Article

Enhancing QoS in Multicore Systems with Heterogeneous Memory Configurations

by Jesung Kim 1, Hoorin Park 2,* and Jeongkyu Hong 3,*
1 School of Computing, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
2 Department of Information Security, Seoul Women’s University, 621 Hwarang-ro, Nowon-gu, Seoul 01797, Republic of Korea
3 School of Electrical and Computer Engineering, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(17), 3492; https://doi.org/10.3390/electronics13173492
Submission received: 16 June 2024 / Revised: 17 July 2024 / Accepted: 26 July 2024 / Published: 3 September 2024
(This article belongs to the Section Computer Science & Engineering)

Abstract

Quality of service (QoS) has evolved to ensure performance across various computing environments, focusing on data bandwidth, response time, throughput, and stability. Traditional QoS schemes primarily target DRAM-based homogeneous memory systems, exposing limitations when applied to diverse memory configurations. Moreover, the emergence of nonvolatile memories (NVMs) has made achieving QoS even more challenging due to their differing characteristics. While QoS schemes have been proposed for DRAM-based memory systems or hybrid memory systems combining DRAM and a single NVM type, there is a lack of research on QoS techniques for memory systems that incorporate multiple types of NVM simultaneously. Ensuring QoS in these heterogeneous memory environments is challenging due to significant differences in memory characteristics. In this paper, we propose a novel technique, dynamic affinity-based resource pairing (DARP), designed to enhance QoS in multicore heterogeneous memory systems. The proposed approach dynamically monitors the memory access patterns of applications and leverages the specific read/write characteristics of NVM devices. The detailed information gathered through monitoring is used to allocate memory data to the most suitable memory devices, ensuring stable memory response times and mitigating bottlenecks. Extensive experiments validate the efficiency and scalability of DARP across various workloads and heterogeneous memory configurations, including memory systems with multiple types of NVM. The results show that our technique significantly outperforms state-of-the-art QoS methods in terms of memory response time consistency and overall QoS in heterogeneous memory environments. On average, DARP achieved a memory response time variability of 74.4% relative to the baseline across six different memory configurations, demonstrating its high scalability and effectiveness in enhancing QoS across various heterogeneous memory systems.

1. Introduction

Quality of service (QoS) plays an important role in networking, cloud computing, telecommunications, real-time systems, multimedia streaming, data centers, and the internet of things (IoT), ensuring reliable performance and meeting specific application requirements. Accordingly, many techniques have been proposed to ensure fairness, response time, throughput, and stability by considering the memory’s architecture and size as well as the characteristics of the applications, under the assumption that memory read and write operations have constant latency and are processed at high speed. However, the advent of GPU- and accelerator-based AI applications, advances in IoT technologies, and the emergence of open hardware such as RISC-V have created demand for memory systems with low power consumption, nonvolatility, and diverse memory operation characteristics [1]. These requirements have highlighted the limitations of conventional DRAM-based memory structures in terms of latency, bandwidth, and energy efficiency [2]. In order to address these limitations, nonvolatile memory (NVM) technologies have been proposed to replace DRAM or to complement it in combined DRAM and NVM memory systems. The emergence of NVMs, however, complicates the QoS guarantees provided by existing techniques, as the latency and bandwidth of read and write operations can vary significantly depending on the specific NVM type used.
While QoS techniques have been proposed considering emerging memory technologies, they are generally designed to accommodate only one or two types of memory. Traditional QoS methods for DRAM-based homogeneous memory systems face limitations such as high power consumption and scalability issues. In order to address these limitations, hybrid memory structures combining DRAM and NVMs, such as phase-change memory (PCM), have been explored. These hybrid systems aim to leverage the fast access times of DRAM and the nonvolatility of NVMs, but they introduce new challenges in managing the different characteristics of each memory type, such as varying latency and endurance. However, new technologies such as Compute Express Link (CXL) and in-memory processing suggest that multiple types of memory devices can be used within a single system [3,4,5], necessitating QoS techniques that can handle such heterogeneous memory systems. Despite this, there has been limited consideration of QoS in systems utilizing multiple NVM types simultaneously. Our analysis indicates that state-of-the-art QoS techniques, which are designed for memory systems using a single type of NVM, fail to operate correctly in environments where a different type of NVM than the one targeted is used or in heterogeneous memory systems where multiple types of NVMs are used together. Given the varying characteristics of different NVMs, the QoS of an application can differ significantly based on which memory device it is allocated to in a heterogeneous memory system.
In this paper, we propose a novel technique called dynamic affinity-based resource pairing (DARP) to enhance the QoS of applications running on multicore heterogeneous memory systems. Unlike hybrid memory systems that simply use two different types of memory, this study targets heterogeneous memory systems that can be composed of multiple types of memory devices. In these heterogeneous memory configurations, memory devices exhibit varying read and write latencies, making memory allocation and management more challenging for maintaining QoS. For instance, in a hybrid memory system with two different memory devices, the memory can be simply categorized as faster or slower. However, in a heterogeneous memory structure with multiple memory types, the memory speed spans several levels. Consequently, the QoS performance of an application can significantly vary depending on which memory device it is paired with. Furthermore, when multiple applications run on multiple memory devices, it is crucial to match applications to the optimal memory device based on their characteristics to ensure QoS.
In order to address these challenges, we introduce the concept of affinity for both programs and memory in terms of the intensity of memory operations. This affinity dynamically tracks the memory access patterns of various programs and the characteristics of NVM devices. The affinity is then used to optimally allocate program pages to the most suitable memory devices. DARP manages the pairing of programs and memory at runtime, preventing increases in memory response times due to concentrated memory requests and ensuring stable memory response times, thereby enhancing the QoS of applications. Additionally, DARP does not enforce the use of specific memory configurations and supports the flexibility to utilize various memory technologies within a single system. By defining the compatibility between programs and memory devices through the concept of affinity, DARP performs QoS-aware memory allocation and management, ensuring optimal performance in heterogeneous memory systems.
We conducted extensive experiments to evaluate the efficiency and scalability of DARP across various workloads and heterogeneous memory system configurations. The proposed scheme was evaluated not only in the most common heterogeneous memory configuration using both DRAM and NVM but also in systems where two different types of NVMs are used simultaneously and in complex heterogeneous memory environments utilizing more than three types of memory. Additionally, we compared our approach to state-of-the-art QoS techniques designed for NVM. While evaluating application QoS in terms of memory response time variability, DARP demonstrated a significantly more uniform memory response time compared to competing techniques, which exhibited large variations in memory response times between different applications. Specifically, across six different heterogeneous memory environments, DARP achieved memory response time variabilities of 90.6, 92.3, 61.3, 62.0, 67.1, and 73.3% compared to the baseline. This indicates that DARP provides substantially less variable memory response times for all running applications in various heterogeneous memory environments, leading to improved QoS.
Our contributions can be summarized as follows:
  • We provide a comprehensive analysis of the effect of memory configuration on QoS in heterogeneous memory systems;
  • We propose a novel QoS-aware memory management scheme that enhances QoS under memory resource contention;
  • We enable system designers to freely employ their preferred memory types while minimizing the impact of memory configuration on QoS.
The remainder of this paper is organized as follows. The background information and related studies are provided in Section 2. DARP is discussed in detail in Section 3, and the experimental results are presented in Section 4. Finally, the paper is concluded in Section 5.

2. Background and Related Work

2.1. Necessity of Emerging Memory Technology

As DRAM, the commonly used main memory, has reached its scaling limit and is difficult to develop further, meeting the demands for high performance and low power consumption has become challenging [2]. It has been reported that memory energy consumption accounts for about 30–40% of the total energy consumption of computing systems [6]. In order to address these limitations, various efforts, such as 3D-stacked memory and processing-in-memory, have been made, but these are all based on DRAM and cannot fully overcome its limitations. Consequently, there has been growing interest in next-generation memory technologies (i.e., NVMs), which offer near-zero standby power consumption and/or high density [7,8,9], including PCM, spin-transfer-torque magneto-resistive RAM (STT-MRAM), resistive RAM (ReRAM), and ferroelectric RAM (FeRAM). Despite the variety of memory types, none of them outperforms DRAM in all aspects, as summarized in Table 1. For example, PCM offers higher density than DRAM and consumes no leakage power, but it suffers from higher read and write latencies and greater dynamic power consumption.
In order to leverage the strengths of both DRAM and NVMs, researchers have proposed heterogeneous memory configurations that integrate both types. These configurations can be broadly classified into two categories: horizontal and hierarchical [10,16]. The hierarchical approach uses DRAM as a cache for NVM or vice versa [17,18], while the horizontal approach manages both DRAM and NVM within a single memory address space [19,20]. In order to support a wide range of memory types, our scheme is based on a horizontal address system in which all installed memories share the same address space regardless of their types.

2.2. QoS Challenges in Heterogeneous Memory Systems

Heterogeneous memory systems address the diverse and complex requirements of modern workloads, which traditional DRAM-based architectures cannot fully satisfy in terms of performance, energy efficiency, and scalability. These systems offer tailored solutions by utilizing various memory technologies. For instance, the higher density and persistence of PCM make it suitable for large-scale data storage, whereas the faster access times of DRAM are advantageous in performance-critical applications. Additionally, the integration of STT-MRAM, which is inherently resistant to radiation, makes it suitable for use in aerospace, automotive, and other harsh environments. In other words, utilizing various memory types offers numerous benefits, and ideally, there should be no constraints on employing multiple memory types within a single system. Industry initiatives such as CXL [21] and in-memory processing suggest that multiple types of memory devices can be utilized within a single system. With CXL, multiple memory types or accelerators can be connected to the processor, and accelerators can have their own attached memory, so it is natural in a CXL-based architecture to connect multiple types of memory in a single-tier or multi-tier hierarchy [22,23]. Efforts to realize such heterogeneous memory configurations are being pursued in various directions [24,25].
However, this diversity introduces several challenges. In a multicore environment where multiple applications compete for memory resources, guaranteeing QoS for applications becomes particularly difficult due to differences in memory characteristics across heterogeneous memory systems. For instance, the pages of a single program could be distributed across various types of memory, and significant differences in memory response times between these types can lead to considerable variability in the application’s overall memory response times. This variability undermines predictable performance, efficient resource utilization, and fairness among applications, thereby degrading the overall QoS of the system. Therefore, we propose a QoS management scheme that considers the heterogeneous configuration of memory systems in multicore environments to provide consistent memory response times for all concurrently running applications.

2.3. QoS Techniques Considering Shared Memory Resources

As more computing units are adopted in a system, contention in the memory subsystem increases, leading to slower and less predictable memory access times, a problem that intensifies as the number of processors increases. This issue is further exacerbated by the fact that many modern applications are memory-intensive and frequently request memory data during execution. In order to guarantee QoS for applications contending for memory resources, various memory scheduling and management schemes have been proposed. Zhou et al. proposed a scheme for PCM-based memory systems that uses request pre-emption and row buffer partitioning to finely tune QoS for high-priority applications [26]. Jeong et al. introduced a method that tracks GPU workload progress to balance the priorities of CPU and GPU memory requests, thereby improving GPU performance and addressing the limitations of existing QoS allocation techniques [27]. Another approach categorizes memory requests, distributes updates across multiple banks, and schedules read/write batches to improve fairness in systems running both persistent and nonpersistent applications [28]. A QoS-aware memory scheduler for heterogeneous systems was proposed [29], which prioritizes hardware accelerators based on deadlines and worst-case access times. Subramanian et al. developed a model to accurately estimate application slowdowns due to memory interference in multicore systems and proposed scheduling schemes to enhance QoS and minimize unfairness [30]. Tam et al. proposed a long short-term memory-based congestion-aware federated learning scheme to enhance QoS in heterogeneous memory systems by proactively detecting congestion rates and prioritizing mission-critical resources for stable model convergence [31]. A dynamic load-balancing approach is introduced for cloud computing [32], utilizing deep learning and reinforcement learning to enhance QoS by efficiently distributing workloads and optimizing task scheduling and resource utilization.
Kommareddy et al. proposed a QoS guarantee method for heterogeneous memory systems in environments where multiple computing nodes intensively use shared memory [33]. Their approach, hierarchical priority (HP), differentiates memory access priorities by using both static priorities for each workload, determined by preprofiled memory request counts, and dynamic priorities that adjust the given priority of a program according to memory requests during execution. Additionally, dedicated memory spaces are allocated for high-priority workloads to prevent their QoS from being degraded by lower-priority workloads. However, all the previously proposed schemes, including HP, fail to function properly in heterogeneous memory environments where various memory configurations can be utilized. For instance, HP operates correctly only with the specific memory type (i.e., PCM) it was designed for. Considering future systems, there is a need for a memory management scheme that is not restricted by the types of memory used and can efficiently enhance QoS. DARP proposes a QoS enhancement technique that can support such memory systems.

3. Dynamic Affinity-Based Resource Pairing

This section discusses the proposed DARP process, a novel memory management scheme designed to enhance QoS in heterogeneous memory systems, in detail. DARP dynamically pairs program pages with the most suitable memory types based on program and memory affinities, effectively managing memory response times and preventing resource contention in multicore environments.

3.1. Key Design Considerations of DARP

Before delving into the details of DARP, we discuss three key design considerations of our approach. These observations illustrate how QoS can degrade when conventional memory management techniques are applied to heterogeneous memory systems. For clarity, we consider a system with four cores, where each core can run independent applications, and four memory channels, each potentially equipped with different types of memory.

3.1.1. Variability of Program Behavior

In heterogeneous memory systems, placing program page data into the appropriate memory types is crucial for ensuring consistent memory response times and achieving QoS due to the fundamental differences in read and write operation latencies across memory types. Moreover, the variability in program behavior regarding memory access patterns can further exacerbate these differences. Figure 1 illustrates the changes in program behavior over time for cam4 and wrf, showing the number of memory read and write operations at 100 K cycle intervals. As shown in the figure, the number and ratio of read and write requests can vary significantly with changes in program phases during execution. Therefore, even memory types with small inherent differences in read and write operation latencies can experience significant cumulative response time differences during phases of concentrated memory requests, leading to increased response time variability compared to less intense phases. In a multicore environment, program variability can be exacerbated by mutual interference among multiple programs, further amplifying memory response time variability. Conventional fixed memory mapping approaches that cannot adapt flexibly to changes in memory request intensity struggle to maintain QoS as program phases change.
Consequently, effectively achieving QoS requires not only managing the latency differences between memory types but also considering the diverse memory access patterns among different programs and the inconsistent access patterns within a single program. A proper management system needs to continuously monitor application behaviors at the granularity of program phases or finer and incorporate this information into the memory mapping process.
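As a concrete illustration of such phase-granular monitoring, the following minimal Python sketch tallies per-program read and write requests in fixed cycle windows and flags a phase change when the counts shift sharply between consecutive windows. The `observe` hook, the window length, and the change threshold are illustrative assumptions, not the exact mechanism used by DARP.

```python
# Illustrative per-window read/write monitoring (not the paper's exact mechanism).

class WindowMonitor:
    def __init__(self, window_cycles=64 * 1024, change_threshold=0.5):
        self.window_cycles = window_cycles      # monitoring granularity (illustrative choice)
        self.change_threshold = change_threshold
        self.window_start = 0
        self.reads = 0
        self.writes = 0
        self.prev = (0, 0)                      # counts of the previous window

    def observe(self, cycle, is_write):
        """Called once per memory request issued by the monitored program."""
        if cycle - self.window_start >= self.window_cycles:
            self._close_window()
            self.window_start = cycle
        if is_write:
            self.writes += 1
        else:
            self.reads += 1

    def _close_window(self):
        prev_r, prev_w = self.prev
        if prev_r + prev_w > 0:
            # Flag a phase change when read/write intensity shifts sharply between windows.
            delta = abs(self.reads - prev_r) + abs(self.writes - prev_w)
            if delta / (prev_r + prev_w) > self.change_threshold:
                print(f"phase change: reads {prev_r}->{self.reads}, writes {prev_w}->{self.writes}")
        self.prev = (self.reads, self.writes)
        self.reads, self.writes = 0, 0
```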

3.1.2. Variability of Channel Access

In multicore systems, the last-level cache (LLC) is shared among programs and demonstrates relatively lower spatial locality compared to higher-level caches. This leads to irregular LLC access patterns, causing main memory access to be unevenly distributed across channels, resulting in large variations in the number of memory accesses per channel. Figure 2 compares the channel-wise access ratios for memory read and write operations across nine workload sets during the entire program execution. Each workload set is composed of four programs (see Table 2 for detailed experimental environments), and the ratio is normalized based on the access count of Channel 0. As shown in the figure, the memory access counts per channel are non-uniform, exhibiting fluctuations ranging from an average of 6% to a maximum of 22%. Note that these fluctuations become more pronounced when monitoring the program at finer intervals. This variability indicates that heterogeneous memory systems can exhibit significant performance differences depending on their configurations, even when the same memory type is used. Connecting relatively slow memory to heavily accessed channels results in significantly longer average memory access times compared to when fast memory is connected to those channels, posing challenges in ensuring QoS. Therefore, an appropriate data management scheme must consider the system hardware configuration.

3.1.3. Variability of Memory Response Time

Read and write latencies differ by memory type, and each type generally maintains consistent latency characteristics. However, from the perspective of memory response times, even the same type of memory can exhibit significant variation depending on the situation. Because all CPUs share memory and peripherals, such as buses and request queues, concentrating requests on a specific memory type can significantly increase congestion (wait) cycles, leading to longer overall response times. Figure 3 shows the average congestion cycles per channel for memory read accesses at 1 M cycle intervals in a homogeneous memory environment with DRAM installed across four channels, where four programs (nab, pop2, imagick, and exchange2) are executed simultaneously. As shown in the graph, the average congestion cycles per channel vary significantly, and memory accessed intensively by certain programs has much higher congestion cycles. For instance, in the 50th window, the average congestion cycle for Channel 0 is 16 cycles, whereas for Channel 3, it is nearly 0 cycles. This difference in response times can become even more pronounced in heterogeneous memory configurations with varying access latencies across memories. In other words, even memory types with inherently fast read and write speeds can have longer response times than those with slower speeds, and thus, simply storing data in faster memory does not guarantee quicker response times. According to our observations, these variabilities in per-channel congestion cycles are consistently found across various workload sets, indicating that a QoS-aware management system must consider not only the unique read and write latencies of each memory type but also the dynamic states of both the programs and the memories.

3.2. Affinity-Based Program and Memory Management

Based on three key design considerations, we propose a memory management technique that reflects the unique characteristics of heterogeneous memory systems, as well as the dynamic states of programs and memory, and the memory system configuration. In order to accurately reflect the dynamic states of programs and memory to the operations of DARP, we introduce the concepts of program affinity and memory affinity in terms of memory read and write operations. Program affinity is defined separately for read and write operations, and similarly, memory affinity is defined individually for read and write operations. Our scheme utilizes these affinity values to pair program pages with the most suitable memory types to ensure consistent memory response times regardless of the number of memory requests of the programs, thereby effectively enhancing their QoS.

3.2.1. Program Affinity and Memory Affinity

The program affinity quantifies the memory access intensity of a program, represented as the number of memory requests per program within a given cycle period. Memory affinity measures the relative response time of the memories used in the system, accounting for variations in response times based on the current memory state. In other words, program affinity indicates the extent to which the QoS of a program can be affected by memory read or write operations, while memory affinity indicates the suitability of a type of memory for processing read or write requests.
The program affinity values for read and write requests are computed by tallying the cumulative memory reads and writes during a 64 K cycle window. In order to normalize these cumulative values between 0 and 1, they are divided by 2048 and 512, respectively. Values exceeding 1 are set to 1. The values of 2048 for reads and 512 for writes, which represent the maximum accumulated reads and writes during the given window cycles, were chosen because they provided the best balance and accuracy in reflecting program behavior during evaluation. These values are crucial for accurately capturing program affinity and ensuring the effectiveness of the proposed scheme.
For example, if the cumulative number of reads for a program during the current window is 1536, the read affinity value is 0.75 (1536/2048); if more than 2048 reads occur during the window period, the read affinity saturates at the maximum value of 1. These cumulative counts start from half the final value of the previous window to minimize fluctuations caused by transient program behaviors. Meanwhile, the memory affinity value for each memory type is calculated by dividing the shortest latency among the installed memories by the respective memory latency, resulting in values ranging from 0 to 1. For instance, the write affinity values for write latencies of 15, 20, 60, and 150 ns are 1.0 (15 ns/15 ns), 0.75 (15 ns/20 ns), 0.25 (15 ns/60 ns), and 0.1 (15 ns/150 ns), respectively. During execution, the initial memory affinity values are adjusted to reflect the runtime memory state: the adjusted affinity is the initial value multiplied by the unused fraction of the request queue. For example, a memory with an initial read affinity of 0.75 and 16 read requests queued out of 64 entries has a memory affinity of 0.5625 (0.75 × (1 − 16/64)).
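The affinity definitions above can be reproduced in a few lines. The Python sketch below follows the normalization described in the text (reads capped at 2048 and writes at 512 per 64 K cycle window, memory affinity as the ratio of the fastest latency to each memory's latency, scaled by the unused fraction of the request queue); the function names and memory labels are ours.

```python
READ_CAP, WRITE_CAP = 2048, 512   # maximum counted accesses per 64 K cycle window

def program_affinity(reads_in_window, writes_in_window):
    """Normalize per-window read/write counts into [0, 1]."""
    read_aff = min(reads_in_window / READ_CAP, 1.0)
    write_aff = min(writes_in_window / WRITE_CAP, 1.0)
    return read_aff, write_aff

def initial_memory_affinity(latencies_ns):
    """Affinity of each memory type = shortest installed latency / its own latency."""
    fastest = min(latencies_ns.values())
    return {mem: fastest / lat for mem, lat in latencies_ns.items()}

def adjusted_memory_affinity(initial_aff, queued_requests, queue_entries=64):
    """Scale the initial affinity by the unused fraction of the request queue."""
    return initial_aff * (1.0 - queued_requests / queue_entries)

# Worked examples from the text:
print(program_affinity(1536, 0))                                      # (0.75, 0.0)
print(initial_memory_affinity({"A": 15, "B": 20, "C": 60, "D": 150})) # 1.0, 0.75, 0.25, 0.1
print(adjusted_memory_affinity(0.75, 16))                             # 0.5625
```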
In summary, program affinity denotes how strongly a program depends on memory operations, indicating how much the program is affected by memory response time. Memory affinity, on the other hand, signifies the efficiency of memory operations, reflecting how effectively a memory can handle requests. In order to ensure QoS for programs, we manage heterogeneous memory systems based on affinity values: programs with high affinity are paired with memories with high affinity, and vice versa. This approach is reasonable because it provides fast memory response times for programs with high program affinity while steering programs with low program affinity away from fast memory, thereby preventing increased traffic to the fast memory. In addition, because programs with few memory requests are less affected by changes in memory response time, slower memory is acceptable for these programs.

3.2.2. Conceptual Behaviors of DARP

Figure 4 shows the operation of DARP, which pairs a program and a memory by considering changes in both program and memory affinities. In this example, we assume that the program state is initially midway along the affinity dimensions (circled P) (Figure 4a). Memory affinity is initially arranged based on the relative read and write latencies of the memories installed in the system, as discussed previously. Hence, a memory with faster read and write speeds is assigned higher affinity values for both read and write (e.g., type y). Once the read and write affinities of programs and memories are determined, the memory type with the shortest distance is selected (indicated by red arrows). In this example, a memory of type w will be paired with the program.
However, the initially paired memory type may not remain the most suitable choice over time. As its paired memory handles the memory requests of the program, the increase in memory access traffic may potentially increase the waiting time in the request queue, leading to slower response times. Figure 4b illustrates how the previously paired memory, type w, exhibits a decrease in both memory read and write affinities due to the increase in memory traffic. Consequently, DARP replaces the paired memory with memory type y, which exhibits the shortest affinity distance with the given program.
As memory affinities dynamically fluctuate, program behaviors also exhibit variability. When the phase of a program changes, leading to changes in the number of memory reads and writes, its affinity values are also adjusted accordingly, as depicted in Figure 4c. Increased memory read requests result in a rise in the read affinity of a program, which necessitates the selection of memory type x with the lowest read latency (i.e., highest memory read affinity) to handle the increased demand effectively.
In summary, DARP dynamically tracks the states of all running programs and installed memory types to pair each program with its optimal memory. The proposed scheme allows multiple programs to pair with a single memory type simultaneously, thereby simplifying memory management complexity (Figure 4d). Note, however, that this approach rarely results in QoS degradation. When memory is paired with multiple programs, the concentrated traffic naturally lowers its affinity value, prompting programs to pair with other memory types. This mechanism prevents traffic congestion on specific memory modules and provides consistent memory response times for programs.

3.3. Operation Details of DARP

Figure 5 illustrates the read and write operations of DARP. Read accesses follow the existing memory behavior and do not perform pairing-related operations, thus avoiding excessive changes in memory pairing that could lead to significant page migration overhead. The pairing process is activated only when a dirty LLC block is evicted (i.e., during a memory write). It searches for the most suitable memory type for the page associated with the corresponding block. If the current write access hits the memory row buffer or if the new pairing is not predicted to provide sufficient benefit, the pairing process is not performed, minimizing the overhead associated with pairing changes. This policy of performing pairing at the page level and only under specific conditions enhances the convenience and efficiency of data management.
The pairing process aims to identify the shortest distance between a program and memory along the affinity dimension. However, relying solely on the shortest vector distance can often lead to incorrect pairing results that fail to guarantee QoS. This is primarily because maintaining QoS is closely tied to stable memory response times, which are more influenced by memory read latency than by write latency. Memory read requests are crucial for providing the data needed to process the currently executing program instructions, making them directly related to ensuring QoS. In contrast, memory write operations involve data that are not immediately requested by the processor, and thus, delays in write operations do not significantly impact QoS.
Given this, when a program has the same distance from a memory with high read affinity and low write affinity as from another memory with low read affinity and high write affinity, we pair the program with the former to ensure better QoS. In order to implement this, we assign distinct weights to the read distance ($\text{Distance}_{\text{read}}$) and write distance ($\text{Distance}_{\text{write}}$) and compute the final affinity distance ($\text{Distance}_{\text{aff}}$) using the following equation:

$$\text{Distance}_{\text{aff}} = \text{Distance}_{\text{read}} \times \text{Weight}_{\text{read}} + \text{Distance}_{\text{write}} \times \text{Weight}_{\text{write}}$$

where $\text{Weight}_{\text{read}}$ and $\text{Weight}_{\text{write}}$ are the weight values for the read and write distances, assigned as 0.80 and 0.20, respectively.
Furthermore, even if a new memory with the shortest affinity distance is identified, it may not be suitable from a QoS perspective. Changing pairings requires a memory data migration process to ensure data integrity, which can result in significant overhead. Therefore, pairing changes are evaluated based on whether the expected benefits outweigh the associated costs. Specifically, only memories with affinity distances of 0.25 or less are considered candidates for pairing changes. Moreover, if access to the existing paired memory is a row-buffer hit, the pairing is maintained regardless of the affinity distance.
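A minimal Python sketch of this pairing decision is shown below. The data structures, helper names, and example affinity values are illustrative assumptions; the read/write weights and the 0.25 candidacy threshold are the values stated in the text, and the row-buffer-hit shortcut keeps the existing pairing as described.

```python
W_READ, W_WRITE = 0.80, 0.20     # weights from the affinity distance equation
PAIRING_THRESHOLD = 0.25         # only memories this close are candidates for re-pairing

def affinity_distance(prog_aff, mem_aff):
    """Weighted distance between (read, write) affinity pairs."""
    d_read = abs(prog_aff[0] - mem_aff[0])
    d_write = abs(prog_aff[1] - mem_aff[1])
    return d_read * W_READ + d_write * W_WRITE

def select_pairing(prog_aff, memories, current, row_buffer_hit):
    """Return the memory type the page should be paired with.

    memories: dict of memory name -> (read_aff, write_aff)
    current:  name of the currently paired memory
    """
    if row_buffer_hit:
        return current                       # keep the pairing on a row-buffer hit
    best = min(memories, key=lambda m: affinity_distance(prog_aff, memories[m]))
    if best == current:
        return current
    if affinity_distance(prog_aff, memories[best]) > PAIRING_THRESHOLD:
        return current                       # expected benefit too small to justify migration
    return best

# Illustrative use: a read-heavy program moves off its current memory to the type
# with the shortest weighted affinity distance (STT-MRAM here, distance 0.13).
mems = {"DRAM": (1.0, 1.0), "STT-MRAM": (0.75, 0.25), "PCM": (0.3, 0.1)}
print(select_pairing((0.9, 0.2), mems, current="PCM", row_buffer_hit=False))  # STT-MRAM
```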
Once a memory page is paired with a different memory type, it must be migrated to a frame of a new memory type. In order to maintain data integrity, tracking the change in physical frame location is essential, which can be easily handled by updating the physical address of the corresponding page in the translation lookaside buffer. This page-level migration approach simplifies page management and enables the individual pairing of each program page with its optimal memory type. Therefore, not all pages belonging to a program are stored in the same memory type.
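At this level, migration mainly amounts to bookkeeping on the virtual-to-physical mapping. The sketch below models that bookkeeping with a simple page-table dictionary; in hardware, the corresponding TLB entry would be updated instead, and the frame allocator shown here is purely illustrative.

```python
def migrate_page(page_table, vpage, new_channel, alloc_frame):
    """Re-map a virtual page to a frame on the newly paired memory channel.

    page_table:  dict vpage -> (channel, frame)
    alloc_frame: callable(channel) -> free frame number (illustrative allocator)
    """
    old_channel, old_frame = page_table[vpage]
    if old_channel == new_channel:
        return old_frame                     # nothing to migrate
    new_frame = alloc_frame(new_channel)
    # The data copy between source and destination memories happens here;
    # the updated mapping is what keeps the page addressable afterwards.
    page_table[vpage] = (new_channel, new_frame)
    return new_frame
```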
The reference values, including the size of a window cycle, the maximum number of reads and writes within a window cycle, the weights for calculating read and write distances, and the thresholds for the validity of affinity distances, were determined through extensive and repetitive evaluations to ensure optimal results.

3.4. DARP Architecture and Overhead Analysis

3.4.1. Additional Hardware Components

In order to support DARP, the required additional hardware components are implemented in the memory controller, as shown in Figure 6. These include two table-like structures: the Program Status Table (PST) and the Memory Status Table (MST). The PST stores the cumulative numbers of reads and writes for each program. This information is stored using 11-bit and 9-bit counters, respectively, so 20 bits are required for each core. The MST, on the other hand, stores memory-related information for each memory channel. This includes the initial reference affinity value for each memory type and the most recently used (MRU) address of each bank. The initial reference value is represented using 7 bits, allowing it to hold a value between 0 and 1 with a precision of 1/100. The MRU address is used to determine whether a row buffer hit occurs, and each address requires 16 bits in our baseline system configuration. The total number of bits required to implement the DARP architecture is 648 (80 bits for the PST and 568 bits (4 channels × 142 bits) for the MST), which constitutes a negligible area overhead and does not require modifications to the conventional memory system. In addition, a clock counter that records 64 K cycles (one window) and an ALU that performs vector operations to calculate the affinity distance are required. These components can be implemented using simple hardware.
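The quoted storage budget can be reproduced with simple arithmetic. The breakdown of the 142 bits per MST channel entry below (read and write reference affinities plus one MRU address per bank, assuming 8 banks per channel) is our reconstruction and an assumption; the per-core PST cost and the totals match the figures in the text.

```python
# Reconstruction of the DARP storage budget (8 banks per channel is an assumed breakdown).
CORES, CHANNELS, BANKS_PER_CHANNEL = 4, 4, 8

pst_bits_per_core = 11 + 9                               # cumulative read + write counters
pst_bits = CORES * pst_bits_per_core                     # 4 x 20 = 80 bits

mst_bits_per_channel = 2 * 7 + BANKS_PER_CHANNEL * 16    # reference affinities + MRU addresses = 142
mst_bits = CHANNELS * mst_bits_per_channel               # 4 x 142 = 568 bits

print(pst_bits, mst_bits, pst_bits + mst_bits)           # 80 568 648
```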

3.4.2. Page Migration Overhead

Page migration may be required when DARP determines the most appropriate memory type for a certain page. When a new pairing is determined, the program pages originally residing in a memory frame must be migrated to a new memory frame. This migration involves the utilization of both the source and destination memory banks, as well as the memory bus. Consequently, memory operations intending to use these banks and buses must stall, impacting memory response times for other programs (i.e., increasing response time variability), thereby reducing QoS. However, due to multiple policies aimed at minimizing unnecessary page migrations, the proportion of memory congestion cycles attributable to migrations is relatively low.
In addition, we assume that our scheme is based on the CXL 3.0 protocol to address the migration overheads. This enables memory data bypass, also referred to as peer-to-peer communication, which facilitates direct page migration between memories. Even without CXL 3.0, the migration overhead in our scheme can be reduced by adopting migration optimization techniques [34,35]. These techniques manipulate the timings of migration and/or merge multiple migration requests. While these techniques are orthogonal to our approach and can be applied alongside the proposed system, adopting migration optimization techniques is beyond the scope of this study and was, therefore, not considered.

4. Experimental Results

In order to quantitatively evaluate DARP, we compared the proposed method with both a baseline system and HP, as HP shares several common features with DARP. Both methods dynamically track program behavior and incorporate this information into memory management policies to improve QoS in heterogeneous memory systems. However, unlike HP, DARP also considers the dynamic state of the memory, allowing for more effective memory data management. Additionally, DARP does not restrict the use of specific memory types, making it more flexible and scalable.

4.1. Experimental Environment

In order to implement DARP and HP, we modified Champsim [36], which is a trace-based simulator designed for out-of-order core implementations. Table 2 describes the detailed baseline system configuration, workload sets, and heterogeneous memory configurations. We used FR-FCFS [37] scheduling for both the read and write queues and RoRaBaChCo address mapping for all memory types. Because of their various advantages, row buffers are utilized in our baseline memory system regardless of the memory type [38]. Upon a row buffer hit, they all exhibit the same hit latency. Compared to the baseline system, DARP and HP differ only in their memory management schemes, while the fundamental system configuration remains identical.
We utilized nine workload sets, each composed of four randomly selected benchmarks from SPEC CPU2017 [39]. Each benchmark was fast-forwarded by one billion instructions, followed by a simulation of 200 million instructions. Since memories can have different properties even when built from the same physical elements [40,41], we used the properties listed in Table 1. We configured six different heterogeneous memory systems, which can be categorized based on whether they combine DRAM and NVM, different types of NVMs, or more than three different memory types. Each configuration is designated using abbreviations that correspond to the types of memory used and the order in which they are installed on the channels. For instance, if STT-MRAM is used on Channels 0 and 1, and DRAM is used on Channels 2 and 3, this memory configuration is referred to as M/M/D/D.

4.2. Evaluation of Program QoS

In order to evaluate application QoS in multicore heterogeneous memory system environments, we assess the variability of memory response times. Memory response time variability is defined as the extent to which memory response times are affected when applications are executed simultaneously compared to when they are executed individually. However, evaluating the QoS of applications running together by examining them individually does not accurately reflect the overall QoS. This is because applications running concurrently compete for shared memory resources, which can stabilize the memory response time of one application at the expense of others. Therefore, to evaluate the overall QoS of concurrently running applications, we measure the standard deviation of the memory response time variabilities for each application. This standard deviation serves as an indicator of how consistently all applications can achieve consistent memory response times. A lower standard deviation indicates less disparity in memory response times among applications, suggesting improved QoS through more consistent and predictable behaviors of applications. Note that the extensive execution of a large number of instructions and the use of diverse workload sets ensure that potential outliers have minimal impact on the overall results, reducing any bias.
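To make the metric concrete, the sketch below computes it from per-application response times under one natural reading of the definition: each application's variability is its average memory response time when co-running divided by its solo-run response time, and the reported score is the standard deviation of those variabilities across the co-running applications. The numbers are made up for illustration only.

```python
from statistics import pstdev

def response_time_variability(corun_ns, solo_ns):
    """Change in an application's average memory response time when co-running vs. running alone."""
    return corun_ns / solo_ns

def qos_score(corun, solo):
    """Standard deviation of variabilities across co-running applications (lower is better)."""
    variabilities = [response_time_variability(c, s) for c, s in zip(corun, solo)]
    return pstdev(variabilities)

# Illustrative numbers only: four applications' average memory response times (ns).
solo = [60.0, 75.0, 50.0, 90.0]
corun = [78.0, 120.0, 55.0, 110.0]
print(qos_score(corun, solo))   # smaller value -> more uniform impact across applications -> better QoS
```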
Figure 7 compares the application QoS of HP and DARP across six different heterogeneous memory environments in terms of the standard deviation of memory response time variations. These results are normalized against the standard deviation of the memory response time variability of a baseline system without any specific memory management scheme in a quad-core environment. HP suffered from significant fluctuations in memory response times due to its fixed memory mapping based on program priority. Its preferential policies for high-priority programs lead to a decrease in the QoS of lower-priority programs, thereby aggravating the variability of response times for all applications running concurrently. In addition, HP does not consider the memory system configuration, leading to differences in QoS, even when the same memory types but different channel configurations were used (e.g., M/M/D/D vs. D/D/M/M and P/P/M/M vs. M/P/M/P).
In contrast, through affinity-based pairing, DARP minimizes response time variations regardless of memory configuration. On average, compared to the baseline, DARP achieved memory response time variabilities of 90.6, 92.3, 61.3, 62.0, 67.1, and 73.3% across six different memory configurations. In different configurations using the same memory types (comparison between left and right graphs), the proposed scheme provides nearly identical memory response times, indicating that it is not significantly affected by the physical memory layout. The reduction in variability is more pronounced when there are significant differences in the read and write latencies between the memory types used, such as in PCM and STT-MRAM (NVM + NVM configurations) or manifold configurations with various memory types, compared to memory configurations consisting of types with similar read and write latencies, such as DRAM and STT-MRAM. Note that DARP provides significantly lower memory response time variability, even in environments that use multiple types of memory (i.e., manifold configurations). Consequently, the proposed scheme provides stable response times for all concurrently running applications, regardless of the memory types used or the heterogeneous memory configuration.

4.3. Impact on Memory Congestion Cycles

In this subsection, we analyze the impact of the proposed technique on memory traffic and congestion cycles. Since the various memory configurations show similar results with no significant differences, we focus on the representative configurations M/M/D/D and D/D/M/M, which use the same types of memory.
Figure 8 illustrates the distribution of the total accumulated memory congestion cycles per channel during the entire execution of each workload set, normalized to the baseline system results. Memory congestion cycles refer to periods when memory requests are delayed due to bus utilization or bank conflicts. The address mapping scheme adopted in the baseline system determines the channel number based on the requested memory address without considering the type or state of the memory. In contrast, DARP selects the appropriate type of memory based on program and memory affinity values and then intentionally accesses the channel where that memory is installed. Consequently, memory accesses are not evenly distributed across all channels, potentially leading to concentrated access to specific memory types. This imbalance is particularly evident in workload sets such as Set 1, Set 2, Set 6, Set 8, and Set 9. For instance, Channel 0 in the M/M/D/D configuration and Channel 2 in the D/D/M/M configuration are specific memory channels that experience higher congestion cycles.
However, this phenomenon does not adversely affect the provision of consistent response times to applications. Most memory congestion cycles are caused by write accesses, particularly in channels equipped with relatively slower memory, such as STT-MRAM. Since memory write operations do not significantly impact the overall execution time of a program (i.e., critical path), the QoS degradation due to these delays is minimal. Additionally, as previously discussed, slower memory stores fewer memory-intensive pages, leading to minor QoS changes due to memory response time increases from congestion cycles. Therefore, program pages that do not perform memory-intensive operations are allocated to slower memory, such as STT-MRAM, thereby reducing memory traffic to channels with faster memory, such as DRAM, and preventing delays caused by congestion cycles in the faster memory. Consequently, channels with faster memory in the M/M/D/D configuration (Channels 2 and 3) and in the D/D/M/M configuration (Channels 0 and 1) experience fewer congestion cycles compared to the baseline, providing faster and less variable response times for memory-intensive applications and improving QoS. Moreover, the additional memory congestion cycles due to migration are relatively small compared to those caused by primary memory operations (less than 5%). This is because DARP initiates migration only when there are significant changes in program behavior and/or memory state, minimizing migration frequency.
Figure 9 shows the average congestion cycles per memory access for four channels, normalized to the baseline results. The average congestion cycle is calculated by dividing the total accumulated congestion cycles by the total number of memory accesses during the program execution. This provides a different perspective from the previously discussed total accumulated congestion cycles.
Even if the total number of congestion cycles accumulated on a specific channel increases significantly, the average congestion cycles per memory access might not increase proportionally. This indicates that the memory channel is effectively handling a high volume of requests, preventing excessive delays. In such cases, other channels handle relatively fewer memory requests, providing shorter and more stable waiting times for critical, high-priority memory accesses. For example, in Set 2 with the M/M/D/D configuration, Channel 0 shows significantly higher total congestion cycles (see Figure 8), yet its average congestion cycles per access remain nearly identical to the baseline. This is because the STT-MRAM installed in Channel 0 handles many accesses, primarily processing the less memory-intensive pages of the applications. Consequently, the DRAMs in Channels 2 and 3, which carry lower traffic, exhibit a decrease in the average congestion cycles per access compared to the baseline.
Conversely, as shown in Set 9 of the D/D/M/M configuration, the DRAMs on Channels 0 and 1 have fewer total congestion cycles than the baseline, but the average congestion cycles per access have increased. This is due to prioritizing fast memory to handle the memory-intensive phases of applications where many memory accesses are concentrated over a short period. In other words, despite the increase in congestion cycles, if it is determined that accessing the inherently faster DRAM provides a faster and more consistent memory response time compared to the slower STT-MRAM, DARP prioritizes handling memory-intensive requests through DRAM to enhance application QoS.

5. Conclusions

In order to satisfy the QoS demands of heterogeneous memory systems, we propose DARP, a memory resource management approach that pairs program pages with the most suitable memory based on their static and dynamic characteristics. DARP uses affinity to dynamically capture the memory access patterns of programs and the states of the memories, ensuring consistent memory response times and preventing QoS degradation due to resource contention. Evaluations comparing DARP with competing techniques across various memory configurations demonstrated that DARP significantly reduces memory response time variability and improves overall program QoS. The flexibility of DARP supports a wide range of memory technologies within a single system, making it robust for future computing environments. In conclusion, DARP dynamically adapts to diverse memory characteristics and application demands, providing consistent and predictable memory response times and improved QoS for all running programs.

Author Contributions

Conceptualization, J.H. and J.K.; methodology, J.H.; software, J.H.; validation, J.K. and H.P.; formal analysis, J.K.; investigation, J.H.; resources, J.K.; data curation, J.H.; writing—original draft preparation, J.K. and J.H.; writing—review and editing, H.P.; visualization, J.K.; supervision, H.P.; project administration, H.P.; funding acquisition, J.H. and H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1F1A1074641), and a Research Grant of Seoul Women’s University (2024-0012).

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
QoS: Quality of service
IoT: Internet of things
NVM: Nonvolatile memory
DARP: Dynamic affinity-based resource pairing
HP: Hierarchical priority
PCM: Phase-change memory
STT-MRAM: Spin-transfer-torque magneto-resistive RAM
ReRAM: Resistive RAM
FeRAM: Ferroelectric RAM
CXL: Compute Express Link
LLC: Last-level cache
PST: Program status table
MST: Memory status table
MRU: Most recently used

References

  1. Cai, Y.; Chen, X.; Tian, L.; Wang, Y.; Yang, H. Enabling secure nvm-based in-memory neural network computing by sparse fast gradient encryption. IEEE Trans. Comput. 2020, 69, 1596–1610. [Google Scholar] [CrossRef]
  2. Park, S. Technology scaling challenge and future prospects of DRAM and NAND flash memory. In Proceedings of the International Memory Workshop, Monterey, CA, USA, 17–20 May 2015. [Google Scholar]
  3. Li, H.; Berger, D.S.; Hsu, L.; Ernst, D.; Zardoshti, P.; Novakovic, S.; Shah, M.; Rajadnya, S.; Lee, S.; Agarwal, I.; et al. Pond: Cxl-based memory pooling systems for cloud platforms. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023. [Google Scholar]
  4. Lee, K.; Kim, S.; Lee, J.; Moon, D.; Kim, R.; Kim, H.; Ji, H.; Mun, Y.; Joo, Y. Improving key-value cache performance with heterogeneous memory tiering: A case study of CXL-based memory expansion. IEEE Micro 2024, 1–11. [Google Scholar] [CrossRef]
  5. Ahn, M.; Chang, A.; Lee, D.; Gim, J.; Kim, J.; Jung, J.; Redholz, O.; Pham, V.; Malladi, K.; Ki, Y.S. Enabling CXL memory expansion for in-memory database management systems. In Proceedings of the International Workshop on Data Management on New Hardware, Philadelphia, PA, USA, 13 June 2022. [Google Scholar]
  6. Barroso, L.A.; Hölzle, U. The case for energy-proportional computing. Computer 2007, 40, 33–37. [Google Scholar] [CrossRef]
  7. Chen, A. A review of emerging non-volatile memory (NVM) technologies and applications. Solid-State Electron. 2016, 125, 25–38. [Google Scholar] [CrossRef]
  8. Boukhobza, J.; Rubini, S.; Chen, R.; Shao, Z. Emerging NVM: A survey on architectural integration and research challenges. ACM Trans. Des. Autom. Electron. Syst. 2017, 23, 1–32. [Google Scholar] [CrossRef]
  9. Hameed, F.; Menard, C.; Castrillon, J. Efficient STT-RAM last-level-cache architecture to replace DRAM cache. In Proceedings of the International Symposium on Memory Systems, Alexandria, VA, USA, 2–5 October 2017. [Google Scholar]
  10. Liu, H.; Chen, D.; Jin, H.; Liao, X.; He, B.; Hu, K.; Zhang, Y. A survey of non-volatile main memory technologies: State-of-the-arts, practices, and future directions. J. Comput. Sci. Technol. 2021, 36, 4–32. [Google Scholar] [CrossRef]
  11. Grossi, A.; Vianello, E.; Sabry, M.; Barlas, M.; Grenouillet, L.; Coignus, J.; Beigne, E.; Wu, T.; Le, B.Q.; Wootters, M.K.; et al. Resistive RAM endurance: Array-level characterization and correction techniques targeting deep learning applications. IEEE Trans. Electron Devices 2019, 66, 1281–1288. [Google Scholar] [CrossRef]
  12. Kumar, D.; Aluguri, R.; Chand, U.; Tseng, T. Metal Oxide Resistive Switching Memory: Materials, Properties, and Switching Mechanisms. Ceram. Int. 2017, 43, 547–556. [Google Scholar] [CrossRef]
  13. Seltzer, M.; Marathe, V.; Byan, S. An NVM Carol: Visions of NVM past, present, and future. In Proceedings of the International Conference on Data Engineering, Paris, France, 16–19 April 2018. [Google Scholar]
  14. Arulraj, J.; Pavlo, A. How to build a non-volatile memory database management system. In Proceedings of the International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017. [Google Scholar]
  15. Arulraj, J.; Pavlo, A.; Dulloor, S.R. Let’s talk about storage & recovery methods for non-volatile memory database systems. In Proceedings of the International Conference on Management of Data, Melbourne, VIC, Australia, 31 May–4 June 2015. [Google Scholar]
  16. Suresh, A.; Cicotti, P.; Carrington, L. Evaluation of emerging memory technologies for HPC, data intensive applications. In Proceedings of the IEEE International Conference on Cluster Computing, Madrid, Spain, 22–26 September 2014. [Google Scholar]
  17. Wu, P.; Li, D.; Chen, Z.; Vetter, J.S.; Mittal, S. Algorithm-directed data placement in explicitly managed non-volatile memory. In Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing, Kyoto, Japan, 31 May–4 June 2016. [Google Scholar]
  18. Awad, A.; Blagodurov, S.; Solihin, Y. Write-aware management of nvm-based memory extensions. In Proceedings of the International Conference on Supercomputing, Istanbul, Turkey, 1–3 June 2016. [Google Scholar]
  19. Wei, Q.; Chen, J.; Chen, C. Accelerating file system metadata access with byte-addressable nonvolatile memory. ACM Trans. Storage 2015, 11, 1–28. [Google Scholar] [CrossRef]
  20. Hassan, A.; Vandierendonck, H.; Nikolopoulos, D.S. Energy-efficient in-memory data stores on hybrid memory hierarchies. In Proceedings of the International Workshop on Data Management on New Hardware, Melbourne, VIC, Australia, 31 May–4 June 2015. [Google Scholar]
  21. Compute Express Link 3.0. Available online: https://www.computeexpresslink.org/resource-library (accessed on 1 August 2022).
  22. Fridman, Y.; Mutalik Desai, S.; Singh, N.; Willhalm, T.; Oren, G. Cxl memory as persistent memory for disaggregated hpc: A practical approach. In Proceedings of the International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, USA, 12–17 November 2023. [Google Scholar]
  23. Fakhry, D.; Abdelsalam, M.; El-Kharashi, M.W.; Safar, M. A review on computational storage devices and near memory computing for high performance applications. Mem. Mater. Devices Circuits Syst. 2023, 4, 100051. [Google Scholar] [CrossRef]
  24. Jung, M. Hello bytes, bye blocks: Pcie storage meets compute express link for memory expansion (cxl-ssd). In Proceedings of the ACM Workshop on Hot Topics in Storage and File Systems, Virtual, 27–28 June 2022. [Google Scholar]
  25. Sharma, D.D. Compute express link (cxl): Enabling heterogeneous data-centric computing with heterogeneous memory hierarchy. IEEE Micro 2022, 43, 99–109. [Google Scholar] [CrossRef]
  26. Zhou, P.; Du, Y.; Zhang, Y.; Yang, J. Fine-grained QoS scheduling for PCM-based main memory systems. In Proceedings of the International Symposium on Parallel & Distributed Processing, Atlanta, GA, USA, 19–23 April 2010. [Google Scholar]
  27. Jeong, M.K.; Erez, M.; Sudanthi, C.; Paver, N. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC. In Proceedings of the Design Automation Conference, San Francisco, CA, USA, 3–7 June 2012. [Google Scholar]
  28. Zhao, J.; Mutlu, O.; Xie, Y. FIRM: Fair and high-performance memory control for persistent memory systems. In Proceedings of the International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014. [Google Scholar]
  29. Usui, H.; Subramanian, L.; Chang, K.; Mutlu, O. Squash: Simple qos-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators. arXiv 2015, arXiv:1505.07502. [Google Scholar]
  30. Subramanian, L.; Seshadri, V.; Kim, Y.; Jaiyen, B.; Mutlu, O. Predictable performance and fairness through accurate slowdown estimation in shared main memory systems. arXiv 2018, arXiv:1805.05926. [Google Scholar]
  31. Tam, P.; Kang, S.; Ros, S.; Kim, S. Enhancing QoS with LSTM-Based Prediction for Congestion-Aware Aggregation Scheduling in Edge Federated Learning. Electronics 2023, 12, 3615. [Google Scholar] [CrossRef]
  32. Navaz, A.N.; Kassabi, H.T.E.; Serhani, M.A.; Barka, E.S. Resource-Aware Federated Hybrid Profiling for Edge Node Selection in Federated Patient Similarity Network. Appl. Sci. 2023, 13, 13114. [Google Scholar] [CrossRef]
  33. Kommareddy, V.R.; Hughes, C.; Hammond, S.; Awad, A. Investigating fairness in disaggregated non-volatile memories. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, Miami, FL, USA, 15–17 July 2019; pp. 104–110. [Google Scholar]
  34. Yan, Z.; Lustig, D.; Nellans, D.; Bhattacharjee, A. Nimble page management for tiered memory systems. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI, USA, 13–17 April 2019. [Google Scholar]
  35. Baruah, T.; Sun, Y.; Dinçer, A.T.; Mojumder, S.; Abellán, J.; Ukidave, Y.; Joshi, A.; Rubin, N.; Kim, J.; Kaeli, D. Griffin: Hardware-software support for efficient page migration in multi-gpu systems. In Proceedings of the International Symposium on High Performance Computer Architecture, San Diego, CA, USA, 22–26 February 2020. [Google Scholar]
  36. Gober, N.; Chacon, G.; Wang, L.; Gratz, P.V.; Jimenez, D.A.; Teran, E.; Pugsley, S.; Kim, J. The championship simulator: Architectural simulation for education and competition. arXiv 2022, arXiv:2210.14324. [Google Scholar]
  37. Rixner, S.; Dally, W.J.; Kapasi, U.J.; Mattson, P.; Owens, J.D. Memory access scheduling. ACM SIGARCH Comput. Archit. News 2000, 28, 128–138. [Google Scholar] [CrossRef]
  38. Meza, J.; Li, J.; Mutlu, O. Evaluating row buffer locality in future non-volatile main memories. arXiv 2018, arXiv:1812.06377. [Google Scholar]
  39. Bucek, J.; Lange, K.D.; v. Kistowski, J. SPEC CPU2017: Next-generation compute benchmark. In Proceedings of the Companion of the ACM/SPEC International Conference on Performance Engineering, Berlin, Germany, 9–13 April 2018. [Google Scholar]
  40. Kang, W.; Ran, Y.; Zhang, Y.; Lv, W.; Zhao, W. Modeling and exploration of the voltage-controlled magnetic anisotropy effect for the next-generation low-power and high-speed MRAM applications. IEEE Trans. Nanotechnol. 2017, 16, 387–395. [Google Scholar] [CrossRef]
  41. Chen, Y. ReRAM: History, status, and future. IEEE Trans. Electron Devices 2020, 67, 1420–1433. [Google Scholar] [CrossRef]
Figure 1. Variations in program phases observed from the number of memory reads and writes for the cam4 and wrf benchmarks, monitored at 100 K cycle intervals over time. (a) Number of memory accesses for cam4 over time. (b) Number of memory accesses for wrf over time.
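The per-interval read/write counts plotted in Figure 1 can be illustrated with a small monitoring sketch. The snippet below is a hypothetical reconstruction for exposition only: the trace format and function names are assumptions, not the simulator interface used in the paper.

```python
# Hypothetical per-interval read/write counter; names and trace format are illustrative only.
from collections import defaultdict

INTERVAL = 100_000  # cycles per monitoring window, as in Figure 1

def count_accesses(trace):
    """trace: iterable of (cycle, is_write) memory access events."""
    reads = defaultdict(int)
    writes = defaultdict(int)
    for cycle, is_write in trace:
        window = cycle // INTERVAL
        if is_write:
            writes[window] += 1
        else:
            reads[window] += 1
    return reads, writes

# Example with a tiny synthetic trace
reads, writes = count_accesses([(50, False), (120_000, True), (130_000, False)])
print(dict(reads), dict(writes))  # {0: 1, 1: 1} {1: 1}
```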
Figure 2. Comparison of channel-wise access ratios for memory read and write operations across nine workload sets. The values are normalized based on the number of accesses to channel 0.
Figure 3. Average congestion cycles per channel for memory read accesses at 1 M cycle intervals over time, observed when four programs (nab, pop2, imagick, and exchange2) are executed simultaneously.
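Figure 3's per-channel curves are, in essence, averages of read-side congestion over 1 M-cycle windows. The sketch below shows one way such an average could be computed from a list of (cycle, channel, congestion cycles) samples; the event format is assumed for illustration.

```python
# Hypothetical aggregation of read-congestion cycles per channel over 1 M-cycle windows (Figure 3).
from collections import defaultdict

WINDOW = 1_000_000  # cycles per window

def avg_congestion(events):
    """events: iterable of (cycle, channel, congestion_cycles) for read requests."""
    total = defaultdict(int)
    count = defaultdict(int)
    for cycle, channel, stall in events:
        key = (cycle // WINDOW, channel)
        total[key] += stall
        count[key] += 1
    return {key: total[key] / count[key] for key in total}

print(avg_congestion([(10, 0, 40), (20, 0, 60), (1_500_000, 1, 30)]))
# {(0, 0): 50.0, (1, 1): 30.0}
```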
Figure 4. Dynamic program-memory pairing of DARP accounting for changes in both program and memory affinities. A program is mapped to the memory type with the shortest distance in the affinity dimension.
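Figure 4 frames pairing as a nearest-neighbor match between a program's read/write affinity and the affinity of each memory type. The following sketch illustrates that idea only; the affinity values and the Euclidean distance metric are invented for the example and are not taken from the paper.

```python
# Minimal sketch of affinity-based pairing: each program is matched to the memory
# type at the shortest distance in a (read-affinity, write-affinity) space.
# All numeric affinities here are invented for illustration.
import math

MEMORY_AFFINITY = {        # (read affinity, write affinity) per memory type
    "DRAM":     (0.5, 0.5),
    "STT-MRAM": (0.6, 0.4),
    "PCM":      (0.8, 0.1),
    "ReRAM":    (0.7, 0.3),
}

def pair(program_affinity):
    """Return the memory type closest to the program's (read, write) affinity."""
    def dist(mem):
        (pr, pw), (mr, mw) = program_affinity, MEMORY_AFFINITY[mem]
        return math.hypot(pr - mr, pw - mw)
    return min(MEMORY_AFFINITY, key=dist)

print(pair((0.85, 0.05)))  # a read-heavy program pairs with "PCM" in this toy example
```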
Figure 5. Operations of DARP on memory reads and memory writes. Read accesses follow the existing memory behavior, whereas write accesses trigger a check of whether a new pairing is needed.
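A schematic control flow consistent with the caption might look as follows; the helper names and data structures are hypothetical, and the actual DARP logic in the memory controller is more involved.

```python
# Schematic of the read/write handling described in Figure 5 (helper names hypothetical):
# reads are served under the current mapping, while writes additionally check whether
# the current program-memory pairing is still the best fit and re-pair if not.
def handle_access(request, mapping, pick_best_memory):
    """request: dict with 'program' and 'is_write'; mapping: program -> memory type."""
    program = request["program"]
    if request["is_write"]:
        best = pick_best_memory(program)   # e.g., the affinity match sketched above
        if best != mapping.get(program):
            mapping[program] = best        # adopt the new pairing for subsequent data
    return mapping.get(program)            # memory type that services this access

mapping = {"cam4": "DRAM"}
print(handle_access({"program": "cam4", "is_write": True},
                    mapping, lambda p: "STT-MRAM"))  # "STT-MRAM"
```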
Figure 6. Additional hardware components implemented in the memory controller for DARP. Two table-like structures, PST and MST, are required to store information related to program and memory affinities. The system is configured with four cores (n) and four memory channels (m).
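One plausible software model of the two structures is sketched below. Only the roles stated in the caption, per-program and per-memory affinity bookkeeping for four cores and four channels, are taken from the figure; the field names and the affinity formula are assumptions.

```python
# One plausible (assumed) software model of the PST/MST bookkeeping in Figure 6.
from dataclasses import dataclass

@dataclass
class PSTEntry:                 # per-core/program state
    reads: int = 0
    writes: int = 0
    def affinity(self) -> float:
        """Fraction of reads in the recent access mix (illustrative metric)."""
        total = self.reads + self.writes
        return self.reads / total if total else 0.5

@dataclass
class MSTEntry:                 # per-channel memory state
    mem_type: str = "DRAM"
    read_affinity: float = 0.5  # how favorably this device handles reads vs. writes

pst = [PSTEntry() for _ in range(4)]   # n = 4 cores
mst = [MSTEntry() for _ in range(4)]   # m = 4 channels
pst[0].reads, pst[0].writes = 900, 100
print(round(pst[0].affinity(), 2))     # 0.9
```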
Figure 7. Comparison of memory response time variability between HP and DARP. The results are normalized to the standard deviation of the memory response time of a baseline system in a quad-core environment.
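The metric behind Figure 7 can be written compactly: the standard deviation of a scheme's memory response times divided by the standard deviation measured on the baseline system. A minimal computation, with made-up latency samples, is shown below.

```python
# Minimal computation of the normalized response-time variability plotted in Figure 7:
# the standard deviation of a scheme's memory response times divided by the baseline's.
from statistics import pstdev

def normalized_variability(scheme_latencies, baseline_latencies):
    return pstdev(scheme_latencies) / pstdev(baseline_latencies)

baseline = [100, 180, 90, 250, 120]   # made-up latency samples (cycles)
darp     = [110, 140, 115, 160, 125]
print(round(normalized_variability(darp, baseline), 2))  # < 1.0 means more stable than baseline
```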
Figure 8. Breakdown of the total accumulated memory congestion cycles per channel during the entire execution of each workload set, normalized to the baseline system results.
Figure 9. Average congestion cycles per memory access for four channels, normalized to the baseline system results.
Table 1. Comparison between DRAM and various types of NVM technologies [8,10,11,12,13,14,15] 1.
|                          | DRAM  | STT-MRAM | PCM  | FeRAM  | ReRAM |
| Read time (ns)           | 15    | 20       | 48   | 50     | 15    |
| Write time (ns)          | 15    | 20       | 150  | 50     | 50    |
| Read energy (pJ)         | 2     | 2        | 20   | 2      | 3     |
| Write energy (pJ)        | 2     | 2        | 100  | 10     | 24    |
| Density                  | Low   | Medium   | High | Medium | High  |
| Endurance (write cycles) | >10¹⁵ | 10¹⁵     | 10¹² | 10¹²   | 10⁸   |
1 Specific values may vary depending on the experimental environment and implementation technology.
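As a usage illustration, the read/write latencies of Table 1 could be held in a small lookup consulted by a latency-aware allocator; the sketch below is illustrative only and inherits the footnote's caveat that exact values depend on the implementation technology.

```python
# Illustrative lookup of the per-device read/write latencies from Table 1 (ns).
# As the table's footnote notes, exact values vary with the implementation technology.
LATENCY_NS = {
    #            (read, write)
    "DRAM":      (15, 15),
    "STT-MRAM":  (20, 20),
    "PCM":       (48, 150),
    "FeRAM":     (50, 50),
    "ReRAM":     (15, 50),
}

def service_time(mem_type, is_write):
    read, write = LATENCY_NS[mem_type]
    return write if is_write else read

print(service_time("PCM", True))   # 150 ns: PCM writes are the costliest in the table
```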
Table 2. Detailed system configuration, workload sets, and heterogeneous memory configurations.
Processor: quad-core, 4 GHz, out-of-order
L1 cache: private, split I/D cache, 64 KB, 8-way, 64 B line, 4 cycle latency, write-back
L2 cache: private, unified I/D cache, 512 KB, 8-way, 64 B line, 10 cycle latency, write-back
Main memory: 16 GB, 4 channels, 1 rank per channel, 8 chips per rank, 64-bit data bus, 3200 MT/s, FR-FCFS scheduler, RoRaBaChCo address mapping, 64/64-entry read/write request queues per channel

Workload sets:
| Notation | Program 1 | Program 2 | Program 3 | Program 4 |
| Set 1    | omnetpp   | mcf       | pop2      | cam4      |
| Set 2    | xz        | bwaves    | pop2      | fotonik3d |
| Set 3    | x264      | gcc       | xalancbmk | fotonik3d |
| Set 4    | deepsjeng | wrf       | fotonik3d | lbm       |
| Set 5    | fotonik3d | cam4      | wrf       | lbm       |
| Set 6    | x264      | mcf       | pop2      | omnetpp   |
| Set 7    | lbm       | bwaves    | fotonik3d | roms      |
| Set 8    | perlbench | leela     | xz        | roms      |
| Set 9    | xz        | pop2      | cam4      | wrf       |

Heterogeneous memory configurations:
| Type       | Notation | Channel 0 | Channel 1 | Channel 2 | Channel 3 |
| NVM + DRAM | M/M/D/D  | STT-MRAM  | STT-MRAM  | DRAM      | DRAM      |
|            | D/D/M/M  | DRAM      | DRAM      | STT-MRAM  | STT-MRAM  |
| NVM + NVM  | P/P/M/M  | PCM       | PCM       | STT-MRAM  | STT-MRAM  |
|            | M/P/M/P  | STT-MRAM  | PCM       | STT-MRAM  | PCM       |
| Manifold   | R/D/M/P  | ReRAM     | DRAM      | STT-MRAM  | PCM       |
|            | P/M/D/R  | PCM       | STT-MRAM  | DRAM      | ReRAM     |
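The six heterogeneous configurations in Table 2 are per-channel memory-type assignments, so an experimental harness could encode them as a simple mapping, as in the sketch below; the dictionary is an illustrative encoding using the table's notation, not an artifact of the paper.

```python
# Illustrative encoding of the six channel-to-memory-type configurations in Table 2.
# Keys use the table's notation; each list gives the memory type of channels 0-3.
CONFIGS = {
    "M/M/D/D": ["STT-MRAM", "STT-MRAM", "DRAM", "DRAM"],       # NVM + DRAM
    "D/D/M/M": ["DRAM", "DRAM", "STT-MRAM", "STT-MRAM"],
    "P/P/M/M": ["PCM", "PCM", "STT-MRAM", "STT-MRAM"],         # NVM + NVM
    "M/P/M/P": ["STT-MRAM", "PCM", "STT-MRAM", "PCM"],
    "R/D/M/P": ["ReRAM", "DRAM", "STT-MRAM", "PCM"],           # manifold
    "P/M/D/R": ["PCM", "STT-MRAM", "DRAM", "ReRAM"],
}

def memory_of(config, channel):
    return CONFIGS[config][channel]

print(memory_of("R/D/M/P", 2))  # "STT-MRAM"
```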