Article

Latency-Aware and Auto-Migrating Page Tables for ARM NUMA Servers

1 Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS), Shenzhen 518055, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
3 Inspur Group Co., Ltd., Jinan 250101, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1685; https://doi.org/10.3390/electronics14081685
Submission received: 23 March 2025 / Revised: 16 April 2025 / Accepted: 17 April 2025 / Published: 21 April 2025

Abstract: The non-uniform memory access (NUMA) architecture is the de facto norm in modern server processors. Applications running on NUMA processors may suffer significant performance degradation (the NUMA effect) due to non-uniform memory accesses, including both data and page table accesses. Recent studies show that the NUMA effect of long-running memory-intensive workloads can be mitigated by replicating or migrating page tables to the nodes that access them remotely. However, this technique cannot adapt to situations where other applications compete for the memory controller. Furthermore, it was only implemented on x86 processors and cannot be readily applied to ARM server processors, which are becoming increasingly popular. To address this issue, we designed the page table access latency aware (PTL-aware) page table auto-migration (Auto-PTM) mechanism. We then implemented it on Linux ARM64 (the Linux kernel name for AArch64) by identifying the differences between the ARM and x86 architectures in terms of page table structure and the implementation of the Linux kernel source code, and we evaluated it on real ARM NUMA servers. The experimental results demonstrate that, compared to the state-of-the-art PTM mechanism, our PTL-aware mechanism significantly enhances the performance of workloads in various scenarios (e.g., GUPS by 3.53x, XSBench by 1.77x, Hashjoin by 1.68x).

1. Introduction

A NUMA server consists of several nodes, each with a multi-core CPU and a certain amount of memory, as shown in Figure 1. The nodes are connected through a high-speed interconnection. The memory on all nodes can be addressed uniformly, and the CPU on each node can access the memory on other nodes. This architectural design addresses the bus contention issues arising from the increasing number of CPUs and has become the de facto standard in modern server processors. The CPU in a node accesses the memory of that node (local memory) through a local memory controller and accesses the memory of other nodes (remote memory) through an interconnection network and a remote memory controller. As a result, the CPU accesses local memory faster than remote memory. This non-uniform memory access behavior can significantly affect the performance of many workloads, a phenomenon called the NUMA effect [1,2,3,4].
Non-uniform memory accesses can be classified into data accesses and page table accesses. Previous studies have focused on maximizing the locality of data accesses [5,6,7]. In the data allocation phase, operating systems (OS) such as Linux (we use Linux as the representative OS throughout this paper) provide different memory allocation policies to improve data locality [8]. While an application runs, Linux collects remote access information for each data page and migrates data pages to the node issuing the most remote accesses [9]. Other studies have proposed copying data pages to the remote nodes that access them frequently [5,6].
Recent studies have emphasized that the placement of page tables is just as important as the placement of data pages in NUMA systems. Mitosis [10] and vMitosis [11] are the state-of-the-art studies on this topic. Mitosis can mitigate NUMA effects caused by remote page table walks by transparently replicating and migrating page tables across nodes without application changes. Page table replication (PTR) works well for workloads with a large memory footprint spread across NUMA nodes (wide workloads). Page table migration (PTM) works well for workloads with a small memory footprint that are migrated frequently across nodes (thin workloads). The PTM in Mitosis achieves migration by copying all page tables to every node (we name it Full-PTM). Because it must keep the page table copies on each node consistent, it suffers from both memory footprint overhead and runtime overhead.
vMitosis is a system for explicit management of two-level page tables, i.e., the guest and extended page tables, on virtualized NUMA servers. The PTM in vMitosis periodically scans the page tables, determines the locality of the page tables and the next-level page tables or data, and migrates the page tables to the node with the maximum locality. This mechanism operates automatically (we name it Auto-PTM).
Both Full-PTM and Auto-PTM rest on the precondition that the latency of the local node is lower than that of the remote nodes, with the goal of achieving maximum locality. However, WASP [12] identifies that when other programs compete for local resources, the local latency can exceed the remote latency, which also undermines the benefit of accessing page tables locally. This paper confirms this phenomenon through experiments, and we explain this issue in detail in Section 3.4. In such a situation, migrating the page tables back to the local node will reduce program performance. Therefore, it is necessary to measure the page table memory access latency (PTL) and incorporate it into the decision-making mechanism during page table migration.
This motivates us to design a PTL-aware Auto-PTM technique. In the process of exploring a more effective page table migration mechanism, we noticed that although ARM processors are gaining popularity in computing systems such as supercomputers and data centers due to their appealing power efficiency, price, and density advantages over traditional Intel/AMD counterparts, existing PTR and PTM techniques, especially Auto-PTM, are only implemented and tested on x86 servers and do not readily work on ARM servers. Therefore, our design not only integrates PTL into the decision-making mechanism for page table migration but also aims to make PTL-aware Auto-PTM work well on ARM bare-metal servers.
We address the issues and develop a practically useful ARM Linux kernel patch that can use PTL-aware Auto-PTM. In particular, we make three contributions in this paper.
  • We design a page table access latency-aware (PTL-aware) mechanism for automatic page table migration (Auto-PTM) to optimize the performance of thin workloads.
  • We comprehensively analyze the differences between the ARM and x86 architectures in terms of page table structure and Linux kernel implementation. This makes our design suitable for implementation across both architectures.
  • We evaluated our design on real ARM bare-metal NUMA servers. The experimental results show that, compared to the state-of-the-art PTM mechanism, the performance of the memory-intensive workloads in this paper improves in different scenarios (GUPS, 3.53x; XSBench, 1.77x; Hashjoin, 1.68x). We also evaluated compute-intensive workloads and confirmed that our work has little effect on them.
The rest of this paper is organized as follows. Section 2 describes the related work, and Section 3 describes the background. Section 4 depicts our PTL-aware Auto-PTM mechanism and introduces our implementation on Linux ARM64. Section 5 describes the experimental setup. Section 6 presents the experimental results and analysis. Section 7 concludes the paper.

2. Related Work

2.1. Mitigating NUMA Effects via Data Optimization

Modern NUMA systems employ multilayered strategies to optimize the distribution of data pages. The Linux kernel provides various memory allocation strategies to accommodate different application scenarios. First-Touch (default policy) ensures that data are allocated to the NUMA node hosting the CPU that first accesses the data. This maximizes data locality. Interleave allocates physical memory in a round-robin fashion across multiple nodes on a page-by-page basis, balancing memory bandwidth pressure and being suitable for high-concurrency memory access scenarios. The binding policy enforces memory allocation on a specified node. If insufficient space is available, it immediately returns an error, making it applicable to scenarios with strict resource isolation requirements. The preferred policy elastically prioritizes memory allocation on a specified node and automatically falls back to other nodes if insufficient space is available, balancing both locality and allocation success rates [8].
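For concreteness, the sketch below shows how these policies can be requested from user space through the set_mempolicy(2) and mbind(2) system calls (declared in <numaif.h>; link with -lnuma). The node numbers and region size are arbitrary examples for a two-node machine, not values from this paper.

```c
/* Hedged example: applying the Linux NUMA policies described above. */
#include <numaif.h>
#include <sys/mman.h>
#include <stddef.h>

int main(void)
{
    size_t len = 64UL << 20;                        /* a 64 MB region */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    unsigned long nodes01 = 0x3;                    /* nodes 0 and 1 */
    unsigned long node1   = 0x2;                    /* node 1 only   */
    unsigned long maxnode = 8 * sizeof(unsigned long);

    /* Interleave: spread this region's pages round-robin over nodes 0-1,
     * balancing memory-controller pressure. */
    mbind(buf, len, MPOL_INTERLEAVE, &nodes01, maxnode, 0);

    /* Bind: future allocations of this thread must come from node 1;
     * allocation fails rather than falling back. */
    set_mempolicy(MPOL_BIND, &node1, maxnode);

    /* Preferred: try node 1 first, fall back to other nodes if full. */
    set_mempolicy(MPOL_PREFERRED, &node1, maxnode);
    return 0;
}
```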
Additionally, users can dynamically set policies using the numactl tool [13] and optimize performance by combining them with thread binding (CPU affinity). The NUMA balancing mechanism in modern kernels also automatically migrates hot pages to further balance cross-node access latency. AutoNUMA automatically adjusts the distribution of threads and memory pages by dynamically monitoring their access patterns, thus reducing memory access latency across NUMA nodes. Although AutoNUMA offers automation, transparency, and dynamic adaptability, its performance overhead and limited applicability should also be kept in mind. In practice, AutoNUMA can be enabled or disabled according to specific requirements to achieve optimal performance. Note that AutoNUMA only migrates data pages; it does not migrate page tables [9].
MemProf [14] is a profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems. MemProf builds temporal flows of interactions between threads and objects that help programmers understand why and which memory objects are accessed remotely.
Earlier NUMA-aware memory management policies aimed to mitigate the cost of remote wire delays, which is no longer the main bottleneck in modern systems. Carrefour’s [6] design was motivated by the evolution of modern NUMA hardware, where traffic congestion plays a much larger role in performance than wire delays. Carrefour represents an automated counter-based strategy that transparently replicates and migrates application data. However, it does not adequately address the NUMA challenges associated with large pages.
In NUMA systems that employ large pages, the execution of memory placement algorithms is important. These algorithms must strike a balance between distributing the load across memory controllers and preserving data locality. Moreover, there are instances where NUMA-aware memory placement alone proves insufficient to achieve optimal performance. In such scenarios, the only feasible solution is to split large pages, despite the potential performance drawbacks this may introduce. To address these obstacles, Carrefour-LP is introduced to recover the performance lost due to the use of large pages [15].
Machine learning frameworks further optimize thread and data placement through decision tree models, reducing execution time in OLTP scenarios [16]. MONEPML [17] coordinates thread binding and hardware prefetcher parameters via NUMA topology analysis, achieving 13–28% QPS improvements through periodic PMU event sampling (e.g., MEM_LOAD_RETIRED.LOCAL_DRAM). These dynamic optimization methods overcome the limitations of static configuration through runtime adaptation.

2.2. Mitigating NUMA Effects via Page Table Optimization

To mitigate NUMA effects caused by page tables, four approaches have been proposed. Mitosis [10] pioneered page table self-replication (PTSR) on x86 platforms. The PTSR technology on physical machines includes two optimization techniques: page table replication (PTR) and page table migration (PTM). PTR is designed for memory-intensive workloads in which threads and data are distributed across NUMA nodes. PTM, on the other hand, is intended for workloads where most of the memory resides on a single NUMA node but that are frequently scheduled and migrated to other nodes by the system load-balancing mechanism. PTM builds on the replication technology to implement full page table migration (Full-PTM), which creates a copy of the page table on the destination node to which the program is to be migrated. Mitosis has limitations in practical adoption because it relies on manual configuration and is exclusive to x86 systems. Moreover, Mitosis does not account for the fact that, under local traffic congestion, the strategy of using local page table copies may no longer be beneficial.
vMitosis [11] extended PTSR to virtualized environments by explicitly managing two-level page tables. It involves not only the guest page tables (gPT) managed by the guest operating system but also the extended page tables or shadow page tables (ePT/sPT) managed by the hypervisor (KVM). vMitosis optimizes these two types of page tables. vMitosis implements automatic page table migration (Auto-PTM). A page table contains page table entries that point to either the next-level page tables or data. Auto-PTM counts the number of next-level page tables or data on each node, identifies the node with the largest quantity, and then migrates this page table to that node to achieve maximum locality. vMitosis still requires manual configuration, has only been implemented on the x86 architecture, and does not take traffic congestion into consideration.
The Hydra [18] strategy breaks the traditional mode of full page table replication and adopts an innovative approach of automatic partial page table replication. When a program’s access to a specific page table triggers a remote access, the system only replicates the affected partial page table instead of replicating the entire page table. This precise operation significantly reduces the number of page tables that need to be synchronized, effectively lowering the overhead caused by page table synchronization. However, it lacks ARM compatibility and does not take into account traffic congestion.
The WASP [12] framework represents a paradigm shift through two innovations. First, dynamic metric analysis leverages PMU counters to monitor the memory access rate and TLB pressure, enabling automated PTSR activation. Second, WASP identifies that whether PTSR improves performance depends on workload characteristics and on how much interference is incurred by colocated workloads. It measures the page table access latency (PTL) between each NUMA node pair. When a process is running on the CPUs of NUMA Node-x, WASP selects the page table replica on the node with the shortest PTL as seen from Node-x. WASP was implemented in Linux and evaluated on x86 and ARM NUMA servers. However, WASP has only been implemented in the operating system of NUMA physical machines, not in virtualized environments. WASP uses Full-PTM and lacks an automated solution.
In summary, the existing page table migration techniques are based on the condition that the local latency is lower than the remote latency, and they migrate the page tables to local nodes to achieve maximum locality. They do not consider how to deal with the situation where the local latency is higher than the remote latency due to other programs competing for the resources of the local memory controller. This has inspired us to conduct research in this area.

3. Background

This section provides an overview of page tables, the page table-caused NUMA effect, and two key techniques, page table replication (PTR) and page table migration (PTM), designed to mitigate these issues. Finally, we present the motivations behind our research aimed at enhancing these techniques.

3.1. Page Tables

Modern processors employ virtual memory, where virtual addresses are translated into physical addresses by the memory management unit (MMU) using page tables stored in memory. The MMU automatically reads page tables when necessary, which is known as page table walk [19].
Figure 2 shows a native ARMv8 page table walk to translate a virtual address (VA) to its corresponding physical address (PA). As shown, to locate the PA of a VA, the hardware starts from the page table base address stored in the Translation Table Base Register (TTBR) and traverses up to four levels of page tables: the Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD) and Page Table Entry (PTE). Since the page tables are also stored in memory, an address translation needs to access memory four times in the worst case. In a virtualized environment, the number of memory accesses can be as large as 24 for an address translation. In both cases, the overhead for address translation is high.
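As an illustration, the following C model mimics the four-level walk in Figure 2 for a 4 KB granule and a 48-bit virtual address, with one 9-bit index per level. The table base and the read_phys() helper are hypothetical stand-ins; real hardware performs this walk inside the MMU.

```c
#include <stdint.h>

#define LEVEL_BITS 9                          /* 512 entries per table */
#define PAGE_SHIFT 12                         /* 4 KB pages            */
#define IDX_MASK   ((1UL << LEVEL_BITS) - 1)

extern uint64_t read_phys(uint64_t pa);       /* assumed memory read */

/* Software model of a 4-level walk: PGD -> PUD -> PMD -> PTE,
 * one memory access per level (four in the worst case). */
uint64_t walk(uint64_t ttbr, uint64_t va)
{
    uint64_t table = ttbr;
    for (int shift = 39; shift >= PAGE_SHIFT; shift -= LEVEL_BITS) {
        uint64_t idx  = (va >> shift) & IDX_MASK;
        uint64_t desc = read_phys(table + idx * 8);
        table = desc & ~((1UL << PAGE_SHIFT) - 1);   /* next-level base */
    }
    return table | (va & ((1UL << PAGE_SHIFT) - 1)); /* PA = frame | offset */
}
```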

3.2. Page Table-Caused NUMA Effect

In NUMA servers, the NUMA effect caused by page tables is mainly attributed to two factors: the uneven distribution of page tables and TLB capacity limitations, which complicate access paths [10]. This section explains the reasons why the NUMA effect caused by page tables is becoming increasingly important.

3.2.1. Page Table Distribution and Cross-Node Access

First-Touch Allocation Policy: The Linux kernel defaults to allocating page tables on the NUMA node where memory is first accessed [8]. For memory-intensive workloads (e.g., databases), page tables often scatter across multiple nodes, causing frequent remote accesses when threads execute on non-local nodes [10]. For example, a 4-node ARM NUMA system may require threads on Node0 to access page tables on Node2, incurring cross-node latency 2.3× higher than local access (measured on HiSilicon Kunpeng920) [12].
Page Table Residency During Process Migration: When a process is migrated by NUMA balancing, the OS typically has mechanisms (such as AutoNUMA) to ensure that the data pages relocate with the process to achieve the best locality. However, as part of the kernel data structures, page tables are not migrated along with the process. This leaves the page tables on the original node while the process runs on a new node. Consequently, every page table access becomes a remote node access, further aggravating the NUMA effect [10].

3.2.2. TLB Capacity Limitations

With modern applications using increasingly larger amounts of memory, the growth rate of the TLB capacity has not kept pace. This results in a significant increase in TLB misses. Each TLB miss triggers a page table walk, requiring memory access to query the page table and retrieve the final physical address. If the page table is located on a remote node, this significantly increases access latency, thereby intensifying the NUMA effect.
For example, if an application’s memory requirement exceeds the current TLB capacity, each TLB miss necessitates a multi-level page table walk. If the page table has four levels, the memory management unit (MMU) needs to access memory four times to traverse the page table and obtain the final physical address. With each level potentially residing on different nodes, this can result in up to four remote accesses per TLB miss. This situation worsens when the page table level extends to five levels, causing more remote accesses and higher overhead [20,21,22,23].
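To put the cost in perspective, a rough worst-case estimate follows directly from the per-level latencies; the 2.3x remote-to-local ratio below is the Kunpeng 920 measurement cited in Section 3.2.1, used here purely for illustration.

```latex
% Worst-case TLB-miss penalty for a four-level walk:
T_{\mathrm{miss}} = \sum_{i=1}^{4} t_{\mathrm{node}(i)}
% All levels local:  T_{\mathrm{miss}} = 4\, t_{\mathrm{local}}
% All levels remote: T_{\mathrm{miss}} \approx 4 \times 2.3\, t_{\mathrm{local}} = 9.2\, t_{\mathrm{local}}
```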

3.3. PTR and PTM

In recent years, considerable attention has been paid to optimizing the NUMA effect caused by page tables. Next, we will describe two key technologies, namely PTR and PTM.
PTR: Page table replication was first introduced by Mitosis [10]. It can mitigate NUMA effects on remote page table walks by transparently replicating page tables across nodes without requiring changes to the application. PTR has two core mechanisms. The first is transparent replication: replicas of the page table are created and placed on all NUMA nodes, and the original page table and its replicas are connected via a circular linked list. When the original page table is updated, the circular linked list is used to locate and update the replicas, keeping the original and replicas synchronized. The second is local page table replica access: after a process migrates to a new node, the context switch gives priority to the local replica of the page table, thus avoiding remote page table walks. Mitosis was implemented only for the x86 architecture in the Linux kernel. It achieves the above functions through para-virtualization interfaces related to page tables.
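A minimal sketch of the replica-consistency idea follows: replicas of a page table page sit on a circular list, and each update to the original is echoed to every replica. The structure and names are illustrative, not Mitosis's actual kernel code.

```c
#include <stdint.h>

struct pt_page {
    uint64_t *entries;            /* the 512 page table entries       */
    int node;                     /* NUMA node holding this copy      */
    struct pt_page *next;         /* circular linked list of replicas */
};

/* Write one entry in the original, then walk the circular list so the
 * per-node replicas stay synchronized with it. */
void set_entry_replicated(struct pt_page *orig, int idx, uint64_t val)
{
    orig->entries[idx] = val;
    for (struct pt_page *r = orig->next; r != orig; r = r->next)
        r->entries[idx] = val;
}
```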
WASP adds two features, workload awareness and PTL awareness, on top of PTR. First, PTR is applicable to workloads that are memory-intensive and suffer a high number of TLB misses. Based on these two characteristics, WASP experimentally selects two performance indicators, MAR (Memory Access Rate) and the DTLB (Data Translation Lookaside Buffer) miss rate, to determine whether a workload is suitable for enabling PTR. Second, WASP identifies the impact of memory access latency on the effectiveness of PTR and proposes the PTL-aware mechanism: the local latency is not necessarily the lowest. When NUMA congestion occurs locally, the memory controller is contended, and the local memory access latency may be higher than the remote memory access latency. In this case, using the local page table replica will instead reduce program performance, and selecting the page table replica on the node with the lowest remote latency is the optimal solution. WASP implements workload awareness, PTL awareness, and PTR on both the ARM and x86 architectures. The implementation of PTR and PTM on x86 is the same as that of Mitosis, while it differs on ARM. We introduce this in detail in Section 4.
PTM: Small-memory applications (thin workloads) are frequently migrated across NUMA nodes for load balancing, but only the data pages, not the page tables, migrate along with them. This results in substantial remote page table accesses with long latencies, significantly degrading overall application performance [10,11,12]. Page table migration was designed to mitigate this situation.
The PTM technology can be implemented through two distinct methods: full page table migration (Full-PTM) and auto page table migration (Auto-PTM), as illustrated in Figure 3. Full-PTM accomplishes migration by replicating the entire page table onto the node to which the application has been migrated. This method is employed by Mitosis and WASP. While Full-PTM ensures the locality of page table access, it has two notable disadvantages. First, Full-PTM lacks automation, requiring users to manually decide whether to activate it for a program, which is not user-friendly. Second, due to the uncertainty about which node the process will be migrated to, a set of page table copies must be maintained on each node, leading to substantial memory overhead. To mitigate these challenges, Auto-PTM has been introduced.
Auto-PTM migrates the necessary parts of the page table through AutoNUMA-based scanning; this method is used by vMitosis. In the Linux kernel, each process has its own virtual address space, which is divided into multiple virtual memory areas (VMAs). Each VMA has starting and ending virtual addresses, and these virtual addresses have corresponding page tables. Auto-PTM periodically scans the page tables corresponding to the VMAs of a process. Specifically, the scanning starts from the last-level page table. Taking a PTE as an example, if most of the data pages pointed to by the PTE are on Node-1, this PTE is migrated to Node-1. The PMD and PUD are then scanned according to the same rule, achieving a bottom-up page table migration. This migration is automatic and is a real migration: only one copy of the page table is maintained throughout.

3.4. Motivation

This work is driven by two goals. First, we identify the issue that NUMA congestion significantly impacts decision-making for optimal page table placement, undermining the effectiveness of existing methods. To address this problem, we enhance the Auto-PTM mechanism by incorporating a PTL-aware feature, enabling more adaptive page table management. Moreover, we analyze the differences in page table mechanisms between x86 and ARM architectures. Based on these findings, we implement Full-PTM, Auto-PTM, and PTL-aware Auto-PTM on the ARM architecture, aiming to improve performance and ensure broad architectural compatibility.

3.4.1. Enhancing Auto-PTM with PTL-Awareness

Although Auto-PTM can mitigate the NUMA effect caused by page tables for thin workloads, the core concept of Auto-PTM is to migrate the page tables to the local node to achieve maximum locality. However, we have found that simply considering locality is insufficient; it is also necessary to take into account the resource contention situations caused by NUMA congestion in the local node.
We conducted an experiment in which we ran Stream [24] as interference on the local node where the workload was running, and then observed the program's running time when the page tables were placed on different nodes, including both local and remote nodes. The configuration of the experiment is shown in Figure 4. LPLD means that the page table and data are located on the same node as the process, which serves as the base case. LPLDI adds an interference program on the local node on top of the LPLD scenario. RPLDLI is LPLDI with the page table placed on a remote node. For Type-1, the page table is placed on Node 3, which is the farthest from Node 0, resulting in the highest remote latency. For Type-2, the page table is placed on Node 1, which is the closest to Node 0, yielding the lowest remote latency.
The results of the experiment are shown in Figure 5. We can draw the following three conclusions:
First, the workload has the shortest running time in the LPLD scenario and the longest in the LPLDI scenario. This is because the tested workload is memory-intensive: the interfering program Stream competes for the local memory controller, which in turn affects the running time of the workloads. This is a manifestation of NUMA congestion. As shown in Figure 6, contention between different applications causes memory controller and interconnection congestion in NUMA servers. NUMA congestion increases memory access latency; therefore, memory accesses to NUMA-congested nodes should be avoided as much as possible.
Second, in order to alleviate the local NUMA congestion, we place the page tables at the remote nodes. We find that the program running time under RPLDLI (type-1) is shorter than that under LPLDI. This indicates that when there is local resource contention, placing the page table on a remote node can improve the program performance.
Finally, the effect of RPLDLI (type-2) is better than that of RPLDLI (type-1). This shows that among remote nodes, a node with the lowest latency should still be selected to place the page table to achieve the best optimization.
The above three findings motivate us to study a PTL-awareness mechanism to enhance Auto-PTM. Our approach aims to minimize page table access latency by prioritizing the migration of frequently accessed page tables. This hybrid approach leverages the strengths of both Auto-PTM and PTL awareness, leading to more efficient and adaptive page table management.

3.4.2. Cross-Architecture Compatibility

The second motivation stems from the challenges of applying page table migration (PTM) techniques across different architectures. Specifically, PTM techniques implemented in the Linux kernel for x86 architectures cannot be directly ported to ARM architectures due to fundamental differences in their page table mechanisms and kernel implementations.
Key differences between x86 and ARM architectures include the following:
  • Page Table Structure—The multi-level page table mechanism in ARMv8-A differs significantly from that of the x86 architecture, requiring modifications to the replication and migration logic.
  • Para-Virtualization Interface—ARM lacks the para-virtualization interfaces available in x86, necessitating alternative methods for implementing page table self-replication and migration.
  • Linux Kernel Implementation—The Linux kernel’s handling of page tables on ARM architectures is fundamentally different from that on x86, presenting additional technical challenges.
To bridge this gap, we performed a detailed analysis of the architectural and kernel implementation differences between x86 and ARM. Our goal is to provide a framework that enables PTR and PTM techniques to be effectively implemented across both architectures.
In summary, this paper aims to improve the efficiency of page table management through the integration of Auto-PTM and PTL-awareness while also ensuring cross-architecture compatibility.

4. Design and Implementation

4.1. PTL-Aware Auto-PTM

We design a mechanism named PTL-aware Auto-PTM, which consists of three parts: PTL probing, page table scan, and migration decision.

4.1.1. PTL-Probing

We use lat_mem_rd to measure the page table latency (PTL) information for each node. Given that the L3 cache of the ARM servers we use is 48 MB and that of the x86 servers is 32 MB, we set the length of memory accesses to 64 MB to exceed the L3 cache of the server CPU and measure the true memory access latency.
For each node, it is necessary to measure both the local latency and the latency of accessing other nodes. To ensure that the data are up to date, these measurements are performed periodically. We set the measurement time interval to 10 s. This periodic measurement approach allows us to capture any dynamic changes in memory access latency. We record the PTL information for each node and identify which node has the lowest latency for each individual node. For example, for Node-0, if the PTL when accessing Node-1 from Node-0 is the smallest, then the PTL information recorded for Node-0 is the node number of Node-1.
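The bookkeeping this implies can be summarized in a short sketch: a periodically refreshed latency matrix plus, for every node, the ID of its lowest-latency target. measure_latency() stands in for the lat_mem_rd probe described above; the fixed node count is an example for our four-node setup.

```c
#include <limits.h>

#define MAX_NODES 4

static unsigned ptl[MAX_NODES][MAX_NODES];  /* ptl[from][to] latency      */
static int best_node[MAX_NODES];            /* lowest-PTL target per node */

extern unsigned measure_latency(int from, int to);  /* assumed probe */

/* Runs every 10 s: refresh the matrix and record, for each node, which
 * node (possibly itself) currently offers the lowest PTL. */
void refresh_ptl_info(int nr_nodes)
{
    for (int from = 0; from < nr_nodes; from++) {
        unsigned best = UINT_MAX;
        for (int to = 0; to < nr_nodes; to++) {
            ptl[from][to] = measure_latency(from, to);
            if (ptl[from][to] < best) {
                best = ptl[from][to];
                best_node[from] = to;
            }
        }
    }
}
```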

4.1.2. Page Table Scan and Migration Decision

Auto-PTM builds on AutoNUMA (an automatic NUMA balancing approach) and implements the page table migration function. AutoNUMA operates by periodically invalidating the page table entries (PTEs) within a process’s page table. This process triggers minor page faults, which the operating system uses as cues to determine whether a remote socket is the primary source of memory access for a data page. Based on this evaluation, AutoNUMA decides whether to migrate data pages to a more suitable node to improve memory access efficiency.
However, the data page migrations driven by AutoNUMA can disrupt the locality between page tables and data. To address this issue, the Auto-PTM mechanism initiates a scan of the page tables immediately after AutoNUMA completes its data-scan process. Since AutoNUMA’s data migrations may change the optimal placement of page tables, the Auto-PTM mechanism re-evaluates whether the corresponding page tables also need to be migrated, thereby ensuring that both data and page tables are optimally located for efficient memory access. Therefore, the page table scanning of Auto-PTM is triggered by AutoNUMA.
Algorithm 1 shows the pseudo-code of the migration algorithm for PTL-aware Auto-PTM. Auto-PTM scans the virtual memory areas (VMAs) of the target process. Specifically, it scans the page tables corresponding to the virtual address range of each VMA. It traverses the Page Global Directory (PGD), Page Upper Directory (PUD), and Page Middle Directory (PMD) to find the Page Table Entry (PTE).
Then, it scans the 512 page table entries of the PTE. An array is used to record on which node the data pointed to by each page table entry is located: the subscript of the array represents a node number, and the value of the array element represents the number of data pages on that node.
Next, we calculate the subscript of the element in the array that stores the maximum value. This subscript is the ‘to_node’. Here, we pass the ‘from_node’ and ‘to_node’ to the PTL-aware decision-making algorithm. Using the tested Page Table Latency (PTL) information, we identify the node with the lowest PTL from the perspective of the ‘to_node’ as the ‘migration_node’.
Then, we compare the ‘from_node’ and the ‘migration_node’. If they are the same, no migration is performed. If they are different, the PTE is migrated to the ‘migration_node’.
Subsequently, the same scanning process is performed for all PTEs. After the migration of the PTEs is complete, the PMD, PUD, and PGD are scanned in the same way.
Algorithm 1 The pseudo-code of the migration algorithm for PTL-aware Auto-PTM. The configuration is a page size of 4 KB, and a four-level page table is used
Require: VMA-0: the VMA of the target process to be scanned;
Require: PTL_info: the PTL info array;
1: Recursively scan the PGD, PUD, PMD, and PTE corresponding to VMA-0 to find the last-level page table, PTE. Record the node where this PTE is located as from_node.
2: Scan the 512 page table entries of the PTE, and use an array to record which node the data pointed to by each page table entry is located on. The subscript of the array represents a node number, and the value of the array represents the number of data pages on that node.
3: Calculate the subscript of the element in the array that stores the maximum value. This subscript is the to_node.
4: Pass from_node and to_node to the PTL-aware decision.
5: Find from PTL_info the node that has the lowest PTL when viewed from the perspective of to_node, and mark it as the migration_node.
6: if (from_node == migration_node) then
7:   do not migrate
8: else
9:   migrate page table to migration_node
10: end if
11: Scan PMD
12: Scan PUD
13: Scan PGD
14: end
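For readers who prefer code, the following compact C rendering of Algorithm 1's PTE-level steps makes the decision explicit. data_node() and migrate_table() are placeholders for kernel helpers, and best_node[] is the per-node lowest-PTL table from the Section 4.1.1 sketch.

```c
#include <stdint.h>

#define MAX_NODES         4
#define ENTRIES_PER_TABLE 512

extern int best_node[MAX_NODES];            /* from PTL probing            */
extern int data_node(uint64_t entry);       /* node of the pointed-to page */
extern void migrate_table(const uint64_t *pt, int node);

void scan_pte(const uint64_t *pte, int from_node, int nr_nodes)
{
    int count[MAX_NODES] = { 0 };

    /* Step 2: count, per node, the data pages referenced by valid entries. */
    for (int i = 0; i < ENTRIES_PER_TABLE; i++)
        if (pte[i])
            count[data_node(pte[i])]++;

    /* Step 3: to_node is the node holding the most data pages. */
    int to_node = 0;
    for (int n = 1; n < nr_nodes; n++)
        if (count[n] > count[to_node])
            to_node = n;

    /* Steps 4-5: PTL-aware decision picks the lowest-latency node as
     * seen from to_node, which need not be to_node itself. */
    int migration_node = best_node[to_node];

    /* Steps 6-10: migrate only if the placement actually changes. */
    if (migration_node != from_node)
        migrate_table(pte, migration_node);
}
```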

4.1.3. Workflow

Figure 7 shows an example. At step ①, we enable PTM for process-x. At step ②, we periodically probe the PTL of all NUMA nodes with lat_mem_rd (a tool in lmbench that measures memory access latency) and store the PTL information. At step ③, AutoNUMA triggers a PTM scan, and Auto-PTM scans all VMAs of process-x, each of which corresponds to a set of page tables.
We take the scan of VMA-0 as an example. At step ④, PTL-aware Auto-PTM scans the page tables including PGD, PUD, PMD, and PTE for VMA-0. Here we take the scan of PTE-0 which is located on Node-0 as an example. In this step, Auto-PTM will calculate which node PTE-0 should be migrated to in order to achieve maximum locality. We assume that PTE-0 contains three pointers pointing to Data-0, Data-1, and Data-2. Moreover, we suppose Data-0 and Data-1 are located on Node-1, and Data-2 is located on Node-0. Here, Auto-PTM will count that Node-1 has the largest number of data pages. Therefore, migrating PTE-0 to Node-1 will achieve the maximum locality. Next, we will make a PTL-aware decision based on this result.
In step ⑤, we use the PTL information probed in step ② to answer a question: is the local PTL of Node-1 the shortest? If the answer is yes, we migrate PTE-0 from Node-0 to Node-1. If the answer is no, we keep PTE-0 on Node-0; and if the NUMA server has more than two NUMA nodes, we migrate PTE-0 to whichever other node has the lowest PTL.

4.1.4. Analysis of Scanning Frequency

Although Auto-PTM can take advantage of AutoNUMA’s dynamic rate-limiting heuristics to adjust the scanning frequency based on the rate of data page migration for performance optimization, this mechanism cannot guarantee applicability across all architectures. In our experiments, we found that on our ARM NUMA server, this mechanism introduces non-negligible overhead. As shown in Figure 8, LPLDLI means that the page tables and the data are placed locally; there is no need to migrate the page tables. In this case, enabling Auto-PTM will incur running-time overhead for the tested workloads. For example, the running time of Redis increases by 13%, and the running time of XSBench increases by 42%.
This happens because the scanning process interferes with the normal execution of the running programs, leading to degraded overall performance. Additionally, another significant factor contributing to performance degradation is the periodic scanning of local page tables. This increases the volume of access to the local memory controller, thereby increasing local NUMA traffic and increasing local memory access latency. As a result, all memory-accessing programs running on the local node are negatively affected.
We designed a mechanism to limit the frequency of page table scanning. The scanning of page tables is triggered when AutoNUMA scans data, but whether to actually execute the page table scan can be controlled. If no page table migration has occurred, there is no need to scan the page tables. Based on this idea, after each page table scan, we check whether any page table migration has occurred. If migration is detected, no action is taken. If no page table migration is observed during the current scan, the next 2^n rounds of page table scans triggered by AutoNUMA are ignored. The initial value of n is 0, and n increases by 1 after each scan in which no page table migration occurs. If page table migration occurs, n is reset to 0.
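The backoff logic can be sketched in a few lines; the state variables and hook names here are illustrative, not the patch's actual symbols.

```c
static unsigned int n;          /* backoff exponent                */
static unsigned int skip_left;  /* AutoNUMA triggers still to skip */

/* Called on each AutoNUMA-triggered page table scan request. */
int should_scan_page_tables(void)
{
    if (skip_left) {
        skip_left--;
        return 0;               /* ignore this round */
    }
    return 1;
}

/* Called after a page table scan completes. */
void after_page_table_scan(int migrated)
{
    if (migrated) {
        n = 0;                  /* migration seen: scan eagerly again */
    } else {
        skip_left = 1U << n;    /* skip the next 2^n triggered scans  */
        n++;
    }
}
```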

4.1.5. Transparent Huge Pages in PTM

When it comes to THP memory management, special care must be taken. When THPs are enabled, Linux supports a 2 MB page size, and the corresponding page table conversions are carried out to manage these huge pages. When the page table of a THP is migrated, the system needs to release the huge page memory for other purposes, which requires withdrawing the huge page conversion. Since huge page memory may be used by different processes or kernel modules, when these huge pages are no longer needed, they must be converted back to the form managed by ordinary page tables for subsequent memory allocation and management. The pgtable_trans_huge_withdraw function handles the modification of the relevant page table entries to ensure that the memory can be reclaimed and redistributed normally.
However, Auto-PTM only scans the page middle directory entries (PMDs) within the range defined by the start and end addresses of the current VMA; if only the large page entries within this range are processed, those outside the range will be ignored, which can undermine the integrity and consistency of the page table.
Therefore, when migrating page tables for THP, one should avoid relying solely on the ‘start’ and ‘end’ range of the VMA to scan PMD entries. As demonstrated above, traversing the entire PMD table to ensure that all large page entries are processed is necessary to guarantee system stability and the effectiveness of memory management. In our approach, this problem has been solved.

4.2. Implementation

4.2.1. Page Table Differences Between ARM64 and x86-64 Processors

Table 1 shows the differences in page tables between ARM and x86 processors. First, the page table base address registers (PTBARs) are different. On a context switch, the PTBAR loads the page table base address (PTBA) of the next process to run. The PTBA is the physical address of the Page Global Directory (PGD) of a process. The MMU translates virtual addresses to physical addresses by using the PTBAR to obtain the PTBA of an application. The x86 processor uses the CR3 register as the PTBAR. For ARM64, different exception levels have different PTBARs. The concept of exception levels was first introduced in ARMv8, with each exception level representing a different privilege level. At EL0 (exception level 0), execution is non-privileged. EL2 is used to support non-secure virtualization operations. EL3 provides a secure/non-secure switching mechanism. Exception level 1 (EL1) has both translation table base address register 0 (TTBR0_EL1) and translation table base address register 1 (TTBR1_EL1); EL2 and EL3 have a TTBR0 but no TTBR1 [19]. Translation table is the ARMv8 term for a page table.
Moreover, an ARM64 processor uses TTBR0_EL1 as the user-space process PTBAR and TTBR1_EL1 as the kernel-space PTBAR. TTBR0_EL1 provides translations for the bottom of the virtual address space, which is typically application space, and TTBR1_EL1 covers the top of the virtual address space, typically kernel space. When context switch occurs, the TTBR0_EL1 loads PTBA of the next process to run. TTBR1_EL1 loads kernel-space page table base address, which is fixed at kernel boot. When the MMU performs address translation, the page table pointed to by TTBR0 is selected when the upper bits of the virtual address (VA) are all set to 0. TTBR1 is selected when all the upper bits of the VA are set to 1.
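This selection rule is simple enough to express directly; the check below assumes a 48-bit virtual address space and mirrors what the MMU does in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

bool va_uses_ttbr1(uint64_t va)
{
    /* Bits [63:48] all ones selects TTBR1_EL1 (kernel space); all zeros
     * selects TTBR0_EL1 (user space); anything else is a translation fault. */
    return (va >> 48) == 0xFFFF;
}
```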
Secondly, the application's page table stores different ranges of the virtual address space. For x86, the user process copies the PGD entries of the main kernel page table into TSK (TSK is an instance of the kernel data structure task_struct, which represents a kernel thread or user process and records all the context of the process) -> mm (mm is an instance of the kernel data structure mm_struct, which abstracts and describes all information about the process address space) -> PGD when it is created. Thus, the user process page table stores both the user and kernel virtual address spaces. This design avoids reloading the CR3 register when switching between user and kernel mode, since there is only one PTBAR on x86. For ARM64, since kernel space and user space have different PTBARs, processes are forked without kernel page table synchronization. Therefore, the page table of a user process stores only the user virtual address space.
Finally, different page table levels are supported. The x86 architecture supports five-level page tables, while the ARMv8-A architecture supports only four-level page tables.
The above differences impact the implementation of PTR and PTM. In the next subsection, we detail how to overcome these challenges.

4.2.2. Solution

We implement our work in Linux kernel 5.7.1. The non-trivial aspect of the implementation is understanding how to implement PTR and PTM on ARM64 and how the ARM64 implementation differs from x86. We describe how we overcame these challenges as follows.
Para-Virtualization: The existing page table replication technology is implemented on x86 using PV-Ops [25]. PV-Ops is required to support para-virtualization environments such as Xen [26], and all page table updates propagate through the PV-Ops interface. The Linux kernel shipped with major distributions like Ubuntu has para-virtualization support enabled by default, but the page table-related interface only supports the x86 architecture. The ARM architecture in the Linux kernel supports para-virtualization, too, but does not include the page table-related interface. To address this problem, we modified the page table-related functions throughout the memory subsystem, effectively reimplementing page table replication in the Linux kernel for the ARM architecture.
Overcome Page Table Differences: ARM64 has translation table base address registers (TTBRs) for different ELs. We concentrate on TTBR0_EL1 and TTBR1_EL1; other EL conditions are left for future work. We modify TTBR0_EL1 to load the PTBA from the local page table replica and update TTBR1_EL1 with the address space identifier (ASID) during context switches. We replicate user-space page tables, not kernel-space page tables, because the page tables of a user-space process store only the user-space virtual address space on ARM64. We also modify the functions for the four page table levels, not including P4D. By addressing the differences in page table implementations between ARM64 and x86, we successfully implemented PTR and PTM on ARM64.
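A hedged sketch of the modified context switch path conveys the idea: the incoming task's TTBR0_EL1 is loaded from the page table replica local to the node where the task will run. Helper names such as replica_pgd_phys(), write_ttbr0_el1(), and asid_field() are hypothetical, not the actual symbols in our patch; cpu_to_node() and struct mm_struct are real kernel facilities.

```c
/* Kernel-context sketch only, not the actual patch. */
#include <linux/mm_types.h>
#include <linux/topology.h>

extern phys_addr_t replica_pgd_phys(struct mm_struct *mm, int node);
extern void write_ttbr0_el1(u64 val);
extern u64 asid_field(struct mm_struct *mm);

void switch_mm_replicated(struct mm_struct *next_mm, int cpu)
{
    int node = cpu_to_node(cpu);                        /* node of this CPU  */
    phys_addr_t pgd = replica_pgd_phys(next_mm, node);  /* local PGD replica */

    write_ttbr0_el1((u64)pgd | asid_field(next_mm));    /* PTBA plus ASID    */
}
```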

4.2.3. Performance Impact of Page Table Differences

Among the identified differences in x86 and ARM page table mechanisms, those relevant to Auto-PTM implementation mainly involve different page table levels, while base address register differences and support for page table-related para-virtualization interfaces mainly affect page table replication. Here is a concise performance analysis:
Base address registers: Registers such as TTBR0 and CR3 are internal CPU registers, faster than the L1 cache, with an access latency of just one clock cycle.
Para-virtualization interfaces: Registered early in kernel initialization (before MMU activation), they replace native page table operation function pointers. Para-virtualization-based page table operations incur multiple user-kernel-virtualization layer context switches, consuming CPU cycles and reducing memory efficiency. Our ARM implementation, directly in the memory subsystem’s page table management unit, bypasses this issue.
Page table levels: 5-level page tables manage larger memory spaces but increase TLB miss rates and page table traversals, exacerbating the NUMA effect compared to 4-level page tables.

5. Experimental Setup

Hardware Configuration: We utilize a two-socket HiSilicon Kunpeng 920-4862 processor with 48 cores and 32 GB of memory per node [27]. The system is configured with four NUMA nodes. The two-level TLB system consists of separate L1 instruction and data TLBs and a unified L2 TLB: a 48-entry fully associative L1 I-TLB with variable page size support, a 32-entry fully associative L1 D-TLB with variable page size support, and a 4-way set-associative unified L2 TLB with 1052 entries per core. We also employ an x86 server in our experiments. That server has two NUMA nodes, each with 20 cores and 64 GB of memory.
Software Configuration: Our work is implemented on Linux with kernel version 5.7.1. The operating system is Ubuntu 18.04.
Target Workload: Memory-intensive workloads, especially when running for long periods of time, can strain TLB coverage and cause a large number of TLB misses, leading to more remote page table accesses [28,29]. This kind of workload is the target of our work [30,31,32,33,34]. As a comparison, we also evaluate HPC benchmarks such as CG and MG from the NAS benchmark suite [35]. The workloads used in this article are listed in Table 2.

6. Results and Analysis

In this section, we evaluate our work on an ARM NUMA server using the memory-intensive workloads shown in Table 2. The evaluation covers two aspects: (1) how PTL-aware Auto-PTM performs on the ARM machine compared to Full-PTM and Auto-PTM, and (2) a comparative experiment with the same workloads and configurations on x86 servers. Since Auto-PTM is implemented on the basis of AutoNUMA, AutoNUMA is enabled in every tested configuration. Each workload is executed three times, and the average value is reported for each configuration.

6.1. Performance with PTL-Aware Auto-PTM on ARM64 NUMA Server

First, we implement PTL-aware Auto-PTM on Linux ARM64. Then we evaluate the design in different scenarios. The results are shown in Figure 9, and the configurations are shown in Table 3. LD indicates the local data configuration, meaning that processes and data are on the same node. RP indicates the remote page table configuration, which simulates a situation where processes and data have been migrated to another node while the page table remains on the original node and is therefore remote. LP indicates the local page table configuration, which simulates a situation in which the page table is placed in immediate proximity to processes and data to ensure the best locality. We run a Stream benchmark to contend for the memory controller as interference, simulating a realistic situation [24]. We use A and B to represent Node-0 and Node-3 of the ARM64 NUMA server shown in Figure 1.
Figure 9a shows the performance of workloads in two sets of configurations when the interference program and the workload processes, along with the data they access, are on local Node-0 and the page table is on remote Node-3 (RPLDLI) or local Node-0 (LPLDLI). Each set of configurations contains three cases: the default Linux configuration, Auto-PTM, and PTL-aware Auto-PTM. Several interesting findings can be made here. First, in the case of LPLDLI, the performance of workloads is lower than that of RPLDLI. This indicates that when there is interference on the local node, keeping the page table on the local node reduces workload performance.
Second, when the page table is at the remote node and the interference is at the local node (RPLDLI), enabling Auto-PTM will reduce the performance of the six workloads because Auto-PTM only cares about the locality of the page table and data and will migrate the remote page table to the local node. In this way, it causes the workload to suffer greater latency when accessing the page table, thus reducing the performance of the workload. In contrast, enabling PTL-aware Auto-PTM can improve the performance of all workloads. We can see that GUPS has an improvement of 3.18 times. This indicates that PTL-aware Auto-PTM has successfully identified that local PTL on Node-0 is higher than PTL on remote Node-3 and has migrated the page table to remote Node-1, which has the lowest PTL among Node-1, Node-2, and Node-3.
Third, in the case where the page table is at the local node in the presence of interference (LPLDLI), enabling Auto-PTM will lead to a reduction in the performance of all workloads. In this configuration, the page table, the data, and the processes that access them are located on the same node. Consequently, Auto-PTM will not migrate the page table. However, Auto-PTM will periodically scan the page table. This will disrupt the normal operation of the program and also increase the contention for the memory controller of the local node, as scanning the page table is also a form of memory access behavior. Therefore, the overhead resulting from the periodic scanning of the page table diminishes the performance of the workloads.
Last but not least, we find that enabling PTL-aware Auto-PTM can still improve the performance of all workloads in the case of LPLDLI. Through PTL awareness, it finds the node with the lowest PTL relative to the node currently holding the page table (Node-1 in this test) and migrates the page table there. In this way, it reduces the PTL of the workload and improves performance.
Figure 9b shows the performance of workloads in two sets of configurations where the page table is located on Node-3: the workload processes and the data they access are on local Node-0 and the interference program is on remote Node-3 (RPLDRI), or interference runs on both local Node-0 and remote Node-3 (RPLDRILI). Again, four findings can be made. First, compared to RPLDRI, workloads exhibit the same performance improvement when Auto-PTM is enabled and when PTL-aware Auto-PTM is enabled. This is because both migrate the page table to the local node: since the interference is imposed on the remote node, the local node indeed has the lowest PTL.
Second, in the case of RPLDRILI, the performance of workloads is lower than that of RPLDRI. This is due to the presence of interference programs on both the local and remote nodes that contend for the memory controller, thereby increasing the PTL of their respective nodes. Third, compared to RPLDRILI, all workloads except BTree experience a performance improvement when Auto-PTM is enabled. This is because when accessing the remote page table in the case of RPLDRILI, the latency is the sum of the latency on the wires and the PTL of the remote node. In the case of RPLDRILI-m, since the page table is migrated back to the local node, the latency is only the PTL of the local node. Finally, in the case of RPLDRILI, workloads will achieve the highest performance improvement when PTL-aware Auto-PTM is enabled. This is because there is interference on both Node-0 and Node-3, and as a result, the page table will be migrated to Node-1, which has a relatively low PTL.
Figure 9c activates THP on the basis of Figure 9a and employs a page size of 2 MB. We find that, compared to the scenario using a 4 KB page size, the performance enhancement achieved by workloads when PTL-aware Auto-PTM is enabled becomes relatively smaller. This is expected, as the same application data sizes are used for both 4 KB and 2 MB pages, and the TLB misses for 2 MB pages are significantly fewer than those for 4 KB pages. Regardless of whether 2 MB or 4 KB pages are used, PTL-aware Auto-PTM demonstrates its crucial advantage: it still surpasses Auto-PTM with large pages.
Figure 9d illustrates the configuration sets of RPLDRI, which represents the optimal scenario for the implementation of Full-PTM. In comparison with RPLDRI-m and RPLDRI-w, RPLDRI-f demonstrates an equivalent level of performance enhancement. However, its limitations include its lack of automation and a significant memory occupancy overhead. Additionally, it is not cognizant of PTL and exhibits reduced effectiveness in managing scenarios involving interference on the local node.

6.2. Memcached with PTL-Aware Auto-PTM on ARM64

In this section, we evaluate the performance of different migration mechanisms when process migration actually occurs. Figure 10 shows the performance of a thin Memcached instance before, during, and after migration from Node-0 to Node-3, with the interference program Stream running on Node-3. The default configuration means that no page table migration mechanism is used after the migration. We observed that when Memcached was migrated to Node-3, its performance decreased. As AutoNUMA migrated the data to Node-3 as well, the performance improved; however, due to the interference program on Node-3 contending for the memory controller, the performance could not be restored to its previous level. Enabling Auto-PTM results in the migration of the page table to Node-3, which leads to lower program performance than the default configuration. PTL-aware Auto-PTM, by contrast, senses the high latency on Node-3 and does not migrate the page table, so its performance stays close to that of the default configuration.
Figure 11 shows the performance of a thin Memcached instance before, during, and after migration from Node-0 to Node-3, with the interference program Stream running on both Node-0 and Node-3. We observed that after the migration occurred, enabling PTL-aware Auto-PTM resulted in higher performance than both the default configuration and Auto-PTM. This is because it sensed the high latency on Node-0 and Node-3 and migrated the page table to another node with lower latency.

6.3. Evaluation of Compute-Intensive Workloads

We also evaluated, as a comparison, HPC benchmarks such as CG and MG from the NAS benchmark suite. As shown in Figure 12, when there are local interference programs, neither enabling Auto-PTM (-m) nor PTL-aware Auto-PTM (-w) has an optimization effect on CG and MG, regardless of whether the page tables are located on remote or local nodes.
The reason is that page table migration optimization specifically targets memory-intensive workloads that suffer severe TLB misses. Such workloads are significantly affected by the page table NUMA effect, and applying PTM optimization can effectively mitigate it. However, the CG and MG programs in our experiments are compute-intensive workloads, which explains why they did not benefit from the optimization. We also observed that enabling PTL-aware Auto-PTM (-w) led to a performance loss of up to 1.4% for CG and 2.4% for MG. This is because probing the PTL consumes CPU resources, which in turn impacts the execution of compute-intensive workloads.

6.4. Performance with PTL-Aware Auto-PTM on x86-64 NUMA Server

In this section, we conducted a comparative experiment with the same workloads and configurations used in Section 6.1 on an x86-64 NUMA server. We use A and B to represent Node-0 and Node-1 of the x86-64 NUMA server with two NUMA nodes. Figure 13 shows the results. As illustrated, PTL-aware Auto-PTM consistently achieves the same or better performance for the tested applications with 4 KB and 2 MB pages. This indicates that PTL-aware Auto-PTM also works well on x86-64 NUMA servers, making it a general technique for mitigating the page table-caused NUMA effect.

Why Does the Scanning Frequency Have Little Impact on the x86-64 NUMA Server?

From Figure 9a we can see that, compared to LPLDLI, workloads with the LPLDLI-m configuration experienced performance degradation. With the LPLDLI-m configuration, both the page table and the data are located locally, and the Auto-PTM scan detects this condition without migrating any page tables. We found that the periodic scanning of the workload page tables by Auto-PTM disturbs normal program execution. Moreover, scanning page tables is itself a memory access operation, so the slowdown becomes more severe when Stream is continuously contending for the memory controller on the local NUMA node. To address this, we improved performance by reducing the scanning frequency.
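One plausible way to realize such a frequency reduction is to back off the scan period whenever a scan finds nothing to migrate. The sketch below illustrates this idea in user space under our own assumptions; scan_page_tables() is a hypothetical stand-in for the real kernel-side scan, and the backoff policy is illustrative rather than the exact policy of our implementation.

```c
/* Illustrative sketch only: an adaptive scan period that backs off
 * when scanning finds nothing to migrate, so that scans stop
 * competing with the workload for the memory controller (the
 * LPLDLI-m case). scan_page_tables() is a hypothetical stand-in. */
#include <unistd.h>

#define SCAN_PERIOD_MIN_S  1U
#define SCAN_PERIOD_MAX_S 64U

/* Stub: the real scan would walk the process page tables and return
 * how many page tables it queued for migration. */
static int scan_page_tables(void) { return 0; }

int main(void)
{
    unsigned int period_s = SCAN_PERIOD_MIN_S;

    for (;;) {
        sleep(period_s);
        if (scan_page_tables() > 0)
            period_s = SCAN_PERIOD_MIN_S;   /* work found: stay responsive */
        else if (period_s < SCAN_PERIOD_MAX_S)
            period_s *= 2;                  /* idle: scan half as often */
    }
}
```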
However, on the x86-64 NUMA server, our experimental results showed that the periodic scanning of workload page tables by Auto-PTM has little impact on program performance under LPLDLI. To analyze this, we measured memory bandwidth and memory latency on both the ARM64 and x86-64 NUMA servers.
First, we conducted bandwidth tests using the program bw_mem from lmbench on the ARM and x86 NUMA servers we used. Local bandwidth refers to the bandwidth of accessing memory on the local node with bw_mem, while remote bandwidth refers to the bandwidth of accessing memory on a remote node; accessing a remote node requires crossing sockets. On the x86-64 server, the local node is Node-0 and the remote node is Node-1. On the ARM server, the local node is Node-0 and the remote node is Node-3, since Node-0 and Node-1 reside on the same socket. LI indicates that the interference program Stream runs on the local node, whereas RI indicates that it runs on the remote node; the number of Stream threads is set to the number of cores per node. The experimental results in Figure 14 show that the Kunpeng 920 CPU has lower bandwidth than the x86-64 server we used.
We also performed a latency test. We ran lat_mem_rd on Node-0 and measured the latency of reading memory from Node-0 to Node-0 (local access) and from Node-0 to Node-1 (remote access). We read a 64 MB memory block, exceeding the server's LLC size, to ensure the accesses reach memory, while simultaneously running Stream on Node-0 as interference; increasing the number of Stream threads increases the interference pressure. Because the x86-64 server has 20 cores per node while the ARM server has 48, the x86-64 server was tested with at most 20 Stream threads. The experimental results show that the Kunpeng 920 CPU has higher latency than the x86-64 CPU.
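For readers who want to reproduce this methodology, the following user-space sketch approximates what lat_mem_rd does: it pins execution to one node, places a 64 MB buffer on another, and times a dependent pointer chase so every load must wait for the previous one. This is our own illustration built on libnuma (compile with -lnuma), not the lmbench source; the buffer size, stride, and node numbers are assumptions to adjust per machine.

```c
/* Approximate lat_mem_rd: time a dependent pointer chase through a
 * buffer placed on a chosen NUMA node. A sketch, not lmbench. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (64UL << 20)        /* 64 MB, larger than the LLC */
#define STRIDE   64                  /* one cache line per entry */
#define N        (BUF_SIZE / STRIDE)

static double probe(int cpu_node, int mem_node)
{
    numa_run_on_node(cpu_node);                 /* pin the prober */
    char *buf = numa_alloc_onnode(BUF_SIZE, mem_node);
    if (!buf) return -1.0;

    /* Build a random cyclic pointer chain, one pointer per line,
     * so hardware prefetchers cannot hide the latency. */
    size_t *order = malloc(N * sizeof *order);
    for (size_t i = 0; i < N; i++) order[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % (i + 1), t = order[i];
        order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        *(void **)(buf + order[i] * STRIDE) =
            buf + order[(i + 1) % N] * STRIDE;

    struct timespec t0, t1;
    void * volatile p = buf + order[0] * STRIDE;  /* volatile: keep loads */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) p = *(void **)p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(order);
    numa_free(buf, BUF_SIZE);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / N;                   /* average load latency in ns */
}

int main(void)
{
    if (numa_available() < 0) return 1;
    printf("local:  %.1f ns\n", probe(0, 0));
    printf("remote: %.1f ns\n", probe(0, 1));  /* Node-3 on our ARM box */
    return 0;
}
```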
The optimization effect of PTL-awareness is related to the intensity of interference. When interference causes lower memory bandwidth and higher latency, migrating page tables to nodes with higher bandwidth and lower latency can yield better optimization results for memory-intensive workloads with a high rate of TLB misses.
Based on the bandwidth and latency results, the ARM64 server has relatively lower bandwidth and higher latency, and it is therefore more significantly impacted by the memory controller contention caused by Stream. The x86-64 server, with its higher bandwidth and lower latency, is less affected by such contention, so the optimization effect of PTL-awareness is less pronounced there. Nevertheless, PTL-aware Auto-PTM can still help workloads on the x86-64 platform that suffer from local memory controller contention.

6.5. Overhead

6.5.1. Memory Overhead

Our work is an Auto-PTM mechanism and maintains only a single copy of the page tables. Full-PTM, by contrast, achieves migration through replication and must therefore maintain a set of page table replicas on every node. When our mechanism scans the page tables, it creates temporary arrays to record the number of page tables or data pages on each node; these arrays are reclaimed as soon as they are no longer in use and are never held for long.
Our work uses lat_mem_rd to measure the PTL of the nodes. To exceed the size of the L3 cache, each measurement reads 64 MB of memory; this length can be adjusted to the L3 cache size of different servers. A round of tests is performed every 10 s, and each round requires n² measurements, where n is the number of nodes. For example, with 4 nodes, each round requires 16 tests and touches 1 GB of memory in total. The measured memory is not held for long and is released as soon as each test completes.
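As a quick sanity check of this accounting (a trivial calculation, assuming one 64 MB buffer per source/destination node pair):

```c
/* Back-of-the-envelope cost of one probing round: n^2 probes of
 * 64 MB each. With n = 4 this reproduces the 16 tests / 1 GB
 * figure quoted above. */
#include <stdio.h>

int main(void)
{
    const unsigned long buf_mb = 64;
    for (int n = 2; n <= 8; n *= 2) {
        unsigned long tests = (unsigned long)n * n;   /* n^2 probes */
        printf("%d nodes: %3lu tests, %5lu MB touched per round\n",
               n, tests, tests * buf_mb);
    }
    return 0;
}
```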

6.5.2. Runtime Overhead

The runtime overhead of PTL-aware Auto-PTM can be divided into three parts: scanning overhead, migration overhead, and PTL measurement overhead. In the scenario of Figure 8, no migration and no PTL measurement occur, so the measured overhead is caused entirely by the scanning operation. With the reduced scanning frequency, the scanning overhead amounts to only 0.1–1.2% of the program's running time.
The migration overhead only occurs during page table migration. Therefore, for thin workloads, there is no runtime migration overhead before or after migration, or when they are never migrated across sockets. The measurement of PTL consumes CPU and memory resources. As can be seen in Figure 15, even when page table migration and PTL measurement occur, the running time of GUPS is still optimized under the LPLDLI-w configuration. This shows that, in memory-intensive workloads with a high TLB miss rate, the benefits outweigh the overhead of migration and PTL measurement.
For compute-intensive workloads, as shown in Figure 12, page table migration and PTL measurement bring a runtime overhead of up to 1.4% for CG and 2.4% for MG. In summary, PTL-aware Auto-PTM is suitable for memory-intensive workloads that suffer from a higher TLB miss rate.

6.6. Discussion

6.6.1. Scalability

Our solution can be extended to systems with more than 4 NUMA nodes. As we scale to systems with more nodes, operations like using arrays to count the distribution of next-level page tables or data across NUMA nodes, and detecting the latency between nodes, will require additional consideration.
We employ arrays as temporary variables to count the distribution of next-level page tables or data across NUMA nodes. Given that the maximum number of NUMA nodes is theoretically 16, this approach remains scalable. After counting the distribution for each page table, we discard the array; because the operation is independent for each page table, the overhead of creating and discarding these arrays is well contained. As the number of nodes grows, the array-based counting scales with it, allowing the mechanism to adapt to an increasing number of nodes. PTL probing is likewise an essential part of our page table migration process; as the number of nodes increases, the PTL measurements automatically cover the larger node set and conduct the corresponding measurements.
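A user-space analogue of this array-based counting, under our own assumptions, can be built on the Linux move_pages() system call, which with a NULL node list merely reports the node each page currently resides on; the buffer size and MAX_NODES bound below are illustrative (compile with -lnuma).

```c
/* Query which node each page of a buffer lives on and tally the
 * distribution in a fixed array sized for the maximum node count,
 * analogous to our in-kernel counting of next-level page tables. */
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_NODES 16   /* matches the theoretical maximum noted above */

int main(void)
{
    if (numa_available() < 0) return 1;

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = 1024;
    char *buf = numa_alloc_onnode(npages * page, 0);
    for (size_t i = 0; i < npages; i++) buf[i * page] = 1; /* fault in */

    void **pages = malloc(npages * sizeof *pages);
    int *status = malloc(npages * sizeof *status);
    for (size_t i = 0; i < npages; i++) pages[i] = buf + i * page;

    long hist[MAX_NODES] = {0};
    /* nodes == NULL: do not move anything, just report placement. */
    if (move_pages(0, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            if (status[i] >= 0 && status[i] < MAX_NODES)
                hist[status[i]]++;

    for (int n = 0; n < MAX_NODES; n++)
        if (hist[n])
            printf("node %d: %ld pages\n", n, hist[n]);

    numa_free(buf, npages * page);
    free(pages); free(status);
    return 0;
}
```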
In this paper, although our validation was restricted to systems with 2 and 4 NUMA nodes, our codebase inherently supports the expansion to systems with more than 4 NUMA nodes. We intend to conduct comprehensive validation on such multi-node systems in our future work.

6.6.2. Applicability to Virtualized Systems?

In virtualized systems, hardware-based nested paging uses a two-step page table translation process: Firstly, the translation from the guest virtual address (gVA) to the guest physical address (gPA) is achieved through a per-process guest operating system page table (gPT). Secondly, the conversion from the guest physical address (gPA) to the host physical address (hPA) is carried out via a per-virtual machine extended page table (ePT).
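For background, the cost of this two-dimensional translation is a standard result for hardware nested paging (not specific to our mechanism): with an n-level gPT and an m-level ePT, each of the n guest page-table accesses itself requires an m-step ePT walk, and the final guest physical address must be translated as well, giving a worst-case walk cost of

C_walk = n·m + n + m

memory references. For two 4-level tables this is 4·4 + 4 + 4 = 24 references, versus 4 for native 4-level paging, which is why the NUMA placement of both gPT and ePT pages matters.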
Although our work can apply PTL-aware migration to the gPT, we have not yet implemented it for the ePT. Specifically, gPT migration is triggered by AutoNUMA scanning operations, which actively scan the gPT to detect migration opportunities. ePT migration, in contrast, must be triggered by ePT update operations: changes to the ePT itself initiate the scanning and subsequent migration. The two approaches differ significantly in their triggering mechanisms and operational logic; they not only pose technical challenges but also require more comprehensive and in-depth experiments for validation. We are committed to solving this issue in future work, including exploring more efficient trigger and migration strategies for both gPT and ePT, and conducting extensive experiments to ensure the effectiveness and reliability of our methods across scenarios.

7. Conclusions and Future Work

In this work, we introduced a PTL-aware Auto-PTM mechanism to improve the efficiency of page table migration and to ensure its applicability across architectures, specifically x86-64 and ARM64. The experimental results show that, compared to Auto-PTM, it improves workload performance in various scenarios: for example, GUPS by 3.53x, XSBench by 1.77x, and Hashjoin by 1.68x. We also evaluated the compute-intensive workloads CG and MG and found that our mechanism has no positive effect on them. Our mechanism is therefore best suited to PTL-sensitive applications such as memory-intensive workloads, which strain TLB coverage, incur large numbers of TLB misses, and consequently perform many remote page table accesses.
It is essential to recognize several limitations of our current work. Firstly, our approach is specifically tailored for physical machines and does not directly translate to virtualized environments. In virtualized setups, the management of guest page tables within the guest operating system and of extended page tables (EPT) in KVM introduces an additional level of complexity. As a result, our work is not yet applicable to multi-tenant scenarios, where multiple users or applications share virtualized resources. Moreover, the effectiveness of our approach under mixed interference patterns, which are common in complex virtualized environments, remains unexplored and requires further investigation.
Another limitation is the overhead associated with PTL detection. Detecting PTL consumes both memory and CPU resources, and this overhead grows quadratically with the number of nodes in the system, since one probe is needed per node pair. Addressing these limitations and exploring solutions for virtualized environments, multi-tenant scenarios, mixed interference patterns, and the scalable detection of PTL will be the focal points of our future research, with the ultimate goal of enhancing overall system performance across a broader range of applications.
In summary, this paper proposes a PTL-aware Auto-PTM mechanism to enhance page table migration efficiency across x86-64 and ARM64 architectures. Experiments demonstrate that it improves the performance of various workloads compared to Auto-PTM, though it has no effect on compute-intensive workloads, making it suitable for memory-intensive workloads. However, the mechanism is limited to physical machines, not applicable to virtualized or multi-tenant scenarios, and its performance in mixed interference patterns is untested. Moreover, PTL detection causes substantial resource overhead that increases with more nodes. Addressing these limitations will be the focus of future research for broader application.

Author Contributions

Conceptualization, H.Q.; methodology, H.Q.; software, H.Q.; validation, H.Q.; investigation, H.Q.; writing—original draft preparation, H.Q.; writing—review and editing, H.Q. and P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Shenzhen Science and Technology Program (JCYJ20220818101607015).

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

Author Peng Wang was employed by the company Inspur Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NUMA: Non-Uniform Memory Access
PTR: Page Table Replication
PTM: Page Table Migration
PTL: Page Table Access Latency
WASP: Workload-Aware Self-Replicating Page Tables
VMA: Virtual Memory Areas
MMU: Memory Management Unit
TLB: Translation Lookaside Buffer
TTBR: Translation Table Base Register
PGD: Page Global Directory
PUD: Page Upper Directory
PMD: Page Middle Directory
PTE: Page Table Entry

References

  1. Demir, Y.; Pan, Y.; Song, S.; Hardavellas, N.; Kim, J.; Memik, G. Galaxy: A High-performance Energy-efficient Multi-chip Architecture Using Photonic Interconnects. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS ’14), Munich, Germany, 10–13 June 2014. [Google Scholar] [CrossRef]
  2. Iyer, S. Heterogeneous Integration for Performance and Scaling. IEEE Trans. Components Packag. Manuf. Technol. 2016, 6, 973–982. [Google Scholar] [CrossRef]
  3. Kannan, A.; Jerger, N.E.; Loh, G.H. Enabling interposer-based disintegration of multi-core processors. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), Waikiki, HI, USA, 5–9 December 2015. [Google Scholar] [CrossRef]
  4. Yin, J.; Lin, Z.; Kayiran, O.; Poremba, M.; Altaf, M.S.B.; Jerger, N.E.; Loh, G.H. Modular Routing Design for Chiplet Based Systems. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA ’18), Los Angeles, CA, USA, 1–6 June 2018. [Google Scholar] [CrossRef]
  5. Calciu, I.; Sen, S.; Balakrishnan, M.; Aguilera, M.K. Black-box Concurrent Data Structures for NUMA Architectures. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17), Xi’an, China, 8–12 April 2017. [Google Scholar] [CrossRef]
  6. Dashti, M.; Fedorova, A.; Funston, J.; Gaud, F.; Lachaize, R.; Lepers, B.; Quema, V.; Roth, M. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13), Houston, TX, USA, 16–20 March 2013. [Google Scholar] [CrossRef]
  7. Kaestle, S.; Achermann, R.; Roscoe, T.; Harris, T. Shoal: Smart Allocation and Replication of Memory for Parallel Programs. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’15), Santa Clara, CA, USA, 8–10 July 2015. [Google Scholar]
  8. The Linux Kernel Documentation. NUMA Memory Policy. Available online: https://www.kernel.org/doc/html/v5.7/admin-guide/mm/numa_memory_policy.html (accessed on 22 December 2024).
  9. Corbet, J. AutoNUMA: The Other Approach to NUMA Scheduling. Available online: https://lwn.net/Articles/488709/ (accessed on 22 December 2024).
  10. Achermann, R.; Panwar, A.; Bhattacharjee, A.; Roscoe, T.; Gandhi, J. Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines. In Proceedings of the Twenty-fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’20), Lausanne, Switzerland, 16–20 March 2020. [Google Scholar] [CrossRef]
  11. Achermann, R.; Panwar, A.; Bhattacharjee, A.; Roscoe, T.; Gandhi, J.; Gopinath, K. Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis. In Proceedings of the Twenty-Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), Virtual, 19–23 April 2021. [Google Scholar] [CrossRef]
  12. Qu, H.; Yu, Z. WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’24), La Jolla, CA, USA, 27 April–1 May 2024. [Google Scholar] [CrossRef]
  13. Linux Man Page. numactl. Available online: https://linux.die.net/man/8/numactl (accessed on 1 October 2024).
  14. Lachaize, R.; Lepers, B.; Quéma, V. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, Boston, MA, USA, 13–15 June 2012. [Google Scholar]
  15. Gaud, F.; Fedorova, A.; Funston, J.; Decouchant, J.; Lepers, B.; Quema, V. Large Pages May Be Harmful on NUMA Systems. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, Philadelphia, PA, USA, 19–20 June 2014; Available online: https://api.semanticscholar.org/CorpusID:17793052 (accessed on 22 December 2024).
  16. Denoyelle, N.; Goglin, B.; Jeannot, E.; Ropars, T. Data and Thread Placement in NUMA Architectures: A Statistical Learning Approach. In Proceedings of the 48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, 5–8 August 2019. [Google Scholar] [CrossRef]
  17. Barrera, I.S.; Black-Schaffer, D.; Casas, M.; Moretó, M.; Stupnikova, A.; Popov, M. Modeling and Optimizing NUMA Effects and Prefetching with Machine Learning. In Proceedings of the 34th ACM International Conference on Supercomputing (ICS ’20), Barcelona, Spain, 29 June–2 July 2020. [Google Scholar] [CrossRef]
  18. Gao, B.; Kang, Q.; Tee, H.W.; Chu, K.T.N.; Sanaee, A.; Jevdjic, D. Scalable and effective page-table and TLB management on NUMA systems. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, Santa Clara, CA, USA, 10–12 July 2024. [Google Scholar]
  19. Arm Holdings. ARMv8-A Address Translation. Available online: https://developer.arm.com/documentation/100940/0101/?lang=en (accessed on 22 December 2024).
  20. Pham, B.; Vaidyanathan, V.; Jaleel, A.; Bhattacharjee, A. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45), Vancouver, BC, Canada, 1–5 December 2012. [Google Scholar] [CrossRef]
  21. Basu, A.; Gandhi, J.; Chang, J.; Hill, M.D.; Swift, M.M. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA ’13), Tel-Aviv, Israel, 23–27 June 2013. [Google Scholar] [CrossRef]
  22. Karakostas, V.; Gandhi, J.; Ayar, F.; Cristal, A.; Hill, M.D.; McKinley, K.S.; Nemirovsky, M.; Swift, M.M.; Ünsal, O. Redundant Memory Mappings for Fast Access to Large Memories. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15), Portland, OR, USA, 13–17 June 2015. [Google Scholar] [CrossRef]
  23. Pham, B.; Bhattacharjee, A.; Eckert, Y.; Loh, G.H. Increasing TLB reach by exploiting clustering in page translations. In Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture (HPCA ’14), Orlando, FL, USA, 15–19 February 2014. [Google Scholar] [CrossRef]
  24. McCalpin, J.D. The STREAM Benchmark. Available online: https://github.com/jeffhammond/STREAM (accessed on 22 December 2024).
  25. Kernel.org. Paravirt_ops. Available online: https://www.kernel.org/doc/html/latest/virt/paravirt_ops.html (accessed on 22 December 2024).
  26. Barham, P.; Dragovic, B.; Fraser, K.; Hand, S.; Harris, T.; Ho, A.; Neugebauer, R.; Pratt, I.; Warfield, A. Xen and the Art of Virtualization. ACM SIGOPS Oper. Syst. Rev. 2003, 37, 164–177. [Google Scholar] [CrossRef]
  27. HUAWEI. The Taishan-Server. Available online: https://e.huawei.com/cn/products/computing/kunpeng/taishan (accessed on 22 December 2024).
  28. Gandhi, J.; Hill, M.D.; Swift, M.M. Agile Paging: Exceeding the Best of Nested and Shadow Paging. In Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, Republic of Korea, 18–22 June 2016. [Google Scholar] [CrossRef]
  29. Kwon, Y.; Yu, H.; Peter, S.; Rossbach, C.J.; Witchel, E. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016. [Google Scholar]
  30. Princeton. PARSEC Benchmark Suite. Available online: https://dl.acm.org/doi/10.1145/3053277.3053279 (accessed on 15 January 2025).
  31. HPC Challenge. RandomAccess: GUPS (Giga Updates Per Second). Available online: https://icl.utk.edu/projectsfiles/hpcc/RandomAccess/ (accessed on 15 January 2025).
  32. Redis Labs. Redis. Available online: https://redis.io (accessed on 15 January 2025).
  33. Tramm, J.R.; Siegel, A.R.; Islam, T.; Schulz, M. XSBench-The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In Proceedings of PHYSOR 2014-The Role of Reactor Physics toward a Sustainable Future, Kyoto, Japan, 28 September–3 October 2014. [Google Scholar]
  34. Ang, J.; Barrett, K.W.B.; Murphy, R. Introducing the Graph500. Available online: https://graph500.org/ (accessed on 15 January 2025).
  35. Bailey, D. NAS Parallel Benchmarks. Available online: https://www.nas.nasa.gov/software/npb.html (accessed on 15 January 2025).
Figure 1. NUMA architecture on modern servers.
Figure 2. Address translation for a 4 KB page on ARMv8. The Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD), and Page Table Entry (PTE) each denote one level of the page tables. The Translation Table Base Register (TTBR) stores the base address of the PGD.
Figure 3. Full-PTM and Auto-PTM mechanism.
Figure 4. The schematic diagram of the configurations. L denotes local. R denotes remote. P denotes page table and D denotes data. I denotes interference.
Figure 5. Performance comparison when Stream is running at the local node as interference in different cases for thin workloads in a bare-metal machine. The vertical axis represents the normalized runtime. Runtime (in seconds) for the base case (LPLD) is in brackets. The specific configuration is shown in Figure 4.
Figure 6. NUMA congestion on a NUMA server with four nodes.
Figure 7. PTL-aware page table auto-migration mechanism. The green block represents the PTL-aware process, and the blue block represents the Auto-PTM process.
Figure 8. The overhead brought by Auto-PTM when there are interfering programs locally. -m indicates enabling Auto-PTM. Since both the page tables and the data are placed locally, no page table migration operation occurs. The numbers in the figure represent the percentage increase in the running time relative to that of LPLDLI.
Figure 9. The performance comparison of different migration mechanisms on the ARM NUMA server. (a–c) show the performance comparison of workloads using the Auto-PTM (-m) and PTL-aware Auto-PTM (-w) mechanisms under different configurations. (c) represents the configuration of (a) with a 2 MB page size while Transparent Huge Pages (THP) are enabled. (d) shows the performance of workloads using the full migration mechanism (-f) in the scenarios of remote page tables, local data, and remote interference. The configurations are listed in Table 3. The vertical axis represents the normalized runtime. Runtimes (in seconds) for the base cases are in brackets. Numbers at the top show the speedup of PTL-aware Auto-PTM (-w) over the corresponding Auto-PTM (-m).
Figure 10. Throughput of a thin Memcached instance before, during, and after migration from Node-0 to Node-3. An interference program Stream is running on Node-3.
Figure 11. Throughput of a thin Memcached instance before, during, and after migration from Node-0 to Node-3. An interference program Stream is running on both Node-0 and Node-3.
Figure 12. The performance comparison of CG and MG using the Auto-PTM (-m) and PTL-aware Auto-PTM (-w) mechanisms on an ARM NUMA server when the interfering programs are located on the local node. The numbers in the figure represent the percentage increase in the running time relative to that of RPLDLI and LPLDLI.
Figure 13. The performance comparison of different migration mechanisms on the x86 NUMA server. (a–c) show the performance comparison of workloads using the Auto-PTM (-m) and PTL-aware Auto-PTM (-w) mechanisms under different configurations. (c) represents the configuration of (a) with a 2 MB page size while THP are enabled. (d) shows the performance of workloads using the full migration mechanism (-f) in the scenarios of remote page tables, local data, and remote interference. The configurations are listed in Table 3. The vertical axis represents the normalized runtime. Runtimes (in seconds) for the base cases are in brackets. Numbers at the top show the speedup of PTL-aware Auto-PTM (-w) over the corresponding Auto-PTM (-m).
Figure 14. Comparison of memory access bandwidth and latency on the ARM and x86 NUMA servers we used. For (a), LB denotes local bandwidth, RB denotes remote bandwidth. LI denotes local interference, RI denotes remote interference. For (b), the interference program Stream is run on the local node. The x-axis represents the number of threads used by Stream, where 0 denotes no interference.
Figure 15. Analysis of performance events of GUPS when using PTL-aware Auto-PTM. We use perf to obtain hardware performance counter values.
Table 1. Page-table implementation differences between the Linux kernels for x86 and ARM64.

Difference                                     | x86 | ARM64
Page table related para-virtualization support | Yes | No
Page table base address register               | CR3 | TTBR0, TTBR1
Page table levels supported                    | 5   | 4
Table 2. Experimented workloads in this paper.

Workload  | Description                                                   | Memory
Canneal   | using simulated annealing to minimize the routing cost [30]. | 10 GB
GUPS      | measuring the random memory access performance [31].         | 16 GB
Hashjoin  | hash-table probing in databases.                              | 10 GB
Redis     | an in-memory data store [32].                                 | 13 GB
XSBench   | Monte Carlo neutronics [33].                                  | 20 GB
BTree     | index lookups used in databases.                              | 22 GB
Memcached | an event-based key/value store.                               | 14 GB
cg        | Conjugate Gradient [35].                                      | 1 GB
mg        | Multi-Grid on meshes [35].                                    | 3 GB
Table 3. Experimented configurations in this paper. R denotes remote, and L denotes local. P denotes page table, and D denotes data page. I denotes interference program Stream. -m denotes the Auto-PTM mechanism, and -w denotes the PTL-aware Auto-PTM mechanism. When Transparent Huge Pages (THP) in Linux are enabled, the configuration is prefixed with (T). A and B denote two different NUMA nodes. For example, RPLDLI means the page table is on remote NUMA node B, and the data page is on local node A with the process that accesses them; in addition, Stream is running on local node A as interference to contend for the memory controller.

Configuration | Workload | Data | Page Table | I   | m      | w      | f
RPLDLI        | A        | A    | B          | A   | -      | -      | -
RPLDLI-m      | A        | A    | B          | A   | enable | -      | -
RPLDLI-w      | A        | A    | B          | A   | enable | enable | -
LPLDLI        | A        | A    | A          | A   | -      | -      | -
LPLDLI-m      | A        | A    | A          | A   | enable | -      | -
LPLDLI-w      | A        | A    | A          | A   | enable | enable | -
RPLDRI        | A        | A    | B          | B   | -      | -      | -
RPLDRI-m      | A        | A    | B          | B   | enable | -      | -
RPLDRI-w      | A        | A    | B          | B   | enable | enable | -
RPLDRI-f      | A        | A    | B          | B   | -      | -      | enable
RPLDRILI      | A        | A    | B          | A,B | -      | -      | -
RPLDRILI-m    | A        | A    | B          | A,B | enable | -      | -
RPLDRILI-w    | A        | A    | B          | A,B | enable | enable | -