Article

Microarchitectural Malware Detection via Translation Lookaside Buffer (TLB) Events

by Cristian Agredo *, Daniel F. Koranek, Christine M. Schubert Kabban, Jose A. Gutierrez del Arroyo and Scott R. Graham
Air Force Institute of Technology, 2950 Hobson Way, Wright-Patterson AFB, OH 45433, USA
* Author to whom correspondence should be addressed.
J. Cybersecur. Priv. 2025, 5(3), 75; https://doi.org/10.3390/jcp5030075
Submission received: 25 June 2025 / Revised: 2 September 2025 / Accepted: 11 September 2025 / Published: 17 September 2025
(This article belongs to the Special Issue Intrusion/Malware Detection and Prevention in Networks—2nd Edition)

Abstract

Prior work has shown that Translation Lookaside Buffer (TLB) data contains valuable behavioral information. Many existing methodologies rely on timing features or focus solely on workload classification. In this study, we propose a novel approach to malware classification using only TLB-related Hardware Performance Counters (HPCs), explicitly excluding any dependence on timing features such as task execution duration or memory access timing. Our methodology evaluates whether TLB data alone, without any timing information, can effectively distinguish between malicious and benign programs. We test this across three classification scenarios: (1) a binary classification problem distinguishing malicious from benign tasks, (2) a 4-way classification problem designed to improve separability, and (3) a 10-way classification problem with classes of individual benign and malware tasks. Our results demonstrate that even without execution time or memory access timing, TLB events achieve up to 81% accuracy for the binary classification, 72% for the 4-class grouping, and 61% for the 10-class grouping. These findings demonstrate that time-independent TLB patterns can serve as robust behavioral signatures. This work expands the understanding of microarchitectural side effects by demonstrating that TLB-only features, independent of timing-based techniques, can be effectively used for real-world malware detection.

1. Introduction

Electronic devices are now ubiquitous and often equipped with advanced processing units designed to enhance users’ experience [1]. In this study, we continue to focus on CPU security, specifically the TLB, an understudied microarchitectural component.
Previous research has established that cache behavior can be exploited to leak sensitive information [2,3,4,5,6,7,8,9]. Prior work also showed that HPCs can be used to classify tasks, and that microarchitectural data can reveal the nature of a victim process [10]. These studies target microarchitectural components such as caches and branch predictors, which have become more robust due to extensive research into attacks and countermeasures [11,12,13,14], or leverage timing, i.e., memory access time or task execution time. However, an open question remains: can malware be accurately classified using only TLB-related event counters?
To answer this question, we propose a methodology that relies exclusively on TLB event data and is independent of timing techniques [10,15,16]. Although many existing studies leverage a combination of HPCs [17,18,19,20], there is limited research focusing solely on the TLB. This lack of attention represents both a potential vulnerability and an opportunity to advance system security. We show that TLB events alone can serve as indicators of malicious activity or potential vectors for information leakage. Our methodology is demonstrated under three classification scenarios: (1) a binary classification between benign and malicious tasks; (2) a 4-class setup in which tasks are grouped to improve classification performance; and (3) a 10-class setup involving five benign and five malware tasks. Building upon the methodology described in [10,21], we expand the dataset to include both benign and malicious programs. All tasks are executed under varying CPU affinity settings, and the resulting data is analyzed using both statistical learning models and neural networks. Our results demonstrate that the TLB-only methodology achieves up to 81% accuracy in the binary classification, 72% in the 4-class setup, and 61% in the 10-class setup.
The contributions of the paper are as follows:
  • A methodology for integrating both benign and malicious programs into a controlled experimental environment for capturing relevant microarchitectural data.
  • A data collection and analysis process that excludes traditional timing-based techniques, such as memory access latency or task execution duration.
  • A TLB-only approach that applies statistical learning models and neural networks to classify benign and malicious activity, achieving classification accuracies of up to 81% for the binary setup (1), 72% for the 4-class setup (2), and 61% for the 10-class setup (3).
The paper is organized as follows. Section 2 reviews the background and existing literature. Section 3 describes the methodology, including research design, instrumentation, procedures, data collection methods, and limitations. Section 4 presents the results for the system under test, including plots and their interpretation. Section 5 concludes the paper and explores future directions for TLB research.

2. Background

This work expands upon [10], which explored the use of TLB-related events for workload classification. Since a significant portion of the background established in that research is applicable to the present work, we provide a summary here. This includes foundational information on TLB operation, statistical learning models, neural networks, known TLB attacks and defenses, and the use of HPCs for classification. References not included in the prior work are discussed in greater detail. We include a dedicated subsection titled Traditional Time-Based TLB Techniques, which outlines conventional timing-based methods for context. This subsection highlights how we further develop the methodology introduced in [10] by eliminating dependencies on task execution duration and memory access latency, and by applying the approach to malware classification under multiple task groupings and core configurations.

2.1. TLB Operation Overview

The TLB is a cache within the Memory Management Unit (MMU) that stores recent translations from Virtual Address (VA) to Physical Address (PA), thereby avoiding the latency of full page-table walks [1,22]. When a virtual address is requested, the TLB checks for a Virtual Page Number (VPN) match. In the event of a hit, the corresponding Physical Page Number (PPN) is returned quickly. In response to a miss, the MMU walks the page table to retrieve the mapping, potentially triggering a page fault if the page is not resident in memory. A more detailed explanation of the TLB architecture and its interaction with the MMU can be found in our prior work [23].
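For intuition, the following toy model (assuming 4 KiB pages and a fully associative TLB; the parameters are illustrative, not those of our test CPU) shows the VPN/offset split and the hit and miss paths:

    PAGE_SHIFT = 12                      # 4 KiB pages: low 12 bits form the page offset
    tlb = {}                             # toy fully associative TLB: VPN -> PPN
    page_table = {0x7f1a2: 0x3b4c5}      # hypothetical VPN -> PPN mapping

    def translate(va: int) -> int:
        vpn = va >> PAGE_SHIFT
        offset = va & ((1 << PAGE_SHIFT) - 1)
        if vpn in tlb:                   # TLB hit: PPN returned quickly
            ppn = tlb[vpn]
        else:                            # TLB miss: MMU walks the page table
            ppn = page_table[vpn]        # a KeyError here would model a page fault
            tlb[vpn] = ppn               # fill the TLB with the new translation
        return (ppn << PAGE_SHIFT) | offset

    print(hex(translate(0x7f1a2abc)))    # first access: miss, then fill
    print(hex(translate(0x7f1a2def)))    # second access to the same page: hit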

2.2. Traditional Time-Based TLB Techniques

Prior studies often rely on or are augmented by the inclusion of timing-based features like task execution duration [10] or memory access latency [3,15], as these features allow us to distinguish between workloads or detect anomalous behavior. To measure memory access latency, a set of virtual addresses is used to probe the TLB; this operation is repeated, and the access time is recorded. The latency is then used to determine whether a given access results in a TLB hit or miss. Task execution duration is typically measured by inserting timestamp instructions into the collection code, marking both the task boundaries and the counter collection window, to ensure that counters record data only during task execution. However, these timing-based techniques are difficult to control and are often impractical in real-world deployments.
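The probing idea can be sketched as follows; note that real attacks use cycle-accurate timers (e.g., rdtsc) in native code, whereas this Python sketch with perf_counter_ns is purely illustrative:

    import time

    STRIDE = 4096                        # one access per 4 KiB page, one TLB entry per touch
    buf = bytearray(512 * STRIDE)        # 512 pages of probe memory

    def probe_latency(offsets):
        """Time one read per offset; slower reads suggest TLB misses."""
        latencies = []
        for off in offsets:
            t0 = time.perf_counter_ns()
            _ = buf[off]                 # the probed memory access
            latencies.append(time.perf_counter_ns() - t0)
        return latencies

    lat = probe_latency(range(0, len(buf), STRIDE))
    print(f"mean access latency: {sum(lat) / len(lat):.1f} ns")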

2.3. Models Approach

This study builds upon the modeling framework described in our previous work [10], where we applied statistical learning and deep learning models to classify workloads using TLB event data. A brief summary of the previously used models is provided here for completeness. In this paper, we extend that framework by incorporating additional classifiers, including eXtreme Gradient-Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and a voting classifier, which are described in more detail below.

2.3.1. Statistical Learning Models

We employed Logistic Regression (LR) and Random Forest (RF) as classical machine learning classifiers. LR models the relationship between input features and class probabilities using a logistic function, while RF ensembles multiple decision trees to improve classification accuracy and reduce overfitting [24,25,26].
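For illustration, the following sketch shows how LR and RF are trained and scored in scikit-learn; the randomly generated matrix is a placeholder standing in for the TLB feature matrix described in Section 3.4:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Placeholder stand-in for the 1600-column TLB feature matrix of Section 3.4.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 1600))
    y = rng.integers(0, 2, size=1000)    # 0 = benign, 1 = malware

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
        model.fit(X_tr, y_tr)
        print(type(model).__name__, model.score(X_te, y_te))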

2.3.2. Neural Networks

Neural networks, particularly Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs), were also utilized. ANNs learn from weighted combinations of inputs through nonlinear activations [27], and CNNs apply hierarchical feature extraction using convolutional layers inspired by the human visual cortex [28,29].
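As a minimal sketch of the ANN side, scikit-learn's MLPClassifier can express the layer sizes of the baseline ANN summarized in Section 3.5; this is a stand-in for whichever deep learning framework is used, with random placeholder data:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.random((1000, 1600))         # placeholder TLB feature matrix
    y = rng.integers(0, 10, size=1000)   # ten task classes

    # Three hidden ReLU layers of 128, 64, and 16 units, matching the baseline ANN
    # of Section 3.5; MLPClassifier applies a softmax-style multi-class output.
    ann = MLPClassifier(hidden_layer_sizes=(128, 64, 16), activation="relu",
                        max_iter=300, random_state=0)
    ann.fit(X, y)
    print("training accuracy:", ann.score(X, y))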

2.3.3. eXtreme Gradient-Boosting (XGBoost)

XGBoost is a decision-tree-based ensemble algorithm that uses gradient-boosting techniques to optimize performance. It includes regularization to reduce overfitting and is known for its efficiency and scalability in classification tasks [30].

2.3.4. Light Gradient-Boosting Machine (LightGBM)

LightGBM is a gradient-boosting framework that builds trees using a leaf-wise growth strategy, which can lead to faster training and improved accuracy compared to traditional approaches. It is particularly suited for large-scale data and high-dimensional feature spaces [31].

2.3.5. Voting Classifier

The voting classifier is an ensemble technique that combines the predictions of multiple base classifiers through either hard voting (majority class) or soft voting (averaged predicted probabilities), aiming to improve overall classification robustness. We used RF, XGBoost, and LightGBM with the soft voting option.
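A minimal sketch of this ensemble with scikit-learn's VotingClassifier follows; the hyperparameters and placeholder data are illustrative, not the tuned values used in our experiments:

    import numpy as np
    from lightgbm import LGBMClassifier
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.random((500, 16))            # placeholder feature matrix
    y = rng.integers(0, 2, size=500)     # 0 = benign, 1 = malware

    # Soft voting averages the predicted class probabilities of the base models.
    voting = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=0)),
            ("xgb", XGBClassifier(eval_metric="logloss")),
            ("lgbm", LGBMClassifier(verbose=-1)),
        ],
        voting="soft",
    )
    voting.fit(X, y)
    print("training accuracy:", voting.score(X, y))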
In this paper, we reuse the foundational modeling approach and augment it with the additional ensemble methods to evaluate whether TLB event data can effectively distinguish not only between different workloads but also between benign and malicious behaviors under multiple task groupings and core configurations. While the base model structures remain consistent, our contribution shifts the focus toward practical malware detection.

2.4. Related Work

This section summarizes related work from our previous study [10], including general concepts related to data collection using hyperthreads, the use of HPCs for classification, and known TLB attack and defense strategies.

2.4.1. Hyper-Threading, HPC-Based Malware Detection, and Co-Residency Classification

This work leverages concepts such as hyper-threading technology, HPCs, and co-residency-based classification. Intel’s introduction of hyper-threading [32] enables a single physical processor to appear as two logical processors, where architectural states are duplicated but physical execution resources are shared. These shared resources can introduce microarchitectural contention, which we exploit during the data collection process. HPCs, which are used to monitor and measure CPU events, have proven useful for malware detection, with several works [18,19,33,34] demonstrating high classification accuracy using HPC features. However, these studies often exclude TLB-related events or limit their scope to binary classification. In contrast, our approach relies on TLB events alone (without other counter events) for multi-task malware classification. The authors of [21] developed co-residency classification strategies that we apply in our research. Langehaug et al. run sensor and target programs concurrently to reveal behavioral signatures. In this study, we use our own sensor designs and TLB events, and we augment the machine learning models.

2.4.2. Prior Work on TLB Attacks

Several studies have demonstrated that TLBs can serve as a source of exploitable microarchitectural leakage. Gras et al. introduced TLBleed [15], the first known attack to extract cryptographic keys by exploiting TLB-based side channels, proving that caches are not the only shared resources vulnerable to adversarial use. Tatar et al. [35] proposed a desynchronization technique that manipulates page table entries to reveal precise TLB behavior, including eviction inference, replacement policy, and PCID handling. In contrast, our methodology uses HPCs. TLB attacks have been extended to Graphics Processing Units (GPUs). Dutta et al. [36] examined Nvidia DGX systems and concluded that GPU TLBs are not remotely cached and thus resistant to their attack model. Conversely, Nayak et al. [37] successfully implemented a covert channel by reverse engineering the GPU TLB hierarchy and exploiting shared virtual memory through Unified Virtual Memory (UVM) and the Multi-Process Service (MPS). In contrast to these prior works, which aim to extract fine-grained memory behavior through probing and timing memory access, our research uses HPC-based data to identify macro-level behavioral signatures for malware classification.

2.4.3. Prior Work on TLB Defense

The literature on defending against TLB side-channel attacks is limited. One common approach on Linux systems involves assigning distinct VA spaces and process identifiers to different execution contexts (e.g., attacker vs. victim), reducing external hit-based leakage [38]. More aggressive strategies include flushing the TLB when transitioning between protected and unprotected regions [39,40], and deploying fully associative TLBs to eliminate miss-based vulnerabilities [38]. Deng et al. [38] proposed two defensive TLB designs: the Static-Partition (SP) TLB and the Random-Fill (RANF) TLB. The authors claimed that the SP TLB mitigates a subset of known attacks, while the RANF TLB defends against all known vulnerabilities with less than 10% performance overhead. This work represents one of the first hardware defenses against TLB attacks. Stolz et al. [41] introduced TLBcoat, a secure TLB architecture designed to resist timing-based side channels related to page translation. Their study evaluated the applicability of cache defenses to TLBs and concluded that such methods are insufficient. However, TLBcoat’s applicability to attacks that leverage hardware performance counters remains unexplored.

2.5. Additional Related Work

2.5.1. TLB Coalescing with a Range-Compressed Page Table for Embedded I/O Devices

Recent performance optimizations such as TLB coalescing [42] aim to reduce page-table walk overhead by compressing contiguous page mappings into single TLB entries. While these techniques improve TLB utilization and overall system throughput, they may also expand the granularity of observable memory behavior, potentially increasing the attack surface for TLB-based side channels and covert channels. This study highlights another instance where architectural enhancements may unintentionally influence TLB behavior. However, the implications of TLB coalescing for microarchitectural leakage and secure isolation remain largely unexplored.

2.5.2. HiPeR—Early Detection of a Ransomware Attack Using Hardware Performance Counters

HiPeR [18] proposes an early-stage ransomware detection technique using HPCs to identify malicious behavior during the setup phase of an attack. Its selected feature set includes five HPCs: one TLB counter (dTLB-loads) and four branch and cache counters (branch-loads, L1-dcache-loads, L1-dcache-stores, L1-dcache-load-misses). While the authors achieve 98.68% accuracy, the method is not based on TLB events alone. Additionally, it remains unclear whether existing cache countermeasures could limit access to these counters and thus impair HiPeR’s detection capabilities. In contrast, our approach focuses exclusively on TLB events, aiming to evaluate whether TLB activity alone, without reliance on other architectural subsystems or traditional timing features, can support malware classification.

2.5.3. RanStop: A Hardware-Assisted Runtime Crypto-Ransomware Detection Technique

The authors of RanStop propose a runtime detection system specifically for crypto-ransomware, using HPCs and timestamps [19]. Their approach employs an LSTM-based Recurrent Neural Network (RNN) to classify ransomware against benign programs. The authors identify the TLB_DATA counter group as the most effective feature set for this binary classification task. In contrast, our study addresses a broader classification problem involving multiple benign and multiple malware programs (multi-task classification). While RanStop relies primarily on timestamp-based features for detection, our methodology uses the raw counter values as input. Furthermore, our augmented approach incorporates a timing feature derived from the duration of counter collection, a metric that is distinct from the execution timestamps used in RanStop.

2.5.4. Intelligent Malware Detection Based on Hardware Performance Counters: A Comprehensive Survey

Sayadi et al. [20] present a comprehensive survey on malware detection using HPCs and machine learning techniques. Their study reviews common malware types, summarizes machine learning algorithms frequently applied in this domain, and outlines recent research trends in hardware-assisted malware detection. While the paper provides a strong overview of the current state of the field, it lacks a methodological contribution of its own, and the conclusions are limited to general recommendations. The survey does not reference any work that relies exclusively on TLB-related events.

2.5.5. Redefining Trust: Assessing the Reliability of Machine Learning Algorithms in Intrusion Detection Systems

Sayadi et al. [43] investigated the reliability of machine learning algorithms in hardware-assisted intrusion detection systems (IDS). They examined various parameters that impact the reliability of these algorithms and, consequently, the performance of IDS frameworks. The authors report that their method improves the reliability and performance of ML-based IDS by up to 6%. While their work highlighted the potential of HPCs in detecting malicious activity, its primary focus is on evaluating the robustness of machine learning models. In contrast, our study focuses specifically on TLB-related events and their utility for malware classification.

2.5.6. Cyber-Immunity at the Core: Securing Biomedical Devices Through Hardware-Level Machine Learning Defense

Sayadi et al. utilize HPCs and machine learning for biomedical devices. Their methodology aligns with the approaches used in [20,43]. The authors identified LLC-load-misses, LLC-loads, and cache-misses (the latter counted twice) as their top four hardware events, and LLC-load-misses and LLC-loads as their top two. Their findings suggest that XGBoost, when using four HPC events, is the most effective for malware identification, while ExtraTree performs best for classification. Their study, like much other HPC research, prioritized cache events while overlooking TLB events.

2.5.7. Stochastic-HMD: Adversarial-Resilient Hardware Malware Detector via Undervolting

Islam et al. [44] propose Stochastic-HMDs, a hardware-based malware detection approach that introduces stochastic noise into the detection model’s computations. This is achieved through controlled undervolting, where the supply voltage is deliberately scaled below nominal levels to induce stochastic timing violations within the HMD’s operations. The authors claim that this technique enhances the resilience of HMDs against adversarial attacks. Their method represents an alternative to conventional data collection techniques, such as the Linux perf tool. Instead, they use Intel’s Pin dynamic instrumentation framework [45] on a system running Windows 7. This work highlights a distinct strategy for defending against malware by manipulating the microarchitectural environment rather than passively observing it, offering an alternative perspective to HPC-based classification.

2.5.8. Obfuscation-Resistant Hardware Malware Detection: A Stacked Denoising Autoencoder Approach

He et al. [46] present a study on the impact of code obfuscation on the effectiveness of machine learning (ML)-based HMDs. They introduce ObfusGate, an obfuscation-resistant malware detection framework that leverages HPCs. The authors construct a correlation matrix using the top 16 HPC features, which include TLB events. However, although they report selecting the best four counters for their final model, these counters are not explicitly identified in the paper. Moreover, their evaluation is limited to binary classification (malware vs. benign). Nonetheless, the study provides valuable insights into the challenges of detecting obfuscated malware using HPC-based models.

3. Methodology

The Methodology Section builds upon [10], though there are several significant changes that improve the data collection, preprocessing, and adaptation; further, the methodology is augmented by the introduction of synthetic malware for classification, which was not examined in [10]. In addition, this work considers the model types XGBoost, LightGBM, and a voting classifier, which improve classification results. Lastly, the methodology removes the use of any timestamp-based features to improve the practicality of the final models.
Replicated steps are summarized and modifications and additions are described in detail. This section contains the materials and instrumentation, experimental design, workflow, and limitations. It also includes a detailed description of the programs used. The overall sequence in the methodology is presented in Figure 1.

3.1. Materials and Instrumentation

The experiments were conducted on a computer with an Intel Xeon E3-1535M v6 processor, which supports the x86_64 ISA, featuring 4 cores and 8 threads. The system includes 4 physical counters, each with a 48-bit width. The TLB has two levels: level-1 consists of separate instruction and data TLBs, while level-2 is a unified TLB. The OS used was Ubuntu 22.04.3 LTS (Jammy Jellyfish), running the Linux 6.8.0-49-generic kernel. We utilized the Linux performance tool, perf, version 6.8.12 [47], to monitor specific performance events. The machine learning models and neural networks were implemented using Python 3.11.4 and scikit-learn 1.5.1.

3.1.1. Experimental Design

The objective of this research is to evaluate the reliability of microarchitectural data for identifying malware behavior. We build on the framework presented in [10]. We leverage the multi-threading capabilities of modern CPUs to collect data from two processes running concurrently. The hypothesis is that the interference between these two co-resident processes can provide information that helps distinguish malware behavior. During data collection, we manipulate CPU affinity to run the target task (i.e., benign or malware programs) and our sensor in various core configurations: same logical, where both processes—the task and the sensor—run on the same logical thread; Simultaneous Multi-Threading (SMT), where the task and the sensor run on different threads of the same physical core; different physical, where they are assigned to different physical cores; and hybrid, where core affinity is left to the system. The target tasks are CoreMark-Pro benchmarks [48]: core, linear_alg-mid-100x100-sp, loops-all-mid-10k-sp, parser-125k, and sha-test. Malware tasks are implemented as Python programs that simulate typical malware behaviors, including a cryptominer, an infector, a network scanner, a ransomware, and a rootkit. We run two sensor programs: an active TLB sensor and a benign TLB sensor. We also perform experiments with both sensors concurrently, as well as with no sensors at all (referred to as the only counters configuration). Thus, data is collected under four program (sensor) configurations: only counters, TLB active, TLB benign, and both sensors active. This setup is based on the framework from [10], with the primary difference being the use of a new TLB active sensor. Table 1 summarizes these programs and the core configurations. To run the experiment, we use a top-level Python script, run_all_arguments.py, which selects each combination of core configuration and sensor and calls experiment.py and counters.sh for data collection. After data collection, we use a Python environment to preprocess the data and train our models.
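To make the core configurations concrete, the following sketch pins a task and a sensor to specific logical CPUs with taskset; the CPU IDs and the (k, k+4) sibling layout are illustrative assumptions for a 4-core/8-thread part, the command names are stand-ins, and the hybrid configuration is simply the absence of pinning:

    import subprocess

    # Illustrative logical-CPU map for a 4-core/8-thread CPU where logical CPUs
    # (k, k+4) are SMT siblings; actual IDs are system-specific.
    CONFIGS = {
        "same_logical":       (2, 2),    # task and sensor share one hardware thread
        "smt":                (2, 6),    # sibling threads of one physical core
        "different_physical": (2, 3),    # two distinct physical cores
    }

    def launch_pinned(cmd, cpu):
        """Start a process pinned to one logical CPU via taskset."""
        return subprocess.Popen(["taskset", "-c", str(cpu)] + cmd)

    task_cpu, sensor_cpu = CONFIGS["smt"]
    task = launch_pinned(["./core.exe"], task_cpu)            # benchmark (name illustrative)
    sensor = launch_pinned(["python3", "data_tlb.py"], sensor_cpu)
    task.wait()
    sensor.terminate()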

3.1.2. Data Description

The dataset contains counts from TLB HPCs, specifically: dTLB-loads, dTLB-store-misses, dtlb_load_misses.walk_completed, and itlb_misses.stlb_hit. Each run lasts 0.5 s and is sampled at a 5 ms interval, yielding 100 samples per run. For each configuration, defined by a combination of benign/malware task, sensor, and core affinity, the experiment is repeated 500 times. The raw data includes 10 columns: task, configuration, run, time, counts, events, t0, t1, t2, and t3. task refers to the specific benign or malware program executed; configuration denotes the sensor and core affinity setting; time corresponds to 5 ms intervals within each run; counts represents the actual HPC values, which reset upon each read; and events indicates the specific performance counter being recorded. The timestamps t0 through t3 denote: t0—task start, t1—counter collection start, t2—counter collection end, and t3—task end. None of these timestamps are used for model training.
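A hedged sketch of one such collection, invoking perf stat with the four events at a 5 ms print interval (perf emits a warning for intervals below 100 ms but accepts them), is shown below; the monitored command is a stand-in for the actual task, and output parsing happens downstream:

    import subprocess

    EVENTS = ",".join([
        "dTLB-loads",
        "dTLB-store-misses",
        "dtlb_load_misses.walk_completed",
        "itlb_misses.stlb_hit",
    ])

    # One illustrative 0.5 s collection window at a 5 ms print interval.
    subprocess.run([
        "perf", "stat", "-e", EVENTS,
        "-I", "5",                       # print counter deltas every 5 ms
        "-x", ",",                       # CSV-style output for easier parsing
        "--", "sleep", "0.5",            # stand-in for the monitored task
    ])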

3.1.3. Workflow

The pipeline begins with a Python script, run_all_arguments.py, which reads a Comma-Separated Values (CSV) file where each row specifies a set of arguments for the subsequent stages of the pipeline. These arguments include the task, number of runs, core configuration, sampling rate, thread1, and thread2. run_all_arguments.py passes these parameters to experiment.py, which is responsible for selecting the task, sensor, and CPU affinity based on the input arguments. It then calls counters.sh, which launches the TLB HPCs and the sensor programs, again according to the parameters specified in the CSV file. This completes the data collection pipeline.
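The sketch below illustrates the shape of this driver loop; the CSV column names and the experiment.py flags are hypothetical, as the exact argument interface is given in Appendix A.1:

    import csv
    import subprocess

    # Hypothetical arguments file; the column names mirror the parameters listed
    # above, and the experiment.py flags are illustrative.
    with open("arguments.csv", newline="") as f:
        for row in csv.DictReader(f):
            subprocess.run([
                "python3", "experiment.py",
                "--task", row["task"],
                "--runs", row["runs"],
                "--config", row["core_configuration"],
                "--rate", row["sampling_rate"],
                "--thread1", row["thread1"],
                "--thread2", row["thread2"],
            ])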

3.2. Implementation Details: Python and C Programs, and Shell Scripts

From the framework presented in [10], we reused the scripts run_all_arguments.py, experiment.py, counters.sh, and benign.c, with minor modifications. The specific updates are detailed in the following paragraph. We extended the methodology by replacing active_tlb.c and adding new scripts to simulate malware behaviors: cryptocurrency.py, infector.py, network_scanner.py, ransomware.py, and rootkit.py.
The first modification involves the arguments in the CSV file passed to run_all_arguments.py.
As shown in Appendix A.1, experiment.py was updated to support running a combination of .exe and .py files. The .exe files correspond to benign programs, while the .py files represent malware programs. Additionally, the programs were updated to make T2 and T3 independent of each other. Although T2 and T3 are not used for model training, they are retained for reference and can be used in visualizations during the analysis phase. In counters.sh, only directory paths and file names were updated. The file benign.c remains unchanged.

3.3. Augmented Framework

This section provides a detailed description of the programs added to the experiment for the purpose of evaluating the TLB’s effectiveness in identifying and classifying malware. It includes one sensor program and multiple behavioral malware programs.

3.3.1. data_tlb.py

This script was originally designed to allocate memory and perform read–write operations in a loop, with the goal of stressing the memory and generating a high number of TLB hits and misses. We modified it into a program that intentionally probes the data TLB in a more controlled manner. Specifically, it applies the linear mapping function described in [15,16,23] to generate a set of VAs that target a specific set within the TLB. This targeted probing enables more precise analysis of TLB replacement behavior. The implementation is provided in Appendix A.2.
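The following sketch illustrates the linear-mapping idea: under the set = VPN mod (number of sets) mapping described in [15,16,23], virtual addresses spaced one set-stride apart all land in the same TLB set. The set and way counts here are illustrative, not the measured geometry of our test CPU:

    PAGE_SHIFT = 12
    NUM_SETS, NUM_WAYS = 16, 4           # illustrative L1 dTLB geometry

    def addresses_for_set(base: int, target_set: int, count: int):
        """Generate VAs whose VPNs all map to target_set under the linear
        mapping set = VPN mod NUM_SETS."""
        base_vpn = base >> PAGE_SHIFT
        first_vpn = base_vpn - (base_vpn % NUM_SETS) + target_set
        return [(first_vpn + i * NUM_SETS) << PAGE_SHIFT for i in range(count)]

    # Probing twice the associativity of one set is enough to force evictions there.
    probe_vas = addresses_for_set(0x7f0000000000, target_set=5, count=2 * NUM_WAYS)
    print([hex(va) for va in probe_vas])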

3.3.2. cryptominer.py

This script simulates a basic cryptocurrency mining process by repeatedly generating and hashing block headers until a hash is found that meets a specified difficulty. The implementation is provided in Appendix A.3 and the corresponding workflow is illustrated in Figure A1. When executed, the script continuously hashes different values until a valid hash is discovered, mimicking proof-of-work mining in blockchain systems. This process places pressure on the TLB due to repeated function calls, memory operations, and tight looping, generating distinctive patterns of TLB hits and misses.
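In essence, the mining loop reduces to the following sketch (the difficulty prefix and header format are illustrative):

    import hashlib

    DIFFICULTY = "0000"                  # required hash prefix (illustrative)

    def mine(block_data: str):
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
            if digest.startswith(DIFFICULTY):    # valid proof-of-work found
                return nonce, digest
            nonce += 1                           # tight loop: sustained TLB pressure

    print(mine("block-header"))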

3.3.3. infector.py

This script simulates infector malware behavior by injecting a benign payload into Python files. The implementation is provided in Appendix A.4 and the corresponding workflow is illustrated in Figure A2. File injection and recovery cycles cause instructions to be fetched repeatedly and lead to new memory allocations, producing iTLB and dTLB activity.
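A simplified sketch of the infect/restore cycle follows; the marker string, payload, and flat backup layout are illustrative assumptions:

    from pathlib import Path

    MARKER = "# INFECTED"                # illustrative infection marker
    PAYLOAD = "print('benign payload')\n"

    def infect(root: str):
        """Prepend the payload to every Python file not already marked."""
        for p in Path(root).rglob("*.py"):
            text = p.read_text()
            if MARKER not in text:
                p.write_text(f"{MARKER}\n{PAYLOAD}{text}")

    def restore(root: str, backup: str):
        """Recover infected files from clean backup copies (flat layout assumed)."""
        for p in Path(backup).rglob("*.py"):
            (Path(root) / p.name).write_text(p.read_text())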

3.3.4. network_scanner.py

This script simulates a basic network scanner by probing a defined IP address range for open ports. The implementation is provided in Appendix A.5 and the corresponding workflow is illustrated in Figure A3. It is configured to scan IPs within a specific subnet (e.g., 192.168.180.127 to 192.168.180.128) and checks for common ports such as SSH (22), HTTP (80), and HTTPS (443). Iterative port probing leads to predictable looping with repeated network stack calls, generating sustained TLB pressure.
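The scanning loop is essentially the following sketch (the timeout value is illustrative):

    import socket

    PORTS = (22, 80, 443)                # SSH, HTTP, HTTPS

    def scan(ip: str):
        open_ports = []
        for port in PORTS:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(0.5)                  # illustrative timeout
                if s.connect_ex((ip, port)) == 0:  # 0 means the connection succeeded
                    open_ports.append(port)
        return open_ports

    for host in ("192.168.180.127", "192.168.180.128"):
        print(host, scan(host))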

3.3.5. ransomware.py

This script simulates ransomware behavior by encrypting and decrypting all files within a target directory using symmetric encryption. The implementation is provided in Appendix A.6 and the corresponding workflow is illustrated in Figure A4. When run, the script first encrypts the directory contents, then immediately decrypts them, simulating a full ransomware attack and recovery cycle. Recursive file encryption and decryption affect a large number of files, creating wide memory access coverage that stresses the TLB.
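A condensed sketch of the encrypt-then-decrypt cycle using Fernet symmetric encryption follows; key persistence (the "generates or loads" step in Figure A4) is omitted here for brevity:

    from pathlib import Path
    from cryptography.fernet import Fernet

    f = Fernet(Fernet.generate_key())    # key persistence omitted for brevity

    def process_dir(root: str, encrypt: bool = True):
        """Recursively encrypt (or decrypt) every file under root."""
        for p in Path(root).rglob("*"):
            if p.is_file():
                data = p.read_bytes()
                p.write_bytes(f.encrypt(data) if encrypt else f.decrypt(data))

    process_dir("target_dir")                 # attack pass
    process_dir("target_dir", encrypt=False)  # immediate recovery pass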

3.3.6. rootkit.py

This script simulates rootkit behavior by hooking into the Python open() function, mimicking a syscall table modification. The implementation is provided in Appendix A.7 and the corresponding workflow is illustrated in Figure A5. It models common rootkit techniques such as syscall hooking, file hiding, and activity logging. By intercepting file access, it generates irregular TLB activity from hidden resource manipulation and altered system call handling.
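The hooking idea reduces to the following sketch (the hidden file name matches Figure A5; the log format is illustrative):

    import builtins

    HIDDEN = "secret.txt"
    _real_open = builtins.open

    def hooked_open(path, *args, **kwargs):
        if HIDDEN in str(path):                        # hide the protected file
            raise FileNotFoundError(path)
        print(f"[rootkit] access logged: {path}")      # activity logging
        return _real_open(path, *args, **kwargs)

    builtins.open = hooked_open                        # mimic syscall-table hooking
    try:
        open("secret.txt")
    except FileNotFoundError:
        print("[rootkit] hidden file blocked")
    builtins.open = _real_open                         # unhook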

3.4. Preprocessing and Analysis

As described in Section 3.1.2, the raw data consists of 10 columns. This data is preprocessed before being fed into the machine learning models. In the framework proposed in [10], the Events column is expanded into four separate columns, one for each event. In this study, we modify the preprocessing strategy. Instead of creating one column per event and keeping the time sequence (resulting in 100 rows per event per run, i.e., 100 × 4 = 400 rows per run), we convert each run into a single row.
For each of the four event types, we compute four summary statistics: mean, standard deviation, kurtosis, and skewness. These 4 × 4 = 16 statistics are computed at each of the original 100 time steps, so the total number of features becomes 100 × 4 × 4 = 1600 columns per row. The final dataset contains 80,000 rows, corresponding to 500 runs × 16 configurations × 10 tasks.
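As a simplified illustration of the statistic extraction (collapsing each run to 16 features rather than the per-time-step 1600 columns used in our experiments), consider:

    import pandas as pd
    from scipy.stats import kurtosis, skew

    def flatten_runs(raw: pd.DataFrame) -> pd.DataFrame:
        """Collapse each (task, configuration, run) into one row of per-event
        summary statistics; raw has the columns described in Section 3.1.2."""
        stats = (raw.groupby(["task", "configuration", "run", "events"])["counts"]
                    .agg(["mean", "std", kurtosis, skew])
                    .unstack("events"))
        stats.columns = [f"{event}_{stat}" for stat, event in stats.columns]
        return stats.reset_index()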
The dataset contains the following columns:
  • Task: the program to be classified, either one of the five benign benchmarks or one of the five malware scripts.
  • Configuration: the core configuration and sensor combination used during execution.
  • Run: identification of each of the 500 runs per configuration-task combination.
  • Time: time steps at 5 ms intervals over a 0.5 s duration.
  • Counts: The number of counts recorded per HPC event.
  • Events: The four TLB events measured.
  • T0–T3: timestamps for key events, T0 (task start), T1 (counter collection start), T2 (counter collection end), T3 (task end).
These timestamps are included only for reference and analysis and are not used for training. Notably, this study is the first to use only TLB-related HPC metrics for classification, without relying on timing features such as memory access latency or task duration.

3.5. Augmented Prior Learning Models

Building upon the framework of our previous study [10], we extend the use of statistical and neural learning models to classify tasks using only TLB event features. For clarity, the models and their setups are summarized below.
Baseline Models (from [10]):
  • Logistic Regression—implemented in Python, 80/20 train–validation split, evaluated on all 15 combinations of four performance counters.
  • Random Forest (RF)—same setup as logistic regression.
  • Artificial Neural Network (ANN)—three hidden layers (128, 64, 16 ReLU units), softmax output, multi-class classification.
  • Convolutional Neural Network (CNN)—three-dimensional reshaped inputs, kernel (2, 3, 1), global average pooling, dense layers, padding by duplicating final row.
Ensemble Models (this study):
  • eXtreme Gradient-Boosting (XGBoost)—gradient-boosted decision trees with regularization (Section 2.3.3).
  • Light Gradient-Boosting Machine (LightGBM)—leaf-wise gradient-boosting framework suited to high-dimensional features (Section 2.3.4).
  • Voting Classifier—soft voting over RF, XGBoost, and LightGBM (Section 2.3.5).

3.6. Limitations

A primary limitation of this approach is the number of performance counters that can be simultaneously utilized. Using more HPC events than the microarchitecture supports reduces precision, as the available registers must be multiplexed across the selected HPC events [49]. Additionally, as noted by Weaver et al. [50], HPCs are inherently non-deterministic. Despite this, our experiments were repeated multiple times, and the classification accuracy remained consistent.
Methodology transferability is another limitation. HPC-based approaches often require adjustments, sometimes significant, depending on the specific microarchitecture used. For example, while our methodology is transferable across Intel x64 CPUs (e.g., from Xeon to i9), extending it to other architectures such as AMD, ARM, or RISC-V is not straightforward. This is due to three primary factors:
  • TLB Mapping Functions: The manner in which virtual addresses are mapped to TLB sets and ways differs across Instruction Set Architectures (ISAs) and microarchitectures. This impacts both the observability and interpretability of TLB behavior, and thus affects feature extraction and side-channel signal quality.
  • HPC Event Semantics: The availability and definition of TLB related HPC events vary across vendors. Events that are effective for classification on Intel processors may be absent, renamed, or behave differently on AMD, ARM, or RISC-V architectures.
  • Access Restrictions: Some operating systems impose restrictions on user-level access to HPCs, limiting data collection. These access policies can vary across platforms and distributions, which further complicates methodology portability.
Another limitation is that the malware programs used in this study were developed by us specifically to investigate how their features impact TLB behavior. For example, we examined whether a cryptominer leaves a distinguishable signature in the TLB time-series profile. As shown in Figure 2a, our results confirm that such a signature is indeed observable. Although these are not real-world malware samples, the programs serve as a useful proof of concept for demonstrating the feasibility of TLB-based behavioral analysis.

4. Results

To meet the hardware constraints of our system, we needed to select the combination of four counters that gives the highest classification accuracy. Based on prior research [15,16,35], the following counters were identified as the most effective for leaking microarchitectural information: dtlb_load_misses.stlb_hit, itlb_misses.stlb_hit, dtlb_load_misses.walk_completed, dTLB-loads, dTLB-load-misses, dTLB-store-misses, iTLB-loads, and iTLB-load-misses. This list was narrowed to the best four counters using the methodology outlined in [10].
Our analysis starts by validating the integrity of the data through visual inspection of the plots for both benign (benchmark) and malware programs, as shown in Figure 2 and Figure 3. In both figures, the red lines indicate the hardware performance counter (HPC) collection window, confirming consistent coverage of the intended 0.5 s duration. The benchmark execution time, shown by the blue lines, varies across tasks: some tasks, such as those in Figure 3b and Figure 2b, are very short, while others, such as in Figure 2a and Figure 2c, extend beyond the 0.5 s window. Nevertheless, our models only use the data collected during the fixed 0.5 s interval. Having confirmed the validity of the data, we proceeded with training the classification models.

4.1. Statistical Learning Model Performance

The data used to train these models is described in Section 3.1.2, and the preprocessing steps are detailed in Section 3.4. We augmented the set of models by including XGBoost, LightGBM, and a voting classifier, in addition to the previously used Logistic Regression (LR), RF, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN). These additions proved effective: the voting classifier achieved the highest accuracy among the statistical models, followed closely by XGBoost. Among the neural network approaches, the ANN provided the best classification performance.
We evaluated the models under three scenarios: (1) binary classification, (2) a 4-class grouping, and (3) a 10-class setup distinguishing each individual benign and malware task. The results for each scenario are presented below. Section 4.2 provides a comparison and discussion of these classification scenarios.

4.1.1. Binary Classifier

We evaluated the classification accuracy of our models in a binary classification scenario, distinguishing between malware and benign programs. The results are shown in Table 2. The trend remained consistent: the highest accuracy was achieved using the voting classifier, reaching 81%. Among the neural network models, the ANN outperformed the CNN, achieving a classification accuracy of 74%.
One of the challenges in binary classification is the limited separability between certain benign and malware programs, which reduces overall accuracy. To address this limitation and improve performance, we proposed a four-way classifier.

4.1.2. Multi-Task Four-Way Classifier

The objective in this scenario was to create clusters of programs that are more difficult to separate in a binary setup. We defined four categories: a ‘benign’ group consisting of core, linear_alg, loop, and sha; a ‘malware’ group including cryptominer, infector, ransomware, and rootkit; and two additional categories for parser and net_scan. These two were separated into individual classes because a combination of their plots and confusion matrix results showed they were more distinguishable from the other programs. However, alternative groupings are also possible. The results for this four-way classification are presented in Table 3. The trend remained consistent: the voting classifier achieved the highest classification accuracy at 72%, followed by the ANN, which reached 66%. A comparison of all three classification scenarios—the binary, four-way, and 10-way classifiers—is discussed in Section 4.2.

4.1.3. Multi-Task 10-Way Classifier

The 10-way classifier is trained to distinguish among all ten tasks, including five benign programs and five malware programs. The results for the models with the highest classification accuracy are presented in Table 4. Decision-tree-based models consistently achieve accuracies in the 50% range. By combining RF, XGBoost, and LightGBM in a voting classifier, we reached an improved accuracy of 61%. We also tested two neural network models: a CNN and an ANN. The CNN achieved weaker performance, with accuracies in the 30% range, while the ANN achieved results up to 50%. Additionally, Table 4 shows that classification accuracy tends to improve when more counters are used and decreases as the number of counters is reduced.

4.2. Comparing Classifiers

The scores for the three scenarios are not directly comparable, as each uses a different baseline for accuracy. For example, the baseline for binary classification is 50%, for 4-way classification it is 25%, and for 10-way classification it is 10%. By dividing the highest score achieved in each scenario (81%, 72%, and 61%) by its respective baseline, we observe that the binary classification performs 1.62 times better than chance, the 4-way classification 2.88 times better, and the 10-way classification 6.1 times better. However, this type of normalization has limitations; binary classification cannot be more than two times better than chance, and 4-way classification cannot be more than four times better than chance. Thus, this technique only confirms that the trained models perform better than random guessing; it does not indicate which scenario or model is objectively superior.
For a more reliable comparison, we use Cohen’s Kappa coefficient, defined in Equation (1). This metric accounts for chance agreement and provides a more balanced evaluation across tasks with different class distributions. The results confirm that the 4-way classification scenario performs best with a Kappa score of 0.64, followed by binary classification at 0.62, and finally the 10-way classification at 0.56.
κ = (P_o − P_e) / (1 − P_e)  (1)
  • κ: Cohen’s Kappa.
  • P_o: observed accuracy.
  • P_e: expected (chance) accuracy.
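In practice, the coefficient is computed directly from validation labels and model predictions, e.g., with scikit-learn (toy labels shown):

    from sklearn.metrics import cohen_kappa_score

    y_true = [0, 0, 1, 1, 2, 2, 3, 3]    # toy 4-way labels
    y_pred = [0, 0, 1, 2, 2, 2, 3, 1]    # hypothetical model predictions
    print(f"Cohen's kappa: {cohen_kappa_score(y_true, y_pred):.2f}")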

4.3. Comparison with Related Work

A direct comparison to prior work is challenging, as existing studies differ significantly in both objectives and methodology. Related research can be categorized into two groups: studies that use similar microarchitectural components but pursue different goals, and studies that aim for similar objectives but rely on different types of HPCs.
In the first category, we include TLBleed [15] and Holmes et al. [16]. These studies use the TLB as the source of microarchitectural leakage but focus on side-channel attacks rather than classification. For example, [15] probes the TLB and measures memory access delays to distinguish hits from misses, successfully leaking a 256-bit EdDSA secret key with a 98% success rate and reconstructing 92% of RSA keys. Holmes et al. [16] extend this approach to infer Linux shell commands, achieving classification accuracies of 95% in clean settings and 62.4% under noisy conditions. These studies rely heavily on memory access timing to execute the attacks. By contrast, our study does not use any timing or timestamp features and instead focuses on malware identification, a fundamentally different goal.
In the second category are studies that share a similar objective, malware classification, but leverage different HPC features. Pundir et al. [19] performed binary classification of benign versus malicious software using a combination of HPCs, achieving 97% accuracy. Anand et al. [18] reported 98% accuracy using five HPC events, only one of which was TLB-related. In contrast, our work focuses solely on TLB events and addresses more complex classification tasks by evaluating binary, 4-way, and 10-way scenarios. Moreover, many of the HPC events used in prior work are well studied and thus difficult to exploit.
Our previous work [10] also explored task classification using microarchitectural data, but it included timestamp-based features (e.g., task duration) and was limited to benign programs. In this study, we eliminate time dependencies entirely to construct a more realistic and portable methodology applicable in real-world malware detection settings.
To our knowledge, no prior study has performed malware classification using TLB data exclusively, without any form of timing information. This makes our contribution unique in its use of TLB-only features, its time independent design, and its support for multi-class malware classification.

5. Conclusions

This research integrated both benign and malicious programs into a controlled environment for collecting microarchitectural data. Our methodology introduced a unique preprocessing pipeline that flattens temporal data and extracts summary statistics from collected TLB events. To enhance its real-world applicability and generalizability, we intentionally excluded timestamp-based features, such as task duration, which may not be reliably available in practical deployment scenarios.
We evaluated classification performance across three scenarios: binary, 4-way, and 10-way classification. To compare these models, we used Cohen’s Kappa as a normalized metric. The four-way classifier achieved the highest performance, with a classification accuracy of 72% and a Kappa score of 0.64, using a voting classifier composed of RF, XGBoost, and LightGBM. The binary classifier followed with 81% accuracy and a Kappa of 0.62, while the 10-way classifier scored 61% accuracy with a Kappa of 0.56.
Our findings demonstrate that microarchitectural data, specifically TLB performance counters, can be used to infer both benign and malicious program behavior. As larger and more diverse datasets are incorporated, and as hyperparameters are fine-tuned, classification accuracy is expected to improve.
In a practical defense scenario, a system operator could deploy this framework on a network to monitor incoming executable workloads. By feeding real-time TLB counter statistics into the preprocessing pipeline, the operator would receive near-instant anomalous-behavior alerts whenever a program’s microarchitectural fingerprint deviates from known benign profiles. This would enable rapid isolation of potentially malicious code, such as ransomware or rootkits, before it can propagate through the system.
Future work will focus on enhancing these models and extending the methodology to other microarchitectures. While differences across hardware platforms and ISAs may pose challenges, identifying the adaptations required to leak microarchitectural information across systems would be a valuable direction for broader applicability. Additionally, a next step is incorporating real-world malware into future experiments to evaluate the effectiveness of the approach in more realistic scenarios. This includes deploying the system in a controlled environment and evaluating detection performance using a broader range of metrics, such as precision, recall, and F1-score, to better assess operational effectiveness.

Author Contributions

Conceptualization, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; methodology, C.A., D.F.K., J.A.G.d.A. and C.M.S.K.; software, C.A.; validation, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; formal analysis, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; investigation, C.A.; resources, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; data curation, C.A., D.F.K., C.M.S.K. and J.A.G.d.A.; writing—original draft preparation, C.A.; writing—review and editing, D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; visualization, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; supervision, D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G.; project administration, C.A., D.F.K. and C.M.S.K.; funding acquisition, C.A., D.F.K., C.M.S.K., J.A.G.d.A. and S.R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this manuscript, spelling and grammar were reviewed with the assistance of ChatGPT (GPT-5). The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The views expressed in this paper are those of the authors, and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government. This document has been approved for public release; distribution unlimited, case #88ABW-2025-0571.

Appendix A. Listings

Appendix A.1. experiment.py Program

  • Listing A1. Changes to experiment.py.

Appendix A.2. Data TLB Linear Mapping C Program

  • Listing A2. Data TLB Linear Mapping C program [10].

Appendix A.3. Simulated cryptominer.py Program

  • Listing A3. Simulated cryptominer.py.
Figure A1. Block diagram of the simulated cryptominer. The script repeatedly generates block headers with incrementing nonces, applies SHA-256 hashing, and checks for a difficulty prefix.

Appendix A.4. Simulated infector.py Program

  • Listing A4. Simulated infector.py.
Figure A2. Block diagram of the simulated file infector malware. The script scans directories for Python files, infects those not already marked, and later restores files from a clean backup.

Appendix A.5. Simulated network_scanner.py Program

  • Listing A5. Simulated network_scanner.py.
Figure A3. Block diagram of the simulated network scanner. The script iterates through a defined IP range and probes common ports (SSH, HTTP, HTTPS) using TCP connection attempts.

Appendix A.6. Simulated ransomware.py Program

  • Listing A6. Simulated ransomware.py.
Figure A4. Block diagram of the simulated ransomware. The script generates or loads a symmetric key, then recursively encrypts and decrypts all files in a directory.

Appendix A.7. Simulated rootkit.py Program

  • Listing A7. Simulated rootkit.py.
Figure A5. Block diagram of the simulated rootkit. The script hooks the open() system call, blocking access to a hidden file (secret.txt) while logging other file accesses.

Appendix A.8. XGBoost Program

  • Listing A8. XGBoost.

Appendix A.9. LightGBM Program

  • Listing A9. LightGBM.

Appendix A.10. Voting Classifier Program

  • Listing A10. Voting Classifier.

References

  1. Hennessy, J.L.; Patterson, D.A. Computer Architecture: A Quantitative Approach, 6th ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2017. [Google Scholar]
  2. Disselkoen, C.; Kohlbrenner, D.; Porter, L.; Tullsen, D. Prime+Abort: A Timer-Free High-Precision L3 Cache Attack Using Intel TSX. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017. [Google Scholar]
  3. Kocher, P.; Genkin, D.; Gruss, D.; Haas, W.; Hamburg, M.; Lipp, M.; Mangard, S.; Prescher, T.; Schwarz, M.; Yarom, Y. Spectre attacks: Exploiting speculative execution. Commun. ACM 2020, 63, 93–101. [Google Scholar] [CrossRef]
  4. Lipp, M.; Schwarz, M.; Gruss, D.; Prescher, T.; Haas, W.; Horn, J.; Mangard, S.; Kocher, P.; Genkin, D.; Yarom, Y.; et al. Meltdown: Reading kernel memory from user space. Commun. ACM 2020, 63, 46–56. [Google Scholar] [CrossRef]
  5. Liu, F.; Yarom, Y.; Ge, Q.; Heiser, G.; Lee, R.B. Last-level cache side-channel attacks are practical. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 17–21 May 2015; pp. 605–622. [Google Scholar]
  6. Yarom, Y.; Falkner, K. Flush+ Reload: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. In Proceedings of the USENIX Security Symposium, San Diego, CA, USA, 20–22 August 2014; pp. 719–732. [Google Scholar]
  7. Percival, C. Cache missing for fun and profit. In Proceedings of the Free BSD Presentations and Papers (2005), Ottawa, ON, Canada, 13–14 May 2005. [Google Scholar]
  8. Osvik, D.A.; Shamir, A.; Tromer, E. Cache Attacks and Countermeasures: The Case of AES. In Topics in Cryptology—CT-RSA 2006; Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J.M., Mattern, F., Mitchell, J.C., Naor, M., Nierstrasz, O., Pandu Rangan, C., Steffen, B., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3860, pp. 1–20. [Google Scholar] [CrossRef]
  9. Gullasch, D.; Bangerter, E.; Krenn, S. Cache Games–Bringing Access-Based Cache Attacks on AES to Practice. In Proceedings of the Security and Privacy (SP), 2011 IEEE Symposium On, Oakland, CA, USA, 22–25 May 2011; pp. 490–505. [Google Scholar]
  10. Agredo, C.; Koranek, D.F.; Kabban, C.M.S.; Arroyo, J.A.G.D.; Langehaug, T.J.; Graham, S.R. Exploring the Translation Lookaside Buffer (TLB) for Low-Level Task Differentiation and Classification. IEEE Access 2025, 13, 111199–111216. [Google Scholar] [CrossRef]
  11. Braun, B.A.; Jana, S.; Boneh, D. Robust and efficient elimination of cache and timing side channels. arXiv 2015, arXiv:1506.00189. [Google Scholar] [CrossRef]
  12. Gruss, D.; Schuster, F.; Ohrimenko, O.; Haller, I.; Lettner, J.; Costa, M. Strong and efficient cache side-channel protection using hardware transactional memory. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017. [Google Scholar]
  13. Liu, F.; Ge, Q.; Yarom, Y.; Mckeen, F.; Rozas, C.; Heiser, G.; Lee, R.B. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Barcelona, Spain, 12–16 March 2016; pp. 406–418. [Google Scholar]
  14. Sprabery, R.; Evchenko, K.; Raj, A.; Bobba, R.B.; Mohan, S.; Campbell, R.H. A novel scheduling framework leveraging hardware cache partitioning for cache-side-channel elimination in clouds. arXiv 2017, arXiv:1708.09538. [Google Scholar]
  15. Gras, B.; Razavi, K.; Bos, H.; Giuffrida, C. Translation Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In Proceedings of the USENIX Security Symposium, USENIX, Baltimore, MD, USA, 15–17 August 2018; pp. 955–972. [Google Scholar]
  16. Holmes, N. Not Lost in Translation: Implementing Side Channel Attacks Through the Translation Lookaside Buffer. Master’s Thesis, Department of Computer Science, University of Warwick, Coventry, UK, 2023. [Google Scholar]
  17. Hill, J.E.; Walker, T.O., III; Blanco, J.A.; Ives, R.W.; Rakvic, R.; Jacob, B. Ransomware Classification Using Hardware Performance Counters on a Non-Virtualized System. IEEE Access 2024, 12, 63865–63878. [Google Scholar] [CrossRef]
  18. Anand, P.M.; Charan, P.V.S.; Shukla, S.K. HiPeR—Early Detection of a Ransomware Attack using Hardware Performance Counters. Digit. Threat. Res. Pract. 2023, 4, 43. [Google Scholar] [CrossRef]
  19. Pundir, N.; Tehranipoor, M.; Rahman, F. RanStop: A Hardware-assisted Runtime Crypto-Ransomware Detection Technique. arXiv 2020, arXiv:2011.12248. [Google Scholar]
  20. Sayadi, H.; He, Z.; Makrani, H.M.; Homayoun, H. Intelligent Malware Detection based on Hardware Performance Counters: A Comprehensive Survey. In Proceedings of the 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 3–5 April 2024. [Google Scholar] [CrossRef]
  21. Langehaug, T.; Borghetti, B.; Graham, S. Classifying Co-resident Computer Programs Using Information Revealed by Resource Contention. Digit. Threat. Res. Pract. 2023, 4, 1–29. [Google Scholar] [CrossRef]
  22. Stallings, W. Operating Systems: Internals and Design Principles; Pearson: San Antonio, TX, USA, 2014. [Google Scholar]
  23. Agredo, C.; Langehaug, T.J.; Graham, S.R. Inferring TLB Configuration with Performance Tools. J. Cybersecur. Priv. 2024, 4, 951–971. [Google Scholar] [CrossRef]
  24. Chollet, F. Deep Learning with Python; Manning Publications Co.: Shelter Island, NY, USA, 2018; p. 4. [Google Scholar]
  25. Shrivastava, A. COMP 642—Machine Learning Lecture 5: Deep Learning: Logistic Regression. Online, 2022. Scribed by Kristina Sanclemente, James Kafer, Tess Houlette, and Sarah McDonnell. Available online: https://www.cs.rice.edu/~as143/COMP642Spring22/Scribes/Lect5 (accessed on 9 January 2025).
  26. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  27. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  28. Cireșan, D.C.; Meier, U.; Masci, J.; Gambardella, L.M.; Schmidhuber, J. Flexible, High Performance Convolutional Neural Networks for Image Classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain, 16–22 July 2011; pp. 1237–1242. [Google Scholar]
  29. Wiesel, T.N.; Hubel, D.H. Receptive fields of single neurones in the cat’s striate cortex. J. Physiol. 1959, 148, 574–591. [Google Scholar] [CrossRef]
  30. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  31. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3146–3154. [Google Scholar]
  32. Marr, D.T.; Hinton, G.; Koufaty, D.A.; Miller, J.A. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technol. J. 2002, 6, 1. [Google Scholar]
  33. Das, S.; Werner, J.; Antonakakis, M.; Polychronakis, M.; Monrose, F. SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 20–38. [Google Scholar] [CrossRef]
  34. Zeraatkar, A.A.; Kamran, P.S.; Kaur, I.; Ramu, N.; Sheaves, T.; Al-Asaad, H. On the Performance of Malware Detection Classifiers Using Hardware Performance Counters. In Proceedings of the 2024 International Conference on Smart Applications, Communications and Networking (SmartNets), Harrisonburg, VA, USA, 28–30 May 2024; pp. 1–6. [Google Scholar]
35. Tatar, A.; Trujillo, D.; Giuffrida, C.; Bos, H. TLB;DR: Enhancing TLB-based Attacks with TLB Desynchronized Reverse Engineering. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 18–21 May 2020; pp. 1273–1290. [Google Scholar]
  36. Dutta, S.B.; Naghibijouybari, H.; Gupta, A.; Abu-Ghazaleh, N.; Marquez, A.; Barker, K. Spy in the GPU-box: Covert and Side Channel Attacks on Multi-GPU Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture, Orlando, FL, USA, 17–21 June 2023; pp. 1–13. [Google Scholar]
  37. Nayak, A.; Ganapathy, V.; Basu, A. (Mis) Managed: A Novel TLB-based Covert Channel on GPUs. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, Hong Kong, 7–11 June 2021; pp. 872–885. [Google Scholar]
  38. Deng, S.; Xiong, W.; Szefer, J. Secure TLBs. In Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; pp. 346–359. [Google Scholar]
  39. Costan, V.; Lebedev, I.A.; Devadas, S. Sanctum: Minimal Hardware Extensions for Strong Software Isolation. In Proceedings of the USENIX Security Symposium, Austin, TX, USA, 10–12 August 2016; pp. 857–874. [Google Scholar]
40. Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide; Intel Corporation: Santa Clara, CA, USA, 2016. [Google Scholar]
  41. Stolz, F.; Thoma, J.P.; Güneysu, T.; Sasdrich, P. Risky Translations: Securing TLBs against Timing Side Channels. In Proceedings of the Conference on Computer and Communications Security; Horst Görtz Institute for IT Security, Ruhr University Bochum: Bochum, Germany, 2024. [Google Scholar]
  42. Duong, T.D.; Kim, Y.S.; Hur, J.Y. TLB Coalescing with Range Compressed Page Table for Embedded I/O Devices. IEEE Access 2025, 13, 12623–12633. [Google Scholar] [CrossRef]
  43. Sayadi, H.; He, Z.; Miari, T.; Aliasgari, M. Redefining Trust: Assessing Reliability of Machine Learning Algorithms in Intrusion Detection Systems. In Proceedings of the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, 19–22 May 2024. [Google Scholar] [CrossRef]
  44. Islam, M.S.; Alouani, I.; Khasawneh, K.N. Stochastic-HMDs: Adversarial-Resilient Hardware Malware Detectors via Undervolting. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023. [Google Scholar] [CrossRef]
  45. Luk, C.K.; Cohn, R.; Muth, R.; Patil, H.; Klauser, A.; Lowney, G.; Wallace, S.; Reddi, V.J.; Hazelwood, K. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Chicago, IL, USA, 12–15 June 2005; ACM SIGPLAN Notices. Volume 40, pp. 190–200. [Google Scholar] [CrossRef]
46. He, Z.; Fernandes, C.W.; Sayadi, H. Obfuscation-Resistant Hardware Malware Detection: A Stacked Denoising Autoencoder Approach. IEEE, 2025. Available online: https://www.researchgate.net/publication/390842933 (accessed on 10 September 2025).
  47. Linux Kernel Organization. Perf—A Performance Counting Tool. 2024. Available online: https://perf.wiki.kernel.org/index.php/Main_Page (accessed on 11 January 2024).
  48. EEMBC. CoreMark-Pro. GitHub Repository. 2025. Available online: https://github.com/eembc/coremark-pro (accessed on 10 September 2025).
  49. PerfWiki. Counting with Perf Stat. Available online: https://perfwiki.github.io/main/tutorial/#counting-with-perf-stat (accessed on 5 January 2025).
  50. Weaver, V.M.; Terpstra, D.; Moore, S. Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations. In Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, 21–23 April 2013; pp. 215–224. [Google Scholar]
Figure 1. Overall workflow of the proposed malware detection methodology using TLBs. Benign benchmarks and simulated malware scripts (cryptominer, infector, scanner, ransomware, and rootkit) are executed alongside sensor programs while TLB-related performance data is collected with perf. The data is then preprocessed, features are engineered, and multiple models are trained and validated to classify task behavior. Unlike generic ML pipelines, this figure highlights our novel integration of sensor-assisted TLB event data with diverse program behaviors, which produce distinctive TLB activity patterns used for classification.
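For concreteness, the counter-collection step in this workflow can be reproduced with the Linux perf tool [47,49]. The sketch below is a minimal illustration rather than the study's exact harness: the sampling interval, the CSV output handling, and the monitored command are assumptions.

import subprocess

# The four TLB-related events evaluated in Tables 2-4.
EVENTS = [
    "dTLB-loads",
    "dTLB-store-misses",
    "dtlb_load_misses.walk_completed",
    "itlb_misses.stlb_hit",
]

def count_tlb_events(cmd, interval_ms=100):
    # perf stat counts the named events while cmd runs; -I prints
    # running totals every interval_ms milliseconds, and -x "," emits
    # CSV output for easier preprocessing. perf stat writes its
    # counter report to stderr.
    perf_cmd = ["perf", "stat",
                "-e", ",".join(EVENTS),
                "-I", str(interval_ms),
                "-x", ","] + cmd
    result = subprocess.run(perf_cmd, capture_output=True, text=True)
    return result.stderr

# Example (hypothetical task binary): count_tlb_events(["./task_under_test"])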
Figure 2. These panels illustrate the behavior of the dTLB-loads counter for the malware tasks under a specific configuration, showing the characteristic pattern of each task. The red lines indicate when the counters start and stop, while the blue lines mark the start and end of the task execution.
Figure 3. These panels illustrate the behavior of the dTLB-loads counter for a specific configuration, showing the characteristic patterns for each benchmark. The red lines indicate when the counters start and stop, while the blue lines mark the start and end of the benchmark execution.
Table 1. Programs (sensors) and core configurations [10].

Core Configuration    Only Counters   Benign   TLB Active   Both
Same Logical          A1              B1       C1           D1
SMT                   A2              B2       C2           D2
Different Physical    A3              B3       C3           D3
Hybrid                A4              B4       C4           D4
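The core configurations above are realized by pinning the sensor and the monitored task to specific logical CPUs. The sketch below shows one way to enforce such placements from user space on Linux; the binary names and CPU numbers are placeholders, since the true sibling and physical-core mapping depends on the machine's topology (readable via lscpu).

import subprocess

def run_pinned(cmd, cpu):
    # taskset -c restricts the process to the given logical CPU, which
    # is how the Same Logical, SMT, Different Physical, and Hybrid
    # placements can be enforced without kernel changes.
    return subprocess.Popen(["taskset", "-c", str(cpu)] + cmd)

# Hypothetical SMT-style placement: logical CPUs 0 and 4 are sibling
# threads of one physical core on many Intel parts; verify with lscpu.
sensor = run_pinned(["./tlb_sensor"], 0)       # placeholder sensor binary
task = run_pinned(["./task_under_test"], 4)    # placeholder task binary
task.wait()
sensor.terminate()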
Table 2. Binary classification accuracy of the voting classifier (Random Forest, XGBoost, and LightGBM) and ANN models with different counter combinations. Counter 1: dTLB-store-misses; Counter 2: dTLB-loads; Counter 3: dtlb_load_misses.walk_completed; Counter 4: itlb_misses.stlb_hit.

Counter 1   Counter 2   Counter 3   Counter 4   ANN      Voting Classifier
x           x           x           x           0.7468   0.8108
x           x           x                       0.7475   0.7957
x           x                       x           0.7636   0.7987
x                       x           x           0.7477   0.7859
            x           x           x           0.7753   0.8023
x           x                                   0.7313   0.7893
x                       x                       0.7204   0.7793
x                                   x           0.7342   0.7735
            x           x                       0.7314   0.7862
            x                       x           0.7583   0.7904
                        x           x           0.7281   0.7888
                                    x           0.6971   0.7476
                        x                       0.7306   0.7518
            x                                   0.6759   0.7336
x                                               0.7081   0.7642
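The voting classifier reported in Tables 2–4 combines Random Forest [26], XGBoost [30], and LightGBM [31]. A minimal sketch using their scikit-learn-compatible interfaces follows; the soft-voting mode and all hyperparameters are illustrative assumptions, not the study's tuned settings.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def build_voting_classifier():
    # voting="soft" averages predicted class probabilities across the
    # three tree ensembles; hard majority voting is equally plausible
    # from the text of the paper.
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("xgb", XGBClassifier(eval_metric="logloss")),
            ("lgbm", LGBMClassifier()),
        ],
        voting="soft",
    )

# Usage: clf = build_voting_classifier(); clf.fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)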
Table 3. Four-way classification accuracy of the voting classifier (Random Forest, XGBoost, and LightGBM) and ANN models with different counter combinations. Counter 1: dTLB-store-misses; Counter 2: dTLB-loads; Counter 3: dtlb_load_misses.walk_completed; Counter 4: itlb_misses.stlb_hit.

Counter 1   Counter 2   Counter 3   Counter 4   ANN      Voting Classifier
x           x           x           x           0.6324   0.7236
x           x           x                       0.6084   0.7138
x           x                       x           0.6628   0.7061
x                       x           x           0.6148   0.7003
            x           x           x           0.6509   0.7210
x           x                                   0.5974   0.6912
x                       x                       0.5784   0.6723
x                                   x           0.6288   0.6724
            x           x                       0.6450   0.7106
            x                       x           0.6359   0.7063
                        x           x           0.6112   0.6948
                                    x           0.5827   0.6493
                        x                       0.6154   0.6650
            x                                   0.5154   0.6163
x                                               0.5700   0.6559
Table 4. Ten-way classification accuracy of the voting classifier (Random Forest, XGBoost, and LightGBM) and ANN models with different counter combinations. Counter 1: dTLB-store-misses; Counter 2: dTLB-loads; Counter 3: dtlb_load_misses.walk_completed; Counter 4: itlb_misses.stlb_hit.

Counter 1   Counter 2   Counter 3   Counter 4   ANN      Voting Classifier
x           x           x           x           0.5079   0.6103
x           x           x                       0.4934   0.5922
x           x                       x           0.4714   0.5917
x                       x           x           0.4693   0.5785
            x           x           x           0.5051   0.6048
x           x                                   0.4594   0.5666
x                       x                       0.3897   0.5169
x                                   x           0.4458   0.5429
            x           x                       0.4899   0.5844
            x                       x           0.4861   0.5846
                        x           x           0.4467   0.5594
                                    x           0.4059   0.5166
                        x                       0.4531   0.5224
            x                                   0.3573   0.4428
x                                               0.3799   0.4836
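The tables do not specify the ANN architecture. In the spirit of [24], the following Keras sketch shows one plausible fully connected classifier for the 10-class scenario; the layer sizes, activations, and optimizer are assumptions rather than the study's configuration.

from tensorflow import keras

def build_ann(input_dim, n_classes=10):
    # A small fully connected network; integer class labels pair with
    # the sparse categorical cross-entropy loss used here.
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model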