Article

Advanced Side-Channel Profiling Attacks with Deep Neural Networks: A Hill Climbing Approach

1 Department of Electrical and Computer Engineering, New York University, 5 MetroTech Center, Brooklyn, NY 11201, USA
2 EMARATSEC Center, New York University Abu Dhabi, Abu Dhabi P.O. Box 129188, United Arab Emirates
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3530; https://doi.org/10.3390/electronics13173530
Submission received: 25 July 2024 / Revised: 1 September 2024 / Accepted: 2 September 2024 / Published: 5 September 2024

Abstract

Deep learning methods have significantly advanced profiling side-channel attacks, yet finding the optimal set of hyperparameters for these models remains challenging, and effective hyperparameter optimization is crucial for training accurate neural networks. In this work, we introduce a novel hill climbing optimization algorithm designed specifically for deep learning in profiled side-channel analysis. The algorithm iteratively explores the hyperparameter space through precise, localized adjustments. By incorporating performance feedback at each iteration, our approach efficiently converges on optimal hyperparameters, surpassing traditional Random Search methods. Extensive experiments covering protected implementations, leakage models, and various neural network architectures demonstrate that our hill climbing method achieves superior performance in over 80% of test cases, predicting the secret key with fewer attack traces and outperforming both Random Search and state-of-the-art techniques.

1. Introduction

The rise in the number of embedded electronic devices has propelled the prevalence of side-channel attacks. These attacks capitalize on vulnerabilities inherent in the physical implementations of algorithms, thus posing a significant security threat. The low cost of acquiring precise measurements of physical leakages, encompassing power consumption, timing, and electromagnetic emanations, further accentuates this vulnerability and reinforces the notion that both software and hardware implementations of algorithms can inadvertently divulge information [1]. In response, security experts have actively explored various side-channel analysis attacks and devised countermeasures to bolster the security of embedded systems. Established techniques like differential power analysis (DPA) and correlation power analysis (CPA) attacks are grounded in cryptographic algorithm theory and statistical analysis. Executing these attacks therefore demands substantial background understanding, as they rely on analytical expertise rather than tunable hyperparameters [2].
In contemporary times, profiling attacks have emerged as a dominant variant of side-channel analysis attacks, with a notable emphasis on those that are powered by machine learning. Noteworthy reports emphasize that profiling attacks exhibit superior efficiency and accuracy in deciphering secret keys compared to conventional techniques like DPA or CPA. Importantly, profiling attacks possess the distinct advantage of effectively detecting concealed information, even when countermeasures such as masking and random delay are employed [3,4,5]. As a result, substantial research and focus have converged on refining and advancing this method. In a profiling attack scenario, the adversary leverages a profiling device to meticulously study the target device prior to mounting an attack, often involving careful hyperparameter tuning to optimize the attack’s effectiveness.
The inherent similarity between profiling and supervised machine learning renders the framing of profiling side-channel analysis (SCA) attacks as machine-learning classification problems feasible. The profiling data gathered by an adversary from studying a cloned device conveniently serves as training data for this purpose. Conversely, the data derived from the target device under attack is akin to the test data in the machine learning paradigm. Numerous studies have showcased the effectiveness of machine learning techniques in the domain of side-channel attacks [4,6,7,8]. A range of machine learning algorithms, such as support vector machines (SVM), random forest (RF), and deep learning, have been harnessed to demonstrate their prowess in these contexts, with careful hyperparameter tuning playing a crucial role in maximizing their performance [9,10].
Moreover, the adoption of deep learning techniques has garnered heightened attention due to their ability to produce successful attacks without requiring intricate feature engineering. Notably, techniques like the multi-layer perceptron (MLP) and convolutional neural network (CNN) have emerged as prominent choices. In the profiling phase of these methods, deep learning models are trained using diverse side-channel information sources such as power consumption, electromagnetic emanations, and timing. However, the distinctive aspect of deep learning in the context of side-channel attacks is the application of specific side-channel attack-oriented metrics [11,12]. While conventional machine learning primarily relies on accuracy as a performance metric, deep learning-based SCA often employs success rate and guessing entropy as the key evaluation criteria. Although these SCA metrics have offered better attack performance, the complexity of the hyperparameter search space limits the performance of deep learning-based side-channel attack methods.
Hyperparameter tuning is of paramount importance in enhancing the efficacy of deep learning-based side-channel attacks. Various strategies have been employed to optimize hyperparameters during the model training process. For instance, Perin et al. employed grid search for hyperparameter tuning in their study [13], while Wu et al. applied Bayesian optimization [14]. Random search has equally established its foothold as a reliable tuning technique because it is more efficient than grid search in large search spaces and has the advantage of being easily parallelized [14]. Furthermore, regularization techniques have been seamlessly integrated into numerous deep learning methods, aiming to bolster the generalization capacity of trained models and, consequently, enhance their overall performance. Gupta et al. proposed a hyperparameter tuning method based on fully black-box neural architecture search (NAS) [15]. They applied Random Search to optimize 1-D CNNs, demonstrating an effective approach to model optimization. Robissout et al., on the other hand, focused on optimizing the profile of attack traces rather than hyperparameter tuning [16,17]. They introduced a novel metric and scoring loss, which allows deep learning models to assign scores to attack predictions, thereby facilitating the effective ordering of attack traces. Li et al. presented a deep learning-based side-channel attack model tailored for different block ciphers [18]. Their approach, formulated as a regression model, specifically addresses AES and PRESENT ciphers, employing grid search combined with expert knowledge for optimization. Ni et al. investigated a side-channel attack scheme based on CNN model fusion [19]. In this approach, they randomly selected N sets of hyperparameters from the search space and trained CNN models on a side-channel dataset to create N base models. By merging high-dimensional features from the middle layers of these base models, they enhanced the model’s generalization ability. Their results demonstrated that the fusion model improved attack capability while simplifying hyperparameter tuning across various public datasets. Krček et al. explored the challenges of hyperparameter tuning in deep learning-based profiling side-channel analysis, particularly across different side-channel datasets [20]. They proposed using autoencoders for dimensionality reduction to test whether encoded datasets could enable the portability of profiling models and reduce tuning efforts. Their findings revealed that, while autoencoders did not significantly reduce tuning efforts for the original datasets, they did improve model portability and simplify hyperparameter tuning in transfer learning scenarios. Despite the advancements reported across the literature, these approaches lack the exploration of local search performance within hyperparameter tuning mechanisms. Consequently, in this research, we investigated hill climbing optimization search as a novel avenue for hyperparameter tuning, thus addressing this existing gap.
Our main contributions are as follows:
  • To the best of our knowledge, we are the first to propose a hill climbing algorithm for hyperparameter tuning in deep learning-based SCA. For a comprehensive analysis, we compared our proposed algorithm with a baseline method, Random Search (RS), which is widely used in the SCA community. Our proposed method succeeded in predicting the secret key in all test cases, in contrast to the inconsistent performance exhibited by Random Search.
  • We conducted extensive experiments on benchmark datasets in SCA research and subsequently demonstrated our proposed method’s competitive performance over the baseline methods.
  • We illustrate the strength of the hill climbing algorithm relative to Random Search: in all test cases, on both simple and complex datasets, the hill climbing algorithm showed more promising performance than Random Search.
The rest of this paper is organized as follows: Section 2 describes the background. Section 3 introduces the dataset frameworks and structures. Section 4 discusses the methodology, leakage models, SCA metrics, and hill climbing optimization. Section 5 details the hill climbing search framework. Section 6 analyzes the test cases and results. Section 7 presents the discussion, and Section 8 wraps up with future considerations and a conclusion of our findings.

Related Works

Numerous studies in the literature have extensively investigated the application of deep learning techniques in machine learning-based SCA, consistently yielding superior results. For instance, Maghrebi et al. employed deep learning to construct a highly accurate model for profiling side-channel leakage, resulting in successful key recovery in unprotected side-channel implementations [3]. Their findings indicate that the proposed deep learning-based SCA techniques outperform traditional machine learning methods.
Further supporting the suitability of deep learning in SCA attacks, Masure et al. provided theoretical and experimental evidence demonstrating that minimizing the negative log likelihood (NLL) loss function during deep learning training yields robust results in side-channel attack scenarios [21]. The authors, however, did not factor hyperparameters into their formulation, relying instead on expert domain knowledge. Additionally, Leo showcased the potency of convolutional neural networks in the SCA of AES, but their methodology did not account for hyperparameter tuning [22].
Ensemble techniques have also been explored in deep learning-based SCA. Perin et al. leveraged hyperparameter tuning in training an ensemble of deep learning models for prediction, resulting in improved generalization capabilities [13]. The drawback with the proposed ensemble tuning method is that it can be a lengthy process, requiring first training several models in an ensemble using a grid search. Similarly, Wang et al. demonstrated the efficacy of ensembles in deep learning side-channel attacks, exploring different attack points and showcasing superior attack performance with their tandem approach [23]. The downside with the tandem approach is that it under-utilizes the combinations in hyperparameter search space.
Additionally, other studies have applied deep learning techniques to tackle the challenge of imbalanced data. Picek et al. identified the imbalanced nature of the data in ML-based SCA arising from Hamming Weight leakage models. To address this issue, they employed the synthetic minority oversampling (SMOTE) balancing technique, which yielded encouraging performance in SCA attacks [24]. Without systematic hyperparameter tuning, the researchers compared different sampling techniques, including random undersampling and random oversampling, and found that SMOTE effectively mitigated the imbalanced data problem. They relied on expert domain knowledge to set hyperparameters, which is difficult to replicate when faced with an unfamiliar dataset.
Ito et al. took another approach to handling data imbalance (without considering hyperparameter tuning): they used the Kullback–Leibler (KL) divergence as an evaluation metric and adopted a key value-based likelihood function instead of Hamming Weight (HW) or Hamming Distance (HD) models, relying on experiential knowledge for hyperparameter tuning. Their method proved effective in addressing the data imbalance in SCA datasets [25]. Furthermore, Paguada et al. introduced an optimized form of the early stopping strategy for deep learning models, defining new metrics of patience and persistence and employing grid search for hyperparameter tuning. This innovative approach not only improved the overall performance of deep learning models, but also reduced computation overhead [26]. On the other hand, the use of a grid search hyperparameter tuning technique contributes heavily to slowing down the training process.
Zhang et al. proposed the cross entropy ratio (CER) metric to bridge the gap between deep learning metrics and side-channel metrics, exhibiting significant performance with imbalanced data [27]. The added advantage of the CER metric lies in its low computational complexity, making it an efficient choice that does not require mounting an attack like the guessing entropy (GE) or success rate (SR) metrics. However, the authors downplayed the significance of hyperparameter tuning, instead relying on their domain knowledge. Furthermore, Kubota et al. investigated deep learning-based SCA against hardware implementations of AES and introduced a mixed-model dataset without incorporating a hyperparameter tuning technique. Their findings, although encouraging, demonstrated that deep learning without hyperparameter tuning yields only a limited attack against protected AES implementations in ASIC [28]. These approaches collectively showcase the effectiveness of hyperparameter-tuned deep learning techniques in addressing data imbalance challenges in SCA, further enhancing the reliability and robustness of SCA attack strategies.
In order to examine the impact of each hyperparameter of a CNN on the training process, Zaid et al. conducted a comprehensive study on the intricacies of deep neural networks [29]. However, this approach necessitates prior knowledge of the datasets being targeted by adversaries. In contrast, Wu et al. proposed an automated approach for hyperparameter tuning using Bayesian optimization, which demonstrated superior performance compared to reinforcement learning [14]. However, their proposed Bayesian approach was marginally better than the Random Search and, considering that Random Search is a faster technique, the Random Search seemed more highly favored overall.
Additionally, Kim et al. demonstrated the effectiveness of adding artificial noise to a CNN, resulting in more efficient breaking of protected implementations [30]. For anyone without prior knowledge of the dataset, it would be time consuming to replicate their models’ performance. Overall, there remains ample room for further investigation into the role of hyperparameter tuning in SCA. Currently, methods like simulated annealing, iterated local search, Random Search, and grid search are well suited for problems with smaller hyperparameter spaces. However, for more complex problems with larger search spaces, Bayesian optimization, genetic algorithms, and augmented hill climbing have been successfully employed in machine learning but not SCA attacks [31]. These diverse optimization techniques offer valuable tools to tackle a wide range of problem types and complexities.
Finding the optimal hyperparameters for deep learning models can be highly computationally expensive in terms of time and resources. The limitations of available computational resources make it impossible to explore all possible architectures of the neural network model. Therefore, the multitude of feasible configurations often requires specialized knowledge to narrow down the search region and achieve a desirable architecture. As our survey of current hyperparameter tuning techniques shows, existing neural architecture search (NAS) algorithms do not yet offer the combined advantages of speed, exploitation, and exploration. As a result, we propose the hill climbing algorithm (HCA) optimization technique as an efficient and robust hyperparameter tuning method to tackle the complex hyperparameter search space associated with deep learning-based SCA [32].
In this project, we formulate and explore the hill climbing algorithm for its robustness, flexibility, and scalability [33]. This algorithm possesses a unique ability to navigate search spaces, making it robust in handling local optima scenarios. It is worthwhile to mention that we are the first to propose the hill climbing algorithm for hyperparameter tuning in SCA attacks. The algorithm can facilitate either local exploitation or global exploration, and its convergence results from striking a balance between the two [34]. This well-balanced approach ensures that the algorithm effectively explores the search space while efficiently exploiting promising solutions, leading to the discovery of optimal or near-optimal solutions.

2. Background

In this section, we will examine the optimization techniques currently employed in the SCA hyperparameter tuning problem. We will follow this up with preliminaries on profiling side-channel attacks, supervised machine learning, and experimental frameworks.

2.1. Current Hyperparameter Tuning Techniques

Grid search is a widely used hyperparameter tuning technique for conducting side-channel analysis attacks. However, it is computationally expensive, especially when applied to deep learning architectures with numerous tunable hyperparameters. This challenge becomes even more pronounced when working with high-dimensional datasets, making grid search impractical for most side-channel analysis scenarios. Consequently, Random Search is more commonly employed in this domain due to its efficiency and simplicity. Given these factors, Random Search has been chosen as the baseline technique for comparison in this research.
Ref. [1] highlighted that hyperparameter tuning is a critical factor in training an effective profiling model for side-channel attacks. However, the authors did not propose a specific methodology for hyperparameter tuning. Instead, they offer recommended ranges of hyperparameters to explore during model training, leaving the selection process largely dependent on expert guidance. As a result, this approach may be challenging for new researchers in the field of deep learning-based SCA to adopt effectively as it assumes a level of expertise that might not be immediately accessible to beginners.
Ref. [35] proposed a methodology for optimizing hyperparameters to build efficient CNN architectures that balance attack efficiency and network complexity. Their approach emphasizes a deep understanding of the model’s inner workings, ensuring that the role of each hyperparameter is clearly understood in terms of explainability and interpretability. This methodology offers valuable insights for tailoring models to specific attack scenarios, contributing to more precise and targeted optimizations. However, the approach relies heavily on expert knowledge of the specific dataset being attacked, which limits its generalizability to other datasets. Additionally, it requires significant time and effort to fine tune hyperparameters, making it less practical in scenarios with limited resources.
Ref. [36] applied reinforcement learning to tune CNN hyperparameters. The authors achieved this by exploring the Q-learning algorithm and subsequently developing two reward functions based on side-channel metrics. The strength of this approach lies in its ability to automate the hyperparameter tuning process and effectively generalize across different datasets. As a result, the methodology identifies several optimal hyperparameters that lead to successful attacks. However, the search process is extremely time consuming, which can be a significant drawback in practical applications where computational efficiency is critical.
Several studies have demonstrated that CNNs are highly effective at extracting relevant features from noisy side-channel data, which significantly enhances the success rate of attacks on AES implementations [35,37]. In the context of other cryptographic implementations, such as ASCON, researchers have utilized ensemble learning to address the challenges of optimizing CNN architecture and hyperparameter tuning [38]. This shows that hyperparameter tuning in deep learning for side-channel analysis is a widespread issue across various cryptographic algorithms. Additionally, Ref. [39] explored the use of a sampling algorithm combined with an early-stopping mechanism for hyperparameter optimization.
Ref. [14] proposed a hyperparameter search technique based on Bayesian optimization. Their experimental analysis demonstrates that the framework performs robustly across different datasets and leakage models, highlighting its versatility in various side-channel attack scenarios. The key advantage of this approach lies in its ability to effectively balance exploration and exploitation during the search process, leading to more precise hyperparameter configurations. However, the significant drawback of this method is its high computational complexity, which can be resource intensive and time consuming. This complexity may limit the method’s scalability, particularly in scenarios where rapid model deployment is essential.
HCA addresses these shortcomings by offering several key improvements: iterative hyperparameter refinement with reduced computational cost, adaptability across varying datasets, and effectiveness in overcoming local optima through adaptive search steps. Positioning HCA within the broader landscape of existing techniques highlights both the novelty of this approach and its practical advantages in SCA applications. These features make HCA a valuable alternative for scenarios where traditional methods struggle with efficiency and flexibility.

2.2. Notation

In this paper, calligraphic letters such as $\mathcal{X}$ denote sets and, when $\mathcal{X}$ is finite, $|\mathcal{X}|$ denotes its cardinality. $X$ denotes a random variable and $\mathbf{X}$ a random vector over $\mathcal{X}$. The corresponding lowercase letters $x$ and $\mathbf{x}$ represent the realizations of $X$ and $\mathbf{X}$, respectively, and the $i$-th entry of the vector $\mathbf{x}$ is $\mathbf{x}[i]$. Each entry $\mathbf{x}[i]$ has an associated key candidate and plaintext byte. Let $k$ be a key candidate with possible values in the key space $\mathcal{K}$, and let $k^{*} \in \mathcal{K}$ be the correct key. The key guess associated with $\mathbf{x}[i]$ is $k_i$ and, likewise, the associated plaintext is $PT_i$.

2.3. Profiled Side-Channel Attack

In this work, the profiling side-channel attack is considered. It constitutes a worst-case security analysis since it assumes a powerful attacker who has access to a clone device and knows its secret key. The profiling attack consists of the following two phases.

2.3.1. Profiling Phase

In this step, using the clone device, the attacker obtains a dataset of $N_p$ profiling traces by encrypting known plaintexts under known secret keys. It is worth mentioning that the profiling traces are independent and identically distributed (i.i.d.). The data capture results in a profiling dataset, which is used to characterize the leakage $X$. To characterize the leakage, a model is built that maps the secret key-dependent intermediate variable to the physical leakages contained in the measured traces.

2.3.2. Attack Phase

This stage begins with the attacker obtaining a set of $N_q$ traces from the target device under attack, independent of the profiling traces. To do this, the adversary uses the target device to encrypt $N_q$ plaintexts and records the resulting traces. Using the model trained in the profiling phase, the attacker computes a prediction vector of probabilities for each attack trace. Based on this vector, the attacker chooses the best key candidate.

2.4. Supervised Machine Learning

The profiled attack we are investigating is a classification problem, as it involves evaluating a set of possible key values and selecting one of the options, hence necessitating the training of a classifier [40,41]. Therefore, in this work, we examined deep learning-based SCAs against AES. In these attacks, the deep learning model is trained on the output of the S-box in the first round of AES. Training is carried out with the profiling data obtained using a clone device. The goal of the training or profiling process is to learn how the device leaks information. While training, the adversary encrypts several plaintexts with known secret keys and collects the resulting traces. To ensure a sufficiently generalized training model, the side-channel traces should result from random inputs, such as a random plaintext $pt$ and a random key $k$. This cycle is repeated several times to obtain $N_p$ profiling traces. Thus, we can define the training data $S_p$ comprising the $N_p$ traces as follows:
$$S_p = \{ (k_i, PT_i, \mathbf{x}_i) \mid 1 \le i \le N_p \},$$
where $k_i$ denotes the partial key (one byte) of the full key; $PT_i$ is the plaintext corresponding to the partial key $k_i$; $\mathbf{x}_i$ is the profiling trace collected from the clone or profiling device; and $N_p$ is the total number of traces in the profiling stage. With the training data in place, the deep learning model is subsequently trained. The leakage being characterized is at the output of the S-box in the first round; therefore, the label is set to $l_i = HW(Sbox(k_i \oplus PT_i))$. As a result, the output of the deep learning model is a probability for each of nine classes, corresponding to HW values from zero to eight. To obtain the class probabilities, the final layer of the deep learning model is a softmax activation layer. Thus, the trained model uses the side-channel traces as the feature data and the HW of the intermediate values as the labels; achieving an accurate, generalized DL model is therefore one step from predicting the secret key. Upon training, a set $V$ of validation traces was used to validate the performance of the trained model and ensure that it generalized.
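As an illustration, the following minimal Python sketch computes these labels for a batch of key and plaintext bytes; the function name is our own choice, and the only external ingredient is the standard AES S-box table:

import numpy as np

# Standard AES S-box lookup table (256 entries).
AES_SBOX = np.array([
    0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
    0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
    0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
    0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
    0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
    0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
    0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
    0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
    0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
    0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
    0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
    0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
    0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
    0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
    0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
    0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16,
], dtype=np.uint8)

def hw_labels(keys, plaintexts):
    """Label each profiling trace with l_i = HW(Sbox(k_i XOR PT_i))."""
    sbox_out = AES_SBOX[np.bitwise_xor(keys, plaintexts)]
    # unpackbits expands each byte into its 8 bits; summing gives the HW.
    return np.unpackbits(sbox_out[:, None], axis=1).sum(axis=1)

# Example: Sbox(0x2B ^ 0x00) = 0xF1, and HW(0xF1) = 5.
# hw_labels(np.array([0x2B], dtype=np.uint8), np.array([0x00], dtype=np.uint8))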
The completion of the profiling or training phase was an indication that the characterization of the clone device was completed and the attack phase can be initiated. In this phase, the adversary collects the attack traces and S-box output. Thus, we can define the attack data S q to comprise N q traces as follows:
$$S_q = \{ (PT_i, \mathbf{x}_i) \mid 1 \le i \le N_q \},$$
where $PT_i$ is the plaintext corresponding to the unknown key $k^{*}$; $\mathbf{x}_i$ is the attack trace collected from the device under attack; and $N_q$ is the total number of traces in the attack stage. After this, using the trained model from the profiling phase, a prediction vector $Pr$ was computed for each attack trace. That is, a score was assigned to each class, and the highest score indicates to which of the nine classes the attack trace belongs. Thus, for any given trace, the class with the highest probability corresponds to the intermediate value that the trace leaks most strongly compared to all other intermediate values.
Furthermore, upon obtaining the prediction vectors for all the attack traces, the secret key prediction follows. It is worth noting that, for the key prediction task in this work, unlike typical machine learning problems, all the attack traces together contributed to the key prediction; that is, a key ranking method was employed. Therefore, if the trained model generalizes well and the number of test traces is sufficiently large, the likelihood of the secret key is higher than the likelihood of the incorrect key candidates.
Consequently, key ranking rather than accuracy has proven to be one of the best and most used metrics in SCA attacks [30]. Hence, the output class probabilities of each trace were considered in the key ranking calculation, as they carry crucial information. The classes represent intermediate values while the secret key is our objective, and key ranking bridges this gap. For every trace $i$ in $Q$, given our trained model with $j$ possible classes and $k$ possible key guesses, we evaluate every key guess $k$. Since each trace $i$ has an accompanying plaintext in the dataset, we can compute the S-box output under key guess $k$ and apply the Hamming weight to obtain the class $j$, then look up the predicted probability $P_{i,j}$. The logarithm of this probability is accumulated over all $Q$ traces of the attack set, and the same is done for every key guess $k$. Thus, the trained classifier predicts the secret key for the attack set $Q$ by selecting the key $k_Q$ with the maximum log-likelihood, as given in Equation (4). If $k_Q$ equals the correct key $k^{*}$, then the model has predicted the secret key correctly.
$$S(k) = \log P_{k,Q} = \sum_{i=1}^{Q} \log P_{i,j},$$
$$k_Q = \arg\max_{k} \log P_{k,Q}.$$
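A minimal sketch of this accumulation under the HW leakage model is shown below; it assumes the AES_SBOX table from the earlier labeling sketch, and the variable names are illustrative:

import numpy as np

def key_log_likelihood_scores(preds, plaintexts, sbox):
    # preds: (N_q, 9) softmax outputs under the HW model;
    # plaintexts: (N_q,) uint8 plaintext byte per attack trace.
    eps = 1e-36                      # guard against log(0)
    idx = np.arange(len(plaintexts))
    scores = np.zeros(256)
    for k in range(256):
        # HW class j that each trace would leak if k were the key.
        j = np.unpackbits(sbox[plaintexts ^ k][:, None], axis=1).sum(axis=1)
        scores[k] = np.log(preds[idx, j] + eps).sum()  # S(k) = sum_i log P_{i,j}
    return scores

# The predicted key maximizes the accumulated log-likelihood:
# k_Q = int(np.argmax(key_log_likelihood_scores(preds, plaintexts, AES_SBOX)))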

3. Dataset Frameworks and Structures

In this work, we will investigate the ASCAD fixed and variable key datasets, which are based on the AES-128 block cipher. These two datasets are publicly available and they have been studied in previous studies [13,21].
The ASCAD dataset was introduced by Prouff et al. [42]. It targets the electromagnetic emanations (EM) resulting from a protected software implementation of AES-128 on the ATMEGA8515, an 8-bit AVR architecture, with a Boolean masking countermeasure. In our experiments, for each dataset, we utilized 50,000 profiling traces and 10,000 attack traces, for a total of 60,000 traces. Each trace contains 700 time samples, corresponding to the operations of the third masked S-box in the first round.
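As a hedged illustration, traces and metadata can be loaded with h5py, assuming the public ASCAD HDF5 layout (groups Profiling_traces and Attack_traces, each holding traces, labels, and a structured metadata dataset); the file path and the standardization step are our own choices:

import h5py
import numpy as np

with h5py.File("ASCAD.h5", "r") as f:                                  # placeholder path
    X_prof = np.array(f["Profiling_traces/traces"], dtype=np.float32)  # (50000, 700)
    y_prof = np.array(f["Profiling_traces/labels"])                    # leakage labels
    X_att = np.array(f["Attack_traces/traces"], dtype=np.float32)      # (10000, 700)
    pt_att = np.array(f["Attack_traces/metadata"]["plaintext"])        # per-trace plaintexts

# Standardizing traces with profiling-set statistics is a common preprocessing step.
mu, sigma = X_prof.mean(axis=0), X_prof.std(axis=0)
X_prof = (X_prof - mu) / sigma
X_att = (X_att - mu) / sigma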

4. Methodology

4.1. Leakage Model

The data leakage model describes the relationship between the number of bits switching from one state to another and the observed side-channel signal [43]. Therefore, in this project, we used the following two leakage models.

4.1.1. Identity Model

In the identity model, every possible value of the eight-bit output of the S-box represents its own class, which results in 256 possible classes of the output of the ML task.

4.1.2. Hamming Weight (HW) Model

In this leakage model, the class label of a given eight-bit value is the number of ones in its binary representation; therefore, we obtain nine possible classes (0 through 8). Since a Hamming weight between 0 and 8 fits in four bits, this more or less compresses the eight-bit representation to four bits, as applied to the intermediate values before and after the S-box in the AES cipher.
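The two leakage models therefore differ only in how the S-box output byte is mapped to a class label, as the following small sketch illustrates (function names are ours):

def id_label(sbox_out: int) -> int:
    # Identity model: each 8-bit S-box output is its own class (256 classes).
    return sbox_out

def hw_label(sbox_out: int) -> int:
    # HW model: count the set bits, giving classes 0..8 (9 classes).
    return bin(sbox_out).count("1")

assert id_label(0xF1) == 241 and hw_label(0xF1) == 5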

4.2. SCA Metric

The two SCA metrics utilized in this work are as follows:

4.2.1. Guessing Entropy (GE)

The GE provides the average rank of the correct key $k^{*}$ in a key guessing vector $g$ obtained from processing $Q$ attack traces. From these $Q$ traces, a subset of fewer than $Q$ traces is randomly selected to constitute the experimental set used to estimate the GE. Repeating this experiment a number of times, usually 100, and taking the average rank of the correct key $k^{*}$ provides a more accurate estimate of the GE. The goal of any SCA method is to recover the correct key from the set of all possible keys within a minimal number of traces; hence, the GE performance metric was adopted in this project. Given $Q$ traces in the attack phase, an attack outputs a key guessing vector $g = [g_1, g_2, \ldots, g_{|\mathcal{K}|}]$ in decreasing order of probability, where $|\mathcal{K}|$ is the size of the key space. The most likely key candidate is $g_1$ and the least likely is $g_{|\mathcal{K}|}$.
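A sketch of this estimation procedure is given below, assuming a precomputed matrix of per-trace log-probabilities for each key guess (as produced, e.g., by the key-ranking sketch above); here a GE of 1 means the correct key is ranked first:

import numpy as np

def guessing_entropy(key_log_probs, correct_key, n_experiments=100, seed=0):
    # key_log_probs: (N_q, 256) per-trace log-probability of each key guess.
    rng = np.random.default_rng(seed)
    n_traces = key_log_probs.shape[0]
    ranks = np.zeros((n_experiments, n_traces))
    for e in range(n_experiments):
        order = rng.permutation(n_traces)              # random trace ordering
        cum = np.cumsum(key_log_probs[order], axis=0)  # running score per key
        # Rank of the correct key: how many guesses score at least as high.
        ranks[e] = (cum >= cum[:, [correct_key]]).sum(axis=1)
    return ranks.mean(axis=0)  # ge[q]: average rank after q+1 traces; GE = 1 is success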

4.2.2. Leakage Distribution Difference (LDD) and Correlation with Key Guessing Vector

In LDD, the relationship between the correct key and the guessed keys contributes to estimating the SCA metric [44]; that is, for every key candidate $k$, a leakage distribution is estimated using all the plaintexts. The difference between the leakage distribution of the correct key and that of each key candidate gives rise to the LDD. The LDD of the correct key candidate is therefore zero, whereas, for incorrect key candidates, the LDD is higher. A distance measure over the leakage values is used to estimate the LDD, as shown below [14]:
$$LDD(k^{*}, k) = \sum_{i=0}^{Q} \left| f(d_i, k^{*}) - f(d_i, k) \right|, \quad \forall k \in \mathcal{K},$$
such that $f(d_i, k^{*})$ and $f(d_i, k)$ are the leakage functions that provide the leakage value for the data value $d_i$ under the correct key $k^{*}$ and under the candidate key $k$, respectively. Either the identity or the HW leakage model can be used in the LDD calculation.
In the next step, using the LDD and the key guessing vector, a correlation was estimated between the two. Thus, if the profiling model was accurate for the device under attack, then the correlation between these two variables will be strong; otherwise, it will be weak. This variation in the correlation serves as the SCA metric and it can be calculated as follows [14]:
$$L_m(LDD, g) = \mathrm{corr}(\mathrm{argsort}(LDD), g).$$
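The following sketch combines the two steps, assuming an HW leakage function and a key guessing vector g sorted from most to least likely; the helper names and the use of Pearson correlation are our assumptions:

import numpy as np
from scipy.stats import pearsonr

def leakage(plaintexts, key, sbox):
    # HW leakage function f(d_i, k) for every plaintext byte d_i.
    return np.array([bin(sbox[pt ^ key]).count("1") for pt in plaintexts], float)

def lm_metric(plaintexts, guess_vector, correct_key, sbox):
    ref = leakage(plaintexts, correct_key, sbox)
    # LDD(k*, k): accumulated leakage difference; zero for the correct key.
    ldd = np.array([np.abs(ref - leakage(plaintexts, k, sbox)).sum()
                    for k in range(256)])
    # Correlate the LDD ordering of candidates with the key guessing vector g.
    r, _ = pearsonr(np.argsort(ldd), guess_vector)
    return r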

4.3. Hill Climbing Optimization

HCA optimizes a given objective function $f(x)$, where $x$ is a vector of discrete or continuous values. Given an objective function $f(x)$, the optimal hyperparameters $x^{*}$, which maximize it, can be found through updates of the form
$$x^{(t+1)} = x^{(t)} + \Delta x,$$
where $\Delta x > 0$ and $f(x^{(t+1)}) \ge f(x^{(t)})$. Thus, for a current solution $x^{(t)}$ at iteration $t$, $f(x^{(t)})$ is the objective function value; $x^{(t+1)}$ is the solution at the next iteration $t+1$, with objective function value $f(x^{(t+1)})$. The algorithm works iteratively, considering neighboring solutions and moving in the direction of improvement until it reaches a local optimum. Thus, under the constraint of a fixed number of iterations, the optimum $x^{*}$ is determined.

5. Hill Climbing Search Framework

In this section, we propose the HCA to address the hyperparameter search problem. The algorithm is illustrated in Figure 1, with the pseudocode detailed in Algorithm 1. Our framework consists of two primary phases: the profiling phase and the attack phase. The hill climbing optimization occurs during the profiling phase. We focused on two neural networks—MLP and CNN—due to their prominence in the deep learning SCA community [1]. Thus, the hyperparameters we reference pertain specifically to MLP and CNN models. The efficacy of our proposed method will be demonstrated through comparisons with the widely used Random Search (RS) technique. RS is commonly employed in hyperparameter tuning for deep learning models in SCA due to its simplicity and its broad exploration of the search space.
As depicted in Figure 1, the algorithm begins by initializing the iteration count $a$ to 0. An arbitrary hyperparameter combination from the set $S$ is selected to define the architecture of a deep learning model. The model is then trained over ten epochs, resulting in a profiling model. Provided $a < 200$, the HCA optimization determines the next set of hyperparameters to explore in the subsequent iteration based on the performance of the current model. Drawing from the literature and an extensive calibration phase, and considering the limited training resources and time constraints in profiling attacks, we capped the number of iterations at 200. This optimization search completed in five hours on a machine with a CPU and a Tesla V100 GPU with 16 GB of memory and 5120 GPU cores.
After completing 200 iterations, the best model was selected from the set T of trained models based on the GE SCA metric. The GE was estimated by performing an attack on the traces from the validation set V. The GE serves as the objective function estimate, O, for the set of hyperparameters. Specifically, 5000 traces from the attack dataset were used as the validation data to estimate the GE. The best model was then identified, trained, and subsequently utilized again to attack the training dataset, as depicted in the flowchart in Figure 1. Similarly, the trained model was employed to attack the 5000 traces from the attack dataset Q to estimate the GE, which reflects the attack performance of the trained model.
For this work, the hyperparameters of the MLP considered included the number of dense or fully connected layers, the number of neurons in those layers, the learning rate, and the activation function across all layers; they are summarized in Table 1. The hyperparameter search space for the CNN is shown in Table 2.
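To make the search space concrete, the sketch below builds one candidate MLP from a hyperparameter combination using Keras; the dictionary keys are illustrative stand-ins for the entries of Table 1, not the table's actual values:

import tensorflow as tf

def build_mlp(hp, n_features=700, n_classes=9):
    # hp holds one combination, e.g. {"n_layers": 4, "n_neurons": 200,
    # "activation": "relu", "learning_rate": 1e-3} (illustrative keys).
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(hp["n_layers"]):
        x = tf.keras.layers.Dense(hp["n_neurons"], activation=hp["activation"])(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp["learning_rate"]),
        loss="sparse_categorical_crossentropy",  # integer HW labels 0..8
        metrics=["accuracy"],
    )
    return model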
Algorithm 1 Hill climbing algorithm pseudocode
Input: Search Space H consisting of MLP or CNN hyperparameters, objective function l, and neighborhood function N
  • Pick a hyperparameter sample $h_1 \in H$ at random
  • Evaluate $l(h_1)$; initialize a dummy value $l(h_0) = -\infty$ to start the algorithm; set $i = 1$
  • While  $l(h_i) > l(h_{i-1})$:
    • Evaluate $l(u)$ for all $u \in N(h_i)$
    • Set $h_{i+1} = \arg\max_{u \in N(h_i)} l(u)$; set $i = i + 1$
Output: Hyperparameters $h_i$
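A compact Python rendering of Algorithm 1 is given below; it assumes discrete option lists per hyperparameter, a neighborhood that moves one hyperparameter one step in its option list, and an objective l(h) that trains a model with h and returns a score to maximize (e.g., the negated GE). All names are illustrative:

import math
import random

def neighbors(h, space):
    # N(h): all combinations differing in exactly one hyperparameter by
    # one position in that hyperparameter's option list.
    out = []
    for name, options in space.items():
        pos = options.index(h[name])
        for j in (pos - 1, pos + 1):
            if 0 <= j < len(options):
                nb = dict(h)
                nb[name] = options[j]
                out.append(nb)
    return out

def hill_climb(space, objective, seed=0):
    rng = random.Random(seed)
    h = {name: rng.choice(opts) for name, opts in space.items()}  # random h_1
    best, prev = objective(h), -math.inf                          # l(h_0) = -inf
    while best > prev:                                            # while improving
        prev = best
        scored = [(objective(u), u) for u in neighbors(h, space)]
        score, cand = max(scored, key=lambda t: t[0])
        if score > best:                                          # greedy ascent step
            best, h = score, cand
    return h, best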

6. Understanding the Test Cases and Results

We detailed the hyperparameter values considered in training the ML models in Table 1 and Table 2. Table 1 lists the hyperparameters for MLP, while Table 2 presents those for CNN. The search space for MLP consisted of 23,040 combinations, whereas the CNN search space was significantly larger, with 637,009,920 combinations. Ideally, identifying the single best model would involve exhaustively exploring all possible combinations within the search space during hyperparameter tuning. However, due to constraints of time and memory, we limited our exploration to 200 combinations from the search space [14]. Therefore, for both the Random Search (RS) and the proposed hill climbing algorithm (HCA), we fixed the number of iterations at 200, which was deemed sufficient for this study. The general architecture of our algorithm was divided into two phases. As illustrated in the flowchart of Figure 1, the first phase involved randomly selecting a hyperparameter combination from either the MLP or CNN set, followed by ML training to produce a profiling model. Based on the performance of the initial profiling model, either RS or HCA determines the next hyperparameter combination. This process was repeated for the specified 200 iterations. From the 200 profiling models generated, the best model was selected, retrained, and then used to perform the attack on the attack dataset. For this study, we utilized the ASCAD and CHES CTF datasets, in both their fixed and random versions. The results were consistent across both datasets; however, due to page limitations, we present the results for ASCAD only. Table 3 includes other important design parameters employed in our experiment.
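The quoted search-space sizes follow directly from multiplying the number of options per hyperparameter. The snippet below illustrates the arithmetic with hypothetical option counts (not the actual contents of Table 1) that happen to multiply to the MLP figure:

from math import prod

# Hypothetical option counts per MLP hyperparameter; the real counts are
# given in Table 1. The search-space size is simply their product.
mlp_option_counts = {"layers": 6, "neurons": 10, "learning_rate": 4,
                     "activation": 4, "batch_size": 4, "optimizer": 6}
print(prod(mlp_option_counts.values()))  # 23040 for these illustrative counts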
We first analyzed the results from the ASCAD fixed dataset where the secret key was fixed and then the ASCAD variable dataset where the secret key was random.

6.1. ASCAD Fixed Key

For this experiment and all others, we considered three objective functions, namely key rank, $L_m$, and validation accuracy, training them for 10 and 50 epochs, thereby resulting in six plots. We started by examining the performance of the MLP under the ID leakage model, as shown in Figure 2a,b. With the HCA, all three objectives achieved successful secret key prediction. In particular, the key rank with fifty epochs was the most efficient, requiring 90 attack traces, demonstrating the efficacy of the HCA search optimization method. Likewise, with RS, as shown in Figure 2b, $L_m$ achieved the best performance, also requiring about 90 traces. In terms of performance, the HCA and the RS achieved similar results. In general, for both search methods, all of the objective functions led to correctly predicting the secret key, showing that the ASCAD fixed key was easy to break.
In the next phase, we studied the MLP performance under the HW leakage model, as presented in Figure 3a,b. Here, the HCA achieved clearly better performance on all three objectives. For the HCA, the $L_m$ and key rank objectives yielded the best attack performance of about 600 traces, while the validation accuracy required about 1000 traces, as shown in Figure 3a. For the RS, the key rank and $L_m$ also gave the best performance, requiring about 1000 traces, while the accuracy objective needed about 1500 traces, as seen in Figure 3b. In all test cases, for both HCA and RS, all the profiling models successfully predicted the secret key, again showing how easy it is to break the ASCAD fixed key.
Next, we examined the CNN performance. As in the previous stage, we first looked into the ID leakage model. With the HCA in Figure 4a, the key rank objective produced the best attack performance of about 300 attack traces; the $L_m$ and validation accuracy objectives followed with about 800 traces. In general, all objective functions resulted in a successful attack with the HCA. On the other hand, as shown in Figure 4b, the RS performance was clearly not comparable with the HCA: its best results came from the profiling models produced by the key rank and accuracy objectives, which required about 2500 traces, and the $L_m$ objective did not yield any successes. This illustrates the disadvantage of RS, as randomly selecting parameter combinations does not guarantee a successful attack. The CNN had a much wider search space, and the HCA, as a method that thrives by finding peaks, was more efficient than RS because it converged to a solution.
Next was the CNN performance using the HW leakage model. Figure 5a shows the results of the parameter combinations found by HCA hyperparameter tuning. The $L_m$ objective function yielded the profiling model with the best performance at about 700 traces, followed by the key rank and accuracy objectives, with about 800 and 1000 traces, respectively. For the RS, as shown in Figure 5b, the key rank achieved the best attack performance of about 1200 traces, followed by both the $L_m$ and accuracy objectives at about 1500 traces. To summarize, all objective functions succeeded in predicting the secret key for both the HCA and RS methods, but the HCA performed better than the RS, and the key rank objective yielded the best performance for both methods. Comparing the CNN's performance across leakage models, the CNN trained under the HW leakage model performed better than under the ID model because fewer classes had to be distinguished (9 as opposed to 256) given the same amount of training data, so the HW models generalized better after training. In addition, the MLP models' attack performance was generally better than the CNN models' because the MLP has a much smaller hyperparameter search space (about 23,000 combinations versus over 637 million for the CNN) while the number of traces in our training set remained 50,000 across all optimization search methods, which is why the MLP converged much faster.
We summarize the results obtained from this experiment in Table 4. The table shows the method that achieved the best performance in each category listed as a row across all objective functions. In three out of the four cases, HCA demonstrated superior performance. In the remaining test case, both methods achieved comparable results. Overall, these findings highlight the effectiveness of HCA in improving performance over RS in most scenarios.

6.2. ASCAD Variable Key

First, we analyzed the GE performance of the MLP on the ASCAD variable key data under the ID leakage model. With the HCA in Figure 6a, all three models yielded successful attacks, each requiring around 3000 traces. For the RS, on the other hand, only the key rank objective succeeded in predicting the secret key, as can be observed in Figure 6b, requiring about 3500 traces. The models that failed to break the secret key reflect the fact that the variable-key dataset is more difficult to train on than the fixed-key dataset. The HCA is therefore more capable of handling hyperparameter search under this scenario, further reinforcing the robustness of the algorithm.
Furthermore, we studied the GE performance of the MLP on the ASCAD variable key data under the HW leakage model. With the HCA in Figure 7a, the profiling model from the accuracy objective predicted the secret key in about 600 traces, making it the best among the three objectives, followed by the key rank and $L_m$ objectives with 1000 and 2000 traces, respectively. For the RS, as shown in Figure 7b, the key rank achieved the best performance, requiring about 700 traces, with accuracy and $L_m$ following at 800 and 1200 traces, respectively. Therefore, for MLP training under the HW leakage model, all objectives in both search methods produced profiling models that correctly predicted the secret key, which was not the case with the ID leakage model. The reason for the HW model's better performance is that HW leakage has fewer classes, i.e., 9 compared to the ID model's 256, making it faster to train a generalized model. It is worthwhile to mention that HCA's superior performance for both ID and HW leakage models was due to its flexibility and robustness in dealing with various hyperparameter search scenarios.
Next was the analysis of the CNN's performance. For the ID leakage model, none of the objective functions predicted the secret key within 5000 traces with the HCA, as summarized in Table 5 and seen in Figure 8a. However, we observed that the accuracy and $L_m$ objectives were within one rank of cracking the secret key; thus, we estimate that, with another 20 traces, these two objectives would predict the secret key. For RS, on the other hand, as shown in Table 5 and Figure 8b, only $L_m$ predicted the secret key successfully, needing the full 5000 traces. This observation corroborates the claim that the CNN is harder to train because of its much larger search space, which makes it difficult for the hyperparameter tuning algorithm to find a trainable hyperparameter combination. In addition, from our analysis, the HCA yielded better training outcomes here, as two of its objectives gave rise to accurate profiling models versus one for RS.
The last test case we examined was the GE performance of a CNN under the HW leakage model. Table 5 and Figure 9a,b show the performance for both the HCA and RS. With the HCA in Figure 9a, the $L_m$ and key rank objectives successfully predicted the secret key in 1400 and 4100 traces, respectively. With the RS in Figure 9b, on the other hand, all the objective functions produced trained models whose GE remained greater than 50. We concluded that, when training a CNN under the HW leakage model, the HCA profiles ML models better than the RS. Also, in this test case and all others, hyperparameter tuning with both HCA and RS performed better under the HW leakage model than under the ID leakage model. Lastly, our proposed HCA was very competitive, as shown in Table 6, when compared against [14,36] in complexity, that is, the number of trainable parameters and the time to reach a GE of 1, besting the existing methods in two of the four categories. We calculated the complexity of our models using the formulas in (8) and (9):
$$\text{Complexity}_{\text{MLP}} = \sum_{i=1}^{L-1} \left( \text{Neurons}_i \times \text{Neurons}_{i+1} \right) + \sum_{i=1}^{L} \text{Neurons}_i,$$
where $L$ is the number of fully connected layers, and $\text{Neurons}_i$ is the number of neurons in layer $i$.
$$\text{Complexity}_{\text{CNN}} = \sum_{j=1}^{C} \text{Filters}_j \times \left( \text{KernelSize}_j \times \text{InputChannels}_j + 1 \right) + \sum_{k=1}^{F-1} \left( \text{Neurons}_k \times \text{Neurons}_{k+1} \right) + \sum_{k=1}^{F} \text{Neurons}_k,$$
where $C$ is the number of convolutional layers, $\text{Filters}_j$ is the number of filters in the $j$-th convolutional layer, $\text{KernelSize}_j$ is the size of the convolution kernel, $\text{InputChannels}_j$ is the number of input channels for the $j$-th convolutional layer, $F$ is the number of fully connected layers, and $\text{Neurons}_k$ is the number of neurons in the $k$-th fully connected layer.
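These formulas translate directly into code; the following sketch (with our own helper names) reproduces Equations (8) and (9), including the paper's convention of counting one bias term per neuron in every listed layer:

def mlp_complexity(neurons):
    # neurons: layer widths, e.g. [700, 200, 200, 9]. Per Equation (8):
    # inter-layer weights plus one bias per neuron in every listed layer.
    weights = sum(neurons[i] * neurons[i + 1] for i in range(len(neurons) - 1))
    return weights + sum(neurons)

def cnn_complexity(conv_layers, dense_neurons):
    # conv_layers: list of (filters, kernel_size, input_channels) per layer,
    # contributing filters * (kernel_size * input_channels + 1) each.
    conv = sum(f * (k * c + 1) for f, k, c in conv_layers)
    return conv + mlp_complexity(dense_neurons)

# e.g. mlp_complexity([700, 200, 200, 9]) -> 182909 trainable parameters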
We summarized the results obtained from this experiment in Table 5. The table shows that, in three out of the four cases, HCA demonstrated superior performance, while RS achieved the best performance in only one test case. These results reinforce the robustness and effectiveness of HCA in comparison to RS, further highlighting its advantages in optimizing hyperparameters for deep learning models in side-channel analysis.

6.3. Statistical Significance Test of the Results

In this section, we calculate the p-values to assess the statistical significance of our experimental results. To ensure a rigorous analysis, we adopted a systematic approach that involves hypothesis testing, a computation of test statistics, a calculation of p-values, and an interpretation of the results. The process is detailed in Algorithm 2.
Algorithm 2 The pseudocode for the statistical significance testing
Input: Data from experiments: HCA results and RS results.
Output: p-value to determine if HCA outperforms RS.
  • Define Hypotheses:
    Null Hypothesis ($H_0$): No significant difference between the performances of HCA and RS.
    Alternative Hypothesis ($H_1$): Significant difference between the performances of HCA and RS.
  • Select a Statistical Test: Use an independent t-test to compare the means of the two groups (HCA vs. RS).
  • Calculate Test Statistic (time complexity: $O(n)$):
    $$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},$$
    where:
    $\bar{X}_1$ and $\bar{X}_2$ are the sample means for HCA and RS, respectively;
    $s_1^2$ and $s_2^2$ are the sample variances for HCA and RS, respectively;
    $n_1$ and $n_2$ are the sample sizes for HCA and RS, respectively.
  • Determine p-value (time complexity: $O(1)$): calculate the p-value using the t-distribution:
    $$p = 2 \cdot P(T \ge |t|),$$
    where $P(T \ge |t|)$ is the probability that the $t$-distribution with $df = n_1 + n_2 - 2$ degrees of freedom attains a value greater than or equal to the absolute value of $t$.
  • Interpret the p-value (time complexity: $O(1)$):
    If $p < \alpha$ (typically $\alpha = 0.05$), reject the null hypothesis $H_0$.
    Conclude that the difference in performance between HCA and RS is statistically significant.
  • Analyze and Report Results:
    Use the number of traces required to achieve a guessing entropy (GE) of 1 as the primary metric.
    Aggregate the results across test cases and compute the sample size (e.g., 100 samples).
    If p is approximately 0, then conclude that HCA significantly outperforms RS in hyperparameter tuning.
Therefore, in aiming to provide robust statistical evidence that our HCA search outperforms Random Search, we calculated the p-values for these tests. We used the number of traces required to reach a guessing entropy (GE) of 1 as our sampled data. The results were aggregated across all test cases, and multiple experiments were conducted to ensure a total sample size of 100. For both experiments, the calculated p-values were approximately 0, indicating that the observed differences were statistically significant and unlikely to be due to chance. Based on the results, our proposed hill climbing search results demonstrated superior performance compared to the baseline Random Search.
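For illustration, the sketch below runs the same test with SciPy on synthetic samples (the trace counts are hypothetical, not our measured results); equal_var=False matches the unpooled-variance statistic above, whereas equal_var=True would use the pooled df = n1 + n2 - 2 stated in Algorithm 2:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
hca_traces = rng.normal(900, 150, size=100)  # hypothetical HCA samples
rs_traces = rng.normal(1400, 300, size=100)  # hypothetical RS samples

# Welch's t-test on the number of traces needed to reach GE = 1 per run.
t_stat, p_value = stats.ttest_ind(hca_traces, rs_traces, equal_var=False)
if p_value < 0.05:
    print(f"t = {t_stat:.2f}, p = {p_value:.3g}: reject H0")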
We then examined the limitations of the proposed HCA. Hill climbing is a local search technique that iteratively adjusts hyperparameters to enhance performance. However, its primary drawback is its susceptibility to becoming trapped in local optima. Unlike more sophisticated methods such as Bayesian optimization or genetic algorithms, which explore the search space globally, hill climbing (HC) can become stuck in sub-optimal regions, especially in high-dimensional or complex search spaces. This limitation is particularly evident in performance landscapes with numerous local peaks, where HC may converge prematurely without reaching the global optimum. Moreover, HC does not incorporate uncertainty estimates, a crucial feature of Bayesian optimization that allows for the effective balancing of exploration and exploitation during the search process. To overcome these challenges, several researchers have proposed adaptive hill climbing algorithms that dynamically adjust the search strategy to better navigate such complex landscapes [33,45]. Investigating these adaptive variants within the context of deep learning-based side-channel analysis for hyperparameter optimization could offer promising avenues for future research.

7. Discussion

Our experiments validated that the HCA is a viable method for hyperparameter tuning in deep learning-based SCA. Across multiple test cases, HCA consistently demonstrated superior performance compared to Random Search, particularly in terms of achieving lower guessing entropy. This performance advantage was further confirmed through statistical significance tests. Notably, HCA matched or outperformed Random Search across all test cases, regardless of dataset complexity, proving its effectiveness in hyperparameter optimization.
In our approach, hill climbing efficiently navigated the hyperparameter search space by making iterative, small-scale adjustments that improved performance. Unlike grid or Random Search methods, which can be computationally expensive or directionless, HCA offers a systematic, performance-driven exploration. This strategy reduces the risk of settling on suboptimal configurations by allowing adaptive changes based on feedback. Furthermore, the simplicity and lower computational overhead of HCA make it particularly well suited for complex models and larger datasets, where traditional methods may struggle with overfitting. In addition, the data disproportionation problem, where certain leakage classes (such as HW classes) are underrepresented, can hinder the performance of deep neural networks (DNNs) [46]. The HCA mitigates this by optimizing hyperparameters like learning rates, batch sizes, and class weights, which better accommodate imbalanced datasets. HCA’s tailored search reduces the risk of overfitting to the majority class, enhancing overall model performance in scenarios with significant class imbalance.
In terms of computational complexity, our HCA is generally more resource efficient than Random Search. HCA’s targeted approach converges more quickly to optimal or near-optimal solutions, reducing computational costs. In contrast, Random Search randomly samples the hyperparameter space, often requiring more iterations to achieve similar performance, resulting in higher resource demands. While Random Search can occasionally yield faster results due to its randomness, HCA consistently provides better resource efficiency in both average and worst-case scenarios.
Furthermore, HCA, while simpler, demonstrates lower computational complexity due to its iterative local adjustments, allowing faster convergence to optimal or near-optimal solutions. In contrast, techniques like Bayesian optimization, genetic algorithms, and simulated annealing incorporate global search mechanisms that can better escape local optima, though often at the cost of increased computational demands and complexity [14]. Bayesian optimization, for instance, excels at balancing exploration and exploitation by leveraging probabilistic models, but it can be resource intensive. Genetic algorithms offer strong global search capabilities but may require extensive population management and iterations [47]. Simulated annealing, with its probabilistic acceptance of worse solutions, helps avoid premature convergence but introduces additional computational overhead. On the other hand, the multi-dimensional fusion convolutional residual dendrite (MD_CResDD) network excels in profiling speed and feature extraction for side-channel analysis through multi-scale feature fusion. In contrast, our HCA systematically optimizes hyperparameters, potentially achieving superior accuracy across diverse datasets when tested under identical conditions [48].
Our hill climbing-based side-channel analysis attack is a powerful technique for extracting secret information from cryptographic devices. Such techniques have steadily improved attack performance over the years, providing valuable insights to both industry and academia. Because this work sits within the security domain, it also raises important ethical concerns.
The primary ethical concern with SCA attacks lies in their potential for misuse. If the proposed HCA for SCA falls into the wrong hands, there is a significant risk of it being exploited to compromise the security of systems by extracting confidential data. The accessibility and proliferation of SCA methods, coupled with the growing availability of cloud computing resources, make such attacks increasingly feasible even for adversaries with limited resources, and the inherently low cost of SCA attacks further amplifies this risk. It is therefore crucial to address these ethical implications and to emphasize responsible usage, focusing on defensive applications and contributing to the development of stronger countermeasures.
On the other hand, advancements in side-channel techniques have been instrumental in fostering the development of more secure hardware designs. A successful side-channel attack reveals specific vulnerabilities in the design that adversaries can exploit. Identifying these flaws allows security experts to develop and implement more robust countermeasures, thereby fortifying the defense mechanisms of the hardware. The continuous refinement of side-channel techniques ultimately drives improvements in hardware security, ensuring that future devices are better protected against such attacks. Consequently, many hardware manufacturing companies now incorporate side-channel analysis testing as a standard stage in their design process to proactively address potential weaknesses.

8. Conclusions

This paper introduces a novel hill climbing algorithm (HCA) for hyperparameter tuning in deep learning-based side-channel analysis (SCA). By formulating SCA metrics as the objective function, our approach optimizes hyperparameters more effectively than traditional methods. Comprehensive experiments and evaluations demonstrate that our HCA outperforms Random Search and other established techniques in the large majority of test cases. The HCA not only yields strong profiling models, but also proves effective and reliable, particularly when dealing with large datasets under stringent memory and computing constraints. The HCA therefore emerges as a compelling choice for enhancing the performance of deep learning-based side-channel attacks.

Author Contributions

H.A. conceived the research idea, designed the experiments, and provided guidance throughout the research process. F.H. conceived the idea, designed the experiments, conducted the experiments, provided data, analyzed the data, created the figures and tables, and wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available from the corresponding author upon request.

Acknowledgments

This work was supported by the NYUAD Global Ph.D. Fellowship and the EMARATSEC Lab.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Picek, S.; Heuser, A.; Perin, G.; Guilley, S. Profiling Side-Channel Analysis in the Efficient Attacker Framework. In Proceedings of the Smart Card Research and Advanced Applications, Lübeck, Germany, 11–12 November 2021; Grosso, V., Pöppelmann, T., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 44–63. [Google Scholar]
  2. Lerman, L.; Bontempi, G.; Markowitch, O. Power analysis attack: An approach based on machine learning. Int. J. Appl. Cryptogr. 2014, 3, 97–115. [Google Scholar] [CrossRef]
  3. Maghrebi, H.; Portigliatti, T.; Prouff, E. Breaking cryptographic implementations using deep learning techniques. In Proceedings of the International Conference on Security, Privacy, and Applied Cryptography Engineering, Hyderabad, India, 14–18 December 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 3–26. [Google Scholar]
  4. Lerman, L.; Bontempi, G.; Markowitch, O. A machine learning approach against a masked AES. J. Cryptogr. Eng. 2015, 5, 123–139. [Google Scholar] [CrossRef]
  5. Zeng, Z.; Gu, D.; Liu, J.; Guo, Z. An improved side-channel attack based on support vector machine. In Proceedings of the 2014 Tenth International Conference on Computational Intelligence and Security, Kunming, China, 15–16 November 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 676–680. [Google Scholar]
  6. Jin, S.; Kim, S.; Kim, H.; Hong, S. Recent advances in deep learning-based side-channel analysis. ETRI J. 2020, 42, 292–304. [Google Scholar] [CrossRef]
  7. Chang, L.; Wei, Y.; He, S.; Pan, X. Research on side-channel analysis based on deep learning with different sample data. Appl. Sci. 2022, 12, 8246. [Google Scholar] [CrossRef]
  8. Rivest, R.L. Cryptography and machine learning. In Proceedings of the International Conference on the Theory and Application of Cryptology, Brighton, UK, 8–11 April 1991; Springer: Berlin/Heidelberg, Germany, 1991; pp. 427–439. [Google Scholar]
  9. Picek, S.; Heuser, A.; Guilley, S. Template attack versus Bayes classifier. J. Cryptogr. Eng. 2017, 7, 343–351. [Google Scholar] [CrossRef]
  10. Ou, Y.; Li, L. Side-channel analysis attacks based on deep learning network. Front. Comput. Sci. 2022, 16, 162303. [Google Scholar] [CrossRef]
  11. Perin, G.; Wu, L.; Picek, S. The need for speed: A fast guessing entropy calculation for deep learning-based SCA. Algorithms 2023, 16, 127. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Ding, A.A.; Fei, Y. A guessing entropy-based framework for deep learning-assisted side-channel analysis. IEEE Trans. Inf. Forensics Secur. 2023, 18, 3018–3030. [Google Scholar] [CrossRef]
  13. Perin, G.; Chmielewski, Ł.; Picek, S. Strength in numbers: Improving generalization with ensembles in machine learning-based profiled side-channel analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 337–364. [Google Scholar] [CrossRef]
  14. Wu, L.; Perin, G.; Picek, S. I choose you: Automated hyperparameter tuning for deep learning-based side-channel analysis. IEEE Trans. Emerg. Top. Comput. 2022, 12, 546–557. [Google Scholar] [CrossRef]
  15. Gupta, P.; Drees, J.P.; Hüllermeier, E. Automated side-channel attacks using black-box neural architecture search. In Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy, 29 August–1 September 2023; pp. 1–11. [Google Scholar]
  16. Robissout, D.; Bossuet, L.; Habrard, A. Scoring the predictions: A way to improve profiling side-channel attacks. J. Cryptogr. Eng. 2024, 1–23. [Google Scholar] [CrossRef]
  17. AlSobeh, A. OSM: Leveraging Model Checking for Observing Dynamic Behaviors in Aspect-Oriented Applications. arXiv 2024, arXiv:2403.01349. [Google Scholar] [CrossRef]
  18. Li, L.; Ou, Y. A deep learning-based side-channel attack model for different block ciphers. J. Comput. Sci. 2023, 72, 102078. [Google Scholar] [CrossRef]
  19. Ni, L.; Wang, P.; Zhang, Y.; Zhang, H.; Li, X.; Ni, L.; Lv, J.; Zheng, W. Profiling side-channel attacks based on CNN model fusion. Microelectron. J. 2023, 139, 105901. [Google Scholar] [CrossRef]
  20. Krček, M.; Perin, G. Autoencoder-enabled model portability for reducing hyperparameter tuning efforts in side-channel analysis. J. Cryptogr. Eng. 2023, 1–23. [Google Scholar] [CrossRef]
  21. Masure, L.; Strullu, R. Side Channel Analysis against the Anssi’s Protected AES Implementation on ARM. Cryptology ePrint Archive, Paper 2021/592. 2021. Available online: https://eprint.iacr.org/2021/592 (accessed on 5 May 2023).
  22. Weissbart, L.; Chmielewski, Ł.; Picek, S.; Batina, L. Systematic side-channel analysis of curve25519 with machine learning. J. Hardw. Syst. Secur. 2020, 4, 314–328. [Google Scholar] [CrossRef]
  23. Wang, H.; Dubrova, E. Tandem deep learning side-channel attack on FPGA implementation of AES. SN Comput. Sci. 2021, 2, 373. [Google Scholar] [CrossRef]
  24. Picek, S.; Heuser, A.; Jovic, A.; Bhasin, S.; Regazzoni, F. The curse of class imbalance and conflicting metrics with machine learning for side-channel evaluations. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 2019, 209–237. [Google Scholar] [CrossRef]
  25. Ito, A.; Saito, K.; Ueno, R.; Homma, N. Imbalanced data problems in deep learning-based side-channel attacks: Analysis and solution. IEEE Trans. Inf. Forensics Secur. 2021, 16, 3790–3802. [Google Scholar] [CrossRef]
  26. Paguada, S.; Batina, L.; Buhan, I.; Armendariz, I. Being Patient and Persistent: Optimizing an Early Stopping Strategy for Deep Learning in Profiled Attacks. IEEE Trans. Comput. 2023, 1–12. [Google Scholar] [CrossRef]
  27. Zhang, J.; Zheng, M.; Nan, J.; Hu, H.; Yu, N. A novel evaluation metric for deep learning-based side channel analysis and its extended application to imbalanced data. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 73–96. [Google Scholar] [CrossRef]
  28. Kubota, T.; Yoshida, K.; Shiozaki, M.; Fujino, T. Deep learning side-channel attack against hardware implementations of AES. Microprocess. Microsyst. 2021, 87, 103383. [Google Scholar] [CrossRef]
  29. Zaid, G.; Bossuet, L.; Dassance, F.; Habrard, A.; Venelli, A. Ranking loss: Maximizing the success rate in deep learning side-channel analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 25–55. [Google Scholar] [CrossRef]
  30. Kim, J.; Picek, S.; Heuser, A.; Bhasin, S.; Hanjalic, A. Make Some Noise. Unleashing the Power of Convolutional Neural Networks for Profiled Side-channel Analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019, 2019, 148–179. [Google Scholar] [CrossRef]
  31. Pradhan, A.; Mishra, D.; Das, K.; Obaidat, M.S.; Kumar, M. A COVID-19 X-ray image classification model based on an enhanced convolutional neural network and hill climbing algorithms. Multimed. Tools Appl. 2023, 82, 14219–14237. [Google Scholar] [CrossRef]
  32. Alweshah, M.; Al-Daradkeh, A.; Al-Betar, M.A.; Almomani, A.; Oqeili, S. β-Hill climbing algorithm with probabilistic neural network for classification problems. J. Ambient Intell. Humaniz. Comput. 2020, 11, 3405–3416. [Google Scholar] [CrossRef]
  33. Al-Betar, M.A.; Aljarah, I.; Awadallah, M.A.; Faris, H.; Mirjalili, S. Adaptive β-hill climbing for optimization. Soft Comput. 2019, 23, 13489–13512. [Google Scholar] [CrossRef]
  34. Al-Betar, M.A. β-Hill climbing: An exploratory local search. Neural Comput. Appl. 2017, 28, 153–168. [Google Scholar] [CrossRef]
  35. Zaid, G.; Bossuet, L.; Habrard, A.; Venelli, A. Methodology for efficient CNN architectures in profiling attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 1–36. [Google Scholar] [CrossRef]
  36. Rijsdijk, J.; Wu, L.; Perin, G.; Picek, S. Reinforcement learning for hyperparameter tuning in deep learning-based side-channel analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 677–707. [Google Scholar] [CrossRef]
  37. Cagli, E.; Dumas, C.; Prouff, E. Convolutional Neural Networks with Data Augmentation Against Jitter-Based Countermeasures. In Proceedings of the Cryptographic Hardware and Embedded Systems—CHES 2017, Taipei, Taiwan, 25–28 September 2017; Fischer, W., Homma, N., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 45–68. [Google Scholar]
  38. Rezaeezade, A.; Basurto-Becerra, A.; Weissbart, L.; Perin, G. One for All, All for Ascon: Ensemble-Based Deep Learning Side-Channel Analysis. In Proceedings of the International Conference on Applied Cryptography and Network Security, Abu Dhabi, United Arab Emirates, 5–8 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 139–157. [Google Scholar]
  39. Serafini, G.; Weissbart, L.; Batina, L. Everything All at Once: Deep Learning Side-Channel Analysis Optimization Framework. In Proceedings of the International Conference on Applied Cryptography and Network Security, Abu Dhabi, United Arab Emirates, 5–8 March 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 195–212. [Google Scholar]
  40. Kotsiantis, S.B.; Zaharakis, I.; Pintelas, P. Supervised machine learning: A review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 2007, 160, 3–24. [Google Scholar]
  41. Tubbing, R. An Analysis of Deep Learning Based Profiled Side-Channel Attacks: Custom Deep Learning Layer, CNN Hyperparameters for Countermeasures, and Portability Settings. Master’s Thesis, Delft University of Technology (TU Delft), Delft, The Netherlands, 2019. [Google Scholar]
  42. Benadjila, R.; Prouff, E.; Strullu, R.; Cagli, E.; Dumas, C. Deep learning for side-channel analysis and introduction to ASCAD database. J. Cryptogr. Eng. 2020, 10, 163–188. [Google Scholar] [CrossRef]
  43. Brier, E.; Clavier, C.; Olivier, F. Correlation Power Analysis with a Leakage Model. In Proceedings of the Cryptographic Hardware and Embedded Systems—CHES 2004, Boston/Cambridge, MA, USA, 11–13 August 2004; Joye, M., Quisquater, J.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 16–29. [Google Scholar]
  44. Wu, L.; Weissbart, L.; Krček, M.; Li, H.; Perin, G.; Batina, L.; Picek, S. On the Attack Evaluation and the Generalization Ability in Profiling Side-Channel Analysis. Cryptology ePrint Archive, Paper 2020/899. 2020. Available online: https://eprint.iacr.org/2020/899 (accessed on 5 May 2023).
  45. Sun, K.; Jia, H.; Li, Y.; Jiang, Z. Hybrid improved slime mould algorithm with adaptive β hill climbing for numerical optimization. J. Intell. Fuzzy Syst. 2021, 40, 1667–1679. [Google Scholar] [CrossRef]
  46. Alfreihat, M.; Almousa, O.; Tashtoush, Y.; AlSobeh, A.; Mansour, K.; Migdady, H. Emo-SL Framework: Emoji Sentiment Lexicon Using Text-Based Features and Machine Learning for Sentiment Analysis. IEEE Access 2024, 12, 81793–81812. [Google Scholar] [CrossRef]
  47. Ali, Y.A.; Awwad, E.M.; Al-Razgan, M.; Maarouf, A. Hyperparameter search for machine learning algorithms for optimizing the computational complexity. Processes 2023, 11, 349. [Google Scholar] [CrossRef]
  48. Deng, T.; Wang, H.; He, D.; Xiong, N.; Liang, W.; Wang, J. Multi-Dimensional Fusion Deep Learning for Side Channel Analysis. Electronics 2023, 12, 4728. [Google Scholar] [CrossRef]
Figure 1. HCA flowchart.
Figure 2. The GE performance of the two methods utilizing MLP on an ASCAD fixed key based on the ID leakage model.
Figure 3. The GE performance of the two methods utilizing MLP on an ASCAD fixed key based on the HW leakage model.
Figure 4. The GE performance of the two methods utilizing CNN on an ASCAD fixed key based on the ID leakage model.
Figure 5. The GE performance of the two methods utilizing CNN on an ASCAD fixed key based on the HW leakage model.
Figure 6. The GE performance of the two methods utilizing MLP on an ASCAD variable key based on the ID leakage model.
Figure 7. The GE performance of the two methods utilizing MLP on an ASCAD variable key based on the HW leakage model.
Figure 8. The GE performance of the two methods utilizing CNN on an ASCAD variable key based on the ID leakage model.
Figure 9. The GE performance of the two methods utilizing CNN on an ASCAD variable key based on the HW leakage model.
Table 1. MLP hyperparameters.

Hyperparameter                     | Min | Max  | Step
Fully connected layers             | 2   | 10   | 1
Neurons per fully connected layer  | 8   | 1024 | 8

Hyperparameter                     | Values
Learning rate                      | 1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴, 5 × 10⁻⁵, 1 × 10⁻⁵
Activation function per layer      | ReLU, Tanh, ELU, SELU
Table 2. CNN hyperparameters.

Hyperparameter                     | Min | Max  | Step
Convolution layers                 | 1   | 4    | 1
Convolution filters                | 8   | 256  | 8
Convolution kernel size            | 2   | 10   | 1
Pooling size                       | 2   | 5    | 1
Pooling stride                     | 2   | 10   | 1
Fully connected layers             | 1   | 3    | 1
Neurons per fully connected layer  | 8   | 1024 | 8

Hyperparameter                     | Values
Pooling type                       | Max pooling, avg pooling
Learning rate                      | 1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴, 5 × 10⁻⁵, 1 × 10⁻⁵
Activation function per layer      | ReLU, Tanh, ELU, SELU
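Written out as code, the two search spaces above are compact; the sketch below encodes them as plain Python dictionaries of candidate values (an illustrative encoding of Tables 1 and 2, not the authors' exact data structure), together with a uniform sampler of the kind Random Search would use.

```python
import random

# Discrete search spaces from Tables 1 and 2, encoded as candidate lists.
MLP_SPACE = {
    "fc_layers":     list(range(2, 11)),        # 2..10, step 1
    "fc_neurons":    list(range(8, 1025, 8)),   # 8..1024, step 8
    "learning_rate": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    "activation":    ["ReLU", "Tanh", "ELU", "SELU"],
}

CNN_SPACE = {
    "conv_layers":   list(range(1, 5)),         # 1..4, step 1
    "conv_filters":  list(range(8, 257, 8)),    # 8..256, step 8
    "kernel_size":   list(range(2, 11)),        # 2..10, step 1
    "pool_size":     list(range(2, 6)),         # 2..5, step 1
    "pool_stride":   list(range(2, 11)),        # 2..10, step 1
    "fc_layers":     list(range(1, 4)),         # 1..3, step 1
    "fc_neurons":    list(range(8, 1025, 8)),   # 8..1024, step 8
    "pool_type":     ["max", "avg"],
    "learning_rate": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    "activation":    ["ReLU", "Tanh", "ELU", "SELU"],
}

def sample_config(space):
    """Uniformly sample one configuration, e.g., a single Random Search draw."""
    return {name: random.choice(values) for name, values in space.items()}
```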
Table 3. Experimental parameters.

Variable             | Value
Epochs               | 10 and 50
Training Dataset     | 50,000
Validation Dataset   | 5000
Attack Dataset       | 5000
Performance Metrics  | Accuracy, Lm, Key rank
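Since the tables and figures that follow report guessing entropy (GE) and key rank, a brief sketch of how both are computed from a model's per-trace predictions may help. This follows the standard SCA definitions; the helper names and array shapes are illustrative.

```python
import numpy as np

def key_rank(log_probs, true_key):
    """`log_probs`: (n_traces, 256) array of per-trace log-likelihoods for
    each key-byte candidate under the chosen leakage model."""
    scores = log_probs.sum(axis=0)                 # accumulate evidence over traces
    order = np.argsort(scores)[::-1]               # candidates, best first
    return int(np.where(order == true_key)[0][0])  # rank 0 = key recovered

def guessing_entropy(attack_runs):
    """Average key rank over independent runs [(log_probs, true_key), ...]."""
    return float(np.mean([key_rank(lp, k) for lp, k in attack_runs]))
```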
Table 4. Comparison of the performance of HCA and RS on an ASCAD fixed key across leakage models.

        | Better Method | HCA Traces                   | RS Traces
        |               | Key Rank | Lm   | Accuracy   | Key Rank | Lm   | Accuracy
MLP_HW  | HCA           | 600      | 600  | 1000       | 1000     | 1000 | 1500
MLP_ID  | –             | 90       | 100  | 300        | 90       | 90   | 100
CNN_HW  | HCA           | 800      | 700  | 1000       | 1200     | 1500 | 1500
CNN_ID  | HCA           | 300      | 800  | 800        | 2500     | –    | –
Table 5. Comparison of the performances of HCA and RS on an ASCAD variable key across leakage models.

        | Better Method | HCA Traces                   | RS Traces
        |               | Key Rank | Lm   | Accuracy   | Key Rank | Lm   | Accuracy
MLP_HW  | HCA           | 1000     | 2000 | 600        | 700      | 1200 | 800
MLP_ID  | HCA           | 3000     | 3000 | 3000       | 3500     | –    | –
CNN_HW  | HCA           | 4100     | 1400 | –          | –        | –    | –
CNN_ID  | RS            | –        | –    | –          | –        | 5000 | –
Table 6. Proposed HCA performance versus the state-of-the-art methods. HCA provided the best attack performance in two of the four comparisons.

                            | [36]   | [14]      | HCA MLP   | HCA CNN
ASCAD Fixed + ID Leakage
Complexity                  | 79,439 | 1,544,776 | 1,360,567 | 2,154,420
Time to reach GE of 1       | 202    | 120       | 90        | 300
ASCAD Fixed + HW Leakage
Complexity                  | 5566   | 1,388,457 | 1,155,816 | 3,240,134
Time to reach GE of 1       | 90     | 6447      | 600       | 700
ASCAD Variable + ID Leakage
Complexity                  | 70,492 | 1,539,320 | 1,002,342 | 1,578,995
Time to reach GE of 1       | 490    | 2945      | 300       | 500
ASCAD Variable + HW Leakage
Complexity                  | 15,241 | 4,128,753 | 1,672,980 | 3,450,872
Time to reach GE of 1       | 91     | 1496      | 600       | 1400