1. Introduction
The growing number of embedded electronic devices has increased the prevalence of side-channel attacks. These attacks exploit vulnerabilities inherent in the physical implementations of algorithms and therefore pose a significant security threat. The threat is compounded by the fact that precise measurements of physical leakages, such as power consumption, timing, and electromagnetic emanations, can be acquired at low cost. This reinforces the notion that both software and hardware implementations of algorithms can inadvertently divulge information [1]. In response, security experts have actively explored various side-channel analysis attacks and devised countermeasures to strengthen the security of embedded systems. Established techniques such as differential power analysis (DPA) and correlation power analysis (CPA) are grounded in cryptographic algorithm theory and statistical analysis. However, executing these attacks demands substantial background knowledge, as they do not rely on hyperparameters in the process [2].
In recent years, profiling attacks have emerged as a dominant variant of side-channel analysis attacks, with a notable emphasis on those powered by machine learning. Reports indicate that profiling attacks recover secret keys more efficiently and accurately than conventional techniques such as DPA or CPA. Importantly, profiling attacks can detect concealed information even when countermeasures such as masking and random delay are employed [3,4,5]. As a result, substantial research has focused on refining and advancing this method. In a profiling attack scenario, the adversary uses a profiling device to study the target device before mounting an attack, often with careful hyperparameter tuning to optimize the attack’s effectiveness.
The inherent similarity between profiling and supervised machine learning makes it natural to frame profiling side-channel analysis (SCA) attacks as machine-learning classification problems. The profiling data gathered by an adversary from a cloned device serves as the training data, while the data derived from the target device under attack is akin to the test data in the machine learning paradigm. Numerous studies have demonstrated the effectiveness of machine learning techniques in the domain of side-channel attacks [4,6,7,8]. A range of machine learning algorithms, such as support vector machines (SVM), random forest (RF), and deep learning, have been applied in these contexts, with careful hyperparameter tuning playing a crucial role in maximizing their performance [9,10].
Moreover, deep learning techniques have attracted heightened attention due to their ability to produce successful attacks without requiring intricate feature engineering. Notably, the multi-layer perceptron (MLP) and the convolutional neural network (CNN) have emerged as prominent choices. In the profiling phase of these methods, deep learning models are trained using diverse side-channel information sources such as power consumption, electromagnetic emanations, and simulation time. However, the distinctive aspect of deep learning in the context of side-channel attacks is the application of specific side-channel attack-oriented metrics [11,12]. While conventional machine learning primarily relies on accuracy as a performance metric, deep learning-based SCA often employs success rate and guessing entropy as the key evaluation criteria. Although these SCA metrics have offered better attack performance, the complexity of the hyperparameter search space limits the performance of deep learning-based side-channel attack methods.
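The two SCA metrics named above can be illustrated with a minimal sketch. It assumes the usual definitions (guessing entropy as the average rank of the correct key over several attacks, success rate as the fraction of attacks that rank the correct key first); the function names and the toy scores are illustrative and are not taken from the experiments in this paper.

```python
from statistics import mean

def key_rank(scores, correct_key):
    """Rank of the correct key (0 = best) given one score per key candidate."""
    correct_score = scores[correct_key]
    # Count the candidates that score strictly higher than the correct key.
    return sum(1 for s in scores if s > correct_score)

def guessing_entropy(score_sets, correct_key):
    """Average rank of the correct key over several independent attacks."""
    return mean(key_rank(scores, correct_key) for scores in score_sets)

def success_rate(score_sets, correct_key):
    """Fraction of attacks in which the correct key is ranked first."""
    hits = sum(1 for scores in score_sets if key_rank(scores, correct_key) == 0)
    return hits / len(score_sets)

# Toy data (made up): 4 key candidates, 3 attacks, correct key index 1.
attacks = [
    [-5.0, -1.0, -3.0, -4.0],  # correct key ranked first
    [-2.0, -2.5, -3.0, -4.0],  # correct key ranked second
    [-6.0, -0.5, -0.7, -9.0],  # correct key ranked first
]
ge = guessing_entropy(attacks, correct_key=1)
sr = success_rate(attacks, correct_key=1)
print(f"GE = {ge:.3f}, SR = {sr:.3f}")
```

Unlike accuracy, both quantities are computed on the key candidates rather than on the predicted classes, which is why they track attack performance more directly.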
Hyperparameter tuning is of paramount importance in enhancing the efficacy of deep learning-based side-channel attacks. Various strategies have been employed to optimize hyperparameters during the model training process. For instance, Perin et al. employed grid search for hyperparameter tuning in their study [13], while Wu et al. used Bayesian optimization [14]. Random search has also established itself as a reliable tuning technique because it is more efficient than grid search in large search spaces and is easily parallelized [14]. Furthermore, regularization techniques have been integrated into numerous deep learning methods, aiming to bolster the generalization capacity of trained models and, consequently, enhance their overall performance. Gupta et al. proposed a hyperparameter tuning method based on fully black-box neural architecture search (NAS) [
15]. They applied Random Search to optimize 1-D CNNs, demonstrating an effective approach to model optimization. Robissout et al., on the other hand, focused on optimizing the profile of attack traces rather than hyperparameter tuning [
16,
17]. They introduced a novel metric and scoring loss, which allows deep learning models to assign scores to attack predictions, thereby facilitating the effective ordering of attack traces. Li et al. presented a deep learning-based side-channel attack model tailored for different block ciphers [
18]. Their approach, formulated as a regression model, specifically addresses AES and PRESENT ciphers, employing grid search combined with expert knowledge for optimization. Ni et al. investigated a side-channel attack scheme based on CNN model fusion [
19]. In this approach, they randomly selected N sets of hyperparameters from the search space and trained CNN models on a side-channel dataset to create N base models. By merging high-dimensional features from the middle layers of these base models, they enhanced the model’s generalization ability. Their results demonstrated that the fusion model improved attack capability while simplifying hyperparameter tuning across various public datasets. Krček et al. explored the challenges of hyperparameter tuning in deep learning-based profiling side-channel analysis, particularly across different side-channel datasets [
20]. They proposed using autoencoders for dimensionality reduction to test whether encoded datasets could enable the portability of profiling models and reduce tuning efforts. Their findings revealed that, while autoencoders did not significantly reduce tuning efforts for the original datasets, they did improve model portability and simplify hyperparameter tuning in transfer learning scenarios. Despite the advancements reported across the literature, these approaches lack the exploration of local search performance within hyperparameter tuning mechanisms. Consequently, in this research, we investigated hill climbing optimization search as a novel avenue for hyperparameter tuning, thus addressing this existing gap.
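As a point of reference for the Random Search baseline discussed above, the following is a minimal sketch under stated assumptions. The search space, the function names, and the toy objective are hypothetical; in an actual attack the objective would train a model and return an SCA metric such as the guessing entropy.

```python
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4],
    "neurons": [50, 100, 200, 400],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "activation": ["relu", "selu", "tanh"],
}

def sample(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}

def random_search(space, objective, iterations, seed=0):
    """Keep the best of `iterations` independent samples (lower is better)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(iterations):
        cfg = sample(space, rng)
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_objective(cfg):
    """Stand-in for 'train a model and measure its guessing entropy'."""
    return abs(cfg["layers"] - 3) + abs(cfg["neurons"] - 200) / 100

best_cfg, best_score = random_search(SEARCH_SPACE, toy_objective, iterations=50)
print(best_cfg, best_score)
```

Because every sample is drawn independently of the previous ones, the evaluations can run in parallel, but no information from earlier trials guides later ones. This is the gap that a local search method aims to address.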
Our main contributions are as follows:
To the best of our knowledge, we are the first to propose a hill climbing algorithm for hyperparameter tuning in deep learning-based SCA. For a comprehensive analysis, we compared our proposed algorithm with a baseline method, Random Search (RS), which is widely used in the SCA community. Our proposed method in all test cases demonstrated success in predicting the secret key in contrast to the inconsistent performance exhibited by Random Search.
We conducted extensive experiments on benchmark datasets in SCA research and subsequently demonstrated our proposed method’s competitive performance over the baseline methods.
We illustrate the strength and performance of the hill climbing algorithm relative to Random Search. In all test cases, using both simple and complex datasets, the hill climbing algorithm showed more promising performance than Random Search.
The rest of this paper is organized as follows: Section 2 describes the background. Section 3 discusses the methodology, SCA metric, and hill climbing optimization. Section 4 details the hill climbing optimization framework. Section 5 discusses the experimental setup, while Section 6 analyzes the test cases and results. Section 7 presents the discussion, and Section 8 wraps up with future considerations and a conclusion of our findings.
Related Works
Numerous studies in the literature have extensively investigated the application of deep learning techniques in machine learning-based SCA, consistently yielding superior results. For instance, Maghrebi et al. employed deep learning to construct a highly accurate model for profiling side-channel leakage, resulting in successful key recovery in unprotected side-channel implementations [
3]. Their findings indicate that the proposed deep learning-based SCA techniques outperform traditional machine learning methods.
Further supporting the suitability of deep learning in SCA attacks, Masure et al. provided theoretical and experimental evidence, demonstrating that minimizing the negative log likelihood (NLL) loss function during deep learning training yields robust results in side-channel attack scenarios [
21]. The authors, however, did not factor hyperparameters in their formulation, showing their reliance on expert domain knowledge. Additionally, Leo showcased the potency of convolutional neural networks in the SCA of AES, but their methodology did not account for hyperparameter tuning [
22].
Ensemble techniques have also been explored in deep learning-based SCA. Perin et al. leveraged hyperparameter tuning in training an ensemble of deep learning models for prediction, resulting in improved generalization capabilities [
13]. The drawback with the proposed ensemble tuning method is that it can be a lengthy process, requiring first training several models in an ensemble using a grid search. Similarly, Wang et al. demonstrated the efficacy of ensembles in deep learning side-channel attacks, exploring different attack points and showcasing superior attack performance with their tandem approach [
23]. The downside with the tandem approach is that it under-utilizes the combinations in hyperparameter search space.
Additionally, other studies have applied deep learning techniques to tackle the challenge of imbalanced data. Picek et al. identified the imbalanced nature of the data in ML-based SCA arising from Hamming Weight leakage models. To address this issue, they employed the synthetic minority oversampling technique (SMOTE), which yielded encouraging performance in SCA attacks [24]. The researchers compared different sampling techniques, including random undersampling and random oversampling, and found that SMOTE effectively mitigated the imbalanced data problem. However, they did not tune hyperparameters systematically and instead relied on their expert domain knowledge, which would be difficult to replicate when faced with an unfamiliar dataset.
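The source of the imbalance under a Hamming Weight leakage model can be seen by counting how many of the 256 byte values fall into each of the nine classes. The short sketch below assumes a uniformly distributed intermediate byte; the helper name is illustrative.

```python
from collections import Counter

def hamming_weight(value):
    """Number of set bits in a byte."""
    return bin(value).count("1")

# Class sizes when every byte value 0..255 is equally likely.
counts = Counter(hamming_weight(v) for v in range(256))
for hw in range(9):
    print(f"HW={hw}: {counts[hw]:3d} values ({counts[hw] / 256:.4f})")
```

The class sizes follow the binomial coefficients of 8, so the middle class (Hamming weight 4) contains 70 values while the extreme classes (0 and 8) contain one value each.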
Ito et al. took another approach to handling data imbalance: they used the Kullback–Leibler (KL) divergence as an evaluation metric and adopted a key value-based likelihood function instead of Hamming Weight (HW) or Hamming Distance (HD) models. They did not consider hyperparameter tuning and instead relied on their experiential knowledge. Their method proved effective in addressing the data imbalance in SCA datasets [25]. Furthermore, Paguada et al. introduced an optimized form of the early stopping strategy for deep learning models, defining new metrics of patience and persistence and employing grid search for hyperparameter tuning. This approach not only improved the overall performance of deep learning models but also reduced computation overhead [26]. On the other hand, the use of grid search for hyperparameter tuning considerably slows down the training process.
Zhang et al. proposed the cross entropy ratio (CER) metric to bridge the gap between deep learning metrics and side-channel metrics, exhibiting significant performance with imbalanced data [
27]. The added advantage of the CER metric lies in its low computational complexity, making it an efficient choice that does not require mounting an attack like the guessing entropy (GE) or success rate (SR) metrics. However, the authors downplayed the significance of hyperparameter tuning, instead relying on their domain knowledge. Furthermore, Kubota et al. investigated deep learning-based SCA against hardware implementations of AES and introduced a mixed-model dataset without incorporating a hyperparameter tuning technique. Their findings, although encouraging, demonstrated that deep learning without hyperparameter tuning can yield a limited attack against protected AES implementations in ASIC [
28]. Collectively, these approaches showcase the effectiveness of deep learning techniques in addressing data imbalance challenges in SCA, further enhancing the reliability and robustness of SCA attack strategies, although most of them set hyperparameters through domain knowledge rather than systematic tuning.
In order to examine the impact of each hyperparameter of a CNN on the training process, Zaid et al. conducted a comprehensive study on the intricacies of deep neural networks [
29]. However, this approach necessitates prior knowledge of the datasets being targeted by adversaries. In contrast, Wu et al. proposed an automated approach for hyperparameter tuning using Bayesian optimization, which demonstrated superior performance compared to reinforcement learning [
14]. However, their proposed Bayesian approach was marginally better than the Random Search and, considering that Random Search is a faster technique, the Random Search seemed more highly favored overall.
Additionally, Kim et al. demonstrated the effectiveness of adding artificial noise to a CNN, resulting in more efficient breaking of protected implementations [
30]. For anyone without prior knowledge of the dataset, it would be time consuming to replicate their models’ performance. Overall, there remains ample room for further investigation into the role of hyperparameter tuning in SCA. Currently, methods like simulated annealing, iterated local search, Random Search, and grid search are well suited for problems with smaller hyperparameter spaces. However, for more complex problems with larger search spaces, Bayesian optimization, genetic algorithms, and augmented hill climbing have been successfully employed in machine learning but not SCA attacks [
31]. These diverse optimization techniques offer valuable tools to tackle a wide range of problem types and complexities.
Finding the optimal hyperparameters for deep learning models can be highly computationally expensive in terms of time and resources. The limitations of available computational resources make it impossible to explore all possible architectures of the neural network model. Therefore, the multitude of feasible configurations often requires specialized knowledge to narrow down the search region and achieve a desirable architecture. Although various neural architecture search (NAS) algorithms have been proposed as a solution to this challenge, they do not yet fully offer the combined advantages of speed, exploitation, and exploration, as our survey of current hyperparameter tuning techniques shows. As a result, we propose the hill climbing algorithm (HCA) optimization technique as an efficient and robust hyperparameter tuning method to tackle the complex hyperparameter search space associated with deep learning-based SCA [32].
In this work, we formulate and explore the hill climbing algorithm for its robustness, flexibility, and scalability [33]. The algorithm’s ability to navigate search spaces makes it robust in handling local optima. To the best of our knowledge, we are the first to propose the hill climbing algorithm for hyperparameter tuning in SCA attacks. Our improved version of the algorithm can facilitate either local exploitation or global exploration. The convergence of the hill climbing algorithm results from striking a balance between exploration and exploitation [34]. This balance ensures that the algorithm effectively explores the search space while efficiently exploiting promising solutions, leading to the discovery of optimal or near-optimal solutions.
2. Background
In this section, we will examine the optimization techniques currently employed in the SCA hyperparameter tuning problem. We will follow this up with preliminaries on profiling side-channel attacks, supervised machine learning, and experimental frameworks.
2.1. Current Hyperparameter Tuning Techniques
Grid search is a widely used hyperparameter tuning technique for conducting side-channel analysis attacks. However, it is computationally expensive, especially when applied to deep learning architectures with numerous tunable hyperparameters. This challenge becomes even more pronounced when working with high-dimensional datasets, making grid search impractical for most side-channel analysis scenarios. Consequently, Random Search is more commonly employed in this domain due to its efficiency and simplicity. Given these factors, Random Search has been chosen as the baseline technique for comparison in this research.
Ref. [1] highlighted that hyperparameter tuning is a critical factor in training an effective profiling model for side-channel attacks. However, the authors did not propose a specific methodology for hyperparameter tuning. Instead, they offered recommended ranges of hyperparameters to explore during model training, leaving the selection process largely dependent on expert guidance. As a result, this approach may be challenging for new researchers in the field of deep learning-based SCA to adopt effectively, as it assumes a level of expertise that might not be immediately accessible to beginners.
Ref. [
35] proposed a methodology for optimizing hyperparameters to build efficient CNN architectures that balance attack efficiency and network complexity. Their approach emphasizes a deep understanding of the model’s inner workings, ensuring that the role of each hyperparameter is clearly understood in terms of explainability and interpretability. This methodology offers valuable insights for tailoring models to specific attack scenarios, contributing to more precise and targeted optimizations. However, the approach relies heavily on expert knowledge of the specific dataset being attacked, which limits its generalizability to other datasets. Additionally, it requires significant time and effort to fine tune hyperparameters, making it less practical in scenarios with limited resources.
Ref. [
36] applied reinforcement learning to tune CNN hyperparameters. The authors achieved this by exploring the Q-learning algorithm and subsequently developing two reward functions based on side-channel metrics. The strength of this approach lies in its ability to automate the hyperparameter tuning process and effectively generalize across different datasets. As a result, the methodology identifies several optimal hyperparameters that lead to successful attacks. However, the search process is extremely time consuming, which can be a significant drawback in practical applications where computational efficiency is critical.
Several studies have demonstrated that CNNs are highly effective at extracting relevant features from noisy side-channel data, which significantly enhances the success rate of attacks on AES implementations [
35,
37]. In the context of other cryptographic implementations, such as ASCON, researchers have utilized ensemble learning to address the challenges of optimizing CNN architecture and hyperparameter tuning [
38]. This shows that hyperparameter tuning in deep learning for side-channel analysis is a widespread issue across various cryptographic algorithms. Additionally, Ref. [
39] explored the use of a sampling algorithm combined with an early-stopping mechanism for hyperparameter optimization.
Ref. [
14] proposed a hyperparameter search technique based on Bayesian optimization. Their experimental analysis demonstrates that the framework performs robustly across different datasets and leakage models, highlighting its versatility in various side-channel attack scenarios. The key advantage of this approach lies in its ability to effectively balance exploration and exploitation during the search process, leading to more precise hyperparameter configurations. However, the significant drawback of this method is its high computational complexity, which can be resource intensive and time consuming. This complexity may limit the method’s scalability, particularly in scenarios where rapid model deployment is essential.
HCA addresses these shortcomings by offering several key improvements: iterative hyperparameter refinement with reduced computational cost, adaptability across varying datasets, and effectiveness in overcoming local optima through adaptive search steps. Positioning HCA within the broader landscape of existing techniques highlights both the novelty of this approach and its practical advantages in SCA applications. These features make HCA a valuable alternative for scenarios where traditional methods struggle with efficiency and flexibility.
2.2. Notation
In this work, calligraphic letters such as $\mathcal{X}$ are used to denote sets. When $\mathcal{X}$ is finite, $|\mathcal{X}|$ represents its cardinality. Upper-case letters $X$ denote random variables and bold upper-case letters $\mathbf{X}$ denote random vectors over $\mathcal{X}$. The corresponding lower-case letters $x$ and $\mathbf{x}$ represent the realizations of $X$ and $\mathbf{X}$, respectively, and the i-th entry of the vector $\mathbf{x}$ is $x_i$. Each entry $x_i$ has an associated key candidate and plaintext byte. Let $k$ be a key candidate, with possible values in the key space $\mathcal{K}$, and let $k^*$ be the correct key. The key guess associated with $x_i$ is $k_i$ and, likewise, the associated plaintext is $p_i$.
2.3. Profiled Side-Channel Attack
In this work, we consider the profiling side-channel attack. It represents the worst-case security analysis, since it assumes a powerful attacker with access to a clone device for which the secret key is known. The profiling attack consists of the following two phases.
2.3.1. Profiling Phase
In this step, the attacker uses the clone device to obtain a dataset of profiling traces, with known plaintexts and secret keys as inputs. It is worth mentioning that the profiling traces are independent and identically distributed (i.i.d.). The data input and capture result in a profiling dataset, which is used to characterize the leakage X. To characterize the leakage, a model is built that maps the secret key-dependent intermediate variable to the physical leakages contained in the measured traces.
2.3.2. Attack Phase
This stage begins with the attacker obtaining a set of traces from the target device under attack, which are independent from the profiling traces. To do this, the adversary using the target device encrypts plaintexts and, consequently, records the resulting traces. Using the model trained in the profiling phase, the attacker computes a prediction vector of the probabilities for each attack trace. Based on this vector, the attacker chooses the best key candidate.
2.4. Supervised Machine Learning
The profiled attack we investigate is a classification problem, since it involves evaluating a set of possible key values and selecting one of them, which necessitates training a classifier [40,41]. In this work, we examine deep learning-based SCAs against AES. In these attacks, the deep learning model is trained on labels derived from the output of the S-box in the first round of AES. Training is carried out on the profiling data obtained using a clone device, and its goal is to learn how the device leaks information. During training, the adversary encrypts several plaintexts with known secret keys and collects the resulting traces. To ensure a sufficiently generalized model, the side-channel traces should result from random inputs, namely a random plaintext, $p$, and a random key, $k$. This cycle is repeated several times to obtain $N$ profiling traces. Thus, we can define the training data to comprise the $N$ traces as follows:
$$\{(x_i, p_i, k_i)\}_{i=1}^{N},$$
where $k_i$ denotes the partial key of the full key, one byte in length; $p_i$ is the plaintext corresponding to the partial key; $i = 1, \ldots, N$; $x_i$ is the profiling trace collected from the clone or profiling device; and $N$ is the total number of traces in the profiling stage. With the training data in place, the deep learning model is subsequently trained. The leakage being characterized is at the output of the S-box in the first round; therefore, the label is set to $\mathrm{HW}(\mathrm{Sbox}(p_i \oplus k_i))$. As a result, the output of the deep learning model is a probability distribution over nine classes, labeled zero to eight. These classes are the HW values and, in order to obtain the class probabilities, the final layer of the deep learning model is a softmax activation layer. The model is thus trained with the side-channel traces as the feature data and the HW of the intermediate values as the labels, and a well-generalized DL model is one step away from predicting the secret key. Upon training, a set V of validation traces was used to validate the performance of the trained model and to ensure that the model was generalized.
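The labeling step described above (Hamming weight of the first-round S-box output) can be sketched as follows. To keep the block self-contained, the AES S-box is generated from its algebraic definition (inversion in GF(2^8) followed by the affine transformation) instead of being pasted as a table; the function names are illustrative assumptions and not the authors’ code.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result

def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by AES convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 in GF(2^8)
        result = gf_mul(result, a)
    return result

def rotl8(x, shift):
    """Rotate a byte left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF

def build_sbox():
    """AES S-box: field inversion followed by the affine transformation."""
    sbox = []
    for value in range(256):
        b = gf_inverse(value)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def hw_label(plaintext_byte, key_byte):
    """Class label in 0..8: Hamming weight of Sbox(p XOR k)."""
    return bin(SBOX[plaintext_byte ^ key_byte]).count("1")

# Labels for a few (plaintext byte, key byte) pairs chosen for illustration.
labels = [hw_label(p, k) for p, k in [(0x00, 0x00), (0x32, 0x2B), (0xFF, 0x10)]]
print(labels)
```

Each label indexes one of the nine outputs of the softmax layer, so a profiling trace with plaintext byte p and key byte k is assigned the class `hw_label(p, k)`.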
The completion of the profiling (training) phase indicates that the clone device has been characterized and that the attack phase can be initiated. In this phase, the adversary collects the attack traces and the corresponding plaintexts, from which the S-box output is computed for each key guess. Thus, we can define the attack data $Q$ to comprise $|Q|$ traces as follows:
$$Q = \{(x_i, p_i)\}_{i=1}^{|Q|},$$
where $p_i$ is the plaintext corresponding to the unknown key; $i = 1, \ldots, |Q|$; $x_i$ is the attack trace collected from the device under attack; and $|Q|$ is the total number of traces in the attack stage. After this, using the trained model from the profiling phase, a prediction vector was computed for each attack trace. That is, a score was assigned to each class, and the highest score indicates which of the nine classes the attack trace belongs to. Thus, for any given trace, the class with the highest probability is the intermediate value that the trace leaks most strongly compared with all other intermediate values.
Furthermore, upon obtaining the prediction vector for all the attack traces, the secret key prediction follows. It is worthwhile to state that, for the key prediction task in this work, unlike typical machine learning problems, all the attack traces together contributed in the key prediction task. That is, a key ranking method was employed in this process. Therefore, if the trained model has a high generalization and if the number of test traces is sufficiently large, the likelihood of the secret key is higher than the likelihood from the incorrect key candidates.
Consequently, key ranking, rather than accuracy, has proven to be one of the best and most widely used metrics in SCA attacks [30]. The output class probabilities of each trace are considered in the key ranking calculation because they carry crucial information. Furthermore, the classes represent intermediate values, whereas the secret key is our objective, and key ranking bridges the two. For every trace $i$ in $Q$, given our trained model with $j$ possible classes and $k$ possible key guesses, we perform the following evaluation for every $k$. Each trace $i$ has an accompanying plaintext in its dataset, from which we compute the S-box output under the key guess $k$ and apply the Hamming weight; this value is $j$. Based on the trace $i$ and the estimate $j$, the model probability $\Pr(j \mid x_i)$ is obtained, and its logarithm is accumulated over all traces of the attack set $Q$. All the other $k$ values are evaluated in the same way. The trained classifier then predicts the secret key for the attack set $Q$ by selecting the key $\hat{k}$ with the maximum log-likelihood, as given in Equation (4). If the correct key $k^*$ is the key with the maximum log-likelihood, then the model has predicted the secret key correctly.
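The key-ranking procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `rank_keys`, `label_fn`, and the toy data are hypothetical names, and the toy label function uses HW(p XOR k) in place of the S-box output only to keep the example short.

```python
import math

def rank_keys(probabilities, plaintexts, label_fn, num_keys=256):
    """Accumulate log-likelihoods per key guess and return the best guess.

    probabilities[i][j] is the model probability that trace i is in class j.
    label_fn(p, k) returns the class implied by plaintext p under key guess k.
    """
    eps = 1e-36  # avoids log(0) when the model outputs a zero probability
    log_likelihood = [0.0] * num_keys
    for probs, p in zip(probabilities, plaintexts):
        for k in range(num_keys):
            j = label_fn(p, k)
            log_likelihood[k] += math.log(probs[j] + eps)
    best_key = max(range(num_keys), key=lambda k: log_likelihood[k])
    return best_key, log_likelihood

def toy_label(p, k):
    """Stand-in leakage model (assumption): Hamming weight of p XOR k."""
    return bin(p ^ k).count("1")

# Simulated model output: most of the probability mass on the true class.
secret_key = 0x2B
plaintexts = list(range(256))
probabilities = []
for p in plaintexts:
    probs = [0.01] * 9
    probs[toy_label(p, secret_key)] = 0.92
    probabilities.append(probs)

best_key, scores = rank_keys(probabilities, plaintexts, toy_label)
print(hex(best_key))
```

Summing logarithms over all attack traces is what lets many individually weak predictions combine into a confident key guess, which is why every attack trace contributes to the final prediction.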
5. Hill Climbing Search Framework
In this section, we propose the HCA to address the hyperparameter search problem. The algorithm is illustrated in Figure 1, with the pseudocode detailed in Algorithm 1. Our framework consists of two primary phases: the profiling phase and the attack phase. The hill climbing optimization occurs during the profiling phase. We focused on two neural networks, MLP and CNN, due to their prominence in the deep learning SCA community [1]. Thus, the hyperparameters we reference pertain specifically to MLP and CNN models. The efficacy of our proposed method will be demonstrated through comparisons with the widely used Random Search (RS) technique. RS is commonly employed in hyperparameter tuning for deep learning models in SCA due to its simplicity and its broad exploration of the search space.
As depicted in Figure 1, the algorithm begins by initializing the iteration count $a$ to 0. An arbitrary set $S$ of hyperparameters was selected to serve as the architecture for a deep learning model. The model was then trained over ten epochs, resulting in a profiling model. The HCA optimization determines the next set of hyperparameters to explore in subsequent iterations based on the performance of the current model, provided the iteration count $a$ remains below the iteration limit. Drawing from the literature and an extensive calibration phase, and considering the limited training resources and time constraints in profiling attacks, we capped the number of iterations at 200. This optimization search was completed in five hours using a CPU and a Tesla V100 GPU with 16 GB of memory and 5120 GPU cores.
After completing 200 iterations, the best model was selected from the set $T$ of trained models based on the GE SCA metric. The GE was estimated by performing an attack on the traces from the validation set $V$, and it serves as the objective function estimate, $O$, for the set of hyperparameters. Specifically, 5000 traces from the attack dataset were used as the validation data to estimate the GE. The best model was then identified, trained, and subsequently utilized again to attack the training dataset, as depicted in the flowchart in Figure 1. Similarly, the trained model was employed to attack the 5000 traces from the attack dataset $Q$ to estimate the GE, which reflects the attack performance of the trained model.
For this work, the MLP hyperparameters considered were the number of dense (fully connected) layers, the number of neurons in those layers, the learning rate, and the activation function across all layers. These are summarized in Table 1. The hyperparameter search space for the CNN is shown in Table 2.
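Algorithm 1 Hill climbing algorithm pseudocode
Input: Search space consisting of MLP or CNN hyperparameters, objective function l, and neighborhood function
1. Pick a hyperparameter sample at random.
2. Evaluate the objective function l on the sample; initialize a dummy variable to start the algorithm.
3. While the iteration limit has not been reached, use the neighborhood function to pick a neighboring sample, evaluate l on it, and keep it as the current sample if it improves the objective.
Output: Hyperparameters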
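The pseudocode above can be made concrete with a minimal, generic hill climbing sketch. It is written under stated assumptions and is not the authors’ implementation: the search space, the neighborhood function (move one hyperparameter to an adjacent value in its list), and the toy objective are hypothetical, and the adaptive search steps mentioned earlier are not reproduced. In practice, the objective would train a model for a few epochs and return its guessing entropy on the validation traces.

```python
import random

# Hypothetical MLP search space; names and values are illustrative only.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4, 5],
    "neurons": [50, 100, 200, 400, 800],
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "activation": ["relu", "selu", "tanh", "elu"],
}

def random_sample(space, rng):
    """Pick one value per hyperparameter uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}

def neighbor(space, current, rng):
    """Move one randomly chosen hyperparameter to an adjacent list entry."""
    name = rng.choice(list(space))
    values = space[name]
    idx = values.index(current[name])
    step = rng.choice([-1, 1])
    new_idx = min(max(idx + step, 0), len(values) - 1)  # clamp at the ends
    candidate = dict(current)
    candidate[name] = values[new_idx]
    return candidate

def hill_climb(space, objective, iterations=200, seed=0):
    """Keep a candidate only if it does not worsen the objective (lower is better)."""
    rng = random.Random(seed)
    current = random_sample(space, rng)
    current_score = objective(current)
    for _ in range(iterations):
        candidate = neighbor(space, current, rng)
        score = objective(candidate)
        if score <= current_score:  # accepting ties lets the search cross plateaus
            current, current_score = candidate, score
    return current, current_score

# Toy objective (assumption): distance, in list positions, from a made-up target.
TARGET = {"layers": 3, "neurons": 200, "learning_rate": 1e-3, "activation": "selu"}

def toy_objective(cfg):
    return sum(
        abs(values.index(cfg[name]) - values.index(TARGET[name]))
        for name, values in SEARCH_SPACE.items()
    )

best_cfg, best_score = hill_climb(SEARCH_SPACE, toy_objective, iterations=200)
print(best_cfg, best_score)
```

In contrast to Random Search, each new candidate here depends on the best configuration found so far, so the iteration budget is spent in the neighborhood of promising configurations.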
6. Understanding the Test Cases and Results
We detail the hyperparameter values considered in training the models in Table 1 and Table 2. Table 1 lists the hyperparameters for the MLP, while Table 2 presents those for the CNN. The search space for the MLP consisted of 23,040 combinations, whereas the CNN search space was significantly larger, with 637,009,920 combinations. Ideally, identifying the single best model would involve exhaustively exploring all possible combinations within the search space during hyperparameter tuning. However, due to constraints of time and memory, we limited our exploration to 200 combinations from the search space [
14]. Therefore, for both the Random Search (RS) and the proposed Hill Climbing Algorithm (HCA), we fixed the number of iterations at 200, which was deemed sufficient for this study. The general architecture of our algorithm was divided into two phases. As illustrated in the flowchart of
Figure 1, the first phase involved randomly selecting a hyperparameter combination from either the MLP or CNN set. This selection was followed by ML training to produce a profiling model. Based on the performance of the initial profiling model, either RS or HCA determined the next hyperparameter combination. This process was repeated 200 times, as specified by the number of iterations. From the 200 profiling models generated, the best model was selected, retrained, and then used to perform the attack on the attack dataset. For this study, we utilized the ASCAD and CHES CTF datasets, in both their fixed-key and random-key versions. The results were consistent across both datasets; however, due to page limitations, we present the results for ASCAD only.
Table 3 includes other important design parameters employed in our experiment.
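To make the search budget concrete, the sketch below computes the size of a discrete search space as the product of the number of options per hyperparameter, and implements a Random Search baseline with a fixed number of evaluations. The dictionary layout of the search space and the `objective` callable are hypothetical, and the objective is assumed to be minimized.

```python
import math
import random


def search_space_size(space):
    """Number of distinct combinations in a discrete search space."""
    return math.prod(len(options) for options in space.values())


def random_search(space, objective, budget=200, seed=0):
    """Evaluate `budget` uniformly sampled configurations and keep the best one."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(budget):
        cand = {name: rng.choice(options) for name, options in space.items()}
        score = objective(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score
```

With a budget of 200 evaluations, a search covers roughly 0.9% of the 23,040 MLP combinations and only about 3 in 10 million of the 637,009,920 CNN combinations, which is why the choice of search strategy matters far more in the CNN case.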
We first analyzed the results from the ASCAD fixed dataset where the secret key was fixed and then the ASCAD variable dataset where the secret key was random.
6.1. ASCAD Fixed Key
For this experiment and all others, we considered three objective functions: key rank,
, and validation accuracy, each trained for 10 and 50 epochs, resulting in six plots. We started by examining the performance of the MLP under the ID leakage model, as shown in
Figure 2a,b. With the HCA, all objective functions led to a successful prediction of the secret key. In particular, the key rank objective with 50 epochs was the most efficient, requiring 90 attack traces, which demonstrates the efficacy of the HCA search optimization method. Likewise, for RS, as shown in
Figure 2b,
achieved the best performance, and it also required about 90 traces. The HCA and the RS therefore performed similarly. For both search methods, all of the objective functions led to a correct prediction of the secret key, indicating that the ASCAD fixed-key dataset is easy to break.
In the next phase, we studied the MLP performance under the HW leakage model, as presented in
Figure 3a,b. Here, the HCA clearly performed better across all three objectives. For the HCA, the
and the key rank objectives yielded the best attack performance of about 600 traces while the validation accuracy required about 1000 traces, as shown in
Figure 3a. For the RS, the key rank and the
also resulted in the best performance, requiring about 1000 traces while the accuracy objective function needed about 1500 traces, as can be seen in
Figure 3b. In all test cases, for both HCA and RS, every profiling model successfully predicted the secret key, again showing how easy it is to break the ASCAD fixed-key dataset.
We then examined the CNN performance. As in the previous stage, we first looked at the ID leakage model. For the HCA in
Figure 4a, the key rank objective resulted in the best attack performance of about 300 attack traces. The
and validation accuracy objectives followed with about 800 traces. In general, all objective functions resulted in a successful attack with the HCA. On the other hand, as shown in
Figure 4b, for the RS, the performance was clearly worse than that of the HCA: its best profiling models, obtained with the key rank and accuracy objectives, required about 2500 traces. The
objective did not yield a successful attack. This illustrates a disadvantage of RS: randomly selecting hyperparameter combinations does not guarantee a successful attack. The CNN had a much larger search space, and the HCA, which works by climbing toward peaks of the objective, was more efficient than RS because it converged to a solution.
Next, we examined the CNN performance using the HW leakage model.
Figure 5a shows the results obtained with the HCA hyperparameter tuning. The
objective function yielded the profiling model with the best performance at about 700 traces, followed by the key rank and accuracy objective functions, with about 800 and 1000 traces, respectively. For the RS, as shown in
Figure 5b, the key rank achieved the best attack performance of about 1200 traces. Next were both the
and the accuracy objective, with about 1500 traces. To summarize, all objective functions succeeded in predicting the secret key for both the HCA and RS methods, but the HCA performed better than the RS. Furthermore, the key rank objective yielded the best performance in both methods. Compared with the CNN under the ID leakage model, the CNN trained under the HW leakage model generally performed better. The reason is that the HW leakage model has fewer classes to learn, i.e., 9 as opposed to 256, given the same amount of training data, so the HW models generalized better after training. In addition, the MLP models generally attacked better than the CNN models: the MLP has a much smaller hyperparameter search space (about 23,000 combinations versus over 637 million for the CNN), while the training set remained at 50,000 traces across all search methods, which is why the MLP search converged much faster.
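The class counts behind this argument (9 for HW versus 256 for ID) follow directly from how a byte-valued intermediate is labeled. The short sketch below derives both label sets for all possible byte values. The variable names are illustrative, and no particular intermediate value (such as an S-box output) is assumed.

```python
from collections import Counter


def hamming_weight(value):
    """Number of set bits in an integer."""
    return bin(value).count("1")


# ID leakage model: the label is the byte value itself (256 classes).
id_labels = list(range(256))

# HW leakage model: the label is the Hamming weight of the byte (9 classes, 0 to 8).
hw_labels = [hamming_weight(v) for v in id_labels]

# How many byte values fall into each HW class (binomial coefficients C(8, k)).
hw_class_counts = Counter(hw_labels)
```

The HW classes are far from uniform: for uniformly distributed bytes, 70 of the 256 values have Hamming weight 4, while weights 0 and 8 each correspond to a single value. Fewer classes therefore come with a class imbalance that the training procedure has to cope with.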
We summarize the results obtained from this experiment in
Table 4. The table shows the method that achieved the best performance in each category listed as a row across all objective functions. In three out of the four cases, HCA demonstrated superior performance. In the remaining test case, both methods achieved comparable results. Overall, these findings highlight the effectiveness of HCA in improving performance over RS in most scenarios.
6.2. ASCAD Variable Key
First, we analyzed the GE performance of the MLP on the ASCAD variable key data based on the ID leakage model. In the HCA of
Figure 6a, all three models yielded successful attacks, each requiring around 3000 traces. For the RS, on the other hand, only the key rank objective was successful in predicting the secret key, as can be observed in
Figure 6b, requiring about 3500 traces. The failure of the other RS models to recover the secret key reflects the fact that the variable-key dataset is more difficult to train on than the fixed-key dataset. The HCA therefore copes better with hyperparameter search in this scenario, which further underlines the robustness of the algorithm.
Furthermore, we studied the GE performance of the MLP on the ASCAD variable key data based on the HW leakage model. In the HCA from
Figure 7a, the profiling model from the accuracy objective predicted the secret key in about 600 traces, making it the best among the three objectives. Next were the key rank and the
with 1000 and 2000 traces, respectively. For the RS, as shown in
Figure 7b, the key rank achieved the best performance, requiring about 700 traces. The accuracy and the
followed with 800 and 1200 traces, respectively. Therefore, for MLP training based on the HW leakage model, all objectives in both search methods resulted in profiling models that correctly predicted the secret key, which was not the case with the ID leakage model. The HW leakage model performed better because it has fewer classes, i.e., 9 compared with the ID model's 256, which makes training a generalized model faster. It is worth mentioning that HCA's superior performance under both the ID and HW leakage models reflects its flexibility and robustness across different hyperparameter search scenarios.
Next, we analyzed the performance of the CNN. For the ID leakage model, none of the objective functions predicted the secret key within 5000 traces with the HCA, as summarized in
Table 5 and seen in
Figure 8a. However, we observed that the accuracy and the
objectives were within one rank of cracking the secret key. Thus, we estimated that, in another 20 traces, these two objectives would predict the secret key. For RS, on the other hand, as shown in
Table 5, and
Figure 8b, only
predicted the secret key successfully, requiring 5000 traces. This observation corroborates the claim that the CNN is harder to train because of its much larger search space, which makes it difficult for the hyperparameter tuning algorithm to find a trainable hyperparameter combination. In addition, from our analysis, the HCA yielded better training outcomes in this setting, as two of its objectives produced profiling models that came within one rank of the secret key, versus the single successful objective in RS.
The last of the test cases that we looked at was the GE performance of a CNN under the HW leakage model. From
Table 5, and
Figure 9a,b, we can see the performance plots for both the HCA and RS. For the HCA in
Figure 9a, with the
and key rank objectives, the HCA successfully predicted the secret key in 1400 and 4100 traces, respectively. The RS, on the other hand, in
Figure 9b had all the objective functions producing models with a GE greater than 50. We concluded from this that, when training a CNN with the HW leakage model, the HCA produces better profiling models than the RS. Also, in this test case and the others, hyperparameter tuning with both HCA and RS generally performed better when using the HW leakage model than when using the ID leakage model. Lastly, our proposed HCA was very competitive, as represented in
Table 6, compared against [
14,
36] in complexity, that is, the number of trainable parameters and the time to reach a GE of 1, and it outperformed the existing methods in two of the four categories. We calculated the complexity of our models using Equations (8) and (9):
where $L$ is the number of fully connected layers, and $n_i$ is the number of neurons in layer $i$.
where $C$ is the number of convolutional layers, $f_i$ is the number of filters in the $i$-th convolutional layer, $k_i$ is the size of the convolution kernel, $c_i$ is the number of input channels for the $i$-th convolutional layer, $F$ is the number of fully connected layers, and $n_j$ is the number of neurons in the $j$-th fully connected layer.
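As a rough illustration of how trainable parameters can be counted from the quantities defined above, the sketch below assumes that each fully connected layer contributes its weights plus one bias per neuron, and that each one-dimensional convolutional layer contributes kernel size times input channels times filters, plus one bias per filter. This is a common convention and not necessarily the exact form of Equations (8) and (9). The function names, the `input_dim` argument, and the example sizes are hypothetical, and pooling or normalization layers are not counted.

```python
def mlp_params(input_dim, layer_sizes):
    """Trainable parameters of a stack of fully connected layers.

    input_dim   : number of inputs to the first layer
    layer_sizes : number of neurons in each fully connected layer, in order
    """
    total, prev = 0, input_dim
    for n in layer_sizes:
        total += prev * n + n  # weights + biases
        prev = n
    return total


def conv1d_params(kernel_sizes, in_channels, filters):
    """Trainable parameters of a stack of 1-D convolutional layers."""
    return sum(k * c * f + f for k, c, f in zip(kernel_sizes, in_channels, filters))


def cnn_params(kernel_sizes, in_channels, filters, flattened_dim, dense_sizes):
    """Convolutional part plus the fully connected part that follows it."""
    return conv1d_params(kernel_sizes, in_channels, filters) + mlp_params(
        flattened_dim, dense_sizes
    )
```

For example, under these assumptions an MLP with 700 inputs and layers of 100 and 256 neurons has 700·100 + 100 + 100·256 + 256 = 95,956 trainable parameters.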
We summarized the results obtained from this experiment in
Table 5. The table shows that, in three out of the four cases, HCA demonstrated superior performance, while RS achieved the best performance in only one test case. These results reinforce the robustness and effectiveness of HCA in comparison to RS, further highlighting its advantages in optimizing hyperparameters for deep learning models in side-channel analysis.
6.3. Statistical Significance Test of the Results
In this section, we calculate the p-values to assess the statistical significance of our experimental results. To ensure a rigorous analysis, we adopted a systematic approach that involves hypothesis testing, a computation of test statistics, a calculation of p-values, and an interpretation of the results. The process is detailed in Algorithm 2.
Algorithm 2 The pseudocode for the statistical significance testing |
Input: Data from experiments: HCA results and RS results. |
Output: p-value to determine if HCA outperforms RS. |
Define hypotheses:
- Null hypothesis ($H_0$): No significant difference between the performances of HCA and RS.
- Alternative hypothesis ($H_1$): Significant difference between the performances of HCA and RS.
Select a statistical test: Use an independent t-test to compare the means of the two groups (HCA vs. RS).
Calculate the test statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where:
- $\bar{X}_1$ and $\bar{X}_2$ are the sample means for HCA and RS, respectively.
- $s_1^2$ and $s_2^2$ are the sample variances for HCA and RS, respectively.
- $n_1$ and $n_2$ are the sample sizes for HCA and RS, respectively.
Determine the p-value: Calculate the p-value using the t-distribution:
$$p = 2 \cdot P\left(T_{\nu} \geq |t|\right)$$
where $P(T_{\nu} \geq |t|)$ is the probability that the t-distribution with $\nu$ degrees of freedom is greater than or equal to the absolute value of $t$.
Interpret the p-value:
- If $p < \alpha$ (typically $\alpha = 0.05$), reject the null hypothesis $H_0$.
- Conclude that the difference in performance between HCA and RS is statistically significant.
Analyze and report results:
- Use the number of traces required to achieve a guessing entropy (GE) of 1 as the primary metric.
- Aggregate the results across test cases and compute the sample size (e.g., 100 samples).
- If p is approximately 0, then conclude that HCA significantly outperforms RS in hyperparameter tuning.
To provide robust statistical evidence that our HCA search outperforms Random Search, we calculated the p-values for these tests. We used the number of traces required to reach a guessing entropy (GE) of 1 as our sampled data. The results were aggregated across all test cases, and multiple experiments were conducted to ensure a total sample size of 100. For both experiments, the calculated p-values were approximately 0, indicating that the observed differences were statistically significant and unlikely to be due to chance. Based on these results, our proposed hill climbing search demonstrated superior performance compared to the baseline Random Search.
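The test procedure above can be sketched in a few lines using only the standard library. The sketch assumes the unequal-variance (Welch) form of the independent t-test, with the Welch-Satterthwaite approximation for the degrees of freedom, and it approximates the t-distribution tail with a normal tail, which is reasonable only for large samples such as the 100 used here. The function name and the inputs `a` and `b` (traces-to-GE-1 samples for HCA and RS) are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Two-sided Welch t-test; returns (t, degrees of freedom, approximate p-value)."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    v1, v2 = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = v1 / n1 + v2 / n2            # squared standard error of the difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    # Normal approximation to the t-distribution tail (assumption: large df).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p
```

When SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic with an exact t-distribution p-value. A negative `t` here means the first group needed fewer traces on average.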
We then examined the limitations of the proposed HCA. Hill climbing is a local search technique that iteratively adjusts hyperparameters to enhance performance. However, its primary drawback is its susceptibility to becoming trapped in local optima. Unlike more sophisticated methods such as Bayesian optimization or genetic algorithms, which explore the search space globally, hill climbing (HC) often becomes stuck in sub-optimal regions, especially in high-dimensional or complex search spaces. This limitation is particularly evident in performance landscapes with numerous local peaks, where HC may converge prematurely without reaching the global optimum. Moreover, HC does not incorporate uncertainty estimates, a crucial feature of Bayesian optimization that allows for the effective balancing of exploration and exploitation during the search process. To overcome these challenges, several researchers have proposed adaptive hill climbing algorithms that dynamically adjust the search strategy to better navigate such complex landscapes [
33,
45]. Investigating these adaptive variants within the context of deep learning-based side-channel analysis for hyperparameter optimization could offer promising avenues for future research.
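The local-optimum limitation described above is easy to reproduce on a toy problem. The sketch below runs a greedy climber (maximizing, to match the "peaks" wording) on a made-up one-dimensional landscape with a local peak and a higher global peak. The landscape values are purely illustrative and are not experimental data.

```python
def climb(landscape, start):
    """Greedy hill climbing on a 1-D list: move to the better neighbor until none is higher."""
    i = start
    while True:
        candidates = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = max(candidates, key=lambda j: landscape[j])
        if landscape[best] <= landscape[i]:
            return i  # no neighbor is strictly better: a (possibly local) peak
        i = best


# Toy landscape: local peak at index 2 (value 5), global peak at index 6 (value 9).
landscape = [1, 3, 5, 4, 2, 6, 9, 7]
stuck_at = climb(landscape, start=0)    # climbs 0 -> 1 -> 2 and stops at the local peak
reaches = climb(landscape, start=4)     # climbs 4 -> 5 -> 6 and reaches the global peak
```

The outcome depends entirely on the starting point, which is the behavior that adaptive variants of hill climbing aim to mitigate.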
7. Discussion
Our experiments validated that the HCA is a viable method for hyperparameter tuning in deep learning-based SCA. In most test cases, HCA demonstrated superior performance compared to Random Search, particularly in terms of achieving lower guessing entropy. This performance advantage was further confirmed through statistical significance tests. Notably, HCA outperformed Random Search in six of the eight test cases, across both the fixed-key and variable-key datasets, supporting its effectiveness in hyperparameter optimization.
In our approach, hill climbing efficiently navigated the hyperparameter search space by making iterative, small-scale adjustments that improved performance. Unlike grid search, which can be computationally expensive, or Random Search, which is directionless, HCA offers a systematic, performance-driven exploration. Because each move is based on feedback from the objective function, this strategy reduces the risk of settling on poor configurations, although, as noted in Section 6.3, it can still become trapped in local optima. Furthermore, the simplicity and lower computational overhead of HCA make it particularly well suited for complex models and larger datasets, where traditional methods may struggle with overfitting. In addition, the data disproportionation problem, where certain classes are underrepresented, can hinder the performance of deep neural networks (DNNs) [
46]. The HCA mitigates this by optimizing hyperparameters like learning rates, batch sizes, and class weights, which better accommodate imbalanced datasets. HCA’s tailored search reduces the risk of overfitting to the majority class, enhancing overall model performance in scenarios with significant class imbalance.
In terms of computational complexity, our HCA is generally more resource efficient than Random Search. HCA’s targeted approach converges more quickly to optimal or near-optimal solutions, reducing computational costs. In contrast, Random Search randomly samples the hyperparameter space, often requiring more iterations to achieve similar performance, resulting in higher resource demands. While Random Search can occasionally yield faster results due to its randomness, HCA consistently provides better resource efficiency in both average and worst-case scenarios.
Furthermore, HCA, while simpler, demonstrates lower computational complexity due to its iterative local adjustments, allowing faster convergence to optimal or near-optimal solutions. In contrast, techniques like Bayesian optimization, genetic algorithms, and simulated annealing incorporate global search mechanisms that can better escape local optima, though often at the cost of increased computational demands and complexity [
14]. Bayesian optimization, for instance, excels at balancing exploration and exploitation by leveraging probabilistic models, but it can be resource intensive. Genetic algorithms offer strong global search capabilities but may require extensive population management and iterations [
47]. Simulated annealing, with its probabilistic acceptance of worse solutions, helps avoid premature convergence but introduces additional computational overhead. On the other hand, the multi-dimensional fusion convolutional residual dendrite (MD_CResDD) network excels in profiling speed and feature extraction for side-channel analysis through multi-scale feature fusion. In contrast, our HCA systematically optimizes hyperparameters, potentially achieving superior accuracy across diverse datasets when tested under identical conditions [
48].
Our hill climbing-based side-channel analysis attack is a powerful technique for extracting secret information from cryptographic devices. Such techniques have improved attack performance over the years, thus providing valuable insights in industry, as well as academia. Because this application lies in the security field, it raises important ethical concerns.
The primary ethical concern with SCA attacks lies in their potential misuse for malicious purposes. If the proposed HCA for SCA falls into the wrong hands, there is a significant risk of it being exploited to compromise the security of systems by extracting confidential data. The accessibility and proliferation of various SCA methods, coupled with the growing availability of cloud computing resources, make such attacks increasingly feasible, even for adversaries with limited resources. This risk is further amplified by the low-cost nature of SCA attacks, allowing attackers to breach secure systems without substantial investment. Therefore, it is crucial to address these ethical implications and emphasize responsible usage, focusing on defensive applications and contributing to the development of stronger countermeasures.
On the other hand, advancements in side-channel techniques have been instrumental in fostering the development of more secure hardware designs. A successful side-channel attack reveals specific vulnerabilities in the design that adversaries can exploit. Identifying these flaws allows security experts to develop and implement more robust countermeasures, thereby fortifying the defense mechanisms of the hardware. The continuous refinement of side-channel techniques ultimately drives improvements in hardware security, ensuring that future devices are better protected against such attacks. Consequently, many hardware manufacturing companies now incorporate side-channel analysis testing as a standard stage in their design process to proactively address potential weaknesses.