Article

Improved Correlation Power Analysis Attack on the Latest Cortex M4 Kyber Implementation

by Costin Ghiban *,† and Marios Omar Choudary
Computer Science Department, Faculty of Automatic Control and Computer Science, National University of Science and Technology POLITEHNICA Bucharest, 060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Cryptography 2025, 9(1), 19; https://doi.org/10.3390/cryptography9010019
Submission received: 13 February 2025 / Revised: 10 March 2025 / Accepted: 11 March 2025 / Published: 16 March 2025

Abstract
CRYSTALS-Kyber has been standardized as a general public-key post-quantum algorithm under the name of ML-KEM after NIST released its first three final post-quantum standards in August 2024. The resilience of post-quantum cryptography to side-channel attacks has been an important research endeavor, and there have been many attacks designed, including basic Correlation Power Analysis. This paper adapts existing Correlation Power Analysis attacks to the most recent ARM Cortex M4 optimized implementation that uses Plantard arithmetic. It also demonstrates an improved version of a CPA that results in a 50% speedup compared to the original attack. Data are gathered and the mathematical model is tested using a ChipWhisperer-Lite board.

1. Introduction

The risk posed by quantum computers and Shor’s algorithm [1] to public-key cryptographic systems prompted the National Institute of Standards and Technology (NIST) to launch in 2016 a standardization process for one or several quantum-resistant algorithms (also denoted post-quantum) for public-key encryption and key establishment, and digital signatures [2].
One important class of vulnerabilities in which NIST encouraged the cryptographic community to invest time is that of Side-Channel Attacks (SCAs) [3]. Although it is essential to possess a quantum-resistant design, testing the resilience of the new cryptosystems to leakage attacks is equally necessary while the process of adoption is still ongoing.
In 2022, CRYSTALS-Kyber was chosen as the main candidate for standardization for public-key encryption. Later, in 2024, it became the basis for the FIPS 203 standard [4], changing its name to ML-KEM (Module-Lattice-Based Key-Encapsulation Mechanism). Kyber is a lattice-based Key Encapsulation Mechanism (KEM) that relies on the hardness of the Module-Learning-With-Errors (Module-LWE) problem, where the keys and ciphertexts are vectors of polynomials with coefficients in the residue class ring $\mathbb{Z}_Q$, that is, the integers modulo $Q$, where $Q$ is one of Kyber’s parameters.
There are a number of primitives that can potentially leak information about secret key coefficients. In this paper, we demonstrate an improved Correlation Power Analysis (CPA) attack on the pairwise coefficient multiplication between the secret key and ciphertext polynomials in the Number Theoretic Transform (NTT) domain. The Kyber implementation that we target is the latest version of the open-source pqm4 library [5], designed for the ARM Cortex M4 microcontroller. The traces are collected with the ChipWhisperer side-channel analysis board [6].
While CPA attacks on Kyber’s pqm4 implementation have been made in the past [7,8], our main contributions are updating the attack to the newest implementation and leveraging Plantard arithmetic as an optimization in the latest versions of the library to produce a 50% speedup compared to previous works. The observations presented in this paper could also lead to further improvements and attack vectors.

2. Background

Public-key cryptographic schemes generally rely on hard mathematical problems in their security assumptions. This is currently the case with RSA’s large integer factorization into large prime factors [9] and Diffie–Hellman’s discrete logarithms over multiplicative cyclic groups or elliptic curves [10].
Shor’s algorithm is known to solve these problems in polynomial time on a quantum computer [1], which would make them practical to break. However, because it requires quantum hardware, it is not yet practical to run: despite important advances in recent years, the technology is still not developed enough. Even so, the prospect of a future quantum-capable attacker defeating the most common communication security primitives cannot be ignored. Hence, new cryptosystems founded on different hard mathematical problems that resist quantum attacks are required to prevent this scenario in time.

2.1. Lattice-Based Cryptography

Lattices are mathematical objects formally defined as (1) a partially ordered set (poset), $(L; \leq)$, in which there exists a supremum, $\sup\{a, b\}$, and an infimum, $\inf\{a, b\}$, for any $a, b \in L$, or (2) an algebra, $(L; \vee, \wedge)$, where $L$ is a nonempty set and $\vee$ and $\wedge$ are binary operations that satisfy the properties of idempotency, commutativity, and associativity, along with the absorption identities:
$$a \vee (a \wedge b) = a, \qquad a \wedge (a \vee b) = a$$
The two definitions are equivalent, as shown in [11]. In practice, lattices are presented as the set of linear combinations with integer coefficients of the vectors of a vector space basis. Their use in cryptography stems from certain NP-hard optimization problems such as the shortest vector problem [12].
Miklós Ajtai proposed a cryptographic system based on the theory of lattices [13]. More modern approaches, however, use lattices as a framework for other computationally hard problems, such as Learning With Errors (LWE), which is said to be computationally as hard as worst-case lattice problems [14]. LWE asks us to recover a secret vector $s \in \mathbb{Z}_q^n$ from a series of random linear equations, each perturbed by a small error, $\pm\epsilon$ [15]. In mathematical notation, that is $A \cdot s + e = t$, where $A$ is a matrix, and $s$, $e$, and $t$ are vectors. The tuple $(A, t)$ forms the public key, while $s$ is the secret key. Without the error vector $e$, the linear system could be solved easily by Gaussian elimination.
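As an illustration of the relation $A \cdot s + e = t$, the following is a minimal sketch of a toy LWE instance. The dimensions, the error bound, and the function name `lwe_instance` are assumptions chosen for readability; they are not Kyber parameters and offer no security.

```python
import random

def lwe_instance(n=8, m=16, q=3329, err_bound=2, seed=0):
    """Build a toy LWE instance t = A*s + e (mod q). Sizes are illustrative only."""
    rng = random.Random(seed)
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]   # public matrix
    s = [rng.randrange(q) for _ in range(n)]                       # secret vector
    e = [rng.randint(-err_bound, err_bound) for _ in range(m)]     # small errors
    t = [(sum(a * x for a, x in zip(row, s)) + err) % q
         for row, err in zip(A, e)]
    return A, s, e, t

q = 3329
A, s, e, t = lwe_instance(q=q)

# Knowing s, the residual t - A*s (mapped to the centered range) is exactly
# the small error vector; without s, (A, t) looks like a noisy linear system.
residual = [((ti - sum(a * x for a, x in zip(row, s))) + q // 2) % q - q // 2
            for row, ti in zip(A, t)]
```

Here `residual` equals `e`, which shows that the secret holder can strip the noise, while an attacker who only sees `(A, t)` cannot apply Gaussian elimination directly.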

2.2. Kyber

CRYSTALS-KYBER and CRYSTALS-DILITHIUM are two selected solutions that have CRYSTALS (Cryptographic Suite for Algebraic Lattices) as a starting point. Kyber is a key encapsulation mechanism constructed from module lattices and the Module-LWE problem [16]. Kyber is said to be IND-CCA2 secure, meaning it can withstand and remain indistinguishable in the face of adaptive chosen ciphertext attacks. Its KEM is obtained from a Chosen Plaintext Attack secure Public-Key Encryption (PKE) algorithm by applying what is known as the Fujisaki–Okamoto (FO) transform [17].
A key encapsulation mechanism is a protocol for establishing a key for symmetric encryption between two parties using a public-key encryption scheme. This two-layer method was necessary to confer message integrity. A KEM exposes three methods: a key generation procedure that corresponds to the PKE key pair generation; an encapsulation method that generates the shared secret, encrypts it with one party’s public key and sends it to that party; and a decapsulation procedure, which decrypts the shared secret from the received ciphertext using the secret key [18].
The FO transform defines a ciphertext validity check inside the decapsulation procedure: the ciphertext (encapsulating the shared secret) is first decrypted using the secret key, then re-encrypted with the public key. If the newly encrypted ciphertext is equal to the input ciphertext, then the shared secret is returned. Otherwise, the request is dropped as invalid.
The Module-Learning With Errors (M-LWE) problem that sits at the basis of Kyber’s design is a generalization of the Ring-LWE, in which the underlying structure is the ring of polynomials with coefficients reduced modulo $Q$. M-LWE, among other particularities, deals with a module of rank $k$ over the ring $R_Q = \mathbb{Z}_Q[X]/f(X)$, that is, $R_Q^k = (\mathbb{Z}_Q[X]/f(X))^k$. We could see this as $s$, $e$, and $t$ from LWE being vectors of polynomials and, similarly, $A$ being a matrix of polynomials, satisfying Equation (1). The module algebraic structure offers more flexibility than the ring in terms of security and efficiency through the $k$ parameter, being stronger than the standard LWE, while having larger keys and ciphertexts than the Ring-LWE.
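$$\begin{bmatrix} A_{00}[X] & \cdots & A_{0k}[X] \\ \vdots & \ddots & \vdots \\ A_{k0}[X] & \cdots & A_{kk}[X] \end{bmatrix} \cdot \begin{bmatrix} s_0[X] \\ s_1[X] \\ \vdots \\ s_k[X] \end{bmatrix} + \begin{bmatrix} e_0[X] \\ e_1[X] \\ \vdots \\ e_k[X] \end{bmatrix} = \begin{bmatrix} t_0[X] \\ t_1[X] \\ \vdots \\ t_k[X] \end{bmatrix} \tag{1}$$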
Kyber PKE defines methods for key generation, encryption, and decryption, as presented in Algorithms 1–3. Here, CBD denotes a Centered Binomial Distribution function from which both noise and secret key coefficients are sampled. The $\eta_1$ parameter for Kyber512 is 3, as shown in Table 1, which means that the original secret key has coefficients in the set $\{-3, -2, -1, 0, 1, 2, 3\}$. For errors, the $\eta_2$ parameter is used. The secret key is then immediately transformed into the NTT domain and stored internally as a byte string. In fact, both the public-secret key pair and the ciphertext are stored in this format, an operation denoted as $\mathrm{Encode}_{num\_of\_bits}$ in the algorithms. The NTT is similar to a Discrete Fourier Transform (DFT), adapted for finite fields, and is used to optimize polynomial multiplications.
The other notation is as follows: PRF is a Pseudo-Random Function, instantiated in the Kyber specification with SHAKE-256, as is the Key Derivation Function (KDF); XOF is an Extendable-Output Function, instantiated with SHAKE-128; and H and G are two hash functions, namely SHA3-256 and SHA3-512 [19].
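To make the CBD sampling concrete, here is a minimal sketch of a centered binomial sampler with parameter $\eta$: each coefficient is the difference of two sums of $\eta$ random bits, so it lies in $[-\eta, \eta]$. The function name `cbd` and the use of Python's `random` module (instead of the PRF output that Kyber actually consumes) are assumptions made for illustration.

```python
import random

def cbd(eta, rng):
    """Sample one coefficient as (sum of eta bits) - (sum of eta other bits)."""
    a = sum(rng.getrandbits(1) for _ in range(eta))
    b = sum(rng.getrandbits(1) for _ in range(eta))
    return a - b

rng = random.Random(1)          # illustrative randomness source, not a PRF
samples = [cbd(3, rng) for _ in range(10000)]
support = sorted(set(samples))  # distinct values observed for eta = 3
```

With $\eta = 3$ every sample falls between $-3$ and $3$, with values near zero being the most frequent.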
Algorithm 1: KYBER.CPAPKE.KeyGen(): key generation
[Algorithm body provided as an image in the original article]
Algorithm 2: KYBER.CPAPKE.Enc(pk, m, r): encryption
[Algorithm body provided as an image in the original article]
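Algorithm 3: KYBER.CPAPKE.Dec($sk$, $c$): decryption
Input: Secret key $sk \in \mathcal{B}^{12 \cdot k \cdot n/8}$
       Ciphertext $c \in \mathcal{B}^{d_u \cdot k \cdot n/8 + d_v \cdot n/8}$
Output: Message $m \in \mathcal{B}^{32}$
1: $u \leftarrow \mathrm{Decompress}_q(\mathrm{Decode}_{d_u}(c), d_u)$;
2: $v \leftarrow \mathrm{Decompress}_q(\mathrm{Decode}_{d_v}(c + d_u \cdot k \cdot n/8), d_v)$;
3: $\hat{s} \leftarrow \mathrm{Decode}(sk)$;
4: $m \leftarrow \mathrm{Encode}_1(\mathrm{Compress}_q(v - \mathrm{NTT}^{-1}(\hat{s}^T \circ \mathrm{NTT}(u)), 1))$;
5: return $m$;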
Kyber KEM’s exposed methods for key generation, encapsulation, and decapsulation are shown in Algorithms 4–6. All PKE and KEM algorithms are taken from the official documentation and presented here for reference [16].
Algorithm 4: KYBER.CCAKEM.KeyGen(): key generation
Output: Public key $pk \in \mathcal{B}^{12 \cdot k \cdot n/8 + 32}$
        Secret key $sk \in \mathcal{B}^{24 \cdot k \cdot n/8 + 96}$
1: $z \leftarrow \mathcal{B}^{32}$;
2: $(pk, sk') \leftarrow \mathrm{KYBER.CPAPKE.KeyGen}()$;
3: $sk \leftarrow (sk' \,\|\, pk \,\|\, H(pk) \,\|\, z)$;
4: return $(pk, sk)$;
Algorithm 5: KYBER.CCAKEM.Enc($pk$): encapsulation
Input: Public key $pk \in \mathcal{B}^{12 \cdot k \cdot n/8 + 32}$
Output: Ciphertext $c \in \mathcal{B}^{d_u \cdot k \cdot n/8 + d_v \cdot n/8}$
        Shared key $K \in \mathcal{B}^{*}$
1: $m \leftarrow \mathcal{B}^{32}$;
2: $m \leftarrow H(m)$;
3: $(\bar{K}, r) \leftarrow G(m \,\|\, H(pk))$;
4: $c \leftarrow \mathrm{KYBER.CPAPKE.Enc}(pk, m, r)$;
5: $K \leftarrow \mathrm{KDF}(\bar{K} \,\|\, H(c))$;
6: return $(c, K)$;
Algorithm 6: KYBER.CCAKEM.Dec(): decapsulation
[Algorithm body provided as an image in the original article]

2.3. Side-Channel Attacks

Side-Channel Attacks are a special category of attacks that exploit design and implementation details of algorithms that cause unintended data leakage. The data are not leaked directly, but through related information such as time, memory access patterns, or power usage.
One particularly interesting side-channel, closer to the physical world, is the power consumption of a target function. The power consumption leakage lets the adversary distinguish both between instructions (load and store operations might consume more power than arithmetic operations) and between different data fed to those instructions.
The attack that relies on this side-channel is generally known as Power Analysis (PA) and comes in different forms. Simple Power Analysis (SPA), ideally, deduces information, usually instruction-related, by inspecting a single power trace, while Differential Power Analysis (DPA) covers multiple executions with different inputs and, thus, is rather data-driven [20].

Correlation Power Analysis

Another type of PA is Correlation Power Analysis (CPA). CPA correlates the power consumption with the Hamming weights (or other theoretical models) of the data processed by the target function, which generate Points of Interest (PoIs). The metric used in this case is the Pearson Correlation Coefficient (PCC) between the distribution of power amplitudes across inputs and the distribution of the Hamming weights of the same inputs, where the inputs fall into a chosen ciphertext scenario. The highest value of the PCC, given a sufficient number of traces, corresponds to the correct value of the secret [21].
The Pearson Correlation Coefficient of two populations is defined as the covariance of two random variables describing the populations scaled by the product of their standard deviations. The covariance measures the similarity in behavior (variability) of two random variables. The scaled value, denoted as $\rho$ in Equation (2), is therefore a metric of the linear dependency between two distributions, represented by the random variables $X$ and $Y$, and it lies in the interval $[-1, 1]$.
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{2}$$
The power of CPA is that it is applied on all the points in every power trace and on all possible key values, and it is able to determine both the PoI and the correct secret key simultaneously.
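The coefficient from Equation (2) can be computed directly from two equally long sample lists. The sketch below is a plain-Python version; the helper name `pearson` and the sample values are made up for illustration and are not measurements from the article.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences (Equation (2)).
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Hamming weights and power amplitudes at one sample point.
hw = [3, 5, 8, 2, 7, 4]
power = [-0.30, -0.52, -0.79, -0.21, -0.68, -0.41]
rho = pearson(hw, power)   # strongly negative: amplitude falls as HW rises
```

The common $1/n$ factors of the covariance and the standard deviations cancel, which is why the function works with plain sums.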

2.4. Plantard Arithmetic

In his 2021 paper, Thomas Plantard introduced a new modular reduction method suited to cryptographic applications with word-sized moduli [22], which sets it apart from other solutions such as Montgomery and Barrett reductions, which operate best on huge moduli (e.g., RSA with over 1024-bit keys). Its advantage is that, when one of the two multiplied numbers is a constant (i.e., a value used multiple times, as in modular exponentiation), one multiply operation can be saved. This was later leveraged in [23], which adapted the general Plantard arithmetic to the specific use case of lattice-based cryptography.
In particular, Kyber uses a modulus $Q$ equal to 3329, representable on 12 bits. The improvement brought about by [23] is the use of signed integers, which removes a constraint of the original algorithm and allows for a wider input range and a smaller output range ($[-\frac{Q+1}{2}, \frac{Q}{2})$), the latter being important to this paper, as will be discussed in Section 5.4.
This optimization using Plantard’s arithmetic is applied only to computations performed in the NTT domain. The same is true for older implementations using Montgomery reductions, which have an output range of $[-Q, Q]$. The computations with signed integers were found to run faster, at least on the Cortex M4 hardware. Outside of NTT, the values should lie in the theoretical interval, $[0, Q)$. Therefore, after performing the optimized computations in NTT, the resulting polynomial coefficients are reduced to $[0, Q)$ using a Barrett reduction as explained in [23]. In this way, the implementation achieves both efficiency and consistency.
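The signed Plantard multiplication can be sketched in a few lines. This is a reading of the word-size variant described in [23] under the assumptions of a 16-bit word ($l = 16$), $\alpha = 3$, and inputs in $[-2^{\alpha} Q, 2^{\alpha} Q]$; the names `plantard_mul` and `to_signed` are hypothetical, and the pqm4 assembly is organized differently (it precomputes the constant operand).

```python
Q = 3329
L = 16                               # word size in bits (assumption)
ALPHA = 3                            # input-range slack parameter (assumption)
QINV = pow(Q, -1, 1 << (2 * L))      # Q^{-1} mod 2^32

def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >= 1 << (bits - 1) else x

def plantard_mul(a, b):
    """Return r with r = -a*b*2^(-32) (mod Q) and r in [-(Q+1)/2, Q/2)."""
    t = to_signed(a * b * QINV, 2 * L)   # low 32 bits, read as signed
    t >>= L                              # keep the upper half (floor shift)
    return ((t + (1 << ALPHA)) * Q) >> L

r = plantard_mul(1234, -567)
```

Note that the result carries the factor $-2^{-32}$, so an implementation has to compensate for it, for example by folding it into precomputed constants.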

3. Related Works

As presented in Section “Correlation Power Analysis”, the Correlation Power Analysis is a strong tool for statistically inferring secret information from power leakage and facilitating a feasible brute-force attack.
CPAs on Kyber for full-key recovery have been devised before, most notably in [7,8], which targeted the NTT polynomial multiplication of the pqm4 implementation from Round 2, using a setup similar to that presented in this paper in Section 4. They form the basis for the attack methodology described in Section 5.
CPA was also used by Kuo and Takayasu in a two-step full-key recovery attack on Kyber and other algorithms relying on NTT [24]. In the first step, the CPA recovers a part of the secret key coefficients. These are then used to form an LWE problem with a smaller lattice rank, which helps to recover the rest of the coefficients in a shorter running time than for the CPA-only attack.
Kuo and Takayasu later improved their attack and introduced the method of negative correlation to distinguish false positives and reduce the number of required traces in the CPA step [25]. The main observation that relates to our work is that the correct pair of secret coefficients has approximately the same correlation, but of opposite sign, as its additive inverse with respect to $Q$. Around the same time, Wang et al. noticed the same thing and used this observation to reduce the number of traces and the time required for the two-step CPA-lattice attack [26]. Concurrently, we arrived at similar results during our experiments, but we use the observation that the correct key leads to a negative correlation to directly reduce the time and the search space for the secret coefficients in a CPA-only attack, as presented in Section 5.4.
In addition to CPA attacks, other SCAs relied on more complex techniques, such as Machine Learning [27,28,29] and template attacks [30,31], as well as lattice attacks [32].

4. Experimental Setup

The power consumption traces employed in the attacks and the intermediate results used for verification were collected using a CW1173 ChipWhisperer-Lite board with an ARM Cortex M4 target chip and an STM32F3 platform [6]. The board offers a 10-bit ADC capable of capturing power traces at a sample rate of 105 MS/s, and a clock frequency of up to 200 MHz. For our experiments, we used the default setting of 4 × 7.37 MHz ADC clock. The sample buffer size of 24,573 samples proved to be enough to capture the first polynomial multiplication, but the ADC also offers an offset mechanism for capturing operations outside the initial buffer window.
ChipWhisperer also comes with an open-source Python library and a SimpleSerial communication protocol for interacting with the target board. Most of the commonly used cryptographic algorithms are included in a C crypto library, ready to be compiled and programmed onto the target. Since Kyber and all other post-quantum algorithms are not widely adopted yet, they are not included in the library provided by the developers and had to be built manually.
The Kyber512 implementation we programmed onto the board is the round 4 iteration of the pqm4 open-source library [5]. The pqm4 library builds on the PQClean project [33] to create optimized implementations for the ARM Cortex M4 environment, a common medium for side-channel analysis. Among these optimizations is the introduction of signed modular reductions and, subsequently, the Plantard arithmetic [22,23].

5. Attack Methodology

The baseline for this attack is the previous CPA work of [7,8], with the added observation that we obtain peak correlation values both for the correct key values and for their additive inverses.
The attack steps are as follows. First, we determine the sampling window in the trace where the target operation is performed. Then, the first pair of secret key coefficients is guessed by picking the pair with the highest correlation coefficient, and the same is done for each subsequent pair until full-key recovery. Lastly, we repeat the attack with different dataset sizes and conclude that it works with a minimum of 80 traces, although for some keys this number could be even lower, as is the case with our experimental data.
Our target operation is the basemul function, which performs pair-pointwise multiplication of polynomial coefficients in the NTT domain. More specifically, in the decryption procedure that is used during decapsulation (Algorithm 6), the ciphertext and secret key polynomials are multiplied in NTT in pairs of coefficients (see the innermost computation on Line 4 of Algorithm 3). Note that compared to Karlov [7], we worked directly with the NTT coefficients, thus eliminating an operation that is redundant in the actual attack.
The multiplication in NTT is performed, as said, in pairs. The basemul function takes two ciphertext coefficients, $u_0$ and $u_1$, two secret key coefficients, $s_0$ and $s_1$, and a constant, $z$, and outputs two result coefficients, $r_0$ and $r_1$. The parameter $z$ refers to one of the roots of unity of the $X^N + 1$ polynomial. For Kyber512, the number of coefficients in a polynomial is $N = 256$, which means that we have 128 calls to basemul. In practice, the calls are grouped 2 by 2 in what is known as the doublebasemul procedure. Both basemul iterations in a doublebasemul use the same $z$, but with opposite signs.
We could formally describe the result coefficients obtained from the first basemul function with the following system of equations:
$$(u_0, u_1) \cdot (s_0, s_1) = (r_0, r_1) \tag{3}$$
$$r_0 = (u_1 \cdot s_1 \cdot z + u_0 \cdot s_0) \bmod Q \tag{4}$$
$$r_1 = (u_1 \cdot s_0 + u_0 \cdot s_1) \bmod Q \tag{5}$$
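The system of equations for the first basemul translates directly into code. The sketch below is a reference model that reduces into $[0, Q)$; it ignores the signed Montgomery or Plantard representation used by pqm4, and the value $z = 17$ is only an illustrative constant.

```python
Q = 3329

def basemul(u0, u1, s0, s1, z):
    """Multiply (u0 + u1*X) by (s0 + s1*X) modulo (X^2 - z) and modulo Q."""
    r0 = (u1 * s1 * z + u0 * s0) % Q
    r1 = (u1 * s0 + u0 * s1) % Q
    return r0, r1

# Multiplying by the constant polynomial 1 returns the secret pair unchanged,
# while multiplying by X moves s1*z into r0 and s0 into r1.
identity = basemul(1, 0, 5, 7, 17)
shifted = basemul(0, 1, 5, 7, 17)
```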
In practice, in pqm4’s Round 2 implementation, the doublebasemul uses Montgomery arithmetic to perform the modular reduction, as presented in Listing 1, and Plantard arithmetic for Round 4, as presented in Listing 2. The assembly implementation also optimizes the multiplication by storing two 16-bit coefficients in a single 32-bit register and performing operations with halfwords. We can see from these listings that the main difference is the call to the Montgomery reduction in Listing 1 on Lines 3, 6, and 10, and the call to the Plantard reduction on Lines 9 and 13 of Listing 2, while both end similarly by packaging the two result coefficients, $r_0$ and $r_1$, on Lines 12 and 16, respectively.
The attack relies on random ciphertexts, in contrast to [8], which describes a method of constructing specially crafted ciphertexts such that the result of a point-wise multiplication depends on only a single secret key coefficient, so that the coefficients can be targeted individually.
Listing 1. Round 2 doublebasemul (ASM).
  1   // basemul(r->coeffs + 4 * i, a->coeffs + 4 * i, b->coeffs + 4 * i, zetas[64 + i]);
  2   smultt tmp, poly0, poly1
  3   montgomery q, qinv, tmp, tmp2
  4   smultb tmp2, tmp2, zeta
  5   smlabb tmp2, poly0, poly1, tmp2
  6   montgomery q, qinv, tmp2, tmp
  7   // r[0] in upper half of tmp
  8
  9   smuadx tmp2, poly0, poly1
 10   montgomery q, qinv, tmp2, tmp3
 11   // r[1] in upper half of tmp3
 12   pkhtb tmp, tmp3, tmp, asr#16
 13   str tmp, [rptr], #4
 14   ...
Listing 2. Round 4 doublebasemul (ASM).
  1   .macro doublebasemul_frombytes_asm rptr, bptr, zeta, poly0, poly1, poly3, tmp, tmp2, q, qa, qinv
  2      ldr.w \poly0, [\bptr], #4
  3
  4      smulwt \tmp, \zeta, \poly1
  5      smlabt \tmp, \tmp, \q, \qa
  6      smultt \tmp, \poly0, \tmp
  7      smlabb \tmp, \poly0, \poly1, \tmp
  8      // a1*b1*zeta+a0*b0
  9      plant_red \q, \qa, \qinv, \tmp
 10      // r[0] in upper half of tmp
 11
 12      smuadx \tmp2, \poly0, \poly1
 13      plant_red \q, \qa, \qinv, \tmp2
 14
 15      // r[1] in upper half of tmp2
 16      pkhtb \tmp, \tmp2, \tmp, asr#16
 17      str \tmp, [\rptr], #4
 18      ...

5.1. Data Acquisition

The first step towards deploying the attack is data collection. This is performed using the ChipWhisperer board presented in Section 4. ChipWhisperer’s oscilloscope resides on the main board and is connected to the target through two types of subsystems: Measure and Glitch. The Measure subsystem captures the power consumed by the target chip as long as the trigger pin is set to high.
In this case, the triggers are set around the first NTT polynomial multiplication between the ciphertext and the secret key from the decryption function. We captured power consumption traces for 2000 randomly generated ciphertexts.

5.2. Attack Phases

5.2.1. Determining the Target Operation Sample Point

We first need to determine the point in the trace where the first iteration of basemul is performed involving the coefficients under attack, $s_0$ and $s_1$.
This can be performed through an exhaustive search over all points in the trace and across all possible secret coefficient pairs, but in a testing environment it is much more efficient to predetermine this sample point using a known pair of secret coefficients. This is consistent with a scenario in which the adversary employs a copy of the Device under Test (DuT) that runs the same algorithm. The key on the testing device does not even have to be the one used by the target, since the operation will be executed around the same moment in time regardless of the data involved if the sequence of instructions is the same.
For this procedure, we first set the key coefficients to the correct pair and then compute the results of the target operation with this pair for every ciphertext in the dataset. After applying the Hamming weight function on each result $r_0$, we obtain a 1-D vector that forms the first distribution sampling.
A trace can be viewed as a 1-D row vector of amplitudes corresponding to the execution with a given ciphertext. An example of a trace is shown in Figure 1. Stacking the traces corresponding to every ciphertext gives a matrix where each row is a trace, and each column is a 1-D vector of power amplitudes at the same point in time.
$$H = \begin{bmatrix} HW(r_0^{(0)}) & HW(r_0^{(1)}) & \cdots & HW(r_0^{(C)}) \end{bmatrix}$$
$$t_i = \begin{bmatrix} P_{i0} & P_{i1} & \cdots & P_{iS} \end{bmatrix}$$
$$T = \begin{bmatrix} t_0 \\ t_1 \\ \vdots \\ t_C \end{bmatrix} = \begin{bmatrix} t^{(0)} & t^{(1)} & \cdots & t^{(S)} \end{bmatrix} = \begin{bmatrix} P_{00} & \cdots & P_{0S} \\ \vdots & \ddots & \vdots \\ P_{C0} & \cdots & P_{CS} \end{bmatrix} \tag{6}$$
where $C$ = number of ciphertexts; $S$ = maximum number of samples; $P$ = power amplitude; $HW(n)$ = Hamming weight of $n$; $t_i$ = $i$th row of $T$; $t^{(j)}$ = $j$th column of $T$.
Since we want the point in time at which the target gets executed, we compute the PCC between the Hamming weight vector $H$ from Equation (6) and every column vector in the trace matrix, $T$. The column with the highest absolute correlation to the Hamming weights indicates the most likely point.
$$\text{sample point} = \mathrm{index}\left(\max_{t^{(j)} \in T}\left(\mathrm{PCC}(H, t^{(j)})\right)\right) \tag{7}$$
This phase is exemplified in the Results Section (Section 5.3). It should be mentioned that another viable method for obtaining the Points of Interest is the Test-Vector Leakage Assessment (TVLA) [34].
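The sample-point search can be sketched on synthetic data as follows. The traces here are simulated Gaussian noise with a leak injected at one column; the helper names, the leak position, and the scaling factor are assumptions for illustration, not values from the experiment.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def find_sample_point(hw, traces):
    """Correlate the Hamming weight vector with every column of the trace
    matrix and return the column with the highest absolute correlation."""
    n_samples = len(traces[0])
    corr = [pearson(hw, [row[j] for row in traces]) for j in range(n_samples)]
    best = max(range(n_samples), key=lambda j: abs(corr[j]))
    return best, corr[best]

rng = random.Random(0)
hw = [rng.randint(0, 16) for _ in range(200)]             # one HW per ciphertext
traces = [[rng.gauss(0, 1) for _ in range(50)] for _ in hw]
for row, h in zip(traces, hw):
    row[20] += -0.5 * h                                   # simulated leak at column 20

index, rho = find_sample_point(hw, traces)
```

The search recovers column 20 with a clearly negative correlation, mirroring the sign observed in the measured traces.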

5.2.2. Determining the Secret Key Coefficients

Once the sample point for the target has been determined, the next phase is to demonstrate that an attacker can distinguish the correct key coefficients among all possible pairs. Now, the fixed vector from Equation (6) is represented by the column of power measurements at the given sample point, $t^{(j)}$, and the matrix is built from the vectors of Hamming weights for each pair guess as in Equation (8). The correlation is computed between $t^{(j)}$ and each row in the matrix $H$.
$$t^{(j)} = \begin{bmatrix} P_{0j} \\ P_{1j} \\ \vdots \\ P_{Cj} \end{bmatrix}, \qquad H = \begin{bmatrix} HW(r_0^{(00)}) & \cdots & HW(r_0^{(0C)}) \\ \vdots & \ddots & \vdots \\ HW(r_0^{(G0)}) & \cdots & HW(r_0^{(GC)}) \end{bmatrix} \tag{8}$$
where $G$ = total number of guesses for a pair.
The range of values for a number in $\mathbb{Z}_Q$ is, in theory, the set $\mathbb{Z} \cap [0, Q)$. In practice, the pqm4 implementation uses signed integers in the NTT domain for efficiency. The output range of a computation in Montgomery arithmetic is $[-Q, Q]$, while for Plantard it is $[-\frac{Q+1}{2}, \frac{Q}{2})$ [23]. Note that negative values are allowed in internal computations, but outside the NTT domain, the values should respect the theoretical interval of $[0, Q)$. The values that are not the result of a modular arithmetic operation in NTT, such as the secret key coefficients, lie in this interval, while the rest, such as ciphertext coefficients, follow the output range of the respective arithmetic. This leads to the necessity of an additional modular reduction (as an implementation detail, a Barrett reduction is used) when leaving the NTT domain, before storing data for outside use (e.g., sending it over the network between parties). Refer again to [23] for more detailed explanations.
Computing the PCC using each pair essentially means looping through all possible values for $s_0$ and $s_1$ and running the calculation with every ciphertext. This creates a complexity of $O(Q^2 M)$, where $Q = 3329$ is the number of coefficient values and $M$ is the number of ciphertexts (and corresponding traces) used.
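The guessing loop can be sketched as below on simulated leakage. Several simplifications are assumed and are not the article's setup: $r_0$ is modeled as an unsigned value in $[0, Q)$ with positive leakage (so the additive-inverse peak does not appear), the candidate window is restricted to 64 pairs around the true key to keep the run short, and all names and constants are hypothetical.

```python
import math
import random

Q = 3329

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r0(u0, u1, s0, s1, z):
    return (u1 * s1 * z + u0 * s0) % Q

def hw(x):
    return bin(x).count("1")

def best_pair(ciphertexts, column, z, s0_candidates, s1_candidates):
    """Try every (s0, s1) guess and keep the one whose Hamming weight model
    correlates most strongly (in absolute value) with the measured column."""
    best, best_rho = None, 0.0
    for s0 in s0_candidates:
        for s1 in s1_candidates:
            model = [hw(r0(u0, u1, s0, s1, z)) for u0, u1 in ciphertexts]
            rho = pearson(model, column)
            if abs(rho) > abs(best_rho):
                best, best_rho = (s0, s1), rho
    return best, best_rho

rng = random.Random(42)
z = 17                                      # illustrative constant
secret = (1234, 567)                        # made-up secret pair
ciphertexts = [(rng.randrange(Q), rng.randrange(Q)) for _ in range(300)]
column = [hw(r0(u0, u1, *secret, z)) + rng.gauss(0, 1.0) for u0, u1 in ciphertexts]

guess, rho = best_pair(ciphertexts, column, z, range(1230, 1238), range(564, 572))
```

A full attack would pass `range(Q)` for both candidate lists, which is where the $O(Q^2 M)$ cost comes from.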

5.2.3. Determining the Minimum Number of Traces

In the end, after running the attack with a large enough dataset to obtain the correct pair, it is useful to test the procedure with an incremental number of traces and follow the evolution of the correct key correlation against the other key candidates. The expected result is that the PCC of the right key should slightly increase with the size of the dataset, while the others should drop to negligible values.
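A rough way to picture this phase is to recompute the correlation on growing prefixes of the dataset. The sketch below uses simulated leakage and a single unrelated wrong hypothesis; the noise level, the trace counts, and the variable names are assumptions, not the experimental values.

```python
import math
import random

Q = 3329

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hw(x):
    return bin(x).count("1")

rng = random.Random(7)
n_max = 400
right = [hw(rng.randrange(Q)) for _ in range(n_max)]   # model under the correct key
wrong = [hw(rng.randrange(Q)) for _ in range(n_max)]   # model under an unrelated key
power = [h + rng.gauss(0, 1.0) for h in right]         # simulated measurements

# Correlation of both hypotheses as more traces are included.
history = [(n, pearson(right[:n], power[:n]), pearson(wrong[:n], power[:n]))
           for n in (20, 40, 80, 160, 400)]
```

Plotting the second and third entries of `history` against the first gives the same kind of curve as the trace-count experiment: the correct hypothesis stabilizes while the wrong one shrinks toward zero.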

5.3. Results

Figure 2 represents the first phase of the attack, determining the target sample point. The highest correlation in absolute value is 0.686 for the index (sample point) 200 when testing against the first pair of correct key coefficients.
Doing the same thing for all consecutive pairs of the correct key gives Figure 3. The average distance between two peaks is 55.367 samples and the maximum deviation from the mean is 1.633. This periodicity further confirms that we found the right place to look for our target function because it is consistent with the underlying execution loop that computes the product of the secret key and ciphertext in iterations of two coefficients. An additional observation is that the actual value of the PCC is negative, as seen in Figure 4, and we could conjecture that the Hamming weight for the right key is inversely correlated with the power usage. The smaller peaks appearing next to the pairs’ minima signal another operation involving our PoI result. Indeed, cross-referencing the assembly code from Listing 2, we can deduce that the smaller local minima match the instruction from Line 16, which packs together $r_0$ and $r_1$ into a single 32-bit register. Since the value we compute in the first basemul, $r_0$, is also part of the pack operation, it is only natural that it appears in the correlation graph with a fairly significant value.
Figure 5 depicts the main attack phase on the interval $\mathbb{Z} \cap [0, Q)$ for the first pair of secret coefficients using 400 traces. As can be seen, the correct key is the negative peak on the right side of the graph, keeping in mind that the correlation was observed to be negative at the previous step of the attack. However, there is another, positive, peak in the first half of the graph with roughly the same absolute value (the difference being of the order of $10^{-3}$).
This second peak seems to contradict our expectations and the conditions for a successful CPA. Ideally, the attack would yield a single candidate for which the correlation is significantly different from the rest. There are cases where more possible candidates are extracted from a single attack step, just as in [7], where the adversary first guesses half of the required bytes, yielding a number of candidate key parts. However, they are filtered down when guessing the rest of the bytes in that step. Here, without a filtering criterion, the number of candidates would only increase as we try to guess more pairs.
In this case, for the attack to succeed, the adversary would have to be able to distinguish between the two peaks. We provide a more formal explanation and a discussion for choosing the negative peak in Section 5.4.
The last phase of the attack determines the minimum number of traces needed for an accurate result. The attack running with an increasing number of traces is depicted in Figure 6. We can infer that the required minimum is somewhere around 80 power traces, which is in line with the state-of-the-art conclusions [8,35]. The black line shadowing the pair of correct coefficients in red corresponds to the positive peak from Figure 5.

5.4. Improved CPA

5.4.1. The Impact of the Additive Inverse

Returning to the matter of the two candidate solutions observed previously in Figure 5 and Figure 6, the first remark we have to make is that the false positive, the incorrect pair of secret coefficients, is found at $(s_0', s_1') = (Q - s_0, Q - s_1)$, where $(s_0, s_1)$ is the correct pair. In other words, $(s_0, s_1)$ and $(s_0', s_1')$ are additive inverses of one another modulo $Q$: $(s_0, s_1) + (s_0', s_1') = (Q, Q) \equiv (0, 0)$.
This alone does not explain why we obtain an almost symmetrical graph. To understand this behavior, recall that the target computation is the first half of the multiplication $(u_0, u_1) \cdot (s_0, s_1) = (r_0, r_1)$ mathematically described in Equation (4) and the associated assembly instructions from Listing 2.
Substituting $(s_0, s_1)$ with $(s_0', s_1') = (Q - s_0, Q - s_1)$, we obtain the following important result:
$$\begin{aligned} r_0' &= (u_1 s_1' z + u_0 s_0') \bmod Q \\ &= (u_1 (Q - s_1) z + u_0 (Q - s_0)) \bmod Q \\ &= (u_1 Q z - u_1 s_1 z + u_0 Q - u_0 s_0) \bmod Q \\ &= (Q (u_1 z + u_0) - (u_1 s_1 z + u_0 s_0)) \bmod Q \\ &= ((Q (u_1 z + u_0)) \bmod Q - (u_1 s_1 z + u_0 s_0) \bmod Q) \bmod Q \\ &= (0 - (u_1 s_1 z + u_0 s_0) \bmod Q) \bmod Q \\ &= (-(u_1 s_1 z + u_0 s_0) \bmod Q) \bmod Q \\ &= (-r_0) \bmod Q \end{aligned} \tag{9}$$
Now consider that while s 0 , s 1 [ 0 , Q ) , the Plantard reduced result is r 0 [ Q + 1 2 , Q / 2 ) . Normally, in [ 0 , Q ) , since r 0 is already reduced mod Q, ( r 0 ) m o d Q would simply equate Q r 0 , but because the interval is shifted to [ Q + 1 2 , Q / 2 ) to accommodate signed integers, the modular reduction is actually equal to r 0 .
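The substitution above can be checked numerically. The following Python sketch is only an illustration, not the attack code: the twiddle value `Z = 17` is a placeholder for $z$, and `centered` is a plain symmetric reduction to $[-(Q-1)/2, (Q-1)/2]$ that stands in for the Plantard reduction (both are assumptions made for this example).

```python
import random

Q = 3329  # Kyber modulus (Table 1)
Z = 17    # placeholder for the twiddle factor z; any fixed value works for this check


def centered(x, q=Q):
    """Map x to its representative in the symmetric range [-(q-1)/2, (q-1)/2]."""
    r = x % q
    return r - q if r > q // 2 else r


def r0(u0, u1, s0, s1, z=Z, q=Q):
    """First output coefficient u1*s1*z + u0*s0, reduced to the centered range."""
    return centered(u1 * s1 * z + u0 * s0, q)


def check_additive_inverse(trials=10_000, seed=0):
    """Return True if the pair (Q - s0, Q - s1) always yields -r0."""
    rng = random.Random(seed)
    for _ in range(trials):
        u0, u1, s0, s1 = (rng.randrange(Q) for _ in range(4))
        r = r0(u0, u1, s0, s1)
        r_inv = r0(u0, u1, (Q - s0) % Q, (Q - s1) % Q)
        if r_inv != -r:
            return False
    return True


print(check_additive_inverse())
```

With a symmetric output range, the additive-inverse pair produces exactly the negated intermediate value for every random input tried, which is the property the rest of this section builds on.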
The Hamming weight function operates on the binary representation of numbers; therefore, we have to look at the two's complement representations of $r_0$ and $-r_0$. In two's complement, $-r_0 = not(r_0) + 1$, where $not$ is the logical operator that flips all the bits of its argument. Since $r_0$ and $not(r_0)$ have all their bits flipped, $HW(r_0) + HW(not(r_0)) = 16$, with 16 being the maximum possible Hamming weight of a 16-bit number. Notice that for odd numbers, the sum $not(r_0) + 1$ does not require any carry; therefore, $HW(not(r_0) + 1) = HW(not(r_0)) + 1 = 16 - HW(r_0) + 1 = 17 - HW(r_0)$. From this we obtain $HW(r_0') = HW(-r_0) = 17 - HW(r_0)$. For correctness, we should wrap this result in a modulo 17 to account for the case $r_0 = 0 \Rightarrow HW(r_0') = HW(r_0) = 0$.
$$
\begin{aligned}
r_0 &= 75 = \mathtt{0b0000000001001011} &&\Rightarrow HW(r_0) = 4 \\
-r_0 &= -75 = \mathtt{0b1111111110110101} &&\Rightarrow HW(-r_0) = 13 \\
&&&\Rightarrow HW(-r_0) = 17 - HW(r_0)
\end{aligned}
$$
This result shows that for odd values of $r_0$ the Hamming weights of the two additive inverse coefficients, $r_0$ and $-r_0$, are linearly dependent and hence very strongly inversely correlated. However, the same rule does not apply to even values, which imply a cascade of one or more carry additions when performing $not(r_0) + 1$.
$$
\begin{aligned}
r_0 &= 74 = \mathtt{0b0000000001001010} &&\Rightarrow HW(r_0) = 3 \\
-r_0 &= -74 = \mathtt{0b1111111110110110} &&\Rightarrow HW(-r_0) = 13 \\
&&&\Rightarrow HW(-r_0) \neq 17 - HW(r_0)
\end{aligned}
$$
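The two worked examples can be reproduced with a few lines of Python. This is a minimal sketch: `hw16` and `trailing_zeros` are helper names introduced here, and the bound 1665 is taken from the centered output range quoted above. The sketch also checks a slightly more general relation that follows from the carry argument (an observation added for illustration, not a claim from the attack itself): for a non-zero value with $k$ trailing zero bits, $HW(-r_0) = 17 - k - HW(r_0)$, which reduces to the odd-value rule when $k = 0$.

```python
def hw16(x):
    """Hamming weight of x in 16-bit two's complement."""
    return bin(x & 0xFFFF).count("1")


def trailing_zeros(x):
    """Number of trailing zero bits of a non-zero integer."""
    return (x & -x).bit_length() - 1


def odd_rule_holds(limit=1665):
    """HW(-r) == 17 - HW(r) for every odd r with |r| <= limit."""
    return all(hw16(-r) == 17 - hw16(r)
               for r in range(-limit, limit + 1) if r % 2)


def general_rule_holds(limit=1665):
    """HW(-r) == 17 - k - HW(r), with k the number of trailing zeros of r != 0."""
    return all(hw16(-r) == 17 - trailing_zeros(r) - hw16(r)
               for r in range(-limit, limit + 1) if r != 0)


for r in (75, 74):
    # Columns: r, HW(r), HW(-r), 17 - HW(r)
    print(r, hw16(r), hw16(-r), 17 - hw16(r))
print(odd_rule_holds(), general_rule_holds())
```

The even case differs from the odd rule only by the number of trailing zeros, which is small for most values, so the relation between $HW(r_0)$ and $HW(-r_0)$ remains close to linear.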
If we compute the correlation between the sets of all positive and all negative numbers, we obtain a PCC of ≈−0.59. However, if we sample a set of positive and negative values and correlate it with the set of their opposites, which is the case for the sets of $r_0$ and $-r_0$ values, the resulting PCC is ≈−0.98, which can be visualized in Figure 7 for the first 50 entries. This explains the formation of the second correlation peak of opposite orientation seen in Figure 5.
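A rough way to reproduce the second measurement is sketched below in Python. It assumes the intermediate values are drawn uniformly from a symmetric centered range, which is a simplification: in the real attack their distribution depends on the ciphertexts and the key. The function names are introduced here for illustration, and the exact value printed depends on the sample, so it should only be compared qualitatively with the reported ≈−0.98.

```python
import random

Q = 3329  # Kyber modulus (Table 1)


def hw16(x):
    """Hamming weight of x in 16-bit two's complement."""
    return bin(x & 0xFFFF).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def pcc_of_opposites(n_samples=400, seed=0):
    """PCC between HW(r) and HW(-r) for r sampled from the centered range."""
    rng = random.Random(seed)
    rs = [rng.randint(-(Q // 2), Q // 2) for _ in range(n_samples)]
    return pearson([hw16(r) for r in rs], [hw16(-r) for r in rs])


print(round(pcc_of_opposites(), 3))
```

Because positive values have mostly zero high bits and negative values mostly one high bits, the two Hamming weight sequences move in opposite directions, which is why a wrong candidate can still produce a strong peak of the opposite sign.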

5.4.2. Halving the Attack Range

Taking advantage of the remarks above, the basic CPA from Section 5 can be improved.
The idea is to reduce the search space from $[0, Q)$ to $[0, Q/2)$ for one coefficient of the pair and to define, for every searched element, an approximation of the correlation value of its inverse with respect to that coefficient, which we call the “half-additive inverse”:
$$corr((Q - s_0, s_1)) = -corr((s_0, s_1)) \quad \text{for } s_0 \in [0, Q/2),\ s_1 \in [0, Q)$$
The maximum absolute correlation over the halved interval corresponds to one of two possibilities: the correct pair or its half-additive inverse. This provides us with a way to choose the correct pair from only the reduced range.
$$
\text{correct pair} =
\begin{cases}
(s_0, s_1) & \text{if } corr((s_0, s_1)) < 0 \\
(Q - s_0, s_1) & \text{if } corr((s_0, s_1)) > 0
\end{cases}
$$
There might appear to be room for further optimization by also slicing down the interval for the second coefficient, thus reducing the number of iterations by a factor of 4 instead of 2. However, on closer inspection, the set of associated pairs $(s_0, s_1)$, $(Q - s_0, Q - s_1)$ for $s_0, s_1 \in [0, Q/2)$ does not cover the pairs with one coefficient in the first half and the other in the second half of the interval, so we would lose potential candidates.
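The coverage argument in the last paragraph can be checked by brute force. The Python sketch below enumerates a searched range together with the associated pairs $(Q - s_0, Q - s_1)$ used in that paragraph and tests whether every pair of the full space is reached. The toy modulus 13 and the function names are assumptions chosen to keep the enumeration small; the same reasoning applies to any odd modulus.

```python
def covered_pairs(q, halve_s1):
    """Pairs reached by searching s0 in [0, q/2) plus their additive inverses.

    If halve_s1 is True, s1 is also restricted to [0, q/2); otherwise s1 spans [0, q).
    """
    half = range((q + 1) // 2)  # the integers in [0, q/2) for odd q
    s1_range = half if halve_s1 else range(q)
    covered = set()
    for s0 in half:
        for s1 in s1_range:
            covered.add((s0, s1))
            covered.add(((q - s0) % q, (q - s1) % q))
    return covered


def covers_everything(q, halve_s1):
    """True if the searched pairs and their inverses span all q*q candidate pairs."""
    return len(covered_pairs(q, halve_s1)) == q * q


TOY_Q = 13  # small odd modulus used only for illustration
print(covers_everything(TOY_Q, halve_s1=False))
print(covers_everything(TOY_Q, halve_s1=True))
print((1, 7) in covered_pairs(TOY_Q, halve_s1=True))
```

A pair such as (1, 7), with one coefficient in each half of the interval, is neither searched directly nor reached as the inverse of a searched pair once both ranges are halved, which is the loss of candidates described above.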
To test this improvement, we ran the attack for the first pair of coefficients with 80 power traces and compared the running times of the basic CPA and the improved CPA. The basic attack took ≈1002.1 s, while the improved attack took only ≈499.43 s. This confirms the hypothesis of a ≈50% reduction in running time. The tests were performed on an Intel Core i5-6400 CPU with four cores at 2.7 GHz.
We further observe that even though the output range of the Montgomery reduction is not normalized to an interval of length $Q$, it is still centered around 0, so we can proceed as we did when analyzing the additive inverse in the Plantard case and deduce that $r_0' = -r_0$. The optimization method is consequently the same in this case as well.

6. Conclusions

The main contributions of this paper are the adaptation of previous Correlation Power Analysis side-channel attacks to the most recent version of CRYSTALS-Kyber from the pqm4 library, and an improvement of the methodology that reduces the attack time by half. Our work focused on how an optimization using signed modular arithmetic can have important mathematical implications, which can be leveraged to reduce by 50% both the search range and the time required to search for secret key coefficients.
The factors that led to this finding are not, however, intrinsic to the algorithms but are rather practical design decisions: the use of signed integers, the packing of half-word values, and the output range of the modular reduction are all aspects that can be addressed in order to retain the advantages of Plantard’s method without making it easier for an adversary to mount an attack.
While we applied our mathematical observations to the specific case of a CPA, they could also be applied to profiled attacks using Template Attacks or Machine Learning approaches.

Author Contributions

Conceptualization, C.G. and M.O.C.; Methodology, M.O.C.; Software, C.G.; Validation, M.O.C.; Data curation, C.G.; Writing—original draft, C.G.; Writing—review & editing, M.O.C.; Visualization, C.G.; Supervision, M.O.C.; Project administration, M.O.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by RoNaQCI, part of EuroQCI, DIGITAL-2021-QCI-01-DEPLOY-NATIONAL, 101091562.

Data Availability Statement

The code and data are in the process of being prepared for open-source publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shor, P. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Santa Fe, NM, USA, 20–22 November 1994; pp. 124–134.
  2. NIST. Post-Quantum Cryptography. Available online: https://csrc.nist.gov/projects/post-quantum-cryptography (accessed on 10 March 2025).
  3. Neve, M.; Tiri, K. On the Complexity of Side-Channel Attacks on AES-256—Methodology and Quantitative Results on Cache Attacks. Cryptology ePrint Archive. 2007. Available online: https://eprint.iacr.org/2007/318.pdf (accessed on 10 March 2025).
  4. FIPS 203; Module-Lattice-Based Key-Encapsulation Mechanism Standard. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
  5. Kannwischer, M.J.; Petri, R.; Rijneveld, J.; Schwabe, P.; Stoffelen, K. PQM4: Post-Quantum Crypto Library for the ARM Cortex-M4. Available online: https://github.com/mupq/pqm4 (accessed on 10 March 2025).
  6. NewAE Technology Inc. ChipWhisperer. Available online: https://github.com/newaetech/chipwhisperer-jupyter/tree/master (accessed on 10 March 2025).
  7. Karlov, A.; de Guertechin, N.L. Power Analysis Attack on Kyber. Cryptology ePrint Archive. 2021. Available online: https://eprint.iacr.org/2021/1311 (accessed on 10 March 2025).
  8. Yang, Y.; Wang, Z.; Ye, J.; Fan, J.; Chen, S.; Li, H.; Li, X.; Cao, Y. Chosen ciphertext correlation power analysis on Kyber. Integration 2023, 91, 10–22.
  9. Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126.
  10. Diffie, W.; Hellman, M. New directions in cryptography. IEEE Trans. Inf. Theory 1976, 22, 644–654.
  11. Gratzer, G.A. Lattice Theory: Foundation; Springer: Basel, Switzerland, 2011; Volume 2.
  12. Ajtai, M. The shortest vector problem in L2 is NP-hard for randomized reductions. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Dallas, TX, USA, 24–26 May 1998; pp. 10–19.
  13. Ajtai, M. Generating hard instances of lattice problems. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, Philadelphia, PA, USA, 22–24 May 1996; pp. 99–108.
  14. Regev, O. On lattices, learning with errors, random linear codes, and cryptography. J. ACM (JACM) 2009, 56, 1–40.
  15. Regev, O. The learning with errors problem. Invit. Surv. CCC 2010, 7, 11.
  16. Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehle, D. CRYSTALS—Kyber: A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 353–367.
  17. Fujisaki, E.; Okamoto, T. Secure integration of asymmetric and symmetric encryption schemes. In Advances in Cryptology—CRYPTO’99, Proceedings of the 19th Annual International Cryptology Conference, Santa Barbara, CA, USA, 15–19 August 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 537–554.
  18. Shoup, V. A Proposal for an ISO Standard for Public Key Encryption. Cryptology ePrint Archive, Paper 2001/112. 2001. Available online: https://eprint.iacr.org/2001/112 (accessed on 10 March 2025).
  19. FIPS PUB 202; SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions. National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015.
  20. Kocher, P.; Jaffe, J.; Jun, B. Differential power analysis. In Advances in Cryptology—CRYPTO’99, Proceedings of the 19th Annual International Cryptology Conference, Santa Barbara, CA, USA, 15–19 August 1999; Proceedings 19; Springer: Berlin/Heidelberg, Germany, 1999; pp. 388–397.
  21. Brier, E.; Clavier, C.; Olivier, F. Correlation power analysis with a leakage model. In Cryptographic Hardware and Embedded Systems—CHES 2004, Proceedings of the 6th International Workshop, Cambridge, MA, USA, 11–13 August 2004; Proceedings 6; Springer: Berlin/Heidelberg, Germany, 2004; pp. 16–29.
  22. Plantard, T. Efficient word size modular arithmetic. IEEE Trans. Emerg. Top. Comput. 2021, 9, 1506–1518.
  23. Huang, J.; Zhang, J.; Zhao, H.; Liu, Z.; Cheung, R.C.; Koç, Ç.K.; Chen, D. Improved Plantard arithmetic for lattice-based cryptography. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 614–636.
  24. Kuo, Y.T.; Takayasu, A. A lattice attack on crystals-Kyber with correlation power analysis. In Proceedings of the 26th International Conference on Information Security and Cryptology, Seoul, Republic of Korea, 29 November–1 December 2023; pp. 202–220.
  25. Kuo, Y.T.; Takayasu, A. Improved Power Analysis on CRYSTALS-Kyber. Power 2024, 23, 200.
  26. Wang, K.; Xu, D.; Tian, J. An Improved Two-Step Attack on Kyber. arXiv 2024, arXiv:2407.06942.
  27. Ueno, R.; Xagawa, K.; Tanaka, Y.; Ito, A.; Takahashi, J.; Homma, N. Curse of re-encryption: A generic power/em analysis on post-quantum kems. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 296–322.
  28. Hoang, A.T.; Kennaway, M.; Pham, D.T.; Mai, T.S.; Khalid, A.; Rafferty, C.; O’Neill, M. Deep Learning Enhanced Side Channel Analysis on CRYSTALS-Kyber. In Proceedings of the 2024 25th International Symposium on Quality Electronic Design (ISQED), San Francisco, CA, USA, 3–5 April 2024; pp. 1–8.
  29. Ravi, P.; Jap, D.; Bhasin, S.; Chattopadhyay, A. Machine Learning Based Blind Side-Channel Attacks on PQC-Based KEMs—A Case Study of Kyber KEM. In Proceedings of the 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 28 October–2 November 2023; pp. 01–07.
  30. Mu, J.; Zhao, Y.; Wang, Z.; Ye, J.; Fan, J.; Chen, S.; Li, H.; Li, X.; Cao, Y. A Voltage Template Attack on the Modular Polynomial Subtraction in Kyber. In Proceedings of the 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), Taipei, Taiwan, 17–20 January 2022; pp. 672–677.
  31. Yang, Y.; Huang, J.; Wang, Z.; Ye, J.; Sun, Z.; Fan, J.; Chen, S.; Li, H.; Li, X.; Cao, Y. A Template Attack on Reduction Without Reference Device on Kyber. In Proceedings of the 2023 IEEE 32nd Asian Test Symposium (ATS), Beijing, China, 14–17 October 2023; pp. 1–6.
  32. Guo, Q.; Johansson, T. Faster dual lattice attacks for solving LWE with applications to CRYSTALS. In Advances in Cryptology—ASIACRYPT 2021, Proceedings of the 27th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, 6–10 December 2021; Proceedings, Part IV 27; Springer: Cham, Switzerland, 2021; pp. 33–62.
  33. Kannwischer, M.J.; Schwabe, P.; Stebila, D.; Wiggers, T. Improving Software Quality in Cryptography Standardization Projects. In Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&PW), Genoa, Italy, 6–10 June 2022; IEEE: Los Alamitos, CA, USA, 2022; pp. 19–30.
  34. Goodwill, G.; Jun, B.; Jaffe, J.; Rohatgi, P. A testing methodology for side-channel resistance validation. In Proceedings of the NIST Non-Invasive Attack Testing Workshop, Nara, Japan, 25–27 September 2011; Volume 7, pp. 115–136.
  35. Mujdei, C.; Wouters, L.; Karmakar, A.; Beckers, A.; Bermudo Mera, J.M.; Verbauwhede, I. Side-channel analysis of lattice-based post-quantum cryptography: Exploiting polynomial multiplication. ACM Trans. Embed. Comput. Syst. 2024, 23, 27.
Figure 1. Example of a power trace captured for a random ciphertext.
Figure 2. Correlation of the first coefficient pair across all samples.
Figure 3. Correlation of all coefficient pairs across all samples in absolute value.
Figure 4. Correlation of the first 5 correct coefficient pairs across samples.
Figure 5. Correlation of secret key pair candidates for 400 traces.
Figure 6. Evolution of correlation for the main pair candidates up to 400 traces.
Figure 7. Correlation between $HW(r_0)$ and $HW(-r_0)$ for the first 50 results.
Table 1. Parameter sets for Kyber.
             n     k    Q      $\eta_1$   $\eta_2$   $(d_u, d_v)$   $\delta$
Kyber512     256   2    3329   3          2          (10, 4)        $2^{-139}$
Kyber768     256   3    3329   2          2          (10, 4)        $2^{-164}$
Kyber1024    256   4    3329   2          2          (11, 5)        $2^{-174}$
