An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms

Nguyen, Tuy Tan; Nguyen, Tram Thi Bao; Lee, Hanho

doi:10.3390/electronics11060891

Open AccessFeature PaperReview

An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms

by

Tuy Tan Nguyen

^*

,

Tram Thi Bao Nguyen

and

Hanho Lee

^*

Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea

^*

Authors to whom correspondence should be addressed.

Electronics 2022, 11(6), 891; https://doi.org/10.3390/electronics11060891

Submission received: 11 February 2022 / Revised: 7 March 2022 / Accepted: 9 March 2022 / Published: 12 March 2022

(This article belongs to the Special Issue 10th Anniversary of Electronics: Recent Advances in Computer Science & Engineering)

Download

Browse Figures

Versions Notes

Abstract

:

This paper presents a review of module ring learning with errors-based (MLWE-based) public-key encryption and key-establishment algorithms. In particular, we introduce the preliminaries of public key cryptography, MLWE-based algorithms, and arithmetic operations in post-quantum cryptography. We then focus on analyzing the state-of-the-art hardware architecture designs of CRYSTALS-Kyber at different security levels, including hardware architectures for Kyber-512, Kyber-768, and Kyber-1024. This analysis is dedicated to providing complete guidelines for selecting the most suitable CRYSTALS-Kyber hardware architecture to apply in post-quantum cryptography-based security systems in reality, with different requirements of security levels and hardware efficiency.

Keywords:

CRYSTALS-Kyber; number theoretic transform (NTT); module-LWE; public-key encryption; key establishment

1. Introduction

Cryptography is broadly categorized into symmetric key cryptography and asymmetric key cryptography (or public key cryptography). In symmetric key cryptography, a single key is used between a sender and a receiver to enable secure communication, and the key is kept confidential from anyone else. Public key cryptography utilizes a public key in encryption and a private key in decryption, as demonstrated in Figure 1. While the private key is kept secret to use in the decryption operation, the public key is published to everyone and used in the encryption operation. The encryption process generates the encrypted message (ciphertext) from the input message (plaintext) and the public key. The ciphertext can be decrypted using the private key. One of the most well-known asymmetric cryptographic algorithms was developed by Rivest–Shamir–Adleman (RSA) [1,2,3,4]. The RSA algorithm is the earliest public key cryptographic algorithm developed and published for commercial use. It is widespread and has been integrated in both Netscape Web and Microsoft browsers to provide security solutions for e-commerce applications. The patented RSA encryption algorithm has, in fact, become a standard for public-use encryption applications. Elliptic curve cryptography (ECC) [5,6,7,8,9] is another widely used public key cryptography algorithm. The ECC encryption and decryption operations rely on an elliptic curve and arithmetic operations over a Galois field, either GF(p) or GF(

2^{m}

). In the key-generation operation, the recipient selects a base point

P_{S}

on the elliptic curve and a random number as its private key

k_{S}

to calculate ECC point multiplication

Q_{S} = k_{S} \times P_{S}

. The public key is stored publicly so that the sender can use the receiver’s public key to encrypt the input data before sending the data over the network. At the receiver side, the data sent by the transmitter can be retrieved using ECC point multiplication operations and the private key of the receiver.

With the rapid growth of quantum computers, the existing public key schemes can be basically broken in the next few years. In 2016, National Institute of Standard and Technology (NIST) launched the post-quantum cryptography (PQC) project to discover PQC candidates for future standardization [10]. The goal of the PQC project is to develop cryptographic systems that are secure against both classical and quantum computers, and that can be compatible with existing communications protocols and networks. Round 3 finalists include (1) four candidates for public-key encryption (PKE) and key-establishment (KEM) algorithms (Classic McEliece, CRYSTALS-Kyber, NTRU, Saber), (2) three candidates for digital signature algorithms (CRYSTALS-Dilithium, Falcon, Rainbow), and (3) alternate candidates. Among the PKE and KEM algorithms in the round 3 finalists, the CRYSTALS-Kyber algorithm is a promising candidate.

Learning with errors (LWE) problems are lattice-based problems attracting a lot of attention from research communities in recent years. In LWE problems, given a finite field

Z_{q}

with modulus q, two matrices A and p, and a small noise e, it is a challenge to find the secret key s from the equation

A \cdot s + e = p

. Many proposals using standard LWE [11] and the structured ring-LWE [12,13,14,15,16,17] have been conducted. The typical advantage of standard LWE is easy scalability. However, it introduces a significant decrease in efficiency. The structured LWE offers better efficiency in terms of speed and key and ciphertext sizes. Nevertheless, there is a tradeoff between efficiency and security because of the additional structure [18]. Module-LWE can balance these two extremes. In [18], the authors of the CRYSTALS-Kyber algorithm mentioned that Kyber helps to reduce structure and offers much better scalability compared to ring-LWE. The performance of Kyber is very similar to the ring-LWE-based schemes, using 256 bits to encrypt messages [19].

The implementations of CRYSTALS-Kyber include software design [18,20], software and hardware codesign [20], and pure hardware design [20,21,22,23,24]. In [18], the authors showed the implementation results on Intel Haswell CPUs and ARM Cortex-M4 CPUs. The authors in [20] described the software implementation of post-quantum cryptography schemes including CRYSTALS-Kyber using C language on ARM Cotex-A53. In order to accelerate the operations of Kyber, the authors in [25] introduced massively parallel algorithms implemented on a GPU. The authors in [20] also presented a software and hardware codesign of Kyber. The software and hardware codesign offers a remarkable improvement in encapsulation time and decapsulation time compared with pure software implementation results. The pure hardware implementations in [20,21,22,23,24] introduce different approaches, with a focus on reducing hardware complexity and speeding up the processing time.

In this paper, we summarize the CRYSTALS-Kyber public-key encryption and key-establishment algorithms. We then present an analysis of the state-of-the-art implementations of CRYSTALS-Kyber in pure software, software and hardware codesign, and pure hardware. From the existing implementation results, we recommend the most suitable Kyber hardware architecture for various systems which have different requirements of hardware resources and latency.

The rest of this paper is structured as follows. Section 2 gives background information about CRYSTALS-Kyber. Section 3 presents an analysis of the implementation of the state-of-the-art designs. We discuss some potential solutions to improve the existing works in Section 4. Finally, conclusions are drawn in Section 5.

2. Background

In this section, we introduce CRYSTALS-Kyber, an MLWE-based public-key encryption and key-establishment algorithms that entered round 3 of the PQC standardization. Kyber has first been introduced in [26], which includes three parameter sets, as reported in Table 1, corresponding to three security levels of NIST. Polynomials are of the same degree

n = 256

, and the polynomial coefficients are members of the prime field

Z_{q}

, where q = 3329 for all security levels. However, for each security level, different numbers of polynomials are required. These polynomials are considered as a vector whose size is specified by a parameter k. The values of k are 2, 3, and 4, corresponding to the security levels 1, 3, and 5, respectively. The remaining parameters

η_{1}

,

η_{1}

,

d_{u}

, and

d_{v}

are chosen to balance between security, ciphertext size, and failure probability. A detailed explanation of parameter selection can be found in [18]. Centered binomial distribution (CBD) is used to generate secret noise polynomials. In this paper, the same notation presented in [18] is used. For example, regular font letters represent elements in R or

R_{q}

, bold lower-case letters denote vectors with coefficients in R or

R_{q}

, bold upper-case letters are matrices. For bytes and byte arrays,

B^{k}

is the set of byte arrays of length k, and

a | | b

is the concatenation of two byte arrays a and b. More details about notation can be found in [18].

2.1. Public-Key Encryption Algorithm

Operations in the PKE algorithm includes key generation, encryption, and decryption. The key-generation function generates a public key

p k

and a private key

s k

, which are then used in encryption and decryption operations, respectively. Particularly, the public key participates in the encryption process to generate a ciphertext c from the input message m and the public key

p k

. The public-key encryption process is illustrated in Figure 2.

The decryption function restores the original message m from the ciphertext c and the private key

s k

, as described in Figure 3.

The functions of the Kyber chosen-plaintext attack public-key encryption (Kyber.CPAPKE) are defined as follows:

Kyber.CPAPKE key-generation function

The Kyber.CPAPKE key-generation function generates a public key

p k

used for the encryption process and a secret key

s k

used for the decryption process. Noise vectors s and e are sampled from a CBD. The public matrix

\hat{A}

is generated from a rejection sampler. The public key

p k

and private key

s k

are computed as

p k = (ρ, \hat{t})

, and

s k = \hat{s}

, in which

\hat{t} = \hat{A} \circ \hat{s} + \hat{e}

. The details of the Kyber.CPAPKE key-generation function are described in Algorithm 1. In this algorithm, XOF is an extendable output function instantiated with SHAKE-128. A parse function returns the NTT-representation of the input byte stream. G is a hash function G:

B^{*} \to B^{32} \times B^{32}

. PRF and NTT represent a pseudo-random function and the number theoretic transform, respectively.

Algorithm 1: Kyber PKE key-generation algorithm (Kyber.CPAPKE.KeyGen) [18].

Output: Public key

p k \in B^{12 \cdot k \cdot n / 8 + 32}

,

Secret key

s k \in B^{12 \cdot k \cdot n / 8}

1

d \leftarrow \in B^{32}

2

(ρ, σ) : = G (d)

3

N : = 0

4 for i from 0 to

k - 1

do

5 for j from 0 to

k - 1

do

6

\hat{A} [i] [j] : = Parse (XOF (ρ, j, i))

7 end for

8 end for

9 for i from 0 to

k - 1

do

10

s [i] : = {CBD}_{η_{1}} (PRF (σ, N))

11

N : = N + 1

12 end for

13 for i from 0 to

k - 1

do

14

e [i] : = {CBD}_{η_{1}} (PRF (σ, N))

15

N : = N + 1

16 end for

17

\hat{s} : = NTT (s)

18

\hat{e} : = NTT (s)

19

\hat{t} : = \hat{A} \circ \hat{s} + \hat{e}

20

p k : = {Encode}_{12} (\hat{t} \mod^{+} q) | | ρ

21

s k : = {Encode}_{12} (\hat{s} \mod^{+} q)

22 return $(p k, s k)$

Kyber.CPAPKE encryption function

The Kyber.CPAPKE encryption function constructs ciphertext

c = (c_{1}, c_{2})

from input message m, random coins

r \in B^{32}

, and the public key

p k

, as presented in Algorithm 2. As described in Algorithm 2,

{\hat{A}}^{T}

is generated from a uniform distribution, and r, e

_{1}

, and

e_{2}

are sampled from a binomial sampler. The ciphertext c is constructed as

c = ({Compress}_{q} (u, d_{u}),

{Compress}_{v} (v, d_{v}))

, where

u = {NTT}^{- 1} ({\hat{A}}^{T} \circ \hat{r}) + e_{1}

and

v = {NTT}^{- 1} ({\hat{t}}^{T} \circ \hat{r}) + e_{2} + m

. NTT

^{- 1}

is the inverse number theoretic transform.

Algorithm 2: Kyber PKE encryption algorithm (Kyber.CPAPKE.Enc(

p k, m, r

)) [18].

Input: Message

m \in B^{32}

,

Public key

p k \in B^{12 \cdot k \cdot n / 8 + 32}

,

Random coins

r \in B^{32}

Output: Ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

1

N : = 0

2

\hat{t} : = {Decode}_{12} (p k)

3

ρ : = p k + 12 \cdot k \cdot n / 8

4 for i from 0 to

k - 1

do

5 for j from 0 to

k - 1

do

6

{\hat{A}}^{T} [i] [j] : = Parse (XOF (ρ, i, j))

7 end for

8 end for

9 for i from 0 to

k - 1

do

10

r [i] : = {CBD}_{η_{1}} (PRF (r, N))

11

N : = N + 1

12 end for

13 for i from 0 to

k - 1

do

14

e_{1} [i] : = {CBD}_{η_{2}} (PRF (r, N))

15

N : = N + 1

16 end for

17

e_{2} : = {CBD}_{η_{2}} (PRF (r, N))

18

\hat{r} : = NTT (r)

19

u : = {NTT}^{- 1} ({\hat{A}}^{T} \circ \hat{r}) + e_{1}

20

v : = {NTT}^{- 1} ({\hat{t}}^{T} \circ \hat{r}) + e_{2} + {Decompress}_{q} ({Decode}_{1} (m), 1)

21

c_{1} : = {Encode}_{d_{u}} ({Compress}_{q} (u, d_{u}))

22

c_{2} : = {Encode}_{d_{v}} ({Compress}_{q} (v, d_{v}))

23 return $c = (c_{1}, c_{2})$

Kyber.CPAPKE decryption function

The Kyber.CPAPKE decryption function shown in Algorithm 3 recovers the original message m from the ciphertext c using a secret key

s k

. The value of m is calculated as

m = Compress (v - {NTT}^{- 1} ({\hat{s}}^{T} \circ \hat{u})

, where u and v are extracted from c.

Algorithm 3: Kyber PKE decryption algorithm (Kyber.CPAPKE.Dec) [18].

Input: Ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

,

Secret key

s k \in B^{12 \cdot k \cdot n / 8}

Output: Message

m \in B^{32}

1

u : = {Decompress}_{q} ({Decode}_{d_{u}} (c), d_{u})

2

v : = {Decompress}_{q} ({Decode}_{d_{v}} (c + d_{u} \cdot k \cdot n / 8), d_{v})

3

c_{1} : = {Encode}_{d_{u}} ({Compress}_{q} (u, d_{u}))

4

c_{2} : = {Encode}_{d_{v}} ({Compress}_{q} (v, d_{v}))

5

\hat{s} : = {Decode}_{12} (s k)

6

m : = {Encode}_{1} ({Compress}_{q} (v - {NTT}^{- 1} ({\hat{s}}^{T} \circ NTT (u)), 1))

7 return m

2.2. Key-Establishment Algorithm

The Kyber key-establishment algorithm (Kyber.CCAKEM) includes key generation, encapsulation, and decapsulation. Kyber.CCAKEM is constructed from the CPA-secure public-key encryption described in Algorithms 1–3. The encapsulation operation constructs ciphertext c and a shared key K from the input message m, public key

p k

, and random vectors, as illustrated in Figure 4. Figure 5 describes the decapsulation operation, which recovers the shared key K from ciphertext c and private key

s k

.

The following algorithms describe the key generation, encapsulation, and decapsulation of Kyber.CCAKEM in detail.

Kyber.CCAKEM key generation algorithm

The Kyber.CCAKEM key-generation algorithm generating public key

p k \in B^{12 \cdot k \cdot n / 8 + 32}

and secret key

s k \in B^{12 \cdot k \cdot n / 8 + 96}

is presented in Algorithm 4. Initially,

p k

and

s k^{'}

are constructed by the Kyber.CPAPKE.KeyGen() algorithm presented in Algorithm 1. The value of the secret key

s k

is then calculated using the formula

s k : = (s k^{'} | | p k | | H (p k) | | z)

, where z is a value in

B^{32}

. H is a hash function, H:

B^{*} \to B^{32}

.

Algorithm 4: Kyber KEM key-generation algorithm (Kyber.CCAKEM.KeyGen) [18].

Output: Public key

p k \in B^{12 \cdot k \cdot n / 8 + 32}

,

Secret key

s k \in B^{12 \cdot k \cdot n / 8 + 96}

1

z \leftarrow B^{32}

2

(p k, s k^{'}) : = Kyber . CPAPKE . KeyGen ()

3

s k : = (s k^{'} | | p k | | H (p k) | | z)

4 return $(p k, s k)$

Kyber.CCAKEM encapsulation algorithm

The Kyber.CCAKEM encapsulation algorithm returns the ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

and shared key

K \in B^{*}

from input public key

p k

that was generated from Algorithm 5. Specifically, the ciphertext c is constructed by the Kyber.CPAPKE encryption function in Algorithm 2, and the shared key K is generated using the SHA3-256, SHA3-512, and SHAKE-256 algorithms from the input message m and ciphertext c. More details about the functions used to generate K can be found in [18].

Algorithm 5: Kyber.CCAKEM.Enc(

p k

) [18].

Input: Public key

p k \in B^{12 \cdot k \cdot n / 8 + 32}

Output: Ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

,

Shared key

K \in B^{*}

1

m \leftarrow B^{32}

2

m \leftarrow H (m)

3

(\bar{K}, r) \leftarrow G (m | | H (p k))

4

c : = Kyber . CPAPKE . Enc (p k, m, r)

5

K : = KDF (\bar{K} | | H (c))

6 return $(c, K)$

Kyber.CCAKEM decapsulation algorithm

The Kyber.CCAKEM decapsulation algorithm restores the shared key K from the input ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

and secret key

s k \in B^{12 \cdot k \cdot n / 8 + 96}

, as described in Algorithm 6.

Algorithm 6: Kyber.CCAKEM.Dec(

c, s k

) [18].

Input: Ciphertext

c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}

,

Secret key

s k \in B^{12 \cdot k \cdot n / 8 + 96}

Output: Shared key

K \in B^{*}

1

p k : = s k + 12 \cdot k \cdot n / 8

2

h : = s k + 24 \cdot k \cdot n / 8 + 32 \in B^{32}

3

z : = s k + 24 \cdot k \cdot n / 8 + 64

4

m^{'} : = Kyber . CPAPKE . Dec (s, (u, v))

5

(\bar{K}, r^{'}) : = G (m^{'} | | h)

6

c^{'} : = Kyber . CPAPKE . Enc (p k, m^{'}, r^{'})

7

if c = c^{'} then

8

return K = KDF (\bar{K^{'}} | | H (c))

9 else

10

return K = KDF (z | | H (c))

11 endif

12 return K

The Kyber PKE enrcryption/decryption and Kyber KEM encapsulation/decapsulation algorithms are summarized in Table 2.

2.3. Arithmetic Operations in CRYSTALS-Kyber

In this section, we introduce the typical arithmetic operations in CRYSTALS-Kyber, which include number theoretic transform (NTT), modular reduction, and sampling.

Number Theoretic Transform

Polynomial multiplication in

R_{q}

using NTT offers multiple advantages [18]: simple code space, high speed, and no additional memory is required. NTT is a form of the well-known fast Fourier transform (FFT) [27,28], with all arithmetic operations performed in a finite field. NTT uses the n-th primitive root of unity

ω_{n}

in the ring

Z_{q}

. Given a polynomial a in

R_{q}

:

a (x) = a_{0} + a_{1} x + a_{2} x^{2} + \dots + a_{255} x^{255} .

(1)

NTT(a) is expressed as:

\hat{a} (x) = {\hat{a}}_{0} + {\hat{a}}_{1} X + {\hat{a}}_{2} X^{2} + \dots + + {\hat{a}}_{255} X^{255} .

(2)

Since NTT plays a crucial role in the hardware architecture design of post-quantum cryptography, improving NTT performance has received great attention. In [29], the authors proposed the algorithmic and hardware optimizations to design the NTT-based polynomial multiplication architecture. In [30], the authors introduced a fast modular multiplication method and a memory access scheme using doubled bandwidth ping-pong; the polynomial multiplication can be accelerated using the significantly fewer hardware resources.

Modular reduction algorithm

Modular reduction for Kyber is specified in [18] as follows: define

r^{'} = r \mod^{\pm} α

as the unique element

r^{'}

in

- \frac{α}{2} < r^{'} \leq \frac{α}{2}

such that

r^{'} = r \mod α

for an even positive integer

α

. For an odd positive integer

α

, the rank becomes

- \frac{α - 1}{2} \leq r^{'} \leq \frac{α - 1}{2}

. Define

r^{'} = r \mod^{+} α

as the unique element

r^{'}

in

0 \leq r^{'} < α

such that

r^{'} = r \mod α

. Generally, modular reduction can be simplified as

r \mod α

.

In hardware design, the modular reduction operation in multiplication can be executed using Barrett’s reduction [31] or Montgomery’s reduction [32]. Montgomery’s reduction algorithm requires converting numbers into and out of Montgomery form [33]. For Barrett’s modular reduction, the SAM2 technique is utilized to accelerate its operation in hardware. Barrett’s modular reduction requires a large number of shift blocks, adders, subtractors, and may be subject to timing attacks [33].

Sampling

The Kyber design is based on the module ring learning with errors encryption scheme, which typically considers LWE with either a rounded Gaussian [34] or a discrete Gaussian [35]. The authors of Kyber use centered binomial noise, which relies on LWE, in the design of Kyber.

In uniform sampling, Kyber uses a deterministic approach to sample elements in

R_{q}

, where

q = 3329

. These elements are statistically close to a uniform distribution [18]. Specifically, a byte stream

B = b_{0}, b_{1}, b_{2}, \dots

is the input of a

parse function : B \to R_{q}

, to compute the number theoretic transform of

a \in R_{q}

. The details of the parse function can be found in [18]. The output of the parse function is as follows:

\hat{a} (x) = {\hat{a}}_{0} + {\hat{a}}_{1} X + \dots + + {\hat{a}}_{n - 1} X^{n - 1} .

(3)

Centered binomial distribution

B_{η}

is used to sample noise in Kyber, where

η = 2

or

η = 3

. The centered binomial distribution is described in Algorithm 7.

Algorithm 7:

{CBD}_{η} : B^{64 η} \to R_{q}

[18].

Input: Byte array

B = (b_{0}, b_{1}, b_{2}, \dots, b_{64 η - 1}) \in B^{64 η}

Output: Polynomial

f \in R_{q}

1

(β_{0}, \dots, β_{512 η - 1}) : = BytesToBits (B)

2

for i from 0 to 255 do

3

a : = \sum_{j = 0}^{η - 1} β_{2 i η + j}

4

b : = \sum_{j = 0}^{η - 1} β_{2 i η + η + j}

5

f_{i} : = a - b

6 end for

7 return $f_{0} + f_{1} X + f_{2} X^{2} + \dots + f_{255} X^{255}$

3. Implementation of CRYSTALS-Kyber

In this section, we introduce existing implementation results of CRYSTALS-Kyber in pure software, software/hardware codesign, and pure hardware. We then analyze these results with a focus on hardware architecture design, and recommend suitable Kyber hardware architectures among the state-of-the-art designs for specific design goals.

The pure software implementation results of Kyber on Intel Haswell CPUs and ARM Cortex-M4 CPUs are reported in Table 3. Cycle counts are obtained on one core of a CPU and the results on an Intel Haswell CPU are the C-reference implementation results. As can be seen from Table 3, the software implementation on Intel Haswell CPUs offers a better performance, in terms of number of clock cyles, than the implementation on ARM Cortex-M4 CPUs, for both encapsulation and decapsulation processes at all security levels.

Table 4 shows the processing time in microseconds of the Kyber encapsulation and decapsulation operations in both software design and software/hardware codesign. As can be seen, the Kyber encapsulation and decapsulation operations implemented in the software/hardware codesign are much faster than those on pure software design. Specifically, with Kyber-1024, the encapsulation time and decapsulation time in software/hardware codesign are accelerated by up to 35.8 times and 38.6 times, compared with those in pure software design, respectively.

The pure hardware implementation results of Kyber-512, Kyber-768, and Kyber-1024 are presented in Table 5, Table 6 and Table 7, respectively. As can be seen in Table 5, among the Kyber-512 architectures, the design in [37] requires the highest hardware resources. In addition, the architecture in [37] introduces the longest latency compared with other works. The architecture in [23] helps reduce the hardware complexity, in terms of LUTs, and the processing time, compared with [37]. However, the area-to-time ratio of the architecture in [23] is still large. The architectures in [21,24,29,36] offer better values of hardware complexity and latency. Therefore, these architectures are suitable for the systems which require high speed and low complexity. Among the Kyber-512 architectures in Table 5, the design in [21] offers the best value of area-to-time ratio, and the design in [24] requires the lowest hardware resources to execute the Kyber-512 functions.

Among the Kyber-768 architectures in Table 6, the proposal in [29] obtains the best value of area-to-time ratio compared with other designs, followed by the architecture in [36]. Therefore, the design in [29] is the best candidate for systems which strictly consider the balance between hardware resource and latency. In addition, the design in [24] offers a reasonable area–time efficiency on the lowest hardware resources. Therefore, the work in [24] is an ideal candidate to deploy Kyber-768 on low hardware resource systems.

At NIST security level 5, the Kyber-1024 architecture in [29] is still the best work in terms of area–time ratio. For the resource-constraint systems, the architecture in [24] is the best choice. The design in [36] can be implemented on low hardware resources and offers an acceptable value of area–time ratio.

4. Discussion

From the analysis results, the architecture in [29] requires the lowest latency to perform Kyber encapsulation and decapsulation at different security levels. In addition, this architecture can offer the best value of area–time ratio at three security levels of CRYSTALS-Kyber. The architecture in [24] requires the lowest hardware resources. The performance of the existing CRYSTALS-Kyber architectures can be improved by deploying novel solutions to accelerating polynomial multiplication, such as integrating a novel parallel polynomial multiplication into the architecture in [24]. Discovering an efficient method for scheduling arithmetic operations is another technique to reduce hardware consumption and improve the area–time ratio.

5. Conclusions

In this paper, we introduced CRYSTALS-Kyber, a public key cryptography algorithm that entered round 3 of the NIST PQC standardization. We briefly described the background information of Kyber and introduced the key-encryption and key-establishment algorithms in detail. Furthermore, we analyzed the existing implementations of the pure software design, software and hardware codesign, and pure hardware design of Kyber, with a focus on pure hardware design. From the analysis results, we recommended the most suitable candidates at three security levels for different systems which have specific requirements of hardware resources, latency, and area-to-time balancing.

Author Contributions

T.T.B.N. and T.T.N. formulated the idea for this research, analyzed data, and prepared the initial version. H.L. reviewed, supervised, validated, and supported the research with funding. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT) under the ITRC support program (IITP-2021-0-02052) supervised by the IITP, and in part by the Inha University Research Grant.

Conflicts of Interest

The authors declare no conflict of interest.

References

Huang, X.; Wang, W. A Novel and Efficient Design for an RSA Cryptosystem with a Very Large Key Size. IEEE Trans. Circuits Syst. II Express Briefs 2015, 62, 972–976. [Google Scholar] [CrossRef]
Sun, H.; Wu, M.; Ting, W.; Hinek, M.J. Dual RSA and Its Security Analysis. IEEE Trans. Inf. Theory 2007, 53, 2922–2933. [Google Scholar] [CrossRef]
Ma, K.; Liang, H.; Wu, K. Homomorphic Property-Based Concurrent Error Detection of RSA: A Countermeasure to Fault Attack. IEEE Trans. Comput. 2012, 61, 1040–1049. [Google Scholar] [CrossRef]
Yang, C.; Chang, T.; Jen, C. A New RSA Cryptosystem Hardware Design Based on Montgomery’s Algorithm. IEEE Trans. Circuits Syst. II Analog. Digit. Signal Process. 1998, 45, 908–913. [Google Scholar] [CrossRef]
Sutter, G.D.; Deschamps, J.P.; Imaña, J.L. Efficient Elliptic Curve Point Multiplication using Digit-Serial Binary Field Operations. IEEE Trans. Ind. Electron. 2013, 60, 217–225. [Google Scholar] [CrossRef]
Koblitz, N.; Menezes, A.; Vanstone, S. The State of Elliptic Curve Cryptography. Des. Codes Cryptogr. 2000, 19, 173–193. [Google Scholar] [CrossRef]
Chelton, W.N.; Benaissa, M. Fast Elliptic Curve Cryptography on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2008, 16, 198–205. [Google Scholar] [CrossRef]
Mahdizadeh, H.; Masoumi, M. Novel Architecture for Efficient FPGA Implementation of Elliptic Curve Cryptographic Processor over GF(2¹⁶³). IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2013, 21, 2330–2333. [Google Scholar] [CrossRef]
Lee, Y.K.; Sakiyama, K.; Batina, L.; Verbauwhede, I. Elliptic Curve-Based Security Processor for RFID. IEEE Trans. Comput. 2008, 57, 1514–1527. [Google Scholar] [CrossRef] [Green Version]
NIST Post-Quantum Cryptography Standardization. Available online: https://csrc.nist.gov/projects/post-quantum-cryptography (accessed on 8 February 2022).
Bos, J.; Costello, C.; Ducas, L.; Mironov, I.; Naehrig, M.; Nikolaenko, V.; Raghunathan, A.; Stebila, D. Frodo: Take off the Ring! Practical, Quantum-Secure Key Exchange from LWE. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1006–1018. [Google Scholar] [CrossRef] [Green Version]
Nguyen Tan, T.; Lee, H. High-Performance Ring-LWE Cryptography Scheme for Biometric Data Security. IEIE Trans. Smart Process. Comput. 2018, 7, 97–106. [Google Scholar] [CrossRef]
Rentería-Mejía, C.P.; Velasco-Medina, J. High-Throughput Ring-LWE Cryptoprocessors. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2332–2345. [Google Scholar] [CrossRef]
Chen, D.D.; Mentens, N.; Vercauteren, F.; Roy, S.S.; Cheung, R.C.C.; Pao, D.; Verbauwhede, I. High-Speed Polynomial Multiplication Architecture for Ring-LWE and SHE Cryptosystems. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 157–166. [Google Scholar] [CrossRef]
Nguyen Tan, T.; Lee, H. Efficient-Scheduling Parallel Multiplier-Based Ring-LWE Cryptoprocessors. Electronics 2019, 8, 413. [Google Scholar] [CrossRef] [Green Version]
Nguyen Tan, T.; Lee, H. High-Secure Low-Latency Ring-LWE Cryptography Scheme for Biomedical Images Storing and Transmitting. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–4. [Google Scholar] [CrossRef]
Nguyen Tan, T.; Nguyen, T.T.B.; Lee, H. High-Efficiency Low-Latency NTT Polynomial Multiplier for Ring-LWE Cryptography. J. Semicond. Technol. Sci. (JSTS) 2020, 20, 220–223. [Google Scholar] [CrossRef]
Avanzi, R.; Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehle, D. CRYSTALS-Kyber Algorithm Specifications and Supporting Documentation. Available online: https://pq-crystals.org/kyber/data/kyber-specification-round3-20210131.pdf (accessed on 8 February 2022).
Ma, L.; Wu, X.; Bai, G. Parallel polynomial multiplication optimized scheme for CRYSTALS-KYBER Post-Quantum Cryptosystem based on FPGA. In Proceedings of the 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), Beijing, China, 14–16 May 2021; pp. 361–365. [Google Scholar] [CrossRef]
Nguyen, D.T.; Dang, V.B.; Gaj, K. A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 371–374. [Google Scholar] [CrossRef]
Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 4648–4659. [Google Scholar] [CrossRef]
Roy, S.S.; Basso, A. High-Speed Instruction-Set Coprocessor for Lattice-Based Key Encapsulation Mechanism: Saber in Hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 4, 443–466. [Google Scholar] [CrossRef]
Huang, Y.; Huang, M.; Lei, Z.; Wu, J. A Pure Hardware Implementation of CRYSTALS-KYBER PQC Algorithm through Resource Reuse. IEICE Electron. Express 2020, 17, 20200234. [Google Scholar] [CrossRef]
Xing, Y.; Li, S. A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism Crystals-Kyber on FPGA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2, 328–356. [Google Scholar] [CrossRef]
Gupta, N.; Jati, A.; Chauhan, A.K.; Chattopadhyay, A. PQC Acceleration Using GPUs: FrodoKEM, NewHope, and Kyber. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 575–586. [Google Scholar] [CrossRef]
Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehle, D. CRYSTALS-Kyber: A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 353–367. [Google Scholar] [CrossRef] [Green Version]
He, S.; Torkelson, M. Designing Pipeline FFT Processor for OFDM (De)modulation. In Proceedings of the 1998 URSI International Symposium on Signals, Systems, and Electronics, Pisa, Italy, 2 October 1998; pp. 257–262. [Google Scholar] [CrossRef]
Nguyen, T.T.B.; Lee, H. High-Throughput Low-Complexity Mixed-Radix FFT Processor using a Dual-Path Shared Complex Constant Multiplier. J. Semicond. Technol. Sci. (JSTS) 2017, 17, 101–109. [Google Scholar] [CrossRef] [Green Version]
Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-Speed NTT-Based Polynomial Multiplication Accelerator for Post-Quantum Cryptography. In IACR Cryptology ePrint Archive: Report 2021/563. 2021. Available online: https://eprint.iacr.org/2021/563 (accessed on 8 February 2022).
Zhang, C.; Liu, D.; Liu, X.; Zou, X.; Niu, G.; Liu, B.; Jiang, Q. Towards Efficient Hardware Implementation of NTT for Kyber on FPGAs. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 22–28 May 2021; pp. 1–5. [Google Scholar] [CrossRef]
Barrett, P. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Conference on the Theory and Application of Cryptographic Techniques; Springer: Berlin/Heidelberg, Germany, 1986; pp. 311–323. [Google Scholar]
Montgomery, P.L. Modular Multiplication without Trial Division. Math. Comput. 1985, 44, 519–521. [Google Scholar] [CrossRef]
Kundi, D.-S.; Zhang, Y.; Wang, C.; Khalid, A.; O’Neill, M.; Liu, W. Ultra High-Speed Polynomial Multiplications for Lattice-based Cryptography on FPGAs. IEEE Trans. Emerg. Top. Comput. 2022. [Google Scholar] [CrossRef]
Regev, O. On Lattices, Learning with Errors, Random Linear Codes, and Cryptography. J. ACM 2009, 56, 34. [Google Scholar] [CrossRef]
Brakerski, Z.; Langlois, A.; Peikert, C.; Regev, O.; Stehle, D. Classical Hardness of Learning with Errors. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, Palo Alto, CA, USA, 2–4 June 2013; pp. 575–584. [Google Scholar] [CrossRef] [Green Version]
Dang, V.B.; Farahmand, F.; Andrzejczak, M.; Mohajerani, K.; Nguyen, D.T.; Gaj, K. Implementation and Benchmarking of Round 2 Candidates in the NIST Post-Quantum Cryptography Standardization Process Using Hardware and Software/Hardware Co-Design Approaches. In IACR Cryptology ePrint Archive: Report 2020/795. 2020. Available online: https://eprint.iacr.org/2020/795 (accessed on 8 February 2022).
Basu, K.; Soni, D.; Nabeel, M.; Karri, R. NIST Post-Quantum Cryptography—A Hardware Evaluation Study. IACR Cryptology ePrint Archive: Report 2019/047. 2019. Available online: https://eprint.iacr.org/2019/047 (accessed on 8 February 2022).

Figure 1. Public-key encryption.

Figure 2. Encryption process of a public-key algorithm.

Figure 3. Decryption process of a public-key algorithm.

Figure 4. Encapsulation process of a key-establishment algorithm.

Figure 5. Decapsulation process of a key-establishment algorithm.

Table 1. Kyber parameter sets [18].

Algorithm	NIST Security Level	Parameters
Algorithm	NIST Security Level	$n$	$k$	$q$	( $η_{1}$ , $η_{2}$ )	( $d_{u}$ , $d_{v}$ )
Kyber-512	1	256	2	3329	(3, 2)	(10, 4)
Kyber-768	3	256	3	3329	(2, 2)	(10, 4)
Kyber-1024	5	256	4	3329	(2, 2)	(11, 5)

Table 2. Comparison of Kyber PKE encryption/decryption and Kyber KEM encapsulation/decapsulation algorithms.

Algorithm	Input	Output
PKE encryption	Input message $m \in B^{32}$ Public key $p k \in B^{12 \cdot k \cdot n / 8 + 32}$ Random coins $r \in B^{32}$	Ciphertext $c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}$
KEM encapsulation	Public key $p k \in B^{12 \cdot k \cdot n / 8 + 32}$	Ciphertext $c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}$ Shared key $K \in B^{*}$
PKE decryption	Ciphertext $c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}$ Secret key $s k \in B^{12 \cdot k \cdot n / 8}$	Original message $m \in B^{32}$
KEM decapsulation	Ciphertext $c \in B^{d_{u} \cdot k \cdot n / 8 + d_{v} \cdot n / 8}$ Secret key $s k \in B^{12 \cdot k \cdot n / 8 + 96}$	Shared key $K \in B^{*}$

Table 3. Results of Kyber pure software design on Intel Haswell CPUs and ARM Cortex-M4 CPUs [18].

Algorithm	Function	Time (Clock Cycles)
Algorithm	Function	Intel Haswell CPUs	ARM Cortex-M4 CPUs
Kyber-512	Encapsulation	154,524	561,518
Kyber-512	Decapsulation	187,960	519,237
Kyber-768	Encapsulation	235,260	915,676
Kyber-768	Decapsulation	274,900	853,001
Kyber-1024	Encapsulation	346,648	1,407,769
Kyber-1024	Decapsulation	396,584	1,326,409

Table 4. Existing pure software design and software/hardware codesign results of Kyber [36].

Algorithm	Function	Time ( $μ s$ )
Algorithm	Function	Pure Software Design	Software/Hardware Codesign
Kyber-512	Encapsulation	332.0	15.2
Kyber-512	Decapsulation	433.0	17.1
Kyber-768	Encapsulation	536.7	17.8
Kyber-768	Decapsulation	670.1	20.1
Kyber-1024	Encapsulation	787.5	22.0
Kyber-1024	Decapsulation	953.7	24.7

Table 5. Comparison of the existing Kyber-512 hardware architectures.

Parameter	[37]	[23]	[24]	[36]	[29]	[21]
Device	Virtex-7	Artix-7	Artix-7	Artix-7	Artix-7	Artix-7
LUTs	1978 K	89 K	7 K	12 K	11 K	18 K
FFs	194 K	NA	5 K	10 K	10 K	5 K
Slices	NA	NA	2 K	4 K	4 K	5 K
DSPs	0	354	2	8	8	6
BRAMs	0	202	3	15	13	15
Frequency (MHz)	67	155	161	210	200	115
Total time ( $μ$ s)	1169	761	72	35	31	148
Area × Time (LUTs × s)	2312.28	67.73	0.50	0.42	0.34	2.66

Table 6. Comparison of the state-of-the-art Kyber-768 hardware architectures.

Parameter	[23]	[24]	[36]	[29]	[21]
Device	Artix-7	Artix-7	Artix-7	Artix-7	Artix-7
LUTs	167 K	7 K	12 K	12 K	16 K
FFs	NA	5 K	10 K	10 K	6 K
Slices	NA	2 K	4 K	4 K	4 K
DSPs	292	2	8	12	9
BRAMs	202	3	15	14	16
Frequency (MHz)	155	161	210	200	115
Total time ( $μ$ s)	1155	111	46	40	209
Area × Time (LUTs × s)	192.89	0.78	0.55	0.48	3.34

Table 7. Comparison of the state-of-the-art Kyber-1024 hardware architectures.

Parameter	[23]	[24]	[36]	[29]	[21]
Device	Artix-7	Artix-7	Artix-7	Artix-7	Virtex-7
LUTs	133 K	7 K	12 K	13 K	16 K
FFs	NA	5 K	12 K	12 K	6 K
Slices	NA	2 K	5 K	54 K	5 K
DSPs	548	2	8	16	12
BRAMs	202	3	15	16	17
Frequency (MHz)	192	161	210	185	156
Total time ( $μ$ s)	1260	154	63	56	205
Area × Time (LUTs × s)	167.58	1.08	0.76	0.72	3.28

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Nguyen, T.T.; Nguyen, T.T.B.; Lee, H. An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms. Electronics 2022, 11, 891. https://doi.org/10.3390/electronics11060891

AMA Style

Nguyen TT, Nguyen TTB, Lee H. An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms. Electronics. 2022; 11(6):891. https://doi.org/10.3390/electronics11060891

Chicago/Turabian Style

Nguyen, Tuy Tan, Tram Thi Bao Nguyen, and Hanho Lee. 2022. "An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms" Electronics 11, no. 6: 891. https://doi.org/10.3390/electronics11060891

APA Style

Nguyen, T. T., Nguyen, T. T. B., & Lee, H. (2022). An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms. Electronics, 11(6), 891. https://doi.org/10.3390/electronics11060891

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

An Analysis of Hardware Design of MLWE-Based Public-Key Encryption and Key-Establishment Algorithms

Abstract

1. Introduction

2. Background

2.1. Public-Key Encryption Algorithm

2.2. Key-Establishment Algorithm

2.3. Arithmetic Operations in CRYSTALS-Kyber

3. Implementation of CRYSTALS-Kyber

4. Discussion

5. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI