Article

Designing a Scalable and Area-Efficient Hardware Accelerator Supporting Multiple PQC Schemes

by Heonhui Jung 1 and Hyunyoung Oh 2,*
1 Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea
2 Department of AI·Software, Gachon University, Seongnam-si 13120, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3360; https://doi.org/10.3390/electronics13173360
Submission received: 29 June 2024 / Revised: 8 August 2024 / Accepted: 14 August 2024 / Published: 23 August 2024
(This article belongs to the Special Issue Recent Advances in Information Security and Data Privacy)

Abstract:
This study introduces a hardware accelerator to support various Post-Quantum Cryptosystem (PQC) schemes, addressing the quantum computing threat to cryptographic security. PQCs, while more secure, also bring significant computational demands, which are especially problematic for lightweight devices. Previous hardware accelerators are typically scheme-specific, which is inefficient given the National Institute of Standards and Technology (NIST)’s multiple finalists. Our approach focuses on the shared operations among these schemes, allowing a single design to accelerate multiple candidate PQCs at the same time. This is further enhanced by allocating resources according to performance profiling results. Our compact, scalable hardware accelerator supports four of NIST PQC finalists, achieving an area efficiency of up to 81.85% compared to the current state-of-the-art multi-scheme accelerator while supporting twice as many schemes. The design demonstrates average throughput improvements ranging from 0.97× to 35.97× across the four schemes and their main operations, offering an efficient solution for implementing multiple PQC schemes within constrained hardware environments.

1. Introduction

The key exchange algorithm (KEA) and digital signature algorithm (DSA) are indispensable cryptographic algorithms used for identification and authentication between systems. Such classical cryptosystems have served as strong security measures in various fields such as the Internet of Things (IoT) [1] and autonomous industrial systems [2], implemented on a wide range of HW platforms, from low-end embedded/mobile devices [3] to high-end platforms [4]. However, the advent of quantum computing, a prominent and rapidly developing field, has put classical cryptosystems at great risk. They were proven to be vulnerable to attacks by quantum computing systems [5], creating the need for novel cryptosystems designed to be quantum-resistant.
To address this need, NIST initiated a standardization process for post-quantum cryptography (PQC), selecting four final cryptographic schemes (CRYSTALS-Kyber, CRYSTALS-Dilithium, FALCON, and SPHINCS+) among 82 initial submissions as targets of standardization (for simplicity, we refer to CRYSTALS-Kyber as Kyber and CRYSTALS-Dilithium as Dilithium throughout this paper). These schemes have undergone extensive scrutiny and research, ensuring their reliability. While NIST may consider additional schemes in the future, these four were chosen first due to their promising potential and comprehensive evaluation results. Several researchers have put effort into implementing them on HW platforms. However, achieving post-quantum resistance requires significantly more complex algorithms than classical cryptography, leading to a substantially increased amount of computation. Under such conditions, prior works mostly focused on enhancing the performance of individual PQC schemes by exploiting the parallelism inherent in each algorithm. Additionally, several of the existing accelerators targeted only one of the four schemes. To the best of our knowledge, only Aikata et al. [6] introduced HW designs that support two of the four schemes. It is noteworthy that all four schemes have their own pros and cons (e.g., a better fit for longer/shorter message lengths), which makes it important for an implementation to efficiently support all four schemes.
A naive approach to supporting all four PQC finalist schemes would be to integrate four independent designs, each dedicated to one scheme. However, this would require excessive hardware area, limiting applicability across various platforms needed for wide-ranging fields. Our work proposes a design methodology enabling efficient implementation of all four schemes within hardware area constraints. This methodology is built on a comprehensive analysis of the four PQC finalist schemes, aiming to create a flexible and efficient hardware design that adapts to various area constraints. We begin with performance profiling to identify computational hotspots—parts of each scheme where the most computational resources are used—and common operations across the schemes. This analysis reveals three key challenges: the diverse nature of polynomial operations, varying proportions of Keccak usage, and distinct high-level operation sequences among the schemes.
To address these challenges, our hardware design incorporates three main components: a scalable Keccak Acceleration Module (KAM), a versatile Joint Polynomial Arithmetic Unit (JPAU), and an efficient control unit. The KAM offers three variants to balance area and performance requirements, while the JPAU serves as a generic arithmetic unit capable of handling various polynomial operations common to all schemes. To manage the complexity of control flow, we implement a Unified Polynomial Control Unit (UPCU) separate from the main control unit, efficiently handling polynomial operations for all schemes. This modular and scalable approach allows for efficient resource utilization and performance optimization, achieving an area efficiency of up to 81.85% compared to the current state-of-the-art multi-scheme accelerator in [6], while supporting all four schemes instead of just two. Our evaluation shows an average throughput improvement ranging from 0.97× to 35.97× across the four schemes and three main operations, demonstrating the robustness and efficiency of our comprehensive design.
The remainder of this paper is organized as follows: Section 2 provides background information on post-quantum cryptography and detailed explanations of the four finalist schemes: Dilithium, Kyber, Falcon, and SPHINCS+. Section 3 discusses related works and outlines our motivation. In Section 4, we present our design methodology, including performance profiling and the proposed design architecture. Section 5 details the implementation of our design, while Section 6 provides a comprehensive evaluation of its performance. Section 7 discusses architectural differences from prior works, security considerations, and limitations. Finally, we conclude our work in Section 8, summarizing our contributions and discussing potential future directions in the field of hardware acceleration for post-quantum cryptography.

2. Background

2.1. Post-Quantum Cryptography

Post-quantum cryptography (PQC) refers to cryptosystems that are considered secure against cryptanalytic attacks by quantum computers. Since 2016, NIST has been pursuing a PQC standardization program to select suitable schemes for key establishment and digital signature algorithms (KEAs and DSAs). Figure 1 depicts the general process of KEAs and DSAs. The KEA consists of three principal stages: key generation, encapsulation, and decapsulation. During the key generation stage, the receiver generates a pair of keys (public and secret) using Keygen() and broadcasts the public key. The sender, who wishes to send a message to the receiver, uses the public key to encapsulate the message using Encaps(), which the receiver decapsulates with the secret key using Decaps(). The DSA is likewise composed of three stages: key generation, signature generation, and signature verification. The sender generates a pair of public and secret keys using Keygen(). With their private key, they generate a signature using Sign(), which the receiver can verify with the sender's public key using Verify(). Signature generation continues until a valid signature is produced. For a signature to be valid, it must satisfy a set of constraints ensuring that it does not reveal any dependency on the secret key.
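To make the data flow concrete, the minimal Python sketch below models only the KEA message flow described above; the keygen/encaps/decaps functions and the XOR-based "encapsulation" are illustrative placeholders, not any actual PQC scheme.

```python
# Toy model of the KEA flow (key generation, encapsulation, decapsulation).
# Only the data flow is modeled; no real cryptography is performed.
import os

def keygen():
    # A real scheme derives (pk, sk) from lattice or hash operations;
    # here random byte strings stand in for both keys.
    sk = os.urandom(32)
    pk = os.urandom(32)
    return pk, sk

def encaps(pk):
    # Sender: derive a shared secret and a ciphertext bound to the receiver's pk.
    shared = os.urandom(32)
    ct = bytes(a ^ b for a, b in zip(shared, pk))   # stand-in for real encapsulation
    return ct, shared

def decaps(ct, sk, pk):
    # Receiver: recover the shared secret (sk is unused in this toy model).
    return bytes(a ^ b for a, b in zip(ct, pk))     # stand-in for real decapsulation

pk, sk = keygen()                  # receiver generates and broadcasts pk
ct, k_sender = encaps(pk)          # sender encapsulates a session key
k_receiver = decaps(ct, sk, pk)    # receiver decapsulates the same key
assert k_sender == k_receiver
```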
In 2022, NIST selected four final schemes among the 82 initial submissions: one KEA, Kyber [7], and three DSAs, Dilithium [8], FALCON [9], and SPHINCS+ [10]. Table 1 lists the algorithms selected in 2022. Below are brief descriptions of Kyber, Dilithium, FALCON, and SPHINCS+. For readers seeking more details, we recommend referring to the submission references of the schemes [11].

2.2. Dilithium

Dilithium [8] is a digital signature algorithm (DSA) that is lattice-based, relying on the hardness of finding short vectors in lattices for security. In Algorithm 1, the key generation algorithm first generates a k × l matrix A, where each element is a 256-dimensional polynomial in the ring R_q = Z_q[x]/(x^n + 1). This is achieved by sampling uniform random values from the SHAKE128 hash output. The algorithm then samples random secret vectors s_1 and s_2. Using these values, the public key is computed as t = A s_1 + s_2.
Algorithm 1 Main algorithm of Dilithium
 1 Keygen()
 2     A ← SHAKE128Sampling(size = k × l)
 3     s_1 ← SHAKE256Sampling(size = l)
 4     s_2 ← SHAKE256Sampling(size = k)
 5     // Use NTT for faster multiplication
 6     t = A s_1 + s_2
 7     return (pk = (A, t), sk = (A, t, s_1, s_2))

 8 Sign(sk, M)
 9     // Generate masking vector
10     z = ⊥
11     while z = ⊥ do
12         y ← SHAKE256SamplingLessThan_γ1(size = l)
13         w_1 = HighBits(A y, 2γ_2)
14         c ∈ B_τ = H(M ∥ w_1)
15         z = y + c s_1
16         if ∥z∥_∞ ≥ γ_1 − β or ∥LowBits(A y − c s_2, 2γ_2)∥_∞ ≥ γ_2 − β then
17             z = ⊥
18     return σ = (z, c)

19 Verify(pk, M, σ = (z, c))
20     w_1 = HighBits(A z − c t, 2γ_2)
21     return (∥z∥_∞ < γ_1 − β) and (c = H(M ∥ w_1))
The signing algorithm first generates a masking vector of polynomials y with coefficients less than γ_1. After generating y, A y is computed, and w_1 is set to the high-order bits of the coefficients in this vector. The challenge c is computed as the hash of the message and w_1 using SHAKE256. The signature is then computed as z = y + c s_1. To avoid leakage of the secret key, we ensure that z does not reveal any dependency on the secret key.
The verification algorithm first computes w_1 as the high-order bits of A z − c t. We then check the conditions (∥z∥_∞ < γ_1 − β) and (c = H(M ∥ w_1)) to validate the signature. Polynomial multiplication is efficiently performed by applying the Number Theoretic Transform (NTT) to each element. The Dilithium scheme uses a 23-bit modulus q = 2^23 − 2^13 + 1.
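As a behavioral reference for the polynomial arithmetic discussed here, the sketch below performs schoolbook negacyclic multiplication in R_q = Z_q[x]/(x^n + 1) with Dilithium's modulus; an NTT-based multiplier computes the same product in O(n log n), which is what makes the NTT a hotspot worth accelerating. This is only an illustrative Python model, not the reference implementation.

```python
import random

# Schoolbook multiplication in R_q = Z_q[x]/(x^n + 1) with Dilithium's modulus.
# An NTT-based multiplier computes the same result in O(n log n); this O(n^2)
# version is only a behavioral reference.
Q = (1 << 23) - (1 << 13) + 1   # q = 2^23 - 2^13 + 1 = 8380417
N = 256

def poly_mul_negacyclic(a, b, q=Q, n=N):
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n = -1 in the quotient ring, so wrapped terms are subtracted
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c

a = [random.randrange(Q) for _ in range(N)]
b = [random.randrange(Q) for _ in range(N)]
c = poly_mul_negacyclic(a, b)   # 256 coefficients in [0, q)
```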

2.3. Kyber

Kyber [7] is an IND-CCA2-secure key exchange algorithm (KEA) that is also lattice-based. Kyber is constructed in two stages: an IND-CPA-secure public-key encryption scheme, Kyber.CPAPKE, is wrapped into the IND-CCA2-secure KEM. The CPAPKE algorithm is as follows.
  • Kyber.CPAPKE.KeyGen. Key generation first generates a k × k matrix A by sampling from the SHAKE128 hash function. It then samples the secret s and error e using SHAKE256. The public key pk is computed as As + e, and the secret key sk is s.
  • Kyber.CPAPKE.Encryption. Encryption first regenerates the k × k matrix A from SHAKE128 and samples r, e_1, and e_2 using SHAKE256. It then computes u = A^T r + e_1 and v = t^T r + e_2 + Decompress_q(m, 1). The ciphertext c is computed as c = (Compress_q(u, d_u), Compress_q(v, d_v)).
  • Kyber.CPAPKE.Decryption. Decryption uses the secret key sk and ciphertext c to restore u and v by decompressing the ciphertext. The original message is computed as m = Compress_q(v − s^T u, 1).
Taking all the above-mentioned steps into account, the full Kyber KEM, built on Kyber.CPAPKE, is detailed in Algorithm 2.
Algorithm 2 Main algorithm of Kyber
 1 Keygen()
 2     z ← B^32
 3     (pk, sk′) = Kyber.CPAPKE.Keygen()
 4     sk = (sk′ ∥ pk ∥ H(pk) ∥ z)
 5     return (pk, sk)

 6 Encapsulate(pk)
 7     m ← B^32
 8     m ← H(m)
 9     (K̄, r) = G(m ∥ H(pk))
10     c = Kyber.CPAPKE.Encrypt(pk, m, r)
11     K = KDF(K̄ ∥ H(c))
12     return (c, K)

13 Decapsulate(c, sk)
14     pk = sk + 12 · k · n/8
15     h = sk + 24 · k · n/8 + 32 ∈ B^32
16     z = sk + 24 · k · n/8 + 64
17     m′ = Kyber.CPAPKE.Decrypt(s, (u, v))
18     (K̄′, r′) = G(m′ ∥ h)
19     c′ = Kyber.CPAPKE.Encrypt(pk, m′, r′)
20     if c = c′ then
21         K = KDF(K̄′ ∥ H(c))
22     else
23         K = KDF(z ∥ H(c))
24     return K
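The Compress_q and Decompress_q helpers used in Kyber's encryption and decryption above keep only the d most significant fractional bits of each coefficient. The following sketch, assuming the round-and-scale definition from the Kyber specification with q = 3329, illustrates the round trip; it is a simplified model, not the reference code.

```python
Q = 3329  # Kyber modulus

def compress(x, d, q=Q):
    # round((2^d / q) * x) mod 2^d, rounding halves up as in the specification
    return ((x << d) + q // 2) // q % (1 << d)

def decompress(y, d, q=Q):
    # round((q / 2^d) * y)
    return (y * q + (1 << (d - 1))) >> d

# A coefficient survives a compress/decompress round trip up to a small error.
x = 1234
d = 10
y = decompress(compress(x, d), d)
assert abs(x - y) <= (Q // (1 << d)) // 2 + 1
```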

2.4. Falcon

Falcon [9] is a digital signature algorithm (DSA) that utilizes the Gentry–Peikert–Vaikuntanathan (GPV) framework to construct a hash-and-sign lattice-based scheme. Thanks to the use of NTRU lattices, Falcon signatures are substantially shorter than those of any other lattice-based signature scheme at the same security level, while maintaining the same public key size. Originally, Falcon employs fast Fourier sampling with double-precision floating-point operations for a fast implementation. However, the modification in [12] allows the use of NTT and modular arithmetic instead of costly double-precision floating-point operations. Taking these points into account, the modified Falcon algorithm from [12] is detailed in Algorithm 3.
Algorithm 3 Main algorithm of Falcon
 1 Keygen(ϕ ∈ Z[x], q)
 2     f, g, F, G ← PolyGen(ϕ, q)
 3     B ← [[g, −f], [G, −F]]
 4     sk ← B
 5     h ← g × f^−1 mod (q, ϕ)
 6     pk ← h
 7     return sk, pk

 8 Sign(M, sk, β^2)
 9     r ← {0, 1}^320 uniformly
10     c ← H(M ∥ r)
11     μ ← 2^6
12     I_1 ← ModDown(c × F, Q, q)
13     I_2 ← ModDown(c × f, Q, q)
14     while ∥s∥^2 > β^2 do
15         for i = 0; i ≤ n − 1; i ← i + 1 do
16             J_1,i ← B_μ
17             J_2,i ← B_μ
18         J_1 ← Σ_{i=0}^{n−1} J_1,i × x^i
19         J_2 ← Σ_{i=0}^{n−1} J_2,i × x^i
20         s_1 ← c − (I_1 + J_1) × g − (I_2 + J_2) × G mod (ϕ, q)
21         s_2 ← (I_1 + J_1) × f + (I_2 + J_2) × F mod (ϕ, q)
22         s ← (s_1, s_2)
23     return sig = (r, s_2)

24 Verify(M, sig, pk, β^2)
25     c ← H(M ∥ r)
26     s_1 ← c − s_2 × h mod (ϕ, q)
27     if ∥(s_1, s_2)∥^2 ≤ β^2 then
28         return accept
29     else
30         return reject

2.5. SPHINCS+

SPHINCS+ [10] is a stateless hash-based scheme, an advancement of the SPHINCS [13] signature scheme with several improvements, including a reduced signature size. The signing information is essentially a hypertree signature, a hierarchical structure that combines multiple layers of hash-based XMSS signatures. XMSS (eXtended Merkle Signature Scheme) is a hash-based digital signature scheme that uses a Merkle tree to generate and verify signatures. In SPHINCS+, an XMSS signature is a Merkle-tree signature consisting of a WOTS+ (Winternitz One-Time Signature) used as a one-time signature on the given message and the authentication path in the binary hash tree. By combining several layers of XMSS trees, a hypertree is formed, which is a variant of XMSS^MT (multi-tree XMSS). The overall algorithm is detailed in Algorithm 4.
Algorithm 4 Main algorithm of SPHINCS+
 1 Keygen()
 2     SK.seed ← random(n)
 3     SK.prf ← random(n)
 4     PK.seed ← random(n)
 5     PK.root ← HypertreeKeygen(SK.seed, PK.seed)
 6     return ((SK.seed, SK.prf, PK.seed, PK.root), (PK.seed, PK.root))

 7 Sign(M, sk)
 8     // Init
 9     ADRS = toByte(0, 32)
10     opt = PK.seed
11     R = PRF_msg(SK.prf, opt, M)
12     SIG = SIG ∥ R

13     // Compute message digest and index
14     digest = H_msg(R, PK.seed, PK.root, M)
15     tmp_md = first ⌊(k·a + 7)/8⌋ bytes of digest
16     tmp_idx_tree = next ⌊(h − h/d + 7)/8⌋ bytes of digest
17     tmp_idx_leaf = next ⌊(h/d + 7)/8⌋ bytes of digest

18     md = first k·a bits of tmp_md
19     idx_tree = first h − h/d bits of tmp_idx_tree
20     idx_leaf = first h/d bits of tmp_idx_leaf

21     // FORS sign
22     SIG_FORS = fors_sign(md, SK.seed, PK.seed, ADRS)
23     SIG = SIG ∥ SIG_FORS

24     // Get FORS public key
25     PK_FORS = fors_pkFromSig(SIG_FORS, md, PK.seed, ADRS)

26     // Sign FORS public key with the hypertree
27     ADRS.setType(TREE)
28     SIG_HT = ht_sign(PK_FORS, SK.seed, PK.seed, idx_tree, idx_leaf)
29     SIG = SIG ∥ SIG_HT
30     return SIG

31 Verify(M, SIG, pk)
32     // Init
33     ADRS = toByte(0, 32)
34     R = SIG.getR()
35     SIG_FORS = SIG.getSIG_FORS()
36     SIG_HT = SIG.getSIG_HT()

37     // Compute message digest and index
38     digest = H_msg(R, PK.seed, PK.root, M)
39     tmp_md = first ⌊(k·a + 7)/8⌋ bytes of digest
40     tmp_idx_tree = next ⌊(h − h/d + 7)/8⌋ bytes of digest
41     tmp_idx_leaf = next ⌊(h/d + 7)/8⌋ bytes of digest

42     md = first k·a bits of tmp_md
43     idx_tree = first h − h/d bits of tmp_idx_tree
44     idx_leaf = first h/d bits of tmp_idx_leaf

45     // Compute FORS public key
46     ADRS.setLayerAddress(0)
47     ADRS.setTreeAddress(idx_tree)
48     ADRS.setType(FORS_TREE)
49     ADRS.setKeyPairAddress(idx_leaf)
50     PK_FORS = fors_pkFromSig(SIG_FORS, md, PK.seed, ADRS)

51     // Verify the hypertree signature
52     ADRS.setType(TREE)
53     return ht_verify(PK_FORS, SIG_HT, PK.seed, idx_tree, idx_leaf, PK.root)

3. Related Works and Motivation

Several works, such as [6,14], have presented accelerators for two schemes: Dilithium–Kyber and Dilithium–Saber. The authors in [6] exploited Kyber's small q value to compute two coefficients simultaneously on Dilithium's 24-bit datapath, scheduling operations to further utilize each component. As a result, they achieved a performance per area comparable to, or, in some cases, better than, single-scheme accelerators. The authors in [14] optimized Saber [15] to use NTT-based polynomial multiplication and designed a unified NTT multiplier. Both works exploited the fact that these schemes are lattice-based and share many common operations. While they support both KEA and DSA schemes, they do not support all four schemes in the selected set. It is worth noting that Saber was not selected as a finalist in the PQC standardization process and is not covered in our work.
The authors in [16] introduced a coprocessor and instruction set for the RISC-V architecture to accelerate various schemes. The key achievement reported is that this design reduces the number of clock cycles required for NTT operations by about 50% compared to the same operations on a standard RISC-V 64-bit integer multiplication (RV64IM) architecture. The authors in [17] presented an NTT accelerator specifically targeting the Dilithium and Kyber schemes to accelerate butterfly operations and polynomial multiplication, which improved NTT execution time by 3.4×–9.6× for Dilithium and 1.36×–34.16× for Kyber. Both works accelerated only part of each scheme in the form of a coprocessor and do not implement the full scheme. There are also various single-scheme accelerators designed for FPGA or ASIC, such as those in [18,19,20,21,22,23].
Despite these advancements, none of the existing works support all four PQC finalist schemes: Kyber, Dilithium, FALCON, and SPHINCS+. Supporting all four schemes is crucial for the following reasons:
1. Versatility across applications and environments. Allows for a single solution adaptable to different security and performance requirements.
2. Reduced need for multiple specialized accelerators. Unifies HW resources, reducing overall cost and complexity.
3. Simplified maintenance and updates. Changes can be applied uniformly across all supported schemes, making it easier to adapt to evolving cryptographic standards.
4. Enhanced flexibility and longevity of the HW design. Ensures compatibility with future PQC standards and extends the useful life of the HW investment.
According to the National Institute of Standards and Technology (NIST), having a versatile and comprehensive approach to PQC is essential for addressing the wide-ranging threats posed by quantum computing. By standardizing a set of diverse and robust schemes, NIST aims to provide a secure foundation that can protect sensitive information well into the future. Therefore, our motivation lies in developing an HW design that efficiently supports all four PQC finalist schemes, ensuring robust security and broad applicability.

4. Design Methodology

Our paper aims to provide a methodology to implement designs supporting all four PQC finalist schemes (Kyber, Dilithium, FALCON, and SPHINCS+) while accommodating various HW area constraints. Supporting diverse HW area constraints is significant because it allows for the deployment of PQC solutions across a wide range of devices and applications, from resource-constrained IoT devices to high-performance computing systems. This flexibility ensures that robust post-quantum security can be integrated into existing and future technology infrastructures without being limited by HW capabilities.
To achieve this, our approach avoids getting into intricate, detailed optimizations. Instead, we introduce a design space exploration (DSE) method that can be scaled according to any given area constraints. We begin by profiling the performance of each scheme to identify hotspots. Hotspots refer to parts of the scheme where the most computational resources are used. Next, we identify functions that are commonly shared across different schemes, with the goal of implementing them as single, unified HW modules. Following this, we analyze the trade-offs of specific design aspects, such as SRAM size and the number of parallel butterfly units, to optimize overall HW efficiency.

4.1. Performance Profiling

Figure 2 shows the breakdown of executed operations for each scheme. The profiling was performed using the NIST reference implementations on an Intel Xeon W5-2465X CPU with 512 GB of memory. Each reference implementation was compiled with gcc 11.4.0 using the -O3 optimization flag. We used Intel VTune Profiler [24] to analyze the profiling results. The breakdown of operations highlighted several hotspots:
  • Keccak. Predominantly in SPHINCS+ (99%) and significantly in Dilithium (43%).
  • NTT. Major component in Falcon as FFT operations (60.4%).
  • Polynomial operations. Present in Falcon as floating-point operations, indicating heavy computational load (30.8%).
  • Reduction. Notable in Kyber (25%) and Dilithium (21%) due to the use of Montgomery reduction.
From the results, we identified three major points that are highly relevant to the crux of our paper.
  • P1—Polynomial operations are commonly used, but data types differ. Three schemes—Dilithium, Kyber, and Falcon—commonly perform operations over polynomial data. Dilithium and Kyber operate over polynomials with coefficients in integer rings, requiring variants of Montgomery reduction [25] and the Number Theoretic Transform (NTT). Both functions are frequently used and represent significant hotspots in their execution profiles. Falcon also operates over polynomial data but uses polynomials with floating-point coefficients and performs Fast Fourier Transforms (FFTs) instead of NTTs, eliminating the need for modular reductions.
While the three schemes share the common trait of operating over polynomials, the difference in data types leads to distinct HW module designs, reducing efficiency. To address this and minimize the need for multiple types of modules, we used a modified variant of Falcon called Peregrine [12], which offers the same functionality but operates over integer rings. Although the modulo bases differ, this approach allows us to reuse the modules required for Montgomery reduction and the NTT, as illustrated in the sketch below.
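A minimal Python model of the Montgomery reduction that such a shared integer-ring datapath would implement is shown below; only the modulus q and the precomputed constant change between Dilithium and Kyber, which is exactly the reuse opportunity P1 points to. The parameterization is illustrative, not the RTL.

```python
def make_montgomery_reduce(q, r_bits):
    """Build a Montgomery reduction for modulus q with R = 2^r_bits."""
    R = 1 << r_bits
    q_inv = pow(q, -1, R)                     # q^{-1} mod R (q must be odd)

    def reduce(a):
        # Returns a * R^{-1} mod q for 0 <= a < q * R.
        t = (a * q_inv) % R
        return ((a - t * q) >> r_bits) % q    # (a - t*q) is exactly divisible by R

    return reduce

DILITHIUM_Q = 8380417                          # 2^23 - 2^13 + 1
KYBER_Q = 3329
red_dilithium = make_montgomery_reduce(DILITHIUM_Q, 32)
red_kyber = make_montgomery_reduce(KYBER_Q, 16)

# Montgomery multiplication: pre-scale one operand by R, then reduce the product.
a, b, R = 123456, 654321, 1 << 32
assert red_dilithium((a * R % DILITHIUM_Q) * b) == (a * b) % DILITHIUM_Q
```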
  • P2—Dissimilar proportions of Keccak. Keccak is another hotspot function present in all four schemes, but it accounts for a varying proportion of execution time: 43% of Dilithium's operations, 19% for Kyber, 3.7% for Falcon, and 99% for SPHINCS+. Since there is no clear preference for any specific scheme (no usage statistics are available indicating which scheme will be used more frequently, as PQC schemes have only recently been developed and are not yet widely deployed in industry), we assumed that all four schemes will be used equally.
Generally, functions with a high average proportion of usage across schemes should receive more HW resources. However, in practice, users are unlikely to run all schemes concurrently. For example, a user who requires a DSA would use one of the three options, not all three. The average proportion of Keccak is less than 50%. For a user who runs only SPHINCS+, more than half of the HW would be idle during computation, as Keccak is the primary function used. Conversely, for users running Falcon, dedicating about half of the HW area to Keccak is inefficient since it is rarely used. Based on these circumstances, we settled on an approach that is not solely optimized for Keccak's performance but is simplifiable and scalable to arbitrary area constraints. Additionally, we incorporated specific optimizations targeting Keccak to balance the overall efficiency.
  • P3—Distinct high-level operation sequences. Although the schemes share common functions, their high-level operation sequences differ significantly. This is due not only to the inherent differences in the algorithms but also to the varying polynomial lengths used by different parameter sets. For instance, using parallel butterfly modules to compute the NTT [26] requires different numbers of stages, and each stage needs a different number of cycles depending on the number of butterfly modules instantiated, as the sketch below illustrates.
As a result, we need to integrate multiple control flows for each of these possibilities in order to provide the full functionality of the schemes, which not only leads to a large Finite-State Machine (FSM) as a control unit but also results in different FSM sizes depending on the number of each type of module instantiated. Based on these observations, we acknowledge that designing an efficient FSM is also key to enhancing the efficiency of our targeted HW.
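The sketch below makes this variability concrete with a simple cycle-count model for an n-point NTT; it assumes one butterfly result per unit per cycle and no memory stalls, which are our simplifying assumptions for illustration, not measured figures.

```python
import math

def ntt_cycle_estimate(n, num_butterflies):
    """Rough cycle count for an n-point NTT with a given number of butterfly units.

    Assumes each unit retires one butterfly per cycle and ignores memory stalls;
    the point is only to show how stage and cycle counts shift with n and the
    amount of parallel hardware, not to predict real latency.
    """
    stages = int(math.log2(n))                 # log2(n) stages
    butterflies_per_stage = n // 2             # n/2 butterflies per stage
    cycles_per_stage = math.ceil(butterflies_per_stage / num_butterflies)
    return stages * cycles_per_stage

# Kyber and Dilithium use n = 256; Falcon-1024 uses n = 1024.
for n in (256, 1024):
    for bf_units in (2, 4, 8):
        print(f"n={n}, butterflies={bf_units}: ~{ntt_cycle_estimate(n, bf_units)} cycles")
```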

4.2. Proposed Design

Figure 3 shows our overall design. To support all four schemes with maximum area efficiency, we utilized the results from Section 4.1 to build each component of our accelerator.
Our design integrates several key components: a Keccak Acceleration Module (KAM) (Section 4.2.1), Joint Polynomial Arithmetic Unit (JPAU) (Section 4.2.2), and two sophisticated control units named main control unit (MCU) and Unified Polynomial Control Unit (UPCU) (Section 4.2.3).
Our accelerator functions as a coprocessor for the host processor. To flexibly support four PQC schemes, our MCU receives input data (data_in) along with the control signals of integer type: scheme, parameter, and op from the host. The scheme signal specifies the active PQC scheme, with 0 for Kyber, 1 for Dilithium, 2 for FALCON, and 3 for SPHINCS+. The parameter signal indicates the parameter sets currently in use, using integer values set according to the parameter-set order for each scheme as described in Table 2, Table 3, and Table 4. The op signal specifies the operation to be executed, with 0 for Keygen, 1 for Sign/Encapsulate, and 2 for Verify/Decapsulate. Once the requested operation is completed, our accelerator generates an interrupt done to the host and outputs the result (data_out).
Scalability was achieved by designing components that can adjust their resource usage based on available hardware area, while flexibility was ensured through a modular design that allows easy reconfiguration for different schemes. Each of these components is detailed in the following subsections.

4.2.1. Keccak Acceleration Module (KAM)

The first component is the Keccak Acceleration Module (KAM). All four schemes use Keccak operations to compute SHAKE hash values. For scalability, we present three KAM design choices:
1. KAM-Small: optimized for minimal area consumption, it uses 5 KALUs (Keccak ALUs) to compute each step of the Keccak permutation, taking 5 cycles per step.
2. KAM-Large: a mid-range solution balancing area and performance, it has 25 KALUs, allowing each Keccak permutation step to be computed in a single cycle.
3. KAM-FP: for maximum performance, this variant has a fully pipelined datapath that computes each round of the permutation in a single cycle.
Figure 4 represents the structure of the KAM-Small and KAM-Large variants of the KAM. Since the Keccak permutation requires XOR and bitwise operations over 1600 bits, implementing the entire datapath in the KAM could consume significant area and resources. To reduce area, both KAM-Small and KAM-Large use Keccak ALU (KALU) clusters. Each KALU consists of a barrel shifter, XOR, and AND gates to perform bitwise operations. The KAM-Small variant, optimized for minimal area consumption, has 5 KALUs, consuming 5 cycles for each step of the Keccak permutation by computing 5 entries per cycle. The KAM-Large variant, which offers a balance between area and performance, has 25 KALUs, consuming 1 cycle per step of the Keccak permutation. The KAM-FP variant, instead of using KALUs for sequential computation, has a fully pipelined datapath for maximum performance, computing one round of the permutation per cycle. A fully pipelined design means that different stages of the permutation process are handled simultaneously by different pipeline stages, allowing for continuous data processing and maximizing throughput. Figure 5 shows the structure of the KAM-FP variant, which can be used when surplus area is available.
Each KAM can perform three operations: absorb, squeeze, and permutation. The absorb operation handles Keccak's absorb stage, splitting the input and adding SHAKE padding to each input block. After adding padding to an input block, we XOR it into the Keccak state stored in the Keccak State Register File and then perform the permutation operation. This is repeated for each input block. The squeeze operation outputs results after permutation. An output buffer is placed at the output, which minimizes latency during the execution of sampling operations.
The permutation operation performs the Keccak-f1600 permutation, a critical part of the SHA-3 and SHAKE algorithms standardized by NIST. The Keccak-f1600 permutation involves 24 rounds of complex transformations, each requiring XOR computations over 1600 bits. Because of this complexity, the KAM-Small and KAM-Large variants compute the permutation sequentially, with latencies of 30 and 6 cycles per round, respectively. The fully pipelined variant, which consumes more area, can compute a round in each cycle, providing maximum performance; a simple latency model is sketched below.
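The latency model referenced above can be summarized in a few lines of Python; the per-round figures come from the description of the three variants, while the pipeline-fill term for KAM-FP is an illustrative assumption.

```python
KECCAK_ROUNDS = 24   # Keccak-f1600 rounds

def kam_permutation_cycles(variant, pipeline_fill=4):
    """Approximate cycles for one Keccak-f1600 permutation per KAM variant."""
    if variant == "small":      # 5 KALUs, 30 cycles per round
        return KECCAK_ROUNDS * 30
    if variant == "large":      # 25 KALUs, 6 cycles per round
        return KECCAK_ROUNDS * 6
    if variant == "fp":         # fully pipelined: one round per cycle once filled
        return KECCAK_ROUNDS + pipeline_fill   # pipeline_fill is illustrative
    raise ValueError(f"unknown KAM variant: {variant}")

for v in ("small", "large", "fp"):
    print(v, kam_permutation_cycles(v), "cycles per permutation")
```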

4.2.2. Joint Polynomial Arithmetic Unit (JPAU)

The second component is the Joint Polynomial Arithmetic Unit (JPAU). The JPAU serves as a generic ALU supporting operations of multiple schemes by breaking down each computation into basic operations such as addition, multiplication, and logical operations, as shown in Figure 6. Table 5 lists the JPAU opcodes used for each operation. Since reduction and NTT/INTT operations account for significant portions of execution time (up to 18% and 30.8%, respectively), a butterfly unit and a Montgomery reduction unit are also attached to the JPAU. These design choices were made to handle the most computationally intensive tasks more efficiently. The included adder can also be used to compute the CADDQ function, which is used in Dilithium and Kyber. CADDQ conditionally adds the modulus q to a coefficient, which helps manage polynomial coefficients during computations, ensuring that they stay within a specific range to maintain algorithm correctness and efficiency; a behavioral sketch follows.
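The sketch below models CADDQ; hardware and the reference software use a branch-free form based on the sign bit, which the plain conditional here mirrors for readability.

```python
def caddq(a, q):
    """Conditionally add q: map a centered coefficient in (-q, q) into [0, q).

    Hardware and the reference C code implement this branch-free (the sign bit
    of `a` masks q before the addition); the plain conditional below has the
    same behavior and keeps the sketch readable.
    """
    return a + q if a < 0 else a

Q = 8380417                      # Dilithium modulus
assert caddq(-5, Q) == Q - 5
assert caddq(7, Q) == 7
```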
When performing modular multiplication, a reduction must follow the multiplication. In this case, the upper 48 bits of the output port are used, with these values stored in a temporary register outside the JPAU and fed back to the JPAU for reduction. This design ensures accurate and efficient reduction operations, preventing overflow and maintaining consistency. Comparison operations can be performed by subtracting two data values, which is useful for condition checks such as rejection sampling or signature validation. The comparison result is output through a separate port.
Because each scheme uses different q values and coefficients for the NTT, a twiddle-factor ROM is also attached to the JPAU. This allows for flexible and accurate handling of the various polynomial transformations needed by the different algorithms. Since Kyber uses a 12-bit q value and Dilithium uses the largest q value at 23 bits, we followed the approach of [6], which extends the ALU's datapath to 24 bits and computes four coefficients instead of two when running the Kyber scheme. This significantly increases throughput and utilization for Kyber, optimizing the hardware for its specific requirements.
Each JPAU can perform coefficient-wise operations on two coefficients simultaneously, with each port receiving two coefficients from two different polynomials. Adding more JPAUs can further accelerate polynomial operations, enhancing overall computational efficiency. The JPAU is fully pipelined, maximizing throughput and minimizing latency by ensuring that multiple stages of computation can be processed concurrently without waiting for previous stages to complete. This pipelining is crucial for maintaining high performance across the supported cryptographic schemes.

4.2.3. Control Unit

The control unit is responsible for sending commands to JPAU and Keccak modules, as well as managing memory and MUX addresses. It is implemented as a large FSM with states for each scheme. Building a separate FSM for each scheme can result in a significant area overhead, due to the need to construct separate states for each of the four schemes. This can lead to a large state register and also delays in control signal paths.
To overcome this problem, we designed a Unified Polynomial Control Unit (UPCU) separate from the main control unit. Figure 7 shows the diagram of the main control unit and the UPCU. The main control unit handles the high-level control flow for each scheme, including initializing operations and managing the overall sequence of tasks. For instance, in the Dilithium_sign operation, the control unit starts by initializing and performing the SHAKE256 operation, then moves to Keccak operations and matrix expansion. Similarly, for Falcon_sign, it handles random sampling and then proceeds to polynomial multiplication.
When a JPAU operation is needed, instead of the main control unit sending all JPAU opcodes and MUX control signals, it sends a predefined polynomial function code to the UPCU. The UPCU then takes the function code along with information such as the scheme and security level and starts issuing the appropriate JPAU opcodes and SRAM memory addresses. The UPCU adjusts parameters such as N for each scheme and security level, eliminating the need to create a separate control sequence for each scheme. This delegation of detailed polynomial control to the UPCU minimizes the FSM complexity in the main control unit. This design ensures that the control logic is streamlined and efficient, capable of handling various polynomial operations without excessive state overhead. The detailed operation of the UPCU can be summarized as follows (see the sketch after this list):
  • Sample_polynomial. The UPCU initiates and manages the polynomial sampling process. This includes setting up necessary registers and handling data flow for efficient sampling.
  • Polynomial_multiplication. The UPCU controls the sequence of multiplication and accumulation operations, coordinating data flow and setting up operands for the computation.
  • NTT_INTT. The UPCU manages the NTT and INTT operations, controlling the butterfly units and Montgomery reduction units. It ensures efficient operations by adjusting control signals and managing data flow through various stages, utilizing the Twiddle factor ROM for different schemes.
By implementing these processes within the UPCU separately, the complexity of the overall FSM is significantly reduced, leading to higher area efficiency. This approach allows the control unit to handle the operations of all four PQC schemes without incurring a large area overhead, thus enhancing the overall performance and efficiency of the hardware design.
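The division of labor between the main control unit and the UPCU can be modeled as below; the function codes, opcodes, and addresses are illustrative placeholders rather than the actual encodings used in our design.

```python
# Polynomial length per scheme; parameter sets within a scheme reuse the same n.
POLY_N = {"kyber": 256, "dilithium": 256, "falcon": 1024}

def upcu_expand(func_code, scheme, addr_a, addr_b, addr_dst):
    """Expand a coarse function code into a stream of (opcode, addresses) micro-ops.

    The main control unit only supplies `func_code` and the scheme; the UPCU
    selects n and walks the SRAM addresses, which is the behavior modeled here.
    """
    n = POLY_N[scheme]
    if func_code == "poly_add":              # coefficient-wise addition
        for i in range(n):
            yield ("ADD", addr_a + i, addr_b + i, addr_dst + i)
    elif func_code == "poly_pointwise_mul":  # pointwise multiply followed by reduce
        for i in range(n):
            yield ("MULRED", addr_a + i, addr_b + i, addr_dst + i)
    else:
        raise ValueError(f"unknown function code: {func_code}")

# The same function code yields streams of different lengths per scheme.
ops_kyber = list(upcu_expand("poly_add", "kyber", 0x000, 0x100, 0x200))
ops_falcon = list(upcu_expand("poly_add", "falcon", 0x000, 0x400, 0x800))
assert len(ops_kyber) == 256 and len(ops_falcon) == 1024
```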

5. Implementation

We synthesized our design using Design Compiler N-2017.09-SP2 [27] with the 15 nm OpenCell library [28]. We used kGE as a metric to ensure a fair comparison across different silicon processes, as it normalizes the differences in technology nodes. This standardization allowed us to compare designs more effectively, regardless of the specific fabrication technology used.
Our target kGE (kilo gate equivalent) was set to 600 kGE, which approximates the sum of kGE values from various reported works, as there is no single reference implementation supporting all four PQC finalist schemes. This target was determined by combining the most efficient individual implementations for each scheme to obtain the minimum possible kGE sum. Specifically, we considered Gupta et al. [19]'s 157 kGE for Dilithium, Bisheh-Nisar et al. [23]'s 93 kGE for Kyber, Lee et al. [18]'s 98.729 kGE for Falcon verification, Soni et al. [29]'s 181.120 kGE for Falcon-1024 signing, and Wagner et al. [20]'s 84 kGE for SPHINCS+. Adding these values results in a total of 613.849 kGE (see the check below). However, we conservatively set our target to 600 kGE to provide a more challenging and aggressive goal, ensuring a more efficient and streamlined design. It is worth noting that no implementation of Falcon's key generation was available, so this function's kGE was not included in our target. Additionally, while Aikata et al. [6] implemented both Dilithium and Kyber, their reported 747 kGE was deemed too high. Thus, we opted for the combination of the separate implementations by Gupta et al. [19] for Dilithium and Bisheh-Nisar et al. [23] for Kyber, as this resulted in a lower total kGE target. This approach allowed us to set a realistic and competitive target for our unified implementation of all four PQC finalist schemes.
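The 613.849 kGE figure is simply the sum of the cited per-scheme areas, as the short check below confirms; the dictionary labels only identify the sources.

```python
# Per-scheme kGE figures cited above; their sum reproduces the 613.849 kGE total
# from which the 600 kGE target was conservatively derived.
reported_kge = {
    "Dilithium (Gupta et al. [19])": 157.000,
    "Kyber (Bisheh-Nisar et al. [23])": 93.000,
    "Falcon verify (Lee et al. [18])": 98.729,
    "Falcon-1024 sign (Soni et al. [29])": 181.120,
    "SPHINCS+ (Wagner et al. [20])": 84.000,
}
total = sum(reported_kge.values())
print(f"{total:.3f} kGE")   # 613.849 kGE
```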
We first built a small baseline design, Ours_Baseline, with 4 JPAUs and the KAM-Small variant, and extended the design by replacing KAM-Small with KAM-Large to build the Ours_S variant. Our first priority in scaling up was the JPAU cluster, which generally affects every scheme's performance. After scaling up the JPAU cluster from 4 to 8 units, we built the Ours_M variant; we then checked whether any margin was left. If spare resources remained, we could allocate them to a faster KAM module. By switching to the KAM-FP, we built the Ours_L variant, which resulted in 611.389 kGE, satisfying our target. Table 6 shows the synthesis results of our proposed design compared with other designs.

6. Evaluation

We compared the performance of our design variants presented in Section 5, namely Ours_S, Ours_M, and Ours_L with prior works.
To compare with other works, we used the figure of merit (FoM) defined in [30], shown below:
FoM = Throughput/Area = Throughput/kGE
This metric captures the area efficiency of a design. We also used the kGE count instead of absolute area for a fair comparison with accelerators built in other technologies.
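As a worked example of the metric, the snippet below computes the FoM for two hypothetical design points; the numbers are placeholders and do not correspond to any entry in our tables.

```python
def fom(throughput, area_kge):
    """Figure of merit from [30]: throughput divided by area in kGE."""
    return throughput / area_kge

# Placeholder design points: a smaller, slower design can still win on FoM.
print(fom(10_000, 250.0))   # 40.0 (operations per second per kGE)
print(fom(18_000, 600.0))   # 30.0
```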
The parameters used for the evaluation of each PQC scheme are listed in Table 2 and Table 4. In cases where no prior HW implementation exists for certain operations (e.g., key generation for Falcon and SPHINCS+), we used the performance of CPU implementations with AVX extensions as a baseline for comparison. These CPU cycle counts, reported in the NIST reference submissions, are converted to throughput numbers to provide a point of reference. For ease of comparison and presentation, we present the performance results for Dilithium and Kyber in a single subsection, as they were both implemented in the work by Aikata et al. [6]. The results for Falcon and SPHINCS+ are discussed in separate subsections due to their unique implementation characteristics and the lack of comprehensive HW implementations covering all operations. This approach allows us to provide a comprehensive comparison across all schemes and operations, even in cases where direct hardware implementation comparisons are not available. It also enables us to highlight the advantages of our unified design across different PQC algorithms.

6.1. Dilithium and Kyber

For Dilithium and Kyber, we compared our design with current state-of-the-art ASIC accelerators that support more than two different parameter sets, namely Aikata et al. [14], Aikata et al. [6], and the state-of-the-art Dilithium ASIC accelerator by Gupta et al. [19]. Table 7 shows the normalized throughput of our variants on Dilithium and Kyber compared to the other accelerators. The throughput is calculated in the same manner as the benchmarks in the NIST submission package, which perform Keygen, Sign/Encapsulate, and Verify/Decapsulate for each security parameter.
For Dilithium, due to the significant load of matrix generation, changing the Keccak module to the KAM-FP variant can improve performance by up to 8%. Compared with Aikata et al. [14], who accelerated both the Dilithium and Saber schemes, our Ours_M and Ours_L variants achieved 3.11× and 3.69× speedups on Keygen, 1.73× and 1.81× speedups on Sign, and 2.18× and 2.38× speedups on Verify, on average, with 2.65× and 2.77× larger kGE counts and 1.44× and 1.07× larger FoM on average, respectively. The Ours_S variant also achieved speedups of 1.76× and 1.17× on average on Keygen and Verify, while having 0.87× lower throughput on Sign, a 1.43× larger kGE count, and a 0.76× lower FoM on average.
Compared to Aikata et al. [6], the current state-of-the-art implementation accelerating both the Dilithium and Kyber schemes, our Ours_M and Ours_L variants achieved average throughput increases of 1.51× and 1.27× in Keygen, respectively. In Sign, our variants had slightly lower throughputs of 0.75× and 0.71×, and, in Verify, throughputs of 0.89× and 0.97×, on average, compared to Aikata et al. [6]. However, the Ours_M and Ours_L variants had significantly lower kGE counts, which were 0.27× and 22× lower than those of Aikata et al. [6], resulting in 2.21× and 1.65× higher FoM averaged over Dilithium. The Ours_S variant had lower throughputs of 0.72×, 0.37×, and 0.48× on average on Keygen, Sign, and Verify, respectively, while having a 2.63× smaller kGE count and a 1.17× larger FoM.
For Kyber, each JPAU in our design can operate on four coefficients simultaneously, significantly reducing the time spent in the JPAU and increasing the proportion of time spent in the KAM. Replacing the KAM-Large with the KAM-FP when transitioning from the Ours_M to the Ours_L variant therefore significantly increased throughput for Kyber. The Ours_S, Ours_M, and Ours_L variants showed 0.72×, 0.29×, and 0.62× lower throughput in Keygen on average. In Encapsulate, the three variants showed differences of 0.38×, 0.42×, and 1.05×, and, in Decapsulate, 0.47×, 0.53×, and 1.24×, respectively, compared to Aikata et al. [6]. This is because our design focused on breaking down each function to maximize shared functionality, whereas Aikata et al. [6] focused on accelerating Kyber by consuming as many hardware resources as possible.

6.2. Falcon

Table 8 shows our speedup factors on Falcon [12] compared with other Falcon accelerators implemented as ASICs. Since it is difficult to find works that have implemented the full scheme in an ASIC, we compared each functionality (namely Keygen, Sign, and Verify) with existing accelerators, including the SW implementation on a CPU with AVX extensions reported in [9]. Compared to the CPU with AVX extensions, our Ours_S, Ours_M, and Ours_L variants achieved speedups of 8.06×, 14.76×, and 14.76× on Keygen, 4.03×, 6.49×, and 6.49× on Sign, and 7.87×, 15.66×, and 15.76× on Verify, respectively. Since the Keccak operation accounts for about 0.1% of total Falcon operations, using the KAM-FP has almost no effect on the overall performance. Our accelerator outperformed prior ASIC implementations with all three variants, with more than 100× speedups and more than 50× higher FoM.

6.3. SPHINCS+

For SPHINCS+, since the current state-of-the-art design is on an FPGA, we compared our designs with both FPGA-based work and existing ASIC implementations in Table 9. When attempting to obtain a FoM comparable with other works, direct comparisons between existing SPHINCS+ implementations become complicated due to the significant differences between FPGA and ASIC technologies. The metrics used to evaluate FPGA area (such as the number of LUTs or FFs) do not directly translate to ASIC metrics like kGE, making it challenging to establish a consistent basis for comparison. We also included a comparison with the CPU performance with AVX extensions reported in [10] as the baseline for Keygen, since no prior work has targeted the Keygen of SPHINCS+. SPHINCS+'s performance is highly dependent on the throughput of the KAM. Due to the heavy use of SHAKE256 hash functions for tree hashing, utilizing KAM-Large and KAM-FP dramatically increases overall throughput.
Since there are many parameter sets in SPHINCS+, we selected the most time-consuming ones, 256s-simple and 256s-robust, which are also implemented in [20]. The parameters are listed in Table 4. It should be noted that the robust parameter set adds an extra layer of SHAKE to generate bitmasks and XORs the bitmask with the input when hashing each input. The Ours_S and Ours_M variants outperform the CPU with a speedup of 33.3× in signing. The Ours_L variant, which has the KAM-FP, further accelerates signing by up to 99×. Compared with the current state-of-the-art FPGA implementation in [31], our design has 1.2× and 3.6× higher throughput for the {Ours_S, Ours_M} and Ours_L variants, respectively.

6.4. Power Consumption

Table 10 presents the power consumption of our accelerator and the energy used for each scheme compared with other accelerators. For a fair comparison, we used the Cadence Genus tool for power analysis, following the method used in Aikata et al. [6]. Power consumption represents the rate of energy use, typically measured in watts, and is obtained from the Genus tool. Energy consumption, on the other hand, is calculated by multiplying the power consumption by the execution time, providing a measure of the total energy used for a specific operation or set of operations. Our design achieved significantly lower power consumption than Aikata et al. [6], whose larger design yields higher performance. Specifically, our design consumed 32.73×, 20.93×, and 20.34× less power for the Ours_S, Ours_M, and Ours_L variants, respectively. For Dilithium, in terms of the energy used for Sign and Verify combined, our variants Ours_S, Ours_M, and Ours_L were 13.45×, 16.30×, and 16.81× more efficient than Aikata et al. [6]. Furthermore, for Kyber, with the Encapsulate and Decapsulate operations combined, our variants Ours_S, Ours_M, and Ours_L were 15.11×, 10.58×, and 24.67× more efficient than Aikata et al. [6].
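The energy figures follow directly from this definition (power multiplied by execution time); the helper below only makes the unit handling explicit, and the example values are placeholders.

```python
def energy_uj(power_mw, exec_time_ms):
    """Energy in microjoules: power (mW) multiplied by time (ms) gives uJ."""
    return power_mw * exec_time_ms

# Placeholder values: a lower-power design can still use more energy if it is slow.
print(energy_uj(power_mw=5.0, exec_time_ms=2.0))    # 10.0 uJ
print(energy_uj(power_mw=50.0, exec_time_ms=0.1))   # 5.0 uJ
```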
For Falcon, we compared our design with Lee et al. [18], focusing on the energy used during Verify operations. Due to our larger design with higher gate counts, our variants consumed more power: 1.89×, 2.97×, and 3.05× for Ours_S, Ours_M, and Ours_L, respectively, but consumed 276× less energy when performing Verify. By utilizing modular arithmetic instead of complex double-precision floating-point operations [12], we achieved significantly higher energy efficiency: 206.51×, 263.17×, and 256.82× for our three variants.
Regarding SPHINCS+, we compared our design with Amiet et al. [31], which reported power consumption for an FPGA accelerator platform. We chose this FPGA implementation for comparison due to the absence of other ASIC implementations in the literature, despite the platform difference. This state-of-the-art FPGA design served as our benchmark. It is important to note that, while this comparison provides valuable insights, the fundamental differences between ASIC and FPGA platforms should be considered when interpreting the following results. Our design variants Ours_S, Ours_M, and Ours_L consumed 1,087,931×, 695,955×, and 1,992,631× less energy, respectively, on the SPHINCS+ 256s-simple parameter set. This substantial difference in power consumption can be largely attributed to the inherent differences between our PQC-specific ASIC implementation and the more general-purpose nature of FPGA platforms. Additionally, our design achieved up to 3.6× shorter processing time on SPHINCS+, further contributing to this significant energy efficiency gap.

7. Discussions

7.1. Architectural Differences from Prior Works

Our design is unique in that it supports all four PQC finalist schemes by breaking down each scheme’s operations into fundamental components and building a generalized ALU capable of handling these operations across different schemes. As discussed in Section 3, unlike prior works that implemented accelerators for only one or two schemes, we focused on creating a flexible and efficient hardware design from scratch, capable of performing fast computations for all four schemes. To support four different schemes, we analyzed the detailed operations of each algorithm and constructed a generalized ALU that can be utilized across multiple schemes. A similar approach can be seen in the work by [16], which implemented a coprocessor based on RISC-V architecture for widely used tasks such as NTT. However, our design takes a processor-like approach that is more customized for PQC needs, as opposed to a general CPU pipeline structure. Our design operates under a specialized control unit tailored for PQC, allowing it to efficiently support all four schemes with a structure specifically optimized for these cryptographic tasks, rather than a general-purpose architecture. This approach ensures that our hardware is not just a general solution, but a highly specialized one for the unique requirements of PQC.

7.2. Security and Reliability

Recent research has demonstrated side-channel and fault-injection vulnerabilities in existing cryptosystems [32,33], with similar vulnerabilities reported for NIST-selected PQC schemes in hardware accelerator implementations [31,34]. While our research primarily focused on developing an efficient hardware implementation without specific protection methods currently applied, we considered various security solutions as orthogonal work that could be integrated into our design. For instance, to counter fault injection attacks that create glitches by adjusting power supply voltage [35], our design architecture allows for the potential duplication of parts like JPAU and KAM, enabling result comparison for enhanced defense [31]. Masking techniques to randomize intermediate values and prevent secret leakage [36] could also be incorporated, although this presents challenges for lattice-based schemes due to their complex operations and rejection sampling. Existing works proposed efficient masking techniques for these schemes [37,38] that could be also applied to our design. Our flexible architecture also allows for the potential implementation of additional masking operations during function computations, albeit at the cost of extra clock cycles. Alternatively, trusted execution like TrustZone may be utilized to provide a more secure environment, but it suffers from some vulnerability issues, as witnessed in [39,40]. Our design was implemented as an isolated IP providing only fixed interfaces specific to PQC operations. While TrustZone aims to separate general execution environments, our approach focused exclusively on cryptographic operations with a limited set of dedicated interfaces. This specialization potentially results in a smaller attack surface compared to more general-purpose secure execution environments, which may offer additional security benefits in the context of post-quantum cryptographic operations. However, a comprehensive analysis of this security aspect is beyond the scope of this paper and remains an important area for future research.

7.3. Limitations and Future Works

Our design, while supporting multiple PQC schemes, has several limitations. Firstly, the absence of optimal performance for a single scheme is a limitation, as our accelerator primarily focuses on optimizations for multiple algorithms, making it challenging to achieve the best performance for any one scheme. Secondly, the performance ceiling of the Keccak Acceleration Module (KAM) is a concern. Due to the limited scalability of the Keccak algorithm, while the number of Joint Polynomial Arithmetic Units (JPAUs) can be scaled extensively, the KAM cannot be similarly scaled, leading to a performance bottleneck. Additionally, our design relies heavily on the Peregrine algorithm modifications to Falcon to avoid using double-precision floating-point operations. However, these modifications have not been extensively tested over a long period. Despite these limitations, experiments show that our design achieves throughput comparable to or sometimes better than other single-scheme accelerators, making it a viable option. In future works, we plan to support the original Falcon algorithm by integrating double-precision floating-point operations into our JPAUs and improve scalability by utilizing multiple KAM modules. We also aim to explore advanced optimization techniques to further enhance the performance for individual schemes and to develop more robust testing methodologies to ensure the reliability of our design over time. Furthermore, integrating adaptive algorithms to dynamically allocate resources based on workload could provide better performance balance across different PQC schemes.

8. Conclusions

The advent of quantum computing poses a significant threat to classical cryptosystems, creating a need for post-quantum cryptography (PQC). In response, NIST announced four schemes for standardization, each with its own advantages and disadvantages, such as Falcon's short signature length. Prior works have primarily focused on accelerating individual schemes, making integration with other schemes challenging. We presented a scalable accelerator designed to support all four NIST-selected algorithms, incorporating a modified version of Falcon. Through function-level profiling, we identified common operations shared between the schemes and designed each component of our accelerator accordingly. Our design achieved nearly the same throughput as state-of-the-art multi-scheme accelerators with a small area overhead on Dilithium and Kyber, and also achieved significant speedups compared with other single-scheme accelerators on Falcon and SPHINCS+. Overall, our design provides a general speedup across all four NIST-selected schemes, demonstrating its effectiveness and versatility in addressing the challenges posed by quantum computing to cryptographic systems.

Author Contributions

Conceptualization and methodology H.O.; software, investigation, data curation, and writing—original draft, H.J.; All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2022-00166529), Korea Planning & Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea Government (MOTIE) (No. RS-2024-00406121, Development of an Automotive Security Vulnerability-based Threat Analysis System (R&D), National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00277326), Institute of Information & communications Technology Planning & Evaluation (IITP) under the artificial intelligence semiconductor support program to nurture the best talents (IITP-2023-RS-2023-00256081) grant funded by the Korea government (MSIT), Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2023-00277060, Development of open edge AI SoC hardware and software platform, 0.1) and Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (MOTIE) (No.RS-2023-00277060, Development of open edge AI SoC hardware and software platform, 0.1), MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-2020-0-01602) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), Inter-University Semiconductor Research Center (ISRC), BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2024. The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

Data Availability Statement

The data used for the experimental comparisons in this study (i.e., the comparison figures reported for other designs) can be found in the corresponding related research papers. Our hardware implementation code is protected under the proprietary rights of the funding project’s institution and therefore cannot be made publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this paper.
AVX	Advanced Vector eXtension
ASIC	Application-Specific Integrated Circuit
ALU	Arithmetic and Logical Unit
DSE	Design Space Exploration
DSA	Digital Signature Algorithm
XMSS	eXtended Merkle Signature Scheme
FFT	Fast Fourier Transform
FPGA	Field-Programmable Gate Array
FSM	Finite-State Machine
GE	Gate Equivalent
GPV	Gentry–Peikert–Vaikuntanathan
HW	Hardware
IoT	Internet of Things
INTT	Inverse Number Theoretic Transform
JPAU	Joint Polynomial Arithmetic Unit
KALU	Keccak ALU
KAM	Keccak Acceleration Module
KEA	Key Exchange Algorithm
NIST	National Institute of Standards and Technology
NTRU	Number Theory Research Unit
NTT	Number Theoretic Transform
PQC	Post-Quantum Cryptosystem
pk	Public Key
sk	Secret Key
SW	Software
UPCU	Unified Polynomial Control Unit
WOTS	Winternitz One-Time Signature

References

  1. Carracedo, J.M.; Milliken, M.; Chouhan, P.K.; Scotney, B.; Lin, Z.; Sajjad, A.; Shackleton, M. Cryptography for Security in IoT. In Proceedings of the 2018 Fifth International Conference on Internet of Things: Systems, Management and Security, Valencia, Spain, 15–18 October 2018; pp. 23–30. [Google Scholar] [CrossRef]
  2. Katzenbeisser, S.; Polian, I.; Regazzoni, F.; Stöttinger, M. Security in Autonomous Systems. In Proceedings of the 2019 IEEE European Test Symposium (ETS), Baden-Baden, Germany, 27–31 May 2019; pp. 1–8. [Google Scholar] [CrossRef]
  3. Muzikant, P.; Willemson, J. Deploying Post-quantum Algorithms in Existing Applications and Embedded Devices. In Proceedings of the Ubiquitous Security; Wang, G., Wang, H., Min, G., Georgalas, N., Meng, W., Eds.; Springer: Singapore, 2024; pp. 147–162. [Google Scholar]
  4. Kim, D.; Choi, H.; Seo, S.C. Parallel Implementation of SPHINCS+ With GPUs. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 2810–2823. [Google Scholar] [CrossRef]
  5. Shor, P.W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 1999, 41, 303–332. [Google Scholar] [CrossRef]
  6. Aikata, A.; Mert, A.C.; Imran, M.; Pagliarini, S.; Roy, S.S. KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 747–758. [Google Scholar] [CrossRef]
  7. Avanzi, R.; Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber: Algorithm Specifications and Supporting Documentation, Submission to the NIST Post-Quantum Project. 2021. Available online: https://pq-crystals.org/kyber/data/kyber-specification-round3-20210131.pdf (accessed on 7 August 2024).
  8. Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Dilithium: Algorithm Specifications and Supporting Documentation, Submission to the NIST Post-Quantum Project. 2021. Available online: https://pq-crystals.org/dilithium/data/dilithium-specification-round3-20210208.pdf (accessed on 7 August 2024).
  9. Fouque, P.A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Pornin, T.; Prest, T.; Ricosset, T.; Seiler, G.; Whyte, W.; Zhang, Z. Falcon: Fast-Fourier Lattice-Based Compact Signatures over NTRU, Specification v1.2. 2020. Available online: https://falcon-sign.info/falcon.pdf (accessed on 7 August 2024).
  10. Aumasson, J.P.; Bernstein, D.J.; Beullens, W.; Dobraunig, C.; Eichlseder, M.; Fluhrer, S.; Gazdag, S.L.; Hülsing, A.; Kampanakis, P.; Kölbl, S.; et al. SPHINCS+ Specification. Submission to the NIST Post-Quantum Project. 2020. Available online: https://sphincs.org/data/sphincs+-r3.1-specification.pdf (accessed on 7 August 2024).
  11. NIST. Selected Algorithms 2022, July 2022. Available online: https://csrc.nist.gov/projects/post-quantum-cryptography/selected-algorithms-2022 (accessed on 7 August 2024).
  12. Seo, E.Y.; Kim, Y.S.; Lee, J.W.; No, J.S. Peregrine: Toward Fastest FALCON Based on GPV Framework. Cryptology ePrint Archive. 2022. Available online: https://eprint.iacr.org/2022/1495 (accessed on 7 August 2024).
  13. Bernstein, D.J.; Hopwood, D.; Hülsing, A.; Lange, T.; Niederhagen, R.; Papachristodoulou, L.; Schneider, M.; Schwabe, P.; Wilcox-O’Hearn, Z. SPHINCS: Practical Stateless Hash-Based Signatures. In Proceedings of the Advances in Cryptology–EUROCRYPT 2015; Oswald, E., Fischlin, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 368–397. [Google Scholar]
  14. Aikata, A.; Mert, A.C.; Jacquemin, D.; Das, A.; Matthews, D.; Ghosh, S.; Roy, S.S. A Unified Cryptoprocessor for Lattice-Based Signature and Key-Exchange. IEEE Trans. Comput. 2023, 72, 1568–1580. [Google Scholar] [CrossRef]
  15. Basso, A.; Bermudo Mera, J.M.; D’Anvers, J.P.; Karmakar, A.; Sinha Roy, S.; Van Beirendonck, M.; Vercauteren, F. SABER: Mod-LWR Based KEM (Round 3 Submission) SABER Submission Package for Round 3. 2017. Available online: https://www.esat.kuleuven.be/cosic/pqcrypto/saber/files/saberspecround3.pdf (accessed on 7 August 2024).
  16. Lee, J.; Kim, W.; Kim, J.H. A Programmable Crypto-Processor for National Institute of Standards and Technology Post-Quantum Cryptography Standardization Based on the RISC-V Architecture. Sensors 2023, 23, 9408. [Google Scholar] [CrossRef]
  17. Nguyen, T.H.; Kieu-Do-Nguyen, B.; Pham, C.K.; Hoang, T.T. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium. IEEE Access 2024, 12, 34918–34930. [Google Scholar] [CrossRef]
  18. Lee, Y.; Youn, J.; Nam, K.; Jung, H.H.; Cho, M.; Na, J.; Park, J.Y.; Jeon, S.; Kang, B.G.; Oh, H.; et al. An Efficient Hardware/Software Co-Design for FALCON on Low-End Embedded Systems. IEEE Access 2024, 12, 57947–57958. [Google Scholar] [CrossRef]
  19. Gupta, N.; Jati, A.; Chattopadhyay, A.; Jha, G. Lightweight Hardware Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 3234–3243. [Google Scholar] [CrossRef]
  20. Wagner, A.; Oberhansl, F.; Schink, M. To Be, or Not to Be Stateful: Post-Quantum Secure Boot using Hash-Based Signatures. In Proceedings of the 2022 Workshop on Attacks and Solutions in Hardware Security, Los Angeles, CA, USA, 11 November 2022; ASHES’22. pp. 85–94. [Google Scholar] [CrossRef]
  21. Mandal, S.; Roy, D.B. KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA. In Proceedings of the 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), Kolkata, India, 6–10 January 2024; pp. 455–460. [Google Scholar] [CrossRef]
  22. Beckwith, L.; Nguyen, D.T.; Gaj, K. Hardware Accelerators for Digital Signature Algorithms Dilithium and FALCON. IEEE Des. Test 2023, 1. [Google Scholar] [CrossRef]
  23. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. A Monolithic Hardware Implementation of Kyber: Comparing Apples to Apples in PQC Candidates. In Progress in Cryptology–LATINCRYPT 2021, Proceedings of the 7th International Conference on Cryptology and Information Security in Latin America, Bogotá, Colombia, 6–8 October 2021; Longa, P., Ràfols, C., Eds.; Springer: Cham, Switzerland, 2021; pp. 108–126. [Google Scholar]
  24. Intel Inc. Intel VTune Profiler. 2023. Available online: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html (accessed on 7 August 2024).
  25. Montgomery, P.L. Modular multiplication without trial division. Math. Comput. 1985, 44, 519–521. [Google Scholar] [CrossRef]
  26. Richard, T.; Chao, L.; Myoung, A. Algorithms for Discrete Fourier Transform and Convolution; Springer: Cham, Switzerland, 2021. [Google Scholar] [CrossRef]
  27. SYNOPSYS Inc. Synopsys Design Compiler. Available online: https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/dc-ultra.html (accessed on 7 August 2024).
  28. Martins, M.; Matos, J.M.; Ribas, R.P.; Reis, A.; Schlinker, G.; Rech, L.; Michelsen, J. Open Cell Library in 15nm FreePDK Technology. In Proceedings of the 2015 Symposium on International Symposium on Physical Design, Monterey, CA, USA, 29 March–1 April 2015; ISPD ’15. pp. 171–178. [Google Scholar] [CrossRef]
  29. Soni, D.; Basu, K.; Nabeel, M.; Aaraj, N.; Manzano, M.; Karri, R. Hardware Architectures for Post-Quantum Digital Signature Schemes; Springer: Cham, Switzerland, 2021. [Google Scholar] [CrossRef]
  30. Alharbi, A.R.; Hazzazi, M.M.; Jamal, S.S.; Aljaedi, A.; Aljuhni, A.; Alanazi, D.J. DCryp-Unit: Crypto Hardware Accelerator Unit Design for Elliptic Curve Point Multiplication. IEEE Access 2024, 12, 17823–17835. [Google Scholar] [CrossRef]
  31. Amiet, D.; Leuenberger, L.; Curiger, A.; Zbinden, P. FPGA-based SPHINCS+ Implementations: Mind the Glitch. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD), Kranj, Slovenia, 26–28 August 2020; pp. 229–237. [Google Scholar] [CrossRef]
  32. Kocher, P.C. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In Proceedings of the Advances in Cryptology—CRYPTO ’96; Koblitz, N., Ed.; Springer: Berlin/Heidelberg, Germany, 1996; pp. 104–113. [Google Scholar]
  33. Bogdanov, A. Improved Side-Channel Collision Attacks on AES. In Proceedings of the Selected Areas in Cryptography; Adams, C., Miri, A., Wiener, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 84–95. [Google Scholar]
  34. Ji, Y.; Wang, R.; Ngo, K.; Dubrova, E.; Backlund, L. A Side-Channel Attack on a Hardware Implementation of CRYSTALS-Kyber. In Proceedings of the 2023 IEEE European Test Symposium (ETS), Venezia, Italy, 22–26 May 2023; pp. 1–5. [Google Scholar] [CrossRef]
  35. Xagawa, K.; Ito, A.; Ueno, R.; Takahashi, J.; Homma, N. Fault-Injection Attacks Against NIST’s Post-Quantum Cryptography Round 3 KEM Candidates. In Proceedings of the Advances in Cryptology–ASIACRYPT 2021; Tibouchi, M., Wang, H., Eds.; Springer: Cham, Switzerland, 2021; pp. 33–61. [Google Scholar]
  36. Zhao, Y.; Pan, S.; Ma, H.; Gao, Y.; Song, X.; He, J.; Jin, Y. Side Channel Security Oriented Evaluation and Protection on Hardware Implementations of Kyber. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 5025–5035. [Google Scholar] [CrossRef]
  37. Bos, J.W.; Gourjon, M.O.; Renes, J.; Schneider, T.; Vredendaal, C.V. Masking Kyber: First- and higher-order implementations. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 173–214. [Google Scholar] [CrossRef]
  38. Migliore, V.; Gérard, B.; Tibouchi, M.; Fouque, P.A. Masking Dilithium. In Proceedings of the Applied Cryptography and Network Security; Deng, R.H., Gauthier-Umaña, V., Ochoa, M., Yung, M., Eds.; Springer: Cham, Switzerland, 2019; pp. 344–362. [Google Scholar]
  39. Cerdeira, D.; Martins, J.; Santos, N.; Pinto, S. ReZone: Disarming TrustZone with TEE Privilege Reduction. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 2261–2279. [Google Scholar]
  40. Ryan, K. Hardware-Backed Heist: Extracting ECDSA Keys from Qualcomm’s TrustZone. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; CCS ’19. pp. 181–194. [Google Scholar] [CrossRef]
Figure 1. General process of KEA and DSA.
Figure 2. Breakdown of the number of executed operations in each scheme; note that the NTT portion of Falcon is FFT, since the measurements were taken on the NIST submission package for a fair comparison.
Figure 3. Overall design of our proposed accelerator, with scalable parts in red and data lines in bold arrows.
Figure 4. Diagram of the KAM-Small and KAM-Large variants. Note that the number of KALUs in the Keccak ALU cluster can vary depending on the variant.
Figure 5. Diagram of the KAM-FP variant performing a single stage of the Keccak permutation.
Figure 6. Diagram of the JPAU.
Figure 7. Diagram of the control unit with the UPCU.
Table 1. Selected finalist PQC schemes.
 | Kyber | Dilithium | FALCON | SPHINCS+
Algorithm Type | Key Exchange (KEA) | Digital Signature (DSA) | Digital Signature (DSA) | Digital Signature (DSA)
Based Approach | Lattice-Based | Lattice-Based | Lattice-Based | Hash-Based
Table 2. Parameters used in Dilithium and Kyber.
 | Dilithium2 | Dilithium3 | Dilithium5
q | 8,380,417 | 8,380,417 | 8,380,417
N | 256 | 256 | 256
(k, l) | (4, 4) | (6, 5) | (8, 7)
γ1 | 2^17 | 2^19 | 2^19
γ2 | (q − 1)/88 | (q − 1)/32 | (q − 1)/32
 | Kyber512 | Kyber768 | Kyber1024
q | 3329 | 3329 | 3329
N | 256 | 256 | 256
k | 2 | 3 | 4
η1 | 3 | 2 | 2
η2 | 2 | 2 | 2
(du, dv) | (10, 4) | (10, 4) | (11, 5)
δ | 2^-139 | 2^-164 | 2^-174
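The moduli q listed in Table 2 are handled in Montgomery form inside the JPAU, whose REDUCE opcode (Table 5) performs Montgomery reduction [25]. As an illustration only, the following minimal C sketch shows the reduction for Kyber’s q = 3329 with R = 2^16; the constant QINV and the overall structure follow the public Kyber reference implementation, and the sketch does not describe the actual JPAU datapath.

```c
#include <stdio.h>
#include <stdint.h>

#define Q     3329        /* Kyber modulus (Table 2) */
#define QINV  62209u      /* Q^(-1) mod 2^16 */

/* Returns a * 2^(-16) mod Q in the range (-Q, Q), for |a| < Q * 2^15. */
static int16_t montgomery_reduce(int32_t a)
{
    /* m = a * Q^(-1) mod 2^16, interpreted as a signed 16-bit value */
    int16_t m = (int16_t)(uint16_t)((uint32_t)a * QINV);
    /* a - m*Q is a multiple of 2^16, so the shift is an exact division */
    return (int16_t)((a - (int32_t)m * Q) >> 16);
}

int main(void)
{
    /* Sanity check: the Montgomery result carries an extra factor of
     * 2^(-16) mod Q, so r * 2^16 mod Q must equal the naive product mod Q. */
    int32_t prod = 1234 * 5678;
    int16_t r = montgomery_reduce(prod);
    int32_t r65536 = 65536 % Q;
    int32_t lhs = ((((int32_t)r % Q + Q) % Q) * r65536) % Q;
    int32_t rhs = prod % Q;
    printf("montgomery_reduce(%d) = %d, check %s\n",
           (int)prod, (int)r, (lhs == rhs) ? "ok" : "mismatch");
    return 0;
}
```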
Table 3. Parameters used in Falcon.
 | Falcon512 | Falcon1024 | Peregrine * 512 | Peregrine * 1024
q | 12,289 | 12,289 | 12,289 | 12,289
N | 512 | 1024 | 512 | 1024
b | 34,034,726 | 70,265,242 | 34,034,726 | 150,700,176
* Modified Falcon scheme presented in [12].
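The bound b in Table 3 is the squared-norm acceptance threshold for Falcon/Peregrine signatures: a candidate signature vector is accepted only if its squared Euclidean norm does not exceed b. The C sketch below illustrates such a check; the coefficient centering, data types, and function name are our own illustrative choices, not the reference implementation.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define FALCON_Q 12289

/* Returns true if ||(s1, s2)||^2 <= bound, where bound is the value b from
 * Table 3 (e.g., 34,034,726 for Falcon512). Coefficients given in [0, q) are
 * first mapped to the centered range around zero. Illustrative sketch only. */
static bool norm_within_bound(const uint16_t *s1, const uint16_t *s2,
                              size_t n, uint64_t bound)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t c1 = (int32_t)s1[i];
        int32_t c2 = (int32_t)s2[i];
        if (c1 > FALCON_Q / 2) c1 -= FALCON_Q;   /* center around 0 */
        if (c2 > FALCON_Q / 2) c2 -= FALCON_Q;
        acc += (uint64_t)((int64_t)c1 * c1) + (uint64_t)((int64_t)c2 * c2);
    }
    return acc <= bound;
}

int main(void)
{
    /* Toy example: s1[0] represents -1 after centering, s2[0] = 5. */
    uint16_t s1[512] = {0}, s2[512] = {0};
    s1[0] = 12288;
    s2[0] = 5;
    printf("within bound: %s\n",
           norm_within_bound(s1, s2, 512, 34034726ULL) ? "yes" : "no");
    return 0;
}
```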
Table 4. Parameters used in SPHINCS+.
Parameter | n | h | d | log(t) | k | w | NIST Security Level
SPHINCS+-256s | 32 | 64 | 8 | 14 | 22 | 16 | 5
SPHINCS+-256s robust | 32 | 64 | 8 | 14 | 22 | 16 | 5
Table 5. List of JPAU’s opcodes.
Mnemonic | Opcode | Description
NOP | 0000 | No operation, do nothing
ADD | 0001 | Result[i] ← vec_a[i] + vec_b[i]
SUB | 0010 | Result[i] ← vec_a[i] − vec_b[i]
CADDQ | 0011 | Result[i] ← (vec_a[i] < 0) ? vec_a[i] + Q : vec_a[i]
MULT | 0100 | Result[i] ← vec_a[i] × vec_b[i]
SHIFT | 0101 | Result[i] ← vec_a[i] << SHIFT_AMOUNT
REDUCE | 0110 | Result[i] ← MontgomeryReduction(vec_a[i])
AND | 0111 | Result[i] ← vec_a[i] AND vec_b[i]
OR | 1000 | Result[i] ← vec_a[i] OR vec_b[i]
XOR | 1001 | Result[i] ← vec_a[i] XOR vec_b[i]
NTT_BUTTERFLY | 1010 | Result ← Butterfly(vec_a)
INTT_BUTTERFLY | 1011 | Result ← InvButterfly(vec_a)
COMP | 1100 | Comp_result[i] ← COMPARE(vec_a[i], vec_b[i])
RESERVED | 1101–1111 | -
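To make the opcode semantics in Table 5 concrete, the C sketch below models the element-wise opcodes as a loop over a vector of coefficients. The lane count, data types, and function names are hypothetical, and REDUCE, the butterfly operations, and COMP are omitted; the real JPAU executes these operations on hardware vectors under control of the UPCU.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 8  /* hypothetical number of parallel lanes */

typedef enum {
    OP_NOP   = 0x0, OP_ADD  = 0x1, OP_SUB = 0x2, OP_CADDQ = 0x3,
    OP_MULT  = 0x4, OP_SHIFT = 0x5, OP_AND = 0x7, OP_OR    = 0x8,
    OP_XOR   = 0x9
} jpau_opcode;

/* Behavioral model of the element-wise opcodes in Table 5. q is the active
 * modulus (3329, 8,380,417, or 12,289); shift_amount corresponds to
 * SHIFT_AMOUNT in the table. */
static void jpau_execute(jpau_opcode op,
                         const int32_t vec_a[LANES],
                         const int32_t vec_b[LANES],
                         int32_t result[LANES],
                         int32_t q, unsigned shift_amount)
{
    for (size_t i = 0; i < LANES; i++) {
        switch (op) {
        case OP_ADD:   result[i] = vec_a[i] + vec_b[i];                       break;
        case OP_SUB:   result[i] = vec_a[i] - vec_b[i];                       break;
        case OP_CADDQ: result[i] = (vec_a[i] < 0) ? vec_a[i] + q : vec_a[i];  break;
        case OP_MULT:  result[i] = vec_a[i] * vec_b[i];                       break;
        case OP_SHIFT: result[i] = vec_a[i] << shift_amount;                  break;
        case OP_AND:   result[i] = vec_a[i] & vec_b[i];                       break;
        case OP_OR:    result[i] = vec_a[i] | vec_b[i];                       break;
        case OP_XOR:   result[i] = vec_a[i] ^ vec_b[i];                       break;
        case OP_NOP:
        default:       /* NOP: leave result untouched */                      break;
        }
    }
}

int main(void)
{
    int32_t a[LANES] = {1, 2, 3, 4, 5, 6, 7, 8};
    int32_t b[LANES] = {8, 7, 6, 5, 4, 3, 2, 1};
    int32_t r[LANES] = {0};
    jpau_execute(OP_ADD, a, b, r, 3329, 0);   /* r[i] = a[i] + b[i] = 9 */
    return (r[0] == 9) ? 0 : 1;
}
```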
Table 6. Synthesis results compared with other works.
Design | Technology | Clock Frequency | Area (mm²) | kGE | Target kGE | Target Scheme
Ours_Baseline | 15 nm | 1000 MHz | 0.056 | 284.939 | - | Dilithium, Kyber, SPHINCS+, Falcon (Peregrine)
Ours_S | 15 nm | 1000 MHz | 0.062 | 315.743 | 300 | Dilithium, Kyber, SPHINCS+, Falcon (Peregrine)
Ours_M | 15 nm | 1000 MHz | 0.115 | 584.624 | 613.849 | Dilithium, Kyber, SPHINCS+, Falcon (Peregrine)
Ours_L | 15 nm | 1000 MHz | 0.120 | 611.389 | 613.849 | Dilithium, Kyber, SPHINCS+, Falcon (Peregrine)
Gupta et al. [19] | 65 nm | 1176 MHz | 0.227 | 157.000 | - | Dilithium
Aikata et al. [14] | 65 nm | 400 MHz | 0.317 | 220.000 | - | Dilithium, Saber
Aikata et al. [6] | 28 nm | 1000 MHz | 0.263 | 747.000 | - | Dilithium, Kyber
Wagner et al. [20] | 120 nm | 250 & 500 MHz | 0.560 | 84.000 | - | SPHINCS+
Wagner et al. [20] extended | 120 nm | 250 & 500 MHz | 0.476 | 98.800 | - | SPHINCS+
Lee et al. [18] | 28 nm | 300 MHz | 0.038 | 98.729 | - | Falcon (Verification) (1)
Soni et al. [29] 512 | 65 nm | 122 MHz | 0.387 | 184.300 | - | Falcon (Signing) (2)
Soni et al. [29] 1024 | 65 nm | 173 MHz | 0.380 | 181.120 | - | Falcon (Signing) (2)
Bisheh-Niasar et al. [23] | 65 nm | 200 MHz | N/A | 93 | - | Kyber
(1) Implemented verification algorithm in Falcon. (2) Implemented signing algorithm in Falcon.
Table 7. Relative throughput and FoM on Dilithium and Kyber.
Operation | Parameter | Gupta et al. [19] Thrpt./FoM | Aikata et al. [14] Thrpt./FoM | Aikata et al. [6] Thrpt./FoM | Ours_S Thrpt./FoM | Ours_M Thrpt./FoM | Ours_L Thrpt./FoM
Keygen | Dilithium2 | -/- | 0.52/0.75 | 1.27/0.54 | 1.00/1.00 | 1.74/0.94 | 2.09/1.08
Keygen | Dilithium3 | -/- | 0.57/0.82 | 1.39/0.59 | 1.00/1.00 | 1.76/0.95 | 2.08/1.07
Keygen | Dilithium5 | 1.11/2.23 | 0.61/0.88 | 1.50/0.63 | 1.00/1.00 | 1.77/0.96 | 2.08/1.07
Keygen | Kyber512 | -/- | -/- | 4.66/1.97 | 1.00/1.00 | 1.04/0.56 | 2.18/1.12
Keygen | Kyber768 | -/- | -/- | 3.47/1.47 | 1.00/1.00 | 1.04/0.56 | 2.26/1.17
Keygen | Kyber1024 | -/- | -/- | 3.08/1.30 | 1.00/1.00 | 1.04/0.56 | 2.31/1.19
Sign | Dilithium2 | -/- | 0.96/1.38 | 2.31/0.98 | 1.00/1.00 | 1.90/1.03 | 2.01/1.04
Sign | Dilithium3 | -/- | 1.08/1.55 | 2.63/1.11 | 1.00/1.00 | 1.92/1.04 | 2.01/1.04
Sign | Dilithium5 | 2.39/4.81 | 1.35/1.94 | 3.30/1.40 | 1.00/1.00 | 1.93/1.04 | 2.01/1.04
Sign | Kyber512 | -/- | -/- | 2.85/1.20 | 1.00/1.00 | 1.07/0.58 | 2.74/1.42
Sign | Kyber768 | -/- | -/- | 2.57/1.09 | 1.00/1.00 | 1.07/0.58 | 2.71/1.40
Sign | Kyber1024 | -/- | -/- | 2.34/0.99 | 1.00/1.00 | 1.07/0.58 | 2.68/1.39
Verify | Dilithium2 | -/- | 0.83/1.19 | 2.02/0.85 | 1.00/1.00 | 1.83/0.99 | 2.00/1.03
Verify | Dilithium3 | -/- | 0.85/1.22 | 2.08/0.88 | 1.00/1.00 | 1.85/1.00 | 2.01/1.04
Verify | Dilithium5 | 1.71/3.44 | 0.86/1.24 | 2.11/0.89 | 1.00/1.00 | 1.86/1.01 | 2.03/1.05
Verify | Kyber512 | -/- | -/- | 2.26/0.95 | 1.00/1.00 | 1.11/0.60 | 2.65/1.37
Verify | Kyber768 | -/- | -/- | 1.96/0.83 | 1.00/1.00 | 1.11/0.60 | 2.61/1.35
Verify | Kyber1024 | -/- | -/- | 2.10/0.89 | 1.00/1.00 | 1.11/0.60 | 2.57/1.33
Table 8. Relative throughput and FoM on Falcon.
Operation | Parameter | CPU (AVX) Thrpt. | Lee et al. [18] Thrpt./FoM | Soni et al. [29] Thrpt./FoM | Ours_S Thrpt./FoM | Ours_M Thrpt./FoM | Ours_L Thrpt./FoM
Keygen | Falcon512 | 0.15 | -/- | -/- | 1.00/1.000 | 1.82/0.983 | 1.82/0.940
Keygen | Falcon1024 | 0.11 | -/- | -/- | 1.00/1.000 | 1.84/0.992 | 1.84/0.948
Sign | Falcon512 | 0.24 | 0.002/0.006 | -/- | 1.00/1.000 | 1.60/0.863 | 1.60/0.826
Sign | Falcon1024 | 0.25 | 0.002/0.006 | -/- | 1.00/1.000 | 1.62/0.874 | 1.62/0.836
Verify | Falcon512 | 0.12 | -/- | 0.01/0.015 | 1.00/1.000 | 1.98/1.072 | 2.00/1.034
Verify | Falcon1024 | 0.13 | -/- | 0.01/0.015 | 1.00/1.000 | 1.99/1.076 | 2.00/1.033
Table 9. Relative throughput and FoM on SPHINCS+.
Operation | Parameter | CPU (AVX) Thrpt. | Wagner et al. [20] Thrpt./FoM | Wagner et al. [20] extended Thrpt./FoM | Amiet et al. [31] Thrpt./FoM | Ours_S Thrpt./FoM | Ours_M Thrpt./FoM | Ours_L Thrpt./FoM
Keygen | 256s-simple | 0.05 | -/- | -/- | -/- | 1.00/1.000 | 1.00/0.540 | 2.95/1.522
Keygen | 256s-robust | 0.04 | -/- | -/- | -/- | 1.00/1.000 | 1.00/0.540 | 2.97/1.535
Sign | 256s-simple | 0.03 | -/- | -/- | 0.82/- | 1.00/1.000 | 1.00/0.540 | 2.95/1.522
Sign | 256s-robust | 0.03 | -/- | -/- | 0.83/- | 1.00/1.000 | 1.00/0.540 | 2.97/1.535
Verify | 256s-simple | 0.01 | 0.03/0.104 | 0.04/0.135 | 0.06/- | 1.00/1.000 | 1.00/0.540 | 2.58/1.330
Verify | 256s-robust | 0.01 | 0.02/0.077 | 0.04/0.131 | 0.08/- | 1.00/1.000 | 1.00/0.540 | 2.59/1.336
Table 10. Power consumption compared with other accelerators.
Design | Power (mW) | Dilithium3 (μJ) | Kyber1024 (μJ) | SPHINCS+-256s (μJ) | FALCON1024 (μJ)
Ours_S | 11 | 2.01 | 0.61 | 0.174 | 0.10
Ours_M | 17.2 | 1.66 | 0.88 | 0.272 | 0.08
Ours_L | 17.7 | 1.61 | 0.38 | 0.095 | 0.08
Aikata et al. [6] | 360 | 27.00 | 9.27 | - | -
Lee et al. [18] | 5.79 | - | - | - | 27.60
Amiet et al. [31] | 9750 | - | - | 189,300 | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
