Article

On the Parallelization of Square-Root Vélu’s Formulas

by Jorge Chávez-Saab 1,*,†, Odalis Ortega 2,*,† and Amalia Pizarro-Madariaga 2,*,†

1 Cryptography Research Center, Technology Innovation Institute, Abu Dhabi P.O. Box 1639, United Arab Emirates
2 Instituto de Matemáticas, Universidad de Valparaíso, General Cruz 222, 5º piso, Valparaíso 2360102, Chile
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work and order is alphabetical.
Math. Comput. Appl. 2024, 29(1), 14; https://doi.org/10.3390/mca29010014
Submission received: 11 January 2024 / Revised: 6 February 2024 / Accepted: 7 February 2024 / Published: 16 February 2024
(This article belongs to the Topic Mathematical Modeling)

Abstract: A primary challenge in isogeny-based cryptography lies in the substantial computational cost associated with computing and evaluating prime-degree isogenies. This computation traditionally relied on Vélu's formulas, an approach with time complexity linear in the degree, which was further enhanced by Bernstein, De Feo, Leroux, and Smith to a square-root complexity. The improved square-root Vélu's formulas exhibit a degree of parallelizability that has not been exploited in major implementations. In this study, we introduce a theoretical framework for parallelizing isogeny computations and provide a proof-of-concept implementation in C with OpenMP. While the effectiveness of the parallelization exhibits diminishing returns with the number of cores, we still obtain strong results when using a small number of cores. Concretely, our implementation shows that for large degrees it is easy to achieve speedup factors of up to 1.74, 2.54, and 3.44 for two, four, and eight cores, respectively.

1. Introduction

Since the foundational work of De Feo, Jao, and Plût [1,2], which resulted in the SIKE protocol [3], isogeny-based cryptography has received considerable interest in the development of post-quantum cryptography. Since then, however, the security of SIKE has been broken by a wave of attacks that started with the work of Castryck and Decru [4] and was followed up by the works of Maino, Martindale, Panny, Pope, and Wesolowski [5] and of Robert [6], ultimately demonstrating the existence of a polynomial-time attack on SIKE with any starting curve. Nevertheless, many isogeny-based protocols remain unbroken, including the CSIDH [7] key exchange protocol and signature schemes like SeaSign [8], CSI-FiSh [9], and SQISign [10], the last of which is currently under consideration in the NIST (National Institute of Standards and Technology) process for Standardization of Additional Digital Signature Schemes [11].
While SIKE only required the computation of isogenies of degrees 2 and 3, there has been a tendency in some of the newer isogeny-based protocols to move towards higher and higher prime degrees. This has brought increasing attention to the task of optimizing the evaluation of large-degree isogenies. For instance, Boris and Basso [12] proposed countermeasures against the attack at the cost of utilizing isogenies of considerably larger degrees. On the other hand, the original CSIDH proposal used isogenies of degrees up to 587, and Chávez, Chi, Jacques, and Rodríguez have suggested that a level 5 dummy-free implementation would need degrees as high as 2239 [13] after revising the security. The CSI-FiSh [9] signature scheme also worked with the same degrees as the original CSIDH scheme, but efforts to generalize it to higher security levels, such as [14], have led to the use of higher degrees as well. More recently, the level 5 version of SQISign that was submitted to the NIST standardization process [15] raises the ceiling with a colossal 318,233-degree isogeny computation. Even the aforementioned attacks on SIKE require the evaluation of large prime-degree isogenies, with the techniques from [5] requiring the evaluation of an isogeny of degree 321,193 over a large extension field for a direct key recovery attack. Therefore, optimizing the computation of large prime-degree isogenies is of great interest for constructive purposes as well as for cryptanalysis. Interestingly, a new generation of isogeny-based protocols has also arisen which uses the techniques of the attack constructively, such as FESTA [16], QFESTA [17], IS-CUBE [18], and constructions for VDFs [19] and VRFs [20], the last two of which also involve large-degree isogenies.
Traditionally, the main method for computing isogenies of prime degree $\ell$, when there exists a point of order $\ell$ on the curve, has been Vélu's formulas, initially introduced in [21] and expanded upon in (Section 2.4, [22]) and (Theorem 12.16, [23]). In view of the ubiquity of large-degree isogenies, the more recent improvement from Bernstein, De Feo, Leroux, and Smith [24], which reduced the time complexity from $O(\ell)$ to $\tilde{O}(\sqrt{\ell})$, has been a keystone for the viability of many protocols. In addition to the straightforward savings that it provides, this new algorithm also exhibits a substantial degree of parallelizability that has remained largely unexplored.
Our Contributions: In this work we detail and showcase the parallelizability of the square-root Vélu formulas, focusing on the problem of computing a single large-degree isogeny from a fixed kernel generator and pushing a number of points through said isogeny. We analyze and provide a parallelism-friendly reformulation of the square-root Vélu formulas, and provide a proof-of-concept (PoC) implementation to illustrate the implications of our work. More precisely,
  • We propose a new indexing system for the points in the kernel of the isogeny that is similar to the one presented in [24], but which can be naturally split into subsets that are assigned to each core in a multi-core implementation.
  • We provide a parallelized expected cost function based on our square-root Vélu variant, as well as a revised expected cost of the sequential Vélu formulas that were presented in [25]. Assuming that the assigned task is to perform an isogeny construction and to push two points through it, the idealized speedup factors for large degrees reach up to 1.79, 2.96, and 4.58 relative to the expected sequential cost function when using two, four, and eight cores, respectively.
  • We give a PoC implementation utilizing C and OpenMP to compute our isogeny formulas with large odd prime degrees. Our implementation assumes for concreteness that isogenies are evaluated over a prime field such that there exist rational points of the desired degree on the curve, and achieves speedup factors for large degrees of up to 1.74, 2.54, and 3.44 for two, four, and eight cores, respectively, even for medium-sized isogenies.
Assumptions and generalizations: For the purposes of our implementation, we have searched for a 1792-bit prime such that $p + 1$ contains evenly spaced prime factors ranging from 19 to 321,193 (the degree of the isogeny in the attack of [5]). Isogenies are performed over the base prime field, and two points are pushed through each isogeny. These choices were made for concreteness and closely mirror the scenario in CSIDH. On the other hand, they do not adhere exactly to the case of SQISign (which works in a quadratic field extension) nor to that of the attack in [5] (which performs isogenies in a much larger field extension). Nevertheless, the techniques that we describe save a fixed number of field multiplications, which are the dominating part of the total cost. Therefore, we expect that similar percentage speedups would be obtained in those other cases, with the caveat that the savings could even be slightly larger as the parallelization overhead becomes more negligible relative to the arithmetic cost. As for the theoretical savings, our costs are expressed in terms of the total number of field multiplications and remain valid for any choice of base field.
Related Work: Different forms of parallelism have been explored at various layers of isogeny-based protocols. For instance, the use of vectorization through Intel's Advanced Vector Extensions (AVX-512) has been exploited at the finite field arithmetic layer, both in the context of CSIDH [26] and of the now obsolete SIKE protocol [27,28], obtaining speedup factors on the order of 1.5×. Additionally, it has also been proposed to use AVX-512 to batch multiple evaluations of the protocols, leading to increases in throughput of up to 3.6× for CSIDH [26] and 4.6× for SIKE [28]. When latency is the focus, vectorization can also be used to push multiple points through an isogeny, leading to the concept of parallel-friendly evaluation strategies which favor pushing points over point multiplications [28,29,30]. These parallel strategies can accelerate the evaluation of large composite-degree isogenies, but do not apply to isogenies of a large prime degree.
This work focuses exclusively on multi-core optimization for the isolated evaluation of a large prime-degree isogeny, which has received much less attention. To the best of our knowledge, only simple parallelization strategies have been proposed [31] for the linear-complexity Vélu formulas, without exploiting the improved-complexity methods from [24]. We also do not consider vectorization: if it were to be used, we expect that the best way to exploit it would be at the field arithmetic layer. This would lead to improvements similar to those from [26,27,28], which are well documented and are largely independent of our multi-core improvements.
Outline: The remainder of the manuscript is organized as follows. In Section 2 we give the basic background and description of the original square-root Vélu formulas with their three main building blocks, KPS, xISOG, and xEVAL, and derive their expected cost function. In Section 3 we introduce our new framework, explaining the need for a new indexing system in Section 3.1 and then detailing the new algorithms in Section 3.2 and deriving their expected cost function in Section 3.3. Finally, in Section 4, we present our experimental results, comparing the performance to the theoretical savings.

2. Background

We denote by $\mathbb{F}_q$ the finite field with $q = p^n$ elements, where $p$ is a prime number. We focus on supersingular Montgomery curves $E$ given by
$$E : y^2 = x^3 + Ax^2 + x, \qquad A \in \mathbb{F}_q \setminus \{\pm 2\}.$$
Remark 1.
In general, a Montgomery curve is defined by the equation $E_{A,B} : By^2 = x^3 + Ax^2 + x$ with $B \neq 0$ and $A \neq \pm 2$, but all choices of $B$ are equivalent up to isomorphism. For the sake of simplicity, we write $E_A$ instead of $E_{A,1}$ and omit the constant $B$.
Viewing the curve projectively, the set $E(\mathbb{F}_q)$ of $\mathbb{F}_q$-rational points of $E$ forms a group in which the point at infinity $\infty$ acts as the group identity. An order-$d$ point $P$ on $E$ is a point on the curve such that $d$ is the smallest positive integer satisfying
$$[d]P := \underbrace{P + \cdots + P}_{d \text{ times}} = \infty.$$
We write $E[d]$ to refer to the $d$-torsion subgroup $\{P \in E(\overline{\mathbb{F}_q}) \mid [d]P = \infty\}$ of $E$.
Isogenies: An isogeny $\phi : E \to E'$ is a surjective morphism between curves with a finite kernel such that $\phi(E) = E'$. Such a map is always also a group homomorphism. Two curves $E$ and $E'$ are isogenous over $\mathbb{F}_q$ if there exists such a $\phi$ connecting them, or equivalently if $\#E(\mathbb{F}_q) = \#E'(\mathbb{F}_q)$. The kernel of $\phi$ is $\{P \in E(\overline{\mathbb{F}_q}) \mid \phi(P) = \infty\}$, denoted by $\ker \phi$. When restricting only to separable maps, the degree of an isogeny is equal to the size of its kernel, and an isogeny is uniquely determined by its kernel up to isomorphism. An isogeny $\phi : E \to E'$ of degree $\ell$ is referred to as an $\ell$-isogeny, and it has a unique dual (up to isomorphism) $\hat{\phi} : E' \to E$ of the same degree such that $\hat{\phi} \circ \phi$ and $\phi \circ \hat{\phi}$ are equivalent to the multiplication-by-$\ell$ map on $E$ and $E'$, respectively.

2.1. Computation of ℓ-Isogenies

Let $E_A$ be a Montgomery curve defined over $\mathbb{F}_q$ and $P \in E_A(\mathbb{F}_q)$ a point of order $\ell$. If $\phi : E_A \to E_{A'}$ is an isogeny with $\ker \phi = \langle P \rangle$, then it is possible to find polynomials $g(x), h(x) \in \mathbb{F}_q[x]$ such that
$$\phi(x, y) = \left( \frac{g(x)}{h(x)},\ y \left( \frac{g(x)}{h(x)} \right)' \right).$$
The polynomials $g(x)$ and $h(x)$ are related by a formula stated by Elkies [32], so the main task for computing an $\ell$-isogeny is that of obtaining $h(x)$ from $P$. In the particular case of Montgomery curves, from the formulas of (Theorem 1, [33]) the coefficient of the image curve $E_{A'}$ can be obtained by
$$A' = \frac{2(1 + d)}{1 - d}, \qquad d = \left( \frac{A - 2}{A + 2} \right)^{\ell} \left( \frac{h_S(1)}{h_S(-1)} \right)^{8}, \tag{1}$$
and the $x$-coordinate $\phi_x(x(Q))$ of the point $\phi(Q)$ by
$$\phi_x(x(Q)) = x(Q)^{\ell} \left( \frac{h_S(1/x(Q))}{h_S(x(Q))} \right)^{2}, \tag{2}$$
where $h_S(X) = \prod_{s \in S} (X - x([s]P))$ and $S = \{1, 3, \ldots, \ell - 2\}$.
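To make the codomain formula above concrete, the following toy Python script (ours; the paper's implementation is in C with OpenMP) brute-forces a small Montgomery curve, finds a kernel point of odd prime order, and computes the codomain coefficient via $d = ((A-2)/(A+2))^{\ell}(h_S(1)/h_S(-1))^{8}$. The prime $p = 163$, the search loop, and all helper names are our own illustrative choices. As an independent check, it verifies that the two curves have the same number of rational points, which must hold for isogenous curves over a prime field.

```python
import math

p = 163  # illustrative small prime (the paper uses a 1792-bit prime)

def inv(a):
    return pow(a, p - 2, p)

def all_points(A):
    # all points of E_A : y^2 = x^3 + A x^2 + x over F_p, None = infinity
    pts = [None]
    for x in range(p):
        rhs = (x ** 3 + A * x * x + x) % p
        pts += [(x, y) for y in range(p) if y * y % p == rhs]
    return pts

def add(P, Q, A):
    # affine Montgomery addition with B = 1
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + 2 * A * x1 + 1) * inv(2 * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - A - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def mul(k, P, A):
    R = None
    while k:
        if k & 1:
            R = add(R, P, A)
        P = add(P, P, A)
        k >>= 1
    return R

def codomain(A, P, l):
    # d = ((A-2)/(A+2))^l * (h_S(1)/h_S(-1))^8,  A' = 2(1+d)/(1-d)
    xs = [mul(s, P, A)[0] for s in range(1, l - 1, 2)]  # x([s]P), s in S
    h1 = math.prod(1 - x for x in xs) % p               # h_S(1)
    hm1 = math.prod(-1 - x for x in xs) % p             # h_S(-1)
    if h1 == 0 or hm1 == 0:
        return None                                     # degenerate toy case
    d = pow((A - 2) * inv(A + 2), l, p) * pow(h1 * inv(hm1), 8, p) % p
    if d == 1:
        return None
    A2 = 2 * (1 + d) * inv(1 - d) % p
    return None if (A2 * A2 - 4) % p == 0 else A2

# search for a curve E_A with a point P of odd prime order l
for A in range(3, p):
    if (A * A - 4) % p == 0:
        continue
    pts = all_points(A)
    N, M = len(pts), len(pts)
    while M % 2 == 0:
        M //= 2
    if M == 1:
        continue
    l = next(f for f in range(3, M + 1, 2) if M % f == 0)  # smallest odd factor, prime
    P = next((R for R in (mul(N // l, Q, A) for Q in pts) if R is not None), None)
    if P is None:
        continue
    A2 = codomain(A, P, l)
    if A2 is not None:
        break

assert mul(l, P, A) is None          # P really has order l
assert len(all_points(A2)) == N      # isogenous curves: same point count
```

The point-counting check is quadratic in $p$ and only viable for toy parameters; it plays no role in the actual algorithms, which never enumerate points.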
In 2020, Bernstein, De Feo, Leroux, and Smith [24] proposed an important improvement in the computation of isogenies. Their key idea exploits the fact that for isogeny evaluations and constructions, it is not required to compute the polynomial $h$ itself, but only its evaluation at a given set of points. We briefly describe their method for evaluating (1) and (2), starting by pointing out that, although the map $Q \mapsto x(Q)$ is not homomorphic, an algebraic relation exists between the $x$-coordinates of $P + Q$, $P - Q$, $P$, and $Q$. More precisely,
Lemma 1
(Lemma 4.3, [24]). Let $E/\mathbb{F}_q$ be an elliptic curve, where $q$ is a prime power. There exist biquadratic polynomials $F_0$, $F_1$, and $F_2$ in $\mathbb{F}_q[X_1, X_2]$ such that
$$(X - x(P + Q))(X - x(P - Q)) = X^2 + \frac{F_1(x(P), x(Q))}{F_0(x(P), x(Q))} X + \frac{F_2(x(P), x(Q))}{F_0(x(P), x(Q))},$$
for all $P$ and $Q$ in $E$ such that $\infty \notin \{P, Q, P + Q, P - Q\}$.
Remark 2.
We focus on isogeny computations between Montgomery curves using $x$-only coordinates, mainly because this is the case most used in practical applications. However, the results obtained can be extended to other models. Formulas analogous to (1) and (2) are provided in [34] for the Edwards and Huff models. These formulas require the evaluation of polynomials analogous to $h_S$, which can be evaluated in the same way that we will describe after adjusting the polynomials $F_0, F_1, F_2$. More generally, the same procedure could be applied to the generalized Montgomery coordinate from [35], of which all of the models mentioned are special cases.
Using Lemma 1, we can rearrange $h_S(X)$ in a baby-step giant-step style:
$$h_S(X) = \left( \prod_{i \in I} \prod_{j \in J} (X - x([i + j]P))(X - x([i - j]P)) \right) \left( \prod_{k \in K} (X - x([k]P)) \right), \tag{3}$$
where the index system $(I, J)$ is chosen such that $S = (I + J) \cup (I - J) \cup K$, with $K = S \setminus ((I + J) \cup (I - J))$ and satisfying the following definition:
Definition 1.
Let $S$, $I$ and $J$ be finite sets of integers. The pair $(I, J)$ is an index system of $S$ if:
  • the maps $I \times J \to \mathbb{Z}$ defined by $(i, j) \mapsto i + j$ and $(i, j) \mapsto i - j$ are both injective and have disjoint images;
  • $I + J$ and $I - J$ are both contained in $S$.
From here, they show that the factor of $h_S(X)$ that is related to $I$ and $J$ can be computed efficiently as a resultant. More precisely, letting $I \pm J$ denote the union of the sets $I + J$ and $I - J$, the following result is proven.
Lemma 2
(Lemma 4.9, [24]). Let $E/\mathbb{F}_q$ be an elliptic curve, where $q$ is a prime power. Let $P$ be an order-$n$ element of $E(\mathbb{F}_q)$. Let $(I, J)$ be an index system such that $I$, $J$, $I + J$, and $I - J$ do not contain any elements of $n\mathbb{Z}$. Then
$$h_{I \pm J}(X) = \frac{1}{\Delta_{I,J}} \operatorname{Res}_Z(h_I(Z), E_J(X, Z)),$$
where
$$E_J(X, Z) := \prod_{j \in J} \left( F_0(Z, x([j]P)) X^2 + F_1(Z, x([j]P)) X + F_2(Z, x([j]P)) \right)$$
and $\Delta_{I,J} := \operatorname{Res}_Z(h_I(Z), D_J(Z))$, where $D_J(Z) := \prod_{j \in J} F_0(Z, x([j]P))$.
Remark 3.
Note that $\Delta_{I,J}$ in Lemma 2 does not depend on the variable $X$. Formulas (1) and (2), which compute the codomain curve and the point evaluation, respectively, involve only quotients of the polynomial $h$, so in practice $\Delta_{I,J}$ does not need to be computed.
Notice that $h_{I \pm J}$ is precisely the first factor appearing in (3), which can be computed quadratically faster as a resultant, while the second factor (corresponding to $h_K(X)$) is a leftover product that is still computed in linear time. Based on this, Adj, Chi-Domínguez, and Rodríguez-Henríquez [25] introduced three explicit algorithms for evaluating $\ell$-isogenies: KPS (which computes the necessary multiples of the kernel generator), xISOG (which obtains the coefficient of the image curve), and xEVAL (which obtains the image of a point). These algorithms are reproduced here as Algorithms 1–3.
Algorithm 1 KPS.
Inputs: An elliptic curve $E_A/\mathbb{F}_q$; $P \in E_A(\mathbb{F}_q)$ of order an odd prime $\ell$
Output: $\mathcal{I}, \mathcal{J}, \mathcal{K}$, where $(I, J)$ is an index system for $S$
  1: $b \leftarrow \lfloor \sqrt{\ell - 1}/2 \rfloor$
  2: $b' \leftarrow \lfloor (\ell - 1)/(4b) \rfloor$
  3: $J \leftarrow \{2j + 1 : 0 \le j < b\}$
  4: $I \leftarrow \{2b(2i + 1) : 0 \le i < b'\}$
  5: $K \leftarrow S \setminus (I \pm J)$
  6: $\mathcal{J} \leftarrow \{x([j]P) : j \in J\}$
  7: $\mathcal{I} \leftarrow \{x([i]P) : i \in I\}$
  8: $\mathcal{K} \leftarrow \{x([k]P) : k \in K\}$
  9: return $\mathcal{I}$, $\mathcal{J}$, and $\mathcal{K}$
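The index arithmetic of KPS can be checked in isolation. The following small Python sketch (ours; curve points are omitted, only the integer index sets are built) instantiates the sets for $\ell = 89$ and verifies the index-system conditions directly:

```python
import math

def kps_index_sets(l):
    # index sets from Algorithm 1 (integer part only, no curve arithmetic)
    b = math.isqrt(l - 1) // 2                       # b = floor(sqrt(l-1)/2)
    bp = (l - 1) // (4 * b)                          # b'
    J = {2 * j + 1 for j in range(b)}
    I = {2 * b * (2 * i + 1) for i in range(bp)}
    S = set(range(1, l - 1, 2))                      # S = {1, 3, ..., l-2}
    IpJ = {i + j for i in I for j in J}
    ImJ = {i - j for i in I for j in J}              # min(I) > max(J) here
    K = S - IpJ - ImJ
    return I, J, K, S, IpJ, ImJ

I, J, K, S, IpJ, ImJ = kps_index_sets(89)
assert IpJ.isdisjoint(ImJ) and (IpJ | ImJ | K) == S  # S = (I+J) u (I-J) u K
assert len(IpJ) == len(ImJ) == len(I) * len(J)       # both maps injective
assert K == {81, 83, 85, 87}                         # small leftover set
```

For $\ell = 89$ this yields $b = 4$, $b' = 5$, and a leftover set $K$ of only four indices, matching the claim that $K$ is a short linear-time tail.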
Algorithm 2 xISOG.
Inputs: An elliptic curve $E_A/\mathbb{F}_q : y^2 = x^3 + Ax^2 + x$; $P \in E_A(\mathbb{F}_q)$ of order an odd prime $\ell$; and $\mathcal{I}, \mathcal{J}, \mathcal{K}$ the output from KPS(P)
Output: $A' \in \mathbb{F}_q$ such that $E_{A'}/\mathbb{F}_q : y^2 = x^3 + A'x^2 + x$ is the image curve of a separable isogeny with kernel $\langle P \rangle$
  1: $h_I \leftarrow \prod_{x_i \in \mathcal{I}} (Z - x_i) \in \mathbb{F}_q[Z]$
  2: $E_{0,J} \leftarrow \prod_{x_j \in \mathcal{J}} \left( F_0(Z, x_j) + F_1(Z, x_j) + F_2(Z, x_j) \right) \in \mathbb{F}_q[Z]$
  3: $E_{1,J} \leftarrow \prod_{x_j \in \mathcal{J}} \left( F_0(Z, x_j) - F_1(Z, x_j) + F_2(Z, x_j) \right) \in \mathbb{F}_q[Z]$
  4: $R_0 \leftarrow \operatorname{Res}_Z(h_I, E_{0,J}) \in \mathbb{F}_q$
  5: $R_1 \leftarrow \operatorname{Res}_Z(h_I, E_{1,J}) \in \mathbb{F}_q$
  6: $M_0 \leftarrow \prod_{x_k \in \mathcal{K}} (1 - x_k) \in \mathbb{F}_q$
  7: $M_1 \leftarrow \prod_{x_k \in \mathcal{K}} (-1 - x_k) \in \mathbb{F}_q$
  8: $d \leftarrow \left( \frac{A - 2}{A + 2} \right)^{\ell} \left( \frac{M_0 R_0}{M_1 R_1} \right)^{8}$
  9: return $2(1 + d)/(1 - d)$
Algorithm 3 xEVAL.
Inputs: An elliptic curve $E_A/\mathbb{F}_q : y^2 = x^3 + Ax^2 + x$; $P \in E_A(\mathbb{F}_q)$ of order an odd prime $\ell$; $\alpha = x(Q) = [Q_x : Q_z] \neq 0$ for a point $Q \in E_A(\mathbb{F}_q) \setminus \langle P \rangle$; and $\mathcal{I}, \mathcal{J}, \mathcal{K}$ the output from KPS(P)
Output: The $x$-coordinate $x(\phi(Q))$ of the image of $Q$, where $\phi$ is a separable isogeny of kernel $\langle P \rangle$
  1: $h_I \leftarrow \prod_{x_i \in \mathcal{I}} (Z - x_i) \in \mathbb{F}_q[Z]$
  2: $E_{0,J} \leftarrow \prod_{x_j \in \mathcal{J}} \left( F_0(Z, x_j)/\alpha^2 + F_1(Z, x_j)/\alpha + F_2(Z, x_j) \right) \in \mathbb{F}_q[Z]$
  3: $E_{1,J} \leftarrow \prod_{x_j \in \mathcal{J}} \left( F_0(Z, x_j)\alpha^2 + F_1(Z, x_j)\alpha + F_2(Z, x_j) \right) \in \mathbb{F}_q[Z]$
  4: $R_0 \leftarrow \operatorname{Res}_Z(h_I, E_{0,J}) \in \mathbb{F}_q$
  5: $R_1 \leftarrow \operatorname{Res}_Z(h_I, E_{1,J}) \in \mathbb{F}_q$
  6: $M_0 \leftarrow \prod_{x_k \in \mathcal{K}} (1/\alpha - x_k) \in \mathbb{F}_q$
  7: $M_1 \leftarrow \prod_{x_k \in \mathcal{K}} (\alpha - x_k) \in \mathbb{F}_q$
  8: return $(M_0 R_0)^2 Q_x^{\ell} \,/\, \left( (M_1 R_1)^2 Q_z^{\ell} \right)$

2.2. Computing the Resultants

The main computational task required for the square-root Vélu algorithm is the evaluation of the resultants $\operatorname{Res}_Z(h_I, E_{i,J})$, which can be achieved in constant time via a residue tree approach as described in [25]. This task accounts for approximately 85% of the total computation time and depends on the sizes of the sets $I$, $J$, and $K$. For the following discussion, we set $\#J = b$, $\#I = b'$, and $\#K = \frac{\ell - 1}{2} - 2bb'$. As a reminder, if $f(Z) \in \mathbb{F}_q[Z]$ splits into linear factors as $f(Z) = a \prod_{0 \le i < b} (Z - x_i)$ and $g(Z) \in \mathbb{F}_q[Z]$, then the resultant is given by
$$\operatorname{Res}_Z(f(Z), g(Z)) = a^{\deg g} \prod_{0 \le i < b} g(x_i). \tag{4}$$
For our purposes, the polynomials $f$ and $g$ are provided as products of $b$ linear polynomials and $b'$ quadratic polynomials, respectively. Since the $x_i$ values are readily available, one could compute (4) directly by evaluating $g$ multiple times, but this would lead to a complexity of $O(bb') = O(\ell)$.
Instead, we begin by obtaining the polynomials $f$ and $g$ as lists of coefficients by multiplying together all the terms following a product tree approach, as shown in Figure 1.
After the product tree for $f$ has been built, we compute a corresponding reciprocal tree. Unlike the product tree, the reciprocal tree is computed from the root down. If a node in the product tree contains a polynomial $F$ of degree $m$, then the corresponding node in the reciprocal tree contains $(F', c)$, where $F'$ is a polynomial of degree $m$ and $c$ is a constant such that $\operatorname{rev}_m(F) \cdot F' \equiv c \pmod{x^m}$. Here, $\operatorname{rev}_m(\cdot)$ denotes the polynomial with the list of coefficients in reverse order: if $F = \sum_{i=0}^{m} F_i x^i$, then $\operatorname{rev}_m(F) = \sum_{i=0}^{m} F_{m-i} x^i$. The reciprocal tree is depicted in Figure 2, and is used as an auxiliary tool to construct one last tree called the residue tree.
The residue tree is also built from the root down, and as depicted in Figure 3, each of its nodes contains $c \, g \bmod F$, where $F$ is the polynomial in the corresponding node of the product tree of $f$ and $c$ the constant from the reciprocal tree. The leaves of the residue tree contain a multiple of $g(Z) \bmod (Z - x_i) = g(x_i)$, so a multiple of the resultant is readily obtained by multiplying all the leaves together. The multiple cancels out when taking the ratios at the last step of Algorithms 2 and 3. Remarkably, this method allows us to compute (a multiple of) all of the $g(x_i)$ with a complexity of just $\tilde{O}(b + b') = \tilde{O}(\sqrt{\ell})$. A detailed description of the cost of the product tree, reciprocal tree, and residue tree is presented in Appendix A.
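The tree-based evaluation can be sketched in a few lines. The following Python toy (ours) uses naive quadratic polynomial arithmetic modulo a small prime and a recursive remainder (residue) tree instead of the Karatsuba products and reciprocal tree of the real C implementation; the prime and all names are illustrative:

```python
# multipoint evaluation via a product tree and a residue (remainder) tree
p = 10007

def pmul(f, g):
    # product of coefficient lists (ascending order), mod p
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] = (h[i + j] + a * b) % p
    return h

def pmod(f, g):
    # remainder of f modulo a monic g, mod p
    f = f[:]
    while len(f) >= len(g):
        c = f[-1]
        for i in range(len(g)):
            f[len(f) - len(g) + i] = (f[len(f) - len(g) + i] - c * g[i]) % p
        f.pop()
    return f or [0]

def from_roots(xs):
    # monic polynomial with the given roots (product tree, done naively)
    f = [1]
    for x in xs:
        f = pmul(f, [(-x) % p, 1])
    return f

def multieval(g, xs):
    # residue tree: reduce g modulo each half's subproduct and recurse
    if len(xs) == 1:
        return [pmod(g, [(-xs[0]) % p, 1])[0]]
    m = len(xs) // 2
    return (multieval(pmod(g, from_roots(xs[:m])), xs[:m]) +
            multieval(pmod(g, from_roots(xs[m:])), xs[m:]))

def horner(g, x):
    r = 0
    for c in reversed(g):
        r = (r * x + c) % p
    return r

xs = [3, 1, 4, 15, 9, 2, 6]
g = [5, 0, 7, 1]                       # g(Z) = Z^3 + 7Z^2 + 5
assert multieval(g, xs) == [horner(g, x) for x in xs]

# since h_I is monic and splits, Res_Z(h_I, g) is the product of the leaves
res = 1
for v in multieval(g, xs):
    res = res * v % p
```

With fast polynomial multiplication each level of the tree costs quasi-linear time, which is where the $\tilde{O}(\sqrt{\ell})$ bound comes from.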

2.3. Cost Model for the Sequential Square-Root Vélu

An expected cost function for KPS, xISOG, and xEVAL was presented in [25], but it takes into account some redundant computations and also fails to consider the cost of the residue tree. We now present a more accurate cost function.
As in [25], we begin by noting that better performance is achieved when $\#K$ is as small as possible and $\#I$, $\#J$ are similar in size. Since we need $2(\#I)(\#J) + \#K = (\ell - 1)/2$, we set $\#J = b := \lfloor \sqrt{\ell - 1}/2 \rfloor$ and then perform division with remainder to obtain $\#I = \lfloor (\ell - 1)/(4b) \rfloor$ and $\#K \in [0, 2b]$. For simplicity, from now on we ignore rounding errors and assume the average value for $\#K$, so that $\#I \approx \#J \approx \#K \approx b \approx \sqrt{\ell - 1}/2$.
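The balance of the three sets under this parameter choice is easy to verify numerically. This short script (ours) checks it for the largest degree mentioned in the introduction:

```python
import math

# parameter choice of Section 2.3: #J = b, #I = floor((l-1)/(4b)),
# #K the remainder, so that 2(#I)(#J) + #K = (l-1)/2 with #K in [0, 2b)
l = 321193
b = math.isqrt(l - 1) // 2          # #J = floor(sqrt(l-1)/2)
nI = (l - 1) // (4 * b)             # #I
nK = (l - 1) // 2 - 2 * b * nI      # #K
assert 2 * b * nI + nK == (l - 1) // 2
assert 0 <= nK < 2 * b
assert abs(nI - b) <= 1             # #I is indeed close to #J
```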
Next, we provide explicit formulas for the number of field multiplications in each of the three procedures as a function of b. This cost model neglects the cost of field additions and subtractions, as well as the savings that a field squaring may have over a regular field multiplication. We consider this a fair compromise, since the total number of squarings is small and the benefit of having a single metric far outweighs the importance of the small error. The cost functions are expressed in terms of the cost of the various tree procedures, which are detailed in Appendix A.
Cost function for KPS: Notice that the computation of $h_I$ is performed in both Algorithms 2 and 3, so instead we delegate this product tree and the corresponding reciprocal tree to KPS and perform them only once. The total cost of KPS is then reflected by obtaining $b$ multiples of an elliptic curve point (at a cost of 6 field multiplications each) for each of $\mathcal{I}$, $\mathcal{J}$, and $\mathcal{K}$, plus the cost of a product tree and reciprocal tree for $b$ linear polynomials. We obtain
$$\mathrm{Cost}_{\mathrm{KPS}}(b) = 18b + \mathrm{ProductTree}_1(b) + \mathrm{ReciprocalTree}(b) = 4b^{\log_2 3} + 2\log_2(b) + 16b - 2.$$
Cost function for xISOG: The degree-2 factors in $E_{0,J}$ and $E_{1,J}$ can be obtained at the cost of $5b$ field multiplications, and the complete polynomials are then obtained from two product trees with quadratic leaves. We then perform two residue trees to compute the two resultants of Algorithm 2, and another $2(b - 1)$ multiplications to compute $M_0$ and $M_1$. Algorithm 2 also requires an $\ell$-th power exponentiation at an average cost of $1.5\log_2(\ell) \approx 3(\log_2(b) + 1)$ multiplications, which must be performed separately on the numerator and denominator when working in projective coordinates. Finally, there are another 10 multiplications for the final two steps. We obtain
$$\mathrm{Cost}_{\mathrm{xISOG}}(b) = 5b + 2 \times \mathrm{ProductTree}_2(b) + 2 \times \mathrm{ResidueTree}(b) + 2(b - 1) + 2 \times 3(\log_2(b) + 1) + 10 = 18b^{\log_2 3} + 6\log_2(b) - 5b + 12.$$
Cost function for xEVAL: In this case, computing the factors for $E_{0,J}$ requires $10b$ field multiplications. However, only one product tree is required, since $E_{1,J}$ can be obtained by inverting the coefficient list of $E_{0,J}$, as noted in [25]. We then need two residue trees to compute the resultants in Algorithm 3, $2(2b - 1)$ multiplications to compute $M_0$ and $M_1$, and six more multiplications for the final step. The total is
$$\mathrm{Cost}_{\mathrm{xEVAL}}(b) = 10b + \mathrm{ProductTree}_2(b) + 2 \times \mathrm{ResidueTree}(b) + 2(2b - 1) + 6 = 15b^{\log_2 3} + 5b + 2.$$
Total cost function: The total cost, assuming that two points need to be pushed through the isogeny as is the case in CSIDH and SQISign, is
$$\mathrm{Cost}_{\mathrm{V\acute{e}lu}}(b) = \mathrm{Cost}_{\mathrm{KPS}}(b) + \mathrm{Cost}_{\mathrm{xISOG}}(b) + 2 \times \mathrm{Cost}_{\mathrm{xEVAL}}(b) = 52b^{\log_2 3} + 8\log_2(b) + 21b + 14.$$
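As a quick sanity check of the cost model, the closed form of the total must equal the sum of its three components. The script below (ours; the coefficients are our reading of the formulas in this section) verifies this for the degrees mentioned in the introduction:

```python
import math

log3 = math.log2(3)
cost_kps   = lambda b: 4 * b**log3 + 2 * math.log2(b) + 16 * b - 2
cost_xisog = lambda b: 18 * b**log3 + 6 * math.log2(b) - 5 * b + 12
cost_xeval = lambda b: 15 * b**log3 + 5 * b + 2
cost_velu  = lambda b: 52 * b**log3 + 8 * math.log2(b) + 21 * b + 14

for l in (587, 2239, 321193):            # degrees mentioned in Section 1
    b = math.sqrt(l - 1) / 2
    total = cost_kps(b) + cost_xisog(b) + 2 * cost_xeval(b)
    assert abs(total - cost_velu(b)) < 1e-9 * total
```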

3. Parallelizing Square-Root Vélu Formulas

In this section, we present our main results, which focus on the problem of parallelizing Vélu’s formulas for isogeny computations.
An immediate observation is that the two resultants that appear in each of Algorithms 2 and 3, which represent the heaviest part of the computation, are completely independent of each other. Therefore, there is a simple way to parallelize the two algorithms in a 2-core implementation by assigning one resultant to each core. We show that these algorithms exhibit a good degree of parallelizability even beyond two cores, by assigning more than one core per resultant. However, doing so requires a new indexing system that allows for a clear separation of the computation between cores, which we introduce in the first subsection.
Additionally, other parts of the algorithms are also apt for parallelization, such as the computation of $M_0$ and $M_1$, where a partial product can be assigned to each core, or the computation of product trees, where a subtree can be assigned to each core after a small amount of sequential work. We also show that the computation of point multiples in KPS can be parallelized with our new index system.
After detailing the new index system, we provide explicit new algorithms for the square-root Vélu formulas, assuming the availability of $n$ cores, where $n > 1$ is a power of two.

3.1. Construction of a New Index System

The main observation that allows us to parallelize the computation of resultants beyond two cores is that they exhibit a multiplicative property. More precisely, if $I$ can be written as the disjoint union
$$I = I_0 \sqcup I_1 \sqcup \cdots \sqcup I_{n/2 - 1},$$
then
$$\operatorname{Res}_Z(h_I, E_{i,J}) = \operatorname{Res}_Z(h_{I_0}, E_{i,J}) \times \operatorname{Res}_Z(h_{I_1}, E_{i,J}) \times \cdots \times \operatorname{Res}_Z(h_{I_{n/2 - 1}}, E_{i,J}),$$
so we can assign one subset $I_t$ to each of the cores, have them compute one subresultant each, and then multiply them all together. Doing so requires us to modify the sizes of $I$ and $J$: since each subresultant computation should have balanced sizes, we now require $\#J \approx \#I/(n/2)$. Accordingly, we now need $\#J \approx \sqrt{(\ell - 1)/(2n)}$ and $\#I \approx (n/2)\sqrt{(\ell - 1)/(2n)}$. We will design such an indexing system according to the following lemma.
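The multiplicative property is easy to check numerically. In this toy Python sketch (ours), $h_I$ is monic and splits with known roots, so the resultant with any $g$ reduces to a product of evaluations at the roots, and any partition of $I$ factors it into subresultants:

```python
p = 10007

def g(x):
    # arbitrary polynomial g(Z) = Z^3 + 5Z + 7, evaluated mod p
    return (x * x * x + 5 * x + 7) % p

def subresultant(roots):
    # Res_Z(h_roots, g) for monic h_roots = prod (Z - x_i)
    out = 1
    for x in roots:
        out = out * g(x) % p
    return out

I = [3, 14, 15, 92, 65, 35, 89, 79]      # toy roots of h_I
chunks = [I[0:2], I[2:4], I[4:6], I[6:8]]  # I_0, ..., I_3 (one per core)

full = subresultant(I)
prod = 1
for c in chunks:
    prod = prod * subresultant(c) % p    # multiply the subresultants
assert prod == full
```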
Lemma 3.
Let $n > 1$ be a power of two, $\ell$ an odd positive integer, and consider the set $S = \{1, 3, \ldots, \ell - 2\}$. If $b = \lfloor \sqrt{(\ell - 1)/(2n)} \rfloor$ and $b' = \lfloor (\ell - 1)/(2nb) \rfloor$, then $(I, J)$ is an index system for $S$, where
$$J := \{2j + 1 : 0 \le j < b\},$$
$$I = \bigcup_{t=0}^{n/2 - 1} I_t, \qquad I_t := \{2b(2t + 1) + 2bni : 0 \le i < b'\} \quad \text{for } 0 \le t < n/2.$$
Moreover, $I_t \cap I_{\tilde{t}} = \emptyset$ if and only if $t \neq \tilde{t}$.
Proof of Lemma 3. 
Suppose that
$$(2b(2t + 1) + 2bni_1) - (2b(2\tilde{t} + 1) + 2bni_2) = (2j_1 + 1) - (2j_2 + 1),$$
where $0 \le j_1, j_2 < b$, $0 \le i_1, i_2 < b'$, and the index pairs are not all equal ($t$ and $\tilde{t}$ can be equal). Then
$$4b(t - \tilde{t}) + 2bn(i_1 - i_2) = 2(j_1 - j_2),$$
so
$$2b\left( t - \tilde{t} + n(i_1 - i_2)/2 \right) = j_1 - j_2,$$
but this is impossible when the left-hand side is nonzero, since $|j_1 - j_2| < b < 2b \le 2b\,|t - \tilde{t} + n(i_1 - i_2)/2|$; and when it is zero, the constraints $|t - \tilde{t}| < n/2$ and $|i_1 - i_2| < b'$ force $t = \tilde{t}$, $i_1 = i_2$, and $j_1 = j_2$. Therefore, the map $I \times J \to \mathbb{Z}$ defined by $(i, j) \mapsto i + j$ is injective. Similarly, the map $(i, j) \mapsto i - j$ is injective, and both have disjoint images.
On the other hand, the sets $I + J$ and $I - J$ are both contained in $S$. It is clear that the elements in these sets are odd integers, and for all $0 \le t < n/2$, $0 \le j < b$, and $0 \le i < b'$ we have
$$(2b(2t + 1) + 2bni) + (2j + 1) \le 2b(2(n/2 - 1) + 1) + 2bn(b' - 1) + (2b - 1) = 2nbb' - 1 \le \ell - 2.$$
Similarly, $1 \le (2b(2t + 1) + 2bni) - (2j + 1)$ for all $0 \le t < n/2$, $0 \le j < b$, and $0 \le i < b'$, since the minimum element of $I$ is $2b$ (attained at $t = i = 0$) and the maximum element of $J$ is $2b - 1$ (attained at $j = b - 1$).    □
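Lemma 3 can be instantiated and checked directly. The following Python sketch (ours; the function name and the parameter reading $b = \lfloor\sqrt{(\ell-1)/(2n)}\rfloor$, $b' = \lfloor(\ell-1)/(2nb)\rfloor$ are our assumptions) verifies the index-system conditions for $\ell = 89$ and $n = 4$:

```python
import math

def lemma3_sets(l, n):
    b = math.isqrt((l - 1) // (2 * n))
    bp = (l - 1) // (2 * n * b)
    J = [2 * j + 1 for j in range(b)]
    I_t = [[2 * b * (2 * t + 1) + 2 * b * n * i for i in range(bp)]
           for t in range(n // 2)]
    return b, bp, J, I_t

l, n = 89, 4
b, bp, J, I_t = lemma3_sets(l, n)
I = [x for part in I_t for x in part]
S = set(range(1, l - 1, 2))
sums  = [i + j for i in I for j in J]
diffs = [i - j for i in I for j in J]
assert len(set(sums)) == len(I) * len(J)      # (i, j) -> i + j injective
assert len(set(diffs)) == len(I) * len(J)     # (i, j) -> i - j injective
assert set(sums).isdisjoint(diffs)            # disjoint images
assert set(sums) <= S and set(diffs) <= S     # I + J, I - J contained in S
assert I_t == [[6, 30, 54], [18, 42, 66]]     # b = b' = 3 in this case
```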
Note that we have chosen to partition $I$ into only $n/2$ subsets because, when $n$ cores are available, we assign $n/2$ cores to each of the two resultants. However, when computing point multiples during KPS we do have all $n$ cores available for working on the $n/2$ subsets. Therefore, it will be convenient to further divide each $I_t$ into two subsets to obtain a total of $n$ half-sized subsets. Note that some subsets will have fewer elements than others, because $\#I$ need not be a multiple of $n$. In practice, this is problematic because it is crucial for each core to perform exactly analogous tasks to guarantee performance. In order to achieve a balanced computation, we will add additional elements to the subsets that are lacking, ensuring that each subset contains the same number of elements, while ignoring the redundant multiples in later steps. This idea is inspired by (Example 3.7, [24]) and the parallelization technique proposed by Costello in [36] to compute the set $\{[i]P : 1 \le i \le d\}$. We provide an example to illustrate this approach.
Example 1.
Let $\ell = 89$ and $n = 4$, so that $b = b' = 3$. From Lemma 3 we obtain
$$J := \{2j + 1 : 0 \le j < 3\}, \qquad I = \bigcup_{t=0}^{1} I_t,$$
$$I_t := \{2b((2t + 1) + ni) : 0 \le i < 3\} \quad \text{for all } 0 \le t < 2.$$
So $I_0 = \{2b, 10b, 18b\}$ and $I_1 = \{6b, 14b, 22b\}$. Instead of this, we can consider the following sets:
$$II_t := \{2b((2t + 1) + 2ni) : 0 \le i < 2\} \quad \text{for all } 0 \le t < 4.$$
So $II_0 = \{2b, 18b\}$, $II_1 = \{6b, 22b\}$, $II_2 = \{10b, 26b\}$, and $II_3 = \{14b, 30b\}$.
We will now demonstrate that the partitioning of the set I, without taking into account the additional elements added to ensure equal-sized subsets, still forms an index system.
Proposition 1.
Assume the same conditions as in Lemma 3, and let $c = \lceil b'/2 \rceil$ with $b' > 2$. If
$$II_t := \{2b(2t + 1) + 2bn(2i) : 0 \le i < c\} \quad \text{for all } 0 \le t < n,$$
$$I' := \bigcup_{t=0}^{n-1} II_t \setminus \bigcup_{t=n/2}^{n-1} \{w_t\}, \qquad \text{with } w_t := 2b(2t + 1 + 2n(c - 1)) \cdot (b' \bmod 2),$$
then $I' = I$. In particular, $(I', J)$ is an index system for $S$.
Proof of Proposition 1. 
If $b'$ is an even integer, it is clear that $II_t \subseteq I$ for all $0 \le t < n$, and thus $\bigcup_{t=n/2}^{n-1} \{w_t\} = \{0\}$. Note that $0 \le 2i \le 2c - 2 = b' - 2$, and considering that $\bigcup_{t=0}^{n-1} II_t$ and $I$ have the same cardinality, we conclude that they are equal. If $b'$ is an odd integer, for all $0 \le t < n/2$ we have $II_t \subseteq I$. When $n/2 \le t < n$, dropping the last element $w_t = 2b(2t + 1 + 2n(c - 1))$ from each of $II_{n/2}, \ldots, II_{n-1}$, we have $II_t \setminus \{w_t\} \subseteq I$. Note that the cardinality of $\bigcup_{t=n/2}^{n-1} \{w_t\}$ is $n/2$ and the cardinality of $\bigcup_{t=0}^{n-1} II_t$ is $nc = n(b' + 1)/2 = nb'/2 + n/2 = \#I + n/2$.    □
Similarly, we consider a partition of $K = S \setminus (I \pm J) = \{2nbb' + 1, 2nbb' + 3, \ldots, \ell - 2\}$ into $n$ sets $K_0, \ldots, K_{n-1}$ with $d = \lfloor \#K/n \rfloor$ elements each:
$$K_t = \{2nbb' + 2t + 1 + 2nk : 0 \le k < d\}, \qquad 0 \le t < n.$$
However, when $\#K$ is not divisible by $n$, we add one additional element to each of the sets $K_{\#K \bmod n}, \ldots, K_{n-1}$, ensuring that each subset contains the same number of elements.
The additional elements that were added to balance the size of the sets are not taken into account when calculating the corresponding products for each set.
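Continuing with the toy parameters of Example 1 ($\ell = 89$, $n = 4$, so $b = b' = 3$), this short Python sketch (ours) builds the sets $K_t$ and checks that they partition $K$; here $\#K = 8$ is divisible by $n$, so no padding elements are needed:

```python
l, n, b, bp = 89, 4, 3, 3
K = set(range(2 * n * b * bp + 1, l - 1, 2))      # K = {73, 75, ..., 87}
d = len(K) // n                                   # elements per core
K_t = [{2 * n * b * bp + 2 * t + 1 + 2 * n * k for k in range(d)}
       for t in range(n)]
assert set().union(*K_t) == K                     # the K_t partition K
assert all(len(s) == d for s in K_t)              # balanced workloads
```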

3.2. Parallelized Algorithms

We now describe the construction of parallelized versions of algorithms KPS, xISOG, and xEVAL assuming that n-cores are available, algorithms which we refer to as KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel.
The KPS-Parallel algorithm, presented as Algorithm 4, is based on the new index system from Proposition 1: instead of computing the sets I and K as in the original algorithm, KPS-Parallel computes the sets II_t and K_t, as well as the polynomial product tree for each h_{II_t}, by assigning one core to each value of 0 ≤ t < n. Note that the computation of the resultants requires a reciprocal tree for each of the h_{I_t}, so the polynomials {h_{II_t} : 0 ≤ t < n} are multiplied pair-wise to obtain the polynomials {h_{I_t} : 0 ≤ t < n/2}. We then compute the reciprocal trees, which are built from the root down: the root is computed for each tree using n/2 cores, and then the two subtrees of each tree are computed in parallel, using two cores per tree, for a total of n cores. Note that although the computations for the root of the product and reciprocal trees, corresponding to lines 11 and 12 of Algorithm 4, should only be run with n/2 cores, we instead have all cores running, with half of them repeating the work over dummy arrays I_t for t ≥ n/2. These computations are of course redundant, but they ensure that the workload is balanced.
For the algorithms xISOG-Parallel and xEVAL-Parallel, the set J is partitioned into n subsets J_t, and each core computes the polynomial product tree corresponding to one subset. The roots of the n trees are then multiplied together sequentially to obtain the full polynomial E_{i,J}. Next, each core computes one of the subresultants Res_Z(h_{I_t}, E_{i,J}) using the reciprocal trees from KPS-Parallel, and the subresultants are multiplied sequentially at the end to obtain the two principal resultants. As for the computations related to K, the set is likewise split into subsets K_t so that each core can compute a subproduct of M_i, and the subproducts are multiplied together at the end.
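The split–compute–combine pattern just described can be sketched in a few lines. The following Python illustration (not the paper's C/OpenMP code; the modulus is an arbitrary toy choice) mirrors how each core forms one subproduct of M_i and the n partial results are multiplied together sequentially at the end:

```python
# Sketch of the split-compute-combine pattern of xISOG-Parallel /
# xEVAL-Parallel (illustrative Python, not the paper's C/OpenMP code).
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

p = 2**127 - 1                    # toy modulus, chosen only for illustration

def chunk_product(chunk):
    # each "core" multiplies its own chunk of factors
    return reduce(lambda a, x: a * x % p, chunk, 1)

def parallel_product(factors, n):
    size = -(-len(factors) // n)                        # ceil(len/n)
    chunks = [factors[t*size:(t+1)*size] for t in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(chunk_product, chunks))
    # sequential combine step, as in lines 9-12 of Algorithms 5 and 6
    return reduce(lambda a, x: a * x % p, partials, 1)

factors = [3 + 7*k for k in range(1000)]
assert parallel_product(factors, 4) == chunk_product(factors)
```

The sequential combine is exactly the part whose cost does not shrink with n, which is the source of the diminishing returns analyzed in Section 3.4.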
In the next subsection, we present a detailed analysis that estimates the cost of computing the image curve of an isogeny and the evaluation of a point using the procedures KPS-Parallel+ xISOG-Parallel+ xEVAL-Parallel.
Algorithm 4 KPS-Parallel.
Inputs: 
An elliptic curve E_A/F_q; P ∈ E_A(F_q) of order an odd prime ℓ; and a number of cores n, which is a power of 2
Output: 
Integers b and d; arrays of field elements J and K; polynomials h_{I_0}, …, h_{I_{n/2−1}}; and reciprocal trees rtree_{h_{I_0}}, …, rtree_{h_{I_{n/2−1}}}
1: b ← ⌈√((ℓ−1)/(2n))⌉; b′ ← ⌈(ℓ−1)/(2nb)⌉; d ← (ℓ−1)/2 − nbb′
2: b′ ← ⌈b′/2⌉; d ← ⌈d/n⌉    // d/n > 2 and b′/2 > 1
3: J ← {x([m]P) : m ∈ {2j+1 : 0 ≤ j < b}}
4: B_I ← {x([m]P) : m ∈ {2b(2t+1) : 0 ≤ t < n}}
5: B_K ← {x([m]P) : m ∈ {2(t+1) : 0 ≤ t < n}}
6: for t ∈ {0, 1, …, n−1} in parallel do
7:   II_t ← {B_I[t] + x([m]P) : m ∈ {2b(2ni) : 0 ≤ i < b′}}
8:   K_t ← {B_K[t] + x([m]P) : m ∈ {2nk : 0 ≤ k < d}}
9:   h_{II_t} ← ∏_{x_i ∈ II_t} (Z − x_i)
10:  barrier    // synchronize cores
11:  h_{I_t} ← h_{II_t} · h_{II_{t+n/2}}
12:  rtree_{h_{I_t}} ← ReciprocalTreeRoot(h_{I_t})    // fills in only the root node
13:  if t < n/2 then
14:    rtree_{h_{I_t}} ← ReciprocalLeftSubtree(h_{I_t})    // fills in only the left subtree
15:  else
16:    rtree_{h_{I_{t−n/2}}} ← ReciprocalRightSubtree(h_{I_{t−n/2}})    // fills in only the right subtree
17:  end if
18: end for
19: K ← K_0 ∪ ⋯ ∪ K_{n−1}
20: return b, d, J, K, h_{I_0}, …, h_{I_{n/2−1}}, and rtree_{h_{I_0}}, …, rtree_{h_{I_{n/2−1}}}

3.3. Cost Analysis of the Parallel Square-Root Vélu

In this section, we provide an estimate of the cost that is expected when performing the KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel procedures, analogous to the one presented in Section 2.3 for the sequential algorithms. Recall that for our new index system from Section 3.1 we have #J = ⌈√((ℓ−1)/(2n))⌉, where ℓ is the degree of the isogeny and n represents the number of cores available, while each of the I_t subsets has a size of b′ := ⌈(ℓ−1)/(2n(#J))⌉, for a total size of #I = (n/2)·b′. It follows that
#K := (ℓ−1)/2 − 2(#I)(#J)
   ≤ (ℓ−1)/2 − 2·(√((ℓ−1)/(2n)) − 1)·(n/2)·(√((ℓ−1)/(2n)) − 1)
   = √(2n(ℓ−1)) − n.
As before, we will ignore rounding errors and assume that #K takes the middle value of this range (neglecting the second term), so that #I ≈ nb/2, #J ≈ b, and #K ≈ nb for b = √((ℓ−1)/(2n)).
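As a numerical illustration of these approximations (with a hypothetical degree and core count chosen for this sketch), one can check that the three sets together account for almost the whole set S of size (ℓ−1)/2:

```python
# Illustrative sizes for the approximations above (hypothetical degree):
# b ~ sqrt((l-1)/(2n)), #I ~ n*b/2, #J ~ b, #K ~ n*b, so that
# 2(#I)(#J) + #K should account for almost all of the (l-1)/2 elements of S.
from math import isqrt

l, n = 321193, 8                      # sample degree and core count
b = isqrt((l - 1) // (2 * n))         # ~ sqrt((l-1)/(2n))
I, J, K = n * b // 2, b, n * b
covered = 2 * I * J + K               # elements of S accounted for
assert abs(covered - (l - 1) // 2) / ((l - 1) / 2) < 0.02
```

The small residual comes from the rounding errors that the analysis deliberately neglects.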
Proposition 2.
The cost of computing the parallel Algorithm KPS-Parallel with n cores is
Cost_KPS-Parallel(b, n) = (8/3)·b^{log₂(3)} + 4log₂(b) + 15b + 12n − 18,
field multiplications.
Proof of Proposition 2. 
In order to find all the point multiples in K, we first compute the first n multiples sequentially; then core t starts from the t-th multiple and takes steps of size n. This leads to a wall-clock time of n + #K/n − 1 point operations, which involve six field multiplications each. A similar approach is taken for computing all the multiples in the II_t sets, while the ones in J are computed sequentially, as this is the smallest of the sets. A product tree is constructed for each of the h_{II_t} in parallel, and line 11 involves multiplying two polynomials of degree b/2. As for the reciprocal trees, we incur the cost of the root node and then just half of the remaining cost, since there is one core working on each subtree, so the cost of this stage is (Reciprocal(b) + ReciprocalTree(b))/2, where Reciprocal(b) = b^{log₂(3)} + b + 2log₂(b) − 2 is the cost of the root node alone. The total cost is then
Cost_KPS-Parallel(b, n) = 6(n + b − 1) + 6(n + b/2 − 1) + 6b + ProductTree₁(b/2) + PolyMul(b/2) + (Reciprocal(b) + ReciprocalTree(b))/2,
and the proposition follows after considering the cost functions in the Appendix A.   □
We now focus on Algorithm 5 (xISOG-Parallel), which computes the coefficient of the image curve, and on Algorithm 6 (xEVAL-Parallel), which computes the evaluation of the isogeny at a specific point.
Algorithm 5 xISOG-Parallel.
Inputs: 
An elliptic curve E_A/F_q : y² = x³ + Ax² + x; P ∈ E_A(F_q) of order an odd prime ℓ; b, d, J, K, h_{I_0}, …, h_{I_{n/2−1}}, and rtree_{h_{I_0}}, …, rtree_{h_{I_{n/2−1}}} from KPS-Parallel; and a number of cores n, which is a power of 2.
Output: 
A′ ∈ F_q such that E_{A′}/F_q : y² = x³ + A′x² + x is the image curve of a separable isogeny with kernel ⟨P⟩
1: for t ∈ {0, 1, …, n−1} in parallel do
2:   J_t ← {x([j]P) ∈ J : t⌈b/n⌉ ≤ j < ⌈b/n⌉ + t⌈b/n⌉}
3:   E_{0,J_t} ← ∏_{x_j ∈ J_t} (F₀(Z, x_j) + F₁(Z, x_j) + F₂(Z, x_j))
4:   E_{1,J_t} ← ∏_{x_j ∈ J_t} (F₀(Z, x_j) − F₁(Z, x_j) + F₂(Z, x_j))
5:   K_t ← {x([k]P) ∈ K : td ≤ k < d + td}
6:   M_{0,t} ← ∏_{x_k ∈ K_t} (1 − x_k)
7:   M_{1,t} ← ∏_{x_k ∈ K_t} (−1 − x_k)
8: end for
9: E_{0,J} ← E_{0,J_0} × ⋯ × E_{0,J_{n−1}} ∈ F_q[Z]
10: E_{1,J} ← E_{1,J_0} × ⋯ × E_{1,J_{n−1}} ∈ F_q[Z]
11: M₀ ← M_{0,0} × ⋯ × M_{0,n−1} ∈ F_q
12: M₁ ← M_{1,0} × ⋯ × M_{1,n−1} ∈ F_q
13: for t ∈ {0, 1, …, n−1} in parallel do
14:   if t < n/2 then
15:     R_{0,t} ← Res_Z(h_{I_t}, E_{0,J}, rtree_{I_t})
16:   else
17:     R_{1,(t−n/2)} ← Res_Z(h_{I_{t−n/2}}, E_{1,J}, rtree_{I_{t−n/2}})
18:   end if
19: end for
20: R₀ ← R_{0,0} × ⋯ × R_{0,n/2−1} ∈ F_q
21: R₁ ← R_{1,0} × ⋯ × R_{1,n/2−1} ∈ F_q
22: d′ ← ((A − 2)/(A + 2))^ℓ · (M₀R₀/(M₁R₁))^8
23: return 2(1 + d′)/(1 − d′)
Algorithm 6 xEVAL-Parallel.
Inputs: 
An elliptic curve E_A/F_q : y² = x³ + Ax² + x; P ∈ E_A(F_q) of order an odd prime ℓ; α = x(Q) ≠ 0 for a point Q ∈ E_A(F_q) \ ⟨P⟩; b, d, J, K, h_{I_0}, …, h_{I_{n/2−1}}, and rtree_{h_{I_0}}, …, rtree_{h_{I_{n/2−1}}} from KPS-Parallel; and a number of cores n, which is a power of 2.
Output: 
x(φ(Q)), where φ is a separable isogeny with kernel ⟨P⟩
1: for t ∈ {0, 1, …, n−1} in parallel do
2:   J_t ← {x([j]P) ∈ J : t⌈b/n⌉ ≤ j < ⌈b/n⌉ + t⌈b/n⌉}
3:   E_{0,J_t} ← ∏_{x_j ∈ J_t} (F₀(Z, x_j)α² + F₁(Z, x_j)α + F₂(Z, x_j))
4:   E_{1,J_t} ← ∏_{x_j ∈ J_t} (F₀(Z, x_j)α^{−2} + F₁(Z, x_j)α^{−1} + F₂(Z, x_j))
5:   K_t ← {x([k]P) ∈ K : td ≤ k < d + td}
6:   M_{0,t} ← ∏_{x_k ∈ K_t} (α − x_k)
7:   M_{1,t} ← ∏_{x_k ∈ K_t} (α^{−1} − x_k)
8: end for
9: E_{0,J} ← E_{0,J_0} × ⋯ × E_{0,J_{n−1}} ∈ F_q[Z]
10: E_{1,J} ← E_{1,J_0} × ⋯ × E_{1,J_{n−1}} ∈ F_q[Z]
11: M₀ ← M_{0,0} × ⋯ × M_{0,n−1} ∈ F_q
12: M₁ ← M_{1,0} × ⋯ × M_{1,n−1} ∈ F_q
13: for t ∈ {0, 1, …, n−1} in parallel do
14:   if 0 ≤ t < n/2 then
15:     R_{0,t} ← Res_Z(h_{I_t}, E_{0,J}, rtree_{I_t})
16:   else
17:     R_{1,(t−n/2)} ← Res_Z(h_{I_{t−n/2}}, E_{1,J}, rtree_{I_{t−n/2}})
18:   end if
19: end for
20: R₀ ← R_{0,0} × ⋯ × R_{0,n/2−1} ∈ F_q
21: R₁ ← R_{1,0} × ⋯ × R_{1,n/2−1} ∈ F_q
22: return M₀R₀Q_x² / (M₁R₁Q_z²)
Proposition 3.
The cost of computing the parallel Algorithm xISOG-Parallel  with n cores is
Cost_xISOG-Parallel(b, n) = 6b^{log₂(3)} + 3(n² − n + 2)·(b/n)^{log₂(3)} + 3log₂(b√n) − b − b/n + (5/2)n + 11/2,
field multiplications.
Proof of Proposition 3. 
We begin by considering the cost of computing the polynomials E_{0,J} and E_{1,J} in Algorithm 5, where J comes from KPS-Parallel. We partition J into the subsets J_0, …, J_{n−1} of size ⌈b/n⌉ each, and compute one sub-product tree E_{0,J_t} per subset J_t concurrently. The cost of computing the factors of each subset is given by 5(b/n) field multiplications; then a product tree is computed for b/n quadratic polynomials on each core, and the tree roots are multiplied together sequentially to obtain the complete polynomial E_{0,J}. This last step involves multiplying together n polynomials of degree 2b/n each, which we denote as LinearProduct_n(2b/n). An identical procedure is used to compute E_{1,J} as well.
Using the reciprocal trees rtree_{h_{I_0}}, …, rtree_{h_{I_{n/2−1}}} from KPS-Parallel, each core then computes a subresultant using a residue tree of size b, and the subresultants are multiplied sequentially at a cost of n/2 − 1 multiplications per resultant.
For each of the M_i, each core computes a subproduct of size b at a cost of b − 1 multiplications, and then combining the subproducts takes another n − 1 multiplications.
Finally, we use two cores to compute the ℓ-th power exponentiation in Algorithm 5 for the numerator and denominator concurrently, at a cost of about 1.5·log₂(ℓ) ≈ 3/2 + 3log₂(b) + 3log₂(n)/2 field multiplications (using ℓ ≈ 2nb²), and the last few products take 10 more multiplications.
The total cost is then
Cost_xISOG-Parallel(b, n) = 5b/n + 2·ProductTree₂(b/n) + 2·LinearProduct_n(2b/n) + ResidueTree(b) + n/2 − 1 + 2·(b + n − 2) + 3/2 + 3log₂(b) + 3log₂(n)/2 + 10,
and the proposition follows after considering the cost functions in the Appendix A.    □
Proposition 4.
The cost of computing the parallel Algorithm xEVAL-Parallel with n cores is
Cost_xEVAL-Parallel(b, n) = 6b^{log₂(3)} + (3/2)(n² − n + 2)·(b/n)^{log₂(3)} + b + 7b/n + (5/2)n,
field multiplications.
Proof of Proposition 4. 
The proof of the current proposition follows a similar structure to that of the previous one, with the main distinction lying in the polynomials E_{0,J}, E_{1,J}, M_{0,t}, and M_{1,t}. As before, the factors of E_{0,J} can be computed with a cost of 10(b/n), and only one product tree is needed since E_{1,J} is obtained at no additional cost. For each of M₀ and M₁, each core is still working with a subset of size b, but now there is a cost of b scalar multiplications for computing the factors in a subset, b − 1 for the subproduct of each subset, and n − 1 for combining the subproducts. There is no exponentiation in this case, and the final steps require only six additional multiplications. The total cost is then
Cost_xEVAL-Parallel(b, n) = 10b/n + ProductTree₂(b/n) + LinearProduct_n(2b/n) + ResidueTree(b) + n/2 − 1 + 2·(2b + n − 2) + 6,
and the proposition follows after considering the cost functions in the Appendix A.    □
The next theorem summarizes our cost analysis, and follows immediately from the previous propositions.
Theorem 1.
The expected total cost of the algorithms KPS-Parallel, xISOG-Parallel, and twice xEVAL-Parallel, expressed in field multiplications and assuming that two points need to be pushed through the isogeny, is given by
Cost_Vélu-Parallel(b, n) = Cost_KPS-Parallel(b, n) + Cost_xISOG-Parallel(b, n) + 2·Cost_xEVAL-Parallel(b, n)
= (62/3)·b^{log₂(3)} + 6(n² − n + 2)·(b/n)^{log₂(3)} + 7log₂(b) + 13b/n + 16b + (3/2)·log₂(n) + (39/2)·n − 25/2.
Remark 4.
While it may be tempting to compare (6) directly to (5), a direct comparison would be incorrect since the definitions of b (the size of the set J) differ in each case. The fair comparison is in terms of ℓ, where (5) has b = √(ℓ−1)/2, whereas (6) has b = √((ℓ−1)/(2n)).

3.4. Asymptotic Speedup

Although working with large degrees should help to amortize the cost of the parallelization overhead, we stress that combining the work from each core (for example, joining together the n product subtrees and working up to the root) incurs a sequential cost that is not subdominant, and so perfect parallelization is not possible even in a theoretical setting and in the limit. Concretely, we provide the following result:
Proposition 5.
Using n cores, the speedup of KPS + xISOG + 2×xEVAL for large degrees ℓ is
26√3 · n^{(3/2)·log₂(3)} / (9n² + 31n^{log₂(3)} − 9n + 18).
Proof. 
The result follows by taking the ratio of (5) and (6) and letting ℓ → ∞, with Remark 4 taken into account.    □
The above result demonstrates the decreasing returns from each additional core, as plotted in Figure 4, and suggests that our algorithms work best with a modest number of cores (2–8) and are not apt for large-scale parallelization architectures like those of GPUs.
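Under our reading of the constants in Proposition 5, the limit can be evaluated directly for small core counts. The sketch below (illustrative only) checks the qualitative behavior just described: the speedup stays below n, and the per-core utilization decreases as cores are added:

```python
# Evaluating the asymptotic speedup expression of Proposition 5 for a few
# core counts (illustrative; constants as transcribed in this section).
from math import log2, sqrt

def speedup(n):
    k = log2(3)
    return 26 * sqrt(3) * n**(1.5 * k) / (9*n**2 + 31*n**k - 9*n + 18)

# Speedup grows with n but never reaches perfect parallelization...
assert speedup(2) < speedup(4) < speedup(8)
assert all(speedup(n) < n for n in (2, 4, 8))
# ...and the per-core utilization (efficiency) is strictly decreasing.
assert speedup(2)/2 > speedup(4)/4 > speedup(8)/8
```

This quantifies the diminishing returns visible in Figure 4.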

4. Experimental Results

We implemented our parallel version of the square-root Vélu algorithm in the C programming language, leveraging OpenMP for parallelization. This implementation has been integrated into a fork of the SQALE’d CSIDH library [13], where we can measure the savings due to parallelism for isogenies of each prime degree | p + 1 . In order to visualize the progression of the performance with the isogeny degree, we have incorporated a new 1792-bit prime
p_test := 0x4FA4E8C57C4EFF02D0650FE3AFC0413536E72253101645B1387DA2DD519C17FBEC69843C04DEAEA2DB59CDDDA7876B514C101A1DF0D96778BD72A3C51844BB0196F73F1DDBFEC980A4BB3B200A4E618C54621ADB35B5E4B0545F5BE025D63BC914AB11A882AD78B6203C57A31031B98B6C104DC99AC9A4532DEC0C2930F8AE51B008E4BA6D26E56C736D3C067C8F2DFDF7F8206B444A42D39E0F4D82FF3FD0EB1DF44B31DDCDE876E658489E1CA359DAF5868A6C22E8455B4A4F7679F62B0C30D8883D2B79931C19E4737C3CC33056461E9C96A175D94B594B2A9EAB4B6B6303,
where p_test + 1 = 4·∏_{i=0}^{107} ℓ_i for odd primes ℓ_i that are roughly evenly spaced between ℓ₀ = 19 and ℓ₁₀₇ = 321,193 (the degree of the isogeny in the attack of [5]). This prime was chosen so that isogenies of each degree ℓ_i can be evaluated using only F_{p_test} arithmetic (for which we also provide an assembly code implementation). As previously mentioned, this model most closely resembles the scenario in CSIDH, as opposed to applications where isogenies are performed over quadratic fields or larger extensions. However, we make this choice for concreteness while still expecting our results to extrapolate well to other scenarios, since the main savings come from reducing the number of field multiplications by a factor that is agnostic to the field itself. Our library can be found at https://github.com/TheParallelSquarerootVeluTeam/Parallel-Squareroot-Velu (accessed on 9 January 2024). The benchmark results presented in this section are from experiments conducted using an Intel(R) Xeon(R) Gold 6230R processor equipped with 26 physical cores, operating at 2.10 GHz. Turbo Boost and Hyper-Threading technologies were disabled during these experiments.
Our experimental results are illustrated in Figure 5. We measured the total computational time for executing KPS-Parallel, xISOG-Parallel, and two iterations of xEVAL-Parallel for each of the prime degrees ℓ_i using two, four, and eight cores, and compared it to the computational time of the original sequential implementation of [13]. In Figure 5, we compare the observed speedup factors against the theoretical speedup obtained from (5) and (6). The experimental speedup can be seen to achieve levels close to its asymptotic value starting from degrees around 10,000, although we do observe a slight deviation from the theoretical expectation, which becomes important for the eight-core implementation. Since a large part of the computation is spent computing different kinds of trees, with all cores writing simultaneously to some array, a plausible explanation for the degrading performance is the competition for memory accesses. While our implementation is only a proof of principle, it is possible that a more sophisticated implementation could achieve significantly better results for eight cores (or more) by carefully managing the distribution and alignment of the memory that each core accesses.
To showcase our results in a more practical setting, we also implemented and benchmarked our parallelized square-root Vélu algorithms in the context of SQALE’d CSIDH-9216. In its dummy-free variant, SQALE’d CSIDH-9216 performs the operations corresponding to KPS + xISOG + 2xEVAL exactly once for each of the 333 smallest odd primes (excluding 263), over a 9216-bit prime field. The experimental timings for each of the three routines, summed over all the degrees, are shown in Table 1. Our two-core implementation achieves speedup factors of 1.23, 1.70, and 1.92 for KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel, respectively, while for the four-core implementation we obtained acceleration factors of 1.42, 2.27, and 3.02. While these accelerations span the entirety of the Vélu-related computations in the protocol, it is important to stress that they cannot be translated easily into an acceleration for the protocol as a whole, since the protocol includes other non-negligible computations (most notably elliptic curve point scalar multiplications).

5. Discussion

Since KPS-Parallel performs some of its computational tasks sequentially, while others only benefit from half of the total number of cores, it is not surprising that its speedup factor is the lowest of the three square-root Vélu routines. Nonetheless, as illustrated by Table 1, this is not too important, as its total cost is comparatively small. Fortunately, the largest accelerations come from xEVAL-Parallel, which, apart from being the most expensive of the three, is often required to be evaluated multiple times for different points.
Both our experimental and theoretical results show that parallelizing with at least two cores yields strong improvements, which are important considering that they can be combined with other improvements such as vectorized field arithmetic. On the downside, it is also noticeable that the speedup factors in our implementations do not scale well when transitioning to higher numbers of cores. From eight cores onward, even the maximum speedup derived from the theoretical estimates suggests that the utilization that can be achieved may not be attractive for practical applications. The intuition behind these diminishing returns on the parallelization effectiveness is easy to explain, as the size of the resultants in Algorithms 5 and 6 only decreases as 1/√n.

Author Contributions

Conceptualization, J.C.-S., O.O. and A.P.-M.; methodology, J.C.-S. and O.O.; software, J.C.-S. and O.O.; validation, J.C.-S., O.O. and A.P.-M.; formal analysis, J.C.-S., O.O. and A.P.-M.; investigation, O.O.; resources, A.P.-M.; data curation, O.O.; writing—original draft preparation, O.O. and A.P.-M.; writing—review and editing, J.C.-S., O.O. and A.P.-M.; visualization, O.O.; supervision, J.C.-S. and A.P.-M.; project administration, J.C.-S., O.O. and A.P.-M.; funding acquisition, J.C.-S. and A.P.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/TheParallelSquarerootVeluTeam/Parallel-Squareroot-Velu (accessed on 10 January 2024).

Acknowledgments

This work started when J.C. and O.O. were doing an internship at the Technology Innovation Institute (TII) under the guidance of Rodríguez-Henríquez F. We thank TII for sponsoring this internship. We thank ANID for the study scholarship to O.O., grant number 21190301. We also thank Chi-Domínguez J. and Zamarripa-Rivera L. for valuable discussion on an early version of this manuscript. Additionally, this work has received partial funding to facilitate the use of a server in CINVESTAV-IPN in Mexico which was used for our tests.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Cost Functions for Polynomial Operations

In this section, we detail the costs in field multiplications associated with various operations in both the sequential and parallel versions of square-root Vélu.
Polynomial Multiplication: The basic step is the multiplication of two polynomials of equal degree d, which we denote as P o l y M u l ( d ) . We use a Karatsuba strategy for polynomial multiplication, so the cost in field multiplications can be expressed as in [37]
PolyMul(d) = d^{log₂(3)}.
Multiplication of Multiple Polynomials: Suppose we want to multiply together m polynomials of degree d each. If m is small, we opt for multiplying the polynomials one by one: in step 1 we are multiplying together two polynomials of degree d, and in step t for 1 < t < m we are multiplying a polynomial of degree t d by another of degree d. By partitioning the larger polynomial, we can perform this multiplication via t polynomial multiplications of degree d by d, so the total cost is
LinearProduct_m(d) = Σ_{t=1}^{m−1} t · PolyMul(d) = (m(m−1)/2) · PolyMul(d) = m(m−1)·d^{log₂(3)}/2.
On the other hand, if m is large then we opt for a product tree strategy, where the input polynomials are placed at the leaves of the tree and each node contains the product of its two child nodes. We assume for simplicity that m is a power of 2. At level i (where i = 0 corresponds to the root), we need to compute 2 i nodes, each of which is a product of two polynomials of degree m d / 2 i + 1 . The total cost is therefore
ProductTree_d(m) = Σ_{i=0}^{log₂(m)−1} 2^i · PolyMul(md/2^{i+1}) = Σ_{i=0}^{log₂(m)−1} 2^i · (md/2^{i+1})^{log₂(3)} = (md)^{log₂(3)} − m·d^{log₂(3)}.
We only use product trees for computing h I and E i , J which have leaves of degree d = 1 and d = 2 , respectively, so we are only interested in the special cases
ProductTree₁(m) = m^{log₂(3)} − m,    ProductTree₂(m) = 3m^{log₂(3)} − 3m.
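For concreteness, a minimal product tree can be sketched as follows. This is an illustration of the data structure only (naive schoolbook multiplication over the integers), not the Karatsuba-based finite-field implementation whose costs are counted above:

```python
# A minimal product tree (illustrative sketch): leaves are the linear
# polynomials Z - x_i, stored as ascending coefficient lists, and each
# internal node is the product of its two children.
def poly_mul(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, c in enumerate(g):
            out[i + j] += a * c
    return out

def product_tree(leaves):
    level, tree = leaves[:], [leaves[:]]
    while len(level) > 1:
        level = [poly_mul(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                  # tree[-1][0] is h_I = prod (Z - x_i)

roots = [2, 3, 5, 7]
h = product_tree([[-x, 1] for x in roots])[-1][0]
# (Z-2)(Z-3)(Z-5)(Z-7) = Z^4 - 17Z^3 + 101Z^2 - 247Z + 210
assert h == [210, -247, 101, -17, 1]
```

The number of leaves is assumed to be a power of 2, as in the cost analysis above.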
Reciprocal tree: Computing the resultants requires the aid of a reciprocal tree, which is built from the root down using the product tree. If a node in the product tree contains a polynomial F of degree m, then the corresponding node in the reciprocal tree contains (F′, c), where F′ is a polynomial of degree m and c is a constant such that rev_m(F)·F′ = c mod x^m. Here, rev_m(·) denotes the polynomial with the list of coefficients in reverse order: if F = Σ_{i=0}^m F_i x^i, then rev_m(F) = Σ_{i=0}^m F_{m−i} x^i.
We only use the case of the product tree of a polynomial f that has b leaves of degree 1. For the root of the tree, we need to know the cost of obtaining from scratch rev_b(f)^{−1} mod x^b, where f has degree b. As pointed out in (Section 6.4, [25]), if we have already computed a reciprocal (f₀, c₀) of half the precision (that is, rev_b(f)·f₀ = c₀ mod x^{b/2}), then the inverse modulo x^b can be obtained as (f′, c) with
f′ = c₀f₀ − (rev_b(f)·f₀ − c₀)·f₀ mod x^b,    c = c₀².
These equations can be evaluated at the cost of b/2 + 1 field multiplications (for multiplying f₀ by the scalar c₀), plus the cost of two polynomial multiplications modulo x^{b/2} (because the term in parentheses is known to have null coefficients for all powers less than b/2), plus one squaring. This leads to the recursive formula for the cost of the reciprocal
Reciprocal(b) = Reciprocal(b/2) + b/2 + 1 + 2·PolyMul(b/2) + 1 = Reciprocal(b/2) + b/2 + 2·(b/2)^{log₂(3)} + 2.
The base case (finding a multiple of the reciprocal of a constant) is trivial, so we obtain
Reciprocal(b) = Σ_{i=1}^{log₂(b)} (b/2^i + 2·(b/2^i)^{log₂(3)} + 2) = b^{log₂(3)} + b + 2log₂(b) − 2.
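The doubling step above can be exercised directly over a small prime field. The following sketch (arbitrary small parameters chosen for illustration; schoolbook truncated multiplication rather than Karatsuba) maintains the invariant rev_b(f)·f′ = c mod x^k while k doubles:

```python
# Illustrative Newton doubling for the reciprocal over GF(p): maintain a
# pair (f, c) with rev_f * f = c mod x^k, doubling k until k = b.
p = 101

def mul_mod(f, g, k):                          # f*g mod (x^k, p)
    out = [0] * k
    for i, a in enumerate(f[:k]):
        for j, c in enumerate(g[:k - i]):
            out[i + j] = (out[i + j] + a * c) % p
    return out

def reciprocal(rf, b):
    f, c, k = [1], rf[0] % p, 1                # base case: rf * 1 = c mod x
    while k < b:
        k *= 2
        t = mul_mod(rf, f, k)                  # rf*f = c + (terms above x^(k/2))
        t[0] = (t[0] - c) % p                  # rf*f - c
        corr = mul_mod(t, f, k)                # (rf*f - c)*f
        f = [(c * a - e) % p
             for a, e in zip(f + [0] * (k - len(f)), corr)]
        c = c * c % p
    return f, c

rev_f = [3, 7, 2, 9, 1, 4, 6, 5]               # rev_b(f) for b = 8 (arbitrary)
fp, c = reciprocal(rev_f, 8)
assert mul_mod(rev_f, fp, 8) == [c] + [0] * 7
```

Each pass performs exactly the step of the displayed recursion: one scalar-by-polynomial product, two truncated polynomial products, and one squaring of the constant.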
The child nodes in the reciprocal tree can then be computed from the root. Let f₁, f₂ be polynomials of degree d corresponding to sibling nodes of the product tree, and assume that we have already computed a reciprocal for their parent node F = f₁·f₂; that is, we already have (F′, C) such that rev_{2d}(F)·F′ = C mod x^{2d}. It follows that rev_d(f₁)·(rev_d(f₂)·F′) = C mod x^d, so we can compute a reciprocal for f₁ via the modular multiplication rev_d(f₂)·F′ mod x^d. The total cost of the reciprocal tree is then
ReciprocalTree(b) = Reciprocal(b) + Σ_{i=1}^{log₂(b)} 2^i · PolyMul(b/2^i) = b^{log₂(3)} + b + 2log₂(b) − 2 + Σ_{i=1}^{log₂(b)} 2^i · (b/2^i)^{log₂(3)} = 3b^{log₂(3)} − b + 2log₂(b) − 2.
Residue tree: Recall that we can compute the resultant R e s Z ( f , g ) for monic f as
Res_Z(f, g) = ∏_i g(x_i) = ∏_i (g mod f_i),
where x_i are the roots of f and f_i = Z − x_i are its linear factors. In the context of Algorithms 2 and 3 (resp. the parallel versions, Algorithms 5 and 6), f is a polynomial of degree b corresponding to h_I (resp. h_{I_t}) for which we have already computed the product tree with leaves f_i, as well as the reciprocal tree, and g is a polynomial of degree 2b corresponding to E_{i,J} for which we have computed the product tree with leaves of degree 2. We obtain the values g mod f_i by building a residue tree from the root down. Each node of the residue tree contains R := c·(G mod F), where F and G are the polynomials in the corresponding nodes of the product trees of f and g, respectively, and c is the constant of the reciprocal tree.
As pointed out in (Section 17.4, [38]), the computation of G mod F for deg(G) = 2·deg(F) = 2d can be achieved with the aid of (F′, c) from the reciprocal tree via
c·(G mod F) = cG − rev_d(F′·rev_{2d}(G) mod x^d)·F mod x^d,
at the cost of an additional two polynomial multiplications modulus x d , so the total cost of the tree (including the cost to multiply together all of the leaves) is
ResidueTree(b) = b − 1 + 2·Σ_{i=0}^{log₂(b)} 2^i · PolyMul(b/2^i) = b − 1 + 2·Σ_{i=0}^{log₂(b)} 2^i · (b/2^i)^{log₂(3)} = 6b^{log₂(3)} − 3b − 1.
Note that the product of all the leaves gives us a multiple of the resultant (the multiple being the product of all the c constants at the leaves of the reciprocal tree), but these cancel out when taking ratios in the actual algorithms.
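The evaluation-via-remainders identity can be checked with naive polynomial arithmetic over a small prime field. The example below (ad hoc parameters chosen for illustration) reduces g once at the root of the tree and then modulo each linear leaf, and compares with direct evaluation at the roots:

```python
# Illustrative check of Res_Z(f, g) = prod g(x_i) = prod (g mod f_i) over
# GF(p), with naive (schoolbook) polynomial arithmetic.
p = 101

def poly_mul(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b_ in enumerate(g):
            out[i + j] = (out[i + j] + a * b_) % p
    return out

def poly_mod(g, f):                  # remainder of g modulo monic f
    g = g[:]
    while len(g) >= len(f):
        coef, shift = g[-1], len(g) - len(f)
        for i, a in enumerate(f):
            g[shift + i] = (g[shift + i] - coef * a) % p
        while g and g[-1] == 0:
            g.pop()
    return g or [0]

def poly_eval(g, x):
    acc = 0
    for coef in reversed(g):
        acc = (acc * x + coef) % p
    return acc

roots = [2, 3, 5, 7]
leaves = [[-x % p, 1] for x in roots]        # linear factors Z - x_i
f = leaves[0]
for leaf in leaves[1:]:
    f = poly_mul(f, leaf)                    # monic f with the given roots
g = [4, 0, 0, 0, 0, 1]                       # g(Z) = Z^5 + 4

r = poly_mod(g, f)                           # one reduction at the tree root
down = 1
for leaf in leaves:
    down = down * poly_mod(r, leaf)[0] % p   # g mod f_i is the constant g(x_i)
direct = 1
for x in roots:
    direct = direct * poly_eval(g, x) % p
assert down == direct
```

Reducing through the tree changes each leaf value only by the reciprocal-tree constants, which, as noted above, cancel out when taking ratios in the actual algorithms.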

References

  1. De Feo, L.; Jao, D. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In Proceedings of the Post-Quantum Cryptography, Taipei, Taiwan, 29 November–2 December 2011; Yang, B.Y., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 19–34. [Google Scholar]
  2. De Feo, L.; Jao, D.; Plût, J. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. J. Math. Cryptol. 2014, 8, 209–247. [Google Scholar] [CrossRef]
  3. SIKE—Supersingular Isogeny Key Encapsulation. 2023. Available online: https://sike.org/ (accessed on 6 December 2023).
  4. Castryck, W.; Decru, T. An Efficient key recovery attack on SIDH. In Proceedings of the Advances in Cryptology—EUROCRYPT 2023, Lyon, France, 23–27 April 2023; Hazay, C., Stam, M., Eds.; Springer: Cham, Switzerland, 2023; pp. 423–447. [Google Scholar]
  5. Maino, L.; Martindale, C.; Panny, L.; Pope, G.; Wesolowski, B. A direct key recovery attack on SIDH. In Proceedings of the Advances in Cryptology—EUROCRYPT 2023, Lyon, France, 23–27 April 2023; Hazay, C., Stam, M., Eds.; Springer: Cham, Switzerland, 2023; pp. 448–471. [Google Scholar]
  6. Robert, D. Breaking SIDH in polynomial time. In Proceedings of the Advances in Cryptology—EUROCRYPT 2023, Lyon, France, 23–27 April 2023; Hazay, C., Stam, M., Eds.; Springer: Cham, Switzerland, 2023; pp. 472–503. [Google Scholar]
  7. Castryck, W.; Lange, T.; Martindale, C.; Panny, L.; Renes, J. CSIDH: An efficient post-quantum commutative group action. In Proceedings of the Advances in Cryptology—ASIACRYPT 2018: 24th International Conference on the Theory and Application of Cryptology and Information Security, Brisbane, QLD, Australia, 2–6 December 2018; Proceedings, Part III. Springer: Berlin/Heidelberg, Germany, 2018; pp. 395–427. [Google Scholar]
  8. De Feo, L.; Galbraith, S.D. SeaSign: Compact isogeny signatures from class group actions. In Proceedings of the Advances in Cryptology—EUROCRYPT 2019, Darmstadt, Germany, 19–23 May 2019; Ishai, Y., Rijmen, V., Eds.; Springer: Cham, Switzerland, 2019; pp. 759–789. [Google Scholar]
  9. Beullens, W.; Kleinjung, T.; Vercauteren, F. CSI-FiSh: Efficient isogeny based signatures through class group computations. In Proceedings of the Advances in Cryptology—ASIACRYPT 2019, Kobe, Japan, 8–12 December 2019; Galbraith, S.D., Moriai, S., Eds.; Springer: Cham, Switzerland, 2019; pp. 227–247. [Google Scholar]
10. De Feo, L.; Kohel, D.; Leroux, A.; Petit, C.; Wesolowski, B. SQISign: Compact post-quantum signatures from quaternions and isogenies. In Proceedings of the Advances in Cryptology—ASIACRYPT 2020, Daejeon, Republic of Korea, 7–11 December 2020; Moriai, S., Wang, H., Eds.; Springer: Cham, Switzerland, 2020; pp. 64–93.
11. National Institute of Standards and Technology (NIST). Additional PQC Digital Signature Candidates. 2023. Available online: https://csrc.nist.gov/news/2023/additional-pqc-digital-signature-candidates (accessed on 6 December 2023).
12. Basso, A.; Fouotsa, T.B. New SIDH Countermeasures for a More Efficient Key Exchange. In Proceedings of the Advances in Cryptology—ASIACRYPT 2023, Guangzhou, China, 4–8 December 2023; Guo, J., Steinfeld, R., Eds.; Springer: Singapore, 2023; pp. 208–233.
13. Chávez-Saab, J.; Chi-Domínguez, J.; Jaques, S.; Rodríguez-Henríquez, F. The SQALE of CSIDH: Sublinear Vélu quantum-resistant isogeny action with low exponents. J. Cryptogr. Eng. 2022, 12, 349–368.
14. De Feo, L.; Fouotsa, T.B.; Kutas, P.; Leroux, A.; Merz, S.P.; Panny, L.; Wesolowski, B. SCALLOP: Scaling the CSI-FiSh. In Proceedings of the Public-Key Cryptography—PKC 2023, Atlanta, GA, USA, 7–10 May 2023; Boldyreva, A., Kolesnikov, V., Eds.; Springer: Cham, Switzerland, 2023; pp. 345–375.
15. SQIsign: Algorithm Specifications and Supporting Documentation. 2023. Available online: https://csrc.nist.gov/csrc/media/Projects/pqc-dig-sig/documents/round-1/spec-files/sqisign-spec-web.pdf (accessed on 6 December 2023).
16. Basso, A.; Maino, L.; Pope, G. FESTA: Fast Encryption from Supersingular Torsion Attacks. In Proceedings of the Advances in Cryptology—ASIACRYPT 2023, Guangzhou, China, 4–8 December 2023; Guo, J., Steinfeld, R., Eds.; Springer: Singapore, 2023; pp. 98–126.
17. Nakagawa, K.; Onuki, H. QFESTA: Efficient Algorithms and Parameters for FESTA Using Quaternion Algebras. Cryptol. ePrint Arch. 2023, preprint.
18. Moriya, T. IS-CUBE: An Isogeny-Based Compact KEM Using a Boxed SIDH Diagram. Cryptol. ePrint Arch. 2023, preprint.
19. Decru, T.; Maino, L.; Sanso, A. Towards a Quantum-resistant Weak Verifiable Delay Function. In Proceedings of the Progress in Cryptology—LATINCRYPT 2023—8th International Conference on Cryptology and Information Security in Latin America, Quito, Ecuador, 3–6 October 2023; Aly, A., Tibouchi, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14168, pp. 149–168.
20. Leroux, A. Verifiable random function from the Deuring correspondence and higher dimensional isogenies. Cryptol. ePrint Arch. 2023, preprint.
21. Vélu, J. Isogénies entre courbes elliptiques. Comptes Rendus de l’Académie des Sciences de Paris, Série A 1971, 273, 238–241.
22. Kohel, D.R. Endomorphism Rings of Elliptic Curves over Finite Fields. Ph.D. Thesis, University of California at Berkeley, Berkeley, CA, USA, 1996. Available online: http://iml.univ-mrs.fr/~kohel/pub/thesis.pdf (accessed on 9 January 2023).
23. Washington, L. Elliptic Curves: Number Theory and Cryptography, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2008.
24. Bernstein, D.J.; De Feo, L.; Leroux, A.; Smith, B. Faster computation of isogenies of large prime degree. In ANTS XIV; The Open Book Series; University of Auckland: Auckland, New Zealand, 2020; Volume 4, pp. 39–55.
25. Adj, G.; Chi-Domínguez, J.; Rodríguez-Henríquez, F. Karatsuba-based square-root Vélu’s formulas applied to two isogeny-based protocols. J. Cryptogr. Eng. 2022, 13, 89–106.
26. Cheng, H.; Fotiadis, G.; Großschädl, J.; Ryan, P.Y.A.; Rønne, P.B. Batching CSIDH group actions using AVX-512. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 618–649.
27. Orisaka, G.; López-Hernández, J.C.; Aranha, D.F. Finite field arithmetic using AVX-512 for isogeny-based cryptography. In Proceedings of the 18th Brazilian Symposium on Information and Computer Systems Security (SBSeg), Natal, Brazil, 22–25 October 2018; pp. 49–56.
28. Cheng, H.; Fotiadis, G.; Großschädl, J.; Ryan, P.Y.A. Highly Vectorized SIKE for AVX-512. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 2022, 41–68.
29. Phalakarn, K.; Suppakitpaisarn, V.; Rodríguez-Henríquez, F.; Hasan, M.A. Vectorized and parallel computation of large smooth-degree isogenies using precedence-constrained scheduling. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2023, 3, 246–269.
30. Phalakarn, K.; Suppakitpaisarn, V.; Hasan, M.A. Speeding-up parallel computation of large smooth-degree isogeny using precedence-constrained scheduling. In Proceedings of the Information Security and Privacy, Wollongong, NSW, Australia, 28–30 November 2022; Nguyen, K., Yang, G., Guo, F., Susilo, W., Eds.; Springer: Cham, Switzerland, 2022; pp. 309–331.
31. Kato, G.; Suzuki, K. Speeding up CSIDH using parallel computation of isogeny. In Proceedings of the 2020 7th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Tokoname, Japan, 8–9 September 2020; pp. 1–6.
32. Elkies, N.D. Elliptic and modular curves over finite fields and related computational issues. AMS IP Stud. Adv. Math. 1998, 7, 21–76.
33. Costello, C.; Hisil, H. A Simple and Compact Algorithm for SIDH with Arbitrary Degree Isogenies. In Proceedings of the Advances in Cryptology—ASIACRYPT 2017—23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Proceedings, Part II, Lecture Notes in Computer Science. Takagi, T., Peyrin, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10625, pp. 303–329.
34. Moody, D.; Shumow, D. Analogues of Velu’s Formulas for Isogenies on Alternate Models of Elliptic Curves. Cryptol. ePrint Arch. 2011, 2011, 430.
35. Moriya, T.; Onuki, H.; Aikawa, Y.; Takagi, T. The Generalized Montgomery Coordinate: A New Computational Tool for Isogeny-based Cryptography. Math. Cryptol. 2022, 2, 36–59.
36. Costello, C. B-SIDH: Supersingular Isogeny Diffie-Hellman Using Twisted Torsion. In Proceedings of the Advances in Cryptology—ASIACRYPT 2020—26th International Conference on the Theory and Application of Cryptology and Information Security, Daejeon, Republic of Korea, 7–11 December 2020; Proceedings, Part II, Lecture Notes in Computer Science. Moriai, S., Wang, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12492, pp. 440–463.
37. Karatsuba, A.; Ofman, Y. Multiplication of Multidigit Numbers on Automata. Sov. Phys. Dokl. 1962, 7, 595.
38. Bernstein, D.J. Fast multiplication and its applications. In Algorithmic Number Theory; Buhler, J., Stevenhagen, P., Eds.; Cambridge University Press: Cambridge, UK, 2008; pp. 325–384.
Figure 1. Diagram of the product tree for computing $\prod_i f_i$, where $f_i = (Z - x_i)$.
Figure 2. Diagram of the reciprocal tree, where $f_i = (Z - x_i)$, $\mathrm{rev}(\cdot)$, and $c_i$ are as mentioned above.
Figure 3. Diagram of the residue tree for computing $\mathrm{Res}_Z(f(Z), g(Z))$.
Figure 4. (top): The asymptotic speedup factor from Proposition 5 as a function of the number of cores. (bottom): Average core utilization (speedup factor divided by the number of cores).
Figure 5. Speedup factors of timings of the joint cost KPS-Parallel + xISOG-Parallel + 2xEVAL-Parallel over $\mathbb{F}_{p_{\text{test}}}$ for each odd prime degree $\ell \mid (p_{\text{test}} + 1)$. Experimental timings correspond to an average of 100 runs, whereas the expected savings are obtained from the estimated number of field multiplications from (5) and (6).
Table 1. Aggregate computational time (in gigacycles ¹) measured for the single-core square-root Vélu procedures KPS, xISOG, and xEVAL described in Section 2.1, and for the two-core and four-core implementations of the parallel analogues described in Section 3.2. The costs are summed over each odd prime degree $\ell \mid (p_{9216} + 1)$, and the Total column corresponds to KPS + xISOG + 2xEVAL. Timings correspond to the average of 100 runs. (Gcc) is the number of gigacycles and (SF) is the experimental speedup factor, taking the single-core implementation as the baseline.
Algorithm   Single Core   Algorithm          Two Cores        Four Cores
            Gcc                              Gcc      SF      Gcc      SF
KPS         23.6          KPS-Parallel       19.2     1.23    16.6     1.42
xISOG       58.5          xISOG-Parallel     34.5     1.70    25.8     2.27
xEVAL       59.5          xEVAL-Parallel     31.0     1.92    19.7     3.02
Total       201.1                            115.7    1.74    81.8     2.46
¹ One gigacycle is one billion clock cycles.