Extended supersingular isogeny Diffie–Hellman key exchange protocol: Revenge of the SIDH

The supersingular isogeny Diffie–Hellman key exchange protocol (SIDH) was introduced by Jao and De Feo in 2011. SIDH operates on supersingular elliptic curves defined over $\mathbb{F}_{p^2}$, where $p$ is a large prime number of the form $p = 4^{e_A} 3^{e_B} - 1$ and $e_A$ and $e_B$ are positive integers such that $4^{e_A} \approx 3^{e_B}$. A variant of the SIDH protocol, dubbed extended SIDH (eSIDH), is presented. eSIDH makes use of primes of the form $p = 4^{e_A} \ell_B^{e_B} \ell_C^{e_C} f - 1$. Here $\ell_B$ and $\ell_C$ are two small prime numbers; $f$ is a cofactor; and $e_A$, $e_B$, and $e_C$ are positive integers such that $4^{e_A} \approx \ell_B^{e_B} \ell_C^{e_C}$. It is shown that for many relevant instantiations of the SIDH protocol, this new family of primes enjoys faster field arithmetic than the one associated with traditional SIDH primes. Furthermore, its richer opportunities for parallelism yield a noticeable speed-up factor when implemented on multicore platforms. A supersingular isogeny key encapsulation (SIKE) instantiation using the prime eSIDH-$p_{765}$ yields an acceleration factor of 1.06, 1.15 and 1.14 over a SIKE instantiation with the prime SIKE-$p_{757}$ when implemented on $k = \{1, 2, 3\}$-core processors. To the authors' knowledge, this work reports the first multicore implementation of SIDH and SIKE.


1 | INTRODUCTION
In 2011, Jao and De Feo proposed the supersingular isogeny Diffie-Hellman key exchange protocol (SIDH) [1] (see also [2]). Thanks to the high complexity of its underlying problem, SIDH provides key sizes comparable to classical public-key cryptosystems currently in use. Consequently, SIDH has been studied and implemented in an impressive number of recent publications [3][4][5][6][7][8]. Moreover, the supersingular isogeny key encapsulation (SIKE) protocol [9], which can be seen as a descendent of SIDH, is one of the candidate schemes under consideration within the second round of the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography Standardization project [10].
The SIKE key exchange protocol operates on supersingular elliptic curves defined over $\mathbb{F}_{p^2}$, where $p$ is a large prime number of the form $p = 4^{e_A} 3^{e_B} - 1$.¹ During the SIDH key generation and key agreement phases, Alice and Bob must compute degree-$4^{e_A}$ and degree-$3^{e_B}$ isogenies, respectively. Hence, by choosing the exponents $e_A$ and $e_B$ such that $4^{e_A} \approx 3^{e_B}$, one can ensure that Alice and Bob incur about the same computational cost when executing SIDH. Moreover, this design choice guarantees a healthy security balance because the security guarantees of SIDH lie in the intractability of the computational supersingular isogeny (CSSI) problem. Solving CSSI implies computing $\mathbb{F}_{p^2}$-rational isogenies of degrees $4^{e_A}$ and $3^{e_B}$ between pairs of supersingular elliptic curves defined over a quadratic extension field $\mathbb{F}_{p^2}$. Following recent analyses of the classical and quantum security of SIDH and SIKE [11][12][13], the authors of [9] endorsed the primes SIKEp434, SIKEp503, SIKEp610, and SIKEp751 (so named to indicate the bit-size of the underlying prime field characteristic) to meet the security requirements of NIST categories 1, 2, 3, and 5, respectively.
A variant of the SIDH protocol is presented that allows us to accelerate Bob's computations on single-core and multicore platforms without modifying the formats and lengths of its private/public keys. The proposed SIDH variant is dubbed extended SIDH (eSIDH),² because of the pair of primes that are assigned to Bob for performing his isogeny computations. The eSIDH domain parameters are a supersingular elliptic curve $E/\mathbb{F}_{p^2}$, where $p$ is a prime of the form

$$p = 4^{e_A}\,\ell_B^{e_B}\,\ell_C^{e_C}\,f - 1, \qquad (1)$$

where $\ell_B$, $\ell_C$ are two small prime numbers; $f$ is a cofactor that for efficiency reasons is usually selected as a power of 2; and $e_A$, $e_B$ and $e_C$ are positive integers such that $4^{e_A} \approx \ell_B^{e_B}\ell_C^{e_C}$. Just as in SIKE, in eSIDH Alice limits herself to computing degree-$4^{e_A}$ isogenies. This naturally implies that Alice can still take advantage of the low cost associated with the fast degree-4 isogeny arithmetic. On the other hand, Bob is now responsible for computing degree-$\ell_B^{e_B}\ell_C^{e_C}$ isogenies. At first glance, it would appear that Bob's task in eSIDH has just become more expensive than his computational role in a traditional SIDH scheme. Nonetheless, we will show that Bob's eSIDH tasks offer several advantages, such as faster underlying field arithmetic and novel opportunities for exploiting the parallelism associated with his new computational responsibilities.
Indeed, because of the existence of friendlier Montgomery-friendly primes [4,16], the rich abundance of the family of primes given in Equation (1) often yields faster field arithmetic. Our experimental results show that the computational advantages of eSIDH compensate quite well for the extra calculations demanded by this variant. For example, using a single-core SIKE prime $p_{751}$ implementation as a baseline, a comparable eSIDH prime $p_{765}$ instantiation yields an acceleration factor of 1.05, 1.30, and 1.41 when implemented on $k = \{1, 2, 3\}$-core processors.
Presently, relatively few works have attempted to exploit the rich opportunities for parallel computation that the main SIDH computations offer. In this research direction, we are only aware of the works reported in [5,6], where explicit efforts for parallelizing the computations of the SIDH protocol were attempted and/or exploited. Using an approach similar to the one followed in [5,6], we report that a two-core and a three-core parallel implementation of the SIDH $p_{751}$ instantiation yield a speed-up factor of 1.118 and 1.216 against a sequential implementation, respectively. To our knowledge this work reports the first multicore implementation of SIDH. In addition, when both protocols are implemented on $k = \{1, 2, 3\}$-core processors, eSIDH $p_{765}$ yields an acceleration factor of 1.050, 1.160 and 1.162 over SIDH.
The remainder of this paper is organized as follows. In §2 a summary of the SIDH protocol and associated implementation aspects is presented. In §3 three different approaches for implementing the eSIDH protocol are presented. In §4 several relevant eSIDH implementation aspects on single-core and multicore processors are discussed. We draw our concluding remarks in §5.

2 | PRELIMINARIES
In this section, a brief summary of the SIDH protocol and its optimal strategies is given. For more in-depth details see [2,9].

2.1 | SIDH protocol
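The most popular instantiation of the SIDH key exchange protocol operates on supersingular elliptic curves defined over $\mathbb{F}_{p^2}$, where $p$ is a large prime number of the form $p = 4^{e_A} 3^{e_B} - 1$.
The exponents $e_A$ and $e_B$ are typically chosen such that $4^{e_A} \approx 3^{e_B}$. Let us define the constants $r_A = 4^{e_A}$ and $r_B = 3^{e_B}$. The public parameters of SIDH are given by a supersingular base curve $E_0$ and the basis points $P_A, Q_A$ and $P_B, Q_B$ such that $\langle P_A, Q_A\rangle = E_0[r_A]$ and $\langle P_B, Q_B\rangle = E_0[r_B]$.
During the initial key generation phase, Alice chooses a random integer $m_A \in [1, r_A - 1]$, which acts as her secret key. Thereafter, Alice computes a secret point $R_A = P_A + [m_A]Q_A$ and the degree-$4^{e_A}$ isogeny $\phi_A : E_0 \to E_A$ with $\mathrm{Ker}(\phi_A) = \langle R_A\rangle$, and sends $E_A$ together with $\phi_A(P_B)$ and $\phi_A(Q_B)$ to Bob. Bob proceeds analogously with a secret integer $m_B$, the secret point $R_B = P_B + [m_B]Q_B$ and the degree-$3^{e_B}$ isogeny $\phi_B : E_0 \to E_B$, and sends $E_B$ together with $\phi_B(P_A)$ and $\phi_B(Q_A)$ to Alice.³
In the key agreement phase, Alice uses Bob's information to recover the image of her secret point on Bob's curve $E_B$, as $\phi_B(R_A) = \phi_B(P_A) + [m_A]\phi_B(Q_A)$, and computes the isogenous curve $E_{BA}$ such that there is a degree-$4^{e_A}$ isogeny $\phi_{BA} : E_B \to E_{BA}$ with $\mathrm{Ker}(\phi_{BA}) = \langle\phi_B(R_A)\rangle$. Bob then computes the isogenous curve $E_{AB}$ such that there is a degree-$3^{e_B}$ isogeny $\phi_{AB} : E_A \to E_{AB}$ with $\mathrm{Ker}(\phi_{AB}) = \langle\phi_A(R_B)\rangle$. This ends the SIDH protocol. Alice and Bob can now create a shared secret by computing the $j$-invariant of their respective curves, using the fact that $E_{BA} \cong E_{AB}$ implies $j(E_{BA}) = j(E_{AB})$.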
Remark 1 The most prominent SIDH computational tasks include the computation of large degree isogenies and the evaluation of elliptic curve points in those isogenies. Another large operation of this scheme is the computation of four three-point scalar multiplications. For a typical software or hardware implementation of SIDH, the isogeny computations and associated point evaluations on one hand, along with the three-point scalar multiplications on the other hand, may take 81%-83% and 17%-19% of the overall protocol's computational cost, respectively.

² Pronounced by spelling out all five letters individually. An early version of this paper was presented in [14] and [15, Chapter 11].

³ State-of-the-art SIDH implementations use differential point arithmetic on Montgomery curves. Consequently, Alice and Bob evaluate and transmit three points each, namely, $x(P_A)$, $x(Q_A)$, $x(P_A - Q_A)$; and $x(P_B)$, $x(Q_B)$, and $x(P_B - Q_B)$, respectively [3].

CERVANTES-VÁZQUEZ ET AL.
Remark 2 In order to compute the points $R_A$, $\phi_B(R_A)$ (resp. $R_B$, $\phi_A(R_B)$), Alice (resp. Bob) must perform two three-point scalar multiplication procedures using a right-to-left Montgomery ladder algorithm [1,4]. This kind of Montgomery ladder has a per-step cost of one point addition (xADD) and one point doubling (xDBL), which are usually performed in the projective space $\mathbb{P}^1$. Noticing that for current state-of-the-art SIDH implementations the costs of xDBL and xADD are about the same, one can assume that the per-step computational cost of the three-point Montgomery ladder is essentially that of two xDBL operations. It follows that the cost of computing $R_A$ or $\phi_B(R_A)$ (resp. $R_B$ or $\phi_A(R_B)$) is $4e_A$ (resp. $2\log_2(3)e_B$) xDBL operations.
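The estimate of Remark 2 can be illustrated with a short sketch. The function names are hypothetical, and the exponents used in the example are those of the SIKEp434 prime $2^{216}3^{137} - 1$ (that is, $e_A = 108$ as a power of 4); the model simply charges two xDBL per scalar bit, as assumed in the remark.

```python
from math import log2

def ladder_cost_xdbl(scalar_bits: float) -> float:
    """Approximate xDBL count of one three-point ladder.

    Assumption from Remark 2: each ladder step (one xADD plus one xDBL)
    costs about two xDBL operations.
    """
    return 2.0 * scalar_bits

def sidh_ladder_costs(e_a: int, e_b: int) -> tuple:
    """Return (Alice, Bob) ladder costs in xDBL for p = 4^e_a * 3^e_b - 1."""
    alice = ladder_cost_xdbl(2 * e_a)          # 4^e_a has about 2*e_a bits
    bob = ladder_cost_xdbl(log2(3) * e_b)      # 3^e_b has about log2(3)*e_b bits
    return alice, bob

# Example with the SIKEp434 exponents (4^108 = 2^216 and 3^137).
alice_cost, bob_cost = sidh_ladder_costs(108, 137)
```

Both values come out close to each other, which reflects the balance $4^{e_A} \approx 3^{e_B}$ sought by the parameter choice.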

2.2 | Optimal strategies for supersingular isogeny Diffie-Hellman key exchange protocol
Let $E$ be a supersingular elliptic curve defined over the quadratic extension field $\mathbb{F}_{p^2}$. Given a point $R_0 \in E$, let $S = \langle R_0\rangle$ be an order-$\ell^e$ subgroup of $E[\ell^e]$. Then there exists an isogeny $\phi: E \to E'$ (with both $\phi$ and $E'$ defined over $\mathbb{F}_{p^2}$) having kernel $S$. The isogeny $\phi$ is unique up to isomorphism. Given $E$ and $S$, an isogeny $\phi$ with kernel $S$ and the corresponding equation for $E'$ can be computed as a sequence of degree-$\ell$ isogenies using Vélu-like formulas and scalar multiplications by $\ell$ such as the ones discussed in [17,18]. The problem of optimally computing large smooth-degree isogenies, for the special case of a sequential SIDH implementation, was posed and solved in [2].
In order to efficiently compute a degree-$\ell^e$ isogeny, it was shown in [2] that one can apply balanced or optimal strategies for traversing a weighted directed graph, which is represented as a right triangular lattice $\Delta_e$ having $\frac{e(e+1)}{2}$ points distributed in $e$ columns and rows (see Figure 1a).⁴ A leaf is defined as the bottommost point in a given column of the lattice. The vertices of the graph represent elliptic curve points, and its vertical and horizontal edges have associated weights $p_\ell$ and $q_\ell$, defined as the cost of performing one scalar multiplication by $\ell$ and one degree-$\ell$ isogeny evaluation, respectively. At the beginning of the isogeny computation, only the point $R_0$ of order $\ell^e$ is known. The goal of the isogeny construction/evaluation computation is to obtain all the leaves in $\Delta_e$ one by one until the rightmost one, $R_{e-1}$, has been calculated. Then, $\phi: E \to E'$ can be obtained by simply computing a degree-$\ell$ isogeny with kernel $\langle R_{e-1}\rangle$.
Optimal strategies as defined in [2] exploit the fact that a triangle $\Delta_e$ can be optimally and recursively decomposed into two sub-triangles $\Delta_h$ and $\Delta_{e-h}$ as shown in Figure 1b. Let us denote as $\Delta_e^h$ the design decision of splitting a triangle $\Delta_e$ at row $h$. Then, the sequential cost $C(\Delta_e^h)$ of walking through the triangle $\Delta_e$ using the cut $\Delta_e^h$ is the sum of the costs of the two sub-triangles $\Delta_h$ and $\Delta_{e-h}$ plus the cost of the vertical and horizontal edges (weighted by $p_\ell$ and $q_\ell$, respectively) that connect them. We say that $\Delta_e^{\hat{h}}$ is optimal if $C(\Delta_e^{\hat{h}})$ is minimal among all $\Delta_e^h$ for $h \in [1, e-1]$. Applying this strategy recursively leads to a procedure that computes a degree-$\ell^e$ isogeny at a cost of approximately $\frac{e}{2}\log_2 e$ scalar multiplications by $\ell$, $\frac{e}{2}\log_2 e$ degree-$\ell$ isogeny evaluations, and $e$ constructions of degree-$\ell$ isogenous curves.

FIGURE 1 Subfigure 1a shows a triangular lattice used to compute a degree-$\ell^e$ isogeny $\phi: E \to E'$. The kernel of $\phi$ is the subgroup $\langle R_0\rangle$, where $R_0 \in E$ is an order-$\ell^e$ elliptic curve point. Using an optimal SIDH strategy as in [2], a triangular lattice $\Delta_e$ is processed by splitting it into two sub-triangles as shown in Subfigure 1b. After applying this splitting strategy recursively, the cost of computing $\phi$ drops to approximately $\frac{e}{2}\log_2 e$ scalar multiplications by $\ell$, $\frac{e}{2}\log_2 e$ degree-$\ell$ isogeny evaluations, and $e$ constructions of degree-$\ell$ isogenous curves.

⁴ Note that we depart from the tradition that would represent the weighted directed graph $\Delta_e$ as a triangular equilateral lattice between the x-axis and the lines $y = \sqrt{3}\,x$ and …
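The recursive splitting idea can be sketched as a small dynamic program. This is only an illustration under an assumed recurrence, the one commonly stated for SIDH optimal strategies: $C(1) = 0$ and $C(n) = \min_h \{C(h) + C(n-h) + (n-h)\,p_\ell + h\,q_\ell\}$. Which of $h$ and $n-h$ is paired with which weight depends on how the lattice is oriented, so that pairing is an assumption here, and the function name is hypothetical.

```python
from functools import lru_cache

def optimal_strategy_cost(e: int, p_l: float, q_l: float) -> float:
    """Minimal traversal cost of a triangle with e leaves.

    p_l: cost of one scalar multiplication by l (vertical edge weight).
    q_l: cost of one degree-l isogeny evaluation (horizontal edge weight).
    Assumed recurrence: C(1) = 0,
    C(n) = min over h in [1, n-1] of C(h) + C(n-h) + (n-h)*p_l + h*q_l.
    """
    @lru_cache(maxsize=None)
    def cost(n: int) -> float:
        if n == 1:
            return 0.0
        return min(
            cost(h) + cost(n - h) + (n - h) * p_l + h * q_l
            for h in range(1, n)
        )
    return cost(e)

# With unit weights, the balanced split is optimal for a triangle with 4 leaves.
cost_4 = optimal_strategy_cost(4, 1.0, 1.0)
```

The returned value counts only the edges walked inside the triangle; the $e$ constructions of degree-$\ell$ isogenous curves are not included.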
Remark 3 Let us assume that a degree-$\ell^e$ isogeny $\phi: E \to E'$ has been constructed using the procedure just described. Then, given a point $P \in E$, its image $\phi(P) \in E'$ can be found by performing the composition of $e$ degree-$\ell$ isogeny evaluations. By way of illustration, the computation of the image of the point $R_{BC}$ under Bob's isogeny $\phi_B$ can be pictured as the top horizontal segment of the triangular lattice going from the vertex $R_{BC}$ to the vertex $\phi_B(R_{BC})$. The cost of this operation is $e_B$ degree-$\ell_B$ isogeny evaluations.

3 | EXTENDED SUPERSINGULAR ISOGENY DIFFIE-HELLMAN KEY EXCHANGE PROTOCOL
The extended SIDH (eSIDH) protocol operates on supersingular elliptic curves defined over $\mathbb{F}_{p^2}$, where $p$ is a large prime number of the form $p = 4^{e_A}\ell_B^{e_B}\ell_C^{e_C} - 1$. The exponents $e_A$, $e_B$ and $e_C$ are chosen so that $4^{e_A} \approx \ell_B^{e_B}\ell_C^{e_C}$. The eSIDH protocol flow is quite similar to that of traditional SIDH as described in §2.1. Alice must still compute degree-$4^{e_A}$ isogenies, but now Bob is responsible for computing degree-$\ell_B^{e_B}\ell_C^{e_C}$ isogenies. In this section, three different approaches for computing the eSIDH protocol are presented. We start in §3.1 with the description of a simple naive eSIDH approach that is relatively expensive and offers few opportunities for exploiting parallelism. In §3.2, an eSIDH approach especially designed for exploiting parallelism opportunities is presented. Table 1 shows the estimated scalar multiplication expenses incurred by SIDH and the two eSIDH instantiations discussed in this section. All the costs are given in number of xDBL operations.⁵ For two-core implementations, the parallel eSIDH described in §3.2 is significantly faster than the SIDH implementation of [9] and any other eSIDH instantiation discussed here.
We start by noting that in SIDH, Alice must perform two $\frac{\lambda}{2}$-bit scalar multiplications for computing her secret points $R_A$ and $R_A'$. Each of these two three-point scalar multiplications involves the execution of about $\frac{4}{4}\lambda$ xDBL operations (cf. Remark 2). Here we also take into account the cost incurred by Alice for computing her $[2^{e_A-1}]R_0$ kernel point shown in Subfigure 1a. This computation requires performing $e_A = \lambda/2$ xDBL operations and must be calculated in both SIDH phases. Hence, for the two SIDH phases, Alice ends up computing a total of $\frac{12}{4}\lambda$ xDBL operations. On the other hand, Bob must compute his secret points $R_B$ and $R_B'$ at a cost of $\frac{8}{4}\lambda$ xDBL operations. Bob also must calculate two $[3^{e_B-1}]R_0$ kernel points whose cost can be estimated as follows. SIDH uses a prime $p$ with $3^{e_B} \approx 2^{\lambda/2}$, so that $e_B \approx \frac{\lambda}{2\log_2 3} \approx \frac{8}{25}\lambda$. Moreover, we have experimentally found that a point multiplication by 3 (xTPL) costs approximately the same as two xDBLs (cf. Table 2). Thus, Bob's computational cost for calculating $e_B$ triplings is equivalent to some $\frac{16}{25}\lambda$ xDBL operations. Because Bob must perform this task for both SIDH phases, the combined cost becomes $\frac{32}{25}\lambda$ xDBL operations. We conclude that Bob ends up performing $\left(\frac{8}{4} + \frac{32}{25}\right)\lambda$ xDBL operations during the execution of the two phases of SIDH.
The above discussion implies that the combined scalar multiplication computational effort performed by Bob and Alice in SIDH can be estimated as some $\left(\frac{20}{4} + \frac{32}{25}\right)\lambda$ xDBL operations. This estimate is reported in the second-row entries of Table 1. Note that we are considering here that this computational effort remains the same for single-core and two-core implementations of SIDH.⁶
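The tally above can be re-derived with exact rational arithmetic. The sketch below only restates the coefficients of $\lambda$ quoted in the text (ladder cost, kernel-point doublings and triplings); the variable names are illustrative.

```python
from fractions import Fraction as F

# All quantities are coefficients of lambda, measured in xDBL operations.
alice_ladders = 2 * F(4, 4)        # two three-point ladders
alice_kernel = 2 * F(1, 2)         # kernel-point doublings, once per phase
alice_total = alice_ladders + alice_kernel

bob_ladders = F(8, 4)              # two three-point ladders
e_b = F(8, 25)                     # e_B is about (8/25) * lambda
xtpl_in_xdbl = 2                   # one tripling costs about two xDBL
bob_kernel = 2 * e_b * xtpl_in_xdbl  # e_B triplings in each of the two phases
bob_total = bob_ladders + bob_kernel

sidh_total = alice_total + bob_total
```

Here `sidh_total` equals $\frac{20}{4} + \frac{32}{25} = \frac{157}{25}$, that is, about $6.28\lambda$ xDBL operations.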

3.1 | Naive approach for computing eSIDH
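Mimicking his role in SIDH, for a naive eSIDH instantiation Bob can first choose a basis $\langle P_{BC}, Q_{BC}\rangle = E[\ell_B^{e_B}\cdot\ell_C^{e_C}]$. Thereafter, Bob computes his secret point $R_{BC}$ from $P_{BC}$, $Q_{BC}$ and his secret scalar, in the same way as in §2.1, and then computes the corresponding degree-$\ell_B^{e_B}\ell_C^{e_C}$ isogeny using an optimal strategy à la SIDH as shown in Figure 2a.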
TABLE 1 Let $\lambda = \lceil\log_2(p)\rceil$ be the bit-length of the extended supersingular isogeny Diffie-Hellman key exchange protocol (eSIDH) prime $p$. This table reports the approximate number of doubling (xDBL) operations processed by the SIDH protocol of [2] compared against the three eSIDH variants presented in this section (for the experimental clock cycle cost of xDBL, see …).
Abbreviations: CRT, Chinese remainder theorem; SIDH, supersingular isogeny Diffie-Hellman; xDBL, doubling.

⁵ We do not account for isogeny computations, because the computational cost associated with this task is about the same for both variants of eSIDH and the SIDH protocol.

⁶ Once again we stress that for this analysis many of the xDBL operations included in the computation of the isogeny triangles shown in Subfigure 1b were not taken into account (see the previous footnote).
Alice's eSIDH computational expenses are exactly the same as in SIDH. In the case of Bob, we stress that the computational cost of calculating his eSIDH secret point $R_{BC}$ as defined above is about the same as that of computing Bob's SIDH secret point $R_B$ as given in §2.1. Figure 2a depicts an optimal strategy procedure for computing Bob's degree-$\ell_B^{e_B}\ell_C^{e_C}$ isogeny. The computational cost of this isogeny is about $\frac{e_B}{2}\log_2 e_B$ and $\frac{e_C}{2}\log_2 e_C$ scalar multiplications by $\ell_B$ and $\ell_C$, $\frac{e_B}{2}\log_2 e_B$ degree-$\ell_B$ and $\frac{e_C}{2}\log_2 e_C$ degree-$\ell_C$ isogeny evaluations, and $e_B$ and $e_C$ constructions of degree-$\ell_B$ and degree-$\ell_C$ isogenous curves, respectively. This computational expense is nearly the same as the one required by Alice for computing a degree-$4^{e_A}$ isogeny, using the optimal strategies described in §2.2 and Figure 1.
There seems to be no obvious way of parallelizing the main computation of this naive eSIDH instantiation, except by using explicit parallel optimal strategies as the ones proposed in [6]. In the following two subsections, two eSIDH instantiations more amenable for parallelization are described.

3.2 | Parallel approach for computing eSIDH key exchange protocol
As mentioned before, eSIDH offers rich opportunities for exploiting its inherent parallelism. In this subsection, an eSIDH instantiation specifically designed for the parallel computation of some of its scalar multiplication operations, will be presented.
As before, let $\lambda = \lceil\log_2(p)\rceil$ be the bit-length of the eSIDH prime $p = 4^{e_A}\ell_B^{e_B}\ell_C^{e_C} - 1$. For the sake of compactness let us define $r_B = \ell_B^{e_B}$ and $r_C = \ell_C^{e_C}$. Rather than defining Bob's secret point $R_{BC}$ as in the previous subsection, Bob now has two secret points, $R_B$ and $R_C$, that he can calculate by choosing two pairs of bases such that $\langle P_B, Q_B\rangle = E[r_B]$ and $\langle P_C, Q_C\rangle = E[r_C]$. Now, by picking $\ell_B$, $\ell_C$, $e_B$ and $e_C$ such that $\log_2(\ell_B)e_B \approx \log_2(\ell_C)e_C$, it follows that the cost of computing $R_B$ is about $\frac{2}{4}\lambda$ xDBL operations (cf. Remark 2), which is nearly the same as the cost of computing $R_C$ and about half of the cost of computing Alice's secret point $R_A$. Furthermore, the calculations of Bob's secret points $R_B$ and $R_C$ are fully independent. Therefore, one can compute them in parallel on multicore platforms. Moreover, the isogeny $\phi_{BC} = \phi_C \circ \phi_B$ can now be determined without performing the multiplication by $r_C$ shown in Figure 2a. This computational saving comes from the facts that $\gcd(r_B, r_C) = 1$ and that $R_B$, $R_C$ are points of order $r_B$ and $r_C$, respectively. Hence, as shown in Figure 2b, $R_B$ and $\phi_B(R_C)$ can serve to generate the kernels of the isogenies $\phi_B$ and $\phi_C$, respectively. This simple observation yields a significant saving of roughly $\frac{\lambda}{4}$ xDBL operations in each one of the two phases of the eSIDH protocol.

TABLE 2 Timing performance of selected quadratic-field arithmetic operations and isogeny evaluations and constructions. Timings are reported in clock cycles measured on a Skylake processor at 4.0 GHz. (*) In this cell the left number indicates the cost of quadratic-field inversion for SIKE-$p_{434}$ whereas the right number indicates the cost of quadratic-field inversion for eSIDH-$p_{434}$. The rightmost column shows the acceleration factor when comparing $p_{751}$ versus $p_{765}$.

FIGURE 2 Each isogeny $\phi_B$ and $\phi_C$ can be computed using a traditional supersingular isogeny Diffie-Hellman key exchange protocol strategy as in [2]. The kernel of $\phi_B$ is the subgroup $\langle[\ell_C^{e_C}]R_{BC}\rangle$, and the kernel of $\phi_C$ is the subgroup $\langle\phi_B(R_{BC})\rangle$. Figure 2a shows a naive way of computing the $\ell_B^{e_B}\ell_C^{e_C}$-isogeny $\phi_{BC} = \phi_C \circ \phi_B$. Figure 2b shows a parallel-oriented approach for computing such a strategy.
3.2.1 | Reducing the public-key size of eSIDH key exchange protocol parallel instantiation
Seemingly, an important drawback of using two secret points for Bob is that for the key agreement phase this design decision forces Bob to know the images of his public points $P_B$, $Q_B$, $P_C$ and $Q_C$, all of them evaluated under Alice's degree-$4^{e_A}$ isogeny $\phi_A$. Sending these four points implies an increase in the data to be transferred from Alice to Bob. This in turn implies an increase in Alice's computational load because she would now need to find the isogeny images of four points (instead of two as in the original SIDH).⁷ Alternatively, one can reduce the eSIDH public-key size while also sparing Alice the extra work. This can be done by defining two auxiliary public points that, while encoding Bob's public points $P_B$, $Q_B$, $P_C$ and $Q_C$, provide an efficient way to recover them. Let us redefine Bob's public points as $S = P_B + P_C$ and $T = Q_B + Q_C$. Because $P_B$, $Q_B$ have order $r_B$ and $P_C$, $Q_C$ have order $r_C$, this implies that

$$[r_C]S = [r_C]P_B, \quad [r_C]T = [r_C]Q_B, \quad [r_B]S = [r_B]P_C, \quad [r_B]T = [r_B]Q_C. \qquad (3)$$

Hence, given the points $S$, $T$, one can recover multiples of Bob's original four public points by performing four scalar multiplications. Notice that all four of these scalar multiplications are fully independent. Nonetheless, we can do better as discussed below.
, is sufficient to generate the degree-$r_B$ isogeny with kernel $\langle R_C\rangle$.
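The recovery of multiples of Bob's public points from $S$ and $T$ can be checked in a toy model. The sketch below is not elliptic-curve code: as a stated assumption, the torsion group $E[r_B r_C]$ is replaced by the abstract group $\mathbb{Z}_n \times \mathbb{Z}_n$ with $n = r_B r_C$, which has the same group structure, and all names and the small values $r_B = 3^4$, $r_C = 5^3$ are illustrative.

```python
# Toy stand-in (assumption): points are pairs of integers modulo n = r_B * r_C.
R_B_ORDER, R_C_ORDER = 3**4, 5**3
N = R_B_ORDER * R_C_ORDER

def add(P, Q):
    """Group law of Z_n x Z_n."""
    return ((P[0] + Q[0]) % N, (P[1] + Q[1]) % N)

def mul(k, P):
    """Scalar multiplication [k]P."""
    return ((k * P[0]) % N, (k * P[1]) % N)

# Bases of the order-r_B and order-r_C parts of the group.
P_B, Q_B = (R_C_ORDER, 0), (0, R_C_ORDER)
P_C, Q_C = (R_B_ORDER, 0), (0, R_B_ORDER)

# Auxiliary public points.
S, T = add(P_B, P_C), add(Q_B, Q_C)

# Multiplying by r_C kills the order-r_C part; multiplying by r_B kills the other.
relations_hold = (
    mul(R_C_ORDER, S) == mul(R_C_ORDER, P_B)
    and mul(R_C_ORDER, T) == mul(R_C_ORDER, Q_B)
    and mul(R_B_ORDER, S) == mul(R_B_ORDER, P_C)
    and mul(R_B_ORDER, T) == mul(R_B_ORDER, Q_C)
)

# Since gcd(r_B, r_C) = 1, [r_C]P_B still generates the subgroup <P_B>.
k = pow(R_C_ORDER, -1, R_B_ORDER)
generator_recovered = mul(k, mul(R_C_ORDER, P_B)) == P_B
```

The last check illustrates why a multiple by a coprime scalar is as good as the original point when only the generated subgroup matters.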
The observation stated in Remark 4 along with the relations given in Equation (3) suggest an approach where Bob can efficiently recover the points $R_B'$, $R_C'$ by the direct computation of

$$R_B' = [r_C]\big(\phi_A(S) + [m_B]\phi_A(T)\big), \qquad R_C' = [r_B]\big(\phi_A(S) + [m_C]\phi_A(T)\big). \qquad (4)$$

Notice that the cost of computing either $R_B'$ or $R_C'$ in (4) is that of one $\frac{\lambda}{4}$-bit three-point ladder plus a scalar multiplication by $r_C$ (resp. $r_B$). Invoking once again Remark 2, the cost of a $\frac{\lambda}{4}$-bit three-point ladder is about $\frac{2}{4}\lambda$ xDBL operations. An estimate of the computational cost associated with a point multiplication by the scalars $[r_B]$ or $[r_C]$ is explained next.
For the sake of concreteness, let us assume that $\ell_B = 3$ and $\ell_C = 5$. Let $\lambda = \lceil\log_2(p)\rceil$ be the bit-length of the eSIDH prime $p = 4^{e_A}3^{e_B}5^{e_C} - 1$. By construction, we choose $3^{e_B} \approx 5^{e_C} \approx 4^{e_A/2}$. It follows that the exponents $e_A$, $e_B$ and $e_C$ are related by the following equations:

$$e_A \approx \frac{\lambda}{4}, \qquad e_B \approx \frac{\lambda}{4\log_2 3} \approx \frac{4}{25}\lambda, \qquad e_C \approx \frac{\lambda}{4\log_2 5} \approx \frac{7}{64}\lambda.$$

Moreover, from our experimental data (cf. Table 2), the costs associated with the scalar multiplications by 3 (xTPL) and by 5 (xQPL) in terms of multiplications by 2 (xDBL) can be approximated as,
• Cost of point multiplication [3] ≈ 2 xDBL operations.
• Cost of point multiplication [5] ≈ 3 xDBL operations.
Hence, the cost of computing a point multiplication by the scalars $r_B = 3^{e_B}$ and $r_C = 5^{e_C}$ can be estimated as $[r_B] \approx \frac{8}{25}\lambda$ and $[r_C] \approx \frac{21}{64}\lambda$ xDBL operations, respectively. We conclude that the total cost of computing the points $R_B'$ or $R_C'$ of Equation (4) is about $\frac{53}{64}\lambda$ and $\frac{41}{50}\lambda$ xDBL operations, respectively. Figure 3 shows a general overview of the extended supersingular isogeny Diffie-Hellman key exchange protocol (eSIDH) parallel instantiation described in this subsection. Assuming that a multicore platform is available for the execution of this eSIDH instantiation, $R_B'$ and $R_C'$ in (4) (shown in dashed boxes in Figure 3) can be computed in parallel (cf. §3.2.2).

Remark 5 Equation (4) is useful for the eSIDH key agreement phase. For the eSIDH key generation phase, it is more efficient to compute the points $R_B$ and $R_C$ as discussed at the beginning of Subsection 3.2.
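The rational estimates used above for $[r_B]$ and $[r_C]$ can be compared with a direct numerical evaluation. The following sketch assumes, as in the text, that $3^{e_B} \approx 5^{e_C} \approx 2^{\lambda/4}$, that xTPL costs about 2 xDBL and xQPL about 3 xDBL; the helper name is hypothetical.

```python
from math import log2

XTPL_IN_XDBL = 2   # cost of one tripling, in xDBL (approximation from the text)
XQPL_IN_XDBL = 3   # cost of one quintupling, in xDBL (approximation from the text)

def scalar_mult_coeff(ell: int, step_cost: float, share: float = 0.25) -> float:
    """Coefficient c such that [ell^e]P costs about c * lambda xDBL,
    when ell^e is about 2^(share * lambda)."""
    steps_per_lambda = share / log2(ell)   # e / lambda
    return step_cost * steps_per_lambda

c_rB = scalar_mult_coeff(3, XTPL_IN_XDBL)   # compare with 8/25
c_rC = scalar_mult_coeff(5, XQPL_IN_XDBL)   # compare with 21/64

ladder = 0.5                                # (lambda/4)-bit three-point ladder
cost_RB_prime = ladder + c_rC               # compare with 53/64
cost_RC_prime = ladder + c_rB               # compare with 41/50
```

Both computed coefficients agree with the quoted fractions to within about $0.005\lambda$, so the fractions in the text are slightly conservative roundings.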
Remark 6 eSIDH security: Recall that $\gcd(r_B, r_C) = 1$ and $2\cdot e_A \approx \log_2(\ell_B)e_B + \log_2(\ell_C)e_C$. Given the points $S$ and $T$, computing a degree-$r_B r_C$ isogeny between $E_0$ and $E_{BC}$ should have the same computational complexity as the problem of, given the points $P_A$ and $Q_A$, finding a degree-$r_A$ isogeny between $E_0$ and $E_A$.
Furthermore, provided that $4^{e_A} \approx \ell_B^{e_B}\cdot\ell_C^{e_C}$, the heuristic polynomial time key recovery attacks presented in [19] do not appear to apply against eSIDH.

⁷ In practice one uses differential point arithmetic on Montgomery curves. Hence, Alice would need to evaluate and transmit six points, namely, $x(P_B)$, $x(Q_B)$, $x(P_B - Q_B)$, $x(P_C)$, $x(Q_C)$, and $x(P_C - Q_C)$.

3.2.2 | Computational cost of eSIDH key exchange protocol parallel instantiation
Let us consider the eSIDH instantiation shown in Figure 3. Note that the eSIDH private/public-key sizes are the same as the traditional SIDH protocol of [2]. Moreover, Alice's isogeny computations are exactly the same for both protocols.
The scalar multiplications computational expenses of the parallel eSIDH variant are discussed in the following.
As in traditional SIDH, Alice must perform two $\frac{\lambda}{2}$-bit scalar multiplications that involve the computation of about $\frac{8}{4}\lambda$ xDBL operations (cf. Remark 2). Taking into account, for the two SIDH phases, the computation of the $[2^{e_A-1}]R_0$ kernel point as shown in Subfigure 1a, Alice must compute $\lambda$ extra xDBL operations. We conclude that Alice ends up computing a total of $\frac{12}{4}\lambda$ xDBL operations. Moreover, during the key generation phase, Bob computes the points $R_B$, $R_C$ by performing $\frac{4}{4}\lambda$ and $\frac{2}{4}\lambda$ xDBLs for a single-core and a two-core implementation, respectively. During the key agreement phase, Bob computes the points $R_B'$, $R_C'$ by performing $\frac{82}{50}\lambda$ and $\frac{41}{50}\lambda$ xDBL operations for a single-core and a two-core implementation, respectively (cf. discussion after Equation (4)).
As shown in Figure 2b, Bob also needs to calculate two $[3^{e_B-1}]R_0$ kernel points at a combined cost of $2e_B$ triplings. As previously discussed, in eSIDH the exponent $e_B$ is chosen to be approximately equal to $\frac{4}{25}\lambda$. Moreover, our experimental data show that one tripling costs approximately the same as two xDBLs. This implies that Bob can compute $2e_B$ triplings at an equivalent cost of some $\frac{16}{25}\lambda$ xDBLs. Hence, the eSIDH combined scalar multiplication effort of Alice and Bob can be estimated as about $\left(\frac{16}{4} + \frac{114}{50}\right)\lambda$ and $\left(\frac{14}{4} + \frac{73}{50}\right)\lambda$ xDBL operations for a single-core and a two-core implementation, respectively (cf. third row of Table 1).

3.3 | A CRT-based approach for computing eSIDH key exchange protocol
Another instantiation of eSIDH can be constructed by taking advantage of the Chinese remainder theorem (CRT). As in §3.2, let $\lambda = \lceil\log_2(p)\rceil$ be the bit-length of the eSIDH prime $p = 4^{e_A}\ell_B^{e_B}\ell_C^{e_C} - 1$. And once again, let us define $r_B = \ell_B^{e_B}$ and $r_C = \ell_C^{e_C}$. A CRT-based approach for eSIDH can be computed as explained in the remainder of this subsection.
First, let us choose a pair of random integers under the following restrictions. Pick randomly $m_B \in [1, r_B]$ and $m_C \in [1, r_C]$ such that $\gcd(m_B, r_C) = \gcd(m_C, r_B) = 1$. Then compute the integers of Equation (5), which combine $m_B$ and $m_C$ via the CRT into a single scalar $m_{BC}$. From Equation (5), it follows that $m_{BC} \equiv m_B \bmod r_B$ and $m_{BC} \equiv m_C \bmod r_C$.
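A minimal sketch of such a CRT combination is given below. It uses the textbook CRT formula, which is an assumption about the shape of Equation (5) rather than a reproduction of it; the function name and the toy values $r_B = 3^4$, $r_C = 5^3$ are illustrative. Python 3.8 or later is assumed for the modular inverse via `pow(x, -1, m)`.

```python
from math import gcd

def crt_combine(m_b: int, m_c: int, r_b: int, r_c: int) -> int:
    """Return m_bc in [0, r_b*r_c) with m_bc = m_b (mod r_b) and m_bc = m_c (mod r_c).

    Textbook CRT formula (assumed form):
    m_bc = m_b * r_c * (r_c^-1 mod r_b) + m_c * r_b * (r_b^-1 mod r_c)  (mod r_b*r_c).
    """
    if gcd(r_b, r_c) != 1:
        raise ValueError("r_b and r_c must be coprime")
    inv_rc = pow(r_c, -1, r_b)   # inverse of r_c modulo r_b
    inv_rb = pow(r_b, -1, r_c)   # inverse of r_b modulo r_c
    return (m_b * r_c * inv_rc + m_c * r_b * inv_rb) % (r_b * r_c)

# Toy example: r_B = 3^4, r_C = 5^3, with gcd(m_B, r_C) = gcd(m_C, r_B) = 1.
r_B, r_C = 3**4, 5**3
m_B, m_C = 7, 11
m_BC = crt_combine(m_B, m_C, r_B, r_C)
```

The single scalar `m_BC` then behaves like `m_B` on the order-$r_B$ part and like `m_C` on the order-$r_C$ part of the torsion group.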
For the execution of the eSIDH key generation phase the following two points are computed, $R_B = P_B + [m_B]Q_B$ and $R_C = P_C + [m_C]Q_C$. Thereafter, one can compute $\phi_{BC}$ as shown in Figure 2b, such that the kernel of $\phi_B$ is generated by $R_B$ and the kernel of $\phi_C$ is generated by $\phi_B(R_C)$.⁸ Note that the combined cost of computing $R_B$ and $R_C$ is about the same as the cost of computing $R_A$. As was the case for the eSIDH parallel instantiation, note that the CRT-based approach also implies a saving of $[r_C] \approx \frac{\lambda}{4}$ xDBL operations, corresponding to the leftmost vertical edge between the points $R_C$ and $R_B$ shown in Figure 2b.
For the computation of the eSIDH key agreement phase, as in §3.2.1, let us define the auxiliary public points $S = P_B + P_C$ and $T = Q_B + Q_C$. It turns out that the generators of the subgroups $\langle R_B\rangle$ and $\langle R_C\rangle$ can be recovered by invoking the CRT and Remark 4 applied on the integers given in Equation (5).

FIGURE 3 Overview of an extended supersingular isogeny Diffie-Hellman key exchange protocol parallel instantiation with Bob's secret points computed in parallel. In the key generation phase, $\mathrm{Ker}(\phi_B) = \langle R_B\rangle$ and $\mathrm{Ker}(\phi_C) = \langle\phi_B(R_C)\rangle$; in the key agreement phase, $\mathrm{Ker}(\phi_B') = \langle R_B'\rangle$ and $\mathrm{Ker}(\phi_C') = \langle\phi_B'(R_C')\rangle$. The operator $|\cdot|$ evaluates the bit-length of its operand.

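Proposition 1 Let $P_B$, $Q_B$, $P_C$, $Q_C$, $m_B$, $m_C$, $m_{BC}$, $R_B$ and $R_C$ be defined as before, and fix $S = P_B + P_C$ and $T = Q_B + Q_C$. Then $[r_C]R_B = [r_C](S + [m_{BC}]T)$ and $[r_B]R_C = [r_B](S + [m_{BC}]T)$.
Proof: By straightforward substitution, we obtain

$$[r_C](S + [m_{BC}]T) = [r_C](P_B + [m_{BC}]Q_B) + [r_C](P_C + [m_{BC}]Q_C) = [r_C](P_B + [m_B]Q_B) = [r_C]R_B,$$

where the second equality uses that $P_C$ and $Q_C$ have order $r_C$, that $Q_B$ has order $r_B$, and that $m_{BC} \equiv m_B \bmod r_B$. Using an analogous procedure, one can show that $[r_B]R_C = [r_B](S + [m_{BC}]T)$. □
Using Proposition 1, one can recover the generator $R_B'$ of the subgroup $\mathrm{Ker}(\phi_B')$ and $\phi_B'(R_C')$, the generator of the subgroup $\mathrm{Ker}(\phi_C')$. To this end, one can compute

$$R_B' = [r_C]\big(\phi_A(S) + [m_{BC}]\phi_A(T)\big), \qquad R_C' = [r_B]\big(\phi_A(S) + [m_{BC}]\phi_A(T)\big).$$

Nevertheless, computing the points $R_B'$ and $R_C'$ has a steep combined cost of roughly $\frac{10}{4}\lambda$ xDBL operations. Fortunately, there is an efficient way to reduce this expense.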
Proof: By virtue of Proposition 1, the order-$r_B$ point $R_B'$ generates the kernel of the degree-$r_B$ isogeny $\phi_B'$, that is, $\mathrm{Ker}(\phi_B') = \langle R_B'\rangle$. Writing $R' = \phi_A(S) + [m_{BC}]\phi_A(T)$, by straightforward substitution we obtain $R' = \phi_A(R_B) + \phi_A(R_C)$, where $\phi_A(R_B)$ has order $r_B$ and lies in $\langle R_B'\rangle$, and $\phi_A(R_C)$ has order $r_C$. It follows that $\phi_B'(R') = \phi_B'(\phi_A(R_C))$, which yields an order-$r_C$ point. □
Note that the points $R_B'$ and $\phi_B'(R')$ can serve as the kernel generators of Bob's key agreement phase isogenies $\phi_B'$ and $\phi_C'$, respectively. The cost of computing $R'$ and $R_B'$ (see Figure 4) is about $\frac{4}{4}\lambda$ and $\frac{21}{64}\lambda$ xDBL operations, respectively (see discussion below Equation (4)). There seems to be no obvious way to parallelize these two calculations.
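The identities of Proposition 1 can be sanity-checked in a toy model. As a clearly stated assumption, the sketch replaces the torsion group $E[r_B r_C]$ by $\mathbb{Z}_n \times \mathbb{Z}_n$ with $n = r_B r_C$ (same group structure, no curve arithmetic), and it obtains $m_{BC}$ from the textbook CRT formula; all names and values are illustrative.

```python
# Toy stand-in (assumption): points are pairs of integers modulo n = r_B * r_C.
r_B, r_C = 3**4, 5**3
n = r_B * r_C

def add(P, Q):
    return ((P[0] + Q[0]) % n, (P[1] + Q[1]) % n)

def mul(k, P):
    return ((k * P[0]) % n, (k * P[1]) % n)

# Bases of the order-r_B and order-r_C subgroups, and the auxiliary points.
P_B, Q_B = (r_C, 0), (0, r_C)
P_C, Q_C = (r_B, 0), (0, r_B)
S, T = add(P_B, P_C), add(Q_B, Q_C)

# Secret scalars and their CRT combination (textbook formula, Python >= 3.8).
m_B, m_C = 7, 11
m_BC = (m_B * r_C * pow(r_C, -1, r_B) + m_C * r_B * pow(r_B, -1, r_C)) % n

# Bob's two secret points and the single combined point.
R_B = add(P_B, mul(m_B, Q_B))
R_C = add(P_C, mul(m_C, Q_C))
R = add(S, mul(m_BC, T))

prop1_first = mul(r_C, R_B) == mul(r_C, R)    # [r_C]R_B = [r_C](S + [m_BC]T)
prop1_second = mul(r_B, R_C) == mul(r_B, R)   # [r_B]R_C = [r_B](S + [m_BC]T)
decomposition = R == add(R_B, R_C)            # the combined point splits as R_B + R_C
```

The last check mirrors the substitution step used in the proofs: the single ladder output carries both secret points at once.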

3.3.1 | Computational cost of CRT-based extended supersingular isogeny Diffie-Hellman key exchange protocol instantiation
The scalar multiplication expenses of the CRT-based eSIDH variant are discussed next. Let us consider the eSIDH instantiation depicted in Figure 4. Then, as in traditional SIDH, Alice must perform two $\frac{2}{4}\lambda$-bit scalar multiplications that involve the computation of about $\frac{8}{4}\lambda$ xDBL operations (cf. Remark 2). Taking into account that for the two SIDH phases Alice has to compute the $[2^{e_A-1}]R_0$ kernel point as shown in Subfigure 1a, Alice must compute $\lambda$ extra xDBL operations. This gives a total of $\frac{12}{4}\lambda$ xDBL operations for Alice. During the key generation phase Bob computes the points $R_B$, $R_C$ by performing $\frac{4}{4}\lambda$ and $\frac{2}{4}\lambda$ xDBL operations for a single-core and a two-core implementation, respectively. During the key agreement phase Bob computes the points $R'$, $R_B'$ by performing $\left(\frac{4}{4} + \frac{21}{64}\right)\lambda$ xDBL operations for either a single-core or a two-core implementation.
As shown in Figure 2b, Bob also must calculate two $[3^{e_B-1}]R_0$ kernel points at a combined cost of $2e_B$ triplings. As discussed in §3.2.2, Bob can compute $2e_B$ triplings at an equivalent cost of about $\frac{16}{25}\lambda$ xDBLs. We conclude that the combined scalar multiplication effort of Alice and Bob for the eSIDH CRT-based variant can be estimated as about $\left(\frac{20}{4} + \frac{21}{64} + \frac{16}{25}\right)\lambda$ and $\left(\frac{18}{4} + \frac{21}{64} + \frac{16}{25}\right)\lambda$ xDBL operations for a single-core and a two-core implementation, respectively (cf. Table 1).
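The estimates derived in this section for SIDH, the parallel eSIDH variant and the CRT-based eSIDH variant can be placed side by side with exact fractions. The sketch below only restates coefficients of $\lambda$ quoted in the text; the dictionary layout and names are illustrative.

```python
from fractions import Fraction as F

# Coefficients of lambda (in xDBL): (single-core, two-core) for each protocol.
estimates = {
    "SIDH": (F(20, 4) + F(32, 25), F(20, 4) + F(32, 25)),
    "eSIDH parallel": (F(16, 4) + F(114, 50), F(14, 4) + F(73, 50)),
    "eSIDH CRT": (
        F(20, 4) + F(21, 64) + F(16, 25),
        F(18, 4) + F(21, 64) + F(16, 25),
    ),
}

# Protocol with the smallest two-core estimate.
best_two_core = min(estimates, key=lambda name: estimates[name][1])

# Ratio of the SIDH estimate to the two-core parallel eSIDH estimate.
two_core_ratio = estimates["SIDH"][1] / estimates["eSIDH parallel"][1]

for name, (single, dual) in estimates.items():
    print(f"{name:15s} single-core {float(single):.3f}  two-core {float(dual):.3f}")
```

Under these estimates the parallel variant has the lowest two-core scalar multiplication count, which is consistent with the claim made at the start of this section; note that this ratio covers scalar multiplications only, not the isogeny computations.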

| Hunting for efficient eSIDH primes
Let $N = \lceil\lceil\log_2(p)\rceil/w\rceil$ be the minimum number of $w$-bit words needed to represent an eSIDH prime $p$, where it is assumed that $w = 64$. We say that a modulus $p$ is $\gamma$-Montgomery-friendly if $p \equiv \pm 1 \bmod 2^{\gamma\cdot w}$, for a positive integer $\gamma$ [20,21]. This property implies that $-p^{-1} \equiv \mp 1 \bmod 2^{\gamma\cdot w}$, which is conveniently exploited to produce savings in Montgomery's reduction algorithm (also commonly known as REDC) [16]. SIKE uses primes of the form $p = 4^{e_A}3^{e_B} - 1$. There are at least two computer arithmetic reasons for this choice. One of them is that this family of primes is Montgomery-friendly, which implies that its members admit fast Montgomery reduction [4,16]. The second advantage is that there exist highly efficient formulas for computing degree-3 and degree-4 isogenies [17,18]. The eSIDH primes proposed are of the form $p = 4^{e_A}\ell_B^{e_B}\ell_C^{e_C}f - 1$, which are much more flexible and abundant than the SIKE primes. Given some fixed values for $N$ and the small primes $\ell_B$ and $\ell_C$, one searches for $\frac{N}{2}$-Montgomery-friendly primes (if they exist) by varying $e_B$, $e_C$ and $f$. These friendlier Montgomery-friendly primes achieve a faster Montgomery reduction (see [4, Algorithm 6]) than the ones that could possibly be obtained from comparable SIKE primes.
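The definition of $\gamma$-Montgomery-friendliness can be tested directly on a candidate prime. The sketch below uses hypothetical helper names and takes as its example the SIKEp434 prime $2^{216}3^{137} - 1$; it returns the largest $\gamma$ for which $p \equiv \pm 1 \bmod 2^{\gamma w}$ holds.

```python
def two_adic_valuation(x: int) -> int:
    """Largest k such that 2^k divides the positive integer x."""
    return (x & -x).bit_length() - 1

def montgomery_friendliness(p: int, w: int = 64) -> int:
    """Largest gamma such that p = +1 or -1 modulo 2^(gamma * w), for odd p > 2."""
    v = max(two_adic_valuation(p + 1), two_adic_valuation(p - 1))
    return v // w

def word_count(p: int, w: int = 64) -> int:
    """Minimum number N of w-bit words needed to represent p."""
    return -(-p.bit_length() // w)   # ceiling division

# Example: the SIKEp434 prime.
p434 = 2**216 * 3**137 - 1
gamma = montgomery_friendliness(p434)
n_words = word_count(p434)

# REDC constant -p^(-1) modulo 2^(gamma * w); it is 1 because p = -1 modulo 2^(gamma * w).
R = 2 ** (gamma * 64)
redc_constant = (-pow(p434, -1, R)) % R
```

For this prime the sketch reports seven words and $\gamma = 3$, so a search for an $\frac{N}{2}$-Montgomery-friendly prime of the same size would be looking for a slightly larger power of two dividing $p + 1$.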
Another important design aspect to be considered is that on Bob's side, there exists a trade-off between the size of the base-primes $\ell_B$ and $\ell_C$ and their corresponding exponents $e_B$ and $e_C$, respectively. The base-primes define the size of the step, whereas their exponents determine how many steps one must perform for isogeny evaluations and constructions. Depending on the exact choice of these parameters, one can make a few big steps or many small steps. Furthermore, as discussed in §3.2, to take full advantage of balanced parallel computations and because of security reasons (cf. Remark 6), it is important to choose $\log_2(\ell_B)\, e_B \approx \log_2(\ell_C)\, e_C$ as well as $\log_2(\ell_B)\, e_B + \log_2(\ell_C)\, e_C \approx 2 \cdot e_A$.
Hence, we used primes of the form $p = 4^{e_A} \ell_B^{e_B} \ell_C^{e_C} f - 1$ with $4^{e_A} \approx \ell_B^{e_B} \ell_C^{e_C}$, where $e_A$ is chosen so that the security level offered by the SIKE primes as specified in [9] is matched (see also [11]). The cofactor $f = 2^k c$ is carefully selected so that $p$ qualifies as an $\frac{N}{2}$-Montgomery-friendly prime (if at all possible). Table 3 shows our selection of four eSIDH primes matching the four security levels specified in [9]. When searching for eSIDH primes with security comparable to that offered by the SIKE prime $p_{434}$, the best choice that we were able to find is eSIDH-$p_{434}$ as specified in Table 3. Both SIKE-$p_{434}$ and eSIDH-$p_{434}$ fit in seven 64-bit words and are 3-Montgomery-friendly primes. This implies that the field arithmetic costs associated with SIKE-$p_{434}$ and eSIDH-$p_{434}$ are fairly similar (cf. Table 2). For the other three security levels, we managed to find $\frac{N}{2}$-Montgomery-friendly eSIDH primes sharing the same security level as their SIKE prime counterparts.
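The kind of search described here can be sketched at toy scale. The code below is an illustrative assumption, not the authors' search procedure: it fixes a small $e_A$, enumerates $(e_B, e_C, f)$, keeps candidates that satisfy the two balance conditions within a tolerance chosen here for illustration, and tests $p = 4^{e_A} \ell_B^{e_B} \ell_C^{e_C} f - 1$ for primality. The parameters $e_A = 16$, $\ell_B = 3$, $\ell_C = 5$ are hypothetical and far below cryptographic size.

```python
from math import log2


def is_prime(n: int) -> bool:
    """Miller-Rabin with the first 12 prime bases.

    This base set is deterministic for n below roughly 3 * 10**23, which
    comfortably covers the ~72-bit toy candidates generated below.
    """
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for q in bases:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def toy_search(e_a, l_b, l_c, max_f, tol_bits=2.0):
    """Yield (e_b, e_c, f, p) with p = 4^e_a * l_b^e_b * l_c^e_c * f - 1 prime.

    Both balance conditions are enforced up to tol_bits bits (an assumption
    made for this sketch): the two odd prime powers have similar bit sizes,
    and together they match the 2*e_a bits of 4^e_a.
    """
    target = 2 * e_a
    for e_b in range(1, int(target / log2(l_b)) + 1):
        for e_c in range(1, int(target / log2(l_c)) + 1):
            bits_b, bits_c = e_b * log2(l_b), e_c * log2(l_c)
            if abs(bits_b - bits_c) > tol_bits:
                continue
            if abs(bits_b + bits_c - target) > tol_bits:
                continue
            for f in range(1, max_f + 1):
                p = 4**e_a * l_b**e_b * l_c**e_c * f - 1
                if is_prime(p):
                    yield e_b, e_c, f, p


candidates = list(toy_search(e_a=16, l_b=3, l_c=5, max_f=64))
```

In a realistic search, the surviving candidates would then be ranked by how many low-order words of $p + 1$ vanish, since even cofactors $f = 2^k c$ increase the power of two dividing $p + 1$.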

| Results and discussion
In this subsection, we present a full implementation of the eSIDH protocol proposed in this work. We focus mainly on the parallel eSIDH instantiation discussed in §3.2, and we use the SIDH implementation of [9] as a baseline to measure the acceleration factor achieved by the eSIDH scheme. Building on the techniques proposed in [6], we also report a multicore implementation of the SIDH protocol. To the best of our knowledge, this is the first reported multicore software implementation of SIDH. Our two case studies targeted $p_{434}$ and $p_{751}$, the smallest and largest primes included in the SIKE specification [9].
Our software library is freely available from [link eliminated for anonymity]. Table 2 compares the field arithmetic costs associated with the SIKE primes $p_{434}$ and $p_{751}$ with those exhibited by the eSIDH primes $p_{434}$ and $p_{765}$, respectively. Note that the field arithmetic of our eSIDH prime $p_{765}$ is noticeably faster than that of the SIKE prime $p_{751}$. This acceleration is explained by the fact that $p_{765}$, being a friendlier Montgomery-friendly prime, admits a faster modular reduction than $p_{751}$.
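To see why a friendlier prime reduces faster, consider the following minimal sketch of a word-block Montgomery reduction for a modulus with $p \equiv -1 \bmod 2^{\gamma \cdot w}$. It is not [4, Algorithm 6] nor the authors' code; the function name and interface are assumptions, and Python integers stand in for multi-word arithmetic. Because $-p^{-1} \equiv 1$ modulo each block, the quotient is read directly from the low words with no multiplication, and a larger $\gamma$ means fewer block steps.

```python
def redc_friendly(t: int, p: int, n_words: int, gamma: int, w: int = 64) -> int:
    """Return t * R^(-1) mod p for R = 2^(n_words * w).

    Assumes p = -1 mod 2^(gamma * w) and 0 <= t < p * R.
    """
    remaining = n_words
    while remaining > 0:
        chunk = min(gamma, remaining)    # number of low words cleared in this step
        shift = chunk * w
        q = t & ((1 << shift) - 1)       # q = t * (-p^-1) mod 2^shift, with -p^-1 = 1
        t = (t + q * p) >> shift         # exact: t + q*p = t - q = 0 mod 2^shift
        remaining -= chunk
    return t - p if t >= p else t        # result is below 2p, so one subtraction suffices


# Example with the SIKE prime 2^216 * 3^137 - 1 (seven 64-bit words, gamma = 3).
p = 2**216 * 3**137 - 1
R = 1 << (7 * 64)
a, b = pow(3, 200, p), pow(5, 150, p)
reduced = redc_friendly(a * b, p, n_words=7, gamma=3)
```

With $N = 7$ and $\gamma = 3$ the loop runs three times, whereas treating the same prime as only 1-Montgomery-friendly would take seven steps for the same result.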

| Parallelizing SIDH protocol
Following an approach similar to that of [5,6], in this work we parallelize the SIDH implementation of [9] as follows. Alice's and Bob's isogeny evaluations and constructions were computed using the optimal strategy of [2]. Optimal strategies typically produce an average of four points per curve whose isogeny images can be processed concurrently [5]. Hence, our two- and three-core implementations strove to perform as many isogeny evaluations concurrently as possible.
TABLE 3 Our selection of extended supersingular isogeny Diffie-Hellman key exchange protocol (eSIDH) primes matching the four security levels offered by the supersingular isogeny key encapsulation (SIKE) primes included in [9], where $N = \lceil \lceil \log_2(p) \rceil / 64 \rceil$ and $\gamma$ is the largest integer for that $N$ such that $p \equiv -1 \bmod 2^{\gamma \cdot 64}$ holds. Column headings: eSIDH primes proposed here, $N$, $\gamma$, SIKE primes as in [9].
Table 4 shows that, with respect to a sequential implementation, two-core and three-core parallel implementations of the SIDH $p_{434}$ instantiation yield speed-up factors of 1.062 and 1.123, respectively. Likewise, Table 5 reports that, with respect to a sequential implementation, two-core and three-core parallel implementations of the SIDH $p_{751}$ instantiation yield speed-up factors of 1.118 and 1.216, respectively.
Table 4 also reports the performance timings achieved by the eSIDH-$p_{434}$ parallel instantiation. An implementation of eSIDH using the eSIDH-$p_{434}$ prime yields acceleration factors of 0.971, 1.073 and 1.086 over SIDH using the SIKE-$p_{434}$ prime when both instantiations are implemented on $k$-core processors, $k \in \{1, 2, 3\}$. We note that for a single-core implementation, eSIDH with the eSIDH-$p_{434}$ prime is slower than SIDH with the SIKE-$p_{434}$ prime. Table 5 also reports the performance timings achieved by the eSIDH-$p_{765}$ parallel instantiation. An implementation of eSIDH using the eSIDH-$p_{765}$ prime yields acceleration factors of 1.050, 1.160 and 1.161 over SIDH using the SIKE-$p_{751}$ prime when both instantiations are implemented on $k$-core processors, $k \in \{1, 2, 3\}$.
Table 6 reports the performance timings achieved by the eSIDH-$p_{434}$ parallel instantiation. An implementation of SIKE using the eSIDH-$p_{434}$ prime yields acceleration factors of 0.97, 1.06 and 1.07 over SIKE using the SIKE-$p_{434}$ prime when both instantiations are implemented on $k$-core processors, $k \in \{1, 2, 3\}$. Table 7 reports the performance timings achieved by the eSIDH-$p_{765}$ parallel instantiation. An implementation of SIKE using the eSIDH-$p_{765}$ prime yields acceleration factors of 1.06, 1.15 and 1.14 over SIKE using the SIKE-$p_{751}$ prime when both instantiations are implemented on $k$-core processors, $k \in \{1, 2, 3\}$.
We stress that even for a single-core implementation of this case study, our eSIDH variant produces a modest but noticeable speedup of about 5% and 6% in SIDH and SIKE, respectively. Moreover, using a single-core SIDH $p_{751}$ implementation as a baseline, a parallel eSIDH $p_{765}$ implementation yields acceleration factors of 1.05, 1.29 and 1.41 when executed on $k$-core processors, $k \in \{1, 2, 3\}$.
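The combined factors of 1.05, 1.29 and 1.41 are consistent with multiplying the two kinds of speedups quoted above. The sketch below is only a sanity check on the rounded figures in the text; it assumes the factors compose multiplicatively, and the published values were presumably derived from the raw timings, so small rounding differences are expected.

```python
# Speed-up of parallel SIDH p751 over sequential SIDH p751 (from the text, Table 5).
sidh_parallel = {1: 1.000, 2: 1.118, 3: 1.216}

# Speed-up of eSIDH p765 over SIDH p751 on the same number of cores (from the text).
esidh_vs_sidh = {1: 1.050, 2: 1.160, 3: 1.161}

# eSIDH p765 on k cores relative to single-core SIDH p751.
combined = {k: sidh_parallel[k] * esidh_vs_sidh[k] for k in (1, 2, 3)}

for k, factor in combined.items():
    print(f"k = {k}: about {factor:.3f}x over sequential SIDH p751")
```

The products agree with the reported factors to within 0.01, which suggests that the multicore gains come from both the parallel isogeny evaluations and the faster prime.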

| CONCLUSIONS
The eSIDH scheme, a variant of the SIDH protocol in [2], has been presented. Our experimental results show that a parallel eSIDH implementation is faster than a corresponding parallel version of SIDH. Our future work includes expanding the search for more efficient eSIDH primes for all four security levels considered in [9]. Building on the work presented in [6], we would also like to explore more aggressive approaches for parallelizing the SIDH isogeny computations and evaluations. The algorithmic ideas discussed here might be useful for the B-SIDH construction [8], where, given the generous size of the prime factors involved in the factorization of $p \pm 1$, parallel implementations appear to be mandatory. We would also like to explore applications of eSIDH to the client-server scenarios discussed in [8].