Article

Energy/Area-Efficient Scalar Multiplication with Binary Edwards Curves for the IoT

by Carlos Andres Lara-Nino 1,*,†, Arturo Diaz-Perez 2,† and Miguel Morales-Sandoval 1,†
1 CINVESTAV Tamaulipas, Victoria 87130, Mexico
2 CINVESTAV Guadalajara, Zapopan 45019, Mexico
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Sensors 2019, 19(3), 720; https://doi.org/10.3390/s19030720
Submission received: 30 November 2018 / Revised: 29 January 2019 / Accepted: 30 January 2019 / Published: 10 February 2019

Abstract

Making Elliptic Curve Cryptography (ECC) available for the Internet of Things (IoT) and related technologies is a recent topic of interest. Modern IoT applications transfer sensitive information which needs to be protected. This is a difficult task due to the processing power and memory availability constraints of the physical devices. ECC mainly relies on scalar multiplication (kP)—which is an operation-intensive procedure. The broad majority of kP proposals in the literature focus on performance improvements and often overlook the energy footprint of the solution. Some IoT technologies—Wireless Sensor Networks (WSN) in particular—are critically sensitive in that regard. In this paper we explore energy-oriented improvements applied to a low-area scalar multiplication architecture for Binary Edwards Curves (BEC)—selected given their efficiency. The design and implementation costs for each of these energy-oriented techniques—in hardware—are reported. We propose an evaluation method for measuring the effectiveness of these optimizations. Under this novel approach, the energy-reducing techniques explored in this work contribute to achieving the scalar multiplication architecture with the most efficient area/energy trade-offs in the literature, to the best of our knowledge.

1. Introduction

The deployment of Internet of Things (IoT) applications is pushing society to interact with smart environments on a regular basis. Smartphones, buildings, vehicles, roads, home appliances; most new instances of these technologies are being equipped with capabilities for data sensing and internet connectivity [1]. The data retrieved by these systems might be sensitive, since it can be inherently confidential [2] or can be used to infer a user’s behavior [3]. Providing security for the IoT is said to be the equivalent of providing security for a conventional network, with the added complexity that the network can be physically reached by attackers [4].
A common characteristic in many IoT nodes is that they suffer from physical constraints, most notably on size and energy [4,5]. For reducing manufacturing costs, devices’ physical size needs to be decreased.
For some, one of the most precious resources of a constrained device is energy [6,7]. The reasoning is that after deployment, some nodes rely on battery systems which cannot be replaced and ought to last for several months or years. That is why “to minimize energy consumption, lightweight Public-key Cryptography (PKC) implementations are a fundamental requirement” [8]. For both cases, lightweight cryptography can provide an effective solution that (a) is physically small and (b) has low energy consumption.
Elliptic Curve Cryptography (ECC) has proven to be one of the best PKC alternatives for constrained applications [8,9,10,11,12,13] where multiple restrictions are also observed. Compared to other PKC systems, ECC features reduced key sizes for equivalent security levels [14]. ECC can be used for achieving key establishment, encryption, authentication, and signatures, among other security functions. The fundamental operation required in ECC is kP, which relies on a large number of field operations. Although improving the performance and area of this algorithm has been widely addressed in the literature, the energy profile of these systems has seldom been studied [15].
The most popular strategy for reducing the energy consumption of an implementation is to reduce its runtime [16,17,18,19,20]. However, some performance-enhancing techniques might prove to be too costly for constrained devices in terms of hardware utilization. Area minimization can also lead to energy savings by reducing the power dissipation—to a lesser extent this approach has also been studied in the literature [21,22]. Experimenting with the tradeoffs of both methodologies can lead to novel design insights which can be used to reduce the energy footprint of the system.
In this paper, we explore the area/energy tradeoffs on an FPGA-based realization of scalar multiplication for Binary Edwards Curves (BECs) [23]. We use a lightweight area-oriented kP architecture as the starting point for applying a sequence of energy-related improvements. The efficiency of these modifications is assessed at each step. Our approach emulates an incremental development, where the final step yields a solution that is efficient in both hardware usage and energy consumption. We have chosen BECs as a case study, but the energy-reduction strategies applied can be translated to any other elliptic curve. Our optimizations focus on hardware, since in that way the module can be used as a security accelerator which is enabled upon request to further save energy. The proposed improvements are evaluated with performance and area metrics, as well as with a novel method for estimating the energy savings in relation to area costs. Under this novel approach, we show that one of the kP architectures proposed is, to the best of our knowledge, the most efficient design reported. As the evaluation platform we use the xc6slx16 low-cost FPGA operating at commonly used frequencies.
Our contributions can be summarized as follows:
  • The detailed implementation and assessment of energy-reducing techniques are presented. The techniques employed are often used in the literature, but the actual effectiveness of each one is seldom explored. In this paper we aim at filling this gap by providing detailed implementation results. With this study, researchers aiming at producing new low-energy designs can have a precedent for choosing the strategies best suited to their projects.
  • Our architectures improve the state of the art with regard to area/energy efficiency. This is thanks in part to the carefully designed cryptosystem and to the design methodology followed.
  • We have created and described a novel evaluation metric for assessing the efficiency of the proposed architectures in terms of energy reduction and area increments. This metric can account for variations in the measurement units, the operational frequency, and the underlying finite field. Thanks to these points we were able to employ the novel metric for benchmarking our architectures and the entirety of the state of the art for low-power/low-energy scalar multiplication realizations.
The rest of the paper is structured as follows. Section 2 briefly enumerates some preliminary notions regarding the topics in this paper. Section 3 describes the energy-oriented improvements applied to ECC architectures in selected works from the literature. The description and implementation for our energy improvements can be found in Section 4. A novel evaluation method for energy improvements is detailed in Section 5. Lastly, our concluding remarks are available in Section 6.

2. Preliminaries

2.1. Elliptic Curve Cryptography

An elliptic curve can be described as the set of points that satisfy the Weierstrass model in (1) over the finite field $\mathbb{F}_q$.
$$E : y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \quad \text{with } a_i \in \mathbb{F}_q \tag{1}$$
Simplifications of (1) and equivalences are used as the basis for different elliptic curve families: random prime, random binary, Koblitz, Montgomery, Edwards, twisted Edwards, binary Edwards, among others.
The elliptic curve points $E$, a group operation $+$, and the point at infinity $\mathcal{O}$ form an elliptic curve group $E(\mathbb{F}_q)$, which can be used in cryptographic applications. The operation $+$ is the addition of points; it varies for each elliptic curve family. Thus $kP$ represents the consecutive application of the group operation $k$ times over the base point or generator $P$:
$$Q = P + P + \cdots + P = kP. \tag{2}$$
In practice kP relies on point addition ( P + P ) and doubling ( 2 P ), where each is composed of multiple field operations. The complexity of kP depends on the group and field arithmetic definitions. The kP calculation is used in any ECC-based algorithm, hence improving its efficiency is critical.
The BEC family is defined by the model
$$E_B : d_1 (x + y) + d_2 (x^2 + y^2) = x y + x y (x + y) + x^2 y^2 \tag{3}$$
where $d_1, d_2 \in \mathbb{F}_{2^m}$ with $d_1 \neq 0$ and $d_2 \neq d_1^2 + d_1$. These curves are birationally equivalent to binary generic curves [23]. The principal advantages of BECs are that (a) their group operation is complete, so no extra checks are required, and (b) their group operation requires fewer field operations.
In [23] the authors introduced the concept of w coordinates for BEC. By using this point representation it is possible to reduce the amount of field operations required in performing kP. Furthermore, the use of projective-w coordinates enables reducing the number of inversions required—inversions are some of the most expensive field operations. Differential addition and doubling formulae can be combined with projective-w coordinates to achieve the smallest requirements in terms of field operations for kP in BECs [24].

2.2. Power and Energy

Let the energy (ENE) consumed by a circuit to perform a task be defined as the product of the dissipated power (POW) and the runtime ($t$):
$$\mathit{ENE} = \mathit{POW} \times t \tag{4}$$
This approach is employed in multiple works from the literature [17,18,25,26,27].
We consider the runtime to be directly linked with the performance of the system: lower runtime equals higher performance and vice versa. The runtime is the product of the latency clock cycles (LAT) and the inverse of the operational frequency ($f$):
$$t = \mathit{LAT} \times \frac{1}{f} \tag{5}$$
POW is obtained as the sum of the dynamic (DP) and static or quiescent (SP) powers:
$$\mathit{POW} = \mathit{DP} + \mathit{SP} \tag{6}$$
DP is the sum of powers associated with clocking, signals, logic, IOs, and dedicated blocks; this includes the data-dependent power. SP is dissipated by the whole FPGA fabric and remains somewhat constant regardless of the implemented circuit. For FPGAs the static power tends to be higher than the dynamic part. Each component is usually modeled as
$$\mathit{DP} = e \times f \times A \quad \text{and} \quad \mathit{SP} = I_s \times V_{cc} \tag{7}$$
where $e$ is the average energy spent during one clock cycle per area unit, $A$ represents the area of the circuit, $I_s$ is the static current consumed from the power supply, and $V_{cc}$ is the supply voltage. The designer has control over the operational frequency, the latency, and the area to influence the energy consumption of the system.
The effects of $f$ on ENE are not straightforward. If $f$ is reduced, then $t$ grows and ENE rises, as shown in (4), because of the SP component in POW; if $f$ is increased, then POW may grow due to its DP element, see (7), and the increment of ENE follows from (4). Finding the optimal operational frequency for the proposed kP architecture is outside the scope of this work; we do, however, use two operational frequencies (low vs. high) to study this variation.
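The model in (4) to (7) can be exercised numerically. The following is a minimal sketch, not a measurement: the function name `energy_joules` and every device parameter in `PARAMS` are hypothetical values chosen only for illustration, while the latency is the cycle count reported for the base architecture in Section 4.1 and the two frequencies are the ones used in the evaluation of Section 4.3.2.

```python
def energy_joules(latency_cycles, freq_hz, e_joules, area_units,
                  static_current_a, vcc_volts):
    """ENE = POW * t, with t = LAT / f and POW = e*f*A + I_s*V_cc."""
    runtime_s = latency_cycles / freq_hz           # Eq. (5)
    dynamic_w = e_joules * freq_hz * area_units    # DP in Eq. (7)
    static_w = static_current_a * vcc_volts        # SP in Eq. (7)
    return (dynamic_w + static_w) * runtime_s      # Eqs. (4) and (6)


# Hypothetical device parameters (assumptions, not values from the paper).
PARAMS = dict(e_joules=1e-12, area_units=100,
              static_current_a=0.010, vcc_volts=1.2)
LATENCY = 832_818  # cycles per kP for the base architecture (Section 4.1)

for freq_hz in (100e3, 13.56e6):
    ene = energy_joules(LATENCY, freq_hz, **PARAMS)
    print(f"f = {freq_hz:12.0f} Hz -> ENE = {ene:.6e} J")
```

With these assumed parameters the static term dominates at the lower frequency, because the longer runtime multiplies the constant SP contribution.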
So, if we seek to reduce ENE we need to find a minimum in the balance between the area and the latency. The former has been the main optimization goal for lightweight cryptography, whereas the latter has generated interest in recent years [28].
Other popular optimizations such as clock gating and datapath insulation aim at mitigating the switching activity of parts of the circuit which are not actively used—these aim at reducing the dynamic power consumption of the circuit.

2.3. Percentile Differences

The percentile increment ($\Delta\%$) is provided whenever new implementation results are presented. These increments are calculated as the difference between the new observation ($OC_i$) and the previous observation ($OC_{i-1}$), with reference to $OC_{i-1}$:
$$\Delta\% = \frac{OC_i - OC_{i-1}}{OC_{i-1}} \times 100 \tag{8}$$
In this work we only use percentile differences to assess the area increments and the energy decrements.
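As a small illustration of (8), the sketch below computes the increment for two made-up observations; the function name `percent_difference` and the sample values are assumptions used only to show the sign convention (positive for an increment, negative for a decrement).

```python
def percent_difference(new, previous):
    """Delta% of Eq. (8): change of `new` relative to `previous`."""
    return (new - previous) / previous * 100


# Hypothetical observations: an area that grows and an energy that shrinks.
print(percent_difference(107.0, 100.0))  # area increment
print(percent_difference(93.0, 100.0))   # energy decrement
```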

2.4. Evaluation Environment

We used the Xilinx ISE Design Suite 14.3 for synthesis and configuration of all the architectures described. The designs were described in VHDL and synthesized with Area Reduction as design goal and strategy2 as strategy. All the results provided in this document were obtained after Place and Route (PAR) unless explicitly stated otherwise.
The power estimations reported in this paper correspond to the sum of dynamic and static power. Since for FPGAs the static part tends to outweigh the dynamic power, in some cases the total power might appear somewhat constant.
These estimations were obtained using the Xilinx XPower Analyzer software. In order to obtain a high overall confidence level we employed the post-PAR design file (ncd), the physical constraints file (pcf) for the specified FPGA, and a simulation activity file (saif). The latter was obtained using the Xilinx Isim software from a post-PAR simulation; each one of the architectures was simulated using actual data for over 10,000 cycles.

3. Energy Reduction in the Literature

Improving the performance of the system is one of the most common approaches for reducing the energy consumption [16,17,18,19,20,29,30,31,32,33,34]. In [35] it is shown that techniques like pipelining and parallelism can be used to reduce the power consumption. If the computations are completed quickly, a moderate rise in the power required (due to increments in the area and switching activity) can be mitigated by the time reduction. In this regard multiple alternatives have been proposed: using low-latency algorithms, proposing low-latency implementations, exploiting algorithm parallelism, and using dedicated processing units. Nonetheless, just as it is inadequate to say that low-area equals lightweight, it is also flawed to assume that high-performance equals low-energy. As reviewed in the previous section, the relations between energy and performance are not clear-cut. The other strategies for achieving power reduction consider area minimization [21,22] and exploring area/performance tradeoffs [32,36].
From the perspective of security protocols, it can be concluded that low overheads in the number of packets [22,37,38,39,40] and the number of cryptographic operations [32,38,41,42,43] are key for low-energy PKC. These nodes are characterized by wireless transmissions, which require considerable amounts of energy to be performed, thus it is opportune to use protocols with low packet count requirements. As mentioned, ECC offers the smallest key sizes for comparable security levels. That property holds for all the group elements, thus contributing to reducing the transmissions overhead.
The implementation platform plays a significant role in the design of an ECC system. Using a generic processor would imply selecting prime curves, since commercial ALUs seldom include binary multipliers. On the other hand, a hardware solution would benefit from using binary curves [17].
Selecting the adequate coordinate representation, the group operations, and the field operations used in the ECC system is of paramount importance. For a software-implementation these choices translate into different routines that are executed by the processor, whilst for a hardware-realization these translate into different hardware modules. Processors benefit from shorter routines, from quick calculations, but also from reduced memory accesses [44]. On the other hand, hardware architectures can exploit the arithmetic of binary fields for performing calculations swiftly.
For hardware systems, low latency designs are generally preferred, although implementing faster application-specific modules can result in expensive area investments. How much hardware can be used for improving performance? In the literature, significant interest has been put into concrete points: (a) selecting the optimal field inverter [16,17,18,19,27,30,41,45,46]; (b) determining the adequate digit size for the digit multiplier [17,22,27,38]; (c) designing dedicated squaring modules [16,17,18,25,27,41,45,47].
At circuit level, some works have explored reducing the switching activity of the design by applying clock gating [18,25,26,29,30,38,47,48,49,50], reducing memory accesses [26,30,41,44,46,47], and implementing datapath insulation [38,47].

4. Methods

In this section we outline the application and evaluation of different energy-reducing techniques over a low-area kP architecture. Throughout the document we use the Binary Edwards Curve BE251 [23] as case study.
We study three architecture-level transformations—field inverter, field multiplier, field squarer—as well as a circuit-level modification—datapath insulation. We study these strategies in the aforementioned order so that the contribution of each technique can be studied in a way in which it benefits the most from previous techniques.

4.1. Starting Point: Low-Area kP Architecture

In Figure 1 we illustrate the base area-optimized architecture used. This module follows the Montgomery Ladder algorithm with differential addition and doubling for binary Edwards curves in mixed-w coordinates as proposed in [24]. One of the main characteristics of this design is that it offers flexibility of the field, curve, base point, and scalar; all the proposed optimizations ought to preserve this property.
The field operations supported by this design are multiplication, addition, and inversion. A bit-serial like multiplier is used to reduce implementation size. Addition is performed by a layer of XOR gates. Field inversion is required to convert the input and output of the system from w to projective-w coordinates and vice versa. This operation is performed with only multiplications thanks to Fermat’s Little Theorem. The particular inversion algorithm used is Wang’s [51].
In regards to latency, each inversion requires $2m - 3$ $m$-bit multiplications, which amounts to 125,249 cycles when $m = 251$. The $m$-step Montgomery ladder requires $9 \times m$ full multiplications with a latency of 567,009 cycles, $m$ short multiplications with a latency of 14,558 cycles, and $3 \times m$ additions which take 753 cycles. The architecture requires two inversions and an $m$-bit Montgomery ladder per kP, hence the total latency of the design is 832,818 cycles.
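The latency figures above can be reproduced with a few lines of arithmetic. In this sketch the per-operation cycle counts (251 cycles per full multiplication, 58 per short multiplication, 1 per addition) are not stated in the text; they are inferred by dividing the reported totals by the reported operation counts, and the function names are illustrative.

```python
M = 251
CYCLES_FULL_MULT = 251   # inferred: 125,249 / (2*251 - 3)
CYCLES_SHORT_MULT = 58   # inferred: 14,558 / 251
CYCLES_ADD = 1           # inferred: 753 / (3*251)


def inversion_cycles(m, mult_cycles):
    """Wang inversion with squarings done as multiplications: 2m - 3 products."""
    return (2 * m - 3) * mult_cycles


def ladder_cycles(m, full_cycles, short_cycles, add_cycles):
    """m-bit ladder: 9m full products, m short products, 3m additions."""
    return 9 * m * full_cycles + m * short_cycles + 3 * m * add_cycles


def kp_cycles(m, full_cycles, short_cycles, add_cycles):
    """Two inversions plus one m-bit Montgomery ladder per kP."""
    return (2 * inversion_cycles(m, full_cycles)
            + ladder_cycles(m, full_cycles, short_cycles, add_cycles))


print(inversion_cycles(M, CYCLES_FULL_MULT))
print(kp_cycles(M, CYCLES_FULL_MULT, CYCLES_SHORT_MULT, CYCLES_ADD))
```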
While this design performs well in regards to hardware resources, it requires many latency cycles. This has a negative effect on the performance and energy consumption of the system. In the following we review the application of different optimization strategies devised to reduce the energy footprint.

4.2. Modification 1: Inversion Algorithm

Field inversion provides a convenient way to perform divisions in finite fields. Such operations are required in point conversion. The scalar multiplication algorithm selected requires two inversions. The Wang inversion algorithm is used in the C0 architecture. Although this method is simple and flexible, more efficient solutions exist.

4.2.1. Fermat’s Little Theorem

Let $q$ be a prime number and let $a$ be an integer satisfying $\gcd(a, q) = 1$; then
$$a^{q-1} \equiv 1 \bmod q \tag{9}$$
This result is known as Fermat's Little Theorem [52]. A simple proof of the theorem is provided in [53]. Consider the product $(a)(2a)(3a)\cdots((q-1)a)$, which can be written as $(q-1)!\,a^{q-1}$. The list of terms in the product modulo $q$ is a complete list of the values from 1 to $q-1$, since no two terms in the list are equivalent modulo $q$. From this, the product can also be written as $(q-1)! \bmod q$. Thus
$$(q-1)!\,a^{q-1} \equiv (q-1)! \bmod q, \tag{10}$$
and, since $(q-1)!$ is coprime to $q$ and can be cancelled from both sides, (9) is demonstrated.

4.2.2. Divisions on Finite Fields

In 1979, MacWilliams and Sloane demonstrated that every element $a \in \mathbb{F}_{p^m}$, where $p = 2^n$, satisfies the identity $a^{p^m} = a$. This, together with the demonstration from Wang in 1985 that a non-zero element $a \in \mathbb{F}_{p^m}$ has a unique multiplicative inverse $a^{-1}$, shows that $a^{-1} = a^{2^m - 2}$.
Then, for all $a \in \mathbb{F}_{2^m}$, $a \neq 0$, $a^{-1}$ can be computed as
$$a^{-1} = a^{2^m - 2} = a^{2} \times a^{2^2} \times \cdots \times a^{2^{m-1}} \tag{11}$$
according to a generalization of (9). This requires $m - 2$ multiplications and $m - 1$ squarings [54].
Inverses are important in calculating divisions since
$$\frac{a}{b} = c \iff a\,b^{-1} = c \tag{12}$$
Therefore, it is possible to perform divisions through a series of repeated multiplications and squarings.

4.2.3. Wang Inversion

The naïve approach for computing inversions through Fermat’s Little Theorem is denominated Wang Inversion [51]. As presented in Algorithm 1, this operation requires $m - 2$ multiplications and $m - 1$ squarings.
Algorithm 1 Wang Inversion Method.
Input: $A(x) \in \mathbb{F}_{2^m}$, $f(x)$ the irreducible polynomial of $\mathbb{F}_{2^m}$
Output: $C(x) = A^{-1}(x) \bmod f(x)$
$C(x) \leftarrow A(x)$
for $i = 1$ to $m - 2$ do
  $B(x) \leftarrow C^2(x) \bmod f(x)$
  $C(x) \leftarrow A(x)\,B(x) \bmod f(x)$
end for
$C(x) \leftarrow C^2(x) \bmod f(x)$
return $C(x)$
Albeit slow, the Wang method of inversion is capable of solving for any $A(x)$ which has an inverse over $\mathbb{F}_{2^m}$ with $m$ of any length. It is also important to note that only two registers are required in this procedure.
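Algorithm 1 can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware design: field elements are stored as Python integers whose bit $i$ is the coefficient of $x^i$, the helper `gf2m_mul` is a plain shift-and-add multiplier written for this example, and the small field GF(2^8) with the polynomial `0x11B` is used only because it can be checked exhaustively.

```python
def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m); `poly` is f(x) with bit m set."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # a <- x * a
        if (a >> m) & 1:
            a ^= poly            # reduce modulo f(x)
    return result


def wang_inverse(a, m, poly):
    """Algorithm 1: A^(2^m - 2) using two working values, B and C."""
    c = a
    for _ in range(m - 2):               # i = 1 .. m-2
        b = gf2m_mul(c, c, m, poly)      # B <- C^2
        c = gf2m_mul(a, b, m, poly)      # C <- A * B
    return gf2m_mul(c, c, m, poly)       # final squaring


# Illustrative field (an assumption for this example): GF(2^8), f = 0x11B.
M, POLY = 8, 0x11B
a = 0x53
inv = wang_inverse(a, M, POLY)
print(hex(inv), gf2m_mul(a, inv, M, POLY))
```

The loop performs one squaring and one multiplication per iteration, which matches the $m - 2$ multiplications and $m - 1$ squarings stated for the method.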
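$$\left(A^{2^{m-1}-1}\right)^{2} = \left[\left(A^{2^{2^{k_t}}-1}\right)\left(\left(A^{2^{2^{k_{t-1}}}-1}\right)\cdots\left(\left(A^{2^{2^{k_2}}-1}\right)\left(A^{2^{2^{k_1}}-1}\right)^{2^{2^{k_2}}}\right)^{2^{2^{k_3}}}\cdots\right)^{2^{2^{k_t}}}\right]^{2} \tag{13}$$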

4.2.4. Itoh-Tsujii Inversion Algorithms

In their work [51], Itoh and Tsujii proposed three field inversion algorithms. The first two compute inverses over binary fields and the third computes inverses over generic fields. The third case, however, relies on subfield inversion.
The first algorithm is applicable in $\mathbb{F}_{2^m}$ such that $m = 2^r + 1$. It is based on the observation that the exponent $2^m - 2$ in (11) can be rewritten as $(2^{m-1} - 1) \times 2$. Thus if $m = 2^r + 1$, it follows that
$$A^{-1} = \left(A^{2^{2^r} - 1}\right)^{2} \tag{14}$$
From this, Algorithm 2 is obtained. This procedure requires $\log_2(m - 1)$ multiplications and $m - 1$ squarings.
Algorithm 2 Itoh-Tsujii Inversion for $\mathbb{F}_{2^m}$ Where $m = 2^r + 1$.
Input: $A(x) \in \mathbb{F}_{2^m}$ with $m = 2^r + 1$, $f(x)$ the irreducible polynomial of $\mathbb{F}_{2^m}$
Output: $C(x) = A^{-1}(x) \bmod f(x)$
$C(x) \leftarrow A(x)$
for $i = 0$ to $r - 1$ do
  $B(x) \leftarrow C(x)$
  for $j = 0$ to $2^i$ do
    $B(x) \leftarrow B^2(x) \bmod f(x)$
  end for
  $C(x) \leftarrow C(x)\,B(x) \bmod f(x)$
end for
$C(x) \leftarrow C^2(x) \bmod f(x)$
return $C(x)$
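A software sketch of Algorithm 2 is given next. It is only an illustration: elements are Python integers (bit $i$ is the coefficient of $x^i$), `gf2m_mul` is a helper written for this example, and the test field GF(2^5) with $f(x) = x^5 + x^2 + 1$ (so $r = 2$) is an assumption chosen because it satisfies $m = 2^r + 1$. The inner loop is read here as performing exactly $2^i$ squarings, which is what the stated total of $m - 1$ squarings implies.

```python
def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m); `poly` is f(x) with bit m set."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return result


def itoh_tsujii_inverse(a, r, poly):
    """Inverse in GF(2^m) for m = 2^r + 1, following Algorithm 2."""
    m = (1 << r) + 1
    c = a                                   # C = A^(2^1 - 1)
    for i in range(r):
        b = c
        for _ in range(1 << i):             # exactly 2^i squarings
            b = gf2m_mul(b, b, m, poly)
        c = gf2m_mul(c, b, m, poly)         # C = A^(2^(2^(i+1)) - 1)
    return gf2m_mul(c, c, m, poly)          # final squaring, Eq. (14)


# Illustrative field (an assumption): GF(2^5), f(x) = x^5 + x^2 + 1, r = 2.
R, POLY = 2, 0b100101
a = 0b10011
inv = itoh_tsujii_inverse(a, R, POLY)
print(bin(inv), gf2m_mul(a, inv, 5, POLY))
```

Only $r$ multiplications are performed, compared with the $m - 2$ multiplications of the Wang method, while the number of squarings stays at $m - 1$.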
Algorithm 2 can be generalized to any value of $m$ as proposed in [51]. For this, write $m - 1$ as
$$m - 1 = \sum_{i=1}^{t} 2^{k_i} \tag{15}$$
where $k_1 > k_2 > \cdots > k_t$ is an addition chain. Then, knowing that
$$A^{-1} = \left(A^{2^{m-1} - 1}\right)^{2} \tag{16}$$
and (15), it can be shown that the inverse of $A$ can be solved as in (13).
The Itoh-Tsujii inversion for fields of generic length can be computed following two approaches. Note that by calculating $A^{2^{2^{k_1}} - 1}$, all the previous partial products are also obtained. For later use these must be either stored by using additional registers (Algorithm 3) or re-calculated by taking additional operations (Algorithm 4).
Algorithm 3 Itoh-Tsujii Inversion for Generic Binary Fields Where Extra Storage is Used.
Input: $A(x) \in \mathbb{F}_{2^m}$, $U = u_0, u_1, \ldots, u_{r-1}$ the binary representation of $m$, $f(x)$ the irreducible polynomial of $\mathbb{F}_{2^m}$
Output: $C(x) = A^{-1}(x) \bmod f(x)$
$C(x) \leftarrow A(x)$
for $i = 0$ to $r - 1$ do
  $B(x) \leftarrow C(x)$
  for $j = 0$ to $2^i$ do
    $B(x) \leftarrow B^2(x) \bmod f(x)$
  end for
  $C(x) \leftarrow C(x)\,B(x) \bmod f(x)$
  $D_i(x) \leftarrow C(x)$
end for
for $i = r - 2$ to $1$ do
  if $u_i = 1$ then
    $B(x) \leftarrow C(x)$
    for $j = 0$ to $2^i$ do
      $B(x) \leftarrow B^2(x) \bmod f(x)$
    end for
    $C(x) \leftarrow C(x)\,D_i(x) \bmod f(x)$
  end if
end for
$C(x) \leftarrow C^2(x) \bmod f(x)$
return $C(x)$
Algorithm 4 Itoh-Tsujii Inversion for Generic Binary Fields Where Additional Cycles are Required.
Input: $A(x) \in \mathbb{F}_{2^m}$, $U = u_0, u_1, \ldots, u_{r-1}$ the binary representation of $m$, $f(x)$ the irreducible polynomial of $\mathbb{F}_{2^m}$
Output: $C(x) = A^{-1}(x) \bmod f(x)$
$C(x) \leftarrow A(x)$
for $i = 0$ to $r - 1$ do
  $B(x) \leftarrow C(x)$
  for $j = 0$ to $2^i$ do
    $B(x) \leftarrow B^2(x) \bmod f(x)$
  end for
  $C(x) \leftarrow C(x)\,B(x) \bmod f(x)$
end for
for $i = r - 2$ to $1$ do
  if $u_i = 1$ then
    $B(x) \leftarrow C(x)$
    for $j = 0$ to $2^i$ do
      $B(x) \leftarrow B^2(x) \bmod f(x)$
    end for
    $C(x) \leftarrow A(x)$
    for $j = 0$ to $i$ do
      $D(x) \leftarrow C(x)$
      for $k = 0$ to $2^j$ do
        $D(x) \leftarrow D^2(x) \bmod f(x)$
      end for
      $C(x) \leftarrow C(x)\,D(x) \bmod f(x)$
    end for
    $C(x) \leftarrow C(x)\,B(x) \bmod f(x)$
  end if
end for
$C(x) \leftarrow C^2(x) \bmod f(x)$
return $C(x)$
These algorithms perform inverses over fields of generic length. The addition chains used are based on the binary representation of the field length. It is possible to compute optimal addition chains, however, this task is difficult to perform on constrained devices given that the field length is variable.
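The addition chain behind the generic inversion can be checked without field arithmetic by tracking only the exponent of $A$. The sketch below follows the stored-partial-results idea of Algorithm 3, but it is not a transcription of it: the loop bounds, the scan over the bits of $m - 1$ below its leading one, and the names `itoh_tsujii_chain` and `partial` are choices made for this example. A multiplication adds exponents, a squaring doubles them, and everything is taken modulo $2^m - 1$.

```python
def itoh_tsujii_chain(m):
    """Return (exponent, multiplications, squarings) for inversion in GF(2^m), m >= 2."""
    order = (1 << m) - 1           # a^(2^m - 1) = 1 for every non-zero a
    bits = m - 1                   # Eq. (15): m - 1 as a sum of powers of two
    top = bits.bit_length() - 1    # k_1, position of the leading one
    mults = sqrs = 0
    partial = {0: 1}               # partial[i] holds the exponent 2^(2^i) - 1
    c = 1                          # exponent of A itself
    for i in range(top):           # build A^(2^(2^k_1) - 1), keeping partials
        b = c
        for _ in range(1 << i):
            b = (2 * b) % order
            sqrs += 1
        c = (c + b) % order
        mults += 1
        partial[i + 1] = c
    for i in range(top - 1, -1, -1):   # fold in the remaining set bits
        if (bits >> i) & 1:
            for _ in range(1 << i):
                c = (2 * c) % order
                sqrs += 1
            c = (c + partial[i]) % order
            mults += 1
    c = (2 * c) % order            # final squaring, Eq. (16)
    sqrs += 1
    return c, mults, sqrs


exponent, mults, sqrs = itoh_tsujii_chain(251)
print(exponent == (1 << 251) - 2, mults, sqrs)
```

The returned exponent equals $2^m - 2$ when the chain is correct, so the same function can be used to compare operation counts across candidate field lengths.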

4.2.5. Comparison of the Inversion Methods Reviewed

A summary of the computational and storage costs for the different inversion algorithms reviewed is provided in Table 1. Table 2 reports the latency and storage estimates of the inversion algorithms for security levels close to 128 bits.
As can be noted from Table 2, there is an improvement in the number of underlying operations when the Itoh-Tsujii algorithm is implemented instead of the Wang inversion method. Recall that kP requires two field inversions; therefore, reducing the latency of this operation by $x$ reduces the latency of the scalar multiplication by $2x$.
The alternative in Algorithm 2 only works for fields that satisfy the condition $m = 2^r + 1$ and thus would limit the elliptic curves that can be used if selected. The alternative in Algorithm 3 works for any $m$, but its implementation requires increased storage space, which would be translated into higher hardware usage (four additional $m$-bit registers if $m = 251$); this increment can be calculated as described in Table 1. Although the inversion method in Algorithm 4 does not offer the same latency advantages as the alternatives, it preserves generality without requiring additional hardware resources. Moreover, when the overall kP latency is considered and a dedicated squaring module can be added, the performance cost is not as significant (3% difference).
The Itoh-Tsujii inversions exploit the fact that squarings over finite fields are faster than multiplications. To achieve further improvement in the energy consumption of the system, it is necessary to improve the multiplication and squaring modules.

4.2.6. Implementation of the Itoh-Tsujii Inversion

Figure 2 reflects the changes in the architectural design compared to the base architecture in Figure 1. For this second architecture it was necessary to include the field length as an additional input. This value is used to control the iterations in the Itoh-Tsujii inversion algorithm. One of the MUXes that feed the field multiplier input also had to be re-wired.
The implementation results for the kP architectures (comparing C0 and C1) can be found in Table 3.
From these results it can be noted how the modification of the inversion algorithm offers an average reduction of 7% in the energy consumption for different versions of the kP architecture. On the other hand, the hardware usage shows an average increment of 7%. This is consistent with the data in Table 2.

4.3. Modification 2: Field Multiplier

In the outlined second strategy, replacing the bit-serial multiplier with a digit multiplier is suggested. The new multiplier should be created with the same ports as the previous one to ease the interconnection; it ought to provide support for fast × 1 operations (which can be used to store data in the registers); and constant multiplications (with reduced length) should also be preserved. The new multiplier also needs to be parameterized in order to function for any digit size and any field length, preserving the generality of the design.

4.3.1. Digit-Based Multiplier

A digit-based multiplier, as presented in Algorithm 5, makes it possible to explore area/latency tradeoffs for different applications, that is, how much hardware can be committed in order to reduce the cycle count of the architecture. If the design is parameterized, then a single architecture can be used for a wide range of applications.
Algorithm 5 Digit Multiplication in $\mathbb{F}_{2^m}$ Where $d$ is the Digit Size [55].
Input: $A(x), B(x) \in \mathbb{F}_{2^m}$, $f(x) = x^m + x^l + \cdots + 1$ the irreducible polynomial of $\mathbb{F}_{2^m}$
Output: $C(x) = A(x)\,B(x) \bmod f(x)$
$C(x) \leftarrow B_{d-1}(x)\,A(x) \bmod f(x)$
for $i = d - 2$ to $0$ do
  $C(x) \leftarrow x^l\,C(x)$
  $C(x) \leftarrow B_i(x)\,A(x) + C(x) \bmod f(x)$
end for
return $C(x)$
The digit multiplier from Algorithm 5 uses an underlying combinatorial multiplier:
$$U(x)\,A(x) \bmod f(x) = u_{d-1} x^{d-1} A(x) \bmod f(x) + \cdots + u_1 x A(x) \bmod f(x) + u_0 A(x) \bmod f(x) \tag{17}$$
The size of this combinatorial multiplier is what determines the hardware cost of the digit multiplier. A combinatorial multiplier can be seen as a matrix of hardware cells where its width is the digit size and its depth is the operand size.
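The digit-serial idea can be sketched in software as follows. This is an illustration rather than the hardware module: the digits of $B(x)$ are consumed from the most significant one, `digit_times_operand` plays the role of the combinatorial multiplier in (17), and the names, the bit-serial reference, and the test field GF(2^8) with polynomial `0x11B` are assumptions made for this example. In the sketch `d` is the digit size in bits and the loop runs over $\lceil m/d \rceil$ digits.

```python
def gf2m_mul_bitserial(a, b, m, poly):
    """Reference multiplier: one bit of b per iteration."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return result


def digit_times_operand(u, a, m, poly, d):
    """U(x) * A(x) mod f(x) for a d-bit digit U, as in Eq. (17)."""
    acc = 0
    for j in range(d):
        if (u >> j) & 1:
            acc ^= a             # a currently holds x^j * A(x) mod f(x)
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return acc


def gf2m_mul_digit(a, b, m, poly, d):
    """Digit-serial product: ceil(m/d) iterations, most significant digit first."""
    n_digits = -(-m // d)        # ceil(m / d)
    mask = (1 << d) - 1
    c = 0
    for i in range(n_digits - 1, -1, -1):
        for _ in range(d):       # C <- x^d * C mod f(x)
            c <<= 1
            if (c >> m) & 1:
                c ^= poly
        digit = (b >> (i * d)) & mask
        c ^= digit_times_operand(digit, a, m, poly, d)
    return c


print(hex(gf2m_mul_digit(0x57, 0x83, 8, 0x11B, 4)))
```

With `d = 1` the loop degenerates into the bit-serial case, while larger digits trade a wider combinatorial stage for fewer iterations.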

4.3.2. Implementation of the Digit Multiplier

We designed a digit multiplier based on two combinatorial multipliers. The design was synthesized for the xc6slx16 FPGA. At this point, the number of IO ports in the digit multiplier makes the place-and-route process infeasible; some post-synthesis results are provided in Table 4.
The digit multiplier was integrated in the architecture C0 which uses the Wang inversion algorithm to generate a new architecture denominated C2. This aims to determine if the use of Itoh-Tsujii inversion is cost-effective when a dedicated squaring module is not implemented. In this case, as can be seen in Figure 3, the bit-serial multiplier is replaced with the digit multiplier. Only small changes in the input MUX are required.
The multiplier was also merged into architecture C1 to generate the design shown in Figure 4. This architecture now has been modified with the first two proposed optimizations.
The implementation results for C2 and C3, which now use a digit multiplier, can be found in Table 5. Both designs were synthesized for the xc6slx16 FPGA using operational frequencies of 100 kHz and 13.56 MHz. These results are compared against the implementation results for C0 and C1 from Table 3. In this instance we are evaluating the efficiency of the digit multiplier (used in C2 and C3) compared with the bit-serial multiplier (used in C0 and C1).
The use of a digit multiplier achieves an energy reduction ranging from 51% to 92%, with hardware increments ranging from 6% to 54%. This trend seems to be consistent for both C2 and C3. The main difference between these architectures is that C3 has greater energy reduction for small digit sizes, which implies smaller hardware increments. In the long run ($d > 16$), however, both architectures tend to reach similar energy consumption levels.

4.4. Modification 3: Squaring Module

In the base architecture from Figure 1 the squaring operations are realized as multiplications. The selected kP algorithm performs four squarings per ladder step ($4m$). By including a dedicated squaring module the latency can be reduced since squarings are more efficient than multiplications in hardware. The C1 and C3 architectures (Figure 2 and Figure 4, respectively) can also benefit from this modification since the inversion method used (Algorithm 4) relies heavily on squarings. Although the advantages of a dedicated squaring component are evident, the hardware costs must be evaluated in order to assess its efficiency.
A combinatorial design for squarings was selected in order to maximize the latency reduction. Note that using a squaring module can reduce the latency independently of the field multiplier used. For this reason, we study both the alternative where the field multiplication uses a bit-serial approach but that also features dedicated squarings, and the option where the system uses a digit multiplier together with a squaring module.

4.4.1. Field Squarings

A squaring module is a special kind of field multiplier which exploits the fact that both input operands are the same word. The squaring procedure is presented in the following. With this method, the squaring operation is reduced to a field multiplication of an $\frac{m}{2}$-bit word by a $d$-bit word.
Let the input element $A$ be represented as a polynomial $A(x)$:
$$A(x) = a_{m-1}x^{m-1} + \cdots + a_2x^2 + a_1x + a_0$$
Thus, the squaring of $A(x)$ in polynomial form is obtained by spreading the polynomial's coefficients to the left, so that each coefficient $a_i$ becomes the coefficient of $x^{2i}$, generating a polynomial $A^2(x)$ of $2m-1$ terms:
$$A^2(x) = a_{m-1}x^{2m-2} + \cdots + a_2x^4 + a_1x^2 + a_0$$
The coefficients in $A^2$ can be divided into two polynomials $A_h(x)$ and $A_l(x)$, considering that the elements in $A_h(x)$ are shifted $m+1$ positions to the left:
$$A^2(x) = A_h(x)\,x^{m+1} + A_l(x)$$
The coefficients in each of the new polynomials are:
$$A_h(x) = a_{m-1}x^{m-3} + \cdots + a_{\frac{m+3}{2}}x^2 + a_{\frac{m+1}{2}}$$
$$A_l(x) = a_{\frac{m-1}{2}}x^{m-1} + \cdots + a_1x^2 + a_0$$
Shifting the elements in $A_h(x)$ can be solved as a multiplication by the element $x^{m+1}$, which can be obtained from the finite field's irreducible polynomial $f(x)$:
$$f(x) = x^m + x^d + \cdots + 1$$
$$x^m = x^d + \cdots + 1 \bmod f(x)$$
$$x^{m+1} = x^{d+1} + \cdots + x$$
The final multiplication can be performed either using a bit-serial multiplier or a combinatorial multiplier. The latter was used for this work.
$$A_h(x)\,x^{m+1} = A_h(x) \times (x^{d+1} + \cdots + x)$$
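The decomposition above can be checked in software. The following is a minimal sketch only, not the hardware design reported here: it assumes a toy field $GF(2^7)$ with $f(x) = x^7 + x + 1$ (chosen for brevity instead of the 251-bit field used in this work), stores polynomials as Python integers with bit $i$ holding the coefficient of $x^i$, and uses hypothetical helper names (`reduce_poly`, `clmul`, `square_split`). As in the derivation, $m$ is assumed to be odd.

```python
M = 7                      # toy field degree (assumption; must be odd)
F = (1 << 7) | 0b11        # f(x) = x^7 + x + 1 (assumed toy polynomial)

def reduce_poly(a, f=F, m=M):
    """Reduce polynomial a modulo f so that its degree is below m."""
    while a.bit_length() > m:
        a ^= f << (a.bit_length() - 1 - m)   # cancel the leading term
    return a

def clmul(a, b):
    """Carry-less (GF(2)) polynomial multiplication, no reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def mul_mod(a, b):
    """Reference field multiplication, used only to check the squarer."""
    return reduce_poly(clmul(a, b))

def square_split(a, f=F, m=M):
    """Square a using the A_h / A_l decomposition described above."""
    spread = 0
    for i in range(m):                       # a_i moves to position 2i
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    a_l = spread & ((1 << m) - 1)            # terms of degree <= m - 1
    a_h = spread >> (m + 1)                  # terms of degree >= m + 1
    x_m1 = reduce_poly(1 << (m + 1), f, m)   # x^(m+1) mod f(x)
    return a_l ^ reduce_poly(clmul(a_h, x_m1), f, m)

# Exhaustive check against the generic multiplier in the toy field.
all_match = all(square_split(a) == mul_mod(a, a) for a in range(1 << M))
```

Because $m$ is odd, the spread polynomial has no term of degree $m$, which is why the split at $x^{m+1}$ loses no coefficient.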

4.4.2. Implementation of the Squaring Module

The squaring module was added to the architectures that use the Itoh-Tsujii inversion algorithm (C1 and C3) to generate the versions C4 and C5 of the architecture, respectively. The architectural designs of C4 and C5 are illustrated in Figure 5.
In both designs the main difference is the addition of the squaring module, which affects the MUXs at the input of the field multiplier. The MUXs at the input of the data registers were also updated to store the results of the squaring module.
The designs were implemented for the xc6slx16 FPGA using operational frequencies of 100 kHz and 13.56 MHz. Table 6 provides implementation results for the kP architectures which include a dedicated squaring module.
In the case where a bit-serial multiplier is used, the energy consumption is halved (compare C4 in Table 6 to C1 in Table 3). If a digit multiplier is used (C5), the addition of the squaring module achieves energy reductions ranging from 77% to 96%. Compared with C3, where the reduction ranges from 51% to 92% (see Table 5), the improvement is noticeable. The hardware increment of implementing the squaring module is 20%.

4.5. Other Strategies

In a design which contains combinatorial logic and data registers, spurious calculations will be performed if these registers are not disconnected from the combinatorial logic. The switching activity on the combinatorial modules translates into power dissipation, and if the data being processed is not useful then it represents wasted energy. In order to mitigate the spurious calculations, it is good design practice to insulate the data registers. This applies both to inputs and outputs. If storing the data is not required, then the register writing must be disabled; if the data in the register is not needed, then it should be masked with zeros.
Register insulation is built into the proposed designs. As can be noted from Figure 1, Figure 2, Figure 3, Figure 4 and Figure 5, the data register outputs are always connected to a MUX element. In all cases these MUXs default to GND when the output is not required, effectively preventing the data in the registers from reaching any combinatorial module.
We evaluated the impact of removing the register insulation from our designs. Although this strategy has a hardware cost and contributes to reducing the power dissipation, the variation is not significant: 13% less energy in the best case and 10% more hardware in the worst case.

4.6. Summary

Table 7 provides a summary of all the designs studied in this section.

5. Energy Savings in Relation to Area Costs

In this section we describe the design and application of a method to evaluate the efficiency of the optimization techniques that were used to create the kP architectures C1–C5. This section consists of two parts: first we describe a novel method for quantifying the efficiency of energy optimizations with regard to area cost, then we use this method to compare our work with other state-of-the-art solutions.
For the analysis provided in this section we consider that the configurations C0, C1, and C4 are equivalent to the configurations C2, C3, and C5, respectively, when $d = 1$.

5.1. Novel Metric for Efficiency of Energy Oriented Optimizations in Regards to Area Costs

Since it is complicated to characterize the efficiency of an optimization technique in terms of area or energy alone, we have developed an evaluation metric which can account for both magnitudes.
We start from the energy evaluation and area cost of all the hardware implementations. Figure 6 shows the area (FF, LUT, SLC), power (POW), and energy (ENE) results for the different kP architectures under study.
In this work we identify four challenges that an evaluation metric aimed at comparing hardware realizations must overcome; these challenges are described in the following.

5.1.1. Selecting the Data

In FPGA implementations it is customary to use SLCs as the area unit. However, as can be seen from Figure 6, the SLC measurements are prone to outliers. This occurs due to the nature of the PAR process, which follows heuristic approaches. In the ideal case the number of SLCs should be correlated with the number of FFs and LUTs placed in the design. For example, for the FPGA used in our experiments, each SLC contains four FFs, four LUTs, and some connection logic.
The number of FFs required by the kP architectures is given by the number of registers allocated. The modifications applied do not change the number of registers substantially, which is why this value remains almost constant. In contrast, since most of the changes made require combinatorial logic, we can observe that the number of LUTs varies steadily. With this reasoning, we propose to use LUTs as the area indicator for the configurations evaluated. The first challenge for the proposed metric is to determine whether the LUT results can be used to represent the hardware increment in the design accurately.
With regard to power dissipation and energy consumption, the quiescent component in FPGAs is almost constant and more significant than its dynamic counterpart; this makes it easier to study the energy profile of an architecture.
To measure the area and energy increments we use percentage differences ($\Delta\%$). The variation, difference, or increment in the measurement of a particular metric ($O_{C_i}$) for the architecture $C_i$, with regard to a previous observation ($O_{C_{i-1}}$) for the architecture $C_{i-1}$, can be computed as in (8).
Figure 7 shows the area and energy $\Delta$ values for the different architectures created. It is important to recall that in the proposed scheme a positive difference implies an increment, as in the case of area, and a negative difference implies a decrement, as in the case of energy consumption.
From the results in Figure 7 we can note that the LUT usage is in fact a close match to the SLC usage with regard to perceived hardware cost, with less impact from outliers. The R-square yields a closeness of 74.54%, 98.21% and 16.45% for C2, C3, and C5, respectively. Even though the R-square in the case of C5 is low, the goodness-of-fit achieved for C2 and C3 hints that the LUT measurements can substitute the SLC as area units when the number of FFs remains constant. This solves the first challenge proposed.
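As a rough illustration of the two computations used in this subsection, the sketch below evaluates percentage differences between successive observations and an R-square closeness between two series. It rests on assumptions: the percentage difference is taken in the usual relative form (by analogy with the reference-based increment used later in this section), R-square is taken as the coefficient of determination with one series treated as the reference, and the LUT and SLC counts are hypothetical placeholders rather than values from Figure 6.

```python
def percent_difference(current, previous):
    """Relative change of an observation with respect to the previous one."""
    return 100.0 * (current - previous) / previous

def successive_differences(values):
    """Percentage differences between each observation and its predecessor."""
    return [percent_difference(b, a) for a, b in zip(values, values[1:])]

def r_squared(reference, candidate):
    """Coefficient of determination of candidate with respect to reference.

    This is one common definition (1 - SS_res / SS_tot); the exact
    definition behind the reported figures is an assumption here.
    """
    mean_ref = sum(reference) / len(reference)
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    ss_res = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return 1.0 - ss_res / ss_tot

# Hypothetical resource counts for four successive configurations.
lut_counts = [400, 440, 506, 607]
slc_counts = [150, 168, 190, 230]

lut_deltas = successive_differences(lut_counts)
slc_deltas = successive_differences(slc_counts)
closeness = r_squared(slc_deltas, lut_deltas)
```

A positive entry in `lut_deltas` denotes a hardware increment; the same function yields negative values when applied to a decreasing energy series.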

5.1.2. Efficiency Metric

Using the LUT and ENE results we propose the efficiency (EFF) metric in (24). This value conveys the energy decrement weighted by the area increment associated with the improvement. If the energy savings are high (negative percentages), and the area costs for said improvements are low (in relation to a reference model), then the efficiency metric will yield a large negative result. Results that are less negative imply that the area cost outweighs the energy savings achieved.
$$\mathit{EFF} = \frac{\Delta\mathit{ENE}}{\Delta\mathit{LUT}} \qquad (24)$$
Figure 8 presents the evaluation of the efficiency metric for the different architectures created.
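A minimal sketch of how (24) behaves is given below. The function name is hypothetical, and the two sample points pair the extremes of the energy and area ranges quoted for the digit multiplier; that pairing is an assumption made only for illustration, since the per-digit values are given in the tables and figures rather than in the text.

```python
def efficiency(delta_ene_pct, delta_lut_pct):
    """EFF as in (24): energy change (%) divided by area change (%)."""
    if delta_lut_pct == 0:
        raise ValueError("area increment must be non-zero")
    return delta_ene_pct / delta_lut_pct

# Illustrative pairings (assumed): small digit size, then large digit size.
eff_small_digit = efficiency(-51.0, 6.0)    # large saving per unit of area
eff_large_digit = efficiency(-92.0, 54.0)   # more saving overall, less per area
```

Under this pairing the first point is more negative than the second, which matches the reading that a less negative result signals area cost catching up with the energy savings.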

5.1.3. Sensitivity to Frequency Variations

For the results in Figure 8 we used an operational frequency of 100 kHz. However, how does the operational frequency affect the proposed evaluation metric? This is the second challenge for our metric. This question is important for comparing our results with proposals from the literature, since not all works use the same operational frequency. Furthermore, given the relevance of this magnitude in the energy consumption of a design, any evaluation metric should account for frequency variations.
Figure 9 illustrates the differences in the energy consumption for the kP architectures under evaluation as a result of changing the operational frequency.
As can be noted, even though the measured values vary by two orders of magnitude, the consumption models are similar. In fact, the energy measurements at either frequency can be used to compute the energy increment of each architecture and, from it, the efficiency evaluation for both operational frequencies.
The evaluation of the efficiency metric for two operational frequencies is provided in Figure 10. The results demonstrate that the proposed metric can account for variations in the operational frequency of the implementation. In these results, the R-square for the configurations C2, C3, and C5 indicates goodness values of 99.87%, 99.91%, and 99.62%, respectively. This answers the second challenge proposed, since the metric proposed appears to be able to isolate the strong influence that the frequency has on the static power consumption, while highlighting the improvements achieved by the architecture modifications on the dynamic power consumption.

5.1.4. Sensitivity to Different Curve Sizes

How does the curve length influence the results? This is the third challenge. In this work we use the elliptic curve BE251 as a case study. However, when comparing our work with the literature, it is noteworthy that most existing lightweight proposals of elliptic curve systems target security levels of at most 80 bits. Since our work targets security levels close to 128 bits, the necessity of accounting for the difference in field length is clear.
We have used the results provided in [22] to evaluate the sensitivity of the proposed metric to differences in the curve length. The relevance of that work is that the authors present implementation results for a scalar multiplication architecture using generic elliptic curves of varying length. We took their area and energy results and used them to evaluate our metric. Figure 11 illustrates the area and energy results from [22] for various curve lengths. Figure 12 presents the evaluation of the proposed metric using the results from [22]. In this case the area increments are measured in GEs and the energy increments in μJ.
As can be noted from Figure 12, our metric yields similar values for the different experiments which use varying curve lengths. These figures generate R-square evaluations of 99.99%, 99.99%, and 99.99%, which implies a close match in the values. Based on this experiment, we can conclude that the proposed metric is not sensitive to variations in the curve length, which solves the third challenge presented. This is a significant result, as it implies that in comparing our results with the state of the art it is not necessary to account for variations in the curve length.

5.1.5. Sensitivity to the Implementation Technology

As evidenced in the previous point, not all works in the literature target FPGA technology. Some of them, as in the case of [22], have been developed for ASIC. Even though our metric can be applied to both scenarios without problems, it is necessary to determine if changing the implementation technology can impact the results of the proposed metric for the same architecture. However, our work focuses on FPGA technology and in the literature we have not identified any work which allows carrying out this experiment. As of now, we consider this fourth question as an open challenge.

5.2. Applying the Proposed Metric for Comparing Our Work with the State of the Art

All of the reviewed works which propose low-power or low-energy kP architectures use digit multipliers. This is understandable, given that a digit multiplier allows significant reductions in energy consumption at relatively low hardware cost.
The proposed metric is particularly useful for comparing such works. For starters, the ability to synthesize a design for varying digit sizes makes the application flexible. Some scopes might be able to accommodate greater hardware costs in order to achieve improved performance, whereas others have stricter area bounds. Therefore, an architecture of this type cannot be evaluated solely on its efficiency for a particular digit size. The curves derived from evaluating the proposed efficiency metric as a function of the digit size, in architectures with digit multipliers, make it possible to use the area under the curve as an objective quantifier of efficiency. To this end, several problems need to be addressed.
First, since in this comparison we refer to the efficiency of a particular architecture which uses a digit multiplier, each series shall use as reference the instance of the implementation where $d = 1$. In this scenario we aim at quantifying the efficiency of an individual architecture. The relative percentage increments can be computed as in (25).
$$\Delta_r\% = \frac{O_{C_i,d} - O_{C_i,1}}{O_{C_i,1}} \times 100 \qquad (25)$$
Second, to use the area under the curve as quantifier it is necessary that the evaluation bounds coincide for each configuration. That is, all the designs evaluated must provide implementation results for the same digit interval. The case where $d = 1$ is mandatory since it is used as reference, but as upper bound we can define any $d = n$.
Once the evaluation boundaries have been defined, we note that the majority of works in the literature do not provide results for continuous intervals of the digit space. For instance, the works in [17] and [25] only provide implementation results for the cases where $d \in \{1, 15\}$ and $d \in \{1, 16\}$, respectively. A solution for this problem is to use interpolation models in order to obtain the missing data.

5.2.1. Modeling the Data

The area and energy increments are the source for computing the efficiency of a design. These increments are calculated from the raw data of hardware resources and energy consumption. The former can be modeled using a polynomial fit of first degree of the form $y = \alpha_1 d + \alpha_2$, while the latter can be adjusted to an exponential model of the form $y = \alpha_3 d^{\alpha_4}$, where $\alpha_i \in \mathbb{R}$ are constants for the model of each configuration and $d \in \mathbb{Z}$ is the digit size. The model proposed for the efficiency metric is presented in (26).
$$\mathit{EFF}_m = \frac{\Delta_r\%\left(\alpha_3 d^{\alpha_4}\right)}{\Delta_r\%\left(\alpha_1 d + \alpha_2\right)} \qquad (26)$$
The use of a mathematical model over the raw data has the additional advantage that the effects of outliers are mitigated. This is practical since some works from the literature that target FPGAs do not provide LUT results [25] or do provide them but the variance in the flip-flop count is significant [17].
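The modeling step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the fitting procedure used for the reported coefficients: the linear model is fitted by ordinary least squares, the energy model $y = \alpha_3 d^{\alpha_4}$ is fitted by a straight-line fit in log-log space, and the area and energy samples are synthetic values generated from known coefficients so that the fit can be checked. All function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_power(xs, ys):
    """Fit y = a3 * x**a4 through a linear fit of log(y) against log(x)."""
    slope, intercept = fit_line([math.log(x) for x in xs],
                                [math.log(y) for y in ys])
    return math.exp(intercept), slope

def relative_pct(value, reference):
    """Relative percentage increment with respect to a reference, as in (25)."""
    return 100.0 * (value - reference) / reference

def eff_model(d, a1, a2, a3, a4):
    """EFF_m as in (26), with the d = 1 instance of each model as reference."""
    area = lambda k: a1 * k + a2
    energy = lambda k: a3 * k ** a4
    return relative_pct(energy(d), energy(1)) / relative_pct(area(d), area(1))

# Synthetic samples (assumed): sparse digit sizes, as in works that do not
# report a continuous digit interval.
digits = [1, 2, 4, 8, 16]
luts = [20 * d + 300 for d in digits]
energy = [100.0 * d ** -0.9 for d in digits]

a1, a2 = fit_line(digits, luts)
a3, a4 = fit_power(digits, energy)
eff_at_3 = eff_model(3, a1, a2, a3, a4)   # digit size absent from the samples
```

Note that `eff_model` is undefined at $d = 1$, where both increments are zero, so evaluations start at $d = 2$.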
In Figure 13 we show the models obtained for the area and energy results from our C2, C3, and C5 architectures. Figure 14 then presents the evaluation of the efficiency metric applied over these data. It is possible to observe the precision obtained in the final model, which produces R-square evaluations of 93.12% and 93.66% for C2 and C3, respectively.
As can be observed in Figure 13, the area in LUTs recorded for the configuration C5 where $d = 1$ is an outlier. When this anomalous reference point is used for evaluating (24), the results are skewed. Modeling the data prevents such erroneous results by removing the outliers. This is the reason for the significant variation exhibited between C5 and $C5_m$.
With the updated analysis we can note that the most cost-effective solution provided in this work, with regard to preserving the implementation area while reducing the energy profile, is the architecture C5. This design consistently outperforms the other configurations for any digit size.

5.2.2. Quantifying the Efficiency

The data in Figure 14 can be used to obtain the area under the curve for each configuration using a trapezoidal rule as shown in Equation (27).
$$\mathit{EFF}_A = \frac{1}{2}\sum_{d=2}^{n}\left[\frac{\Delta_r\%\left(\alpha_3 d^{\alpha_4}\right)}{\Delta_r\%\left(\alpha_1 d + \alpha_2\right)} + \frac{\Delta_r\%\left(\alpha_3 (d+1)^{\alpha_4}\right)}{\Delta_r\%\left(\alpha_1 (d+1) + \alpha_2\right)}\right]\Delta d \qquad (27)$$
where $\alpha_i \in \mathbb{R}$ and $d \in \mathbb{Z}$.
For this evaluation we shall define $n = 15$ and $\Delta d = 1$ since $d \in \mathbb{Z}$. From this, the configurations C2, C3, and C5 obtain efficiency scores of −77.59, −80.16, and −97.5, respectively. In this evaluation, the configuration C5 is the one that achieves the greatest energy reduction per area cost overall.
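The trapezoidal evaluation can be sketched as below. The summation follows (27) as written, with $d$ running from 2 to $n$ and each term averaging the modeled efficiency at $d$ and $d + 1$. The coefficients passed to the model are hypothetical placeholders, not the fitted values from Table 9, so the resulting score is not comparable with the scores reported above.

```python
def relative_pct(value, reference):
    """Relative percentage increment with respect to a reference, as in (25)."""
    return 100.0 * (value - reference) / reference

def make_eff_model(a1, a2, a3, a4):
    """Return EFF_m(d) as in (26) for one configuration's coefficients."""
    def eff(d):
        area_inc = relative_pct(a1 * d + a2, a1 + a2)
        energy_inc = relative_pct(a3 * d ** a4, a3)
        return energy_inc / area_inc
    return eff

def efficiency_area(eff, n, delta_d=1):
    """Trapezoidal rule of (27): sum over d = 2..n of the mean of
    eff(d) and eff(d + 1), times delta_d."""
    return 0.5 * sum((eff(d) + eff(d + 1)) * delta_d
                     for d in range(2, n + 1))

# Hypothetical coefficients for a single configuration.
eff = make_eff_model(a1=20.0, a2=300.0, a3=100.0, a4=-0.9)
score = efficiency_area(eff, n=15)
```

A more negative `score` indicates a larger energy reduction per unit of area over the whole digit interval.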

5.2.3. Comparison with the Literature

Table 8 provides implementation results from works in the literature that are defined as “low power” or “low energy” by their authors. Using these data we have adjusted coefficients for the model of the area and energy measurements from each work. These models are used for evaluating the efficiency metric and to obtain the respective efficiency score for each design. In 77% of the non-trivial models we achieved R-square evaluations above 99%, which suggests that the fitted models are accurate. Table 9 provides the coefficients obtained for the model of each configuration, according to the formula in Equation (26).
Figure 15 illustrates the evaluation of the efficiency metric for the different works in the state of the art.
Finally, the efficiency scores for each configuration evaluated are reported in Figure 16.

5.3. Limitations of the Proposed Method

The proposed method is sensitive to data outliers. Since the results are provided as percentages, when the measurements are small, area or energy variations can skew the results. This is mitigated by using models to adjust the data.
Conditions that do not fit the proposed models also lead to unexpected results. If the energy consumption is not reduced or the hardware requirements are not increased, the sign of the results will flip and produce spurious evaluations of (27). While that might have some use, for the purposes intended in this article such results are undesired.
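The sign problem can be made concrete with a small guard around the ratio. This is a hypothetical helper, not part of the proposed method: it only accepts the case the metric is designed for (energy decreases, area increases) and flags everything else. The sample percentages are invented for illustration.

```python
def checked_efficiency(delta_ene_pct, delta_lut_pct):
    """Return the energy/area ratio only when it is meaningful.

    Meaningful here means an energy decrement obtained at the cost of an
    area increment; any other combination returns None.
    """
    if delta_ene_pct >= 0 or delta_lut_pct <= 0:
        return None
    return delta_ene_pct / delta_lut_pct

# Without the guard, two opposite situations give a positive ratio:
better_on_both = -20.0 / -5.0   # less energy and less area
worse_on_both = 10.0 / 5.0      # more energy and more area
```

Both unguarded ratios are positive even though one design is strictly better and the other strictly worse, which is why such points would distort an area-under-the-curve score.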

6. Conclusions

In this paper we have studied the reduction of energy consumption in six different scalar multiplication architectures. Starting from a base low-area design, we have improved it following energy- and power-reducing strategies. The result of this process is a comprehensive set of designs with gradual optimization levels, which exhibit a range of area/energy trade-offs. These scalar multiplication modules can be used in key establishment systems with low-area requirements and low energy consumption.
The novel metric proposed can be applied in studying the impact of any modification to a reference architecture implemented in hardware. In a sense, it represents the energy savings weighted by the associated hardware costs. The main goal of this indicator is to demonstrate the effectiveness of any energy-related improvement in a platform with hardware constraints. We have shown that this metric is capable of accounting for differences in the area units, the operational frequency, and the field size; we also provided a way to reduce its sensitivity to unavailable data and outliers. For these reasons we believe that it is adequate for comparing works implemented under heterogeneous conditions.
Among the proposed architectures, the configuration C5 exhibits the greatest efficiency. This design employs a digit multiplier, the Itoh-Tsujii inversion algorithm, and a dedicated squaring module, and also implements datapath insulation. Compared against the state of the art, this configuration turned out to be 13.33% more efficient than the closest work. In terms of efficiency, our proposal represents a good candidate for implementation in environments with area and energy constraints such as IoT devices.

Author Contributions

Conceptualization, C.A.L.-N., A.D.-P. and M.M.-S.; Data curation, C.A.L.-N.; Formal analysis, C.A.L.-N.; Funding acquisition, A.D.-P. and M.M.-S.; Investigation, C.A.L.-N.; Methodology, C.A.L.-N., A.D.-P. and M.M.-S.; Project administration, A.D.-P. and M.M.-S.; Resources, A.D.-P. and M.M.-S.; Software, C.A.L.-N.; Supervision, A.D.-P. and M.M.-S.; Validation, C.A.L.-N.; Visualization, C.A.L.-N.; Writing—original draft, C.A.L.-N.; Writing—review & editing, C.A.L.-N., A.D.-P. and M.M.-S.

Funding

This research was funded by CONACyT Mexico, grant number 336750. The research was also funded by “Fondo Sectorial de Investigación para la Educación”, CONACyT Mexico, through the project number 281565.

Acknowledgments

The authors would like to thank CINVESTAV for the support received. The authors would like to express their gratitude to the reviewers, in particular reviewer 2, for their remarkable contributions which helped to improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Mangia, M.; Pareschi, F.; Rovatti, R.; Setti, G. Low-Cost Security of IoT Sensor Nodes With Rakeness-Based Compressed Sensing: Statistical and Known-Plaintext Attacks. IEEE Trans. Inf. Forensics Secur. 2018, 13, 327–340.
  2. Kumar, P.; Lee, H.J. Security Issues in Healthcare Applications Using Wireless Medical Sensor Networks: A Survey. Sensors 2012, 12, 55–91.
  3. Perez-Torres, R.; Torres-Huitzil, C.; Galeana-Zapien, H. An On-Device Cognitive Dynamic Systems Inspired Sensing Framework for the IoT. IEEE Commun. Mag. 2018, 56, 154–161.
  4. Khan, M.K.; Alghathbar, K. Cryptanalysis and Security Improvements of ‘Two-Factor User Authentication in Wireless Sensor Networks’. Sensors 2010, 10, 2450–2459.
  5. Buratti, C.; Conti, A.; Dardari, D.; Verdone, R. An Overview on Wireless Sensor Networks Technology and Evolution. Sensors 2009, 9, 6869–6896.
  6. Ghiasi, S.; Srivastava, A.; Yang, X.; Sarrafzadeh, M. Optimal Energy Aware Clustering in Sensor Networks. Sensors 2002, 2, 258–269.
  7. Liu, M.; Cao, J.; Chen, G.; Wang, X. An Energy-Aware Routing Protocol in Wireless Sensor Networks. Sensors 2009, 9, 445–462.
  8. Liu, Z.; Seo, H. IoT-NUMS: Evaluating NUMS Elliptic Curve Cryptography for IoT Platforms. IEEE Trans. Inf. Forensics Secur. 2019, 14, 720–729.
  9. Yeh, H.L.; Chen, T.H.; Liu, P.C.; Kim, T.H.; Wei, H.W. A Secured Authentication Protocol for Wireless Sensor Networks Using Elliptic Curves Cryptography. Sensors 2011, 11, 4767–4779.
  10. Piedra, A.d.l.; Braeken, A.; Touhafi, A. Extending the IEEE 802.15.4 Security Suite with a Compact Implementation of the NIST P-192/B-163 Elliptic Curves. Sensors 2013, 13, 9704–9728.
  11. Liu, Z.; Liu, D.; Zou, X.; Lin, H.; Cheng, J. Design of an Elliptic Curve Cryptography Processor for RFID Tag Chips. Sensors 2014, 14, 17883–17904.
  12. Choi, Y.; Lee, D.; Kim, J.; Jung, J.; Nam, J.; Won, D. Security Enhanced User Authentication Protocol for Wireless Sensor Networks Using Elliptic Curves Cryptography. Sensors 2014, 14, 10081–10106.
  13. Liu, Z.; Seo, H.; Großschädl, J.; Kim, H. Efficient Implementation of NIST-Compliant Elliptic Curve Cryptography for 8-bit AVR-Based Sensor Nodes. IEEE Trans. Inf. Forensics Secur. 2016, 11, 1385–1397.
  14. Barker, E. NIST Special Publication 800-57. Part 1. Revision 4. Recommendation for Key Management. 2016. Available online: http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-57pt1r4.pdf (accessed on 15 October 2018).
  15. Lara-Nino, C.A.; Diaz-Perez, A.; Morales-Sandoval, M. Elliptic Curve Lightweight Cryptography: A Survey. IEEE Access 2018, 6, 72514–72550.
  16. Schroeppel, R.; Beaver, C.; Gonzales, R.; Miller, R.; Draelos, T. A Low-Power Design for an Elliptic Curve Digital Signature Chip. In Cryptographic Hardware and Embedded Systems, Proceedings of the CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, 13–15 August 2002; Revised Papers; Kaliski, B.S., Koç, Ç.K., Paar, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 366–380.
  17. Keller, M.; Byrne, A.; Marnane, W.P. Elliptic Curve Cryptography on FPGA for Low-Power Applications. ACM Trans. Reconfigurable Technol. Syst. 2009, 2, 2.
  18. Keller, M.; Marnane, W. Energy Efficient Elliptic Curve Processor. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, Proceedings of the 18th International Workshop, PATMOS 2008, Lisbon, Portugal, 10–12 September; Revised Selected Papers; Svensson, L., Monteiro, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 287–296.
  19. Tamura, M.; Ikeda, M. 1.68uJ/signature-generation 256-bit ECDSA over GF(p) signature generator for IoT devices. In Proceedings of the 2016 IEEE Asian Solid-State Circuits Conference (A-SSCC), Toyama, Japan, 7–9 November 2016; pp. 341–344.
  20. Asif, S.; Andersson, O.; Rodrigues, J.; Kong, Y. 65-nm CMOS low-energy RNS modular multiplier for elliptic-curve cryptography. IET Comput. Digit. Tech. 2018, 12, 62–67.
  21. Öztürk, E.; Sunar, B.; Savaş, E. Low-Power Elliptic Curve Cryptography Using Scaled Modular Arithmetic. In Cryptographic Hardware and Embedded Systems, Proceedings of the CHES 2004, 6th International Workshop, Cambridge, MA, USA, 11–13 August 2004; Joye, M., Quisquater, J.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 92–106.
  22. Batina, L.; Mentens, N.; Sakiyama, K.; Preneel, B.; Verbauwhede, I. Low-Cost Elliptic Curve Cryptography for Wireless Sensor Networks. In Security and Privacy in Ad-Hoc and Sensor Networks, Proceedings of the Third European Workshop, ESAS 2006, Hamburg, Germany, 20–21 September 2006; Revised Selected Papers; Buttyán, L., Gligor, V.D., Westhoff, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 6–17.
  23. Bernstein, D.J.; Lange, T.; Rezaeian Farashahi, R. Binary Edwards Curves. In Cryptographic Hardware and Embedded Systems, Proceedings of the CHES 2008, 10th International Workshop, Washington, DC, USA, 10–13 August 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 244–265.
  24. Koziel, B.; Azarderakhsh, R.; Mozaffari-Kermani, M. Low-Resource and Fast Binary Edwards Curves Cryptography. In Progress in Cryptology, Proceedings of the INDOCRYPT 2015: 16th International Conference on Cryptology in India, Bangalore, India, 6–9 December 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 347–369.
  25. Keller, M.; Marnane, W. Low Power Elliptic Curve Cryptography. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, Proceedings of the 17th International Workshop, PATMOS 2007, Gothenburg, Sweden, 3–5 September 2007; Azémard, N., Svensson, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 310–319.
  26. Kocabas, U.; Fan, J.; Verbauwhede, I. Implementation of Binary Edwards Curves for very-constrained devices. In Proceedings of the ASAP 2010, Rennes, France, 7–9 July 2010; pp. 185–191.
  27. Dan, Y.P.; He, H.L. Tradeoff Design of Low-Cost and Low-Energy Elliptic Curve Crypto-Processor for Wireless Sensor Networks. In Proceedings of the 2012 8th International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, 21–23 September 2012; pp. 1–5.
  28. Knežević, M.; Nikov, V.; Rombouts, P. Low-Latency Encryption—Is “Lightweight = Light + Wait”? In Proceedings of the Cryptographic Hardware and Embedded Systems, CHES 2012, Leuven, Belgium, 9–12 September 2012; Prouff, E., Schaumont, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 426–446.
  29. Purnaprajna, M.; Puttmann, C.; Porrmann, M. Power Aware Reconfigurable Multiprocessor for Elliptic Curve Cryptography. In Proceedings of the 2008 Design, Automation and Test in Europe, Munich, Germany, 10–14 March 2008; pp. 1462–1467.
  30. Ahmadi, H.R.; Afzali-Kusha, A. A low-power and low-energy flexible GF(p) elliptic-curve cryptography processor. J. Zhejiang Univ. Sci. C 2010, 11, 724–736.
  31. Iwasaki, A.; Shibata, Y.; Oguri, K.; Harasawa, R. An energy-efficient FPGA-based soft-core processor with a configurable word size ECC arithmetic accelerator. In Proceedings of the 2015 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XVIII), Yokohama, Japan, 13–15 April 2015; pp. 1–3.
  32. Rožić, V.; Reparaz, O.; Verbauwhede, I. A 5.1 uJ per point-multiplication elliptic curve cryptographic processor. Int. J. Circuit Theory Appl. 2017, 45, 170–187.
  33. Liu, Z.; Weng, J.; Hu, Z.; Seo, H. Efficient Elliptic Curve Cryptography for Embedded Devices. ACM Trans. Embed. Comput. Syst. 2016, 16, 53.
  34. Liu, Z.; Huang, X.; Hu, Z.; Khan, M.K.; Seo, H.; Zhou, L. On Emerging Family of Elliptic Curves to Secure Internet of Things: ECC Comes of Age. IEEE Trans. Dependable Secur. Comput. 2016, 14, 237–248.
  35. Chandrakasan, A.P.; Potkonjak, M.; Mehra, R.; Rabaey, J.; Brodersen, R.W. Optimizing Power Using Transformations. Trans. Comp.-Aided Des. Integr. Cir. Syst. 2006, 14, 12–31.
  36. Kim, H.; Kim, Y.; Yoo, H.J. A 6.3nJ/op low energy 160-bit modulo-multiplier for elliptic curve cryptography processor. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 18–21 May 2008; pp. 3310–3313.
  37. Gaubatz, G.; Kaps, J.P.; Ozturk, E.; Sunar, B. State of the art in ultra-low power public key cryptography for wireless sensor networks. In Proceedings of the Third IEEE International Conference on Pervasive Computing and Communications Workshops, Kauai Island, HI, USA, 8–12 March 2005; pp. 146–150.
  38. Fan, J.; Reparaz, O.; Rožić, V.; Verbauwhede, I. Low-energy encryption for medical devices: Security adds an extra design dimension. In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 29 May–7 June 2013; pp. 1–6.
  39. Maidhili, R.; Karthik, G. Energy Efficient and Secure Multi-User Broadcast Authentication Scheme in Wireless Sensor Networks. In Proceedings of the 2018 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 4–6 January 2018; pp. 1–6.
  40. Kreiser, D.; Dyka, Z.; Kabin, I.; Langendoerfer, P. Low-energy key exchange for automation systems. In Proceedings of the 2018 13th International Conference on Design Technology of Integrated Systems In Nanoscale Era (DTIS), Taormina, Italy, 9–12 April 2018; pp. 1–5.
  41. Hein, D.; Wolkerstorfer, J.; Felber, N. ECC Is Ready for RFID—A Proof in Silicon. In Selected Areas in Cryptography, Proceedings of the 15th International Workshop, SAC 2008, Sackville, NB, Canada, 14–15 August 2009; Revised Selected Papers; Avanzi, R.M., Keliher, L., Sica, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 401–413.
  42. Kodali, R.K.; Patel, K.H.; Sarma, N. Energy efficient elliptic curve point multiplication for WSN applications. In Proceedings of the 2013 National Conference on Communications (NCC), New Delhi, India, 15–17 February 2013; pp. 1–5.
  43. Ting, P.; Tsai, J.; Wu, T. Signcryption Method Suitable for Low-Power IoT Devices in a Wireless Sensor Network. IEEE Syst. J. 2018, 12, 2385–2394.
  44. De Clercq, R.; Uhsadel, L.; Van Herrewege, A.; Verbauwhede, I. Ultra Low-Power Implementation of ECC on the ARM Cortex-M0+. In Proceedings of the 51st Annual Design Automation Conference, DAC ’14, San Francisco, CA, USA, 1–5 June 2014; ACM: New York, NY, USA, 2014; pp. 112:1–112:6.
  45. Zeidler, S.; Goderbauer, M.; Krstić, M. Design of a low-power asynchronous elliptic curve cryptography coprocessor. In Proceedings of the 2013 IEEE 20th International Conference on Electronics, Circuits, and Systems (ICECS), Abu Dhabi, UAE, 8–11 December 2013; pp. 569–572.
  46. Targhetta, A.D.; Owen, D.E.; Israel, F.L.; Gratz, P.V. Energy-efficient Implementations of GF (P) and GF(2M) Elliptic Curve Cryptography. In Proceedings of the 2015 33rd IEEE International Conference on Computer Design (ICCD), ICCD ’15, New York, NY, USA, 18–21 October 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 704–711.
  47. Tan, X.; Dong, M.; Wu, C.; Ota, K.; Wang, J.; Engels, D.W. An Energy-Efficient ECC Processor of UHF RFID Tag for Banknote Anti-Counterfeiting. IEEE Access 2017, 5, 3044–3054. [Google Scholar] [CrossRef]
  48. Dao, V.L.; Nguyen, V.T.; Hoang, V.P. Low Power ECC Implementation on ASIC. In Advances in Information and Communication Technology, Proceedings of the International Conference, ICTA 2016, Thai Nguyen, Vietnam, 12–12 December 2016; Akagi, M., Nguyen, T.T., Vu, D.T., Phung, T.N., Huynh, V.N., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 332–339. [Google Scholar]
  49. Rabaey, J. Low Power Design Essentials, 1st ed.; Springer Publishing Company: Berlin/Heidelberg, Germany, 2009; Incorporated. [Google Scholar]
  50. Liu, D.; Liu, Z.; Yong, Z.; Zou, X.; Cheng, J. Design and Implementation of an ECC-Based Digital Baseband Controller for RFID Tag Chip. IEEE Trans. Ind. Electron. 2015, 62, 4365–4373. [Google Scholar] [CrossRef]
  51. Itoh, T.; Tsujii, S. A Fast Algorithm for Computing Multiplicative Inverses in GF(2M) Using Normal Bases. Inf. Comput. 1988, 78, 171–177. [Google Scholar] [CrossRef]
  52. Daepp, U.; Gorkin, P. Fermat’s Little Theorem. In Reading, Writing, and Proving: A Closer Look at Mathematics; Springer: New York, NY, USA, 2011; pp. 315–323. [Google Scholar]
  53. Liskov, M. Fermat’s Little Theorem. In Encyclopedia of Cryptography and Security; van Tilborg, H.C.A., Jajodia, S., Eds.; Springer: Boston, MA, USA, 2011; p. 456. [Google Scholar]
  54. Guajardo, J. Itoh–Tsujii Inversion Algorithm. In Encyclopedia of Cryptography and Security; van Tilborg, H.C.A., Ed.; Springer: Boston, MA, USA, 2005; p. 313. [Google Scholar]
  55. Rodríguez-Henríquez, F.; Saqib, N.A.; Díaz-Pérez, A.; Koc, C.K. Cryptographic Algorithms on Reconfigurable Hardware (Signals and Communication Technology); Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 2006. [Google Scholar]
Figure 1. Low-area kP architecture, in the following referred to as C0.
Figure 2. Architecture for kP featuring Itoh-Tsujii inversion (C1).
Figure 3. Architecture C2 for kP, Wang inversion and a digit-multiplier are used.
Figure 4. The Itoh-Tsujii inversion is paired with a digit-multiplier on this kP architecture (C3).
Figure 5. Architectures for kP featuring dedicated squaring modules.
Figure 6. FPGA area, power dissipation, and energy consumption for the different kP architectures.
Figure 7. Percentile area and energy increments for architectures C2, C3, and C5 with reference to C0.
Figure 8. Evaluation of the efficiency metric for the different kP configurations.
Figure 9. Energy consumption of the kP architectures at different operational frequencies.
Figure 10. Evaluation of the efficiency metric for the different kP configurations using two operational frequencies.
Figure 11. Implementation results for the architectures in [22].
Figure 12. Evaluation of the efficiency metric for the results from [22].
Figure 13. Curve fitting for the hardware usage and energy consumption of architectures C2, C3, and C5.
Figure 14. Evaluation of the efficiency metric for the C2, C3, and C5 configurations based on model data ($EFF_m$), compared to the evaluation based on real data ($EFF$).
Figure 15. Evaluation of the efficiency metric for the different works in the literature, ours included.
Figure 16. Efficiency scores for the different architectures in the literature. Values that are more negative represent greater energy savings overall.
Table 1. Inversion algorithms cost over binary fields of variable length. Let $v = HW(m-1)$ and $u_1 \ldots u_i$ the binary representation of $m$, where $HW(w)$ represents the Hamming weight of $w$.

| Inv. | Field | Multiplications | Squarings | Storage bits |
|---|---|---|---|---|
| Algorithm 1 | $\mathbb{F}_{2^m}$, $m \in \mathbb{Z}$ | $m-2$ | $m-1$ | $2 \times m$ |
| Algorithm 2 | $\mathbb{F}_{2^m}$, $m = 2^r + 1 \in \mathbb{Z}$ | $\log_2(m-1)$ | $m-1$ | $2 \times m$ |
| Algorithm 3 | $\mathbb{F}_{2^m}$, $m \in \mathbb{Z}$ | $\lfloor \log_2(m-1) \rfloor + v - 1$ | $m-1$ | $(2 + v - 1) \times m$ |
| Algorithm 4 | $\mathbb{F}_{2^m}$, $m \in \mathbb{Z}$ | $\lfloor \log_2(m-1) \rfloor + v - 1 + \sum_{i=1}^{r-2} (u_i \times i)$ | $m - 1 + \sum_{i=1}^{r-2} u_i \times \sum_{j=1}^{i} 2^{j-1}$ | $3 \times m$ |
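The operation counts of Table 1 can be checked numerically against the M and S columns of Table 2. The following is a minimal Python sketch of the Algorithm 1, 3, and 4 rows. It assumes that the logarithm is rounded down, that $u_i$ is bit $i$ of $m$ with the least significant bit at $i = 0$, and that $r$ is the bit length of $m$. These readings are assumptions inferred from the values reported for $m = 251$ and $m = 257$, not definitions stated by the authors, and the function names are illustrative.

```python
def hamming_weight(w: int) -> int:
    """Number of ones in the binary representation of w (HW in Table 1)."""
    return bin(w).count("1")


def alg1_cost(m: int) -> tuple[int, int]:
    """(multiplications, squarings) for the Algorithm 1 row: m - 2 and m - 1."""
    return m - 2, m - 1


def alg3_cost(m: int) -> tuple[int, int]:
    """(multiplications, squarings) for the Algorithm 3 row.

    Assumption: log2(m - 1) is rounded down, i.e. bit_length(m - 1) - 1.
    """
    e = m - 1
    mults = (e.bit_length() - 1) + hamming_weight(e) - 1
    return mults, e


def alg4_cost(m: int) -> tuple[int, int]:
    """(multiplications, squarings) for the Algorithm 4 row.

    Assumptions: u_i is bit i of m (LSB at i = 0) and r = bit_length(m).
    """
    r = m.bit_length()
    u = [(m >> i) & 1 for i in range(r)]
    base_m, base_s = alg3_cost(m)
    # Sums run over i = 1 .. r - 2; the inner sum of 2^(j-1) equals 2^i - 1.
    extra_m = sum(u[i] * i for i in range(1, r - 1))
    extra_s = sum(u[i] * (2**i - 1) for i in range(1, r - 1))
    return base_m + extra_m, base_s + extra_s


if __name__ == "__main__":
    for m in (251, 257):
        print(m, alg1_cost(m), alg3_cost(m), alg4_cost(m))
```

Under these assumptions the sketch returns the same M and S values that Table 2 lists for both field sizes, which suggests the reading of the indices is at least consistent with the reported data.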
Table 2. Latency costs for inversion algorithms and kP over binary fields of approximately 128-bit security. We evaluate $m = 251$, which corresponds with the curve used (BE251), and $m = 257$, which has the form $m = 2^8 + 1$, to showcase the best and the average complexities for Itoh-Tsujii inversions.

| Inv. | m (bits) | M | S | EM | LAT a, Inv. (cycles) | LAT a, kP (cycles) | Impr. a, Inv. ΔLAT | Impr. a, Inv. Δ% | Impr. a, kP ΔLAT | Impr. a, kP Δ% | LAT b, Inv. (cycles) | LAT b, kP (cycles) | Impr. b, Inv. ΔLAT | Impr. b, Inv. Δ% | Impr. b, kP ΔLAT | Impr. b, kP Δ% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Algorithm 1 | 251 | 249 | 250 | 502 | 125,249 | 832,818 | - | - | - | - | 62,749 | 456,818 | - | - | - | - |
| | 257 | 255 | 256 | 514 | 131,327 | 872,772 | - | - | - | - | 65,791 | 478,532 | - | - | - | - |
| Algorithm 2 | 257 | 8 | 256 | 514 | 67,848 | 745,814 | −63,479 | −48 | −126,958 | −15 | 2312 | 351,574 | −60,437 | −92 | −126,958 | −27 |
| Algorithm 3 | 251 | 12 | 250 | 1757 | 65,762 | 713,844 | −59,487 | −47 | −118,974 | −14 | 3262 | 337,844 | −59,487 | −95 | −118,974 | −26 |
| | 257 | 8 | 256 | 1799 | 67,848 | 745,814 | −63,479 | −48 | −126,958 | −15 | 2312 | 351,574 | −63,479 | −96 | −126,958 | −27 |
| Algorithm 4 | 251 | 31 | 367 | 753 | 99,898 | 782,116 | −25,351 | −20 | −50,702 | −6 | 8148 | 347,616 | −54,601 | −87 | −109,202 | −24 |
| | 257 | 8 | 256 | 771 | 67,848 | 745,814 | −63,479 | −48 | −126,958 | −15 | 2312 | 351,574 | −63,479 | −96 | −126,958 | −27 |

a Field multiplications (M) and squarings (S) are performed using a bit-serial multiplier. b Multiplications are performed using a bit-serial multiplier and squarings are considered to take 1 cycle.
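The inversion latencies in Table 2 follow from the operation counts once a per-operation cycle cost is fixed. The sketch below assumes that one bit-serial multiplication takes $m$ cycles; this cost is not stated in the table and is inferred here because it reproduces the reported figures. Under footnote a, squarings reuse the multiplier and cost $m$ cycles each, and under footnote b they cost one cycle. The function names are hypothetical.

```python
def inversion_latency(mults: int, squarings: int, m: int,
                      dedicated_squarer: bool = False) -> int:
    """Cycle count of one field inversion.

    Assumption: a bit-serial multiplication costs m cycles. A squaring costs
    m cycles on the multiplier (case a) or 1 cycle otherwise (case b).
    """
    squaring_cycles = 1 if dedicated_squarer else m
    return mults * m + squarings * squaring_cycles


def improvement_percent(new: int, base: int) -> int:
    """Relative change of `new` with respect to `base`, rounded to an integer."""
    return round(100 * (new - base) / base)


if __name__ == "__main__":
    m = 251
    base_a = inversion_latency(249, 250, m)        # Algorithm 1, case a
    alg3_a = inversion_latency(12, 250, m)         # Algorithm 3, case a
    base_b = inversion_latency(249, 250, m, True)  # Algorithm 1, case b
    alg3_b = inversion_latency(12, 250, m, True)   # Algorithm 3, case b
    print(base_a, alg3_a, alg3_a - base_a, improvement_percent(alg3_a, base_a))
    print(base_b, alg3_b, alg3_b - base_b, improvement_percent(alg3_b, base_b))
```

For $m = 251$ this yields the Algorithm 1 and Algorithm 3 inversion latencies and the corresponding ΔLAT and Δ% entries of the table in both cases. The kP latencies include operations outside the inversion and are not modeled here.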
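Table 3. Implementation results for C0 and C1 at frequencies of $f_1 = 100$ kHz and $f_2 = 13.56$ MHz in the xc6slx16 FPGA.

| Arch. | m | FF | LUT | SLC (#) | SLC Δ% | Fmax (MHz) | LAT (cycles) | LAT Δ% | t at f1 (ms) | t at f2 (ms) | POW at f1 (mW) | POW at f2 (mW) | ENE at f1 (mJ) | ENE Δ% at f1 | ENE at f2 (mJ) | ENE Δ% at f2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C0 | 127 | 1140 | 2220 | 633 | - | 122 | 223,024 | - | 2230.24 | 16.45 | 23.59 | 26.89 | 52.61 | - | 0.44 | - |
| | 163 | 1432 | 2755 | 868 | - | 119 | 362,530 | - | 3625.30 | 26.74 | 24.08 | 27.43 | 87.30 | - | 0.73 | - |
| | 233 | 1994 | 3877 | 1224 | - | 102 | 730,248 | - | 7302.48 | 53.85 | 25.38 | 29.87 | 185.34 | - | 1.61 | - |
| | 251 | 2138 | 4122 | 1357 | - | 109 | 845,395 | - | 8453.95 | 62.34 | 25.28 | 29.83 | 213.72 | - | 1.86 | - |
| C1 | 127 | 1168 | 2370 | 716 | 13 | 83 | 212,096 | −5 | 2120.96 | 15.64 | 23.66 | 27.04 | 50.18 | −5 | 0.42 | −4 |
| | 163 | 1462 | 2981 | 945 | 9 | 99 | 324,596 | −10 | 3245.96 | 23.94 | 24.13 | 27.50 | 78.33 | −10 | 0.66 | −10 |
| | 233 | 2024 | 4173 | 1311 | 7 | 97 | 680,096 | −7 | 6800.96 | 50.15 | 25.15 | 29.65 | 171.04 | −8 | 1.49 | −8 |
| | 251 | 2168 | 4435 | 1352 | 0 | 99 | 793,978 | −6 | 7939.78 | 58.55 | 25.24 | 29.61 | 200.40 | −6 | 1.73 | −7 |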
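The time and energy columns of Table 3 are related to the latency and power columns by two simple identities: execution time is the cycle count divided by the clock frequency, and energy is power multiplied by time. The sketch below is an illustrative check of those identities using the C0 row for $m = 127$; the helper names are hypothetical and the rounding to two decimals mirrors the precision shown in the table.

```python
def time_ms(cycles: int, freq_hz: float) -> float:
    """Execution time in milliseconds for a given cycle count and clock."""
    return 1e3 * cycles / freq_hz


def energy_mj(cycles: int, freq_hz: float, power_mw: float) -> float:
    """Energy in millijoules: mW * ms gives microjoules, so divide by 1e3."""
    return power_mw * time_ms(cycles, freq_hz) / 1e3


if __name__ == "__main__":
    cycles = 223_024                 # C0, m = 127
    f1, f2 = 100e3, 13.56e6          # 100 kHz and 13.56 MHz
    print(round(time_ms(cycles, f1), 2), round(energy_mj(cycles, f1, 23.59), 2))
    print(round(time_ms(cycles, f2), 2), round(energy_mj(cycles, f2, 26.89), 2))
```

The same two helpers apply to the rows of the later implementation tables, since they report the same quantities at the same two frequencies.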
Table 4. Preliminary results for the digit-based multiplier on $\mathbb{F}_{2^{251}}$.

| Digit | FF | LUT | LAT (cycles) |
|---|---|---|---|
| 2 | 510 | 786 | 129 |
| 4 | 507 | 1052 | 66 |
| 8 | 506 | 1643 | 35 |
| 16 | 497 | 2662 | 19 |
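Table 5. Implementation results for C2 and C3 at frequencies of $f_1 = 100$ kHz and $f_2 = 13.56$ MHz, and variable multiplier digit size in the xc6slx16 FPGA. The Δ% for C2 and C3 were computed in relation to C0 and C1, respectively.

| Arch. | Digit | FF | LUT | SLC (#) | SLC Δ% | Fmax (MHz) | LAT (cycles) | LAT Δ% | t at f1 (ms) | t at f2 (ms) | POW at f1 (mW) | POW at f2 (mW) | ENE at f1 (mJ) | ENE Δ% at f1 | ENE at f2 (mJ) | ENE Δ% at f2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C2 | 2 | 2138 | 4251 | 1527 | 13 | 88 | 426,980 | −49 | 4269.80 | 31.49 | 25.54 | 29.30 | 109.05 | −49 | 0.92 | −51 |
| | 3 | 2138 | 4387 | 1445 | 6 | 98 | 287,843 | −66 | 2878.43 | 21.23 | 25.87 | 29.57 | 74.46 | −65 | 0.63 | −66 |
| | 4 | 2137 | 4518 | 1468 | 8 | 93 | 218,149 | −74 | 2181.49 | 16.09 | 25.80 | 29.37 | 56.28 | −74 | 0.47 | −75 |
| | 5 | 2140 | 4650 | 1509 | 11 | 88 | 178,288 | −79 | 1782.88 | 13.15 | 25.98 | 29.59 | 46.32 | −78 | 0.39 | −79 |
| | 6 | 2137 | 4780 | 1675 | 23 | 63 | 148,455 | −82 | 1484.55 | 10.95 | 25.76 | 29.32 | 38.24 | −82 | 0.32 | −83 |
| | 7 | 2137 | 4793 | 1688 | 24 | 91 | 128,650 | −85 | 1286.50 | 9.49 | 26.18 | 30.06 | 33.68 | −84 | 0.29 | −84 |
| | 8 | 2140 | 4932 | 1718 | 27 | 88 | 115,363 | −86 | 1153.63 | 8.51 | 26.44 | 30.28 | 30.5 | −86 | 0.26 | −86 |
| | 9 | 2136 | 5057 | 1722 | 27 | 85 | 102,076 | −88 | 1020.76 | 7.53 | 26.35 | 30.25 | 26.9 | −87 | 0.23 | −88 |
| | 10 | 2144 | 5198 | 1836 | 35 | 86 | 95,307 | −89 | 953.07 | 7.03 | 26.70 | 30.50 | 25.45 | −88 | 0.21 | −89 |
| | 11 | 2137 | 5210 | 1634 | 20 | 90 | 85,530 | −90 | 855.30 | 6.31 | 26.63 | 30.45 | 22.78 | −89 | 0.19 | −90 |
| | 12 | 2136 | 5334 | 1855 | 37 | 81 | 78,761 | −91 | 787.61 | 5.81 | 26.50 | 30.33 | 20.87 | −90 | 0.18 | −90 |
| | 13 | 2144 | 5469 | 1914 | 41 | 80 | 75,502 | −91 | 755.02 | 5.57 | 26.49 | 30.36 | 20 | −91 | 0.17 | −91 |
| | 14 | 2136 | 5598 | 1983 | 46 | 78 | 68,984 | −92 | 689.84 | 5.09 | 26.85 | 30.75 | 18.52 | −91 | 0.16 | −91 |
| | 15 | 2139 | 5731 | 1896 | 40 | 79 | 65,474 | −92 | 654.74 | 4.83 | 26.72 | 30.40 | 17.49 | −92 | 0.15 | −92 |
| | 16 | 2139 | 5856 | 1972 | 45 | 76 | 62,215 | −93 | 622.15 | 4.59 | 27.11 | 30.93 | 16.87 | −92 | 0.14 | −92 |
| C3 | 2 | 2168 | 4570 | 1564 | 15 | 84 | 401,075 | −53 | 4010.75 | 29.58 | 25.87 | 29.64 | 103.76 | −51 | 0.88 | −53 |
| | 3 | 2168 | 4697 | 1583 | 17 | 77 | 270,464 | −68 | 2704.64 | 19.95 | 25.97 | 29.75 | 70.24 | −67 | 0.59 | −68 |
| | 4 | 2167 | 4831 | 1541 | 14 | 82 | 205,033 | −76 | 2050.33 | 15.12 | 25.98 | 29.86 | 53.27 | −75 | 0.45 | −76 |
| | 5 | 2170 | 4819 | 1561 | 15 | 88 | 167,608 | −80 | 1676.08 | 12.36 | 26.16 | 29.98 | 43.85 | −79 | 0.37 | −80 |
| | 6 | 2167 | 5098 | 1717 | 27 | 81 | 139,602 | −83 | 1396.02 | 10.30 | 26.24 | 30.11 | 36.63 | −83 | 0.31 | −83 |
| | 7 | 2167 | 5110 | 1644 | 21 | 79 | 121,015 | −86 | 1210.15 | 8.92 | 26.23 | 30.04 | 31.74 | −85 | 0.27 | −85 |
| | 8 | 2170 | 5237 | 1756 | 29 | 78 | 108,540 | −87 | 1085.40 | 8.00 | 26.29 | 29.99 | 28.54 | −87 | 0.24 | −87 |
| | 9 | 2166 | 5362 | 1746 | 29 | 74 | 96,065 | −89 | 960.65 | 7.08 | 26.44 | 30.17 | 25.4 | −88 | 0.21 | −89 |
| | 10 | 2174 | 5503 | 1878 | 38 | 72 | 89,702 | −89 | 897.02 | 6.62 | 26.51 | 30.23 | 23.78 | −89 | 0.2 | −89 |
| | 11 | 2167 | 5526 | 1895 | 40 | 73 | 80,534 | −90 | 805.34 | 5.94 | 26.57 | 29.19 | 21.4 | −90 | 0.17 | −91 |
| | 12 | 2166 | 5643 | 1820 | 34 | 69 | 74,171 | −91 | 741.71 | 5.47 | 26.82 | 30.78 | 19.89 | −91 | 0.17 | −91 |
| | 13 | 2174 | 5778 | 1953 | 44 | 72 | 71,115 | −92 | 711.15 | 5.24 | 26.99 | 30.75 | 19.19 | −91 | 0.16 | −91 |
| | 14 | 2166 | 5915 | 1995 | 47 | 71 | 65,003 | −92 | 650.03 | 4.79 | 27.03 | 30.75 | 17.57 | −92 | 0.15 | −92 |
| | 15 | 2169 | 6042 | 1965 | 45 | 71 | 61,696 | −93 | 616.96 | 4.55 | 26.92 | 30.72 | 16.61 | −92 | 0.14 | −92 |
| | 16 | 2169 | 6260 | 2093 | 54 | 68 | 58,640 | −93 | 586.40 | 4.32 | 27.56 | 31.36 | 16.16 | −92 | 0.14 | −92 |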
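Table 6. Implementation results for C4 and C5 at frequencies of $f_1 = 100$ kHz and $f_2 = 13.56$ MHz in the xc6slx16 FPGA. The Δ% for C4 and C5 were computed in relation to C1 and C3, respectively.

| Arch. | Digit | FF | LUT | SLC (#) | SLC Δ% | Fmax (MHz) | LAT (cycles) | LAT Δ% | t at f1 (ms) | t at f2 (ms) | POW at f1 (mW) | POW at f2 (mW) | ENE at f1 (mJ) | ENE Δ% at f1 | ENE at f2 (mJ) | ENE Δ% at f2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C4 | 1 | 2176 | 5290 | 1651 | 22 | 88 | 354,264 | −55 | 3542.64 | 26.13 | 26.92 | 32.11 | 95.37 | −52 | 0.84 | −52 |
| C5 | 2 | 2176 | 5668 | 1758 | 30 | 85 | 180,349 | −79 | 1803.49 | 13.30 | 27.01 | 30.88 | 48.71 | −77 | 0.41 | −78 |
| | 3 | 2176 | 5797 | 1790 | 32 | 84 | 122,734 | −85 | 1227.34 | 9.05 | 27.14 | 31.22 | 33.31 | −84 | 0.28 | −85 |
| | 4 | 2175 | 5934 | 1948 | 44 | 84 | 93,801 | −89 | 938.01 | 6.92 | 27.84 | 31.99 | 26.11 | −88 | 0.22 | −88 |
| | 5 | 2178 | 6068 | 1912 | 41 | 82 | 77,232 | −91 | 772.32 | 5.70 | 27.68 | 31.83 | 21.38 | −90 | 0.18 | −90 |
| | 6 | 2175 | 6199 | 1966 | 45 | 85 | 64,868 | −92 | 648.68 | 4.78 | 27.74 | 31.80 | 17.99 | −92 | 0.15 | −92 |
| | 7 | 2175 | 6210 | 1992 | 47 | 88 | 56,709 | −93 | 567.09 | 4.18 | 27.92 | 32.02 | 15.83 | −93 | 0.13 | −93 |
| | 8 | 2178 | 6353 | 1995 | 47 | 89 | 51,186 | −94 | 511.86 | 3.77 | 28.31 | 32.29 | 14.49 | −93 | 0.12 | −94 |
| | 9 | 2174 | 6474 | 1920 | 41 | 82 | 45,663 | −95 | 456.63 | 3.37 | 28.10 | 32.12 | 12.83 | −94 | 0.11 | −94 |
| | 10 | 2182 | 6618 | 2122 | 56 | 77 | 42,776 | −95 | 427.76 | 3.15 | 28.20 | 32.33 | 12.06 | −94 | 0.10 | −95 |
| | 11 | 2175 | 6716 | 2153 | 59 | 75 | 38,822 | −95 | 388.22 | 2.86 | 28.50 | 32.96 | 11.06 | −95 | 0.09 | −95 |
| | 12 | 2174 | 6702 | 2101 | 55 | 75 | 35,935 | −96 | 359.35 | 2.65 | 29.02 | 33.37 | 10.43 | −95 | 0.09 | −95 |
| | 13 | 2182 | 6847 | 2175 | 60 | 72 | 34,617 | −96 | 346.17 | 2.55 | 28.46 | 32.61 | 9.85 | −95 | 0.08 | −96 |
| | 14 | 2174 | 7062 | 2182 | 61 | 73 | 31,981 | −96 | 319.81 | 2.36 | 29.20 | 33.87 | 9.34 | −96 | 0.08 | −96 |
| | 15 | 2177 | 7117 | 2099 | 55 | 70 | 30,412 | −96 | 304.12 | 2.24 | 29.27 | 33.45 | 8.90 | −96 | 0.08 | −96 |
| | 16 | 2177 | 7251 | 2205 | 62 | 69 | 29,094 | −97 | 290.94 | 2.15 | 29.67 | 33.96 | 8.63 | −96 | 0.07 | −96 |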
Table 7. Details of the six different kP architectures created, highlighting the approach used for performing field operations.

| Conf. | Multiplication | Inversion | Addition | Squaring |
|---|---|---|---|---|
| C0 | Bit-serial | Wang | Combinatorial | Not supported |
| C1 | Bit-serial | Itoh-Tsujii | Combinatorial | Not supported |
| C2 | Digit-serial | Wang | Combinatorial | Not supported |
| C3 | Digit-serial | Itoh-Tsujii | Combinatorial | Not supported |
| C4 | Bit-serial | Itoh-Tsujii | Combinatorial | Combinatorial |
| C5 | Digit-serial | Itoh-Tsujii | Combinatorial | Combinatorial |
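Table 8. Implementation results for different low-power or low-area kP architectures from the literature.

| Year | Ref. | m | Curve | Platform | Label | Digit | FF | LUT | SLC | GE | Storage | LAT (cycles) | Freq. (MHz) | t (ms) | POW (μW) | ENE (μJ) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | [22] | 131 | B131 | 0.13 μm | w1a01 | 1 | x | x | x | 4446 | 5·m bits | 226,330 | 0.50 | 452.66 | 21.00 | 9.51 |
| | | | | | | 2 | x | x | x | 4917 | 5·m bits | 116,480 | 0.50 | 232.96 | 21.50 | 5.01 |
| | | | | | | 3 | x | x | x | 5376 | 5·m bits | 79,300 | 0.50 | 158.60 | 22.00 | 3.49 |
| | | | | | | 4 | x | x | x | 5837 | 5·m bits | 60,710 | 0.50 | 121.42 | 22.50 | 2.73 |
| | | 139 | B139 | | | 1 | x | x | x | 4716 | 5·m bits | 254,610 | 0.50 | 509.22 | 22.00 | 11.20 |
| | | | | | | 2 | x | x | x | 5214 | 5·m bits | 130,824 | 0.50 | 261.65 | 22.50 | 5.89 |
| | | | | | | 3 | x | x | x | 5712 | 5·m bits | 89,562 | 0.50 | 179.12 | 23.00 | 4.12 |
| | | | | | | 4 | x | x | x | 6189 | 5·m bits | 68,034 | 0.50 | 136.07 | 23.50 | 3.20 |
| | | 151 | B151 | | | 1 | x | x | x | 5117 | 5·m bits | 300,150 | 0.50 | 600.30 | 23.00 | 13.81 |
| | | | | | | 2 | x | x | x | 5652 | 5·m bits | 153,900 | 0.50 | 307.80 | 23.50 | 7.23 |
| | | | | | | 3 | x | x | x | 6187 | 5·m bits | 105,150 | 0.50 | 210.30 | 24.00 | 5.05 |
| | | | | | | 4 | x | x | x | 6700 | 5·m bits | 79,800 | 0.50 | 159.60 | 25.00 | 3.99 |
| | | 163 | B163 | | | 1 | x | x | x | 5525 | 5·m bits | 349,434 | 0.50 | 698.87 | 24.00 | 16.77 |
| | | | | | | 2 | x | x | x | 6105 | 5·m bits | 178,848 | 0.50 | 357.70 | 24.50 | 8.76 |
| | | | | | | 3 | x | x | x | 6685 | 5·m bits | 121,986 | 0.50 | 243.97 | 25.00 | 6.10 |
| | | | | | | 4 | x | x | x | 7243 | 5·m bits | 92,502 | 0.50 | 185.00 | 26.00 | 4.81 |
| 2007 | [25] | 163 | B163 | xc3s1000l | w2a01 | 1 | - | - | 2541 | x | RAM/ROM/Pro | 130,141 | 80.00 | 1.63 | 207,328.39 | 339.62 |
| | | | | | | 16 | - | - | 3721 | x | RAM/ROM/Pro | 92,958 | 80.00 | 1.16 | 236,085.34 | 274.87 |
| | | | | | w2a02 | 1 | - | - | 2692 | x | RAM/ROM/Pro | 287,324 | 80.00 | 3.59 | 171,614.10 | 610.82 |
| | | | | | | 16 | - | - | 3728 | x | RAM/ROM/Pro | 40,564 | 80.00 | 0.51 | 252,319.11 | 129.49 |
| | | | | | w2a03 | 1 | - | - | 1551 | x | RAM/ROM/Pro | 287,324 | 80.00 | 3.59 | 155,380.33 | 549.74 |
| | | | | | | 16 | - | - | 2556 | x | RAM/ROM/Pro | 40,564 | 80.00 | 0.51 | 173,933.21 | 87.96 |
| | | | | | w2a04 | 1 | - | - | 2541 | x | RAM/ROM/Pro | 112,677 | 80.00 | 1.41 | 208,719.85 | 287.09 |
| | | | | | | 16 | - | - | 3728 | x | RAM/ROM/Pro | 112,677 | 80.00 | 1.41 | 205,009.28 | 284.64 |
| | | | | | w2a05 | 1 | - | - | 2541 | x | RAM/ROM/Pro | 174,648 | 80.00 | 2.18 | 217,996.29 | 472.77 |
| | | | | | | 16 | - | - | 3728 | x | RAM/ROM/Pro | 25,353 | 80.00 | 0.32 | 224,953.62 | 69.63 |
| | | | | | w2a06 | 1 | - | - | 1543 | x | RAM/ROM/Pro | 174,648 | 80.00 | 2.18 | 153,525.05 | 333.51 |
| | | | | | | 16 | - | - | 2707 | x | RAM/ROM/Pro | 25,353 | 80.00 | 0.32 | 179,035.25 | 54.97 |
| | | | | | w2a07 | 1 | - | - | 3033 | x | RAM/ROM/Pro | 116,057 | 80.00 | 1.45 | 222,634.51 | 322.51 |
| | | | | | | 16 | - | - | 4061 | x | RAM/ROM/Pro | 82,817 | 80.00 | 1.04 | 233,766.23 | 244.33 |
| | | | | | w2a08 | 1 | - | - | 2624 | x | RAM/ROM/Pro | 238,874 | 80.00 | 2.99 | 212,430.43 | 631.59 |
| | | | | | | 16 | - | - | 3751 | x | RAM/ROM/Pro | 33,803 | 80.00 | 0.42 | 226,345.08 | 97.73 |
| | | | | | w2a09 | 1 | - | - | 1641 | x | RAM/ROM/Pro | 238,874 | 80.00 | 2.99 | 157,235.62 | 471.55 |
| | | | | | | 16 | - | - | 2821 | x | RAM/ROM/Pro | 33,803 | 80.00 | 0.42 | 175,324.68 | 76.96 |
| 2009 | [17] | 163 | B163 | xc3s500e | w3a01 | 1 | 3323 | 3249 | 2873 | x | 7 BRAM | 126,836 | 10.00 | 12.68 | 76,730.00 | 973.18 |
| | | | | | | 15 | 3337 | 5238 | 3738 | x | 7 BRAM | 89,976 | 10.00 | 9.00 | 78,500.00 | 706.26 |
| | | | | | w3a02 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 281,024 | 10.00 | 28.10 | 73,680.00 | 2070.63 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 33,720 | 10.00 | 3.37 | 84,650.00 | 285.45 |
| | | | | | w3a03 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 226,110 | 10.00 | 22.61 | 73,710.00 | 1666.58 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 28,054 | 10.00 | 2.81 | 84,920.00 | 238.23 |
| | | | | | w3a04 | 1 | 3323 | 3249 | 2873 | x | 7 BRAM | 111,188 | 10.00 | 11.12 | 77,230.00 | 858.70 |
| | | | | | | 15 | 3337 | 5238 | 3783 | x | 7 BRAM | 110,884 | 10.00 | 11.09 | 78,080.00 | 865.82 |
| | | | | | w3a05 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 171,796 | 10.00 | 17.18 | 73,810.00 | 1267.93 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 21,164 | 10.00 | 2.12 | 83,940.00 | 177.65 |
| | | | | | w3a06 | 1 | 3323 | 3249 | 2873 | x | 7 BRAM | 170,214 | 10.00 | 17.02 | 75,700.00 | 1288.45 |
| | | | | | | 15 | 3337 | 5238 | 3738 | x | 7 BRAM | 21,181 | 10.00 | 2.12 | 85,890.00 | 181.93 |
| | | | | | w3a07 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 172,124 | 10.00 | 17.21 | 73,640.00 | 1267.59 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 21,492 | 10.00 | 2.15 | 82,850.00 | 178.05 |
| | | | | | w3a08 | 1 | 2834 | 2612 | 2384 | x | 8 BRAM | 88,991 | 10.00 | 8.90 | 77,290.00 | 687.84 |
| | | | | | | 15 | 2864 | 6573 | 4447 | x | 8 BRAM | 12,991 | 10.00 | 1.30 | 95,010.00 | 123.43 |
| | | | | | w3a09 | 1 | 3658 | 3122 | 2888 | x | 8 BRAM | 61,769 | 10.00 | 6.18 | 80,210.00 | 495.47 |
| | | | | | | 15 | 3688 | 7200 | 4654 | x | 8 BRAM | 10,545 | 10.00 | 1.05 | 98,620.00 | 104.00 |
| | | | | | w3a10 | 1 | 3323 | 3249 | 2873 | x | 7 BRAM | 113,098 | 10.00 | 11.31 | 77,480.00 | 876.23 |
| | | | | | | 15 | 3337 | 5238 | 3738 | x | 7 BRAM | 80,216 | 10.00 | 8.02 | 78,980.00 | 633.56 |
| | | | | | w3a11 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 235,001 | 10.00 | 23.50 | 73,900.00 | 1736.63 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 28,230 | 10.00 | 2.82 | 84,820.00 | 239.45 |
| | | | | | w3a12 | 1 | 2005 | 1768 | 1551 | x | 8 BRAM | 189,372 | 10.00 | 18.94 | 73,860.00 | 1398.72 |
| | | | | | | 15 | 2019 | 3748 | 2575 | x | 8 BRAM | 23,742 | 10.00 | 2.37 | 86,260.00 | 204.79 |
| 2009 | [18] | 163 | B163 | 0.13 μm | w4a01 | 1 | x | x | x | 16,837 | 0 | 169,769 | 0.50 | 339.54 | 16.01 | 5.44 |
| | | | | | | 2 | x | x | x | 17,444 | 0 | 89,417 | 0.50 | 178.83 | 17.33 | 3.10 |
| | | | | | | 3 | x | x | x | 17,957 | 0 | 62,633 | 0.50 | 125.27 | 19.98 | 2.50 |
| | | | | | | 4 | x | x | x | 18,567 | 0 | 48,745 | 0.50 | 97.49 | 22.05 | 2.15 |
| | | | | | | 8 | x | x | x | 20,678 | 0 | 28,905 | 0.50 | 57.81 | 28.03 | 1.62 |
| | | | | | | 15 | x | x | x | 24,561 | 0 | 18,985 | 0.50 | 37.97 | 34.63 | 1.32 |
| | | | | | | 19 | x | x | x | 26,777 | 0 | 17,001 | 0.50 | 34.00 | 41.51 | 1.41 |
| | | | | | | 55 | x | x | x | 47,247 | 0 | 11,049 | 0.50 | 22.10 | 68.23 | 1.51 |
| 2010 | [26] | 163 | BE163 | 0.13 μm | w5a01 | 1 | x | x | x | 11,720 | 84 bytes | 219,148 | 0.40 | 547.87 | 7.27* | 3.98 |
| | | | | | | 2 | x | x | x | 12,348 | 84 bytes | 113,428 | 0.40 | 283.57 | 9.10* | 2.58 |
| | | | | | | 3 | x | x | x | 12,862 | 84 bytes | 78,112 | 0.40 | 195.28 | 10.19* | 1.99 |
| | | | | | | 4 | x | x | x | 13,427 | 84 bytes | 59,800 | 0.40 | 149.50 | 12.00* | 1.79 |
| | | | | | | 5 | x | x | x | 13,970 | 84 bytes | 49,336 | 0.40 | 123.34 | 12.69* | 1.57 |
| | | | | | | 6 | x | x | x | 14,530 | 84 bytes | 42,796 | 0.40 | 106.99 | 13.80* | 1.48 |
| 2012 | [27] | 163 | B163 | 0.25 μm | w6a01 | 1 | x | x | x | 24,140 | 0 | 165,000 | 10.00 | 16.50 | 5940.00 | 98.01 |
| | | | | | | 2 | x | x | x | 24,742 | 0 | 84,900 | 10.00 | 8.49 | 7180.00 | 60.96 |
| | | | | | | 4 | x | x | x | 26,156 | 0 | 44,200 | 10.00 | 4.42 | 8640.00 | 38.19 |
| | | | | | | 8 | x | x | x | 31,333 | 0 | 23,500 | 10.00 | 2.35 | 13,200.00 | 31.02 |
| | | | | | | 16 | x | x | x | 34,956 | 0 | 13,500 | 10.00 | 1.35 | 17,400.00 | 23.49 |
| 2016 | [32] | 163 | K163 | 0.13 μm | w7a01 | 1 | x | x | x | 10,106 | RAM/ROM | - | 1.13 | - | 36.63 | 9.16 |
| | | | | | | 2 | x | x | x | 11,383 | RAM/ROM | - | 0.59 | - | 21.55 | 5.39 |
| | | | | | | 3 | x | x | x | 12,236 | RAM/ROM | - | 0.41 | - | 15.75 | 3.94 |
| | | | | | | 4 | x | x | x | 12,863 | RAM/ROM | - | 0.32 | - | 12.08 | 3.02 |
| | | | | | | 5 | x | x | x | 13,497 | RAM/ROM | - | 0.27 | - | 11.41 | 2.85 |

Markers: (*) dynamic power; (-) data not available; (x) does not apply. Some results were retrieved from graph representations.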
Table 9. Adjusted coefficients for the efficiency model of each configuration. The R-square result is provided for the adjustment of the hardware and the energy consumption curves.

| Year | Ref. | m | Curve | Platform | Conf. | α1 | α2 | R-square (hardware) | α3 | α4 | R-square (energy) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | [22] | 131 | B131 | 0.13 μm | w1a01 | 463.2000 | 3986.0000 | 99.99% | 9.4985 | −0.9107 | 99.99% |
| | | 139 | B139 | | | 491.7000 | 4228.5000 | 99.98% | 11.1861 | −0.9118 | 99.99% |
| | | 151 | B151 | | | 528.4000 | 4593.0000 | 99.98% | 13.7842 | −0.9122 | 99.97% |
| | | 163 | B163 | | | 573.4000 | 4956.0000 | 99.99% | 16.7408 | −0.9169 | 99.97% |
| 2007 | [25] | 163 | B163 | xc3s1000l | w2a01 | 78.6667 | 2462.3333 | 100% | 339.6200 | −0.0763 | 100% |
| | | | | | w2a02 | 69.0667 | 2622.9333 | 100% | 610.8200 | −0.5595 | 100% |
| | | | | | w2a03 | 67.0000 | 1484.0000 | 100% | 549.7400 | −0.6610 | 100% |
| | | | | | w2a04 | 79.1333 | 2461.8666 | 100% | 287.0900 | −0.0031 | 100% |
| | | | | | w2a05 | 79.1333 | 2461.8666 | 100% | 472.7700 | −0.6908 | 100% |
| | | | | | w2a06 | 77.6000 | 1465.4000 | 100% | 333.5100 | −0.6503 | 100% |
| | | | | | w2a07 | 68.5333 | 2964.4666 | 100% | 322.5100 | −0.1001 | 100% |
| | | | | | w2a08 | 75.1333 | 2548.8666 | 100% | 631.5900 | −0.6730 | 100% |
| | | | | | w2a09 | 78.6667 | 1562.3333 | 100% | 471.5500 | −0.6538 | 100% |
| 2009 | [17] | 163 | B163 | xc3s500e | w3a01 | 61.7857 | 2811.2142 | 100% | 973.1800 | −0.1184 | 100% |
| | | | | | w3a02 | 73.1429 | 1477.8571 | 100% | 2070.6300 | −0.7317 | 100% |
| | | | | | w3a03 | 73.1429 | 1477.8571 | 100% | 1666.5800 | −0.7183 | 100% |
| | | | | | w3a04 | 65.0000 | 2808.0000 | 100% | 858.7000 | 0.0030 | 100% |
| | | | | | w3a05 | 73.1429 | 1477.8571 | 100% | 1267.9300 | −0.7257 | 100% |
| | | | | | w3a06 | 61.7857 | 2811.2142 | 100% | 1288.4500 | −0.7229 | 100% |
| | | | | | w3a07 | 73.1429 | 1477.8571 | 100% | 1267.5900 | −0.7248 | 100% |
| | | | | | w3a08 | 147.3571 | 2236.6428 | 100% | 687.8400 | −0.6344 | 100% |
| | | | | | w3a09 | 126.1429 | 2761.8571 | 100% | 495.4700 | −0.5765 | 100% |
| | | | | | w3a10 | 61.7857 | 2811.2141 | 100% | 876.2300 | −0.4678 | 100% |
| | | | | | w3a11 | 73.1429 | 1477.8571 | 100% | 1736.6300 | −0.7317 | 100% |
| | | | | | w3a12 | 73.1429 | 1477.8571 | 100% | 1398.7200 | −0.7095 | 100% |
| 2009 | [18] | 163 | B163 | 0.13 μm | w4a01 | 562.4282 | 16,236.0223 | 99.99% | 4.9753 | −0.5094 | 89.25% |
| 2010 | [26] | 163 | BE163 | 0.13 μm | w5a01 | 556.6000 | 11,194.0000 | 99.95% | 3.9404 | −0.5794 | 99.50% |
| 2012 | [27] | 163 | B163 | 0.25 μm | w6a01 | 752.5578 | 23,599.5416 | 95.70% | 95.7553 | −0.5840 | 98.33% |
| 2016 | [32] | 163 | K163 | 0.13 μm | w7a01 | 826.2000 | 9538.3999 | 97.40% | 9.1506 | −0.7636 | 99.80% |
| 2019 | This work | 251 | BE251 | xc6slx16 | C2 | 110.8941 | 4050.2750 | 99.54% | 1846.3496 | −0.9634 | 99.92% |
| | | | | | C3 | 114.1824 | 4331.0750 | 99.17% | 1725.1535 | −0.9542 | 99.95% |
| | | | | | C5 | 115.8118 | 5409.7250 | 97.91% | 827.4241 | −0.9371 | 99.76% |

Share and Cite

MDPI and ACS Style

Lara-Nino, C.A.; Diaz-Perez, A.; Morales-Sandoval, M. Energy/Area-Efficient Scalar Multiplication with Binary Edwards Curves for the IoT. Sensors 2019, 19, 720. https://doi.org/10.3390/s19030720

