They are composed of one or more Central Processing Units (CPUs), an interconnection mechanism, some peripherals, and the memories where the applications executed during the simulation reside. From this general scenario, we want to focus on evaluating the power consumption of some modules in the design, using different parameters to customize the analysis procedure. To further illustrate the framework features and demonstrate the design productivity gain achieved with the aided workflow, we present a use-case example where the framework helps to set up the environment needed to obtain the desired result. The workflows described in the next section were executed on a computer equipped with two Intel(R) Xeon(R) E5-2650 v3 CPUs and of RAM running CentOS Linux 7.
4.1. Use-Case System-on-Chip
This section presents how the various parts of the framework have been used to perform the post-synthesis evaluation of a hardware accelerator for ECC. As reported in Section 1, this hardware module will be integrated into the Hardware Secure Module of the European Processor Initiative (EPI) chip and implemented in a 7 nm ARM® Artisan technology. Using the framework reduced the time spent building the design and verification environment needed to set up the synthesis, simulation, and power analysis steps of the workflow. It also made it possible to run many simulations with different input data.
As shown in Figure 12, the simulations instantiate a SoC composed of a RISC-V CPU (the 64-bit CVA6 by the OpenHW Group [29]), an Advanced eXtensible Interface (AXI) communication network, a simulation-only RAM initialised with the binary of the application, and the hardware accelerator for ECC (henceforth ECC Core). The ECC Core under evaluation can be configured to support both the NIST P-256 and NIST P-521 elliptic curves [30], which are used in different cryptographic schemes such as the Elliptic Curve Digital Signature Algorithm (ECDSA) and Elliptic Curve Diffie-Hellman (ECDH). In this work, we focus on evaluating the performance of the ECC Point Multiplication (PM) operation, which is the most important primitive in ECC. The architectural details of the ECC Core that we want to evaluate are presented in [14]. That work implements three different algorithms to perform the PM and evaluates them in terms of latency, power, and area. In addition, [14] provides a preliminary evaluation of the accelerator's resistance against Simple Power Analysis (SPA) attacks for the three architectures, implemented in the ARM® Artisan® technology (typical corner case: 0.75 V, 85 °C) at 100 MHz. The architecture of the ECC Core is reported in Figure 13: it features two computational units (the Point addition module and the Point doubling module in Figure 13) and a state machine. The latter manages the computational modules and the data flow according to the PM algorithm. At synthesis time, the state machine can be configured to execute one of three PM algorithms, briefly described as follows:
DA: This configuration (already presented in [14]) performs PM using the standard Double-and-Add (DA) algorithm, which is not resistant to SPA. Its latency is not fixed but depends on the value of the key k.
DAA: This configuration (already presented in [14]) performs PM using the Double-and-Add-Always (DAA) algorithm, which is considered secure against SPA.
MDAA: This configuration (already presented in [14]) performs PM using a Modified Double-and-Add-Always (MDAA) algorithm.
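The difference between the DA and DAA configurations can be illustrated with a minimal software sketch. It is not the hardware implementation of [14]: it assumes a toy curve y^2 = x^3 + 2x + 3 over the prime field of 97 elements, affine coordinates, and a left-to-right bit scan, all chosen only for illustration. The recorded operation sequence stands in for what an SPA attacker observes in a power trace.

```python
# Toy parameters (assumptions for illustration, not NIST P-256).
P_MOD = 97          # prime field modulus
A = 2               # curve coefficient a in y^2 = x^3 + a*x + b (b = 3)
G = (3, 6)          # a point on the toy curve: 6^2 = 36 = 27 + 6 + 3 (mod 97)


def add(p, q):
    """Affine point addition; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # p + (-p), also covers doubling a point with y = 0
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def double_and_add(k, p):
    """DA: an addition is executed only when the key bit is 1."""
    acc, ops = None, []
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        ops.append("D")
        if bit == "1":
            acc = add(acc, p)
            ops.append("A")
    return acc, ops


def double_and_add_always(k, p):
    """DAA: an addition is always executed; its result is kept only for 1 bits."""
    acc, ops = None, []
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        ops.append("D")
        tmp = add(acc, p)
        ops.append("A")
        if bit == "1":
            acc = tmp
    return acc, ops


if __name__ == "__main__":
    k = 0b101010  # illustrative key value
    print("".join(double_and_add(k, G)[1]))         # DADDADDAD
    print("".join(double_and_add_always(k, G)[1]))  # DADADADADADA
```

In the DA sequence every "A" marks a key bit equal to 1, so the key can be read directly from the order of operations, and the number of operations (hence the latency) depends on the key. In the DAA sequence the pattern is the same for every key of the same length.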
In this work, we introduce an additional randomized MDAA architecture, named Randomized Modified Double-and-Add-Always (RMDAA), to improve the SCA resistance of the ECC Core. All the architectures in [14] employ a redundant projective representation of the elliptic curve points (named Standard Projective representation), which reduces the computation time of PM at the cost of higher resource consumption. This approach requires mapping every generic point P = (x, y) of the elliptic curve to its projective representation (X, Y, Z), with x = X/Z and y = Y/Z, where Z ≠ 0 can be arbitrarily chosen. In the presented solution, we randomize the Z-coordinate to determine whether this countermeasure provides a benefit against SPA attacks. Furthermore, several works [31,32,33] have shown that randomizing the Z-coordinate can also be used as a countermeasure against Differential Power Analysis (DPA) SCAs.
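The redundancy exploited by the Z-coordinate randomization can be sketched in a few lines. The example assumes the usual standard projective convention, in which an affine point (x, y) corresponds to (X, Y, Z) with x = X/Z and y = Y/Z, and it uses a toy prime field instead of the NIST P-256 field. The function names are hypothetical and do not come from the ECC Core.

```python
import secrets

P_MOD = 97  # toy prime modulus (assumption; the real core works on NIST fields)


def to_projective(point, z):
    """Map an affine point (x, y) to (x*Z, y*Z, Z) for a nonzero Z."""
    x, y = point
    if z % P_MOD == 0:
        raise ValueError("Z must be nonzero modulo the field prime")
    return (x * z % P_MOD, y * z % P_MOD, z % P_MOD)


def to_affine(proj):
    """Recover (x, y) = (X/Z, Y/Z) using a modular inverse of Z."""
    big_x, big_y, z = proj
    z_inv = pow(z, -1, P_MOD)
    return (big_x * z_inv % P_MOD, big_y * z_inv % P_MOD)


def randomize(point):
    """Pick a fresh random nonzero Z for every new computation."""
    z = 1 + secrets.randbelow(P_MOD - 1)  # uniform in [1, P_MOD - 1]
    return to_projective(point, z)


if __name__ == "__main__":
    p = (3, 6)
    a, b = randomize(p), randomize(p)
    # Different internal values (with high probability), same affine point.
    print(a, b, to_affine(a) == to_affine(b) == p)
```

Every nonzero Z yields a different triple of intermediate values for the same point, which is why the data processed by the accelerator, and hence its data-dependent power consumption, changes from run to run even when the key and the input point are fixed.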
Thanks to the functionalities offered by the framework presented in this paper, we were able to obtain a more accurate characterisation of the ECC Core in the four different synthesis scenarios (i.e., DA, DAA, MDAA, RMDAA). In particular, we synthesised the hardware designs at a frequency of 1 GHz, used the SDF generated by Design Compiler for the gate-level simulations, and evaluated the power consumption profile of the four architectures. We used the proposed framework to synthesise the four ECC Core configurations, evaluating area utilization, latency, and power consumption (for simplicity, only the configuration for the NIST P-256 elliptic curve is synthesised, but the workflow can synthesise all the configurations automatically). Additionally, we assessed the resistance of the ECC Core to SPA. To do so, we needed to extract the power trace of the ECC Core while performing the PM with different inputs, provided from the software side. For the sake of readability and conciseness, in this work we provide only six different keys for each architecture. Therefore, the workflow must execute: six software compilations, one per key (k1, k2, k3, k4, k5, k6); four syntheses, one per PM implementation (da, daa, mdaa, rmdaa); and twenty-four simulations and power analyses, one for each combination of compiled software and synthesised netlist (targets are named <syn>-<sw>). It should be noted that a complete characterization of an ECC architecture against SPA or DPA SCAs would require thousands of simulations. The flow recipe scales easily with the complexity of the desired simulations, reducing the time spent on workflow setup and data collection.
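The size of the resulting target matrix can be checked with a short sketch. It only enumerates names and is not part of the framework; the order of the two parts of each target name is an assumption made here for illustration.

```python
from itertools import product

SW_TARGETS = [f"k{i}" for i in range(1, 7)]   # six compiled applications
SYN_TARGETS = ["da", "daa", "mdaa", "rmdaa"]  # four synthesised netlists

# One simulation and one power analysis per (netlist, software) pair.
SIM_TARGETS = [f"{syn}-{sw}" for syn, sw in product(SYN_TARGETS, SW_TARGETS)]

if __name__ == "__main__":
    print(len(SIM_TARGETS))  # 24
    print(SIM_TARGETS[:3])
```

Because the combinations are generated rather than listed by hand, moving from six keys to thousands only changes the length of the key list.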
4.2. Recipe Configuration
We wrote the recipe exploiting GNU Make functions to define the various properties dynamically. First, we select the RTL modules to include in the workflow by using the RTL_MODULES property. The RTL_DEFINES property is used to set some SystemVerilog defines common to all targets. The ecc_soc module includes as dependencies the RTL modules of the CPU, the AXI interconnect, and the ECC Core.
The software handler uses the GCC compiler and the Baremetal SDK included in the framework to build the six different applications.
    SW = cxx # CXX tool
    SW_TARGETS = k1 k2 k3 k4 k5 k6

    define gen-sw-props
    SW_$1_SDK = baremetal
    SW_$1_TYPE = vmem
    SW_$1_MEMNAME = initram
    SW_$1_BAREMETAL_APP = ecc/spa_test
    SW_$1_BAREMETAL_ARCH = soft64
    SW_$1_BAREMETAL_PLATFORM = ecc_soc
    endef
    $(foreach t,$(SW_TARGETS),$(eval $(call gen-sw-props,$t)))

    SW_k1_BAREMETAL_DEFINES = ECC_K=101010
    SW_k2_BAREMETAL_DEFINES = ECC_K=010101
    SW_k3_BAREMETAL_DEFINES = ECC_K=001100
    SW_k4_BAREMETAL_DEFINES = ECC_K=110011
    SW_k5_BAREMETAL_DEFINES = ECC_K=110000
    SW_k6_BAREMETAL_DEFINES = ECC_K=001111
The synthesis handler uses Design Compiler to synthesize the four configurations of the ECC Core for the provided Process Development Kit (PDK).
    SYN = dc # DesignCompiler tool
    SYN_PARALLEL = 5 # Limit for license availability
    SYN_TARGETS = da daa mdaa rmdaa
    SYN_NETLIST_ANNOTATED = yes
    SYN_DC_LIB = libs/epi7nm
    SYN_DC_SETUP_FILE = scripts/dc_setup.tcl
    SYN_DC_SDC_FILES = constr/ecc_core.sdc

    define gen-syn-props
    SYN_$1_TOP = ecc_core_wrapper
    SYN_$1_TOP_LIB = ecc_core
    SYN_$1_REQUIRE_SW = # Force no dependencies with SW
    endef
    $(foreach t,$(SYN_TARGETS),$(eval $(call gen-syn-props,$t)))

    SYN_daa_RTL_DEFINES = ECC_PROT_DAA
    SYN_mdaa_RTL_DEFINES = ECC_PROT_MDAA
    SYN_rmdaa_RTL_DEFINES = ECC_PROT_MDAA ECC_RAND
The simulation handler uses QuestaSim to perform the twenty-four simulations. The software and synthesis dependencies of each target are restricted using the SIM_<target>_REQUIRE_SYN/SW properties. SDF annotation is performed by adding QuestaSim-specific command-line arguments to each simulation target.
    SIM = questa # QuestaSIM tool
    SIM_TB = tb_ecc_core
    SIM_TB_LIB = ecc_core
    SIM_TIMESCALE = 100ps/1ps
    SIM_RUN_TIME = all
    SIM_OPT = yes
    # 1 GHz clock from testbench
    SIM_RTL_DEFINES = CLK_PERIOD=10
    SIM_VCD_LOG_MODULES = tb_ecc_soc/soc/ecc_core
    SIM_QUESTA_VOPT_ARGS = +sdf_verbose +sdf_iopath_to_prim_ok

    # Generation of simulation targets
    $(foreach t,$(SYN_TARGETS), $(foreach k,$(SW_TARGETS),\
      $(eval SIM_TARGETS += $t-$k)\
      $(eval SIM_$t-$k_REQUIRE_SYN = $t)\
      $(eval SIM_$t-$k_REQUIRE_SW = $k)\
      $(eval SIM_$t-$k_QUESTA_VOPT_ARGS = -sdftyp /tb_ecc_soc/soc/ecc_core=$t-ecc_core_wrapper.sdf)))
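The "1 GHz clock" comment in the simulation recipe can be cross-checked with simple arithmetic. The sketch below assumes that CLK_PERIOD is expressed in simulator time units, i.e. in multiples of the first field of SIM_TIMESCALE; the testbench source is not shown, so this interpretation is an assumption. The helper names are hypothetical.

```python
import re
from fractions import Fraction

# Seconds per unit suffix, kept exact with fractions.
UNIT_SECONDS = {
    "s": Fraction(1),
    "ms": Fraction(1, 10**3),
    "us": Fraction(1, 10**6),
    "ns": Fraction(1, 10**9),
    "ps": Fraction(1, 10**12),
    "fs": Fraction(1, 10**15),
}


def time_unit_seconds(timescale):
    """Return the time unit (first field of '<unit>/<precision>') in seconds."""
    unit_field = timescale.split("/")[0].strip()
    match = re.fullmatch(r"(\d+)\s*(fs|ps|ns|us|ms|s)", unit_field)
    if match is None:
        raise ValueError(f"unsupported timescale: {timescale!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def clock_frequency_hz(timescale, clk_period_units):
    """Clock frequency for a period given in simulator time units."""
    return 1 / (time_unit_seconds(timescale) * clk_period_units)


if __name__ == "__main__":
    # 10 units of 100 ps give a 1 ns period, i.e. 1 GHz.
    print(clock_frequency_hz("100ps/1ps", 10))  # 1000000000
```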
The power analysis handler generates the targets automatically; only the path of the netlist within the VCD file must be specified.
    PWR = pt # PrimeTime tool
    PWR_PARALLEL = 5 # Limit for license availability
    $(foreach t,$(SYN_TARGETS),\
      $(eval PWR_$t_NETLIST_PATH = tb_ecc_soc/soc/ecc_core))
To prevent excessive disk usage on the host, the Limited Power-Simulation workflow has been used, limiting the number of VCD files on disk to five.
    WORKFLOW = limit-pwr-sim
    LIMIT_PWR_SIM_FILES = 5
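The idea of bounding the number of VCD files on disk can be sketched with a counting semaphore. This is not how the framework implements the Limited Power-Simulation workflow, which is driven by GNU Make; it is only a hypothetical illustration of the scheduling constraint, in which a simulation may start only when a file slot is free and the slot is released once the power analysis has consumed and deleted the VCD.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_limited(targets, limit=5, workers=8):
    """Process all targets while keeping at most `limit` VCD files 'on disk'."""
    slots = threading.BoundedSemaphore(limit)  # one slot per allowed VCD file
    lock = threading.Lock()
    on_disk, peak, done = set(), 0, []

    def job(target):
        nonlocal peak
        slots.acquire()  # wait until a VCD slot is free
        try:
            with lock:
                on_disk.add(target)  # the simulation writes its VCD file
                peak = max(peak, len(on_disk))
            # ... the power analysis would read the VCD file here ...
            with lock:
                on_disk.discard(target)  # the VCD file is deleted
                done.append(target)
        finally:
            slots.release()  # free the slot for the next simulation

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, targets))
    return peak, done


if __name__ == "__main__":
    names = [f"sim{i}" for i in range(24)]
    peak, done = run_limited(names, limit=5)
    print(peak <= 5, len(done))
```

The trade-off is the usual one: a lower limit saves disk space but may leave simulation jobs waiting for a power analysis to finish.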
4.3. Performance Evaluation Results
A designer with good knowledge of the GNU Make syntax takes just 30 min to set up the entire workflow, including the organisation of the RTL, the software sources, the technology synthesis library, and all the scripts and constraint files. After that, the framework can be invoked with make -j to parallelise the workload. At the end of the workflow, which took ≃ 12 h, the designer finds all the files required to evaluate the performance of the architecture in the output directory of the flow recipe. In particular, the output folders of the synthesis targets contain the synthesis reports, the simulation output folders contain reports on the latency of the operations, and the power analysis output folders contain the power report and the power trace of each simulation. Table 1 reports the results for the different PM architectures.
We used the simulation-based approach to assess the resistance to SPA attacks. Figure 14 shows some power traces for each accelerator configuration. The four plots on the left side of Figure 14 were acquired for the key k1 and the four plots on the right side for the key k3; in both columns, the plots correspond, from first to last, to the DA, DAA, MDAA, and RMDAA architectures. As already stated in [14], the information leakage of the DA architecture allows the whole private key to be easily retrieved, whereas for the DAA architecture some part of the key can still be guessed. The power traces of the MDAA and RMDAA architectures are extremely similar, with no substantial differences between them.
To further investigate the MDAA and RMDAA architectures, we measured and plotted in Figure 15 the average power consumption during the computation of the PM operation. The power is averaged over the intervals corresponding to the processing of each ECC key bit. While the MDAA architecture presents the same power consumption profile each time it is used with the same key, the RMDAA hides the key information thanks to the randomness added by the random Z coordinate of the projective representation.
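The per-bit averaging can be sketched as follows. The example assumes that the power trace is available as a list of samples and that every key bit occupies the same number of samples, which is plausible for fixed-latency algorithms but is an assumption here. The function name and the sample values are hypothetical and are not measured data.

```python
def average_power_per_bit(trace, samples_per_bit):
    """Average a power trace over consecutive, equally long key-bit intervals."""
    if samples_per_bit <= 0:
        raise ValueError("samples_per_bit must be positive")
    n_bits = len(trace) // samples_per_bit  # trailing partial interval is dropped
    return [
        sum(trace[i * samples_per_bit:(i + 1) * samples_per_bit]) / samples_per_bit
        for i in range(n_bits)
    ]


if __name__ == "__main__":
    # Made-up samples, two per key bit.
    demo_trace = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0]
    print(average_power_per_bit(demo_trace, 2))  # [2.0, 2.0, 6.0]
```

Comparing the resulting per-bit profiles across repeated runs with the same key is one way to see whether the profile is repeatable, as for MDAA, or changes from run to run, as for RMDAA.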
The results presented in this work on the characterisation of the ECC Core are not intended to be complete and exhaustive; rather, the use-case aims to show how the proposed framework accelerates the verification, validation, and characterisation of SoC design workflows. In particular, the results obtained in previous work [14] required a few days of development and verification of the simulation and analysis environment itself. Here, the correct use of the various third-party tools needed to complete the workflow is ensured and automated by the proposed framework, which allowed the environment to be set up in around 30 min, plus the ≃ 12 h of actual flow execution reported above. Using the proposed framework, we can continue the SCA assessment of the ECC Core, in particular against DPA attacks, which could require thousands of simulations and power analyses that will be completely automated. In an ordinary design flow, all the scripts for simulations and analyses must be carefully modified to adapt them to the new requirements. With our framework, instead, changing only a few properties in the flow recipe yields very different design and verification environments.