Exploration of the High-Efficiency Hardware Architecture of SM4-CCM for IoT Applications

Chen, Rui; Li, Bing

doi:10.3390/electronics11060935

by Rui Chen ¹ and Bing Li ²,*

¹ School of Computer and Software Engineering, Nanjing Vocational University of Industry Technology, Nanjing 210023, China

² School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China

* Author to whom correspondence should be addressed.

Electronics 2022, 11(6), 935; https://doi.org/10.3390/electronics11060935

Submission received: 15 February 2022 / Revised: 13 March 2022 / Accepted: 15 March 2022 / Published: 17 March 2022

(This article belongs to the Special Issue Digital Hardware Architectures: Systems and Applications)

Abstract

The widespread use of the internet of things (IoT) is due to the value of the data collected by IoT devices. These IoT devices generate, process, and exchange large amounts of safety-critical or privacy-sensitive data. Before transmission, the data should be protected against information leakage and data theft. Deploying authenticated encryption with associated data (AEAD) algorithms on IoT devices ensures data confidentiality and integrity. However, AEAD algorithms are computationally intensive, while IoT devices are resource constrained or even battery powered. Therefore, a low-cost, low-power, and high-efficiency method of implementing an AEAD algorithm on resource-constrained IoT devices is required. The SM4-CCM algorithm, introduced in RFC 8998, is selected as the AEAD algorithm to address this problem. Algorithms similar to SM4-CCM (e.g., SM4 and AES-CCM) provide many architectural design references, but it is challenging to decide which architecture is the most suitable for SM4-CCM. In order to find the most efficient SM4-CCM hardware architecture, a design space exploration method is proposed. Firstly, the SM4-CCM algorithm is divided into five layers, and three candidate architectures are provided for each layer. Secondly, 63 design schemes for SM4-CCM are constructed by combining candidate architectures from each layer. Finally, the experimental results of all schemes are compared and analyzed in batch to identify the most efficient one. Under TSMC 90 nm technology, the experimental results of the identified scheme show a throughput of 199.99 Mbps, a power consumption of 1.625 mW, and an area of 14.6 K gates. As a proof of concept, an implementation of this scheme on an FPGA board is also presented.

Keywords:

internet of things (IoT); authenticated encryption; AEAD; SM4; CCM; design space exploration; hardware architecture

1. Introduction

The internet of things (IoT) has brought great convenience to daily life. Its applications have spread across many fields, such as healthcare, industry, and smart cities [1]. IoT is widely used mainly because of the value of the data collected and generated. In healthcare applications, IoT devices collect patient health statuses and send this information to a cloud server for health monitoring. In industrial applications, the data collected by IoT devices can be used to monitor equipment status remotely and, with the help of artificial intelligence, to predict maintenance needs. IoT devices may generate, process, and exchange large amounts of safety-critical data or privacy-sensitive information and are thus appealing targets for various attacks [2]. The collected data or information should be protected against attacks on privacy and confidentiality [3].

Authenticated encryption with associated data (AEAD) algorithms provide confidentiality and integrity protection in a single operation [4]. Deploying an AEAD algorithm on an IoT device and applying authenticated encryption before any data are sent protects the confidentiality and integrity of the data at the data-producer side and guards against information leakage and data theft during transmission. SM4-CCM, which stands for the SM4 block cipher in counter with CBC-MAC mode, is one of the new AEAD algorithms introduced in RFC 8998 [5]. The SM4 block cipher used in SM4-CCM is the preferred block cipher for the Chinese market. It became an ISO (International Organization for Standardization) standard in June 2021 [6].

The software implementation of a computing-intensive cryptographic algorithm, such as SM4-CCM, generally limits performance and energy efficiency [7]. In comparison, dedicated cryptographic hardware offers high performance, low power consumption, and high efficiency, and requires fewer ROM/RAM resources. Moreover, dedicated cryptographic hardware extends the battery life of battery-powered IoT devices [4] and provides a higher level of security [8]. These advantages motivate us to develop dedicated hardware architectures for SM4-CCM that are suitable for resource-constrained IoT devices.

This work aims to develop an efficient, dedicated hardware architecture of SM4-CCM for resource-constrained applications. The main contributions of this paper are listed as follows.

(1) Algorithms similar to SM4-CCM (e.g., SM4 and AES-CCM) provide many architectural design references, but it is challenging to decide which architecture is the most suitable for SM4-CCM. Therefore, for IoT applications, this paper proposes a design-space-exploration-based approach that identifies the most efficient hardware architecture of the SM4-CCM algorithm.
(2) To realize the design space exploration, we divide the SM4-CCM algorithm into five layers and provide three candidate architectures for each layer. The candidate architectures of different layers are combined to construct 63 design schemes for SM4-CCM. The design scheme with the best efficiency is found through batch experimental comparison and analysis.

2. Related Work

Relevant prior work includes studies dedicated to the hardware of SM4, the hardware of a block cipher in CCM mode, and design space exploration of SM4 or CCM.

2.1. Dedicated Hardware of SM4

Ref. [9] proposed a dual-cascade architecture to improve the throughput, which achieved up to 1.9 Gbps at 286 MHz on FPGA. Refs. [10,11,12,13] proposed compact hardware implementations of SM4 by reducing the area cost of the Sbox. Refs. [14,15] merged multiple block ciphers into one hardware implementation, such as AES/SM4 [14] and Camellia/SM4/AES [15]. Refs. [16,17,18,19] presented the designs of special operation modes of SM4, including XTS [19], GCM [17,18], and CBC [16]. The above-mentioned papers provide us with a reference for the hardware design of the SM4 algorithm. Similar to prior work, this paper also presents an implementation of a special operation mode, SM4-CCM. However, unlike prior works, we use the design-space-exploration method to find the optimal hardware architecture for resource-constrained IoT devices.

2.2. Dedicated Hardware of CCM

Ref. [20] proposed a sequential AES-CCM architecture for IEEE 802.16e that divides the composite field logic of SubBytes into two stages to process the CBC-MAC and CTR modes concurrently with one round of operation. Ref. [21] proposed a parallel AES-CCM architecture using two AES cores: one in CBC-MAC mode and another in CTR mode. Refs. [22,23] proposed a hardware architecture for AES-CCM that uses only one AES core and operates in the CBC-MAC mode and CTR mode in turn. Refs. [24,25] proposed network-on-chip (NoC)-based hardware architectures for AES-CCM that achieve high throughput at a large area cost. The above-mentioned papers provide us with a reference for the hardware design of the CCM algorithm. Unlike prior works, this paper studies the hardware architecture for SM4-CCM; as far as we know, it is the first to do so. Moreover, various hardware architectures for SM4-CCM are explored to find the most efficient one.

2.3. Design Space Exploration of SM4 or CCM

Ref. [26] studied and evaluated the impact of the number of round functions (RFs) of SM4 on power, energy, area, and speed to find the optimal design. Ref. [27] studied the impact of fully pipelined and 1/4/8/16-RF implementations of SM4 on area and speed. Refs. [22,28] studied the impact of the number of AES cores on area and speed. Ref. [22] presented three architectures for AES-CCM, using one or two AES cores. Ref. [28] presented and evaluated two architectures for AES-CCM: one in which the CBC-MAC and CTR modes work in parallel with two separate AES modules, and one in which they work sequentially with one AES module. The above-mentioned papers provide us with a reference for design space exploration. Unlike prior works, this paper studies not only the impact of the number of SM4 cores on area and speed but also each layer of the SM4-CCM algorithm, and it explores 63 design schemes to find the most efficient hardware architecture.

3. Background

3.1. Notations

This section gives an overview of the SM4 and CCM algorithms. Table 1 lists the notations used for SM4 and CCM.

3.2. Brief Introduction of SM4

SM4 is an unbalanced Feistel network with an 8-bit Sbox, a 128-bit key, and a 128-bit block size. Compared with the AES block cipher, which NIST standardized in 2001, the SM4 block cipher has the following characteristics, which make it more suitable for resource-constrained scenarios. First, the security of SM4 is equivalent to that of AES-128. Second, the encryption and decryption algorithms of SM4 have the same structure. Third, the same Sbox is used for both encryption and decryption. Fourth, SM4 requires only four Sboxes (each with 256 × 8 bits) in one round, while AES requires 16. In the following, we introduce SM4 briefly. A detailed description of this algorithm can be found in Ref. [29].

SM4 consists of encryption, decryption, and key expansion algorithms. The encryption and decryption algorithms perform 32 iterations with different round keys to process one 128-bit data block. Their structures are identical, except that the round keys are used in reverse order during decryption. As shown in the left part of Figure 1, the encryption algorithm consists of 32 rounds and one reverse transform. As shown in the right part of Figure 1, the round function processes a 128-bit data block with a 32-bit round key (rki) in seven steps. It spans three layers: Sbox, non-linear transform, and round function.
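The round structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design of this paper: the rotation amounts of the linear transform follow the public SM4 specification, while `toy_sbox` is a hypothetical stand-in (the real 256-entry SM4 Sbox is omitted for brevity) and the demo round keys are arbitrary.

```python
from typing import Callable, List, Sequence

MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def tau(word: int, sbox: Callable[[int], int]) -> int:
    """Non-linear transform: pass each of the four bytes through the Sbox."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= (sbox((word >> shift) & 0xFF) & 0xFF) << shift
    return out


def linear_l(b: int) -> int:
    """Linear diffusion of the SM4 encryption round."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def crypt_block(block: Sequence[int], round_keys: Sequence[int],
                sbox: Callable[[int], int]) -> List[int]:
    """Apply one round per round key, then the reverse transform."""
    x = list(block)
    for rk in round_keys:
        t = linear_l(tau(x[1] ^ x[2] ^ x[3] ^ rk, sbox))
        x = [x[1], x[2], x[3], x[0] ^ t]
    return x[::-1]  # reverse transform


def toy_sbox(byte: int) -> int:
    """Hypothetical placeholder permutation, NOT the SM4 Sbox."""
    return (byte * 7 + 3) & 0xFF


# Arbitrary demo values (assumptions for illustration only).
demo_keys = [(0x9E3779B9 * (i + 1)) & MASK32 for i in range(32)]
plain = [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210]
cipher = crypt_block(plain, demo_keys, toy_sbox)
recovered = crypt_block(cipher, demo_keys[::-1], toy_sbox)
```

Because the structure is a Feistel network, running the same routine with the round keys reversed recovers the input for any Sbox, which is why encryption and decryption can share one datapath.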

The key expansion algorithm is similar to the encryption algorithm, as shown in Figure 2. Before the 32 iterations, the 128-bit master key (MK) is pre-processed by XORing it with the family key (FK, a 128-bit constant). As shown in the right part of Figure 2, the round function processes a 128-bit data block with a 32-bit constant key (CKi) in seven steps. It also spans three layers: Sbox, non-linear transform, and round function.
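The constants used by the key expansion can be illustrated as follows. This is a sketch under the assumption that the FK words and the CK generation rule are as given in the public SM4 specification; the function names are hypothetical and the iterative part of the key expansion is not shown.

```python
# Family key FK as four 32-bit words (values from the SM4 specification).
FK = (0xA3B1BAC6, 0x56AA3350, 0x677D9197, 0xB27022DC)


def constant_keys() -> list:
    """Generate CK_0..CK_31: byte j of CK_i is ((4*i + j) * 7) mod 256."""
    cks = []
    for i in range(32):
        word = 0
        for j in range(4):
            word = (word << 8) | (((4 * i + j) * 7) % 256)
        cks.append(word)
    return cks


def preprocess_master_key(mk_words):
    """XOR the four 32-bit words of the master key with FK."""
    return [m ^ f for m, f in zip(mk_words, FK)]


ck = constant_keys()
k0_to_k3 = preprocess_master_key(
    [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210])  # example key, an assumption
```

Since the CK values follow a simple arithmetic rule, a hardware design may either store them or compute them on the fly; which option a given implementation uses is not stated here.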

Encryption of a 128-bit data block needs both encryption and key expansion algorithms. Thus, from the perspective of the encryption procedure of the 128-bit data block, four layers are crossed, which are (1) Sbox, (2) non-linear transform, (3) round function and (4) encryption and key expansion.

3.3. Brief Introduction of CCM

Counter with CBC-MAC (CCM) is one of the operation modes of a block cipher and provides both data confidentiality and authentication by combining the counter (CTR) mode with the cipher block chaining message authentication code (CBC-MAC) mode. The CBC-MAC mode is used to calculate a MAC over all inputs, including the nonce, additional authenticated data (AAD), and plaintext. The CTR mode is used to encrypt the plaintext. Figure 3 gives an example operation flow of CCM. Detailed descriptions of CCM can be found in Ref. [30].
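The two building blocks of CCM can be sketched generically over any block-encryption function. The sketch below is deliberately incomplete: the formatting of the first MAC block, the AAD length encoding, the counter-block layout, and the final encryption of the tag are defined by the CCM specification and omitted here. `toy_cipher` is a hypothetical stand-in built on SHA-256, not SM4.

```python
import hashlib
from typing import Callable, List

BLOCK = 16  # bytes in one 128-bit block
BlockCipher = Callable[[bytes], bytes]


def toy_cipher(key: bytes) -> BlockCipher:
    """Hypothetical stand-in for block encryption under a fixed key (NOT SM4)."""
    return lambda block: hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_mac(encrypt: BlockCipher, blocks: List[bytes]) -> bytes:
    """Chain every 16-byte block through the cipher; the last output is the MAC."""
    state = bytes(BLOCK)
    for block in blocks:
        state = encrypt(xor_bytes(state, block))
    return state


def ctr_xor(encrypt: BlockCipher, counter_blocks: List[bytes], data: bytes) -> bytes:
    """XOR data with the keystream made by encrypting successive counter blocks."""
    out = bytearray()
    for i, ctr in enumerate(counter_blocks):
        chunk = data[i * BLOCK:(i + 1) * BLOCK]
        out += xor_bytes(chunk, encrypt(ctr))
    return bytes(out)


enc = toy_cipher(b"demo-key")
message = b"sixteen byte msg" + b"and a partial"
counters = [i.to_bytes(BLOCK, "big") for i in range(1, 3)]  # simplified counters
ciphertext = ctr_xor(enc, counters, message)
mac = cbc_mac(enc, [message[:BLOCK], message[BLOCK:].ljust(BLOCK, b"\x00")])
```

Both modes call only the encryption direction of the block cipher, which is why a CCM datapath never needs a block-decryption unit.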

Based on the above analysis, we find that, to design and implement an efficient hardware architecture for SM4-CCM, five layers should be designed and explored, which are listed as follows: (1) architecture of Sbox, (2) architecture of non-linear transform, (3) architecture of round function, (4) architecture of SM4 encryption and key expansion, (5) architecture of CBC-MAC mode and CTR mode.

4. Design Space Exploration

This section uses a design-space-exploration approach to identify the most efficient hardware architecture for SM4-CCM.

4.1. Our Method

As mentioned earlier, the SM4-CCM algorithm can be divided into five layers, each of which can be implemented using several different hardware architectures. We cannot simply decide which combination of architectures is the most efficient. Therefore, in this paper, we propose a design space exploration approach to find such an optimal design scheme, as shown in Figure 4. First, three candidate architectures are provided for each layer, as shown in Table 2 (detailed descriptions of these candidate architectures are given in the following sections). Then, candidate architectures selected from each layer are combined to construct multiple SM4-CCM design schemes. Finally, the performance of each design scheme is evaluated and compared to identify the most efficient one.

The next few subsections give a detailed description of the candidate architectures for each layer.

4.2. Sbox

The efficiency of the SM4 hardware implementation in terms of power consumption, area, and throughput mainly depends on the implementation of the Sbox [11]. Three candidate architectures are provided for the Sbox layer, as shown in the second row of Table 2. The LUT (look-up table)-based Sbox uses a 256 × 8-bit ROM to achieve byte substitution. The composite field arithmetic (CFA)-based Sbox computes the substitution with arithmetic over composite fields instead of a table. The CFA-based Sbox proposed in Ref. [11] is adopted in this paper. The decoder–switch–encoder (DSE)-based Sbox is reported to be the most power-efficient one [31]. The DSE Sbox consists of an 8-to-256 decoder that converts the input byte to a one-hot representation, a 256-to-256 permutation block that connects the decoder and encoder one-to-one, and a 256-to-8 encoder that converts the one-hot 256-bit representation back to a byte. We refer to Ref. [32] for the design and implementation of the DSE-based Sbox.
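The decoder, switch, and encoder stages can be modelled behaviourally to show that a DSE Sbox computes the same function as a look-up table. This is only a software sketch of the idea; `TOY_TABLE` is a hypothetical permutation used for illustration and does not contain the SM4 Sbox values.

```python
def decode_8_to_256(byte: int) -> int:
    """8-to-256 decoder: return a 256-bit one-hot word with bit `byte` set."""
    return 1 << byte


def switch(one_hot: int, table: list) -> int:
    """256-to-256 permutation: route input wire i to output wire table[i]."""
    out = 0
    for i in range(256):
        if (one_hot >> i) & 1:
            out |= 1 << table[i]
    return out


def encode_256_to_8(one_hot: int) -> int:
    """256-to-8 encoder: return the index of the single set bit."""
    return one_hot.bit_length() - 1


def dse_sbox(byte: int, table: list) -> int:
    """Compose decoder, switch, and encoder into one byte substitution."""
    return encode_256_to_8(switch(decode_8_to_256(byte), table))


# Hypothetical permutation for illustration only; NOT the SM4 Sbox contents.
TOY_TABLE = [(x * 7 + 3) % 256 for x in range(256)]
```

In this model only one of the 256 internal wires is active for any input, which is consistent with the low switching activity attributed to this Sbox style.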

4.3. Non-Linear Transform (NLT)

The input of this layer is a 32-bit word. Each byte of the word is substituted through the Sbox, producing a new 32-bit word. The delay of this layer is determined by the number of Sboxes, as shown in the third row of Table 2. The candidate architecture NLT1 needs four clock cycles to complete the substitution of a 32-bit word. It also needs three additional byte registers to store temporary bytes, as shown in Figure 5a. The candidate architecture NLT2 substitutes a half-word in each clock cycle. Thus, the delay of the NLT is reduced to two clock cycles, as shown in Figure 5b. NLT4 (Figure 5c) substitutes a whole word at once and needs only one clock cycle. However, the more Sboxes are used, the larger the area of this layer.
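The relationship between the number of Sboxes and the delay of this layer can be mimicked with a small cycle-counting model. It is a behavioural sketch only; `toy_sbox` is a hypothetical placeholder rather than the SM4 Sbox, and the function name `nlt` is chosen for this example.

```python
def toy_sbox(byte: int) -> int:
    """Hypothetical placeholder permutation, NOT the SM4 Sbox."""
    return (byte * 7 + 3) & 0xFF


def nlt(word: int, sbox, num_sboxes: int):
    """Substitute a 32-bit word with 1, 2, or 4 Sboxes; return (result, cycles)."""
    if num_sboxes not in (1, 2, 4):
        raise ValueError("num_sboxes must be 1, 2, or 4")
    data = word.to_bytes(4, "big")
    out = bytearray()
    cycles = 0
    for start in range(0, 4, num_sboxes):
        # In each clock cycle, num_sboxes bytes are substituted in parallel.
        out += bytes(sbox(b) for b in data[start:start + num_sboxes])
        cycles += 1
    return int.from_bytes(out, "big"), cycles


word = 0x11223344
result_nlt1, cycles_nlt1 = nlt(word, toy_sbox, 1)
result_nlt2, cycles_nlt2 = nlt(word, toy_sbox, 2)
result_nlt4, cycles_nlt4 = nlt(word, toy_sbox, 4)
```

All three variants produce the same word; they differ only in how many clock cycles, and how many Sbox instances, they spend on it.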

4.4. Round Function (RF)

The RFs of the key expansion and encryption/decryption algorithms are almost the same. Thus, resource sharing can be used to reduce the area, as shown in Figure 6a. However, if performance is preferred over area, two separate RFs can be used for the encryption and key expansion algorithms, as shown in Figure 6b,c. The names and descriptions of each candidate architecture of this layer are given in the fourth row of Table 2.

4.5. SM4

Three candidate architectures for this layer are provided according to whether the key expansion is online or offline and according to the number of RFs. The names and descriptions of each candidate architecture are listed in the fifth row of Table 2. The first candidate (OffKS) needs only one shared RF with one 128-bit state register. The shared RF works for both key expansion and encryption. In this candidate architecture, the key expansion is performed offline, that is, before encryption starts. The generated round keys need to be stored in 32 32-bit registers, as shown in Figure 7a. The advantages of this architecture are that encryption never has to wait for round keys and that only one shared RF is needed. The disadvantage is that 32 additional 32-bit registers are required. The second candidate (OnKP) needs two RFs with two 128-bit state registers. One of the RFs works for key expansion, and the other works for encryption. The RF of key expansion generates a 32-bit round key in each round and sends it immediately to the RF of encryption. Since this round key is consumed immediately, it is unnecessary to save it. Thus, unlike OffKS, this option does not need the 32 32-bit registers. The third candidate (OnKS) needs only one shared RF with two 128-bit state registers. The shared RF works for both key expansion and encryption. However, unlike the first candidate, the key expansion is online: the RF works for key expansion and encryption in turn. When the RF works for key expansion, a 32-bit round key is generated and stored in the state register. This round key is then read directly from the state register when the RF works for encryption.
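The storage trade-off between the three SM4 candidates can be tallied directly from the register counts given above. The sketch below counts only the state and round-key registers named in the text; control logic, counters, and the RFs themselves are not modelled, so the numbers are not gate counts. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sm4Candidate:
    name: str
    round_functions: int
    state_registers: int      # number of 128-bit state registers
    round_key_registers: int  # number of 32-bit round-key registers

    @property
    def register_bits(self) -> int:
        """Total flip-flop bits in state and round-key storage."""
        return 128 * self.state_registers + 32 * self.round_key_registers


CANDIDATES = [
    Sm4Candidate("OffKS", round_functions=1, state_registers=1, round_key_registers=32),
    Sm4Candidate("OnKP", round_functions=2, state_registers=2, round_key_registers=0),
    Sm4Candidate("OnKS", round_functions=1, state_registers=2, round_key_registers=0),
]

bits_by_name = {c.name: c.register_bits for c in CANDIDATES}
```

Under this simplified count, offline key expansion stores 1152 register bits against 256 for either online variant, which illustrates why the round-key registers dominate the storage of OffKS.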

4.6. SM4-CCM

The SM4-CCM algorithm needs SM4 modules to work in the CBC-MAC and CTR modes. Figure 8 and the sixth row of Table 2 present the candidate hardware architectures for SM4-CCM. The first candidate uses only one SM4 module, which works in the CBC-MAC and CTR modes in turn. The second candidate uses two SM4 modules to run the CBC-MAC and CTR modes in parallel. The third candidate uses one SM4 module and one RF. The SM4 module works in the CBC-MAC mode with offline KEYEXP and encryption, and the RF with a state register works in the CTR mode. In the third candidate, the SM4 module has two tasks: one is to generate the 32 32-bit RKs before the CBC-MAC and CTR modes start, and the other is to work in the CBC-MAC mode. The SM4 module and the RF share the generated RKs.
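A rough way to see the difference between running the two modes in turn and running them in parallel is to count block-cipher invocations per message. The model below follows the general CCM structure (one extra MAC block for the formatted header and one extra counter block for the tag) and ignores key expansion, control overhead, and per-round timing, so it is an assumption-laden estimate and not the cycle count measured in this paper. All function names are hypothetical.

```python
def cipher_calls(payload_blocks: int, aad_blocks: int):
    """Return (cbc_mac_calls, ctr_calls) for one CCM message.

    aad_blocks is the number of 16-byte blocks occupied by the formatted AAD.
    """
    cbc_mac = 1 + aad_blocks + payload_blocks  # header block, AAD blocks, payload blocks
    ctr = 1 + payload_blocks                   # one block for the tag, one per payload block
    return cbc_mac, ctr


def sequential_slots(payload_blocks: int, aad_blocks: int) -> int:
    """One cipher core alternating between CBC-MAC and CTR work."""
    return sum(cipher_calls(payload_blocks, aad_blocks))


def parallel_slots(payload_blocks: int, aad_blocks: int) -> int:
    """Two cipher datapaths working concurrently; the longer chain dominates."""
    return max(cipher_calls(payload_blocks, aad_blocks))


example = (sequential_slots(4, 1), parallel_slots(4, 1))
```

For long payloads the sequential count approaches twice the parallel count, which matches the intuition that a second datapath buys throughput at the cost of area.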

4.7. Combination of Candidate Architectures

We chose one candidate architecture from each layer and combined them to construct SM4-CCM: $3^{5} = 243$ design schemes were constructed. To reduce the workload, we eliminated some schemes. First, the candidate architectures for the round function were merged into SM4 because the round function used in the three candidate architectures of SM4 is fixed. Thus, the number of schemes was reduced to $3^{4} = 81$. Second, the candidate architecture GB2SM4 of SM4-CCM is special. In GB2SM4, the candidate architectures of the two SM4 modules were also fixed: one was OffKS and the other was ERF. Therefore, the number of schemes was reduced to 63: $2 \times 3 \times 3 \times 3$ (S1SM4 and P2SM4) $+\ 1 \times 3 \times 3$ (GB2SM4) $= 63$. Figure 9 presents the naming of these design schemes.
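The count of 63 can be reproduced by enumerating the combinations. The prefixes c, k, n, and s follow the scheme names used in the results (for example c0k1n2s2); mapping GB2SM4 to `c2` and writing its schemes without a `k` field are assumptions made for this sketch, since the exact labels are defined in Figure 9.

```python
from itertools import product

CCM = {"c0": "S1SM4", "c1": "P2SM4", "c2": "GB2SM4"}  # c2 label is an assumption
SM4 = {"k0": "OffKS", "k1": "OnKP", "k2": "OnKS"}
NLT = {"n0": "NLT1", "n1": "NLT2", "n2": "NLT4"}
SBOX = {"s0": "LUT", "s1": "CFA", "s2": "DSE"}


def design_schemes() -> list:
    """Enumerate scheme labels after the two pruning steps."""
    schemes = []
    # S1SM4 and P2SM4: free choice of SM4, NLT, and Sbox candidates.
    for c, k, n, s in product(("c0", "c1"), SM4, NLT, SBOX):
        schemes.append(c + k + n + s)
    # GB2SM4: the SM4 candidate is fixed, so only NLT and Sbox vary.
    for n, s in product(NLT, SBOX):
        schemes.append("c2" + n + s)  # hypothetical label format
    return schemes


all_schemes = design_schemes()
```

Enumerating the labels this way also makes it straightforward to drive batch synthesis and simulation runs from a single list.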

5. Experimental Results and Analysis

The experimental flow consists of the following four steps, as shown in Figure 10.

First, we describe each layer of SM4-CCM in parameterized Verilog HDL, verify the functionality of each scheme with the Synopsys VCS simulator, and estimate the throughput.
Second, we synthesize each scheme to a gate-level netlist under TSMC 90 nm technology with the Synopsys Design Compiler, estimate the area cost, and calculate the area efficiency.
Third, a gate-level simulation of each scheme is run to generate a backward switching activity interchange format (SAIF) file containing the switching activity of individual nets and signals.
Finally, Synopsys PrimeTime-PX is used to estimate the gate-level average power consumption and the power efficiency.

The optimal design scheme is then identified by comparing and analyzing the experimental results.

5.1. Throughput

The throughput is calculated according to Equation (1). The number of bits includes a nonce, AAD, and plaintext. Additionally, the number of clock cycles is counted from loading MK to reading out the CCM tag.

$$\text{throughput} = \frac{\text{number of bits}}{\text{number of clock cycles}} \times \text{frequency} \qquad (1)$$

Figure 11 compares the throughput of the 63 design schemes from four perspectives to show the impact of each layer's candidate architecture on throughput: design schemes are grouped by (1) the same CCM candidate architecture, (2) the same SM4 candidate architecture, (3) the same NLT candidate architecture, and (4) the same Sbox candidate architecture. As shown in Figure 11a, the schemes with c1 (P2SM4, CCM with two separate SM4 modules) achieve higher throughput than the others. As shown in Figure 11b, the schemes with k1 (OnKP) obtain higher throughput than those with k0 (OffKS) and k2 (OnKS), and the schemes with k2 (OnKS) obtain the lowest throughput. Therefore, we conclude that SM4 using online key expansion with two separate RFs obtains higher throughput than the other candidates. As shown in Figure 11c, the schemes with n2 (NLT4) obtain higher throughput than the schemes with n0 (NLT1) and n1 (NLT2). In NLT4, the NLT uses four Sboxes, while in NLT1 and NLT2, it uses only one and two Sboxes, respectively. Thus, we conclude that the throughput increases with the number of Sboxes. As shown in Figure 11d, schemes that differ only in the Sbox candidate architecture (s{0/1/2}) have the same throughput, because the architecture of the Sbox has no impact on throughput.
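As a rough consistency check on Equation (1), the figures reported for the identified scheme (199.99 Mbps at a 100 MHz clock) can be substituted back into it. This is a back-of-the-envelope rearrangement, not an additional measurement:

$$\frac{\text{number of bits}}{\text{number of clock cycles}} = \frac{\text{throughput}}{\text{frequency}} = \frac{199.99 \times 10^{6}\ \text{bit/s}}{100 \times 10^{6}\ \text{cycle/s}} \approx 2.0\ \text{bits per clock cycle}$$

In other words, that scheme spends on average about 64 clock cycles per 128 bits of input, with key loading and tag readout included in the cycle count.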

5.2. Area

We estimate the area cost in terms of gate count, which is the result of dividing the area reported with the Synopsys Design Compiler by the minimum area of a two-input NAND gate, as shown in Equation (2).

$$\text{gate count} = \frac{\text{area reported by DC}}{\text{area of NAND2}} \qquad (2)$$

Figure 12 compares the area of the 63 design schemes from four perspectives to show the impact of each layer's candidate architecture on area: design schemes are grouped by (1) the same CCM candidate architecture, (2) the same SM4 candidate architecture, (3) the same NLT candidate architecture, and (4) the same Sbox candidate architecture. Figure 12a shows that the schemes with c1 (P2SM4, CCM with two separate SM4 modules) obtain higher gate counts than the other schemes. As shown in Figure 12b, the schemes with k0 (OffKS) obtain higher gate counts than those with k1 (OnKP) and k2 (OnKS). Thus, we conclude that offline key expansion significantly impacts the area. As shown in Figure 12c, the schemes with n2 (NLT4) obtain higher gate counts than the schemes with n0 (NLT1) and n1 (NLT2); however, the difference between the schemes with n0 and those with n1 is tiny. As shown in Figure 12d, the schemes with s1 (CFA) obtain the lowest gate counts, while the schemes with s2 (DSE) obtain the highest.

5.3. Power

The gate-level power analysis reports the power consumption under the typical corner of the TSMC 90 nm library: the temperature is 25 °C, the supply voltage is 0.9 V, and the clock frequency is 100 MHz. Figure 13 compares the power of the 63 design schemes from four perspectives to show the impact of each layer's candidate architecture on power: design schemes are grouped by (1) the same CCM candidate architecture, (2) the same SM4 candidate architecture, (3) the same NLT candidate architecture, and (4) the same Sbox candidate architecture.

Figure 13a shows that the scheme c0k2n0s2 obtains the lowest power, about 80% lower than the highest (c1k0n2s1). The schemes with c0 (S1SM4, CCM with only one SM4 module) obtain lower power consumption than the other schemes. As shown in Figure 13b, the schemes with k0 (OffKS) obtain higher power than the schemes with k1 (OnKP) and k2 (OnKS). As shown in Figure 13c, the schemes with n0 (NLT1) obtain clearly lower power than those with n1 (NLT2) and n2 (NLT4). Figure 13d shows that the schemes with s2 (DSE) obtain much lower power than the schemes with s0 (LUT) and s1 (CFA).

5.4. Area Efficiency

Area efficiency is calculated by dividing the throughput by gate count, as shown in Equation (3).

$$\text{area efficiency} = \frac{\text{throughput}}{\text{gate count}} \qquad (3)$$

The area efficiency of various schemes is compared in Figure 14. The scheme c0k1n2s1 obtains the highest area efficiency, about six times higher than the lowest (c1k0n0s2).

5.5. Power Efficiency

Power efficiency is calculated by dividing the throughput by power consumption, as shown in Equation (4).

$$\text{power efficiency} = \frac{\text{throughput}}{\text{power}} \qquad (4)$$

The power efficiency of various schemes is compared in Figure 15. The scheme c0k1n2s2 obtains the highest power efficiency, about 6.8 times higher than the lowest (c1k0n0s0).
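Equations (3) and (4) can be evaluated with the figures reported for the identified scheme (199.99 Mbps, 14.6 K gates, 1.625 mW). The helper names and the choice of units (Gbps per kilo-gate and Gbps per mW) are assumptions made for this sketch.

```python
def area_efficiency(throughput_mbps: float, gate_count_k: float) -> float:
    """Equation (3): throughput / gate count, in Gbps per kilo-gate."""
    return throughput_mbps / 1000.0 / gate_count_k


def power_efficiency(throughput_mbps: float, power_mw: float) -> float:
    """Equation (4): throughput / power, in Gbps per mW."""
    return throughput_mbps / 1000.0 / power_mw


# Figures reported for scheme c0k1n2s2.
area_eff = area_efficiency(199.99, 14.6)
power_eff = power_efficiency(199.99, 1.625)
```

With these inputs the area efficiency is about 0.0137 Gbps/KGate and the power efficiency about 0.123 Gbps/mW, in line with the 0.013 and 0.12 listed for this scheme in Table 5.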

5.6. Area–Power–Delay Product

The area–power–delay product (APDP) is calculated to evaluate the overall performance, as shown in Equation (5). We use the number of clock cycles as the delay, counted from the start of key loading to the readout of the CCM tag.

$$\text{APDP} = \text{gate count} \times \text{power} \times \text{clock cycles} \qquad (5)$$

The APDP comparison of various schemes is demonstrated in Figure 16. The scheme c0k1n2s2 obtains the lowest APDP, about 92% lower than the highest one (c1k0n0s1). Thus, we conclude that the optimal scheme is c0k1n2s2.

5.7. Our Findings

The findings from the above experimental results are summarized in Table 3. The scheme c0k1n2s1 is the most area-efficient, and the scheme c0k1n2s2 is the most power-efficient. By the APDP metric, the best scheme is also c0k1n2s2. Thus, we conclude that the scheme combining S1SM4 + OnKP + NLT4 + DSE is the optimal scheme. Figure 17 presents the simplified hardware architecture of this scheme. Table 4 provides a breakdown of the power consumption of the optimal scheme c0k1n2s2: the dynamic, static and leakage power account for 12.55%, 84.26% and 3.19% of the total power, respectively.

5.8. Comparison

Our work is the first SM4-CCM-specific hardware implementation, and no directly relevant references are available for comparison yet. Therefore, only the references most similar to our work are compared, as shown in Table 5.

Ref. [13] implemented SM4 using 14 nm technology, with two round functions implementing the encryption algorithm and the key expansion algorithm of SM4. Owing to the advanced technology used, its throughput is much higher than ours. A more advanced technology also means a lower supply voltage, and power consumption decreases with the supply voltage. Nevertheless, our work has a much lower power consumption than Ref. [13]. That is because the DSE Sbox is used as our Sbox, as opposed to the CFA Sbox in Ref. [13]. The internal signals in the DSE Sbox have low switching activity, and therefore this type of Sbox consumes less power. Ref. [33] implemented SM4 using 180 nm technology with offline key expansion, so the expanded round keys had to be stored in RAM. However, the gate count reported in Ref. [33] did not include the area cost of this RAM and thus appears to be lower than ours. Our work uses online key expansion, so neither RAM nor additional registers are needed to store the expanded round keys. Moreover, our work has a higher throughput than Ref. [33], even though it runs at a lower clock frequency. The algorithmic complexity of SM4-CCM is much higher than that of SM4, which results in a higher gate count than in Refs. [13,33].

Ref. [25] implemented the AES-CCM algorithm based on an asynchronous network on chip (ANoC) using 65 nm technology. Its throughput is much higher than ours because of its parallel architecture. Nine AES cores are used in the parallel architecture, making its gate count about 3.5 times ours. Ref. [25] targets high-performance computing, whereas our work targets the internet of things, where both a small area and low power consumption are necessary. Ref. [34] implemented the AES-CCM algorithm under 65 nm technology, using an 8-bit datapath for the AES core and multiplexing this core between the CBC-MAC and CTR modes. Its area cost is small because of the 8-bit datapath, but its throughput is lower than ours. Ref. [35] implemented the AES-CCM algorithm using two AES cores and 90 nm technology and achieved a higher throughput than ours, but at a higher area cost. Although the throughput of our work is low compared to Refs. [25,35], it is sufficient for most IoT applications, since most IoT protocols require only a low data rate. Taking 6LoWPAN as an example, the data rate of this protocol is at most 250 Kbps [36]. Our work has a small area, low power consumption, and moderate throughput, and is thus more suitable for IoT applications.

Table 5. Experimental results of scheme c0k1n2s2 and comparison with related work.

Metrics	c0k1n2s2	Ref. [13]	Ref. [33]	Ref. [25]	Ref. [34]	Ref. [35]
Algorithm	SM4-CCM	SM4	SM4	AES-CCM	AES-CCM	AES-CCM
Technology (nm)	90	14	180	65	65	90
Frequency (MHz)	100	1000	185	100	149	264
Throughput (Mbps)	199.99	4040	185	8320	119.2	2690
Gate Count	14.6K	10.2K	3.06K	51.31K	8.1K	44.5K
Power (mW)	1.625	12.1	-	0.311	0.593	-
Norm. Power $^{a}$	1	307.72	-	0.37	0.7	-
Area Eff. (Gbps/KGate)	0.013	0.396	0.06	0.16	0.014	0.06
Power Eff. (Gbps/mW)	0.12	0.33	-	14.3	-	-

^a Scaling factor $S = \frac{\text{Technology of A}}{\text{Technology of B}}$; normalized power of $B = \frac{B \times S^{2}}{A}$; the ideal full-scaling model [37] is used.
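The normalization in footnote a can be checked numerically against the Norm. Power row of Table 5. The sketch assumes that A denotes this work (90 nm, 1.625 mW) and B the compared design, which is consistent with the value 1 listed for scheme c0k1n2s2; the function name is hypothetical.

```python
def normalized_power(power_b_mw: float, tech_b_nm: float,
                     power_a_mw: float = 1.625, tech_a_nm: float = 90.0) -> float:
    """Scale B's power to A's technology node and express it relative to A."""
    s = tech_a_nm / tech_b_nm          # scaling factor S
    return power_b_mw * s ** 2 / power_a_mw


ref13 = normalized_power(12.1, 14)    # Ref. [13], 14 nm
ref25 = normalized_power(0.311, 65)   # Ref. [25], 65 nm
ref34 = normalized_power(0.593, 65)   # Ref. [34], 65 nm
this_work = normalized_power(1.625, 90)
```

Rounded to two decimals, these evaluate to 307.72, 0.37, and 0.70, matching the normalized power values listed in the table.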

6. Proof of Concept

We implemented the optimal design scheme on a DE10-Standard FPGA board. The experimental environment is shown in Figure 18a. The test vectors were taken from RFC 8998 and stored in a memory initialization file (MIF), which was then used to initialize the ROM. The data in the ROM are read by a simple signal generator module, which provides the input data to the SM4-CCM module. The SM4-CCM module generates an interrupt signal after each 128-bit block of data is processed, which tells the generator to provide new data. After the data are processed, the CCM tag is output via the GPIO pins and captured by the Intel Quartus Signal Tap Logic Analyzer (LA). The captured waveform, shown in Figure 18d, demonstrates that the design scheme works on the FPGA board.

7. Conclusions

This paper uses a design-space-exploration approach to find the most efficient hardware architecture of the SM4-CCM algorithm for resource-constrained IoT applications. We first divide the SM4-CCM algorithm into five layers and provide three candidate architectures for each layer. Then, 63 design schemes are constructed by combining different candidate architectures from each layer. To find the most efficient of the 63 design schemes, multiple performance metrics of these schemes are evaluated. After comparison and analysis, the most efficient design scheme is identified. The experimental results of this scheme under TSMC 90 nm technology show that the throughput reaches 199.99 Mbps at 100 MHz, the area is about 14.6 K gates, and the power consumption is only 1.625 mW. These experimental results indicate that the identified scheme is well suited for IoT applications. Future work will study more dedicated hardware architectures of cryptographic algorithms for IoT applications.

Author Contributions

Conceptualization, R.C. and B.L.; methodology, R.C. and B.L.; software, R.C.; validation, R.C.; formal analysis, R.C.; investigation, R.C.; resources, R.C.; data curation, R.C.; writing—original draft preparation, R.C.; writing—review and editing, R.C.; visualization, R.C.; supervision, B.L.; project administration, R.C.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Fund of Nanjing Vocational University of Industry Technology (YK18-05-04, ZK19-04-03, YK20-05-07), the basic research (exploration) of science and technology in Shenzhen (JCYJ20170817115538543), and the basic research (layout) of science and technology in Shenzhen (JCYJ20170817115500476).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CCM	Counter with CBC-MAC mode
RF	Round Function
RK	Round Key
NLT	Non-Linear Transform
DSE	Decoder–Switch–Encoder
LUT	Look-Up Table
FPGA	Field-Programmable Gate Array


Figure 1. The operation flow of SM4 encryption algorithm.

Figure 2. The operation flow of SM4 key expansion algorithm.

Figure 3. The operation flow of CCM algorithm.

Figure 4. The proposed design-space-exploration method.

Figure 5. Candidate hardware architectures for NLT (NLT1 (a); NLT2 (b); NLT4 (c)).

Figure 6. Candidate hardware architectures for the RF (shared RF for encryption and key expansion (a); RF for encryption only (b); RF for key expansion only (c)).

Figure 7. Candidate architectures for the SM4 layer (OffKS (a); OnKP (b); OnKS (c)).

Figure 8. Candidate architectures for SM4-CCM (S1SM4 (a); P2SM4 (b); GB2SM4 (c)).

Figure 9. The naming of design schemes (e.g., c0k0n0s0).

Figure 10. Demonstration of our experimental flow.

Figure 11. Throughput comparison of 63 schemes from four perspectives: (a) comparison of design schemes with the same CCM candidate architecture, (b) comparison of design schemes with the same SM4 candidate architecture, (c) comparison of design schemes with the same NLT candidate architecture, and (d) comparison of design schemes with the same SBox candidate architecture (the higher the value, the better).

Figure 12. Area (in terms of gate count) comparison of 63 schemes from four perspectives: (a) comparison of design schemes with the same CCM candidate architecture, (b) comparison of design schemes with the same SM4 candidate architecture, (c) comparison of design schemes with the same NLT candidate architecture, and (d) comparison of design schemes with the same SBox candidate architecture (the lower the value, the better).

Figure 13. Power comparison of 63 schemes from four perspectives: (a) comparison of design schemes with the same CCM candidate architecture, (b) comparison of design schemes with the same SM4 candidate architecture, (c) comparison of design schemes with the same NLT candidate architecture, and (d) comparison of design schemes with the same SBox candidate architecture (the lower the value, the better).

Figure 14. Area efficiency of 63 schemes (the higher the value, the better).

Figure 15. Power efficiency of 63 schemes (the higher the value, the better).

Figure 16. APDP of 63 schemes (the lower the value, the better).

Figure 17. Simplified hardware architecture of the optimal design scheme: (a) SM4-CCM with only one SM4 core (S1SM4); (b) SM4 with online key expansion (OnKP); (c) Round function for encryption only (ERF); (d) Round function for key expansion only (KRF); (e) Non-linear transform with 4 DSE Sbox (NLT4).

Figure 18. Proof of concept of the SM4-CCM on a DE10-Standard FPGA Board: (a) Our physical environment: PC+DE10-Standard FPGA Board; (b) Test vectors from RFC 8998; (c) Block diagram of FPGA proof-of-concept; (d) Waveform captured by Logic Analyzer.

Table 1. Notations of the SM4 algorithm and CCM algorithm.

Notation	Meaning
Sbox	A 256-byte substitution table; each input byte is replaced by the corresponding byte from the table.
Non-Linear Transform (NLT)	Each byte of the input word (4 bytes) is substituted using the Sbox.
Round Function (RF)	The SM4 encryption and key expansion algorithms each consist of 32 iterations; the body of one iteration is called the round function.
Key Expansion	A routine that generates thirty-two 32-bit round keys from a 128-bit master key.
Inverse Transform	Reverses the order of the four input words: $A, B, C, D \Rightarrow D, C, B, A$.
Additional Authenticated Data (AAD)	Non-secret data that is authenticated but not encrypted, adding an integrity and authenticity check alongside the encrypted data.
CBC-MAC Mode	A block cipher mode of operation used to generate a message authentication code.
Counter Mode (CTR)	A counter-based block cipher mode of operation used to encrypt data.
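To make the CBC-MAC and CTR entries of Table 1 concrete, the sketch below chains both modes around a single 128-bit block function. It is a simplified illustration under stated assumptions: `toy_block_encrypt` is a keyed-hash stand-in and is not SM4, the 12-byte nonce is assumed, and the B0/AAD block formatting that a full CCM implementation requires is omitted. All function names are hypothetical.

```python
import hashlib

BLOCK = 16  # 128-bit block size, as in SM4


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a 128-bit block cipher (NOT SM4): truncated keyed hash."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_mac(key: bytes, blocks: list[bytes]) -> bytes:
    """CBC-MAC: each block is XORed into the running state, then encrypted."""
    state = bytes(BLOCK)
    for block in blocks:
        state = toy_block_encrypt(key, xor(state, block))
    return state


def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """CTR: encrypt counter blocks 1, 2, ... and XOR the keystream onto the data."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        counter = (i // BLOCK + 1).to_bytes(BLOCK - len(nonce), "big")
        out += xor(data[i:i + BLOCK], toy_block_encrypt(key, nonce + counter))
    return bytes(out)


def seal(key: bytes, nonce: bytes, plaintext: bytes) -> tuple[bytes, bytes]:
    """Return (ciphertext, tag); the MAC is masked with the counter-0 keystream."""
    padded = plaintext + bytes(-len(plaintext) % BLOCK)
    blocks = [padded[i:i + BLOCK] for i in range(0, len(padded), BLOCK)]
    mac = cbc_mac(key, blocks)
    s0 = toy_block_encrypt(key, nonce + bytes(BLOCK - len(nonce)))
    return ctr_xor(key, nonce, plaintext), xor(mac, s0)


key, nonce = b"k" * 16, bytes(12)
ciphertext, tag = seal(key, nonce, b"sensor reading 42")
print(ctr_xor(key, nonce, ciphertext))  # CTR decryption is the same XOR operation
```

Both modes only ever call the block function in the encryption direction, which is why a CCM core needs no block-cipher decryption datapath.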

Table 2. Names and descriptions of the candidate hardware architectures for each layer.

Layer	NCA $^{a}$	Description of Candidate Architectures
Sbox	LUT	Sbox based on look-up table (LUT)
Sbox	CFA	Sbox based on composite field arithmetic (CFA)
Sbox	DSE	Sbox based on the decoder–switch–encoder (DSE) architecture
NLT $^{b}$	NLT1	Uses one Sbox to implement the NLT
NLT	NLT2	Uses two Sboxes to implement the NLT
NLT	NLT4	Uses four Sboxes to implement the NLT
RF $^{c}$	KRF	RF for key expansion only
RF	ERF	RF for encryption only
RF	KERF	RF for both key expansion and encryption
SM4	OffKS	Offline key expansion with only one shared RF (KERF)
SM4	OnKP	Online key expansion with two separate RFs (KRF and ERF)
SM4	OnKS	Online key expansion with only one shared RF (KERF)
SM4-CCM	S1SM4	Only one SM4, working in CBC-MAC mode and CTR mode by turns
SM4-CCM	P2SM4	Two SM4s: one works in CBC-MAC mode and the other in CTR mode
SM4-CCM	GB2SM4	One SM4 and one RF: the SM4 works in CBC-MAC mode with offline key expansion and encryption, and the RF acts as an SM4 module working in CTR mode

^a name of candidate architectures, ^b non-linear transform, ^c round function.
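The scheme names (e.g., c0k1n2s2) can be enumerated from the candidate architectures in Table 2. The sketch below is an illustration, not the authors' tooling. The index order of each list is inferred from Table 3, and the RF layer is taken to be implied by the SM4 choice, as Table 2 describes. The pruning rule is an assumption: pairing GB2SM4 only with offline key expansion (OffKS) is one rule that reproduces the count of 63, but the rule actually used in Section 4.7 is not restated here.

```python
from itertools import product

# Candidate architectures per layer; list index = digit in the scheme name.
CCM = ["S1SM4", "P2SM4", "GB2SM4"]  # c
SM4 = ["OffKS", "OnKP", "OnKS"]     # k
NLT = ["NLT1", "NLT2", "NLT4"]      # n
SBOX = ["LUT", "CFA", "DSE"]        # s


def is_valid(c: int, k: int) -> bool:
    """Assumed pruning rule: GB2SM4 is combined only with OffKS."""
    return CCM[c] != "GB2SM4" or SM4[k] == "OffKS"


schemes = {
    f"c{c}k{k}n{n}s{s}": (CCM[c], SM4[k], NLT[n], SBOX[s])
    for c, k, n, s in product(range(3), repeat=4)
    if is_valid(c, k)
}

print(len(schemes))         # 63 under the assumed rule (81 without it)
print(schemes["c0k1n2s2"])  # ('S1SM4', 'OnKP', 'NLT4', 'DSE')
```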

Table 3. The best design schemes under various metrics and the corresponding candidate architectures of each layer.

Metrics	The Best Scheme	Metric Value	CCM	SM4	NLT	SBOX
Throughput (Mbps)	c1k1n2s{0,1,2}	281.31	P2SM4	OnKP	NLT4	-
Area (Gate Count)	c0k2n0s1	10754	S1SM4	OnKS	NLT1	CFA
Power (mW)	c0k2n0s2	1.344	S1SM4	OnKS	NLT1	DSE
Area Efficiency	c0k1n2s1	16447	S1SM4	OnKP	NLT4	CFA
Power Efficiency	c0k1n2s2	123.07	S1SM4	OnKP	NLT4	DSE
APDP	c0k1n2s2	$1.21 \times 10^{9}$	S1SM4	OnKP	NLT4	DSE
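The power-efficiency entry of Table 3 can be cross-checked against the figures reported for the optimal scheme c0k1n2s2 (199.99 Mbps, 1.625 mW, about 14.6 K gates). The short calculation below assumes that power efficiency is throughput divided by power (Mbps/mW) and that area efficiency is throughput divided by gate count (bps/gate); these definitions are assumptions consistent with the tabulated values, and the rounded gate count makes the area figure approximate.

```python
# Reported figures for the optimal scheme c0k1n2s2.
throughput_mbps = 199.99
power_mw = 1.625
area_gates = 14_600  # "about 14.6 K gates", rounded in the text

# Assumed definition: throughput per unit of power.
power_efficiency = throughput_mbps / power_mw
print(round(power_efficiency, 2))  # 123.07, matching the Table 3 entry

# Assumed definition: throughput in bit/s per gate (approximate, rounded area).
area_efficiency = throughput_mbps * 1e6 / area_gates
print(round(area_efficiency))
```

Under these assumptions, the area efficiency of c0k1n2s2 comes out below the best value of 16447 listed for c0k1n2s1, which is consistent with the CFA-based variant leading on that metric.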

Table 4. Breakdown analysis of the optimal scheme c0k1n2s2.

Power Group	Internal Power	Switching Power	Leakage Power	Total Power
Clock Network	1.067 mW	0	0	1.067 mW
Register	102.7 µW	20.89 µW	13.84 µW	137.4 µW
Combinational	199.5 µW	183.0 µW	37.98 µW	420.5 µW
In total	1.369 mW	203.9 µW	51.82 µW	1.625 mW
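The totals in Table 4 can be reproduced by summing the individual entries. The sketch below does this in microwatts and also derives the share of the clock network in the total power; the variable names are illustrative only.

```python
# Table 4 entries in microwatts: (internal, switching, leakage).
rows_uw = {
    "Clock Network": (1067.0, 0.0, 0.0),
    "Register": (102.7, 20.89, 13.84),
    "Combinational": (199.5, 183.0, 37.98),
}

# Sum across each row (per power group) and down each column (per power type).
row_totals_uw = {name: sum(parts) for name, parts in rows_uw.items()}
internal_uw, switching_uw, leakage_uw = (sum(col) for col in zip(*rows_uw.values()))
total_mw = (internal_uw + switching_uw + leakage_uw) / 1000.0

clock_share = row_totals_uw["Clock Network"] / (total_mw * 1000.0)

print({name: round(value, 1) for name, value in row_totals_uw.items()})
print(round(internal_uw / 1000.0, 3), round(switching_uw, 1), round(leakage_uw, 2))
print(round(total_mw, 3))        # 1.625
print(f"{clock_share:.1%}")      # 65.7%
```

The sums agree with the tabulated totals after rounding, and they show that the clock network accounts for roughly two thirds of the total power of the optimal scheme.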

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

