Efficient Parallel Implementation of CTR Mode of ARX-Based Block Ciphers on ARMv8 Microcontrollers

Song, JinGyo; Seo, Seog Chung

doi:10.3390/app11062548


by JinGyo Song and Seog Chung Seo *

Department of Financial Information Security, Kookmin University, Seoul 02707, Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2021, 11(6), 2548; https://doi.org/10.3390/app11062548

Submission received: 2 February 2021 / Revised: 1 March 2021 / Accepted: 9 March 2021 / Published: 12 March 2021

(This article belongs to the Special Issue Design and Security Analysis of Cryptosystems)


Abstract

With the advancement of 5G mobile telecommunication, various IoT (Internet of Things) devices exchange massive amounts of data over wireless networks. Because wireless communication is vulnerable to data leakage, the transmitted data should be encrypted with block ciphers. In addition, to encrypt massive amounts of data securely, it is essential to apply a secure mode of operation. Among them, CTR (CounTeR) mode is the most widely used in industrial applications. However, IoT devices have limited computing and memory resources compared to typical computers, so it is challenging to run computation-intensive cryptographic algorithms on them at high speed. Optimizing cryptographic operations on these devices is therefore essential for building secure IoT-based service systems. Even though several ARX (Add-Rotate-XOR)-based ciphers have been proposed for efficient encryption on IoT devices, it is still necessary to improve encryption performance for smooth and secure IoT services. In this article, we propose the first parallel implementations of CTR mode of ARX-based ciphers: LEA (Lightweight Encryption Algorithm), HIGHT (high security and light weight), and revised CHAM on the ARMv8 platform, a popular microcontroller in various IoT applications. For the parallel implementation, we propose an efficient data parallelism technique and a register scheduling that maximizes the usage of vector registers. Through the proposed techniques, we process the maximum number of encryptions simultaneously by utilizing all vector registers. Namely, in the case of HIGHT and revised CHAM-64/128 (resp. LEA, revised CHAM-128/128, and CHAM-128/256), we can execute 48 (resp. 24) encryptions simultaneously. In addition, we optimize CTR mode by pre-computing the intermediate values of some initial rounds, utilizing the property that the nonce part of the CTR mode input is fixed during encryption. Through the pre-computation table, CTR mode is optimized up to round 4 in LEA, round 5 in HIGHT, and round 7 in revised CHAM. With the proposed parallel processing technique, our software improves performance by 3.09%, 5.26%, and 9.52% in LEA, HIGHT, and revised CHAM-64/128, respectively, compared to the existing parallel works on ARM-based MCUs. Furthermore, with the proposed CTR mode optimization technique, our software improves performance by 8.76%, 8.62%, and 15.87% in LEA-CTR, HIGHT-CTR, and revised CHAM-CTR, respectively. To the best of our knowledge, this work is the fastest implementation of CTR mode on the ARMv8 architecture.

Keywords:

embedded security; LEA block cipher; HIGHT block cipher; revised CHAM block cipher; counter mode of operation; ARMv8; parallel implementation; internet of things

1. Introduction

Since the fourth industrial revolution, IoT (Internet of Things) devices have been evolving into ‘Intelligent IoT’ across various applications through the convergence of technologies such as big data, artificial intelligence, and cloud services. IoT devices have expanded beyond their previously limited functions and now exchange massive amounts of data with other IoT devices and servers. However, since a large portion of this data is exchanged over wireless links, it is vulnerable to attacks such as eavesdropping and data modification. Thus, cryptographic algorithms such as block ciphers and hash algorithms need to be applied to secure wireless communication. In other words, IoT devices need to run computation-intensive cryptographic algorithms on the device itself. However, IoT devices have limited computing and memory resources compared with typical computers, so applying existing cryptographic algorithms to them is challenging. For this reason, several ARX (Add-Rotate-XOR)-based lightweight block ciphers [1,2,3] have recently been proposed so that they can operate smoothly on resource-limited IoT devices. However, they still incur computational overhead compared to other operations on IoT devices. Thus, ARX-based block ciphers need to be optimized on IoT devices so that various IoT services operate smoothly.

Because massive amounts of data are exchanged in wireless communication, a mode of operation such as Cipher Block Chaining (CBC) mode, Output FeedBack (OFB) mode, Cipher FeedBack (CFB) mode, or CounTeR (CTR) mode needs to be applied. Among them, CTR mode is the most widely used in industrial applications because it has the following advantages. Each block in CTR mode can be processed independently, which allows encryption to be performed in parallel. In addition, CTR mode requires only the encryption routine of the block cipher, because decryption is identical to encryption.
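The two properties of CTR mode described above can be illustrated with a minimal Python sketch. The function `toy_block_encrypt` is a hypothetical stand-in built from SHA-256 and is not a real block cipher; the 128-bit block size, the big-endian counter, and the nonce-then-counter layout are likewise assumptions made only for this illustration.

```python
import hashlib

BLOCK_BYTES = 16  # assumed 128-bit block, as in LEA


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher's encryption routine (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data in CTR mode; both directions use the same routine."""
    counter_bytes = BLOCK_BYTES - len(nonce)
    out = bytearray()
    for offset in range(0, len(data), BLOCK_BYTES):
        # Each input block is nonce || counter, so blocks are independent of each other.
        counter = (offset // BLOCK_BYTES).to_bytes(counter_bytes, "big")
        keystream = toy_block_encrypt(key, nonce + counter)
        chunk = data[offset:offset + BLOCK_BYTES]
        out += bytes(c ^ k for c, k in zip(chunk, keystream))
    return bytes(out)
```

Because every keystream block depends only on the key, the nonce, and its own counter value, the loop iterations could run in any order or side by side, which is the property that a SIMD implementation exploits.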

Until now, various studies have been conducted on CTR mode optimization. In CHES’18 [4], Park et al. proposed the first optimized method for AES-CTR mode, called FACE, on Intel Core i2 and i7 processors. The proposed method not only reduces unnecessary operations by looking up five caches but can also be applied to various platforms. In ICISC’19 [5], Kim et al. optimized AES-CTR mode on the 8-bit AVR, which is widely used as a low-end processor, by extending the concept of FACE from [4]. In MDPI Electronics’20 [6], Kwon et al. proposed an optimized implementation of CHAM-CTR mode on 8-bit AVR MCUs. They reduced some operations in the first few round functions of the revised CHAM by pre-computing the round operations related to the fixed nonce value in the input block of CTR mode. In MDPI Electronics’20 [7], Kim et al. optimized HIGHT-CTR mode, LEA-CTR mode, and CTR-DRBG on 8-bit AVR MCUs. HIGHT-CTR and LEA-CTR mode were optimized in the same way as in [6]; in CTR-DRBG, in addition to applying the CTR mode optimization, computational overhead was reduced by looking up a pre-computation table instead of computing some operations within CTR-DRBG.

Until now, most works on CTR mode optimization have considered only 8-bit AVR MCUs among embedded IoT devices. Recently, however, ARMv8 has been widely used in various IoT devices such as smartphones and tablets. ARMv8 is equipped with the NEON engine, which can process tasks in parallel and can be used to accelerate cryptographic operations. In this article, we propose the first parallel implementations of CTR mode optimization of ARX-based block ciphers (LEA, HIGHT, and revised CHAM) on the ARMv8 architecture. For parallel implementation, we propose data parallelism and vector register scheduling to process multiple encryptions simultaneously. Moreover, we apply the proposed parallel implementation to the CTR mode optimization. Our implementations pre-compute the initial few rounds by using the property that the nonce part of the input block is fixed during encryption, and utilize the precomputed values for efficiency. Through the proposed methods, the maximum number of encryptions is processed simultaneously by utilizing the NEON engine. Furthermore, the actual round operations can be skipped for the initial few rounds in CTR mode by using the precomputed values. Thus, we achieved enhanced performance of 8.52%, 8.62%, and 15.87% over related works in LEA-CTR mode, HIGHT-CTR mode, and revised CHAM-64/128-CTR mode, respectively.

Contributions

The contributions of this work are as follows.

First parallel implementation of CTR mode on embedded devices using ARMv8 architecture
Until now, CTR mode optimization has been conducted only on 8-bit AVR, Intel Core i2, and Intel Core i7 [4,5,6,7,8]. However, there are many types of IoT platforms. Among them, ARMv8 is widely used in various IoT devices, and it supports a NEON engine that is capable of efficient parallel processing. Therefore, we present the first parallel implementation of CTR mode with ARX-based block ciphers on the ARMv8 architecture. For parallel implementation, we present not only an efficient data parallelism technique but also a register scheduling that maximizes the use of vector registers to process multiple encryptions simultaneously. Finally, we propose the parallel implementation of CTR mode by applying the proposed parallelism technique to CTR mode optimization, which pre-computes the initial few rounds using the fixed nonce of the input block. The proposed parallel implementation of CTR mode can be easily applied to various parallel environments such as Single Instruction Multiple Data (SIMD) and Advanced Vector Extensions (AVX2).
Proposing the efficient data parallelism of ARX-Based Block Ciphers
We propose a parallel implementation of ARX-based block ciphers (LEA, HIGHT, and revised CHAM) that utilizes the NEON engine in the ARMv8 architecture. The proposed parallel technique is more efficient than the existing data parallel processing techniques in [9,10]. In LEA and revised CHAM, we eliminate the transpose operations required for data parallelism by using the LD4 and ST4 instructions when loading data from memory into four vector registers and storing data from four vector registers into memory. HIGHT processes its round operations in 8-bit units, so it is difficult to apply data parallelism without additional cost. Thus, we present an optimized transpose operation for data parallelism in HIGHT. Furthermore, to perform as many encryptions as possible simultaneously, we present an efficient vector register scheduling. Through the proposed data parallelism techniques, 24 encryptions are performed simultaneously in LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions are performed simultaneously in HIGHT and revised CHAM-64/128. In the case of HIGHT and revised CHAM, more encryptions are performed simultaneously than in the previous works. As a result, the proposed data parallelism outperforms the previous works by 3.09%, 5.26%, and 9.52%, respectively. The proposed data parallelism techniques can be applied to various lightweight ciphers such as SIMECK [11] and SKINNY [12].
Presenting the first parallel implementation of CTR mode
We apply the proposed parallel techniques to the CTR mode implementation. The existing works on CTR optimization have been conducted on 8-bit AVR MCUs [6,7]. They utilized the property of CTR mode that the input block consists of a nonce part and a counter part, and that the nonce part is always the same during encryption. Thus, we can precompute the initial few rounds that are related to the fixed nonce part. In other words, by precomputing the operations related to the nonce part, performance can be improved by using precomputed values instead of computing the round operations for up to 4 rounds in LEA, 5 rounds in HIGHT, and 7 rounds in revised CHAM. We extend the optimization concept of the existing works to the proposed parallel implementation of CTR mode. However, in the case of LEA, the input position of the nonce in CTR mode is not fixed. Thus, by changing the position of the nonce used in the existing CTR mode optimization, we can precompute one more round than the previous work [7]. In addition, with the proposed data parallelism, the maximum number of encryptions is performed simultaneously in CTR mode, and this number is the same as in plain data parallelism. Through the parallel implementation of CTR mode optimization, we achieved enhanced performance of 8.76%, 8.62%, and 15.87% in LEA-CTR, HIGHT-CTR, and revised CHAM-CTR compared with the previous works.
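The nonce-based precomputation described above can be sketched in Python for the first LEA round only. This is an illustrative assumption, not the authors' implementation: it assumes the counter occupies word 0 and the nonce occupies words 1 to 3 of the input block, and the helper names (`precompute_round1`, `round1_with_cache`) are hypothetical. The cited works extend the same idea over several rounds.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror32(x, n):
    return rol32(x, 32 - n)


def lea_round(x, rk):
    """Reference LEA round on four 32-bit words with six round-key words."""
    return [
        rol32(((x[0] ^ rk[0]) + (x[1] ^ rk[1])) & MASK32, 9),
        ror32(((x[1] ^ rk[2]) + (x[2] ^ rk[3])) & MASK32, 5),
        ror32(((x[2] ^ rk[4]) + (x[3] ^ rk[5])) & MASK32, 3),
        x[0],
    ]


def precompute_round1(nonce_words, rk):
    """Compute everything in round 1 that depends only on the nonce (assumed layout)."""
    n1, n2, n3 = nonce_words
    w1 = ror32(((n1 ^ rk[2]) + (n2 ^ rk[3])) & MASK32, 5)
    w2 = ror32(((n2 ^ rk[4]) + (n3 ^ rk[5])) & MASK32, 3)
    nonce_term = n1 ^ rk[1]  # reused in the only counter-dependent addition
    return w1, w2, nonce_term


def round1_with_cache(counter, cache, rk):
    """Round 1 for one counter value: one XOR, one addition, one rotation remain."""
    w1, w2, nonce_term = cache
    w0 = rol32(((counter ^ rk[0]) + nonce_term) & MASK32, 9)
    return [w0, w1, w2, counter]
```

Under this layout, two of the three ARX computations of the first round are shared by every block that uses the same nonce, so they are computed once and looked up afterwards.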

The rest of this paper is structured as follows. Section 2 explains NEON engine and target block ciphers. Section 3 provides existing studies using NEON engine and CTR mode optimization. Section 4 presents the efficient data parallelism and parallel implementation of CTR mode optimization. Section 5 provides performance comparison results from the proposed implementation and previous works. Finally, Section 6 concludes this paper.

2. Background

2.1. NEON Engine and ASIMD Instructions

Since the ARMv7 architecture, the NEON engine, a parallel processing unit, has been supported to maximize performance. The NEON engine in ARMv7 supports 16 vector registers (q0–q15) with a 128-bit width for parallel processing, whereas the NEON engine in ARMv8, the target of this paper, supports 32 128-bit vector registers (v0–v31). In addition, the NEON engine can process a 128-bit vector register in parallel as 64-, 32-, 16-, or 8-bit lanes. Table 1 shows the Advanced Single Instruction Multiple Data (ASIMD) instructions and the clock cycles of each instruction. Since the target block ciphers are based on the ARX structure, ASIMD instructions for addition, rotation, and $XOR$ are required. The ADD instruction processes addition in parallel by lane. The SHL and SRI instructions are shift instructions; used as a pair, they are the common way to implement rotation in the NEON engine. REV16 processes the $ROL_{8}$ operation more efficiently than SRI and SHL when $ROL_{8}$ is applied in parallel to 16-bit units. The EOR instruction processes $XOR$ operations in parallel by lane. The TRN1 and TRN2 instructions efficiently transpose the vector registers; their transpose operations are shown in Figure 1. ST4 and LD4 store data from four vector registers to memory and load data from memory into four vector registers, respectively, and the transpose operation is applied automatically in the process. Using this property, we present a method to apply data parallelism without additional cost by using the ST4 and LD4 instructions in the ARX-based ciphers (LEA and revised CHAM) other than HIGHT. Table 1 also shows that memory access operations require more cycles than simple register operations.
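As a hedged illustration of how the SHL and SRI pair yields a rotation, and why REV16 equals $ROL_{8}$ on 16-bit lanes, the following Python sketch models a single lane of each instruction. The function names are hypothetical and the model covers only the per-lane arithmetic, not the real instruction encoding or timing.

```python
def shl(x, n, bits=32):
    """Model of SHL on one lane: logical shift left, truncated to the lane width."""
    return (x << n) & ((1 << bits) - 1)


def sri(dst, src, n, bits=32):
    """Model of SRI on one lane: insert src >> n into dst, keeping dst's top n bits."""
    mask = (1 << bits) - 1
    keep = (mask << (bits - n)) & mask
    return (dst & keep) | (src >> n)


def rotate_right(x, n, bits=32):
    """Rotation built from the SHL + SRI pair."""
    return sri(shl(x, bits - n, bits), x, n, bits)


def rev16(x):
    """Model of REV16 on one 16-bit lane: swap the two bytes."""
    return ((x & 0xFF) << 8) | (x >> 8)


def rol(x, n, bits):
    """Plain rotate-left used as the reference."""
    mask = (1 << bits) - 1
    return ((x << n) | (x >> (bits - n))) & mask
```

For example, `rotate_right(x, 3)` corresponds to an `SHL #29` followed by an `SRI #3`, and a left rotation by 9 corresponds to `SHL #9` followed by `SRI #23`. On a 16-bit lane, rotating by 8 only exchanges the two bytes, so a single REV16 replaces the two-instruction pair.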

2.2. Target Block Ciphers

2.2.1. LEA Block Cipher

In WISA’13 [1], Hong et al. proposed the Lightweight Encryption Algorithm (LEA) block cipher, which provides high-speed encryption on common software platforms. LEA not only provides high-speed encryption and small code size, but also resists well-known attacks on block ciphers, such as differential cryptanalysis and linear cryptanalysis. Furthermore, LEA was adopted in ISO/IEC 29192-2 [14] in 2019, which supports its security and efficiency. It is classified into three types according to its parameters, as shown in Table 2. LEA supports key lengths of 128, 192, and 256 bits, and consists of 24, 28, and 32 rounds, respectively. The key schedule of LEA uses constants called delta, which are derived from the square root of 766,995, where 76, 69, and 65 are the ASCII codes of ‘L’, ‘E’, and ‘A’. Round keys are generated from the delta through $ROL$ operations and addition modulo $2^{32}$. Each round function of LEA takes six 32-bit round keys, although LEA-128 requires only four 32-bit round keys per round. The round function operates on 32-bit words, which suits the 32-bit and 64-bit processors widely used in recent years, and consists of ARX operations (Addition, Rotation, and XOR). Figure 2 gives the round function of LEA. It is composed of six $XOR$ operations, three additions modulo $2^{32}$, and three rotate operations ($ROR_{3}$, $ROR_{5}$, and $ROL_{9}$). Encryption is completed by repeating this round function for the required number of rounds.
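The round function described above can be written as a short Python sketch. It follows the published LEA round structure on four 32-bit words with six round-key words; the function names are illustrative, and the key schedule is omitted, so the round keys are simply passed in.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror32(x, n):
    return rol32(x, 32 - n)


def lea_round(x, rk):
    """One LEA encryption round: six XORs, three additions mod 2^32, three rotations."""
    return [
        rol32(((x[0] ^ rk[0]) + (x[1] ^ rk[1])) & MASK32, 9),
        ror32(((x[1] ^ rk[2]) + (x[2] ^ rk[3])) & MASK32, 5),
        ror32(((x[2] ^ rk[4]) + (x[3] ^ rk[5])) & MASK32, 3),
        x[0],  # the first input word moves to the last position unchanged
    ]


def lea_round_inverse(y, rk):
    """Inverse of lea_round, included only to check the sketch against itself."""
    x0 = y[3]
    x1 = ((ror32(y[0], 9) - (x0 ^ rk[0])) & MASK32) ^ rk[1]
    x2 = ((rol32(y[1], 5) - (x1 ^ rk[2])) & MASK32) ^ rk[3]
    x3 = ((rol32(y[2], 3) - (x2 ^ rk[4])) & MASK32) ^ rk[5]
    return [x0, x1, x2, x3]
```

Three of the four output words each cost two XORs, one addition, and one rotation, which matches the operation counts given in the text.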

2.2.2. HIGHT Block Cipher

In CHES’06 [2], Hong et al. proposed the HIGHT block cipher to provide confidentiality in environments where only limited computation and power are available, such as RFID tags and sensors. HIGHT is a lightweight block cipher with a Feistel structure that encrypts a 64-bit block under a 128-bit secret key. The round function of HIGHT is composed of lightweight operations in 8-bit units, namely addition modulo $2^{8}$, rotation (in the $F_{0}$ and $F_{1}$ functions), and $XOR$, so it is well suited to low-end processors with limited resources. Furthermore, HIGHT was adopted in ISO/IEC 18033-3 [15] in 2010. In the key schedule of HIGHT, the delta value is updated through a Linear-Feedback Shift Register (LFSR), and each subkey is generated by adding one byte of the master key and the delta value modulo $2^{8}$. The connection polynomial of the LFSR is $x^{7} + x^{3} + 1$, so the delta has a period of 127. By repeating this process, a total of 128 bytes of subkeys are generated from the delta values. The encryption process requires these 128 bytes of subkeys together with 8 bytes taken from the master key. Encryption consists of an initial transformation, the round functions, and a final transformation. The initial and final transformations apply byte-wise $XOR$ and addition modulo $2^{8}$. The round function performs the $F_{0}$ and $F_{1}$ functions, $XOR$, and addition modulo $2^{8}$. The $F_{0}$ and $F_{1}$ functions, which have the largest computational overhead in the round function, perform rotate and $XOR$ operations. After each round, the result values are rotated one byte to the left. Figure 3 shows the round function of HIGHT. A total of 32 rounds are performed to encrypt a 64-bit plaintext. The byte rotation is not applied to the result of the last round. Finally, the final transformation is applied to the result, which completes the encryption of HIGHT.
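The two building blocks mentioned above, the $F_{0}$/$F_{1}$ functions and the LFSR that produces the delta values, can be sketched in Python as follows. The rotation amounts of $F_{0}$ and $F_{1}$ and the initial value 0x5A follow the HIGHT specification as commonly stated; the bit ordering used for the LFSR state is an assumption of this sketch, and the function names are illustrative.

```python
def rol8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def f0(x):
    """F0: XOR of three rotations of one byte (by 1, 2, and 7 bits)."""
    return rol8(x, 1) ^ rol8(x, 2) ^ rol8(x, 7)


def f1(x):
    """F1: XOR of three rotations of one byte (by 3, 4, and 6 bits)."""
    return rol8(x, 3) ^ rol8(x, 4) ^ rol8(x, 6)


def hight_deltas(count=128, seed=0x5A):
    """Generate 7-bit delta values from an LFSR with polynomial x^7 + x^3 + 1.

    The state holds s[i+6]..s[i] with s[i] in bit 0 (assumed convention);
    the new bit is s[i+7] = s[i+3] XOR s[i].
    """
    deltas = []
    state = seed
    for _ in range(count):
        deltas.append(state)
        new_bit = ((state >> 3) ^ state) & 1
        state = (new_bit << 6) | (state >> 1)
    return deltas
```

Because the connection polynomial is primitive, the state runs through all 127 nonzero 7-bit values before repeating, which is the period quoted in the text.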

2.2.3. Revised CHAM Block Cipher

In ICISC’17 [16], the CHAM family was proposed for efficient encryption in resource-constrained environments such as 8-bit and 16-bit processors. The CHAM family consists of ARX-based ultra-lightweight block ciphers that do not use S-box operations. The key schedule of the CHAM family does not update the round keys, which reduces the memory space needed to store them. Furthermore, both H/W and S/W implementations of the CHAM family have been shown to be more efficient than those of other lightweight block ciphers. In H/W, it can be implemented in a smaller area than the SIMON [17] block cipher, which shows the best performance in H/W. In S/W, it provides more efficient performance than the SPECK [17] block cipher, which shows the best performance in S/W. The CHAM family can be classified into three block ciphers according to its parameters, as shown in Table 3. The key schedule of CHAM is composed of $ROL_{1}$, $ROL_{11}$, $ROL_{8}$, and $XOR$ operations. Through these operations, $2k/w$ round keys are generated, where $k$ is the key length and $w$ is the word size in bits; this is far fewer than the number of rounds of each CHAM block cipher. The round function of CHAM is made up of modular addition, rotate operations ($ROL_{1}$ and $ROL_{8}$), and $XOR$ operations. These operations are applied to only one word in each round, so the round function is very light compared to those of other lightweight block ciphers. Figure 4 shows the round function of CHAM. Recently, in ICISC’19 [3], the designers of the original CHAM reported new differential characteristics of the existing CHAM family found with a SAT solver, and proposed a revised CHAM family to provide sufficient security. Using the SAT solver, differential characteristics of the existing CHAM-64/128 and CHAM-128/k were found for 39 and 62 rounds, respectively, and related-key differential characteristics were found for up to 47 rounds in the CHAM family. The revised CHAM family is identical to the CHAM family except that, to provide sufficient security against these differential characteristics, the number of rounds is increased from 80 to 88 in CHAM-64/128, from 80 to 112 in CHAM-128/128, and from 96 to 120 in CHAM-128/256. Despite the increased number of rounds, the H/W implementation of the revised CHAM family remains the same as that of the existing CHAM family. In S/W, the revised CHAM family is comparable to SPECK, which is known to have the best performance in S/W.
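To make the "only one word per round" property and the $2k/w$ round-key count concrete, here is a minimal Python sketch of a CHAM round. It follows the round structure of the CHAM specification as commonly stated (the rotation pair alternates between even and odd rounds); the function names are illustrative, and the round key is passed in rather than derived from a key schedule.

```python
def rol(x, n, w):
    """Rotate a w-bit word left by n bits."""
    mask = (1 << w) - 1
    return ((x << n) | (x >> (w - n))) & mask


def cham_round(x, rk, i, w):
    """One CHAM round on four w-bit words; i is the round index, rk its round key."""
    mask = (1 << w) - 1
    if i % 2 == 0:
        new = rol(((x[0] ^ i) + (rol(x[1], 1, w) ^ rk)) & mask, 8, w)
    else:
        new = rol(((x[0] ^ i) + (rol(x[1], 8, w) ^ rk)) & mask, 1, w)
    # Only one new word is computed; the other three are shifted by one position.
    return [x[1], x[2], x[3], new]


def round_key_count(k, w):
    """Number of round keys, 2k/w, for key length k and word size w (in bits)."""
    return 2 * k // w
```

With these parameters, CHAM-64/128 ($w = 16$) uses 16 round keys, CHAM-128/128 ($w = 32$) uses 8, and CHAM-128/256 ($w = 32$) uses 16, each reused cyclically over the 88, 112, or 120 rounds of the revised family.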

3. Related Works

Parallel Implementation of Block Ciphers on NEON Engine and CTR Mode Optimization

In this section, we describe related work on parallel implementation using the NEON engine and on CTR mode optimization. The ARMv7 architecture supports, in addition to the ARM core, the NEON engine, which can effectively process data in parallel. In the NEON engine, a 128-bit vector register can be processed in parallel as 64-, 32-, 16-, or 8-bit lanes. Until now, works on parallel optimization of various block ciphers utilizing the NEON engine in the ARMv7 and ARMv8 architectures have been conducted [9,10,18,19,20]. Table 4 shows the number of simultaneously performed encryptions utilizing the NEON engine in related works. In WISA’16 [18], taking advantage of the independent ARM and NEON cores, the authors proposed an optimization technique that hides the cycles of ARM instructions within the cycles of NEON instructions by interleaving NEON and ARM instructions. In the NEON engine, 12 encryptions were performed simultaneously using 16 vector registers, and one encryption was performed efficiently utilizing the barrel shifter in the ARM core. The authors presented the overall process for parallel processing in the NEON engine as follows:

Load ⟶ Transpose ⟶ Encryption ⟶ Transpose ⟶ Store

As this process shows, the transpose operation was required for parallel processing in the NEON engine. Furthermore, utilizing OpenMP, the authors extended the proposed single-core optimization to four cores so that a total of 52 encryptions were performed simultaneously. In WISA’18 [19], the authors presented the results of applying the above optimization method to CHAM-64/128. Taking advantage of the fact that CHAM-64/128 operates on 16-bit words, the proposed method processed 24 encryptions simultaneously in the NEON engine and four encryptions simultaneously in the ARM core. In the NEON parallel process, the transpose operation was required as in the above method. Furthermore, the authors interleaved ARM instructions and NEON instructions in CHAM-64/128 as in the above optimization method, and optimized CHAM-64/128 on multiple cores through OpenMP.

Recently, various research results have been introduced for the ARMv8 architecture, the latest version of the ARM core. Unlike ARMv7, the ARMv8 architecture supports 32 128-bit vector registers instead of 16 128-bit vector registers. In the Journal of the Korea Institute of Information and Communication Engineering’17 [9], the author presented an efficient parallel implementation of LEA utilizing the NEON engine in the ARMv8 architecture for the first time. In the proposed data parallelism, 24 encryptions were processed simultaneously, but the transpose operation was required for data parallelism in the same way as in the above optimization method. In ICISC’19 [20], the authors optimized AES using the ASIMD instruction set on the ARMv8 architecture. The 4-way transpose MixColumns was introduced to efficiently process MixColumns on four encryptions, and through it the authors achieved improved MixColumns performance. In IEEE Access’20 [10], the authors proposed secure and fast implementations of HIGHT and CHAM by utilizing the NEON engine on ARMv8 platforms. In the fast implementations of HIGHT and revised CHAM, data and task parallelism were applied to process multiple operations and data simultaneously. Through the proposed data parallelism, 24 encryptions were performed simultaneously in HIGHT, and 16, 10, and 8 encryptions were performed simultaneously in revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively. However, the authors did not fully exploit the scheduling of vector registers, so the maximum number of encryptions was not processed simultaneously.

In MDPI Electronics’20 [6], Kwon et al. proposed the first CHAM-CTR mode optimization on an 8-bit AVR processor. The authors utilized the property of CTR mode that the nonce part is always the same during encryption. Through this property, the authors optimized some round operations up to round 7 in CHAM-CTR. The overall process of the proposed CHAM-CTR mode [6] is shown in Figure 5. In MDPI Electronics’20 [7], Kim et al. proposed the first LEA-CTR and HIGHT-CTR mode optimization on an 8-bit AVR processor. Utilizing the same CTR mode property, the authors optimized some round operations up to round 3 in LEA-CTR and round 5 in HIGHT-CTR. The overall process of HIGHT-CTR optimization is shown in Figure 6.

In actual applications, massive amounts of data are encrypted, so applying a mode of operation is essential. Currently, the modes of operation most widely used in industry are CBC and CTR modes. However, studies so far have tended to focus on optimizing the block cipher rather than the mode of operation. Thus, the objective of our work is to present the first parallel implementation of CTR mode optimization on the ARMv8 architecture. For parallel implementation, we introduce efficient register scheduling and data parallelism techniques. For CTR mode optimization, we propose a method for LEA-CTR that optimizes one more round than the previous study [7]. For HIGHT-CTR and revised CHAM-CTR, we use the same method as the previous CTR mode optimizations [6,7].

4. Proposed Parallel Implementation of LEA-CTR, HIGHT-CTR, and Revised CHAM-CTR on ARMv8 Microcontrollers

4.1. Proposed Data Parallelism Technique

In this section, we present an efficient data parallelism technique for LEA, HIGHT, and revised CHAM. It relies on the following techniques. First, the transpose operation is eliminated from LEA and revised CHAM by using the ST4 and LD4 instructions. However, since HIGHT performs encryption in 8-bit units, it is difficult to eliminate the transpose operation there. Thus, we present an optimized transpose operation for HIGHT so that data parallelism can be applied efficiently. Second, we propose an efficient register scheduling technique that maximizes the number of encryptions performed in parallel. Through the proposed register scheduling, 24 encryptions are performed simultaneously for LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions are performed simultaneously for HIGHT and revised CHAM-64/128 on the NEON engine. Furthermore, to process multiple encryptions simultaneously, we interleave the NEON instructions to minimize pipeline stalls, as shown in Figure 7. In other words, while NEON instructions 1, 2, and 3 are being processed for one group of plaintexts, pipeline stalls are minimized by inserting the same NEON instructions 1, 2, and 3 for other groups, using the TEMP registers. Finally, we access memory using as many vector registers per instruction as possible, which reduces the number of memory accesses.

4.1.1. LEA Optimization

We introduce an efficient data parallelism technique for LEA. Since LEA performs its round operations on 32-bit words, the 128-bit vector registers are processed in parallel as 32-bit lanes. Previously [9], an explicit transpose operation was performed to apply data parallelism. We eliminate this computational overhead because the LD4 and ST4 instructions apply the transpose operation automatically when loading data from memory into four vector registers and storing data from four vector registers to memory. In addition, all vector registers are utilized to perform 24 encryptions simultaneously. Figure 8 shows the register scheduling in the proposed parallel implementation of LEA. The vector registers v0–v23 store 24 plaintexts in transposed form. The vector registers v24–v27 are used as temporary variables to process multiple plaintexts simultaneously. Finally, the vector registers v28–v31 store the round keys.
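The automatic transpose provided by LD4 and ST4 can be modeled in a few lines of Python. This is a behavioural sketch of the de-interleaving access pattern for four registers with four 32-bit lanes each, assuming four 128-bit blocks stored back to back in memory; the function names are hypothetical.

```python
def ld4_words(memory_words):
    """Model of LD4 {v0.4s-v3.4s}: de-interleave 16 consecutive 32-bit words.

    Memory word i goes to register i % 4, lane i // 4, so register r ends up
    holding word r of each of the four consecutive blocks.
    """
    assert len(memory_words) == 16
    return [memory_words[r::4] for r in range(4)]


def st4_words(regs):
    """Model of ST4: interleave four 4-lane registers back into 16 words."""
    return [regs[r][lane] for lane in range(4) for r in range(4)]


# Four 4-word blocks laid out one after another in memory.
blocks = [[10 * b + w for w in range(4)] for b in range(4)]
memory = [word for block in blocks for word in block]
v0, v1, v2, v3 = ld4_words(memory)
```

After the load, `v0` holds word 0 of all four blocks, `v1` holds word 1, and so on, which is exactly the layout needed to run one round on four blocks lane by lane. No separate transpose instructions are involved, and ST4 restores the original block order on the way out.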

Algorithm 1 shows one round of the parallel implementation of LEA-128 for four plaintexts. The other plaintexts are processed using the remaining Temp registers by interleaving NEON instructions. Our contribution in Algorithm 1 is to perform the transpose operation without computational overhead; the other data parallelism techniques are the same as in [9]. Step 1 loads the plaintexts (PT) into four vector registers via the LD4 instruction, which automatically applies the transpose operation while loading from memory. Step 2 loads the round keys required for one round function into four vector registers. Steps 3–5, Steps 8–10, and Steps 13–15 perform the $XOR$ operations and addition modulo $2^{32}$. Steps 6–7 apply the $ROR_{3}$ operation to the word produced by the modular addition. Likewise, Steps 11–12 apply the $ROR_{5}$ operation and Steps 16–17 apply the $ROL_{9}$ operation to one word each. Repeating this round process 24 times completes the encryption of LEA-128. After that, the ST4 instruction automatically applies the transpose operation while storing the result to memory, which removes the computational overhead of the transpose operation. As with LEA-128, the proposed parallel implementations of LEA-192 and LEA-256 perform 24 encryptions simultaneously. However, unlike LEA-128, both LEA-192 and LEA-256 need more vector registers to store round keys. Therefore, in LEA-192 and LEA-256, the vector registers v26–v31 store the round keys, and the remaining vector registers v24–v25 are used as the Temp registers.

Algorithm 1 Data parallel implementation of LEA-128 Round Function.

Require: v0–v3 (Plaintexts), v28–v31 (Roundkeys)

Ensure: v0–v3 (Ciphertexts)

Loading the PT, (x0: PT Address)

1: LD4 {v0.4s-v3.4s}, [x0], #64

Loading the RK, (x1: RK Address)

2: LD1 {v28.4s-v31.4s}, [x1], #64

Round Function

3: EOR v3.16b, v3.16b, v31.16b

4: EOR v24.16b, v2.16b, v30.16b

5: ADD v24.4s, v3.4s, v24.4s

$ROR_{3}$ Operation

6: SHL v3.4s, v24.4s, #29

7: SRI v3.4s, v24.4s, #3

Round Function

8: EOR v2.16b, v2.16b, v31.16b

9: EOR v24.16b, v1.16b, v29.16b

10: ADD v24.4s, v2.4s, v24.4s

$ROR_{5}$ Operation

11: SHL v2.4s, v24.4s, #27

12: SRI v2.4s, v24.4s, #5

Round Function

13: EOR v1.16b, v1.16b, v31.16b

14: EOR v24.16b, v0.16b, v28.16b

15: ADD v24.4s, v1.4s, v24.4s

$R O L_{9}$ Operation

16: SHL v1.4s, v24.4s, #9

17: SRI v1.4s, v24.4s, #23
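As a lane-wise model of Algorithm 1, the following Python sketch mirrors the listing register by register. It is an illustration only: the function names, the use of Python lists as stand-ins for the 4s lanes, and the assumption that each round-key register (v28–v31) holds a single key word shared by all lanes are ours and are not taken from the original implementation. The sketch also shows why the SHL/SRI pair implements a rotation.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, r):
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def ror32_shl_sri(x, r):
    """Rotate right by r, written as the listing does it:
    SHL by (32 - r) fills the high bits, SRI by r inserts the low bits."""
    high = (x << (32 - r)) & MASK32   # SHL vd, vn, #(32 - r)
    low = x >> r                      # SRI vd, vn, #r
    return high | low


def lea128_round_lanes(v0, v1, v2, v3, k28, k29, k30, k31):
    """One round of Algorithm 1 on lists of 32-bit lanes.

    v0..v3 hold word 0..3 of every block (one block per lane).
    k28..k31 are the key words assumed to sit in v28..v31.
    """
    new_v3 = [ror32_shl_sri(((d ^ k31) + (c ^ k30)) & MASK32, 3)   # Steps 3-7
              for c, d in zip(v2, v3)]
    new_v2 = [ror32_shl_sri(((c ^ k31) + (b ^ k29)) & MASK32, 5)   # Steps 8-12
              for b, c in zip(v1, v2)]
    new_v1 = [rol32(((b ^ k31) + (a ^ k28)) & MASK32, 9)           # Steps 13-17
              for a, b in zip(v0, v1)]
    return list(v0), new_v1, new_v2, new_v3                        # v0 is left untouched
```

With all key words set to zero and single-lane inputs 1, 2, 3, and 4, the sketch yields 0x600 in v1, 0x28000000 in v2, and 0xE0000000 in v3.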

4.1.2. HIGHT Optimization

In HIGHT, the round function operates on 8-bit words, so the vector registers use the 16b arrangement (sixteen 8-bit lanes) to parallelize byte-wise operations within 128 bits. In this way, 16 encryptions are performed simultaneously in a total of eight vector registers. In the existing work [10], only 24 encryptions were processed simultaneously on the NEON engine, whereas we process 48 encryptions simultaneously by efficiently utilizing all vector registers. Figure 9 shows the register scheduling for the proposed parallel implementation of HIGHT. The v0–v23 vector registers store 48 plaintexts, with 16 plaintexts per eight vector registers. The v24–v27 vector registers are Temp registers that store intermediate values. Finally, the v28–v31 vector registers hold the round keys.
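The encryption counts quoted in this section follow from simple register arithmetic, which the short Python sketch below reproduces. The function name and parameters are hypothetical, and the sketch assumes that 24 of the 32 vector registers (v0–v23) hold cipher state while the other eight hold Temp values, round keys, or a counter.

```python
VECTOR_BITS = 128       # width of one NEON vector register
DATA_REGISTERS = 24     # v0-v23, assumed to hold cipher state only


def parallel_blocks(block_bits, word_bits, data_registers=DATA_REGISTERS):
    """Number of blocks processed at once when word j of every block
    is kept in its own vector register (one block per lane)."""
    registers_per_group = block_bits // word_bits   # one register per word index
    lanes = VECTOR_BITS // word_bits                # blocks per register group
    groups = data_registers // registers_per_group
    return groups * lanes
```

For HIGHT (64-bit block, 8-bit words) this gives three groups of eight registers with 16 lanes each, that is, 48 blocks; for LEA (128-bit block, 32-bit words) it gives six groups of four registers with four lanes each, that is, 24 blocks.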

Unlike LEA and revised CHAM, HIGHT operates byte-wise, so it is difficult to remove the transpose operations entirely. We therefore present an optimized transpose operation. Algorithm 2 shows the optimized transpose operations used in the proposed parallel implementation of HIGHT. For the rest of the encryptions, we interleave the instructions to minimize pipeline stalls. Steps 1–14 are the transpose performed before encryption, and Steps 15–28 are the transpose performed after encryption. In Steps 1–2, each LD4 instruction loads partially transposed plaintexts into four vector registers. For example, the v0 vector register holds the 0th and 4th words of eight of the 16 plaintexts, and the v4 vector register holds the 0th and 4th words of the remaining eight plaintexts. Step 3 saves an intermediate copy, and in Step 4 the 0th word of each of the 16 plaintexts is gathered in the v0 vector register through the TRN1 instruction. In Step 5, the 4th word of each of the 16 plaintexts is gathered in the v4 vector register through the TRN2 instruction. Repeating this process three more times completes the transpose for 16 plaintexts. The steps from Step 15 onward perform the transpose after encryption is completed and before the ciphertexts are stored in memory, in the reverse order of Steps 1–14. After this transpose is completed, the ciphertexts are stored in memory through the ST4 instruction.

Algorithm 2 Efficient Transpose Operation for HIGHT.

Require: v0–v7(Plaintexts)

Ensure: v0–v7(Ciphertexts)

Transpose operation to be performed before encryption

1: LD4 {v0.16b-v3.16b}, [x0], #64

2: LD4 {v4.16b-v7.16b}, [x0], #64

3: MOV v24.16b, v0.16b

4: TRN1 v0.16b, v0.16b, v4.16b

5: TRN2 v4.16b, v4.16b, v24.16b

6: MOV v24.16b, v1.16b

7: TRN1 v1.16b, v1.16b, v5.16b

8: TRN2 v5.16b, v5.16b, v24.16b

9: MOV v24.16b, v2.16b

10: TRN1 v2.16b, v2.16b, v6.16b

11: TRN2 v6.16b, v6.16b, v24.16b

12: MOV v24.16b, v3.16b

13: TRN1 v3.16b, v3.16b, v7.16b

14: TRN2 v7.16b, v7.16b, v24.16b

Transpose operation to be performed after encryption

15: MOV v24.16b, v4.16b

16: TRN1 v4.16b, v0.16b, v4.16b

17: TRN2 v0.16b, v0.16b, v24.16b

18: MOV v24.16b, v5.16b

19: TRN1 v5.16b, v1.16b, v5.16b

20: TRN2 v1.16b, v1.16b, v24.16b

21: MOV v24.16b, v6.16b

22: TRN1 v6.16b, v2.16b, v6.16b

23: TRN2 v2.16b, v2.16b, v24.16b

24: MOV v24.16b, v7.16b

25: TRN1 v7.16b, v3.16b, v7.16b

26: TRN2 v3.16b, v3.16b, v24.16b

27: ST4 {v0.16b-v3.16b}, [x0], #64

28: ST4 {v4.16b-v7.16b}, [x0], #64
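The effect of the LD4 load followed by a TRN1/TRN2 pair can be checked with a small Python model. The sketch below is an illustration under stated assumptions: registers are modeled as lists of 16 bytes, the helper names are hypothetical, and the saved copy is passed as the first operand of both TRN1 and TRN2 so that the two result registers list the plaintexts in the same lane order (this operand order is a choice of the sketch).

```python
def ld4_bytes(mem):
    """Model of LD4 {v.16b x 4}: de-interleave 64 bytes with stride 4."""
    assert len(mem) == 64
    return [list(mem[i::4]) for i in range(4)]


def trn1(a, b):
    """TRN1: even-indexed elements of a and b, interleaved."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i], b[i]]
    return out


def trn2(a, b):
    """TRN2: odd-indexed elements of a and b, interleaved."""
    out = []
    for i in range(1, len(a), 2):
        out += [a[i], b[i]]
    return out


def gather_words_0_and_4(mem):
    """mem holds 16 plaintexts of 8 bytes each (128 bytes).
    Returns (v0, v4): byte 0 and byte 4 of all 16 plaintexts."""
    v0 = ld4_bytes(mem[:64])[0]    # bytes 0 and 4 of plaintexts 0-7
    v4 = ld4_bytes(mem[64:])[0]    # bytes 0 and 4 of plaintexts 8-15
    saved = v0                     # MOV to a Temp register
    return trn1(saved, v4), trn2(saved, v4)
```

If byte k of plaintext p is given the value 8p + k, every element of the first result is a multiple of 8 (word 0) and every element of the second leaves remainder 4 (word 4), with both registers enumerating the plaintexts in the order 0, 8, 1, 9, and so on.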

Algorithm 3 is the round function of HIGHT with the data parallelism technique applied. The HIGHT optimization of [10] applies both task and data parallelism, whereas Algorithm 3 applies only data parallelism; this data parallelism technique for HIGHT is our contribution in Algorithm 3. Algorithm 3 shows one round function that processes 16 encryptions. The rest of the encryptions are processed with the Temp registers by interleaving the NEON instructions. Step 1 loads the round keys required for one round function into vector registers. Steps 2–9 and Steps 22–29 perform the $F_{1}$ function. Steps 10–11 and Steps 30–31 perform an XOR operation followed by an addition modulo $2^{8}$, and Steps 20–21 and Steps 40–41 perform an addition modulo $2^{8}$ followed by an XOR operation. Finally, Steps 12–19 and Steps 32–39 perform the $F_{0}$ function.

Algorithm 3 Data parallel implementation of HIGHT Round Function.

Require: v0–v7(Plaintexts), v28–v31(Roundkeys)

Ensure: v0–v7(Ciphertexts)

Loading the RK, (x1: RK Address)

1: LD1 {v28.16b-v31.16b}, [x1], #64

$F_{1}$ Function

2: SHL v24.16b, v7.16b, #3

3: SRI v24.16b, v7.16b, #5

4: SHL v25.16b, v7.16b, #4

5: SRI v25.16b, v7.16b, #4

6: EOR v24.16b, v24.16b, v25.16b

7: SHL v25.16b, v7.16b, #6

8: SRI v25.16b, v7.16b, #2

9: EOR v24.16b, v24.16b, v25.16b

Round Function

10: EOR v24.16b, v24.16b, v28.16b

11: ADD v6.16b, v6.16b, v24.16b

$F_{0}$ Function

12: SHL v24.16b, v5.16b, #1

13: SRI v24.16b, v5.16b, #7

14: SHL v25.16b, v5.16b, #2

15: SRI v25.16b, v5.16b, #6

16: EOR v24.16b, v24.16b, v25.16b

17: SHL v25.16b, v5.16b, #7

18: SRI v25.16b, v5.16b, #1

19: EOR v24.16b, v24.16b, v25.16b

Round Function

20: ADD v24.16b, v24.16b, v29.16b

21: EOR v4.16b, v4.16b, v24.16b

$F_{1}$ Function

22: SHL v24.16b, v3.16b, #3

23: SRI v24.16b, v3.16b, #5

24: SHL v25.16b, v3.16b, #4

25: SRI v25.16b, v3.16b, #4

26: EOR v24.16b, v24.16b, v25.16b

27: SHL v25.16b, v3.16b, #6

28: SRI v25.16b, v3.16b, #2

29: EOR v24.16b, v24.16b, v25.16b

Round Function

30: EOR v24.16b, v24.16b, v30.16b

31: ADD v2.16b, v2.16b, v24.16b

$F_{0}$ Function

32: SHL v24.16b, v1.16b, #1

33: SRI v24.16b, v1.16b, #7

34: SHL v25.16b, v1.16b, #2

35: SRI v25.16b, v1.16b, #6

36: EOR v24.16b, v24.16b, v25.16b

37: SHL v25.16b, v1.16b, #7

38: SRI v25.16b, v1.16b, #1

39: EOR v24.16b, v24.16b, v25.16b

Round Function

40: ADD v24.16b, v24.16b, v31.16b

41: EOR v0.16b, v0.16b, v24.16b
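The structure of Algorithm 3 can be summarized in a few lines of Python. The sketch below is a model rather than the authors' code: lanes are plain integers in a list, the names are hypothetical, and each of the four round-key registers (v28–v31) is assumed to contribute a single byte shared by all lanes.

```python
MASK8 = 0xFF


def rol8(x, r):
    """Rotate an 8-bit value left by r bits (SHL #r followed by SRI #(8 - r))."""
    return ((x << r) | (x >> (8 - r))) & MASK8


def f0(x):
    """F0 as computed in Steps 12-19: rotations by 1, 2, and 7."""
    return rol8(x, 1) ^ rol8(x, 2) ^ rol8(x, 7)


def f1(x):
    """F1 as computed in Steps 2-9: rotations by 3, 4, and 6."""
    return rol8(x, 3) ^ rol8(x, 4) ^ rol8(x, 6)


def hight_round_lanes(v, rk):
    """One round in the register layout of Algorithm 3.

    v  : list of eight registers v0..v7, each a list of byte lanes
    rk : four round-key bytes, standing in for v28..v31
    """
    out = [list(reg) for reg in v]
    out[6] = [(a + (f1(b) ^ rk[0])) & MASK8 for a, b in zip(v[6], v[7])]   # XOR, then add
    out[4] = [a ^ ((f0(b) + rk[1]) & MASK8) for a, b in zip(v[4], v[5])]   # add, then XOR
    out[2] = [(a + (f1(b) ^ rk[2])) & MASK8 for a, b in zip(v[2], v[3])]   # XOR, then add
    out[0] = [a ^ ((f0(b) + rk[3]) & MASK8) for a, b in zip(v[0], v[1])]   # add, then XOR
    return out
```

Only the even-numbered registers are rewritten; the byte rotation of the HIGHT state between rounds is not modeled here and is assumed to be handled by how the registers are named in the next round.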

4.1.3. Revised CHAM Optimization

We introduce an efficient data parallel implementation of the revised CHAM family. We describe our approach on the basis of revised CHAM-64/128, because revised CHAM-128/128 and revised CHAM-128/256 are almost the same and differ only in the lane arrangement, which follows from their different word size. In the existing parallel implementation of revised CHAM [19], a transpose process was required to apply data parallelism. In contrast, we have the transpose applied automatically when loading data from memory into four vector registers and when storing data from four vector registers to memory, through the LD4 and ST4 instructions. In addition, the maximum number of encryptions is performed simultaneously by utilizing all vector registers: 48 encryptions in revised CHAM-64/128 and 24 encryptions in the remaining revised CHAM variants. The vector register scheduling for processing the maximum number of encryptions in revised CHAM-64/128 is shown in Figure 10. The register scheduling of the remaining revised CHAM variants is similar to that of LEA, except that the v27 vector register needs to be set as a counter.

Algorithm 4 shows a parallel implementation of the odd and even rounds of revised CHAM-64/128 with the data parallelism technique. Algorithm 4 may look similar to the HIGHT optimized implementation of [10], but that implementation applies both task and data parallelism, whereas Algorithm 4 applies only data parallelism. Our contribution in Algorithm 4 is the data parallelism technique for revised CHAM-64/128, in which the transpose operation required for data parallelism is handled by the LD4 and ST4 instructions without computational overhead. In addition, because only data parallelism is applied, no transpose operation is needed within the round function, unlike the revised CHAM-64/128 optimized implementation of [10]. Algorithm 4 processes eight encryptions in the round function, and the remaining plaintexts are encrypted by efficiently interleaving instructions through the Temp registers. Since revised CHAM-64/128 performs round operations on 16-bit words, the vector registers use the 8h arrangement (eight 16-bit lanes). In Step 1, the plaintexts are automatically transposed by the LD4 instruction while being loaded from memory into four vector registers. Step 2 loads the round keys for four rounds into four vector registers. In Step 3, one word of the plaintexts is XORed with the counter value. In Steps 4–5, the $ROL_{1}$ operation is performed, and in Step 6 the result is XORed with the round key. Step 7 then performs the modular addition of the results of Step 3 and Step 6. Step 8 performs the $ROL_{8}$ operation, the last operation of the odd round; with 16-bit lanes, this rotation is processed efficiently with a single REV16 instruction. Steps 9–10 increase the counter value because one round is over. The even round starts at Step 11, where the counter value is again XORed with one word of the plaintext. Step 12 performs the $ROL_{8}$ operation, and Step 13 XORs the round key with the result of Step 12. After that, Step 14 performs an addition modulo $2^{16}$ of the results of Steps 11 and 13. In Steps 15–16, the even round is completed through the $ROL_{1}$ operation. Finally, Steps 17–18 add 1 to the counter value. Repeating this process 44 times completes the encryption of revised CHAM-64/128, after which the ciphertexts are stored in memory with the ST4 instruction, which applies the transpose automatically. Through efficient vector register scheduling, we process more encryptions simultaneously than the previous work [10].

Algorithm 4 Data parallel implementation of revised CHAM-64/128 Round Function.

Require: v0–v3(Plaintexts), v27(Counter), v28–v31(Roundkeys)

Ensure: v0–v3(Ciphertexts)

Loading the PT and RK, (x0: PT Address, x1: RK Address)

1: LD4 {v0.8h-v3.8h}, [x0], #64

2: LD1 {v28.8h-v31.8h}, [x1], #64

Counter ⊕ PT

3: EOR v0.16b, v0.16b, v27.16b

$R O L_{1}$ Operation

4: SHL v24.8h, v1.8h, #1

5: SRI v24.8h, v1.8h, #15

Roundkey ⊕ $R O L_{1}$

6: EOR v24.16b, v24.16b, v28.16b

Step 3 ⊞ Step 6

7: ADD v0.8h, v24.8h, v0.8h

$R O L_{8}$ Operation

8: REV16 v0.16b, v0.16b

Adding 1 to the counter value

9: MOVI v24.8h, #1

10: ADD v27.8h, v27.8h, v24.8h

Counter ⊕ PT

11: EOR v1.16b, v1.16b, v27.16b

$R O L_{8}$ Operation

12: REV16 v24.16b, v2.16b

Roundkey ⊕ $R O L_{8}$

13: EOR v24.16b, v24.16b, v28.16b

Step 11 ⊞ Step 13

14: ADD v24.8h, v24.8h, v1.8h

$R O L_{1}$ Operation

15: SHL v1.8h, v24.8h, #1

16: SRI v1.8h, v24.8h, #15

Adding 1 to the counter value

17: MOVI v24.8h, #1

18: ADD v27.8h, v27.8h, v24.8h
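For reference, the two round types of Algorithm 4 can be written as a scalar Python model in which every 16-bit lane performs the same computation. This is a sketch under assumptions: it uses the usual revised CHAM round definition with a zero-based round index i (the role played by the counter register v27 in the listing), takes the round key as a plain argument, and uses hypothetical function names. It also checks the observation that REV16 on a 16-bit lane equals $ROL_{8}$.

```python
MASK16 = 0xFFFF


def rol16(x, r):
    """Rotate a 16-bit word left by r bits (0 < r < 16)."""
    return ((x << r) | (x >> (16 - r))) & MASK16


def rev16(x):
    """REV16 restricted to one 16-bit lane: swap its two bytes."""
    return ((x & 0xFF) << 8) | (x >> 8)


def cham64_round(state, rk, i):
    """One round of revised CHAM-64/128 on a single block.

    state : (x0, x1, x2, x3), four 16-bit words
    rk    : round key word for this round
    i     : zero-based round index (assumption of this sketch)
    """
    x0, x1, x2, x3 = state
    if i % 2 == 0:
        # First round type (Steps 3-8): ROL1 on x1, ROL8 on the sum
        new = rev16(((x0 ^ i) + (rol16(x1, 1) ^ rk)) & MASK16)
    else:
        # Second round type (Steps 11-16): ROL8 on x1, ROL1 on the sum
        new = rol16(((x0 ^ i) + (rev16(x1) ^ rk)) & MASK16, 1)
    return (x1, x2, x3, new)
```

Because the byte swap replaces a shift pair, the first round type needs one instruction fewer for its final rotation than a generic SHL/SRI rotation would.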

4.2. Parallel Implementation of CTR Mode of Operation on the NEON Engine

In this section, we propose the parallel implementation of the CTR mode optimization, which works as follows. The input block of CTR mode is composed of a nonce and a counter; the nonce part is fixed and the counter part varies from block to block. Because the nonce part is the same for every block encrypted in CTR mode, some round operations of the initial few rounds can be pre-computed. Therefore, CTR mode encryption can reduce the computational overhead by looking up a pre-computation table rather than computing the round operations of the initial few rounds. We use the existing optimization of CTR mode for HIGHT and revised CHAM [6,7], and, in the case of LEA, we optimize one more round than the existing work [7]. With this CTR mode optimization, the operations of the round function are reduced by table lookups for up to four rounds for LEA, five rounds for HIGHT, and seven rounds for revised CHAM. Furthermore, we apply the proposed parallel implementation to the CTR mode optimization to encrypt multiple plaintexts simultaneously. By combining the CTR mode optimization with the parallel processing of a large number of encryptions, we improve the performance of CTR mode on the ARMv8 architecture.
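The property that the optimization relies on is visible in a generic CTR mode sketch: every input block shares the same nonce bytes and differs only in the counter. The Python code below is a minimal illustration, assuming a nonce-then-counter layout with a big-endian 32-bit counter; the function names are hypothetical, and `toy_block` is a placeholder permutation, not a cipher.

```python
def ctr_input_blocks(nonce, first_counter, count, counter_bytes=4):
    """Build the CTR input blocks nonce || counter (layout assumed by this sketch)."""
    modulus = 1 << (8 * counter_bytes)
    return [nonce + ((first_counter + i) % modulus).to_bytes(counter_bytes, "big")
            for i in range(count)]


def ctr_xor(encrypt_block, nonce, data, counter_bytes=4):
    """Encrypt or decrypt data in CTR mode with the given block function."""
    block_len = len(nonce) + counter_bytes
    count = (len(data) + block_len - 1) // block_len
    out = bytearray()
    for i, block in enumerate(ctr_input_blocks(nonce, 0, count, counter_bytes)):
        keystream = encrypt_block(block)
        chunk = data[i * block_len:(i + 1) * block_len]
        out.extend(a ^ b for a, b in zip(chunk, keystream))
    return bytes(out)


def toy_block(block):
    """Placeholder byte permutation, NOT a cipher; it only makes the sketch runnable."""
    return bytes((7 * b + 3) & 0xFF for b in reversed(block))
```

Since the input blocks are independent of each other, they can be placed in separate vector lanes and encrypted at the same time, and any round computation that touches only the nonce bytes gives the same result for every block.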

4.2.1. LEA-CTR Optimization

We describe how to efficiently process multiple encryptions simultaneously by applying the proposed LEA-CTR mode optimization. In CTR mode, the positions of the counter and the nonce within the block are not prescribed. The size of the counter is not prescribed either; for security reasons, counters on 32-bit processors typically have a period of $2^{32}$. Unlike the previous work [7], we change the counter position so that the technique can be applied to one more round. Figure 11 shows the optimization of the LEA-CTR mode, which bypasses some round operations of the initial four rounds by using the fixed nonce value. To describe the round functions up to the optimized rounds, we write $X_{i,j}$ ($i \in [0, 4]$, $j \in [0, 3]$) for word $j$ of the state after round $i$ is completed. The $X_{0,0}$ word, which is part of the input block, is the variable counter, and the remaining $X_{0,1}$, $X_{0,2}$, and $X_{0,3}$ words are the fixed nonce. In Round 1, all words except $X_{0,0}$ are nonce words, so the corresponding round operations can be skipped through pre-computation. In Round 2, the $X_{1,2}$ and $X_{1,3}$ words still depend only on the fixed nonce, so their round operations are bypassed by looking up the pre-computation table. The remaining $X_{1,0}$ and $X_{1,1}$ words are affected by the variable counter, so round operations are required for them; these use values pre-computed from the fixed nonce. In Round 3, round function operations are required for the $X_{2,1}$ and $X_{2,0}$ words, which are affected by the variable counter in Round 2 and Round 3, and for the $X_{2,2}$ word, which is affected by the counter from the previous rounds. Some of these round operations are performed by looking up the pre-computation table rather than by computation. In Round 4, all words except $X_{3,0}$ are affected by the variable counter, so the round operations of Round 4 are partially reduced through pre-computation on the $X_{3,0}$ word.

The parallel implementation of LEA-CTR mode processes the same number of encryptions simultaneously as the proposed parallel implementation of LEA. Its register scheduling is therefore similar to that of the proposed parallel implementation of LEA, with the following differences. The v26–v28 vector registers store the pre-computation table during Rounds 1–4, and the remaining v29–v31 vector registers store the round keys. After Round 4, the same register scheduling as in the parallel implementation of LEA is applied. Algorithm 5 shows the optimization of the initial four rounds in LEA-CTR mode for four plaintexts. The rest of the plaintexts are processed by efficiently interleaving NEON instructions using the Temp registers. In Steps 1–2, the round keys and the pre-computation table for Rounds 1–2 are loaded into vector registers, respectively. In Steps 3–6, which perform Round 1, only the $X_{0,0}$ word, the variable counter, requires round operations. Bypassing the round operations for the remaining words reduces the computational overhead of Round 1.

Steps 7–14 show the round operations of Round 2. Steps 7–10 are the round operations for the $X_{1,0}$ word, which is affected by the variable counter in Round 2. Steps 11–14 perform the round operations of the word containing the variable counter; the remaining $X_{1,2}$ and $X_{1,3}$ words depend only on the nonce, so no round operations are required for them. In Steps 15–16, the round keys and the pre-computation table to be used in Round 3 and Round 4 are loaded into vector registers. Steps 17–29 represent the round operations of Round 3. Steps 17–20 are the round operations for the $X_{2,0}$ word affected by the variable counter in Round 3, and Steps 21–25 are the round operations for the word affected by the variable counter in Round 2. In Steps 26–29, the word including the counter undergoes the round operations. In Step 30, the round keys for one round are loaded into the vector registers. In Steps 31–32, one round operation is reduced by looking up the last pre-computation table. After Round 4, the procedure is the same as the parallel implementation of LEA. In LEA-192 and LEA-256, only the number of rounds differs from that of LEA-128, so the CTR mode can be optimized in the same way.

Algorithm 5 Parallel implementation of LEA-128 CTR mode optimization.

Require: v0–v3(Plaintexts), v26–v28(Table), v29–v31(Roundkeys)

Ensure: v0–v3(Ciphertexts)

Loading the RK and the table, (x1: RK Address, x2: Pre-computation table Address)

1: LD1 {v29.4s-v31.4s}, [x1], #48

2: LD1 {v26.4s-v28.4s}, [x2], #48

1 Round Function

3: EOR v3.16b, v3.16b, v29.16b

4: ADD v24.4s, v3.4s, v26.4s

5: SHL v3.4s, v24.4s, #29

6: SRI v3.4s, v24.4s, #3

2 Round Function

7: EOR v24.16b, v3.16b, v30.16b

8: ADD v24.4s, v24.4s, v27.4s

9: SHL v0.4s, v24.4s, #29

10: SRI v0.4s, v24.4s, #3

11: EOR v24.16b, v3.16b, v31.16b

12: ADD v24.4s, v24.4s, v28.4s

13: SHL v3.4s, v24.4s, #27

14: SRI v3.4s, v24.4s, #5

Loading the RK and table

15: LD1 {v29.4s-v31.4s}, [x1], #48

16: LD1 {v26.4s-v28.4s}, [x2], #48

3 Round Function

17: EOR v24.16b, v0.16b, v30.16b

18: ADD v24.4s, v24.4s, v26.4s

19: SHL v1.4s, v24.4s, #29

20: SRI v1.4s, v24.4s, #3

21: EOR v0.16b, v0.16b, v31.16b

22: EOR v24.16b, v3.16b, v29.16b

23: ADD v24.4s, v24.4s, v0.4s

24: SHL v0.4s, v24.4s, #27

25: SRI v0.4s, v24.4s, #5

26: EOR v3.16b, v3.16b, v31.16b

27: ADD v24.4s, v27.4s, v3.4s

28: SHL v3.4s, v24.4s, #9

29: SRI v3.4s, v24.4s, #23

Loading the RK

30: LD1 {v28.4s-v31.4s}, [x1], #64

4 Round Function

31: EOR v24.16b, v1.16b, v30.16b

32: ADD v24.4s, v24.4s, v28.4s
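The saving in Round 1 can be illustrated with a scalar Python sketch. It is not the authors' table layout: it assumes the standard LEA round definition with six round-key words per round and the counter in word 0, as in the description above, and it does not reproduce the modified counter position that the paper uses to gain a further round. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def ror32(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK32


def lea_round(x, rk):
    """Reference LEA round on one block x = (x0, x1, x2, x3), rk = six key words."""
    x0, x1, x2, x3 = x
    return (rol32(((x0 ^ rk[0]) + (x1 ^ rk[1])) & MASK32, 9),
            ror32(((x1 ^ rk[2]) + (x2 ^ rk[3])) & MASK32, 5),
            ror32(((x2 ^ rk[4]) + (x3 ^ rk[5])) & MASK32, 3),
            x0)


def precompute_round1(nonce_words, rk):
    """Everything in Round 1 that depends only on the nonce words (n1, n2, n3)."""
    n1, n2, n3 = nonce_words
    addend = n1 ^ rk[1]                                        # operand of the only live addition
    y1 = ror32(((n1 ^ rk[2]) + (n2 ^ rk[3])) & MASK32, 5)      # constant output word
    y2 = ror32(((n2 ^ rk[4]) + (n3 ^ rk[5])) & MASK32, 3)      # constant output word
    return addend, y1, y2


def round1_with_table(counter, table, rk):
    """Round 1 for one counter value: one XOR, one addition, one rotation."""
    addend, y1, y2 = table
    return (rol32(((counter ^ rk[0]) + addend) & MASK32, 9), y1, y2, counter)
```

The table is computed once per nonce and key, after which each block costs three word operations in Round 1 instead of eleven; this is the same EOR, ADD, rotate pattern as Steps 3–6 of Algorithm 5.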

4.2.2. HIGHT-CTR Optimization

In the HIGHT-CTR mode, we propose a parallel implementation that processes 48 encryptions simultaneously by applying the proposed data parallelism technique to the HIGHT-CTR mode optimization. As the optimization method, we use the CTR mode optimization proposed in the existing work [7], which is shown in Figure 6. The 64-bit block of HIGHT consists of a 32-bit variable counter and a 32-bit fixed nonce. As in the LEA-CTR optimization, some round operations of up to five rounds are bypassed through pre-computation of the round operations for the fixed nonce part. To implement the HIGHT-CTR mode optimization in parallel, we use register scheduling similar to that of the proposed parallel implementation of HIGHT. The round function of HIGHT requires more Temp registers than the other block ciphers (LEA and CHAM) to store intermediate values. Thus, when the RK and table values are loaded from memory into registers, three vector registers are used for each, and the remaining vector registers are used as Temp registers.

Algorithm 6 shows the parallel implementation of the initial Rounds 1–5 of the optimized HIGHT-CTR mode, which looks up the pre-computed values for the fixed nonce part. After Round 5, the procedure is the same as the proposed parallel implementation of HIGHT, and the remaining plaintexts are encrypted by efficiently interleaving NEON instructions using the Temp registers. In Round 1, only the four words $X_{0,0}$–$X_{0,3}$, the counter value, need round operations; the remaining words $X_{0,4}$–$X_{0,7}$ are pre-computed from the fixed part, so no round operations are required for them. In Round 2, since the $X_{1,1}$ and $X_{1,5}$ words are part of the variable counter, their round operations are performed with the help of the pre-computation table. Steps 3–13 show the round function for the $X_{1,1}$ and $X_{1,5}$ words. In Step 3, the round operation for the $X_{1,1}$ word is performed with a single ADD instruction using the pre-computed value. Steps 4–13 are the round function for the $X_{1,5}$ word, which also uses a pre-computed value.

In Round 3, the $X_{2,7}$ word is affected by the counter, so its intermediate value can no longer be pre-computed and a round operation is required. In Steps 15–25, the round function for the $X_{2,7}$ word is performed: the $F_{0}$ function is applied to the $X_{2,6}$ word, and the result is combined with the pre-computed $X_{2,7}$ value through a modular addition and an XOR operation. In Round 4, round operations are required for the $X_{3,1}$ and $X_{3,3}$ words, which are affected by the counter. Steps 27–36 perform the round function for the $X_{3,1}$ word, and Steps 37–38 process the round function for the $X_{3,3}$ word with a single XOR instruction after loading the pre-computation table into vector registers. In Round 5, all words are affected by the counter, so pre-computation is no longer possible and all words require round operations. In Steps 40–48, the last pre-computed value is used to perform the round function of the $X_{4,3}$ word. Through the HIGHT-CTR mode optimization, some round operations of the initial five rounds are processed as simple table lookups. After Round 5, encryption is performed in the same way as in the proposed parallel implementation of HIGHT.

Algorithm 6 Parallel implementation of HIGHT-CTR mode optimization.

Require: v0–v7(Plaintexts), v24–v31(Roundkeys and Table)

Ensure: v0–v7(Ciphertexts)

Loading RK and the table, (x1: RK Address, x2: pre-computation table Address)

1: LD1 {v30.16b-v31.16b}, [x1], #32

2: LD1 {v28.16b-v29.16b}, [x2], #32

2 Round Function

3: ADD v7.16b, v7.16b, v28.16b

4: SHL v24.16b, v4.16b, #3

5: SRI v24.16b, v4.16b, #5

6: SHL v25.16b, v4.16b, #4

7: SRI v25.16b, v4.16b, #4

8: EOR v24.16b, v24.16b, v25.16b

9: SHL v25.16b, v4.16b, #6

10: SRI v25.16b, v4.16b, #2

11: EOR v24.16b, v24.16b, v25.16b

12: EOR v24.16b, v31.16b, v24.16b

13: ADD v3.16b, v29.16b, v24.16b

Loading RK

14: LD1 {v29.16b-v31.16b}, [x1], #48

3 Round Function

15: SHL v24.16b, v3.16b, #1

16: SRI v24.16b, v3.16b, #7

17: SHL v25.16b, v3.16b, #2

18: SRI v25.16b, v3.16b, #6

19: EOR v24.16b, v24.16b, v25.16b

20: SHL v25.16b, v3.16b, #7

21: SRI v25.16b, v3.16b, #1

22: EOR v24.16b, v24.16b, v25.16b

23: ADD v24.16b, v31.16b, v24.16b

Loading table

24: LD1 {v30.16b-v31.16b}, [x2], #32

25: EOR v2.16b, v24.16b, v30.16b

Loading RK

26: LD1 {v28.16b-v30.16b}, [x1], #48

4 Round Function

27: SHL v24.16b, v2.16b, #3

28: SRI v24.16b, v2.16b, #5

29: SHL v25.16b, v2.16b, #4

30: SRI v25.16b, v2.16b, #4

31: EOR v24.16b, v24.16b, v25.16b

32: SHL v25.16b, v2.16b, #6

33: SRI v25.16b, v2.16b, #2

34: EOR v24.16b, v24.16b, v25.16b

35: EOR v24.16b, v30.16b, v24.16b

36: ADD v1.16b, v31.16b, v24.16b

Loading table

37: LD1 {v30.16b-v31.16b}, [x2]

38: EOR v7.16b, v7.16b, v30.16b

Loading RK

39: LD1 {v28.16b-v31.16b}, [x1], #64

5 Round Function

40: SHL v24.16b, v1.16b, #1

41: SRI v24.16b, v1.16b, #7

42: SHL v25.16b, v1.16b, #2

43: SRI v25.16b, v1.16b, #6

44: EOR v24.16b, v24.16b, v25.16b

45: SHL v25.16b, v1.16b, #7

46: SRI v25.16b, v1.16b, #1

47: EOR v24.16b, v24.16b, v25.16b

48: ADD v24.16b, v31.16b, v24.16b

4.2.3. Revised CHAM-CTR Optimization

In this section, we present a parallel implementation of the revised CHAM-CTR mode optimization. Since revised CHAM-128/128 and CHAM-128/256 are almost the same as revised CHAM-64/128 except for the word size, the proposed parallel implementation of CTR mode carries over to them by changing only the lane arrangement of the vector registers. Thus, we explain the method on the basis of revised CHAM-64/128. Like HIGHT, revised CHAM-64/128 encrypts 64-bit blocks, and our target platform is a 64-bit processor, so the block of CTR mode typically consists of a 32-bit variable counter and a 32-bit fixed nonce. The revised CHAM-CTR mode optimization is the same as in the existing work [6] and is shown in Figure 5. The optimization of CHAM-CTR mode proposed in [6] reduces the computational overhead of the round functions by pre-computing some operations of the initial few rounds from the fixed nonce part, in the same way as the HIGHT-CTR and LEA-CTR optimizations. To process as many encryptions as possible simultaneously, the proposed parallel implementation of the revised CHAM-CTR mode uses the same register scheduling as the proposed parallel implementation of revised CHAM, except that the v28–v31 vector registers store the RK and the pre-computation table.

Algorithm 7 shows the parallel implementation of the initial seven rounds in the CTR mode optimization of revised CHAM-64/128. After these seven rounds, the round functions are performed in the same way as in the proposed parallel implementation of revised CHAM-64/128. Algorithm 7 shows the round functions up to Round 7 for eight plaintexts; the remaining plaintexts are encrypted by interleaving NEON instructions using the Temp registers. As shown in Figure 5, a block of the revised CHAM-CTR mode consists of the $X_{0,2}$ and $X_{0,3}$ words, which form the fixed nonce part, and the $X_{0,0}$ and $X_{0,1}$ words, which form the variable counter part. In Step 1, the round keys used for Round 1 are loaded into vector registers. Round 1 operates on the $X_{0,0}$ and $X_{0,1}$ words. Since both belong to the variable counter part, pre-computation is impossible, and the round function is performed normally. In Step 2, the pre-computed values to be used in Round 2 and Round 4 are loaded into vector registers. In Round 2, the round function is performed for the $X_{1,0}$ word with the $X_{1,1}$ word. Here the pre-computed value is used, which removes the $ROL_{8}$ and XOR operations that would otherwise be performed. In Steps 3–8, the round operations for the $X_{1,0}$ word are processed using the pre-computed value.

Algorithm 7 Parallel implementation of the revised CHAM-64/128-CTR mode optimization.

Require: v0–v3(Plaintexts), v28–v31(Roundkeys and Table)

Ensure: v0–v3(Ciphertexts)

Loading RK (x1: RK Address)

1: LD1 {v29.8h}, [x1], #16

Loading the table (x2: pre-computation table Address)

2: LD1 {v28.8h-v29.8h}, [x2], #32

2 Round Function

3: EOR v1.16b, v27.16b, v1.16b

4: ADD v24.8h, v1.8h, v28.8h

5: SHL v1.8h, v24.8h, #1

6: SRI v1.8h, v24.8h, #15

7: MOVI v29.8h, #1

8: ADD v27.8h, v27.8h, v29.8h

3 Round Function

9: MOVI v29.8h, #1

10: ADD v27.8h, v27.8h, v29.8h

Loading RK

11: LD1 {v30.8h-v31.8h}, [x1], #32

4 Round Function

12: REV16 v24.16b, v0.16b

13: EOR v24.16b, v30.16b, v24.16b

14: ADD v24.8h, v29.8h, v24.8h

15: SHL v3.8h, v24.8h, #1

16: SRI v3.8h, v24.8h, #15

17: MOVI v29.8h, #1

18: ADD v27.8h, v27.8h, v29.8h

Loading the table

19: LD1 {v30.8h-v31.8h}, [x2]

6 Round Function

20: EOR v1.16b, v27.16b, v1.16b

21: ADD v24.8h, v1.8h, v30.8h

22: SHL v1.8h, v24.8h, #1

23: SRI v1.8h, v24.8h, #15

24: MOVI v29.8h, #1

25: ADD v27.8h, v27.8h, v29.8h

Loading RK

26: LD1 {v28.8h-v29.8h}, [x1], #32

7 Round Function

27: SHL v24.8h, v3.8h, #1

28: SRI v24.8h, v3.8h, #15

29: EOR v24.16b, v24.16b, v28.16b

30: ADD v24.8h, v24.8h, v31.8h

31: REV16 v2.16b, v24.16b

32: MOVI v29.8h, #1

33: ADD v27.8h, v27.8h, v29.8h

In Round 3, the round function operates on the $X_{2,0}$ and $X_{2,1}$ words, but both belong to the fixed nonce part and can be pre-computed, so no round operations are required for Round 3. Steps 9–10 simply increase the counter value. In Step 11, the round keys to be used in Round 4 and Round 5 are loaded into vector registers. In Round 4, the round involving the pre-computed $X_{3,0}$ word is affected by the variable counter, so the round function must be computed. In Steps 12–18, the round operations are performed for the $X_{3,0}$ word. In Step 19, the last pre-computed values, to be used in Round 6 and Round 7, are loaded into the vector registers. In Round 5, both the $X_{4,0}$ and $X_{4,1}$ words are affected by the counter, so the round operations are performed normally. In Round 6, the round function for the $X_{5,0}$ word is performed with the pre-computed $X_{5,1}$ word. In Steps 20–25, the round function is completed with only the $ROL_{1}$, modular addition, and XOR operations, thanks to the pre-computed $X_{5,1}$ word. In Round 7, the round function is performed using the last pre-computed value. The subsequent rounds are the same as in the proposed parallel implementation of revised CHAM-64/128. Through the proposed parallel implementation of the revised CHAM-CTR mode optimization, some round operations of the first seven rounds are bypassed while the maximum number of encryptions is processed simultaneously.
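The first three rounds of this walkthrough can be reproduced with a scalar Python sketch and compared against a straightforward reference. The sketch assumes the usual revised CHAM round definition with a zero-based round index, places the counter in words 0 and 1 and the nonce in words 2 and 3 as described above, and uses hypothetical names and a hypothetical two-entry table; it is not the table layout of Algorithm 7.

```python
MASK16 = 0xFFFF


def rol16(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK16


def cham64_round(state, rk, i):
    """Reference round of revised CHAM-64/128; i is the zero-based round index."""
    x0, x1, x2, x3 = state
    if i % 2 == 0:
        new = rol16(((x0 ^ i) + (rol16(x1, 1) ^ rk)) & MASK16, 8)
    else:
        new = rol16(((x0 ^ i) + (rol16(x1, 8) ^ rk)) & MASK16, 1)
    return (x1, x2, x3, new)


def precompute(nonce, rk):
    """Values of Rounds 2 and 3 that depend only on the nonce words (x2, x3)."""
    x2, x3 = nonce
    t2 = rol16(x2, 8) ^ rk[1]                                    # ROL8 and XOR of Round 2
    r3 = rol16(((x2 ^ 2) + (rol16(x3, 1) ^ rk[2])) & MASK16, 8)  # all of Round 3
    return t2, r3


def first_three_rounds(counter, nonce, table, rk):
    """Rounds 1-3 for one block, using the table for the nonce-only parts."""
    x0, x1 = counter
    _, x3 = nonce
    t2, r3 = table
    a = rol16((x0 + (rol16(x1, 1) ^ rk[0])) & MASK16, 8)   # Round 1: full round (x0 ^ 0 == x0)
    b = rol16(((x1 ^ 1) + t2) & MASK16, 1)                 # Round 2: XOR, add, ROL1
    return (x3, a, b, r3)                                  # Round 3: table lookup only
```

Round 2 keeps only the XOR with the round index, the addition, and $ROL_{1}$, while Round 3 costs nothing per block, which matches the roles of Steps 3–8 and Steps 9–10.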

5. Evaluation

We evaluate the proposed data parallelism technique and CTR mode optimization technique on the Raspberry Pi 4B [21], a widely used embedded device. The Raspberry Pi 4B offers up to 4 GB of memory, an upgrade over previous models, and uses the ARM Cortex-A72, a 64-bit processor, as its microcontroller unit. We compiled with aarch64-linux-gnu-gcc in the Ubuntu 19.10 environment and used VS Codium as the development environment. To our knowledge, there are no prior studies on the optimization of CTR mode using LEA, HIGHT, and revised CHAM on the ARMv8 architecture. Thus, we compare the performance of our work with the existing optimization works [9,10] for LEA, HIGHT, and revised CHAM on the ARMv8 architecture.

5.1. Parallel Implementation of LEA-CTR Mode on the NEON Engine

Table 5 shows the performance comparison of the LEA parallel implementations. Seo et al. [9] presented a parallel implementation of LEA utilizing the NEON engine on the Apple A7 and Apple A9. A total of 24 encryptions were performed simultaneously using all vector registers, but a transpose process was required to apply data parallelism. For a fair comparison, we ported the implementation of Seo et al. to our target device, the ARM Cortex-A72. Our Work 1 is the proposed parallel implementation of LEA and processes 24 encryptions simultaneously, as in [9]. We reduce the computational overhead by applying the transpose operation automatically when loading data from memory to vector registers and storing data from vector registers to memory. By removing the explicit transpose process, Our Work 1 shows 3.09%, 2.77%, and 2.63% performance improvement in LEA-128, LEA-192, and LEA-256, respectively, over the implementation of [9] ported to our target platform. These results concern the optimization of the block cipher alone. IoT devices send massive amounts of data in the real world, so a mode of operation should be applied when encrypting. Among the modes, CTR mode is widely used in industry. Our Work 2 is the proposed CTR mode optimization technique and also performs 24 encryptions simultaneously. Because the CTR mode optimization is applied, the round operations of the initial few rounds are replaced by lookups in the pre-computation table, which reduces the computational overhead of the round operations. Moreover, by pre-computing one round more than the previous work [7], Our Work 2 achieves 8.76%, 6.39%, and 5.82% performance improvement in LEA-128, LEA-192, and LEA-256, respectively, over [9]. Our Work 2 is thus faster than both the previous study [9] and Our Work 1.

5.2. Parallel Implementation of HIGHT-CTR and Revised CHAM-CTR Mode on the NEON Engine

Table 6 compares the performance of the HIGHT and revised CHAM parallel implementations. Song et al. [10] presented secure and fast implementations of HIGHT and revised CHAM using the NEON engine on the ARMv8 platform. The fast implementation applies task and data parallelism to process multiple operations and data simultaneously. Using all the vector registers, it processes 24 encryptions for HIGHT, 16 for revised CHAM-64/128, 10 for revised CHAM-128/128, and 8 for revised CHAM-128/256 simultaneously. In the secure implementation, Song et al. [10] optimized random shuffling, the core operation of fault attack countermeasures, using the NEON engine. Our Work 1 for HIGHT is the proposed parallel implementation with an efficient transpose operation. By scheduling the vector registers more efficiently than the previous work [10], it processes 48 encryptions simultaneously, twice as many as before. As a result, Our Work 1 for HIGHT achieves about a 5.26% performance improvement over the existing fast implementation of HIGHT [10]. Our Work 2 for HIGHT is a parallel implementation of the HIGHT-CTR mode and, like Our Work 1, processes 48 encryptions simultaneously. In addition, the CTR mode optimization reduces the computational overhead by pre-computing part of the round operations for up to the first five rounds. With the proposed parallel techniques and the CTR mode optimization, it shows an 8.62% performance improvement over the previous work [10].

Our Work 1 for revised CHAM is our proposed parallel implementation. Through efficient vector register scheduling, it processes 48, 24, and 24 encryptions simultaneously for revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively, more than the previous work in each case. As with LEA, the transpose is applied at no additional cost when loading data from memory into four vector registers and storing four vector registers back to memory. With these techniques, Our Work 1 for the revised CHAM family shows improvements of 9.52%, 1.52%, and 4.02% over the previous work [10]. Finally, Our Work 2 for revised CHAM is a parallel implementation with the CTR mode optimization. It processes the same number of encryptions simultaneously as Our Work 1. In addition, part of the round operations for up to the first eight rounds is pre-computed to reduce the computational overhead of the round function. As a result, it achieves performance improvements of 15.87%, 2.94%, and 5.36% for revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively, over the previous work [10]. To our knowledge, Our Work 2 gives the fastest CTR mode results reported for these ciphers on the ARMv8 architecture.
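The "transpose for free" effect of loading into four vector registers can be modelled as follows. This is a minimal Python model of the de-interleaving behaviour of the `LD4` and `ST4` instructions for 32-bit lanes, not the authors' code; the function names `ld4` and `st4` and the labelling of the memory words are assumptions made for the example.

```python
def ld4(words):
    """Model LD4 with 32-bit lanes: de-interleave 16 consecutive words
    into four registers of four lanes each."""
    assert len(words) == 16
    return [words[r::4] for r in range(4)]

def st4(regs):
    """Model ST4 with 32-bit lanes: interleave four 4-lane registers
    back into 16 consecutive words."""
    return [regs[r][lane] for lane in range(4) for r in range(4)]

# Four 128-bit blocks stored back to back; word j of block i is labelled 10*i + j.
memory = [10 * i + j for i in range(4) for j in range(4)]
regs = ld4(memory)

for r, reg in enumerate(regs):
    print(f"register {r}: {reg}")  # register r holds word r of every block
```

After the load, each register holds the same word position from four different blocks, which is the layout a word-wise ARX round needs in order to process four blocks per instruction. The store reverses the mapping, so no separate `TRN1`/`TRN2` sequence is required in this model.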

6. Conclusions

In this article, we have presented parallel implementations of the CTR mode of ARX-based block ciphers (LEA, HIGHT, and revised CHAM) for data security on embedded devices based on the ARMv8 architecture. For the parallel implementation, we have presented data parallelism techniques. For LEA and revised CHAM, we have eliminated the separate transpose operation otherwise required for data parallelism, and for HIGHT we have proposed an optimized transpose operation. In addition, to process the maximum number of encryptions simultaneously, we have presented a register scheduling that efficiently uses all vector registers. As a result, 24 encryptions are performed simultaneously for LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 for HIGHT and revised CHAM-64/128; HIGHT and revised CHAM thus process more encryptions simultaneously than in previous work. Through the data parallelism technique and register scheduling, performance improved by 3.09%, 5.26%, and 9.52% over the previous studies for LEA-128, HIGHT, and revised CHAM-64/128, respectively. Most studies to date have optimized the block cipher alone. However, encrypting massive amounts of data requires a mode of operation, and the CTR mode is widely used in industrial applications. We applied the proposed parallel technique to CTR mode optimization to process multiple encryptions efficiently. In addition, exploiting the structure of the CTR mode, we pre-computed the operations of the initial few rounds that depend only on the nonce part. In particular, for LEA-CTR we pre-computed one more round than the previous work by changing the position of the nonce part. With the proposed optimization techniques, the CTR modes of LEA-128, HIGHT, and revised CHAM-64/128 showed 8.76%, 8.62%, and 15.87% performance improvements, respectively, over the previous best results. The proposed CTR mode implementations are faster than both the previous work and our data-parallel implementations.
In the future, we will study other lightweight ciphers such as SIMECK, SPECK, and SKINNY and apply the proposed parallel implementation of the CTR mode to them. Our work contributes to fast encryption on ARMv8-based IoT devices through the proposed parallel implementations of the CTR mode of ARX-based ciphers. We believe that our software can serve as a cornerstone for building secure IoT services and applications such as smart cities, smart factories, and autonomous driving.

Author Contributions

Writing—original draft, J.S.; Writing—review and editing, S.C.S. Both authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1058494).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hong, D.; Lee, J.K.; Kim, D.C.; Kwon, D.; Ryu, K.H.; Lee, D.G. LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors. In Proceedings of the 14th International Workshop, WISA 2013, Jeju Island, Korea, 19–21 August 2013; Revised Selected Papers. Springer: Cham, Switzerland, 2013; pp. 3–27.
2. Hong, D.; Sung, J.; Hong, S.; Lim, J.; Lee, S.; Koo, B.S.; Lee, C.; Chang, D.; Lee, J.; Jeong, K.; et al. HIGHT: A new block cipher suitable for low-resource device. In Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Yokohama, Japan, 10–13 October 2006; pp. 46–59.
3. Roh, D.; Koo, B.; Jung, Y.; Jeong, I.W.; Lee, D.G.; Kwon, D. Revised Version of Block Cipher CHAM. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 1–19.
4. Park, J.H.; Lee, D.H. FACE: Fast AES CTR mode Encryption Techniques based on the Reuse of Repetitive Data. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 469–499.
5. Kim, K.; Choi, S.; Kwon, H.; Liu, Z.; Seo, H. FACE–LIGHT: Fast AES–CTR Mode Encryption for Low-End Microcontrollers. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 102–114.
6. Kwon, H.; An, S.; Kim, Y.; Kim, H.; Choi, S.J.; Jang, K.; Park, J.; Kim, H.; Seo, S.C.; Seo, H. Designing a CHAM Block Cipher on Low-End Microcontrollers for Internet of Things. Electronics 2020, 9, 1548.
7. Kim, Y.; Kwon, H.; An, S.; Seo, H.; Seo, S.C. Efficient Implementation of ARX-Based Block Ciphers on 8-Bit AVR Microcontrollers. Electronics 2020, 8, 1837.
8. Kim, Y.; Seo, S.C. An Efficient Implementation of AES on 8-Bit AVR-Based Sensor Nodes. In Information Security Applications; Springer International Publishing: Cham, Switzerland, 2020; pp. 276–290.
9. Seo, H. High Speed Implementation of LEA on ARMv8. J. Korea Inst. Inf. Commun. Eng. 2017, 21, 1929–1934.
10. Song, J.; Seo, S.C. Secure and Fast Implementation of ARX-Based Block Ciphers Using ASIMD Instructions in ARMv8 Platforms. IEEE Access 2020, 8, 193138–193153.
11. Yang, G.; Zhu, B.; Suder, V.; Aagaard, M.D.; Gong, G. The Simeck Family of Lightweight Block Ciphers. In Cryptographic Hardware and Embedded Systems—CHES 2015, Proceedings of the 17th International Workshop, Saint-Malo, France, 13–16 September 2015; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9293, pp. 307–329.
12. Beierle, C.; Jean, J.; Kölbl, S.; Leander, G.; Moradi, A.; Peyrin, T.; Sasaki, Y.; Sasdrich, P.; Sim, S.M. The SKINNY Family of Block Ciphers and Its Low-Latency Variant MANTIS. In Advances in Cryptology—CRYPTO 2016, Proceedings of the 36th Annual International Cryptology Conference, Santa Barbara, CA, USA, 14–18 August 2016; Lecture Notes in Computer Science; Proceedings, Part II; Robshaw, M., Katz, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9815, pp. 123–153.
13. Arm® A64 Instruction Set Architecture: Armv8, for Armv8-A Architecture Profile. Available online: https://developer.arm.com/docs/ddi0596/c/simd-and-floating-point-instructions-alphabetic-order (accessed on 2 February 2021).
14. ISO. ISO/IEC 29192-2: 2019: Information Security—Lightweight Cryptography—Part 2: Block Ciphers; International Organization for Standardization: Geneva, Switzerland, 2019.
15. ISO. ISO/IEC 18033-3: 2010: Information Technology—Security Techniques—Encryption Algorithms—Part 3: Block Ciphers; International Organization for Standardization: Geneva, Switzerland, 2010.
16. Koo, B.; Roh, D.; Kim, H.; Jung, Y.; Lee, D.G.; Kwon, D. CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices. In Proceedings of the International Conference on Information Security and Cryptology (ICISC’17), Seoul, Korea, 29 November–1 December 2017.
17. Beaulieu, R.; Shors, D.; Smith, J.; Treatman-Clark, S.; Weeks, B.; Wingers, L. The SIMON and SPECK lightweight block ciphers. In Proceedings of the 52nd Annual Design Automation Conference, San Francisco, CA, USA, 7–11 June 2015; pp. 175:1–175:6.
18. Seo, H.; Park, T.; Heo, S.; Seo, G.; Bae, B.; Hu, Z.; Zhou, L.; Nogami, Y.; Zhu, Y.; Kim, H. Parallel Implementations of LEA, Revisited. In Information Security Applications—WISA 2016, Proceedings of the 17th International Workshop, Jeju Island, Korea, 25–27 August 2016; Revised Selected Papers; Springer: Cham, Switzerland, 2016; pp. 318–330.
19. Seo, H.; An, K.; Kwon, H.; Park, T.; Hu, Z.; Kim, H. Parallel Implementations of CHAM. In Information Security Applications—WISA 2018, Proceedings of the 19th International Conference, Jeju Island, Korea, 23–25 August 2018; Revised Selected Papers; Springer: Cham, Switzerland, 2018; pp. 93–104.
20. Fujii, H.; Carvalho Rodrigues, F.; López, J. Fast AES Implementation Using ARMv8 ASIMD Without Cryptography Extension. In Information Security and Cryptology—ICISC 2019, Proceedings of the 22nd International Conference, Seoul, Korea, 4–6 December 2019; Revised Selected Papers; Springer: Cham, Switzerland, 2019; pp. 84–101.
21. Raspberry Pi 4B Specification. Available online: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/ (accessed on 2 February 2021).

Figure 1. Transpose process of the TRN1.16b and TRN2.16b instructions [13].

Figure 2. LEA round function [1] (RK: round key; $ROR_{3}$, $ROR_{5}$, and $ROL_{9}$: rotate operations).

Figure 3. HIGHT round function [2] (SK: subkey; $F_{0}$, $F_{1}$: functions; ⋘: rotate operation).

Figure 4. CHAM round functions [16] ($RK_{odd}$: $RK_{i \bmod 2k/w}$; $RK_{even}$: $RK_{(i+1) \bmod 2k/w}$; $ROL_{8}$ and $ROL_{1}$: rotate operations).

Figure 5. Optimization of revised CHAM-64/128-CTR mode through partial pre-computation for the initial seven rounds in previous work [6].

Figure 6. Optimization of HIGHT-CTR mode through partial pre-computation for the initial five rounds in previous work [7].

Figure 7. Interleaving the NEON instructions.

Figure 8. Proposed vector register scheduling for LEA.

Figure 9. Proposed vector register scheduling for HIGHT.

Figure 10. Proposed vector register scheduling for revised CHAM-64/128.

Figure 11. Proposed LEA-CTR mode optimization through pre-computation for the initial four rounds.

Table 1. Summary of ASIMD instructions used in this paper [13].

Instructions	Operands	Description	Cycles
`ADD`	$V_{d}, V_{m}, V_{n}$	Vector addition $V_{d}$ = $V_{m}$ + $V_{n}$	1
`SHL`	$V_{d}, V_{m}, V_{n}$	Vector left shift $V_{d} \leftarrow V_{m} \ll \#n$	1
`SRI`	$V_{d}, V_{m}, \#immediate$	Vector right shift and insert $V_{d} \leftarrow V_{m} \gg \#n$ ⊕ $V_{m}$ (insert)	2
`TRN1`	$V_{d}, V_{m}, V_{n}$	Vector transpose (primary)	1
`TRN2`	$V_{d}, V_{m}, V_{n}$	Vector transpose (secondary)	1
`REV16`	$V_{d}, V_{m}$	Vector reverse elements in 16-bit halfwords $V_{d} \leftarrow V_{m}$ (reverse)	1
`LD4`	$\{V_{n} - V_{m}\}, []$	Loading data from memory to 4 vector registers by applying the transpose operation	4
`ST4`	$\{V_{n} - V_{m}\}, []$	Storing data from 4 vector registers to memory by applying the transpose operation	4

Table 2. Parameters of LEA [1].

Cipher	Block Size (bits)	Key Size (bits)	Number of Rounds	Word Size (bits)
LEA-128	128	128	24	32
LEA-192	128	192	28	32
LEA-256	128	256	32	32

Table 3. Parameters of CHAM [16].

Cipher	Block Size (bits)	Key Size (bits)	Number of Rounds	Word Size (bits)
CHAM-64/128	64	128	80	16
CHAM-128/128	128	128	80	32
CHAM-128/256	128	256	96	32

Table 4. Previous block cipher implementations using NEON engine (#Data parallelism: the number of encryptions performed simultaneously on NEON engine).

Methods	Target Device	Target Block Cipher	#Data Parallelism
Seo et al. [18]	Cortex-A9 (ARMv7)	LEA	12
Seo et al. [19]	Cortex-A53 (ARMv8)	CHAM-64/128	24
Seo Hwajeong [9]	Apple A7 and Apple A9 (ARMv8)	LEA	24
H. Fujii et al. [20]	Cortex-A53 (ARMv8)	AES	4
Song et al. [10]	Cortex-A72 (ARMv8)	HIGHT	24
		Revised CHAM-64/128	16
		Revised CHAM-128/128	10
		Revised CHAM-128/256	8

Table 5. Running time comparison of LEA implementations on the ARMv8 architecture (Our Work 1: proposed parallel implementation; Our Work 2: proposed CTR mode optimization; #Data parallelism: the number of encryptions performed simultaneously on the NEON engine; Cpb: clock cycles per byte; Improvement: performance improvement over previous work).

Work	Suggested Structure	#Data Parallelism	Cpb	Improvement
LEA-128 [9]	ECB mode	24	3.88	-
LEA-192 [9]	ECB mode	24	4.69	-
LEA-256 [9]	ECB mode	24	5.32	-
LEA-128 (Our Work 1)	ECB mode	24	3.76	3.09%
LEA-192 (Our Work 1)	ECB mode	24	4.56	2.77%
LEA-256 (Our Work 1)	ECB mode	24	5.18	2.63%
LEA-128 (Our Work 2)	CTR mode (Our Work)	24	3.54	8.76%
LEA-192 (Our Work 2)	CTR mode (Our Work)	24	4.39	6.39%
LEA-256 (Our Work 2)	CTR mode (Our Work)	24	5.01	5.82%
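The Improvement column appears to be the relative reduction in cycles per byte with respect to the ported previous work [9]. The short Python check below reproduces the LEA-128 entries under that reading; the formula is inferred from the table rather than stated by the authors, and the function name `improvement_percent` is hypothetical.

```python
def improvement_percent(cpb_previous, cpb_ours):
    """Relative reduction in cycles per byte, expressed as a percentage."""
    return (cpb_previous - cpb_ours) / cpb_previous * 100.0

# LEA-128 values taken from Table 5.
lea128_previous = 3.88
lea128_work1 = 3.76
lea128_work2 = 3.54

print(f"Our Work 1: {improvement_percent(lea128_previous, lea128_work1):.2f}%")
print(f"Our Work 2: {improvement_percent(lea128_previous, lea128_work2):.2f}%")
```

For other rows, recomputing from the published cpb values can differ from the table in the last digit, presumably because the cpb values are themselves rounded.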

Table 6. Running time comparison of HIGHT and revised CHAM implementations on the ARMv8 architecture (Our Work 1: proposed parallel implementation; Our Work 2: proposed CTR mode optimization; #Data parallelism: the number of encryptions performed simultaneously on the NEON engine; Cpb: clock cycles per byte; Improvement: performance improvement over previous work).

Work	Suggested Structure	#Data Parallelism	Cpb	Improvement
HIGHT-64/128 [10]	ECB mode	24	8.35	-
Revised CHAM-64/128 [10]	ECB mode	16	6.3	-
Revised CHAM-128/128 [10]	ECB mode	10	9.85	-
Revised CHAM-128/256 [10]	ECB mode	8	10.81	-
HIGHT-64/128 (Our Work 1)	ECB mode	48	7.91	5.26%
HIGHT-64/128 (Our Work 2)	CTR mode [7]	48	7.63	8.62%
Revised CHAM-64/128 (Our Work 1)	ECB mode	48	5.7	9.52%
Revised CHAM-128/128 (Our Work 1)	ECB mode	24	9.71	1.52%
Revised CHAM-128/256 (Our Work 1)	ECB mode	24	10.38	4.02%
Revised CHAM-64/128 (Our Work 2)	CTR mode [6]	48	5.3	15.87%
Revised CHAM-128/128 (Our Work 2)	CTR mode [6]	24	9.56	2.94%
Revised CHAM-128/256 (Our Work 2)	CTR mode [6]	24	10.23	5.36%
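The #Data Parallelism values for Our Work in Tables 5 and 6 are consistent with a simple register budget. The Python sketch below assumes that every block being encrypted is held entirely in vector registers and uses the fact that AArch64 provides 32 vector registers of 128 bits each. How the remaining registers are divided between round keys and temporaries is not derived here, and the names used are hypothetical.

```python
VECTOR_REGISTERS = 32  # AArch64 provides v0-v31
REGISTER_BITS = 128

# (block size in bits, blocks encrypted simultaneously) from Tables 5 and 6
CONFIGS = {
    "LEA": (128, 24),
    "HIGHT": (64, 48),
    "Revised CHAM-64/128": (64, 48),
    "Revised CHAM-128/128": (128, 24),
    "Revised CHAM-128/256": (128, 24),
}

def state_registers(block_bits, parallel_blocks):
    """Number of 128-bit registers needed to hold all block states."""
    return block_bits * parallel_blocks // REGISTER_BITS

for name, (block_bits, blocks) in CONFIGS.items():
    used = state_registers(block_bits, blocks)
    print(f"{name}: {used} state registers, {VECTOR_REGISTERS - used} left")
```

Under this assumption every configuration needs the same number of state registers, which would explain why the 64-bit ciphers process twice as many blocks as the 128-bit ciphers.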

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
