Article

Cuckoo-Store Engine: A Reed–Solomon Code-Based Ledger Storage Optimization Scheme for Blockchain-Enabled IoT

1 School of Microelectronics, Tianjin University, Tianjin 300072, China
2 School of Electrical Automation and Information Engineering, Tianjin University, Tianjin 300072, China
3 Tianjin Navigation Instrument Research Institute, Tianjin 300131, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(15), 3328; https://doi.org/10.3390/electronics12153328
Submission received: 9 June 2023 / Revised: 26 July 2023 / Accepted: 26 July 2023 / Published: 3 August 2023

Abstract

As the distributed ledger technology underlying cryptocurrencies such as Bitcoin and Ethereum, blockchain has empowered various industries, such as supply chain management, healthcare, government services, e-voting, etc. However, the ever-growing ledger on each node has been the main bottleneck for blockchain scalability as the network scale expands, which worsens in blockchain-enabled IoT scenarios with resource-limited devices. With the support of the Reed–Solomon (RS) code, the Cuckoo-Store (CS), a ledger storage optimization engine, is proposed in this paper to dramatically decrease the storage burden on each node by encoding the ledger as data segments with redundancy and distributing them to multiple nodes. These distributed data segments can be collected and decoded using the RS code to recover the original ledger. Furthermore, the Cuckoo filter (CF) is used to guarantee the integrity of the encoded segments, which helps detect forged segments and facilitates the process of ledger recovery. Theoretical analysis and simulation results show that the CS engine can decrease the storage on each node by more than 94%, and the original ledger can be recovered efficiently with acceptable communication overheads.

1. Introduction

As the distributed ledger technology (DLT) underlying cryptocurrencies such as Bitcoin and Ethereum, blockchain implements decentralization, immutability, transparency, trustlessness, and traceability, with the support of P2P networking, cryptography, chained structure, and the consensus algorithm [1,2]. With the development of industrial digitization, the blockchain has played an essential role as the trust anchor for multi-party cooperation and resource sharing in various fields, such as digital twins (DT), healthcare, smart grids, and supply chain management [3,4,5]. A typical application is the blockchain-enabled Internet of Things (IoT), where all sensors act as the blockchain nodes to maintain the entire ledger, and collected data are exchanged in the form of transactions [6]. For example, blockchain-enabled IoT has been applied in various fields, such as Industry 4.0, smart cities, and identification [7,8,9].
However, the ever-growing ledger on each node has been the main bottleneck for blockchain scalability as the network scale expands, which worsens in blockchain-enabled IoT scenarios with resource-limited devices. The limited storage scalability is caused by the blockchain storage model, in which all nodes have to maintain replicas of the entire ledger to guarantee the core characteristics of the blockchain, such as decentralization, transparency, and traceability [10]. By the end of March 2023, the ledger had exceeded 470 GB in Bitcoin and 900 GB in Ethereum, and the growth rates were predicted to reach 50 GB and 400 GB per year, respectively [11]. Considering the expansion rate of the IoT (projected to grow from 10 billion devices in 2018 to 64 billion in 2025 [12]), the storage required for IoT devices to participate in blockchain activities independently will rise rapidly (10× or even 100×), which damages decentralization and further limits the scalability of transaction throughput as well as consensus efficiency.
Various solutions have been proposed to address the storage issue, including role division, channels, sharding, and storage clusters. In [1,13], the nodes are divided into light and full nodes according to their resources. The former only maintain the lightweight chain formed by the block headers, while the latter maintain the entire ledger, including all headers and bodies. Correspondingly, the full nodes can participate in all blockchain activities independently, while the light nodes have to rely on the full nodes via simplified payment verification (SPV). With role division, the storage burden on the light nodes is decreased, but at the cost of damaged decentralization and robustness. As indicated in [14], a blockchain in which more nodes perform independently has better decentralization and robustness. Poon et al. constructed an off-chain transaction channel called the Lightning Network [15]. Once a channel is constructed, all transaction executions are moved off-chain; only the transactions for channel opening and closing are recorded on-chain, which decreases the ledger growth rate. However, the main purpose of the Lightning Network is to improve throughput, and it makes nodes more vulnerable to attacks because both parties involved in a transaction must be online and logged in with their private keys. A sharding scheme named Elastico was proposed in [16] to improve blockchain scalability, where all nodes are divided into multiple consensus units called shards, and each node only maintains the shard-related ledger, dramatically decreasing the local storage burden. However, a complicated architecture is necessary for cross-shard communication and committee reshuffling, which increases the deployment difficulty. Perard proposed the Erasure Code-based low storage (ECLS) blockchain in [17], where the block is encoded by RS code and each node only maintains a subset of the encoded data instead of the entire ledger. The original block can be recovered, based on the RS code's maximum distance separable (MDS) property, once the distributed subsets are collected correctly. However, there is no efficient mechanism to inspect the integrity of the collected subsets, so the ledger recovery process is easily interrupted due to low fault tolerance. Similarly, Qi et al. proposed a reliable storage partition scheme, RS-Store, for the blockchain ledger in [18] by combining the RS code and a Byzantine Fault Tolerance (BFT) consensus protocol, where the storage burden on each node is decreased by distributing the encoded transactions over multiple nodes. However, RS-Store is designed for permissioned blockchains and heavily modifies the original workflow, resulting in compatibility issues when applying the scheme to permissionless blockchains such as Bitcoin and Ethereum. Furthermore, there are cluster storage schemes based on other techniques, such as those proposed in [19,20] with the support of the residue number system (RNS) and the distributed hash table (DHT), respectively. These schemes can decrease the storage burden on each node but have individual limitations in terms of communication overheads and integrity.
In this paper, a storage optimization engine, Cuckoo-Store (CS), is proposed to dramatically decrease each node's storage burden while retaining sufficient fault tolerance: RS encoding translates the block bodies into encoded segments with redundancy, and the Cuckoo filter (CF) guarantees the integrity of the encoded results and facilitates the ledger recovery process. With the support of the CS, each node stores only a subset of the encoded results when synchronizing new blocks. Nodes can recover the original ledger by requesting the remaining subsets from neighbor nodes and performing the decoding algorithm provided by the RS code, where the CF helps detect forged responses via membership checks and verifies whether the decoded segments are correct. The major contributions of this paper are summarized as follows:
1. We first propose the CS engine to dramatically reduce each node's storage burden without damaging the core functions of the blockchain. The CS engine is consensus-algorithm-independent and only slightly modifies the original block structure and workflow, lowering the deployment difficulty;
2. The RS encoding is used to translate the block body into encoded segments with redundancy. The packed transactions are grouped into basic processing units, each represented by a matrix containing the raw and redundant data. The encoded segments are distributed over multiple nodes to decrease the storage burden on each node. The original block can be recovered with the decoding algorithm of the RS code, where the grouped data segments remain the basic units of recovery and the redundant data provide extra error detection and correction capability to avoid interruptions caused by transmission errors or node crashes. The recoverability of the RS code guarantees the traceability of the blockchain after the CS engine is applied;
3. We use the CF to guarantee the integrity of the encoded results during the block proposal. During ledger recovery, the CF detects forged segments from malicious nodes before decoding, which facilitates recovery by keeping forged segments out of the decoding process and removing the detected malicious nodes from the neighbor list. After decoding, nodes can also verify the correctness of the decoded segments with the CF. With the CF's support, the CS engine provides better fault tolerance and more efficient error detection and correction than the existing RS-based schemes;
4. We implement a proof-of-concept (PoC) deployment of the CS engine. The simulation results show that with the CS engine, the ledger storage on each node is decreased dramatically (by more than 94%), which is comparable with existing schemes. Furthermore, the original ledger can be recovered more efficiently with the CS engine, so the system remains stable even under malicious attacks.
The rest of this paper is organized as follows. The basic techniques underlying the CS engine are introduced in Section 2. Section 3 describes the construction of the CS engine, including the processes for ledger distribution and recovery. The theoretical analysis and simulation are performed in Section 4 and Section 5, respectively. Finally, the paper is concluded in Section 6.

2. Related Techniques

This section introduces the techniques related to the construction of the Cuckoo-Store engine. The blockchain’s chain structure and transaction lifecycle are first described in Section 2.1. Then, the encoding and decoding processes of the Reed–Solomon code are introduced in Section 2.2. Finally, the Cuckoo filter is introduced in Section 2.3, including its construction and membership check process.

2.1. Blockchain Ledger Structure and Transaction Lifecycle

Blockchain is a back-linked list of blocks in chronological order, where each block is composed of a body and a header. The body records all confirmed transactions, which underpins the blockchain's decentralization, immutability, and traceability [21,22]. The header maintains the metadata related to the body. As shown in Figure 1, all transactions packed in the body are hashed together in pairs to obtain the Merkle root, which is stored in the header to guarantee the body's integrity. The header also contains other fields, such as the timestamp, the difficulty target, the nonce, and the parent hash. The timestamp is a unique serial number that determines the order of the blocks. The difficulty target is a numeric value used to adjust the mining difficulty according to the estimated computing power, ensuring that blocks are produced at a steady rate. The nonce is a numeric value repeatedly altered as the input of the hash process, referred to as Proof-of-Work (PoW), until the difficulty restriction is met. The parent hash is the previous block's hash value, and the hash of the current block will be referenced in the next header, which links adjacent blocks tightly. Any modification to a block forces an enormous recalculation of all subsequent blocks, making the blockchain's deep history immutable unless an attacker controls a majority of the computing power (a 51% attack) [23].
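To make the header linkage and the PoW loop concrete, the following minimal Python sketch mines a toy block; the field names and single-pass SHA-256 hashing are simplified stand-ins for illustration, not the schema of any real client:

```python
# Toy block-header mining loop: alter the nonce until the header hash
# falls below the difficulty target, linking blocks via parent_hash.
import hashlib
import json
import time

def block_hash(header: dict) -> str:
    payload = json.dumps(header, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def mine(parent_hash: str, merkle_root: str, difficulty_bits: int) -> dict:
    target = 2 ** (256 - difficulty_bits)          # difficulty target
    header = {"parent_hash": parent_hash, "merkle_root": merkle_root,
              "timestamp": int(time.time()), "nonce": 0}
    while int(block_hash(header), 16) >= target:   # Proof-of-Work
        header["nonce"] += 1
    return header

genesis = mine("00" * 32, "11" * 32, difficulty_bits=16)
print(genesis["nonce"], block_hash(genesis))
```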
Taking Ethereum as an example, a transaction is described formally by tx = {txid, (sender, recipient, value), nonce, sig}, where the tuple (sender, recipient, value) indicates the amount of transferred tokens and the accounts on both sides, nonce is the number of transactions issued by the sender, and sig is the sender's signature. Finally, tx can be indexed by txid, its unique identifier, obtained via the hash function [24,25]. The transaction is the smallest unit driving the blockchain system: transactions are created, relayed, and validated by the nodes and packed into blocks, and state transitions occur when a transaction is executed and confirmed. The workflow of the blockchain is as follows:
1. A new transaction is generated with a complete structure and propagated to the network;
2. The nodes collect transactions into their waiting area, called the mempool. These transactions are then checked locally to ensure the transferred funds are available and the signatures are valid. Valid transactions remain in the mempool until the packing rules are met, whereas invalid ones are excluded;
3. The nodes pack the transactions in the mempool into new blocks, execute them, and update the local state sets. Then, PoW is processed independently. The node whose nonce satisfies the difficulty target propagates its block to the network;
4. The nodes receiving the new block check its validity and append it to the tip of the local chain.
In particular, Bitcoin and Ethereum implement simplified payment verification (SPV) so that light nodes can confirm transactions with the help of full nodes. For a specific transaction tx, the light node sends txid to a full node, which returns the Merkle branch corresponding to txid. The light node can then confirm the transaction by recalculating the Merkle root from the received branch and comparing the result with the Merkle root maintained in the local header.
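As an illustration, the sketch below performs the SPV recomputation; it is a simplified model (single SHA-256 and explicit left/right flags) rather than the exact Bitcoin or Ethereum encoding:

```python
# SPV-style check: fold a txid together with its Merkle branch (sibling
# hash plus the side the sibling sits on) and compare against the root
# maintained in the local block header.
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def verify_branch(txid: bytes, branch: list[tuple[bytes, str]], root: bytes) -> bool:
    node = txid
    for sibling, side in branch:
        pair = sibling + node if side == "left" else node + sibling
        node = h(pair)
    return node == root

# Two-leaf example: root = H(leaf0 + leaf1).
leaf0, leaf1 = h(b"tx-0"), h(b"tx-1")
root = h(leaf0 + leaf1)
assert verify_branch(leaf0, [(leaf1, "right")], root)
```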

2.2. Reed–Solomon Code

Error-correcting code is an encoding technique capable of error detection and correction by adding redundancy during the encoding process [26]. Benefiting from the redundancy, the encoded results can always be decoded correctly as long as the errors (such as data loss or forgery) do not exceed the upper limits of the capabilities for error detection and correction, which improve with more redundancy [27]. Thus, error-correcting coding is widely used in data backup and distributed storage [28,29].
The Reed–Solomon (RS) code is a typical implementation of error-correcting code based on arithmetic in finite fields, where the data are encoded and decoded in bytes by the Encoder and Decoder, respectively [30]. In particular, a specific encoding pattern of the RS code can be represented formally as RS(n, p), where p is the number of bytes of the original data and n is the number of bytes of the encoded result, with q redundant bytes added during encoding; thus, $n = p + q$.
In RS(n, p), the original data D of b bits can be encoded as follows. First, D is divided bytewise into p parts $d = [d_1, d_2, \ldots, d_p]$ after padding $(8p - b)$ zero bits, where each part $d_i$ $(1 \le i \le p)$ occupies one byte $(0 \le d_i < 2^8)$. Then, the Encoder calculates the redundant vector $c = [c_1, c_2, \ldots, c_q]$ from d as follows:

$$c^{T} = G \cdot d^{T} = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1p} \\ g_{21} & g_{22} & \cdots & g_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ g_{q1} & g_{q2} & \cdots & g_{qp} \end{bmatrix} \cdot \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_p \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_q \end{bmatrix}, \quad (1)$$

where $G = [g_{ij}]$ $(1 \le i \le q,\ 1 \le j \le p)$ is the generator matrix with q rows and p columns. Finally, the encoding result $D_{enc}$ is obtained by concatenating d and c as

$$D_{enc} \leftarrow Encoder(D) = [d, c], \quad (2)$$

where $D_{enc}$ is a set with $n = p + q$ elements.
Based on the properties of the RS code, the Decoder can recover the original data D from the encoding result $D_{enc}$ or any subset $D_{sub}$ with no fewer than $p + q/2$ elements ($D_{sub} \subseteq D_{enc}$ and $|D_{sub}| \ge p + q/2$) as follows:

$$D_{dec} \leftarrow V^{-1} \cdot Vector(D_{sub}), \quad (3)$$

where V is the decoder matrix with n rows and p columns and $V^{-1}$ is the inverse of V. V is obtained by concatenating a p-order identity matrix $I_{p \times p}$ and the generator matrix $G_{q \times p}$ as follows:

$$V = \begin{bmatrix} I_{p \times p} \\ G_{q \times p} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ g_{11} & g_{12} & \cdots & g_{1p} \\ g_{21} & g_{22} & \cdots & g_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ g_{q1} & g_{q2} & \cdots & g_{qp} \end{bmatrix}. \quad (4)$$
During the transmission of RS-encoded data, some bytes may be corrupted, resulting in inconsistencies between the received data and the encoded data. Through the RS decoding process, these errors can be corrected so that the recovered data are consistent with the original data. An RS(n, p) code can correct up to $t = \lfloor (n - p)/2 \rfloor$ errors at unknown locations, and up to $n - p$ erasures at locations that are known and provided to the algorithm [31].
In practice, the fast Fourier transform (FFT) and its inverse (IFFT) can be used to accelerate matrix multiplication and inversion in a finite field, giving a time complexity of $O(n \log n)$ for encoding and decoding [32,33]. Furthermore, the ISA-L library provides an efficient implementation of RS encoding and decoding with high performance and low latency, with the support of single instruction multiple data (SIMD) instructions [34].
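For illustration, the following minimal sketch runs an RS(6, 4) round trip with the pure-Python reedsolo package rather than the ISA-L binding mentioned above; the tuple layout returned by decode follows recent reedsolo releases, so treat the indexing as an assumption:

```python
# RS(6, 4) round trip with the third-party `reedsolo` package
# (pip install reedsolo); illustrative only, not the paper's ISA-L code.
from reedsolo import RSCodec

p, q = 4, 2                      # 4 data bytes, q = 2 redundant bytes, n = 6
rsc = RSCodec(q)                 # q parity symbols over GF(2^8)

data = bytes([0x12, 0x34, 0x56, 0x78])
encoded = rsc.encode(data)       # D_enc = [d, c], n = p + q bytes
assert len(encoded) == p + q

corrupted = bytearray(encoded)
corrupted[1] ^= 0xFF             # one unknown-location error (<= floor(q/2))
decoded = rsc.decode(bytes(corrupted))[0]   # first element: repaired message
assert bytes(decoded) == data
```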

2.3. Cuckoo Filter

The Bloom filter (BF) is a probabilistic, space-efficient data structure for fast set-membership checks [35,36]. A BF consists of a bit array of $S_{BF}$ bits and K hash functions $h_k$ $(k = 1, 2, \ldots, K)$. Initially, all bits in the array are set to 0. To insert a new item x into a set S, all locations indexed by $h_k(x)$ are set to 1. Correspondingly, the membership of x in S is reported if all locations indexed by $h_k(x)$ in the bit array are 1. A BF with $C_{BF}$ items inserted has a false positive probability, meaning the BF may report yes by mistake for an item outside of S [37].
The Cuckoo filter (CF) improves on the BF with a more compact size and higher query efficiency by introducing two new components: buckets and fingerprints [38,39,40]. Unlike the BF, the CF divides the underlying bit array into multiple buckets as the units for storing elements, where the maximum capacity of a bucket is preset as $C_{buk}$. Typically, the number of buckets $N_{buk}$ in a CF is determined as [39]

$$N_{buk} = \frac{C_{CF}}{C_{buk}}, \quad (5)$$

where $C_{CF}$ is the maximum number of elements that can be inserted.
The fingerprint is the identifier of an element within a bucket, obtained with a specific hash function $h_S$ and truncated to length fl:

$$finger(x) = h_S(x)[:fl], \quad (S = 1 \text{ or } 2). \quad (6)$$

Typically, fl can be calculated as [38]

$$fl = \left\lceil \log_2 \frac{2 \cdot C_{buk}}{FPP} \right\rceil, \quad (7)$$

where FPP is the false positive probability. False positives arise because two different elements may have the same fingerprint and the same candidate buckets.
Based on $C_{CF}$ and fl, the size of the CF ($S_{CF}$) can be determined as follows:

$$S_{CF} = fl \cdot C_{CF} = fl \cdot \frac{N_x}{\gamma}, \quad (8)$$

where $N_x$ is the number of stored elements and $\gamma$, called the load factor, is the fraction of $C_{CF}$ already occupied [38]. The performance of the CF is affected by $\gamma$, which should be chosen carefully to achieve a good trade-off between the false positive probability and the CF's size [39]. Typically, $\gamma$ is approximately 95% [40]. Initially, all $S_{CF}$ bits in the buckets are set to 0.
An example of a CF with two hash functions ($h_1$ and $h_2$) and eight buckets ($N_{buk} = 8$) is shown in Figure 2 [36]. The process to insert a new element x with fingerprint $finger(x)$ into set S is as follows:
1. Calculate $h_1(x)$ using the SHA-256 function and insert x into bucket $b_1$ if it is not full, where $b_1 = h_1(x) \bmod N_{buk}$ and $h_1$ is calculated as follows:
$$h_1(x) = \mathrm{SHA256}(x); \quad (9)$$
2. If bucket $b_1$ is full, pick an old element y, kick it out of bucket $b_1$, and reinsert it into one of the other buckets. Typically, y is the element first inserted into bucket $b_1$. Then, insert x into bucket $b_1$;
3. For the old element y kicked out of bucket $b_1$, its new bucket $b_2$ is determined by $h_2(y) \bmod N_{buk}$, where $h_2(y)$ is calculated as follows:
$$h_2(y) = h_1(y) \oplus \mathrm{SHA256}(h_1(y)[:fl]). \quad (10)$$
Correspondingly, the membership of x in S can be determined by a two-step check as follows [37]:
1. Calculate $finger(x)$ and locate the two candidate buckets indexed by $h_1(x) \bmod N_{buk}$ and $h_2(x) \bmod N_{buk}$;
2. If either of the inspected buckets reports the existence of $finger(x)$, the membership of x in S is confirmed, subject to the false positive probability FPP.
Figure 2. Insert a new element x in the Cuckoo filter.
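The insertion and lookup procedure can be condensed into the following toy Python sketch; the parameter values and the power-of-two bucket count are illustrative assumptions, not values prescribed by the paper:

```python
# Toy Cuckoo filter following Equations (9) and (10): two candidate buckets
# per element, fl-bit fingerprints, and eviction when a bucket is full.
import hashlib

N_BUK, C_BUK, FL_BYTES = 8, 4, 2          # 8 buckets, 4 slots each, 16-bit fp

buckets = [[] for _ in range(N_BUK)]

def sha256(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def fingerprint(x: bytes) -> bytes:
    return sha256(x)[:FL_BYTES]           # finger(x) = h(x)[:fl]

def h1(x: bytes) -> int:
    return int.from_bytes(sha256(x), "big") % N_BUK

def alt_bucket(b: int, fp: bytes) -> int:
    # b2 = b1 XOR hash(fingerprint): with a power-of-two N_BUK, either
    # candidate bucket can be derived from the other plus the fingerprint.
    return (b ^ int.from_bytes(sha256(fp), "big")) % N_BUK

def insert(x: bytes, max_kicks: int = 64) -> bool:
    fp, b = fingerprint(x), h1(x)
    for _ in range(max_kicks):
        if len(buckets[b]) < C_BUK:
            buckets[b].append(fp)
            return True
        buckets[b].append(fp)             # place fp, then evict the oldest
        fp = buckets[b].pop(0)
        b = alt_bucket(b, fp)             # relocate the evicted fingerprint
    return False                          # filter is considered full

def contains(x: bytes) -> bool:
    fp, b1 = fingerprint(x), h1(x)
    return fp in buckets[b1] or fp in buckets[alt_bucket(b1, fp)]

insert(b"segment-1")
assert contains(b"segment-1")
```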
Thus, the CF could be used in various scenarios, such as network routing and data storage, by providing membership checks [41,42]. The CF could also be applied to Bitcoin and Ethereum to replace the BF used [43,44].
For a set or vector S, polynomial commitment (PC) schemes can also provide a membership proof that an element x is in S. Typical PCs (including FRI [45], IPA [46], and KZG [47]) are compared in Table 1 in terms of their cryptographic assumptions, proof sizes, and verification delays.
Among the above PCs, the KZG scheme has the most compact proof size and the shortest verification time. However, KZG is based on operations over elliptic curve cyclic groups, and its proofs are checked with pairings. On the widely used pairing-friendly curve BLS12_381, commitment/proof generation and checking in KZG are extremely expensive compared with the insertion and check operations of the Cuckoo filter (hash operations). Furthermore, the construction of KZG relies on a trusted setup, which increases the security risk. As a probabilistic structure, the Cuckoo filter has a nonzero false positive probability, which is inferior to the PCs; however, as indicated in [39], the Cuckoo filter can be designed to keep the false positive rate low.
The Merkle tree can also be used for membership checks; this is called a vector commitment scheme and is the technique underlying SPV, as described in Section 2.1. SPV provides the membership proof for a leaf node, called the Merkle branch. A node that provides the Merkle branch for a leaf must store the hashes of all leaves locally, which conflicts with our goal of storage optimization. Compared with the PCs and the Merkle tree, the Cuckoo filter has a more compact commitment and a shorter check time, making it more suitable for blockchain applications.

3. Design of the Cuckoo-Store Engine

In this section, the system model of the Cuckoo-Store engine is first introduced in Section 3.1. The processes of ledger encoding and decoding are introduced in Section 3.2 and Section 3.3, respectively. Additionally, the Cuckoo filter is introduced into the encoding and decoding process to address the practical issues in Section 3.4. Finally, a demonstration of the CS engine is given in Section 3.5.

3.1. System Model

As shown in Figure 3, the Cuckoo-Store engine is formed from four modules, which cooperate to interact with the blockchain network. The RS module is the core of the CS engine and implements RS encoding and decoding through the Encoder and Decoder, respectively. The data verification module validates the received data during the decoding process through the Block and Filter Verifiers. The data storage module is divided into three parts: temp data storage, segment storage, and Cuckoo filter storage, which maintain the temporary data, the encoded segments, and the CF, respectively. The data request module is designed to interact with other nodes: the Local Reader responds to data requests from peers, while the Decode Reader and Remote Reader obtain segments and blocks from neighbor nodes, respectively.
In practice, the CS engine could be implemented as the middleware, and any nodes with the CS engine could adopt a new storage model for the blockchain ledger. The basic idea is to encode the new block to multiple encoded segments with smaller sizes and distribute them among multiple nodes, which will decrease the storage burden for the ledger on each node and guarantee the recoverability of the ledger with the support of the RS Decoder. For clarity, the nodes can be divided into senders to issue the new transactions, validators to verify the transactions, and miners to propose new blocks. Each node can have multiple roles simultaneously. In the blockchain with the CS engine, the senders and validators can issue and validate the transactions as usual, and the process of block proposal and synchronization can be modified as follows:
1. Block proposal. The miner packs transactions into a new block. With the RS Encoder's support, the miner encodes the block body into multiple encoded segments and inserts these segments into a Cuckoo filter (step 3). The hash of the CF is then inserted into the new header to guarantee the integrity and consistency of the encoding results, alongside the other fields (parent hash, Merkle root, etc.). The miner then participates in the consensus process (such as PoW), and the recognized new block is broadcast to the network;
2. Block validation and synchronization. The validators that receive the new block stop their local PoW and validate it. Unlike the original blockchain, in addition to the normal checks (such as nonce and Merkle root), the nodes repeat the encoding and CF insertion to validate the integrity of the encoding results, with the support of the RS Encoder and Block Verifier (steps 1 and 2). The valid block is appended to the tip of the local blockchain. During this process, each node adaptively selects a subset of the encoded segments according to its storage resources (step 3). Note that the CF is also maintained locally.
Furthermore, the distributed ledger can be recovered with the support of the RS Decoder. The process can be described as follows:
1. Segment collection. With the support of the data request module, the node collects enough encoded segments from its neighbor nodes. During this process, the integrity of the received segments is checked against the CF; forged segments are discarded and the nodes returning them are removed from the neighbor list (step 4). The process is repeated until the number of valid received segments reaches the threshold for decoding;
2. Decoding and validation. With the support of the RS Decoder, the entire block body can be recovered (step 5). The correctness of the recovered block is checked against the Merkle root in the header and the CF (step 6).

3.2. Data Encoding

For a specific encoding pattern RS(n, p), a group of p transactions is the basic unit to be encoded, and the encoding result for a block is obtained after processing all groups. For simplicity, for a block with T transactions ($TX = \{tx_1, tx_2, tx_3, \ldots, tx_T\}$), assuming all transactions are L bytes long, the T transactions can be divided exactly into g groups ($T = p \cdot g$).
Based on the description in Section 2.2, each transaction tx can be represented as a byte array to be processed by the Encoder as $tx = [d_1, d_2, \ldots, d_L]$. Then, a group of p transactions can be represented as a matrix A with p rows and L columns as follows:

$$A = \begin{bmatrix} d_1^1 & d_2^1 & \cdots & d_L^1 \\ d_1^2 & d_2^2 & \cdots & d_L^2 \\ \vdots & \vdots & \ddots & \vdots \\ d_1^p & d_2^p & \cdots & d_L^p \end{bmatrix}, \quad (11)$$

where $d_j^i$ denotes the j-th byte of the i-th transaction in the group. Then, based on Equation (2), $A_{enc}^1$ can be obtained by encoding A as follows:

$$A_{enc}^1 = \begin{bmatrix} A \\ Q \end{bmatrix}, \quad (12)$$

where $A_{enc}^1$ is called an encoded result and Q is the redundant matrix with q rows and L columns as follows:

$$Q = \begin{bmatrix} c_1^1 & c_2^1 & \cdots & c_L^1 \\ c_1^2 & c_2^2 & \cdots & c_L^2 \\ \vdots & \vdots & \ddots & \vdots \\ c_1^q & c_2^q & \cdots & c_L^q \end{bmatrix}, \quad (13)$$

where $c_j^i$ denotes the j-th byte of the i-th redundant segment and $q = n - p$. After performing the same operations for all g groups, the encoding result $B_{enc}$ for the block is obtained as follows:

$$B_{enc} = \left[ A_{enc}^1, A_{enc}^2, \ldots, A_{enc}^g \right]^T. \quad (14)$$
Each row of $B_{enc}$ represents an encoded segment:

$$S_j = B_{enc}[j, :], \quad (j = 1, 2, \ldots, n \cdot g). \quad (15)$$

As described in Section 3.1, during block synchronization each node can select l segments from each $A_{enc}^i$ $(i = 1, 2, \ldots, g)$, so a total of $l \cdot g$ segments are stored locally, and the indices j of the stored $S_j$ are recorded as metadata with the locally stored block body. The compression ratio achieved by storing this subset of $B_{enc}$ plus the metadata is analyzed in detail in Section 4.1.
It should be noted that, in practice, the transactions packed in a block have different sizes, and the number of transactions T cannot always be divided exactly by the selected p. To address these issues, the transactions are regularized by zero padding before being encoded by the RS Encoder. Assuming the longest transaction has L bytes, the regularization process is as follows:
1. Pad all transactions with zeros so that their length reaches L bytes;
2. Construct $\lceil T/p \rceil \cdot p - T$ dummy transactions of L zero bytes and add them to the transaction list.
Then, the ledger can be processed as described above, yielding $\lceil T/p \rceil \cdot p$ transaction segments and $\lceil T/p \rceil \cdot q$ redundant segments.
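The regularization can be sketched as follows; `regularize` is a hypothetical helper that pads every transaction to the longest length L and appends all-zero dummy transactions until the count is a multiple of p:

```python
# Hypothetical helper for the two regularization steps above.
import math

def regularize(txs: list[bytes], p: int) -> list[bytes]:
    L = max(len(tx) for tx in txs)
    padded = [tx.ljust(L, b"\x00") for tx in txs]      # step 1: zero-pad to L
    groups = math.ceil(len(txs) / p)
    padded += [bytes(L)] * (groups * p - len(txs))     # step 2: dummy txs
    return padded

txs = [b"tx-one", b"tx-two-longer", b"tx-3"]
assert len(regularize(txs, p=4)) == 4                  # one dummy appended
```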

3.3. Block Recovery by Decoding

In a blockchain with the CS engine, the distributed ledger can be recovered with the support of the RS Decoder, which guarantees the traceability of the blockchain. As described in Section 2.2, thanks to the redundant matrix, the original message can be decoded without relying on all data segments, and the RS code can detect and correct errors. In this part, we assume each encoded segment is maintained on multiple nodes and all received segments are correct, so decoding succeeds as long as the number m of received segments ($S_{rec}$) reaches the lower limit $n_t$ ($m \ge n_t$). Collection for a group ends when segments with n different j values of $S_j$ have been gathered. After all T transactions are recovered, the original block is decoded successfully. The processes of data collection and decoding are as follows:
1. Collecting segments. The node collects data segments $S_j$ from its neighbor nodes with the support of the data request module. Specifically, the node issues a request via the Decode Reader submodule, and upon receiving it, the neighbor nodes return their stored segments and metadata through the Local Reader submodule. Typically, Ethereum nodes discover their neighbors and maintain connections with the K-bucket and Kademlia algorithms; in particular, the K-bucket can be extended to record information about the data segments to improve response efficiency. As in the encoding process, a group of segments is the minimum unit of data collection, and the request for a group finishes when the node has received n segments with different j values;
2. Decoding segments. Based on the m received segments $S_{rec}$ and $n - m$ constructed segments of L zero bytes, matrix $A_{rec}^1$ is assembled, and the first group of p segments can be decoded according to Equation (3). The decoding matrix $A_{dec}^1$, called a decoded result, is obtained by decoding $A_{rec}^1$, and each row of $A_{dec}^1$ is a decoded segment.
Similarly, the other groups can be decoded as $A_{dec}^2, \ldots, A_{dec}^g$, and the decoded results can be represented as a matrix with $p \times g$ rows and L columns as follows:

$$B_{dec} = \left[ A_{dec}^1, A_{dec}^2, \ldots, A_{dec}^g \right]^T. \quad (16)$$

Then, the original block body is obtained by removing all redundant rows and reassembling the T transactions byte by byte. The correctness of the recovered block can be checked by computing the Merkle root over the decoded transactions and comparing the result with the root maintained in the header.
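Assuming segments are lost rather than silently corrupted, the per-column recovery can be sketched with reedsolo's erasure decoding, where the positions of the zero-filled placeholder segments are passed as known erasures; this is an illustrative stand-in for the RS Decoder, not the paper's implementation:

```python
# Group recovery under erasures: missing segments become zero-byte
# placeholders whose positions are declared as erasures, so up to q
# losses per RS(6, 4) column are tolerated.
from reedsolo import RSCodec

rsc = RSCodec(2)                              # q = 2 parity bytes
column = rsc.encode(bytes([1, 2, 3, 4]))      # one column of A_enc, n = 6

received = bytearray(column)
missing = [2, 5]                              # segments never collected
for j in missing:
    received[j] = 0                           # zero-byte placeholders
recovered = rsc.decode(bytes(received), erase_pos=missing)[0]
assert bytes(recovered) == bytes([1, 2, 3, 4])
```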

3.4. Commitment Generation for Encoding Results

In a permissionless blockchain network, some malicious nodes exist that may return forged segments to interrupt the ledger recovery process and even corrupt the data integrity of the blockchain. Although a block can eventually be recovered by repeated RS decodings in such an untrusted network, the repeated data requests incur huge communication and computational overheads whenever decoding fails. According to the principles of the blockchain, all data should be committed to the header; however, in the encoding process described above, there is no commitment to the encoding results to guarantee their integrity. Thus, in this part, a Cuckoo filter is constructed to commit to the encoded segments before the new block is broadcast. During the ledger recovery process, the CF can then quickly detect forged segments via membership checks, which improves recovery efficiency.
During the block synchronization process, the CF can be constructed as follows:
1. Setting parameters. We set the number of elements to be stored to $N_x = \lceil T/p \rceil \cdot n$, the false positive probability FPP (usually less than 1%), the number of hash functions to k = 2, and the load factor $\gamma$ to 95%. From these parameters, we calculate the fingerprint length fl and the number of buckets $N_{buk}$, and then the filter size $S_{CF}$ in accordance with Equation (8). The CF is initialized with all bits set to 0;
2. Inserting the encoded segments. The hash value $h_1(S_j) = hash(S_j + r + i)$ is computed based on Equation (9). Then, $h_1(S_j)$ is taken modulo $N_{buk}$ to obtain $b_1$, the bucket into which $S_j$ is inserted, and the low fl bits of $h_1(S_j)$ are stored as the fingerprint in bucket $b_1$. When bucket $b_1$ is full, the fingerprint of the element first inserted into the bucket is kicked out and the new fingerprint is stored in bucket $b_1$; the kicked fingerprint is stored in another bucket $b_2$, where $b_2$ is obtained by taking $h_2(S_j) = h_1(S_j) \oplus hash(finger(h_1(S_j)))$ modulo $N_{buk}$ based on Equation (10), and the low fl bits of $h_2(S_j)$ are stored as the fingerprint in bucket $b_2$;
3. Writing the hash into the header. The fingerprints in each bucket of the CF are retrieved; empty entries contribute their initial (zero) values. All fingerprints are concatenated and hashed, and the resulting value, the CF hash, is written into the block header.
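Step 3 can be sketched as follows; `cf_hash` is a hypothetical helper that serializes the buckets, padding empty slots with zero-valued fingerprints as described above, and hashes the concatenation into the 256-bit commitment for the header:

```python
# Hypothetical CF-commitment helper for step 3.
import hashlib

def cf_hash(buckets: list[list[bytes]], c_buk: int, fl_bytes: int) -> bytes:
    blob = b"".join(
        b"".join(bucket) + bytes(fl_bytes) * (c_buk - len(bucket))
        for bucket in buckets
    )
    return hashlib.sha256(blob).digest()      # written into the block header
```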
During the block recovery process, the CF can be used in two cases:
1. First Check. Before decoding, the received segments are screened by membership checks against the generated commitment, so a reliable matrix for decoding can be constructed quickly;
2. Second Check. After the $\lceil T/p \rceil$ groups of received segments for a block are decoded, the transactions reconstructed from the decoded segments are organized into a Merkle tree to obtain the root, which is compared with the Merkle root maintained in the header to determine the correctness of the decoded result. If the two roots differ, the locally stored CF carries out the second check via membership checks to identify the groups containing erroneous segments. The segments of these groups are re-collected, decoded, and verified repeatedly until the computed Merkle root matches the one maintained in the header.
To further enhance block recovery efficiency in an untrusted network, the requirement for ending segment collection given in Section 3.3 must be tightened: segment collection for a group continues until n segments with distinct j values have been received that pass the First Check.

3.5. A Demonstration of the CS Engine

In this subsection, a demonstration of the CS engine is given, covering regularization, the encoding of a group of transactions, and the verification and decoding of a group of received segments.
As shown in Figure 4, with the RS(6,4) code, a group of four transactions ($tx_1 \sim tx_4$) is transformed by regularization into four transaction segments ($S_1 \sim S_4$). Taking the first column of data ($[d_1^1, d_1^2, d_1^3, d_1^4]^T$) from the four transaction segments, RS encoding produces the encoded column ($[d_1^1, d_1^2, d_1^3, d_1^4, c_1^1, c_1^2]^T$), which includes the redundant data ($[c_1^1, c_1^2]^T$). After RS encoding of the whole group is complete, we obtain six encoded segments: four transaction segments ($S_1 \sim S_4$) and two redundant segments ($S_5, S_6$).
In Figure 5, the six received data segments of a group contain a forged segment at location $j = 2$, and membership checks are performed against the commitment generated by the CF. The segment at $j = 2$ is detected as forged by the commitment check and is replaced with a segment of L zero bytes. After that, the group of segments is decoded.

3.6. The Features of CS Engine

Based on the design scheme, the features of a blockchain applying the CS engine are summarized as follows:
In the proposed CS engine, all nodes perform independently, which ensures the decentralization and robustness of the system; there is no longer a distinction between full and light nodes. The storage volume for the ledger on each node is reduced by encoding all transactions into multiple segments and distributing them to multiple nodes. When synchronizing a new block into the local blockchain, each node adaptively selects a subset of segments to store locally according to its storage resources. A more refined selection algorithm (such as the incentive mechanism introduced in FileCoin [1,48]) would yield better storage efficiency, but this is outside the scope of this paper.
The CS engine is consensus-algorithm-independent and only slightly modifies the original block structure and workflow, lowering the deployment difficulty. Specifically, (1) an additional 256-bit field is added to the block header to store the hash value of the CF, and (2) the encoding of transactions and the construction of the CF are added to the block synchronization process.
The CS engine reduces the storage burden on each node without damaging blockchain functions such as traceability. After collecting the necessary segments from neighbor nodes, the original transactions can be recovered by RS decoding, which preserves the traceability of the blockchain. With the support of the Cuckoo filter, forged responses can be detected efficiently by a membership check with O(1) complexity.

4. Theoretical Analysis

The compression ratio and the recovery availability are analyzed in Section 4.1 and Section 4.2, respectively. Then, the communication and computational overheads are analyzed in Section 4.3. For clarity, the symbols used in this paper and their meanings are listed in Table 2.

4.1. Compression Ratio

The benefits of the CS engine for blockchain storage can be measured in two respects: the local compression ratio $\beta_L$ and the system compression ratio $\beta_S$. The local compression ratio indicates the space saved by the CS engine on each node and is defined as

$$\beta_L = 1 - \frac{S_{CB}}{S_{OB}} = 1 - \frac{(\bar{L} + L_B) \cdot l \cdot \lceil T/p \rceil + S_{CF} + \bar{S}_H}{\bar{L} \cdot T + \bar{S}_H}, \quad (17)$$

where $S_{CB}$ and $S_{OB}$ are the average sizes of the compressed and original blocks stored on a node, respectively, and l $(1 \le l \le n)$ is the number of selected segments maintained locally. T is the average number of transactions packed into a block, and $\bar{S}_H$ is the average size of a block header. $\bar{L}$ is the average size of the regularized transactions (which equals the average size of the data segments obtained by RS encoding, as described in Section 3.2), and $L_B$ is the size of the metadata describing the index of each encoded segment. Considering that a block contains thousands of transactions (T is very large) and that $\bar{S}_H$ and $L_B$ are much smaller than $\bar{L}$, the local compression ratio can be approximated as

$$\beta_L \approx 1 - \frac{l}{p} - \frac{fl}{\bar{L} \cdot \gamma} \cdot \frac{p+q}{p}, \quad (18)$$

where fl and $\gamma$ are the fingerprint length and load factor of the Cuckoo filter. Furthermore, since $\bar{L}$ is far larger than fl, Equation (18) can be simplified as

$$\beta_L \approx 1 - \frac{l}{p}. \quad (19)$$

From the above formula, the local compression ratio is determined by l and the RS pattern, specifically p. For example, with p = 20 and l = 1, $\beta_L \approx 95\%$.
For a group of n encoded segments, assume that a proportion $p_l$ $(l = 1, 2, \ldots, n)$ of the nodes in the network store l encoded segments, where $\sum_{l=1}^{n} p_l = 1$. The storage overhead of the whole system can then be expressed as

$$S_{sy} = N \cdot \left\lceil \frac{T}{p} \right\rceil \cdot \left( fl \cdot n + \sum_{l=1}^{n} l \cdot p_l \right), \quad (20)$$

where N is the total number of nodes. The system compression ratio can be expressed as

$$\beta_S = 1 - \frac{S_{sy}}{N \cdot S_{OB}} = 1 - \frac{fl \cdot n + \sum_{l=1}^{n} l \cdot p_l}{p}. \quad (21)$$
Assuming that $p_l = 2^{n-l} \cdot p_n$, i.e., the number of nodes storing $l + 1$ encoded segments is half the number of nodes storing l encoded segments, the system compression ratio can be expressed as

$$\beta_S = 1 - \frac{fl \cdot n + \sum_{l=1}^{n} l \cdot 2^{n-l}/(2^n - 1)}{p}. \quad (22)$$

Specifically, when p = 20, the $\beta_S$ of the CS engine-based scheme is approximately 90%. For a plain blockchain network, whenever the proportion of full nodes exceeds 10% of all nodes, the scheme proposed in this paper has the advantage in storage.
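The approximate 90% figure can be checked numerically; the short sketch below evaluates Equation (22) with the small Cuckoo-filter term $fl \cdot n$ dropped, under the assumed geometric distribution of $p_l$:

```python
# Numeric check of Equation (22): with p_l = 2^(n-l)/(2^n - 1), the average
# number of locally stored segments sum(l * p_l) approaches 2, so
# beta_S ~ 1 - 2/p once the (small) Cuckoo-filter term fl*n is ignored.
p = 20
n = 2 * p                                   # e.g. p:q = 1:1, so n = p + q = 2p
avg_segments = sum(l * 2 ** (n - l) for l in range(1, n + 1)) / (2 ** n - 1)
beta_s = 1 - avg_segments / p
print(f"beta_S ~ {beta_s:.2%}")             # ~90% for p = 20
```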
From the above formula, the system compression ratio is mainly determined by the number of segments maintained locally on each node, which is affected by the proportions of the nodes with limited and ample resources in the network. Considering that most nodes act as light nodes with limited resources in blockchain applications, especially for the blockchain-based IoT scenario, the system storage efficiency could be improved greatly by applying the CS engine.

4.2. Availability of Block Recovery

The Cuckoo filter is constructed to guarantee the data integrity of the encoding segments during the RS encoding process, which can provide error detection via membership checks during the ledger recovery process. The received segments are first checked by the CF, and only valid ones can be used in the RS decoding process. Correspondingly, the nodes providing the forged segments will be removed from the peer list maintained locally. The availability of the ledger recovery for the Byzantine network environment is analyzed in this section.
For simplicity and clarity, we make the following assumptions about the Byzantine network: (1) every encoded segment is maintained by multiple nodes; (2) the proportion of malicious nodes in the network is denoted by $\alpha$, and $\alpha$ does not exceed 0.5; (3) all malicious nodes act independently; and (4) errors in the communication channel are outside the scope of the analysis. Furthermore, we assume the malicious nodes fall into two types: those that modify segments at random ($N_S$) and those that modify segments with the support of ample computing resources ($N_C$); the latter can cheat the CF by finding hash collisions. The proportions of the two types among all nodes are denoted by $\mu$ and $(\alpha - \mu)$, respectively.
During the ledger recovery process, all received segments are first checked by the CF. A forged segment provided by $N_S$ cheats the CF with probability

$$P_F = \frac{1}{2^{fl} \cdot N_{buk}} = \frac{1}{2^{\log_2 (2 C_{buk}/FPP)} \cdot N_{buk}}. \quad (23)$$

All forged segments provided by $N_C$ can cheat the CF because they are colliding pre-images. The proportion of forged segments among all segments passing the CF check can then be estimated as

$$\alpha' = \frac{\mu \cdot P_F + (\alpha - \mu)}{1 - \alpha + \mu \cdot P_F + (\alpha - \mu)}. \quad (24)$$
After the first check, for n ($n = p + q$) received segments of a group, the probability that exactly c correct segments exist can be estimated as

$$P(X = c) = \binom{n}{c} \cdot (1 - \alpha')^c \cdot (\alpha')^{\,n-c}. \quad (25)$$

Suppose $p : q = \xi$; then $n : p = (1 + \xi)/\xi$, and when n increases proportionally with p, $n : p$ is fixed. $P(X = c)$ can thus be written as

$$P(X = c) = \binom{\frac{1+\xi}{\xi} \cdot p}{c} \cdot (1 - \alpha')^c \cdot (\alpha')^{\frac{1+\xi}{\xi} \cdot p - c}. \quad (26)$$
Therefore, the probability of successfully recovering a group ($A^g$) is

$$P_{RecG} = P(X \ge n_t) = \sum_{i=n_t}^{n} P(X = i), \quad (27)$$

where $n_t$ is the minimum number of segments required to decode a group successfully. When $\alpha \le 1/3$ and $\xi \le 1$, $P_{RecG}$ increases towards 1 as p increases; when $\xi$ takes a larger value, $P_{RecG}$ decreases in steps as p increases.
The number of attempts needed to decode a group correctly follows a geometric distribution, so the expected number of attempts to recover a group of encoded segments is

$$E(RecG) = \frac{1}{P_{RecG}}. \quad (28)$$

Considering that the transactions in a block are divided into $\lceil T/p \rceil$ groups, $\lceil T/p \rceil \cdot E(RecG)$ attempts are required in total, and the average number of decoding attempts per group to recover the block is

$$N_{RecB} = \frac{\lceil T/p \rceil \cdot E(RecG)}{\lceil T/p \rceil} = E(RecG). \quad (29)$$
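Equations (25)–(29) can be evaluated directly; the sketch below computes the group recovery probability and the expected number of decoding attempts for a given n, $n_t$, and post-check forged fraction $\alpha'$ (the example parameters are illustrative):

```python
# Sketch of Equations (25)-(29): X ~ Binomial(n, 1 - alpha') counts the
# correct segments in a group; the group decodes when X >= n_t, and the
# number of attempts follows a geometric law with mean 1 / P_RecG.
from math import comb

def p_rec_g(n: int, n_t: int, alpha_p: float) -> float:
    # P_RecG = sum_{i = n_t}^{n} C(n, i) (1 - a')^i (a')^(n - i)
    return sum(comb(n, i) * (1 - alpha_p) ** i * alpha_p ** (n - i)
               for i in range(n_t, n + 1))

def n_rec_b(n: int, n_t: int, alpha_p: float) -> float:
    return 1 / p_rec_g(n, n_t, alpha_p)     # N_RecB = E(RecG) = 1 / P_RecG

# Example: RS(30, 20), i.e. p:q = 2:1, requiring n_t = 25 valid segments.
print(n_rec_b(30, 25, 1 / 3))
```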
Thus, we can draw the following conclusion: when $\alpha \le 1/3$ and $\xi \le 1$, $N_{RecB}$ decreases towards 1 as p increases; when $\xi$ takes a larger value, $N_{RecB}$ increases in steps as p increases.

4.3. Implementation Overheads

The CS engine decreases the storage burden on each node with extra communication and computation overheads. The communication overhead is introduced mainly by the data segment collection during the ledger recovery process. The computation overhead is introduced mainly by the RS encoding and decoding processes.
Based on Section 3.4, the communication overhead can be estimated as

$$O_{Comm} = \left\lceil \frac{T}{p} \right\rceil \cdot n \cdot \bar{L}. \quad (30)$$

As p increases and n grows proportionally with p, the communication overhead for block recovery barely changes. Ledger recovery involves only a small fraction of nodes, triggered when a node joins the blockchain system or a historical transaction needs to be viewed; the communication overhead for the requesting node may be significant, but it does not noticeably increase the overall communication overhead of the network.
The computation overhead can be divided into two parts: (1) encoding and computing the CF for storage and (2) decoding to recover a block. Under the RS(n, p) pattern, the average computation overhead of encoding a group of p transactions is denoted by EnCom, so the overhead of encoding a block is $\lceil T/p \rceil \cdot EnCom$. In addition, $\lceil T/p \rceil \cdot 2n$ hash operations are required to insert the $\lceil T/p \rceil \cdot n$ encoded segments into the CF. Thus, the computational overhead for encoding is

$$O_{CompEn} = \left\lceil \frac{T}{p} \right\rceil \cdot (EnCom + 2n \cdot hash). \quad (31)$$

The computational overhead of decoding a group of received segments under RS(n, p) is denoted by DeCom. The average decoding computation required to recover a block successfully is $N_{RecB} \cdot \lceil T/p \rceil \cdot DeCom$; in addition, each decoding attempt requires 2n hash operations to verify whether the decoded segments exist in the CF:

$$O_{CompDe} = N_{RecB} \cdot \left\lceil \frac{T}{p} \right\rceil \cdot (DeCom + 2n \cdot hash). \quad (32)$$

As p increases and n grows proportionally with p, EnCom and DeCom grow proportionally to $p \log p$, so $O_{CompEn}$ increases continuously. Furthermore, $N_{RecB}$ increases or approaches 1, leading to an increase in $O_{CompDe}$. Therefore, as p increases, the computational overhead keeps growing.

5. Experimental Evaluation

The simulation platform is established and the compression ratio is evaluated in Section 5.1. Then, the availability of the ledger recovery is evaluated in Section 5.2. Finally, the proposed scheme is compared with the existing scheme in terms of compression ratio and block recovery delay in Section 5.3.

5.1. Evaluation of Compression Ratio

An Ethereum-like private blockchain system is formed using a PC with Windows 10 OS, two Ubuntu virtual machines, and two CentOS virtual machines. The PC has an Intel (R) Core(TM) i7-7700HQ CPU @2.8 GHz and 8 GB memory. Each virtual machine is assigned 2 GB memory and 20 GB disk space. The blockchain protocols are implemented with Python 3.7, where only the core functions (such as consensus process, transaction processing, etc.) are realized. Furthermore, all parameters in the private blockchain are referenced from the typical values in Ethereum, and all blocks to be processed are the blocks (from 16,530,547 to 16,537,406) fetched from the real-world Ethereum. These blocks have an average size of 83,126 bytes, with an average of 155.15 transactions per block.
We consider a fixed value of p:q (typically, p:q = 1:2, 2:1, or 1:1 [18]) when we study the relationship between properties and p. The core function of the RS module is implemented with the support of the ISA-L library.
Furthermore, the CF size can be adjusted dynamically according to p:q and FPP. The error correction ability of RS decoding is determined by the number of redundant segments, i.e., the value of q: the larger q is, the stronger the error correction of RS decoding, and hence the less stringent the required CF false positive probability (FPP). We therefore assume $FPP^q = 10^{-20}$, which reduces the CF storage overhead while preserving block recovery efficiency as p increases. Otherwise, we set the load factor $\gamma$ to 95% and the number of hash functions k to 2 for the CF according to [40]. In addition, the proportion of malicious nodes that modify segments at random among all malicious nodes is set to 0.5, i.e., $\mu = 0.5\alpha$.
It should be noted that the simulation results on the virtual machines are almost the same as those on the PC. According to Equation (17), the five nodes in the private blockchain system have the same compression ratio performance, which means the system compression ratio can be estimated on a single node. Furthermore, under different parameters, the five nodes show the same trends in encoding and decoding speed, so the trend of block recovery performance can also be tested on a single node. Therefore, for simplicity, the subsequent analysis is based on the simulation results from the PC with Windows 10 OS.
The CF sizes for different p:q are shown in Figure 6, where the sizes for p:q = 2:1, 1:1, and 1:2 are represented by the black, red, and blue lines, respectively. Two conclusions can be drawn from Figure 6. First, the CF size decreases for a larger p, because a larger p results in a shorter fingerprint under the assumption $FPP^q = 10^{-20}$. Second, for a specific p, the CF size increases for a larger p:q because more redundant segments need to be inserted.
Unlike the CF size, the size of the segments maintained locally on each node is determined by p and the number of selected segments l. Based on the selected blocks, when p:q = 1:1, the local compression ratio for l = 1 and 2 is shown by the black and blue lines in Figure 7, from which it can be concluded that the local compression ratio decreases for a larger l. When l = 1, the local compression ratio for p:q = 2:1, 1:1, and 1:2 is shown by the red, blue, and green lines, respectively. It can be concluded that the larger the p:q and the larger the l, the smaller the $\beta_L$, which coincides with Equation (19).
As indicated in Figure 7, the simulation results match the theoretical results very well. Generally speaking, as p increases, the compression ratio gradually tends toward 1; when p = 20, the compression ratio for l = 1 exceeds 94%. Considering that most nodes in the blockchain network are light nodes with limited resources, they will store only a few encoded segments locally, as simulated. Therefore, the system storage efficiency can be improved greatly.

5.2. Efficiency of Block Recovery

In this section, we evaluate the average number of decoding attempts and the average decoding time for block recovery.
When α = 1/3, the average number of decoding attempts ($N_{RecB}$) for recovering blocks in our scheme is shown in Figure 8, where the curves for p:q = 2:1, 1:1, and 1:2 are plotted in black, red, and blue, respectively. As p increases, $N_{RecB}$ decreases and gradually tends to 1 when p:q = 1:2, whereas it increases in steps when p:q = 2:1, consistent with our analysis in Section 4.2. Since $n_t$ may take the same integer value for different p in the sum over Equation (25), when p:q = 2:1, $P_{RecG}$ decreases in steps, and thus $N_{RecB}$ increases in steps.
We now discuss the average decoding time delay of block recovery, $t_B$, under different p:q, where the curves for p:q = 2:1, 1:1, and 1:2 are plotted in black, red, and blue, respectively, in Figure 9. When the proportion of malicious nodes is α = 1/3, the curve for p:q = 2:1 shows that $t_B$ increases significantly with p, because more attempts are required overall and the decoding becomes more complex. In contrast, when p:q = 1:1 or 1:2, $N_{RecB}$ tends to 1 as p increases, and the growth of $t_B$ is mainly due to the increasing complexity of the decoding process, so $t_B$ increases only slowly, as shown in Figure 9.
Figure 10 illustrates that when p:q = 1:1, $t_B$ increases with p for different proportions of malicious nodes α, where the curves for α = 1/2, 2/5, and 1/3 are plotted in black, red, and blue, respectively. The larger the α, the larger the $N_{RecB}$ required to recover a block, and $t_B$ increases significantly with p. When p = 20 and α = 1/2, the maximum $t_B$ is approximately 1.74 ms, which is still small compared with the block time. Therefore, the blockchain system based on the CS engine achieves excellent block recovery efficiency under attacks by different proportions of malicious nodes.
Since block recovery completes within milliseconds or less, a blockchain using the CS engine is friendly both to ledger synchronization when new nodes join and to the recovery of historical blocks by existing nodes.

5.3. Comparison between the Proposed Scheme and the Existing Scheme

The compression ratios of the proposed scheme and the ECLS scheme are compared in Figure 11, where the two schemes under p:q = 1:1 are plotted in black and red, respectively. As indicated in Figure 11, the compression ratios of both schemes increase with p, as they use the same RS encoding pattern. For a specific p, the ECLS scheme has a slightly higher compression ratio because it incurs no storage overhead for the Cuckoo filter and metadata.
The average decoding time delay of block recovery for the proposed scheme and the ECLS scheme is compared in Figure 12, where the two schemes under α = 1/3 and p:q = 1:1 are plotted in black and red, respectively. On the one hand, since the ECLS scheme provides no guarantee of the integrity of the encoded segments, more forged segments can enter the decoding process. On the other hand, when a decoded result fails the Merkle root verification, the ECLS scheme cannot localize the error, so all $\lceil T/p \rceil$ groups of segments must be re-collected and decoded. The smaller the p, the more groups there are and the greater the probability of error, which results in a larger time delay, as shown by the red line in Figure 12.
By contrast, based on commitment verification, the proposed scheme effectively reduces the participation of forged segments in decoding and quickly identifies the erroneous groups after a failed Merkle root verification. Therefore, compared with the ECLS scheme, our scheme requires fewer decoding attempts, and regardless of p, it shows an obvious advantage in $t_B$, as shown in Figure 12. Even though the gap in $t_B$ between the two schemes gradually narrows, there is still an improvement of more than 200% in block recovery efficiency when p = 20.

6. Conclusions

This paper proposes the CS engine, a distributed ledger storage solution based on the CF and RS code. Through data regularization and RS encoding, the CS engine stores the transaction segments and redundant segments derived from the transactions across nodes. By combining the verification ability of the CF with the error-correcting ability of RS decoding, the CS engine greatly improves the efficiency of block data recovery under malicious node attacks. This approach relieves the data storage pressure while still allowing a specific transaction to be queried relatively quickly by recovering the whole block, preserving core characteristics of the blockchain such as traceability and decentralization. Theoretical analysis and experimental verification demonstrate that the CS engine supports fast block data recovery and resists malicious node attacks effectively. It should be noted that the security of the CS engine is guaranteed under the assumption of permissionless blockchains; its security under stricter assumptions, such as those proposed in [49], including the security of the CF itself, should be studied in future work.

Author Contributions

Conceptualization, J.Y. and W.J.; methodology, W.J. and Z.G. (Zhen Gao); software, J.Y. and W.J.; validation, J.Y., W.J. and Z.P.; formal analysis, J.Y., Z.G. (Zhaohui Guo) and Y.Z.; investigation, W.J. and Y.Z.; writing—original draft preparation, W.J. and J.Y.; writing—review and editing, Z.G. (Zhaohui Guo), Y.Z. and Z.P.; project administration, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the College Government Procurement Branch of Education Accounting Society of China, grant number EASCCGPB2022MS24.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nakamoto, S. Bitcoin: A Peer-to-Peer Electronic Cash System. Available online: https://bitcoin.org/bitcoin.pdf (accessed on 16 November 2008).
  2. Ethereum: A Secure Decentralised Generalised Transaction Ledger. Available online: https://ethereum.org/en/whitepaper/ (accessed on 30 October 2017).
  3. Imran, M.; Zaman, U.; Imran; Imtiaz, J.; Fayaz, M.; Gwak, J. Comprehensive survey of IoT, machine learning, and blockchain for health care applications: A topical assessment for pandemic preparedness, challenges, and solutions. Electronics 2021, 10, 2501. [Google Scholar] [CrossRef]
  4. Hasselgren, A.; Kralevska, K.; Gligoroski, D.; Pedersen, S.A.; Faxvaag, A. Blockchain in healthcare and health sciences—A scoping review. Int. J. Med. Inform. 2020, 134, 104040. [Google Scholar] [CrossRef]
  5. Moosavi, J.; Naeni, L.M.; Fathollahi-Fard, A.M.; Fiore, U. Blockchain in supply chain management: A review, bibliometric, and network analysis. Environ. Sci. Pollut. Res. 2021, 1–15. [Google Scholar] [CrossRef]
  6. Haro-Olmo, F.; Alvarez-Bermejo, J.A.; Varela-Vaca, A.J.; López-Ramos, J.A. Blockchain-based federation of wireless sensor nodes. J. Supercomput. 2021, 77, 7879–7891. [Google Scholar] [CrossRef]
  7. Zhang, K.; Zhu, Y.; Maharjan, S.; Zhang, Y. Edge intelligence and blockchain empowered 5G beyond for the industrial Internet of Things. IEEE Network 2019, 33, 12–19. [Google Scholar] [CrossRef]
  8. Alam, T. Cloud-based IoT applications and their roles in smart cities. Smart Cities 2021, 4, 1196–1219. [Google Scholar] [CrossRef]
  9. Hosseini, S.M.; Ferreira, J.; Bartolomeu, P.C. Blockchain-Based Decentralized Identification in IoT: An Overview of Existing Frameworks and Their Limitations. Electronics 2023, 12, 1283. [Google Scholar]
  10. Xu, M.; Chen, X.; Kou, G. A systematic review of blockchain. Financ. Innov. 2019, 5, 1–14. [Google Scholar] [CrossRef] [Green Version]
  11. Etherscan. Available online: https://etherscan.io/ (accessed on 30 June 2023).
  12. Mian, A.N.; Shah, S.W.H.; Manzoor, S.; Said, A.; Heimerl, K.; Crowcroft, J. A value-added IoT service for cellular networks using federated learning. Comput. Netw. 2022, 213, 109094. [Google Scholar] [CrossRef]
  13. Zohar, A. Bitcoin: Under the hood. Commun. ACM 2015, 58, 104–113. [Google Scholar] [CrossRef]
  14. Taylor, P.J.; Dargahi, T.; Dehghantanha, A.; Parizi, R.M.; Choo, K.K.R. A systematic literature review of blockchain cyber security. Digit. Commun. Networks 2020, 6, 147–156. [Google Scholar] [CrossRef]
  15. Poon, J.; Dryja, T. The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments. Available online: https://lightning.network/lightning-network-paper.pdf (accessed on 14 January 2016).
  16. Luu, L.; Narayanan, V.; Zheng, C.; Baweja, K.; Gilbert, S.; Saxena, P. A secure sharding protocol for open blockchains. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 17–30. [Google Scholar]
  17. Perard, D.; Lacan, J.; Bachy, Y.; Detchart, J. Erasure code-based low storage blockchain node. In Proceedings of the 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Halifax, NS, Canada, 30 July–3 August 2018; pp. 1622–1627. [Google Scholar]
  18. Qi, X.; Zhang, Z.; Jin, C.; Zhou, A. A reliable storage partition for permissioned blockchain. IEEE Trans. Knowl. Data Eng. 2020, 33, 14–27. [Google Scholar]
  19. Guo, Z.; Gao, Z.; Liu, Q.; Chakraborty, C.; Hua, Q.; Yu, K.; Wan, S. RNS-based adaptive compression scheme for the block data in the blockchain for IIoT. IEEE Trans. Ind. Informatics 2022, 18, 9239–9249. [Google Scholar]
  20. Hassanzadeh-Nazarabadi, Y.; Küpçü, A.; Özkasap, Ö. Lightchain: Scalable dht-based blockchain. IEEE Trans. Parallel Distrib. Systems 2021, 32, 2582–2593. [Google Scholar] [CrossRef]
  21. Benisi, N.Z.; Aminian, M.; Javadi, B. Blockchain-based decentralized storage networks: A survey. J. Netw. Comput. Appl. 2020, 162, 102656. [Google Scholar] [CrossRef]
  22. Rosa, R.V.; Rothenberg, C.E. Blockchain-based decentralized applications for multiple administrative domain networking. IEEE Commun. Stand. Mag. 2018, 2, 29–37. [Google Scholar] [CrossRef]
  23. Sayeed, S.; Marco-Gisbert, H. Assessing blockchain consensus and security mechanisms against the 51% attack. Appl. Sci. 2019, 9, 1788. [Google Scholar]
  24. Gai, K.; Wang, S.; Zhao, H.; She, Y.; Zhang, Z.; Zhu, L. Blockchain-Based Multisignature Lock for UAC in Metaverse. IEEE Trans. Comput. Soc. Syst. 2022. [Google Scholar] [CrossRef]
  25. Yang, J.s.; Wang, H.; Gao, Z.; Guo, Z.h. Double RSA accumulator based stateless transaction verification scheme. J. Zhejiang Univ. (Eng. Sci.) 2023, 57, 178–189. [Google Scholar]
  26. Hamming, R.W. Error detecting and error correcting codes. Bell Syst. Tech. J. 1950, 29, 147–160. [Google Scholar] [CrossRef]
  27. Etzion, T.; Vardy, A. Error-correcting codes in projective space. IEEE Trans. Inf. Theory 2011, 57, 1165–1173. [Google Scholar] [CrossRef]
  28. Wang, G.; Peng, H.; Tang, Y. Repair and restoration of corrupted LZSS files. IEEE Access 2019, 7, 9558–9565. [Google Scholar] [CrossRef]
  29. Han, Y.S.; Pai, H.T.; Zheng, R.; Varshney, P.K. Update-efficient error-correcting product-matrix codes. IEEE Trans. Commun. 2015, 63, 1925–1938. [Google Scholar] [CrossRef] [Green Version]
  30. Reed, I.S.; Solomon, G. Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 1960, 8, 300–304. [Google Scholar] [CrossRef]
  31. Sudan, M. Decoding of Reed Solomon codes beyond the error-correction bound. J. Complex. 1997, 13, 180–193. [Google Scholar] [CrossRef] [Green Version]
  32. Lin, S.J.; Al-Naffouri, T.Y.; Han, Y.S. FFT algorithm for binary extension finite fields and its application to Reed–Solomon codes. IEEE Trans. Inf. Theory 2016, 62, 5343–5358. [Google Scholar] [CrossRef] [Green Version]
  33. Gao, S. A new algorithm for decoding Reed-Solomon codes. In Communications, Information and Network Security; Springer: Boston, MA, USA, 2003; pp. 55–68. [Google Scholar]
  34. Plank, J.S.; Ding, Y. Intel Intelligent Storage Acceleration Library (Intel ISA-L). Available online: https://software.intel.com/en-us/storage/ISA-L (accessed on 16 November 2008).
  35. Almeida, P.S.; Baquero, C.; Preguiça, N.; Hutchison, D. Scalable bloom filters. Inf. Process. Lett. 2007, 101, 255–261. [Google Scholar] [CrossRef] [Green Version]
  36. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426. [Google Scholar] [CrossRef]
  37. Kiss, S.Z.; Hosszu, É.; Tapolcai, J.; Rónyai, L.; Rottenstreich, O. Bloom filter with a false positive free zone. IEEE Trans. Netw. Serv. Manag. 2021, 18, 2334–2349. [Google Scholar] [CrossRef]
  38. Fan, B.; Andersen, D.G.; Kaminsky, M.D. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, New York, NY, USA, 2–5 December 2014; pp. 75–88. [Google Scholar]
  39. Reviriego, P.; Martínez, J.; Larrabeiti, D.; Pontarelli, S. Cuckoo filters and bloom filters: Comparison and application to packet classification. IEEE Trans. Netw. Serv. Manag. 2020, 17, 2690–2701. [Google Scholar] [CrossRef]
  40. Ting, D.; Cole, R. Conditional cuckoo filters. In Proceedings of the 2021 International Conference on Management of Data, New York, NY, USA, 20–25 June 2021; pp. 1838–1850. [Google Scholar]
  41. Lian, W.; Li, Y.; Wang, J.; You, J. A Cuckoo Filter-Based Name Resolution and Routing Method in Information-Centric Networking. Electronics 2022, 11, 3243. [Google Scholar] [CrossRef]
  42. Mosharraf, S.I.M.; Adnan, M.A. Improving lookup and query execution performance in distributed Big Data systems using Cuckoo Filter. J. Big Data 2022, 9, 1–30. [Google Scholar] [CrossRef]
  43. Shafeeq, S.; Zeadally, S.; Alam, M.; Khan, A. Curbing address reuse in the iota distributed ledger: A cuckoo-filter-based approach. IEEE Trans. Eng. Manag. 2019, 67, 1244–1255. [Google Scholar] [CrossRef]
  44. Zhang, H.; Zhao, F. Cross-domain identity authentication scheme based on blockchain and PKI system. High-Confid. Comput. 2023, 3, 100096. [Google Scholar] [CrossRef]
  45. Ben-Sasson, E.; Goldberg, L.; Kopparty, S.; Saraf, S. DEEP-FRI: Sampling Outside the Box Improves Soundness. In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS), Washington, DC, USA, 12–14 January 2020; pp. 5:1–5:32. [Google Scholar]
  46. Zhou, Z.; Zhang, Z.; Tao, H.; Li, T.; Zhao, B. Efficient inner product arguments and their applications in range proofs. IET Inf. Secur. 2023, 17, 485–504. [Google Scholar] [CrossRef]
  47. Kate, A.; Zaverucha, G.M.; Goldberg, I. Constant-Size Commitments to Polynomials and Their Applications. In Proceedings of the 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, 5–9 December 2010; pp. 177–194. [Google Scholar]
  48. Benet, J.; Greco, N. Filecoin: A Decentralized Storage Network. Available online: https://filecoin.io/filecoin.pdf (accessed on 6 May 2020).
  49. Al-Bassam, M.; Sonnino, A.; Buterin, V.; Khoffi, I. Fraud and data availability proofs: Detecting invalid blocks in light clients. In Proceedings of the 25th International Conference on Financial Cryptography and Data Security (FC 2021), Virtual Event, 1–5 March 2021; Revised Selected Papers, Part II. [Google Scholar]
Figure 1. The chain structure of the blockchain.
Figure 3. Storage system model.
Figure 4. Example of the regularization and encoding of a group of transactions.
Figure 5. Example of the verification and decoding of a group of received segments.
Figure 6. The storage overhead of the CF under different p:q.
Figure 7. The compression ratio of a block.
Figure 8. The average number of decoding attempts required to recover a block.
Figure 9. The average time for block recovery under different p:q, when the proportion of malicious nodes α = 1/3.
Figure 10. The average time for block recovery under different proportions of malicious nodes, when p:q = 1:1.
Figure 11. The performance in terms of compression ratio between the proposed and ECLS scheme, when α = 1/3 and p:q = 1:1.
Figure 12. The performance in terms of the time delay for the block recovery process between the proposed and ECLS scheme, when α = 1/3 and p:q = 1:1.
Table 1. The typical polynomial commitment (PC) schemes.

| Technology | Cryptographic Assumption                 | Proof Size        | Verification Delay |
|------------|------------------------------------------|-------------------|--------------------|
| FRI        | Hashes only                              | Large (10–200 KB) | Medium             |
| IPA        | Elliptic curves                          | Medium (1–3 KB)   | High               |
| KZG        | Elliptic curves, pairings, trusted setup | Small (≈500 B)    | Low                |
Table 2. Main parameters and meanings of CS.

| Symbol | Meaning                                 | Symbol | Meaning                                  |
|--------|-----------------------------------------|--------|------------------------------------------|
| f_l    | length of fingerprint                   | T      | average number of transactions           |
| γ      | load factor of the CF                   | S̄_H    | average size of the block headers        |
| P_F    | probability of cheating the CF          | S_CF   | average size of the CF                   |
| β_S    | system compression ratio                | S_CB   | average compressed block size            |
| β_L    | local compression ratio                 | S_OB   | average original block size              |
| β_l    | β_L under different l                   | S_sy   | storage overhead of the system           |
| N      | total number of nodes                   | L_B    | metadata size per segment                |
| n      | number of segments in a group           | l      | number of selected segments              |
| p      | number of txs in a group                | α      | proportion of malicious nodes            |
| q      | number of redundant segments in a group | p_l    | proportion of nodes storing l segments   |
| L̄      | average size of the regularized txs     | α      | proportion of forged segments for decoding |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
