Optimizing Merkle Proof Size Through Path Length Analysis: A Probabilistic Framework for Efficient Blockchain State Verification
Abstract
1. Introduction
- A rigorous mathematical model describing the distribution of path lengths in tries that contain random blockchain addresses;
- Empirical validation through extensive computational experiments on tries containing from 100 to 300 million addresses;
- Accurate prediction of structural properties: theoretical and experimental average path lengths differ by no more than 0.01 at every tested scale;
- Guidelines on practical implementation, showing potential proof size reductions of up to 70% using optimized path structuring.
2. Related Work and Theoretical Foundations
2.1. Theoretical Foundations of Patricia Tries
2.2. Blockchain Data Structure Optimizations
- Tabatabaei et al. (2023) [13] analyzed Ethereum’s Modified Merkle Patricia Trie, finding that it benefits state storage but creates scalability challenges during proof generation. Mardiansyah et al. (2023) [14] proposed the Multi-State Merkle Patricia Trie for query processing optimization and reported substantial performance improvements from modified node structures and traversal algorithms;
- Yang et al. (2024) [15] introduced SolsDB, which tackled performance bottlenecks and showed that an optimized storage engine can significantly reduce state access latency. Their results demonstrated a 30% improvement in read performance compared to traditional MPT implementations;
- Mizrahi et al. (2024) [16] introduced an innovative approach to optimizing Merkle tries based on transaction patterns, achieving significant reductions in proof sizes through traffic-aware structuring. Their algorithms, inspired by coding theory methods, demonstrated substantial improvements in communication costs for both payment and smart contract transactions on the Ethereum network;
- Kuznetsov et al. (2024) [17] proposed a novel OR-based proof aggregation technique that enables compact and universally verifiable proofs for Merkle tree inclusion. Their approach achieves proof sizes independent of tree leaf count while maintaining universal verifiability, representing a significant advancement in proof system scalability.
2.3. Ethereum-Specific Challenges
- State Bloat: The rapid growth in the size of Ethereum’s state has led to increasing storage and verification overhead;
- Proof size: The current MPT implementation generates prohibitively large stateless proofs, up to 300 MB in the worst cases (vitalik.eth, 2024) [4];
- Client Efficiency: The width-16 structure of Ethereum’s MPT impacts client performance, particularly for stateless implementations.
2.4. Research Gap
- No comprehensive analysis of path length distributions in MPTs at the Ethereum scale;
- Lack of validated probabilistic models for predicting trie structure properties;
- Limited understanding of the relationship between address distribution and trie efficiency;
- Absence of empirically verified optimization strategies for large-scale implementations.
3. Background: Merkle Patricia Tries in Ethereum State Management
- Merkle trees provide cryptographic verification capabilities through hierarchical hashing, enabling efficient proof generation and verification [21]. The hash-based structure ensures data integrity while allowing selective disclosure of state information;
- Patricia tries (Practical Algorithm to Retrieve Information Coded in Alphanumeric [22]) enhance the radix trie concept by eliminating single-child nodes through path compression. This optimization significantly reduces the trie’s height and memory footprint while maintaining lookup efficiency.
- Deterministic root hash computation independent of insertion order;
- Efficient proof generation for state verification;
- Optimal storage utilization through prefix sharing and path compression;
- O(log n) complexity for key operations.
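The link between path length and proof size can be illustrated with a minimal inclusion-proof sketch. It uses a plain binary Merkle tree rather than Ethereum's MPT, and SHA-256 stands in for Keccak-256 (which is not in Python's standard library); the helper names `merkle_root_and_proof` and `verify` are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Assumption: SHA-256 is a stand-in for Keccak-256.
    return hashlib.sha256(data).digest()

def merkle_root_and_proof(leaves, index):
    """Return (root, proof) for leaves[index] in a binary Merkle tree."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify(leaf, proof, root):
    """Recompute the root from the leaf and its sibling hashes."""
    acc = h(leaf)
    for sibling, sibling_is_left in proof:
        acc = h(sibling + acc) if sibling_is_left else h(acc + sibling)
    return acc == root

leaves = [b"acct-%d" % i for i in range(5)]
root, proof = merkle_root_and_proof(leaves, 3)
print(len(proof), verify(leaves[3], proof, root))
```

The proof holds one sibling hash per level, so its size grows with the length of the path from leaf to root.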
3.1. Width-16 Structure Design Rationale
- Optimal Bit Manipulation: Processing 4 bits at a time provides efficient CPU operation alignment;
- Memory Access Patterns: 16-way branching creates node structures that align well with common memory page sizes;
- Storage Density: Reduced tree height compared to binary tries while maintaining manageable node sizes;
- Implementation Efficiency: Hexadecimal representation simplifies debugging and development.
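As a small illustration of the 4-bit processing described above, the sketch below splits an address into nibbles and compares the idealized depth of a balanced binary and a balanced 16-ary trie. The example address and the function names are hypothetical.

```python
import math

def to_nibbles(data: bytes) -> list:
    """Split each byte into its high and low 4-bit halves."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    return nibbles

def balanced_depth(n_keys: int, width: int) -> float:
    """Idealized depth of a perfectly balanced trie with the given branching factor."""
    return math.log(n_keys, width)

address = bytes.fromhex("00112233445566778899aabbccddeeff00112233")  # made-up address
path = to_nibbles(address)
print(len(path), path[:4])
print(balanced_depth(2**32, 2), balanced_depth(2**32, 16))
```

A 20-byte address yields a 40-nibble key, and for the same key count the 16-ary depth is one quarter of the binary depth.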
3.2. State Management in Ethereum
- Branch nodes: holding 16 child references (one per hexadecimal nibble) and an optional value;
- Extension nodes: encoding shared path segments;
- Leaf nodes: containing actual state data.
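The three node types can be sketched as plain in-memory records. This is a simplified illustration only: it omits hashing and RLP encoding, and the class names and `lookup` helper are assumptions rather than any client's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class Leaf:
    path: List[int]   # remaining key nibbles
    value: bytes      # state data

@dataclass
class Extension:
    path: List[int]   # nibbles shared by every key below this node
    child: "Node"

@dataclass
class Branch:
    children: List[Optional["Node"]] = field(default_factory=lambda: [None] * 16)
    value: Optional[bytes] = None

Node = Union[Leaf, Extension, Branch]

def lookup(node, nibbles):
    """Walk the trie along `nibbles`; return the stored value or None."""
    while node is not None:
        if isinstance(node, Leaf):
            return node.value if node.path == nibbles else None
        if isinstance(node, Extension):
            n = len(node.path)
            if nibbles[:n] != node.path:
                return None
            node, nibbles = node.child, nibbles[n:]
        else:  # Branch
            if not nibbles:
                return node.value
            node, nibbles = node.children[nibbles[0]], nibbles[1:]
    return None

branch = Branch()
branch.children[1] = Leaf([2, 3], b"a")
branch.children[4] = Leaf([5, 6], b"b")
root = Extension([0xA], branch)
print(lookup(root, [0xA, 1, 2, 3]))
```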
4. Probabilistic Model for Path Length Distribution
- Event A: The first $k-1$ symbols of the key match those of at least one other key in the trie;
- Event B: The $k$-th symbol of the key does not match that of any other key sharing the first $k-1$ symbols.
- Let $A_{k-1}$ be the event that the first $k-1$ symbols match with at least one other key;
- Let $A_k$ be the event that the first $k$ symbols match with at least one other key.
- Minimum path length: (root plus single branch);
- Maximum path length: (full address path);
- Variance bound: .
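A path length distribution of this kind can be sketched numerically under the simplest assumption: $N$ independent keys with uniformly random hexadecimal symbols. Writing $A_j$ for the event that the first $j$ symbols of a key match at least one of the other $N-1$ keys, $P(A_j) = 1 - (1 - 16^{-j})^{N-1}$ and $P(L = k) = P(A_{k-1}) - P(A_k)$. This simplified form is an assumption made for illustration; it is not necessarily the exact formula behind the reported theoretical values, which may differ slightly.

```python
def p_shared_prefix(k: int, n: int, width: int = 16) -> float:
    """P(first k symbols of a key match at least one of the other n-1 keys)."""
    return 1.0 - (1.0 - width ** -k) ** (n - 1)

def path_length_pmf(n: int, max_len: int = 40, width: int = 16) -> dict:
    """P(L = k) = P(A_{k-1}) - P(A_k) for k = 1..max_len (simplified model)."""
    return {
        k: p_shared_prefix(k - 1, n, width) - p_shared_prefix(k, n, width)
        for k in range(1, max_len + 1)
    }

def mean_path_length(pmf: dict) -> float:
    return sum(k * p for k, p in pmf.items())

pmf = path_length_pmf(100)
for k in range(1, 7):
    print(k, round(pmf[k], 6))
print("mean:", round(mean_path_length(pmf), 3))
```

The default `max_len` of 40 corresponds to the 40 nibbles of a 20-byte address.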
5. Experimental Methodology
5.1. Implementation Environment
- CPU: AMD Ryzen 7 7840 HS (3.80 GHz, 8 cores);
- RAM: 64 GB DDR5;
- OS: Windows 11.
5.2. Trie Implementation and Address Generation
- Generate a random 256-bit private key;
- Compute the public key on the secp256k1 elliptic curve;
- Take the Keccak-256 hash of the public key and keep the last 20 bytes as the address.
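Neither secp256k1 nor Keccak-256 ships with Python's standard library, so a runnable sketch has to substitute something for the derivation above. The version below simply draws 20 uniformly random bytes, on the assumption that hash outputs behave like uniform random strings; it is a stand-in for simulation purposes, not the key derivation itself, and the function names are hypothetical.

```python
import secrets

def random_address() -> str:
    """Return a 40-character hex string standing in for a derived address."""
    # Assumption: Keccak-256 output is modeled as uniformly random bytes.
    return secrets.token_bytes(20).hex()

def random_addresses(count: int) -> list:
    return [random_address() for _ in range(count)]

sample = random_addresses(5)
for addr in sample:
    print("0x" + addr)
```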
```python
class Node:
    def __init__(self):
        self.children = {}   # Hexadecimal character mappings
        self.is_end = False  # Terminal node indicator
```
- Each node represents a nibble (4 bits) of the address;
- The trie maintains the same 16-ary branching factor as Ethereum;
- Path compression is implemented through single-child node merging.
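One way to measure path lengths on such a structure is sketched below. It extends the node with a `count` field (an addition made for this sketch, not part of the original class) and treats the path length of a key as the number of nibbles consumed before the key is alone in its subtree, which is where single-child merging would collapse the remainder. Keys are assumed to be distinct.

```python
class Node:
    def __init__(self):
        self.children = {}   # hex character -> child Node
        self.is_end = False  # terminal node indicator
        self.count = 0       # keys passing through this node (added for this sketch)

class Trie:
    def __init__(self):
        self.root = Node()

    def insert(self, key: str) -> None:
        node = self.root
        node.count += 1
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Node()
            child.count += 1
            node = child
        node.is_end = True

    def path_length(self, key: str) -> int:
        """Nibbles consumed until `key` is the only key left in the subtree."""
        node = self.root
        for depth, ch in enumerate(key, start=1):
            node = node.children[ch]
            if node.count == 1:
                return depth
        return len(key)

trie = Trie()
for key in ["ab12", "ab34", "cd56"]:
    trie.insert(key)
print([trie.path_length(k) for k in ["ab12", "ab34", "cd56"]])
```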
5.3. Experimental Design
- Small-scale: addresses;
- Medium-scale: addresses;
- Large-scale: addresses;
- Network-scale: addresses;
- Ethereum-scale: addresses.
- Random address generation;
- Trie construction;
- Path length measurement;
- Statistical analysis.
5.4. Measurement Protocol
- Path lengths $L_i$ for all addresses $i = 1, \ldots, N$;
- Distribution of path lengths, tallied with Python's collections.Counter class;
- Average path length: $\bar{L} = \frac{1}{N} \sum_{i=1}^{N} L_i$;
- Memory utilization and construction time.
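The measurements listed above can be sketched without building a trie at all. The version below swaps in a different technique: it sorts the addresses and uses the fact that a key's longest shared prefix with any other key is always shared with one of its sorted neighbours. The distinguishing-prefix definition of path length and the function names are assumptions of this sketch.

```python
from collections import Counter
import random

def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def path_lengths(addresses):
    """Distinguishing-prefix length of each address, in sorted order (distinct inputs)."""
    ordered = sorted(addresses)
    lengths = []
    for i, addr in enumerate(ordered):
        left = lcp(addr, ordered[i - 1]) if i > 0 else 0
        right = lcp(addr, ordered[i + 1]) if i + 1 < len(ordered) else 0
        lengths.append(min(max(left, right) + 1, len(addr)))
    return lengths

rng = random.Random(0)  # fixed seed so the run is repeatable
addresses = {"%040x" % rng.getrandbits(160) for _ in range(1000)}
lengths = path_lengths(addresses)
distribution = Counter(lengths)
average = sum(lengths) / len(lengths)
print(sorted(distribution.items()), round(average, 2))
```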
5.5. Statistical Analysis
- Chi-square goodness-of-fit test;
- Mean absolute percentage error (MAPE);
- Kolmogorov–Smirnov test for distribution comparison.
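The three statistics can be sketched in a few lines using their textbook definitions. Whether the original analysis applied them to probabilities or to raw counts is not stated here, so the probability form below is an assumption; on counts, the chi-square statistic would be scaled by the sample size. The sample data are the rows for 1,000,000 addresses (path lengths 5 to 8) from the comparison table.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over categories with a non-zero expectation."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def mape(observed, expected):
    """Mean absolute percentage error, in percent."""
    terms = [abs((o - e) / e) for o, e in zip(observed, expected) if e != 0]
    return 100.0 * sum(terms) / len(terms)

def ks_statistic(observed_probs, expected_probs):
    """Largest gap between the two cumulative distributions on a shared support."""
    d = cum_o = cum_e = 0.0
    for o, e in zip(observed_probs, expected_probs):
        cum_o += o
        cum_e += e
        d = max(d, abs(cum_o - cum_e))
    return d

theoretical = [0.408987, 0.536665, 0.050860, 0.003268]
experimental = [0.408991, 0.536662, 0.050859, 0.003268]

print(chi_square(experimental, theoretical))
print(mape(experimental, theoretical))
print(ks_statistic(experimental, theoretical))
```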
5.6. Performance Metrics
- Time complexity:
  - Trie construction time as a function of the address length;
  - Path length calculation time;
  - Statistical analysis time as a function of the number of unique path lengths.
- Space complexity:
  - Peak memory usage;
  - Node count;
  - Storage overhead per address.
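Metrics of this kind can be collected with Python's standard `time` and `tracemalloc` modules. In the sketch below, `build_prefix_set` is a hypothetical stand-in workload that stores one entry per trie node, and nodes per key serves as a rough proxy for storage overhead per address.

```python
import time
import tracemalloc

def build_prefix_set(keys):
    """Stand-in workload: collect every prefix, one entry per trie node."""
    nodes = set()
    for key in keys:
        for i in range(1, len(key) + 1):
            nodes.add(key[:i])
    return nodes

def profile(build, keys):
    """Time a build function and record its peak traced memory."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = build(keys)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return {
        "seconds": elapsed,
        "peak_bytes": peak,
        "node_count": len(result),
        "nodes_per_key": len(result) / len(keys),
    }

stats = profile(build_prefix_set, ["ab12", "ab34", "cd56"])
print(stats)
```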
6. Results and Analysis
6.1. Path Length Distribution Analysis
6.2. Statistical Validation
6.3. Average Path Length Scaling
- Scalability: The logarithmic relationship ensures that path lengths remain manageable even as Ethereum’s state grows exponentially. At the current scale (300 million addresses), the average path length is only 7.72 nodes;
- Proof Size: Since Merkle proofs must include all nodes along a path, the logarithmic scaling directly translates to proof size efficiency. This validates Ethereum’s design choice of using MPTs for state management;
- Performance Bounds: The tight correlation between theoretical and experimental results allows precise performance predictions for future network growth. For example, even a 1000-fold increase in network size would only increase the average path length by approximately 2.48 nodes;
- Optimization Potential: The consistent behavior across scales suggests that optimization strategies targeting average-case performance will remain effective as the network grows.
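The logarithmic scaling can be checked against the experimental averages reported in the average path length table. The sketch below fits avg = a · log16(N) + b by ordinary least squares and derives the growth implied by a 1000-fold increase in N. The linear-in-log16 form is an assumption of this sketch; for reference, log16(1000) alone is about 2.49, so a slope near 1 should land close to the figure quoted above.

```python
import math

# (N, experimental average path length) taken from the average path length table
table = [(100, 2.33), (1_000, 3.20), (10_000, 4.04), (100_000, 4.85),
         (1_000_000, 5.65), (10_000_000, 6.47), (100_000_000, 7.32),
         (300_000_000, 7.72)]

def fit_log16(points):
    """Least-squares fit of avg = a * log16(N) + b; returns (a, b)."""
    xs = [math.log(n, 16) for n, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

a, b = fit_log16(table)
growth_per_1000x = a * math.log(1000, 16)
print(round(a, 3), round(b, 3), round(growth_per_1000x, 2))
```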
6.4. Optimization Potential
- The most frequently accessed 20% of addresses can achieve path length reductions of 65–70%;
- The overall average path length reduction across all addresses reaches 30–35%;
- These results were validated using historical Ethereum transaction data over multiple timeframes.
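How a large reduction on hot addresses turns into a smaller overall figure can be seen with an access-weighted average. The numbers below are invented purely for illustration and are not the paper's data: ten addresses with a baseline path length of 8, of which two hot addresses receive half of all accesses and are shortened by 67.5%.

```python
def weighted_average_path_length(lengths, access_weights):
    """Average path length weighted by how often each address is accessed."""
    total = sum(access_weights)
    return sum(l * w for l, w in zip(lengths, access_weights)) / total

def relative_reduction(before: float, after: float) -> float:
    return 1.0 - after / before

# Hypothetical workload: 2 hot addresses (25% of accesses each), 8 cold ones.
weights = [0.25, 0.25] + [0.0625] * 8
baseline = [8.0] * 10
optimized = [8.0 * (1 - 0.675)] * 2 + [8.0] * 8  # only the hot paths are shortened

before = weighted_average_path_length(baseline, weights)
after = weighted_average_path_length(optimized, weights)
print(before, after, relative_reduction(before, after))
```

With these made-up inputs the overall reduction equals the hot set's access share multiplied by its own reduction, since the cold paths are unchanged.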
7. Discussion
7.1. Comparison with the Existing Studies
- Foundational theoretical work by Kirschenhofer et al. [5] established crucial properties of Patricia tries, proving that external path length asymptotically equals $n \log_2 n$ with probability one. This fundamental result suggested that Patricia tries maintain a natural balance without explicit restructuring;
- Tong et al. [9] extended this understanding through smoothed analysis, demonstrating that both tries and Patricia tries achieve logarithmic height bounds under perturbation, providing theoretical justification for their practical efficiency;
- Building on these theoretical foundations, our previous work on adaptive Merkle trees [8] introduced dynamic restructuring based on access patterns, achieving significant path length reductions for frequently accessed data;
- The current work synthesizes these approaches, providing a comprehensive probabilistic framework that:
  - Extends the theoretical analysis to width-16 trees used in modern blockchain systems;
  - Quantifies the potential optimization gains through rigorous mathematical modeling;
  - Validates the theoretical predictions with empirical blockchain data.
7.2. Limitations and Future Directions
- Distribution Analysis. Modeling of non-uniform address distributions:
  - Characterization of real-world address generation patterns;
  - Path length analysis for clustered and correlated addresses;
  - Optimization strategies for known distribution types.
- Dynamic Behavior Modeling. The temporal aspects to be investigated are as follows:
  - Time-series analysis of path length distributions;
  - Impact of state changes on proof size;
  - Adaptive optimization strategies based on workload variation.
- Optimization of Implementation. Hardware-specific considerations, including cache-aware trie organization strategies.
7.3. Impact on Blockchain Technology
7.4. Synergy Between Adaptive Restructuring and Verkle Tree Implementation
8. Conclusions
- Extending the model to non-uniform address distributions;
- Analyzing dynamic trie behavior when there are frequent state changes;
- Developing adaptive optimization strategies based on observed path length distributions.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Lv, Z.; Wu, J.; Chen, D.; Gander, A.J. Chapter 3—Distributed Computing to Blockchain: Architecture, Technology, and Applications. In Distributed Computing to Blockchain; Pandey, R., Goundar, S., Fatima, S., Eds.; Academic Press: Cambridge, MA, USA, 2023; pp. 39–54. ISBN 978-0-323-96146-2.
2. Rosa-Bilbao, J.; Boubeta-Puig, J. Chapter 15—Ethereum Blockchain Platform. In Distributed Computing to Blockchain; Pandey, R., Goundar, S., Fatima, S., Eds.; Academic Press: Cambridge, MA, USA, 2023; pp. 267–282. ISBN 978-0-323-96146-2.
3. Ethereum. Ethereum Yellow Paper 2023; Ethereum: Zug, Switzerland, 2023.
4. vitalik.eth [@VitalikButerin] @danfinlay The Goal Is to Replace the Current Merkle Patricia State Tree (MPT), Because the Current MPT Is *very* Unfriendly to Stateless Clients: A Worst-Case Stateless Proof for an Ethereum Block Is ~300 MB (Think: Spam Reads on 24kB Contracts), and Even the Average Case Sucks Because The. Twitter. 2024. Available online: https://x.com/VitalikButerin/status/1817408883897593911 (accessed on 28 November 2024).
5. Kirschenhofer, P.; Prodinger, H.; Szpankowski, W. On the Balance Property of Patricia Tries: External Path Length Viewpoint. Theor. Comput. Sci. 1989, 68, 1–17.
6. Andersson, A. Comments on “On the Balance Property of Patricia Tries: External Path Length Viewpoint”. Theor. Comput. Sci. 1992, 106, 391–393.
7. Kuznetsov, O.; Rusnak, A.; Yezhov, A.; Kuznetsova, K.; Kanonik, D.; Domin, O. Merkle Trees in Blockchain: A Study of Collision Probability and Security Implications. Internet Things 2024, 26, 101193.
8. Kuznetsov, O.; Kanonik, D.; Rusnak, A.; Yezhov, A.; Domin, O.; Kuznetsova, K. Adaptive Merkle Trees for Enhanced Blockchain Scalability. Internet Things 2024, 27, 101315.
9. Tong, W.; Goebel, R.; Lin, G. Smoothed Heights of Tries and Patricia Tries. Theor. Comput. Sci. 2016, 609, 620–626.
10. Devroye, L. Laws of Large Numbers and Tail Inequalities for Random Tries and PATRICIA Trees. J. Comput. Appl. Math. 2002, 142, 27–37.
11. Jung, M.; Shishibori, M.; Tanaka, Y.; Aoe, J. A Dynamic Construction Algorithm for the Compact Patricia Trie Using the Hierarchical Structure. Inf. Process. Manag. 2002, 38, 221–236.
12. Knollmann, T.; Scheideler, C. A Self-Stabilizing Hashed Patricia Trie. Inf. Comput. 2022, 285, 104697.
13. Tabatabaei, M.H.; Vitenberg, R.; Veeraragavan, N.R. Understanding Blockchain: Definitions, Architecture, Design, and System Comparison. Comput. Sci. Rev. 2023, 50, 100575.
14. Mardiansyah, V.; Muis, A.; Sari, R.F. Multi-State Merkle Patricia Trie (MSMPT): High-Performance Data Structures for Multi-Query Processing Based on Lightweight Blockchain. IEEE Access 2023, 11, 117282–117296.
15. Yang, C.; Yang, F.; Xu, Q.; Zhang, Y.; Liang, J. SolsDB: Solve the Ethereum’s Bottleneck Caused by Storage Engine. Future Gener. Comput. Syst. 2024, 160, 295–304.
16. Mizrahi, A.; Koren, N.; Rottenstreich, O.; Cassuto, Y. Traffic-Aware Merkle Trees for Shortening Blockchain Transaction Proofs. IEEE/ACM Trans. Netw. 2024, 32, 5326–5340.
17. Kuznetsov, O.; Rusnak, A.; Yezhov, A.; Kanonik, D.; Kuznetsova, K.; Domin, O. Efficient and Universal Merkle Tree Inclusion Proofs via OR Aggregation. Cryptography 2024, 8, 28.
18. AlSobeh, A.M.R.; Gaber, K.; Hammad, M.M.; Nuser, M.; Shatnawi, A. Android Malware Detection Using Time-Aware Machine Learning Approach. Clust. Comput. 2024, 27, 12627–12648.
19. Liang, C.; Zhang, J.; Ma, S.; Zhou, Y.; Hong, Z.; Fang, J.; Zhou, Y.; Tang, H. Study on Data Storage and Verification Methods Based on Improved Merkle Mountain Range in IoT Scenarios. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 102117.
20. Alshattnawi, S.; Shatnawi, A.; AlSobeh, A.M.R.; Magableh, A.A. Beyond Word-Based Model Embeddings: Contextualized Representations for Enhanced Social Media Spam Detection. Appl. Sci. 2024, 14, 2254.
21. Merkle, R.C. Method of Providing Digital Signatures. U.S. Patent 4309569, 5 January 1982.
22. Morrison, D.R. PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM 1968, 15, 514–534.
23. Knuth, D. Art of Computer Programming, The: Volume 3: Sorting and Searching, 2nd ed.; Addison-Wesley Professional: Boston, MA, USA, 1998; ISBN 978-0-201-89685-5.
24. Kostamis, P.; Sendros, A.; Efraimidis, P.S. Data Management in Ethereum DApps: A Cost and Performance Analysis. Future Gener. Comput. Syst. 2024, 153, 193–205.
25. Arslanian, H. Ethereum. In The Book of Crypto: The Complete Guide to Understanding Bitcoin, Cryptocurrencies and Digital Assets; Arslanian, H., Ed.; Springer International Publishing: Cham, Switzerland, 2022; pp. 91–98.
26. George, J.T. Ethereum. In Introducing Blockchain Applications: Understand and Develop Blockchain Applications Through Distributed Systems; George, J.T., Ed.; Apress: Berkeley, CA, USA, 2022; pp. 55–106. ISBN 978-1-4842-7480-4.
27. Samreen, N.F.; Alalfi, M.H. An Empirical Study on the Complexity, Security and Maintainability of Ethereum-Based Decentralized Applications (DApps). Blockchain Res. Appl. 2023, 4, 100120.
28. Kuznetsov, O. Patricia Trie Simulation for Ethereum Addresses. Available online: https://colab.research.google.com/drive/1rxevEArsxD5CN_aHYB_gDDJs_sra4ZbV?usp=sharing (accessed on 26 August 2024).
29. Merkle Patricia Trie. Available online: https://ethereum.org/en/developers/docs/data-structures-and-encoding/patricia-merkle-trie/ (accessed on 5 December 2024).
Path Length | Theoretical Prob. | Experimental Prob. | Difference |
---|---|---|---|
For 100 addresses | |||
1 | 0.002386 | 0.000000 | 0.002386 |
2 | 0.690504 | 0.684000 | 0.006504 |
3 | 0.284479 | 0.306000 | 0.021521 |
4 | 0.021201 | 0.010000 | 0.011201 |
5 | 0.001340 | 0.000000 | 0.001340 |
6 | 0.000084 | 0.000000 | 0.000084 |
For 1000 addresses | |||
2 | 0.025506 | 0.020200 | 0.005306 |
3 | 0.769895 | 0.763500 | 0.006395 |
4 | 0.190395 | 0.214300 | 0.023905 |
5 | 0.013310 | 0.002000 | 0.011310 |
6 | 0.000838 | 0.000000 | 0.000838 |
7 | 0.000052 | 0.000000 | 0.000052 |
For 10,000 addresses | |||
3 | 0.101360 | 0.085860 | 0.015500 |
4 | 0.765349 | 0.786600 | 0.021251 |
5 | 0.124390 | 0.126340 | 0.001950 |
6 | 0.008342 | 0.001200 | 0.007142 |
7 | 0.000524 | 0.000000 | 0.000524 |
8 | 0.000033 | 0.000000 | 0.000033 |
For 100,000 addresses | |||
4 | 0.239184 | 0.217824 | 0.021360 |
5 | 0.675289 | 0.712484 | 0.037195 |
6 | 0.079954 | 0.069361 | 0.010593 |
7 | 0.005223 | 0.000331 | 0.004892 |
8 | 0.000327 | 0.000000 | 0.000327 |
9 | 0.000020 | 0.000000 | 0.000020 |
For 1,000,000 addresses | |||
5 | 0.408987 | 0.408991 | 0.000004 |
6 | 0.536665 | 0.536662 | 0.000003 |
7 | 0.050860 | 0.050859 | 0.000001 |
8 | 0.003268 | 0.003268 | 0.000000 |
Trie Size | χ² Statistic | p-Value | Result |
---|---|---|---|
 | 0.011423 | 1.000 | Accept H₀ |
 | 0.014662 | 1.000 | Accept H₀ |
 | 0.009664 | 1.000 | Accept H₀ |
Addresses (N) | Theoretical Avg. | Experimental Avg. | Difference |
---|---|---|---|
100 | 2.33 | 2.33 | 0.00 |
1000 | 3.19 | 3.20 | 0.01 |
10,000 | 4.04 | 4.04 | 0.00 |
100,000 | 4.85 | 4.85 | 0.00 |
1,000,000 | 5.65 | 5.65 | 0.00 |
10,000,000 | 6.46 | 6.47 | 0.01 |
100,000,000 | 7.31 | 7.32 | 0.01 |
300,000,000 | 7.72 | 7.72 | 0.00 |
Study | Main Focus | Key Contribution | Optimization Approach | Performance Results | Limitations |
---|---|---|---|---|---|
Kirschenhofer et al. [5] | Binary Patricia tries | Variance analysis of external path length | Mathematical analysis of path distribution | Variance ≈ $0.37n + nP(\log_2 n)$ | Limited to binary tries |
Tong et al. [9] | Height analysis | Smoothed analysis of tries | Perturbation model | Height = Θ(log n) under perturbation | Theoretical bounds only |
Adaptive Merkle Trees [8] | Path length optimization | Dynamic tree restructuring | Statistical encoding principles | Up to 70% path length reduction for hot addresses | Initial restructuring overhead |
Current work | Probabilistic framework | Path length distribution model | Entropy-based optimization | 30–35% average improvement | Requires periodic updates |
Yang et al. [15] | Storage efficiency | SolsDB optimization | Storage engine modifications | 30% read performance improvement | System complexity increase |
Mardiansyah et al. [14] | Query processing | Multi-State MPT | Node structure modifications | Query processing speedup | Additional storage overhead |
Tabatabaei et al. [13] | State management | Modified MPT structure | Protocol-level changes | Improved state handling | Scalability trade-offs |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kuznetsov, O.; Frontoni, E.; Kuznetsova, K.; Arnesano, M. Optimizing Merkle Proof Size Through Path Length Analysis: A Probabilistic Framework for Efficient Blockchain State Verification. Future Internet 2025, 17, 72. https://doi.org/10.3390/fi17020072