Image Storage in DNA by an Extensible Quaternary Codec System

Pang, Ruoying; Dong, Yiming; Zhao, Xin

doi:10.3390/app15094760

by Ruoying Pang ¹, Yiming Dong ² and Xin Zhao ¹,*

¹ State Key Laboratory of Radio Frequency Heterogeneous Integration, Shanghai Jiao Tong University, Shanghai 200240, China
² Guiji Life Sciences Co., Ltd., Suzhou 215000, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4760; https://doi.org/10.3390/app15094760

Submission received: 13 March 2025 / Revised: 16 April 2025 / Accepted: 24 April 2025 / Published: 25 April 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:

Silicon-based storage technologies are increasingly failing to meet the explosively growing data storage demands of the information age. DNA-based data storage offers a promising solution due to its unparalleled storage density, long lifespan, low energy consumption, and high parallel accessibility. In this study, we propose a novel True Quadratic Codec System (ETQ) that directly encodes data into nucleotide sequences using a quaternary encoding approach. By treating A, T, C, and G as direct encoding symbols (0, 1, 2, 3), the ETQ eliminates the intermediate binary-to-ATCG conversion step, thus surpassing the theoretical storage density limit of 2 bits/nt. The ETQ is built on three key components: (1) dividing image data into B, G, and R color channels for separate encoding and storage, (2) employing quaternary Huffman encoding to map image information directly into nucleotide sequences, and (3) integrating Reed–Solomon error correction codes to enhance data reliability and system extensibility. The ETQ framework demonstrates significant improvements in storage density and efficiency compared to conventional methods. By leveraging the inherent properties of DNA, this system offers a scalable and cost-effective solution that addresses the growing global data storage crisis.

Keywords:

DNA storage; Huffman coding; RS error correction; quaternary codec system

1. Introduction

With the rapid development of the information age, the total amount of data used globally is projected to grow from 30 ZB in 2018 to 163 ZB by 2025. Based on this trend, researchers predict that by 2040, more than 1000 kg of wafer-scale monocrystalline silicon will be needed to store global data, but only 108 kg of monocrystalline wafers will be available [1]. This impending silicon deficit poses significant challenges to the $574 billion global semiconductor manufacturing sector, where monocrystalline silicon serves as the foundational material for 92% of computing components including CPUs, memory chips, and storage devices [2]. Current projections indicate that maintaining silicon-based storage infrastructure at 2025 data volumes would consume over 65% of annual monocrystalline silicon production, compared to just 22% in 2018 [3]. This resource competition may lead to substantial increases in semiconductor prices, potentially driving significant cost escalation in global computing device production. According to Gartner, Inc.’s latest forecast, worldwide IT spending is projected to reach $5.74 trillion by 2025, with server sales expected to demonstrate exponential growth [4]. DNA-based data storage technology offers a promising solution to this dual crisis of material scarcity and economic pressure. DNA (deoxyribonucleic acid) is a long-strand molecule formed by the connection of four deoxyribonucleotide bases (A, T, C, and G). It ensures the secure storage and stable intergenerational replication of massive quantities of genetic information in living organisms, making it one of the most dense and stable data storage media known. Theoretically, 1 g of DNA can store 455 EB of data, 4 g of DNA can store the total amount of information generated globally in one year, and 1 kg of DNA can store all human information [5]. 
While DNA exhibits remarkable longevity under controlled conditions (e.g., millennia-old fossils preserve intact sequences [6]), its stability in storage systems requires careful mitigation of enzymatic degradation risks. Environmental DNases and endonucleases pose significant challenges to unprotected DNA, potentially cleaving phosphodiester bonds and compromising data integrity. To address this, current preservation strategies employ multi-layered protection, such as silica encapsulation and cryogenic storage [7].

Since the dawn of the 21st century, the field of DNA-based data storage research has witnessed remarkable advancements, fueled by breakthroughs in synthetic biology and collaborative efforts across academia and industry involving researchers, institutions, and enterprises. Church et al. [8] designed a binary encoding and decoding scheme where one nucleotide corresponded to one binary digit, successfully storing 650 KB of data in vitro. Goldman et al. [9] developed a ternary encoding scheme with rotational coding and implemented error correction through quadruple overlapping steps. Erlich et al. [10] introduced fountain codes into DNA information storage, using an encoding method where one nucleotide corresponded to two binary digits, marking the advent of the so-called “quaternary” codec system. Subsequent studies, including those by Organick et al. [11], Pan et al. [12], Ping et al. [13], Ceze et al. [14], Chen et al. [15], Rasool et al. [16], and Ding et al. [17], have adopted the innovative concept of fountain codes (as shown in Figure 1). Critically, these encoding paradigms synergize with emerging DNA nanotechnology. By programmatically controlling the spatial arrangement of deoxyribonucleotide bases (A, T, C, and G), DNA strands can serve as programmable building blocks to construct three-dimensional nanostructures through scaffolded origami techniques [18]. This molecular self-assembly process enables the creation of addressable 3D architectures with functionalized domains, effectively integrating data storage with nanoscale structural engineering [19]. Such DNA origami frameworks not only enhance error resilience through spatially constrained redundancy but also expand application horizons by enabling in situ computational operations within biomolecular matrices.

The above studies leverage various source coding methods from the fields of informatics and communication theory to encode the information to be stored, resulting in binary data streams. A series of intricate mapping rules are then designed to map the binary data streams, composed of zeros and ones, into DNA sequences consisting of the four bases A, T, C, and G. The most basic mapping rule is a quaternary encoding scheme where one base corresponds to two binary digits, for example 00 → A, 01 → T, 10 → C, 11 → G. This completes the so-called quaternary encoding. However, the encoding process of “data → binary data stream → ATCG base sequence” increases the steps in the DNA storage pipeline and fails to fully utilize the inherent advantages of DNA sequences formed by the arrangement of the A, T, C, and G bases. Under this framework, the theoretical maximum information storage density is 2 bits/nt (bits per nucleotide). Moreover, after accounting for sequence indexing, error correction redundancy, characteristic primers, and other overheads, the actual storage density is typically about half of the theoretical value.
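As an illustration of the conventional pipeline described above, the following Python sketch applies the basic rule 00 → A, 01 → T, 10 → C, 11 → G to a byte string. The function names `bits_to_dna` and `dna_to_bits` are hypothetical and chosen only for this example; the cited systems layer further constraints on top of this rule.

```python
# Basic "binary stream -> ATCG" mapping: each base carries exactly two bits.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(data: bytes) -> str:
    """Write every byte as 8 bits, then map each bit pair to one base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(seq: str) -> bytes:
    """Inverse mapping: four bases give back one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


example = bits_to_dna(b"\x1b")  # 0x1b is the bit string 00 01 10 11
```

Because every byte always becomes four bases, the density of this pipeline cannot exceed 2 bits/nt even before indexing, primers, and error correction are added.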

In this study, we developed a True Quadratic Codec System, termed ETQ (ETQ consists of the initials of extensible, true, and quadratic), which uses A, T, C, and G (i.e., 0, 1, 2, 3) as symbols for direct information encoding. This system bypasses the conversion process from binary data streams to DNA sequences, directly encoding data into ATCG nucleotide sequences. Our proposed approach consists of three main components:

Dividing an image into B, G, and R color channels for separate encoding and storage.
Applying quaternary Huffman encoding to each color channel, directly encoding image information into nucleotide sequences composed of ATCG, eliminating the traditional “data → binary data stream → ATCG sequence” process, and overcoming the storage density limit of 2 bits/nt.
Incorporating Reed–Solomon error correction codes into each DNA sequence to enhance the extensibility of the proposed method.

2. Materials and Research Methods

2.1. Encoding

The encoding framework of the ETQ is illustrated in Figure 2. It is well known that compressing image data is necessary for storage; however, compression can result in a significant loss of image quality when errors occur. To mitigate this, conventional methods typically add up to 30% error redundancy [1,20], which significantly increases the cost of DNA storage. To reduce redundancy and save costs, we divide the image into RGB color channels, encode and store them separately, and use a 3 nt block to represent the RGB color information of each image. The probability of simultaneous errors occurring in multiple channels for the same pixel is extremely low, and decoding errors can be detected using majority rule color correction [12]. Thus, this design treats storing different color channels as a proxy for redundant encoding. By dividing the image into three sub-images corresponding to RGB channels, redundancy is reduced.

The subsequent step employs a Hilbert space-filling curve, which transforms the 2D pixel matrix into a 1D pixel vector while preserving the local similarity and smoothness of the 2D image. This results in linear strings with minimal differences between adjacent entries [21], making it suitable for images of various scales.
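The flattening step can be sketched as follows. This is a minimal example that assumes a square channel whose side is a power of two and uses the widely known iterative index-to-coordinate conversion for the Hilbert curve; the helper names `hilbert_d2xy` and `hilbert_flatten` are hypothetical, and how the ETQ handles images of other shapes is not specified here.

```python
def hilbert_d2xy(order, d):
    """Map index d along the Hilbert curve to (x, y) on a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the current quadrant so sub-curves join end to end.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_flatten(channel):
    """Read a square 2D channel (list of rows) in Hilbert-curve order."""
    n = len(channel)
    order = n.bit_length() - 1
    if n != 1 << order or any(len(row) != n for row in channel):
        raise ValueError("sketch assumes a square channel with power-of-two side")
    coords = (hilbert_d2xy(order, d) for d in range(n * n))
    return [channel[y][x] for x, y in coords]


flat = hilbert_flatten([[1, 2],
                        [3, 4]])
```

Consecutive positions along the curve are always neighbouring pixels, which is the locality property the text relies on.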

Taking the R channel as an example (the same process applies to the other two channels), the encoding begins by using the Hilbert space-filling curve to convert the 2D image data into a 1D sequence. Next, a quaternary Huffman tree is constructed to encode the data (as shown in Figure 3).

Probability Calculation: The probability of the occurrence of each pixel value in the image is calculated and used as the weight of the leaf nodes of the Huffman tree. The nodes are sorted in ascending order of their weights (from bottom to top).
Tree Construction: The four nodes with the smallest weights are sequentially encoded as A/T/C/G. The weights of these four nodes are summed, and the result is used as the weight of a new node to construct the next level of the Huffman tree. This process is repeated until all nodes are encoded. The resulting sequence is denoted as seq_init.
Metadata Generation: The initial encoded sequence is analyzed to generate an information table and a tree table, which are stored in the computer to enhance decoding accuracy.
Biochemical Constraints: Due to limitations in DNA synthesis and sequencing technologies [22,23,24], biochemical constraints must be applied to the initial sequence. The ETQ system adopts a two-step extended mapping strategy:
- GC Content Control: A 5-to-6 mapping is used, grouping every five bases and mapping them to six bases with a GC ratio close to 50%.
- Homopolymer Length Control: Sliding windows are applied to replace homopolymers of “CCCC”, “GGGG”, “TTTT”, and “AAAA” with “ATGCC”, “ATGCG”, “ATGCT”, and “ATGCA”, respectively, ensuring a controlled homopolymer length.
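The probability calculation and tree construction steps above can be sketched in Python as follows. This is an illustrative 4-ary Huffman coder, not the authors' implementation: the names `quaternary_huffman`, `encode`, and `decode` are hypothetical, and padding the leaf set with zero-weight dummy nodes (so that every merge takes exactly four nodes) is an assumption that the text does not state.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ATCG"


def quaternary_huffman(symbols):
    """Return {symbol: codeword over A/T/C/G} for a 4-ary Huffman code."""
    tie = count()  # unique tie-breaker so the heap never compares payloads
    heap = [(w, next(tie), ("leaf", s)) for s, w in Counter(symbols).items()]
    # Assumption: pad with zero-weight dummies so each merge uses four nodes.
    while len(heap) < 4 or (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), ("dummy", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(child[0] for child in children)
        heapq.heappush(heap, (weight, next(tie), ("node", children)))
    codes = {}
    stack = [(heap[0], "")]
    while stack:
        (_, _, (kind, payload)), prefix = stack.pop()
        if kind == "leaf":
            codes[payload] = prefix
        elif kind == "node":
            # The four children of a node are labelled A, T, C, G in order.
            for base, child in zip(BASES, payload):
                stack.append((child, prefix + base))
    return codes


def encode(symbols, codes):
    return "".join(codes[s] for s in symbols)


def decode(seq, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for base in seq:
        buf += base
        if buf in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[buf])
            buf = ""
    return out


pixels = [0, 0, 0, 0, 1, 1, 2, 3, 3, 3, 4]
codes = quaternary_huffman(pixels)
seq_init = encode(pixels, codes)
```

In this toy input the three most frequent pixel values receive one-base codewords and the two rare values receive two-base codewords, so frequent values cost fewer nucleotides.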

The final encoded sequence is thus obtained. However, a DNA storage system requires the synthesis of specific DNA sequences, followed by the preservation, amplification, and sequencing of DNA molecules. These processes may introduce random and systematic errors, posing challenges to the reliability of the DNA data. To address these issues, appropriate error correction schemes are required to add error correction sequences to the encoded DNA fragments. At present, almost all error correction methods rely on redundancy—both physical and logical—to ensure the accuracy of information recovery [11,14,25].
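The two biochemical constraints of Section 2.1 can be checked with a few lines of Python. The sketch below is illustrative only: `gc_content` and `max_homopolymer` are hypothetical helper names, and the count of balanced 6-mers merely shows that a 5-to-6 mapping onto sequences with exactly 50% GC is combinatorially possible. The actual mapping table used by the ETQ is not given in the text.

```python
from itertools import groupby, product


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical consecutive bases."""
    return max(len(list(run)) for _, run in groupby(seq))


# Feasibility check for a 5-to-6 mapping: count the 6-mers whose GC content
# is exactly 50% and compare with the 4**5 possible 5-mers to be mapped.
balanced_6mers = [
    "".join(p) for p in product("ATCG", repeat=6)
    if sum(base in "GC" for base in p) == 3
]
num_5mers = 4 ** 5
```

There are more balanced 6-mers than 5-mers, so each group of five bases can in principle be assigned a distinct six-base word with balanced GC content.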

2.2. Error Correction

The ETQ employs a Reed–Solomon error correction code (RS code) under the BCH view as a logical redundancy. RS codes, as multilevel error correction codes, are particularly suitable for multilevel modulation scenarios and possess strong error correction capabilities, capable of correcting both random and burst errors [26]. During DNA storage and retrieval, base sequences may encounter insertion, deletion, and substitution errors, which are unevenly distributed [8]. RS codes are highly effective at correcting these types of errors and have been adopted in multiple DNA storage systems as the error correction strategy [11,20,25,27,28,29].

An RS code with parameters (n, k) can correct up to (n − k)/2 errors, where k represents the original data length and n is the length of the encoded data. Within the BCH coding framework, RS error correction codes encode messages by treating their symbols as coefficients of a polynomial defined over the finite field (Galois field, GF). Through polynomial multiplication and modulo operations in GF, the original message polynomial is systematically transformed into an extended codeword polynomial, where the resultant coefficients constitute the error correcting encoded sequence. This algebraic approach ensures the codeword’s redundancy adheres to the designed minimum distance properties critical for bounded-distance decoding. In the presence of storage-induced errors, the received polynomial during decoding is expressed as

$r (x) = s (x) + e (x)$

where $s (x)$ denotes the original message polynomial and $e (x)$ represents the error polynomial. The fundamental objective of Reed–Solomon (RS) error correction algorithms resides in computationally deriving $e (x)$ from $r (x)$, followed by algebraic subtraction $s (x) = r (x) - e (x)$ to achieve error-free message recovery. All mathematical operations are rigorously executed within the finite field (Galois field, GF) framework. Figure 4 schematically illustrates the workflow of the RS error correction decoding algorithm. Let $μ$ denote the number of errors introduced during transmission. The error polynomial is defined as

$e (x) = \sum_{i = 1}^{μ} γ_{i} x^{l_{i}}$

where $γ_{i} \in G F (2^{m})$ represents the error magnitude at position $l_{i}$. The decoding procedure unfolds as follows:

Syndrome Computation: The syndrome values $S_{j}$ are computed by evaluating the received polynomial $r (x)$ at consecutive powers of the primitive element $α$, specifically at $α^{j}$ for j = 1, 2, …, n − k, within $G F (2^{m})$. Because the error-free codeword vanishes at these points, this operation corresponds to

$S_{j} = r (α^{j}) = 0 + e (α^{j}) = \sum_{i = 1}^{μ} γ_{i} X_{i}^{j}$

where $X_{i} = α^{l_{i}}$, giving n − k syndrome values in total.
Error Locator Polynomial: Employ the Berlekamp–Massey algorithm to derive the coefficients $λ_{i}$ of the error locator polynomial

$σ (x) = \prod_{i = 1}^{μ} (1 - x X_{i}) = 1 + \sum_{i = 1}^{μ} λ_{i} x^{i}$
Root Identification: Execute a Chien search [30] to determine the roots $X_{i}^{- 1}$ of $σ (x)$, yielding the error positions $l_{i} = \log_{α} X_{i}$.
Error Magnitude Calculation: Apply Forney’s algorithm [31] to compute the error magnitudes

$γ_{i} = - \frac{X_{i}^{1 - j_{0}} ω (X_{i}^{- 1})}{σ' (X_{i}^{- 1})}$

where $ω (x)$ is the error evaluator polynomial, $σ' (x)$ is the formal derivative of $σ (x)$, and $j_{0}$ is the first syndrome index (here $j_{0} = 1$).
Error Correction: Reconstruct $e (x)$ using $\{l_{i}, γ_{i}\}$ and recover the codeword

$c (x) = r (x) - e (x)$
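The syndrome step can be made concrete with a small Python sketch over GF(256). Several choices here are assumptions made for illustration: the primitive polynomial 0x11D with α = 2, the helper names, and the 20-symbol test message. The sketch does not implement Berlekamp–Massey, Chien search, or Forney's algorithm; it only handles the special case of a single error (μ = 1), where $S_{1} = γ_{1} X_{1}$ and $S_{2} = γ_{1} X_{1}^{2}$ give the locator and the magnitude directly.

```python
# GF(2^8) log/antilog tables (assumed primitive polynomial 0x11D, alpha = 2).
EXP = [0] * 510
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def poly_eval(poly, x):
    """Horner evaluation; poly lists coefficients from highest degree down."""
    y = 0
    for coef in poly:
        y = gf_mul(y, x) ^ coef
    return y


def generator_poly(nsym):
    """g(x) = product over j = 1..nsym of (x - alpha^j); minus equals XOR here."""
    g = [1]
    for j in range(1, nsym + 1):
        nxt = g + [0]
        for i, coef in enumerate(g):
            nxt[i + 1] ^= gf_mul(coef, EXP[j])
        g = nxt
    return g


def rs_encode(msg, nsym):
    """Systematic encoding: append the remainder of msg(x) * x^nsym mod g(x)."""
    g = generator_poly(nsym)
    rem = list(msg) + [0] * nsym
    for i in range(len(msg)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(g[j], coef)
    return list(msg) + rem[len(msg):]


def syndromes(received, nsym):
    """S_j = r(alpha^j) for j = 1..nsym."""
    return [poly_eval(received, EXP[j]) for j in range(1, nsym + 1)]


NSYM = 5
message = list(range(1, 21))           # 20 data symbols (illustrative)
codeword = rs_encode(message, NSYM)    # 25 symbols
clean = syndromes(codeword, NSYM)      # all zero for an error-free codeword

received = list(codeword)
received[3] ^= 0x5A                    # inject one symbol error
S = syndromes(received, NSYM)
X1 = gf_div(S[1], S[0])                # X_1 = S_2 / S_1
position = LOG[X1]                     # l_1 = log_alpha X_1 (exponent of x)
magnitude = gf_div(S[0], X1)           # gamma_1 = S_1 / X_1
received[len(received) - 1 - position] ^= magnitude  # c(x) = r(x) - e(x)
```

The list index and the exponent $l_{1}$ differ because the coefficient list is stored from the highest degree down; index 3 of a 25-symbol word corresponds to $x^{21}$.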

In this study, RS codes are constructed within GF (256) because the decoding time of RS codes increases nonlinearly with code length [17], and the RS error correction framework under GF (256) is well established [32]. Building upon this foundation, each group of four bases is treated as an error correction block. The final encoded sequence, obtained after the two-step mapping process, is divided into 80 nt base fragments, which are sequentially indexed to generate the data segment indices. Redundant error correction sequences are then added to each data fragment using RS (100,80), resulting in 100 nt fragments. Subsequently, the file ID, color channel ID, and index are prepended to each fragment. Specific primers are added to both ends of each sequence to enable random access to individual files during large-scale storage [11].
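One way to read the four-base error correction block is that four quaternary digits (4^4 = 256 values) form one GF(256) symbol. The sketch below follows that reading, which is an interpretation rather than something the text states explicitly: it uses the digit assignment A, T, C, G = 0, 1, 2, 3 from the abstract, the helper names `pack` and `unpack` are hypothetical, and RS (100,80) is taken to count nucleotides.

```python
BASE_TO_DIGIT = {"A": 0, "T": 1, "C": 2, "G": 3}
DIGIT_TO_BASE = "ATCG"


def pack(seq):
    """Pack each group of four bases into one symbol in the range 0..255."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of four")
    symbols = []
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = value * 4 + BASE_TO_DIGIT[base]  # first base is most significant
        symbols.append(value)
    return symbols


def unpack(symbols):
    """Inverse of pack: expand each symbol back into four bases."""
    return "".join(
        DIGIT_TO_BASE[(value >> shift) & 3]
        for value in symbols
        for shift in (6, 4, 2, 0)
    )


# RS (100,80) counted in nucleotides, expressed in GF(256) symbols.
n_sym = 100 // 4
k_sym = 80 // 4
t = (n_sym - k_sym) // 2  # correctable symbol errors per fragment
```

Under this reading each 100 nt fragment is a 25-symbol codeword with 20 data symbols, and up to two corrupted four-base blocks per fragment can be corrected.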

2.3. Synthesis and Sequencing

The DNA strands were synthesized by Twist Bioscience. The synthesized oligos were subsequently sequenced and analyzed using next-generation sequencing (NGS) technology, performed by Guiji Life Sciences Co., Suzhou, China.

3. Results

The performance of the ETQ was validated through in vitro experiments and in silico experiments, and all images were successfully restored (the results are shown in Figure 5). In the in vitro experiments, we synthesized 1,037,400 bits of image data (image ID: 0). The synthetic DNA sequences were stored at room temperature for three days and subsequently frozen for another three days. After sequencing, the image was successfully recovered. Due to the limitations of DNA synthesis technology (short DNA strands have the highest synthesis efficiency [22,23,24,33]), the synthesized sequences were all 150 nt in length, resulting in a total of 4891 DNA sequences (733,650 nt). These DNA sequences exhibited a GC content of approximately 50%, homopolymer lengths shorter than 4 nt, and a net information density of about 1.41 bits/nt (as shown in Figure 6). In the computer simulation experiments, in addition to image ID 0, we tested three other images. For each image, the DNA strand length used for storage was varied among 150 nt, 200 nt, 250 nt, 300 nt, 350 nt, and 400 nt. Figure 6a shows the relationship between the net information density and DNA strand length for different images. Taking image ID 0 as an example, the GC content and homopolymer length of sequences with different lengths were controlled to match the in vitro experiments. However, due to the increased length of the data segments, the net information density for sequences of 150 nt, 200 nt, 250 nt, 300 nt, 350 nt, and 400 nt was 1.71 bits/nt, 1.94 bits/nt, 2.07 bits/nt, 2.14 bits/nt, and 2.22 bits/nt, respectively. This broke the storage density limit of 2 bits/nt previously achieved by other encoding systems (Figure 1).
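The in vitro net information density follows directly from the figures reported above. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
payload_bits = 1_037_400   # image data stored in the in vitro experiment
num_strands = 4891         # synthesized DNA sequences
strand_length_nt = 150     # length of each sequence

total_nt = num_strands * strand_length_nt
net_density = payload_bits / total_nt   # bits per nucleotide
```

The total of 733,650 nt matches the value given in the text, and the ratio rounds to the reported 1.41 bits/nt.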

Additionally, all the curves in Figure 6a show a trend of increasing net information density with increasing DNA strand length, but the rate of increase gradually diminishes, indicating a trend toward saturation. The curve for image ID “0.jpg” is consistently higher than the others, suggesting that its net information density is the highest under given conditions. Overall, there is a positive correlation between DNA strand length and net information density, although the specific form of this relationship varies significantly depending on the image content. Short DNA strands may be limited by boundary effects and a constrained information space, resulting in lower information density. In contrast, longer DNA strands exhibit greater redundancy tolerance, allowing for more efficient information storage. This finding provides important insights into the optimization of DNA storage and encoding strategies, highlighting the critical role of DNA strand length and encoding methods in the design of high-density DNA molecules. At this stage, there are various encoding systems in the field, making it difficult to test different DNA data storage systems on the same dataset. Therefore, we followed the conventions in the DNA digital storage field [10,14] by testing system performance on their respective datasets. Through comparisons and predictions of net information density, despite the ETQ’s net information density in in vitro experiments not exceeding 2 bits/nt, it remains the most feasible and extensible DNA storage system published in recent years.

4. Discussion

With the rapid development of the internet and digital technologies, images have become one of the most critical data types on the internet, playing a pivotal role in fields such as medical diagnosis, visual positioning, facial recognition, and autonomous driving. The demand for storing high-resolution and high-color-depth images has surged, and images also serve as the foundation for video storage. DNA storage, with its ultra-high density and scalability, has emerged as one of the most promising media capable of meeting these demanding storage requirements.

We have designed a true quaternary DNA storage encoding system that divides a color image into three RGB color channels to reduce redundant data and enhance storage density. Our system employs a lossless compression method, ensuring that the decompressed image is completely identical to the original with no information loss. To guarantee data accuracy and integrity, error correction codes are indispensable in DNA storage systems [34]. Therefore, we incorporated RS error correction codes into a quaternary Huffman tree [35], transcending the “so-called” quaternary encoding framework established by DNA fountain codes. Both in vitro and simulation experiments were successfully conducted. In the in vitro experiment, constrained by the current precision of DNA synthesis, the synthesized sequences were limited to a length of 150 nt, resulting in a net storage density of 1.41 bits/nt. In the in silico experiment, the net storage density exceeded the theoretical maximum of 2 bits/nt established by Erlich et al. [10]. While our method was specifically applied to color image files, for other file types such as text or audio, the ETQ encoding steps cannot be directly applied due to the absence of color channels. However, the combination of quaternary Huffman compression and RS decoding remains applicable. By omitting the steps of color channel division and flattening, the ETQ design is still effective for these data types. Hence, our research provides significant insights into the development and optimization of DNA digital storage systems and contributes to improving their performance.

Unlike DNA Fountain’s probabilistic encoding [10], which achieves theoretical density at the cost of complex reassembly, the ETQ guarantees deterministic reconstruction through fixed-length quaternary mapping. The ETQ’s net density of 1.41 bits/nt in the biochemical implementation, measured after adding error correction redundancy, surpasses Organick’s random access codec (1.16 bits/nt) [11] and approaches Erlich’s code benchmark (1.55 bits/nt) [10], which validates the potential of quaternary encoding despite current synthesis constraints. Notably, our in silico result of 2.22 bits/nt exceeds Erlich’s theoretical 2 bits/nt limit. Additionally, the two-step mapping in the coding scheme ensures smooth decoding, and the addition of RS error correction redundancy makes the system more robust than the quadratic system designed by Lu et al. [35].

A limitation of this study lies in its reliance on the RS error correction algorithm, which has been widely adopted in wireless communication. Unlike communication systems, DNA storage systems are composed of four characters (A, T, C, G) and are subject to substitution, insertion, and deletion errors [14]. These unique characteristics pose significant challenges to error correction, particularly because insertions and deletions (indels) can cause frame shifts in sequence reading. Indels not only introduce localized errors but also produce downstream effects that compromise the integrity of the entire sequence. At present, research on quaternary error correction codes specifically tailored for DNA storage remains scarce. The few existing studies, such as Yan et al. [36], involve complex and highly redundant encoding steps. Given the high cost of DNA synthesis, the feasibility of such approaches is limited. The development of efficient, cost-effective quaternary error correction codes that address the unique challenges of DNA storage systems remains an important area for future research [22].

Future research can explore more quaternary-based encoding methods, building upon the ETQ framework. By synthesizing and sequencing the same dataset multiple times, researchers can analyze sequencing reads to identify patterns and correlations between DNA sequences and errors. This could inform updates to the 5-to-6 mapping scheme, enabling the selection of more stable and efficient base combinations. The ETQ system adopts a single-function encoding approach with clearly encapsulated modules, offering excellent extensibility and flexibility. Its modular design allows for the easy modification and replacement of components, as well as integration into larger systems. For instance, future studies could implement multi-layered encoding schemes on top of the quaternary Huffman coding framework, while simultaneously improving error correction strategies to further enhance storage density and error resilience. In terms of biotechnology, the development of DNA synthesis and sequencing techniques specifically tailored for DNA storage is essential. These advancements could improve synthesis and sequencing efficiency, significantly reducing the frequency of errors other than insertions. Additionally, designing reliable DNA storage systems requires careful consideration of data volume and sequencing technology to determine optimal encoding schemes and sequencing depths, ensuring accurate and robust data storage and retrieval.

5. Conclusions and Future Work

Our research presents a true quaternary DNA storage encoding system, specifically designed for image files, which achieves high storage density and data integrity. By dividing color images into RGB channels and employing a lossless compression method combined with Reed–Solomon (RS) error correction codes, we ensured that decompressed images were identical to the originals. The ETQ demonstrated excellent performance in both in vitro and computational simulation experiments. In vitro experiments achieved a net storage density of 1.41 bits/nt, while in silico simulations surpassed the theoretical 2 bits/nt limit established by Erlich [10]. The system’s modular design, centered on quaternary Huffman encoding and RS error correction, offers flexibility and extensibility, making it adaptable to other data types such as text or audio by omitting color channel processing steps. This work provides valuable insights into optimizing DNA digital storage systems and highlights the potential of quaternary encoding to overcome current storage density limitations.

Future research on DNA storage should focus on developing efficient, quaternary-specific error correction codes to address unique challenges like insertions, deletions, and substitutions, while minimizing redundancy and cost. Advanced quaternary encoding methods, building on the ETQ framework, can refine mapping schemes for stable and efficient base combinations. Multi-layered encoding atop the quaternary Huffman framework could enhance storage density and error resilience. Biotechnological advancements in DNA synthesis and sequencing tailored for storage applications are crucial to reduce errors and improve performance. Additionally, the modular design of the ETQ enables seamless integration with other encoding or error correction strategies, paving the way for robust and versatile DNA storage solutions. We believe that the introduction of quaternary encoding has the potential to significantly enhance DNA storage systems, surpassing the 2 bits/nt storage density limit and enabling more efficient data storage. In the future, encoding systems that directly map data to nucleotide sequences are likely to replace the current paradigm of “data → binary data stream → nucleotide sequence”, further advancing the field of DNA storage technology. We will also focus on advancing quaternary error correction algorithms, particularly those addressing the insertion–deletion–substitution (indel-sub) error correlation inherent to DNA storage systems. This includes developing machine learning-driven encoding optimization to counteract sequence synthesis biases and engineering enzymatic synthesis platforms for error-suppressed long-strand (>500 nt) production. Concurrent standardization efforts will prioritize establishing ethical guidelines and interoperability protocols for biologically embedded data within hybrid silicon–DNA storage hierarchies, ensuring secure transitions between molecular and electronic domains.
Through algorithmic development and biotechnology co-advancement, this integrated strategy aims to resolve error propagation challenges while enabling scalable architectures for ultra-dense data preservation, ultimately contributing actionable solutions to the impending zettabyte-scale data storage crisis.

Author Contributions

Conceptualization, R.P., Y.D. and X.Z.; methodology, R.P., Y.D. and X.Z.; software, R.P.; validation, R.P., Y.D. and X.Z.; formal analysis, R.P.; investigation, R.P., and Y.D.; resources, X.Z.; data curation, R.P. and Y.D.; writing—original draft preparation, R.P.; writing—review and editing, X.Z.; visualization, R.P.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China under grant No. 62174107, the National Key R&D Program of China No. 2022YFF1202000, and the National Natural Science Foundation of China under grant No. 62188102.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study.

Conflicts of Interest

Author Yiming Dong was employed by the company Guiji Life Sciences. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Zhirnov, V.; Zadegan, R.M.; Sandhu, G.S.; Church, G.M.; Hughes, W.L. Nucleic Acid Memory. Nat. Mater. 2016, 15, 366–370. [Google Scholar] [CrossRef] [PubMed]
Recent News Release. Available online: https://www.wsts.org/76/Recent-News-Release (accessed on 16 April 2025).
Market Analysis Perspective: Worldwide Enterprise Storage Systems. 2024. Available online: https://my.idc.com/getdoc.jsp?containerId=US47269521 (accessed on 16 April 2025).
Gartner Forecasts Worldwide IT Spending to Grow 9.3% in 2025. Available online: https://www.gartner.com/en/newsroom/press-releases/2024-10-23-gartner-forecasts-worldwide-it-spending-to-grow-nine-point-three-percent-in-2025 (accessed on 16 April 2025).
Bonnet, J.; Colotte, M.; Coudy, D.; Couallier, V.; Portier, J.; Morin, B.; Tuffet, S. Chain and Conformation Stability of Solid-State DNA: Implications for Room Temperature Storage. Nucleic Acids Res. 2009, 38, 1531–1546.
Zhou, Y.; Bi, K.; Ge, Q.; Lu, Z. Advances and Challenges in Random Access Techniques for In Vitro DNA Data Storage. ACS Appl. Mater. Interfaces 2024, 16, 43102–43113.
Chen, W.D.; Kohll, A.X.; Nguyen, B.H.; Koch, J.; Heckel, R.; Stark, W.J.; Ceze, L.; Strauss, K.; Grass, R.N. Combining Data Longevity with High Storage Capacity—Layer-by-Layer DNA Encapsulated in Magnetic Nanoparticles. Adv. Funct. Mater. 2019, 29, 1901672.
Church, G.M.; Gao, Y.; Kosuri, S. Next-Generation Digital Information Storage in DNA. Science 2012, 337, 1628.
Goldman, N.; Bertone, P.; Chen, S.; Dessimoz, C.; LeProust, E.M.; Sipos, B.; Birney, E. Towards Practical, High-Capacity, Low-Maintenance Information Storage in Synthesized DNA. Nature 2013, 494, 77–80.
Erlich, Y.; Zielinski, D. DNA Fountain Enables a Robust and Efficient Storage Architecture. Science 2017, 355, 950–954.
Organick, L.; Ang, S.D.; Chen, Y.-J.; Lopez, R.; Yekhanin, S.; Makarychev, K.; Racz, M.Z.; Kamath, G.; Gopalan, P.; Nguyen, B.; et al. Random Access in Large-Scale DNA Data Storage. Nat. Biotechnol. 2018, 36, 242–248.
Pan, C.; Tabatabaei, S.K.; Tabatabaei Yazdi, S.M.H.; Hernandez, A.G.; Schroeder, C.M.; Milenkovic, O. Rewritable Two-Dimensional DNA-Based Data Storage with Machine Learning Reconstruction. Nat. Commun. 2022, 13, 2984.
Ping, Z.; Chen, S.; Zhou, G.; Huang, X.; Zhu, S.; Zhang, H.; Lee, H.H.; Lan, Z.; Cui, J.; Chen, T.; et al. Towards Practical and Robust DNA-Based Data Archiving Using the Yin–Yang Codec System. Nat. Comput. Sci. 2022, 2, 234–242.
Ceze, L.; Nivala, J.; Strauss, K. Molecular Digital Data Storage Using DNA. Nat. Rev. Genet. 2019, 20, 456–466.
Chen, W.; Han, M.; Zhou, J.; Ge, Q.; Wang, P.; Zhang, X.; Zhu, S.; Song, L.; Yuan, Y. An Artificial Chromosome for Data Storage. Natl. Sci. Rev. 2021, 8, nwab028.
Rasool, A.; Hong, J.; Jiang, Q.; Chen, H.; Qu, Q. BO-DNA: Biologically Optimized Encoding Model for a Highly-Reliable DNA Data Storage. Comput. Biol. Med. 2023, 165, 107404.
Ding, L.; Wu, S.; Hou, Z.; Li, A.; Xu, Y.; Feng, H.; Pan, W.; Ruan, J. Improving Error-Correcting Capability in DNA Digital Storage via Soft-Decision Decoding. Natl. Sci. Rev. 2024, 11, nwad229.
Dey, S.; Fan, C.; Gothelf, K.V.; Li, J.; Lin, C.; Liu, L.; Liu, N.; Nijenhuis, M.A.D.; Saccà, B.; Simmel, F.C.; et al. DNA Origami. Nat. Rev. Methods Primers 2021, 1, 13.
Postigo, A.; Marcuello, C.; Verstraeten, W.; Sarasa, S.; Walther, T.; Lostao, A.; Göpfrich, K.; del Barrio, J.; Hernández-Ainsa, S. Folding and Functionalizing DNA Origami: A Versatile Approach Using a Reactive Polyamine. J. Am. Chem. Soc. 2025, 147, 3919–3924.
Grass, R.; Heckel, R.; Puddu, M.; Paunescu, D.; Stark, W. Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes. Angew. Chem. 2015, 54, 2552–2555.
Moon, B.; Jagadish, H.V.; Faloutsos, C.; Saltz, J.H. Analysis of the Clustering Properties of the Hilbert Space-Filling Curve. IEEE Trans. Knowl. Data Eng. 2001, 13, 124–141.
Baek, D.; Joe, S.-Y.; Shin, H.; Park, C.; Jo, S.; Chun, H. Recent Progress in High-Throughput Enzymatic DNA Synthesis for Data Storage. Biochip J. 2024, 18, 357–372.
Masaki, Y.; Onishi, Y.; Seio, K. Quantification of Synthetic Errors during Chemical Synthesis of DNA and Its Suppression by Non-Canonical Nucleosides. Sci. Rep. 2022, 12, 12095.
Hoose, A.; Vellacott, R.; Storch, M.; Freemont, P.S.; Ryadnov, M.G. DNA Synthesis Technologies to Close the Gene Writing Gap. Nat. Rev. Chem. 2023, 7, 144–161.
Chen, Y.-J.; Takahashi, C.N.; Organick, L.; Bee, C.; Ang, S.D.; Weiss, P.; Peck, B.; Seelig, G.; Ceze, L.; Strauss, K. Quantifying Molecular Bias in DNA Data Storage. Nat. Commun. 2020, 11, 3264.
Proakis, J.; Salehi, M. Communication Systems Engineering; Prentice Hall: Upper Saddle River, NJ, USA, 1994.
Blawat, M.; Gaedke, K.; Hütter, I.; Chen, X.-M.; Turczyk, B.; Inverso, S.; Pruitt, B.W.; Church, G.M. Forward Error Correction for DNA Data Storage. Procedia Comput. Sci. 2016, 80, 1011–1022.
Meiser, L.C.; Antkowiak, P.L.; Koch, J.; Chen, W.D.; Kohll, A.X.; Stark, W.; Heckel, R.; Grass, R. Reading and Writing Digital Data in DNA. Nat. Protoc. 2019, 15, 86–101.
Press, W.; Hawkins, J.; Jones, S.K.; Schaub, J.M.; Finkelstein, I.J. HEDGES Error-Correcting Code for DNA Storage Corrects Indels and Allows Sequence Constraints. Proc. Natl. Acad. Sci. USA 2020, 117, 18489–18496.
Chien, R. Cyclic Decoding Procedures for Bose-Chaudhuri-Hocquenghem Codes. IEEE Trans. Inf. Theory 1964, 10, 357–363.
Forney, G. On Decoding BCH Codes. IEEE Trans. Inf. Theory 1965, 11, 549–557.
Chen, S. Notes/Notebooks/ReedSolomonErasureCodes.Ipynb at Master·Chenshuo/Notes. Available online: https://github.com/chenshuo/notes/blob/master/notebooks/ReedSolomonErasureCodes.ipynb (accessed on 8 January 2025).
Caruthers, M.H. The Chemical Synthesis of DNA/RNA: Our Gift to Science. J. Biol. Chem. 2013, 288, 1420–1427.
Xu, C.; Zhao, C.; Ma, B.; Liu, H. Uncertainties in Synthetic DNA-Based Data Storage. Nucleic Acids Res. 2021, 49, 5451–5469.
Lu, M.; Wang, Y.; Qiang, W.; Cui, J.; Wang, Y.; Huang, X.; Dai, J. Towards High-Density Storage of Text and Images into DNA by the “Xiao-Pang” Codec System. Sci. China Life Sci. 2023, 66, 1447–1450.
Yan, Z.; Liang, C.; Wu, H. A Segmented-Edit Error-Correcting Code with Re-Synchronization Function for DNA-Based Storage Systems. IEEE Trans. Emerg. Top. Comput. 2023, 11, 605–618.

Figure 1. The process of DNA-based information storage, which involves two key stages: information writing and information reading. In the writing process, information is encoded into DNA via DNA synthesis, conducted either in vivo or in vitro. In the reading process, the DNA is sequenced, allowing the stored information to be retrieved.

Figure 2. The process of encoding an image into DNA via the ETQ codec.

Figure 3. Illustration of quaternary Huffman encoding for image data storage, showing the conversion of pixel values into nucleotide sequences.
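
The conversion named in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quaternary_huffman`, `encode`, `decode`) are hypothetical, the digit-to-base mapping 0, 1, 2, 3 to A, T, C, G follows the abstract, and only a single color channel is handled. Zero-weight dummy leaves are added so that every merge combines exactly four nodes, which is the standard adjustment for a 4-ary Huffman tree.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ATCG"  # digits 0..3 -> A, T, C, G (mapping taken from the abstract)

def quaternary_huffman(freqs):
    """Build a prefix-free code over A/T/C/G from a {symbol: weight} table."""
    if not freqs:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares payloads
    heap = [(w, next(tie), sym, None) for sym, w in freqs.items()]
    # Pad with zero-weight dummies until (n - 1) is a multiple of 3,
    # so that every merge step can take exactly four nodes.
    while len(heap) > 1 and (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tie), None, None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(4)]
        weight = sum(child[0] for child in children)
        heapq.heappush(heap, (weight, next(tie), None, children))

    codes = {}

    def walk(node, prefix):
        _, _, sym, children = node
        if children is None:
            if sym is not None:  # skip dummy leaves
                codes[sym] = prefix or BASES[0]  # lone symbol gets "A"
            return
        for digit, child in enumerate(children):
            walk(child, prefix + BASES[digit])

    walk(heap[0], "")
    return codes

def encode(pixels, codes):
    """Concatenate the codeword of each pixel value into one strand."""
    return "".join(codes[p] for p in pixels)

def decode(strand, codes):
    """Greedy decoding; valid because the code is prefix-free."""
    inverse = {cw: sym for sym, cw in codes.items()}
    out, buf = [], ""
    for base in strand:
        buf += base
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

# Hypothetical pixel values from one color channel.
pixels = [12, 12, 12, 200, 200, 7, 45, 45, 45, 45, 90]
codes = quaternary_huffman(Counter(pixels))
strand = encode(pixels, codes)
```

Frequent pixel values receive shorter codewords, and the output is already a nucleotide string, so no separate binary-to-base conversion step is needed in this sketch.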

Figure 4. Flow diagram of RS error correction and decoding algorithm.

Figure 5. Successful decoding and image reconstruction from DNA-based storage, validating the feasibility of DNA-based image storage and retrieval via the ETQ.

Figure 6. (a) The relationship between the length of the DNA strands (in nucleotides) and their net information density (in bits per nucleotide); (b) the distribution of GC content in the encoded DNA sequence.
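
The two quantities plotted in Figure 6 can be computed as in the short Python sketch below. It assumes the common definitions: net information density is the number of payload bits divided by the total number of nucleotides in the strand (including any overhead such as indexing or Reed–Solomon parity), and GC content is the fraction of G and C bases in a strand. The function names and the example numbers are hypothetical, and the paper's exact accounting of overhead may differ.

```python
def gc_content(strand):
    """Fraction of bases in the strand that are G or C (0.0 for an empty strand)."""
    if not strand:
        return 0.0
    return sum(base in "GC" for base in strand) / len(strand)

def net_information_density(payload_bits, total_nucleotides):
    """Payload bits stored per nucleotide, counting all overhead nucleotides."""
    if total_nucleotides <= 0:
        raise ValueError("total_nucleotides must be positive")
    return payload_bits / total_nucleotides

# Illustrative values only, not results from the paper.
example_gc = gc_content("ATGCGCAT")
example_density = net_information_density(payload_bits=300, total_nucleotides=200)
```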


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
