1. Introduction
Information technology has developed rapidly in recent years, yielding new ways of processing data securely and quickly. Inspired by the structure of DNA (deoxyribonucleic acid), it may become possible in the near future to replace traditional silicon-based facilities with biological tools that compute with more than 0 and 1 [1,2]. Data storage in DNA has rapidly gained interest because of its high density, long-term durability, and low maintenance costs [3]. Over time, humanity has produced a large amount of information that requires long-term storage. Stone, clay, papyrus, wood, paper, and other materials were the media on which information was stored until the late 1920s. Since then, other carriers such as silicon have been introduced, giving rise to the current digital revolution [4]. Meanwhile, advances in DNA sequencing technology have reduced the cost of genome sequencing, which has led to several revolutionary advances in the genetic industry. NGS (next-generation sequencing) allows the parallel sequencing of large amounts of DNA and minimizes the need for the comparatively inefficient fragment-cloning methods typically used in Sanger sequencing. However, managing such large volumes of data remains an important open issue for researchers [5].
In the current Big Data era, a huge amount of data is produced every day. The global data volume is predicted to reach 175 ZB by 2025. Researchers are concerned that this growth will exceed the existing storage capacity [6], because the amount of data increases by nearly 50% annually. The data volume is predicted to exceed three yottabytes by 2040, and storing it would require more than 10^9 kg of highly pure silicon [7]. However, only 10^7–10^8 kg of silicon can be produced by 2040 [4]. DNA is a promising storage medium owing to its durability and ultrahigh density. One gram of DNA can store 215 PB of data, and data stored in DNA can be preserved for thousands of years [6]. To make this comparison concrete, the main digital information carriers are listed in Table 1. The table shows that DNA has clear advantages as a carrier in terms of capacity and lifetime, whereas access time remains an issue to be tackled. Therefore, it is advisable to store only archival information in this biopolymer.
In the literature, there are two main approaches to DNA data storage: in vivo and in vitro. In vivo approaches are based on recombinant microorganisms that carry artificially embedded non-biological information within their genomes. Here, the information is preserved by being transmitted from generation to generation. The main disadvantage of this approach is that only a small amount of information can be stored in living organisms [4].
Long-term in vitro storage is widely studied because it is well suited to computer-aided systems. The process consists of the following stages: conversion of information into binary code (binarization), translation of the binary code into DNA code (encoding), synthesis of oligos (DNA synthesis), storage of the oligos under stable conditions (DNA storage after synthesis and before sequencing), amplification to obtain a sufficient amount of informative DNA followed by retrieval of the nucleobase code (DNA sequencing), and recovery of the digital information from the nucleotide sequences (decoding) [8]. Long-term in vitro storage thus offers researchers a suitable framework for developing novel methods by improving the individual stages. Accordingly, this study addresses this branch of research.
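As a minimal sketch of the binarization, encoding, and decoding stages listed above, the following Python example uses the conventional mapping of 2 bits per base. The mapping table and all function names are illustrative assumptions. This is not the spatial encoding proposed in this study, and the synthesis, storage, and sequencing stages are not modeled.

```python
# Conventional 2-bits-per-base mapping (illustrative assumption,
# not the encoding proposed in this study).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def binarize(data: bytes) -> str:
    """Binarization: convert raw bytes into a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def encode(bits: str) -> str:
    """Encoding: translate each pair of bits into one DNA base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> str:
    """Decoding, step 1: translate each DNA base back into two bits."""
    return "".join(BASE_TO_BITS[base] for base in strand)


def debinarize(bits: str) -> bytes:
    """Decoding, step 2: regroup the bits into bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(binarize(message))
recovered = debinarize(decode(strand))
print(strand)     # CACACATGCAAC
print(recovered)  # b'DNA'
```

Because the mapping is a bijection between bit pairs and bases, decoding recovers the input exactly, provided that the strand is read back without errors.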
In 2014, Hafeez et al. proposed a robust data-hiding model for watermarking DNA sequences called “DNA-LCEB”. This model exploits silent mutations (i.e., synonymous substitutions) to store data in the degenerate codons of the DNA coding region. The authors claimed that DNA-LCEB exploits the entire coding region for watermark storage, thus leading to a better storage capacity [9]. In 2018, Lee et al. proposed two approaches to reversible DNA data hiding using multiple difference expansions. In this method, the string sequences of four characters (A, T, C, G) of noncoding DNA sequences are converted into decimal-coded values, and the watermark is embedded into the coded value sequence using two approaches: DE-based multiple-bit embedding (DE-MBE), which uses pairs of neighboring values, and consecutive DE-MBE (CDE-MBE), which uses previously embedded coded values [10]. In 2019, Rahman et al. presented a lossless DNA sequence hiding method to authenticate DNA sequences in the context of mobile cloud-based healthcare systems. In this method, authentication data are hidden and extracted, and the DNA sequence is reconstructed without any loss of information. The study included a security analysis and performance experiments [11]. In 2022, Song et al. proposed a de novo strand assembly algorithm (DBGPS) using a de Bruijn graph and a greedy path search. They claimed that DBGPS has substantial advantages in handling DNA breaks, rearrangements, and indels. The robustness of DBGPS was demonstrated through accelerated aging, multiple independent data retrievals, deep error-prone PCR, and large-scale simulations [3]. In 2023, Lenz and Zeh proposed a communication system in which information is transferred over many sequences in parallel. In this system, the receiver cannot control access to these sequences and can only draw from them, unaware of which sequence has been drawn. Furthermore, the drawn sequences are susceptible to errors. The information capacity was computed for a wide range of parameters and a general class of drawing distributions [12]. In 2024, Hong et al. proposed a novel Video Segmentation and Storage in DNA (VSD) method that relies on an innovative video segmentation strategy and a quadratic coding model and uses efficient indexing. The encoding model, which is based on the RS error-correcting code, balances storage density, combinatorial bio-constraints, and time efficiency, thereby reducing overhead costs [8]. Preuss et al. presented a novel approach for encoding information in DNA using combinatorial encoding and shortmer DNA synthesis, leading to an efficient sequence design and improved DNA synthesis and readout interpretation. This method leverages the advantages of combinatorial encoding schemes while relying on existing DNA chemical synthesis methods with some modifications. They noted that the use of shortmer DNA synthesis also minimizes the effects of synthesis and sequencing errors [13]. Cao et al. proposed a parity encoding and local mean iteration (PELMI) scheme to achieve the robust DNA storage of images. They reported that the proposed parity-encoding scheme satisfies the common biochemical constraints on DNA sequences and on undesired motif content. The scheme also accounts for the varying weights of pixel bits at different positions in the binary data, thus optimizing the utilization of the Reed–Solomon error correction. They claimed that PELMI achieves image reconstruction under general errors (insertion, deletion, and substitution) and enhances DNA sequence quality [14].
In this study, a DNA data storage method using spatial encoding-based lossless compression is proposed. The proposed approach employs a vector representation of each DNA base in a two-dimensional (2D) spatial domain for both the encoding and decoding phases. This representation of the input data makes the proposed method suitable for efficient compression. The method is reversible, so decoding is possible without any information loss. Experimental studies were performed by conducting capacity, compression ratio, stability, and reliability analyses. The results show that the proposed method is much more efficient in terms of capacity than other known algorithms in the literature. Moreover, significant results were achieved in terms of the compression ratio, stability, and reliability. This paper is organized into four sections. A brief description of the DNA structure and the proposed DNA storage approach is given in Section 2, titled “Materials and Methods”. The experiments and their results are presented in Section 3, titled “Experimental Results”. Finally, the overall findings are discussed in Section 4, titled “Discussion”.
4. Discussion
In recent years, significant efforts have been made to adapt digital data storage processes to computational biology tasks. With the development of Sanger sequencing technology, it became possible to collect and manipulate DNA sequences using computers. This has given rise to the new and attractive fields of bioinformatics and DNA data storage.
In this study, a DNA data storage method using spatial encoding-based lossless compression was proposed. The method is based on a vector representation of DNA bases that is processed in a 2D spatial domain. The experiments showed that the proposed method achieves better results than previously known algorithms in the literature when storage capacity is measured in bpn. Moreover, significant compression ratios for storage and GC content values for the stability of the sequence were achieved. The reliability of the proposed method was evaluated by performing multiple DNA sequencing analyses. The proposed method has a storage capacity of 1.99 bpn, mainly owing to the use of the 2D spatial domain and the vector representation of DNA bases. Another advantage of the proposed approach is that it enables the storage of 3 bits in one DNA base, which exceeds the 2 bits per base of traditional encoding schemes. In addition, the method requires little memory during simulation.
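The two sequence-level quantities mentioned above can be computed as in the following minimal sketch. It assumes that bpn denotes bits per nucleotide (stored information bits divided by the number of bases) and that GC content is the fraction of G and C bases in a strand. The function names and the example strand are hypothetical and are not taken from the experiments of this study.

```python
def gc_content(strand: str) -> float:
    """Return the fraction of bases in the strand that are G or C."""
    if not strand:
        raise ValueError("strand must not be empty")
    gc_count = sum(1 for base in strand.upper() if base in "GC")
    return gc_count / len(strand)


def bits_per_nucleotide(num_bits: int, num_nucleotides: int) -> float:
    """Return the storage capacity in bits per nucleotide (bpn)."""
    if num_nucleotides <= 0:
        raise ValueError("num_nucleotides must be positive")
    return num_bits / num_nucleotides


# Hypothetical example: a 12-base strand that carries 24 bits of data.
example_strand = "CACACATGCAAC"
print(gc_content(example_strand))                        # 0.5
print(bits_per_nucleotide(24, len(example_strand)))      # 2.0
```

A GC content near 0.5 is commonly regarded as favorable for strand stability, which is why it is reported alongside capacity.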
However, increasing the storage capacity via the compression ratio remains a valuable open problem. In the proposed method, 3 bits correspond to one base, and we first aim to increase this rate in future studies by adapting the method to the 3D spatial domain. In this way, we expect to benefit from the additional directions available to the vector representations and thereby increase the number of bits stored in one DNA base. In addition, this will enable us to use error-correcting codes (if needed) without affecting the storage capacity. Another important issue that needs to be addressed is the practical running time. DNA will not be fast enough to compete with optical, magnetic, or quantum formats in the foreseeable future. The use of primers makes the DNA strand accessible during sequencing. In addition, hash codes can be used as fingerprints to provide a more developed random-access mechanism. Finally, the implementation of the proposed method in living organisms and its resistance to mutations must be investigated. Such an implementation would enable the proposed method to carry information from one generation to another, which presents a new challenge. However, this requires a long, detailed, and expensive in vivo experiment, because preserving synthesized DNA in living organisms is difficult owing to mutations and amplifications.