1. Introduction
DNA molecules are a natural storage medium, having carried the information of life for billions of years, and they can also preserve non-biological information [1,2,3,4]. Since the validation of scalable DNA storage by George Church's team [5] at Harvard University in 2012, research on DNA storage has grown annually. The first step in DNA storage is encoding information into DNA sequences [3,6]; encoding algorithms based on simple mapping relationships are easy to implement but sacrifice a certain degree of information density [7]. In 2012, Church et al. [5] mapped a draft of an HTML-encoded book into DNA as a 5.27 MB bit stream using a binary-to-base mapping method. The DNA-coding method developed by the Church team propelled research and applications in this field. In 2013, Nick Goldman proposed [8] a ternary information transformation model. First, a ternary Huffman tree built from the byte frequencies of the binary file to be transcoded converts the binary sequence (0/1) into a corresponding ternary sequence (0/1/2). Each base of the DNA sequence is then determined from the current ternary digit and the previously selected base according to a rotating mapping pattern. This method completely avoids runs of consecutive identical bases, but it cannot regulate GC content under fixed rules and may produce repeated fragments. The Goldman code was the first to introduce Huffman coding into DNA storage and the first coding method to consider base information density. Because DNA synthesis and sequencing involve many intricate experimental operations [9,10,11], chemical reactions [12], and unavoidable noise pollution [13], unpredictable DNA-specific errors can occur at any time, leading to base loss, incorrect connections, or other unexpected changes, and thus to erroneous DNA sequences.
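The rotating mapping described above can be sketched in a few lines. The lookup table below is an assumption for illustration, not Goldman's published table; the principle is the same: each ternary digit selects one of the three bases that differ from the previously emitted base.

```python
# Candidate next bases for each previous base; the previous base itself
# is excluded, so two consecutive bases can never repeat.
NEXT_BASE = {
    "A": ("C", "G", "T"),
    "C": ("G", "T", "A"),
    "G": ("T", "A", "C"),
    "T": ("A", "C", "G"),
}

def trits_to_dna(trits, start="A"):
    """Map a ternary sequence (digits 0/1/2) to a DNA string with no
    homopolymer runs, rotating the choice on the previous base."""
    seq = []
    prev = start
    for t in trits:
        base = NEXT_BASE[prev][t]
        seq.append(base)
        prev = base
    return "".join(seq)

print(trits_to_dna([0, 1, 2, 0, 1]))  # -> CTGTC (no consecutive repeats)
```

Because the mapping rotates on the previous base, homopolymers are impossible by construction, but the GC content is left uncontrolled, matching the limitation noted above.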
To reduce the error rate, Grass et al. [14] introduced error-correcting codes from information and communication technology into DNA storage, using Reed–Solomon (RS) codes to correct the base errors and sequence losses that occur during DNA storage. The theoretical information density of the Grass et al. coding method reaches 1.78 bits per nucleotide (nt), marking the first integration of error-correction algorithms into the DNA coding process and expanding the encoding module of DNA storage. In 2016, Meinolf Blawat's team [15] developed an efficient and robust forward error-correction scheme for the DNA channel, using the byte as the basic unit of base conversion and mapping eight bits of information onto five nucleotides; the remaining two bits select among optional conversion parts. This design restricts the maximum homopolymer length to three, reducing the likelihood of DNA sequence self-complementarity. Phylogenetic analysis methods [16] that represent DNA sequences by unique natural vectors also help ensure one-to-one correspondence in DNA storage, so that each DNA sequence is represented clearly and unambiguously in genome space. In 2017, Yaniv Erlich and Dina Zielinski [17] proposed a coding algorithm based on the Luby transform, raising the encoding rate to an unprecedented 1.98 bits/nt. In 2020, Press et al. [18] developed the Hash Encoded, Decoded by Greedy Exhaustive Search (HEDGES) DNA encoding algorithm, which can handle the insertion and deletion errors arising in DNA synthesis and sequencing; it combines RS codes with convolutional codes for encoding. The results show that, at the cost of some encoding density, HEDGES can handle an insertion and deletion error rate of approximately 1.2%. Cai et al. [19] emphasize the importance of redundancy and error correction for maintaining the uniqueness of the encoded DNA sequences and for ensuring that the original data can be retrieved accurately even when errors occur.
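The homopolymer limit used in schemes such as Blawat's can be checked mechanically. A minimal sketch (function names are mine; the limit of three is the value stated above):

```python
def max_homopolymer(seq):
    """Length of the longest run of identical consecutive bases."""
    if not seq:
        return 0
    longest = run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return longest

def satisfies_runlength(seq, limit=3):
    """True if no homopolymer exceeds `limit` bases, as in the
    Blawat-style constraint described above."""
    return max_homopolymer(seq) <= limit

print(satisfies_runlength("ACGGGT"))   # True  (longest run is 3)
print(satisfies_runlength("ACGGGGT"))  # False (a run of four G's)
```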
The random-access process of DNA storage based on the Polymerase Chain Reaction (PCR) first requires the design of a specific primer library to ensure the uniqueness of each random-access target. The number of fixed-length primers is limited, and increasing the primer length to enlarge the primer library reduces the number of bases available for data in each DNA sequence [1,20]. In 2018, Organick et al. [21] proposed an encoding algorithm that significantly reduces sequencing redundancy through random access, so that fewer physical copies of a given molecule are needed to fully recover the stored data. Moreover, a random-access DNA storage system can also represent file metadata with barcodes on impervious silica capsules [22], enabling Boolean logic searches without PCR-based methods. Anavy et al. [23] encoded binary data using a six-letter composite DNA alphabet and combined RS and fountain codes for error correction: the information to be stored is converted from standard American Standard Code for Information Interchange (ASCII) encoding into binary sequences, and Huffman coding is then used to generate the DNA sequences. In 2023, Yu et al. [24] overcame the passive processing of DNA storage data in DNA pools by realizing an active DNA data-editing process in a droplet-controlled jet (DCF) system using splint connections.
To reduce inherent errors in the random-access DNA storage process, Cao et al. [25] reported new combinatorial constraints and proposed a Damping Multi-Verse Optimizer (DMVO) algorithm to design DNA storage coding sets that satisfy these constraints, using the resulting codewords as address bits. Building on this, they proposed a thermodynamic Minimum Free Energy (MFE) constraint [26] for the construction of DNA storage coding sets; the MFE constraint is used to avoid nonspecific hybridization and reduce synthesis and sequencing error rates, and a new BMVO algorithm was employed in that work. Yin et al. [27] proposed the NOL-HHO algorithm by improving the Harris Hawks Optimization algorithm, achieving better lower bounds for DNA storage coding. Although the NOL-HHO results improve on those of Limbachiya et al., there is still considerable room to raise its lower bounds. In 2022, Rasool et al. [28] proposed a new biologically optimized DNA data storage coding model (BO-DNA) to overcome reliability issues.
Although the construction of DNA storage coding sets can be formulated as an optimization problem of satisfying combinatorial constraints, existing encoding algorithms still fall short in both the quantity and quality of the codewords they produce. Therefore, this paper introduces Levy flight operations to improve the Sooty Tern Optimization Algorithm (STOA), proposing the LSTOA, which reduces the likelihood of the original algorithm falling into local optima and accelerates convergence. The encoding results show that, under the same combined constraint conditions, the LSTOA constructs larger DNA storage coding sets, providing more address bits for random access and thereby reducing DNA storage read–write latency.
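Levy flight perturbations inject occasional long jumps into a search, which is what helps an optimizer escape local optima. A minimal sketch of drawing one Levy step via Mantegna's algorithm (the value beta = 1.5 is a common choice in the literature, an assumption here rather than a value taken from this paper):

```python
import math
import random

def levy_step(beta=1.5):
    """Draw one Levy-flight step length with Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)
    v = random.gauss(0.0, 1.0)
    # Heavy-tailed ratio: mostly small moves, occasionally a long jump
    # that can pull the search out of a local optimum.
    return u / abs(v) ** (1 / beta)

random.seed(42)
print(levy_step())
```

In an STOA-style update, such a step would scale the displacement of a candidate solution toward (or past) the current best, rather than replace the algorithm's own position-update rule.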
5. Conclusions
In this paper, the LSTOA is proposed, based on the Levy flight strategy, to address the local optima that traditional heuristic algorithms frequently encounter in optimization problems. To evaluate the LSTOA, 13 benchmark test functions are introduced, some of which are high-dimensional unimodal functions for general testing purposes, while others are high-dimensional multimodal functions for extreme-case performance testing. On these functions, the LSTOA achieved satisfactory results. In the practical problem of DNA storage encoding, the LSTOA addresses the low encoding efficiency of DNA encoding. To enhance encoding quality, an edit distance constraint, a GC content constraint, a no-runlength constraint, and an uncorrelated address constraint are introduced. These combined constraints reduce errors in DNA storage and improve DNA storage efficiency, but they also complicate encoding and may reduce storage density. Therefore, by transforming the DNA storage encoding problem into a multi-objective optimization problem solved by a heuristic algorithm, I iteratively generate DNA storage coding sets that satisfy the constraints.
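The constraints listed above can each be checked mechanically when a coding set is built. The following is an illustrative sketch only (helper names, the exact 50% GC target, and the distance threshold are assumptions, not this paper's implementation):

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def no_runlength(seq):
    """True if no two consecutive bases are identical."""
    return all(a != b for a, b in zip(seq, seq[1:]))

def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def valid_pair(a, b, d):
    """Check two candidate codewords against the combined constraints:
    balanced GC content, no-runlength, and minimum edit distance d."""
    return (gc_content(a) == 0.5 and gc_content(b) == 0.5
            and no_runlength(a) and no_runlength(b)
            and edit_distance(a, b) >= d)

print(edit_distance("ACGT", "AGT"))           # 1 (delete the C)
print(valid_pair("ACGTACGT", "TACGTACG", 2))  # True
```

A heuristic optimizer such as the LSTOA would call checks like these inside its fitness evaluation, accepting a candidate codeword only if it satisfies every constraint against all codewords already in the set.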
Table 5 shows that the LSTOA expands the DNA storage coding sets in four scenarios and matches previous studies in the others. Cases such as n = 9, d = 3 indicate the potential of the LSTOA to surpass existing DNA storage coding set sizes, enabling random access to more data with the same codeword length. In the other cases, although the coding sets constructed by the LSTOA are not larger, they remain consistent with the best results of previous studies, demonstrating the stability of the LSTOA. In Figure 1, I also compare the code rate; a higher code rate means that more information can be stored in the same DNA sequence, and even a 1% improvement is significant for expensive DNA storage systems.
In future work, I will continue to focus on DNA storage encoding, because encoding is crucial not only for writing data into DNA storage but also for reading it back. Clustering [35], assembly, and other processes also depend on the encoding, so embedding clustering, assembly, or other preset information into the encoding process may be a direction for our continued efforts.