1. Introduction
With the rapid advancement of computer technology, efficiently storing vast amounts of information has become a critical challenge in data storage [1,2,3,4]. Compared to traditional methods, DNA storage technology offers superior capacity, enhanced stability, and reduced maintenance costs, positioning it as a prominent focus of contemporary information storage research [5,6,7,8].
In DNA storage systems, all files are encoded into DNA sequences composed of the four bases A, T, C, and G. These sequences are then synthesized and stored in a DNA file pool. To retrieve the desired files from the DNA file pool, PCR amplification technology is typically employed for information retrieval [9,10]. Each DNA data file contains a sequence that binds to a specific PCR primer. To extract a particular file, the corresponding primer is added to the sample to locate and amplify the desired sequence [11,12,13,14,15,16,17]. However, the binding of primers to DNA sequences is governed by the principle of base complementarity. If other base segments within the DNA sequences can form complementary pairs with the primer, it will lead to incomplete amplification of the target sequence. More critically, if other DNA sequences in the file pool form complementary pairs with the primer, cross-talk between the primer and DNA sequences will occur, resulting in the retrieval of unintended files [18,19]. The phenomenon in which DNA sequences form complementary base pairs with the primer at incorrect positions is known as “nonspecific pairing”.
Because DNA sequences are much longer than primer sequences, it is nearly impossible to entirely avoid complementary pairing between base segments and the primer. Kayama et al. utilized recurrent neural networks to predict the success rate of PCR amplification for specific primer sets and DNA sequences [20]. Their experimental results indicated that when continuous base complementarity exists between the DNA sequence and the 3’ end of the primer, PCR amplification may initiate at that position, leading to nonspecific pairing errors. We conducted further experimental analysis to determine how many continuous complementary bases are required to increase the PCR amplification error rate. The results demonstrated that when the DNA sequence and the primer’s 3’ end exhibit eight or more consecutive complementary bases, nonspecific pairing is more likely to occur (detailed experiments and results are provided in the Supplementary Materials: S1.2. Nonspecific Pairing Constraint). Since PCR amplification involves a double-stranded reaction, if one strand of the DNA sequence contains base segments that repeat the primer, the other strand will form complementary base pairs with the primer. Therefore, to avoid nonspecific pairing between the DNA sequence and the primer, it is sufficient to ensure that the DNA sequence does not exhibit eight consecutive complementary bases with the primer’s 3’ end. We summarize this as the “nonspecific pairing constraint”.
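As a concrete illustration, the constraint can be checked programmatically. The following is a minimal sketch (the function and variable names are our own, not from the original implementation) that tests whether a coded sequence shares an 8-base complementary run with a primer’s 3’ end on either strand:

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(seq))

def violates_nonspecific_pairing(dna: str, primer: str, k: int = 8) -> bool:
    """True if the DNA sequence could pair with the primer's 3' end over
    k or more consecutive bases.  Because PCR acts on both strands, it is
    enough to check whether the k-base 3'-end fragment of the primer, or
    its reverse complement, occurs anywhere in the coded strand."""
    tail = primer[-k:]
    return tail in dna or reverse_complement(tail) in dna
```

A sequence flagged by such a check would need to be re-coded or interleaved before synthesis.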
Additionally, due to the biochemical properties of DNA sequences, they must also satisfy traditional biological constraints (detailed information is provided in the Supplementary Materials: S1.1. Conventional Biological Constraints) [21,22,23,24,25,26]. To ensure that the DNA sequences meet both constraints and to enhance the accuracy of PCR amplification, we propose an efficient DNA coding algorithm for PCR amplification information retrieval (ECA-PCRAIR). This algorithm employs pruning optimization and variable-length scanning to adaptively establish a file coding codebook with the maximum theoretical storage density while adhering to traditional biological constraints. A codeword search tree is constructed based on the statistical characteristics of primers to optimize the codebook and minimize nonspecific pairing. Additionally, a variable-length interleaver, based on the primers, interleaves the DNA sequence to further ensure compliance with biological constraints and reduce nonspecific pairing. This algorithm not only effectively addresses the problem of nonspecific pairing but also enhances storage density and reduces the cost of DNA synthesis. Furthermore, the algorithm is universal and suitable for efficient coding of files in arbitrary formats.
3. Discussion
The experimental results demonstrate that the probability of nonspecific pairing between the DNA sequence and the 3’ end of the primer in the traditional QC algorithm ranges from 10% to 76%, while in the TC algorithm, it is between 71% and 99%. For the YYC, Modulation, and CAC algorithms, the probabilities of nonspecific pairing are 64% to 94%, 10% to 98%, and 7% to 72%, respectively. The ECA-PCRAIR algorithm exhibits a significantly lower nonspecific pairing probability, ranging from 2% to 25%. This substantial reduction highlights the superior ability of the ECA-PCRAIR algorithm in mitigating nonspecific pairing compared to the other five algorithms, thereby indicating its exceptional effectiveness in addressing the nonspecific pairing constraint.
Due to the inevitable nonspecific pairing under the YYC, Modulation, and CAC algorithms, the thermodynamic parameter values of the DNA sequences are very close to those of the given primer. Consequently, when the target primer is used to retrieve the target file from the DNA file pool, non-target files are also erroneously retrieved, complicating subsequent decoding and causing PCR amplification bias. In contrast, the DNA sequences coded by the ECA-PCRAIR algorithm exhibit almost no nonspecific pairing. As shown in Figure 2d, the thermodynamic indices of otherwise unrelated file sequences, whose values are close to those of the target file, diverge significantly after processing by the ECA-PCRAIR algorithm. This makes it difficult to amplify files other than the specific target in PCR amplification experiments, resulting in more accurate target amplification and ideal experimental outcomes. Thus, the ECA-PCRAIR algorithm effectively avoids the nonspecific pairing problem between DNA sequences and the primer’s 3’ end, enhancing the accuracy of PCR amplification and the proportion of effective information.
The demonstrated effectiveness of the ECA-PCRAIR algorithm can be attributed to its innovative coding process, which leverages a codeword search tree derived from the primer library. The algorithm constructs the codebook in ascending order of codeword weights, preferentially selecting base combinations that either do not appear in the primers or have a low probability of occurrence. This strategy effectively minimizes nonspecific pairing between the DNA sequence and the primer. In contrast, other algorithms primarily focus on traditional biological constraints and storage density without systematically addressing the nonspecific pairing between DNA sequences and primers. Given the vast number of potential DNA sequences and the limited number of primers in a DNA file pool, nonspecific pairing with primers becomes inevitable. Consequently, existing coding algorithms fail to fully satisfy the nonspecific pairing constraint. The CAC algorithm partially addresses nonspecific pairing; however, it is fundamentally a coding scheme based on primer-controlled construction and only ensures that the DNA sequence does not exhibit nonspecific pairing with one specific primer. In practical applications, where hundreds or thousands of primers may be present, DNA sequences remain prone to nonspecific pairing with the other primers. Therefore, the thermodynamic parameter values of CAC are higher than those of the proposed ECA-PCRAIR algorithm, and its binding force is weaker, as shown in Figure 2c. Furthermore, the theoretical storage density of CAC is limited to 1 bit/nt, significantly lower than that of the ECA-PCRAIR algorithm. In conclusion, the overall performance of the CAC algorithm is inferior to that of the ECA-PCRAIR algorithm.
Based on empirical data from file and simulation experiments, the ECA-PCRAIR algorithm effectively manages GC content and homopolymer situations. The experiments demonstrated that GC content was consistently maintained between 40% and 60%, with a concentration range of 45% to 55%. Long homopolymers were kept below 4 nt, and homopolymers of 4 nt or longer appeared in only 0.06% of cases. This minimal occurrence is unlikely to substantially impact the overall PCR reaction, underscoring the ECA-PCRAIR algorithm’s superior performance in controlling GC content and homopolymer formation.
Compared with the TC, YYC, and CAC algorithms, the ECA-PCRAIR algorithm exhibits a faster coding speed and reduced coding time for various file formats. The coding time is directly proportional to the file size; larger files necessitate longer coding time. For text files, the ECA-PCRAIR algorithm reduces time consumption by an average of 29% compared to the TC, YYC, and CAC algorithms. For image files, the reduction averages 44%, highlighting the most significant performance improvement. For other file formats, time consumption decreases by approximately 26% on average. This improvement is attributable to the low time complexity of the ECA-PCRAIR algorithm, with the primary computational efforts concentrated in variable-length scanning and search tree weighting, making the computational complexity linear with the input data size. Once these components are completed, the subsequent coding involves straightforward codebook mapping. In contrast, the TC, YYC, and CAC algorithms require iterative comparisons with previous bases during coding, leading to decreased efficiency as the DNA sequence lengthens and the number of iterations increases. Consequently, these algorithms demand more computational resources. Among these, the ECA-PCRAIR algorithm shows the highest reduction rate in time consumption for image files due to the regularity of the binary code streams in such files. The text and other file formats, being more complex, exhibit less regular binary data patterns. The core principle of the ECA-PCRAIR algorithm is to identify combinatorial rules within binary code streams to derive the most efficient codebook. Therefore, more regular binary data results in reduced time consumption for the ECA-PCRAIR algorithm.
The ECA-PCRAIR algorithm adaptively identifies the coding method with the highest storage density for different files. Its theoretical storage density ranges from 2.14 to 3.67 bits/nt, as depicted in Figure 6. In comparison, current coding algorithms achieve a theoretical density of 1 to 1.98 bits/nt [19,27,28,29,30,31,32,33,34,35,36,37,38,39]. This demonstrates that the proposed algorithm significantly enhances storage capacity, thereby reducing the cost of DNA sequence synthesis and accelerating the adoption and application of DNA storage technology.
4. Methods
The proposed ECA-PCRAIR algorithm comprises three main steps: codebook generation, codebook optimization, and interleaved correction. By analyzing the statistical characteristics of the binary bit stream of the file to be coded, the initial codebook is generated. This initial codebook is then optimized by incorporating the nonspecific pairing constraint to produce the final codebook for the file. Finally, through the interleaving algorithm, the constraint conditions of the coded DNA file sequence are detected and corrected, resulting in a set of file sequences that can be directly synthesized as DNA molecules. A schematic of the coding algorithm is shown in Figure 7.
4.1. Codebook Generation Algorithm Based on Pruning Optimization and Variable-Length Scanning
4.1.1. Codeword Generation Algorithm Based on Pruning Optimization
For a DNA sequence of length n, the space of alternative bases at each position is {A, T, C, G} and its size is 4. Therefore, there are at most 4^n possibilities for a sequence of length n. However, a significant portion of these possible sequences cannot be used in practice. Sequences with long runs of consecutive repeated bases, a GC content lower than 40% or higher than 60%, or complementary regions within different parts of the sequence that lead to hairpin structures can significantly affect the accuracy of DNA molecule synthesis, selective PCR amplification reactions, and DNA molecule sequencing. Thus, it is necessary to filter the sequences according to these constraints to obtain the viable codewords for each length.
Firstly, the length n of the base sequence is determined, and a multi-branch search tree of depth n is established, taking the optional base space {A, T, C, G} at each position as the branching set from the root node. Each search tree therefore covers 4^n different base combination patterns in total.
After constructing the search tree, a depth-first traversal of the tree nodes is performed. Starting from the root node, child nodes are visited depth-first to determine whether the biological constraints on homopolymer length and GC content are met. If the visited node does not meet the constraint conditions, the current search path is marked, and the search tree is pruned to optimize the time complexity and space complexity of the search algorithm.
These steps of child-node visitation, biological constraint checking of node-path base sequences, and pruning optimization are repeated until all feasible paths have been visited. The remaining node paths of the search tree, after pruning optimization, represent all the alternative codewords of the fixed length that meet the basic biological constraints, as depicted in Figure 8.
After the aforementioned search process, a feasible codebook consisting of codewords of length n is obtained. By altering the codeword length n and repeating the above steps, a set of feasible codeword sets, together with the relationship between codeword length and the number of alternative codewords, is established for the given orthogonal primer library. Typically, the constraints for the codeword search and pruning process are a homopolymer length below 4 nt and a GC content between 40% and 60%.
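The pruning search described above can be sketched as follows. This is a minimal illustration in Python: the homopolymer bound is pruned mid-branch as described, while the GC window is, for brevity, checked only at the leaves rather than pruned on partial paths:

```python
def generate_codewords(n: int, max_homopolymer: int = 3,
                       gc_range: tuple = (0.4, 0.6)) -> list:
    """Enumerate all length-n codewords over {A, T, C, G} by depth-first
    search.  Any branch whose prefix already contains a homopolymer run
    longer than `max_homopolymer` is pruned immediately; the GC-content
    window is verified once a full-length codeword is reached."""
    results = []

    def dfs(prefix: str):
        # Prune: a run of max_homopolymer + 1 identical bases at the end
        # of the prefix can never be repaired by later bases.
        if len(prefix) >= max_homopolymer + 1 and \
           len(set(prefix[-(max_homopolymer + 1):])) == 1:
            return
        if len(prefix) == n:
            gc = sum(b in "GC" for b in prefix) / n
            if gc_range[0] <= gc <= gc_range[1]:
                results.append(prefix)
            return
        for base in "ATCG":
            dfs(prefix + base)

    dfs("")
    return results
```

For n = 4, this yields the 96 sequences containing exactly two G/C bases, since 1/4 and 3/4 fall outside the 40–60% window.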
4.1.2. Codebook Generation Algorithm Based on Variable-Length Scanning
Considering that different input files have varying statistical characteristics in their code streams, we scan the input files with variable lengths. By comparing the compression efficiency of the scan results for different lengths, we select the optimal scan length parameter and then use the pruning optimization algorithm to match the best codeword length, thereby obtaining the adaptive codebook. The adaptive codebook is generated as follows.
For a binary sequence of length L, given the string length m, the binary sequence is padded with “0” so that its length becomes a multiple of m, yielding a new sequence of length L′. L′, L, and m should satisfy the following relationship:
L′ = m⌈L/m⌉
A string set S = {s_1, s_2, …, s_q} can be obtained by scanning the new binary sequence at m-bit intervals. In different string sets, each string has a different probability of appearing. Let the probability of string s_i be p_i; then:
p_i = c_i/(L′/m)
among them, c_i denotes the number of occurrences of s_i in the scan, and the probabilities satisfy Σ_i p_i = 1.
For the string set S obtained from the binary sequence after interval scanning, let the codebook be C and the code length of the codeword mapped to s_i be l_i. The average code length is:
l̄ = Σ_i p_i · l_i
Using different interval lengths for scanning, the corresponding average code lengths can be obtained. From the perspective of coding efficiency and DNA storage cost, the smaller the average code length, the better. Considering that different files have different statistical characteristics, the optimal scan length should be selected according to the characteristics of the file.
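The padding, scanning, and averaging steps above can be sketched as follows (illustrative Python; `codeword_len`, which maps each m-bit string to the length of its assigned codeword, is an assumed input produced by the codeword search):

```python
from collections import Counter

def average_code_length(bits: str, m: int, codeword_len: dict) -> float:
    """Pad `bits` with '0' to a multiple of m, scan it in m-bit strings,
    estimate each string's probability p_i by its relative frequency,
    and return the average code length sum_i p_i * l_i."""
    pad = (-len(bits)) % m
    padded = bits + "0" * pad
    chunks = [padded[i:i + m] for i in range(0, len(padded), m)]
    counts = Counter(chunks)
    total = len(chunks)
    return sum((c / total) * codeword_len[s] for s, c in counts.items())
```

Comparing this quantity across candidate values of m is what drives the scan-length selection described above.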
The steps for the best code length matching are as follows:
(1) An initial string length m is selected, and a new binary stream is obtained by padding the original binary sequence with “0”. The probability distribution of the strings of length m in the binary stream is counted, and the required codeword length n is matched according to the pruning optimization algorithm.
(2) Calculate the theoretical storage density ρ = m/n of a single codeword according to the string length m and the codeword length n obtained by matching, and record the string length m, the codeword length n, and the theoretical storage density ρ. However, since the code-length allocation algorithm produces one codebook per file, the length of the codebook should be included when calculating the actual theoretical storage density ρ′. The formula is as follows:
ρ′ = ρ · l_s/(l_s + l_c)
in which l_s denotes the length of the coded base sequence and l_c denotes the codebook length.
(3) Let m = m + 1, repeat the above steps, and finally obtain a one-to-one dictionary of m, n, and ρ′. It should be noted that the number of scanning bits cannot be increased indefinitely; based on experimental experience, an upper limit on the number of scanning bits is set.
(4) Sort the dictionary in descending order according to the actual theoretical storage density ρ′, and select the string length m and the codeword length n with the highest theoretical storage density.
After the optimal code-length matching, the optimal string length m and codeword length n have been obtained, based on which the initial adaptive coding codebook can be generated.
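The matching loop in steps (1)–(4) can be sketched as follows. This is a simplified illustration under stated assumptions: `num_codewords(n)` stands in for the counts delivered by the pruned search tree, the worst case of 2^m distinct m-bit strings is assumed, and the codebook-length correction for ρ′ is omitted:

```python
def match_best_length(num_codewords, m_max: int = 8):
    """For each string length m, find the smallest codeword length n whose
    pruned search tree can host all 2**m possible m-bit strings, compute
    the density m/n, and keep the best (density, m, n) triple."""
    best = None
    for m in range(1, m_max + 1):
        n = 1
        while num_codewords(n) < 2 ** m:
            n += 1
        density = m / n
        if best is None or density > best[0]:
            best = (density, m, n)
    return best
```

With an unconstrained alphabet (`num_codewords(n) = 4**n`), the loop recovers the familiar 2 bits/nt bound; real constraint counts yield longer codewords and lower densities.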
4.2. Codebook Optimization Algorithm Based on Nonspecific Pairing Constraint
According to the codebook generation algorithm described in the previous section, the parameters for the codebook that achieves optimal theoretical storage density have been determined. These parameters include the number of string bits, the codeword length, the search tree at this codeword length, and the initial codebook. Although the initial codebook generated by the algorithm provides the best information storage density, it has two main limitations: (1) the mapping relationship between strings and codewords in the codebook has not been constrained, and (2) the nonspecific pairing constraints of DNA storage systems have not been considered during the codebook generation process. Consequently, the initial codebook requires optimization.
Upon further investigation, it was observed that the base distribution of primer sequences in the primer library exhibits specific statistical characteristics and is not uniformly distributed. Detailed experimental procedures are provided in the Supplementary Materials “4. Analysis of Primer Statistical Properties”. Considering this feature, we propose a codebook optimization algorithm based on the nonspecific pairing constraint to enhance the initial codebook. By leveraging the statistical characteristics of the base substrings in the primer library, we assign a weight value to each node of the codeword search tree obtained previously. Based on these weight values and the frequency of the strings to be coded, we map the codewords to the strings, thereby optimizing the codebook and completing the preliminary coding. The main steps are as follows:
(1) Sliding Scan: Perform an n-base sliding scan on the last 8 bases of all primers in the primer library. After each scan, the path weight of the corresponding codeword on the depth-n search tree obtained by the pruning optimization algorithm is updated according to the scanning result. The result is a depth-n search tree with path weights.
(2) Mapping: Starting from the path with the smallest sum of path weights, perform a one-to-one mapping to the m-bit strings determined by the variable-length scan, in descending order of string occurrence probability. This mapping ensures that the continuous 8-base fragments at the 3’ ends of the primers appear as infrequently as possible in the coded sequence, thereby reducing the likelihood of nonspecific pairing between the DNA sequence and the 3’ ends of the primers in the library.
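Steps (1) and (2) can be sketched as follows (illustrative Python with our own names; ties in weight and probability are broken lexicographically, a detail the algorithm does not specify):

```python
from collections import Counter

def optimize_codebook(codewords, string_probs, primers, n):
    """Slide an n-base window over the last 8 bases of every primer,
    count how often each n-base fragment occurs, weight every candidate
    codeword by that count, and map the most frequent bit-strings to the
    least-weighted (i.e. least primer-like) codewords."""
    window_counts = Counter()
    for p in primers:
        tail = p[-8:]
        for i in range(len(tail) - n + 1):
            window_counts[tail[i:i + n]] += 1
    # Ascending weight: codewords absent from primer tails come first.
    ranked_codewords = sorted(codewords, key=lambda cw: (window_counts[cw], cw))
    # Descending probability: frequent strings get the "safest" codewords.
    ranked_strings = sorted(string_probs, key=lambda s: (-string_probs[s], s))
    return dict(zip(ranked_strings, ranked_codewords))
```

Because the most common strings dominate the coded output, pairing them with fragments rare in the primer tails keeps primer-like runs out of the sequence.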
4.3. Variable-Length Interleaving Models with Nonspecific Pairing Constraint
After the codebook optimization process, the probability of nonspecific pairing between the DNA sequence and the 3’ ends of primers in the orthogonal primer library is greatly reduced. However, each DNA file pool stores a large number of DNA molecules, and the number of distinct file DNA sequences usually exceeds 10^4. During the conversion of binary information sequences into DNA sequences according to the codebook, the junctions between adjacent codewords can still cause nonspecific pairing between parts of the DNA sequence and the 3’ end of the primer sequence.
To address this problem, we design a variable-length interleaver. By establishing a criterion to determine whether the load part of the DNA sequence satisfies the nonspecific pairing constraint and other biological constraints, DNA sequences that do not meet these criteria are input into the variable-length interleaver for cyclic interleaving calibration, transforming them into risk-free sequences.
Figure 9 shows an example diagram of the variable-length adaptive packet interleaver, where the existing interleaver is divided into four blocks with different sizes, and each sub-block has different interleaving criteria.
The overall interleaver size k × k adapts to the length L of the input sequence by adjusting toward a square interleaver, calculated as follows:
k = ⌈√L⌉
Simultaneously, the interleaver is adaptively divided into four groups of smaller interleavers of different sizes, each tending toward a square shape. The sizes of the first, second, third, and fourth sub-interleavers are specified as k_1 × k_1, k_2 × k_2, k_3 × k_3, and k_4 × k_4, respectively, with the dimensions of the four interleaver groups determined by this adaptive division.
After each interleaving operation, the DNA sequence is evaluated to determine if it still poses a risk. If nonspecific pairing remains a concern, the interleaving process is repeated. Otherwise, the risk-free sequence is output. By using the variable-length adaptive group interleaver, risk-free sequences that satisfy all biological constraints can be generated through base rearrangement without altering the types, numbers, and GC content of the original DNA base sequences.
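The interleaving pass can be sketched as a single square block interleaver (illustrative Python; the adaptive four-block division and the specific risk criteria are omitted, and `is_risky` is an assumed caller-supplied predicate implementing the constraint checks):

```python
import math

def block_interleave(seq: str) -> str:
    """Write the sequence row by row into a k x k grid, k = ceil(sqrt(len)),
    then read it back column by column.  The output is a permutation of
    the input, so base counts and GC content are unchanged."""
    k = math.ceil(math.sqrt(len(seq)))
    rows = [seq[i:i + k] for i in range(0, len(seq), k)]
    return "".join(row[c] for c in range(k) for row in rows if c < len(row))

def interleave_until_safe(seq: str, is_risky, max_rounds: int = 16):
    """Repeat the interleaving pass until the sequence passes the risk
    check, or give up after max_rounds."""
    for _ in range(max_rounds):
        if not is_risky(seq):
            return seq
        seq = block_interleave(seq)
    return None
```

Since each pass only permutes bases, the composition-preserving property stated above holds by construction.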
5. Conclusions
In PCR amplification reactions, nonspecific pairing between base segments of the DNA sequence and the primer, particularly when it involves eight or more consecutive bases, can easily cause cross-talk in the amplification process. This leads to the generation of redundant sequences and a reduction in amplification accuracy. To address this issue, we propose a novel DNA coding algorithm tailored for PCR amplification information retrieval. This algorithm constructs a weighted codeword search tree based on a primer library and codes with base combinations that are either absent or have a low occurrence probability in the primer library. Additionally, it employs a variable-length interleaver for constraint detection and correction, significantly reducing nonspecific pairing between DNA sequences and primers. The preliminary codebook generation process incorporates pruning optimization and variable-length scanning, which not only ensures the satisfaction of traditional biological constraints but also adaptively searches for the optimal coding scheme in terms of storage density, thereby enhancing storage capacity.
Experimental results demonstrate that the DNA sequences coded by our proposed algorithm exhibit a nonspecific pairing probability with primers of only 2% to 25%, which is significantly lower than that of existing algorithms. Furthermore, the theoretical storage density of our algorithm reaches 2.14 to 3.67 bits/nt, more than twice that of current algorithms. This algorithm holds significant potential for advancing the practical application of DNA storage.