Our strategy relies on comparing the Boltzmann statistical weights of on-target and off-target pairings in order to evaluate the success probability of the process. In the miRNA case, we study the pairing of the miRNA–RISC complex to the mRNA, where mainly the nucleotides within the “seed” region are available for Watson–Crick interactions. In the PCR case, we consider the first annealing cycle, which is the most significant for the success of the technique.
In the following subsections, we present the main ingredients of our statistical physics description of the PCR technique and of miRNA gene expression regulation. First, we obtain the average free energy of a given class of duplex through a parametrization of the pairing that depends on the kinds of mismatches involved. Second, the same parametrization allows us to obtain the degeneracy of each class of duplex, i.e., the total number of sequences with which the primer/miRNA can form a duplex with the same combination of mismatched bases. To keep the results generally valid, we neglect the sequence specificity of the genomic ssDNA or of the mRNA and treat them as random sequences, where the four nitrogenous bases are equiprobable at each nucleotide of the off-target sites. Finally, we combine the binding free energy and the degeneracy to compute the Boltzmann weights of the on-target and off-target pairings. Comparing these two terms, we obtain the on-target pairing probability.
2.1. Free Energy for Duplex Formation
Unlike other works on DNA hybridization, which focus on predicting stable pairings as a function of temperature, i.e., on the study of “melting curves” [9,10,11], we aim to characterize the probability of on-target binding of oligonucleotides in the presence of huge numbers of random competing sites. To do so, we have to describe the binding free energy between any given pair of interacting oligomers, as well as the degeneracy of their potential pairing.
The free energy difference $\Delta G$ between a nucleic acid duplex and its free constituent sequences can be split into an enthalpic and an entropic part,
$$\Delta G = \Delta H - T\,\Delta S.$$
Nevertheless, providing an accurate description of the thermodynamic parameters characterizing the interaction between nucleic acids is not an easy task. The highly cited review by SantaLucia and Hicks [12] reports detailed energetic data for several DNA motifs, comprising canonical Watson–Crick pairing and a long catalog of errors, including internal mismatches, terminal mismatches, terminal dangling ends, hairpins, bulges, internal loops, and multibranched loops. Extracting such thermodynamic parameters, however, requires knowledge of the specific bases composing the two strands; this information is typically not accessible, and the computation is unfeasible when dealing with a multitude of random competing pairs. Moreover, since our aim is to unveil fundamental properties that explain the effectiveness of selective binding in nucleic acids through thermodynamic arguments and coarse-grained modeling, we assume that such properties do not depend on fine details such as the specific bases composing the interacting oligomers. This hypothesis is checked for specific cases (see Appendix A, and Appendix A.5 in particular), supporting the robustness and range of applicability of our description. Our two-state assumption, i.e., on–off hybridization with no intermediate state between unbound and paired, is justified for short oligomers [13], such as those considered in this study.
For these reasons, we develop here an effective energetic model that considers only three classes of base pairing errors and follows a “mean field” approach, in which all possible combinations of interacting pairs are averaged. The model yields a simplified yet quantitatively fair description of DNA (or RNA) hybridization.
For the sake of concreteness, let us focus on the pairing between a generic primer (an oligomer with length
L) and a long polymer with length
. Specifically,
is measuring the number of ways in which the first oligomer can couple to the latter (number of sites wherein it can attach). Once
L is defined, in our description, the duplex is fully characterized through a three-component parameter vector
. This vector carries the information of the number of external mismatches,
and
, and internal mismatches,
. The definition of
thus consists of re-parametrizing the hybridization thermodynamics on three classes of base pairing errors. To this aim, we split the total enthalpy and entropy into different contributions stemming from the different interactions involved in the duplex,
Above, we have separated the contributions from the perfect match, the dangling ends, and internal mismatches. Note that entropy is additionally corrected due to salt concentration [14]. In the following subsections, we account for each contribution in detail.
2.1.1. Perfect Match: Initiation and Nearest-Neighbor Canonical Base Pairs
Our starting point is the contribution of an ideal matched duplex. The nearest-neighbor model has been shown to provide a very good description of the enthalpy and entropy of duplexes [12]. This model starts from initiation values
and
, which are complemented by additive contributions coming from each couple of neighboring base pairs. Such contributions depend on the specific bases considered. Nevertheless, in our coarse-grained description, we associate a single averaged contribution
and
to any couple of neighboring matched base pairs (see
Appendix A along with
Table A1 therein for further details on the averaging). Therefore, in our framework, the enthalpy and entropy of perfectly matched duplexes depend solely on the length
L and simply read
where we have taken into account that the number of couples of neighboring base pairs is L − 1.
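As a minimal sketch of the perfect-match term, the snippet below adds the initiation values to one averaged contribution per couple of neighboring base pairs, with L − 1 couples in a duplex of length L. The function name `perfect_match` and all numerical values are illustrative placeholders, not the averaged parameters derived in Appendix A.

```python
def perfect_match(length, dh_init, ds_init, dh_nn, ds_nn):
    """Return (enthalpy, entropy) of a perfectly matched duplex.

    A duplex of `length` base pairs contains `length - 1` couples of
    neighboring base pairs, each given the same averaged contribution.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    n_couples = length - 1
    return dh_init + n_couples * dh_nn, ds_init + n_couples * ds_nn


# Placeholder values in kcal/mol and kcal/(mol K), chosen only for illustration.
dh, ds = perfect_match(20, dh_init=0.2, ds_init=-0.006, dh_nn=-8.0, ds_nn=-0.022)
temperature = 330.0  # K, hypothetical annealing temperature
dg = dh - temperature * ds  # free energy of the perfectly matched duplex
```

With these placeholders, the free energy grows linearly in magnitude with L, which is the only length dependence the averaged description retains.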
2.1.2. Dangling Ends: External Mismatches
This contribution accounts for the possibility that the duplex forms with mismatched base pairs at its ends. Moreover, if the external base is correctly paired, there is a stacking contribution to the free energy due to the base of the long polymer adjacent to the pair.
The number of external mismatches in each end is given by
and
, respectively. Note that, in order to obtain at least one matched base pair, we need to enforce
(see
Figure 1). Our working hypothesis, motivated by the values typically found [12], is that external mismatches can be thought of as two dangling bases at the same end. Therefore, due to external mismatches, (i) the corresponding neighboring base pairs are canceled out with respect to the perfect match and (ii) there is an extra contribution stemming from the first bases within a dangling end. Although, in reality, this contribution would depend on the identity of the bases, we consider an averaged contribution
and
to any dangling end (see
Appendix A along with
Table A2 therein for further insight on these values).
Therefore, by summing up the previous discussion, the contribution of external mismatches can be parametrized as follows:
where
takes the possible values
depending on the external mismatches
corresponding, respectively, to dangling ends without external mismatches, dangling ends plus external mismatches in one end, and dangling and external mismatches in both ends. When writing the cases above, we have kept in mind the binding of a primer inside a specific region of a longer DNA as in
Figure 1. Nevertheless, this has to be modified if one is interested in studying selection by miRNA. As described in the Introduction, the active region of miRNA is finite, as represented in
Figure 2. Therefore, when considering miRNA, we always assume
, regardless of the number of external mismatches.
2.1.3. Internal Mismatches
Now, we consider the effect of internal mismatches in the duplex. The integer parameter
gives the number of internal mismatches within the duplex. When
, the set of possible
defining a possible duplex, with one matched base pair at least, fulfills the condition
(see
Figure 1). Besides the corresponding couples of neighboring base pairs that are canceled out, the contribution penalty that stems from single internal mismatches has been thoroughly studied [
12]. This depends on the particular bases. Again, following the philosophy of our coarse-grained approach, we give an averaged contribution
and
to those eventualities (see
Appendix A along with
Table A3 therein for further details on the averaging). We assume that such a contribution does not vary when more than one internal mismatch is considered. Moreover, in order to prevent further complexity, we completely neglect the internal structure of the internal mismatches (number, sizes, and separation of adjacent internal mismatches). Specifically, we consider that the effects of additional internal mismatches are equivalent to considering those mismatches to be non-consecutive. Therefore, the thermodynamic parameters associated with internal mismatches are
According to our modeling, each internal mismatch replaces two couples of nearest-neighbor canonical base pairs with two nearest-neighbor couples of mismatched base pairs. Note that this is a strong approximation. Nevertheless, since the states with a significant statistical weight are those with low numbers of errors, we argue that it does not lead to significant errors. The results presented in this work are physically sound, which suggests that these assumptions do not distort our analysis.
2.1.4. Salt Correction
Thermodynamic parameters are computed for a given reference salt concentration, usually 1 M NaCl. Either an excess or a deficit of salt, or the presence of other ions, changes those parameters, affecting mainly the entropic contribution. Salt correction has been studied in detail in the literature [14]. In a nutshell, the most accepted proposals for this contribution assume that
is a function of the salt concentration
, usually through its logarithm. Again, this contribution has a dependence on the specific sequence that we neglect through averaging (see
Appendix A for details).
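A logarithmic salt dependence of the entropy can be sketched as follows. This is an assumption-laden illustration: the proportionality to the duplex length and the coefficient `coeff` are hypothetical choices, and the snippet does not reproduce the specific functional form of [14] or the averaging of Appendix A.

```python
import math


def salt_entropy_correction(length, salt_molar, coeff):
    """Entropy shift relative to the 1 M reference, assumed logarithmic.

    `coeff` is a hypothetical per-base-pair coefficient; the correction
    vanishes at the reference concentration because ln(1) = 0.
    """
    if salt_molar <= 0:
        raise ValueError("salt concentration must be positive")
    return coeff * length * math.log(salt_molar)
```

For a positive `coeff`, concentrations below 1 M give a negative entropy shift, while the reference concentration leaves the entropy unchanged.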
2.1.5. CG Contribution
To complement our averaged description, we also develop a more detailed, yet simple, approach that accounts for the effect of different sequences. Specifically, we assume that the energetic parameters are a function of the fraction of C or G bases in the DNA sequence, which may significantly change the thermal stability of the duplex. This description is of interest primarily for the PCR pairing statistics, since it highlights how the choice of specific sequences can influence the success of the PCR.
Herein, we follow the IUPAC-IUB notation, where the bases are classified as either strong bases S = {C, G} or weak bases W = {A, T}. Then, we define
as the fraction of S bases in the DNA sequence of interest. Our hypothesis is that the contribution coming from the nearest-neighbor canonical couples of base pairs is a function of this fraction. Specifically, we consider a linear interpolation (see
Appendix A for further details), i.e.,
When illustrating the effect of differences in the richness of strong bases, we will present the results in terms of the number of S bases .
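The linear interpolation in the fraction of strong bases can be sketched as below. Interpolating between a purely weak and a purely strong value is an assumption made here for illustration; the endpoint values and both function names are hypothetical and do not come from Appendix A.

```python
def strong_fraction(sequence):
    """Fraction of strong bases (C or G) in a DNA sequence."""
    return sum(base in "CG" for base in sequence.upper()) / len(sequence)


def interpolate_nn(fraction_strong, value_weak, value_strong):
    """Linearly interpolate a nearest-neighbor parameter in the S fraction."""
    if not 0.0 <= fraction_strong <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return (1.0 - fraction_strong) * value_weak + fraction_strong * value_strong


# Placeholder endpoint enthalpies in kcal/mol for an all-W and an all-S duplex.
dh_nn = interpolate_nn(strong_fraction("ACGT"), value_weak=-7.0, value_strong=-10.0)
```

The same interpolation would be applied to the entropic parameter with its own pair of endpoint values.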
2.2. Degeneracy of Equivalent Duplexes
Given a specific sequence, there is only one well-defined complementary sequence, which corresponds with . On the contrary, with the same specific referential sequence, we can find many duplexes with the same , i.e., duplexes with errors are degenerate.
Since there are 4 different possible bases, if we focus on one of them, there is only 1 exact complementary and 3 possible mismatches. Therefore, the degeneracy of a duplex with errors made by a selective molecule (primer or miRNA) of length
L within a specific site of a much longer nucleic acid characterized by
is
where the binomial coefficient takes into account all possible combinations of the
mismatches in the internal region of the duplex. Note that the simplicity of this degeneracy is partially due to our disregarding of the internal structure of internal mismatches.
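The counting argument behind the degeneracy (three wrong bases per mismatched position, and a binomial coefficient for the placement of mismatches) can be checked by brute force on a short sequence. The sketch below treats all positions alike, so it does not distinguish external from internal mismatches; it illustrates the underlying combinatorics rather than the degeneracy formula of the text. The function names are hypothetical.

```python
from itertools import product
from math import comb

BASES = "ACGT"


def count_with_mismatches(n_positions, n_mismatches):
    """Number of sequences differing from a reference at exactly n_mismatches sites."""
    return comb(n_positions, n_mismatches) * 3 ** n_mismatches


def brute_force_count(reference, n_mismatches):
    """Enumerate all sequences and count those at the requested Hamming distance.

    Comparing against the reference itself is equivalent to comparing
    against its complement, since complementation is a bijection on bases.
    """
    return sum(
        sum(a != b for a, b in zip(candidate, reference)) == n_mismatches
        for candidate in product(BASES, repeat=len(reference))
    )
```

Summing `count_with_mismatches(n, k)` over all k from 0 to n recovers the total number of sequences, 4 to the power n, as expected from the binomial theorem.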
2.3. Quantifying Selectivity
In the annealing phase of PCR, a short primer of length L can pair to its complementary target or to an off-target site in either of the two genomic ssDNA strands. Similarly, a miRNA can pair to its specific target or to other available sites within different mRNA molecules. The specificity of this binding is key to the success of the selective process. Herein, we compute the probability of such a successful binding using our model.
Let us consider a duplex comprising one selective molecule (primer/miRNA) and a longer nucleic acid. This duplex has, in principle, many ways to be formed. Obviously, we expect that there is a preferred binding, which corresponds with the selective molecule binding to the target region of the longer nucleic acid. For generalization purposes, let us assume that this target region appears times in the longer nucleic acids.
The statistical weight of occurrence for a specific binding j is given by the Boltzmann factor
$$e^{-\Delta G_j/(RT)},$$
where $\Delta G_j$ is the free energy difference corresponding to such binding, R is the gas constant, and T is the temperature used in the experiment. Therefore, if we label
as the desirable hybridization of the primer/miRNA with a specific target region, the probability of having a successful selection is
where the sum is carried out over all possible pairings in the system. Note that
is the conditional probability of having a successful binding, given that a binding occurs. In other words, we implicitly assume that in typical conditions, concentration and temperature grant a good degree of PCR primer (or miRNA) binding to the longer nucleic acids.
should not be confused with a melting curve, e.g.,
means that, out of a total of
bound primers/miRNA per long polymer,
is on-target and
are off-target.
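The conditional probability of a successful binding can be illustrated by normalizing Boltzmann factors over an explicit list of candidate bindings. This is a minimal sketch: the free energies, the temperature, and the helper names `boltzmann_weight` and `success_probability` are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def boltzmann_weight(delta_g, temperature):
    """Boltzmann factor exp(-dG / RT) for a binding free energy in kcal/mol."""
    return math.exp(-delta_g / (R_KCAL * temperature))


def success_probability(dg_target, dg_others, temperature):
    """Probability that a bound oligomer sits on the target, given that it is bound."""
    w_target = boltzmann_weight(dg_target, temperature)
    w_total = w_target + sum(boltzmann_weight(dg, temperature) for dg in dg_others)
    return w_target / w_total
```

Two bindings with equal free energy share the probability equally, while a sufficiently stable target dominates even a large number of weak competitors.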
When computing
, we have assumed that the oligomers (primer/miRNA) in the system are mutually independent, i.e., they do not compete for binding to any specific site. Therefore, we implicitly require that the total number of actual bindings
per long polymer measured by the melting curve should not be much larger than
. This constraint means that, on average, the number of primers/miRNA on target computed from
, i.e.,
, does not exceed the number of target spots on the genome. In
Appendix B, we provide a numerical check of
in typical genomic PCR conditions, based on the assumption of independence of primers and the computation of the melting curve, validating our hypothesis. Thus, we interpret
as a good estimator of pairing selectivity, expressing the ratio between on-target and off-target bindings.
In order to compute
, we need to quantify the different
. On the one hand, we can compute
through
using the formalism introduced in the previous section considering
. On the other hand, using a mean field approximation, we assign the averaged Boltzmann factor
to the rest of the possible bindings, where the sum over
runs for all possible external and internal mismatches. The denominator in (
16) comes from
, where the sum includes duplexes without a single complementary base pair. These duplexes can be considered within our energetic framework as impossible bindings to which we associate
. These off-target pairs have a weight proportional to the total sites of pairing
available in the system, i.e., the number of bases of the long polymer. Finally, we can rewrite the probability in (
15) for pairing to the targets that are found in number
in the system as
Note that we have used that , which is true for both PCR and miRNA.
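The mean field estimate can be sketched numerically under stated assumptions: the off-target weight is taken as a degeneracy-weighted average of Boltzmann factors over mismatch classes, normalized by the total number of sequences (impossible bindings contribute zero weight), and the on-target probability is taken as the on-target weight divided by the sum of on-target and off-target weights. The function names, arguments, and this explicit form are illustrative assumptions rather than a transcription of the equations above.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def mean_offtarget_weight(classes, total_sequences, temperature):
    """Degeneracy-weighted average Boltzmann factor over off-target duplexes.

    `classes` holds (degeneracy, free energy) pairs; sequences not listed
    (impossible bindings) contribute zero weight but still count in the
    normalization `total_sequences`.
    """
    rt = R_KCAL * temperature
    return sum(g * math.exp(-dg / rt) for g, dg in classes) / total_sequences


def selectivity(dg_target, n_targets, n_sites, mean_weight, temperature):
    """On-target probability with n_sites off-target sites, assuming n_sites >> n_targets."""
    on_target = n_targets * math.exp(-dg_target / (R_KCAL * temperature))
    return on_target / (on_target + n_sites * mean_weight)
```

In this form, the probability decreases as the number of available sites grows at fixed mean off-target weight, which is the competition effect the approach is meant to capture.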
This statistical approach provides a simple, computationally cheap theoretical result that requires no knowledge of the specific sequences involved. Although we are aware of the quantitative limitations of such an approach, we show here that our framework leads to a better understanding of the physics involved in selective processes such as the PCR technique or miRNA regulation.