Article

The Inconsistency of the Algorithms of Jaro–Winkler and Needleman–Wunsch Applied to DNA Chain Similarity Results

Department of Computational Mathematics and Cybernetics, Shenzhen MSU–BIT University, 1 International University Park Road, Dayun New Town, Longgang District, Shenzhen 518172, China
Mathematics 2025, 13(2), 263; https://doi.org/10.3390/math13020263
Submission received: 3 December 2024 / Revised: 6 January 2025 / Accepted: 6 January 2025 / Published: 15 January 2025
(This article belongs to the Collection Theoretical and Mathematical Ecology)

Abstract

There are many different algorithms for calculating the distances between DNA chains, and different algorithms give different results. This paper does not consider which of the classical algorithms is better; instead, it shows the inconsistency of two classical algorithms, namely those of Jaro–Winkler and Needleman–Wunsch. To do this, we consider distance matrices based on both of these algorithms. We explain that, ideally, the triangle formed by each triple of distances in the distance matrix should be acute-angled isosceles. Of course, in reality, this property is violated, and we can determine the badness of each such triangle. Two algorithms for determining distances are considered consistent when their sequences of badness values are ordered in the same way; the greater the deviation from this common order, the less consistent they are. In this paper, we consider the distance matrices for the two mentioned algorithms, calculated for the mitochondrial DNA of 32 species of monkeys belonging to different genera. For them, 4960 triangles are formed in both matrices, and we calculate the values of the rank correlation between the two sequences of badness. We obtain very small values (with different methods of calculating the rank correlation, they do not exceed 0.14), which indicates the inconsistency of the two algorithms under consideration.

1. Introduction and Motivation

There are many different algorithms for calculating the distances between sequences of symbols of different natures, and, in particular, between DNA chains, which are the main subject of research in this article. At the same time, both biologists and specialists in applied mathematics consider some of the provisions used in these algorithms to be unshakable. (However, “Nothing in Biology Makes Sense Except in the Light of Evolution” (1973, Theodosius Dobzhansky). We believe that both the title of this essay and its content are directly related to the study of DNA sequences in general and to algorithms for calculating distances between them in particular.) Certainly, this applies to the once-calculated distance between the genomes of different species.
However, there are some important points to make about this case:
  • Firstly, if we talk specifically about mammals, whose genomes are the object of the research of this paper, then one of three options is usually used as the object for analysis:
    Mitochondrial DNA (mt DNA, which will be the main object of this research);
    The “tails” of Y chromosomes;
    The main histocompatibility complex.
  • Secondly, different algorithms are used to determine the distances between genomes, and, in the author’s opinion, they are all modifications of the Levenshtein metric (sometimes significant modifications, as a result of which their relation to the Levenshtein metric is not always obvious). In this paper, we do not engage in comparing these different algorithms.
  • Thirdly, the main difficulty encountered in calculating the distance between such sequences is their great length. For example, the length of the human mt DNA sequence, which is a very short DNA, exceeds 16,000 characters, while the total length of human DNA exceeds 3,000,000,000 characters. Therefore, it is impossible to solve real problems with the exact calculation of the Levenshtein distance, and all the algorithms used for them can be called heuristic.
  • Fourthly, it is possible to conduct distance studies either before “combining triples of letters into one” or after such combining. However, for the task described in this paper, this is not fundamental.
  • Fifthly, for 4 variants of nucleotides in the genome, natural selection results not in 64 variants of the triples but in 21 variants only. Each of these options can be considered an encoded letter. Moreover, as is written in the popular scientific literature (we shall not give specific references), at least four artificial amino acids have been designed, which can “on full grounds” enter artificial DNA chains, and the triples of amino acids in such artificial DNA chains can be replaced by fours or even fives.
Thus, different algorithms for determining the distances between DNA chains give different results. The general trend is nevertheless correct: for any adequate algorithm, the distance between the genomes of humans and chimpanzees is less than that between the genomes of humans and, e.g., elephants. However, we would like to obtain a more detailed answer to the question of quantifying this distance. The paper is devoted to one of the issues of this big topic.
When considering such different algorithms for determining the distances between genomes, the author has long suspected that the Jaro–Winkler and Needleman–Wunsch algorithms are highly inconsistent. The main subject of the paper is the quantitative verification of this hypothesis. We shall show in the paper that it holds, i.e., these algorithms are not consistent.
We calculate such a quantitative characteristic as follows: First, for the algorithm in question, we calculate the distance matrix between pairs of genomes. Note that, for example, for a matrix of dimension 32, we have
$$\frac{32 \cdot 31}{2} = 496$$
cells to fill, and the algorithms mentioned above take quite a long time to run. Using the Needleman–Wunsch algorithm for mt DNA, filling the 496 cells of the 32-dimensional matrix takes about a day on an average modern computer.
Next, in this matrix, we consider all the triangles. It is important to note that there are quite a lot of them. For example, for a matrix of dimension 32, there are
$$\frac{32 \cdot 31 \cdot 30}{2 \cdot 3} = 4960.$$
Each of the triangles has a special characteristic, the so-called badness, which will be discussed in detail in Section 2.
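The two counts quoted above are the binomial coefficients "32 choose 2" and "32 choose 3". A minimal sketch that checks them, both by formula and by explicit enumeration (the variable names are illustrative only):

```python
from itertools import combinations
from math import comb

n_species = 32  # dimension of the distance matrix used in the paper

n_pairs = comb(n_species, 2)      # cells above the main diagonal
n_triangles = comb(n_species, 3)  # unordered triples of species

# Enumerating the triples explicitly gives the same count.
triples = list(combinations(range(n_species), 3))

print(n_pairs, n_triangles, len(triples))
```

The number of triangles grows cubically with the number of species, which is why the badness sequences are much longer than the sequences of matrix elements.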
For two different algorithms, as a result of such constructions, we obtain two sufficiently long sequences of badness values for all triangles. Next, we calculate the pair correlation for these sequences, and we consider acceptably large values of it (that is, 0.7 or more) to indicate that the algorithms are consistent.
However, in our situation, real calculations give results that are very far from such values. Specifically, with different methods of calculating the correlation coefficient, the values for the examples we are considering lie in the range from 0.075 to 0.14 (see details below). It would hardly be an exaggeration to say that we are fortunate that the correlation coefficients are positive at all. Thus, as we have already said, the Jaro–Winkler and Needleman–Wunsch algorithms are inconsistent.
At the end of this section, we note a few small general remarks:
  • Firstly, everything said here is described in more detail below, but we do not provide detailed content by section in the introduction.
  • Secondly, the technique considered in the paper, which can be called an algorithm for determining the consistency of algorithms for calculating distances between strings, is applicable to any pair of such distance calculation algorithms and to any set of species.
  • Thirdly,
    Algorithms for determining the distances between two specific strings (in particular, DNA sequences) are heuristic due to the total size of the data under consideration;
    Algorithms for calculating the badness for triples of DNA sequences are therefore heuristic algorithms for analyzing heuristic algorithms;
    Algorithms for determining consistency between two distance calculation algorithms can therefore be called heuristic algorithms for analyzing heuristic algorithms designed to analyze heuristic algorithms.
    In other words, a “triple embedding” appears.

2. Preliminaries: DNA Chains, Their Distance and Statistical Characteristics

The theory presented in the paper related to the analysis of DNA sequences is based on the author’s previous works, among which it is primarily worth noting [1,2,3,4]. The standard concepts and formulas of mathematical statistics are consistent with the monographs [5,6].
Above, we mentioned the triangles formed by the distances between genomes; let us now explain where they come from. We continue the example of chimpanzees and humans but add a third very close species.
For this interesting example, let us consider the three following species: human (H), chimpanzee (C) and bonobo (B). According to biologists,
  • The ancestors of humans and of both of these apes diverged about 7,000,000 years ago;
  • The ancestors of chimpanzees and bonobos diverged about 2,500,000 years ago.
At the same time, the exact values are not particularly important. The only important thing is that the triangles formed by the corresponding three distances should ideally be acute-angled isosceles. Moreover, all of the above must be fulfilled for any three species.
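To see why the ideal triangle is acute-angled isosceles, here is a small sketch under one simplifying assumption that is ours and not stated in the paper: the distance between two species is taken to be proportional to the time since their divergence. Then the H–C and H–B sides are equal and longer than the C–B side. The function name `angles_deg` is illustrative.

```python
import math

def angles_deg(a, b, c):
    """Angles (in degrees) opposite sides a, b, c, via the law of cosines."""
    alpha = math.degrees(math.acos((b * b + c * c - a * a) / (2 * b * c)))
    beta = math.degrees(math.acos((a * a + c * c - b * b) / (2 * a * c)))
    gamma = 180.0 - alpha - beta
    return alpha, beta, gamma

# Assumption (not from the paper): distance is proportional to divergence time.
t_human_apes = 7.0     # million years, human vs. chimpanzee or bonobo
t_chimp_bonobo = 2.5   # million years, chimpanzee vs. bonobo

d_hc = d_hb = t_human_apes  # H-C and H-B share the same divergence time
d_cb = t_chimp_bonobo

alpha, beta, gamma = angles_deg(d_hc, d_hb, d_cb)
is_isosceles = math.isclose(alpha, beta)
is_acute = max(alpha, beta, gamma) < 90.0
print(round(alpha, 1), round(beta, 1), round(gamma, 1), is_isosceles, is_acute)
```

Under this assumption the two equal sides are always the longest ones, so the two equal base angles are each (180° − γ)/2 < 90°, and the triangle is acute for any divergence times.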
Table 1. Some triangles and options for their badness.
Sides ¹,²   | Angles ²    | Bad. (0)   | Bad. (1)   | Bad. (2)   | Bad. (3)   | Bad. (5)
a, b, c     | α, β, γ     | (α − β)/γ  | (α − β)/π  | (α − β)/α  | (a − b)/a  | (a − b)/c
1, 1, 1     | 60, 60, 60  | 0          | 0          | 0          | 0          | 0
5, 5, 4     | 66, 66, 47  | 0          | 0          | 0          | 0          | 0
42, 41, 28  | 72, 68, 39  | 0.10       | 0.04       | 0.05       | 0.02       | 0.04
19, 18, 17  | 66, 60, 55  | 0.11       | 0.07       | 0.09       | 0.05       | 0.06
10, 9, 8    | 72, 59, 50  | 0.26       | 0.14       | 0.18       | 0.10       | 0.13
6, 5, 5     | 74, 53, 53  | 0.39       | 0.23       | 0.28       | 0.17       | 0.20
13, 12, 5   | 90, 67, 23  | 1.00       | 0.25       | 0.25       | 0.08       | 0.20
5, 4, 3     | 90, 53, 37  | 1.00       | 0.41       | 0.41       | 0.20       | 0.33
12, 6, 5    |             | 1.09       |            |            |            |
20, 6, 5    |             | 1.81       |            |            |            |
¹ We round to a whole number of degrees; therefore, the sum may differ from 180. ² a ≥ b ≥ c, α ≥ β ≥ γ.
Thus, we consider distance matrices based on both of these algorithms. Certainly, in reality, the requirement that the triangles formed in these matrices be acute-angled isosceles is violated. Then, for each such triangle, we determine the numerical value of its badness. In the course of the calculations, several variants of such badness are considered. Examples for some specified triangles are shown in Table 1. Below, we use the value of badness denoted by “Bad. (0)”, which we consider to be the most adequate.
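As an illustration of the “Bad. (0)” column of Table 1, the following sketch computes (α − β)/γ from three side lengths using the law of cosines. The function name `badness0` is ours. Triples that violate the triangle inequality (the last two rows of Table 1) are deliberately rejected, because the rule used for them is not spelled out in this section.

```python
import math

def badness0(x, y, z):
    """Bad. (0) = (alpha - beta) / gamma for a triangle with sides x, y, z.

    Sides are sorted so that a >= b >= c, hence alpha >= beta >= gamma.
    """
    a, b, c = sorted((x, y, z), reverse=True)
    if a >= b + c:
        # Degenerate triples are outside the scope of this sketch.
        raise ValueError("sides do not form a proper triangle")
    alpha = math.acos((b * b + c * c - a * a) / (2 * b * c))
    beta = math.acos((a * a + c * c - b * b) / (2 * a * c))
    gamma = math.pi - alpha - beta
    return (alpha - beta) / gamma

for sides in [(1, 1, 1), (5, 5, 4), (10, 9, 8), (6, 5, 5), (5, 4, 3)]:
    print(sides, round(badness0(*sides), 2))
```

The value is 0 exactly when the two largest sides are equal, and it reaches 1 for a right-angled triangle, since then α − β = γ.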
It should be noted that such matrices are used primarily by biologists, in particular in the popular science literature, but are little used in related mathematical modeling tasks. Of course, there are successful exceptions. As one of them, we note the most recent study [7]. Among earlier works, we note [8,9].
The author hopes that the presented work will be considered not only for the use of such distance matrices by biologists, but also, above all, as one of the applications for creating mathematical models and algorithms for working with such matrices.
Based on such matrices, the total badness of all triangles can be considered and it can be argued that algorithms with lower badness values are better than algorithms with higher values. However, such consideration is not the subject of this paper. Here, we consider another natural assumption: two algorithms for determining distances are consistent when the sequences of badness values of their corresponding matrices are ordered in the same way, and the greater the deviation from this common order, the less consistent they are.
Two such sequences of badness values can be compared by applying rank correlation algorithms. As in many statistical experiments, if we obtained a value exceeding, e.g., 0.7, then it could be argued that the considered sequences of badness are consistent, and, therefore, that the two algorithms under consideration are also consistent. However, we do not obtain such values (see details below). We also note in advance that different classical variants for calculating rank correlation give approximately the same results; therefore, the specific algorithm of rank correlation is unimportant.
Now, let us move on to the description of the standard statistical characteristics we use, as well as their small variations. Sometimes, we use “more customary” notation. For example, we do not use the “standard statistical” notation $MXY$ (we write $M\,X \cdot Y$ instead), etc.
The two random variables under consideration are denoted by $X$ and $Y$; their observed realizations are denoted in the same way with the corresponding subscripts, i.e.,
$$X_i \ \text{and} \ Y_i \quad \text{for} \ i = 1, 2, \ldots, N.$$
Firstly, let us formulate the usual definition of correlation. Recall that the pair correlation coefficient can be calculated using the usual formula:
$$R(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \cdot \sigma_Y},$$
where
$$\operatorname{cov}(X, Y) = M\,X \cdot Y - M X \cdot M Y.$$
In our further tables and program fragments, this variant of the coefficient has the number 0.
Secondly, let us formulate a modified Kendall correlation coefficient. For it, we define the number of discrepancies (“entropy coefficient”) as follows: a discrepancy holds if, for a pair $(i, j)$ where $i \neq j$, we have
$$X_i > X_j \quad \text{but} \quad Y_i < Y_j.$$
Let us denote the number of such discrepancies by $\operatorname{entr}(X, Y)$, or simply $E$ in the next formula. We should also note that the correlation calculated in any way between the usual Kendall correlation coefficient and our variant is always equal to 1 (“correlation between correlations”). This is easily obtained by simply considering the formulas.
Since the maximum possible number of such discrepancies is $\frac{N \cdot (N-1)}{2}$, we define the modified Kendall correlation coefficient by
$$1 - \frac{4 \cdot E}{N \cdot (N-1)};$$
this value is equal to $1$ in the case of 0 discrepancies, and is equal to $-1$ in the case of the maximum possible number of discrepancies. In our further tables and program fragments, this variant of the coefficient has the number 2.
Note that we could calculate this coefficient as follows: We define the “entropy coefficient” considered before for each pair of pairs by (1). Then, we calculate the sum of these coefficients and divide the result by the value $\frac{N \cdot (N-1)}{2}$ already used earlier.
Different publications provide different versions of criticism of the Kendall criterion, but the author of the current paper considers the following flaw to be the most important: it does not give very adequate results when there is a large number of coincidences in the values of the considered random variables. Therefore, we also consider the following “very modified” Kendall correlation coefficient.
It is most convenient to consider it as a search over pairs of pairs, as in the last remark. However, unlike in (1), we also use the value $0$ (not only $1$ and $-1$). Specifically, the value $0$ is selected if and only if the values of at least one of the random variables in the considered pairs match. In our further tables and program fragments, this variant of the coefficient has the number 3.
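The two Kendall-type variants described above can be sketched as follows. This is our reading of the verbal definitions, not the author's program: the function names are illustrative, and the straightforward quadratic loops are chosen for clarity rather than speed.

```python
from itertools import combinations

def discrepancies(xs, ys):
    """E: number of index pairs (i, j), i != j, with X_i > X_j but Y_i < Y_j."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and xs[i] > xs[j] and ys[i] < ys[j]
    )

def kendall_modified(xs, ys):
    """Variant 2: 1 - 4E / (N (N - 1))."""
    n = len(xs)
    return 1.0 - 4.0 * discrepancies(xs, ys) / (n * (n - 1))

def kendall_with_zeros(xs, ys):
    """Variant 3 (as read here): per-pair score in {-1, 0, +1}, 0 on any tie."""
    n = len(xs)
    total = 0
    for i, j in combinations(range(n), 2):
        dx, dy = xs[i] - xs[j], ys[i] - ys[j]
        if dx == 0 or dy == 0:
            continue  # a tie in X or in Y contributes 0
        total += 1 if dx * dy > 0 else -1
    return total / (n * (n - 1) / 2)

print(kendall_modified([1, 1, 2], [1, 2, 3]), kendall_with_zeros([1, 1, 2], [1, 2, 3]))
```

In variant 2, a tied pair is never counted as a discrepancy and therefore behaves like an agreeing pair. Variant 3 scores it as 0, so the two variants differ only when coincidences are present.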
Thirdly, the Spearman correlation coefficient is calculated in the usual way, i.e.,
$$\frac{\sum_{i=1}^{n} (x_i - M X) \cdot (y_i - M Y)}{n \cdot \sigma_X \cdot \sigma_Y}.$$
In our further tables and program fragments, this variant of the coefficient has the number 1.
Note in advance that, in Section 5, we will briefly describe and apply another way to calculate the rank correlation where it will have the number 4.
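For reference, variants 0 and 1 can be sketched in plain Python as follows. The usual correlation follows the formula for $R(X, Y)$ given above. For the Spearman coefficient, we assume the standard convention that the same formula is applied to the ranks of the observations, with tied values sharing their average rank. The function names are illustrative.

```python
import math

def pearson(xs, ys):
    """R(X, Y) = cov(X, Y) / (sigma_X * sigma_Y), with cov = M(XY) - MX * MY."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman coefficient: the usual correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

print(pearson([1, 2, 3], [2, 4, 6]), spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only the ranks enter the Spearman coefficient, any strictly increasing transformation of the data leaves it unchanged, which is not true of the usual correlation.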

3. Problem Statement

Continuing what was said in the introduction, we formulate the problem statement. Based on general considerations, our expert assessment is that the Needleman–Wunsch algorithm [10] is much more adequate than the Jaro–Winkler algorithm [11]. However, we do not show this in this paper (many indirect arguments were given in our publications cited above); rather, we show a less significant fact, namely, the inconsistency of these two algorithms.
By using correlation analysis, such inconsistency can be shown by simply calculating the rank correlation for the sequences of the corresponding elements of the two distance matrices, i.e., matrices for the Jaro–Winkler and Needleman–Wunsch algorithms. However, we do not consider such work to be important, for the following three reasons:
  • Firstly, there are not very many such elements. In square matrices of size 32, we have 496 elements located above the main diagonal.
  • Secondly, based on the calculations performed, we came to the conclusion that the results of such a correlation comparison are not very informative. The values of the correlation coefficient (with different methods of calculating it) do not differ much from 0.5 (specifically, they range from 0.38 to 0.61), and this fact, apparently, does not allow us to draw unambiguous conclusions.
  • Thirdly, we are considering a specific task (and not just comparing any two abstract matrices), and, as we noted in Section 2, our matrices must have an important special property, i.e., a small value of badness (we use the Bad. (0) value); moreover, in our case, these badness values should also be consistent between the two matrices.
Thus, we use some more complex comparisons. We will talk about them in the next section.

4. Algorithms, Methods and Some More About the Motivation

In this section, we discuss different opinions about whether the conclusions that can be drawn based on the simplest study of the distances between DNA sequences are correct or not. In particular, we ask (and this is the simplest question, to which there is no definitive answer) whether the initial results themselves expressing the distances between genomes, i.e., the results of the algorithms of Jaro–Winkler and Needleman–Wunsch (as well as several other algorithms not considered in this paper), are correct and adequate. In any case, the most basic conclusion is as follows: the research related to the consideration of algorithms for determining the distances between DNA strands should be continued.
On the one hand, it is worth adding that there are scientific publications where the difference between the genomes of humans and chimpanzees is estimated to be significantly greater than the commonly cited values (see [12] and many others); estimates of this difference reach up to 19% (and this is only in works known to the author; see also some references in [12]). This is explained by the authors as follows: Geneticists allegedly sequenced “small pieces of chimpanzee DNA”, i.e., using conventional chemical laboratory procedures, they determined the sequence of the chemical symbols. Then, these small chains of “symbols” were connected to the human genome in those places where, in their opinion, they should match. After that, the human genome was removed and a chimpanzee pseudo-genome was obtained, which allegedly indicated a common kinship with humans. Thus, a mixed sequence was obtained, which, apparently, is not real. Hence, it is concluded that the real differences are significantly more than 1%.
However, on the other hand, there are arguments for the fact that the resulting distances between genomes are more or less correct, and the grounds for the noticeable difference between humans and their closest relatives lie elsewhere.
According to the general DNA sequence, humans actually stand apart from other hominids. Moreover, this is not according to the “formal set of genes”, but according to their distribution on chromosomes. It is precisely the following factors that apply:
  • Multiple chromosomal aberrations;
  • The deletion of a huge section;
  • The transition of another section to the other chromosome, due to which humans have one pair of autosomes fewer;
  • The reversal of another section.
Most likely, these factors lead to a radical change in the phenotype. We consider the phenotype to be the set of internal and external features, properties and traits of a specific organism. There are some other definitions in the scientific literature.
The latter, i.e., the change in the phenotype, can be described primarily by the following signs (also based on material taken from numerous scientific and popular science publications):
  • The absence of a massive, protruding jaw in humans, and, consequently, a significantly different structure of the oral cavity, which is the most important resonator in speech formation;
  • A significantly different structure of the nose (as well as the larynx);
  • The lack of a fur coat;
  • Walking upright;
  • The altered functioning of the sebaceous and sweat glands;
  • The restructuring of the upper part of the skull;
  • Many other things that distinguish humans from anthropoids in general.
However, as can be understood, all of the above are general reflections at the level of “apparently”, not supported by specific genomic studies. At the same time, it is precisely this “lack of strength”, i.e., the inability to strictly prove the above dependencies, at least in the near future, that suggests the need to continue detailed studies of DNA strands, in particular, to analyze their similarity.
Such tasks remain and they will remain very relevant for a long time. As we said before, the research related to the consideration of algorithms for determining the distances between DNA strands should be continued. In particular, it is possible in future studies to try to algorithmically formulate the grounds for the strong difference between humans and great apes.
Thus, for the considered matrices corresponding to 32 species of monkeys of different genera, we consider two sequences of badness values for all the corresponding triangles (as we said above, there are 4960 of them) and calculate the rank correlation values for them.
We also repeat once again that we do not consider in this paper the issues related to determining which of the algorithms is better; we are considering only a method that can show the inconsistency of these algorithms.
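The construction of the two badness sequences can be sketched as follows. The first small matrix uses the entries for species 1 to 4 of Table 3. The second matrix is purely hypothetical and only stands in for the corresponding Needleman–Wunsch values. The function names are illustrative, and triples that violate the triangle inequality are not handled.

```python
import math
from itertools import combinations

def badness0(x, y, z):
    """Bad. (0) = (alpha - beta) / gamma; assumes a proper triangle."""
    a, b, c = sorted((x, y, z), reverse=True)
    if a >= b + c:
        raise ValueError("sides do not form a proper triangle")
    alpha = math.acos((b * b + c * c - a * a) / (2 * b * c))
    beta = math.acos((a * a + c * c - b * b) / (2 * a * c))
    return (alpha - beta) / (math.pi - alpha - beta)

def badness_sequence(dist):
    """Bad. (0) for every triple i < j < k of a symmetric distance matrix."""
    n = len(dist)
    return [
        badness0(dist[i][j], dist[i][k], dist[j][k])
        for i, j, k in combinations(range(n), 3)
    ]

# Species 1-4 of Table 3 (Jaro-Winkler), written as integers.
matrix_a = [
    [0, 541, 677, 583],
    [541, 0, 635, 387],
    [677, 635, 0, 665],
    [583, 387, 665, 0],
]
# Hypothetical values, NOT taken from the paper.
matrix_b = [
    [0, 30, 41, 35],
    [30, 0, 38, 33],
    [41, 38, 0, 40],
    [35, 33, 40, 0],
]

seq_a = badness_sequence(matrix_a)
seq_b = badness_sequence(matrix_b)
print(len(seq_a), len(seq_b))
```

Both sequences list the triangles in the same order of triples, so element k of one sequence and element k of the other refer to the same three species. This pairing is what allows a rank correlation coefficient to be applied to them.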

5. Description of Computational Experiments and the Results

Firstly, let us list the species of monkeys we are considering (see Table 2). It is important to remark that all the species belong to different genera. Apparently, this fact leads to a more or less successful distribution of the elements of the distance matrix.
After that, we present the distances calculated for the mt DNA of these species in the form of two tables. Everything is considered for two different distance calculation algorithms. Namely, for our article, we have reviewed the algorithms of Jaro–Winkler and Needleman–Wunsch.
Table 3 is the calculated distance matrix for the Jaro–Winkler algorithm. The species numbers correspond to those shown in Table 2. The peculiarity of this algorithm is that it gives very close values for these species; therefore, the three-digit numbers shown in the table are the three decimal digits following 0.0. For instance, 541 means 0.0541.
If it is necessary to verify the algorithms described by us and the calculation results, the values of the table elements can simply be copied from a PDF file and pasted into a computer program. The author can also provide them by e-mail.
Table 3. The matrix obtained by applying the Jaro–Winkler algorithm.
     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
 1 000 541 677 583 592 541 589 536 562 633 465 610 530 370 512 565 545 800 624 640 520 556 548 562 515 570 726 524 511 589 589 540
 2 541 000 635 387 342 369 396 381 386 733 600 686 463 542 409 549 349 722 698 708 515 440 401 543 462 455 681 388 452 464 383 532
 3 677 635 000 665 676 627 668 626 670 714 728 739 666 678 655 777 617 731 744 760 737 661 663 767 692 680 690 646 648 710 661 753
 4 583 387 665 000 334 396 385 384 396 767 630 727 457 579 422 577 383 677 723 733 546 447 411 571 442 434 637 403 474 447 378 568
 5 592 342 676 334 000 384 395 321 397 777 644 736 481 584 433 570 375 672 742 751 554 451 421 579 429 444 650 418 498 453 393 562
 6 541 369 627 396 384 000 401 319 406 706 581 665 455 528 387 526 383 753 676 675 510 458 381 499 481 457 693 320 436 475 400 527
 7 589 396 668 385 395 401 000 397 389 763 630 727 471 580 425 584 392 695 738 741 556 429 346 573 458 451 657 400 488 463 382 573
 8 536 381 626 384 321 319 397 000 400 723 595 691 453 537 396 527 345 724 687 696 518 457 392 534 474 457 685 312 448 470 392 526
 9 562 386 670 396 397 406 389 400 000 747 585 700 462 561 415 565 390 725 703 722 532 448 403 571 467 469 681 409 477 482 327 546
10 633 733 714 767 777 706 763 723 747 000 628 635 706 661 676 699 723 674 653 678 634 720 693 677 767 758 538 712 697 793 775 656
11 465 600 728 630 644 581 630 595 585 628 000 560 584 462 549 535 594 859 568 579 494 596 589 526 636 608 790 582 560 631 639 464
12 610 686 739 727 736 665 727 691 700 635 560 000 673 610 631 601 687 871 379 381 556 688 669 589 724 706 795 667 646 729 731 571
13 530 463 666 457 481 455 471 453 462 706 584 673 000 535 446 467 449 741 665 678 434 391 454 454 414 402 678 463 454 413 461 448
14 370 542 678 579 584 528 580 537 561 661 462 610 535 000 502 566 545 790 614 627 526 545 539 558 578 549 723 511 492 545 582 539
15 512 409 655 422 433 387 425 396 415 676 549 631 446 502 000 515 400 772 630 642 477 478 395 506 510 483 705 390 437 509 426 493
16 565 549 777 577 570 526 584 527 565 699 535 601 467 566 515 000 529 913 580 571 401 484 548 350 483 461 836 528 513 481 589 379
17 545 349 617 383 375 383 392 345 390 723 594 687 449 545 400 529 000 719 684 701 514 442 391 543 462 461 673 376 443 468 387 503
18 800 722 731 677 672 753 695 724 725 674 859 871 741 790 772 913 719 000 871 884 851 708 759 897 664 690 538 759 763 694 709 874
19 624 698 744 723 742 676 738 687 703 653 568 379 665 614 630 580 684 871 000 366 579 701 682 565 734 711 799 668 647 721 729 547
20 640 708 760 733 751 675 741 696 722 678 579 381 678 627 642 571 701 884 366 000 585 717 688 567 752 718 806 679 656 729 739 551
21 520 515 737 546 554 510 556 518 532 634 494 556 434 526 477 401 514 851 579 585 000 446 515 386 469 462 787 508 498 485 549 344
22 556 440 661 447 451 458 429 457 448 720 596 688 391 545 478 484 442 708 701 717 446 000 438 473 377 369 644 465 471 379 451 469
23 548 401 663 411 421 381 346 392 403 693 589 669 454 539 395 548 391 759 682 688 515 438 000 539 492 478 705 380 451 490 416 528
24 562 543 767 571 579 499 573 534 571 677 526 589 454 558 506 350 543 897 565 567 386 473 539 000 503 479 822 522 509 465 569 372
25 515 462 692 442 429 481 458 474 467 767 636 724 414 578 510 483 462 664 734 752 469 377 492 503 000 346 627 484 486 344 467 486
26 570 455 680 434 444 457 451 457 469 758 608 706 402 549 483 461 461 690 711 718 462 369 478 479 346 000 621 460 453 366 451 471
27 726 681 690 637 650 693 657 685 681 538 790 795 678 723 705 836 673 538 799 806 787 644 705 822 627 621 000 694 699 634 663 805
28 524 388 646 403 418 320 400 312 409 712 582 667 463 511 390 528 376 759 668 679 508 465 380 522 484 460 694 000 389 478 409 525
29 511 452 648 474 498 436 488 448 477 697 560 646 454 492 437 513 443 763 647 656 498 471 451 509 486 453 699 389 000 476 488 500
30 589 464 710 447 453 475 463 470 482 793 631 729 413 545 509 481 468 694 721 729 485 379 490 465 344 366 634 478 476 000 466 479
31 589 383 661 378 393 400 382 392 327 775 639 731 461 582 426 589 387 709 729 739 549 451 416 569 467 451 663 409 488 466 000 541
32 540 532 753 568 562 527 573 526 546 656 464 571 448 539 493 379 503 874 547 551 344 469 528 372 486 471 805 525 500 479 541 000
The following Table 4 is the calculated distance matrix for the Needleman–Wunsch algorithm. The species numbers also correspond to those shown in Table 2.
This algorithm gives less close values for these species; therefore, the three-digit numbers shown in the table are the three decimal digits following 0. (not 0.0). For instance, 375 means 0.375. It is important to note that such a tenfold increase in values does not change any of the values of the badness of the triangles we are considering. Indeed, considering the first triangle of Table 3, with the sides 0.0541, 0.0677 and 0.0635, we can say that its badness is exactly equal to the badness of the triangle with the sides 0.541, 0.677 and 0.635.
As follows from the previous material, we can work with Table 3 and Table 4 (they are given after the text of the paper), as well as with any other tables built on the same principle, simply as with tables of integers: the values of badness that we are interested in will be the same.
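The invariance under a common scale factor can be verified directly from the law of cosines. For a triangle with sides $a$, $b$, $c$ and any factor $k > 0$ (here $k = 10$, or $k = 10^4$ when passing to integers), the angle $\alpha$ opposite the side $a$ satisfies:

```latex
\[
\cos\alpha = \frac{b^2 + c^2 - a^2}{2bc}
\quad\Longrightarrow\quad
\frac{(kb)^2 + (kc)^2 - (ka)^2}{2\,(kb)(kc)}
= \frac{k^2\,(b^2 + c^2 - a^2)}{k^2 \cdot 2bc}
= \cos\alpha , \qquad k > 0 .
\]
```

The same holds for $\beta$ and $\gamma$, so every badness variant that depends only on the angles, such as Bad. (0) $= (\alpha - \beta)/\gamma$, is unchanged. The variants built from ratios of sides, such as $(a - b)/a$, are unchanged as well, because $k$ cancels in the ratio.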
In general, all the calculation results are shown in Table 5. The column designations are clear; they are related to the options described above for calculating the rank correlation. The row designations have the following meaning:
  • “Simple” means that the sequences of matrix elements above the main diagonal are compared, while “main” means that the sequences of badness (Bad. (0)) of triangles are compared;
  • “With” (unlike “without”) means that we used normalization before calculations. As usual, by normalization we mean the linear mapping of all the received data into the segment [0, 1].
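The normalization mentioned in the last item can be sketched as a min-max map. We assume here that the smallest value is sent to 0 and the largest to 1, which is the usual choice but is not stated explicitly in the text. The function name `normalize` is illustrative.

```python
def normalize(values):
    """Linearly map values onto [0, 1]: the minimum goes to 0, the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values coincide; the linear map is undefined")
    return [(v - lo) / (hi - lo) for v in values]

sample = [0.0541, 0.0677, 0.0635]  # sides of the first triangle of Table 3
print(normalize(sample))
```

If such a map is applied to the sequences being correlated, the usual correlation coefficient does not change, since it is invariant under increasing linear maps. This agrees with the identical Corr-0 values in the two “main” rows of Table 5.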
In the next section, we will discuss the numerical results obtained and some conclusions.
Table 5. The calculation results.
Option        | Corr-0, Usual | Corr-1, Spearman | Corr-2, Kendall+ | Corr-3, Kendall++
simple, with  | 0.610         | 0.533            | 0.386            | 0.376
main, without | 0.0817        | 0.136            | 0.0742           | 0.0909
main, with    | 0.0817        | 0.136            | 0.139            | 0.0909
At the end of this section, we note that our own method of calculating rank correlation was given in a recently published work [13]; see its detailed description in that paper. This method differs from all “classical” methods and at the same time gives adequate results in all fields of application known to us. The first computational experiments show its good applicability in the subject area considered in this paper. For example, for the same set of genomes, the values of the rank correlation in all three calculation variants (i.e., in the three rows) for the pair of Needleman–Wunsch and Jaro–Winkler algorithms do not exceed 0.14, while the values obtained for the pair of Needleman–Wunsch and Damerau–Levenshtein [14] algorithms are greater than 0.6. However, of course, computational experiments related to this method of calculating the rank correlation should be continued.

6. Discussion and Conclusions

Summarizing some of the thoughts of this paper, we can formulate the following: The reported difference between genomes varies widely across studies, although the vast majority of both scientific and popular scientific papers give a distance between the genomes of humans and chimpanzees in the range from 0.5% to 2% (i.e., the similarity is from 98% to 99.5%). For example, according to [15], the genomes of humans and chimpanzees are “identical by more than 98.5%”, and this statement is very often quoted “as the ultimate truth”. However, for the sake of completeness, in the next section, we give even more detailed reflections on the topic of specific values of genome proximity.
Regarding the various rank correlation algorithms used in this article, we note that very interesting results related to their different ways of calculation (including the method described in [13]) are provided by the following interesting example, specially selected by the author:
1001  1002  1003  1004  1005  1006  2001  2002  2003  2004  2005  2006
1006  1005  1004  1003  1002  1001  2006  2005  2004  2003  2002  2001
(The corresponding elements of the two rows form the pairs of elements.)
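To illustrate how strongly the classical coefficients can disagree on this example, the following sketch evaluates three of them on the two rows above: the usual correlation, the Spearman coefficient (the usual correlation of ranks) and the modified Kendall coefficient $1 - 4E/(N(N-1))$. The code is our own illustration, the function names are ours, and the method of [13] is not reproduced here.

```python
import math

xs = [1001, 1002, 1003, 1004, 1005, 1006, 2001, 2002, 2003, 2004, 2005, 2006]
ys = [1006, 1005, 1004, 1003, 1002, 1001, 2006, 2005, 2004, 2003, 2002, 2001]

def pearson(us, vs):
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / n
    su = math.sqrt(sum((u - mu) ** 2 for u in us) / n)
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs) / n)
    return cov / (su * sv)

def ranks(values):
    """1-based ranks; the example has no ties, so no tie handling is needed."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def kendall_modified(us, vs):
    n = len(us)
    e = sum(1 for i in range(n) for j in range(n)
            if us[i] > us[j] and vs[i] < vs[j])
    return 1.0 - 4.0 * e / (n * (n - 1))

corr0 = pearson(xs, ys)
corr1 = pearson(ranks(xs), ranks(ys))
corr2 = kendall_modified(xs, ys)
print(round(corr0, 5), round(corr1, 3), round(corr2, 3))
```

With these formulas, the usual correlation is almost 1, the Spearman value is about 0.51 and the modified Kendall value is about 0.09. The order is fully reversed inside each group of six elements but preserved between the two groups, and the three coefficients weigh these two effects very differently.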
From our point of view, this example, as well as some other specially selected ones, shows the need to use special algorithms for calculating rank correlation and improving existing ones. Therefore, in the following publications, we propose to return to the consideration of this example and its connection with the methods of calculating rank correlation that we have considered.
Here are some references to biologists’ works that use distance matrices between genomes. The following are recent works that are not related to the study of mammalian mitochondrial DNA: [16,17,18]. However, we have already noted that the application of mathematical methods and the creation of algorithms for analyzing such matrices, including heuristic algorithms, are reflected in a very small number of publications. We have already cited the following: [7,8,9].
Thus, the main work performed is as follows. Based on the above tables of calculation results, we can say that the method we described as “simple” can hardly answer the question we posed about the consistency of the two algorithms: correlation values of approximately 0.5, as a rule, do not say much. However, everything is clarified by a more complex method that examines the rank correlation of the badness of all the triangles under consideration. Whereas the values obtained for the pair of Needleman–Wunsch and Damerau–Levenshtein algorithms are large, the values obtained for the pair of Needleman–Wunsch and Jaro–Winkler algorithms on the same input data are very small (not exceeding 0.14).
Therefore, we think that we have shown the inconsistency between two well-known algorithms for determining the distances between genomes, namely, the algorithms of Jaro–Winkler and Needleman–Wunsch. Moreover, there is an assumption (not yet fully confirmed) that the Needleman–Wunsch algorithm is significantly more adequate than the Jaro–Winkler one.
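The “more complex” method can be sketched roughly as follows: enumerate all triangles of each distance matrix, compute a badness value for each, and compare the two resulting sequences by a rank correlation. This is only an illustration under stated assumptions. The badness function used here, (a − b)/a for sides a ≥ b ≥ c, is a hypothetical stand-in that equals 0 when the two longest sides are equal; it is not necessarily one of the variants defined in Section 2. Kendall's tau-a is likewise an assumed choice of rank correlation. The first small matrix is the upper-left 4 × 4 corner of Table 4, and the second one is made-up toy data.

```python
from itertools import combinations
from math import comb


def badness(d1, d2, d3):
    """Hypothetical badness of a triangle: (a - b) / a for sides a >= b >= c.

    It is 0 when the two longest sides are equal (an isosceles triangle
    whose base is the shortest side).
    """
    a, b, _ = sorted((d1, d2, d3), reverse=True)
    return (a - b) / a if a > 0 else 0.0


def triangle_badness(matrix):
    """Badness of every triangle (i, j, k), i < j < k, of a distance matrix."""
    n = len(matrix)
    return [
        badness(matrix[i][j], matrix[i][k], matrix[j][k])
        for i, j, k in combinations(range(n), 3)
    ]


def kendall_tau(x, y):
    """Kendall tau-a between two sequences of equal length."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / comb(len(x), 2)


# Upper-left 4 x 4 corner of Table 4 (Needleman-Wunsch), values as printed.
nw_corner = [
    [0, 250, 375, 260],
    [250, 0, 293, 184],
    [375, 293, 0, 322],
    [260, 184, 322, 0],
]

# Made-up toy matrix standing in for a second algorithm (NOT data from the paper).
toy_other = [
    [0, 300, 310, 200],
    [300, 0, 250, 305],
    [310, 250, 0, 120],
    [200, 305, 120, 0],
]

bad_nw = triangle_badness(nw_corner)
bad_other = triangle_badness(toy_other)

print("triangles for 32 species:", comb(32, 3))
print("rank correlation of badness:", kendall_tau(bad_nw, bad_other))
```

For 32 species the enumeration yields 4960 triangles per matrix, which matches the number of triangles considered in the paper. A low rank correlation between the two badness sequences is what the paper interprets as inconsistency of the two distance algorithms.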
Here are possible directions for continuing the work described in the paper, including conclusions that can be drawn from its material:
  • We hope to obtain a matrix for all species of monkeys (500 to 850 species, according to various sources); at first, this will be done using algorithms for restoring a partially filled matrix.
  • This task is best carried out with the Needleman–Wunsch algorithm, setting aside the other algorithms described.
  • The author believes that the following task is very important. It consists of examining, for the given distance matrix, all five variants of badness and choosing “the best” of them. In previous papers and in Section 2, it was said that, ideally, this value should be equal to 0. Then, “the best” badness can be obtained by minimizing a linear combination of the considered variants. At the same time, of course, functions such as the identically zero function are pointless to consider. Therefore, in our model, we consider a linear combination of several of the above functions for variants of badness.
  • We hope to continue the consideration of the tasks described in the paper using our algorithm for calculating rank correlation [13], which can be called corr-4.

Funding

This work was partially supported by a grant from the scientific program of Chinese universities, the “Higher Education Stability Support Program” (section “Shenzhen 2022—Science, Technology and Innovation Commission of Shenzhen Municipality”).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work was partially supported by a grant from the scientific program of Chinese universities, the “Higher Education Stability Support Program” (section “Shenzhen 2022—Science, Technology and Innovation Commission of Shenzhen Municipality”). The author also expresses gratitude to post-graduate students Li Jiamian and Mu Jingyuan (Shenzhen MSU–BIT University), who obtained the values of Table 3 and Table 4 by independently implementing these algorithms.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Melnikov, B.; Pivneva, S.; Trifonov, M. Various algorithms, calculating distances of DNA sequences, and some computational recommendations for use such algorithms. CEUR Workshop Proc. 2017, 1902, 43–50.
  2. Melnikov, B.; Melnikova, E.; Pivneva, S.; Trenina, M. An approach to analysis of the similarity of DNA-sequences. CEUR Workshop Proc. 2018, 2212, 67–72.
  3. Melnikov, B.; Zhang, Y.; Chaikovskii, D. An inverse problem for matrix processing: An improved algorithm for restoring the distance matrix for DNA chains. Cybern. Phys. 2022, 11, 217–226.
  4. Melnikov, B.; Chaikovskii, D. On the application of heuristics of the TSP for the task of restoring the DNA matrix. Front. Artif. Intell. Appl. 2024, 385, 36–44.
  5. Lagutin, M. Visual Mathematical Statistics; BINOM. Laboratoriya Znaniy: Moscow, Russia, 2012; 472p. (In Russian)
  6. Wasserman, L. All of Statistics: A Concise Course in Statistical Inference; Springer Science & Business Media: Berlin, Germany, 2013; 442p.
  7. Young, S.; Gilles, J. Use of 3D chaos game representation to quantify DNA sequence similarity with applications for hierarchical clustering. J. Theor. Biol. 2024, 5967, 111972.
  8. Ballester, P.J.; Richards, W.G. Ultrafast shape recognition to search compound databases for similar molecular shapes. J. Comput. Chem. 2007, 28, 1711–1723.
  9. Bodenhofer, U.; Bonatesta, E.; Horejs-Kainrath, C.; Hochreiter, S. Msa: An R package for multiple sequence alignment. Bioinformatics 2015, 31, 3997–3999.
  10. Needleman, S.; Wunsch, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
  11. Winkler, W. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Survey Research Methods Sections, Anaheim, CA, USA, 6–9 August 1990; American Statistical Association: Alexandria, VA, USA, 1990; pp. 354–359.
  12. Cohen, J. Relative differences: The myth of 1%. Science 2007, 316, 1836.
  13. Melnikov, B.; Lysak, T. On some algorithms for comparing models of femtosecond laser radiation propagation in a medium with gold nanorods. Cybern. Phys. 2024, 13, 261–267.
  14. Levenshtein, V. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
  15. Polavarapu, N.; Arora, G.G.; Mittal, V.; McDonald, J. Characterization and potential functional significance of human-chimpanzee large INDEL variation. Mob. DNA 2011, 2, 13.
  16. Sampaio, J.R.; Oliveira, W.D.d.S.; Junior, L.C.d.S.; Nascimento, F.d.S.; Moreira, R.F.C.; Ramos, A.P.d.S.; Santos-Serejo, J.A.d.; Amorim, E.P.; Jesus, R.D.M.d.; Ferreira, C.F. Diversity of Improved Diploids and Commercial Triploids from Musa spp. via Molecular Markers. Curr. Issues Mol. Biol. 2024, 46, 11783–11796.
  17. Memon, J.; Patel, R.; Patel, B.N.; Patel, M.P.; Madariya, R.B.; Patel, J.K.; Kumar, S. Genetic diversity, population structure and association mapping of morphobiochemical traits in castor (Ricinus communis L.) through simple sequence repeat markers. Ind. Crops Prod. 2024, 221, 119348.
  18. Mansueto, L.; Tandayu, E.; Mieog, J.; Garcia-de Heer, L.; Das, R.; Burn, A.; Mauleon, R.; Kretzschmar, T. HASCH—A high-throughput amplicon-based SNP-platform for medicinal cannabis and industrial hemp genotyping applications. BMC Genom. 2024, 25, 818.
Table 2. The considered monkey species in alphabetical order.
No. | Species of Monkeys | No. | Species of Monkeys
1 | Allenopithecus nigroviridis | 17 | Lagothrix lagotricha
2 | Ateles belzebuth | 18 | Leontopithecus rosalia
3 | Brachyteles arachnoides | 19 | Macaca fascicularis
4 | Cacajao calvus | 20 | Macaca fuscata
5 | Callimico goeldii | 21 | Mandrillus leucophaeus
6 | Callithrix jacchus | 22 | Nasalis larvatus
7 | Carlito syrichta | 23 | Nycticebus coucang
8 | Cebuella pygmaea | 24 | Papio anubis
9 | Cephalopachus bancanus | 25 | Presbytis melalophos
10 | Cercocebus atys | 26 | Pygathrix nemaeus
11 | Cercopithecus albogularis | 27 | Rhinopithecus roxellana
12 | Chlorocebus sabaeus | 28 | Saguinus oedipus
13 | Colobus angolensis | 29 | Saimiri boliviensis
14 | Erythrocebus patas | 30 | Semnopithecus entellus
15 | Galago moholi | 31 | Tarsius dentatus
16 | Gorilla gorilla | 32 | Theropithecus gelada
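Table 4. The matrix obtained by applying the Needleman–Wunsch algorithm.
No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32
1 | 000 | 250 | 375 | 260 | 253 | 256 | 283 | 253 | 277 | 156 | 143 | 192 | 197 | 157 | 274 | 216 | 253 | 477 | 206 | 204 | 154 | 187 | 284 | 161 | 188 | 192 | 381 | 263 | 256 | 192 | 281 | 153
2 | 250 | 000 | 293 | 184 | 168 | 168 | 267 | 167 | 265 | 250 | 253 | 287 | 258 | 256 | 264 | 240 | 123 | 473 | 289 | 289 | 246 | 247 | 275 | 254 | 245 | 249 | 473 | 180 | 179 | 251 | 267 | 249
3 | 375 | 293 | 000 | 322 | 323 | 320 | 371 | 320 | 368 | 373 | 377 | 476 | 380 | 375 | 384 | 375 | 286 | 476 | 474 | 476 | 374 | 372 | 383 | 378 | 377 | 376 | 296 | 327 | 329 | 381 | 372 | 374
4 | 260 | 184 | 322 | 000 | 191 | 191 | 271 | 189 | 270 | 258 | 263 | 295 | 264 | 264 | 271 | 258 | 182 | 476 | 297 | 298 | 257 | 258 | 278 | 265 | 258 | 259 | 405 | 196 | 199 | 259 | 270 | 261
5 | 253 | 168 | 323 | 191 | 000 | 146 | 268 | 145 | 265 | 255 | 258 | 289 | 259 | 259 | 272 | 251 | 169 | 474 | 292 | 293 | 253 | 253 | 276 | 260 | 250 | 250 | 472 | 169 | 184 | 251 | 269 | 256
6 | 256 | 168 | 320 | 191 | 146 | 000 | 276 | 091 | 271 | 254 | 254 | 286 | 261 | 259 | 271 | 249 | 165 | 477 | 286 | 287 | 252 | 253 | 274 | 255 | 255 | 255 | 474 | 163 | 180 | 256 | 273 | 253
7 | 283 | 267 | 371 | 272 | 268 | 276 | 000 | 272 | 152 | 283 | 287 | 319 | 286 | 285 | 255 | 279 | 266 | 477 | 319 | 320 | 281 | 277 | 253 | 289 | 276 | 275 | 406 | 273 | 278 | 282 | 177 | 281
8 | 253 | 167 | 320 | 189 | 145 | 091 | 272 | 000 | 266 | 251 | 253 | 286 | 257 | 257 | 269 | 247 | 165 | 474 | 289 | 288 | 251 | 254 | 275 | 256 | 253 | 253 | 471 | 163 | 181 | 255 | 269 | 254
9 | 277 | 265 | 368 | 270 | 266 | 271 | 152 | 266 | 000 | 275 | 277 | 312 | 279 | 276 | 250 | 272 | 263 | 477 | 311 | 313 | 275 | 271 | 250 | 282 | 272 | 271 | 402 | 270 | 273 | 275 | 172 | 275
10 | 156 | 250 | 373 | 258 | 255 | 254 | 283 | 251 | 275 | 000 | 159 | 201 | 202 | 173 | 275 | 212 | 249 | 477 | 191 | 191 | 084 | 190 | 279 | 153 | 191 | 196 | 377 | 260 | 253 | 192 | 279 | 148
11 | 143 | 253 | 377 | 263 | 258 | 254 | 287 | 253 | 277 | 159 | 000 | 174 | 202 | 140 | 273 | 215 | 251 | 480 | 205 | 202 | 153 | 191 | 280 | 162 | 191 | 193 | 384 | 264 | 258 | 194 | 283 | 156
12 | 192 | 287 | 476 | 295 | 289 | 286 | 319 | 286 | 312 | 201 | 174 | 000 | 244 | 193 | 301 | 246 | 285 | 479 | 160 | 157 | 201 | 236 | 312 | 203 | 235 | 234 | 478 | 291 | 287 | 237 | 316 | 200
13 | 197 | 258 | 380 | 264 | 259 | 261 | 286 | 257 | 279 | 202 | 202 | 244 | 000 | 209 | 281 | 227 | 256 | 478 | 246 | 245 | 197 | 167 | 284 | 200 | 176 | 174 | 363 | 266 | 267 | 174 | 283 | 197
14 | 157 | 256 | 375 | 264 | 259 | 259 | 285 | 257 | 276 | 173 | 140 | 193 | 209 | 000 | 278 | 225 | 258 | 478 | 221 | 219 | 169 | 200 | 287 | 179 | 200 | 206 | 378 | 270 | 265 | 207 | 282 | 173
15 | 274 | 264 | 384 | 271 | 272 | 271 | 255 | 269 | 250 | 275 | 273 | 301 | 281 | 278 | 000 | 267 | 266 | 476 | 301 | 301 | 274 | 273 | 202 | 277 | 273 | 273 | 476 | 271 | 275 | 278 | 249 | 272
16 | 216 | 240 | 375 | 258 | 251 | 249 | 279 | 247 | 272 | 212 | 215 | 246 | 227 | 225 | 267 | 000 | 244 | 481 | 245 | 245 | 208 | 217 | 275 | 216 | 219 | 222 | 399 | 254 | 250 | 222 | 275 | 210
17 | 253 | 123 | 286 | 182 | 169 | 165 | 266 | 165 | 263 | 249 | 251 | 285 | 256 | 259 | 266 | 245 | 000 | 473 | 288 | 289 | 247 | 248 | 272 | 252 | 252 | 250 | 407 | 179 | 179 | 251 | 267 | 247
18 | 477 | 472 | 476 | 476 | 474 | 477 | 476 | 474 | 477 | 477 | 480 | 479 | 478 | 478 | 476 | 482 | 473 | 000 | 480 | 481 | 478 | 479 | 477 | 478 | 475 | 476 | 476 | 477 | 477 | 479 | 474 | 479
19 | 206 | 289 | 474 | 297 | 292 | 286 | 319 | 289 | 311 | 191 | 205 | 160 | 246 | 221 | 301 | 245 | 288 | 480 | 000 | 077 | 189 | 234 | 311 | 200 | 237 | 237 | 477 | 296 | 290 | 239 | 317 | 199
20 | 204 | 289 | 475 | 298 | 293 | 287 | 320 | 288 | 313 | 191 | 202 | 157 | 245 | 219 | 301 | 245 | 289 | 481 | 077 | 000 | 189 | 236 | 312 | 196 | 236 | 236 | 478 | 293 | 288 | 241 | 316 | 195
21 | 154 | 246 | 374 | 257 | 253 | 252 | 281 | 251 | 275 | 084 | 153 | 201 | 197 | 169 | 274 | 208 | 247 | 477 | 189 | 189 | 000 | 185 | 281 | 146 | 187 | 190 | 379 | 256 | 253 | 190 | 276 | 141
22 | 187 | 247 | 372 | 258 | 254 | 253 | 277 | 254 | 271 | 190 | 191 | 236 | 167 | 200 | 273 | 217 | 248 | 479 | 234 | 236 | 185 | 000 | 279 | 193 | 142 | 129 | 336 | 264 | 257 | 145 | 271 | 187
23 | 284 | 275 | 383 | 278 | 276 | 274 | 253 | 275 | 250 | 279 | 280 | 312 | 284 | 287 | 202 | 275 | 272 | 477 | 311 | 312 | 281 | 279 | 000 | 287 | 282 | 281 | 476 | 276 | 279 | 284 | 253 | 282
24 | 161 | 254 | 378 | 265 | 260 | 255 | 289 | 256 | 282 | 153 | 162 | 203 | 200 | 179 | 277 | 216 | 252 | 479 | 200 | 196 | 146 | 193 | 286 | 000 | 199 | 197 | 382 | 264 | 260 | 196 | 286 | 095
25 | 188 | 245 | 377 | 258 | 250 | 255 | 276 | 253 | 272 | 191 | 192 | 235 | 176 | 200 | 273 | 219 | 252 | 474 | 237 | 236 | 187 | 142 | 282 | 199 | 000 | 148 | 348 | 267 | 256 | 148 | 272 | 192
26 | 192 | 249 | 376 | 259 | 250 | 255 | 275 | 253 | 271 | 196 | 193 | 234 | 174 | 206 | 273 | 222 | 250 | 477 | 237 | 236 | 190 | 129 | 281 | 197 | 148 | 000 | 339 | 264 | 256 | 153 | 276 | 192
27 | 381 | 473 | 296 | 405 | 472 | 474 | 406 | 471 | 402 | 377 | 384 | 478 | 363 | 378 | 475 | 399 | 407 | 476 | 477 | 478 | 379 | 336 | 476 | 382 | 348 | 339 | 000 | 477 | 471 | 352 | 403 | 380
28 | 263 | 180 | 327 | 196 | 169 | 163 | 273 | 163 | 270 | 260 | 264 | 291 | 266 | 270 | 270 | 254 | 179 | 477 | 296 | 293 | 256 | 264 | 276 | 264 | 267 | 264 | 477 | 000 | 190 | 265 | 273 | 259
29 | 256 | 179 | 329 | 199 | 184 | 180 | 278 | 181 | 273 | 253 | 258 | 286 | 267 | 265 | 275 | 250 | 179 | 477 | 290 | 288 | 253 | 257 | 279 | 260 | 256 | 256 | 472 | 190 | 000 | 261 | 275 | 255
30 | 192 | 251 | 380 | 259 | 251 | 256 | 282 | 254 | 275 | 192 | 194 | 237 | 174 | 207 | 278 | 222 | 251 | 480 | 239 | 241 | 190 | 145 | 284 | 196 | 148 | 153 | 352 | 265 | 261 | 000 | 279 | 195
31 | 281 | 267 | 372 | 270 | 269 | 273 | 177 | 269 | 172 | 280 | 283 | 316 | 283 | 282 | 249 | 275 | 267 | 475 | 317 | 316 | 276 | 272 | 253 | 286 | 272 | 276 | 403 | 273 | 275 | 279 | 000 | 279
32 | 153 | 249 | 374 | 261 | 256 | 253 | 281 | 254 | 275 | 148 | 156 | 200 | 197 | 173 | 272 | 210 | 247 | 479 | 199 | 195 | 141 | 187 | 282 | 095 | 192 | 192 | 380 | 259 | 255 | 195 | 279 | 000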

Share and Cite

MDPI and ACS Style

Melnikov, B. The Inconsistency of the Algorithms of Jaro–Winkler and Needleman–Wunsch Applied to DNA Chain Similarity Results. Mathematics 2025, 13, 263. https://doi.org/10.3390/math13020263
