1. Introduction
Phylogenetic trees serve as graphical representations aimed at depicting potential relationships among various species or groups of organisms over time. Despite their limitations, such as simplified interpretations, phylogenetic trees remain valuable in efforts to understand the diversity of life and find applications across disciplines such as taxonomy, comparative genomics, and evolutionary biology. Many practical tasks, such as tree clustering and classification, involve comparing phylogenetic trees by computing distances between them, which is a crucial part of the process.
Distance measures for phylogenetic trees have been widely developed for trees defined on the same sets of taxa. These include well-known distance measures such as the Robinson–Foulds (RF) distance [1] and its generalizations [2,3,4], triplet and quartet distances [5,6,7], nodal distance [8], geodesic distance [9,10], maximum agreement subtrees [11,12], and path distance [13,14].
Concurrently, there are applied problems that require trees defined on different but overlapping sets of taxa. For instance, phylogenetic tree clustering [15,16,17], supertree construction [16,18,19], the Tree of Life construction [20,21], and phylogenetic database searching [22,23] demand computing distances between such trees. There are also related works on distance measures for phylogenetic trees with different numbers of overlapping leaves (see Table 1).
One approach to determining the distance between two phylogenetic trees defined on different but partially overlapping taxa sets is to prune the trees to their common taxa set. Specifically, in the RF(−) approach, the leaves unique to each tree are removed so that both trees are defined on a common set of taxa, and the classical RF distance between the pruned trees is then computed [24]. This approach is simple and relatively fast to compute. However, the pruning step may discard valuable topological information from both trees.
A more advanced RF-based approach involves adding the non-common leaves of one tree to the other tree being compared. This results in two trees defined on the same set of taxa, i.e., on the union of the original leaf sets of both trees. The methods of the RF(+) approach are described in [24] and developed in [25,26,27,28,29]. The completion-based RF(+) method described in [29] adds leaves to the trees so as to minimize the resulting RF distance. Compared to RF(−), the RF(+) method uses more information about both trees after their completion and admits a wider set of possible distance values.
The Generalized Robinson–Foulds (GRF) distance [4] is another approach that can be applied to two trees not necessarily defined on the same set of taxa. This method utilizes symmetric differences for sets of clades (clusters) from both phylogenetic trees, accounting for both shared and non-shared clades. As a result, the GRF distance incorporates more tree information compared to the classical RF distance and can be computed in linear time. However, it primarily uses topological information and does not consider the length of tree branches.
The distance between trees defined on different but partially overlapping sets of taxa can also be calculated using the Vectorial Tree Distance (VTD) [30]. VTD is a vector whose elements represent the difference in the number of branches at each tree level, starting from the root of the tree. It is based on a tree-alignment technique that maps the branches of one tree to the branches of another tree at the same level. However, VTD does not consider leaf names, is not a metric, and yields a vector rather than the scalar commonly used as a distance value (although the authors suggest a method for converting the vector to a single value).
The geodesic distance in the Billera–Holmes–Vogtmann (BHV) tree space [31] is a distance metric that considers both tree topology and branch lengths. This distance was originally developed for phylogenetic trees with the same set of taxa, but there are extensions of this approach to trees defined on different but overlapping sets of taxa [32,33]. In particular, in [32], the authors introduce the BHV connection cluster, the BHV connection space, and the BHV connection graph. They describe the process of constructing the BHV connection graph to move from lower dimensions to higher dimensions of the BHV tree space, which provides a way to compute distances for trees with different numbers of overlapping leaves. However, this process adds computational complexity to the geodesic distance calculation and makes the method slow for large trees with many non-common leaves.
These distance measures are compared based on their properties, such as the tree information used (e.g., binary nature, branch lengths, topology, and leaf names) and their computational complexity. The results are shown in Table 1.
Table 1.
Comparison of distance measures for trees with non-identical taxa. A distance measure is recognized as a metric if it satisfies the 4 properties of a metric (non-negativity, identity of indiscernibles, symmetry, and triangle inequality), where n is the total number of unique leaves in both trees, is the total number of nodes in the tree, k is the number of maximal subtrees unique to an input tree, is the number of minimal matching of two m-dimensional vectors, is the maximum number of nodes, l is the number of leaves to be added to the tree.
Measure | Metric | Complexity | Non-Binary Trees | Branch Length | Topology | Leaf Names | References |
---|---|---|---|---|---|---|---|
RF(−) | Yes | | No | No | Yes | Yes | [24,28] |
RF(+) | Yes | | No | No | Yes | Yes | [28,29] |
GRF | Yes | | No | No | Yes | Yes | [4] |
VTD | No | | Yes | No | Yes | No | [30] |
BHV | Yes | | No | Yes | Yes | Yes | [32,33] |
A number of research papers have addressed the issue of imputing missing taxa in phylogenetic trees, presenting a variety of techniques for this purpose. Yasui et al. [34] presented an optimization-based method using a mixed-integer non-linear programming model to handle missing pairwise distances in gene trees. This method involves a two-stage optimization process to assign individuals to hypothetical groups and estimate the missing distances. Yoshida [35] introduced a technique utilizing tropical geometry, leveraging tropical polytopes and max-plus algebra. This method projects the incomplete tree onto a constructed tropical polytope to estimate missing data, focusing on equidistant trees. Rabiee and Mirarab [36] developed INSTRAL, which integrates new species into existing trees by minimizing quartet discordance. Mai and Mirarab [37] proposed a method to complete gene trees independently by optimizing the quartet score and introduced quartet subsampling for better accuracy. Mahbub et al. [38] addressed missing data using QT-GILD, a deep learning approach that employs an autoencoder to generate and correct quartets in incomplete gene trees.
Problem statement. The primary problem is to complete phylogenetic trees defined on different but overlapping sets of taxa while preserving the evolutionary relationships and structural integrity of the original trees. The challenge is to maintain the accuracy of evolutionary distances and the topological information of phylogenetic trees despite differences in their original taxon composition.
Our contributions. In this work, we present a new algorithm for completing phylogenetic trees defined on different but overlapping sets of taxa. This algorithm utilizes branch lengths and pairwise distances between leaves of the considered trees. It applies branch adjustment rates, common leaf distances, temporary nodes, and a midnode approach to insert distinctive leaves from one tree into the other, making them defined on the same set of taxa. Based on the common part of both trees (i.e., the set of common leaves), new leaves are inserted into the other tree by exploiting this common information using the leaf distances associated with the common leaves. Specifically, using the branch adjustment rates to scale the branch lengths, the algorithm uses the adjusted distances to find planting points for the distinctive leaves in the other tree associated with the same common leaves. The planting point is chosen as the position for inserting a new leaf (or leaves) with its adjusted terminal branch. Several important properties are formulated for the proposed approach.
The rest of this paper is organized as follows. Section 2.1 provides the necessary notation and preliminary information. Section 2.2 and Section 2.3 outline the new phylogenetic tree completion algorithm, and Section 2.4 provides a practical example. Section 3 presents several properties of the proposed algorithm and discusses their importance.
2. Materials and Methods
2.1. Notation and Preliminaries
In a phylogenetic tree $T$, nodes (or vertices, $v$) represent taxonomic units, such as species. The set of nodes is denoted as $V(T)$. The root node is a special node that represents the most recent common ancestor of all taxa included in the tree. Leaves (or terminal nodes, $l$) of the tree $T$ are nodes that do not have any children (i.e., nodes that are not connected further downstream). These nodes represent the individual species or taxa under study, and the set of leaves of the tree $T$ is denoted as $L(T)$. Internal nodes are nodes that have at least one child and represent ancestral taxa that have given rise to descendant taxa. Edges (or branches) in the phylogenetic tree $T$ represent evolutionary relationships between taxa: each branch connects an ancestral node to one of its direct descendants, showing how species have evolved from their ancestors over time. The set of edges (branches) is denoted as $E(T)$. Branches can have a length associated with them. A terminal branch (or pendant branch) is a branch that ends in a leaf (terminal node) and does not give rise to any further branches or internal nodes. The length of the terminal branch of the leaf $l$ in the tree $T$ is denoted as $b_T(l)$. In this work, rooted phylogenetic trees with labeled branch lengths are considered.
Definition 1 (Distance between leaves). The distance between any two leaves $l_i$ and $l_j$ of the phylogenetic tree $T$ (denoted as $d_T(l_i, l_j)$) is the sum of branch lengths along the unique path from $l_i$ to $l_j$.
The distance $d_T(l_i, l_j)$ can be expressed as follows:
$$d_T(l_i, l_j) = \sum_{e \in P(l_i, l_j)} w(e), \qquad (1)$$
where $e$ represents each branch (edge) along the path $P(l_i, l_j)$ and $w(e)$ is the length of branch $e$.
Similarly, the distance between any two nodes in the tree can be calculated.
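As an illustration of Definition 1, the following minimal sketch sums branch lengths along the path between two leaves of a rooted tree. The parent-map representation, the function names, and the toy tree are assumptions made for this example only; they are not taken from the paper.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: node -> (parent, length of the branch to the parent).
# The root maps to (None, 0.0).
Tree = Dict[str, Tuple[Optional[str], float]]


def distances_to_ancestors(tree: Tree, node: str) -> Dict[str, float]:
    """Cumulative branch length from `node` to itself and to each of its ancestors."""
    out: Dict[str, float] = {}
    dist = 0.0
    cur: Optional[str] = node
    while cur is not None:
        out[cur] = dist
        parent, length = tree[cur]
        dist += length
        cur = parent
    return out


def leaf_distance(tree: Tree, a: str, b: str) -> float:
    """Sum of branch lengths on the unique path between nodes a and b."""
    up_from_a = distances_to_ancestors(tree, a)
    dist_b = 0.0
    cur: Optional[str] = b
    while cur is not None:
        if cur in up_from_a:  # first shared ancestor is the lowest common ancestor
            return up_from_a[cur] + dist_b
        parent, length = tree[cur]
        dist_b += length
        cur = parent
    raise ValueError("the two nodes do not belong to the same tree")


# Toy rooted tree ((A:2, B:3)x:1, D:4)root, chosen only for illustration.
example: Tree = {
    "root": (None, 0.0),
    "x": ("root", 1.0),
    "A": ("x", 2.0),
    "B": ("x", 3.0),
    "D": ("root", 4.0),
}
```

Because the function accepts any pair of nodes, the same sketch also covers the distance between internal nodes mentioned above.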
Definition 2 (Common and distinct leaves). For two phylogenetic trees $T_1$ and $T_2$, the leaves they share are called common leaves. The set of common leaves for trees $T_1$ and $T_2$ is denoted as $CL(T_1, T_2)$. If one tree contains leaves that are not included in the other tree, these leaves are called distinct leaves. The set of distinct leaves for tree $T$ is denoted as $DL(T)$.
Definition 3 (Maximal distinct-leaf subtree). A subtree $S$ of a tree $T$ is called a maximal distinct-leaf subtree if and only if all leaves of $S$ belong to the set $DL(T)$ and no other subtree of $T$ whose leaves all belong to $DL(T)$ has $S$ as a proper subtree.
The branch connecting the root of a maximal distinct-leaf subtree $S$ to its lowest ancestor node in the tree $T$ is called the root branch of that subtree. The length of the root branch of the maximal distinct-leaf subtree $S$ is denoted as $b_T(S)$, by analogy with the terminal branch length of a leaf. The cutback distance between a leaf $l$ and the maximal distinct-leaf subtree $S$ is the distance in $T$ between the leaf $l$ and the lowest ancestor node of the subtree $S$. For a tree $T$, the set containing all its maximal distinct-leaf subtrees and remaining distinct leaves is denoted as $DE(T)$ and can be found using Algorithm 1.
Definition 4 (Branch adjustment rate). Given two phylogenetic trees $T_1$ and $T_2$ defined on different but overlapping taxa and their set of common leaves $CL(T_1, T_2)$, the branch adjustment rate is the ratio of the sum of pairwise (without repetitions) distances between common leaves in one tree to the corresponding sum in the other tree:
$$r_{adj}(T_1) = \frac{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d_{T_1}(l_i, l_j)}{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d_{T_2}(l_i, l_j)}, \qquad (2)$$
where $r_{adj}(T_1)$ is the branch adjustment rate for tree $T_1$ related to tree $T_2$, $l_i, l_j \in CL(T_1, T_2)$, and $m$ is the number of common leaves. The branch adjustment rate is used to adjust the terminal branch lengths for distinct leaves (and subtree branches, if applicable) in the tree completion process.
Definition 5 (Leaf-based adjustment rate). Given two phylogenetic trees $T_1$ and $T_2$ defined on different but overlapping taxa, their set of common leaves $CL(T_1, T_2)$, and a common leaf $l_i \in CL(T_1, T_2)$, the leaf-based adjustment rate is defined by the following equation:
$$r_{adj}(T_1, l_i) = \frac{\sum_{j \neq i} d_{T_1}(l_i, l_j)}{\sum_{j \neq i} d_{T_2}(l_i, l_j)}, \qquad (3)$$
where $r_{adj}(T_1, l_i)$ is the $l_i$-based adjustment rate for tree $T_1$ related to tree $T_2$ and $l_j \in CL(T_1, T_2)$. Each leaf-based adjustment rate is calculated based on one common leaf relative to the other common leaves in the considered trees.
Definition 6 (Midnode). In the context of a phylogenetic tree $T$, the midnode between any two connected nodes $v_i$ and $v_j$ refers to a specific point along the path that connects these nodes, such that this point divides the total branch length of the path into two equal halves.
Formally, if $P(v_i, v_j)$ represents the unique path between $v_i$ and $v_j$, then the midnode $M$ is defined as the point on $P(v_i, v_j)$ such that
$$d_T(v_i, M) = d_T(M, v_j) = \frac{1}{2}\, d_T(v_i, v_j). \qquad (4)$$
It is essential to note that the midnode represents a calculated position that may not coincide with the pre-existing nodes in the tree. This definition presupposes the existence of a unique path between any two nodes in T, a fundamental characteristic of phylogenetic trees, ensuring that each pair of nodes is connected by exactly one path. The midnode approach is employed in the tree completion process and can be found using Algorithm 2.
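The midnode of Definition 6 can be located by walking along the path and interpolating inside the branch where half of the total length is reached. The sketch below assumes the path is already known and is passed as a list of consecutive nodes with the branch lengths between them; this input format and the function name are illustrative choices, not the paper's Algorithm 2 or 3.

```python
from typing import List, Tuple


def midnode_on_path(nodes: List[str], lengths: List[float]) -> Tuple[str, str, float]:
    """Locate the midpoint of a path.

    nodes      consecutive nodes on the path from v_i to v_j
    lengths    lengths[k] is the branch length between nodes[k] and nodes[k + 1]
    Returns (u, v, offset): the midnode lies on branch (u, v), `offset` away from u.
    """
    if len(nodes) < 2 or len(lengths) != len(nodes) - 1:
        raise ValueError("need at least two nodes and one length per branch")
    half = sum(lengths) / 2.0
    walked = 0.0
    for k, length in enumerate(lengths):
        if walked + length >= half:
            # The half distance is reached inside (or at the end of) this branch.
            return nodes[k], nodes[k + 1], half - walked
        walked += length
    # Only reachable through floating-point rounding: fall back to the last node.
    return nodes[-2], nodes[-1], lengths[-1]


# Path A - x - root - D with branch lengths 2, 1 and 4 (total 7, half 3.5).
mid = midnode_on_path(["A", "x", "root", "D"], [2.0, 1.0, 4.0])
```

When the half distance falls exactly on an existing node, the returned offset equals the full length of the branch, so the midnode coincides with that node.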
2.2. Distance Measure
The task of comparing phylogenetic trees with different but overlapping sets of taxa can be formulated as calculating the distance between them after completing the trees on the union of their taxa sets. Given two trees, $T_1$ defined on $L(T_1)$ and $T_2$ defined on $L(T_2)$, and the set of their common leaves $CL(T_1, T_2)$ containing at least two elements, the tree completion process produces completed trees $T_1'$ and $T_2'$ defined on $L(T_1) \cup L(T_2)$. The distance between the completed trees is then calculated using the Branch Score Distance, denoted as $BSD$, which utilizes the difference in distances between the corresponding tree leaves [13]. The formula for the $BSD$ distance is as follows:
$$BSD(T_1', T_2') = \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( d_{T_1'}(l_i, l_j) - d_{T_2'}(l_i, l_j) \right)^2}, \qquad (5)$$
where $l_i, l_j \in L(T_1) \cup L(T_2)$ and $N$ is the size of the set $L(T_1) \cup L(T_2)$.
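The distance computation can be sketched as follows, assuming that the pairwise leaf distances of both completed trees are already available and that the distance is the unnormalized square root of the summed squared differences described in the text. The dictionary layout, the function name, and the numbers are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical layout: an unordered leaf pair -> path distance in one completed tree.
Dist = Dict[FrozenSet[str], float]


def completed_tree_distance(leaves: Iterable[str], d1: Dist, d2: Dist) -> float:
    """Square root of the summed squared differences of leaf-pair distances."""
    total = 0.0
    for a, b in combinations(sorted(leaves), 2):  # each unordered pair once
        pair = frozenset((a, b))
        total += (d1[pair] - d2[pair]) ** 2
    return math.sqrt(total)


# Invented distances for two trees on the same three leaves.
d_first: Dist = {frozenset("AB"): 5.0, frozenset("AD"): 7.0, frozenset("BD"): 8.0}
d_second: Dist = {frozenset("AB"): 5.0, frozenset("AD"): 4.0, frozenset("BD"): 4.0}
value = completed_tree_distance("ABD", d_first, d_second)
```

In this toy case the pairwise differences are 0, 3, and 4, so the distance is 5.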
2.3. Tree Completion Algorithm
A novel tree completion algorithm based on the concepts of common leaf distances, adjustment rates, and midnodes is described in this subsection. The algorithm is a step-by-step procedure designed to meet the objectives outlined in the problem statement. Specifically, it seeks to preserve evolutionary information by scaling branch lengths with the calculated rates, aiming to maintain the proportional evolutionary distances present in the original trees. Additionally, the use of common leaves and the distances between them allows leaf insertion to be guided by the information shared by the trees being compared, so that insertions are consistent with the relationships supported by both trees.
Let $T_1$ and $T_2$ be phylogenetic trees defined on different but overlapping sets of taxa. The phylogenetic tree completion algorithm includes the following main steps.
The first step consists of finding the common leaves $CL(T_1, T_2)$, the distinct leaves, and the maximal distinct-leaf subtrees (sets $DE(T_1)$ and $DE(T_2)$). Finding the maximal distinct-leaf subtrees and the remaining single distinct leaves in a phylogenetic tree $T$ can be accomplished using Algorithm 1.
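The first step can be illustrated with a short sketch. It is not a transcription of Algorithm 1 (which is available only as a figure); it is one possible recursive formulation, assuming each tree is stored as a hypothetical map from a node to its list of children, with invented internal node names.

```python
from typing import Dict, List, Set

# Hypothetical representation: node -> list of children (leaves map to []).
Children = Dict[str, List[str]]


def leaves_of(tree: Children) -> Set[str]:
    """All nodes without children."""
    return {node for node, kids in tree.items() if not kids}


def distinct_elements(tree: Children, root: str, common: Set[str]) -> List[str]:
    """Roots of maximal subtrees whose leaves are all distinct.

    A single distinct leaf is reported as its own (one-node) subtree.
    """
    found: List[str] = []

    def all_distinct(node: str) -> bool:
        kids = tree[node]
        if not kids:
            return node not in common
        flags = [all_distinct(kid) for kid in kids]
        if all(flags):
            return True  # the parent decides whether this subtree is maximal
        # Mixed node: each fully distinct child subtree cannot be extended further.
        found.extend(kid for kid, flag in zip(kids, flags) if flag)
        return False

    if all_distinct(root):
        found.append(root)
    return found


# Toy trees sharing the leaves A, B and D.
t1: Children = {"r": ["u", "D"], "u": ["v", "C"], "v": ["A", "B"],
                "A": [], "B": [], "C": [], "D": []}
t2: Children = {"r": ["u", "D"], "u": ["v", "S"], "v": ["A", "B"],
                "S": ["w", "E"], "w": ["G", "F"],
                "A": [], "B": [], "D": [], "E": [], "F": [], "G": []}
common = leaves_of(t1) & leaves_of(t2)
```

For these toy trees, the first tree yields the single distinct leaf C and the second yields the subtree rooted at S, which contains G, F, and E.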
The process continues for each tree $T$ in $\{T_1, T_2\}$. The second step calculates the branch adjustment rates (see Equation (2)) and the leaf-based adjustment rates for each common leaf $l_i \in CL(T_1, T_2)$ (see Equation (3)).
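The second step can be sketched as below. The direction of the ratio (distances in the tree being completed divided by distances in the tree that supplies the new elements) is an assumption inferred from how the rates are used to rescale inserted branches; the function names, the dictionary layout, and the numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical layout: an unordered pair of common leaves -> path distance.
Dist = Dict[FrozenSet[str], float]


def branch_adjustment_rate(common: Iterable[str], d_target: Dist, d_source: Dist) -> float:
    """Ratio of summed pairwise common-leaf distances (target tree over source tree)."""
    pairs = [frozenset(p) for p in combinations(sorted(common), 2)]
    return sum(d_target[p] for p in pairs) / sum(d_source[p] for p in pairs)


def leaf_adjustment_rate(leaf: str, common: Iterable[str],
                         d_target: Dist, d_source: Dist) -> float:
    """Same ratio, restricted to the pairs that contain `leaf`."""
    pairs = [frozenset((leaf, other)) for other in common if other != leaf]
    return sum(d_target[p] for p in pairs) / sum(d_source[p] for p in pairs)


# Invented distances between the common leaves A, B and D in two trees.
d_target: Dist = {frozenset("AB"): 4.0, frozenset("AD"): 6.0, frozenset("BD"): 6.0}
d_source: Dist = {frozenset("AB"): 2.0, frozenset("AD"): 4.0, frozenset("BD"): 2.0}
rate = branch_adjustment_rate("ABD", d_target, d_source)
rate_b = leaf_adjustment_rate("B", "ABD", d_target, d_source)
```

Both functions divide by a sum of distances, so they rely on at least two common leaves separated by branches of positive length.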
Algorithm 1: Finding distinct elements |
![Symmetry 16 00790 i001]() |
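The third step consists of processing each distinct element $a$ (a distinct leaf or a maximal distinct-leaf subtree) that has to be inserted, performing the following substeps. The formulas below are written for inserting an element $a \in DE(T_2)$ into $T_1$; the other direction is obtained by swapping the indices.
The first substep calculates the new branch length for element $a$ as the current branch length multiplied by the corresponding branch adjustment rate using Equation (6):
$$b'(a) = b_{T_2}(a) \cdot r_{adj}(T_1). \qquad (6)$$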
All branch lengths within maximal distinct-leaf subtrees should be adjusted using the appropriate adjustment rate when inserted into the corresponding tree. It is important to note that the initial branch lengths in the trees should be kept unchanged.
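The second substep calculates the cutback distances between each common leaf $l_i$ and that element, denoted as $d_{cut}(l_i, a)$, using Equation (7):
$$d_{cut}(l_i, a) = d_{T_2}(l_i, p(a)), \qquad (7)$$
where $p(a)$ is the lowest ancestor node of $a$ in the tree that originally contains it (written here as $T_2$).
The third substep involves multiplying these cutback distances by the corresponding leaf-based adjustment rates to obtain distances $d'(l_i, a)$ (see Equation (8)), which are used to find possible positions for adding temporary nodes in the next substep:
$$d'(l_i, a) = d_{cut}(l_i, a) \cdot r_{adj}(T_1, l_i). \qquad (8)$$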
A possible position for a temporary node is determined by traversing the branches of the tree being completed and identifying points whose distance from the common leaf $l_i$ matches the adjusted distance obtained in the third substep. This involves checking each branch and calculating the cumulative distance from the common leaf $l_i$. If the cumulative distance matches the adjusted distance at some point of a branch, that point is considered a possible position for a temporary node. Temporary nodes are auxiliary nodes introduced to facilitate finding insertion points for new leaves in the tree completion process. They do not represent actual biological entities and serve only as placeholders.
The fourth substep consists of adding temporary nodes to the tree being completed in all possible positions (only among the branches that were in the tree initially, not including newly added branches and nodes) at the calculated adjusted distances from each common leaf $l_i$.
The fifth substep involves finding the planting point among temporary nodes (see Algorithms 2 and 3) and inserting the considered distinct element $a$ (leaf or maximal distinct-leaf subtree) with its adjusted branch length at the planting point position.
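The search for temporary-node positions described in the substeps above can be sketched as a traversal that reports every point lying at a given path distance from a common leaf. The undirected adjacency-map representation, the function name, and the tolerance parameter are assumptions made for this illustration; the map is assumed to contain only the branches originally present in the tree.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: node -> {neighbour: branch length}, stored in both directions.
Graph = Dict[str, Dict[str, float]]


def points_at_distance(graph: Graph, start: str, target: float,
                       tol: float = 1e-9) -> List[Tuple[str, str, float]]:
    """All positions at path distance `target` from `start`.

    Each position is returned as (u, v, offset): it lies on branch (u, v),
    `offset` away from u, where u is the endpoint closer to `start`.
    """
    if target <= 0:
        raise ValueError("target distance must be positive")
    found: List[Tuple[str, str, float]] = []
    stack: List[Tuple[str, Optional[str], float]] = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        for nxt, length in graph[node].items():
            if nxt == parent:
                continue  # do not walk back along the branch just used
            if dist + length + tol >= target:
                found.append((node, nxt, target - dist))  # target reached on this branch
            else:
                stack.append((nxt, node, dist + length))  # keep walking away from start
    return found


# Toy tree ((A:2, B:3)x:1, D:4)root stored as an undirected graph.
toy: Graph = {
    "root": {"x": 1.0, "D": 4.0},
    "x": {"root": 1.0, "A": 2.0, "B": 3.0},
    "A": {"x": 2.0},
    "B": {"x": 3.0},
    "D": {"root": 4.0},
}
candidates = points_at_distance(toy, "A", 2.5)
```

A single common leaf can yield several candidate positions, here one on the branch towards B and one on the branch towards the root, which is why a separate rule is needed to choose the planting point among the temporary nodes.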
Algorithm 2: Finding the farthest nodes |
![Symmetry 16 00790 i002]() |
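Since Algorithm 2 is available here only as a figure, the following brute-force sketch merely illustrates the task it names, namely selecting the pair of nodes with the largest pairwise distance. The function name, the callable-based interface, and the one-dimensional toy coordinates are hypothetical and are not claimed to match the authors' procedure.

```python
from itertools import combinations
from typing import Callable, Iterable, Tuple


def farthest_pair(nodes: Iterable[str],
                  dist: Callable[[str, str], float]) -> Tuple[str, str]:
    """Return the pair of nodes with the largest pairwise distance (brute force)."""
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        raise ValueError("at least two nodes are required")
    return max(pairs, key=lambda pair: dist(pair[0], pair[1]))


# Hypothetical temporary nodes placed on a line at the given coordinates.
positions = {"t1": 0.0, "t2": 1.5, "t3": 4.0}
pair = farthest_pair(positions, lambda a, b: abs(positions[a] - positions[b]))
```

In a tree, the `dist` argument would be a path-distance function between temporary nodes; the quadratic number of pairs is acceptable for the small sets of temporary nodes created per inserted element.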
The planting point is identified iteratively from the set of temporary nodes in the phylogenetic tree, the midnode $M$ (see Definition 6), and the pair of farthest nodes (see Algorithm 2), as described by Equation (9).
Algorithm 3: Finding the planting point |
![Symmetry 16 00790 i003]() |
In Algorithm 3, the function TraverseTree traverses the tree from one node towards another to find the exact position of the midnode based on the calculated half distance. Starting at the first of the two nodes with an initial distance of zero, the function iteratively moves to the next node towards the second node while accumulating the distance. If the next step would exceed the half distance, the function interpolates the position between the current node and the next node to pinpoint the precise midnode position. The traversal stops once the cumulative distance equals the half distance, at which point the midnode position is returned.
As a result, two completed phylogenetic trees, $T_1'$ and $T_2'$, are obtained, both defined on the same set of taxa $L(T_1) \cup L(T_2)$. The distance between the completed trees is calculated using Equation (5).
2.4. Example
The following example illustrates the tree completion procedure within the proposed algorithm. Consider the trees $T_1$ and $T_2$ in Figure 1. The tree completion process can start with either $T_1$ or $T_2$ due to the symmetry property (see Proposition 1). In this example, the tree that lacks leaf C is completed first. All subsequent results are rounded to three decimal places.
The process begins with identifying common leaves, distinct leaves, and maximal distinct-leaf subtrees in both trees. The common leaves in both trees are A, B, and D. One tree possesses a single distinct leaf C, whereas the other tree includes distinct leaves G, F, and E, which together form a maximal distinct-leaf subtree, denoted as S.
In order to compute the branch adjustment rate
(see Equation (
2)), the pairwise distances among common leaves, specifically paths (
A–
B), (
A–
D), and (
B–
D), are considered. The result is
.
For common leaves
A,
B, and
D, the leaf-based adjustment rates are calculated as follows (see Equation (
3)):
,
, and
.
Next, the new (adjusted) branch length for the distinct leaf
C is
(see Equation (
6)).
The process continues by inserting the distinct leaf
C with its adjusted terminal branch length (0.725) into the tree
. This is achieved by calculating the cutback distances between each common leaf in tree
and leaf
C as follows (see Equation (
7)):
,
, and
.
Subsequently, the distances for identifying possible locations for temporary nodes in tree
are computed (see Equation (
8)):
,
, and
.
These distances are employed to integrate temporary nodes into tree
at all possible positions (among the original branches) from the same common leaves
A,
B, and
D. A traversal of tree
at a distance of 1.661 from leaf
A identifies temporary nodes
and
. Upon traversing tree
at a distance of 2.150 from leaf
B, the temporary node
was identified. Finally, the temporary node corresponding to common leaf
D at a distance of 0.705 is
. The results are presented in
Figure 2.
The planting point for inserting leaf C into the tree being completed is determined using Algorithm 3 (this point is highlighted in blue in Figure 2b). After this insertion, the completion of this tree is finished, as there are no further distinct leaves to incorporate. The completed tree is shown in Figure 3b.
The branch adjustment rate for the next tree, , is calculated to be . The leaf-based adjustment rates are , , and .
Completion of tree proceeds with the addition of subtree S with its adjusted root branch length b. For subtree S, upon insertion into tree , all branch lengths are adjusted using the rate .
The cutback distances between each common leaf in tree and subtree S are , , and .
The distances for identifying potential locations of temporary nodes in tree are , , and .
Utilizing these distances, temporary nodes in tree
are identified (nodes
,
,
, and
in
Figure 2a). The planting point (indicated in blue in
Figure 2a) is then found using these nodes.
Completed trees $T_1'$ and $T_2'$ defined on the same taxa are shown in Figure 3. Newly added internal nodes and leaves with their adjusted branch lengths are highlighted in blue.
Finally, the distance
between the completed trees
and
is calculated as follows:
3. Results and Discussion
The properties of the described approach are formulated in the form of theorems and propositions.
Theorem 1 (Metric properties). Let $T_1$ and $T_2$ be phylogenetic trees, then the distance $BSD(T_1, T_2)$ is a metric.
Proof. To be a metric, a distance measure has to satisfy the following properties for any three phylogenetic trees $T_1$, $T_2$, and $T_3$: non-negativity, identity of indiscernibles, symmetry, and triangle inequality.
These properties are first demonstrated for the general case where phylogenetic trees are defined on the same set of taxa, denoted as $L$. It is then discussed how the proposed tree completion process affects these properties in order to preserve their metric characteristics. Let $N$ denote the size of the set $L$.
Non-negativity. Each term under the square root is a squared difference, so the sum of squared differences is non-negative. Consequently, given that the square root of a non-negative value is non-negative, it follows that $BSD(T_1, T_2) \geq 0$.
Identity of indiscernibles. To prove this property, it is necessary to consider two cases. If $T_1$ and $T_2$ are identical, then for any pair of leaves in $L$, the distances $d_{T_1}(l_i, l_j)$ and $d_{T_2}(l_i, l_j)$ are equal. Formally, $d_{T_1}(l_i, l_j) = d_{T_2}(l_i, l_j)$ for all $l_i, l_j \in L$. Consequently, each squared difference term in the formula becomes zero. Therefore, the entire summation evaluates to zero, leading to $BSD(T_1, T_2) = 0$.
If $BSD(T_1, T_2) = 0$, then the sum of squared differences is zero. Given that squared differences are always non-negative, this can only occur if each squared difference term is individually zero. Formally, $\left(d_{T_1}(l_i, l_j) - d_{T_2}(l_i, l_j)\right)^2 = 0$ for all $l_i, l_j \in L$. This implies $d_{T_1}(l_i, l_j) = d_{T_2}(l_i, l_j)$ for every pair of leaves $l_i, l_j \in L$, establishing that $T_1$ and $T_2$ are identical in terms of their leaf distances, and hence are identical trees.
Symmetry. The formula for $BSD$ is symmetric, as it considers pairwise differences between $d_{T_1}(l_i, l_j)$ and $d_{T_2}(l_i, l_j)$, and it is known that $(x - y)^2 = (y - x)^2$.
Therefore, $BSD(T_1, T_2) = BSD(T_2, T_1)$.
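Triangle inequality. In order to establish the triangle inequality, it is necessary to demonstrate the following inequality for any triplet of phylogenetic trees denoted as $T_1$, $T_2$, and $T_3$:
$$BSD(T_1, T_3) \leq BSD(T_1, T_2) + BSD(T_2, T_3).$$
$BSD(T_1, T_3)$ can be written as
$$BSD(T_1, T_3) = \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( \left(d_{T_1}(l_i, l_j) - d_{T_2}(l_i, l_j)\right) + \left(d_{T_2}(l_i, l_j) - d_{T_3}(l_i, l_j)\right) \right)^2}.$$
Let $a_{ij} = d_{T_1}(l_i, l_j) - d_{T_2}(l_i, l_j)$ and $b_{ij} = d_{T_2}(l_i, l_j) - d_{T_3}(l_i, l_j)$.
The triangle inequality can be used for the square root of a sum of squares (the Minkowski inequality):
$$\sqrt{\sum_{i<j} \left(a_{ij} + b_{ij}\right)^2} \leq \sqrt{\sum_{i<j} a_{ij}^2} + \sqrt{\sum_{i<j} b_{ij}^2}.$$
Applying this inequality to the double sum above, we have
$$BSD(T_1, T_3) \leq \sqrt{\sum_{i<j} \left(d_{T_1}(l_i, l_j) - d_{T_2}(l_i, l_j)\right)^2} + \sqrt{\sum_{i<j} \left(d_{T_2}(l_i, l_j) - d_{T_3}(l_i, l_j)\right)^2}.$$
The first square root on the right side is $BSD(T_1, T_2)$, and the second square root is $BSD(T_2, T_3)$:
$$BSD(T_1, T_3) \leq BSD(T_1, T_2) + BSD(T_2, T_3).$$
Therefore, the distance function satisfies the triangle inequality.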
Since the distance function satisfies all four properties, it is a metric.
Next, we argue that the distance $BSD(T_1', T_2')$ for the completed trees $T_1'$ and $T_2'$ is still a metric.
Non-negativity. The distance $BSD$ is non-negative by construction, since it is the square root of a sum of squared differences between leaf-pair distances. Adding leaves in the tree completion process only contributes further squared terms, so $BSD(T_1', T_2') \geq 0$ is maintained.
Identity of indiscernibles. The tree completion process preserves the original distances between common leaves while adding new distances in a consistent manner across both trees. As a result, if the trees are identical after completion, implying that all pairwise distances among leaves are equal, then $BSD(T_1', T_2') = 0$. Conversely, if $BSD(T_1', T_2') = 0$, all corresponding distances are equal, including those involving newly added leaves, confirming that the trees are identical.
Symmetry. Symmetry is inherent to the formulation, as it calculates the difference in distances between corresponding leaf pairs in both trees. This symmetry is not affected by the tree completion process, as the method of calculating distances between leaf pairs remains the same, ensuring $BSD(T_1', T_2') = BSD(T_2', T_1')$.
Triangle inequality. The completed trees are defined on the same set of taxa, so the sum of squared differences, including the terms for newly added leaves, has the same form as in the general case and continues to satisfy the triangle inequality. □
The fact that the distance $BSD$ is a metric matters because it guarantees mathematical consistency in phylogenetic comparisons, which makes the results reliable and interpretable.
Theorem 2 (Computational complexity). Let and be phylogenetic trees defined on different but overlapping sets of taxa, and be their corresponding completed versions, , and . The computational complexity of the proposed algorithm for completing the phylogenetic trees and and computing the distance can be estimated as .
Proof. To establish the computational complexity of the proposed approach, it is necessary to analyze the complexity associated with each significant step in the completion and distance computation process.
The computational complexity for determining common and distinct leaves is , utilizing efficient data structures for leaf comparison, such as hash sets.
Assume that the first tree has nodes and edges, and the second tree has nodes and edges. Identifying distinct-leaf subtrees via breadth-first traversal in both and has a complexity of . It is to be noted that this complexity surpasses but remains inferior to .
The calculations of branch adjustment rates and , along with leaf-based adjustment rates and , involve nested loops over the set of common leaves . The number of common leaves is at most n, and for each pair of common leaves, the distances between them are calculated in each initial tree. The time complexity for this step is .
The completion process involves adding temporary nodes and determining the planting point for each element in and . Each insertion involves traversing the trees, which has a complexity of . Since we perform this operation for k elements, the overall complexity for this step is .
The distance calculation between the completed trees and involves nested loops over all pairwise combinations of leaves in the union of the sets of leaves (). For each pair of leaves, distances are calculated in both completed trees. The time complexity for this step can be estimated as .
Combining the complexities of all major steps, the overall computational complexity of the approach is the maximum of these complexities, which is . □
Understanding the computational complexity of the algorithm is important for assessing its feasibility and efficiency, especially in the case of dealing with large datasets. An estimated complexity of indicates that the algorithm is scalable and can handle large phylogenetic trees within a reasonable timeframe, which is important for practical applications in evolutionary biology and comparative genomics.
Proposition 1 (Symmetry in tree completion). Let $T_1$ and $T_2$ be phylogenetic trees defined on different but overlapping sets of taxa. The proposed tree completion algorithm is symmetric with regard to the input trees $T_1$ and $T_2$.
Proof. The symmetry property ensures that interchanging $T_1$ and $T_2$ in the tree completion process does not alter the resulting completed trees $T_1'$ and $T_2'$.
To prove symmetry, it is necessary to demonstrate that the operations performed by the algorithm do not depend on the order of $T_1$ and $T_2$.
The first step involves identifying the common leaves, given by $L(T_1) \cap L(T_2)$. This operation is inherently symmetric because set intersection is commutative, ensuring $L(T_1) \cap L(T_2) = L(T_2) \cap L(T_1)$. Identifying distinct leaves uses the set differences $L(T_1) \setminus L(T_2)$ and $L(T_2) \setminus L(T_1)$. A single set difference is not commutative, but the algorithm always computes both, so the resulting pair of distinct-leaf sets does not depend on the order of the inputs.
Consequently, for each pair of common leaves $l_i, l_j$, the distances are computed within each tree. These distances are used for adjusting branch lengths and determining insertion points. Since distances are symmetric, $d_T(l_i, l_j) = d_T(l_j, l_i)$, this step does not introduce any asymmetry.
The selection of insertion points for distinct elements from $T_1$ into $T_2$, and from $T_2$ into $T_1$, employs temporary nodes and midnodes. This method applies the same criteria regardless of whether the tree is designated as $T_1$ or $T_2$.
Let $f(T_1, T_2)$ denote the function that produces the completed trees $T_1'$ and $T_2'$ through a series of operations $O$ involving the identification of common and distinct leaves, pairwise distance calculations, and distinct element integration. The symmetry of the algorithm implies that $f(T_1, T_2)$ and $f(T_2, T_1)$ produce the same pair of completed trees, $T_1'$ and $T_2'$. This indicates that the operations $O$ are order-independent.
In addition, the distance between the completed trees (see Equation (5)) is symmetric with respect to $T_1'$ and $T_2'$ because the squared differences $\left(d_{T_1'}(l_i, l_j) - d_{T_2'}(l_i, l_j)\right)^2$ are inherently symmetric (see Theorem 1). This ensures that $BSD(T_1', T_2') = BSD(T_2', T_1')$.
Therefore, the symmetry of the proposed tree completion algorithm guarantees that the completed trees $T_1'$ and $T_2'$ are the same regardless of the order in which $T_1$ and $T_2$ are processed. □
Symmetry ensures that the algorithm treats the input trees $T_1$ and $T_2$ equally, without bias towards either tree. This property is important for the consistency of the algorithm, as it guarantees that the outcome does not depend on the order of the input trees.
Proposition 2 (Branch adjustment rates). Let and be phylogenetic trees defined on different but overlapping sets of taxa. The branch adjustment rates and are positive and non-zero. Furthermore, if the common leaves in both trees have identical pairwise distances, then .
Proof. Branch adjustment rates are defined by Equation (2). Since all terms in the numerator and denominator of the equation are non-negative distances, and both sums are non-zero (as each tree has at least one edge between any two distinct common leaves), it follows that each rate is strictly positive.
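If the common leaves have identical pairwise distances in both trees, the two sums in Equation (2) coincide, so each rate equals 1. Now consider the scenario where the summed pairwise distances between common leaves are larger in one tree than in the other. In this case, the rate whose numerator is the larger sum will be greater than 1, indicating that distances in the former tree are generally longer than in the latter, and vice versa. □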
The strict positivity of the branch adjustment rates, along with their equality when common leaves have identical pairwise distances, ensures that rescaled branch lengths remain positive and that no rescaling takes place when the two trees already agree on their common leaves. This property is important for keeping the branch lengths of the completed trees biologically interpretable.
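The behaviour described in Proposition 2 can be sketched as follows. Equation (2) is not reproduced here, so the sketch assumes that a rate is the ratio of the summed pairwise distances between common leaves in the two trees; the function name `adjustment_rate`, the direction of the ratio, and the distance tables are assumptions made for illustration.

```python
from itertools import combinations


def adjustment_rate(dist_from, dist_to, common):
    """Ratio of summed pairwise distances over common leaves (dist_to / dist_from).

    Assumption: the direction of the ratio is illustrative and may differ
    from the definition used in Equation (2).
    """
    pairs = list(combinations(sorted(common), 2))
    total_from = sum(dist_from[p] for p in pairs)
    total_to = sum(dist_to[p] for p in pairs)
    if total_from <= 0 or total_to <= 0:
        raise ValueError("summed distances between common leaves must be positive")
    return total_to / total_from


common = {"A", "B", "C"}
# Hypothetical pairwise distances between the common leaves in each tree.
d1 = {("A", "B"): 2.0, ("A", "C"): 3.0, ("B", "C"): 3.0}
d2 = {("A", "B"): 4.0, ("A", "C"): 6.0, ("B", "C"): 6.0}

rate_1_to_2 = adjustment_rate(d1, d2, common)   # second tree is twice as long
rate_2_to_1 = adjustment_rate(d2, d1, common)   # the opposite direction
rate_identical = adjustment_rate(d1, d1, common)
```

Under this assumed definition, both rates are strictly positive, one exceeds 1 exactly when the other falls below 1, and identical distance tables give a rate of 1.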
Proposition 3 (Preservation of leaf–leaf distances). Let T be a phylogenetic tree, and consider its corresponding completed version. For any two leaves of T, the distance between them is preserved in the completed tree. That is, the distance between the two leaves in the completed tree equals their distance in T.
Proof. By definition, the tree completion algorithm adds distinct elements to the tree T to form the completed tree without modifying the existing structure and distances among the initial leaves in T. This ensures that the paths and distances between any two original leaves are unchanged.
Specifically, when a new leaf or subtree is added, it is appended in such a manner that it does not disrupt the pre-existing paths between any pair of leaves in T.
The planting points for new elements are determined based on the common leaves and the calculated midnodes among temporary nodes, ensuring these new insertions do not shorten or lengthen the original distances between any two common leaves.
The distance between two leaves in the original tree T is defined as the sum of branch lengths along the unique path between them. An insertion can at most subdivide a branch on this path without changing its total length, so the sum of branch lengths, and thus the distance between the two leaves, is the same in the completed tree as in T.
Given that the tree completion algorithm adds new elements in a way that does not affect the distances between the original leaves of T, it follows that for any two leaves of T, the distance between them is preserved in the completed tree. □
This proposition underscores the integrity of the original phylogenetic tree during the completion process. This aspect is important for preserving the biological significance of the tree. The preservation of the original leaf-to-leaf distances, both among the common leaves and between distinct leaves, ensures that the completed tree continues to reflect the evolutionary relationships and distances initially depicted in the original tree T.
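The preservation property can be checked on a small example. The sketch below is not the midnode and temporary node procedure of the proposed algorithm; it assumes a generic insertion that subdivides one existing branch and hangs a new leaf on it. The helper names (`path_length`, `add_edge`, `attach_leaf`), the tree, and the branch lengths are hypothetical.

```python
def path_length(adj, start, goal):
    """Sum of branch lengths on the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, length in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length))
    raise ValueError("nodes are not connected")


def add_edge(adj, u, v, length):
    """Add an undirected weighted edge to the adjacency dictionary."""
    adj.setdefault(u, {})[v] = length
    adj.setdefault(v, {})[u] = length


def attach_leaf(adj, u, v, offset, new_node, new_leaf, pendant_length):
    """Subdivide edge (u, v) at `offset` from u and hang `new_leaf` there."""
    length = adj[u].pop(v)
    adj[v].pop(u)
    if not 0 <= offset <= length:
        raise ValueError("offset must lie on the edge")
    add_edge(adj, u, new_node, offset)
    add_edge(adj, new_node, v, length - offset)  # the two parts sum to `length`
    add_edge(adj, new_node, new_leaf, pendant_length)


# Hypothetical tree: leaves A, B, C joined at internal node X.
tree = {}
add_edge(tree, "A", "X", 1.0)
add_edge(tree, "B", "X", 2.0)
add_edge(tree, "C", "X", 1.5)

before = path_length(tree, "A", "B")
attach_leaf(tree, "X", "B", 0.5, "Y", "D", 0.75)  # insert leaf D on branch X-B
after = path_length(tree, "A", "B")
to_new_leaf = path_length(tree, "A", "D")
```

Because the subdivided branch keeps its total length, the distance between A and B is the same before and after the insertion, while the new leaf D acquires its own distances to the original leaves.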
Proposition 4 (Multifurcation). The proposed distance can be applied to non-binary trees with multifurcations.
Proof. Consider two completed phylogenetic trees, potentially including multifurcations, and defined on the same set of taxa. Internal nodes may bifurcate (in binary trees) or multifurcate (in non-binary trees). The distance between these trees is defined by Equation (5).
For any two leaves of a tree, the path connecting them is unique, due to the acyclic nature of phylogenetic trees. This holds true for both binary and non-binary (multifurcated) structures.
The distance between two leaves is the sum of the lengths of the edges along the path connecting them (see Equation (1)). This calculation depends solely on the path, not on the branching structure at each node, indicating that the presence of multifurcations does not alter the fundamental distance calculation between leaf pairs.
Therefore, the computation of the distance, which aggregates the squared differences of these pairwise distances across the two completed trees, remains valid and meaningful regardless of whether the trees are binary or multifurcating. □
The ability to apply the distance to non-binary trees with multifurcations increases the versatility and applicability of the algorithm. Many real-world phylogenetic trees are not strictly binary, so accommodating multifurcations allows the algorithm to be used in a broader range of scenarios. However, application requires careful consideration of how multifurcations are represented and interpreted within this framework. The utility and accuracy of using the distance in this context will depend on the specific characteristics of the trees being compared and the biological implications of their multifurcating structures.
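The argument that leaf-to-leaf distances depend only on the connecting path can be sketched on a tree containing multifurcations. The parent-pointer representation, the function names `root_path` and `leaf_distance`, and the example tree are assumptions made for illustration.

```python
def root_path(parent, node):
    """Map each ancestor of `node` (and `node` itself) to its distance from `node`."""
    dist = 0.0
    reached = {node: 0.0}
    while node in parent:
        up, length = parent[node]
        dist += length
        reached[up] = dist
        node = up
    return reached


def leaf_distance(parent, a, b):
    """Path length between two leaves via their lowest common ancestor."""
    up_a, up_b = root_path(parent, a), root_path(parent, b)
    # Among shared ancestors, the lowest one minimises the summed distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Hypothetical rooted tree stored as child -> (parent, branch length).
# Root R has three children (A, B, N) and N has three children (C, D, E),
# so both internal nodes are multifurcations.
parent = {
    "A": ("R", 1.0),
    "B": ("R", 2.0),
    "N": ("R", 0.5),
    "C": ("N", 1.0),
    "D": ("N", 1.0),
    "E": ("N", 2.0),
}

d_ab = leaf_distance(parent, "A", "B")
d_ce = leaf_distance(parent, "C", "E")
d_ad = leaf_distance(parent, "A", "D")
```

Nothing in the computation refers to the number of children of a node, which is the point made in the proof of Proposition 4.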
Proposition 5 (Consideration of topology and branch lengths). The proposed tree completion algorithm integrates considerations of both the topology (the arrangement and relationships between nodes) and the branch lengths (quantitative measures of evolutionary distance) of the two original trees to produce the two completed trees.
Proof. The proposed tree completion algorithm inserts new leaves and subtrees at the planting points determined by midnodes and temporary nodes, which are positioned based on the distances to common leaves in the two original trees.
For any new leaf added to a tree, its position is determined by calculating the planting point among the selected temporary nodes, which are based on distances related to common leaves. This process ensures that the topological arrangement of leaves in the completed trees respects the evolutionary relationships inferred from the original trees.
In addition, the tree completion algorithm incorporates maximal distinct-leaf subtrees (if any) from one tree into the other. This process is critical for topology consideration, because these subtrees represent significant evolutionary branches that must be integrated while preserving the phylogenetic relationships.
Branch lengths in the two completed trees are adjusted using the two branch adjustment rates, calculated as the ratio of sums of pairwise distances between common leaves in the two original trees. These rates reflect how branch lengths should be scaled to align the tree branch lengths. For any branch b in either completed tree corresponding to a new insertion, the length is rescaled by the corresponding rate, as given in Equation (6). This ensures that the branch lengths in the completed trees reflect the evolutionary distances measured in the initial trees, adjusted for the context of the completion.
The distance between the two completed trees is calculated by considering the pairwise distances between leaves, which are the sum of branch lengths along the paths connecting the leaves. Although the distance does not directly quantify dissimilarities in tree topology like purely topological measures, it indirectly involves topology through the paths chosen and directly involves branch lengths through the sum of lengths along these paths. The topology involvement is indirect because the distance is calculated as the sum of squared differences in pairwise distances between leaves across the two trees. These pairwise distances are inherently determined by the paths through the tree topology from one leaf to another and the lengths of the branches that compose those paths. Thus, the formula encapsulates both topological and branch length considerations by evaluating the aggregate difference in these pairwise distances between the two completed trees. □
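The way topology and branch lengths both enter the distance can be illustrated on four-leaf trees. Equation (5) is not reproduced here, so the sketch assumes a plain sum of squared differences of pairwise leaf distances, without any normalisation or square root that the actual definition may include. The function name `squared_difference_distance` and the distance tables are hypothetical.

```python
from itertools import combinations


def squared_difference_distance(dist_a, dist_b, leaves):
    """Sum of squared differences of pairwise leaf distances.

    Assumption: no normalisation or square root is applied.
    """
    total = 0.0
    for pair in combinations(sorted(leaves), 2):
        total += (dist_a[pair] - dist_b[pair]) ** 2
    return total


leaves = ["A", "B", "C", "D"]

# Unrooted tree ((A,B),(C,D)) with every branch of length 1.
tree_ab_cd = {("A", "B"): 2.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
              ("B", "C"): 3.0, ("B", "D"): 3.0, ("C", "D"): 2.0}
# Different topology ((A,C),(B,D)), again with unit branch lengths.
tree_ac_bd = {("A", "B"): 3.0, ("A", "C"): 2.0, ("A", "D"): 3.0,
              ("B", "C"): 3.0, ("B", "D"): 2.0, ("C", "D"): 3.0}
# Same topology as the first tree, but the internal branch has length 2.
tree_ab_cd_long = {("A", "B"): 2.0, ("A", "C"): 4.0, ("A", "D"): 4.0,
                   ("B", "C"): 4.0, ("B", "D"): 4.0, ("C", "D"): 2.0}

topology_gap = squared_difference_distance(tree_ab_cd, tree_ac_bd, leaves)
length_gap = squared_difference_distance(tree_ab_cd, tree_ab_cd_long, leaves)
self_gap = squared_difference_distance(tree_ab_cd, tree_ab_cd, leaves)
```

In this example, a change of topology with identical branch lengths and a change of one branch length with identical topology both produce a non-zero value, while a tree compared with itself gives zero.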
Remark 1. The use of branch lengths in the phylogenetic tree distance metric, instead of solely relying on tree topology, provides numerous advantages, especially in terms of biological insights.
Branch lengths in a phylogenetic tree often represent evolutionary time or genetic change. Incorporating branch lengths can provide a more nuanced and accurate picture of the evolutionary relationships between species, reflecting not just how they are related but also how much they have diverged from each other. In addition to this, in cases where branch lengths are considered, the distance metric can better differentiate between trees that are topologically similar but differ significantly in how branches are distributed. This can be crucial in scenarios where slight changes in branch length represent important evolutionary events. Two trees might have a similar structure (topology), but if the lengths of the corresponding branches are significantly different, it indicates a greater evolutionary divergence. Furthermore, trees with branch lengths can be sensitive to more subtle evolutionary changes that are not apparent from topology alone. This sensitivity is critical in studies where small genetic differences are significant, such as in closely related species or in populations of the same species. Finally, using branch lengths can make comparisons between different phylogenetic trees more robust, especially when those trees are derived from different datasets or methods.
Considering the importance of using branch lengths in the distance metric between phylogenetic trees for gaining biological insights, the introduction of branch adjustment rates takes this a step further by ensuring that the evolutionary distances expressed in branch lengths are comparable between different trees. These rates standardize evolutionary rates across disparate datasets and methodologies, making it possible to compare trees more effectively. They adjust branch lengths to a common scale, allowing for meaningful evaluations of evolutionary time and genetic change, which is crucial when different trees may scale these distances differently. This is particularly significant in cases where trees might share similar topologies but differ markedly in their branch distributions, as it allows the distance metric to differentiate between trees that, while structurally similar, differ substantially in their evolutionary paths. Such differentiation is essential in scenarios where even slight variations in branch length can indicate important evolutionary developments. Practical examples of these adjustments include their use in supertree construction, where they help integrate multiple partial trees into a single coherent whole, and in comparative phylogenetics, where they enable more accurate analyses of evolutionary relationships and rates across different organisms or genes, facilitating deeper insights into biodiversity and evolution.