Novel Algorithm for Comparing Phylogenetic Trees with Different but Overlapping Taxa

Koshkarov, Aleksandr; Tahiri, Nadia

doi:10.3390/sym16070790

by Aleksandr Koshkarov and Nadia Tahiri *

Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K 2R1, Canada

* Author to whom correspondence should be addressed.

Symmetry 2024, 16(7), 790; https://doi.org/10.3390/sym16070790

Submission received: 27 May 2024 / Revised: 18 June 2024 / Accepted: 20 June 2024 / Published: 24 June 2024

(This article belongs to the Special Issue Applications of Symmetry in Computational Biology)

Abstract

Comparing phylogenetic trees is a prominent problem that arises in applications such as clustering and building the Tree of Life. While there are many well-developed distance measures for phylogenetic trees defined on the same set of taxa, far fewer exist for trees defined on different but mutually overlapping sets of taxa. This paper presents a new polynomial-time algorithm for completing phylogenetic trees and computing the distance between trees defined on different but overlapping sets of taxa. This novel approach considers both the branch lengths and the topology of the phylogenetic trees being compared. We demonstrate that the distance measure applied to completed trees is a metric and provide several properties of the new method, including its symmetrical nature in tree completion.

Keywords: phylogenetic tree; evolution; tree distance; tree completion

1. Introduction

Phylogenetic trees serve as graphical representations aimed at depicting potential relationships among various species or groups of organisms over time. Despite their limitations, such as simplified interpretations, phylogenetic trees remain valuable in efforts to understand the diversity of life and find applications across disciplines such as taxonomy, comparative genomics, and evolutionary biology. Many practical tasks, such as tree clustering and classification, involve comparing phylogenetic trees by computing distances between them, which is a crucial part of the process.

Distance measures for phylogenetic trees have been widely developed for trees defined on the same sets of taxa. These include well-known distance measures such as the Robinson–Foulds (RF) distance [1] and its generalizations [2,3,4], triplet and quartet distances [5,6,7], nodal distance [8], geodesic distance [9,10], maximum agreement subtrees [11,12], and path distance [13,14].

Concurrently, there are applied problems that require trees defined on different but overlapping sets of taxa. For instance, phylogenetic tree clustering [15,16,17], supertree construction [16,18,19], the Tree of Life construction [20,21], and phylogenetic database searching [22,23] demand computing distances between such trees. There are also related works on distance measures for phylogenetic trees with different numbers of overlapping leaves (see Table 1).

One approach to determine the distance between two phylogenetic trees defined on different but partially overlapping taxa sets is to prune the trees to their common taxa set. Specifically, in the RF(−) approach, the two trees are first defined by their different but overlapping leaves. Subsequently, the unique leaves are removed to make both trees defined on a common set of taxa, and then the classical RF distance between these trees is computed [24]. This approach is known for its simplicity and relatively fast computation time. However, it should be noted that this pruning step may result in the loss of valuable topological information from both trees.

A more advanced RF-based approach involves adding the non-common leaves of one tree to the other tree being compared. This results in two trees defined on the same set of taxa, i.e., on the union of the original leaf sets of both trees. The methods of the RF(+) approach are described in [24] and developed in [25,26,27,28,29]. The completion-based RF(+) method as described in [29] is based on adding leaves to trees under the condition of the minimization of the RF distance criterion. Compared to RF(−), the RF(+) method processes more information about both trees after their completion and includes a wider set of possible distance values.

The Generalized Robinson–Foulds (GRF) distance [4] is another approach that can be applied to two trees not necessarily defined on the same set of taxa. This method utilizes symmetric differences for sets of clades (clusters) from both phylogenetic trees, accounting for both shared and non-shared clades. As a result, the GRF distance incorporates more tree information compared to the classical RF distance and can be computed in linear time. However, it primarily uses topological information and does not consider the length of tree branches.

The distance between trees defined on different but partially overlapping sets of taxa can be calculated using the Vectorial Tree Distance (VTD) [30]. VTD is a vector consisting of elements representing the difference in the number of branches at each tree level, starting from the root of the tree. It is based on a tree-alignment technique that maps the branches of one tree to the branches of another tree at the same level. However, VTD does not consider leaf names, is not a metric, and the final result is a vector rather than the scalar commonly used as a distance value (although the authors suggest a method for how to convert a vector to a single value).

The geodesic distance in the Billera–Holmes–Vogtmann (BHV) tree space [31] is a distance metric that considers both tree topology and branch lengths. This distance was originally developed for phylogenetic trees with the same set of taxa, but there are extensions of this approach to trees defined on different but overlapping sets of taxa [32,33]. In particular, in [32], the authors introduce the BHV connection cluster, the BHV connection space, and the BHV connection graph. They describe the process of constructing the BHV connection graph to move from lower dimensions to higher dimensions of the BHV tree space, which provides a way to compute distances for trees with different numbers of overlapping leaves. However, this process introduces additional computational complexity into the resulting geodesic distance calculation and makes this method slow for large trees with many non-common leaves.

These distance measures are compared based on their properties, such as the tree information used (e.g., binary nature, branch lengths, topology, and leaf names) and their computational complexity. The results are shown in Table 1.

Table 1. Comparison of distance measures for trees with non-identical taxa. A distance measure is recognized as a metric if it satisfies the four properties of a metric (non-negativity, identity of indiscernibles, symmetry, and triangle inequality). In the complexity column, $n$ is the total number of unique leaves in both trees, $n_{1}$ is the total number of nodes in the tree, $k$ is the number of maximal subtrees unique to an input tree, $|M(m)|$ is the number of minimal matchings of two $m$-dimensional vectors, $n_{2}$ is the maximum number of nodes, and $l$ is the number of leaves to be added to the tree.

Measure	Metric	Complexity	Non-Binary Trees	Branch Length	Topology	Leaf Names	References
RF(−)	Yes	$O (n)$	No	No	Yes	Yes	[24,28]
RF(+)	Yes	$O (n_{1} k^{2})$	No	No	Yes	Yes	[28,29]
GRF	Yes	$O (n)$	No	No	Yes	Yes	[4]
VTD	No	$O (| M (m) | \cdot n_{2})$	Yes	No	Yes	No	[30]
BHV	Yes	$O (n^{l + 2})$	No	Yes	Yes	Yes	[32,33]

A number of research papers have addressed the issue of imputing missing taxa in phylogenetic trees, presenting a variety of techniques for this purpose. Yasui et al. [34] presented an optimization-based method using a mixed-integer non-linear programming model to handle missing pairwise distances in gene trees. This method involves a two-stage optimization process to assign individuals to hypothetical groups and estimate the missing distances. Yoshida [35] introduced a technique utilizing tropical geometry, leveraging tropical polytopes and max-plus algebra. This method projects the incomplete tree onto a constructed tropical polytope to estimate missing data, focusing on equidistant trees. Rabiee and Mirarab [36] developed INSTRAL, which integrates new species into existing trees by minimizing quartet discordance. Mai and Mirarab [37] proposed a method to complete gene trees independently by optimizing the quartet score and introduced quartet subsampling for better accuracy. Mahbub et al. [38] addressed missing data using QT-GILD, a deep learning approach that employs an autoencoder to generate and correct quartets in incomplete gene trees.

Problem statement. The primary problem is to complete phylogenetic trees defined on different but overlapping sets of taxa while preserving the evolutionary relationships and structural integrity of the original trees. The challenge is to maintain the accuracy of evolutionary distances and the topological information of phylogenetic trees despite differences in their original taxon composition.

Our contributions. In this work, we present a new algorithm for completing phylogenetic trees defined on different but overlapping sets of taxa. This algorithm utilizes branch lengths and pairwise distances between leaves of the considered trees. It applies branch adjustment rates, common leaf distances, temporary nodes, and a midnode approach to insert distinctive leaves from one tree into the other, making them defined on the same set of taxa. Based on the common part of both trees (i.e., the set of common leaves), new leaves are inserted into the other tree by exploiting this common information using the leaf distances associated with the common leaves. Specifically, using the branch adjustment rates to scale the branch lengths, the algorithm uses the adjusted distances to find planting points for the distinctive leaves in the other tree associated with the same common leaves. The planting point is chosen as the position for inserting a new leaf (or leaves) with its adjusted terminal branch. Several important properties are formulated for the proposed approach.

The rest of this paper is organized as follows. Section 2.1 provides the necessary notation and preliminary information. Section 2.2 and Section 2.3 outline the new phylogenetic tree completion algorithm, and Section 2.4 provides a practical example. Section 3 presents several properties of the proposed algorithm and discusses their importance.

2. Materials and Methods

2.1. Notation and Preliminaries

In the phylogenetic tree $T$, nodes (or vertices, $v$) represent taxonomic units, such as species. The set of nodes is denoted as $V(T)$. The root node is a special node that represents the most recent common ancestor of all taxa included in the tree. Leaves (or terminal nodes, $l$) of the tree $T$ are nodes that do not have any children (i.e., nodes that are not connected further downstream). These nodes represent individual species or taxa under study, and the set of leaves for the tree $T$ is denoted as $L(T)$. Internal nodes are nodes that have at least one child and represent ancestral taxa that have given rise to descendant taxa. Edges (or branches) in the phylogenetic tree $T$ represent evolutionary relationships between taxa; each branch connects two nodes and indicates a common ancestor, showing how species have evolved from their ancestors over time. The set of edges (branches) is denoted as $E(T)$. Branches can have a length associated with them. A terminal branch (or a pendant branch) is a branch that ends in a leaf (terminal node) and does not give rise to any further branches or internal nodes. The length of the terminal branch of the leaf $l$ in the tree $T$ is denoted as $\mathit{br}^{(T)}(l)$. In this work, rooted phylogenetic trees with labeled branch lengths are considered.

Definition 1 (Distance between leaves). The distance between any two leaves $l_{1}$ and $l_{2}$ of the phylogenetic tree $T$ (denoted as $d^{(T)}(l_{1}, l_{2})$) is the sum of branch lengths along the unique path from $l_{1}$ to $l_{2}$.

The distance $d^{(T)}(l_{1}, l_{2})$ can be expressed as follows:

$$d^{(T)}(l_{1}, l_{2}) = \sum_{e \in P^{(T)}(l_{1}, l_{2})} \mathit{length}(e), \tag{1}$$

where $e$ represents each branch (edge) along the path $P^{(T)}(l_{1}, l_{2})$ and $\mathit{length}(e)$ is the length of branch $e$.

Similarly, the distance between any two nodes in the tree can be calculated.
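As a minimal illustration of Equation (1), the sketch below computes the path length between two nodes of a small rooted tree. The parent-pointer representation, the function names, and the toy tree are assumptions made for this example only; they are not taken from the paper.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: node -> (parent, length of the branch to the parent).
# The root maps to (None, 0.0).
Tree = Dict[str, Tuple[Optional[str], float]]


def path_to_root(tree: Tree, node: str) -> Dict[str, float]:
    """Map each ancestor of `node` (and `node` itself) to its distance from `node`."""
    distances = {node: 0.0}
    travelled = 0.0
    parent, length = tree[node]
    while parent is not None:
        travelled += length
        distances[parent] = travelled
        parent, length = tree[parent]
    return distances


def leaf_distance(tree: Tree, a: str, b: str) -> float:
    """Sum of branch lengths on the unique path between a and b (Equation (1))."""
    up_from_a = path_to_root(tree, a)
    travelled = 0.0
    node = b
    # Climb from b until reaching a node that is also an ancestor of a
    # (their lowest common ancestor); the root guarantees termination.
    while node not in up_from_a:
        parent, length = tree[node]
        travelled += length
        node = parent
    return up_from_a[node] + travelled


# Toy tree (assumed values): ((A:0.1, B:0.2)u:0.5, D:0.4)root
toy_tree: Tree = {
    "root": (None, 0.0),
    "u": ("root", 0.5),
    "A": ("u", 0.1),
    "B": ("u", 0.2),
    "D": ("root", 0.4),
}
```

Because the function only follows parent pointers, it works unchanged for internal nodes, which matches the remark that node-to-node distances are computed in the same way.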

Definition 2 (Common and distinct leaves). For two phylogenetic trees $T_{1}$ and $T_{2}$, the leaves they share are called common leaves. The set of common leaves for trees $T_{1}$ and $T_{2}$ is denoted as $\mathit{CL}(T_{1}, T_{2})$. If one tree contains leaves that are not included in the other tree, these leaves are called distinct leaves. The set of distinct leaves for tree $T$ is denoted as $\mathit{DL}(T)$.

Definition 3 (Maximal distinct-leaf subtree). A subtree $S$ of a tree $T$ is called a maximal distinct-leaf subtree if and only if all leaves in the subtree $S$ belong to the set $\mathit{DL}(T)$ and there is no other subtree of $T$ whose leaves all belong to $\mathit{DL}(T)$ and that has $S$ as its proper subtree.

The branch connecting the root of a maximal distinct-leaf subtree $S$ to its lowest ancestor node in the tree $T$ is called the root branch of that subtree. The length of the root branch of the maximal distinct-leaf subtree $S$ is denoted as $\mathit{br}^{(T)}(S)$, similar to how the length of the terminal branch for a leaf is denoted. To calculate the cutback distance between a leaf $l$ and the maximal distinct-leaf subtree $S$, it is necessary to compute the distance in the tree $T$ between the leaf $l$ and the lowest ancestor node of the subtree $S$. For a tree $T$, the set containing all its maximal distinct-leaf subtrees and remaining distinct leaves is denoted as $\mathit{SD}(T)$ and can be found using Algorithm 1.
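The body of Algorithm 1 is not reproduced in this text. As a rough illustration of the idea only, the sketch below collects the roots of maximal subtrees whose leaves are all distinct, together with any remaining single distinct leaves. The child-list representation, the function name, and the toy topology are assumptions for this example and should not be read as the authors' procedure.

```python
from typing import Dict, List, Set

# Hypothetical representation: node -> list of children (leaves have no entry).
Children = Dict[str, List[str]]


def distinct_elements(children: Children, root: str, distinct: Set[str]) -> List[str]:
    """Return the roots of maximal subtrees whose leaves all lie in `distinct`.

    A returned name is either an internal node (root of a maximal
    distinct-leaf subtree) or a single distinct leaf.
    """
    found: List[str] = []

    def only_distinct(node: str) -> bool:
        kids = children.get(node, [])
        if not kids:                      # a leaf
            return node in distinct
        flags = [only_distinct(kid) for kid in kids]  # visit every child
        if all(flags):
            return True                   # may still be extended by an ancestor
        # This node mixes common and distinct leaves, so each fully
        # distinct child below it cannot be extended any further.
        found.extend(kid for kid, flag in zip(kids, flags) if flag)
        return False

    if only_distinct(root):
        found.append(root)
    return found


# Toy topology (assumed): common leaves A, B, D; distinct leaves E, F, G.
toy_children: Children = {
    "root": ["n1", "D"],
    "n1": ["A", "n2"],
    "n2": ["B", "s"],
    "s": ["E", "n3"],
    "n3": ["F", "G"],
}
```

In this toy tree the three distinct leaves hang below the single internal node `s`, so the result contains one subtree rather than three separate leaves.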

Definition 4 (Branch adjustment rate). Given two phylogenetic trees $T_{1}$ and $T_{2}$ defined on different but overlapping taxa and their set of common leaves $\mathit{CL}(T_{1}, T_{2})$, the branch adjustment rate is the ratio of the sum of pairwise distances between common leaves in one tree (each pair counted once) to the corresponding sum in the other tree:

$$r(T_{1}, T_{2}) = \frac{\sum_{i = 1}^{N_{\mathit{CL}} - 1} \sum_{j = i + 1}^{N_{\mathit{CL}}} d^{(T_{1})}(l_{i}, l_{j})}{\sum_{i = 1}^{N_{\mathit{CL}} - 1} \sum_{j = i + 1}^{N_{\mathit{CL}}} d^{(T_{2})}(l_{i}, l_{j})}, \tag{2}$$

where $r(T_{1}, T_{2})$ is the branch adjustment rate for tree $T_{1}$ relative to tree $T_{2}$ and $N_{\mathit{CL}}$ is the number of common leaves in $\mathit{CL}(T_{1}, T_{2})$.

The branch adjustment rate is used to adjust the terminal branch lengths for distinct leaves (and subtree branches, if applicable) in the tree completion process.

Definition 5 (Leaf-based adjustment rate). Given two phylogenetic trees $T_{1}$ and $T_{2}$ defined on different but overlapping taxa, their set of common leaves $\mathit{CL}(T_{1}, T_{2})$, and a common leaf $l_{c} \in \mathit{CL}(T_{1}, T_{2})$, the leaf-based adjustment rate is defined by the following equation:

$$r^{(l_{c})}(T_{1}, T_{2}) = \frac{\sum_{i = 1}^{N_{\mathit{CL}}} d^{(T_{1})}(l_{c}, l_{i})}{\sum_{i = 1}^{N_{\mathit{CL}}} d^{(T_{2})}(l_{c}, l_{i})}, \tag{3}$$

where $r^{(l_{c})}(T_{1}, T_{2})$ is the $l_{c}$-based adjustment rate for tree $T_{1}$ relative to tree $T_{2}$ and $l_{c}, l_{i} \in \mathit{CL}(T_{1}, T_{2})$. Each leaf-based adjustment rate is calculated based on one common leaf relative to the other common leaves in the considered trees.
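The two rates in Equations (2) and (3) can be sketched as follows. The pairwise distances below are assumptions: they were chosen to be consistent with the sums reported in the example of Section 2.4 (5.8 and 2.4 for the pairs A–B, A–D, B–D), since the figure with the actual branch lengths is not reproduced here. The dictionary layout and function names are likewise illustrative.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical layout: unordered leaf pair -> path length between the two leaves.
Distances = Dict[FrozenSet[str], float]


def branch_adjustment_rate(d1: Distances, d2: Distances, common: Iterable[str]) -> float:
    """r(T1, T2) of Equation (2): each unordered pair of common leaves counted once."""
    pairs = [frozenset(pair) for pair in combinations(sorted(common), 2)]
    return sum(d1[p] for p in pairs) / sum(d2[p] for p in pairs)


def leaf_based_rate(d1: Distances, d2: Distances, common: Iterable[str], lc: str) -> float:
    """r^(lc)(T1, T2) of Equation (3).

    The term with l_i = l_c is skipped because d(l_c, l_c) = 0,
    which leaves both sums unchanged.
    """
    others = [leaf for leaf in common if leaf != lc]
    numerator = sum(d1[frozenset((lc, leaf))] for leaf in others)
    denominator = sum(d2[frozenset((lc, leaf))] for leaf in others)
    return numerator / denominator


common_leaves = ["A", "B", "D"]
# Assumed distances, consistent with the sums quoted in Section 2.4.
d_t1: Distances = {frozenset("AB"): 0.3, frozenset("AD"): 1.0, frozenset("BD"): 1.1}
d_t2: Distances = {frozenset("AB"): 2.1, frozenset("AD"): 1.5, frozenset("BD"): 2.2}

rate_t2_t1 = branch_adjustment_rate(d_t2, d_t1, common_leaves)
rate_a = leaf_based_rate(d_t2, d_t1, common_leaves, "A")
```

Note the argument order: the tree whose distances form the numerator comes first, so `branch_adjustment_rate(d_t2, d_t1, ...)` corresponds to $r(T_{2}, T_{1})$.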

Definition 6 (Midnode). In the context of a phylogenetic tree $T$, the midnode between any two connected nodes $v_{1}$ and $v_{2}$ refers to a specific point along the path that connects these nodes, such that this point divides the total branch length of the path into two equal halves.

Formally, if $P^{(T)}(v_{1}, v_{2})$ represents the unique path between $v_{1}$ and $v_{2}$, then the midnode $M$ is defined as the point on $P^{(T)}(v_{1}, v_{2})$ such that

$$d^{(T)}(v_{1}, M) = d^{(T)}(M, v_{2}) = \frac{d^{(T)}(v_{1}, v_{2})}{2}. \tag{4}$$

It is essential to note that the midnode represents a calculated position that may not coincide with the pre-existing nodes in the tree. This definition presupposes the existence of a unique path between any two nodes in $T$, a fundamental characteristic of phylogenetic trees, ensuring that each pair of nodes is connected by exactly one path. The midnode approach is employed in the tree completion process and can be found using Algorithm 2.

2.2. Distance Measure

The task of comparing phylogenetic trees with different but overlapping sets of taxa can be formulated as calculating the distance between them after completing the trees on the union of their taxa sets. Given two trees, $T_{1}$ defined on $L(T_{1})$ and $T_{2}$ defined on $L(T_{2})$, and the set of their common leaves $\mathit{CL}(T_{1}, T_{2})$ containing at least two elements, the tree completion process involves making the trees $T_{1}^{⊎}$ and $T_{2}^{⊎}$ defined on $L(T^{⊎}) = L(T_{1}) \cup L(T_{2})$. The distance between the completed trees is then calculated using the Branch Score Distance, denoted as $\mathit{BSD}(+)$, which utilizes the difference in distances between the corresponding tree leaves [13]. The formula for the $\mathit{BSD}(+)$ distance is as follows:

$$\mathit{BSD}(+)(T_{1}^{⊎}, T_{2}^{⊎}) = \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {\left(d^{(T_{1}^{⊎})}(l_{i}, l_{j}) - d^{(T_{2}^{⊎})}(l_{i}, l_{j})\right)}^{2}}, \tag{5}$$

where $l_{i}, l_{j} \in L(T^{⊎})$ and $N$ is the size of the set $L(T^{⊎})$.
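Equation (5) translates almost directly into code once the pairwise leaf distances of both completed trees are available. The following is a small sketch under that assumption; the dictionary layout, the function name, and the toy values are illustrative and not part of the paper.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical layout: unordered leaf pair -> path length in a completed tree.
Distances = Dict[FrozenSet[str], float]


def bsd_plus(d1: Distances, d2: Distances, leaves: Iterable[str]) -> float:
    """Square root of the summed squared differences of pairwise leaf distances."""
    total = 0.0
    for a, b in combinations(sorted(leaves), 2):  # each unordered pair once
        pair = frozenset((a, b))
        total += (d1[pair] - d2[pair]) ** 2
    return math.sqrt(total)


# Toy distances on three shared leaves (assumed values).
toy_leaves = ["A", "B", "C"]
toy_d1: Distances = {frozenset("AB"): 1.0, frozenset("AC"): 2.0, frozenset("BC"): 3.0}
toy_d2: Distances = {frozenset("AB"): 1.0, frozenset("AC"): 5.0, frozenset("BC"): 7.0}

toy_distance = bsd_plus(toy_d1, toy_d2, toy_leaves)  # sqrt(0 + 9 + 16) = 5.0
```

Both trees must already be defined on the same leaf set, which is exactly what the completion step is meant to guarantee.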

2.3. Tree Completion Algorithm

A novel tree completion algorithm based on the concepts of common leaf distances, adjustment rates, and midnodes is described in this subsection. The proposed algorithm adopts a procedural approach, leveraging domain knowledge and procedural logic to achieve its objectives outlined in the problem statement. Specifically, the algorithm seeks to preserve evolutionary information by adjusting branch lengths with calculated rates, aiming to maintain the proportional evolutionary distances present in the original trees. Additionally, the use of common leaves and the distances between them allows leaf insertion to be guided by the shared information of the trees being compared, ensuring that insertions reflect established phylogenetic relationships.

Let $T_{1}$ and $T_{2}$ be phylogenetic trees defined on different but overlapping sets of taxa. The phylogenetic tree completion algorithm includes the following main steps.

The first step consists of finding common leaves $\mathit{CL}(T_{1}, T_{2})$, distinct leaves, and maximal distinct-leaf subtrees (sets $\mathit{SD}(T_{1})$ and $\mathit{SD}(T_{2})$). Finding the maximal distinct-leaf subtrees and the remaining single distinct leaves in a phylogenetic tree $T$ can be accomplished using Algorithm 1.

The process continues for each tree $T_{i}$ in $\{T_{1}, T_{2}\}$. The second step calculates the branch adjustment rates $r(T_{i}, T_{3 - i})$ (see Equation (2)) and the leaf-based adjustment rates $r^{(l_{c})}(T_{i}, T_{3 - i})$ for each leaf $l_{c} \in \mathit{CL}(T_{1}, T_{2})$ (see Equation (3)).

Algorithm 1: Finding distinct elements

The third step consists of processing each distinct element $a \in \mathit{SD}(T_{3 - i})$, performing the following substeps.

The first substep calculates the new branch length for element $a$ as the current branch length multiplied by the corresponding branch adjustment rate using Equation (6).

$$\mathit{br}^{(T_{i}^{⊎})}(a) = \mathit{br}^{(T_{3 - i})}(a) \cdot r(T_{i}, T_{3 - i}). \tag{6}$$

All branch lengths within maximal distinct-leaf subtrees should be adjusted using the appropriate adjustment rate when inserted into the corresponding tree. It is important to note that the initial branch lengths in the trees should be kept unchanged.

The second substep calculates the cutback distances between each common leaf and that element, denoted as $\mathit{dc}^{(T_{3 - i})}(l_{c}, a)$, using Equation (7).

$$\mathit{dc}^{(T_{3 - i})}(l_{c}, a) = d^{(T_{3 - i})}(l_{c}, a) - \mathit{br}^{(T_{3 - i})}(a). \tag{7}$$

The third substep involves multiplying these cutback distances by the corresponding leaf-based adjustment rates to obtain distances $d_{p}^{(T_{i})}(l_{c}, a)$ (see Equation (8)), which are used to find possible positions for adding temporary nodes in the next substep.

$$d_{p}^{(T_{i})}(l_{c}, a) = \mathit{dc}^{(T_{3 - i})}(l_{c}, a) \cdot r^{(l_{c})}(T_{i}, T_{3 - i}). \tag{8}$$

A possible position for inserting temporary nodes is determined by traversing the branches of tree $T_{i}$ and identifying points where the distance from $l_{c}$ matches $d_{p}^{(T_{i})}(l_{c}, a)$. This involves checking each branch and calculating the cumulative distance from common leaf $l_{c}$. If the calculated cumulative distance matches $d_{p}^{(T_{i})}(l_{c}, a)$, that point is considered a possible position for temporary nodes. Temporary nodes are auxiliary nodes that are introduced to facilitate finding insertion points for new leaves in the tree completion process. These temporary nodes do not represent actual biological entities and serve as placeholders.
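The search for candidate positions described above can be sketched as a traversal that walks away from the common leaf and stops on every branch where the cumulative distance reaches the target. The adjacency representation, the function name, the tolerance, and the toy tree are assumptions made for this illustration; the restriction to the original branches of the tree is assumed to be handled by passing in only those branches.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: undirected weighted tree, node -> {neighbour: branch length}.
Adjacency = Dict[str, Dict[str, float]]


def points_at_distance(tree: Adjacency, start: str, target: float,
                       tol: float = 1e-9) -> List[Tuple[str, str, float]]:
    """Find every point of the tree lying `target` away from `start`.

    Each result (u, w, offset) means: on branch u-w, `offset` away from u,
    where u is the endpoint nearer to `start`.
    """
    found: List[Tuple[str, str, float]] = []
    stack: List[Tuple[str, Optional[str], float]] = [(start, None, 0.0)]
    while stack:
        node, came_from, travelled = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour == came_from:
                continue  # never walk back towards the start
            if travelled + length + tol >= target:
                # The target distance is reached somewhere on this branch.
                found.append((node, neighbour, target - travelled))
            else:
                stack.append((neighbour, node, travelled + length))
    return found


# Toy tree (assumed values): A-u 1.0, u-B 0.5, u-r 2.0, r-D 1.0.
toy_adjacency: Adjacency = {
    "A": {"u": 1.0},
    "B": {"u": 0.5},
    "D": {"r": 1.0},
    "u": {"A": 1.0, "B": 0.5, "r": 2.0},
    "r": {"u": 2.0, "D": 1.0},
}

candidates = points_at_distance(toy_adjacency, "A", 1.2)
```

In the toy tree a distance of 1.2 from A is reached on two different branches, which mirrors how a single common leaf can give rise to more than one temporary node.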

The fourth substep consists of adding temporary nodes in $T_{i}$ in all possible positions (only among the branches that were in the tree initially, not including newly added branches and nodes) at the calculated distances $d_{p}^{(T_{i})}(l_{c}, a)$ from each common leaf $l_{c}$.

The fifth substep involves finding the planting point among temporary nodes (see Algorithms 2 and 3) and inserting the considered distinct element $a$ (leaf or maximal distinct-leaf subtree) with its adjusted branch length $\mathit{br}^{(T_{i}^{⊎})}(a)$ at the planting point position.

Algorithm 2: Finding the farthest nodes

Let $V$ be the set of temporary nodes in the phylogenetic tree, $M$ be the midnode (see Definition 6), and $\{v_{1}, v_{2}\}$ be the pair of farthest nodes (see Algorithm 2). The iterative process of identifying the planting point can be described as follows (Equation (9)).

$$\mathit{PlantingPoint}(V) = \begin{cases} V[0], & \text{if } |V| = 1 \\ \mathit{PlantingPoint}\big((V \cup \{M\}) \setminus \{v_{1}, v_{2}\}\big), & \text{otherwise} \end{cases} \tag{9}$$
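The recursion in Equation (9) can be written as a simple loop: repeatedly replace the two farthest temporary nodes by their midnode until a single node remains. In the sketch below the distance and midnode computations are passed in as functions, because their tree-based versions (Algorithms 2 and 3) are not reproduced here. The function names, the tie-breaking (the first farthest pair found), and the one-dimensional toy usage are assumptions for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # a position in the tree, in whatever encoding the caller uses


def planting_point(nodes: Sequence[P],
                   dist: Callable[[P, P], float],
                   midnode: Callable[[P, P], P]) -> P:
    """Iterative form of Equation (9)."""
    remaining: List[P] = list(nodes)
    if not remaining:
        raise ValueError("at least one temporary node is required")
    while len(remaining) > 1:
        # Pair of farthest nodes among the remaining ones.
        v1, v2 = max(combinations(remaining, 2), key=lambda pair: dist(*pair))
        remaining.remove(v1)
        remaining.remove(v2)
        remaining.append(midnode(v1, v2))  # (V ∪ {M}) \ {v1, v2}
    return remaining[0]


# Toy usage on a path, i.e. a degenerate tree whose positions are coordinates
# along a single line (assumed values).
toy_point = planting_point(
    [0.0, 4.0, 10.0],
    dist=lambda a, b: abs(a - b),
    midnode=lambda a, b: (a + b) / 2.0,
)
```

In the toy run the farthest pair (0, 10) is first replaced by 5, then the pair (4, 5) by 4.5, which is returned as the planting point. Each iteration removes two nodes and adds one, so the loop always terminates.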

Algorithm 3: Finding the planting point

In Algorithm 3, the function TraverseTree traverses the tree from one node towards another to find the exact position of the midnode based on the calculated half distance. Starting at node $v_{1}$ with an initial distance of zero, the function iteratively moves to the next node towards $v_{2}$ while accumulating the distance. If the next step exceeds the half distance, the function interpolates the position between the current node and the next node to pinpoint the precise midnode position. This traversal continues until the cumulative distance equals the half distance, at which point the midnode position is returned.
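The traversal described for TraverseTree can be illustrated on a path that has already been extracted from the tree. This is a sketch under that assumption and not the paper's Algorithm 3: the list-based path encoding, the function name, and the returned triple are hypothetical.

```python
from typing import List, Tuple


def midnode_on_path(nodes: List[str], lengths: List[float]) -> Tuple[str, str, float]:
    """Locate the midnode of the path nodes[0] ... nodes[-1].

    lengths[i] is the branch length between nodes[i] and nodes[i + 1].
    Returns (u, w, offset): the midnode lies on branch u-w, `offset` from u.
    """
    if len(nodes) < 2 or len(lengths) != len(nodes) - 1:
        raise ValueError("need at least two nodes and one length per branch")
    half = sum(lengths) / 2.0
    travelled = 0.0
    for i, length in enumerate(lengths):
        if travelled + length >= half:
            # The next step would reach or pass the half distance:
            # interpolate inside the current branch.
            return nodes[i], nodes[i + 1], half - travelled
        travelled += length
    # Only reachable through floating-point round-off: end of the last branch.
    return nodes[-2], nodes[-1], lengths[-1]


# Toy path (assumed values): v1 -1.0- x -3.0- y -2.0- v2, total length 6.0.
toy_mid = midnode_on_path(["v1", "x", "y", "v2"], [1.0, 3.0, 2.0])
```

For the toy path the half distance is 3.0, so the midnode falls on the branch x–y at 2.0 from x and does not coincide with an existing node, in line with the remark after Definition 6.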

As a result, two completed phylogenetic trees, $T_{1}^{⊎}$ and $T_{2}^{⊎}$, are obtained, both defined on the same set of taxa $L(T^{⊎}) = L(T_{1}) \cup L(T_{2})$. The distance between completed trees is calculated using Equation (5).

2.4. Example

The following example illustrates the tree completion procedure within the proposed algorithm. Consider the trees $T_{1}$ and $T_{2}$ in Figure 1. The tree completion process can start with either $T_{1}$ or $T_{2}$ due to the symmetry property (see Proposition 1). In this example, the tree $T_{2}$ is completed first. All subsequent results are rounded to three decimal places.

The process begins with identifying common leaves, distinct leaves, and maximal distinct-leaf subtrees in both trees. The common leaves in both trees are A, B, and D. Tree $T_{1}$ possesses one distinct leaf C, whereas tree $T_{2}$ includes distinct leaves G, F, and E, which together form a maximal distinct-leaf subtree, denoted as S.

In order to compute the branch adjustment rate $r(T_{2}, T_{1})$ (see Equation (2)), the pairwise distances among common leaves, specifically paths (A–B), (A–D), and (B–D), are considered. The result is $r(T_{2}, T_{1}) = \frac{5.8}{2.4} = 2.417$.

For common leaves A, B, and D, the leaf-based adjustment rates are calculated as follows (see Equation (3)): $r^{(A)}(T_{2}, T_{1}) = \frac{3.6}{1.3} = 2.769$, $r^{(B)}(T_{2}, T_{1}) = \frac{4.3}{1.4} = 3.071$, and $r^{(D)}(T_{2}, T_{1}) = \frac{3.7}{2.1} = 1.762$.

Next, the new (adjusted) branch length for the distinct leaf C is $\mathit{br}^{(T_{2}^{⊎})}(C) = 0.725$ (see Equation (6)).

The process continues by inserting the distinct leaf C with its adjusted terminal branch length (0.725) into the tree $T_{2}$. This is achieved by calculating the cutback distances between each common leaf in tree $T_{1}$ and leaf C as follows (see Equation (7)): $\mathit{dc}^{(T_{1})}(A, C) = 0.6$, $\mathit{dc}^{(T_{1})}(B, C) = 0.7$, and $\mathit{dc}^{(T_{1})}(D, C) = 0.4$.

Subsequently, the distances for identifying possible locations for temporary nodes in tree $T_{2}$ are computed (see Equation (8)): $d_{p}^{(T_{2})}(A, C) = 0.6 \times r^{(A)}(T_{2}, T_{1}) = 1.661$, $d_{p}^{(T_{2})}(B, C) = 0.7 \times r^{(B)}(T_{2}, T_{1}) = 2.150$, and $d_{p}^{(T_{2})}(D, C) = 0.4 \times r^{(D)}(T_{2}, T_{1}) = 0.705$.
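The arithmetic of Equation (8) for leaf C can be reproduced with a few lines of code. The sketch below uses the cutback distances and the leaf-based rates exactly as quoted in the example, that is, with the rates already rounded to three decimal places; the variable and function names are illustrative assumptions.

```python
from typing import Dict

# Values quoted in the example for inserting leaf C into T2.
leaf_rates: Dict[str, float] = {"A": 2.769, "B": 3.071, "D": 1.762}  # r^(lc)(T2, T1), rounded
cutback_to_c: Dict[str, float] = {"A": 0.6, "B": 0.7, "D": 0.4}      # dc^(T1)(lc, C)


def target_distances(cutback: Dict[str, float], rates: Dict[str, float]) -> Dict[str, float]:
    """Equation (8): scale each cutback distance by the matching leaf-based rate."""
    return {leaf: round(cutback[leaf] * rates[leaf], 3) for leaf in cutback}


d_p_for_c = target_distances(cutback_to_c, leaf_rates)
```

Using the unrounded rates instead can change the last digit of a result (for example, $0.6 \times 3.6 / 1.3 \approx 1.6615$), so the rounded rates are used here to stay with the values stated in the text.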

These distances are employed to integrate temporary nodes into tree $T_{2}$ at all possible positions (among the original branches) from the same common leaves A, B, and D. A traversal of tree $T_{2}$ at a distance of 1.661 from leaf A identifies temporary nodes $c_{1}$ and $c_{2}$. A traversal of tree $T_{2}$ at a distance of 2.150 from leaf B identifies the temporary node $c_{3}$. Finally, the temporary node corresponding to common leaf D at a distance of 0.705 is $c_{4}$. The results are presented in Figure 2.

The planting point for leaf C in tree $T_{2}$ is determined using Algorithm 3 (this point is highlighted in blue in Figure 2b). After this insertion, the completion of tree $T_{2}$ is finished, as there are no further distinct leaves to incorporate. The completed tree $T_{2}^{⊎}$ is shown in Figure 3b.

The branch adjustment rate for the next tree, $T_{1}$, is calculated to be $r(T_{1}, T_{2}) = \frac{2.4}{5.8} = 0.414$. The leaf-based adjustment rates are $r^{(A)}(T_{1}, T_{2}) = \frac{1.3}{3.6} = 0.361$, $r^{(B)}(T_{1}, T_{2}) = \frac{1.4}{4.3} = 0.326$, and $r^{(D)}(T_{1}, T_{2}) = \frac{2.1}{3.7} = 0.568$.

Completion of tree $T_{1}$ proceeds with the addition of subtree S with its adjusted root branch length $\mathit{br}^{(T_{1}^{⊎})}(S) = 0.248$. For subtree S, upon insertion into tree $T_{1}$, all branch lengths are adjusted using the rate $r(T_{1}, T_{2}) = 0.414$.

The cutback distances between each common leaf in tree $T_{2}$ and subtree S are $\mathit{dc}^{(T_{2})}(A, S) = 1.6$, $\mathit{dc}^{(T_{2})}(B, S) = 0.5$, and $\mathit{dc}^{(T_{2})}(D, S) = 1.7$.

The distances for identifying potential locations of temporary nodes in tree $T_{1}$ are $d_{p}^{(T_{1})}(A, S) = 1.6 \times r^{(A)}(T_{1}, T_{2}) = 0.578$, $d_{p}^{(T_{1})}(B, S) = 0.5 \times r^{(B)}(T_{1}, T_{2}) = 0.163$, and $d_{p}^{(T_{1})}(D, S) = 1.7 \times r^{(D)}(T_{1}, T_{2}) = 0.966$.

Utilizing these distances, temporary nodes in tree $T_{1}$ are identified (nodes $s_{1}$, $s_{2}$, $s_{3}$, and $s_{4}$ in Figure 2a). The planting point (indicated in blue in Figure 2a) is then found using these nodes.

Completed trees $T_{1}^{⊎}$ and $T_{2}^{⊎}$ defined on the same taxa are shown in Figure 3. Newly added internal nodes and leaves with their adjusted branch lengths are highlighted in blue.

Finally, the distance $\mathit{BSD}(+)$ between the completed trees $T_{1}^{⊎}$ and $T_{2}^{⊎}$ is calculated as follows:

$$\mathit{BSD}(+)(T_{1}^{⊎}, T_{2}^{⊎}) = \sqrt{35.96} = 5.997.$$

3. Results and Discussion

The properties of the described approach are formulated in the form of theorems and propositions.

Theorem 1 (Metric properties). Let $T_{1}$ and $T_{2}$ be phylogenetic trees. Then the $\mathit{BSD}(+)(T_{1}, T_{2})$ distance is a metric.

Proof. To be a metric, a distance measure has to satisfy the following properties for any three phylogenetic trees $T_{1}$, $T_{2}$, and $T_{3}$: non-negativity, identity of indiscernibles, symmetry, and triangle inequality.

These properties are first demonstrated for the general case where phylogenetic trees are defined on the same set of taxa, denoted as $L(T)$. It is then discussed how the proposed tree completion process affects these properties and why the metric characteristics are preserved. Let $N$ denote the size of the set $L(T)$.

Non-negativity. It is evident that squaring each term within the square root guarantees a non-negative sum of squared differences. Consequently, given that the square root of a non-negative value is non-negative, it follows that $\mathit{BSD}(+)(T_{1}, T_{2}) \geq 0$.

Identity of indiscernibles. To prove this property, it is necessary to consider two cases. If $T_{1}$ and $T_{2}$ are identical, then for any pair of leaves $(l_{i}, l_{j})$ in $L(T)$, the distances $d^{(T_{1})}(l_{i}, l_{j})$ and $d^{(T_{2})}(l_{i}, l_{j})$ are equal. Formally, $\forall (l_{i}, l_{j}) \in L(T), d^{(T_{1})}(l_{i}, l_{j}) = d^{(T_{2})}(l_{i}, l_{j})$. Consequently, each squared difference term in the $\mathit{BSD}$ formula becomes zero. Therefore, the entire summation evaluates to zero, leading to $\mathit{BSD}(+)(T_{1}, T_{2}) = 0$.

If $\mathit{BSD}(+)(T_{1}, T_{2}) = 0$, then it implies that the sum of squared differences is zero. Given that squared differences are always non-negative, this can only occur if each squared difference term is individually zero. Formally, $\forall (l_{i}, l_{j}) \in L(T), {(d^{(T_{1})}(l_{i}, l_{j}) - d^{(T_{2})}(l_{i}, l_{j}))}^{2} = 0$. This implies $d^{(T_{1})}(l_{i}, l_{j}) = d^{(T_{2})}(l_{i}, l_{j})$ for every pair of leaves $(l_{i}, l_{j})$, establishing that $T_{1}$ and $T_{2}$ are identical in terms of their leaf distances, and hence are identical trees.

Symmetry. The formula for $\mathit{BSD}(+)$ is symmetric, as it considers pairwise differences between $d^{(T_{1})}(l_{i}, l_{j})$ and $d^{(T_{2})}(l_{i}, l_{j})$, and it is known that $\forall (l_{i}, l_{j}) \in L(T), d^{(T)}(l_{i}, l_{j}) = d^{(T)}(l_{j}, l_{i})$. Moreover, each squared difference is unchanged when $T_{1}$ and $T_{2}$ are swapped, since ${(x - y)}^{2} = {(y - x)}^{2}$.

Therefore, $\mathit{BSD}(+)(T_{1}, T_{2}) = \mathit{BSD}(+)(T_{2}, T_{1})$.

Triangle inequality. In order to establish the triangle inequality, it is necessary to demonstrate the following inequality, for any triplet of phylogenetic trees denoted as

T_{1}

,

T_{2}

, and

T_{3}

:

B S D (+) (T_{1}, T_{3}) \leq B S D (+) (T_{1}, T_{2}) + B S D (+) (T_{2}, T_{3}) .

B S D (+) (T_{1}, T_{3})

can be written as

\begin{matrix} B S D (+) (T_{1}, T_{3}) = \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {(d^{(T_{1})} (l_{i}, l_{j}) - d^{(T_{3})} (l_{i}, l_{j}))}^{2}} \\ = \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {(d^{(T_{1})} (l_{i}, l_{j}) - d^{(T_{2})} (l_{i}, l_{j}) + d^{(T_{2})} (l_{i}, l_{j}) - d^{(T_{3})} (l_{i}, l_{j}))}^{2}} . \end{matrix}

Let

X_{i j} = d^{(T_{1})} (l_{i}, l_{j}) - d^{(T_{2})} (l_{i}, l_{j})

and

Y_{i j} = d^{(T_{2})} (l_{i}, l_{j}) - d^{(T_{3})} (l_{i}, l_{j})

.

The triangle inequality can be used for the square root of a sum of squares:

\sqrt{\sum {(a + b)}^{2}} \leq \sqrt{\sum a^{2}} + \sqrt{\sum b^{2}} .

Applying this inequality to each term in the double sum above, we have

B S D (+) (T_{1}, T_{3}) = \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} {(X_{i j} + Y_{i j})}^{2}} \leq \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} X_{i j}^{2}} + \sqrt{\sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} Y_{i j}^{2}} .

The first square root on the right side is

B S D (+) (T_{1}, T_{2})

, and the second square root is

B S D (+) (T_{2}, T_{3})

:

B S D (+) (T_{1}, T_{3}) \leq B S D (+) (T_{1}, T_{2}) + B S D (+) (T_{2}, T_{3}) .

Therefore, the distance function

B S D (+)

satisfies the triangle inequality.
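For readers who want the intermediate step, the inequality for the square root of a sum of squares used above can be derived from the Cauchy–Schwarz inequality. The derivation below is a standard argument added for illustration; it is not part of the original proof, and a and b stand for arbitrary real terms indexed by the same set.

```latex
\begin{aligned}
\sum (a + b)^{2} &= \sum a^{2} + 2 \sum a b + \sum b^{2} \\
&\leq \sum a^{2} + 2 \sqrt{\sum a^{2}} \sqrt{\sum b^{2}} + \sum b^{2} \\
&= \left( \sqrt{\sum a^{2}} + \sqrt{\sum b^{2}} \right)^{2} .
\end{aligned}
```

Both sides are non-negative, so taking square roots gives the stated bound.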

Since the distance function

B S D (+)

satisfies all four properties, it is a metric.

Next, we show that the

B S D (+)

distance for the completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

is still a metric.

Non-negativity. The

B S D (+)

distance inherently ensures non-negativity through the square of differences in distances between pairs of leaves, followed by the square root of their sum. The introduction of additional leaves in the tree completion process does not result in negative values, as distances calculated are inherently non-negative. Thus,

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎}) \geq 0

is maintained.

Identity of indiscernibles. The tree completion process preserves the original distances between common leaves while adding new distances in a consistent manner across both trees. As a result, if the trees are identical after completion, implying that all pairwise distances among leaves are equal, then

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎}) = 0

. Conversely, if

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎}) = 0

, it implies all corresponding distances are equal, including those involving newly added leaves, confirming the trees are identical.

Symmetry. Symmetry is inherent to the

B S D (+)

formulation, as it calculates the difference in distances between corresponding leaf pairs in both trees. This symmetry is not affected by the tree completion process, as the method of calculating distances between leaf pairs remains consistent, ensuring

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎}) = B S D (+) (T_{2}^{⊎}, T_{1}^{⊎})

.

Triangle inequality. The addition of leaves maintains the original distances and adds new distances consistently across the trees, so the completed trees are compared on the same set of leaves. The argument given above for the square root of a sum of squares therefore applies unchanged, and the aggregation of differences in distances, including those from newly added leaves, continues to satisfy the triangle inequality in the completed tree context. □

The fact that the

B S D (+) (T_{1}, T_{2})

distance is a metric is essential for maintaining mathematical consistency and validity in phylogenetic comparisons, ensuring reliable and interpretable results.
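The four properties can also be checked numerically on small inputs. The following is a minimal sketch, assuming pairwise leaf distances are already available as dictionaries keyed by unordered leaf pairs; the function name `bsd`, the helper `dist`, and the three toy distance tables are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def bsd(d1, d2, leaves):
    """Square root of the summed squared differences of pairwise leaf distances."""
    total = 0.0
    for a, b in combinations(sorted(leaves), 2):
        pair = frozenset((a, b))
        total += (d1[pair] - d2[pair]) ** 2
    return math.sqrt(total)


def dist(ab, ac, bc):
    """Build a hypothetical distance table for the three leaves A, B, C."""
    return {frozenset("AB"): ab, frozenset("AC"): ac, frozenset("BC"): bc}


leaves = ["A", "B", "C"]
t1 = dist(2.0, 3.0, 3.0)  # toy values, not taken from the article
t2 = dist(4.0, 5.0, 5.0)
t3 = dist(2.0, 6.0, 6.0)

print(bsd(t1, t1, leaves))                         # identical inputs
print(bsd(t1, t2, leaves), bsd(t2, t1, leaves))    # argument order swapped
print(bsd(t1, t3, leaves), bsd(t1, t2, leaves) + bsd(t2, t3, leaves))
```

A check of this kind does not replace the proof; it only illustrates the properties on one example.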

Theorem 2

(Computational complexity). Let

T_{1}

and

T_{2}

be phylogenetic trees defined on different but overlapping sets of taxa,

T_{1}^{⊎}

and

T_{2}^{⊎}

be their corresponding completed versions,

n = | L (T_{1}) | + | L (T_{2}) |

, and

k = | S D (T_{1}) | + | S D (T_{2}) |

. The computational complexity of the proposed algorithm for completing the phylogenetic trees

T_{1}

and

T_{2}

and computing the distance

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎})

can be estimated as

O (k \cdot n^{2})

.

Proof.

To establish the computational complexity of the proposed approach, it is necessary to analyze the complexity associated with each significant step in the completion and distance computation process.

The computational complexity for determining common and distinct leaves is

O (n)

, utilizing efficient data structures for leaf comparison, such as hash sets.

Assume that the first tree has

V_{1}

nodes and

E_{1}

edges, and the second tree has

V_{2}

nodes and

E_{2}

edges. Identifying distinct-leaf subtrees via breadth-first traversal in both

T_{1}

and

T_{2}

has a complexity of

O (V_{1} + E_{1}) + O (V_{2} + E_{2})

. Because a tree with V nodes has exactly V − 1 edges, this cost is linear in the number of nodes and remains below

O (n^{2})

.

The calculations of branch adjustment rates

r (T_{1}, T_{2})

and

r (T_{2}, T_{1})

, along with leaf-based adjustment rates

r^{(l_{c})} (T_{1}, T_{2})

and

r^{(l_{c})} (T_{2}, T_{1})

, involve nested loops over the set of common leaves

C L (T_{1}, T_{2})

. The number of common leaves is at most n, and for each pair of common leaves, the distances between them are calculated in each initial tree. The time complexity for this step is

O (n^{2})

.

The completion process involves adding temporary nodes and determining the planting point for each element in

S D (T_{1})

and

S D (T_{2})

. Each insertion involves traversing the trees, which has a complexity of

O (n^{2})

. Since we perform this operation for k elements, the overall complexity for this step is

O (k \cdot n^{2})

.

The distance calculation between the completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

involves nested loops over all pairwise combinations of leaves in the union of the sets of leaves (

L (T_{1}) \cup L (T_{2})

). For each pair of leaves, distances are calculated in both completed trees. The time complexity for this step can be estimated as

O (n^{2})

.

Combining the complexities of all major steps, the overall computational complexity of the approach is the maximum of these complexities, which is

O (k \cdot n^{2})

. □

Understanding the computational complexity of the algorithm is important for assessing its feasibility and efficiency, especially in the case of dealing with large datasets. An estimated complexity of

O (k \cdot n^{2})

indicates that the algorithm runs in polynomial time and can handle large phylogenetic trees within a reasonable timeframe, which is important for practical applications in evolutionary biology and comparative genomics.
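The quadratic cost of the pairwise distance steps can be illustrated with a short sketch. It assumes an undirected tree stored as an adjacency dictionary with branch lengths and runs one breadth-first traversal per leaf; the function name `leaf_distances` and the example tree are hypothetical, and the authors' implementation may organize this computation differently.

```python
from collections import deque


def leaf_distances(adj):
    """Return {(leaf_a, leaf_b): path length} for all ordered pairs of leaves.

    adj maps each node to {neighbor: branch_length} for an undirected tree.
    """
    leaves = [node for node, nbrs in adj.items() if len(nbrs) == 1]
    result = {}
    for src in leaves:
        # One traversal per leaf: linear in the number of nodes and edges.
        dist = {src: 0.0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v, w in adj[u].items():
                if v not in dist:
                    dist[v] = dist[u] + w
                    queue.append(v)
        for dst in leaves:
            if dst != src:
                result[(src, dst)] = dist[dst]
    return result


# Hypothetical tree: leaves A, B hang off internal node u; C, D hang off v.
adj = {
    "A": {"u": 1.0}, "B": {"u": 2.0},
    "C": {"v": 1.0}, "D": {"v": 3.0},
    "u": {"A": 1.0, "B": 2.0, "v": 1.5},
    "v": {"C": 1.0, "D": 3.0, "u": 1.5},
}
d = leaf_distances(adj)
print(d[("A", "C")], d[("B", "D")])
```

With one linear traversal per leaf, the total work grows with the number of leaves times the tree size, which is quadratic when the number of nodes is proportional to the number of leaves.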

Proposition 1

(Symmetry in tree completion). Let

T_{1}

and

T_{2}

be phylogenetic trees defined on different but overlapping sets of taxa. The proposed tree completion algorithm is symmetric with regard to the input trees

T_{1}

and

T_{2}

.

Proof.

The symmetry property ensures that interchanging

T_{1}

and

T_{2}

in the tree completion process does not alter the resulting completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

.

To prove symmetry, it is necessary to demonstrate that the operations performed by the algorithm do not depend on the order of

T_{1}

and

T_{2}

.

The first step involves identifying the common leaves, given by

C L (T_{1}, T_{2}) = L (T_{1}) \cap L (T_{2})

. This operation is inherently symmetric because set intersection is commutative, ensuring

C L (T_{1}, T_{2}) = C L (T_{2}, T_{1})

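. Identifying distinct leaves relies on set difference, which is not commutative on its own; however, the algorithm computes the difference in both directions (the leaves of each tree that are absent from the other), so interchanging the input trees merely interchanges the two resulting sets.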

Next, for each pair of common leaves

(l_{i}, l_{j})

, the distances

d^{(T)} (l_{i}, l_{j})

are computed within each tree. These distances are used for adjusting branch lengths and determining insertion points. Since distances are symmetric,

d^{(T)} (l_{i}, l_{j}) = d^{(T)} (l_{j}, l_{i})

, this step does not introduce any asymmetry.

The selection of insertion points for distinct elements from

S D (T_{1})

into

T_{2}

, and from

S D (T_{2})

into

T_{1}

, employs temporary nodes and midnodes. This method applies the same criteria regardless of whether the tree is designated as

T_{1}

or

T_{2}

.

Let

f (T_{1}, T_{2})

denote the function that produces the completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

through a series of operations O involving the identification of common and distinct leaves, pairwise distance calculations, and distinct element integration. The symmetry of the algorithm implies that

f (T_{1}, T_{2}) = (T_{1}^{⊎}, T_{2}^{⊎}) \equiv f (T_{2}, T_{1}) = (T_{2}^{⊎}, T_{1}^{⊎}) .

(10)

This indicates that the operations O are commutative and order-independent.

In addition, the distance between the completed trees (see Equation (5)) is symmetric with respect to

T_{1}^{⊎}

and

T_{2}^{⊎}

because the squared differences

{(d^{(T_{1}^{⊎})} (l_{i}, l_{j}) - d^{(T_{2}^{⊎})} (l_{i}, l_{j}))}^{2}

are inherently symmetric (see Theorem 1). This ensures that

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎}) = B S D (+) (T_{2}^{⊎}, T_{1}^{⊎})

.

Therefore, the symmetry of the proposed tree completion algorithm guarantees that the completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

are structurally consistent and represent the evolutionary relationships accurately, regardless of the order in which

T_{1}

and

T_{2}

are processed. □

Symmetry ensures that the algorithm treats the input trees

T_{1}

and

T_{2}

equally, without bias towards either tree. This property is important for the consistency of the algorithm, as it guarantees that the outcome does not depend on the order of the input trees, making the method robust and reliable.
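The order-independence of the first step can be sketched with plain set operations. The leaf sets below are hypothetical: the distinct leaves follow the description of Figure 1 (C in the first tree; E, F, and G in the second), while the common leaves A, B, and D are assumed for illustration, as is the helper name `split_leaves`.

```python
def split_leaves(leaves_1, leaves_2):
    """Return (common leaves, leaves only in tree 1, leaves only in tree 2)."""
    common = leaves_1 & leaves_2
    return common, leaves_1 - leaves_2, leaves_2 - leaves_1


t1_leaves = {"A", "B", "C", "D"}            # assumed leaf set of the first tree
t2_leaves = {"A", "B", "D", "E", "F", "G"}  # assumed leaf set of the second tree

common_12, only_1, only_2 = split_leaves(t1_leaves, t2_leaves)
common_21, only_2_swapped, only_1_swapped = split_leaves(t2_leaves, t1_leaves)

print(sorted(common_12), sorted(only_1), sorted(only_2))
```

Swapping the arguments leaves the common set unchanged and exchanges the two distinct-leaf sets, which is the behavior the symmetry argument relies on for this step.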

Proposition 2

(Branch adjustment rates). Let

T_{1}

and

T_{2}

be phylogenetic trees defined on different but overlapping sets of taxa. The branch adjustment rates

r (T_{1}, T_{2})

and

r (T_{2}, T_{1})

are positive and non-zero. Furthermore, if the common leaves in both trees have identical pairwise distances, then

r (T_{1}, T_{2}) = r (T_{2}, T_{1}) = 1

.

Proof.

Branch adjustment rates are defined by Equation (2). All terms in the numerator and denominator of the equation are non-negative distances, and both sums are non-zero (as each of the trees

T_{1}

and

T_{2}

has at least one edge between any two distinct common leaves). It follows that

r (T_{1}, T_{2})

is positive and non-zero, i.e.,

0 < r (T_{1}, T_{2}), r (T_{2}, T_{1})

.

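If the common leaves have identical pairwise distances in both trees, the numerator and denominator of Equation (2) coincide, and hence

r (T_{1}, T_{2}) = r (T_{2}, T_{1}) = 1

.

Consider now the scenario where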

d^{(T_{1})} (l_{i}, l_{j}) > d^{(T_{2})} (l_{i}, l_{j})

for every pair of distinct common leaves

l_{i}, l_{j} \in C L (T_{1}, T_{2})

. In this case, the adjustment rate

r (T_{1}, T_{2})

will be greater than 1, indicating that distances in

T_{1}

are generally longer than in

T_{2}

, and vice versa. □

The positivity and non-zero nature of branch adjustment rates, along with their equality when common leaves have identical pairwise distances, ensure that the adjustments made to branch lengths are meaningful and preserve the evolutionary distances. This property is critical for maintaining the biological relevance and accuracy of the completed trees, ensuring that the algorithm reflects true evolutionary relationships.
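A small numerical sketch may help make the rates concrete. It assumes, following the description in this section, that the rate is the ratio of the sums of pairwise distances between common leaves, with the first tree in the numerator; Equation (2) itself should be consulted for the exact definition. The function name `adjustment_rate` and the distance values are hypothetical.

```python
from itertools import combinations


def adjustment_rate(d_num, d_den, common):
    """Ratio of summed pairwise distances over common leaves (assumed form)."""
    pairs = [frozenset(p) for p in combinations(sorted(common), 2)]
    return sum(d_num[p] for p in pairs) / sum(d_den[p] for p in pairs)


common = {"A", "B", "D"}  # assumed common leaves
# Toy distances: every distance in the second tree is twice that in the first.
d1 = {frozenset("AB"): 2.0, frozenset("AD"): 3.0, frozenset("BD"): 3.0}
d2 = {frozenset("AB"): 4.0, frozenset("AD"): 6.0, frozenset("BD"): 6.0}

r12 = adjustment_rate(d1, d2, common)
r21 = adjustment_rate(d2, d1, common)
print(r12, r21, adjustment_rate(d1, d1, common))
```

Under this assumed form, the two rates are reciprocals of each other, and both equal 1 when the common leaves have identical pairwise distances.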

Proposition 3

(Preservation of leaf–leaf distances). Let T be a phylogenetic tree, and

T^{⊎}

be its corresponding completed version. For any two leaves

l_{i}, l_{j} \in L (T)

, the distance between them is preserved in the completed tree. That is,

d^{(T^{⊎})} (l_{i}, l_{j}) = d^{(T)} (l_{i}, l_{j})

.

Proof.

By definition, the tree completion algorithm adds distinct elements to the tree T to form

T^{⊎}

without modifying the existing structure and distances among the initial leaves in

L (T)

. This ensures that the paths and distances between original leaves

l_{i}

and

l_{j}

are unchanged.

Specifically, when a new leaf or subtree is added, it is appended in such a manner that it does not disrupt the pre-existing paths between any pair of leaves

l_{i}

and

l_{j}

in

L (T)

.

The planting points for new elements are determined based on the common leaves and the calculated midnodes among temporary nodes, ensuring these new insertions do not shorten or lengthen the original distances between any two common leaves (

l_{i}

,

l_{j}

).

The distance

d^{(T)} (l_{i}, l_{j})

in the original tree T is defined as the sum of branch lengths along the unique path from

l_{i}

to

l_{j}

. Since this path and its constituent branch lengths remain unaltered in

T^{⊎}

, the sum of branch lengths, and thus the distance

d^{(T^{⊎})} (l_{i}, l_{j})

, remains the same as

d^{(T)} (l_{i}, l_{j})

.

Given that the tree completion algorithm preserves the structural integrity of T regarding the distances between its original leaves during the completion process, and since the algorithm ensures that new elements are added in a way that does not affect these distances, it follows that for any two leaves

l_{i}, l_{j} \in L (T)

, the distance between them is preserved in the completed tree

T^{⊎}

. Thus,

d^{(T^{⊎})} (l_{i}, l_{j}) = d^{(T)} (l_{i}, l_{j})

. □

This proposition underscores the integrity of the original phylogenetic tree during the completion process. This aspect is important for preserving the biological significance of the tree. The preservation of the original leaf-to-leaf distances, both among the common leaves and between distinct leaves, ensures that the completed tree continues to reflect the evolutionary relationships and distances initially depicted in the original tree T.
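The preservation property can be illustrated by attaching a new leaf at a point inside an existing edge and comparing path lengths before and after. This is a minimal sketch under stated assumptions: the tree is an adjacency dictionary with branch lengths, the planting point lies strictly inside one edge, and the names `path_length`, `attach_leaf`, the node label `p`, and all numeric values are hypothetical. How the article's algorithm selects the planting point is not reproduced here.

```python
import copy
import math
from itertools import combinations


def path_length(adj, src, dst):
    """Length of the unique path between src and dst in an undirected tree."""
    stack = [(src, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("src and dst are not connected")


def attach_leaf(adj, u, v, offset, new_leaf, leaf_length, new_node):
    """Return a copy of adj with edge (u, v) split at `offset` from u
    and new_leaf attached to the newly created node."""
    w = adj[u][v]
    if not 0.0 < offset < w:
        raise ValueError("offset must fall strictly inside the edge")
    out = copy.deepcopy(adj)
    del out[u][v], out[v][u]
    out[u][new_node] = offset
    out[v][new_node] = w - offset
    out[new_node] = {u: offset, v: w - offset, new_leaf: leaf_length}
    out[new_leaf] = {new_node: leaf_length}
    return out


original = {
    "A": {"u": 1.0}, "B": {"u": 2.0},
    "C": {"v": 1.0}, "D": {"v": 3.0},
    "u": {"A": 1.0, "B": 2.0, "v": 1.5},
    "v": {"C": 1.0, "D": 3.0, "u": 1.5},
}
completed = attach_leaf(original, "u", "v", 0.5, "E", 2.0, "p")

preserved = all(
    math.isclose(path_length(original, a, b), path_length(completed, a, b))
    for a, b in combinations(["A", "B", "C", "D"], 2)
)
print(preserved)
```

Splitting an edge into two parts whose lengths add up to the original length leaves every path between original leaves with the same total length, which is the mechanism behind the proposition.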

Proposition 4

(Multifurcation). The

B S D (+)

distance can be applied to non-binary trees with multifurcations.

Proof.

Consider

T_{1}^{⊎}

and

T_{2}^{⊎}

as two completed phylogenetic trees, potentially including multifurcations, and defined on the same set of taxa

L (T^{⊎})

. Internal nodes may bifurcate (in binary trees) or multifurcate (in non-binary trees). The

B S D (+)

distance between these trees is defined by Equation (5).

For any two leaves

l_{i}, l_{j} \in L (T^{⊎})

, the path

P^{(T_{k}^{⊎})} (l_{i}, l_{j})

connecting them is unique, due to the acyclic nature of phylogenetic trees. This holds true for both binary and non-binary (multifurcated) structures.

The distance

d^{(T_{k}^{⊎})} (l_{i}, l_{j})

is the sum of the lengths of the edges along the path

P^{(T_{k}^{⊎})} (l_{i}, l_{j})

(see Equation (1)). This calculation depends solely on the path, not on the branching structure at each node, indicating that the presence of multifurcations does not alter the fundamental distance calculation between leaf pairs.

Therefore, the computation of

B S D (+) (T_{1}^{⊎}, T_{2}^{⊎})

, which aggregates the squared differences of these pairwise distances across

T_{1}^{⊎}

and

T_{2}^{⊎}

, remains valid and meaningful regardless of whether the trees are binary or multifurcating. □

The ability to apply the

B S D (+)

distance to non-binary trees with multifurcations increases the versatility and applicability of the algorithm. Many real-world phylogenetic trees are not strictly binary; accommodating multifurcations therefore allows the algorithm to be used in a broader range of scenarios, enhancing its utility and relevance in evolutionary studies. However,

B S D (+)

application requires the careful consideration of how multifurcations are represented and interpreted within this framework. The utility and accuracy of using

B S D (+)

in this context will depend on the specific characteristics of the trees being compared and the biological implications of their multifurcating structures.
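As a small worked illustration, consider a multifurcating star tree S and a binary tree R on the same four leaves. Both trees and all branch lengths are hypothetical and not taken from the article; the computation uses the square root of the summed squared differences of pairwise leaf distances.

```latex
% Star tree S: leaves A, B, C, D attached to one internal node, each branch of length 1.
% Binary tree R: cherries (A,B) and (C,D), leaf branches of length 1, internal edge of length 1.
\begin{aligned}
d^{(S)}(l_{i}, l_{j}) &= 2 \quad \text{for all six leaf pairs}, \\
d^{(R)}(A, B) &= d^{(R)}(C, D) = 2, \qquad
d^{(R)}(A, C) = d^{(R)}(A, D) = d^{(R)}(B, C) = d^{(R)}(B, D) = 3, \\
B S D (+) (S, R) &= \sqrt{2 \cdot (2 - 2)^{2} + 4 \cdot (2 - 3)^{2}} = \sqrt{4} = 2 .
\end{aligned}
```

The path between any two leaves is unique in both trees, so the same calculation applies to the multifurcating node of S as to the binary nodes of R.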

Proposition 5

(Consideration of topology and branch lengths). The proposed tree completion algorithm integrates considerations of both the topology (the arrangement and relationships between nodes) and the branch lengths (quantitative measures of evolutionary distance) of the original trees

T_{1}

and

T_{2}

to produce completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

.

Proof.

The proposed tree completion algorithm inserts new leaves and subtrees at the planting points determined by midnodes and temporary nodes, which are positioned based on the distances to common leaves in the original trees

T_{1}

and

T_{2}

.

Let

L (T^{⊎}) = L (T_{1}) \cup L (T_{2})

. For any new leaf

l_{n e w}

added to the tree, its position is determined by calculating the planting point among the selected temporary nodes, which are based on distances related to common leaves. This process ensures that the topological arrangement of leaves in

T_{1}^{⊎}

and

T_{2}^{⊎}

respects the evolutionary relationships inferred from

T_{1}

and

T_{2}

.

In addition, the tree completion algorithm incorporates maximal distinct-leaf subtrees (if any) from one tree into the other. This process is critical for topology consideration, because these subtrees represent significant evolutionary branches that must be integrated while preserving the phylogenetic relationships.

Branch lengths in

T_{1}^{⊎}

and

T_{2}^{⊎}

are adjusted using rates

r (T_{1}, T_{2})

and

r (T_{2}, T_{1})

, calculated as the ratio of sums of pairwise distances between common leaves in

T_{1}

and

T_{2}

. These rates reflect how branch lengths should be scaled to align the tree branch lengths. For any branch b in

T_{1}^{⊎}

or

T_{2}^{⊎}

corresponding to a new insertion, the length is adjusted as follows (see Equation (6)):

\begin{cases} b r^{(T_{1}^{⊎})} (b) = b r^{(T_{2})} (b) \cdot r (T_{1}, T_{2}), \\ b r^{(T_{2}^{⊎})} (b) = b r^{(T_{1})} (b) \cdot r (T_{2}, T_{1}) . \end{cases}

(11)

This ensures that the branch lengths in the completed trees reflect the evolutionary distances measured in the initial trees, adjusted for the context of the completion.

The

B S D (+)

distance between the completed trees

T_{1}^{⊎}

and

T_{2}^{⊎}

is calculated by considering the pairwise distances between leaves, which are the sum of branch lengths along the paths connecting the leaves. Although the

B S D (+)

distance does not directly quantify dissimilarities in tree topology like purely topological measures, it indirectly involves topology through the paths chosen and directly involves branch lengths through the sum of lengths along these paths. The topology involvement is indirect because the

B S D (+)

calculates the sum of squared differences in pairwise distances between leaves across the two trees. These pairwise distances are inherently determined by the paths through the tree topology from one leaf to another and the lengths of the branches that compose those paths. Thus, the

B S D (+)

formula encapsulates both topological and branch length considerations by evaluating the aggregate difference in these pairwise distances between

T_{1}^{⊎}

and

T_{2}^{⊎}

. □

Remark 1.

The use of branch lengths in the phylogenetic tree distance metric, instead of solely relying on tree topology, provides numerous advantages, especially in terms of biological insights.

Branch lengths in a phylogenetic tree often represent evolutionary time or genetic change. Incorporating branch lengths can provide a more nuanced and accurate picture of the evolutionary relationships between species, reflecting not just how they are related but also how much they have diverged from each other. In addition to this, in cases where branch lengths are considered, the distance metric can better differentiate between trees that are topologically similar but differ significantly in how branches are distributed. This can be crucial in scenarios where slight changes in branch length represent important evolutionary events. Two trees might have a similar structure (topology), but if the lengths of the corresponding branches are significantly different, it indicates a greater evolutionary divergence. Furthermore, trees with branch lengths can be sensitive to more subtle evolutionary changes that are not apparent from topology alone. This sensitivity is critical in studies where small genetic differences are significant, such as in closely related species or in populations of the same species. Finally, using branch lengths can make comparisons between different phylogenetic trees more robust, especially when those trees are derived from different datasets or methods.

Considering the importance of using branch lengths in the distance metric between phylogenetic trees for gaining biological insights, the introduction of branch adjustment rates takes this a step further by ensuring that the evolutionary distances expressed in branch lengths are comparable between different trees. These rates standardize evolutionary rates across disparate datasets and methodologies, making it possible to compare trees more effectively. They adjust branch lengths to a common scale, allowing for meaningful evaluations of evolutionary time and genetic change, which is crucial when different trees may scale these distances differently. This is particularly significant in cases where trees might share similar topologies but differ markedly in their branch distributions, as it allows the distance metric to differentiate between trees that, while structurally similar, differ substantially in their evolutionary paths. Such differentiation is essential in scenarios where even slight variations in branch length can indicate important evolutionary developments. Practical examples of these adjustments include their use in supertree construction, where they help integrate multiple partial trees into a single coherent whole, and in comparative phylogenetics, where they enable more accurate analyses of evolutionary relationships and rates across different organisms or genes, facilitating deeper insights into biodiversity and evolution.

4. Conclusions

In conclusion, the proposed method for tree completion and distance calculation between phylogenetic trees defined on different but overlapping sets of taxa is designed to address the existing limitations of distance-related measures. Our approach exploits important properties of phylogenetic trees, such as branch length, topology information through the arrangement and relationships between nodes, and leaf names. Additionally, the proposed distance measure is a metric that can be applied to both binary and non-binary trees and is computed in polynomial time. Incorporating branch lengths into phylogenetic tree distance metrics enhances the biological relevance and interpretative power of phylogenetic analyses, providing deeper insights into evolutionary processes and relationships. The proposed tree completion algorithm is designed to respect evolutionary relationships and preserve the structural integrity of the compared phylogenetic trees.

Building on this foundation, the ability of our algorithm to integrate leaves from different trees into a unified taxonomic framework supports significant advancements in comparative genomics and phylogenetics. By enabling the insertion of leaves from one phylogenetic tree into another while maintaining a consistent taxonomic framework, our algorithm significantly aids in the synthesis of comprehensive phylogenetic trees, often referred to as ‘supertrees’. These supertrees are particularly useful in scenarios where partial data from various studies need to be integrated to form a more complete evolutionary picture. Moreover, our approach is crucial in studying evolutionary alternatives among sets of genes, where clusters of genes may exhibit similar evolutionary trajectories but are often studied in isolation. Our algorithm allows for the integration of these clusters into alternative phylogenetic trees, providing a holistic view of potential evolutionary paths and helping researchers to understand complex genetic relationships.

Future work in a more advanced comparison framework includes comparing the proposed tree completion approach and the

B S D (+)

distance with the relevant phylogenetic tree distance measures, as well as the

B S D (-)

approach, which involves pruning non-common leaves from both trees before calculating the distance. Experiments with different scenarios will be conducted using biological and simulated data [39,40]. Furthermore, various modifications of the proposed tree completion algorithm can be implemented and evaluated. In particular, the number of common leaves involved in finding planting points for new leaves can be limited to a few common leaves closest to the distinct leaf or maximal distinct-leaf subtree. Additional strategies for determining a representative point among temporary nodes can also be investigated and tested.

Author Contributions

Conceptualization, A.K. and N.T.; methodology, A.K. and N.T.; validation, A.K. and N.T.; resources, N.T.; data curation, A.K.; writing—original draft preparation, A.K. and N.T.; writing—review and editing, A.K. and N.T.; visualization, A.K.; supervision, N.T.; project administration, N.T.; funding acquisition, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada—Discovery Grants (grant number RGPIN-2022-04322) and Canada Graduate Scholarship-Doctoral (grant number CGS D-589644-2024), Fonds de recherche du Québec-Nature and technologies (grant number 326911), and the University of Sherbrooke grant.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors would like to thank the Department of Computer Science, University of Sherbrooke, Quebec, Canada for providing the necessary resources to conduct this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Robinson, D.F.; Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 1981, 53, 131–147.
2. Briand, S.; Dessimoz, C.; El-Mabrouk, N.; Lafond, M.; Lobinska, G. A generalized Robinson-Foulds distance for labeled trees. BMC Genom. 2020, 21, 779.
3. Smith, M.R. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics 2020, 36, 5007–5013.
4. Llabrés, M.; Rosselló, F.; Valiente, G. The Generalized Robinson-Foulds Distance for Phylogenetic Trees. J. Comput. Biol. 2021, 28, 1181–1195.
5. Critchlow, D.E.; Pearl, D.K.; Qian, C. The triples distance for rooted bifurcating phylogenetic trees. Syst. Biol. 1996, 45, 323–334.
6. Estabrook, G.F.; McMorris, F.; Meacham, C.A. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 1985, 34, 193–200.
7. Snir, S.; Weissberg, O.; Yuster, R. On the quartet distance given partial information. J. Graph Theory 2022, 100, 252–269.
8. Cardona, G.; Llabrés, M.; Rosselló, F.; Valiente, G. Nodal distances for rooted phylogenetic trees. J. Math. Biol. 2010, 61, 253–276.
9. Kupczok, A.; Haeseler, A.V.; Klaere, S. An exact algorithm for the geodesic distance between phylogenetic trees. J. Comput. Biol. 2008, 15, 577–591.
10. Khodaei, M.; Owen, M.; Beerli, P. Geodesics to characterize the phylogenetic landscape. PLoS ONE 2023, 18, e0287350.
11. Amir, A.; Keselman, D. Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J. Comput. 1997, 26, 1656–1669.
12. Markin, A. On the extremal maximum agreement subtree problem. Discret. Appl. Math. 2020, 285, 612–620.
13. Steel, M.A.; Penny, D. Distributions of tree comparison metrics—Some new results. Syst. Biol. 1993, 42, 126–141.
14. Smith, M.R. Robust analysis of phylogenetic tree space. Syst. Biol. 2022, 71, 1255–1270.
15. Tahiri, N.; Willems, M.; Makarenkov, V. A new fast method for inferring multiple consensus trees using k-medoids. BMC Evol. Biol. 2018, 18, 48.
16. Tahiri, N.; Fichet, B.; Makarenkov, V. Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance. Bioinformatics 2022, 38, 3367–3376.
17. Silva, A.S.; Wilkinson, M. On defining and finding islands of trees and mitigating large island bias. Syst. Biol. 2021, 70, 1282–1294.
18. Whidden, C.; Zeh, N.; Beiko, R.G. Supertrees based on the subtree prune-and-regraft distance. Syst. Biol. 2014, 63, 566–581.
19. Makarenkov, V.; Barseghyan, G.S.; Tahiri, N. Inferring multiple consensus trees and supertrees using clustering: A review. In Data Analysis and Optimization: In Honor of Boris Mirkin’s 80th Birthday; Springer: Cham, Switzerland, 2023; pp. 191–213.
20. Hinchliff, C.E.; Smith, S.A.; Allman, J.F.; Burleigh, J.G.; Chaudhary, R.; Coghill, L.M.; Crandall, K.A.; Deng, J.; Drew, B.T.; Gazis, R.; et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc. Natl. Acad. Sci. USA 2015, 112, 12764–12769.
21. Letunic, I.; Bork, P. Interactive Tree of Life (iTOL) v6: Recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 2024, gkae268.
22. Wang, J.T.; Shan, H.; Shasha, D.; Piel, W.H. Fast structural search in phylogenetic databases. Evol. Bioinform. 2005, 1, 117693430500100009.
23. Chen, D.; Burleigh, J.G.; Bansal, M.S.; Fernández-Baca, D. PhyloFinder: An intelligent search engine for phylogenetic tree databases. BMC Evol. Biol. 2008, 8, 90.
24. Cotton, J.A.; Wilkinson, M. Majority-rule supertrees. Syst. Biol. 2007, 56, 445–452.
25. Christensen, S.; Molloy, E.K.; Vachaspati, P.; Warnow, T. OCTAL: Optimal Completion of gene trees in polynomial time. Algorithms Mol. Biol. 2018, 13, 6.
26. Kupczok, A. Split-based computation of majority-rule supertrees. BMC Evol. Biol. 2011, 11, 205.
27. Dong, J.; Fernández-Baca, D. Properties of majority-rule supertrees. Syst. Biol. 2009, 58, 360–367.
28. Bansal, M.S. Linear-time algorithms for phylogenetic tree completion under Robinson–Foulds distance. Algorithms Mol. Biol. 2020, 15, 6.
29. Yao, K.; Bansal, M.S. Optimal completion and comparison of incomplete phylogenetic trees under Robinson-Foulds distance. In Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021), Wrocław, Poland, 5–7 July 2021; Schloss Dagstuhl-Leibniz-Zentrum für Informatik: Wadern, Germany, 2021.
30. Priel, A.; Tamir, B. A vectorial tree distance measure. Sci. Rep. 2022, 12, 5256.
31. Billera, L.J.; Holmes, S.P.; Vogtmann, K. Geometry of the space of phylogenetic trees. Adv. Appl. Math. 2001, 27, 733–767.
32. Ren, Y.; Zha, S.; Bi, J.; Sanchez, J.A.; Monical, C.; Delcourt, M.; Guzman, R.K.; Davidson, R. A combinatorial method for connecting BHV spaces representing different numbers of taxa. arXiv 2017, arXiv:1708.02626.
33. Grindstaff, G.; Owen, M. Geometric comparison of phylogenetic trees with different leaf sets. arXiv 2018, arXiv:1807.04235.
34. Yasui, N.; Vogiatzis, C.; Yoshida, R.; Fukumizu, K. imPhy: Imputing phylogenetic trees with missing information using mathematical programming. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 17, 1222–1230.
35. Yoshida, R. Imputing phylogenetic trees using tropical polytopes over the space of phylogenetic trees. Mathematics 2023, 11, 3419.
36. Rabiee, M.; Mirarab, S. INSTRAL: Discordance-aware phylogenetic placement using quartet scores. Syst. Biol. 2020, 69, 384–391.
37. Mai, U.; Mirarab, S. Completing gene trees without species trees in sub-quadratic time. Bioinformatics 2022, 38, 1532–1541.
38. Mahbub, S.; Sawmya, S.; Saha, A.; Reaz, R.; Rahman, M.S.; Bayzid, M.S. Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data. J. Comput. Biol. 2022, 29, 1156–1172.
39. Koshkarov, A.; Tahiri, N. GPTree: Generator of Phylogenetic Trees with Overlapping and Biological Events for Supertree Inference. In Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023), Lisbon, Portugal, 16–18 February 2023; SCITEPRESS: Setúbal, Portugal, 2023; Volume 3: BIOINFORMATICS, pp. 212–219.
40. Koshkarov, A.; Tahiri, N. GPTree Cluster: Phylogenetic tree cluster generator in the context of supertree inference. Bioinform. Adv. 2023, 3, vbad023.

Figure 1. Phylogenetic trees defined on different but mutually overlapping sets of taxa. Common taxa are colored red. Tree T_{1} (a) has one distinct leaf (C). Tree T_{2} (b) includes three distinct leaves (G, F, and E) that form a distinct-leaf subtree.

Figure 2. Temporary nodes and planting points. Temporary nodes are marked in black for both trees. Tree T_{1} (a) has 4 temporary nodes (denoted as s_{1}, s_{2}, s_{3}, and s_{4} in (a)). Tree T_{2} (b) also contains 4 temporary nodes (labeled as c_{1}, c_{2}, c_{3}, and c_{4} in (b)). The planting points found using Algorithm 2 are marked in blue.

Figure 3. Completed trees T_{1}^{⊎} and T_{2}^{⊎}. Newly added internal nodes and leaves with their adjusted branch lengths are colored in blue.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
