1. Introduction
A gene/species tree reconciliation is an embedding of a gene tree into its associated species tree, which explicates the topological differences between the two trees through a sequence of events shaping the gene family inside its host species. Gene/species tree reconciliation has been widely studied since the 1980s [1], first by focusing on duplication and loss events and then by extending to horizontal gene transfers [2,3,4] and other events such as hybridization or incomplete lineage sorting (see the review in [5]). One of the major drawbacks of classical reconciliation is that gene families are considered separately from one another, which is not appropriate for genes organized in syntenies, i.e., colocalized genes likely to have evolved together through segmental events.
Although some work has been done to infer the evolution of adjacencies [6], to group individual events into segmental ones [7], or to minimize “duplication episodes” [8,9], none of these methods is intended to explicitly look for evolutionary scenarios that minimize segmental events. In [10], we presented the first algorithm that generalizes reconciliation to synteny trees (i.e., trees with leaves representing syntenies rather than single genes) and segmental events for the duplication–loss distance, and we extended it to horizontal transfers with SuperDTL in [11].
Here, we present Synesth (for SYNteny Evolution in SegmenTal Histories), an extended syntenic reconciliation model accounting for fissions, whereby part of a synteny is detached to another locus or species, in addition to losses, gains, duplications, and transfers. Moreover, as the choice of the species included in a study (or, equivalently, the absence of species that are not chosen, unsampled, or extinct) has a serious impact on the output of a reconciliation algorithm [12,13,14], we account for transfers to and from unsampled or extinct species.
Among the infinite space of possible histories explicating the evolution of a synteny tree inside a species tree, both given as input, we are interested in selecting the most likely ones. A possible way to achieve this is by assigning costs (or probabilities) to event types and selecting histories minimizing (resp. maximizing) the overall cost (resp. probability). The resulting histories can vary significantly depending on the chosen costs, which are usually challenging to determine and strongly depend on the taxa under study [15,16]. Instead, our goal here is to provide an overview of the optimal histories for all possible cost choices.
Our approach is to progressively subdivide the history space. First, the space is reduced to a finite but exponential-size set of Pareto-optimal histories (in terms of counts of each event type), then to a polynomial-size set of Pareto-optimal event count vectors, and eventually to the single cost of an optimal history. We develop efficient and exact algorithms to compute the optimal histories for each subdivision level using algebraic dynamic programming. The inductive characterization of Pareto-optimal histories, which is given in Section 5, leads to a polynomial-time dynamic programming algorithm to output all Pareto-optimal vectors (Section 6), and ultimately to an $O(mn)$ time complexity algorithm (where $m$ and $n$ are the numbers of nodes in the input synteny and species trees, respectively) to output the cost of a minimum-cost history (Section 7), thereby matching the time complexity of the fastest algorithms for the classical duplication–transfer–loss reconciliation [14]. An implementation of the algorithm is available at: https://github.com/UdeM-LBIT/superrec2/tree/algo2024 (accessed on 20 April 2024).
In Section 8, we apply Synesth to study the evolution of CRISPR-associated (Cas) gene syntenies. Building on previous work [15,17], we present a visualization of the solution landscape as a partition of the space of cost choices into regions of equivalent costs, i.e., costs leading to the same set of optimal histories.
We first introduce the required notation in Section 2 and the evolutionary model in Section 3, and we then consider the solution space subdivision and optimization problems in Section 4.
2. Preliminary Notation
All trees are considered rooted in this work. Given a tree $T$, we denote by $r(T)$ its root, by $V(T)$ its node set, by $L(T)$ its leaf set, and we let $I(T) = V(T) \setminus L(T)$ be the set of its internal nodes. A node $v'$ is an ancestor of $v$ if $v'$ is on the path from $r(T)$ to $v$, and the parent of $v$, of which $v$ is a child, directly precedes $v$ on this path. Conversely, $v$ is a descendant of $v'$. This ancestor–descendant relation is denoted as $\leq$ (with $v' \leq v$ when $v'$ is an ancestor of $v$) and forms a partial order on the nodes, in which the root is minimal and the leaves are maximal. Any two nodes $v$ and $v'$ not ordered by this relation are said to be separated. Notice that a node $v$ is both an ancestor and a descendant of itself; whenever this case needs to be excluded, we will talk about strict ancestors and strict descendants.
We denote by $E(T)$ the set of edges of $T$, each of which is represented by a pair of nodes $(u, v)$. For any two nodes $v$ and $v'$ of $T$, there exists a unique path from $v$ to $v'$, and the distance between $v$ and $v'$ is defined as the number of edges on this path. Given a node $v$ of $T$, $T[v]$ is the subtree of $T$ rooted at $v$ (i.e., containing only the descendants of $v$). The lowest common ancestor (LCA) of a subset $V$ of nodes, denoted as $\mathrm{lca}(V)$, is the common ancestor of all nodes in $V$ that is the most distant from the root.
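To make these notions concrete, the following Python sketch encodes a rooted tree with parent pointers and implements the ancestor relation, separation, the LCA, and the distance. The class `Node` and all function names are illustrative assumptions and are not taken from the Synesth implementation; the distance is assumed here to be the number of edges on the path between two nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: nodes are compared with `is`
class Node:
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach `child` below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(v: Node) -> List[Node]:
    """Return v followed by all of its strict ancestors, up to the root."""
    chain = []
    while v is not None:
        chain.append(v)
        v = v.parent
    return chain


def is_ancestor(u: Node, v: Node) -> bool:
    """True if u is an ancestor of v (every node is its own ancestor)."""
    return any(a is u for a in ancestors(v))


def separated(u: Node, v: Node) -> bool:
    """True if neither node is an ancestor of the other."""
    return not is_ancestor(u, v) and not is_ancestor(v, u)


def lca(nodes: List[Node]) -> Node:
    """Common ancestor of all given nodes that is farthest from the root."""
    common = ancestors(nodes[0])  # ordered from deepest to root
    for v in nodes[1:]:
        seen = {id(a) for a in ancestors(v)}
        common = [a for a in common if id(a) in seen]
    return common[0]


def distance(u: Node, v: Node) -> int:
    """Number of edges on the unique path between u and v."""
    depth_of_lca = len(ancestors(lca([u, v])))
    return (len(ancestors(u)) - depth_of_lca) + (len(ancestors(v)) - depth_of_lca)


# Small example tree: root -> (a -> (c, d), b)
root = Node("root")
a = root.add(Node("a"))
b = root.add(Node("b"))
c = a.add(Node("c"))
d = a.add(Node("d"))
```

In this example, `c` and `b` are separated, their LCA is `root`, and the path between them goes through `a` and `root`.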
A tree $T'$ is said to be an extension of a tree $T$ if $T'$ can be obtained from $T$ by a sequence of operations among the following: (1) Subdividing an edge $(u, w)$ by adding a new node $v$ and replacing $(u, w)$ by two edges $(u, v)$ and $(v, w)$; (2) Grafting a new node $v$ below an existing node $u$ by adding the edge $(u, v)$; (3) Rerooting the tree to a new node $u$ by adding the edge $(u, r)$, where $r$ is the former root.
The set of children of any node $v$ is denoted by $\mathrm{ch}(v)$. If $|\mathrm{ch}(v)| = 1$, then $v$ is said to be unary and we denote its only child by $v_c$. If $|\mathrm{ch}(v)| = 2$, then it is said to be binary and, unless specified otherwise, we denote its children by $v_l$ and $v_r$ in no particular order. A binary tree is a tree where all internal (non-leaf) nodes are binary. If all internal nodes are unary or binary, then the tree is partially binary.
If $\mathcal{F}$ is a set of gene families, then a synteny on $\mathcal{F}$ is a subset $X \subseteq \mathcal{F}$, which represents a group of genes assumed to have jointly evolved. Notice that we ignore the relative order of genes in the genomic region, as well as the physical distance between genes and regions. The genes of a synteny are considered to all belong to different gene families (i.e., repeated gene families inside a synteny are forbidden); therefore, a gene is simply identified by the family it belongs to.
A species tree $S$ on a set of species $\Sigma$ is a binary tree with a bijection between its leaf set and $\Sigma$. For a set of syntenies $\mathcal{X}$, a synteny tree is a tuple $\mathcal{T} = (T, x, s)$ where $T$ is a binary tree, and $x$ and $s$ are two functions defined on the leaves of $T$, the first giving the synteny at each leaf and the second indicating the species to which each synteny belongs.
Finally, the restriction of a function $f$ to a subset $A$ of its domain is denoted as $f|_A$.
3. Evolutionary Histories for Syntenies
We model the evolution of syntenies through the following syntenic events: codivergence with the host species (speciation), duplication of a synteny subset, fission of a synteny (cut), transfer of a duplicated or cut subset, and the gain or loss of a subset. Losses can be partial in the sense that only a subset of the genes in a synteny is lost. Evolutionary histories are sequences of such events, as formally defined below (see the example in Figure 1).
Definition 1 (Evolutionary history for syntenies). A history on a species tree S is a tuple , where H is a partially binary tree. Each node is labeled with a species and a synteny (i.e., a subset of gene families) . Each internal node is additionally labeled with an event acting on and . These labels satisfy the following conditions:
- 1.
If , with and , then , , and .
- 2.
If , with :
- 1.
If , then ;
- 2.
If , then , ,
and (but is allowed);
- 3.
If , then ;
- 4.
If , then , and .
- 3.
If with , then and the following:
- 1.
If , then .
- 2.
If , then (a loss is full if , and partial otherwise).
- 4.
For each gene family Γ, exactly one gain event in H involves Γ.
- 5.
The only nodes v of the history such that are its root and its leaves.
Finally, we denote by any event in .
Notice that, when the syntenies of a history are restricted to at most one gene family each, this model reduces to the classical reconciliation model of [14], with the transfer of a duplicated subset corresponding to a transfer and the transfer of a cut subset corresponding to a transfer–loss. Additionally, from any syntenic history on a set of gene families, one can extract a reconciled gene tree for each gene family Γ, whose root is the gain event for Γ and whose leaves are the loss events where Γ is lost. This root is unique because of Condition (4), which excludes convergent gains.
Moreover, as in [
14], we will allow for transfers to and from unsampled or extinct species by augmenting the species tree. For example, in
Figure 1, Synteny
in Species
A is the result of a transfer from an unsampled species. In the following, unless specified otherwise, all histories are on augmented species trees, as formally defined below.
Definition 2 (Augmented species tree). A tree $S$ can be augmented into a tree $S'$ by adding unsampled leaves as follows: (1) Subdivide each edge $(u, v)$ of $S$ into two edges $(u, z)$ and $(z, v)$ linked to a new node $z$; (2) Connect each $z$ to a new unsampled leaf; (3) Create a new root whose two children are the root of $S$ and a new unsampled leaf. Edges leading to unsampled leaves are called unsampled edges.
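The three steps of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are encoded as nested tuples (a leaf is a string, an internal node is a tuple of its children), and the labels `unsampled1`, `unsampled2`, and so on are hypothetical names chosen here for the new leaves.

```python
import itertools


def augment(species_tree):
    """Return the augmented version of a nested-tuple species tree.

    Subdividing the edge above a node and hanging an unsampled leaf on the
    new node amounts to wrapping that node in a pair (node, unsampled leaf).
    The new root of step (3) wraps the old root in exactly the same way.
    """
    fresh = itertools.count(1)  # numbering for the new unsampled leaves

    def wrap(node):
        if isinstance(node, tuple):  # internal node: augment the children first
            node = tuple(wrap(child) for child in node)
        return (node, f"unsampled{next(fresh)}")

    return wrap(species_tree)


def count_nodes(tree):
    """Number of nodes (internal nodes and leaves) in a nested-tuple tree."""
    if isinstance(tree, tuple):
        return 1 + sum(count_nodes(child) for child in tree)
    return 1


augmented = augment(("A", "B"))
```

Under this encoding, every original node gains one new parent and one unsampled sibling, so a species tree with 3 nodes yields an augmented tree with 9 nodes.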
Finally, notice that Definition 1 does not require the species involved in a transfer to be contemporary, nor does it forbid biologically infeasible cyclic histories, such as one resulting from a transfer from a Species A to a Species B, and then back from Species B to an ancestor of Species A (for a more precise definition of acyclicity, see [4]). This limitation is necessary to make the computational problems of the next section tractable.
4. Explicatory Histories and Optimization Problems
The goal of reconciliation is to infer histories that explicate the topology of a synteny tree given a tree of the corresponding species. Such histories are extensions of the synteny tree that map all leaves to the appropriate species and syntenies without introducing new visible leaves.
Definition 3 (Visible leaves). A leaf $l$ of a history is said to be visible if its species $s(l)$ is not an unsampled species and its synteny $x(l)$ is nonempty, and invisible otherwise. The visible leaves of a history form its set of visible leaves.
For example, in
Figure 1, the history in (2) explicates the trees in (1). In that history, the only invisible leaf is the unnamed leaf in the hatched region (representing an unsampled species). The leaves below full losses would also be invisible, but that example history does not contain any full loss. The following is a formal definition of an explicatory history:
Definition 4 (Explicatory histories). For a species tree S, a history on is said to explicate a synteny tree and S if the following holds true:
- 1.
H is an extension of T.
- 2.
and .
- 3.
.
- 4.
No is such that .
- 5.
No gain event is the parent of a node that does not belong to the synteny tree T.
- 6.
No partial loss event is a child of a node that does not belong to the synteny tree T.
Condition (3) disallows introducing new visible leaves. Condition (4) excludes assignments of species that create cycles between adjacent nodes of the synteny tree. This condition is a necessary, but not sufficient, condition for acyclicity. Imposing a full acyclicity condition would lead to computationally intractable problems [
4].
As for Conditions (5) and (6), they are introduced to avoid having multiple histories with the same events, but with gains and losses distributed differently between the adjacent nodes of the synteny tree. More precisely, Condition (5) requires sifting gains down and merging them until they are the parent of a synteny tree node; conversely, Condition (6) requires sifting losses up and merging them until they are the child of a synteny tree node.
Note that the set of histories explicating a given synteny tree and species tree is infinite: given any explicatory history, it is always possible to extend it into a larger one by introducing superfluous duplications or transfers. We next define a way to reduce this space to a finite one.
Definition 5 (Event vector). Let $\mathcal{H}$ be a history. We define its event vector as the vector whose component for each counted event type is the number of events of that type in $\mathcal{H}$.
As usual for reconciliation, our definition of an event vector excludes the number of speciations since they do not allow one to meaningfully distinguish between histories. Notice that the number of gains is also excluded, as needed to make the problem of
Section 6 tractable. Consequently, we disregard taking advantage of the simultaneous gains of multiple genes to reduce the overall number of individual gain events.
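As a small illustration of Definition 5, the sketch below counts the events of a history given as a flat list of event labels. The labels (`spe`, `gain`, `dup`, `cut`, `tdup`, `tcut`, `loss`) and the order of the five components are assumptions made for this example only; what comes from the text is that speciations and gains are not counted and that the vector has five components.

```python
from collections import Counter

# Hypothetical labels for the five counted event types, in a fixed order:
# duplication, cut, transfer of a duplicated subset, transfer of a cut subset, loss.
COUNTED_EVENTS = ("dup", "cut", "tdup", "tcut", "loss")


def event_vector(events):
    """Count the events of each counted type; speciations and gains are ignored."""
    counts = Counter(events)
    return tuple(counts[event_type] for event_type in COUNTED_EVENTS)


# Event labels of a made-up history, listed in any order.
example_events = ["gain", "spe", "dup", "loss", "loss", "tcut", "spe"]
example_vector = event_vector(example_events)
```

Here `example_vector` records one duplication, one transfer of a cut subset, and two losses, while the gain and the two speciations leave no trace.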
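Definition 6 (Order on vectors and histories). For two event vectors $a$ and $b$, $a \preceq b$ if, for all components $i$, $a_i \leq b_i$. This partial order induces another one on the histories, namely $\mathcal{H} \preceq \mathcal{H}'$ if the event vector of $\mathcal{H}$ precedes that of $\mathcal{H}'$.
Definition 7 (Pareto optimality). For any set $E$ with a partial order ⪯, we define its Pareto subset as the set of elements $e \in E$ such that no $e' \in E$ is strictly smaller than $e$ (i.e., $e' \preceq e$ but not $e \preceq e'$).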
As an example, in the set of vectors , we have while and are not comparable; hence, .
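The component-wise order and the Pareto subset of Definitions 6 and 7 can be sketched as follows. The function names and the two-component example set are illustrative assumptions; the quadratic filtering shown here is the naive approach and is not meant to reflect the complexity bounds discussed later in the paper.

```python
def strictly_smaller(a, b):
    """True if a precedes b component-wise and a differs from b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def pareto_subset(vectors):
    """Keep the vectors that no other vector of the set strictly precedes."""
    pool = set(vectors)
    return {v for v in pool if not any(strictly_smaller(u, v) for u in pool)}


# Made-up set of vectors: (2, 2) is preceded by (1, 2), the others are incomparable.
example = {(1, 2), (2, 1), (2, 2), (3, 0)}
optimal = pareto_subset(example)
```

In this made-up set, only `(2, 2)` is discarded, since `(1, 2)` is at most as large in every component and differs from it.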
Let us now consider the set of explicatory histories whose event vectors are Pareto-optimal. As opposed to the set of all explicatory histories, this is a finite set, as we will show in Theorem 3. We can thus meaningfully define the problem of computing this set.
Problem 1 (All Pareto-optimal histories).
Input: A synteny tree $\mathcal{T}$ and a species tree $S$.
Output: The set of Pareto-optimal histories explicating $\mathcal{T}$ and $S$.
Even though this set is finite, it may contain a number of optimal histories that is exponential in the size of the input. Rather, consider the set of Pareto-optimal event vectors, i.e., the event vectors of the Pareto-optimal histories. As we will show later (Theorem 4), the number of optimal event vectors is polynomial. We now consider the following problem of reduced complexity:
Problem 2 (All Pareto-optimal vectors).
Input: A synteny tree $\mathcal{T}$ and a species tree $S$.
Output: The set of Pareto-optimal event vectors of the histories explicating $\mathcal{T}$ and $S$.
Now, given a vector of costs $c$ with one cost for each event type, we can associate an overall scalar cost to each history, namely the sum over all event types of the cost of that type multiplied by its number of occurrences in the history.
Problem 3 (Minimum cost).
Input: A synteny tree $\mathcal{T}$, a species tree $S$, and a cost vector $c$.
Output: The minimum cost of any history explicating $\mathcal{T}$ and $S$.
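The link between Problems 2 and 3 can be illustrated with a short sketch: once a set of event vectors is known, the cost of a history is the dot product of the cost vector with its event vector, and a minimum-cost choice is found by scanning the set. With nonnegative costs, a vector that is preceded by another one can never be strictly cheaper, so scanning the Pareto-optimal vectors is enough. The three candidate vectors and the cost values below are made up for the example.

```python
def history_cost(costs, vector):
    """Overall cost: each event count weighted by the cost of its event type."""
    return sum(c * n for c, n in zip(costs, vector))


def cheapest(costs, vectors):
    """Return the minimum cost and the list of vectors that achieve it."""
    best = min(history_cost(costs, v) for v in vectors)
    return best, [v for v in vectors if history_cost(costs, v) == best]


# Three hypothetical, mutually incomparable event vectors with five components.
candidates = [(1, 0, 0, 1, 2), (0, 0, 2, 0, 1), (3, 0, 0, 0, 0)]

result_a = cheapest((1, 1, 4, 4, 1), candidates)
result_b = cheapest((5, 1, 1, 1, 2), candidates)
```

Changing the costs changes which vector wins: the first cost choice favors the vector made of three events of the first type, whereas the second favors the vector dominated by events of the third type.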
Finally, we will call Problems 2’ and 3’ the versions of Problems 2 and 3 where we additionally ask for one history corresponding to each returned optimal event vector (for Problem 2’) or one history for the returned minimum cost (for Problem 3’). For example,
Figure 1 shows in (2) a possible solution for Problem 2’ for the synteny and species trees in (1) for vector
, as well as cost
for Problem 3’.
5. Generating All Pareto-Optimal Histories
We start by addressing Problem 1, which asks to enumerate the set of Pareto-optimal histories. This can be done inductively by building histories from the leaves of the synteny tree up to its root. This inductive characterization is the basis for the dynamic programming formulations introduced for the other problems in the upcoming sections.
The next two definitions are used to build histories by composing partial histories (see
Figure 2). In the following, a node
v of a history such that
and
is denoted as
, or
if the event
associated to
v is known. The node name may be omitted where it is not relevant by simply writing
or
. If
A,
B, and
C are three nodes, then
refers to the triplet tree with Root
C and Leaves
A and
B.
Definition 8 (Partial histories)
. Let be a synteny tree on a species tree S. Let , , and . We define as the set of the Pareto-optimal histories explicating the subtree of rooted at . We also define as the set of Pareto-optimal acyclic histories whose root is and whose only visible leaf is . Formally, Definition 9 (History composition). If and are two histories such that there exists a leaf with and , then we define their composition as the history obtained by replacing l with and merging e with , x with , and s with . This operation is only defined if the resulting history is valid (particularly if no gene family would be gained in two separate events).
When applied to two sets of histories A and B, then is defined by taking the Cartesian product of the two sets and composing each resulting history pair whilst excluding invalid compositions.
For any node v of a synteny tree and any event , syntenies , and species , we define the set of histories starting with as followed by two paths from to and from to , thereby leading to two sub-histories as follows: For example, in
Figure 2, we have
.
This representation of histories as the compositions of sub-histories and paths will now be used to formulate inductive definitions for and , as well as ultimately . In the following, we use to mean .
Theorem 1 (Inductive form of Pareto-optimal histories)
. Let be a synteny tree on S, X be any synteny on , and σ be a node of . If v is a leaf of T, then if and , and otherwise. If v is an internal node of T, then with the following:
Proof. If v is a leaf of T, then the proposition follows directly from Definitions 1 and 4. Assume now that v is an internal node. First, notice that all histories of the sets—for —explicate and S, have their root assigned to X and and are Pareto-optimal.
Let . Since v is an internal node of T, v must be binary in H; hence . Denote the children of v in T as and , and the children of v in H as and . If , then and cannot be strict ancestors of , as otherwise would not be acyclic. Let , , , .
If , then by Definition 1 (Item 1).
Otherwise, let us first show that the synteny contents of both children are chosen appropriately. We have that (resp. ); since, if any (resp. ) was not in (resp. ), we would have multiple gains of g in H, i.e., one above v (since ) and one between v and (since , resp. since ).
If , by Definition 1, at least one of or is equal to X. Without loss of generality (w.l.o.g.), assume that . We should have , as otherwise we would have at least one additional event in the path between v and for losing , thus contradicting the Pareto-optimality of . Hence, .
If , it cannot be that and are both true as such a scenario would lead to a loss on each path between v and and between v and , whereas by choosing or , we save at least one loss. Since and are a partition of X, then if , or otherwise.
Finally, we show that the species of both children are also chosen appropriately.
If , then by Definition 1, .
If , then by Definition 1, either and , or and .
If , if (resp. ), then (resp. ). □
We next consider how to compute the set of histories given any two syntenies X and Y and species and .
Theorem 2 (Pareto-optimal paths). Let X and Y be two syntenies and σ and γ be two species. If γ is a strict ancestor of σ, then . Otherwise, , where the following holds:
if . Otherwise, contains the history made up of a chain of speciations and full losses; preceded by a partial loss event if and followed by a gain event if . If , then also includes a history where the initial partial loss event is replaced by a terminal or event and an unsampled leaf.
. If , all three sets contain histories starting with a event v from to γ. Otherwise, all three sets contain histories starting with a speciation at σ. Letting such that , the speciation is followed by a event v from to γ and a full loss on the side of if . In both cases, the histories end with a gain event if . For , and with a full loss at if . For or , . For , and is a full loss if and . For , and there is a partial loss at if .
if and , or , or . Otherwise, . Each of the three sets contain histories made up of two initial consecutive transfers v and such that and . The histories of and are such that . In , v is followed by a full loss if . In and , the first cut v is complete (i.e., with an empty leaf). In if , then , otherwise the second cut is complete if . In all cases, the histories end with a gain event if .
Proof. If is a strict ancestor of , there can be no acyclic history leading from to ; therefore, .
Otherwise, let , , and let w be the only visible leaf of . We call the main path of . We say that a subtree of H is invisible if it contains only invisible leaves.
First notice that all binary nodes of must be on the main path. Assume that v is a binary node outside of the main path. Both subtrees of v must be invisible. If , then at least one subtree contains a full loss; hence, we can completely replace v and its subtrees with a full loss. Otherwise, we can completely replace v and its subtrees with an unsampled leaf. In both cases, we strictly reduce the number of events in , which contradicts its Pareto-optimality. Also, notice that all binary nodes of the main path have exactly one child whose subtree is invisible.
We show now that may only contain at most two nodes. Assume that contains at least three consecutive events that we denote , , . Let be the child of resulting from the transfer. Consider the history obtained from via the following: (1) Removing all the nodes on the path from to and their invisible subtrees, excluding and themselves but including, in particular, ; (2) Connecting directly to ; (3) Remapping to any unsampled species separated both from and ; (4) Replacing the subtree below by a single unsampled leaf l with and if , or otherwise. Clearly, , and , which contradicts the Pareto-optimality of .
As per Definition 4 (Item 5 and Item 6), any partial loss must be placed at the end of the history, and any gain must be placed at the start. Note that contains either zero (), one (), or two () transfers.
[ contains no transfers.] Note that cannot be separated from since only transfers can reach separate species; thus, is an ancestor of . We start by showing that can only possibly contain a duplication or a cut if that event is on an unsampled leaf species. If v is a node such that and , then the invisible subtree of v contains at least one full loss. In this case, we can remove the invisible subtree and turn v into a partial loss, if needed, otherwise we would replace v with its remaining child. Hence, if contains a duplication or a cut, it must be the last binary event on the main path. Thus, all other binary events must be speciations, and we need exactly of them to reach from . Those speciations must lead to one full loss for each species . Hence, .
[ contains exactly one transfer.] Let v be the only transfer node. If , then there must be a speciation above v so that can be separated from since cannot contain other transfers. Notice that contains no partial losses, duplications, or cuts. In fact, any partial loss can be merged into the transfer. As for duplications and cuts, they can be removed if they do not save any loss, or merged into the transfer otherwise. Apart from the initial speciation if , contains no other speciations as those before (resp. after) the transfer can be removed by redirecting the transfer to start from a higher species (resp. to end at a lower species) that is still separated from (resp. ). If or , then the invisible subtree of v must contain at least one full loss, unless . If and , then either subtree of v must contain at least one full or partial loss if . Hence, .
[ contains exactly two transfers.] Let v and be the only two transfer nodes. If , then must contain at least one full loss; hence, is not Pareto-optimal. If , then, if , we can remove both transfers, and, if , then we can remove , in both cases without introducing additional events. If , then we have the following: If , we can remove without adding new losses; If , then . Hence, we can also remove . If either of v or is such that or , then , as otherwise it is , and there is a full loss below v. In addition, we can exchange the transfer types of v and so that , and , so as to save a full loss. If or , then , as otherwise we can reroute the first transfer toward an unsampled leaf so as to save a full loss. Hence, . □
Finally, the set of all possible histories
can be computed as the union of assignments of the root node of the synteny tree to all possible syntenies and species. This starts with a path from an empty synteny (i.e., an initial gain). In other words,
Hence, using Theorems 1 and 2, one can derive a dynamic programming algorithm to solve Problem 1. Due to the exponential size of the set of solutions, the time complexity of that algorithm is also exponential.
Theorem 3 (Number of minimal histories)
. Let be a synteny tree on a species tree S with gene family set . Then, Proof. Using Theorem 1, we obtain that
with
where
p is an upper bound on the number of possible paths. This is because, in each case of Theorem 1, up to all subsets
are tried along with up to all possible species pairs
, which has a number of nodes directly proportional to
. Using Theorem 2, we obtain
. We obtain the desired result by solving the recurrence. □
6. Polynomial-Time Computation of Pareto-Optimal Event Vectors
We now address Problem 2. Given a synteny tree on a species tree S, similar to the way was previously used to recursively compute , we define for computing as .
To compute these sets of vectors, we replace the algebra of Theorem 1 with an algebra where:
The base cases are $\{\mathbf{0}\}$ (the set containing only the zero vector) and ∅;
$A \oplus B$ is the union of vectors from $A$ and $B$, retaining only the Pareto-optimal ones;
$A \otimes B$ sums the pairs of vectors from $A \times B$, retaining only the Pareto-optimal ones.
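The algebra over Pareto sets of vectors described in the list above can be sketched as follows. This is a naive illustration with assumed function names (`oplus`, `otimes`, `zero`); it filters by pairwise comparison and does not implement the faster algorithms cited later for these two operators.

```python
from itertools import product


def pareto(vectors):
    """Keep only the vectors that no other vector of the set strictly precedes."""
    pool = set(vectors)
    return {
        v for v in pool
        if not any(u != v and all(a <= b for a, b in zip(u, v)) for u in pool)
    }


def oplus(A, B):
    """Choice between alternatives: union of both sets, then Pareto filtering."""
    return pareto(set(A) | set(B))


def otimes(A, B):
    """Combination of sub-results: all pairwise sums, then Pareto filtering."""
    return pareto({tuple(a + b for a, b in zip(x, y)) for x, y in product(A, B)})


def zero(dim=5):
    """Base case for a valid leaf: a single vector with no counted event."""
    return {(0,) * dim}


EMPTY = frozenset()  # base case for an invalid leaf: no vector at all

# Two-component vectors keep the example readable.
A = {(1, 0), (0, 2)}
B = {(0, 1), (2, 0)}
```

With these definitions, `zero` is neutral for `otimes`, `EMPTY` is neutral for `oplus`, and combining anything with `EMPTY` through `otimes` yields no vector, which mirrors the role of the two base cases.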
Additional simplifications can reduce the complexity of computing . We start by showing that, when adapted for Problem 2, it is sufficient to try a constant number of syntenies at each step of the recurrence of Theorem 1. First, let us show that it is sufficient to place the gain event for each gene family at the lowest common ancestor of the leaves they appear in.
Definition 10 (Gain positions)
. Let be a synteny tree on , , , and . We thus define Lemma 1 (Gains at the LCA). If is a synteny tree on , , and , then .
Proof. Let . Let , if there is any. By definition, . Assume w.l.o.g. that . Consider the history in which is removed from and all for , thereby removing any invalid loss event created in the process, and in which the gain event for is moved to be the parent of (potentially merging it with other gains). Then, , since we only potentially removed losses from and the number of gains is not part of the event vector. Let be the history obtained after repeating this process for each . Clearly, ; hence, and . □
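The idea stated before Definition 10, namely that the gain of each gene family can be placed at the lowest common ancestor of the leaves in which the family appears, can be sketched as below. The encoding of the synteny tree as a child-to-parent dictionary, the function names, and the toy data are assumptions made for this example; the sketch only locates the LCA node for each family and does not build a history.

```python
def path_to_root(parent, v):
    """List of nodes from v up to the root, following parent links."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    """Deepest node that is an ancestor of every node in the list."""
    common = path_to_root(parent, nodes[0])  # ordered from deepest to root
    for v in nodes[1:]:
        on_path = set(path_to_root(parent, v))
        common = [a for a in common if a in on_path]
    return common[0]


def gain_positions(parent, leaf_synteny):
    """Map each gene family to the LCA of the leaves whose synteny contains it."""
    families = set().union(*leaf_synteny.values())
    return {
        g: lca(parent, [leaf for leaf, x in leaf_synteny.items() if g in x])
        for g in families
    }


# Toy synteny tree: r -> (u -> (a, b), c), with made-up synteny contents.
parent = {"r": None, "u": "r", "c": "r", "a": "u", "b": "u"}
leaf_synteny = {"a": {"g1", "g2"}, "b": {"g2", "g3"}, "c": {"g3"}}
positions = gain_positions(parent, leaf_synteny)
```

In the toy tree, family `g1` occurs in a single leaf and is gained there, `g2` is shared by the two children of `u`, and `g3` spans both sides of the root.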
In a similar way to Lemma 5 in [
10], we now show that only two synteny contents have to be tried at each step of the recurrence as any synteny larger than the minimal required gene families (as formally defined below) leads to the same set of optimal event vectors.
Definition 11 (Minimal synteny contents)
. For any , we define Lemma 2 (Two choices of synteny contents). Let be a synteny tree on a species tree S and v be a node of T. For any such that and , .
Proof. We proceed by induction on the depth of v in T. If v is a leaf, then no valid history exists; hence, . Otherwise, let , , , and denote (resp. for ) to be the subhistory of below (resp. ).
Let , and . If , then , thus implying that and since and because . This implies that . Using the same argument for , we deduce that .
Suppose that . Let . Noting that , and using the induction hypothesis, we see that . There exists a history such that . The same argument applies to , thereby yielding a history such that .
Suppose that . Then, there must be an event to lose on the path from v to . That event can also be used to lose . In this case, we let . The same argument applies to .
Finally, consider the history that is obtained by replacing with , by , and setting . Clearly, and ; hence, . □
Using the recurrence from Theorem 1 adapted to use the algebra over Pareto-sets of vectors and simplified, as shown in Lemmas 1 and 2, we can obtain a dynamic programming algorithm for solving Problems 2 and 2’.
Theorem 4. Problem 2 can be solved in time and space , and Problem 2’ in time and space , where and are, respectively, the numbers of the nodes in the synteny and species trees.
Proof. Let us first show that
From Theorem 2, the number of events in a history
is in
(attained for histories in
, while those in
and
have a constant number of events). From Theorem 1, the histories
are extensions of the synteny tree
T, which are obtained by inserting such paths; hence, the number of events in such histories is in
. As the vectors have five components, we obtain the desired result by adapting the argument from Lemma 3.1 in [
15].
We solve Problem 2 using a dynamic programming table for . The number of entries in the table is m for v, two for X (as per Lemma 2), and n for . Hence, the space complexity result follows from the bound on the size of shown above.
As for the time complexity, following the recurrence adapted from Theorem 1, to compute each entry we need to consider four options for
Y and
Z, as well as the up to
options for
and
(or
and
) and
. However, it is possible to reduce these
species options to a constant number of operations at each step by simultaneously computing three separate tables
,
, and
, as defined by Weiner and Bansal [
14] and explained in their Algorithm 1 and the proof of their Theorem 1. For each of these options, the ⊕ and ⊗ operators need to be used a constant number of times.
The ⊕ and ⊗ operations on two sets containing
k Pareto-vectors can be implemented in time
[
18] and
, respectively [
15]. In our case, it follows from the bound on the size of
that both operators can be implemented in time
and
, respectively.
To solve Problem 2’, we need to be able to reconstruct one of the histories leading to each Pareto-optimal event vector. To that end, we associate additional pieces of information to each vector: the root node of the history, two pointers to two sub-histories, and two paths that lead to those histories (see
Figure 2). Since a path can contain, at most,
events, the time and space complexities of this method to solve Problem 2’ are obtained by adding a factor of
n to those of Problem 2. □
7. Efficient Computation of Minimum-Cost Histories
We finally address Problem 3. Let be a synteny tree on a species tree S, and let be an event cost vector. Similar to the way through which and were previously used to compute and , we define to compute .
To compute the minimum cost, we replace the algebra of Theorem 1 with an algebra where:
The base cases are 0 and ∞;
$A \oplus B$ is the minimum of $A$ and $B$;
$A \otimes B$ is the sum of $A$ and $B$.
This is the so-called min-plus or “tropical” semiring. Notice that Lemmas 1 and 2 still apply since if = , then .
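One way to see why the same recurrence can be evaluated in the min-plus semiring is that, for a fixed cost vector, scoring a set of event vectors commutes with both operations: the cheapest vector of a union costs the minimum of the two scores, and the cheapest pairwise sum costs the sum of the two scores. The sketch below checks this on made-up data; the names `oplus`, `otimes`, and `score` are assumptions for the example.

```python
import math

INFEASIBLE = math.inf  # base case when no history exists


def oplus(a, b):
    """Choice between alternatives in the min-plus semiring."""
    return min(a, b)


def otimes(a, b):
    """Combination of sub-results in the min-plus semiring."""
    return a + b


def score(costs, vectors):
    """Cheapest cost among a set of event vectors (infinity for an empty set)."""
    return min(
        (sum(c * n for c, n in zip(costs, v)) for v in vectors),
        default=INFEASIBLE,
    )


costs = (2, 3, 3, 2, 1)  # made-up nonnegative costs for five event types
A = {(1, 0, 0, 0, 0), (0, 2, 0, 0, 0)}
B = {(0, 0, 1, 0, 0)}

union = A | B
pairwise_sums = {tuple(x + y for x, y in zip(a, b)) for a in A for b in B}
```

Because only one number is kept per entry instead of a set of vectors, both operations take constant time, which is the source of the improved complexity.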
Theorem 5. Problem 3 can be solved in time and space , and Problem 3’ in time and space , where and . Proof. Both results follow from the proof of Theorem 4 due to the fact that the minimum and sum operators can be computed in constant time, thereby removing the factor from the time complexity, and also due to the fact that only a single value needs to be stored for each entry of the dynamic programming table, thus removing the factor from the space complexity. □
8. Results
The CRISPR–Cas module is an adaptive system that allows prokaryotes to defend against invading viruses and plasmids. Its fame is due to the development of the CRISPR-Cas9 genome editing technology, which is one of the most reliable and accurate “molecular scissors” to date. An important part of any CRISPR-Cas system is the operon of the associated Cas genes playing various roles in the defense machinery. As the microbial function of CRISPR-Cas systems highly depends on the syntenic organization of Cas genes, elucidating the evolution of these syntenies is crucial.
In [
11], taking the Makarova et al. [
19] CRISPR-Cas classification of Class 1 as the synteny tree—and considering a dataset of 15 bacterial species, as well as the species tree topology inferred in [
20]—we recovered an evolutionary history that is broadly consistent with that proposed in [
21], with the CRISPR-Cas emergence inferred at the root of Terrabacteria. However, some inconsistencies were observed. For example, in the Proteobacteria subtree, the
SuperDTL algorithm inferred an unlikely scenario with an ancestral synteny duplication before the LCA of
Shewanella putrefaciens,
Vibrio crassostreae,
Yersinia pseudotuberculosis, and
Escherichia coli, thereby resulting in a succession of three consecutive full synteny losses along the branch to
E. coli.
Here, we used Synesth on the same dataset with the same event costs for
,
, and
events and by choosing intermediate costs for the new
and
events. The used event costs were
. Almost the same history was obtained, but with the above unlikely scenario replaced by a speciation (on the branch separating the ancestor of
Geobacter sulfurreducens from the LCA of
Thioalkalivibrio and
Shewanella putrefaciens) copying the ancestral synteny to an unsampled species, which was later transferred back to
E. coli (see
Figure 3).
Choosing the appropriate event costs constitutes one of the main challenges of tree reconciliation. Slight changes to the event costs may lead to significantly different output histories. We use an approach similar to that developed by Libeskind-Hadas et al. [
15] to display a summary of the solution space over all possible event cost choices. In order to represent the solutions in a 2D plot, we normalized the cost of
events to 1 and set an equal cost for
and
and for
and
. The Pareto-optimal vectors were condensed to three dimensions as follows:
. The resulting plot is given in
Figure 4, in which each color-coded region corresponds to a set of event costs that give rise to exactly the same set of Pareto-optimal histories.
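The partition of the cost space into regions can be approximated by sampling, as in the sketch below. It assumes event vectors already condensed to three components, with the cost of the first component normalized to 1 and the two remaining costs `x` and `y` taken from a grid; the vectors, the grid, and the choice of which component is normalized are all hypothetical. Integer costs are used so that ties are detected exactly.

```python
from collections import defaultdict


def cost_regions(vectors, xs, ys):
    """Group sampled cost choices (x, y) by the set of vectors optimal for them."""
    regions = defaultdict(list)
    for x in xs:
        for y in ys:
            costs = (1, x, y)  # first cost normalized to 1 (assumption)
            scored = {v: sum(c * n for c, n in zip(costs, v)) for v in vectors}
            best = min(scored.values())
            optimal = frozenset(v for v, s in scored.items() if s == best)
            regions[optimal].append((x, y))
    return dict(regions)


# Made-up condensed Pareto-optimal vectors and a small integer grid of costs.
vectors = [(4, 0, 0), (1, 1, 0), (0, 0, 2)]
regions = cost_regions(vectors, xs=[1, 2, 4], ys=[1, 2])
```

Each key of `regions` plays the role of one colored region: all the sampled cost choices listed under it select exactly the same optimal vectors. Grid points where several vectors tie, such as `(1, 1)` here, lie on boundaries between regions.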
9. Conclusions
Synesth is a flexible tool for tree reconciliation, which allows for a wide range of segmental events, addresses the inevitable incompleteness of the input dataset in terms of unsampled species, and offers various optimization and output criteria to the user. Moreover, its time complexity brings it up to the level of the most time-efficient reconciliation algorithms, such as
ecceTERA [
12] and
RANGER-DTLx [
14], but for an evolutionary model with events involving sets of genes rather than single genes.
The inductive characterization of Pareto-optimal histories allows for an exhaustive exploration of the solution space. Taking advantage of this flexibility, future extensions of the computational aspects of this work may address the problem of formally characterizing this space in terms of constructing equivalence classes or normalized histories, thus uniformly sampling the space of histories and assigning confidence values to predicted histories.
Further extensions of the model would also be worth investigating. For example, representing syntenies as sets does not capture the information of gene orders and multiplicities. Accounting for gene orders would require allowing rearrangement events, which may significantly increase the computational complexity of the problem. Allowing gene repetitions inside syntenies would require representing them as multisets, which would break some of the assumptions required for the algorithmic approach presented in this work. Probably the most questionable limitation of the model, however, is the absence of synteny fusions while synteny fissions are allowed; the model thus favors large syntenies up to the root of the tree. Note that including fusions would require adding reticulated nodes. It will be interesting to see whether our dynamic programming scheme can be generalized to such phylogenetic networks or whether this makes the problem NP-hard, in which case tree decomposition methods may be explored [
22].
Another important challenge not addressed in this paper is how to obtain an input synteny tree. In fact, phylogenetic methods instead output sets of gene trees—one for each gene family. If the individual gene trees are “consistent”, i.e., with no contradictory phylogenetic information, then a tree displaying them all can be obtained. However, even in this case, there may be an exponential number of such supertrees. In [
10], the suggested solution was to test each possible supertree and retain the one leading to the most parsimonious reconciliation. An alternative would be to simultaneously construct and reconcile a supertree with a given species tree. This opens the door to interesting future investigations.