Synesth: Comprehensive Syntenic Reconciliation with Unsampled Lineages

Delabre, Mattéo; El-Mabrouk, Nadia

doi:10.3390/a17050186


by Mattéo Delabre * and Nadia El-Mabrouk *

Département d’Informatique et de Recherche Opérationnelle, Université de Montréal, 2920 Chemin de la Tour, Montréal, QC H3T 1J4, Canada

* Authors to whom correspondence should be addressed.

Algorithms 2024, 17(5), 186; https://doi.org/10.3390/a17050186

Submission received: 14 March 2024 / Revised: 20 April 2024 / Accepted: 25 April 2024 / Published: 29 April 2024

(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)


Abstract:

We present Synesth, the most comprehensive and flexible tool for tree reconciliation that allows for events on syntenies (i.e., on sets of multiple genes), including duplications, transfers, fissions, and transient events going through unsampled species. This model allows for building histories that explicate the inconsistencies between a synteny tree and its associated species tree. We examine the combinatorial properties of this extended reconciliation model and study various associated parsimony problems. First, the infinite set of explicatory histories is reduced to a finite but exponential set of Pareto-optimal histories (in terms of counts of each event type), then to a polynomial set of Pareto-optimal event count vectors, and this eventually ends with minimum event cost histories given an event cost function. An inductive characterization of the solution space using different algebras for each granularity leads to efficient dynamic programming algorithms, ultimately ending with an $O(mn)$ time complexity algorithm for computing the cost of a minimum-cost history ($m$ and $n$: number of nodes in the input synteny and species trees). This time complexity matches that of the fastest known algorithms for classical gene reconciliation with transfers. We show how Synesth can be applied to infer Pareto-optimal evolutionary scenarios for CRISPR-Cas systems in a set of bacterial genomes.

Keywords:

tree reconciliation; algebraic dynamic programming; multi-objective optimization; horizontal gene transfer; synteny; fission; unsampled lineages

1. Introduction

A gene/species tree reconciliation is an embedding of a gene tree into its associated species tree that explicates the topological differences between the two trees through a sequence of events shaping the gene family inside its host species. Gene/species tree reconciliation has been widely studied since the 1980s [1], first by focusing on duplication and loss events and then by extending to horizontal gene transfers [2,3,4] and other events such as hybridization or incomplete lineage sorting (see the review in [5]). One of the major drawbacks of classical reconciliation is that gene families are considered separately from one another, which is not appropriate for genes organized in syntenies, i.e., colocalized genes likely to have evolved together through segmental events.

Although some work has been done to infer the evolution of adjacencies [6], to group individual events into segmental ones [7], or to minimize “duplication episodes” [8,9], none of these methods is intended to explicitly look for evolutionary scenarios that minimize segmental events. We presented the first algorithm that generalizes reconciliation to synteny trees (i.e., with leaves representing syntenies rather than single genes) and segmental events for the duplication–loss distance in [10], and we extended it to horizontal transfers with SuperDTL in [11].

Here, we present Synesth (for SYNteny Evolution in SegmenTal Histories), an extended syntenic reconciliation model accounting for fissions, whereby part of a synteny is detached to another locus or species, in addition to losses, gains, duplications, and transfers. Moreover, as the choice of the species included in a study (or, equivalently, the absence of species that are not chosen, unsampled, or extinct) has a serious impact on the output of a reconciliation algorithm [12,13,14], we account for transfers going through unsampled species.

Among the infinite space of possible histories explicating the evolution of a synteny tree inside a species tree when both are given as an input, we are interested in selecting the most likely ones. A possible way to achieve this is by assigning costs (or probabilities) to event types and selecting histories minimizing (resp. maximizing) the overall cost (resp. probability). The resulting histories can vary significantly depending on the chosen costs, which are usually challenging to determine and strongly depend on the taxa under study [15,16]. Instead, our goal here is to provide an overview of the optimal histories for all possible cost choices.

Our approach is to progressively subdivide the history space. First, the space is reduced to a finite but exponential size set of Pareto-optimal histories (in terms of counts of each event type), then to a polynomial size set of Pareto-optimal event count vectors, and eventually to the single cost of an optimal history. We develop efficient and exact algorithms to compute the optimal histories for each subdivision level using algebraic dynamic programming. The inductive characterization of Pareto-optimal histories, which is given in Section 5, leads to a polynomial-time dynamic programming algorithm to output all Pareto-optimal vectors (Section 6), and then ultimately to an $O(mn)$ time complexity algorithm (where $m$ and $n$ are the numbers of nodes in the input synteny and species trees, respectively) to output the cost of a minimum-cost history (Section 7), thereby matching the time complexity of the fastest algorithms for the classical duplication–transfer–loss reconciliation [14]. An implementation of the algorithm is available at: https://github.com/UdeM-LBIT/superrec2/tree/algo2024 (accessed on 20 April 2024).

In Section 8, we apply Synesth to study the evolution of CRISPR-associated (Cas) gene syntenies. Building on previous work [15,17], we present a visualization of the solution landscape as a partition of the space of cost choices into regions of equivalent costs leading to the same set of optimal histories.

We first introduce the required notations in Section 2, the evolutionary model in Section 3, and then we consider the solution space subdivision and optimization problems in Section 4.

2. Preliminary Notation

All trees in this work are rooted. Given a tree $T$, we denote by $r(T)$ its root, by $V(T)$ its node set, by $L(T) \subseteq V(T)$ its leaf set, and we let $I(T) = V(T) - L(T)$ be the set of its internal nodes. A node $v'$ is an ancestor of $v$ if $v'$ is on the path from $r(T)$ to $v$, and the parent $p(v)$ of $v$, of which $v$ is a child, directly precedes $v$ on this path. Conversely, $v$ is a descendant of $v'$. This ancestor–descendant relation is denoted as $\leq$ (so that $v' \leq v$ when $v'$ is an ancestor of $v$) and forms a partial order on the nodes, in which the root is minimal and the leaves are maximal. Any pair of nodes $v$ and $v'$ not ordered by this relation are said to be separated, which we denote as $v \parallel v'$. Notice that a node $v$ is both an ancestor and a descendant of itself; whenever this case needs to be excluded, we will talk about strict ancestors and strict descendants.

We denote by $E(T)$ the set of edges of $T$, each of which is represented by a pair of nodes $(p(v), v)$. For any two nodes $v$ and $v'$ of $T$, there exists a unique path from $v$ to $v'$ that we denote $P_{T}(v, v') \subseteq E(T)$. The distance between $v$ and $v'$ is defined as $D_{T}(v, v') = |P_{T}(v, v')|$. Given a node $v$ of $T$, $T[v]$ is the subtree of $T$ rooted at $v$ (i.e., containing only the descendants of $v$). The lowest common ancestor (LCA) of a subset $V$ of nodes, denoted as $\mathrm{lca}_{T}(V)$, is the common ancestor of all nodes in $V$ that is the most distant from the root.
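The notation above can be made concrete with a small sketch. It assumes a rooted tree stored as a child-to-parent map; the function names and the example tree are hypothetical and only illustrate the ancestor relation, separation, the LCA, and the distance as defined in this section.

```python
# Minimal sketch (hypothetical helper names): a rooted tree is stored as a
# dictionary mapping each node to its parent, with None for the root.

def ancestors(parent, v):
    """Nodes on the path from v up to the root, v included (v is its own ancestor)."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def is_ancestor(parent, u, v):
    """True if u <= v, i.e., u lies on the path from the root to v."""
    return u in ancestors(parent, v)

def separated(parent, u, v):
    """True if u and v are not ordered by the ancestor relation (u || v)."""
    return not is_ancestor(parent, u, v) and not is_ancestor(parent, v, u)

def lca(parent, nodes):
    """Common ancestor of all given nodes that is the most distant from the root."""
    nodes = list(nodes)
    common = ancestors(parent, nodes[0])  # ordered from the node up to the root
    for v in nodes[1:]:
        up = set(ancestors(parent, v))
        common = [a for a in common if a in up]
    return common[0]

def distance(parent, u, v):
    """Number of edges on the unique path between u and v."""
    depth = lambda w: len(ancestors(parent, w)) - 1
    a = lca(parent, [u, v])
    return (depth(u) - depth(a)) + (depth(v) - depth(a))

# Example tree: r has children a and b; a has children c and d.
PARENT = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
```

With this example tree, `c` and `b` are separated, their LCA is the root `r`, and the path between them has three edges.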

A tree $T'$ is said to be an extension of a tree $T$ if $T'$ can be obtained from $T$ by a sequence of operations among the following: (1) Subdividing an edge $(u, w)$ by adding a new node $v$ and replacing $(u, w)$ by two edges $(u, v)$ and $(v, w)$; (2) Grafting a new node $v$ below an existing node $u$ by adding the edge $(u, v)$; (3) Rerooting the tree to a new node $u$ by adding the edge $(u, r(T))$.

The set of children of any node $v$ is denoted by $\mathrm{ch}(v)$. If $|\mathrm{ch}(v)| = 1$, then $v$ is said to be unary and we denote its only child by $v_{c}$. If $|\mathrm{ch}(v)| = 2$, then it is said to be binary and, unless specified otherwise, we denote its children by $v_{\ell}$ and $v_{r}$ in no particular order. A binary tree is a tree where all internal (non-leaf) nodes are binary. If all internal nodes are unary or binary, then the tree is partially binary.

If $\mathcal{F}$ is a set of gene families, then a synteny on $\mathcal{F}$ is a subset $X \subseteq \mathcal{F}$, which represents a group of genes assumed to have jointly evolved. Notice that we ignore the relative order of genes in the genomic region, as well as the physical distance between genes and regions. The genes of a synteny are considered to all belong to different gene families (i.e., repeated gene families inside a synteny are forbidden); therefore, a gene is simply identified by the family $\Gamma \in \mathcal{F}$ it belongs to.

A species tree $S$ on a set $\Sigma$ of species is a binary tree with a bijection between $L(S)$ and $\Sigma$. For a set of syntenies $\mathcal{X}$, a synteny tree $\mathcal{T} = \langle T, x, s \rangle$ is a tuple where $T$ is a binary tree, and $x : L(T) \to \mathcal{X}$ and $s : L(T) \to \Sigma$ are two functions, the first giving the synteny at each leaf and the second indicating the species to which each synteny belongs.

Finally, the restriction of a function $f$ to a subset $A$ of its domain is denoted as $f|_{A}$.

3. Evolutionary Histories for Syntenies

We model the evolution of syntenies through the following syntenic events: codivergence with the host species (“Spe”), duplication of a synteny subset (“Dup”), fission of a synteny (“Cut”), transfer of a duplicated or cut subset (resp. “TrDup” or “TrCut”), and the gain or loss of a subset (“Gain”, “Loss”). Losses can be partial in the sense that only a subset of genes in a synteny are lost. Evolutionary histories are sequences of such events, as formally defined below (see example in Figure 1).

Definition 1 (Evolutionary history for syntenies). A history $\mathcal{H}$ on a species tree $S$ is a tuple $\langle H, e, x, s \rangle$, where $H$ is a partially binary tree. Each node $v \in V(H)$ is labeled with a species $s(v) \in V(S)$ and a synteny (i.e., a subset of gene families) $x(v)$. Each internal node is additionally labeled with an event $e(v) \in E = \{\mathrm{Spe}, \mathrm{Dup}, \mathrm{Cut}, \mathrm{TrDup}, \mathrm{TrCut}, \mathrm{Gain}, \mathrm{Loss}\}$ acting on $x(v)$ and $s(v)$. These labels satisfy the following conditions:

1. If $e(v) = \mathrm{Spe}$, with $\mathrm{ch}(v) = \{v', v''\}$ and $\sigma = s(v)$, then $x(v) = x(v') = x(v'')$, $s(v') = \sigma_{\ell}$, and $s(v'') = \sigma_{r}$.

2. If $e(v) \in \{\mathrm{Dup}, \mathrm{Cut}, \mathrm{TrDup}, \mathrm{TrCut}\}$, with $\mathrm{ch}(v) = \{v_{t}, v_{k}\}$:

   1. If $e(v) \in \{\mathrm{Dup}, \mathrm{TrDup}\}$, then $x(v_{t}) \subseteq x(v) = x(v_{k})$;
   2. If $e(v) \in \{\mathrm{Cut}, \mathrm{TrCut}\}$, then $x(v_{t}) \cup x(v_{k}) = x(v)$, $x(v_{t}) \cap x(v_{k}) = \emptyset$, and $x(v_{t}) \neq \emptyset$ (but $x(v_{k}) = \emptyset$ is allowed);
   3. If $e(v) \in \{\mathrm{Dup}, \mathrm{Cut}\}$, then $s(v) = s(v_{t}) = s(v_{k})$;
   4. If $e(v) \in \{\mathrm{TrDup}, \mathrm{TrCut}\}$, then $s(v) \parallel s(v_{t})$, and $s(v) = s(v_{k})$.

3. If $e(v) \in \{\mathrm{Gain}, \mathrm{Loss}\}$ with $\mathrm{ch}(v) = \{v_{c}\}$, then $s(v_{c}) = s(v)$ and the following:

   1. If $e(v) = \mathrm{Gain}$, then $x(v_{c}) \supsetneq x(v)$.
   2. If $e(v) = \mathrm{Loss}$, then $x(v_{c}) \subsetneq x(v)$ (a loss is full if $x(v_{c}) = \emptyset$, and partial otherwise).

4. For each gene family $\Gamma$, exactly one $\mathrm{Gain}$ event in $H$ involves $\Gamma$.

5. The only nodes $v$ of the history such that $x(v) = \emptyset$ are its root and its leaves.

Finally, we denote by $\mathrm{Tr}$ any event in $\{\mathrm{TrDup}, \mathrm{TrCut}\}$.
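The synteny-content conditions of Definition 1 are local to each internal node, so they can be checked one node at a time. The following is a minimal sketch under stated assumptions: syntenies are Python frozensets of gene family names, the function name `contents_ok` is hypothetical, and for binary non-speciation events the children are passed in the order (transferred or duplicated child, kept child). Species conditions and the global Conditions (4) and (5) are deliberately not checked.

```python
# Minimal sketch: check only the synteny-content part of Definition 1
# for a single internal node (species labels are ignored here).

def contents_ok(event, x_v, children):
    """event: event label of node v; x_v: synteny at v;
    children: list of child syntenies, ordered (x(v_t), x(v_k)) for binary
    non-speciation events (an assumed calling convention)."""
    if event == "Spe":
        return len(children) == 2 and all(c == x_v for c in children)
    if event in ("Dup", "TrDup"):
        x_t, x_k = children
        return x_t <= x_v and x_k == x_v          # x(v_t) subset of x(v) = x(v_k)
    if event in ("Cut", "TrCut"):
        x_t, x_k = children
        return (x_t | x_k) == x_v and not (x_t & x_k) and len(x_t) > 0
    if event == "Gain":
        (x_c,) = children
        return x_c > x_v                          # strict superset
    if event == "Loss":
        (x_c,) = children
        return x_c < x_v                          # strict subset
    raise ValueError(f"unknown event: {event}")

AB = frozenset({"a", "b"})
A = frozenset({"a"})
B = frozenset({"b"})
EMPTY = frozenset()
```

For instance, a cut of `{a, b}` into `{a}` and `{b}` passes the check, as does a complete cut that keeps nothing (`x(v_k)` empty), whereas a cut with an empty `x(v_t)` is rejected.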

Notice that, when the sets $x(v)$ are restricted to at most one gene family each, this model reduces to the classical reconciliation model of [14], with $\mathrm{TrDup}$ corresponding to $\mathrm{T}$ (transfer) and $\mathrm{TrCut}$ to $\mathrm{TL}$ (transfer–loss). Additionally, from any syntenic history on a set of gene families $\mathcal{F}$, one can extract a reconciled gene tree for each gene family $\Gamma \in \mathcal{F}$ whose root is the gain event for $\Gamma$ and leaves are the loss events where $\Gamma$ is lost. This root is unique because of Condition (4), which excludes convergent gains.

Moreover, as in [14], we will allow for transfers to and from unsampled or extinct species by augmenting the species tree. For example, in Figure 1, Synteny $\{1, 2\}$ in Species A is the result of a transfer from an unsampled species. In the following, unless specified otherwise, all histories are on augmented species trees, as formally defined below.

Definition 2 (Augmented species tree). A tree $S$ can be augmented into $S^{*}$ by adding unsampled leaves as follows: (1) Subdivide each edge $(v, v')$ of $E(S)$ into two edges $(v, z)$ and $(z, v')$ linked to a new node $z$; (2) Connect each $z$ to a new unsampled leaf; (3) Create a new root $r(S^{*})$ whose two children are $r(S)$ and a new unsampled leaf. Edges leading to unsampled leaves are called unsampled edges.
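The three steps of Definition 2 can be sketched directly. The code below assumes that the species tree is given as a dictionary mapping every node (leaves included) to its list of children; the function name `augment` and the generated node names (`z(...)`, `u1`, `u2`, ..., `root*`) are hypothetical and are assumed not to clash with existing node names.

```python
# Minimal sketch of Definition 2 (hypothetical names throughout).

def augment(children, root):
    """Return (children map of S*, root of S*, set of unsampled leaves)."""
    aug = {v: [] for v in children}   # original nodes, edges re-added below
    unsampled = set()
    counter = 0

    def new_unsampled():
        nonlocal counter
        counter += 1
        name = f"u{counter}"
        aug[name] = []
        unsampled.add(name)
        return name

    # Steps (1) and (2): subdivide each edge (v, w) with a new node z,
    # and attach a new unsampled leaf to z.
    for v, kids in children.items():
        for w in kids:
            z = f"z({v},{w})"
            aug[z] = [w, new_unsampled()]
            aug[v].append(z)

    # Step (3): new root whose children are r(S) and a new unsampled leaf.
    new_root = "root*"
    aug[new_root] = [root, new_unsampled()]
    return aug, new_root, unsampled

# Example: a species tree with root r and two leaves A and B.
S = {"r": ["A", "B"], "A": [], "B": []}
S_STAR, ROOT_STAR, UNSAMPLED = augment(S, "r")
```

On this two-leaf example, the augmented tree has nine nodes, three of which are unsampled leaves, and it remains binary.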

Finally, notice that Definition 1 does not require the species involved in a transfer to be contemporary, nor does it forbid biologically infeasible cyclic histories, such as one resulting from a transfer from a Species A to a Species B, and then back from Species B to an ancestor of Species A (for a more precise definition of acyclicity, see [4]). This limitation is necessary to make the computational problems of the next section tractable.

4. Explicatory Histories and Optimization Problems

The goal of reconciliation is to infer histories that explicate the topology of a synteny tree given a tree of the corresponding species. Such histories are extensions of the synteny tree that map all leaves to the appropriate species and syntenies without introducing new visible leaves.

Definition 3 (Visible leaves). A leaf $l$ of a history $\mathcal{H} = \langle H, e, x, s \rangle$ is said to be visible if $x(l) \neq \emptyset$ and $s(l) \in V(S)$, and invisible otherwise. The set of visible leaves of $\mathcal{H}$ is denoted as $L_{V}(\mathcal{H})$.

For example, in Figure 1, the history in (2) explicates the trees in (1). In that history, the only invisible leaf is the unnamed leaf in the hatched region (representing an unsampled species). The leaves below full losses would also be invisible, but that example history does not contain any full loss. The following is a formal definition of an explicatory history:

Definition 4 (Explicatory histories). For a species tree $S$, a history $\mathcal{H} = \langle H, e, x, s \rangle$ on $S^{*}$ is said to explicate a synteny tree $\mathcal{T} = \langle T, x', s' \rangle$ and $S$ if the following holds true:

1. $H$ is an extension of $T$.
2. $x' = x|_{L(T)}$ and $s' = s|_{L(T)}$.
3. $L_{V}(\mathcal{H}) = L(T)$.
4. No $(u, v) \in E(T)$ is such that $s(v) < s(u)$.
5. No gain event is the parent of a node $v \in V(H) - V(T)$.
6. No partial loss event is a child of a node $v \in V(H) - V(T)$.

The set of all such histories is denoted by $\mathcal{H}(\mathcal{T}, S)$.

Condition (3) disallows introducing new visible leaves. Condition (4) excludes assignments of species that create cycles between adjacent nodes of the synteny tree. This condition is a necessary, but not sufficient, condition for acyclicity. Imposing a full acyclicity condition would lead to computationally intractable problems [4].

As for Conditions (5) and (6), they are introduced to avoid having multiple histories with the same events, but with gains and losses distributed differently between the adjacent nodes of the synteny tree. More precisely, Condition (5) requires sifting gains down and merging them until they are the parent of a synteny tree node; conversely, Condition (6) requires sifting losses up and merging them until they are the child of a synteny tree node.

Note that $\mathcal{H}(\mathcal{T}, S)$ is infinite: given any explicatory history, it is always possible to extend it into a larger one by introducing superfluous duplications or transfers. We next define a way to reduce this space to a finite one.

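Definition 5 (Event vector). Let $\mathcal{H} = \langle H, e, x, s \rangle$ be a history. We define $\mathrm{ev}(\mathcal{H}) = (c_{\mathrm{Dup}}, c_{\mathrm{Cut}}, c_{\mathrm{TrDup}}, c_{\mathrm{TrCut}}, c_{\mathrm{Loss}}) \in \mathbb{N}^{5}$ as the vector such that $c_{e}$ is the number of events of type $e \in E$ in $\mathcal{H}$.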

As usual for reconciliation, our definition of an event vector excludes the number of speciations since they do not allow one to meaningfully distinguish between histories. Notice that the number of gains is also excluded, as needed to make the problem of Section 6 tractable. Consequently, we disregard taking advantage of the simultaneous gains of multiple genes to reduce the overall number of individual gain events.

Definition 6 (Order on vectors and histories). For two event vectors $(a_{i})_{1 \leq i \leq n}$ and $(b_{i})_{1 \leq i \leq n}$, $(a_{i}) \preceq (b_{i})$ if for all $i \in \{1, \ldots, n\}$, $a_{i} \leq b_{i}$. This partial order induces another one on the histories, namely $\mathcal{H} \preceq \mathcal{H}'$ if $\mathrm{ev}(\mathcal{H}) \preceq \mathrm{ev}(\mathcal{H}')$.

Definition 7 (Pareto optimality). For any set $E$ with a partial order $\preceq$, we define its Pareto subset as ${\min}_{\preceq} E = \{x \in E \mid \forall y \in E, (x \neq y) \to \neg (y \preceq x)\}$.

As an example, in the set of vectors $A = \{(1, 2), (2, 2), (3, 1)\} \subseteq \mathbb{N}^{2}$, we have $(1, 2) \preceq (2, 2)$ while $(1, 2)$ and $(3, 1)$ are not comparable; hence, ${\min}_{\preceq} A = \{(1, 2), (3, 1)\}$.
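The Pareto subset of a finite set of vectors can be computed by a direct transcription of Definition 7. The sketch below is a naive quadratic filter with hypothetical function names; it is meant to illustrate the definition on the example set $A$, not to reflect how any implementation computes it.

```python
# Minimal sketch of Definitions 6 and 7 on tuples of integers.

def leq(a, b):
    """Componentwise order: a precedes b if a_i <= b_i for every i."""
    return all(x <= y for x, y in zip(a, b))

def pareto_min(vectors):
    """Vectors x such that no other vector y of the set satisfies y <= x."""
    vs = set(vectors)
    return {x for x in vs if not any(y != x and leq(y, x) for y in vs)}

A = {(1, 2), (2, 2), (3, 1)}
print(pareto_min(A) == {(1, 2), (3, 1)})  # prints True
```

Here `(2, 2)` is discarded because `(1, 2)` is componentwise smaller, while `(1, 2)` and `(3, 1)` are incomparable and are both kept.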

Let us now consider the set $\mathcal{H}^{\min}(\mathcal{T}, S) = {\min}_{\preceq} \mathcal{H}(\mathcal{T}, S)$ of histories whose event vectors are Pareto-optimal. As opposed to $\mathcal{H}(\mathcal{T}, S)$, $\mathcal{H}^{\min}$ is a finite set, as we will show in Theorem 3. We can thus meaningfully define the problem of computing this set.

Problem 1 (All Pareto-optimal histories).
Input: A synteny tree $\mathcal{T}$ and a species tree $S$.
Output: The set $\mathcal{H}^{\min}(\mathcal{T}, S)$ of Pareto-optimal histories explicating $\mathcal{T}$ and $S$.

Even though $\mathcal{H}^{\min}$ is finite, it may contain a number of optimal histories that is exponential in $|V(T)|$. Rather, consider the set $\mathrm{ev}^{\min}(\mathcal{T}, S) = {\min}_{\preceq} \{\mathrm{ev}(\mathcal{H}) \mid \mathcal{H} \in \mathcal{H}(\mathcal{T}, S)\}$ of Pareto-optimal event vectors. As we will show later (Theorem 4), the number of optimal event vectors in $\mathrm{ev}^{\min}$ is polynomial. We now consider the following problem of reduced complexity:

Problem 2 (All Pareto-optimal vectors).
Input: A synteny tree $\mathcal{T}$ and a species tree $S$.
Output: The set $\mathrm{ev}^{\min}(\mathcal{T}, S)$.

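Now, given a vector of costs for each event type $\delta = (\delta_{\mathrm{Dup}}, \delta_{\mathrm{Cut}}, \delta_{\mathrm{TrDup}}, \delta_{\mathrm{TrCut}}, \delta_{\mathrm{Loss}}) \in (\mathbb{R}^{+} \cup \{\infty\})^{5}$, we can associate an overall scalar cost $c(\mathcal{H}) = \delta \cdot \mathrm{ev}(\mathcal{H})$ to each history.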

Problem 3 (Minimum cost).
Input: A synteny tree $\mathcal{T}$, a species tree $S$, and a vector $\delta \in (\mathbb{R}^{+} \cup \{\infty\})^{5}$.
Output: The minimum cost $c^{\min}(\mathcal{T}, S) = \min \{c(\mathcal{H}) \mid \mathcal{H} \in \mathcal{H}(\mathcal{T}, S)\}$ of any history explicating $\mathcal{T}$ and $S$.

Finally, we will call Problems 2’ and 3’ the versions of Problems 2 and 3 where we additionally ask for one history corresponding to each returned optimal event vector (for Problem 2’) or one history for the returned minimum cost (for Problem 3’). For example, Figure 1 shows in (2) a possible solution for Problem 2’ for the synteny and species trees in (1) for vector $(c_{\mathrm{Dup}} = 1, c_{\mathrm{Cut}} = 0, c_{\mathrm{TrDup}} = 1, c_{\mathrm{TrCut}} = 1, c_{\mathrm{Loss}} = 1)$, as well as cost $(\delta_{\mathrm{Dup}} = 2, \delta_{\mathrm{Cut}} = 2.5, \delta_{\mathrm{TrDup}} = 3, \delta_{\mathrm{TrCut}} = 3.5, \delta_{\mathrm{Loss}} = 1)$ for Problem 3’.
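The scalar cost $c = \delta \cdot \mathrm{ev}$ is a plain dot product, which the following sketch evaluates on the event vector and cost vector quoted above for Figure 1. The function name is hypothetical. One detail is an assumption of this sketch rather than something stated in the text: an event type with an infinite cost and a zero count is taken to contribute nothing to the total.

```python
import math

# Order of the components of an event vector, as in Definition 5.
EVENTS = ("Dup", "Cut", "TrDup", "TrCut", "Loss")

def history_cost(delta, ev):
    """Dot product of a cost vector and an event vector.
    Assumed convention: a zero count contributes 0 even if its cost is infinite."""
    total = 0.0
    for d, c in zip(delta, ev):
        if c:
            total += d * c
    return total

ev_fig1 = (1, 0, 1, 1, 1)            # event vector quoted for Figure 1
delta_fig1 = (2, 2.5, 3, 3.5, 1)     # cost vector quoted for Figure 1
print(history_cost(delta_fig1, ev_fig1))  # 2 + 0 + 3 + 3.5 + 1 = 9.5
```

Setting a cost to `math.inf` forbids the corresponding event type: any history using it gets an infinite cost, while histories that avoid it are unaffected.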

5. Generating All Pareto-Optimal Histories

We start by addressing Problem 1, which asks to enumerate the set $\mathcal{H}^{\min}$. This can be done inductively by building histories from the leaves of the synteny tree up to its root. This result is the basis for the dynamic programming formulations introduced for the other problems in the upcoming sections.

The next two definitions are used to build histories by composing partial histories (see Figure 2). In the following, a node $v$ of a history such that $x(v) = X$ and $s(v) = \sigma$ is denoted as $v[X, \sigma]$, or $v[e, X, \sigma]$ if the event $e \in E$ associated to $v$ is known. The node name may be omitted where it is not relevant by simply writing $[X, \sigma]$ or $[e, X, \sigma]$. If $A$, $B$, and $C$ are three nodes, then $(A, B)\,C$ refers to the triplet tree with root $C$ and leaves $A$ and $B$.

Definition 8 (Partial histories). Let $\mathcal{T} = \langle T, x, s \rangle$ be a synteny tree on a species tree $S$. Let $v \in V(T)$, $X \subseteq \mathcal{F}$, and $\sigma \in V(S^{*})$. We define $h(v, X, \sigma)$ as the set of the Pareto-optimal histories explicating the subtree of $\mathcal{T}$ rooted at $v[X, \sigma]$. We also define $\mathrm{path}([X, \sigma], [Y, \gamma])$ as the set of Pareto-optimal acyclic histories whose root is $[X, \sigma]$ and whose only visible leaf is $[Y, \gamma]$. Formally,

$$
\begin{aligned}
h(v, X, \sigma) &= \{\langle H, e, x, s \rangle \in \mathcal{H}^{\min}(\mathcal{T}[v], S) \mid (r(H) = v) \land (x(v) = X) \land (s(v) = \sigma)\}, \\
\mathrm{path}([X, \sigma], [Y, \gamma]) &= {\min}_{\preceq} \{\mathcal{H} = \langle H, e, x, s \rangle \mid (x(r(H)) = X) \land (s(r(H)) = \sigma) \\
&\qquad \land (L_{V}(\mathcal{H}) = \{l\}) \land (x(l) = Y) \land (s(l) = \gamma) \land \mathcal{H} \text{ is acyclic}\}.
\end{aligned}
$$

Definition 9 (History composition). If $\mathcal{H} = \langle H, e, x, s \rangle$ and $\mathcal{H}' = \langle H', e', x', s' \rangle$ are two histories such that there exists a leaf $l \in L(H)$ with $x(l) = x(r(H'))$ and $s(l) = s(r(H'))$, then we define their composition $\mathcal{H} \otimes \mathcal{H}'$ as the history obtained by replacing $l$ with $r(H')$ and merging $e$ with $e'$, $x$ with $x'$, and $s$ with $s'$. This operation is only defined if the resulting history is valid (particularly if no gene family would be gained in two separate $\mathrm{Gain}$ events).

When applied to two sets of histories $A$ and $B$, $A \otimes B$ is defined by taking the Cartesian product of the two sets and composing each resulting history pair whilst excluding invalid compositions.

For any node $v$ of a synteny tree $\mathcal{T}$ and any event $e \in E$, syntenies $X, Y, Y', Z, Z' \subseteq \mathcal{F}$, and species $\sigma, \gamma_{\ell}, \gamma'_{\ell}, \gamma_{r}, \gamma'_{r} \in V(S^{*})$, we define the set of histories starting with $v[e, X, \sigma]$ as followed by two paths from $[Y', \gamma'_{\ell}]$ to $v_{\ell}[Y, \gamma_{\ell}]$ and from $[Z', \gamma'_{r}]$ to $v_{r}[Z, \gamma_{r}]$, thereby leading to two sub-histories as follows:

$$
\begin{aligned}
& M(v[e, X, \sigma], [Y', \gamma'_{\ell}], v_{\ell}[Y, \gamma_{\ell}], [Z', \gamma'_{r}], v_{r}[Z, \gamma_{r}]) \\
&\quad = ([Y', \gamma'_{\ell}], [Z', \gamma'_{r}])\, v[e, X, \sigma] \otimes (\mathrm{path}([Y', \gamma'_{\ell}], v_{\ell}[Y, \gamma_{\ell}]) \otimes h(v_{\ell}, Y, \gamma_{\ell})) \\
&\qquad \otimes (\mathrm{path}([Z', \gamma'_{r}], v_{r}[Z, \gamma_{r}]) \otimes h(v_{r}, Z, \gamma_{r})).
\end{aligned}
$$

For example, in Figure 2, we have $\mathcal{H} = M(v[\mathrm{Spe}, X, \sigma], [X, \sigma_{\ell}], v_{\ell}[Y, \gamma_{\ell}], [X, \sigma_{r}], v_{r}[Z, \gamma_{r}])$.

This representation of histories as the compositions of sub-histories and paths will now be used to formulate inductive definitions for $h(v, X, \sigma)$ and $\mathrm{path}(\cdot, \cdot)$, as well as ultimately $\mathcal{H}^{\min}(\mathcal{T}, S)$. In the following, we use $A \oplus B$ to mean ${\min}_{\preceq}(A \cup B)$.

Theorem 1 (Inductive form of Pareto-optimal histories). Let $\mathcal{T} = \langle T, x, s \rangle$ be a synteny tree on $S$, $X$ be any synteny on $\mathcal{F}$, and $\sigma$ be a node of $S^{*}$. If $v$ is a leaf of $T$, then $h(v, X, \sigma) = \{v[X, \sigma]\}$ if $x(v) = X$ and $s(v) = \sigma$, and $h(v, X, \sigma) = \emptyset$ otherwise. If $v$ is an internal node of $T$, then $h(v, X, \sigma) = P_{\mathrm{Spe}} \oplus P_{\mathrm{Dup}} \oplus P_{\mathrm{Cut}} \oplus P_{\mathrm{TrDup}} \oplus P_{\mathrm{TrCut}}$ with the following:

$$
\begin{aligned}
P_{\mathrm{Spe}} = \bigoplus_{\substack{\gamma_{\ell}, \gamma_{r} \nless \sigma \\ Y, Z \subseteq \mathcal{F}}} \big(
& M(v[\mathrm{Spe}, X, \sigma], [X, \sigma_{\ell}], v_{\ell}[Y, \gamma_{\ell}], [X, \sigma_{r}], v_{r}[Z, \gamma_{r}]) \\
\oplus\ & M(v[\mathrm{Spe}, X, \sigma], [X, \sigma_{r}], v_{\ell}[Y, \gamma_{\ell}], [X, \sigma_{\ell}], v_{r}[Z, \gamma_{r}]) \big)
\end{aligned}
$$

$$
\begin{aligned}
P_{\mathrm{Dup}} = \bigoplus_{\substack{\gamma_{\ell}, \gamma_{r} \nless \sigma \\ Y, Z \subseteq \mathcal{F}}} \big(
& M(v[\mathrm{Dup}, X, \sigma], [X \cap Y, \sigma], v_{\ell}[Y, \gamma_{\ell}], [X, \sigma], v_{r}[Z, \gamma_{r}]) \\
\oplus\ & M(v[\mathrm{Dup}, X, \sigma], [X, \sigma], v_{\ell}[Y, \gamma_{\ell}], [X \cap Z, \sigma], v_{r}[Z, \gamma_{r}]) \big)
\end{aligned}
$$

$$
\begin{aligned}
P_{\mathrm{Cut}} = \bigoplus_{\substack{\gamma_{\ell}, \gamma_{r} \nless \sigma \\ Y, Z \subseteq \mathcal{F}}} \big(
& M(v[\mathrm{Cut}, X, \sigma], [X \cap Y, \sigma], v_{\ell}[Y, \gamma_{\ell}], [X - Y, \sigma], v_{r}[Z, \gamma_{r}]) \\
\oplus\ & M(v[\mathrm{Cut}, X, \sigma], [X - Z, \sigma], v_{\ell}[Y, \gamma_{\ell}], [X \cap Z, \sigma], v_{r}[Z, \gamma_{r}]) \big)
\end{aligned}
$$

$$
\begin{aligned}
P_{\mathrm{TrDup}} = \bigoplus_{\substack{\gamma_{i} \parallel \sigma;\ \gamma_{t} \nless \gamma_{i}, \sigma \\ \gamma_{k} \nless \sigma;\ Y, Z \subseteq \mathcal{F}}} \big(
& M(v[\mathrm{TrDup}, X, \sigma], [X \cap Y, \gamma_{i}], v_{\ell}[Y, \gamma_{t}], [X, \sigma], v_{r}[Z, \gamma_{k}]) \\
\oplus\ & M(v[\mathrm{TrDup}, X, \sigma], [X, \sigma], v_{\ell}[Y, \gamma_{k}], [X \cap Z, \gamma_{i}], v_{r}[Z, \gamma_{t}]) \big)
\end{aligned}
$$

$$
\begin{aligned}
P_{\mathrm{TrCut}} = \bigoplus_{\substack{\gamma_{i} \parallel \sigma;\ \gamma_{t} \nless \gamma_{i}, \sigma \\ \gamma_{k} \nless \sigma;\ Y, Z \subseteq \mathcal{F}}} \big(
& M(v[\mathrm{TrCut}, X, \sigma], [X \cap Y, \gamma_{i}], v_{\ell}[Y, \gamma_{t}], [X - Y, \sigma], v_{r}[Z, \gamma_{k}]) \\
\oplus\ & M(v[\mathrm{TrCut}, X, \sigma], [X - Z, \gamma_{i}], v_{\ell}[Y, \gamma_{t}], [X \cap Z, \sigma], v_{r}[Z, \gamma_{k}]) \\
\oplus\ & M(v[\mathrm{TrCut}, X, \sigma], [X \cap Y, \sigma], v_{\ell}[Y, \gamma_{k}], [X - Y, \gamma_{i}], v_{r}[Z, \gamma_{t}]) \\
\oplus\ & M(v[\mathrm{TrCut}, X, \sigma], [X - Z, \sigma], v_{\ell}[Y, \gamma_{k}], [X \cap Z, \gamma_{i}], v_{r}[Z, \gamma_{t}]) \big)
\end{aligned}
$$

Proof.

If $v$ is a leaf of $T$, then the proposition follows directly from Definitions 1 and 4. Assume now that $v$ is an internal node. First, notice that all histories of the $P_{e}$ sets, for $e \in \{\mathrm{Spe}, \mathrm{Dup}, \mathrm{Cut}, \mathrm{TrDup}, \mathrm{TrCut}\}$, explicate $\mathcal{T}[v]$ and $S$, have their root assigned to $X$ and $\sigma$, and are Pareto-optimal.

Let $\mathcal{H} = \langle H, e, x, s \rangle \in h(v, X, \sigma)$. Since $v$ is an internal node of $T$, $v$ must be binary in $H$; hence $e(v) \in \{\mathrm{Spe}, \mathrm{Dup}, \mathrm{Cut}, \mathrm{TrDup}, \mathrm{TrCut}\}$. Denote the children of $v$ in $T$ as $v_{\ell}$ and $v_{r}$, with $\gamma_{\ell} = s(v_{\ell})$ and $\gamma_{r} = s(v_{r})$, and the children of $v$ in $H$ as $w_{\ell}$ and $w_{r}$. If $e(v) \in \{\mathrm{Spe}, \mathrm{Dup}, \mathrm{Cut}\}$, then $\gamma_{\ell}$ and $\gamma_{r}$ cannot be strict ancestors of $\sigma$, as otherwise $\mathcal{H}$ would not be acyclic. Let $Y = x(v_{\ell})$, $Y' = x(w_{\ell})$, $Z = x(v_{r})$, $Z' = x(w_{r})$.

If $e(v) = \mathrm{Spe}$, then $\mathcal{H} \in P_{\mathrm{Spe}}$ by Definition 1 (Item 1).

Otherwise, let us first show that the synteny contents of both children are chosen appropriately. We have that $X \cap Y \subseteq Y'$ (resp. $X \cap Z \subseteq Z'$); since, if any $g \in X \cap Y$ (resp. $g \in X \cap Z$) was not in $Y'$ (resp. $Z'$), we would have multiple gains of $g$ in $H$, i.e., one above $v$ (since $g \in X$) and one between $v$ and $v_{\ell}$ (since $g \in Y$, resp. $v_{r}$ since $g \in Z$).

If $e(v) \in \{\mathrm{Dup}, \mathrm{TrDup}\}$, by Definition 1, at least one of $Y'$ or $Z'$ is equal to $X$. Without loss of generality (w.l.o.g.), assume that $Z' = X$. We should have $Y' \subseteq X \cap Y$, as otherwise we would have at least one additional event in the path between $v$ and $v_{\ell}$ for losing $Y' - (X \cap Y)$, thus contradicting the Pareto-optimality of $\mathcal{H}$. Hence, $Y' = X \cap Y$.
If $e(v) \in \{\mathrm{Cut}, \mathrm{TrCut}\}$, it cannot be that $Y' \nsubseteq X \cap Y$ and $Z' \nsubseteq X \cap Z$ are both true, as such a scenario would lead to a loss on each path between $v$ and $v_{\ell}$ and between $v$ and $v_{r}$, whereas by choosing $Y' \subseteq X \cap Y$ or $Z' \subseteq X \cap Z$, we save at least one loss. Since $Y'$ and $Z'$ are a partition of $X$, then $Z' = X - Y$ if $Y' = X \cap Y$, or $Y' = X - Z$ otherwise.

Finally, we show that the species of both children are also chosen appropriately.

If $e(v) \in \{\mathrm{Dup}, \mathrm{Cut}\}$, then by Definition 1, $s(w_{\ell}) = s(w_{r}) = \sigma$.
If $e(v) \in \{\mathrm{TrDup}, \mathrm{TrCut}\}$, then by Definition 1, either $s(w_{\ell}) = \sigma$ and $s(w_{r}) \parallel \sigma$, or $s(w_{r}) = \sigma$ and $s(w_{\ell}) \parallel \sigma$.
If $e(v) = \mathrm{TrDup}$ and $s(w_{\ell}) = \sigma$ (resp. $s(w_{r}) = \sigma$), then $x(w_{\ell}) = X$ (resp. $x(w_{r}) = X$), since the child kept in the species of $v$ retains the whole synteny. □
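The synteny contents assigned to the two children of $v$ in Theorem 1 depend only on the event type and on $X$, $Y$, and $Z$. The sketch below lists, for each event, the candidate pairs $(Y', Z')$ that appear in the terms of the theorem. It ignores species labels entirely (for $\mathrm{TrCut}$, the four terms differ in species only and yield two distinct content pairs), and the function name is hypothetical.

```python
# Minimal sketch: candidate contents (Y', Z') of the two children of v,
# read off the terms of Theorem 1. Syntenies are frozensets.

def child_contents(event, X, Y, Z):
    if event == "Spe":
        return [(X, X)]
    if event in ("Dup", "TrDup"):
        # One child keeps X, the other receives the part shared with its target.
        return [(X & Y, X), (X, X & Z)]
    if event in ("Cut", "TrCut"):
        # The two children partition X.
        return [(X & Y, X - Y), (X - Z, X & Z)]
    raise ValueError(f"not a binary event: {event}")

X = frozenset({1, 2, 3})
Y = frozenset({1, 2})
Z = frozenset({3, 4})

for left, right in child_contents("Cut", X, Y, Z):
    # Each candidate for a cut is a partition of X.
    assert left | right == X and not (left & right)
```

In this example both candidates for a cut coincide, since $X \cap Y = X - Z = \{1, 2\}$; in general they differ, and the $\oplus$ operator keeps whichever leads to Pareto-optimal histories.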

We next consider how to compute the set of histories $\mathrm{path}([X, \sigma], [Y, \gamma])$ given any two syntenies $X$ and $Y$ and species $\sigma$ and $\gamma$.

Theorem 2 (Pareto-optimal paths). Let $X$ and $Y$ be two syntenies and $\sigma$ and $\gamma$ be two species. If $\gamma$ is a strict ancestor of $\sigma$, then $\mathrm{path}([X, \sigma], [Y, \gamma]) = \emptyset$. Otherwise, $\mathrm{path}([X, \sigma], [Y, \gamma]) = P_{0} \oplus P_{1} \oplus P_{2}$, where the following holds:

$P_{0} = \emptyset$ if $\sigma \parallel \gamma$. Otherwise, $P_{0}$ contains the history made up of a chain of $D_{S^{*}}(\sigma, \gamma)$ speciations and $|\{(u, v) \in P_{S^{*}}(\sigma, \gamma) \mid v \notin V(S)\}|$ full losses, preceded by a partial loss event if $X \nsubseteq Y$ and followed by a gain event if $Y \nsubseteq X$. If $\gamma \in L(S^{*}) - L(S)$, then $P_{0}$ also includes a history where the initial partial loss event is replaced by a terminal $\mathrm{Dup}$ or $\mathrm{Cut}$ event and an unsampled leaf.
$P_{1} = P_{\mathrm{TrDup}} \oplus P_{\mathrm{TrCut}}^{\mathrm{Left}} \oplus P_{\mathrm{TrCut}}^{\mathrm{Right}}$. If $\sigma \parallel \gamma$, all three sets contain histories starting with a $\mathrm{Tr}$ event $v$ from $\sigma' = \sigma$ to $\gamma$. Otherwise, all three sets contain histories starting with a speciation at $\sigma$. Letting $\mathrm{ch}(\sigma) = \{\sigma', \sigma''\}$ such that $\sigma' \parallel \gamma$, the speciation is followed by a $\mathrm{Tr}$ event $v$ from $\sigma'$ to $\gamma$ and a full loss on the side of $\sigma''$ if $\sigma'' \notin L(S^{*}) - L(S)$. In both cases, the histories end with a gain event if $Y \nsubseteq X$. For $\mathcal{H} \in P_{\mathrm{TrDup}}$, $e(v) = \mathrm{TrDup}$ and $x(v_{t}) = X \cap Y$ with a full loss at $v_{k}$ if $\sigma' \notin L(S^{*}) - L(S)$. For $\mathcal{H} \in P_{\mathrm{TrCut}}^{\mathrm{Left}}$ or $\mathcal{H} \in P_{\mathrm{TrCut}}^{\mathrm{Right}}$, $e(v) = \mathrm{TrCut}$. For $\mathcal{H} \in P_{\mathrm{TrCut}}^{\mathrm{Left}}$, $x(v_{t}) = X \cap Y$ and $v_{k}$ is a full loss if $X \nsubseteq Y$ and $\sigma' \notin L(S^{*}) - L(S)$. For $\mathcal{H} \in P_{\mathrm{TrCut}}^{\mathrm{Right}}$, $x(v_{k}) = \emptyset$ and there is a partial loss at $v_{t}$ if $X \nsubseteq Y$.
$P_{2} = \emptyset$ if $\sigma \parallel \gamma$ and $X \subseteq Y$, or $\sigma \in L(S^{*}) - L(S)$, or $\sigma = r(S^{*})$. Otherwise, $P_{2} = P_{\mathrm{TrDup}}^{\mathrm{TrDup}} \oplus P_{\mathrm{TrDup}}^{\mathrm{TrCut}} \oplus P_{\mathrm{TrCut}}^{\mathrm{TrCut}}$. Each of the three sets $P_{e'}^{e}$ contains histories made up of two initial consecutive transfers $v$ and $v'$ such that $e(v) = e$ and $e(v') = e'$. The histories of $P_{\mathrm{TrDup}}^{\mathrm{TrDup}}$ and $P_{\mathrm{TrDup}}^{\mathrm{TrCut}}$ are such that $s(v') = \sigma' \in L(S^{*}) - L(S)$. In $P_{\mathrm{TrDup}}^{\mathrm{TrDup}}$, $v$ is followed by a full loss if $s(v) \notin L(S^{*}) - L(S)$. In $P_{\mathrm{TrDup}}^{\mathrm{TrCut}}$ and $P_{\mathrm{TrCut}}^{\mathrm{TrCut}}$, the first cut $v$ is complete (i.e., with an empty leaf). In $P_{\mathrm{TrCut}}^{\mathrm{TrCut}}$, if $X \nsubseteq Y$, then $s(v) \in L(S^{*}) - L(S)$; otherwise, the second cut $v'$ is complete if $s(v) \notin L(S^{*}) - L(S)$. In all cases, the histories end with a gain event if $Y \nsubseteq X$.
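The event counts of the transfer-free history described for $P_{0}$ can be read off the path from $\sigma$ to $\gamma$ in $S^{*}$. The sketch below is an illustration under stated assumptions: $S^{*}$ is given as a child-to-parent map, `sampled` stands for the node set $V(S)$ of the original species tree, and all names are hypothetical. It only counts speciations and full losses along the path; the optional partial loss, gain, and terminal $\mathrm{Dup}$ or $\mathrm{Cut}$ variants are not modeled.

```python
# Minimal sketch: number of speciations and full losses of the P_0 history.

def p0_counts(parent, sampled, sigma, gamma):
    """Return (speciations, full_losses) along the path from sigma down to gamma,
    or None if sigma is not an ancestor of gamma (then P_0 is empty)."""
    edges = []
    v = gamma
    while v != sigma:
        u = parent.get(v)
        if u is None:          # reached the root without meeting sigma
            return None
        edges.append((u, v))
        v = u
    # One speciation per edge; one full loss per edge whose lower node is not in V(S).
    full_losses = sum(1 for (_, w) in edges if w not in sampled)
    return len(edges), full_losses

# Augmented version of the species tree with root r and leaves A and B:
# zA and zB subdivide the edges (r, A) and (r, B); u1, u2, u3 are unsampled leaves.
PARENT_STAR = {
    "root*": None, "r": "root*", "u3": "root*",
    "zA": "r", "zB": "r",
    "A": "zA", "u1": "zA",
    "B": "zB", "u2": "zB",
}
SAMPLED = {"r", "A", "B"}
```

Going from `r` to `A` takes two speciations and one full loss (the lineage toward `B` is lost at `r`, while the sibling of `A` below `zA` is an unsampled leaf and needs no loss event).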

Proof.

If $\gamma$ is a strict ancestor of $\sigma$, there can be no acyclic history leading from $[X, \sigma]$ to $[Y, \gamma]$; therefore, $\mathrm{path}([X, \sigma], [Y, \gamma]) = \emptyset$.

Otherwise, let $\mathcal{H} = \langle H, e, x, s \rangle \in \mathrm{path}([X, \sigma], [Y, \gamma])$, $u = r(H)$, and let $w$ be the only visible leaf of $\mathcal{H}$. We call $P(u, w)$ the main path of $\mathcal{H}$. We say that a subtree $H[v]$ of $H$ is invisible if it contains only invisible leaves.

First notice that all binary nodes of $\mathcal{H}$ must be on the main path. Assume that $v$ is a binary node outside of the main path. Both subtrees of $v$ must be invisible. If $s(v) \notin L(S^{*}) - L(S)$, then at least one subtree contains a full loss; hence, we can completely replace $v$ and its subtrees with a full loss. Otherwise, we can completely replace $v$ and its subtrees with an unsampled leaf. In both cases, we strictly reduce the number of events in $\mathcal{H}$, which contradicts its Pareto-optimality. Also, notice that all binary nodes of the main path have exactly one child whose subtree is invisible.

We now show that $\mathcal{H}$ contains at most two $\mathrm{Tr}$ nodes. Assume that $\mathcal{H}$ contains at least three consecutive $\mathrm{Tr}$ events that we denote $v_{1}$, $v_{2}$, $v_{3}$. Let $v_{4}$ be the child of $v_{3}$ resulting from the transfer. Consider the history $\mathcal{H}'$ obtained from $\mathcal{H}$ via the following: (1) Removing all the nodes on the path from $v_{2}$ to $v_{4}$ and their invisible subtrees, excluding $v_{2}$ and $v_{4}$ themselves but including, in particular, $v_{3}$; (2) Connecting $v_{2}$ directly to $v_{4}$; (3) Remapping $s(v_{2})$ to any unsampled species separated both from $s(v_{1})$ and $s(v_{4})$; (4) Replacing the subtree below $v_{2}$ by a single unsampled leaf $l$ with $s(l) = s(v_{2})$ and $x(l) = x(v_{2})$ if $e(v_{2}) = \mathrm{TrDup}$, or $x(l) = x(v_{2}) - x(v_{4})$ otherwise. Clearly, $\mathcal{H}' \preceq \mathcal{H}$, $\mathcal{H}' \neq \mathcal{H}$, and $\mathcal{H}' \in \mathrm{path}([X, \sigma], [Y, \gamma])$, which contradicts the Pareto-optimality of $\mathcal{H}$.

As per Definition 4 (Item 5 and Item 6), any partial loss must be placed at the end of the history, and any gain must be placed at the start. Note that $H$ contains either zero ($P_{0}$), one ($P_{1}$), or two ($P_{2}$) transfers.

[ $H$ contains no transfers.] Note that $σ$ cannot be separated from $γ$ since only transfers can reach separate species; thus, $σ$ is an ancestor of $γ$ . We start by showing that $H$ can only possibly contain a duplication or a cut if that event is on an unsampled leaf species. If v is a node such that $e (v) \in {Dup, Cut}$ and $s (v) \notin L (S^{*}) - L (S)$ , then the invisible subtree of v contains at least one full loss. In this case, we can remove the invisible subtree and turn v into a partial loss, if needed, otherwise we would replace v with its remaining child. Hence, if $H$ contains a duplication or a cut, it must be the last binary event on the main path. Thus, all other binary events must be speciations, and we need exactly $D_{S^{*}} (σ, γ)$ of them to reach $σ$ from $γ$ . Those speciations must lead to one full loss for each species $s \in V (S)$ . Hence, $H \in P_{0}$ .
[ $H$ contains exactly one transfer.] Let v be the only transfer node. If $σ \leq γ$ , then there must be a speciation above v so that $s (v)$ can be separated from $γ$ since $H$ cannot contain other transfers. Notice that $H$ contains no partial losses, duplications, or cuts. In fact, any partial loss can be merged into the transfer. As for duplications and cuts, they can be removed if they do not save any loss, or merged into the transfer otherwise. Apart from the initial speciation if $σ \leq γ$ , $H$ contains no other speciations as those before (resp. after) the transfer can be removed by redirecting the transfer to start from a higher species (resp. to end at a lower species) that is still separated from $γ$ (resp. $σ$ ). If $e (v) = TrDup$ or $X ⊈ Y$ , then the invisible subtree of v must contain at least one full loss, unless $s (v) \in L (S^{*}) - L (S)$ . If $s (v) \notin L (S^{*}) - L (S)$ and $e (v) = TrCut$ , then either subtree of v must contain at least one full or partial loss if $x (v) ⊈ Y$ . Hence, $H \in P_{1}$ .
[ $H$ contains exactly two transfers.] Let v and $v^{'}$ be the only two transfer nodes. If $σ = r (S^{*})$ , then $H$ must contain at least one full loss; hence, $H$ is not Pareto-optimal. If $σ \in L (S^{*}) - L (S)$ , then, if $σ \leq γ$ , we can remove both transfers, and, if $σ ∥ γ$ , then we can remove $v^{'}$ , in both cases without introducing additional events. If $σ ∥ γ$ , then we have the following: If $e (v) = TrDup$ , we can remove $v^{'}$ without adding new losses; If $X \subseteq Y$ , then $x (v) \subseteq x (v^{'}) \subseteq x (w)$ . Hence, we can also remove $v^{'}$ . If either of v or $v^{'}$ is such that $e (v) = TrCut$ or $e (v^{'}) = TrCut$ , then $e (v) = TrCut$ , as otherwise it is $e (v) = TrDup$ , and there is a full loss below v. In addition, we can exchange the transfer types of v and $v^{'}$ so that $e (v) = TrCut$ , $e (v^{'}) = TrDup$ and $s (v^{'}) \in L (S^{*})$ , so as to save a full loss. If $x (v^{'}) ⊉ Y$ or $e (v^{'}) = TrDup$ , then $s (v^{'}) \in L (S^{*}) - L (S)$ , as otherwise we can reroute the first transfer toward an unsampled leaf so as to save a full loss. Hence, $H \in P_{2}$ . □

Finally, the set of all possible histories $H^{\min} (T, S)$ can be computed as the union of assignments of the root node of the synteny tree to all possible syntenies and species. This starts with a path from an empty synteny (i.e., an initial gain). In other words,

$$H^{\min} (T, S) = \bigoplus_{\substack{X \subseteq F \\ σ \in V (S^{*})}} path ([\emptyset, σ], r (T) [X, σ]) \otimes h (r (T), X, σ).$$

Hence, using Theorems 1 and 2, one can derive a dynamic programming algorithm to solve Problem 1. Due to the exponential size of the set of solutions, the time complexity of that algorithm is also exponential.

Theorem 3 (Number of minimal histories). Let $T = 〈 T, x, s 〉$ be a synteny tree on a species tree S with gene family set $F$. Then,

$$| H^{\min} (T, S) | \in O ((2^{| F |} {| V (S) |}^{3})^{| V (T) |}).$$

Proof.

Using Theorem 1, we obtain that $| h (v, X, σ) | \in O (f (v))$ with

$$f (v) = \begin{cases} 1 & \text{if } v \text{ is a leaf}, \\ (2^{| F |} {| V (S) |}^{2} p) \times f (v_{ℓ}) \times f (v_{r}) & \text{otherwise}, \end{cases}$$

where p is an upper bound on the number of possible paths. This is because, in each case of Theorem 1, up to all subsets $Y, Z \subseteq F$ are tried along with up to all possible species pairs $γ_{ℓ}, γ_{r} \in V (S^{*})$, which has a number of nodes directly proportional to $| V (S) |$. Using Theorem 2, we obtain $p \in O (| V (S) |)$. We obtain the desired result by solving the recurrence. □
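The step "solving the recurrence" can be spelled out as follows. This is only a sketch: the symbols $c$ and $i (v)$ are introduced here for illustration and do not appear in the original proof, and the constant hidden in the bound on $p$ is treated loosely.

```latex
% c: per-node factor taken from the recurrence for f(v) (illustrative name)
c = 2^{|F|}\,|V(S)|^{2}\,p, \qquad p \in O(|V(S)|)

% i(v): number of internal (binary) nodes of the subtree T[v] (illustrative name)
% Claim: f(v) = c^{i(v)}.
% Leaf: i(v) = 0, hence f(v) = c^{0} = 1.
% Internal node, assuming the claim holds for both children:
f(v) = c \cdot c^{\,i(v_{\ell})} \cdot c^{\,i(v_{r})}
     = c^{\,1 + i(v_{\ell}) + i(v_{r})}
     = c^{\,i(v)}

% At the root, i(r(T)) \le |V(T)|, so with c bounded by a multiple of 2^{|F|} |V(S)|^{3}:
f(r(T)) = c^{\,i(r(T))} \le c^{\,|V(T)|}
```

Under this reading, the exponent $| V (T) |$ in Theorem 3 comes from the number of internal nodes of T, and the cube on $| V (S) |$ comes from the two species choices combined with the linear bound on p.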

6. Polynomial-Time Computation of Pareto-Optimal Event Vectors

We now address Problem 2. Given a synteny tree $T = 〈 T, x, s 〉$ on a species tree S, similar to the way $h (v, X, σ)$ was previously used to recursively compute $H^{\min} (T, S)$, we define $Λ (v, X, σ)$ for computing ${ev}^{\min} (T, S)$ as $Λ (v, X, σ) = \{ev (H) ∣ H \in h (v, X, σ)\}$.

To compute $Λ (v, X, σ)$, we replace the algebra of Theorem 1 with an algebra where:

The base cases are $\{(0, 0, 0, 0, 0)\}$ (one zero per component of the five-component event vector) and ∅;
$A \oplus B$ is the union of vectors from A and B, retaining only the Pareto-optimal ones;
$A \otimes B$ sums the pairs of vectors from $A \times B$ , retaining only the Pareto-optimal ones.
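The algebra described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Synesth: the function names, the component order (Dup, Cut, TrDup, TrCut, Loss), and the example sets are assumptions, and the quadratic dominance filter is deliberately naive rather than the faster procedures cited later in the text.

```python
from itertools import product

# Assumed component order (not fixed by the text): (Dup, Cut, TrDup, TrCut, Loss).
ONE = frozenset({(0, 0, 0, 0, 0)})   # base case: one history with no events
ZERO = frozenset()                    # base case: no valid history


def pareto_min(vectors):
    """Keep only the vectors that no other vector dominates componentwise."""
    unique = set(vectors)

    def dominated(v):
        return any(u != v and all(a <= b for a, b in zip(u, v)) for u in unique)

    return frozenset(v for v in unique if not dominated(v))


def oplus(a, b):
    """A (+) B: union of the two sets, filtered down to its Pareto front."""
    return pareto_min(a | b)


def otimes(a, b):
    """A (x) B: pairwise sums over A x B, filtered down to the Pareto front."""
    return pareto_min(tuple(x + y for x, y in zip(u, v)) for u, v in product(a, b))


# Hypothetical input sets, chosen only to exercise the operators.
A = frozenset({(1, 0, 0, 0, 2), (0, 0, 1, 0, 0)})
B = frozenset({(1, 0, 0, 0, 3), (0, 0, 0, 0, 1)})
print(sorted(oplus(A, B)))
print(sorted(otimes(A, B)))
```

In this sketch, `ZERO` is neutral for `oplus` and absorbing for `otimes`, while `ONE` is neutral for `otimes`, which is what lets the two base cases plug into the recurrence unchanged.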

Additional simplifications can reduce the complexity of computing $Λ (v, X, σ)$. We start by showing that, when adapted for Problem 2, it is sufficient to try a constant number of syntenies $X, Y, Z \subseteq F$ at each step of the recurrence of Theorem 1. First, let us show that it is sufficient to place the gain event for each gene family at the lowest common ancestor of the leaves they appear in.

Definition 10 (Gain positions). Let $T = 〈 T, x, s 〉$ be a synteny tree on $F$, $Γ \in F$, $v \in V (T)$, and $X \subseteq F$. We thus define

$$\begin{aligned} lca (Γ) &= {lca}_{T} \{l \in L (T) ∣ Γ \in x (l)\}, \\ f (v, X) &= \{Γ \in X ∣ lca (Γ) ≯ v\}. \end{aligned}$$

Lemma 1 (Gains at the LCA). If $T = 〈 T, x, s 〉$ is a synteny tree on $F$, $v \in V (T)$, and $X \subseteq F$, then ${\min}_{⪯} (Λ (v, X, σ) \cup Λ (v, f (v, X), σ)) = Λ (v, f (v, X), σ)$.

Proof.

Let $H \in h (v, X, σ)$. Let $Γ \in X - f (v, X)$, if there is any. By definition, $lca (Γ) > v$. Assume w.l.o.g. that $lca (Γ) \geq v_{ℓ}$. Consider the history $H^{'}$ in which $Γ$ is removed from $x (v)$ and all $x (w)$ for $w \in T [v_{r}]$, thereby removing any invalid loss event created in the process, and in which the gain event for $Γ$ is moved to be the parent of $v_{r}$ (potentially merging it with other gains). Then, $ev (H^{'}) ⪯ ev (H)$, since we only potentially removed losses from $H$ and the number of gains is not part of the event vector. Let $H^{*}$ be the history obtained after repeating this process for each $Γ \in X - f (v, X)$. Clearly, $H^{*} \in h (v, f (v, X), σ)$; hence, $ev (H^{*}) \in Λ (v, f (v, X), σ)$ and $ev (H^{*}) ⪯ ev (H)$. □

In a similar way to Lemma 5 in [10], we now show that only two synteny contents have to be tried at each step of the recurrence as any synteny larger than the minimal required gene families (as formally defined below) leads to the same set of optimal event vectors.

Definition 11 (Minimal synteny contents). For any $v \in V (T)$, we define

$$\begin{aligned} gain (v) &= \{Γ \in F ∣ v = lca (Γ)\}, \\ x^{\min} (v) &= \begin{cases} x (v) & \text{if } v \text{ is a leaf}, \\ x^{\min} (v_{ℓ}) \cup x^{\min} (v_{r}) - (gain (v_{ℓ}) \cup gain (v_{r})) & \text{otherwise}. \end{cases} \end{aligned}$$
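To make Definitions 10 and 11 concrete, the following sketch computes $lca (Γ)$ and $x^{\min} (v)$ on a toy synteny tree. The `Node` class, the function names, and the example tree are hypothetical; the set difference is read as applying to the whole union of the two children's contents, which is an interpretation of the formula rather than something stated explicitly.

```python
class Node:
    """Binary tree node; `content` plays the role of x(v) at the leaves."""

    def __init__(self, left=None, right=None, content=()):
        self.left, self.right = left, right
        self.content = frozenset(content)

    def is_leaf(self):
        return self.left is None


def lca_map(v):
    """Map each gene family to the LCA of the leaves of T[v] that contain it."""
    if v.is_leaf():
        return {fam: v for fam in v.content}
    left, right = lca_map(v.left), lca_map(v.right)
    merged = {**left, **right}
    for fam in left.keys() & right.keys():
        merged[fam] = v  # present on both sides: the LCA is v itself
    return merged


def x_min(v, lca):
    """x^min(v): children's minimal contents minus families gained at a child."""
    if v.is_leaf():
        return v.content
    gained = {fam for fam, node in lca.items() if node is v.left or node is v.right}
    return (x_min(v.left, lca) | x_min(v.right, lca)) - gained


# Toy tree: ((l1, l2), l3) with leaf syntenies {1, 2}, {2, 3}, and {1, 3}.
l1, l2, l3 = Node(content={1, 2}), Node(content={2, 3}), Node(content={1, 3})
inner = Node(l1, l2)
root = Node(inner, l3)
lca = lca_map(root)
print(sorted(x_min(inner, lca)), sorted(x_min(root, lca)))
```

Here family 2 occurs only below `inner`, so it is gained there and is not required in the root synteny, whereas families 1 and 3 occur on both sides of the root and must already be present at the root.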

Lemma 2 (Two choices of synteny contents). Let $T = 〈 T, x, s 〉$ be a synteny tree on a species tree S and v be a node of T. For any $X, X^{'} ⊋ x^{\min} (v)$ such that $X = f (v, X)$ and $X^{'} = f (v, X^{'})$, $Λ (v, X, σ) = Λ (v, X^{'}, σ)$.

Proof.

We proceed by induction on the depth of v in T. If v is a leaf, then no valid history exists; hence, $Λ (v, X, σ) = Λ (v, X^{'}, σ) = \emptyset$. Otherwise, let $H = 〈 H, e, x, s 〉 \in h (v, X, σ)$, $Y = x (v_{ℓ})$, $Z = x (v_{r})$, and denote $H_{ℓ} \in h (v_{ℓ}, Y, s (v_{ℓ}))$ (resp. $H_{r}$ for $v_{r}$) to be the subhistory of $H$ below $v_{ℓ}$ (resp. $v_{r}$).

Let $G = X - x^{\min} (v)$, and $G^{'} = X^{'} - x^{\min} (v)$. If $Γ \in G$, then $Γ \notin x^{\min} (v)$, thus implying that $Γ \notin x^{\min} (v_{ℓ})$ and $Γ \notin x^{\min} (v_{r})$ since $Γ \notin gain (v_{ℓ})$ and $Γ \notin gain (v_{r})$ because $lca (Γ) ≯ v$. This implies that $X ⊈ Y, Z$. Using the same argument for $Γ^{'} \in G^{'}$, we deduce that $X^{'} ⊈ Y, Z$.

Suppose that $Y ⊋ x^{\min} (v_{ℓ})$. Let $Y^{'} = (Y - G) \cup G^{'}$. Noting that $Y^{'} ⊋ x^{\min} (v_{ℓ})$, and using the induction hypothesis, we see that $Λ (v_{ℓ}, Y, s (v_{ℓ})) = Λ (v_{ℓ}, Y^{'}, s (v_{ℓ}))$. There exists a history $H_{ℓ}^{'} \in h (v_{ℓ}, Y^{'}, s (v_{ℓ}))$ such that $ev (H_{ℓ}) = ev (H_{ℓ}^{'})$. The same argument applies to $v_{r}$, thereby yielding a history $H_{r}^{'}$ such that $ev (H_{r}) = ev (H_{r}^{'})$.

Suppose that $Y = x^{\min} (v_{ℓ})$. Then, there must be an event to lose $X - Y$ on the path from v to $v_{ℓ}$. That event can also be used to lose $X^{'} - Y$. In this case, we let $H_{ℓ}^{'} = H_{ℓ}$. The same argument applies to $v_{r}$.

Finally, consider the history $H^{'}$ that is obtained by replacing $H_{ℓ}$ with $H_{ℓ}^{'}$, $H_{r}$ by $H_{r}^{'}$, and setting $x (v) = X^{'}$. Clearly, $ev (H) = ev (H^{'})$ and $H^{'} \in h (v, X^{'}, σ)$; hence, $ev (H) \in Λ (v, X^{'}, σ)$. □

Using the recurrence from Theorem 1 adapted to use the algebra over Pareto-sets of vectors and simplified, as shown in Lemmas 1 and 2, we can obtain a dynamic programming algorithm for solving Problems 2 and 2’.

Theorem 4. Problem 2 can be solved in time $O ({(m n)}^{9} \log (m n))$ and space $O ({(m n)}^{5})$, and Problem 2’ in time $O ({(m n)}^{9} n \log (m n))$ and space $O ({(m n)}^{5} n)$, where $m = | V (T) |$ and $n = | V (S) |$ are, respectively, the numbers of the nodes in the synteny and species trees.

Proof.

Let us first show that

$$| {ev}^{\min} (T, S) | \in O ({(m n)}^{4}).$$

From Theorem 2, the number of events in a history $H \in path ([X, σ], [Y, γ])$ is in $O (n)$ (attained for histories in $P_{0}$, while those in $P_{1}$ and $P_{2}$ have a constant number of events). From Theorem 1, the histories $H \in H^{\min} (T, S)$ are extensions of the synteny tree T, which are obtained by inserting such paths; hence, the number of events in such histories is in $O (m n)$. As the vectors have five components, we obtain the desired result by adapting the argument from Lemma 3.1 in [15].

We solve Problem 2 using a dynamic programming table for $Λ (v, X, σ)$. The number of entries in the table is m for v, two for X (as per Lemma 2), and n for $σ$. Hence, the space complexity result follows from the bound on the size of ${ev}^{\min} (T, S)$ shown above.

As for the time complexity, following the recurrence adapted from Theorem 1, to compute each entry we need to consider four options for Y and Z, as well as the up to $n^{3}$ options for $γ_{ℓ}$ and $γ_{r}$ (or $γ_{k}$ and $γ_{t}$) and $γ_{i}$. However, it is possible to reduce these $n^{3}$ species options to a constant number of operations at each step by simultaneously computing three separate tables $in$, $inAlt$, and $out$, as defined by Weiner and Bansal [14] and explained in their Algorithm 1 and the proof of their Theorem 1. For each of these options, the ⊕ and ⊗ operators need to be used a constant number of times.

The ⊕ and ⊗ operations on two sets containing k Pareto-vectors can be implemented in time $O (k \log k)$ [18] and $O (k^{2} \log k)$, respectively [15]. In our case, it follows from the bound on the size of ${ev}^{\min} (T, S)$ that both operators can be implemented in time $O ({(m n)}^{4} \log (m n))$ and $O ({(m n)}^{8} \log (m n))$, respectively.
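For readers who want the arithmetic behind these bounds and behind Theorem 4, the substitution can be written out as below. It is a sketch that only combines quantities already stated in the proof (the size bound on Pareto sets, the table dimensions, and the constant number of operator calls per entry); the symbol $k$ is the set size used in the preceding paragraph.

```latex
% Size of any Pareto set handled by the operators
k \in O\big((mn)^{4}\big), \qquad \log k \in O\big(\log (mn)\big)

% Cost of one operator application
\oplus:\; O(k \log k) = O\big((mn)^{4} \log (mn)\big), \qquad
\otimes:\; O(k^{2} \log k) = O\big((mn)^{8} \log (mn)\big)

% Table with m choices of v, 2 choices of X, n choices of \sigma,
% and a constant number of operator applications per entry
\text{time} \in O\big(2mn \cdot (mn)^{8} \log (mn)\big) = O\big((mn)^{9} \log (mn)\big)

% Each entry stores at most k vectors
\text{space} \in O\big(2mn \cdot (mn)^{4}\big) = O\big((mn)^{5}\big)
```

The $\otimes$ operator dominates the running time because it first forms all $k^{2}$ pairwise sums before filtering.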

To solve Problem 2’, we need to be able to reconstruct one of the histories leading to each Pareto-optimal event vector. To that end, we associate additional pieces of information to each vector: the root node of the history, two pointers to two sub-histories, and two paths that lead to those histories (see Figure 2). Since a path can contain, at most, $O (n)$ events, the time and space complexities of this method to solve Problem 2’ are obtained by adding a factor of n to those of Problem 2. □

7. Efficient Computation of Minimum-Cost Histories

We finally address Problem 3. Let $T = 〈 T, x, s 〉$ be a synteny tree on a species tree S, and let $δ \in (\mathbb{R}^{+} \cup \{\infty\})^{5}$ be an event cost vector. Similar to the way through which $h (v, X, σ)$ and $Λ (v, X, σ)$ were previously used to compute $H^{\min} (T, S)$ and ${ev}^{\min} (T, S)$, we define $c (v, X, σ) = \{c (H) ∣ H \in h (v, X, σ)\}$ to compute $c^{\min} (T, S)$.

To compute $c (v, X, σ)$, we replace the algebra of Theorem 1 with an algebra where:

The base cases are 0 and ∞;
$A \oplus B$ is the minimum of A and B;
$A \otimes B$ is the sum of A and B.

This is the so-called min-plus or “tropical” semiring. Notice that Lemmas 1 and 2 still apply since if $Λ (v, X, σ)$ = $Λ (v, X^{'}, σ)$ , then $c (v, X, σ) = c (v, X^{'}, σ)$ .
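A minimal sketch of this min-plus algebra, and of how it relates to the vector algebra of the previous section, is given below. The function names and the two event vectors are hypothetical; the cost vector reuses the values quoted in the Results section, with an assumed component order (Dup, Cut, TrDup, TrCut, Loss).

```python
import math

INF = math.inf  # base case: no valid history
ZERO_COST = 0   # base case: a history with no events


def oplus(a, b):
    """A (+) B in the min-plus semiring: keep the cheaper alternative."""
    return min(a, b)


def otimes(a, b):
    """A (x) B in the min-plus semiring: add the costs of the two parts."""
    return a + b


def cost(vector, delta):
    """Cost of an event vector under cost vector delta (zero counts are skipped)."""
    return sum(n * d for n, d in zip(vector, delta) if n)


# Assumed order (Dup, Cut, TrDup, TrCut, Loss); values from the Results section.
delta = (2, 2.5, 4, 4.5, 1)

# Two hypothetical event vectors for the same table entry.
u, v = (1, 0, 0, 0, 3), (0, 0, 1, 0, 0)
best = oplus(cost(u, delta), cost(v, delta))

# Summing two vectors and then pricing them equals pricing them and adding.
combined = tuple(a + b for a, b in zip(u, v))
print(best, cost(combined, delta))
```

Because pricing a vector commutes with vector addition, and taking a minimum over a set of vectors only ever selects a Pareto-optimal one, each table entry can store a single number instead of a whole Pareto front.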

Theorem 5. Problem 3 can be solved in time and space $O (m n)$, and Problem 3’ in time and space $O (m n^{2})$, where $m = | V (T) |$ and $n = | V (S) |$.

Proof.

Both results follow from the proof of Theorem 4 due to the fact that the minimum and sum operators can be computed in constant time, thereby removing the ${(m n)}^{8} \log (m n)$ factor from the time complexity, and also due to the fact that only a single value needs to be stored for each entry of the dynamic programming table, thus removing the ${(m n)}^{4}$ factor from the space complexity. □

8. Results

The CRISPR–Cas module is an adaptive system that allows prokaryotes to defend against invading viruses and plasmids. Its fame is due to the development of the CRISPR-Cas9 genome editing technology, which is one of the most reliable and accurate “molecular scissors” to date. An important part of any CRISPR-Cas system is the operon of the associated Cas genes playing various roles in the defense machinery. As the microbial function of CRISPR-Cas systems highly depends on the syntenic organization of Cas genes, elucidating the evolution of these syntenies is crucial.

In [11], taking the Makarova et al. [19] CRISPR-Cas classification of Class 1 as the synteny tree—and considering a dataset of 15 bacterial species, as well as the species tree topology inferred in [20]—we recovered an evolutionary history that is broadly consistent with that proposed in [21], with the CRISPR-Cas emergence inferred at the root of Terrabacteria. However, some inconsistencies were observed. For example, in the Proteobacteria subtree, the SuperDTL algorithm inferred an unlikely scenario with an ancestral synteny duplication before the LCA of Shewanella putrefaciens, Vibrio crassostreae, Yersinia pseudotuberculosis, and Escherichia coli, thereby resulting in a succession of three consecutive full synteny losses along the branch to E. coli.

Here, we used Synesth on the same dataset with the same event costs for $Loss$, $Dup$, and $TrDup$ events and by choosing intermediate costs for the new $Cut$ and $TrCut$ events. The event costs used were $(δ_{Dup} = 2, δ_{Cut} = 2.5, δ_{TrDup} = 4, δ_{TrCut} = 4.5, δ_{Loss} = 1)$. Almost the same history was obtained, but with the above unlikely scenario replaced by a speciation (on the branch separating the ancestor of Geobacter sulfurreducens from the LCA of Thioalkalivibrio and Shewanella putrefaciens) copying the ancestral synteny to an unsampled species, which was later transferred back to E. coli (see Figure 3).

Choosing the appropriate event costs constitutes one of the main challenges of tree reconciliation. Slight changes to the event costs may lead to significantly different history outputs. We use an approach similar to that developed by Libeskind-Hadas et al. [15] to display a summary of the solution space over all possible event cost choices. In order to represent the solutions in a 2D plot, we normalized the cost of $Loss$ events to 1 and set an equal cost for $Dup$ and $Cut$ and for $TrDup$ and $TrCut$. The Pareto-optimal vectors were condensed to three dimensions as follows: $(c_{Dup} + c_{Cut}, c_{TrDup} + c_{TrCut}, c_{Loss})$. The resulting plot is given in Figure 4, in which each color-coded region corresponds to a set of event costs that give rise to exactly the same set of Pareto-optimal histories.
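The construction of such a landscape can be sketched as follows. This is an illustration under stated assumptions, not the plotting code behind Figure 4: the condensed vectors are made up, the function names are hypothetical, and the regions are approximated by sampling a coarse grid of cost pairs with the loss cost fixed to 1.

```python
import math
from collections import defaultdict


def condense(vector):
    """Collapse (Dup, Cut, TrDup, TrCut, Loss) counts to three dimensions."""
    dup, cut, trdup, trcut, loss = vector
    return (dup + cut, trdup + trcut, loss)


def best_vectors(vectors, dup_cost, tr_cost, loss_cost=1.0):
    """Condensed vectors that reach the minimum cost for the given event costs."""
    costs = {v: v[0] * dup_cost + v[1] * tr_cost + v[2] * loss_cost for v in vectors}
    lowest = min(costs.values())
    return frozenset(v for v, c in costs.items() if math.isclose(c, lowest))


def landscape(vectors, grid):
    """Group the sampled (dup_cost, tr_cost) pairs by their set of optimal vectors."""
    regions = defaultdict(list)
    for dup_cost in grid:
        for tr_cost in grid:
            regions[best_vectors(vectors, dup_cost, tr_cost)].append((dup_cost, tr_cost))
    return regions


# Hypothetical condensed Pareto-optimal vectors (not taken from the CRISPR-Cas data).
vectors = [(0, 0, 9), (1, 0, 5), (0, 2, 1), (2, 1, 0)]
regions = landscape(vectors, grid=[0.5, 1, 2, 3, 5, 10])
for optimal, points in regions.items():
    print(sorted(optimal), len(points))
```

Each key of `regions` corresponds to one colored area of such a plot; a finer grid, or an exact computation of the boundaries between regions, would be needed for a publication-quality figure.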

9. Conclusions

Synesth is a flexible tool for tree reconciliation that allows for a wide range of segmental events, addresses the inevitable incompleteness of the input dataset in terms of unsampled species, and offers various optimization and output criteria to the user. Moreover, its time complexity brings it up to the level of the most time-efficient reconciliation algorithms, such as ecceTERA [12] and RANGER-DTLx [14], but for an evolutionary model with events involving sets of genes rather than single genes.

The inductive characterization of Pareto-optimal histories allows for an exhaustive exploration of the solution space. Taking advantage of this flexibility, future extensions of the computational aspects of this work may address the problem of formally characterizing this space in terms of constructing equivalence classes or normalized histories, thus uniformly sampling the space of histories and assigning confidence values to predicted histories.

Further extensions of the model would also be worth investigating. For example, representing syntenies as sets does not capture the information of gene orders and multiplicities. Accounting for gene orders would require allowing rearrangement events, and this may significantly increase the computational complexity of the problem. Allowing gene repetitions inside syntenies would require representing them as multisets, which would break some of the assumptions required for the algorithmic approach presented in this work. However, the most questionable limitation of the model is probably the absence of synteny fusions while synteny fissions are allowed; as a result, the model favors large syntenies up to the root of the tree. Note, however, that including fusions will require adding reticulated nodes. It will be interesting to see whether our dynamic programming scheme can be generalized to such phylogenetic networks or if it makes the problem NP-hard, in which case tree decomposition methods may be explored [22].

Another important challenge not addressed in this paper is how to obtain an input synteny tree. In fact, phylogenetic methods instead output sets of gene trees—one for each gene family. If the individual gene trees are “consistent”, i.e., with no contradictory phylogenetic information, then a tree displaying them all can be obtained. However, even in this case, there may be an exponential number of such supertrees. In [10], the suggested solution was to test each possible supertree and retain the one leading to the most parsimonious reconciliation. An alternative would be to simultaneously construct and reconcile a supertree with a given species tree. This opens the door to interesting future investigations.

Author Contributions

Conceptualization, N.E.-M.; methodology, M.D. and N.E.-M.; software, M.D.; writing—original draft, M.D. and N.E.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (grant number RN000743) and Fonds de recherche du Québec—Nature et technologies (grant number 335893).

Data Availability Statement

Data sharing is not applicable to this article.

Acknowledgments

We would like to thank Arnaud Grandisson for his work on the graphical output and Mathieu Gascon for his help and invaluable comments on the model and algorithm.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Goodman, M.; Czelusniak, J.; Moore, G.W.; Romero-Herrera, A.E.; Matsuda, G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Biol. 1979, 28, 132–163.
2. Bansal, M.S.; Alm, E.J.; Kellis, M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 2012, 28, i283–i291.
3. Donati, B.; Baudet, C.; Sinaimeri, B.; Crescenzi, P.; Sagot, M.F. EUCALYPT: Efficient tree reconciliation enumerator. Algorithms Mol. Biol. 2015, 10, 3.
4. Tofigh, A.; Hallett, M.; Lagergren, J. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011, 8, 517–535.
5. El-Mabrouk, N.; Noutahi, E. Gene family evolution: An algorithmic framework. In Bioinformatics and Phylogenetics; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 87–119.
6. Duchemin, W.; Anselmetti, Y.; Patterson, M.; Ponty, Y.; Bérard, S.; Chauve, C.; Scornavacca, C.; Daubin, V.; Tannier, E. DeCoSTAR: Reconstructing the ancestral organization of genes or genomes using reconciled phylogenies. Genome Biol. Evol. 2017, 9, 1312–1319.
7. Duchemin, W. Phylogeny of Dependencies and Dependencies of Phylogenies in Genes and Genomes. Ph.D. Thesis, Université de Lyon, Lyon, France, 2017.
8. Dondi, R.; Lafond, M.; Scornavacca, C. Reconciling multiple genes trees via segmental duplications and losses. Algorithms Mol. Biol. 2019, 14, 7.
9. Paszek, J.; Gorecki, P. Efficient algorithms for genomic duplication models. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018, 15, 1515–1524.
10. Delabre, M.; El-Mabrouk, N.; Huber, K.T.; Lafond, M.; Moulton, V.; Noutahi, E.; Castellanos, M.S. Evolution through segmental duplications and losses: A super-reconciliation approach. Algorithms Mol. Biol. 2020, 15, 12.
11. Anselmetti, Y.; Delabre, M.; El-Mabrouk, N. Reconciliation with segmental duplication, transfer, loss and gain. In Comparative Genomics; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 124–145.
12. Jacox, E.; Chauve, C.; Szöllősi, G.J.; Ponty, Y.; Scornavacca, C. ecceTERA: Comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 2016, 32, 2056–2058.
13. Szöllősi, G.J.; Tannier, E.; Lartillot, N.; Daubin, V. Lateral gene transfer from the dead. Syst. Biol. 2013, 62, 386–397.
14. Weiner, S.; Bansal, M.S. Improved duplication-transfer-loss reconciliation with extinct and unsampled lineages. Algorithms 2021, 14, 231.
15. Libeskind-Hadas, R.; Wu, Y.C.; Bansal, M.S.; Kellis, M. Pareto-optimal phylogenetic tree reconciliation. Bioinformatics 2014, 30, i87–i95.
16. David, L.A.; Alm, E.J. Rapid evolutionary innovation during an Archaean genetic expansion. Nature 2011, 469, 93–96.
17. Libeskind-Hadas, R. Tree reconciliation methods for host-symbiont cophylogenetic analyses. Life 2022, 12, 443.
18. Saule, C.; Giegerich, R. Pareto optimization in algebraic dynamic programming. Algorithms Mol. Biol. 2015, 10, 22.
19. Makarova, K.S.; Wolf, Y.I.; Iranzo, J.; Shmakov, S.A.; Alkhnbashi, O.S.; Brouns, S.J.J.; Charpentier, E.; Cheng, D.; Haft, D.H.; Horvath, P.; et al. Evolutionary classification of CRISPR–Cas systems: A burst of class 2 and derived variants. Nat. Rev. Microbiol. 2020, 18, 67–83.
20. Coleman, G.A.; Davín, A.A.; Mahendrarajah, T.A.; Szánthó, L.L.; Spang, A.; Hugenholtz, P.; Szöllősi, G.J.; Williams, T.A. A rooted phylogeny resolves early bacterial evolution. Science 2021, 372, 588.
21. Koonin, E.V.; Makarova, K.S. Evolutionary plasticity and functional versatility of CRISPR systems. PLoS Biol. 2022, 20, e3001481.
22. Scornavacca, C.; Weller, M. Treewidth-based algorithms for the small parsimony problem on networks. Algorithms Mol. Biol. 2022, 17, 15.

Figure 1. (1) A synteny tree $T$ (on the left) with an augmented species tree $S^{*}$ (on the right). The numbers represent single gene families and the letters represent species. As indicated by the dashed gray lines connecting the two trees, the syntenies $\{1, 2\}$, $\{2, 3\}$, and $\{1, 3\}$ belong to species A; $\{2, 3\}$ and $\{1, 2\}$ to B; and $\{3\}$ and $\{1, 2\}$ to C. The dotted lines in the augmented species tree represent unsampled edges. (2) An output history $H$ of Synesth when given $T$ and $S^{*}$ as the input with costs $(δ_{Dup} = 2, δ_{Cut} = 2.5, δ_{TrDup} = 3, δ_{TrCut} = 3.5, δ_{Loss} = 1)$. The history tree is represented with black lines on top of the species tree filled in gray. The hatched background represents a part of the history taking place in an unsampled lineage. The events are represented as follows: “Spe” by ovals, “Dup” and “Cut” by rectangles, “TrDup” and “TrCut” by diamonds, “Loss” by right half-circles, and “Gain” by left half-circles. The synteny contents are written inside of each event, while the associated species is represented implicitly by the position of each event on top of the species tree. Duplicated or transferred genes are underlined, while fissions are represented by a separation in the synteny.


Figure 2. General shape of a history $H$ for a synteny tree $T$ starting with $v [Spe, X, σ]$, which is composed of two sub-histories taken from $h (v_{ℓ}, Y, γ_{ℓ})$ and $h (v_{r}, Z, γ_{r})$. These are linked together by two paths (represented by wavy lines) taken from $path ([X, σ_{ℓ}], v_{ℓ} [Y, γ_{ℓ}])$ and $path ([X, σ_{r}], v_{r} [Z, γ_{r}])$.


Figure 3. Output of Synesth for the CRISPR-Cas Class 1 dataset when asked for one minimum-cost history with event costs $(δ_{Dup} = 2, δ_{Cut} = 2.5, δ_{TrDup} = 4, δ_{TrCut} = 4.5, δ_{Loss} = 1)$.


Figure 4. The event cost landscape for the solutions returned by Synesth for the Class 1 Cas gene synteny dataset (see text). The cost of a loss event was fixed to 1. For each color region, the legend shows a condensed event count vector of the form $(c_{Dup} + c_{Cut}, c_{TrDup} + c_{TrCut}, c_{Loss})$, where “n” indicates the number of distinct Pareto-optimal histories for any set of event costs in that region.


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

