*4.2. Proof of the Algorithm Exactness for the First Subcase*

**End of the algorithm description for the first subcase.**

Define the *autonomous a-cost Aa*(*G*) of a graph *G* to be the total cost of the sequence of operations given in the autonomous reduction for *G* after deduction of costs of all its *b*-special operations, i.e., those reducing the number of *b*-singular vertices. In other words, *Aa*(*G*) accounts only for *a*-special (i.e., those reducing the number of *a*-singular vertices) and nonspecial (i.e., not reducing the number of singular vertices) operations. Together, they are called *a*-*operations*.

For a graph *G*, we use the following *notation*: *Ba* and *Bb* are the numbers of, respectively, *a*-singular and *b*-singular vertices in *G*; *S* is the sum of integral parts of halved lengths of maximal connected fragments consisting of conventional edges (referred to as *segments*) plus the number of extremal (on a chain) odd segments (i.e., those consisting of an odd number of edges) minus the number of cyclic segments; *Sa* is the number of segments enclosed between two singular *a*-edges; *Ca* is the number of *a*-cycles; *D* is the number of chains of types 1*a*, 1*b*, 3*a*, 3*b*, and 3; *N*(*t*) is the number of chains of a certain type *t*. Denote by *A* the number of *a*-special DCJ operations in the autonomous reduction.

**Lemma 5.** *Let wa and wb be the removal costs for singular a- and b-vertices, and let all other operations have cost 1. Then,*

$$A_a(G) = (1 - w_a) \cdot \bigl(S_a - C_a + N(2b^{*}) + N(3a^{*}) + N(1a^{*})\bigr) + w_a \cdot B_a + S + D.$$

**Proof.** For every component of *G*, let us check the equality *A* = *Sa* − *Ca* + *N*(2*b*\*) + *N*(3*a*\*) + *N*(1*a*\*), and then sum up the obtained equalities. Each of them follows from these facts: when cutting out conventional edges from any component other than an *a*-cycle, the number of executed *a*-special operations is *Sa*, and, for an *a*-cycle, it is less by 1; the operation of circularizing a chain into a cycle is *a*-special if and only if the chain is of type 2*b*\*, 3*a*\*, or 1*a*\*; breaking a cycle involves no *a*-special DCJ operations. Next, the number of removals of *a*-singular vertices in the autonomous reduction is *Ba* − *A*, and, similarly to the proof of Lemma 1, we obtain that the number of nonspecial operations is *S* + *D*. Hence, *Aa*(*G*) = *A* + *wa*·(*Ba* − *A*) + *S* + *D*, which is the claimed expression. □
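The counting argument in this proof can be illustrated by a minimal sketch that evaluates *Aa*(*G*) from the quantities of the notation paragraph. The class `GraphCounts`, its field names, and the function names are hypothetical and introduced only for illustration; the counts are assumed to be already known for the graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphCounts:
    """Counts from the notation paragraph (field names are illustrative)."""
    B_a: int        # number of a-singular vertices
    S: int          # segment-based quantity S
    S_a: int        # segments enclosed between two singular a-edges
    C_a: int        # number of a-cycles
    D: int          # chains of types 1a, 1b, 3a, 3b, 3
    N_2b_star: int  # chains of type 2b*
    N_3a_star: int  # chains of type 3a*
    N_1a_star: int  # chains of type 1a*


def a_special_dcj_count(g: GraphCounts) -> int:
    """A = S_a - C_a + N(2b*) + N(3a*) + N(1a*)."""
    return g.S_a - g.C_a + g.N_2b_star + g.N_3a_star + g.N_1a_star


def autonomous_a_cost(g: GraphCounts, w_a: float) -> float:
    """Cost of a-operations: unit-cost DCJ operations plus a-removals."""
    A = a_special_dcj_count(g)
    removals = g.B_a - A       # removals of a-singular vertices, cost w_a each
    nonspecial = g.S + g.D     # nonspecial operations, cost 1 each
    return A + w_a * removals + nonspecial
```

Summing the three groups of operations in this way gives the same value as the closed form (1 − *wa*)·*A* + *wa*·*Ba* + *S* + *D* of Lemma 5.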

Define the *a***-***quality Pa*(*s*) of an interaction *s* to be

$$P_a(s) = A_a(G) - A_a(s(G)) - c_a(s),$$

where *s*(*G*) is the graph obtained by applying *s* to *G*, and *ca*(*s*) is the total cost of *a*-operations in *s*. The number *Pa*(*s*) shows how much *a*-cost is saved by applying the interaction *s* compared with the autonomous reduction of *G*, i.e., without using *s*. Note that, in Case II, we used the *quality* of an interaction, which accounts for all operations rather than only *a*-operations.

Using Lemma 4, we compute *a*-qualities of 2-interactions. For example, let us find the *a*-quality of the interaction 1*a* + 1*b* = 1*b*\* (Figure 4), where *ca*(*s*) = 1. The interaction reduces *Ba* by 1, reduces *D* by 2, and leaves all other quantities in the expression for *Aa*(*G*) unchanged. Therefore, *Pa*(*s*) = *wa* + 1.
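The intermediate step of this example can be written out as follows. This is a sketch that takes the stated changes at face value: Δ*Ba* = 1 and Δ*D* = 2 denote the decreases caused by the interaction, and every other term in the expression for *Aa*(*G*) is assumed unchanged.

$$\begin{aligned}
A_a(G) - A_a(s(G)) &= w_a \cdot \Delta B_a + \Delta D = w_a \cdot 1 + 2, \\
P_a(s) &= \bigl(A_a(G) - A_a(s(G))\bigr) - c_a(s) = (w_a + 2) - 1 = w_a + 1.
\end{aligned}$$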

Let *E* be any reducing sequence for a graph *G*, which is arbitrarily divided into connected fragments of operations; let us for a while call the latter interactions and denote them by *s*. Define *Ta*(*G,E*), the *total cost of a*-*operations* in *E*, and *Pa*(*G,E*), the total *a*-quality of all interactions in *E*, where *Pa*(*s*) is formally defined above. The following lemma is proven similarly to Lemma 2 in Section 3:

**Lemma 6.** *For any graph G, we have Ta*(*G*,*E*) = *Aa*(*G*) − *Pa*(*G*,*E*).
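One way to see why the identity holds is a telescoping sum. The following is only a sketch and not the proof referred to above: the indices *i* and *k* and the graphs *G*<sub>*i*</sub> are notation introduced here, and it is assumed that the autonomous *a*-cost of a fully reduced graph is zero. Let the interactions of the sequence be *s*<sub>1</sub>, ..., *s*<sub>*k*</sub>, with *G*<sub>0</sub> = *G* and *G*<sub>*i*</sub> = *s*<sub>*i*</sub>(*G*<sub>*i*−1</sub>). Then

$$\sum_{i=1}^{k} P_a(s_i) = \sum_{i=1}^{k} \bigl(A_a(G_{i-1}) - A_a(G_i) - c_a(s_i)\bigr) = A_a(G_0) - A_a(G_k) - \sum_{i=1}^{k} c_a(s_i),$$

and, under the assumption that the autonomous *a*-cost of *G*<sub>*k*</sub> is zero, the right-hand side equals the autonomous *a*-cost of *G* minus the total cost of *a*-operations in the sequence, which rearranges to the statement of the lemma.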

**Lemma 7.** *After Stage 3, the algorithm outputs a graph G''' with no polytype corresponding to any 2-interaction of two chains.*

**Proof.** It is important that the algorithms for Cases II (when *wb* = 1) and III at Stages 0, 1, and 2 completely coincide. Maximal domains that are constructed in both cases coincide, since, through an exhaustive search over all 2-interactions, we check the following: for *wb* = 1, the *a*-quality of any 2-interaction is equal to the quality of the same 2-interaction (computed in Section 2). Furthermore, through an exhaustive search, we check that all 3-interactions have zero quality. If, for *G'''*, there are two different chains corresponding to some 2-interaction (its *a*-quality is automatically strictly greater than zero for 0 < *wa* < 1), then, after Stage 2 of the *algorithm for Case II* applied to *G* with *wb* = 1, we use 3-interactions, followed by this 2-interaction and, then, autonomous reduction. The aggregate quality of all interactions in such a reducing sequence is strictly greater than the aggregate quality in the sequence output by the algorithm for Case II. By Lemma 2, this means that we have obtained a sequence with total cost strictly less than the absolute minimum *C*(*G*), which is impossible. □

**Lemma 8.** *After Stage 3, the algorithm outputs a graph G''' satisfying the following condition: it contains no more than two chains having b-singular vertices. Any such pair of different chains has types forming one of the following pairs:* {2*a*, 3*a*\*}, {1*a*\*, 1*b*}, {3*b*, 2*b*\*}, {2*a*, 1*a*\*}, {3*a*\*, 1*b*}, {1*a*\*, 3*b*}, and {1*b*, 2*b*\*}. *If the initial G has no (*a,b*)-chains, then G''' contains at most one chain with a b-singular vertex*.

**Proof.** First claim. Through an exhaustive search, we check that, for any three types of chains with *b*-singular vertices, there are two of them corresponding to a 2- or 3-interaction. The former is impossible by Lemma 7, and the latter is impossible by the definition of Stage 3.

Second claim. Exhaustively searching over all types of chains with *b*-singular vertices, we form all pairs of them that do not correspond to any 2- or 3-interaction. We obtain precisely the pairs listed in the lemma. No pair in this list corresponds to such an interaction.

Third claim. If an (*a,b*)-chain does not occur in the initial *G*, it does not appear while running the algorithm, i.e., in *G'''*, there are only *a-* and *b*-chains. If, in *G'''*, there are two different chains with *b*-singular vertices, then, by the second claim, their types form one of the listed pairs. However, every listed pair contains a type with an asterisk, and every chain whose type contains an asterisk is an (*a,b*)-chain, a contradiction. □

Denote by *T*(*G*) the *total cost of operations* executed by the described algorithm on a graph *G*. Let *Cb* and *Cab* be the numbers of *b*- and (*a,b*)-cycles in *G*, respectively; let *Ipb* be the indicator function of the property "*G* contains a *b*-loop but has no components with a singular *b*-vertex other than loops"; similarly, let *Icb* be the indicator function of having a chain with a singular *b*-vertex; let ε*b* = *wb* − 1.

Denote by *Pa*(*G*) the *total a-quality* of 2-interactions when running the algorithm on a graph *G.*

**Lemma 9.** *Let wa and wb be removal costs for singular a- and b-vertices, and let all other operations have cost 1. Then,*

$$T(G) = A_a(G) - P_a(G) + B_b + \varepsilon_b \cdot (C_b + C_{ab} + I_{cb} + I_{pb} + E), \quad \text{where } E = 0 \text{ or } E = 1.$$

**Proof.** We declare each operation outside Stages 2 and 3 as an interaction. It is easily checked that the *a*-quality of any such formal interaction, as well as that of any 3-interaction, is zero. By Lemma 6, the total cost of *a*-operations is *Aa*(*G*) − *Pa*(*G*). In a reducing sequence, the number of *b*-special operations is *Bb*. Let us show that the number of *b*-removals among them is *Cb* + *Cab* + *Icb* + *Ipb* + *E*. If, in *G*, there are no components with a *b*-vertex other than *b*-loops, the claim is trivial. Otherwise, the number of *b*-removals is equal to the number of components with a *b*-vertex that occur in *G'''*, i.e., before Stage 4, since a removal operation was never applied in the algorithm before; at Stage 4, only autonomous reduction is used. This amounts to *Cb* + *Cab* original cycles, *Icb* chains with a *b*-vertex, and *E* more chains from Lemma 8. □
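The bookkeeping in this proof can be sketched as a short function. This is an illustration under the assumptions of the lemma, not the authors' implementation: all arguments are taken as already computed, the parameter names mirror the symbols of the lemma, and the function name `total_cost` is hypothetical.

```python
def total_cost(A_a: float, P_a: float, B_b: int, w_b: float,
               C_b: int, C_ab: int, I_cb: int, I_pb: int, E: int) -> float:
    """Total cost T(G), assembled as in the proof of Lemma 9."""
    if E not in (0, 1):
        raise ValueError("E must be 0 or 1")
    # Cost of all a-operations (a-special and nonspecial), by Lemma 6.
    a_cost = A_a - P_a
    # Among the B_b b-special operations, this many are removals (cost w_b).
    b_removals = C_b + C_ab + I_cb + I_pb + E
    # The remaining b-special operations have cost 1 each.
    b_other = B_b - b_removals
    return a_cost + b_other * 1.0 + b_removals * w_b
```

Since each removal replaces a unit-cost operation by one of cost *wb*, the *b*-part equals *Bb* + (*wb* − 1)·(number of removals), which is the form stated in the lemma. Changing *E* from 0 to 1 raises the result by exactly *wb* − 1.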

We prove the additive exactness of the algorithm by induction on the cost *C*(*G*) of the shortest reducing sequence for a graph *G*. Denote by *T'*(*G*) the value of *T*(*G*) for *E* = 0. Assume that, for any operation *o* applied to an arbitrary breakpoint graph *G*, the "triangle inequality",

$$c(o) \ge T'(G) - T'(o(G)),$$

is valid, where *c*(*o*) is the cost of *o* and *o*(*G*) is the result of applying *o* to *G*. If we take for *o* the first operation in the shortest sequence, by the induction hypothesis, we obtain *C*(*o*(*G*)) ≥ *T'*(*o*(*G*)) ≥ *T'*(*G*) − *c*(*o*) and *C*(*G*) = *C*(*o*(*G*)) + *c*(*o*) ≥ *T'*(*G*) − *c*(*o*) + *c*(*o*) = *T'*(*G*). Therefore, the algorithm error is at most ε*b*·*E* ≤ *wb* − 1.

The proof of the triangle inequality follows the same lines as in Section 3 (Case II) by examining all operations *o*. In the proof, one should use *Pa*, the *a*-quality of the maximal domain on chains in *G*.

#### **5. Algorithm and Proof of Theorem 1 for Case I**

It might be interesting for the reader to compare the algorithms and proofs presented above with those related to Case I, where both costs *wd* and *wi* are not less than *w*, although this case was analyzed in a preparatory work [2]. As above, it suffices to consider the case of *wa* ≤ *wb*.

Recall that, in Case I, the algorithm consists of nine stages, numbered 0 through 8.

**Stage 0**: The same as above: Transform initial graphs *a* and *b* into *a* + *b*.

**Stage 1**: The same as above: Cut out conventional edges.

**Stage 2**: Perform the same 2-interactions between chains as in Section 2 (except for the two additional ones) but in the same order as they are listed, independently of their quality (i.e., we need not choose a maximal domain *M*). The meaning of these interactions, as always, is that they save as many operations as possible compared with the number of operations in the autonomous reduction of a graph.

**Stage 3**: Perform *3'-interactions* between chains, which are listed in [2]. The meaning of *3'-*interactions is that they reduce the number of *short* chains in *a* + *b* (to be precise, chains of types 2*a'*, 2*b'*, 3*a'*, 3*b'*, 1*a'*, 1*b'*, and 2*'*), thereby reducing the additive error of the algorithm, which occurs precisely because of them.

**Stage 4**: Circularize chains (except for short ones) into cycles using OM or SM operations.

**Stage 5**: Join all cycles into one cycle and then detach final cycles from it.

**Stage 6**: Perform *6-interactions* between a cycle and short chains, which reduce the number of the latter. These are interactions defined by the following term equalities: (*a,b*)-cycle + 2*a'* + 2*b'* = 2*'* (two DM operations) and (*a,b*)-cycle + 2*'* = 2*'* (DM and cutting out a conventional edge).

**Stage 7** is applied if *wb* > 2. Perform *7-interactions* between two cycles or between a cycle and a short chain. Their meaning is to replace the removal operation for a *b*-singular vertex with two intermergings, which is advantageous if *wb* > 2. These are the interactions defined by the following term equalities: (*a,b*)-cycle + *b*-cycle = (*a,b*)-cycle + cycle of length 2 (two DM operations), *b*-cycle + *b*-cycle = *b*-cycle + cycle of length 2 (two DM operations), *b*-cycle + 2*'*-chain = 2*'*-chain + cycle of length 2 (SM and cutting out an extremal conventional edge), (*a,b*)-cycle + 2*a'*-chain = (*a,b*)-cycle (SM and OM), (*a,b*)-cycle + 1*b'*-chain = (*a,b*)-cycle (two SM operations), *b*-cycle + 2*a'*-chain = 2*a'*-chain + cycle of length 2 (SM and cutting out an extremal conventional edge), *b*-cycle + 1*b'*-chain = 1*b'*-chain + cycle of length 2 (SM and cutting out an extremal conventional edge). If *wa* > 2, then symmetric interactions with *a* replaced by *b* are also performed.

**Stage 8:** Remove singular vertices.
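The order of the stages for Case I, together with the condition under which Stage 7 is applied, can be summarized in a small sketch. The names `Stage`, `CASE_I_STAGES`, and `active_stages` are hypothetical, the summaries are abbreviations of the stage descriptions above, and the stages themselves are not implemented here; only the stage schedule is modeled.

```python
from typing import Callable, List, NamedTuple


class Stage(NamedTuple):
    number: int
    summary: str
    applies: Callable[[float, float], bool]  # arguments: (w_a, w_b)


def always(w_a: float, w_b: float) -> bool:
    return True


def stage7_condition(w_a: float, w_b: float) -> bool:
    # Stage 7 is applied only if w_b > 2, as stated in the text.
    return w_b > 2


CASE_I_STAGES = (
    Stage(0, "transform initial graphs a and b into a + b", always),
    Stage(1, "cut out conventional edges", always),
    Stage(2, "perform 2-interactions in listed order", always),
    Stage(3, "perform 3'-interactions", always),
    Stage(4, "circularize non-short chains", always),
    Stage(5, "join cycles, then detach final cycles", always),
    Stage(6, "perform 6-interactions", always),
    Stage(7, "perform 7-interactions", stage7_condition),
    Stage(8, "remove singular vertices", always),
)


def active_stages(w_a: float, w_b: float) -> List[int]:
    """Stage numbers executed for the given removal costs."""
    if w_a > w_b:
        raise ValueError("the text assumes w_a <= w_b")
    return [s.number for s in CASE_I_STAGES if s.applies(w_a, w_b)]
```

For example, `active_stages(1.0, 2.0)` omits Stage 7, whereas `active_stages(1.5, 2.5)` includes every stage from 0 to 8.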

Now, we recall the proof of exactness of this algorithm, which has an additive error of at most 2*w*. Like all such proofs, it is based on the triangle inequality. However, for Case I, it has a distinction: *T*(*G*) takes the form *T'*(*G*) + *E*(*G*), where *T'*(*G*) involves only easily computable characteristics of a graph *G*, and *E*(*G*) ≤ 2*w*. The triangle inequality for *T'*(*G*) would imply that *C*(*G*) ≥ *T'*(*G*), where, as always, *C*(*G*) is the total cost of the shortest reducing sequence for *G*. However, for *T'*(*G*), the triangle inequality is not always valid. To overcome this obstacle, all graphs *G* are *divided into ranks* 1, 2, and 3. Graphs of ranks 1 and 2 have a simpler structure, although graphs of a final form fall into rank 3 (from the point of view of the current proof, they are more complicated). For graphs *G* of ranks 1 and 2, we prove a lower bound on *C*(*G*), which is stronger than *C*(*G*) ≥ *T'*(*G*); for graphs of rank 1, it is even stronger than for those of rank 2. When passing, by an operation *o*, from a graph of a larger rank to that of a smaller rank, the triangle inequality is weakened by a strictly positive number Δ (depending only on these ranks), i.e., *c*(*o*) ≥ *T'*(*G*) − *T'*(*o*(*G*)) − Δ. Under the inverse transformation, it is strengthened by Δ, i.e., *c*(*o*) ≥ *T'*(*G*) − *T'*(*o*(*G*)) + Δ. Then, the proof proceeds by induction on the shortest reducing sequence for *G*; it also requires an exhaustive search over all pairs of ranks of *G* and *o*(*G*). For example, if *G* is of rank 3 and *o*(*G*) of rank 2, then, by the induction hypothesis, we have *C*(*o*(*G*)) ≥ *T'*(*o*(*G*)) + Δ ≥ *T'*(*G*) − *c*(*o*) − Δ + Δ and *C*(*G*) = *C*(*o*(*G*)) + *c*(*o*) ≥ *T'*(*G*) − *c*(*o*) + *c*(*o*) = *T'*(*G*). Here, we used the fact that the Δ by which the triangle inequality is weakened when passing from a graph of rank 3 to that of rank 2 is equal to the number by which the lower bound on *C*(*G*) for graphs of rank 2 is strengthened. The analogous consistency of such inequalities holds for other pairs of ranks as well. All cases of the ranks are examined in [2].
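For comparison, the inverse direction can be sketched in the same way. This is an illustration under the assumptions stated in the text, not a case taken from [2]: *G* is of rank 2, *o*(*G*) is of rank 3, *o* is the first operation of a shortest reducing sequence for *G*, and Δ is the same constant as in the example above. The induction hypothesis for the rank 3 graph and the strengthened triangle inequality give

$$\begin{aligned}
C(o(G)) &\ge T'(o(G)) \ge T'(G) - c(o) + \Delta, \\
C(G) &= C(o(G)) + c(o) \ge T'(G) + \Delta,
\end{aligned}$$

which is the strengthened lower bound required for a graph of rank 2.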

#### **6. Discussion**

Usually, a small (slightly different from the original operation costs) explicitly specified additive error is not considered a violation of the algorithm exactness. In this sense, we described an exact algorithm of linear complexity which constructs a shortest transformation of one weighted (in other words, labeled) directed graph into another under equal costs of DCJ operations and arbitrary costs of deletion and insertion operations. These graphs must be of degree 2 in the sense that each vertex is of degree 1 or 2. The labeling cannot contain repetitions of names. Note that the above can be carried over to infinite (countable) recursively enumerated sequences. The idea of the current algorithm was based on the abstract theory developed in [11,12].

The theorem proven in the paper is rather unexpected from the mathematical point of view: a problem that seems to require exhaustive search, or at least to be computationally very difficult, is solved by an exact algorithm whose runtime is always linear in the size of input data. In this sense, the problem is not related to any application. However, it has arisen in the context of a wide variety of applications. Of these, the most popular concerns the description of a biological genome by a set of chains and cycles that correspond to linear and circular chromosomes. In this case, the considered operations correspond to real genome transformations in the biological evolution of the genome. It is important that different operations can be assigned different costs, which corresponds to different frequencies of their occurrence in the evolution.

For instance, the possibility of taking different costs of deletion *wd* and insertion *wi* is essential when mitochondria of deuterostomes are considered. Their mitochondria mainly have the same set of genes, and each gene has a unique function. However, the order of these genes can vary; it is substantially different in the purple sea urchin *Strongylocentrotus purpuratus* than in vertebrates (Jacobs et al. [13]), as well as among echinoderms themselves. Thus, gene losses and acquisitions are much rarer here than DCJ operations, and rare events should be assigned higher costs. A similar pattern is observed in plastids of plants, e.g., the order of plastid genes in the red alga *Porphyridium purpureum* notably differs from that in other red algae. In another case, where one genome is smaller than the other, the loss cost should exceed that of acquisition, which surely depends on the gene and the number of its copies. For example, the nuclear genome amplifies after hybridization of two species as in the case of polyploid strawberry or wheat (Bors and Sullivan [14]). Polyploidy is much more common in plants than in animals. Among animals, it was described in nematodes including ascarids and certain amphibians. However, other full-genome duplications have been reported in chordates (Putnam et al. [15]). Genome reduction can also occur, e.g., many genes (paralogs) are lost after a full-genome duplication. In this regard, the presented algorithm has been used, e.g., in [16].

Let us give an example from another area. In some robotics and image processing problems, it is assumed that the "terrain" is bounded by barriers that can be described as chains and cycles. Chains and cycles are also used to represent parts that have long or compact forms and to describe the frame of an area or human pose, among other examples. Among many such applications, which are not related to genomics, let us mention, as an example, [17].

Further research will be aimed at reducing the additive error, ideally to zero, constructing fast algorithms in the case of unequal costs of DCJ operations, allowing name repetitions in the graph labeling, and passing to graphs of degree 3. Moreover, it would be important to extend the list of operations by adding, for example, the operation of edge duplication in the initial graph, which will allow a rigorous mathematical treatment of the duplication of biological genes [18].

**Author Contributions:** Conceptualization, V.L. and K.G.; proof, K.G. and V.L.; writing—original draft preparation, K.G.; writing—review and editing, V.L.; supervision, V.L.; project administration, V.L.; funding acquisition, V.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Russian Foundation for Basic Research under research project No. 18-29-13037.

**Acknowledgments:** We are grateful to the anonymous reviewers for their thorough review, and we highly appreciate the comments and suggestions, which significantly contributed to improving the quality of the publication.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
