#### **1. General Aspects of** *T. cruzi* **Biology**

The Trypanosomatidae family includes parasites of vertebrates, invertebrates, and plants. Owing to their adaptation to different environmental conditions and their high biological diversity, these protists have a major impact on all biotic communities [1,2]. *Trypanosoma cruzi* (*T. cruzi*) is the parasite that causes Chagas disease, or American trypanosomiasis, a chronic illness endemic to Central and South America and a neglected tropical disease. Chagas disease is characterized by an acute phase with low mortality and few symptoms. Patients can then remain in an asymptomatic phase for life or, after many years without any sign of disease, develop a symptomatic chronic phase with cardiomyopathy, megaviscera, or both [3]. These variations in disease outcome are related to the high genetic variability of the parasite [4–7].

*T. cruzi* has a very complex life cycle that includes a hematophagous triatomine invertebrate vector and a broad range of mammalian hosts [8]. Across the insect and mammalian hosts, four major developmental stages have been identified [9,10]. The non-infective epimastigotes are present in the midgut of triatomines, where they differentiate into infective metacyclic trypomastigotes, which, after infecting host cells, differentiate into replicative amastigotes [11]. Finally, these amastigotes replicate by binary fission and differentiate into bloodstream trypomastigotes, which lyse the cell and can infect other cells of the host.

The mitochondrial DNA of *T. cruzi* forms a network of concatenated circular molecules, maxicircles and minicircles, called the kinetoplast. This structure contains dozens of maxicircles (20–40 kb) and thousands of minicircles (0.5–10 kb), whose sizes vary by species [12,13]. Maxicircles contain the characteristic mitochondrial genes of other eukaryotes and consist of two regions: the coding region and the divergent/variable region, which is very difficult to sequence because of its repetitive sequences [14]. Minicircles are exclusive to trypanosomatids and are directly involved in the U-insertion/deletion editing system, as they encode guide RNAs (gRNAs) [15]. Moreover, it has been suggested that both molecule populations are heterogeneous, showing strain-specific variations [16,17].

*T. cruzi* reproduction is usually asexual, by binary division, but there is evidence of natural hybridization, genetic exchange between strains, and sexual reproduction [18–21]. The population genetics of *T. cruzi* has also generated significant interest and produced two opposing views. A clonal theory was proposed that considers *T. cruzi* the paradigm of the predominant clonal evolution (PCE) model of pathogens, showing that this parasite shares many features with other parasitic protozoa, fungi, and bacteria [22,23]. However, other researchers have demonstrated that *T. cruzi* could reproduce sexually by a mechanism consistent with classic meiosis, and have suggested that the PCE model does not reflect the biological reality of this parasite [21,24].

During mitosis, the genome of *T. cruzi* does not condense to form chromosomes, which prevents its visualization by conventional techniques [25,26]. The parasite karyotype was instead determined by molecular biology techniques, such as pulsed-field gel electrophoresis (PFGE) combined with Southern blotting. These studies revealed large variability in the size and number of chromosomes between strains and even among clones of the same strain [27,28]. The parasite is usually described as diploid; chromosome size varies from 0.45 Mb to 4 Mb and chromosome number from 19 to 40. Flow cytometry experiments have estimated a genome size between 80 and 150 Mb [29].

#### **2. Classification of** *T. cruzi* **Strains**

There are many genetically different strains of *T. cruzi* [30,31]. Investigators in the field have therefore looked for methods to classify these strains, mostly according to their biological and genomic differences. The first classification was established in 1999 at a Satellite Meeting held at Fiocruz [32]. An expert committee reviewed the available data and established two principal subgroups, named *T. cruzi* I and *T. cruzi* II (Figure 1A). This classification was based on biological and biochemical characteristics and on molecular approaches such as mini-exon studies and the 24Sα ribosomal DNA sequence.

Ten years later, knowledge of the molecular diversity of the parasite had increased, and multilocus genotyping analyses revealed six distinct Discrete Typing Units (DTUs) [30], which were in turn classified into two major subdivisions called DTU I and DTU II. DTUs are defined as "sets of stocks that are genetically more related to each other than to any other stock and that are identifiable by common genetic, molecular or immunological markers" [33]. Furthermore, based on phylogenetic information from multilocus enzyme electrophoresis (MLEE) and random amplified polymorphic DNA (RAPD) markers, DTU II was split into five DTUs (IIa–e) [34,35]; DTUs I and IIb correspond, respectively, to the *T. cruzi* I and *T. cruzi* II groups recommended by the original committee in 1999 (Figure 1B). This new classification considered DTUs I and IIb to be the ancestral strains, DTUs IId and IIe to be the products of a minimum of two hybridization events [36–38], and DTUs IIa and IIc to be ancestral hybrids.

**Figure 1.** Different classifications of *Trypanosoma cruzi* since 1999. (**A**) Classification of the meeting of 1999. (**B**) First consensus classification of 2009. (**C**) Second consensus classification of 2009. (**D**) Alternative classification proposed in 2016.

However, a second revision that same year (2009) proposed a final classification into six DTUs [30]. DTUs I and II were the ancestral strains, DTUs III–IV were those with at least one recombination event between DTUs I and II (homozygote hybrids), and DTUs V–VI were heterozygote hybrids of DTUs II and III (Figure 1C). A new strain detected in bats was also included in the classification as TcBat [39] and, following subsequent studies based on diverse molecular markers, it is considered to be the seventh DTU [40].

Finally, in 2016, Barnabé et al. [41] questioned the statistical validity of this classification. They performed a phylogenetic reconstruction with maximum likelihood trees based on the mitochondrial genes most commonly found in databases. They proposed a new grouping based on three genes, two mitochondrial (*CytB* and *COII*) and one nuclear (*Gpi*). This new classification established three groups: the ancestral mtTcI and mtTcII, and mtTcIII, which grouped all the hybrid strains. They included TcBat as an independent strain, although it was phylogenetically related to mtTcI (Figure 1D).

#### **3. The Genomes of** *T. cruzi***: A New Update**

The first version of a *T. cruzi* genome, from the CL Brener strain, was published in 2005 [42]. Interestingly, the genomes of *Leishmania major* [43] and *Trypanosoma brucei* (*T. brucei*) [44] were published in the same year.

The CL Brener strain was the most analyzed strain at the time: it had reproducible in vitro models, was capable of producing an acute phase, and was susceptible to benznidazole [45]. In contrast to *Leishmania major* and *T. brucei*, whose genomes contained around 20–25% repetitive sequences, the *T. cruzi* genome contained around 50%, making genome analysis and assembly more difficult [46]. Therefore, this first genome did not achieve the expected quality and remains incomplete, although it has been the principal reference for many researchers to this day, despite the increasing availability of new and better genome sequences.

To date, several *T. cruzi* genomes are available in the databases of the National Center for Biotechnology Information (NCBI) and TriTrypDB. This has contributed to the study and understanding of the phenotypic, pathogenic, and other complex variations among strains. Table 1 summarizes the recently available genomes in databases for the most studied strains. Some of these genomes were constructed with short-read sequencing methods (e.g., Illumina/Roche 454), such as those of the Y [47], 231 [48], Sylvio X10/1 [49], and G [50] strains, or the B7 strain of *T. cruzi marinkellei* [51]. Although these methods produce a high number of reads and have a low error rate, a relevant problem is their inability to reconstruct complete chromosomes from short reads, which results in very fragmented assemblies for complex genomes such as those of trypanosomatids. This could lead to over-, under-, or misrepresentation of genes or of complete chromosomal regions. In this regard, long-read sequencing methods (e.g., PacBio, Nanopore) could be a better choice for trypanosomatid genomes [52], as in the case of the Bug2148 strain [53]. This technology allows long DNA fragments to be sequenced, overcoming the difficulties posed by the complex and repetitive nature of the parasite genome. It could help to obtain less redundant and more complete genomes, although the assembly size is still below the estimates made by DNA measurements (80–150 Mb) [29]. However, the error rate is higher with long-read methods than with short-read methods and needs to be minimized by increasing the sequencing coverage. Therefore, in recent years, some laboratories have combined both techniques to improve the assembly process, as for the Berenice [54] or TCC and Dm28c [55] strains. In fact, long-read sequencing methods generate contigs of more than 1 Mb, probably covering whole chromosomes. This allows a genome to be assembled into a smaller number of contigs, as happens with the Berenice, Dm28c, TCC, and Bug2148 strains (Table 1), which have the largest contig N50 values. Other researchers have suggested that the copy number of conserved *T. cruzi* genes, such as the monoglyceride lipase gene, could be used as a control for misassembly [56].

**Table 1.** Data of the most recent genomes of the best-studied strains of *T. cruzi* and the B7 strain of *T. cruzi marinkellei*. BNEL: CL Brener Non-Esmeraldo-like; BEL: CL Brener Esmeraldo-like; PacBio: Pacific Biosciences. Contig N50: the contig length such that 50% of the whole assembly is contained in contigs equal to or larger than this value.
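The contig N50 statistic defined in the caption can be illustrated with a short calculation. The following is a minimal sketch in Python; the function name `contig_n50` and the example contig lengths are hypothetical and are not taken from any of the assemblies in Table 1.

```python
def contig_n50(lengths):
    """Return the contig N50 for an iterable of contig lengths (in bp).

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the contig at which the running
    total first reaches half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
example_lengths = [1_200_000, 900_000, 400_000, 300_000, 200_000]
print(contig_n50(example_lengths))  # prints 900000
```

In this toy assembly of 3 Mb, the two longest contigs together already cover more than 1.5 Mb, so the N50 is the length of the second contig. A more fragmented assembly of the same total size would give a smaller N50.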


Moreover, it was demonstrated that transcriptomic data may be useful to correct and re-annotate previously assembled genomes. In the case of Sylvio X10/1, RNA-seq data were used to improve the previous genome annotation, showing that 79.95% of the genome corresponds to coding sequence, whereas the previous genomic analysis had established only 37.73% [57]. These results also suggested that the haploid genome of Sylvio X10/1 may be larger than previously reported (at least 51 Mb).

In the NCBI, the reference genome is the hybrid CL Brener genome of 2005 [42,58], and many researchers currently rely on this information. CL Brener is a hybrid strain whose homologous chromosomes differ in length and genetic content. Furthermore, this strain was separated into two haplotypes, named CL Brener Esmeraldo-like and CL Brener Non-Esmeraldo-like, whose genomes are also deposited in databases. Full-length chromosome sequencing was performed for this hybrid strain using a combined strategy based on bacterial artificial chromosome (BAC) end sequencing and synteny maps with *T. brucei* [58], yielding 41 virtual chromosomes (Table 1). Despite the continuous re-annotation of these genomes, they are far from being the best reference considering all the new and more complete genomes obtained with current long- and short-read sequencing techniques, such as those of the Y [47], Bug2148 [53], Berenice [54], or Dm28c [55] strains. Therefore, we need to reconsider which genome is appropriate as a reference for *T. cruzi* research, and whether a single reference genome is useful given the high heterogeneity of the parasite. Moreover, and more importantly, some of the different DTUs of *T. cruzi* showed relevant differences in pathogenicity in mice [6]. This compels us to understand the differences at the genomic level, and each strain would need a specific genomic analysis. Also, the high pathogenic, biological, and genetic diversity of *T. cruzi* strains, even within DTUs, suggests that DTUs might not be a definitive form of classification, and it has been hypothesized that *T. cruzi* could be a complex of species rather than a single species [59].

#### **4. Genetic Diversity and Genome Structure of** *T. cruzi*

#### *4.1. Ploidy*

Different studies confirmed the complexity of the *T. cruzi* genome, with different chromosome lengths between clones of the same strain, strains of distinct DTUs, or strains of the same DTU [26,28]. However, ploidy and chromosomal copy number variation (CCNV) in this parasite could not be analyzed until the arrival of Next Generation Sequencing (NGS) approaches.

Aneuploidy was studied in detail in *Leishmania*, whose "mosaic aneuploidies" are ploidy variations between isolates from the same strain and even between individual cells from the same population. These aneuploidies are related to drug resistance, gene expression regulation, or host adaptation [60–62]. In contrast, ploidy is stable in *T. brucei*, including the subspecies *T. b. gambiense* and *T. b. rhodesiense* [63].

Regarding *T. cruzi*, the CCNV analysis depends on the quality of the assembled reference genome. Studies including strains of different DTUs revealed that, as in *Leishmania*, the aneuploidy pattern varies among and within strains and DTUs [26]. However, the reference genome used was that of CL Brener, which is not the most complete genome available in databases. Despite this limitation, it was concluded that the strains from DTU I seem to be more stable, while the strains from DTUs II and III present a high degree of aneuploidy, such as monosomies, trisomies, or tetrasomies [64].
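To make the dependence on the reference genome concrete, the following minimal Python sketch shows one common depth-ratio heuristic for estimating chromosome copy number from reads mapped to a reference. It is an illustrative assumption, not necessarily the method used in the cited studies: the function name `estimate_somy`, the chromosome names, and the depth values are hypothetical, and the sketch assumes that most chromosomes are disomic so that the median chromosome depth represents two copies.

```python
from statistics import median


def estimate_somy(depth_by_chromosome, baseline_ploidy=2):
    """Estimate per-chromosome copy number from read depth.

    depth_by_chromosome maps a chromosome name to a list of read depths
    measured in windows along that chromosome. The genome-wide baseline is
    the median of the per-chromosome medians and is assumed to correspond
    to baseline_ploidy copies.
    """
    chrom_medians = {
        name: median(depths) for name, depths in depth_by_chromosome.items()
    }
    baseline = median(chrom_medians.values())
    if baseline <= 0:
        raise ValueError("baseline read depth must be positive")
    return {
        name: baseline_ploidy * chrom_median / baseline
        for name, chrom_median in chrom_medians.items()
    }


# Hypothetical windowed read depths, for illustration only.
example_depths = {
    "chrA": [40, 42, 38],
    "chrB": [60, 61, 59],
    "chrC": [20, 21, 19],
    "chrD": [40, 39, 41],
    "chrE": [41, 40, 40],
}
for name, somy in estimate_somy(example_depths).items():
    print(name, round(somy, 2))
```

With these values, `chrB` would be read as a trisomy (about 3 copies) and `chrC` as a monosomy (about 1 copy). Because depth is measured against the reference, collapsed repeats or missing regions in that reference would distort the estimates, which is why the quality of the assembly matters.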

These results suggest that aneuploidy events could be used by *T. cruzi* to expand gene copy number and promote alterations in gene expression, something that may be critical for parasites that depend on post-transcriptional mechanisms to control gene expression. Although aneuploidies are mainly associated with debilitating phenotypes in many eukaryotes, they may be involved in species-specific adaptations during trypanosomatid evolution, affecting, for example, multi-gene families that are critical for the establishment of a productive infection in the mammalian hosts [65].

#### *4.2. Genome Composition*

Besides the different mechanisms to control gene expression such as polycistronic transcription, RNA editing, nuclear compartmentalization, or trans-splicing [66,67], *T. cruzi* presents genomic plasticity and an unusual gene organization among strains. Tandemly repeated sequences take up more than 50% of the *T. cruzi* genome and, although the parasite is considered a diploid organism, it presents variations in chromosome number and aneuploidy arrangements between strains and clones of the same strain [26,56,68].

The genome plasticity of *T. cruzi* is related to its genetic composition, and a compartmentalization of protein-coding genes into two large principal regions has been established. The first is the core compartment, which contains highly conserved genes of known function, as well as genes without an assigned function, typically annotated as hypothetical conserved genes, that show synteny conservation with other species such as *Leishmania major* and *T. brucei*. The second is the non-syntenic disruptive compartment, which is mainly composed of genes that evolve constantly, such as those belonging to surface multi-gene families (trans-sialidases, MASPs, or mucins). The core and disruptive compartments show opposite G + C content and gene organization, with large differences in their regulatory sites [26,55].
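The G + C content used to contrast the two compartments is simply the fraction of G and C bases in a sequence. The following minimal Python sketch shows the calculation; the function name `gc_content` and the two short sequences are hypothetical and do not represent real *T. cruzi* regions or either compartment.

```python
def gc_content(sequence):
    """Return the G + C fraction of a DNA sequence.

    Only A, C, G, and T are counted, so ambiguous bases such as N do not
    affect the result.
    """
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (counts["G"] + counts["C"]) / total


# Hypothetical toy sequences, for illustration only.
example_regions = {"region_1": "ATGCGCGC", "region_2": "ATATATGC"}
for name, seq in example_regions.items():
    print(name, gc_content(seq))
```

In practice, such a value would typically be computed in windows along each contig, so that shifts in G + C content between neighboring regions become visible.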

The *T. cruzi* genome is formed by three types of DNA: (1) coding sequences of single-copy genes that are conserved between strains and species; (2) coding sequences of multi-copy gene families, such as surface proteins or virulence factors; and (3) non-coding and repetitive sequences, such as tandem repeats, retrotransposable elements, and short repeat elements, which represent more than half of the genome and particularly affect short-read sequencing methods, as explained above. Interestingly, around 50% of the genetic content of *T. cruzi* has unknown functions [47], which correlates with proteome studies of the CL Brener, Dm28c, Y, and VFRA strains [69–72], in which around 40–50% of total proteins were of unknown function. This indicates how much remains unknown about *T. cruzi* biology.

Regarding the single-copy genes, it was estimated that *T. cruzi* has more than 215 of these genes [54]. However, according to previous results [47], their number might be underestimated in hybrid strains because of the conservation of these genes and the appearance of new variations. Recent results in the Y and Bug2148 strains confirmed this theory, with 183 and 400 detected single-copy genes, respectively [47]. The identification of these genes may help to understand the differences in behavior among strains, such as in pathogenicity, immune evasion, or life cycle.
