*4.5. Replication Origin*

Chromosomes of eukaryotic organisms are replicated from hundreds to thousands of DNA replication origins (ORIs), which are specified by the binding of the origin recognition complex (ORC). ORIs were mapped in *T. brucei* by marker frequency analysis sequencing (MFA-seq) coupled to ChIP analysis of the ORC [82]. These studies showed that all mapped *T. brucei* ORIs are located at the boundaries of the transcription units. A similar pattern was detected in another species, *Leishmania major*, where replication initiation sites lie close to the genomic locations where RNA pol II terminates transcription, suggesting a strong correlation between transcription kinetics and replication initiation [83]. These analyses also revealed more than 5000 potential ORI sites by SNS-seq (small nascent strand purification coupled with deep sequencing). However, another study using MFA-seq detected just one origin per chromosome in *Leishmania major* [84]. This discrepancy may arise because MFA-seq detects mainly constitutive origins, whereas SNS-seq may not reflect the frequency of origin activation, since it can also identify flexible and/or dormant origins. Taken together, these results suggest that complete replication of the genome in *T. brucei* and *Leishmania major* may require not only constitutive ORIs that fire in every cell cycle, but also additional flexible and/or dormant ORIs, which may not coincide with ORC binding and fire stochastically [85].

Regarding *T. cruzi*, the ORIs of the CL Brener strain were recently analyzed by MFA-seq [86], which mapped 103 and 110 putative consensus ORIs in the two haplotypes of this hybrid strain. The analysis also showed that some replication initiation sites map to the borders of the transcription units, as in *Leishmania major* and *T. brucei*. Interestingly, the majority of the putative ORIs lay within coding DNA sequences and showed a marked G + C enrichment (65% on average), whereas the genomic regions had a maximum of 54%. In addition, an analysis of the same *T. cruzi* strain by DNA combing, which can detect any replication initiation event (including constitutive, flexible and dormant origins, but without reference to genome location), showed a median inter-origin distance of 1711 kb [87].
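The G + C enrichment mentioned above is a base-composition percentage. As a minimal sketch only (the function name `gc_content` and the two toy sequences are illustrative assumptions and are not taken from [86]), such a value could be computed as follows:

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted, so characters
    such as N do not affect the result.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Toy sequences invented for illustration; they are not real ORI data.
ori_like = "GCGCGGCCATGCGC"
background = "ATGCATTAGCATAT"

print(f"ORI-like window: {gc_content(ori_like):.1f}% G + C")
print(f"Background window: {gc_content(background):.1f}% G + C")
```

In practice the same calculation would be applied to each putative ORI and to matched genomic windows before comparing the two distributions.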

Considering the chromosomal location, while some *T. cruzi* ORIs are located in non-transcribed regions, like those seen in *T. brucei* and *Leishmania major*, many others are localized at sub-telomeric regions (with a strong focus on DGF-1 genes), where they can generate genetic variability in multi-gene families [86]. Because transcription in these regions is oriented toward the telomeres while the replisomes move toward the centers of the chromosomes, the abundance of putative ORIs in sub-telomeric regions is expected to produce head-on transcription-replication collisions. These results suggest that collisions between DNA replication and transcription are recurrent in the *T. cruzi* genome and produce genetic variability, as indicated by the increased SNP levels in the sub-telomeric regions and in the DGF-1 genes containing putative ORIs [86].
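The orientation argument above can be made explicit with a small sketch. The function name `collision_type` and the two direction labels are hypothetical and were chosen for illustration; the sketch only encodes the rule that opposite directions give a head-on encounter and equal directions a co-directional one.

```python
VALID_DIRECTIONS = {"toward_telomere", "toward_center"}


def collision_type(fork_direction: str, transcription_direction: str) -> str:
    """Classify a replication-transcription encounter on one chromosome arm.

    Both arguments must be 'toward_telomere' or 'toward_center'.
    """
    for direction in (fork_direction, transcription_direction):
        if direction not in VALID_DIRECTIONS:
            raise ValueError(f"unknown direction: {direction!r}")
    if fork_direction == transcription_direction:
        return "co-directional"
    return "head-on"


# Case described in the text: the replisome moves toward the chromosome
# center while transcription is oriented toward the telomere.
print(collision_type("toward_center", "toward_telomere"))
```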

## **5. Transcription of** *T. cruzi*

Transcription in *T. cruzi* is polycistronic. Protein-coding genes are organized into non-overlapping clusters on the same DNA strand, sometimes with unrelated predicted functions, and are separated by relatively short intergenic regions. Polycistronic transcripts are processed to produce mature mRNAs [88]. *T. cruzi* gene clusters range from 30 to 500 kb and are separated by divergent or convergent strand-switch regions, or lie in a head-to-tail orientation in which transcription terminates and then restarts on the same strand [57,89].
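The three types of boundary between gene clusters follow directly from the strand of the flanking clusters. The sketch below is illustrative only: the function name `classify_boundaries` and the '+'/'-' strand encoding are assumptions, with '+' meaning transcription toward higher chromosomal coordinates.

```python
from typing import List


def classify_boundaries(cluster_strands: List[str]) -> List[str]:
    """Classify each boundary between consecutive gene clusters.

    cluster_strands gives the coding strand ('+' or '-') of each cluster
    in left-to-right chromosomal order.
    """
    for strand in cluster_strands:
        if strand not in ("+", "-"):
            raise ValueError(f"unknown strand: {strand!r}")

    labels = []
    for left, right in zip(cluster_strands, cluster_strands[1:]):
        if left == right:
            # Same strand: transcription stops and restarts.
            labels.append("head-to-tail")
        elif left == "-" and right == "+":
            # Both clusters are transcribed away from the boundary.
            labels.append("divergent")
        else:
            # left == '+' and right == '-': both are transcribed toward it.
            labels.append("convergent")
    return labels


# Hypothetical chromosome with four clusters.
print(classify_boundaries(["-", "+", "+", "-"]))
```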

These strand-switch regions have a nucleotide composition different from the rest of the genome and a higher intrinsic curvature, which is associated with transcriptional regulation [90]. In both *T. cruzi* and *T. brucei*, canonical signals for RNA polymerase II promoters have not yet been identified, except for the genes encoding the spliced leader (SL) [91]. In trypanosomatids, the transcription start sites and the histone variants implicated in transcription initiation were described mainly at the divergent strand-switch regions [92,93]. In contrast, the convergent strand-switch regions preferentially contain transcription termination sites as well as tRNA genes transcribed by RNA polymerase III [94].

Up to hundreds of genes are transcribed at the same time by RNA pol II in large polycistronic transcription units (PTUs). Final mRNA maturation occurs by trans-splicing and polyadenylation (Figure 2). Trans-splicing is a special form of RNA processing in which two RNAs encoded at different genome locations are joined to form a single transcript [95]. In *T. cruzi* it consists of the addition of a 39-nucleotide sequence, known as the mini-exon or SL, to the 5′ end of each transcript. The SL is transcribed from a tandem array as a precursor of around 140 nucleotides and is the target for the capping modification. The addition of this Cap-SL stabilizes the mRNA and causes the excision of each mRNA from the PTU, allowing the final polyadenylation [88,96].


**Figure 2.** Transcription process of *T. cruzi*. RNA polymerase II produces polycistronic RNAs that are modified by trans-splicing and polyadenylation. The final mature mRNAs contain the Cap with the SL and the poly A tail. SL: spliced leader.

The AG dinucleotide was described as the consensus sequence for SL trans-splicing in *T. cruzi* [57], *Leishmania major* [97] and *T. brucei* [98]. However, small differences among them were detected in the nucleotide composition surrounding the AG dinucleotide, suggesting that different specific mechanisms are involved in mRNA maturation in these species. For example, at the first residue before the AG dinucleotide, the most probable nucleotide in *T. cruzi* is an adenine, as in *T. brucei*, whereas in *Leishmania major* it is a cytosine. Also, at position -4 a guanine is the most probable nucleotide in *T. cruzi* and *Leishmania major*, in contrast to *T. brucei*, where a poly-T tract starts and continues up to 50 nucleotides upstream. Interestingly, this pyrimidine enrichment is one of the principal differences between these trypanosomatids. In *T. cruzi* and *T. brucei* the C-T pattern is conserved only in the upstream 5′ region, whereas in *Leishmania major* it represents about 70% of the nucleotides both upstream and downstream of the AG dinucleotide. Moreover, whereas the downstream region in *T. cruzi* is composed of up to 60% purine nucleotides (A-G), in *T. brucei* A-T nucleotides are the most frequent, indicating that *T. cruzi* and *Leishmania major* transcripts have a more balanced nucleotide composition than those of *T. brucei*.
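Statements such as "the most probable nucleotide at position -4" come from per-position nucleotide frequencies computed over many aligned splice-site windows. The following is a minimal sketch of that calculation, not the procedure used in [57,97,98]; the function names and the three toy windows (four upstream bases followed by the AG acceptor) are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def position_frequencies(sites: List[str]) -> List[Dict[str, float]]:
    """Return per-position A/C/G/T frequencies for equal-length windows."""
    if not sites:
        raise ValueError("no sequences given")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sequences must have the same length")

    freqs = []
    for column in zip(*sites):
        counts = Counter(base.upper() for base in column)
        freqs.append({base: counts[base] / len(sites) for base in "ACGT"})
    return freqs


def most_probable(freqs: List[Dict[str, float]]) -> str:
    """Most frequent base per position (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=column.get) for column in freqs)


# Toy windows invented for illustration: positions -4 to -1, then AG.
windows = ["GTCAAG", "GCTAAG", "GTTCAG"]
freqs = position_frequencies(windows)
print(most_probable(freqs))
```

With real data, each window would be extracted around an experimentally supported trans-splicing site, and the resulting frequency table could be compared between species.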

The AAUAAA polyadenylation signal of eukaryotes is not present in trypanosomatids. Recent studies published by our group showed that in *T. cruzi* a single nucleotide seems to be the most probable signal for the start of polyadenylation, with cytosine being the most frequent nucleotide (45.3%) and thymine the least frequent (6.79%) [57]. This differs from other trypanosomatid species, such as *Leishmania major* and *T. brucei*, which present an AA dinucleotide [97,98] as the most probable polyadenylation signal. Furthermore, the surrounding genomic regions are also different. Whereas *T. cruzi* displays an abundant thymine composition in the upstream region and a higher T-A composition in the downstream region, *Leishmania major* shows a more variable sequence composition in both upstream and downstream regions, and *T. brucei* shows a uniform pattern of T-A nucleotides on both sides. These results suggest that the mRNA maturation processes in *T. cruzi* may differ notably from those in *Leishmania major* and *T. brucei*.

Genes in trypanosomatids do not have promoter regions to regulate gene expression, and their regulation occurs mainly at the post-transcriptional level, with a key role for the 3′ UTRs. The principal mechanisms of regulation are the stability or instability of the transcripts, gene duplication, histone regulation, and translation efficiency [96,99]. Therefore, although the genes of the same polycistron are transcribed in equal proportion, differences in their expression were detected in distinct life cycle stages or growth conditions [97,98,100]. This could explain the selection of highly repetitive sequences in the parasite through evolution [46,101,102] by the aggregation of tandem repeats, retrotransposons, and repetitive short sequences in chromatin remodeling [103].

However, the specific mechanisms involved in the regulation of gene expression in *T. cruzi* are still unknown and have not been studied as extensively as in other species, such as *Leishmania* or *T. brucei* [104,105]. In the latter species, for example, it was demonstrated that under heat-shock conditions the genes close to the transcription initiation sites are down-regulated, while genes in a distal position increase their expression [106].

## **6. Principal Multi-Gene Families of** *T. cruzi*

*T. cruzi* possesses several multi-gene families, some with hundreds of members, which, like the retrotransposons and the tandem repeats, contribute to the repetitive nature of the parasite's genome. Most of these multi-gene families code for surface proteins, which play key roles in the *T. cruzi* life cycle, from the establishment of an effective host-cell interaction and invasion to protection against the host immune system. Furthermore, these multi-gene families show a huge expansion and constant evolution that produce great diversity among strains [107].

Therefore, many efforts have been made to unravel the structure, distribution, and functions of these multi-gene families. Several groups identified multi-gene families such as trans-sialidases (TSs), mucins and MASPs in the disruptive compartment of the *T. cruzi* genome, whereas the RHS, GP63 and DGF-1 families were located in both the disruptive and core compartments [55]. Copy numbers of these multi-gene families in the genomes of *T. cruzi* strains and the B7 strain of *T. cruzi marinkellei* are displayed in Figure 3. According to these data, and considering all strains as a whole, the most expanded multi-gene family is the TS family, followed by MASPs, RHS, mucins and DGF-1, although this order does not hold for every strain with an available genome. There is high variability among strains, which may be related to strain-specific genetic profiles, the accuracy of the assembled genomes, and genomic plasticity. This produces a great diversity that could explain the different infection kinetics, virulence and/or immune responses detected among *T. cruzi* strains [6,7,108].

Here, we focus on the principal multi-gene families in terms of diversity, abundance and function that belong to the disruptive compartment of the *T. cruzi* genome: TSs, mucins and MASPs.


**Figure 3.** Genome copy number of the most abundant multi-gene families of *T. cruzi* and the B7 strain of *T. cruzi marinkellei*. BNEL: CL Brener Non-Esmeraldo-like; BEL: CL Brener Esmeraldo-like; DGF-1: Dispersed Gene Family 1; GP63: Glycoprotein 63; MASP: Mucin-Associated Surface Proteins; RHS: Retrotransposon Hot Spot genes.
