*3.3. Regions Upstream and Downstream of the 241 nt Repeats Are Enriched in Surface Protein Genes*

The interspersed distribution of the 241 nt repeat and its intergenic location led us to investigate whether this repetitive DNA element is associated with nearby genes. First, the flanking genes were classified according to their transcription orientation: genes transcribed toward the repeat were termed "upstream genes" and genes transcribed away from the repeat were termed "downstream genes" (Figure 2B), regardless of the strain considered. This analysis revealed distinct patterns, as shown in Figure 2B. Most 241 nt repeats were located between genes of the same polycistronic transcription unit (PTU), on either the sense strand (indicated by ++) or the anti-sense strand (indicated by −−), as shown in Table 3. In both cases (++ and −−), two genes flanked the 241 nt sequence, one upstream and the other downstream. Fewer 241 nt repeats were located between convergent PTUs (indicated by +− in Table 3); in this case, both adjacent genes were considered upstream. The 241 nt repeats were also located between divergent PTUs (indicated by −+ in Table 3), in which case both adjacent genes were designated downstream. Additionally, some 241 nt copies had only one adjacent gene (indicated by +\* and \*− in Table 3), and these genes were always upstream genes. No 241 nt repeat with a single downstream gene close to it (indicated by −\* and \*+ in Table 3) was found in the CL Brener strain; however, a few copies with this pattern were found in the Dm28c, Y and TCC strains (Table 3).
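The classification rule described above can be sketched in a few lines of Python. This is only an illustration of the stated convention, not the pipeline used in this study; the function name `classify_flanks` and the use of `None` for a missing neighbor are assumptions made for the example, and the ASCII character `-` stands in for the − symbol of Table 3.

```python
from typing import Dict, Optional, Tuple


def classify_flanks(left: Optional[str], right: Optional[str]) -> Tuple[str, Dict[str, str]]:
    """Return the flanking pattern and the role of each neighboring gene.

    `left` / `right` are the strands ('+' or '-') of the genes on the left and
    right side of the repeat, or None when no gene is adjacent on that side.
    (Hypothetical helper; names are not taken from the original analysis.)
    """
    for strand in (left, right):
        if strand not in ("+", "-", None):
            raise ValueError(f"unexpected strand value: {strand!r}")

    # Pattern string in the style of Table 3, with '*' marking a missing gene.
    pattern = (left or "*") + (right or "*")

    roles: Dict[str, str] = {}
    if left is not None:
        # A '+' gene on the left is transcribed toward the repeat.
        roles["left"] = "upstream" if left == "+" else "downstream"
    if right is not None:
        # A '-' gene on the right is transcribed toward the repeat.
        roles["right"] = "upstream" if right == "-" else "downstream"
    return pattern, roles


# Convergent PTUs: both neighbors are transcribed toward the repeat.
print(classify_flanks("+", "-"))
# Single neighbor on the left, sense strand.
print(classify_flanks("+", None))
```

Under this convention, the ++ and −− patterns each yield one upstream and one downstream gene, whereas +− yields two upstream genes and −+ yields two downstream genes, matching the description in the text.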

**Figure 2.** Localization of the 241 nt repetitive element in the *T. cruzi* CL Brener genome. (**A**) The consensus sequence of the 241 nt repeat was submitted to a BLASTn search against the CL Brener genome, and retrieved sequences with at least 95% identity and sizes of 140–244 nucleotides were selected. The graphic shows the frequency distribution of the retrieved sequences by their size in the genome sequence of *T. cruzi* strains Dm28c (TcI), Y (TcII), TCC (TcVI), and CL Brener S and P (TcVI). (**B**) Schematic representation of genes surrounding the 241 nt repeat found inside the intergenic region. Genes were classified as upstream (UP) or downstream (DOWN) of the 241 nt repeat according to their transcription orientation. Distinct patterns of upstream and downstream genes in relation to the repeat are observed and indicated by the symbols ++, −−, +−, −+. In some cases, only one gene is associated with the repeat, as indicated by \*+, +\*, \*−, and −\*. The letter "d" indicates the distance between the repeat and the upstream (dup) or downstream (ddown) gene. (**C**) Percentage of genes upstream and (**D**) downstream of the 241 nt repeat in the *T. cruzi* genome sequences of the Dm28c, Y, TCC, and CL Brener strains. (**E**) A total of 334 sequences of 241 nt were randomly distributed in the CL Brener\_S genome sequence. The graph shows the number of repeats found near a trans-sialidase gene. The dashed line represents the number of repeats identified close to a trans-sialidase gene in the CL Brener genome (S haplotype). (**F**) A total of 1117 sequences of 241 nt were randomly distributed in the Dm28c genome sequence. The graph shows the number of repeats found near a trans-sialidase gene. The dashed line represents the number of repeats identified close to a trans-sialidase gene in the Dm28c genome. Abbreviations: TS, trans-sialidase; MASP, mucin-associated surface protein; GP, glycoprotein; DGF-1, dispersed gene family-1; RHS, retrotransposon hot spot; EF-1 γ, elongation factor-1 γ.


**Table 2.** Number of 241 nt repeats found on each chromosome of *T. cruzi* CL Brener.


**Table 3.** Number of repeats found in the intergenic region of each *T. cruzi* strain. The + symbol indicates gene transcription from the sense strand, and the − symbol indicates gene transcription from the anti-sense strand. The left symbol represents the gene at the left side of the repeat, and the right symbol represents the gene at the right side of the repeat.

The next step was to identify which genes surround the 241 nt repeat. Among upstream genes, the large majority belonged to multigenic families (trans-sialidase, MASP, mucin and GP63), which represented 96.6% of the genes in CL Brener\_S, 91.2% in CL Brener\_P, 95.6% in the Y strain, and 68.8% and 66.4% in the Dm28c and TCC strains, respectively. The remaining genes were mostly hypothetical proteins (representing 1.1% in Dm28c, 3.3% in Y, 0.8% in TCC, and 2.5% and 6.9% in the CL Brener S and P haplotypes, respectively) and some other genes (listed in Supplementary File S3) that collectively represent 1% in Dm28c, 0.9% in Y, 1.9% in TCC, and 0.8% and 1.7% in the CL Brener S and P haplotypes, respectively (Figure 2C). By comparison, across the whole genome these multigenic families collectively represent 18.07% of the genes in Dm28c, 32.19% in Y, 16.15% in TCC, and 15.19% and 14.52% in CL Brener S and P, respectively. Hypothetical protein genes, in turn, correspond to 37.33% of the genes in the genome of Dm28c, 58.24% in Y, 37.14% in TCC, and 38.44% and 38.43% in CL Brener S and P, respectively, yet only a small percentage of them was found upstream of the 241 nt repeat. Taken together, these data suggest that the 241 nt repeat is preferentially located near multigenic families along the genome and is not randomly distributed.
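To make the contrast between the two sets of percentages explicit, the short Python sketch below divides the share of multigenic family genes among upstream genes by their share in the whole genome. The percentages are those quoted in the paragraph above; the fold-enrichment ratio itself is only an illustrative summary and was not part of the original analysis, and the names `fold_enrichment`, `upstream_pct` and `genome_pct` are hypothetical.

```python
# Share (%) of multigenic family genes (TS, MASP, mucin, GP63) among genes
# upstream of the 241 nt repeat, as quoted in the text.
upstream_pct = {
    "CL Brener_S": 96.6,
    "CL Brener_P": 91.2,
    "Y": 95.6,
    "Dm28c": 68.8,
    "TCC": 66.4,
}

# Share (%) of the same families among all genes in each genome, as quoted in the text.
genome_pct = {
    "CL Brener_S": 15.19,
    "CL Brener_P": 14.52,
    "Y": 32.19,
    "Dm28c": 18.07,
    "TCC": 16.15,
}


def fold_enrichment(observed_pct: float, background_pct: float) -> float:
    """Ratio of the observed share to the genome-wide (background) share."""
    if background_pct <= 0:
        raise ValueError("background percentage must be positive")
    return observed_pct / background_pct


enrichment = {
    strain: fold_enrichment(upstream_pct[strain], genome_pct[strain])
    for strain in upstream_pct
}

for strain, value in enrichment.items():
    print(f"{strain}: {value:.2f}-fold")
```

A ratio above 1 for every strain is in line with the conclusion that multigenic families are over-represented upstream of the repeat, although a ratio alone does not establish statistical significance.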

Having determined that the upstream genes are mostly multigenic family genes, with trans-sialidase (TS) genes the most abundant among them, we tested whether the association between the 241 nt repeat and TS had biological relevance or was merely a consequence of a random distribution of the 241 nt repeat in the genome. To address this question, we conducted a Monte Carlo test on the genome sequences of the CL Brener S (TcVI) and Dm28c (TcI) strains, which are highly divergent [30]. This test consists of random re-insertions of 241 nt repeats in the genome sequence, using the number of repeats originally identified in each genome (334 for CL Brener S and 1117 for Dm28c). In each replicate, random re-insertion was performed, and the number of trans-sialidase genes found flanking the repeat was counted. Monte Carlo analysis of CL Brener S showed that up to four TS genes were located close to the repeat after its random re-insertion into the CL Brener S genome (Figure 2E). As indicated by the dashed line in Figure 2E, the total number of TS genes found close to the 241 nt repeat in *T. cruzi* CL Brener\_S was 244, which is significantly higher than expected for a random distribution of the repeat (*p* < 0.01). For the Dm28c strain, the Monte Carlo analysis showed that up to 89 repeats were found close to TS genes after random re-insertion of the repeat, as shown in Figure 2F. Again, the number of TS genes flanking the 241 nt repeat (dashed line in Figure 2F) in the genome sequence of Dm28c was significantly higher (*p* < 0.01) than that expected by chance distribution of the repeat in the genome. Therefore, these findings indicate that the proximity between the repeats and TS genes is not the result of a random distribution in the genomes analyzed and may have a biological function.
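The logic of such a Monte Carlo test can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the original analysis: the genome is treated as a single linear sequence, insertion positions are drawn uniformly, "close" is defined by an arbitrary `window` in base pairs, and the gene coordinates are toy values. Only the number of repeats (334) and the observed count (244) come from the text; the function names are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) coordinates of a gene, inclusive


def is_near(pos: int, genes: List[Interval], window: int) -> bool:
    """True if `pos` lies inside, or within `window` bp of, any gene."""
    return any(start - window <= pos <= end + window for start, end in genes)


def monte_carlo_ts_proximity(
    genome_length: int,
    ts_genes: List[Interval],
    n_repeats: int,
    observed: int,
    window: int,
    n_replicates: int,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Randomly re-insert repeats and count how many land near a TS gene.

    Returns the null distribution of counts and a one-sided empirical
    p-value, (number of replicates >= observed + 1) / (n_replicates + 1).
    """
    rng = random.Random(seed)
    null_counts = []
    for _ in range(n_replicates):
        # One replicate: draw a uniform random position for every repeat.
        positions = [rng.randrange(genome_length) for _ in range(n_repeats)]
        null_counts.append(sum(is_near(p, ts_genes, window) for p in positions))
    n_extreme = sum(count >= observed for count in null_counts)
    p_value = (n_extreme + 1) / (n_replicates + 1)
    return null_counts, p_value


# Toy genome and toy TS gene coordinates (assumptions for illustration only).
toy_ts_genes = [(100_000, 102_000), (400_000, 402_500), (750_000, 752_000)]
null_counts, p_value = monte_carlo_ts_proximity(
    genome_length=1_000_000,
    ts_genes=toy_ts_genes,
    n_repeats=334,    # repeats identified in CL Brener S (from the text)
    observed=244,     # observed count near TS genes (from the text)
    window=1_000,     # assumed definition of "close"
    n_replicates=200,
)
print(max(null_counts), p_value)
```

Adding 1 to both the numerator and the denominator keeps the empirical p-value from being reported as exactly zero when no replicate reaches the observed count; the smallest attainable p-value is therefore set by the number of replicates.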

We then analyzed the pattern of gene distribution in downstream genes, which was found to differ from that in upstream genes. The same multigenic families enriched upstream of the repeat (trans-sialidase, MASP, mucin and GP63) represented 35.5% of the downstream genes in CL Brener\_S, 39.8% in CL Brener\_P, 34.9% in the Y strain, and 34.7% and 31.3% in the Dm28c and TCC strains, respectively. Additionally, two other multigenic families were found among the downstream genes: DGF-1 (2.7% in Dm28c, 8.2% in Y, 5.4% in TCC, and 8% and 7% in CL Brener S and P, respectively) and RHS (13.5% in Dm28c, 14.7% in Y, 15.6% in TCC, and 9% and 11.7% in CL Brener S and P, respectively). The remaining genes were mostly genes for hypothetical proteins (7.6% in Dm28c, 38.8% in Y, 9.5% in TCC, 40.5% in CL Brener\_S and 33.1% in CL Brener\_P). The Dm28c and TCC strains also presented "unspecific product" genes, which represented 38.8% of the downstream genes in the former and 31.5% in the latter (Figure 2D). The higher proportion of hypothetical protein genes and the lower proportion of multigenic family genes among the downstream genes corroborate the closer relation of the repeat to upstream genes than to downstream genes.

In addition to the strains analyzed above, we also examined the repertoire of genes located upstream and downstream of the 241 nt repeats in the *T. cruzi* Brazil A4 and Sylvio X10/1 strains (PacBio sequenced) as well as in the ancestral *T. cruzi marinkellei* strain. The Brazil A4 strain and *T. cruzi marinkellei* presented similar repertoires in their upstream genes (Supplementary Files S8 and S9), where trans-sialidase genes represented the great majority (73.7% in Brazil A4 and 75% in *T. cruzi marinkellei*), followed by MASP genes (13% in Brazil A4 and 6.3% in *T. cruzi marinkellei*). Other multigenic family genes were also observed among the upstream genes (Supplementary Files S8 and S9) and, collectively, the multigenic family genes (TS, MASP, mucin and GP63) represented 93.7% and 85% of the upstream genes from the Brazil A4 strain and *T. cruzi marinkellei*, respectively. Among the genes found downstream of the repeat, the repertoires in Brazil A4 and *T. cruzi marinkellei* were similar to those found in the previously analyzed strains but differed in the proportion of multigenic family genes (Supplementary Files S8 and S9). In the ancestral *T. cruzi marinkellei*, MASP genes comprised 36.8% of the downstream genes, followed by trans-sialidase and hypothetical protein genes (each representing 13.2% of the downstream genes). In the Brazil A4 strain, the most abundant genes among downstream genes were hypothetical protein genes (35.8%) and trans-sialidase (24.1%) (Supplementary Files S8 and S9). Surprisingly, analysis of the *T. cruzi* Sylvio X10/1 strain revealed different genes flanking the 241 nt repeat (Supplementary Files S8 and S9). Bacterial neuraminidase repeat (BNR)-like domain genes were the most abundant genes among the upstream genes (49.82%) and the second most abundant among the downstream genes (27.86%). The concanavalin A-like lectin/glucanases superfamily represented 20% of the upstream genes and 14.5% of the downstream genes, while leishmanolysin represented 7.64% of the upstream genes and 8.4% of the downstream genes. RHS (5.09% and 3.44% of the upstream and downstream genes, respectively), EF1-γ (4.73% and 2.29% of the upstream and downstream genes, respectively), trans-sialidase (1.09% and 0.38% of the upstream and downstream genes, respectively) and DGF-1 (4.73% of the upstream genes) also flanked the repeat. Genes identified as "unspecific products" comprised 35% of the downstream genes (Supplementary Files S8 and S9). The fact that the 241 nt repeat is exclusive to *T. cruzi* and that the bat subspecies *T. cruzi marinkellei* presented a similar composition indicates that this repeat is ancient within *T. cruzi* and reinforces its potential biological role.
