#### *2.3. Reconstruction of the Ancestral Monomer*

BLAST-detected relatedness between satellite monomers of the CficCl-61-40 satDNA family allowed determination of the major part of the ancestral monomer. For this reconstruction, satellites of *C. bryoniifolium* (consensus monomer from one RE cluster) and *C. vulvaria* (consensus monomers from seven RE clusters) that show relatedness to both *C. quinoa* and *B. corolliflora* satellites were aligned. DNA fragments with 100% BLAST matches, taken in combination, formed the most conserved fragment of the basic monomer (supplementary data 2). This approach is quite similar to the method of ancient paralogs (LUCA) [31,32]. The sequence of 37 bp was as follows: TCAAACAAAGCTAATTGAATCAAATGAAAGTCAAATG. This sequence was used as a basis for the subsequent comparison of the monomer divergence in *Chenopodium* lineages. Analysis of basic satellite alterations revealed point mutations, indels, and shifts that were present with different frequencies in the genomes of the studied diploid *Chenopodium* species (supplementary data 1). K-mer-based distance estimation produced a phylogenetically reliable tree with the ancestral monomer as the base: *B. corolliflora* is located separately and rather close to the root; the analyzed diploids form fairly natural groups, with species of clades B and E located nearby and *C. bryoniifolium* and *C. acuminatum* set aside; and polyploid *C. quinoa* lies at the maximum distance from the ancestral monomer (Figure 3).
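The idea of a k-mer-based distance between monomers can be illustrated with a minimal sketch. This section does not state which k-mer measure, which value of k, or which tree-building method was used, so the Jaccard distance on k-mer sets and k = 4 below are assumptions chosen only for illustration. The two variant sequences are hypothetical and are generated by arbitrary substitutions in the 37 bp ancestral fragment; they are not observed monomers.

```python
# The 37 bp ancestral fragment reported in the text.
ANCESTRAL = "TCAAACAAAGCTAATTGAATCAAATGAAAGTCAAATG"

# Deterministic base swap used to create hypothetical variants.
SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}


def substitute(seq, positions):
    """Return seq with the base at each listed 0-based position replaced."""
    chars = list(seq)
    for p in positions:
        chars[p] = SWAP[chars[p]]
    return "".join(chars)


def kmer_set(seq, k=4):
    """Return the set of all overlapping k-mers in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=4):
    """Jaccard distance between k-mer sets (0 = same set, 1 = nothing shared).

    Assumption: Jaccard distance is used here for illustration only.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


# Hypothetical variants: one substitution versus five substitutions.
variant_one = substitute(ANCESTRAL, [10])
variant_many = substitute(ANCESTRAL, [3, 10, 17, 24, 31])

d_one = kmer_distance(ANCESTRAL, variant_one)
d_many = kmer_distance(ANCESTRAL, variant_many)

print(f"ancestral vs one substitution:   {d_one:.3f}")
print(f"ancestral vs five substitutions: {d_many:.3f}")
```

In this toy case the more heavily modified variant lies farther from the ancestral fragment. A matrix of such pairwise distances is the kind of input a distance-based tree method would take.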

Clade H (*C. vulvaria*) deserves separate attention. The RE pipeline divided the variety of CficCl-61-40 satDNA family sequences in the genome of *C. vulvaria* into seven clusters (supplementary data 1), indicating considerable heterogeneity. On the one hand, all the basic monomers of the clusters contain BLAST-recognizable fragments of the ancestral monomer. On the other hand, the observed variability exceeds that for all clades taken together (Figure 3). An important question is whether all these clusters from the *C. vulvaria* genome belong to the same CficCl-61-40 satDNA family. RE output includes not only the list of clusters but also detailed cluster characteristics, including the cluster neighborhoods of connected components. The analysis showed that all clusters that we classified as belonging to the CficCl-61-40 satDNA family are related to each other and to the repetitive sequences with the GenBank IDs HM641822.1 and AJ288880.1. Additionally, these satDNA clusters possess a limited number of similarity hits with TE clusters (mainly with Ty3-*gypsy* retrotransposons), which may indicate splitting of satDNA arrays by the insertion of TEs.

*Int. J. Mol. Sci.* **2019**, *20*, x 7 of 18

**Figure 3.** Phylogenetic relationships of the CficCl-61-40 satDNA family sequences. Phylogenetic tree based on the k-mer analysis.

#### *2.4. High Order Repeat (HOR) Detection in the CficCl-61-40 satDNA Family and Determination of Its Physical Counterpart*

TRF analysis of the CficCl-61-40 satDNA family in seven diploid species of *Chenopodium* revealed different structures of the arrays. In *C. ficifolium*, *C. pamiricum*, and *C. suecicum*, uniform tandem arrays with basic satellite motifs of ~40 bp (87–96% matches between monomers and copy numbers of 79.2–153.4) were identified by TRF. In *C. acuminatum*, *C. bryoniifolium*, *C. iljinii* and *C. vulvaria*, derivatives from CficCl-61-40 satDNA family repeats ranging up to 332 bp and of different repeatability were found (supplementary data 1). It was proposed that in the latter species, HORs could be formed by concurrent amplification and homogenization of modified monomers.

Here, it is necessary to elucidate the TRF algorithm using the example of the detection of a 117 bp monomer in the genome of *C. acuminatum* (later used as a probe in fluorescent *in situ* hybridization (FISH) experiments). Analysis of the RE Cluster-1 sequence by TRF produced a table of monomers, the most frequent of which was 117 bp (consensus size) (supplementary data 1). However, when the consensus sequence was manually analyzed, it decomposed into three 39 bp long subrepeats. Nevertheless, it can be argued that the 117 bp fragment is the basic monomer and that the formation of a HOR unit is based on an ~40 bp monomer. The program finds likely patterns (monomers) and then refines them into a consensus sequence. Patterns are detected by a high percentage of matches at the candidate pattern length. For 39 bp, not enough matches were found, whereas a very high number were found for 117 bp. This indicates that the unit of duplication was 117 bp and not 39 bp. Furthermore, the mismatches and indels are more consistent with a 117 bp monomer than with a 39 bp monomer (Gary Benson, personal communication). Subsequent sequencing of the physical counterparts of the CacuCl-1-117 consensus sequence (see below) revealed that the physical components of the CacuCl-1-117 HOR unit did not coincide completely (as in the consensus) but varied within the interval of 82% to 86% similarity, which confirmed the accuracy of the TRF algorithm. Additionally, it should be taken into account that the TRF analysis of all RE clusters belonging to the CficCl-61-40 satDNA family was performed with the same parameters: in the genomes of three species only homogeneous arrays were identified, while the arrays of the four other species were heterogeneous, which reflects the real structure of satDNA.
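The argument that matches at the candidate pattern length distinguish a 117 bp unit from a 39 bp one can be illustrated with a toy calculation. The sketch below is not the TRF algorithm, which relies on its own statistical detection criteria and alignment step; it only counts positions where a base equals the base one lag further along. The 39 bp base monomer (the 37 bp ancestral fragment plus two made-up bases), the substitution positions, and the six identical HOR copies are all assumptions made for illustration.

```python
# Hypothetical 39 bp base monomer: the 37 bp ancestral fragment from the text
# plus two made-up padding bases, used only to build a toy array.
BASE = "TCAAACAAAGCTAATTGAATCAAATGAAAGTCAAATG" + "CA"

SWAP = {"A": "C", "C": "A", "G": "T", "T": "G"}


def substitute(seq, positions):
    """Return seq with the base at each listed 0-based position replaced."""
    chars = list(seq)
    for p in positions:
        chars[p] = SWAP[chars[p]]
    return "".join(chars)


# Three diverged 39 bp subrepeats (substitution positions are arbitrary).
sub1 = substitute(BASE, [2, 11, 20, 29])
sub2 = substitute(BASE, [5, 14, 23, 32])
sub3 = substitute(BASE, [8, 17, 26, 35])

hor_unit = sub1 + sub2 + sub3   # 117 bp higher-order unit
array = hor_unit * 6            # toy tandem array of identical HOR copies


def match_fraction(seq, lag):
    """Fraction of positions i for which seq[i] == seq[i + lag]."""
    n = len(seq) - lag
    if n <= 0:
        raise ValueError("lag must be shorter than the sequence")
    return sum(seq[i] == seq[i + lag] for i in range(n)) / n


# Scan candidate period lengths and keep the one with the most matches.
best_lag = max(range(30, 131), key=lambda lag: match_fraction(array, lag))

print(f"lag 39:  {match_fraction(array, 39):.3f}")
print(f"lag 117: {match_fraction(array, 117):.3f}")
print("best lag in 30..130:", best_lag)
```

In this toy array, the 39 bp lag scores lower because neighboring subrepeats differ by construction, while the 117 bp lag matches at every position. The same reasoning is used in the text to favor the 117 bp unit of duplication.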

A total of three to four different proposed HOR units were detected in the genomes of *C. acuminatum*, *C. bryoniifolium* and *C. iljinii*. However, approximately 23 such units were found in the genome of *C. vulvaria* (supplementary data 1). The genome of *C. vulvaria* is thus again the most variable according to this parameter.

While multiple studies demonstrate that RE is efficient in repeat identification using NGS, there are some limitations regarding the sequence analysis of satellite repeats. The most important one is that consensus sequences are generated by assembling reads into contigs. While this works well for most dispersed repeats such as TEs, it is problematic for satellites because of their tandem structure. Consequently, contigs vary in their coverage by reads, and their sequences can be partially chimeric (producing combinations of sequence variants that do not in fact exist in the genome). To confirm that the computer-generated consensus monomers have physical counterparts in the genomes, we analyzed the sequence variation of CficCl-61-40 and the proposed HOR units CacuCl-1-117, CvulCl-28-118, CvulCl-28-397, CvulCl-112-117, CvulCl-134-117 and Cvul-145-129 by cloning. We then compared the obtained sequences with the consensus sequence from the TRF output (supplementary data 3, Figure 4). For all monomers, we obtained several clones that differed from each other as well as from the consensus sequence (supplementary data 4). The CficCl-61-40 monomer is rather uniform, with a few point mutations; sequence similarity between the clones and the consensus sequence ranged from 90.2% to 95%. For the four obtained clones of the CacuCl-1-117 monomer, two sequence types were found, with generally high similarity to the consensus sequence as well as to each other (similarity value ranges 89.8–91.5 and 90.7–99.2, respectively). This once again confirmed the correctness of the TRF algorithm.
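The section does not say how the clone-to-consensus similarity values were computed. One common choice, assumed here purely for illustration, is percent identity over a global pairwise alignment. The sketch below implements a small Needleman-Wunsch alignment with arbitrary scores (match +1, mismatch -1, gap -1) and applies it to a hypothetical clone derived from the 37 bp ancestral fragment; neither the scoring scheme nor the clone sequence comes from the study.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity over one optimal global alignment of a and b.

    Identity = identical aligned columns / total alignment columns * 100.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns and all columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                matches += a[i - 1] == b[j - 1]
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return 100.0 * matches / columns if columns else 100.0


# Consensus stand-in: the 37 bp ancestral fragment from the text.
consensus = "TCAAACAAAGCTAATTGAATCAAATGAAAGTCAAATG"
# Hypothetical clone with a single substitution (C -> A at 0-based position 10).
clone = consensus[:10] + "A" + consensus[11:]

print(f"clone vs consensus identity: {global_identity(clone, consensus):.1f}%")
```

Applied to every clone and consensus pair, a function of this kind yields a table of pairwise values comparable in form to the similarity ranges reported above, although the actual values depend on the alignment method and scoring used.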

More variability was detected for the proposed HOR units in the *C. vulvaria* genome, which once again highlights the complexity of the satDNA fraction in this species. Thus, among three clones obtained for the proposed HOR unit CvulCl-28-118, two sequence types were found, with generally higher similarity to each other than to the consensus monomer (similarity value ranges 88.2–98.3 and 76.4–79.1, respectively). For the proposed HOR unit CvulCl-28-397, the sequences amplified with primers (supplementary data 3) also show more relatedness to each other than to the consensus sequence (supplementary data 4). For the proposed HOR units CvulCl-112-117 and CvulCl-134-117, two types were found among the cloned sequences. One type showed high relatedness to the consensus monomer (83.3%–90.7%), whereas the clones of the other type were 100% identical to each other and less related to the consensus monomer and to the first variant (supplementary data 4). This most likely suggests that several related HOR units could be formed simultaneously. For the proposed HOR unit Cvul-145-129, the clones possess high similarity to the consensus sequence as well as to each other (82.9%–100.0%). Some of the cloned sequences were submitted to GenBank (accession numbers MH257681–MH257687). However, it should also be noted that we were not able to amplify some of the proposed HOR units generated by TRF analysis (for example CvulCl-28-355 and Cvul-134-148) (supplementary data 1, far right line on Figure 4). These sequences are most likely computer-generated chimeric sequences (i.e., method error). Nevertheless, for the majority of the proposed HOR units, physical counterparts were discovered in the genomes.
