*Article* **Genome-Wide Comparative Analysis of Transposable Elements by Matrix-TE Method Revealed Indica and Japonica Rice Evolution**

**Zhiguo Wu 1,\*, Wei Xi <sup>1</sup> , Zixuan Han <sup>1</sup> , Yanhua Wu <sup>1</sup> , Yongzhuo Guan <sup>1</sup> and Yuxian Zhu 1,2**


**Abstract:** Transposons (TEs) are known to change the gene expression and function, and subsequently cause plant speciation and evolution. Nevertheless, efficient and new approaches are required to investigate the role of TEs in the plant genome structural variations. Here, we reported the method named matrix-TE to investigate the differentiation of intact and truncated LTR/TEs comprehensively in *Indica* and *Japonica* rice throughout whole genomes with a special eye on centromeric regions. Six LTR/TE super-families were identified in both *Indica* and *Japonica* rice genomes, and the TE ORF references were extracted by phylogenetic analysis. *Indica* rice specific TE peak P-*Gypsy* and *Japonica* rice specific TE peak P-*Copia* were observed, and were further analyzed by Gaussian probability density function (GPDF) fit. The individual TE peak P-*Gypsy* was observed in centromeric regions of the *Indica* genome. By the matrix-TE method, the divergence of *Indica* and *Japonica* genomes, especially their centromeric regions, mainly resulted from the *Ty3*/*Gypsy* insertion events at 0.77 Mya. Our data indicate that the optimized matrix-TE approach may be used to specifically analyze the TE content, family evolution, and time of the TE insertions.

**Keywords:** transposon; rice genome; centromere; divergence; Gaussian distribution

### **1. Introduction**

TEs show an importance in both monocotyledonous and dicotyledonous plants, since TEs probably lead to the alterations of plant gene expression and function by introducing mutations both in the coding regions and the regulatory regions [1]. Recent great advances in genomics has gradually made it a reality to systematically investigate the roles of TEs during the evolution of plant genomic structures [2–5]. Long terminal repeat transposable elements (LTR/TEs) are retrotransposons and usually constitute a major part of most plant genomes [6].

*Indica* and *Japonica* rice are the most widely cultivated subspecies and the genome sequences were successfully assembled and polished in the past two decades, with their genome structure variations studied to interpret the origin and evolution for the *Oryza* genus [7–12]. The rice genomes are known to contain high levels of *Ty1*/*Copia* and *Ty3*/*Gypsy* super-family LTR/TEs [4,5]. To date, the *Oryza* species have been evolved for ~15 million years. However, the divergence time between *Indica* and *Japonica* rice was estimated at 0.55 million years ago (Mya) [11,12]. Molecular phylogenetic studies suggested that *Indica* and *Japonica* rice originated independently [11]. Recently, gap-free rice genomes have been assembled and provide a new view of the structure and function of full-length centromeres with LTR/TEs being a major component [4,5]. These advances provide a firm basis on which to study the essential roles that LTR/TEs play during the origin and evolution of rice genomes.

**Citation:** Wu, Z.; Xi, W.; Han, Z.; Wu, Y.; Guan, Y.; Zhu, Y. Genome-Wide Comparative Analysis of Transposable Elements by Matrix-TE Method Revealed Indica and Japonica Rice Evolution. *Agronomy* **2022**, *12*, 1490. https://doi.org/ 10.3390/agronomy12071490

Academic Editor: Katherine Steele

Received: 22 May 2022 Accepted: 18 June 2022 Published: 22 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

LTR/TEs are found everywhere in many plant genomes, especially in wheat and maize, as more than half of their entire genome sequences are constituted by *Ty1*/*Copia* and *Ty3*/*Gypsy* super-families [13–17]. A large number of intact and truncated LTR/TEs were identified in both the wheat D and A subgenomes. Sequential insertion followed by silencing events of TEs over the past three million years were thought to play a significant role during evolution of the hexoploid wheat genome [13,14]. Furthermore, comparative studies of TEs in three maize genomes interpreted the existence of high level variations, even among very related subspecies [15–17], and TE-mediated gene silencing was proposed to be involved in epigenetic regulations during different developmental stages [18].

Most plant TEs cumulate a large amount of nucleotide substitution and truncation events over their evolution time courses [19–25]. The high repetitive nature with a high mutation rate makes it a challenge to accurately analyze TEs during genome assembly and related studies. Previously, a software entitled TRACK-POSON was published to detect TE insertion polymorphisms in rice genomes [26]. Here, we reported a matrix-TE approach to quantitatively and systematically study the intact and truncated LTR/TEs in the genomes of *Indica* and *Japonica* rice. All of the LTR/TE super-families were restricted in one super matrix by their ORFs and sequence identities, then classified and phylogenetically clustered. Individual TE insertion peaks of P-*Gypsy* and P-*Copia* were detected in *Indica* and *Japonica* rice, respectively. The TE peaks were further resolved by Gaussian probability density function (GPDF), as the GPDF model was matched with stochastic nucleotide substitution events, and the time point of the individual TE peak was calculated by the nucleotide substitution ratio *Ks* [3]. We suggest that this is the first characterization of the insertion events for all rice LTR/TE super-families in the genomes of two subspecies, especially in centromeric regions.

#### **2. Materials and Methods**

#### *2.1. Indica and Japonica Genomes Used for TE Matrix Generation and GPDF Analysis*

Two cultivated rice *Indica* (MH63) and *Japonica* (Nip) genome sequence data were applied to the analysis process. The *Indica* (MH63) genome assembly ID was GWH-BCKY00000000 and the *Japonica* (Nip) genome assembly ID was GCF\_001433935.1\_IRGSP-1.0 [4]. Intact LTR/TEs of the MH63 and Nip genomes were annotated by LTR\_finder software with default parameters.

#### *2.2. LTR/TE ORF Matrix Generation*

The ORFs were extracted from the intact LTR/TEs by the Getorf Script in the EMBOSS package with default parameters. Then, the ORFs were sorted by length from long to short, and we discarded ORFs shorter than 1500 bp. The identity matrix for the selected longer ORF sequences was calculated by BioEdit software with default parameters [3].

#### *2.3. Phylogenetic Analysis of TE ORFs*

The LTR/TE ORFs with sequence identities greater than 95% in the TE matrix were extracted according to their sequence homology. Then, the MEGA software with the construct neighbor-joining tree method was used to generate the molecular phylogenetic trees of the ORF groups. The LTR/TE ORF sequences on each top branch of the phylogenetic trees were determined as the TE ORF reference sequences of the super-families in the following steps.

#### *2.4. Whole Genome and Centromere Scanning with TE ORF Reference Sequences and GPDF Analysis of Individual TE Peaks*

The *Indica* (MH63) and *Japonica* (Nip) whole genomes and centromeric regions were searched by the TE ORF reference sequences with BLASTN software using the parameters: \$blastn -db Ricegenomedb -query ref-ORFseq.fa -out ref-ORFseq-blast-Ricegenomedb evalue 0.00001 -word\_size 11 -gapopen 5 -gapextend 2 -penalty -2 -reward 1 -culling\_limit 0 -outfmt 7.

Sequences obtained from the above blast search contained intact and truncated TE sequences, and identity distribution curves were generated using individual TE superfamily sequences. The individual peaks of the curves were fitted by GPDF and the average nucleotide substitution ratio *Ks* of each TE peak were defined as 2.58σ [3]. The TE insertion time was calculated by the formula: T = *Ks*/2*r*, where *r* is the average nucleotide substitution rate (1.3 <sup>×</sup> <sup>10</sup>−<sup>8</sup> here) [27].

#### *2.5. TE Insertion Events, Whole Genome, and Centromere Evolution Analysis*

The distribution of the single nucleotide polymorphism across *Ty1*/*Copia* and *Ty3*/*Gypsy* ORFs were scanned in Nip and MH63 genomes to calculate the SNP densities. The individual TE peaks were compared with *Indica* and *Japonica* rice genomes as well as the centromeric regions. Different TE insertion events among the whole genomes and centromeric regions were applied to correlate with the differentiation of rice subspecies [28,29].

#### **3. Results**

#### *3.1. Development of Matrix-TE Approach Pipeline*

Considering the large amount of TEs and the stochastic nucleotide substitutions, we developed the approach matrix-TE to successfully evaluate the TE ORFs, and calculated the *Ks* of the TE insertion events in rice genomes (Figure S1). Whole rice genome sequences were sequentially applied to the LTR\_finder, Getorf, and BioEdit scripts to obtain various related data. ORF clusters with an identity over 95% were observed at the diagonal of the TE matrix and were extracted based on sequence identities. The whole genome or the centromeric regions were scanned by the reference sequence with the identity distribution curve fitted by the GPDF model. The individual TE insertion event was analyzed by calculating the *Ks* derived from the GPDF. The matrix-TE approach was used to analyze the most abundant TE super-families, and the TE content at both the whole genome level and in the centromeric regions were subsequently quantified.

#### *3.2. TE Matrix and Cluster Generation for Indica and Japonica Rice Whole Genomes*

Compared with that of Nip, MH63 had a slightly bigger genome and higher TE content with both having more annotated intact LTR/TEs and ORFs (Table 1). A total of 1520 Nip ORFs and 2010 MH63 ORFs were discovered from the super matrices, and clusters with identities over 95% observed at the matrix diagonals (Figure 1A,B) were extracted and annotated so that both the Nip and MH63 genomes contained six clusters of LTR/TE super families, named *type1*, *type2*, *typeRT*, *typePHA*, *Ty1*/*Copia*, and *Ty3*/*Gypsy*. Both the Nip and MH63 genomes contained similar numbers of *type1*, *type2*, *typeRT*, and *typePHA* TEs (from 51 to 64), whereas significantly more of the *Ty1*/*Copia* TEs were observed in Nip than in MH63 (comparing 48 in Nip to 18 in the later). In contrast, MH63 contained significantly more *Ty3*/*Gypsy* type TEs than that of Nip (71 in MH63 while there were only 14 in the latter) (Figure 1C,D).

**Table 1.** Genome size, TE content, and intact TE ORF statistics of two rice genomes.


#### *3.3. Phylogenetic Trees and ORF Reference Sequences of TE Clusters*

The phylogenetic trees of TE ORFs were constructed with the six TE clusters of the Nip and MH63 genomes separately (Figure 2). With a non-homologous sequence as the root (e.g., using *Ty1*/*Copia* as the root of *Ty3*/*Gypsy* tree and vice versa), the *ref-N-type1* and *ref-M-type1* (Figure 2A), the *ref-N-type2* and *ref-M-type2* (Figure 2B), the *ref-N-typeRT*

and *ref-M-typeRT* (Figure 2C), the *ref-N-typePHA* and *ref-M-typePHA* (Figure 2D), the *ref-N-Ty1*/*Copia* and *ref-M-Ty1*/*Copia* (Figure 2E), and the *ref-N-Ty3*/*Gypsy* and *ref-M-Ty3*/*Gypsy* (Figure 2F) on the top branch of each tree were selected as the reference sequences for the TE ORFs of each Nip or MH63 TE cluster. These reference sequences were potentially active and were probably inserted into rice genomes in recent ages [2,3]. These TE ORF reference sequences were applied to whole genome and centromere scanning in the following steps.

**Figure 1.** The Nip and MH63 TE ORF matrix and clusters. (**A**,**B**) Six TE ORF clusters *type1*, *type2*, *typeRT*, *typePHA*, *Ty1*/*Copia*, and *Ty3*/*Gypsy* with identities of over 95% observed at the TE matrix diagonals in the Nip and MH63 genomes, respectively, where *N* indicates Nip and *M* indicates MH63. (**C**) TE numbers with identities of over 95% for each cluster in Nip, *N-type1* (64 TEs), *N-type2* (51 TEs), *N-typeRT* (54 TEs), *N-typePHA* (51 TEs), *N-Ty1*/*Copia* (48 TEs), *N-Ty3*/*Gypsy* (14 TEs). (**D**) TE numbers with an identity over 95% for each cluster in MH63, *M-type1* (63 TEs), *M-type2* (55 TEs), *M-typeRT* (59 TEs), *M-typePHA* (56 TEs), *M-Ty1*/*Copia* (18 TEs), and *M-Ty3*/*Gypsy* (71 TEs).

#### *3.4. Whole Genome and Centromere Scanning by TE ORF Reference Sequences*

Both whole genome and the centromeric regions of Nip or MH63 were scanned with the TE ORF reference sequences. Data analysis indicated that the TE ORF contents for the *type1*, *type2*, *typeRT*, and *typePHA* clusters were similar in the Nip or MH63 genomes (Table 2), with similar identity distribution curves and in general quite weak TE hit signals for these four super-families (Figure 3E–L). There were 1479 *Ty1*/*Copia* ORFs in Nip vs. 316 in MH63, and 6277 *Ty3*/*Gypsy* ORFs in MH63 compared with 3328 in the Nip genomes. In the Nip centromeres, a total of 57 *Ty1*/*Copia* and 345 *Ty3*/*Gypsy* ORFs were observed, while 23 *Ty1*/*Copia* and 758 *Ty3*/*Gypsy* ORFs were found in that of MH63 (Table 2), which may indicate the importance of the activities of *Ty1*/*Copia* and *Ty3*/*Gypsy* in the evolutions of the Nip and MH63 genomes, respectively. Indeed, sharp peaks of P-*Copia* and P-*Gypsy* were detected in the Nip and MH63 genomes (Figure 3A,B), whereas significantly lower levels of *Ty3*/*Gypsy* were found in Nip and only some scattered *Ty1*/*Copia* was revealed in MH63 (Figure 3C,D).

**Figure 2.** The phylogenetic analysis of the Nip and MH63 TE super-family clusters and ORF reference sequence identifications for each cluster. (**A**) Phylogenetic trees of *type1* TE ORFs for Nip and MH63, the *ref-N-type1* and *ref-M-type1* sequence on the top branch was selected as the reference sequences for the *N-type1* and *M-type1* clusters, respectively. (**B**–**F**) In the same format as (**A**), for *type2*, *typeRT*, *typePHA, Ty1*/*Copia* and *Ty3*/*Gypsy*, respectively. The TE ORF reference sequences on the top of the branch for each cluster are shown in red.

**Table 2.** The whole genome and centromere scanning with six TE ORF reference sequences in MH63 and Nip.


1

2 **Figure 3.** Scanning of the Nip and MH63 whole genomes with the TE ORF reference sequences. (**A**) Nip genome scanning after normalization with *ref-N-Ty1*/*Copia*. (**B**) MH63 genome scanning after normalization with *ref-M-Ty3*/*Gypsy*. (**C**–**L**) in the same format as (**A**), for *Ty1*/*Copia*, *Ty3*/*Gypsy*, *type1*, *type2*, *typeRT* and *typePHA,* respectively. Individual TE peaks P-*Copia* (*p* value = 0.0001) and P-*Gypsy* (*p* value = 0.008) were observed in the Nip and MH63 genomes separately.

#### *3.5. Stochastic SNP Distribution in TE ORFs, and GPDF Analysis of P-Copia and P-Gypsy*

Across the *Ty1*/*Copia* and *Ty3*/*Gypsy* ORF sequences in both MH63 and Nip, the SNP distribution was observed. A significantly much higher rate of the SNP distributions was found for *Ty3*/*Gypsy* ORFs than that of *Ty1*/*Copia* ORFs in both rice genomes (Figure 4A,B). We further produced identity distribution curves for P-*Copia* and P-*Gypsy*, and they fitted well to the mathematical GPDF model (Figure 4C,D). We reported the R square values in the inset of Figure 4C,D, respectively. The nucleotide substitution ratio (*Ks*) of the two peaks were calculated as 2.58σ [3]. The *Ks* value of P-*Copia* was calculated to be 0.0049, while that of P-*Gypsy* was 0.020. The individual P-*Copia* peak representing the *Ty1*/*Copia* insertion events in the Nip genome were estimated to be 0.19 Mya by GPDF fitting (Figure 4C), and the individual P-*Gypsy* peak representing the *Ty3*/*Gypsy* insertion events in the MH63 genome were calculated to be at around 0.77 Mya (Figure 4D). The *Ty3*/*Gypsy* peaks with identity at *Ks* values of 0.7–0.8 were estimated to be inserted in the MH63 genome at 6.3~9.7 Mya (Figure 4D).

**Figure 4.** The SNP distributions across *Ty1*/*Copia* and *Ty3*/*Gypsy* ORFs, and the GPDF fit of the P-*Copia* and P-*Gypsy* peaks. (**A**) The SNP level distributions of *Ty1*/*Copia* in the MH63 (upper panel) and Nip (lower panel) genomes. (**B**) The SNP level distributions of *Ty3*/*Gypsy* in the MH63 (upper panel) and Nip genomes (lower panel). gag, the GAG protein; pr, the protease; int, the integrase; RT, the reverse transcriptase; RH, RNaseH. (**C**) The GPDF fit of the P-*Copia* peak in the Nip genome with an R square value 0.99, and the age of the peak was calculated at around 0.19 Mya. (**D**) The GPDF fit of the P-*Gypsy* peak in MH63 with an R square value of 0.96, and the age of the peak was 0.77 Mya. The peaks with identities of 0.7–0.8 were estimated as *Ty3*/*Gypsy* insertion events at 9.7~6.3 Mya.

#### *3.6. LTR/TE Analysis in Nip and MH63 Centromeric Regions*

Centromere sequences of the MH63 and Nip genomes were extracted from the whole genome data (Tables S1 and S2) [4], and were scanned with the *Ty1*/*Copia* and *Ty3*/*Gypsy* ORF reference sequences (Figure 5). No significant *Ty1*/*Copia* ORF distribution signals were observed in centromeres of MH63 (Figure 5A) or in that of Nip (Figure 5C). However, strong distribution signals for T*y3*/*Gypsy* ORFs in the centromeric regions of MH63 (Figure 5B) and Nip (Figure 5D), although the 0.77 Mya *Ty3*/*Gypsy* peak was only observed in the former. In centromeres of the Nip genome, this most recent *Ty3*/*Gypsy* peak was not observed, which may indicate that these two types of rice have been diversified for at least 0.77 million years (Figure 5D).

3

**Figure 5.** The centromeres of MH63 and Nip were scanned with *Ty1*/*Copia* and *Ty3*/*Gypsy* ORF reference sequences and normalized with sequence identities. (**A**) MH63 centromere scanning and normalizing with *ref-M-Ty1*/*Copia*. (**B**–**D**) In the same way. Individual peak P-*Gypsy* (*p* value = 0.03) was observed in the MH63 centromeres, and the age of the peaks was 0.77 Mya.

#### **4. Discussion**

3 LTR/TEs have been shown to constitute the major of many monocotyledonous plant genome components, usually, LTR/TEs are randomly inserted across the whole genomes [30]. Earlier inserted TEs may be truncated or fragmented at stochastic sites by various insertional events or sequence mutations that happened later on [31]. Thus, comparisons of the rate in the nucleotide changes between two intact LTR sequences were often used to estimate their insertion time, although it is a challenge to define the exact insertion events for truncated TEs [32]. In our analysis, both rice genomes cumulated plenty of SNPs in the TE ORFs as they were supposed to experience low selection forces during the evolution [33,34]. The huge number of LTR/TE copies, together with their non-conserved nucleotide substitution sites, make them good candidates for evolutionary analysis in several plant systems [2,3,35]. Currently, we have established a matrix-TE approach to comprehensively evaluate the TE insertion events in plant genomes and centromeric regions. This approach overcomes problems related to fragmented TE pieces as well as to stochastic nucleotide substitution sites in the ORFs, as was previously elucidated [3].

Six super-families of LTR/TEs were identified for *Indica* and *Japonica* rice genomes through our super matrix (Figures 1 and 2). The ORFs of intact LTR/TEs were used to generate the identity matrix, then the extracted clusters with high identities were analyzed to figure out the ORF reference sequences by constructing phylogenetic trees (Figure 2). This pipeline has been proven to be an efficient strategy for this type of analysis, because both the intact and truncated TE ORF sequences with diverse SNPs were identified and collected for subsequent evolutionary studies. *Ty1*/*Copia* and *Ty3*/*Gypsy* were the two families that showed significant different distribution patterns between the two rice genomes, with that of the other four types showing similar distribution patterns (Figure 3). Detailed analysis showed that significantly more *Ty3*/*Gypsy* ORFs were inserted in the *Indica* genome, while more *Ty1*/*Copia* ORFs were inserted in the *Japonica* genome (Table 2). The un-balanced *Ty1*/*Copia* and *Ty3*/*Gypsy* insertions increased the diversification potential for the *Oryza* genus [5]. These data strongly indicate that *Ty1*/*Copia* and *Ty3*/*Gypsy* ORF insertions were probably an important driving force for the differentiation of *Indica* and *Japonica* rice [3,36,37].

LTR/TEs are also major components of the centromeres in both the MH63 and Nip genomes. A significant *Ty3*/*Gypsy* peak, calculated to appear at around 0.77 Mya, was observed only in MH63 centromeres, suggesting that *Indica* and *Japonica* genomes must be diverged at or before this time point (Figure 5). Previously, by calculating *Ks* values with single-copy ortholog genes, the mean divergence time of *Indica* and *Japonica* rice was at around 0.55 Mya [11]. Since LTR/TE ORFs are generally not under strict evolutionary selection forces as the ortholog genes are, they may accumulate more nucleotide substitutions or other sequence mutations [27,33,34]. We thus conclude that *Indica* and *Japonica* rice may have diverged between 0.55 to 0.77 Mya. For a more accurate estimation of key time points during genome evolution, both ortholog genes as well as LTR/TE ORFs have to be considered simultaneously in a fully balanced way. A recent publication by Zhang et al. successfully applied the GPDF model to interpret the evolution of autopolyploids in the *Saccharum* species [38]. The optimized matrix-TE method may probably be used in other plant species with large genomes.

**Supplementary Materials:** The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/agronomy12071490/s1, Figure S1: The matrix-TE approach pipeline. A rice genome with 12 chromosome sequences was used as the input data and the TE ORFs were used to generate the super matrix. The TE ORF clusters with an identity of over 95% were extracted and annotated. The TE ORF reference sequence was identified by phylogenetic analysis with the whole genomes and the centromeres were scanned and analyzed by the GPDF fit; Table S1: The centromere locations and length statistics for different chromosomes in the MH63 genome; Table S2: The centromere locations and length statistics for different chromosomes in the Nip genome.

**Author Contributions:** Z.W., W.X., Z.H., Y.W., Y.G. and Y.Z. conceived the bioinformatics experiments and carried out the data analysis. Y.Z. and Z.W. conceived the project and wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by grants from the Natural Science Foundation of China (No. 21602162, No. 31690090, No. 31690091) and the National Science and Technology Major Project (No. 2016ZX08005003-001).

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

