**2. Characterization of Organellar DNA-Derived Sequences in Plants**

Nuclear integrants of organellar DNA were first discovered in a study in which a mitochondrial ATPase subunit gene was found in both the nuclear and the mitochondrial genome of *Neurospora crassa* [25]. Since then, organelle-derived sequences have been examined in the nuclear genomes of a number of animals [26–32] and plants [15–21]. The availability of a large amount of plant organelle and nuclear genome data has made it possible to investigate the prevalence and characteristics of NUMTs and NUPTs in plants. In addition to the plant genomes analyzed previously, we estimated the whole-genome landscape of NUMTs and NUPTs in the majority of currently sequenced plant species. A dataset of NUPTs in 199 plant genomes and NUMTs in 91 plant genomes was obtained (Tables S1 and S2). The analysis methods are described in Supplemental File 1. It should be noted that the NUPT and NUMT insertions analyzed here are all contiguous fragments as reported in the BLAST results; rearrangements that join NUPTs/NUMTs from different regions of the chloroplast/mitochondrial genomes were not analyzed.

#### *2.1. Number and Size Distribution*

NUPTs and NUMTs were observed in all examined plant genomes. Their number and size distributions vary markedly among species. The average NUPT size ranges from 57 bp (*Chlorella variabilis*) to 3382 bp (*Porphyra umbilicalis*), and the total NUPT length ranges from 1038 bp (*Phaeodactylum tricornutum*) to 9.83 Mb (*Triticum urartu*). The proportion of the genome occupied by NUPTs varies from 0.004% to more than 1%, the latter in three genomes (those of *Cucurbita maxima*, *Porphyridium purpureum*, and *Ziziphus jujuba*). NUPTs are few in the majority of algae, and all genomes with fewer than 200 NUPTs are algal. By contrast, NUPTs are abundant in most flowering plants, with the highest number found in *Triticum urartu* (Table S1). NUMTs are similarly variable among plant species. Their cumulative length within the examined plant genomes varies from 327 bp in *Cyanidioschyzon merolae* to 11.42 Mb in *Capsicum annuum*, and the proportion of the nuclear genome occupied by NUMTs ranges from 0.0002% to 2.08% (Table S2). Previous studies reported a 131-kb integrant in rice as the longest NUPT [33,34], and a 620-kb NUMT insertion in *Arabidopsis*, derived from partially duplicated mitochondrial DNA, is the largest NUMT examined to date [35]. Our analysis detected a 135-kb NUPT in *Gossypium hirsutum*, which is thus the longest NUPT insertion known so far (Table S1). NUPTs/NUMTs usually account for less than 0.1% of the nuclear genome. However, "old" NUPT and NUMT sequences are usually difficult to detect because of continual mutation and rearrangement during evolution [29,36]. In addition, NUMTs and NUPTs with high sequence similarity to mitochondrial/chloroplast DNA may be removed as organelle contamination when nuclear genomes are assembled. Thus, even in "thoroughly sequenced" nuclear genomes, NUMTs/NUPTs may not be completely captured, and the content of these organelle-derived sequences is likely to be underestimated by standard BLAST-based searches.
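
The per-genome quantities discussed above (number of insertions, average size, cumulative length, and proportion of the nuclear genome) can be derived from a table of BLAST hits. The following is a minimal sketch, not the pipeline of Supplemental File 1: it assumes BLAST tabular output (`-outfmt 6`) with the organelle genome as query and the nuclear assembly as subject, and the length and e-value cutoffs, the function names, and the toy hits are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def parse_blast_tab(lines, min_len=50, max_evalue=1e-5):
    """Yield (chrom, start, end) nuclear intervals from BLAST outfmt-6 lines.

    The min_len and max_evalue cutoffs are illustrative assumptions.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[1]                      # sseqid: nuclear sequence
        length = int(fields[3])                # alignment length
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if length < min_len or evalue > max_evalue:
            continue
        # Minus-strand hits report sstart > send, so normalize the order.
        start, end = sorted((sstart, send))
        yield chrom, start, end


def merge_intervals(hits):
    """Merge overlapping hits per chromosome so no base is counted twice."""
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:             # overlap (1-based, inclusive)
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


def summarize(intervals, genome_size):
    """Return count, mean size, cumulative length, and percent of genome."""
    sizes = [end - start + 1 for _, start, end in intervals]
    total = sum(sizes)
    return {
        "count": len(sizes),
        "mean_size": total / len(sizes) if sizes else 0.0,
        "total_length": total,
        "percent_of_genome": 100.0 * total / genome_size,
    }


# Toy hits (hypothetical, not taken from Tables S1 or S2).
toy_hits = [
    "cp\tchr1\t98.0\t100\t2\t0\t1\t100\t1001\t1100\t1e-40\t180",
    "cp\tchr1\t97.0\t80\t2\t0\t200\t279\t1080\t1159\t1e-30\t140",
    "cp\tchr2\t95.0\t200\t10\t0\t500\t699\t5200\t5001\t1e-80\t300",
    "cp\tchr2\t90.0\t30\t3\t0\t1\t30\t9000\t9029\t1e-3\t40",
]
merged = merge_intervals(parse_blast_tab(toy_hits))
stats = summarize(merged, genome_size=1_000_000)
```

Merging overlapping hits before summing is one design choice among several; it prevents double counting when BLAST reports several alignments over the same nuclear region, but it also means the insertion count depends on how overlaps are resolved.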

Studies of the correlation between NUPT/NUMT abundance and nuclear genome size have reported conflicting findings [29,37]. We therefore analyzed the correlation between genome size and the cumulative length or total number of NUPTs and NUMTs in more than 200 plant species to investigate whether nuclear genome size affects NUPT/NUMT content. Nuclear genome size was positively correlated with both the cumulative length and the total number of NUPTs/NUMTs (Figure 1). Earlier analyses that detected no such correlation probably did so because fewer plant nuclear genomes were analyzed. No correlations were detected between NUPT/NUMT content and chloroplast/mitochondrial genome size (Tables S1 and S2).
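
A correlation analysis of this kind can be sketched as follows. The text does not state which coefficient was used, so this example simply shows two common options, Pearson's r and Spearman's rank correlation, implemented with the standard library only. The function names and the toy values are hypothetical and are not taken from Tables S1 or S2.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Toy values (hypothetical): genome size in Mb and cumulative NUPT length in kb.
genome_size_mb = [120, 390, 750, 2300, 4900]
nupt_length_kb = [35, 800, 420, 3100, 9800]
r = pearson_r(genome_size_mb, nupt_length_kb)
rho = spearman_rho(genome_size_mb, nupt_length_kb)
```

Because genome sizes span several orders of magnitude, a rank-based coefficient or a log transformation before computing Pearson's r may be less sensitive to a few very large genomes; which approach fits the data in Tables S1 and S2 is a judgment left to the analyst.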

**Figure 1.** Correlation analysis between nuclear genome size and the content or number of nuclear integrants of plastid DNA (NUPTs) and nuclear integrants of mitochondrial DNA (NUMTs) in plants: (**A**) genome size versus cumulative length of NUPTs; (**B**) genome size versus total number of NUPTs; (**C**) genome size versus cumulative length of NUMTs; (**D**) genome size versus total number of NUMTs. The scatter plots were generated from the data presented in Tables S1 and S2. Red dots represent algae, and black dots represent other plants, most of which are flowering plants.

#### *2.2. Organization and Distribution Patterns*

NUPTs and NUMTs are frequently organized as clusters [21,38,39]. For example, we observed that approximately 45% of the 3155 NUPT insertions in the genome of *Asparagus officinalis* were organized in clusters [21]. In the model plant species *Arabidopsis* and rice, NUPTs and NUMTs are frequently arranged nonrandomly as loose or tight clusters, depending on their degree of physical linkage [39]. NUPTs/NUMTs are organized into three major patterns in plants: (a) continuous fragments of nuclear DNA collinear with mitochondrial or chloroplast DNA, (b) rearranged NUPTs/NUMTs originating from different regions of one organelle genome with non-uniform orientation, and (c) mosaics containing both NUPTs and NUMTs [40]. The presence of mosaic clusters containing NUPTs and NUMTs indicates that DNA fragments from different organelles might have concatemerized before insertion, or that these nuclear regions are hotspots for integration [39,41]. Other organization patterns, such as tandemly duplicated NUPTs/NUMTs originating from one organelle fragment with the same orientation, have occasionally been observed [21]. The variety of organization patterns of NUPTs/NUMTs in plant genomes suggests that the origins and evolutionary paths of the integrated regions may differ, and that NUPTs/NUMTs are involved in shaping the plant genome via complicated mechanisms.
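
One simple way to identify clusters of insertions is to group those that lie within a maximum distance of one another on the same chromosome. The sketch below illustrates that idea only; it is not the procedure used in the cited studies, and the `max_gap` threshold, the function names, and the toy coordinates are hypothetical. A smaller `max_gap` would correspond loosely to "tight" clusters and a larger one to "loose" clusters.

```python
from collections import defaultdict


def cluster_insertions(insertions, max_gap):
    """Group (chrom, start, end) insertions separated by at most max_gap bp.

    Returns a list of clusters, each a list of insertions; an isolated
    insertion is returned as a cluster of size one.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in insertions:
        by_chrom[chrom].append((start, end))

    clusters = []
    for chrom in sorted(by_chrom):
        current = []
        current_end = None
        for start, end in sorted(by_chrom[chrom]):
            if current and start - current_end <= max_gap:
                # Close enough to the running cluster: extend it.
                current.append((chrom, start, end))
                current_end = max(current_end, end)
            else:
                if current:
                    clusters.append(current)
                current = [(chrom, start, end)]
                current_end = end
        if current:
            clusters.append(current)
    return clusters


def clustered_fraction(clusters):
    """Fraction of insertions that belong to a cluster of two or more."""
    total = sum(len(c) for c in clusters)
    in_clusters = sum(len(c) for c in clusters if len(c) >= 2)
    return in_clusters / total if total else 0.0


# Toy coordinates (hypothetical).
toy_insertions = [
    ("chr1", 100, 200),
    ("chr1", 1000, 1500),
    ("chr1", 50000, 50100),
    ("chr2", 10, 90),
    ("chr2", 95000, 95100),
]
toy_clusters = cluster_insertions(toy_insertions, max_gap=5000)
toy_fraction = clustered_fraction(toy_clusters)
```

With these toy coordinates, the first two insertions on `chr1` form one cluster and the remaining three are isolated, so two of the five insertions are clustered. The result is sensitive to `max_gap`, which is why a distance threshold should be reported alongside any clustered fraction.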

NUPTs/NUMTs are usually distributed unevenly in the analyzed plant genomes [20,21,39]. They are preferentially located in centromeric and pericentromeric regions [36,39], which contain few genes and a high proportion of heterochromatin [42–44]. Such regions may offer a stable genomic environment for the maintenance of foreign, organelle-derived DNA [16,36], and integrations in these regions are expected to be less harmful than those in other chromosomal regions [36]. For example, in rice and *A. officinalis*, large NUPTs are predominantly distributed in the pericentromeric regions of the chromosomes [21,36]. In some species, such as *Arabidopsis* and sorghum, a considerable fraction of NUPTs and NUMTs co-localizes with transposable elements (TEs) [16]. These findings imply that recombination based on repetitive sequences can rearrange chromosome structure and contribute to the various organization patterns of organelle-derived sequences.

The chromatin state seems to be an essential factor affecting the successful insertion of organellar DNA into the nucleus. The pre-insertion status of *Oryza sativa* subsp. *indica*-specific NUPTs suggests that newly transferred organellar sequences are predominantly inserted into open chromatin. This phenomenon has also been observed in humans [45]. However, existing NUPTs/NUMTs are often detected in heterochromatic regions. This apparent paradox has two possible explanations. First, chromatin accessibility can be modified by the external environment, such as stress [46], and/or by genetic crash, such as hybridization [47]. Second, many new insertions in open chromatin may not be retained because of selective pressure, since insertion into the exons of genes can damage gene function. Indeed, most NUPTs/NUMTs are located in introns or untranslated regions [23]. By contrast, heterochromatic regions are more favorable for the maintenance of organelle-derived sequences.
