**De Novo Assembly Discovered Novel Structures in Genome of Plastids and Revealed Divergent Inverted Repeats in** *Mammillaria* **(Cactaceae, Caryophyllales)**

**Sofía Solórzano 1,\*, Delil A. Chincoya 1, Alejandro Sanchez-Flores 2,\*, Karel Estrada 2, Clara E. Díaz-Velásquez 3, Antonio González-Rodríguez 4, Felipe Vaca-Paniagua 3,5, Patricia Dávila <sup>6</sup> and Salvador Arias <sup>7</sup>**


Received: 21 August 2019; Accepted: 22 September 2019; Published: 1 October 2019

**Abstract:** The complete sequence of chloroplast genome (cpDNA) has been documented for single large columnar species of Cactaceae, lacking inverted repeats (IRs). We sequenced cpDNA for seven species of the short-globose cacti of *Mammillaria* and de novo assembly revealed three novel structures in land plants. These structures have a large single copy (LSC) that is 2.5 to 10 times larger than the small single copy (SSC), and two IRs that contain strong differences in length and gene composition. Structure 1 is distinguished by short IRs of <1 kb composed by *rpl23-trnI-CAU-ycf2*; with a total length of 110,189 bp and 113 genes. In structure 2, each IR is approximately 7.2 kb and is composed of 11 genes and one Intergenic Spacer-(*psbK-trnQ*)-*trnQ-UUG-rps16-trnK-UUU-matK-trnK-UUU-psbA-trnH-GU G-rpl2-rpl23-trnI-CAU-ycf2*; with a total size of 116,175 bp and 120 genes. Structure 3 has divergent IRs of approximately 14.1 kb, where IRA is composed of 20 genes: *psbA-trnH-GUG-rpl23-trnI-CAUycf2-ndhB-rps7-rps12-trnV-GAC-rrn16-ycf68-trnI-GAU-trnA-AGC-rrn23 -rrn4.5-rrn5-trnR-ACG-trnN-GUU-ndhF-rpl32*; and IRB is identical to the IRA, but lacks *rpl23*. This structure has 131 genes and, by pseudogenization, it is shown to have the shortest cpDNA, of just 107,343 bp. Our findings show that *Mammillaria* bears an unusual structural diversity of cpDNA, which supports the elucidation of the evolutionary processes involved in cacti lineages.

**Keywords:** divergent inverted repeats; short-globose cacti; novel gene rearrangements; pseudogenization

#### **1. Introduction**

A new era in the study of evolutionary processes of chloroplasts and their genomes has arisen with the advent of massive sequencing [1]. Huge advances have been documented since 1883, when Schimper postulated an endosymbiotic cyanobacterial origin of these organelles [2]. More recently, many studies have focused on determining the cyanobacterial origin of the DNA molecule contained in chloroplasts [1,3,4]. Using comparative genomics, DNA sequences of complete genomes of contemporary cyanobacteria, algae, and plants have been analyzed, leading to the discovery that the chloroplast genome encompasses structural changes with significant evolutionary information. Thus, in comparison to cyanobacteria and algae, a significant reduction in the total length and in the number of genes has been documented in land plants [5]. However, many genes lacking in the chloroplast genome have migrated to nuclear or mitochondrial genomes [6], which indicates a complex functional relationship among the three genomes contained in plants. In addition, in plants, the chloroplast genome has a hybrid transcriptional process, which denotes the evolutionary transition from a prokaryotic form to a eukaryotic form. Accordingly, most encoding regions are regulated in operons which are transcribed into polycistronic units, as occurs in contemporary cyanobacteria [7]. Additionally, typical eukaryotic transcriptional regulation was documented in nearly 60 promoters for encoding regions and their transfer RNAs [8].

Comparisons within land plants have concluded that the differences in the total length of the complete chloroplast genome (cpDNA) are caused by lengthening or shortening of genes and not by a significant gain/loss of them [5,9]. Flowering land plants tend to have a total of 120 genes; of these, nearly 80 are encoding genes, 30 are tRNAs, and four are rRNAs [9]. In angiosperms, these genes are not randomly distributed along the entire molecule of DNA of the chloroplasts, but instead determine a recognizable structure in the cpDNA. Moreover, in most angiosperms, this cpDNA is sectioned into four regions, which are distinguished by their length and trend in gene composition. The largest single copy (LSC) contains most of the encoding genes directly related to photosystems I and II, ATP synthases, proteins of cytochrome b/f complex, DNA depending on RNA polymerases, and proteins which tend to have a single encoding gene: ribulose bisphosphate carboxylase large chain, maturase K, envelope membrane protein, acetyl coenzyme carboxylase, transcriptional initiation factor, as well as most of proteins of small and large subunits of the ribosome. The small single copy (SSC) often contains dehydrogenase subunits and open reading frames. This copy typically shows the highest mutation rates. The SSC is often flanked by two inverted repeats (IRA and IRB), which vary in gene composition and length [1]. In these plants, the IRs typically contains four ribosomal RNA subunits (*4.5S*, *5S*, *16S*, and *23S*) and five transfer RNA subunits (*trnA-UGC, trnI-GAU, trnN-GUU, trnR-ACG*, and *trnV-GAC*). In addition, the IRs exhibit lower mutation rates than the SSC [10]. Currently, nearly 500 cpDNA have been sequenced for the land plant group, showing that IRs are the main source of structural variation by relative expansion, contraction, and gene rearrangement. However, between IRA and IRB within the same genome, there are no differences in gene arrangement and composition, and in only a few cases do they have low divergence in the DNA sequence [9,10].

In angiosperms, although IRs are commonly present, they are absent in some taxa. Around 95% of legume species of the subfamily Papilionoideae (order Fabales) lack IRs, which has been interpreted as a novel evolutionary change that appeared in a common ancestor and, eventually, was inherited by its descendants, whereas other legume species of this order have IRs [11]. Recently, the lack of IRs was documented in two species of large columnar cacti of Cactoideae (Cactaceae, Caryophyllales) of the tribe Echinocereeae. The loss of IRs in the saguaro (*Carnegiea gigantea*) was interpreted as a novel structural change [12]. In addition, we have verified that the cpDNA of *Pachycereus schottii*, which has recently been directly submitted to the GenBank database, also lacks IRs (uploaded with its synonym *Lophocereus schottii*, NCBI, NC\_041727.1). In contrast, in all other species currently sequenced in Caryophyllales have been shown to have IRs [13]. At the infrageneric level, contrasting results have been documented and it is not a rule that all members of a certain genus show identical cpDNA structure. For example, in 13 species of *Camellia* (Theaceae, Ericales), identical structure of cpDNA and

low divergence of DNA sequences were documented [14]. A similar result was obtained for seven species of *Silene* (Caryophyllaceae, Caryophyllales), with identical cpDNA structure and only a small gain/loss of genes among them being documented [15,16]. In contrast, unusual results have been obtained for 17 species of *Erodium* (Geraniaceae, Geraniales), which showed deep and strong structural changes, such as expansion and contraction of IRs or even the absence of IRs, and substantial gene rearrangements in the LSC [17]. Thus, data based on characterizations of the structures of complete chloroplast genomes are necessary, as they might reveal novel unexpected results that may help to clarify evolutionary processes in plants.

In this study, we focused on cacti species of the short-globose genus *Mammillaria* (Cactoideae, tribe Cacteae). *Mammillaria* is relevant, in terms of biodiversity, due to its high species richness (163–232) [18] in the Cactaceae. A total of 192 species and subspecies of *Mammillaria* are listed in the Red List of Threatened Species of International Union for Conservation of Nature [19]. For the species of this genus, non-fully resolved phylogenies were obtained from DNA sequences of the r*pl16* intron and psbA-trnH intergenic spacer regions of the chloroplast [20]. The increment of plastid molecular markers (*rpl16, trnK,* and *rpoC1* introns, and *trnK-psbA, rpl20-rps12, trnL-trnF,* and *trnT-trnL* intergenic spacers) did not resolve the relationships among species of *Mammillaria*, nor of species from closer genera (i.e., *Coryphantha, Escobaria, Neolloydia, Ortegocactus,* and *Pelecyphora*). These unresolved evolutionary relationships have been attributed to the recent origin of Cactaceae (e.g., [21,22]), estimated at 35 million years ago [23]. Currently, morphological characteristics have been used to postulate the taxonomic limits among species of *Mammillaria* and of those in close cacti genera [18]. However, these characters are ambiguous and often do not accomplish a robust taxonomically resolved separation [20,21].

In this study, we de novo assembled the complete chloroplast genome of seven species in this genus, in order to utilize these genomes as reference for *Mammillaria*. A second objective was to identify putative structural characteristics of the cpDNA of *Mammillaria*, by comparing with the complete chloroplast genomes, which have been documented for other cacti species and other Caryophyllales. In addition, we discuss whether the structural differences of the cpDNA discovered in *Mammillaria* may serve to resolve the evolutionary and taxonomic pendants of this genus. As structural differences have been documented at the subfamily level, we expect the cpDNA of *Mammillaria* (Cacteae) to differ from those of the large columnar cacti (Echinocereeae); however, among species of *Mammillaria*, structural differences in cpDNA are not expected by its recent divergence.

#### **2. Results**

#### *2.1. Gene Composition and Length Variation in Three Novel cpDNA Structures Identified in Mammillaria*

De novo assembly revealed three structures of cpDNA in *Mammillaria* (Figure 1). Structure 1 was present in *M. albiflora* and *M. pectinifera* (Figure 1a), structure 2 in *M. crucigera, M. huitzilopochtli, M. solisioides,* and *M. supertexta* (Figure 1b), and structure 3 in *M. zephyranthoides*(Figure 1c). These structures had a quadripartite partition, into LSC, SSC, and two IRs (Figure 1). We identified unexpected and strong structural differences in gene composition and length among the three structures (Figures 2 and 3a). In addition, structure 3 (*M. zephyranthoides*) had divergent IRs; meanwhile, the rest of the species had identical gene composition in their IRs (Figure 1).

Variation between species in the total length and number of genes of cpDNA were documented (Tables 1 and 2). The cpDNA of *Mammillaria* ranged from 107,343 bp (*M. zephyranthoides*) to 116,175 bp (*M. supertexta*) (Table 1). The relative length of LSC represented approximately 62% of the genome in the six species (*M. albiflora, M. crucigera, M. huitzilopochtli, M. pectinifera, M. solisioides,* and *M. supertexta*) (Table 1). Moreover, the LSC was nearly 2.5 times larger than SSC, which represented 25.35–28% of the total genome length (Table 1). In contrast, in *M. zephyranthoides*, the LSC was longer (68% of the cpDNA length), being 10 times larger than its respective SSC, whose length only reached 7%. The shortening of SSC in *M. zephyranthoides* was due to the lengthening of the IRs, which had nearly 14 kb, and to the strong reduction of the genes *ycf1* and *ycf2* to <1 kb; meanwhile, in the other species of *Mammillaria*, these genes were >6 kb (Figure 1, Table 1).

We identified additional different types of gene rearrangements at the LSC (Figure 3b). The first was a type of rearrangement involving blocks of genes, which were inverted but maintained identical order (Figure 3b); there was also a second type involving a single gene with two variants: (a) the single gene did not change its relative location, but its orientation was inverted (*trnF-GAA*); (b) the single gene changed in location, but it maintained its orientation (*rpl2*). The single gene *trnF-GAA* had identical orientation in the species of structure 1 (*M. albiflora* and *M. pectinifera*), structure 3 (*M. zephyranthoides*), and *M. solisioides* of structure 2, but was inverted in the other three species of structure 2 (*M. crucigera, M. huitzilopochtli,* and *M. supertexta*). In the structure 1 species, *rpl2* flanked the IRA in *M. pectinifera*, whereas, in *M. albiflora*, it flanked the IRB. In addition, in *M. supertexta* (structure 2) and *M. zephyranthoides* (structure 3), *rpl33* was lost, but it was present in the other five species of *Mammillaria* as pseudogene except in *M. albiflora* (Figure 1, Table 2).

**Figure 1.** *Cont.*

**Figure 1.** Three different structures found in the complete chloroplast genome of *Mammillaria*: (**a**) structure 1, (**b**) structure 2, and (**c**) structure 3. In structure 1, the *rpl2* gene is flanking IRB in *M. albiflora* and IRA in *M. pectinifera*. Gene *rpl33* was lost in *M. supertexta* of structure 2 and in *M. zephyranthoides* of structure 3. The genomes are displayed circularly, and IRA and IRB correspond to duplicated blocks of regions; starting from the top of the circle, the IRA is the one that appears first in clockwise.

On the other hand, comparison of the complete genomes of the seven species showed similar percentages of types of genes. In the three structures, the highest percentage of genes corresponded to tRNAs (26%), where each of the sets of *rps* and *psb* represented 13% of the genes. Another similarity found among species was that, in the LSC, large blocks of concatenated genes maintained identical gene compositions and arrangements (Figures 1 and 2). Most of the concatenated genes correspond to the encoding genes of photosystems I and II. In addition, the *rbcL* gene, units of the cytochrome b/f complex, and genes of the DNA-dependent RNA polymerase (Table 2) were identical in number, location, and arrangement (Figure 1). Comparisons of LSC and SSC showed that structures 1 and 2 had more mutual similarities than either had with structure 3 (Table 2). In the SSC, the seven species maintained identical order and orientation in *ycf2-trnL-CAA-ycf1*; although, in structure 3, they were shorter by pseudogenization (Figure 1).

In addition, in the seven species pseudogenization was identified in the NADH dehydrogenase-like (NDH) complex of plastid genes (*ndh* genes, hereafter). Of this family of genes, only four subunits of the suite of dehydrogenase genes (B, D, F, and G) were recorded in the seven species. The *ndh* subunits B, D, and F were pseudogenes in the seven species, and subunit G was pseudogene only in *M. pectinifera, M. solisioides* and *M. zephyranthoides*. In addition, *ycf68* was a pseudogene in all seven species, and *ycf4* was pseudogene in three species except in *M. albiflora*, *M. huitzilopochtli*, *M. supertexta* and *M. zephyranthoides* (Table 2).

**Figure 2.** MAUVE graphic of five structural alignments of complete chloroplast genomes. The upper graph corresponds to caryophyllid *P. oleracea* (Portulacaceae); below that, the large giant columnar cactus, *C. gigantea*; and the last three graphs are the three structures documented in *Mammillaria*. Relative inverted DNA sequences are drawn above/below of the horizontal line; identical genes are in the same color. *P. oleracea* has a larger genome than any species of Cactaceae. Discarding the IRs that are recorded in *Mammillaria* and *P. oleracea,* but not in *C. gigantea*, the cpDNA structure of *P. oleracea* is more similar in structure to *C. gigantea* than to *Mammillaria*. Between *C. gigantea* and *P. oleracea*, a single large block of inverted genes (encircled) corresponding to *atpB* and *atpE* is shown. This block of genes in *Mammillaria* has identical orientation to *P. oleracea*. In *Mammillaria*, many other novel gene rearrangements, which are absent in the other two-caryophyllid taxa, were documented. Additionally, structure 3 has two blocks of inverted genes (described in detail in Figure 3b), with respect to structures 1 and 2. These two blocks of genes are indicated with arrows and have identical orientation in *C. gigantea, P. oleracea*, and structures 1 and 2 of *Mammillaria*.

**Figure 3.** (**a**) Comparison of length and gene composition of IRs in the three structures documented for the complete chloroplast genomes of *Mammillaria*. The two IRs of structure 3 diverge in *rpl23*; its location in IRA is denoted with an asterisk. (**b**) Blocks of genes rearranged at the LSC. These genes are inverted and reoriented in structures 1 and 2, with respect to structure 3. The direction of the row indicates the orientation of transcription, to the left in sense of clockwise and to the right, counter-clockwise. The large squares indicate the genes of LSC that flank these two rearrangements. The asterisk in *rpl33* (bottom figure) indicates that, in *M. supertexta* of structure 2 and in species of structure 3, this gene was lost.


**Table 1.** Species of *Mammillaria* grouped by the type of the structure identified in the complete chloroplast genome (cpDNA). Within and among structure variation in total length size, the two inverted repeats (IRs), large single copy (LSC), and small single copy (SSC) were detected.

<sup>1</sup> GeneBank access number of the DNA sequences deposited.

**Table 2.** Variation in structural and functional gene composition in the three structures of cpDNA found in *Mammillaria*. A total of 18 different types of genes were documented, and these are organized alphabetically according to their location in IRs, LSC, and SSC. All the genes located at IRs are duplicated (2X), except the *rpl23*Ψ in structure 3 that lacks in IRB.



**Table 2.** *Cont.*

Ψ indicates a pseudogene. The note "-p" indicates that a partial DNA sequence of a gene is inserted in the two IRs. \* indicates that *rpl33* lacks in *M. supertexta* but it is present in the other three species of this structure. In addition, this gen is pseudogene in all species except in *M. albiflora*. \*\* indicates that is a pseudogene in *M. crucigera* and *M. solisioides* of structure 2. \*\*\* indicates that it is pseudogene only in *M. solisioides* of structure 2 and *M. pectinifera* of structure 1. 89

#### *2.2. Structure 1: Shortest IRs, Composed of Three Genes, rpl23-trnI-CAU-ycf2*

This structure was distinguished by two short IRs (of <1 kb) composed by *rpl23-trnI-CAU-ycf2.* Of this *ycf2*, a total of 265 bases of its 5' extreme are inserted in the IRs, which were identical in DNA sequence and gene composition. This structure 1 was found only in two species of *Krainzia* subgenus, *M. albiflora*, and *M. pectinifera* (Figure 1, Table 1). The genome of *M. albiflora* was larger (110,789 bp, Table 1) than the one of *M. pectinifera*. In this structure, both IRs had length 1348 bp (1.22% of the total genome). The three genes represented 5.3% of the total genes (113). These two species had an identical number of total genes; most of which (26.6%) were represented by tRNAs (Figure 1a, Table 1). The LSC covered the largest proportion of the DNA sequence (72.6%, Table 1) and the largest number of genes (82) (Table 2). However, *rpl2* gene is flaking IRA in *M. pectinifera* but in *M. albiflora* is flanking IRB that is the identical location of the species of structures 2 and 3.

#### *2.3. Structure 2: IRs Composed by an Unusual Complete Battery of 11 Concatenated Genes and One Identical Intergenic Spacer.*

Structure 2 was found in three species of the subgenus *Mammillaria* (*M. crucigera*, *M. Huitzilopochtli*, and *M. supertexta*) and one of the subgenus *Krainzia* (*M. solisioides*). In all of these, the two IRs were formed by genes usually located at the LSC, and were flanked by the DNA sequence of an identical intergenic spacer sequence (IGS) that is from psbk to trnQ; however, IRB was flanked by *rps19*, whereas IRA was flanked by *psbK* (Figure 1). The complete composition of this structure consisted of this IGS and 11 genes: IGS (*psbK-trnQ*)*-trnQ-UUG-rps16-trnK-UUU-matK-trnK -UUU-psbA-trnH-GUG-rpl2-rpl23-trnI-CAU-ycf2*. The four species with structure 2 had identical IRs, with respect to gene composition and DNA sequence. However, there were differences between species, in terms of the total genome length, number of genes, and length of each of the four quadripartite regions (Table 1). The *rpl33* gene was found in three species, but was absent in *M. supertexta*. Consequently, the total number of genes differed among the structure 2 species (Tables 1 and 2). Although they showed differences, the four species had similar percentages in the relative proportions of genes represented in IRs, LSC, and SSC, as well as in the percentages of gene types in the overall genome (Tables 1 and 2). Using *M. supertexta*, as a reference is important, as it had the largest genome in structure 2, showing 63% (75) of genes located at the LSC and 20.2% (24) at the SSC. In addition, each IR comprised 6.24% of the genes (11 genes and one IGS). In structure 2, the complete gene of maturase K gen (*matK*) is nested with *trnK-UUU* introns and inserted into the IRs. In this structure, 28.20% were tRNAs, followed by 13.67% in each of the suites of *rps* and *psb* subunits, *rpl* subunits represent 8.40%. Finally, the DNA regions that were represented by only one gene (0.85%) were *accD*, *ccsA*, *cemA*, *infA*, and *rbcL* (Table 2).

#### *2.4. Structure 3: Largest and Divergent IRs in Which Four Ribosomal Units are Included.*

Structure 3 was only recorded in *M. zephyranthoides* (Figure 1c). This genome had a length of 107,343 bp and 131 genes (Table 1). The LSC covered 62% of the total number of genes (81), the SSC had only 7 genes (5.7%), and both IRs had 43 genes (32.8%). In addition, each IR was approximately 14.1 kb in length and was comprised of genes typically located at the SSC. The IRA was composed of *psbA-trnH-GUG-rpl23\*-trnI-CAU-ycf2partial-ndhB-rps7-rps12-trnV-GAC-rrn16-ycf68-trnI-GAU-trnA-AGC -rrn23-rrn4.5-rrn5-trnR-ACG-trnN-GUU-ndhF-rpl32*. The *rpl23* is marked with an asterisk because it is absent in the IRB, thus the IRB was identical in gene composition and arrangements but was divergent. Structure 3 is distinguished by the ancestral presence of four rRNA subunits (4.5, 5, 16, and 23) in the IRs.

Pseudogenization in structure 3 was clearly identified, in that *ycf2* and *ycf1* showed incomplete DNA sequences. The former was truncated into three segments. Two of these segments were inserted into the IRS and the third one was at the SSC. These three segments added only to a total of 960 bp. The gene *ycf1* had <1000 bp. Consequently, the shortening of *ycf1* and *ycf2* caused a diminished cpDNA total length (Table 1, Figure 1c). Additionally, this structure added, as pseudogenes, *accd, rps18, rpl23,* and one copy of *rps12*.

#### **3. Discussion**

The three structures of cpDNA discovered in *Mammillaria* are novel and these have not been recorded in other eukaryote organisms. In addition, the divergent IRs (identified in structure 3) are a novel result for land plants (Embryophyta), only having otherwise been recently discovered in green algae of the order Ignatiales (*Pseudoneochloris*, and *Chamaetrichon*) [24]. These strong arrangements in the cpDNA of *Mammillaria* are notable with respect to the rest of caryophyllids as *P. oleracea* (Portulacaceae) and *C. gigantea* (Cactaceae), and even within *Mammillaria* (Figures 2 and 3). Based on the DNA sequences of chloroplasts and current biogeographic distribution, it was estimated that the suborder Portulacinae, which includes the families Cactaceae and Portulacaceae, diverged in the early Miocene (18.8–33.7 Mya). In particular, for Cactaceae, an origin of 10–19 Mya has been estimated [25]. This last estimation differs from the age of 35 Mya estimated for Cactaceae based on molecular phylogeny [23]. Accordingly, the members of Cactaceae are relatively young in the evolutionary history of Caryophyllales and, thus, the structures of cpDNA found in *Mammillaria* have evolved in a recent diverging process, as none of the three structures have been reported for other, older members of Caryophyllales.

Our results suggest that the structural reconfigurations of cpDNA within the family Cactaceae have occurred frequently. Particularly, such structural changes have mainly involved genes located at the flanking extremes of the LSC and, secondly, genes located at the SSC. Although deep and strong changes occurred in reconfigurations of the IRs, only few gene rearrangements and loss of genes occurred in the LSC, protecting the large blocks of genes involved in photosynthesis (Figure 2). We conclude that no single type of cpDNA structure characterizes all members of *Mammillaria*; however, the presence of IRs is a marked difference, with respect to the large columnar cacti species that have lost them.

In addition, our results showed that, among the tribes of the subfamily Cactoideae, there are notable structural differences in the cpDNA. The most evident is that *Mammillaria* (Cacteae) has IRs, as occurs in most of Caryophyllales, which were lost in Echinocereeae. We propose that the presence of IRs is the basal state in the Cactaceae family, as the presence of IRs is common in all members of Caryophyllales currently sequenced. However, we cannot ignore the possibility that IRs may be have been lost and recovered (with a new configuration) in multiple evolutionary events occurring during Cactaceae radiation. The phylogenetic relationships of the seven species based on DNA sequences (Figure 4) showed that *M. albiflora* (structure 1) is the closest species to columnar saguaro, however, *M. pectinifera* (structure 1) is the closest one to *M. solisioides* (structure 2). Unexpectedly, *M. zephyranthoides* (structure 3) occupied a branch that is closer to the three species of series *Supertextae*. These phylogenetic results should be taken with caution, since the number of species is too poor; however, it seems that the evolutionary underlying processes that have operated on the structural changes differ of those operating at the level of DNA sequences.

**Figure 4.** Phylogenetic ML tree obtained for the seven species of *Mammillaria*. The analysis is based on 42 coding regions shared to the two species used as outgroups (*C. gigantea* and *P. oleraceae*).

Our findings showed that, in *Mammillaria*, the concatenated battery *rpl23-trnI-CAU-ycf2* might have a relevant role in the reconfiguration of IRs. Particularly in these genes, it is worthy to highlight the gene *trnI-CAU*, which (along *with trnI-GAU, trnfM-CAU*, and *trnM-CAU*) has been identified, in experimental essays, as one of four essential tRNAs in plastids [26]. In addition, the IRs composed of a single gene have been shown to correspond to *trnI-CAU* (e.g. *Pinus massoniana*) [27]. In this context, we propose that *trnI-CAU* in *Mammillaria* plays a key role in the reconfiguration of IRs, but future studies are needed to verify this.

We found that the IGS *psbK-trnQ* was inserted in both IRs of structure 2, although this IGS at IRB was flanked by *rps19* at the LSC; meanwhile, in the rest of the species, this pair of genes (*psbk* and *trnQ*) was located at the LSC. In addition, the DNA sequence of this IGS was highly conserved, showing 91–97% of identity to the IGS of other Caryophyllales species (e.g., *C. gigantea, Cistanthe longiscaspa, Tallinum paniculatum,* and *P. oleracea*). In addition, we found other highly conserved IGS, which showed 90–98% of identity to the IGS of *trnG-UCC—trnS-GGU* in other members of Caryophyllales. However, in *Mammillaria*, this pair of transfer genes was at the contrary extreme of the LSC, although the DNA sequence of this IGS was found between *rpl20-rps12*. Thus, we support the idea that IGS may have played an important role in the functional processes, in agreement with former studies [28]. Pseudogenization was identified in the seven species of *Mammillaria*, but in structure 3 it was more evident (Figures 1 and 2); particularly by the strong shortening of the genes of the two open reading frames, *ycf2* and *ycf1*, as they had a length of <1 kb whilst, in the other six species of *Mammillaria*, each of these genes had a length of nearly 6–7 kb. The pseudogenization of *ycf1* and *ycf2* indicates a loss of functional activity, which disagrees with the conclusion that these genes are essential for plant survival [29]. In addition, pseudogenization was also identified for all species of *Mammillaria*, with incomplete copies of subunits of *rps* and *rpl* suites, *accD*, as well as, *rps12* and *clpP* duplicated and, evidently, in three dehydrogenase subunits (B, D, and F). These subunits translated to an interrupted sequence of amino acids, which indicates that the functionalities of these genes may have been lost. In addition, seven other subunits of *ndh* genes were completely lost (A, C, E, H, I, J, and K). These subunits in *C. gigantea*, both pseudogenization and the complete loss of *ndh* subunits in cpDNA have also been documented [12]. However, we could not show, for *Mammillaria*, that all of those genes documented as pseudogenes, or entirely lacking in the cpDNA, were not present in nuclear or mitochondrial genomes; in other plants, many genes of the chloroplast have been found in nuclear or mitochondrial genomes [6].

An interesting result, obtained in the SSC of the seven species of *Mammillaria*, *C. gigantea*, and *P. schottii*, is that they all have identical order and orientation of *ycf2-trnL-CAA-ycf1*. This result suggests that the arrangement of these three genes may be a synapomorphy in the subfamily Cactoideae, as it is not present in *Opuntia microdasys*, subfamily Opuntioideae (sequence consulted GenBank: HQ664651.1), nor in the rest of the species of Caryophyllales [13].

Structure 2 was distinguished by the insertion of the *matK* gene into the IRs, which was nested inside of two complete *trnK* introns. The gene *matK* was also documented in the IRs of some species of *Erodium* (Geraniaceae) but was truncated and separated to the two *trnK* introns [17]. In addition, in IRs of *Lamprocapnos spectabilis* (Papaveraceae), complete *matK* nested between the two introns has been documented [30]. It is relevant to point out that *matK* is duplicated in these genomes due to its insertion in IRs. This encoding gene plays a fundamental role in photosynthesis by editing the RNAs of nearly 15 proteins, even though the *trnK* introns were lost [31]. Thus, in species having structure 2, by the formation of IRs, there are two copies of *matK* in the haploid genome of the chloroplast; however, we do not possess sufficient information to discuss the functional consequences of this.

Finally, it is important to point out the following: 1) the three structures of cpDNA of *Mammillaria* are not concordant to the taxonomic subgenera and series levels. Consequently, our results do not support any of the infrageneric classification currently proposed. 2) *Sensu* [18], *M. solisioides* is considered to be a subspecies of *M. pectinifera* (subgenus *Krainzia*, series *Herrerae*+*Pectiniferae*); however, *M. solisioides* exhibits the cpDNA structure found in the *Supertextae*, subgenus *Mammillaria*. 3) Accordingly, we consider that *M. solisioides*is an independent species, in agreement with Arias et al., [32]. We conclude that the structural analysis of cpDNA can also contribute towards clarifying the taxonomic relationships of *Mammillaria* with other plant species.

#### **4. Materials and Methods**

#### *4.1. Plant Sampling and DNA Extraction*

We included seven species of *Mammillaria*, which represent three of the eight subgenera proposed by Hunt [18]. These seven species are listed in the IUCN red list [19]. The three species sampled (*M. crucigera*, *M. huitzilopochtli*, and *M. supertexta*) are classified in the subgenus *Mammillaria*, series *Supertextae.* According to Crozier [21], only the species included in this subgenus represent a natural and monophyletic clade in phylogenies obtained with chloroplast DNA sequences. The other three taxa sampled (*M. albiflora*, *M. pectinifera*, and *M. solisioides)* are of the subgenus *Krainzia*, series *Herrerae*+*Pectiniferae*. It is important to mention that *M. solisioides* was considered [18] to be a subspecies of *M. pectinifera*. The last sampled species was *M. zephyranthoides*, classified in the subgenus *Dolichothele*. This last species has faced a complex and controversial taxonomic identification, as it has been included in both *Dolichothele* and *Mammillaria*.

Living complete plants for tissue samples were obtained for *M. albiflora*, *M. crucigera*, *M. huitzilopochtli*, *M. pectinifera*, *M. solisioides*, and *M. supertexta*. Small pieces of the surface of green stems were cleaned of spines and areoles to obtain 300 mg samples of tissue, which were treated using a MinuteTM Chloroplast Isolation Kit (Invent Technologies Inc., Eden Prairie, MN, USA), according to the manufacturer's instructions. The chloroplast extracts were processed with a DNeasy plant minikit (Qiagen, Valencia, CA, USA), in order to obtain enriched chloroplast genomes.

#### *4.2. High-Throughput Sequencing and Sanger Verification*

Massive sequencing was done with the Illumina platform (Ilumina, San Diego, CA, USA). For each species, genomic libraries were prepared with the Nextera XT kit, according to the manufacturer's instructions, and sequenced in a MiSeq 2 × 300 cycles. The flanks of IRs and gene rearrangements were PCR verified (Table S1), using the recently assembled cpDNA in our study for the design of specific primers with Primer3 v.4 [33]. These PCR products were sequenced in a 3730×l capillary sequencer (Applied Biosystems, Pleasanton, CA, USA).

#### *4.3. Genome Assembly, Annotation, and Structural Alignment*

The available assembly genome for the giant columnar cactus species of *Carnegiea gigantea* [12] lacks IRs and, thus, de novo assembly was carried out with NovoPlasty v.2.6.5. [34] and DISCOVAR de novo v.52488 [35]. The scaffolding was carried out with Ragout v.2.0 [36]. The gaps were filled with GARM v.0.7.5 [37] and the circularization of each of the assembly genomes was obtained with Circlator v.1.5.5 [38]. The assemblage of each genome was tested with REAPR v.1.0.18 [39]. The annotation for the seven species was done with GeSeq [40] and the genomes were drawn with OGD [41]. For gene annotation, we used the cpDNA of *C. gigantea* [12] and *Portulaca oleracea* [42]. Annotation of the largest genome found for each of the three different structures (Table 1) was manually curated. The structural alignment of the complete assembly genomes was performed with MAUVE [43]. This program was also used to compare these structures to other species of Caryophyllales (*P. oleracea*, NC\_036236.1) and Cactaceae (*C. gigantea*, GCA\_002740515.1), in order to emphasize the relevance of our structural findings in the whole chloroplast genome of *Mammillaria*. In order to reconstruct the phylogenetic relationships of the seven species of *Mammillaria*, two species (*C. gigantea* and *P. oleracea*) were used as outgroups. A total of 42 orthologous protein-coding genes (Table S2) shared in these nine species were identified with Prottest [44]. The DNA sequences of these 42 loci were aligned using MAFFT v7.310 [45]. The Akaike Information Criterion (AIC) in JMODELTEST v2.1.10 was used to determine

the best-fitting model of nucleotide substitutions [46]. The GTR + G model was used to obtain the phylogenetic tree based on ML in RAXML-HPC v8.2.10 [47] with 1000 replicates.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2223-7747/8/10/392/s1, Table S1: Primer sequences designed for PCR verifications for flanking adjacent sequences of IRs; and gene rearrangements located at LSC in three genomes structures of *Mammillaria.* In the last column the temperature used to amplify each locus for all species. Primer design was based on the DNA sequences of *M. albiflora* (structure 1), *M. supertexta* (structure 2) and *M. zephyranthoides* (structure 3); Table S2. List of the total of 42 coding loci were used to obtain phylogenetic relationships of the seven species of *Mammillaria* studied and the two species used as outgroups *Carnegiea gigantea and Portulaca oleraceae*.

**Author Contributions:** S.S. is the researcher leading of this study, she designed research and obtain the financial support. S.S., S.A., and P.D. carried out fieldwork and taxonomic identification. S.S. and D.A.C. carried out chloroplasts and DNA isolation. C.E.D.-V. and F.V.-P. implemented experimental protocols of massive sequencing. A.S.-F. supervised bioinformatics analyses, which were processed by D.A.C. and K.E. A.G.-R. supervised the experiments for PCR verifications. S.S. and D.A.C. prepared the complete manuscript; and A.G.-R. and P.D. reviewed all draft versions. All the authors approved the submitted version of this this article.

**Funding:** This study is supported by grants from the Dirección General de Personal Académico of UNAM for research projects (DGAPA-PAPIIT IN222216).

**Acknowledgments:** SEMARNAT (SGPA/DGVS/06880/16) supplied sampling permission. G. Mendoza-Juárez (IIES UNAM) carried out PCR verification. L.M Marquez-Valdelamar and N.M López-Ortiz (Laboratorio de Secuenciación de LaNaBio, IB UNAM) provided sequencing service for Sanger sequencing. P. Gaytán (Unidad de Síntesis y Secuenciación de ADN, IBT UNAM) synthetized primers for PCR verification. We are grateful to Unidad de Secuenciación Masiva y Bioinformática of the Laboratorio Nacional de Apoyo Tecnológico a las Ciencias Genómicas, CONACyT #260481, at the Instituto de Biotecnología/UNAM for advice and training in bioinformatics to Sofía Solórzano and Delil A. Chincoya. The comments of two anonymous reviewers improved the quality of this article.

**Conflicts of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **New Insights on** *Lilium* **Phylogeny Based on a Comparative Phylogenomic Study Using Complete Plastome Sequences**

#### **Hyoung Tae Kim 1, Ki-Byung Lim <sup>2</sup> and Jung Sung Kim 3,\***


Received: 6 November 2019; Accepted: 22 November 2019; Published: 27 November 2019

**Abstract:** The genus *Lilium* L. is widely distributed in the cold and temperate regions of the Northern Hemisphere and is one of the most valuable plant groups in the world. Regarding the classification of the genus *Lilium*, Comber's sectional classification, based on the natural characteristics, has been primarily used to recognize species and circumscribe the sections within the genus. Although molecular phylogenetic approaches have been attempted using different markers to elucidate their phylogenetic relationships, there still are unresolved clades within the genus. In this study, we constructed the species tree for the genus using 28 *Lilium* species plastomes, including three currently determined species (*L. candidum*, *L. formosanum*, and *L. leichtlinii* var. *maximowiczii*). We also sought to verify Comber's classification and to evaluate all loci for phylogenetic molecular markers. Based on the results, the genus was divided into two major lineages, group A and B, consisting of eastern Asia + Europe species and Hengduan Mountains + North America species, respectively. Sectional relationships revealed that the ancestor *Martagon* diverged from *Sinomartagon* species and that *Pseudolirium* and *Leucolirion* are polyphyletic. Out of all loci in that *Lilium* plastome, *ycf1*, *trnF-ndhJ*, and *trnT-psbD* regions are suggested as evaluated markers with high coincidence with the species tree. We also discussed the biogeographical diversification and long-distance dispersal event of the genus.

**Keywords:** *Lilium*; phylogenomics; plastome; molecular markers; gene tree; species tree

#### **1. Introduction**

The genus *Lilium* L. is the type genus of Liliaceae that consists of approximately 100 species spread throughout the cold and temperate regions of the Northern Hemisphere [1,2]. *Lilium* species are economically important because of their ornamental features in horticulture as cut flowers and as potted and garden plants [3]. In addition, the flowers and bulbs of cultivated *Lilium* species are used for food and medicine [4].

The classification of the genus *Lilium* had been built based on flower shape before Comber's classification [5]. As a result, the sectional (or subgeneral) boundaries in the genus frequently changed, and many species were referred to different sections according to different classification systems [6–8]. In contrast to the previous sectional concept based on flower shape, Comber [5] suggested the use of new natural characteristics, including characteristics of leaves, bulbs, and stems to classify the genus *Lilium* and divided it into seven sections. Although Comber's classification was revised by De Jong [9] based on previous papers published up to that time, Comber's classification has been primarily used to date to recognize *Lilium* species and circumscribe the sections within the phylogeny of *Lilium*.

Nishikawa et al. [10] constructed the phylogeny of 55 *Lilium* species using the internal transcribed spacer (ITS) region to clarify their phylogenetic relationships and found that most species formed a clade according to the classification based on the morphological features. Nonetheless, it is difficult to follow their results directly because most of the branch lengths were very short and not significantly supported. In addition, the phylogenies using the same marker with more samples indicated that most of the sections, except for *Martagon*, were polyphyletic [11–13]. Such discordance between classifications based on morphological characteristics and molecular phylogeny has also been found in other genera. For example, the subgenera of the genus *Cymbidium* based on morphological characteristics [14] was not found to be monophyletic using ITS [15] and ITS + *matK* [16,17], and a number of branches in the phylogenetic trees were collapsed in strict consensus trees. It is likely that gene trees based on molecular markers were often insufficient for investigating species trees because of certain evolutionary events, such as incomplete lineage sorting and horizontal gene transfer, or because convergent morphological evolution has occurred in certain lineages. To ensure the phylogenetic accuracy, increased taxa sampling has been preferred for a long time [18–20]. However, it has been suggested that the number of genes may be a more important determinant than the taxon number [21–23], and the rapid development of next-generation sequencing techniques has led us to the phylogenomic era, which provides numerous data to resolve ambiguous relationships.

Plants contain three genomes: a nuclear genome, a mitochondrial genome, and a plastid genome (plastome). The nuclear genomes [24] and mitochondrial genomes [23] in flowering plants substantially vary in length, whereas the plastomes maintain consistent lengths and typical structures for a long time [25], except for those of some specific lineages [26–28]. In addition, the typical plastome structure allows for next-generation sequencing (NGS) data to be assembled easily. As a result, the number of deposited plastome sequences is 10 times that of mitochondrial genome sequences in the NCBI genome database. The highly conservative nature of the plastome structure makes it possible for intergenic spaces to be used as molecular markers, as well as genes [29,30]. Using this feature, many researchers have attempted to increase phylogenetic accuracy for unresolved taxa using whole plastome sequences [31–33] and have also suggested new combinations of molecular markers for specific taxa [34–38]. Two studies on the phylogenomics of *Lilium* using plastome sequences have been published [4,39] to date. The phylogenetic relationships among the branches were well resolved with high support. However, taxon samplings were restricted because of the inability to use the data from both papers, which were published in the same year. Therefore, it was difficult to verify Comber's classification based on well-resolved phylogeny. Du et al. [4] suggested molecular markers for the phylogeny of *Lilium* based on nucleotide diversity; however, the issues of polytomies and low supporting values still remained. Apparently, there was no comparison between species trees and gene trees to suggest molecular markers for the phylogenetic analysis. A gene tree can be easily acquired from each gene; however, it can conflict with the species tree [40] because not all genes have evolved in the same manner in the same lineage. Therefore, it is necessary to evaluate the reliability of any constructed species trees and gene trees and suggest molecular markers for the phylogeny of certain lineages.

In this study, three plastomes in *Lilium* were newly sequenced (1) to verify Comber's classification and (2) to evaluate all loci of the *Lilium* plastome for phylogenetic molecular markers. Two taxa, *Lilium formosanum* Wallace and *Lilium candidum* L., were sampled to investigate the monophyly of *Leucolirion* and to determine the phylogenetic position of *Liriotypus*. *Lilium leichtlinii* var. *maximowiczii* (Regel) Baker was added because it was found to be closely or distantly related to *Lilium lancifolium* Thunb. in the previous studies [10–13]. The phylogenies of the genus *Lilium* that were built based on different methodologies using plastome sequences were constructed to determine a more accurate species tree. Based on this result, gene trees were compared to the species tree of the genus *Lilium* to evaluate which gene trees were similar to the species tree in a topology, while reflecting the generic relationships in Liliaceae. Additionally, based on the species tree generated in this study, the evolution of genomic characteristics in the genus *Lilium* was also discussed.

#### **2. Results**

#### *2.1. Newly Sequenced Plastomes of Three Lilium Species*

Numbers of reads ranging from 35,028,408 (*L. leichtlinii* var. *maximowiczii*) to 89,224,524 (*L. candidum*) were used for raw data after removing less than 50 bp reads from each dataset for the three species (Table 1). After constructing a complete plastome sequence, the number of mapped reads to complete the plastome sequences ranged from 568,878 (*L. leichtlinii* var. *maximowiczii*) to 1,945,534 (*L. formosanum*). Consequently, the average coverage of each plastome sequence ranged from 520.1 (*L. leichtlinii* var. *maximowiczii*) to 1840.6 (*L. formosanum*). The plastomes of the three *Lilium* species (*L. candidum*, *L. formosanum*, and *L. leichtlinii* var. *maximowiczii*) were 152,101–152,653 bp in length with a large single copy (81,481–82,101 bp), a small single copy (17,524–17,644 bp), and two inverted repeats (26,488–26,514 bp) (Table S1).


**Table 1.** Summary of genome assembly.

<sup>a</sup> Sequence Read Archive.

The overall GC content was 37.0%, which is similar to those of other *Lilium* species. In total, 133 genes were annotated from each plastome with 85 protein-coding genes, 8 rRNA genes, 38 tRNA genes, and 2 partial genes (*rps19* and *ycf1*). *InfA* was a pseudogene in all three plastomes, as well as other *Lilium* species, and *cemA* was pseudogenized in the plastomes of *L. candidum* and *L. leichtlinii* var. *maximowiczii* because of the copy number variation of the poly A-tract.

#### *2.2. Nucleotide Diversity within Genera Lilium and Fritillaria*

The total length of the aligned sequences of *Lilium* and *Fritillaria* plastomes was 159,458 bp. The maximum nucleotide diversities of *Lilium* and *Fritillaria* were 0.030 and 0.041, respectively (Figure 1). In total, the nucleotide diversity was lower in the inverted repeat (IR) region than in the large single copy (LSC) and small single copy (SSC) regions in both genera, as well as being inversely proportional to the GC content. The delta nucleotide diversity between the two genera fluctuated from 0.014 to −0.017. Four loci higher than 0.02 in *Lilium* plastomes and two loci higher than 0.025 in *Fritillaria* plastomes were caused by large deletions in certain species because we deleted the sites with missing data for at least one species. In addition, there were the loci with higher nucleotide diversity, particularly in *Lilium* plastome sequences, which were caused by small palindromic repeats and the IR expansion from IRa to SSC.

**Figure 1.** Nucleotide diversity (π) and GC content throughout the plastome sequence according to sliding window analysis (window size = 600 bp, step size = 200 bp). The red line and blue dashed line refer to π and GC content, respectively. The vertical dashed grey lines refer to the approximate boundaries of the plastome structure.

#### *2.3. Phylogeny of Lilium in Liliaceae*

The 28 phylogenetic trees constructed from the four datasets, four tools, and two models showed that all of the genera in Liliaceae were monophyletic (Figure 2). *Amana* and *Erythronium* in tribe Tulipeae were distinguished from *Cardiocrinum*, *Fritillaria*, and *Lilium* in tribe Lilieae. In tribe Lilieae, *Cardiocrinum* diverged first, and *Lilium* and *Fritillaria* were separated later. In contrast to the generic relationships within the tribe, the infrageneric relationships varied slightly among phylogenetic trees.

In the clades of *Lilium* in 28 phylogenies, different datasets affected the change in topology more than different tools and models; however, the major clades were highly conserved in all phylogenies (Figure S1). The genus was divided into two major lineages, group A and group B, consisting of eastern Asia + Europe species and Hengduan Mountains + North America species, respectively (Figure 3). Group A consisted of three clades and five independent lineages. Clade I comprised three *Martagon* species and *L.* sp. KHK\_2014, except for the phylogeny created by ASTRAL with coding genes. Clade II consisted of *L. amabile*, *L. lancifolium*, and *L. callosum*, which belong to *Sinomartagon*. *L. longiflorum,* and *L. formosanum* of *Leucolirion* and *L. brownii* of *Archelirion* formed clade III. *L. cernuum* of *Sinomartagon* and *L. candidum* of *Liriotypus* were sister groups to *Martagon* and the rest of group A, respectively. The positions of *L. leichtlinii* var. *maximowiczii* were distantly related to *L. lancifolium* and *L. amabile*, which were morphologically close to *L. leichtlinii* var. *maximowiczii*.

Group B consisted of four clades and two independent lineages. Clade IV comprised species in *Sinomartagon* 5c. Clade V consisted of *L. leucanthum* of *Leucolirion* and *L. henryi* of *Sinomartagon* 5a. Clade VI comprised *L. duchartrei* and *L. fargesii* of *Sinomartagon*, and clade VII included *L. washingtonianum*, *L. pardalinum*, and *L. superbum* of *Pseudolirium*. *L. distichum* of *Martagon* and *L. philadelphicum* of *Pseudolirium* were sisters to clade IV and clade IV + V, respectively. Most of the branches, including seven clades in groups A and B, were strongly supported by bootstrap values of the maximum

likelihood analysis, branch support values of ASTRAL, and posterior probabilities of Bayesian inference, but certain branches within clades had moderate support.

A consensus tree of 28 phylogenies was constructed using the 50% majority rule to infer a robust species tree of *Lilium* (Figure 2). Except for two polytomies, one in *Lilium* and another in *Amana*, the seven clades in *Lilium,* as described above, were maintained.

**Figure 2.** Phylogenetic trees based on four tools (ASTRAL, IQ-TREE, RAxML, and MrBayes) using four datasets (genes, introns, intergenic spacers, and all regions) and two different models (partition model and non-partition model of the dataset). The colored line refers to each genus.

**Figure 3.** Consensus tree of 28 phylogenies based on four tools (ASTRAL, IQ-TREE, RAxML, and MrBayes) using four datasets (genes, introns, intergenic spacers, and all regions) and two different models (partition model and non-partition model of the dataset). The colored line refers to each genus. The distribution areas of clades are based on Gao et al. [41] and Xinqi et al. [42].

#### *2.4. Comparing Gene Trees to Species Trees*

Among the 189 total gene trees, only 22 gene trees showed that each genus was monophyletic, even though two trees built using *trnS-trnG* IGS and the *trnV* intron had different generic relationships compared with the other 20 gene trees (Figures S2–S5). Thirty-six clusters of similar trees from the 190 trees (189 gene trees + the consensus tree of 28 species trees) were generated using treespace [43], with five principal components and a cut-off height of 100 (Figure 4). Among them, 21 clusters consisted of more than three trees (Table 2), and the 10th cluster included a consensus tree of the 28 species trees and 21 gene trees. These 21 gene trees were identical to the gene trees wherein each genus was monophyletic, except for the *ycf2* gene tree.


**Table 2.** Information on 21 clustered trees using treespace, with five principal components and 100 cut-off distance.

#### *2.5. IR Expansion and Contraction in Lilium*

Based on the topology of *Lilium* species using Bayesian inference with all loci and partition models, the movements of IR boundaries (LSC-IRb, IRb-SSC, and SSC-IRa) were investigated (Figure 5). The LSC-IRb boundaries of group A were identical to each other. In contrast to group A, IR expansions and contractions were found in group B, excluding *L. distichum*, *L. washingtonianum*, *L. pardalinum*, and *L. superbum*. Two IR-SSC boundaries were more diverse than LSC-IRb, although most IR expansions and contractions occurred in the SSC-IRa boundary. Overall, IR boundaries were highly conserved within clades, except clades IV and VI.

**Figure 4.** Hierarchical cluster analysis for 190 trees, including 189 gene trees and a consensus tree. The horizontal dashed line refers to the cut-off value for cluster trees.

#### *2.6. Insertions*/*Deletions in Lilium*

In total, eight insertions/deletions (indels) longer than 50 bp occurred in over five species that were found throughout the whole plastome sequences. Five of them had distinguishable features between groups A and B in the *Lilium* phylogeny (Figure S5A), but others did not correspond to the phylogenetic relationships (Figure S5B).

**Figure 5.** Inverted repeat (IR) expansion and contraction in *Lilium*. Pale grey refers to bases and colors on the bases represent disagreements with the consensus sequence. The red, blue, brown, and green blocks below bases stand for the large single copy (LSC) region, IRb region, small single copy (SSC) region, and IRa region, respectively. The tree on the left is constructed using Bayesian inference with all loci and partition models.

#### **3. Discussion**

#### *3.1. Evolution of Lilium Plastomes in Liliaceae*

The plastome sequences of *Lilium*, including the three newly determined species in this study, are highly conserved in terms of gene content and order and genomic structure. Although the nucleotide diversities of the LSC and SSC regions were higher than that of the IR region, this is a common phenomenon across the angiosperms [25]. However, the nucleotide diversities of *Lilium* and *Fritillaria* were higher in the LSC and SSC regions but lower in the IR region as compared to that of *Paris* belonging to Melanthiaceae of Liliales [44]. Consequently, the IR regions in Liliaceae appear to be more stable by purifying selection, and the LSC and SSC regions were assumed to have been under relaxed purifying selective pressure compared to those of Melanthiaceae during their evolutionary process. In Liliaceae, the fluctuating delta nucleotide diversity between *Lilium* and *Fritillaria* also supported the idea that different loci of the plastome have undergone different selective regions in this lineage, except the overestimated nucleotide diversities owing to deletions and small inversions. These findings imply that the mutational dynamics with respect to plastome loci between *Lilium* and *Fritillaria* have been processed differently even though they are closely related taxa.

#### *3.2. Verification of Comber's Sectional Classification*

In this study, we reconstructed the phylogeny of 28 *Lilium* species using complete plastomes with four different datasets, four tools, and two models and compared these trees to determine the accurate species tree. There were slightly different topologies, but the major clades were coincident among the trees and were strongly supported by various branch support values (Figure S1). Before the discussion of the phylogeny of *Lilium*, we must evaluate the identification of plastome sequences of *L. distichum* (NC\_029937) and *L.* sp. KHK\_2014 (NC\_027679), which were published by the same research group, because of their unreliable phylogenetic position in the genus. Du et al. [4] first raised a question regarding the phylogenetic position of *L. distichum* (NC\_029937). The *rbcL* and *matK* sequences of *L. distichum* (NC\_029937) were identical to those of *L. speciosum* (*rbcL*: AB034922.1; *matK*: AB030853, AB049526). On the contrary, *atpB*, *rbcL*, and *ndhF* of *L.* sp. KHK\_2014 (NC\_027679) were identical to those of *L. distichum* (*atpB*: KC796843, JX903928, KM085888; *rbcL*: JX903238, JN786059, JN417422, KP711933; *ndhF*: JX903509, KM085762). Consequently, it may be necessary to exclude these two plastomes or to consider *L.* sp. KHK\_2014 as *L. distichum* to avoid improper conclusions for the phylogenomics of *Lilium*.

All phylogenetic trees using different datasets indicated that *Lilium* species could be divided into two major groups. These two groups were also distinguished by the mutational dynamics of the IR expansion/contraction and large indels (Figure 5 and Figure S5A).

#### 3.2.1. The Phylogenetic Position of Martagon

*Martagon* consists of five species that are primarily distributed in northeastern Asia and Russia, except *L. martagon*, which ranges widely from central Europe to eastern Siberia. This section had been considered an early-diverging lineage in *Lilium* based on the morphological characteristics: hypogeal and delayed germination, whorled leaves, jointed scales, and heavy seeds [5,45]. In contrast to the morphological analyses, molecular phylogenetic analyses using ITS sequences showed that this section is a more recently derived lineage and is sister to some of the *Sinomartagon* + *Leucolirion* [10,11,13]. Based on these subgeneric relationships, Gao et al. [41] suggested that the ancestor of *Martagon*, *Sinomartagon*, and *Leucolirion* 6b had a distribution within the Hengduan Mountains before *Martagon* separated from *Sinomartagon* + *Leucolirion* 6b approximately 8.8 million years ago. However, based on the plastome sequences, *Martagon* is not a sister to *Sinomartagon* + *Leucolirion* 6b but forms a clade within *Sinomartagon* I (Figure 5). Additionally, the plastome structures of *Lilium* provide further support. The junctions between SCs and IRs of *Martagon* are identical to those of *Sinomartagon* I, whereas they differ from those of *Sinomartagon* 5c and *Leucolirion* 6b (Figure 5). These results imply that the ancestor of *Martagon* diverged from *Sinomartagon* species, i.e., the ancestor of *L. cernuum* in this study, and the divergence time was more recent than the expectation of Gao et al. [41]. In addition, four species within *Sinomartagon* I and three species within *Martagon* are commonly distributed in eastern Asia. As a result, the hypothesis that the origin of *Martagon* is the Hengduan Mountains [41] is controvertible.

#### 3.2.2. The Polyphyly of Pseudolirium

*Pseudolirium* consists of all American lilies, including *L. philadelphicum*, which is a lectotype of the section [5]. Interestingly, when the data matrix involved *L. philadelphicum*, the phylogeny using ITS [10,11,13] showed the section was monophyletic, but using *matK* [46] revealed polyphyly for the section. Unfortunately, the phylogenetic position of the section using both markers was not resolved or was supported weekly. On the other hand, Kim et al. [47] suggested that *L. philadelphicum* seems to be distinguishable from other species in the section based on phylogenomics using plastome sequences. This relationship is well supported by the 28 species trees constructed in this study. In terms of DNA sequence mutations, a two base deletion in ITS was found specifically in *Pseudolirium* species, except for *L. philadelphicum* [10], and there was an obvious distinction between the IR-SC junctions of *L. philadelphicum* and other *Pseudolirium* species (Figure 5). Morphologically, subsection 2d, consisting of *L. philadelphicum* and *L. catesbaei* in *Pseudolirium*, is distinguished from other species by erect flowers and highly clawed perianth parts [3]. Consequently, a new circumscription of *Pseudolirium* should be considered to reflect the recent phylogenetic results.

#### 3.2.3. The Polyphyly of Leucolirion

*Leucolirion* consists of eight species with scattered and sessile leaves and trumpet flowers [1]. The section is subdivided into two subsections based on bulb color: dark purple or brown for 6a and white for 6b [1]. In this study, the two subsections were distantly separated with strong support and this result was congruent with the previous phylogenetic studies [13,46]. In addition, *L. henryi* of *Sinomartagon* 5a and *L. brownii* of *Archelirion* formed a robust clade with *Leucolirion* 6a and 6b, respectively (Figure S1, Figure 5). Based on the phylogenetic and cytological studies, Du et al. [13] suggested that *L. henryi* and *L. brownii* should be classified into *Leucolirion* 6a and 6b, respectively. Consequently, our results provide further support for the modification of *Leucolirion* according to Du et al. [13].

#### 3.2.4. The Position of Liriotypus in the Genus *Lilium*

*Liriotypus* comprises 20 species, including all European, Turkish, and Caucasian species, with the exception of *Lilium martagon* [5,48]. Among them, *L. bulbiferum*, having upright flowers, has been distinguished from the rest of the species within the section and forms a clade with the *Sinomartagon* species, including *L. dauricum* of *Daurolirion* based on the molecular phylogenetic analyses [13,48]. In this study, *L. bulbiferum* was placed far away from *L. candidum*, which is a lectotype of the section [5], agreeing with the results of previous studies. However, it forms a clade with *Martagon* + *Sinomartagon* I with strong support, although there are sampling gaps that prevent a concrete conclusion (Figure S1). Therefore, increased taxa sampling, particularly the members of *Daurolirion*, will help to resolve the discordance of the phylogenetic position of *L. bulbiferum* between this study and previous studies and offer the correct phylogenetic position for this species.

On the other hand, the phylogenetic position of *Liriotypus,* except *L. bulbiferum,* was irregular within the genus, although they formed a clade with the *Sinomartagon* 5c species-*Nomocharis* clade in previous studies [11,13]. On the contrary, *L. candidum* was an early-diverging taxon in group A, without alternative relationships with the rest of group A species in this study (Figure 3, Figure S1). Consequently, the newly suggested phylogenetic position of *Liriotypus* in this study is incongruent with that of previous phylogenetic studies using a few molecular markers. One possible explanation for this discordance is that the position of *L. candidum* does not represent *Liriotypus* in spite of its taxonomic importance by lectotype in the subsection. This species differs from other *Liriotypus* species based on rosette basal leaves and widely trumpet-shaped flowers [48,49], and the geographic circumscription of the sections in *Lilium* by Comber [5] collided with the molecular phylogeny of this study, i.e., *Pseudolirium* and *Liriotypus*. Another scenario is that the origin of *Liriotypus* came from early-diverging *Sinomartagon* (see additional details in the next section) and directly moved to Europe via the Caucasus during evolution.

In spite of low taxon sampling in *Liriotypus*, our result strongly supports the previous conclusions in which *L. bulbiferum* is separated from the rest of the species [48], and it suggests a new phylogenetic position of the section within the robust phylogenetic trees. However, because there still remains uncertainty as to whether the position of *L. candidum* belongs to the *Liriotypus* based on its unique morphological characteristics in the section, more sampling in the section will be needed to solve its position.

#### 3.2.5. Biogeographic History of the Genus *Lilium*

The geographical origin of the tribe Lilieae has been speculated to be the Himalayas + Hengduan Mountains and multiple intercontinental dispersal events for Lilieae were suggested, as well as *Lilium* [41,50]. In this study, we also found long distant dispersals and simpler than the previous speculation. Gao et al. [41] suggested a long-distance dispersal model of *Lilium* based on ITS and *matK* phylogenies, in which there were movements among four regions: Hengduan Mountains, eastern Asia, Europe, including Caucasus, and Northern America. However, from the results in this study, it was suggested that there were two main long dispersals in eastern Asia—Europe in group A and Hengduan Mountains—North America in group B (Figure 3).

To explain these distributions, it was supposed that *Lilium* comprised "*Sinomartagon* and its derivatives." *Sinomartagon* comprises approximately 30 species [11] with epigeal and immediate germination (except *L. henryi*), scattered leaves, an entire bulb, and Turk's cap flowers [5], and it has a distribution from the Hengduan Mountains to eastern Asia. In molecular phylogenetic trees, including those in this study, this section has polyphyly regardless of the types of marker or taxon samplings [12,13,41,46]. Therefore, if the manner for distant dispersals was paved during glacial periods or pollinators traveled a great distance at the beginning of the *Lilium* diversification, certain populations of different species in the *Sinomartagon* could have moved together. Therefore, the adaptations to new circumstances may have led to new populations that morphologically converged. Consequently, populations within new circumstances had similar morphological characteristics and belonged to the same section by morphological classification. This may cause the discordance between classifications based on morphological characteristics and molecular phylogeny. Regarding the basal lineages of group A and B, the "*Sinomartagon* and its derivatives" hypothesis is unclear because of our low taxon sampling or extinction of ancient *Sinomartagon* species. This hypothesis may further confuse the evolutionary history of the genus because (1) nobody has placed *Sinomartagon* as a basal lineage in the *Lilium* based on the morphological characteristics, and (2) most phylogenies of *Lilium* constructed using ITS are not consistent with the present results. In addition, inheritance from single parents, such as a plastome, makes it difficult to detect hybridization events. However, the phylogenetic tree generated in this study was the first robust phylogenetic tree with more than 20 samples. There is no doubt that the most important action for the discussion of phylogenetic relationships or divergence times of certain lineage is the construction of accurate species trees without polytomy and poor support. Therefore, we cannot rule out this "*Sinomartagon* and its derivatives" hypothesis based on the well-resolved phylogenetic tree.

#### *3.3. Molecular Markers for the Phylogeny of Lilium and its Relatives*

ITS has been used as a valuable molecular marker for the phylogeny of *Lilium* [10,11,13,41] but the phylogenetic relationships among sections or species were incongruent or were weakly supported. To overcome this problem, we constructed the phylogeny of *Lilium* using whole plastome sequences in this study. However, the production and manipulation of NGS data also require significant time and cost, as well as a higher-level technique than the Sanger sequencing method.

To evaluate which gene trees were similar to the species tree in terms of topology, while reflecting the generic relationship in Liliaceae, 189 gene trees were constructed, and only 20 were found to be consistent with the results presented in previous phylogenetic studies [51,52]. Among them, the *ycf1*, *trnF-ndhJ*, and *trnT-psbD* regions were coincident with highly variable regions of the *Lilium* plastome, as suggested by Du et al. [4].

This could serve as an alternative choice of NGS-based phylogenomics in *Lilium* when we use insufficient conditioned samples, such as very low concentration, fragmentation, or extraction from very old specimens that are difficult for use in the preparation of an NGS library.

#### **4. Materials and Methods**

#### *4.1. Plant Materials and DNA Extraction*

The bulbs of *L. leichtlinii* var. *maximowiczii* (Wooriseed, Korea) and the seeds of *L. formosanum* (Wageningen University, Netherlands) and *L. candidum* (Royal horticultural society Lily group, UK) were germinated on media at Kyungpook National University of Korea. Genomic DNA was extracted from young fresh leaves using a DNeasy plant mini kit (Qiagen, Valencia, CA, USA), following the manufacturer's protocol.

#### *4.2. Sequencing, Assembly, and Annotation*

Genomic DNA was sequenced using the HiSeq 2500 instrument (Illumina, San Diego, CA, USA). The following assembling procedures were implemented using Geneious 10.2.5 [53]. Both ends of raw reads were trimmed with more than a 1% chance of an error per base. Reads exceeding 50 bp in length were extracted and used as raw reads after this step. Raw reads were mapped to the plastome sequence of *Lilium pardalinum* [47] with medium-low sensitivity. Reads were aligned to the reference then *de novo* assembled with zero mismatches and gaps among the reads to generate contigs. Raw reads were realigned to the contigs with zero mismatches and gaps among the reads for up to 100 iterations. The generated contigs were concatenated into a circular form using *de novo* assembled circularizing contigs with matching ends. Finally, the raw reads were mapped to the complete plastome sequence with zero mismatches and gaps among the reads to verify the coverage depths through the genome, because of the fact that many plastome-like sequences distributed in the mitochondrial genome and nuclear genome have relatively low coverage depths compared to that of the plastome.

All of the genes in the three plastome sequences were annotated and compared with those of *L. pardalinum* using Geneious annotation with 90% similarity, then re-checked separately using BLASTP [54] and tRNAscan-SE [55].

#### *4.3. Sequence Diversity and GC Content Analyses*

Twenty *Fritillaria* and 28 *Lilium* plastome sequences were aligned by MAFFT [56], then the AT-rich regions were realigned by MUSCLE [57] to increase the alignment accuracy at these regions. The alignment sequences were loaded in R ver. 3.5.1 [58], and the nucleotide diversities of the two genera were analyzed using sliding window analysis (window size = 600 bp, step size = 200 bp) by deleting the sites including at least one missing data point for all sequences. In addition, the GC content was also calculated to compare the relationship between nucleotide diversity and GC content.

#### *4.4. Phylogenetic Analysis*

Fifty-five plastome sequences from the genera *Lilium*, *Fritillaria*, *Cardiocrinum*, *Amana*, and *Erythronium* were downloaded from GenBank to construct a phylogeny for *Lilium*. We extracted 78 coding sequences (CDSs), 90 intergenic spaces (IGSs) longer than 100 bp, and 21 intron regions (two regions from *clpP*, *trnK*, and *ycf3*) from each plastome (Table S2).

All of the extracted sequences were aligned using MAFFT [55] according to the region (Table S3) and merged into four datasets as follows: All\_loci (including CDSs, IGSs, and introns), CDS\_loci, IGS\_loci, and Intron\_loci. The models for each partition for the four datasets were estimated using PartitionFinder 2 [59], ModelFinder [60], or Jmodeltest 2 [61] for different phylogenetic analyses. The phylogenetic trees were constructed using the RAxML Black Box with 1000 bootstraps [62] in the CIPRES gateway [63], MrBayes with ngen = 10,000,000, samplefreq = 1000, and burninfrac = 0.25 [64], or IQ-TREE with 1000 bootstraps [65]. In total, 189 gene trees were constructed using IQ-TREE with ModelFinder, and these were used to estimate the species trees by ASTRAL [66–68]. All of the details for phylogenies are summarized in Table S2. A consensus tree of 28 phylogenies was constructed by majority rule.

#### *4.5. Comparison of Gene Trees*

The generic relationships of each gene tree were compared to those presented in previous phylogenetic studies of Liliaceae [51,52]. Clusters of similar gene trees were identified using the Kendall Colijn metric [69] in treespace [43] with five principal components and a cut-off distance of 100. Phylogeny using selected markers was constructed by IQ-TREE, with 1000 bootstraps for comparison to the consensus tree constructed by 28 phylogenies.

#### *4.6. R Packages for Manipulation of Phylogenies*

APE [70], Biostrings [71], dplyr [72], ggplot2 [73], ggtree [74,75], gridExtra [76], pegas [77], phytools [78], tidytree [79], and treespace [43] were used in this study.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2223-7747/8/12/547/s1, Figure S1: A clade of *Lilium* species extracted from 28 species trees, Figure S2: Gene trees constructed using 78 coding regions. Coloured line refers to each genus, Figure S3: Gene trees constructed using 90 intergenic spacers. Coloured line refers to each genus, Figure S4: Gene trees were constructed using 21 introns. Coloured line refers to each genus, Figure S5: Large indertions/deletions occurring at least five species with longer than 50 bp. Red line distinguishes between two groups, Table S1: Summary of three plastome sequences, Table S2: List of estimated species trees according to different options, Table S3: 189 loci used for phylogenetic analyses in this paper.

**Author Contributions:** Performed research design, writing of the manuscript, lab work, and data analysis, H.T.K.; collected samples and edited the manuscript, K.-B.L.; contributed to the writing and editing of the manuscript, J.S.K.

**Funding:** This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Project No. NRF-2017R1D1A1A02018573/ 2016R1D1A1B04932913).

**Acknowledgments:** We thank Ji-Hyang Park for growing the plants from seeds and bulbs.

**Conflicts of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
