*2.2. Repeat Sequence Analysis*

We identified a total of 1798 simple sequence repeat (SSR) loci across the 24 Salicaceae *s.l.* plastids (Figure 2a, Table S1). Of these, 1455 were mononucleotide repeats accounting for about 80.92% of the total SSRs, while 180 (10.01%), 98 (5.45%), 41 (2.28%), 18 (1.00%), and six (0.33%) were tetra-, di-, tri-, penta-, and hex-nucleotides repeats, respectively. The penta- and hex- nucleotides repeats were very rare across the plastid genomes. Most (74.36%) SSR loci were located in the intergenic regions, whereas, 9.07% were in intron and 16.57% were in the protein-coding regions.



*Int. J. Mol. Sci.* **2019**, *20*, 3788

**Figure 2.** Comparison of repeat sequences among 24 plastomes. The *X*-axis represents the species and the *Y*-axis represents the number of repeats. (**a**) Frequency of selected motifs of simple sequence repeats (SSRs) and (**b**) number of each repeat type: tandem repeats, forward repeats, palindromic repeats, reverse repeats, and complement repeats.

We also identified 5 other types of repeats, including 1750 tandems, 434 forward, 407 palindromic, 86 reverse, and 30 complementary repeats (Figure 2b, Table S2, Table S3). Among these, tandem repeats are the most frequent type of repeats (64.65%), while complementary repeats are the least. We found that the tandem repeat sequences were mainly located in non-coding regions, whereas, only a few were located in the coding regions (*ycf2*, *rpl22*, *rpl14*, *rpoC2*, *ycf1*, *ndhD*, *ndhG*, *psbL*, *ndhF*, *rpoA*, *petD*, and *ccsA*). Most of the five types of repeats were concentrated in the intergenic regions (Table S3).

## *2.3. Inverted Repeats and Genome Comparison*

Next, we conducted whole genome alignment using the program mVISTA (*Populus euphratica* as reference), and the results showed that both the content and order of the genes were highly conserved in the Salicaceae *s.l.* plastids (Figure S1). The IRa/SSC boundary positions for all species were located in the *ycf1* gene, while the border genes of IRa and LSC were *Rps19* and *trnH-GUG*, respectively. Only slight variations of the border structure were identified across these plastid genomes. For example, the IRb/LSC junction was located within the *rpl22* gene in all but six species (*Scolopia chinensis*, *Homalium racemosum*, *Homalium cochinchinense*, *Prockia crucis*, *Casearia velutina*, and *Casearia decandra Jacq*). The *ndhF* gene was located entirely in the SSC region of 16 species, while in the other eight species it extended into the IRb region (Figure 3).


**Figure 3.** Comparison of the border positions of large single copy (LSC), small single copy (SSC), and inverted repeat (IR) region borders among plastid genomes of 24 Salicaceae *s.l.* species. Complete genes and portions of genes adjacent to the junctions are depicted by differently colored blocks.

#### *2.4. Divergence Hotspots of Plastid Genomes Int. J. Mol. Sci.* **2019**, *20*, x 9 of 18

To evaluate the level of sequence divergence, we calculated the percentages of variation using a sliding window approach (Figure 4). Across the 24 species, the values of nucleotide variability ranged from 0 to 0.139, with an average of 0.023, suggesting a high conservation of plastid genomes within Salicaceae *s.l.* Six relatively high variable regions (divergence hotspots) were identified, which comprised one gene region (*ycf1* + *ndhF*) and the following five intergenic regions: *matK-trnQ-UUG*, *trnS-GCU-trnG-UCC*, *psbZ-trnG-GCC*, and *ndhF-trnL-UAG, psbE-petL*. *2.4. Divergence Hotspots of Plastid Genomes*  To evaluate the level of sequence divergence, we calculated the percentages of variation using a sliding window approach (Figure 4). Across the 24 species, the values of nucleotide variability ranged from 0 to 0.139, with an average of 0.023, suggesting a high conservation of plastid genomes within Salicaceae *s.l.* Six relatively high variable regions (divergence hotspots) were identified, which

**Figure 4***.* Comparison of the nucleotide variability (Pi) values of the plastomes. *X*-axis: position of the midpoint of a window, *Y*-axis: nucleotide diversity of each window (window length: 600 bp and step size: 200 bp). **Figure 4.** Comparison of the nucleotide variability (Pi) values of the plastomes. *X*-axis: position of the midpoint of a window, *Y*-axis: nucleotide diversity of each window (window length: 600 bp and step size: 200 bp).

#### *2.5. Phylogenetic Analysis 2.5. Phylogenetic Analysis*

To obtain an accurate phylogenetic relationship of Salicaceae *s.l.* species, we performed multiple sequence alignments of these 24 complete plastid genomes, including an additional two species from *Passiflora* as the outgroup. The final concatenated dataset, which included 63 protein-coding genes (Table S4) and 51,780 nucleotides, after trimming poorly aligned regions, produced a highly supported topology based on the maximum likelihood (ML) strategy (Figure 5). Two subfamilies in the phylogenetic tree, Samydoideae and Salicoideae, formed a highly supported monophyletic. *Casearia* (*Casearia decandra Jacq* and *Casearia velutina*) was the only genus from Samydoideae in this study and was identified as the basal clade. The remaining species belonged to Salicoideae and were divided into two main clades. Furthermore, we estimated divergence times from the plastome dataset using an approximate likelihood method. The Salicaceae *s.l.* was estimated to diverge from the outgroup around 93 Mya. To obtain an accurate phylogenetic relationship of Salicaceae *s.l.* species, we performed multiple sequence alignments of these 24 complete plastid genomes, including an additional two species from *Passiflora* as the outgroup. The final concatenated dataset, which included 63 protein-coding genes (Table S4) and 51,780 nucleotides, after trimming poorly aligned regions, produced a highly supported topology based on the maximum likelihood (ML) strategy (Figure 5). Two subfamilies in the phylogenetic tree, Samydoideae and Salicoideae, formed a highly supported monophyletic. *Casearia* (*Casearia decandra Jacq* and *Casearia velutina*) was the only genus from Samydoideae in this study and was identified as the basal clade. The remaining species belonged to Salicoideae and were divided into two main clades. Furthermore, we estimated divergence times from the plastome dataset using an approximate likelihood method. The Salicaceae *s.l.* was estimated to diverge from the outgroup around 93 Mya.

**Figure 5.** Phylogenetic tree of the 26 species based on 63 protein-coding genes using maximum likelihood (ML). Bootstrap values (1000 replications) between different genera are shown at the nodes. The branches are drawn in proportion to the posterior means of divergence times estimated under the GTR + I + G model with an independent relaxed clock and birth–death sampling, and the outgroup divergence constraint of 87 to 97 Mya.

#### **3. Discussion**

## *3.1. Features of Plastid Genomes*

We used the information from complete plastid genome sequencing to research and analyze the plastomes of Salicaceae *s.l*. species. Generally, the plastids of angiosperms are considered to be highly conserved, have a typical tetragonal structure, usually contain about 110–130 unique genes, and the genome size, GC composition, as well as gene and intron content are identical in most angiosperms [32–35]. The shift of IR-SC boundaries was a common evolutionary event and played an important role in plastome size variation [36]. The IR regions of the Salicaceae *s.l.* species varied in size from 27,168 bp to 27,926 bp and the total length ranged from 155,144 bp in *Flacourtia ramontchii* to 158,605 bp in *Bennettiodendron brevipes*, as a result of the expansion and contraction of the IR borders.

Previous studies have revealed that many chloroplast genes, such as *infA*, *rpl22*, *rps19*, *rpl2* intron and *rpl23*, are transferred to the nucleus or lost [37–39]. In our study, only *infA* was not found in the plastid genome of Salicaceae *s.l. cemA* and contained premature termination codons in *Casearia decandra Jacq* and *Casearia velutina*, whereas, *atpF* was a pseudogene in *Olmediella betschleriana*. Gene duplication caused by IR is common in plastomes and is believed to be an important driving force in the evolution of genomes, leading to the creation of new genes and new gene functions [40]. Gene duplication has been previously reported in multiple angiosperm lineages and most of them are tRNAs [41–46]. Duplication of protein-coding genes outside of the IR is rare. In our results, seven tRNA genes and eight protein-coding genes were duplicated in the inverted-repeat region.

## *3.2. Repeat Sequence Variations*

SSRs are a type one to seven nucleotide unit tandem repeat sequence frequently observed in plastid genome and the changes in copy number are usually polymorphic [47–50]. The SSRs are considered effective molecular markers for plant species identification and population genetics because they exhibit codominant inheritance, high repeatability, and high variation within the same species [47–50]. In this study, the mononucleotide repeats were the most common, of which 822 were T type accounting for about 57%, while 606 (41%), 16 (1%), 11 (0.7%), were A, C, and G type, respectively. Most of the SSR loci were located in the non-coding region, and only 16% of the SSR loci were found in the gene-coding region. These SSRs were located in 39 coding genes, and the genes with the highest SSR frequency are *ycf1*, *rpoc2*, and *ndhF*, which are 95, 48, and 26 times, respectively (Table S1). The plastome size variation was previously reported to contribute to tandem [36,51] and dispersed repeats [41,52–54]. We analyzed the direct repeats, inverted repeats, reverse repeats, complementary repeats, and tandem repeats in 24 Salicaceae *s.l.* species. The analysis showed that the number of tandem repeats is more than the other repeats, while complementary repeats are the least common in these species. Most of the repeats are distributed in the intergenic and intron regions, similar to those reported in other angiosperm lineages [55]. These repeat sequences and plastid SSRs provided molecular markers for studying the genetic diversity, population structure, and phylogenetics of Salicaceae *s.l.*

#### *3.3. Comparative Analyses*

Comparing chloroplast genomics not only provides insight into chloroplast evolution patterns [56], but also contributes to phylogenetic studies to understand the evolutionary relationships among taxa and apply them to species breeding and conservation [57]. We compared the 24 complete plastid genomes of 18 genera in Salicaceae *s.l.* The species in this family have a conserved plastid genome that lack major structural variations. The main cause of genomic length variation in higher plant plastids is the change in the position of the boundary between IR and SSC/LSC [58]. In this study, the location of the boundary and length of the IR regions showed that variation among the 24 species and the plastid genomes differed slightly in size, with *Bennettiodendron leprosipes*, *Bennettiodendron brevipes*, *Olmediella betschleriana*, *Carrierea calycina*, and *Ito orientalis* larger than other species, which may be related to their IR expansion. IR expansion/contraction also represents a highly variable region, which can be used to study the phylogenetic classification of plants and achieve molecular improvement of plant phenotypes. In addition, the position of the *trnH* gene, which was found in the LSC region of all 24 plastid genomes, was conserved. This result was consistent with the observation that the *trnH* gene is usually located in the IR region in monocots but is found within the LSC region in dicots.

## *3.4. Divergence Hotspot Analysis*

The whole aligned sequences revealed relatively low divergence, however, six (*matK-trnQ-UUG*, *trnS-GCU-trnG-UCC*, *psbZ-trnG-GCC*, *ndhF-trnL-UAG*, *psbE-petL*, and *ycf1*+*ndhF*) displayed high variation. Further work is still necessary to determine whether these six variable loci could be used in phylogenetic analyses of related Salicaceae *s.l.* species or as potential molecular markers for population genetics and phylogenetics.

## *3.5. Phylogenetic Relationships*

The deep relationships of mimosoids were poorly resolved by phylogenetic studies applying a few plastid markers [59,60]. Plastid phylogenomics has been proven to be efficient to resolve difficult relationships at the family level such as Orchidaceae [61] and the lower taxonomic level such as subfamilies Bambusoideae [62] and Chloridoideae of Poaceae. For the phylogenetic relationship between the species of Salicaceae, earlier studies focused on Salicaceae *s.s.* Preliminary phylogenetic

frameworks for *Salix* and *Populus* have previously been provided [23–26]. In addition, some studies of angiosperm and Malpighiales also involve the phylogenetic relationships of several genera of Salicaceae *s.l.* [63–66]. However, phylogenetic analyses focusing on Salicaceae *s.l.* have been limited to the use of one or a few genes obtained from plastid or nuclear genomes [2,67]. In this study, we obtained phylogenetic relationships of 18 genera in Salicaceae *s.l.* using 63 plastid genes from 26 species. According to the Angiosperm Phylogeny Group's most recent phylogeny, APG IV [2,68], the phylogenetic tree is divided into Salicoideae and Samydoideae. *Casearia* from Samydoideae was identified as the basal clade. The genera in one of the two main clades of Salicoideae included *Populus*, *Salix*, *Bennettiodendron*, *Idesia*, *Olmediella*, *Carrierea*, *Itoa*, and *Poliothyrsis*, all belonging to Saliceae. The other clade was divided into two subclades, one subclade included five genera, *Flacourtia*, *Xylosma*, and *Dovyalis* belonging to the Saliceae, *Scolopia* belonging to Scolopieae, and *Homolium* belonging to Homalieae. The other subclade consisted of four species of three tribes, of which *Banara* and *Prockia* belong to Prockieae, *Abatia* and *Azara* belong to Abatieae and Saliceae, respectively. The Saliceae with the most species is not monophyletic.

We further estimated divergence timescales of the major clades within the Salicaceae *s.l.* according to the calibrations of the species tree constructed on the basis of 63 protein-coding genes. The split between Salicaceae *s.l.* and the outgroup was estimated as ~93 Mya, and the basal *Casearia* was estimated to diverge from other clades around 87 Mya. The clade of Salicoideae is estimated to have originated around 61 Mya. *Salix* and *Populus* diverged around 34 Mya, consistent with the previously reported [26].

#### **4. Materials and Methods**

#### *4.1. Sampling and Genome Sequencing*

Fresh leaves and silica-gel dried materials were sampled from 20 species representing 14 genera of the family Salicaceae *s.l*. The voucher specimens for the ten fresh sampled plants collected from XiShuangBanNa Tropical Botanical Garden (Mengla, China) were deposited at the Key Laboratory of Bio-Resource and Eco-Environment of the Ministry of Education (Chengdu, Sichuan, China). The ten silica-gel-dried materials were obtained from Harvard University Herbaria and the Arnold Arboretum of Harvard University. For each species, we used the modified CTAB method [69] to extract the total genomic DNA from dry leaves. About 5 ug purified DNA was used to construct Illumina paired-end libraries with an insert size of 500 bp and sequenced using the HiSeq X Ten System by Novogene (Beijing, China). At least two gigabases (Gb) of 2 × 150 bp short reads data were generated for each sample. Reads with a Phred quality score <7 and more than 10% ambiguous nucleotides were filtered.

#### *4.2. Plastid Genome Assembly and Annotation*

The filtered reads were assembled by NOVOPlasty v2.7.1 [70], and we used the complete plastid genome *Itoa orientalis* and *Flacourtia indica* (NC\_037411.1 and NC\_037410.1) as the reference, and then used Geneious v11.1.5 [71] to correct the assemble. Then, we annotated the assembled plastid genome in Plann v1.1 [72]. The positions of starts, stops, introns, and exons were manually adjusted using Sequin v15.50. In addition, the physical map of the plastid genome was generated using OGDRAW v1.2 [73]. The complete plastid genome together with gene annotations was submitted to the GenBank. The accession numbers are shown in Table 1.

#### *4.3. Repeat Sequence Analysis*

The SSRs were evaluated using the online software MISA (https://webblast.ipk-gatersleben. de/misa/) [74] with thresholds: 10, 5, 4, 3, 3, 3 ten repeat units for mononucleotide SSRs, five repeat units for dinucleotide SSRs, four repeat units for trinucleotide SSRs and three repeat units for tetra-, penta-, and hexanucleotide SSRs. We used the web-based analysis tool REPuter (https: //bibiserv.cebitec.uni-bielefeld.de/reputer) [75] to detect the repeat sequences, including reverse, forward

(direct), complement, and palindromic (inverted), with a minimal repeat size of 30 bp and Hamming distance less than or equal to 3 (90% or greater sequence identity). Tandem Repeats Finder v4.09 (http://tandem.bu.edu/trf/trf.submit.options.html) [76] was used to determine the trandem repeats, alignment parameters match, mismatch. Indel were set as 2, 7, and 7.

## *4.4. Genome Comparative Analysis*

Furter, mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml) [77], a web-based program, was used to compare similarities and detect any rearrangements or inversions among different plastid genomes. We used this software (*Populus euphratica* as reference) to discover any significant interspecific and intergeneric variations among plastid genome sequences of Salicaceae *s.l.* Additionally, the IR expansion/contraction regions were compared among the 24 Salicaceae *s.l.* species (20 newly generated plastid genomes in this study and 4 previous published plastid genomes: *Populus euphratica*, *Salix interior*, *Idesia polycarpa Maxim*, and *Poliothyrsis sinensis*).
