*Article* **Analysis of** *Centranthera grandiflora* **Benth Transcriptome Explores Genes of Catalpol, Acteoside and Azafrin Biosynthesis**

**Xiaodong Zhang 1,2,**† **, Caixia Li 1,2,**† **, Lianchun Wang <sup>1</sup> , Yahong Fei <sup>3</sup> and Wensheng Qin 4,\***


Received: 14 September 2019; Accepted: 27 November 2019; Published: 29 November 2019

**Abstract:** Cardiovascular diseases (CVDs) are a major cause of health loss in the world. Prevention and treatment of this disease by traditional Chinese medicine is a promising method. *Centranthera grandiflora* Benth is a high-value medicinal herb in the prevention and treatment of CVDs; its main medicinal components include iridoid glycosides, phenylethanoid glycosides, and azafrin in roots. However, biosynthetic pathways of these components and their regulatory mechanisms are unknown. Furthermore, there are no genomic resources of this herb. In this article, we provide sequence and transcript abundance data for the root, stem, and leaf transcriptome of *C. grandiflora* Benth obtained by the Illumina Hiseq2000. More than 438 million clean reads were obtained from root, stem, and leaf libraries, which produced 153,198 unigenes. Based on databases annotation, a total of 557, 213, and 161 unigenes were annotated to catalpol, acteoside, and azafrin biosynthetic pathways, respectively. Differentially expressed gene analysis identified 14,875 unigenes differentially enriched between leaf and root with 8,054 upregulated genes and 6,821 downregulated genes. Candidate MYB transcription factors involved in catalpol, acteoside, and azafrin biosynthesis were also predicated. This work is the first transcriptome analysis in *C. grandiflora* Benth which will aid the deciphering of biosynthesis pathways and regulatory mechanisms of active components.

**Keywords:** *Centranthera grandiflora* Benth; transcriptome; catalpol biosynthesis; acteoside biosynthesis; azafrin biosynthesis

#### **1. Introduction**

Cardiovascular diseases (CVDs) are an important reason for death in the world which hinder sustainable development of human beings [1]. In China, CVDs were also the leading cause of death due to lifestyle changes, urbanization, and the accelerated process of aging, and the figures have exceeded 42% of all deaths in both rural and urban regions, which was much higher than deaths caused by cancer or any other diseases in 2014 [2]. Traditional Chinese medicine has been used for more than 2000 years and has displayed the explicit role in preventing and treating CVDs, although the detailed pharmacological mechanisms have seldomly been clarified [3]. *Centranthera grandiflora* Benth, also known as broad bean *Ganoderma lucidum*, wild broad bean root, Huaxuedan, Golden Cat's Head, and Xiaohongyao, is a medicinal plant widely used for preventing and treating CVDs among Miao Nationality of Yunnan in China. In taxonomy, it belongs to the Centranthera, Scrophulariaceae family. Distinguished as a rare and endangered medicinal plant, *C. grandiflora* Benth usually grows well with *Cyperus rotundus* and is mainly distributed in Yunnan, Guizhou, and Guangxi in China as well as parts of India, Myanmar, and Vietnam [4–7]. Its roots possess many functions, such as to promote blood circulation, to regulate menstruation, to dispel blood stasis, and to relieve pain, and has known coagulation, antibacterial, and anticancer properties [6,8–10]. Therefore, it is mainly used to treat amenorrhea, dysmenorrhea, metrorrhagia, fall-related injuries, rheumatic bone pain, traumatic hemorrhage, and cardiovascular and cerebrovascular diseases [6,8–10].

So far, studies on this herb have mainly focused on the isolation and identification of its chemical constituents and pharmacological effects, while the discovery of genes related to biosynthesis of active secondary metabolites has not been reported. Azafrin and D-mannitol were first isolated and identified from the roots of *C. grandiflora* Benth in 1984 [8]. Then, aeginetin and azalea were isolated from the roots of *C. grandiflora* Benth, and their coagulation, antimicrobial, and anticancer functions were verified in 2012 [10]. In the same year, nine iridoid glycosides including aucubin, mussaenoside, 8-epiloganin, 8-epiloganic acid, mussaenosidic acid, catalpol, gardoside methyl ester, geniposidic acid, and 6-O-methylaucubin were isolated from roots of *C. grandiflora* Benth [6]. In 2014, another 17 compounds, including six new ones: centrantheroside A to E and neomelasmoside; phenylethanoid glycosides: plantainoside A, calceolarioside A, acteoside, and isoacteoside; monoterpenoid glycosides: melasmoside and rehmaionoside C; Di-O-methylcrenatin; azafrin; β-sitosterol; mannitol; and β-daucosterol were isolated from *C. grandiflora* Benth roots [5,7]. Studies have shown that iridoid glycosides, phenylethanoid glycosides, and azafrin are the main substance bases for their pharmacodynamics [5,7]. In 2017, tissue culture of *C. grandiflora* Benth was also successfully developed [11].

At present, *C. grandiflora* Benth roots sold in public markets are mainly collected from wild resources, while its artificial cultivation has just started [5]. So far, the cost of annual *C. grandiflora* Benth planting is about \$0.13 million per hectare and the worth of annual yield is about \$0.64 million per hectare [5]. Therefore, to explore the biosynthetic pathways and regulatory mechanisms of the main active ingredients of *C. grandiflora* Benth will lay a scientific foundation for breeding new varieties of this herb and for producing its medicinal chemical constituents by synthetic biology.

Iridoid glycosides belong to monoterpenoids, and their biosynthesis in plants can be divided into three stages. The first stage is precursor formation, which includes the plastidial 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway and the cytoplasmic mevalonate (MVA) pathway to produce isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP) [12,13]. The second stage is the formation of a carbon skeleton structure [13–16]. The third stage is the post-modification of terpenoids: hydroxylation, methylation, isomerization, demethylation, glycosylation, etc. [16]. So far, most of the biosynthesis pathways of iridoid glycosides remain unclear. However, the complete catalpol biosynthetic pathway was first elicited in *Picrorhiza kurroa* [17], and it was partially decoded in *Rehmannia glutinosa* [18]. In *P. kurroa*, the catalpol biosynthetic pathway contains 29 steps including 14 steps for the MEP and MVA pathways and 15 steps for the iridoid pathway [17]. As the MEP and MVA pathways has been widely and intensively studied and they are conserved in plants [19], here, we mainly focused on the iridoid pathway. So far, two iridoid pathways including secoiridoid pathway (Route I) and decarboxylated iridoid pathway (Route II) have been reported, and the early enzymatic steps containing geranyl diphosphate synthase (GPPS), geraniol synthase (GES), geraniol 10-hydroxylase (G10H), 8-hydroxygeraniol oxidoreductase (8HGO), iridoid synthase (IS), iridoid oxidase (IO), and UDP-glucosyltransferase (UGT) are common to both pathways [20], has been verified in *Catharanthus roseus* and *P. kurroa* [14,21,22], and proposed in *Gardenia jasminoides* [23,24]. The remaining steps were first deduced by chemical intermediates [20], and then the corresponding enzymes were predicted and discovered by transcriptome analysis [17]. In *P. kurroa*, another seven enzymes containing aldehyde dehydrogenase (ALD), flavanone 3-dioxygenase/hydoxylase (F3D), 2-hydroxyisoflavanone dehydratase (2FHD), deacetoxycephalosporin-C hydroxylase (DCH), uroporphyrinogen decarboxylase/UDP-glucuronic acid decarboxylase (UPD/UGD), and squalene

monooxygenase (SQM) have been proposed to catalyze the remaining seven steps in catalpol biosynthesis [17].

Acteoside, belonging to phenylethanoid glycosides, is composed of two parts: caffeoyl CoA and hydroxytyrosol glucoside [25]. Feeding and inhibition experiments showed that hydroxytyrosol glucoside moiety is derived from tyrosine while caffeoyl CoA moiety is derived from phenylalanine via the cinnamate pathway and that both tyrosine and phenylalanine come from the shikimate pathway [26,27]. In *Ole europae* and *R*. *glutinosa*, phenylalanine is converted into caffeoyl CoA via four enzymes including phenylalanine ammonia-lyase (PAL), cinnamate-4-hydroxylase (C4H), coumarate-3-hydroxylase (C3H), and 4-coumarate-CoA ligase (4CL) [18,25,28]. Simultaneously, tyrosine is transformed into hydroxytyrosol glucoside through two alternative pathways: one is via *L*-dopa, dopamine, and hydroxytyrosol with the enzymes polyphenol oxidase (PPO), tyrosine decarboxylase (TDC), copper-containing amine oxidase (CuAO), alcohol dehydrogenase (ADH), and UGT; the other is via tyramine, tyrosol, and salidroside with the enzymes TDC, CuAO, ADH, UGT, and PPO [18,25,28]. Finally, caffeoyl CoA and hydroxytyrosol glucoside can be converted into acteoside by Shikimate O-hydroxycinnamoyltransferase (HCT) and UGT [18,25]. Recent studies have verified that acteoside possesses pharmacological properties: antioxidant, anti-inflammatory, antidepressant, antitumor, antidiabetes, and hepatoprotection [29–32].

Azafrin, belonging to carotenoid derivative, is one of the most abundant active ingredients in *C. grandiflora* Benth roots and plays an important role in myocardial protection [33]. Carotenoids are ubiquitous pigments in plants, and they confer plants with bright yellows, oranges, and reds [34]. In higher plants, carotenoids are synthesized through isoprene-like pathways in plastids, including condensation, dehydrogenation, cyclization, hydroxylation, and epoxidation reactions, while lycopene acts as an important branch point of both synthesis of α-carotene and β-carotene [35]. In the α-carotene pathway, α-carotene is synthesized by lycopene ε-cyclase (LCY-ε) and lycopene β-cyclase (LCY-β) and is then converted to lutein by ε-hydroxylase (LUT1) and β-cyclohexylase (LUT5) [17,18]. In the β-carotene pathway, LCY-β catalyzes the synthesis of β-carotene, which can be converted into strigolactone, astaxanthin, capsanthin, capsorubin, and violaxanthin under the catalysis of different enzymes, while violaxanthin can be further converted into abscisic acid [36,37]. However, studies have shown that azafrin is an apocarotenoid which is generated by cleavage of carotenoids at the C90–C100 [38,39]. In the strigolactone pathway, β-carotene is converted into carlactone through 9-cis-carotene, 100 -apo-β-carotenal by enzymes DWARF27, carotenoid cleavage dioxygenase 7 (CCD7), and CCD8 [39]. The intermediate product 100 -apo-β-carotenal is very similar to azafrin in structure except one terminal carboxyl group and two hydroxyl groups. Therefore, the hypothesis that azafrin is synthesized via 100 -apo-β-carotenal is proposed in this article.

Thus, the aim of this research is to characterize globally for the first time the transcriptomes of the root, stem, and leaf of *C. grandiflora* Benth using the Illumina Hiseq2000. To explore the genes involved in the catalpol, acteoside, and azafrin biosynthesis pathways and regulatory mechanisms, transcripts from leaves, stems, and roots of *C. grandiflora* Benth were screened out, quantified, and annotated. The results obtained here will facilitate further molecular studies in *C. grandiflora* Benth.

#### **2. Results**

#### *2.1. Sequencing and Assembly*

To figure out which genes are involved in the biosynthesis of active components in *C. grandiflora* Benth, nine sequencing libraries including roots (C\_R1, C\_R2, and C\_R3), stems (C\_S1, C\_S2, and C\_S3), and leaves (C\_L1, C\_L2, and C\_L3) were prepared and sequenced with the Illumina Hiseq2000 platform. As a result, more than 45 million clean reads per library were obtained after cleaning and quality examination. Quality assessments of the sequencing data are shown in Table 1. The error rate of all libraries was 0.03%, while Q20 and Q30 were over 94.99% and 88.32%, respectively, indicating that these data are suitable for further analysis. The raw data from the nine libraries have been deposited into the Short Reads Archive (SRA) database under the accession numbers: SRX6654843–SRX6654851.


**Table 1.** Quality assessment of the sequencing data.

Note: C\_R1, C\_R2, and C\_R3: three root samples; C\_S1, C\_S2, and C\_S3: three stem samples; and C\_L1, C\_L2, and C\_L3: three leaf samples.

The clean reads were combined and assembled by Trinity with min\_kmer\_cov set to 2 and all other default parameters [40]. Assembled sequences were subjected to cluster using the Trinity algorithm. As a result, 153,300 contigs were clustered into 173,851 trinity components. Each Trinity component contained a set of transcripts derived from the same gene, and a unigene was designated as the longest transcript in each trinity component. A total of 173,851 transcripts and 153,198 genes were assembled, with 69,421 (39.93%) transcripts and 69,419 (45.31%) genes being over 2 Kb in length (Figure 1). The average length of transcripts and genes were 1895 bp and 2115 bp, respectively (Table 2), and the N50 for transcripts and genes were 2902 and 2936 bp, respectively (Table 2).

**Figure 1.** Length distribution frequency of spliced transcripts and deduced genes.


#### *2.2. Gene Function Annotation and Classification*

All the 153,198 assembled putative unigenes were aligned using the BLAST (Basic Local Alignment Search Tool) program against the seven classic databases including NR (nonredundant protein sequences), NT (Nucleotide collection), PFAM (Protein family), SwissProt, KOG (euKaryotic Orthologous Groups), KEGG (Kyoto Encyclopedia of Genes and Genomes), and GO (Gene Ontology) databases with *e*-value cutoffs of 10−<sup>5</sup> , 10−<sup>5</sup> , 10−<sup>2</sup> , 10−<sup>5</sup> , 10−<sup>3</sup> , 10−10, and 10−<sup>6</sup> , respectively. A total of 26,652 unigenes (17.39%) were annotated to the above seven databases in common, while 132,896 unigenes (86.74%) were annotated in at least one database (Table 3). Among them, 127,767 unigenes (83.39%) showed high similarity with sequences in the NR database (*e*-value = 10−<sup>5</sup> ), 96,216 unigenes (62.80%) matched to protein sequences in NT, and 103,257 unigenes (67.40%) showed homology with known genes in SwissProt. The detailed results are shown in Table 3 and Tables S1–S3. Based on the top-hit species distribution of the homology results against NR database, the highest matches were genes from *Sesamum indicum* (43.77%), followed by *Handroanthus impetiginosus* (22.58%) and *Erythranthe guttata* (13.07%) (Figure 2).


**Table 3.** Statistical results of gene annotation.

Note: NR (nonredundant protein sequences), NT (Nucleotide collection), PFAM (Protein family), GO (Gene Ontology), KOG (euKaryotic Orthologous Groups), KEGG (Kyoto Encyclopedia of Genes and Genomes).

**Figure 2.** Species distribution of top 10 BLASTx hits against the NR database.

In coding sequence prediction analysis, unigenes were aligned first to the NR and then Swissprot database. If aligned, ORF (open reading frame) information of transcripts was extracted from the alignment results and the sequence of the coding region was translated into amino acid sequences according to the standard codon table. If failed, ESTSCAN (Expression Sequence Tag Scan) software was adopted to predict the ORF of the unigenes. As a result, a total of 157,392 peptides were predicted by BLASTx and the peptide length mainly ranged from 38 to 1059 (Figure 3a) while 35,339 peptides were predicted by ESTSCAN and the peptide length was from 15 to 1092 (Figure 3b).

**Figure 3.** Length range distributions of unigene encoded peptides: (**a**) Peptides predicted by BLASTx searches against NR and Swissprot databases; (**b**) peptides predicted by software ESTSCAN 3.0.3.

To figure out the biological processes that our unigenes are involved in as well as their molecular functions and the cellular environments they reside in, all unigenes were searched against the GO database with software BLAST2GO. Out of 153,198 unigenes, 98,364 (64.20%) were successfully annotated and classified into three GO categories—biological process (BP), cellular component (CC), and molecular function (MF)—and then assigned to 55 functional groups (Figure 4). As shown in Figure 4, assignments which fell under BP (273,598, 47.25%) ranked the highest, followed by CC (175,650, 30.34%) and MF (129,742, 22.41%). Similar to *R. glutinosa* [41] and adventitious roots in *Panax ginseng* [42], "cellular process" (60,529, 61.54%) and "metabolic process" (56,468, 57.41%) were the two most representative subcategories in the BP category, which suggested that lots of important cellular processes and metabolic activities took place in *C. grandiflora* benth. Unlike adventitious roots in *P*. *ginseng* [42], although unigenes related to "cell" (33,946, 34.51%) and "cell part" (33,946, 34.51%) were dominant in the CC category, the percentages in *C. grandiflora* Benth were far less than in *P*. *ginseng*, which implied that many tissues and organs in *C. grandiflora* Benth were in construction at a slow speed. In the MF category, the majority of unigenes were involved in "binding" (59,157, 60.14%) and "catalytic activity" (49,089, 49.91%) in *C. grandiflora* Benth, and this was somewhat similar to *R. glutinosa* [41], in which unigenes annotated into "binding" were about 20% more than "catalytic activity", indicating that more catalytic reactions may occur in the form of protein complexes.

**Figure 4.** GO classification map: The ordinate represents the next-level GO term of the three GO categories, while the abscissa represents the number of genes annotated into the corresponding term and its proportion of the total number of annotated genes. Three basic categories of GO term, from top to bottom, are the molecular function, cell components, and biological processes.

KOG refers to clusters of orthologous groups from different eukaryotic species, and genes from the same ortholog are assumed to have the same function. To further classify our unigenes, KOG annotation was performed with software diamond. A total of 44,170 unigenes were classified into 26 KOG groups (Figure 5), where the "posttranslational modification, protein turnover, and chaperon" (5516, 12.49%) category accounted for the most frequent group, "general function prediction only" (5445, 12.33%) was the second largest group, and "translation, ribosomal structure, and biogenesis" (4404, 9.97%) and "intracellular trafficking, secretion, and vesicular transport" (3504, 7.93%) were tied for the third largest. In addition, 773 unigenes were assigned to "secondary metabolites biosynthesis, transport, and catabolism", implying that catalpol, acteoside, and carotenoid biosynthesis may take place in *C. grandiflora* Benth.

**Figure 5.** KOG classification map: The abscissa represents KOG groups, while the vertical axis represents the percentage of annotated genes.

KEGG is a database in which gene products and compounds of the cellular metabolic pathways and the functions of these gene products were systematically analyzed. To figure out the active pathways in growing *C. grandiflora* Benth, KEGG annotation of our unigenes were performed with KAAS (KEGG Automatic Annotation Server). A total of 57,190 (37.33%) unigenes were annotated into five categories (level 1; cellular processes, environmental information processing, genetic information processing, metabolism, and organismal systems), 19 subcategories (level 2, Figure 6), and 130 pathways (level 3). Similar to *R. glutinosa* [41], the five pathways with the largest number of genes were "carbohydrate metabolism" (4996, 8.74%), "overview" (3455, 6.04%), "amino acid metabolism", "lipid metabolism", and "energy metabolism" in the metabolism category, indicating that primary metabolism was very important to the growth of *C. grandiflora* Benth. In the category of genetic information processing, the two pathways with the largest number of genes were "translation" (5427, 9.49%) and "folding, sorting, and degradation" (4178, 7.31%), indicating that protein biosynthesis and processing were more active in *C. grandiflora* Benth (Figure 6). The numbers of unigenes for "amino acid metabolism", "metabolism of terpenoids and polyketides", and "biosynthesis of other secondary metabolites" were 2927, 1164, and 882, respectively. These results indicate that the amino acid pathway and terpenoid pathways were active in growing *C. grandiflora* Benth and that the corresponding genes would be good candidate genes for catalpol, acteoside, and carotenoid biosynthesis.

**Figure 6.** KEGG classification map: The ordinate is the pathway, and the abscissa is the proportion of genes belonging to the corresponding pathway. These genes were divided into five categories: **A**. Cellular Processes; **B**. Environmental Information Processing; **C**. Genetic Information Processing; **D**. Metabolism; and **E**. Organismal Systems.

#### *2.3. Identification of Di*ff*erentially Expressed Genes (DEGs), GO, and KEGG Enrichment Analysis*

Gene expression level which is transformed from read counts using RSEM (RNA-Seq by Expectation-Maximization) software was analyzed with the FPKM (expected number of Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced) method [43]. According to the criteria *p* < 0.05 and log2(FoldChange) > 1, 14,875 genes (9.71% of all genes) were identified as significant DEGs between leaves and roots, which comprised 8054 upregulated genes (54.14%) and 6821 downregulated genes (45.86%) in leaves (Figure 7a, Table S4). There were 4126 genes (2.69% of all genes) identified as significant DEGs between leaves and stems, which comprised 2251 upregulated genes (54.56%) and 1875 downregulated genes (45.44%) in leaves (Figure 7b, Table S5). A total of 9115 genes (5.95% of all genes) were identified as significant DEGs between stems and roots, which comprised 5290 upregulated genes (58.04%) and 3825 downregulated genes (41.96%) in stems (Figure 7c, Table S6). Using a Venn diagram, we compared these three data sets from different comparison groups (C\_L vs. C\_R, C\_S vs. C\_R, and C\_L vs. C\_S). In all three comparison groups, 829 DEGs were identified as being in common (Figure 7d). Specifically, 4839 DEGs were identified in both "C\_L vs. C\_S" and "C\_S vs. C\_R" comparisons; 1918 DEGs were identified in both "C\_L vs. C\_R" and "C\_L vs. C\_S" comparisons; and 551 DEGs were identified in both "C\_L vs. C\_S" and "C\_S vs. C\_R" comparisons (Figure 7d).

GO and KEGG enrichment analysis on all DEGs were performed to find the enriched pathways. The GO enrichment is shown in Tables S7–S9. In the KEGG enrichment analysis, the top two enriched pathways were flavonoid biosynthesis with 37 DEGs and 68 background unigenes, and flavone and flavonol biosynthesis with 10 DEGs and 14 background unigenes in the C\_L vs. C\_R comparison (Figure 8a, Table S10). In the C\_L vs. C\_S comparison, the top two pathways were flavone and flavonol biosynthesis with 5 DEGs and 14 background unigenes, and the stilbenoid, diarylheptanoid, and gingerol biosynthesis with 11 DEGs and 48 background unigenes (Figure 8b, Table S11). Finally, in the L\_S vs. L\_R comparison, the top two pathways were stilbenoid, diarylheptanoid, and gingerol biosynthesis with 15 DEGs and 48 background unigenes, and flavonoid biosynthesis with 3 DEGs and 11 background unigenes (Figure 8c, Table S12). Notably, pathways for both phenylpropanoid biosynthesis and carotenoid biosynthesis were enriched in all three comparisons. The top 20 KEGG enrichment pathways are shown in Figure 8.

**Figure 7.** Differentially expressed genes (DEGs) in different comparisons. Volcano plots of the DEGs in different comparisons: The red dots mean significantly upregulated genes, and the green dots represent significantly downregulated genes. The black dots represent non-DEGs. (**a**) C\_L vs. C\_R volcano; (**b**) C\_L vs. C\_S volcano; and (**c**) C\_S vs. C\_R volcano. (**d**) Venn diagram of DEGs in different comparisons: All DEGs are clustered into three comparison groups represented by three circles. Overlapping parts of the different circles represent the number of DEGs in common among those comparison groups.

#### *2.4. Biosynthetic Genes of the Terpenoid Backbone and Catalpol in C. grandiflora Benth*

Terpenoids are produced from the universal precursor IPP (a five-carbon building material) and its isomer DMAPP [44]. In plants, IPP is synthesized via the cytoplasmic MVA pathway from acetyl-CoA and through the plastidial MEP pathway from glyceraldehyde 3-phosphate and pyruvate; IPP isomerase (IDI) catalyzes the interconversion between IPP and DMAPP [44] (Figure 9a). When examining the annotation of unigenes against the KEGG database, 239 unigenes were assigned to the terpenoid backbone biosynthesis pathway, including 74 unigenes encoding 6 enzymes in the MVA pathway and 165 unigenes encoding 8 enzymes in the MEP pathway (Table 4). Among these genes, the largest number is *DXS* (67), followed by *IDI* (28), *AACT* (24), and *CMK* (24), and the lowest number is *MCS* (1). Transcriptome profiling data showed that the MEP pathway is more active in leaves, while the MVA pathway is more active in stems due to the high expression levels of corresponding pathway genes (Figure 9b).

**Figure 8.** Top 20 of KEGG pathway enrichment of DEGs: The y-axis indicates the pathway name, and the x-axis indicates the enrichment factor corresponding to the pathway. The q-value is represented by the color of the dot. The number of DEGs is represented by the size of the dots. (**a**) C\_L vs. C\_R; (**b**) C\_L vs. C\_S; and (**c**) C\_S vs. C\_R.

Catalpol, belonging to iridoid glucoside, is usually found in Scrophulariaceae plants [6,45,46], and iridoid glucosides are derived from MEP and MVA pathways [41]. Based on feeding experiments with isotope labeling and transcriptome analysis, the draft biosynthesis pathway of catalpol was first proposed in *R. glutinosa* in 2012 [47,48]. Then, the complete pathway of catalpol was clarified for the first time in *P. kurroa* in 2015 [17].

According to the KEGG and Swissprot annotation, a total of 368 unigenes were assigned to the catalpol biosynthetic pathway, with 60 unigenes upregulated and 39 unigenes downregulated in leaf vs. root (Figure 9a and Table 4). The unigenes encoding 13 enzymes involved in catalpol biosynthesis are listed in Table 4. Among these genes, the largest number was *ALDH* (76), followed by *8HGO* (53), *IO* (44), *GPPS* (32), and UGT (30), and the lowest number was *F3D* (2) (Table 4). Like the terpenoid backbone pathway, most genes in the catalpol biosynthesis pathway possess at least two unigenes, displaying the redundancy of the plant genes and adding the difficulty of deciphering the pathway

(Table 4). In our transcriptome, unigenes of catalpol pathways are more abundant in leaves, as revealed by much higher expression level of *GES*, *G10H*, *IS*, *IO*, and *F3D* in leaves than in roots (Figure 9b, Table 4). It is worth noting that there were only two *F3D* genes in our transcriptome and that it was only expressed in leaves and stems, but not roots, which indicated that the catalpol biosynthesis was active in aboveground growth at this developmental stage (Figure 9a, Table 4). Therefore, while it is the roots of *C. grandiflora* Benth that are used as medicinal materials, our results imply that the catalpol is first synthesized in the leaves and then transported and stored in the roots. Furthermore, *DCH* gene functioning in the conversion of deoxygeniposidic acid to geniposidic acid was not found in our transcriptome, which may be a result of low expression or low homology with the known *DCH* genes.

**Figure 9.** Expression of unigenes in the putative pathway of terpenoid backbone and catalpol biosynthesis in *C. grandiflora* Benth: (**a**) Proposed biosynthetic pathway of terpenoid backbone and catalpol. (**b**) Heatmap based on the expression level of unigenes involved in terpenoid backbone and catalpol biosynthesis across three tissues in *C. grandiflora* Benth. The expression level is the sum of all the unigenes for each gene, and log10(sum(FPKM)+1) was used to plot the heatmap. Candidate unigenes were selected according to the annotation. Abbreviations: G3P, Glyceraldehyde 3-phosphate; DXP, 1-deoxy-D-xylulose-5-phosphate; MEP, 2-C-methyl-Derythritol 4-phosphate; CDP-ME, 4-(Cytidine 50 -diphospho)-2-C-methyl-D-erythritol; MECPP, 2-C-methyl-D-erythritol-2,4-cyclodiphosphate; HMBPP, 1-hydroxy-2-methyl-2-butenyl 4-diphosphate; IPP, isopentenyl pyrophosphate; DMAPP, dimethylallyl pyrophosphate.


**Table 4.** Putative genes of the mevalonate (MVA), 2-C-methyl-D-erythritol-4-phosphate (MEP), and catalpol biosynthesis pathways.

Note: FC represents fold change.

#### *2.5. Biosynthetic Genes of Acteoside in C. grandiflora Benth*

Studies have shown that acteoside is widely distributed in more than 150 plant species and has medicinal properties including antioxidant, anti-inflammation, anti-nephritis, cell regulation, hepatoprotection, immunoregulation, and neuroprotection [49]. Upstream regions of the acteoside biosynthetic pathway, including the phenylalanine-derived pathway and tyrosine-derived pathway, was first clarified in *Olea europaea* using feeding experiments, while the downstream is largely unknown [26]. The downstream region was partially deciphered with elicitor inducing and transcriptome sequencing in *R. glutinosa* [18,28].

Based on the KEGG annotation in this study, a total of 213 unigenes were assigned to the acteoside biosynthetic pathway, with 40 unigenes significantly upregulated and 16 unigenes significantly downregulated in leaves vs. roots (Figure 10a). The unigenes encoding key enzymes involved in acteoside biosynthesis are listed in Table 5. Among these genes, the largest number was *CuAO* (53), followed by *4CL* (35), *ADH* (26), and *UGT* (22), and the lowest number was *HCT* (6) (Table 5). In the DEGs analysis, four genes including *PAL*, *C4H*, *C3H*, and *4CL* were upregulated in leaves and stems compared with roots (Figure 10b, Table 5), which implies that the phenylalanine-derived pathway is active in aerial parts.

**Figure 10.** Expression of unigenes in putative pathways of acteoside biosynthesis in *C. grandiflora* Benth: (**a**) Proposed biosynthetic pathway of acteoside. The solid arrow represents the known steps, and the dashed arrows denote the putative steps. (**b**) Heatmap based on the expression level of unigenes involved in acteoside biosynthesis across three tissues: Candidate unigenes were selected according to the annotation.


**Table 5.** Putative genes of acteoside biosynthesis pathways.

#### *2.6. Biosynthetic Genes of Azafrin in C. grandiflora Benth*

Recent studies have shown that azafrin can significantly improve myocardial contractile function during myocardial ischemia via activation of the Nrf2-ARE (Nuclear factor-erythroid 2-related factor-Antioxidant Response Element) pathway in rats [7,33]. So far, biosynthesis and chemical synthesis pathways of azafrin are still unknown. Azafrin is a derivative of carotenoid, a tetraterpenoid compound [38]. In higher plants, carotenoids are manufactured in plastid with IPP generated by the MEP pathway [50]. The putative carotenoid biosynthesis pathway, including the MEP part, lutein branch, strigolactone branch, capsanthin/capsorubin branch, abscisic acid branch, and without the azafrin pathway, has been established in plants (Figure 11a) [35,51]. However, studies have shown that the substrate β-carotene can be directly converted into 10'-apo-β-carotenal and ionone by β-carotene-90 ,100 -oxygenase (BCO2) in non-plants or can be indirectly converted into 100 -apo-β-carotenal and ionone by DWARF27 and carotenoid cleavage dioxygenases 7 (CCD7) in plants [39,52]. The differences between azafrin and 100 -apo-β-carotenal are one terminal carboxyl group and two hydroxyl groups in the cyclohexane skeleton. From the aspect of biochemistry, acetaldehyde dehydrogenase (ALDH) can transform aldehyde into carboxylic acid and the cytochrome P450 monooxygenases (CYP450s) are capable of inserting oxygen atoms into inert hydrophobic molecules to make them more hydrophilic [53]. There is also a report that post-modification of terpenoid derivatives is mostly initiated by oxidation and that most of them are catalyzed by CYP450s

and then other post-modification reactions are carried out on the basis of oxidation products [54]. Then, we hypothesize that azafrin is produced in two continuous steps: 100 -apo-β-carotenal is first oxidized by ALDH and then is hydrolyzed by CYP450. The detailed biosynthetic pathway of azafrin is shown in Figure 11a.

**Figure 11.** The putative carotenoid biosynthesis pathway and the heatmap of corresponding genes in *C. grandiflora* Benth: (**a**) Proposed biosynthetic pathway of carotenoid. Here, the hypothesis that 100 -apo-β-carotenal can be converted into azafrin by ALDH and CYP450 is proposed. (**b**) Heatmap based on the expression level of unigenes involved in carotenoid biosynthesis across three tissues. For CYP450, only genes of Log<sup>2</sup> (FC) > 10 (leaf vs. root) were selected for map. Candidate unigenes were selected according to the annotation.

Based on the KEGG annotation and NR annotation in this study, a total of 356 unigenes were correlated with the carotenoid biosynthesis, of which 161 unigenes were assigned to the azafrin biosynthetic pathway with 20 unigenes upregulated and 33 unigenes downregulated in leaves vs. roots (Table 6). For the MEP portion, it was active in leaves, stems, and roots in general as it is known to provide the universal precursor for the terpenoids (Figure 11b). For the lutein pathway, it was more active in leaves and stems while somewhat inactive in roots due to the low expressions of *LUT5* and *LUT1* genes (Figure 11b). For the azafrin and strigolactone branch, it was slightly active in stem, as neither *DWARF27* and *CCD7* were expressed in leaves nor *CCD7* were expressed in roots (Figure 11b). For the capsanthin/capsorubin and abscisic acid branch, it is more active in leaves and stems and somewhat blocked by the low expression of the *LUT5* gene (Figure 11b). What should be noted here is that there is only one gene encoding the CCD7 enzyme, which may be a rate-limiting enzyme (Table 6).


**Table 6.** Putative genes of carotenoid biosynthesis pathways.

#### *2.7. Identification of Transcription Factors (TFs)*

Transcription factors can activate or inhibit the expression of functional genes in the biosynthetic pathway of plant metabolites, thereby effectively regulating the synthesis and accumulation of secondary metabolites. According to gene sequence alignment to the PFAM database, referring to the Hidden Markov Model files of various TFs, the HMMER3.0 software was used to search the transcriptome database of *C. grandiflora* Benth. The results showed that, in our transcriptome, 4888 unigenes were annotated as TFs belonging to 78 categories. The top three TFs with the largest numbers were MYB (avian myeloblastosis viral oncogene homolog, 356, accounting for 7.28%), WRKY (WRKY domain-containing protein, 301, accounting for 6.16%), and orphans (234, accounting for 4.79%), followed by HB (homeobox, 223, accounting for 4.56%), C3H (Cys3His zinc finger domain-containing protein, 209, accounting for 4.28%), and bHLH (basic Helix-Loop-Helix, 201, accounting for 4.11%) (Figure 12, Table 7, and Table S13). There were also TFs ERF and bZIP (basic region-leucine zipper, Table 7). Among these TFs, most were expressed in both root and leaf tissues, with 121 and 132 showing significantly upregulation and downregulation in leaves, respectively (Table 7).

**Figure 12.** Top 20 transcription factors in *C. grandiflora* Benth.


**Table 7.** Summary of transcription factor unigenes of *C. grandiflora* Benth.

Note: TF (Transcription Factor), MYB (avian myeloblastosis viral oncogene homolog), WRKY (WRKY domain-containing protein), HB(homeobox), C3H(Cys3His zinc finger domain-containg protein), bHLH (basic Helix-Loop-Helix), ERF(Ethylene Responsive Factor), bZIP (basic region-leucine zipper).

Studies have also shown that the active components of medicinal plants are regulated by many TFs and that the number of genes regulated by a specific TF varies widely. There may be even crosstalk between regulations. In *Artemisia annua,* only AaHD1 (Homeodomain-leucine zipper) and AaGSW1 (Glandular trichome-Specific WRKY 1) can activate transcription of the *CYP71AV1* gene [55,56]; AaWRKY1, AabHLH1, and ERF (Ethylene Response Factor) TFs including AaTAR1 (Trichome and Artemisinin Regulator 1), AaERF1, AaERF2, and AaORA (Octadecanoid-derivative Responsive *Apetala*2 domain) can activate the transcription of both the amorpha-4,11-diene synthase (*ADS*) gene and the *CYP71AV1* gene and then facilitates artemisinin biosynthesis [57–61]; AabZIP1 is responsible for the activation of the *ADS*, *CYP71AV1*, and *AaGSW1* genes [62], while AaMYC2 is responsible for the *CYP71AV1*, *DBR2* (*Double-Bond Reductase 2*), and *AaGSW1* genes [63,64]; and AaMYB2 may regulate the *ADS*, *CYP71AV1*, *DBR2*, and *ALDH1* (*Aldehyde Dehydrogenase 1*) genes [65]. All the abovementioned TFs were found in our transcriptome (Table 7).

In order to figure out which TFs are involved in catalpol, acteoside, and carotenoid biosynthesis in *C. grandiflora* Benth, MYB TFs of which the log2(FC) > 4 were selected for performing phylogenetic analysis with 168 MYBs from *Arabidopsis thaliana*. As a result, a total of 28 MYBs, including 16 upregulated and 12 downregulated, were screened out in leaf vs. root. In kiwifruit, *AdMYB7*, *AdMYB8*, *AdMYBR2*, and *AdMYBR3* play important roles in regulating carotenoid accumulation in tissues through transcriptional activation of metabolic pathway genes [35,66]. In our analysis, CgMYB18, CgMYB26, CgMYB19, and AdMYB7 were all clustered into the S20 subgroup, while AdMYB8 was near the S20; CgMYB15, CgMYB4, CgMYB8, CgMYB13, AdMYBR2, and AdMYBR3 were in the same clade; and *CgMYB18*, *CgMYB26*, *CgMYB19*, *CgMYB15*, *CgMYB4*, *CgMYB8*, and *CgMYB13* were all upregulated in the root, which indicates that they may regulate carotenoid biosynthesis in the roots of *C. grandiflora* Benth. (Figure 13a). In *A. annua*, overexpression of *AaMYB1* exclusively in trichomes or in whole plants both increased the expression of the *FDS* (*farnesyl diphosphate synthase*), *ADS*, *CYP71AV1*, *DBR2,* and *ALDH1* genes and increased the accumulation of artemisinin [67]. CgMYB9 and AaMYB1 were clustered into S13 subgroup and *CgMYB9* was upregulated in leaves, which showed that it may be a candidate regulatory gene in catalpol and carotenoid biosynthesis [66]. Overexpression of *AtPAP1* (*Production of Anthocyanin Pigment1*) in rose plants enhanced production of phenylpropanoid and terpenoid scent compounds by transcriptional activation of their respective pathway genes [68], and AtPAP1, AtMYB90 (AtPAP2), AtMYB113, and AtMYB114 all regulated anthocyanin biosynthesis [69]. In our data, CgMYB1, CgMYB2, CgMYB6, AtPAP1, and AtMYB90 were all in the S6 subgroup and *CgMYB1*, *CgMYB2*, and *CgMYB6* were all significantly upregulated in the leaves (Figure 13b), suggesting that they are candidate regulatory genes in catalpol, acteoside, and carotenoid biosynthesis.

**Figure 13.** Phylogenetic analysis and expression level of MYBs from *C. grandiflora* Benth: (**a**) Phylogenetic analysis of CgMYBs. Amino acid sequences were aligned using the ClustalX2 program, and evolutionary distances were calculated using phyML software with the maximum likelihood statistical method. The sequences of *C. grandiflora* Benth are listed in Table S14. The sequences of *Arabidopsis thaliana* come from PLANTTFDB (https://planttfdb.cbi.pku.edu.cn), while that of *Artemisia annua* and *Actinidia deliciosa* come from NCBI (https://www.ncbi.nlm.nih.gov). (**b**) Expression level of CgMYBs: Expression level for each gene is represented by the average RPKM in roots, stems, and leaves.

#### *2.8. Expression Correlation Analysis of Selected Genes*

To verify our transcriptome results, five terpenoid-related genes including three upregulated genes (*MCS*, *GES*, and *IS*) and two downregulated genes (*8HGO* and *HMGR2*) in leaf vs. root were selected for correlation analysis. All the selected genes possessed the same expression trend although the expression levels varied between RNA-Seq and qRT-PCR, especially for the *8HGO* and *HMGR2* genes (Figure 14a). The overall correlation coefficient was about 0.84, which indicates that our transcriptome is valid (Figure 14b).

**Figure 14.** The correlation analysis of gene expression pattern between RNA-Seq and qPCR in roots, stems, and leaves in *C. grandiflora* Benth. (**a**) Expression patterns of *MCS*, *GES*, *IS*, *8HGO* and *HMGR2* gene by RNA-Seq and qPCR; (**b**) Correlation analysis of gene expression between RNA-Seq and qPCR. Each qPCR was biologically repeated three times.

#### **3. Discussion**

Cardiovascular diseases (CVDs) remain a major cause of health loss for all regions of the world in the past 25 years [70]. In China, the incidence of CVDs is continuously rising and will keep an upward trend in the next decade [2]. Therefore, to find the herbs with effective treatment of CVDs is imminent. *C. grandiflora* Benth is one of the most precious herbs in the area of Miao Nationality in Yunnan, China. It is widely used in folk medicine because of its multifunctional medicinal values, especially in the aspect of the prevention and treatment of CVDs [7]. Although it has been collected in the Chinese Materia Medica, up to now, it has not been included in the Chinese Pharmacopoeia because of limited researches [9]. The current situation is high market prices and overexploitation of wild resources, which has not only prevented the herbal medicine from being widely used but has destroyed species diversity. Synthetic biology will provide solutions for the abovementioned problems through the biosynthetic pathway elucidations of the main pharmacodynamic components.

So far, de novo transcriptome analysis is an important method in gene discovery of biosynthesis pathways, especially for species without reference genomes [13]. In this research, the transcriptomes of three tissues with three biological repeats were sequenced by illumine Hiseq2000, and 438,112,930 clean reads were assembled into 173,851 transcripts and 153,198 unigenes. This suggests that one gene may have different transcripts which may come from variable splicing, alleles, different copies of the same gene, homologs, orthologs, etc. The mean length of transcripts and genes were 1895 bp and 2115 bp, respectively, and the N50 of the transcripts and genes were 2902 and 2936 bp, respectively (Table 2), which were higher than that in *Dendrobium huoshanense*, *Persea Americana*, and *R. glutinosa* [18,71,72]. These results implied that our assembly quality was suitable for subsequent analyses. In the species distribution analysis of unigenes, more than 43.77% of unigenes were matched to *Sesamum indicum* (Figure 2), which is similar to *R. glutinosa*, a plant of Scrophulariaceae; these results implied that they shared the closer genetic relationship, similar chemical substances, and similar biosynthetic pathways. Catalpol, acteoside, and azafrin are three medicinal ingredients in *C*. *grandiflora* Benth; however, their biosynthetic pathway is unexplored.

So far, catalpol biosynthesis containing terpenoid backbone pathway and iridoid pathway has not been fully deciphered due to the deficiency of detailed information on genetic and molecular levels [20]. In 1993, Damtoft found that 8-epi-deoxyloganic acid, bartsioside, and aucubin are intermediates of catalpol biosynthesis by feeding experiments [48]. Then, Jensen et al. confirmed that catalpol is synthesized via decarboxylated iridoids pathway (Route II), which involved 8-epi-iridodial, 8-epi-iridotrial, and 8-epi-deoxyloganic acid [73]. In 2013, the more detailed route II was proposed in *R. glutinosa* and *P. kurrooa* [21,41]. In 2015, the complete catalpol biosynthesis pathway was hypothesized in *P. kurrooa* according to data of the transcriptome mining, gene expression, and picroside content [17]. In our transcriptomes, 368 unigenes were annotated to the catalpol biosynthetic pathway with

60 unigenes upregulated in leaves and 39 unigenes in roots; simultaneously combined with the fact that *F3D* gene was not expressed in roots, we deduced that catalpol biosynthesis was mainly active in leaves. A recent article showed that, in wild *C. grandiflora* Benth, the content of catalpol is far higher in leaves than in stems and roots [74], which also implied that catalpol is mainly synthesized in leaves other than roots. The discovery of rate-limiting enzymes is essential for synthetic biology; therefore, some genes are discussed here. Catalpol biosynthesis begins with the terpenoid backbone pathway, which contains the MEP and MVA pathways. In the MEP pathway, the DXS enzyme is the first and rate-limiting enzyme, and in *A. annua*, among the three AaDXSs, only AaDXS2 might participate in artemisinin biosynthesis [75]. Contrary to *A. annua*, the *DXS*s were more abundant in our transcriptome, which seems that DXS was not a limiting enzyme in *C. grandiflora* Benth. Further studies are needed to clarify which DXS functions in MEP pathway. A recent report showed that plastidial IDI plays an important role in optimizing the ratio between IPP and DMADP as precursors for different downstream isoprenoid pathways while mutation of *IDI1* reduced the content of carotenoids in fruits, flowers, and cotyledons (except mature leaves) [44]. In our transcriptome, there were 28 *IDI* genes with two upregulated in leaves compared with roots, which highlights their importance in terpenoid backbone biosynthesis (Table 4). However, there were no significant differences for the overall expression of *IDI* genes in roots, stems, and leaves in our transcriptome (Figure 9b). What is interesting is that there was only one *MCS* gene in our transcriptome; however, its expression levels in roots, stems, and leaves were all relatively high, which directly denied that *MCS* was a rate-limiting enzyme gene. According to the expression profile, *MCT* may be a rate-limiting enzyme for roots (Figure 9b). In addition, the relative contribution of the MEP and MVA pathways for a specific pathway is a focus scientist paying attention to. In *P. kurroa*, the biosynthesis of picroside-I is contributed solely by the MEP pathway [17]. In *Taxus baccata*, the MEP pathway provides the main source of universal terpenoid precursor IPP [76]. However, in *C. grandiflora* Benth, the contribution of the MEP and MVA pathways for catalpol biosynthesis remains to be clarified and it will be resolved by the inhibition experiments in the future.

Acteoside biosynthesis was first studies in an *O. europaea* cell with feeding experiments, which outline the basic pathway profile: caffeoyl moiety was synthesized through the phenylalanine-derived pathway including intermediates cinnamic acid, p-coumaric acid, and caffeic acid, while hydroxytyrosol moiety was formed via the tyrosine-derived pathway including two alternative routes [26]. Then, HCT enzyme which connects the caffeoyl moiety and the hydroxytyrosol moiety, UGT enzymes, and the corresponding enzymes of the phenylalanine-derived pathway and tyrosine-derived pathway were hypothesized in *R. glutinosa* [28]. All of the acteoside pathway genes were found in our transcriptome of *C. grandiflora* Benth. Expression profiles showed that genes involved in both the phenylalanine-derived pathway and the tyrosine-derived pathway were more abundant in leaves and stems compared to roots, especially for the *PAL* and *PPO* genes (Figure 10b). This is consistent with the reports that, in *Harpagophytum procumbens*, the content of acteoside was higher in leaves and stems than in roots and that, in *Sesamum indicum*, the content of acteoside in leaves is far higher than in stems and roots [25,77].

Studies have shown that PAL is an entry-point enzyme which can convert *L*-Phe into trans-cinnamic acid and that it plays a vital role in channeling carbon flux from primary metabolism into the phenylpropanoid pathway [78]. So far, *PAL* gene has been cloned from many medicinal plants, such as *Ocimum basilicum* [79], *Ginkgo biloba* [80], *Salvia miltiorrhiza* [81], and *A. annua* [82]. In *G. biloba*, the highest expression of *GbPAL* gene was found in leaves, followed by stems, and the lowest expression was in roots; transcription levels of *GbPAL* were closely related to flavonoid accumulation [80]. In *R. glutinosa*, the *RgPAL* gene (CL1389.Contig1) shared the same expression pattern as in *G. biloba* [28]. In *A. annua*, the highest expression of the *AaPAL* gene was found in young leaves and the lowest expression of that was in roots [82]. In plants, *PAL* gene is a multi-gene family and the gene number ranges from 4 in *A. thaliana* to more than 12 in tomato and potato [83]. For example, there are 6 *PAL* genes in *R. glutinosa* [28]. Recently, three different redundancy phenomena including active compensation in ligand plus passive compensation in receptor in tomato, passive compensation in

ligand plus active compensation in receptor in Arabidopsis, and active compensation in both in corn have been figured out [84]; however, which type does the *CgPAL* genes belong to and whether they benefit the plants themselves in *C. grandiflora* Benth remain to be discovered. Unlike potato, the *PAL* gene family is highly redundant but underutilized due to the highly silencing mechanism in tomato [83]. In our transcriptome, there are 19 *PAL* genes and their highest expressions are found in leaves and stems with the lowest expression in roots (Figure 10b), which is similar to that in *G. biloba*, *R. glutinosa*, and *A. annua* [28,80,82]. Our transcriptome profiling data showed that 10 of 19 *CgPAL* genes were not expressed or slightly expressed in roots, stems, and leaves (Figure S1), which implied that gene silencing was also active in *C. grandiflora* Benth, and DNA cytosine methylation may account for this phenomenon [83]. A recent report showed that functional redundancy among *BZR*/*BEH* (*BRASSINAZOLE-RESISTANT*/*BRI1-EMS-SUPRESSOR1*/*BRASSINAZOLE-RESISTANT1 HOMOLOG*) gene family members is not necessary for trait robustness [85]. Even in tomato, only *PAL5* was expressed under environmental stimuli [83]. Therefore, *PAL* genes including the 3 significantly upregulated and 1 significantly downregulated in leaf vs. root in *C. grandiflora* Benth played important roles in acteoside biosynthesis (Table 5).

Polyphenol oxidase is usually undesirable in fruit and vegetable due to the browning, while it is desirable in tea, coffee, cocoa, etc. for the pigmentation [86]. Polyphenol oxidase (1,2-benzenediol: oxygen oxidoreductase), also known as tyrosinase, catechol oxidase, and laccase according to the specific substrate and reaction mechanism, is a group of copper-containing proteins [86,87]. A typical PPO protein contains three conservative regions: an N-terminal transit peptide that is responsible for the import of PPO into the thylakoid lumen; a di-copper center, each with three histidine residues to bind a copper atom; and a C-terminal region [88]. Polyphenol oxidases can catalyze two quite different types of reactions: monophenol monooxygenases (E.C. 1.14.18.1) activity and o-diphenol oxidation reactions including catechol oxidases (E.C. 1.10.3.1) and laccases (E.C. 1.10.3.2) activity [87]. In plants, polyphenol oxidase is localized in chloroplasts and the reaction product accumulated in thylakoid [89]. The number of *PPO* gene ranges from 1 to 13 in land plants with 0 for green algae and *A. thaliana*, and tandem duplications of the *PPO* gene family is common in dicotyledon [88]. In our transcriptome, 11 *PPO* genes were clustered into three groups. Expression levels of the upper group including *PPO7*, *PPO9*, *PPO10*, and *PPO11* were higher in leaves and stems compared with roots, while that of the bottom group including *PPO1*, *PPO2*, and *PPO3* were higher in roots and stems than in leaves with the somewhat low expressions in middle group including *PPO4*, *PPO5*, *PPO6*, and *PPO8* (Figure S2). Phylogenetic analysis of 11 CgPPOs with 6 PPOs of *Solanum melongena* and 6 PPOs of *Solanum lycopersicum* showed that all of our CgPPO proteins are clustered into one clade and that the other 12 PPO proteins formed another two clades (Figure S3). These species-specific PPO clades were also found in four major land plant lineages including *Populus trichocarpa*, *Glycine max*, *Vitis vinifera*, and *Aquilegia coerulea*, which implied that *CgPPO* genes were also formed by independent burst of gene duplication [88].

Azafrin (C27H38O4) derivates from tetraterpenoids (C40). It has been found in many medicinal plants such as rhizome of *Alectra chitrakutensis*, *Bergenia ciliate*, *Caralluma umbellate*, and *Alectra parasitica*, and it has the functions of being antimicrobial, anti-inflammatory, analgesic, antioxidant, treatment of cardiovascular diseases [90–93]. Roots of *C. grandiflora* Benth display orange-yellow color, which is largely due to the presence of abundant azafrin as *A. parasitica* [93]. So far, the biosynthetic pathway of azafrin is not established from perspectives of chemistry and biology. There are studies implying that excentric cleavage of carotenoid compounds is a possible route [94]. CCD7 can catalyze β-carotene (C40) into 10<sup>0</sup> -apo-β-carotenal (C27) and ionone (C13) to support the above hypothesis [39]. In the view of molecular structure, the differences between 100 -apo-β-carotenal and azafrin are one terminal carboxyl group and two hydroxyl groups in cyclohexane skeleton. Therefore, two reactions are indispensable from 10'-apo-β-carotenal to azafrin: one is to convert the aldehyde group into carboxyl groups, and the other is to insert two oxygen atoms into cyclohexane skeleton to generate two hydroxyl groups. The ALDH superfamily comprises a group of enzymes involved

in the NAD<sup>+</sup> (Nicotinamide Adenine Dinucleotide) or NADP<sup>+</sup> (Nicotinamide Adenine Dinucleotide Phosphate)-dependent conversion of various aldehydes to their corresponding carboxylic acids [95]. Although there are only 76 NAD+-dependent *ALDH* genes in our transcriptome, they are candidate genes for azafrin biosynthesis. In plant, CYP450s are responsible for many oxidative reactions such as hydroxylation, epoxidation, dealkylation, and dehydration, and the reactions catalyzed by CYP450s are irreversible [53]. There are 413 CYP450 unigenes in our transcriptome, of which 5 are significantly upregulated (log2(FC) > 10) and 5 are significantly downregulated in leaf vs. root (Table 6). They can be candidate genes of azafrin biosynthesis. The key enzymes determine the flux of the pathway, and the expression of the key enzyme gene dominates the number of enzymes. In marigold, the expression level of the *LCYE* gene in petals and *LCYB* gene in leaves were positively correlated with the lutein content [96]. In *Momordica cochinchinensis*, transcriptional regulation of genes including *HMGR*, *HDS*, *PSY*, *PDS*, *ZDS*, *CRTISO,* and *LCYE* may determine the alteration of carotenoid content during fruit ripening [97]. Our transcriptome data showed that only the expression levels of *HDS*, *PSY*, *ZDS*, and *CRTISO* were more abundant in roots than leaves and stems (Figure 11b). However, trace expression of the *DWARF27* gene in leaves and low expression of the *CCD7* gene in roots, stems, and leaves suggested that they were two rate-limiting enzymes in azafrin biosynthesis. *DWARF27*, which exhibits increased tillers and reduced plant height, was first studied in rice [98]. It encodes an iron-containing protein localized in chloroplasts and is expressed mainly in vascular cells of shoots and roots [98]. Further studies indicated that DWARF27 is an all-trans/9-cis isomerase which can convert all-trans-β-carotene into 9-cis-β-carotene in vivo and in vitro [99]. Obviously, DWARF27 is vital for azafrin biosynthesis. What is interesting is that there is only one *CCD7* gene in our transcriptome which coincides with the all *CCD7* genes identified including maize, rice, sorghum, *Selaginella moellendorfii*, *Physcomitrella patens*, and *Chlamydomonas reinhardtii* and is a single copy [100]. The highest expression of the *CCD7* gene was found in roots among maize, *A. thaliana*, pea, and petunia [100]. However, the highest expression in our transcriptome is in stems.

In the future, studies related to catalpol, acteoside, and azafrin biosynthesis will focus on the following aspects: (1) to construct a transgenic system for *C. grandiflora* Benth according to the successive tissue culture technology for verification of gene function, to characterize the putative genes of three pathways, and to verify their functions by enzyme assays in vitro or to overexpress them in vivo; (2) to explore the correlation between the contents of active component and related gene expression levels, to clone the putative TFs, and to verify their functions in the biosynthesis of active components via chromatin immunoprecipitation and overexpression in vivo; and (3) to figure out the biosynthetic pathway using feeding experiments with suspension cells.

#### **4. Materials and Methods**

#### *4.1. Plant Materials and RNA Isolation*

The artificial, cultivated *Centranthera grandiflora* Benth was grown in fields with *Cyperus rotundus* in Yushancheng base, Yuxi Flyingbear Agricultural Development Company Limited, Yuxi, Yunnan Province, China (Figure 15a). Three healthy plants with the same growth potential were selected, and the fresh roots, stems, and leaves were collected from one-year-old *C. grandiflora* plants on May 7, 2018 (Figure 15b). Materials from three individual plants were collected using scissors to yield 1 g of root, stem, and leaf samples (BioSample accessions: SAMN12499651, SAMN12499652, SAMN12499653, SAMN12499654, SAMN12499655, SAMN12499656, SAMN12499657, SAMN12499658, and SAMN12499659). After wrapping with tinfoil and tagging, all samples were immediately frozen in liquid nitrogen and stored at −80 ◦C.

**Figure 15.** Plant materials of *C. grandiflora* Benth: (**a**) *C. grandiflora* Benth growing in the field with *Cyperus rotundus* and (**b**) Roots, stems, and leaves used in experiments for sequencing and qRT-PCR.

For total RNA extraction and quality control, refer to Zhang et al. [101].

#### *4.2. Library Preparation for Transcriptome Sequencing*

A total amount of 1.5 µg RNA per sample was used as input material for the RNA sample preparations. Sequencing libraries were generated using the NEBNext® UItra™ RNA Library Prep Kit for Illumina (NEB, Ipswich, MA, USA) following the manufacturer's recommendations. Index codes were added to attribute sequences to each sample. Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. Fragmentation was carried out using divalent cations under elevated temperature in a NEBNext First Strand Synthesis Reaction Buffer (5×). First strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNase H−). Second-strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of 30 ends of DNA fragments, NEBNext Adaptor with hairpin loop structure was ligated to prepare for hybridization. In order to select cDNA fragments preferentially 250–300 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter, Beverly, USA). Then, 3 µL USER Enzyme (NEB, Ipswich, MA, USA) was used with size-selected, adaptor-ligated cDNA at 37 ◦C for 15 min followed by 5 min at 95 ◦C before PCR. PCR was performed with Phusion High-Fidelity DNA polymerase, Universal PCR primers, and Index(X) Primer. At last, PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, USA).

#### *4.3. Clustering and Sequencing*

The clustering of the index-coded samples was performed on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumia, San Diego, CA, USA) according to the manufacturer's instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq2000 platform and paired-end reads were generated.

#### *4.4. Data Filtering and Transcriptome Assembly*

The flow of bioinformatics analysis is showed in Figure 16. Before assembly, raw reads containing adaptors, more than 10% N bases, and more than 50% of low quality were removed to obtain clean reads. Meanwhile, Q20, Q30, and GC content were used to assess the data quality. All the subsequent analyses were based on these clean reads. As there are no reference genomes available for *C. grandiflora*, the clean reads of roots, stems, and leaves were assembled together. The paired-end reads of each sample were merged into one interleaved fastq file. All the nine pooled files were assembled using Trinity software (version: r20140413p1) (Cambridge, MA, USA) with min\_k-mer\_cov set to 2 and all

other parameters settings as default [102]. After clustering and de-redundancy by Corset software (version: 1.07) (VIC, Austrilia) [103], the clean nonredundant unigenes was generated.

**Figure 16.** Flow chart of transcriptome bioinformatics analysis for *C. grandiflora* Benth.

#### *4.5. Gene Functional Annotation*

Gene function was annotated based on the following databases: Nr (NCBI nonredundant protein sequences, diamond v0.8.22, *e*-value = 10−<sup>5</sup> ), NT (NCBI nucleotide sequences, NCBI blast 2.2.28+, *e*-value = 10−<sup>5</sup> ), PFAM (Protein family, HMMER 3.0 package, hmmscan *e*-value = 10−<sup>2</sup> ), SwissProt (a manually annotated and reviewed protein sequence database, diamond v0.8.22, *e*-value = 10−<sup>5</sup> ), KOG/COG (Clusters of Orthologous Groups of proteins/euKaryotic Ortholog Groups, diamond v0.8.22, *e*-value = 10−<sup>5</sup> ), KAAS (version: r140224, *e*-value = 10−10), and GO (Blast2GO v2.5 and inhouse script, *e*-value = 10−<sup>6</sup> ). To figure out the TF families involved in the active ingredient biosynthesis, iTAK software (https://github.com/kentn/iTAK/) was used to predict the TF. Its basic principle is to identify TF by hmmscan using TF family and rules defined by classification in the database. For the identification and classification methods of TF, refer to Perez-Rodriguez et al. [104].

#### *4.6. Di*ff*erential Expression Analysis*

The calculation of unigene expression was performed using the RPKM method, and gene expression levels were estimated by RSEM (version: 1.2.15, parameter for bowtie2: mismatch 0) for each sample [43]. Differential expression analysis of two organs was performed using the DESeq R package 1.10.1. DESeq provides statistical routines for determining differential expression in digital gene expression data using a model based on the negative binomial distribution. The resulting *p* values were adjusted using Benjamini and Hocberg's approach for controlling false discovery rates. Genes with a threshold of foldchange ≥ 2 and *p*-value < 0.05 found by DESeq were assigned as differentially expressed.

#### *4.7. GO Enrichment and KEGG Pathway Enrichment Analysis*

Gene ontology enrichment analysis of the differentially expressed genes (DEGs) was implemented by the GOseq (version: 1.10.0) and topGO (version: 2.10.0) R packages based Wallenius non-central hyper-genometric distribution, which can adjust for gene length bias in DEGs [105]. KOBAS software (Beijing, China) was used to test the statistical enrichment of DEGs in KEGG pathways [106]. The expressed genes (FPKM ≥ 1) were used as background with a corrected *p*-value ≤ 0.05 for both enrichment analyses.

## *4.8. qRT-PCR Analysis*

Total RNA was extracted from roots, stems, and leaves of annual *C. grandiflora* Benth. The first strand of DNA was synthesized using reverse transcription kit PrimeScript RT Master Mix (Perfect Real Time) (Takara, Dalian, China). Specific primers were designed according to the selected gene sequences for expression analysis (Table S14). Using the *C. grandiflora* Benth *CgUbi* gene (Accession number: MK256646) as an internal reference, qPCR was performed using chimeric fluorescence detection kit TB Green Premix Ex Taq II (Takara, China). Each reaction was repeated three times. Reaction was amplified by LightCycler 480II fluorescent quantitative PCR (Roche, Basel, Switzerland). After the amplification, results were calibrated by internal reference gene and the relative gene expressions in roots, stems, and leaves were calculated automatically by the 2−∆∆*C*<sup>t</sup> method.

## *4.9. Data Submission*

This Transcriptome Shotgun Assembly project (PRJNA558809) has been deposited at DDBJ/ENA/GenBank under the accession GHUX00000000. The version described in this paper is the first version, GHUX01000000.

#### **5. Conclusions**

*Centranthera grandiflora* Benth has been used to prevent and treat CVDs for a long time; however, the biosynthesis pathway of its active components including catalpol, acteoside, and azafrin remains undeciphered. Transcriptome sequencing technology is an effective way to discover the genes of this herb's biosynthesis pathways and its regulatory mechanisms. In this study, nine cDNA libraries were constructed from the roots, stems, and leaves of *C. grandiflora* Benth and sequenced by an Illumina Hiseq2000 platform. As a result, 438,112,930 clean reads were obtained and 153,198 unigenes were assembled. Among these genes, 557, 213, and 161 unigenes were annotated into catalpol, acteoside, and azafrin biosynthetic pathways, respectively. Azafrin can be synthesized through β-carotene, 9-cis-β-carotene, and 100 -apo-β-carotenal with the corresponding enzymes DWARF27, CCD7, ALDH, and CYP450. Also, the *PAL* gene silencing phenomenon is discovered and discussed. The candidate TF MYBs involved in the regulation of these pathways were proposed. Our results represent the first genomic resource for *C. grandiflora* Benth, which is a starting point for exploration of this valuable herb in molecular biology.

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/1422-0067/20/23/ 6034/s1.

**Author Contributions:** X.Z. conceived and design of experiments. X.Z., C.L., and L.W. performed the experiments. X.Z., C.L., and W.Q. analyzed the data. Y.F. contributed materials. X.Z. wrote the manuscript, and W.Q. reviewed and edited the manuscript.

**Funding:** This research was funded by Yunnan Provincial Science and Technology Department (grant number 2017FH001-024, 2016FD113) and China Scholarship Council (grant number 201908530057).

**Acknowledgments:** We gratefully acknowledge Beijing Novogene Bioinformatics Technology Company of China for carrying out the sequencing of the transcriptomes. We gratefully acknowledge the anonymous reviewers for their constructive comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Species Identification of Oaks (***Quercus* **L., Fagaceae) from Gene to Genome**

**Xinbo Pang 1,2,3,4, Hongshan Liu <sup>2</sup> , Suran Wu <sup>2</sup> , Yangchen Yuan <sup>2</sup> , Haijun Li <sup>2</sup> , Junsheng Dong <sup>2</sup> , Zhaohua Liu <sup>2</sup> , Chuanzhi An <sup>2</sup> , Zhihai Su <sup>2</sup> and Bin Li 1,2,3,4,\***


Received: 24 September 2019; Accepted: 8 November 2019; Published: 26 November 2019

**Abstract:** Species identification of oaks (*Quercus*) is always a challenge because many species exhibit variable phenotypes that overlap with other species. Oaks are notorious for interspecific hybridization and introgression, and complex speciation patterns involving incomplete lineage sorting. Therefore, accurately identifying *Quercus* species barcodes has been unsuccessful. In this study, we used chloroplast genome sequence data to identify molecular markers for oak species identification. Using next generation sequencing methods, we sequenced 14 chloroplast genomes of *Quercus* species in this study and added 10 additional chloroplast genome sequences from GenBank to develop a DNA barcode for oaks. Chloroplast genome sequence divergence was low. We identified four mutation hotspots as candidate *Quercus* DNA barcodes; two intergenic regions (*matK-trnK-rps16* and *trnR-atpA*) were located in the large single copy region, and two coding regions (*ndhF* and *ycf1b*) were located in the small single copy region. The standard plant DNA barcode (*rbcL* and *matK*) had lower variability than that of the newly identified markers. Our data provide complete chloroplast genome sequences that improve the phylogenetic resolution and species level discrimination of *Quercus*. This study demonstrates that the complete chloroplast genome can substantially increase species discriminatory power and resolve phylogenetic relationships in plants.

**Keywords:** oak species identification; chloroplast genome; *Quercus*; mutation hotspots

#### **1. Introduction**

DNA barcoding has recently emerged as a new molecular tool for species identification [1]. A DNA barcode is a short, standardized DNA region normally employed for species identification. The mitochondrial gene cytochrome oxidase 1 (*COI*) is an effective and reliable standard animal DNA barcode for species identification [1]. Over the past 10 years, plant DNA barcode researchers have been evaluating the proposed barcode segments of plants. Previously proposed barcode segments exist primarily in chloroplast genomes that are relatively stable, single-copy, and easy to amplify. These proposed barcodes are *matK*, *rbcL*, *ropC1*, and *rpoB* in the coding region, and *atpF-H*, *trnL-F*, *trnH-psbA*, and *psbK-I* in the non-coding region [2]. At the third DNA barcode conference held in Mexico City in 2009, the majority of the Consortium for the Barcode of Life (CBOL) Plant Working Group preferred to recommend a core-barcode combination consisting of portions of two plastid coding regions, *rbcL* and *matK*, which are supplemented with additional markers (such as *trnH-psbA* and

internal transcribed spacers [ITS]) as required. In 2011, the China Plant BOL Group suggested using ITS as the plant DNA barcode [3]. However, increasing numbers of studies show that core-barcodes remain problematic, especially in recently diverged and rapidly radiated taxa [4–6].

With the development of next-generation sequencing (NGS), the number of sequenced chloroplast genomes has increased rapidly, making it possible to generate chloroplast genome data to extend the concept of DNA barcoding for plant species identification [6–9]. The DNA barcoding approaches for species identification has extended from gene to genome, promptly extending phylogeny analysis from gene-based phylogenetics to phylogenomics. Chloroplast genome sequences are a primary source of data for inferring plant phylogenies and DNA barcoding because of their conserved gene content and genome structure, low nucleotide substitution mutation rates, usually uni-parental inheritance, and the low cost of generating whole chloroplast genomes with high throughput sequencing. Using chloroplast genome data, longstanding controversies at various taxonomic levels have been resolved [10–12], suggesting its power in resolving evolutionary relationships. However, challenges still exist in establishing phylogeny relationships and discrimination of closely related, recently divergent, hybridized, or introgressed lineages such as the oak group.

Oaks (*Quercus* L., Fagaceae) comprise approximately 400–500 species that are widespread throughout the temperate zones of the Northern Hemisphere; they are dominant, diverse forest and savannah angiosperm trees and shrubs belonging to a taxonomically complex group. The taxonomy of oak species remains controversial and incomplete, owing to the overlapping variation of individuals and population produced by ecological adaptation and differential reproductive isolation. A series of phylogenetic and DNA barcoding studies have mainly used several chloroplast DNA markers [13,14] such as *rbcL*, *rpoC1*, *trnH-psbA*, *matK*, *ycf3-trnS*, *ycf1*, and the nuclear ribosomal DNA ITS [4,15–17]. These studies focused only on regional flora, and those markers revealed low sequence divergence leading to lower discrimination success [4,18]. Yang et al. [13] compared two closely related species (*Quercus rubra* and *Castanea mollissima*) by exploring nine highly variable chloroplast DNA markers for species identification. However, the results showed a very low discrimination success rate using a single marker and all their combinations. On the other hand, oaks are notorious for interspecific hybridization and introgression, as well as complex speciation patterns involving incomplete lineage sorting [19–21], which have possible negative effects for barcoding and phylogeny of the species-rich *Quercus* genus [4].

In this study, we sequenced the complete chloroplast genome of 14 *Quercus* species and combined the previously reported chloroplast genomes of 10 other *Quercus* species in order to provide a comparative analysis. The study aimed to (1) investigate the genome structure, gene order, and gene content of the whole chloroplast genome of multiple *Quercus* species; (2) test whether chloroplast genome data yielded sufficient variation to construct a well-supported phylogeny of *Quercus* species; and (3) determine if multiple variable markers or whole chloroplast genome data can be successfully used for oaks species identification.

#### **2. Results**

#### *2.1. General Features of the Quercus Chloroplast Genome*

Using the Illumina HiSeq X Ten system, 14 *Quercus* species were sequenced to produce 9,910,273–16,862,000 paired-end raw reads (150 bp average read length), with an average sequencing depth of 162× to 480× (Table S1). To validate the accuracy of the assembled chloroplast genome, we carried out Sanger sequencing of PCR amplicons spanning the junction regions (LSC/IRA, LSC/IRB, SSC/IRA, and SSC/IRB). The 14 *Quercus* chloroplast genome sequences were deposited in GenBank (accession numbers MK105451–MK105453, MK105456–MK105464, and MK105466-MK105467).

The total chloroplast genome sequence lengths of 14 *Quercus* species ranged from 161,132 bp (*Q. phillyraeoides*) to 161,366 bp (*Q. rubra*). These genomes displayed typical circular quadripartite structure consisting of a pair of IR regions (25,817–25,870 bp) separated by an LSC region

(90,363–90,624 bp) and an SSC region (18,946–19,073 bp) (Figure 1). The overall GC content was absolutely identical (36.8%; Table 1) across all plastomes, but was clearly higher in the IR region (42.8%) than in the other regions (LSC 34.7%; SSC 30.9%), possibly because of the high GC content of the rRNA that was located in the IR regions. All plastomes possessed 113 unique genes, including 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes. Among the unique genes, 15 genes contained one intron, and two genes contained two introns. genes, 30 tRNA genes, and 4 rRNA genes. Among the unique genes, 15 genes contained one intron, and two genes contained two introns. The chloroplast genome results showed that all 14 *Quercus* plastomes were remarkably similar in terms of size, genes, and genome structures. The LSC/IR and IR/SSC boundaries were conserved*. Rps19* was located in the LSC near the LSC/IRb, and *trnH-GUG* was located in the LSC near the IRa/LSC border. Additionally, the location of the SSC/IRa junction was within the coding region of the *ycf1* gene.

*Int. J. Mol. Sci.* **2019**, *20*, x FOR PEER REVIEW 3 of 15

90,624 bp) and an SSC region (18,946–19,073 bp) (Figure 1). The overall GC content was absolutely identical (36.8%; Table 1) across all plastomes, but was clearly higher in the IR region (42.8%) than in

was located in the IR regions. All plastomes possessed 113 unique genes, including 79 protein-coding

**Figure 1.** Gene map of *Quercus* chloroplast genome. Genes drawn within the circle are transcribed clockwise; genes drawn outside are transcribed counterclockwise. Genes in different functional groups are shown in different colors. Dark bold lines indicate the extent of the inverted repeats (IRa and IRb) that separate the genomes into small single-copy (SSC) and large single-copy (LSC) regions. **Figure 1.** Gene map of *Quercus* chloroplast genome. Genes drawn within the circle are transcribed clockwise; genes drawn outside are transcribed counterclockwise. Genes in different functional groups are shown in different colors. Dark bold lines indicate the extent of the inverted repeats (IRa and IRb) that separate the genomes into small single-copy (SSC) and large single-copy (LSC) regions.

The chloroplast genome results showed that all 14 *Quercus* plastomes were remarkably similar in terms of size, genes, and genome structures. The LSC/IR and IR/SSC boundaries were conserved. *Rps19* was located in the LSC near the LSC/IRb, and *trnH-GUG* was located in the LSC near the IRa/LSC border. Additionally, the location of the SSC/IRa junction was within the coding region of the *ycf1* gene.



*2.2. Phylogenetic Analyses* 

#### *2.2. Phylogenetic Analyses* nodes. Clade I on the base of the tree (BS = 100% and PP = 1) comprised *Q. edithiae*, *Q. gambelii*, *Q.*

The matrix of whole chloroplast genome sequences was used to reconstruct the *Quercus* phylogenetic tree (Figure 2). Both maximum likelihood and Bayesian analyses produced similar topologies for the 24 species and were highly branch supported. All the sampled *Quercus* species were clustered into one clade with 100% bootstrap value (BS) or Bayesian posterior probability (PP). However, backbone branch supports were relatively poor, as were some internal branches. Moreover, six major clades were identified in *Quercus* and the analyses obtained high support for all six of the nodes. *sichourensis*, *Q. aquifolioides*, and *Q. spinosa* being the earliest diverging lineages. Clade II (BS = 100% and PP = 1) contained seven species: *Q. acutissima*, *Q. variabilis*, *Q. serrata*, *Q. phillyraeoides*, *Q. dolicholepis*, *Q. baronii*, and *Q. tarokoensis*. Clade III only contained *Q. tungmaiensis*. *Q. rubra* and *Q. palustris* formed clade IV, which was identified as a sister to clade V with high support value (BS = 92% and PP = 1). Clade V included three species, *Q. macrocarpa*, *Q. glauca*, and *Q. stellata*. The last clade (BS = 100% and PP = 1) was made up of *Q. aliena* var. *acuteserrata*, *Q. wutaishanica*, *Q. mongolica*, *Q. fabri*, *Q. glandulifera* var. *brevipetiolata*, and *Q. dentata*.

six major clades were identified in *Quercus* and the analyses obtained high support for all six of the

*Int. J. Mol. Sci.* **2019**, *20*, x FOR PEER REVIEW 5 of 15

The matrix of whole chloroplast genome sequences was used to reconstruct the *Quercus*  phylogenetic tree (Figure 2). Both maximum likelihood and Bayesian analyses produced similar topologies for the 24 species and were highly branch supported. All the sampled *Quercus* species were clustered into one clade with 100% bootstrap value (BS) or Bayesian posterior probability (PP).

**Figure 2.** Phylogenetic tree inferred from the 25 chloroplast genomes. Left: Maximum likelihood tree with maximum likelihood (ML) bootstrap values; right: Bayesian tree with posterior probabilities. **Figure 2.** Phylogenetic tree inferred from the 25 chloroplast genomes. Left: Maximum likelihood tree with maximum likelihood (ML) bootstrap values; right: Bayesian tree with posterior probabilities.

*2.3 Analyses of the Standard DNA Barcodes*  The *trnH-psbA* intergenic spacer region ranged from 412 bp to 474 bp with 27 variable sites, 16 informative sites, and nine indels of 3–20 bp within 574 aligned bp. A small 32 bp inversion occurred at 454 bp. *RbcL* and *matK* genes, both without indels, were 698 bp with eight variable and five informative sites, and 744 bp with 21 variable and 11 informative sites, respectively (Table 2). The mean interspecific genetic distances of the 24 oaks species with K2P were 0.0026 for *rbcL*, 0.0048 for *matK*, and 0.0125 for *trnH-psbA*. Based on the distance method, the universal DNA barcode had less discriminatory power; *rbcL*, *matK,* and *trnH-psbA* had only a 12.50%, 25.00%, and 37.50% success rate, Clade I on the base of the tree (BS = 100% and PP = 1) comprised *Q. edithiae*, *Q. gambelii*, *Q. sichourensis*, *Q. aquifolioides*, and *Q. spinosa* being the earliest diverging lineages. Clade II (BS = 100% and PP = 1) contained seven species: *Q. acutissima*, *Q. variabilis*, *Q. serrata*, *Q. phillyraeoides*, *Q. dolicholepis*, *Q. baronii*, and *Q. tarokoensis*. Clade III only contained *Q. tungmaiensis*. *Q. rubra* and *Q. palustris* formed clade IV, which was identified as a sister to clade V with high support value (BS = 92% and PP = 1). Clade V included three species, *Q. macrocarpa*, *Q. glauca*, and *Q. stellata*. The last clade (BS = 100% and PP = 1) was made up of *Q. aliena* var. *acuteserrata*, *Q. wutaishanica*, *Q. mongolica*, *Q. fabri*, *Q. glandulifera* var. *brevipetiolata*, and *Q. dentata*.

#### respectively. With the two core DNA barcodes (*rbcL* and *matK*) combined, success was only 29.17%. *2.3. Analyses of the Standard DNA Barcodes*

The *trnH-psbA* intergenic spacer region ranged from 412 bp to 474 bp with 27 variable sites, 16 informative sites, and nine indels of 3–20 bp within 574 aligned bp. A small 32 bp inversion occurred at 454 bp. *RbcL* and *matK* genes, both without indels, were 698 bp with eight variable and five informative sites, and 744 bp with 21 variable and 11 informative sites, respectively (Table 2). The mean interspecific genetic distances of the 24 oaks species with K2P were 0.0026 for *rbcL*, 0.0048 for *matK*, and 0.0125 for *trnH-psbA*. Based on the distance method, the universal DNA barcode had less discriminatory power; *rbcL*, *matK,* and *trnH-psbA* had only a 12.50%, 25.00%, and 37.50% success rate, respectively. With the two core DNA barcodes (*rbcL* and *matK*) combined, success was only 29.17%. Combined analyses of *rbcL*, *matK*, and *trnH-psbA* or *rbcL* and *matK* generated lower branch supported trees (Figure 3). *Int. J. Mol. Sci.* **2019**, *20*, x FOR PEER REVIEW 6 of 15 Combined analyses of *rbcL*, *matK*, and *trnH-psbA* or *rbcL* and *matK* generated lower branch supported trees (Figure 3).


*matK-trnK-rps16 + trnR-atpA + ndhF+ycf1b* 6921 318 4.59% 198 2.86% 100.00%

**Table 2.** The variability of the four new markers, chloroplast genome, and the universal chloroplast DNA barcodes in *Quercus.* **Table 2.** The variability of the four new markers, chloroplast genome, and the universal chloroplast DNA barcodes in *Quercus.* 

**Figure 3.** Neighbor joining trees for *Quercus* using *rbcL + matK*, *rbcL + matK*, and *trnH-psbA* **Figure 3.** Neighbor joining trees for *Quercus* using *rbcL* + *matK*,*rbcL* + *matK*, and *trnH-psbA* combinations.

#### combinations. *2.4. Analyses of Specific Barcodes*

To identify closely related species, it is imperative to identify rapidly evolving markers. We used DNAsp and SPIDER to discover the variable mutation regions of the *Quercus* chloroplast genome (Figure 4). The nucleotide diversity (pi) value ranged from 0 to 0.01766 in the 800 bp window size, while the K2P-distance ranged from 0 to 0.0179. We found four relatively variable regions: *matK-trnK-rps16*, *trnR-atpA*, *ndhF,* and *ycf1b*. Two intergenic regions (*matK-trnK-rps16* and *trnR-atpA*) were located in the LSC region, and two coding regions (*ndhF* and *ycf1b*) in the SSC region. We designed new primers for four variable regions (Table S3).

The *ycf1b* marker possessed the highest variability (5.33%), followed by the *ndhF* (4.82%), *trnR-atpA* (4.35%), and *matK-trnK-rps16* (4.02%) regions. Of the four variable makers, *ndhF* had the highest rate of correct identifications (83.33%), followed by *matK-trnK-rps16* (79.17%) and *ycf1b* (70.83%). Combining the four variable markers produced the most correct identifications (100%). The NJ tree-based method generated a graphical representation of the results and they were the same as those of the distance-based method (Figure 5).

*Int. J. Mol. Sci.* **2019**, *20*, x FOR PEER REVIEW 7 of 15

*Int. J. Mol. Sci.* **2019**, *20*, x FOR PEER REVIEW 2 of 15

**Figure 5.** Neighbor joining tree for *Quercus* using the four highly variable markers and complete chloroplast genome data. **Figure 5.** Neighbor joining tree for *Quercus* using the four highly variable markers and complete chloroplast genome data.

#### *2.5. Super-Barcode*

The 24 *Quercus* chloroplast genomes were fully aligned, and an alignment matrix of 164,156 bp was obtained (Table 3). We identified 2778 variable sites (1.69%), including 1727 parsimony-informative sites (1.05%), in the total chloroplast genome. The average Pi value for the 24 *Quercus* chloroplast genomes was 0.00335. Among these regions, IR exhibited the least nucleotide diversity (0.00073) and SSC exhibited high divergence (0.00624).



To estimate the genetic divergence among *Quercus* chloroplast genomes, nucleotide substitutions and p-distance were calculated using MEGA. The overall sequence divergence estimated by p-distance among the 24 chloroplast genome sequences was only 0.0036. The number of nucleotide substitutions among the 24 species ranged from 14 to 734, and the p-distance ranged from 0.0001 to 0.0046. *Q. tungmaiensis* and *Q. serrata* had the largest sequence divergence. *Q. variabilis* had only 14 nucleotide substitutions with *Q. acutissima*.

The discriminatory power of the complete chloroplast genome as a DNA barcode was assessed using distance and tree-based methods. Compared to the standard DNA barcode or the four newly identified markers (specific barcodes), the complete chloroplast genome had the highest discriminatory power (Table 2 and Figure 5).

#### **3. Discussion**

Species delimitation remains one of the most controversial topics in biology. However, the accurate discrimination of material using only morphological characteristics is difficult. DNA barcoding is a widely used and effective tool that has enabled rapid and accurate identification of plant species since its development in 2003 [1]. Though DNA barcoding technology has developed significantly, no barcode can achieve the goal of sophisticated plant species identification [2]. In plants, the determination of a standardized barcode has been more complex. At present, increasing amounts of practical research tend to use chloroplast markers, such as *atpB-rbcL*, *atpF-H*, *matK*, *rbcL*, *psbK-I*, *rpoB*, *rpoC1*, *trnH-psbA*, and *trnL-F,* to identify species because of their relatively low evolutionary rates compared to those of nuclear loci and universal PCR primers [22–25]. The CBOL Working Group recently recommended a two-locus combination of *matK* + *rbcL* as the core plant barcode, with the recommendation to complement these using *trnH-psbA* and the ITS of the nuclear ribosomal DNA. However, because of the lower variability in standard DNA barcodes, discrimination power was low in plants [26]. In this study, the combination of *rbcL*, *matK*, and *trnH-psbA* had poor resolution (less than 50%) within *Quercus* (Table 2). Using the universal DNA barcode, the 12 Italian oak species revealed extremely low discrimination success (0%) [4]. Combined five chloroplast genome markers (*psbA-trnH, matK-trnK, ycf3-trnS, matK*, and *ycf1*), the species identification powers were only less than 20% [13]. Thus, there is an ongoing drive to develop additional oak barcodes.

With sequencing method development, greater numbers of DNA sequences were easily acquired. Identification of specific barcodes was an effective strategy for barcoding complex groups. Most studies showed that chloroplast genome mutations were clustered into hotspots, and those hotspots were defined as DNA barcodes [27–30]. The strategy of searching the complete chloroplast genome has been successfully applied to *Oryza* [30], *Panax* [28], *Diospyros* [31], and *Dioscorea* [32]. By comparing

24 *Quercus* chloroplast genomes in the present study, we identified four oak-specific barcodes including *matK-trnK-rps16*, *trnR-atpA*, *ndhF,* and *ycf1b* (Figure 4). The *ycf1* gene was more variable than the *matK* and *rbcL* genes in most plant lineages, and recently has been the focus of a DNA barcoding and plant phylogeny study [14]. Furthermore, *ycf1* has previously provided a higher species resolution in *Quercus* [13,14]. The *ndhF* gene has been widely used in plant phylogeny and is considered a variable coding gene in the chloroplast genome [27,33–35]. *MatK-trnK-rps16* and *trnR-atpA* are two interspace regions less commonly used as DNA barcode. Combined with the four highly variable markers, all 24 *Quercus* species were successfully identified using the distance method (Table 2).

Although the four specific barcodes had the highest discriminatory power, it was necessary to develop additional markers for *Quercus* because of its complex evolutionary history. With the advent of the next-generation DNA sequencing technologies, genomic data have extended the concept of DNA barcoding for species identification [6,8,36–38]. The DNA barcode has extended from gene or genes to the entire genome, and the extended DNA barcoding approach has been referred to as "ultra-barcoding" [39], "super-barcoding" [7], or "plant barcoding 2.0" [40]. Compared to the nuclear and mitochondrial genomes, the chloroplast genome is easily sequenced and may be the best-suited genome for plant species super-barcoding [36,41].

#### **4. Materials and Methods**

#### *4.1. Taxon Sampling*

The collection and GenBank accession information for taxa sampled in the present study are listed in Table 1 and Table S1. Ten species with previously sequenced chloroplast genomes used for analysis in this study are listed in Table S2. *Castanea pumila*, the sister group of *Quercus*, was used as the out-group.

#### *4.2. DNA Extraction and Sequencing*

We used an Illumina HiSeq X Ten platform to produce chloroplast genome sequences. *Quercus* species total DNA was extracted from silica-dried leaflets using the mCTAB protocol [42]. After extraction, total DNA was quantified with a Nanodrop 1000 Spectrophotometer. Fragmented samples of 350 bp were used to prepare paired-end libraries using a NEBNext®Ultra™DNA Library Prep Kit following the manufacturer's protocol. Each library that passed the first quality control step was tested with an Agilent 2100 Bio-147 analyzer (Agilent Technologies, Santa Clara, CA, USA) to ensure the libraries had the required size distributions. Real-time quantitative PCR was carried out to precisely measure library concentrations to balance the amounts used in multiplexed reactions. Paired-end sequencing (2 × 150 bp) was conducted on an Illumina HiSeq X Ten platform. For each species, approximately 5 Gb of raw data were generated.

#### *4.3. Genome Assembly and Genome Annotation*

A five-step approach was used to assemble the chloroplast genome. First, raw sequence reads were filtered for high quality reads by removing duplicate reads, as well as adapter-contaminated reads and reads with more than five Ns using the NGS QC Tool Kit [43]. Second, the SPAdes 3.6.1 program [44] was used for de novo assemblies. Third, chloroplast genome sequence contigs were selected from the SPAdes software by performing a BLAST search using the *Quercus variabilis* chloroplast genome sequence as a reference. Fourth, the Sequencher 5.4.5 program (Gene Codes Corp., Ann Arbor, Michigan, USA) was used to merge the selected contigs. Finally, small gaps or ambiguous nucleotides were bridged with specific primers designed for PCR based on their flanking sequences by Sanger sequencing. The four junctional regions between the IRs and small single copy (SSC) and large single copy (LSC) regions in the chloroplast genome sequences were further checked by PCR amplification and Sanger sequencing with specific primers as previously described [45].

Chloroplast genome annotation was performed with Plann [46] using the *Quercus variabilis* reference sequence. The chloroplast genome map was drawn using OGdraw online [47].

#### *4.4. Phylogenetic Analyses*

Multiple sequence alignment was performed using MAFFT v7 [48]. We estimated phylogenetic trees on the nucleotide substitution matrix using maximum likelihood (ML) and Bayesian inference (BI). ML analyses were performed using RAxML v.8.1.24 [49].

The RAxML analyses included 1000 bootstrap replicates in addition to a search for the best-scoring ML tree. BI was conducted with Mrbayes v3.2 [50]. The Metropolis-coupled Markov chain Monte Carlo (MCMC) algorithm was run for 50,000,000 generations with one cold and three heated chains, starting with a random tree and sampling one tree every 2000 generations. The first 25% of the trees were discarded as burn-in, and the remaining trees were used to build a 50% majority-rule consensus tree. Stationarity was considered reached when the average standard deviation of split frequencies remained below 0.01.

#### *4.5. Sequence Divergence and Hotspot Identification*

We analyzed the aligned sequences and counted the sequence divergence among *Quercus* chloroplast genomes to evaluate *Quercus* species divergence. Variable, parsimony-informative base sites, p-distances across the complete chloroplast genomes, and LSC, SSC, and inverted repeat (IR) regions of the 14 taxa were calculated using MEGA 6.0 software [51].

We used two methods to identify the hypervariable chloroplast genome regions. The first (nucleotide variability) was conducted using DnaSP version 5.1 software with the sliding window method. The second (genetic distance) was conducted using the *slideAnalyses* function of SPIDER [52] version 1.2-0 software. This function extracts all passable windows of a chosen size in a DNA alignment and performs pairwise distance (K2P) analyses of each window. The proportion of zero pairwise distances for each species and mean distance were considered for the definition of hypervariable regions. The step size was set to 100 bp with an 800 bp window length.

#### *4.6. DNA Barcoding Analysis*

To access the effectiveness of marker discriminatory performance, we used two methods to assess the barcoding resolution. The distance method used the *nearNeighbour* function of SPIDER software [52]. The distance method was used to analyze the barcode performances of newly identified highly variable regions.

Tree building analyses provide a convenient and visualized method for evaluating discriminatory performance by calculating the proportion of monophyletic species. A neighbor joining (NJ) tree was constructed for each hypervariable marker and different marker combinations using PAUP\* 4.0 software [53]. Relative support for the NJ tree branches was assessed via 200 bootstrap replicates.

#### **5. Conclusions**

In this study, we sequenced and compared the chloroplast genomes of 24 *Quercus* species. The structure, size, and gene content of the *Quercus* chloroplast genomes were found to be well conserved, and comparative analyses revealed low levels of sequence variability. Four higher variable regions were identified, which were suitable as DNA barcodes for *Quercus* species identification. We also evaluated the resolution of the complete chloroplast genome in phylogenetic reconstruction and species discrimination in *Quercus.* The complete chloroplast genome sequence data produced strongly supported and highly resolved phylogenies in this taxonomically complex group despite the extensive hybridization and introgression in *Quercus*. Compared to standard plant DNA barcodes and the specific barcodes, analyses of the complete chloroplast genome sequences improved species identification resolution.

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/1422-0067/20/23/ 5940/s1. Table S1. Sampling and assembly information for the 14 *Quercus* species; Table S2. A list of the 10 taxa sampled from GenBank in this study. Table S3. Primers for amplifying four highly variable loci.

**Author Contributions:** B.L. and X.P. designed the experiment; B.L., X.P., H.L., S.W., Y.Y., H.L., J.D., Z.L., C.A., Z.S. and P.H. collected samples and performed the experiment; B.L. and X.P. analyzed the data and wrote the manuscript; All of the authors have read and approved the final manuscript.

**Funding:** The study was funded by by "national forest germplasm resources bank of Quercus mongolica and Quercus variabilis in Hongyashan of Hebei (2017, 2018, 2019)".

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
