1. Introduction
Chrysosplenium L. is a very small perennial herbaceous genus in the family Saxifragaceae, with tetramerous flowers and petaloid sepals [
1]. This genus consists of around 80 species distributed in Asia, Europe, Africa, and America; however, only two species in Chile have been found in the southern hemisphere, and the rest are concentrated in the northern hemisphere [
2,
3,
4,
5]. In the northern hemisphere,
Chrysosplenium species (ca. 53) are mainly distributed in East Asia; China is one of the diversity centers of this genus, with 39 species, of which 24 are endemic [
1,
5,
6,
7]. In accordance with the Flora of China, the literature, and field investigations,
Chrysosplenium macrophyllum is endemic to China and is distributed mainly in 14 provinces [
8,
9]. It is a common folk herbal medicine that can treat infantile convulsions, ecthyma, scalds, and lung and ear disorders [
10]. Only a few studies have been performed on
C. macrophyllum, and its chloroplast genomic data have been obtained [
11]. Because molecular markers for
C. macrophyllum are scarce, its population structure and genetic diversity remain unknown, which limits the exploitation and utilization of this species.
Molecular markers are an extremely popular tool in the analysis of genetic diversity because of their stability, cost-effectiveness, and facile application [
12]. The most used molecular markers mainly include restriction fragment length polymorphisms (RFLP), random amplified polymorphic DNA markers (RAPD), amplified fragment length polymorphisms (AFLP), inter simple sequence repeats (ISSR), sequence-related amplified polymorphisms (SRAP), simple sequence repeats (SSR), and single-nucleotide polymorphism (SNP) markers [
13,
14]. SSRs are the most widely used molecular markers owing to their codominance, abundance, high polymorphism, good reproducibility, and simple operation [
15,
16,
17]. SSRs can be separated into genomic SSR (gSSR) and expressed sequence tag SSR (EST-SSR) markers, in accordance with their type of sequence source [
18]. EST-SSRs have a lower developmental cost than gSSRs and exhibit cross-species transferability and direct correlations with gene functions [
18,
19]. They have been widely used in plant research, such as studies on
Carex breviculmis [
20],
Pinus koraiensis [
21],
Actinidia eriantha [
22],
Zingiber officinale [
23],
Rosa roxburghii [
24], and
Dendrobium officinale [
25].
Next-generation sequencing technology, especially transcriptome sequencing with Illumina and MGI, is an effective and reliable tool that provides a low-cost means to develop SSR markers [
26,
27,
28,
29]. Transcriptome sequencing and
de novo assembly are essential for functional genomics and marker mining, especially in non-model organisms that lack sequenced genomes [
30,
31]. As of September 2022, only a few nucleotide sequences of
Chrysosplenium aureobracteatum have been reported, and no
C. macrophyllum ESTs are available in GenBank [
32]. In previous studies, only the chloroplast gene
matK was used to examine the genetic variations of the genus
Chrysosplenium [
33]. Moreover, only a few researchers have investigated
C. macrophyllum.
In this study, we (i) used the DNBSEQ-T7 sequencer to obtain the global transcriptome of C. macrophyllum and annotated and functionally classified the transcripts; (ii) developed EST-SSRs for C. macrophyllum on the basis of these transcripts and verified their transferability among different Chrysosplenium species; and (iii) evaluated the genetic diversity and structure of 12 populations of C. macrophyllum. This study provides a resource for studies on functional genomics, metabolomics, proteomics, and the development and utilization of molecular markers, and offers a reference for related studies on other Chrysosplenium species.
2. Materials and Methods
2.1. Plant Materials, RNA Isolation, and DNA Extraction
The fresh roots, stems, and leaves of
C. macrophyllum were collected on 10 August 2021 from Xuanen County, Hubei Province, China, and immediately frozen in liquid nitrogen. Samples were then stored at −80 °C until used for RNA isolation. The young leaves of 60 individuals from 12 wild populations of
C. macrophyllum were collected and placed in sealed bags containing dried silica gel for subsequent DNA isolation. They were collected from seven provinces that included most of the distribution of this species in China (
Table 1). Individuals sampled within a population were more than 1 m apart. Sixteen additional
Chrysosplenium species were collected to test the cross-species transferability of the EST-SSRs (
Table 1).
Total RNA was extracted by using the R6827 Plant RNA Kit (Omega Bio-Tek, Inc., Norcross, GA, USA) in accordance with the manufacturer’s instructions. RNA contamination and degradation were monitored on 1% agarose gels. RNA integrity and purity were assayed by using a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA), respectively. Qualified RNA from the roots, stems, and leaves of C. macrophyllum was mixed in equal amounts for RNA sequencing.
Genomic DNA was extracted by using a modified cetyltrimethylammonium bromide (CTAB) method [
34]. DNA integrity and concentration were determined by using 1% agarose gel electrophoresis and a NanoPhotometer® NP80 (Implen, München, Germany), respectively. Then, the extracted DNA was diluted with ddH2O to the desired working concentration (50 ng/μL) and stored at −20 °C until PCR amplification.
2.2. Transcriptome Sequencing and De Novo Assembly
The transcriptome sequencing of
C. macrophyllum was performed on the DNBSEQ-T7 platform at Wuhan Benagen Technology Co., Ltd. (Wuhan, China). fastp v0.23.1 [
35] was used to remove reads with adaptors, reads with more than 5% unknown nucleotides (N), and reads with more than 50% low-quality (Q-value ≤ 5) bases. Then, the de novo assembly of the high-quality clean reads was conducted by utilizing Trinity v2.8.3 [
36] with the parameters of min_contig_length = 500, min_kmer_cov = 3, and min_glue = 15. After assembly, CD-HIT [
37] was used to cluster the transcripts and remove redundant ones, yielding the final set of unigenes.
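The read-level criteria described above can be illustrated with a minimal Python sketch. It is not the fastp implementation and it ignores adaptor trimming; the function name `passes_filter`, the Phred+33 quality encoding, and the reading of "low-quality" as a Phred score of 5 or less are assumptions made for illustration only.

```python
def passes_filter(seq, qual, max_n_frac=0.05, max_lowq_frac=0.50,
                  lowq_phred=5, phred_offset=33):
    """Return True if a read survives the N-content and low-quality filters.

    seq  : read sequence (string of A/C/G/T/N)
    qual : FASTQ quality string of the same length (Phred+33 assumed)
    """
    if not seq or len(seq) != len(qual):
        return False
    # Fraction of unknown nucleotides in the read.
    n_frac = seq.upper().count("N") / len(seq)
    # Fraction of bases whose Phred score is at or below the threshold.
    low_quality = sum(1 for ch in qual if ord(ch) - phred_offset <= lowq_phred)
    lowq_frac = low_quality / len(qual)
    # A read is discarded only when it exceeds ("more than") a threshold.
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac
```

In this sketch a read sitting exactly on a threshold (for example, exactly 5% N) is retained, because the criteria are phrased as "more than".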
2.3. Annotation and Functional Classification
Coding regions within unigenes were detected by using TransDecoder (
https://github.com/TransDecoder/TransDecoder/releases, accessed on 10 October 2022), as implemented in the Trinity software. To characterize the putative functions of the unigenes, they were compared against public databases, including NCBI nonredundant protein sequences (NR) [
38], Kyoto Encyclopedia of Genes and Genomes (KEGG) [
39], Gene Ontology (GO) [
40], and Clusters of Eukaryotic Orthologous Groups (KOG) (E-value < 1.0 × 10⁻⁵) [
41].
eggNOG-mapper v2 [
42] and InterProScan v5.0 (
https://github.com/ebi-pf-team/interproscan, accessed on 20 October 2022) were used to obtain GO and KOG annotations. After the prediction of protein sequences, the unigenes were aligned with the NR, Swiss-Prot, and KEGG databases by using DIAMOND (E-value < 1.0 × 10⁻⁵) [
43].
2.4. SSR Identification and Primer Design
The detection and localization of potential SSRs were performed by using the MIcroSAtellite identification tool (MISA) [
44]. The search criteria for SSRs were set to a minimum of 10, 6, 5, 5, 5, and 5 repeat units for mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs, respectively. Primers for the flanking sequences of the identified microsatellite motifs were designed by using Primer3. The parameters for primer design were as follows: (a) primer length of 18–23 bp with 20 bp as the optimal length; (b) PCR product sizes ranging from 100 bp to 250 bp; (c) GC content ranging from 40% to 60% with an optimum of 50%; (d) annealing temperature between 50 °C and 60 °C with 58 °C as the optimal temperature; and (e) default values for the other parameters.
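The repeat-number thresholds above can be sketched as a simple regular-expression scan. This is an illustrative stand-in, not the tool used in this study: it reports only perfect SSRs, does not merge compound SSRs, and the names `MIN_REPEATS` and `find_ssrs` are hypothetical.

```python
import re

# Minimum repeat counts per motif length (mono- to hexanucleotide),
# taken from the search criteria stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_ssrs(seq):
    """Return (motif, repeat_count, start, end) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT");
            # those tracts are reported under the shorter motif length instead.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((motif, repeats, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])
```

Coordinates are 0-based with an exclusive end, following Python slicing, so `seq[start:end]` recovers the repeat tract.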
2.5. EST-SSR Validation and Cross-Species Amplification
In total, 58 pairs of primers were randomly chosen and synthesized by Beijing TSINGKE Biological Technology Co., Ltd. (Beijing, China), to develop polymorphic EST-SSR markers. Twelve DNA samples from different populations (ZJ, BD, HY, NJ, GD, XE, WG, LA, YS, JN, TS, and PA) were used for the initial polymorphism screening of the primers. PCR amplification was performed on a BIO-RAD T100™ Thermal Cycler in a 10 μL total reaction volume comprising 5 μL of 2×T5 Super PCR Mix (PAGE) (Beijing TsingKe Biotech Co., Ltd., Beijing, China), 0.4 μL (10 μM) each of the forward and reverse primers, 1 μL of genomic DNA (50 ng/μL), and 3.2 μL of ddH2O. The PCR procedure was as follows: an initial denaturation at 98 °C for 2 min; 30 cycles of denaturation at 98 °C for 10 s, annealing at 58 °C for 10 s, and extension at 72 °C for 10 s; a final extension at 72 °C for 2 min; and holding at 4 °C. The amplified PCR products were mixed with 10× loading buffer at a ratio of 1:5 or 1:10, denatured at 95 °C for 5 min in the BIO-RAD T100™ Thermal Cycler, and immediately placed in an ice-water mixture. The same denaturation process was applied to the PAGE Gel 20 bp ladder marker (Beijing Bio-ulab Biotech Co., Ltd., Beijing, China), which served as the molecular size standard. Then, the mixture of PCR products and 10× loading buffer was subjected to 6% denaturing polyacrylamide gel electrophoresis at 90 W for 1–1.5 h and visualized by silver nitrate staining.
After the screening of polymorphic primers, 39 pairs of primers with the expected band sizes were selected for cross-species amplification in other Chrysosplenium species. The PCR reaction system and conditions were the same as above. After PCR amplification, the products were separated on 3% agarose gels, with a 50 bp DNA ladder as the size marker. Agarose gel photographs were taken using an automated gel imaging system. Then, 10 pairs of polymorphic primers were further selected for the analysis of genetic diversity in 60 individuals from 12 C. macrophyllum populations. The PCR amplification conditions and genotyping methods were the same as those above. The PCR bands on the gel images, observed under a lamp, were scored as present (1) or absent (0).
2.6. SSR Data Analysis
GENODIVE version 3.06 [
45], which can handle genetic data from polyploids or mixed-ploidy datasets, was used to calculate the following population genetic parameters: the number of alleles (
Na), effective number of alleles (
Ne), observed (
Ho) and expected (
He) heterozygosity, and inbreeding coefficient (
Fis). The
Ho and
He, polymorphic information content (PIC), and Shannon diversity index (
I) of each population and locus were estimated by using POLYGENE v1.2 [
46]. Differentiation between
C. macrophyllum populations was assessed on the basis of GST. Analysis of molecular variance (AMOVA) was performed by using POLYGENE v1.2 to obtain the genetic variation among populations.
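For orientation, the per-locus statistics named above can be written out from allele frequencies using their common textbook definitions: expected heterozygosity He = 1 − Σp², effective number of alleles Ne = 1/Σp², PIC = 1 − Σp² − Σ 2pᵢ²pⱼ² (summed over allele pairs i < j), and Shannon index I = −Σ p ln p. The sketch below is an assumption-level illustration only; GENODIVE and POLYGENE use ploidy-aware estimators whose corrections may differ, and the function name `locus_diversity` is hypothetical.

```python
import math
from itertools import combinations

def locus_diversity(freqs):
    """Textbook diversity statistics for one locus from its allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    p = [f for f in freqs if f > 0]          # ignore absent alleles
    homozygosity = sum(f * f for f in p)     # sum of squared frequencies
    he = 1.0 - homozygosity                  # expected heterozygosity
    ne = 1.0 / homozygosity                  # effective number of alleles
    # PIC subtracts the pairwise term 2 * p_i^2 * p_j^2 over all allele pairs.
    pic = he - sum(2.0 * a * a * b * b for a, b in combinations(p, 2))
    shannon = -sum(f * math.log(f) for f in p)
    return {"He": he, "Ne": ne, "PIC": pic, "I": shannon}
```

For two equally frequent alleles this gives He = 0.5, Ne = 2, PIC = 0.375, and I = ln 2.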
A neighbor-joining tree based on
DA genetic distance was established for
C. macrophyllum individuals by using POPTREE v.2 software [
47]. Principal coordinate analysis (PCoA) was performed with Cavalli–Sforza’s chord distances, which have been shown to be the least biased distance measure in the absence of dosage information [
48]. STRUCTURE version 2.3.4 [
49] was used to infer the population structure using an admixture model with correlated allele frequencies. The number of genetic clusters (K) was varied from 1 to 10, and 10 independent replicates were run for each K value with a burn-in period of 100,000 iterations followed by 1,000,000 Markov chain Monte Carlo iterations. The online program STRUCTURE HARVESTER [
50] was used to infer the optimal K in accordance with the method of Evanno et al. [
51]. The program CLUMPP version 1.1.2 [
52] was applied to estimate the averaged admixture coefficients for each K value. The clustering results were visualized by using Distruct version 1.1 [
53].
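The ΔK statistic of Evanno et al. used to choose K can be sketched as follows: for each K, the absolute second-order difference of the mean log-likelihood, |L(K+1) − 2L(K) + L(K−1)|, is divided by the standard deviation of L(K) across replicates. This is a minimal illustration under the assumption that the second difference is taken on replicate means; it is not the STRUCTURE HARVESTER code, and the function name `evanno_delta_k` and its input layout are hypothetical.

```python
from statistics import mean, stdev

def evanno_delta_k(lnp_by_k):
    """Compute Delta K from replicate log-likelihoods.

    lnp_by_k : dict mapping K -> list of Ln P(D) values, one per replicate run
               (at least two replicates per K are needed for a standard deviation).
    Returns a dict mapping K -> Delta K for every K that has both neighbours.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    sds = {k: stdev(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        # Delta K needs K-1 and K+1, and a non-zero spread at K.
        if k - 1 in means and k + 1 in means and sds[k] > 0:
            second_diff = abs(means[k + 1] - 2.0 * means[k] + means[k - 1])
            delta[k] = second_diff / sds[k]
    return delta
```

Because the statistic needs both neighbours, it is undefined for the smallest and largest K tested, so a run over K = 1 to 10 yields ΔK only for K = 2 to 9.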
4. Discussion
Research on
C. macrophyllum has progressed slowly compared with research on model plants with reference genomes. Access to genomic data is crucial for understanding a species and expanding its study. Transcriptome sequencing is more affordable than whole-genome sequencing and is well suited to studying non-model plant species [
54]. In this study, the transcriptome sequencing of
C. macrophyllum generated 40,507,062 high-quality clean reads (93.00% Q30), which were assembled into 29,477 non-redundant unigenes with an N50 of 1646 bp and an average length of 1341.32 bp. These assembly statistics were better than those previously reported for
Actinidia eriantha (average length = 594 bp, N50 = 973 bp) [
22] and
Panax vietnamensis (average length = 598.32 bp, N50 = 942 bp) [
55] and similar to those reported for
Pistacia chinensis (average length = 1325 bp, N50 = 2027 bp) [
56] and
P. vietnamensis var.
fuscidiscus (average length = 1304 bp, N50 = 2108 bp) [
57]. Compared with
C. aureobracteatum (70,753,963 bp total assembled bases), we obtained more assembled bases in
C. macrophyllum (99,257,989 bp total assembled bases) [
32]. These findings indicate that the quality of sequencing and assembly was high and meets the requirements of subsequent transcriptomic data analysis.
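The two assembly statistics compared above can be made concrete with a short sketch. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The function names `n50` and `mean_length` are hypothetical, and the input is assumed to be a plain list of unigene lengths in bp.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (bp)."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

def mean_length(lengths):
    """Return the average sequence length (bp)."""
    return sum(lengths) / len(lengths)
```

Because N50 is weighted by length, it is less sensitive than the mean to a large number of short sequences, which is why both values are usually reported together.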
Among the 29,477 unigenes, 11,478 (38.94%) were successfully annotated in the public protein databases of NR, KOG, Swiss-Prot, KEGG, and GO. The annotated unigenes could provide valuable information for future studies on
C. macrophyllum. The remaining unmatched unigenes in the protein databases may be incomplete sequences lacking key information for annotation and/or the genes specific to
C. macrophyllum without previous characterization. The BLASTX search against the NR database revealed that although only 7.83% of the identified unigenes of
C. macrophyllum were similar to those of
Vitis vinifera, this was the species with the largest number of hits for
C. macrophyllum unigenes. In fact,
C. macrophyllum and
V. vinifera are members of Saxifragaceae and Vitaceae, respectively, and are therefore genetically and evolutionarily distant from each other. This result may be attributed to the lack of whole-genome sequences for any species of Saxifragaceae in public databases. The division of the identified unigenes into 25 subterms and 57 subcategories in the GO and KOG databases suggested that the annotated unigenes have a wide range of important functions in
C. macrophyllum. A total of 2020 unigenes were mapped to 127 biological pathways, among which the metabolism category was the largest, followed by the genetic information processing category. These data suggest active metabolic processes and the synthesis of various metabolites. In
C. nudicaule,
C. carnosum, and other
Chrysosplenium species, flavonoids and triterpenoids are the main active components; these components help in resistance against biological and environmental stresses, such as cold, drought, and pests [
10,
58,
59]. In this study, we identified unigenes assigned to the terpenoid backbone biosynthesis pathway.
In this study, 5573 unigenes contained 6985 SSR loci, with a distribution frequency of 23.46% and a density of one SSR per 5.67 kb. The distribution frequency found in this work was higher than that reported for
Epimedium sagittatum (3.67%) [
60] and
Phyllostachys violascens (13.83%) [
17] but lower than that reported for
Phoebe bournei (55.57%) [
61]. The abundance and distribution of SSRs are influenced by numerous factors, including species differences, SSR search criteria, dataset size, SSR development tools, and sequence redundancy [
56,
62,
63]. The SSR types in the transcriptome of
C. macrophyllum were relatively abundant, ranging from mononucleotide repeats to hexanucleotide repeats. Consistent with the EST-SSR distribution reported in
C. aureobracteatum [
32], the dinucleotide (33.34%) and trinucleotide (19.18%) repeats became dominant when mononucleotides were excluded. Of the mononucleotide motifs, A/T (45.38%) motifs were far more abundant than the G/C (0.70%) motif, as in most plants [
64]. Among dinucleotide repeats, AG/CT (13.97%) was the most abundant; this result was consistent with previous findings on monocots and eudicots [
65,
66]. AT/TA (6.09%) and AC/GT (2.21%) were the next most abundant motifs. In
C. macrophyllum, the most predominant trinucleotide repeat motif was ATC/ATG (4.31%), followed by AAG/CTT (4.27%). In contrast to those in
C. macrophyllum, the most frequent trinucleotide repeat motifs were AGG/CCT in
Z. officinale [
23], AAG/CTT in
E. sagittatum [
60], and CCG/CGG in
Elymus sibiricus [
67]. Previous studies on other species indicated that the trinucleotide motif AAG/CTT is a major motif and that CCG/CGG is a rare motif in dicotyledonous plants, but is a common motif in monocots [
68]. In this study, CCG/CGG (0.30%) was the least abundant trinucleotide repeat, in line with the pattern reported for dicots; the abundance of this motif in monocots has been attributed to their high GC content and consequent codon usage bias [
69,
70].
We successfully designed 3127 (44.77%) primer pairs out of 6985 EST-SSR candidate loci. The failure of primer design for the remaining SSR loci may be due to the short flanking sequences of the SSR loci or the inappropriate motif of the required SSR markers. Among the 58 primer pairs selected, 39 (67.24%) resulted in successful amplification in
C. macrophyllum, among which 33 (56.90%) were polymorphic. The rate of polymorphism in this species was lower than in
Vigna mungo (58.2%; n = 18) [
71] but higher than in
R. roxburghii (29.4%; n = 16) [
24]. Therefore, in this study, the rate of EST-SSR polymorphisms was relatively high. The transferability of markers corresponds to the similarity of genomes, which can reflect the genomic relationships and even the evolutionary relationships between species [
72]. In general, EST-SSR markers are expected to transfer more readily between closely related species. In this study, the transferability of the 39 EST-SSRs from
C. macrophyllum to
C. hydrocotylifolium was the highest, suggesting that
C. macrophyllum had a closer relationship with
C. hydrocotylifolium than with other
Chrysosplenium species. This result was consistent with the close phylogenetic relationship between the two species [
5]. Notably, only 3 (7.69%) out of 39 EST-SSR markers failed to amplify successfully in all 16
Chrysosplenium species. The high transferability of the markers indicated that the flanking sequences of EST-SSRs were highly conserved among related species. These results suggest that the markers developed in our study may provide a powerful molecular tool for studies of evolutionary adaptation and for phylogenetic analyses of
C. macrophyllum and other species of
Chrysosplenium.
In this study, the samples were subdivided into two main groups on the basis of STRUCTURE analysis, and the NJ tree and PCoA supported these two genetic clusters. Individuals from the YS, LA, and PA populations, which originate from the Ta-pieh Mountains, Tianmu Mountains, and Dapan Mountains, respectively, were allocated to one cluster. The grouping of individuals from the same area correlates with geographical distribution and environmental conditions, and geographic isolation may have contributed to the genetic differences. In addition, the population structure, NJ tree, and PCoA based on the genotypic data showed clear genetic differentiation among C. macrophyllum populations. The set of EST-SSRs obtained in this work will facilitate diversity analyses of C. macrophyllum.