Article

De Novo Genome Assembly and Phylogenetic Analysis of Cirsium nipponicum

by Bae Young Choi 1,†, Jaewook Kim 2,†, Hyeonseon Park 3, Jincheol Kim 4, Seahee Han 5, Ick-Hyun Jo 4,* and Donghwan Shim 3,6,*

1 School of Liberal Arts and Sciences, Korea National University of Transportation, Chungju 27469, Republic of Korea
2 Department of Biology Education, Korea National University of Education, Cheongju 28173, Republic of Korea
3 Department of Biological Sciences, Chungnam National University, Daejeon 34134, Republic of Korea
4 Department of Crop Science and Biotechnology, Dankook University, Cheonan 31116, Republic of Korea
5 Division of Botany, Honam National Institute of Biological Resources, Mokpo 58762, Republic of Korea
6 Center for Genome Engineering, Institute for Basic Science, Daejeon 34126, Republic of Korea
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Genes 2024, 15(10), 1269; https://doi.org/10.3390/genes15101269
Submission received: 4 September 2024 / Revised: 20 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024
(This article belongs to the Section Plant Genetics and Genomics)

Abstract

Background: Cirsium nipponicum, a pharmaceutically valuable plant from the Asteraceae family, has been utilized for over 2000 years. Native to East Asia, it differs from other Korean thistles in being found exclusively on Ulleung Island, off the Korean Peninsula. Despite its significance, the genome information of C. nipponicum has remained unclear. Methods: In this study, we assembled the genome of C. nipponicum using both short reads from Illumina sequencing and long reads from Nanopore sequencing. Results: The assembled genome is 929.4 Mb in size with an N50 length of 0.7 Mb, covering 95.1% of BUSCO core groups listed in eudicots_odb10. Repeat sequences accounted for 70.94% of the assembled genome. We curated 31,263 protein-coding genes, of which 28,752 were functionally annotated using public databases. Phylogenetic analysis of 11 plant species using single-copy orthologs revealed that C. nipponicum diverged from Cynara cardunculus approximately 15.9 million years ago. Gene family evolutionary analysis revealed significant expansion and contraction in genes involved in abscisic acid biosynthesis, late endosome to vacuole transport, response to nitrate, and abaxial cell fate specification. Conclusions: This study provides a reference genome of C. nipponicum, enhancing our understanding of its genetic background and facilitating an exploration of genetic resources for beneficial phytochemicals.

1. Introduction

Cirsium nipponicum (C. nipponicum) is a perennial flowering plant distributed in East Asia. It belongs to the Asteraceae, a family renowned for its diverse species with medicinal properties. Unlike other thistle species in Korea, C. nipponicum has minimal thorns on its leaves and is found exclusively on Ulleung Island, an oceanic volcanic island off the Korean Peninsula [1]. Phytochemicals in C. nipponicum, such as polyphenols and flavonoids, exhibit antioxidant and anti-inflammatory activities, making them valuable for therapeutic purposes [2,3]. Of particular interest is the high accumulation of silymarin in the fruits of thistle plants [4]. Silymarin, a complex mixture of flavonoids and flavonolignans, has been extensively studied for its medicinal effects, particularly in the treatment of liver disorders [5,6]. Despite the pharmaceutical importance of thistle species, to the best of our knowledge, the genome of only one thistle, Silybum marianum (L.) Gaertn., has been sequenced to date [7].
Oceanic islands are important ecosystems for studying the evolutionary history of organisms, due to their spatial and temporal confinement [8]. Ulleung Island, a biodiversity hotspot in Korea, is home to many species [9]. Anagenetic speciation, the evolutionary change within a lineage, has been extensively studied on Ulleung Island [10], highlighting its significance in evolutionary research for diverse species. In Korea, there are eight thistle species of the genus Cirsium, namely C. lineare, C. vlassovianum, C. setidens, C. pendulum, C. japonicum, C. schantarense, C. rhinoceros, and C. nipponicum [11]. However, the genetic resources of these thistle species remain largely unexplored. Recent sequencing of six Cirsium plastomes suggested that C. nipponicum was introduced from the north Eurasian region and evolved independently on Ulleung Island, distinct from other mainland Korean thistle species [1]. Ulleung Island C. nipponicum might therefore possess unique genetic characteristics that could enhance our understanding of thistle species evolution.
In this study, we assembled the genome of C. nipponicum using a combination of Illumina short reads and Nanopore long reads. We annotated 31,263 protein-coding genes using RNA-seq data and protein homology information. Orthologous gene identification and phylogenetic analysis of the C. nipponicum genome, along with the genomes of 10 other representative plants, revealed that C. nipponicum formed a monophyletic clade with Cynara cardunculus (C. cardunculus), from which it diverged approximately 15.9 million years ago (Mya). Furthermore, we found that genes related to abscisic acid biosynthesis, late endosome to vacuole transport, response to nitrate, and abaxial cell fate specification in C. nipponicum exhibited both significant expansion and contraction compared to the other species. The genomic resources, gene structure, and comparative genomic analyses of C. nipponicum will improve our understanding of thistle species.

2. Materials and Methods

2.1. Plant Materials, DNA, and RNA Extraction

Whole plants of C. nipponicum were collected from Ulleung Island (Figure 1A). Healthy leaves were used for whole genome sequencing (WGS). Total genomic DNA was extracted using a Wizard® Genomic DNA Purification kit (Promega, Madison, WI, USA). Total RNA was extracted from leaves, roots, stems, and flowers using a SmartGene Plant RNA Extraction kit (SmartGene, Daejeon, Republic of Korea). The quality and quantity of the DNA and RNA samples were evaluated using a 4150 TapeStation (Agilent, Santa Clara, CA, USA) and a Qubit 4.0 Fluorometer (Invitrogen Ltd., Paisley, UK), respectively.

2.2. Short-Read Illumina Sequencing

The genomic DNA library was prepared using an xGen™ DNA Lib Prep EZ kit (Integrated DNA Technologies, Coralville, IA, USA). All RNA samples with RIN values > 7 were pooled, and mRNA was enriched using a Poly(A) RNA Selection Kit (Lexogen, Vienna, Austria). The enriched mRNA was subjected to 150 bp paired-end library preparation using an xGen™ RNA Lib Prep kit (Integrated DNA Technologies, Coralville, IA, USA). The generated Illumina libraries were sequenced on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA).

2.3. Long-Read Nanopore Sequencing

Genomic DNA samples with a DNA integrity number > 8.0 were subjected to long-read Oxford Nanopore Technologies (ONT) library preparation using a SQK-LSK110 kit (ONT, Oxford, UK). The pooled RNA was subjected to long-read cDNA library preparation using a PCS-109 kit (ONT, Oxford, UK). The long-read libraries were sequenced on the ONT MinION platform (ONT, Oxford, UK) according to the manufacturer’s instructions.

2.4. Genome Assembly

Paired-end short WGS reads were trimmed using Trimmomatic v. 0.39 with the following options: leading 15; trailing 15; and min. length 150 [12]. To estimate the genome size and heterozygosity of C. nipponicum, the k-mer frequency of the cleaned reads was analyzed using JELLYFISH v. 2.3.0 [13] and GenomeScope v. 2.0 [14].
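The idea behind k-mer-based genome size estimation can be sketched in a few lines. The block below is only a naive illustration, not the model fitted by GenomeScope: it divides the total number of counted k-mers by the depth of the tallest coverage peak. The function name, the `min_depth` error cutoff, and the toy histogram are all hypothetical, and the approach ignores heterozygosity (in a heterozygous genome the tallest peak may be the heterozygous one).

```python
from typing import Dict


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Naive estimate: total k-mers divided by the depth of the main peak.

    hist maps a k-mer depth to the number of distinct k-mers observed at
    that depth. Depths below min_depth are treated as sequencing errors
    and ignored (an assumption made for this sketch).
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the largest number of distinct k-mers (the coverage peak).
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences contributed by the usable depths.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical values): an error tail at depths 1-2 and a
# single coverage peak centred on depth 20.
toy_hist = {1: 1000, 2: 300, 19: 50, 20: 100, 21: 50}
toy_estimate = estimate_genome_size(toy_hist)
```

With the toy histogram, 4000 usable k-mer occurrences at a peak depth of 20 give an estimate of 200 positions; a real histogram would come from a k-mer counter's output.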
Error correction for long WGS reads was conducted using Ratatosk v. 0.7.6, utilizing paired-end short WGS reads [15]. The error-corrected long reads were then subjected to de novo genome assembly using NextDenovo v. 2.5.0 [16]. Both long and short reads were used to polish the genome using NextPolish v. 1.4.0 [17]. Haplotypic duplicated genome sequences were removed using Purge-dups software v. 1.2.5 [18]. The statistics of the assembled genome were analyzed using QUAST, and the completeness of the genome assembly was evaluated using BUSCO v. 5.2.2 with 2326 single-copy orthologs from the Eudicots odb10 database [19,20].
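For readers unfamiliar with the N50 statistic reported by assembly evaluation tools, the following minimal sketch shows the usual definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the toy contig lengths are illustrative assumptions, not part of the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (hypothetical): total 20, so half is 10;
# 6 + 5 = 11 reaches it, which makes the N50 equal to 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```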

2.5. Genome Annotation

Transposable elements (TEs) and repeat sequences were annotated in two steps. First, a de novo repeat library was built using RepeatModeler v. 2.0.4 [21], and the Repbase database was downloaded [22]. Second, the TEs and repeat sequences in the library and database were analyzed and annotated using RepeatMasker v. 4.1.5 (https://www.repeatmasker.org/RepeatMasker/, accessed on 16 June 2022).
The BRAKER3 pipeline [23] was used to annotate the protein-coding genes of C. nipponicum with three different hints: short-read RNA-seq data, protein homology information, and long-read isoform sequencing (IsoSeq) data. PRINSEQ-lite v. 0.20.4 was used to trim and filter short RNA-seq reads with the following parameters: min len 50; min qual score 10; min qual mean 20; derep 14; trim qual left 20; trim qual right 20 [24]. Cleaned reads were aligned to the genome using HISAT2 v. 2.2.1 [25], and the resulting mapping files were supplied to the BRAKER3 pipeline along with protein sequences from four plant species (Arabidopsis thaliana, Carthamus tinctorius, Centaurea solstitialis, Cynara cardunculus). In this pipeline, protein sequence data and RNA-seq data were processed by DIAMOND and StringTie2, respectively, to serve as hints for gene prediction [26,27]. The resulting hints were used for training GeneMark-ETP and AUGUSTUS to predict genes [28,29]. Errors in long-read IsoSeq data were corrected by Ratatosk v. 0.7.6 using paired-end short RNA-seq data [15]. Error-corrected IsoSeq data were aligned to the genome using Minimap2 v. 2.25 [30], and redundant isoforms were collapsed using Cupcake (https://github.com/Magdoll/cDNA_Cupcake, accessed on 23 March 2016). GeneMarkS-T was employed to predict genes in the unique isoforms [31]. Finally, all predicted gene models were combined to produce non-redundant and consensus gene sets using TSEBRA [23].
Functional annotation of protein-coding genes was conducted using EnTAP v. 1.1.1 [32] with two methods. First, protein sequences of protein-coding genes were subjected to a BLASTp analysis against the NCBI RefSeq database [33] and UniProt database using DIAMOND [26,34]. Second, genes were annotated with KEGG terms and GO terms using eggNOG-mapper [35].
Transfer RNA (tRNA) genes were annotated using tRNAscan-SE v. 2.0.12 with eukaryote parameters [36]. Ribosomal RNA (rRNA) and its subunits were identified using Barrnap v. 0.9 in Eukaryotic mode [37]. To detect microRNA (miRNA) and small nuclear RNA (snRNA), sequence analysis was conducted by comparing against the Rfam database using Infernal’s cmscan v. 1.1.5 [38,39].

2.6. Phylogenetic Analyses

Gene families for the eleven plant species listed in Table S3 were clustered using OrthoFinder v. 2.5.5 with the -M msa option [40]. The resulting rooted species tree from a concatenated multiple sequence alignment of single-copy orthologs was used to infer the divergence times of C. nipponicum using MCMCtree implemented in PAML v. 4.10.7 with the following parameters: JC69 model, burnin 5,000,000; sampfreq 30; and nsample 10,000,000 [41]. The calibration time points were obtained from the TimeTree database (https://www.timetree.org, accessed on 23 March 2023) using nwkit [42]. We repeated the divergence time analysis, and the convergence of the two independent analyses was evaluated by calculating Pearson’s correlation coefficient (Figure S1). The phylogenetic tree with divergence times was visualized using MCMCtreeR v. 1.1 [43].
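The convergence check described above compares the node ages from two independent runs. A minimal sketch of that comparison is given below, assuming the posterior mean ages of matching nodes have already been extracted into two equal-length lists; the function name and the example ages are hypothetical and are not values from this study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical posterior mean node ages (Mya) from two independent runs,
# listed in the same node order.
run1_ages = [10.0, 20.0, 40.0, 80.0]
run2_ages = [10.5, 19.5, 40.2, 79.6]
r = pearson_r(run1_ages, run2_ages)
```

A coefficient close to 1 indicates that the two runs recovered nearly the same ages for every node, which is the pattern expected when both chains have converged.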

2.7. Gene Family Expansion and Contraction Analyses

To predict the contraction and expansion of gene families in C. nipponicum relative to their ancestors, a birth and death model was used to estimate the sizes of ancestral gene families based on orthogroup clustering and the phylogenetic tree with divergence times using CAFE5 v. 1.1 [44]. Gene families with a p-value < 0.001 were defined as significantly expanded or contracted gene families. Gene Ontology (GO) enrichment analyses for significantly expanded and contracted gene families were conducted using topGO v. 2.52.0 [45]. GO terms with a p-value < 0.05 were summarized and visualized using REVIGO with default options [46].
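The statistical idea underlying a GO enrichment p-value can be illustrated with a plain one-sided hypergeometric test. This is a simplified sketch and should not be read as the procedure run here: the text does not state which topGO algorithm or test statistic was used, and topGO also offers algorithms that take the GO graph structure into account. All names and numbers below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes: int, annotated_genes: int,
                           selected_genes: int, selected_annotated: int) -> float:
    """P(X >= selected_annotated) under sampling without replacement.

    total_genes        : size of the gene universe
    annotated_genes    : genes in the universe carrying the GO term
    selected_genes     : genes in the set of interest (e.g. expanded families)
    selected_annotated : genes in the set of interest carrying the GO term
    """
    if not (0 <= annotated_genes <= total_genes
            and 0 <= selected_genes <= total_genes
            and 0 <= selected_annotated <= min(annotated_genes, selected_genes)):
        raise ValueError("inconsistent counts")
    upper = min(annotated_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    # Sum the hypergeometric probabilities of the observed count and of
    # every more extreme (larger) count.
    tail = sum(
        comb(annotated_genes, k)
        * comb(total_genes - annotated_genes, selected_genes - k)
        for k in range(selected_annotated, upper + 1)
    )
    return tail / denominator


# Toy example: 10 genes, 5 carry the term, 5 are selected, and all 5
# selected genes carry the term, giving p = 1/252.
p_toy = hypergeom_enrichment_p(10, 5, 5, 5)
```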

2.8. Identification of Gene Duplications

To identify homologous gene pairs among four species (C. nipponicum, C. cardunculus, C. solstitialis, C. tinctorius), protein sequences were subjected to BLASTp analysis using DIAMOND v. 2.1.9 with an E-value < 1 × 10⁻⁵ [26]. The homologous gene pairs were used to identify collinear blocks using MCScanX. The synonymous substitution rates (Ks) were calculated for each gene pair using the add_ka_and_ks_to_collinearity.pl script implemented in MCScanX v. 1.1.11 [47].
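The E-value cutoff can also be applied after the search when hits are stored in the 12-column BLAST tabular format, in which the 11th column holds the E-value. The sketch below assumes that format; the function name, the choice to drop self-hits, and the gene identifiers are illustrative assumptions rather than details taken from the study.

```python
from typing import Iterable, List, Tuple


def filter_hits(lines: Iterable[str],
                max_evalue: float = 1e-5) -> List[Tuple[str, str, float]]:
    """Keep (query, subject, evalue) for hits with evalue below the cutoff.

    Assumes tab-separated BLAST tabular rows: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    Self-hits are dropped (an assumption made for this sketch).
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if query != subject and evalue < max_evalue:
            pairs.append((query, subject, evalue))
    return pairs


# Hypothetical rows: one strong hit, one weak hit, one self-hit.
rows = [
    "\t".join(["geneA", "geneB", "92.1", "300", "20", "1",
               "1", "300", "5", "304", "1e-150", "540"]),
    "\t".join(["geneA", "geneC", "31.0", "80", "50", "3",
               "10", "90", "12", "95", "0.002", "38"]),
    "\t".join(["geneA", "geneA", "100.0", "300", "0", "0",
               "1", "300", "1", "300", "0.0", "620"]),
]
kept = filter_hits(rows)
```

Only the geneA and geneB pair survives in this toy input, because the second hit exceeds the cutoff and the third is a self-hit.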

3. Results and Discussion

3.1. Genome Sequencing and Assembly

We assembled the C. nipponicum genome using both the Illumina and ONT sequencing platforms. This approach generated approximately 128 Gb of short WGS reads, 89 Gb of long WGS reads with an N50 of 16 kb, 21 Gb of short RNA-seq reads, and 6.3 Gb of long IsoSeq reads. We performed k-mer analysis on the short WGS reads to estimate the genome size of C. nipponicum. Using a k-mer size of 31, the genome was estimated to be 913 Mb in size with a heterozygosity of 1.43% (Figure 1B).
The initial genome assembly using long WGS reads resulted in a genome size of 1487.8 Mb, comprising 5675 contigs with an N50 of 0.4 Mb (Table 1). This assembly size was larger than the genome size estimated from the k-mer analysis. To refine the assembly, we removed haplotypic duplicated genome sequences in the draft assembly, resulting in a purged genome of 929.4 Mb in length, consisting of 2199 contigs with an N50 of 0.7 Mb (Table 1). Notably, this genome size is the smallest reported within the Cirsium genus: flow cytometry estimates for 19 Cirsium species, excluding C. nipponicum, range from approximately 1046 Mb to 5245 Mb [48,49,50].
To assess the completeness of the assembled genome, we searched for 2326 single-copy orthologs from the Eudicots odb10 database. BUSCO analysis indicated that 95.1% of the core genes were completely captured in the assembled genome, with 85.8% being single-copy genes and 9.3% duplicated genes. The percentages of fragmented and missing core genes were 0.8% and 4.1%, respectively (Table 2). The high BUSCO completeness score, along with the consistency between the genome size obtained from the assembly and the estimated size from k-mer analysis, indicates a high-quality and complete C. nipponicum genome assembly.
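The figures reported in this section can be cross-checked with simple arithmetic. The short sketch below uses only numbers stated in the text; the function and variable names are hypothetical, and the gene count is approximate because it is derived from a rounded percentage.

```python
def busco_summary(total: int, single_pct: float, duplicated_pct: float,
                  fragmented_pct: float, missing_pct: float) -> dict:
    """Recombine BUSCO category percentages into summary figures."""
    complete_pct = single_pct + duplicated_pct
    all_pct = complete_pct + fragmented_pct + missing_pct
    return {
        "complete_pct": round(complete_pct, 1),
        "sum_pct": round(all_pct, 1),
        # Approximate count, since the percentages are already rounded.
        "approx_complete_count": round(total * complete_pct / 100),
    }


# Values as reported in the text.
summary = busco_summary(2326, 85.8, 9.3, 0.8, 4.1)

# Purged assembly size relative to the k-mer-based estimate (Mb).
size_ratio = 929.4 / 913.0

# Fraction of the draft assembly removed as haplotypic duplication.
purged_fraction = (1487.8 - 929.4) / 1487.8
```

The single-copy and duplicated categories add up to the reported 95.1% completeness (roughly 2212 of 2326 orthologs), the purged assembly is about 1.8% larger than the k-mer estimate, and about 37.5% of the draft assembly was removed during purging.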

3.2. Repeat Sequence and Gene Prediction

In the assembled genome of C. nipponicum, a total of 659.3 Mb sequences were identified as repetitive elements, accounting for 70.94% of the genome (Table S1). The high proportion of repetitive elements in the C. nipponicum genome is consistent with findings in most plant genomes, where repetitive DNA constitutes a significant portion of the total genomic content [51]. The most abundant class of repetitive elements was long terminal repeat (LTR) elements, comprising 344.9 Mb (37.11%) of the genome. The substantial presence of LTR elements suggests that the large genome size of C. nipponicum might be due to the accumulation of these elements, since their accumulation plays a key role in genome size expansion in some plants [52,53]. Unclassified repeats were the next most abundant, accounting for 263.8 Mb (28.39%). Interspersed repeats comprised 637.6 Mb (68.61%), while non-interspersed repeats comprised 21.6 Mb (approximately 2.3%).
We identified and curated a total of 31,263 protein-coding genes with 5596 isoforms in the assembled genome. The average length of the primary transcripts was 1203 bp. Functional annotation of these protein-coding genes was conducted using multiple databases, including the RefSeq, UniProt, and eggNOG databases. Overall, 28,752 genes (91.97%) were functionally annotated in at least one of these databases (Table S2). We further identified various non-coding RNA genes in the C. nipponicum genome, which included 771 rRNA, 1137 tRNA, 159 miRNA, and 1907 snRNA genes (Table 3).

3.3. Comparative Genomic and Phylogenetic Analyses

The evolutionary relationships within the green plant lineage and the phylogenetic position of C. nipponicum within the Asteraceae family were investigated through the analysis of primary protein-coding genes from 11 plant species using OrthoFinder [40]. We obtained 32,950 orthologous gene groups among these species. A phylogenetic tree was constructed using single-copy orthologous genes from these 11 plant species, and the divergence times were estimated (Figure 2A). Our phylogenetic analysis revealed that C. nipponicum, C. cardunculus, C. solstitialis, and C. tinctorius were grouped together, indicating that these species are closely related. These four species diverged from a common ancestor approximately 36.86 Mya. Specifically, C. nipponicum formed a monophyletic clade with C. cardunculus, and the two species diverged from their common ancestor approximately 15.9 Mya.
To investigate species diversification and gene duplication events in C. nipponicum, we conducted Ks distribution analysis using paralogs and orthologs of C. nipponicum and its three closest relatives. The comparison of Ks distributions of the orthologs between C. nipponicum and C. cardunculus with those between C. nipponicum and C. solstitialis suggested that C. nipponicum diverged from C. cardunculus more recently than from C. solstitialis. This finding is consistent with the phylogenetic analysis and estimated divergence times (Figure 2A). The Ks distribution for paralogous genes in C. nipponicum showed a major peak at a Ks value of ~0.65, suggesting that gene duplication events occurred after the divergence of C. nipponicum.
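A peak in a Ks distribution is usually located by binning the Ks values and finding the most populated bin. The following is a minimal sketch of that step under stated assumptions: the bin width, the upper Ks cutoff, the function name, and the toy values are all hypothetical, and published analyses often fit a smoothed density or mixture model instead of using a raw histogram.

```python
from collections import Counter
from typing import Iterable


def ks_peak(ks_values: Iterable[float], bin_width: float = 0.05,
            max_ks: float = 3.0) -> float:
    """Return the centre of the most populated Ks histogram bin.

    Values that are not in the interval (0, max_ks] are ignored, on the
    assumption that they are unreliable or saturated estimates.
    """
    counts: Counter = Counter()
    for ks in ks_values:
        if 0 < ks <= max_ks:
            counts[int(ks / bin_width)] += 1
    if not counts:
        raise ValueError("no Ks values within the accepted range")
    # Most populated bin; ties are resolved in favour of the lower Ks bin.
    best_bin = max(counts, key=lambda b: (counts[b], -b))
    return (best_bin + 0.5) * bin_width


# Toy Ks values (hypothetical): three pairs fall in the 0.65-0.70 bin.
toy_ks = [0.12, 0.62, 0.66, 0.67, 0.68, 1.4]
peak = ks_peak(toy_ks)
```

A smaller ortholog Ks peak between two species indicates a more recent split, which is how ortholog and paralog peaks are compared when ordering speciation and duplication events.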
Using the birth and death model, we identified 1508 gene families in C. nipponicum that underwent expansion and 1179 gene families that underwent contraction relative to the most recent common ancestor of C. nipponicum and C. cardunculus (Figure 2A). GO enrichment analysis revealed that the significantly expanded gene families were mainly enriched in the positive regulation of the proteasomal protein catabolic process and DNA conformation change, while the significantly contracted gene families were primarily involved in the positive regulation of gene expression and ribosomal small subunit assembly (Figure 3). Notably, gene families involved in abscisic acid biosynthesis, late endosome to vacuole transport, response to nitrate, and abaxial cell fate specification exhibited both significant expansion and contraction.
In conclusion, the combination of the Illumina and ONT sequencing technologies, followed by rigorous assembly and polishing steps, has enabled the production of a high-quality genome assembly for C. nipponicum. The characterization of repeat sequences and the detailed annotation of protein-coding genes in the C. nipponicum genome provide a rich resource for future genetic and functional studies. Notably, the identification of genes involved in the biosynthesis of silymarin, as well as potential biomarkers for silymarin production, could contribute to enhancing the yield of thistle-derived pharmaceutical products, which are promising phytochemicals for the treatment of liver-related diseases. Furthermore, our comparative genomic and phylogenetic analyses provide valuable insights into the evolutionary history of C. nipponicum. Its relationship with other species, coupled with significant gene family expansions and contractions, highlights the dynamic evolutionary processes shaping the genome of C. nipponicum. This genomic information provides a foundation for comparative genomics and biotechnological applications in thistle species.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/genes15101269/s1: Figure S1: Convergence plot of two independent divergence time analyses using MCMCtree; Table S1: The statistics of repeat sequences; Table S2: Functional annotation of protein-coding genes using EnTAP; Table S3: The species used for phylogenetic analysis; Table S4: GO enrichment analysis of expanded gene families; Table S5: GO enrichment analysis of contracted gene families.

Author Contributions

Conceptualization, B.Y.C., J.K. (Jaewook Kim), I.-H.J. and D.S.; methodology, B.Y.C., J.K. (Jaewook Kim) and H.P.; software, B.Y.C., J.K. (Jaewook Kim) and H.P.; validation, B.Y.C., J.K. (Jaewook Kim) and H.P.; formal analysis, B.Y.C., J.K. (Jaewook Kim) and H.P.; investigation, B.Y.C., J.K. (Jaewook Kim) and H.P.; resources, I.-H.J. and S.H.; data curation, B.Y.C., J.K. (Jaewook Kim), J.K. (Jincheol Kim) and H.P.; writing—original draft preparation, B.Y.C., J.K. (Jaewook Kim), I.-H.J. and D.S.; writing—review and editing, B.Y.C., J.K. (Jaewook Kim), H.P., I.-H.J. and D.S.; visualization, B.Y.C., J.K. (Jaewook Kim) and H.P.; supervision, I.-H.J. and D.S.; project administration, I.-H.J. and D.S.; funding acquisition, I.-H.J. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was carried out with the support of the “Cooperative Research Program for Agricultural Science and Technology Development (Project No. RS-2023-00232275)”, Rural Development Administration, Republic of Korea.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All sequencing data generated in this study were deposited in the SRA database at the National Center for Biotechnology Information under the accession number PRJNA1127082. The assembled genome sequences and annotations are available at Figshare [10.6084/m9.figshare.26927092].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

BUSCO: Benchmarking universal single-copy orthologs
BLAST: Basic local alignment search tool
Mb: Megabase pairs
IsoSeq: Isoform sequencing
ONT: Oxford Nanopore Technologies
TEs: Transposable elements
GO: Gene Ontology
Ks: Synonymous substitution rates
WGS: Whole genome sequencing
LTR: Long terminal repeat
Mya: Million years ago

References

  1. Kim, B.; Lee, Y.; Koh, B.; Jhang, S.Y.; Lee, C.H.; Kim, S.; Chi, W.-J.; Cho, S.; Kim, H.; Yu, J. Distinctive origin and evolution of endemic thistle of Korean volcanic island: Structural organization and phylogenetic relationships with complete chloroplast genome. PLoS ONE 2023, 18, e0277471.
  2. Lee, J.-H.; Lee, K.-R. Phytochemical constituents of Cirsium nipponicum (MAX.) Makino. Korean J. Pharmacogn. 2005, 36, 145–150.
  3. Yin, J.; Heo, S.-I.; Wang, M.-H. Antioxidant and antidiabetic activities of extracts from Cirsium japonicum roots. Nutr. Res. Pract. 2008, 2, 247.
  4. Lv, Y.; Gao, S.; Xu, S.; Du, G.; Zhou, J.; Chen, J. Spatial organization of silybin biosynthesis in milk thistle [Silybum marianum (L.) Gaertn]. Plant J. 2017, 92, 995–1004.
  5. Federico, A.; Dallio, M.; Loguercio, C. Silymarin/Silybin and Chronic Liver Disease: A Marriage of Many Years. Molecules 2017, 22, 191.
  6. Shaker, E.; Mahmoud, H.; Mnaa, S. Silymarin, the antioxidant component and Silybum marianum extracts prevent liver damage. Food Chem. Toxicol. 2010, 48, 803–806.
  7. Kim, K.D.; Shim, J.; Hwang, J.-H.; Kim, D.; El Baidouri, M.; Park, S.; Song, J.; Yu, Y.; Lee, K.; Ahn, B.-O. Chromosome-level genome assembly of milk thistle (Silybum marianum (L.) Gaertn.). Sci. Data 2024, 11, 342.
  8. Vargas, P.; Zardoya, R. Evolution on islands. In The Tree of Life: Evolution and Classification of Living Organisms; Sinauer Associates: Sunderland, MA, USA, 2014; pp. 577–594.
  9. Oh, S.-H.; Chen, L.; Kim, S.-H.; Kim, Y.-D.; Shin, H. Phylogenetic relationship of Physocarpus insularis (Rosaceae) endemic on Ulleung Island: Implications for conservation biology. J. Plant Biol. 2010, 53, 94–105.
  10. Stuessy, T.F.; Jakubowsky, G.; Gómez, R.S.; Pfosser, M.; Schlüter, P.M.; Fer, T.; Sun, B.Y.; Kato, H. Anagenetic evolution in island plants. J. Biogeogr. 2006, 33, 1259–1265.
  11. Song, M.-J.; Kim, H. Taxonomic study on Cirsium Miller (Asteraceae) in Korea based on external morphology. Korean J. Plant Taxon. 2007, 37, 17–40.
  12. Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30, 2114–2120.
  13. Marçais, G.; Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011, 27, 764–770.
  14. Ranallo-Benavidez, T.R.; Jaron, K.S.; Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 2020, 11, 1432.
  15. Holley, G.; Beyter, D.; Ingimundardottir, H.; Møller, P.L.; Kristmundsdottir, S.; Eggertsson, H.P.; Halldorsson, B.V. Ratatosk: Hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol. 2021, 22, 28.
  16. Hu, J.; Wang, Z.; Sun, Z.; Hu, B.; Ayoola, A.O.; Liang, F.; Li, J.; Sandoval, J.R.; Cooper, D.N.; Ye, K. NextDenovo: An efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 2024, 25, 107.
  17. Hu, J.; Fan, J.; Sun, Z.; Liu, S. NextPolish: A fast and efficient genome polishing tool for long-read assembly. Bioinformatics 2020, 36, 2253–2255.
  18. Guan, D.; McCarthy, S.A.; Wood, J.; Howe, K.; Wang, Y.; Durbin, R. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 2020, 36, 2896–2898.
  19. Mikheenko, A.; Prjibelski, A.; Saveliev, V.; Antipov, D.; Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 2018, 34, i142–i150.
  20. Manni, M.; Berkeley, M.R.; Seppey, M.; Simão, F.A.; Zdobnov, E.M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 2021, 38, 4647–4654.
  21. Rodriguez, M.; Makałowski, W. Software evaluation for de novo detection of transposons. Mobile DNA 2022, 13, 14.
  22. Bao, W.; Kojima, K.K.; Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 2015, 6, 11.
  23. Gabriel, L.; Brůna, T.; Hoff, K.J.; Ebel, M.; Lomsadze, A.; Borodovsky, M.; Stanke, M. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res. 2024.
  24. Schmieder, R.; Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27, 863–864.
  25. Kim, D.; Paggi, J.M.; Park, C.; Bennett, C.; Salzberg, S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019, 37, 907–915.
  26. Buchfink, B.; Xie, C.; Huson, D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 2015, 12, 59–60.
  27. Kovaka, S.; Zimin, A.V.; Pertea, G.M.; Razaghi, R.; Salzberg, S.L.; Pertea, M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019, 20, 278.
  28. Stanke, M.; Schöffmann, O.; Morgenstern, B.; Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform. 2006, 7, 62.
  29. Brůna, T.; Lomsadze, A.; Borodovsky, M. GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. Genome Res. 2024.
  30. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
  31. Tang, S.; Lomsadze, A.; Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 2015, 43, e78.
  32. Hart, A.J.; Ginzburg, S.; Xu, M.; Fisher, C.R.; Rahmatpour, N.; Mitton, J.B.; Paul, R.; Wegrzyn, J.L. EnTAP: Bringing faster and smarter functional annotation to non-model eukaryotic transcriptomes. Mol. Ecol. Resour. 2020, 20, 591–604.
  33. O’Leary, N.A.; Wright, M.W.; Brister, J.R.; Ciufo, S.; Haddad, D.; McVeigh, R.; Rajput, B.; Robbertse, B.; Smith-White, B.; Ako-Adjei, D. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016, 44, D733–D745.
  34. Consortium, T.U. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2020, 49, D480–D489.
  35. Cantalapiedra, C.P.; Hernández-Plaza, A.; Letunic, I.; Bork, P.; Huerta-Cepas, J. eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 2021, 38, 5825–5829.
  36. Lowe, T.M.; Eddy, S.R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Res. 1997, 25, 955–964.
  37. Loman, T. A Novel Method for Predicting Ribosomal RNA Genes in Prokaryotic Genomes. Master’s Thesis, Lund University, Lund, Sweden, 2017.
  38. Griffiths-Jones, S.; Moxon, S.; Marshall, M.; Khanna, A.; Eddy, S.R.; Bateman, A. Rfam: Annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33, D121–D124.
  39. Nawrocki, E.P.; Eddy, S.R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 2013, 29, 2933–2935.
  40. Emms, D.M.; Kelly, S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 2019, 20, 238.
  41. Rannala, B.; Yang, Z. Inferring speciation times under an episodic molecular clock. Syst. Biol. 2007, 56, 453–466.
  42. Fukushima, K.; Pollock, D.D. Detecting macroevolutionary genotype–phenotype associations using error-corrected rates of protein convergence. Nat. Ecol. Evol. 2023, 7, 155–170.
  43. Puttick, M.N. MCMCtreeR: Functions to prepare MCMCtree analyses and visualize posterior ages on trees. Bioinformatics 2019, 35, 5321–5322.
  44. Mendes, F.K.; Vanderpool, D.; Fulton, B.; Hahn, M.W. CAFE 5 models variation in evolutionary rates among gene families. Bioinformatics 2020, 36, 5516–5518.
  45. Alexa, A.; Rahnenführer, J.; Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006, 22, 1600–1607.
  46. Supek, F.; Bošnjak, M.; Škunca, N.; Šmuc, T. REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms. PLoS ONE 2011, 6, e21800.
  47. Wang, Y.; Tang, H.; DeBarry, J.D.; Tan, X.; Li, J.; Wang, X.; Lee, T.-h.; Jin, H.; Marler, B.; Guo, H. MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012, 40, e49.
  48. Bai, C.; Alverson, W.S.; Follansbee, A.; Waller, D.M. New reports of nuclear DNA content for 407 vascular plant taxa from the United States. Ann. Bot. 2012, 110, 1623–1629.
  49. Garcia, S.; Hidalgo, O.; Jakovljević, I.; Siljak-Yakovlev, S.; Vigo, J.; Garnatje, T.; Vallès, J. New data on genome size in 128 Asteraceae species and subspecies, with first assessments for 40 genera, 3 tribes and 2 subfamilies. Plant Biosyst.-Int. J. Deal. All Asp. Plant Biol. 2013, 147, 1219–1227.
  50. Bureš, P.; Wang, Y.-F.; Horová, L.; Suda, J. Genome size variation in Central European species of Cirsium (Compositae) and their natural hybrids. Ann. Bot. 2004, 94, 353–363.
  51. Macas, J.; Novák, P.; Pellicer, J.; Čížková, J.; Koblížková, A.; Neumann, P.; Fukova, I.; Doležel, J.; Kelly, L.J.; Leitch, I.J. In depth characterization of repetitive DNA in 23 plant genomes reveals sources of genome size variation in the legume tribe Fabeae. PLoS ONE 2015, 10, e0143424.
  52. Piegu, B.; Guyot, R.; Picault, N.; Roulin, A.; Saniyal, A.; Kim, H.; Collura, K.; Brar, D.S.; Jackson, S.; Wing, R.A. Doubling genome size without polyploidization: Dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Res. 2006, 16, 1262–1269.
  53. Neumann, P.; Koblížková, A.; Navrátilová, A.; Macas, J. Significant expansion of Vicia pannonica genome size mediated by amplification of a single type of giant retroelement. Genetics 2006, 173, 1047–1056.
Figure 1. Morphology and K-mer analysis of C. nipponicum. (A) C. nipponicum plant in the reproductive stage on Ulleung Island, displaying flowers with emerging petals. (B) The genome size was estimated as 913 Mb with 1.43% heterozygosity using 31-mer.
Figure 2. Comparative genomics analysis of C. nipponicum and other plant species. (A) Phylogenetic relationships of 11 plant species with single-copy orthologs identified using OrthoFinder. The divergence times (in millions of years ago) were estimated using MCMCTree, and the blue bars indicate highest posterior density intervals of at least 95%. The numbers in green and red indicate the expanded and contracted gene families relative to the most recent common ancestor. Ea, early; La, late; Pa, Paleogene; Ne, Neogene; P, Paleocene; Eo, Eocene; O, Oligocene; Mi, Miocene. (B) Distribution of the synonymous substitution rates (Ks) for pairs of paralogs and orthologs in the four plants (C. nipponicum, C. cardunculus, C. solstitialis, and C. tinctorius).
Figure 3. GO enrichment analysis of significantly expanded and contracted gene families in C. nipponicum. The GO enrichment analysis was performed on the 1760 expanded (A) and 365 contracted (B) gene families using topGO. The GO terms with a p-value < 0.05 listed in Tables S4 and S5 were further analyzed using REVIGO to identify representative enriched GO terms. The size of each rectangle in the treemap corresponds to the p-value of each GO term.
Table 1. Statistics of the genome assembly and annotations.
Genome Assembly | Draft | Purge Haplotigs
Genome size (Mb) | 1487.8 | 929.4
Number of contigs | 5675 | 2199
N50 (bp) | 421,852 | 700,963
GC contents (%) | 36.01 | 35.76
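The N50 values in Table 1 follow the usual definition: the contig length at which contigs of that length or longer hold at least half of the assembled bases. The snippet below is a minimal sketch of that definition only, not the authors' pipeline; the function name `n50` and the toy contig lengths are hypothetical and are not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy contig lengths (bp), not taken from the C. nipponicum assembly.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Under this definition, removing many short haplotig contigs tends to raise the N50, which is consistent with the increase from the draft to the purged assembly in Table 1.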
Table 2. Statistics for genome assessment using BUSCO (eudicots).
Category | Number of BUSCOs (%)
Complete | 2212 (95.1)
Complete and single-copy | 1995 (85.8)
Complete and duplicated | 217 (9.3)
Fragmented | 19 (0.8)
Missing | 95 (4.1)
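The percentages in Table 2 can be reproduced from the counts alone. The sketch below assumes the total number of BUSCO groups is the sum of the single-copy, duplicated, fragmented, and missing counts in the table; the variable names are illustrative.

```python
# Counts copied from Table 2.
counts = {
    "Complete and single-copy": 1995,
    "Complete and duplicated": 217,
    "Fragmented": 19,
    "Missing": 95,
}

# Assumption: the four categories above partition all BUSCO groups searched.
total = sum(counts.values())
complete = counts["Complete and single-copy"] + counts["Complete and duplicated"]

percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}
percentages["Complete"] = round(100 * complete / total, 1)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The complete count is the sum of the single-copy and duplicated counts (1995 + 217 = 2212), which matches the first row of the table.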
Table 3. Statistics of non-coding RNA in C. nipponicum genome.
Type | Copy | Average Length (bp) | Total Length (bp)
rRNA | 771 | 139.66 | 107,681
tRNA | 1137 | 76.00 | 86,409
miRNA | 159 | 120.81 | 19,208
snRNA | 1907 | 113.67 | 216,769
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

