1. Introduction
Prunus tenella is one of the oldest relict plants left from the Tertiary Miocene epoch and is mainly distributed in China and Kazakhstan [
1]. In China,
P. tenella is an endangered and rare species with economic, scientific, and cultural importance and is scattered only in the northern mountainous areas of Xinjiang province [
2]. It is considered to be in danger of extinction due to natural and human-caused disturbances, and as a result, is on the list of nationally significant protected wild plants (
Figure 1).
As a wild relative of cultivated almond,
P. tenella has adapted to extremely harsh environments, displaying impressive cold and drought tolerance [
3]. For example, it can grow normally in the hills and mountains in Tacheng and Altay at extremely low temperatures reaching −35 °C. Additionally,
P. tenella possesses exceptional agronomic traits and unique genomic characteristics, making it highly valuable for diverse applications in various fields, such as food processing and medicine [
2]. These characteristics also provide valuable genetic material for studying the adaptive evolution of the
P. tenella genome and for the genetic improvement of cultivated almond, which could increase its resistance to biotic and abiotic stresses and its yield.
With the increased attention to wild resources, many desirable traits have been mined and applied to agricultural varieties [
4,
5,
6,
7,
8]. Moreover, many breakthrough varieties have also benefited from the discovery of wild-breeding genetic resources [
9,
10,
11,
12], especially for wheat, rice, soybean, and other economically important food crops [
13,
14,
15,
16]. Together, these lines of evidence have shown that introducing wild gene resources has greatly improved disease resistance, insect resistance, and the growth of cultivated varieties [
17,
18,
19,
20,
21].
In
P. tenella, some genes such as self-incompatibility-related genes SBPI, Petcullin1, SFB, SSK1, S-RNase, and the cold resistance-related gene AlsCBF1-A were identified by homologous cloning technology [
22,
23,
24]. Population diversity and structure were studied using molecular markers, including chloroplast DNA (cp DNA), ISSR, and SSR markers. There was a significant genetic structure among the wild almond populations, and the genetic variation mainly came from among the populations. Some regions with high genetic diversity were also found. These studies provide a basis for the distribution, evolution and conservation of wild almond populations [
25,
26]. However, thorough knowledge of the economically significant genetic features has been severely hampered by a paucity of genomic resources. More research into the evolutionary adaptations and the development process of distinct features is required before the genetic characteristics of
P. tenella can be properly analyzed.
Obtaining high-quality reference genome sequences is the key to revealing allelic variation, genetic relationship, and evolutionary history [
27,
28,
29,
30,
31]. The present paper describes a highly accurate chromosome-scale reference genome of
P. tenella assembled de novo using the Hi-C technique and long PacBio SMRT reads. Additionally, the genetic variation and geographical differentiation of 130 individual plants from eight wild natural populations were assessed using genome-wide high-resolution molecular markers, providing an important basis for understanding the origin, formation, and geographical distribution of
P. tenella.
3. Discussion
Through the use of various sequencing methods, we assembled the first complete reference genome for P. tenella in this work. The Prunus genus has several commercially and ecologically significant species in forestry and agriculture, and these data are essential for learning more about P. tenella and the genus as a whole. These findings will also aid in the development of genome-enabled P. tenella breeding initiatives. Last but not least, P. tenella’s status as a relict species makes it a useful model for studying the genetic basis of population formation, evolution, and adaptation to environmental effects under conditions of geographic isolation.
Given the high quality of our P. tenella genome assembly and the high-depth PacBio long-read and whole-genome re-sequencing data, we now have a comprehensive view of the P. tenella genome. Single-copy, multi-copy, and species-specific gene families were obtained, the evolutionary status was inferred, and the genome’s evolutionary history was traced, laying the foundation for further exploration and research. Additionally, we found many variant sites, including SNPs, insertions, deletions, and inversions. Many of these variants may be associated with phenotypic traits, which will help in understanding the phylogenetic evolution of P. tenella. Moreover, these variant sites can be used as molecular markers for germplasm identification, genetic analysis, functional gene mining, and assisted breeding. Compared with the genomes of P. persica and P. dulcis, we identified a large inversion on chromosome 5, which may be related to characteristics unique to P. tenella, such as dwarfing and freezing resistance. To verify that this inversion is genuine, we re-checked the assembly process and the contig connection. Since the assembled chromosome was composed of two contigs with a break point at about 7588 kb on one side, we took the 500 kb sequences on both sides of the break point and calculated the degree of linkage disequilibrium for the two possible connection modes. The degree of linkage disequilibrium of the current connection mode is 0.09415, which is substantially greater than that of the alternative connection mode (0.05116), indicating that our present assembly is reasonable.
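The break-point check described above can be illustrated with a minimal sketch. It assumes that the "degree of linkage disequilibrium" is summarized as the mean pairwise r² between SNPs on the two sides of the junction, with genotypes coded as alternate-allele dosages (0/1/2). The function names and the toy genotypes are hypothetical and are not taken from the original analysis.

```python
from itertools import product
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # monomorphic SNP: r^2 is undefined
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def mean_cross_ld(left_snps, right_snps):
    """Mean r^2 over all SNP pairs with one SNP on each side of the junction."""
    values = [r_squared(a, b) for a, b in product(left_snps, right_snps)]
    values = [v for v in values if v is not None]
    return mean(values) if values else float("nan")

# Toy data (assumption): one SNP per flank, six individuals.
left = [[0, 1, 2, 0, 1, 2]]
right_current = [[0, 1, 2, 0, 1, 2]]      # flank joined in the current orientation
right_alternative = [[2, 0, 1, 1, 2, 0]]  # flank joined in the other orientation

ld_current = mean_cross_ld(left, right_current)
ld_alternative = mean_cross_ld(left, right_alternative)
```

Under this assumption, the connection mode whose flanking SNPs show the higher mean r² is the one better supported by the population data.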
Through this analysis, we also found characteristics that distinguish the P. tenella genome from those of other species of the same genus. P. tenella had relatively more endemic gene families (288), whereas cultivated almond and peach had only 31–37, which may reflect the unique evolutionary characteristics of P. tenella. More gene families were contracted in P. tenella, which might be related to natural selection imposed by its extreme natural environment. The phylogenetic tree shows that P. tenella clustered into a large clade with the Amygdalus subgenus but diverged earlier than the other species of this subgenus, 13.4 million years ago, whereas cultivated almond diverged only 6.9 million years ago, indicating that P. tenella is a relatively old species. Nevertheless, P. tenella is still the closest relative of cultivated almond. This observation indicates that P. tenella has potential value as a source of genetic resources for cultivated almond, and it lays the foundation for further exploration and research.
Population genetic structure can be used to analyze the evolutionary dynamics of a population by describing gene transmission, gene frequency change, and genotype distribution [
32,
33,
34,
35]. Based on the SNP information derived from the whole-genome re-sequencing data, thousands of single SNP markers can be used for the fine-scale description of genetic structure [
36,
37,
58]. The results showed that, compared with the Tacheng (Nei’s = 0.18; Ho = 0.16) and Tuoli (Nei’s = 0.23–0.26; Ho = 0.16–0.22) populations, the Yumin populations (Nei’s = 0.26–0.3; Ho = 0.17–0.22) have relatively high genetic diversity. This observation is consistent with the study on genetic diversity using chloroplast sequences [
4]. Additionally, the results indicated high genetic differentiation among the natural populations of
P. tenella: the pairwise genetic differentiation (Fst) between regions is 0.23–0.32, and between the Tacheng and Yumin groups Fst reached 0.29–0.32, values much higher than Wright’s threshold for high differentiation [
39,
40,
41]. However, there is little differentiation between subgroups within the Tuoli and Yumin groups (0.05–0.1). These results suggest that geographical isolation is an important factor affecting the genetic evolution of
P. tenella. This high differentiation may result from long-term natural selection in the absence of gene flow.
Selective sweep analysis showed that the Tacheng population had fewer sites under selection, whereas the Tuoli and Yumin populations showed stronger signals of selection. This indicates a tendency toward decreased genetic polymorphism and increased genetic purity in the Yumin and Tuoli populations. Given the severe geographical isolation of P. tenella, this phenomenon may be related to small population size and inbreeding. Since Tuoli and Yumin are in the same mountain range and Tacheng is in another, this result also suggests that, under the influence of climatic and geographical conditions, evolution and selection among the groups are relatively independent, resulting in different evolutionary directions. In summary, we assembled the first chromosome-level genome of P. tenella and assessed the genetic variation and geographical differentiation of eight natural populations, which lays a solid foundation for future research on genetic improvement and on the mechanisms underlying important traits.
4. Materials and Methods
4.1. Utilized Materials
P. tenella sample materials used for genome assembly were obtained from the germplasm conservation nursery of Xinjiang Academy of Forestry Sciences, Xinjiang, China. Fresh leaves were used for Hi-C library construction, PacBio HiFi sequencing, and Illumina sequencing. To aid genome assembly and annotation, fruit, leaf, root, and stem tissues were collected for RNA-seq.
The fresh leaves used for whole genome re-sequencing were collected from Yumin County, Tuoli County, and Tacheng City, Xinjiang, China. A total of 8
P. tenella populations were collected, including 3 from Yumin County, 4 from Tuoli County, and 1 from Tacheng City (
Table 6). Approximately 15–18 samples were collected from each population. In addition, 7 cultivated almond samples were collected for population evolution analysis.
4.2. Genome Sequencing and Transcriptome Sequencing
The experiments were carried out in accordance with Illumina’s recommended methodology. Qualified genomic DNA was physically fragmented by sonication into fragments of 350 bp, and end repair, A-tailing, adapter ligation, target fragment selection, and PCR were then used to generate the short-fragment sequencing library. Using bridge PCR, the library was transferred to the sequencing chip, and paired-end 150 bp (PE 150) sequencing was performed on an Illumina sequencer.
Hi-C sequencing comprised cell cross-linking, endonuclease digestion, end repair, cyclization, DNA purification and capture, and on-machine sequencing. Full-length cDNA was synthesized from mRNA using the SMARTer™ PCR cDNA Synthesis Kit and then used to generate sequencing libraries. The whole transcriptome was sequenced on the PacBio system.
For re-sequencing of the population samples, sample quality testing, library construction, library quality testing, and library sequencing were all carried out following Illumina’s recommended approach. The DNA was first physically fragmented by sonication and purified; the ends were then repaired, an A was added to the 3’ end, and the sequencing adapter was ligated. Finally, agarose gel electrophoresis was used to select the fragment size, and PCR amplification was carried out to form the sequencing library.
Transcriptome sequencing of the stem, root, leaf, and fruit tissues was performed on the NovaSeq 6000 platform.
4.3. Assurance of Sequencing Data Quality
Low-quality sequences and duplicated reads were removed from the sequencing data using stringent, platform-specific filtering to ensure data integrity and accuracy. The following filtering criteria were applied to the Illumina Hi-Seq data. Firstly, polyG tails were removed. Secondly, paired reads of less than 100 bp in length were discarded. Thirdly, read pairs containing more than 10% of bases that are the same as the next base were removed. Fourthly, read pairs with over 50% low-quality bases (quality score less than 10) were discarded. Lastly, read pairs with an average quality score below 20 were removed. The Hi-C sequencing data went through a comparable filtering procedure before being processed in 3D. Subreads from the PacBio HiFi long reads were filtered and corrected with the default settings of the pbccs pipeline. Approximately 2000 PacBio HiFi (CCS) reads were randomly selected from the sequencing data and compared with the NT library to check the sequencing data for contamination.
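The Illumina filtering criteria can be sketched as follows. This is an illustrative sketch rather than the pipeline actually used: it covers polyG trimming, the 100 bp length cutoff, the low-quality-base fraction, and the mean quality cutoff, and it omits the third criterion. The minimum polyG run length of 10 and the choice to apply every rule to each mate of a pair are assumptions, and all function names are hypothetical.

```python
def trim_poly_g(seq, qual, min_run=10):
    """Remove a trailing run of G bases if it is at least min_run long (assumed cutoff)."""
    stripped = seq.rstrip("G")
    if len(seq) - len(stripped) >= min_run:
        return stripped, qual[:len(stripped)]
    return seq, qual

def read_passes(qual, min_len=100, low_q=10, max_low_frac=0.5, min_mean_q=20):
    """Apply the length, low-quality-fraction, and mean-quality rules to one read."""
    if len(qual) < min_len:
        return False
    low = sum(q < low_q for q in qual)
    if low / len(qual) > max_low_frac:
        return False
    return sum(qual) / len(qual) >= min_mean_q

def keep_pair(r1, r2):
    """r1, r2: (sequence, list of Phred scores). A pair is kept only if both mates pass."""
    mates = [trim_poly_g(seq, qual) for seq, qual in (r1, r2)]
    return all(read_passes(qual) for _, qual in mates)

# Toy reads (assumption): 150 bp mates with uniform Phred scores.
good = ("A" * 150, [30] * 150)
poly_g = ("A" * 95 + "G" * 55, [30] * 150)  # falls below 100 bp after trimming
kept = keep_pair(good, good)
dropped = keep_pair(good, poly_g)
```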
4.4. Heterozygosity and Genome Size Estimation
Heterozygosity and genome size were analyzed before HiFi library construction and sequencing. Using the Illumina data, Jellyfish v.2.2.10 [
42] was used to examine the frequency distribution of quality-filtered 21-mers. Then, based on Jellyfish’s results, GenomeScope 2.0 was used for genome analysis. This strategy provided genomic information for
P. tenella (
Supplementary Figure S1), such as heterozygosity, genome size, and proportion of repeat sequences.
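The idea behind estimating genome size from a k-mer frequency distribution can be shown with a deliberately simple sketch: the total number of k-mers divided by the depth of the main peak. This is not the mixture model that GenomeScope fits, and it does not estimate heterozygosity or repeat content. The error cutoff `min_depth` and the toy histogram are assumptions; in a heterozygous genome the highest peak may be the heterozygous one, which this sketch does not handle.

```python
def parse_histo(text):
    """Parse 'multiplicity count' lines, the two-column layout written by `jellyfish histo`."""
    hist = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist

def estimate_genome_size(hist, min_depth=5):
    """Return (peak depth, estimated genome size) from a k-mer histogram.

    K-mers with multiplicity below min_depth are treated as sequencing
    errors and ignored (assumed cutoff).
    """
    usable = {d: c for d, c in hist.items() if d >= min_depth}
    peak = max(usable, key=usable.get)
    total_kmers = sum(d * c for d, c in usable.items())
    return peak, total_kmers / peak

# Toy histogram (assumption): an error spike at low depth and a peak at 30x.
toy = parse_histo("""
1 1000000
2 200000
20 50
28 300
29 500
30 1000
31 500
32 300
40 50
""")
peak_depth, genome_size = estimate_genome_size(toy)
```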
4.5. Genome Assembly
Following correction and filtering, the HiFi circular consensus sequencing (CCS) reads were used for de novo assembly with hifiasm (v0.14-r312) with default parameters. Purge haplotigs was used to remove redundant haplotigs [
43]. The 3D de novo assembly (3D-DNA) software of Dudchenko et al. [
44] was used for scaffolding the haploid contigs. The Hi-C reads were aligned to the draft genome, and 3D-DNA was used to generate the candidate assembly. Juicebox Assembly Tools (JBAT) v1.9.8 [
45] was used to review the candidate assembly and correct it manually. The eudicotyledons_odb10 database was employed in conjunction with the BUSCO v3.0.2 (Benchmarking Universal Single-Copy Orthologs) [
46] algorithm for assessing genome completeness and gene annotation. BWA-MEM and HISAT2 (v2.1.0) [
47] were used to map the filtered Illumina short reads and the assembled transcripts to the assembly.
4.6. Repetitive Element Annotations
To annotate transposable elements (TEs) [
48], the EDTA genome annotation pipeline was used. TEs include retrotransposons and DNA transposons. RepeatModeler was used to identify retrotransposons, including long terminal repeat (LTR) elements and long interspersed nuclear elements (LINEs), and DNA transposons, including terminal inverted repeat (TIR) elements and Helitrons. To do this, we used RepeatMasker (v4.0.7) and Repbase with the optimal settings to generate a de novo repeat library for repeat sequence identification [
49,
50].
4.7. Functional Annotation and Gene Prediction
The HISAT2 (v2.1.0) and StringTie (v1.3.5) pipeline was used to map the RNA-seq data from the fruits, leaves, stems, and roots to the genome. Transcript-based gene prediction relied on de novo transcript assembly conducted with Trinity [
51]. The PASA (v2.4.1) pipeline and TransDecoder were also applied to annotate the coding regions of the transcripts [
52]. Homolog prediction was carried out with Exonerate v2.2.0; GlimmerHMM (v3.0.4) was also applied, and the protein sequences of
P. dulcis,
P. mira,
P. persica,
P. armeniaca,
P. mume, and
P. salicina were mapped to the genome [
53]. For de novo gene prediction, AUGUSTUS (v3.3.3) and SNAP were trained on genes from the PASA results [
54,
55]. EVidenceModeler (v1.1.1) was used to combine the gene models [
56]. The predicted protein sequences were compared against the EuKaryotic Orthologous Groups (KOG), Nr, Pfam, SwissProt, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Ontology (GO) databases to infer possible functions for the protein-coding genes.
4.8. Phylogenetic and Gene Family Analysis
Thirteen closely related species were selected for phylogenetic and gene family analysis along with
P. tenella. Additionally,
P. avium was selected as the outgroup. The genome database for Rosaceae “
www.rosaceae.org (accessed on 11 April 2023)”, and the NCBI database “
www.ncbi.nlm.nih.gov (accessed on 5 January 2023)” were used to obtain the protein sequences of these species. Alignment quality was ensured by excluding sequences with lengths < 100 bp. OrthoFinder (v2.5.2) was used to identify single-copy homologous genes and classify gene families, using the settings “-M msa -S diamond -T raxml-ng” [
57]. RAxML [
58] was used to estimate and evaluate the phylogenetic tree of the 14 species with 100 bootstrap replicates. Divergence times were estimated using MCMCTree in PAML [
59]. CAFE (v3.1) was used to examine the expansion and contraction of gene families, as described by Han et al. [
60]. By counting the number of ancestral gene families on each branch of the phylogenetic tree, we determined the rate at which gene family sizes shrank or grew. To reduce prediction errors, the script cafetutorial_clade_and_size_filter.py was used to filter out gene families with very high variation in gene copy number. Data on the contracted and expanded gene families of the 14 species were extracted using the script cafetutorial_report_analysis.py and then analyzed. For selected gene families, we used Fisher’s exact test to analyze GO functional enrichment.
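The GO enrichment test can be sketched with the hypergeometric tail that underlies a one-sided Fisher’s exact test. Whether the original analysis used a one-sided or a two-sided test, and how it corrected for multiple testing, is not stated, so the one-sided form here is an assumption. The function name and the example counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the selected set (e.g., an expanded gene family)
    k: selected genes annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy counts (assumption): 3 of 10 background genes carry the term,
# and all 3 selected genes carry it.
p_enriched = fisher_enrichment_p(k=3, n=3, K=3, N=10)
p_none = fisher_enrichment_p(k=0, n=3, K=3, N=10)
```

In practice, the p-values for all tested GO terms would typically be adjusted for multiple testing before interpretation.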
4.9. Whole-Genome Synteny Analysis
Almond and peach were selected for whole-genome duplication (WGD) analysis. Four-fold degenerate third-codon transversion (4DTv) values and the distribution of synonymous substitutions per synonymous site (Ks) were calculated to analyze genome duplication events. The YN substitution model was used to calculate the 4DTv rates based on four-fold degenerate sites. KaKs_Calculator (v2.0) [
61] with default parameters was used to calculate Ks values. The minimap2 software “
https://lh3.github.io/minimap2/minimap2.html (accessed on 15 November 2022)” was used for genome-wide comparison, and syri software “
https://github.com/schneebergerlab/syri (accessed on 12 November 2022)” was used to identify collinear regions between the two genomes, structural rearrangements (inversion, translocation, and duplication), local variations (SNP, indel, and CNV), and unaligned regions. The nucmer (4.0.0beta2) program in MUMmer4 [
62] was used to determine whether similar gene pairs on chromosomes were adjacent in different species.
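The 4DTv statistic can be illustrated with a minimal sketch that counts transversions at four-fold degenerate third-codon positions of two aligned, in-frame coding sequences. It returns the raw (uncorrected) proportion only; no substitution-model correction is applied, so it is not a reimplementation of the calculation above. It assumes the standard genetic code and gap-free codon alignments, and the function name is hypothetical.

```python
# Codon prefixes whose third position is four-fold degenerate in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}

def raw_4dtv(cds_a, cds_b):
    """Uncorrected 4DTv: transversions / four-fold degenerate sites."""
    sites = transversions = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Both codons must share the same four-fold degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        a, b = codon_a[2], codon_b[2]
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (a in PURINES) != (b in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")

# Toy gene pair (assumption): three four-fold sites, two transversions.
value = raw_4dtv("GCTGGAATGCCA", "GCAGGCATGCCA")
```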
4.10. Single-Nucleotide Polymorphism (SNP) Calling
Trimmomatic v0.36 was used to remove adaptors and low-quality sequences during preprocessing. The clean reads of every sample were mapped using Burrows-Wheeler Aligner to the
P. tenella reference genome. Next, Picard “
http://broadinstitute.github.io/picard/ (accessed on 12 November 2022)” was employed to identify PCR duplicates in the alignment results. SNP sites in the re-sequenced individuals from different geographical regions were identified using GATK v4 (Genome Analysis Toolkit). Per-sample VCF files were generated by variant calling with GATK HaplotypeCaller, and the VCF files for all 137 genomes were then combined into a single VCF file. The analysis was restricted to biallelic variant sites, and only SNPs with a Hardy–Weinberg equilibrium < 0.001, minor allele frequency > 0.05, and genotype missing rate of 10% were kept for further study.
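The site-level filters can be sketched on VCF-style genotype strings as follows. The sketch covers the biallelic, minor allele frequency, and missing-rate filters; the Hardy–Weinberg filter is omitted. Treating the 10% missing rate as an upper bound is an assumption, and the function names are hypothetical rather than part of GATK or any other tool.

```python
def parse_gt(gt):
    """Return allele indices for a GT string such as '0/1' or '1|1'; None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return [int(a) for a in alleles]

def keep_site(alt_alleles, genotypes, min_maf=0.05, max_missing=0.10):
    """Decide whether one variant site passes the filters.

    alt_alleles: list of ALT alleles at the site
    genotypes:   list of GT strings, one per sample
    """
    if len(alt_alleles) != 1:  # keep biallelic sites only
        return False
    called = [g for g in map(parse_gt, genotypes) if g is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:  # assumed: 10% is the maximum allowed
        return False
    alleles = [a for g in called for a in g]
    alt_freq = sum(alleles) / len(alleles)
    return min(alt_freq, 1 - alt_freq) > min_maf

# Toy site (assumption): 10 samples, 3 alternate alleles out of 20.
passes = keep_site(["T"], ["0/0"] * 8 + ["0/1", "1/1"])
```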
4.11. Phylogenetic Analysis
A phylogenetic tree was generated using the distance matrix produced by MEGA-CC3.5 software (MEGAX) [
63] with 1000 bootstrap replicates to assess the phylogenetic relationships among individuals and the evolutionary links between different populations. In addition, the SMARTPCA application included in the EIGENSOFT software “
https://github.com/chrchang/eigensoft (accessed on 16 November 2022)” was utilized to carry out principal component analysis (PCA) and ascertain the subpopulations’ clustering status [
64].
4.12. Population Genetic Structure and Genetic Diversity Analysis
To characterize the genetic diversity, structure, and differentiation of the populations, nucleotide diversity was assessed for each population in 100 kb windows with a 10 kb step [
65]. Using a model-based clustering strategy, the K-values (the hypothesized number of populations) ranged between 1 and 10 in ADMIXTURE [
66]. The optimal K-value was determined using cross-validation statistics across five separate runs. Bar graphs of the Q matrix for each K-value were made with the aid of the R package Pophelper “
http://royfrancis.github.io/pophelper (accessed on 25 November 2022)”. The fixation index (FST) and nucleotide diversity (π) were computed using VCFtools to identify genomic regions possibly affected by selective sweeps during adaptation.
4.13. Selective Sweep Analysis
Genome-wide detection of selective sweep regions was performed by calculating population genetic indices for all SNPs within a sliding window of 100 kb with a step of 10 kb. The indices include the population differentiation fixation index (Fst) and nucleotide polymorphism (π). The indices were calculated with the PopGenome package based on consensus SNPs with a pre-defined bin and step.
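The sliding-window scan can be sketched as follows, assuming per-SNP alternate-allele frequencies are already available for two populations. The estimators are deliberately simple: π is the summed expected heterozygosity 2p(1 − p) divided by window length, and Fst is a Nei-style (Ht − Hs)/Ht ratio of window sums, without sample-size correction. They are not necessarily the estimators implemented in PopGenome, and the function names and toy data are hypothetical.

```python
def sliding_windows(chrom_len, size=100_000, step=10_000):
    """Yield (start, end) half-open windows along a chromosome."""
    start = 0
    while start < chrom_len:
        yield start, min(start + size, chrom_len)
        start += step

def window_stats(snps, chrom_len, size=100_000, step=10_000):
    """snps: list of (position, p1, p2), alt-allele frequencies in two populations.

    Returns a list of (start, end, pi_pop1, fst) tuples, one per window.
    """
    results = []
    for lo, hi in sliding_windows(chrom_len, size, step):
        inside = [(p1, p2) for pos, p1, p2 in snps if lo <= pos < hi]
        # Nucleotide diversity of population 1, per base pair.
        pi1 = sum(2 * p1 * (1 - p1) for p1, _ in inside) / (hi - lo)
        # Mean within-population and total expected heterozygosity.
        hs = sum(p1 * (1 - p1) + p2 * (1 - p2) for p1, p2 in inside)
        ht = sum(2 * ((p1 + p2) / 2) * (1 - (p1 + p2) / 2) for p1, p2 in inside)
        fst = (ht - hs) / ht if ht > 0 else float("nan")
        results.append((lo, hi, pi1, fst))
    return results

# Toy data (assumption): a fixed difference at position 5 and a shared
# polymorphism at position 15, scanned with 10 bp windows.
toy_snps = [(5, 1.0, 0.0), (15, 0.5, 0.5)]
stats = window_stats(toy_snps, chrom_len=20, size=10, step=10)
```

Windows that combine high Fst with low π in one population are the usual candidates for selective sweep regions.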