Chromosome-Level Genome Assembly and Comparative Genomic Analysis of the Barbel Chub (Squaliobarbus curriculus) by Integration of PacBio Sequencing and Hi-C Technology

Zhang, Baidong; Sun, Yanling; Liu, Yang; Song, Xiaojun; Wang, Su; Xiao, Tiaoyi; Nie, Pin

doi:10.3390/fishes9080327


by Baidong Zhang ¹, Yanling Sun ¹, Yang Liu ¹, Xiaojun Song ¹, Su Wang ¹, Tiaoyi Xiao ² and Pin Nie ¹,*

¹ School of Marine Science and Engineering, Qingdao Agricultural University, Qingdao 266109, China

² Fisheries College, Hunan Agricultural University, Changsha 410128, China

* Author to whom correspondence should be addressed.

Fishes 2024, 9(8), 327; https://doi.org/10.3390/fishes9080327

Submission received: 28 April 2024 / Revised: 5 August 2024 / Accepted: 16 August 2024 / Published: 20 August 2024

(This article belongs to the Special Issue Advances in Fish Genome and Transcriptomes)


Abstract

The barbel chub (Squaliobarbus curriculus), the only species in the genus, is widely distributed in freshwater lakes and rivers at different latitudes in East Asia, with fishery and biodiversity importance, and is an emerging commercially important fish in China. However, the resource of this species has dramatically declined due to anthropogenic activities such as over-exploitation and water pollution. Genomic resources for S. curriculus are useful for the management and sustainable utilization of this important fish species, and also for a better understanding of its genetic variation in the region. Here, we report the chromosome-level assembly of the S. curriculus genome obtained from the integration of PacBio long-read sequencing and Hi-C technology. A total of 155.34 Gb of high-quality PacBio sequences were generated, and the preliminary genome assembly was 894.95 Mb in size with a contig N50 of 20.34 Mb. By using Hi-C data, 99.42% of the assembled sequences were anchored to 24 pseudochromosomes, with chromosome lengths ranging from 27.22 to 58.75 Mb. A total of 25,779 protein-coding genes were predicted, 94.70% of which were functionally annotated. Moreover, S. curriculus shows resistance to grass carp haemorrhagic disease (GCHD) caused by grass carp reovirus (GCRV), which seriously hinders current and future commercial grass carp production. Phylogenetic analysis indicated that S. curriculus diverged from grass carp (Ctenopharyngodon idellus) approximately 20.80 million years ago. Annotations of the expanded gene families were found to be largely enriched in immune-related KEGG pathway categories. Moreover, a total of 18 Toll-like receptor (TLR) genes were identified from the whole genome of S. curriculus. The high-quality genome assembled in this study will provide a valuable resource for accelerating ecological, evolutionary, and genetic research on S. curriculus.

Keywords:

genome assembly; barbel chub; Squaliobarbus curriculus; toll-like receptor; comparative genomics

Key Contribution: The genome of a cyprinid fish, the barbel chub Squaliobarbus curriculus, was sequenced, and 25,779 protein-coding genes were predicted. Phylogenetic analysis indicated that this fish is closely related to grass carp Ctenopharyngodon idellus, from which it diverged 20.80 million years ago.

1. Introduction

East Asia has a rich cyprinid fish fauna. The barbel chub (Squaliobarbus curriculus; Figure 1) is the only species in the genus, belonging to the subfamily Leuciscinae (Teleostei: Cyprinidae) in the East Asian group of cyprinid fish. The fish is characterized by red spots on the superior border of its eyes and is widely distributed in freshwater lakes and rivers from North China, such as in Heilongjiang River, to the south in Hainan Island [1]. The fish spawns drifting eggs from April to July and is omnivorous, feeding on algae, macrophytes, and aquatic insects, as well as organic detritus [1,2,3]. S. curriculus is listed as Least Concern according to a recent assessment of the IUCN Red List of Threatened Species in 2020 (https://www.iucnredlist.org/, accessed on 1 April 2020). Nevertheless, S. curriculus is important in capture fisheries, and its natural resource has declined dramatically over the last few decades [1,3,4,5].

As a consequence, efforts have been made to understand the general biology of S. curriculus and its reproduction, culture, and disease resistance. Xiang and He [6] described the external morphology of S. curriculus in detail and investigated its biological characteristics, such as growth, reproduction, and feeding habits. Yang and Zheng [7] considered that this fish should have potential for commercial culture because of its high nutritional composition and fillet yield. The artificial reproduction of this fish was successfully achieved in 2003, and its pond culture has become a current practice for aquaculture in China [8,9,10,11,12,13,14,15]. According to the latest data, the current production of S. curriculus from culture-based fisheries is estimated to reach 40 kg per cubic meter, showing desirable economic benefits in the aquaculture industry. Interestingly, S. curriculus, in culture, shows resistance to grass carp haemorrhagic disease (GCHD) caused by grass carp reovirus (GCRV). Grass carp (Ctenopharyngodon idellus) has the highest yield among all freshwater fish species in aquaculture according to the Fishery Statistical Yearbook of China (2023) [16,17]. It is reported that GCRV causes severe viral disease, which seriously hinders current and future commercial grass carp production [17]. Hybrids between S. curriculus and C. idellus, both of which belong to the subfamily Leuciscinae in the Cyprinidae, can be obtained artificially, and the hybrids also show resistance to GCRV [8]. Meanwhile, more studies have been devoted to understanding the composition and function of immune genes in S. curriculus, such as Toll-like receptors (TLRs), interferons (IFNs), and interferon regulatory factors (IRFs) [17,18,19,20,21,22,23]. In addition, attempts have been made to increase the restocking of this fish species in lakes as well as in rivers through artificial reproduction [24,25].

However, studies on the genetic variation or population structure of S. curriculus have, so far, rarely been reported, and only limited genetic markers from narrow regions of the genome are available [26,27,28]. The lack of whole-genome information has therefore, to some extent, impeded the examination of genetic, molecular, and immune differences between S. curriculus and C. idellus, the genome of which has been well characterized [29,30], and it has also affected the restoration of local genetic resources of S. curriculus.

The rapid development of high-throughput sequencing technology, especially long-read sequencing, offers advantages over short-read sequencing because long reads help improve de novo assembly, mapping certainty, variant detection, and transcript isoform identification [31,32,33]. Single-molecule real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio, Menlo Park, CA, USA), makes it possible to resolve complex genomic regions and to detect gene isoforms, and it has been successfully used to construct high-quality genome assemblies for many fish species [33,34,35,36,37]. In this study, the chromosome-level reference genome sequence of S. curriculus was assembled with the integration of PacBio SMRT and high-throughput chromosome conformation capture (Hi-C) technologies. The completeness and continuity of the high-quality reference genome should be bioinformatically valuable for phylogenetic, biological, and immunological research on S. curriculus, and for the development of genome-scale disease-resistance strategies in economically important cyprinid fish species.

2. Materials and Methods

2.1. Ethics Statement

The care and use of experimental animals were conducted by following national laboratory animal guidelines and policies as approved by the School of Marine Science and Engineering, Qingdao Agricultural University in April 2020 with the code of 20200403.

2.2. Sample DNA and RNA Extraction

A single live adult female fish was collected in January 2020 from Suya Lake, located in Henan Province, China (Figure 1). The sampled individual was first identified based on its morphological characteristics and then sacrificed; the muscle, heart, liver, spleen, intestine, and kidney were dissected and stored in liquid nitrogen until DNA and RNA extraction. Muscle tissue below the dorsal fin was used for DNA sequencing, while all the organs/tissues were used for transcriptome sequencing. Genomic DNA was extracted using the standard phenol/chloroform extraction method to construct the DNA sequencing library. The quality and concentration of the genomic DNA were assessed using 1% agarose gel electrophoresis, a Nanodrop 2000 spectrophotometer, and Qubit fluorometric quantitation (Novogene Bioinformatics Institute, Beijing, China). This high-quality DNA was used for subsequent PacBio, Illumina, and Hi-C sequencing (Novogene as indicated above). Total RNA was extracted from six tissues of S. curriculus, including muscle, heart, liver, spleen, intestine, and kidney, by using TRIzol reagent (Invitrogen, Waltham, MA, USA). RNA quality was checked with a Nanodrop 2000 spectrophotometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).

2.3. Library Construction and Genome Sequencing

The extracted DNA was sequenced with both the Illumina NovaSeq-6000 (Illumina Inc., San Diego, CA, USA) and PacBio Sequel II platforms at Novogene Bioinformatics Institute (Beijing, China) to generate short and long genomic reads, respectively. Illumina sequencing libraries were prepared for the estimation of genome size, heterozygosity level, and repeat content, as well as for genome-assembly correction. A paired-end library (2 × 150 bp) with an insert size of ~350 bp was generated according to the Illumina standard protocol for the NovaSeq-6000 platform. To avoid low-quality reads with artificial bias, the following filtering criteria were applied: (1) reads with adapter contamination were removed; (2) reads with ≥10% unidentified nucleotides were removed; (3) read pairs in which >20% of the bases of either read had a Phred quality < 5 were discarded; (4) putative duplicated read pairs generated by PCR amplification were discarded. After quality control, the remaining clean reads were used for subsequent analysis [38,39]. For long-read sequencing, a SMRTbell library with a fragment size of 20 kb was constructed for the PacBio platform according to the manufacturer’s protocols, providing ultra-long genomic sequences for the subsequent genome assembly.
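Criteria (2) and (3) above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study; the function names and the representation of a read as a (sequence, quality list) pair are assumptions made for illustration, and adapter removal and PCR-duplicate detection are not covered.

```python
def read_passes(seq, quals, max_n_frac=0.10, low_q=5, max_low_frac=0.20):
    """Apply criteria (2) and (3) to a single read.

    seq   -- nucleotide string
    quals -- list of integer Phred scores, one per base
    """
    if not seq or len(seq) != len(quals):
        return False
    # Criterion (2): remove reads with >= 10% unidentified nucleotides.
    n_frac = sum(base in "Nn" for base in seq) / len(seq)
    if n_frac >= max_n_frac:
        return False
    # Criterion (3): reject if > 20% of bases have Phred quality < 5.
    low_frac = sum(q < low_q for q in quals) / len(quals)
    return low_frac <= max_low_frac


def keep_pair(read1, read2):
    """Keep a read pair only if both mates pass the per-read checks."""
    return read_passes(*read1) and read_passes(*read2)
```

In this sketch a pair is discarded as soon as either mate fails, which matches the wording of criterion (3).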

2.4. Genome Size Estimation and Genome Assembly

A k-mer-based method was used for the genome survey to estimate the genome size of S. curriculus. Jellyfish v2.2.7 was used to count k-mers [40]. The paired-end Illumina short-read data were used to estimate the genome size, heterozygosity, and repeat content of the S. curriculus genome with GCE v1.0.2 [41]. To generate a high-quality contig assembly, wtdbg2 v2.5 was used to assemble the S. curriculus genome from the long reads generated on the PacBio Sequel II platform [42]. After genome assembly, Quiver (SMRT Link v5.0.1) [43] was applied to polish the assembly with the PacBio data. The Purge Haplotigs software [44] was used to remove redundant heterozygous sequences (haplotigs) from the assembly. The remaining errors were corrected with the Illumina short reads using Pilon v1.22 [45], yielding the final draft genome assembly of S. curriculus.
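Assembly contiguity is summarized in this study by the N50 statistic. As a reminder of how that value is defined, the following Python sketch computes N50 from a list of contig (or scaffold) lengths; the function name is illustrative and the example lengths in the usage note are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths supplied")
```

For the toy lengths 10, 8, 6, 4, and 2 (total 30), the two longest sequences already cover 18 of 30, so the N50 is 8.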

2.5. Hi-C Analysis and Chromosome-Level Genome Assembly

The Hi-C library for the Illumina NovaSeq-6000 sequencing platform was constructed to generate a chromosome-scale assembly of the genome. The sequencing reads were first mapped to the polished S. curriculus genome with BWA v0.7.8 [46]. Then, the low-quality and potential duplicate reads were eliminated to build raw inter/intra-chromosomal contact maps. Finally, the “configure” shell of the LACHESIS software v201701 was applied for clustering, ordering, and orienting according to the agglomerative hierarchical clustering algorithm [47]. The scaffolds were clustered into 24 pseudochromosomes based on the diploid chromosome number of 48 (2n = 48), previously determined from kidney cells of S. curriculus after injection of phytohaemagglutinin (PHA) and colchicine [48].
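The clustering step can be illustrated with a toy Python sketch of average-linkage agglomerative clustering on a contig-by-contig contact matrix. This is only a conceptual illustration under simplifying assumptions: the function name and the matrix are hypothetical, and the scaffolding software used in this study normalizes contact counts and performs ordering and orientation, none of which is reproduced here.

```python
def cluster_contigs(links, n_groups):
    """Greedy average-linkage clustering of contigs.

    links    -- symmetric matrix; links[i][j] is the contact score of contigs i, j
    n_groups -- target number of clusters (e.g. the haploid chromosome number)
    """
    clusters = [[i] for i in range(len(links))]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average contact score between all members of the two clusters.
                total = sum(links[i][j] for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        # Merge the most strongly linked pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

With four contigs in which contigs 0 and 1 share many contacts and contigs 2 and 3 share many contacts, asking for two groups returns the two expected pairs.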

2.6. Assessment of the Genome Assembly

To assess the quality of the genome assembly, three methods were applied to evaluate the completeness and accuracy of the S. curriculus genome. Initially, all Illumina read pairs were mapped onto the S. curriculus genome assembly for SNP calling to evaluate the within-individual heterozygosity rate and to assess mapping rate and average sequence depth, as well as coverage in the long-read data, using BWA and SAMtools v0.1.19 [49,50]. Then, the completeness of the genome assembly was further assessed by performing the Benchmarking Universal Single-Copy Orthologs analysis (BUSCO v3.0.2, database: vertebrata_odb9) [51] to search against a database consisting of 2586 single-copy orthologues. Last, the Core Eukaryotic Genes Mapping Approach (CEGMA) was applied based on a core set of 248 CEGs (Core Eukaryotic Genes) from six eukaryotic model organisms [52].
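The depth and coverage statistics mentioned above can be sketched as follows, assuming that a per-base depth value is already available for every position of the assembly (for example, exported from a depth report). The function name and the toy input are illustrative only.

```python
def depth_summary(depths, min_depth=10):
    """Summarize per-base read depths.

    Returns (mean depth, fraction of positions covered by at least one read,
    fraction of positions covered by at least `min_depth` reads).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    mean_depth = sum(depths) / n
    covered = sum(d > 0 for d in depths) / n
    well_covered = sum(d >= min_depth for d in depths) / n
    return mean_depth, covered, well_covered
```

For the six toy positions with depths 0, 5, 10, 20, 40, and 45, the mean depth is 20, five of six positions are covered, and four of six are covered by 10 or more reads.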

2.7. Repeat Annotation, Gene Prediction, and Functional Annotation

A combined strategy based on homology alignment and de novo search was applied for whole-genome repeat annotation. Homology-based prediction was performed against the Repbase database [53] using RepeatMasker v4.1.0 [54] and its in-house scripts (RepeatProteinMask) with default parameters to extract repeat regions. For ab initio prediction, a de novo repetitive element database was built with LTR_FINDER v1.0.6 [55], RepeatScout v1.0.5 [56], and RepeatModeler v2.0.1 [48] with default parameters; all repeat sequences with lengths >100 bp and less than 5% gap “N” content constituted the raw transposable element (TE) library. The repeat elements within the library were then classified using a homology search against Repbase. Meanwhile, the combination of Repbase and the de novo TE library was processed by uclust to yield a non-redundant library, which was supplied to RepeatMasker for DNA-level repeat identification [57]. Additionally, tandem repeats were identified ab initio using Tandem Repeats Finder (TRF v4.09).
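The length and gap criteria for the raw TE library can be expressed as a short Python sketch. The function name is hypothetical, and the input is assumed to be a plain nucleotide string for each candidate repeat.

```python
def keep_repeat(seq, min_len=100, max_n_frac=0.05):
    """Keep a candidate repeat if it is longer than `min_len` bp and
    contains less than `max_n_frac` gap characters ("N")."""
    if len(seq) <= min_len:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    return n_frac < max_n_frac
```

A 101 bp sequence without gaps is kept, whereas a 100 bp sequence or a 101 bp sequence containing six "N" characters is dropped.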

Structural annotation of the genome was performed by combining homology-based prediction, ab initio prediction, and RNA-Seq-assisted prediction. For homology-based prediction, homologous protein sequences of six species (Ctenopharyngodon idellus, http://www.ncgr.ac.cn/grasscarp/index.html, accessed on 1 April 2020; Cyprinus carpio, GCF_000951615.1; Carassius auratus, GCF_003368295.1; Danio rerio, GCF_000002035.6; Onychostoma macrolepis, GCA_012432095.1; Sinocyclocheilus grahami, GCA_001515645.1) were downloaded to confirm the completeness of the gene set. Protein sequences were aligned to the genome using TBLASTN v2.2.26 (E-value ≤ 1 × 10⁻⁵), and then the matching proteins were aligned to the homologous genome sequences for accurate spliced alignments with GeneWise v2.4.1 [58], which was used to predict the gene structure contained in each protein region. For ab initio gene prediction, five programs, including Augustus v3.2.3 [59], Geneid v1.4 [60], Genscan v1.0 [61], GlimmerHMM v3.04 [62], and SNAP v2013.11.29 [63], were used to predict coding regions from the repeat-masked genome. For the transcriptome-based approach, RNA-seq data from Illumina were mapped to the assembled genome with TopHat v2.0.13 [64], followed by Cufflinks v2.1.1 [65]. In addition, Trinity v2.1.1 [66] was used to assemble the RNA-seq data, and its output was used to create pseudo-expressed sequence tags, which were then mapped to the assembly. Gene models were predicted by using the Program to Assemble Spliced Alignments (PASA) genome-annotation tool. Finally, the non-redundant reference gene set was generated by merging the genes predicted by the three methods with EvidenceModeler v1.1.1 (EVM) [67].

Gene functions were assigned according to the best match by aligning the protein sequences to Swiss-Prot using BLASTP (with a threshold of E-value ≤ 1 × 10⁻⁵). The motifs and domains were annotated using InterProScan v5.35-74.0 [68] by searching against publicly available databases, including ProDom, PRINTS, Pfam, SMART, PANTHER, and PROSITE. The Gene Ontology (GO) IDs [69] for each gene were assigned according to the corresponding InterPro entry. The protein function was predicted by transferring annotations from the closest BLAST hit (E-value < 1 × 10⁻⁵) in the Swiss-Prot database [70] and the closest BLAST hit (E-value < 1 × 10⁻⁵) in the NR database. The gene set was mapped to KEGG pathways, and the best match was identified for each gene [71]. For non-coding RNA annotation, tRNAs were predicted using the program tRNAscan-SE v1.4 [72]. Considering that rRNAs are highly conserved, rRNA sequences from related species such as grass carp, common carp, and zebrafish, as indicated above, were chosen as references to predict rRNA sequences by using BLAST. Other ncRNAs, including miRNAs and snRNAs, were identified by searching against the Rfam database with default parameters using the Infernal software [73].
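The best-match rule described above can be sketched in Python for BLAST results stored in the standard 12-column tabular format, in which column 11 is the E-value and column 12 is the bit score. The function name and the identifiers in the usage note are hypothetical, and choosing the highest bit score as the "best" hit is an assumption of this sketch rather than a detail stated in the text.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} with the top-scoring hit
    per query among hits whose E-value does not exceed `max_evalue`."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Given two significant hits for a query "g1" and one non-significant hit for "g2", the function keeps only the higher-scoring hit for "g1" and leaves "g2" unannotated.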

2.8. Phylogenetic Analysis and Estimation of Divergence Time

To identify the evolutionary relationship of S. curriculus, the phylogenetic tree was constructed using the shared single-copy orthologous genes. Summary information of the species used in the phylogenetic analysis is listed in Table S1. Orthologous relationships between genes of Ctenopharyngodon idellus (http://www.ncgr.ac.cn/grasscarp/index.html), Cyprinus carpio (GCF_000951615.1), Carassius auratus (GCF_003368295.1), Danio rerio (GCF_000002035.6), Sinocyclocheilus grahami (GCA_001515645.1), O. macrolepis (GCA_012432095.1), Anabarilius grahami (GCA_003731715.1), Labeo rohita (GCA_017311145.1), Triplophysa tibetana (GCA_008369825.1), Clupea harengus (GCF_900700415.1), Periophthalmus magnuspinnatus (GCF_009829125.1), Salmo salar (GCF_000233375.1), Lepisosteus oculatus (GCF_000242695.1), Acipenser ruthenus (GCF_010645085.1), and Branchiostoma belcheri (GCF_001625305.1) were inferred through all-against-all protein sequence similarity searches with OrthoMCL v1.4, with the longest predicted transcript selected per locus [74]. For each orthologue, alignment was conducted using Muscle v3.8.31 [75], and ambiguously aligned positions were trimmed using Gblocks v0.91b [76], with the tree inferred using RAxML v8.2.12 [77]. Divergence times between species were calculated using the MCMCTree program implemented in PAML v4.9 [78].
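The selection of shared single-copy orthologous genes can be sketched in Python, assuming that the clustering output has already been parsed into a mapping from an orthogroup ID to a list of (species, gene) pairs. The data structure, function name, and identifiers are illustrative assumptions.

```python
from collections import Counter


def single_copy_groups(orthogroups, species):
    """Return the orthogroups that contain exactly one gene from every
    species in `species` and no genes from any other species."""
    wanted = set(species)
    selected = {}
    for group_id, members in orthogroups.items():
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == wanted and all(c == 1 for c in counts.values()):
            selected[group_id] = dict(members)  # species -> gene
    return selected
```

Only the groups returned by this filter would be aligned, trimmed, and concatenated for tree inference; groups with duplicated or missing genes are skipped.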

2.9. Expansion, Contraction, and Identification of Gene Family

Gene family evolution was modelled as a stochastic birth-and-death process, in which a gene family either expands or contracts per gene per million years independently along each branch of the phylogenetic tree, using the likelihood model implemented in the software package CAFE v4.2 [79]. The phylogenetic tree topology and branch lengths were considered to infer the significance of changes in gene family size in each branch. In addition, specific gene families were further identified on the basis of annotations for expanded and contracted genes of this species. The corresponding protein sequences of selected bony fishes and model vertebrate species were retrieved from the National Center for Biotechnology Information (NCBI) database. These sequences were then used to identify the gene families in S. curriculus. Two methods were utilized to search the S. curriculus protein sequences. The local BLASTP method was first used with a threshold E-value < 1 × 10⁻⁵, and then a Hidden Markov Model (HMM) was used to search against the S. curriculus protein sequences. Afterwards, the putative protein sequences were submitted to SMART for further confirmation of conserved protein domains [80]. The Neighbor-joining tree of Toll-like receptors in S. curriculus was constructed with 1000 bootstrap replications using MEGA-X based on the full-length protein sequences. The exon-intron structures of these genes were graphically displayed by TBtools v2.096 [81]. The summary of software versions and parameters used in this study is listed in Table S2.

3. Results

3.1. De Novo Genome Assembly of S. curriculus

A total of 43.88 Gb of data were generated from the Illumina NovaSeq-6000 platform with an insert size of 350 bp, representing 47.40-fold coverage of the S. curriculus genome (Table 1). A total of 35,987,938,045 17-mers were counted, with a k-mer peak at a depth of 38 (Figure S1 and Table S3). The estimated genome size was ~925.66 Mb with heterozygosity of 0.32% and repeat content of 49.48% (Table S3). In total, 155.34 Gb of high-quality data were generated using the PacBio Sequel II platform from the long-read library, representing 167.82-fold coverage of the S. curriculus genome (Table 1; Table S4). These data were assembled, and a high-quality genome of S. curriculus was obtained with a total length of ~894.95 Mb, a contig N50 of 20.34 Mb, and a scaffold N50 of 35.35 Mb (Table S5). To evaluate the accuracy and completeness of the initial genome assembly, the Illumina reads were realigned to the genome assembly of S. curriculus. The results showed that 98.49% of the Illumina reads were successfully mapped to the assembled genome with a genome coverage rate of 99.23% (Table S6). An average depth of 43.90× was obtained, and approximately 98.84% of the genome assembly was covered by 10 or more reads (Table S6). Moreover, the genome assembly had a heterozygous SNP rate of 0.13% and a low homozygous SNP rate of 0.00052%, which further validated the reliability of the S. curriculus genome (Table S7). The BUSCO analysis showed that 97.20% of the BUSCOs (Benchmarking Universal Single-Copy Orthologs) were found complete in the genome assembly, including 94.40% complete and single-copy and 2.80% complete and duplicated genes (Table 2). With the CEGMA analysis, 234 CEGs were identified, accounting for 94.35% of a highly conserved 248 CEG dataset (Table S8).
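The relationship between the k-mer statistics, the genome size, and the reported fold coverage can be checked with a short Python sketch. The simple ratio of total k-mers to peak depth is a textbook approximation and is an assumption here; the survey software applies its own corrections, which are not detailed in the text, so the naive value is expected to differ somewhat from the reported ~925.66 Mb.

```python
def naive_genome_size(total_kmers, peak_depth):
    """Uncorrected k-mer estimate: genome size ~ total k-mers / peak depth."""
    return total_kmers / peak_depth


def fold_coverage(data_gb, genome_mb):
    """Sequencing depth as data volume (Gb) divided by genome size (Mb)."""
    return data_gb * 1000 / genome_mb


# Values reported in this section.
size_mb = naive_genome_size(35_987_938_045, 38) / 1e6   # about 947.05 Mb
illumina_fold = fold_coverage(43.88, 925.66)             # about 47.40
pacbio_fold = fold_coverage(155.34, 925.66)              # about 167.82
```

The fold-coverage values reproduce the figures given for the Illumina and PacBio data when the estimated genome size of 925.66 Mb is used as the denominator.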

The Hi-C scaffolding approach was employed to anchor the contigs or scaffolds of the draft assembly to chromosomes in order to obtain a chromosome-level genome of S. curriculus. A total of 145.69 Gb of data (157.39×) were generated from the Hi-C library (Table 1). By using LACHESIS, 99.42% of the assembled sequences were anchored to 24 pseudochromosomes, with chromosome lengths ranging from 27.22 Mb to 58.75 Mb (Table S9). To further evaluate the quality of the chromosome-level genome assembly, a genome-wide Hi-C heatmap was generated. The 24 pseudochromosomes could be distinguished easily, and the interaction signal strength around the diagonal was considerably stronger than that of other positions, indicating the high quality of the S. curriculus genome assembly (Figure 2).

3.2. Repeat Annotation, Gene Prediction, and Functional Annotation

A total of 424.37 Mb of repeat sequences were detected, accounting for 47.42% of the assembled genome (Table S10). This repeat content was close to the value (49.48%) obtained from the k-mer analysis. The predominant repeat types were long terminal repeats (LTRs) (340.52 Mb; 38.05%), and the other transposable elements mainly consisted of DNA-transposable elements (DNA TE) (56.93 Mb; 6.36%), short interspersed elements (SINEs) (0.63 Mb; 0.07%), and long interspersed elements (LINEs) (15.79 Mb; 1.76%) (Table 3). A total of 25,779 protein-coding genes were predicted based on the combination of ab initio, homology-based, and RNA-Seq-assisted methods. The average values of the gene length, exon length, and intron length were found to be 15,553.78 bp, 173.11 bp, and 1736.53 bp, respectively (Table S11; Figure S2). Moreover, 94.70% (24,402/25,779) of the predicted genes were successfully annotated by alignment to databases, including Swiss-Prot, NR, KEGG, InterPro, GO, and Pfam (Table 4). The noncoding RNA prediction enabled the identification of 3098 tRNAs, 410 rRNAs, and 2654 microRNAs (Table S12).

3.3. Comparative Genomics

A total of 35,018 homologues were found by comparison with the genomes of the 15 other species, 3617 of which were shared among all 16 species (Table S13; Figure S3). In addition, the 25,779 genes of S. curriculus were clustered into 17,089 gene families, and a total of 12,577 gene families were found to be shared among five species in the Cyprinidae (Figure S4). The phylogenetic tree showed that S. curriculus and C. idellus were two closely related fish species in the Leuciscinae, and the divergence time between them was 20.80 Ma (Figure 3). The S. curriculus genome displayed 618 expanded gene families and 3742 contracted gene families compared with the most recent common ancestor shared with C. idellus (Figure 3). The expanded gene families of S. curriculus were significantly enriched in 115 GO terms and 72 KEGG pathways (p-value < 0.05), mainly including protein serine/threonine phosphatase complex (GO:0008287, p-value = 1.09 × 10⁻¹³), protein phosphatase Type 2A complex (GO:0000159, p-value = 1.38 × 10⁻¹²), MAP kinase activity (GO:0004707, p-value = 2.47 × 10⁻⁶), protein phosphatase regulator activity (GO:0019888, p-value = 2.56 × 10⁻⁶), protein serine/threonine kinase activity (GO:0004674, p-value = 6.09 × 10⁻⁵), and regulation of the Toll-signaling pathway (GO:0008592, p-value = 8.28 × 10⁻⁴), which were closely linked to protein kinase activity and the immune system (Tables S14 and S15). In contrast, the enrichment analyses showed that only 44 GO terms and 19 KEGG pathways identified from the contracted gene families were significantly enriched (p-value < 0.05), which were associated with GTP binding (GO:0005525, p-value = 1.96 × 10⁻¹⁶), guanyl nucleotide binding (GO:0019001, p-value = 1.96 × 10⁻¹⁶), cell adhesion (GO:0007155, p-value = 2.63 × 10⁻¹²), cell–cell adhesion (GO:0098609, p-value = 9.70 × 10⁻¹²), homophilic cell adhesion (GO:0007156, p-value = 1.27 × 10⁻¹¹), and ion binding (GO:0043167, p-value = 7.69 × 10⁻⁶) (Tables S16 and S17).
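The statistical test behind these enrichment p-values is not specified in the text. One commonly used choice is the one-sided hypergeometric test, sketched below in Python purely as an assumption; the function name, argument names, and the toy numbers in the usage note are illustrative and are not taken from this study.

```python
from math import comb


def hypergeom_enrichment(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N -- number of annotated genes in the background
    K -- background genes annotated with the term of interest
    n -- genes in the test set (e.g. genes in expanded families)
    k -- test-set genes annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper index,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

For a toy background of 10 genes, 5 of which carry the term, drawing 2 genes that both carry the term has a probability of 10/45, or about 0.22.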

Notably, as compared with the most recent common ancestor of S. curriculus and C. idellus, the expanded gene families of the S. curriculus genome were observed to be highly represented in processes related to the immune system category (Table S15). The results showed that a total of 58 KEGG pathways were successfully enriched, among which 52 KEGG pathways (~90%) were shared by enriched KEGG pathways of expanded genes detected in the S. curriculus genome (Table S15). Specifically, most of the shared enriched pathways were also related to the immune system, including the Toll-like receptor (TLR)-signaling pathway, NOD-like receptor-signaling pathway, C-type lectin receptor-signaling pathway, complement and coagulation cascades, Fc epsilon RI-signaling pathway, the intestinal immune network for IgA production, B cell receptor-signaling pathway, T cell receptor-signaling pathway, and platelet activation-signaling pathway (Table S15).

In this study, TLR signaling and regulation functions were highly represented in both GO annotations and KEGG pathways for expanded genes in the S. curriculus genome. TLR gene families were therefore further identified and characterized. TLR protein sequences from eight species of bony fish, Ctenopharyngodon idellus, Carassius auratus, Cyprinus carpio, D. rerio, Triplophysa tibetana, Larimichthys crocea, Periophthalmus magnuspinnatus, and Acipenser ruthenus, as well as three other model vertebrates, Xenopus tropicalis, Gallus gallus, and Rattus norvegicus, were aligned with BLAST against the S. curriculus protein sequences to obtain an overview of TLR genes in S. curriculus. After BLAST alignment and HMM searches, 27 putative TLR genes were identified, and these putative genes were further examined using SMART to determine whether they contained TLR-specific conserved structural features, including the leucine-rich repeat (LRR) motif and the TIR domain. After removing redundant genes, 18 TLR genes were eventually identified (Table S18). The phylogenetic tree was constructed using the Neighbor-joining method on the basis of the deduced amino acid sequences of these TLR genes (Figure 4). Phylogenetic analysis showed that the TLRs in S. curriculus could be assigned to the TLR1, TLR3, TLR4, TLR5, TLR7, and TLR11 subfamilies based on their family classification (Table S18; Figure S5).
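The domain-based confirmation step can be sketched in Python, assuming that the domain predictions for each candidate have been collected into a mapping from gene ID to a set of domain names. The simplified labels "LRR" and "TIR", the gene IDs, and the function name are assumptions for illustration; real domain predictions use more specific labels, and the redundancy-removal step is not shown.

```python
def confirm_tlr_candidates(domain_hits, required=("LRR", "TIR")):
    """Return the sorted IDs of candidates that carry every required domain.

    domain_hits -- dict mapping gene ID to an iterable of predicted domain names
    """
    needed = set(required)
    return sorted(
        gene for gene, domains in domain_hits.items() if needed <= set(domains)
    )
```

A candidate carrying only a TIR domain or only LRR motifs would be dropped by this filter, whereas a candidate with both would be retained for phylogenetic classification.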

The three TLRs in TLR4 subfamily were located on Chromosome 12; TLR18, TLR19, and TLR21 on Chromosome 10; and TLR7, TLR8a, and TLR20 on Chromosome 6, whereas others were unevenly located on Chromosomes 2, 4, 11, 14, 17, 19, 20, and 23 (Figure 4).

4. Discussion

In the present study, a high-quality chromosome-level genome assembly of S. curriculus was obtained with the integration of Pacific Biosciences single-molecule real-time sequencing, Illumina paired-end sequencing, and Hi-C reads. The assembled genome of S. curriculus showed a higher contig N50 and fewer contigs than the genome assemblies of several other cyprinid fishes, which reflects the accuracy and completeness of the S. curriculus genome [26,84,85,86,87]. The content of repeat sequences was close to the value (49.48%) obtained from the k-mer analysis; it is lower than the 52.20% repeat content observed in D. rerio [88], comparable to that of O. macrolepis (46.23%), and higher than those of C. idellus (38%), C. carpio (31.23%), and C. auratus (39.60%).

Genome duplication is found to be prevalent in the evolution of the Cypriniformes, and the genome sequence obtained in this study suggests that S. curriculus might have undergone such evolution as well. In these circumstances, demographic events such as bottlenecks might have resulted in the expansion of repeat sequences in S. curriculus, as observed in other cyprinid fish [88,89]. Alternatively, these differences in repeat content may also be attributed to the exclusion of repetitive sequences located in unclosed gaps or on small fragments of the assemblies [26]. Therefore, the relatively high percentage of repeat sequences may further validate the reliability of the S. curriculus genome assembly. In addition, 99.42% of the assembled sequences were anchored to 24 pseudochromosomes by using LACHESIS, which coincided with the karyotype of 24 chromosome pairs (2n = 48) based on cytological observation [48]. Thus, physical mapping using Hi-C technology markedly improved the assembly of the chromosome-level genome. Furthermore, the phylogenetic tree confirmed the grouping of Leuciscinae in East Asia with the inclusion of S. curriculus and C. idellus, which was consistent with the results of previous studies based on cytochrome b sequences [90]. Nevertheless, genome-wide studies on the phylogeny and evolution of S. curriculus within the Cypriniformes are still required, especially for understanding intergeneric relationships in the subfamily Leuciscinae and also in the family Xenocyprididae in East Asia. However, it should be pointed out that in FishBase, S. curriculus is classified as a member of the family Xenocyprididae (Cypriniformes) (https://fishbase.se/search.php, accessed on 20 June 2024), which may be a result of irregular updating of the database, or it could be attributed to the occurrence of a large number of genera and species in the Cypriniformes in East Asia and the limited inclusion of taxa in phylogenetic analyses [90]. The high contiguity of the genome assembly reported in this study may enable the reconstruction of a comprehensive phylogeny of the Cyprinidae.

From the BUSCO assessment, the annotated gene number of S. curriculus is comparable to that of O. macrolepis (24,753 genes), and less than that of D. rerio (26,151 genes) (Table S19). Furthermore, 24,402 (94.70%) of the 25,779 predicted genes were successfully annotated in at least one of the databases: SwissProt, NR, KEGG, InterPro, Pfam, or GO (Table S20). Taken together, all these analyses suggest that we have assembled a high-integrity and high-quality chromosome-level genome of S. curriculus in the present study.

The gene families expanded in S. curriculus were observed to be highly represented in pathways related to the immune system. However, most of these pathways have been found to be contracted in C. idellus. The function of these shared pathways is critically important for the activation of both innate and adaptive immunity [91]. In the immune system, pathogen-associated molecular patterns (PAMPs) derived from invading pathogens such as bacteria, fungi, and viruses are recognized by pattern-recognition receptors (PRRs) [92]. Toll-like receptors (TLRs) are classic and major PRRs, playing important roles in recognizing a wide spectrum of pathogens and in orchestrating innate, as well as adaptive, immune responses [93,94]. Previous molecular immune gene analyses in S. curriculus and C. idellus, as well as the rare minnow (Gobiocypris rarus), demonstrated that TLR genes were upregulated significantly following GCRV infection, suggesting they are functional in virus recognition and antiviral immune defense [18,95,96,97]. So far, six major families of TLRs, i.e., TLR1, TLR3, TLR4, TLR5, TLR7, and TLR11, have been reported in vertebrates, and more than 20 TLRs have been identified in fish [97,98,99,100]. In particular, some TLR genes in fish, such as TLR22 and TLR11, are reported to have undergone positive selection, which enables them to cope with pathogenic challenges [98,101,102]. The 18 TLRs identified in S. curriculus can also be grouped into six families, and the composition of TLRs in this fish is comparable to that reported in C. idellus [97]. However, it should be noted that TLR genes in S. curriculus are not entirely identical to those in C. idellus. For instance, grass carp possesses two TLR22 members, which were reported as CiTLR22a and CiTLR22b [103], and two TLR5 members, CiTLR5a and CiTLR5b [97]. Although two TLR22 genes were identified in S. curriculus in the present study, consistent with the number reported in C. idellus, only one TLR5 member was identified in S. curriculus, and it clusters with CiTLR5b. Genome duplication or reorganization events are proposed to account for the presence of multiple TLR5 genes in teleost fish, such as the duplicated TLR5 (TLR5a and TLR5b) in zebrafish and grass carp [97,104,105,106,107]. The TLRs identified in the genome of S. curriculus should also provide the basis for understanding their functions in immune recognition and signaling.

In addition, GO annotation also indicates the immune involvement of other expanded genes in S. curriculus, such as those annotated with the protein serine/threonine phosphatase complex, protein serine/threonine kinase activity, and MAP kinase activity [108]. It should be pointed out that the precise identification of all members of all gene families may not be possible, as some genes might be located in unclosed gaps or on small fragments of the assemblies; combined or specific cloning approaches may enable a precise identification of targeted genes or gene families [109]. Nevertheless, the assembled genome sequence of S. curriculus should provide new information for further investigations into the composition and function of the fish immune system at the genomic level, which should be valuable for understanding the possible difference in susceptibility or resistance to GCRV infection between S. curriculus and C. idellus.

Moreover, it should be noted that S. curriculus is a widespread species with a large geographical distribution and may therefore exhibit adaptability to environmental heterogeneity. A reference genome is thus crucial for understanding its evolutionary and adaptive variation, which should be valuable for the conservation of its natural genetic diversity.

5. Conclusions

In summary, the high-quality chromosome-scale genome assembly of S. curriculus was constructed with the integration of PacBio sequencing and Hi-C technology in this study. The genome will serve as a promising platform for future studies on molecular variation, genome duplication, population genomics, and disease prevention of S. curriculus or other cyprinid fish. Moreover, the high contiguity of the genome assembly may provide information for understanding the comprehensive phylogeny of species in the Cyprinidae and provide insight into the intergeneric relationship within this largest fish family.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/fishes9080327/s1, Figure S1: Distribution of k-mers of length 17 from the Illumina data; Figure S2: Annotation quality comparison of protein-coding genes. The CDS length, exon length, exon number, gene length, and intron length were compared among seven species: Carassius auratus, Cyprinus carpio, Squaliobarbus curriculus, Ctenopharyngodon idellus, Danio rerio, Onychostoma macrolepis, and Sinocyclocheilus grahami; Figure S3: The distribution of genes in 16 different species; Figure S4: Venn diagram of orthologous gene families among five Cyprinidae species; Figure S5: Phylogenetic relationships of Toll-like receptors in S. curriculus and C. idellus. The neighbor-joining tree was constructed with 1000 bootstrap replications using MEGA-X based on the full-length protein sequences; Table S1: Summary information of species used in the comparative genomics analysis; Table S2: Summary information of software versions and parameters used in this study; Table S3: The statistics for the genome survey (k-mer = 17); Table S4: Statistics of PacBio data used for genome assembly; Table S5: Assembly statistics of S. curriculus genome; Table S6: Summary statistics for alignment of the Illumina reads to S. curriculus genome assembly; Table S7: Statistics of heterozygous and homozygous SNPs after mapping back all Illumina read pairs onto S. curriculus genome assembly; Table S8: Statistics of CEGMA for S. curriculus genome; Table S9: Statistics of the pseudochromosome assemblies using Hi-C data; Table S10: Statistics of repetitive sequences in S. curriculus genome; Table S11: Summary statistics of predicted protein-coding genes in S. curriculus genome; Table S12: Statistics of the noncoding RNA in S. curriculus genome; Table S13: Gene number used for gene family clustering in each species; Table S14: Summary of GO annotations for expanded genes of S. curriculus compared with the most common ancestor; Table S15: Summary of enriched KEGG pathways for expanded genes of S. curriculus compared with the most common ancestor; Table S16: Summary of GO annotations for contracted genes of S. curriculus compared with the most common ancestor; Table S17: Summary of enriched KEGG pathways for contracted genes of S. curriculus compared with the most common ancestor; Table S18: Characteristic features of TLR genes in S. curriculus genome; Table S19: Summary statistics of predicted protein-coding genes in S. curriculus and related species; Table S20: Summary statistics of gene annotation for S. curriculus genome.

Author Contributions

Conceptualization, B.Z.; Software, B.Z.; Formal analysis, B.Z., Y.S., S.W. and P.N.; Investigation, B.Z. and X.S.; Resources, T.X. and P.N.; Data curation, B.Z., Y.L. and P.N.; Writing—original draft, B.Z. and P.N.; Writing—review and editing, P.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) (32072978), the “First Class Fishery Discipline” programme ([2020]3), and the top talent plan “One Thing One Decision (Yishi Yiyi)” ([2018]27) in Shandong Province.

Institutional Review Board Statement

The animal study protocol was approved by the School of Marine Science and Engineering, Qingdao Agricultural University (20200403, 3 April 2020).

Data Availability Statement

This Whole Genome Shotgun project has been deposited at GenBank under the accession JASITH000000000. The version described in this paper is version JASITH010000000. The BioProject and BioSample accessions are PRJNA973127 and SAMN35124070, respectively (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_040359625.1/, accessed on 20 June 2024).

Acknowledgments

The sample was collected with the assistance of local fishermen.

Conflicts of Interest

The authors declare no conflict of interest.

References

Li, S.Z. Fishes of the Yellow River; China Ocean University Press: Qingdao, China, 2017; p. 200. [Google Scholar]
Luo, Q.S.; Zheng, G.G. A brief introduction to the biology of Squaliobarbus curriculus. J. Southwest China Norm. Univ. 1980, 2, 119–122. [Google Scholar]
Liu, Q.; Xiao, T.; Liu, M.; Zhou, W. Research progress of biology in Squaliobarbus curriculus. Fish. Sci. 2012, 11, 687–691. [Google Scholar]
Chen, D.Q.; Xiong, F.; Wang, K.; Chang, Y.H. Status of research on Yangtze fish biology and fisheries. Environ. Biol. Fish 2009, 85, 337–357. [Google Scholar] [CrossRef]
Yi, Y.J.; Yang, Z.F.; Zhang, S.H. Ecological influence of dam construction and river-lake connectivity on migration fish habitat in the Yangtze River basin, China. Procedia Environ. Sci. 2010, 2, 1942–1954. [Google Scholar] [CrossRef]
Xiang, J.G.; He, F.L. Study on biological characteristics of Squaliobarbus curriculus. Freshw. Fish. 2006, 36, 38–40. [Google Scholar]
Yang, S.X.; Zheng, T.S. Analysis on flesh rate and muscle nutritional value in Squaliobarbus curriculus Richardson. J. Anhui Agric. Sci. 2010, 38, 11835–11837. [Google Scholar]
Jin, X.L.; Tian, X.C.; Zeng, G.Q.; Wang, M.L. Preliminary study on crossbreeding and seedling cultivation of Squaliobarbus curriculus. Inland Fish. 1997, 12, 6–7. [Google Scholar]
Long, G.H.; Hu, D.S.; Liu, J.H.; Lu, W.Z. Research on artificial propagation of Squaliobarbus curriculus. Freshw. Fish. 2005, 35, 44–46. [Google Scholar]
Long, G.H.; Hu, D.S.; Liu, J.H.; Lu, W.Z. Preliminary study on seedling cultivation technology of Squaliobarbus curriculus. Reserv. Fish. 2005, 25, 41. [Google Scholar]
Xiong, W.Z.; Li, J.; Zhou, P.; Zhang, J.; Chen, C.C.; Wu, D.C. The technical research about the artificial fecundation and rearing of fingerling of the Squaliobarbus curriculus. J. Aquacult. 2005, 26, 12–15. [Google Scholar]
Chen, Y.C.; Lin, G. Growth characteristics and breeding technology of Squaliobarbus curriculus. Guangxi Agric. Sci. 2007, 38, 97–100. [Google Scholar]
Mi, G.Q.; Shen, T.S.; Xu, G.X. Studies on artificial propagation techniques of Squaliobarbus curriculus. J. Zhejiang Ocean Univ. (Nat. Sci.) 2007, 26, 272–276. [Google Scholar]
Guo, S.R.; Feng, X.Y.; Xie, N.; Liu, X.Y.; Wang, Y.X. Culturing experiment of Squaliobarbus curriculus in ponds. J. Hydroecol. 2009, 2, 142–144. [Google Scholar]
Yang, K.; Gao, Y.A.; Wang, Q.Y.; Xia, R.L.; Zeng, K.W.; Zhu, S.H.; Li, B.; Deng, G.Q.; Cheng, Y.H.; Zheng, C.H. Study on the introduction, domestication and artificial propagation of Squaliobarbus curriculus. Hubei Agric. Sci. 2014, 53, 1367–1369. [Google Scholar]
Liu, Q.L.; Liu, M.; Xiao, T.Y.; Xu, B.H.; Su, J.M. Mitochondrial DNA sequence of the hybrid of Squaliobarbus curriculus (♀) × Ctenopharyngodon idella (♂). Mitochondrial DNA 2013, 24, 394–396. [Google Scholar] [CrossRef]
Jin, S.Z.; Zhao, X.; Wang, H.Q.; Su, J.M.; Wang, J.A.; Ding, C.H.; Li, Y.G.; Xiao, T.Y. Molecular characterization and expression of TLR7 and TLR8 in barbel chub (Squaliobarbus curriculus): Responses to stimulation of grass carp reovirus and lipopolysaccharide. Fish Shellfish Immunol. 2018, 83, 292–307. [Google Scholar] [CrossRef]
Wang, R.H.; Li, W.; Fan, Y.D.; Liu, Q.L.; Zeng, L.B.; Xiao, T.Y. Tlr22 structure and expression characteristic of barbel chub, Squaliobarbus curriculus provides insights into antiviral immunity against infection with grass carp reovirus. Fish Shellfish Immunol. 2017, 66, 120–128. [Google Scholar] [CrossRef] [PubMed]
Zhao, X.; Wang, R.H.; Li, Y.G.; Xiao, T.Y. Molecular cloning and functional characterization of interferon regulatory factor 7 of the barbel chub, Squaliobarbus curriculus. Fish Shellfish Immunol. 2017, 69, 185–194. [Google Scholar] [CrossRef]
Wang, R.H.; Li, Y.G.; Zhou, Z.Y.; Liu, Q.L.; Zeng, L.B.; Xiao, T.Y. Involvement of interferon regulatory factor 3 from the barbel chub Squaliobarbus curriculus in the immune response against grass carp reovirus. Gene 2018, 648, 5–11. [Google Scholar] [CrossRef]
Li, Y.G.; Jin, S.Z.; Zhao, X.; Luo, H.; Li, R.; Li, D.F.; Xiao, T.Y. Sequence and expression analysis of the cytoplasmic pattern recognition receptor melanoma differentiation-associated gene 5 from the barbel chub Squaliobarbus curriculus. Fish Shellfish Immunol. 2019, 94, 485–496. [Google Scholar] [CrossRef]
Zhao, X.; Xiao, T.Y.; Jin, S.Z.; Wang, J.A.; Wang, J.Y.; Luo, H.; Li, R.; Sun, T.; Zou, J.; Li, Y.G. Characterization and immune function of the interferon-β promoter stimulator-1 in the barbel chub, Squaliobarbus curriculus. Dev. Comp. Immunol. 2020, 104, 103571. [Google Scholar] [CrossRef] [PubMed]
Huang, Y.; Wang, X.; Lv, Z.; Hu, X.; Xu, B.; Yang, H.; Xiao, T.; Liu, Q. Comparative transcriptomics analysis reveals unique immune response to Grass Carp Reovirus infection in barbel chub (Squaliobarbus curriculus). Biology 2024, 13, 214. [Google Scholar] [CrossRef]
Kong, B.; Lin, G.; Li, X.Z.; Hu, X.N. Adaptive domestication of barbel chub (Squaliobarbus curriculus) before artificial releasing in the river. Guangxi Agric. Sci. 2008, 39, 541–543. [Google Scholar]
Xia, X.L.; Zhai, L.T. Several economic fishes extinct in the wild state of Baiyangdian Lake and their causes. Hebei Fish. 2014, 5, 63–64. [Google Scholar]
Liu, Q.L.; Xu, B.H.; Xiao, T.Y.; Su, J.M.; Yao, Y.B.; Liu, Y.J. Complete mitochondrial genome of the Xiangjiang barbel chub Squaliobarbus curriculus: Comparative analysis of the genetic variation associated with geographical population. Mitochondrial DNA 2013, 24, 654–656. [Google Scholar] [CrossRef]
Zhou, W.; Song, N.; Wang, J.; Gao, T.X. Effects of geological changes and climatic fluctuations on the demographic histories and low genetic diversity of Squaliobarbus curriculus in Yellow River. Gene 2016, 590, 149–158. [Google Scholar] [CrossRef]
Li, C.J.; Teng, T.; Shen, F.F.; Guo, J.Q.; Chen, Y.N.; Zhu, C.K.; Ling, Q.F. Transcriptome characterization and SSR discovery in Squaliobarbus curriculus. J. Oceanol. Limnol. 2019, 37, 235–244. [Google Scholar] [CrossRef]
Wang, Y.P.; Lu, Y.; Zhang, Y.; Ning, Z.M.; Li, Y.; Zhao, Q.; Lu, H.Y.; Huang, R.; Xia, X.Q.; Feng, Q.; et al. The draft genome of the grass carp (Ctenopharyngodon idellus) provides insights into its evolution and vegetarian adaptation. Nat. Genet. 2015, 47, 962. [Google Scholar] [CrossRef]
Wu, C.S.; Ma, Z.Y.; Zheng, G.D.; Zou, S.M.; Zhang, X.J.; Zhang, Y.A. Chromosome-level genome assembly of grass carp (Ctenopharyngodon idella) provides insights into its genome evolution. BMC Genom. 2022, 23, 271. [Google Scholar] [CrossRef] [PubMed]
Burgess, D.J. Genomics: Next regeneration sequencing for reference genomes. Nat. Rev. Genet. 2018, 19, 125. [Google Scholar] [CrossRef]
Pollard, M.O.; Gurdasani, D.; Mentzer, A.J.; Porter, T.; Sandhu, M.S. Long reads: Their purpose and place. Hum. Mol. Genet. 2018, 27, 234–241. [Google Scholar] [CrossRef]
Amarasinghe, S.L.; Su, S.; Dong, X.Y.; Zappia, L.; Ritchie, M.E.; Gouil, Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020, 21, 30. [Google Scholar] [CrossRef] [PubMed]
Rhoads, A.; Au, K.F. PacBio sequencing and its applications. Genom. Proteom. Bioinf. 2015, 13, 278–289. [Google Scholar] [CrossRef] [PubMed]
Vij, S.; Kuhl, H.; Kuznetsova, I.S.; Komissarov, A.; Yurchenko, A.A.; Van Heusden, P.; Singh, S.; Thevasagayam, N.M.; Prakki, S.R.; Purushothaman, K.; et al. Chromosomal-level assembly of the Asian seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet. 2016, 12, e1005954. [Google Scholar]
Gong, G.R.; Dan, C.; Xiao, S.J.; Guo, W.J.; Huang, P.P.; Xiong, Y.; Wu, J.J.; He, Y.; Zhang, J.C.; Li, X.H.; et al. Chromosomal-level assembly of yellow catfish genome using third-generation DNA sequencing and Hi-C analysis. GigaScience 2018, 7, giy120. [Google Scholar] [CrossRef]
Chen, B.H.; Zhou, Z.X.; Ke, Q.Z.; Wu, Y.D.; Bai, H.Q.; Pu, F.; Xu, P. The sequencing and de novo assembly of the Larimichthys crocea genome using PacBio and Hi-C technologies. Sci. Data 2019, 6, 188. [Google Scholar] [CrossRef]
Zhang, B.D.; Li, Y.L.; Xue, D.X.; Liu, J.X. Population genomic evidence for high genetic connectivity among populations of small yellow croaker (Larimichthys polyactis) in inshore waters of China. Fish. Res. 2020, 225, 105505. [Google Scholar] [CrossRef]
Zhang, B.D.; Li, Y.L.; Xue, D.X.; Liu, J.X. Population genomics reveals shallow genetic structure in a connected and ecologically important fish from the northwestern Pacific Ocean. Front. Mar. Sci. 2020, 7, 374. [Google Scholar] [CrossRef]
Marçais, G.; Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011, 27, 764–770. [Google Scholar] [CrossRef]
Liu, B.H.; Shi, Y.J.; Yuan, J.Y.; Hu, X.S.; Zhang, H.; Li, N.; Li, Z.Y.; Chen, Y.X.; Mu, D.S.; Fan, W. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv 2013, arXiv:1308.2012. [Google Scholar]
Ruan, J.; Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 2020, 17, 155–158. [Google Scholar] [CrossRef]
Chin, C.S.; Alexander, D.H.; Marks, P.; Klammer, A.A.; Drake, J.; Heiner, C.; Clum, A.; Copeland, A.; Huddleston, J.; Eichler, E.E.; et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 2013, 10, 563. [Google Scholar] [CrossRef]
Roach, M.J.; Schmidt, S.A.; Borneman, A.R. Purge Haplotigs: Synteny reduction for third-gen diploid genome assemblies. bioRxiv 2018. [Google Scholar] [CrossRef]
Walker, B.J.; Abeel, T.; Shea, T.; Priest, M.; Abouelliel, A.; Sakthikumar, S.; Cuomo, C.A.; Zeng, Q.D.; Wortman, J.; Young, S.K.; et al. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 2014, 9, e112963. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009, 25, 1754–1760. [Google Scholar] [CrossRef] [PubMed]
Burton, J.N.; Adey, A.; Patwardhan, R.P.; Qiu, R.; Kitzman, J.O.; Shendure, J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 2013, 31, 1119. [Google Scholar] [CrossRef]
Shu, H.; Liu, Y.B.; Wei, Q.L.; Liu, L.; Yang, L.D.; Qiang, L.; Hou, L.P. Studies on chromosome karyotype, Ag-NORs and C-banding patterns of wild Ctenopharyngodon idellus and Squaliobarbus curriculus in the Pearl River. J. Guangzhou Univ. 2014, 13, 53–59. [Google Scholar]
Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079. [Google Scholar] [CrossRef]
Li, H.; Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010, 26, 589–595. [Google Scholar] [CrossRef]
Simão, F.A.; Waterhouse, R.M.; Ioannidis, P.; Kriventseva, E.V.; Zdobnov, E.M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015, 31, 3210–3212. [Google Scholar] [CrossRef]
Parra, G.; Bradnam, K.; Korf, I. CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 2007, 23, 1061–1067. [Google Scholar] [CrossRef]
Jurka, J.; Kapitonov, V.V.; Pavlicek, A.; Klonowski, P.; Kohany, O.; Walichiewicz, J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 2005, 110, 462–467. [Google Scholar] [CrossRef]
Smit, A.F.A.; Hubley, R. RepeatModeler Open-1.0. 2010, 2008–2015. Available online: https://www.repeatmasker.org (accessed on 28 December 2021).
Xu, Z.; Wang, H. LTR_FINDER: An efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007, 35, W265–W268. [Google Scholar] [CrossRef]
Price, A.L.; Jones, N.C.; Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 2005, 21, i351–i358. [Google Scholar] [CrossRef] [PubMed]
Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26, 2460–2461. [Google Scholar] [CrossRef] [PubMed]
Birney, E.; Clamp, M.; Durbin, R. GeneWise and Genomewise. Genome Res. 2004, 14, 988–995. [Google Scholar] [CrossRef]
Stanke, M.; Morgenstern, B. AUGUSTUS: A web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005, 33, W465–W467. [Google Scholar] [CrossRef] [PubMed]
Guigó, R.; Knudsen, S.; Drake, N.; Smith, T. Prediction of gene structure. J. Mol. Biol. 1992, 226, 141–157. [Google Scholar] [CrossRef]
Burge, C.; Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997, 268, 78–94. [Google Scholar] [CrossRef]
Majoros, W.H.; Pertea, M.; Salzberg, S.L. TigrScan and GlimmerHMM: Two open-source ab initio eukaryotic gene-finders. Bioinformatics 2004, 20, 2878–2879. [Google Scholar] [CrossRef]
Korf, I. Gene finding in novel genomes. BMC Bioinform. 2004, 5, 59. [Google Scholar] [CrossRef]
Trapnell, C.; Pachter, L.; Salzberg, S.L. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25, 1105–1111. [Google Scholar] [CrossRef]
Trapnell, C.; Roberts, A.; Goff, L.; Pertea, G.; Kim, D.; Kelley, D.R.; Pimentel, H.; Salzberg, S.L.; Rinn, J.L.; Pachter, L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012, 7, 562–578. [Google Scholar] [CrossRef]
Grabherr, M.G.; Haas, B.J.; Yassour, M.; Levin, J.Z.; Thompson, D.A.; Amit, I.; Adiconis, X.; Fan, L.; Raychowdhury, R.; Zeng, Q.; et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011, 29, 644–652. [Google Scholar] [CrossRef] [PubMed]
Haas, B.J.; Salzberg, S.L.; Zhu, W.; Pertea, M.; Allen, J.E.; Orvis, J.; White, O.; Buell, C.R.; Wortman, J.R. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 2008, 9, R7. [Google Scholar] [CrossRef]
Mulder, N.; Apweiler, R. InterPro and InterProScan: Tools for protein sequence classification and comparison. Methods Mol. Biol. 2007, 396, 59–70. [Google Scholar] [PubMed]
Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25–29. [Google Scholar] [CrossRef]
Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28, 45–48. [Google Scholar] [CrossRef] [PubMed]
Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27–30. [Google Scholar] [CrossRef] [PubMed]
Lowe, T.M.; Eddy, S.R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25, 955–964. [Google Scholar] [CrossRef] [PubMed]
Griffiths-Jones, S.; Moxon, S.; Marshall, M.; Khanna, A.; Eddy, S.R.; Bateman, A. Rfam: Annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33, D121–D124. [Google Scholar] [CrossRef] [PubMed]
Chen, F.; Mackey, A.J.; Stoeckert, C.J., Jr.; Roos, D.S. OrthoMCL-DB: Querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, 34, D363–D368. [Google Scholar] [CrossRef]
Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–1797. [Google Scholar]
Castresana, J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 2000, 17, 540–552. [Google Scholar] [CrossRef]
Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 2014, 30, 1312–1313. [Google Scholar] [CrossRef] [PubMed]
Yang, Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol. Biol. Evol. 2007, 24, 1586–1591. [Google Scholar] [CrossRef]
De Bie, T.; Cristianini, N.; Demuth, J.P.; Hahn, M.W. CAFE: A computational tool for the study of gene family evolution. Bioinformatics 2006, 22, 1269–1271. [Google Scholar] [CrossRef]
Letunic, I.; Bork, P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 2018, 46, D493–D496. [Google Scholar] [CrossRef]
Chen, C.J.; Chen, H.; Zhang, Y.; Thomas, H.R.; Frank, M.H.; He, Y.H.; Xia, R. TBtools: An integrative toolkit developed for interactive analyses of big biological data. Mol. Plant 2020, 13, 1194–1202. [Google Scholar] [CrossRef]
Kumar, S.; Stecher, G.; Suleski, M.; Hedges, S.B. TimeTree: A resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 2017, 34, 1812–1819. [Google Scholar] [CrossRef]
Chen, D.; Zhang, Q.; Tang, W.; Huang, Z.; Wang, G.; Wang, Y.; Shi, J.; Xu, H.; Lin, L.; Li, Z.; et al. The evolutionary origin and domestication history of goldfish (Carassius auratus). Proc. Natl. Acad. Sci. USA 2020, 117, 29775–29785. [Google Scholar] [CrossRef] [PubMed]
Xu, P.; Zhang, X.F.; Wang, X.M.; Li, J.T.; Liu, G.M.; Kuang, Y.Y.; Xu, J.; Zheng, X.H.; Ren, L.F.; Wang, G.L.; et al. Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nat. Genet. 2014, 46, 1212–1219. [Google Scholar] [CrossRef] [PubMed]
Yang, J.X.; Chen, X.L.; Bai, J.; Fang, D.M.; Qiu, Y.; Jiang, W.S.; Yuan, H.; Bian, C.; Lu, J.; He, S.Y.; et al. The Sinocyclocheilus cavefish genome provides insights into cave adaptation. BMC Biol. 2016, 14, 1. [Google Scholar] [CrossRef] [PubMed]
Chen, Z.L.; Omori, Y.; Koren, S.; Shirokiya, T.; Kuroda, T.; Miyamoto, A.; Wada, H.; Fujiyama, A.; Toyoda, A.; Zhang, S.Y.; et al. De Novo assembly of the goldfish (Carassius auratus) genome and the evolution of genes after whole genome duplication. Sci. Adv. 2019, 5, eaav0547. [Google Scholar] [CrossRef] [PubMed]
Sun, L.N.; Gao, T.; Wang, F.L.; Qin, Z.L.; Yan, L.X.; Tao, W.J.; Li, M.H.; Jin, C.B.; Ma, L.; Kocher, T.D.; et al. Chromosome-level genome assembly of a cyprinid fish Onychostoma macrolepis by integration of nanopore sequencing, Bionano and Hi-C technology. Mol. Ecol. Resour. 2020, 20, 1361–1371. [Google Scholar] [CrossRef] [PubMed]
Howe, K.; Clark, M.D.; Torroja, C.F.; Torrance, J.; Berthelot, C.; Muffato, M.; Collins, J.E.; Humphray, S.; McLaren, K.; Matthews, L.; et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 2013, 505, 248. [Google Scholar] [CrossRef]
Yang, L.; Sado, T.; Hirt, M.V.; Pasco-Viel, E.; Arunachalam, M.; Li, J.B.; Wang, X.Z.; Freyhof, J.; Saitoh, K.; Simons, A.M.; et al. Phylogeny and polyploidy: Resolving the classification of cyprinine fishes (Teleostei: Cypriniformes). Mol. Phylogenet. Evol. 2015, 85, 97–116. [Google Scholar] [CrossRef] [PubMed]
He, S.; Liu, H.; Chen, Y.; Kuwahara, M.; Nakajima, T.; Zhong, Y. Molecular phylogenetic relationships of Eastern Asian Cyprinidae (Pisces: Cypriniformes) inferred from cytochrome b sequences. Sci. China Life Sci. 2004, 47, 130–138. [Google Scholar] [CrossRef] [PubMed]
Secombes, C.J.; Wang, T. The innate and adaptive immune system of fish. Infect. Dis. Aquac. 2012, 14, 3–68. [Google Scholar]
Lester, S.N.; Li, K. Toll-like receptors in antiviral innate immunity. J. Mol. Biol. 2014, 426, 1246–1264. [Google Scholar] [CrossRef] [PubMed]
Medzhitov, R.; Preston-Hurlburt, P.; Janeway, C.A., Jr. A human homologue of the Drosophila Toll protein signals activation of adaptive immunity. Nature 1997, 388, 394–397. [Google Scholar] [CrossRef] [PubMed]
Takeda, K.; Akira, S. Toll receptors and pathogen resistance. Cell Microbiol. 2003, 5, 143–153. [Google Scholar] [CrossRef] [PubMed]
Su, J.G.; Yang, C.R.; Xiong, F.; Wang, Y.P.; Zhu, Z.Y. Toll-like receptor 4 signaling pathway can be triggered by grass carp reovirus and Aeromonas hydrophila infection in rare minnow Gobiocypris rarus. Fish Shellfish Immunol. 2009, 27, 33–39. [Google Scholar] [CrossRef] [PubMed]
Pei, Y.Y.; Huang, R.; Li, Y.M.; Liao, L.J.; Zhu, Z.Y.; Wang, Y.P. Characterizations of four toll-like receptor 4s in grass carp Ctenopharyngodon idellus and their response to grass carp reovirus infection and lipopolysaccharide stimulation. J. Fish Biol. 2015, 86, 1098–1108. [Google Scholar] [CrossRef] [PubMed]
Liao, Z.W.; Wan, Q.Y.; Su, H.; Wu, C.S.; Su, J.G. Pattern recognition receptors in grass carp Ctenopharyngodon idella: I. Organization and expression analysis of TLRs and RLRs. Dev. Comp. Immunol. 2017, 76, 93–104. [Google Scholar] [CrossRef] [PubMed]
Roach, J.C.; Glusman, G.; Rowen, L.; Kaur, A.; Purcell, M.K.; Smith, K.D.; Hood, L.E.; Aderem, A. The evolution of vertebrate Toll-like receptors. Proc. Natl. Acad. Sci. USA 2005, 102, 9577–9582. [Google Scholar] [CrossRef] [PubMed]
Altmann, S.; Korytar, T.; Kaczmarzyk, D.; Nipkow, M.; Kuhn, C.; Goldammer, T.; Rebl, A. Toll-like receptors in maraena whitefish: Evolutionary relationship among salmonid fishes and patterns of response to Aeromonas salmonicida. Fish Shellfish Immunol. 2016, 54, 391–401. [Google Scholar] [CrossRef] [PubMed]
Wang, K.L.; Chen, S.N.; Huo, H.J.; Nie, P. Identification and expression analysis of sixteen Toll-like receptor genes, TLR1, TLR2a, TLR2b, TLR3, TLR5M, TLR5S, TLR7−9, TLR13a−c, TLR14, TLR21−23 in mandarin fish Siniperca chuatsi. Dev. Comp. Immunol. 2021, 121, 104100. [Google Scholar] [CrossRef] [PubMed]
Sundaram, A.Y.; Kiron, V.; Dopazo, J.; Fernandes, J.M. Diversification of the expanded teleost-specific toll-like receptor family in Atlantic cod, Gadus morhua. BMC Evol. Biol. 2012, 12, 256. [Google Scholar] [CrossRef] [PubMed]
Qiu, H.T.; Fernandes, J.M.O.; Hong, W.S.; Wu, H.X.; Zhang, Y.T.; Huang, S.; Liu, D.T.; Yu, H.; Wang, Q.; You, X.X.; et al. Paralogues from the expanded Tlr11 gene family in mudskipper (Boleophthalmus pectinirostris) are under positive selection and respond differently to LPS/Poly(I:C) challenge. Front. Immunol. 2019, 10, 343. [Google Scholar] [CrossRef] [PubMed]
Ji, J.; Liao, Z.; Rao, Y.; Li, W.; Yang, C.; Yuan, G.; Feng, H.; Xu, Z.; Shao, J.; Su, J. Thoroughly remold the localization and signaling pathway of TLR22. Front. Immunol. 2020, 10, 3003. [Google Scholar] [CrossRef] [PubMed]
Meijer, A.H.; Gabby Krens, S.F.; Medina Rodriguez, I.A.; He, S.; Bitter, W.; Ewa Snaar-Jagalska, B.; Spaink, H.P. Expression analysis of the Toll-like receptor and TIR domain adaptor families of zebrafish. Mol. Immunol. 2004, 40, 773–783. [Google Scholar] [CrossRef] [PubMed]
Palti, Y. Toll-like receptors in bony fish: From genomics to function. Dev. Comp. Immunol. 2011, 35, 1263–1272. [Google Scholar] [CrossRef] [PubMed]
Liao, Z.W.; Su, J.G. Progresses on three pattern recognition receptor families (TLRs, RLRs and NLRs) in teleost. Dev. Comp. Immunol. 2021, 122, 104131. [Google Scholar] [CrossRef]
Liao, Z.W.; Yang, C.R.; Jiang, R.; Zhu, W.T.; Zhang, Y.A.; Su, J.G. Cyprinid-specific duplicated membrane TLR5 senses dsRNA as functional homodimeric receptors. EMBO Rep. 2022, 23, e54281. [Google Scholar] [CrossRef] [PubMed]
Cao, W.S.; Bao, C.; Padalko, E.; Lowenstein, C.J. Acetylation of mitogen-activated protein kinase phosphatase-1 inhibits Toll-like receptor signaling. J. Exp. Med. 2008, 205, 1491–1503. [Google Scholar] [CrossRef] [PubMed]
Nestor, B.J.; Bayer, P.E.; Fernandez, C.G.T.; Edwards, D.; Finnegan, P.M. Approaches to increase the validity of gene family identification using manual homology search tools. Genetica 2023, 151, 325–338. [Google Scholar] [CrossRef]

Figure 1. The barbel chub (Squaliobarbus curriculus).

Figure 2. Genome-wide Hi-C heatmap of the barbel chub S. curriculus. The reddish blocks represent the 24 pseudochromosomes. The colour shade at any point within the square shows the proximity score for two genomic regions.

Figure 3. Phylogenetic analysis and divergence time tree of the barbel chub (marked in blue) and other teleost species, with Branchiostoma belcheri as the outgroup. The numbers of significantly expanded (+, green) and contracted (−, red) gene families are shown on each branch after the species name. The estimated species divergence time (million years ago, Ma) is labelled at each branch site, and the confidence intervals are provided in parentheses. The divergence ages are taken from the TimeTree database [82] and a previous publication [83].

Figure 4. Phylogenetic relationship, chromosome position, and motif composition of Toll-like receptors in S. curriculus. The neighbor-joining tree was constructed in MEGA-X from the full-length protein sequences with 1000 bootstrap replicates. Numbers above branches are bootstrap percentages and are shown only when above 50%. The exon-intron structures of these genes were drawn with TBtools.

Table 1. Statistics for the sequencing data of S. curriculus genome.

Library Type	Library Size (bp)	Sequencing Platform	Total Data (Gb)	Sequence Coverage (×)
Illumina reads	350	Illumina NovaSeq 6000	43.88	47.40
PacBio reads	20,000	PacBio Sequel II	155.34	167.82
Hi-C reads	350	Illumina NovaSeq 6000	145.69	157.39
Transcriptome	350	Illumina NovaSeq 6000	39.78	42.97
Total			384.69	415.58

Note: Sequence coverage was calculated using an estimated genome size of 925.66 Mb.
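The coverage column can be reproduced from the note above. The following is a minimal sketch, assuming coverage is simply total data divided by the estimated genome size; the function and variable names are illustrative and do not come from the article.

```python
# Estimated genome size from the Table 1 note, in Mb.
GENOME_SIZE_MB = 925.66

# Total data per library in Gb, copied from Table 1.
total_data_gb = {
    "Illumina reads": 43.88,
    "PacBio reads": 155.34,
    "Hi-C reads": 145.69,
    "Transcriptome": 39.78,
}


def coverage(data_gb, genome_size_mb=GENOME_SIZE_MB):
    """Return fold coverage: sequenced bases divided by genome size.

    Assumes 1 Gb = 1000 Mb, so both quantities are compared in Mb.
    """
    return data_gb * 1000.0 / genome_size_mb


# Per-library coverage, rounded to two decimals as in the table.
coverages = {name: round(coverage(gb), 2) for name, gb in total_data_gb.items()}

# Coverage of the summed data, corresponding to the "Total" row.
total_coverage = round(coverage(sum(total_data_gb.values())), 2)

for name, value in coverages.items():
    print(f"{name}: {value:.2f}x")
print(f"Total: {total_coverage:.2f}x")
```

Under this assumption the computed values agree with the table to two decimal places, which suggests the column was derived in this way.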

Table 2. BUSCO analysis result of the S. curriculus genome.

Statistics	Number of Genes	Percentage (%)
Complete BUSCOs	2511	97.10
Complete and single-copy BUSCOs	2441	94.40
Complete and duplicated BUSCOs	70	2.70
Fragmented BUSCOs	44	1.70
Missing BUSCOs	31	1.20
Total BUSCO groups searched	2586	100
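The BUSCO categories in Table 2 are nested, so the counts can be checked against one another. The sketch below is illustrative only: the names are hypothetical, and rounding to one decimal place is an assumption inferred from the reported percentages.

```python
# Counts copied from Table 2.
busco = {
    "single_copy": 2441,
    "duplicated": 70,
    "fragmented": 44,
    "missing": 31,
}
TOTAL_GROUPS = 2586

# Complete BUSCOs are the single-copy and duplicated ones together.
complete = busco["single_copy"] + busco["duplicated"]

# Every searched group is complete, fragmented, or missing.
accounted_for = complete + busco["fragmented"] + busco["missing"]


def percent(count, total=TOTAL_GROUPS):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


print(f"Complete: {complete} ({percent(complete)}%)")
print(f"Groups accounted for: {accounted_for} of {TOTAL_GROUPS}")
```

Both identities hold for the reported counts: the complete category sums to 2511 and the three top-level categories sum to the 2586 groups searched.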

Table 3. Statistics on transposable elements in S. curriculus genome.

	De Novo & Repbase		TE Proteins		Combined TEs
	Length (bp)	% in Genome	Length (bp)	% in Genome	Length (bp)	% in Genome
DNA	54,528,535	6.09	5,695,830	0.64	56,929,089	6.36
LINE	6,820,226	0.76	12,651,477	1.41	15,790,457	1.76
SINE	632,665	0.07	0	0	632,665	0.07
LTR	339,137,241	37.89	19,028,647	2.13	340,522,087	38.05
Unknown	18,886,738	2.11	0	0	18,886,738	2.11
Total	411,198,899	45.94	37,369,252	4.18	414,869,703	46.35

Note: LINE, long interspersed element; SINE, short interspersed element; LTR, long terminal repeat.

Table 4. Summary of functional annotations for predicted genes of S. curriculus genome.

Annotation Database	Number of Annotated Genes	Percentage (%)
Swissprot	21,440	83.20
Nr	24,328	94.40
KEGG	21,328	82.70
InterPro	22,481	87.20
GO	16,145	62.60
Pfam	20,160	78.20
Annotated	24,402	94.70
Unannotated	1,377	5.30
Total	25,779	-

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

