Article

Mining NCBI Sequence Read Archive Database: An Untapped Source of Organelle Genomes for Taxonomic and Comparative Genomics Research

Department of Biology, Faculty of Science, Istanbul University, Istanbul 34134, Turkey
*
Author to whom correspondence should be addressed.
Diversity 2024, 16(2), 104; https://doi.org/10.3390/d16020104
Submission received: 22 December 2023 / Revised: 24 January 2024 / Accepted: 29 January 2024 / Published: 6 February 2024
(This article belongs to the Special Issue Genome Sequence and Analysis for Animal Ecology and Evolution)

Abstract

The NCBI SRA database is constantly expanding due to the large amount of genomic and transcriptomic data from various organisms generated by next-generation sequencing, and researchers worldwide regularly deposit new data into the database. This high-coverage genomic and transcriptomic information can be re-evaluated regardless of the original research subject. The database-deposited NGS data can offer valuable insights into the genomes of organelles, particularly for non-model organisms. Here, we developed an automated bioinformatics workflow called “OrgaMiner”, designed to unveil high-quality mitochondrial and chloroplast genomes by data mining the NCBI SRA database. OrgaMiner, a Python-based pipeline, automatically orchestrates various tools to extract, assemble, and annotate organelle genomes for non-model organisms without available organelle genome sequences but with data in the NCBI SRA. To test the usability and feasibility of the pipeline, “mollusca” was selected as a keyword, and 76 new mitochondrial genomes were de novo assembled and annotated automatically without writing a single line of code. The applicability of the pipeline can be expanded to recover organelle genomes from diverse invertebrate, vertebrate, and plant species by simply specifying the taxonomic name. OrgaMiner provides an easy-to-use, end-to-end solution for biologists mainly working with taxonomy and population genetics.

1. Introduction

Whole-genome-sequencing (WGS) and RNA-sequencing (RNA-Seq) studies that employ new or third-generation sequencing approaches generate millions of reads and vast amounts of genomic data per sample. The high- or low-coverage WGS and RNA-Seq approaches have become routine procedures for conducting a variety of studies, including population genetics studies [1], comparative genome analyses [2,3], and genome-wide association studies (GWASs) [4], as well as studies on clarifying the molecular mechanism of organ development [5,6], sex determination [7], understanding the physical effects of exposure to biological and chemical agents [8,9,10], and other gene expression-based research on non-model organisms [11]. Due to the vast amount of genomic and transcriptomic data generated, this high-coverage nucleotide information can be re-evaluated regardless of the original research subject. For instance, these high-volume genomic data can yield significant insights into the genomes of organelles that are abundantly present in tissues, particularly mitochondria and chloroplasts. In vertebrates, organs such as the skeletal muscle, heart, liver, kidney, brain neurons, theca cells in the ovary, Leydig cells in the testis, and sperm cells have high concentrations of circular mitochondrial DNA (mtDNA) [12,13,14]. In invertebrates, flight muscle tissue and ommatidia in the compound eyes of insects, sperm cells, and gastrodermal cells in the digestive systems of some marine invertebrates, as well as the gills and digestive gland cells in mollusks, are rich in mitochondria and mtDNA [15,16,17]. Moreover, multiple circular genomes can exist within one mitochondrion. Similarly, in plants, chlorenchyma cells, especially those found in the mesophyll layer of leaf tissue, contain a significant number of chloroplasts and chloroplast DNA (cpDNA) [18,19].
Mitochondrial and chloroplast genomes are extensively used and highly preferred in various types of research, including genetic diversity and population structure analyses, research on resolving taxonomic ambiguities, divergence-time estimates, haplotype network analyses, molecular metabarcoding, and environmental DNA surveys [20,21,22,23,24]. In addition to conducting their own organelle genome-sequencing studies, researchers have three primary sources for acquiring organelle genomes: (i) the NCBI Organelle Genome Resources (https://www.ncbi.nlm.nih.gov/genome/organelle/, accessed on 24 October 2023), (ii) the NCBI Nucleotide (https://www.ncbi.nlm.nih.gov/nucleotide/, accessed on 10 November 2023), and (iii) the NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra, accessed on 10 November 2023) databases. Researchers predominantly rely on the organelle genome resources and nucleotide databases and often overlook the massive amount of raw next-generation sequencing data (“.fastq”) deposited in the NCBI SRA database, because managing and processing NGS data demands substantial computational resources, basic shell scripting, and command-line experience. On the experimental side, organelle genome-sequencing studies currently rely predominantly on the genome-skimming approach. This approach entails performing low-coverage WGS and subsequently utilizing bioinformatic tools to extract circular cpDNA and mtDNA genomes from the generated data [25]. In practice, it is favored over the separate isolation of chloroplast and mitochondrial genomes or amplification with long-range PCR. While not as robust as WGS data, RNA-Seq data can also be utilized for the reconstruction of cpDNA and mtDNA genomes [26,27,28].
This can be attributed to two factors: (i) both rRNA-depleted and poly-A-selected RNA-Seq libraries enable the profiling and discovery of coding and non-coding RNAs, and (ii) the genomic contents of mtDNA and cpDNA primarily consist of coding genes, tRNA genes, and rRNA genes, with only a small fraction dedicated to non-coding intronic and intergenic regions. Several software tools, including GetOrganelle v1.7.7 [29], MitoZ v3.6 [30], and MITGARD v1.0 [26], as well as workflow frameworks like ORTHOSKIM v1.6 [31], go_batch [32], and PhyloHerb v1.1.3 [33], have been developed to facilitate the assembly of organelle genomes and nuclear ribosomal repeats from genomic skimming or RNA-Seq-based transcriptomic data.
Unlike other software and approaches, our workflow stands out due to several distinctive features: (i) We developed a streamlined pipeline for quickly retrieving and organizing SRA data related to multiple species within a particular clade. This process is facilitated using a metafile where users can input the clade’s name. Subsequently, data from the SRA for all species belonging to this clade are automatically downloaded and categorized into separate result files corresponding to different data types; (ii) The “--mt_check” and “--pt_check” options identify species that lack a mitochondrial or chloroplast genome in the NCBI database but have pertinent SRA data; (iii) The pipeline automatically ranks and streamlines the processing of WGS or RNA-Seq data. It automatically excludes NGS data types, such as genotyping-by-sequencing (GBS), RAD-Seq, metabarcoding, and small RNA-Seq data, that are not suitable for obtaining complete mtDNA or cpDNA sequences; (iv) During the automated download process, the pipeline offers users multiple download options, such as sra-tools v3.0.7 (https://github.com/ncbi/sra-tools), IBM Aspera Connect v4.2.6 (https://www.ibm.com/aspera/connect/), and the bash curl (short for “Client URL”) command or Efetch v16.0.2 (E-utilities, https://www.ncbi.nlm.nih.gov/home/tools/) to prevent potential download issues; (v) Users can acquire and analyze organellar genomes from SRA WGS and RNA-Seq data without the need to write any code; a single command is sufficient for DNA data, while RNA data require two distinct commands.
In this study, we developed an automated bioinformatics workflow designed to unveil high-quality mitochondrial and chloroplast genomes by data mining the NCBI SRA database. We evaluated the effectiveness of the pipeline through a case study in which the complete mitochondrial genomes of 76 mollusk species were recovered for the first time by mining the SRA database using only the keyword “mollusca”. With this automated workflow, it becomes feasible to uncover the organelle genomes of the many species for which genomic or transcriptomic data exist in the SRA database but whose organelle genomes remain uncharacterized.

2. Materials and Methods

2.1. Implementation

The OrgaMiner pipeline consists of three fundamental stages: (i) “download_fastq_from_SRA”, (ii) “trimming_and_read_quality_assessment”, and (iii) “assembly_and_annotation_of_organelle_genomes”. In the “download_fastq_from_SRA” stage, the “.fastq” files are obtained from the NCBI SRA database using a range of options described in Section 2.1.2. The “trimming_and_read_quality_assessment” stage produces analysis-ready, high-quality NGS “.fastq” reads and generates summary statistical reports for these reads. This is achieved with the trim_galore v0.6.10 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) wrapper, which trims the raw NGS reads and generates read quality reports. All the outputs from trim_galore (with the “--fastqc” option) are merged into a single report using MultiQC v1.15 [34].

2.1.1. Input File Preparation

To initiate the pipeline, it is necessary to generate a metadata file that contains the scientific names of the target taxa for which organelle genomes are to be extracted. This step facilitates the identification of the relevant SRA files for each taxon for the organelle genome assembly through the utilization of the ESearch utility. Using the “--mt_check or --pt_check” option, species for which organelle genomes are already present in the NCBI database can be excluded from the analysis. Alternatively, the metadata file can include both the scientific names of the species and the corresponding accession numbers of the relevant SRA files. For those who want to skip the download process, the metadata file can contain the names of the read files instead of SRA accession numbers. Examples of metadata files can be located in the dedicated GitHub repository at the following link: https://github.com/MolecularBioDiversityLab/OrgaMiner.
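The taxon lookup described above can be illustrated with a minimal sketch that builds an ESearch request for the SRA database. It is not taken from the OrgaMiner source code: the function name, the `[Organism]` field restriction, and the `retmax` default are assumptions made for illustration, and only the URL is constructed (no request is sent).

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_esearch_url(taxon: str, retmax: int = 100) -> str:
    """Return an ESearch URL that looks up SRA records for one taxon.

    Hypothetical helper: the query syntax OrgaMiner actually uses may differ.
    """
    params = {
        "db": "sra",                    # search the Sequence Read Archive
        "term": f"{taxon}[Organism]",   # assumed field restriction
        "retmax": retmax,               # maximum number of record IDs returned
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_sra_esearch_url("Mollusca")
print(url)
```

The resulting URL could be passed to any HTTP client; NCBI's usage guidelines on request rates and API keys apply to real queries.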

2.1.2. Downloading Unprocessed “.fastq” Files from SRA Database

Finding and downloading “.fastq” files of the taxa of interest from the SRA database can be time-consuming for researchers. The OrgaMiner pipeline offers a user-friendly option for selecting the preferred “.fastq” download method when retrieving the relevant SRA files. This feature is facilitated through the use of the Kingfisher-download program (https://wwood.github.io/kingfisher-download/). Subsequently, the SRA accession numbers for paired-end reads are automatically extracted from the metadata file, thereby simplifying the data retrieval process and eliminating the need for users to download the “.fastq” files manually. The primary objective of this step is to acquire organelle genomes from various taxa using the corresponding SRA files available in public databases. However, users also have the option to extract organelle genomes from any WGS or RNA-Seq data that are already stored, as users have the flexibility to skip the download process and initiate the organelle genome assembly process with the “--skip_download” option.
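As a rough illustration of how paired-end runs suitable for organelle assembly might be picked from run metadata before downloading, consider the sketch below. The record layout, the strategy labels, the exclusion of single-end runs, and the "WGS before RNA-Seq" ordering are all assumptions for this example; the accession numbers are placeholders, and the actual selection logic of OrgaMiner is not reproduced here.

```python
from typing import Dict, List

# Library strategies assumed suitable for organelle assembly; labels are
# assumed to follow the SRA "LibraryStrategy" vocabulary.
SUITABLE_STRATEGIES = {"WGS", "RNA-Seq"}


def select_runs(records: List[Dict[str, str]]) -> List[str]:
    """Keep paired-end WGS/RNA-Seq runs and list WGS runs first."""
    kept = [
        r for r in records
        if r["strategy"] in SUITABLE_STRATEGIES and r["layout"] == "PAIRED"
    ]
    # Stable sort: WGS runs (key 0) come before RNA-Seq runs (key 1).
    kept.sort(key=lambda r: 0 if r["strategy"] == "WGS" else 1)
    return [r["run"] for r in kept]


# Placeholder records, not real SRA accessions.
records = [
    {"run": "SRR0000001", "strategy": "RNA-Seq", "layout": "PAIRED"},
    {"run": "SRR0000002", "strategy": "RAD-Seq", "layout": "PAIRED"},
    {"run": "SRR0000003", "strategy": "WGS", "layout": "PAIRED"},
    {"run": "SRR0000004", "strategy": "WGS", "layout": "SINGLE"},
]
selected = select_runs(records)
print(selected)
```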

2.1.3. Quality Assessment, De Novo Assembly, Annotation, and Outputting

In addition to the GetOrganelle v1.7.7 [29] and MITGARD v1.0 [26] tools, a set of scripts suggested by Senthilkumar et al. [35] was integrated into our pipeline, which enables OrgaMiner to assemble plastid genomes using RNA-Seq data. To ensure that the input reads were suitably prepared for organelle genome assembly using both the GetOrganelle v1.7.7 [29] and MITGARD v1.0 [26] tools, we utilized “trim_galore”. This Perl wrapper integrates cutadapt v4.4 [36] and FastQC v0.12.1 (https://github.com/s-andrews/FastQC) and enables adapter trimming and the removal of poly-A tails and undesired reads, and it generates a quality assessment report. To facilitate a more convenient assessment process, the reports were combined and summarized using MultiQC v1.15 [34]. For de novo organelle genome assembly, it is essential to ensure that all required arguments are included in the command line, as the pipeline utilizes GetOrganelle v1.7.7 [29] software for genome-skimming data and MITGARD v1.0 [26] software for RNA-Seq data. Users can either keep the default settings of these software tools or change their parameters. While the mitochondrial genomes are annotated using MitoZ v3.6 [30], our pipeline lacks a tool for plastid genome annotation, necessitating the use of external tools for annotating plastid genomes. Upon completion, the pipeline compiles the “.fastq” files from the SRA database, the quality assessment reports, the organelle genomes in the “.fasta” format, and their annotations into distinct directories, each denoted with the appropriate species name. In addition, users can have the large “.fastq” files deleted automatically after each assembly to reduce disk usage if required. This systematic storage approach ensures convenient access to the output files while minimizing the storage footprint of the pipeline.
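The trimming and assembly steps for genome-skimming data can be sketched as two external commands. The example below only builds the argument lists and does not claim to reproduce the exact calls issued by OrgaMiner: the helper names and file paths are hypothetical, while the flags shown (`--paired`, `--fastqc`, and `--output_dir` for Trim Galore; `-1`, `-2`, `-o`, and `-F` for GetOrganelle) are standard options of those tools.

```python
from typing import List


def trim_galore_cmd(read1: str, read2: str, out_dir: str) -> List[str]:
    """Build a paired-end Trim Galore call that also runs FastQC."""
    return ["trim_galore", "--paired", "--fastqc",
            "--output_dir", out_dir, read1, read2]


def getorganelle_cmd(read1: str, read2: str, out_dir: str,
                     organelle: str = "animal_mt") -> List[str]:
    """Build a GetOrganelle call.

    'animal_mt' targets animal mitogenomes; 'embplant_pt' would target
    land-plant plastomes.
    """
    return ["get_organelle_from_reads.py", "-1", read1, "-2", read2,
            "-o", out_dir, "-F", organelle]


# Hypothetical file names, for illustration only.
trim_cmd = trim_galore_cmd("sample_1.fastq.gz", "sample_2.fastq.gz", "trimmed")
asm_cmd = getorganelle_cmd("trimmed/sample_1_val_1.fq.gz",
                           "trimmed/sample_2_val_2.fq.gz", "assembly")
print(" ".join(trim_cmd))
print(" ".join(asm_cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the arguments as lists avoids shell-quoting problems with unusual file names.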

2.2. Case Studies

The pipeline was devised for the de novo assembly of organelle genomes from “.fastq” files in the SRA database for species that lack representation in the NCBI Organelle Genome Resources or NCBI Nucleotide databases. In this context, the keyword “mollusca” was added to the pipeline’s configuration file. In its first step, the pipeline automatically collected the scientific names of species that had SRA data but lacked mitochondrial genomes. Using the Aspera option, we retrieved all appropriate WGS and RNA-Seq data, which were then processed with the “command_DNA” and “command_RNA” commands using default parameters. However, mitochondrial genomes were obtained for only a subset of the mollusk species because of issues with the sequence length (shorter than 50 bp) and data quality, quantity, or depth.

2.3. Validation and Application of OrgaMiner Workflow across Diverse Clades

We performed a series of comparative analyses to validate the accuracy of the OrgaMiner workflow’s mitochondrial genome annotations. First, we used the workflow to download the NGS data from mollusk species with well-established mitochondrial genome annotations deposited in the NCBI GenBank database. Then, OrgaMiner processed the NGS data, and the obtained results were compared with those stored in GenBank to assess the reliability of the workflow. The following criteria were used to assess the completeness: the sequence alignment identity (percentage identity), the alignment coverage, and a gene number and composition comparison. Further analyses were conducted to determine the validity of the OrgaMiner workflow in different animal and plant groups. For this purpose, the NGS datasets from segmented worms (Phylum Annelida) were selected and analyzed for de novo mitochondrial genome assembly and annotation using the OrgaMiner workflow. Similarly, using the same workflow, the NGS datasets belonging to various species within the dicotyledon Solanum and monocotyledon Zingiberales taxonomic groups were assembled to obtain their chloroplast genomes. Following the assembly stage and the annotation of the chloroplast genomes using Chlorobox, a comparative analysis was undertaken between the assembled organelle genomes and those obtained from the NCBI. First, a BlastN database was generated with all gene sequences extracted from the reference assemblies. Subsequently, the BlastN algorithm (E-value threshold: 1 × 10⁻¹⁰) was employed, using the gene sequences of each assembled genome as queries to identify sequence similarities with the reference NCBI genomes. Additionally, the genes present in both the reference genomes and the assembled genomes of each species were compared to assess the completeness of the organelle genomes.
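The comparison criteria above (percentage identity and gene composition) can be illustrated with a short sketch that reads BLAST tabular output and compares gene sets. It assumes the default 12-column tabular format (`-outfmt 6`), in which the third column is the percentage identity and the eleventh is the E-value; the function names are hypothetical, the example rows are made-up values rather than results from this study, and the authors' actual comparison scripts are not reproduced.

```python
from typing import Dict, Iterable, Set


def best_identity_per_gene(blast_lines: Iterable[str],
                           max_evalue: float = 1e-10) -> Dict[str, float]:
    """Parse BLAST tabular lines and keep the best % identity per query gene."""
    best: Dict[str, float] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, identity, evalue = fields[0], float(fields[2]), float(fields[10])
        if evalue <= max_evalue and identity > best.get(query, 0.0):
            best[query] = identity
    return best


def gene_completeness(reference: Set[str], assembled: Set[str]) -> float:
    """Fraction of reference genes that were also found in the assembly."""
    return len(reference & assembled) / len(reference) if reference else 0.0


# Made-up rows in the default outfmt 6 column order.
example = [
    "cox1\tref_cox1\t99.50\t1536\t7\t0\t1\t1536\t1\t1536\t0.0\t2800",
    "nad3\tref_nad3\t98.00\t350\t7\t0\t1\t350\t1\t350\t1e-150\t600",
    "atp8\tref_atp8\t80.00\t60\t12\t0\t1\t60\t1\t60\t1e-05\t50",
]
best = best_identity_per_gene(example)
completeness = gene_completeness({"cox1", "nad3", "atp8"}, set(best))
print(best, completeness)
```

In this toy input, the short atp8 hit fails the E-value threshold, which mirrors how short, highly variable genes can drop out of a similarity-based comparison.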

2.4. Code Availability

OrgaMiner is implemented in Python and orchestrates a sequence of external tools to extract and identify mitochondrial or chloroplast reads from WGS or RNA-Seq datasets of non-model organisms for which organelle genome sequences are not available but NGS data are deposited in the NCBI SRA database. OrgaMiner is freely available on GitHub (https://github.com/MolecularBioDiversityLab/OrgaMiner) and is compatible with Python version 3.6 or higher. The source code is available under a BSD-3 license. To reproduce all the assemblies generated in this study, please refer to the instructions provided on our GitHub page.

3. Results and Discussion

3.1. Sequence Characteristics of Datasets in Study

In the context of our research, we developed a flexible workflow that facilitates the automated retrieval, organization, and processing of WGS and RNA-Seq data from the NCBI SRA database to unveil the mitochondrial or chloroplast genome sequences of non-model organisms. We selected the phylum “mollusca” to test the practical application of the developed workflow. The phylum Mollusca was also chosen for the following reasons: (i) Mollusca, one of the most diverse animal phyla, accounts for approximately 23% of all known marine species [37]; (ii) these ecologically and economically significant invertebrates inhabit a wide array of marine, freshwater, and terrestrial ecosystems, displaying adaptability to various challenging oceanic environments [38]; (iii) there is a significant amount of NGS data available for the Mollusca phylum in the NCBI SRA database due to the reasons mentioned above. The OrgaMiner pipeline workflow, which consists of two phases, is depicted in Figure 1. First, the user must enter the taxonomic name or both the taxonomic name and the NCBI SRA accession number as input. After obtaining the information, the next step involves searching the NCBI Nucleotide database to download the sequences of species that do not have organelle genomes but do have WGS or RNA-Seq NGS reads in the NCBI SRA database. The second step involves the quality control of the NGS reads, de novo organelle genome assembly, and annotation (Figure 1). Upon entering the “mollusca” keyword in the workflow config file, the NCBI SRA database was automatically scanned, and the resulting species were matched with the NCBI Nucleotide and Organelle database. The data from species without mtDNA were saved only in “.fastq” format for WGS and RNA-Seq.
According to the mining of the SRA database across the phylum Mollusca, a total of 282 mollusk species possessed NGS data but lacked a corresponding mitochondrial genome. Of these species, 130 have WGS data, while 152 have RNA-Seq transcriptomic data. Regardless of whether WGS or RNA-Seq data were analyzed, from a taxonomic standpoint, most of the mollusk species belong to the Gastropoda, Bivalvia, and Cephalopoda classes (Figure 2). For the de novo mitochondrial DNA (mtDNA) analysis of 282 mollusk species, a total of 27.3 billion sequencing reads were obtained, producing over 4 terabases (Tb) of WGS data. Additionally, 9.75 billion sequencing reads were acquired, generating more than 12 Tb of RNA-Seq data, and both datasets were subsequently downloaded and processed (Table S1). While most NGS data were sequenced in the PE150 or PE100 mode with satisfactory Q20/Q30 Phred quality scores, heterogeneity was observed in the NGS data size generated per species. Sequencing data generated in the paired-end (PE) sequencing mode constitute more than 97% of the total reads in the WGS dataset and more than 92% in the RNA-Seq dataset (Table S1). The PE reads in the WGS dataset vary from approximately 1 M (million) to 554 M, averaging around 98.6 M reads. In comparison, the PE reads in the RNA-Seq dataset range from about 2.2 M to 308 M reads, averaging approximately 27.9 M reads (Figure 2, Table S1).
An examination of the NGS data to be analyzed showed that the read numbers and lengths align with the data sizes reported in the literature. For instance, Notocrater youngi (Gastropoda) has around 3.2 million paired-end reads [39]. Another species of Gastropoda, Planorbella pilsbryi, has approximately 21 million paired-end reads [40]. Additionally, species such as Nerita undata and Nerita balteata (Neritimorpha) have produced roughly 5 gigabases (Gb) of paired-end data per sample [41]. The documentation of the two primary de novo organelle genome assembly programs, MitoZ v3.6 [30] (utilizing the assembly module of a modified version of SOAPdenovo-Trans) and GetOrganelle v1.7.7 [29] (employing SPAdes as the assembler), also suggests that an NGS data yield ranging from 2 to 8 GB is adequate for organelle genome assembly. Therefore, our WGS- or RNA-Seq-based datasets were considered suitable for the de novo assembly and annotation analysis.

3.2. Molluscan mtDNA Annotation Findings and Implications

Although species belonging to the phylum Mollusca are generally consistent with the ancestral mitochondrial genome in terms of the mitochondrial DNA organization (typically encoding 13 proteins, two rRNAs, and 22 tRNAs, as well as a putative control region), they display extraordinary variation in size and architecture within the animal kingdom, such as radical genome rearrangements, gene duplications and losses, and the introduction of novel genes [42,43,44,45]. Despite this complex mitochondrial structure, a high-quality mtDNA sequence and annotation were obtained by mining the NCBI database for 76 mollusk species for which an mtDNA sequence had not yet been reported. Of these complete mtDNAs, 52 were acquired through WGS analysis (Table 1), while 24 were obtained through RNA-Seq read processing (Table 2). Among bilaterian animals, the mitochondrial genomes of the phylum Mollusca exhibit considerable size variation [46]. Upon an examination of 3396 mitochondrial genomes in the NCBI Nucleotide database, it was observed that the minimum length of the mtDNA genome was 13.1 kb (Pliocardia stearnsii), while the maximum was 67.1 kb (zebra mussel, Dreissena polymorpha), with an average length of 16.2 kb. In Table 1 and Table 2, the average length of the mtDNA genomes was approximately 16 kb. The average length of the mtDNA genomes obtained through de novo read assembly in the WGS and RNA-Seq datasets was therefore consistent with the literature and the mitochondrial genome database for Mollusca [37].
In the WGS dataset, the Gastropoda class had the highest number of mitochondrial DNA (mtDNA) genomes (32), followed by Cephalopoda (7), Bivalvia (6), and Polyplacophora (6). Among the RNA-Seq dataset, Gastropoda had the highest number of mitochondrial DNA (mtDNA) genomes with 13, followed by Bivalvia with 8, Cephalopoda with 2, and Polyplacophora with 1. The mtDNA genomes constructed from the WGS dataset show a significantly low count of missing genes, and the counts of coding and non-coding genes are consistent with the general mitochondrial genome pattern observed in mollusks. The presence of 16S large subunit rRNA and 12S small subunit rRNA was thoroughly identified in the de novo assembled mtDNA genomes, regardless of the taxonomic classification. In some bivalve species, the absence of the Atp8 gene leads to a reduction in the number of coding genes to 12 [46,47,48], while, in other mollusk species, there are 13 coding genes, in line with the ancestral mollusk genome. When examining Table 1, it becomes apparent that three Gastropoda species (Lottia persona, Stagnicola palustris, and Aeolidia papillosa) possess 12 coding genes, and upon characterizing the missing gene, it was identified as the Atp8 gene. Later, it was elucidated that this circumstance is not attributed to the absence of the Atp8 gene in the mtDNA genome, observed in the class Bivalvia, but rather to challenges in annotating this particular gene. In recent studies, the annotation of the Atp8 gene has posed challenges due to its high variation and short length [42,49]. We also need to emphasize the following regarding protein-coding genes: in Table 1 and Table 2, some species are marked with * and †, and these marks indicate that the NADH dehydrogenase subunit 3 (Nad3) gene is annotated on both the heavy (high G + T content) and light (low G + T content) strands in these species. We believe that this issue arises from challenges in annotating molluscan genomes. 
Because of the transcription of mtDNA as polycistronic RNA, it is considered physically impossible to have gene overlap between two protein-coding genes encoded on the same strand and in the same open reading frame, but it is possible if the frames are different [42]. Secondly, the boundaries of some coding genes (correct start and stop codon locations) cannot be determined precisely with the current annotation tools following the de novo assembly of NGS short reads, and this may lead to incorrect annotations. Therefore, third-generation technologies, such as Oxford Nanopore or PacBio, may be more effective in mtDNA genome assembly and annotation [50,51]. The mtDNA genome of a typical mollusk contains 22 transfer RNA (tRNA) genes. The mean number of recovered tRNA genes in the mtDNA genomes generated from the WGS analysis was 21.8 (Table 1). However, some tRNA genes could not be annotated in the mtDNA genome produced from the RNA-Seq dataset, resulting in an average number of 19.8 tRNA genes (as shown in Table 2). Recent studies suggest that changes and duplications in mitochondrial tRNA genes contribute significantly to the rearrangement of the mitochondrial genome [42,52]. Our analysis showed that some mollusk species might exhibit duplications in tRNA genes, resulting in over 22 tRNA genes, while fewer than 20 mitochondrial tRNA genes generally indicate an annotation issue. All GenBank annotations (“.gb” and “.gbk” files) and circular mitochondrial DNA (mtDNA) plots generated through the analysis of WGS and RNA-Seq data are included in the Supplementary File. To evaluate the accuracy of the mollusk mitochondrial genome annotations generated by the OrgaMiner workflow, NGS data from species with known mitochondrial genome sequences and annotations were analyzed using the workflow. The resulting mitochondrial genome and annotation files were then compared to the corresponding data available in the NCBI Nucleotide database. 
These analyses demonstrated that the OrgaMiner workflow produces mitochondrial genomes that are highly similar to those in the NCBI Nucleotide database, as evidenced by metrics such as the alignment identity (percentage identity), the alignment coverage, and gene number/composition comparisons. We processed 10 WGS and 10 RNA-Seq mollusk datasets using the OrgaMiner workflow and compared the results to the corresponding mitochondrial genomes in the NCBI Nucleotide database (Table S3). Compared to the reference mtDNA, regardless of coding or non-coding genes, the alignment identities were found to be >99.18% for the WGS data and >98.60% for the RNA-Seq data. However, the WGS data appear to be more successful at extracting mitochondrial genes than the RNA-Seq data (“genome comparison” in Table S3). The validation results largely indicate the reliability of the OrgaMiner workflow in analyzing NGS data from the NCBI SRA. In some cases, partial mitochondrial and chloroplast genomes may be obtained from the OrgaMiner workflow. This occurrence could be attributed to the nature of the RNA-Seq or WGS NGS datasets in the NCBI SRA database rather than to the OrgaMiner workflow itself. Even when the read lengths, depths, and Phred quality scores are satisfactory, several factors inherent to NGS data in the “.fastq” format could negatively impact the analysis. These factors include the following: (i) the low levels of organellar coding genes, and particularly tRNA genes, represented in RNA-Seq data, unlike in WGS data; (ii) the high duplication rates of certain NGS reads, such as the high rRNA duplication common in RNA-Seq data; and (iii) the potential presence of contaminant sequences (non-target organisms).
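The rule of thumb for tRNA counts given earlier in this section (more than 22 genes hinting at duplications, fewer than 20 usually pointing to an annotation issue) can be written as a small helper for screening annotation tables. This is only an illustrative sketch: the function name and the wording of the labels are assumptions, and the label for counts of 20 or 21 is an addition not stated in the text.

```python
def interpret_trna_count(n_trna: int, expected: int = 22) -> str:
    """Classify an annotated mitochondrial tRNA gene count.

    Thresholds follow the rule of thumb in the text; the label for
    counts between 20 and the expected number is an assumption.
    """
    if n_trna > expected:
        return "possible tRNA duplication"
    if n_trna < 20:
        return "likely annotation issue"
    if n_trna == expected:
        return "typical complement"
    return "slightly below typical complement"


for count in (24, 22, 21, 17):
    print(count, interpret_trna_count(count))
```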
Despite the typically high copy numbers of mitochondria (and, hence, mtDNAs) in tissues, fewer organelle genomes and coding/non-coding genes were detected in the RNA-Seq data compared to the WGS dataset. This discrepancy could be attributed to various potential reasons. First, mitochondrial transcripts are polyadenylated by MTPAP (mitochondrial poly(A) polymerase), which adds 40–50 adenine nucleotides to almost all mitochondrial transcripts, a modification that is crucial for mRNA stability and regulation [53,54]. The poly-A tail lengths of nuclear mRNA transcripts vary depending on the tissue type, and the median tail lengths of nuclear mRNA transcripts are generally longer (~250 nucleotides) than those of mitochondrial transcripts [55,56]. Nuclear mRNA transcripts might therefore be more easily captured in RNA-Seq libraries, which are prepared from total RNA through the poly(A) enrichment of mRNA (mRNA-Seq). In addition to coding genes, one notable category of genes often overlooked and challenging to predict in RNA-Seq analysis comprises tRNA genes. Due to their shorter length and lack of polyadenylation, tRNA genes are less frequently represented and sequenced than mRNA and rRNA genes in both poly-A-captured and rRNA-depleted libraries. Moreover, during the bioinformatics preprocessing step, sequences with fewer than 50 bases in the PE100 and PE150 “.fastq” files are typically filtered out and excluded from the analysis. tRNA genes are easily captured in small RNA-Seq libraries, but such libraries cannot be used in de novo mitochondrial genome assembly, so they are excluded from our OrgaMiner workflow. Despite numerous attempts to recover all genes, 132 mitochondrial genomes from the WGS (50) and RNA-Seq (82) datasets of various mollusk species could be assembled only partially and appeared to be of low quality. Nevertheless, these partial mtDNA sequences can be helpful in population genetics, haplotype analysis, and molecular barcoding studies; therefore, their sequences and annotations are included in Table S2.
To test whether complex mitogenomic rearrangements were correctly annotated using OrgaMiner, special attention was given to Pectinidae, one of the most peculiar groups within Mollusca. While gene rearrangements are most common for tRNAs among metazoans in general, the mitochondrial genome rearrangements in Pectinidae often involve coding genes, unlike what has been observed in other animal groups. Malkócs et al. [57] investigated the mitogenomic rearrangements observed in certain Pectinidae species and visualized their mitochondrial structural variations. Using the OrgaMiner workflow, we obtained annotations for three species within this family and compared them with published data (Figure 6 in Malkócs et al. [45]). The order and orientation of the gene arrangement pattern are identical to that of the study. This comparison and the mitochondrial rearrangements shown in Figure 3 validate the ability of the workflow to annotate mitochondrial structural variations as well.

3.3. Applicability of OrgaMiner Workflow across Different Clades

To test the validity of the OrgaMiner workflow for other organism groups, we performed mitochondrial genome analyses on animals using the taxonomic keyword Annelida. For plants, we analyzed the chloroplast genomes of dicotyledons from the Solanum genus and monocotyledons from the Zingiberales order. A total of 37 annelid mitochondrial genomes and annotations were obtained, 27 from whole-genome-sequencing (WGS) data and 10 from RNA-Seq data. Comparison with the mitochondrial genome assemblies and annotations of these species in the NCBI Nucleotide database shows that the OrgaMiner workflow produced accurate results (Table S4). Relative to the reference mitogenomes, the alignment identities for annelid species were >99.65% for the WGS data and >96.67% for the RNA-Seq data. Nearly all the mitochondrial genes of annelids were covered. As observed in the mollusk species, the WGS data appeared to outperform the RNA-Seq data in the de novo retrieval and annotation of annelid organelle genomes in terms of the number of genes retrieved (“mitochondrial comparison” in Table S4). For plants, we analyzed 15 species from the Solanum genus, including 6 with WGS data and 9 with RNA-Seq data, as well as 26 species from the Zingiberales order, all with WGS data. The results from both groups were compared with their reference chloroplast genome sequences and annotations. All plant species showed over 96% alignment identity. In the WGS data analysis of species from the genus Solanum and the order Zingiberales, the chloroplast genome and genes are largely represented and compatible with the reference genome/annotation (“plastid comparison” in Table S4).
Although the de novo-assembled chloroplast genomes from both the WGS and RNA-Seq data showed promising results, with many unique genes aligned and commonly found in the reference genomes, some inconsistencies were observed between the reference genomes and the OrgaMiner-assembled genomes, as the gene numbers and compositions differed more than expected. These inconsistencies may reflect the challenging nature of plastid genome annotation, a task that requires manual curation; users may wish to curate the annotations manually after assembling plastid genomes [57].
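
As a worked illustration of what an alignment identity value expresses, the sketch below computes percent identity from two already-aligned sequences. The article does not state how identity was calculated, so the definition used here (matching columns divided by aligned columns, with a gap counted as a mismatch) is an assumption, and `percent_identity` is a hypothetical helper.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Both inputs must be rows of the same pairwise alignment, with '-' for gaps.
    Columns where both rows are gaps are ignored; a gap against a base counts
    as a mismatch.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y)
        for x, y in zip(aln_a.upper(), aln_b.upper())
        if not (x == "-" and y == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Toy alignment: 10 columns, one substitution and one gap, so 8 matches.
identity = percent_identity("ACGTACGTAC", "ACGTTCGT-C")
print(identity)  # 80.0
```

Other conventions exist, for example excluding gapped columns from the denominator, and they can give slightly different values for the same alignment.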

3.4. Limitations of OrgaMiner Pipeline and Recommendations

OrgaMiner is a tool that quickly searches for species without organelle genomes (mtDNA or cpDNA) in the NCBI Nucleotide database and obtains organelle genomes for these species by processing their WGS and RNA-Seq data from the NCBI SRA database. Although the workflow is useful for quickly revealing new organelle genomes, several factors limit its effectiveness for mining the database. First, the pipeline’s data acquisition process often requires substantial storage resources, particularly when dealing with taxa encompassing large numbers of species. To reduce this burden, users can use the “--remove” option for removing “.fastq” files or the “--remove-all” option, which deletes all “.fastq” files, including those downloaded or already stored, following each assembly process. Nevertheless, the storage requirement remains a significant concern, potentially posing challenges for users with limited storage capacities. Second, OrgaMiner has inherent limitations in plastid genome analysis. One notable constraint is its inability to derive plastid genomes from RNA-Seq data, requiring users to resort to alternative methods for plastid genome reconstruction from transcriptome data. Additionally, the pipeline lacks plastid genome annotation functionality, so external annotation tools and databases are needed, which introduces additional steps. Furthermore, the efficiency of the pipeline is influenced by the choice of download options. While alternative download methods may be faster than the default sra-tools, the pipeline may encounter challenges in acquiring FTP links, potentially impeding data retrieval in some instances.
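
The kind of post-assembly cleanup described above can be sketched in a few lines. This is not the OrgaMiner implementation of “--remove” or “--remove-all”, which is not shown in this article; `remove_fastq_files` is a hypothetical helper that simply deletes FASTQ files below a directory and reports how many bytes were freed, and the file names in the demonstration are invented.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def remove_fastq_files(directory: str, patterns: Sequence[str] = ("*.fastq", "*.fastq.gz")) -> int:
    """Delete FASTQ files under directory (recursively) and return bytes freed."""
    freed = 0
    for pattern in patterns:
        # Materialize the matches first so files are not deleted mid-iteration.
        for path in list(Path(directory).rglob(pattern)):
            if path.is_file():
                freed += path.stat().st_size
                path.unlink()
    return freed


# Demonstration in a throwaway directory with one read file and one assembly.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_run_1.fastq").write_text("@r\nACGT\n+\nIIII\n")
    (Path(tmp) / "assembly.fasta").write_text(">mt\nACGT\n")
    freed = remove_fastq_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(freed > 0, remaining)
```

Only the read files are removed, while assembly outputs are left in place, which is the property that makes such a cleanup safe to run after each assembly.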

4. Conclusions

Taken together, through extensive analysis and evaluations on real “.fastq” data, we report a user-friendly bioinformatics pipeline called OrgaMiner, which enables the management, exploitation, and mining of large genomic and transcriptomic datasets available in the NCBI SRA database to uncover high-quality mitochondrial and chloroplast genomes for non-model organisms automatically. By successfully demonstrating its utility in the de novo assembly of mitochondrial genomes for various mollusk species solely through the keyword “mollusca,” we provide a valuable resource for researchers working with taxonomic and population genetics questions. The versatility of the pipeline extends its applicability to diverse species, including invertebrates, vertebrates, and plants, making it an accessible and efficient solution for organelle genome assembly used mainly in molecular taxonomy, population genetics, and haplotype network analysis.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/d16020104/s1, Table S1. The sequencing summary stats and basic SRA metainformation for the “.fastq” files used in this study; Table S2. The basic annotations and sequences of various mollusk species whose mitochondrial genomes were not entirely covered in the WGS or RNA-Seq datasets; Table S3. Comparative analysis of mollusk mitochondrial genome sequences generated with OrgaMiner workflow against their reference mitochondrial genomes from NCBI Nucleotide database; Table S4. The basic organelle genome features determined by OrgaMiner workflow in animal species belonging to Annelida and plant species belonging to the Solanum genus and Zingiberales order.

Author Contributions

Conceptualization, V.E.; methodology, V.E. and M.A.B.; pipeline development and implementation, M.A.B. and V.E.; validation, V.E. and M.A.B.; writing—review and editing, V.E.; visualization, M.A.B. and V.E.; funding acquisition, V.E. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Scientific Research Projects Coordination Unit of Istanbul University, grant number 37223. The computing resources used in this work were funded by the National Center for High Performance Computing of Turkey (UHeM) under grant number 5004732017.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All relevant data presented in the study are included in the article and Supplementary Material. The source codes (Python and BASH scripts) along with detailed instructions on how these must be executed are also freely available in the GitHub repository (https://github.com/MolecularBioDiversityLab/OrgaMiner).

Acknowledgments

We express our gratitude to Tuana Öğretici for her invaluable assistance with developing the Python scripts and the subsequent transfer of the fully functional code to the GitHub page of our Molecular Biodiversity Lab. We also thank the members of our laboratories for the helpful comments. We are also grateful to the anonymous reviewers for their valuable comments on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Lou, R.N.; Jacobs, A.; Wilder, A.P.; Therkildsen, N.O. A beginner’s guide to low-coverage whole genome sequencing for population genomics. Mol. Ecol. 2021, 30, 5966–5993.
  2. Holmer, R.; van Velzen, R.; Geurts, R.; Bisseling, T.; de Ridder, D.; Smit, S. GeneNoteBook, a collaborative notebook for comparative genomics. Bioinformatics 2019, 35, 4779–4781.
  3. Baxter, S.W.; Davey, J.W.; Johnston, J.S.; Shelton, A.M.; Heckel, D.G.; Jiggins, C.D.; Blaxter, M.L. Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. PLoS ONE 2011, 6, e19315.
  4. Berhe, M.; Dossa, K.; You, J.; Mboup, P.A.; Diallo, I.N.; Diouf, D.; Zhang, X.; Wang, L. Genome-wide association study and its applications in the non-model crop Sesamum indicum. BMC Plant Biol. 2021, 21, 283.
  5. Zeng, Q.; Liu, S.; Yao, J.; Zhang, Y.; Yuan, Z.; Jiang, C.; Chen, A.; Fu, Q.; Su, B.; Dunham, R.; et al. Transcriptome Display During Testicular Differentiation of Channel Catfish (Ictalurus punctatus) as Revealed by RNA-Seq Analysis. Biol. Reprod. 2016, 95, 19.
  6. Eldem, V.; Zararsiz, G.; Erkan, M.; Bakir, Y. De novo assembly and comprehensive characterization of the skeletal muscle transcriptomes of the European anchovy (Engraulis encrasicolus). Mar. Genom. 2015, 20, 7–9.
  7. Du, X.; Wang, B.; Liu, X.; Liu, X.; He, Y.; Zhang, Q.; Wang, X. Comparative transcriptome analysis of ovary and testis reveals potential sex-related genes and pathways in spotted knifejaw Oplegnathus punctatus. Gene 2017, 637, 203–210.
  8. Ubrihien, R.P.; Ezaz, T.; Taylor, A.M.; Stevens, M.M.; Krikowa, F.; Foster, S.; Maher, W.A. The response of Isidorella newcombi to copper exposure: Using an integrated biological framework to interpret transcriptomic responses from RNA-seq analysis. Aquat. Toxicol. 2017, 185, 183–192.
  9. Liang, Q.; Dong, B.; Li, A.; Wu, L.; Zhang, Y.; Han, T.; Liu, X. scRNA-seq analysis reveals toxicity mechanisms in shrimp hemocytes subjected to nitrite stress. Chemosphere 2023, 316, 137853.
  10. Jiao, L.; Dai, T.; Jin, M.; Sun, P.; Zhou, Q. Transcriptome Analysis of the Hepatopancreas in the Litopenaeus vannamei Responding to the Lead Stress. Biol. Trace Elem. Res. 2021, 199, 1100–1109.
  11. Eldem, V.; Zararsiz, G.; Taşçi, T.; Duru, I.P.; Bakir, Y.; Erkan, M. Transcriptome analysis for non-model organism: Current status and best-practices. Appl. RNA-Seq Omics Strateg.-Microorg. Hum. Health 2017, 1, 1–19.
  12. Ghazi, S.; Bourgeois, S.; Gomariz, A.; Bugarski, M.; Haenni, D.; Martins, J.R.; Nombela-Arrieta, C.; Unwin, R.J.; Wagner, C.A.; Hall, A.M.; et al. Multiparametric imaging reveals that mitochondria-rich intercalated cells in the kidney collecting duct have a very high glycolytic capacity. FASEB J. 2020, 34, 8510–8525.
  13. Jurss, K.; Bastrop, R. The Function of Mitochondria-Rich Cells (Chloride Cells) in Teleost Gills. Rev. Fish Biol. Fish. 1995, 5, 235–255.
  14. Magoffin, D.A. Ovarian theca cell. Int. J. Biochem. Cell B 2005, 37, 1344–1349.
  15. Lavrov, D.V.; Bell, E. Mitochondrial genomes in invertebrate animals. In Molecular Life Sciences: An Encyclopedic Reference; Springer: New York, NY, USA, 2014; pp. 1–8.
  16. Gorska-Andrzejak, J.; Stowers, R.S.; Borycz, J.; Kostyleva, R.; Schwarz, T.L.; Meinertzhagen, I.A. Mitochondria are redistributed in Drosophila photoreceptors lacking milton, a kinesin-associated protein. J. Comp. Neurol. 2003, 463, 372–388.
  17. Tunholi, V.M.; Tunholi-Alves, V.M.; Santos, A.T.; Garcia, J.D.; Maldonado, A.; da-Silva, W.S.; Rodrigues, M.D.D.; Pinheiro, J. Evaluation of the mitochondrial system in the gonad-digestive gland complex of Biomphalaria glabrata (Mollusca, Gastropoda) after infection by Echinostoma paraensei (Trematoda, Echinostomatidae). J. Invertebr. Pathol. 2016, 136, 136–141.
  18. Lersten, N.R.; Curtis, J.D. Paraveinal Mesophyll in Calliandra Tweedii and C. emarginata (Leguminosae; Mimosoideae). Am. J. Bot. 1993, 80, 561–568.
  19. Procko, C.; Lee, T.; Borsuk, A.; Bargmann, B.O.R.; Dabi, T.; Nery, J.R.; Estelle, M.; Baird, L.; O’Connor, C.; Brodersen, C.; et al. Leaf cell-specific and single-cell transcriptional profiling reveals a role for the palisade layer in UV light protection. Plant Cell 2022, 34, 3261–3279.
  20. Lynch, M.; Koskella, B.; Schaack, S. Mutation pressure and the evolution of organelle genomic architecture. Science 2006, 311, 1727–1730.
  21. Duminil, J.; Besnard, G. Utility of the Mitochondrial Genome in Plant Taxonomic Studies. Methods Mol. Biol. 2021, 2222, 107–118.
  22. Shinzato, C.; Narisoko, H.; Nishitsuji, K.; Nagata, T.; Satoh, N.; Inoue, J. Novel Mitochondrial DNA Markers for Scleractinian Corals and Generic-Level Environmental DNA Metabarcoding. Front. Mar. Sci. 2021, 8, 758207.
  23. Liu, H.; Zhao, W.; Hua, W.; Liu, J. A large-scale population based organelle pan-genomes construction and phylogeny analysis reveal the genetic diversity and the evolutionary origins of chloroplast and mitochondrion in Brassica napus L. BMC Genom. 2022, 23, 339.
  24. Toparslan, E.; Karabag, K.; Bilge, U. A workflow with R: Phylogenetic analyses and visualizations using mitochondrial cytochrome b gene sequences. PLoS ONE 2020, 15, e0243927.
  25. Cay, S.B.; Cinar, Y.U.; Kuralay, S.C.; Inal, B.; Zararsiz, G.; Ciftci, A.; Mollman, R.; Obut, O.; Eldem, V.; Bakir, Y.; et al. Genome skimming approach reveals the gene arrangements in the chloroplast genomes of the highly endangered Crocus L. species: Crocus istanbulensis (B.Mathew) Rukans. PLoS ONE 2022, 17, e0269747.
  26. Nachtigall, P.G.; Grazziotin, F.G.; Junqueira-de-Azevedo, I.L.M. MITGARD: An automated pipeline for mitochondrial genome assembly in eukaryotic species using RNA-seq data. Brief. Bioinform. 2021, 22, bbaa429.
  27. Smith, D.R. RNA-Seq data: A goldmine for organelle research. Brief. Funct. Genom. 2013, 12, 454–456.
  28. Osuna-Mascaro, C.; de Casas, R.R.; Perfectti, F. Comparative assessment shows the reliability of chloroplast genome assembly using RNA-seq. Sci. Rep. 2018, 8, 17404.
  29. Jin, J.J.; Yu, W.B.; Yang, J.B.; Song, Y.; dePamphilis, C.W.; Yi, T.S.; Li, D.Z. GetOrganelle: A fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020, 21, 241.
  30. Meng, G.L.; Li, Y.Y.; Yang, C.T.; Liu, S.L. MitoZ: A toolkit for animal mitochondrial genome assembly, annotation and visualization. Nucleic Acids Res. 2019, 47, e63.
  31. Pouchon, C.; Boyer, F.; Roquet, C.; Denoeud, F.; Chave, J.; Coissac, E.; Alsos, I.G.; Lavergne, S.; Consortium, P.; Consortium, P. ORTHOSKIM: In silico sequence capture from genomic and transcriptomic libraries for phylogenomic and barcoding applications. Mol. Ecol. Resour. 2022, 22, 2018–2037.
  32. White, O.; Hall, A.; Clark, M.; Williams, S. go_batch: A snakemake pipeline to assemble mitochondrial and ribosomal sequences from genome skims. bioRxiv 2023.
  33. Cai, L.M.; Zhang, H.R.; Davis, C.C. PhyloHerb: A high-throughput phylogenomic pipeline for processing genome skimming data. Appl. Plant Sci. 2022, 10, e11475.
  34. Ewels, P.; Magnusson, M.; Lundin, S.; Kaller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016, 32, 3047–3048.
  35. Senthilkumar, S.; Ulaganathan, K.; Dasgupta, M.G. Reference-based assembly of chloroplast genome from leaf transcriptome data of Pterocarpus santalinus. 3 Biotech 2021, 11, 393.
  36. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 2011, 17, 10–12.
  37. Qu, J.Y.; Xu, Y.R.; Cui, Y.T.; Wu, S.; Wang, L.J.; Liu, X.M.; Xing, Z.K.; Guo, X.Y.; Wang, S.S.; Li, R.R.; et al. MODB: A comprehensive mitochondrial genome database for Mollusca. Database 2021, 2021, baab056.
  38. Gomes-dos-Santos, A.; Lopes-Lima, M.; Castro, L.F.C.; Froufe, E. Molluscan genomics: The road so far and the way forward. Hydrobiologia 2020, 847, 1705–1726.
  39. Uribe, J.E.; Sei, M.; Harasewych, M.G. The mitogenome of the sunken wood limpet: Insights into mitogenome evolution in Lepetellida (Gastropoda: Vetigastropoda). J. Mollus. Stud. 2022, 88, eyac009.
  40. Rempel, E.M.; Marcus, J.M.; Detwiler, J.T. The complete mitochondrial genome of the file ramshorn snail (Mollusca: Gastropoda: Hygrophila: Planorbidae). Mitochondrial DNA B 2021, 6, 3181–3183.
  41. Feng, J.; Guo, Y.; Yan, C.; Ye, Y.; Li, J.; Guo, B.; Lu, Z. Sequence comparison of the mitochondrial genomes in two species of the genus Nerita (Gastropoda: Neritimorpha: Neritidae): Phylogenetic implications and divergence time estimation for Neritimorpha. Mol. Biol. Rep. 2020, 47, 7903–7916.
  42. Ghiselli, F.; Gomes-Dos-Santos, A.; Adema, C.M.; Lopes-Lima, M.; Sharbrough, J.; Boore, J.L. Molluscan mitochondrial genomes break the rules. Philos. Trans. R Soc. Lond B Biol. Sci. 2021, 376, 20200159.
  43. Moreyra, N.N.; Mensch, J.; Hurtado, J.; Almeida, F.; Laprida, C.; Hasson, E. What does mitogenomics tell us about the evolutionary history of the cluster (group)? PLoS ONE 2019, 14, e0220676.
  44. Grande, C.; Templado, J.; Zardoya, R. Evolution of gastropod mitochondrial genome arrangements. BMC Evol. Biol. 2008, 8, 61.
  45. Malkócs, T.; Viricel, A.; Becquet, V.; Evin, L.; Dubillot, E.; Pante, E. Complex mitogenomic rearrangements within the Pectinidae (Mollusca: Bivalvia). BMC Ecol. Evol. 2022, 22, 29.
  46. He, C.B.; Wang, J.; Gao, X.G.; Song, W.T.; Li, H.J.; Li, Y.F.; Liu, W.D.; Su, H. The complete mitochondrial genome of the hard clam Meretrix meretrix. Mol. Biol. Rep. 2011, 38, 3401–3409.
  47. Wang, Y.; Yang, Y.; Kong, L.F.; Sasaki, T.; Li, Q. Phylogenomic resolution of Imparidentia (Mollusca: Bivalvia) diversification through mitochondrial genomes. Mar. Life Sci. Tech. 2023, 5, 326–336.
  48. Li, F.P.; Liu, H.Y.; Heng, X.; Zhang, Y.; Fan, M.F.; Wang, S.S.; Liu, C.S.; Gu, Z.F.; Wang, A.M.; Yang, Y. The complete mitochondrial genome of (Bivalvia, Ostreoidea) indicates the genetic diversity within Gryphaeidae. Biodivers Data J. 2023, 11, e101333.
  49. Gaitán-Espitia, J.D.; Quintero-Galvis, J.F.; Mesas, A.; D’Elía, G. Mitogenomics of southern hemisphere blue mussels (Bivalvia: Pteriomorphia): Insights into the evolutionary characteristics of the complex. Sci. Rep. 2016, 6, 26853.
  50. Kinkar, L.; Gasser, R.B.; Webster, B.L.; Rollinson, D.; Littlewood, D.T.J.; Chang, B.C.H.; Stroehlein, A.J.; Korhonen, P.K.; Young, N.D. Nanopore Sequencing Resolves Elusive Long Tandem-Repeat Regions in Mitochondrial Genomes. Int. J. Mol. Sci. 2021, 22, 1811.
  51. Uliano-Silva, M.; Ferreira, J.G.R.N.; Krasheninnikova, K.; Formenti, G.; Abueg, L.; Torrance, J.; Myers, E.W.; Durbin, R.; Blaxter, M.; McCarthy, S.A.; et al. MitoHiFi: A python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinform. 2023, 24, 288.
  52. Cantatore, P.; Gadaleta, M.N.; Roberti, M.; Saccone, C.; Wilson, A.C. Duplication and remoulding of tRNA genes during the evolutionary rearrangement of mitochondrial genomes. Nature 1987, 329, 853–855.
  53. Ojala, D.; Montoya, J.; Attardi, G. tRNA punctuation model of RNA processing in human mitochondria. Nature 1981, 290, 470–474.
  54. Tomecki, R.; Dmochowska, A.; Gewartowski, K.; Dziembowski, A.; Stepien, P.P. Identification of a novel human nuclear-encoded mitochondrial poly(A) polymerase. Nucleic Acids Res. 2004, 32, 6001–6014.
  55. Kühn, U.; Gündel, M.; Knoth, A.; Kerwitz, Y.; Rüdel, S.; Wahle, E. Poly(A) Tail Length Is Controlled by the Nuclear Poly(A)-binding Protein Regulating the Interaction between Poly(A) Polymerase and the Cleavage and Polyadenylation Specificity Factor. J. Biol. Chem. 2009, 284, 22803–22814.
  56. Honarmand, S.; Shoubridge, E.A. Poly (A) tail length of human mitochondrial mRNAs is tissue-specific and a mutation in LRPPRC results in transcript-specific patterns of deadenylation. Mol. Genet. Metab. Rep. 2020, 25, 100687.
  57. Qu, X.J.; Zou, D.; Zhang, R.Y.; Stull, G.W.; Yi, T.S. Progress, challenge and prospect of plant plastome annotation. Front. Plant Sci. 2023, 14, 1166140.
Figure 1. OrgaMiner workflow: schematic representation of the NGS data analysis process, from downloading WGS- or RNA-Seq-based “.fastq” files to de novo mitochondrial genome assembly and annotations for species with NGS data in the NCBI SRA database but that lack a complete mitochondrial genome. (A) The selection of species for de novo mtDNA assembly (“--mt_check or --pt_check”) and downloading NGS data via various approaches. (B) Subsequent steps involve QC analysis, de novo mtDNA assembly, and annotation of the downloaded NGS data.
Figure 2. Overview of (A) WGS dataset and (B) RNA-Seq dataset properties and taxonomic distribution of the classes with the most abundant data. Orange bars represent total read numbers, gray bars represent total base numbers, and dots on the blue line represent the numbers of species belonging to these classes. The values on the left side of the graphs represent the logarithm (base 10, log10) of the total read counts and base numbers (bp), relative to the bars. The numbers on the right side of the graphs are related to the lines and represent the numbers of species.
Figure 3. Structural and synteny comparisons of mitochondrial gene rearrangements observed in Pectinidae species were performed by uploading annotations of mtDNAs using pyGenomeViz (https://github.com/moshi4/pyGenomeViz) with default settings.
Table 1. Basic information on the mitochondrial genome characteristics of various mollusk species, including the mtDNA genome lengths and numbers of coding and non-coding genes, obtained through the WGS data in the NCBI SRA database. * indicates that the ND3 gene was identified both in the H-strand and L-strand during the annotation processes.
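| Class | Family | Species | Coding Genes | tRNA Genes | rRNA Genes | Missing Genes | Total Genes | mtDNA Length (bp) |
|---|---|---|---|---|---|---|---|---|
| Bivalvia | Mytilidae | Botula fusca | 12 | 21 | 2 | 2 | 35 | 19,595 |
| Bivalvia | Unionidae | Elliptio hopetonensis | 13 | 22 | 2 | 0 | 37 | 15,775 |
| Bivalvia | Tellinidae | Macoma nasuta | 12 | 22 | 2 | 1 | 36 | 17,348 |
| Bivalvia | Unionidae | Megalonaias nervosa | 13 | 22 | 2 | 0 | 37 | 16,026 |
| Bivalvia | Anomiidae | Pododesmus macrochisma | 13 | 22 | 2 | 0 | 37 | 15,080 |
| Bivalvia | Veneridae | Saxidomus gigantea | 13 | 22 | 2 | 0 | 37 | 19,754 |
| Cephalopoda | Octopodidae | Muusoctopus eicomar | 13 | 22 | 2 | 0 | 37 | 16,168 |
| Cephalopoda | Octopodidae | Muusoctopus leioderma * | 13 | 22 | 2 | 0 | 37 | 17,006 |
| Cephalopoda | Octopodidae | Muusoctopus longibrachus | 13 | 22 | 2 | 0 | 37 | 16,192 |
| Cephalopoda | Octopodidae | Octopus americanus | 13 | 22 | 2 | 0 | 37 | 15,655 |
| Cephalopoda | Octopodidae | Amphioctopus burryi | 13 | 22 | 2 | 0 | 37 | 15,883 |
| Cephalopoda | Sepiolidae | Rondeletiola minor | 13 | 22 | 2 | 0 | 37 | 15,800 |
| Cephalopoda | Loliginidae | Doryteuthis pealeii | 13 | 21 | 2 | 1 | 36 | 16,674 |
| Gastropoda | Onchidorididae | Corambe burchi | 13 | 22 | 2 | 0 | 37 | 14,308 |
| Gastropoda | Neomphalidae | Cyathermia naticoides * | 13 | 22 | 2 | 0 | 37 | 16,156 |
| Gastropoda | Ovulidae | Cyphoma gibbosum | 13 | 20 | 2 | 2 | 35 | 16,638 |
| Gastropoda | Dironidae | Dirona albolineata | 13 | 23 | 2 | 0 | 38 | 14,651 |
| Gastropoda | Dorididae | Doris verrucosa * | 13 | 22 | 2 | 0 | 37 | 14,518 |
| Gastropoda | Plakobranchidae | Elysia diomedea | 13 | 21 | 2 | 1 | 36 | 14,158 |
| Gastropoda | Chromodorididae | Goniobranchus kuniei | 13 | 23 | 2 | 0 | 38 | 14,738 |
| Gastropoda | Haliotidae | Haliotis corrugata | 13 | 22 | 2 | 0 | 37 | 16,951 |
| Gastropoda | Haliotidae | Haliotis discus discus | 13 | 22 | 2 | 0 | 37 | 16,805 |
| Gastropoda | Haliotidae | Haliotis fulgens | 13 | 22 | 2 | 0 | 37 | 16,376 |
| Gastropoda | Haliotidae | Haliotis gigantea | 13 | 22 | 2 | 0 | 37 | 16,539 |
| Gastropoda | Haliotidae | Haliotis kamtschatkana | 13 | 22 | 2 | 0 | 37 | 16,892 |
| Gastropoda | Haliotidae | Haliotis madaka | 13 | 22 | 2 | 0 | 37 | 16,745 |
| Gastropoda | Haliotidae | Haliotis midae | 13 | 22 | 2 | 0 | 37 | 16,530 |
| Gastropoda | Haliotidae | Haliotis sorenseni | 13 | 22 | 2 | 0 | 37 | 16,711 |
| Gastropoda | Glaucidae | Hermissenda crassicornis | 13 | 21 | 2 | 1 | 36 | 14,750 |
| Gastropoda | Lepetodrilidae | Lepetodrilus galriftensis * | 13 | 22 | 2 | 0 | 37 | 19,339 |
| Gastropoda | Lepetodrilidae | Lepetodrilus gordensis | 13 | 22 | 2 | 0 | 37 | 16,455 |
| Gastropoda | Littorinidae | Littorina arcana | 13 | 22 | 2 | 0 | 37 | 16,301 |
| Gastropoda | Littorinidae | Littorina compressa | 13 | 22 | 2 | 0 | 37 | 16,349 |
| Gastropoda | Lottiidae | Lottia persona | 12 | 22 | 2 | 1 | 36 | 17,106 |
| Gastropoda | Peltospiridae | Peltospira delicata | 13 | 23 | 2 | 0 | 38 | 15,523 |
| Gastropoda | Tateidae | Potamopyrgus kaitunuparaoa | 13 | 23 | 2 | 0 | 38 | 15,332 |
| Gastropoda | Lepetodrilidae | Pseudorimula midatlantica * | 13 | 22 | 2 | 0 | 37 | 16,411 |
| Gastropoda | Lymnaeidae | Radix swinhoei | 13 | 19 | 2 | 3 | 34 | 14,998 |
| Gastropoda | Scyllaeidae | Scyllaea pelagica | 13 | 22 | 2 | 0 | 37 | 14,759 |
| Gastropoda | Lymnaeidae | Stagnicola palustris | 12 | 21 | 2 | 2 | 35 | 14,261 |
| Gastropoda | Tritoniidae | Tritonia tetraquetra | 13 | 22 | 2 | 0 | 37 | 15,087 |
| Gastropoda | Tylodinidae | Tylodina fungina | 13 | 21 | 2 | 2 | 36 | 14,649 |
| Gastropoda | Aegiretidae | Aegires albopunctatus | 13 | 21 | 3 | 1 | 37 | 13,947 |
| Gastropoda | Aeolidiidae | Aeolidia papillosa | 12 | 22 | 2 | 2 | 36 | 16,696 |
| Gastropoda | Goniodorididae | Ancula gibbosa | 13 | 22 | 2 | 0 | 37 | 14,532 |
| Polyplacophora | Mopaliidae | Mopalia ciliata * | 13 | 21 | 2 | 1 | 36 | 13,987 |
| Polyplacophora | Mopaliidae | Mopalia kennerleyi * | 13 | 22 | 2 | 0 | 37 | 14,290 |
| Polyplacophora | Mopaliidae | Mopalia muscosa * | 13 | 22 | 2 | 0 | 37 | 14,976 |
| Polyplacophora | Mopaliidae | Mopalia swanii * | 13 | 22 | 2 | 0 | 37 | 14,969 |
| Polyplacophora | Mopaliidae | Mopalia vespertina * | 13 | 22 | 2 | 0 | 37 | 14,987 |
| Polyplacophora | Chitonidae | Acanthopleura granulata | 13 | 22 | 2 | 0 | 37 | 15,618 |
| Solenogastres | Gymnomeniidae | Wirenia argentea | 13 | 20 | 2 | 2 | 35 | 16,443 |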
Table 2. Overview of mtDNA genome lengths and compositions of some mollusk species analyzed through RNA-Seq data in the NCBI SRA database. indicates that the ND3 gene was identified both in the H-strand and L-strand during the annotation processes.
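| Class | Family | Species | Coding Genes | tRNA Genes | rRNA Genes | Missing Genes | Total Genes | mtDNA Length (bp) |
|---|---|---|---|---|---|---|---|---|
| Bivalvia | Mactridae | Mactra antiquata | 13 | 18 | 2 | 5 | 33 | 16,429 |
| Bivalvia | Unionidae | Uniomerus tetralasmus | 13 | 18 | 2 | 4 | 33 | 15,247 |
| Bivalvia | Mytilidae | Mytilus planulatus | 13 | 22 | 2 | 1 | 37 | 16,727 |
| Bivalvia | Vesicomyidae | Archivesica packardana | 14 | 22 | 2 | 0 | 38 | 16,467 |
| Bivalvia | Pharidae | Ensis directus | 13 | 23 | 2 | 0 | 38 | 16,925 |
| Bivalvia | Ostreidae | Saccostrea palmula | 13 | 19 | 2 | 4 | 34 | 16,130 |
| Bivalvia | Mytilidae | Gigantidas horikoshii | 12 | 20 | 2 | 3 | 34 | 17,504 |
| Bivalvia | Thyasiridae | Conchocele bisecta | 12 | 22 | 2 | 2 | 36 | 17,181 |
| Cephalopoda | Sepiolidae | Rossia pacifica | 14 | 18 | 2 | 4 | 34 | 14,897 |
| Cephalopoda | Octopodidae | Enteroctopus megalocyathus | 14 | 20 | 2 | 2 | 36 | 16,027 |
| Gastropoda | Ranellidae | Monoplex corrugatus | 13 | 18 | 2 | 4 | 33 | 16,178 |
| Gastropoda | Planorbidae | Biomphalaria alexandrina | 13 | 19 | 2 | 3 | 34 | 13,570 |
| Gastropoda | Conidae | Conus ammiralis | 13 | 19 | 2 | 3 | 34 | 15,459 |
| Gastropoda | Conidae | Conus purpurascens | 13 | 19 | 2 | 3 | 34 | 15,509 |
| Gastropoda | Facelinidae | Facelina rubrovittata | 13 | 19 | 2 | 3 | 34 | 14,481 |
| Gastropoda | Chromodorididae | Verconia verconis | 13 | 19 | 2 | 3 | 34 | 14,560 |
| Gastropoda | Nacellidae | Cellana rota | 14 | 19 | 2 | 3 | 35 | 16,042 |
| Gastropoda | Semisulcospiridae | Semisulcospira reiniana | 13 | 20 | 2 | 4 | 35 | 15,291 |
| Gastropoda | Tritoniidae | Tritoniopsis frydis | 13 | 20 | 2 | 5 | 35 | 14,481 |
| Gastropoda | Turbinidae | Angaria nodosa | 14 | 20 | 2 | 3 | 36 | 19,389 |
| Gastropoda | Conidae | Conus bayani | 14 | 20 | 2 | 2 | 36 | 15,525 |
| Gastropoda | Nacellidae | Nacella polaris | 13 | 21 | 2 | 1 | 36 | 16,752 |
| Gastropoda | Conidae | Conus chaldaeus | 13 | 22 | 2 | 0 | 37 | 15,442 |
| Polyplacophora | Chitonidae | Tonicia schrammi | 14 | 18 | 2 | 4 | 34 | 14,909 |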
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Eldem, V.; Balcı, M.A. Mining NCBI Sequence Read Archive Database: An Untapped Source of Organelle Genomes for Taxonomic and Comparative Genomics Research. Diversity 2024, 16, 104. https://doi.org/10.3390/d16020104