1. Introduction
Daphnia, commonly known as the water flea, are small crustaceans usually inhabiting freshwater ponds and lakes on all continents of the globe. It has long been used as a model for the elucidation of animal responses and adaptations to environmental changes [
1]. It has been also used in diverse biological research areas such as ecology, ecotoxicology, evolution, and reproductive biology due to its important position in the aquatic food chain, a high degree of phenotypic plasticity, and cyclical parthenogenesis responding to environmental stimuli [
1,
2,
3,
4,
5,
6,
7]. Its sensitive behavioral and physiological responses are parameters used as biomarkers of the effect induced by various substances [
1,
2,
3,
4,
5,
6,
7,
8]. Thus, it has been used for reproduction tests, acute toxicity studies, and chronic toxicity tests in the OECD Guidelines [
9,
10]. The current state of knowledge of the nuclear genome of
Daphnia magna is based on reports by Routtu et al. [
11,
12], Dukić et al. [
13], and Lee et al. [
14]. Low- and high-density genetic linkage maps were obtained, in which they assembled the whole genome sequence of
D. magna. Specific genetic markers from a high-resolution genetic linkage map of
D. magna Xinb3 were evaluated in toxicological studies by Korea Institute of Toxicology (KIT) [
14].
Genomic resources are steadily being developed for many species of the genus Daphnia. In particular, a database of around 12,000 expressed sequence tags (EST) is currently available (
http://wfleabase.org) [
15], providing a useful resource to isolate polymorphic genetic markers in this species. However, there is no information about pseudogenic sequences of mitochondrial origin (NUMTs) in the
D. magna genome.
Nuclear DNA sequences that are homologous to the mitochondrial genome are often referred to as mitochondrial pseudogenes, or NUMTs [
16]. NUMTs may differ in length and be as large as the full length of the mitochondrial genome [
17]. It has been reported that the NUMT length is positively correlated with the genome size, suggesting potential roles of non-coding DNA gain and loss in NUMT accumulation [
18]. NUMTs have been documented in almost all eukaryotic genomes studied [
19]. The transfer of mitochondrial DNA sequences into the nuclear genome is an ongoing evolutionary process [
20], which has markedly influenced the evolution and function of eukaryotic genomes [
19]. Thus, NUMTs are good materials for studying the evolution of nuclear sequences without selective constraints [
21]. However, because of their homology, NUMTs may confound mtDNA studies, as the NUMT co-amplification product could interfere with sequence analysis [
22].
This is the first study in which the D. magna nuclear genome deposited in the GenBank database was analyzed for pseudogene sequences of mitochondrial origin. The present study aimed to identify NUMTs, their length, homology, and location for potential use in evolutionary studies and to check whether their occurrence causes co-amplification during mitochondrial genome analyses.
2. Results
Bioinformatic analysis showed 1909 fragments of the mitochondrial genome, of which 1630 fragments were located in ten linkage groups of the nuclear genome of
D. magna. The other fragments were localized in scaffolds which were used during genome sequencing. All the NUMTs found in this research are listed in
supplementary materials (Table S1). The total length of the NUMT sequences in the linkage groups corresponded to the number of fragments on individual LG (
Table 1). The total length of NUMTs in the
D. magna genome was 44.391 base pairs (bp), which accounted for 0.042% of the length of the nuclear genome. Their percentage content was 0.037% in the longest linkage group 2, in which 228 NUMTs were identified, and 0.047% in the shortest linkage group 10 (
Table 1). The most frequently occurring fragments of the mtDNA sequence in the nuclear genome included
ND2 (115),
ND3 (113),
TRNA-CYS (110), and 16S rRNA (105). However, the highest number of NUMTs was observed for the non-coding area, i.e., the D-loop (147) (
Table 2). The lowest numbers of mtDNA fragments found in nDNA were observed mainly for genes encoding tRNA molecules:
TRNA-PRO and
TRNA-MET (6),
TRNA-TYR and
TRNA-ASN (4), and
TRNA-SER1 (2). The highest numbers of mtDNA fragments were recorded on LG2-228, and the lowest—on LG8-134 (
Table 2).
The longest fragments of the mitochondrial genome present in the nuclear genome were observed for the D-loop (182 bp),
ND4 (108 bp),
ND3 (99 bp), and
ND5 and
COX3 (94 bp each) (
Table 3). The 182-bp fragment of the D-loop constituted 63% of the entire sequence of this region. In contrast, in the case of the other protein-encoding genes, the fragment size ranged from 4% (
CYTB) to 32% (
ATP8). In turn, NUMTs were recorded among genes encoding tRNA, constituting over 90% of the mtDNA gene sequence:
TRNA-ARG and
TRNA-THR (95% each) and
TRNA-GLU (97%). All the analyzed sequence fragments had a minimum length in the range of 16–25 bp (
Table 3).
Of the 1630 NUMTs (
Table 1), 253 fragments, representing 16% of all NUMTs, showed 100% homology with the mtDNA gene sequences. 100% sequence homology for
TRNA-MET was found for all 6 NUMTs (
Table 2 and
Table 4). 23 NUMTs whose sequence homology was 100% were observed for the D-loop region and 21 NUMTs for the gene
ND3. In contrast, in the case of genes
TRNA-SER1 and
TRNA-ALA, no NUMTs with 100% sequence homology were observed (
Table 2). The mean values of the percentage of sequence identity ranged from 88.0% (
ATP8) to 100% (
TRNA-MET). The percentage identity for the individual linkage groups was in the range of 90–91%. The largest homology was recorded on LG10 (90.7%) and the lowest—on LG6 (90.2%) (
Table 4). At least one sequence with 100% homology was identified on each of the linkage groups.
Table 5 compares the NUMTs’ number and length with mtDNA fragments found in pseudogenes and transcriptome of
D. magna. The overall number of mtDNA fragments in pseudogenes was 599 and their highest length was estimated between 18 bp (
TRNA-MET) to 99 bp (
ND3). The lowest number of mtDNA fragments being part of pseudogenes was observed for
TRNA-PRO (1),
ATP8 (2), and
TRNA-ASP and
TRNA-ILE (3). The highest number of these fragments was observed for
16s rRNA (63) and D-loop (48). In the case of mtDNA fragments found in the transcriptome, the highest number was observed for
TRNA-CYS (101) and
16S rRNA (100). The lowest number of mtDNA fragments was found for
TRNA-SER1 and
TRNA-PHE gene (1). The highest length ranged between 20 bp to 64 bp. The main localization of mtDNA fragments in transcriptome was in different mRNA transcript variants of various protein-coding genes as well as ncRNA and misc_RNA sequences (
Table S1). The overall number of mtDNA fragments in transcriptome was 1275.
3. Discussion
This is the first study in which the
D. magna nuclear genome deposited in the GenBank database was analyzed for pseudogene sequences of mitochondrial origin. The first complete information about the genome of
D. magna was published by Lee et al. in 2019 [
14]. To date, there is no information about pseudogenes localized in the genome of the water flea. Our research provides comprehensive bioinformatic details about the location of NUMTs found in the reference genome, which may be useful for future phylogenetic, evolution, and/or population analyses.
A computer-based search for NUMTs in the nuclear genome of 85 species of animals, plants, fungi, and protists showed that the total length of detected NUMTs varied from 0 to 823.9 kb per nuclear genome [
23]. For instance, the NUMT content was 0 in
Anopheles gambiae, 263.478 bp in
Homo sapiens, and more than 800 kbp in
Oryza sativa [
23]. In the case of
D. magna (
Table 1), the overall length of NUMT in the nuclear genome was 44.391 bp. The total length of 24 NUMTs was 9.989 bp in the
Pteromalus puparum genome, and 42.972 bp in
Nasoni vitripennis [
24], and more than 230 kbp in
Apis mellifera [
25]. There seems to be a positive correlation between the haploid genome sizes (C-values) and NUMT amount/prevalence in eukaryotes [
26].
Another explanation for the differences in the total length of NUMTs is the fact that they accumulate in the genomes in a continuous evolutionary process [
18]. Like in other species, e.g.,
Myotis lucifugus [
27], the percentage of NUMTs in the genome was less than 0.1%. In contrast, the number and length of NUMTs may vary depending on the computer-based query for NUMTs in the BLAST tool, as in the case of the genome of
Canis lupus familiaris [
22,
28]. In our study, the sequence search in BLASTN 2.6.0 implemented in CLC Genomics Workbench 12.0 yielded 1909 results, although 279 results were found in various scaffolds used during the sequencing of the
D. magna genome. In this paper, however, the NUMT results from the scaffold were excluded due to the potential occurrence of artifacts created during sequencing, as observed by Shi et al. [
27], where surprisingly, an entire mitochondrial genome was found in the scaffold AAPE02072785 in the
M. lucigufus genome. It is also worth considering that, in this work, the BLAST search result took into account all search results, even those fragments whose length was only 16 bp (
Table 3). The NUMTs found in scaffolds and mitochondrial DNA fragments found in pseudogenes and transcriptome are listed in
supplementary materials (Table S1).
In the case of various mtDNA gene fragments, such as
TRNA-MET,
COX1, and
ND4 where their number is higher in the transcriptome than in the nuclear genome, it may be the result of the presence of multiple mRNA transcripts or ncRNA forms in the transcriptome. The variability of mRNA and ncRNA forms may be caused by alternative splicing. Moreover, the transcriptome is changing during the lifespan and may depend on the type of tissue [
29]. It is not excluded that some NUMT are parts of protein-coding genes and/or participate in gene regulation through ncRNA formation. In this research, the authors focused on describing the presence of mitochondrial DNA in the nuclear genome, while the investigation of the role of NUMTs in gene expression and regulation in
D. magna should be verified experimentally in further analyses.
NUMTs were used to define characteristics and to clarify phylogenetic inconsistencies suggested by paralog sequences [
18,
30]. The analysis performed by Mishmar et al. [
31] revealed that mtDNA fragments, which were integrated into the nucleus before the radiation of modern human mtDNAs, confirming that mtDNAs similar to today’s African macro-haplogroup L were the first human mtDNAs. The analysis of NUMTs in
D. magna revealed that the latest evolutionary sequences are pseudogenes derived from the sequence
TRNA-MET as the homology of the entire gene sequence was 100%. The homology of
TRNA-MET gene fragments localized in transcriptome is below 100%, however, it could be a result of gene modifications after expression (
Table S1). In addition, all mtDNA fragments in transcriptome might differ from each other due to different ways of modulating gene expression depending on the tissue. The lowest homology (70.33%) was characteristic for the pseudogene derived from the mtDNA D-loop sequence located on LG5 (
Table S1). It may probably be the oldest element of mitochondrial DNA incorporated into the nuclear genome. However, due to the high degree of mutation in the D-loop, the thesis requires further verification. Similarly, NUMTs derived from
TRNA-ALA and
TRNA-SER1 may have been one of the first sequences derived from mtDNA. However, by assessing only sequence homology, the sequence of incorporation of mitochondrial pseudogenes into the nuclear genome cannot be determined. Nevertheless, as observed by Mishmar et al. [
31], the nuclear genome accumulates mutational changes at a much slower rate than mtDNA. Hence, the sequences of “recent” NUMTs can provide valuable information about the mtDNA sequences of the earliest humans.
NUMTs exhibit different degrees of homology to their mitochondrial counterparts. They are variable in size, evenly distributed within and among chromosomes, and, in some cases, they are highly rearranged and/or fragmented [
32]. However, the mitochondrial chromosome’s size does not correlate with the NUMT frequency or size distribution [
19]. The transfer to the nucleus can be influenced by mitochondria’s vulnerability to stress and other factors that may cause the escape of mtDNA to the cytoplasm [
32]. Mutations in mtDNA may occur in the entire mitochondrial genome; however, they are most frequently detected in the hypervariable regions of D-loops [
33,
34,
35].
The number of mutations as well as their incidence in the D-loop area may be related to the number of NUMTs occurring in the nuclear genome. The higher the mutation rate, the greater the likelihood of transfer of the D-loop fragment into the cytoplasm followed by its incorporation into the nuclear genome as a pseudogene. In the
D. magna genome, the highest numbers of pseudogenes from the D-loop (147) were observed, and only 23 of them had 100% sequence homology (
Table 1).
No pseudogenes derived from the
TRNA-ILE gene sequence were observed in the nuclear genome (
Table 1). However, the detailed search of
TRNA-ILE fragments in pseudogenes revealed the presence of three of them in LOC116920659, LOC116923942, and LOC116927475 and two fragments in the transcriptome (
Table S1). The difference between the number of NUMTs and pseudogenes and transcriptome for the
TRNA-ILE gene might be the result of the searching method. Pseudogenes were generated by automated computational analysis using gene prediction method: Gnomon [
36]. Gnomon is a gene prediction HMM-based program. The core algorithm is based on Genscan which uses a 3-periodic fifth-order Hidden Markov Model (HMM) for the coding propensity score and incorporates descriptions of the basic transcriptional, translational, and splicing signals, as well as length distributions and compositional features of exons, introns, and intergenic regions. For the genes for which no experimental information is available, Gnomon is creating conventional ab initio predictions [
36]. As a result of this prediction, the proposed gene contains the fragment of
TRNA-ILE. The presence of this sequence fragments derived from mtDNA cannot be excluded, however, further experimental analysis of the protein should be held. Due to the lack of comparison with other genome sequences for
D. magna the authors could not exclude any of the results. The parameters of the Gnomon prediction were widely used in the description of other model organisms’ genomes i.e.,
Arabidopsis thaliana,
Danio rerio,
Mus musculus, and
Drosophila melanogaster [
36]. Thus, the authors recognize the possibility of occurrence of tRNA-ILE-derived NUMTs in the
D. magna genome.
In the human genome, NUMTs are commonly associated with repetitive elements, suggesting a possible role for transposable elements in mtDNA integration in the nuclear genome [
31]. Certain NUMTs are repeated multiple times within the human genome [
32,
37]. In the case of the
D. magna genome, some NUMTs were also observed, which were repeated many times in different linkage groups (
Table S1). However, their association with repetitive elements in the nuclear genome requires additional research. The average levels of NUMT sequence homology for the individual linkage groups do not differ significantly from each other, which may indicate a random and even inclusion of sequence fragments into each of them (
Table 4).
The cytochrome C oxidase subunit I (
COI) has possibly been the most commonly studied marker. However, its popularity is mainly associated with its use as a marker for the DNA barcoding of animal diversity [
38]. There are several factors causing inadequacy of mtDNA in general and
COI individually, such as male-biased gene flow, the selection on any mtDNA nucleotide(s) (as the whole genome is one linkage group), retention of ancestral polymorphism, and introgression following hybridization [
39]. Presently, there are huge numbers of
COI sequences in public databases, and most of them have a limited length, generally close to the length of the barcoding region. It is known that the possibility of the presence of NUMTs in the existing data should not be ignored [
40]. Since the success of taxonomic differentiation is positively correlated with the barcode length, the minibarcode length is usually kept above 100 bp. For example, an approximately 250-bp region of 16S rRNA can be successfully amplified from various medicinal preparations and food products. It provides the correct identification of animal species [
41,
42]. Gene fragments that are often used for species identification in
D. magna are in the following ranges:
COX1 (19–72 bp),
CYTB (18–46 bp), 12s rRNA (18–72 bp), and 16s rRNA (19–78 bp); each of them constitutes less than 10% of the length of the entire gene (
Table 3). Hence, the NUMT sequences of frequently analyzed genes are generally shorter than the respective mitochondrial sequence; thus, the possibility of NUMT co-amplification should decrease with an increased length of the targeted mitochondrial marker. However, it is worth paying attention to the coverage of NUMTs derived from the
TRNA-ARG (95%),
TRNA-GLU (97%), and
TRNA-THR (95%) genes (
Table 3). The transcriptome analysis for
TRNA-ARG revealed the existence of the NUMT in lncRNA sequence transcript variants of the gene LOC116925827 (
Table S1). The presence of this gene fragment in a non-coding sequence may suggest its role in gene regulation. The
TRNA-THR-derived NUMT was identified in mRNA gene transcript variant LOC116930084. On the other hand, the sequence fragment of
TRNA-GLU was only partially found in the transcriptome in zinc finger CCHC domain-containing protein 7-like mRNA localized in LG2. Perhaps, in these cases, they are not NUMTs but functional gene fragments taking part in gene modulation. Nonetheless, to confirm it, a detailed analysis of the transcriptome on larger research groups should be held.
NUMTs are highly polymorphic in terms of the sequence, homo/heterozygosis status, and presence/absence at a specific locus [
43]. These features facilitate the use of NUMTs as specific population markers, as proposed for the human population by Lang et al. [
44] The biological importance of NUMTs may correlate with their location on the chromosome. Depending on the insertion location, NUMTs may perturb the function of the genes [
45]. Additionally, de novo integration of NUMT pseudogenes into the nuclear genome has an adverse effect in some cases: promoting various disorders and aging, as observed in humans [
46]. Chatre and Ricchetti [
47] report that migratory mitochondrial DNA can also impact the replication of the nuclear region in
Saccharomyces cerevisiae.