1. Introduction
Endogenous retroviruses (ERVs), discovered in the late 1960s and early 1970s, derive from exogenous retroviruses (XRVs) that infected germ lines throughout evolution. Once integrated into the genomes of germ cells, they coevolved with the host and became fixed in subsequent generations, spreading vertically as proviruses in a Mendelian manner [1,2,3,4,5,6]. The genomic structure of ERVs is the same as that of exogenous retroviruses, comprising the internal coding sequences gag, pro, pol, and env (5–10 kb in length), flanked by a pair of identical long terminal repeats (LTRs; 300–1200 nucleotides) that carry cis-regulatory elements (CREs). There are many varieties of ERVs, generally named by adding a letter indicating the host species before “ERV”. ERVs are classified by phylogeny into three major classes: class I includes Gamma-like and Epsilon-like elements; class II includes Beta-like elements and lentiviruses; and class III includes Spuma-like elements [7,8]. The primer binding site (PBS) of class II ERVs is complementary to lysine (K) tRNA; therefore, these ERVs are named ERV-K.
ERVs initially integrate as proviral sequences, and those found in genomes today range from intact proviruses to highly fragmented proviral elements. Over time, they may lose the ability to replicate and transpose through destructive mutations, recombination, methylation, histone modifications, and other host defense mechanisms (thus generating intrinsic immunity to retroviral infection) [9,10,11,12,13,14], with the extent of sequence degradation roughly correlated with the time of germline insertion of the provirus. Approximately 90% of the ERV sequences in the genome are solo LTRs, which are generated through homologous recombination between the 5′ and 3′ LTRs. This process deletes all internal sequences, including the viral genes [15,16]. LTRs are highly variable sequences within the retroviral genome, and there is little similarity between retroviral LTRs of different genotypes [17,18]. As a result, annotating solo LTRs in genome assemblies often relies on their association with known retroviruses or previously characterized ERVs. However, there have been reports of the query-independent identification of LTRs [9,17,18].
From an evolutionary perspective, chimpanzees and humans share a common ancestor and numerous similar sequences, and the evolutionary patterns of protein-coding genes in the two species are highly related [19,20,21,22,23]. ERVs and other repetitive elements were initially perceived as parasitic elements whose presence reduced host fitness and were therefore dismissed as “junk DNA” [24,25]. There is now growing evidence that ERVs are widely distributed throughout the human and chimpanzee genomes and play vital roles in genome and gene evolution, epigenetics, and gene regulation [26,27,28,29]. Studies have shown that chimpanzee endogenous retroviruses (CERVs) may be more numerous than human endogenous retroviruses (HERVs) [23]. The chimpanzee genome contains at least 42 independent ERV groups, of which 29 belong to class I, 10 to class II, and 3 to class III. Of these 42 groups, all except CERV1/PTERV1 and CERV2 have homologues in humans [30,31]. Fewer than 1% of tested CREs showed different activity between humans and chimpanzees, indicating that small changes in gene regulation can produce significant differences and supporting a central role for cis-regulatory evolution in primate diversity [32,33,34,35,36]. ERVs can be enriched for certain active histone modifications and transcription factors and can act as cis-regulatory elements that regulate host gene expression [37,38]. Some studies have shown that different types of insertions and deletions are associated with specific epigenomic diversity between humans and chimpanzees [14], and about 7% of chimpanzee–human insertion–deletion (INDEL) variants are related to ERV sequences [39,40]. CERVs are thus among the factors contributing to INDELs between the human and chimpanzee genomes, accounting for 7% of the 4% genome-wide difference between the two species [41]. Despite the availability of human and chimpanzee genomes, little is known about the differences in cis-regulation between humans and our closest evolutionary relatives.
Endogenous retrovirus K (ERV-K), which comprises the groups HML-1 to HML-10, is the most recently integrated endogenous retrovirus in the human and chimpanzee genomes. HML-9 is an important but little-studied member of the ERV-K family. A precise and updated genomic characterization of HML-9 is therefore critical to understanding both the evolutionary history and the functional effects of these elements in primates. The HML-9 (HERV-K14C) elements entered the primate germ line following the divergence of the Old World and New World monkey lineages approximately 39 million years ago [41]. One study showed that HERV-K14C-related sequences were amplified during the evolution of the Y chromosome and contributed to the genomic diversity of this chromosome in the great ape lineage [42]. Homologous recombination between HERV-K14C LTRs associated with testis-specific transcripts linked to the Y chromosome (TTY) leads to TTY deletion events, indicating that ERVs may cause genomic instability both by inducing new insertions and through homologous recombination between ERV sequences (especially LTRs), which leads to deletions. These deletion events may be associated with some cases of genomic instability that result in male infertility [43].
In summary, most ERV groups consist almost exclusively of solo LTRs, and such structural variants persist among HERVs. Research has documented HERV-K alleles, as well as alleles of the older HERV-H and HERV-W, that are present as proviruses in some individuals and as solo LTRs in others [44,45]. In addition, a previous study presented a comprehensive analysis of the presence and distribution of HERV-K HML-9 elements in the human genome, along with a detailed account of the structure and phylogenetic characteristics of this group [46]. However, the biological function of HML-9 elements in the chimpanzee genome remains largely elusive. Hence, we identified the sequences of HML-9 proviruses and solo LTRs in the chimpanzee genome and conducted a comprehensive analysis of their presence and distribution, describing the structural and phylogenetic properties of the group in detail. Moreover, we analyzed the integration time of the proviruses, the genes that may be regulated by these elements, and the features of their PBS sequences. Overall, this study aims to provide a clear and comprehensive characterization of HML-9 elements in the chimpanzee genome. This characterization is intended as a foundation for more detailed expression studies, which are essential for understanding HML-9's potential role in physiological and pathological contexts, and for further functional studies that focus on specific sites of interest.
4. Discussion
The HML-9 elements in the chimpanzee genome were identified via bioinformatic assessment, using an approach in line with studies of other HML groups [47,48,53]. In the present study, we identified and characterized 26 proviruses and 38 solo LTRs retrieved from the January 2018 chimpanzee genome assembly. Additionally, we analyzed the distribution, structure, phylogeny, integration time, motifs, and PBS sequence features of all ERV-K HML-9 elements in detail and predicted their regulatory function, in order to provide a comprehensive characterization of the ERV-K HML-9 group.
Transposable elements (TEs) are DNA sequences that can move or replicate within the genome and integrate into new sites. They contribute nearly half of the open chromatin regions of the human genome, as well as elements unique to most primates [54,55,56,57,58,59]. Based on whether transposition involves an RNA intermediate, TEs can be classified as DNA transposons or retrotransposons. Retrotransposons are RNA-mediated: reverse transcription produces a new copy at a new location in the genome in a copy-and-paste fashion [60]. In contrast, DNA transposons move by a cut-and-paste mechanism [61]. ERVs are related to LTR retrotransposons. There are two generally accepted models for the mechanism of ERV proliferation in the host genome: reinfection of the germ line and retrotransposition [62]. ERVs can also be reactivated by a variety of factors, including infectious agents, exogenous viruses, radiation, aging-related processes, epigenetic drugs, cytokines, and mitogens [63]. Studies have identified replication-competent ERVs in many species, and recombination between defective ERVs may also lead to the production of infectious viruses [64,65]. Abnormally activated ERVs may be involved in the occurrence and development of tumors [66,67,68,69].
All ERV families discovered in humans have subsequently been found in other primates, although some of the younger HERV genes are not conserved in other species [45]. Here, we compare the chimpanzee HML-9 elements with their counterparts in the human genome in Table 3 and Table 4, including both proviruses and solo LTRs. In total, 38 ERV elements were detected in both genomes, comprising 10 proviral elements and 28 solo LTRs. A further 13 proviral elements and 19 solo LTRs were detected only in the human genome, and 16 proviral elements and 10 solo LTRs were detected only in the chimpanzee genome. The divergence of the human and chimpanzee lineages is thought to date back approximately 6.5–7.5 million years or more. Our results indicate that HML-9 was integrated into the common ancestor before humans and chimpanzees diverged. In theory, therefore, the two species should carry the same set of HML-9 insertions. However, the two species have undergone 6.5–7.5 million years of independent evolution since their divergence, and the differences in HML-9 element distribution reported here are consistent with that independent evolution.
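As a quick consistency check, the shared and species-specific counts above can be tallied per genome. The sketch below uses only the numbers quoted in this paragraph; the variable and function names are illustrative, and the human totals are derived here rather than quoted from the text.

```python
# Counts quoted in the paragraph above (proviruses and solo LTRs).
shared = {"provirus": 10, "solo_ltr": 28}      # detected in both genomes
human_only = {"provirus": 13, "solo_ltr": 19}  # detected only in human
chimp_only = {"provirus": 16, "solo_ltr": 10}  # detected only in chimpanzee


def totals(*groups):
    """Sum provirus and solo LTR counts over the given groups."""
    return {kind: sum(g[kind] for g in groups) for kind in ("provirus", "solo_ltr")}


chimp_total = totals(shared, chimp_only)
human_total = totals(shared, human_only)

print("chimpanzee:", chimp_total)  # chimpanzee: {'provirus': 26, 'solo_ltr': 38}
print("human:", human_total)       # human: {'provirus': 23, 'solo_ltr': 47}
```

The chimpanzee totals reproduce the 26 proviruses and 38 solo LTRs (64 elements) reported for this study.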
CERVs have contributed to species-specific changes in the chimpanzee genome since the divergence of chimpanzees and humans and are potentially a source of the genomic differences between chimpanzees and their closest relatives, humans. The function of an ERV varies with the genomic structure of its integration site and the modification of its proviral sequence. The induction of novel transcriptional activity in the genomic region of the integration site can trigger cancer and abnormal developmental processes. ERV proviruses have attracted increasing interest because of their association with cancer, autoimmune diseases, and neurodegenerative disorders. Studies have reported clear signals of TE-derived, tissue-specific putative enhancers, as well as promoters that are specific to humans or chimpanzees. In particular, research has identified LTR5 as a putative promoter in induced pluripotent stem cells (iPSCs) [14], while other studies have found that it exhibits enhancer activity in human NCCIT cells [70]. LTR5_Hs/LTR5 showed higher expression levels, displaying high activity in embryonic stem cells (ESCs) and extended pluripotent stem cells (EPSCs), and can act as a distal enhancer regulating host genes [11]. Among the class II elements, the HERV-K sequence was initially identified through its similarity to mouse mammary tumor virus (MMTV) and was classified into 10 so-called human MMTV-like groups (HML-1 to HML-10) [7,71]. Of these, HML-2 is the most recently integrated and the most bioactive, containing the youngest known HERV sequences as well as many members with full-length ORFs. Furthermore, it is the only known group with human-specific integrations [72,73]. HML-2 was first integrated into the genome of the common ancestor of humans and Old World monkeys 30 million years ago, and it contains more than 12 elements that have integrated since the divergence of humans and chimpanzees [74,75,76]. In particular, one study identified two species-specific HERV-K (HML-2) provirus loci (Pan8q and Pan2Ap) in chimpanzees, with Pan2Ap represented as a solo LTR in the reference genome and dimorphic between solo LTR and provirus [77]. HERV-K solo LTRs formed after the divergence of humans and chimpanzees have also been identified in the chimpanzee genome. Acting as alternative promoters and enhancers, solo LTRs play significant roles in gene regulation. They are thought to contribute to species evolution by regulating host gene networks and key host genes, especially those involved in embryogenesis and stem cell development [58,70,78]. Studies have shown that chimpanzee-specific ERVs located in the genomic region between chimpanzee PNRC2 5′ UTR1 and 2 can induce alternative splicing or different RNA polymerase II binding sites on the gene, indicating that chimpanzee-specific ERVs can generate alternative transcripts through new insertions into genes [40]. LTRs of the primate-specific retrovirus MER41 function as natural enhancers of interferon-inducible gene networks [79]. Previous studies of HML-9 indicate that ERV-K HML-9 can function in different tissues under physiological conditions and during disease progression, which may contribute to immune regulation and antiviral defense [80]. Comprehensive information has been reported regarding ERV-K HML-9 in the human genome. However, because a comprehensive description of the ERV-K HML-9 group in the chimpanzee genome has been lacking, the specific contribution of individual ERV-K HML-9 loci to the chimpanzee transcriptome remains unclear. In this study, we describe in detail the features of all 64 ERV-K HML-9 elements retrieved from the January 2018 chimpanzee genome assembly.
Firstly, we predicted the expected number of HML-9 integrations per chromosome and compared it with the number of loci actually detected, in order to evaluate the insertion rate of HML-9 in the chimpanzee genome. The observed number of ERV-K HML-9 integration events was often inconsistent with the expected number. The proviruses and solo LTRs showed a non-random integration pattern and were mainly located in intergenic regions and introns. In particular, the number of proviruses on the Y chromosome differed significantly from the expected number (chi-square test), indicating that the Y chromosome has accumulated a higher density of CERVs and their related sequences. This distribution pattern agrees in general with that observed for the ERV-K HML-9 group in the human genome [46]. Initial integration into the Y chromosome may be disfavored because of its heterochromatic status; however, once integrated, proviruses tend to have minimal detrimental effects because the Y chromosome is gene-sparse. In addition, because the Y chromosome lacks a homolog, newly integrated DNA cannot be removed through homologous recombination, resulting in the accumulation of a large number of complex repetitive sequences on the Y chromosome [81]. The comparison between humans and chimpanzees also shows that HML-9 elements are absent from chromosomes 9, 17, 20, and 22 in the chimpanzee genome but only from chromosomes 9 and 22 in the human genome, suggesting that chimpanzees may have lost two homologous HML-9 elements during evolution.
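The comparison of expected and observed integrations can be illustrated with a minimal sketch. It assumes that the expected count for a chromosome is proportional to its length and tests one chromosome against the rest of the genome with a 1-degree-of-freedom chi-square statistic; the exact procedure used in this study is not described in this section, and the chromosome names, lengths, and counts below are hypothetical toy values.

```python
import math


def expected_counts(chrom_lengths, total_elements):
    """Expected insertions per chromosome if integration were uniform per base."""
    genome_length = sum(chrom_lengths.values())
    return {c: total_elements * n / genome_length for c, n in chrom_lengths.items()}


def chi_square_one_vs_rest(observed_on_chrom, expected_on_chrom, total_elements):
    """Chi-square goodness of fit (1 df): one chromosome versus all others."""
    obs = (observed_on_chrom, total_elements - observed_on_chrom)
    exp = (expected_on_chrom, total_elements - expected_on_chrom)
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy genome: lengths (Mb) and observed counts are made up.
lengths = {"chrA": 200, "chrB": 150, "chrY": 50}
observed = {"chrA": 8, "chrB": 6, "chrY": 6}

total = sum(observed.values())
expected = expected_counts(lengths, total)
stat, p = chi_square_one_vs_rest(observed["chrY"], expected["chrY"], total)
print(f"chrY: observed={observed['chrY']}, expected={expected['chrY']:.2f}, "
      f"chi2={stat:.2f}, p={p:.4f}")
```

With expected counts as small as those in this toy example, the chi-square approximation is rough, so an exact test may be preferable for sparse data.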
Secondly, we defined the structural features of the CERV-K HML-9 proviruses in the chimpanzee genome relative to the consensus sequence, annotating all insertions and deletions in the internal sequence. The characterization of the HML-9 consensus sequence confirmed a structure resembling the typical proviral genome, with the retroviral genes gag, pro, pol, and env flanked by 5′ and 3′ LTRs. Only six elements (23.08%) retain a relatively complete structure, while most have an incomplete structure owing to deletions, and seven are less than 2000 bp in length (26.92%). We also annotated all minor insertions and deletions, which provides specific background for studying the structure of individual HML-9 loci. In the human genome, six elements (26.09%) retain a relatively intact structure, and only one element is less than 2000 bp in length (4.35%). These results suggest that sequence loss in ERV-K HML-9 elements is more pronounced in the chimpanzee genome than in the human genome. In both genomes, some proviruses share identical pol deletions, and most of these proviruses are on the Y chromosome. We examined the HML-9 proviral DNA sequences on the Y chromosome in the chimpanzee genome using MEGA7 and found that the flanking sequences of integration sites transcribed from the same strand were identical (Supplementary Materials).
Next, phylogenetic analysis showed that homologous HML-9 elements in the human and chimpanzee genomes cluster together. However, the CERV-K HML-9 sequences showed no obvious subclustering and formed a single phylogenetic group, which differs markedly from other HML groups. The integration time of the ERV-K HML-9 proviruses was then calculated using the LTRs and the four internal regions (gag, pro, pol, and env). The LTR-based integration times ranged from 14 to 36 mya, with an average of 25 mya, whereas the main period of ERV-K HML-9 integration based on the four internal regions was between 22 and 118 mya, with an average of 45 mya. Overall, the integration times estimated from the LTRs were more recent than those estimated from the four internal regions. This difference may arise because the internal coding regions accumulate mutations during each replication cycle and therefore carry many sequence differences and a much higher error rate, whereas the two LTRs of a provirus are identical at the time of integration into the host genome. It is therefore more reasonable to estimate integration time from the LTRs. In the human genome, LTR-based integration times ranged from 18 to 49 mya, with an average of 29 mya, which is earlier than in the chimpanzee genome. Based on the LTR method, the average integration times of HML-9 in the human and chimpanzee genomes differ by 4 million years, which can be explained as follows: ① Only three proviruses are homologous between human and chimpanzee; most are non-homologous, which may contribute to the difference in integration times. ② Some elements are located in regions that have been differentially subjected to post-integration rearrangements (segmental duplication or deletion).
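The two estimates can be sketched as follows. The calculation used in this study is not given in this section, so the formulas here are an assumption based on a commonly used approach: T = D / (2r) for the divergence D between the 5′ and 3′ LTRs of one provirus (both LTRs accumulate substitutions independently), and T = D / r for the divergence of an internal region from the group consensus. The substitution rate r = 0.002 per site per million years and the toy sequences are likewise assumptions.

```python
ASSUMED_RATE = 0.002  # substitutions per site per million years (assumption)


def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences, skipping gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def ltr_age_mya(ltr5, ltr3, rate=ASSUMED_RATE):
    """Age from 5'/3' LTR divergence; both LTRs diverge independently, hence 2 * rate."""
    return p_distance(ltr5, ltr3) / (2 * rate)


def region_age_mya(region, consensus, rate=ASSUMED_RATE):
    """Age from the divergence of an internal region against the group consensus."""
    return p_distance(region, consensus) / rate


# Toy aligned LTR pair (made-up sequences): 2 differences over 20 sites, D = 0.1.
toy_ltr5 = "ACGTACGTACGTACGTACGT"
toy_ltr3 = "ACGTACGTACCTACGTACGA"
print(f"{ltr_age_mya(toy_ltr5, toy_ltr3):.1f} mya")  # 25.0 mya
```

A simple p-distance ignores multiple substitutions at the same site; a corrected distance model would give somewhat older estimates for highly diverged pairs.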
In addition, we performed a conserved motif analysis of the chimpanzee DNA sequences and obtained the ten most frequent motifs in proviruses, solo LTRs, and two-LTR proviruses. We then predicted and clustered the genes potentially regulated by ERV-K HML-9 proviruses and solo LTRs. For the ERV-K HML-9 proviruses, a total of 37 genes were predicted, and the analysis showed that these genes are related to biological regulation. Previous studies have shown that, among the six H3K9 methyltransferases present in mammals, SETDB1 has a specific and nonredundant role in the deposition of H3K9me3 in compartment A, where it is tethered by hundreds of sequence-specific KRAB domain-containing zinc finger transcription factors, primarily to repress endogenous retroviruses [82,83,84]. It is therefore important to identify specific ERVs associated with certain diseases, especially polymorphic ERV loci, which may influence the expression profile of viruses in different individuals and the regulation of host genes. For the ERV-K HML-9 solo LTRs, a total of 54 genes were predicted, and the analysis showed that these genes are related to synapses. Previous studies showed that HERVs produce proteins that regulate brain cell function and synaptic transmission and that are implicated in the etiology of neurological and neurodevelopmental psychiatric disorders. Investigators combined single-molecule tracking, calcium imaging, and behavioral approaches to demonstrate that the envelope protein (Env) of HERV-W is usually silent but can be expressed in patients with neuropsychiatric disorders, altering N-methyl-d-aspartate receptor (NMDAR)-mediated synaptic organization and plasticity through glial- and cytokine-dependent changes [85]. It must be noted that these results are based entirely on predictions, the accuracy of which depends on many factors, and further investigation is needed to confirm any implied associations between individual LTRs and nearby genes.
During the initiation phase of retroviral replication, the host tRNA used as a primer by the retroviral reverse transcriptase is partially unfolded from its native structure so that it can base-pair with the complementary PBS on the viral genomic RNA. In the PBS analysis of the chimpanzee ERV-K HML-9 elements, the initial TGG trinucleotide was the most conserved of the 18 bases, and the same holds for the human genome. We identified nine proviral PBS sequences, three of which were predicted to recognize lysine tRNA. The sequence logos generated from the human and chimpanzee genomes were identical. In the human genome, eight homologous HML-9 PBS sequences were predicted to recognize lysine (K) tRNA. It should be noted that these results are based entirely on prediction, and experimental verification is needed to confirm the association between these elements and the predicted tRNAs.
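The per-position conservation that underlies a sequence logo can be sketched as below. This is only an illustration of the idea, not the tool used in this study: the function name is hypothetical, and the three 18-nucleotide PBS sequences are made-up examples rather than chimpanzee HML-9 data.

```python
from collections import Counter


def position_conservation(seqs):
    """For each alignment column, return (most common base, its frequency)."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all PBS sequences must have the same length")
    result = []
    for column in zip(*(s.upper() for s in seqs)):
        base, count = Counter(column).most_common(1)[0]
        result.append((base, count / len(seqs)))
    return result


# Made-up 18-nt PBS sequences for illustration only.
toy_pbs = [
    "TGGCGCCCGAACAGGGAC",
    "TGGCGCCCAACGTGGGGC",
    "TGGTGCCCGAACAGGGAC",
]

conservation = position_conservation(toy_pbs)
for pos, (base, freq) in enumerate(conservation, start=1):
    print(f"{pos:2d} {base} {freq:.2f}")
```

In this toy set the first three positions (T, G, G) are fully conserved, mirroring the pattern described above, while later positions vary.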
It is worth noting that this study lacks data on conserved transcription factor binding sites because chimpanzee is not among the options in the Species section of JASPAR.