**Molecular Analysis of the Complete Genome of a Simian Foamy Virus Infecting** *Hylobates pileatus* **(pileated gibbon) Reveals Ancient Co-Evolution with Lesser Apes**

#### **Anupama Shankar 1, Samuel D. Sibley 2, Tony L. Goldberg <sup>2</sup> and William M. Switzer 1,\***


Received: 1 April 2019; Accepted: 30 June 2019; Published: 3 July 2019

**Abstract:** Foamy viruses (FVs) are complex retroviruses present in many mammals, including nonhuman primates, where they are called simian foamy viruses (SFVs). SFVs can zoonotically infect humans, but very few complete SFV genomes are available, hampering the design of diagnostic assays. Gibbons are lesser apes widespread across Southeast Asia that can be infected with SFV, but only two partial SFV sequences are currently available. We used a metagenomics approach with next-generation sequencing of nucleic acid extracted from the cell culture of a blood specimen from a lesser ape, the pileated gibbon (*Hylobates pileatus*), to obtain the complete SFVhpi\_SAM106 genome. We used Bayesian analysis to co-infer phylogenetic relationships and divergence dates. SFVhpi\_SAM106 is ancestral to other ape SFVs with a divergence date of ~20.6 million years ago, reflecting ancient co-evolution of the host and SFVhpi\_SAM106. Analysis of the complete SFVhpi\_SAM106 genome shows that it has the same genetic architecture as other SFVs but has the longest recorded genome (13,885-nt) due to a longer long terminal repeat region (2,071 bp). The complete sequence of the SFVhpi\_SAM106 genome fills an important knowledge gap in SFV genetics and will facilitate future studies of FV infection, transmission, and evolutionary history.

**Keywords:** simian foamy virus; gibbon; lesser apes; co-evolution; complete viral genome

#### **1. Introduction**

Foamy viruses (FVs) belong to the Retroviridae subfamily of Spumaretrovirinae, which have fundamentally different replication strategies compared to other complex retroviruses, such as human immunodeficiency virus (HIV). For example, FVs differ from other retroviruses in how they initiate infection. Like complex DNA viruses, FVs use two promoters for gene expression, one in the long terminal repeat (LTR) and one in the envelope (*env*) gene [1]. In addition, FVs complete reverse transcription within the virion before infection of a new host cell, such that the SFV genome can be double-stranded DNA or single-stranded RNA [2,3]. FVs infect a wide range of mammals, including cows, horses, cats, and nonhuman primates (NHPs) in which they are called simian foamy viruses (SFVs). SFVs have been identified in nearly every primate species examined across all continents where NHPs exist. Phylogenetic analyses show each of these NHPs to be infected with species-specific variants, reflecting a generally ancient co-evolution of SFVs and their hosts [3–15]. Like other simian retroviruses, SFVs can recombine and cross-species infections occur, both of which can complicate their evolutionary history [5,16–19]. Hence, analysis of complete genomes is necessary to fully understand the evolutionary trajectory of SFV.

SFVs received heightened public health attention following numerous reports of transmission of SFV from NHPs to humans across the globe via a variety of routes of exposure [3,12,20–22]. Many studies have documented SFV acquisition, both in persons working with NHPs in research facilities and zoos [20,21,23–28] and in humans exposed to NHPs in natural habitats, where hunting and butchering of primates and keeping NHPs as pets is common, especially in parts of Africa and Asia [24–27,29]. Although SFVs establish permanent infection of their primate hosts and in zoonotically infected humans, there has been no clear evidence of pathogenesis despite their ability to cause cytopathology in vitro [1,3,12,22,30–34]. Limited studies have also been unable to identify cases of person-to-person transmission [3,7,12,22,31]. As human populations expand and encroach upon NHP habitats, interaction among these species grows, increasing risks for SFV exposure and infection. Many areas of Asia, Africa, and South America have seen increases in deforestation with concomitant intensified incursions into NHP habitats [35,36]. The ensuing exposures to NHPs and their pathogens require continued monitoring, facilitated by the development and modification of new and existing assays for the detection of zoonotic viral agents, including SFV and other retroviruses.

The design of molecular methods for the accurate identification of SFV infection requires a database containing representative sequences from divergent SFV lineages. Although numerous partial nucleotide (nt) sequences (average length of about 500 nt, mostly in the polymerase (*pol*) gene), are available from a large number of SFVs from prosimians, Old World monkeys (OWMs), apes, and New World monkeys (NWMs), there remains a paucity of complete SFV genomes. Recently, the taxonomic classification of FVs has been standardized such that the SFV names include a lower case three-letter abbreviation, where the first letter is the first letter of the scientific name of the host genus and the next two letters are the first two letters of the species or subspecies they were isolated from [14]. Whole genome SFV sequences are currently available from one prosimian (SFVocr from a galago), four from NWMs (one each from a spider monkey (SFVaxx), marmoset (SFVcja), squirrel monkey (SFVssc), and capuchin (SFVsxa)), seven from OWMs (three from African green monkeys (SFVcae), four from macaques (SFVmcy, SFVmmu, SFVmfu, SFVmfa), one from a human infected with SFV from a spot-nosed guenon (SFVcni), and eight from great apes (four from chimpanzee (SFVpsc, SFVpve, SFVptr) species, including three from infected humans, three from gorilla (SFVggo), including two from infected humans, and one from an orangutan (SFVppy)) [14]. In contrast, there is a dearth of knowledge about SFV genomes of the smaller or "lesser" apes, family Hylobatidae (common name gibbons), despite their wide taxonomic diversity. Gibbons belong to the superfamily Hominoidea along with great apes and humans. Gibbons are found predominantly in tropical and sub-tropical forests of Southern and Southeast Asia from eastern Bangladesh and northeast India to southern China and Indonesia, including the islands of Sumatra, Borneo, and Java [37]. Unlike the great apes, which include about six species, gibbons are more diverse, with about 19 species identified depending on the classification source [38,39], most of which are endangered (www.iucnredlist.org) [40]. The Hylobatidae consists of five genera and 19 species (*Hoolock* species (*n* = 2), *Hylobates* sp. (*n* = 9), *Justitia* (*n* = 1), *Nomascus* sp. (*n* = 6), and *Symphalangus* sp. (*n* = 1).

A recent study reported a high seroprevalence of SFV in gibbons from Cambodia, although the seropositive samples were not SFV PCR-positive [41]. Like other NHPs, gibbons are frequently hunted or kept as pets, facilitating opportunities for human exposure to SFV; yet there is a lack of available sequences to optimize PCR assays for their detection. Considering these factors, elucidation of additional sequences from SFV-infected gibbons will provide the field with important molecular information for the development of diagnostic assays and will permit further examination of the biology and evolutionary history of SFV in apes, and in NHPs in general.

In a previous report [42], we described the isolation of a novel, highly divergent gibbon SFV strain (SFVhpi\_SAM106) from a captive-born *Hylobates pileatus* (pileated gibbon) using blood cell co-culture, and the subsequent amplification of partial *pol* sequences. In this study, we used random hexamer-based deep-sequencing [43], a technique that uses random priming instead of relying on sequence-specific approaches, thus allowing molecular characterization of divergent viral sequences. In addition to

*in silico* characterization of the full-length SFVhpi\_SAM106 genome, we also analyzed evolutionary relationships to other complete monkey and ape SFVs by using non-simian FVs as outgroups.

#### **2. Materials and Methods**

#### *2.1. Blood Sample Processing, Co-Culture, and PCR Identification of a Novel Divergent SFV in Gibbons*

SFVhpi\_SAM106 was previously isolated from peripheral blood mononuclear cells (PBMCs) prepared from whole blood from a *H. pileatus* (SAM106) co-cultured with canine thymocyte Cf2Th cells as described in detail elsewhere [42]. Briefly, frozen PBMCs were thawed, stimulated with interleukin-2 for three days at 37 ◦C, washed with media, and incubated with an equal number of Cf2Th cells. Cultures were monitored every 3 to 4 days for syncytial cytopathic effect (CPE) typical of FV. CPE was visible on day 38, whereupon infected Cf2Th cells and viral supernatants were collected and stored frozen in liquid nitrogen until use. This captive gibbon was wild-caught and was about 30 years old at the time of specimen collection. Previous PCR and sequence analysis of integrase sequences within the *pol* gene from this gibbon confirmed the presence of a highly divergent SFV distinct from other ape SFVs [42]. During that same study, we similarly isolated SFVhle from a *H. leucogenys* but were unable to recover virus from the stored tissue culture supernatants. As described in the initial publication, NHP blood specimens were obtained opportunistically in accordance with the animal care use committees at each participating institution [42].

#### *2.2. Next Generation Sequencing and SFVhpi\_SAM106 Genome Assembly*

In total, 1 mL of tissue culture supernatant was centrifuged at 5000× *g* at 4 ◦C for 5 min with subsequent filtration through a 0.45 μm filter (Millipore, Billerica, MA, USA) to remove any residual host cells. Viral nucleic acids were then isolated using the Qiagen QIAamp MinElute virus spin kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions, except that carrier RNA was omitted. The eluted nucleic acids were then treated with DNase I (DNA-free, Ambion, Austin, TX, USA) and cDNA was generated by priming with random hexamers using the Superscript double-stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA). cDNA was purified using the Agencourt Ampure XP system (Beckman Coulter, Brea, CA, USA) and approximately 1 ng of cDNA was subjected to simultaneous fragmentation and adaptor ligation ("tagmentation") using the Nextera XT DNA Sample Prep Kit (Illumina Systems, San Diego, CA, USA). Briefly, fragmented DNA was PCR-amplified with Nextera index primers (15 cycles) and purified using the Agencourt Ampure XP system. The resulting DNA library was sequenced using an Illumina MiSeq (MiSeq Reagent Kit V3, 600 cycle, Illumina Systems, San Diego, CA, USA). To analyze sequence data, raw sequences were de-multiplexed and converted to FastQ format using the CASAVA v1.8.2 software (Illumina). The processed reads were then imported into the CLC Genomics Workbench v7 (CLC bio, Aarhus, Denmark), trimmed to remove Nextera-specific transposon sequences as well as short and low quality reads, and assembled using the CLC de novo assembler. Both singleton and assembled contiguous sequences (contigs) were queried against the GenBank database (http://www.ncbi.nlm.nih.gov/GenBank) using the basic local alignment search tools blastn and blastx [12], with a high e-value cut-off of 10 and word sizes of 11 and three for blastn and blastx queries, respectively. Contigs with significant homology to SFV were then mapped to SFVmcy (macaque SFV; GenBank accession number NC\_010819) as the reference sequence to generate the consensus SFVhpi\_SAM106 sequence.

#### *2.3. Sanger Sequencing of LTR Region*

We amplified the LTR region using primers specific to SFVhpi\_SAM106 in order to confirm the LTR length. DNA was extracted from the tissue culture cells and 500 ng was used as template in two separate nested PCRs using SFVhpi\_SAM106-specific primers. One assay spanned the 5 LTR and RU5 region while the second spanned the 3 end of *bet* and the 3 LTR. The primers for the SFVhpi\_SAM106\_LTR-RU5 and SFVhpi\_SAM106\_*bet*-LTR fragments for the primary and secondary PCRs respectively are:

SFVhpi\_SAM106\_LTRF2 (5 GCAGTAGGAGAACAACCTCCTT 3 ) and FVRU5R1 (5 CCCGACTT ATATTCGAGCCCCAC 3 ) and

SFVhpi\_SAM106\_LTR\_nestF2 (5 GGAGGAATACTCCTCTCCCCCTCTC 3 ); FVRU5R2 (5 CACG TTGGGCGCCAAATTGTC 3 ) and

#### SFVhpi\_SAM106\_*bet*\_F1 (5 GTGGGAGAAGGTAATATTAATCC 3 ) and SFVhpi\_SAM106\_LTR\_R1 (5 GTGGAATATTCTGTGTTGATTATCC 3 ) and

SFVhpi\_SAM106\_*bet*\_nestF1 (5 AGGCATATGGACCACCACAAG 3 ) and SFVhpi\_SAM 106\_LTR\_nestR1 (5 CAACCTTGTTGATAAGGGCAAC 3 ).

We performed an initial denaturation at 94 ◦C for 2 min. followed by 40 cycles of 94 ◦C for 1 min, 45 ◦C for 1 min; 72 ◦C for 2.5 min, with a final extension at 72 ◦C for 5 min. For both nested PCRs, we used 3 μL of the primary PCR product as template. Nested PCR products were electrophoresed in 1.0% agarose gels and visualized by ethidium bromide staining. For sequence analysis, the PCR products were purified using the Qiaquick gel extraction kit (Qiagen Inc., Valencia, CA) and then sequenced in both directions using a Big Dye terminator cycle kit (ThermoFisher Scientific, Waltham, MA) and additional internal SFVhpi\_SAM106-specific sequencing primers to ensure sufficient coverage.

#### *2.4. Complete SFVhpi\_SAM106 Genome Sequence Analysis*

Gene annotation tools in CLC Genomics Workbench were used to locate open reading frames (ORFs) within coding regions of the SFVhpi\_SAM106 genome. Positions of the complete 5 and 3 long terminal repeats (LTRs) were determined manually using previously published ape SFV genomes as a reference. Potential splice donor and acceptor positions were inferred using neural network predictions implemented in NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/). N-linked glycosylation sites were predicted using the N-GlycoSite tool at the HIV LANL website (https: //www.hiv.lanl.gov/content/sequence/GLYCOSITE/glycosite.html) [44]. Potential nuclear localization signals in the Tas protein were predicted using NucPred (https://nucpred.bioinfo.se/cgi-bin/single.cgi) and PSORTII (https://psort.hgc.jp/form2.html). Coiled-coil motifs were inferred using the website https://embnet.vital-it.ch/software/COILS\_form.html. Nuclear export signals (NESs) were inferred using neural networks and hidden Markov models at http://www.cbs.dtu.dk/services/NetNES/.

Using Geneious v7.0.6, we extracted the five coding regions, *gag, pol, env, tas*, and *bet*, and aligned them with representative SFVs with complete genomes from four other apes, two OWMs, four NWMs, one prosimian, and one FV each from equine, bovine, and feline hosts (Table 1). We also created a concatamer of the major FV coding regions (group-specific antigen (*gag*), polymerase (*pol*), and envelope (*env*)) to enable maximally robust phylogenetic analysis. Concatamers of major coding regions of other slowly evolving cell-associated retroviruses are commonly used for evolutionary analyses [45–47]. Finally, since recombination was reported recently in the surface protein (SU) region of *env,* we also prepared an alignment of the complete SFV *env* sequences of species in Table 1 and those from chimpanzee, gorilla, humans, and macaques used in the analyses by Galvin et al. and Richard et al. consisting of 48 taxa [19,48] All alignments were checked for evidence of potential recombination events using first a 400-nt window and a 40-nt step, and then a 200-nt window and 20-nt step, using the Kimura-2 parameter nucleotide substitution model with gap stripping in Simplot v 3.5.1 [49]. We also checked for recombination by using the Recombination Detection Program (RDP) v4 with the default parameters [50].


**Table 1.** Foamy virus complete genomes analyzed.

We performed codon-based nucleotide alignments using MAFFT v 7.017 [51], followed by manual adjustments and gap stripping. We used the model selection algorithm in MEGA v6 [52] to determine the best fitting nucleotide substitution model, which was inferred to be the general time reversible (GTR) model with gamma (Γ) distribution and invariable sites (GTR+Γ+I). Likelihood mapping of quartet topologies implemented in IQ-TREE v1.6.8 was used to check for evidence of good phylogenetic signal in the alignment [53]. We also checked the phylogenetic signal and evidence of nucleotide substitution saturation in the alignments with the program DAMBE v7.0.35 (http://dambe.bio.uottawa.ca/DAMBE/dambe.aspx).

Phylogenies and divergence dating were simultaneously inferred using Bayesian inference with the program BEAST v.1.8.4 [54]. For BEAST analysis, we created six taxon groups, including Hominoidae (great apes), *Pan*/*Gorilla*, NWMs, OWMs, OWM/Hominoidae, and all simians. We used the MAFFT alignments and enforced monophyly for both simian and non-simian taxon groups in the analyses. To evaluate the potential effect of nucleotide heterogeneity sometimes observed at third codon positions of RNA viruses on the phylogeny, we also conducted phylogenetic analysis after stripping the third codon positions (cdp) from the alignment. We used an uncorrelated lognormal relaxed molecular clock, a birth-death speciation tree prior, and 100 million Markov Chain Monte Carlo (MCMC) iterations with a 10% burn-in. To more accurately estimate divergence dates, we set priors for the time to the most recent ancestor (TMRCA) dates across the FV phylogeny using normal distribution priors and nuclear DNA split estimates for NHPs and other matching non-simian placental mammals from www.timetree.org [55,56] as follows: 18.6–20.2 million years ago (mya), standard deviation (SD) 0.82 for Hominoidae; 8.44–9.06 mya, SD 0.33 for the *Pan troglodytes*/*Gorilla* split; 27.61–29.44 mya, SD 0.94 for the OWM/Hominoidae split; 74–78 mya, SD 2.0 for the *Equs caballus*/*Bos taurus* split; and 91–96 mya, SD 3.1 for *Boreoutheria* (placental mammals).

We used Tracer v1.6 to ensure all parameters converged with effective sampling size (ESS) values >250. Trees were logged every 10,000 generations. Two independent BEAST runs were performed to ensure convergence and reliability of the results. We used TreeAnnotator v1.8.3 to choose the maximum clade credibility tree from the posterior distribution of 10,001 sampled trees with a burn-in value of 1000 trees. The inferred trees were visualized using FigTree v1.4.2 http://tree.bio.ed.ac.uk/software/figtree/).

#### *2.5. Comparison of FV tRNA Binding Motifs*

tRNA primer binding site sequences for SFVhpi\_SAM106 were identified using tRNAdb (http: //trnadb.bioinf.uni-leipzig.de), a database that can be queried for tRNA binding motifs and outputs consensus and features of conservation for any selected set of tRNAs. FV tRNA motifs were compared to investigate potentially divergent primer binding sites.

#### *2.6. Nucleotide Sequence Accession Number*

The complete genome of SFVhpi\_SAM106 has been deposited in GenBank with the accession number M621235.

#### **3. Results**

#### *3.1. SFVhpi\_SAM106 Genome Assembly*

Assembly of 38,000 paired-end reads yielded the complete SFVhpi\_SAM106 coding genome with 380× coverage. The longest contig obtained by *de novo* assembly was 11,815 nt. The read lengths ranged from 175 to 250 bp. The average read length was 203.59 bp. The sequence of the complete genome was determined by manual alignment of overlapping 5 and 3 LTR regions to give a final length of 13,885 nt.

#### *3.2. Organization of the SFVhpi\_SAM106 Genome and Comparison with Other Ape SFVs*

The SFVhpi\_SAM106 genome consists of all expected structural, enzymatic and auxillary gene coding regions, including *gag*, *pol*, *env*, *tas*, and *bet*, together flanked by two LTRs (Figure 1).

**Figure 1.** Genomic structure of SFVhpi\_SAM106. Ets-1, ETS proto-oncogene 1 transcription factor motif; CAP, catabolite activator protein transcription motif; TATA box, promoter region motif; LTR, long terminal repeat; IP, internal promoter; *gag*, group-specific antigen; *pol*, polymerase; *env*, envelope; *tas*, transactivator gene; *bet*, between *env* and *tas* genes; U3, unique 3 region of the LTR; R, repeat region of the LTR; U5, unique 5 region of the LTR. The Bet protein is translated from a spliced RNA and residues from the 5 part of *tas* indicated by the speckled region and dotted line.

Gene and genome length comparisons with other ape SFVs are provided in Table 2. SFVhpi\_SAM106 has the longest recorded genome among ape SFVs owing to its longer LTRs, whereas the lengths of the five coding regions are comparable in size to those from other ape SFVs.

**Table 2.** Ape simian foamy virus (SFV) gene and genome nucleotide length comparison.


1. SFVhpi\_SAM106, pileated gibbon (M621235); SFVpve, chimpanzee (U04327); SFVpsc, chimpanzee (Y07725); SFVppy, orangutan (AJ544579); SFVggo, gorilla (JQ867465); LTR, long terminal repeat; *gag*, group-specific antigen; *pol*, polymerase; *env*, envelope; *tas*, transactivator gene; *bet*, between *env* and *tas* genes.

Nucleotide and amino acid identity comparisons of the major genes and proteins, respectively, of SFVhpi\_SAM106 to those of other ape FVs are provided in Table 3. Sequence analysis showed the SFVhpi\_SAM106 genome was nearly equidistant from other ape SFVs, sharing approximately 65% nucleotide identity across the *gag–pol–env* region of the coding genome. The highest gene and protein identities were seen with *pol*/Pol followed by *env*/Env, *gag*/Gag, *tas*/Tas, and *bet*/Bet.

**Table 3.** Percent nucleotide and amino acid identity comparisons of SFVhpi\_SAM106 compared to SFVs of other ape hosts.


1. SFVhpi\_SAM106, pileated gibbon; SFVpve, chimpanzee (U04327); SFVpsc, chimpanzee, (Y07725); SFVppy, orangutan (AJ544579); SFVggo, gorilla (JQ867465). *gag*, group-specific antigen; *pol*, polymerase; *env*, envelope; *tas*, transactivator protein; Bet, between *env* and *tas* protein. 2. Concatenation of *gag*/Gag, *pol*/Pol, and *env*/Env nucleotide/proteins.

The LTR, at 2071 nt (positions 1–2071), was found to be the longest among the ape SFVs by about 300 to 800 nucleotides. We confirmed the LTR length by PCR using SFVhpi\_SAM106 PCR primers and Sanger sequencing. Both the SFVhpi\_SAM106\_LTR-RU5 and *bet*-LTR fragments were 100% identical to the LTR obtained by NGS and were 1968 bp and 2029 bp in length after removing the primer sequences. The U3 region extends from positions 1 to 1704, which is about 300 nt longer than that of other ape SFVs; followed by the R region (positions 1705 to 1911), which is about 20 nt longer than that of other ape SFVs; and the U5 region (positions 1912 to 2071), which is about the same length as that in other ape SFVs. Three TATA box motifs were found at nucleotide positions 76, 239, and 396 upstream of the primer binding site (PBS), which is in turn 104 nt upstream of the start codon for *gag*. The poly A motif (AATAAA) is located at position 1889, the conserved 3 and central polypurine tracts (AGGAGAGGG) are located at positions 7973 and 11,813, respectively. The first two dimerization signals (DS) were highly conserved and are located at positions 2081 and 2140, but a potential third DS was not strictly conserved and consisted of AAAAGTC instead of AAAATGG found in other SFVs [57]. There are two Ets-1 transcription factor binding domains in the 5 LTR at positions 402 and 766. There are also three CAP (catabolite activator protein transcription motif) sites at positions 1647, 1692, and 1808. We also identified a polypurine tract (PPT) at positions 11,795, just upstream of the start of the 3 LTR. The conserved tRNAlys PBS motif 5 -TGG CGC CCA ACG TGG GGC-3 at positions 2074 to 2091 in SFVhpi\_SAM106 is present in all available complete ape SFV genomes. The relatively conserved SFV Env/Orf-2 splice donor (AG**TTGˆGTAA**TTT) and acceptor (TTTTA**AGˆA**TAAT) sites are located at positions 10,292 and 10,411, respectively, and were predicted by NetGene2. The nucleotide identity of the SFVhpi\_SAM106 LTR to other ape SFV LTRs ranged from 30% to 45%, with the closest identity to LTRs from chimpanzee SFVs.

The internal promoter (IP) was identified by comparison with other FVs and is located at *env* nucleotide positions 10,142 to 10,198 (5 -CAA GAG AA **CATAAA** AGA TCA AAT CGA GAG AGC AAC CGC AGA GC-3 ) (Figure 1). However, unlike the TATAAA box consensus motif of exogenous FV IPs, the SFVhpi\_SAM106 TATA promoter box is more similar to that of two endogenous FVs (sloth and coelacanth) in that it starts with a cytosine (highlighted in the IP sequence in bold and italics) instead of a thymine [58]. A potential CAP site was identified 48 nucleotides downstream of the IP. Fifteen N-linked glycosylation sites were identified, of which one is in the LP region, 11 are in SU, and three are in TM. Ten of these are NXT variants and five are NXS. In comparison, SFVppy has 13 N-linked glycosylation sites of which one is in LP, nine are in SU, and three are in TM. Nine of these are NXT variants and four are NXS.

The number, position, and composition of potential *tas*-response elements (TREs) varies among SFVs and requires in vivo experiments for confirmation [58]. By comparison with other SFVs, we identified a potential TRE upstream of the IP at position 10,021 (CTTAAAGGCAGAAGAGAAA). TREs in the LTR region could not be readily identified.

We observed that *gag* (1974 nt; positions 2178–4151), *pol* (3420 nt; positions 4102–7521), *env* (2966 nt; positions 7481–10,446), and *tas* (897 nt; positions 10,420–11,316) ORF lengths are similar to other ape SFVs (Tables 2 and 3). To identify the potential splice/acceptor sites for the *bet* gene, we used an online neural network prediction tool. Although we found evidence of a splice donor site in SFVhpi\_SAM106 at positions 10,693 to 10,702 (5 -GAGGAATGAˆTAAGTTAAT-3 ) and a splice acceptor site at positions 10,991 to 11,010 (5 -CTCCTATTAGˆGTACACTGGG-3 ) indicating possible bet splicing, support was not strong (0.54 and 0.30, respectively). We also found a splice acceptor site within the bet ORF (positions 11,375–11,395, 5 -AATTCTCAGˆATGATGAGGAT-3 ) with strong support (0.96), which could potentially give rise to an alternate *tas-bet* fusion transcript 1065 nucleotides and 355 amino acids in length.

Notable motifs identified *in silico* in Gag include the cytoplasmic retention and targeting signal (CTRS) (GEWGFGD**R**YNVVQIVLQD) located at aa positions 39 to 56 with the highly conserved arginine (R) at position 46 essential for intracytoplasmic particle formation. The YXXL motif involved in particle assembly was highly conserved at positions 77 to 80. The P3-cleavage site (RSFN/TVSQ) is located at aa positions 624 to 631 in the carboxyl terminus. Analysis of the Gag sequence did not identify conserved P(T/S)AP late domain motifs in the center of the protein, but we did find two PPAP motifs at aa positions 269 to 272 and 296 to 299. These motifs are similar in number and location to those in SFVpve and SFVggo, but not SFVppy, in which the single PSAP motif is located near the end of Gag. We also identified an assembly domain (YEMLGL) at aa positions 462 to 467. Examination of Gag for glycine-arginine (GR) rich boxes involved in viral replication identified four potential GR-boxes at aa positions 485 to 509 (GGRGRGRNNRNAASGNTQGGNQRQSR), 515 to 534 (GRQSQGGRGRGSNNNTNSRQ), 538 to 562 (QNSSGYNLRPRTYNQRYGGGQGRR), and 595 to 612 (RGDQPRRSGAGRGQGGNR) compared to the three such motifs found in other SFVs. The highly conserved chromatin binding motif was located in the third GR at AA positions 542 to 548 (underlined above), which for other FVs is in GRII [59]. However, a nuclear localization signal (NLS) present in GRII of SFVpsc was not identified in the SFVhpi\_SAM106 GRIII by using PSORTII and NucPred [59]. An NES at the N-terminus of Gag has been shown to be critical for the late stages of virus replication of SFVpsc (formerly HFV) and is partially conserved in SFVmcy and SFVcae [60]. Comparison with other SFVs identified a potential NES at aa positions 91 to 108 (LAFNGIGPAEGALRFGPL); however, NetNES predicted in the SFVhpi\_SAM106 Gag a leucine-rich NES around aa positions 8 to 20 (LDVQELVLLMQDL), which is relatively conserved in other ape SFV Gag proteins. NetNes correctly predicted the reported NES in SFVpsc, SFVpve, and SFVggo, but not in SFVppy, SFVmcy, SFVcae, SFVocr, SFVcja, SFVaxx, SFVssc, SFVsxa, BFVbta, and EFVeca. An NES similar to that in SFVhpi\_SAM106 was predicted for FFVfca.

Within the Pro-Pol polypeptide, the highly conserved protease catalytic center (DTGA) and reverse transcriptase (RT) catalytic site (YVDD) were located at aa positions 24 to 26 and 312 to 315, respectively. The DSF motif required for RNAse H activity located at positions 670 to 672 was also conserved. The viral protease cleavage site for the integrase (IN) protein (YTVN/NIQN) was partially conserved (YVNN/XNXX) and located at aa positions 749 to 756, potentially coding for a 388 aa IN peptide. The IN catalytic center (DD35E) was found at D898D936E972 and the zinc-finger motif at H813H817C847C850. Interestingly, we also identified a highly conserved TATA box motif at positions 4840 to 4846 in *pol* present in the RT region of all FVs with the consensus sequence (T(A/T)(T/C)AA(A/G) (Figure 2). In SFVhpi\_SAM106, this TATAAA motif is 80 nucleotides upstream of the RT active site.


**Figure 2.** Conserved TATA box in alignment of foamy virus polymerase sequences. Dots represent conserved nucleotides relative to the first sequence. Nucleotide positions are after alignment using MAFFT. Old World apes (OWA): SFVpve, *Pan troglodytes verus* (chimpanzee), GenBank accession number U04327; SFVpsc, *Pan troglodytes schweinfurthii* (chimpanzee), Y07725; SFVppy, *Pongo pygmaeus* (orangutan), AJ544579; SFVggo, *Gorilla gorilla* (gorilla), NC\_039029; SFVhpi\_SAM106, *Hylobates pileatus*, (pileated gibbon) M621235. Old World monkeys (OWMs): SFVcae, *Cercopithecus aethiops* (African green monkey), M74895; SFVmcy, *Macaca cyclopsis* (macaque), X54482. New World monkeys (NWM): SFVcja, *Callithrix jacchus* (common marmoset), GU356395; SFVsxa, *Sapajus xanthosternos* (capuchin), KP143760; SFVaxx, *Ateles* species (spider monkey), EU010385; SFVssc, *Saimiri sciureus*(squirrel monkey), GU356394. Prosimian (Pro): SFVocr, *Otolemur crassicaudatus* (brown greater galago), KM233624. Non-simian mammals (NSM): EFVeca, *Equus caballus* (equine), AF201902; BFVbta, *Bos taurus* (bovine), U94514; and FFVfca, *Felis catus* (feline), Y08851.

For the 989-aa Env protein, the highly conserved WXXW motif required for Gag interaction and budding is located at aa positions 10 to 13 (WLIW). The cellular furin protease cleavage sites, which cleave the N-terminal leader peptide (LP) from the SU and the SU from the transmembrane (TM) protein, are located at aa positions 125 to 131 (RLAR/RSLR) and 570 to 577 (RKRA/TSSN) to generate three potential Env proteins of lengths 127 (LP), 446 (SU), and 416 (TM), respectively. The highly conserved hydrophobic WXXW motif in the LP required for Env incorporation and particle release is located at aa positions 10 to 13. The membrane-spanning domain located at aa positions 945 to 980 (AKGIFGTAFSLVAYVKPILIGIGVIILLVVIFKIIS) is partially conserved. The consensus KKK endoplasmic reticulum retrieval signal located at aa positions 694 to 696 of the TM (or positions 985 to 987 of Env) is partially conserved and has the sequence KAK, whereas SFVppy, SFVaxx, SFVggo\_BAK74, and SFVssc motifs have the sequence KRK. A putative receptor-binding domain (RBD) is located in SU at aa positions 227 to 552 and the relatively conserved fusion peptide (LRSMGYALTGGIQTVSQI) is located in the TM at aa positions 581 to 598 of Env [61].

The ape SFV Tas proteins are highly divergent (Table 3) and hence identification of the poorly defined acidic activation and DNA binding domains was not possible. Tas is a nuclear protein involved in transcriptional transactivation with a partially conserved NLS at the 3 end of the protein. In SFVhpi\_SAM106, the NLS GTGRKRRTN is located at aa positions 216 to 224 with a strong NucPred score of 0.87, whereas PSORT II identified RKRR in this motif as an NLS, but with a low score (−0.16). Neither prediction program found a bipartite NLS in SFVhpi\_SAM106 or other ape SFV Tas proteins. PSORT II predicted with high reliability (94.1) that the SFVhpi\_SAM106 Tas was a nuclear protein, similar to the Tas protein of SFVppy (reliability score = 94.1), but more likely than those from SFVggo and SFVpve/psc (reliability score = 89). The PSORT II predictor uses a heuristic algorithm, including neural networks, to identify nuclear proteins are rich in basic residues. The SFVhpi\_SAM106 Tas includes 41 basic amino acids (R = 25, K = 12, H = 4), or about 13.7% of the protein, compared to 12.2% (34; R = 15, K = 15, H = 4) for SFVppy, 14.1% (42; R = 14, K = 20, H = 8) for SFVggo, 16.6% (50; R = 16, K = 22, H = 12) for SFVpve, and 15.3% (46; R = 18, K = 17, H = 11) for SFVpsc. A leucine-rich NES was predicted to be at aa positions 97 to 107 (LICERLILLAL).

Although the *bet* splice acceptor site described above was not strong, the SFVhpi\_SAM106 Bet protein length of 483 aa was similar to that of other ape SFVs (SFVpve = 490 aa, SFVpsc = 482 aa, SFVggo = 481 aa, SFVppy = 464). Comparison with these other ape SFV Bet proteins identified a potential integrin-binding motif (K/RGD) at aa positions 306 to 308 that was partially conserved (KGT), with SFVpve, SFVggo, and SFVppy having a KGD motif and SFVpsc having an RGD motif. The tripeptide RGD domain has been shown to be required for cell membrane binding, so it will be important to examine the functionality of the D > T mutation at the third aa position in the SFVhpi\_SAM106 Bet. One study has proposed the SFVpve Bet is secreted in both the cytoplasm and nucleus of infected cells and contains a bipartite NLS in the C-terminus [62]. PSORT II does detect the bipartite NLS RKIRTLTEMTQDEIRKR at aa positions 463 to 479 of SFVpsc Bet, but with a cytoplasmic protein reliability prediction of 70.6% rather than a nuclear one. NucPred did not predict a NLS in the SFVpsc Bet. Neither NucPred nor PSORT II identified any putative NLS in the SFVhpi\_SAM106, SFVpve, SFVggo, and SFVppy Bet proteins.

#### *3.3. Absence of Evidence of Genetic Recombination in the SFVhpi\_SAM106 Genome*

Simplot analyses on the *gag-pol-env* concatamer alignment did not show any evidence of genetic recombination at a threshold of 70% of permuted trees, a cutoff commonly used for the analysis of other retroviruses (data not shown). These results were consistent across two window and step sizes in the analysis. We also found no significant evidence of recombination using the recombination detection program RDP4 using eight methods (RDP, GENCONV, Chimaera, MaxChi, BootScan, SiScan, 3Seq, and LARD). We did not find any evidence of recombination of the SFVhpi\_SAM106 env using the 48 taxa dataset by phylogenetic and RDP analysis, but we did confirm the two different SU RBD SFV clades as previously reported (Supplementary Figure S1) [19,48]. Phylogenetic analysis of the two alignments encompassing the complete *env* and the region without the RBD in SU showed the typical co-evolutionary history of FV with SFVhpi\_SAM106 ancestral to SFVppy, SFV OWMs, and then chimpanzees and gorillas with strong support. In the analysis of only the BD region of SU, SFVhpi\_SAM106 was ancestral to the chimpanzee and gorilla SFVs, but with no support. In addition, SFVppy clustered FFVfca with good support between the NWM SFV and the other OWMA SFVs. Combined, these results suggest an absence of recombination in SFVhpi\_SAM106 and that genetic recombination did not affect our phylogenetic results when using the *gag-pol-env* concatamer.

#### *3.4. Evolutionary Relationships and Divergence Dating of SFVhpi\_SAM106 and Other FVs*

Likelihood mapping of the 15 taxa 7,412 position *gag-pol-env* concatamer alignment showed an equal distribution of the majority of possible quartets across the tree of which 98.4% were fully resolved, i.e., tree-like, with only 1.68% of unresolved quartets. These results suggest an overall dataset with very good phylogenetic signal and very little "noise" and hence suitable for phylogenetic reconstruction. Excellent phylogenetic signal was also found in both *gag-pol-env* alignments with scores >99.3 using DAMBE. Little nucleotide substitution saturation was found in the alignments using the method of Xia in DAMBE [63]. Together, our results indicate the alignments were satisfactory for phylogenetic analysis.

Phylogenetic trees generated using Bayesian inference of the *gag-pol-env* concatamer showed that FV sequences from a broad range of genetically diverse NHPs and non-simians formed monophyletic lineages and distinct clusters that mirrored host taxonomic relationships (Figure 3). SFVhpi\_SAM106 clustered with other ape SFVs with strong posterior probability (PP > 1) support (Figure 3). The FV phylogeny was similar to that seen in our previous study, where SFVhpi\_SAM106 is a sister taxa of but ancestral to the great ape SFVs, mirroring the phylogeny of the host mitochondrial and nuclear sequences in which the lesser apes are an outgroup to the great apes [6,42]. An identical phylogeny was obtained with the *gag-pol-env* first and second cdp alignment, indicating the absence of substitution saturation at the third cdp in the analysis of the unstripped alignment (Figure 3). As expected, the representative prosimian FV sequence from a galago (SFVocr) was ancestral to all other SFVs. The three non-simian FVs from equine, feline, and bovine formed a clade separate from the SFVs, with BFV and EFV clustering together with strong support (PP = 1). All BEAST analyses had standard deviation values of the uncorrelated lognormal relaxed clock (ucld.stdev) greater than zero but less than one, indicating that variation in substitution rates across branches was not consistent with a strict molecular clock but also was not so great as to bias the analyses (Table 4). The absence of site-to-site variation was also supported by alpha parameters of the gamma distribution above 1.0 (Table 4).

**Figure 3.** *Cont*.

**Figure 3.** Evolutionary relationships and divergence dates of foamy viruses (FVs) and their mammalian hosts. (**A**) FV phylogeny inferred using *gag-pol-env* concatamer (~7.4 kb). (**B**) FV phylogeny inferred using the first and second coding positions of the *gag-pol-env* concatamer (~4.9-kb). Phylogeny inferred using BEAST and mammalian host phylogeny inferred at Timetree.org as well as the corresponding geologic timescale. Old World apes (OWAs): SFVpve, *Pan troglodytes verus* (chimpanzee), GenBank accession number U04327; SFVpsc, *Pan troglodytes schweinfurthii* (chimpanzee), Y07725; SFVppy, *Pongo pygmaeus* (orangutan), AJ544579; SFVggo, *Gorilla gorilla* (gorilla), NC\_039029; SFVhpi\_SAM106, *Hylobates pileatus*, (pileated gibbon) M621235. Old World monkeys (OWMs): SFVcae, *Cercopithecus aethiops* (African green monkey), M74895; SFVmcy, *Macaca cyclopsis* (macaque), X54482. New World

monkeys (NWM): SFVcja, *Callithrix jacchus* (common marmoset), GU356395; SFVsxa, *Sapajus xanthosternos* (capuchin), KP143760; SFVaxx, *Ateles* species (spider monkey), EU010385; SFVssc, *Saimiri sciureus* (squirrel monkey), GU356394. Prosimian (Pro): SFVocr, *Otolemur crassicaudatus* (brown greater galago), KM233624. Non-simian mammals (NSMs): EFVeca, *Equus caballus* (equine), AF201902; BFVbta, *Bos taurus* (bovine), U94514; and FFVfca, *Felis catus* (feline), Y08851). Posterior probabilities (on the branch left of the node) and time to most recent ancestors in millions of years (right of node) are provided at each node of the FV phylogeny. Solid circles in the Timetree.org mammalian phylogeny indicate nodes that map directly to the NCBI taxonomy and open circles indicate nodes that were created during the polytomy resolution process. TOR, Tortonian; BUR, Burdigalian; CHT, Chattian; RUP, Rupelian; LUT, Lutetian; YPR, Ypresian; DAN, Danian; MAA, Maastrichtian; TUR, Turonian ages. PAL, Paleogene epoch.

Bayesian dating analyses showed that for the *gag-pol-env* concatamer alignment, the TMRCA for SFVhpi\_SAM106 and the Hominoidea was 20.69 mya with a 95% highest posterior density (HPD) interval of 19.13 to 22.19 mya (Table 4). This divergence date occurs during the Miocene epoch (Figure 3). The TMRCA for the SFV OWM/ape split (Crown Catarrhini) was 27.28 mya with a 95% HPD interval of 27.28 to 30.09 mya. The SFVpsc, SFVpve/SFVggo split (Crown Homininae) had a TMRCA of 9.18 mya with a 95% HPD interval of 8.54 to 9.8 mya. For the OWM SFV (Crown Cercopithecinae), a TMRCA of 18.05 mya was inferred with a 95% HPD interval of 11.29 to 24.7 mya. In contrast, TRMCAs for the non-simian FVs were much older, estimated to be 92.4 mya (95% HPD interval of 85.03–98.69 mya). The TMRCA for the Boreoutherian (placental mammals) at the root of the FV phylogeny was estimated to be 95.46 mya (95%HPD interval of 90.08 to 101.19 mya) during the Upper epoch and Mesozoic era. Similar TMRCAs were inferred for the 12 cdp alignment (Table 4). Our inferred FV and host divergence dates were similar to those inferred by others using different methods, supporting the robustness of our dating methods (Table 4). The inferred mean substitution rates ranged from 5.05 <sup>×</sup> <sup>10</sup>−<sup>9</sup> to 9.82 <sup>×</sup> <sup>10</sup>−<sup>9</sup> nucleotides/site/year across the *gag-pol-env* coding region (Table 4).



 Median*Pan troglodytes verus*; SFVpsc, SFV from *Pan troglodytes schweinfurthii*; SFVggo, SFV from *Gorilla gorilla*, non-simian foamy viruses include bovine foamy virus (BFV), equine foamy virus (EFV), and feline foamy virus (FFV). 3. 12 cdp, first and second codon positions. Length is 4.942 kb versus 7.412 kb for the complete coding concatamer. 4. FV Pol amino acid divergence dates were inferred with Bayesian phylogenetic inference and a time-dependent rate phenomenon, which is a power-law decay function [64]. 5. Primate and non-primate divergence dates from Pozzi et al. (complete mitochondrial genomes) and dos Reis et al. (nuclear genomes and mitochondrial genomes) [65,66]. 6. N/A, not available. 7. Mean evolutionary rate in nucleotide substitutions/site/year. 8. Standard deviation (stdev) of the uncorrelated lognormal (ucld) relaxed clock, BEAST analysis parameter indicating the amount of variation in the substitution rate across branches. Values close to zero indicate little substitution rate variation and the presence of a molecular clock, whereas values >1 indicate substantial rate heterogeneity amongst lineages. 9. Shape parameter of the gamma (Γ) distribution of rate heterogeneity among sites. 10. ESS, effective sampling size values for all BEAST parameters.

1.

#### **4. Discussion**

Using a metagenomics approach, we obtained the first complete genome of an SFV from a lesser ape, the pileated gibbon (*Hylobates pileatus*). Next generation sequencing is a powerful molecular method and has been used to obtain complete genomes of other SFV isolates recently [13,67,68]. We characterized the SFVhpi\_SAM106 genome by detailed sequence analysis and provided a deeper understanding of the evolutionary history of FVs, especially within the Hominoidea. Most of the important genomic structures and functional domains were conserved in SFVhpi\_SAM106 with some exceptions. While the organization of the SFVhpi\_SAM106 genome is similar to that of other FVs, it has several unique features, including an LTR at ~2 kb that is 1.2 to 1.6 times longer than that in other FVs, attributable mostly to longer U3 and R regions. Given that the 5 LTR U3 and R regions of FVs contain transcriptional start control elements, including the *tas* response elements (TREs) and cellular transcription factors, it will be important to conduct transcription-mapping experiments to determine the functional significance of the longer LTRs of SFVhpi\_SAM106 for replication and regulation. Since the number and location of FV TREs are not conserved, we were unable to locate the SFVhpi\_SAM106 TREs within the LTRs or *env* sequences *in silico*. Others have shown that tissue culture of chimpanzee SFV in diploid human fibroblasts isolated from an infected human selects for isolates with substantial nonrandom nucleotide deletions in the U3 region can increase viral replication [69]. The U3 deletion variant was also present as the majority variant in the infected human but was absent from naturally-infected chimpanzees [69]. However, SFVhpi\_SAM106 was isolated from infected PBMCs using canine thymocyte cells [42], which have not been shown to impact SFV LTR length or functionality in this dog cell line or in other non-human cells, including SFVggo and SFVcpz isolates from zoonotically-infected humans grown in baby hamster kidney cells (BHK-21) [57]. Comparisons of SFVhpi\_SAM106 LTRs directly from PBMCs from the pileated gibbon and those obtained by culture may be needed to further evaluate this unusually long LTR.

Additional differences from other FV prototypes included four instead of three potential GR boxes in Gag and a TATA box promoter motif in the IP in *env*, which is more similar to those of two endogenous FVs from the sloth and coelacanth by starting with a cytosine instead of a thymine present in other FVs [58]. Since the GR and TATA box motifs are important for viral replication, studies are needed to determine the effect of these genetic differences on SFVhpi\_SAM106 replication and growth. Some subtypes of human and simian immunodeficiency viruses (HIVs and SIVs) have the CATAAA TATA box sequence in their LTRs, which have not been shown to decrease LTR promoter activity by the HIV-1 transcription activator (Tat) protein in vitro [70]. Examination of the SFVhpi\_SAM106 genome for additional TATA promoters identified a highly conserved TATA box upstream of the RT active site promoter in the FV genome and its possible effect on viral replication is needed.

FVs have been shown to have co-evolved with their hosts for millions of years as vertebrates diversified in the Paleozoic Era [5,6,64]. Our results expand knowledge of the evolutionary history of FVs and show for the first time that SFVhpi\_SAM106 from the lesser ape, *Hylobates pileatus,* follows this same ancient co-speciation trajectory. The phylogenetic position of SFVhpi\_SAM106 mirrors that of the gibbon host, with a divergence date of about 21 mya (95% HPD 19.13–22.19 mya) during the Miocene Epoch of the Cenozoic Era. The FV divergence dates inferred with our methods are strongly congruent with those of Aiewsakun and Katzourakis, who used Bayesian phylogeny and a time-dependent rate power-law decay function that is independent of archaeological calibrators [64]. Our Hylobatidae/Hominidae divergence dates are also consistent with those of Matsudaira et al. (20.3 mya, 95% CI 17.5–23.6 mya) and Chan et al. (19.25 mya, 95% HPD 15.54–22.99), who examined ape complete mitochondrial sequences [71,72]. The Hylobatidae/Hominidae divergence dates are slightly older than those of reported by Thinh et al. (16.26, 95% HPD 14.69–18.16 mya), who used only complete cytochrome B mitochondrial sequences [38]. In addition, our inferred SFV divergence dates are also highly consistent with those reported for the evolutionary histories of NHPs and other placental mammals examined in our study and had very close divergence dates with overlapping confidence intervals (Table 4). Together, these results show that our use of multiple host divergence

calibration points provided a robust inference of FV evolutionary histories. As with all evolutionary studies, the inferred divergence dates represent minimum ages that are younger than true ages, because the fossil record is incomplete.

Given the strong evidence of FVs co-diverging with their hosts, both FVs and their mammalian hosts [73] should have very similar evolutionary rates. Indeed, FVs have extremely low rates of evolution, similar to rates observed for mitochondrial protein-coding genomes of about 1 <sup>×</sup> 10−<sup>8</sup> substitutions/site/year [4,6,11]. However, our estimates were a log lower and are more similar to those of the mammalian neutral substitution rate for nuclear genes and for endogenous retroviruses of 1 <sup>×</sup> 10−<sup>9</sup> substitutions/site/year [74,75]. Our lower inferred FV nucleotide substitution rates may reflect the use of older calibration dates for the *Equs caballus*/*Bos Taurus* split and for Boreoutheria in the phylogenetic analyses.

Genetic recombination did not appear to affect our phylogenetic results. Phylogenetic analysis of only the FV *env* region with additional SFV taxa from the gorilla, chimpanzee, and macaque with evidence of a variant RBD from a potential recombination event showed that SFVhpi\_SAM106 was ancestral to SFVppy and then all other OWMA SFVs instead of ancestral to only the apes as we found for the *gag-pol-env* concatamer alignments. Similar results were inferred even after removal of the putative recombination region in the RBD of SU, whereas in the RBD-only alignment, placement of the SFVhpi\_SAM106 and SFVppy taxa were not resolved. Similar genetic relationships were observed by Richard et al. in SFVppy when examining complete gorilla and chimpanzee *env* sequences [19]. One exception in their analysis of the variant RBD region is that SFVppy clustered ancestral to chimpanzee and gorilla clade 1 instead of with FFVfca as in our analysis since non-simian FVs were not included in their analysis [19,48]. Such nonconforming FV co-evolutionary phylogenetic relationships may reflect the phenomenon called long branch attraction (LBA), which can occur when highly divergent lineages are included in the phylogenetic analysis [73]. LBA can potentially be resolved by the addition of additional sequence information [73], which we have done by using the *gag-pol-env* concatamer, or by inclusion of more SFV sequences from gibbons and orangutans when they become available.

As with other NHPs, gibbons are threatened by habitat encroachment for forest clearance, road construction, and changes in agriculture resources. Gibbons are also hunted for food, medicine, or for exportation for the pet trade (see also the International Union for Conservation of Nature's Red List at www.iucnredlist.org) [40,76]. Gibbons are also common members of zoological exhibits. All of these activities can lead to human exposures to gibbons and their microbial communities, including SFVs, which have crossed into humans from many NHP species [12]. Until now, only two short SFV integrase sequences were available to inform the design of sensitive PCR assays for the detection of human infection [42]. More sensitive molecular assays are needed to detect the low copies of integrated genomes typically found in SFV infection. The small number of complete SFV genomes from all available NHPs has also limited the design of generic SFV PCR primers. These limitations may help explain the inability to detect SFV sequences in seropositive pileated gibbons from Cambodia using generic PCR assays [28]. The availability of the complete SFVhpi\_SAM106 genome from our study will facilitate the design of more sensitive and specific PCR assays for the detection of gibbon SFVs and fills an important knowledge gap in the SFV database, facilitating future studies of FV infection, transmission, and the evolutionary history of FVs.

#### **5. Conclusions**

By using a metagenomics approach, we obtained the complete SFVhpi\_SAM106 genome from a pileated gibbon (*Hylobates pileatus*). Bayesian analysis showed that SFVhpi\_SAM106 is ancestral to other ape SFVs with a divergence date of ~20.6 million years ago, reflecting ancient co-evolution of the host and virus. Our molecular analysis also showed that the SFVhpi\_SAM106 genome has the longest genome (13,885 nt) of all SFVs with complete genomes available, due to a much longer LTR (2071 bp). The complete sequence of the SFVhpi\_SAM106 genome will provide invaluable information for further understanding the epidemiology and evolutionary history of SFVs.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1999-4915/11/7/605/s1. Supplementary Figure S1.

**Author Contributions:** W.M.S. and T.L.G. conceived and designed the study; A.S., W.M.S., and S.D.S. performed the experiments; A.S., S.D.S., T.L.G., and W.M.S. analyzed the data; A.S. prepared the draft manuscript and all authors contributed to the final version, which was approved by all authors for submission.

**Funding:** This work was supported by intramural CDC funding and by NIH grant TW009237 as part of the joint NIH-NSF Ecology of Infectious Disease program and the UK Economic and Social Research Council (grant ES/J011266/1).

**Acknowledgments:** Use of trade names is for identification only and does not imply endorsement by the U.S. Centers for Disease Control and Prevention (CDC). The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the CDC.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **The First Co-Opted Endogenous Foamy Viruses and the Evolutionary History of Reptilian Foamy Viruses**

**Pakorn Aiewsakun 1,\*, Peter Simmonds <sup>2</sup> and Aris Katzourakis 3,\***


Received: 9 May 2019; Accepted: 4 July 2019; Published: 12 July 2019

**Abstract:** A recent study reported the discovery of an endogenous reptilian foamy virus (FV), termed ERV-Spuma-Spu, found in the genome of tuatara. Here, we report two novel reptilian foamy viruses also identified as endogenous FVs (EFVs) in the genomes of panther gecko (ERV-Spuma-Ppi) and Schlegel's Japanese gecko (ERV-Spuma-Gja). Their presence indicates that FVs are capable of infecting reptiles in addition to mammals, amphibians, and fish. Numerous copies of full length ERV-Spuma-Spu elements were found in the tuatara genome littered with in-frame stop codons and transposable elements, suggesting that they are indeed endogenous and are not functional. ERV-Spuma-Ppi and ERV-Spuma-Gja, on the other hand, consist solely of a foamy virus-like *env* gene. Examination of host flanking sequences revealed that they are orthologous, and despite being more than 96 million years old, their *env* reading frames are fully coding competent with evidence for strong purifying selection to maintain expression and for them likely being transcriptionally active. These make them the oldest EFVs discovered thus far and the first documented EFVs that may have been co-opted for potential cellular functions. Phylogenetic analyses revealed a complex virus–host co-evolutionary history and cross-species transmission routes of ancient FVs.

**Keywords:** foamy virus; spumavirus; reptile foamy virus; endogenous foamy virus; endogenous retrovirus; ancient retroviruses; co-evolution; co-speciation; foamy virus-host interactions

#### **1. Introduction**

Foamy viruses (FV; the *Spumaretrovirinae* subfamily) are a unique subgroup of retroviruses (family *Retroviridae*) comprising an independent lineage basal to all other exogenous retroviruses [1]. FV surveillance and the discovery of their endogenous retrovirus (ERV) counterparts revealed that the host range of FVs covers a wide range of vertebrates, including mammals [2,3], amphibians [4], lobe-finned fish [5], bony fish [4,6], and cartilaginous fish [4,7], considerably wider than those of other retrovirus groups. Owing to the wealth of sequence data and the identification of multiple instances of endogenization for FVs, the longer-term evolutionary history of FVs can be investigated in great detail. For example, analysis of endogenous and modern-day viral sequences revealed that FVs have been broadly co-diversifying with their hosts since the origin of vertebrates, dating back almost half a billion years ago to the early Palaeozoic Era [4].

A recent study reported the discovery of the first reptilian endogenous FV (EFV) in the tuatara genome, namely ERV-Spuma-Spu [8]. Phylogenetic analysis showed that ERV-Spuma-Spu is basal to the clade of mammalian FVs. Based on this finding, the authors speculated that the reptilian FV lineage may have diverged from the mammalian FV linage more than 320 million years ago under the virus–host co-speciation assumption. Nevertheless, since there was effectively only one reptilian FV linage in the study, the inferred co-speciation could not be verified, and a history of viral cross-species transmissions might have been overlooked. Indeed, this was shown to be the case before for the lobe-finned fish EFV, CoeEFV [4].

Here, we further characterize ERV-Spuma-Spu and report two additional reptilian EFVs found in the genomes of panther gecko (*Paroedura picta*) and Schlegel's Japanese gecko (*Gekko japonicus*), designated ERV-Spuma-Ppi and ERV-Spuma-Gja, respectively. Evolutionary analyses together with other currently available FV and EFV sequences suggest that these reptile EFVs do not form a monophyletic clade and that they are significantly younger than their hosts. This in turn suggests that, in contrast to what was previously suggested, their ancestors likely originated from cross-species transmissions, where one gave rise to ERV-Spuma-Spu and the other gave rise to the two gecko EFVs. Our results improve our understanding of how FVs evolved and interacted with their hosts in the distant past.

#### **2. Materials and Methods**

#### *2.1. ERV-Spuma-Spu Mining*

The tuatara genome (*Sphenodon punctatus*; accession number: QEPC01000000) was searched using tBLASTn and a CoeEFV Pol protein query with an E-value cut off of 1 <sup>×</sup> <sup>10</sup><sup>−</sup>6. This returned 20,581 hits from 2370 contigs (Table S1). These hits were then combined together (including the sequence between the two hits) if they were ≤ 5000 base pairs apart or overlapping and were in the same orientation. This resulted in 12,520 merged Pol hits (Table S2).

These hits were subsequently subjected to reciprocal BLASTx against a database of retrovirus proteins one by one with an E-value cut off of 1 <sup>×</sup> 10−<sup>6</sup> (Table S2). If the best match protein did not belong to a member of the *Spumaretrovirinae* subfamily or the *Spumavirus* genus, the hit was then excluded from the downstream analyses. Ultimately, 87,387 retrovirus proteins were retrieved from the National Center for Biotechnology Information (NCBI) protein database on the 7th of June 2018 using 3 search queries. The first query was 'txid11632[Organism:exp] NOT "partial" AND 500:100000[SLEN]', which retrieved proteins that belong to the members of the *Retroviridae* family [NCBI: txid11632] with a length between 500 and 100,000 amino acids and were not annotated as partial (86,662 sequences). The second query was 'txid186534[Organism:exp] NOT "partial" AND 500:100000[SLEN]', which retrieved proteins that belong to the members of the *Caulimoviridae* family [NCBI: txid186534] with a length between 500 and 100,000 amino acids and were not annotated as partial (720 sequences). The last query was 'txid186665[Organism:exp]', which retrieved proteins that belong to the members of the *Metaviridae* family [NCBI: txid186665] (5 sequences). Out of 12,520 Pol hits, only 2757 exhibited the greatest similarity to FV proteins (Table S2). The rest were removed from the subsequent dataset.

We noted that some of these 2757 Pol sequence candidates, nevertheless, may actually have been those of Class III ERVs and not actually of EFVs. To further exclude false positive hits, we used them as queries in a BLASTx search against 194 retrovirus and ERV Pol protein sequences, publicly available from the database held at http://bioinformatics.cvr.ac.uk/paleovirology/site/html/retroviruses.html. Those with the best hit protein that did not belong to the *Spumavirus* genus at an E-value cut off of <sup>1</sup> <sup>×</sup> <sup>10</sup>−<sup>6</sup> were further excluded from the dataset. Only 1959 sequence candidates remained after this procedure (Table S2).

To recover potentially full elements, we extracted these 1959 Pol hits from the tuatara genome with 10,000 base pairs extended on both ends. They were then searched against the CoeEFV Env protein query using tBLASTn. The analysis showed that only 165 out of 1959 sequences exhibited similarity to CoeEFV Env protein at the 1 <sup>×</sup> <sup>10</sup>−<sup>6</sup> E-value cut off (Table S3). These 165 endogenous viral elements were designated ERV-Spuma-Spu elements.

#### *2.2. Consensus Sequence Reconstruction*

The top 20 elements of the 165 ERV-Spuma-Spu elements that exhibited the greatest similarity to the concatenated CoeEFV Pol-Env protein sequence were aligned and used to reconstruct a consensus sequence. At the time of analysis, the quality of the tuatara genome was low, however, containing many large strings of undetermined nucleotides ('N's). Furthermore, the majority of the

identified ERV-Spuma-Spu elements were also interrupted by transposable elements. Because of this, the standard protocol of consensus sequence reconstruction (or ancestral sequence reconstruction) inferred false gaps where they should not have been. To overcome this problem, we only allowed gaps in the consensus sequence if there were more than 15 sequences containing gaps in that particular position; otherwise, a consensus base pair from non-gap sequences would be inferred. Standard ambiguous bases were used in the case of base count ties. The consensus sequence of the virus internal region was inferred separately from the LTR portion (Data S1), and the consensus LTR sequence was inferred from both 5' and 3' LTRs (Data S2). Only 16 LTR sequences from 11 elements could be aligned with confidence. ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/) was used to identify open reading frames, and tRNAScan (http://lowelab.ucsc.edu/tRNAscan-SE/) was used to identify the primer binding site. We also attempted to identify the internal promoter in ERV-Spuma-Spu by comparing its sequence to those of mammalian FVs. The consensus sequence is in Figure 1, Figure S1, and Data S3. As previously reported [8], reciprocal BLASTp searches showed that the proteins identified from the consensus sequence were most similar to those of modern-day FVs, supporting that these ERV-Spuma-Spu elements are indeed FVs.

#### *2.3. EFVs in Other Reptiles*

Pol and Env protein sequences of the consensus ERV-Spuma-Spu and CoeEFV were used in a tBLASTn search against the NCBI Whole Genome Shotgun database restricted to reptiles and excluding the tuatara genome to examine if other reptile genomes had any EFVs.

Numerous tBLASTn hits were returned from six reptile genomes showing similarity to the Pol protein sequences. From each genome, we selected one to five top hits (19 hits in total) and examined if they were most similar to modern-day FVs. Reciprocal BLASTx analysis suggested that none of them were FV-like, however (Table S4). We thus did not analyze these sequences any further.

In contrast, only two FV-like *env* elements were found. One was found in the genome of the panther gecko (*Paroedura picta*; BDOT01000314.1:c548029-545198), and the other was found in the genome of Schlegel's Japanese gecko (*Gekko japonicus*; LNDG01066615.1:c46188-43360). Neither elements contained in-frame stop codons or transposable elements. Results from reciprocal BLASTx searchers (Table S4) suggested that they were indeed EFVs, showing the greatest similarity to modern-day FVs. They were designated ERV-Spuma-Ppi and ERV-Spuma-Gja, respectively.

The contigs containing ERV-Spuma-Ppi and ERV-Spuma-Gja were co-linear, suggesting that they may be orthologs. Genes surrounding the EFVs were determined and compared to confirm orthology. While LNDG01066615.1 was annotated with genes, BDOT01000314.1 was not. We thus used the gene prediction program AUGUSTUS [9] to annotate the contig. Homologous regions in other reptiles, including the European green lizard (*Lacerta viridis*, OFHU01003482.1), the brown spotted pitviper (*Protobothrops mucrosquamatus*, BCNE02010247.1), and the green anole (*Anolis carolinensis*, NW\_003339653.1), were identified based on the genes found on LNDG01066615.1 and BDOT01000314.1 by using tBLASTn. AUGUSTUS [9] was used to annotate the genes on these contigs when gene annotations were not available. The results are shown in Figure 2.

To determine the type of selection that ERV-Spuma-Ppi and ERV-Spuma-Gja were under, their nucleotide sequences were aligned, and a dN/dS ratio was computed by using CodeML implemented in PAML 4.9e [10]. The run mode was set to the "pairwise" mode (runmode = −2). No clock was assumed (clock = 0: no clock) with all sites assumed to have evolved under the same rate (fixed α = 1 and α = 0), and the equilibrium codon frequencies were assumed to be equal to those that were observed (CodonFreq = 3: code table; estFreq = 0: use observed freqs). The universal genetic code (icode = 0: universal) was used to determine the number of synonymous and non-synonymous sites and changes. The basic model of selection (model = 0: one dN/dS ratio; NSsites = 0: one ω) was used to compute an overall dN/dS ratio for the entire *env* like gene.

To investigate the possibility that ERV-Spuma-Ppi and ERV-Spuma-Gja might be transcriptionally active, we used them to query the Reptilian Transcriptomes v2.0 Database [11], using both BLASTn and

tBLASTn searches. One very significant hit was returned from the transcriptomic sequence database of common leopard Gecko (*Eublepharis macularius*), named "EMA\_Contig\_Illumina\_8645". It was a consensus contig that was estimated from 499 raw transcript reads.

#### *2.4. Recombination Analyses*

A concatenated alignment of *pol*-*env* nucleotide sequences of mammalian, tuatara, amphibian, and lobe-finned fish FVs/EFVs was prepared for recombination analyses. Their *gag* sequences were not included, as they could not be aligned among these viruses. To obtain ERV-Spuma-Spu's *pol* and *env* sequences, we used BLASTn to query the 165 ERV-Spuma-Spu elements with the consensus ERV-Spuma-Spu's *pol* and *env* sequences with an E-value cut off of 1 <sup>×</sup> 10<sup>−</sup>6. Hits that were found in the same ERV-Spuma-Spu element and that were in the same orientation were concatenated according to their hit locations to obtain contiguous viral sequences without transposable elements. We only included those with both *pol* and *env* gene coverage of ≥ 80% in the alignment. The final alignment contained 45 sequences, 27 of which were ERV-Spuma-Spu elements and were 4695 nucleotides (nt) long (*pol*: 2685 nt; *env*: 2010 nt) (Data S4).

Potential recombination events were detected using 7 programs: RDP, GENECONV, Chimaera, MaxChi, BootScan, SiScan, and 3Seq, implemented in Recombination Detection Program 4 [12] with their default settings. Only those detected with >4 programs were considered significant.

#### *2.5. Phylogenetic Analyses*

Recombination analyses suggested that FVs' *pol* and *env* genes might have different evolutionary histories (see Results). We thus estimated Pol and Env protein phylogenies separately under the Bayesian phylogenetic framework by using MrBayes 3.2.6 [13] to better understand how they evolved (Figure 3). The Env protein alignment was derived from the *env* nucleotide alignment used in the recombination analyses with the addition of sequences from the two gecko EFVs (Data S5). For the Pol protein alignment, sequences from fish EFVs and non-FV class III ERVs were included as an outgroup (Data S6). This alignment (containing only the reverse transcriptase and integrase coding domain portions) was based on the one we used in our previous study, allowing the results to be compared. The best-fit amino acid substitution models were determined to be JTT+I+Γ(4)+F for the Env alignment and JTT+Γ(4)+F for the Pol alignment by ProtTest 3.4.2 [14] under the sample size corrected Akaike information criterion. Two Markov chain Monte Carlo chains were run for 10,000,000 steps with a sampling frequency of 1 per 1000 steps. The metropolis coupling algorithm (3 hot chains and 1 cold chain) was used to improve the sampling. The first 25% of sampled parameter values were discarded as burn-in. Potential scale reduction factors of all parameters were ~1.000 in both analyses, indicating that they were all well sampled from their posterior distributions and had converged.

#### *2.6. Evolutionary Timescale Inference*

Many studies have shown that mammalian FVs have a long-term co-speciation history with their hosts throughout the entire evolution of the eutheria [2,15–17]. This extraordinarily evolutionary feature has led to the observation that the relationship between the virus total per-lineage substitution numbers (*s*) and the evolutionary timescales (*t*) can be approximated very well by a power-law function: *log*(*t*) = α*log*(*s*) + β [18]. We used this relationship, the so-called time-dependent rate phenomenon (TDRP) model, to estimate evolutionary timescales of reptile EFVs.

For each of the trees in the posterior distribution obtained from the Bayesian phylogenetic analyses, we traced the simian foamy virus Pan troglodytes verus (SFVpve) backwards in time to various virus–host co-speciation nodes to obtain various *s* estimates, of which the corresponding *t* estimates could be inferred directly under the co-speciation assumption (Table S5). Based on virus–host tree topology comparison, seven co-speciation nodes could be inferred in the Pol phylogeny (Figure 3A, labeled with black Roman numerals), and five were inferred in the Env tree (Figure 3B, labeled with black Roman numerals) down the SFVpve lineage. These corresponding *s* and *t* estimates were used to

calibrate the TDRP model. The model fitting was performed by using the *lm* function implemented in R 3.1.2 [19], and the *t* and the *s* estimates were log-transformed (base 10) prior to the linear model fitting. The model was then extrapolated to estimate the timescales of other nodes from their *s* estimates.

#### **3. Results**

#### *3.1. Characterisation of ERV-Spuma-Spu*

By using a series of BLAST searches (see Materials and Methods for details), we identified 165 FV-like endogenous viral elements (EVEs) in the genome of tuatara (*Sphenodon punctatus*; accession number: QEPC01000000). All of these EVEs showed the greatest similarity to modern FV Pol protein sequences (and not to those of other retroviruses) and harbored FV-like *env* sequences (Tables S1–S3). We observed that most of these EVEs contained transposable elements, interrupting their protein coding regions, indicating that they are genuine non-functional and old endogenous viruses and not contamination of extant virus sequences in the sequence of the tuatara genome. Due to the poor quality of the tuatara genome at the time of analysis, many EVE sequences contained large strings of "N", representing undetermined nucleotide sequences. We treated them as gaps in the downstream analyses in this study.

A consensus sequence of these FV-like EVEs (Figure 1A) was reconstructed for genome annotation (see Materials and Methods for details). Twenty out of the 165 elements that showed the greatest similarity to the concatenated CoeEFV's Pol and Env protein sequence were aligned and curated and subsequently used to reconstruct a consensus sequence. The consensus sequence of the virus body was inferred separately from the long terminal repeat (LTR) portion (Data S1), and the consensus LTR sequence was inferred from both 5'- and 3'-LTR sequences (Data S2). An average pairwise distance between these 20 elements was estimated to be only 0.118 substitutions per site (body portion), and they could be aligned with high confidence, indicating that they are of the same virus lineage. See Figure S1 and Data S3 for the full consensus sequence. Sequence comparison revealed that this virus is highly similar to the ERV-Spuma-Spu previously reported in [8]. They exhibit 97.17% nucleotide percentage identity and 97% coverage (with the one in [8] being 3% shorter), suggesting they are the same virus. The slight differences are likely due to the different methods we used to reconstruct the sequences.

**Figure 1.** Reconstructed, putative, ERV-Spuma-Spu genome (**A**) and foamy virus (FV) internal promoters (**B**). Top-A, scale bar indicates nucleotide position. Middle-A, schematic diagram representing the genomic organisation of ERV-Spuma-Spu. LTR: long terminal repeat (grey); PBS: primer binging site; Gag: *group antigen* gene (green); Pol: *polymerase* gene (yellow); Env: *envelope* gene (blue). Bottom-A, start (green) and stop (red) codon positions in the six translation frames (+1, +2, +3, −1, −2, and −3). Potential open reading frames are shown in purple. The dotted boxes indicate the two open reading

frames identified as a single accessory gene in [8]. The nucleotide sequences of consensus ERV-Spuma-Spu can be found in Figure S1 and Data S3. Left-B, internal promoters (TATAAAA) towards the 3' end of the *env* gene could be identified in all mammalian FVs (highlighted in yellow) but were absent from CoeEFV (endogenous FVs) and ERV-Spuma-Spu. Right-B, protein sequences used to guide the nucleotide alignment. Those corresponding to the sequences on the left are shown in brighter colors.

In summary, the consensus sequence was 10,044 nucleotides (nt) long. The LTRs were 935 nt in length (5'-LTR: nt 1–935, 3'-LTR: nt 9110–10,044), which were longer than those of other retroviruses such as alpharetroviruses (~350 nt), gammaretroviruses (~600 nt), deltaretroviruses (~550–750 nt), and lentiviruses (~600 nt), but were typical for FVs (~950–1700 nt) [20]. We noted that our LTRs were longer than those reported in [8] (694 nt); a sequence comparison showed that the consensus sequence reported in [8] was missing ~240 nt corresponding to the beginning (i.e., the 5' end) of the LTRs (Data S2). A lysine tRNA utilizing primer binding site (PBS) was identified downstream of the 5 -LTR (TGGCGCCCAAYGTGGGGCTCGA, nt 938–959), which is typical of mammalian FVs [21,22]. The *gag* gene was predicted to be 1308 nt long (nt 1085–2392) to generate a 435 amino acid (aa) protein. This was markedly shorter than those of mammalian FVs (~550–650 aa). The *pol* and *env* genes were determined to be 3540 nt (nt: 2487–6026; 1179 aa) and 3003 nt (nt: 5956–8958; 1000 aa) long, respectively, which are typical lengths of mammalian FV *pol* and *env* genes. Results from reciprocal BLAST analyses revealed that all three protein products were most similar to those of mammalian FVs (Table 1), consistent with previously reported results [8]. Phylogenetic analysis also showed that these EVEs clustered with other FVs and EFVs (see below), supporting that this consensus sequence is indeed derived from EFV elements.


**Table 1.** Reciprocal BLASTp analyses of ERV-Spuma-Spu against the National Center for Biotechnology Information (NCBI) non-redundant retroviral protein database.

\* as explicitly reported by the program.

One hypothetical open reading frame (ORF) was identified as an accessory gene in the previous study [8]. This ORF could be mapped to nt 9114–9498 in our consensus sequence, corresponding to the LTR portion (Data S2). In our consensus sequence, the identified ORF appeared to be broken into two separate ORFs of different frames (nt 9114–9305, frame +3 and nt 9343–9498, frame +1; Figure 1A). The crucial difference was that the Wei et al. sequence missed a nucleotide, causing the two ORFs to merge as one (Data S2; 10049th column in the alignment; 196th nt in our consensus LTR sequence). Indeed, we failed to locate any accessory genes between the *env* gene and the 3'-LTR, typical of a mammalian FV (Figure 1A). In addition, as previously noted, the hypothetical accessory protein did not exhibit significant similarity to any known FV proteins [8] or indeed any molecular sequences in the National Center for Biotechnology Information (NCBI) non-redundant (nr) nucleotide collection database. Furthermore, we could not identify an internal promoter towards the 3' end of the *env* gene (Figure 1B) required for efficient accessory gene expression [23,24]. All these results suggest that ERV-Spuma-Spu may actually lack accessory genes and that the previously identified accessory gene was an artifact.

#### *3.2. The Discovery and Characterisation of Gecko EFVs*

In addition to ERV-Spuma-Spu elements, we were able to recover two FV-like elements from two gecko genomes, namely panther gecko (*Paroedura picta*; accession number: BDOT01000000) and Schlegel's Japanese gecko (*Gekko japonicus*; accession number: LNDG01000000). They were found on contig BDOT01000314.1 (nt: c548,029–545,198; 943 aa) and contig LNDG01066615.1 (nt: c46,188–43,360; 942 aa), respectively. Both elements consisted solely of a full-length FV-like *env* gene, and no other FV-like elements could be identified. Interestingly, neither of them contained any in-frame stop codons or transposable elements. Reciprocal BLAST analyses showed that both elements were most similar to modern mammalian FVs (Table S4). Consistent with this finding, phylogenetic analysis revealed that they clustered with other FVs and EFVs (see below). Combined, these results strongly suggest that they are EFVs. We designated the two elements ERV-Spuma-Ppi and ERV-Spuma-Gja, respectively.

We determined that BDOT01000314.1 and LNDG01066615.1 were co-linear, suggesting that ERV-Spuma-Ppi and ERV-Spuma-Gja might be orthologous. ERV-Spuma-Gja was found in the intronic region of the predicted endoplasmic reticulum membrane protein complex subunit 1 gene (EMC1: XM\_015412627.1) between exon 21 and 22 in the antisense orientation. Further examination revealed three other genes in the same vicinity of ERV-Spuma-Gja, namely ubiquitin protein ligase E3 component n-recognin 4 gene (UBR4: XM\_015412557.1), MRT4 homolog, ribosome maturation factor gene (MRTO4: XM\_015412607.1), and aldo-keto reductase family 7 member A2 gene (AKR7A2: XM\_015412595-6.1). At the time of analysis, the contig BDOT01000314.1 (of the panther gecko genome) was not annotated with genes; nevertheless, we were able to confirm that ERV-Spuma-Ppi had the same genomic location as ERV-Spuma-Gja and was surrounded by the same set of genes in the same order, confirming that they are orthologs. Moreover, we were able to identify corresponding homologous genomic regions in the genomes of European green lizard (*Lacerta viridis*, OFHU01003482.1), brown spotted pitviper (*Protobothrops mucrosquamatus*, BCNE02010247.1), and green anole (*Anolis carolinensis*, NW\_003339653.1) based on the presence of these genes. However, none of these reptile genomes had homologs of ERV-Spuma-Ppi and ERV-Spuma-Gja. The results are shown in Figure 2. We thus could infer that ERV-Spuma-Ppi and ERV-Spuma-Gja are at least 96 (83–98) million years (myr) old based on the speciation date of panther gecko and Schlegel's Japanese gecko [25]. Furthermore, a pairwise nucleotide sequence comparison of ERV-Spuma-Ppi and ERV-Spuma-Gja estimated its dN/dS ratio to be 0.14 (0.08–0.54), strongly suggesting that they were under strong purifying selection pressure.

Remarkably, when we queried the Reptilian Transcriptomes V2.0 Database [11] with ERV-Spuma-Ppi and ERV-Spuma-Gja, we recovered a transcript consensus contig from the transcriptome of common leopard gecko (*Eublepharis macularius*), which is closely related to panther gecko and Schlegel's Japanese gecko. The sequence exhibited 82.2 and 93.0 aa percentage identity (E-value = 4 <sup>×</sup> 10−<sup>66</sup> and 2 <sup>×</sup> 10−74) and 81.6 and 86.94 nt percentage identity (E-value = 1 <sup>×</sup> 10−<sup>93</sup> and 4 <sup>×</sup> 10<sup>−</sup>118) to ERV-Spuma-Ppi and ERV-Spuma-Gja, respectively. The contig was 626 bases long and constructed from 499 raw reads. We could not check for the homolog of ERV-Spuma-Ppi and ERV-Spuma-Gja in the leopard gecko genome, as it is currently not available. However, this finding suggests that ERV-Spuma-Ppi and ERV-Spuma-Gja might be transcriptionally active.

**Figure 2.** FV-like *env* sequences in panther gecko (*Paroedura picta*, BDOT01000314.1) and Schlegel's Japanese gecko (*Gekko japonicus*, LNDG01066615.1). (**A**) Alignment of Env protein sequences of prototype

FV (SFVpsc), consensus ERV-Spuma-Spu, and the two gecko FV-like endogenous viral elements, BDOT01000314.1: ERV-Spuma-Ppi, and LNDG01066615.1: ERV-Spuma-Gja. (**B**) BLASTn dot matrix between LNDG01066615.1 and BDOT01000314.1. The red circle indicates the location of the FV-like *env* sequences. (**C**) Homologous regions in three other reptiles, namely European green lizard (*Lacerta viridis*, OFHU01003482.1), brown spotted pitviper (*Protobothrops mucrosquamatus*, BCNE02010247.1), and green anole (*Anolis carolinensis*, NW\_003339653.1). Genes were predicted by AUGUSTUS [9]. Four eukaryotic genes are shown on the diagram from left to right: ubiquitin protein ligase E3 component n-recognin 4 (UBR4, red), ER membrane protein complex subunit 1 (EMC1, orange and yellow), MRT4 homolog, ribosome maturation factor (MRTO4, green), and aldo-keto reductase family 7 member A2 (AKR7A2, blue). Gene homology at the protein level was examined by using BLASTp. FV-like env genes are highlighted in grey. Grey bars represent the contigs, and the scale bar (black) represents a length of 10 kb.

#### *3.3. Recombination Analyses*

A concatenated nucleotide alignment of *pol* and *env* sequences of ERV-Spuma-Spu elements together with those of mammalian, amphibian, and lobe-finned fish FVs and EFVs was checked for potential recombination events by Recombination Detection Program 4 (RDP4) [12]; *gag* sequences were not included, as they could not be aligned among these viruses. We found that, out of 165 ERV-Spuma-Spu elements, 138 of them (83.6%) had coverage of either *pol* or *env* genes of < 80% (see Methods and Materials for details), likely due to the poor genome quality at the time of analysis. We thus decided to exclude them from any further downstream analyses. The alignment contained 45 sequences, 27 of which were those of ERV-Spuma-Spu elements, and was 4695 nt (*pol*: 2685 nt; *env*: 2010 nt) long after curation (Data S4).

The analysis first identified two ERV-Spuma-Spu elements as recombinants (QEPC01002018.1: 1489335–1511987, and QEPC01002018.1: 1489335–1511987), harboring an integrase coding domain of unknown origin at the same genomic locations (position in the alignment: nt 2038–2596; Figure S2). We removed the recombinant regions (nt 2152–2432 after manual inspection) and performed the analysis again to further examine for other potential recombination events.

The results from the second round of analysis suggested that the *pol* and the *env* genes have different evolutionary histories. Based on the default dendrogram outputs from RDP4 [12] estimated using the unweighted pair group method with arithmetic mean, we found that, while ERV-Spuma-Spus' *pol* genes are more closely related to those of CoeEFV, their *env* genes are more similar to those of mammalian FVs (Figure S3). No recombination could be detected in either of the individual *pol* and *env* nucleotide alignments after that.

#### *3.4. Phylogenetic Analyses and Evolutionary Timescale Estimation*

To better understand how the *pol* and *env* genes evolved, their evolutionary histories were estimated from their corresponding protein alignments by using a Bayesian phylogenetic method (Figure 3). The Env protein alignment was derived from the *env* nucleotide alignment used in the recombination analyses with the addition of the two gecko EFVs to investigate how they relate to other FVs (Data S5). For the Pol protein alignment, we included sequences from fish EFVs and non-FV class III ERVs as an outgroup. This Pol alignment (Data S6) was based on the one we used previously in the study reporting the discovery of amphibian and fish EFVs [4], allowing the results to be compared.

Overall, the well-established broad co-speciation pattern between mammalian FVs and their hosts could be recovered from both the Pol and the Env phylogenies (Figure 3), and the topology of the Pol tree was comparable to that previously published in [4]. Our analyses suggested that ERV-Spuma-Spus' Pol is sister to that of CoeEFV (Bayesian posterior probability clade support = 0.99), while their Env is more closely related to those of mammalian FVs (Bayesian posterior probability clade support = 0.97). These results were consistent with those obtained from the recombination

analysis (Figure S3). In addition, we found that the Env proteins of ERV-Spuma-Spu elements did not cluster with gecko EFVs; they instead formed two separate lineages, with gecko EFVs being closer to mammalian FVs (Bayesian posterior probability clade support = 1.00). In addition, we noted that, while our Pol protein analysis strongly supports the sister taxon relationship between ERV-Spuma-Spu and CoeEFV, the previous Pol protein analysis showed that ERV-Spuma-Spu is a sister taxon of mammalian FVs, inferred under the maximum likelihood framework with 89% bootstrap support [8]. This could be due to the differences in the methods (Bayesian vs. maximum likelihood) and/or the alignments used (reversed transcriptase + RNase-H + integrase domain vs. reversed transcriptase + RNase-H).

**Figure 3.** Foamy virus Pol, Env, and host phylogenies. Bayesian Pol (**A**) and Env (**B**) phylogenies were estimated by using MrBayes 3.2.6 [13], and their scale bars are in the units of amino acid substitutions per site. Both Pol and Env trees were rooted by the mid-point rooting method, and the determined outgroups are shown in grey. Arabic numerals on nodes are Bayesian posterior probability clade support values. The topologies of the Pol and the Env trees were compared to that of the host phylogeny (**C**) to identify virus–host co-speciation events labeled with Roman numerals. Nodes on different phylogenies that are labeled with the same Roman numeral are those corresponding to the same co-speciation event. The timescale of the identified co-speciation nodes, directly inferred from their hosts (Table S5), was used to calibrate the timescales of other nodes. The host tree topology was estimated elsewhere [26], and its scale bar is in units of millions of years. The virus–host association can be found in Table S6. SFV: simian foamy virus; psc, *Pan troglodytes schweinfurthii* chimpanzee; pve, *Pan troglodytes verus* chimpanzee; ggo, *Gorilla gorilla* gorilla; *ppy*, *Pongo pygmaeus* orangutan; mcy, *Macaca cyclopis* macaque; cae, *Chlorocebus aethiops* Grivet; cja, *Callithrix jacchus* marmoset; axx, Ateles spider monkey; scc, *Saimiri sciureus* squirrel monkey; ocr, *Otolemur crassicaudatus* brown greater galago; BFVbta, bovine foamy virus Bos taurus; EFVeca, equine foamy virus Equus caballus; FFVfca, feline foamy virus Felis catus; PSFVaye, prosimian foamy virus aye-aye; EFV, endogenous foamy virus; SloEFV, sloth EFV; ChrEFV, Cape golden mole EFV; CoeEFV, Coelacanth EFV; NviFLERV-1, Notophthalmus viridescens foamy virus-like endogenous retrovirus - 1.

Evolutionary timescales of reptile EFVs were estimated using the time dependent rate phenomenon (TDRP) model [18,27]. The TDRP model is a model that describes the relationship between total per lineage substitutions (*s* estimates) and their associated evolutionary timescales (*t* estimates) using a power law function (*t* = α*s*β), and this relationship can be used to estimate a *t* value for an arbitrary node given its *s* (node height in the units of substitutions per site) [18,27]. We traced simian foamy virus Pan troglodytes verus (SFVpve) down the trees to obtain total per lineage substitutions across various timescales for the TDRP model estimation. Based on virus–host tree topology comparison, we inferred seven virus–host co-speciation events in the Pol phylogeny (Figure 3A, labeled with Roman numerals) and five in the Env tree (Figure 3B, labeled with Roman numerals) that lie along the SFVpve lineage, and the timescales of these nodes were inferred directly from those of their hosts (Table S5). Two TDRP models were estimated based on the *t* and the *s* estimates of these identified co-speciation nodes, one for the Pol protein [α = 364.08 (242.33–533.43), β = 1.59 (1.34–1.86), Adjusted R<sup>2</sup> = 0.95 (0.90–0.99)], which was comparable to the one previously reported {α = 407.89 (264.32–583.97), β = 1.63 (1.38–1.90), Adjusted *R*<sup>2</sup> = 0.95 (0.91–0.99) [4]}, and the other for the Env protein [α = 124.18 (100.42–152.36), β = 1.41 (1.21–1.60), Adjusted R2 = 0.96 (0.91–0.99)]. They were then used to extrapolate in order to calculate the timescales of other nodes based on their *s* estimates.

Analyses of the Pol protein sequences suggested that the ERV-Spuma-Spu lineage diverged 232.50 (173.70–303.33) myr ago (mya), which was comparable to that estimated based on the Env protein sequences, which was 257.15 (202.49–324.05) mya. These age estimates were, however, significantly lower than those of their hosts, which were estimated to be 324.7 (318–331.4) mya [28]. The age of the gecko EFV lineage was estimated to be 208.54 (171.59–250.91) myr old based on phylogenetic analysis of the Env protein sequences. This was consistent with their minimum age estimate of ~96 myr old. In addition, based on the Pol phylogeny, we also estimated the age of the amphibian EFV lineage and the entire clade of vertebrate FVs/EFVs to be 326.40 (229.14–448.65) myr old and 479.08 (298.31–718.86) myr old, respectively. These estimates were comparable to those previously reported {amphibian FVs: 348 (251–478) myr old and vertebrate FVs: 455 (304–684) myr old [4]} and those of their hosts (amphibians: ~335 myr old [29], and vertebrates: ~465 myr old [29]). Our results are thus consistent with those previously reported and support the long-term co-evolution between FVs and their hosts since the origin of vertebrates.

#### **4. Discussion**

This study reports two novel reptile EFVs that reside in the genomes of panther gecko (ERV-Spuma-Ppi) and Schlegel's Japanese gecko (ERV-Spuma-Gja) and further characterizes ERV-Spuma-Spu [8]. Together with ERV-Spuma-Spu [8], we analyzed the evolutionary history of reptilian FVs in detail, filling in the gap in our knowledge of the deep history of FV-host co-evolution.

ERV-Spuma-Ppi and ERV-Spuma-Gja are not full length ERVs, comprising only a full-length FV-like *env* gene, and are present in only a single copy in each genome. We showed that they are orthologous, being present in the same host genomic location in both species. Based on the speciation date of their gecko hosts, we inferred that these EFVs are at least 96 (83–98) myr old [25], making them the oldest EFVs ever discovered to date. The two gecko EFVs are located in the intronic region of the EMC1 gene in an antisense orientation typical of an old fixed intronic ERV; antisense integrations are favored because they are likely minimally disruptive to the host's gene transcription processes [30]. Remarkably, they do not contain any in-frame stop codons or transposable elements despite being almost 100 myr old and have been under strong purifying selection with a dN/dS estimate of 0.14 (0.08–0.54). Furthermore, by searching the transcriptomic sequence database of the common leopard gecko in the Reptilian Transcriptomes v2.0 Database [11], we discovered a transcript consensus sequence constructed from 499 reads that is highly similar to ERV-Spuma-Ppi and ERV-Spuma-Gja, exhibiting more than 80 percent identity both at the protein and the nucleotide levels. Combined, these findings suggest that they might be transcriptionally active and have been maintained for potential cellular functions, making them the first ever known co-opted EFVs. We, however, noted that the transcript we retrieved was distinct from the two gecko EFVs obtained from a different host species and mapped to only a small portion of the 3' end of the *env* genes. Additional analyses of transcriptomic data obtained directly from *Gekko japonicas* and *Paroedura picta* are thus required to confirm that the two gecko EFVs are transcriptionally active.

Retroviral *env* genes are known to have been co-opted many times by various vertebrate hosts for a wide range of functions. The most well-known one is perhaps the *syncytin* genes, which was captured for a function in placental formation numerous times by various mammals (see [31,32] for reviews). A recent study identified (for the first time) a functionally active *syncytin* gene outside mammals, namely *syncytin-Mab1*, in a lizard *Mabuya* [33,34]. The reptilian *syncytin* gene was identified as a gammaretrovirus *env* gene, however, which belongs to a different group to our two gecko EFVs. Furthermore, *Mabuya* is a viviparous placental lizard, while geckos are not. Thus, the functions of ERV-Spuma-Ppi and ERV-Spuma-Gja might differ from that of *syncytin-Mab1.*

The fact that the *env* reading frames are still intact and under purifying selection despite their old age suggests that ERV-Spuma-Ppi and ERV-Spuma-Gja are functional at the protein level. BLASTp analyses against the NCBI nr protein database failed to identify cellular proteins that are related to ERV-Spuma-Ppi and ERV-Spuma-Gja. Therefore, their functions still remain elusive. Nonetheless, the two gecko EFVs are inserted in the EMC1 gene, which intriguingly shows greatest expression level in placenta tissues in human [35]. One could thus imagine that the two genetic elements might be co-expressed and are in turn functionally associated, as has been repeatedly shown in several organisms [36–38]. Although geckos lay eggs and do not possess true placenta tissues such as *Mabuya* lizards, the physical association of EMC1 with co-opted retroviral *env* sequences suggests their possible involvement in the reproductive system of geckos, analogous to the *syncytin* gene in mammals [31,32] and viviparous placental lizards [33,34]. Furthermore, the EMC1 protein influences virus cross-membrane transportation and infectivity via direct physical contact with viral particles [39]. The two gecko EFVs might thus play an active role in the host immune systems as well, if their functions are indeed associated with those of EMC1. Indeed, studies have shown that retroviral *env* genes can be co-opted and exapted for host anti-viral defense, acting as restriction factors against related retroviruses [40,41]. Our finding warrants further functional investigation to confirm the potential involvement of ERV-Spuma-Ppi and ERV-Spuma-Gja in gecko reproductive and/or immune systems.

We were able to identify 165 ERV-Spuma-Spu elements in the tuatara genome. Examination of their consensus sequence revealed that ERV-Spuma-Spu only possessed the three main retroviral core genes, namely *gag*, *pol*, and *env*, flanked by two LTRs. A previous study identified one short hypothetical open reading frame (192 nt) as an accessory gene [8]; however, we found that the sequence was actually part of the 3'-LTR. On the other hand, we could not identify any accessory genes located between the *env* gene and the 3' LTR, which is typical for a mammalian FV [4,6,42]. The authors of the previous study also noted that the identified accessory gene did not exhibit similarity to any known foamy accessory genes. Further examination revealed that the hypothetical protein was not intact and did not exhibit significant similarity to any known molecular sequences in the NCBI nr database. In addition, we could not identify a potential internal promoter towards the 3' end of the *env* gene. It is thus possible that the previously identified accessory gene in ERV-Spuma-Spu might have been an artifact. At face value, the observed lack of accessory genes is suggestive of ancient gene losses in ancestral exogenous reptile FVs. Alternatively, it could be that the ancestral exogenous tuatara FVs did possess accessory genes but lost them after becoming endogenous. This observation is also consistent with multiple acquisitions of accessory genes in other FVs at the same genomic location, which is perhaps less parsimonious but nevertheless possible. Discovery of other reptilian FVs or EFVs will help elucidate this issue.

Furthermore, we found that the ERV-Spuma-Spu *gag* gene was markedly shorter than those of typical simian FVs. Protein sequence comparison showed that the putative ERV-Spuma-Spu Gag protein had a full matrix domain essential for Gag–Gag interaction [43,44], Gag–Env interaction [44], and Gag–microtubular network interaction [45]. The conserved central region, which is evolutionarily related to orthoretroviral capsid proteins [46], could also be found. The regions between the matrix and the capsid domain (aa ~180–~300 of the SFVpsc Gag protein) were missing, however, containing the late domain (P284SAP domain), which mediates viral particle release [47]. Nevertheless, this region was not highly conserved; indeed, non-primate FVs including bovine, equine, and feline FVs also lack this region and the late domain [48].

Moreover, a region homologous to the C-terminus of the nucleocapsid domain (corresponding to aa 549–648 in the SFVpsc Gag protein) containing part of the glycine/arginine rich box (GR) II and the entire GR-III was also missing from the putative ERV-Spuma-Spu Gag protein. GR-I, a nucleolar localization signal [49] that mediates nucleic acid binding [50] and is important for Pol packaging [51] and particle formation [52], could still be found. Sequence examination revealed that the chromatin-binding sequence (CBS) in GR-II (aa 534–546 in the SFVpsc Gag protein [53]) required for a direct physical contact between FV Gag protein and host nucleosomes [54] was still intact (Figure 4). The conserved tyrosine and arginine residues in the CBS (Y405 and R408) could also be found (Figure 4), and are essential for Gag chromosome binding and nuclear accumulation of Gag and genomic DNA [49,54,55]. The arginine-tyrosine-glycine (RYG) residues following the CBS (Figure 4) are crucial for nuclear accumulation of FV Gag and DNA, and, perhaps most importantly, DNA integration [55]. Mutagenesis of these residues causes significant reduction in all three activities, even if the CBS is complete [55]. Intriguingly, the RYG domain is absent from both ERV-Spuma-Spu and CoeEFV (Figure 4), which are incidentally the only two EFVs known to be present in high copy numbers in the host genomes [5,8]. In addition, the missing GR-III box was shown to be a nucleolar localization signal similar to GR-I [49]. Studies have shown that the deletion of GR-III only marginally affects viral budding [52,56], intracellular localization [52,57], reverse transcription [52], RNA packaging/binding [50,52,56,58], and particle morphology [52] but significantly reduces DNA packaging [52] and thus infectivity [52,56]. The lack of GR-III box and the RYG residues might help with the virus retrotransposition process by allowing their DNA to accumulate in the host cell and subsequently re-integrate into the host chromosomes in a steady and non-aggressive manner. As previously reported [8], a number of ERV-Spuma-Spu elements of different ages with paired-LTRs could be recovered. This means that at least some of the ERV-Spuma-Spu elements originated from the re-integration process and not via host genomic copying or LINE-mediated retrotransposition of viral mRNA, supporting our hypothesis. These observations might underlie the high copy number of ERV-Spuma-Spu and CoeEFV elements found in the tuatara [8] and the coelacanth genomes [5], respectively.

**Figure 4.** An alignment of chromatin-binding sequences (CBS) and surrounding regions. CBSs of CoeEFV and ERV-Spuma-Spu are intact. The conserved tyrosine (Y; 4th column) and arginine (R; 7th column) residues in the CBS, which are essential for Gag mitotic chromosome binding and nuclear accumulation of Gag and genomic DNA [49,54,55], could be found in all viruses. The arginine-tyrosine-glycine (RYG) residues, which are also important for nuclear accumulation of FV Gag and DNA [55], were absent from CoeEFV and ERV-Spuma-Spu but could be found in all mammalian FVs.

Based on their observation that ERV-Spuma-Spu is more closely related to mammalian FVs than CoeEFV, Wei et al. proposed an ancient co-speciation of ERV-Spuma-Spu and mammalian FVs dating back more than 320 million years ago [8]. This study, on the other hand, estimated the dates directly based on molecular analyses of Pol and Env protein sequences and the more well-established history of mammalian FV-host co-speciation. Our evolutionary analyses of Pol and Env proteins showed that the

ERV-Spuma-Spu lineage is only 232.50 (173.70–303.33) and 257.15 (202.49–324.05) myr old, respectively, comparable to one another in age. These estimates are much lower than those of their hosts, ~324.7 myr old [28]. This finding rejects the ancient co-speciation hypothesis previously proposed and instead suggests that the ancestral virus that gave rise to ERV-Spuma-Spu elements arose from (potentially a series of) cross species transmission(s) from an unknown, non-reptilian host. This result also highlights the pitfall of using tree topologies alone to infer a virus–host co-speciation history, especially when there are only a few lineages in the investigation.

Our phylogenetic analyses of Env proteins revealed that, while ERV-Spuma-Ppi and ERV-Spuma-Gja form a clade, they do not form a monophyletic clade with ERV-Spuma-Spu elements, with the two gecko EFVs being closer to mammalian FVs than ERV-Spuma-Spu (Bayesian posterior probability clade support = 1.00). We estimated the age of the gecko EFV lineage to be 208.54 (171.59–250.91) myr old. Again, this low age estimate is suggestive of a cross-species transmission origin for the gecko EFVs' ancestor, one that is broadly contemporary but independent from the transmission that gave rise to the ancestor of ERV-Spuma-Spu elements.

The phylogenetic placement of CoeEFV with respect to that of the ERV-Spuma-Spu lineage also suggests an evolutionary history of cross-species transmission. While the Pol protein of CoeEFV exhibits a sister relationship with that of ERV-Spuma-Spu viruses (Figure 3A), its Env protein does not, instead being basal to the clade of mammalian and reptile FVs (Figure 3B). Since CoeEFV forms a clade with ERV-Spuma-Spu viruses in the Pol phylogeny, their branching dates from mammalian FVs are hence the same, estimated to be 232.50 (173.70–303.33) mya. This is indeed comparable to the previously reported estimate of 262.76 (195.00–342.08) mya [4] and further supports the complex evolutionary history of FVs that might have transmitted several times between terrestrial and aquatic animals in the distant past [4].

On the other hand, we estimated CoeEFV's Env protein (and thus its *env* gene) to share a most recent common ancestor with that of mammalian FVs 332.68 (236.13–451.42) mya. This age estimate is drastically older than that of the *pol* gene, suggesting that CoeEFV's *pol* and *env* genes might have different evolutionary histories. We note however that this age estimate is conditioned on how the Env tree is rooted. In this study, we chose the mid-point rooting method, which placed NviFLERV-1 (the amphibian EFV) as the most basal lineage in the Env tree (Figure 3B), consistent with the topology of the Pol tree (Figure 3A). However, if the *pol* and the *env* genes can have different evolutionary histories, then there would be no intrinsic reasons for the Pol and the Env tree topologies to closely resemble each other. Another possibility is to subjectively place CoeEFV as the most basal linage in the Env tree, in which case the estimated date would be the divergence date of NviFLERV-1 instead, which in turn would make it a lower bound estimate for the branching date of CoeEFV's *env* gene. This nonetheless still supports the hypothesis that CoeEFV's *pol* and *env* genes have different evolutionary histories.

This finding mirrors observations from primate [59] and feline FVs [60,61]. Studies at the population level identified the surface domain of their *env* genes to have evolutionary histories that are strikingly different from the rest of the *env* gene and the *pol* gene [59–62], segregating into two variants that co-circulate in the same host populations while other genomic regions are not phylogenetically distinguishable. This domain carries the receptor binding domain and is targeted by neutralizing antibodies [60,63], which may help explain its greater diversity. Our analyses could not detect this evolutionary feature, since our dataset comprises only one sequence from each FV species. Nonetheless, it is remarkable that a similar evolutionary pattern could still be observed at the species level focusing on different timescales. Our results thus further support the modular nature of FV genomes and that this might be a widespread evolutionary feature of FVs.

Our analyses reveal a complex evolutionary history and ancient transmission routes of ancient FVs, likely involving host switches across the boundary between water and land, as well as the modular nature of their genomes. Our work also highlights the importance and the value of recombination analysis and temporal information in evolutionary inference as well as the pitfalls of tree-topology based virus–host co-speciation analysis. Discovery of additional EFVs will undoubtedly further our understanding and improve our knowledge of the complex and rich natural history of FVs.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1999-4915/11/7/641/s1, Figure S1: ERV-Spuma-Spu consensus sequence, Figure S2: Recombination detection in *pol-env* alignment, first round, Figure S3: Recombination detection in *pol-env* alignment, second round, Table S1: tBLASTn using CoeEFV Pol as probe against the tuatara genome, Table S2: Merged Pol hits and reciprocal BLASTx results, Table S3: tBLASTn using CoeEFV Env as probe against the extended Pol hits, Table S4: foamy virus-like endogenous viral elements in other reptiles, Table S5: host evolutionary timescales, Table S6: foamy virus-host association, Data S1: alignment of ERV-Spuma-Spu sequences – virus body portion, Data S2: alignment of ERV-Spuma-Spu sequences – long terminal repeat portion, Data S3: ERV-Spuma-Spu consensus sequence, Data S4: *pol*-*env* alignment, Data S5: Env alignment, Data S6 Pol alignment.

**Author Contributions:** P.A. and A.K. conceived the project. P.A. and A.K. researched the data for the article. P.A., P.S., and A.K. substantially contributed to discussion of content. P.A. performed analyses and wrote the article. P.A., P.S., and A.K. reviewed and edited the manuscript before submission.

**Acknowledgments:** This work is funded by Wellcome Trust.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Review* **Simian Foamy Virus Co-Infections**

#### **Shannon M. Murray \* and Maxine L. Linial**

Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA; mlinial@fredhutch.org

**\*** Correspondence: smurray@fredhutch.org

Received: 8 July 2019; Accepted: 21 September 2019; Published: 27 September 2019

**Abstract:** Foamy viruses (FVs), also known as spumaretroviruses, are complex retroviruses that are seemingly nonpathogenic in natural hosts. In natural hosts, which include felines, bovines, and nonhuman primates (NHPs), a large percentage of adults are infected with FVs. For this reason, the effect of FVs on infections with other viruses (co-infections) cannot be easily studied in natural populations. Most of what is known about interactions between FVs and other viruses is based on studies of NHPs in artificial settings such as research facilities. In these settings, there is some indication that FVs can exacerbate infections with lentiviruses such as simian immunodeficiency virus (SIV). Nonhuman primate (NHP) simian FVs (SFVs) have been shown to infect people without any apparent pathogenicity. Humans zoonotically infected with simian foamy virus (SFV) are often co-infected with other viruses. Thus, it is important to know whether SFV co-infections affect human disease.

**Keywords:** foamy virus; spumaretrovirus; co-infections; NHP; pathogenesis; zoonoses

#### **1. Introduction**

Foamy viruses (FVs) comprise the *Spumaretrovirinae* subfamily of the family *Retroviridae* and are also designated spumaretroviruses [1]. FVs are ancient complex retroviruses that have co-evolved with their nonhuman primate (NHP) hosts for at least 60 million years [2]. Interestingly, recent sequencing of an ancient marine fish, the coelacanth, revealed an endogenous foamy virus (FV) [3]. The coelacanth is an ancient marine four-lobed fish believed to be the organism that first became terrestrial and is the ancestor of all terrestrial organisms. This indicates that FVs have existed for an estimated 400 million years [3], making *Spumaretrovirinae* the oldest known extant vertebrate virus subfamily.

FVs are apparently nonpathogenic in their natural hosts, which include NHPs (reviewed in [4]), felines [5], bovines [6], and equines [7]. FVs have also been found in bats, although the physiological consequences were not stated [8]. In non-primate hosts, FV prevalence is reported to be between ca. 30–70% in adults, depending on age, host, and location (reviewed in [9]). FVs were first identified by their cytopathicity in tissue culture cells ([10–12], reviewed in [13]). Thus, there have been many efforts to determine whether these viruses are pathogenic in vivo. To date, there have been no reports demonstrating clear-cut pathogenicity in natural or accidental hosts (reviewed in [14]). However, in research settings, there is some evidence that FVs can exacerbate the pathogenesis of other viruses. This phenomenon cannot be well studied in natural infections, as finding FV uninfected adult animals is difficult.

Simian FVs (SFVs) replicate primarily in tissues of the oral mucosa [15,16], and transmission most likely occurs through transfer of saliva from one individual to the next. Often, infected saliva transfer occurs through grooming, biting, or sharing food (reviewed in [4]). It is also thought that SFVs from saliva enters the blood or oral cavity of the recipient [17]. In NHP blood transfusion studies, SFV can be transmitted through blood [17,18]. However, it is not known whether this occurs in natural settings. In natural FV-infected felines, FV DNA has been detected in buccal swabs and in the blood [19].

While humans have contacts with felines and bovines, there is no evidence for actual infection of humans with FVs from these species (reviewed in [14]). Zoonotic infections with SFVs are frequent among animal caretakers, zoo keepers, bushmeat hunters, and others in direct contact with NHPs [20–23]. For example, ca. 2–5% of individuals in North America who report contact with NHPs are FV-infected, as determined by SFV polymerase chain reaction (PCR) (reviewed in [14]). There is no evidence for human-to-human transmission. The underlying reason(s) as to why the virus has not adapted to humans, but to all other primate species, is unknown.

Interestingly, SFV replication takes place in the most terminally differentiated superficial epithelial cells of the oral mucosa—those about to slough off into saliva [15]. The lack of pathogenicity of FV infections may be a result of this replicative niche in a cell type that turns over rapidly and is relatively dispensable to the host [15]. If this cellular niche for SFV replication is altered as described below for experimental simian immunodeficiency virus (SIV) infections, there may be potential for pathogenic effects.

#### **2. FVs and the Virome**

Each organism is host to a multitude of microbes, known as the microbiome. Organisms are also host to a collection of viruses, known as the virome (reviewed in [24]). Since, in natural species, FVs infect the majority of adults, FVs are considered part of the virome. It is well established that the virome plays a role in health and disease and that co-infecting viruses can affect each other and the microbiome [24]. In this review, we will consider the effect of FVs on infections by other viruses, in some cases, pathogens, and we will also consider the effects of other viruses upon FV infections, summarized in Table 1.


**Table 1.** Foamy virus (FV) and retrovirus co-infections: their effects on the host.

#### **3. SFV Zoonotic Infections**

There is evidence for zoonotic transmission of SFVs to humans (reviewed in [14]). Many of the zoonotically infected people are those with direct contact with NHPs, including bushmeat hunters in Africa as well as zoo and laboratory workers [20–23,29]. People whose interactions with NHPs are less direct but who are SFV-infected have been found in Asia [30]. There are small numbers of individuals identified who are infected with both SFV and human immunodeficiency virus (HIV) [31,32]. Moreover, as there is interest in using FV as a vector for gene therapy [33,34], the questions surrounding FV co-infections are becoming more relevant.

#### **4. SFV and Other Retrovirus Co-Infections**

SFVs are endemic in all NHP species examined to date in Africa, Asia, and the Americas [13,35]. In Africa, there are at least three other endemic complex retroviruses that infect NHPs. These are SIV, simian T-cell lymphotropic virus-1 (STLV-1), and simian retrovirus type D (SRV D). SIV had been thought to be nonpathogenic in its natural African monkey hosts [36], but there are reports of an acquired immunodeficiency syndrome (AIDS)-like illness in some SIV-infected natural hosts [37,38]. Additionally, when SIV infects other NHP species, such as chimpanzees and gorillas, it can be pathogenic in some cases [39–41].

STLV-1 has not been as extensively studied in its natural NHP hosts as human T-cell lymphotropic virus-1 (HTLV-1) has been in humans. HTLV-1 is pathogenic in only a small fraction of infected humans (reviewed in [42]), and STLV-1 infection in NHPs has been associated with a number of T cell abnormalities, including lymphomas and leukemias (reviewed in [43]). Many Asian NHPs are infected with another retrovirus, SRV D [44,45]. Thus, it is likely that many macaques are co-infected with SRV D and SFV. In NHPs, SFV as well as cytomegalovirus (CMV), a herpesvirus, are endemic and therefore many adult animals are co-infected with both viruses [46]. However, whether these CMV or SRV D co-infections alter animal health has not been reported.

#### *4.1. SFV and SIV Co-Infections*

Researchers have used the rhesus macaque (RM), also known as *Macaca mulatta*, as a model for HIV infections. In the wild, SIV is not known to exist in Asia [47,48], the natural habitat of RM. SIV sooty mangabey (SIVsm) has been found to be pathogenic in RM [49] and has become the virus used to infect RM to recapitulate HIV infections. SFV infection is latent in most tissues, including the blood. SFV replication is only detected in oropharyngeal tissues [16,25]. A key question concerning FV–host interactions is whether FV replication is controlled by the host immune system. To address this question, RM infected with pathogenic SIV strains (SIVmac239 and SIVmac155T3) that lead to CD4+ T cell depletion were studied, and SFV replication was assessed in the blood and other tissues [25]. In SIV/SFV co-infected RM, SFV replication was expanded to include the small intestine, the jejunum [25], which is a site of SIV-induced CD4+ T cell depletion [50]. However, other tissues that were CD4+ T cell depleted were not permissive for SFV replication [25]. Thus, it is not only CD4+ T cell depletion per se that is responsible for the expansion of SFV replication to the jejunum. Overall, this indicates that the host immune system is not limiting systemic SFV replication. It is important to note that the strains of SIV used in this study were lab-adapted, highly pathogenic strains that are HIV infection models in NHPs and are unlike SIV strains in natural hosts, which are usually poorly pathogenic.

It is also possible that SFV can affect the pathogenicity of other viruses, such as SIV. In natural settings, most adult NHPs are naturally infected with SFV. Therefore, this issue was addressed using RM individually housed in primate center facilities. The researchers examined SFV− and SFV+ RM infected with a pathogenic SIV strain (SIVmac239) [26]. The SFV+ animals had increased SIV viral loads in the blood and died more rapidly than SFV− RM. Thus, SFV infection does exacerbate SIV pathogenesis in experimentally SIV-infected RM.

The mechanism by which SFV increases SIV pathogenesis is unknown. One tissue target of SIV pathogenesis in RM is the jejunum [50]. Whether SFV replication in the jejunum contributes to this pathogenicity is an outstanding question. Because the SIV RM model commonly uses SFV+ animals, the RM model might not totally recapitulate HIV pathogenesis in people who are SFV−. There is evidence of HIV/SFV co-infections in humans [29,32]. Bushmeat hunters often get bitten by NHPs and become SFV-infected [29]. The number of bushmeat hunters co-infected with SFV and HIV is small. It would be of interest to compare HIV infections of bushmeat hunters who are SFV-infected or uninfected.

Free-living chimpanzees (cpz) in multiple countries of Africa, such as Cameroon and Gabon, have been analyzed for infection with SFVcpz and SIVcpz [51]. The researchers found that 15/70 (~21%) of SFV-infected animals were co-infected with SIV. This indicates that co-infection of these two viruses is common in chimpanzees. In another study, in the monkey the Ugandan red colobus [52], SFV/SIV co-infections were also common, with 23% of individuals co-infected. The routes of transmission of these two viruses are apparently different, with SFV transmission primarily through saliva. In the case of SIV, both sexual transmission and aggressive behavior have been implicated (reviewed in [53]). The effects of these two viruses on the health of free-living NHPs was not noted in either study, but this would be of great interest because these are clear examples of natural co-infections.

#### *4.2. SFV and STLV-1 Co-Infections*

A cohort of naturally SFV-infected baboons were studied. Some of the baboons were also naturally infected with STLV-1, while others were not, with 18 baboons studied per group. It was found that SFV DNA levels in the blood, but not in the saliva, were increased in the STLV-1-infected baboons [27]. It is not known why there is increased SFV DNA in the blood after STLV-1 infection. Two possibilities are (1) increased integrated SFV proviruses, or (2) increased numbers of SFV viral particles in the blood. The authors show that, in tissue culture cells, the related HTLV-1 transcriptional activator protein (Tax) can stimulate the FV long terminal repeat (LTR). They suggest that this could be a mechanism that explains the in vivo results [27]. However, it is also possible that since STLV-1 increases the number of lymphoid cells in the blood, this could result in more SFV in the blood as well. Overall, STLV-1 infection is unlikely to alter SFV transmission, since SFV transmission is mostly through saliva rather than blood. It is not known whether SFV infection alters the pathogenicity of STLV-1, since all of the STLV-1-infected animals were also SFV-infected. There is also concern about SFV/HTLV-1 co-infection in humans [54], as discussed further below.

#### *4.3. SFV and SRV D Co-Infections*

SRV D infections in Asian primates such as macaques sometimes causes an immunodeficiency-like syndrome [44,45]. SRV D appears to be transmitted through saliva as well as urine and feces [55]. There is a report of one worker who is seropositive for both SFV and SRV D [56]. It should be noted that this individual was both seropositive and PCR positive for SFV, but could not be shown to be PCR positive for SRV D. Thus, this person may have been exposed to both viruses, producing antibodies, but may not be actively SRV D infected. There is no evidence that this SRV D/SFV seropositive human has any retroviral-related pathogenesis.

#### **5. SFV Zoonotic Co-Infections and Hematological Changes**

One study identified HTLV-1 and SFV co-infected hunters in Central Africa [54]. Fifty-six percent of the hunters bitten by NHPs and infected with HTLV-1 were also infected with SFV, suggesting that these two viruses may have been co-transmitted via the bites [54]. However, the pathogenic effects of the co-infection were not evaluated.

There is a publication presenting evidence that SFV-infected bushmeat hunters from Cameron seem to have different levels of hematological markers, including hemoglobin, creatine phosphokinase, and bilirubin, than SFV-uninfected bushmeat hunters [57]. The SFV-infected group had a higher incidence of HTLV-1 infections, and all participants except for one were infected with hepatitis B virus (HBV). Thus, it is hard to ascribe the differences in hematological markers to SFV infection alone. These differences could result from SFV co-infection with either HTLV-1 or HBV. In fact, as described in baboons infected with STLV-1 [27], FV DNA levels were increased in the blood of co-infected animals. Another publication reports no hematological differences in SFV-infected North American research and zoo workers occupationally exposed to NHPs [58]. Since these individuals were unlikely to be infected with HTLV-1 and/or HBV, this supports the notion that hematological changes from SFV infection are a result of co-infections. In natural hosts, there is no evidence that SFV alone leads to any hematological changes. However, hematological changes in natural hosts infected with FV have not been thoroughly investigated.

#### **6. Non-Primate FV Co-Infections**

Most work on FV transmission has been done in NHPs. This has served as the general model for FV transmission in other natural species. While the transmission route for non-primate FVs has been speculated to include saliva, other transmission modes seem to occur. For example, bovine foamy virus (BFV) has been detected in bovine breast milk [59]. In fact, BFV has been shown to be transmitted from mothers' milk to calves [60].

#### *6.1. BFV and Herpesvirus Co-Infections*

A serious illness of unknown origin in dairy cows is non-responsive post-partum metritis (NPPM). It is thought to be caused by a virus, as it is unresponsive to antibiotics. There is some evidence that bovine gammaherpesviruses (BoHVs) are associated with this disease (M. Materniak-Kornas, personal communication, 2019). Since many cows are infected by BFV [60], it is possible that co-infection of BFV and BoHV-4 or BoHV-6 is a factor in the disease. However, the researchers found that is not the case, since as many healthy cows as sick cows were infected with BFV (M. Materniak-Kornas, personal communication, 2019).

#### *6.2. FFV and Other Retrovirus Co-Infections in Cats*

Feline leukemia virus (FeLV), a retrovirus of cats, often leads to fatal diseases, including lymphomas [61]. One study examined FeLV and FFV co-infections in domestic cats [28]. Natural FeLV infections can have two outcomes; these are progressive infections leading to disease, or regressive infections. Regressive infections are characterized by a transient production of viral structural proteins and a persistence of low levels of integrated viral DNA. Cats that are progressively infected have detectable FeLV RNA and infectious virions. It was found that higher levels of FFV DNA in the blood correlated with increased FeLV viremia and disease progression [28]. In another feline study, the researchers examined FeLV regressive versus progressive infections [19]. In cats with regressive infections, there were lower levels of FFV DNA in the oral cavity. This suggests the possibility that innate or adaptive immunity to FeLV in the regressors could affect FFV as well.

There are cats co-infected with FFV and feline immunodeficiency virus (FIV), a lentivirus causing decreases in CD4+ T cells and feline acquired immune deficiency syndrome (FAIDS) [62]. In one study, a small number of cats were experimentally infected with FIV alone or with FIV and FFV. In this study, FFV co-infection did not enhance the pathogenesis of FIV [63]. This is in contrast with what was seen in RM infected with SFV and SIV, as described above [26]. However, in a study of FFV experimental infection, FFV RNA was rarely seen in saliva samples (in only 1 of 80 samples) [64]. Thus, the FIV/FFV co-infection study might be flawed, since it is possible that FFV experimental infection does not mirror natural FFV infections. More studies are needed to assay viral RNA in tissues of naturally FV-infected non-primates, including cats.

#### **7. Conclusions**

In natural hosts, FVs are highly prevalent and therefore finding uninfected adult animals is difficult. Thus, all information about the effect of FVs on other viral infections in these species involves artificial situations in which some animals are kept FV-free. Most of the studies on FV effects on other viral infections use lab-adapted pathogenic viruses. In research laboratory settings, there is clear indication that FV infections exacerbate lentiviral outcomes. There is also evidence that SFV has an expanded tissue tropism in lentiviral co-infected NHPs. The prevalence of FV co-infected individuals has been described in natural hosts. However, these studies did not report whether FV co-infection affects pathogenesis induced by other viruses. Therefore, it is not known whether the results from lab settings are relevant to what occurs in natural settings. The mechanism of how FV infections may affect other viral infections is not clear.

Humans can be zoonotically infected with SFVs from NHPs. While it has been reported that there are people co-infected with SFV and HIV or HTLV-1, there are almost no data about whether SFV infection of humans can exacerbate other viral infections. Overall, more studies in natural settings and of human zoonotic FV co-infections would be important to understand whether FVs contribute to the pathogenicity of other microorganisms. One example of a FV pathogenic effect on hosts is in RM, using lab-adapted strains of a lentivirus, SIV. In the future, it will be important to see whether HIV pathogenesis is worse in humans co-infected with SFV. It is also possible that in humans, FV infections could exacerbate other viral infections such as HTLV-1, HBV, or even CMV or other herpesviruses. The understanding of how SFVs affect other human viruses is important, given the potential use of FV vectors for gene therapy to treat human diseases.

**Author Contributions:** S.M.M. and M.L.L. both contributed to the writing of this review article.

**Acknowledgments:** We would like to acknowledge Delia Pinto-Santini for helpful comments on the manuscript and Magdalena Materniak-Kornas for sharing unpublished data. We also would like to thank the Fred Hutchinson Cancer Research C for financial support.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the writing of the manuscript, or in the decision to publish the article.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*
