3.1. Detection of Viral Sequences in RNAseq Datasets of South American Long-Haired Mice and Olive Field Mice
In order to expand our knowledge on morbillivirus diversity, I assessed an essential resource of public high-throughput sequencing RNA data available at NCBI: the Transcriptome Shotgun Assembly (TSA) Database available at
https://www.ncbi.nlm.nih.gov/Traces/wgs/?view=TSA (accessed on 20 July 2022) including 866 RNAseq datasets on diverse vertebrates. In tBLASTN searches against Vertebrata (taxid:7742) using as query the N protein of MeV (NP_056918.1), I retrieved a significant hit (E-value = 1 × 10^−142, 52.82% identity) corresponding to a 1236 nt transcript from a TSA (GenBank: GCHM00000000.1) of a renal gene expression dataset of the South American long-haired mouse,
Abrothrix hirta, collected in natural populations (BioProject PRJNA256304 [11]).
A. hirta is a sigmodontine rodent widely distributed in southern South America, occurring on both sides of the Andes, from the Chilean Region of Maule (ca. 35° S) to the Argentinean Tierra del Fuego (ca. 52° S). This specific library corresponded to half of a kidney sample from an adult male (specimen PPA528) captured using Sherman live traps three kilometers west of Lago Pueyrredón in Río Chico department, Santa Cruz province, Argentina (47.42105° S, 71.958233° W), during field trips conducted in the fall season (April 2011 and March 2012) (
Supplementary Figure S1).
Further inspection by BLASTP searches (E-value < 1 × 10^−5) of the GCHM00000000.1 TSA library using MeV proteins as query retrieved four additional transcripts, ranging from 462 to 1922 nt, that showed significant hits (E-value 3.43 × 10^−30 to 5.52 × 10^−134, identity 34.6% to 56%) to the MeV-encoded P, M, and F proteins. The tentative virus contigs were curated by iterative mapping of the corresponding 76,503,290 library reads (NCBI-SRA: SRX663121) using Bowtie2 with standard parameters. The transcripts, extended, overlapped, and polished by iterative cycles of raw-read mapping, were subsequently reassembled into a 9941 nt-long virus sequence including a continuum from a partial 3′ leader sequence followed by N-P/V/C-M-F-hp-H and a few short partial sequences of L (1720 nt long), with a mean coverage of 14.8x obtained with 1779 virus-derived 85 × 2 nt-long reads.
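The mapping step described above can be sketched as a pair of Bowtie2 invocations. This is only an illustrative sketch: the file names, index prefix, thread count, and the use of `--no-unal` are assumptions and not taken from the study; only the use of Bowtie2 with standard parameters (and, for other libraries, the `--very-sensitive-local` preset) comes from the text.

```python
import shlex
from typing import List, Optional


def bowtie2_commands(
    reference_fasta: str,
    reads_1: str,
    reads_2: str,
    index_prefix: str = "virus_idx",   # hypothetical index name
    out_sam: str = "mapped.sam",       # hypothetical output file
    preset: Optional[str] = None,      # e.g. "--very-sensitive-local"
    threads: int = 4,                  # assumed, not stated in the text
) -> List[List[str]]:
    """Build the command lines for indexing a contig set and mapping paired reads."""
    # bowtie2-build takes the reference FASTA and the index prefix.
    build = ["bowtie2-build", reference_fasta, index_prefix]
    # --no-unal drops unaligned reads so only virus-derived reads are kept.
    align = [
        "bowtie2", "-p", str(threads), "--no-unal",
        "-x", index_prefix, "-1", reads_1, "-2", reads_2, "-S", out_sam,
    ]
    if preset:
        align.insert(1, preset)
    return [build, align]


# Print the commands for inspection; pass each list to subprocess.run to execute.
for cmd in bowtie2_commands("contigs.fasta", "reads_1.fastq", "reads_2.fastq"):
    print(shlex.join(cmd))
```

In an iterative scheme, the mapped reads would be used to extend and polish the contigs, and the two commands rerun against the updated consensus until no further extension is obtained.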
As this BioProject sampled a total of 16 adult South American long-haired mice from Chile and Argentina, I mapped the reads of the 15 additional libraries to the assembled virus sequence with Bowtie2; virus reads (82 and 62 total reads) were found in two of them. The samples where virus reads were detected (v+) were specimen PPA357 (SRA: SRX663109) and specimen GD1454 (SRA: SRX663075). PPA357 was a female from the very same location (47.42105° S, 71.958233° W), and GD1454 was a male collected 65 km to the west in Aysén, Chile (47.49671666° S, 72.80861666° W). While the number of reads was certainly low, inspection and mapping of virus reads from these additional libraries to the PPA528 consensus revealed some fixed SNPs among libraries (Supplementary Figure S2), suggesting (i) that the reads were evidence of apparently three distinct virus isolates, and (ii) that these few reads did not correspond to spillover from the PPA528 sample or to contamination artifacts from index-hopping during library processing.
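The distance between capture sites can be checked from the reported coordinates. The following is a minimal sketch using the haversine formula on a spherical Earth; the radius of 6371 km and the function name are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumed)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Southern latitudes and western longitudes are written as negative values.
PPA528_SITE = (-47.42105, -71.958233)
GD1454_SITE = (-47.49671666, -72.80861666)

distance = haversine_km(*PPA528_SITE, *GD1454_SITE)
print(f"PPA528/PPA357 site to GD1454 site: {distance:.1f} km")
```

Under these assumptions the result is close to the 65 km stated in the text.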
In order to expand the survey of this virus to additional hosts, all available RNAseq datasets of rodent subfamily
Sigmodontinae (
Cricetidae), including New World rats and mice with at least 376 species, were retrieved from NCBI. Of the 79 additional publicly available transcriptome datasets of mice, including members of the
Sigmodon,
Oligoryzomys, and
Abrothrix genera (
Supplementary Table S1), virus reads were detected by mapping using Bowtie2 in two libraries of
Abrothrix olivacea (
Supplementary Figure S1). The olive field mouse (
A. olivacea) is the rodent with the broadest geographic distribution in southern South America. It ranges from the northernmost region of Chile (ca. 18° S) to central-western Argentina (ca. 35° S) and towards Patagonia, where it reaches the south of Tierra del Fuego (ca. 56° S). In elevation, it is found from sea level up to 2500 m [12]. Regarding viruses and abrotrichine rodents, to my knowledge, there are no reports oriented to the detection or characterization of viruses linked to these mice. It is worth mentioning, however, the detection of Andes hantavirus-reactive antibodies in
A. olivacea specimens from southern Chile [13], and that experimental conditions have indicated that the olive field mouse is susceptible to hantavirus infection [14].
The two v+ libraries corresponded to kidney samples. The first was from an adult male (specimen PPA444, SRA: SRX4099316) also captured in the Río Chico department, Santa Cruz province, Argentina (49.42105° S, 69.958233° W), 265 km southeast of the location where PPA528 and PPA357 were collected. The second, also from an adult male (specimen GD1411, SRA: SRX4099309), was captured 870 km to the northeast, in Fundo San Martín, Región de Los Ríos, Chile (39.649233° S, 73.19255° W); both GD1411 and PPA444 were collected in the same study (BioProject PRJNA471316 [15]). With iterative cycles of relaxed mapping (Bowtie2 preset --very-sensitive-local) of SRX4099309 raw reads, extension, and subsequent reassembly, a 9948 nt-long virus sequence was obtained from the GD1411 sample, including a continuum from a partial 3′ leader sequence to N-P/V/C-M-F-hp-H and a few short partial sequences of L (2736 nt long), with a robust mean coverage of 128x obtained with 12,630 virus-derived 101 × 2 nt-long reads. Notably, applying the same pipeline to the SRX4099316 library of sample PPA444 yielded a coding-complete virus sequence of 16,568 nt with the genome architecture 3′-N-P/V/C-M-F-hp-H-L-5′, supported by a 17.7x mean coverage from 2897 virus-derived 101 × 2 nt-long reads.
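The reported mean coverage values can be approximated from the read counts, read lengths, and assembly lengths given in the text. This sketch assumes every virus-derived read contributes its full length to the assembly, which ignores soft clipping and trimming; the function name is illustrative.

```python
def mean_coverage(n_reads: int, read_length: int, assembly_length: int) -> float:
    """Approximate mean depth as total mapped bases divided by assembly length."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return n_reads * read_length / assembly_length


# (virus-derived reads, read length in nt, assembly length in nt) as given in the text
assemblies = {
    "PPA528": (1779, 85, 9941),
    "GD1411": (12630, 101, 9948),
    "PPA444": (2897, 101, 16568),
}

for sample, (n_reads, read_len, asm_len) in assemblies.items():
    print(f"{sample}: ~{mean_coverage(n_reads, read_len, asm_len):.1f}x")
```

For GD1411 and PPA444 the estimate agrees with the reported 128x and 17.7x. For PPA528 it comes out slightly above the reported 14.8x, which would be expected if some reads were only partially aligned.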
A rapid comparison based on sequence alignments of the three consensus virus sequences (Supplementary Figure S3) indicated that, while divergent, the identity of predicted proteins ranged between 88.2 and 99.8%, suggesting that the sequences corresponded to three distinct strains of the same virus, which I tentatively dubbed Ratón oliváceo morbillivirus (RoMV). In detail, the viruses assembled from the PPA444 and PPA528 samples are highly similar, with their ORFs and predicted proteins sharing over 99% sequence identity. In contrast, the GD1411 virus is clearly more divergent, sharing a lower 85–89% nt identity and 88–98% aa identity of the predicted proteins, with the P and H proteins being the most distinctive and N and M the most similar; the lower identity at the nt level than at the aa level indicates a predominance of synonymous mutations in the GD1411 virus. It is worth emphasizing that the GD1411 mouse was collected on the other side of the Andes mountain range, over 850 km and more than 1100 km northeast of the places where PPA528 and PPA444 were captured, respectively, suggesting that geographical isolation could provide some clues to the evolutionary history of these viruses. The significant diversity revealed by these three mouse viruses could indicate a long-lasting virus–host relationship between RoMV and abrotrichine rodents. The consensus assembly from sample PPA444, which comprised a coding-complete and (near) complete genome, was used as reference for structural and functional annotation, genomic comparison, and evolutionary insights into RoMV.
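Percent identity between two aligned sequences, as used in the strain comparison above, can be computed in several ways. The sketch below adopts one common convention as an assumption: columns containing a gap in either sequence are excluded from the denominator. The two toy sequences are hypothetical and are not RoMV data.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are ignored under this convention
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned peptide fragments, for illustration only.
toy_a = "MATLLRSL-A"
toy_b = "MSTLLKSLGA"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

Applying the same function to the nucleotide alignment of an ORF and to the alignment of its translated protein gives the nt and aa identities that are contrasted in the text.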
3.2. Characterization of a Novel Virus by In Silico Analysis
The genome of the tentatively named Ratón oliváceo morbillivirus is a ≈16,568 nt-long negative-sense single-stranded RNA containing six main ORFs in the anti-genome, positive-sense orientation. In addition, the second ORF includes a transcription unit with RNA editing and an overlapping ORF (P/V/C), and between the F and H genes there is an additional accessory ORF. In sum, the genomic architecture of RoMV is 3′-N-P/V/C-M-F-hp-H-L-5′ (
Figure 1A). As expected for paramyxoviruses, the genes are separated by intergenic gene junction regions, composed of the polyadenylation signal of the preceding gene, a short intergenic region, and the transcriptional start of the following gene [
2]. The detected consensus gene junction region of the RoMV is consistent with morbilliviruses that have a conserved intergenic motif (CUU) between the gene-end and gene-start of adjacent genes following the structure “AAAA-CUU-AGG” (
Table 1).
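Candidate gene junctions following the “AAAA-CUU-AGG” structure can be located with a simple pattern search on the positive-sense (antigenome) sequence. This is a minimal sketch: the strict pattern below is an assumption and would miss junctions that deviate from the consensus, and the toy sequence is hypothetical.

```python
import re
from typing import List, Tuple

# Four or more A residues (polyadenylation signal), the CUU intergenic
# trinucleotide, then the AGG gene start. U or T is accepted so that both
# RNA and cDNA sequences can be scanned.
JUNCTION_PATTERN = re.compile(r"A{4,}C[UT]{2}AGG")


def find_gene_junctions(antigenome: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for each consensus junction found."""
    sequence = antigenome.upper()
    return [(m.start(), m.group()) for m in JUNCTION_PATTERN.finditer(sequence)]


# Hypothetical toy sequence with a single junction, for illustration only.
toy_sequence = "GGCUAUAAAACUUAGGAUCCAAAGAUGGC"
print(find_gene_junctions(toy_sequence))
```

Each hit marks the boundary between the gene-end of one transcription unit and the gene-start of the next, so the spacing between consecutive hits gives a rough outline of the transcription units.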
BLASTP searches of predicted products (
Table 2) tentatively identified these ORFs as potentially encoding: a nucleocapsid protein (N; 513 aa), a phosphoprotein (P; 540 aa), a V non-structural protein (V; 325 aa), a C non-structural protein (C; 161 aa), a matrix protein (M; 336 aa), a fusion protein (F; 546 aa), a small hypothetical protein (hp; 147 aa), a hemagglutinin glycoprotein (H; 603 aa), and an RNA-dependent RNA polymerase (L; 2172 aa). Importantly, all best hits based on highest sequence identity scores, which ranged between 25.1% (H) and 63.7% (M), were morbilliviruses, more specifically MeV, Longquan Berylmys bowersi morbillivirus 1 (LBbMV), and Wufeng Niviventer fulvescens morbillivirus 1 (WNfMV). LBbMV and WNfMV correspond to recently released virus sequences, which are as yet unpublished and have been annotated as unclassified morbilliviruses. The metadata of their GenBank accessions indicate that LBbMV (GenBank accession no. MZ328284) was identified in the Bower’s white-toothed rat (
Berylmys bowersi), a rodent from the family
Muridae that is native to Southeast Asia. WNfMV (GenBank accession no. MZ328285) was detected in the chestnut white-bellied rat (
Niviventer fulvescens), another rodent from the family
Muridae.
Structural and functional annotation indicates that the 513 aa RoMV-N protein harbors a paramyxovirus nucleocapsid protein domain (Paramyxo_ncap, E-value = 0, coordinates 1–512) that is involved in tightly encapsidating the viral RNA and interacting with several other viral-encoded proteins, all of which are involved in controlling replication. RoMV-N presents the conserved MA(S,T)L motif of morbilliviruses, and appears to share the three key conserved motifs in paramyxoviruses and nuclear export signals and NLS (
Supplementary Figure S4). The 540 aa P protein, a co-factor of the RdRP that plays a crucial role by positioning L onto the N/RNA template through an interaction with the C-terminal domain of N, includes a paramyxovirus structural protein V/P N-terminus domain (Paramyxo_PNT, pfam13825, E-value = 3.67 × 10^−3, coordinates 274–347) and a paramyxovirus P/V phosphoprotein C-terminal domain (Paramyx_P_V_C, pfam03210, E-value = 1.13 × 10^−16, coordinates 372–536) (
Figure 1A). Most of its 540 amino acids, as expected, appear to be in a natively disordered state; the C-terminal conserved residues are putatively folded into a three-helical bundle that binds to the C-terminal tail of N, and the protein has an oligomerization domain that forms a long tetrameric coiled coil stabilized at its N terminus by a helical bundle linking protomers. Three-dimensional modeling with the Swiss-Model platform, using the 3zdo.1.B MeV phosphoprotein as best-fit template, showed that RoMV-P forms a tetrameric coiled coil of similar length (63 aa) and conserved structure, but less tightly packed than that of the measles virus P protein (
Supplementary Figure S5). The 327 aa cysteine-rich non-structural V protein, generated by mRNA editing through the incorporation of an additional “G” at coordinate 2512 of the sequence encoding the P mRNA, has a zinc-binding domain of Paramyxoviridae V protein at its C-terminal region (zf-Paramyx-P, E-value = 2.4 × 10^−15, coordinates 280–323). The V mRNA is generated in an A-rich context where the RNA transcriptase ‘stutters’ on the template at the editing motif, which is “AAAAAGGG” in RoMV. This stuttering results in the insertion of one pseudo-templated G, shifting the reading frame to access the alternative ORF V [
2]. The 161 aa C protein, which is generated by leaky scanning of the P mRNA that results in the translation of an overlapping ORF 31 nt downstream of the AUG of P, presents a C protein from the Hendra and measles viruses domain (C_Hendra, pfam16821, E-value = 8.10 × 10^−5, coordinates 1–146). The C protein has been implicated in interactions with host defenses: for instance, MeV C is implicated in modulation of interferon signaling, but also in pathogenicity and virulence, as is the case for CDV C [16], and MeV C downregulates viral RNA synthesis, allowing the virus to escape detection by the cytosolic RNA sensors and ultimately preventing IFN production [
17]. The non-glycosylated membrane or matrix protein (M) is 336 aa long and has a viral matrix protein domain (Matrix, pfam00661, E-value = 6.34 × 10^−118, coordinates 6–326). The M protein appears to be the most conserved protein of RoMV, sharing 63.6% aa identity with that of LBbMV. The 546 aa F glycoprotein presents, as expected, a signal peptide at its N-terminal region and a transmembrane domain at its C-terminal end (
Figure 1A). The F protein functional annotation pinpoints a typical fusion glycoprotein domain (Fusion_gly, pfam00523, E-value = 1.04 × 10^−128, coordinates 23–478). Unexpectedly, a small 147 aa protein (hp) is encoded in the F-H intergenic region, showing no homology to any known protein or domain. No similarities to motifs, domains, peptides, or proteins in any database are found for hp when PSI-BLAST, HHblits, HHpred, or HMMER searches are implemented (see below for more details). The 603 aa surface H glycoprotein shows a typical haemagglutinin-neuraminidase of the Paramyxoviridae domain (HN_like, cd15464, E-value = 3.09 × 10^−20, coordinates 207–579) and an N-terminal transmembrane domain. The H protein is the most divergent main protein encoded by RoMV, showing only 16–22% best aa identity with the H of LBbMV and WNfMV.
As CD150 is the tentative main receptor of morbilliviruses, I used the primary data from A. olivacea to reconstruct the protein using as query the signaling lymphocytic activation molecule family member 1 coding sequence from the available hispid cotton rat CD150 (Sigmodon hispidus, JX424845), eventually generating a complete 1278 nt-long mRNA encoding an
A. olivacea 340 aa protein showing 81.5% aa identity to that of the hispid cotton rat. To gain insight into the RBP-CD150 interactions that could be involved in determining host tropism, I compared the amino acid sequences at the putative contact surfaces of morbillivirus RBPs and their cognate CD150 receptors based on the predictions of Ikegame et al. [
10] (
Supplementary Figure S6). Alignment of putative key regions in some morbillivirus H proteins implicated in CD150 interactions showed virus-specific changes, with some residues highly conserved and others significantly variable, which may suggest the adaptation of morbillivirus H to the putative CD150 receptors of their cognate host (
Supplementary Figure S6). Modeling and experimental assessment of these in silico predictions could shed some light on the specific role of
A. olivacea CD150 in the host interaction and range of RoMV. Finally, the 2172 aa-long L protein presents a Mononegavirales RNA-dependent RNA polymerase domain (Mononeg_RNA_pol, pfam00946, E-value = 0, coordinates 16–1107) followed by a Paramyxoviridae family mRNA-capping enzyme region (paramyx_RNAcap, TIGR04198, E-value = 2.96 × 10^−176, coordinates 1224–2172) including an mRNA (guanine-7-)methyltransferase (G-7-MTase, pfam12803, E-value = 1.74 × 10^−87, coordinates 1483–1793) that catalyzes cap methylation.
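Returning to the P/V editing step described earlier in this section, the effect of one pseudo-templated G on the reading frame can be illustrated with a short sketch. Only the editing motif “AAAAAGGG” comes from the text; the toy mRNA, the function names, and the choice to insert the extra G at the end of the G run are assumptions for illustration.

```python
from typing import List

EDIT_MOTIF = "AAAAAGGG"  # editing motif reported for RoMV


def edit_mrna(p_mrna: str, motif: str = EDIT_MOTIF, extra_g: int = 1) -> str:
    """Insert pseudo-templated G residues right after the first editing motif."""
    sequence = p_mrna.upper().replace("T", "U")
    site = sequence.find(motif)
    if site == -1:
        raise ValueError("editing motif not found")
    insert_at = site + len(motif)
    return sequence[:insert_at] + "G" * extra_g + sequence[insert_at:]


def codons(sequence: str, start: int = 0) -> List[str]:
    """Split a sequence into complete codons, dropping any trailing bases."""
    return [sequence[i:i + 3] for i in range(start, len(sequence) - 2, 3)]


# Hypothetical toy transcript (not RoMV sequence) containing the motif.
toy_p_mrna = "AUGGCAAAAAGGGCCUGAUUGA"
toy_v_mrna = edit_mrna(toy_p_mrna)

print("P frame:", codons(toy_p_mrna))
print("V frame:", codons(toy_v_mrna))
```

Codons upstream of the editing site are identical in both transcripts, whereas every codon downstream of the inserted G differs, which is how the shared N-terminal region of P and V is followed by a distinct C-terminal region in V.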
The genome of RoMV presents some peculiarities that distinguish it from other assigned morbilliviruses. For instance, the (nearly) complete 16,568 nt sequence represents the longest morbillivirus genome reported to date (Supplementary Figure S7), being at least 518 nt longer than that of Feline morbillivirus (FeMV), which is characterized by a long M-F intergenic region and 5′ trailer sequence. The greater length of RoMV is not explained by its coding regions, which are of typical size, but by the longest intergenic regions described so far within the genus (Supplementary Table S2). RoMV presents the longest N-P, P-M, and F-H intergenic regions reported so far. In the latter, the presence of an accessory putative ORF encoding a 147 aa hypothetical protein with a transmembrane domain is not a hallmark of morbilliviruses. Such ORFs do occur elsewhere in the family; for instance, the mumps virus (
Orthorubulavirus) presents a small hydrophobic (SH) protein gene encoded between F and H [
18]. It is worth noting that the rodent putative morbillivirus WNfMV shares, in the same genomic context between F and H, an ORF encoding a small 74 aa protein also with a transmembrane domain. While it is tempting to consider that these accessory proteins may have some role in rodent–host interaction, the absence of this ORF in LBbMV hampers any conclusion, its function remains elusive, and its presence is not a distinguishing feature of this subclade. A short integral membrane protein (SH) and/or transmembrane protein (tM) located between F and H is not exceptional in paramyxoviruses and can be found for instance in some members of the subfamily
Orthoparamyxovirinae such as rodent viruses from genus
Jeilonvirus where it is thought to be involved in cell-to-cell fusion [
19]. Besides genomic location, relative size, and presence of a transmembrane signal, there is no apparent identity or reminiscence of homology between these jeilonvirus proteins and the ones from RoMV and WNfMV; thus, I decided to dub it as hypothetical protein (hp) instead of SH or tM to avoid confusion (
Supplementary Figure S8). Both RoMV and LBbMV also include a significantly long H-L intergenic region, about three times the typical size in morbilliviruses, mainly derived from an unusually long AU-rich (65–70%) H mRNA 3′UTR.
3.3. Phylogenetic Analysis of a Novel Virus
Phylogenetic insights based on the predicted replicase of RoMV were employed to assess the putative evolutionary placement of this virus. To this end, the L protein aa alignment of recognized members of the family
Paramyxoviridae provided as a resource of ICTV available at
https://talk.ictvonline.org/ictv-reports/ictv_online_report/negative-sense-rna-viruses/w/paramyxoviridae/1197/resources-paramyxoviridae (accessed on 20 July 2022), was retrieved and a consensus alignment was generated using ClustalW. The obtained paramyxovirus L tree clearly shows that RoMV clusters together with other viruses within the genus
Morbillivirus (
Figure 1B;
Supplementary Figure S9). In addition, RoMV appears to have a close evolutionary relationship with WNfMV, LBbMV and a putative morbillivirus linked to the wood mouse (
Apodemus sylvaticus), a murid rodent native to Europe [
20], branching together forming a clade of rodent morbilliviruses. The recently reported “Apodemus morbillivirus” was detected in a wood mouse cadaver that had been killed by cats or vehicles, collected in Belgium and was dubbed Gierle apodemus virus (GaMV, GenBank accession no. OK623356, release date 18 May 2022). A comparison of GaMV and RoMV genomes based on sequence alignments showed a relatively low 56% nt pairwise identity and their predicted proteins ranged from 21.1% (H protein) to as high as 64.5% (M protein) pairwise identity. To further confirm the evolutionary findings based on L alignments, N, P, M, F, and H phylogenetic trees were generated using proteins from viruses of genus
Morbillivirus, RoMV, WNfMV, LBbMV, and GaMV and the respective proteins of Tupaia narmovirus and Nariva narmovirus (genus Narmovirus). In all cases, unequivocally, RoMV clustered with morbilliviruses, forming a sub-clade with the rodent viruses WNfMV, LBbMV, and GaMV (
Supplementary Figure S10).
The ICTV species demarcation criteria for morbilliviruses are based on distance in the phylogenetic tree of the complete L protein, considering tree topology and the branch length between the nearest node and the tip of the branch. This length is defined as 0.03 in the trees generated in the ICTV paramyxovirus resource and used as input for the L tree reported here. As the branch length from the node separating RoMV/LBbMV, in substitutions per site of the obtained consensus tree, is well above this threshold, RoMV appears to correspond to a new virus, a putative member of a novel species within the genus Morbillivirus that I tentatively name ‘South American mouse morbillivirus’.