1. Introduction
All viruses that pass through an RNA replication phase are found as what is known as a quasispecies, that is, a set of closely related genomes in a host that may comprise a huge number of variants while keeping a high degree of similarity among them. These variants are produced during replication by the RNA-dependent RNA polymerases, which are error prone and lack the error-correction mechanism typical of most DNA polymerases [
1].
Quasispecies are dynamical entities subject to evolution, generating new variants at each replication cycle, while losing the less fit and those unable to replicate. A quasispecies at a given time point may be described in molecular terms by the different genomes present (haplotypes) and their frequencies (the number or fraction of molecules with the same sequence), that is, by the haplotype distribution. This is a multinomial distribution in which each category corresponds to a different haplotype. The evolution of this dynamic entity may then be represented by the changes observed in this distribution, as new categories appear and others disappear, and as their frequencies vary.
The extent of changes of a quasispecies in a host, between two time points, may be quantified by the genetic distance between the two viral populations [
2], by the changes in quasispecies diversity indices [
3], and by the distance or dissimilarity between the two haplotype distributions [
4]. In this report, we discuss three selected indices used to compute the similarity between two haplotype distributions, and their implications. With simulated quasispecies data, we show their particularities and correlations, and use plots to help in the interpretation of results. Finally, a clinical HEV dataset from a recent publication is used to illustrate the practical use of these indices. They are particularly useful in the clinical follow-up of a patient, where the compared quasispecies are highly related and where the genetic distance between them may not suffice to describe the observed changes.
In the context of NGS, we denote each distinct genome as a haplotype, and each molecule sampled as a read. We use this terminology throughout the paper.
3. Discussion
The proposed methods are intended for the analysis of changes occurring in an in-host quasispecies over time, as a consequence of the host immune system or of an external action, such as a treatment. The quasispecies are treated as entities (closed ecosystems or genetic populations) whose respective distributions of molecules are compared, in contrast with the more widespread comparison of summary values such as diversity indices (e.g., Shannon entropy) or measures of genetic diversification (e.g., nucleotide diversity) [
3].
In a recent paper [
5], we introduced the Quasispecies Fitness Partition (QFP) into four fractions (QFF), also described under Methods, and we recommended its use together with the Hill Numbers Profile (HNP) to visualize the evolution of a quasispecies. Those methods were used in an in-depth exploration of a clinical case of an HEV infection treated with ribavirin. As part of the discussion, we proposed the use of distances between haplotype distributions as an alternative or complement to genetic distances between quasispecies. This paper explores three selected indices of similarity between haplotype distributions, from which the corresponding distances may be obtained.
Here, we have used simulated data aimed only at producing closely related quasispecies, similar to what could be observed in the follow-up of a single patient, and simple enough to be tabulated and plotted. However, to put the methods described here in clinical context, we have added the data of a clinical follow-up of a patient chronically infected with HEV and treated with a mutagen, spanning three years of observation and different treatment regimens. Since HEV is an RNA virus with very high mutation rates, in the range of
to
substitutions/base/replication cycle [
7], similar to other highly clinically relevant viruses such as HCV or HIV, the tools presented can be extrapolated to the vast majority, if not all, of RNA viral infections.
The simulation of a substantial number of paired quasispecies allowed us to illustrate particular cases of interest, contributing to the interpretation of results, and also to estimate the correlations between the three indices (, and ), and with the quasispecies genetic distance, . The correlation values show the pairs and , and , and and as highly correlated, with the most independent of the others. Despite this high correlation we recommend the use of three distances, , or , and . Nevertheless for distant quasispecies the four distances will contribute valuable information.
The use of these distances is shown with the simulated data of a quasispecies treatment (
Figure 7). The changes experienced by the quasispecies, with samples taken at given evolutionary steps, are summarized in the QFF plot,
Figure 8. The relationship between the quasispecies is shown in the form of a matrix of
distances,
Figure 9, from which we obtain a dendrogram by hierarchical clustering with the average method,
Supplementary Figure S15, or a MDS map,
Supplementary Figure S17. Using
distances we may obtain an alternative dendrogram,
Supplementary Figure S16, or an alternative MDS plot,
Supplementary Figure S18.
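To make the clustering step concrete, the following is a minimal Python sketch of hierarchical clustering with the average method applied to a matrix of distances between samples. It is not the authors' R code (which is provided in the Supplementary Materials); the function name `average_linkage`, the sample labels, and the distance values are hypothetical and serve only as an illustration.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering with the average method.

    dist is a symmetric matrix (list of lists) of distances between samples.
    Returns the merges in order, as (members_a, members_b, height) tuples.
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for x, a in enumerate(keys):
            for b in keys[x + 1:]:
                # Average of the original distances over all member pairs.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                height = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]],
                       height))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Hypothetical distances (1 minus a similarity index) between three samples.
example_dist = [
    [0.0, 0.1, 0.5],
    [0.1, 0.0, 0.7],
    [0.5, 0.7, 0.0],
]
example_merges = average_linkage(example_dist, ["S1", "S2", "S3"])
```

In this toy case S1 and S2 are joined first, and S3 joins them at the mean of its distances to both. The list of merges carries the information a dendrogram displays.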
A key point with all these methods is the availability of quasispecies haplotypes with their corresponding frequencies. The classical and more widespread NGS data analysis procedures for viruses, such as those in Galaxy [
8], limit sequencing errors by trimming the reads at their ends, where the quality is poorer, by a number of nucleotides that depends on the instrument quality scores and on the algorithm used. As a result of this trimming, the coverages are uneven, even within the same amplicon, which prevents amplicon haplotypes from being obtained directly. In [
5,
9], for instance, we describe the method used by our group to obtain high quality amplicon haplotypes in sequencing viral quasispecies samples. It is simply based on respecting the integrity of full reads, with no trimming, except for the primers. The quality filters are executed on full reads. This requires high sequencing quality and very high coverage to get a comprehensive picture of an infection that may involve viral loads higher than
copies/mL of blood. Currently we are only able to obtain high quality amplicon haplotypes of a size slightly over 500 bp, with coverages of the order of
reads per amplicon, sequencing with Illumina instruments. Despite this limitation, quasispecies genomes may be studied amplicon by amplicon. On the other hand, when the monitored treatment is a direct-acting agent that targets a specific region of the genome, a single amplicon may suffice [
9]. There are a number of inferential methods for reconstructing full viral haplotypes from short reads, but they have limitations, require special computational resources for high coverages, and perform poorly with samples of high genetic diversity, according to a recent review [
10].
The clinical case presented has provided the opportunity to show a practical application of the proposed methods. This dataset, with thousands of haplotypes in each sample and coverages in the range of $5 \times 10^4$ to $5 \times 10^5$ reads, shows a correlation between the three indices consistent with that observed in the more modest simulated pairs of quasispecies, which entail very few haplotypes. Nevertheless, a critical aspect of the simulations was to ensure a close relationship between pairs of quasispecies, as is the case in the follow-up of a patient, the main objective of this work.
The advantage of the described methods is that they provide rich summaries and visual tools to monitor the changes occurring in a viral quasispecies at the molecular level, with time. This facilitates the interpretation of the biological changes in the quasispecies, and also provides a means to diagnose possible outcomes of a treatment when monitoring a patient, as seen with the discussed HEV clinical case.
In the case of mutagenic treatments, we recommend this method, combined with the method of quasispecies fitness fractions (QFF), and the Hill numbers profile (HNP) [
5]. When the quasispecies evolution rate is low compared to mutagenic scenarios, the QFF may be insufficient to reveal changes in the quasispecies, and the proposed indices could be more sensitive to them.
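As a reminder of what a Hill numbers profile contains, the sketch below computes Hill numbers of several orders q from a vector of haplotype proportions, using the standard definition (the limit at q = 1 is the exponential of the Shannon entropy). This is a Python illustration rather than the authors' R code; the function names and the set of orders are assumptions.

```python
import math


def hill_number(proportions, q):
    """Hill number of order q for a vector of haplotype proportions."""
    p = [x for x in proportions if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def hill_profile(proportions, orders=(0, 0.5, 1, 2, 4)):
    """Hill numbers over a set of orders (the orders are illustrative)."""
    return [(q, hill_number(proportions, q)) for q in orders]
```

For a perfectly even distribution every order returns the number of haplotypes. The more uneven the distribution, the faster the profile falls as q increases.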
4. Materials and Methods
4.1. Data
4.1.1. Simulation of Paired Quasispecies
To quantify the extent of changes (evolution) of a quasispecies, we compare the quasispecies composition at two time points. The paired quasispecies needed to illustrate the results and discussion are obtained by simulation as described in the following method:
Distribution pattern: 20,000 random occurrences of a geometric distribution, with parameter , are generated, simulating 20,000 reads of over 35 haplotypes. The frequencies of this distribution are used as pattern distribution on which to apply random selection criteria of frequencies.
Select frequencies for quasispecies A: From the above pattern distribution, 12 frequencies are randomly selected to represent the composition of quasispecies A.
Select frequencies for quasispecies B: From a new pattern distribution generated with the same parameters as above, randomly select 12 frequencies to represent the composition of quasispecies B.
Confront both simulated quasispecies: The two quasispecies are composed together of 20 haplotypes, some common to both quasispecies, some unique to either one. Assign randomly the 12 frequencies of quasispecies A among the 20, and do the same with the 12 frequencies of quasispecies B. Remove from the 20 any haplotype not populated (0 reads in both quasispecies).
A single cycle of this simulation results in the distributions of two paired quasispecies, which are given as shown in
Table 1, and may be represented, confronting both distributions, as in
Figure 1. The chosen numbers of reads and haplotypes in the simulation are arbitrary, a simplification of real life cases, but complex enough to compose a quasispecies.
The simulated quasispecies in each pair are related because each results from a random selection of 12 haplotypes from a common source of 20. On the other hand, the random selection of frequencies results in varying proportions for each haplotype and varying coverages (total number of reads) for each quasispecies. In this way, in each pair, we consider quasispecies B as the result of an evolution from quasispecies A.
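The simulation cycle described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: the geometric parameter `p=0.2` is a placeholder value chosen for this sketch, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def pattern_counts(rng, n_reads=20000, p=0.2):
    """Read counts per haplotype from n_reads draws of a geometric distribution.

    p=0.2 is a placeholder, not the value used in the paper.
    """
    draws = (int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1
             for _ in range(n_reads))
    return list(Counter(draws).values())


def simulate_pair(rng, n_select=12, n_haplotypes=20, p=0.2):
    """One simulation cycle: returns read counts for quasispecies A and B."""
    pair = []
    for _ in range(2):
        # A fresh pattern distribution for each quasispecies.
        selected = rng.sample(pattern_counts(rng, p=p), n_select)
        # Assign the selected frequencies at random among the haplotypes.
        slots = rng.sample(range(n_haplotypes), n_select)
        reads = [0] * n_haplotypes
        for slot, count in zip(slots, selected):
            reads[slot] = count
        pair.append(reads)
    # Remove haplotypes with 0 reads in both quasispecies.
    populated = [i for i in range(n_haplotypes) if pair[0][i] or pair[1][i]]
    return [[reads[i] for i in populated] for reads in pair]


reads_a, reads_b = simulate_pair(random.Random(1))
```

Each call yields two aligned vectors of read counts, with zeros marking haplotypes present in only one of the two quasispecies.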
4.1.2. Simulation of a Viral Treatment Follow-Up
The previous simulation aimed to generate pairs of quasispecies, more or less distant, as a result of certain evolution from the first to the second, and it was intended to help in the understanding and interpretation of the similarity indices and the correlations between them.
A second simulated dataset aims to generate a sequence of quasispecies that could be the result of an external treatment which generates resistant variants as a side effect. The quasispecies will consist of 40 haplotypes of three types:
The dominant haplotype, initially at a frequency of 99.9%, evolving at each evolution step at a pace given by a constant uniformly distributed between 0.85 and 1.05.
A minority haplotype, initially at (0.1/39)%, evolving at each evolution step at a pace given by a constant uniformly distributed between 0.95 and 1.25.
The remaining 38 haplotypes, initially at (0.1/39)% each, evolving at a pace given by a constant uniformly distributed between 0.8 and 2.5. Only a random number of these, between 2 and 10, are subjected to evolution at each step; the rest are left unchanged.
In this way, samples are sequentially generated at evolution steps 10, 20, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, and 80. The resulting haplotype distributions are plotted in
Figure 7.
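A minimal Python sketch of this second simulation follows. It rests on two assumptions that are not spelled out in the text: the "pace" is read as a multiplicative factor applied to the abundance of a haplotype at each step, and abundances are renormalized to proportions only when a sample is taken. The function name is hypothetical, and this is not the authors' R code.

```python
import random

SAMPLE_STEPS = (10, 20, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)


def simulate_treatment(rng, n_steps=80, sample_steps=SAMPLE_STEPS):
    """Haplotype proportions (40 haplotypes) at each sampled evolution step."""
    minor = 0.1 / 39
    # Percentages: index 0 is the dominant haplotype, index 1 the tracked
    # minority haplotype, indices 2..39 the remaining 38 haplotypes.
    abundance = [99.9] + [minor] * 39
    samples = {}
    for step in range(1, n_steps + 1):
        abundance[0] *= rng.uniform(0.85, 1.05)
        abundance[1] *= rng.uniform(0.95, 1.25)
        # Only between 2 and 10 of the remaining haplotypes evolve per step.
        for i in rng.sample(range(2, 40), rng.randint(2, 10)):
            abundance[i] *= rng.uniform(0.8, 2.5)
        if step in sample_steps:
            total = sum(abundance)  # assumed renormalization at sampling
            samples[step] = [a / total for a in abundance]
    return samples


treatment_samples = simulate_treatment(random.Random(7))
```

The result maps each sampled step to a vector of 40 proportions, which can then be compared pairwise with the similarity indices.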
4.1.3. A Clinical HEV Case
This dataset is taken from a recent publication [
5], which shows the negative effects of the early discontinuation of treatment with a mutagenic agent in a patient chronically infected with HEV. The dataset is used here as an example of the application of the proposed method to a practical case. Briefly, this is the clinical follow-up case of a 27-year-old patient who acquired chronic HEV infection after undergoing two kidney transplantations. The patient received three different RBV regimens (600 mg/day, 800 mg/day, and 1000 mg/day) with discontinuations caused by adverse effects, followed by relapses.
A single amplicon covering genomic positions 6323 to 6734 on the HEV ORF2 region was sequenced for each of 13 sequential samples taken from May 2018 to June 2021. The coverage range of the final dataset is 53,307–503,770 reads, with a median of 328,271 reads per sample/amplicon, covering the full amplicon and enabling amplicon haplotypes and their corresponding frequencies to be obtained. The number of haplotypes per sample is in the range 1688–7881, with a median of 5602.
4.2. Methods
4.2.1. Similarity between Distributions
The similarity between two distributions may be quantified by a rich set of different indices [
4]. In this report, we use three of them:
Commons: As the fraction of reads belonging to haplotypes populated in both quasispecies.
Overlap: As the sum of the minimum proportion of common haplotypes.
Yue–Clayton: This index takes fuller account of all proportion information, considering the proportions of both common and unique haplotypes [11].
The three indices vary from 0 (no similarity) to 1 (equal quasispecies). The dissimilarity, or distance, between two distributions may be computed as 1 minus the similarity index.
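The three indices can be sketched in Python as follows. This is an illustration, not the authors' R code. The Overlap and Yue–Clayton functions follow the verbal definitions and the standard Yue–Clayton formula; for Commons, the sketch assumes that the fraction is taken over the pooled reads of both quasispecies, which is one possible reading of the definition. The dictionaries of read counts are hypothetical.

```python
def proportions(counts):
    """Convert read counts per haplotype into proportions."""
    total = sum(counts.values())
    return {h: n / total for h, n in counts.items() if n > 0}


def commons(a, b):
    """Fraction of reads in haplotypes populated in both quasispecies.

    Assumption: computed over the pooled reads of both samples.
    """
    shared = [h for h in a if a[h] > 0 and b.get(h, 0) > 0]
    shared_reads = sum(a[h] + b[h] for h in shared)
    return shared_reads / (sum(a.values()) + sum(b.values()))


def overlap(a, b):
    """Sum of the minimum proportion over the common haplotypes."""
    p, q = proportions(a), proportions(b)
    return sum(min(p[h], q[h]) for h in p.keys() & q.keys())


def yue_clayton(a, b):
    """Yue-Clayton similarity: sum(pq) / (sum(p^2) + sum(q^2) - sum(pq))."""
    p, q = proportions(a), proportions(b)
    pq = sum(p[h] * q[h] for h in p.keys() & q.keys())
    pp = sum(v * v for v in p.values())
    qq = sum(v * v for v in q.values())
    return pq / (pp + qq - pq)


# Hypothetical read counts for two related quasispecies.
qs_a = {"H1": 900, "H2": 80, "H3": 20}
qs_b = {"H1": 700, "H3": 200, "H4": 100}
distances = {f.__name__: 1 - f(qs_a, qs_b)
             for f in (commons, overlap, yue_clayton)}
```

With these example counts the similarities are about 0.91 (Commons), 0.72 (Overlap), and 0.88 (Yue–Clayton), so the corresponding distances are about 0.09, 0.28, and 0.12.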
4.2.2. Genetic Distance between Quasispecies
The nucleotide distance between two quasispecies [2], X and Y, may be estimated by:

$$D_{XY} = \sum_{i} \sum_{j} p_i \, q_j \, d_{ij}$$

where $p_i$ and $q_j$ are the proportion of the i-th haplotype in quasispecies X, and that of the j-th haplotype in quasispecies Y, and $d_{ij}$ is the genetic distance between both haplotypes. The sum extends over all haplotypes in both quasispecies. This distance is interpreted as the average number of nucleotide substitutions between the reads from quasispecies X and quasispecies Y.

Taking into account the nucleotide diversity of each quasispecies [2], that is, the average number of nucleotide substitutions for a random pair of reads in the quasispecies, $D_X$ and $D_Y$, which may be estimated by:

$$D_X = \frac{n_X}{n_X - 1} \sum_{i} \sum_{j} p_i \, p_j \, d_{ij}, \qquad D_Y = \frac{n_Y}{n_Y - 1} \sum_{i} \sum_{j} q_i \, q_j \, d_{ij}$$

where $n_X$ and $n_Y$ are the number of reads in each quasispecies, then the net nucleotide substitutions between the two quasispecies [2] is estimated by:

$$D_A = D_{XY} - \frac{D_X + D_Y}{2}$$

$D_A$ will be taken as the genetic distance between two quasispecies.
The quasispecies pairs are simulated in a way that all haplotypes are considered to have a single substitution with respect to the master haplotype in the first quasispecies. In this way, the matrix of distances between all pairs of haplotypes in both quasispecies has the form:
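For illustration, under the additional assumption that the single substitutions fall at distinct sites, such a matrix has zeros on the diagonal, ones between the master and every other haplotype, and twos between any two non-master haplotypes. The Python sketch below (not the authors' R code) builds that assumed matrix and computes the net distance described in this section: the average between-quasispecies distance minus the mean of the two within-quasispecies diversities. Function names are hypothetical, and both read vectors are assumed to be aligned on the same haplotype list, with zeros for absent haplotypes.

```python
def star_matrix(n):
    """Assumed distances: master (index 0) at 1 from the others, others at 2."""
    return [[0 if i == j else 1 if 0 in (i, j) else 2 for j in range(n)]
            for i in range(n)]


def mean_pairwise(p, q, d):
    """Sum over i, j of p[i] * q[j] * d[i][j]."""
    return sum(pi * qj * d[i][j]
               for i, pi in enumerate(p) for j, qj in enumerate(q))


def net_distance(reads_x, reads_y, d):
    """Net nucleotide substitutions between two quasispecies."""
    n_x, n_y = sum(reads_x), sum(reads_y)
    p = [r / n_x for r in reads_x]
    q = [r / n_y for r in reads_y]
    d_xy = mean_pairwise(p, q, d)                      # between quasispecies
    d_x = n_x / (n_x - 1) * mean_pairwise(p, p, d)     # diversity within X
    d_y = n_y / (n_y - 1) * mean_pairwise(q, q, d)     # diversity within Y
    return d_xy - (d_x + d_y) / 2


# Hypothetical read counts over four haplotypes (index 0 is the master).
example_net = net_distance([900, 80, 20, 0], [700, 0, 200, 100], star_matrix(4))
```

Because the within-quasispecies terms carry the finite-sample factor n/(n - 1), comparing a quasispecies with itself gives a value very slightly below zero rather than exactly zero.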
4.2.3. Quasispecies Fitness Partition (QFP)
A quasispecies, at a given time, understood as a viral population, is usually comprised of a predominant haplotype, a few low- to medium-frequency genomes, various rare haplotypes with very low fitness but still able to replicate at some level, and some defective genomes unable to replicate. This composition can be modeled using the set of frequencies of all haplotypes in the quasispecies as parameters of a multinomial distribution, $\{p_1, p_2, \ldots, p_n\}$ with $\sum_{i=1}^{n} p_i = 1$, where $p_i$ is the frequency in the quasispecies of the i-th haplotype. The parameters, $p_i$, are sorted in decreasing order without a loss of generality.
In this way, the quasispecies can be partitioned into fractions limited by frequency thresholds of interest [
5], as in Equation (
9), where a partition into four fractions (QFF) is illustrated, and where,
,
,
and
represent the four fractions.
In the typical quasispecies structure mentioned above, the four fractions can be defined as follows:
Master: the fraction of molecules belonging to the most frequent haplotype; that is, the one present at the highest percentage ().
Emerging: the fraction of molecules present at a frequency greater then and less than the master percentage, belonging to haplotypes that are potentially able to compete with the predominant one and possibly replace it ().
Low fitness: the fraction of molecules present at frequencies from to , belonging to haplotypes that have a low probability of progressing to higher frequencies ().
Very low fitness: the fraction of molecules present at frequencies below , belonging to haplotypes with very low fitness and to defective genomes. The likely fate of these molecules individually is degradation, but the fraction is continuously fed with new very low fitness genomes produced by replication errors or by host editing activities ().
This partition represents a summarization of the full haplotype distribution, where changes in each fraction have a straightforward biological meaning, and allow for the interpretation of the effects caused by the current environment, or by the administration of an external agent.
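A partition of this kind can be sketched in Python as below. This is not the authors' R code, and the two threshold values used as defaults (1% and 0.1%) are placeholders chosen for the illustration, not values taken from the text; how ties at the thresholds are assigned is also an assumption.

```python
def fitness_fractions(counts, emerging_min=0.01, low_min=0.001):
    """Partition a haplotype distribution into four fractions.

    counts: read counts per haplotype.
    emerging_min, low_min: placeholder frequency thresholds.
    """
    total = sum(counts)
    freqs = sorted((c / total for c in counts if c > 0), reverse=True)
    master, rest = freqs[0], freqs[1:]
    return {
        "master": master,
        "emerging": sum(f for f in rest if f >= emerging_min),
        "low_fitness": sum(f for f in rest if low_min <= f < emerging_min),
        "very_low_fitness": sum(f for f in rest if f < low_min),
    }


# Hypothetical read counts for a single sample.
example_qff = fitness_fractions([9000, 500, 300, 150, 40, 5, 5])
```

The four fractions always add up to 1, so a sequence of samples can be compared by following how mass moves between them over time.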
4.3. Software and Statistics
All computations were done in R (v4.0.3) [
12], using packages ape [
13], tidyverse [
14], and ggplot2 [
15]. The full code of the simulations and computations is provided in the
Supplementary Materials. The session info follows:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_COLLATE=Catalan_Spain.1252 LC_CTYPE=Catalan_Spain.1252
[3] LC_MONETARY=Catalan_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Catalan_Spain.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[5] readr_2.0.0 tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5
[9] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 cellranger_1.1.0 pillar_1.6.2 compiler_4.0.2
[5] dbplyr_2.1.1 tools_4.0.2 digest_0.6.27 jsonlite_1.7.2
[9] lubridate_1.7.10 lifecycle_1.0.0 gtable_0.3.0 pkgconfig_2.0.3
[13] rlang_0.4.11 reprex_2.0.1 cli_3.0.1 rstudioapi_0.13
[17] DBI_1.1.1 haven_2.4.3 xml2_1.3.2 withr_2.4.2
[21] httr_1.4.2 fs_1.5.0 generics_0.1.0 vctrs_0.3.8
[25] hms_1.1.0 grid_4.0.2 tidyselect_1.1.1 glue_1.4.2
[29] R6_2.5.0 fansi_0.5.0 readxl_1.3.1 farver_2.1.0
[33] tzdb_0.1.2 modelr_0.1.8 magrittr_2.0.1 backports_1.2.1
[37] scales_1.1.1 ellipsis_0.3.2 rvest_1.0.1 assertthat_0.2.1
[41] colorspace_2.0-2 labeling_0.4.2 utf8_1.2.2 stringi_1.7.3
[45] munsell_0.5.0 broom_0.7.9 crayon_1.4.1