1. Introduction
According to WHO, the monkeypox virus (MPXV) poses new threats to humanity [
1]. As surveillance expands in non-endemic countries, the cases of monkeypox being identified continue to rise with the risk of becoming an outbreak. Proper precautions and steps are necessary to avoid another global pandemic in 2022–2023.
MPXV is an infectious disease caused by the Zoonotic Orthopoxvirus closely related to cowpox and smallpox viruses—first identified in a monkey in Denmark at the end of the 1950s. Although it is mainly transmitted by monkeys and rodents, the human-to-human spread is also prevalent since the first case was recorded in central Africa in 1970. As of today, individuals infected by the MPXV are reported across continents. MPXV is transmitted through close physical contact with an infected person, animal or object, and its principal long-term effect is the presence of human skin lesions.
Understanding this pathogen better is still a global challenge for scientific research. A global network of permanent bioinformatics surveillance, with the capacity to rapidly deploy genomic tools and statistical studies for the MPXV characterization of the evolution of lineages and mutations, is still much needed. There are hundreds of complete genome sequence samples reported from confirmed cases from around the world [
2,
3,
4], which indicate a close match to exported cases from Africa found in 2017 [
5].
In this paper, we present a novel method for characterizing the MPXV genome from available data sequences. We apply the novel GenomeBits statistical algorithm, whose salient feature is to map the nucleotide bases
(as observed along a genome sequence) into a finite alternating (±) sum series of distributed terms of binary (0,1) indicators [
6]. We show how GenomeBits allow infectious MPXV surveillance by tracking and analyzing viral sequences in relation to their evolution status. In particular, we derive scatter plots for the nucleotides A-T and C-G to uncover correlations between MPXV sequences of complementary nucleotide bases.
2. Method
The new quantitative method presented for the characterization of distinctive patterns of complete genome sequences of length
N considers
where the independently distributed terms
are associated with (0,1) according to their position along the sequences, satisfying the following relation
. The arithmetic progression carries positive and negative signs
depending on the nucleotide base position
k.
This mapping into four binary projections of genome sequences follows previous studies on the three-base periodicity characteristic of protein-coding DNA sequences [
7]. However, as a principal difference from other binary representations (see also, e.g., Refs [
8,
9]), in Equation (
1), the terms in the sums change sign. If a term
in the sum series is positive at a given
k, then the next
term is negative, and vice versa.
The binary indicator for the sequences is chosen sequentially, starting with
. This is illustrated in
Table 1, where a mapping example for converting the brief genome fragment TTTTAGTACATTAAT (consisting of 15 nucleotides) into the alternating binary array via Equation (
1) is given. As illustrated in
Table 1, the positive and negative terms in the sums of Equation (
1) partly cancel out, allowing the sum series to diverge from zero rapidly and to become a non-Cauchy sequence type [
6].
There is a user-friendly Graphics User Interface (GUI) for the GenomeBits signal analysis method of genome sequences in FASTA format. The GUI can be downloaded freely from GitHub (see
https://github.com/canessae/GenomeBits/, accessed on 26 November 2022) [
10]. It runs under Linux Ubuntu OS. For large genomic data, say, of the order of
N = 197,204 nucleotide bases as, for example, for ON563414.3 monkeypox virus isolate MPXV_USA_2022_MA001 complete genome [
3], the one used in the present study, the GUI requires little processing time and discards uncompleted sequences containing codification errors (usually denoted with “NNNNN” letters).
3. Discussion
Next, we show some interesting statistical imprints of the intrinsic gene organization at the level of nucleotides from the above formula. In
Figure 1, the results obtained via Equation (
1) for the binary sequences of each A, C, G and T nucleotides of the MPXV reported from the USA are shown. For comparison, the curves
Figure 1a on the top display results for nucleotides A and T of the strand and on the bottom
Figure 1b are the nucleotides C and G (complementary to those of the opposite strand according to the pairing rules A-T and C-G of DNA). It is interesting to note how in the figure there are regions where the curves seem to mirror each other. This peculiar behavior reveals coding regions of correspondence between single-nucleotide structures.
Moreover, the fact that for , the total number of zeros is 66.4% of the whole sequence and the total number of ones is 33.6% is also interesting. For , these correspond to 83.5% and 16.5%, respectively. For , one finds 83.5% and 16.5% and, similarly, for , the values become 66.6% and 33.4%.
A first step in assessing the probability distribution of datasets is usually obtained by computing a basic histogram (i.e., counting the number of data points in a given bin). A standard cumulative, normalized histogram as a step function in order to visualize the empirical cumulative distribution function (CDF) for a sample is also useful [
11]. In this case, one gets the probability of, and observation from, the sample not exceeding a particular
x-value at a fixed
y-value. However, our goal here is to compare two different datasets or, alternatively, two different histograms of equally sized bins.
To achieve this, a pair of histograms above and to the right of a scatter plot of such two datasets can help to visualize and highlight correlations at certain colored contact points. Scatter plots with marginal distributions of this kind can be easily made using Python programming through the built-in Matplotlib’s routines: scatter(x,y), hist(x), gridspec and invert_xaxis (see, e.g., Ref. [
12]). In
Figure 2 and
Figure 3, we illustrate the (blue and green bar-graph) histograms, the empirical (red full line) and theoretical (red dotted lines) CDF curves, and the resulting scatter plots for the nucleotides A and its complementary T and C and its complementary G plotted in
Figure 1. We have used 50 bins, and the fits (blue line) have been calculated via an expected Gaussian distribution.
In
Figure 2 and
Figure 3, we have drawn scatter diagrams in an attempt to obtain correlations and anomalies between GenomeBits sequences of complementary nucleotide bases. The clearer colors of the hexagon markers add a new dimension to these plots. These show how the mapped binary data is stratified along the sequences and which areas show somehow correlated points that fall along complex, rounded shapes. The markers are unequally distributed in the scatter plot and far from forming a straight line. The patterns found that summarize the underlying data do not occur by random chance.
From these results, we believe GenomeBits may help to shed light on the surveillance behind infectious diseases. By GenomeBits, we have also uncovered distinctive signals for the intrinsic gene insights in the coronavirus genome sequences for past variants of concern: Alpha, Beta and Gamma, and the variants of interest: Epsilon and Eta [
6]. A study of GenomeBits into the most recent delta and omicron variants of the coronavirus pathogen can be found in Ref. [
13] (see also [
14]). The GenomeBits method focuses on single-nucleotide structures and may also provide useful information to study the evolution of the MPXV outbreak via simple statistical analysis.