Mathematical Modeling in Bioinformatics: Application of an Alignment-Free Method Combined with Principal Component Analysis

Bielińska-Wąż, Dorota; Wąż, Piotr; Błaczkowska, Agata; Mandrysz, Jan; Lass, Anna; Gładysz, Paweł; Karamon, Jacek

doi:10.3390/sym16080967


by Dorota Bielińska-Wąż ¹,*, Piotr Wąż ², Agata Błaczkowska ¹, Jan Mandrysz ¹, Anna Lass ³, Paweł Gładysz ³ and Jacek Karamon ⁴

¹ Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210 Gdańsk, Poland
² Department of Nuclear Medicine, Medical University of Gdańsk, 80-210 Gdańsk, Poland
³ Department of Tropical Parasitology, Medical University of Gdańsk, 81-519 Gdynia, Poland
⁴ Department of Parasitology and Invasive Diseases, National Veterinary Research Institute, 24-100 Puławy, Poland
* Author to whom correspondence should be addressed.

Symmetry 2024, 16(8), 967; https://doi.org/10.3390/sym16080967 (registering DOI)

Submission received: 25 June 2024 / Revised: 22 July 2024 / Accepted: 23 July 2024 / Published: 30 July 2024

(This article belongs to the Special Issue Mathematical Modeling in Biology and Life Sciences)


Abstract:

In this paper, an alignment-free bioinformatics technique, termed the 20D-Dynamic Representation of Protein Sequences, is utilized to investigate the similarity/dissimilarity between Baculovirus and Echinococcus multilocularis genome sequences. In this method, amino acid sequences are depicted as 20D-dynamic graphs, comprising sets of “material points” in a 20-dimensional space. The spatial distribution of these material points is indicative of the sequence characteristics and is quantitatively described by sequence descriptors akin to those employed in dynamics, such as coordinates of the center of mass of the 20D-dynamic graph and the tensor of the moment of inertia of the graph (defined as a symmetric matrix). Each descriptor unveils distinct features of similarity and is employed to establish similarity relations among the examined sequences, manifested either as a symmetric distance matrix (“similarity matrix”), a classification map, or a phylogenetic tree. The classification maps are introduced as a new way of visualizing the similarity relations obtained using the 20D-Dynamic Representation of Protein Sequences. Some classification maps are obtained using the Principal Component Analysis (PCA) for the center of mass coordinates and normalized moments of inertia of 20D-dynamic graphs as input data. Although the method operates in a multidimensional space, we also apply some visualization techniques, including the projection of 20D-dynamic graphs onto a 2D plane. Studies on model sequences indicate that the method is of high quality, both graphically and numerically. Despite the high similarity observed among the sequences of E. multilocularis, subtle discrepancies can be discerned on the 2D graphs. 
Employing this approach has led to the discovery of numerous new similarity relations compared to our prior study conducted at the DNA level, using the 4D-Dynamic Representation of DNA/RNA Sequences, another alignment-free bioinformatics method also introduced by us.

Keywords:

data analysis; bioinformatics; alignment-free methods; moments of inertia

1. Introduction

Alignment-free techniques are a rapidly developing field in bioinformatics (a review can be found in [1]). These approaches, aimed at comparing biological sequences (DNA, RNA, and proteins), combine concepts from computer science, mathematics, biology, physics, and chemistry. They are characterized by computational efficiency. For example, Gupta et al. proposed a method to study protein sequence similarity through the general form of Chou’s pseudo amino acid composition [2]. Li et al. developed an alignment-free algorithm for comparing protein sequence similarity based on the pseudo-Markovian transition probabilities of amino acids [3]. Saw et al. proposed an alignment-free analysis of protein sequence similarities based on a fuzzy integral [4]. The branch of these methods that provides both numerical and graphical tools for sequence analysis is termed Graphical Representations of Biological Sequences (or Graphical Bioinformatics [5]) and is applied to similarity analyses of both DNA/RNA and of protein sequences. In particular, Nandy created one of the first two-dimensional graphical representations of DNA sequences [6], and this method was further developed [7]. Randić et al. proposed a graphical representation of DNA sequences in the form of four horizontal lines [8]. Randić et al. also introduced a representation of proteins based on generalized star graphs, i.e., graphs with one vertex of a maximal degree in the center to which are attached other vertices of either one or two degrees [9]. Cao et al. reduced a DNA sequence into four three-dimensional graphical representations based on the chemical properties of the neighboring dual nucleotides [10]. Jafarzadeh and Iranmanesh introduced a 3D graphical representation of DNA sequences based on codons [11]. Mu et al. proposed a graphical representation of protein sequences called the 3D-PAF curve. The method takes into account the accumulative frequencies of the adjacent amino acids of the protein sequence [12]. 
We introduced a graphical representation of DNA sequences in the form of a spectrum [13]. Zhang and Wen constructed a two-dimensional graphical representation of protein sequences based on a selected pair of the physicochemical properties of amino acids: the isoelectric point and the hydropathy index [14]. Mahmoodi-Reihani et al. took into account twelve physicochemical properties of amino acids for the construction of a new two-dimensional graphical representation of protein sequences [15]. Li et al. represented the protein sequence as a time series in eleven-dimensional space [16].

The problem of similarity is interdisciplinary and can be a source of information in various fields of science [17]. In particular, we consider different types of objects, for example groups of individuals [18], molecular spectra [19], or DNA sequences [20], examining different features of their similarity.

Two or more objects can be very similar to each other due to one feature, and yet very different if we consider another feature. Each computational method reveals different features of object similarity. Therefore, the development of mathematical methods is particularly important.

Until relatively recently, mathematical methods in the biological and medical sciences were applied mainly to the statistical analysis of measurement results and to the description of how experimental devices function. Over roughly the past two decades, applications of mathematics to the modeling of biological processes and to the description of the structure of biomolecules have opened a new window onto their properties. These mathematical methods use a specific language and a specific set of notions. Some of them are alien to traditional biology and medicine, but this does not imply that they are useless. Many new achievements in biology and medicine would not be possible without the support provided by mathematical modeling [21,22].

In this work, we use the alignment-free bioinformatics method we introduced (called 20D-Dynamic Representation of Protein Sequences) to analyze the similarity between protein sequences [23]. This approach is applied to two datasets: Baculovirus [24] and Echinococcus multilocularis [25,26] genome sequences. In addition, the quality of the graphical representation is discussed using the same model sequences as in the literature [14,27,28]. As a result, our graphs clearly show the differences between sequences and directly show the arrangement of amino acids in a given sequence. In [23], we analyzed the similarity relations between the sequences, presented as phylogenetic trees and similarity matrices, for the NADH dehydrogenase subunit 5 (ND5) and for the NADH dehydrogenase subunit 6 (ND6) protein sequences of different species. In the present article the similarity relations are also visualized using classification maps whose axes are represented by 20D-Dynamic Representation of Protein Sequences descriptors. Some classification maps are obtained using a combination of this alignment-free method and the Principal Component Analysis.

Baculoviruses constitute a family of viruses, primarily targeting arthropods, with extensively studied hosts including Lepidoptera, Hymenoptera, and Diptera. The family currently encompasses 85 species distributed among four genera.

Baculoviruses infect insects, with a host range of over 600 species. Immature forms of lepidopteran species, such as moths and butterflies, are the most common hosts, but infections have also been observed in sawflies and mosquitoes. Although Baculoviruses can enter mammalian cells in culture, they do not replicate in mammalian or other vertebrate animal cells.

Historically, beginning in the 1940s, Baculoviruses have been extensively employed and investigated as biopesticides within agricultural settings [29]. Characterized by a circular, double-stranded DNA (dsDNA) genome ranging from 80 to 180 kbp, these viruses represent a significant area of interest in both scientific and practical domains [30,31,32].

Baculoviruses have been known for centuries: the earliest mentions date to 16th century literature describing a “wilting disease” of silkworm larvae. From the 1940s onward, they were widely used and studied as biopesticides in agriculture. Since the 1990s, they have also been used in insect cell cultures to produce complex eukaryotic proteins. These recombinant proteins have found many applications in research and in both human and veterinary medicine. Most notably, they are used to manufacture vaccines, for example a prominent H5N1 avian influenza vaccine for poultry, which is produced in a Baculovirus expression vector. More recently, Baculoviruses have been shown to transduce mammalian cells with a suitable promoter [33].

E. multilocularis is the causative agent of Alveolar Echinococcosis (AE), a highly dangerous zoonotic infection. In the absence of proper medical intervention, AE is potentially fatal, with a mortality rate exceeding 90% [34,35,36].

Human susceptibility to infection is influenced by various behavioral and socio-economic factors that increase contact with E. multilocularis eggs. These risk factors and the geographic distribution of AE vary between countries and regions. Meta-analyses of case-control and cross-sectional studies in endemic regions of Central Europe, North America (Alaska), and Asia (China) have identified common potential risk factors for human AE, including gender (female), age (over 20), dog ownership, interaction with dogs, gardening, forest visits, agricultural occupations, hunting, handling foxes, and low education/income levels [37].

Due to the asexual reproduction of the metacestode and rare cross-fertilization events between mature tapeworms, a significant challenge in E. multilocularis studies lies in accurately estimating its genetic variation. The intraspecific genetic variation of E. multilocularis has been estimated by employing various markers, including mitochondrial (cytb, nad1/nad2, cox1, 12s rRNA, and ATPase) and nuclear (ef1a, elpexons VII/VIII, and elp-exon IX) genes and microsatellites (EMms1 and EMms2, EmsJ and EmsK, EmsB), to analyze isolates from a broad geographic range including areas of Asia, North America, and Europe [38,39,40]. The mitogenome has proven instrumental in elucidating the phylogeny of Echinococcus spp. due to its haploid nature, high copy number per cell, and faster evolutionary rate compared to nuclear markers [41,42]. Early mitochondrial studies identified two distinct geographical genotypes (haplotypes) among European isolates (M1) and isolates from China, Japan, Alaska, and North America (M2) [43,44]. Subsequent sequencing efforts targeting three mitochondrial loci revealed as many as 17 regional haplotypes in a set of isolates from Europe, Asia, and North America [41]. Meta-analysis combining data from commonly used mitochondrial genes, cox1 and cytb, unveiled a notable level of genetic diversity in E. multilocularis isolates derived from rodents, wild canids, and humans [45]. The highest diversity in haplotypes and haplotype diversity was observed within Asian and European populations. A cladistic phylogenetic tree suggested a sister relationship between the European and Asian clades, with only select North American haplotypes clustering with the European clade, including haplotype Em46 from Poland. In another study conducted on voles and plateau Pikas in Qinghai Province, China, it was found that the genetic diversity of E. multilocularis, based on mitochondrial genes within a small local area, is at a low level. 
However, between different regions with long distances and varying ecological environments, the genetic diversity is relatively high [46]. In the most recent pan-European study, five concatenated mitochondrial markers identified 43 haplotypes, and two main clusters possibly reflecting the post-glacial history of the key definitive hosts. In the seminal assessment of E. multilocularis genetic diversity in Poland, three mitochondrial genes (cytb, nad2, cox1) analyzed together divided worms from red foxes into fifteen haplotypes (EmPL1-15), some of them restricted to particular provinces [25]. For the first time, a haplotype of Asian origin was detected in north-eastern Poland, a phenomenon later investigated in detail by combining microsatellite profiling with mitochondrial genotyping to reveal Euro-Asian hybrids [47]. Inspection of metacestodes from AE patients living in north-eastern Poland provided indirect proof of environmental contamination with the eggs of worms with Asian genetic elements and their circulation in the synanthropic cycle, raising a question of differences in virulence between European and Asian variants. Polymorphism of E. multilocularis isolates from various locations across Europe, North America, and Asia was also assessed using a multilocus microsatellite EmsB and single-locus microsatellites EmsJ, EmsK, and NAK1 [38,39]. The highest polymorphism was observed for EmsB (29 profiles) compared to NAK1 (7 genotypes) and the combined EmsJ/EmsK (3 genotypes). EmsB exhibits a distinct (CA)n(GA)n pattern that varies from one worm to another, leading to an identification of different profiles [39]. This highly polymorphic marker has been used extensively to assess the genetic variation of the parasite in Europe and Eurasia [40,48], at the national level [49,50], and on the local scale [51,52,53]. For instance, at a local scale within an endemic region of France, covering a studied area of 900 m², 6 EmsB profiles were identified among 79 E. multilocularis-positive foxes [51]. The distribution of assemblage profiles among the fox population was not uniform, with a notable frequency of mixed infections observed. Furthermore, an increase in the number of worms analyzed per fox corresponded to an increase in detected profiles with the EmsB152 target, indicating the potential for a high dispersion of eggs of various genotypes within a single fox’s home range. At the national level, studies in France identified a total of 22 profiles [54]. In general, thirty-two EmsB profiles (G1-G32) were found in endemic regions across Europe, some of them locally predominant [40]. The continental distribution of profiles has been studied primarily to construct scenarios related to E. multilocularis geographical spread over time, from historical to newly described endemic areas, with Poland hypothesized to be one of the more recently colonized peripheral areas [40,55]. The resolution of EmsB proved discriminatory enough to investigate genetic variation and possible dispersal pathways of the parasite throughout Polish provinces [50]. In this study, 29 distinct profiles were identified (with the most common one accounting for 44.9% of all worms) showing regional differences in distribution. Microsatellite markers were utilized not only to assess the genetic diversity of E. multilocularis isolated from wild definitive and intermediate hosts but also those isolated from humans. Bretagne et al. conducted genotyping of human AE lesions based on the pentameric microsatellite upstream from the coding region of the U1 snRNA gene [56]. As per their findings, human AE lesions were categorized into three profiles based on their geographic origin, with all European patients clustering together, showing an unsatisfactory resolution [56]. In a recent investigation by Knapp et al., metacestode samples from 63 AE patients from France, Switzerland, Germany, and Belgium were examined [57]. 
In total, nine EmsB profiles were identified providing insights into potential transmission pathways to humans. These profiles appeared to exhibit stability over time, as similar profiles were detected in patients who recently underwent surgery and those who underwent surgery years ago. Patients residing in the same geographic area demonstrated identical or similar profiles. Some of these profiles had already been locally described in animals, while others appeared to be widespread among foxes in Europe [40,58]. This implies that individuals living in areas of high endemicity may have been infected within their residential environment, either with locally circulating profiles or with profiles more commonly found across Europe. An EmsB/cox1 study of human and animal samples from rural Sichuan, China, also indicates a link between high human AE incidence and local contamination by dogs [59]. These limited findings, although significant, are still insufficient to draw far-reaching conclusions about the distribution of E. multilocularis assemblages in humans, warranting further studies encompassing various countries and localities.

2. Theory

In the current study, we employ a bioinformatics method named 20D-Dynamic Representation of Protein Sequences, which we introduced to conduct a similarity/dissimilarity analysis of biological sequences [23]. The descriptors characterizing the graphs representing the sequences bear resemblance to those utilized in dynamics. The set of descriptors defined for a single sequence serves as a base dataset for analyzing potential biomedical applications.

To construct a graph representing the sequence, we utilize the convention of a shift (“walk”) in the 20D space. The shift originates from the origin of the 20D coordinate system. Each step in the shift is defined by the unit vector corresponding to the relevant amino acid in the sequence. At the end of the vector, a “material point” (“point mass”) with m = 1 is situated. The first shift corresponds to the first amino acid in the sequence. The second shift, representing the second amino acid in the sequence, starts from the end of the first vector. The procedure is repeated until the end of the sequence. Consequently, an abstract 20D-dynamic graph is formed, consisting of material points in the 20D space. The arrangement of these material points in the space is dictated by the sequence represented by the graph. Details regarding the assignment of the axes of the 20-dimensional coordinate system to the amino acids (representation of amino acids by unit basis vectors in the 20D space) are given in Table 1.

For instance, Alanine (A) is represented by the unit vector (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), while Cysteine (C) is represented by (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), and so forth.
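The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `dynamic_graph_20d` is hypothetical, and the axis order is an assumption (alphabetical one-letter codes, which agrees with A and C taking the first two axes; the authoritative assignment is the one in Table 1).

```python
# Assumed axis order: alphabetical one-letter codes (A -> axis 1, C -> axis 2, ...).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dynamic_graph_20d(sequence):
    """Return the material points of the walk as a list of 20-tuples."""
    position = [0] * 20          # the walk starts at the origin
    points = []
    for aa in sequence:
        # One unit step along the axis assigned to this amino acid.
        position[AMINO_ACIDS.index(aa)] += 1
        # A point mass (m = 1) is placed at the end of the step.
        points.append(tuple(position))
    return points

points = dynamic_graph_20d("MTMH")
```

For the model sequence MTMH the graph has four material points, and the last one has coordinate 2 on the M axis and 1 on each of the T and H axes.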

To graphically represent protein sequences, we project 20D-dynamic graphs onto 2D or 3D spaces. Mass distributions in 3D/2D spaces provide information on the positions of three/two amino acids in the sequences, enabling graphical comparisons. An example of the 20D-dynamic graph and its projection into 3D space for the model sequence MTMHTT can be found in [23]. A similar example of the 20D-dynamic graph and its projection into 2D space for the model sequence MTMH is shown in Table 2 and Table 3, respectively.
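A projection onto a 2D plane keeps only the two coordinates that belong to a chosen pair of amino acids. The sketch below is illustrative only: `project_2d` is a hypothetical helper, and treating coinciding projected points as accumulated mass is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter

def project_2d(sequence, first, second):
    """Project the 20D walk onto the plane spanned by two amino acid axes.

    Returns a Counter mapping (coordinate on `first`, coordinate on `second`)
    to the number of material points that land on that position.
    """
    a = b = 0
    masses = Counter()
    for aa in sequence:
        if aa == first:
            a += 1
        elif aa == second:
            b += 1
        # Every residue contributes one point mass, even if it moves
        # along an axis that is not visible in this plane.
        masses[(a, b)] += 1
    return masses

hm_graph = project_2d("MTMH", "H", "M")
```

For MTMH, the step along the T axis is invisible in the HM plane, so two point masses coincide at (0, 1).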

For the numerical characterization of the 20D-dynamic graphs we apply quantities analogous to those used in dynamics: the coordinates of the centers of mass and the principal moments of inertia of the 20D-dynamic graphs.

In dynamics, commonly used descriptors of the distribution of mass in a rigid body are its center of mass and its moment of inertia. The moment of inertia of a rigid body around a specific axis indicates the resistance of this body to angular acceleration around this axis. When the mass is concentrated closer to the axis, the corresponding moment of inertia is smaller and it is easier to speed up or slow down the rotation. The moments of inertia associated with rotations around the principal axes are referred to as the principal moments of inertia. The concept of the moment of inertia can be generalized to characterize distributions of other quantities. In particular, if we assign a “mass” to each point of the 20D-dynamic graph, then we can define the moment of inertia of the graph, an analogue of the standard moment of inertia.

The coordinates of the center of mass of the 20D-dynamic graph are defined as

μ^{k} = \frac{\sum_{i = 1}^{N} m_{i} x_{i}^{k}}{\sum_{i = 1}^{N} m_{i}}, \quad k = 1, 2, \dots, 20, \qquad (1)

where x_{i}^{k} are the coordinates, determined by the shifts, of the material points with masses m_{i}. Since the mass of each individual material point is equal to one, the total mass of the sequence is equal to the length of the sequence N, i.e.,

N = \sum_{i = 1}^{N} m_{i}. \qquad (2)

N is also the number of the material points in the 20D-dynamic graph. As a consequence, from Equations (1) and (2) it follows that

μ^{k} = \frac{1}{N} \sum_{i = 1}^{N} x_{i}^{k}, \quad k = 1, 2, \dots, 20. \qquad (3)
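As a quick numerical check of Equation (3), the center of mass can be computed by averaging the coordinates of the material points. The sketch below is an illustration under the same assumption as before (axes in alphabetical order of the one-letter codes); `center_of_mass` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order

def center_of_mass(sequence):
    """Equation (3): the mean of the point coordinates on each of the 20 axes."""
    n = len(sequence)
    position = [0] * 20
    totals = [0] * 20
    for aa in sequence:
        position[AMINO_ACIDS.index(aa)] += 1   # next material point
        for k in range(20):
            totals[k] += position[k]           # accumulate x_i^k over i
    return [t / n for t in totals]

mu = center_of_mass("MTMH")
```

For MTMH the M coordinates of the four points are 1, 1, 2, 2, so the M component of the center of mass is 1.5; the T and H components are 0.75 and 0.25, and all other components are zero.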

The tensor of the moment of inertia in the 20D space is expressed by the symmetric matrix

\hat{I} = \begin{pmatrix}
I_{1\,1} & I_{1\,2} & \dots & I_{1\,k} & \dots & I_{1\,20} \\
I_{2\,1} & I_{2\,2} & \dots & I_{2\,k} & \dots & I_{2\,20} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
I_{j\,1} & I_{j\,2} & \dots & I_{j\,k} & \dots & I_{j\,20} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
I_{20\,1} & I_{20\,2} & \dots & I_{20\,k} & \dots & I_{20\,20}
\end{pmatrix} \qquad (4)

with the matrix elements

I_{jj} = \sum_{i = 1}^{N} m_{i} \sum_{k = 1}^{20} \left[ \hat{x}_{i}^{k} (1 - δ_{jk}) \right]^{2}, \qquad (5)

and

I_{jk} = I_{kj} = - \sum_{i = 1}^{N} m_{i} \hat{x}_{i}^{j} \hat{x}_{i}^{k}, \quad j \neq k. \qquad (6)

Here

δ_{jk} = \begin{cases} 1, & j = k, \\ 0, & j \neq k \end{cases}

is the Kronecker delta and \hat{x}_{i}^{k} are the coordinates of m_{i} in the Cartesian coordinate system with the origin chosen at the center of mass:

\hat{x}_{i}^{k} = x_{i}^{k} - μ^{k}. \qquad (7)

The eigenvalue problem of the tensor of inertia, with the eigenvalues I_{k} and the eigenvectors ω_{k}, is defined as

\hat{I} ω_{k} = I_{k} ω_{k}. \qquad (8)

The eigenvalues are derived through the solution of the twentieth-order secular equation

\det (\hat{I} - I \hat{E}) = 0, \qquad (9)

where \hat{E} is the 20 \times 20 unit matrix. The eigenvalues I_{k}, k = 1, 2, \dots, 20, are called the principal moments of inertia.
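Equations (4) to (9) can be checked numerically. The sketch below uses NumPy and is only an illustration: `inertia_tensor` is a hypothetical name, the axis order is assumed to be alphabetical by one-letter code, and `numpy.linalg.eigvalsh` stands in for whatever eigenvalue solver the authors used. With unit masses, the tensor can be written as trace(C)·E − C, where C is the matrix of second moments of the centered coordinates; this reproduces the diagonal elements of Equation (5) and the off-diagonal elements of Equation (6).

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order

def inertia_tensor(sequence):
    """Return the 20 x 20 tensor of Equations (4)-(6) for unit point masses."""
    n = len(sequence)
    steps = np.zeros((n, 20))
    steps[np.arange(n), [AMINO_ACIDS.index(aa) for aa in sequence]] = 1.0
    points = np.cumsum(steps, axis=0)          # material points of the walk
    centered = points - points.mean(axis=0)    # Equation (7)
    second_moments = centered.T @ centered     # sum over i of x_i^j * x_i^k
    # Diagonal: sum over all axes except j; off-diagonal: minus the mixed sums.
    return np.trace(second_moments) * np.eye(20) - second_moments

tensor = inertia_tensor("MTMH")
principal_moments = np.linalg.eigvalsh(tensor)  # real eigenvalues, ascending
```

`eigvalsh` exploits the symmetry of the tensor and returns the 20 principal moments in ascending order; how the paper orders and indexes them is not specified here.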

Figure 1 shows a 2D graph for the MTMH model sequence. This is the x^{7} x^{11}-graph, i.e., the HM-graph, according to Table 1. Additionally, the center of mass and the projections of the three eigenvectors multiplied by the corresponding eigenvalues are marked. These are the only non-zero projections onto this 2D plane created from the 20 possible products of the eigenvectors and eigenvalues obtained for this example.

Finally, as sequence descriptors we apply the normalized principal moments of inertia

r_{k} = \sqrt{\frac{I_{k}}{N}}. \qquad (10)

We also consider

D = \sum_{k = 1}^{20} r_{k}. \qquad (11)

As a similarity measure between a pair of sequences labeled α and β we use

S (α, β) = S (β, α) = \left| \frac{D^{α} - D^{β}}{D^{α} + D^{β}} \right|. \qquad (12)

This similarity measure is dimensionless and normalized: the similarity values vary between 0 and 1.
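Equations (10) to (12) can be chained into a short numerical sketch. The names `descriptor_d` and `similarity` are hypothetical, the axis order is again assumed to be alphabetical by one-letter code, and the two short model sequences are used only to exercise the code, not to reproduce any value reported in the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order

def descriptor_d(sequence):
    """Return D of Equation (11) for one amino acid sequence."""
    n = len(sequence)
    steps = np.zeros((n, 20))
    steps[np.arange(n), [AMINO_ACIDS.index(aa) for aa in sequence]] = 1.0
    points = np.cumsum(steps, axis=0)
    centered = points - points.mean(axis=0)
    second_moments = centered.T @ centered
    tensor = np.trace(second_moments) * np.eye(20) - second_moments
    # Clip tiny negative values caused by rounding before taking the root.
    principal = np.clip(np.linalg.eigvalsh(tensor), 0.0, None)
    r = np.sqrt(principal / n)        # Equation (10)
    return float(r.sum())             # Equation (11)

def similarity(d_alpha, d_beta):
    """Equation (12); undefined when both D values are zero."""
    return abs(d_alpha - d_beta) / (d_alpha + d_beta)

d_short = descriptor_d("MTMH")
d_long = descriptor_d("MTMHTT")
s = similarity(d_short, d_long)
```

Because every r_{k} is non-negative, D is non-negative and S stays between 0 and 1; identical D values give S = 0.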

3. Results and Discussion

Let us consider two short segments of proteins from yeast (P00729) and Saccharomyces cerevisiae (P38109). These two segments, taken from Randić [27], have been tested in several graphical representation methods [14,28]. The corresponding sequences are:

Protein1: WTFESRNDPAKDPVILWLNGGPGCSSLTGL;
Protein2: WFFESRNDPANDPIILWLNGGPGCSSFTGL.

Figure 2 shows 2D graphs for these two sequences. Zhang and Wen [14] claim that their graphs are more visible, intuitive, and easy to analyze compared to those presented by Yu et al. [28]. Our graphs are also visible and intuitive because they are directly related to the distributions of individual amino acids in the sequences.

Table 4 shows values of r_{k} (Equation (10)) for the two sequences and for:

Protein3: WFFESRNDPANDPIILWLNGGPGCSSFTGF.

Protein2 and Protein3 sequences differ by only one amino acid. The last amino acid in Protein2 sequence—Leucine (L)—is changed to Phenylalanine (F). This small difference is recognized by the descriptor values (they are slightly different—columns 2 and 3 in the table).

The 20D-Dynamic Representation of Protein Sequences is also used to analyze the similarities between helicase protein sequences of twelve Baculoviruses. We use the same dataset that has been analyzed using a different alignment-free bioinformatics method in [24]. Table 5 shows the lengths and the accession numbers of these amino acid sequences, which are available in GenBank. The method (Equations (10)–(12)) is applied to determine the symmetric similarity matrix (Table 6) and its representation via a phylogenetic tree (Figure 3). To obtain the phylogenetic tree, the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) implemented in the MEGA 6.0 package [60] is used. The input data for the UPGMA clustering algorithm (i.e., the data used to determine the topology and branch lengths of the tree) are the elements of our similarity matrix.
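To illustrate how a matrix built from Equation (12) can drive UPGMA, here is a small pure-Python stand-in. It is not the MEGA implementation used in the paper: the function names are hypothetical, the four D values are made up for the illustration, and the sketch returns only the tree topology (no branch lengths).

```python
def similarity_matrix(d_values):
    """Pairwise values of Equation (12) for a list of D descriptors."""
    return [[abs(a - b) / (a + b) for b in d_values] for a in d_values]

def upgma(labels, dist):
    """Average-linkage clustering; returns the topology as nested tuples."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda item: item[1])  # closest pair
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean = arithmetic mean over all original members.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]

labels = ["s1", "s2", "s3", "s4"]
d_values = [1.0, 1.1, 2.0, 2.1]            # hypothetical D descriptors
tree = upgma(labels, similarity_matrix(d_values))
```

With these illustrative values, s3 and s4 are merged first, then s1 and s2, and the two pairs are joined last.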

Based on the values given in Table 6, we can observe significant similarities between certain groups of Baculoviruses, namely (AcMNPV, RoMNPV, SeMNPV), (BmNPV, HearNPV, HzSNPV), (NeseNPV, MacoNPVA, MacoNPVB), and (CpGV, AdorGV, CrleGV). The greatest differences between Baculoviruses can be seen for the NeseNPV sequence (Figure 4 and Figure 5) for the considered protein. Figure 4 and Figure 5 show examples of 2D and 3D graphs (projections of 20D-dynamic graphs into 2D and 3D spaces), respectively. We find that the NeseNPV protein graphs shown in the figures differ from those of other species. Other authors have also obtained similar results [24]. In [24], Yao et al. introduced another alignment-free bioinformatics method. It is a 2D graphical representation of protein sequences based on amino acid classification. The authors consider two chemical properties that have significant relations with the structure of proteins: chirality and hydrophilicity of 20 amino acids. Based on this classification, 2D graphs representing the sequences are obtained. Geometric centers of the graphs are proposed as sequence descriptors. These descriptors are equivalent to the centers of mass in our method. The moments of inertia considered in our method can be treated as the corrections [19].

The 20D-Dynamic Representation of Protein Sequences is also used to analyze Echinococcus multilocularis sequence similarities. Sequences of three mitochondrial genes (cytochrome b (cob), NADH dehydrogenase subunit 2 (nad2), and cytochrome c oxidase subunit 1 (cox1)) are analyzed. Sequences, obtained in studies described previously [25,41], are used for the analyses (GenBank accession numbers: KY205662-KY205706, AB461395-AB461420). Additional sequence data specific to the current studies are given in [26], in which we investigated features of similarity revealed using 4D-Dynamic Representation of DNA/RNA Sequences, another bioinformatics method we introduced [20].

In [26], we considered DNA sequences (the sequences composed of 4 nucleobases). In this work, protein sequences are considered, i.e., sequences composed of 20 amino acids. The considered protein sequences are very similar to each other. Some examples illustrate 2D graphs (the projections of 20D-dynamic graphs onto a 2D plane), LV-graphs, and TV-graphs (Figure 6). The graphs almost overlap, but slight differences can also be observed.

Figure 7 and Figure 8 show the classification maps for the cob, nad2 and cox1 genes.

The maps in Figure 7 show clusters of points indicating similarities between sequences based on two different properties (represented by descriptors on the axes): r_{16} - r_{18} - r_{20} and r_{17} - r_{18} - r_{19}. For different genes and different descriptors we observe different clusters of points.

The classification maps shown in Figure 8 have been obtained using PCA for the center of mass coordinates and the normalized moments of inertia of 20D-dynamic graphs as input data. The similarity relations revealed by the clusters of points appearing on the maps also differ.
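The PCA step can be sketched with NumPy as follows. This is a minimal illustration under several assumptions: each sequence is described by a 40-component vector (20 center of mass coordinates followed by 20 normalized moments of inertia), the features are centered but not rescaled, the axis order is alphabetical by one-letter code, and PCA is done through a singular value decomposition. The names `features` and `pca_scores` are hypothetical, and the three model segments from this section serve only as input data; the paper does not state which PCA settings or software were used.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order

def features(sequence):
    """Center of mass coordinates followed by the normalized moments r_k."""
    n = len(sequence)
    steps = np.zeros((n, 20))
    steps[np.arange(n), [AMINO_ACIDS.index(aa) for aa in sequence]] = 1.0
    points = np.cumsum(steps, axis=0)
    mu = points.mean(axis=0)
    centered = points - mu
    second_moments = centered.T @ centered
    tensor = np.trace(second_moments) * np.eye(20) - second_moments
    principal = np.clip(np.linalg.eigvalsh(tensor), 0.0, None)
    return np.concatenate([mu, np.sqrt(principal / n)])

def pca_scores(data, n_components=2):
    """Project the rows of `data` onto the leading principal components."""
    data_centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(data_centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

sequences = [
    "WTFESRNDPAKDPVILWLNGGPGCSSLTGL",   # Protein1
    "WFFESRNDPANDPIILWLNGGPGCSSFTGL",   # Protein2
    "WFFESRNDPANDPIILWLNGGPGCSSFTGF",   # Protein3
]
data = np.vstack([features(seq) for seq in sequences])
scores = pca_scores(data)
```

Each row of `scores` gives the position of one sequence on a two-axis classification map; sequences with similar descriptors land close to each other.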

Figure 9 and Figure 10 show the corresponding phylogenetic trees obtained using different descriptors (Figure 9—the coordinates of the centers of mass of the 20D-dynamic graphs and Figure 10—the normalized principal moments of inertia of the 20D-dynamic graphs). The following clusters of points appear in these trees:

cob gene, coordinates of the centers of mass, Figure 9 top panel:
- (CHM)
- (A-A)
- (14)
- (Jap)
- (Fra)
- (7,15)
- (A-I)
- (Kaz,Slo,Aus,1,2,3,4,5,6,8,9,10,11,12,13);
nad2 gene, coordinates of the centers of mass, Figure 9 middle panel:
- (CHM)
- (A-I)
- (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
- (A-A)
- (Kaz,Slo,Aus,Fra)
- (Jap)
- (CHS);
cox1 gene, coordinates of the centers of mass, Figure 9 bottom panel:
- (CHM)
- (8)
- (13,Aus)
- (Fra)
- (A-A)
- (A-I)
- (Kaz)
- (14)
- (Jap)
- (1,7)
- (6)
- (CHS,Slo,2,3,4,5,9,10,11,12,15);
cob gene, normalized principal moments of inertia, Figure 10 top panel:
- (CHM)
- (A-I)
- (A-A)
- (7,15)
- (Fra)
- (Jap)
- (14)
- (Kaz,Slo,Aus,1,2,3,4,5,6,8,9,10,11,12,13);
nad2 gene, normalized principal moments of inertia, Figure 10 middle panel:
- (A-I)
- (CHM)
- (CHS)
- (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
- (Kaz,Slo,Aus,Fra)
- (Jap)
- (A-A);
cox1 gene, normalized principal moments of inertia, Figure 10 bottom panel:
- (8)
- (Jap)
- (13,Aus)
- (6)
- (1,7)
- (CHS,Slo,2,3,4,5,9,10,11,12,15)
- (14)
- (A-A)
- (A-I)
- (CHM)
- (Fra)
- (Kaz).

The number of clusters present in the phylogenetic trees is equal to 8 for the cob gene, both when using the center of mass coordinates (Figure 9 top panel) and when using the normalized principal moments of inertia (Figure 10 top panel). The clusters are exactly the same in both cases. Some clusters contain only one element. Such a single-element cluster means that its sequence has small similarity values with respect to the sequences belonging to all other clusters. Sequences belonging to one cluster are characterized by large similarity values.

Similar observations can be made for the nad2 gene (seven identical clusters for both center of mass coordinates and normalized principal moments of inertia—Figure 9 and Figure 10, middle panels). Similarly, for the cox1 gene, the same number of identical clusters (12) appeared in the phylogenetic trees (Figure 9 and Figure 10, bottom panels).

The clusters of points appearing in both phylogenetic trees (based on $r_{k}$ and $\mu^{k}$; Figure 9 and Figure 10) are identical, but the structures of both phylogenetic trees (relations between different clusters) are different. These relations correspond to the similarity values between the sequences. To obtain the similarity values, the similarity matrix can be calculated in the same way as for Baculoviruses (Table 6). When calculating the similarity matrix for the center of mass coordinates, $r_{k}$ should be replaced with $\mu^{k}$ in Equation (11).

In the current research, the clusters of points are completely different from those observed in our previous research [26]. At the level of DNA sequences, we confirmed that one Polish haplotype (9) is of Asian origin [26], a result originally reported in [25]. At the level of amino acid sequences considered in the present work, we do not observe any cluster that groups haplotype 9 with the Asian sequences.

This is probably because the mutations in the genes that determine the division into separate clusters (European and Asian) do not cause significant changes in the proteins they encode. One of the fundamental features of the genetic code is that it is degenerate: different codons (a codon is a sequence of three nucleotides that constitutes the unit coding for a particular amino acid) can code for the same amino acid, so most amino acids can be encoded in several ways. As a consequence, some changes in genetic information resulting from mutations are not reflected in the amino acid sequence.
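The effect of degeneracy can be illustrated with a minimal sketch. The codon excerpt below covers only three four-fold degenerate families of the standard genetic code (GCN for alanine, GGN for glycine, GTN for valine). The two nucleotide sequences are hypothetical and are not taken from the data analyzed here, and the mitochondrial code used by the genes studied differs from the standard code in some codons, which this excerpt does not model.

```python
# Excerpt of the standard genetic code (illustration only, not a full table).
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",  # valine
}

def translate(dna):
    """Translate a DNA string codon by codon using the excerpt above."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))

def count_differences(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

reference = "GCTGGTGTT"  # hypothetical sequence
mutant = "GCCGGAGTG"     # hypothetical: one third-position change per codon

dna_differences = count_differences(reference, mutant)
protein_differences = count_differences(translate(reference), translate(mutant))

print(dna_differences, protein_differences)
```

In this toy case the two DNA strings differ at three positions while their translations are identical, so a DNA-level descriptor can separate the sequences while a protein-level descriptor cannot.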

4. Conclusions

Model studies of protein segments from the yeast Saccharomyces cerevisiae indicate the high graphical and numerical quality of our method. We plan to apply this method in the future to interpret our new experimental data, which are currently in preparation.

In this work, the 20D-Dynamic Representation of Protein Sequences was applied to two datasets: Baculovirus and E. multilocularis genome sequences. The Baculovirus data are model data that have also been used by other authors in articles on mathematical modeling in bioinformatics. Yao et al. obtained similar results using another alignment-free method [24]. The data for Echinococcus are our experimental data from [25]. The genes encoding cox1, nad2, and cob are mitochondrial genes which, due to their characteristics, are very often used in analyses of the genetic diversity of Echinococcus and of many other organisms. We recently performed computations on the same Echinococcus data at the DNA level using a similar approach, also proposed by us, called 4D-Dynamic Representation of DNA/RNA Sequences [26]. Using the 4D method, we obtained results similar to those of standard alignment-based methods [25], i.e., we showed that one of the Polish haplotypes (9) is of Asian origin. In this work, we investigated similarity relations at the level of amino acid sequences. As expected, the point clusters obtained from the 20D-Dynamic Representation of Protein Sequences descriptors differ from those obtained from the DNA sequence descriptors. The dispersion of the results obtained from various methods appears to be quite significant, since different methods may capture distinct features of sequence similarity.

Generally, different descriptors may be important depending on the context, and this observation forms the foundation for QSAR studies. In the realm of molecules, a similar approach yields computational techniques known as Quantitative Structure–Activity Relationship (QSAR) and Quantitative Structure–Property Relationship (QSPR), extensively employed in computational pharmacology, toxicology, and eco-toxicology [61,62]. QSAR/QSPR operates on the premise that the properties of chemical compounds correlate systematically with numerical indices representing molecular structural features, termed molecular descriptors. These descriptor values, derived from available experimental data, can be applied to predict the properties of numerous structurally similar molecules. For instance, we have demonstrated the utility of moments of inertia of molecular spectra as descriptors in QSAR studies for predicting environmentally relevant properties of chloronaphthalenes [63]. By using descriptors from the 20D-Dynamic Representation of Protein Sequences, we enhanced the numerical characterization of the entities under investigation.

Each method describes different features of similarity. In particular, our method reveals features of similarity that are mathematically described using various descriptors characterizing 20D-dynamic graphs representing protein sequences (values analogous to dynamics—centers of mass and principal moments of inertia). Non-standard applications of the moment of inertia tensor and centers of mass were already introduced for the description of DNA/RNA sequences. For example, these methods have been applied to the bioinformatics characterization of the SARS-CoV-2 virus and the time evolution of the Zika virus [20,64]. The calculations supported the hypothesis that the SARS-CoV-2 virus may have originated in bats and pangolins [20]. Strong correlations with time have been obtained for the descriptors of the genome sequences of the Zika virus [64].

In this work, by comparing the moments of inertia of the graphs, we compare the shapes of the graphs representing the positions of amino acids in the sequences. Consequently, our description is more precise than those derived from traditional alignment-based methods, which neglect the asymmetry of the distributions of the aligned amino acids along the sequences. The proposed method is sensitive: a difference in a single amino acid can be recognized. Another advantage of this approach is its applicability to large datasets: all the descriptors are expressed analytically, and the computation time for a large number of long sequences is short. Note that the theory places no limit on sequence length.
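The descriptors mentioned above can be sketched in a few lines for the model sequence MTMH. This is a sketch under stated assumptions, not the implementation used in this work: the walk rule is inferred from Table 2, every material point is assumed to have unit mass, and the tensor follows the usual mechanical definition relative to the center of mass, $I_{kl} = \sum_{i} \left( \delta_{kl} \sum_{j} (\hat{x}_{i}^{j})^{2} - \hat{x}_{i}^{k} \hat{x}_{i}^{l} \right)$. The normalization of the principal moments used in the paper is not reproduced, and all function names are hypothetical.

```python
AXES = "ACDEFGHIKLMNPQRSTVWY"  # axis order of Table 1 (axis 1 = A, ..., axis 20 = Y)

def dynamic_graph(sequence):
    """Walk in 20D: each residue shifts the position by one unit along its axis."""
    position = [0] * len(AXES)
    points = []
    for residue in sequence:
        position[AXES.index(residue)] += 1
        points.append(tuple(position))  # assumed: a unit mass at the end of each shift
    return points

def center_of_mass(points):
    """Coordinates of the center of mass for unit masses."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def inertia_tensor(points):
    """Moment of inertia tensor relative to the center of mass (unit masses)."""
    dim = len(points[0])
    mu = center_of_mass(points)
    tensor = [[0.0] * dim for _ in range(dim)]
    for p in points:
        shifted = [p[k] - mu[k] for k in range(dim)]
        squared_norm = sum(c * c for c in shifted)
        for k in range(dim):
            for l in range(dim):
                diagonal_term = squared_norm if k == l else 0.0
                tensor[k][l] += diagonal_term - shifted[k] * shifted[l]
    return tensor

mu = center_of_mass(dynamic_graph("MTMH"))
tensor = inertia_tensor(dynamic_graph("MTMH"))

# A single substitution (H replaced by K) already changes the descriptors.
mu_substituted = center_of_mass(dynamic_graph("MTMK"))
print(mu != mu_substituted)
```

Because the tensor is symmetric, its principal moments could be obtained with a symmetric eigenvalue routine such as `numpy.linalg.eigvalsh`; that step is left out to keep the sketch free of external dependencies.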

Alignment-free bioinformatics methods are constructed to describe 20 amino acids [2,3,4,9,12,14,15,16,23,24,27,28]. The 20D-Dynamic Representation of Protein Sequences can, however, easily be extended to more dimensions if more than 20 amino acids are present in the sequence. The fundamental structure of the algorithm remains unchanged, and new basis vectors are introduced following the same procedure as for the original 20 dimensions. With the inclusion of selenocysteine and pyrrolysine, the representation becomes a 22D-dynamic graph. For the numerical characterization, the tensor of the moment of inertia in the 22D space should be considered. Consequently, the upper limits of all summations in the method must be adjusted to reflect the expanded dimensional space.
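A minimal sketch of this extension is given below, assuming the walk rule inferred from Table 2. The only change is the alphabet passed to the walk. The one-letter symbols U (selenocysteine) and O (pyrrolysine) are the standard ones, but appending them as axes 21 and 22 is an assumption made here for illustration, as are the function name and the test sequence MUTO.

```python
AXES_20 = "ACDEFGHIKLMNPQRSTVWY"  # Table 1
AXES_22 = AXES_20 + "UO"          # assumed order: U = axis 21, O = axis 22

def dynamic_graph(sequence, axes):
    """Dimension-agnostic walk: the space has one axis per symbol in `axes`."""
    position = [0] * len(axes)
    points = []
    for residue in sequence:
        position[axes.index(residue)] += 1  # unit shift along the residue's axis
        points.append(tuple(position))
    return points

points_22 = dynamic_graph("MUTO", AXES_22)  # hypothetical sequence
dimension = len(points_22[0])

print(dimension, points_22[-1])
```

Any descriptor written as a sum over the axes of `points_22` then runs over 22 components without further changes, which is the adjustment of the summation limits described above.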

Author Contributions

Methodology, D.B.-W., P.W. and A.B.; software, P.W., D.B.-W. and A.B.; formal analysis, D.B.-W., P.W. and A.B.; writing—original draft preparation, D.B.-W., A.B., A.L., P.G. and J.K.; visualization, P.W., A.B. and J.M.; data curation, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science Centre, Poland, grant No. 2020/37/B/NZ7/03934.

Data Availability Statement

The sequence data used for the calculations are available in GenBank.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ramanathan, N.; Ramamurthy, J.; Natarajan, G. Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison: A Review. Comb. Chem. High Throughput Screen. 2022, 25, 365–380.
2. Gupta, M.K.; Niyogi, R.; Misra, M. An alignment-free method to find similarity among protein sequences via the general form of Chou’s pseudo amino acid composition. SAR QSAR Environ. Res. 2013, 24, 597–609.
3. Li, Y.S.; Song, T.; Yang, J.S.; Zhang, Y.; Yang, J.L. An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids. PLoS ONE 2016, 11, e0167430.
4. Saw, A.K.; Tripathy, B.C.; Nandi, S. Alignment-free similarity analysis for protein sequences based on fuzzy integral. Sci. Rep. 2019, 9, 2775.
5. Randić, M.; Novič, M.; Plavšić, D. Milestones in graphical bioinformatics. Int. J. Quantum Chem. 2013, 113, 2413–2446.
6. Nandy, A. A new graphical representation and analysis of DNA sequence structure. I: Methodology and application to globin genes. Curr. Sci. 1994, 66, 309–314.
7. Nandy, A.; Dey, S.; Basak, S.C.; Bielińska-Wąż, D.; Wąż, P. Characterizing the Zika Virus Genome: A Bioinformatics Study. Curr. Comput. Aided Drug Des. 2016, 12, 87–97.
8. Randić, M.; Vračko, M.; Lerš, N.; Plavšić, D. Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation. Chem. Phys. Lett. 2003, 371, 202–207.
9. Randić, M.; Zupan, J.; Vikić-Topić, D. On representation of proteins by star-like graphs. J. Mol. Graph. Model. 2007, 26, 290–305.
10. Cao, Z.; Liao, B.; Li, R. A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int. J. Quantum Chem. 2008, 108, 1485–1490.
11. Jafarzadeh, N.; Iranmanesh, A. C-curve: A novel 3D graphical representation of DNA sequence based on codons. Math. Biosci. 2013, 214, 217–224.
12. Mu, Z.C.; Li, G.J.; Wu, H.Y.; Qi, X.Q. 3D–PAF Curve: A Novel Graphical Representation of Protein Sequences for Similarity Analysis. MATCH Commun. Math. Comput. Chem. 2016, 75, 447–462.
13. Bielińska-Wąż, D.; Wąż, P. Spectral-dynamic representation of DNA sequences. J. Biomed. Inform. 2017, 72, 1–7.
14. Zhang, Y.Y.; Wen, J. Similarity analysis of protein sequences based on a new graphical representation method. Commun. Inf. Syst. 2018, 18, 193–208.
15. Mahmoodi-Reihani, M.; Abbasitabar, F.; Zare-Shahabadi, V. A novel graphical representation and similarity analysis of protein sequences based on physicochemical properties. Phys. A Stat. Mech. Appl. 2018, 510, 477–485.
16. Li, C.C.; Dai, Q.; He, P.A. A time series representation of protein sequences for similarity comparison. J. Theor. Biol. 2022, 538, 111039.
17. Bielińska, A.; Majkowicz, M.; Bielińska-Wąż, D.; Wąż, P. A New Method in Bioinformatics: Interdisciplinary Similarity Studies. AIP Conf. Proc. 2019, 2116, 450013.
18. Bielińska, A.; Wąż, P.; Bielińska-Wąż, D. A Computational Model of Similarity Analysis in Quality of Life Research: An Example of Studies in Poland. Life 2022, 12, 56.
19. Wąż, P.; Bielińska-Wąż, D. Moments of Inertia of Spectra and Distribution Moments as Molecular Descriptors. MATCH Commun. Math. Comput. Chem. 2013, 70, 851–865.
20. Bielińska-Wąż, D.; Wąż, P. Non-standard bioinformatics characterization of SARS-CoV-2. Comput. Biol. Med. 2021, 131, 104247.
21. Ždímalová, M.; Chatterjee, A.; Kosnáčová, H.; Ghosh, M.; Obaidullah, S.M.; Kopáni, M.; Kosnáč, D. Various Approaches to the Quantitative Evaluation of Biological and Medical Data Using Mathematical Models. Symmetry 2022, 14, 7.
22. Liu, Y.; Wu, R.; Yang, A. Research on Medical Problems Based on Mathematical Models. Mathematics 2023, 11, 2842.
23. Czerniecka, A.; Bielińska-Wąż, D.; Wąż, P.; Clark, T. 20D-dynamic representation of protein sequences. Genomics 2016, 107, 16–23.
24. Yao, Y.H.; Kong, F.; Dai, Q.; He, P.A. A sequence-segmented method applied to the similarity analysis of long protein sequence. MATCH Commun. Math. Comput. Chem. 2013, 70, 431–450.
25. Karamon, J.; Stojecki, K.; Samorek-Pieróg, M.; Bilska-Zając, E.; Różycki, M.; Chmurzyńska, E.; Sroka, J.; Zdybel, J.; Cencek, T. Genetic diversity of Echinococcus multilocularis in red foxes in Poland: The first report of a haplotype of probable Asian origin. Folia Parasitol. 2017, 64, 7.
26. Bielińska-Wąż, D.; Wąż, P.; Lass, A.; Karamon, J. 4D-Dynamic Representation of DNA/RNA Sequences: Studies on Genetic Diversity of Echinococcus multilocularis in Red Foxes in Poland. Life 2022, 12, 877.
27. Randić, M. 2-D Graphical representation of proteins based on physico-chemical properties of amino acids. Chem. Phys. Lett. 2007, 440, 291–295.
28. Yu, J.F.; Sun, X.; Wang, J.H. A novel 2D graphical representation of protein sequence based on individual amino acid. Int. J. Quantum Chem. 2011, 111, 2835–2843.
29. Gelaye, Y.; Negash, B. The role of baculoviruses in controlling insect pests: A review. Cogent Food Agric. 2023, 9, 2254139.
30. Williams, T. Soil as an Environmental Reservoir for Baculoviruses: Persistence, Dispersal and Role in Pest Control. Soil Syst. 2023, 7, 29.
31. Rodríguez-Hernández, A.P.; Martínez-Flores, D.; Cruz-Reséndiz, A.; Padilla-Flores, T.; González-Flores, R.; Estrada, K.; Sampieri, A.; Camacho-Zarco, A.R.; Vaca, L. Baculovirus Display of Peptides and Proteins for Medical Applications. Viruses 2023, 15, 411.
32. Motta, L.F.; Cerrudo, C.S.; Belaich, M.N. A Comprehensive Study of MicroRNA in Baculoviruses. Int. J. Mol. Sci. 2024, 25, 603.
33. Lackner, A.; Genta, K.; Koppensteiner, H.; Herbacek, I.; Holzmann, K.; Spiegl-Kreinecker, S.; Berger, W.; Grusch, M. A bicistronic baculovirus vector for transient and stable protein expression in mammalian cells. Anal. Biochem. 2008, 380, 146–148.
34. Ammann, R.W.; Eckert, J. Clinical diagnosis and treatment of echinococcosis in humans. In Echinococcus and Hydatid Disease; Thompson, R.C.A., Lymbery, A.J., Eds.; CAB International: Wallingford, UK, 1995; pp. 411–463.
35. Sulima, M.; Nahorski, W.; Gorycki, T.; Wolyniec, W.; Wąż, P.; Felczak-Korzybska, I.; Szostakowska, B.; Sikorska, K. Ultrasound images in hepatic alveolar echinococcosis and clinical stage of the disease. Adv. Med. Sci. 2019, 64, 324–330.
36. Sulima, M.; Szostakowska, B.; Nahorski, W.; Sikorska, K.; Wolyniec, W.; Wąż, P. The usefulness of commercially available serological tests in the diagnosis and monitoring of treatment in patients with alveolar echinococcosis. Clin. Exp. Hepatol. 2019, 5, 327–333.
37. Conraths, F.J.; Probst, C.; Possenti, A.; Boufana, B.; Saulle, R.; La Torre, G.; Busani, L.; Casulli, A. Potential risk factors associated with human alveolar echinococcosis: Systematic review and meta-analysis. PLoS Negl. Trop. Dis. 2017, 11, e0005801.
38. Nakao, M.; Sako, Y.; Ito, A. Isolation of polymorphic microsatellite loci from the tapeworm Echinococcus multilocularis. Infect. Genet. Evol. 2003, 3, 159–163.
39. Knapp, J.; Bart, J.M.; Glowatzki, M.L.; Ito, A.; Gerard, S.; Maillard, S.; Piarroux, R.; Gottstein, B. Assessment of use of microsatellite polymorphism analysis for improving spatial distribution tracking of Echinococcus multilocularis. J. Clin. Microbiol. 2007, 45, 2943–2950.
40. Knapp, J.; Bart, J.M.; Giraudoux, P.; Glowatzki, M.L.; Breyer, I.; Raoul, F.; Deplazes, P.; Duscher, G.; Martinek, K.; Dubinsky, P.; et al. Genetic diversity of the cestode Echinococcus multilocularis in red foxes at a continental scale in Europe. PLoS Negl. Trop. Dis. 2009, 3, e452.
41. Nakao, M.; Xiao, N.; Okamoto, M.; Yanagida, T.; Sako, Y.; Ito, A. Geographic pattern of genetic variation in the fox tapeworm Echinococcus multilocularis. Parasitol. Int. 2009, 58, 384–389.
42. Spotin, A.; Gholami, S.; Nasab, A.N.; Fallah, E.; Oskouei, M.M.; Semnani, V.; Shariatzadeh, S.A.; Shahbazi, A. Designing and conducting in silico analysis for identifying of Echinococcus spp. with discrimination of novel haplotypes: An approach to better understanding of parasite taxonomic. Parasitol. Res. 2015, 114, 1503–1509.
43. Bowles, J.; McManus, D.P. NADH dehydrogenase 1 gene sequences compared for species and strains of the genus Echinococcus. Int. J. Parasitol. 1993, 23, 969–972.
44. Okamoto, M.; Bessho, Y.; Kamiya, M.; Kurosawa, T.; Horii, T. Phylogenetic relationships within Taenia taeniaeformis variants and other taeniid cestodes inferred from the nucleotide sequence of the cytochrome c oxidase subunit I gene. Parasitol. Res. 1995, 81, 451–458.
45. Spotin, A.; Boufana, B.; Ahmadpour, E.; Casulli, A.; Mahami-Oskouei, M.; Rouhani, S.; Javadi-Mamaghani, A.; Shahrivar, F.; Khoshakhlagh, P. Assessment of the global pattern of genetic diversity in Echinococcus multilocularis inferred by mitochondrial DNA sequences. Vet. Parasitol. 2018, 262, 30–41.
46. Li, J.Q.; Li, L.; Fan, Y.L.; Fu, B.Q.; Zhu, X.Q.; Yan, H.B.; Jia, W.Z. Genetic Diversity in Echinococcus multilocularis From the Plateau Vole and Plateau Pika in Jiuzhi County, Qinghai Province, China. Front. Microbiol. 2018, 9, 2632.
47. Umhang, G.; Knapp, J.; Wassermann, M.; Bastid, V.; Peytavin de Garam, C.; Boué, F.; Cencek, T.; Romig, T.; Karamon, J. Asian Admixture in European Echinococcus multilocularis Populations: New Data From Poland Comparing EmsB Microsatellite Analyses and Mitochondrial Sequencing. Front. Vet. Sci. 2021, 7, 620722.
48. Umhang, G.; Bastid, V.; Avcioglu, H.; Bagrade, G.; Bujanić, M.; Čabrilo, O.B.; Casulli, A.; Dorny, P.; van Der Giessen, J.; Guven, E.; et al. Unravelling the genetic diversity and relatedness of Echinococcus multilocularis isolates in Eurasia using the EmsB microsatellite nuclear marker. Infect. Genet. Evol. 2021, 92, 104863.
49. Casulli, A.; Széll, Z.; Pozio, E.; Sréter, T. Spatial distribution and genetic diversity of Echinococcus multilocularis in Hungary. Vet. Parasitol. 2010, 174, 241–246.
50. Umhang, G.; Karamon, J.; Hormaz, V.; Knapp, J.; Cencek, T.; Boué, F. A step forward in the understanding of the presence and expansion of Echinococcus multilocularis in Eastern Europe using microsatellite EmsB genotyping in Poland. Infect. Genet. Evol. 2017, 54, 176–182.
51. Knapp, J.; Guislain, M.-H.; Bart, J.M.; Raoul, F.; Gottstein, B.; Giraudoux, P.; Piarroux, R. Genetic diversity of Echinococcus multilocularis on a local scale. Infect. Genet. Evol. 2008, 8, 367–373.
52. Knapp, J.; Staebler, S.; Bart, J.M.; Stien, A.; Yoccoz, N.G.; Drögemüller, C.; Gottstein, B.; Deplazes, P. Echinococcus multilocularis in Svalbard, Norway: Microsatellite genotyping to investigate the origin of a highly focal contamination. Infect. Genet. Evol. 2012, 12, 1270–1274.
53. Knapp, J.; Umhang, G.; Wahlström, H.; Al-Sabi, M.N.S.; Ågren, E.O.; Enemark, H.L. Genetic diversity of Echinococcus multilocularis in red foxes from two Scandinavian countries: Denmark and Sweden. Food Waterborne Parasitol. 2019, 14, e00045.
54. Umhang, G.; Knapp, J.; Hormaz, V.; Raoul, F.; Boué, F. Using the genetics of Echinococcus multilocularis to trace the history of expansion from an endemic area. Infect. Genet. Evol. 2014, 22, 142–149.
55. Laurimaa, L.; Süld, K.; Moks, E.; Valdmann, H.; Umhang, G.; Knapp, J.; Saarma, U. First report of the zoonotic tapeworm Echinococcus multilocularis in raccoon dogs in Estonia, and comparisons with other countries in Europe. Vet. Parasitol. 2015, 212, 200–205.
56. Bretagne, S.; Assouline, B.; Vidaud, D.; Houin, R.; Vidaud, M. Echinococcus multilocularis: Microsatellite polymorphism in U1 snRNA genes. Exp. Parasitol. 1996, 82, 324–328.
57. Knapp, J.; Gottstein, B.; Bretagne, S.; Bart, J.-M.; Umhang, G.; Richou, C.; Bresson-Hadni, S.; Millon, L. Genotyping Echinococcus multilocularis in Human Alveolar Echinococcosis Patients: An EmsB Microsatellite Analysis. Pathogens 2020, 9, 282.
58. Debourgogne, A.; Goehringer, F.; Umhang, G.; Gauchotte, G.; Hénard, S.; Boué, F.; May, T.; Machouart, M. Primary cerebral alveolar echinococcosis: Mycology to the rescue. J. Clin. Microbiol. 2014, 52, 692–694.
59. Shang, J.Y.; Zhang, G.J.; Liao, S.; Yu, W.J.; He, W.; Wang, Q.; Huang, Y.; Wang, Q.; Long, Y.X.; Liu, Y.; et al. Low genetic variation in Echinococcus multilocularis from the Western Sichuan Plateau of China revealed by microsatellite and mitochondrial DNA markers. Acta Trop. 2021, 221, 105989.
60. Tamura, K.; Stecher, G.; Peterson, D.; Filipski, A.; Kumar, S. MEGA6: Molecular evolutionary genetics analysis version 6.0. Mol. Biol. Evol. 2013, 30, 2725–2729.
61. Schultz, T.W.; Cronin, M.T.D.; Walker, J.D.; Aptula, A.O. Quantitative structure–activity relationships (QSARs) in toxicology: A historical perspective. J. Mol. Struct. 2003, 622, 1–22.
62. Lapinska, N.; Paclawski, A.; Szlek, J.; Mendyk, A. Integrated QSAR Models for Prediction of Serotonergic Activity: Machine Learning Unveiling Activity and Selectivity Patterns of Molecular Descriptors. Pharmaceutics 2024, 16, 349.
63. Jagiełło, K.; Puzyn, T.; Wąż, P.; Bielińska-Wąż, D. Moments of Inertia of Spectra as Descriptors for QSAR/QSPR. In Topics in Chemical Graph Theory; Gutman, I., Ed.; University of Kragujevac: Kragujevac, Serbia, 2014; pp. 151–162.
64. Panas, D.; Wąż, P.; Bielińska-Wąż, D.; Nandy, A.; Basak, S.C. 2D-Dynamic Representation of DNA/RNA Sequences as a Characterization Tool of the Zika Virus Genome. MATCH Commun. Math. Comput. Chem. 2017, 77, 321–332.

Figure 1. $x^{7}x^{11}$-graph representing the MTMH model sequence (see text for details).

Figure 2. 2D graphs representing Protein1 and Protein2 sequences: FI-graph (left panel), FL-graph (middle panel), and FT-graph (right panel).

Figure 3. Phylogenetic tree for Baculoviruses.

Figure 4. KL-graphs representing Baculovirus sequences.

Figure 5. KLN-graphs representing Baculovirus sequences.

Figure 6. 2D graphs representing cob (top panel), nad2 (middle panel), and cox1 (bottom panel) genes. LV-graphs (left panel) and TV-graphs (right panel). Notations: 8—Polish haplotype, Fra—France, Jap—Japan (Hokkaido), A-I—USA, Indiana.

Figure 7. Classification maps for cob (top panel), nad2 (middle panel), and cox1 (bottom panel) genes: $r_{16}$-$r_{18}$-$r_{20}$ (left panel) and $r_{17}$-$r_{18}$-$r_{19}$ (right panel). Colors: blue = Europe excluding Poland; red = Asia; green = America; black = Poland. Detailed notations: $1, 2, \dots, 15$ = Polish haplotypes; A-A = USA, Alaska (St. Lawrence Island); A-I = USA, Indiana; Aus = Austria; CHM = China (Inner Mongolia); Fra = France; Jap = Japan (Hokkaido); Kaz = Kazakhstan; Slo = Slovakia.

Figure 8. Classification maps obtained using Principal Component Analysis for cob (top panel), nad2 (middle panel), and cox1 (bottom panel) genes. The notations are the same as in Figure 7.

Figure 9. Phylogenetic trees obtained using the coordinates of the centers of mass of the 20D-dynamic graphs for cob (top panel), nad2 (middle panel), and cox1 (bottom panel) genes. The notations are the same as in Figure 7.

Figure 10. Phylogenetic trees obtained using the normalized principal moments of inertia of the 20D-dynamic graphs for cob (top panel), nad2 (middle panel), and cox1 (bottom panel) genes. The notations are the same as in Figure 7.

Table 1. Assignment of the axes to the amino acids in the 20D space.

Axis No.	Amino Acid	Single-Letter Symbol
1	Alanine	A
2	Cysteine	C
3	Aspartic acid	D
4	Glutamic acid	E
5	Phenylalanine	F
6	Glycine	G
7	Histidine	H
8	Isoleucine	I
9	Lysine	K
10	Leucine	L
11	Methionine	M
12	Asparagine	N
13	Proline	P
14	Glutamine	Q
15	Arginine	R
16	Serine	S
17	Threonine	T
18	Valine	V
19	Tryptophan	W
20	Tyrosine	Y

Table 2. 20D-dynamic graph for a model sequence MTMH.

$m_{i}$	( $x_{i}^{1}$ , $x_{i}^{2}$ , $x_{i}^{3}$ , $x_{i}^{4}$ , $x_{i}^{5}$ , $x_{i}^{6}$ , $x_{i}^{7}$ , $x_{i}^{8}$ , $x_{i}^{9}$ , $x_{i}^{10}$ , $x_{i}^{11}$ , $x_{i}^{12}$ , $x_{i}^{13}$ , $x_{i}^{14}$ , $x_{i}^{15}$ , $x_{i}^{16}$ , $x_{i}^{17}$ , $x_{i}^{18}$ , $x_{i}^{19}$ , $x_{i}^{20}$ )
$m_{1}$	(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
$m_{2}$	(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0)
$m_{3}$	(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0)
$m_{4}$	(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0)

Table 3. HM-graph for the model sequence MTMH.

$m_{i}$	( $x_{i}^{7}$ , $x_{i}^{11}$ )
$m_{1}$	(0, 1)
$m_{2}$	(0, 1)
$m_{3}$	(0, 2)
$m_{4}$	(1, 2)
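The construction behind Tables 2 and 3 can be sketched as follows. The walk rule (a unit shift along the axis of each successive amino acid, with a material point placed at the end of each shift) is inferred from Table 2, and the function names are hypothetical.

```python
AXES = "ACDEFGHIKLMNPQRSTVWY"  # Table 1: axis 1 = A, ..., axis 20 = Y

def dynamic_graph(sequence):
    """Return the material points of the 20D-dynamic graph of a sequence."""
    position = [0] * len(AXES)
    points = []
    for residue in sequence:
        position[AXES.index(residue)] += 1  # unit shift along the residue's axis
        points.append(tuple(position))      # material point at the end of the shift
    return points

def project(points, first, second):
    """Project the graph onto the plane spanned by the axes of two amino acids."""
    a, b = AXES.index(first), AXES.index(second)
    return [(p[a], p[b]) for p in points]

graph = dynamic_graph("MTMH")          # rows of Table 2
hm_graph = project(graph, "H", "M")    # rows of Table 3 (axes 7 and 11)

for point in hm_graph:
    print(point)
```

Several material points can share the same projected coordinates, as the first two rows of Table 3 show, so a 2D graph carries less information than the full 20D graph.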

Table 4. $r_{k}$ values for Protein1, Protein2, and Protein3.

k	Protein1	Protein2	Protein3
1	2.96217	3.08383	3.08275
2	2.96217	3.08383	3.08275
3	2.96217	3.08383	3.08275
4	2.96217	3.08383	3.08275
5	2.96019	3.08383	3.08275
6	2.96014	3.08383	3.08275
7	2.95979	3.08110	3.08067
8	2.95899	3.08099	3.07986
9	2.95871	3.08049	3.07946
10	2.95784	3.08016	3.07909
11	2.95775	3.07934	3.07827
12	2.95643	3.07806	3.07770
13	2.95587	3.07622	3.07489
14	2.95046	3.07329	3.07272
15	2.94901	3.06775	3.06339
16	2.93858	3.06132	3.06032
17	2.92190	3.04625	3.04504
18	2.89850	3.00080	2.99511
19	2.79203	2.91024	2.90664
20	1.44066	1.50527	1.52230

Table 5. Sequence data for Baculoviruses.

No.	Species	Accession No.	Length
1	AcMNPV (Autographa californica nucleopolyhedrovirus)	AAA66725.1	1221
2	BmNPV (Bombyx mori nucleopolyhedrovirus)	AAC63764.1	1222
3	RoMNPV (Rachiplusia ou MNPV)	AAN28013.1	1221
4	HearNPV (Helicoverpa armigera nucleopolyhedrovirus)	AAK57882.1	1253
5	HzSNPV (Helicoverpa zea single nucleopolyhedrovirus)	AAL56093.1	1253
6	MacoNPVA (Mamestra configurata nucleopolyhedrovirus A)	AAM09201.1	1212
7	MacoNPVB (Mamestra configurata nucleopolyhedrovirus B)	AAM95079.1	1209
8	SeMNPV (Spodoptera exigua multiple nucleopolyhedrovirus)	AAB96630.1	1222
9	AdorGV (Adoxophyes orana granulovirus)	AAP85713.1	1138
10	CpGV (Cydia pomonella granulovirus)	AAK70750.1	1131
11	CrleGV (Cryptophlebia leucotreta granulovirus)	AAQ21676.1	1128
12	NeseNPV (Neodiprion sertifer nucleopolyhedrovirus)	AAQ96438.1	1143

Table 6. Similarity matrix for Baculoviruses (values multiplied by 1000).

Species	AcMNPV	BmNPV	RoMNPV	HearNPV	HzSNPV	MacoNPVA	MacoNPVB	SeMNPV	AdorGV	CpGV	CrleGV	NeseNPV
AcMNPV	0	4.55	0.24	7.38	7.86	11.31	13.40	1.93	31.15	44.52	36.82	22.26
BmNPV		0	4.79	2.83	3.31	15.86	17.94	6.49	35.70	49.06	41.36	26.81
RoMNPV			0	7.62	8.11	11.07	13.15	1.69	30.91	44.27	36.58	22.02
HearNPV				0	0.48	18.69	20.78	9.32	38.53	51.88	44.19	29.64
HzSNPV					0	19.17	21.26	9.80	39.01	52.36	44.67	30.12
MacoNPVA						0	2.09	9.38	19.85	33.22	25.52	10.96
MacoNPVB							0	11.46	17.77	31.14	23.44	8.87
SeMNPV								0	29.22	42.58	34.89	20.33
AdorGV									0	13.38	5.67	8.90
CpGV										0	0.77	22.27
CrleGV											0	14.57
NeseNPV												0
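Table 6 lists only the upper triangle of a symmetric matrix. The following sketch mirrors it into a full matrix and looks up the closest sequences, assuming that smaller entries mean more similar sequences, in line with the description of the similarity matrix as a distance matrix in the abstract. The values are copied from Table 6; the helper `nearest` is hypothetical.

```python
SPECIES = ["AcMNPV", "BmNPV", "RoMNPV", "HearNPV", "HzSNPV", "MacoNPVA",
           "MacoNPVB", "SeMNPV", "AdorGV", "CpGV", "CrleGV", "NeseNPV"]

# Entries to the right of the diagonal, row by row (values multiplied by 1000).
UPPER = [
    [4.55, 0.24, 7.38, 7.86, 11.31, 13.40, 1.93, 31.15, 44.52, 36.82, 22.26],
    [4.79, 2.83, 3.31, 15.86, 17.94, 6.49, 35.70, 49.06, 41.36, 26.81],
    [7.62, 8.11, 11.07, 13.15, 1.69, 30.91, 44.27, 36.58, 22.02],
    [0.48, 18.69, 20.78, 9.32, 38.53, 51.88, 44.19, 29.64],
    [19.17, 21.26, 9.80, 39.01, 52.36, 44.67, 30.12],
    [2.09, 9.38, 19.85, 33.22, 25.52, 10.96],
    [11.46, 17.77, 31.14, 23.44, 8.87],
    [29.22, 42.58, 34.89, 20.33],
    [13.38, 5.67, 8.90],
    [0.77, 22.27],
    [14.57],
]

n = len(SPECIES)
matrix = [[0.0] * n for _ in range(n)]
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row, start=1):
        matrix[i][i + offset] = matrix[i + offset][i] = value  # mirror the entry

def nearest(name):
    """Species with the smallest matrix entry relative to `name`."""
    i = SPECIES.index(name)
    j = min((k for k in range(n) if k != i), key=lambda k: matrix[i][k])
    return SPECIES[j]

closest_pair = min((matrix[i][j], SPECIES[i], SPECIES[j])
                   for i in range(n) for j in range(i + 1, n))

print(closest_pair, nearest("CpGV"), nearest("HearNPV"))
```

A full symmetric matrix of this kind is also the usual input for distance-based tree construction, although the tree-building procedure used for Figure 3 is not reproduced here.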

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bielińska-Wąż, D.; Wąż, P.; Błaczkowska, A.; Mandrysz, J.; Lass, A.; Gładysz, P.; Karamon, J. Mathematical Modeling in Bioinformatics: Application of an Alignment-Free Method Combined with Principal Component Analysis. Symmetry 2024, 16, 967. https://doi.org/10.3390/sym16080967


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
