1. Introduction
Alignment-free techniques are a rapidly developing field in bioinformatics (a review can be found in [1]). These approaches, aimed at comparing biological sequences (DNA, RNA, and proteins), combine concepts from computer science, mathematics, biology, physics, and chemistry. They are characterized by computational efficiency. For example, Gupta et al. proposed a method to study protein sequence similarity through the general form of Chou’s pseudo amino acid composition [2]. Li et al. developed an alignment-free algorithm for comparing protein sequence similarity based on the pseudo-Markovian transition probabilities of amino acids [3]. Saw et al. proposed an alignment-free analysis of protein sequence similarities based on a fuzzy integral [4]. The branch of these methods that provides both numerical and graphical tools for sequence analysis is termed Graphical Representations of Biological Sequences (or Graphical Bioinformatics [5]) and is applied to similarity analyses of both DNA/RNA and protein sequences. In particular, Nandy created one of the first two-dimensional graphical representations of DNA sequences [6], and this method was further developed [7]. Randić et al. proposed a graphical representation of DNA sequences in the form of four horizontal lines [8]. Randić et al. also introduced a representation of proteins based on generalized star graphs, i.e., graphs with one central vertex of maximal degree to which other vertices of degree one or two are attached [9]. Cao et al. reduced a DNA sequence into four three-dimensional graphical representations based on the chemical properties of the neighboring dual nucleotides [10]. Jafarzadeh and Iranmanesh introduced a 3D graphical representation of DNA sequences based on codons [11]. Mu et al. proposed a graphical representation of protein sequences called the 3D-PAF curve. The method takes into account the accumulative frequencies of the adjacent amino acids of the protein sequence [12]. We introduced a graphical representation of DNA sequences in the form of a spectrum [13]. Zhang and Wen constructed a two-dimensional graphical representation of protein sequences based on a selected pair of physicochemical properties of amino acids: the isoelectric point and the hydropathy index [14]. Mahmoodi-Reihani et al. took into account twelve physicochemical properties of amino acids for the construction of a new two-dimensional graphical representation of protein sequences [15]. Li et al. represented the protein sequence as a time series in eleven-dimensional space [16].
The problem of similarity is interdisciplinary and can be a source of information in various fields of science [17]. In particular, we consider different types of objects, for example groups of individuals [18], molecular spectra [19], or DNA sequences [20], examining different features of their similarity.
Two or more objects can be very similar to each other due to one feature, and yet very different if we consider another feature. Each computational method reveals different features of object similarity. Therefore, the development of mathematical methods is particularly important.
Until relatively recently, mathematical methods in the biological and medical sciences were mainly applied to the statistical analysis of measurement results and to the description of the functioning of experimental devices. Over roughly the last two decades, applications of mathematics to the modeling of biological processes and to the description of the structure of biomolecules have opened a new window onto their properties. These mathematical methods use a specific language and a specific set of notions. Some of them are alien to traditional biology and medicine, but this does not imply that they are useless. Many new achievements in biology and medicine would not be possible without the support provided by mathematical modeling [21,22].
In this work, we use the alignment-free bioinformatics method we introduced (called 20D-Dynamic Representation of Protein Sequences) to analyze the similarity between protein sequences [23]. This approach is applied to two datasets: Baculovirus [24] and Echinococcus multilocularis [25,26] genome sequences. In addition, the quality of the graphical representation is discussed using the same model sequences as in the literature [14,27,28]. As a result, our graphs clearly show the differences between sequences and directly show the arrangement of amino acids in a given sequence. In [23], we analyzed the similarity relations between the sequences, presented as phylogenetic trees and similarity matrices, for the NADH dehydrogenase subunit 5 (ND5) and for the NADH dehydrogenase subunit 6 (ND6) protein sequences of different species. In the present article, the similarity relations are also visualized using classification maps whose axes are represented by 20D-Dynamic Representation of Protein Sequences descriptors. Some classification maps are obtained using a combination of this alignment-free method and Principal Component Analysis (PCA).
Baculoviruses constitute a family of viruses that primarily target arthropods; the most extensively studied hosts include Lepidoptera, Hymenoptera, and Diptera. The family currently encompasses 85 species distributed among four genera.
Baculoviruses infect insects, with a host range of over 600 species. Immature forms of lepidopteran species (moths and butterflies) are the most common hosts, but infections have also been observed in sawflies and mosquitoes. Although Baculoviruses can enter mammalian cells in culture, they do not replicate in mammalian or other vertebrate animal cells.
Since the 1940s, Baculoviruses have been extensively used and studied as biopesticides in agriculture [29]. They have a circular, double-stranded DNA (dsDNA) genome ranging from 80 to 180 kbp and are of considerable interest in both scientific and practical domains [30,31,32].
Baculoviruses have been known for centuries: their earliest mentions appear in 16th-century literature describing the “wilting disease” of silkworm larvae. From the 1940s onward they were widely used and studied as biopesticides in crop fields. Since the 1990s they have also been used in insect cell cultures to produce complex eukaryotic proteins. These recombinant proteins have found applications in research and in both human and veterinary medicine. A notable example is vaccine manufacture: a widely used vaccine against H5N1 avian influenza in poultry was produced in a Baculovirus expression vector. More recently, Baculoviruses were shown to transduce mammalian cells given a suitable promoter [33].
E. multilocularis is a causative agent of Alveolar Echinococcosis (AE), a severe zoonotic infection. AE has a mortality rate exceeding 90% and is potentially fatal in the absence of proper medical intervention [34,35,36].
Human susceptibility to infection is influenced by various behavioral and socio-economic factors that increase contact with E. multilocularis eggs. These risk factors and the geographic distribution of AE vary between countries and regions. Meta-analyses of case-control and cross-sectional studies in endemic regions of Central Europe, North America (Alaska), and Asia (China) have identified common potential risk factors for human AE, including gender (female), age (over 20), dog ownership, interaction with dogs, gardening, forest visits, agricultural occupations, hunting, handling foxes, and low education/income levels [37].
Due to the asexual reproduction of the metacestode and rare cross-fertilization events between mature tapeworms, a significant challenge in E. multilocularis studies lies in accurately estimating its genetic variation. The intraspecific genetic variation of E. multilocularis has been estimated by employing various markers, including mitochondrial (cytb, nad1/nad2, cox1, 12s rRNA, and ATPase) and nuclear (ef1a, elp-exons VII/VIII, and elp-exon IX) genes and microsatellites (EMms1 and EMms2, EmsJ and EmsK, EmsB), to analyze isolates from a broad geographic range including areas of Asia, North America, and Europe [38,39,40]. The mitogenome has proven instrumental in elucidating the phylogeny of Echinococcus spp. due to its haploid nature, high copy number per cell, and faster evolutionary rate compared to nuclear markers [41,42]. Early mitochondrial studies identified two distinct geographical genotypes (haplotypes) among European isolates (M1) and isolates from China, Japan, Alaska, and North America (M2) [43,44]. Subsequent sequencing efforts targeting three mitochondrial loci revealed as many as 17 regional haplotypes in a set of isolates from Europe, Asia, and North America [41]. A meta-analysis combining data from the commonly used mitochondrial genes cox1 and cytb revealed a notable level of genetic diversity in E. multilocularis isolates derived from rodents, wild canids, and humans [45]. The highest diversity in haplotypes and haplotype diversity was observed within Asian and European populations. A cladistic phylogenetic tree suggested a sister relationship between the European and Asian clades, with only select North American haplotypes clustering with the European clade, including haplotype Em46 from Poland. In another study conducted on voles and plateau pikas in Qinghai Province, China, it was found that the genetic diversity of E. multilocularis, based on mitochondrial genes, is low within a small local area but relatively high between distant regions with varying ecological environments [46]. In the most recent pan-European study, five concatenated mitochondrial markers identified 43 haplotypes and two main clusters, possibly reflecting the post-glacial history of the key definitive hosts.
In the seminal assessment of E. multilocularis genetic diversity in Poland, three mitochondrial genes (cytb, nad2, cox1) analyzed together divided worms from red foxes into fifteen haplotypes (EmPL1-15), some of them restricted to particular provinces [25]. For the first time, a haplotype of Asian origin was detected in north-eastern Poland, a phenomenon later investigated in detail by combining microsatellite profiling with mitochondrial genotyping to reveal Euro-Asian hybrids [47]. Inspection of metacestodes from AE patients living in north-eastern Poland provided indirect proof of environmental contamination with the eggs of worms with Asian genetic elements and their circulation in the synanthropic cycle, raising a question of differences in virulence between European and Asian variants.
Polymorphism of
E. multilocularis isolates from various locations across Europe, North America, and Asia was also assessed using the multilocus microsatellite EmsB and the single-locus microsatellites EmsJ, EmsK, and NAK1 [38,39]. The highest polymorphism was observed for EmsB (29 profiles) compared to NAK1 (7 genotypes) and the combined EmsJ/EmsK (3 genotypes). EmsB exhibits a distinct (CA)n(GA)n pattern that varies from one worm to another, leading to the identification of different profiles [39]. This highly polymorphic marker has been used extensively to assess the genetic variation of the parasite in Europe and Eurasia [40,48], at the national level [49,50], and on the local scale [51,52,53]. For instance, at a local scale within an endemic region of France, covering a studied area of 900 m², 6 EmsB profiles were identified among 79 E. multilocularis-positive foxes [51]. The distribution of assemblage profiles among the fox population was not uniform, with a notable frequency of mixed infections observed. Furthermore, an increase in the number of worms analyzed per fox corresponded to an increase in detected profiles with the EmsB152 target, indicating the potential for a high dispersion of eggs of various genotypes within a single fox’s home range. At the national level, studies in France identified a total of 22 profiles [54]. In general, thirty-two EmsB profiles (G1-G32) were found in endemic regions across Europe, some of them locally predominant [40]. The continental distribution of profiles has been studied primarily to construct scenarios related to
E. multilocularis geographical spread over time, from historical to newly described endemic areas, with Poland hypothesized to be one of the more recently colonized peripheral areas [40,55]. The resolution of EmsB proved discriminatory enough to investigate genetic variation and possible dispersal pathways of the parasite throughout Polish provinces [50]. In this study, 29 distinct profiles were identified (with the most common one accounting for 44.9% of all worms), showing regional differences in distribution.
Microsatellite markers were utilized not only to assess the genetic diversity of E. multilocularis isolated from wild definitive and intermediate hosts but also of those isolated from humans. Bretagne et al. conducted genotyping of human AE lesions based on the pentameric microsatellite upstream from the coding region of the U1 snRNA gene [56]. As per their findings, human AE lesions were categorized into three profiles based on their geographic origin, with all European patients clustering together, showing an unsatisfactory resolution [56]. In a recent investigation by Knapp et al., metacestode samples from 63 AE patients from France, Switzerland, Germany, and Belgium were examined [57]. In total, nine EmsB profiles were identified, providing insights into potential transmission pathways to humans. These profiles appeared to exhibit stability over time, as similar profiles were detected in patients who recently underwent surgery and those who underwent surgery years ago. Patients residing in the same geographic area demonstrated identical or similar profiles. Some of these profiles had already been locally described in animals, while others appeared to be widespread among foxes in Europe [40,58]. This implies that individuals living in areas of high endemicity may have been infected within their residential environment, either with locally circulating profiles or with profiles more commonly found across Europe. An EmsB/cox1 study of human and animal samples from rural Sichuan, China, also indicates a link between high human AE incidence and local contamination by dogs [59]. These limited findings, although significant, are still insufficient to draw far-reaching conclusions about the distribution of E. multilocularis assemblages in humans, warranting further studies encompassing various countries and localities.
2. Theory
In the current study, we employ a bioinformatics method named 20D-Dynamic Representation of Protein Sequences, which we introduced to conduct a similarity/dissimilarity analysis of biological sequences [23]. The descriptors characterizing the graphs representing the sequences resemble those used in dynamics. The set of descriptors defined for a single sequence serves as a base dataset for analyzing potential biomedical applications.
To construct a graph representing the sequence, we utilize the convention of a shift (“walk”) in the 20D space. The shift originates from the origin of the 20D coordinate system. Each step in the shift is defined by the unit vector corresponding to the relevant amino acid in the sequence. At the end of the vector, a “material point” (“point mass”) with mass $m = 1$ is situated. The first shift corresponds to the first amino acid in the sequence. The second shift, representing the second amino acid in the sequence, starts from the end of the first vector. The procedure is repeated until the end of the sequence. Consequently, an abstract 20D-dynamic graph is formed, consisting of material points in the 20D space. The arrangement of these material points in the space is dictated by the sequence represented by the graph. Details regarding the assignment of the axes of the 20-dimensional coordinate system to the amino acids (representation of amino acids by unit basis vectors in the 20D space) are given in Table 1.
For instance, Alanine (A) and Cysteine (C) are each represented by their own unit basis vector, pointing along the axis assigned to them in Table 1, and so forth.
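The walk described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the axis order `AMINO_ACIDS` (alphabetical one-letter codes) and the function name `dynamic_graph` are assumptions made for the example, and the actual axis assignment is the one given in Table 1.

```python
# Hypothetical axis order (alphabetical one-letter codes); the real
# assignment of axes to amino acids is defined in Table 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AXIS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dynamic_graph(sequence):
    """Return the 20D points visited by the walk, one unit mass per residue."""
    position = [0] * len(AMINO_ACIDS)
    points = []
    for residue in sequence:
        position[AXIS[residue]] += 1      # unit step along the residue's axis
        points.append(tuple(position))    # a material point sits at the end of the step
    return points


graph = dynamic_graph("MTMH")
```

Each point holds the running count of every amino acid seen so far, so for MTMH the last point has coordinate 2 on the M axis and 1 on the T and H axes.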
To graphically represent protein sequences, we project 20D-dynamic graphs onto 2D or 3D spaces. Mass distributions in 3D/2D spaces provide information on the positions of three/two amino acids in the sequences, enabling graphical comparisons. An example of the 20D-dynamic graph and its projection into 3D space for the model sequence MTMHTT can be found in [23]. A similar example of the 20D-dynamic graph and its projection into 2D space for the model sequence MTMH is shown in Table 2 and Table 3, respectively.
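A 2D projection of this kind can be sketched as follows. The sketch assumes that projected points which coincide simply add their unit masses; the function name `projected_masses` is hypothetical and not taken from the original method description.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def projected_masses(sequence, x_residue, y_residue):
    """Walk through the sequence and accumulate unit masses on the (x, y) plane."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    masses = Counter()
    for residue in sequence:
        counts[residue] += 1
        # Only the two selected coordinates survive the projection.
        masses[(counts[x_residue], counts[y_residue])] += 1
    return masses


# HM-graph of the model sequence MTMH: x axis = H, y axis = M.
hm_graph = projected_masses("MTMH", "H", "M")
```

For MTMH the step along T does not move the projected point, so the point (0, 1) carries two unit masses while (0, 2) and (1, 2) carry one each.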
For the numerical characterization of the 20D-dynamic graphs we apply values analogous to the ones used in the dynamics: the coordinates of the centers of mass and the principal moments of inertia of 20D-dynamic graphs.
In dynamics, a commonly used descriptor of the distribution of mass in a rigid body is its center of mass and the moment of inertia. The moment of inertia of a rigid body around a specific axis indicates the resistance of this body to an acceleration of the angular rotation around this axis. When the mass is concentrated closer to the axis, then the corresponding moment of inertia is smaller and it is easier to speed up or slow down the rotation. The moments of inertia associated with rotations around the principal axes are referred to as the principal moments of inertia. The concept of moment of inertia can be generalized to characterize distributions of some other quantities. In particular, if we assign a “mass” to each point of the 20D-dynamic graph, then we can define the moment of inertia of the graph—an analogy of the standard moment of inertia.
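The coordinates of the center of mass of the 20D-dynamic graph are defined as

$$\mu_i = \frac{\sum_{k} m_k x_i^k}{\sum_{k} m_k}, \qquad i = 1, 2, \ldots, 20, \tag{1}$$

where $x_i^k$ are the coordinates, determined by the shifts, of the material points with masses

$$m_k = 1, \qquad k = 1, 2, \ldots, N. \tag{2}$$

Since the mass of each individual material point is equal to one, the total mass of the sequence is equal to the length of the sequence $N$, i.e., $N$ is also the number of the material points in the 20D-dynamic graph. As a consequence, from Equations (1) and (2) it follows that

$$\mu_i = \frac{1}{N} \sum_{k=1}^{N} x_i^k. \tag{3}$$

The tensor of the moment of inertia in the 20D space is expressed by the symmetric matrix

$$\hat{I} = \left( I_{ij} \right), \qquad i, j = 1, 2, \ldots, 20,$$

with the matrix elements

$$I_{ij} = \sum_{k=1}^{N} m_k \left[ \delta_{ij} \sum_{l=1}^{20} \left( \hat{x}_l^k \right)^2 - \hat{x}_i^k \hat{x}_j^k \right],$$

where $\delta_{ij}$ is the Kronecker delta and $\hat{x}_i^k$ are the coordinates of $m_k$ in the Cartesian coordinate system with the origin chosen at the center of mass:

$$\hat{x}_i^k = x_i^k - \mu_i.$$

The eigenvalue problem of the tensor of inertia with the eigenvalues $I_\alpha$ and the eigenvectors $\omega_\alpha$ is defined as

$$\hat{I}\, \omega_\alpha = I_\alpha\, \omega_\alpha, \qquad \alpha = 1, 2, \ldots, 20.$$

The eigenvalues are derived through the solution of the twentieth-order secular equation

$$\det \left( \hat{I} - I \hat{E} \right) = 0,$$

where $\hat{E}$ is the $20 \times 20$ unit matrix. The eigenvalues $I_\alpha$, $\alpha = 1, 2, \ldots, 20$, are called the principal moments of inertia.
Figure 1 shows a 2D graph for the MTMH model sequence. This is the HM-graph, i.e., the projection onto the plane spanned by the axes that Table 1 assigns to Histidine (H) and Methionine (M). Additionally, the center of mass and the projections of the three eigenvectors multiplied by the corresponding eigenvalues are marked. These are the only non-zero projections onto this 2D plane created from the 20 possible products of the eigenvectors and eigenvalues obtained for this example.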
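The center of mass and the principal moments of inertia can be computed numerically along the following lines. This is a sketch under stated assumptions, not the authors' code: it assumes unit masses and the usual rigid-body form of the inertia tensor carried over to 20 dimensions, $I_{ij} = \sum_k [\delta_{ij} \sum_l (\hat{x}_l^k)^2 - \hat{x}_i^k \hat{x}_j^k]$, uses a hypothetical alphabetical axis order, and obtains the eigenvalues with a plain cyclic Jacobi iteration so that no external library is needed.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical axis order; see Table 1


def dynamic_graph(sequence):
    """Points of the 20D walk: running residue counts after each step."""
    position = [0] * len(AMINO_ACIDS)
    points = []
    for residue in sequence:
        position[AMINO_ACIDS.index(residue)] += 1
        points.append(list(position))
    return points


def center_of_mass(points):
    """Mean of the point coordinates (all masses equal to one)."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def inertia_tensor(points):
    """Assumed form: I_ij = sum_k [delta_ij * |x_k|^2 - x_ki * x_kj], centered coordinates."""
    mu = center_of_mass(points)
    dim = len(mu)
    centered = [[p[i] - mu[i] for i in range(dim)] for p in points]
    total_sq = sum(c * c for row in centered for c in row)
    return [[(total_sq if i == j else 0.0) - sum(r[i] * r[j] for r in centered)
             for j in range(dim)] for i in range(dim)]


def symmetric_eigenvalues(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, sorted ascending."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):   # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


points = dynamic_graph("MTMH")
mu = center_of_mass(points)
principal_moments = symmetric_eigenvalues(inertia_tensor(points))
```

For MTMH only the H, M, and T axes are visited, so under the assumed tensor form seventeen of the twenty eigenvalues coincide and only three are distinct from them, in line with the three non-zero projections mentioned for Figure 1.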
Finally, as sequence descriptors we apply the normalized principal moments of inertia
As a similarity measure between a pair of sequences labeled
and
we use
This similarity measure is dimensionless and normalized: the similarity values vary between 0 and 1.
3. Results and Discussion
Let us consider two short segments of proteins from yeast (P00729) and Saccharomyces cerevisiae (P38109). These two segments, taken from Randić [27], have been tested in several graphical representation methods [14,28]. The corresponding sequences are:
Figure 2 shows 2D graphs for these two sequences. Zhang and Wen [14] claim that their graphs are more visible, intuitive, and easy to analyze compared to those presented by Yu et al. [28]. Our graphs are also visible and intuitive because they are directly related to the distributions of individual amino acids in the sequences.
Table 4 shows the descriptor values (Equation (10)) for the two sequences and for a third sequence, Protein3, obtained from Protein2 by changing its last amino acid, Leucine (L), to Phenylalanine (F). Protein2 and Protein3 thus differ by only one amino acid. This small difference is recognized by the descriptor values, which are slightly different (columns 2 and 3 in the table).
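The sensitivity to a single substitution can be illustrated with the center-of-mass coordinates alone. The sketch below uses two hypothetical toy sequences (not the protein segments discussed above) that differ only in the last residue, L versus F, and a hypothetical alphabetical axis order.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical axis order; see Table 1


def center_of_mass(sequence):
    """Center of mass of the 20D walk with unit masses."""
    counts = [0] * len(AMINO_ACIDS)
    sums = [0] * len(AMINO_ACIDS)
    for residue in sequence:
        counts[AMINO_ACIDS.index(residue)] += 1
        for i, c in enumerate(counts):
            sums[i] += c            # add the coordinates of the new material point
    return [s / len(sequence) for s in sums]


toy_a = "MTMHTL"  # toy sequence, illustrative only
toy_b = "MTMHTF"  # same sequence with the last residue changed L -> F

mu_a, mu_b = center_of_mass(toy_a), center_of_mass(toy_b)
changed_axes = [aa for aa, a, b in zip(AMINO_ACIDS, mu_a, mu_b) if a != b]
```

Only the coordinates along the L and F axes change, each by 1/N, which is small but non-zero; the same reasoning applies to longer sequences.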
The 20D-Dynamic Representation of Protein Sequences is also used to analyze the similarities between helicase protein sequences of twelve Baculoviruses. We use the same dataset that has been analyzed using a different alignment-free bioinformatics method in [24]. Table 5 shows the lengths and the accession numbers of these amino acid sequences, which are available in GenBank. The method (Equations (10)–(12)) is applied to determine the symmetric similarity matrix (Table 6) and its representation via a phylogenetic tree (Figure 3). To obtain the phylogenetic tree, the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) implemented in the Mega 6.0 package [60] is used. For the UPGMA clustering algorithm, the input data (i.e., the data used to determine the topology and branch lengths of the tree) is our similarity matrix.
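For readers unfamiliar with UPGMA, the clustering step can be sketched as below. This is a generic textbook version, not the Mega 6.0 implementation; it assumes the input is already a symmetric matrix of pairwise distances (how the values of Table 6 enter the tree construction is handled by the package and is not modeled here), and the labels and numbers are toy values.

```python
def upgma(labels, dist):
    """Average-linkage clustering; returns the tree as nested tuples and the merge heights."""
    clusters = {i: (label, 1) for i, label in enumerate(labels)}  # id -> (subtree, size)
    pair_dist = {(i, j): dist[i][j]
                 for i in range(len(labels)) for j in range(i + 1, len(labels))}
    heights = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d_ab = min(pair_dist.items(), key=lambda item: item[1])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to every other cluster: size-weighted mean.
        updated = {}
        for c in clusters:
            d_ac = pair_dist[(min(a, c), max(a, c))]
            d_bc = pair_dist[(min(b, c), max(b, c))]
            updated[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        pair_dist = {pair: d for pair, d in pair_dist.items()
                     if a not in pair and b not in pair}
        pair_dist.update(updated)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        heights.append(d_ab / 2)    # node height is half the merge distance
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights


toy_labels = ["A", "B", "C"]               # toy data, not taken from Table 6
toy_dist = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
tree, heights = upgma(toy_labels, toy_dist)
```

In the toy example A and B are joined first at height 1, and C is attached to that pair at height 3.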
Based on the values given in Table 6, we can observe significant similarities between certain groups of Baculoviruses, namely (AcMNPV, RoMNPV, SeMNPV), (BmNPV, HearNPV, HzSNPV), (NeseNPV, MacoNPVA, MacoNPVB), and (CpGV, AdorGV, CrleGV). The greatest differences between Baculoviruses can be seen for the NeseNPV sequence (Figure 4 and Figure 5) for the considered protein.
Figure 4 and Figure 5 show examples of 2D and 3D graphs (projections of 20D-dynamic graphs into 2D and 3D spaces), respectively. We find that the NeseNPV protein graphs shown in the figures differ from those of other species. Other authors have also obtained similar results [24]. In [24], Yao et al. introduced another alignment-free bioinformatics method. It is a 2D graphical representation of protein sequences based on amino acid classification. The authors consider two chemical properties that have significant relations with the structure of proteins: chirality and hydrophilicity of 20 amino acids. Based on this classification, 2D graphs representing the sequences are obtained. Geometric centers of the graphs are proposed as sequence descriptors. These descriptors are equivalent to the centers of mass in our method. The moments of inertia considered in our method can be treated as the corrections [19].
The 20D-Dynamic Representation of Protein Sequences is also used to analyze Echinococcus multilocularis sequence similarities. Sequences of three mitochondrial genes (cytochrome b (cob), NADH dehydrogenase subunit 2 (nad2), and cytochrome c oxidase subunit 1 (cox1)) are analyzed. Sequences obtained in studies described previously [25,41] are used for the analyses (GenBank accession numbers: KY205662-KY205706, AB461395-AB461420). Additional sequence data specific to the current studies are given in [26], in which we investigated features of similarity revealed using 4D-Dynamic Representation of DNA/RNA Sequences, another bioinformatics method we introduced [20].
In [26], we considered DNA sequences (sequences composed of 4 nucleobases). In this work, protein sequences are considered, i.e., sequences composed of 20 amino acids. The considered protein sequences are very similar to each other. Figure 6 shows examples of 2D graphs (projections of 20D-dynamic graphs onto a 2D plane): LV-graphs and TV-graphs. The graphs almost overlap, but slight differences can also be observed. Figure 7 and Figure 8 show the classification maps for the cob, nad2 and cox1 genes.
The maps shown in Figure 7 contain clusters of points indicating similarities between sequences based on two different properties, represented by the descriptors on the axes. For different genes and different descriptors we observe different clusters of points.
The classification maps shown in Figure 8 have been obtained using PCA for the center of mass coordinates and the normalized moments of inertia of 20D-dynamic graphs as input data. The similarity relations revealed by the clusters of points appearing on the maps also differ. Figure 9 and Figure 10 show the corresponding phylogenetic trees obtained using different descriptors (Figure 9: the coordinates of the centers of mass of the 20D-dynamic graphs; Figure 10: the normalized principal moments of inertia of the 20D-dynamic graphs). The following clusters of points appear in these trees:
cob gene, coordinates of the centers of mass, Figure 9 top panel: (CHM), (A-A), (14), (Jap), (Fra), (7,15), (A-I), (Kaz,Slo,Aus,1,2,3,4,5,6,8,9,10,11,12,13);
nad2 gene, coordinates of the centers of mass, Figure 9 middle panel: (CHM), (A-I), (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), (A-A), (Kaz,Slo,Aus,Fra), (Jap), (CHS);
cox1 gene, coordinates of the centers of mass, Figure 9 bottom panel: (CHM), (8), (13,Aus), (Fra), (A-A), (A-I), (Kaz), (14), (Jap), (1,7), (6), (CHS,Slo,2,3,4,5,9,10,11,12,15);
cob gene, normalized principal moments of inertia, Figure 10 top panel: (CHM), (A-I), (A-A), (7,15), (Fra), (Jap), (14), (Kaz,Slo,Aus,1,2,3,4,5,6,8,9,10,11,12,13);
nad2 gene, normalized principal moments of inertia, Figure 10 middle panel: (A-I), (CHM), (CHS), (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), (Kaz,Slo,Aus,Fra), (Jap), (A-A);
cox1 gene, normalized principal moments of inertia, Figure 10 bottom panel: (8), (Jap), (13,Aus), (6), (1,7), (CHS,Slo,2,3,4,5,9,10,11,12,15), (14), (A-A), (A-I), (CHM), (Fra), (Kaz).
The number of clusters present in the phylogenetic trees is equal to 8 for the cob gene, both using the center of mass coordinates (Figure 9, top panel) and the normalized principal moments of inertia (Figure 10, top panel). The clusters are exactly the same in both cases. Some clusters contain only one element. A single-element cluster indicates small similarity values between its sequence and the sequences belonging to the other clusters. Sequences belonging to one cluster are characterized by large similarity values.
Similar observations can be made for the nad2 gene (seven identical clusters for both the center of mass coordinates and the normalized principal moments of inertia; Figure 9 and Figure 10, middle panels). Similarly, for the cox1 gene, the same number of identical clusters (12) appeared in the phylogenetic trees (Figure 9 and Figure 10, bottom panels).
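The statement that two trees contain the same clusters, regardless of the order in which they are listed, can be checked mechanically by comparing the clusterings as sets of sets. The sketch below does this for the cob gene using the cluster lists reported above; the helper name `as_partition` is hypothetical.

```python
def as_partition(clusters):
    """Order-independent representation of a clustering."""
    return {frozenset(cluster) for cluster in clusters}


large_cluster = ["Kaz", "Slo", "Aus", "1", "2", "3", "4", "5", "6",
                 "8", "9", "10", "11", "12", "13"]

# cob gene, coordinates of the centers of mass (Figure 9, top panel)
cob_centers = [["CHM"], ["A-A"], ["14"], ["Jap"], ["Fra"], ["7", "15"],
               ["A-I"], large_cluster]

# cob gene, normalized principal moments of inertia (Figure 10, top panel)
cob_inertia = [["CHM"], ["A-I"], ["A-A"], ["7", "15"], ["Fra"], ["Jap"],
               ["14"], large_cluster]

same_clusters = as_partition(cob_centers) == as_partition(cob_inertia)
```

The comparison confirms eight identical clusters for cob; it says nothing about the branching order between clusters, which is where the two trees differ.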
The clusters of points appearing in both phylogenetic trees (based on the coordinates of the centers of mass and on the normalized principal moments of inertia, Figure 9 and Figure 10) are identical, but the structures of the two trees (the relations between different clusters) are different. These relations correspond to the similarity values between the sequences. To obtain the similarity values, the similarity matrix can be calculated in the same way as for Baculoviruses (Table 6). To calculate the similarity matrix for the center of mass coordinates, the normalized principal moments of inertia should be replaced with the center of mass coordinates in Equation (11).
In the current research, the clusters of points are completely different from those we observed in our previous research [26]. Taking into account the DNA sequences, we confirmed that one Polish haplotype (9) is of Asian origin [26], which was originally discovered in [25]. Considering the amino acid sequences in the present work, we do not observe any cluster that groups haplotype 9 with the Asian sequences.
This is probably related to the fact that the mutations in the genes that determine the division into separate clusters (European and Asian) do not cause significant changes in the proteins they encode. One of the fundamental features of the genetic code is that it is degenerate: different codons (a codon is a sequence of three nucleotides that constitutes the unit coding for a particular amino acid) can code for the same amino acid, so most amino acids can be encoded in several ways. As a consequence, some changes in genetic information resulting from mutations are not reflected in the amino acid sequence.
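The effect of degeneracy can be illustrated with a toy translation. The codon table below is a deliberately small subset (the GCN, GGN, and GAT/GAC codons) chosen for illustration; it is not a complete genetic code, does not model mitochondrial codon reassignments, and the DNA strings are invented examples rather than sequences from the dataset.

```python
# Illustrative subset of codons only: GCN -> Ala (A), GGN -> Gly (G), GAT/GAC -> Asp (D).
CODONS = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "GAT": "D", "GAC": "D",
}


def translate(dna):
    """Translate a DNA string codon by codon using the subset table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


reference = "GCTGGTGAT"   # invented example
silent = "GCCGGTGAT"      # GCT -> GCC: synonymous change in the first codon
replacing = "GATGGTGAT"   # GCT -> GAT: the first amino acid changes from A to D

proteins = [translate(seq) for seq in (reference, silent, replacing)]
```

The first two DNA strings differ yet give the same amino acid sequence, so a DNA-level descriptor can separate them while a protein-level descriptor cannot.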
4. Conclusions
Model studies of protein segments from yeast and Saccharomyces cerevisiae indicate the high graphical and numerical quality of our method. We plan to apply this method in the future to interpret our new experimental data, which are currently in preparation.
In this work, 20D-Dynamic Representation of Protein Sequences was applied to two datasets: Baculovirus and E. multilocularis genome sequences. Baculovirus data are model data also used by other authors in articles on mathematical modeling in bioinformatics. Yao et al. obtained similar results using another alignment-free method [24]. The data for Echinococcus are our experimental data from [25]. The genes encoding cox1, nad2, and cob are mitochondrial genes which, due to their characteristics, are very often used in analyses of the genetic diversity of Echinococcus, but also of many other organisms. We recently performed computations on the same Echinococcus data at the DNA level using a similar approach, also proposed by us, called 4D-Dynamic Representation of DNA/RNA Sequences [26]. Using the 4D method, we obtained results similar to those of standard alignment-based methods [25], i.e., we showed that one of the Polish haplotypes (9) is of Asian origin. In this work, we investigated similarity relations at the level of amino acid sequences. As expected, the point clusters based on the 20D-Dynamic Representation of Protein Sequences descriptors are different from those for DNA sequence descriptors. The dispersion of the results obtained from various methods appears to be quite significant. Different methods may capture distinct features of sequence similarity. Generally, different descriptors may be important depending on the context, which forms the foundation for QSAR studies. In the realm of molecules, a similar approach yields computational techniques known as Quantitative Structure–Activity Relationship (QSAR) and Quantitative Structure–Property Relationship (QSPR), extensively employed in computational pharmacology, toxicology, and eco-toxicology [61,62]. QSAR/QSPR operates on the premise that chemical compound properties correlate systematically with numerical indices representing molecular structural features, termed molecular descriptors. These descriptor values, derived from available experimental data, can be applied to predict the properties of numerous structurally similar molecules. For instance, we have demonstrated the utility of moments of inertia of molecular spectra as descriptors in QSAR studies for predicting environmentally relevant properties of chloronaphthalenes [63]. By using descriptors from the 20D-Dynamic Representation of Protein Sequences we enhanced the numerical characterization of the entities under investigation.
Each method describes different features of similarity. In particular, our method reveals features of similarity that are mathematically described using various descriptors characterizing 20D-dynamic graphs representing protein sequences (values analogous to those in dynamics: centers of mass and principal moments of inertia). Non-standard applications of the moment of inertia tensor and centers of mass were already introduced for the description of DNA/RNA sequences. For example, these methods have been applied to the bioinformatics characterization of the SARS-CoV-2 virus and the time evolution of the Zika virus [20,64]. The calculations supported the hypothesis that the SARS-CoV-2 virus may have originated in bats and pangolins [20]. Strong correlations with time have been obtained for the descriptors of the genome sequences of the Zika virus [64].
In this work, by comparing the moments of inertia of graphs, we compare the shapes of graphs representing the positions of amino acids in the sequences. Consequently, our description is more precise than the ones derived from traditional alignment-based methods, which neglect the asymmetry of the distributions of the aligned amino acids along the sequences. The proposed method is sensitive: a difference in a single amino acid can be recognized. Another advantage of this approach is its applicability to large datasets: all the descriptors are expressed analytically and the computation time for a large number of long sequences is short. Note that the theory has no limits on sequence length.
Alignment-free bioinformatics methods for protein sequences are constructed to describe 20 amino acids [2,3,4,9,12,14,15,16,23,24,27,28]. Additionally, 20D-Dynamic Representation of Protein Sequences can easily be extended to more dimensions if more than 20 amino acids are present in the sequence. The algorithm’s fundamental structure remains unchanged. Introducing new basis vectors follows the same procedure as for the original 20 dimensions. With the inclusion of selenocysteine and pyrrolysine, the representation expands to the 22D-dynamic graph. For the numerical characterization, the tensor of the moment of inertia in the 22D space should be considered. Consequently, the upper limits in all summations within the method must be adjusted to reflect the expanded dimensional space.
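The extension to 22 dimensions amounts to enlarging the alphabet that defines the axes. A minimal sketch, assuming the standard one-letter codes U (selenocysteine) and O (pyrrolysine) and a hypothetical axis order in which the two new axes are simply appended:

```python
STANDARD = "ACDEFGHIKLMNPQRSTVWY"   # hypothetical order of the 20 original axes
EXTENDED = STANDARD + "UO"          # U = selenocysteine, O = pyrrolysine (appended axes)


def dynamic_graph(sequence, alphabet):
    """Walk in len(alphabet) dimensions; one unit mass after each unit step."""
    axis = {aa: i for i, aa in enumerate(alphabet)}
    position = [0] * len(alphabet)
    points = []
    for residue in sequence:
        position[axis[residue]] += 1
        points.append(tuple(position))
    return points


# Toy sequence containing both non-standard residues (illustrative only).
graph_22d = dynamic_graph("MUTO", EXTENDED)
```

Because the dimension is taken from the alphabet length, any descriptor written as a sum over axes picks up the two extra terms without further changes.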