1. Introduction
The Protein Data Bank (PDB) [1,2] entered the year 2023 with a historic achievement of 200,000 molecular structures deposited in the central database. This number grew to 210,000 in September 2023 [3]. It is another milestone, because 210,000 constitutes 50.01% of the total number of possible PDB codes. Yet, it pales in comparison to the over 250 million known protein sequences available in the UniProt database [4]. This discrepancy comes from the fact that in order to obtain the 3D model of a protein, one needs to perform a crystallographic experiment, which requires time, funds and expertise. AlphaFold2 [5,6] narrowed this gap somewhat by bringing to the PDB over a million (≈1,070,000 in October 2023) of the so-called Computed Structure Models (CSMs), which were predicted from sequence by this robust machine learning package (it outperformed its competition during CASP14 [7]). CSMs are auxiliary to the main, experimentally solved repository, and can be switched on and off during search queries on the PDB website.
Anyone working in the field of structural bioinformatics eventually encounters the problem of measuring how similar or dissimilar two proteins are in terms of a chosen set of criteria. This task organically extends to the search for all molecules from a given database with properties within the range of the input molecule. Furthermore, the elements of this database can be clustered (or otherwise assigned to categories) on the basis of their pairwise similarity metrics. The difficulty of these tasks comes not only from the number of items to process, but also from the fact that the properties of proteins are often not directly comparable (e.g., sequences), requiring specialized algorithms that are both difficult to program and of nonlinear complexity, which translates to long run times.
The biological function of a protein can be inferred from its sequence or its structure. The structure also has higher evolutionary conservation than the sequence [8]. This degeneracy of the fold code gave rise to the idea that not all details of the sequence are important for reaching the native conformation [9], and that the role of the environment in which the folding takes place must be considered [10]. The probability that two proteins fold alike increases with the percentage identity of their sequences. A 40% identity over a long alignment is sufficient to distinguish between similar and dissimilar structures [11]. Databases such as CATH [12] and SCOP [13] rely on this to organize protein domains into forests according to their structural, genetic and functional features. In turn, this information helps with the analysis and the classification of proteins at the quaternary structure level. The 3D Complex database [14] clusters the biological assemblies from the PDB on the basis of the topology of their inter-chain contact graphs, domain codes and sequence identity.
In the so-called twilight zone, at 20–35% sequence identity, the link between the sequence and the structure becomes too weak for reliable sequence-based homology detection [8,11,15]. Likewise, a pair of biological assemblies with the same stoichiometry and domain composition but low sequence identity may exhibit vastly different complex symmetries. This phenomenon can be observed among the clusters at the QS (quaternary structure) Family level of the 3D Complex hierarchy. The consequence is that accurate protein structure retrieval (i.e., recognition of molecules that folded in a similar manner) requires structural comparison. With the ever-growing number and complexity of the models deposited in the PDB, this calls for the development of fast and scalable approaches.
Atomic superposition algorithms [16,17,18,19,20] are powerful tools for the structural comparison of proteins, but they can only gauge the difference between the two molecules they take as their input. Applying them in bulk (e.g., in an entire-PDB vs. entire-PDB scenario) is prohibitively time consuming. In an alternative approach, the proteins can be reduced to low-dimensional fingerprints of their shapes that can be compared in a massively reduced amount of time, possibly down to constant time per comparison. This encoding organically scales to the database level, permitting rapid mass-scale molecule retrievals.
While the concept of shape is not well defined at the molecular level [20], it is typically (and conveniently) taken as a function of the density of the atomic volumes. Van der Waals spheres provide this volume, while bonds provide the connectivity graph. By rolling a sphere representing the water molecule (i.e., with a radius of 1.4 Å) over the vdW spheres, one produces the molecular surface of the protein, also known as its solvent-excluded or Connolly surface [21]. The computer model of this surface is a polygonal mesh. Alternatively, the volume of the protein can be approximated with an axis-aligned grid of voxels (e.g., 1 Å³ cubes) that intersect the vdW spheres. A relatively simple form of this voxelization is the central element of our Ellipsoid Profile (EP) algorithm for the measurement of the change in globularity in protein structures [22,23]. More advanced grid-based volumetric models can be obtained with EDTSurf (Euclidean Distance Transform) [24].
The molecular surface and the cloud of voxels of chain A of Formamido-pyrimidine DNA Glycosylase from Lactococcus lactis (PDB code: 1KFV [25]) are shown in Figure 1. Both representations are typically calculated for all (heavy) atoms of the protein, but one may choose to obtain them only for the backbone atoms: N, Cα, C, and O. The difference is that the all-atom surface mode encodes the overall shape of the molecule, whereas the backbone mode also captures the details of its interior by tracing the conformation of the chains.
Once the geometric representation of a protein is established, generic shape retrieval methods [26] can be applied to it. The PDB website has offered an option to search for similarly shaped biological assemblies since 2020 [2]. It finds them using BioZernike [27], a protein structure retrieval system based on Zernike–Canterakis (ZC) moments.
The Zernike moments are noise-resistant projections of an arbitrary scalar field that describes the shape of a 2D object onto a set of orthogonal functions defined within the unit disk [28]. These functions, the Zernike polynomials, are named after the Nobel Prize laureate Frits Zernike, who introduced them in the 1930s [29]. At the turn of the century, Nikolaos Canterakis generalized them to the third dimension [30]. Three-dimensional Zernike moments, more accurately called Zernike–Canterakis moments [20,31], are defined inside a unit ball. In 2004, Marcin Novotni and Reinhard Klein [32] showed how to obtain them from a linear combination of the geometric moments of a voxelized volume. Geometric moments, used extensively in pattern recognition [33,34], are easy but relatively expensive to compute. The simplest shape-defining function used with them is binary, yielding 1 for occupied and 0 for unoccupied voxels. Novotni and Klein also showed how to base 3D object retrieval on the ZC moments by compressing those moments into rotation-invariant feature vectors (i.e., the aforesaid low-dimensional fingerprints). They named the components of those vectors 3D Zernike descriptors (3DZDs). Once again, it is more accurate to call them Zernike–Canterakis descriptors (ZCDs) [20,31].
Since ZC moments form an infinite expansion series of the original shape function (i.e., they allow its reconstruction), a cutoff—the moment order N—must be given by the user. Its de facto standard value in structural bioinformatics is N = 20 [20,27,35]. At N = 20, the number of geometric moments of a discretized 3D shape is 1771. The corresponding ZC moments capture most of the prominent features of the structure of a protein, akin to a low-resolution X-ray structure; see Figure 1 from [20] for a comparison.
The number of ZC descriptors at N = 20 is 121. If two shapes are identical, their ZCDs are identical. The more those shapes differ, the more distant their ZCDs become. Due to the fixed length of ZCD vectors (ZCDVs) at a given N, shape comparison in the ZC domain is very fast. The distance of one ZCDV to the ZCDVs of an entire database (e.g., the PDB) can be measured in linear time. This is the essence of ZC-based protein shape matching. 3D Surfer [35] was the first widely available tool to implement it. Its creators recently employed the ZCDs in the mapping of the protein universe [36], in the classification of the CSMs [37] and during SHREC 2020/21 [38,39]. BioZernike [27] uses longer, less compressed ZC descriptors and feeds the retrieval with additional geometric data (e.g., the size of the molecule). Because the proteins must be scaled to fit inside the unit ball, these geometric features compensate for the loss of size-related information. ZC moments can also help with structural superposition; the ZEAL program [20] performs pairwise protein alignments by finding the transformation that maximizes the correlation between ZC moments.
In 2011, José Pozo, Maria-Cruz Villa-Uriol and Alejandro Frangi proposed [31] an alternative approach to ZC moment calculation, directly from the polygonal surface mesh that defines the 3D shape, without voxelization. It is not only convenient, but it also avoids some of the issues of voxel-based solutions, namely the cubic complexity with respect to the width of the grid [31], and the “discretization error, which escalates as the order of the moments increases, and makes the scale and rotation invariants obtained from the moments only approximately invariant” [31]. In the algorithm of Pozo et al., the geometric moments are computed by integrating over the volumes of oriented tetrahedra spanned between the origin and the facets of the mesh. This is the most expensive step; its cost grows with the number of facets f and steeply with the moment order N. In 2012, Patrice Koehl significantly reduced its complexity to O(fN³) [34], down to the level of practical application in mass-scale object retrieval. The optimized algorithm is thus called the Pozo–Koehl algorithm, PK for short [40].
Anyone who wishes to calculate ZC moments encounters a shortage of reliable, configurable libraries readily available for inclusion in custom programs (i.e., not as external binaries), notably libraries written in Python. Novotni and Klein implemented their algorithm in C++, and BioZernike is written in Java [41]. Owing to its many useful libraries [42], Python is often found in scientific projects; for instance, AlphaFold is written in this language [43]. While there are packages in PyPI related to the “zernike” keyword [44], none of them work in 3D. A web search for “3D Zernike Python” leads to a question on StackExchange about ZC moments in molecular surface comparison [45]. It mentions the PK algorithm in Mindboggle [46], a Python brain morphometry platform, but notes its slowness. This is where our work enters the scene.
Here, we present a fast, well-documented Python implementation of the PK algorithm. It relies on Numba [47], a just-in-time Python-to-machine-code compiler, for a ≈200 times faster calculation at moment order 20 with respect to code that relies only on NumPy’s vectorizations [48]. On a consumer-grade PC, our PK-Zernike library is capable of processing about 50,000 facets per second per thread. A mesh of this complexity corresponds to the high-density molecular surface of a protein composed of ≈600 residues, or to a simplified surface of a protein with ≈1200 residues. This is fast enough for full-scale database analysis. Because the facets can be processed in any order, parallel execution is also possible. By releasing the GIL (Python’s global interpreter lock), the PK-Zernike library can use threads, which are lightweight and more convenient than the usual method of GIL circumvention in CPU-bound tasks (i.e., a pool of subprocesses).
The citations on the publisher’s website [31,34] suggest that the PK algorithm is less popular than the NK (Novotni–Klein) algorithm [32], and that it has not yet been used in structural bioinformatics. To assess its usefulness in this area, we made it the central element of a novel, simple but effective protein structure retrieval pipeline that can be implemented using the “baseline” Python libraries for scientific computation [48,49]. This pipeline comprises the calculation of molecular surface meshes via an external program, and the comparison of their ZCDVs together with additional geometric data: the distribution of the residues in the structure, its size and the volume of its mesh. Tests were carried out on the BioZernike validation repository [50] and on several example proteins. Lastly, the results of this retrieval were compared with the results of a retrieval based on ZCDs produced by the 3D Surfer program, for an indirect comparison of the PK and NK algorithms.
While ZC descriptors are invariant to rotations of the input shape, they are susceptible to its scaling and translation. Aderinwale et al. [37] gave examples of structures in which protruding fragments (e.g., chain termini fully exposed to the solvent) can confuse purely ZC-based retrieval systems, making it harder for them to recognize proteins that belong to the same fold. This issue can be worked around with neural networks [37]. Here, the proteins were passed through an outlier residue detection subroutine from the EP algorithm [23]. This subroutine, principal component analysis (PCA [51]) and Kabsch’s superposition algorithm [52,53] are based on the same paradigm. This method provided a major boost to the sensitivity of the retrieval with only a small loss of specificity and a very low computational cost, thus warranting its inclusion in the title of this document.
1.1. Structure of the Manuscript
The Materials and Methods section describes the protein structure datasets (Section 3.1) and the algorithms, libraries and programs utilized in this research (Section 3.2 and Section 3.3). The Results section begins with the presentation of the capabilities of our fast PK-Zernike library (Section 2.1). Next is the description of the novel protein structure retrieval pipeline based on the PK algorithm (Section 2.2). The last three subsections of the Results assess the discrimination ability of this pipeline on the BioZernike validation repository (Section 2.3), on a set of 30 fragments of example proteins (Section 2.4), and in comparison with the results of the 3D Surfer program, also obtained for the example structures (Section 2.5).
1.2. Definitions and Conventions
The structures from the PDB are referred to in this paper by their PDB codes rather than names, for example, 1KFV instead of Formamido-pyrimidine DNA Glycosylase.
Individual chains are denoted by the XXXX:Y notation, in which XXXX is the PDB code and Y is the chain identifier (e.g., 1KFV:A). The two alternative notations for multi-chain selections are XXXX:Y+…+Y (an explicit list, e.g., 1KFV:A+D+G) and XXXX:Y−Y (an implicit list, e.g., 1KFV:B−H). Sub-chain selections end with a residue range; for example, 1KFV:(A+B)(30–100) captures residues 30–100 from chains A and B of 1KFV.
A biological assembly is the largest, functionally relevant fragment of a protein (i.e., its quaternary structure) that was obtained during the crystallographic experiment and is available in a given PDB file [54,55]. Each PDB file should have at least one such assembly defined in its REMARK 350 records. 1KFV has two, 1KFV_1 and 1KFV_2; they are instances of the same protein–nucleic acid complex. In this paper, the first manually assigned (i.e., not merely software-matched) assembly was taken as the representative of each whole PDB file from Section 3.1.2. In this sense, 1KFV ≡ 1KFV_1. Some of those assemblies needed to be reconstructed via the symmetry operators from the REMARK 350 records. They are distinguished here by the ×n notation, which indicates that the selected group of chains had to be transformed n times to produce the final structure of the complex.
The term protein complex should be understood in the context of this work as the reference to two or more protein chains from a given biological assembly that have non-bonded contacts between them. Interactions between the proteins and the non-protein entities (e.g., nucleic acids, ligands or ions) were outside the scope of this research.
An effective atom of a residue is its representative pseudo-atom—a point in space at the average position of all heavy atoms of this residue (i.e., sans hydrogen and deuterium). This definition is shared with the FOD [56,57] and EP algorithms [22,23].
An unstructured surface mesh is the surface of a 3D shape made up from vertices (i.e., points in space) and explicitly defined facets (i.e., polygons) that connect them. The meshes in this study had triangular facets, as required by the PK algorithm.
To decimate a mesh is to simplify its surface. It is achieved by algorithmically decreasing the number of vertices and facets by the reduction factor specified by the user (e.g., by 50%) while trying to retain the shape and volume of the original model.
Lastly, a Zernike–Canterakis descriptor vector (ZCDV) is a feature vector composed of rotation-invariant Zernike–Canterakis descriptors (ZCDs) [58]. It is a low-dimensional representation of the Zernike–Canterakis (ZC) moments of a given 3D shape. Here, the ZCDs were calculated following the “3DZD” formula popularized by Novotni and Klein [32].
2. Results and Discussion
2.1. The Fast Python Implementation of the PK Algorithm
This section presents our time-efficient Python implementation of the Pozo–Koehl (PK) algorithm [31,34] for the calculation of Zernike–Canterakis moments from the exact, volume-like geometric moments of an unstructured surface mesh (Section 3.2.4). A note on vocabulary: PK-Zernike is the name of the library, PKZernike is the name of the main calculation class, and pkzernike is the name of the Python package. The source code has been available at https://codeberg.org/mbanach/pkzernike since 22 October 2023.
A reference Python implementation of the PK algorithm was obtained from the Mindboggle library [46]. It is based on NumPy [48] and SciPy [49]. Its shapes.zernike module [59] is a 2013 port of the Matlab code by Arthur Mikhno [60]; Brian Rossa and Arno Klein are credited for this conversion. There are several PK classes (a.k.a. pipelines) in Mindboggle; the fastest two are called KoehlOptimizations and KoehlMultiproc. The latter is a parallel version of KoehlOptimizations that utilizes Python’s built-in multiprocessing module.
2.1.1. Geometric Moments
The most time-consuming portion of the ZC pipeline is the calculation of the geometric moments. Its complexity in the PK algorithm is O(fN³), where f is the number of facets of the mesh and N is the moment order. This is optimal [34], but doing it in Python, even with the fast NumPy vectorizations (i.e., optimized functions that can operate on whole arrays), is too slow for practical application in structural bioinformatics. Because the facets can be processed in any order, Mindboggle reduces the calculation time by managing a pool of parallel workers. However, this engages the entire CPU while still taking too long for molecular surface meshes. The parallel execution helps with the O(f) component but not with the O(N³) component of the algorithmic complexity. On the flip side, the memory utilization of the PK algorithm is very efficient, only O(N³).
Our approach takes on the cubic time complexity with Numba, a just-in-time (JIT) compiler capable of automatically translating math-heavy functions and loops into optimized machine code [47]. Owing to Numba, programs stay written in Python, but at runtime they achieve the CPU performance of programs written in C or Fortran. To reach this speed, one must carefully organize the Python code (possibly rewriting most of it) to avoid any unsupported features. For instance, the numpy.compress function is unavailable in Numba, and Numba’s version of the numpy.roll function does not accept the axis parameter.
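As a quick illustration of this usage pattern (a generic example, not a function from PK-Zernike), a math-heavy nested loop only needs the numba.njit decorator to be compiled to machine code on its first call:

```python
# A minimal sketch of the Numba usage pattern described above:
# plain Python loops compiled to machine code on the first call.
import numpy as np
from numba import njit

@njit(cache=True)
def sum_of_powers(values, max_order):
    # Accumulate sums of v**n for n = 0..max_order over all input values.
    out = np.zeros(max_order + 1)
    for v in values:
        p = 1.0
        for n in range(max_order + 1):
            out[n] += p
            p *= v
    return out

print(sum_of_powers(np.random.rand(1000), 20))
```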
While working around the aforesaid code restrictions, we found that the shortest run times occur when the facets of the mesh are passed to the geometric moment function in bulk. It produces an (N+1) × (N+1) × (N+1) array using two loops. The outer loop iterates over the facets, while the inner loop populates the C, D and S arrays for each facet. The elements of S are multiplied by the volume of the oriented tetrahedron and added to the output array. Speed is gained by exploiting the fact that the inner loop does not need to visit all index triplets of the geometric moments, but only those for which p + q + r ≤ N [32]. At N = 20, there are 1771 (22.1% of 8000) such (p, q, r) triplets.
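This restriction is easy to express in code; the short sketch below (assuming the p + q + r ≤ N form of the restriction quoted above) merely enumerates the index triplets the inner loop actually has to visit and reproduces the 1771 count for N = 20:

```python
# Enumerate the geometric moment index triplets with p + q + r <= N.
N = 20
triplets = [(p, q, r)
            for p in range(N + 1)
            for q in range(N + 1 - p)
            for r in range(N + 1 - p - q)]
print(len(triplets))  # 1771
```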
2.1.2. The Coefficient Cache
The PKZernike class populates and caches two arrays when it is instantiated. The maximum moment order must be given by the user. The first array holds the binomial coefficients, C(n, k) = n!/(k!(n − k)!), whereas the second array holds the trinomial coefficients, T(n; i, j, k) = n!/(i! j! k!). This procedure is also accelerated with Numba, but because Numba has no support for factorial functions (e.g., scipy.special.factorial), the binomials and trinomials are calculated iteratively using Pascal’s triangle and Pascal’s pyramid, respectively. Overall, the cache is generated in O(N³) time and requires O(N³) space. It can also be copied from another PKZernike object to avoid recalculation.
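A minimal sketch of the iterative, factorial-free approach for the binomial part (the actual cache layout and function names in PKZernike may differ):

```python
# Binomial coefficients built iteratively via Pascal's triangle,
# with no factorial calls, so the function is compilable by Numba.
import numpy as np
from numba import njit

@njit(cache=True)
def binomial_cache(n_max):
    c = np.zeros((n_max + 1, n_max + 1))
    for n in range(n_max + 1):
        c[n, 0] = 1.0
        for k in range(1, n + 1):
            c[n, k] = c[n - 1, k - 1] + c[n - 1, k]
    return c

binom = binomial_cache(20)
print(binom[20, 10])  # 184756
```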
2.1.3. ZC Moments
The cached binomial and trinomial coefficients accelerate the next stage of the PK algorithm, which produces the ZC moments from the geometric moments. Its complexity is O(N⁴) due to the four nested loops populating each of the V, W, X, Y and Z arrays. The trinomial cache also helps with the geometric moments, as each element of their array must be multiplied by a trinomial coefficient; we again avoid explicit factorials by taking this coefficient from the cache. We also locally cache all temporary variables, so that the innermost loops (i.e., at depth 4) never call any math-heavy NumPy function (e.g., numpy.sqrt) on individual numbers. This significantly speeds up this part of the algorithm at N ≤ 20, even without Numba’s JIT.
2.1.4. ZC Descriptors
The last portion of the PKZernike class that is accelerated by Numba calculates the ZC descriptors in the sense of Novotni and Klein [32]. Its algorithmic complexity depends only on the moment order, not on the size of the mesh. Once compiled to machine code, its running time, like that of the coefficient cache and ZC moment functions, becomes negligible—measured in milliseconds.
2.1.5. The Compilation
While Numba delivers an outstanding computational speed boost with no need to change programming languages or interpreters (and it is multi-platform), it does not come without its own cost. Both the just-in-time and ahead-of-time compilation modes incur an overhead that increases with the complexity of the Python code and of the NumPy functions this code calls. The compilation occurs only once per interpreter process, but its delay is noticeable enough to warrant additional optimizations. On our PC, compiling the four calculation functions of the PKZernike class takes:
≈1.4 s for the geometric moment function;
≈0.7 s for the coefficient cache function;
≈1.4 s for the ZC moments function;
≈1.15 s for the ZC descriptors function.
Once everything is compiled (in over 4.5 s), points 2–4 together finish in less than 0.001 s at N = 20 and in ≈0.005 s at N = 35. Without the acceleration, however, the cache is generated in ≈0.01 s at N = 20 and in ≈0.05 s at N = 35. Likewise, if only the geometric moment function is compiled, the ZC moments are obtained from the geometric moments in ≈0.05 s at N = 20 and in ≈0.3 s at N = 35. A total of ≈0.002 s is then needed to calculate their ZCDs at N = 20 (≈0.004 s at N = 35). This benchmark indicates the existence of two use cases: a single mesh and multiple meshes. When N ≤ 25 and the application processes a single mesh and exits, it is more time-efficient to compile only the geometric moment function. But when there are multiple meshes to process, or when the moment order is high, at least the ZC moment function should be compiled, too. This can save a lot of time during mass-scale object retrieval. For example, processing all structures from the PDB on our PC in the single-mesh mode would incur almost 3 h of surplus CPU time (210,000 × 0.05 s). With a compiled ZC moments function, this overhead drops to 3 min. The best times, however, are achieved with Numba’s compilation cache. The cache directory can be specified via the NUMBA_CACHE_DIR environment variable, or by overwriting numba.config.CACHE_DIR. The fully compiled PKZernike class is then instantiated in less than 0.001 s. Loading the cache has a one-time delay of ≈0.175 s.
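Pointing Numba at a persistent compilation cache takes only a few lines; the directory path and function below are, of course, only examples:

```python
# Store the machine code generated by Numba on disk so that subsequent
# interpreter processes skip the compilation delay entirely.
import os
os.environ["NUMBA_CACHE_DIR"] = "/tmp/numba_cache"  # set before importing numba

import numba
# Alternatively, after the import: numba.config.CACHE_DIR = "/tmp/numba_cache"

@numba.njit(cache=True)  # cache=True writes the compiled function to the cache directory
def squared_sum(x):
    total = 0.0
    for v in x:
        total += v * v
    return total
```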
2.1.6. The Comparison
Figure 2 shows the amount of time needed by the serial and parallel run modes of the implementations of the PK algorithm to calculate the ZC descriptors for the all-atom surface meshes of 1SZT (a representative of small proteins) and 1AVO (a representative of medium-sized proteins) at moment orders 1–20. The primary factor contributing to this time was the geometric moments function. In the figure, “Python” means PKZernike without the acceleration (i.e., code employing pure Python loops), “NumPy” means KoehlOptimizations and KoehlMultiproc from Mindboggle (i.e., code relying on NumPy’s vectorizations), and “Numba” means PKZernike with full acceleration (i.e., all functions compiled).
At N ≤ 10, there was a noticeable overhead in KoehlOptimizations. The time needed by the plain, serial PKZernike class then rapidly increased, eventually surpassing KoehlOptimizations at N = 15. A similar situation, albeit on a smaller scale, happened in the parallel mode. As one may expect, the machine code vastly outperformed the Python loops and the NumPy-only classes. At N = 20, the single-core, Numba-powered PKZernike was over 45 times faster than KoehlMultiproc and over 200 times faster than KoehlOptimizations. Its speed gain versus itself in the “Python” mode was ×50 with six workers and ×300 with one worker. This shows the amount of time that can be saved by Numba in math-heavy tasks with nested loops, making PK-Zernike a cost-efficient tool for pythonic protein structure analysis. All that remained was to check its accuracy and numerical stability. Mindboggle’s classes were taken as the gold standard for this purpose.
In this paper, if A and B are two shapes (e.g., two proteins), the difference between their ZC descriptor vectors (ZCDVs, both of length l) is symbolized by d_ZCD(A, B):

d_ZCD(A, B) = Σ_{i=1..l} |zcd(A, i) − zcd(B, i)|   (1)

It is the Manhattan distance between the two ZCDVs, where “zcd” is a hypothetical subroutine of the program that returns the value of the ith ZCD of a given shape. Equation (1) is used throughout the manuscript as the baseline shape distance function.
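Equation (1) maps directly onto NumPy; in the minimal snippet below, the ZCDVs are assumed to be arrays of equal length (121 at N = 20), and the function name is ours, not part of any cited library:

```python
import numpy as np

def d_zcd(zcdv_a: np.ndarray, zcdv_b: np.ndarray) -> float:
    # Equation (1): Manhattan (L1) distance between two ZCD vectors.
    return float(np.abs(zcdv_a - zcdv_b).sum())
```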
The d_ZCD between the ZCDVs of the same protein produced by PK-Zernike and by the reference implementation in Mindboggle was negligible at N = 1, grew slowly with the moment order, and reached ≈0.002 at N = 35. We suspect the cause of this discrepancy to be the different order of additions and multiplications in the internal equations. An error of the same magnitude was observed between the KoehlOptimizations and KoehlMultiproc classes, and between the serial and parallel runs of PKZernike. Thus, it can be considered normal behavior, with no significant impact on the proposed protein structure retrieval (Section 2.2).
All tested approaches exhibited numerical instabilities above order 35, as expected for the PK and NK algorithms with standard double-precision arithmetic [40]. While this is a limiting factor, it did not pose an obstacle for the presented retrieval, since the (small) benefit of a more detailed protein shape encoding at N > 35 is offset by a considerably longer calculation time, approximately ×3 at N = 30 and ×5 at N = 35.
As demonstrated in Figure 2, akin to the Mindboggle library, the run time of the PKZernike class can be further reduced by dispatching a pool of workers to calculate the geometric moments in parallel. The facet array is split into chunks of equal size. The partial results are collected from the pool and added in an out-of-order manner to the final geometric moment array. But because Numba can release Python’s GIL (if nogil = True is passed to numba.jit), PKZernike objects can employ threads instead of processes. This avoids the need to fork the Python interpreter for each worker and reduces the inter-process communication overhead. The speed gain is proportional to the number of threads running on physical CPU cores (e.g., six threads mean roughly the calculation time divided by 6).
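A simplified sketch of this pattern (not the actual PKZernike interface), with a nogil-compiled worker function and a thread pool processing chunks of a facet array:

```python
# Threads instead of processes: the nogil-compiled worker releases the GIL,
# so several chunks of the facet array are processed truly in parallel.
import numpy as np
from numba import njit
from concurrent.futures import ThreadPoolExecutor

@njit(nogil=True, cache=True)
def partial_sum(chunk):
    total = 0.0
    for row in chunk:
        total += row[0] * row[1] * row[2]  # stand-in for the per-facet moment math
    return total

facets = np.random.rand(200_000, 3)
chunks = np.array_split(facets, 6)
with ThreadPoolExecutor(max_workers=6) as pool:
    result = sum(pool.map(partial_sum, chunks))  # out-of-order accumulation
print(result)
```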
2.2. The PK Algorithm in Protein Structure Retrieval
This section describes the novel protein structure retrieval pipeline centered around the fast implementation of the PK algorithm in the PK-Zernike library (Section 2.1).
The primary purpose of this task was to assess the usefulness of the PK algorithm in structural bioinformatics as the provider of the Zernike–Canterakis moments (as opposed to the NK algorithm in a voxel-based workflow), with the calculation of those moments done in a fully pythonic environment (n.b., we consider the use of Numba to conform to this view, since it is a proper, multi-platform Python library, albeit one requiring compilation of its C++ code when installed from source). An external program had to be called only to generate the molecular surface meshes of the proteins. Its contribution to the run time of the pipeline was 1–2× the time of a single-threaded run of PK-Zernike after the meshes were decimated by 50% (the parsing of the PDB files was the third largest time factor).
The second goal was to demonstrate that this pipeline, despite being relatively uncomplicated to make with the popular Python libraries and with PK-Zernike, can achieve the discrimination ability comparable with that of the state-of-the-art retrieval systems.
The flowchart of the proposed pipeline is given in Figure 3. It has eight steps. Steps in green are rudimentary—they involve basic text parsing or numerical comparisons. Steps in orange require advanced programming, but can be performed with just a few calls to functions from dedicated libraries (a numerical package may provide a fast SVD subroutine, e.g., numpy.linalg.svd). Steps in red are challenging to code and take more time to complete than the other instructions on this flowchart.
2.2.1. Preparation of the Protein Structure
After the PDB file with the structure is loaded, it must undergo the usual cleaning. Water molecules (HOH) are not needed, and the alternative atom locations are normalized by keeping only the first rotamer with the highest occupancy (typically A).
The application should also parse the MODRES records from the PDB header to tell the modified polymer residues and proper ligands apart. The link between the MODRESes and their standard parents is necessary for an accurate molecular surface calculation as the van der Waals radii are typically not provided for the first group.
2.2.2. Selection of the Input Fragment
The user may now choose a specific fragment of the structure to be passed to the next steps of the pipeline. It may be the entire molecule (i.e., all chains), a single chain, a group of chains (not necessarily in non-bonded contact with each other), or a more sophisticated selection of residues, such as a domain or a beta sheet. The only limiting factor is whether the molecular surface package will be able to converge on this selection.
Here, all non-protein entities (nucleic acids, etc.) were removed from the input structures at this stage as they were out of scope of this research. Hydrogen and deuterium atoms were removed, too. They are ignored during the outlier residue detection and the molecular surface calculation, so there was no need for keeping them in memory.
2.2.3. Detection of the Outlier Residues
The selected portion of the structure undergoes the PCA-based outlier residue detection procedure described in
Section 3.2.1. Its default settings are recommended, namely
p = 0.9 (i.e., the confidence level) and
r = 3 (i.e., the number of detection rounds). Residues classified as outliers are excluded from the subsequent stages of the pipeline.
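The exact subroutine is defined in Section 3.2.1; the snippet below is only a generic stand-in that assumes a confidence-ellipsoid (Mahalanobis distance) criterion applied iteratively to the effective atom coordinates:

```python
# Assumption-based sketch of iterative, confidence-ellipsoid outlier flagging
# on effective atom coordinates; not the exact subroutine from Section 3.2.1.
import numpy as np
from scipy.stats import chi2

def flag_outliers(coords, p=0.9, rounds=3):
    keep = np.ones(len(coords), dtype=bool)
    limit = chi2.ppf(p, df=3)  # squared Mahalanobis radius of the confidence ellipsoid
    for _ in range(rounds):
        pts = coords[keep]
        center = pts.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(pts, rowvar=False))
        diff = coords - center
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distances
        keep &= d2 <= limit
    return ~keep  # True marks an outlier residue
```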
The main purpose of this step is the same as in the EP algorithm [22,23]—to eliminate the exposed, often disordered fragments of the structure that contribute little to its volume but artificially inflate its bounding shape (the MVEE [61] in EP, the sphere in ZC), skewing the metrics that rely on this shape. Figure 4 shows how strongly the presence of such fragments can affect the ZCDs of 4B0H:A (at P120–G131) and 1IIE:A−C (at Q184–K192).
This procedure may cause some of the external but not so strongly protruding residues to be discarded. It allows the EP algorithm to attain more compact (i.e., tight) ellipsoidal representations of the proteins, counterbalancing the dependence of the MVEE algorithm on the shape and size of the convex hull of its input. This also benefits protein structure retrieval, in particular when similarly shaped molecules are considered. Such molecules are expected to share a common structural motif, but may differ in the details of their surface: a bent loop here, an extra helix there, etc. A controlled (i.e., at r = 3) removal of some of those details diminishes the differences and emphasizes the main similarities. This, in turn, increases the overall sensitivity of the retrieval more than it decreases its specificity. The evidence in favor of this conjecture is presented in Section 2.3.
2.2.4. Calculation of Molecular Surfaces
The molecular surface of the remaining (i.e., guide) portion of the molecule is now calculated. Here, this task was carried out with Michel Sanner’s MSMS program [62] (details in Section 3.2.2). MSMS outputs text files with the vertex and facet arrays of unstructured triangle meshes. These meshes are passed as the input to the PK algorithm.
Two kinds of meshes were generated for the proteins in this experiment. The first mesh was based on all heavy atoms (i.e., in the all-atom mode, e.g., Figure 1a), while the second used only the backbone atoms (i.e., in the backbone atom mode, e.g., Figure 1b). In the backbone mode, the input protein was stripped down to its N, Cα, C, O atoms and split into individual chains. Each chain was sent separately to MSMS, and the resulting set of meshes was merged into a single mesh. No special mesh merging method was needed, owing to the distance between the molecular surfaces of the backbones.
2.2.5. Decimation of Molecular Surfaces
The resolution of the triangulated molecular surfaces produced by MSMS is controlled by the -density and -hdensity parameters. Their default values deliver detailed meshes with highly uniform triangles (Figure 1a,b). While the number of those facets is not overly high, even this level of detail was unnecessary for the current study. Thus, the resolution of all molecular surfaces was reduced by half (Section 3.2.3); 50% fewer facets in the mesh means a 50% shorter geometric moment calculation time.
Pozo et al. [31] showed that adaptive mesh decimation has a lower error rate than uniform decimation in conjunction with the exact, volume-like geometric moments. However, their tests involved smooth 3D models, whereas molecular surface meshes are rugged and noisy. We found that when the shape difference between the original and the decimated molecular surface was measured with d_ZCD, depending on the reduction factor, the uniform algorithm delivered a 2–4 times lower value at N = 20 (and with an overall lower standard deviation across different structures) than the adaptive algorithm.
The impact of the various reduction factors (50, 75, 90 and 95%) on the ZCDs of the molecular surface of 1SZT:A calculated for the backbone atoms is shown in Figure 5. The uniform decimation filter [63] stayed close to the original volume of the mesh for longer than the adaptive filter [64]. It was also able to keep the surface intact at the highest of the four factors, while the other algorithm disintegrated it when its topology preservation setting was not enabled (note that a broken mesh does not break the PK algorithm).
For the above reasons, we decided to rely on the uniform decimation in this experiment. While a conservative 50% mesh reduction was applied to all molecules, as exemplified by Figure 5, factors of up to 90% are also viable options (90% equals a speed-up by an order of magnitude). In fact, the results from Section 2.5 show that uniform decimation of the output of MSMS by 90% was sufficient for the retrieval of the example proteins.
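For illustration, both kinds of decimation are available in PyVista/VTK [66,67]; whether these particular filters are identical to the uniform [63] and adaptive [64] filters used in this study is an assumption of this sketch, as is the input file name:

```python
# Decimating an MSMS-derived surface mesh by 50% with two PyVista filters.
import pyvista as pv

mesh = pv.read("1szt_backbone.ply")                          # hypothetical mesh file
uniform = mesh.decimate(0.5)                                  # quadric decimation
adaptive = mesh.decimate_pro(0.5, preserve_topology=True)     # progressive decimation
print(mesh.n_cells, uniform.n_cells, adaptive.n_cells)        # facet counts
```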
Figure 5 is also a good example of the resistance of the PK algorithm (and of the ZC moments in general) to the noise caused by the simplification of the mesh—the descriptors drift apart from the original, but do not undergo sudden, spectacular alterations.
2.2.6. Extraction of Geometric Features
ZC moments require all points of the shape (i.e., mesh vertices or voxel centers) to reside within the unit ball. The easiest way to fulfill this requirement is to shift the average position of these points to the origin, find their largest distance to the origin, R_max, and divide their coordinates by R_max. All unit ball scale factors discussed in this paper should be understood like this—as coordinate divisors. Unlike voxel-based methods, the meshes do not need to be pre-scaled for the PK algorithm to fill a grid of a specific size. They may, however, have more facets than the number of occupied voxels.
Citing the poor resolution of ZC functions near the unit sphere [58], Ljung and André [20] set the scale factor in their ZEAL program above R_max (i.e., they increase R_max by a factor of ≈1.43). In BioZernike, Guzenko et al. [27] scale the structures by a multiple of R_g, where R_g is the radius of gyration, the root mean square of the distances of the Cα atoms to the center of mass of the molecule. Ljung and André also point out that in their C++ library, Novotni and Klein [32] use a scale factor that does not guarantee that the entire shape will fit inside the unit ball. A radius-of-gyration-based factor means that an even greater portion of the molecule may be left outside. There are two ways to address this: increase the scale factor to at least R_max or remove the outliers. The removal is less damaging to the voxelized representation of the proteins, as it permits partial atomic volumes to remain inside the unit ball. Since it only affects the external voxels, it acts akin to the outlier detection procedure from Section 2.2.3, exchanging some of the details of the molecular surface for an increased sensitivity of the retrieval. It cannot, however, eliminate any strongly protruding fragments like those in 4B0H and 1IIE (Figure 4). On the other hand, removing a vertex of a mesh removes all facets it belongs to. The volume-like variant of the PK algorithm expects the surface to be closed (i.e., hole-free), so its unit ball scale factors should not be lower than R_max.
ZC moments encode the shape of a protein, but the necessary scaling diminishes the information coming from its size. To compensate, Guzenko et al. employ additional geometric features in their retrieval process, such as the distribution of atomic distances to the origin, the radius of gyration and the molecular weight. The exact PK algorithm based on the volume-like geometric moments is naturally predisposed to an accurate description of the interior of the proteins (within the limits of the ZC formalism), but at the same time it relies on their molecular surface to delimit that interior. In the all-atom mode, only the overall shape of the structure is captured, whereas Guzenko et al. stress the need to encode its entire volume. This is why we calculated both all-atom and backbone atom molecular surfaces (Section 2.2.4). It gave us the possibility to observe the change in the specificity of the retrieval following the inclusion of polypeptide conformation data. Following Guzenko et al., we also collected three geometric features of each protein: its size, its volume and the histogram of the distribution of its effective atoms. The benefit of effective atoms is that they are invariant to the shape and complexity of the surface mesh of the molecule.
The distribution of residues inside a protein is approximated by the distribution of the distances of the guide effective atoms (Section 2.2.3) to the origin, after the structure has been translated so that these atoms are centered at the origin. The necessary transformation is produced by the outlier detection algorithm from Section 2.2.3. The default distance histogram spans the 0–1000 Å range at 2 Å intervals. Its upper limit can be as high as needed, but we found 2 Å to be the optimal bin size. There is one problem: the locations of the effective atom centroids may vary even between similar structures, leading to an overestimation of the discrepancies between them. To counterbalance this, the histograms are smoothed with the Savitzky–Golay convolution [65] using a third-order polynomial and a window length of five. Any negative values it may produce must be clipped to zero. The smoothing diminishes some of the local zigzags while retaining the overall trend of the distribution.
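This smoothing step maps directly onto SciPy; in the short example below, the effective atom coordinates are randomly generated placeholders:

```python
# Distance histogram of effective atoms, smoothed with a Savitzky-Golay filter
# (third-order polynomial, window length 5), clipped at zero and normalized.
import numpy as np
from scipy.signal import savgol_filter

effective_atoms = np.random.rand(500, 3) * 40.0          # placeholder coordinates, in Å
distances = np.linalg.norm(effective_atoms - effective_atoms.mean(axis=0), axis=1)

hist, _ = np.histogram(distances, bins=np.arange(0.0, 1002.0, 2.0))  # 2 Å bins, 0-1000 Å
smooth = np.clip(savgol_filter(hist.astype(float), window_length=5, polyorder=3), 0.0, None)
smooth /= smooth.sum()                                     # normalization used later in Eq. (2)
```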
The size of a protein is encoded by the standard deviations of its guide effective atoms in each dimension. The whole structure participates in this measurement (sans the outliers). The atoms need to be centered at the origin and rotated into alignment with the principal axes of the coordinate system to maximize their variance. Once again, this alignment can be easily achieved with the algorithm from Section 2.2.3.
The volume of a protein is simply the volume of its mesh prior to the unit ball scaling. It can be quickly calculated with PyVista/VTK [66,67], even if the surface is not continuous. Unlike the other two geometric features, it depends on the mesh rather than on the effective atoms. To approximate the volume from the effective atoms, one would need to voxelize the protein, which is something we wanted to avoid here.
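Reading this volume with PyVista is essentially a one-liner (the mesh file name is illustrative):

```python
# Volume enclosed by the (possibly decimated) molecular surface, in cubic Å.
import pyvista as pv

mesh = pv.read("1div_A_backbone.ply")  # hypothetical mesh file
print(mesh.volume)
```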
If the atoms of the protein are not provided (only its mesh), the effective atoms can be replaced with the vertices of the mesh for the purpose of these geometric features.
Lastly, there is the question of the unit ball scale factor. Four candidate solutions were already mentioned: R_max of the mesh vertices, the enlarged ZEAL-style factor, the R_g-based factor of BioZernike and the Novotni–Klein factor. Obviously, R_max must be calculated for the vertices of the mesh to ensure that they all stay inside the unit ball, but the other factors can be based on the (guide) effective atoms to make them mesh-invariant (the maximum distance entering them was here that of an effective atom, not of a vertex). Because the effective atoms are shifted away from the molecular surface, their maximum distance to the origin is too short to guarantee that the whole mesh stays inside the unit ball. This leaves the remaining three candidates.
Here, we present another unit ball scale factor: r_ell, defined as the largest of the three radii of the confidence ellipsoid of the guide effective atoms at p = 0.95 (a Mahalanobis radius of ≈2.795483, Section 2.2.3). r_ell is highly variable; depending on the protein, it can reach values close to R_max (e.g., in 5AHE) or considerably higher (e.g., in 1YAR). The reliance on the confidence level is both an advantage and a drawback—it allows and requires tuning of the algorithm, although p = 0.95 appears to be a safe default value. Figure 6 shows how the different scale factors affect the ZCDs of 1DIV:A. It also shows an example effective atom distance histogram before and after smoothing.
2.2.7. Calculation of ZC Moments
The vertices of the mesh are centered at the origin and their coordinates are divided by the chosen unit ball scale factor. If this factor happens to be smaller than R_max (Section 2.2.6), it is increased to R_max. The entire mesh then resides inside the unit ball.
The fast Python implementation of the PK algorithm from Section 2.1 is called on the vertex and facet arrays of the mesh, producing an order-20 Zernike–Canterakis descriptor vector (ZCDV) in the sense of Novotni and Klein (i.e., with 121 ZCDs).
2.2.8. Comparison of ZC Descriptors
If A and B are two shapes—here, the molecular surface meshes of two protein structures—the descriptors of their ZC moments produced by the PK algorithm should range from 0 to 1. In this experiment, before the difference between them was measured with Equation (1), each ZCDV was normalized by dividing it by the sum of all its ZCDs. This capped d_ZCD at a maximum value of 2. d_ZCD(A, B) = 2 happens only when all ZCDs of A and B are 0 except for zcd(A, i) = 1 = zcd(B, j) with i ≠ j, which is an unlikely situation. However, the normalization made it easier to define universal d_ZCD-based structural similarity thresholds.
After experimentation with the example proteins (Section 3.1.2), we devised a similarity scale based on d_ZCD with steps at 0.2 intervals between 0 and 1, plus a final step at 2. It is shown in Table 1. A and B had the same or a nearly identical shape when d_ZCD(A, B) < 0.2. No structural similarity should be expected between them when d_ZCD(A, B) ≥ 1.
1YAR:(A−U)×2 was the only structure that the scale from Table 1 marked as vastly different from all the other example proteins in Figure S31 in Supplemental File S3. It achieved d_ZCD ≥ 0.6 with all of them and d_ZCD ≥ 0.8 with most of them. The other uniquely shaped protein in this set, 1AV1:A−D, had a medium similarity score (i.e., Tier 3) versus a third of the examples. With the exception of 1AIK:(N+C) and 1SZT:A, the Tier 1 results were accurate—the nearly identical shapes were paired without false positives. But there were too many false positive results in the second and third tiers, which were expected to capture only the rest of the similar structures with the same domain superfamily classification. This confirms that shape encoding via ZC moments alone is insufficient for a robust protein structure retrieval system and must be supplemented with additional geometric features to improve its discrimination power. Three such features were described in Section 2.2.6: the smoothed histogram of the effective atom distances to the origin, the size of the molecule and the volume of its surface mesh. Here, we show how they can be incorporated into Equation (1) while staying compatible with the interpretation in Table 1.
d_H gauges the difference between the first smoothed and then normalized effective atom distance histograms of A and B:

d_H(A, B) = √( Σ_{b=0..b_max} (Hist(A, b) − Hist(B, b))² )   (2)

“Hist” is a hypothetical subroutine of the program that returns the smoothed and normalized number of guide effective atoms whose distance to the origin falls in the [2b, 2b + 2) range, where b is the bin index and 2 Å is the bin width. b_max should be high enough (e.g., 500) to contain the largest of the input structures. Alternatively, if the histograms have different lengths, the shorter one can be padded with trailing zeros for the same effect.
d_ZH is simply the sum of the values of d_ZCD and d_H:

d_ZH(A, B) = d_ZCD(A, B) + d_H(A, B)   (3)

The normalization of the histograms and the Euclidean distance in Equation (2) prevent d_H from dominating d_ZCD, while still allowing d_H to push apart proteins that have a similar shape in the sense of ZC descriptors but different distributions of residues inside them.
d_S gauges the difference between the sizes of A and B:

d_S(A, B) = Π_{d=1..3} ( max(Size(A, d), Size(B, d)) / min(Size(A, d), Size(B, d)) + ε_S )   (4)

“Size” is a hypothetical subroutine of the program that measures the spread of the atomic density of a given protein in the dth dimension. Here, it was the standard deviation of the guide effective atoms after they were rotated into alignment with the principal axes of the coordinate system. It does not matter that this coefficient is smaller than the actual length of the structure; it could also be the radius of the confidence ellipsoid.
In Equation (4), the larger size coefficient in each dimension is divided by the smaller one. The three fractions are multiplied, but before that, ε_S is added to each of them. It is a tuning parameter controlled by the user; its suggested default is a small value close to zero.
d_V gauges the difference between the volumes of A and B:

d_V(A, B) = max(Vol(A), Vol(B)) / min(Vol(A), Vol(B)) + ε_V   (5)

“Vol” is a hypothetical subroutine of the program that returns the volume of a given mesh before its placement inside the unit ball. The idea is the same as in Equation (4): the larger volume is divided by the smaller volume, and ε_V is added to the ratio, shifting it up or down. The suggested default value of ε_V is likewise close to zero.
d_S and d_V are combined with d_ZH via multiplication, yielding d_ZHS and d_ZHV:

d_ZHS(A, B) = d_ZH(A, B) · d_S(A, B)   (6)

d_ZHV(A, B) = d_ZH(A, B) · d_V(A, B)   (7)

d_S and d_V act as penalty coefficients for d_ZH when ε_S ≥ 0 and ε_V ≥ 0. Setting ε_S or ε_V to a negative value allows the shape distance between similar structures measured with d_ZHS or d_ZHV to move towards the upper rows of Table 1.
Finally, d_ZHSV is the combination of d_ZH with both d_S and d_V:

d_ZHSV(A, B) = d_ZH(A, B) · d_S(A, B) · d_V(A, B)   (8)

d_S and d_V have the lowest discrimination ability on their own, which is why they are not combined with d_ZCD directly, but through d_ZH. d_H is also weak on its own, but becomes an important factor in d_ZH. This leaves d_ZCD, d_ZH, d_ZHS, d_ZHV and d_ZHSV as the useful shape distance functions. It should also be noted that all of the presented equations are trivially vectorizable, facilitating quick searches in protein databases. If the value of any of these functions surpasses 2 (which is expected for dissimilar proteins when ε_S and ε_V are in effect), it should be clipped to 2 to keep it consistent with Table 1.
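A compact sketch of how these functions could be combined in code; the function and argument names below are ours (they mirror the hypothetical “zcd”, “Hist”, “Size” and “Vol” subroutines), the inputs are assumed to be precomputed NumPy arrays, and the epsilon defaults are placeholders:

```python
import numpy as np

def d_zcd_norm(zcd_a, zcd_b):
    # Equation (1) on sum-normalized ZCD vectors (Section 2.2.8)
    a = zcd_a / zcd_a.sum()
    b = zcd_b / zcd_b.sum()
    return np.abs(a - b).sum()

def d_zhsv(zcd_a, zcd_b, hist_a, hist_b, size_a, size_b, vol_a, vol_b,
           eps_s=0.0, eps_v=0.0):
    d_h = np.sqrt(((hist_a - hist_b) ** 2).sum())          # Equation (2)
    d_zh = d_zcd_norm(zcd_a, zcd_b) + d_h                   # Equation (3)
    ratios = np.maximum(size_a, size_b) / np.minimum(size_a, size_b)
    d_s = np.prod(ratios + eps_s)                           # Equation (4)
    d_v = max(vol_a, vol_b) / min(vol_a, vol_b) + eps_v     # Equation (5)
    return min(d_zh * d_s * d_v, 2.0)                       # Equation (8), clipped to 2
```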
2.3. Results of the Retrieval—The BioZernike Repository
The protein structure retrieval pipeline presented in Section 2.2 has several combinations of settings: (1) the molecular surface mode: all atoms or backbone atoms; (2) the surface mesh decimation factor: fixed here at 50%; (3) the outlier residue detection: enabled (3 rounds) or disabled (0 rounds); (4) the unit ball scale factor: one of the four candidates from Section 2.2.6; and (5) the shape distance function: d_ZCD, d_ZH, d_ZHS, d_ZHV or d_ZHSV. While one may surmise that backbone atoms and enabled outlier detection are the right choices, the choice of the unit ball scale factor and of the distance function is non-trivial. Hence, to determine the optimal settings of this pipeline, and to compare its discrimination ability with the state-of-the-art algorithms, we turned to the BioZernike validation repository [50] and its three test suites: “CATH”, “ECOD” and “assemblies”. The database and the testing protocol are described in Section 3.1.1. Guzenko et al. [27] provided the ROC curves and the areas underneath them (AUROCs) for the retrievals they performed with the BioZernike, 3D Surfer [35] and Omokage [68] algorithms (the Omokage search is a non-Zernike biomolecular structure retrieval system based on histograms of interatomic distances).
Figure 7 presents the ROC curves of the retrieval performed with the following settings: molecular surface of backbone atoms, outlier detection enabled, and the best-performing unit ball scale factor. This combination of settings achieved the highest sum of AUROCs using d_ZHSV as the shape distance metric: 0.961 in the CATH, 0.959 in the ECOD and 0.995 in the assembly suite. The corresponding AUROCs for the other three scale factors were 0.957, 0.959 and 0.995 (Figure S30e,j,o); 0.954, 0.958 and 0.994 (Figure S26e,j,o); and 0.952, 0.954 and 0.996 (Figure S24e,j,o). While the best of these numbers are lower than the AUROCs reported for BioZernike (0.97 for ECOD and 1.0 for the assemblies [27]), they were reached in an easily programmable and modifiable ab initio experiment (Guzenko et al. also reported AUROCs of 0.93 for ECOD and 0.95 for the assemblies with 3D Surfer, and 0.84 and 0.99 with Omokage). We believe this is a promising result that may be optimized further, for example, by introducing another geometric feature. Overall, one of the two leading scale factors gave the similarity thresholds a higher TPR and FPR than the other; with d_ZHSV, the FPR difference between these two scale factors was up to ≈0.03 and the TPR difference up to ≈0.05.
Molecular surfaces based on backbone atoms indeed permit more accurate domain assignment prediction than all-atom surfaces. The AUROCs in Figure 7a,f (i.e., for d_ZCD) are 0.940 and 0.926, while the corresponding values in Figure S20a,f are only 0.869 and 0.871. While the overall shape of domains from different superfamilies may be similar, the backbone traces unveil their internal structure, allowing the ROC curves to stay closer to the left boundary of the ROC space. The opposite happened in the assembly suite, although with a much smaller magnitude; 0.987 in Figure 7k and 0.992 in Figure S20k are an example of this.
AUROCs for all-atom surfaces were also a bit higher, at 0.997 for all scale factors (Figures S16o, S18o, S20o and S22o). This rounds to 1.0, the same as for BioZernike.
Turning on the outlier detection subroutine moved all curves toward the top of the ROC space. Without it, but with the same scale factor and the backbone atom surface, the AUROCs dropped to 0.874, 0.877 and 0.967 (d_ZCD, Figure S27a,f,k). This means that the outlier detection is as important as the type of molecular surface. The small loss of specificity it incurred (up to ≈0.02) was followed by a massive gain in sensitivity (up to ≈0.1 at Tier 2). It is a worthy trade-off with a linear time cost with respect to the number of residues.
It was confirmed that d_ZCD has the lowest discrimination ability of the five tested shape distance functions. When combined with the outlier detection turned off, the all-atom surface and the baseline scale factor (i.e., the baseline settings, Figure S15a,f,k), it performed the worst, with AUROCs of 0.746, 0.777 and 0.947. Switching to the backbone atoms changed these values to 0.789, 0.828 and 0.945, respectively (Figure S23a,f,k). Turning on the outlier detection caused the d_ZCD AUROCs to rise to 0.811, 0.833 and 0.988 with all-atom surfaces (Figure S16a,f,k), and to 0.874, 0.885 and 0.983 with backbone atom surfaces (Figure S24a,f,k).
Another significant improvement was brought about by the introduction of d_H—that is, by the augmentation of the shape distance function from d_ZCD to d_ZH. Once again, this confirms the need to include the geometric data of the proteins for accurate retrieval. The AUROCs for the baseline settings increased considerably with d_ZH, to 0.870, 0.889 and 0.972 (Figure S15b,g,l). The switch to backbone atoms set these numbers to 0.871, 0.900 and 0.971 (Figure S23b,g,l). Outlier detection moved them even further up, to 0.923, 0.934 and 0.995 (all atoms, Figure S16b,g,l), and to 0.929, 0.935 and 0.992 (backbone atoms, Figure S24b,g,l). Discounting d_ZCD, all shape distance functions at the Tier 1 threshold of structural similarity (i.e., the black markers) had the lowest TPR but also an FPR below 0.0005.
The introduction of d_S and d_V to the retrieval brought an increase in its specificity at the expense of its sensitivity, an effect similar to that of enabling the outlier detection, only with a loss rather than a gain of TPR and FPR. However, the net outcome was positive, with a higher area under the ROC curve with respect to d_ZH. Also, in comparison to d_ZH, all structural similarity threshold markers were shifted towards the beginning of the d_ZHS and d_ZHV curves (Figure 7c,d,h,i,m,n), capturing far fewer false positive results and somewhat fewer true positive results. This phenomenon was independent of the surface mode and of the unit ball scale factor. It was stronger with d_ZHS than with d_ZHV. The reason for this is obvious: two meshes may have similar volumes but different sizes.
The markers of the Tier 2 threshold on the d_ZH curves were closest to the top left corner of the ROC space (e.g., the brown marker in Figure 7b,g,l). This threshold saw the highest TPR loss on the d_ZHS and d_ZHV curves for the domains, up to ≈0.1, but only up to ≈0.025 for the assemblies. It was compensated by the strong movement of the higher-tier markers to the left (i.e., an FPR reduction). They formed the least spread-out clusters near the point [0, 1] when d_S and d_V were combined with d_ZH into d_ZHSV (Figure 7e,j,o), especially in the case of assemblies. It seems that with the surface mesh calculated for backbone atoms, and with outlier detection enabled, regardless of the unit ball scale factor, the point on the d_ZHSV curve closest to the “best” corner of the ROC space lay somewhere in the Tier 3 range, from 0.4 to 0.6.
2.4. Results of Retrieval—Example Proteins
An optimal combination of settings of the protein structure retrieval pipeline proposed in Section 2.2 was determined in Section 2.3: the surface mesh of backbone atoms, outlier residue detection enabled, either of the two leading unit ball scale factors (one giving a lower, the other a higher TPR and FPR), and d_ZHSV as the primary shape distance metric. For an in-depth analysis of the results this pipeline can deliver, it was applied to a collection of 30 fragments of example proteins from Section 3.1.2. These fragments represent biological assemblies, chains and chain complexes with various shapes, sizes, domain classifications and levels of structural similarity. They are listed in Table 2 together with additional experimental data.
A single unit ball scale factor from this pair was used throughout this experiment. The ZCDVs of the 30 inputs from Table 2 were compared with each other, yielding 435 values each of d_ZCD, d_ZH, d_ZHS, d_ZHV and d_ZHSV. These values are presented in the form of symmetric heatmaps in Supplemental File S3, Figures S31–S35. Figure 8 is a copy of Figure S35.
As expected, d_ZHSV exhibited the strongest discrimination ability—Figure 8 is sparse (i.e., mostly gray, d_ZHSV ≥ 1). If one considers d_ZHSV = 0.5 to be the similarity threshold, all but 4 pairs of input structures were scored in accordance with their domain classification in the sense of the SCOP/CATH superfamily. Incrementing this threshold to 0.6 added 6 more false positives. However, 4 of those 10 incorrect results were technically duplicates, since they involved 4B0H:A/4B0H:B and 1IIE:A−C/1IIE:(A−C)(118–183), which are parts of the same proteins (e.g., 4B0H:A ≈ 4B0H:B(2–131)).
Using one worker thread, our PC took 17.8 s to calculate all 30 ZCDVs (9.6 s without 1YAR). With six threads, these times were reduced to 3.1 s and 1.7 s, respectively (note that these are only the runtimes of the PKZernike class). The average vertex-to-facet ratio in the decimated surface meshes was 0.5 (σ = 0.002), and the correlation coefficient between the number of facets and the volume of the mesh was 0.999 (0.994 without 1YAR:(A−U)×2).
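The multi-threaded run can be reproduced along the following lines; the PKZernike constructor and the descriptors() method shown here are assumed, not documented, interfaces, and the file paths are placeholders. Threads (rather than processes) can be effective here when the compiled kernels release the GIL (e.g., Numba functions compiled with nogil=True).

```python
# Sketch of thread-based parallel ZCDV calculation for independent meshes.
# Assumptions: the import path, the PKZernike(vertices, facets) constructor and
# the .descriptors(order) method are hypothetical; check the library itself.
from concurrent.futures import ThreadPoolExecutor
import pyvista as pv
from pkzernike import PKZernike   # assumed import path

def zcdv_for_mesh(path, order=20):
    mesh = pv.read(path)                        # e.g., an MSMS surface saved as OBJ/PLY
    vertices = mesh.points
    facets = mesh.faces.reshape(-1, 4)[:, 1:]   # all-triangle meshes only
    return PKZernike(vertices, facets).descriptors(order)   # hypothetical call

mesh_files = [f"surfaces/input_{i:02d}.obj" for i in range(30)]   # placeholder paths
with ThreadPoolExecutor(max_workers=6) as pool:
    zcdvs = list(pool.map(zcdv_for_mesh, mesh_files))
```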
The outlier detection subroutine removed the C-termini from 1IIE:(A−C), reducing it to 1IIE:(A−C)(118–183) and causing their mutual score to become 0. The removal of the C-termini in 4B0H:A and 4B0H:B overcame the issue of the truncation of chain A, leading to a score of 0.06 (otherwise it would be 0.36). 4FFY:H+L and 5KVE:L (i.e., the mouse antibodies) received a score of 0.09, confirming that the two chains in 4FFY structurally match the single chain in 5KVE. Outlier detection can be disabled for an increased specificity at the cost of a lowered sensitivity of the retrieval. This may be beneficial when one is seeking exact or near-exact matches, for example, to find 4B0H:C but not 4FFY:A when 4B0H:B is the query structure. In this case, the important shape information is kept in the outlier region.
High similarity scores (0.2 ≤ score < 0.4) were achieved by 1AIK:(N+C)×3 and 1SZT:A×3 (their score of 0.201 was close to Tier 1), 1AVO:A−N and 1YAR:O−U, 1AVO:A+B and 1YAR:O, 4FFY:A+H+L and 5KVE:E+L, 4FFY:A and 5KVE:E, and by 4B0H:A/4B0H:B paired with 4FFY:A and 5KVE:E. The last group comprised the first four (two non-redundant) false positive results due to mismatched domain classification. It happened because the chains of 4B0H without their C-termini are similar to the viral proteins, albeit larger, as corroborated by the CE RMSD of 5.42 Å over 80 residues. They are, however, CATH distorted sandwiches (2.70), whereas 4FFY:A and 5KVE:E are “regular” sandwiches (2.60). Any trace of similarity between them is lost (score ≥ 1) when the outliers are not removed.
Medium similarity scores (0.4 ≤ score < 0.6) were assigned to 1AIK:N+C and 1SZT:A, to 1IIE:A−C/1IIE:(A−C)(118–183) paired with 4RXF:A and 5AHE (two non-redundant false positives), to 1TJ7:A and 3NZ4:A, to 1UIP paired with 4B0H:A−C and 4FFY:A+H+L (two false positives), and to 4RXF:A and 5AHE. A geometric term of 1.41 was a significant factor in the score of 0.46 for 1AIK:N+C and 1SZT:A. 1TJ7:A and 3NZ4:A have three CATH domains, two of them with matching superfamilies; SCOP puts these proteins as a whole in one superfamily by an automated match. Despite their differences, they received a score of 0.45. Conversely, 4RXF:A and 5AHE belong to different SCOP superfamilies, but CATH assigns them to the same superfamily. 1UIP is their cousin in SCOP fold c.1 and in CATH topology 3.20.20, yet it exhibited incorrect closeness to 4B0H:A−C and 4FFY:A+H+L, although both times the score was above 0.5. The medium structural similarity between 1IIE:(A−C)(118–183) and 4RXF:A and 5AHE was easier to point out, both under visual inspection and via the CE RMSD of 5.86 Å over 72 residues. Like them, 1IIE:(A−C)(118–183) features a central “tunnel” in the surface of its backbone atoms, only surrounded by helices rather than sheets. Its score versus each of these proteins was also above 0.5. The PDB website reports the intact molecule of 1IIE:A−C (i.e., with its disordered C-termini present) to have a unique shape in the entire database. Likewise, the outlying C-terminal helix makes 4RXF:A appear much different from 5AHE (score ≥ 1). After it was removed, their score dropped to 0.44.
All results with scores ≥ 0.6 were true negatives. This analysis suggests that structurally identical or nearly identical proteins score below 0.2. High similarity (i.e., a high probability that the molecules possess the same or closely related domain composition) is signaled by a score below 0.4, and it is worth investigating pairs scoring below 0.5, possibly extending this search to pairs scoring below 0.6.
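These cut-offs can be encoded in a trivial helper such as the one below; the thresholds are taken from the analysis above, while the verbal labels are this sketch's own.

```python
# Thresholds follow the analysis above; the labels are illustrative only.
def interpret_score(score: float) -> str:
    if score < 0.2:
        return "structurally identical or nearly identical"
    if score < 0.4:
        return "high similarity (same or closely related domain composition likely)"
    if score < 0.5:
        return "worth investigating"
    if score < 0.6:
        return "weaker candidate, optional follow-up"
    return "dissimilar (true negative in this analysis)"

print(interpret_score(0.201))   # e.g., 1AIK:(N+C)x3 vs. 1SZT:Ax3
```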
2.5. Results of Retrieval—The 3D Surfer
Lastly, the 30 structures from Section 2.4 were submitted to the original 3D Surfer program [35]. This comparison was particularly interesting because our pipeline employs the PK algorithm, while 3D Surfer relies on the voxel-based Novotni–Klein (NK) method to produce the ZC moments. Both approaches calculate ZC descriptors at moment order 20, which means that their output can be passed to Equation (1) and the other equations, analyzed in the sense of Table 1, and presented for comparison in the same form as Figure 8.
First, 3D Surfer generates the molecular surface mesh of the input structure using EDTSurf [24] (the PDB files are parsed with BioPython [69]). The mesh is then scaled down or up to fill a 64 × 64 × 64 grid and discretized into the nearby voxels. ZC moments and descriptors are calculated for those voxels. The unit ball scale factor is the same as in the C++ library of Novotni and Klein.
ZCDVs of the structures from the PDB can be downloaded from the 3D Surfer website [70,71]. Descriptors of custom models can also be produced with the offline version of 3D Surfer [72] (the EDTSurf binary must be downloaded separately from its website [73]). We calculated them in the backbone atom (“main chain”) mode of the program.
Because the original 3D Surfer does not perform structural outlier detection, to make this comparison meaningful, the outlier residues were removed a priori from all input files. After normalization, the ZCDVs calculated by 3D Surfer for the 30 example structures were compared with each other using the five shape distance functions. This time, however, the lower triangle of each heatmap was filled with the output of the PK-Zernike-based pipeline (run with a fixed unit ball scale factor). These heatmaps are given in Supplemental File S4, Figures S36–S40. Figure 9 is a copy of Figure S40.
The accordance of the two methods was measured with correlation coefficients (CCs). They are listed in Table 3 for the five shape distance functions. Due to the sparseness of Figure 9, the comparisons were made in three value ranges: 0–2 (i.e., all heatmap cells), 0–1 (i.e., by clipping the gray cell values to 1) and below 1 (i.e., by ignoring the gray cells).
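A sketch of these three measurements is given below, assuming two symmetric 30 × 30 score matrices (one from each pipeline) whose off-diagonal values lie in the 0–2 range, with “gray” cells being those ≥ 1.

```python
# Sketch under the stated assumptions; NumPy only.
import numpy as np

def range_ccs(a, b):
    """Correlation coefficients between two symmetric score matrices in three value ranges."""
    iu = np.triu_indices_from(a, k=1)               # 435 unique pairs for 30 structures
    x, y = a[iu], b[iu]
    cc_all = np.corrcoef(x, y)[0, 1]                                     # 0-2: all cells
    cc_clipped = np.corrcoef(np.clip(x, 0, 1), np.clip(y, 0, 1))[0, 1]   # 0-1: gray cells clipped to 1
    keep = (x < 1) & (y < 1)
    cc_below_1 = np.corrcoef(x[keep], y[keep])[0, 1]                     # <1: gray cells ignored
    return cc_all, cc_clipped, cc_below_1
```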
One may immediately notice the striking similarity between the two sides of Figure 9, confirmed by the very high CCs in Table 3. This means that the PK and NK algorithms lead to ZC descriptors of the molecular structures with a similar interpretation, and that they should be interchangeable within ZC-based protein retrieval pipelines. Either set of descriptors only needs to be supplemented with the geometric features of the molecules to improve the discrimination ability of the pipeline. Because the geometric features are independent of the ZC moments, and because the ZC descriptors were normalized in this experiment (n.b., the ZCDs returned by 3D Surfer had values between 0 and ≈13), both pipelines were able to benefit from their combination through Equation (8) and the other equations.
With the unit ball scale factor selected for our pipeline, there were only a few differences between it and 3D Surfer. Most notably, 3D Surfer assigned a score of 0.51 to 1AIK:(N+C) and 1SZT:A (up from 0.44), 0.48 to 1IIE:A–C/1IIE:(A–C)(118–183) and 5AHE (down from 0.52), 0.46 to 1UIP and 4B0H:A–C (down from 0.54), and 0.49 to 1UIP and 4FFY:A+H+L (down from 0.50). Taking 0.5 as the shape similarity threshold, the first pair was a false negative and the other three pairs were false positives. Both pipelines also scored 1IIE:A–C/1IIE:(A–C)(118–183) versus 4RXF:A, and 4FFY:H+L versus 5AHE, below 0.5, resulting in two more false positives. 5AHE and 5KVE:L were close to the same status with a score of ≈0.51. This was a consequence of the higher FPR of the settings used here compared with those of Section 2.4, and it was most noticeable for 5AHE and 5KVE:L, for which the score difference reached 0.15 (Figure 8 vs. Figure 9). All other results in Figure 9 were correct.
We also performed a ZCD calculation time benchmark. MSMS needed 25.2 s to produce the molecular surfaces of the backbone atoms of the 30 structures, and the single-threaded PKZernike class needed a further 17.8 s to produce their ZC descriptors. EDTSurf took 156.2 s to generate the surfaces, while the subsequent voxelization and ZCD calculation were done by 3D Surfer in 26.2 s. The test was performed on a single CPU core.
Finally, we checked the effect of a 90% mesh decimation. The average descriptor distance versus the undecimated surfaces of the example proteins was 0.031 (σ = 0.017, normalized ZCDVs). The total ZCD calculation time dropped to 3.6 s with one thread, and to only 0.7 s with six threads. The scores from Figure 8 shifted only marginally, with no change in their interpretation. This suggests that it is safe to gain even more speed by decimating the meshes generated by MSMS by over 50% for the PK algorithm.
4. Conclusions
In this paper, we report on our progress toward making fast and well-documented calculation of Zernike–Canterakis moments and descriptors available in Python. We created the small PK-Zernike library which, to the best of our knowledge, is currently the fastest Python implementation of the Pozo–Koehl (PK) algorithm [31,34]. In contrast to the popular approach by Novotni and Klein [32], the PK algorithm produces ZC moments from the geometric moments obtained directly from unstructured surface meshes of 3D models. The shapes are represented exactly; no voxelization is needed.
The PK-Zernike library relies on Numba's just-in-time compilation to optimized machine code. Thus, it contains only Python code (with calls to NumPy's math and array functions) that is easy to understand and apply, but, once compiled, it processes the same mesh 200–300 times faster at moment order 20. Owing to PK-Zernike, a consumer-grade PC can process about 50,000 facets per second. Calculation time can be further reduced via the decimation of the mesh (by up to 90%) and with parallel, thread-based execution. A typical protein (≈500 aa) is then processed within a few dozen milliseconds.
One should be aware that the PK-Zernike library is generic. It accepts its input in the form of vertex and facet NumPy arrays. These arrays can be passed by the application, or loaded from the mesh file specified on the command line: python3 -m pkzernike path/to/the/mesh.obj. A rudimentary Wavefront OBJ format reader/writer is built into the library. Other mesh formats, such as PLY or STL, are supported through PyVista (a soft dependency). PyVista also exposes VTK's mesh decimation filters to Python.
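For instance, a PLY or STL surface could be loaded, decimated and handed to the library roughly as follows; the PKZernike constructor and descriptors() call are the same assumed interface sketched in Section 2.4, and the file path is a placeholder.

```python
# Sketch under stated assumptions: pv.read() and decimate() are real PyVista/VTK
# calls, while the PKZernike API shown here is hypothetical.
import pyvista as pv
from pkzernike import PKZernike        # assumed import path

surface = pv.read("path/to/the/mesh.ply")        # e.g., an exported MSMS surface
surface = surface.triangulate().decimate(0.9)    # VTK decimation: keep ~10% of the facets
vertices = surface.points
facets = surface.faces.reshape(-1, 4)[:, 1:]     # all-triangle mesh after triangulate()
zcdv = PKZernike(vertices, facets).descriptors(20)   # hypothetical call, moment order 20
```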
ZC moments lie at the foundation of state-of-the-art protein structure retrieval systems [27,35]. The shape of a molecule is encoded by a feature vector with a fixed length that can be quickly compared with the feature vectors of other molecules. This facilitates real-time (e.g., sub-second) searches for similarly shaped proteins in the entire database. Such functionality has been offered by the web interface of the PDB since 2020 [2].
Because the molecular surfaces of proteins produced by programs like MSMS are proper 3D meshes, they can readily become input data for the PK algorithm. However, it seems that nobody has utilized this fact in a fully featured protein structure retrieval. Hence, to assess the usefulness of the PK algorithm in this area, we made it the pivotal element of a novel protein structure retrieval pipeline. This relatively simple workflow, based only on MSMS, NumPy, SciPy, Numba, PyVista/VTK and PK-Zernike (and our in-house PDB format parser), utilizes ZC descriptors and, following in the footsteps of BioZernike [27], three geometric features: the distribution of distances of the residues to the origin, the size of the molecule and the volume of its mesh. We also devised several shape distance functions capable of joining these geometric features with ZC descriptors.
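The general idea of such a combination can be illustrated with the following toy sketch; its functional form and weights are placeholders and do not reproduce Equations (1)–(8) of this paper.

```python
# Toy illustration only; NOT the paper's shape distance functions.
import numpy as np

def zcd_distance(zcd_a, zcd_b):
    # A simple normalized descriptor distance (stand-in for the ZCD term).
    return np.linalg.norm(zcd_a - zcd_b) / (np.linalg.norm(zcd_a) + np.linalg.norm(zcd_b))

def combined_distance(zcd_a, zcd_b, geom_a, geom_b, w=0.5):
    """geom_* = (residue-distance histogram, molecule size, mesh volume)."""
    hist_a, size_a, vol_a = geom_a
    hist_b, size_b, vol_b = geom_b
    hist_term = np.linalg.norm(np.asarray(hist_a) - np.asarray(hist_b))
    size_term = abs(size_a - size_b) / max(size_a, size_b)
    vol_term = abs(vol_a - vol_b) / max(vol_a, vol_b)
    return zcd_distance(zcd_a, zcd_b) + w * (hist_term + size_term + vol_term)
```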
The discrimination ability of the proposed pipeline was checked with three suites from the BioZernike validation repository [50]. There, it achieved areas under the ROC curves of ≈0.96 for the domains (both CATH and ECOD) and up to 0.997 for the biological assemblies. These values are close to the AUROCs reported for BioZernike: 0.97 and 1.0, respectively. This experiment also unveiled the optimal default parameters of our method, namely the calculation of the molecular surfaces for the backbones of the proteins (N, Cα, C and O atoms), either of the two recommended unit ball scale factors (one yielding a lower, the other a higher TPR and FPR), and the geometry-augmented function as the primary shape distance metric.
The high discrimination ability of the proposed retrieval pipeline would not have been reached without the PCA-based outlier residue detection subroutine borrowed from our other algorithm, the Ellipsoid Profile (EP) [22,23]. The purpose of this subroutine is to eliminate exposed fragments of a molecule (e.g., disordered chain termini) that contribute little to its volume but can skew the metrics that rely on its shape or position. Here, it allowed the TPR of the retrieval to increase substantially at the expense of only a small increase in the FPR. The detection is fast and simple. It may be disabled to reduce the FPR, for example, to find similar structures in which the crucial shape information is stored in the outliers.
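The concept behind this screening can be sketched as follows; this is not the EP code, and the per-axis 2.5 σ cut-off is an assumption made for illustration only.

```python
# Conceptual sketch of PCA-based outlier residue screening (not the EP algorithm).
import numpy as np

def outlier_residues(ca_coords, n_sigma=2.5):
    """ca_coords: (N, 3) C-alpha coordinates; returns a boolean outlier mask."""
    centered = ca_coords - ca_coords.mean(axis=0)
    # Principal axes of the residue cloud via SVD (equivalent to PCA here).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T              # coordinates along the principal axes
    spread = projected.std(axis=0)
    return np.any(np.abs(projected) > n_sigma * spread, axis=1)

# Residues flagged True (e.g., a disordered C-terminus sticking out of the core)
# would be removed before the surface mesh and geometric features are computed.
```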
The analysis of the example protein set showed that the proposed pipeline is capable of returning correct scores of structural similarity, additionally reaching correlation coefficients above 0.95 with the 3D Surfer program [35]. This confirms that ZC descriptors of molecular surface meshes obtained via the exact, volume-like PK algorithm are at least as useful as ZC descriptors of voxelized protein representations, and thus confirms the usefulness of the PK algorithm in structural bioinformatics.
The algorithms of Pozo et al. and of Novotni and Klein begin to suffer from numerical instabilities above moment order 35 when the calculation is performed with standard double precision arithmetic. This phenomenon, caused by large cancellations in the geometric-to-Zernike moment conversion, is explained in depth by Houdayer and Koehl (HK) [40]. The same authors proposed [40] an alternative three-dimensional Zernike algorithm based on quadrature rules that circumvents the main issue by calculating the Zernike moments directly from the facets of the mesh, without the intermediate geometric moments. The finite-precision variant of their approach is time-wise on par with the PK algorithm, but stable at moment order 300 and beyond. The HK algorithm is a mesh-based counterpart of the work of Deng and Gwo [100], who earlier applied the direct calculation idea to the voxel-based shape representation with similar success. While moment orders up to 35 may be sufficient for structural bioinformatics, we believe these new algorithms are going to make an impact on the general field of 3D shape retrieval and reconstruction.
It should also be mentioned that Zernike moments are not the only noise-resistant method of encoding the shape of molecular structures. In particular, van Kempen and colleagues recently published an article about Foldseek [101]. This robust algorithm performs an “inverse translation”: it takes the geometric features of the proteins and discretizes them into sequences of a custom structural alphabet. A local alignment of those sequences results in a local alignment of the structures. In this sense, Foldseek falls into the same category as CE and the other superposition methods [16,17,18,19,20]; for global alignment, it relies on an accelerated TM-align [19]. Its sequence-based protocol organically extends to the search for similarly shaped proteins in the supported databases (e.g., in the PDB). While these searches are slower than comparisons of ZC descriptors, they can deliver accurate partial matches in a matter of seconds [102], unlike ZCDs, which essentially facilitate a global comparison of the shapes inside the unit ball and can be skewed by structural outliers. Conversely, ZCDs are free from the constraints of the logical organization of the molecules (e.g., sequence order or chain IDs).
The reader may also be interested in a recent paper by Koehl and Orland [103] on the subject of partial 3D shape matching. To cite its abstract, “A new algorithm is presented to compute nonrigid, possibly partial comparisons of shapes defined by unstructured triangulations of their surfaces. (…) Its goal is to define a possibly partial correspondence between the vertices of the two triangulations, with a cost associated with this correspondence that can serve as a measure of the similarity of the two shapes. To find this correspondence, the vertices in each triangulation are characterized by a signature vector of features.” It would be very interesting to see whether this method can be applied to proteins.