*Article* **3D-PP: A Tool for Discovering Conserved Three-Dimensional Protein Patterns**

**Alejandro Valdés-Jiménez 1,2, Josep-L. Larriba-Pey <sup>3</sup> , Gabriel Núñez-Vivanco 1,\* and Miguel Reyes-Parada 4,5,\***


Received: 11 May 2019; Accepted: 20 June 2019; Published: 28 June 2019

**Abstract:** Discovering conserved three-dimensional (3D) patterns among protein structures may provide valuable insights into protein classification, functional annotation or the rational design of multi-target drugs. Thus, several computational tools have been developed to discover and compare protein 3D-patterns. However, most of them only consider previously known 3D-patterns such as orthosteric binding sites or structural motifs. This makes it necessary to develop new methods for identifying all possible 3D-patterns that exist in protein structures (allosteric sites and enzyme-cofactor interaction motifs, among others). In this work, we present 3D-PP, a new free access web server for the discovery and recognition of all similar 3D amino acid patterns among a set of protein structures (independent of their sequence similarity). This new tool does not require any previous structural knowledge about ligands, and all data are organized in a high-performance graph database. The input can be a text file with the PDB access codes or a zip file of PDB coordinates, regardless of the origin of the structural data: X-ray crystallographic experiments or in silico homology modeling. The results are presented as lists of sequence patterns that can be further analyzed within the web page. We tested the accuracy and suitability of 3D-PP using two sets of proteins from the Protein Data Bank: (a) Zinc finger-containing proteins and (b) serotonin target proteins. We also evaluated its usefulness for discovering new 3D-patterns, using a set of protein structures obtained by *in silico* homology modeling, all of which are overexpressed in different types of cancer. Results indicate that 3D-PP is a reliable, flexible and user-friendly tool for identifying conserved structural motifs, which could be relevant for improving knowledge about protein function or classification. The web server can be freely used at https://appsbio.utalca.cl/3d-pp/.

**Keywords:** conserved patterns; similarity; 3D-patterns

#### **1. Introduction**

Most drugs interact with more than one molecular target [1,2]. This fact is usually considered an undesired feature since it might be related to the side effects of pharmacological treatments. However, current trends in drug discovery have put hope and considerable effort into the development of multitarget compounds, due to the improved efficacy and safety profiles shown by some promiscuous drugs [3–8]. In this context, several computational approaches to predict the polypharmacological profile of either novel or known drugs have been developed, most of which are based on two main methodological strategies. In the first case, methods are based on ligand characteristics, for example, the search of compounds showing similar pharmacological/molecular activities with known drugs, those that represent the ligands as a bi-dimensional graph and look for similarities in databases using graph-based techniques and those based on the three-dimensional (3D) similarities of ligands. The second approach is centered on target(s) features and involves methods that use the known 3D structure of proteins to perform inverted docking, structure-based pharmacophore searching and the evaluation of binding sites similarities [8–10].

The usefulness of assessing structural similarities of ligand binding sites in different proteins, aimed at target clustering or drug development, is supported by the fact that the structure of proteins is several times more conserved than their sequences [11–13]. Furthermore, even when a close evolutionary relationship exists between two proteins, their global sequences and structures may not be conserved, and they may share only partial 3D-patterns, which in most cases define the function of such proteins. Indeed, the comparative analysis of important 3D protein patterns such as binding sites, catalytic sites and protein-protein interaction motifs has recently been used, for example, to identify putative off-targets of known drugs, to design polypharmacological compounds and to re-purpose drugs [14,15]. For these aims, several computational tools have been developed [16–25], which, in general, require a known query (ligands/binding sites) for their searching processes. Thus, these algorithms usually use only the orthosteric binding site in proteins, annotated motifs and/or previously known functional residues to make similarity assessments. This represents a weakness of the current tools, since some evidence indicates that a conserved 3D arrangement of amino acids might be enough to consider such a 3D-pattern as functionally relevant, even if no prior knowledge of its biological activity is available (e.g., protein cavities/pockets that may serve as allosteric sites) [20,26–31].

Thus, unveiling and comparing all local structural patterns (including those unknown or previously unobserved) within a set of protein structures could be more informative for the discovery, search and characterization of conserved 3D-patterns than exploring only previously known sites. In a recent report [32], we described a strategy for the exhaustive search of similar 3D-patterns between two protein structures, which allowed the discovery of some conserved structural residue arrangements between proteins that differ in their function, structure and tissue localization but that share the same endogenous ligand and perform complementary physiological functions [33]. This type of finding, along with the increasing availability of structural data (more than 130,000 protein structures in the Protein Data Bank [34] and more than 3 million homology models in the SWISS-MODEL Repository [35]), represents an opportunity to use and develop structure-based methods for the classification, description and discovery of conserved 3D amino acid patterns among multiple protein structures.

Here we present 3D-PP, a new free access web server designed to discover all conserved 3D amino acid patterns among a set of protein structures. The pre-processing modules of 3D-PP were developed in Python, and all data generated are processed and organized automatically in a scalable, high-performance graph database [36]. Remarkably, this kind of database has shown better performance than relational databases, particularly when problems must be modeled realistically through, for example, the use of properties in a graph-based analysis [37–41].

#### **2. Results**

To demonstrate the applicability of 3D-PP, in the following sections we show the results of different examples in which the existence of known and unknown 3D-patterns is assessed in a set of proteins. Also, as a benchmark analysis, we tried to replicate the same experiments with other available tools.

#### *2.1. Known Small 3D-Patterns*

We used a dataset of 46 protein structures, all of which contain the PROSITE Zinc finger C3H1-type motif (https://prosite.expasy.org/PDOC50103) (PDBids: *1m9o, 1rgo, 2cqe, 2d9m, 2d9n, 2e5s, 2fc6, 2rhk, 2rpp, 3d2n, 3d2q, 3d2s, 3jb9, 3tp2, 3u1l, 3u1m, 3u9g, 4c3b, 4c3d, 4c3e, 4cyk, 4ii1, 4yh8, 5elh, 5elk, 5gmk, 5lj3, 5lj5, 5lqw, 5mps, 5mq0, 5mqf, 5u6h, 5u6l, 5u9b, 5wsg, 5y88, 5ylz, 5z58, 6bk8, 6dnh, 6eoj, 6exn, 6fbs, 6ff4, 6fuw*; range of PDB resolutions: 1.5 to 5.9 Å). This sequence motif is composed of three cysteine residues and one histidine residue, which are located in the primary sequence as defined by the regular expression *C-x(8)-C-x(5)-C-x(3)-H*. At the structural level, this motif represents a small, highly conserved 3D-pattern in which the residues coordinate one Zinc ion. Usually known as a Zinc finger, this pattern is essential for stabilizing the fold of this kind of protein structure [42]. After the simultaneous evaluation of these 46 protein structures, 3D-PP identified 737,793 sites corresponding to 43,305 3D-patterns organized in 47,203 clusters. As shown in Figure 1, 3C1H was the most represented 3D-pattern, with a protein coverage value of *PCv* = 95.7%.


**Figure 1.** Coverage of 3D-patterns identified in the Zinc finger C3H1-type protein structures. Figure 1 shows the list of all 3D-patterns detected and several criteria for filtering.

This value means that this 3D-pattern was found in the vast majority of the protein structures (44 of 46 proteins). Also, this pattern was grouped in only one cluster (cluster coverage *CCv* = 100%; Figure 1), which denotes that in each of those 44 protein structures there is at least one site whose 3D topological conformation does not exceed the root mean square deviation (RMSD) threshold defined by the user (4.5 Å in this example; Supplementary Data, Figure S1). This RMSD threshold is an important input parameter of our software because it discriminates between 3D-patterns that contain similar components (i.e., amino acid residues) but exhibit different topological conformations (i.e., they are not in the same spatial localization/order). Thus, in 3D-PP, even though several 3D-patterns might show a high level of protein coverage (*PCv*), they will appear grouped in different clusters with low coverage (*CCv*) if they show high structural and/or topological diversity. In this example, only one cluster, formed by 152 sites, was detected in the 3D-pattern 3C1H, denoting high structural conservation (Figure 2) and irregular sequence localization. As shown in Figure 3, the common 3D-pattern can appear in different locations of the sequences (blue and green boxes in Figure 3). Also, even though the 3D-pattern found corresponds to structurally conserved sites, its residues can occur in a different order in the global protein sequence (red and orange boxes in Figure 3).
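
The two coverage values can be illustrated with a minimal Python sketch. It assumes, based only on the figures quoted in this section, that *PCv* is the percentage of submitted proteins containing at least one site of a pattern and that *CCv* is the percentage of those pattern-containing proteins represented in a given cluster. The function name `coverage` and the protein labels are hypothetical and are not part of 3D-PP.

```python
from collections import defaultdict


def coverage(sites, n_proteins):
    """Return (PCv, {cluster_id: CCv}) in percent for one 3D-pattern.

    sites      -- list of (protein_id, cluster_id) pairs, one per detected site
    n_proteins -- total number of protein structures submitted

    Assumed definitions (inferred from the text, not taken from the 3D-PP code):
      PCv = proteins with at least one site / proteins submitted
      CCv = proteins present in the cluster / proteins with at least one site
    """
    sites = list(sites)
    proteins_with_pattern = {protein for protein, _ in sites}
    proteins_by_cluster = defaultdict(set)
    for protein, cluster in sites:
        proteins_by_cluster[cluster].add(protein)

    pcv = 100.0 * len(proteins_with_pattern) / n_proteins
    ccv = {
        cluster: 100.0 * len(members) / len(proteins_with_pattern)
        for cluster, members in proteins_by_cluster.items()
    }
    return pcv, ccv


# Zinc finger example: one cluster with sites in 44 of the 46 proteins.
# The protein identifiers are placeholders.
zinc_sites = [(f"protein_{i}", "3C1H-1") for i in range(44)]
pcv, ccv = coverage(zinc_sites, n_proteins=46)
print(round(pcv, 1), ccv)  # 95.7 {'3C1H-1': 100.0}
```

Under these assumed definitions, a pattern can reach a high *PCv* while each of its clusters has a low *CCv*, which is the situation described above for structurally diverse patterns.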

**Figure 2.** Structural alignment of the 152 sites detected in the cluster 3C1H-1. This result is delivered by 3D-PP using an interactive Jsmol Viewer.

Interestingly, 122 of the 152 detected sites were confirmed by the presence of the *C-x(8)-C-x(5)-C-x(3)-H* pattern in the primary sequence of the proteins analyzed and also by the presence of the respective Zinc ion coordinated by three cysteine residues and one histidine residue in the corresponding crystal structures (confirmed using the PDBsum server [43]). The remaining sites detected by 3D-PP have structural features similar to those of the confirmed sites, but either the protein structure does not have a co-crystallized Zinc ion or the sequence localization of the residues in the sites does not match the corresponding PROSITE pattern (Table 1 and Supplementary Data, Table S1).
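
The sequence-level check mentioned above can be sketched in Python by translating the PROSITE pattern into a regular expression. This is an illustrative helper rather than the procedure used by the authors; the function name `prosite_to_regex` and the toy sequence are assumptions, and only a simple subset of the PROSITE syntax is handled (single residues, `x`, repetition counts, `[...]` and `{...}`).

```python
import re

# One PROSITE element: a residue, x, [ABC] or {ABC}, optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # x means any residue
            core = "."
        elif core.startswith("{"):       # {ABC} means any residue except A, B or C
            core = "[^" + core[1:-1] + "]"
        if low is not None:              # (n) or (n,m) repetition
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("C-x(8)-C-x(5)-C-x(3)-H")
print(regex)  # C.{8}C.{5}C.{3}H

# Toy sequence built for illustration only; it is not taken from any PDB entry.
toy_sequence = "MK" + "C" + "A" * 8 + "C" + "A" * 5 + "C" + "A" * 3 + "H" + "GG"
for hit in re.finditer(regex, toy_sequence):
    print(hit.start() + 1, hit.end())  # 1-based start and end of the motif: 3 22
```

A site found by 3D-PP whose four residues do not fall on such a match corresponds to the second group described above, in which the spatial arrangement is conserved but the sequence spacing is not.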

**Table 1.** Number of sites containing the Zinc finger C3H1-type motif at the sequence (PROSITE) and structural (3D-PP and PDBsum) levels. The last column (A & B & C) shows those sites that satisfy the sequence pattern C-x(8)-C-x(5)-C-x(3)-H (**A**), those discovered by our software that matched with the previously described sites (**B**) and those in which PDBsum shows coordination with the Zinc ion (**C**).


In the detailed analysis of the new sites unveiled by our software, we highlight the following cases:

**Figure 3.** Sequence representation of the structural alignment of the 152 sites detected in the cluster 3C1H-1. This result is delivered by 3D-PP. The diffuse lower box shows the entire representation of the sequence alignment of the 152 sites found. The blue and green boxes show a zoom denoting the PDBids, the chain and the original PDB residue number of each site. The red and orange boxes exemplify that some of the 3D-patterns found exhibit the expected sequence order (C-x(8)-C-x(5)-C-x(3)-H), whereas other sites, while having the same structural orientation, do not match the canonical Zinc finger C3H1-type motif.

#### 2.1.1. Putative New Zinc Ion Coordination Sites

For the protein structure with PDBid:2D9N, three sites were detected by 3D-PP. Two of them were confirmed at both the sequence and structural levels, and the third was found only by our software (Supplementary Data, Table S1, PDBid:2D9N). This new site, which is formed by the residues Cys68, Cys76, Cys82 and His70, shares the three cysteine residues with a known/confirmed site but involves a different histidine residue (His70 instead of His86). As shown in Figure 4, this newly identified site might maintain the coordination of the Zinc ion in cases in which, for example, a point mutation of the residue His86 occurs. It should be noted that the calculated pKa of His86 (which forms the canonical Zinc coordination site) and of His70 (the new putative site) was below 6, indicating that both residues are mostly deprotonated and are therefore able to coordinate the Zinc ion. Thus, in theory, the Zinc ion might be "moving" between both sites, since both offer a favorable environment to stabilize its binding. Along the same lines, another 29 sites with similar features were discovered by our software (Supplementary Data, Table S1, Tag "New Site" in column "Comments").
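
The link between a pKa below 6 and a mostly deprotonated histidine follows from the Henderson-Hasselbalch relation. The sketch below assumes a physiological pH of 7.4, which is not stated in the text, and uses pKa = 6.0 as the upper bound of the reported values; the function name is hypothetical.

```python
def deprotonated_fraction(pka, ph=7.4):
    """Fraction of a titratable group in its deprotonated form.

    Henderson-Hasselbalch: [A] / ([A] + [HA+]) = 1 / (1 + 10 ** (pKa - pH)).
    The default pH of 7.4 is an assumption made for this illustration.
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Upper bound of the reported pKa values (below 6) at the assumed pH of 7.4.
print(f"{deprotonated_fraction(6.0):.2f}")  # 0.96
```

Under this assumption, a histidine with a pKa of 6 or lower would be at least about 96% deprotonated, and a lower pKa gives an even larger fraction.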

**Figure 4.** Putative new site discovered by 3D-PP. In the figure the residues that form the new site detected are shown. The green sphere represents the Zinc ion.

#### 2.1.2. Promiscuous Binding Sites

Another remarkable result was the identification of two Cadmium ion binding sites that appeared in the same 3D-pattern cluster as the Zinc ion binding sites. These two sites belong to the crystal structure of the Essential Transcription Antiterminator M2-1 Protein of the human respiratory syncytial virus (PDBid:4c3d). As shown in the Supplementary Data, Figure S2, this structure contains two co-crystallized Cadmium ions, each coordinated by three cysteine residues and one histidine residue. These results agree with previous reports showing that Zinc ions can be replaced by Cadmium ions in some enzymes [44], indicating that this 3D-pattern can act as a promiscuous binding site. It is worth pointing out that 3D-PP does not use information about ligands or ions co-crystallized with the protein structures and works only with the 3D-patterns found from the virtual grid of coordinates (see Materials and Methods Section).

#### 2.1.3. Not Found Patterns

As indicated above, in 2 of the 46 protein structures submitted it was not possible to identify 3D-patterns with the 3C1H composition. These proteins, namely the pre-mRNA-processing-splicing factor 8 from human (PDBid:5MQF) and yeast (PDBid:5LQW), were the largest structures evaluated in this data set. Both structures are biological assemblies obtained through cryogenic electron microscopy at resolutions of 5.9 Å and 5.8 Å, respectively. As we confirmed in our detailed analysis, low resolution generally limits the possibility of obtaining all the coordinates of residue side chains, some hydrogen bonds and small ligands such as metal ions. In these proteins, most chains have atomic coordinates only for the backbone, and our software cannot detect 3D-patterns without the side chains of the protein residues.
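
A quick way to anticipate this limitation is to check how many residues in a PDB file lack side-chain coordinates before submitting it. The following sketch is not part of 3D-PP; it assumes standard fixed-column PDB `ATOM` records, skips glycine (which has no side-chain heavy atoms) and ignores hydrogens with a simple name-based rule. The function name is hypothetical.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O", "OXT"}


def backbone_only_fraction(pdb_lines):
    """Fraction of non-glycine residues that have backbone heavy atoms only.

    pdb_lines -- iterable of text lines in PDB format; only ATOM records are read.
    """
    atoms_by_residue = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()   # columns 13-16: atom name
        res_name = line[17:20].strip()    # columns 18-20: residue name
        chain_id = line[21]               # column 22: chain identifier
        res_id = line[22:27]              # columns 23-27: residue number + insertion code
        if atom_name.lstrip("0123456789").startswith("H"):
            continue                      # simplification: treat such names as hydrogens
        atoms_by_residue.setdefault((chain_id, res_id, res_name), set()).add(atom_name)

    candidates = [
        names for (_, _, res_name), names in atoms_by_residue.items()
        if res_name != "GLY"
    ]
    if not candidates:
        return 0.0
    return sum(1 for names in candidates if names <= BACKBONE_ATOMS) / len(candidates)
```

For example, `backbone_only_fraction(open("model.pdb"))` (with a hypothetical file name) returns a value close to 1.0 for a backbone-only model, the situation described above for most chains of 5MQF and 5LQW.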

#### *2.2. Serotonin Target Proteins*

Serotonin (5-Hydroxytryptamine; 5-HT) is a biogenic amine found in the gastrointestinal tract, blood platelets and the central nervous system (CNS). In the CNS, 5-HT acts as a neurotransmitter and is released into the synaptic cleft, where it interacts with specific 5-HT receptors (5-HTRs) to activate different signal transduction pathways [45]. After that, 5-HT is pumped back into the nerve terminals by the 5-HT transporter (SERT) and/or is metabolized by the enzyme monoamine oxidase type-A (MAO-A) [46]. Even though these three types of proteins (5-HTRs, SERT and MAO-A) have distinct functions, different sequences and diverse structural folds, they share 5-HT as their primary endogenous ligand. As observed in the matrix of amino acid sequence identity (Supplementary Data, Table S2), the pairwise sequence identity among these proteins does not exceed 22%. In addition, their multiple sequence alignment (MSA; Supplementary Data, Figure S4) shows only 15 conserved residues, with a very dispersed localization. Therefore, biologically relevant results cannot be obtained with these sequence alignment methods. To test our software, we submitted the crystal structures of the human SERT (PDBid:5I6X), MAO-A (PDBid:2BXR) and 5-HT<sub>2A</sub> receptor (PDBid:6A93) using the following input parameters: *St*: 2 Å, *Rt*: 5 Å, *RMSDt*: 4.5 Å, *Dt*: 2 Å and *Mc*: 80%. Despite the differences among these proteins, 3D-PP was able to detect several 3D-patterns with 100% coverage; one of them, the 3D-pattern 1*D*1*G*1*L*1*Q*, shows two clusters with 100% and 33% *CCv* (cluster coverage), respectively. The first cluster has four sites, each composed of one aspartate, one glycine, one leucine and one glutamine residue. These sites have an RMSD lower than 2.5 Å and show a similar 3D topological conformation (Figure 5A), their residues appear in a different order in each primary sequence (Figure 5B), and they are located at the extracellular side in SERT and 5-HTR2A (Figure 5C,D), whereas in MAO-A the site was detected on the protein surface (Figure 5E). The presence of aspartate residues in these sites could be significant because this type of amino acid has been shown to be critical, for example, in the inner binding site (Asp-98 [47]) and the antidepressant binding site of SERT (Asp-400 [48]), in the binding sites of the 5-HT receptors (Asp-155 [49]) and in the substrate/inhibitor cavity of MAO-A (Asp-328, Asp-132 [50]). Thus, these sites could represent a useful starting point for the design of allosteric multi-target drugs [51].
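
To illustrate how a deviation threshold separates sites that share the same residues but differ in spatial arrangement, the sketch below computes a distance RMSD (dRMSD) between two sites. This is a deliberately simpler, superposition-free measure swapped in for illustration; it is not the superposition RMSD reported by 3D-PP, and the function name and coordinates are invented.

```python
import math
from itertools import combinations


def distance_rmsd(site_a, site_b):
    """Root mean square difference between the internal distances of two sites.

    site_a, site_b -- equal-length lists of (x, y, z) points listed in the same
                      residue order (for example D, G, L, Q), coordinates in Å.
    """
    if len(site_a) != len(site_b) or len(site_a) < 2:
        raise ValueError("sites must contain the same number (>= 2) of points")
    squared_differences = [
        (math.dist(site_a[i], site_a[j]) - math.dist(site_b[i], site_b[j])) ** 2
        for i, j in combinations(range(len(site_a)), 2)
    ]
    return math.sqrt(sum(squared_differences) / len(squared_differences))


# Invented coordinates (Å) for three four-residue sites with identical composition.
site_1 = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4)]
site_2 = [(10, -3, 2), (14, -3, 2), (10, 1, 2), (10, -3, 6)]  # site_1 translated
site_3 = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 20)]        # one residue far away

threshold = 4.5  # same numerical value as RMSDt above, reused here only as an example
print(distance_rmsd(site_1, site_2) <= threshold)  # True: same arrangement
print(distance_rmsd(site_1, site_3) <= threshold)  # False: different arrangement
```

Unlike a superposition RMSD, dRMSD cannot distinguish a site from its mirror image, so it should be read only as an approximation of the clustering criterion.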

**Figure 5.** Conserved 3D-pattern among the serotonin target proteins. **A** shows the structural alignment of the four sites forming the 3D-pattern 1*D*1*G*1*L*1*Q*. **B** shows a representation of a sequence alignment of the structurally aligned sites that form the 3D-pattern 1*D*1*G*1*L*1*Q*. **C**, **D** and **E** show the structural localization of the sites forming the 3D-pattern 1*D*1*G*1*L*1*Q* on the global structure of SERT, 5HTR2A and MAO-A, respectively.

#### *2.3. Finding/Discovering Unknown 3D-Patterns on Homology Model Structures*

In this case, we tried to discover conserved 3D-patterns among 10 protein structures generated through the SwissModel server (homology models). All of these proteins are overexpressed in different types of cancer (breast, prostate, lung, gastric, etc.) and correspond, for example, to the insulin-like growth factor 1 receptor, the macrophage-stimulating protein receptor and the aurora kinase B, among others [52]. After the assessment with 3D-PP, our results showed the existence of several common 3D-patterns in these proteins (high *PCv* coverage; Max *PCv* = 80%), but many of them showed high structural or topological diversity (low *CCv* coverage). Nevertheless, the most structurally conserved 3D-pattern has a cluster with sites occurring in 8 of the 10 homology models submitted (cluster 1*E*1*G*2*L*-14, *CCv* = 100%; Figure 6A). The conservation of this 3D-pattern (1*E*1*G*2*L*; one glutamate, one glycine and two leucine residues; Figure 6B) is attractive since it might represent an event of convergent evolution, which could be useful for establishing a functional annotation [53], for the design of new poly-pharmacological anticancer drugs [4] and/or for protein structure-based diagnosis [54]. As discussed above, this conserved 3D-pattern was detected in spite of its non-conserved sequence order (Figure 6C).
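
The pattern names used throughout this section (3C1H, 1*D*1*G*1*L*1*Q*, 1*E*1*G*2*L*) appear to encode residue composition as counts followed by one-letter codes in alphabetical order. The sketch below reproduces that convention as inferred from these labels; it is an assumption rather than code from 3D-PP, and the function name is hypothetical.

```python
from collections import Counter

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pattern_label(residue_names):
    """Build a composition label such as '1E1G2L' from three-letter residue names.

    The label depends only on which residues are present and how many of each,
    not on the order in which they occur in the sequence.
    """
    counts = Counter(THREE_TO_ONE[name.upper()] for name in residue_names)
    return "".join(f"{count}{code}" for code, count in sorted(counts.items()))


print(pattern_label(["LEU", "GLU", "GLY", "LEU"]))  # 1E1G2L
print(pattern_label(["GLU", "LEU", "LEU", "GLY"]))  # 1E1G2L (same label, other order)
print(pattern_label(["CYS", "HIS", "CYS", "CYS"]))  # 3C1H
```

Because such a label ignores sequence order, sites whose residues are arranged differently along the primary sequence can still fall under the same 3D-pattern, as observed in Figure 6C.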

**Figure 6.** Coverage of 3D-patterns detected in the homology models of proteins overexpressed in some cancer types. **A** shows the list of conserved 3D-patterns. **B** shows the structural alignment of the sites forming the 3D-pattern 1E1G2L. **C** shows a representation of a sequence alignment of the structurally aligned sites that form the 3D-pattern 1E1G2L.

#### *2.4. Comparison with other Methods for the Search and Description of Amino Acid Patterns*

Table 2 summarizes some features of computational tools aimed at the search of structural protein patterns, with comments regarding the results obtained when the same data set used in this work was evaluated.

In general terms, none of the software listed in Table 2 was able to perform the same analysis as 3D-PP. Nevertheless, these tools were included in the benchmark, since they are the currently available algorithms whose objectives are most similar to those of 3D-PP. It seems probable, however, that results similar to those of 3D-PP could be obtained by using the MMDB and VAST+ tools in combination with ProBIS, together with a series of additional processing steps.


