#### *3.1. Grid of Virtual Coordinates*

One grid of virtual coordinates (*GvC*) is modeled for each protein structure submitted (Figure 7). This *GvC*, generated by the function *FIND\_SITES* (Figure 8), is used to search for 3D-patterns and allows 3D-PP to operate without any prior knowledge of the ligands or binding sites in the protein structures.

Briefly, each *GvC* is constructed as follows:


**Figure 7.** Grid of Virtual Coordinates (GvC). Letters **A**, **B**, and **C** show the process of creating a Grid of Virtual Coordinates for each protein structure. The red spheres represent the reference coordinates from which the search for 3D-patterns is performed.

#### *3.2. Protein Preprocessing*

This step is the core of 3D-PP, since it identifies all possible sites (arrangements of structurally related amino acids) and generates all input data for the graph database. It is worth noting that the identification of the sites is independent of the order of the amino acids in the protein sequences. This preprocessing considers each chain of each protein separately and uses the previously generated *GvC*. The *GvC* is used as follows:


site is added to the current cluster. This step is implemented to prevent two sites with the same components but different 3D conformations from being grouped in the same cluster. It should be noted that if the user sets overly permissive (high) *RMSDt* values, sites with different structural topologies are more likely to be grouped together; thus, many false positives can occur.
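The effect of the *RMSDt* threshold can be illustrated with a minimal Python sketch. It is not the 3D-PP implementation: the function names `rmsd` and `assign_to_clusters` are hypothetical, the sites are assumed to share the same components in the same atom order and to be already superposed, and comparing each site against the first member of a cluster is an assumption made only for illustration.

```python
import math

def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.
    Assumption: the two sites are already superposed and atom-matched."""
    if len(a) != len(b):
        raise ValueError("sites must contain the same number of atoms")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))

def assign_to_clusters(sites, rmsd_t):
    """Greedy grouping (hypothetical): a site joins the first cluster whose
    representative (its first member) lies within rmsd_t; otherwise it
    starts a new cluster."""
    clusters = []
    for site in sites:
        for cluster in clusters:
            if rmsd(site, cluster[0]) <= rmsd_t:
                cluster.append(site)
                break
        else:
            clusters.append([site])
    return clusters

# Toy sites with identical components: site_b is a slightly shifted copy
# of site_a, while site_c has a clearly different geometry.
site_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
site_b = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.1, 1.0, 0.0)]
site_c = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
sites = [site_a, site_b, site_c]

strict = assign_to_clusters(sites, rmsd_t=0.5)      # site_c stays separate
permissive = assign_to_clusters(sites, rmsd_t=2.0)  # site_c is merged
print(len(strict), len(permissive))
```

With the strict threshold the distorted site forms its own cluster; with the permissive threshold it is merged with the other two, which is the kind of false positive described above.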


**Figure 8.** Pseudocode of the *FIND\_SITES* function of 3D-PP, which searches for 3D-patterns without any previous knowledge about the ligands or binding sites in the protein structures. The computational complexity is indicated on each line of the algorithm.

#### *3.3. Creation of the Graph Databases*

For each protein structure submitted, a new graph database is created, and these databases are built in parallel. In these databases, each newly identified site is stored as a new node (SITE node; Supplementary Data, Figure S6A). Then, the main graph database, which is an extension of the first model, is used to unify the data (e.g., 3D-patterns and sites of protein 1, 3D-patterns and sites of protein 2, etc.). For this, all the SITE node attributes are used to create or connect the corresponding PATTERN and CLUSTER nodes and, finally, to establish the edges SITE\_IN\_CLUSTER, CLUSTER\_IN\_PATTERN, and PATTERN\_IN\_PROTEIN (Supplementary Data, Figure S6B).

It is important to note that the PATTERN node with the largest number of PATTERN\_IN\_PROTEIN edges represents the 3D-pattern with the highest coverage value. Moreover, if this PATTERN node has few CLUSTER\_IN\_PATTERN edges, it can be inferred that the sites that make up this 3D-pattern have a high level of structural and topological conservation. Conversely, many CLUSTER\_IN\_PATTERN edges indicate a high level of structural diversity.
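The edge-count reasoning above can be sketched in plain Python. This is only an illustration: in-memory edge lists stand in for the graph database, and the node identifiers (`P1`, `protA`, `c1`, and so on) are hypothetical.

```python
from collections import Counter

# Hypothetical edges as (source, target) pairs, named after the edge
# types in the text. Real data would come from the graph database.
pattern_in_protein = [
    ("P1", "protA"), ("P1", "protB"), ("P1", "protC"),
    ("P2", "protA"),
]
cluster_in_pattern = [
    ("c1", "P1"),
    ("c2", "P2"), ("c3", "P2"), ("c4", "P2"),
]

# Number of PATTERN_IN_PROTEIN edges per PATTERN node (coverage proxy).
protein_edges = Counter(pattern for pattern, _ in pattern_in_protein)

# Number of CLUSTER_IN_PATTERN edges per PATTERN node (diversity proxy).
cluster_edges = Counter(pattern for _, pattern in cluster_in_pattern)

# The pattern found in the most proteins has the highest coverage.
best = max(protein_edges, key=protein_edges.get)
print(best, protein_edges[best], cluster_edges[best])
```

Here `P1` occurs in three proteins and has a single cluster, which under the interpretation above would suggest a widely covered and structurally conserved 3D-pattern, whereas `P2` has three clusters and would be read as more diverse.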

#### *3.4. Result Visualization*

The first level of results shows, as a graph and dynamic data tables, all 3D-patterns discovered in the set of submitted protein structures. Additionally, the user can search for sub-patterns of interest through a simple regular expression query. For instance, the regular expression ^2C.\*2H\$ detects all sub-patterns that begin with 2C and end with 2H, with any characters in between; these correspond to 3D-patterns containing precisely two cysteines, two histidines, and any other amino acids.
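The behaviour of this query can be checked with Python's `re` module. The pattern labels below are hypothetical and assume that each label concatenates residue counts with one-letter amino acid codes, as the example in the text suggests.

```python
import re

# Hypothetical 3D-pattern labels (count followed by one-letter code).
patterns = ["2C2H", "2C1D2H", "1C2H", "2C2H1E", "3C1H"]

# Same expression as in the text: starts with 2C, ends with 2H.
query = re.compile(r"^2C.*2H$")

matches = [label for label in patterns if query.search(label)]
print(matches)
```

Only the labels that start with `2C` and end with `2H` are kept; `2C2H1E` is rejected because it ends with `1E`, and `1C2H` because it does not start with `2C`.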

Once the measurements have been completed, every 3D-pattern has the following ranking features available:

