In this section, we describe in detail how we prepare our data and use it for training the convolutional neural network (CNN). First, we clean the training data and align and combine the contained proteins and ligands. Next, we model the ligandability using the CNN. In the last step, we combine ligandability predictions into predicted binding sites.
Figure 1 illustrates the entire method’s workflow, which we describe in detail in the following subsections.
2.1. Selection and Preparation of Training Data
To create our dataset, we used the PDB identifiers present in the PDBBind [19] and scPDB [20] databases. However, since entries in these databases have only one binding site occupied by a ligand, we downloaded the full unprocessed data for the selected PDB identifiers from the RCSB PDB [21] database, which allowed us to collect information about all possible binding sites. Proteins often consist of more than one chain, and it is not unusual for some of the chains to be identical in structure and consequently have their binding sites at the same locations. For this reason, it would be impractical to predict binding sites on whole proteins; instead, we look at each chain separately.
In the preliminary dataset analysis, we noticed that some chains were not appropriate for use: some were too small, and some were attributed ligands too far away to have any interaction with the chain. To solve these issues, we filtered our dataset, keeping chains with at least 28 amino acids and ligands with at least 7 non-hydrogen atoms. Another condition for ligands was that the minimum distance between the ligand's atoms and the chain's Cα atoms was at most 5 Å. With these conditions, we managed to include the small insulin molecule on the one hand and, on the other, to remove water molecules, ions, and other small particles that often float around proteins (e.g., glycerol with 6 atoms). After filtering, we were left with suitably large chains whose ligands were very likely to have an actual interaction with the chain. Had we included small ligands such as water and ions, which are present only as part of the process of determining the molecular structure, we would have falsely marked non-ligandable locations as ligandable.
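To make the filtering criteria concrete, the following minimal numpy sketch applies them, assuming the coordinates are already extracted as (N, 3) arrays (the function and variable names are ours, not part of the described pipeline):

import numpy as np

MIN_RESIDUES = 28       # minimum chain length in amino acids
MIN_LIGAND_ATOMS = 7    # minimum number of non-hydrogen ligand atoms
MAX_CONTACT_DIST = 5.0  # maximum ligand-to-chain C-alpha distance in angstroms

def keep_chain(num_residues):
    # A chain is kept if it has at least 28 amino acids.
    return num_residues >= MIN_RESIDUES

def keep_ligand(ligand_xyz, chain_ca_xyz):
    # A ligand is kept if it has at least 7 non-hydrogen atoms and its
    # minimum distance to the chain's C-alpha atoms is at most 5 angstroms.
    if len(ligand_xyz) < MIN_LIGAND_ATOMS:
        return False
    diff = ligand_xyz[:, None, :] - chain_ca_xyz[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    return dists.min() <= MAX_CONTACT_DIST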
An even more difficult problem is preventing the occurrence of falsely non-ligandable locations. Since the binding sites of a chain are only rarely all occupied, we cannot mark the non-occupied locations as non-ligandable with certainty. To mitigate this problem, we cluster chains based on their amino acid sequences and, within each cluster of sufficiently similar chains, merge their ligands onto a single chain. Note that the clusters computed at this step were also used when splitting the dataset into training, validation, and test sets.
2.2. Alignment of Amino Acid Sequences and Their Clustering
Aligning amino acid sequences is a broad topic that studies the similarity among protein sequences. To cluster protein chains, we used the MMseqs2 [22] library, which offers sufficient control throughout the process. Since our dataset is relatively small compared to the datasets this library is typically used on, we were able to skip the prefiltering step and ran the Smith–Waterman algorithm [23] on all protein pairs. In short, for a pair of amino acid sequences $A = (a_1, \dots, a_m)$ and $B = (b_1, \dots, b_n)$, where $m$ and $n$ are the lengths of the initial sequences, the algorithm creates a matrix in which the values represent the highest similarity—according to the selected substitution matrix (e.g., the BLOSUM62 matrix)—between different subsequences of $A$ and subsequences of $B$. The alignment of two subsequences is optimized by taking into account the matches or mismatches, insertions, and deletions of the amino acids. The Smith–Waterman algorithm follows the dynamic programming paradigm to compute the values of the matrix. To find the pair of aligned sequences $A' = (a'_1, \dots, a'_N)$ and $B' = (b'_1, \dots, b'_N)$, where $N$ is the length of the aligned sequences, it finds the highest value in the matrix and then follows the path via which this value was reached during the computation. The details of the algorithm are explained in the original paper [23].
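For illustration, a minimal Python version of the local alignment is given below; unlike the MMseqs2 implementation, it uses flat match/mismatch scores and linear gap penalties instead of a substitution matrix such as BLOSUM62 with affine gaps:

import numpy as np

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    # Fill the dynamic programming matrix H; H[i, j] is the best score of a
    # local alignment ending at a[i-1] and b[j-1].
    m, n = len(a), len(b)
    H = np.zeros((m + 1, n + 1))
    best, best_ij = 0.0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0, H[i - 1, j - 1] + s, H[i - 1, j] + gap, H[i, j - 1] + gap)
            if H[i, j] > best:
                best, best_ij = H[i, j], (i, j)
    # Trace back from the highest-scoring cell to recover the aligned subsequences.
    i, j = best_ij
    al_a, al_b = [], []
    while i > 0 and j > 0 and H[i, j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i, j] == H[i - 1, j - 1] + s:
            al_a.append(a[i - 1]); al_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i, j] == H[i - 1, j] + gap:
            al_a.append(a[i - 1]); al_b.append('-'); i -= 1
        else:
            al_a.append('-'); al_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(al_a)), ''.join(reversed(al_b)), best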
The similarity of the initial sequences is then defined as
$$ s(A, B) = \frac{\lvert \{\, i : a'_i = b'_i \,\} \rvert}{N}, $$
i.e., the fraction of positions at which the aligned sequences match.
To avoid cases where the similarity measure is high but the actual aligned sequences cover only a small part of the initial sequences, we define the covering $c$. For the initial sequence $A$ and aligned sequence $A'$, where $A'$ starts at index $i$ of $A$ and ends at index $j$, the covering $c(A, A')$ is defined as
$$ c(A, A') = \frac{j - i + 1}{m}. $$
Now let $s_{\min}$ denote the lower bound of similarity and $c_{\min}$ the lower bound of covering. For each pair of sequences $A$, $B$ from the set of sequences $Z$, we compute the similarity $s(A, B)$ and the coverings $c(A, A')$ and $c(B, B')$, where $A'$ and $B'$ are the corresponding aligned sequences.
Let $G = (V, E)$ be an undirected graph, where the set of vertices $V$ is the set of sequences $Z$, and an edge $\{A, B\}$ is in $E$ if the following conditions hold:
$$ s(A, B) \ge s_{\min}, \qquad c(A, A') \ge c_{\min}, \qquad c(B, B') \ge c_{\min}. \quad (1) $$
Since our goal is to cluster similar sequences, we define as clusters the connected components of $G$. The sequences $A$ and $B$ are therefore in the same cluster if there exists a path between them. This way of clustering ensures that each pair of similar sequences is in the same cluster. Stated differently, for two different clusters $C_1$ and $C_2$, there exist no sequences $A \in C_1$ and $B \in C_2$ for which the conditions in Equation (1) hold. In our case, we set the lower bound of similarity to $s_{\min} = 0.9$ and the lower bound for covering to $c_{\min} = 0.9$. The two parameters correspond to min-seq-id and c in the MMseqs2 library.
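The clustering step itself reduces to finding connected components, as the following sketch shows (the dictionaries similarity and covering are assumed to be precomputed from the pairwise alignments; they are stand-ins for our actual data structures):

def cluster_sequences(seqs, similarity, covering, s_min=0.9, c_min=0.9):
    # Build the graph from Equation (1): an edge between i and j requires
    # sufficient similarity and sufficient covering of both sequences.
    n = len(seqs)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if (similarity[(i, j)] >= s_min
                    and covering[(i, j)] >= c_min
                    and covering[(j, i)] >= c_min):
                adj[i].append(j)
                adj[j].append(i)
    # Clusters are the connected components, found by depth-first search.
    clusters, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                comp.append(v)
                stack.extend(adj[v])
        clusters.append(comp)
    return clusters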
We believe that sequence alignment with the Smith–Waterman algorithm from the MMseqs2 library returned very robust results, so we do not expect another algorithm to give significantly different or better ones. One could experiment with different amino acid substitution matrices (e.g., different types of BLOSUM matrices), but based on our preliminary experiments, the outcome should not vary significantly. One could also consider the depth of a residue as a parameter by which to group residues in the evaluation phase (for example, one group would consist of residues lying close to the visible surface and the other of buried residues).
2.3. Merging of Ligands of Similar Chains
In our dataset, there are proteins that are represented in more than one PDB file, but not necessarily with the same set of ligands or with ligands bound to the same binding sites. There are also cases in which a protein consists of multiple identical copies of some chain. In order to increase the informative value of each chain, we merge the ligands of similar chains onto one representative chain. This way, we reduce the number of locations falsely marked as non-ligandable, increase the number of occupied binding sites on each kept chain, and remove redundant data. Note that the PDB files do not allow us to deduce non-ligandable locations with certainty; conversely, we treat only locations with a ligand present in any input file as certainly ligandable. If for some chain $A$ there exist different proteins with different numbers of identical copies of it, we can represent all their binding sites with only the representative chain $A$, onto which we add the ligands originally bound to the other chain instances. Since the chain $A$ then carries the information of all its identical copies, these copies can be removed from the dataset. Through this process, we achieved higher information density and decreased the occurrence of highly similar data entries. An example of the merging of ligands of two identical chains is shown in Figure 2.
Our merging algorithm takes as its input a set of PDB files. We run it independently on each cluster of chains resulting from Section 2.2. The algorithm has the following main steps:
We read and sort the chains from the cluster in descending order based on the number of bound ligands and the total number of atoms of bound ligands. Sorting by these two criteria, we prioritize chains with more bound ligands, and in case two chains have the same number of ligands, we prioritize the chain with larger ligands on average. The reason for the second sorting criterion is that ligands bound to approximately the same binding site can be significantly different in size. If in such a case we chose the smaller ligand as the representative occupant of the binding site, a large part of the actual binding site would be falsely marked as non-ligandable.
We choose the representative chain based on the number of chains from the cluster to which it can be aligned well. The basic idea is to find a chain that can be aligned to a sufficient number of other chains in its cluster. We follow the order in the sorted list of chains described in the previous step until we find a chain that fits our criteria. The search for the common orientation of chains is done by adapting the method prody_align from the ProDy library [27]. This method receives as input a reference chain and a list of chains that we wish to rotate and translate so that they align with the reference chain. The algorithm also takes the lower bound for sequence similarity, seqid=0.9, and the lower bound for covering, overlap=0.9, which correspond to the parameters $s_{\min}$ and $c_{\min}$ from Section 2.2.
Then, we add ligands to the selected reference chain from sufficiently similar chains. We only add ligands that do not intersect with the ligands already bound to the reference chain. We also discard ligands that intersect with the reference chain or are too far away from it. In this part, we consider two chains as sufficiently similar if the root mean square deviation of the pairs of matching Cα atoms is smaller than 2 Å. That is,
$$ \mathrm{RMSD}(X, Y) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} d(x_k, y_k)^2} < 2~\text{Å}, $$
where $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ are the lists of matching Cα atoms and $d$ is the Euclidean distance between two points (a code sketch of this criterion follows the step list). The upper bound of 2 Å was set empirically, taking into account the alignment of oriented pairs of chains. An example with an RMSD close to the upper bound is shown in Figure 3; the chains align sufficiently well despite the value being close to the bound.
If we were not able to align all chains in the cluster to a single reference chain, we repeat the process on the remaining chains—we assign the remaining chains to a new cluster and go back to the first step.
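The RMSD criterion from the third step admits a direct implementation; below is a minimal numpy sketch, assuming the two chains are already superposed and their matching Cα coordinates are given as (n, 3) arrays:

import numpy as np

RMSD_BOUND = 2.0  # angstroms; the empirically set upper bound

def rmsd(X, Y):
    # Root mean square deviation between matched C-alpha coordinate arrays.
    assert X.shape == Y.shape
    return np.sqrt(((X - Y) ** 2).sum(axis=1).mean())

def sufficiently_similar(X, Y):
    return rmsd(X, Y) < RMSD_BOUND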
The described procedure resulted in 6169 chains clustered into 5284 clusters.
2.4. Representation of Molecules for Machine Learning
To present the resulting protein/ligand data to a machine learning algorithm, we used a spatial description. This means that each protein is described by a three-dimensional image with a number of channels, where each channel represents a property in the vicinity of a certain point. If we look at a protein as a set of points in space, we can surround it with a bounding box with some padding on its sides. In this bounding box, we then choose equidistant points on a rectangular grid and compute the chemical properties at their locations. This is done by an algorithm from the HTMD library [31] that takes into account the atoms' locations, their elements, the bonds between them, and other properties. A condensed description of the different atom types and their grouping based on chemical properties is shown in Table 1 and Table 2, respectively.
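The grid construction can be sketched as follows (a minimal numpy sketch; the property computation itself is done by HTMD in our pipeline, and the spacing and padding values here are illustrative assumptions):

import numpy as np

def grid_points(atom_xyz, spacing=1.0, padding=4.0):
    # Equidistant grid points inside the padded bounding box of a protein;
    # atom_xyz is an (N, 3) array of atom coordinates in angstroms.
    lo = atom_xyz.min(axis=0) - padding
    hi = atom_xyz.max(axis=0) + padding
    axes = [np.arange(lo[d], hi[d] + spacing, spacing) for d in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1)  # shape (nx, ny, nz, 3)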
Along with the computation of the proteins' attributes, we also need to mark each chain's true ligandable locations. We do that by reading from the original PDB file only the ligands that were left after the initial pre-processing and filtering. The binding of ligands takes place on the surface of the protein molecule, which is usually described by the protein's solvent-accessible surface (SAS) [32]. The points on the SAS of the protein were calculated using the NumericalSurface method from the open-source library CDK [33], which takes the solvent radius and tessellation level as input parameters. Having computed the surface points, we chose to define a point on the chain's surface as ligandable if the excluded volume property exceeded a fixed threshold for some of the chain's ligands. While this threshold was set empirically, it also has a geometric meaning—for an atom of a ligand, it is reached at a distance from the atom's center of about twice the atom's radius. An example of the surface points together with their ligandability property is shown in Figure 4.
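Under the geometric interpretation above, the labeling can be approximated as in the following sketch (the actual excluded-volume values come from HTMD; the per-element radii and the fallback value are assumptions):

import numpy as np

# Approximate van der Waals radii in angstroms (illustrative subset).
VDW_RADIUS = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}

def ligandable_mask(surface_xyz, ligand_xyz, ligand_elements):
    # Mark a surface point as ligandable if it lies within about twice the
    # van der Waals radius of some ligand atom, i.e., the distance at which
    # the excluded-volume threshold is reached.
    radii = np.array([VDW_RADIUS.get(e, 1.7) for e in ligand_elements])
    diff = surface_xyz[:, None, :] - ligand_xyz[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))           # (n_surface, n_ligand)
    return (dists <= 2.0 * radii[None, :]).any(axis=1)  # (n_surface,)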
To gain some intuition about the processed set of chains and their surface points,
Figure 5 shows histograms of their basic properties. Usually, only a small part of the surface allows ligands to bind; consequently, there are more non-ligandable than ligandable points.
2.5. Machine Learning with Convolutional Neural Network
A convolutional neural network (CNN) is a machine learning model able to receive our pre-processed dataset of three-dimensional images with eight channels as its input. Each channel represents a chemical property derived from the atom types assigned by the openbabel [36] library. A short description of the different atom types and their grouping based on chemical properties is shown in Table 1 and Table 2, respectively. In our work, we used the PyTorch library [37] for the implementation of the CNN. Our model consists of four convolutional and two pooling layers for feature extraction and two dense layers for classification. To the output of each convolutional and dense layer, we applied the ReLU activation function and batch normalization; a dropout layer was also used. A single input to the CNN is described by eight equally shaped cubes—one for each channel representing the chemical properties of the protein. These cubes are centered at a selected point on the accessible surface area of the protein, which gives the model the information needed to assess the surroundings of the selected point. As its output, the CNN gives a classification into the ligandable and non-ligandable classes (as a vector with two values that sum up to 1).
Note that we considered different CNN architectures for our work, but in preliminary testing the alternatives showed worse performance. The final version is mainly characterized by a large number of channels and relatively few layers—increasing the number of channels turned out to be the most effective way to increase the quality of predictions, while increasing the number of layers heavily affected the running time of the training process. We also compared different pooling types, where max-pooling performed better than average-pooling.
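The description above translates into roughly the following PyTorch module (a minimal sketch: the paper fixes only the layer counts, the eight input channels, and the two-way softmax output, so the channel widths, kernel sizes, input cube size, and dropout rate below are assumptions):

import torch
import torch.nn as nn

class LigandabilityCNN(nn.Module):
    # Four convolutional and two pooling layers for feature extraction,
    # two dense layers for classification; widths and kernel sizes assumed.
    def __init__(self, in_channels=8, cube=16, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(), nn.BatchNorm3d(64),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(), nn.BatchNorm3d(128),
            nn.MaxPool3d(2),  # max-pooling outperformed average-pooling
            nn.Conv3d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(), nn.BatchNorm3d(256),
            nn.Conv3d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(), nn.BatchNorm3d(256),
            nn.MaxPool3d(2),
        )
        flat = 256 * (cube // 4) ** 3  # two poolings halve each spatial dimension twice
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 512), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Dropout(dropout),
            nn.Linear(512, 2),
            nn.Softmax(dim=1),  # two output values that sum up to 1
        )

    def forward(self, x):  # x: (batch, 8, cube, cube, cube)
        return self.classifier(self.features(x))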
For hyperparameter search and evaluation, we used the nested cross-validation technique. We split our dataset ten times into three disjoint parts—the training, validation, and test set. Each training set consisted of 80% of all clusters of chains, while the validation and test set each contained 10%. We made sure that all ten test sets were pairwise disjoint. We therefore considered each triple of training, validation, and test sets to be independent in the process of nested cross-validation, since the clustering method described in Section 2.2 minimizes the chance of data leakage between different clusters. Pairs of training and validation sets were used in the process of hyperparameter optimization; the best set of parameters was then used to train the model on the union of the training and validation set, which was then evaluated on the test set. The optimal hyperparameters were chosen based on the average value of the loss function on the validation set in the last five epochs of training—the total number of epochs was set to 40. Even though the training sets consisted of almost 5000 clusters, we used only 1024 in each epoch because of the time-consuming training process. Training of a single model took from 30 min up to two hours, depending on the choice of hyperparameters.
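Schematically, the evaluation protocol looks like this (a sketch: train_model, validation_loss, and evaluate are hypothetical helpers standing in for the actual training code):

import random

def nested_cv(clusters, hyperparameter_grid, n_splits=10, seed=0):
    # Ten splits with pairwise disjoint test sets (10% of clusters each);
    # the remainder is split into training (80%) and validation (10%).
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    fold = len(shuffled) // n_splits
    results = []
    for k in range(n_splits):
        test = shuffled[k * fold:(k + 1) * fold]
        rest = shuffled[:k * fold] + shuffled[(k + 1) * fold:]
        valid, train = rest[:fold], rest[fold:]
        # Inner loop: choose hyperparameters by the validation loss.
        best = min(hyperparameter_grid,
                   key=lambda hp: validation_loss(train_model(train, hp), valid))
        # Retrain on training + validation with the best hyperparameters,
        # then evaluate once on the held-out test set.
        model = train_model(train + valid, best)
        results.append(evaluate(model, test))
    return results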
When training a machine learning model, it is important to pay attention to the learning rate. We chose to follow the super-convergence method [38], where during the first iterations the learning rate is small but rises gradually to its maximum value, and in the last phase it descends to a small value to stabilize the learning process. When choosing the samples for a batch in each iteration, we first randomly picked a number $n_{cl}$ of clusters from the training set, one chain from each cluster, and then $n_p$ points on the surface of each selected chain. The total number of samples—cubes centered at the points on the surface—in a batch is therefore $n_{cl} \cdot n_p$. We tested four pairs of $n_{cl}$ and $n_p$ for optimality, all with the same product of 512. So altogether we had three hyperparameters to tune: the learning rate, $n_{cl}$, and $n_p$. One of the four pairs was never the optimal choice, since it caused overfitting to the training set and a high deviation of the loss on the validation set because of the relatively low number of weight updates per epoch. Of the remaining pairs, one was chosen twice and the other two four times each.
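In PyTorch, the super-convergence schedule can be realized with the built-in one-cycle policy, as the following sketch shows (it reuses the LigandabilityCNN sketch from above; the learning rates, the number of batches per epoch, and the random placeholder data are assumptions):

import torch

model = LigandabilityCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.NLLLoss()
epochs, steps_per_epoch = 40, 8  # illustrative; the real count depends on n_cl
# One-cycle policy [38]: the learning rate rises to max_lr, then anneals.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, epochs=epochs, steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        x = torch.randn(512, 8, 16, 16, 16)  # placeholder batch of surface cubes
        y = torch.randint(0, 2, (512,))      # placeholder class labels
        probs = model(x)                     # two probabilities summing to 1
        loss = criterion(torch.log(probs + 1e-8), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                     # update the learning rate every batch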
Additionally, as seen in Figure 5, our dataset is unbalanced with respect to ligandable and non-ligandable points. This could lead to bias if we simply chose random points from the chains' surfaces. For this reason, we oversampled the ligandable points so that the numbers of ligandable and non-ligandable samples fed to the model during training were the same.
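The batch construction with oversampling might look as follows (a sketch: the surface_points attribute with per-point ligandability labels is a hypothetical data structure, and n_cl and n_p are the batch-composition hyperparameters from above; each chain is assumed to have points of both classes):

import random

def make_batch(clusters, n_cl, n_p, rng=random):
    # Pick n_cl clusters, one chain per cluster, and n_p surface points per
    # chain; sampling each class with replacement oversamples the rarer
    # ligandable points to a 50/50 balance.
    batch = []
    for cluster in rng.sample(clusters, n_cl):
        chain = rng.choice(cluster)
        positives = [p for p in chain.surface_points if p.ligandable]
        negatives = [p for p in chain.surface_points if not p.ligandable]
        batch += rng.choices(positives, k=n_p // 2)
        batch += rng.choices(negatives, k=n_p - n_p // 2)
    return batch  # n_cl * n_p samples in total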
2.6. Combining Ligandability Predictions into Binding Sites
Our algorithm for combining ligandability predictions into binding sites is based on a statistical analysis of the binding sites in our dataset. For the points on the SAS of each binding site, we analyzed three properties describing the site's size and shape. For example, the standard deviation of the points' distances to the site's center tells us whether a binding site's shape resembles a sphere (small value) or is elongated (large value). Histograms of the analyzed properties are shown in Figure 6, in which we also mark the 10th and 90th percentiles, i.e., the chosen boundaries that our predicted binding sites are supposed to satisfy. We see that the observed properties do not have peaks too far apart and that the data are dense inside the interval between the lower and upper bounds, which is beneficial for our method.
The pseudocode of the proposed algorithm is given in Algorithm 1. It uses the bounding values of the described properties to combine pointwise predictions into final binding sites. The function binding_sites takes the pointwise ligandability predictions as input. The set of points is split into a ligandable and a non-ligandable class based on the threshold value c. In the while loop, the points from the ligandable class are clustered into components based on their pairwise distances: we look at the points as vertices of a graph in which two points are connected if the distance between them is smaller than 4 Å, and the clusters are the connected components of this graph. Then, we check whether a cluster fits into the bounds set by the percentiles of all three properties; if it does, we add it to the predicted binding sites and remove its points from the set of remaining points.
Based on the condition, we change the variables a and b that serve as the boundaries of the interval in which the threshold c varies. On the remaining set of points, we then compute the new ligandable points based on the updated threshold and compute the new condition for the while loop. Computing the condition is done with the method compute_condition, which tells us whether we should break the while loop or change the interval. The decision to change the interval is made on the basis of the percentiles and the shape of the current clusters: increasing a leads to increasing c, and decreasing b leads to decreasing c. Intuitively, we can think of this process as extracting the “well-shaped” clusters in the current iteration and then adjusting the threshold so that more clusters can be extracted in the following iterations.
Algorithm 1 Grouping predicted ligandable points into binding sites.

function binding_sites(P)
    a, b ← 0, 1    ▹ Real values a and b define the interval in which we look for the best c value.
    c ← (a + b)/2    ▹ Real value c is the threshold for classification of ligandability predictions in each iteration of the while loop.
    L ← select_ligandable_points(P, c)
    S ← ∅    ▹ The predicted binding sites collected so far.
    condition ← compute_condition(a, b, L, S)
    while condition is not “break” do
        clusters ← components(L)
        for K in clusters do
            if the properties of K fit into the quantile bounds then
                append K to S and remove its points from P
            end if
        end for
        if condition is “increase a” then
            a ← c
        else if condition is “decrease b” then
            b ← c
        end if
        c ← (a + b)/2
        L ← select_ligandable_points(P, c)
        condition ← compute_condition(a, b, L, S)
    end while
    return S
end function

function compute_condition(a, b, L, S)
    ▹ The decision is based on the percentile bounds and the shape of the current clusters;
    ▹ increasing a raises c (fewer ligandable points), decreasing b lowers c (more ligandable points).
    if the interval [a, b] has become too narrow or no ligandable points remain then
        return “break”
    else if the current clusters are too large with respect to the quantile bounds then
        return “increase a”
    else
        return “decrease b”
    end if
end function

function components(L)
    V ← L
    E ← {{p, q} : p, q ∈ L, d(p, q) < 4 Å}    ▹ Two points are connected if they are closer than 4 Å.
    G ← (V, E)    ▹ Create a graph with vertices V and edges E.
    return connected_components(G)
end function

function select_ligandable_points(P, c)
    L ← ∅
    for p in P do
        if prediction(p) > c then    ▹ Check if the prediction for p is higher than the threshold c.
            append p to L
        end if
    end for
    return L
end function
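A simplified Python rendering of Algorithm 1 is given below. It assumes the predictions are probabilities in [0, 1], takes the percentile check as a caller-supplied predicate fits_bounds, and replaces the finer-grained compute_condition with a simple bisection heuristic that raises the threshold whenever no cluster fits the bounds:

import numpy as np
from scipy.spatial import cKDTree

def components(points, radius=4.0):
    # Connected components of the graph in which two points are adjacent
    # if they are closer than `radius` angstroms, via union-find.
    n = len(points)
    pairs = cKDTree(points).query_pairs(radius)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

def binding_sites(points, preds, fits_bounds, tol=1e-3):
    # points: (N, 3) array of SAS points; preds: (N,) ligandability scores.
    a, b = 0.0, 1.0
    sites = []
    remaining = np.ones(len(points), dtype=bool)
    while b - a > tol:
        c = (a + b) / 2
        idx = np.where(remaining & (preds > c))[0]
        if len(idx) == 0:
            b = c  # no ligandable points left: lower the threshold
            continue
        accepted = False
        for comp in components(points[idx]):
            members = idx[comp]
            if fits_bounds(points[members]):
                sites.append(members)
                remaining[members] = False
                accepted = True
        if not accepted:
            a = c  # raise the threshold so that the clusters shrink
    return sites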
Two examples of our binding site extraction algorithm are shown in
Figure 7, where we plot point clouds based on ligandability predictions and their clustering into detected binding sites.