1. Introduction
One of the frontier domains of biological research is to determine which small molecules, such as substrates or pharmaceuticals, are more likely than other small molecules to bind on a protein. Among the vast amount of small organic molecules present in living cells, only a small fraction—sometimes only a single chemical species—will normally bind to a specific protein. This specificity of protein:ligand is a central concept in chemical biology and in biochemistry, and the basis of virtually all projects in small molecules drug discovery. Yet, major questions remain on the mechanisms of protein:ligand specificity.
The widely accepted view of protein:ligand interactions is that a protein cycles through multiple conformations, and that a few of these protein conformations will be bound by the small molecules. Modern biophysical tools such as virtual docking [
1], that predict the probability of a given small chemical to bind to a given protein, are used routinely in fundamental and industrial research, but they are limited because they usually consider only one protein conformation in an “induced fit” mechanism rather than an ensemble of conformations.
In our previous work we have performed massive systematic computational characterization of protein:ligands interactions for a very large number of protein conformations and small molecules. We have shown that we can identify the small number of protein conformations that will be “selected” for binding by their ligands. In principle these “binding” conformations will lead to free energy minima in the (protein + ligand) complex free energy hypersurface. In our research we are aiming to understand why these rare apo-conformations possess this capacity to bind their ligands, while the vast majority of the other protein conformations do not. Being able to successfully identify and predict the properties of binding conformations that render them able to bind their ligands would be a transformational change in our understanding of conformational selection, in that it will open the door to the prediction of which protein conformations should be used in the drug discovery pipeline that relies on computational docking. That will massively reduce the enormous cost of docking, by focusing on these binding confirmations alone. We also will answer several very important fundamental questions, beyond the nature of the protein properties responsible for conformational selection: are these properties similar for all proteins? Are they similar for classes of structurally close proteins? Or are they different from all proteins? Are these properties related also to the chemical properties of the ligands, as hypothesized behind most docking approaches? This work contributes to such characterization of what properties of an apo-protein conformation leads to conformational selection.
The fundamental approach is relatively simple: from our previous work we already [
2] know which few protein conformations, among thousands, will lead to conformational selection, and we know which small molecules will be ligands of the proteins in these binding conformations. Calculating physico-chemical descriptors of the proteins, and building simple regression models could separate the protein conformations that bind ligands, from the protein conformations that do not bind ligands. However, the sheer amount of data renders “simple regression” calculations all but impossible. The data set we use here corresponds to about 1.5 millions of protein conformation and protein:ligand complex, and their associated interaction energies. A small fraction of this dataset contains a few hundreds of protein binding conformations that lead to conformational selection, and most of the rest of the data corresponds to protein conformations that are not statistically binding well to ligands. Calculating even “only” 40 physiochemical descriptors properties of the protein conformations, leads very quickly to a complexity that only Big Data Analytics can handle.
In addition, there are fundamental issues that are specific to the field of Big Data analytics that we aim at addressing in this work. Most of the real-world biomedical datasets suffer from statistical ill-conditioning issues such as class imbalance problem [
3], which arises due to imbalanced groups or sub-categories present in the data. In this scenario, typically, the majority class or larger group of data that consists of non-binding protein conformations overshadows the minority class or smaller data group that comprises the data-of-interest, i.e., the binding protein conformations. In such a case, any machine learning (ML) technique applied for data-learning on a biased population dataset could result in higher misclassification of the smaller population of binding conformations, since the decision-making process is more biased towards the larger population of non-binding conformations. Conventional ML algorithms are not directly equipped to handle the class imbalance problem during the data-learning phase and therefore a new two-stage sampling-based classifier framework was proposed in our previous work [
4,
5] which aimed at tackling the class imbalance problem and maximizing the detection of potential binding conformations. This paper extends the use of the two-stage sampling-based classification approach with additional feature scoring and feature selection methods in conjunction with an Enrichment ratio framework, in order to validate and gain deeper understanding about the unique physico-chemical attributes of the selected protein conformation predictions from the ML framework. The feature scoring method uses three different feature selection methods to select the physio-chemical features or descriptors of potential drug-binding molecular conformations for target proteins ADORA2A, ADRB2 and OPRK1. The two-stage sampling-based classifier to maximize the prediction rate of potential binding conformations for target proteins ADORA2A, ADRB2 and OPRK1 and uses the proposed Enrichment ratio framework for the validation of the prediction results obtained from the ML framework. The goal of our ML-based feature analysis process is to advance our understanding of the significance of specific physico-chemical attributes or descriptors than make certain protein conformations more conducive to the protein:ligand binding process.
2. Background and Related Work
2.1. Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a statistical technique to calculate the linear dependency between each feature and the target variable and select the features with the highest F-values [
6]. The ANOVA technique is performed between each feature and the target vector and the obtained F-value is assigned to that feature, wherein the features are ranked based on their information content. The higher the F-value, the more important is the feature. Here top ‘K’ features with the highest F-values were retained, where the K features to be selected is experimentally determined by the user. Therefore, ANOVA technique provides selection of primary physio-chemical protein features that play an important role in protein:ligand binding and conformation selection process.
2.2. Mutual Information (MI)
Mutual Information (MI) is defined as a measure between two random variables
X and Y, that determines the amount of information obtained about one variable, through the other random variable [
7]. In information theory, entropy is an important basis used for uncertainty or known information measurement. Given a variable
Y and the conditional entropy
H(
X|
Y) of
X with respect to
Y is defined as:
where
p(x,y) is the joint probability density function.
p(x|y) is the posterior probabilities of X given Y.
p(x) is the probability density function
From Equations (1) and (2), mutual information
I(
X;
Y) [
8] can be defined as below:
Using Equation (3), if X and Y are independent of each other, i.e., if they are unrelated, then their MI value is 0. Similarly, if X and Y are dependent on each other, then their MI value is 1. Here independent implies that no information of Y can be obtained using X and dependent implies that we can determine X from Y or vice-versa. MI measure for the physio-chemical features can be determined as follows:
Compute the MI to measure the dependency between the physio-chemical features vector and the target variable for all features.
The features are then ranked based on their MI values.
The top K features with highest MI values are retained where K is defined by the user.
Thus, MI explains the unique information present in the physio-chemical protein features that assists in protein conformation selection process.
2.3. Recurrence Quantification Analysis
Recurrence Quantification Analysis (RQA) is a numerical data analysis and statistical technique that is used for the study of non-linear dynamical systems [
9]. The technique quantifies the number and duration of occurrences of a system presented by its state space trajectory. Recurrence analysis starts by quantifying the repeating patterns of the plot. Entropy of the distribution of the diagonal lines (
ENTR), one of the measurements retrieved from the plots is the probability distribution
p(
l) of the diagonal line on the RQA plot is defined as:
where,
We evaluate the RQA-based entropy measure to understand the entropy with respect to time-space evolution of protein conformations and its relationship to probability of discovering potential binding conformation.
2.4. Logistic Regression
Logistic regression (LR) is a probabilistic statistical ML technique used to perform predictive analysis on data that is categorical or has binary classes. The main objective of LR is to determine the best fitting model that describes the relationship between a dependent binary variable against a group of independent variables [
10]. Let
y be the binary outcome indicating failure/success with values labeled as 0 or 1, respectively, and
p be the probability of
y = 1 as
p = 1. Conversely,
y = 0 can be expressed as 1 −
p. LR then models the outcome
y based on the linear combination of the independent data variables
b1,
b2,
…,
bn and their respective parameter/weight values A
1, A
2, …, A
n via maximum likelihood method as given by:
LR technique is used in this work for supervised classification and prediction of potential binding vs. non-binding protein conformations selection.
2.5. SMOTE Algorithm
Synthetic minority Over-sampling Technique (SMOTE) is a well-liked method which helps in solving the class imbalance problem by creating new artificial data points rather than by over-sampling with replacement of the minority class [
11]. The algorithm is elaborated below [
12,
13,
14]:
Select K neighbors.
Oversampling of the minority class is carried out by taking the difference between the feature vector and its nearest neighbors.
Multiply the difference obtained in the last step by a random weight between 0 and 1, and then add it to the feature vector under consideration. This causes the selection of an arbitrary point along the line segment between the two specific features.
This approach of the algorithm coerces the decision region of the minority class to be more general and therefore results in a desired well-balanced dataset.
2.6. Gaussian Naive Bayes Classifier
Gaussian Naive Bayes Classifier (GB) is a conditional probabilistic ML technique that is widely used for supervised data classification. According to Bayes theorem, the posterior probability can be expressed as the product of prior probability and likelihood as described by:
where,
P(y|x) is called the posterior probability. It gives the probability of y given that the data point x is true.
P(x|y) is called the likelihood. It gives the probability of point x given that the data point y is true.
P(y) is called the prior probability of y. It gives the probability of y being true across all data points.
P(x) is the probability of data point x averaged over all of the value of y.
Using Equation (6), posterior probability can be calculated for different features in the dataset. In case, when the features in the dataset has continuous values then the likelihood of the features is assumed to be Gaussian [
15] and, hence, conditional probability is given by:
where,
Since GB technique is probabilistic in nature, it is used in this work for supervised classification and prediction that is biased towards identification of potential binding protein conformations.
2.7. K-Nearest Neighbor Classifier
The K-Nearest Neighbors (KNN) algorithm is a non-parametric ML technique that is widely used for its easy interpretation and low calculation time for classification and regression problems. Given a testing example, the KNN algorithm directly searches through all the training examples by calculating the distances between the testing examples and the training dataset in order to identify its nearest neighbors and give a classification output [
16]. The Euclidean distance, as defined in Equation (8), is used as the distance metric which assigns the testing example to the class among its k nearest neighbors (where k is an integer). KNN technique is another supervised classification method used in this work for prediction of potential binding vs. non-binding protein conformations selection.
where,
2.8. Confusion Matrix
In the ML supervised learning domain, confusion matrix is often used to represent the summary of classification or prediction results. The confusion matrix terminology for a binary classification has four cases that are used to interpret the classification/prediction results as follows:
True Positive (TP): The case where the classifier correctly predicts ‘binding’ between the ligand and the target protein. (Class 1 right predictions)
True Negative (TN): The case where the classifier correctly predicts ‘No binding’, i.e., the ligand and the target protein did not bind. (Class 0 right predictions)
False Negative (FN): The case where the classifier predicts ‘No binding’, but the ligand and the target protein did bind. (Class 0 incorrect predictions)
False Positive (FP): The case where the classifier predicts ‘binding’ but the ligand and the target protein did not bind. (Class 1 incorrect predictions)
Where class 0 refers to the non-binding conformations and class 1 denotes the potential binding conformations.
5. Conclusions
In this paper, we extended the use of the two-stage sampling-based classifier framework with additional feature scoring and feature selection methods in conjunction with Enrichment ratio framework, to identify the unique physico-chemical attributes of the binding conformation and based on those attributes understand what makes a protein conformation more conducive to
protein:ligand binding process. The enrichments shown in
Table 2 through
Table 4 suggest that it is indeed possible to identify binding conformations in order of magnitude better than thought a random selection of protein conformations. This is a very encouraging result: it means that there are physico-chemical protein properties that will drive, at least in part, whether or not a protein conformation is more likely to be binding than non-binding. From a machine learning point of view, both LR + SMOTE − GB and LR + SMOTE − KNN classifiers demonstrated a reasonable prediction performance for binding protein conformations as described in (
Table S1 through Table S3). In addition, the best enrichment is obtained for the smallest selection of binding conformations (0.5% to 1% of the conformation, see
Table 2 through
Table 4). This suggests that the “top” predicted binding conformations, in terms of strength protein: ligand energies as defined in
Figure 2, contain the most “solid” data in terms of predictability power. This greatly simplifies the computational complexity and cost of this approach, as only a small fraction of the otherwise massive ensemble of conformations needs to be used in subsequent docking or analyzing.
There are however several challenges that limit, for the time being, the application of this approach to any protein conformation ensemble a priori. One limitation is that, as shown in
Table 5 through
Table 7, there does not seem to be a universal set of physico-chemical properties that are consistent across all the proteins studied here. Rather, it appears that each protein has its own unique set of “preferred” physico-chemical features that go into driving conformation selection. While this is not in itself a problem (indeed, individual proteins may very well have different physical and chemical ways to interact with their ligands), it certainly complicates the application of this approach to a novel protein for which no a priori knowledge of ligands does exist, and on which the model could be trained.
In our opinion, this suggests that there is a need to continue this work with (i) other proteins for which (preferably many) known ligands exist, to see how diverse are the physico chemical features in each set, and (ii) using different physicochemical descriptors that the 40 descriptors used here. The physico-chemical descriptors used here are well validated to quantify global protein properties and discriminate between global conformation, but it may be advantageous to use physic-chemical features that are limited to the binding sites of the proteins, where the interaction of the protein with its ligands are taking place. This is not however a trivial task. However, the fundamental result of this current work, that apo-protein conformations possess properties that correlate with conformational selection, makes this effort worthwhile.
Another challenge of this approach is a consequence of the severe class imbalance of the data, which contain much more conformations that are not selected by the ligands than conformations that are selected by the ligands. As shown in
Tables S1–S3, the overall model accuracy for all the test case proteins is low due to the class imbalance problem in the dataset, but despite that the model was able to identify the potential binding protein conformations with a smaller training size which is depicted by the sensitivity score in the classification tables in
Supplementary Materials.
Table S1 for ADORA2A shows that the sensitivity score of LR + SMOTE − KNN method was 80% for a training size of only 30%. This indicates that the LR + SMOTE + KNN method was able to correctly predict 80% of binding protein conformations with only a 30% training size. Thus, the ML-based protein conformation framework described here performs well overall in detection of binding protein conformations, which is what the current work is focusing on. These results provide a baseline against which subsequent performance improvement benefits can be evaluated, in a generalized protein conformational selection model (enrichment ratio framework) through the adoption of ML-based prediction models.
The motivation of this work is to understand the protein properties responsible for conformation selection. Applied to novel target proteins, this approach yielded an enrichment ratio of ~11 (ADORA2A) to 20 (ADRB2, OPRK1). Using feature scoring, we observed 5 common physio-chemical descriptors between ADORA2A and OPRK1, 1 common descriptor between ADORA2A and ADRB2 and 3 common descriptors between ADRB2 and OPRK1. No common physio-chemical descriptors were found between ADORA2A, ADRB2 and OPRK1.