1. Introduction
The brain disposition of drug candidates has been one of the major issues in the pharmaceutical industry in recent decades. Τhe development of novel compounds targeting the central nervous system (CNS) is becoming more and more essential as a consequence of the increasing incidence of neurological and neurodegenerative disorders (e.g., depression, Parkinson disease, and Alzheimer disease). The high attrition rate and failures in the field of CNS agents are current challenges that pharmaceutical companies have to face [
1,
2]. From this perspective, assessment of brain distribution in the early stages of the drug development process is crucial for novel CNS drug candidates. The most important obstacle for drug delivery in the brain is the blood-brain barrier (BBB), which is formed by the endothelial cells lining the brain micro-vessels [
3,
4,
5]. The presence of tight junctions between the endothelial cells leads to limited fenestration, making passive diffusion the dominant transport pathway inside the brain, while the paracellular path is of much lower importance. Moreover, the brain transport of certain drugs may be facilitated or restricted by the presence of membrane transport proteins in the BBB [
6].
The most common approach to quantifying the BBB permeability of molecules has been the determination of the ratio between the total brain and the total plasma concentration, mostly expressed as K
p,brain. However, the reliability of K
p,brain in quantitative estimations of brain disposition has been criticized, as its determination is based on the total concentrations, ignoring issues like plasma protein binding (PPB) and tissue binding. According to the “free drug hypothesis”, only the unbound drug crosses the biological membranes, thus constituting the unbound concentration (C
u) relevant for pharmacological activity. From this perspective, the ratio between the unbound drug concentration in the interstitial fluid of the brain to the corresponding plasma concentration, K
p,uu,brain, is considered as a more representative measure [
7,
8,
9,
10,
11,
12,
13].
Experimental determination of K
p,uu,brain through microdialysis exhibits certain limitations as it is technically demanding, time-consuming, and inefficient for very lipophilic compounds [
7]. However, K
p,uu,brain can be derived from K
p,brain by combining the unbound fraction of the drug in the plasma (f
u,p) and in the brain (f
u,brain) or the unbound brain volume of distribution, V
u,brain [
7]. The brain homogenate method is used to estimate the f
u,brain, considered to reflect only nonspecific binding, while the brain slice method is applied for measuring the V
u,brain (mL/g brain), which quantifies the overall cellular uptake of the drug, including active cellular membrane transport and pH partitioning [
7,
10,
12,
14]. The relevant equations are included in
Section 3. Evidently, f
u,p is obtained by standard methods for plasma protein binding.
K
p,brain, usually converted into the logarithmic form and labeled as logBB, has been widely used to establish in silico models in an effort to reveal the favorable physicochemical and molecular properties for BBB permeability [
15]. A number of QSAR models have been reported in the literature, most of them involving calculated lipophilicity, expressed as octanol–water partition or distribution coefficients (logP or logD), polar surface area, and molecular weight [
16,
17,
18,
19]. However, to the best of our knowledge, analogous models for f
u,brain, V
u,brain, or K
p,uu,brain itself are rather limited [
11,
14,
20].
A potential concern regarding in silico predictive models may arise in respect to the reliability of calculated physicochemical properties of drugs, in particular for new chemotypes. It has been reported that small errors in logP or logD predictions may cause substantial errors in the estimation of biological end-points, particularly if limited datasets are analyzed, while experimental lipophilicity requires a rather tedious and time-consuming procedure [
21,
22]. On the other hand, user-friendly chromatographic techniques may offer a challenging alternative to combining rapid measurements with theoretical descriptors for the construction of ‘hybrid’ in silico models based on experimental evidence.
Biomimetic properties, defined as the retention outcome on HPLC stationary phases containing a biologically relevant agent, have attracted considerable interest and are currently used for the rapid evaluation of ADME properties in early drug discovery phases. In particular, immobilized artificial membrane (IAM) chromatography, which uses phospholipid-containing stationary phases, has been applied in investigating permeability as an alternative to traditional octanol–water lipophilicity. Since, however, electrostatic interactions have a strong contribution in retention mechanism, especially in the case of protonated bases, IAM chromatography is considered as also reflecting drug–membrane interactions and tissue binding [
23,
24]. IAM models for oral absorption, skin partitioning, and brain penetration, mostly expressed as logBB, have been reported in the literature, usually in combination with additional molecular descriptors [
23,
24,
25,
26,
27,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41,
42]. Classification of CNS
+ and CNS
− drugs has been suggested using the IAM retention factor divided by the fourth root of molecular weight [
43]. On the other hand, protein-based stationary phases, incorporating human serum albumin (HSA) or α
1 acid glycoprotein (AGP)-, simulate binding to plasma proteins [
25,
26,
39,
44,
45,
46].
Immobilized HSA retains the characteristics of the protein in solution, as has been proven by zone and frontal analysis experiments, permitting the safe estimation of binding constants to human serum albumin in the plasma [
47]. Less investigated is AGP retention due to the polymorphism of α
1 acid glycoprotein and practical issues related to the immobilization on the silica skeleton of the column. It is well-established, however, that protonated bases are more strongly retained on AGP columns, in agreement with the same binding preference of AGP in solution [
47]. In light of the above considerations, HPLC-based biomimetic properties reflect the major factors which govern most biological processes, e.g., passive diffusion and non-specific binding to phospholipids, tissue, or plasma proteins. Accordingly, their combination may be suitable for the estimation of composite pharmacokinetic data. Valko et al. used the weighted sum of IAM and HSA retention to estimate the volume of distribution and tissue binding in different organs, including brain tissue binding [
39].
In the present study, we investigated the performance of biomimetic properties in combination with molecular descriptors in the development of ‘hybrid’ models with experimental evidence for the thorough analysis of composite Kp,uu,brain data, and the distinct brain disposition components, Kp,brain, fu,brain, and Vu,brain. The aim was to bridge the gap with pure in silico prediction models and to contribute to the perception of these crucial experimental end-points for CNS-acting drug candidates. For this purpose, in-house retention factors, determined for a number of pharmaceutical compounds on IAM, has, and AGP stationary phases, were used together with physicochemical/molecular descriptors. Lipophilicity, expressed as octanol–water partition coefficients (logP) or distribution coefficients at pH 7.4 (logD7.4) and pH 5.0 (logD5.0), was incorporated in the pool of descriptors, and its implementation in the models was compared to that of IAM retention. Models were constructed by applying multiple linear regression (MLR) and partial least squares (PLS) analysis. MLR can provide simple models that are easily used by medicinal chemists, while PLS, contributing complementary and supporting MLR models, may permit a deeper insight into the factors underlying brain disposition of drugs. Attention was given to model validation in respect to robustness and applicability domain to offset the drawback that rather limited datasets were analyzed.
2. Results and Discussion
In the present study, we collected experimental biomimetic chromatographic data, previously obtained in our laboratory, for a set of 55 pharmaceutical compounds belonging to a wide range of pharmacological classes. (For further details, see the
Section 3 and
Table S1.) The investigated compounds are small molecules within a molecular weight (MW) range of ca 129 to 513 Da (
Figure 1a). The dataset includes neutral, basic, and acidic compounds; however, most molecules contain one basic ionizable group (
Figure 1b).
Experimentally measured logBB, K
p,uu,brain, f
u,brain, and V
u,brain values were collected from the literature [
10,
11,
12,
48,
49,
50,
51,
52,
53,
54] (
Table S2). The collected logBB values indicated a sufficient range of BBB permeability, from highly permeable (logBB > > 1) to significantly less permeable (logBB < < 0) compounds (
Figure 1c). Nevertheless, the range of logK
p,uu,brain values suggested a lower extent of brain distribution of the studied molecules (
Figure 1e). The two measures showed a mediocre intercorrelation (
Figure 1e), while f
u,brain and V
u,brain showed significant inverse intercorrelation on the basis of their logarithmic values (
Figure 1f–h).
For further statistical analysis, f
u,brain values were converted to the thermodynamic constant K
b [
55] according to the relevant equation included in
Section 3. Its logarithmic form logK
b was used throughout.
2.1. Data Overview with Unsupervised Data Analysis
To obtain an overview of the dataset, unsupervised principal component analysis (PCA) was performed using the pool of descriptors and the response variables as an X matrix. A 5-component PCA model was generated with R
2 = 0.674 and Q
2 = 0.462. The score plot of the first two components shows a uniform distribution of the data in all four quartiles, with one drug, candesartan, lying outside the Hoteling T
2 ellipse (
Figure 2a). As depicted by the coloring of the scores based on the logk
wIAM values, IAM retention is a dominant parameter of the data distribution. The loadings plot reflects the correlations among the chromatographic data, the molecular and physicochemical descriptors, and the experimental brain distribution data in a multilayered fashion (
Figure 2b). The coloring of the variables corresponds to their grouping according to hierarchical clustering based on the five principal components. LogBB and logV
u,brain are located in the same quartile with chromatographic factors and lipophilicity parameters and belong to the same hierarchical clustering group. Interestingly, logK
p,uu,brain is also incorporated into the same group, but its loading value in the first component is approximately zero. LogK
b is located in the opposite quartile in high proximity to hydrogen bonding parameters.
2.2. Modeling logKp,uu,brain
LogK
p,uu,brain is considered to be the most relevant measure of the rate and extent of drug delivery in the brain, and as such, it was examined first. MLR analysis gave poor statistics (R < 0.5). Polarity descriptors, e.g., tPSA, display the strongest negative correlation (
Figure 3a). Poor correlation was observed between logK
p,
uu,brain and logk
wIAM, logP or logD7.4. However, a relatively better correlation was found between logD
5.0 and IAM retention at pH 5.0 (
Figure 3a).
The above findings were further supported by Partial Least Squares analysis. PLS analysis offers a number of advantages, tolerating intercorrelated variables and missing values. Variables are treated simultaneously to extract the principal components as their linear combinations. Using the entire pool of descriptors and performing variable selection according to the variable importance to projection (VIP) criterion, no satisfactory PLS models could be obtained, with R
2 and Q
2 not exceeding 0.52 and 0.32, respectively. In agreement with MLR, polarity terms (TPSA) show the strongest impact with VIP values > 1 and negative contribution. Biomimetic properties and lipophilicity have positive contributions, albeit with considerably lower impacts. In agreement with MLR results, logD
5.0 and IAM retention at pH 5.0 show higher influence. In fact, VIP values decrease in the order of: logD
5.0 > logk
wIAM,5.0 > logD
7.4 > logk
10,HSA > logk
wAGP > logP > logk
wIAM,7.4 (
Table S3). The better performance of IAM retention and lipophilicity at pH 5.0 may be explained by the recently reported evidence of the interference of the endo-lysosomal system (lysosomes’ pH 5.0–4.5) in vitro blood–brain barrier models [
56].
Inspection of the plots of observed vs. predicted logK
p,uu,brain and of DModY plots (distance of observations to the Y model) revealed two strong outliers, quinidine and theophylline, present in any PLS model (figure not shown). Upon exclusion of these drugs from statistical analysis, acceptable PLS models were obtained. Using biomimetic properties and lipophilicity in combination with computational descriptors, a 1-component PLS model with 54 original descriptors was generated with the following statistics (PLS Model 1):
The separate effect of IAM retention or lipophilicity is shown in PLS Models 2 and 3, which include either IAM retention at pH 7.4 and 5.0 or lipophilicity measures (logP, logD
7.4 and logD
5.0). PLS Models 2 and 3 denote equal performance of IAM retention and traditional lipophilicity:
Finally, the use of only a few computational descriptors with high VIP values, including polarity, flexibility, and shape indices, proved sufficient to produce an improved, 1-component PLS model (PLS Model 4):
In
Figure 3b,c the plot of observed vs. predicted logK
p,
uu,brain values generated by PLS Model 4 and the coefficients of the original variables are depicted, respectively. The predicted logK
p,
uu,brain values are presented in
Table S4. The applicability domain was defined using DModY plot, and the doubled average DModY value, equal to 0.822 × 2 = 1.644, was used as a critical value. Two drugs, acetaminophen and midazolam, exceed the critical value (
Figure S1).
PLS Model 4 was validated by permutation tests (
Figure S2) and by dividing the dataset into a training set and a test set, using 4 different training/test sets (see
Section 3). The external validation demonstrated robust statistics except for 1 iteration, in which the drug acyclovir was included in the test set (
Table 1). In
Figure S3, a representative plot of observed vs. predicted values is presented.
In fact, the construction of PLS Model 4 indicates that biomimetic properties and lipophilicity are less crucial in logK
p,
uu,brain modeling, as shown also by the failure to generate successful MLR models. Previous studies have likewise identified the lack of strong correlations between logK
p,uu,brain and physicochemical descriptors, especially lipophilicity, while hydrogen bonding has been found to be the strongest contributor [
11,
20]. Loryan, et al. reported a PLS model with two descriptors related to polarity (tPSA) and hydrogen bonding capacity [
14].
Considering the complex nature of K
p,uu,brain as the outcome of opposing brain disposition end-points and plasma binding [
11] (see relevant equations in
Section 3), modeling K
p,brain, expressed as logBB, f
u,brain, and V
u,brain, can indirectly assist in the exploration of logK
p,uu,brain. In the next sections, the generation and validation of the MLR and PLS models of these distinct brain disposition end-points are discussed.
2.3. Modeling logBB
2.3.1. Multiple Linear Regression Models
Partial correlations between logBB, IAM retention, logD
7.4, and physicochemical/molecular properties revealed the crucial effect of polarity, with TPSA showing the highest correlation coefficient (r = −0.777,
p < 0.001), followed by the sum of nitrogen and oxygen atoms, NO (r = −0.711,
p < 0.001), and dipole moment (r = −0.693,
p < 0.001), while for logkwIAM and logD7.4 r = 0.617 (
p < 0.001) and r = 0.687 (
p < 0.001), respectively. Their combination with TPSA led to Equations (1) and (2), respectively. The term h* in Equations (1) and (2) expresses the critical leverage value, which defines the applicability domain (AD) [
57]. (See
Section 3.)
Eight drugs had h values higher than the threshold h* value and were beyond the AD of the model, as also shown in a Williams plot (
Figure S4).
Six drugs were outside the AD of Equation (2) (
Figure S4).
In both equations, TPSA had a higher significance than logkwIAM or logD7.4, with Student’s test-t values of |tTPSA| = 5.4 and 4.8, respectively, while |tlogkwIAM| = 2.22, and |tlogD7.4| = 2.69.
External validation of Equations (1) and (2) was performed by dividing the dataset into a training set and a test set. This procedure was repeated up to six times for each equation. Practically speaking, the same equations were generated with comparable statistical data (
Table 2). In all cases, observed vs. predicted values showed a 1:1 correlation. In
Figure S5, a representative plot of observed vs. predicted values from Equations (1) and (2) is illustrated.
2.3.2. Partial Least Squares Models
Initially, PLS analysis was performed considering both biomimetic properties and lipophilicity in the pool of descriptors, so as to rank their impact in modeling logBB. Polarity descriptors (TPSA and dipole moment) were found to have the highest VIP values (VIP = 1.2). AGP retention showed equally high impact, followed by logD7.4 (VIP = 1.05), and IAM retention (VIP = 0.95). HSA retention and logP had a lower influence in the model with VIP < 0.9.
Keeping AGP retention as the most influential parameter and including either logk
wIAM (PLS Model 5,
Figure 4a,b) or logD
7.4 (PLS Model 6,
Figure 4c,d), 2 component PLS models with 19 descriptors were obtained after variable selection, with practically equal statistical quality:
The applicability domain was defined by the critical value DModY being 0.767 × 2 = 1.534 for Model 5 and 0.789 × 2 = 1.58 for Model 6 (
Figure S6). A total of 4 drugs, i.e., atenolol, candesartan, haloperidol, and maprotiline, were outside the AD of PLS Model 5. The same drugs, plus indomethacin, were outside the AD of PLS Model 6 (
Figure S6).
Considering the large percentage of missing logk
wAGP values, and in order to confirm its essential contribution, PLS analysis was repeated including only drugs with available logk
wAGP data. The generated models proved the crucial contribution of AGP retention (
Figure S7).
The contribution of logk
wAGP in logBB was further confirmed in MLR models for the drugs with available data:
The high impact of AGP retention may be related to the presence of a strong negative charge in AGP stationary phases due to their high content in sialic acid. Brain cell membranes are also the most anionic and have their lipids mostly exposed, thus explaining the reason that lipophilic cationic compounds are more prone to cross the BBB.
PLS Models 5 and 6 were validated with permutation tests (
Figure S8) and by external validation upon dividing the dataset into 5 different training and test sets. Practically speaking, the same models were generated by an external validation procedure, with comparable statistical data (
Table 2), with the exception of iteration, including quinidine (PLS Model 5) and morphine (PLS Model 6). In all cases, observed vs. predicted values showed a 1:1 correlation. In
Figure S9, a representative plot of observed vs. predicted values is illustrated. The predicted logBB values from the MLR and PLS models are provided in
Table S5.
2.4. Modeling logKb
2.4.1. Multiple Linear Regression
Since logK
b is related to tissue binding, a satisfactory negative correlation with logk
w,IAM values was obtained. Strong IAM retention corresponds to high tissue phospholipid binding and thereupon to a lower fraction being unbound:
A negative correlation with inferior but still acceptable statistics was obtained using HSA retention factors:
Further stepwise MLR analysis led to a three-parameter equation, combining the IAM and HSA retention factors with the count of nitrogen and oxygen atoms present in the molecule (NO):
The positive sign of the coefficient of NO indicated that a higher hydrogen bond acceptor potential is not favorable for unspecific hydrophobic tissue and plasma protein binding, increasing the amount of the unbound fraction. It should be mentioned that logk
w,IAM and logk
10ACN,HSA showed a considerable degree of intercorrelation (r = 0.824). To overcome the collinearity problem, and considering the low differentiation of regression coefficients of logk
w,IAM and logk
10ACN,HSA in Equation (5) was within statistical limits, their sum Sum
IAM-HSA was used instead:
Satisfactory correlation was also obtained with AGP retention factors for 20 drugs with available data:
Hydrophobic binding, a major force in drug-tissue and drug-protein interactions, was better simulated by logP of the neutral form than logD
7.4. Thus, a satisfactory logK
b/logP relationship was obtained—although with a lower correlation coefficient, compared to Equation (4):
Further stepwise regression did not lead to an improved equation.
Correlation with the distribution coefficient logD
7.4 led to an inferior relationship:
Equation (10), however, can be improved if the fraction of the ionized molecular species F
+ and F
− is introduced in combination with NO. The introduction of fractions F
+ and F
− reflects the binding conditions in respect to basic and acidic drugs, the first interacting stronger with phospholipids and the latter with serum albumin:
Evidently, all of the regression coefficients in Equation (11) had negative signs, except NO, which contributed positively. Three drugs were outside the AD, as shown in
Figure S10.
External validation of Equations (7) and (11) was performed by dividing the dataset into the training set and the test set, using 5 different test sets (
Table 3). In all training sets, observed vs. predicted values showed a 1:1 correlation. In
Figure S11, a representative plot of observed vs. predicted values from Equations (7) and (11) is illustrated.
2.4.2. Partial Least Squares Models
Application of PLS to model logK
b led to a two-component model (PLS Model 7) if both biomimetic properties and lipophilicity were included in the pool of descriptors:
Biomimetic properties and lipophilicity were the most crucial parameters (VIP > 1), with decreasing importance following the order: logk
AGP > logk
wIAM > logP > logk
10HSA > logD
7.4 (
Figure S12). In accordance with the MLR equations, the three biomimetic properties and lipophilicity had negative contributions, while the hydrogen bonding acceptor descriptor NO had a positive effect. More to the point, logP showed a higher impact in respect to logD
7.4 as a better lipophilicity expression for hydrophobic binding. Three bulk descriptors, CMR, arC6, and nPSA, included in the model had a negative sign (
Figure S12).
Including only biomimetic properties in the pool of descriptors, a 3-component model (PLS Model 8) was obtained with improved statistics:
As illustrated in
Figure 5, the same signs were kept in PLS Model 8 for biomimetic properties and NO, while a balance was observed between the non-polar descriptors, with nPSA having a positive sign.
According to the DModY criterion (critical value = 0.812 × 2 = 1.624), three drugs, i.e., fluoxetine, ranitidine, and theophylline were outside of the AD of the model (
Figure S13).
Replacement of IAM retention by lipophilicity, expressed both as logP and logD
7.4 led to a 3-component model (PLS Model 9):
Five drugs—acetaminophen, atenolol, neostigmine, propranolol, and theophylline—are the beyond AD, with DModY being higher than the critical value (2 × 0.768) (
Figure S13).
The logK
b PLS Models 8 and 9 were validated with permutation tests (
Figure S14), and upon dividing the dataset into 5 different training and test sets. Robust statistical data were obtained in all cases (
Table 3). Representative plots of observed vs. predicted logK
b values from external validation are presented in
Figure S15. The predicted logK
b values from the MLR and PLS models are presented in
Table S6.
Back calculation of f
u,brain values using the predicted logK
b were successfully correlated with the corresponding experimental values, approximating a 1:1 correlation (
Table S7).
2.5. Unbound Volume of Distribution, Vu, Brain
2.5.1. Correlation between fraction unbound and unbound volume of distribution in the brain
The negative correlation of logV
u,brain with logf
u,brain, shown also in
Figure 1h, is reflected in Equation (12):
Equation (12) is considerably improved upon the inclusion of F
+, which has a positive effect, indicating the importance of positive charge in the overall cellular uptake (Equation (13):
The high quality of Equation (13) permits the safe prediction of logVu,brain from logfu,brain.
We further attempted to construct models for the unbound volume of distribution using biomimetic properties, lipophilicity, and computational descriptors although, in this case, the dataset was limited, including only 17 drugs with available experimental data.
2.5.2. Multiple Linear Regression
Direct correlation of logV
u,brain with logk
w,IAM led to Equation (14), with moderate statistics:
Using stepwise regression, a considerably improved regression equation was obtained upon inclusion of the count of the hydrogen bond acceptors (HBA) as an additional parameter (Equation (15)):
All compounds were within the AD of Equation (15) (
Figure S16).
HSA retention was not significant, while AGP retention led to a very good correlation for the limited dataset of seven compounds:
Replacing logk
w,IAM with lipophilicity, a moderate regression equation was obtained (Equation (17)), with logP in combination with the fraction protonated, including F
+ as an additional parameter:
All compounds were within the AD of Equation (17) (
Figure S16).
The combination of logP with the fraction protonated at pH7.4, both with a positive contribution, has previously been reported for a PLS model of the apparent volume of distribution [
58].
Owing to the reduced dataset, test sets for the external validation of Equations (15) and (17) contained 4 to 6 compounds. Robust statistical data were obtained for validated equations, except when metformin was included in the test set, for both Equations (15) and (17) (
Table 4). In
Figure S17, a representative plot of observed vs. predicted values from Equations (15) and (17) is illustrated:
2.5.3. Partial Least Squares (PLS) Models
A one-component PLS model, based on IAM retention, was obtained after variable selection with very good statistics (
Figure 6a,b):
In agreement with the MLR models, IAM and AGP retention were the most influential variables with a positive contribution, while HSA retention factors were not included in the final model. Most polar descriptors had a negative effect, and the opposite was true for non-polar descriptors.
According to the DModY criterion (critical value: 0.802 × 2 = 1.604), 2 drugs (metformin and propranolol) were outside the AD of the model (
Figure S18).
The use of logP in place of biomimetic properties led to a three-component model with satisfactory statistics although they had inferior cross-validation results (
Figure 6c,d):
According to the DModY criterion (critical value 1.524), 2 drugs (pindolol and propranolol) were outside the AD of the model (
Figure S18).
The lower Q
2 and higher RMSEEcv indicate inferior predictability, reflected also in the higher intercepts of the corresponding permutation tests (
Figure S19). The PLS models were also validated by external validation, dividing the dataset into two training and test sets. In
Table 4, the statistical data of the new models are given. In
Figure S20, representative plots of observed vs. predicted logV
u,brain are provided. The predicted logV
u,brain values from the MLR and PLS models are presented in
Table S8.
Considering the potential of Equation (13) for safe predictions of logV
u,brain, we used Equation (13) in order to extend the dataset. Drugs with predicted logV
u,brain were used as a blind test set. As illustrated in
Figure S21, the blind test set was well accommodated in the model with the exception of two drugs, clomipramine and fluoxetine, which were strong outliers. Excluding clomipramine and fluoxetine, the combination of the blind test set with the training set led to PLS Model 12, which was practically the same as PLS Model 10:
For 4 drugs, DModY exceeded the critical value of 1.45, being outside the AD (
Figure S22).
3. Materials and Methods
3.1. Dataset and Chromatographic Data
A dataset of 55 pharmaceutical compounds, belonging to a wide range of pharmacological classes, was used in the present investigation. The dataset exhibited adequate structural diversity, consisting of acidic, neutral, basic, and zwitterionic molecules. The compounds had been previously studied in our laboratory with respect to their retention profiles in 3 biomimetic chromatographic columns, namely an IAM.PC.DD2 (Regis Technology, Morton Grove, IL, USA), a ChromTech CHIRAL-HSA column (50 mm × 4 mm i.d.), and a ChromTech CHIRAL-AGP column (50 × 4 mm i.d.), under experimental conditions as described in the corresponding references [
23,
24,
27,
29,
45].
For IAM chromatography, retention factors logk
wIAM corresponding to a 100% aqueous mobile phase were used. PBS was used as a buffer at two pH values, i.e., the physiological pH 7.4 (data labeled as logk
wIAM) and pH 5.0 (data labeled as logk
wIAM5.
0), with the latter being associated with intestinal absorption and lysosomal trapping [
59]. In the case of HSA chromatography, isocratic logk values, measured in the presence of 10% acetonitrile (ACN), were used (logk
10,HSA), as they showed a highly significant 1:1 correlation with plasma protein binding data in our previous study [
45]. Logk
10,HSA data were available for 37 compounds in the dataset. For the remaining compounds, logk
10ACN,HSA values were calculated based on the highly significant equation reported in the same study [
45]. Logk
wAGP retention factors corresponding to a 100% aqueous mobile phase were available for 28 drugs. Thus, they could be used in the MLR models only for restricted datasets. However, their performance could be evaluated in PLS models since this type of analysis tolerates missing values. All chromatographic indices are included in
Table S1.
3.2. Brain Disposition Data
Experimental K
p,uu,brain values measured in rat brain tissue were collected for 22 compounds [
10,
11,
12,
60,
61] and converted to the logarithmic form. Experimentally determined K
p,brain values for 42 compounds were obtained from the literature and were converted to logBB. They preferably referred to rat studies in order to achieve homogeneity in the data [
19,
31,
32]. Experimental f
u,brain values were available from the literature for 39 compounds in the dataset [
10,
50,
51,
52,
53,
62]. When more than one value was available, the average was calculated. Most values were determined on rats. Since there are only small inter-species differences, values from species other than rats were included (seven cases) for compounds where rat values were not provided. The f
u,brain values were converted to the thermodynamic constant K
b according to the following equation [
55]:
The denominator was set to 1.001-fu,brain in the case of compounds exhibiting an fu,brain value equal to unity. The decimal logarithm of Kb, logKb, was used for the development of the models.
V
u,brain values, determined using the brain slice method, were collected from the literature for 17 compounds in the dataset and converted to the corresponding logarithm, logV
u,brain [
10,
12].
K
p,brain, K
p,uu,brain, f
u,brain, and V
u,brain values are presented in
Table S2. They span a sufficient range and are evenly distributed, including drugs with different brain distribution profiles.
K
p,uu,brain can be derived from K
p,brain by combining the unbound fractions of the drug in the plasma (f
u,p) and in the brain (f
u,brain) or the unbound brain volume of distribution, V
u,brain [
7]:
V
u,brain shows an inverse relation to Vu,brain shows an inverse relation to fu,brain approximating under circumstances [
63]. fu,brain∼[1/Vu,brain]range 0-1.
3.3. Physicochemical and Molecular Descriptors
A pool of 121 descriptors, including 1D, 2D and 3D descriptors, was calculated using appropriate software. The software ADME Boxes 3.0 was used to calculate the topological polar surface area (TPSA), hydrogen bond acceptor and donor sites (HBA and HBD), the rotatable bonds (RB), the number of ionizable groups, and the fraction of molecular species at pH 7.4 (with the fraction of negatively charged species being F
−7.4 and that of positively charged species being F
+7.4). The Abraham’s solvation parameters (hydrogen bond acidity (A), hydrogen bond basicity (B), excess molar refraction (E), McGowan’s volume (V), and dipolarity/polarizability (S)) were calculated with the module ABSOLV implemented using the same software. Topological indices and electrotopological state indices [
64] were computed with Molconn-Z (v4.12 eduSoft, LC, La Jolla, CA 92037 USA) software Hyperchem v.5.0/Chemplus v.1.6 software (HYPERCUBE Inc., Waterloo, ON, Canada) was used for the calculation of 3D descriptors. Molecular size descriptors, energy parameters, and dipole moment were calculated using the lowest energetic conformation. Descriptors based on essential structural characteristics, such as the total number of rings (nRings), the number of phenyl rings, the number of heteroatoms inside the rings (Nr, Or, Sr), the number of heteroatoms outside of the rings (Nnr, Onr, and Snr), or the sum of them (N, O, and S), the number of different halogens (Cl, F, I, and Br) and the sum of them (halogen), and the number of double bonds between carbon atoms were derived manually.
Experimental lipophilicity data, logP, logD
7.4, and logD
5.0, were taken from references [
23,
27,
29] and the therein-cited literature sources. They are included in
Table S1.
3.4. Statistical Analysis
Multiple linear regression analysis (MLR) was performed using SPSS v.22.0 software. Variable selection was performed by applying the stepwise algorithm, first in groups of descriptors, classified according to their physicochemical content, in order to exclude the less-relevant variables for the final selection. Variables with zero or very small variance were excluded, as were collinear variables, considering a correlation coefficient lower than 0.8. The models were evaluated by considering the values of R, R
2, s (standard deviation), and F- test. For the significance of each individual variable, a t-test value of |t| ≥ 2 was considered. The applicability domain (AD) of the regression equations was defined by calculating the critical leverage h* according to the formula h* = 3 (
p + 1)/n, where
p is the number of parameters, and n is the number of compounds [
65].
Models were validated by dividing the dataset into different training and test sets of 5 to 9 drugs, depending on the overall sample size. The test sets were randomly selected, taking however into account to cover all four quartiles of the PCA scores plot. Subsequently, each model derived by the training set was used to predict the response variable of the corresponding test set. The R, R2, and s for both the training and test sets were considered and compared to the original model. In addition, models were evaluated by the proximity of the relation of observed vs. predicted values to a 1:1 correlation, reflected by a slope close to 1 and an intercept close to 0.
Multivariate data analysis was performed using Simca-P 14.1 (Sartorius Stedim Biotech, Umeå, Sweden,). Prior to analysis, data were scaled to unit variance. Principal component analysis (PCA) was performed, considering all columns of the table as X variables. PCA is a projection method resulting in dimensional reduction. The principal components, derived through projection, represent the new (latent) variables that summarize the information included in the initial set of descriptors. A PCA scores plot provides a useful data overview, which was also considered for the test set selection. Partial least squares analysis (PLS), a regression extension of PCA, was applied in order to construct prediction models. The variable selection was based on the values of the variable’s influence to projection (VIP), the weight (w) in the loadings plot, and the size of coefficients. From this perspective, variables with variables with VIP < 0.8, low weight in the loading plot or low coefficient. Moreover, between descriptors encoding the same information, those which performed better were chosen. It should be mentioned, however, that PLS, as a projection method, can tolerate intercorrelated variables, as compared to multiple linear regression. The predictive ability and robustness of the models was evaluated using cross-validation as an internal validation according to the seven-fold option of Simca-P. The sum of the squared differences between the measured response and the predicted value of the omitted data, defined as predicting residuals sum of squares (PRESS), is used to calculate the cross-validated correlation coefficient, Q2: Q2 = 1 − PRESS/SSY, with SSY representing the variation in Y after mean centering and scaling. Permutation tests (100 permutations) were applied by randomly re-ordering the response variables, and the newly derived R2 and Q2 were plotted against the degree of correlation between the permuted and the original data. The models were validated by external test sets, as already described for multiple linear regression. The root mean square error of prediction (RMSEP) was considered as an index of the predictability of the models. The applicability domain was defined using the double of the average value of the distance of observation to model Y (DModY) as the critical value.
4. Conclusions
In the present study, we combined bio-chromatographic retention factors, determined on IAM, has, and AGP stationary phases, with computed descriptors to develop MLR/PLS models for the rapid estimation of drug brain disposition. Our aim was to suggest a novel ‘hybrid’ modelling approach which combines high-throughput experimental accuracy with theoretical descriptors. At the same time, the well-investigated information content of the individual biomimetic properties [
24,
29,
44,
45] permitted a deeper insight into the underlying mechanisms of biological processes. In this sense, diverse aspects of drug disposition in the CNS were explored, including the modeling of experimental logK
p,uu,brain, logBB
, f
u,brain, and V
u,brain data, upon the application of two statistical techniques, considered as contributing in a complementary way. MLR led to simple models that are easy to use by the medicinal chemist, whereas PLS served to strengthen the MLR models and to further scrutinize the biological issues inherent in brain disposition measurements.
LogKp,uu,brain, as the outcome of contradictory factors related to permeability and tissue and plasma binding could not be efficiently modeled by biomimetic properties and lipophilicity, while computational descriptors alone were sufficient for model construction. Yet, PLS analysis revealed the greater effect of IAM, HSA, and AGP retention factors, as well as logD, if measured at pH 5.0, supporting the potential interference of the endo-lysosomic system in vitro brain penetration models. In regard to logBB, which measures permeability to CNS, traditional lipophilicity and IAM retention showed equal performance with positive contributions. However, in the cases of both logKb and logVubrain, which also depend on binding, IAM retention performed better than lipophilicity. More to the point, a lesser number of total descriptors was required in the corresponding PLS models, reflecting the higher information content of IAM retention. HSA retention was found to be important in logKb modeling, while AGP retention influenced all brain disposition data, confirming the role of basicity as an essential CNS-drug-like characteristic. Polarity and hydrogen bond descriptors proved crucial in all models, with an opposite effect in regard to biomimetic properties and lipophilicity.
In view of the above findings, biomimetic properties are well-justified in modeling distinct brain disposition data as they reflect the major factors which govern biological processes, e.g., passive diffusion and binding. More to the point, IAM retention performs equally well or better than traditional lipophilicity. Hence, biomimetic chromatography can be suggested as a rapid and simple tool for the early evaluation of CNS drug candidates, permitting the construction of evidence-based ‘hybrid’ models and bridging the gap with in silico modeling, built solely on theoretical descriptors.
The ‘hybrid ‘complementary models constructed in this study can be further applied to, and validated with, a wider range of pharmaceutical compounds. Combined with relevant plasma protein binding models also based on biomimetic properties [
45], they can serve as a sound basis for exploring the composite brain disposition end-points.