3.1. PTML-MLP Models
We found that
Model 1 had the notation MLP 15-45-2, which means that this model was based on a multilayer perceptron network with 15 nodes in the input layer (number of
D(
GTI)
cj descriptors present in the model), 45 neurons in the hidden layer, and two values of the categorical variable of inhibitory activity
IAi(
cj) were predicted in the output layer, i.e., active (
IAi(
cj) = 1) and inactive (
IAi(
cj) = –1). The same deduction can be made for
Model 2 whose notation was MLP 14-45-2. A summary of the
D(
GTI)
cj descriptors present in each PTML-MLP model appears in
Table 2, while information regarding the chemical and biological data used to build such models appears in
Supplementary Materials S1 and S2, respectively.
Here,
Model 1 correctly classified 6312 out of the 7283 cases/chemicals in the training set (internal quality), which means
Ac(%) = 86.67%. In the test set (predictive power),
Model 1 satisfactorily classified 2006 out of 2422 cases/chemicals, with
Ac(%) = 82.82%. In the case of
Model 2, similar results were achieved; this PTML-MLP model had
Ac(%) values of 86.49% (6299 out of the 7283 cases/chemicals were correctly classified) and 81.75% (1980 out of 2422 cases/chemicals were correctly classified) for the training and test sets, respectively. Moreover,
Table 3 shows that the two PTML-MLP models have high
Sp(%) values, surpassing 80% in both the training and test set. In addition, for comparison, in
Table 3 we have reported the values of the different statistical indices for the linear counterparts of our two PTML-MLP models. Such linear models were based on the technique known as linear discriminant analysis (LDA), and they respectively used the same variables (molecular descriptors of the type
D(
GTI)
cj) from which our two PTML-MLP models were constructed.
From the analysis of the classification results in
Table 3, it can be seen that each of our two PTML-MLP models outperforms its corresponding LDA counterpart. This suggests that the relationship between the measures of anti-PANC activity and the chemical structure of the molecules in the present dataset is modeled better with the use of non-linear machine learning algorithms, as in the case of our two PTML-MLP models.
Continuing with the results depicted in
Table 3 for our PTML-MLP models, we can observe that for the case statistical index
Sn(%), the values were higher than 80% in the training set while remaining above 75% in the test set. Moreover, in both PTML-MLP models, there is a strong convergence between the observed and the predicted values of the categorical variable of inhibitory activity
IAi(
cj) since the statistical metric
MCC is closer to 1 than to 0 (random classifier) or –1 (completely erroneous prediction). Altogether, the results from
Table 3 indicate that
Model 1 classified/predicted the current data slightly better than
Model 2; however,
Model 2, using one less
D(
GTI)
cj descriptor than
Model 1 achieved similar performance.
We went deeper and, thus, analyzed the local metrics derived from
Sn(%) and
Sp(%). In this sense,
Model 1 exhibited values in the interval 69.64–99.21% for (
Sn(%))
ma, (
Sn(%))
tg, (
Sn(%))
ei, (
Sp(%))
ma, (
Sp(%))
tg, and (
Sp(%))
ei in the training set. The only exception was (
Sn(%))
ei = 33.33% for the assay information labeled as “F (assay format)”. In the test set, the behavior of
Model 1 was very similar to the training set, as the aforementioned local metrics were in the range of 64.29–100%. In this test set,
Model 1 maintained the same exception as in the training set and added the PANC cell line MIA-PaCa-2 with (
Sn(%))
tg = 55.56% and the assay information labeled as “B (assay format)” with (
Sp(%))
ei = 57.14%. In the case of
Model 2, values were between 65.22% and 97.14% in the training set; a relatively low performance was associated with the PANC cell line PSN1, exhibiting (
Sn(%))
tg = 57.29% and the assay information annotated as “F (assay format)” displaying (
Sn(%))
ei = 44.44%. In the test set,
Model 2 achieved the interval 60.87–100% for the 6 local metrics mentioned above, with the exceptions being the labels of assay information “B (assay format)” with (
Sp(%))
ei = 53.97% and assay information labeled as “F (assay format)” with (
Sn(%))
ei = 38.46%. We would like to emphasize that most of the values reported in this work for (
Sn(%))
ma, (
Sn(%))
tg, (
Sn(%))
ei, (
Sp(%))
ma, (
Sp(%))
tg, and (
Sp(%))
ei were above 70%, which confirms the great capabilities of both PTML-MLP models to classify/predict complex biological data focused on anti-PANC activity by considering the 44 different experimental conditions,
cj (as depicted in
Table 1), employed in this study. All the details regarding the classification results of
Model 1 and
Model 2 can be found in
Supplementary Materials S3 and S4, respectively.
Regarding the reliability of the predictions, we determined the AD of both PTML-MLP models according to the descriptor space approach (
Supplementary Material S5), and, therefore, we calculated the total scores of the applicability domain (TSAD). For
Model 1, only the cases/chemicals with TSAD = 15 were considered to be within the AD, with the number fifteen being equivalent to the number of
D(
GTI)
cj descriptors present in
Model 1. Only 2 of the 9705 cases/chemicals were found to be outside the AD of
Model 1 (TSAD < 15) and they both belonged to the test set. For
Model 2, the ideal value TSAD = 14 indicated the ability of case/chemical to fall within the AD. In this model, only 8 of the 9705 cases/chemicals had TSAD < 14, thus remaining outside of the AD of
Model 2.
Last, we would like to highlight the limitations and strengths of our two PTML-MLP models. A key limitation of our PTML-MLP models is their inability to correctly predict the entire dataset employed in this work. This means that the D(GTI)cj descriptors used to build the PTML-MLP models cannot characterize the whole complexity and molecular diversity in the present dataset. Another limitation is the one associated with the machine learning algorithm since none of the current algorithms are capable of encoding enough accurate chemical information. At least one of these two aspects is generalized to all computational models reported in the scientific literature. Consequently, in a virtual screening scenario, our PTML-MLP models will perform accurate predictions to some extent.
In any case, the strengths of our PTML-MLP models outweigh the aforementioned limitation. First, our PTML-MLP models are the first two models reported to date that can predict anti-PANC activity by simultaneously considering different mechanisms of action and multiple experimental conditions (involving dissimilar assay protocols) on the inhibition of both PANC-related proteins and PANC cell lines. This highlights the potentialities and wide applications of the PTML methodology in integrating chemical and biological data in the context of oncology research [
24,
25,
26,
74,
75,
76,
77,
78,
79]. Second, our two PTML-MLP models are highly interpretable, physicochemically and structurally describing the data in terms of suitable molecular fragments (see
Section 3.2), which can be useful to both computational and experimental chemists when selecting 2D pharmacophores. Third, by interpreting our PTML-MLP models, it is possible to design new anti-PANC molecules (see
Section 3.3). Last, in contrast to most computational models, whose complex algorithms involve remarkable consumption of time and financial resources, our PTML-MLP models are cost-efficient; it only took two hours to create each of them when working on a portable computer (12 GB RAM).
3.2. Physicochemical and Structural Meanings of the Molecular Descriptors
Here, we are providing the physicochemical and structural interpretations of the
D(
GTI)
cj descriptors in the PTML-MLP models. To support such interpretations, we employed certain graphics that illustrated the significance of each
D(
GTI)
cj descriptor in the form of sensitivity values (SV) while extracting information on the propensities (increment or diminution) in the value of each
D(
GTI)
cj descriptor [
37,
63,
80]. Such propensities indicate how to vary the values of the
D(
GTI)
cj descriptors to enhance the desired activity, which, in our case, is the improvement of the inhibitory activity against both the PANC-related proteins and the PANC cell lines.
3.2.1. First PTML-MLP Model (Model 1)
The propensities of the
D(
GTI)
cj descriptors in
Model 1 are reported in
Table 4, while the relative importance of each of them (measured by
SV) is depicted in
Figure 1.
In
Model 1, we have six
D(
GTI)
cj descriptors derived from the so-called spectral moments of the bond adjacency matrix, which denote how much the physicochemical properties are distributed/concentrated throughout the chemical structure of a molecule [
44,
45,
46,
81,
82,
83]. These
D(
GTI)
cj descriptors are
DTI01,
DTI06,
DTI08,
DTI09,
DTI12, and
DTI13 and among the most significant descriptors in
Model 1, they rank third, ninth, fourteenth, seventh, eleventh, and eighth, respectively.
Three of these
D(
GTI)
cj descriptors are derived from the spectral moments and are focused on hydrophobicity, according to the Ghose–Crippen approach [
84]. Thus,
DTI01 characterizes the decrease of the hydrophobicity in regions containing three-membered rings and fragments where one atom is attached to three other non-hydrogen atoms. Therefore, the presence of urea, carbamate, and carboxamide groups, as well as portions with secondary alcohols and amines, methylene moieties, and aziridine and oxirane rings favor a decrease in the value of
DTI01. The same fragments mentioned in the case of
DTI01, will also desirably decrease the value of
DTI12; however, the latter of these
D(
GTI)
cj descriptors will impact a molecule more globally. Another hydrophobicity-based descriptor is
DTI06, which involves the increment of the hydrophobicity in regions in fragments containing seven bonds or less. The presence of three-membered rings and groups/moieties where an atom is bonded to four other non-hydrogen atoms (e.g., quaternary carbons, sulfonamides, and phosphorus-based functional groups, such as phosphine oxides, phosphonates, and phosphates) will increase the value of
DTI06. At the same time,
DTI08 describes the increase of the polar surface area, which can be achieved by increasing the number the functional groups (mainly based on nitrogen and oxygen atoms) capable of forming hydrogen bonds. On the other hand,
DTI09 implies the augmentation of electronegative atoms in three-membered rings and groups where one atom is attached to two other non-hydrogen atoms. The functional groups described by
DTI01 also favor
DTI09, but the latter characterizes the structure in terms of the distribution of partial charges, also including the increase of halogens (particularly fluor). The last
D(
GTI)
cj descriptor derived from the spectral moments is
DTI13, which gives information regarding the diminution of the global polarizability in the molecules, thus favoring the presence of nitrogen, oxygen, and fluorine atoms, while diminishing the presence of aromatic rings or alkene moieties, as well as the functional groups attached to them.
In
Model 1, we also have
D(
GTI)
cj descriptors derived from atom-based connectivity indices, and, therefore, these
D(
GTI)
cj descriptors measure the contribution of different fragments to the molecular accessibility [
85,
86], i.e., the ability of certain regions of a given molecule to interact with the surrounding chemical environment (e.g., with molecules such as water, amino acids in the cavity of a protein, and different components of the membrane PANC cell line). The
D(
GTI)
cj descriptors are
DTI02,
DTI10, and
DTI11, and, among the most influential, they ranked sixth, tenth, and thirteenth, respectively. Thus, in
Model 1,
DTI02 describes the increment of the molecular accessibility, mainly in linear (non-ramified) fragments containing at least one atom different from carbon, hydrogen, and boron. In contrast with
DTI02,
DTI10 measures the decrease of the average accessibility area of a molecule, which can be achieved by limiting the number of sulfur and halogen atoms (except fluor). In the case of
DTI11, this descriptor indicates that augmenting the accessibility (and, therefore, the number) of six-membered rings will play a favorable role in enhancing the inhibitory activity against both the PANC-related proteins and the PANC cell lines.
Last, we have
D(
GTI)
cj descriptors which encompass information content similar to that of the edge (bond) connectivity indices and, therefore, behave as measures of contribution to the molecular/molar volume [
50,
87,
88]. Such
D(
GTI)
cj descriptors under analysis are
DTI03,
DTI04,
DTI05,
DTI07,
DTI14, and
DTI15, and they have been considered as the fourth, fifth, twelfth, fifteenth, first, and second most significant descriptors in
Model 1, respectively. Thus,
DTI03 is a direct measure of the global volume of a molecule, expressing the decrease of this physicochemical property. This has important implications because diminishing the molecular volume means that a molecule will be able to better fit in the binding pocket of the PANC-related proteins while also allowing the molecule to be small enough to permeate the cellular membranes [
89]. Decreasing the number of atoms in a molecule will favorably decrease the value of
DTI03. A similar effect is expected to take place by diminishing the value of
DTI04. The only difference is that
DTI04 focuses on fragments formed by two bonds (without counting bond multiplicity). In the case of
DTI05, this expresses the decrease of the volume in regions containing six-membered rings. We should highlight that this descriptor constrains
DTI11, which was explained before. In the end, at least two six-membered rings can be part of the structure of a molecule, but these rings should have at least two substitutions while the presence of nitrogen atoms in these rings is a favorable factor. The descriptor
DTI07 characterizes the increase of the molecular volume due to the presence of five-membered rings. The two most important
D(
GTI)
cj descriptors (
DTI14 and
DTI15) indicate the increase of the average volume by decreasing the number of atoms in a molecule while increasing the number of regions without branching in that molecule. Particularly,
DTI14 measures this effect in five-bond linear fragments while
DTI15 focuses on six-bond fragments where the presence of a six-membered ring is also important.
3.2.2. Second PTML-MLP Model (Model 2)
As in the case of
Model 1, data regarding each
D(
GTI)
cj descriptor propensity and significance in
Model 2 can be found in
Table 5 and
Figure 2, respectively. The
D(
GTI)
cj descriptors used in
Model 2 were derived from the atom-based local stochastic quadratic indices [
54,
56,
90], which means that they characterize the distribution of different physicochemical properties of diverse fragments containing specific types of atoms [
37,
90]. In
Model 2, seven of the fourteen
D(
GTI)
cj descriptors are based on hydrophobicity, which indicates the paramount importance of this physicochemical property. These
D(
GTI)
cj descriptors are
DQI01,
DQI02,
DQI03,
DQI09,
DQI10,
DQI12, and
DQI14 and we would like to highlight that, these
D(
GTI)
cj descriptors consider the multiplication of the hydrophobic contribution of a specific type of atom and the hydrophobic contribution of other non-hydrogen atoms situated at the given topological distance (number of bonds between any two atoms without counting bond multiplicity). Hence, instead of referring to an atomic hydrophobicity, we will use the term “joint hydrophobic contribution”.
In terms of significance in
Model 2, the
D(
GTI)
cj descriptors
DQI01,
DQI02,
DQI03,
DQI09,
DQI10,
DQI12, and
DQI14 rank first, eighth, fourth, eleventh, twelfth, thirteenth, and second, respectively. By inspecting the descriptors
DQI01,
DQI09, and
DQI10, we can see that they express the decrease of the joint hydrophobic contribution of any two atoms (with at least one of them being a halogen) separated by a topological distance of four, one, and two, respectively. Considering that all the halogens have a positive value of hydrophobic contribution [
91], the atoms surrounding them must have a negative value of the hydrophobic contribution. Such atoms can be mainly primary and secondary carbons. Therefore, fragments such as 4-halocyclohexyl and 4-halobutyl will greatly and favorably diminish the values of
DQI01,
DQI09, and
DQI10, thus enhancing the inhibitory activity against the PANC-related proteins and the PANC cell lines. At the same time,
DQI02,
DQI03, and
DQI14 characterize an increase in the joint hydrophobic contribution of any two atoms, with one of them being a heteroatom (particularly N, O, S, or P). However, while
DQI02 and
DQI03 consider fragments where the heteroatom is at the topological distance of three and four (respectively) or less from any other non-hydrogen atom,
DQI14 exclusively depends on the presence and number of heteroatoms. In any case, fragments containing the ethylenediamine or glycine moieties will desirably increase the value of
DQI02, while fragments such as 1,3,5-triazin-2-amine, pyrimidin-2-amine, and heteroaliphatic rings (containing a nitrogen atom or being attached to it), as well as urea and carbamate, will do the same for
DQI03; all these fragments will increase the value of
DQI14. On the other hand, the descriptor
DQI12 indicates the increase of the joint hydrophobic contribution between a carbon atom from a methyl group and its adjacent atoms (the topological distance equal to one). This means that to increase the value of
DQI12, a carbon atom from a methyl group must be attached to a secondary (non-substituted) carbon or aminic nitrogen (resulting in a secondary amine).
In Model 2, we also have three D(GTI)cj descriptors containing electronic information. One of them is DQI04 (the ninth most influential), which characterizes the augmentation of the electronegativity of any two atoms (with one of them being a heteroatom) placed at the topological distance of two. Therefore, all the fragments mentioned for the descriptors DQI02 and DQI03 explained above cause a beneficial increase in the value of DQI04. In the case of DQI05 (the fifth most important), this describes the diminution of the polar surface area in any two adjacent heteroatoms. Because DQI05 directly depends on the presence of electron-rich heteroatoms, such as nitrogen and oxygen, to decrease the value of DQI05, two electron-rich heteroatoms should not be bonded. Moreover, in terms of electronic information, DQI08 expresses the increase of the polar surface area of any two heteroatoms placed at the topological distance of four or less. Therefore, all the fragments mentioned for the descriptor DQI03 (except for the heteroaliphatic rings) and in a lower degree, for DQI02, will increase the value of DQI08. Other groups containing sulfur (sulfonamides) and phosphorus (phosphates) will also increase the value of DQI08. This descriptor is the least significant in Model 2.
Steric factors play an essential role in enhancing the inhibitory activity of a chemical against the PANC-related proteins and the PANC cell lines. The D(GTI)cj descriptors accounting for the steric information are DQI06, DQI07, DQI11, and DQI13, and, regarding their influence in Model 2, they ranked third, seventh, tenth, and sixth, respectively. The descriptor DQI06 indicates the augmentation of the molecular size, mainly by increasing the number of aliphatic carbons but also by increasing the number of atoms (different from carbon) bonded to the aliphatic carbons. On the other hand, DQI07 means that the number of halogens must be decreased; if a halogen is present, iodine, bromine, and chlorine will be preferred over fluorine. In the case of DQI11, this characterizes the decrease of the atomic weight of any two atoms (with one of them being a halogen) placed at the topological distance of two. This means that to decrease the value of DQI11, the number of halogens should be reduced, and their chemical environments must be almost exclusively constituted by carbon atoms; halogens should be attached to aliphatic or aromatic rings or in the form of 2-haloethyl moieties. For DQI11, fluorine is strongly preferred over the other halogens. Last, we have DQI13, and its value can be increased by bonding carbons from methyl groups to low-polarizability atoms, such as nitrogen and oxygen. Therefore, the presence of methoxy and methylamino groups favor the increment of DQI13.
3.3. Designing Multi-Protein and Multi-Cell Inhibitors as Anti-PANC Agents
The guidelines applied to the design of new molecules from PTML models have been reported before [
25,
26,
28,
67,
80,
92]. Here, we used
Model 2 as a tool to design new molecules from the fragments present in
Figure 3.
There were two reasons for which we preferred
Model 2 over
Model 1. From one side, as mentioned in
Section 3.1,
Model 2, with one less descriptor, achieved a statistical performance comparable to that of
Model 1. On the other hand, from a physicochemical point of view,
Model 2 offers a simpler (but more detailed) explanation than
Model 1. The joint interpretation of the
D(
GTI)
cj descriptors in
Model 2 suggests the presence of three well-defined regions that seem to be essential when designing a molecule. First, there should be a hydrophilic region containing several heteroatoms that can interact through hydrogen bonds. The second region, situated in the center of the molecule, may also contain heteroatoms but these should be more dispersed, with any two heteroatoms separated at a distance of three bonds. In this region, aliphatic atoms may serve as “bridges” between any two heteroatoms. The third region is expected to be hydrophobic, particularly containing a 4-halophenyl or 4-halocyclohexyl moiety, which means that the halogen will be in the periphery of the molecule.
By combining the aforementioned joint interpretation with the inspection of the fragments in
Figure 3, we designed three structurally related molecules, which are depicted in
Figure 4.
The three designed molecules were predicted by
Model 1 and
Model 2 to be anti-PANC agents through the inhibition of the PANC-related proteins and PANC cell lines. However, there were differences in the predicted probabilities. For instance, for all the 44 experimental conditions
cj reported in this work, MPMCI-001 and MPMCI-003 were predicted by both PTML-MLP models with probability values close to 100%. However, in the case of MPMCI-002, this happened only with
Model 2. That does not mean that
Model 1 poorly predicted MPMCI-002. Notice that
Model 1 predicted MPMCI-002 to be an anti-PANC agent in 42 of the 44 experimental conditions, but the predicted probability values were lower when compared to those obtained for MPMCI-001 and MPMCI-003. This means that, in the structure of MPMCI-002, the methyl group from the methylamino moiety located between the two nitrogen atoms of the pyrimidine ring should be removed. The difference in the predicted probabilities also demonstrates the sensibility of the
D(
GTI)
cj descriptors in
Model 1, allowing this model to be used as a filter of
Model 2 to ensure the correct design of new molecules. In any case, the fact that the three designed molecules were predicted by both PTML-MLP models as multi-protein and multi-cell inhibitors, theoretically confirms the aforementioned molecules to be anti-PANC agents. Moreover, the three molecules were within the applicability domain of these models. All the results of the predictions (including the assessment of the applicability domain) performed by
Model 1 and
Model 2 for the design molecules can be found in
Supplementary Materials S6 and S7, respectively.
We searched for our designed molecules in different databases, such as ChEMBL [
41,
93], ZINC [
94], and eMolecules [
95], to check their novelty. In this sense, we used the similarity cutoff of 80%. No molecules similar to those designed by us were found. We would like to add that we also calculated several global physicochemical properties for the designed molecules (
Table 6), which allowed us to have an idea regarding their drug-likeness.
In doing so, the calculated physicochemical properties allowed the use of three approaches known as the Lipinski rule of five [
96], Ghose filter [
97], and Veber’s guidelines [
98]. We compared the values of each physicochemical property for each of the designed molecules with the cutoff values/intervals established by each of the aforementioned filters. Our designed molecules complied with all these criteria.