Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery

Hao, Yang; Li, Bo; Huang, Daiyun; Wu, Sijin; Wang, Tianjun; Fu, Lei; Liu, Xin

doi:10.3390/ijms25158239

Open AccessArticle

Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery

by

Yang Hao

^1,2,

Bo Li

^1,2

,

Daiyun Huang

^1,3,*

,

Sijin Wu

¹

,

Tianjun Wang

^1,2,

Lei Fu

¹

and

Xin Liu

^1,*

¹

Wisdom Lake Academy of Pharmacy, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China

²

Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZX, UK

³

School of Life Sciences, Fudan University, Shanghai 200092, China

^*

Authors to whom correspondence should be addressed.

Int. J. Mol. Sci. 2024, 25(15), 8239; https://doi.org/10.3390/ijms25158239 (registering DOI)

Submission received: 21 June 2024 / Revised: 26 July 2024 / Accepted: 26 July 2024 / Published: 28 July 2024

(This article belongs to the Section Molecular Informatics)

Download

Browse Figures

Versions Notes

Abstract

:

Multifactorial diseases demand therapeutics that can modulate multiple targets for enhanced safety and efficacy, yet the clinical approval of multitarget drugs remains rare. The integration of machine learning (ML) and deep learning (DL) in drug discovery has revolutionized virtual screening. This study investigates the synergy between ML/DL methodologies, molecular representations, and data augmentation strategies. Notably, we found that SVM can match or even surpass the performance of state-of-the-art DL methods. However, conventional data augmentation often involves a trade-off between the true positive rate and false positive rate. To address this, we introduce Negative-Augmented PU-bagging (NAPU-bagging) SVM, a novel semi-supervised learning framework. By leveraging ensemble SVM classifiers trained on resampled bags containing positive, negative, and unlabeled data, our approach is capable of managing false positive rates while maintaining high recall rates. We applied this method to the identification of multitarget-directed ligands (MTDLs), where high recall rates are critical for compiling a list of interaction candidate compounds. Case studies demonstrate that NAPU-bagging SVM can identify structurally novel MTDL hits for ALK-EGFR with favorable docking scores and binding modes, as well as pan-agonists for dopamine receptors. The NAPU-bagging SVM methodology should serve as a promising avenue to virtual screening, especially for the discovery of MTDLs.

Keywords:

Support Vector Machine (SVM); multitarget drug; PU-learning; virtual screening

1. Introduction

Drug discovery today mainly focuses on designing ligands with maximum selectivity to act on individual drug targets, with the aim of reducing the risk of off-target related side effects. However, many complex diseases, such as cancer [1,2], neurodegenerative diseases [3,4], cardiovascular diseases [5], and infection [6], involve multiple progression-relevant targets and/or signaling pathways. For these multifactorial diseases, relying on a single medicine to modulate the biological function of a solitary target might be inadequate to achieve satisfactory efficacy, despite the best research efforts. Hence, there is a growing interest in developing multitarget agents capable of simultaneously intervening in multiple receptors, with the goal of enhancing therapeutic efficacy and ensuring safety [7,8,9].

There are two approaches to multitarget therapeutics: drug combination, and multitarget-directed ligands (MTDLs) [10]. While the former approach may offer greater dose flexibility and lower clinical trial costs, it often grapples with treatment complexity, drug side effects, pharmacokinetic intricacies, and drug–drug interactions [11,12]. Conversely, well-designed and finely balanced MTDLs have the potential to circumvent these challenges, all the while capitalizing on the benefits of multitarget therapeutics, such as synergistic effects [13,14,15]. In essence, strategic combinations of targets can enable the use of lower drug doses to achieve sufficient therapeutic effects, as partial modulation of synergistically acting targets may be satisfactory for achieving full therapeutic efficacy. Consequently, this approach can mitigate drug side effects and widen therapeutic windows [16].

To date, several remarkable MTDLs have been identified, developed, and further approved for clinical use. A significant portion of these discoveries, such as the multitarget kinase inhibitor imatinib, occurred serendipitously, while others primarily stem from knowledge-based approaches in which scaffolds from different active molecules with known activity against a particular target are combined [17,18]. An alternative and potentially more practical strategy is the screening of natural compounds that can directly modulate multiple targets, offering an effective avenue for discovering MTDLs with improved bioavailability [19,20].

In recent years, numerous promising virtual screening methods based on machine learning (ML) or deep learning (DL) have been proposed [21,22,23,24], shaping the landscape of artificial intelligence in both single-target drug and MTDL discovery. Despite the promising performance that these methods have claimed, the design and choice of ML/DL architectures often appears arbitrary [25,26]. The existing literature provides limited comparisons on the superiority of traditional ML and DL algorithms, yielding inconclusive findings [27,28,29,30]. A recent impartial viewpoint suggests that no singular learning approach surpassed the others [27,28]. This also underscores, alongside the utilized modeling techniques, the significance of meticulous curation of training data [26,29] and the selection of molecular representations [30].

For any chosen ML or DL algorithm, the immediate and crucial next step is the selection of an appropriate molecular representation method, which largely contributes to the quality and efficiency of these compound activity prediction models. Many molecular descriptors and fingerprints have been derived from human expert knowledge to provide a comprehensive representation of the constitutional, physicochemical, topological, structural, and substructural features of molecules [31,32]. Furthermore, the utilization of neural network architectures for learning representations from graph-structured data stands as a cutting-edge strategy and has led to substantial enhancements in the learning of pharmaceutical and physicochemical properties [33]. Unfortunately, the evaluation of how to select the most suitable molecular representation method for the chosen ML/DL algorithm, or which combination of representation methods and ML/DL algorithms can achieve superior predictive performance, has rarely been systematically assessed in prior studies, leading to a lack of definitive answers.

In addition, the accuracy and robustness of these ML/DL-based compound activity prediction methods heavily rely on the training dataset, where both negative and positive samples play equally vital roles. Bioassay data obtained from the high-throughput screening are usually imbalanced, with a significantly larger number of active compounds compared to inactive compounds [34] (Supplementary Tables S1 and S2). Consequently, the scarcity of negative data can lead to unexpected high false positive rates [35]. To mitigate the issue induced by data sparsity, considerable efforts have been directed towards data augmentation, which mainly entails the random generation of negative samples from the entire chemical space [36,37] or from a selection of clustered regions [38]. However, the chemical space used for generating negative samples lacks experimental assay values (unlabeled data), potentially leading to the inclusion of active (positive) compounds in the generated negative samples, thereby compromising the credibility of the ML/DL prediction models.

In this context, positive unlabeled (PU) learning, a valuable semi-supervised learning algorithm that leverages unlabeled data with diverse features and potential actives, improves classification effectiveness in situations where negative samples are absent. It achieves this by enhancing the reliable extraction of negative samples from unlabeled pools, and it has shown promising performance in real-world applications within the domain of drug discovery [39,40,41]. PU learning offers the benefit of expanding the sample size and information reservoir, alleviating the impact of data sparsity and distribution shift, thus boosting the model’s capacity for generalization.

Based on PU learning, in this study, we aim to address three key questions through a series of experiments focusing on virtual screening of multitarget kinase inhibitors:

Do DL methods consistently outperform traditional ML methods, specifically identifying the best-performing ML/DL method for the compound activity prediction scenario?
Which molecular representation method(s) are most appropriate for the selected ML/DL method?
How can unlabeled or putative negative data be effectively utilized to enhance the true positive rate while keeping the false positive rate of virtual screening models within acceptable limits?

To address these questions, we meticulously compared various ML/DL methods, including Support Vector Machine (SVM) and ten prevalent neural network architectures, along with a range of molecular representation techniques encompassing fingerprints, physicochemical descriptors, and neural networks learning from molecular graphs. Following the evaluation, SVM and ECFP4 emerged as superior performers among the ML/DL and representation methods, prompting their selection as the baseline for further assessment of data augmentation strategies.

Subsequently, we delved into exploring different data augmentation strategies, leading to the development of a novel semi-supervised learning framework named Negative-Augmented PU-bagging (NAPU-bagging) SVM. Our comparative analyses revealed that conventional data augmentation techniques often entail a trade-off between the true positive rate and the false positive rate. In contrast, NAPU-bagging SVM is capable of enhancing the true positive rate without necessitating a sacrifice in the false positive rate. Recognizing its potential in virtual screening, particularly in the context of MTDL discovery, where a high true positive rate is crucial for identifying potential interactions between candidates of multiple targets, we effectively applied this framework. It was instrumental in identifying potential MTDLs targeting the epidermal growth factor receptor (EGFR) and ALK tyrosine kinase receptor (ALK) in non-small-cell lung cancer (NSLC), as well as pan-agonists for dopamine receptors (DRD1–DRD5) for the treatment of neurodegenerative diseases. The outcomes underscored the NAPU-bagging SVM’s proficiency in managing false positive rates in virtual screening while maintaining high recall rates, as well as its ability to unveil structurally novel hit molecules. In summary, the NAPU-bagging SVM presents a robust semi-supervised learning framework for active compound screening, thereby advancing the realm of poly-pharmacology therapy for complex diseases.

2. Results and Discussion

2.1. SVM Outperforms DL-Based Drug–Target Potency Prediction Models

To address the first question, i.e., to identify the best-performing ML/DL method for compound–protein interaction prediction, we initially selected eleven ML/DL methods for a comprehensive comparison, including SVM and ten other popular DL-based sequence-to-drug models. SVM was specifically chosen for its effectiveness in handling imbalanced data with its tunable hyperparameter misclassification penalty parameter

C

and maximum-margin hyperplane. Moreover, SVM has outperformed other traditional ML methods on diverse datasets in other similar evaluations [42]. Additionally, we chose SVM for its consistently superior performance in regression tasks [43], which opens the possibility of extending the interaction probability prediction model in this study to interaction activity prediction. Presently, over forty significant DL-based models for predicting drug–target potency have been developed and have demonstrated promising performance. Among these models, we selected ten, including DeepDTA [44], DeepConv-DTI [37], TransformerCPI [45], VQA-seq [46], MolTrans [47], GraphDTA [48], MGraphDTA [49], HyperAttentionDTI [50], DrugBAN [51], and TransformerCPI2.0 [52]. These models provide varied protein featurization, compound representation, and network frameworks for potency learning, enabling a comprehensive comparison.

In particular, SVM trained on Morgan fingerprints and the ten chosen DL-based models were applied to ChEMBL targets with a minimum of 680 compound entries containing bioactivity data (see Supplementary Table S3 for more details). This threshold was chosen to maintain 80% of the total sample for comparison, and to ensure statistical significance for each target test set size. The results showed that SVM demonstrated superior or at least comparable performance to the ten DL-based models (Figure 1). Therefore, SVM served as the baseline for the subsequent evaluation of molecular representation and data augmentation strategies.

2.2. ECFP4 Outperformed Other Compound Representation Methods

The efficacy of ML/DL-based compound–protein interaction prediction methods greatly hinges on the selection of input features. Therefore, for the chosen ML/DL method, the next crucial step is selecting an appropriate representation for the learned objects, particularly small molecules in this context. In this study, we considered 7 topological path-based fingerprints (including ECFP4, AtomPairFP, TorsionFP, RDkitFP, AvalonFP, MHFP, and MAP4), 2 pharmacophore-based fingerprints (including PharmacoErGFP and PharmacoPFP), 3 substructure-key SMARTS-based fingerprints (including PubChemFP, MACCSFP, and EstateFP), and a compilation of 13 constitutional, physicochemical, and topological descriptors (referred to as “descriptors” for simplicity). Furthermore, recent advancements in developing neural network architectures capable of learning representations from molecular graphs have demonstrated superiority over traditional methods relying on molecular descriptors and fingerprint features [33]. As a result, one model from this category, AttentiveFP, was also included in the assessment, using the same training and evaluation datasets.

Given the significant variance in positive-to-negative ratios across datasets (see Supplementary Table S4 for more details), direct comparisons of evaluation metrics such as false positive and true positive rates across datasets are not feasible. To address this, we calculated feature ranks for each evaluation dataset, which were then averaged to determine the final rank for each feature. Key evaluation metrics including accuracy, BEDROC, false positive rate, and true positive rate were considered. The ridge lines in Figure 2 illustrate the averaged ranks for each feature under each metric, with higher ranks indicating superior performance (see Supplementary Table S5). Notably, ECFP4 consistently ranked at or near the top across all four evaluation metrics. As a result, ECFP4 was utilized as the molecular representation method for all subsequent data augmentation strategies.

2.3. Data Augmentation Strategies for Constructing Semi-Supervised Learning Models

Developing ML/DL-based compound–protein interaction prediction models for virtual screening often faces challenges due to limited data availability on active and inactive molecules relevant to the target, as well as a lack of compound diversity covering the entire chemical space [53] (see Supplementary Tables S6 and S7). These data constraints can significantly hamper classification accuracy and generalization capabilities, particularly impacting the optimal classification boundaries, most notably for SVM algorithms. To address these limitations, we implemented diverse data augmentation strategies (Figure 3). To assess the effectiveness of these strategies, we conducted evaluations on six tumor-related targets, including ALK, MET, EGFR, MAPK1, FGFR1, and VEGFR1.

True positive and negative compounds relevant to these targets were utilized to establish the SVM model as the baseline. Furthermore, we integrated the generated putative negative sets for each target during training to implement Tao’s strategy [38]. The results were consistent with Tao’s findings [38], demonstrating the effectiveness of including putative negatives in reducing FPRs in virtual screening (Table 1). However, this strategy also introduces a trade-off by sacrificing a portion of the true positive rate. While this trade-off may be acceptable for single-target drug screening, it becomes more critical in multitarget drug screening scenarios, where the loss in the true positive rate from single-target screening can have a cascading effect and significantly impact the outcomes of multitarget screening, resulting in an extremely low yield. Therefore, a refined strategy is required to achieve an optimal balance, ensuring enhanced recall while effectively managing the FPR.

Tao’s strategy, which generates putative negative sets from molecular clusters without known active compounds, enhances the confidence in these putative negatives. However, clusters containing known active compounds may also include inactive compounds, leading to the presence of structurally similar compounds with vastly different activities, known as the “activity cliff” phenomenon. Therefore, building upon Tao’s strategy, we opted to utilize unlabeled data derived from clustered regions spanning the entire chemical space instead of using putative negative sets to improve the model’s recall rate. Additionally, clusters without known active compounds may still contain active compounds, suggesting that the credibility of negative sets generated from these clusters should be lower compared to true negative sets. To address this, we explored two strategies: first, assigning weights below 1 to the incorporated unlabeled data (0.5 in this study, referred to as the PN-weighted strategy); second, utilizing true positive and negative sets as the initial training set, where compounds predicted as negative in the unlabeled sets are iteratively included in subsequent training rounds until no new compounds are predicted as negative in the unbalanced sets, aiming to refine the boundary of the classifier (referred to as the PU-iterative strategy). The results (Table 1) suggest that these two strategies are notably effective in managing the false positive rate. However, there was an unexpected decrease in the recall rate compared to Tao’s strategy, possibly attributed to the unavoidable overfitting issue resulting from the substantial increase in negative samples in the training set.

Bagging-based classification has shown various advantages, such as preventing overfitting and reducing estimation variance, making it particularly valuable in the context of PU learning [54]. Therefore, to alleviate the issue of decreased recall caused by overfitting in both the PN-weighted and PU-iterative strategies, we explored an inductive bagging-based PU learning (PU-bagging) strategy. This strategy involves utilizing bootstrap aggregation to create multiple resampled bags from the training data, with each bag consisting of all true positive samples and randomly sampled unlabeled instances at a fixed ratio (1:1 in this study). An ensemble of SVM classifiers is trained on each bag of samples to distinguish positive samples from unlabeled ones. The final prediction is aggregated from the collective outputs of the ensemble of classifiers. As shown in Table 1, the PU-bagging SVM achieved significantly higher sensitivity/recall on the validation sets compared to previous strategies, consistent with observations in the existing literature [55,56]. This outcome further underscores the efficacy of the bagging technique in augmenting the recall/sensitivity of classifiers.

2.4. Developing a Novel Negative-Augmented PU-Bagging (NAPU-Bagging) SVM for Enhanced Virtual Screening

While enhancing the recall/sensitivity is a desirable trait for virtual screening, the PU-bagging SVM encountered a reduction in specificity, potentially impacting subsequent virtual screening processes by increasing the occurrence of false positives. Our hypothesis is that despite the bagging technique demonstrating heightened recall/sensitivity through mitigated overfitting by training ensemble classifiers on diverse samples [57], the presence of positive contamination in the unlabeled data may still introduce noise for each bootstrap classifier in establishing a decision boundary. Leveraging the available yet limited true negative samples for each target could potentially alleviate this issue. In order to maintain the elevated recall/sensitivity of the strategy while moderating its FPR to a satisfactory level, we propose a novel Negative-Augmented PU-bagging (NAPU-bagging) SVM approach. This approach enriches each bag by resampling true negative samples alongside the original true positive and unlabeled samples, following a predefined ratio (e.g., 1:0.2:0.8 for positive [P], negative [N], and unlabeled [U], respectively), and merging the negative samples with the unlabeled samples into a single class. Our refined data augmentation strategy ensures a controlled and substantial presence of “strong” negatives in each bag, aligning the feature distribution of unlabeled bags more closely with that of non-positive instances, promoting clearer class differentiation and reduced misclassification.

The classification boundary of the NAPU-bagging SVM was then optimized by adjusting the bag size and the sampling proportions of negative samples alongside the original positive and unlabeled samples, aiming for the best predictive performance. We trained NAPU-bagging classifiers with different P:N:U ratios by fixing P at 1.0 and tested their recall/sensitivity (SEN), specificity (SPE), F1-score, and BEDROC_80.5 metrics on validation sets for the six kinase targets (Figure 4). For optimal parameter selection, we adhered to the principle of achieving high sensitivity while maintaining sufficiently high specificity. Initially, we maintained the bag size consistent with the number of P samples and varied the N:U ratio from 0.0:1.0 (equivalent to the PU-bagging SVM, which only uses true positive and unlabeled samples) to 1.0:0.0, which revealed a clear trade-off between sensitivity and specificity—higher specificity accompanied by lower sensitivity was observed with more true negative samples used for augmentation in bagging (Figure 4A).

Subsequently, we adjusted the bag size by setting the bag size at two or five times the number of true positive samples, ensuring that it remained within the limits defined by the true negative sample size for each target. This modification allowed us to experiment with different N:U ratios. Interestingly, we found that this adjustment resulted in further increased specificity and reduced sensitivity compared to scenarios where the bag size was equal to the number of true positive samples (Figure 4B). Based on these observations, we concluded that setting the bag size equal to the number of true positive samples, along with establishing a P:N:U ratio of 1:0.4:4.5, achieved a well-balanced performance across all evaluation metrics (Table 1). Consequently, we determined that these parameters would serve as the final configuration for the predictive models of the six selected targets.

To further validate the virtual screening performance of the optimized models, we evaluated them using the 0.46M Enamine HLL-460 library as the virtual screening set. The results indicated that the models identified a very small percentage of compounds as virtual hits for each target: 12 (0.0026%), 25 (0.0054%), 29 (0.0063%), 336 (0.073%), 380 (0.0826%), and 668 (0.1452%) for MET, FGFR1, ALK, EGFR, VEGFR1, and MAPK1, respectively. It is important to note that even if all of these virtual hits were false positives, the maximum false hit rate would only range from 0.0025% to 0.1452%. Hence, we can confidently state that the proposed NAPU-bagging SVM approach is highly capable of efficiently searching extensive chemical libraries, with an exceptionally low false hit rate.

Nonetheless, we also observed diminished performance of NAPU-bagging SVM on certain tested targets, particularly MAPK1 and VEGFR1, especially in terms of F1-score (Table 1). This can be attributed to distinct factors. As depicted in Supplementary Tables S4 and S6, MAPK1 stands out as one of the kinases with extensive bioactivity data in available databases, especially regarding its negative compounds (14,454 compounds from 8687 distinct clusters). Given that NAPU-bagging SVM is primarily designed to address the prevalent issue of insufficient volume and/or diversity in negative data, which is not a problem for MAPK1, its lack of improvement or even diminished performance on MAPK1 is logically consistent. In the case of VEGFR1, although NAPU-bagging SVM slightly outperforms the baseline SVM model across all metrics, the unsatisfactory F1-score could be attributed to the very limited number of active samples in the test datasets, with a ratio of 31 positive to 8906 negative samples. This skewed distribution means that even a single misclassification of an active sample can significantly skew the evaluation metrics.

2.5. NAPU-Bagging SVM in Virtual Screening of Novel Dual-TKIs for NCSLC

Somatic activating mutations in EGFR and chromosomal rearrangements involving ALK are the leading oncogenic drivers in non-small-cell lung cancer (NSCLC) [58]. The treatment landscape for NSCLC has advanced significantly with the introduction of multiple EGFR and ALK tyrosine kinase inhibitors (TKIs) as highly effective agents. However, the emergence of acquired resistance poses a substantial challenge [48]. While combination therapies are an alternative approach to combat acquired resistance, they often encounter complexities related to differing bioavailability, pharmacokinetics, metabolism, and drug interactions [59]. Utilizing dual- or multitarget drugs, which integrate multiple biological actions in a single compound, enables the simultaneous targeting of multiple pathways. This simplifies treatment regimens, minimizes potential drug interactions [59], and therefore offers a more effective solution for acquired resistance with improved safety measures.

As of now, only one dual-TKI targeting EGFR and ALK, Brigatinib [60], has received clinical approval (see Supplementary Table S8 for its affinity data). This section aims to utilize the refined NAPU-bagging SVM models to identify new potential dual-TKIs for EGFR-ALK, offering an alternative therapeutic approach for NSCLC. Additionally, this case study serves to assess the effectiveness of the NAPU-bagging strategy in virtual screening for the discovery of MTDLs.

In pursuit of this, we first conducted a screening of the Therapeutic Target Database (TTD) to identify 100 approved TKIs, TKIs in clinical trials, and preclinical TKIs targeting EGFR, along with 18 for ALK (Supplementary Tables S9 and S10). The performance evaluation primarily focused on three metrics: recall, hit rate, and enrichment factor (EF). The results revealed recall rates of 86.4% and 55.4%, and EF1% of 32.98 and 77.13, for ALK and EGFR, respectively. In addition, Brigatinib, along with two other recently developed dual ALK-EGFR inhibitors (Compound 9j [61] and Compound 11 [62]), was successfully identified by our NAPU-bagging SVM among the hit ALK-EGFR dual-TKIs (Figure 5).

In evaluating the virtual screening performance of our NAPU-bagging SVM model in discovering new dual ALK-EGFR TKIs from large chemical libraries, we conducted a standard ultra-large virtual ligand screening (VLS) using the 0.46M compounds from the Enamine HLL-460 library. Separate screenings were carried out for both targets, resulting in 650 hits for ALK (Supplementary Table S11), 436 for EGFR (Supplementary Table S12), and 11 intersecting dual hits (Prob. > 0.75). To ensure that the chemical structures of the resulted hits aligned with favorable medicinal chemistry properties, we eliminated all hits containing patterns described by the PAINS [63], Dundee Rule [64], and BMS HTS Deck Filters [65] sets, which denote substructures that could be promiscuously reactive, mutagenic, or pharmacokinetically unfavorable. Additionally, all hits were required to meet Lipinski’s rule of five [66] and the Ghose criteria [67] for “drug-likeness”. This refinement process narrowed down the initial pool of 11 dual hits to 9 of them (Figure 6, Supplementary Table S13).

Subsequently, we conducted molecular docking to evaluate the binding energetics and modes between the identified dual hits and ALK and EGFR. Specifically, we extracted the kinase domain structures from the complex structures of ALK and EGFR with Brigatinib to perform docking simulations. As a result, all hits exhibited docking scores of less than −7.5 kcal/mol with both targets (Figure 6). Taking Z4607579043 as an example, the docking study revealed its favorable binding pose in both the ALK and EGFR ATP-binding pockets, with respective docking scores of −8.2 kcal/mol and −8.4 kcal/mol, indicating the potential of Z4607579043 as an active dual inhibitor. Comparing its binding mode to that of Brigatinib, Z4607579043 forms hydrogen bonds with the backbone oxygen and amino group of the conserved Met residues of ALK (M1199) and EGFR (M793) in their ATP-binding pockets, a key binding contribution for Brigatinib as a dual inhibitor (Figure 7). Furthermore, Z4607579043 also establishes hydrogen bonds with additional residues in the pockets, specifically G1269 in ALK and K745 and D855 in EGFR (Figure 7), further strengthening the binding affinity. The close proximity of the benzene ring of Z4607579043 to F723 in the EGFR pocket also suggests the possibility of a

π - π

interaction. For the remaining hits, see Supplementary Figure S1 for their selected docked poses. With the exception of one pair (Z1272282354 binding to EGFR), all exhibited a binding mode analogous to that of Brigatinib. Therefore, we posit that dual hits identified by NAPU-bagging SVM have promising potential as dual ALK-EGFR inhibitors following further optimization.

Worth highlighting is the fact that these dual hits exhibit structural similarities as low as 0.26 with known active TKIs targeting ALK or EGFR (Table 2), suggesting that our approach can detect structurally novel drug hits. These findings collectively demonstrate that our NAPU-bagging SVM method not only effectively identifies potential MTDLs from large compound libraries but also uncovers novel hit molecules in terms of structure.

2.6. NAPU-Bagging SVM for Discovering Novel Multitarget Agonists for Dopamine Receptors

Dopamine receptors play a crucial role as drug targets for a variety of central nervous system (CNS) disorders, including Parkinson’s disease [68], schizophrenia [69], and attention deficit hyperactivity disorder (ADHD) [70]. The activation of D1R-D5R dopamine receptors has emerged as a promising therapeutic approach for Parkinson’s disease [71]. However, the development of pan-agonists targeting multiple dopamine receptor family members poses significant challenges. Therefore, we further leveraged our NAPU-bagging SVM models to identify pan-agonists capable of activating D1R-D5R with reduced adverse effects, offering promising therapeutic potential for CNS disorders. This serves as an additional assessment of the broader applicability of this proposed novel strategy.

To achieve this, we proceeded with VLS using the Enamine HLL-460 library. The resulting hits (Supplementary Tables S14–S18) underwent the same rigorous filtration for unwanted patterns and assessment for drug-likeness as was employed in the discovery of dual ALK-EGFR TKIs (Supplementary Table S19). Ultimately, this meticulous process led to the identification of four pan-hits, as depicted in Figure 8. Subsequent molecular docking analysis revealed that three out of the four pan-hits formed tight interactions with all D1R-D5R targets (Figure 8). Notably, the maximum Tanimoto similarity between the active small molecules in the DRD1-DRD5 training set and the four selected active agonists ranged from 0.269 to 0.614 (Table 3). These results once again underscore the effectiveness of our proposed NAPU-bagging SVM approach in the virtual screening and discovery of MTDLs, particularly novel ones.

In a further endeavor to evaluate the predictive efficacy of NAPU-bagging SVM, we examined the activity of rotigotine, the only FDA-approved pan-agonist for dopamine receptors (D1R to D5R) [72]. Remarkably, the NAPU-bagging SVM demonstrated commendable predictive efficacy for four of the five targets (predicted probabilities ranging from 0.754 to 0.996), with the one exception being DRD5 (predicted probability = 0.216). It is important to acknowledge that the limited availability of training data for DRD5 (consisting of only 155 positive samples and 93 negative samples for this specific target) may have impacted the comprehensive understanding of the intricate chemical relationships governing the prediction of drug–target bioactivities. However, in the validation of multitarget drug activity experiments, priority is still given to molecules that exhibit superior activity across most protein subtypes of dopamine receptors. This approach is preferred because such molecules have the potential for subsequent optimization into lead compounds with enhanced efficacy and safety profiles through further molecular modifications.

3. Materials and Methods

3.1. Datasets for Training, Evaluation, and Virtual Screening

Training datasets: Bioactivity data for five tyrosine kinases (ALK, EGFR, FGFR1, MET, and VEGFR1) and one serine/threonine protein kinase (MAPK1) were collected from the ChEMBL database (v29). Their binding affinities were represented using pChEMBL values, which are the negative logarithms of various half-maximal response measurement values (IC50, XC50, EC50, AC50, Ki, Kd, ED50, and potency). Consistent with most compound–protein interaction prediction studies [52], a mixture of these different measurements were used, and a common set of cut-offs were used to partition the data into binary labels. Since there is no unified gold standard for defining actives and inactives, we followed existing studies [73] in defining those with pChEMBL values greater than 7.5(7) as positives and those with values less than 6.5(6) as negatives for binary classifier training of kinase and GPCR, respectively. The compounds with pChEMBL values ranging from 6.5(6) to 7.5(7) were excluded to mitigate the effects of experimental errors [74] and potential mismatches between different measurements.

Evaluation datasets: We curated the bioactivity data of the same targets from ChEMBL v32, ExCAPE-DB and DrugBank (version 5.1.10). To prevent the potential biases from data leakage, all training data were excluded to form the external validation datasets. In addition, the Enamine Hit Locator Library (HLL-460), featuring 0.46 million diverse compounds with high MedChem tractability, was utilized for virtual screening of MTDLs. Notably, for virtual screening purposes, the training and validation datasets were merged to retrain the model, ensuring comprehensive coverage of diversity in the screening process.

3.2. Data Pools for Augmentation Strategy

To address the class imbalance in the training datasets, two data pools were generated from the PubChem database. Using k-means clustering with PubChem 881 bits fingerprints as input features, a total of 75,745 molecule clusters were formed from the massive pool of 112,428,719 compounds available in PubChem. For each chosen target, a single molecule was randomly from each cluster that did not have a known inhibitor to form the putative negative set. Additionally, a single representative molecule was chosen from each cluster to form the unlabeled set. To mitigate the parameter tuning complexity of SVM, and to keep the training time within reasonable limits, a random subset of up to 50,000 molecules was sampled from both the putative negative and unlabeled sets. The putative negative set served as negative training examples, while the unlabeled set was incorporated into the PU learning framework to enhance the model generalization.

3.3. Molecular Representations

We systematically evaluated 14 compound representations in ligand-based virtual screening tasks including 7 sets of topological path-based fingerprints (namely, ECFP4, AtomPairFP, TorsionFP, RDkitFP, AvalonFP, MHFP6, and MAP4), 2 sets of pharmacophore-based fingerprints (PharmacoErGFP and PharmacoPFP), and 3 sets of substructure-key SMARTS-based fingerprints (PubChemFP, MACCSFP, and EstateFP). Thirteen classes of constitutional, physicochemical, and topological descriptors, including Autocorr, InfoContent, Topology, Path, Connectivity, Kappa, Estate, Charge, Matrix, Fragment, Property, Constitution, and MOE (Molecular Operating Environment), were merged as one representation of 1456 descriptors. All of these predefined representations were calculated using Python scripts with the RDKit package (https://www.rdkit.org/, accessed on 1 February 2024).

In addition to these fingerprints and molecular descriptors designed by chemists, we also considered learnable graph representation of compounds, which is widely adopted in deep learning methods, where atoms were converted to nodes and chemical bonds were converted to edges. In this study, one of the state-of-the-art methods, AttentiveFP, was selected for comparison with other representations.

3.4. Support Vector Machine (SVM)

The theoretical aspects of SVM have been extensively discussed elsewhere [75]. SVM operates by identifying a hyperplane that maximizes the margin between classes, thereby improving the generalization performance. It is capable of handling both linearly separable and non-linearly separable data by employing diverse kernel functions to map the input data into a higher-dimensional space. In this study, the SVM model was specifically trained to optimize the parameters of the c-Support Vector Classification (c-SVC) with the Radial Basis Function (RBF) kernel. All SVM models were implemented using Python with the scikit-learn package.

3.5. DL-Based Sequence-to-Drug Models

To enable a comprehensive comparison between ML and DL methods for drug–target potency prediction, this study incorporated ten prominent DL-based sequence-to-drug models, including DeepDTA [44], DeepConv-DTI [37], TransformerCPI [45], VQA-seq [46], MolTrans [47], GraphDTA [48], MGraphDTA [49], HyperAttentionDTI [50], DrugBAN [51], and TransformerCPI2.0 [52]. These models offer varied protein featurization, compound representation, and network frameworks for potency learning. The suite of models was faithfully re-implemented using the cutting-edge DL frameworks PyTorch 2.2 and PyTorch Lightning. The only modification from the original models was the utilization of up-to-date data to harness the expanding repository of bioassay information.

3.6. Molecular Docking

The crystal structures of ALK (PDB id: 6MX8) and EGFR (PDB id: 7ZYM), along with the cryo-EM structures of the D1R-Gs-rotigotine complex (PDB id: 8IRR), D2R-Gi-rotigotine complex (PDB id: 8IRS), D3R-Gi-rotigotine complex (PDB id: 8IRT), D4R-Gi-rotigotine complex (PDB id: 8IRU), and D5R-Gs-rotigotine complex (PDB id: 8IRV), were retrieved from the Protein Data Bank (PDB) and processed using prepare_receptor4 and prepare_ligand4 within AutoDockTools. Prior to molecular docking, the proteins were protonated, and water molecules were removed. During the molecular docking procedure, the receptors were prepared by adding polar hydrogens. We defined the binding sites with grid boxes measuring approximately 20 Å × 20 Å × 20 Å, centered on the crystal ligand. The docking simulations were performed using AutoDock Vina with default settings.

4. Conclusions

Drug discovery has traditionally emphasized designing ligands with high selectivity for individual drug targets to mitigate off-target side effects. However, the intricate nature of diseases such as cancer, neurodegenerative disorders, cardiovascular conditions, and infections often involves multiple targets and signaling pathways. In response to this complexity, the development of multitarget agents has gained traction to enhance therapeutic outcomes and safety by simultaneously intervening in multiple receptors. Two primary approaches to multitarget therapeutics are drug combinations and multitarget-directed ligands (MTDLs). While drug combinations offer flexibility in dosing and lower clinical trial costs, they can introduce treatment complexities and adverse effects. In contrast, well-designed MTDLs have shown potential to address these challenges by targeting multiple receptors synergistically to maximize therapeutic benefits while minimizing side effects.

The emergence of machine learning (ML) and deep learning (DL) methods in drug discovery, particularly in MTDL development, has opened new avenues for optimizing virtual screening processes. This study focused on exploring the synergy between ML/DL methodologies and molecular representations, alongside an optimized data augmentation strategy, to enhance the screening of MTDLs. Through meticulous experimentation, we have developed a novel semi-supervised learning framework known as Negative-Augmented PU-bagging (NAPU-bagging) SVM, which serves as a robust tool for managing false positive rates in virtual screening while ensuring high recall rates. This innovative approach involves utilizing bootstrap aggregation to create multiple resampled bags from the training data, where each bag consists of all true positive samples, along with randomly sampled true negative and unlabeled instances at an optimized ratio. An ensemble of SVM classifiers is trained on each bag of samples to distinguish positive samples from true negative and unlabeled ones. The final prediction is aggregated from the collective outputs of the ensemble of classifiers.

The results of this study not only showcase the effectiveness of the NAPU-bagging SVM approach in identifying potential MTDLs from extensive compound libraries but also highlight its capacity to uncover structurally novel hit molecules. This innovative methodology has the potential to transform drug discovery efforts by enhancing the accuracy and efficiency of virtual screening processes. As we navigate the intricate landscape of drug development, the integration of advanced machine learning techniques like NAPU-bagging SVM offers a promising pathway to discover novel therapeutics with enhanced efficacy and safety profiles.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms25158239/s1.

Author Contributions

Conceptualization, X.L. and Y.H.; methodology, Y.H. and B.L.; validation, Y.H. and T.W.; formal analysis, Y.H., B.L. and S.W.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, B.L., X.L. and D.H.; visualization, L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China [32300552], Jiangsu Science and Technology Program [BK20200253, BK20230279], SIP High-Quality Innovation Platform for Chronic Diseases [YZCXPT2022203], and XJTLU Research Development Fund [RDF-22-01-073].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying this article are available in the article and in its online supplementary materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fu, R.G.; Sun, Y.; Sheng, W.B.; Liao, D.F. Designing multi-targeted agents: An emerging anticancer drug discovery paradigm. Eur. J. Med. Chem. 2017, 136, 195–211. [Google Scholar] [CrossRef] [PubMed]
Tao, L.; Zhu, F.; Xu, F.; Chen, Z.; Jiang, Y.Y.; Chen, Y.Z. Co-targeting cancer drug escape pathways confers clinical advantage for multi-target anticancer drugs. Pharmacol. Res. 2015, 102, 123–131. [Google Scholar] [CrossRef] [PubMed]
Ibrahim, M.M.; Gabr, M.T. Multitarget therapeutic strategies for Alzheimer’s disease. Neural Regen. Res. 2019, 14, 437–440. [Google Scholar] [CrossRef] [PubMed]
Nozal, V.; García-Rubia, A.; Cuevas, E.P.; Pérez, C.; Tosat-Bitrián, C.; Bartolomé, F.; Carro, E.; Ramírez, D.; Palomo, V.; Martínez, A. From kinase inhibitors to multitarget ligands as powerful drug leads for Alzheimer’s disease using protein-templated synthesis. Angew. Chem. Int. Ed. Engl. 2021, 60, 19344–19354. [Google Scholar] [CrossRef]
Caruso, L.; Nadur, N.F.; da Fonseca, M.B.; Peixoto Ferreira, L.A.; Lacerda, R.B.; Graebin, C.S.; Kummerle, A.E. The design of multi-target drugs to treat cardiovascular diseases: Two (or more) birds on one stone. Curr. Top. Med. Chem. 2022, 22, 366–394. [Google Scholar] [CrossRef] [PubMed]
East, S.P.; Silver, L.L. Multitarget ligands in antibacterial research: Progress and opportunities. Expert. Opin. Drug Discov. 2013, 8, 143–156. [Google Scholar] [CrossRef]
Tan, L.; Zhang, J.; Wang, Y.; Wang, X.; Wang, Y.; Zhang, Z.; Shuai, W.; Wang, G.; Chen, J.; Wang, C.; et al. Development of dual inhibitors targeting epidermal growth factor receptor in cancer therapy. J. Med. Chem. 2022, 65, 5149–5183. [Google Scholar] [CrossRef] [PubMed]
Wu, Q.; Zhen, Y.; Shi, L.; Vu, P.; Greninger, P.; Adil, R.; Merritt, J.; Egan, R.; Wu, M.J.; Yin, X.; et al. EGFR inhibition potentiates FGFR inhibitor therapy and overcomes resistance in FGFR2 fusion-positive cholangiocarcinoma. Cancer Discov. 2022, 12, 1378–1395. [Google Scholar] [CrossRef]
Proschak, E.; Stark, H.; Merk, D. Polypharmacology by Design: A Medicinal Chemist’s Perspective on multitargeting compounds. J. Med. Chem. 2019, 62, 420–444. [Google Scholar] [CrossRef]
Makhoba, X.H.; Viegas, C., Jr.; Mosa, R.A.; Viegas, F.P.D.; Pooe, O.J. Potential Impact of the multi-target drug approach in the treatment of some complex diseases. Drug Des. Devel Ther. 2020, 14, 3235–3249. [Google Scholar] [CrossRef]
Lopez, J.S.; Banerji, U. Combine and conquer: Challenges for targeted therapy combinations in early phase trials. Nat. Rev. Clin. Oncol. 2017, 14, 57–66. [Google Scholar] [CrossRef] [PubMed]
He, B.; Lu, C.; Zheng, G.; He, X.; Wang, M.; Chen, G.; Zhang, G.; Lu, A. Combination therapeutics in complex diseases. J. Cell Mol. Med. 2016, 20, 2231–2240. [Google Scholar] [CrossRef]
Singh, A.K.; Kumar, A.; Singh, H.; Sonawane, P.; Paliwal, H.; Thareja, S.; Pathak, P.; Grishina, M.; Jaremko, M.; Emwas, A.H.; et al. Concept of hybrid drugs and recent advancements in anticancer hybrids. Pharmaceuticals 2022, 15, 1071. [Google Scholar] [CrossRef] [PubMed]
Szumilak, M.; Wiktorowska-Owczarek, A.; Stanczak, A. Hybrid drugs-a strategy for overcoming anticancer drug resistance? Molecules 2021, 26, 2601. [Google Scholar] [CrossRef]
Sang, Z.; Wang, K.; Dong, J.; Tang, L. Alzheimer’s disease: Updated multi-targets therapeutics are in clinical and in progress. Eur. J. Med. Chem. 2022, 238, 114464. [Google Scholar] [CrossRef] [PubMed]
Doostmohammadi, A.; Jooya, H.; Ghorbanian, K.; Gohari, S.; Dadashpour, M. Potentials and future perspectives of multi-target drugs in cancer treatment: The next generation anti-cancer agents. Cell Commun. Signal 2024, 22, 228. [Google Scholar] [CrossRef]
Ma, X.H.; Shi, Z.; Tan, C.; Jiang, Y.; Go, M.L.; Low, B.C.; Chen, Y.Z. In-silico approaches to multi-target drug discovery: Computer aided multi-target drug design, multi-target virtual screening. Pharm. Res. 2010, 27, 739–749. [Google Scholar] [CrossRef]
Liu, K.; Chen, X.; Ren, Y.; Liu, C.; Lv, T.; Liu, Y.n.; Zhang, Y. Multi-target-based polypharmacology prediction (mTPP): An approach using virtual screening and machine learning for multi-target drug discovery. Chem.-Biol. Interact. 2022, 368, 110239. [Google Scholar] [CrossRef]
Ambure, P.; Bhat, J.; Puzyn, T.; Roy, K. Identifying natural compounds as multi-target-directed ligands against Alzheimer’s disease: An in silico approach. J. Biomol. Struct. Dyn. 2019, 37, 1282–1306. [Google Scholar] [CrossRef] [PubMed]
Srivastava, S.; Sharma, S.; Deep, S.; Khare, S.K. Screening of multitarget-directed natural compounds as drug candidates for Alzheimer’s disease using in silico techniques: Their extraction and in vitro validation. ACS Omega 2023, 8, 38118–38129. [Google Scholar] [CrossRef]
Chen, X.; Xie, W.; Yang, Y.; Hua, Y.; Xing, G.; Liang, L.; Deng, C.; Wang, Y.; Fan, Y.; Liu, H.; et al. Discovery of dual FGFR4 and EGFR inhibitors by machine learning and biological evaluation. J. Chem. Inf. Model. 2020, 60, 4640–4652. [Google Scholar] [CrossRef] [PubMed]
Xing, G.; Liang, L.; Deng, C.; Hua, Y.; Chen, X.; Yang, Y.; Liu, H.; Lu, T.; Chen, Y.; Zhang, Y. Activity prediction of small molecule inhibitors for antirheumatoid arthritis targets based on artificial intelligence. ACS Comb. Sci. 2020, 22, 873–886. [Google Scholar] [CrossRef]
Xiong, Z.; Jeon, M.; Allaway, R.J.; Kang, J.; Park, D.; Lee, J.; Jeon, H.; Ko, M.; Jiang, H.; Zheng, M.; et al. Crowdsourced identification of multi-target kinase inhibitors for RET- and TAU-based disease: The multi-targeting drug DREAM challenge. PLoS Comput. Biol. 2021, 17, e1009302. [Google Scholar] [CrossRef] [PubMed]
Yu, Y.; Xu, T.; Li, J.; Qiu, Y.; Rong, Y.; Gong, Z.; Cheng, X.; Dong, L.; Liu, W.; Li, J.; et al. A novel scalarized scaffold hopping algorithm with graph-based variational autoencoder for discovery of JAK1 inhibitors. ACS Omega 2021, 6, 22945–22954. [Google Scholar] [CrossRef]
van Tilborg, D.; Alenicheva, A.; Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 2022, 62, 5938–5951. [Google Scholar] [CrossRef] [PubMed]
Dara, S.; Dhamercherla, S.; Jadav, S.S.; Babu, C.M.; Ahsan, M.J. Machine learning in drug discovery: A review. Artif. Intell. Rev. 2022, 55, 1947–1999. [Google Scholar] [CrossRef] [PubMed]
Cichońska, A.; Ravikumar, B.; Rahman, R. AI for targeted polypharmacology: The next frontier in drug discovery. Curr. Opin. Struct. Biol. 2024, 84, 102771. [Google Scholar] [CrossRef]
Cichońska, A.; Ravikumar, B.; Allaway, R.J.; Wan, F.; Park, S.; Isayev, O.; Li, S.; Mason, M.; Lamb, A.; Tanoli, Z.; et al. Crowdsourced mapping of unexplored target space of kinase inhibitors. Nat. Commun. 2021, 12, 3307. [Google Scholar] [CrossRef]
Gupta, R.; Srivastava, D.; Sahu, M.; Tiwari, S.; Ambasta, R.K.; Kumar, P. Artificial intelligence to deep learning: Machine intelligence approach for drug discovery. Mol. Divers. 2021, 25, 1315–1360. [Google Scholar] [CrossRef] [PubMed]
Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388. [Google Scholar] [CrossRef] [PubMed]
Moriwaki, H.; Tian, Y.S.; Kawashita, N.; Takagi, T. Mordred: A molecular descriptor calculator. J. Cheminform. 2018, 10, 4. [Google Scholar] [CrossRef]
Shen, W.X.; Zeng, X.; Zhu, F.; Wang, Y.l.; Qin, C.; Tan, Y.; Jiang, Y.Y.; Chen, Y.Z. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nat. Mach. Intell. 2021, 3, 334–343. [Google Scholar] [CrossRef]
Xiong, Z.; Wang, D.; Liu, X.; Zhong, F.; Wan, X.; Li, X.; Li, Z.; Luo, X.; Chen, K.; Jiang, H.; et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 2020, 63, 8749–8760. [Google Scholar] [CrossRef] [PubMed]
Korkmaz, S. Deep learning-based imbalanced data classification for drug discovery. J. Chem. Inf. Model. 2020, 60, 4180–4190. [Google Scholar] [CrossRef] [PubMed]
Zhang, W.; Pei, J.; Lai, L. Computational multitarget drug design. J. Chem. Inf. Model. 2017, 57, 403–412. [Google Scholar] [CrossRef]
Thafar, M.A.; Olayan, R.S.; Ashoor, H.; Albaradei, S.; Bajic, V.B.; Gao, X.; Gojobori, T.; Essack, M. DTiGEMS+: Drug-target interaction prediction using graph embedding, graph mining, and similarity-based techniques. J. Cheminform. 2020, 12, 44. [Google Scholar] [CrossRef] [PubMed]
Lee, I.; Keum, J.; Nam, H. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput. Biol. 2019, 15, e1007129. [Google Scholar] [CrossRef] [PubMed]
Xiao, T.; Qi, X.; Chen, Y.; Jiang, Y. Development of ligand-based big data deep neural network models for virtual screening of large compound libraries. Mol. Inform. 2018, 37, e1800031. [Google Scholar] [CrossRef] [PubMed]
Zheng, Y.; Peng, H.; Zhang, X.; Zhao, Z.; Gao, X.; Li, J. DDI-PULearn: A positive-unlabeled learning method for large-scale prediction of drug-drug interactions. BMC Bioinform. 2019, 20, 661. [Google Scholar] [CrossRef]
Bi, X.; Ma, H.; Li, J.; Ma, Y.; Chen, D. A positive and unlabeled learning frameworkbased on extreme learning machine for drug-drug interactions discovery. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 1–12. [Google Scholar] [CrossRef]
Lan, W.; Wang, J.; Li, M.; Liu, J.; Li, Y.; Wu, F.X.; Pan, Y. Predicting drug–target interaction using positive-unlabeled learning. Neurocomputing 2016, 206, 50–57. [Google Scholar] [CrossRef]
Korotcov, A.; Tkachenko, V.; Russo, D.P.; Ekins, S. Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol. Pharm. 2017, 14, 4462–4475. [Google Scholar] [CrossRef] [PubMed]
Jiang, D.; Wu, Z.; Hsieh, C.Y.; Chen, G.; Liao, B.; Wang, Z.; Shen, C.; Cao, D.; Wu, J.; Hou, T. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminform. 2021, 13, 12. [Google Scholar] [CrossRef] [PubMed]
Ozturk, H.; Ozgur, A.; Ozkirimli, E. DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics 2018, 34, i821–i829. [Google Scholar] [CrossRef] [PubMed]
Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; Zheng, M. TransformerCPI: Improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2020, 36, 4406–4414. [Google Scholar] [CrossRef]
Zheng, S.; Li, Y.; Chen, S.; Xu, J.; Yang, Y. Predicting drug–protein interaction using quasi-visual questionanswering system. Nat. Mach. Intell. 2020, 2, 134–140. [Google Scholar] [CrossRef]
Huang, K.; Xiao, C.; Glass, L.M.; Sun, J. MolTrans: Molecular Interaction Transformer for drug-target interaction prediction. Bioinformatics 2021, 37, 830–836. [Google Scholar] [CrossRef]
Nguyen, T.; Le, H.; Quinn, T.P.; Nguyen, T.; Le, T.D.; Venkatesh, S. GraphDTA: Predicting drug-target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140–1147. [Google Scholar] [CrossRef]
Yang, Z.; Zhong, W.; Zhao, L.; Yu-Chian Chen, C. MGraphDTA: Deep multiscale graph neural network for explainable drug-target binding affinity prediction. Chem. Sci. 2022, 13, 816–833. [Google Scholar] [CrossRef]
Zhao, Q.; Zhao, H.; Zheng, K.; Wang, J. HyperAttentionDTI: Improving drug-protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 2022, 38, 655–662. [Google Scholar] [CrossRef]
Bai, P.; Miljković, F.; John, B.; Lu, H. Interpretable bilinear attention network with domain adaptation improves drug–target prediction. Nat. Mach. Intell. 2023, 5, 126–136. [Google Scholar] [CrossRef]
Chen, L.; Fan, Z.; Chang, J.; Yang, R.; Hou, H.; Guo, H.; Zhang, Y.; Yang, T.; Zhou, C.; Sui, Q.; et al. Sequence-based drug design as a concept in computational drug design. Nat. Commun. 2023, 14, 4217. [Google Scholar] [CrossRef] [PubMed]
Lin, X.; Xu, C.; Xiong, Z.; Zhang, X.; Ni, N.; Ni, B.; Chang, J.; Pan, R.; Wang, Z.; Yu, F.; et al. PanGu drug model: Learn a molecule like a human. Sci. China Life Sci. 2023, 66, 879–882. [Google Scholar] [CrossRef]
Mordelet, F.; Vert, J.-P. A bagging SVM to learn from positive and unlabeled examples. Pattern Recognit. Lett. 2010, 37, 201–209. [Google Scholar] [CrossRef]
Zhao, L.; Ma, H.; Dong, J.; Wu, X.; Xu, H.; Niu, R. A comparative study of landslide susceptibility mapping using bagging PU learning in class-prior probability shift datasets. Remote Sens. 2023, 15, 5547. [Google Scholar] [CrossRef]
Pham, B.T.; Tien Bui, D.; Prakash, I. Bagging based support vector machines for spatial prediction of landslides. Environ. Earth Sci. 2018, 77, 146. [Google Scholar] [CrossRef]
Dudoit, S.; Fridlyand, J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003, 19, 1090–1099. [Google Scholar] [CrossRef]
Cooper, A.J.; Sequist, L.V.; Lin, J.J. Third-generation EGFR and ALK inhibitors: Mechanisms of resistance and management. Nat. Rev. Clin. Oncol. 2022, 19, 499–514. [Google Scholar] [CrossRef]
Raghavendra, N.M.; Pingili, D.; Kadasi, S.; Mettu, A.; Prasad, S.V.U.M. Dual or multi-targeting inhibitors: The next generation anticancer agents. Eur. J. Med. Chem. 2018, 143, 1277–1300. [Google Scholar] [CrossRef]
Abourehab, M.A.S.; Alqahtani, A.M.; Youssif, B.G.M.; Gouda, A.M. Globally approved EGFR inhibitors: Insights into their syntheses, target kinases, biological activities, receptor interactions, and metabolism. Molecules 2021, 26, 6677. [Google Scholar] [CrossRef]
Fan, Y.; Li, W.; Nie, W.; Yao, H.; Ren, Y.; Wang, M.; Nie, H.; Gu, C.; Liu, J.; An, B. Novel dual-target kinase inhibitors of EGFR and ALK were designed, synthesized, and induced cell apoptosis in non-small cell lung cancer. Molecules 2023, 28, 2006. [Google Scholar] [CrossRef] [PubMed]
Hu, L.; Fan, M.; Shi, S.; Song, X.; Wang, F.; He, H.; Qi, B. Dual target inhibitors based on EGFR: Promising anticancer agents for the treatment of cancers (2017-). Eur. J. Med. Chem. 2022, 227, 113963. [Google Scholar] [CrossRef] [PubMed]
Baell, J.B.; Holloway, G.A. New Substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 2010, 53, 2719–2740. [Google Scholar] [CrossRef] [PubMed]
Bickerton, G.R.; Paolini, G.V.; Besnard, J.; Muresan, S.; Hopkins, A.L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. [Google Scholar] [CrossRef] [PubMed]
Pearce, B.C.; Sofia, M.J.; Good, A.C.; Drexler, D.M.; Stock, D.A. An empirical process for the design of high-throughput screening deck filters. J. Chem. Inf. Model. 2006, 46, 1060–1068. [Google Scholar] [CrossRef]
Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 2001, 46, 3–26. [Google Scholar] [CrossRef]
Ghose, A.K.; Viswanadhan, V.N.; Wendoloski, J.J. A knowledge-based approach in designing combinatorial or medicinal chemistry libraries for drug discovery. 1. A qualitative and quantitative characterization of known drug databases. J. Comb. Chem. 1999, 1, 55–68. [Google Scholar] [CrossRef] [PubMed]
Bonuccelli, U.; Del Dotto, P.; Rascol, O. Role of dopamine receptor agonists in the treatment of early Parkinson’s disease. Park. Relat. Disord. 2009, 15 (Suppl. S4), S44–S53. [Google Scholar] [CrossRef]
Martel, J.C.; Gatti McArthur, S. Dopamine Receptor subtypes, physiology and pharmacology: New ligands and concepts in schizophrenia. Front. Pharmacol. 2020, 11, 1003. [Google Scholar] [CrossRef]
Wu, J.; Xiao, H.; Sun, H.; Zou, L.; Zhu, L.Q. Role of dopamine receptors in ADHD: A systematic meta-analysis. Mol. Neurobiol. 2012, 45, 605–620. [Google Scholar] [CrossRef]
Xu, P.; Huang, S.; Krumm, B.E.; Zhuang, Y.; Mao, C.; Zhang, Y.; Wang, Y.; Huang, X.P.; Liu, Y.F.; He, X.; et al. Structural genomics of the human dopamine receptor system. Cell Res. 2023, 33, 604–616. [Google Scholar] [CrossRef] [PubMed]
Ouchi, M.; Kitta, T.; Chiba, H.; Higuchi, M.; Togo, M.; Abe-Takahashi, Y.; Shinohara, N. Mechanisms of D1/D2-like dopaminergic agonist, rotigotine, on lower urinary tract function in rat model of Parkinson’s disease. Sci. Rep. 2022, 12, 4540. [Google Scholar] [CrossRef] [PubMed]
Bosc, N.; Atkinson, F.; Felix, E.; Gaulton, A.; Hersey, A.; Leach, A.R. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J. Cheminform. 2019, 11, 4. [Google Scholar] [CrossRef] [PubMed]
Kramer, C.; Kalliokoski, T.; Gedeck, P.; Vulpetti, A. The experimental uncertainty of heterogeneous public K(i) data. J. Med. Chem. 2012, 55, 5165–5173. [Google Scholar] [CrossRef] [PubMed]
Rodríguez-Pérez, R.; Bajorath, J. Evolution of support vector machine and regression modeling in chemoinformatics and drug discovery. J. Comput.-Aided Mol. Des. 2022, 36, 355–362. [Google Scholar] [CrossRef]

Figure 1. Comparison of model performance of the 11 ML/DL methods based on hit rates and enrichment factors across ChEMBL targets (categorized by target family). Hit rate (HR): HR1% and HR5% represent the percentage of actual active compounds among the top 1% and 5% of predicted hits, respectively. Enrichment factor (EF): EF1% and EF5% are calculated as the ratio of HR1% and HR5% to the overall ratio of active compounds to the total number of compounds in the library; for both metrics, the higher the better.

Figure 2. The average rank of 14 features across 17 evaluation datasets. Rankings were based on performance metrics including accuracy (ACC), BEDROC, specificity (SPE, 1—false positive rate), and sensitivity (SEN, true positive rate). The higher values indicate more favorable rankings and superior performance; 17 evaluation datasets: 6 kinase targets

\times

3 evaluation datasets (ChEMBL v32, ExCAPE, and DrugBank), where VEGFR1 has no ExCAPE dataset.

Figure 2. The average rank of 14 features across 17 evaluation datasets. Rankings were based on performance metrics including accuracy (ACC), BEDROC, specificity (SPE, 1—false positive rate), and sensitivity (SEN, true positive rate). The higher values indicate more favorable rankings and superior performance; 17 evaluation datasets: 6 kinase targets

\times

3 evaluation datasets (ChEMBL v32, ExCAPE, and DrugBank), where VEGFR1 has no ExCAPE dataset.

Figure 3. An illustrative diagram of various data augmentation strategies.

Figure 4. Performance of NAPU-bagging SVM on kinase target test sets: (A) Varying negative-to-unlabeled ratios, maintaining a bag size equal to the number of positive samples (1.0 times). (B) Selected negative-to-unlabeled ratios, maintaining a bag size equal to 2.0 times or 5.0 times the number of positive samples.

Figure 5. The structures and predicted probability values of three known dual ALK-EGFR inhibitors.

Figure 6. The structures and the top docking scores (kcal/mol) of predicted dual hits with ALK and EGFR.

Figure 7. Binding modes of Z4607579043 with (A) ALK and (B) EGFR in comparison with Brigatinib. Blue: Z4607579043; yellow: Brigatinib; dashed line: hydrogen bond.

Figure 8. Hit pan-agonists targeting D1R-D5R identified from the Enamine HLL-460 library, along with their respective docking scores (kcal/mol) with D1R-D5R.

Table 1. Performance evaluation of different data augmentation strategies.

Strategy	Metrics	ALK	EGFR	FGFR1	MAPK1	MET	VEGFR1	Average
Baseline SVM	SPE	0.9886	0.9767	0.9961	0.9940	0.9915	0.9743	0.9869
Tao’s strategy		0.9929	0.9881	0.9973	0.9988	0.9931	0.9955	0.9943
PN-weighted		0.9922	0.9881	0.9972	0.9979	0.9923	0.9958	0.9939
PU-Iterative		0.9939	0.9876	0.9976	0.9988	0.9946	0.9952	0.9946
PU-bagging		0.9550	0.9104	0.9354	0.8555	0.9516	0.9275	0.9226
NAPU-bagging		0.9904	0.9695	0.9878	0.988	0.9872	0.9816	0.9841
Baseline SVM	SEN	0.5819	0.6467	0.2048	0.7264	0.7701	0.5806	0.5851
Tao’s strategy		0.4892	0.5923	0.1808	0.4865	0.6848	0.3548	0.4647
PN-weighted		0.4547	0.5923	0.1846	0.5878	0.6588	0.4194	0.4829
PU-Iterative		0.4935	0.6301	0.1858	0.6149	0.7108	0.4516	0.5145
PU-bagging		0.8276	0.9569	0.9456	0.8581	0.9753	0.8387	0.9004
NAPU-bagging		0.6164	0.8243	0.5107	0.7264	0.8640	0.6452	0.6978
Baseline SVM	F1	0.6429	0.694	0.3273	0.2666	0.8235	0.1295	0.4806
Tao’s strategy		0.5997	0.6956	0.2979	0.4318	0.7754	0.2683	0.5114
PN-weighted		0.5642	0.6956	0.3029	0.4065	0.7528	0.3210	0.5072
PU-Iterative		0.6107	0.7217	0.3059	0.5170	0.8008	0.3182	0.5457
PU-bagging		0.6057	0.6856	0.6945	0.0188	0.7653	0.0740	0.4740
NAPU-bagging		0.6810	0.7820	0.6163	0.1587	0.8571	0.1860	0.5469
Baseline SVM	BEDROC	0.8118	0.9185	0.7700	0.7261	0.8984	0.4639	0.7648
Tao’s strategy		0.7480	0.9299	0.8217	0.6107	0.9120	0.5086	0.7552
PN-weighted		0.7180	0.9299	0.8249	0.6607	0.9049	0.5366	0.7625
PU-Iterative		0.8375	0.9400	0.814	0.7518	0.9023	0.5264	0.7953
PU-bagging		0.7900	0.8164	0.8473	0.6013	0.8819	0.5047	0.7403
NAPU-bagging		0.7967	0.9273	0.8134	0.7030	0.9005	0.4943	0.7725

Table 2. The maximal Tanimoto similarity between the hit dual ALK-EGFR inhibitors and known positive compounds.

ID	ALK	EGFR
Z1192975494	0.304	0.315
Z1272282354	0.287	0.315
Z1276057417	0.360	0.360
Z1698127707	0.268	0.306
Z1996380289	0.373	0.373
Z3690564987	0.344	0.416
Z4364760669	0.260	0.433

Note: Tanimoto similarity here means Tanimoto/Jaccard coefficients calculated by RDKit using ECFP fingerprints.

Table 3. The maximal Tanimoto similarity between the candidate pan-hits and known positive compounds.

ID	DRD1	DRD2	DRD3	DRD4	DRD5
Z1408702762	0.389	0.525	0.580	0.580	0.269
Z218005308	0.308	0.519	0.523	0.554	0.300
Z279638458	0.297	0.511	0.473	0.473	0.282
Z46188417	0.438	0.526	0.614	0.614	0.288

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hao, Y.; Li, B.; Huang, D.; Wu, S.; Wang, T.; Fu, L.; Liu, X. Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery. Int. J. Mol. Sci. 2024, 25, 8239. https://doi.org/10.3390/ijms25158239

AMA Style

Hao Y, Li B, Huang D, Wu S, Wang T, Fu L, Liu X. Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery. International Journal of Molecular Sciences. 2024; 25(15):8239. https://doi.org/10.3390/ijms25158239

Chicago/Turabian Style

Hao, Yang, Bo Li, Daiyun Huang, Sijin Wu, Tianjun Wang, Lei Fu, and Xin Liu. 2024. "Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery" International Journal of Molecular Sciences 25, no. 15: 8239. https://doi.org/10.3390/ijms25158239

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Developing a Semi-Supervised Approach Using a PU-Learning-Based Data Augmentation Strategy for Multitarget Drug Discovery

Abstract

1. Introduction

2. Results and Discussion

2.1. SVM Outperforms DL-Based Drug–Target Potency Prediction Models

2.2. ECFP4 Outperformed Other Compound Representation Methods

2.3. Data Augmentation Strategies for Constructing Semi-Supervised Learning Models

2.4. Developing a Novel Negative-Augmented PU-Bagging (NAPU-Bagging) SVM for Enhanced Virtual Screening

2.5. NAPU-Bagging SVM in Virtual Screening of Novel Dual-TKIs for NCSLC

2.6. NAPU-Bagging SVM for Discovering Novel Multitarget Agonists for Dopamine Receptors

3. Materials and Methods

3.1. Datasets for Training, Evaluation, and Virtual Screening

3.2. Data Pools for Augmentation Strategy

3.3. Molecular Representations

3.4. Support Vector Machine (SVM)

3.5. DL-Based Sequence-to-Drug Models

3.6. Molecular Docking

4. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI