CPDP: Contrastive Protein–Drug Pre-Training for Novel Drug Discovery
Abstract
1. Introduction
- We construct a common embedding space through the CPDP framework, which integrates protein and drug representations from various dimensions, thereby enhancing the prediction of DTIs.
- We use contrastive learning to align protein and drug representations, which addresses the sparsity of training data, and we design weak labels that retain diverse DTI information from BioHNs and mitigate overfitting.
- CPDP demonstrates strong performance on novel drug discovery and drug repositioning tasks without relying on predefined graph structures, showing superior generalization to unseen biomolecular entities.
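The contrastive alignment named in the contributions above can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric InfoNCE loss over a batch of matched protein–drug pairs. The loss form, the `temperature` value, and every function name are illustrative assumptions rather than the authors' implementation, and the weak-label mechanism is not modeled here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_idx):
    """Numerically stable -log softmax(logits)[target_idx]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_idx]


def symmetric_contrastive_loss(protein_embs, drug_embs, temperature=0.07):
    """Assumed CLIP-style loss: row i of each input is a known interacting pair.

    Matched pairs sit on the diagonal of the similarity matrix; every other
    pair in the batch is treated as a negative.
    """
    p = [l2_normalize(v) for v in protein_embs]
    d = [l2_normalize(v) for v in drug_embs]
    n = len(p)
    # Cosine similarity scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(p[i], d[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Protein-to-drug direction: classify the right drug for each protein.
    loss_p2d = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Drug-to-protein direction: classify the right protein for each drug.
    loss_d2p = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_p2d + loss_d2p)
```

With toy two-dimensional embeddings, a batch whose paired rows point in the same direction yields a loss close to zero, while swapping the drug rows yields a large loss. That is the behavior that pulls interacting proteins and drugs together in the shared space.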
2. Results and Discussions
2.1. Using CPDP to Simulate Novel Drug Discovery
- Prompt 1: “Please select one and the most relevant drug for treating or managing <target> from the following options: <drugs>”.
- Prompt 2: “Please select one and the most relevant drug for treating or managing <target> from the following options: <drugs>. Answer “None” if you cannot select one drug from the list or require more information. You must start by choosing one from the drug’s name or “None””.
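As a rough sketch of how the two prompt templates above might be filled in and scored, the snippet below substitutes `<target>` and `<drugs>` and checks which candidate a response starts with. The helper names `build_prompt` and `parse_answer`, the comma-separated candidate list, and the ASCII quotes are assumptions made for illustration; the source does not describe this step.

```python
# Templates transcribed from Prompt 1 and Prompt 2, with ASCII quotes.
PROMPT_1 = (
    "Please select one and the most relevant drug for treating or managing "
    "{target} from the following options: {drugs}"
)
PROMPT_2 = (
    PROMPT_1
    + '. Answer "None" if you cannot select one drug from the list or require '
    "more information. You must start by choosing one from the drug's name "
    'or "None"'
)


def build_prompt(template, target, drugs):
    """Fill a template; joining candidates with ', ' is an assumption."""
    return template.format(target=target, drugs=", ".join(drugs))


def parse_answer(response, drugs):
    """Return the candidate the response starts with, or None.

    None covers both an explicit "None" answer and an unparseable response.
    Longer names are tried first so a name that prefixes another cannot
    shadow it.
    """
    text = response.strip().lower()
    for name in sorted(drugs, key=len, reverse=True):
        if text.startswith(name.lower()):
            return name
    return None


candidates = ["Ramipril", "Tacrine", "Sitagliptin"]
print(build_prompt(PROMPT_2, "ACE", candidates))
print(parse_answer("Ramipril is an ACE inhibitor.", candidates))
```

Requiring the answer to begin with a drug name or "None", as Prompt 2 does, is what makes a simple prefix check like this workable.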
2.2. Using CPDP to Simulate Drug Repositioning
2.3. Ablation Experiment
3. Materials and Methods
3.1. Datasets
3.2. Workflow
3.3. More Details for CPDP
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
DTIs | Drug–Target Interactions |
GNN | Graph Neural Network |
BioHNs | Biomedical Heterogeneous Networks |
LLMs | Large Language Models |
TTD | Therapeutic Target Database |
DDIs | Drug–Disease Interactions |
SMILES | Simplified Molecular Input Line Entry System |
AUPR | Area Under the Precision–Recall Curve |
AUROC | Area Under the Receiver Operating Characteristic Curve |
AP | Average Precision |
MAP | Mean Average Precision |
AR | Average Recall |
MAR | Mean Average Recall |
References
- Liu, Z.; Fang, H.; Reagan, K.; Xu, X.; Mendrick, D.L.; Slikker, W., Jr.; Tong, W. In silico drug repositioning—What we need to know. Drug Discov. Today 2013, 18, 110–115.
- DiMasi, J.A.; Grabowski, H.G.; Hansen, R.W. Innovation in the pharmaceutical industry: New estimates of R&D costs. J. Health Econ. 2016, 47, 20–33.
- Catacutan, D.B.; Alexander, J.; Arnold, A.; Stokes, J.M. Machine learning in preclinical drug discovery. Nat. Chem. Biol. 2024, 20, 960–973.
- Gangwal, A.; Lavecchia, A. Unlocking the potential of generative AI in drug discovery. Drug Discov. Today 2024, 29, 103992.
- Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S. Network analysis of protein structures identifies functional residues. J. Mol. Biol. 2004, 344, 1135–1146.
- Luo, Y.; Zhao, X.; Zhou, J.; Yang, J.; Zhang, Y.; Kuang, W.; Peng, J.; Chen, L.; Zeng, J. A network integration approach for drug–target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 2017, 8, 573.
- Wan, F.; Hong, L.; Xiao, A.; Jiang, T.; Zeng, J. NeoDTI: Neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions. Bioinformatics 2019, 35, 104–111.
- Nguyen, T.M.; Nguyen, T.; Le, T.M.; Tran, T. Gefa: Early fusion approach in drug–target affinity prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 718–728.
- Wang, X.; Cheng, Y.; Yang, Y.; Yu, Y.; Li, F.; Peng, S. Multitask joint strategies of self-supervised representation learning on biomedical networks for drug discovery. Nat. Mach. Intell. 2023, 5, 445–456.
- Zhang, Z.; Chen, L.; Zhong, F.; Wang, D.; Jiang, J.; Zhang, S.; Jiang, H.; Zheng, M.; Li, X. Graph neural network approaches for drug–target interactions. Curr. Opin. Struct. Biol. 2022, 73, 102327.
- Xu, Y.; Liu, X.; Cao, X.; Huang, C.; Liu, E.; Qian, S.; Liu, X.; Wu, Y.; Dong, F.; Qiu, C.W.; et al. Artificial intelligence: A powerful paradigm for scientific research. Innovation 2021, 2, 100179.
- Cheng, F.; Kovács, I.A.; Barabási, A.L. Network-based prediction of drug combinations. Nat. Commun. 2019, 10, 1197.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Tian, S.; Jin, Q.; Yeganova, L.; Lai, P.T.; Zhu, Q.; Chen, X.; Yang, Y.; Chen, Q.; Kim, W.; Comeau, D.C.; et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 2024, 25, bbad493.
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Pmc-llama: Further finetuning llama on medical papers. arXiv 2023, arXiv:2304.14454.
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 2022, 23, bbac409.
- Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. Biomedlm: A 2.7 b parameter language model trained on biomedical text. arXiv 2024, arXiv:2403.18421.
- Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022, 2022, 500902.
- Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A.C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; et al. DrugBank 4.0: Shedding new light on drug metabolism. Nucleic Acids Res. 2014, 42, D1091–D1097.
- Zhu, F.; Shi, Z.; Qin, C.; Tao, L.; Liu, X.; Xu, F.; Zhang, L.; Song, Y.; Liu, X.; Zhang, J.; et al. Therapeutic target database update 2012: A resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 2012, 40, D1128–D1136.
- Hernandez-Boussard, T.; Whirl-Carrillo, M.; Hebert, J.M.; Gong, L.; Owen, R.; Gong, M.; Gor, W.; Liu, F.; Truong, C.; Whaley, R.; et al. The pharmacogenetics and pharmacogenomics knowledge base: Accentuating the knowledge. Nucleic Acids Res. 2007, 36, D913–D918.
- Wishart, D.S.; Feunang, Y.D.; Guo, A.C.; Lo, E.J.; Marcu, A.; Grant, J.R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z.; et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288.
- Jin, W.; Barzilay, R.; Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2323–2332.
- Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
- Cai, Z.; Cao, M.; Liu, K.; Duan, X. Analysis of the responsiveness of latanoprost, travoprost, bimatoprost, and tafluprost in the treatment of OAG/OHT patients. J. Ophthalmol. 2021, 2021, 5586719.
- Chauhan, M.; Patel, J.B.; Ahmad, F. Ramipril; StatPearls: Treasure Island, FL, USA, 2024.
- Karasik, A.; Aschner, P.; Katzeff, H.; Davies, M.J.; Stein, P.P. Sitagliptin, a DPP-4 inhibitor for the treatment of patients with type 2 diabetes: A review of recent clinical trials. Curr. Med. Res. Opin. 2008, 24, 489–496.
- Rajendran, K.; Anoop, K.; Nagappan, K.; Sekar, G.M.; Rajendran, S.D. Ultra-Performance Liquid Chromatography Coupled with a Triple Quadrupole Mass Spectrometric Method for the Quantification of Antiepileptic Drugs Methsuximide and Normesuximide in Human Plasma and its Application in a Pharmacokinetic Study. Curr. Pharm. Anal. 2022, 18, 228–239.
- Jeyarasasingam, G.; Yeluashvili, M.; Quik, M. Tacrine, a reversible acetylcholinesterase inhibitor, induces myopathy. Neuroreport 2000, 11, 1173–1176.
- Sweeney, C.L.; Frandsen, J.L.; Verfaillie, C.M.; McIvor, R.S. Trimetrexate inhibits progression of the murine 32Dp210 model of chronic myeloid leukemia in animals expressing drug-resistant dihydrofolate reductase. Cancer Res. 2003, 63, 1304–1310.
- Zhou, J.; Xu, Y.; Wang, H.; Liu, Z. New target-HMGCR inhibitors for the treatment of primary sclerosing cholangitis: A drug Mendelian randomization study. Open Med. 2024, 19, 20240994.
- Stoppoloni, D.; Salvatori, L.; Biroccio, A.; D’Angelo, C.; Muti, P.; Verdina, A.; Sacchi, A.; Vincenzi, B.; Baldi, A.; Galati, R. Aromatase inhibitor exemestane has antiproliferative effects on human mesothelioma cells. J. Thorac. Oncol. 2011, 6, 583–591.
- Gronich, N.; Rennert, G. Beyond aspirin—cancer prevention with statins, metformin and bisphosphonates. Nat. Rev. Clin. Oncol. 2013, 10, 625–642.
- Xiao, R.; Han, J.; Deng, Y.; Zhang, L.; Qian, Y.; Tian, N.; Yang, Z.; Zhang, L. AGTR1: A potential biomarker associated with the occurrence and prognosis of lung adenocarcinoma. Front. Oncol. 2024, 14, 1441235.
- Wang, Y.; Xia, Y.; Yan, J.; Yuan, Y.; Shen, H.B.; Pan, X. ZeroBind: A protein-specific zero-shot predictor with subgraph matching for drug–target interactions. Nat. Commun. 2023, 14, 7861.
- Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 2018, 34, i821–i829.
- Zeng, X.; Zhu, S.; Lu, W.; Liu, Z.; Huang, J.; Zhou, Y.; Fang, J.; Huang, Y.; Guo, H.; Li, L.; et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem. Sci. 2020, 11, 1775–1797.
- Coordinators, N.R. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2016, 44, D7–D19.
- Brown, A.S.; Patel, C.J. A standard database for drug repositioning. Sci. Data 2017, 4, 170029.
- Ursu, O.; Holmes, J.; Knockel, J.; Bologa, C.G.; Yang, J.J.; Mathias, S.L.; Nelson, S.J.; Oprea, T.I. DrugCentral: Online drug compendium. Nucleic Acids Res. 2016, 45, D932–D939.
- Bodenreider, O. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270.
Protein–Molecules CPDP | Top-1 | Top-3 |
---|---|---|
MSSL2drug-JTVAE CPDP | 0.407 | 0.756 |
MSSL2drug-Llama CPDP | 0.393 | 0.708 |
MSSL2drug-BioGPT CPDP | 0.282 | 0.691 |
Llama-7B prompt1 | 0.327 | \ |
Llama-7B prompt2 | 0.327 | \ |
Protein–Molecules CPDP | Top-1 | Top-3 | Top-5 |
---|---|---|---|
MSSL2drug-JTVAE CPDP | 0.256 | 0.512 | 0.674 |
MSSL2drug-Llama CPDP | 0.225 | 0.517 | 0.663 |
MSSL2drug-BioGPT CPDP | 0.145 | 0.390 | 0.597 |
Llama-7B prompt1 | 0.169 | \ | \ |
Llama-7B prompt2 | 0.011 * | \ | \ |
Human Protein Target | Prediction | Likelihood Scores | Evidence |
---|---|---|---|
PTGFR | Travoprost | 0.582 | [27] |
ACE | Ramipril | 0.546 | [28] |
DPP4 | Sitagliptin | 0.532 | [29] |
AR | Methsuximide | 0.505 | [30] |
ACHE | Tacrine | 0.472 | [31] |
DHFR | Trimetrexate | 0.420 | [32] |
HMGCR | Cerivastatin | 0.376 | [33] |
CYP19A1 | Exemestane | 0.367 | [34] |
FDPS | Pamidronic Acid | 0.367 | [35] |
AGTR1 | Eprosartan | 0.361 | [36] |
Protein–Molecules CPDP | Top-1 | Top-3 |
---|---|---|
Esm2-JTVAE CPDP | 0.798 | 0.938 |
MSSL2drug-JTVAE CPDP | 0.787 | 0.936 |
Llama-JTVAE CPDP | 0.711 | 0.886 |
BioGPT-JTVAE CPDP | 0.631 | 0.858 |
Llama-7B prompt1 | 0.456 | \ |
Llama-7B prompt2 | 0.309 | \ |
Protein–Molecules CPDP | Top-1 | Top-3 | Top-5 |
---|---|---|---|
Esm2-JTVAE CPDP | 0.737 | 0.875 | 0.925 |
MSSL2drug-JTVAE CPDP | 0.721 | 0.860 | 0.914 |
Llama-JTVAE CPDP | 0.639 | 0.797 | 0.860 |
BioGPT-JTVAE CPDP | 0.561 | 0.709 | 0.821 |
Llama-7B prompt1 | 0.406 | \ | \ |
Llama-7B prompt2 | 0.250 | \ | \ |
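The Top-1, Top-3, and Top-5 columns in the result tables above are reported without a scoring rule in this extract. As one plausible reading, the sketch below computes a hit rate at k: the fraction of targets for which at least one true drug appears among the k highest-scoring candidates. The function name, the data layout, and the toy scores are hypothetical, and the paper's AP/AR-based definitions may differ.

```python
def top_k_hit_rate(scores_per_target, true_drugs_per_target, k):
    """Fraction of targets with at least one true drug in the top-k ranking.

    scores_per_target: {target: {drug: score}}, higher score = more likely.
    true_drugs_per_target: {target: set of drugs known to interact}.
    """
    hits = 0
    for target, scores in scores_per_target.items():
        # Rank candidate drugs by descending score and keep the first k.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        if any(drug in true_drugs_per_target[target] for drug in ranked):
            hits += 1
    return hits / len(scores_per_target)


# Toy scores (not taken from the paper) for two hypothetical targets.
toy_scores = {
    "T1": {"A": 0.9, "B": 0.5, "C": 0.1},
    "T2": {"A": 0.2, "B": 0.8, "C": 0.7},
}
toy_truth = {"T1": {"B"}, "T2": {"B"}}

print(top_k_hit_rate(toy_scores, toy_truth, k=1))  # 0.5
print(top_k_hit_rate(toy_scores, toy_truth, k=3))  # 1.0
```

Under this reading a score can only stay the same or rise as k grows, which matches the direction of the Top-1, Top-3, and Top-5 values in each row.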
Type of Node | Total | Train Set | Valid Set | Test Set |
---|---|---|---|---|
Drug | 670 | 603 | 67 | 125 * |
Protein | 1894 | 1512 | 382 | 86 * |
Disease | 731 | \ | \ | \ |
Type of Edge | | | | |
Drug–Protein Interactions | 4839 | 4325 | 514 | 156 * |
Drug–Disease Interactions | 1103 | \ | \ | \ |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, S.; Wang, X.; Li, F.; Peng, S. CPDP: Contrastive Protein–Drug Pre-Training for Novel Drug Discovery. Int. J. Mol. Sci. 2025, 26, 3761. https://doi.org/10.3390/ijms26083761