Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA
Abstract
1. Introduction
2. Related Work
2.1. Advancements in Self-Supervised Pre-Training for Natural Language Processing
Comparative Analysis of BERT and ELECTRA
2.2. Tokenization of DNA Sequences
2.3. DNA Motifs
3. Methods
3.1. Model Architecture
3.2. Pre-Training Strategy
3.2.1. Genome Token Prediction
3.2.2. Motif Prediction
3.2.3. Pre-Training Objectives
3.3. Fine-Tuning
4. Experimental Results
4.1. Pre-Training and Fine-Tuning Experimental Pipelines
4.2. Implementation Details
4.3. Datasets
4.3.1. Unsupervised Pre-Training Dataset
4.3.2. Promoter Prediction Dataset
4.3.3. The 690 ChIP-seq Datasets
4.4. Promoter Prediction
4.5. Transcription Factor Binding Site (TFBS) Prediction
4.6. Ablation Studies and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838.
- Li, M.J.; Yan, B.; Sham, P.C.; Wang, J. Exploring the function of genetic variants in the non-coding genomic regions: Approaches for identifying human regulatory variants affecting gene expression. Briefings Bioinform. 2015, 16, 393–412.
- Clauwaert, J.; Menschaert, G.; Waegeman, W. Explainability in transformer models for functional genomics. Briefings Bioinform. 2021, 22, bbab060.
- The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447, 799–816.
- Andersson, R.; Sandelin, A. Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 2020, 21, 71–87.
- Oubounyt, M.; Louadi, Z.; Tayara, H.; Chong, K.T. DeePromoter: Robust promoter predictor using deep learning. Front. Genet. 2019, 10, 286.
- Zhang, Y.; Qiao, S.; Ji, S.; Li, Y. DeepSite: Bidirectional LSTM and CNN models for predicting DNA–protein binding. Int. J. Mach. Learn. Cybern. 2020, 11, 841–851.
- Mantegna, R.N.; Buldyrev, S.V.; Goldberger, A.L.; Havlin, S.; Peng, C.K.; Simons, M.; Stanley, H.E. Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 1994, 73, 3169.
- Brendel, V.; Busse, H. Genome structure described by formal languages. Nucleic Acids Res. 1984, 12, 2561–2568.
- Corso, G.; Ying, Z.; Pándy, M.; Veličković, P.; Leskovec, J.; Liò, P. Neural distance embeddings for biological sequences. Adv. Neural Inf. Process. Syst. 2021, 34, 18539–18551.
- Liao, R.; Cao, C.; Garcia, E.B.; Yu, S.; Huang, Y. Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing variations. In Proceedings of the Chinese Conference on Biometric Recognition, Shenzhen, China, 28–29 October 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 474–483.
- Guo, Y.; Wu, J.; Ma, H.; Huang, J. Self-supervised pre-training for protein embeddings using tertiary structures. Proc. AAAI Conf. Artif. Intell. 2022, 36, 6801–6809.
- Yang, M.; Huang, H.; Huang, L.; Zhang, N.; Wu, J.; Yang, H.; Mu, F. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res. 2022, 50, e81.
- Strodthoff, N.; Wagner, P.; Wenzel, M.; Samek, W. UDSMProt: Universal deep sequence models for protein classification. Bioinformatics 2020, 36, 2401–2409.
- Umarov, R.; Kuwahara, H.; Li, Y.; Gao, X.; Solovyev, V. Promoter analysis and prediction in the human genome using sequence-based deep learning models. Bioinformatics 2019, 35, 2730–2737.
- Torada, L.; Lorenzon, L.; Beddis, A.; Isildak, U.; Pattini, L.; Mathieson, S.; Fumagalli, M. ImaGene: A convolutional neural network to quantify natural selection from genomic data. BMC Bioinform. 2019, 20, 337.
- Quang, D.; Xie, X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016, 44, e107.
- Avsec, Ž.; Agarwal, V.; Visentin, D.; Ledsam, J.R.; Grabska-Barwinska, A.; Taylor, K.R.; Assael, Y.; Jumper, J.; Kohli, P.; Kelley, D.R. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 2021, 18, 1196–1203.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. arXiv 2021, arXiv:2103.00020.
- Gao, S.; Alawad, M.; Young, M.T.; Gounley, J.; Schaefferkoetter, N.; Yoon, H.J.; Wu, X.C.; Durbin, E.B.; Doherty, J.; Stroup, A.; et al. Limitations of transformers on clinical text classification. IEEE J. Biomed. Health Inform. 2021, 25, 3596–3607.
- Rong, Y.; Bian, Y.; Xu, T.; Xie, W.; Wei, Y.; Huang, W.; Huang, J. Self-supervised graph transformer on large-scale molecular data. arXiv 2020, arXiv:2007.02835.
- Ji, Y.; Zhou, Z.; Liu, H.; Davuluri, R.V. DNABERT: Pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics 2021, 37, 2112–2120.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Min, S.; Park, S.; Kim, S.; Choi, H.S.; Lee, B.; Yoon, S. Pre-training of deep bidirectional protein sequence representations with structural information. IEEE Access 2021, 9, 123912–123926.
- Mo, S.; Fu, X.; Hong, C.; Chen, Y.; Zheng, Y.; Tang, X.; Shen, Z.; Xing, E.P.; Lan, Y. Multi-modal self-supervised pre-training for regulatory genome across cell types. arXiv 2021, arXiv:2110.05231.
- Domcke, S.; Hill, A.J.; Daza, R.M.; Cao, J.; O’Day, D.R.; Pliner, H.A.; Aldinger, K.A.; Pokholok, D.; Zhang, F.; Milbank, J.H.; et al. A human cell atlas of fetal chromatin accessibility. Science 2020, 370, eaba7612.
- An, W.; Guo, Y.; Bian, Y.; Ma, H.; Yang, J.; Li, C.; Huang, J. MoDNA: Motif-oriented pre-training for DNA language model. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Northbrook, IL, USA, 7–10 August 2022; pp. 1–5.
- Boeva, V. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Front. Genet. 2016, 7, 24.
- D’haeseleer, P. What are DNA sequence motifs? Nat. Biotechnol. 2006, 24, 423–425.
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555.
- Yamada, K.; Hamada, M. Prediction of RNA–protein interactions using a nucleotide language model. Bioinform. Adv. 2022, 2, vbac023.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Wang, B.; Xie, Q.; Pei, J.; Chen, Z.; Tiwari, P.; Li, Z.; Fu, J. Pre-trained language models in biomedical domain: A systematic survey. arXiv 2021, arXiv:2110.05006.
- Choi, D.; Park, B.; Chae, H.; Lee, W.; Han, K. Predicting protein-binding regions in RNA using nucleotide profiles and compositions. BMC Syst. Biol. 2017, 11, 16.
- Wang, S.; Guo, Y.; Wang, Y.; Sun, H.; Huang, J. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Niagara Falls, NY, USA, 7–10 September 2019; pp. 429–436.
- Bailey, T.L.; Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. ISMB 1994, 2, 28–36.
- Yan, F.; Powell, D.R.; Curtis, D.J.; Wong, N.C. From reads to insight: A hitchhiker’s guide to ATAC-seq data analysis. Genome Biol. 2020, 21, 22.
- Das, M.K.; Dai, H.K. A survey of DNA motif finding algorithms. BMC Bioinform. 2007, 8, S21.
- Bailey, T.L.; Boden, M.; Buske, F.A.; Frith, M.; Grant, C.E.; Clementi, L.; Ren, J.; Li, W.W.; Noble, W.S. MEME SUITE: Tools for motif discovery and searching. Nucleic Acids Res. 2009, 37, W202–W208.
- Janky, R.; Verfaillie, A.; Imrichova, H.; Van de Sande, B.; Standaert, L.; Christiaens, V.; Hulselmans, G.; Herten, K.; Naval Sanchez, M.; Potier, D.; et al. iRegulon: From a gene list to a gene regulatory network using large motif and track collections. PLoS Comput. Biol. 2014, 10, e1003731.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Frazer, J.; Notin, P.; Dias, M.; Gomez, A.; Min, J.K.; Brock, K.; Gal, Y.; Marks, D.S. Disease variant prediction with deep generative models of evolutionary data. Nature 2021, 599, 91–95.
- Kulakovskiy, I.V.; Vorontsov, I.E.; Yevshin, I.S.; Sharipov, R.N.; Fedorova, A.D.; Rumynskiy, E.I.; Medvedeva, Y.A.; Magana-Mora, A.; Bajic, V.B.; Papatsenko, D.A.; et al. HOCOMOCO: Towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018, 46, D252–D259.
- Dreos, R.; Ambrosini, G.; Cavin Périer, R.; Bucher, P. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 2013, 41, D157–D164.
- Lanchantin, J.; Sekhon, A.; Singh, R.; Qi, Y. Prototype matching networks for large-scale multi-label genomic sequence classification. arXiv 2017, arXiv:1710.11238.
- ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57–74.
- Harrow, J.; Frankish, A.; Gonzalez, J.M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B.L.; Barrell, D.; Zadissa, A.; Searle, S.; et al. GENCODE: The reference human genome annotation for the ENCODE Project. Genome Res. 2012, 22, 1760–1774.
- Yang, J.; Ma, A.; Hoppe, A.D.; Wang, C.; Li, Y.; Zhang, C.; Wang, Y.; Liu, B.; Ma, Q. Prediction of regulatory motifs from human ChIP-sequencing data using a deep learning framework. Nucleic Acids Res. 2019, 47, 7809–7824.
- Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934.
- Kelley, D.R.; Snoek, J.; Rinn, J.L. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016, 26, 990–999.
Table: MoDNA pre-training hyperparameters.

Hyperparameter | MoDNA
---|---
Layers (L) | 12
Hidden size (H) | 256
Embedding size | 128
Attention heads | 4
Attention head size | 64
Batch size | 64
Attention dropout | 0.1
Dropout | 0.1
… | 50
… | 1
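MoDNA uses an ELECTRA-style encoder. As a point of reference only (a minimal sketch, not the authors' released code), the architecture rows of the table above map directly onto Hugging Face's `ElectraConfig`; the values coincide with the standard ELECTRA-small settings. The vocabulary size, which depends on the k-mer tokenization, is not listed in the table and is left at its default here.

```python
# Sketch: expressing the tabled architecture hyperparameters with
# Hugging Face's ElectraConfig. Assumes the `transformers` package.
from transformers import ElectraConfig

config = ElectraConfig(
    num_hidden_layers=12,              # Layers (L)
    hidden_size=256,                   # Hidden size (H)
    embedding_size=128,                # Embedding size
    num_attention_heads=4,             # Attention heads; 256 / 4 = 64 per head,
                                       # matching the tabled head size
    hidden_dropout_prob=0.1,           # Dropout
    attention_probs_dropout_prob=0.1,  # Attention dropout
)
print(config)
```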
Table: Promoter prediction performance comparison.

Method | Accuracy | AUC | F1 | MCC | Precision | Recall
---|---|---|---|---|---|---
GeneBERT [26] | - | 0.894 | - | - | 0.805 | 0.803
DNABERT [23] | 0.841 | 0.925 | 0.840 | 0.685 | 0.844 | 0.841
MoDNA w/o motif | 0.857 | 0.929 | 0.857 | 0.714 | 0.858 | 0.857
MoDNA (ours) | 0.862 | 0.935 | 0.862 | 0.725 | 0.863 | 0.862
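All results tables in this section report the same six binary-classification metrics. For reference, this is how they are conventionally computed with scikit-learn; the labels and scores below are toy placeholders, not data from the paper, and this is not the authors' evaluation script.

```python
# Reference computation of Accuracy, AUC, F1, MCC, Precision, and Recall
# for a binary classifier, using scikit-learn. Toy placeholder data only.
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]                   # ground-truth labels
y_prob = [0.2, 0.9, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3]   # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]            # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))  # AUC uses raw scores
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```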
Table: TFBS prediction performance on the 690 ChIP-seq datasets.

Method | Accuracy | AUC | F1 | MCC | Precision | Recall
---|---|---|---|---|---|---
DeepBind [1] | 0.851 | 0.919 | 0.850 | 0.710 | 0.837 | 0.877
DeepSEA [50] | 0.853 | 0.919 | 0.836 | 0.717 | 0.840 | 0.858
Basset [51] | 0.741 | 0.860 | 0.685 | 0.531 | 0.799 | 0.729
DanQ [17] | 0.840 | 0.910 | 0.823 | 0.694 | 0.848 | 0.823
DeepSite [7] | 0.817 | 0.880 | 0.795 | 0.647 | 0.817 | 0.822
DESSO [49] | 0.851 | 0.926 | 0.848 | 0.711 | 0.832 | 0.884
MoDNA w/o Pretrain | 0.837 | 0.905 | 0.819 | 0.688 | 0.842 | 0.823
MoDNA (ours) | 0.856 | 0.935 | 0.851 | 0.727 | 0.859 | 0.856
Table: Ablation of pre-training on the promoter prediction task.

Method | Accuracy | AUC | F1 | MCC | Precision | Recall
---|---|---|---|---|---|---
With pre-training | 0.862 | 0.935 | 0.862 | 0.725 | 0.863 | 0.862
Without pre-training | 0.808 | 0.889 | 0.808 | 0.618 | 0.809 | 0.809