A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset and DNA Sequences
2.2. Model Overview
2.3. The Needleman–Wunsch Algorithm and Sequence Similarity
2.4. Sequence Edit Augmentation
2.5. Self-Supervised Contrastive Learning Based on a Sequence Similarity Criterion
2.6. Supervised Contrastive Learning
2.7. Model Training and Hyper-Parameter Settings
3. Results
3.1. Comparison Experiments and Performance Measure
3.2. Comparison Results
3.3. Analysis of Small Sample Learning
3.4. Self-Supervised Model for Transfer Learning
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Type | Symbol | Meaning
---|---|---
Variables | | input sequences
 | | classification labels of input sequences
 | | p-th character of an input sequence
 | | augmented sequences of an input sequence
 | | encoded features of the augmented sequences
 | | constant temperature hyper-parameter
 | | zoom ratio hyper-parameter of the learning rate
 | | total self-supervised contrastive loss
 | | total supervised contrastive loss
 | | per-sample self-supervised loss
 | | per-sample supervised loss
Sets | I | full augmented sample set
 | | augmented sample set excluding the anchor sample
 | | positive sample set of the anchor sample
 | | negative sample set of the anchor sample
Functions | | Needleman–Wunsch score between the first p characters of one sequence and the first q characters of another
 | | character matching score between two characters
 | | similarity score between two sequences
 | | cosine similarity score between two feature vectors
 | | self-supervised similarity label between two sequences
 | | supervised similarity label between two sequences
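The Needleman–Wunsch score listed under Functions is a standard global-alignment dynamic program: a matrix cell (p, q) holds the best score for aligning the first p characters of one sequence against the first q characters of the other. A minimal sketch follows; the match/mismatch/gap values are illustrative assumptions, not the scoring parameters used in the paper.

```python
import numpy as np

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the Needleman-Wunsch dynamic program.
    F[p, q] is the best score for aligning the first p characters of a
    with the first q characters of b."""
    n, m = len(a), len(b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)   # align a's prefix against gaps
    F[0, :] = gap * np.arange(m + 1)   # align b's prefix against gaps
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            s = match if a[p - 1] == b[q - 1] else mismatch
            F[p, q] = max(F[p - 1, q - 1] + s,   # align the two characters
                          F[p - 1, q] + gap,     # gap in b
                          F[p, q - 1] + gap)     # gap in a
    return F[n, m]

# e.g. needleman_wunsch("ACGT", "AGT") aligns ACGT against A-GT:
# three matches and one gap under the assumed scores.
```

A sequence-level similarity criterion, as in Section 2.3, can then be obtained by normalising this score by sequence length.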
Method | Accuracy | F1 Score | Precision | Recall | AUC |
---|---|---|---|---|---|
baseline | 0.716 | 0.702 | 0.738 | 0.675 | 0.797 |
simCLR | 0.719 | 0.721 | 0.761 | 0.689 | 0.801 |
editCLR-0.7 | 0.720 | 0.721 | 0.760 | 0.692 | 0.800 |
editCLR-0.8 | 0.720 | 0.722 | 0.760 | 0.690 | 0.804 |
editCLR | 0.723 | 0.725 | 0.762 | 0.694 | 0.811 |
supCLR | 0.734 | 0.734 | 0.784 | 0.691 | 0.823 |
Method | Training Ratio | Accuracy | F1 Score | AUC |
---|---|---|---|---|
baseline | 50% | 0.688 | 0.683 | 0.762 |
SimCLR | 50% | 0.690 | 0.697 | 0.764 |
editCLR | 50% | 0.693 | 0.700 | 0.764 |
supCLR | 50% | 0.700 | 0.698 | 0.765 |
baseline | 20% | 0.636 | 0.632 | 0.705 |
SimCLR | 20% | 0.653 | 0.659 | 0.719 |
editCLR | 20% | 0.657 | 0.664 | 0.722 |
supCLR | 20% | 0.658 | 0.656 | 0.721 |
baseline | 10% | 0.629 | 0.607 | 0.674 |
SimCLR | 10% | 0.636 | 0.646 | 0.695 |
editCLR | 10% | 0.640 | 0.650 | 0.706 |
supCLR | 10% | 0.639 | 0.645 | 0.703 |
Architecture | Model | Accuracy | F1 Score | AUC |
---|---|---|---|---|
1layer64motif | baseline | 0.717 | 0.707 | 0.794 |
1layer64motif | editCLR#1 | 0.693 | 0.718 | 0.768 |
1layer64motif | editCLR#2 | 0.720 | 0.715 | 0.807 |
1layer64motif | editCLR#3 | 0.715 | 0.729 | 0.792 |
1layer64motif | editCLR#4 | 0.715 | 0.711 | 0.806 |
1layer128motif | baseline | 0.723 | 0.712 | 0.805 |
1layer128motif | editCLR#5 | 0.725 | 0.727 | 0.811 |
1layer128motif | editCLR#6 | 0.726 | 0.715 | 0.804 |
1layer128motif | editCLR#7 | 0.718 | 0.709 | 0.800 |
1layer128motif | editCLR#8 | 0.721 | 0.716 | 0.808 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lin, K.; Quan, X.; Yin, W.; Zhang, H. A Contrastive Learning Pre-Training Method for Motif Occupancy Identification. Int. J. Mol. Sci. 2022, 23, 4699. https://doi.org/10.3390/ijms23094699