Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning
Abstract
1. Introduction
2. Results
2.1. A Selection of Strong Predictive Features of Essential Genes for CE and DM Identified by Predictions within Species
2.2. Select Features and ML Models Enable Essential Gene Predictions between CE and DM
2.3. Visual Representation of Cross-Species Gene Essentiality Probabilities along the CE and DM Genomes
2.4. Gene Ontology (GO) Enrichment and Functional Clustering Analyses Confirm Important Roles of Essential Genes
3. Discussion
4. Materials and Methods
4.1. Defining Feature Sets
4.2. Feature Selection, ML Training, and Evaluation within Species Using Standardised Data
4.3. Employing and Evaluating the ML Approach for Predictions between Species
4.4. Gene Ontology Analyses with Functional Annotation Clustering
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Telford, M.J.; Bourlat, S.J.; Economou, A.; Papillon, D.; Rota-Stabelli, O. The evolution of the Ecdysozoa. Philos. Trans. R. Soc. B. Biol. Sci. 2008, 363, 1529–1537. [Google Scholar] [CrossRef]
- Abraham, E.G.; Cha, S.-J.; Jacobs-Lorena, M. Towards the genetic control of insect vectors: An overview. Entomol. Res. 2007, 37, 213–220. [Google Scholar] [CrossRef] [PubMed]
- International Helminth Genomes Consortium. Comparative genomics of the major parasitic worms. Nat. Genet. 2019, 51, 163–174. [Google Scholar] [CrossRef] [PubMed]
- Torgerson, P.R. One world health: Socioeconomic burden and parasitic disease control priorities. Vet. Parasitol. 2013, 195, 223–232. [Google Scholar] [CrossRef] [PubMed]
- Parham, P.E.; Waldock, J.; Christophides, G.K.; Hemming, D.; Agusto, F.; Evans, K.J.; Fefferman, N.; Gaff, H.; Gumel, A.; LaDeau, S.; et al. Climate, environmental and socio-economic change: Weighing up the balance in vector-borne disease transmission. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 2015, 370, 20130551. [Google Scholar] [CrossRef] [PubMed]
- Hedges, S.B. The origin and evolution of model organisms. Nat. Rev. Genet. 2002, 3, 838–849. [Google Scholar] [CrossRef]
- Zhan, T.; Boutros, M. Towards a compendium of essential genes—From model organisms to synthetic lethality in cancer cells. Crit. Rev. Biochem. Mol. Biol. 2016, 51, 74–85. [Google Scholar] [CrossRef] [PubMed]
- Rancati, G.; Moffat, J.; Typas, A.; Pavelka, N. Emerging and evolving concepts in gene essentiality. Nat. Rev. Genet. 2018, 19, 34–49. [Google Scholar] [CrossRef] [PubMed]
- Itaya, M. An estimation of minimal genome size required for life. FEBS Lett. 1995, 362, 257–260. [Google Scholar] [CrossRef]
- Koonin, E.V. How many genes can make a cell: The minimal-gene-set concept. Annu. Rev. Genom. Hum. Genet. 2000, 1, 99–116. [Google Scholar] [CrossRef] [PubMed]
- Juhas, M.; Eberl, L.; Glass, J.I. Essence of life: Essential genes of minimal genomes. Trends Cell Biol. 2011, 21, 562–568. [Google Scholar] [CrossRef]
- Xu, J.; Xu, X.; Zhan, S.; Huang, Y. Genome editing in insects: Current status and challenges. Nat. Sci. Rev. 2019, 6, 399–401. [Google Scholar] [CrossRef]
- Zhang, X.; Acencio, M.L.; Lemke, N. Predicting essential genes and proteins based on machine learning and network topological features: A comprehensive review. Front. Physiol. 2016, 7, 75. [Google Scholar] [PubMed]
- Howe, K.L.; Bolt, B.J.; Cain, S.; Chan, J.; Chen, W.J.; Davis, P.; Done, J.; Down, T.; Gao, S.; Grove, C.; et al. WormBase 2016: Expanding to enable helminth genomic research. Nucleic Acids Res. 2016, 44, D774–D780. [Google Scholar] [CrossRef] [PubMed]
- dos Santos, G.; Schroeder, A.J.; Goodman, J.L.; Strelets, V.B.; Crosby, M.A.; Thurmond, J.; Emmert, D.B.; Gelbart, W.M. FlyBase: Introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 2015, 43, D690–D697. [Google Scholar] [CrossRef] [PubMed]
- Dong, C.; Jin, Y.-T.; Hua, H.-L.; Wen, Q.-F.; Luo, S.; Zheng, W.-X.; Guo, F.-B. Comprehensive review of the identification of essential genes using computational methods: Focusing on feature implementation and assessment. Brief. Bioinform. 2018, 21, 171–181. [Google Scholar] [CrossRef]
- Gustafson, A.M.; Snitkin, E.S.; Parker, S.C.; DeLisi, C.; Kasif, S. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genom. 2006, 7, 265. [Google Scholar] [CrossRef]
- Acencio, M.L.; Lemke, N. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinforma. 2009, 10, 290. [Google Scholar] [CrossRef]
- Zhang, X.; Xiao, W.; Hu, X. Predicting essential proteins by integrating orthology, gene expressions, and PPI networks. PLoS ONE 2018, 13, e0195410. [Google Scholar] [CrossRef]
- Campos, T.L.; Korhonen, P.K.; Gasser, R.B.; Young, N.D. An evaluation of machine learning approaches for the prediction of essential genes in eukaryotes using protein sequence-derived features. Comput. Struct. Biotechnol. J. 2019, 17, 785–796. [Google Scholar] [CrossRef] [PubMed]
- Aromolaran, O.; Beder, T.; Oswald, M.; Oyelade, J.; Adebiyi, E.; Koenig, R. Essential gene prediction in Drosophila melanogaster using machine learning approaches based on sequence and functional features. Comput. Struct. Biotechnol. J. 2020, 18, 612–621. [Google Scholar] [CrossRef]
- Campos, T.L.; Korhonen, P.K.; Hofmann, A.; Gasser, R.B.; Young, N.D. Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning. Comput. Struct. Biotechnol. J. 2020, 18, 1093–1102. [Google Scholar] [CrossRef]
- Campos, T.L.; Korhonen, P.K.; Hofmann, A.; Gasser, R.B.; Young, N.D. Combined use of feature engineering and machine learning to predict essential genes in Drosophila melanogaster. NAR Genom. Bioinform. 2020, 2, lqaa051. [Google Scholar] [CrossRef] [PubMed]
- Kumar, S.; Gupta, S.; Mohmad, A.; Fular, A.; Parthasarathi, B.C.; Chaubey, A.K. Molecular tools-advances, opportunities and prospects for the control of parasites of veterinary importance. Int. J. Trop. Insect Sci. 2020, 2020, 1–10. [Google Scholar] [CrossRef] [PubMed]
- Hutter, H.; Suh, J. GExplore 1.4: An expanded web interface for queries on Caenorhabditis elegans protein and gene function. Worm 2016, 19, e1234659. [Google Scholar] [CrossRef]
- Schmidt, E.E.; Pelz, O.; Buhlmann, S.; Kerr, G.; Horn, T.; Boutros, M. GenomeRNAi: A database for cell-based and in vivo RNAi phenotypes, 2013 update. Nucleic Acids Res. 2013, 41, D1021–D1026. [Google Scholar] [CrossRef]
- Garrigues, J.M.; Sidoli, S.; Garcia, B.A.; Strome, S. Defining heterochromatin in C. elegans through genome-wide analysis of the heterochromatin protein 1 homolog HPL-2. Genome Res. 2015, 25, 76–88. [Google Scholar] [CrossRef]
- Kaufman, T.C. A short history and description of Drosophila melanogaster classical genetics: Chromosome aberrations, forward genetic screens, and the nature of mutations. Genetics 2017, 206, 665–689. [Google Scholar] [CrossRef] [PubMed]
- Yates, A.D.; Achuthan, P.; Akanni, W.; Allen, J.; Allen, J.; Alvarez-Jarreta, J.; Amode, M.R.; Armean, I.M.; Azov, A.G.; Bennett, R.; et al. Ensembl 2020. Nucleic Acids Res. 2020, 48, D682–D688. [Google Scholar] [CrossRef] [PubMed]
- Cao, J.; Packer, J.S.; Ramani, V.; Cusanovich, D.A.; Huynh, C.; Daza, R.; Qiu, X.; Lee, C.; Furlan, S.N.; Steemers, F.J.; et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 2017, 18, 661–667. [Google Scholar] [CrossRef]
- Karaiskos, N.; Wahle, P.; Alles, J.; Boltengagen, A.; Ayoub, S.; Kipar, C.; Kocks, C.; Rajewsky, N.; Zinzen, R.P. The Drosophila embryo at single-cell transcriptome resolution. Science 2017, 358, 194–199. [Google Scholar] [CrossRef] [PubMed]
- Soares, L.M.; He, P.C.; Chun, Y.; Suh, H.; Kim, T.; Buratowski, S. Determinants of histone H3K4 methylation patterns. Mol. Cell 2017, 16, 773–785. [Google Scholar] [CrossRef]
- Wiles, E.T.; Selker, E.U. H3K27 methylation: A promiscuous repressive chromatin mark. Curr. Opin. Genet. Dev. 2017, 43, 31–37. [Google Scholar] [CrossRef]
- Altenhoff, A.M.; Glover, N.M.; Train, C.M.; Kaleb, K.; Warwick Vesztrocy, A.; Dylus, D. The OMA orthology database in 2018: Retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Res. 2018, 46, D477–D485. [Google Scholar] [CrossRef]
- Jeong, H.; Mason, S.P.; Barabási, A.L.; Oltvai, Z.N. Lethality and centrality in protein networks. Nature 2001, 411, 41–42. [Google Scholar] [CrossRef]
- Doyle, M.A.; Gasser, R.B.; Woodcroft, B.J.; Hall, R.S.; Ralph, S.A. Drug target prediction and prioritization: Using orthology to predict essentiality in parasite genomes. BMC Genom. 2010, 11, 222. [Google Scholar] [CrossRef] [PubMed]
- Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F.J. Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet. 2019, 20, 389403. [Google Scholar] [CrossRef]
- Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
- Blagus, R.; Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013, 14, 106. [Google Scholar] [CrossRef] [PubMed]
- Blagus, R.; Lusa, L. Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models. BMC Bioinform. 2015, 16, 363. [Google Scholar] [CrossRef] [PubMed]
- Ho, J.W.K.; Jung, Y.L.; Liu, T.; Alver, B.H.; Lee, S.; Ikegami, K.; Sohn, K.-A.; Minoda, A.; Tolstorukov, M.Y.; Appert, A.; et al. Comparative analysis of metazoan chromatin organization. Nature 2014, 512, 449–452. [Google Scholar] [CrossRef]
- Talbert, P.B.; Henikoff, S. What makes a centromere? Exp. Cell Res. 2020, 15, 111895. [Google Scholar] [CrossRef] [PubMed]
- Cutter, A.D. Reproductive evolution: Symptom of a selfing syndrome. Curr. Biol. 2006, 18, R1056–R1058. [Google Scholar] [CrossRef][Green Version]
- Woods, S.; Coghlan, A.; Rivers, D.; Warnecke, T.; Jeffries, S.J.; Kwon, T.; Rogers, A.; Hurst, L.D.; Ahringer, J. Duplication and retention biases of essential and non-essential genes revealed by systematic knockdown analyses. PLoS Genet. 2013, 9, e1003330. [Google Scholar] [CrossRef] [PubMed]
- Liao, B.Y.; Zhang, J. Mouse duplicate genes are as essential as singletons. Trends Genet. 2007, 23, 378–381. [Google Scholar] [CrossRef] [PubMed]
- Stapleton, M.; Liao, G.; Brokstein, P.; Hong, L.; Carninci, P.; Shiraki, T.; Hayashizaki, Y.; Champe, M.; Pacleb, J.; Wan, K.; et al. The Drosophila gene collection: Identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12, 1294–1300. [Google Scholar] [CrossRef]
- Kent, W.J. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002, 12, 656–664. [Google Scholar] [CrossRef] [PubMed]
- Washington, N.L.; Stinson, E.O.; Perry, M.D.; Ruzanov, P.; Contrino, S.; Smith, R.; Zha, Z.; Lyne, R.; Carr, A.; Lloyd, P.; et al. The modENCODE Data Coordination Center: Lessons in harvesting comprehensive experimental details. Database 2011, 2011, bar023. [Google Scholar] [CrossRef]
- Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25, 1754–1760. [Google Scholar] [CrossRef] [PubMed]
- Feng, J.; Liu, T.; Qin, B.; Zhang, Y.; Shirley Liu, X. Identifying ChIP-seq enrichment using MACS. Nat. Protoc. 2012, 7, 1728–1740. [Google Scholar] [CrossRef] [PubMed]
- Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842. [Google Scholar] [CrossRef] [PubMed]
- Huang, D.W.; Sherman, B.T.; Tan, Q.; Collins, J.R.; Alvord, W.G.; Roayaei, J.; Stephens, R.; Baseler, M.W.; Lane, H.C.; Lempicki, R.A. The DAVID Gene Functional Classification Tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007, 8, R183. [Google Scholar] [CrossRef] [PubMed]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Campos, T.L.; Korhonen, P.K.; Young, N.D. Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning. Int. J. Mol. Sci. 2021, 22, 5056. https://doi.org/10.3390/ijms22105056
Campos TL, Korhonen PK, Young ND. Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning. International Journal of Molecular Sciences. 2021; 22(10):5056. https://doi.org/10.3390/ijms22105056
Chicago/Turabian StyleCampos, Tulio L., Pasi K. Korhonen, and Neil D. Young. 2021. "Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning" International Journal of Molecular Sciences 22, no. 10: 5056. https://doi.org/10.3390/ijms22105056
APA StyleCampos, T. L., Korhonen, P. K., & Young, N. D. (2021). Cross-Predicting Essential Genes between Two Model Eukaryotic Species Using Machine Learning. International Journal of Molecular Sciences, 22(10), 5056. https://doi.org/10.3390/ijms22105056