Machine Learning to Advance Human Genome-Wide Association Studies
Abstract
1. Introduction
1.1. The Road from GWAS Findings to Drug Discovery
1.2. GWAS Applications beyond Gene Discovery: Cumulative Genetic Profiles and Causal Relationships
2. Machine Learning Solutions for GWAS
2.1. Machine Learning Methods Frequently Adapted for GWAS
2.2. Machine Learning Application Areas in GWAS
2.3. Tools for SNP Discovery from Whole-Genome SNP Data
2.4. Applications Supporting PRS
3. Limitations and Criticism of Machine Learning
4. Future Prospects
4.1. Multimodal Omics Databases
4.2. Opportunities of Large Language Models and Foundation Models
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Application Categories | Applications and Tools | Machine Learning Approach |
---|---|---|
Prioritization of top GWAS SNPs and genes | | Clustering, SVM, Random Forest, Neural Network |
Epistasis detection among pre-selected SNPs | | Clustering, Random Forest, Neural Network |
Search space reduction | | SVM, Random Forest, Neural Network |
Hypothesis-free GWAS | | SVM, Neural Network |
Polygenic Risk Score | | Random Forest, Neural Network |
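As a point of reference for the "Polygenic Risk Score" row, the conventional (non-machine-learning) score is a weighted sum of effect-allele counts, with the weights taken from GWAS effect sizes. The sketch below illustrates only that baseline; the function name, weights, and genotypes are hypothetical values chosen for illustration and do not come from the article.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per SNP)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-SNP effect sizes (for example, log odds ratios from a GWAS).
weights = [0.5, -0.25, 1.0]

# Rows are individuals, columns are SNPs; values are effect-allele counts.
genotypes = [
    [0, 1, 2],
    [2, 2, 0],
    [1, 0, 1],
]

scores = [polygenic_risk_score(row, weights) for row in genotypes]
print(scores)
```

The machine learning approaches listed in the table replace this fixed linear combination with a learned model, which can in principle capture non-additive effects.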
Name | Method | Genotype Matrix Generation | Explainability/Method for SNP Relevance Scores | Language |
---|---|---|---|---|
COMBI | Two-step method | Not built-in. It requires a phenotype vector and a genotype matrix. | Yes/SVM for SNP relevance scores | MATLAB/Octave, R and Java |
DeepCOMBI | Three-step method | Not built-in. It requires a phenotype vector and a genotype matrix. | Yes/relevance scores | Python |
Deep Mixed Model | Two-component DL method | Not built-in. It requires genotype and phenotype matrices. | Not available | Python |
DeepWAS | Integration method | Not built-in. DeepSEA requires VCF format. | Not available | R |
GenNet | Use of NN with connections defined by prior biological knowledge to create groups of nodes across different layers to reduce the number of learnable parameters | Built-in | Built in as SNP, gene and pathway relevance scores based on relative weights | Python |
GMStool | Three-step method | Not built-in. It requires genotype, phenotype, GWAS result and test list files. | Not available | R |
GWANN | | Not built-in. It requires a VCF file with genotype data and a CSV file with phenotype data. | Not available | Python |
iMEGES | The ANNOVAR input/BED format file | Not built-in. It requires various predictors for genotype data from ANNOVAR, BED or VCF files. | Built in. | Python |
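The GenNet row describes a network whose connections are restricted by prior biological knowledge, which reduces the number of learnable parameters, and whose relevance scores are based on relative weights. The following is a minimal pure-Python sketch of that idea under stated assumptions: the SNP-to-gene annotation, the weight values, the tanh activation, and the normalisation used for relevance are all hypothetical and are not taken from the GenNet implementation.

```python
import math
from typing import Dict, List

# Hypothetical annotation: SNP column index -> gene index.
snp_to_gene: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
n_snps, n_genes = 5, 2

# mask[g][s] is 1 only where the annotation links SNP s to gene g.
mask: List[List[int]] = [
    [1 if snp_to_gene[s] == g else 0 for s in range(n_snps)]
    for g in range(n_genes)
]

# A fully connected SNP -> gene layer learns one weight per (SNP, gene) pair;
# the masked layer learns one weight per annotated connection only.
dense_params = n_snps * n_genes
sparse_params = sum(sum(row) for row in mask)

# Illustrative weights (assumed values, not trained on any data).
weights: List[List[float]] = [
    [0.4, -0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.2, -0.5],
]


def gene_layer(genotype: List[float]) -> List[float]:
    """Forward pass of the masked SNP -> gene layer with a tanh activation."""
    activations = []
    for g in range(n_genes):
        z = sum(mask[g][s] * weights[g][s] * genotype[s] for s in range(n_snps))
        activations.append(math.tanh(z))
    return activations


def snp_relevance() -> List[float]:
    """Relative weight per SNP: |w| of its connection divided by the total |w|."""
    abs_w = [abs(weights[snp_to_gene[s]][s]) for s in range(n_snps)]
    total = sum(abs_w)
    return [w / total for w in abs_w]


print(dense_params, sparse_params)
print(gene_layer([0, 1, 2, 0, 1]))
print(snp_relevance())
```

With this toy annotation the masked layer has half as many weights as the dense one. On genome-wide data, where each SNP maps to at most a few genes, the reduction would be far larger.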
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Sigala, R.E.; Lagou, V.; Shmeliov, A.; Atito, S.; Kouchaki, S.; Awais, M.; Prokopenko, I.; Mahdi, A.; Demirkan, A. Machine Learning to Advance Human Genome-Wide Association Studies. Genes 2024, 15, 34. https://doi.org/10.3390/genes15010034