Exhaustive Variant Interaction Analysis Using Multifactor Dimensionality Reduction
Abstract
1. Introduction
Contributions
- Contribution 1: a containerized high-performance computing (HPC) architecture for detecting disease-associated variant pairs.
- Contribution 2: discovery and functional interpretation of pairwise variant interactions associated with type 2 diabetes (T2D).
2. Materials and Methods
2.1. Input Dataset Preparation
2.2. Multifactor Dimensionality Reduction
- The first step is a k-fold cross-validation, performed to limit over-fitting. The dataset is split into k parts; k − 1 parts are used for training and the remaining part is used for testing. Because this yields k different train/test splits of the dataset, the following steps are performed k times, once per split.
- In the second step, for each variant pair, we cross-tabulate the genotypes of cases and controls using only the individuals in the training dataset. Each variant is a factor with three classes corresponding to its genotypes (AA, Aa, aa), so each variant–variant pair yields a nine-cell table in which every cell holds two counts: the number of cases and the number of controls.
- In step three, each table is reduced to a one-dimensional variable, using the case-control ratio of the training set as a threshold T. Cells whose ratio of cases to controls is greater than T are classified as ‘high risk’; all other cells are classified as ‘low risk’. This step reduces the problem from two dimensions to one.
- In step four, each variant–variant table is used to classify the individuals of the training set. For each individual, we extract the genotypes at the two variants and determine the cell of the multifactor table to which the individual belongs. If that cell is high risk, we classify the individual as a case; otherwise, we classify the individual as a control. After classifying every individual, we compare the predicted classes with the original labels to obtain the misclassification error of the variant–variant table.
- In step five, the model with the lowest misclassification error is selected, and its predictive power is estimated on the independent test data as the percentage of individuals correctly classified. Once these steps have been repeated for every cross-validation split, the best and most consistent variant–variant combinations are selected, that is, the pairs that most often appear as the top predictor pair across the cross-validation splits.
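The five steps above can be illustrated with a minimal single-machine sketch in Python with NumPy. This is not the authors' implementation: the function names (`risk_table`, `misclassification`, `mdr`), the genotype coding (0 = AA, 1 = Aa, 2 = aa), the label coding (1 = case, 0 = control), and the random fold assignment are all assumptions made for illustration, and the sketch assumes every training split contains at least one control.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def risk_table(g1, g2, y):
    """Steps 2-3: build the 3 x 3 case/control tables and mark high-risk cells."""
    cases = np.zeros((3, 3), dtype=int)
    controls = np.zeros((3, 3), dtype=int)
    np.add.at(cases, (g1[y == 1], g2[y == 1]), 1)
    np.add.at(controls, (g1[y == 0], g2[y == 0]), 1)
    # Threshold T: overall case/control ratio of the training individuals.
    T = cases.sum() / controls.sum()
    # "cases / controls > T" written without division, so that cells
    # with zero controls do not raise a division-by-zero warning.
    return cases > T * controls


def misclassification(high_risk, g1, g2, y):
    """Step 4: predict 'case' for individuals falling in a high-risk cell."""
    pred = high_risk[g1, g2].astype(int)
    return float(np.mean(pred != y))


def mdr(G, y, k=5, seed=0):
    """G: (n_variants, n_samples) genotype codes; y: (n_samples,) labels."""
    rng = np.random.default_rng(seed)
    # Step 1: k-fold cross-validation over a random permutation.
    folds = np.array_split(rng.permutation(len(y)), k)
    winners, accuracies = Counter(), []
    for fold in folds:
        test = np.zeros(len(y), dtype=bool)
        test[fold] = True
        train = ~test
        best = None
        for i, j in combinations(range(G.shape[0]), 2):
            table = risk_table(G[i, train], G[j, train], y[train])
            err = misclassification(table, G[i, train], G[j, train], y[train])
            if best is None or err < best[0]:
                best = (err, (i, j), table)
        # Step 5: evaluate the best training model on the held-out part.
        _, (i, j), table = best
        winners[(i, j)] += 1
        accuracies.append(
            1.0 - misclassification(table, G[i, test], G[j, test], y[test])
        )
    # Most consistent pair: the one selected in the most splits.
    pair, count = winners.most_common(1)[0]
    return pair, count, float(np.mean(accuracies))
```

Calling `mdr(G, y)` returns the pair selected most often, the number of splits in which it was selected, and the mean test accuracy. The exhaustive loop over all pairs grows quadratically with the number of variants, which is why a genome-scale analysis needs distributed execution rather than a loop like this one.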
2.3. Chi-Square p-Value Threshold Selection
2.4. Framework
2.5. Functional Interpretation Analyses
3. Results
3.1. Overall Strategy
3.2. Reducing the Computational Cost by Leveraging High-Performance Computing Technologies
3.2.1. Resource Scalability
3.2.2. Sample Scalability
3.3. Applying the MDR Model to Detect Pairwise Interactions Associated with T2D
3.4. Functional Interpretation of the T2D-Associated Pairs
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Chromosome | Position | Ref Allele | Alt Allele | AA | Aa | aa | AA | Aa | aa | … | AA | Aa | aa |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | 16,231,367 | A | G | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 1 | 0 |
22 | 17,052,123 | G | A | 0 | 1 | 0 | 0 | 0 | 1 | … | 0 | 0 | 1 |
22 | 17,055,458 | G | A | 0 | 1 | 0 | 1 | 0 | 0 | … | 0 | 1 | 0 |
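The table above appears to store each sample's genotype as three indicator columns (AA, Aa, aa), exactly one of which is 1. A minimal sketch, assuming that one-hot layout, shows how such rows could be decoded into genotype codes and how the genotype cross-tabulation of two variants falls out as a matrix product. The arrays `row_a` and `row_b` and the helper `decode` are hypothetical; they reuse the three visible samples of the first two rows purely for illustration.

```python
import numpy as np

# Hypothetical one-hot rows for two variants and three samples;
# each consecutive triple is (AA, Aa, aa) for one sample.
row_a = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0])
row_b = np.array([0, 1, 0, 0, 0, 1, 0, 0, 1])


def decode(row):
    """Turn a flat one-hot row into genotype codes 0 (AA), 1 (Aa), 2 (aa)."""
    return row.reshape(-1, 3).argmax(axis=1)


codes_a = decode(row_a)
codes_b = decode(row_b)

# With one-hot rows, entry (i, j) of this product counts the samples
# that carry genotype i at the first variant and j at the second.
table = row_a.reshape(-1, 3).T @ row_b.reshape(-1, 3)
```

Restricting the rows to cases or to controls before taking the product would give the per-class counts used in the cross-tabulation step of MDR.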
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gómez-Sánchez, G.; Alonso, L.; Pérez, M.Á.; Morán, I.; Torrents, D.; Berral, J.L. Exhaustive Variant Interaction Analysis Using Multifactor Dimensionality Reduction. Appl. Sci. 2024, 14, 5136. https://doi.org/10.3390/app14125136