bioGWAS: A Simple and Flexible Tool for Simulating GWAS Datasets
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Overview of bioGWAS
2.2. Genotype Simulation
- Filtering variants based on MAF, thus excluding rare variants (this step is independent of further filtering of causal variants based on both minimum and maximum MAF values);
- Filtering samples based on a sample identifier list, which allows users to specify a list of samples for analysis, streamlining the process and eliminating the need for manual filtering.
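The two filtering steps above can be illustrated with a minimal sketch. It is not the bioGWAS implementation: the function names, the dictionary-based genotype layout, the order of the two filters (samples first, then MAF), and the `min_maf=0.05` default are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def minor_allele_frequency(dosages: Sequence[int]) -> float:
    """MAF of one biallelic variant from 0/1/2 alternate-allele counts."""
    if not dosages:
        return 0.0
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1.0 - alt_freq)


def filter_genotypes(
    genotypes: Dict[str, List[int]],
    sample_ids: List[str],
    keep_samples: Optional[Set[str]] = None,
    min_maf: float = 0.05,  # hypothetical default, not taken from the paper
) -> Tuple[Dict[str, List[int]], List[str]]:
    """Keep the listed samples, then drop variants with MAF below min_maf."""
    # Sample filter: column indices of the samples to retain.
    if keep_samples is None:
        keep_idx = list(range(len(sample_ids)))
    else:
        keep_idx = [i for i, s in enumerate(sample_ids) if s in keep_samples]
    kept_ids = [sample_ids[i] for i in keep_idx]

    # Variant filter: MAF is computed on the retained samples only.
    filtered: Dict[str, List[int]] = {}
    for variant_id, dosages in genotypes.items():
        subset = [dosages[i] for i in keep_idx]
        if minor_allele_frequency(subset) >= min_maf:
            filtered[variant_id] = subset
    return filtered, kept_ids


# Toy example: three variants genotyped in four samples.
toy = {"rs1": [0, 1, 2, 1], "rs2": [0, 0, 0, 0], "rs3": [0, 0, 1, 0]}
ids = ["s1", "s2", "s3", "s4"]
filtered, kept = filter_genotypes(toy, ids, keep_samples={"s1", "s2", "s3"}, min_maf=0.2)
print(filtered)  # {'rs1': [0, 1, 2]}
print(kept)      # ['s1', 's2', 's3']
```

In this sketch the MAF is recomputed after sample filtering, so a variant that is common in the full panel can still be removed if it is rare among the retained samples.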
2.3. Phenotype Simulation
- The closest function of BEDTools v2.30.0 [15] is used to annotate each SNP according to the genome annotation file provided by the user. This step collects information about the closest gene for each variant. Additionally, variants are filtered by minimum and maximum MAF. By default, all common variants are considered potentially causal, though the user may set custom minimum and maximum MAF thresholds to define the desired MAF window for causal variants.
- Next, the user-defined number of causal variants (K) is selected. Depending on the user input, this is done based on either a user-specified gene set (pathway) or an explicitly provided set of causal variants. In the former case, variants are drawn from the gene-SNP mapping obtained in the first step. Let there be n genes in the specified gene set, and let k of the K causal variants affecting the trait come from this set (k ≤ K). The remaining K − k causal variants are then drawn from random genes outside the set. The simulation algorithm then proceeds as follows:
- (a) k genes are randomly selected from the n genes in the gene set of interest. Gene set information is taken from a file in the standard Gene Matrix Transposed (GMT) format, in which each row contains a gene set name followed by the list of corresponding genes (for human data, curated gene sets from the Molecular Signatures Database (MSigDB) [16] are used by default);
- (b) Based on the gene-SNP mapping obtained in step (1), k variants corresponding to the genes from step (2a) are selected (one SNP per gene; if the number of genes in the set is less than k, more than one SNP per gene may be selected, maintaining uniform coverage);
- (c) K − k random variants (not belonging to the n genes of the causal gene set) are added.
Thus, this procedure yields K causal variants to be used in phenotype simulation, k of which correspond to the given set of genes. In the case of manually specified causal variants, the final set of causal variants is constructed by filtering the user-provided variant IDs against the list of variants present in the genotype data.
- Individual genotypes for the constructed set of causal variants from step (2) are extracted from the genotype data file.
- Finally, phenotype data are simulated using genotypes at the causal variant loci obtained in step (3). Continuous trait values are generated using PhenotypeSimulator (R package, v0.3.4) [17]. The following simulation parameters can be specified: the mean effect size, the standard deviation of the effect size, the genetic effects variance (heritability), the proportion of the genetic variants' effect variance, the proportion of variance of shared genetic variant effects, the proportion of genetic variant effects that have a trait-independent fixed effect, the proportion of observational noise effect variance, and the variance of the shared observational noise effect. A description of these parameters can be found in the PhenotypeSimulator documentation (https://cran.r-project.org/web/packages/PhenotypeSimulator/PhenotypeSimulator.pdf, accessed on 23 October 2023).
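The gene-set-based selection of causal variants described in steps (2a) to (2c) can be sketched as follows. This is an illustrative reading of the procedure, not the bioGWAS source code: the function name, the `gene_to_snps` dictionary, and the round-robin way of handling gene sets with fewer than k genes are assumptions.

```python
import random
from typing import Dict, List, Optional


def select_causal_variants(
    gene_to_snps: Dict[str, List[str]],
    gene_set: List[str],
    K: int,
    k: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick K causal SNPs, k of them from genes in `gene_set`."""
    if not 0 <= k <= K:
        raise ValueError("expected 0 <= k <= K")
    rng = random.Random(seed)
    set_genes = set(gene_set)

    # Unused SNPs per gene-set gene (genes without mapped SNPs are skipped).
    pool = {g: list(gene_to_snps[g]) for g in gene_set if gene_to_snps.get(g)}
    set_snps = {s for snps in pool.values() for s in snps}

    # Steps (a) and (b): one SNP per randomly chosen gene. If the set has
    # fewer than k usable genes, further passes over the genes add extra
    # SNPs so that coverage stays as uniform as possible (assumed scheme).
    chosen: List[str] = []
    while len(chosen) < k:
        genes = [g for g in pool if pool[g]]
        if not genes:
            raise ValueError("gene set has fewer than k mapped SNPs")
        rng.shuffle(genes)
        for g in genes[: k - len(chosen)]:
            snp = pool[g].pop(rng.randrange(len(pool[g])))
            if snp not in chosen:
                chosen.append(snp)

    # Step (c): K - k random SNPs mapped to genes outside the gene set.
    background = sorted(
        {s for g, snps in gene_to_snps.items() if g not in set_genes for s in snps}
        - set_snps
    )
    if len(background) < K - k:
        raise ValueError("not enough background SNPs")
    chosen += rng.sample(background, K - k)
    return chosen


# Toy mapping: genes A and B form the pathway, C and D are background.
mapping = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"], "D": ["d1"]}
causal = select_causal_variants(mapping, ["A", "B"], K=4, k=3, seed=1)
print(causal)
```

In the toy call the pathway has two genes but k = 3, so the second pass takes an additional SNP from gene A; the fourth causal variant comes from the background genes C or D.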
2.4. Association Analysis
2.5. Visualization of Generated GWAS Results
2.6. Validation of Simulation Results
- Number of causal variants, K;
- Number of causal variants from a given set of genes, k;
- For each K, 120 combinations of PhenotypeSimulator parameters were evaluated (see Section 2.3 and Supplementary Table S1).
2.7. Replication of Existing GWAS Datasets
2.8. Using bioGWAS to Benchmark Pathway Analysis Tools
2.9. bioGWAS Implementation and Data Availability
3. Results and Discussion
3.1. Validation of Data Simulated Using bioGWAS
3.2. Using bioGWAS to Benchmark Pathway Analysis Tools for GWAS Data
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
GWAS | Genome-wide association study |
SNP | Single-nucleotide polymorphism |
EA | Enrichment analysis |
UKB | UK Biobank |
FG | FinnGen |
1KGP | 1000 Genomes Project |
MAF | Minor allele frequency |
GRCh37 | Genome Reference Consortium Human Build 37 |
References
1. Khoury, M.J. From genes to public health: The applications of genetic technology in disease prevention. In The Ethics of Public Health, Volumes I and II; Routledge: London, UK, 2018; pp. 371–376.
2. Uffelmann, E.; Huang, Q.Q.; Munung, N.S.; de Vries, J.; Okada, Y.; Martin, A.R.; Martin, H.C.; Lappalainen, T.; Posthuma, D. Genome-wide association studies. Nat. Rev. Methods Prim. 2021, 1, 59.
3. Pers, T.H.; Karjalainen, J.M.; Chan, Y.; Westra, H.J.; Wood, A.R.; Yang, J.; Lui, J.C.; Vedantam, S.; Gustafsson, S.; Esko, T.; et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 2015, 6, 5890.
4. de Leeuw, C.A.; Mooij, J.M.; Heskes, T.; Posthuma, D. MAGMA: Generalized Gene-Set Analysis of GWAS Data. PLoS Comput. Biol. 2015, 11, e1004219.
5. Lamparter, D.; Marbach, D.; Rueedi, R.; Kutalik, Z.; Bergmann, S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput. Biol. 2016, 12, e1004714.
6. Silberstein, M.; Nesbit, N.; Cai, J.; Lee, P.H. Pathway analysis for genome-wide genetic variation data: Analytic principles, latest developments, and new opportunities. J. Genet. Genom. 2021, 48, 173–183.
7. Klein, R.J. Power analysis for genome-wide association studies. BMC Genet. 2007, 8, 58.
8. Auton, A.; Abecasis, G.R.; Altshuler, D.M.; Durbin, R.M.; Bentley, D.R.; Chakravarti, A.; Clark, A.G.; Donnelly, P.; Eichler, E.E.; Flicek, P.; et al. A global reference for human genetic variation. Nature 2015, 526, 68–74.
9. Bycroft, C.; Freeman, C.; Petkova, D.; Band, G.; Elliott, L.T.; Sharp, K.; Motyer, A.; Vukcevic, D.; Delaneau, O.; O’Connell, J.; et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 2018, 562, 203–209.
10. Sudlow, C.; Gallacher, J.; Allen, N.; Beral, V.; Burton, P.; Danesh, J.; Downey, P.; Elliott, P.; Green, J.; Landray, M.; et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med. 2015, 12, e1001779.
11. Su, Z.; Marchini, J.; Donnelly, P. HAPGEN2: Simulation of multiple disease SNPs. Bioinformatics 2011, 27, 2304–2305.
12. Shi, M.; Umbach, D.M.; Wise, A.S.; Weinberg, C.R. Simulating autosomal genotypes with realistic linkage disequilibrium and a spiked-in genetic effect. BMC Bioinform. 2018, 19, 2.
13. Fortune, M.D.; Wallace, C. SimGWAS: A fast method for simulation of large scale case-control GWAS summary statistics. Bioinformatics 2019, 35, 1901–1906.
14. Chang, C.C.; Chow, C.C.; Tellier, L.C.; Vattikuti, S.; Purcell, S.M.; Lee, J.J. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience 2015, 4, s13742-015-0047-8.
15. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
16. Liberzon, A.; Birger, C.; Thorvaldsdóttir, H.; Ghandi, M.; Mesirov, J.P.; Tamayo, P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015, 1, 417–425.
17. Meyer, H.V.; Birney, E. Phenotype Simulator: A comprehensive framework for simulating multi-trait, multi-locus genotype to phenotype relationships. Bioinformatics 2018, 34, 2951–2956.
18. Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; Bakker, P.I.D.; Daly, M.J.; et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
19. Yin, L.; Zhang, H.; Tang, Z.; Xu, J.; Yin, D.; Zhang, Z.; Yuan, X.; Zhu, M.; Zhao, S.; Li, X.; et al. rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study. Genom. Proteom. Bioinform. 2021, 19, 619–628.
20. Frankish, A.; Diekhans, M.; Jungreis, I.; Lagarde, J.; Loveland, J.E.; Mudge, J.M.; Sisu, C.; Wright, J.C.; Armstrong, J.; Barnes, I.; et al. GENCODE 2021. Nucleic Acids Res. 2021, 49, D916–D923.
21. Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27–30.
22. Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 2019, 28, 1947–1951.
23. Kanehisa, M.; Furumichi, M.; Sato, Y.; Kawashima, M.; Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023, 51, D587–D592.
24. Kurki, M.I.; Karjalainen, J.; Palta, P.; Sipilä, T.P.; Kristiansson, K.; Donner, K.; Reeve, M.P.; Laivuori, H.; Aavikko, M.; Kaunisto, M.A.; et al. FinnGen: Unique genetic insights from combining isolated population and national health register data. medRxiv 2022.
Algorithm | HAPGEN2 [11] | TriadSim [12] | simGWAS [13] | bioGWAS |
---|---|---|---|---|
Supplied as | Command line interface (CLI) | R package | R package | CLI with a Docker image |
Input format | Known haplotypes in .haps/.legend files | Genotypes in a PLINK binary file, must include trios | Genotype matrix † | Genotypes in a phased VCF or a PLINK binary file |
Genotype filtration | Manual (as a preprocessing step) | Manual (as a preprocessing step) | Manual (in R code, or in pre-processing) | Automatic, for samples and variants (cutoffs specified as parameters) |
Setting causal variants | Yes | Yes | Yes | Yes |
Setting gene sets and biological pathways | Partially (as a list of manually selected variants) | Partially (as a list of manually selected variants) | Partially (as a list of manually selected variants) | Yes: as a list of variants, or as names of biological pathways/gene sets |
Outputs | Genotype data, summary statistics for causal variants | Genotypes and quantitative trait samples | Summary statistics | Genotype data, phenotype data; summary statistics; visualizations (genotypes PCA, Q-Q, and Manhattan plot) |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Changalidis, A.I.; Alexeev, D.A.; Nasykhova, Y.A.; Glotov, A.S.; Barbitoff, Y.A. bioGWAS: A Simple and Flexible Tool for Simulating GWAS Datasets. Biology 2024, 13, 10. https://doi.org/10.3390/biology13010010