Evaluation of Different SNP Analysis Software and Optimal Mining Process in Tree Species
Abstract
1. Introduction
2. Materials and Methods
2.1. Brief Introduction to the Alignment Process
2.2. Comparison of Three Major SNP Calling Programs
2.3. Information on the Test Data Sets
2.4. Experimental Validation, Optimization and Mining of SNPs
3. Results and Discussion
3.1. Runtime of Different Alignment Programs with Algorithm Details
3.2. Influence of Different Alignment Tools on the Results
3.3. Verification of the Prediction Accuracy with the Simulation Data Set
3.4. Experimental Validation and SNP Mining with the Optimized Protocol
3.5. Several Suggestions to Improve the Efficiency of SNP Calling
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Tools | Time with BWA | Time with Bowtie 2 |
---|---|---|
SAMtools | 509 min 15 s | 525 min 21 s |
GATK-HC | 1800 min 34 s | 1209 min 32 s |
Freebayes | 354 min 32 s | 187 min 37 s |
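
The runtimes in the table above are easier to compare once they are converted to a common unit. The following is a minimal sketch, not part of the original study: it assumes each cell is the total reported time for one SNP caller run on the output of one aligner (the table does not say whether alignment time is included), and the names `RUNTIMES`, `to_seconds`, and `bwa_to_bowtie2_ratio` are hypothetical helpers introduced only for illustration.

```python
import re

# Times copied from the runtime table above: caller -> aligner -> reported time.
RUNTIMES = {
    "SAMtools": {"BWA": "509 min 15 s", "Bowtie 2": "525 min 21 s"},
    "GATK-HC": {"BWA": "1800 min 34 s", "Bowtie 2": "1209 min 32 s"},
    "Freebayes": {"BWA": "354 min 32 s", "Bowtie 2": "187 min 37 s"},
}


def to_seconds(text):
    """Convert a duration such as '509 min 15 s' to whole seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s*(\d+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


def bwa_to_bowtie2_ratio(runtimes):
    """Return {caller: BWA seconds / Bowtie 2 seconds} (hypothetical helper)."""
    return {
        caller: to_seconds(times["BWA"]) / to_seconds(times["Bowtie 2"])
        for caller, times in runtimes.items()
    }


if __name__ == "__main__":
    for caller, ratio in bwa_to_bowtie2_ratio(RUNTIMES).items():
        print(f"{caller}: BWA / Bowtie 2 runtime ratio = {ratio:.2f}")
```

A ratio above 1 means the BWA-based run took longer than the Bowtie 2-based run for that caller. With these figures, the GATK-HC and Freebayes ratios are above 1 and the SAMtools ratio is slightly below 1.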

Number of predicted (validated) SNPs per sample ID, by alignment tool and SNP caller:

ID | BWA + SAMtools | BWA + GATK-HC | BWA + Freebayes | Bowtie 2 + SAMtools | Bowtie 2 + GATK-HC | Bowtie 2 + Freebayes |
---|---|---|---|---|---|---|
1 | 42(17) | 16(3) | 27(12) | 8(1) | 6(1) | 9(1) |
2 | 42(21) | 10(3) | 33(12) | 5(0) | 10(6) | 8(2) |
3 | 21(10) | 28(13) | 4(1) | 3(1) | 6(1) | 11(2) |
4 | 35(12) | 19(5) | 25(3) | 11(2) | 8(1) | 12(1) |
5 | 35(15) | 29(8) | 28(5) | 7(3) | 10(2) | 13(3) |
6 | 51(23) | 38(14) | 18(4) | 10(4) | 15(3) | 10(3) |
7 | 40(18) | 26(11) | 30(6) | 12(3) | 14(4) | 10(2) |
8 | 47(20) | 30(13) | 22(5) | 9(2) | 13(3) | 13(2) |
Total | 313(136) | 196(70) | 187(48) | 65(16) | 82(21) | 86(16) |
Percentage (%) | 43.45 | 35.71 | 25.67 | 24.62 | 25.61 | 18.60 |
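
The percentage row of the table above appears to be the number of validated SNPs divided by the number of predicted SNPs, summed over the eight samples. The sketch below recomputes it under two assumptions that are not stated in the table itself: that each cell reads `predicted(validated)`, and that the first three data columns belong to BWA and the last three to Bowtie 2. The names `COLUMNS`, `ROWS`, `parse_cell`, and `validation_rates` are hypothetical.

```python
import re

# Assumed column order: (aligner, caller) for each of the six data columns.
COLUMNS = [
    ("BWA", "SAMtools"), ("BWA", "GATK-HC"), ("BWA", "Freebayes"),
    ("Bowtie 2", "SAMtools"), ("Bowtie 2", "GATK-HC"), ("Bowtie 2", "Freebayes"),
]

# Per-sample cells copied from rows 1 to 8 of the table above.
ROWS = [
    "42(17) 16(3) 27(12) 8(1) 6(1) 9(1)",
    "42(21) 10(3) 33(12) 5(0) 10(6) 8(2)",
    "21(10) 28(13) 4(1) 3(1) 6(1) 11(2)",
    "35(12) 19(5) 25(3) 11(2) 8(1) 12(1)",
    "35(15) 29(8) 28(5) 7(3) 10(2) 13(3)",
    "51(23) 38(14) 18(4) 10(4) 15(3) 10(3)",
    "40(18) 26(11) 30(6) 12(3) 14(4) 10(2)",
    "47(20) 30(13) 22(5) 9(2) 13(3) 13(2)",
]


def parse_cell(cell):
    """Split 'predicted(validated)' into a pair of integers."""
    match = re.fullmatch(r"(\d+)\((\d+)\)", cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def validation_rates(rows):
    """Return {(aligner, caller): (predicted, validated, percent validated)}."""
    totals = {column: [0, 0] for column in COLUMNS}
    for row in rows:
        cells = row.split()
        if len(cells) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
        for column, cell in zip(COLUMNS, cells):
            predicted, validated = parse_cell(cell)
            totals[column][0] += predicted
            totals[column][1] += validated
    return {
        column: (predicted, validated, 100.0 * validated / predicted)
        for column, (predicted, validated) in totals.items()
    }


if __name__ == "__main__":
    for (aligner, caller), (pred, val, pct) in validation_rates(ROWS).items():
        print(f"{aligner} + {caller}: {val}/{pred} validated ({pct:.2f}%)")
```

Summing before dividing weights every predicted SNP equally; averaging the eight per-sample rates instead would give small samples more influence and would not reproduce the totals row.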
Gene Name | Length of Samples (bp) | Number of SNPs | Proportion of SNPs |
---|---|---|---|
comp39263_c0 | 843 | 8 | 0.95% |
comp38056_c5 | 605 | 2 | 0.33% |
comp39131_c0 | 746 | 2 | 0.27% |
comp39024_c3 | 501 | 12 | 2.40% |
comp38899_c0 | 1029 | 4 | 0.39% |
comp39290_c0 | 643 | 6 | 0.93% |
comp39123_c0 | 1009 | 9 | 0.89% |
comp39115_c1 | 1060 | 13 | 1.23% |
comp39323_c0 | 1076 | 3 | 0.28% |
comp38596_c0 | 1160 | 12 | 1.03% |
comp39136_c1 | 1106 | 10 | 0.90% |
comp38902_c0 | 499 | 2 | 0.40% |
comp34734_c0 | 731 | 6 | 0.82% |
comp39345_c0 | 1114 | 3 | 0.27% |
comp39091_c0 | 902 | 10 | 1.11% |
comp39347_c0 | 1105 | 1 | 0.09% |
comp38540_c0 | 1140 | 17 | 1.49% |
comp35986_c0 | 1134 | 2 | 0.18% |
Total | 16,403 | 122 | 0.74% |
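
The "Proportion of SNPs" column above is consistent with the number of SNPs divided by the sequenced length. The sketch below recomputes the totals row from the per-gene rows under that assumption; `GENES` and `snp_density` are hypothetical names, and the bp-per-SNP figure is a derived convenience value that does not appear in the table.

```python
# (gene name, sequenced length in bp, number of SNPs), copied from the table above.
GENES = [
    ("comp39263_c0", 843, 8), ("comp38056_c5", 605, 2), ("comp39131_c0", 746, 2),
    ("comp39024_c3", 501, 12), ("comp38899_c0", 1029, 4), ("comp39290_c0", 643, 6),
    ("comp39123_c0", 1009, 9), ("comp39115_c1", 1060, 13), ("comp39323_c0", 1076, 3),
    ("comp38596_c0", 1160, 12), ("comp39136_c1", 1106, 10), ("comp38902_c0", 499, 2),
    ("comp34734_c0", 731, 6), ("comp39345_c0", 1114, 3), ("comp39091_c0", 902, 10),
    ("comp39347_c0", 1105, 1), ("comp38540_c0", 1140, 17), ("comp35986_c0", 1134, 2),
]


def snp_density(length_bp, snp_count):
    """Return SNPs as a percentage of sequenced bases."""
    if length_bp <= 0:
        raise ValueError("length must be positive")
    return 100.0 * snp_count / length_bp


total_length = sum(length for _, length, _ in GENES)
total_snps = sum(snps for _, _, snps in GENES)
overall_density = snp_density(total_length, total_snps)
bp_per_snp = total_length / total_snps  # derived value, not in the table

if __name__ == "__main__":
    for name, length, snps in GENES:
        print(f"{name}: {snp_density(length, snps):.2f}%")
    print(f"Total: {total_length} bp, {total_snps} SNPs, {overall_density:.2f}%")
    print(f"Roughly one SNP every {bp_per_snp:.0f} bp")
```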
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bu, M.; Xu, M.; Tao, S.; Cui, P.; He, B. Evaluation of Different SNP Analysis Software and Optimal Mining Process in Tree Species. Life 2023, 13, 1069. https://doi.org/10.3390/life13051069