CuReSim-LoRM: A Tool to Simulate Metabarcoding Long Reads
Abstract
1. Introduction
2. Results
2.1. Comparison of NanoSim-H, DeepSimulator, and CuReSim-LoRM Simulated Data with Real Data
2.2. Evaluation of CuReSim-LoRM with Challenging Datasets
3. Discussion
4. Materials and Methods
4.1. Datasets
4.2. Sequencing
4.3. CuReSim-LoRM
4.3.1. Read Simulation
4.3.2. Training Error Models
4.4. Simulated Datasets
4.4.1. CuReSim-LoRM Datasets
- Simulation of error-prone reads with Grinder and bbmap
- Alignment of real datasets against reference sequences
- Simulation of reads with the ONT error model
4.4.2. NanoSim-H and DeepSimulator Datasets
4.4.3. Evaluation Metrics
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Janda, J.M.; Abbott, S.L. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: Pluses, perils, and pitfalls. J. Clin. Microbiol. 2007, 45, 2761–2764.
- Wensel, C.R.; Pluznick, J.L.; Salzberg, S.L.; Sears, C.L. Next-generation sequencing: Insights to advance clinical investigations of the microbiome. J. Clin. Investig. 2022, 132, e154944.
- Santos, A.; van Aerle, R.; Barrientos, L.; Martinez-Urtaza, J. Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput. Struct. Biotechnol. J. 2020, 18, 296–305.
- Winand, R.; Bogaerts, B.; Hoffman, S.; Lefevre, L.; Delvoye, M.; Van Braekel, J.; Fu, Q.; Roosens, N.H.; Keersmaecker, S.C.D.; Vanneste, K. Targeting the 16S rRNA gene for bacterial identification in complex mixed samples: Comparative evaluation of second (Illumina) and third (Oxford Nanopore Technologies) generation sequencing technologies. Int. J. Mol. Sci. 2019, 21, 298.
- Szoboszlay, M.; Schramm, L.; Pinzauti, D.; Scerri, J.; Sandionigi, A.; Biazzo, M. Nanopore is preferable over Illumina for 16S amplicon sequencing of the gut Microbiota when species-level taxonomic classification, accurate estimation of richness, or focus on rare taxa is required. Microorganisms 2023, 11, 804.
- Escalona, M.; Rocha, S.; Posada, D. A comparison of tools for the simulation of genomic next-generation sequencing data. Nat. Rev. Genet. 2016, 17, 459–469.
- Urban, L.; Holzer, A.; Baronas, J.J.; Hall, M.B.; Braeuninger-Weimer, P.; Scherm, M.J.; Kunz, D.J.; Perera, S.N.; Martin-Herranz, D.E.; Tipper, E.T.; et al. Freshwater monitoring by nanopore sequencing. eLife 2021, 10, e61504.
- Curry, K.D.; Wang, Q.; Nute, M.G.; Tyshaieva, A.; Reeves, E.; Soriano, S.; Wu, Q.; Graeber, E.; Finzer, P.; Mendling, W.; et al. Emu: Species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data. Nat. Methods 2022, 19, 845–853.
- Li, Y.; Wang, S.; Bi, C.; Qiu, Z.; Li, M.; Gao, X. DeepSimulator1.5: A more powerful, quicker and lighter simulator for Nanopore sequencing. Bioinformatics 2020, 36, 2578–2580.
- Manske, F.; Grundmann, N.; Makalowski, W. MetaGenomic analysis of short and long reads. bioRxiv 2020.
- Yang, C.; Chu, J.; Warren, R.L.; Birol, I. NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience 2017, 6, gix010.
- Caboche, S.; Audebert, C.; Lemoine, Y.; Hot, D. Comparison of mapping algorithms used in high-throughput sequencing: Application to Ion Torrent data. BMC Genom. 2014, 15, 264.
- Cuscó, A.; Catozzi, C.; Viñes, J.; Sanchez, A.; Francino, O. Microbiota profiling with long amplicons using Nanopore sequencing: Full-length 16S rRNA gene and the 16S-ITS-23S of the rrn operon. F1000Research 2018, 7, 1755.
- Frank, J.A.; Reich, C.I.; Sharma, S.; Weisbaum, J.S.; Wilson, B.A.; Olsen, G.J. Critical evaluation of two primers commonly used for amplification of bacterial 16S rRNA genes. Appl. Environ. Microbiol. 2008, 74, 2461–2470.
- Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
- Angly, F.E.; Willner, D.; Rohwer, F.; Hugenholtz, P.; Tyson, G.W. Grinder: A versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012, 40, e94.
- Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner; Lawrence Berkeley National Lab. (LBNL): Berkeley, CA, USA, 2014.
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
- Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016.
- Stoddard, S.F.; Smith, B.J.; Hein, R.; Roller, B.R.K.; Schmidt, T.M. rrnDB: Improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 2015, 43, D593–D598.
- Quast, C.; Pruesse, E.; Yilmaz, P.; Gerken, J.; Schweer, T.; Yarza, P.; Peplies, J.; Glöckner, F.O. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 2013, 41, D590–D596.
Metrics | NewLot | CuReSim-LoRM | NanoSim-H (Dataset1) | NanoSim-H (Dataset2) | NanoSim-H (Dataset3) | DeepSimulator |
---|---|---|---|---|---|---|
error rate | 14.45 | 14.5 | 12.94 | 13.28 | 18.77 | 10.84 |
%unmapped | 3.4 | 5.5 | 6.5 | 6.6 | 9.6 | 0.01 |
%identity | 86.6 | 86.6 | 87.8 | 87.2 | 82.7 | 89.4 |
SD | 5.2 | 5 | 3.9 | 3.2 | 3.4 | 1.5 |
precision_Z | 1 | 1 | 0.99 | 1 | 1 | 1 |
recall_Z | 0.96 | 0.94 | 0.92 | 0.93 | 0.9 | 1 |
precision_R | 0.94 | 0.91 | 0.87 | 0.87 | 0.86 | 0.84 |
recall_R | 0.91 | 0.88 | 0.82 | 0.83 | 0.8 | 0.84 |
precision_S | 0.74 | 0.76 | 0.67 | 0.64 | 0.62 | 0.73 |
recall_S | 0.72 | 0.72 | 0.6 | 0.59 | 0.55 | 0.73 |
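
The comparison in the table above can be read row by row as the gap between each simulator and the real NewLot run. The following is a minimal sketch of that arithmetic, not the authors' evaluation procedure: the numbers are transcribed from the table, and the helper names `absolute_gaps` and `closest_simulator` are hypothetical, introduced only for illustration.

```python
# Values transcribed from the table above; the real NewLot run is the reference.
METRICS = ["error rate", "%unmapped", "%identity", "SD",
           "precision_Z", "recall_Z", "precision_R", "recall_R",
           "precision_S", "recall_S"]

REAL_NEWLOT = [14.45, 3.4, 86.6, 5.2, 1, 0.96, 0.94, 0.91, 0.74, 0.72]

SIMULATED = {
    "CuReSim-LoRM":       [14.5, 5.5, 86.6, 5, 1, 0.94, 0.91, 0.88, 0.76, 0.72],
    "NanoSim-H Dataset1": [12.94, 6.5, 87.8, 3.9, 0.99, 0.92, 0.87, 0.82, 0.67, 0.6],
    "NanoSim-H Dataset2": [13.28, 6.6, 87.2, 3.2, 1, 0.93, 0.87, 0.83, 0.64, 0.59],
    "NanoSim-H Dataset3": [18.77, 9.6, 82.7, 3.4, 1, 0.9, 0.86, 0.8, 0.62, 0.55],
    "DeepSimulator":      [10.84, 0.01, 89.4, 1.5, 1, 1, 0.84, 0.84, 0.73, 0.73],
}


def absolute_gaps(metric):
    """Return {simulator: |simulated - real|} for one metric row."""
    i = METRICS.index(metric)
    return {name: abs(values[i] - REAL_NEWLOT[i])
            for name, values in SIMULATED.items()}


def closest_simulator(metric):
    """Return the simulator with the smallest gap to the real run.

    Ties are resolved in favour of the first simulator listed.
    """
    gaps = absolute_gaps(metric)
    return min(gaps, key=gaps.get)


if __name__ == "__main__":
    for metric in METRICS:
        gaps = absolute_gaps(metric)
        best = closest_simulator(metric)
        print(f"{metric:12s} closest: {best} (gap {gaps[best]:.2f})")
```

The gaps are kept per metric rather than averaged, because the rows mix percentages with fractions between 0 and 1 and a single pooled score would be dominated by the percentage rows.
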
Metrics | run1 | run1 (sim.) | run2 | run2 (sim.) | run3 | run3 (sim.) | reBasecalling | reBasecalling (sim.) | Urban | Urban (sim.) |
---|---|---|---|---|---|---|---|---|---|---|
Error rate | 17.41 | 17.4 | 16.39 | 16.45 | 14.24 | 14.35 | 11.64 | 11.2 | 10.59 | 10.6 |
%unmapped | 16.8 | 15.76 | 0.7 | 4.9 | 2.6 | 3.9 | 4.5 | 3.7 | 0 | 1.6 |
%identity | 84.2 | 84.5 | 84.8 | 85 | 86.8 | 86.8 | 88.8 | 89.3 | 89.8 | 89.9 |
SD | 4.7 | 4.6 | 4.3 | 4.6 | 5.2 | 5 | 4.9 | 4.5 | 3.6 | 3.8 |
precision_Z | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
recall_Z | 0.83 | 0.84 | 0.99 | 0.95 | 0.97 | 0.96 | 0.95 | 0.96 | 1 | 0.98 |
precision_R | 0.94 | 0.92 | 0.96 | 0.93 | 0.95 | 0.94 | 0.93 | 0.91 | 0.94 | 0.92 |
recall_R | 0.8 | 0.81 | 0.95 | 0.91 | 0.94 | 0.92 | 0.9 | 0.89 | 0.94 | 0.9 |
precision_S | 0.71 | 0.73 | 0.78 | 0.77 | 0.81 | 0.81 | 0.76 | 0.79 | 0.69 | 0.75 |
recall_S | 0.6 | 0.61 | 0.77 | 0.73 | 0.78 | 0.78 | 0.72 | 0.76 | 0.69 | 0.74 |
Name | #Reads | Lot | Basecaller | Error Rate (%) | Mean Length (bp) | SD (bp) | Reference |
---|---|---|---|---|---|---|---|
run1 | 2,388,682 | V1 | Albacore v2.1.7 | 17.41 | 1378.28 | 571.24 | this study |
run2 | 3,263,535 | V1 | Albacore v2.1.7 | 16.39 | 1464.85 | 144.34 | this study |
run3 | 1,108,390 | V1 | guppy_fast v3.4.5 | 14.24 | 1399.014 | 289.30 | this study |
newLot | 1,339,249 | V2 | guppy_fast v3.4.5 | 14.45 | 1362.24 | 331.43 | this study |
reBasecalling | 1,487,976 | V2 | guppy_hac v3.4.5 | 11.64 | 1364.99 | 381.26 | this study |
Urban | 361,582 | V2 | guppy_fast v3.1.5 | 10.59 | 1470.81 | 25.62 | [7] |
Lot V1 | Expected | run1 | run2 | run3 | Simulation |
---|---|---|---|---|---|
Lactobacillus fermentum | 0.19 | 0.09 | 0.09 | 0.08 | 0.08 |
Listeria monocytogenes | 0.16 | 0.16 | 0.16 | 0.14 | 0.14 |
Bacillus subtilis | 0.16 | 0.21 | 0.21 | 0.21 | 0.21 |
Staphylococcus aureus | 0.13 | 0.18 | 0.18 | 0.17 | 0.18 |
Salmonella enterica | 0.11 | 0.14 | 0.14 | 0.16 | 0.16 |
Enterococcus faecalis | 0.10 | 0.12 | 0.12 | 0.10 | 0.10 |
Escherichia coli | 0.10 | 0.10 | 0.09 | 0.11 | 0.12 |
Pseudomonas aeruginosa | 0.05 | 0.00 | 0.00 | 0.02 | 0.01 |

Lot V2 | Expected | NewLot | reBasecalling | Simulation | Urban | Simulation |
---|---|---|---|---|---|---|
Lactobacillus fermentum | 0.18 | 0.10 | 0.10 | 0.10 | 0.05 | 0.05 |
Listeria monocytogenes | 0.14 | 0.13 | 0.13 | 0.13 | 0.07 | 0.07 |
Bacillus subtilis | 0.17 | 0.18 | 0.18 | 0.18 | 0.13 | 0.13 |
Staphylococcus aureus | 0.16 | 0.18 | 0.18 | 0.18 | 0.09 | 0.09 |
Salmonella enterica | 0.10 | 0.15 | 0.15 | 0.15 | 0.30 | 0.30 |
Enterococcus faecalis | 0.10 | 0.09 | 0.09 | 0.08 | 0.05 | 0.05 |
Escherichia coli | 0.10 | 0.16 | 0.16 | 0.16 | 0.30 | 0.30 |
Pseudomonas aeruginosa | 0.04 | 0.02 | 0.02 | 0.02 | 0.01 | 0.01 |
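
One way to summarize how closely two abundance profiles in the Lot V2 rows above agree is the sum of absolute per-species differences (an L1 distance). The sketch below is only an illustration using values transcribed from those rows; the choice of L1 distance and the names `PROFILES` and `l1_distance` are assumptions made here, not a metric taken from the article.

```python
# Lot V2 relative abundances transcribed from the rows above, in species order:
# L. fermentum, L. monocytogenes, B. subtilis, S. aureus,
# S. enterica, E. faecalis, E. coli, P. aeruginosa.
PROFILES = {
    "Expected":          [0.18, 0.14, 0.17, 0.16, 0.10, 0.10, 0.10, 0.04],
    "NewLot":            [0.10, 0.13, 0.18, 0.18, 0.15, 0.09, 0.16, 0.02],
    "NewLot simulation": [0.10, 0.13, 0.18, 0.18, 0.15, 0.08, 0.16, 0.02],
    "Urban":             [0.05, 0.07, 0.13, 0.09, 0.30, 0.05, 0.30, 0.01],
    "Urban simulation":  [0.05, 0.07, 0.13, 0.09, 0.30, 0.05, 0.30, 0.01],
}


def l1_distance(name_a, name_b):
    """Sum of absolute per-species abundance differences between two profiles."""
    a, b = PROFILES[name_a], PROFILES[name_b]
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    pairs = [("NewLot", "NewLot simulation"),
             ("Urban", "Urban simulation"),
             ("Expected", "NewLot")]
    for a, b in pairs:
        print(f"{a} vs {b}: {l1_distance(a, b):.2f}")
```

With these rounded values, each simulated profile sits much closer to its real run than the real run sits to the expected composition, which is the pattern the table displays.
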
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mesloub, Y.; Beury, D.; Vandermeeren, F.; Caboche, S. CuReSim-LoRM: A Tool to Simulate Metabarcoding Long Reads. Int. J. Mol. Sci. 2023, 24, 14005. https://doi.org/10.3390/ijms241814005