Opinion

Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines

by
Andres S. Espindola
Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
Biology 2024, 13(9), 700; https://doi.org/10.3390/biology13090700
Submission received: 23 July 2024 / Revised: 3 September 2024 / Accepted: 3 September 2024 / Published: 6 September 2024
(This article belongs to the Section Bioinformatics)

Simple Summary

Validation of diagnostic assays for plant pathogens requires positive and negative controls. High Throughput Sequencing (HTS) has shown the capability to detect multiple pathogens simultaneously. However, accurate pathogen identification from HTS data depends on the bioinformatic pipeline used, and there is currently no consensus on the best pipeline, leading to inconsistent results. To address this, standardized artificial HTS datasets are proposed as benchmarks for evaluating the performance of the bioinformatic pipelines used in HTS-based detection. These datasets will help resolve challenges such as the unknown sensitivity and specificity of a given bioinformatic tool, contributing to the advancement of plant pathogen detection using HTS.

Abstract

The validation of diagnostic assays in plant pathogen detection is a critical area of research. It requires both negative and positive controls containing a known quantity of the target pathogen, which are crucial when calculating analytical sensitivity and specificity, among other diagnostic performance metrics. High Throughput Sequencing (HTS) allows the simultaneous detection of a theoretically unlimited number of plant pathogens. However, accurately identifying a pathogen from HTS data depends directly on the bioinformatic pipeline utilized and its effectiveness at correctly assigning reads to their associated taxa. To date, there is no consensus on the pipeline that should be used to detect pathogens in HTS data, and results often undergo review and scientific evaluation. It is, therefore, imperative to establish HTS resources tailored for evaluating the performance of bioinformatic pipelines utilized in plant pathogen detection. Standardized artificial HTS datasets can serve as a benchmark by allowing users to test their pipelines against various pathogen infection scenarios, among the most prevalent being multiple infections, low-titer pathogens, mutations, and new strains. Placing these artificial HTS datasets in the hands of HTS diagnostic assay validators can help resolve challenges encountered when implementing bioinformatic pipelines for routine pathogen detection. Offering purely artificial HTS datasets as benchmarking tools will significantly advance research on plant pathogen detection using HTS and enable a more robust and standardized evaluation of bioinformatic methods, thereby enhancing the field.

1. Introduction

High-throughput sequencing (HTS) technologies, coupled with the reduction in sequencing costs, are revolutionizing the field of plant diagnostics by enabling rapid, cost-effective, and simultaneous pathogen detection in a single sample with promising sensitivity and high specificity [1,2]. The adoption of HTS in plant health diagnostics represents a significant advancement since the introduction of PCR-based detection in the late 1980s, offering a powerful method for comprehensive pathogen detection without prior assumptions about the organisms present [3]. HTS techniques such as amplicon sequencing (also known as metabarcoding) [4] and shotgun metagenomics have been applied across a variety of plant pathogens, including bacteria, viruses, fungi, and nematodes [5,6]. A clear example of how HTS has revolutionized plant diagnostics is its ability to detect any plant virus in a sample without prior knowledge, thereby facilitating the discovery of new plant viruses [7]. HTS has been successfully applied for virus and viroid discovery in various agricultural crops, leading to its adoption in routine pathogen detection [8,9]. The application of HTS in plant pathology has allowed for the identification of viruses, bacteria, fungi, and oomycetes infecting many agricultural crops [10,11,12,13,14,15,16,17,18,19,20].
The versatility of HTS extends to the identification of specific pathogens at various taxonomic levels and groups, as demonstrated by tools like MARPLE, which enable rapid identification of individual pathogen strains directly from field-collected infected plant tissue [21]. Furthermore, portable HTS technologies like the MinION from Oxford Nanopore Technologies have opened new avenues for point-of-care detection and identification of plant pathogens, offering rapid and accurate identification of dominant pathogenic organisms from plant tissues [22,23]. These technologies have shown reproducibility and sensitivity in detecting various plant pathogens, including viruses and viroids [8,9].
Despite its advantages, the implementation of HTS in routine plant diagnostics is fraught with challenges. One major hurdle is the lack of standardized bioinformatic pipelines and standard methodological procedures for analyzing the output data [24]. Although some government regulatory agencies have begun to offer guidelines for evaluating HTS pathogen diagnostics, covering both wet-lab procedures and bioinformatics pipelines, these guidelines remain quite general [25,26]. This may be due to a shortage of peer-reviewed literature recommending methods for evaluating these pipelines. The lack of consensus has been associated with the absence of trustworthy controls or HTS datasets that could serve as a benchmarking method for bioinformatic pipelines [27]. Such controls should contain known amounts of pathogens and must include all potential artifacts generated by the various variables of the sequencing platform and the host–pathogen sample. To address this, researchers created a collection of 18 datasets, including semi-artificial, real, and fully artificial datasets, to test different aspects and challenges of plant virus detection from HTS data [27]. However, a limitation of that study was that all datasets were sequenced and/or simulated using an Illumina system, most on a four-channel system. This could be problematic when applying the benchmarked pipelines to datasets from two-channel systems, which are known to produce erroneous guanine base calls [27]. Additionally, the use of real sequencing datasets is a great limitation when benchmarking these pipelines because the absolute amount of pathogen in the sample cannot be known with certainty [27].
The scientific community and stakeholders clearly require a comprehensive methodology for creating HTS datasets that can be used to benchmark the bioinformatic pipelines used in plant health diagnostics. Although the literature on using HTS datasets for this type of benchmarking is scarce, such datasets will play a pivotal role in ensuring the robustness and reliability of pathogen detection methods, ultimately contributing to better disease surveillance and management. Here, we explore the importance of and considerations for developing mock standardized HTS datasets to benchmark bioinformatic pathogen detection pipelines, and discuss the drawbacks and benefits of artificial datasets as benchmarking controls for bioinformatic pipelines used in plant health diagnostics.

2. Types of Datasets

Two main types of reference datasets can be used: real and artificial. Real datasets derived from actual biological samples provide the realistic scenarios encountered by researchers, but their use is limited because the “true” composition of the samples is never known with absolute certainty. Artificial datasets, created entirely in silico, allow complete control over their composition, but they may not accurately reflect the complexities of real HTS data. To overcome these limitations, semi-artificial datasets, which combine real HTS data with artificially generated reads, have been used in the past [27]. Researchers should consider how incorporating the complexity of real HTS data affects the diagnostic performance metrics computed by bioinformatic pipelines. Analytical sensitivity and analytical specificity are fundamental performance metrics for any diagnostic assay. Analytical sensitivity refers to the ability of an assay to accurately detect low concentrations of the target analyte (pathogen reads), while analytical specificity pertains to the ability of the assay to correctly identify the target analyte without interference from other substances [28,29]. For PCR, analytical sensitivity is often assessed by constructing a serial dilution of pathogen nucleic acids in water. To evaluate HTS diagnostic assays, by contrast, we have suggested serial dilutions of pathogen nucleic acids in a background of host nucleic acids, or the use of infected samples in which the pathogen concentration has been quantified by real-time PCR [30,31]. However, the costs of library preparation and sequencing do not yet permit a full experimental design for each pathogen; artificial datasets are therefore imperative when calculating analytical sensitivity. The most fundamental analytical sensitivity experiment uses a matrix that does not interact with the analyte (host and pathogen). However, the sample matrix can influence analytical sensitivity, so subsequent experiments should include a more complex matrix. Multiple factors may shape the matrix, depending on the sampling technique, host, environmental conditions, nucleic acid extraction, enrichment, and library preparation methods. Recreating all possible scenarios is not cost-efficient; therefore, the most prevalent matrix should be used. For example, if a host–pathogen combination is often found as a co-infection or multiple infection, the matrix must include that interaction. Accordingly, at different tiers of validation, the use of artificial or real HTS datasets is interchangeable (Table 1). A sketch of how such an in silico dilution series can be assembled is shown after this paragraph.
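As a concrete illustration, the following minimal Python sketch mixes a known number of simulated pathogen reads into a host-read background at decreasing fractions while holding the library size constant. It is not the method of any specific pipeline: the FASTQ file names and the `load_reads` helper are placeholders, and the read pools are assumed to have been generated beforehand (e.g., with one of the simulators discussed in Section 3).

```python
import random

def load_reads(path):
    """Load a FASTQ file into a list of 4-line records (id, seq, +, qual)."""
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

# Hypothetical pre-simulated read pools (e.g., produced by ART or CAMISIM).
host_reads = load_reads("host_reads.fastq")
pathogen_reads = load_reads("pathogen_reads.fastq")

TOTAL_READS = 100_000                        # fixed library size per dilution point
DILUTIONS = [1e-2, 1e-3, 1e-4, 1e-5]         # pathogen read fractions

random.seed(42)  # reproducibility: the same series can be regenerated exactly

for fraction in DILUTIONS:
    n_pathogen = int(TOTAL_READS * fraction)
    n_host = TOTAL_READS - n_pathogen
    # Sample with replacement so small pools can still fill large libraries.
    library = (random.choices(pathogen_reads, k=n_pathogen)
               + random.choices(host_reads, k=n_host))
    random.shuffle(library)  # interleave host and pathogen reads
    out = f"dilution_{fraction:.0e}.fastq"
    with open(out, "w") as fh:
        for record in library:
            fh.write("\n".join(record) + "\n")
    print(f"{out}: {n_pathogen} pathogen reads out of {TOTAL_READS}")
```

Because the exact pathogen read count at each dilution point is recorded, the resulting files provide ground truth against which a pipeline's limit of detection can be estimated.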
Do we need to replicate the intricacies of a real sample exactly to generate exact analytical sensitivity and specificity values? If so, how do analytical sensitivity and specificity values differ between real and artificial scenarios, and are these differences significant? A hypothesis that needs to be addressed is therefore whether the disconnect between artificial HTS datasets and the real-world complexity reflected in real HTS datasets can lead to an overestimation or underestimation of the analytical sensitivity of an HTS test. Determining this difference in diagnostic performance metrics with the bioinformatic tools available for plant health HTS tests can be achieved as a proof of concept, and should span different taxonomic groups, sampling techniques, and seasonality, as well as wet-lab pipelines that could introduce the variation observed in real HTS datasets. Additionally, to make diagnostic performance metrics comparable, all bioinformatic pipelines used in the evaluation should generate positive/negative results as output. This may be challenging because many bioinformatic tools used for plant health diagnostics produce output that researchers or diagnosticians must interpret before deciding whether the pathogen is present in the sample [3]. Hence, existing bioinformatic pipelines should produce impartial and objective results that indicate positive or negative outcomes while still allowing interpretation of the absolute or relative values generated by the pipeline. By addressing these gaps, researchers can determine whether artificial HTS datasets are sufficient, and the assessment of bioinformatic pipelines used for HTS diagnostic tests will become more consistent and scientifically rigorous.

3. Existing HTS Simulators for Benchmarking Bioinformatic Pipelines for Pathogen Detection

HTS simulators are essential for evaluating the performance of bioinformatic pipelines, especially for pathogen detection, where knowing the exact composition of a sample is crucial for accurate diagnosis. Here, we describe simulators, along with their strengths and limitations, that have shown potential for aiding in the simulation of HTS data that can ultimately be used for benchmarking bioinformatic pipelines used for pathogen detection (Table 2).
ART is a widely used simulator that can generate reads with both substitution and insertion-deletion errors, making it suitable for simulating data from various sequencing platforms like Illumina and 454 [32]. However, its limitations in simulating indels, particularly their distribution, can impact the accuracy of downstream analysis [33]. BEAR utilizes a machine-learning approach to generate reads with lengths and quality scores resembling empirically derived distributions from various sequencing platforms [42]. Its strength lies in its ability to emulate data from platforms like Ion Torrent, for which dedicated simulators are limited. BEAR also automates abundance profile generation, which is often arduous in metagenomic simulations [42]. CAMISIM is a highly modular metagenome simulator that stands out for its ability to model diverse microbial communities, including real and simulated strain-level diversity. Notably, CAMISIM can generate data from various sequencing technologies, including second- and third-generation platforms, making it suitable for benchmarking a broader range of bioinformatic pipelines. It also provides gold standards for various downstream analyses, such as assembly, binning, and taxonomic profiling, enabling comprehensive evaluation of metagenomic pipelines [34]. CuReSim is specifically designed to evaluate mapping algorithms used in HTS data analysis. Its customizable nature allows the generation of reads with varying lengths, error rates, and distributions of insertions, deletions, and substitutions, making it suitable for assessing the robustness of mapping algorithms under different error scenarios [35]. FASTQSim is a platform-independent simulator that characterizes HTS datasets and generates in silico reads with matching error profiles. It is particularly useful for evaluating metagenomic algorithms, as it allows the creation of spiked datasets with known concentrations of different organisms, enabling accurate assessment of their performance in complex mixtures [33]. Grinder is a versatile simulator capable of emulating data from various platforms, including Sanger, 454, and Illumina. While it allows the creation of multi-sample datasets, its application is limited to simulating differential abundances [36]. MetaSim focuses on simulating metagenomic data and supports the use of empirical error profiles. However, it does not generate quality values for the simulated reads, potentially limiting its utility for evaluating pipelines that rely on quality scores [37]. NeSSM is specifically designed for metagenomic simulations and incorporates realistic features like sequencing error models based on explicit error distributions and sequencing coverage bias. It provides tools to estimate these parameters directly from existing metagenomic data, enhancing the fidelity of the simulated data. NeSSM supports both 454 and Illumina platforms, and a GPU-accelerated version is available for faster simulations [38]; however, the link provided for the source code was not available when this article was written. MetaSPARSim is specifically designed to simulate 16S rRNA gene sequencing count data. It aims to address the lack of realistic simulation models for 16S data by capturing characteristics such as sparsity and compositionality commonly observed in real datasets [39].
ReSeq is a more recent simulator that aims to improve the realism of simulated Illumina gDNA sequencing data by incorporating features like systematic errors, a fragment-based coverage model, and sampling-matrix estimates [40]. NanoSim is specifically designed to generate realistic ONT read data and allows users to characterize their own datasets to tailor the simulations [41]. The choice of simulator depends on the specific research question, the sequencing platform, and the desired level of realism in the simulated data. It is important to note that while these simulators aim to capture the characteristics of real sequencing data, they may not fully encompass the complexities and biases inherent in actual experiments. It is therefore crucial to carefully consider the limitations of each simulator and interpret the results of benchmarking studies accordingly. We consider CAMISIM, ART, and BEAR to be the most promising simulators because they offer a combination of features highly relevant to biosecurity applications, including the ability to simulate realistic microbial communities, generate data from various sequencing platforms, incorporate sequencing errors, and provide gold standards for evaluating pipeline performance. These characteristics make them particularly promising for generating HTS datasets that could aid in assessing the limit of detection and other diagnostic performance metrics of HTS in pathogen detection, which is critical for developing and implementing effective biosecurity measures.
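As an example of how one of these simulators is typically driven, the short Python sketch below scripts a single ART run producing Illumina-like paired-end reads from a pathogen genome. It assumes `art_illumina` is installed and on the PATH; the input and output names are placeholders, and the flags follow ART's documented usage but should be verified against the installed version.

```python
import subprocess

# Hypothetical inputs: a pathogen reference genome and an output prefix.
reference = "reference.fasta"
out_prefix = "simulated_pathogen"

# art_illumina: HiSeq 2500 error profile (HS25), 150 bp paired-end reads,
# 20x coverage, 200 bp mean fragment size with 10 bp standard deviation.
cmd = [
    "art_illumina",
    "-ss", "HS25",      # built-in error/quality profile to emulate
    "-i", reference,    # input reference sequence(s)
    "-p",               # paired-end simulation
    "-l", "150",        # read length
    "-f", "20",         # fold coverage
    "-m", "200",        # mean fragment length
    "-s", "10",         # fragment length standard deviation
    "-o", out_prefix,   # prefix for the output FASTQ files
]
subprocess.run(cmd, check=True)
# Produces simulated_pathogen1.fq and simulated_pathogen2.fq, which can then
# be mixed with host-derived reads to build a benchmarking dataset.
```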
Table 2. Key attributes of HTS dataset simulators that are useful for assessing HTS detection pipelines.

| Simulator Name | Taxonomic Profiling (Metagenome) | Platform Errors | Genomic Variants | Sequencing Platform | Quality Score | Ref. | Latest Release Date [M/D/Y] |
|---|---|---|---|---|---|---|---|
| ART | Designed for single genomes but can be adapted for metagenomics. | Yes, simulates substitution and INDEL (insertion-deletion) errors. | Yes, with VarSim. | Illumina | Yes | [32] | 6/5/2016 |
| BEAR | Yes, specifically designed for metagenomics. | Yes, emulates characteristics from real data. | Not specified, but designed to work with metagenomic datasets that inherently contain variations. | Ion Torrent, 454, Illumina | Yes | [42] | 5/8/2020 |
| CAMISIM | Yes, can model different microbial abundance profiles, multi-sample time series, and differential abundance studies. | Yes, offers flexibility in simulating various error profiles. | Yes, includes real and simulated strain-level diversity. | Illumina, PacBio, Oxford Nanopore | Yes | [34] | 1/4/2022 |
| CuReSim | No. | Yes, allows adjustments to error distribution along reads. | Yes, can introduce insertions, deletions, and substitutions at a controlled rate. | Ion Torrent | Yes | [35] | 6/24/2015 |
| FASTQSim | Does not allow profile input, but has been used for metagenome simulations. | Yes, designed to be platform-independent and simulate various NGS datasets. | Yes. | Platform-independent | Yes | [33] | 11/15/2016 |
| Grinder | Yes, can simulate metagenomic data; user-defined profile or inferred from real HTS runs. | Yes, provides options for uniform, linear, and polynomial error models. | Not explicitly specified. | Sanger, 454, Illumina | Yes | [36] | 11/27/2016 |
| metaSPARSim | Yes, specifically designed for 16S rRNA gene sequencing data. | Yes, uses a multivariate hypergeometric distribution to model sequencing and simulate realistic sparsity and compositionality. | Not explicitly specified. | Not specified | Not specified | [39] | 12/1/2020 |
| MetaSim | Yes, explicitly designed for simulating metagenomic data. | Yes, supports user-defined parametric error models. | Not explicitly specified. | 454, Illumina | No | [37] | 10/8/2008 |
| NanoSim | Yes, a metagenomic simulation option has been added. | Yes. | Not explicitly specified. | Nanopore | No | [41] | 8/16/2024 |
| NeSSM | Yes, designed for metagenomic sequencing simulation. | Yes, incorporates sequencing error models based on the distribution of errors at each base and coverage bias. | Not explicitly specified. | 454, Illumina | Yes | [38] | 8/18/2024 |
| nfcore-ReadSimulator | Yes. | Yes, from ART and capsim. | Not explicitly specified. | Illumina, PacBio | Not specified | [43] | 4/26/2024 |
| ReadSim | Not explicitly specified. | Yes. | Not explicitly specified. | Nanopore, PacBio | Yes | [44] | 12/1/2014 |
| ReSeq | Yes. | Yes. | Yes. | Illumina, BGI | Yes | [40] | 12/1/2020 |

4. Diagnostic Performance Metrics with Artificial HTS Datasets

Currently, there is no agreement on what tiers of diagnostic validation are required for HTS-based plant health diagnostics, and the literature does not reflect a systematic validation approach that would yield comparable results for evaluating bioinformatic pipelines. Prior literature has, however, focused on a comprehensive validation approach encompassing key performance criteria tailored to the specific context of HTS technology. Rather than providing methodological procedures, generalized aspects have been suggested, adhering to established international standards for validation such as ISO 9000:2015, ISO/IEC 17025:2017, and EPPO standards (e.g., PM 7/98, PM 7/147) [45]. Such standards provide a framework for demonstrating that a test method consistently meets predefined requirements for its intended use. However, these standards were designed for conventional molecular assays and do not yet cover HTS tests. Researchers have employed a range of diagnostic performance metrics to evaluate the efficacy of HTS tests.

4.1. Analytical Sensitivity

Analytical sensitivity is a metric that reflects the ability of an HTS test to consistently detect a target pathogen at its lowest concentration. It is essentially the limit of detection (LoD) observed under ideal testing conditions [46]. Unlike for traditional diagnostic tests, there is no simple formula for HTS tests; determining analytical sensitivity for HTS requires establishing thresholds, including the minimum number of reads per sample, and determining contamination levels [45]. Several factors can influence the analytical sensitivity of an HTS test, including the number of reads generated from a sample, the level of contamination between samples, and the presence of co-infecting organisms.
When bioinformatic pipelines are used to determine the presence or absence of pathogens in an HTS test, the analyte we want to detect is measured as pathogen reads, percentage of genome coverage, or depth of coverage. Determining the detection threshold has been a subjective matter; however, given the nature of HTS and the likelihood of false positives arising from the background noise found in most HTS runs, a Limit of Blank (LoB) should be calculated. The LoB is the highest apparent analyte concentration expected when replicates of a sample containing no analyte are tested. It is determined by calculating the mean result and standard deviation of replicate analyses of a blank sample [47].
The LoB is used to determine the limit of detection (LoD), which represents the lowest concentration of an analyte that can be reliably distinguished from the LoB [47]. This is important in HTS tests, where analytical sensitivity, represented by the LoD, is essential for distinguishing true positives from background noise. While the LoB helps establish a baseline for analytical signals in the absence of an analyte, it serves as a starting point for estimating the LoD rather than as a direct measure of analytical sensitivity. The LoD is calculated by adding 1.645 times the standard deviation of a low-concentration sample to the LoB [47]. Once the LoD is established, it is confirmed by running samples containing the LoD concentration and analyzing the observed values. The LoB is just one factor considered when determining and verifying the analytical sensitivity of HTS tests. For instance, a higher number of reads increases the probability of detecting a target, but it also raises the risk of detecting contamination if the bioinformatic pipeline thresholds are not well validated. Sequencing depth is therefore a crucial metric when determining the analytical sensitivity of an HTS test; other factors, such as duplication rate and host sequence abundance, might also need consideration. Artificial HTS datasets can be beneficial in determining the analytical sensitivity of HTS-based plant pest diagnostic tests: since it is generally not feasible to have reference material encompassing every potential target and matrix combination, artificial datasets can fill this gap. Examples of HTS simulators suited to this purpose are those that allow the user to modify the taxonomic profile, such as BEAR, CAMISIM, Grinder, metaSPARSim, MetaSim, NanoSim, NeSSM, nfcore-ReadSimulator, and ReSeq [34,36,37,38,39,40,41,42,43]. Rarefaction can be performed on real HTS runs to generate subsamples that mimic lower pathogen read counts [48,49], but it is challenging to determine the original amount of pathogen reads, making it difficult to establish the actual total read output; trial-and-error experiments may then be necessary. Using HTS simulators that allow taxonomy modification provides a more accurate representation of the pathogen.
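To make the LoB/LoD arithmetic concrete, the following sketch applies the formulas above [47] to replicate measurements of blank and low-concentration samples; the read counts are invented for illustration only.

```python
from statistics import mean, stdev

# Hypothetical replicate measurements (pathogen-assigned reads per sample).
blank_runs = [0, 2, 1, 0, 3, 1, 2, 0]        # samples with no target analyte
low_conc_runs = [12, 9, 15, 11, 8, 13, 10]   # samples near the expected LoD

# Limit of Blank: highest apparent signal expected from a blank sample
# (mean + 1.645 SD covers ~95% of blank measurements) [47].
lob = mean(blank_runs) + 1.645 * stdev(blank_runs)

# Limit of Detection: lowest signal reliably distinguishable from the LoB,
# obtained by adding 1.645 SD of a low-concentration sample to the LoB [47].
lod = lob + 1.645 * stdev(low_conc_runs)

print(f"LoB = {lob:.2f} reads")  # detections below this level are noise
print(f"LoD = {lod:.2f} reads")  # candidate threshold for calling a positive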

4.2. Analytical Specificity

Analytical specificity, in its general sense, refers to the HTS test’s capability to detect all intended target pathogens (inclusivity) while avoiding the detection of unrelated pathogens (exclusivity) or host plant material (selectivity). This ensures the test is specifically targeting the desired organisms [46]. However, given the complexity of HTS tests for plant pest diagnostics and the complex interplay of factors, analytical specificity represents the ability of the HTS test to correctly identify the target organism at the desired taxonomic level without being misled by the presence of other organisms or genetic material in the sample. There is no specific metric or calculation for analytical specificity in HTS tests. However, the desired taxonomic resolution acts as a practical measure of analytical specificity. This means determining how effectively the HTS test can differentiate between organisms at the level required for its intended purpose, be it strain, species, genus, or family level identification. Artificial HTS datasets can play a significant role in evaluating the analytical specificity of HTS tests, particularly by addressing a key limitation of relying solely on real-world samples: their true composition is never fully known, making it challenging to determine whether a negative result is truly negative or a false negative. In contrast to real datasets, artificial HTS datasets are entirely controlled in their composition. This means researchers know exactly which organisms and variants are present and at what frequencies. This precise knowledge is essential for evaluating whether an HTS test can accurately identify all the intended targets while excluding closely related organisms or common contaminants. By using artificial datasets, researchers can isolate and understand the impact of specific factors on taxonomic resolution. These factors might include sequencing depth, the presence of closely related organisms, or the choice of bioinformatic parameters. The knowledge gained from these controlled experiments can then be applied to improve the design of HTS tests, optimize bioinformatic analysis pipelines, and minimize the risk of false positives or negatives.
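A minimal sketch of how inclusivity and exclusivity might be scored against an artificial dataset of known composition is shown below; the taxon names and the pipeline's detection output are placeholders invented for the example.

```python
# Ground truth from the artificial dataset (known composition).
target_taxa = {"Candidatus Liberibacter asiaticus", "Citrus tristeza virus"}
nontarget_taxa = {"Xanthomonas citri", "Citrus exocortis viroid"}

# Hypothetical output of the pipeline under evaluation.
detected = {"Candidatus Liberibacter asiaticus", "Citrus tristeza virus",
            "Xanthomonas citri"}

# Inclusivity: fraction of intended targets that were detected.
inclusivity = len(target_taxa & detected) / len(target_taxa)

# Exclusivity: fraction of non-targets that were correctly NOT detected.
exclusivity = len(nontarget_taxa - detected) / len(nontarget_taxa)

print(f"Inclusivity: {inclusivity:.0%}")  # 100% in this example
print(f"Exclusivity: {exclusivity:.0%}")  # 50%: one non-target was flagged
```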

4.3. Diagnostic Sensitivity

The diagnostic sensitivity, often expressed as a percentage, reveals the proportion of truly positive samples accurately identified by the HTS test in an experiment utilizing known positive controls. It provides insight into the test’s ability to minimize false negatives. At this validation stage, the positive controls should include the matrix in which the infection occurs. However, it is very difficult to have reference material for every possible target and matrix combination, given the vast diversity of organisms detectable by HTS tests [45]. The calculation of diagnostic sensitivity relies heavily on several factors, including detection thresholds. When tools such as reference mapping or read-binning algorithms are used to determine the presence or absence of the pathogen, a threshold must be selected. This selection should be made with care: lowering the detection threshold, for example by considering a single read a positive detection, can increase diagnostic sensitivity, but it can also inflate false positives, underscoring the importance of establishing appropriate thresholds. Additionally, the total number of reads generated, or sequencing depth, significantly impacts the ability to detect a target, particularly at low concentrations; a higher number of reads increases the likelihood of capturing target sequences, leading to higher sensitivity. Attempts have been made to generate thresholds for HTS tests; specifically, four genera of plant pathogens affecting citrus (a virus, two bacteria, and a viroid) were used as a proof of concept in which a quantitative discriminant analysis (QDA) was employed to establish a limit of detection (LoD) for each targeted pathogen. This LoD serves as a baseline score for determining positive diagnostic results. The LoD, representing the lowest detectable quantity of an analyte, is directly linked to the concept of analytical sensitivity. In HTS and other diagnostic assays, thresholds are often used to define the boundaries between positive and negative results based on factors like the LoD. The use of an LoD calculated through QDA suggests that HTS test validation can employ a threshold-based approach, with the calculated LoD functioning as a practical threshold for interpreting the significance of pathogen signals [31].
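For reference, these proportions follow the standard definitions used for diagnostic tests generally (not specific to HTS): diagnostic sensitivity DSe = TP / (TP + FN) × 100%, and, for the next subsection, diagnostic specificity DSp = TN / (TN + FP) × 100%, where TP, FN, TN, and FP are the counts of true positives, false negatives, true negatives, and false positives in a panel of samples of known status.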

4.4. Diagnostic Specificity

Diagnostic specificity refers to the ability of a diagnostic test to correctly identify samples that do not have the target pathogen. In simpler terms, it measures how well a test avoids giving false positive results. A high diagnostic specificity indicates that the test is good at ruling out the presence of the target condition when it is truly absent. Evaluating diagnostic specificity is crucial during the development of any diagnostic assay, particularly for HTS applications in plant health diagnostics. This is because HTS methods have the potential to detect a wide range of organisms in a sample, making it crucial to differentiate true positives from false positives.
One significant challenge in calculating diagnostic specificity for HTS-based diagnostics is the potential for false positives arising from various sources, such as contamination or cross-reactions with similar organisms. This challenge is amplified by the fact that the complete range of organisms that an HTS test might detect is often unknown. Simulated HTS datasets are invaluable tools for addressing this challenge and accurately determining the diagnostic specificity of bioinformatic pipelines. The controlled composition of these datasets enables researchers to precisely evaluate the pipeline’s performance in identifying only the target organisms and avoiding misclassifications. Simulated datasets are important because they provide a known negative background, which is often difficult to define with real-world samples where the presence or absence of all potential organisms is uncertain. By spiking known concentrations of target organisms, researchers can assess the pipeline’s ability to correctly identify these targets while ignoring unrelated sequences, thus providing a direct measure of diagnostic specificity. Mock datasets can be designed to include challenging scenarios, such as low virus titers, the presence of closely related organisms, or sequencing errors, to test the robustness of the pipeline’s specificity under these conditions.
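As an illustration of how both diagnostic metrics can be derived from a panel of simulated samples of known status, consider the sketch below; the sample labels and pipeline calls are invented for the example.

```python
# Ground-truth status of each simulated sample (True = pathogen present),
# paired with the pipeline's call for that sample.
panel = [
    # (truth, pipeline_call)
    (True, True), (True, True), (True, False),      # one false negative
    (False, False), (False, False), (False, True),  # one false positive
    (False, False), (True, True),
]

tp = sum(1 for truth, call in panel if truth and call)
fn = sum(1 for truth, call in panel if truth and not call)
tn = sum(1 for truth, call in panel if not truth and not call)
fp = sum(1 for truth, call in panel if not truth and call)

diagnostic_sensitivity = tp / (tp + fn)  # true positives correctly called
diagnostic_specificity = tn / (tn + fp)  # true negatives correctly called

print(f"Diagnostic sensitivity: {diagnostic_sensitivity:.0%}")  # 75%
print(f"Diagnostic specificity: {diagnostic_specificity:.0%}")  # 75%
```

Because the truth column is fully known for a simulated panel, both metrics are exact rather than estimates conditioned on an uncertain reference test.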

4.5. Precision

Precision in HTS diagnostics, which encompasses repeatability, intermediate precision, and reproducibility, ensures consistent outcomes across various operators, equipment, and laboratories. Simulated HTS datasets are valuable tools for evaluating precision because they allow researchers to control variables like pathogen prevalence and sequencing error rates. By generating datasets with predefined characteristics and analyzing them using different operators, equipment, or laboratories, researchers can assess the variability in results. Analyzing these variations provides insights into the robustness of the HTS test and helps establish the reliability of results across different testing environments.
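One simple way to quantify repeatability on a simulated dataset is the coefficient of variation of a reported signal across replicate runs, as sketched below with invented read counts.

```python
from statistics import mean, stdev

# Hypothetical pathogen read counts reported by the same pipeline when the
# same simulated dataset is processed repeatedly (e.g., by different
# operators or on different machines).
replicate_counts = [1042, 1038, 1051, 1029, 1047]

cv = stdev(replicate_counts) / mean(replicate_counts)
print(f"Coefficient of variation: {cv:.2%}")  # a low CV indicates high precision
```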

4.6. Robustness

Assay robustness refers to the ability to maintain precision when experiencing minor variations in protocol or environmental conditions. Although no specific formula exists for calculating robustness, it involves identifying potential sources of variation, such as changes in reagent concentrations, incubation times, thermocycling conditions, fluctuations in laboratory temperature, humidity, equipment calibration, and differences in operator expertise. For the HTS test, deliberate variations should be systematically introduced within a defined range around the standard protocol, and well-characterized positive and negative controls should be included to assess the impact on assay performance. The results are analyzed to evaluate the effect of variations on key performance metrics like sensitivity, specificity, and the limit of detection, and to define the range of variation that maintains acceptable assay performance. Any protocol deviations and their impacts should be clearly documented. Verification studies are emphasized to assess the impact of planned deviations from a validated method, ensuring the test’s robustness to modifications. In the context of HTS tests used for plant health diagnostics, simulated HTS datasets offer distinct advantages over real datasets, such as controlled composition, ground truth knowledge, reproducibility, and the ability to address specific diagnostic challenges. Simulated datasets allow complete control over variables like pathogen concentration, mutation frequencies, and contaminant presence, aiding in targeted benchmarking and validation of specific bioinformatic pipeline features. Researchers have full knowledge of the ‘true’ composition of simulated datasets, eliminating the uncertainty associated with real-world samples, thus facilitating accurate performance evaluation and unbiased comparison of different bioinformatic pipelines. Moreover, simulated datasets enable consistent, reproducible benchmarking conditions, unlike real datasets subject to natural variations and potential batch effects. Researchers can design simulated datasets to tackle particular diagnostic challenges, such as detecting low-abundance viruses, identifying novel strains, or disentangling complex infections, aiding in developing and evaluating pipelines tailored to these challenges.
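For example, rarefaction of a simulated library, one of the deliberate variations mentioned above, can be sketched as follows; the stand-in library is a placeholder, and real use would load FASTQ records instead.

```python
import random

def rarefy(records, depth, seed=0):
    """Subsample a read library to a fixed depth (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(records, depth)

# Stand-in library: 1,000 dummy read records (real use would load a FASTQ).
full_library = [f"read_{i}" for i in range(1000)]

# Re-run the detection pipeline at several depths to test robustness to
# library size; only the subsampling step is shown here.
for depth in (1000, 500, 100):
    subsample = rarefy(full_library, depth)
    print(f"depth {depth}: {len(subsample)} reads retained")
```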

5. Conclusions

In conclusion, simulated HTS datasets play a crucial role in evaluating and validating the diagnostic performance of HTS tests for plant pathogen detection. These datasets offer a controlled and reproducible environment for assessing key performance metrics such as analytical sensitivity, analytical specificity, and precision. By generating datasets with predefined characteristics, such as pathogen prevalence, genome complexity, and sequencing error rates, we can systematically evaluate the impact of these variables on the ability of HTS tests to accurately detect and identify plant pathogens. The wide array of available simulators makes it difficult to select the one best suited to determining the diagnostic performance metrics of HTS tests. In this article, we aimed to raise awareness of the complexities of evaluating HTS tests and to show that selecting the appropriate simulator, or having standardized HTS reference controls, is a potential solution to the lack of consensus on how to evaluate HTS tests for agricultural diagnostic purposes. This controlled approach helps researchers establish reliable detection thresholds, minimize false-positive and false-negative results, and ensure the robustness of HTS tests across different laboratories and testing conditions. The insights gained from analyzing simulated HTS datasets contribute significantly to the development and implementation of accurate and dependable HTS-based diagnostic tools for plant health management.

Funding

Funding was provided by the Oklahoma Agricultural Experiment Station and the Hatch project OKL03271, titled "Computational and Molecular Approaches to Detect Microbes in High-throughput Data".

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Soltani, N.; Stevens, K.A.; Klaassen, V.; Hwang, M.-S.; Golino, D.A.; Al Rwahnih, M. Quality Assessment and Validation of High-Throughput Sequencing for Grapevine Virus Diagnostics. Viruses 2021, 13, 1130.
2. Maina, S.; Zheng, L.; Rodoni, B.C. Targeted Genome Sequencing (TG-Seq) Approaches to Detect Plant Viruses. Viruses 2021, 13, 583.
3. Lebas, B.; Adams, I.; Al Rwahnih, M.; Baeyen, S.; Bilodeau, G.J.; Blouin, A.G.; Boonham, N.; Candresse, T.; Chandelier, A.; De Jonghe, K.; et al. Facilitating the Adoption of High-throughput Sequencing Technologies as a Plant Pest Diagnostic Test in Laboratories: A Step-by-step Description. Bull. OEPP 2022, 52, 394–418.
4. Piombo, E.; Abdelfattah, A.; Droby, S.; Wisniewski, M.; Spadaro, D.; Schena, L. Metagenomics Approaches for the Detection and Surveillance of Emerging and Recurrent Plant Pathogens. Microorganisms 2021, 9, 188.
5. Hu, X.; Hurtado-Gonzales, O.P.; Adhikari, B.N.; French-Monar, R.D.; Malapi, M.; Foster, J.A.; McFarland, C.D. PhytoPipe: A Phytosanitary Pipeline for Plant Pathogen Detection and Diagnosis Using RNA-Seq Data. BMC Bioinform. 2023, 24, 470.
6. Espindola, A.S.; Sempertegui-Bayas, D.; Bravo-Padilla, D.F.; Freire-Zapata, V.; Ochoa-Corona, F.; Cardwell, K.F. TASPERT: Target-Specific Reverse Transcript Pools to Improve HTS Plant Virus Diagnostics. Viruses 2021, 13, 1223.
7. Katsiani, A.; Maliogka, V.I.; Katis, N.; Svanella-Dumas, L.; Olmos, A.; Ruiz-García, A.B.; Marais, A.; Faure, C.; Theil, S.; Lotos, L.; et al. High-Throughput Sequencing Reveals Further Diversity of Little Cherry Virus 1 with Implications for Diagnostics. Viruses 2018, 10, 385.
8. Bester, R.; Cook, G.; Breytenbach, J.H.J.; Steyn, C.; De Bruyn, R.; Maree, H.J. Towards the Validation of High-Throughput Sequencing (HTS) for Routine Plant Virus Diagnostics: Measurement of Variation Linked to HTS Detection of Citrus Viruses and Viroids. Virol. J. 2021, 18, 61.
9. Maree, H.J.; Fox, A.; Al Rwahnih, M.; Boonham, N.; Candresse, T. Application of HTS for Routine Plant Virus Diagnostics: State of the Art and Challenges. Front. Plant Sci. 2018, 9, 1082.
10. Fajardo, T.V.M.; Silva, F.N.; Eiras, M.; Nickel, O. High-Throughput Sequencing Applied for the Identification of Viruses Infecting Grapevines in Brazil and Genetic Variability Analysis. Trop. Plant Pathol. 2017, 42, 250–260.
11. Amoia, S.S.; Chiumenti, M.; Minafra, A. First Identification of Fig Virus A and Fig Virus B in Ficus carica in Italy. Plants 2023, 12, 1503.
12. Maliogka, V.I.; Minafra, A.; Saldarelli, P.; Ruiz-García, A.B.; Glasa, M.; Katis, N.; Olmos, A. Recent Advances on Detection and Characterization of Fruit Tree Viruses Using High-Throughput Sequencing Technologies. Viruses 2018, 10, 436.
13. Al-helu, M.H.; Zhongtian, X.; Li, J.-M.; Lahuf, A.A. Next-Generation Sequencing-Based Detection Reveals Erysiphe Necator-Associated Virus 1 in Okra Plants. J. Kerbala Agric. Sci. 2024, 11, 205–213.
14. Kinoti, W.M.; Nancarrow, N.; Dann, A.; Rodoni, B.C.; Constable, F.E. Updating the Quarantine Status of Prunus Infecting Viruses in Australia. Viruses 2020, 12, 246.
15. Dang, T.; Espindola, A.; Vidalakis, G.; Cardwell, K. An In Silico Detection of a Citrus Viroid from Raw High-Throughput Sequencing Data. In Viroids: Methods and Protocols; Rao, A.L.N., Lavagi-Craddock, I., Vidalakis, G., Eds.; Springer: New York, NY, USA, 2022; Volume 2316, pp. 275–283. ISBN 9781071614648.
16. Proaño-Cuenca, F.; Espindola, A.S.; Garzon, C.D. Detection of Phytophthora, Pythium, Globisporangium, Hyaloperonospora and Plasmopara species in High-Throughput Sequencing data by in silico and in vitro analysis using Microbe Finder (MiFi®). PhytoFrontiers 2023, 3, 124–136.
17. Espindola, A.; Schneider, W.; Hoyt, P.R.; Marek, S.M.; Garzon, C. A New Approach for Detecting Fungal and Oomycete Plant Pathogens in Next Generation Sequencing Metagenome Data Utilising Electronic Probes. Int. J. Data Min. Bioinform. 2015, 12, 115–128.
18. Espindola, A.S.; Cardwell, K.; Martin, F.N.; Hoyt, P.R.; Marek, S.M.; Schneider, W.; Garzon, C.D. A Step Towards Validation of High-Throughput Sequencing for the Identification of Plant Pathogenic Oomycetes. Phytopathology 2022, 112, 1859–1866.
19. Stobbe, A.H.; Daniels, J.; Espindola, A.S.; Verma, R.; Melcher, U.; Ochoa-Corona, F.; Garzon, C.; Fletcher, J.; Schneider, W. E-Probe Diagnostic Nucleic Acid Analysis (EDNA): A Theoretical Approach for Handling of Next Generation Sequencing Data for Diagnostics. J. Microbiol. Methods 2013, 94, 356–366.
20. Bocsanczy, A.M.; Espindola, A.S.; Cardwell, K.; Norman, D.J. Development and Validation of E-Probes with the MiFi System for Detection of Ralstonia Solanacearum Species Complex in Blueberries. PhytoFrontiers 2023, 3, 137–147.
21. Radhakrishnan, G.V.; Cook, N.M.; Bueno-Sancho, V.; Lewis, C.M.; Persoons, A.; Mitiku, A.D.; Heaton, M.; Davey, P.E.; Abeyo, B.; Alemayehu, Y.; et al. MARPLE, a Point-of-Care, Strain-Level Disease Diagnostics and Surveillance Tool for Complex Fungal Pathogens. BMC Biol. 2019, 17, 65.
22. Loit, K.; Adamson, K.; Bahram, M.; Puusepp, R.; Anslan, S.; Kiiker, R.; Drenkhan, R.; Tedersoo, L. Relative Performance of MinION (Oxford Nanopore Technologies) versus Sequel (Pacific Biosciences) Third-Generation Sequencing Instruments in Identification of Agricultural and Forest Fungal Pathogens. Appl. Environ. Microbiol. 2019, 85, e01368-19.
23. Bronzato Badial, A.; Sherman, D.; Stone, A.; Gopakumar, A.; Wilson, V.; Schneider, W.; King, J. Nanopore Sequencing as a Surveillance Tool for Plant Pathogens in Plant and Insect Tissues. Plant Dis. 2018, 102, 1648–1652.
24. Kutnjak, D.; Tamisier, L.; Adams, I.; Boonham, N.; Candresse, T.; Chiumenti, M.; De Jonghe, K.; Kreuze, J.F.; Lefebvre, M.; Silva, G.; et al. A Primer on the Analysis of High-Throughput Sequencing Data for Detection of Plant Viruses. Microorganisms 2021, 9, 841.
25. Standards & Guidelines: Generation and Analysis of High Throughput Sequencing Data. Available online: https://www.agriculture.gov.au/agriculture-land/animal/health/laboratories/hts-standards-and-guidelines (accessed on 18 August 2024).
26. PM 7/151 (1) Considerations for the Use of High Throughput Sequencing in Plant Health Diagnostics. Bull. OEPP 2022, 52, 619–642.
27. Tamisier, L.; Haegeman, A.; Foucart, Y.; Fouillien, N.; Al Rwahnih, M.; Buzkan, N.; Candresse, T.; Chiumenti, M.; De Jonghe, K.; Lefebvre, M.; et al. Semi-Artificial Datasets as a Resource for Validation of Bioinformatics Pipelines for Plant Virus Detection. Peer Community J. 2021, 1, e53.
28. Saah, A.J.; Hoover, D.R. “Sensitivity” and “Specificity” Reconsidered: The Meaning of These Terms in Analytical and Diagnostic Settings. Ann. Intern. Med. 1997, 126, 91–94.
29. Mostafa, H.H.; Hardick, J.; Morehead, E.; Miller, J.-A.; Gaydos, C.A.; Manabe, Y.C. Comparison of the Analytical Sensitivity of Seven Commonly Used Commercial SARS-CoV-2 Automated Molecular Assays. J. Clin. Virol. 2020, 130, 104578.
30. Espindola, A.S.; Cardwell, K.F. Microbe Finder (MiFi®): Implementation of an Interactive Pathogen Detection Tool in Metagenomic Sequence Data. Plants 2021, 10, 250.
31. Dang, T.; Wang, H.; Espíndola, A.S.; Habiger, J.; Vidalakis, G.; Cardwell, K. Development and Statistical Validation of E-Probe Diagnostic Nucleic Acid Analysis (EDNA) Detection Assays for the Detection of Citrus Pathogens from Raw High Throughput Sequencing Data. PhytoFrontiers 2022, 3, 113–123.
32. Huang, W.; Li, L.; Myers, J.R.; Marth, G.T. ART: A Next-Generation Sequencing Read Simulator. Bioinformatics 2012, 28, 593–594.
33. Shcherbina, A. FASTQSim: Platform-Independent Data Characterization and in Silico Read Generation for NGS Datasets. BMC Res. Notes 2014, 7, 533.
34. Fritz, A.; Hofmann, P.; Majda, S.; Dahms, E.; Dröge, J.; Fiedler, J.; Lesker, T.R.; Belmann, P.; DeMaere, M.Z.; Darling, A.E.; et al. CAMISIM: Simulating Metagenomes and Microbial Communities. Microbiome 2019, 7, 17.
35. Caboche, S.; Audebert, C.; Lemoine, Y.; Hot, D. Comparison of Mapping Algorithms Used in High-Throughput Sequencing: Application to Ion Torrent Data. BMC Genom. 2014, 15, 264.
36. Angly, F.E.; Willner, D.; Rohwer, F.; Hugenholtz, P.; Tyson, G.W. Grinder: A Versatile Amplicon and Shotgun Sequence Simulator. Nucleic Acids Res. 2012, 40, e94.
37. Richter, D.C.; Ott, F.; Auch, A.F.; Schmid, R.; Huson, D.H. MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 2008, 3, e3373.
38. Jia, B.; Xuan, L.; Cai, K.; Hu, Z.; Ma, L.; Wei, C. NeSSM: A Next-Generation Sequencing Simulator for Metagenomics. PLoS ONE 2013, 8, e75448.
39. Patuzzi, I.; Baruzzo, G.; Losasso, C.; Ricci, A.; Di Camillo, B. MetaSPARSim: A 16S rRNA Gene Sequencing Count Data Simulator. BMC Bioinform. 2019, 20, 416.
40. Schmeing, S.; Robinson, M.D. ReSeq Simulates Realistic Illumina High-Throughput Sequencing Data. Genome Biol. 2021, 22, 67.
41. Yang, C.; Chu, J.; Warren, R.L.; Birol, I. NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization. Gigascience 2017, 6, gix010.
42. Johnson, S.; Trost, B.; Long, J.R.; Pittet, V.; Kusalik, A. A Better Sequence-Read Simulator Program for Metagenomics. BMC Bioinform. 2014, 15, S14.
43. Ewels, P.A.; Peltzer, A.; Fillinger, S.; Patel, H.; Alneberg, J.; Wilm, A.; Garcia, M.U.; Di Tommaso, P.; Nahnsen, S. The Nf-Core Framework for Community-Curated Bioinformatics Pipelines. Nat. Biotechnol. 2020, 38, 276–278.
44. Lee, H.; Gurtowski, J.; Yoo, S.; Marcus, S.; McCombie, W.R.; Schatz, M. Error Correction and Assembly Complexity of Single Molecule Sequencing Reads. bioRxiv 2014, 006395.
45. Massart, S.; Adams, I.; Al Rwahnih, M.; Baeyen, S.; Bilodeau, G.J.; Blouin, A.G.; Boonham, N.; Candresse, T.; Chandellier, A.; De Jonghe, K.; et al. Guidelines for the Reliable Use of High Throughput Sequencing Technologies to Detect Plant Pathogens and Pests. Peer Community J. 2022, 2, e62.
46. Groth-Helms, D.; Rivera, Y.; Martin, F.N.; Arif, M.; Sharma, P.; Castlebury, L.A. Terminology and Guidelines for Diagnostic Assay Development and Validation: Best Practices for Molecular Tests. PhytoFrontiers 2023, 3, 23–35.
47. Armbruster, D.A.; Pry, T. Limit of Blank, Limit of Detection and Limit of Quantitation. Clin. Biochem. Rev. 2008, 29, S49–S52.
48. Gaafar, Y.Z.A.; Ziebell, H. Comparative Study on Three Viral Enrichment Approaches Based on RNA Extraction for Plant Virus/Viroid Detection Using High-Throughput Sequencing. PLoS ONE 2020, 15, e0237951.
49. Pecman, A.; Kutnjak, D.; Gutiérrez-Aguirre, I.; Adams, I.; Fox, A.; Boonham, N.; Ravnikar, M. Next Generation Sequencing for Detection and Discovery of Plant Viruses and Viroids: Comparison of Two Approaches. Front. Microbiol. 2017, 8, 1998.
Table 1. Overview of assay performance metrics, aims, variables, and appropriate HTS dataset types.

| Performance Metric | Aim | Variables | HTS Dataset Type |
|---|---|---|---|
| Analytical sensitivity | Determine the lowest concentration of the target that is consistently detectable. | Quantitative: Limit of Detection (LoD), calculated as the Limit of Blank (LoB, derived from the mean and standard deviation of replicate blank tests) + 1.645 standard deviations of a low-concentration sample. Qualitative: the lowest concentration consistently detected as positive in repeated testing. | Real: serially diluted samples. Simulated: datasets with varying target concentrations. |
| Analytical specificity | Ensure the assay accurately detects all target variants (inclusivity) while excluding non-targets (exclusivity). | Inclusivity: percentage of target variants correctly identified. Exclusivity: percentage of non-target samples correctly identified as negative. Selectivity: ability to detect the target in the presence of background matrix. | Real: panels of target and non-target samples, including closely related species and environmental samples. Simulated: datasets containing a mix of target and non-target sequences with controlled variations. |
| Diagnostic sensitivity | Evaluate the assay's ability to correctly identify true positive samples. | Percentage of known positive samples correctly identified by the test. | Real: panels of samples with confirmed presence or absence of the target. Simulated: not suggested. |
| Diagnostic specificity | Assess the assay's ability to correctly identify true negative samples. | Percentage of known negative samples correctly identified. | Real: panels of samples with confirmed absence of the target. Simulated: needed because they provide a known negative background; intentional inclusion of organisms closely related to the target is suggested (see precision). |
| Precision | Closeness of agreement between independent test results obtained under specified conditions. | Repeatability: variation when the same operator conducts the assay on the same sample multiple times. Intermediate precision: agreement between results from multiple operators or instruments within a lab. Reproducibility: variation in results when the assay is performed in different laboratories and by different operators. | Real: replicate testing of the same samples. Simulated: datasets containing a mix of target and non-target sequences with controlled variations and concentrations. |
| Robustness | Maintain precision despite minor variations in factors; evaluates the assay's performance under variable conditions. | Deliberate variations within a defined range; for bioinformatic pipelines, most variations come from the operator and can be intentionally introduced, e.g., read filtering, library size (rarefaction), slight variation in pathogen reads, or incorporation of closely related organisms. Typically assessed through ring tests involving multiple laboratories. | Real: testing under various conditions and with minor protocol deviations; the possible variations are more limited. Simulated: variations in pathogen read abundance, library size, read filtering, and others can be introduced without restriction. |