1. Introduction
Foodborne pathogens remain a major health concern and economic burden in the United States. The Centers for Disease Control and Prevention (CDC) estimates that 48 million people are sickened by foodborne illnesses each year, resulting in 128,000 hospitalizations [
1]. Many of those hospitalizations are due to infection by Shiga toxin-producing
Escherichia coli (STEC) and
Listeria monocytogenes [
1]. In the U.S. in 2021, pathogen contamination resulted in 47 meat recalls totaling over 15 million pounds of meat; two recalls were due to STEC contamination and five due to
Listeria [
2]. It is estimated that foodborne disease costs the U.S. economy approximately
$17 billion annually [
3].
Infection with STEC can result in hemorrhagic colitis and hemolytic uremic syndrome [
4]. Serotypes of
E. coli are determined based on the polysaccharide O-antigen in the lipopolysaccharide outer membrane and the H-antigen on the flagella [
5]. The STEC serotype most frequently associated with outbreaks is O157:H7 [
1]. The U.S. Department of Agriculture Food Safety and Inspection Service (USDA FSIS) isolates and identifies STEC in meat through a combination of culturing, molecular methods, O typing, and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry [
6], and the process takes four days to complete. Confirmation of
E. coli O157:H7 is through the detection of the virulence genes
eae,
stx1,
stx2, and
fliC, and the ribosomal 16S rRNA gene,
rrsC, using quantitative PCR (qPCR) [
6]. The
eae gene product is intimin, which mediates enterocyte colonization [
4], and
fliC encodes the flagellar H-antigen determinant [
7]. Expression of genes
stx1 and
stx2 produces Shiga toxins 1 and 2, respectively, which are responsible for surface localization and cytotoxicity [
4]. The U.S. Food and Drug Administration’s (FDA) method to test other food and beverages for
E. coli takes at least 5 days and involves an RT-PCR screen, culture confirmation, antisera testing, RT-PCR confirmation testing, and characterization with pulsed-field gel electrophoresis and whole genome sequencing [
8].
Listeriosis, caused by
L. monocytogenes, generally results in gastroenteritis, but some cases result in sepsis, meningitis, or, in pregnant women, fetal infection [
9]. Isolation and confirmation of
L. monocytogenes from meat, egg, or environmental samples by FSIS involves a combination of culturing, molecular methods, and MALDI-TOF mass spectrometry and takes six days to complete [
10]. Molecular identification methods generally target the
hly gene, which encodes the hemolysin, listeriolysin O [
11]. The FDA method involves culture isolation, biochemical or RT-PCR confirmation, and serological and genetic subtyping and takes at least 5 days, but likely more depending on the tests selected and incubation times needed [
12].
Advances in whole genome sequencing technology have led to third-generation, or long-read, sequencing, which could significantly reduce the time needed to identify foodborne pathogens compared with current culture-based methods. Oxford Nanopore Technologies’ (ONT) MinION device sequences RNA or DNA by detecting changes in electrical current as the strands of nucleic acid pass through nanopores on a flow cell [
13]. Long reads are generated that facilitate genome assembly [
14], and real-time analysis allows pathogen detection to be accomplished in hours instead of days if samples are sequenced directly [
15]. Including a 24-h growth enrichment prior to sequencing may enhance detection and would still reduce the time needed to positively identify samples relative to current methods. Additionally, the small, portable sequencers allow whole genome analysis to be conducted outside of traditional laboratories, and the cost is generally lower than second-generation sequencing. Whole genome sequencing is also advantageous for serotype resolution and antibiotic resistance monitoring. However, a disadvantage of nanopore sequencing is that it is more error-prone than second-generation sequencing, although accuracy is rapidly improving [
15].
Simulator software has been developed to assist with the planning and data analysis of long-read sequencing experiments. NanoSim was developed to simulate Oxford Nanopore reads and error rates [
16], and a derivation of the program, NanoSim-H, incorporates several improvements and bug fixes [
17]. The software operates in two steps. First, the input genome is characterized by an alignment-based analysis, and a model is produced that incorporates errors and length distributions. Second, the model is used to simulate sequencing run reads [
16]. While nanopore sequencing is generally more affordable and faster than second-generation sequencing, novel method development can be expensive and time-consuming. Simulation results can guide method development to reduce the time and cost of benchwork. Therefore, the objectives of this project were to determine the number of reads needed to detect target genes in STEC and
L. monocytogenes and evaluate the influence of host genetic material on detection. The goal of this project was to use the simulation data to assess the feasibility of using long-read sequencing for metagenomic analysis of samples to detect foodborne pathogens without enrichment and to guide experimental design.
4. Discussion
The results of this project provide guidance for the development of foodborne pathogen detection programs using MinION sequencing. NanoSim-H software, an updated version of the NanoSim simulator, was used to simulate ONT sequence data because of its convenience and specificity. NanoSim is a rapid, scalable simulator of ONT sequencing technology-specific data that can be modified as ONT technology improves. NanoSim-H can run in R or Python. Additionally, NanoSim-H has been shown to simulate error events, fragment lengths, and alignment ratios of ONT reads more accurately than other simulation programs [
16]. NanoSim was benchmarked with DNA prepared using sequencing kits that fragment the DNA [
16], and simulation results are representative of using an ONT Rapid or Field Sequencing Kit. The program can be trained to provide simulated results with different library preparation kits, but as the Rapid and Field Sequencing Kits are simple and require minimal equipment, they would be the first choice for use in food safety testing. Therefore, NanoSim-H was used as benchmarked. The number of reads needed to detect virulence genes and to provide sufficient genome coverage was determined for STEC and
L. monocytogenes. For STEC, all seven virulence genes of interest were identified beginning with 2500 simulated reads, while the
L. monocytogenes virulence gene of interest,
hly, was detected starting with 500 simulated reads.
Genome coverage of 30x is desired for high-quality assemblies to ensure all regions of the genome are sequenced at least once, and that sequence variations can be distinguished from errors [
20]. However, for pathogen identification, 10x coverage would be sufficient as error rates were 5–6% (discussed more fully below). For STEC, coverage of 14x was obtained with 10,000 reads, and each virulence gene of interest was detected an average of 12 times, which would allow confident identification (
Figure 1). With 50,000 reads, 70x coverage was observed, and each virulence gene was identified an average of 61 times. A linear regression analysis indicated 30x coverage for a high-quality assembly would be obtained with 21,521 reads. For
L. monocytogenes, coverage of 12x was obtained with 5000 reads, and the virulence gene of interest was detected 15 times (
Figure 1). With 50,000 reads, coverage increased to 129x, and
hly was detected 124 times. A linear regression suggested 11,802 reads would provide 30x coverage. Fewer reads were needed to obtain more coverage of the
L. monocytogenes genome (2.9 Mb) because it is nearly half the size of the STEC genome (5.5 Mb). Additionally, only one gene was targeted in
L. monocytogenes, while seven genes were targeted in STEC (
Table 7). These results suggest that the number of sequences, and therefore time, needed to detect pathogens will depend on the genome size and number of genes targeted. Bacteria with smaller genomes or with fewer virulence genes of interest should require shorter sequencing times.
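The relationship between read count and coverage described above can be illustrated with a short calculation. The sketch below is only an approximation of the reported regression: it draws a straight line through the two (reads, coverage) points quoted in the text for each organism, whereas the regression in the study was presumably fitted to the full set of simulated read counts, which is not reproduced here. The function names are hypothetical.

```python
def fit_line(point_a, point_b):
    """Slope and intercept of the straight line through two (reads, coverage) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def reads_for_coverage(target_coverage, point_a, point_b):
    """Number of reads at which the two-point line reaches the target coverage."""
    slope, intercept = fit_line(point_a, point_b)
    return (target_coverage - intercept) / slope


# (reads, coverage) pairs quoted in the text; two points only, which is an
# assumption of this sketch and not the data set used for the reported regression.
stec_reads = reads_for_coverage(30, (10_000, 14), (50_000, 70))
listeria_reads = reads_for_coverage(30, (5_000, 12), (50_000, 129))

print(f"STEC: about {stec_reads:,.0f} reads for 30x")
print(f"L. monocytogenes: about {listeria_reads:,.0f} reads for 30x")
```

With only two points per organism, the estimates come out at roughly 21,400 and 11,900 reads, close to the 21,521 and 11,802 reads reported from the full regression.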
A simulation of MinION sequencing reads using a 1:1 ratio of STEC to bovine genomes was conducted, and predictably, adding the bovine genome made detecting STEC virulence genes of interest more difficult. Even with 1,000,000 simulated reads, 1x coverage of the STEC genome was not achieved, and the genes of interest were only detected an average of three times each. The majority of reads simulated with the mixed genomes did not align with the STEC reference. This is not surprising, considering the bovine genome is approximately 490 times larger than the STEC genome and would have a higher probability of being sequenced. In actual meat samples, the amount of bovine DNA in comparison to pathogen DNA would far exceed the 1:1 ratio used in this study. However, running simulations with a higher ratio of bovine to STEC DNA was not undertaken in this study due to the huge computing power and time required for such an experiment. Therefore, the results were scaled up mathematically to provide an estimate of the amount of sequencing time needed to detect STEC in a meat sample. The current FSIS protocol tests 325 g samples of raw ground beef for STEC [
6]. An estimated 1.08 × 10¹¹ bovine cells would be in 325 g of beef, assuming a mammalian cell mass of 3 ng [
21,
22]. The STEC genome is approximately 5.5 Mb, and the bovine genome is 2711 Mb; therefore, one STEC genome in the 325 g of meat would represent only 9.23 × 10⁻¹⁰% of the genomes in the sample. The average size of simulated reads was 9444 bp, and based on other studies [
23,
24,
25], an average of 80,638 reads per hour could be expected. This suggests that 7.82 × 10⁸ h of sequencing would be needed to obtain 1x coverage of the STEC genome. Therefore, obtaining 10x coverage to ensure there are no false negatives in detection would substantially increase the amount of time needed to 2.35 × 10¹⁰ h. This suggests that detection of foodborne pathogens with MinION sequencing would be impractical without enrichment for the bacteria of interest.
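The intermediate quantities in the scaling argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text and assumes one genome copy per bovine cell; the variable names are illustrative. It does not attempt to reproduce the final sequencing-time estimates, because the exact formula used to scale the simulation results to hours is not given in this section.

```python
SAMPLE_MASS_G = 325.0        # FSIS raw ground beef sample size
CELL_MASS_G = 3e-9           # assumed mammalian cell mass (3 ng)
STEC_GENOME_MB = 5.5
BOVINE_GENOME_MB = 2711.0
MEAN_READ_BP = 9444          # average simulated read length
READS_PER_HOUR = 80_638      # expected MinION output

# Number of bovine cells in the sample.
bovine_cells = SAMPLE_MASS_G / CELL_MASS_G

# How many times larger the bovine genome is than the STEC genome.
genome_size_ratio = BOVINE_GENOME_MB / STEC_GENOME_MB

# One STEC genome among the bovine genomes, as a percentage of genome count
# (assumption: one genome copy per cell; the single STEC genome added to the
# denominator is negligible).
stec_genome_percent = 100.0 / bovine_cells

# Sequencing throughput in bases per hour.
bases_per_hour = MEAN_READ_BP * READS_PER_HOUR

print(f"bovine cells:         {bovine_cells:.2e}")
print(f"genome size ratio:    {genome_size_ratio:.0f}")
print(f"STEC genomes (%):     {stec_genome_percent:.2e}")
print(f"bases per hour:       {bases_per_hour:.2e}")
```

These values match the 1.08 × 10¹¹ cells, the roughly 490-fold genome size difference, and the 9.23 × 10⁻¹⁰% figure stated in the text.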
Food samples are often enriched to multiply pathogen numbers and increase the probability of detection. In the current FSIS protocols, STEC samples are enriched in modified tryptone soy broth for 15–24 h [
6], which adds one day to the testing regime.
L. monocytogenes undergoes primary enrichment in modified University of Vermont broth for 20–26 h and then secondary enrichment in (3-N-morpholino) propanesulfonic acid-buffered
Listeria enrichment broth for 18–24 h [
10], adding two days to the protocol. Developing a method to detect pathogens without enrichment is a major goal, but even with enrichment, MinION sequencing could still reduce the amount of time needed to identify pathogens in a sample. Enrichment and plating would take two days, and DNA extraction, sequencing, and data analysis could be conducted on the third day. This would result in species confirmation one day faster than the current FSIS STEC protocol and three days faster than the
L. monocytogenes protocol. This time savings would allow meat products to arrive at the market more quickly, reducing losses due to spoilage.
A disadvantage of third-generation sequencing compared to second-generation is the higher error rate in sequencing data. Error rates of 5–20% have been reported [
26,
27]. In this study, the error rate was on the lower side of the reported estimates, with 5–6% of reads simulated from STEC or
L. monocytogenes genomes not aligning with their respective reference genomes due to the introduction of errors. Errors (mismatches, insertions, and deletions) were determined by the NanoSim-H program using statistical mixture models [
16]. Despite these inaccuracies, the simulated sequencing data were sufficient to identify the genes of interest in STEC and
L. monocytogenes. Additionally, the error rate of MinION sequencing appears to be improving substantially, as the latest MinION flow cells (version R10.4.1) are reported to deliver accuracy above 99% [
13,
28], which is on par with next-generation sequencing platforms that have error rates of less than 1% [
29]. ONT also released the Dorado basecaller in 2023, its newest program for converting the raw electrical signal into basecalls. The increased speed of Dorado will allow the use of higher-accuracy basecalling models in real time and further reduce error rates [
30]. Improved accuracy in long-read sequencing would be beneficial for differentiating between serotypes in bacteria, such as
Salmonella, where serotypes can differ by a single nucleotide polymorphism [
31] and inaccurate base calling would be problematic. Conventional methods of serotyping with antisera are time intensive, and sequencing could significantly reduce the time needed for identification.
Short-read sequencing with platforms such as Illumina has also been evaluated for foodborne pathogen detection, particularly in produce [
32,
33,
34]. Pathogens could be successfully identified, but short-read sequencing has disadvantages compared to long-read sequencing in rapid testing methods. It requires longer, more labor-intensive library preparation and larger equipment, which would make on-site applications unlikely and may be cost-prohibitive [
35]. Additionally, real-time analysis cannot be conducted with short-read sequencing, increasing the amount of time needed for pathogen identification [
35]. The MinION platform will also sequence any DNA in the library, both short and long DNA fragments, maximizing the sequencing data obtained. A challenge for all sequencing platforms will be the low quantity of pathogen DNA as compared to the host DNA. Methods will need to be optimized to address this issue during sample preparation prior to sequencing and during data analysis.
There were a few limitations in this study that could be addressed in future research. First, we were unable to obtain replicate sequences for a sample because the simulator would output the same set of reads for a particular genome input. Real sequencing runs would vary, even with replicate runs on the same DNA extraction, because only a portion of the sample is sequenced, and the genomic fragments available for sequencing would be different between runs, especially in metagenomic samples. It is possible that coding changes could allow for varied outputs. Another limitation was the size of the FASTA files that could be processed. As noted above, we only simulated a 1:1 E. coli:bovine mix because of the intensive computing power required, even using the U.S. Department of Agriculture high-performance computing system. As computational power advances, high-performance computers may be able to run these simulations, which would allow more realistic genome mixtures to be analyzed. We also limited this study to one E. coli and one Listeria serotype. However, food samples may be contaminated with multiple serotypes of one bacterial species or multiple bacterial species. Future simulation studies are needed to assess how species and serotypes can be differentiated.
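The replicate limitation noted above is typical of simulators that draw from a random number generator with a fixed seed: the same seed reproduces the same reads, while a different seed gives an independent replicate. The toy sketch below illustrates that idea only. It is not NanoSim-H code, the function name is hypothetical, and the exponential read-length distribution is an assumption chosen for simplicity.

```python
import random


def simulate_read_positions(genome_length, n_reads, mean_read_length, seed):
    """Toy read sampler: returns (start, length) pairs drawn with a seeded generator."""
    rng = random.Random(seed)  # all randomness flows from this seed
    reads = []
    for _ in range(n_reads):
        # Assumed exponential length distribution, capped at the genome length.
        length = min(genome_length, int(rng.expovariate(1.0 / mean_read_length)) + 1)
        # Uniform start position such that the read fits within the genome.
        start = rng.randrange(genome_length - length + 1)
        reads.append((start, length))
    return reads


# Genome size and mean read length are taken from the text (5.5 Mb, 9444 bp).
run_a = simulate_read_positions(5_500_000, 5, 9444, seed=1)
run_a_repeat = simulate_read_positions(5_500_000, 5, 9444, seed=1)
run_b = simulate_read_positions(5_500_000, 5, 9444, seed=2)

print("same seed reproduces the run:", run_a == run_a_repeat)
print("different seed gives a new replicate:", run_a != run_b)
```

Under this scheme, a set of replicate simulations would simply loop over several seed values and summarize coverage and gene detection across the resulting runs.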