1. Introduction
The cultivated oat (Avena sativa L.) genome has recently been sequenced, providing valuable insights into this healthy cereal crop [1,2]. Oats are recognized as an important source of carbohydrates, dietary soluble fiber, balanced protein, lipids, phenolic compounds, vitamins, and minerals, making them a promising functional food with diverse health benefits [3]. The oat genome is allohexaploid (AACCDD, 2n = 6× = 42), with six sets of chromosomes [1,2,4]. Its complexity, due to its hexaploid nature and mosaic-like architecture, has posed challenges for research and breeding [1,2,4]. The sequencing of the oat genome has significant implications for both agriculture and human nutrition. In agriculture, it provides greater knowledge of oat genomics, offering more opportunities for targeted improvements in yield, disease tolerance, and other characteristics of oats [1,2,3,4]. In human nutrition, oats are esteemed as a valuable source of nutrients, and they have been associated with various health benefits, including the mitigation of cardiovascular disease risks, inflammation, and type-2 diabetes [3,5,6].
The complexity of the oat genome emphasizes the indispensability of targeted sequencing in studying this crop [
1,
2]. Despite remarkable advances in sequencing technologies and bioinformatics techniques in recent years, conducting genome sequencing on a large scale to a sufficient depth remains challenging for plants with large and highly repetitive genomes like oats. Target capture based on hybridization offers a cost-effective means of attaining high depth coverage and identifying sequence variants in the coding and noncoding regions of very large genomes [
7,
8,
9,
10]. This approach involves a custom design of capture probes targeting specific chromosome regions harboring loci or candidate genes for traits of interest, enabling the highly flexible scaling of resequencing experiments from a few to many genes at a low cost for large plant populations [
7,
9]. Targeted gene enrichment utilizes synthetic DNA probes designed from reference sequences that are complementary to specific regions in genomes. These probes are attached to a substrate to facilitate the capture of targeted DNA regions. Subsequently, the captured DNA can undergo high-throughput sequencing without requiring universal primers [
11]. This technique is widely employed in human genomic research, phylogenetic studies, and evolutionary investigations [
12,
13]. Surprisingly, gene enrichment has not yet been explored in oat breeding and research. Several factors contribute to this; for example, (1) Oats have a complex polyploid genome, and the presence of multiple sets of chromosomes and extensive repetitive regions complicates the situation [
14]. This complexity makes it difficult to design effective baits for target capture and accurately identify genetic variants. (2) Developing and optimizing gene enrichment techniques require substantial technical expertise and financial investment. Oat research programs, often less funded compared to major crops like wheat or barley, may lack the resources needed to implement and refine these advanced methodologies. (3) Oat breeding has traditionally relied on conventional methods. The integration of molecular techniques, including gene enrichment and targeted sequencing, has been slower due to the established reliance on these conventional approaches. (4) Research priorities and funding are often directed towards crops with higher economic importance or those considered staple foods. As a result, oats, which are important but not among the top global crops, have seen less investment in advanced genomic technologies. (5) The last and most important factor is that, until recently, there was a lack of oat genomic resources. The lack of a high-quality reference genome and the limited availability of annotated gene sequences have hindered the development and application of targeted sequencing technologies. Recent advancements in oat genomics are beginning to address these gaps.
In this context, the myBaits technology, a hybridization capture system, has been used for the targeted next-generation sequencing of specific genomic regions of interest, providing a powerful and versatile tool for studying the genome [
9,
10,
11]. The myBaits technology provides targeted sequencing solutions for plant genomics. These kits use hybridization capture with biotinylated RNA baits to enrich specific genomic regions efficiently, providing deep insights into plant genomes [
10,
15]. Compared to traditional shotgun sequencing, the myBaits technology can make next-generation sequencing (NGS) an order of magnitude more efficient by enriching target molecules and removing non-target molecules, resulting in significant cost savings [
8,
12,
13]. Additionally, the myBaits Custom DNA-Seq kits are versatile and can accommodate various sample types like genomic DNA, metagenomic DNA, environmental DNA, ancient DNA, and more, making them ideal for gene or exon resequencing, novel variant discovery, phylogenetics, transgene detection, and other research applications in plant genomics [
2,
3,
4,
8,
12,
13]. When dealing with high coverage of short sequence reads from specific regions of a crop genome, the initial step involves aligning these reads to corresponding regions of a reference genome. Various mapping tools employ distinct algorithms to ensure the precise and efficient alignment of these short-sequence reads to the appropriate locations on the reference genome [
16,
17]. Polyploid crop genomes, such as that of oats, considerably increase the complexity of both read mapping and variant detection. Choosing appropriate algorithms is therefore essential for precision, accuracy, and reproducibility. In our study, we employed several variant-calling tools and focused on the variants detected via all of them to enhance the accuracy and reliability of the oat target capture results.
The objective of this study was to evaluate the efficacy and reliability of the myBaits target capture sequencing technology for variant detection in oat genomics. Specifically, the study aimed to utilize the myBaits technology using short-read sequencing to detect variants in specific regions of the oat genome and assess the reliability of identified variants through rigorous validation efforts.
3. Results
Three different aligners, BWA-MEM, Bowtie 2, and NGSEP, were evaluated using Illumina paired-end read target capture datasets from the 10 oat genotypes. The read statistics and mapping efficiencies are summarized in Table 1, which presents, for each genotype, the total number of reads generated, the reads that passed quality-control filtering, and the reads successfully mapped via each aligner. BWA-MEM consistently showed high mapping efficiency across all genotypes, with percentages ranging from 98.98% to 99.84%. Bowtie 2 and NGSEP also showed satisfactory mapping efficiencies, albeit with slight variations across genotypes. These results provide insights into aligner performance and aid the selection of the most suitable tool for subsequent oat genomics analyses. Differences in read-mapping efficiency matter for accurate variant detection in the complex oat genome. High mapping efficiency suggests robust handling of the repetitive regions and polyploid nature of oats, which is essential for reliable variant calling. However, the variation in mapping efficiency among mappers suggests potential difficulties in aligning reads for certain genotypes, which could introduce biases and inaccuracies in downstream analyses.
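The mapping efficiency reported above can be sketched as a simple ratio. The snippet below is a minimal illustration that assumes efficiency is computed as mapped reads divided by quality-filtered reads; the function name and the read counts are hypothetical and are not taken from Table 1.

```python
def mapping_efficiency(mapped_reads: int, qc_passed_reads: int) -> float:
    """Percentage of quality-filtered reads that an aligner mapped."""
    if qc_passed_reads <= 0:
        raise ValueError("qc_passed_reads must be positive")
    if not 0 <= mapped_reads <= qc_passed_reads:
        raise ValueError("mapped_reads must lie between 0 and qc_passed_reads")
    return 100.0 * mapped_reads / qc_passed_reads


# Hypothetical counts for one genotype; these are NOT the values in Table 1.
qc_passed = 1_000_000
mapped_by_aligner = {"BWA-MEM": 995_000, "Bowtie 2": 970_000, "NGSEP": 960_000}

for aligner, mapped in mapped_by_aligner.items():
    print(f"{aligner}: {mapping_efficiency(mapped, qc_passed):.2f}%")
```

If efficiency were instead defined relative to the total number of raw reads, only the denominator would change.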
Variant-calling results were obtained from five variant callers across the ten oat genotypes (Table 2). The GATK Haplotype Caller (GATK HC) detected varying variant counts, from 3816 for Symphony to 4411 for NOS 819111-120. In contrast, SAMtools mpileup identified fewer variants than GATK HC, ranging from 753 for Symphony to 2820 for NOS 819111-120. FreeBayes detected a broad range of variant counts across genotypes, from 544 for NOS 81937-11 to 4338 for NOS 81920-15. DeepVariant identified fewer variants than the other variant callers in the majority of genotypes, with counts ranging from 513 for NOS 81937-11 to 2185 for Symphony. Lastly, the NGSEP variant caller consistently detected variants across genotypes, ranging from 3223 for Symphony to 4325 for NOS 819111-120. The differences among variant callers show that the accuracy and confidence of the identified genetic variants must be interpreted with care. Variants identified exclusively via one caller and not via others are more likely to be false positives, highlighting the need for a consensus approach in variant detection. This is one of the reasons why we took the cross-caller consensus approach to minimize false positives and ensure that only high-confidence variants were considered. Moreover, for validation, we selected variants that were present in the two target genotypes and absent in the remaining genotypes, with the aim of increasing the stringency and reliability of variant detection.
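The selection logic described above (called by every caller, present in the target genotypes, absent elsewhere) can be sketched with plain set operations. This is only an illustration under stated assumptions: variants are represented as hypothetical (chromosome, position, ref, alt) tuples already parsed from each caller's output, "absent" is taken to mean "not reported by any caller", and the function names are ours rather than part of the study's pipeline.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]          # (chromosome, position, ref, alt)
Calls = Dict[str, Dict[str, Set[Variant]]]   # genotype -> caller -> variants


def consensus(per_caller: Dict[str, Set[Variant]]) -> Set[Variant]:
    """Variants reported by every caller for a single genotype."""
    sets = list(per_caller.values())
    return set.intersection(*sets) if sets else set()


def target_specific(calls: Calls, targets: Set[str]) -> Set[Variant]:
    """Consensus variants shared by all target genotypes and absent elsewhere.

    Assumption: a variant counts as present in a non-target genotype as soon
    as ANY caller reports it there, which is the strictest reading of "absent".
    """
    if not targets:
        return set()
    shared = set.intersection(*(consensus(calls[g]) for g in targets))
    elsewhere: Set[Variant] = set()
    for genotype, per_caller in calls.items():
        if genotype not in targets:
            for variants in per_caller.values():
                elsewhere |= variants
    return shared - elsewhere
```

A call such as `target_specific(calls, {"KF-318", "NOS 819111-70"})` would then return the candidate set from which validation variants could be picked.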
Upon comparing genotypes, we found 420 variants identified via all variant callers in the Symphony genotype. In addition, 1207 variants were identified exclusively via GATK HC, 948 exclusively via FreeBayes, 649 exclusively via NGSEP, and 20 exclusively via SAMtools mpileup (Figure 1). Variants identified via only one caller and absent from all others were considered likely false positives. For the Delfin genotype, 246 variants were detected via all callers, while 549 were uniquely identified via GATK HC, 437 via NGSEP, 6 via FreeBayes, and 63 via SAMtools mpileup (Figure 1). For the NOS 81920-15 genotype, 518 variants were detected via all callers, with 1253 variants uniquely identified via GATK HC, 936 via NGSEP, 719 via FreeBayes, and 19 via SAMtools mpileup. Similar patterns were observed for the other genotypes, as depicted in Figure 1. Across all genotypes, every variant identified via DeepVariant was also identified via at least one other caller; none was unique to DeepVariant. This suggests that DeepVariant is less prone to caller-specific false positives than the other callers. Together, these results provide insights into the performance of different variant callers in identifying genetic variants within oat genotypes.
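The shared and caller-specific counts summarized in Figure 1 correspond to a set comparison that can be sketched as follows. This is a minimal illustration, not the procedure used to draw the figure; the function name is hypothetical, and variants are assumed to be hashable keys such as (chromosome, position, ref, alt) tuples.

```python
from typing import Dict, Hashable, Set, Tuple


def shared_and_unique(
    per_caller: Dict[str, Set[Hashable]]
) -> Tuple[Set[Hashable], Dict[str, Set[Hashable]]]:
    """Return (variants found by all callers, variants unique to each caller)."""
    if not per_caller:
        return set(), {}
    shared = set.intersection(*per_caller.values())
    unique: Dict[str, Set[Hashable]] = {}
    for caller, variants in per_caller.items():
        # Union of everything reported by the other callers.
        others = set().union(
            *(v for name, v in per_caller.items() if name != caller)
        )
        unique[caller] = variants - others
    return shared, unique
```

A caller whose unique set is empty, as reported above for DeepVariant, has every one of its calls supported by at least one other caller.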
Results of the Validation of Targeted Variants
To validate the variants identified via all variant callers, we selected two SNP variants, one deletion variant, and one insertion variant present in genotypes KF-318 and NOS 819111-70 but absent in the remaining eight genotypes (Supplementary Figures S1–S4). We performed PCR amplification and Sanger sequencing to confirm the presence of the selected SNPs (2A_456055130 and 2A_455932982), which were consistently identified via all variant callers in genotypes KF-318 and NOS 819111-70. These SNPs are located on chromosome 2A at positions 456055130 and 455932982, respectively. In the target capture sequencing data, SNP 2A_456055130 appeared as a G in both KF-318 and NOS 819111-70, while it was a C in the remaining genotypes and the reference (Figure 2A). The total depth coverage for different genotypes ranged from 70 to 450, with a total depth coverage of 380 (540 reads) and 350 in KF-318 and NOS 819111-70, respectively. The genotype 819111-70 has a total coverage of 637 (906 reads) for this SNP. This SNP was not observed in any genotype other than KF-318 and NOS 819111-70 (Supplementary Figure S1). Similarly, SNP 2A_455932982 was identified as a T in both genotypes and a C in the reference and other genotypes, with a total depth coverage of 171 (308 reads) in KF-318 and 141 (257 reads) in NOS 819111-70 (Figure 2A). In contrast, the NOS 819111-120 genotype has a total depth coverage of 525 (947 reads) for this SNP and does not exhibit any heterozygous allele. This SNP was not observed in six of the remaining genotypes (Supplementary Figure S2), and two genotypes (Symphony and NOS 81920-15) had no coverage at this position. Neither SNP appeared heterozygous in KF-318 or NOS 819111-70 in the target capture data (Figure 2A,B).
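Whether a site looks homozygous or mixed in read data is commonly judged from the fraction of reads carrying the alternative allele. The sketch below illustrates that check under loud assumptions: the thresholds (`min_depth`, `low`, `high`) and the function name are hypothetical, the read counts are invented, and a simple two-threshold rule ignores the fact that in an allohexaploid, reads from homoeologous copies can also produce mixed signals.

```python
def classify_site(ref_reads: int, alt_reads: int,
                  min_depth: int = 20,
                  low: float = 0.1, high: float = 0.9) -> str:
    """Classify a site from reference/alternative read counts.

    Thresholds are illustrative assumptions, not values used in this study.
    """
    if ref_reads < 0 or alt_reads < 0:
        raise ValueError("read counts must be non-negative")
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "low_coverage"
    alt_fraction = alt_reads / depth
    if alt_fraction <= low:
        return "hom_ref"
    if alt_fraction >= high:
        return "hom_alt"
    return "mixed"


# Hypothetical counts: nearly all reads carry the alternative allele.
print(classify_site(ref_reads=2, alt_reads=400))
```

Under this rule, a site reported as homozygous in capture data but heterozygous in Sanger traces would be flagged as a discrepancy worth investigating.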
The Sanger sequencing of the PCR products flanking these SNPs revealed heterozygosity for SNP 2A_456055130 in both KF-318 and NOS 819111-70 (Figure 3A), contradicting the target capture data. Similarly, for SNP 2A_455932982, only a “C” nucleotide was observed in both genotypes, contrary to the target capture data (Figure 3B). Although a faint “T” was observed in KF-318 and NOS 819111-70, it was deemed unreliable. Even if it were considered genuine, it would still contradict the target capture data, in which no instances of “C” were observed at this location in these two genotypes.
Using genotypes KF-318 and NOS 819111-70 as references, we identified deletions and insertions relative to the other genotypes. One deletion variant (9 bp) and one insertion variant (3 bp) were selected. The 9 bp deletion, located on chromosome 2A at position 453603957, was clearly identified in KF-318 and NOS 819111-70 (Figure 4A), with a total depth coverage of 420 (562 reads) and 442 (591 reads), respectively. The genotype NOS 819111-120 has coverage of 620 (830 reads) for this region. This 9 bp deletion was not observed in any genotype other than KF-318 and NOS 819111-70 (Supplementary Figure S3). High-resolution melting (HRM) curve analysis grouped the ten oat genotypes into two clusters (Figure 4B); KF-318 and NOS 819111-70 clustered together with genotypes that did not contain the 9 bp deletion (e.g., NOS 819111-120).
For the selected insertion variant, the 3 bp insertion at position 456585644 of chromosome 2A showed a total depth coverage of 69 (127 reads) in KF-318 and 89 (163 reads) in NOS 819111-70. The genotype NOS 819111-120 has coverage of 311 (573 reads) for this region; read alignment confirmed the presence of the 3 bp insertion in both target genotypes but not in NOS 819111-120 (Figure 5A). In fact, this 3 bp insertion was not observed in any genotype other than KF-318 and NOS 819111-70 (Supplementary Figure S4). HRM analysis grouped genotypes KF-318, NOS 819111-70, and NOS 819111-120 together despite NOS 819111-120 lacking the 3 bp insertion (Figure 5B).
Discrepancies between target capture data and validation results highlight potential issues in the applicability of myBaits technology in oat breeding. Several factors could contribute to these discrepancies, including biases in probe capture efficiency, high sequence variability, and the polyploid complexity of the oat genome. Additionally, the genetic diversity and mosaic nature of the oat genomes may pose challenges for effectively capturing and accurately calling variants. Inadequate or uneven coverage, sequencing errors, and aligner inefficiencies may also play roles. Strategies to improve variant detection accuracy include (i) optimizing bait design and hybridization conditions specifically for the oat genome, which could reduce biases and enhance capture efficiency; (ii) developing a dedicated bioinformatics pipeline, method, or tool for target capture data generated from the oat genome that accounts for its polyploid complexity and high sequence variability; (iii) implementing more stringent quality-control measures during read mapping and variant calling, which can help identify and rectify errors; and (iv) using long-read sequencing technologies, which can provide more comprehensive coverage and more accurate capture of variants, especially in complex and repetitive regions, reducing false positives.
4. Discussion
The findings of this study have broader implications for oat genomics research and marker development. The challenges and limitations observed with the current target-capture-sequencing approach demonstrate the need for continued improvement in library preparation so that targeted regions are captured in an unbiased manner, for further advances in sequencing technologies, and for innovative bioinformatics tools that handle the complexity of oat genomes effectively. Improved variant detection accuracy will directly impact the development of molecular markers, which are critical for breeding programs aimed at improving oat varieties. Ultimately, accurate and reliable markers will facilitate the selection of desirable traits, accelerate the breeding process, and improve crop yields, disease resistance, and stress tolerance. By addressing the limitations identified in this study and implementing the proposed strategies, oat researchers can develop more precise and effective markers, contributing to the overall advancement of oat breeding and agricultural productivity. Moreover, the insights gained from this study can be applied to other polyploid and complex plant genomes, broadening the impact of this research beyond oats. As sequencing technologies and bioinformatics tools continue to evolve, the potential for discoveries in plant genomics and breeding will expand.
The process of variant discovery in oat genomics involves two primary stages: read alignment and variant calling. A plethora of tools exist for each stage; hence, the use of different aligners and variant callers may be crucial to evaluating and confirming the effectiveness of certain sequencing technologies. Accordingly, different aligners and variant callers were employed in this study. Many plant studies involve high levels of genetic diversity and, in some cases, incorporate distantly related varieties and wild relatives [
17,
26]. Neither of these conditions is common in human studies, and as such, pipelines designed and evaluated on humans may perform differently than expected [
27,
28]. Therefore, in this study, we employed three different aligners commonly used in plant genomic studies, namely BWA-MEM, Bowtie 2, and NGSEP, to map the Illumina paired-end read target-capture datasets from 10 oat genotypes. Our results demonstrated that BWA-MEM consistently exhibited high mapping efficiency across all genotypes, with percentages ranging from 98.98% to 99.84%. Earlier, Yan et al., 2021, also found that BWA-MEM has a higher mapping rate than Bowtie 2 when they evaluated these two mappers using large plant genome resequencing data [16]. However, BWA-MEM’s increased sensitivity may come at a cost: as the number of SNPs or the size of the INDELs per read increases, its false positive rate can become slightly higher than that of Bowtie 2 and NGSEP. Although Bowtie 2 and NGSEP also showed comparable mapping efficiencies, slight variations were observed across genotypes. These findings suggest that using several aligners is the way forward in oat genomics analyses, given the mosaic structure and complexity of the oat genome. Similar results were obtained by Schilbert et al., 2020, when they compared different mappers using plant NGS data [17]. Neither NGSEP nor Bowtie 2 aligned as many reads as BWA-MEM for any genotype. In this study, we chose not to alter the default settings, mostly because the mapping percentage was already high and there was no obvious parameter to adjust, such as the number of mismatches allowed or the fragment size. Moreover, many program users, especially non-experts in bioinformatics, may retain the default settings of programs. The results of this study are also in line with other studies in which similar results were previously reported, suggesting that the BWA-MEM mapping tool had a higher mapping rate [
16,
17]. However, we recommend using different mappers for variant discovery to lower the false discovery rate. Our study also has strengths compared to other studies in that different mappers were employed when utilizing real data obtained through the target capture sequencing of several oat genotypes, rather than testing using simulated sequence data.
The next step in the bioinformatic analysis pipeline is variant calling, which we performed using five variant callers: the GATK Haplotype Caller (GATK-HC), FreeBayes, DeepVariant, the NGSEP variant caller, and SAMtools-mpileup. GATK-HC detected different variant counts, ranging from 3816 for Symphony to 4411 for NOS 819111-120. In contrast, SAMtools-mpileup identified fewer variants, ranging from 753 for Symphony to 2820 for NOS 819111-120. GATK-HC thus detected many more variants than SAMtools-mpileup, which implies a very low recall for the latter. The reason could be that GATK-HC performs local assembly to identify the haplotypes, whereas SAMtools-mpileup only utilizes read alignments. Plant genomes, in general, are rich in repetitive sequences that are difficult to assemble correctly using short reads. Therefore, the local assembly strategy employed via GATK-HC might identify true variants, but it might also generate false positive variants, especially INDELs. FreeBayes detected a broad range of variant counts across genotypes, from 544 for NOS 81937-11 to 4338 for NOS 81920-15. DeepVariant exhibited variant counts ranging from 513 for NOS 81937-11 to 2185 for Symphony. Lastly, the NGSEP variant caller consistently detected many variants across genotypes, ranging from 3223 for Symphony to 4325 for NOS 819111-120. These results highlight that different variant callers detect different numbers of variants; hence, it is advisable to use several variant callers and then select the variants that are common among them. We chose this strategy in our study. We found that all the variants called via DeepVariant were also detected via at least one other variant caller.
The DeepVariant method relies on a convolutional neural network model, and such advanced machine-learning techniques hold significant promise for the future evolution of bioinformatic software, particularly in variant-calling applications [
29]. Hence, if only a single variant caller is to be used, DeepVariant could be a better choice. Studies have shown that the DeepVariant method can detect variants from next-generation sequencing (NGS) data with high accuracy [
30,
31,
32]. However, it is generally better to use a combination of variant callers and then retain the variants detected via several of them. This strategy has been employed in various studies previously [
33,
34].
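Retaining variants supported by several callers amounts to a simple vote count. The snippet below is a minimal sketch of such a k-of-n filter; the function name, the `min_callers` parameter, and the representation of variants as hashable keys are illustrative assumptions, not the procedure of the cited studies.

```python
from collections import Counter
from typing import Dict, Hashable, Set


def supported_by_at_least(per_caller: Dict[str, Set[Hashable]],
                          min_callers: int) -> Set[Hashable]:
    """Variants reported by at least `min_callers` of the supplied callers."""
    if min_callers < 1:
        raise ValueError("min_callers must be at least 1")
    support: Counter = Counter()
    for variants in per_caller.values():
        # Each caller's variants are a set, so a caller votes once per variant.
        support.update(variants)
    return {variant for variant, votes in support.items() if votes >= min_callers}
```

Setting `min_callers` to the number of callers reproduces the strict all-caller consensus used in this study, while lower values trade stringency for sensitivity.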
The main finding of our study is that the target capture methodology using short-read sequencing devised by Daicel Arbor Biosciences cannot be applied reliably to identify variants in oat genome research. We reached this conclusion in the validation step, using variants identified via all five variant callers employed in this study. For validation, we selected variants detected via all variant callers in two oat genotypes, KF-318 and NOS 819111-70, but absent in the remaining genotypes. For the selected SNPs, the Sanger sequencing of the target region contradicted the target capture data even though the coverage of the selected variants in the target capture data was very high. Similarly, the validation of the deletion and insertion variants also presented challenges, suggesting limitations in the reliability of myBaits target capture sequencing for variant detection in the oat genome. To further investigate these discrepancies, we conducted high-resolution melting (HRM) curve analysis, which grouped the ten oat genotypes into clusters, indicating the presence of variation in the targeted regions. Interestingly, some genotypes that were not expected to cluster together did so. While Sanger sequencing could have been employed to verify and elucidate these variations, we refrained from this approach because of the contradictory results already observed between Sanger sequencing and the target capture data for the SNPs. The central challenge is therefore the reliability of the target capture data as a basis for accurate variant calling. Given that the myBaits technology involves the targeted sequencing of specific genomic regions, the accurate capture of variants within these regions is vital. Challenges may arise in accurately capturing variants, especially in regions with high sequence variability or complexity.
Inadequate coverage via myBaits probes or biases in capture efficiency could lead to incomplete variant calling or inaccuracies in variant identification. In our case, coverage was sufficient; hence, biases in capture efficiency could be the cause. These challenges arise from the complexity of oat genomes, which contain large repetitive regions and are polyploid, making efficient bait design and target capture difficult. Additionally, the genetic diversity and mosaic nature of the oat genome might pose challenges in ensuring that baits effectively capture target sequences. The methodology of custom bait design and synthesis can also be a reason, especially for large genomes such as that of oats. We suggest that bait design and hybridization conditions should be optimized for oats, and this may require extensive experimentation to ensure reliability.
While our study highlights limitations in the reliability of target-capture sequencing using short-read sequencing for variant detection in the oat genome, long-read sequencing can be useful in this context. The myBaits technology using long-read sequencing instead of short-read sequencing could be a better option for the reliable detection of variants. However, this needs to be tested in the case of a complex and polyploid genome, such as the oat genome. Long-read sequencing offers a promising strategy to mitigate the challenges encountered in target-capture sequencing using short reads and to avoid false positive results [
35]. By generating longer sequencing reads, long-read sequencing technologies can overcome some of the challenges associated with short-read sequencing, such as accurately capturing variants in regions with high sequence variability or complexity. Additionally, long-read sequencing enables a more comprehensive characterization of genetic variation, including large structural variants and complex rearrangements, which may be missed or inaccurately identified via short-read sequencing [
36]. However, this needs to be validated for the oat genome before concluding that long-read sequencing can remedy the shortcomings of the myBaits target-capture technology in the reliable detection of variants.
It is true that the myBaits technology offers substantial opportunities for plant researchers by enabling the targeted sequencing of specific regions of interest; researchers can concentrate on genomic regions associated with the traits of interest or genetic variation [
13,
37]. This approach enhances variant-calling efficiency by reducing the volume of non-targeted sequencing data that require processing, potentially alleviating computational burdens and associated costs [
8,
11,
12,
15]. Furthermore, the flexibility in bait design provided via the myBaits technology empowers researchers to tailor sequencing experiments to align with specific research goals or genomic regions of interest. However, our study revealed limitations regarding the suitability of the myBaits target capture technology for marker development in oat breeding. Despite its advantages, the technology may not yet meet the stringent requirements for marker development in oat breeding programs.
To address the limitations of myBaits target-capture sequencing observed in this study and advance research in oat genomics and marker development, we propose the following recommendations for future research directions. (i) The further refinement of myBaits protocols is essential, as the current protocol does not seem to work reliably for variant detection in oats. This may involve fine-tuning the probe design, optimizing the hybridization conditions, and enhancing the coverage depth to improve the accuracy and efficiency of variant calling. Moreover, different hybridization conditions should be explored, including temperature, duration, and buffer composition, to optimize the efficiency of target capture. Fine-tuning these parameters can improve the specificity and sensitivity of the hybridization process. (ii) Incorporating long-read sequencing technologies, such as PacBio or Oxford Nanopore, alongside myBaits target capture can be a useful strategy to overcome the challenges posed by the complex oat genome. Long reads can provide valuable information for resolving repetitive regions and structural variations and address the bioinformatic challenges associated with short reads. However, the feasibility and effectiveness of this approach need further investigation in the context of oats and myBaits target capture. (iii) Alternative targeted sequencing methods, such as amplicon-based sequencing, may offer advantages for marker development to facilitate oat breeding. Amplicon sequencing can provide a targeted approach while avoiding some of the challenges associated with myBaits, such as probe design limitations and capture biases. (iv) The development of specific bioinformatics pipelines or tools tailored to analyzing target capture data generated via myBaits from the oat genome is crucial. These tools should account for the unique challenges posed by the polyploid nature of oats, high sequence variability, and the presence of repetitive regions.
By developing specialized computational methods, researchers can more effectively process and interpret target-capture data, leading to more reliable variant detection and downstream applications. By implementing these recommendations and leveraging recent advancements in oat genomics, researchers can overcome the limitations of myBaits target capture sequencing and unlock the full potential of targeted sequencing technologies for oat breeding.