1. Introduction
Over time, the field of genetic testing for many cancer biomarkers, such as breast cancer driver genes BRCA1 and BRCA2, improved, starting from single gene sequencing on sanger sequencing technology, followed by multigene panels, which were created as a result of developments in next-generation sequencing technology (NGS), allowing for a broader genetic assessment, a faster testing method, and better throughput, without being cost prohibitive but constrained by the generation of short reads [
1,
2]. MinION, the first long-read Nanopore-based sequencer, was released by Oxford Nanopore Technologies (ONT), overcoming the primary limitations of short-read sequence creation [
3] by introducing long-read sequencing technology that was adapted by both ONT and Pacific Biosciences (PacBio) [
4]. These technologies proved that new long-read, single-molecule sequencing technologies could reliably be able to identify small variants, indel, and structural variants (SVs), with significant improvements in both sensitivity and specificity [
3,
5].
In human genomes, single-nucleotide polymorphisms (SNPs) and short insertions and/or deletions (indel) are two forms of genetic variants [
6,
7]. They contribute to genetic diversity and have the ability to affect phenotypic differences, such as human disease susceptibility. Detecting SNPs and indel is challenging in studying genomic variants and functions using new generations of high-throughput sequencing data [
5]. Many different variant (SNP/indel) callers were introduced by the Nanopore community and recommended by ONTs for accurate variant detection based on data from long-read sequencing. Some variant callers implemented variant calling methods using deep learning, such as “Clair” [
8], the successor of “Clairvoyant” [
9]. “Longshot “ [
10] calls SNPs on long-read data using a Pair-Hidden Markov Model (pair-HMM) for a small local window surrounding candidate sites. Medaka [
11], an SNP/indel caller based on deep learning on long-read data, was recently launched by ONTs [
11]. Medaka predicts SNPs from unphased long reads before phasing them. For each set of phased reads, Medaka ends up making SNP and indel calling. Nanocaller [
12] is a deep convolutional neural network that incorporates a long-range haplotype structure to improve variant detection on long-read sequencing data. “Clair3” [
13] combines the greatest characteristics of two key method categories: pile-up calling, which handles most variant candidates fast, and full alignment, which tackles complicated candidates with precision and recall in an account. Accordingly, in this article, the development of a workflow for detecting disease-causing variants, starting from the sample to the variant call format (VCF) with annotated variants, was proposed where different variant calling tools were tested on reference genome samples to evaluate the output of each tool against “Truth” set of variants. The proposed pipeline for targeted sequencing of the data generated from long-read sequencing technology, where the two genes BRCA1 and BRCA2, which are recurrently mutated in breast cancer, were analyzed as an example of this workflow and an examination of its performance was described for future testing and implementation.
4. Discussion
Evaluation of BRCA1/2 molecular status has become the standard of care in the treatment of individuals with breast cancer. Precision medicine has made significant progress against this type of cancer, which accounts for one-third of all new female cancers every year. Female breast cancer is the sixth biggest cause of mortality worldwide, with an estimate of 685,000 deaths in 2020 [
25]. One example is the development and clinical application of PARP-inhibitor (PARPi); Poly (adenosine diphosphate-ribose) polymerase inhibitors (PARPi) are a key arrow in the oncologist’s quiver among new therapeutics [
26,
27]. Indeed, PARPi has been found to enhance the clinical outcomes of breast cancer patients with BRCA1/2 germline or somatic mutations, which have been found to improve survival and quality of life [
28,
29,
30,
31,
32]. As a result, current worldwide guidelines strongly advise BRCA1/2 testing in all patients. Rapid and dependable genetic screening for BRCA1/2 germline or somatic mutations has become critical in identifying individuals who would most likely benefit from these treatments [
3,
33,
34].
The technology used in BRCA 1/2 gene testing held an important impact on getting the full picture of the two genes. Traditional Sanger sequencing is expensive and takes a long turn-around time (TAT). Next-generation sequencing (NGS) is a game-changing high-throughput nucleotide sequencing approach that produces rapid, cheap, and accurate genomic data. NGS developed the clinical methodology for genetic examination across various fields of medicine [
34]. NGS can massively sequence millions of DNA reads, allowing for accurate characterization of the “status” of multiple genes; in this context, NGS-targeted gene sequencing enables the detection of driver mutations, which are responsible for progression and relapse and might be employed as predictive or prognostic biomarkers in breast cancer [
33,
34]. When compared to Sanger sequencing, NGS can offer doctors comparable genetic information at a cheaper cost and shorter time to results [
2,
34], yet the NGS limitations are the small read size and the difficulty in analyzing large alterations as structural variants. Many studies employed the NGS as a technology in the detection of BRCA1/2 gene variants in various ethnic groups to implement the detection of gene variants using NGS in the routine line of diagnostics and may allow doctors to make more prompt and informed decisions about surgery or neo-adjuvant chemotherapy in breast cancer patients [
35,
36,
37,
38,
39,
40,
41,
42].
However, the use of NGS technologies in clinical diagnostics necessitates a large initial investment in the sequencer, which is a barrier for local research institutions in underdeveloped nations, as well as small research institutes and hospitals. MinION, the first commercially available sequencer based on Nanopore technology, might be a viable alternative [
43,
44]. MinION has previously been utilized effectively to identify mutations in TP53 and ABL1 genes in CLL and CML patients [
45,
46,
47,
48], respectively. Furthermore, the cheap cost, ease of use, and length of the reads make MinION a perfect instrument for targeted gene sequencing; the long read can enable researchers to detect and phase genetic variants, as well as thoroughly define new isoforms and fusion transcripts, using Nanopore technology. Nanopore technology sheds new light on health and disease, ranging from cancer to immunology and neurology [
48].
In the current study, the main focus was on the data analysis of data generated using Nanopore technology, as there are many proposed tools by the Nanopore community, a hub for all the Nanopore technology users (
https://community.nanoporetech.com/ (accessed on 8 August 2022)) for every step along the way in data analysis. The in-house targeted gene sequencing workflow was divided into two parts: (1) design a data analysis pipeline for SNV/INDEL/SV detection and how to validate this pipeline and (2) design an in-house primer panel for BRCA1/2 genes as a prototype for future implementation. The pipeline design started with a set of tools designed and trained on long-read data generated from the MinIon ONT sequencer; the reference samples used as the input data for validation of this workflow are the publicly published “NA12878” (HG001) reference sample [
49] and “NA24385” (HG002) dataset that contain whole-genome sequencing of well-known human cell lines, sequenced using Nanopore technology [
50]. Each, therefore, serves as a helpful benchmark sample. The HG002 cell line was used as a “seen” sample in the current (PrecisionFDA Truth Challenge V2) competition [
51]. The method of validating the performance of workflows and especially the variant callers is called “Benchmarking”, where a reference sample is used either as DNA to be sequenced and undergo the workflow or using the data for this reference sample from the public repository in-silico for a data analysis step, a method that was recommended by Global Alliance for Genomics and Health (GA4GH) [
52].
The pipeline went as follows: (1) mapping for the reads stored in the fastq file that outputs the reads into a SAM file format using “Minimap2” mapper for long reads against reference sequences based on the GRCh38/UCSC hg38 public human genome build, (2) sorting and indexing using Samtools as a versatile tool as it was heavily used in many pipelines proposed by other studies, used to convert a SAM file to BAM file, sort and index the BAM output, (3) removing the PCR duplicates even though the reference data samples used to validate this workflow were whole-genome sequencing, not including a PCR step but were included in the workflow as this workflow will be used on targeted gene sequencing data, (4) calculating the mean coverage of the targeted genes using Bedtools, (5) variant calling step, which is the main event in the workflow and the focus of our study; there are many variant callers both recommended by ONT and the Nanopore community, so the output variants were filtered based on “PASS” and QUAL > 20 as a threshold for the comparison of the tools output, (6) annotating the variants using SnpEff as an annotation tool, and (7) checking the clinical significance of the annotated variants using ClinVar clinical database.
The focus of the current study was to evaluate this workflow as well as compare the performance of the commonly used software pipelines for variant calling, which is another key element in variant discovery. The comparison is based on how well the tool calls the “True” variants when compared to the benchmarking VCF file; the tools analyzed in this study are Medaka, Clair, Nanocaller, Longshot, Clair3, and ONT’s wf-human-snp workflow for variant calling, which employs Clair3 with pre-adjusted parameters for the accurate calling of variants.
Recent studies attempted to enhance variant calling by using phasing information from long-read sequencing data. Longshot calls SNPs on long-read data using a pair-hidden Markov Model (pair-HMM) for a small local window surrounding candidate sites and then improves genotyping of identified SNPs using Hap-CUT2 [
53] based on the most probable pair of haplotypes given the present variant genotypes, but on the other hand, is incapable of detecting indel. Medaka was provided by ONT, an SNP/indel caller that uses deep learning on long-read data. Medaka predicts SNPs from unphased long readings before using WhatsHap [
54] to phase the data Medaka eventually makes SNP and indel calls for each phased read group. Clair, the successor of Clairvoyante, is a tool for detecting germline minor variants quickly and accurately using single-molecule sequencing data. Clair outperforms several competing systems for ONT data, including Clairvoyante, Longshot, and Medaka, in terms of precision, recall, and speed. As a deep learning approach, Nanocaller detects SNPs using long-range haplotype information, then phases long reads with identified SNPs and calls indels using local realignment.
Two key designs differ greatly in terms of performance and speed either employing pileup or full alignment as the input of the decision-making neural network. Clair and Nanocaller are pileup-based calling networks that aggregate read alignments into features and counts before sending them into a variant calling network. PEPPER-Margin-DeepVariant5 (PEPPER) [
55] is fully alignment based. The DeepVariant variant calling network input is retained with spatial information in the full alignment method and is tens of times greater in size than the pileup method. Medaka is consensus based, using pileup input to generate a diploid consensus in the first iteration and two haploid consensuses in the second. Variants are formed by identifying and combining differences between the reference and consensus. To fill the void, Clair3 was created, which combines the best of both designs. It is as quick as pileup-based callers and performs just as well as full alignment callers. First, the pileup calling network goes through all the variant candidates that met a coverage and alternative allele frequency criterion. The high-quality pileup calls are then used to phase the alignments and generate the final output. Then, for each low-quality pileup call for full-alignment calling, the alignments phased by WhatsHap are utilized to create full-alignment input that is 23-times greater in size than the pileup input. Finally, as the final output, the full-alignment calls are combined with the high-quality pileup calls.
For performance validation of the pipeline along with the variant callers, the process started with the genome in a bottle (GIAB) reference samples HG001 and the Ashkenazi son sample HG002 ONT reads that were used as an input for mapping with Minimap2, sorting and indexing with Samtools, calculating the mean coverage within the BRCA1/2 gene bed file with coordination. for the variant calling step, the default parameters were used for all the variant callers to ensure uniformity in the output variants. The benchmarking variant VCF “Truth set” used was the GIAB v.4.2.1 for each reference genome sample to compare the output of different variant callers. The hap.py [
24] tool was used for benchmarking, which is a reference implementation of the GA4GH recommendations for variant caller benchmarking with the “vcfeval” engine for comparison; it generated metrics as “False positive”, “False negative”, “True positive”, “Precision”, “Recall”, and “F1 score”. It was found that three metrics are the most important for variant caller performance evaluations, which are “Precision”, “Recall”, and, most importantly, “F1 score”, which is the mean of precision and recall and is commonly used to test the performance of the callers [
56,
57,
58].
Based on the metrics obtained in our results, it is suggested that Clair3 as a stand-alone or incorporated into a workflow as Human-SNP-wf by ONT, was found to be outperforming other variant callers concerning performance. The Clair3 method’s efficiency is based on its ability to effectively distinguish between true and false calls during pileup calling, allowing only essential candidates to be transferred to the considerably more computationally costly full alignment calling. Following that comes Nanocaller, which performed in a better way than the rest of the variant callers, Longshot, Clair, and Medaka, respectively, agree with the findings of another study. Even though Clair is supposed to outperform Longshot, it was found to have lower F1 scores in both reference samples and that may be because Clair was outdated and was succeeded by Clair3 in May 2021 (
https://github.com/HKU-BAL/Clair (accessed on 8 August 2022)). Although Medaka was, up until the release of Clair3, the recommended variant caller for SNP calling using the “medaka_variant” argument, which was formerly implemented inside the medaka package, it has been exceeded in accuracy and computing performance by alternative approaches and is, thus, deprecated and it is advised to utilize Clair3 either directly or through the Oxford Nanopore Technologies offered Nextflow implementation (Human-SNP-wf) (
https://github.com/nanoporetech/medaka (accessed on 8 August 2022)) and that may explain the low performance. It was intentional not to test Nanopolish [
59], which is also capable of variant calling on ONT data since it requires fast5 raw signals file as input, which are not publicly accessible for HG002, so it was excluded from the variant callers’ comparison.
Targeted gene panels are one of the most frequent ways of enriching the genomic areas to be sequenced and they are widely utilized in NGS technology. Using Nanopore technology, we were able to enrich all the gene areas of interest without being limited by the read length. MinION real-time sequencing allows reads to be evaluated as they are produced, considerably speeding up analysis and allowing for the modification of experimental conditions as needed. Another benefit of MinION over second-generation sequencers is its mobility and ease of use for library preparation and sequencing, as well as its low cost. There are currently many custom/academic or commercial BRCA1/2 target panels that have been established in recent years because of investigations on the use and impact of NGS in breast/ovarian cancer [
56,
60,
61,
62], the majority of which are based on the amplicon sequencing technique. There are currently many commercial short-read amplicon-based BRCA gene panels available that detect SNV and/or copy number variation. Nonetheless, efforts to create a complete gene panel useful for BRCA prognosis and medication impact prediction are ongoing. The design of a primer panel targeting different oncology biomarkers will be incorporated into our future plan for trial on different cancer sample types.