A Tool for the Design of the Minimal Fingerprinting SNP Set: Use Case for Barley

Ermolaev, Aleksey; Samarina, Mariya; Strembovskiy, Ilya; Kroupin, Pavel; Karlov, Gennady; Kharchenko, Pyotr; Voronov, Sergey; Eroshenko, Lyubov; Kryuchenko, Elizaveta; Laptina, Yulia; Avdeev, Sergey; Shirnin, Sergey; Igonin, Vladimir; Divashuk, Mikhail

doi:10.3390/agronomy14081802


by Aleksey Ermolaev ^1,2,3, Mariya Samarina ^1,2,3, Ilya Strembovskiy ¹, Pavel Kroupin ¹, Gennady Karlov ¹, Pyotr Kharchenko ¹, Sergey Voronov ², Lyubov Eroshenko ², Elizaveta Kryuchenko ⁴, Yulia Laptina ², Sergey Avdeev ¹, Sergey Shirnin ¹, Vladimir Igonin ¹ and Mikhail Divashuk ^1,2,3,*

¹ All-Russian Research Institute of Agricultural Biotechnology, Moscow 127550, Russia
² Federal Research Center Nemchinovka, Moscow 143026, Russia
³ Department of Agrobiotechnology, Moscow Timiryazev Agricultural Academy—Russian State Agrarian University, Moscow 127550, Russia
⁴ Gorbatov Federal Research Center for Food Systems, Moscow 109316, Russia
^* Author to whom correspondence should be addressed.

Agronomy 2024, 14(8), 1802; https://doi.org/10.3390/agronomy14081802

Submission received: 19 July 2024 / Revised: 13 August 2024 / Accepted: 14 August 2024 / Published: 15 August 2024

(This article belongs to the Section Innovative Cropping Systems)


Abstract

High-throughput genomic technologies are enabling the identification of thousands or even millions of single-nucleotide polymorphisms (SNPs). SNP markers are frequently used to analyze crop varieties, with the marker data summarized in a common VCF file. At present, it is difficult to identify the minimal SNP set, i.e., the smallest set that can distinguish between all crop varieties listed in a VCF file, because no ready-to-use tools capable of such characterization are available. Here, we describe the development of the ready-to-use open-source tool MDSearch (Minimal Discriminatory SNP Set Search), which identifies the MDS (minimal discriminatory set) of SNPs using random walking starting from the maximal discriminatory set. MDSearch can be used for diploid as well as polyploid species and for both phased and unphased VCF files. MDSearch has been validated using a publicly available dataset of barley SNPs obtained by genotyping-by-sequencing. As a result, we have successfully identified a discriminating set of 19 SNP markers capable of distinguishing all 254 barley varieties included in our study. We expect that this program will prove useful to genomics researchers to support a variety of certifications.

Keywords: barley; fingerprinting; SNP; genetic marker; algorithm; bioinformatics

1. Introduction

Barley (Hordeum vulgare L.) was domesticated around 8000 BC, making it one of the oldest cultivated crops [1]. Nowadays, barley primarily serves as an animal feed component (approximately 70%), with the remaining 30% allocated for brewing and food purposes (www.fao.org/faostat, accessed on 10 June 2024). Notably, in Russia, barley varieties covered 7943 thousand hectares in 2023, representing approximately 9.8% of all sown areas in the Russian Federation (rosstat.gov.ru, accessed on 6 June 2024).

Producers and processors adhere to strict criteria for the quality of barley grain, which include meeting the specified variety, specific biochemical composition, and grain purity. Brewing varieties face particularly high standards, which require variety certification. Taxonomic characteristics, primarily morphological, used to confirm varietal identity are referred to as Distinctness, Uniformity, and Stability (DUS) characteristics. The assessment of DUS occurs during the registration of new varieties in accordance with national and international legislation and protocols (www.upov.int/edocs/tgdocs/en/tg019.pdf, accessed on 6 June 2024). However, the apparent uniformity of a variety does not necessarily imply genetic homogeneity. Barley varieties may consist of several biotypes with varying characteristics, including fat, carbohydrate, protein content, and enzyme composition. To provide a more comprehensive description of the variety, data on storage protein composition or DNA polymorphisms can be employed in addition to morphological characteristics. Genetic markers, specifically those based on single-nucleotide polymorphisms (SNPs), are the most widely used type of markers for fingerprinting. The abundance of SNPs spread across the plant genome, coupled with its low genetic variability, has solidified the widespread adoption of this mutation for plant genotyping (SNP genotyping) [2,3,4]. Barley SNPs have been thoroughly studied, and high-quality barley SNP-based genetic maps are readily available [5]. A substantial number of SNP arrays, based on these maps, have been developed for the majority of American and European barley varieties [6,7].

The fingerprinting methodology is based on the use of a consistent set of SNP markers. There are three main approaches for barley fingerprinting: grain storage protein profiling [8], SSR [9], and SNP markers [10]. However, the protein profile may not distinguish some varieties, and SSR markers might not be easy to find and optimize for use. SNPs are the most abundant mutation type and are easily identified in large numbers [11]. A small set of biallelic SNPs could distinguish a large number of varieties. Such a set requires maximum discrimination power (distinguishability) and minimal size for easy implementation. There is no universal approach for developing such a set; each scientific team uses its own algorithms and bioinformatics tools. Previous studies have reported programs capable of accurately identifying minimal marker sets for a given population, but these programs are not available at present [12,13]. There are also studies on the design of the minimal discriminatory SNP set for various crops [14,15], including barley [16]; however, in such studies, the authors only describe a protocol for minimal SNP set discovery and do not provide ready-to-use tools.

In this study, we present a universal, ready-to-use tool called MDSearch (Minimal Discriminatory Set (MDS) Search) for the selection of a minimum informative SNP set for any given VCF file (both phased and unphased). MDSearch is an open-source script written in Python 3 and available on GitHub (github.com/alermol/MDSearch, accessed on 6 June 2024). MDSearch can be used for diploid as well as polyploid species. As a demonstration, an MDS of SNPs for 254 barley varieties was designed using MDSearch.

2. Materials and Methods

2.1. The MDSearch Algorithm

The MDSearch algorithm is based on the minimization, by random walking, of a primary discriminative set containing the SNPs with the highest minor allele frequency (MAF). The algorithm starts by picking the one SNP with the highest MAF. After that, MDSearch iteratively and greedily adds to this set the SNP with the highest MAF still available, without considering whether an (at least partly) discriminative set of SNPs with moderate MAF exists. After each iteration, the resulting SNP set is assessed for its ability to discriminate all samples in the run. Iterations stop when the resulting SNP set is capable of discriminating all samples in the run. Next, the minimization step using random walking is performed. During this stage, a single SNP is randomly removed from the set, and the discriminative ability of the resulting set is assessed. If the removal of this SNP eliminates the discriminative ability, the removed SNP is returned to the set; otherwise, it stays excluded. The effect of removing another random single SNP on the discriminative ability of the set is then tested.
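The two phases described above can be illustrated with a minimal Python sketch. It is not the MDSearch source code: the function names, the 0/1 coding of homozygous genotypes, and the single shuffled removal pass are assumptions made for illustration only.

```python
import random


def is_discriminative(genotypes, snp_ids):
    """True if every sample has a unique genotype profile over snp_ids."""
    profiles = [tuple(sample[s] for s in snp_ids) for sample in genotypes]
    return len(set(profiles)) == len(profiles)


def maf(genotypes, snp):
    """Minor allele frequency, assuming homozygous calls coded 0 (ref) / 1 (alt)."""
    alt_freq = sum(sample[snp] for sample in genotypes) / len(genotypes)
    return min(alt_freq, 1 - alt_freq)


def find_mds(genotypes, n_snps, seed=0):
    """Greedy addition by MAF, then random single-SNP removal (hypothetical helper)."""
    rng = random.Random(seed)

    # Phase 1: add SNPs in order of decreasing MAF until all samples differ.
    ranked = sorted(range(n_snps), key=lambda s: maf(genotypes, s), reverse=True)
    selected = []
    for snp in ranked:
        selected.append(snp)
        if is_discriminative(genotypes, selected):
            break
    else:
        raise ValueError("the SNPs provided cannot discriminate all samples")

    # Phase 2: try removing each SNP once, in random order. A SNP whose removal
    # breaks discrimination is kept; otherwise it stays excluded.
    candidates = selected[:]
    rng.shuffle(candidates)
    for snp in candidates:
        trial = [s for s in selected if s != snp]
        if is_discriminative(genotypes, trial):
            selected = trial
    return sorted(selected)


if __name__ == "__main__":
    # Rows are samples, columns are SNPs (toy data, not from the article).
    toy = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
    print(find_mds(toy, n_snps=3))
```

One pass over the shuffled candidates is enough in this sketch because removing further SNPs can never restore discrimination that a larger set lacked. Different seeds may return different sets of the same quality, which is why repeated runs are worthwhile.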

2.2. Testing Dataset

MDSearch was tested using the genotyping-by-sequencing data of 254 barley lines from BioProject PRJEB38709 [17]. Read archives belonging to the same line were merged before further actions. The lines in the BioProject are an F₇ population that has been obtained from an interspecific tetraploid Hordeum vulgare L. (Hv) cv. ‘Borwina’ × Hordeum bulbosum L. (Hb, accession 0 A420) hybrid [18] as described previously [19]. The reference barley genome assembly MorexV3 was used for reads mapping (GenBank: GCF_904849725.1) [20].

2.3. Illumina Reads Preparation and Mapping

The quality of downloaded reads was assessed using FastQC 0.12.1 [21]. Further adapter trimming and elimination of low-quality reads were carried out in the bbduk.sh program from the BBMap 39.01 toolkit (www.sourceforge.net/projects/bbmap, accessed on 10 June 2024). The trimming conditions were as follows: maxns = 0 (remove reads with undefined nucleotides), ml = 50 (removal of reads shorter than 50 nucleotides), forcetrimleft = 5 (the first 5 nucleotides of each read are removed). Mapping of the trimmed reads to the barley reference assembly MorexV3 was managed by bowtie2 v2.5.1 [22] with default parameters. Supplementary alignments and unmapped reads were removed with samtools 1.18 [23].

2.4. SNP-Calling and Filtration for Minimal Discriminatory SNP Set Selection

SNPs were identified in freebayes 1.3.6 [24] with default parameters, except for the parameter “-n 4” (evaluate only the 4 alleles with the highest overall quality). Further filtering was carried out using plink2 v2.00a5LM [25]. Firstly, indels, multiallelic SNPs, SNPs with at least one missing genotype, and SNPs with MAF < 10% were removed. Although MDSearch allows missing genotypes, it is advised to use only positions genotyped in all samples, because for non-genotyped positions, there is no way to find out whether the position is not genotyped due to a local genome feature (like a deletion) or a technical error (like too stringent mapping or SNP calling parameters). All SNPs that were heterozygous in at least one sample were removed as well. Before the selection of the minimal discriminatory SNP set, LD (linkage disequilibrium) pruning was performed: for SNP pairs in a sliding window of 50 SNPs (step of 5 SNPs), if r² for the pair was greater than 0.2, the first SNP from the pair was removed. For each window, removal was repeated until no SNP pairs exceeding the threshold remained. Additionally, thinning of SNPs was performed: variants were removed until the minimal distance between two neighboring variants reached 10 Mbp.
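To make the two pruning criteria concrete, the following Python sketch computes r² for a pair of SNPs and thins positions by physical distance. It only illustrates the quantities involved. It assumes homozygous genotypes coded 0/1, and the greedy left-to-right thinning rule is an assumption that may not match how plink2 chooses which variant to drop.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1 coded)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no linkage information
    return cov * cov / (var_x * var_y)


def thin_by_distance(positions, min_dist):
    """Keep a position only if it lies at least min_dist after the last kept one."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_dist:
            kept.append(pos)
    return kept


if __name__ == "__main__":
    snp_a = [0, 0, 1, 1]
    snp_b = [0, 0, 1, 1]  # perfectly linked with snp_a
    snp_c = [0, 1, 0, 1]  # independent of snp_a
    print(r_squared(snp_a, snp_b) > 0.2, r_squared(snp_a, snp_c) > 0.2)
    print(thin_by_distance([5_550_434, 8_000_000, 402_453_297], 10_000_000))
```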

3. Results and Discussion

The process of creating a minimum discriminating set (MDS) can be divided into two distinct blocks of work. Firstly, it involves the identification and filtration of highly polymorphic SNPs (see Section 3.1). Secondly, it involves developing a minimal SNP set for fingerprinting and certification purposes (see Section 3.2). The initial block further comprises sub-stages that are meticulously arranged in a logically related order and executed in separate programs.

3.1. Filtration and Identification of Highly Polymorphic SNPs

In the initial phase (Figure 1), a quality check was performed on the previously downloaded Illumina reads using the FastQC program (Step I). Subsequently, the bbduk.sh program was used to remove reads that did not meet quality standards and to trim Illumina adapters (Step II). After trimming, the quality of the reads was reassessed using FastQC (Step III). The prepared reads were then mapped to the barley reference genome MorexV3 using Bowtie2 (Step IV), with approximately 96% of reads aligned on average. Secondary alignments and unmapped reads were then removed using samtools, leaving only the primary alignments. Lastly, sequence mismatches between the mapped reads and the reference genome were identified using freebayes (Step V), revealing a total of 850,174 mutations.

The list of detected mutations includes the single-nucleotide polymorphisms (SNPs) needed for certification, as well as insertions, deletions, inversions, and other mutations that are not relevant to our work. Hence, SNPs were separated from other types of mutations using plink2 v2.00a5LM [25]. Technically, the selection of SNPs is a command-line operation with specific arguments. plink2 applies these arguments in a predefined order encoded within the program itself (Figure 2).

The characteristics of the identified SNP markers set depend on the effectiveness of filtering, so we need to pay closer attention to the arguments used by plink2 (shown in brackets) and the functions they perform. The SNP filtering results are as follows:

Initially, from a set of 850,174 detected mutations, 148,108 indels and complex mutations were removed (--snps-only), retaining only SNPs.
Subsequently, out of 702,066 SNPs, 7252 multiallelic SNPs were removed (--min-alleles 2 --max-alleles 2), eliminating SNPs with more than two alleles.
The SNPs were then filtered by call rate (--geno 0): SNPs with a missing genotype in at least one of the 254 studied varieties (SNP call rate < 100%) were removed. After this step, 41,112 SNPs remained.
The remaining variants were filtered by minor allele frequency (MAF; --maf 0.01). The minor allele frequency is the frequency of the less common of the two alleles of an SNP in the studied collection. For a biallelic SNP, the MAF cannot exceed 0.5; for SNPs with only homozygous genotypes in a diploid species, a MAF of 0.5 means that 50% of varieties carry the reference SNP variant and 50% carry the alternative one. After appropriate evaluation, 37,450 SNPs with MAF < 1% were removed because such a low polymorphism frequency may indicate a genotyping error. Before the next stage, the set included 3662 SNPs.
Homozygous SNPs are the most suitable for creating the SNP barcode, so 3166 SNPs that were heterozygous in at least one of the cultivars were removed from the set (--keep <filename>). After this procedure, 496 homozygous SNPs remained in the set.
For certification, it is important to use SNPs that are not inherited together (linked), because linkage may affect the accuracy of the passport during the variety cultivation. Therefore, the pairwise linkage of 496 SNPs was analyzed in a sliding window of 50 SNPs with a step of 5 SNPs (--indep-pairwise 50 5 0.2). As a result, one SNP was removed from each pair whose linkage disequilibrium (r²) exceeded 0.2. After removing 319 SNPs, 177 SNPs remained for further analysis.
The optimal set of markers is considered to be one in which the markers are evenly distributed throughout the genome, which also reduces the risk of having linked markers in the set. Therefore, at the final stage of filtering, SNPs separated by less than 10 Mbp (--bp-space 10000000) were removed. As a result, 89 SNPs were excluded from the analysis, and the remaining 88 SNPs were used to identify the minimum discriminatory set of SNPs.
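The counts reported for the filtering steps can be checked with a short tally. The sketch below uses only the numbers given in the text; the count for the call-rate step is not stated explicitly and is derived here from the reported remainders (694,814 − 41,112), which is an assumption of this sketch.

```python
START = 850_174  # mutations reported by freebayes

# (plink2 argument, variants removed). The --geno 0 count is derived from the
# reported remainders and is not a number given in the article.
REMOVED = [
    ("--snps-only", 148_108),
    ("--min-alleles 2 --max-alleles 2", 7_252),
    ("--geno 0", 653_702),
    ("--maf 0.01", 37_450),
    ("--keep <filename>", 3_166),
    ("--indep-pairwise 50 5 0.2", 319),
    ("--bp-space 10000000", 89),
]


def tally(start, removed):
    """Return (argument, removed, remaining) after each filtering step."""
    remaining = start
    trail = []
    for argument, count in removed:
        remaining -= count
        trail.append((argument, count, remaining))
    return trail


if __name__ == "__main__":
    for argument, count, remaining in tally(START, REMOVED):
        print(f"{argument:<34} removed {count:>8,}  left {remaining:>8,}")
```

The remainders after each step agree with those stated in the text, ending with the 88 SNPs passed to the MDS search.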

3.2. A Minimal SNP Set Development

Several approaches and corresponding tools were developed earlier for finding a minimal SNP marker set for a given sample set [12,13]. The previously reported GGDS program was based on Integer Linear Programming and could be used to identify all minimal marker sets for any given summary table. However, GGDS was designed for use with only dominant markers [12]. The successor of GGDS, MinimalMarker, could process co-dominant markers, regardless of the number of alleles [13]. Unfortunately, both of these programs are not available now. Another set of articles about minimal SNP set development describes a method but does not provide a ready-to-use tool for reproduction. These articles used approaches based on SNP prioritization by MAF [14,16] or linkage disequilibrium [26].

The initial phase of our work involved filtering a comprehensive set of 254 varieties to identify suitable SNPs, marking the completion of the first stage. Subsequently, in the second phase, we developed a minimally discriminating set of SNPs using a stepwise addition–elimination algorithm. This process, illustrated in Figure 3, was based on sorting SNPs by their minor allele frequency (MAF) using a modified prioritization method. The first SNP selected for the set had a MAF closest to 0.5, indicating that the allele containing this SNP was present in half of the varieties (Step I of the algorithm, Figure 3). The subsequent SNP with the highest MAF was then added to the set (Step II). At each iteration (SNP selection stage), we evaluated the discriminatory ability of the entire formed set (Step III). The addition process continued until the resulting set of SNPs could effectively distinguish all 254 varieties. Following the identification of such a set, we proceeded with sequential removals, one SNP at a time, to search for the minimal discriminating set (MDS) (Step IV). If the removal of a specific SNP did not affect the resolution of the entire set, we excluded it and reassessed the resolution (Step V). This process continued until the size of the final set was minimized. To expedite the convergence of the algorithm, multiple iterations were launched simultaneously.

Upon completion of the algorithm, a set of 19 discriminatory SNPs was obtained (Table 1). All selected SNPs were biallelic and homozygous for the studied barley variety collection. MAF and PIC of selected SNPs varied from 0.11 to 0.45 and from 0.20 to 0.50, respectively. Selected SNPs were distributed across barley chromosomes and separated by long physical distances, thereby reducing the risk of SNP linkage (Figure 4). The highest number of SNPs was located on chromosome 3H (5 SNPs), while the fewest were found on chromosome 4H (1 SNP), followed by chromosomes 1H and 6H (2 SNPs each). Importantly, the selected SNPs were spaced apart, with an average distance of 53 Mbp, effectively covering large chromosome regions. According to the alleles listed in Table 1, 10 of the 19 SNPs identified were transitions (purine to purine or pyrimidine to pyrimidine mutations), while 9 were transversions (purine to pyrimidine or pyrimidine to purine mutations).
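The per-chromosome and transition/transversion tallies can be recomputed directly from the reference and alternative alleles listed in Table 1. The short sketch below does this; the data literal is transcribed from Table 1, and the helper name is illustrative.

```python
from collections import Counter

# (chromosome, reference allele, alternative allele), transcribed from Table 1.
TABLE_1 = [
    ("1H", "T", "A"), ("1H", "G", "A"), ("2H", "T", "C"), ("2H", "G", "T"),
    ("2H", "A", "G"), ("3H", "C", "G"), ("3H", "C", "A"), ("3H", "G", "C"),
    ("3H", "T", "C"), ("3H", "T", "C"), ("4H", "T", "C"), ("5H", "G", "C"),
    ("5H", "A", "G"), ("5H", "T", "C"), ("6H", "T", "G"), ("6H", "A", "T"),
    ("7H", "A", "G"), ("7H", "G", "C"), ("7H", "G", "A"),
]

PURINES = {"A", "G"}


def mutation_class(ref, alt):
    """Transition if both bases are purines or both pyrimidines, else transversion."""
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


classes = Counter(mutation_class(ref, alt) for _, ref, alt in TABLE_1)
per_chromosome = Counter(chrom for chrom, _, _ in TABLE_1)

if __name__ == "__main__":
    print(dict(classes))
    print(dict(per_chromosome))
```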

An essential trait exhibited by both individual SNP markers and the complete set is their inherent stability. Stability, in this context, refers to the resilience of a marker or a group of markers against spontaneous mutations where one of the SNPs within the set is physically lost. The loss of even a single SNP can have significant repercussions for the entire set, especially in situations where two pairs of varieties show minor differences in their SNP profiles. The stability of each individual SNP marker can only be accurately evaluated through experimental methods. However, the likelihood of losing a marker can be substantially minimized by a meticulous SNP selection process and comprehensive testing of SNPs. In instances where stability is paramount, preference should be given to SNPs that result in neutral mutations. These criteria are fulfilled by SNPs located in intergenic or intronic regions, as well as SNPs situated in quadruple degenerate triplets that vary by a single nucleotide while encoding the same amino acid (e.g., A > G, A > C, T > G, T > C, G > A, G > T, C > A, C > T) [27].

The stability of the entire set of SNPs can be assessed through bioinformatics. The fingerprints of the 254 varieties in this study, created with the developed set of SNPs, were compared pairwise. For each pair of varieties, the number of SNPs at which the two fingerprints differ was measured as the Hamming distance (the minimum number of substitutions required to change one barcode into the other) [28]. In our analysis, the average distance between fingerprint pairs was found to be 7. This implies that, on average, two different varieties differ at 7 of the 19 SNPs and share the remaining 12 (Figure 5). Consequently, the loss of data for 7 SNPs could lead to a loss of functionality for the entire set, as it would no longer be able to distinguish between most varieties. Considering the extremely low probability of a one-time mutation resulting in the loss of 7 SNPs and the insignificant proportion of barcodes differing by only 1 SNP (Hamming distance = 1), we conclude that the developed set has good potential to prove stable.
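As an illustration of this comparison, the Hamming distance between two barcodes and the distribution of distances over all pairs can be computed as follows. The barcodes are toy strings, not the fingerprints from this study, and the function names are illustrative.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(barcodes):
    """Hamming distances for all unordered pairs of barcodes."""
    return [hamming(a, b) for a, b in combinations(barcodes, 2)]


if __name__ == "__main__":
    toy_barcodes = ["TGTG", "AGTC", "TATC"]  # one allele per SNP, toy data
    distances = pairwise_distances(toy_barcodes)
    print(distances, sum(distances) / len(distances), min(distances))
```

A minimum distance of at least 1 means that all barcodes are distinct. The larger the minimum and mean distances, the more SNPs can drop out before two varieties become indistinguishable.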

3.3. Developed SNP Set Comparison with Known Sets

When evaluating the outcomes of our research in comparison to similar studies conducted by other authors, we have uncovered distinct characteristics within the marker sets we have developed. Notably, the set we have created is significantly smaller. It is important to acknowledge that these differences can be attributed to the varying number of cultivars on which each marker set is developed. For instance, Owen et al. [16] created their own set of 45 SNP markers by utilizing sequences from 800 Scottish barley varieties, whereas Hayden et al. [29] devised a customized set of 48 SNPs for 88 Australian barley lines and varieties. As observed, both of these aforementioned sets, along with many others we have examined, were designed to certify a limited number of varieties. This common objective requires a larger collection of SNPs to differentiate genetically similar varieties. For this reason, comparing these sets to the set we have developed based on varieties from different countries is not entirely accurate due to a possible founder effect [30].

The variation in the quantity and origin of the varieties used in the study may also account for the lack of matches between the SNP markers we selected and those identified by other researchers. This allows us to discuss the uniqueness of the set we have developed. The observed lack of SNP matches is also partially explained by the novelty of the approaches we have employed, specifically the modified prioritization method underlying our own greedy SNP filtration algorithm. Previously, this approach has not been used for SNP selection in barley, but it has been successfully applied for similar purposes in hops (Humulus lupulus L.) [14].

To accurately assess the resolution capacity of the developed SNP set, it is essential to apply it in practice to a statistically significant number of barley varieties, ideally those that are genetically unrelated. Looking ahead, we intend to conduct a similar evaluation of this SNP set on a collection of 400 barley varieties sourced from diverse countries. The theoretically calculated indicators of the set’s resolution capacity, along with its stability against SNP dropout, have yielded promising results, underscoring the effectiveness of the developed SNP set.

4. Conclusions

In this work, we presented MDSearch, a ready-to-use open-source tool that addresses the challenge of identifying the minimal discriminatory SNP set. It identified a set of 19 SNP markers for 254 barley varieties. The MDS created using MDSearch has good potential to prove stable due to the large Hamming distance between barcodes. MDSearch is applicable to species with different ploidy and can process both phased and unphased VCF files. This tool is expected to aid genomics researchers in various certification processes.

Author Contributions

Conceptualization, A.E., M.S., P.K. (Pavel Kroupin) and M.D.; Data curation, A.E.; Formal analysis, A.E.; Funding acquisition, G.K., P.K. (Pyotr Kharchenko) and M.D.; Investigation, A.E.; Methodology, A.E. and M.S.; Project administration, A.E., P.K. (Pavel Kroupin) and M.D.; Resources, G.K., S.V., L.E., E.K., Y.L., S.A., S.S., V.I. and M.D.; Software, A.E.; Supervision, M.D.; Validation, A.E.; Visualization, A.E. and I.S.; Writing—original draft, I.S.; Writing—review and editing, A.E., I.S., P.K. (Pavel Kroupin) and M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by State assignment FGGE-2023-0004.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Muñoz-Amatriaín, M.; Cuesta-Marcos, A.; Hayes, P.M.; Muehlbauer, G.J. Barley genetic variation: Implications for crop improvement. Brief. Funct. Genom. 2014, 13, 341–350.
2. Bohra, A.; Chand Jha, U.; Godwin, I.D.; Kumar Varshney, R. Genomic interventions for sustainable agriculture. Plant Biotechnol. J. 2020, 18, 2388–2405.
3. Hasan, N.; Choudhary, S.; Naaz, N.; Sharma, N.; Laskar, R.A. Recent advancements in molecular marker-assisted selection and applications in plant breeding programmes. J. Genet. Eng. Biotechnol. 2021, 19, 128.
4. Abed, A.; Belzile, F. Comparing single-SNP, multi-SNP, and haplotype-based approaches in association studies for major traits in barley. Plant Genome 2019, 12, 190036.
5. Close, T.J.; Bhat, P.R.; Lonardi, S.; Wu, Y.; Rostoks, N.; Ramsay, L.; Druka, A.; Stein, N.; Svensson, J.T.; Wanamaker, S.; et al. Development and implementation of high-throughput SNP genotyping in barley. BMC Genom. 2009, 10, 582.
6. Comadran, J.; Kilian, B.; Russell, J.; Ramsay, L.; Stein, N.; Ganal, M.; Shaw, P.; Bayer, M.; Thomas, W.; Marshall, D.; et al. Natural variation in a homolog of Antirrhinum CENTRORADIALIS contributed to spring growth habit and environmental adaptation in cultivated barley. Nat. Genet. 2012, 44, 1388–1392.
7. Bayer, M.M.; Rapazote-Flores, P.; Ganal, M.; Hedley, P.E.; Macaulay, M.; Plieske, J.; Ramsay, L.; Russell, J.; Shaw, P.D.; Thomas, W.; et al. Development and evaluation of a barley 50K iSelect SNP array. Front. Plant Sci. 2017, 8, 1792.
8. Zhang, G.; Li, C. Genetics and Improvement of Barley Malt Quality; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010.
9. Mohammadi, S.A.; Abdollahi Sisi, N.; Sadeghzadeh, B. The influence of breeding history, origin and growth type on population structure of barley as revealed by SSR markers. Sci. Rep. 2020, 10, 19165.
10. Pasam, R.K.; Sharma, R.; Walther, A.; Özkan, H.; Graner, A.; Kilian, B. Genetic diversity and population structure in a legacy collection of spring barley landraces adapted to a wide range of climates. PLoS ONE 2014, 9, e116164.
11. Song, L.; Wang, R.; Yang, X.; Zhang, A.; Liu, D. Molecular markers and their applications in marker-assisted selection (MAS) in bread wheat (Triticum aestivum L.). Agriculture 2023, 13, 642.
12. Gale, K.; Jiang, H.; Westcott, M. An optimization method for the identification of minimal sets of discriminating gene markers: Application to cultivar identification in wheat. J. Bioinform. Comput. Biol. 2005, 3, 269–279.
13. Fujii, H.; Ogata, T.; Shimada, T.; Endo, T.; Iketani, H.; Shimizu, T.; Yamamoto, T.; Omura, M. Minimal marker: An algorithm and computer program for the identification of minimal sets of discriminating DNA markers for efficient variety identification. J. Bioinform. Comput. Biol. 2013, 11, 1250022.
14. Henning, J.A.; Coggins, J.; Peterson, M. Simple SNP-based minimal marker genotyping for Humulus lupulus L. identification and variety validation. BMC Res. Notes 2015, 8, 542.
15. Allen, A.M.; Barker, G.L.; Wilkinson, P.; Burridge, A.; Winfield, M.; Coghill, J.; Uauy, C.; Griffiths, S.; Jack, P.; Berry, S.; et al. Discovery and development of exome-based, co-dominant single nucleotide polymorphism markers in hexaploid wheat (Triticum aestivum L.). Plant Biotechnol. J. 2013, 11, 279–295.
16. Owen, H.; Pearson, K.; Roberts, A.M.; Reid, A.; Russell, J. Single nucleotide polymorphism assay to distinguish barley (Hordeum vulgare L.) varieties in support of seed certification. Genet. Resour. Crop Evol. 2019, 66, 1243–1256.
17. Wendler, N.; Mascher, M.; Nöh, C.; Himmelbach, A.; Scholz, U.; Ruge-Wehling, B.; Stein, N. Unlocking the secondary gene-pool of barley with next-generation sequencing. Plant Biotechnol. J. 2014, 12, 1122–1131.
18. Szigat, G. Amphidiploid Hybrids between Hordeum vulgare and Hordeum bulbosum-Basis for the Development of New Initial Material for Winter Barley Breeding; Vortraege fuer Pflanzenzuechtung: Brussels, Belgium, 1991; Volume 20.
19. Ruge-Wehling, B.; Linz, A.; Habekuß, A.; Wehling, P. Mapping of Rym16 Hb, the second soil-borne virus-resistance gene introgressed from Hordeum bulbosum. Theor. Appl. Genet. 2006, 113, 867–873.
20. Mascher, M.; Wicker, T.; Jenkins, J.; Plott, C.; Lux, T.; Koh, C.S.; Ens, J.; Gundlach, H.; Boston, L.B.; Tulpová, Z.; et al. Long-read sequence assembly: A technical evaluation in barley. Plant Cell 2021, 33, 1888–1906.
21. Andrews, S. FastQC: A quality control tool for high throughput sequence data. Babraham Bioinform. 2010. Available online: http://www.bioinformatics.babraham.ac.uk/projects/fastqc (accessed on 10 June 2024).
22. Langmead, B.; Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359.
23. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve years of SAMtools and BCFtools. Gigascience 2021, 10, giab008.
24. Garrison, E.; Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv 2012, arXiv:1207.3907.
25. Chang, C.C.; Chow, C.C.; Tellier, L.C.; Vattikuti, S.; Purcell, S.M.; Lee, J.J. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 2015, 4, s13742-015.
26. Carlson, C.S.; Eberle, M.A.; Rieder, M.J.; Yi, Q.; Kruglyak, L.; Nickerson, D.A. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. 2004, 74, 106–120.
27. Baniecki, M.L.; Moon, J.; Sani, K.; Lemieux, J.E.; Schaffner, S.F.; Sabeti, P.C. Development of a SNP barcode to genotype Babesia microti infections. PLoS Neglected Trop. Dis. 2019, 13, e0007194.
28. Hamming, R.W. Error detecting and error correcting codes. Bell Syst. Tech. J. 1950, 29, 147–160.
29. Hayden, M.; Tabone, T.; Nguyen, T.; Coventry, S.; Keiper, F.; Fox, R.; Chalmers, K.; Mather, D.; Eglinton, J. An informative set of SNP markers for molecular characterisation of Australian barley germplasm. Crop Pasture Sci. 2009, 61, 70–83.
30. Templeton, A.R. The theory of speciation via the founder principle. Genetics 1980, 94, 1011–1038.

Figure 1. Flowchart for preparing sequencing reads for alignment to the barley MorexV3 [20] reference genome assembly and subsequent mutation detection (initial workflow block).

Figure 2. Scheme of barley mutation filtering detected after mapping reads to the reference genome assembly for subsequent selection of discriminating SNPs. The numbers in boxes below the headings correspond to the following: #samples—the number of varieties on the basis of which the mutation filtering is performed; #variants/SNPs left—the total number of mutations or SNPs left after filtering on each step; #removed samples and variants—the number of excluded mutations and varieties from the analysis (removal of varieties is possible in cases where the read quality is insufficient for detecting the desired SNPs).

Figure 3. The flowchart of the MDSearch algorithm for discovering the minimal discriminating set of SNPs for certification purposes.

Figure 4. Distribution of SNPs in designed minimal SNP set using MDSearch on chromosomes of reference barley genome assembly MorexV3 [20]. SNPs numbered according to Table 1.

Figure 5. The histogram illustrating the distribution of Hamming distances (the minimum number of substitutions required to change one barcode into the other) between all pairs of fingerprints for 254 barley varieties.

Table 1. Summary information about selected discriminating SNPs for the testing barley population.

# SNP	Chromosome Coordinate ^a	Ref. Allele	Alt. Allele	MAF ^b	PIC ^c
1	1H:5,550,434	T	A	0.11	0.20
2	1H:402,453,297	G	A	0.36	0.46
3	2H:39,017,947	T	C	0.11	0.20
4	2H:81,911,835	G	T	0.19	0.31
5	2H:560,080,229	A	G	0.38	0.47
6	3H:205,461,025	C	G	0.24	0.36
7	3H:447,564,794	C	A	0.23	0.35
8	3H:525,419,066	G	C	0.38	0.47
9	3H:566,880,335	T	C	0.27	0.39
10	3H:615,767,065	T	C	0.28	0.40
11	4H:453,072,008	T	C	0.15	0.26
12	5H:456,058,599	G	C	0.43	0.49
13	5H:519,706,238	A	G	0.13	0.23
14	5H:545,182,553	T	C	0.11	0.20
15	6H:13,191,080	T	G	0.44	0.49
16	6H:71,795,477	A	T	0.45	0.50
17	7H:208,590,826	A	G	0.12	0.21
18	7H:457,261,245	G	C	0.38	0.47
19	7H:616,514,313	G	A	0.38	0.47

^a Coordinates in barley reference genome assembly MorexV3 [20]. ^b Minor allele frequency. ^c Polymorphism information content (PIC) = 1 − ((RefAlleleFreq)² + (AltAlleleFreq)²).
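The PIC values in Table 1 follow from the MAF values through the formula in footnote c. For a biallelic SNP, the reference and alternative allele frequencies sum to 1, so the PIC depends only on the MAF. A minimal sketch (the function name is illustrative):

```python
def pic(maf):
    """PIC = 1 - (p^2 + q^2) for a biallelic SNP with allele frequencies p and q."""
    other = 1 - maf
    return 1 - (maf ** 2 + other ** 2)


if __name__ == "__main__":
    # MAF values taken from Table 1; the printed values can be compared with
    # the PIC column after rounding to two decimals.
    for maf in (0.11, 0.19, 0.36, 0.43):
        print(f"MAF {maf:.2f} -> PIC {pic(maf):.4f}")
```

Under this formula, PIC reaches its maximum of 0.5 at MAF = 0.5, which is why SNPs with MAF close to 0.5 are the most informative for discrimination.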

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
