*2.1. SNP Discovery and Characterization*

The Miseq run of 192 genotypes from 12 CWG lines (Table 1) generated approximately 87.8 million raw forward (R1) sequence reads of 250 bp. The number of raw forward sequence reads per sample ranged from 190,606 to 775,160 with an average of 457,279. Combined UNEAK and Haplotag analysis at the 20%, 30%, 40%, and 50% level of missing data generated 227; 1,884; 10,738; and 45,507 SNPs, respectively across the 192 genotypes. In addition, this analysis also generated many metagenomic files associated with the SNP discovery, which are described and accessible in the online Supplementary Materials. The distribution of the minor allele frequency in 45,507 SNPs' data ranged from 0.025 to 0.5, and exhibited a steady decline of minor alleles with increased occurrence of frequencies from 0.075 to 0.5 (Figure 1A). Likewise, there were more SNPs at the higher percentages of missing data (Figure 1B). metagenomic files associated with the SNP discovery, which are described and accessible in the online Supplementary Materials. The distribution of the minor allele frequency in 45,507 SNPs' data ranged from 0.025 to 0.5, and exhibited a steady decline of minor alleles with increased occurrence of frequencies from 0.075 to 0.5 (Figure 1A). Likewise, there were more SNPs at the higher percentages of missing data (Figure 1B). **Table 1.** List of the 12 crested wheatgrass (*A*. *cristatum*) lines used in the study.

*IJMS* **2018**, *9*, x FOR PEER REVIEW 3 of 13

SNPs, respectively across the 192 genotypes. In addition, this analysis also generated many


**Table 1.** List of the 12 crested wheatgrass (*A*. *cristatum*) lines used in the study. **Lines CN Number <sup>a</sup>Alternative Identification <sup>a</sup>Origin Type** 

<sup>a</sup> CN number is the line identification in Plant Gene Resources of Canada, Agriculture, and Agri-Food Canada (AAFC), while the alternative identifications, including FOR or S, are from the joint forage breeding program of the University of Saskatchewan and AAFC, and PI is from plant inventory book, National Germplasm Resources Laboratory, USA. Canada (AAFC), while the alternative identifications, including FOR or S, are from the joint forage breeding program of the University of Saskatchewan and AAFC, and PI is from plant inventory book, National Germplasm Resources Laboratory, USA.

**Figure 1.** The minor allele frequency distribution (**A**) and the frequency of missing data (**B**) for 45,507 SNP markers in 192 genotypes of 12 crested wheatgrass lines. **Figure 1.** The minor allele frequency distribution (**A**) and the frequency of missing data (**B**) for 45,507 SNP markers in 192 genotypes of 12 crested wheatgrass lines.

#### *2.2. Genetic Structure and Relationship 2.2. Genetic Structure and Relationship*

The genetic structure estimated for 192 genotypes from 12 CWG lines without consideration of prior population information in the STRUCTURE [36] analysis revealed four optimal clusters (Figure 2A) with strong support from change in LnP(K) variance (Figure 2B) and the largest delta K value (Figure 2C). Cluster 1 (red in color) consisted of 17 genotypes (16 from Vysokij 9 and one from S8959E). Cluster 2 (green in color) had 22 genotypes (16 from S9491 and 6 from S9514). Cluster 3 (blue in color) was the largest cluster, with 95 genotypes from seven lines. Cluster 4 (yellow in color), with 58 genotypes from five lines, was the second largest cluster. The neighbor-joining (NJ) tree was in agreement with clusters obtained from the STRUCTURE analysis (Figure 3). However, there existed some discrepancies, as some members of cluster 4 (yellow in color) were spread into cluster 2 (green in color) and cluster 3 (blue in color). The genetic structure estimated for 192 genotypes from 12 CWG lines without consideration of prior population information in the STRUCTURE [36] analysis revealed four optimal clusters (Figure 2A) with strong support from change in LnP(K) variance (Figure 2B) and the largest delta K value (Figure 2C). Cluster 1 (red in color) consisted of 17 genotypes (16 from Vysokij 9 and one from S8959E). Cluster 2 (green in color) had 22 genotypes (16 from S9491 and 6 from S9514). Cluster 3 (blue in color) was the largest cluster, with 95 genotypes from seven lines. Cluster 4 (yellow in color), with 58 genotypes from five lines, was the second largest cluster. The neighbor-joining (NJ) tree was in agreement with clusters obtained from the STRUCTURE analysis (Figure 3). However, there existed some discrepancies, as some members of cluster 4 (yellow in color) were spread into cluster 2 (green in color) and cluster 3 (blue in color).

*IJMS* **2018**, *9*, x FOR PEER REVIEW 4 of 13

**Figure 2.** Four genetic clusters of 192 genotypes of the 12 crested wheatgrass lines inferred by STRUCTURE based on 45,507 SNP markers. (**A**) The mixture coefficients of 192 genotypes with K = 4, presented in the original order of genotypes from 12 lines (see Table 1 for line label); (**B**) support from the LnP(K) estimation; (**C**) support from the estimation of the largest value of the delta K = mean (|Ln"(K)|)/sd (LnP(K)). **Figure 2.** Four genetic clusters of 192 genotypes of the 12 crested wheatgrass lines inferred by STRUCTURE based on 45,507 SNP markers. (**A**) The mixture coefficients of 192 genotypes with K = 4, presented in the original order of genotypes from 12 lines (see Table 1 for line label); (**B**) support from the LnP(K) estimation; (**C**) support from the estimation of the largest value of the delta K = mean (|Ln"(K)|)/sd (LnP(K)). **Figure 2.** Four genetic clusters of 192 genotypes of the 12 crested wheatgrass lines inferred by STRUCTURE based on 45,507 SNP markers. (**A**) The mixture coefficients of 192 genotypes with K = 4, presented in the original order of genotypes from 12 lines (see Table 1 for line label); (**B**) support from the LnP(K) estimation; (**C**) support from the estimation of the largest value of the delta K = mean (|Ln"(K)|)/sd (LnP(K)).

label. Each node for a genotype is represented with colored circle followed by genotype name. Red, green, blue, and yellow represent plants in Clusters 1, 2, 3, and 4, inferred from the STRUCTURE analysis (Figure 2A), respectively. **Figure 3.** Genetic relationship of 192 genotypes of the 12 crested wheatgrass lines as revealed by neighbor-joining clustering with the 45,507 SNP markers. Each genotype is numbered after its line label. Each node for a genotype is represented with colored circle followed by genotype name. Red, green, blue, and yellow represent plants in Clusters 1, 2, 3, and 4, inferred from the STRUCTURE analysis (Figure 2A), respectively. **Figure 3.** Genetic relationship of 192 genotypes of the 12 crested wheatgrass lines as revealed by neighbor-joining clustering with the 45,507 SNP markers. Each genotype is numbered after its line label. Each node for a genotype is represented with colored circle followed by genotype name. Red, green, blue, and yellow represent plants in Clusters 1, 2, 3, and 4, inferred from the STRUCTURE analysis (Figure 2A), respectively.

neighbor-joining clustering with the 45,507 SNP markers. Each genotype is numbered after its line

The principal coordinates analysis (PCoA) revealed that the genetic relationship of 192 genotypes (Figure 4A) was not in accordance to the Bayesian inferences from the STRUCTURE analysis. The clusters II, III, and IV identified by the Bayesian inferences appeared to overlap and became undistinguishable with PCoA. However, the PCoA plot was able to distinguish four lines Karabalykskij 202 (from Kazakhstan), PGR 16,830 (from Kazakhstan), Vysokij 9 (from Russia) and S8,959E (selected from Vysokij 9) from the rest of the lines (Figure 4B). We also observed lines S9,516, S9,544 and S9,556 from cluster 3 (blue in color from the model-based Bayesian analysis) were more dispersed than other breeding lines and cultivars, likely indicating the larger genetic diversity present in those breeding lines (Figure 4B). The principal coordinates analysis (PCoA) revealed that the genetic relationship of 192 genotypes (Figure 4A) was not in accordance to the Bayesian inferences from the STRUCTURE analysis. The clusters II, III, and IV identified by the Bayesian inferences appeared to overlap and became undistinguishable with PCoA. However, the PCoA plot was able to distinguish four lines Karabalykskij 202 (from Kazakhstan), PGR 16,830 (from Kazakhstan), Vysokij 9 (from Russia) and S8,959E (selected from Vysokij 9) from the rest of the lines (Figure 4B). We also observed lines S9,516, S9,544 and S9,556 from cluster 3 (blue in color from the model-based Bayesian analysis) were more dispersed than other breeding lines and cultivars, likely indicating the larger genetic diversity present in those breeding lines (Figure 4B).

**Figure 4.** Genetic relationship of 192 genotypes of the 12 crested wheatgrass lines as revealed by principal coordinates analysis (PCoA) with the 45,507 SNP markers. Two panels are identical, but in the left panel (**A)** each genotype is labelled with colored circles representing the clusters obtained from the STRUCTURE analysis, while the right panel (**B)** labels genotypes for 12 lines. **Figure 4.** Genetic relationship of 192 genotypes of the 12 crested wheatgrass lines as revealed by principal coordinates analysis (PCoA) with the 45,507 SNP markers. Two panels are identical, but in the left panel (**A)** each genotype is labelled with colored circles representing the clusters obtained from the STRUCTURE analysis, while the right panel (**B)** labels genotypes for 12 lines.

#### *2.3. Genetic Differentiation 2.3. Genetic Differentiation*

distance of 0.15.

The analysis of molecular variance (AMOVA) revealed that most of the SNP variations were present within the lines (84.2%), while much smaller variations reside among lines (15.8%) or among the four Bayesian clusters (12.07%) (Table 2). Line-specific Fst was also estimated from AMOVA for each line as the weighted variation among individual plants within a line to observe the extent of inbreeding. They were obtained in the range of 0.56 (in line S9491) to 0.64 (in the cultivar Kirk) with mean of 0.60 (Figure 5B). The pairwise genetic distance among the 12 lines ranged from 0.055 (between AC-Goliath and S9544) to 0.32 (between Karabalykskij 202 and S9491) with an average The analysis of molecular variance (AMOVA) revealed that most of the SNP variations were present within the lines (84.2%), while much smaller variations reside among lines (15.8%) or among the four Bayesian clusters (12.07%) (Table 2). Line-specific Fst was also estimated from AMOVA for each line as the weighted variation among individual plants within a line to observe the extent of inbreeding. They were obtained in the range of 0.56 (in line S9491) to 0.64 (in the cultivar Kirk) with mean of 0.60 (Figure 5B). The pairwise genetic distance among the 12 lines ranged from 0.055 (between AC-Goliath and S9544) to 0.32 (between Karabalykskij 202 and S9491) with an average distance of 0.15.

**Table 2.** Results of the analysis of molecular variance for two models of genetic structure (12 lines and **Table 2.** Results of the analysis of molecular variance for two models of genetic structure (12 lines and four clusters from the STRUCTURE analysis) based on 45,507 SNP markers.


Among clusters 3 54,736.5 193.3 12.1 Within clusters 380 534,910.3 1407.7 87.9 <sup>a</sup> These variances were statistically significant from zero at *P* < 0.0001.

a These variances were statistically significant from zero at *P* < 0.0001.

*IJMS* **2018**, *9*, x FOR PEER REVIEW 6 of 13

**Figure 5.** Genetic diversity and genetic relationships of the 12 crested wheatgrass lines. Left panel (**A**) shows their genetic relationship in the unweighted pair group method with arithmetic mean (UPGMA) dendrogram based on the Phi statistics obtained from the AMOVA. The right panel (**B**) displays the line-specific Fst values for the 12 lines. **Figure 5.** Genetic diversity and genetic relationships of the 12 crested wheatgrass lines. Left panel (**A**)shows their genetic relationship in the unweighted pair group method with arithmetic mean (UPGMA)dendrogram based on the Phi statistics obtained from the AMOVA. The right panel (**B**) displays the line-specific Fst values for the 12 lines. **Figure 5.** Genetic diversity and genetic relationships of the 12 crested wheatgrass lines. Left panel (**A**) shows their genetic relationship in the unweighted pair group method with arithmetic mean (UPGMA) dendrogram based on the Phi statistics obtained from the AMOVA. The right panel (**B**) displays the line-specific Fst values for the 12 lines.

The dendrogram based on AMOVA showed the grouping of the 12 CWG lines into three genetically distinct clusters at the Phi statistic of 0.08 or more (Figure 5A). The dendrogram grouped the lines from Kazakhstan and Russia in one distinct cluster. The second distinct cluster consisted of the single line S9491. The largest of all is the third cluster, with seven lines consisting of cultivars and breeding lines from Canada. The dendrogram based on AMOVA showed the grouping of the 12 CWG lines into three genetically distinct clusters at the Phi statistic of 0.08 or more (Figure 5A). The dendrogram grouped the lines from Kazakhstan and Russia in one distinct cluster. The second distinct cluster consisted of the single line S9491. The largest of all is the third cluster, with seven lines consisting of cultivars and breeding lines from Canada. The dendrogram based on AMOVA showed the grouping of the 12 CWG lines into three genetically distinct clusters at the Phi statistic of 0.08 or more (Figure 5A). The dendrogram grouped the lines from Kazakhstan and Russia in one distinct cluster. The second distinct cluster consisted of the single line S9491. The largest of all is the third cluster, with seven lines consisting of cultivars and breeding lines from Canada.

#### *2.4. Effects of Missing Data on Diversity Analysis 2.4. Effects of Missing Data on Diversity Analysis 2.4. Effects of Missing Data on Diversity Analysis*

**3. Discussion** 

**3. Discussion** 

The optimal numbers of genetic clusters inferred from STRUCTURE analyses with respect to the extent of missing data from M20%, M30%, M40%, and M50% datasets provided 4, 6, 6, and 4 optimal clusters, respectively (Figure 6A). Comparing the proportions of SNP variance residing among the 12 lines inferred from the AMOVA analysis showed 24.6%, 20.3%, 17.8%, and 15.8% for M20%, M30%, M40%, and M50%, respectively (Figure 6B). The optimal numbers of genetic clusters inferred from STRUCTURE analyses with respect to the extent of missing data from M20%, M30%, M40%, and M50% datasets provided 4, 6, 6, and 4 optimal clusters, respectively (Figure 6A). Comparing the proportions of SNP variance residing among the 12 lines inferred from the AMOVA analysis showed 24.6%, 20.3%, 17.8%, and 15.8% for M20%, M30%, M40%, and M50%, respectively (Figure 6B). The optimal numbers of genetic clusters inferred from STRUCTURE analyses with respect to the extent of missing data from M20%, M30%, M40%, and M50% datasets provided 4, 6, 6, and 4 optimal clusters, respectively (Figure 6A). Comparing the proportions of SNP variance residing among the 12 lines inferred from the AMOVA analysis showed 24.6%, 20.3%, 17.8%, and 15.8% for M20%, M30%, M40%, and M50%, respectively (Figure 6B).

**Figure 6.** The impact of missing SNP data on the inferences of STRUCTURE and AMOVA analysis. The left panel (**A**) shows the four optimal clusters obtained from the STRUCTURE analyses at the missing level of M20% and M50%, and six clusters at M30% and M40%. The right panel (**B**) shows the SNP variances, ranging from 24.6 to 15.78%, inferred from AMOVA analyses residing among 12 lines at the increasing level of missing values from M20% to M50%, respectively. **Figure 6.** The impact of missing SNP data on the inferences of STRUCTURE and AMOVA analysis. The left panel (**A**) shows the four optimal clusters obtained from the STRUCTURE analyses at the missing level of M20% and M50%, and six clusters at M30% and M40%. The right panel (**B**) shows the SNP variances, ranging from 24.6 to 15.78%, inferred from AMOVA analyses residing among 12 lines at the increasing level of missing values from M20% to M50%, respectively. **Figure 6.** The impact of missing SNP data on the inferences of STRUCTURE and AMOVA analysis. The left panel (**A**) shows the four optimal clusters obtained from the STRUCTURE analyses at the missing level of M20% and M50%, and six clusters at M30% and M40%. The right panel (**B**) shows the SNP variances, ranging from 24.6 to 15.78%, inferred from AMOVA analyses residing among 12 lines at the increasing level of missing values from M20% to M50%, respectively.

This study utilized the gd-GBS application, in combination with Haplotag pipeline, for the first time in CWG, to generate a data matrix of 192 genotypes × 45,507 SNP markers, and captured

This study utilized the gd-GBS application, in combination with Haplotag pipeline, for the first time in CWG, to generate a data matrix of 192 genotypes × 45,507 SNP markers, and captured
