*4.4. Genetic Diversity Analysis*

The diversity analysis was based on 45,507 SNP markers, each with 50% or less missing values, scored in 192 genotypes from 12 CWG lines. Data analysis began with the calculation of the minor allele frequency and the extent of missing SNP data in Microsoft Excel®. Diversity analyses were then carried out at the individual and line levels.
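The two per-SNP summaries named above can be illustrated with a minimal sketch. It is not the spreadsheet workflow used in the study: the genotype encoding (0, 1, or 2 copies of the alternate allele, `None` for a missing call) and the names `snp_summary` and `keep_snp` are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple


def snp_summary(calls: Sequence[Optional[int]]) -> Tuple[float, Optional[float]]:
    """Return (missing_rate, minor_allele_frequency) for one SNP.

    `calls` holds one entry per genotype: 0, 1 or 2 copies of the alternate
    allele, or None for a missing call (an assumed encoding).
    """
    if not calls:
        raise ValueError("at least one genotype call is required")
    observed = [c for c in calls if c is not None]
    missing_rate = (len(calls) - len(observed)) / len(calls)
    if not observed:
        return missing_rate, None  # MAF is undefined without observed calls
    # Each diploid genotype contributes two alleles.
    alt_freq = sum(observed) / (2 * len(observed))
    return missing_rate, min(alt_freq, 1.0 - alt_freq)


def keep_snp(calls: Sequence[Optional[int]], max_missing: float = 0.5) -> bool:
    """True when the SNP has `max_missing` or less missing calls."""
    missing_rate, _ = snp_summary(calls)
    return missing_rate <= max_missing
```

With `max_missing=0.5`, the filter mirrors the "50% or less missing values" criterion stated for the 45,507-marker dataset.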

Three types of diversity analysis were performed at the individual genotype level. First, the genetic structure of the 192 CWG genotypes was examined using the model-based Bayesian method implemented in the program STRUCTURE version 2.2.3 [36,43]. STRUCTURE was run in parallel on a 60-core Linux server under an admixture model, with 20 independent runs for each number of population subgroups (K = 1–9) and 10,000 replicates each for the burn-in and for the subsequent analysis. The final number of population subgroups was determined based on (1) a plot of the likelihood of these models, (2) the second-order rate of change in the likelihood (∆K) between successive K values [44], and (3) the consistency of group configuration across the 20 runs. For a given K, the run with the highest likelihood value among the 20 runs was chosen to assign the posterior membership coefficients to each sample. These posterior membership coefficients were used to create a graphical bar plot. The size and composition of each optimal cluster with respect to population were evaluated. Second, a neighbor-joining (NJ) analysis of the 192 genotypes was conducted using MEGA version 7.0.14 [45] based on the dissimilarity matrix obtained from the R routine AveDissR [46,47], and a radiation tree was displayed. Third, a PCoA of all 192 genotypes was performed using the R routine AveDissR [46,47] to assess genetic distinctness and redundancy; plots of the first two resulting principal components were generated to assess the genotype associations. For comparison, the resulting NJ trees and PCoA plots were individually labeled according to the inferred structures.
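The ∆K criterion [44] used to select K can be sketched as follows. The sketch assumes the log-likelihood values of the replicate runs have already been extracted from the STRUCTURE output into a dictionary keyed by K. The function name `evanno_delta_k`, the input layout, and the use of the mean log-likelihood across runs are illustrative assumptions, not the exact procedure used in the study.

```python
from statistics import mean, stdev
from typing import Dict, List


def evanno_delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)).

    `lnp[K]` lists the log-likelihoods of the replicate runs at that K
    (at least two runs per K). L(K) is the mean over runs. Delta K is
    only defined where both K-1 and K+1 were also run.
    """
    avg = {k: mean(values) for k, values in lnp.items()}
    delta: Dict[int, float] = {}
    for k in sorted(lnp):
        if k - 1 not in avg or k + 1 not in avg:
            continue  # first and last K have no second-order difference
        sd = stdev(lnp[k])
        if sd == 0:
            continue  # Delta K is undefined when all runs agree exactly
        delta[k] = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1]) / sd
    return delta


# Hypothetical log-likelihoods, two runs per K, for illustration only.
example = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -792.0],
    4: [-785.0, -787.0],
}
delta_k = evanno_delta_k(example)
best_k = max(delta_k, key=delta_k.get)
```

For K = 1–9, ∆K is defined only for K = 2–8, which is one reason to read it alongside the likelihood plot and the consistency of group configuration across runs.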

Genetic variation among the 12 lines was evaluated with AMOVA using Arlequin version 3.5 [48] on the 45,507 markers. In addition, pairwise genetic distances were computed and line-specific Fst values (inbreeding coefficients) [49] were generated for each line to infer the reduction in heterozygosity. An additional AMOVA was performed to inspect the genetic variation among the clusters identified by the STRUCTURE analysis. An unweighted pair group method with arithmetic mean (UPGMA) dendrogram, based on the pairwise genetic distances among the 12 lines obtained from AMOVA, was generated using MEGA version 7.0.14 [45] to evaluate line differentiation and distinctness.
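To make the among-line variance concrete, the following minimal sketch partitions a matrix of squared pairwise distances into among-group and within-group components for a single grouping level. It is a simplified illustration, not a reimplementation of Arlequin: it treats every sample as one unit, omits the permutation test, and uses hypothetical names (`amova_one_level`, `sq_dist`, `groups`).

```python
from itertools import combinations
from typing import Dict, List, Sequence


def amova_one_level(sq_dist: Sequence[Sequence[float]],
                    groups: Sequence[str]) -> Dict[str, float]:
    """Partition squared distances into among- and within-group variance."""
    n = len(groups)
    members: Dict[str, List[int]] = {}
    for idx, label in enumerate(groups):
        members.setdefault(label, []).append(idx)
    n_groups = len(members)
    if n_groups < 2 or n <= n_groups:
        raise ValueError("need at least two groups and more samples than groups")

    # Sums of squared deviations expressed through pairwise squared distances.
    ssd_total = sum(sq_dist[i][j] for i, j in combinations(range(n), 2)) / n
    ssd_within = sum(
        sum(sq_dist[i][j] for i, j in combinations(idx, 2)) / len(idx)
        for idx in members.values()
    )
    ssd_among = ssd_total - ssd_within

    var_within = ssd_within / (n - n_groups)
    # Effective group size, which corrects for unequal group sizes.
    n0 = (n - sum(len(idx) ** 2 for idx in members.values()) / n) / (n_groups - 1)
    var_among = (ssd_among / (n_groups - 1) - var_within) / n0
    total = var_among + var_within
    phi_st = var_among / total if total != 0 else float("nan")
    return {"var_among": var_among, "var_within": var_within, "phi_st": phi_st}


# Hypothetical toy data: two lines, two samples each, fully differentiated.
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0],
       [1, 1, 0, 0]]
result = amova_one_level(toy, ["A", "A", "B", "B"])
```

In this toy case all variance lies among groups, so the fixation index equals 1. The same function could be applied with STRUCTURE cluster labels in place of line labels.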

To estimate the influence of missing SNP data on the genetic diversity analysis, four datasets of 272; 1884; 10,738; and 45,507 SNPs, corresponding to missing-data levels of 20%, 30%, 40%, and 50% (M20%, M30%, M40%, and M50%), respectively, were assembled for the 192 genotypes. For each dataset, the among-line variance from AMOVA and the optimal number of genetic clusters from STRUCTURE were obtained and compared across the four datasets.
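One way such datasets could be derived from a single genotype matrix is sketched below. The sketch assumes each threshold is inclusive (a SNP is kept when its missing rate is at or below the cutoff), that rows are SNPs and columns are genotypes, and that missing calls are stored as `None`. These choices and the function names are illustrative assumptions rather than the procedure documented in the study.

```python
from typing import Dict, List, Optional, Sequence


def missing_rate(calls: Sequence[Optional[int]]) -> float:
    """Fraction of genotypes without a call for one SNP."""
    return sum(c is None for c in calls) / len(calls)


def subsets_by_missing(
    matrix: Sequence[Sequence[Optional[int]]],
    thresholds: Sequence[float] = (0.2, 0.3, 0.4, 0.5),
) -> Dict[str, List[int]]:
    """Map each label (for example 'M20%') to the row indices of retained SNPs."""
    rates = [missing_rate(row) for row in matrix]
    return {
        f"M{round(t * 100)}%": [i for i, r in enumerate(rates) if r <= t]
        for t in thresholds
    }


def make_row(n_missing: int, n_genotypes: int = 10) -> List[Optional[int]]:
    """Hypothetical SNP row with the requested number of missing calls."""
    return [None] * n_missing + [0] * (n_genotypes - n_missing)


# Five hypothetical SNPs with 10%, 30%, 40%, 50%, and 60% missing calls.
toy_matrix = [make_row(k) for k in (1, 3, 4, 5, 6)]
datasets = subsets_by_missing(toy_matrix)
```

Under inclusive thresholds the subsets are nested, so each stricter dataset is contained in the more permissive ones. Differences among them then reflect both the level of missing data and the number of markers.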
