*4.5. Evaluation of Prediction Models*

Two validation methods were used to evaluate the prediction models generated from combinations of statistical models, marker sets and PS datasets. The first method was a five-fold random cross-validation. The 370 flax accessions were randomly partitioned into five subsets. For a given partition, each subset was used in turn as the validation (test) dataset and the remaining four subsets formed the training dataset. This partitioning was repeated 500 times, yielding a total of 2500 training datasets (500 partitions × 5 folds) that were used to build GP models and estimate marker effects. These models were then used to predict the breeding values of the individuals in the corresponding 2500 validation datasets.

The accuracy of the genomic predictions (*r*) was defined as the Pearson correlation coefficient between the genetic values predicted by GP and the observed phenotypic values. The relative efficiency of genomic prediction over phenotypic selection (*RE*) was estimated as |*r*|/*H*<sup>2</sup> [26,27], where *H*<sup>2</sup> is the broad-sense heritability of PS, estimated to be 0.25 [3]. *RE* was used as a criterion to compare the response to one cycle of genome-wide selection with the response to one cycle of phenotypic selection. The means of *r* and *RE* over the 500 samplings for each marker set, GP model and PS dataset were used to describe the prediction accuracy of GP and the efficiency of one GP cycle relative to one phenotypic selection cycle, respectively.

To compare marker sets and PS datasets, a joint analysis of variance with Tukey multiple pairwise comparisons was performed in R to test the statistical significance of differences in *r* and *RE*. As a case study, we randomly selected 20% of all 370 accessions as the validation dataset and used the remaining 277 accessions as the training dataset to build a GP model for genomic prediction of unknown germplasm.
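The repeated five-fold procedure and the two summary statistics can be illustrated with a minimal Python sketch. It is not the code used in this study (the analyses above were run in R): the function names, the `fit_predict` callback standing in for any GP model, and the choice to compute one *r* per train/test split are all assumptions made for illustration. Only the fold count, the number of repeats and *H*<sup>2</sup> = 0.25 come from the text.

```python
import random
from math import sqrt

H2 = 0.25  # broad-sense heritability of PS reported in the text


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def five_fold_partition(n, rng, n_folds=5):
    """Randomly split the indices 0..n-1 into n_folds near-equal subsets."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]


def repeated_cv(y, fit_predict, n_repeats=500, n_folds=5, seed=0):
    """Return one (r, RE) pair per train/test split.

    `fit_predict(train_idx, test_idx)` is a hypothetical callback: it should
    train a GP model on the accessions in `train_idx` and return predicted
    values for the accessions in `test_idx`, in the same order.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        folds = five_fold_partition(len(y), rng, n_folds)
        for k, test in enumerate(folds):
            # The four folds other than fold k form the training set.
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            pred = fit_predict(train, test)
            r = pearson_r(pred, [y[i] for i in test])
            results.append((r, abs(r) / H2))  # RE = |r| / H^2
    return results
```

With 370 accessions and the default settings, each fold holds 74 accessions and `repeated_cv` returns 2500 pairs, whose means correspond to the reported mean *r* and *RE*. Whether *r* is computed per fold or pooled within a partition is not stated in the text, so the per-fold choice here is only one plausible reading.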

The second validation method compared models across PS datasets: each of the six complete PS phenotypic datasets was used as a training dataset to build a GP model, which was then applied to that same dataset and to the other five. The same set of markers for all 370 accessions was used for training and validation. This method tests whether models built from single-year phenotypic data are relevant for predicting phenotypes measured in other years.
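The cross-dataset comparison amounts to filling a square table of accuracies, one row per training dataset and one column per validation dataset. The sketch below shows that bookkeeping only, in Python rather than the R used for the study. The function `cross_dataset_accuracy` and the `fit_predict` callback are hypothetical names, and the callback is assumed to close over the shared marker matrix and return predictions for all accessions in a fixed order.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_dataset_accuracy(datasets, fit_predict):
    """Train on each dataset and validate on every dataset, itself included.

    `datasets` maps a dataset name to its phenotypes, listed for the same
    accessions in the same order. `fit_predict(y_train)` is a hypothetical
    callback that trains a GP model on `y_train` and returns predicted
    values for all accessions.
    """
    accuracy = {}
    for train_name, y_train in datasets.items():
        pred = fit_predict(y_train)
        for test_name, y_test in datasets.items():
            accuracy[(train_name, test_name)] = pearson_r(pred, y_test)
    return accuracy
```

For six PS datasets the result has 36 entries. The six entries where the training and validation names match measure how well each model fits its own data, so they are not independent estimates of accuracy and are best read separately from the 30 off-diagonal entries.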
