1. Introduction
Greater grain legume cultivation has a high priority for European agriculture, to enhance the sustainability of its agricultural systems [
1,
2], reduce its huge deficit for high-protein feedstuffs [
3], and meet the increasing demand for healthy and nutritious plant-based foods [
4]. White lupin (
Lupinus albus L.) is a Mediterranean crop with a long history of domestication that has high potential interest in this context, because of its seed protein content close to 40% and other characteristics [
5,
6]. Its exploitation as a component of functional, healthy, or vegan food is favored by its good content of essential amino acids and several useful techno-functional properties [
7], the ability of its γ-conglutin protein fraction to control glycaemia by interaction and binding with insulin and insulin-mimetic properties [
8], and 8–12% seed content of oil with excellent nutritional characteristics [
9]. While being less targeted by modern breeding than other cool-season grain legumes such as peas, faba beans, and narrow-leafed lupin, white lupin has out-performed these species in terms of crude protein yield per unit area across climatically-contrasting, autumn-sown rain-fed environments of southern Europe [
10].
A key factor that limits the cultivation of white lupin worldwide is its narrow adaptation with respect to soil type. Grain and biomass reductions can be expected in soils whose active lime, i.e., the soluble fraction of calcium carbonate based on Drouineau’s [
11] method, exceeds 1% [
12], which are mildly calcareous, or in soils with an alkaline reaction [
13]. Various studies have suggested that active lime has a greater depressive effect on crop yield than alkalinity [
14,
15,
16], but these factors are associated in calcareous soils. Although Ca
2+ may directly affect
Lupinus species [
17], the main effects of lime are indirect, through the precipitation of organic acids secreted by lupin cluster roots to mobilize and uptake phosphorus and iron [
18], and through the inhibition of iron uptake by HCO
3− [
13]. Indeed, lupin cluster roots increase in their relative proportion in response to limed soils, although the total root biomass decreases [
19]. High soil Ca
2+ concentration and alkaline soils also reduce the growth and the nodulating ability of the nitrogen-fixing
Bradyrhizobium microsymbiont of lupins [
13,
20,
21], but the selection of lime-tolerant plant material may have a greater impact on crop adaptation to calcareous soils than the exploitation of
Bradyrhizobium strains adapted to such soils [
22].
Various studies have highlighted the adaptation to moderately calcareous or alkaline soils of Egyptian landraces that evolved in such soils [
23,
24,
25], but the exploitation of Egyptian genetic resources in breeding for autumn-sown environments of southern Europe is limited by their high susceptibility to winter low temperatures [
26]. Variation in adaptation to moderately calcareous or alkaline soils was also found in commercial varieties and some accessions of indefinite origin [
15], and within Italian landrace germplasm [
22]. The phenotypic selection for white lupin lime tolerance may be complicated by high soil heterogeneity when performed under field conditions [
27,
28], and by abnormal plant root growth [
29] and fairly modest correlation with genotype responses in agricultural environments [
15] when performed in pots or liquid culture. In addition, lime tolerance and adaptation to other site-specific agroclimatic characteristics are confounded when assessing genotype yield responses under field conditions, suggesting to assess these responses across agroclimatically-contrasting sites that feature a calcareous soil.
Plant breeding for traits that are genetically complex and/or difficult to assess could become more cost-efficient by genomic selection, which combines phenotyping and genotyping data of a genotype sample (training population) representing a target genetic base (reference population) into a statistical model for the prediction of breeding values in future plant selection [
30,
31]. This avenue has become practically feasible after the development of next generation sequencing techniques that genotype large germplasm sets by thousands of single nucleotide polymorphism (SNP) markers at a relatively low cost. Genotyping-by-sequencing (GBS; [
32]), which skips sequence discovery and explores SNP polymorphism in DNA fragments cut by a restriction enzyme, holds special interest in this context [
33]. One study confirmed the ability of GBS to generate thousands of polymorphic SNP markers for white lupin genetic analyses [
34]. GBS-based, pioneer genomic selection studies were encouraging for white lupin breeding. The cross-environment predictive ability (as Pearson’s correlation between predicted and observed values) for grain yield between pairs of moderately contrasting environments, such as autumn-sown subcontinental vs. Mediterranean sites or moisture-favorable vs. drought-prone environments, was in the range of 0.40–0.51 [
35]. Predictions were highly accurate (0.84–0.85) for the onset of flowering and winter survival, and moderately accurate (0.49–0.63) for various grain yield components and other morphophysiological traits, based on intra-environment cross-validations in a second study [
36]. The practical interest of these findings for breeding programs was limited by the fact that they were based on relatively small genotypes sets (81 to 117 entries) mainly including landrace accessions.
This study focused on a collection of 140 GBS-genotyped sweet-seed breeding lines representative of a genetically-broad reference population originated by crosses of four elite landraces with four elite sweet-seed cultivars or breeding lines performed according to a factorial mating design. The lines were evaluated for grain yield and other traits associated with performance in an autumn-sown environment of Greece and a spring-sown environment of the Netherlands that featured moderately calcareous and alkaline soils. The main objectives of our study were to assess the phenotypic variation, the trait architecture based on a genome-wide association study (GWAS), and the predictive ability of genome-enabled models for grain yield and its contributing traits, verifying as well the consistency of genotype adaptation patterns and genome-based results across the two growing regions.
3. Discussion
The soil active lime content of the test sites was nearly (Ens) or over (Larissa) two-fold that indicated in [
12] as the threshold for the occurrence of crop yield reduction due to soil lime stress. However, lime tolerance affected lupin genotype responses to a much greater extent in the Greek site than the Dutch site on the grounds of site mean values of the lime susceptibility scores and site-specific correlations of these scores with line grain yield, despite the only moderate difference between sites for soil active lime. Definitely greater drought stress caused by lower rainfall may have emphasized the negative effect of a calcareous soil in the Greek site, when considering that soil lime may determine a remarkable reduction of root biomass [
19,
29]. Better crop performance in the Dutch site may have been favored not only by the wetter climate, but also by greater soil N (
Table 1) and more intense N fixation arising from seed inoculation, thereby partly compensating for the negative effect of soil lime.
An important finding of our study was the low consistency of genotype yield responses across the two sites featuring a moderately calcareous soil (as indicated by the
rg value of 0.32). Various reasons may account for this result. First, the site-specific adaptive responses of the genotypes were not only influenced by lime susceptibility, but also by adaptation to the contrasting climates and sowing times of the two sites. Secondly, the much greater impact on plant responses of lime tolerance in the Greek site than in the Dutch site (as indicated by site mean values of the lime susceptibility score) could itself be a reason for GEI, making lime tolerance variation important mainly in the former site. In this context, data from the Greek site are more relevant than those from the Dutch site for assessing the lime tolerance of the lines, also with regard to lime tolerance score values, which suffered low heritability besides referring to milder stress in the Dutch site. Lime tolerance scores themselves displayed definite inconsistency between locations, suggesting that they might have estimated the general plant vigor (as determined by several possible site-specific factors) rather than the intrinsic lime tolerance, especially in the Dutch site (featuring lower soil lime stress). However, white lupin lime tolerance proved to be a fairly elusive and only partly repeatable trait in earlier studies. For example, 17 cultivars exhibited only moderate correlation (
r = 0.50) for lime tolerance, expressed as shoot biomass ratio between high-soil lime and low-soil lime environments, across experiments in Italy and Morocco [
22]; and the Egyptian landrace Giza 1 did not exhibit a distinct lime tolerance response in any of these experiments, while having emerged as lime tolerant in other studies [
23,
25]. The whole of these findings confirms the difficulty of improving the lupin adaptation to calcareous soils and suggests the importance of pursuing that on a regional basis, in order to also take account of climatic adaptation and its possible interaction with lime susceptibility. Selection for lime tolerance is also made challenging by the difficulty to highlight this trait by concurrent evaluation in high-soil lime and low-soil lime environments, because of the difficulty in identifying geographically-close locations with contrasting soil (as necessary to prevent the confounding of soil type effects with other environmental effects) and the limitations of evaluations performed in pots or hydroponic solutions (e.g., abnormal plant root growth [
29]; modest reproduction of responses under field conditions [
15]; lack of meaningful grain yield assessment).
The number of pods per plant was the grain yield component that displayed the highest correlation with grain yield in both locations. This result agrees with results for landrace germplasm evaluated in Spanish environments [
38,
39]. Individual seed weight exhibited higher correlation with grain yield than the number of pods per plant or the number of seeds per pod in a multi-environment evaluation of a world germplasm collection in which seed weight featured greater genetic variation (based on CV
g values) than the other seed components [
26], which was not the case here (
Table 2). The current results, which are likely to be relevant for most breeding populations (whose variation for individual seed weight is unlikely to be very large), support the selection for more pods per plant whenever grain yield could not be meaningfully assessed, as in early selection stages.
The observed pattern of LD decay as a function of the physical distance was very similar to that reported for the same reference population of lines in an earlier study [
40], with differences in the 90th percentile
r2 and corresponding distance values essentially arising from the adoption of different window sizes for LD estimation (100 kb instead of 50 kb). However, the current LD decay was faster than that reported by Hufnagel et al. for a collection of cultivars and landraces [
41], in the presence of substantial variation among chromosomes that was also found in that study. Although challenged by the fast LD decay and the somewhat suboptimal genotype sample size, our study revealed various quantitative trait loci (QTL) for crop yield and its component traits. However, these QTL were of region-specific interest, given their inconsistency across locations. This finding reflected the large GEI observed for most traits. The definite relationship between lime tolerance and crop yield that emerged phenotypically for the site featuring greater soil lime stress, i.e. Larissa, was confirmed genomically by the detection of a SNP on chromosome 7 that was associated with the former trait and tended to association with the latter trait. This SNP was in high LD with a gene encoding a putative ferric-chelate reductase (Lalb_Chr07g0184151), an enzyme that proved to affect lime tolerance in various other herbaceous [
42,
43] and tree species [
44,
45]. In particular, this enzyme participates in the absorption of iron, which is poorly available in soils featuring a high pH [
46] and active lime [
18], by reducing chelated Fe
3+ according to the acquisition strategy I typical of non graminaceous species [
47]. Our results support this gene as a strong candidate for improving white lupin tolerance to calcareous soils in future studies, albeit in the context of a polygenic control of lime tolerance that emerged from our study.
The GWAS revealed very few or no significant SNPs for all traits in both sites, except for individual seed weight in Larissa, probably because of insufficient power to detect small-effect genes that are expected to be important for the control of crop yield, yield components, and plant stature. For individual seed weight, a polygenic trait architecture was supported not only by the detection of various putative QTL in one location, but by the fact that cross-environment GS prediction models exhibited just a modest loss in predictive ability (27%) compared with intra-environment prediction models despite the inconsistency for significant SNPs across locations. Such an outcome could only arise from the presence of many small-effect, statistically non-significant QTL that are consistent across locations.
Genomic selection is highly relevant for polygenic traits if predictions are proved to be sufficiently accurate. The current intra-environment predictive ability values for grain yield and the lime tolerance score in the site featuring greater soil lime stress (Larissa), which were close to 0.34, were lower than those reported for white lupin grain yield in moisture-favorable or drought stress conditions of landrace genotypes (in the range of 0.47–0.58) and breeding lines (in the range of 0.67–0.78) [
35,
40]. However, they could still justify the exploitation of the GS models for these traits for regional lime tolerance improvement (e.g., by concurrent selection according to genome-enabled predictions for both traits), when considering the challenges of phenotypic selection for lime tolerance. A second important indication for GS arising from this study is the high predictive ability of individual seed weight, which is supported not only by values for intra-environment prediction above 0.6, but also by the small loss of predictive ability when using a model constructed in one location for predictions in another, climatically-contrasting location. An earlier study on landrace accessions reported a similar intra-environment predictive ability value for this trait [
36], further supporting the GS for this trait. A large seed (say, >0.50 g), which is needed for the traditional consumption of white lupin as a snack, is hardly found in modern cultivars [
26]. The moderate predictive ability observed for plant height and the site-specific number of seeds per pod agrees with earlier findings for landrace germplasm [
36].
The absence of sizable and consistent differences in predictive ability among GS statistical models is fairly common in general and was already reported in earlier white lupin studies that compared the same models [
40] or a subset of them [
35,
36]. The slightly higher average predictive ability of GS models adopting the more stringent missing data threshold suggests that the 5429 polymorphic markers provided by this threshold were mostly sufficient for genome saturation aimed at GS prediction in this reference population.
In conclusion, this study highlighted the importance of GEI effects when improving white lupin adaptation to calcareous soils, supporting efforts on a regional scale that may take advantage of genome-enabled predictions of grain yield and visually-assessed lime tolerance. Future proof-of-concept work will verify the actual genetic gains obtained from GS for these traits applied to independent germplasm. Other major results issued from our study are: (a) the identification of a candidate gene for lime tolerance, revealed by the GWAS; (b) the high reliability across contrasting locations of the genome-enabled prediction of seed weight; and (c) the usefulness of more pods per plant as a phenotypic selection criterion in early selection stages that lack grain yield data.
4. Materials and Methods
4.1. Plant Material
The plant material for this study was represented by a sample of 140 sweet-seed inbred lines sorted from a reference population developed by CREA to broaden the genetic base for white lupin breeding in Europe. The reference population originated from 16 crosses produced by a 4 × 4 factorial mating design by which each of four elite sweet-seed cultivars or breeding lines was crossed with each of four elite bitter-seed, landrace accessions. The list of test lines and information on their parent germplasm is provided in
Table S4. To further broaden the population genetic base, we used a different parent genotype within a landrace population for each cross with sweet-seed material (assuming that landraces are genetically heterogeneous, unlike modern cultivars). The landrace parent accessions were selected out of a world germplasm collection evaluated for grain yield under spring sowing in France and autumn sowing in two climatically-contrasting Italian sites [
26]. Additional trait information for parent choice was provided from other studies relative to lime tolerance [
22], genotype adaptation across Italian environments [
48], and drought tolerance [
49]. In brief, the Italian landrace LAP123 collected by CREA was selected because of moderately good adaptation to fairly calcareous soil; the landrace La646 from the Canary Islands, the Italian landrace La246 belonging to INRAE’s germplasm collection, the newly-registered Italian variety Arsenio (referred to as line 7–50 in earlier studies), and the French variety Lucky were selected because of wide adaptation to climatically-contrasting and/or moisture-contrasting environments; the Moroccan breeding line L27PS3 was chosen because of moderately high drought tolerance; the Greek landrace Gr56 (INRAE’s collection) and the breeding line MB-38 were selected because of high tolerance to low winter temperatures. Seed quality characteristics contributed to parent choice; e.g., the high γ-conglutin content of Arsenio and Lucky, or the very large seed of LAP123. The crosses were performed in 2014, whereas F
2 to F
5 inbred lines for each of the 16 crosses were produced from 2015 to 2017 by single-seed descent on-season and off-season generations. All crossing and seed multiplication work was carried out in isolation through insect-proof nets. Within-cross selection for low alkaloid content was performed (a) on F
3 and F
4 individual seeds by the fluorescence method [
50], and (b) on F
4 seed by a non-destructive test that was adapted to single seeds using the spectrophotometer method described in [
51,
52]. The latter test was used to discard material for which the alkaloid content value belonged to the highest 25% quartile. The final population included 960 F
5 inbred lines (60 per cross), of which 560 (35 per cross) were genotyped, and 192 (12 per cross) were multiplied in isolation in 2018 to obtain the F
6 seed used for this study and to phenologically characterize these lines. Four to 10 lines per cross contributed to the current set of 140 test lines (
Table S4). The lines were randomly chosen within earlier-maturing crosses, and were selected for earliness within later-maturing crosses to avoid the presence in the panel of definitely winter-type germplasm (a phenological type expected to be poorly adapted to both target regions of this study). The number of lines issued by each individual parent was moderately balanced, ranging from 30 for MB-38 to 37 for La646, Arsenio, and L27PS3.
4.2. Field Evaluation
The 140 inbred lines were evaluated under field conditions in Larissa (eastern continental Greece; 39°36′ N, 22°25′ E) and Ens (the Netherlands; 52°38′ N, 5°49′ E), two sites that featured moderately calcareous, moderately alkaline soils along with a contrasting climate (
Table 1). In Larissa, characterized by a Mediterranean climate with mild winters and substantial terminal drought, the material was sown in late November 2020 and was combine-harvested in early July 2021. This site adopted a conventional management involving mineral fertilization with 46 kg/ha of P
2O
5, with manual weeding and no lupin seed inoculation. In Ens, characterized by a cool Oceanic climate, the material was sown at the beginning of April and was harvested by the end of October 2021. The crop was organically-managed, inoculating the seed by the commercial inoculant HiStick (Becker Underwood, Toulouse, France) before sowing, and performing occasional hand-weeding during the crop cycle. Each experiment was laid out according to a randomized complete block design with three replications per line, sowing 36 seeds per plot. Plots included 4 rows in Larissa and 3 rows in Ens, with rows at 25 cm spacing and seeds at 9 cm spacing on the row (density: 44.4 seeds/m
2) in both sites.
Grain yield at 0% seed moisture was assessed on a plot basis, after measuring the seed moisture on a sample of 100 seeds per plot. In Ens, the harvested pods were previously dried through a heated forced air drier for about 48 h. A lime tolerance score was visually attributed twice, i.e., at the end of the pod setting stage of the main stem, and at the end of the pod setting stage of the primary branches, according to the following scale definition: 1 (no stress): dark-green leaves, robust growth, 3–6 lateral branches, >20 pods/plant (only for the second observation); 3 (mild stress): chlorotic lower leaves, 2–4 lateral branches, 10–20 pods/plant (for the second observation); 5 (intermediate stress): yellow leaves along the whole plant, apparently limited shoot development, 1–2 lateral branches, 6–10 pods/plant (for the second observation); 7 (severe stress): yellow leaves along the whole plant, modest shoot development, abscission of leaves early in the ripening stage, 0–2 lateral branches, 2–5 pods/plant (for the second observation); 9 (extreme stress): yellow leaves along the whole plant, very poor shoot development, abscission of leaves and inflorescences, inability to complete the reproductive stage. Data analysis was performed for the latest score observation and for the average value of both observations. We also recorded the following traits: proportion of plants with seed; plant height at the end of flowering (measured on five test plants); average numbers of pods per plant and of seeds per pod (based on all test plants); and individual dry seed weight (assessed on 100 random seeds per plot).
4.3. Statistical Analysis of Phenotyping Data
An analysis of variance (ANOVA) including the factors lupin genotype and block was performed for each trait in each environment to assess the significance of the variation among lines and its extent as a genetic coefficient of variation computed as:
where
m is the trait mean value, and
sg is the square root of the genotypic component of variance (
s2g) estimated along with the experimental error (
s2e) component of variance by the REML method. Trait broad-sense heritability was computed from these components of variance by the equation:
where
n is the number of experiment replicates. We used
H2 values to compute best linear unbiased prediction (BLUP) values as described in [
53], which were used for trait-marker analyses. Another ANOVA including the factors lupin genotype, environment, and block within environment was performed on each trait that displayed genetic variation in both sites, to test the significance of environmental and genotype × environment interaction (GEI) effects. The extent of the GEI effects was assessed by the size of the genetic correlation for lupin line responses across sites as described in [
54] for one trait assessed in different environments. One last ANOVA including the fixed factors cross and environment and the random factors line within cross and block within environment partitioned the variation of genotype main effects and GEI into effects relative to the 16 crosses and the within-cross variability. The relationship of grain yield with the other traits was explored for each test site by phenotypic correlation. All analyses were carried out using SAS/STAT
® software 9.3 [
55].
4.4. DNA Isolation, GBS Library Construction, and Sequencing
Genomic DNA was extracted from young leaves of F
5 plants of each inbred line using the DNeasy Plant Mini Kit (Qiagen, Milan, Italy). Nucleic acid was quantified by a Quant-iT™ PicoGreen™ dsDNA Assay Kit (P7589, Life Technologies, Italy), checking its quality by 1% agarose gel electrophoresis. A trial digestion was carried out on 10% of the DNA samples using the Optizyme EcoRI restriction enzyme (25,000 U, Fisher BioReagents, Rodano, Italy), to compare bands of cut and uncut DNA. The reaction was performed at 37 °C for an hour and the enzyme was deactivated at 65 °C for 20 min. DNA samples were sent to The Elshire Group Ltd. lab (Palmerston North, New Zealand) for outsourced library preparation and sequencing. GBS data were generated according to Elshire et al.’s [
32] method with the following changes: we used 100 ng of genomic DNA and 3.6 ng of total adapters, and restricted the genomic DNA with
ApeKI enzyme (NEB New England Biolabs, R0643L, Ipswich, MA, USA); then, the library was amplified with Kapa Taq polymerase Alpha (KAPA Library Amplification Readymix, Kapa Biosystems KK2611, Cape Town, South Africa) by 14 PCR cycles. Sequencing was performed on a single Illumina HiSeq X lane, at 2 × 150 bp paired end. Adopting
ApeKI as the restriction enzyme according to Elshire et al. [
32] was supported by the fact that about 60% of the white lupin genome includes repetitive DNA sequences [
37], which this enzyme tends to skip.
4.5. Genotype SNP Calling Procedures, Data Filtering, and Imputation
GBS raw reads were demultiplexed using Axe demultiplexer [
56]. Trimming for restriction enzyme remnants, alignment on reference genome and SNP calling were performed using the dDocent pipeline [
57]. For alignment, we used the
Lupinus albus genome version 1.0 [
37], which was downloaded from
https://www.whitelupin.fr/ (accessed on 3 November 2022). The final genotype matrix, in the form of a vcf file, was further filtered for quality using the vcftools software [
58] with parameters
−minQ 30 −max-non-ref-af 1 –non-ref-af 0.001. The resulting data set was filtered for monomorphic markers, minor allele frequency (MAF) > 5%, missing SNP marker rate < 10%, 20% or 30%, and a missing rate per individual < 50%. Following Nazzicari et al. [
59], we estimated missing data by random forest imputation [
60] using the R package MissForest [
61] with the configuration ntree = 100, maxiter = 10 and encoding genotypes as categorical data (factors).
Both genotypes and SNPs were filtered for excess heterozygosis by using the mean plus three and two standard deviations as maximum thresholds, respectively.
4.6. Analysis of Population Structure and Genome-Wide Association Study
The presence and pattern of population structure were investigated by a discriminant principal components analysis (DPCA) [
62]. We iteratively used the k-means clustering algorithm for increasing values of K genotype groups from 1 to 20 to identify the optimal number of groups according to the local minimum of the Bayesian information criterion (BIC). The analyses were performed on the output of an ordinary principal components analysis performed on SNP data to benefit from its dimensionality reduction, but keeping all the components to avoid information loss. We performed the final DPCA after selecting the optimal K value. The optimal number of DPCA axis to retain for the following analyses was selected by the a-score criterion (which represents the propensity of DPCA toward overfitting). The whole procedure was implemented using the R package adegenet [
63] using the functions
find.clusters(), dapc() and optim.a.score().
Linkage disequilibrium was estimated for SNPs within a 100 kb window in R by the LD.decay function from the package sommer [
64]. We used genotype data filtered by 0.3 missing per marker, 0.5 missing per sample, 0.05 MAF, and excess SNP and genotype heterozygosis, plotted against physical distance, and fitted by a polynomial curve as described in [
65]. Following data filtering, 136 genotypes were available for LD and GWAS analyses.
A GWAS was performed on BLUP data of grain yield, the average value across observations of the lime susceptibility score, plant height, number of pods per plant, number of seeds per pod, and individual seed weight measured in each location, using 9815 mapped SNPs according to the Blink model [
66] in R package GAPIT3 [
67]. A second GWAS was performed on data on plant height and individual seed weight averaged across locations. Each GWAS included the first seven components of the DPCA with K = 17, which appeared to properly account for population structure for all traits except individual seed weight in Ens, the only case for which the visual inspection of quantile-quantile plots comparing observed trait-marker association scores with those expected in case of no significant association suggested some inflation (
Figure S5).
4.7. Genome-Enabled Predictions
For each of six focus traits, we tested several whole-genome regression models whose configuration resulted from the combination of four possible statistical models (described below), the presence or absence of population structure inputted as in the GWAS, and three thresholds of maximum missing rate per SNP marker (0.15, 0.20, and 0.30). We first envisaged an intra-environment prediction scenario, assessing the predictive ability of GS models by a standard within-location 10-fold cross-validation. Each analysis was performed 10 times, reporting the average result, to ensure numerical stability. We also envisaged a cross-environment prediction scenario for two traits that displayed moderately high genetic correlation for line responses across locations, using by turns one environment for training and the other for validation (by splitting the training data in a 90/10 fashion as done for intra-environment predictions). The whole procedure was repeated 10 times for numerical stability.
We considered four possible whole-genome regression models: Ridge Regression BLUP (rrBLUP), Bayesian Lasso (BL), Bayesian Reproducing Kernel Hilbert space (RKHS), and Weighted G-BLUP (WGBLUP). All models were implemented using R package GROAN [
68].
rrBLUP [
69] assumes a linear mixed additive model where each marker is assigned an effect as a solution of the following equation:
where y is the vector of observed phenotypes, μ is the mean of y, W is the genotype matrix (e.g., {0,1,2} for biallelic SNPs), q ∼ N (0, Iσ
2q) is the vector of marker effects, and ε is the vector of residuals. The model is solved in a maximum likelihood context estimating the ridge parameter λ = σ
2e/σ
2q representing the ratio between residual and markers variance. When the population structure is included in the model the above formula is updated as follows:
where X is the is structure matrix with one row per sample and one column per considered DPCA component; and b ∼ N (0, Iσ
2b) is the vector of (fixed) effects corresponding to the population structure.
BL [
70] solves the same general model than rrBLUP but in the bayesian context where regression parameters have independent Laplace double-exponential priors. The system is solved via Gibbs sampling with proper iteration count (10,000 repetitions) and burn-in period (1000 repetitions) so as to ensure convergence. When present, the population structure was added to the model as a fixed (i.e., flat prior) component.
RKHS model is used to solve the so-called genomic BLUP (G-BLUP) in the Bayesian context. First, a genomic kinship additive matrix G is computed according to [
71]. The matrix is then used in the following model:
where Z is a design matrix allocating samples to genetic values and g is a vector of additive genetic effects for a sample with var(g) = Gσ
2g, where G is the genomic relationship matrix and σ
2g is the genetic variance for this model. In the context of RKHS the G matrix is considered as the reproducing kernel function mapping from each pair of markers to covariance. The system is then solved with a standard Gibbs sampling, as done in BL, with the same configuration to accommodate for population structure if required.
WGBLUP [
72] is very similar to RKHS and operatively follows the same implementation, with the main difference being that matrix G is substituted by matrix G*, which is computed by weighing the SNP markers by the
p values resulting from an association study. The association scores were programmatically computed inside each cross-validation cycle on the training set using statgenGWAS R package [
73]. Once the scores were obtained, the G* matrix was computed as:
where Z is an identity matrix for the markers, D is a diagonal matrix where each element of the diagonal corresponds to SNP weights, and p
i is the observed MAF of all genotyped individuals.