1. Introduction
Genomic selection (GS), an upgraded form of marker-assisted selection (MAS), aims to use the genetic effects of genome-wide molecular markers to determine the genomic estimated breeding value (GEBV) of individuals based on optimized statistical models [1]. In GS, the genome is densely covered with markers that are expected to be in complete or partial population-wide linkage disequilibrium (LD) with quantitative trait loci (QTL), allowing a high fraction of genetic variance to be explained by the markers. Generally, GS achieves a higher prediction accuracy than traditional methods based on pedigree. In practice, GS is becoming a popular tool for genetic evaluations of livestock and poultry; it is at present replacing the traditional methodology of genetic evaluation in dairy cattle [2,3], and has been widely used in the commercial breeding of other animals, such as beef cattle [4], pigs [5], chickens [6,7], and sheep [8,9]. In GS using single nucleotide polymorphism (SNP) chips, typical statistical methods include genomic best linear unbiased prediction (GBLUP) and Bayesian methods. In practical applications, GBLUP is favored for its robust predictions and speed.
Decreased costs for whole-genome sequencing (WGS) have provided an exceptional opportunity to systematically detect genetic variants throughout the entire genomes of numerous individuals [10]. In theory, GS based on WGS data can utilize causative mutation sites directly, and can thus achieve higher accuracy than GS based on SNP chips. Because prediction then no longer depends on LD between markers and causative mutations, the loss of prediction accuracy that occurs when many generations separate the reference and candidate populations is greatly mitigated. In addition, prediction of GEBV based on causative mutation loci can minimize the influence of selection (artificial or natural) on GS accuracy and allow cross-population and cross-breed GS.
Meuwissen and Goddard [11] published the first simulation study of GS based on WGS data, which confirmed that prediction accuracy could be improved by using WGS data. Since then, WGS-based GS has become a topic of great research interest in the fields of livestock [9,12] and poultry breeding [13], and has yielded valuable results. Due to the implementation of the 1000 Bull Genomes Project (http://1000bullgenomes.com/, accessed on 10 August 2022), most relevant research has been undertaken with dairy cattle [12,14,15].
SNP chips can typically detect tens or hundreds of thousands of SNPs, whereas WGS data can be used to identify millions of SNPs. The use of such ultra-high-dimensional data poses two key challenges in GS. First, the predictive variables (i.e., SNPs) contain irrelevant and redundant information, which impairs prediction performance [16,17,18]; second, computational complexity is greatly increased [19]. Feature selection is an important method in high-dimensional data analysis and modeling. It can improve prediction performance and reduce computational complexity by removing irrelevant and redundant features while retaining relevant features.
Feature selection has become an important strategy for GS using WGS data. Feature selection of predictive variables is often carried out based on genome annotation information or genome-wide association study (GWAS) results. Studies have shown that feature selection using genomic annotation information (such as gene location) can improve the prediction performance of GS to some extent [20]. There are two main strategies for feature selection based on GWAS results: selection based on the degree of association between a SNP and a trait (p-value), and selection based on the estimated effect or effect variance of a SNP. Brondum, et al. [21] built a model including SNPs that were found to be significantly associated with traits based on WGS data, and achieved reliabilities approximately 0.005–0.04 higher than those obtained with SNP chip data. Raymond, et al. [16] used different p-value thresholds to select SNPs from WGS data, and found that, compared with SNP chips, using only the selected SNPs could significantly improve the prediction accuracy, whereas the increase in accuracy was small when both the selected SNPs and SNP chips were used. Ye, et al. [22] found that the prediction accuracy was 1–3% higher when all imputed SNP loci were used rather than SNP chip data. When different p-value thresholds were further used to select loci for prediction, the accuracy was unchanged compared to the model using all SNPs. Starting with a set of SNP chip data, Warburton, et al. [15] added significantly associated loci based on WGS data, selecting the top 100 and 250 SNPs based on p-values; they found that for the 50K and 800K SNP chips, adding the selected loci resulted in little improvement in accuracy. Among feature selection strategies (SNP chip + SNPs selected based on p-value/effect estimation/effect variance), VanRaden, et al. [18] found that methods based on effects and effect variance of SNPs could significantly improve the prediction performance of GS.
Although the aforementioned feature selection methods have been used to improve the prediction performance of GS to varying degrees, they have several shortcomings. First, the current annotations of livestock and poultry genomes are not sufficiently accurate, so feature selection based on genome annotation information cannot be used to generate models with optimal performance [23]. Second, relevant loci are not selected accurately and comprehensively based on GWAS results, because SNP significance values are affected by population size and other factors. Selection of only SNPs significantly associated with a trait may exclude other SNPs that also contribute to the phenotype of interest, resulting in limited improvement in prediction performance. Similarly, arbitrarily selecting a certain number of SNPs based on GWAS results cannot guarantee that all loci in the selected subset will be useful for prediction, or that truly effective loci will not be omitted. Accurate, comprehensive selection of relevant features therefore remains the key problem in feature selection.
Regularized regressions are a group of classical methods for variable selection. They identify outcome-associated features and estimate the non-zero parameters simultaneously, and are particularly useful for high-dimensional datasets with small sample sizes. Commonly used regularized regression models include least absolute shrinkage and selection operator (LASSO) [24], ridge regression (RR) [25], and elastic net (EN) [26]. RR regularization shrinks the regression coefficients of all predictors and thus makes parameter estimates more stable, whereas LASSO regularization compresses many regression coefficients to exactly zero, thus enabling automatic variable selection (it tends to retain only one predictor from a group of correlated predictors). EN regularization employs both the RR and LASSO penalties, combining the advantages of both methods. Because LASSO and EN perform variable selection automatically, we used them as feature selection methods to perform GS based on WGS data. The purpose of this study was to investigate the application of LASSO and EN models in feature selection to determine whether they could improve the prediction performance of GS models for milk production traits in Chinese Holstein cattle.
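The different behavior of the three penalties can be illustrated with a small sketch. The code below is a minimal coordinate-descent solver written for illustration only; it is not the software used in this study. The function names, the toy data (two perfectly correlated markers plus one small-effect marker, coded as centered values), and the penalty parameterization with a mixing parameter alpha (alpha = 1 for LASSO, alpha = 0 for RR) are all assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent (illustrative, hypothetical helper) minimizing
    (1/2n)*||y - Xb||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2).
    X is a list of rows; its columns and y are assumed to be centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b starting at zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            denom = col_ss[j] + lam * (1 - alpha)
            if denom == 0.0:
                continue  # monomorphic column under pure LASSO: leave at zero
            # covariance of marker j with the residual that excludes marker j
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / denom
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data (assumed): markers 1 and 2 are identical, marker 3 has a small effect.
X = [[1, 1, 1], [1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
y = [2.125, 1.875, -1.875, -2.125]  # y = 2*marker1 + 0.125*marker3

lasso = elastic_net(X, y, lam=0.5, alpha=1.0)
enet = elastic_net(X, y, lam=0.5, alpha=0.5)
ridge = elastic_net(X, y, lam=0.5, alpha=0.0)

print([round(b, 3) for b in lasso])  # [1.5, 0.0, 0.0]
print([round(b, 3) for b in enet])   # [0.778, 0.778, 0.0]
print([round(b, 3) for b in ridge])  # [0.8, 0.8, 0.083]
```

In this toy case LASSO keeps only one of the two identical markers, EN keeps both with equal coefficients while still removing the small-effect marker, and RR shrinks all coefficients without setting any of them to zero.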
4. Discussion
The objective of this study was to explore the application of regularized regression feature selection strategies to WGS-based GS for milk production traits in Chinese Holstein cattle. Previous studies suggest that improving the prediction accuracy of GS based on WGS data may require increasing the number of individuals in the reference population [43] or selecting feature markers using different strategies [44,45]. The regularized regression models LASSO and EN had not previously been used as feature selection strategies in WGS-based GS [12,45]. LASSO and EN shrink the regression coefficients of minor-effect SNPs towards zero, removing irrelevant and redundant features to yield the final set of selected features.
In the present study, GS using 50K SNP chip data and WGS SNPs selected with LASSO (50K+LASSO) had the best prediction performance for MY and PY; the 50K+LMMLASSO method performed best for GS of FY. GS using SNPs selected by EN and LASSO did not lead to higher accuracy than GS using SNPs selected based on p-values from single-marker linear mixed model GWAS analysis (LMMLASSO and LMMEN), and the bias was much higher in the former. There are four possible reasons for this result; these are enumerated below.
First, the WGS data contained a total of 5,912,383 SNPs, making it an ultra-high-dimensional dataset in which the number of SNPs (p) was much larger than the number of observations (n). Theoretically, LASSO and EN analyses are suitable for such instances where p >> n. However, the ‘GLMNET’ package in R could not handle such a large volume of data, preventing optimal feature selection. To solve this problem, single-marker GWAS analysis was used for SNP pre-selection prior to the use of LASSO or EN for feature selection. With the single-marker model, several adjacent SNPs may all be selected for subsequent analyses, because they may all have low p-values as a result of being in LD with the same causal mutation. This would prevent EN and LASSO from fully exploiting their advantages.
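The LD effect described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the analysis performed in this study: it replaces the single-marker linear mixed model with ordinary single-marker regression, approximates the t-distribution with a normal distribution, and uses simulated genotypes. The function names, allele frequency, effect size, and sample size are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def single_marker_pvalue(genotypes, phenotypes):
    """Two-sided p-value for the regression of phenotype on one SNP.
    Assumption: simple regression with a normal approximation to the
    t-statistic, standing in for a linear mixed model GWAS."""
    n = len(genotypes)
    g_bar, y_bar = mean(genotypes), mean(phenotypes)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    syy = sum((v - y_bar) ** 2 for v in phenotypes)
    if sxx == 0 or syy == 0:
        return 1.0  # monomorphic SNP or constant phenotype: nothing to test
    sxy = sum((g - g_bar) * (v - y_bar) for g, v in zip(genotypes, phenotypes))
    r = sxy / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # keep the statistic finite
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(t)))


rng = random.Random(1)
n = 500  # hypothetical number of animals


def draw_genotype(freq=0.3):
    """Allele count (0, 1 or 2) for one animal at the given allele frequency."""
    return sum(rng.random() < freq for _ in range(2))


causal = [draw_genotype() for _ in range(n)]
# Neighbouring SNP in high LD: identical to the causal SNP for ~95% of animals.
neighbour = [g if rng.random() < 0.95 else draw_genotype() for g in causal]
nulls = [[draw_genotype() for _ in range(n)] for _ in range(20)]
phenotypes = [1.0 * g + rng.gauss(0.0, 1.0) for g in causal]

snps = [causal, neighbour] + nulls
pvalues = [single_marker_pvalue(s, phenotypes) for s in snps]
selected = [j for j, pv in enumerate(pvalues) if pv < 0.05]
print(selected)  # indices 0 (causal) and 1 (neighbour) both pass the filter
```

Both the causal SNP and its neighbour pass the p < 0.05 filter even though only one of them affects the phenotype, so the pre-selected set handed to LASSO or EN already contains redundant markers.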
Moreover, it has been shown that performing GS with markers selected based on GWAS results did not significantly improve the prediction performance; when markers with p-values < 0.05 were used for prediction, the prediction accuracy of some traits significantly decreased [46]. This indicates that SNP pre-selection based on GWAS may have a negative effect on GS accuracy. We therefore recommend using other pre-selection methods followed by LASSO or EN.
Ideally, all SNPs in WGS data would be directly analyzed using EN or LASSO. We used the ‘BIGLASSO’ package in R, which can improve the memory and computational efficiency of LASSO models built with ultra-high-dimensional data [47]. Unfortunately, this package was only able to analyze 1,000,000 SNPs. It is therefore necessary to reevaluate feature selection in GS when software becomes available that can perform LASSO or EN with millions or even tens of millions of SNPs.
Second, we pre-selected SNPs with p-values < 0.05 from single-marker GWAS for subsequent analyses. It is possible that the accuracy of GS using LASSO or EN, particularly the latter, was not greatly improved because of this threshold value. Ye, et al. [45] divided SNPs into different categories based on several p-value thresholds: 0.05, 0.001, 0.0001, 0.00001, and 0.000001. The results showed that GS using SNPs selected with the different p-value thresholds had lower accuracy and larger bias than models built with the complete WGS data when using either GBLUP or GFBLUP.
A third possible reason for the undesirable GS results after using LASSO or EN for feature selection is that the statistical model for GS used in this study was GBLUP. Although it performs robustly under a variety of conditions, Meuwissen and Goddard [11] demonstrated that GBLUP does not take full advantage of WGS data. GS based on WGS data tends to have better results when using Bayesian models. In theory, Bayesian methods can utilize all of the mutation information provided by WGS data. Van Binsbergen, et al. [2] performed GS using GBLUP and Bayesian stochastic search variable selection (BSSVS) based on WGS data for somatic cell scores, intervals between first and last insemination, and protein yield in Holstein cattle. BSSVS performed better than GBLUP in all cases. Liu, et al. [48] used GBLUP and Bayesian four-distribution mixture models to perform GS for milk production and reproduction traits by integrating additional SNPs selected from imputed WGS data. The Bayesian four-distribution model achieved higher accuracy than the GBLUP model for milk and protein yields.
Finally, selected SNPs were integrated with 50K SNP chip data and regarded as random effects in this study; however, it has been demonstrated that including some selected SNPs as fixed effects or giving them a higher prior may improve prediction accuracy [2]. Therefore, in addition to our approach of constructing two genomic relationship matrices with SNPs selected for traits as random effects in the 50K SNP chip dataset, GS could be attempted with large-effect SNPs selected for traits as fixed effects. Brondum, et al. [21] reported that accuracy may be increased by adding several SNPs from significant QTLs identified in WGS data to conventional 54K SNP chip data, especially for fertility traits. This strategy should be tested in future studies.
In our study, the GBLUP model that considered one G matrix (model (6)) achieved higher accuracy than the model that considered two G matrices (model (7)). Brondum, et al. [21] reported that model (6) had higher accuracies than model (7) for all the traits analyzed. In the study by Liu, et al. [48], model (6) performed better than model (7) in GS of mastitis and fertility, but model (7) performed better in GS of milk and protein. Moreover, in GS of fat, model (6) had better performance when using the reference populations of Danish bulls and Danish and US bulls; model (7) had better performance when using the reference population of Danish cows; models (6) and (7) had the same performance when using the reference populations of Danish bulls and cows, as well as Danish and US bulls and Danish cows. Gebreyesus, et al. [49] reported that model (6) achieved better performance in a scenario involving additional SNPs selected by GWAS in Denmark–Finland–Sweden dairy cattle populations, whereas model (7) achieved higher accuracy in a scenario involving SNPs selected from GWAS for survival index. Therefore, the performance of models (6) and (7) varies depending on the traits, the reference population, and how many SNPs are selected. Moreover, the problems of noise and confounding between the effects may be present in model (7).
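The difference between using one and two G matrices can be sketched as follows. The exact forms of models (6) and (7) are given in the methods and are not reproduced here; the code only illustrates, under assumptions, how a genomic relationship matrix might be built from merged markers versus from two separate marker sets. It assumes the commonly used VanRaden formulation G = ZZ'/(2Σp(1−p)) with allele frequencies estimated from the data; the function name and the toy genotypes are hypothetical.

```python
def vanraden_g(genotypes):
    """Genomic relationship matrix G = Z Z' / (2 * sum_j p_j * (1 - p_j)).
    genotypes: one list of 0/1/2 allele counts per individual.
    Assumption: allele frequencies p_j are estimated from these genotypes."""
    n, m = len(genotypes), len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic")
    # center each marker by twice its allele frequency
    Z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]
    return [[sum(a * b for a, b in zip(Z[i], Z[k])) / scale for k in range(n)]
            for i in range(n)]


# Toy genotypes (assumed): 4 animals, 3 chip SNPs and 2 selected WGS SNPs.
chip = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
selected = [[0, 2], [1, 1], [2, 0], [1, 1]]

# One G matrix: chip SNPs and selected SNPs merged into a single marker set.
G_merged = vanraden_g([c + s for c, s in zip(chip, selected)])

# Two G matrices: one per marker set, each with its own variance component.
G_chip = vanraden_g(chip)
G_sel = vanraden_g(selected)

for row in G_merged:
    print([round(v, 3) for v in row])
```

Under this formulation the merged matrix is a weighted average of the two separate matrices, with weights proportional to each marker set's 2Σp(1−p). A single-G model therefore fixes the relative contribution of the selected SNPs in advance, whereas a two-G model estimates a separate variance component for them. This may be one reason why the relative performance of the two models depends on the trait and on how many SNPs are selected.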