1. Introduction
Genomic selection (GS) has transformed breeding practices across the animal and plant industries; its accuracy depends on factors such as reference population size, marker density, and trait heritability [1, 2, 3]. Accurate genomic prediction for selection candidates requires a large reference population of individuals with both genotypic and phenotypic data [2]. However, assembling such a reference population is challenging, particularly when animal resources are limited. An alternative strategy is to combine populations of different breeds to improve the accuracy of genomic prediction [4, 5]. A previous study [6] found that, in simulated data, Bayesian models were more accurate than the GBLUP and ssGBLUP models when different populations of the same species were merged. The present study was conducted to verify, using real data, whether the conclusions drawn from simulated data hold.
While prior investigations of genomic selection in pigs have focused primarily on populations with similar or closely related genetic backgrounds [7], limited research has addressed genomic selection in pig populations with diverse genetic backgrounds. It has been shown that combining populations does not improve the accuracy of genomic prediction, because differences in allele substitution effects between populations reduce the accuracy of prediction across populations [8]. Additionally, linkage disequilibrium patterns are less consistent among pig populations than among dairy cattle populations, which raises the question of whether the substantial improvement reported in dairy cattle can be achieved in pigs. For traits that are important in pig genomic selection, such as feed intake, carcass characteristics, and meat quality, it is difficult to establish sufficiently large reference populations. Hence, it is important to investigate whether combining reference populations of different sizes can increase the accuracy of genomic estimated breeding values in pigs.
The Yorkshire pig breed is prominent in pig production because of its efficient feed utilization, fast growth, valuable slaughter characteristics, and favorable lean meat composition. Two traits are of particular importance: age at 100 kg live weight (AGE100), a key indicator of growth rate [9], and backfat thickness at 100 kg (BF100), which is crucial for assessing lean meat percentage [10, 11]. Because of their economic importance, these two traits were used to examine genomic estimated breeding values for growth rate and fat deposition in Yorkshire pigs with diverse genetic backgrounds.
In this study, we explored the impact of combining Yorkshire pig populations of different origins on the accuracy of genomic estimated breeding values (GEBVs) for the AGE100 and BF100 traits. Our findings indicate that combining populations can enhance GEBV accuracy under specific conditions. Bayesian models outperformed the GBLUP and ssGBLUP models when applied to combined populations, while the latter models were effective for predicting GEBVs within individual populations. In addition, using two populations combined as the reference notably improved GEBV accuracy for the third population, highlighting the need for further research into the factors behind this improvement.
2. Materials and Methods
2.1. Husbandry Management of Experimental Animals
This study included 2295 pigs from three distinct regions, representing three Yorkshire pig populations: 295 Danish, 500 Canadian, and 1500 American Yorkshire pigs. Both boars and sows were included. All data were recorded from 2015 to 2020. The Danish and Canadian pigs were raised in regions with a temperate monsoon climate, while the American pigs were raised in a subtropical climate. All pigs were in the fattening stage and were housed in large facilities. Stocking density ranged from 0.8 to 1.2 m² per pig and was adjusted according to body weight. Automated feeding troughs provided continuous access to feed. Standard disinfection procedures were applied to all the experimental animal buildings, and all the animals were vaccinated.
2.2. Trait-Corrected Models for Experimental Animals
Pigs from all populations were selected for analysis at approximately 160 days of age; all were healthy, free of disease, and kept under uniform feeding conditions. Individual measurements were taken at an average body weight of 100 kg. Backfat thickness was measured by ultrasound between the eleventh and twelfth ribs at the time of weighing. The measurements for AGE100 were computed using the appropriate correction equations [
12]:
, where CF is the correction factor (referring to the National Program for Swine Genetic Enhancement in China),
, and
. Those for BF100 were calculated as follows [
13]:
and
.
2.3. DNA Extraction and Genotyping
Blood samples were transferred to sterile tubes and centrifuged at 3000 rpm for 10 min. The buffy coat layer containing leukocytes was carefully aspirated and transferred to a clean tube. Red blood cells were lysed by adding an equal volume of RBC lysis buffer (10 mM Tris-HCl, 10 mM NaCl, 1 mM EDTA, pH 8.0) and incubating at room temperature for 10 min. After the leukocyte pellet was washed with phosphate-buffered saline (PBS), cells were lysed using a commercial cell lysis buffer containing proteinase K. The lysate was incubated at 55 °C for 2–4 h to ensure complete lysis. Following the cell lysis, proteinase K was inactivated via heat treatment at 95 °C for 10 min. RNase A (20 μg/mL) was added to the lysate to degrade RNA, and the mixture was further incubated at 37 °C for 30 min. DNA was purified using a Tiangen DNA extraction kit, following the manufacturer’s instructions. Briefly, the lysate was mixed with binding buffer and transferred to a spin column. After centrifugation, the DNA was bound to the column while contaminants were removed through washing steps. Finally, purified DNA was eluted in sterile water or TE buffer.
All phenotyped individuals were genotyped using the GeneSeek GGP Porcine HD array. The SNP chip comprised 50,915 probes mapped to the Sus scrofa 10.2 assembly; the positions of autosomal SNPs were then updated to the latest porcine genome assembly, Sus scrofa 11.1. The numbers of autosomal SNPs finally retained were 34,150 for the Canadian line, 34,543 for the Danish line, and 34,497 for the American line.
Quality control was performed with PLINK (v1.90; http://www.cog-genomics.org/; accessed on 16 March 2023). Pigs with call rates < 0.9 were excluded, and SNPs with a minor allele frequency (MAF) below 0.05 or a call rate < 0.9 were excluded within each population.
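The filtering thresholds above can be illustrated with a small NumPy sketch. This is not the PLINK workflow used in the study; the function name `qc_filter`, the 0/1/2 coding with `NaN` for missing calls, and the order of the two filters (individuals first, then SNPs) are assumptions made only for illustration.

```python
import numpy as np

def qc_filter(geno, sample_call_min=0.9, snp_call_min=0.9, maf_min=0.05):
    """Return boolean masks (keep_ind, keep_snp) for a genotype matrix.

    geno: (n_individuals, n_snps) array coded 0/1/2, missing calls as NaN.
    Assumption: individuals are filtered first, then SNPs are evaluated
    on the retained individuals only.
    """
    geno = np.asarray(geno, dtype=float)

    # Individual call rate = share of non-missing genotypes per row.
    sample_call = 1.0 - np.isnan(geno).mean(axis=1)
    keep_ind = sample_call >= sample_call_min
    g = geno[keep_ind]

    # SNP call rate = share of non-missing genotypes per column.
    n_called = (~np.isnan(g)).sum(axis=0)
    snp_call = n_called / max(g.shape[0], 1)

    # Allele frequency of the counted allele, then fold to the minor allele.
    allele_sum = np.nansum(g, axis=0)
    p = np.divide(allele_sum, 2.0 * n_called,
                  out=np.zeros_like(allele_sum), where=n_called > 0)
    maf = np.minimum(p, 1.0 - p)

    keep_snp = (snp_call >= snp_call_min) & (maf >= maf_min)
    return keep_ind, keep_snp
```

Applying the masks as `geno[keep_ind][:, keep_snp]` gives the filtered matrix; in the setting described above this would be done separately for each population.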
2.4. Genomic Analysis of the Population
Eigenvalues and eigenvectors were obtained with PLINK (v1.90; http://www.cog-genomics.org/; accessed on 16 March 2023) for a principal component analysis (PCA) of the SNPs remaining after quality control [14]. Furthermore, linkage disequilibrium (LD, measured as r²) was computed for each population. SNeP software (v1.1; https://bioinformaticshome.com/tools/descriptions/SNeP.html#gsc.tab=0; accessed on 25 March 2023) was used to estimate the effective population size (Ne) [15].
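To make the link between LD and Ne concrete, the following sketch computes r² between two SNPs as the squared correlation of their 0/1/2 genotype codes and converts a mean r² at a given recombination distance into Ne with the basic Sved relation, E[r²] ≈ 1/(1 + 4·Ne·c). The function names are hypothetical, and the source does not state which SNeP options were used, so this is an illustration of the principle rather than a reproduction of the SNeP analysis.

```python
import numpy as np

def ld_r2(snp_a, snp_b):
    """Squared Pearson correlation between two genotype vectors (0/1/2).

    Individuals with a missing call (NaN) at either SNP are ignored.
    """
    a = np.asarray(snp_a, dtype=float)
    b = np.asarray(snp_b, dtype=float)
    ok = ~(np.isnan(a) | np.isnan(b))
    r = np.corrcoef(a[ok], b[ok])[0, 1]
    return r * r

def ne_from_r2(mean_r2, c_morgans):
    """Effective population size from mean r2 at recombination distance c.

    Assumption: the basic Sved relation E[r2] = 1 / (1 + 4 * Ne * c),
    without sample-size or mutation corrections.
    """
    return (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)
```

For instance, a mean r² of 0.2 between markers 0.01 Morgan apart corresponds to Ne of about 100 under this relation.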
2.5. Scenarios of Combining References
The scenarios of combining references were as follows: (1) combining all three populations (Danish, Canadian, and American lines) for prediction; (2) combining two populations to predict the third, e.g., combining the American (large-scale) and Canadian (medium-scale) lines to predict the Danish (small-scale) population or combining the American (large-scale) and Danish (small-scale) lines to predict the Canadian (medium-scale) population; and (3) predicting each population independently.
2.6. Statistical Analysis
2.6.1. Bayesian A and Bayesian B
In 2001, Meuwissen et al. [3] proposed the concept of GS, concurrently proposing two Bayesian models, named Bayesian A and Bayesian B, for predicting genomic estimated breeding values. The general framework for a Bayesian regression model is outlined as follows:
$$y = Xb + \sum_{i} z_i \alpha_i + e$$
where $y$ represents the vector of phenotypes, $b$ signifies the vector of fixed effects, including variables such as sex, $X$ denotes the corresponding design matrix, $z_i$ denotes the genotype of the $i$-th locus (coded as 0/1/2), $\alpha_i$ represents the effect value linked to the $i$-th locus, and $e$ is the random residual effect vector.
The Bayesian A model assumes that all markers contribute to the genetic effect, with the effect of each marker, $\alpha_i$, following a Gaussian distribution, $\alpha_i \sim N(0, \sigma_{\alpha_i}^2)$, while the marker-specific variance follows an inverse chi-square distribution, $\sigma_{\alpha_i}^2 \sim \chi^{-2}(\nu, S)$, where $\nu$ represents the degrees of freedom, and $S$ denotes the scale parameter.
Diverging from the Bayesian A approach, the Bayesian B model introduces an indicator variable, $\delta_i$, indicating whether the $i$-th SNP has an effect. The model assumes that most markers have no effect (with prior probability $\pi$), whereas only a small subset of markers has an effect (with prior probability $1 - \pi$). The variance of the markers with a nonzero effect follows an inverse chi-square distribution.
The Bayesian A model can be regarded as a special case of Bayesian B in which the parameter $\pi$ is set to 0. In this study, both the Bayesian A and Bayesian B models were fitted using the R package BGLR [16].
2.6.2. BayesC
The formulation for BayesC [17] was as follows:
$$y = 1\mu + \sum_{i} z_i \alpha_i \delta_i + e$$
In this representation, $y$ denotes a vector of yield deviations, $1$ signifies a vector of ones, $\mu$ represents the overall mean, and $z_i$ stands for a vector of genotypes for SNP $i$, coded as $-2p_i$ for one homozygote, $1 - 2p_i$ for heterozygotes, and $2 - 2p_i$ for the alternative homozygote, with $p_i$ denoting the allele frequency of SNP $i$. The indicator variable $\delta_i$ determines whether SNP $i$ is included in the model during a given MCMC cycle (0/1), where the prior probability of $\delta_i$ being equal to 1 is denoted by $\pi$. The SNP effect $\alpha_i$ is modeled to follow a normal distribution: $\alpha_i \sim N(0, \sigma_{\alpha}^2)$. The residual $e$ follows a normal distribution: $e \sim N(0, I\sigma_e^2)$.
2.6.3. GBLUP
The GBLUP model can be written as follows:
$$y = 1\mu + Xb + Zg + e$$
where $y$ represents the vector of phenotypic values, $b$ denotes the vector of fixed effects, including variables such as sex, $X$ stands for the corresponding design matrix, $\mu$ signifies the overall mean, $Z$ is the design matrix establishing the link between genetic value ($g$) and $y$, and $e$ is a vector of random residuals. It was assumed that
$$g \sim N(0, G\sigma_g^2), \quad e \sim N(0, I\sigma_e^2)$$
where $\sigma_g^2$ is the additive genetic variance, and $\sigma_e^2$ is the random residual variance. The genomic relationship matrix ($G$), as outlined by [18], was computed from the SNPs:
$$G = \frac{(M - P)(M - P)^{T}}{2\sum_{i=1}^{m} p_i (1 - p_i)}$$
In this formulation, $M$ represents an $n \times m$ matrix, where $n$ denotes the number of individuals, and $m$ represents the number of SNPs; $p_i$ signifies the minor allele frequency of the $i$-th SNP; and $P$ is a matrix where the elements of the $i$-th column are $2p_i$. In this study, GBLUP was performed using the R package qgg [19].
2.6.4. ssGBLUP
The univariate ssGBLUP model can be written as follows:
$$y = Xb + Za + e$$
In this framework, the incidence matrix $X$ relates the fixed effects $b$ (the effect of sex) to the observations, while the incidence matrix $Z$ relates the breeding values ($a$) to the corresponding observations in vector $y$, and $e$ represents the random residual vector. It is assumed that Var(e) = $I\sigma_e^2$, where $I$ denotes the identity matrix. The same matrix is used in GBLUP; alternatively, a diagonal matrix of weights can be used [18]. In ssGBLUP, the breeding values have the covariance structure Var(a) = $H\sigma_a^2$, with $\sigma_a^2$ denoting the genetic variance and $H$ incorporating information from both the genomic ($G$) and pedigree ($A$) relationship matrices [18, 20].
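The text does not write out H explicitly. For orientation, a form commonly used in the single-step literature gives its inverse directly, which is what the mixed model equations require. The partitioning below, with subscript 2 referring to genotyped animals, is an assumption about notation and not a statement of the exact variant (for example, any blending or scaling of G) used in this study:

$$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix}$$

Here $A_{22}$ is the block of the pedigree relationship matrix among genotyped animals, so the genomic information modifies only the relationships among genotyped individuals while non-genotyped relatives contribute through $A^{-1}$.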
2.7. Evaluation of the Accuracy of Genomic Prediction
Predictive accuracy was assessed using 5-fold cross-validation (CV) on the real genomic data. The genotyped individuals were randomly divided into five subgroups of nearly equal size. In each round, one subgroup served as the validation population, while the remaining four subgroups constituted the reference population. This procedure was iterated five times so that each subgroup served as the validation set once. To ensure robustness, the 5-fold CV was repeated 20 times, resulting in 20 averaged predictive accuracies.
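The repeated cross-validation scheme can be sketched as follows. Two points are assumptions, since the text does not specify them: accuracy is taken here as the Pearson correlation between predictions and observed (corrected) phenotypes in the validation fold, and a simple ridge regression on genotypes (`ridge_fit_predict`, a hypothetical helper) stands in for the GBLUP, ssGBLUP, and Bayesian models actually used.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Stand-in predictor (assumption): ridge regression on centered genotypes."""
    mu = y_train.mean()
    col_mean = X_train.mean(axis=0)
    Xt = X_train - col_mean
    # Dual form: solve (Xt Xt' + lam I) alpha = y - mu.
    K = Xt @ Xt.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - mu)
    return mu + (X_test - col_mean) @ Xt.T @ alpha

def repeated_kfold_accuracy(X, y, predict=ridge_fit_predict,
                            k=5, repeats=20, seed=0):
    """Return one averaged accuracy per repeat of k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    averaged = []
    for _ in range(repeats):
        # Random split into k folds of nearly equal size.
        folds = np.array_split(rng.permutation(n), k)
        fold_acc = []
        for val in folds:
            train = np.setdiff1d(np.arange(n), val)
            pred = predict(X[train], y[train], X[val])
            # Accuracy (assumption): correlation of prediction and phenotype.
            fold_acc.append(np.corrcoef(pred, y[val])[0, 1])
        averaged.append(np.mean(fold_acc))
    return np.array(averaged)
```

With the default `k=5` and `repeats=20`, the function returns 20 averaged accuracies, matching the design described above; their mean and standard deviation can then be reported per model and scenario.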
4. Discussion
The present study aimed to evaluate different models for estimating genomic breeding values and different scenarios for combining populations of Yorkshire pigs originating from distinct genetic backgrounds, namely, the Danish, Canadian, and American lines. Our findings demonstrate that combining the Danish, Canadian, and American populations significantly enhanced the accuracy of the Bayesian models, while the BLUP models showed only marginal improvement in accuracy. These outcomes are consistent with our earlier findings using simulated data [6]. It has been shown that the accuracy of Bayesian variable selection models depends on the number of QTLs, with fewer QTLs resulting in higher accuracy. This trend is more evident in cross-population genomic prediction. Bayesian variable selection outperforms GBLUP when the number of QTLs is small but loses its advantage when the number of QTLs is large [21]. Several studies [7, 22, 23, 24] have affirmed that ssGBLUP, an extension of GBLUP, performs better than GBLUP. Furthermore, the results of this study suggest that this advantage becomes more pronounced with a smaller reference population. In such instances, the phenotypic information from non-genotyped individuals related to the selection candidates makes a relatively larger contribution. The ssGBLUP model had a positive effect in the medium-sized and small Yorkshire pig populations, except for the AGE100 trait in the medium-sized population; in that case, the ssGBLUP model was more accurate when GEBVs were predicted from the medium-sized population alone. In addition, for the medium-sized and small populations, predicting GEBVs with the large, medium, and small populations combined improved accuracy, to varying degrees, compared with using each population alone. Overall, the heritability estimates for the BF100 trait ranged from moderate to high [25], indicating a substantial genetic component and suggesting a favorable response to selection. Many studies have shown that the ssGBLUP model is more accurate for genomic selection than other models in most cases [26, 27, 28, 29, 30]. Furthermore, the higher heritability estimates obtained with ssGBLUP than with PBLUP may be attributed to the Mendelian sampling term, which genomic methods such as ssGBLUP capture and pedigree-based approaches neglect. This term captures the variation among members of a half-sib or full-sib family around the family mean [31].
The reference population plays a pivotal role in GS: the accuracy of GS depends on both its size and its relationship to the selection candidates. Establishing a combined reference population, comprising animals of the same breed from populations with distinct genetic backgrounds, serves two purposes: it enlarges the reference population, and it allows a shared reference population to be used for genomic selection across breeding populations. This strategy has been applied in various genomic selection initiatives. For example, within the Euro Genomics Coöperative U.A. initiative based in Amsterdam, the Netherlands, a unified reference population was established by combining four closely related populations of dairy cattle. The results indicated a substantial improvement in the reliability of genomic predictions for bulls across the four populations, with an average increase of 10% compared to predictions based on individual reference populations alone [32]. Similar efforts in beef cattle showed that prediction equations developed in multibreed reference populations achieved higher accuracy for subpopulations [33]. Combining individuals of the same species improves genomic prediction accuracy because of their shared ancestry [34]. Emre Karaman [35] found that, in admixed populations, higher accuracies in genomic evaluation can be achieved by combining all available data from both purebred and admixed individuals. This approach outperformed other methods by considering the breed origin of alleles solely for selection candidates, along with utilizing breed-specific SNP effects estimated independently in each pure breed.
In multi-population studies, previous research has demonstrated the efficacy of individual-SNP methods based on whole-genome sequencing data [36] and of haplotype methods based on chip data [37, 38]. By exploiting the comprehensive information in whole-genome sequence data or the haplotype information in chip data, these methods better capture the LD between genetic variants and QTLs, and they have been shown to increase the accuracy of genomic prediction in multi-population settings. However, many studies have shown that high-density SNP markers and imputed data are not more accurate than medium-density markers; therefore, this study directly used the more economical and convenient 50k chip [26, 39]. Our findings are relevant to practical pig breeding in China. As China is the world's largest producer and consumer of pork, its pig industry is of paramount importance to agriculture. However, genetic connectedness among most Chinese pig breeding farms is limited or absent because genetic exchange is infrequent. This lack of connectedness has so far hindered comprehensive national or regional joint genetic evaluations.
Analysis of the three Yorkshire populations in our study revealed very weak genetic connectedness based on pedigree data. No shared ancestors or relatives were found among individuals from the Danish, Canadian, and American Yorkshire populations. Moreover, the genetic connectedness within the American Yorkshire population was not sufficient for joint genetic evaluation (data not shown). Weak connectedness of this kind has slowed genetic improvement in Chinese pig breeding, even though China has the largest nucleus herd in the world. In response to these challenges, genomic selection has emerged as a promising avenue for advancing Chinese pig breeding practices. Our results underscore the potential of combining reference populations from disparate genetic backgrounds, even in the absence of robust genetic connectedness, to significantly enhance the accuracy of genomic predictions. This insight offers a prospective way to advance genetic improvement in Chinese pig breeding by using genomic data where extensive genetic exchange networks are lacking.
The higher accuracy observed when the 1500- and 500-head populations were used as the reference population and the 295-head population as the validation population may arise because the larger combined reference (1500 + 500) covered a broader spectrum of genetic diversity and provided more data points. This broader coverage captured more comprehensive genetic information, thereby raising the prediction accuracy for the smaller population (295 head). However, when a smaller population (e.g., 295 head) is incorporated into a larger reference population, there is a risk of overfitting the characteristics of the larger population, particularly if there are significant genetic differences between them. This could reduce predictive accuracy for the validation population (i.e., the 295-head population).
Conversely, when all three populations are amalgamated, even though the volume of information increases, the merger may introduce irrelevant genetic information if the populations lack genetic relatedness. This influx of extraneous genetic data may contribute to increased ‘noise’ in the model, ultimately diminishing prediction accuracy.
The Bayesian approach may be superior to GBLUP in handling complex genetic structures because it allows effect sizes to vary among genetic markers, especially in the presence of substantial nonadditive genetic variation. In contrast, the GBLUP approach assumes that all marker effects are drawn from the same normal distribution with a common variance, and it may therefore lack the flexibility required for datasets with complex genetic backgrounds. The complex kinship relationships, lack of relatedness, and distinct genetic structures among the three populations resulted in the lower accuracies of the GBLUP and ssGBLUP models compared to those of the Bayesian approaches.