Multivariate Adaptive Regression Splines Enhance Genomic Prediction of Non-Additive Traits

de Oliveira Celeri, Maurício; da Costa, Weverton Gomes; Nascimento, Ana Carolina Campana; Azevedo, Camila Ferreira; Cruz, Cosme Damião; Sagae, Vitor Seiti; Nascimento, Moysés

doi:10.3390/agronomy14102234


by Maurício de Oliveira Celeri ¹, Weverton Gomes da Costa ¹, Ana Carolina Campana Nascimento ¹, Camila Ferreira Azevedo ¹, Cosme Damião Cruz ², Vitor Seiti Sagae ¹ and Moysés Nascimento ¹,*

¹ Laboratory of Computational Intelligence and Statistical Learning (LICAE), Department of Statistics, Federal University of Viçosa, Av. Peter Henry Rolfs, Viçosa 36570-900, MG, Brazil

² Department of General Biology, Federal University of Viçosa, Av. Peter Henry Rolfs, Viçosa 36570-900, MG, Brazil

* Author to whom correspondence should be addressed.

Agronomy 2024, 14(10), 2234; https://doi.org/10.3390/agronomy14102234

Submission received: 7 August 2024 / Revised: 12 September 2024 / Accepted: 24 September 2024 / Published: 27 September 2024

(This article belongs to the Special Issue Multi-omic Integration for Applied Prediction Breeding)


Abstract

The present work used Multivariate Adaptive Regression Splines (MARS) for genomic prediction and to study the non-additive fraction present in a trait. To this end, 12 scenarios for an F₂ population were simulated by combining three levels of broad-sense heritability (h² = 0.3, 0.5, and 0.8) and four numbers of QTLs controlling the trait (8, 40, 80, and 120). All scenarios included non-additive effects due to dominance and additive–additive epistasis. The individuals’ genomic estimated breeding values (GEBV) were predicted via MARS and compared against those from the GBLUP method fitted with additive, additive-dominance, and additive-epistatic models. In addition, a linkage disequilibrium study between markers and QTLs was performed. Linkage maps highlighted the QTLs and molecular markers identified by the methodologies under study. MARS showed higher predictive ability than the GBLUP models for traits controlled by 8 loci, and results were similar for traits controlled by 40 or more loci. Moreover, the use of MARS, together with a linkage disequilibrium study of the trait, can help to elucidate the traits’ genetic architecture. Therefore, MARS showed potential to improve genomic prediction, especially for oligogenic traits or traits controlled by approximately 40 QTLs, while enabling the elucidation of the genetic architecture of traits.

Keywords:

dominance; epistasis; marker selection; oligogenic trait

1. Introduction

Genome-wide selection (GWS), proposed by Meuwissen et al. (2001) [1], uses molecular markers distributed throughout the genome to predict the genetic merit of individuals and has become a popular tool for animal and plant breeding. This approach increases the prediction accuracy of genetic values and improves the understanding of the genetic architecture of traits, accelerating the selection of individuals [2,3]. Since the conception of GWS, several statistical techniques have been proposed, such as the GBLUP (Genomic Best Linear Unbiased Predictor), RR-BLUP (Ridge Regression Best Linear Unbiased Predictor), BayesA, and BayesB [1,4]. The method most commonly applied in GWS is the GBLUP, mainly because of its computational efficiency [5].

Although GWS is now employed routinely in breeding programs, most studies consider only additive effects controlling the traits [6]. Ignoring non-additive genetic effects can lead to inaccurate estimates of genetic values and, consequently, to reduced genetic gains [7]. Among the proposed methods, the GBLUP enables the inclusion of non-additive effects in a simplified way, with a smaller number of parameters to be estimated, while showing promise in terms of predictive ability [8,9].

Interest in methods for genomic prediction based on computational intelligence and machine learning, such as artificial neural networks [10,11,12,13,14] and regression trees [15], has been increasing [16,17,18]. This stems especially from the fact that, unlike classical methods for genomic prediction, such methodologies do not require prior assumptions about the relationship between phenotype and markers. This feature allows great flexibility to deal with different types of non-additive effects and to capture non-linear and interaction patterns [19]. Within this scope, an alternative methodology that also does not require defining the functional relationship between inputs (SNP markers) and output (predicted phenotypic and genotypic values) is Multivariate Adaptive Regression Splines, known as MARS [20]. MARS automatically models complex nonlinearities and interactions among the input variables [21] and returns the fitted model explicitly at the end of the process. In this sense, MARS has the potential to increase the prediction accuracy for traits with non-additive effects and to provide information about their genetic control.

In this study, we aimed to use Multivariate Adaptive Regression Splines for prediction of genetic merit and capturing non-additive effects to increase the understanding of the trait genetic architecture. The obtained results were compared to those obtained with the standard approach of the GBLUP modeling additive and non-additive effects.

2. Materials and Methods

2.1. Data Simulation

An F₂ population of a diploid species (2n = 20) with 1000 individuals was simulated in the software Genes [22]. We considered 4010 bi-allelic, codominant single nucleotide polymorphism (SNP) markers, equally and equidistantly distributed across 10 linkage groups (chromosomes) of 200 cM each, with no QTL present in the last two groups.

2.2. Evaluated Scenarios

Twelve scenarios were simulated by combining four numbers of QTLs controlling the trait (8, 40, 80, and 120), equally distributed among the first eight linkage groups, with heritabilities of 0.3, 0.5, and 0.8, as presented in Table 1. The last two linkage groups, containing 802 of the 4010 markers, have neither a direct influence on the traits nor linkage disequilibrium with the QTLs, and thus serve as a control to evaluate the method’s efficiency in detecting QTLs.
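The scenario grid can be enumerated in a few lines. The sketch below is only an illustration of the design described above: the variable names are hypothetical, and the per-group counts are derived by assuming that markers and QTLs are split evenly, as the text states.

```python
from itertools import product

# Design constants taken from the text.
QTL_COUNTS = (8, 40, 80, 120)
HERITABILITIES = (0.3, 0.5, 0.8)
N_MARKERS = 4010
N_GROUPS = 10
GROUPS_WITH_QTL = 8  # the last two linkage groups carry no QTL

# One scenario per (number of QTLs, heritability) combination.
scenarios = [
    {"n_qtl": q, "h2": h2, "qtl_per_group": q // GROUPS_WITH_QTL}
    for q, h2 in product(QTL_COUNTS, HERITABILITIES)
]

# Markers are split evenly across linkage groups (assumption stated in the text).
markers_per_group = N_MARKERS // N_GROUPS
control_markers = markers_per_group * (N_GROUPS - GROUPS_WITH_QTL)

print(len(scenarios))      # 12
print(markers_per_group)   # 401
print(control_markers)     # 802
```

The 802 control markers quoted in the text follow directly from 401 markers per linkage group in the two QTL-free groups.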

2.3. Phenotype Simulation

The phenotypic traits for the 12 scenarios were simulated considering a mean ($\mu$) equal to 100 and a coefficient of variation of 10%, with an average degree of dominance equal to 0.5 and an epistatic effect, as shown below:

$$Y_{i} = \mu + \sum_{j} p_{j} \alpha_{j} + \sum_{j} \sum_{j'} p_{j} p_{j'} \alpha_{j} \alpha_{j'} + \epsilon_{i},$$

where $Y_{i}$ is the phenotype of individual $i$, $i = 1, \dots, 1000$; $d_{i}/a_{i} = 0.5$; $\alpha_{j}$ is the effect of the favorable allele of marker $j$, $j = 1, \dots, 4010$, assuming values $u + a_{i}$, $u + d_{i}$ and $u - a_{i}$ associated with classes AA, Aa and aa, respectively, which are coded by 1, 0 and −1, in this order; $p_{j}$ is the contribution of locus $j$ to the manifestation of the trait under consideration, generated from a binomial distribution with parameters $p = q = 0.5$ and $n$ equal to $g - 1$, with $g$ being the number of QTLs; $\alpha_{j} \alpha_{j'}$ symbolizes the interaction between loci $j$ and $j'$, $j \neq j'$; and $\epsilon_{i}$ represents the random error of the observation of individual $i$. It was assumed that $\epsilon \sim N(0, I V_{e})$, where $V_{e}$ is the residual variance, defined by

$$V_{e} = \frac{V_{g} (1 - h^{2})}{h^{2}},$$

where $V_{g}$ is the genetic variance and $h^{2}$ the heritability.
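As a worked illustration of the residual-variance formula, the sketch below evaluates $V_{e}$ for the three heritability levels used in the scenarios. The genetic variance of 1.0 is a hypothetical value chosen only for the example; the function name is likewise illustrative.

```python
def residual_variance(v_g, h2):
    """V_e = V_g * (1 - h2) / h2, as defined in the text."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("heritability must lie in (0, 1]")
    return v_g * (1.0 - h2) / h2

v_g = 1.0  # hypothetical genetic variance, for illustration only
for h2 in (0.3, 0.5, 0.8):
    v_e = residual_variance(v_g, h2)
    # Sanity check: V_g / (V_g + V_e) recovers the target heritability.
    implied_h2 = v_g / (v_g + v_e)
    print(h2, round(v_e, 4), round(implied_h2, 4))
```

The formula is simply the definition $h^{2} = V_{g} / (V_{g} + V_{e})$ solved for $V_{e}$, which is why the implied heritability printed in the last column matches the target.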

2.4. Genomic Best Linear Unbiased Predictor (GBLUP)

The GBLUP model considering additive and non-additive effects (dominance and additive × additive epistasis) is given by the following:

$$y = 1 \mu + Z u_{a} + Z u_{d} + Z u_{e} + \epsilon,$$

where $y$ ($N \times 1$), with $N$ being the number of individuals, is the vector of phenotypic observations; $1$ ($N \times 1$) is a vector of ones and $\mu$ is the overall mean; $u_{a}$ ($N \times 1$), $u_{d}$ ($N \times 1$) and $u_{e}$ ($N \times 1$) are the additive, dominance and epistatic (additive × additive) effects of individuals, respectively; $Z$ ($N \times N$) is the incidence matrix; and $\epsilon$ ($N \times 1$) is the random error vector. The variance structure of the model is given by $u_{a} \sim N(0, G_{a} \sigma_{u_{a}}^{2})$, $u_{d} \sim N(0, G_{d} \sigma_{u_{d}}^{2})$, $u_{e} \sim N(0, G_{e} \sigma_{u_{e}}^{2})$ and $\epsilon \sim N(0, I \sigma_{\epsilon}^{2})$, where $\sigma_{u_{a}}^{2}$ is the additive variance, $\sigma_{u_{d}}^{2}$ is the variance due to dominance, $\sigma_{u_{e}}^{2}$ is the epistasis variance, $\sigma_{\epsilon}^{2}$ is the residual variance, and $G_{a}$, $G_{d}$ and $G_{e}$ ($N \times N$) are the genomic relationship matrices for additive, dominance and additive × additive epistatic effects, respectively.

To calculate the genomic relationship matrices used in the model, we considered $M_{ij}$ as the incidence matrix containing the genotype (number of copies of allele A) at marker $j$ for individual $i$ and $p_{j}$ as the frequency of the dominant allele A at marker $j$. Thus, we obtained the matrices $W$ and $S$ following Zhang et al. (2019) [23]:

$$W_{ij} = \begin{cases} 2 - 2 p_{j}, & \text{if } M_{ij} = AA \\ 1 - 2 p_{j}, & \text{if } M_{ij} = Aa \\ - 2 p_{j}, & \text{if } M_{ij} = aa; \end{cases}$$

$$S_{ij} = \begin{cases} - 2 (1 - p_{j})^{2}, & \text{if } M_{ij} = AA \\ 2 p_{j} (1 - p_{j}), & \text{if } M_{ij} = Aa \\ - 2 p_{j}^{2}, & \text{if } M_{ij} = aa. \end{cases}$$

Then, we calculated the following:

$$G_{a} = \frac{W W'}{\sum_{j = 1}^{n} 2 p_{j} (1 - p_{j})};$$

$$G_{d} = \frac{S S'}{\sum_{j = 1}^{n} [2 p_{j} (1 - p_{j})]^{2}};$$

$$G_{e} = G_{a} \# G_{a},$$

where # denotes the Hadamard product and the sums run over the markers.
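To make the construction of the relationship matrices concrete, the following minimal sketch builds $W$, $S$, $G_{a}$, $G_{d}$ and $G_{e}$ for a tiny genotype matrix using only the Python standard library. It assumes genotypes are stored as the number of copies of allele A (2 = AA, 1 = Aa, 0 = aa) and that allele frequencies are estimated from the sample; the function names and the toy data are hypothetical and are not taken from the GenomicLand software.

```python
def allele_frequencies(M):
    """Frequency of allele A at each marker (genotypes coded 2/1/0)."""
    n_ind = len(M)
    return [sum(row[j] for row in M) / (2.0 * n_ind) for j in range(len(M[0]))]

def additive_code(g, p):
    # 2 - 2p (AA), 1 - 2p (Aa), -2p (aa)
    return g - 2.0 * p

def dominance_code(g, p):
    if g == 2:
        return -2.0 * (1.0 - p) ** 2
    if g == 1:
        return 2.0 * p * (1.0 - p)
    return -2.0 * p ** 2

def cross_product(A):
    """A A' for a matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rk)) for rk in A] for ri in A]

def relationship_matrices(M):
    p = allele_frequencies(M)
    W = [[additive_code(g, pj) for g, pj in zip(row, p)] for row in M]
    S = [[dominance_code(g, pj) for g, pj in zip(row, p)] for row in M]
    het = [2.0 * pj * (1.0 - pj) for pj in p]
    denom_a = sum(het)
    denom_d = sum(h * h for h in het)
    G_a = [[v / denom_a for v in row] for row in cross_product(W)]
    G_d = [[v / denom_d for v in row] for row in cross_product(S)]
    # Hadamard (element-wise) product of G_a with itself.
    G_e = [[v * v for v in row] for row in G_a]
    return G_a, G_d, G_e

# Toy data: 4 individuals x 3 markers (hypothetical).
M = [[2, 1, 0],
     [1, 1, 2],
     [0, 2, 1],
     [1, 0, 1]]
G_a, G_d, G_e = relationship_matrices(M)
for row in G_a:
    print([round(v, 3) for v in row])
```

With thousands of markers a matrix library would normally be used instead of nested lists; the plain-Python version is kept here only so each term can be checked against the formulas.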

The genomic estimated breeding values (GEBV) of individuals are defined as

$$\widehat{GEBV} = \hat{u}_{a} + \hat{u}_{d} + \hat{u}_{e}.$$

In this work, we considered three models for comparison: GBLUP-A (only the additive component), GBLUP-AD (additive and dominance components), and GBLUP-AE (additive and additive × additive epistatic components). All models were fitted using GenomicLand software [24].

2.5. Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines is a nonparametric regression technique proposed for solving high dimensionality problems using basis functions to fit the relationship between the dependent variable and its predictors [20,25]. Moreover, it allows modeling the contribution of each variable individually and also the possible interactions between them [26].

The MARS model is based on basis functions having one of the following forms:

$$(x - a)_{+} = \max(0, x - a);$$

$$(a - x)_{+} = \max(0, a - x).$$

In these functions, $a$ is called a knot (node), and the pair $(x - a)_{+}$ and $(a - x)_{+}$ is called a reflected pair. Basis functions are constructed for predictor variables in a regression context as follows: given a set of variables $X_{j}$, $j = 1, 2, \dots, p$, with observations $x_{ij}$, $i = 1, 2, \dots, N$, one can determine a collection of basis functions

$$C = \{(X_{j} - a)_{+}, (a - X_{j})_{+}\},$$

with $j = 1, 2, \dots, p$ and $a \in \{x_{1j}, x_{2j}, \dots, x_{Nj}\}$.
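The hinge functions and the candidate collection $C$ can be sketched as follows for a single SNP. The genotype coding (1, 0, −1 for AA, Aa, aa) follows the phenotype simulation section; the function names and the toy genotype vector are hypothetical.

```python
def hinge_pos(x, a):
    """(x - a)_+ = max(0, x - a)"""
    return max(0.0, x - a)

def hinge_neg(x, a):
    """(a - x)_+ = max(0, a - x)"""
    return max(0.0, a - x)

def candidate_basis(column):
    """One reflected pair per distinct observed value (knot) of a predictor."""
    pairs = []
    for a in sorted(set(column)):
        pos = [hinge_pos(x, a) for x in column]
        neg = [hinge_neg(x, a) for x in column]
        pairs.append((a, pos, neg))
    return pairs

# Toy SNP coded 1 (AA), 0 (Aa), -1 (aa); hypothetical data.
snp = [1, 0, -1, 0, 1, -1]
for knot, pos, neg in candidate_basis(snp):
    print(knot, pos, neg)
```

For a SNP with this coding, the knot at 0 is the informative one: $(x - 0)_{+}$ flags AA individuals and $(0 - x)_{+}$ flags aa individuals, so a reflected pair can assign the two homozygotes different deviations from the heterozygote.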

The model containing $M$ terms, proposed by Friedman (1991), is given by the following:

$$f(X) = c_{0} + \sum_{i = 1}^{M} c_{i} B_{i}(X) + \epsilon,$$

where $c_{0}$ is the intercept; $B_{i}(X)$ is a basis function, or a product of basis functions, contained in $C$; $X$ is the set referring to the marker data; $c_{i}$ is the coefficient of $B_{i}$, with $i = 1, \dots, M$; $M$ corresponds to the number of basis functions, or products of basis functions, in the model, set automatically by the MARS algorithm [26]; and $\epsilon$ is the error. The coefficients $c_{0}$ and $c_{i}$ are estimated by minimizing the residual sum of squares [27]. Small sample data to demonstrate the whole procedure are presented as a toy example (Example S1).
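The sketch below shows how a fitted model of this form would be evaluated, including a product of two hinge functions, which is how a degree-2 model can represent an interaction between markers. The coefficients, knots and marker indices are invented for illustration and do not come from the fitted models in this study.

```python
def hinge(x, a, sign):
    """sign = +1 gives (x - a)_+; sign = -1 gives (a - x)_+."""
    return max(0.0, sign * (x - a))

def mars_predict(x, intercept, terms):
    """Evaluate c_0 + sum_i c_i * B_i(x).

    terms: list of (coefficient, factors), where factors is a list of
    (predictor index, knot, sign) and B_i is the product of its hinges.
    """
    total = intercept
    for coef, factors in terms:
        value = 1.0
        for j, a, sign in factors:
            value *= hinge(x[j], a, sign)
        total += coef * value
    return total

# Hypothetical fitted model on two markers coded 1/0/-1.
intercept = 100.0
terms = [
    (4.0, [(0, 0.0, +1)]),                  # main effect: (x0 - 0)_+
    (-3.0, [(0, 0.0, -1)]),                 # main effect: (0 - x0)_+
    (2.5, [(0, 0.0, +1), (1, -1.0, +1)]),   # interaction: (x0 - 0)_+ * (x1 + 1)_+
]

print(mars_predict([1, 1], intercept, terms))   # 109.0
print(mars_predict([-1, 1], intercept, terms))  # 97.0
```

Because each term is an explicit product of hinges on named markers, the markers entering the final model can be read off directly, which is what allows the selected markers to be compared with QTL positions later in the paper.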

The MARS algorithm [20,28] consists of two stages: the forward and the backward phases. The forward phase is an iterative process of inserting reflected pairs of basis functions. Initially, the reflected pair inserted into the model is the one that maximizes the reduction of the residual sum of squares. Then, the inclusion of each remaining candidate pair is tested, and the one that promotes the greatest reduction of the residual sum of squares is inserted. This process continues until a stopping condition is reached [28,29]. In the backward phase, basis functions are removed one at a time, aiming to achieve a more parsimonious model and avoid overfitting. In this step, the MARS algorithm uses Generalized Cross Validation (GCV) to identify the best model of each size λ, and the final model returned by the algorithm is the one that minimizes the GCV value, which is given by the following:

$$GCV(\lambda) = \frac{RSS}{\left(1 - \frac{N(\lambda)}{n}\right)^{2}},$$

where RSS is the residual sum of squares, $N(\lambda)$ is the effective number of model parameters, and $n$ is the number of observations. Hastie, Tibshirani, and Friedman (2008) [27] propose $N(\lambda) = r + cK$, where $r$ is the number of linearly independent basis functions present in the model, $c$ is the penalty factor ($c = 2$ for an additive model and $c = 3$ otherwise), and $K$ is the number of knots (nodes) present in the model.
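A small numerical sketch of the GCV criterion, exactly as written above, may help show how the penalty acts during the backward phase. The RSS values and model sizes are hypothetical, and the function names are illustrative rather than part of the earth package.

```python
def effective_parameters(r, n_knots, c):
    """N(lambda) = r + c * K."""
    return r + c * n_knots

def gcv(rss, r, n_knots, n_obs, c):
    """GCV(lambda) = RSS / (1 - N(lambda) / n)^2."""
    n_eff = effective_parameters(r, n_knots, c)
    if n_eff >= n_obs:
        raise ValueError("effective number of parameters must be below n")
    return rss / (1.0 - n_eff / n_obs) ** 2

# Two hypothetical degree-2 models (c = 3) fitted to n = 100 observations.
small = gcv(rss=50.0, r=5, n_knots=4, n_obs=100, c=3)
large = gcv(rss=48.0, r=9, n_knots=8, n_obs=100, c=3)
print(round(small, 2), round(large, 2))
```

In this example the larger model has a slightly lower RSS but a higher GCV, so the backward phase would prefer the smaller model.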

The GCV is also used by the MARS algorithm to compute the generalized coefficient of determination (GRSq), given by the following:

$$GRSq = 1 - \frac{GCV}{GCV_{0}},$$

where $GCV_{0}$ represents the $GCV$ of a model containing only the intercept. GRSq is a metric used to assess the goodness-of-fit of the fitted model. Higher GRSq values indicate a better ability of the model to explain the observed variation in the dataset.

The analyses for Multivariate Adaptive Regression Splines were conducted in R software [30], version 4.0.2, with the earth package [28]. The MARS fitting was performed considering models of degree 1 (MARS1—additive models) and 2 (MARS2—non-additive models). The selected MARS model was the one that presented the highest GRSq between the two fitted.
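The selection rule between the degree-1 and degree-2 fits can be sketched as below. This is not a call into the earth package: it only assumes that a GCV value is available for each fitted model and for the intercept-only model, and the numbers used are hypothetical.

```python
def grsq(gcv_model, gcv_intercept_only):
    """GRSq = 1 - GCV / GCV_0."""
    return 1.0 - gcv_model / gcv_intercept_only

def select_by_grsq(gcv_by_model, gcv_intercept_only):
    """Return the model name with the highest GRSq and all scores."""
    scores = {name: grsq(value, gcv_intercept_only)
              for name, value in gcv_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical GCV values for one scenario.
best, scores = select_by_grsq({"MARS1": 80.0, "MARS2": 60.0},
                              gcv_intercept_only=100.0)
print(best, {k: round(v, 3) for k, v in scores.items()})
```

Since both models share the same $GCV_{0}$, choosing the highest GRSq is equivalent to choosing the lowest GCV.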

2.6. Linkage Disequilibrium

Linkage disequilibrium ($r^{2}$) was calculated between marker pairs within each linkage group using the LD.decay function implemented in the sommer package [31]. To study the LD pattern in our data, $r^{2}$ was presented as a function of genetic distance. Subsequently, a local polynomial regression (LOESS) model was fitted [32]. Finally, as presented by Vos et al. (2017) and Otyama et al. (2019) [33,34], a horizontal straight line was plotted at the critical value $r^{2} = 0.2$. The markers within the window distance, defined by the intersection between the fitted LOESS curve and the horizontal straight line, were considered to be associated with the simulated QTLs and were used to evaluate the potential of the method to study the trait genetic architecture.
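The two ingredients of this procedure can be sketched in plain Python under explicit assumptions: $r^{2}$ is taken here as the squared Pearson correlation between the genotype codes of two markers, and the decay curve is supplied as already-fitted (distance, $r^{2}$) points, so the LOESS fit itself is not reproduced. The function names and the example curve are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ld_r2(marker_a, marker_b):
    """Genotype-based r^2: squared correlation of the marker codes."""
    return pearson(marker_a, marker_b) ** 2

def decay_distance(curve, threshold=0.2):
    """Distance at which a fitted decay curve first drops below the threshold.

    curve: (distance, r2) points sorted by distance; linear interpolation
    is used between consecutive points. Returns None if never crossed.
    """
    for (d0, r0), (d1, r1) in zip(curve, curve[1:]):
        if r0 >= threshold > r1:
            return d0 + (r0 - threshold) * (d1 - d0) / (r0 - r1)
    return None

# Hypothetical fitted curve (cM, r^2).
curve = [(0.0, 0.8), (25.0, 0.4), (50.0, 0.1)]
window = decay_distance(curve)
print(round(window, 2))
```

Markers lying within `window` cM of a QTL would then be counted as associated with that QTL.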

2.7. Assessing Methods

Predictive abilities (PA) for the genomic prediction models (MARS, GBLUP-A, GBLUP-AD, and GBLUP-AE) were obtained using a 5-fold cross-validation scheme. The PA is defined as the Pearson correlation coefficient between the individuals’ simulated genetic values and those predicted by the models [35].
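A minimal sketch of this evaluation scheme is given below. The model is trained on phenotypes in four folds, and the predictions for the held-out fold are correlated with the simulated genetic values. The predictor used here, a least-squares line on a single marker, is a deliberately simple stand-in rather than MARS or GBLUP, and the toy data, seeds and function names are all hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def k_fold_indices(n, k=5, seed=1):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def fit_predict_one_marker(x_train, y_train, x_test):
    """Stand-in predictor: simple linear regression on one marker."""
    n = len(x_train)
    mx, my = sum(x_train) / n, sum(y_train) / n
    sxx = sum((a - mx) ** 2 for a in x_train)
    slope = sum((a - mx) * (b - my) for a, b in zip(x_train, y_train)) / sxx
    return [my + slope * (a - mx) for a in x_test]

def predictive_ability(marker, phenotypes, genetic_values, k=5, seed=1):
    """Mean over folds of cor(simulated genetic value, prediction)."""
    n = len(phenotypes)
    per_fold = []
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        preds = fit_predict_one_marker(
            [marker[i] for i in train],
            [phenotypes[i] for i in train],
            [marker[i] for i in test],
        )
        per_fold.append(pearson([genetic_values[i] for i in test], preds))
    return sum(per_fold) / k

# Toy data: two markers affect the genetic value, the model sees only one.
rng = random.Random(42)
x1 = [(-1, 0, 1)[i % 3] for i in range(60)]
x2 = [(-1, 0, 1)[(i // 3) % 3] for i in range(60)]
genetic = [100.0 + 5.0 * a + 2.0 * b for a, b in zip(x1, x2)]
phenotype = [g + rng.gauss(0.0, 2.0) for g in genetic]
pa = predictive_ability(x1, phenotype, genetic)
print(round(pa, 3))
```

Averaging the per-fold correlations is one common convention and is an assumption here; the text does not state whether correlations were averaged across folds or computed on pooled predictions.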

In addition, Cohen’s Kappa coefficient [36] was used to measure the agreement between the 10% of individuals with the highest GEBVs estimated by each model and the 10% of individuals with the highest simulated genetic values. Cohen’s Kappa coefficient is defined by the following:

$$\kappa = \frac{p_{0} - p_{e}}{1 - p_{e}},$$

where $p_{0}$ is the observed agreement and $p_{e}$ is the agreement expected by chance.
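The agreement measure can be sketched as follows, treating "selected" versus "not selected" as the two categories rated once by the true genetic values and once by the predictions. The function names and the toy data, in which 6 of the 10 truly best individuals are also selected by the predictions, are hypothetical.

```python
def top_fraction(values, fraction=0.10):
    """Indices of the individuals with the highest values."""
    k = max(1, round(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def cohen_kappa_top(true_values, predicted, fraction=0.10):
    n = len(true_values)
    top_true = top_fraction(true_values, fraction)
    top_pred = top_fraction(predicted, fraction)
    both = len(top_true & top_pred)          # selected by both rankings
    neither = n - len(top_true | top_pred)   # selected by neither ranking
    p_0 = (both + neither) / n               # observed agreement
    q_true, q_pred = len(top_true) / n, len(top_pred) / n
    p_e = q_true * q_pred + (1 - q_true) * (1 - q_pred)  # chance agreement
    return (p_0 - p_e) / (1 - p_e)

# Toy data: 100 individuals; predictions miss 4 of the true top 10.
true_values = [float(i) for i in range(100)]
predicted = list(true_values)
for i in range(4):
    predicted[i] = 1000.0 + i   # four poor individuals ranked at the top
    predicted[90 + i] = -1.0    # four top individuals ranked at the bottom

kappa = cohen_kappa_top(true_values, predicted)
print(round(kappa, 4))  # 0.5556
```

Here $p_{0} = 0.92$ and $p_{e} = 0.82$, giving $\kappa = 0.10 / 0.18 \approx 0.56$, a value in the range the Results section describes as moderate.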

We also counted the number of QTLs within the distance window defined by the LD analysis for each model. The number of markers selected in each of the 12 scenarios was set to the number of markers identified by the MARS methodology. In this analysis, the whole dataset was used for fitting the models described above.

3. Results

3.1. MARS Model Selection

Figure 1 shows the generalized coefficients of determination (GRSq) and the associated standard error for the additive and non-additive models across evaluated scenarios. The additive model presented a GRSq ranging from 0.048 to 0.406, while the non-additive model presented a GRSq between 0.088 and 0.550, being higher than the additive model in all evaluated scenarios. Thus, we selected the MARS model that considers additive and interaction effects for obtaining the genomic estimated breeding values (GEBV).

3.2. Genomic Prediction Models

The predictive ability estimates for the GBLUP-A, GBLUP-AD, GBLUP-AE, and MARS methods and their respective standard errors are shown in Figure 2. In general, MARS showed higher predictive ability than the GBLUP models in scenarios with eight QTLs controlling the trait. For the other scenarios, in which traits were controlled by at least 40 QTLs, MARS showed similar or inferior results in comparison to the GBLUP models. Specifically, MARS showed results similar to those of the GBLUP models when the traits were controlled by at least 40 QTLs in the high heritability scenario ($h^{2} = 0.80$). For the lower heritability values ($h^{2} = 0.30$ and $0.50$), MARS showed results similar to those of the additive GBLUP model (GBLUP-A) in scenarios with at least 40 QTLs controlling the trait.

3.3. Cohen’s Kappa Coefficient of Agreement

Based on Cohen’s Kappa coefficient (Figure 3), the GBLUP and MARS models showed coefficients ranging from not significant (0–0.20) to moderate (0.40–0.59) [37]. MARS showed a higher Cohen’s Kappa coefficient than the GBLUP methods in scenarios with eight QTLs. For the other scenarios, Cohen’s Kappa coefficient for the MARS model was similar to or lower than that of the GBLUP models.

3.4. Linkage Disequilibrium and Study of Trait Genetic Architecture

Linkage disequilibrium was calculated for all pairs of marker combinations within the same linkage group. To fit the decay curve, only those with significant $r^{2}$ were considered, i.e., pairs with a p-value no greater than 0.05 with Bonferroni correction (Figure 4). The threshold value for the linkage disequilibrium was $r^{2} = 0.2$ [33,34].

The number of markers selected by the MARS model in each evaluated scenario is presented in Table 2.

MARS was the methodology that presented the highest linkage disequilibrium values between the selected marker and the closest QTL considering the scenarios with eight QTLs and heritabilities equal to 0.3 and 0.5 (Figure S1). Considering heritability equal to 0.8, the markers selected by all methodologies presented medium or high linkage disequilibrium values (Figure S1). In the other scenarios, all fitted models indicated important markers with high linkage disequilibrium values between the marker and the closest QTL (Figures S2–S4).

The average linkage disequilibrium among markers selected by MARS was lower than the average disequilibrium among markers selected by the GBLUP methods (Table 3). This result indicates that the markers selected by MARS cover a larger portion of the genome and therefore detect a larger number of QTLs. This result is corroborated by Supplementary Figures S5–S8, in which the linkage maps show the QTLs and the selected molecular markers.

Despite the small number of markers used to fit the MARS model, this methodology can capture markers’ effects and consequently genome positions that are important in controlling the trait. This result is observed in Figure S5, where MARS was able to indicate all the positions in the genome that are important for the trait. Considering a larger number of QTLs, MARS was also able to indicate positions in which the QTLs were allocated. For some traits, a possible limitation of MARS is the small number of markers used to fit the model. For polygenic traits where the QTLs are well distributed throughout the genome, MARS may not achieve a good explanation of the trait architecture since not all markers will be involved in the analysis.

3.5. Computational Efficiency

Figure 5 illustrates the average time required to fit MARS2 and GBLUP models, along with their standard errors. Among the GBLUP models, the additive model exhibited the shortest fitting time, consistently spending less than 10 s in all scenarios. In contrast, the degree-2 MARS model demonstrated the longest fitting time, ranging from 18.76 to 37.54 s.

4. Discussion

In this study, we employed MARS to increase the predictive ability for traits with non-additive effects. Considering that the inclusion of non-additive effects in the model can increase the predictive ability and accuracy of genetic values [6,7,38], it is worth emphasizing that, unlike statistical models, MARS incorporates non-additive effects naturally, without the need to specify them explicitly in the model. Specifically, the algorithm captures these effects through the basis functions and their products that are inserted into the model. This feature is evident in the GRSq values, as MARS2 models (non-additive models) exhibited superior goodness of fit (Figure 1). This is consistent with the simulated traits, which have a non-additive genetic architecture.

The improvement observed in the predictive ability of the GBLUP as the number of QTLs increases can be explained by its construction, which considers the infinitesimal model in which many genes control the trait equally [39]. Moreover, when comparing additive and non-additive models, Liu et al. (2019) [3] found similar results, where non-additive models showed higher accuracy than the strictly additive model in scenarios with heritability around 0.83. According to the same authors, for traits with low heritability, the insertion of non-additive terms did not cause a considerable increase.

In low heritability scenarios, MARS presents problems, possibly for two reasons: (i) it is sensitive to multicollinearity [20], which can arise from the high number of markers and QTLs; and (ii) the algorithm defines termination conditions that disfavor traits controlled by many genes, since the number of terms inserted in the model is much lower than the number of QTLs. On the other hand, in scenarios where the trait is controlled by eight QTLs, the MARS model showed higher predictive ability values. Additionally, traits controlled by a few QTLs tend to have well-defined phenotypic classes [40] and present more important epistatic effects than polygenic traits [6,41]. These characteristics, combined with the violation of the infinitesimal assumption underlying the GBLUP model, make MARS better suited to oligogenic traits. Li et al. (2019) [42], working with geotechnical data, found that MARS can capture complex relationships and perform well with a more parsimonious final model, which corroborates the results found in this study.

The Cohen’s Kappa agreement coefficient, calculated for the top 10% of individuals based on estimated GEBVs and simulated genetic values, ranged from non-significant to moderate. In the scenario with eight QTLs, only MARS successfully identified some individuals within the top 10% based on true genetic values. Conversely, scenarios with a higher number of QTLs demonstrated stronger agreement, suggesting more reliable selection outcomes.

The LD decays to $r^{2} = 0.2$ over a distance of 39.60 cM (Figure 4). However, it is interesting to note that the value of $r^{2}$ increases again for markers that are approximately 130 cM apart within the same linkage group, suggesting the epistatic nature of the simulated genome, given that linkage disequilibrium can be generated by epistasis between nearby loci [43].

Considering the linkage disequilibrium information, it is possible to verify that the simulated traits have an epistatic effect (Figure 4) and are controlled by QTLs present in the first eight linkage groups (Figures S1–S8, and Table 3). Some authors highlight the importance of knowing the genetic architecture of traits in plants [44,45]. According to these authors, genetic architecture studies, combined with a better knowledge of plant population structure, will impact the understanding of plant evolution, crop improvement design, and accuracy in genomic prediction models.

Based on the evaluated dataset, MARS incurred a computational cost approximately 2.5 times greater than the GBLUP model. However, the absolute runtime for MARS was still less than a minute, suggesting that its increased computational demands are manageable for most applications.

Finally, it can be highlighted that machine learning methods, such as MARS, are interesting for genomic prediction. This class of methods allows capturing complex structures, such as dominance and epistasis, directly from the available dataset [46], unlike usual methodologies, where the genetic architecture must be informed a priori. In addition, machine learning-based methodologies do not make assumptions about the distribution of observed phenotypic values or the types of effects to be included in the model.

Genomic prediction based on machine learning methods, such as artificial neural networks, decision tree-based methods and their refinements, and support vector machines, has obtained good results for complex traits [16]. Attention should be paid to the fact that even today with machine learning-based methods, genomic prediction methods encounter major challenges in detecting epistatic effects [47].

Our results suggest that MARS is a promising tool for genomic prediction of traits controlled by a limited number of genes ($\leq 40$). For polygenic traits, MARS can be employed within an ensemble modeling framework, combining its advantages with those of traditional methods. For example, MARS can be used to identify a subset of QTLs, and then the GBLUP can be used to capture the remaining genetic variation. Nascimento et al. (2024) [48] demonstrated the effectiveness of Stacking Ensemble Learning (SEL) in coffee breeding for key traits such as yield, fruit number, leaf miner infestation, and cercosporiosis incidence. By combining MARS, the GBLUP, and other machine learning techniques, SEL significantly enhanced prediction accuracy. Although it has not yet been implemented for this purpose, MARS has the potential to optimize marker density panels, potentially reducing genotyping costs and improving predictive ability (PA). Sousa et al. (2024) [49] demonstrated this concept using machine learning algorithms (Random Forest and Bagging) to evaluate the trade-off between panel size and PA for eight agronomic traits in Coffea canephora. Their findings suggest that reducing the number of markers can improve selection efficiency while lowering costs. These optimized marker panels can serve as inputs for machine learning models, such as neural networks, potentially reducing the computational cost associated with this approach.

5. Conclusions

MARS proved efficient for predicting genetic values of individuals with the inclusion of non-additive effects. Specifically, it showed higher predictive ability than the GBLUP models for traits controlled by 8 loci, and similar results for traits controlled by 40 or more loci. Furthermore, the use of MARS, together with a linkage disequilibrium study of the trait, was able to elucidate the genetic architecture of the trait under study and identify important regions in the genome.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agronomy14102234/s1, Example S1: MARS procedure using small sample data. Figure S1: Linkage disequilibrium between selected markers and the nearest QTL for GBLUP-A, GBLUP-AD, GBLUP-AE, and MARS2 for the scenario with eight QTLs (heritability equal to 0.3, 0.5, and 0.8). Figure S2: Linkage disequilibrium between selected markers and the nearest QTL for GBLUP-A, GBLUP-AD, GBLUP-AE, and MARS2 for the scenario with 40 QTLs (heritability equal to 0.3, 0.5, and 0.8). Figure S3: Linkage disequilibrium between selected markers and the nearest QTL for GBLUP-A, GBLUP-AD, GBLUP-AE, and MARS2 for the scenario with 80 QTLs (heritability equal to 0.3, 0.5, and 0.8). Figure S4: Linkage disequilibrium between selected markers and the nearest QTL for GBLUP-A, GBLUP-AD, GBLUP-AE, and MARS2 for the scenario with 120 QTLs (heritability equal to 0.3, 0.5, and 0.8). Figure S5: Distribution of selected markers on the genetic map for the models GBLUP-A (red), GBLUP-AD (dark red), GBLUP-AE (green), and MARS2 (purple) for the scenario with eight QTLs and heritability: A—0.3, B—0.5, and C—0.8. Figure S6: Distribution of selected markers on the genetic map for the models GBLUP-A (red), GBLUP-AD (dark red), GBLUP-AE (green), and MARS2 (purple) for the scenario with 40 QTLs and heritability: A—0.3, B—0.5, and C—0.8. Figure S7: Distribution of selected markers on the genetic map for the models GBLUP-A (red), GBLUP-AD (dark red), GBLUP-AE (green), and MARS2 (purple) for the scenario with 80 QTLs and heritability: A—0.3, B—0.5, and C—0.8. Figure S8: Distribution of selected markers on the genetic map for the models GBLUP-A (red), GBLUP-AD (dark red), GBLUP-AE (green), and MARS2 (purple) for the scenario with 120 QTLs and heritability: A—0.3, B—0.5, and C—0.8.

Author Contributions

Conceptualization, M.d.O.C., A.C.C.N. and M.N.; Formal analysis, M.d.O.C.; Investigation, W.G.d.C., A.C.C.N., C.F.A., C.D.C. and M.N.; Methodology, M.d.O.C., W.G.d.C. and M.N.; Software, M.d.O.C., W.G.d.C. and M.N.; Supervision, M.N.; Validation, M.d.O.C., V.S.S. and M.N.; Writing—original draft, M.d.O.C. and M.N.; Writing—review and editing, M.d.O.C., A.C.C.N., C.F.A., C.D.C., V.S.S. and M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES)—Finance Code 001.

Data Availability Statement

The data presented in this study are openly available in Zenodo at 10.5281/zenodo.13256070.

Acknowledgments

The authors acknowledge support from the Foundation for Research Support of the state of Minas Gerais (FAPEMIG, APQ-01638–18), the National Council of Scientific and Technological Development (CNPq, 408833/2023–8), and the National Institutes of Science and Technology of Coffee (INCT/Café). MN and CA are supported by scientific productivity grants (310755/2023–9 and 309856/2023-0, respectively) from the Brazilian Council for Scientific and Technological Development (CNPq).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Meuwissen, T.H.E.; Hayes, B.J.; Goddard, M.E. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 2001, 157, 1819–1829. [Google Scholar] [CrossRef] [PubMed]
Singh, B.; Mal, G.; Gautam, S.K.; Mukesh, M. Whole-Genome Selection in Livestock. In Advances in Animal Biotechnology; Singh, B., Mal, G., Gautam, S.K., Mukesh, M., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 349–364. ISBN 978-3-030-21309-1. [Google Scholar]
Liu, X.; Wang, H.; Hu, X.; Li, K.; Liu, Z.; Wu, Y.; Huang, C. Improving Genomic Selection with Quantitative Trait Loci and Nonadditive Effects Revealed by Empirical Evidence in Maize. Front. Plant Sci. 2019, 10, 1129. [Google Scholar] [CrossRef]
VanRaden, P.M. Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 2008, 91, 4414–4423. [Google Scholar] [CrossRef] [PubMed]
Hernandez, C.O.; Wyatt, L.E.; Mazourek, M.R. Genomic Prediction and Selection for Fruit Traits in Winter Squash. G3 GenesGenomesGenetics 2020, 10, 3601–3610. [Google Scholar] [CrossRef]
Varona, L.; Legarra, A.; Toro, M.A.; Vitezica, Z.G. Non-Additive Effects in Genomic Selection. Front. Genet. 2018, 9, 78. [Google Scholar] [CrossRef]
Lebedev, V.G.; Lebedeva, T.N.; Chernodubov, A.I.; Shestibratov, K.A. Genomic Selection for Forest Tree Improvement: Methods, Achievements and Perspectives. Forests 2020, 11, 1190. [Google Scholar] [CrossRef]
Martini, J.W.R.; Gao, N.; Cardoso, D.F.; Wimmer, V.; Erbe, M.; Cantet, R.J.C.; Simianer, H. Genomic Prediction with Epistasis Models: On the Marker-Coding-Dependent Performance of the Extended GBLUP and Properties of the Categorical Epistasis Model (CE). BMC Bioinform. 2017, 18, 3. [Google Scholar] [CrossRef]
Calleja-Rodriguez, A.; Chen, Z.; Suontama, M.; Pan, J.; Wu, H.X. Genomic Predictions with Nonadditive Effects Improved Estimates of Additive Effects and Predictions of Total Genetic Values in Pinus Sylvestris. Front. Plant Sci. 2021, 12, 666820. [Google Scholar] [CrossRef] [PubMed]
González-Camacho, J.M.; Ornella, L.; Pérez-Rodríguez, P.; Gianola, D.; Dreisigacker, S.; Crossa, J. Applications of Machine Learning Methods to Genomic Selection in Breeding Wheat for Rust Resistance. Plant Genome 2018, 11, 170104. [Google Scholar] [CrossRef] [PubMed]
Wang, K.; Abid, M.A.; Rasheed, A.; Crossa, J.; Hearne, S.; Li, H. DNNGP, a Deep Neural Network-Based Method for Genomic Prediction Using Multi-Omics Data in Plants. Mol. Plant 2023, 16, 279–293. [Google Scholar] [CrossRef] [PubMed]
Coelho de Sousa, I.; Nascimento, M.; de Castro Sant’anna, I.; Caixeta, E.T.; Azevedo, C.F.; Cruz, C.D.; da Silva, F.L.; Alkimim, E.R.; Nascimento, A.C.C.; Serão, N.V.L. Marker Effects and Heritability Estimates Using Additive-Dominance Genomic Architectures via Artificial Neural Networks in Coffea Canephora. PLoS ONE 2022, 17, e0262055. [Google Scholar] [CrossRef]
Montesinos-López, O.A.; Sivakumar, A.; Huerta Prado, G.I.; Salinas-Ruiz, J.; Agbona, A.; Ortiz Reyes, A.E.; Alnowibet, K.; Ortiz, R.; Montesinos-López, A.; Crossa, J. Exploring Data Augmentation Algorithm to Improve Genomic Prediction of Top-Ranking Cultivars. Algorithms 2024, 17, 260. [Google Scholar] [CrossRef]
Feng, W.; Gao, P.; Wang, X. AI Breeder: Genomic Predictions for Crop Breeding. New Crops 2024, 1, 100010. [Google Scholar] [CrossRef]
Wang, X.; Shi, S.; Wang, G.; Luo, W.; Wei, X.; Qiu, A.; Luo, F.; Ding, X. Using Machine Learning to Improve the Accuracy of Genomic Prediction of Reproduction Traits in Pigs. J. Anim. Sci. Biotechnol. 2022, 13, 60. [Google Scholar] [CrossRef] [PubMed]
Azodi, C.B.; Bolger, E.; McCarren, A.; Roantree, M.; de los Campos, G.; Shiu, S.-H. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3 GenesGenomesGenetics 2019, 9, 3691–3702. [Google Scholar] [CrossRef]
Crossa, J.; Pérez-Rodríguez, P.; Cuevas, J.; Montesinos-López, O.; Jarquín, D.; de los Campos, G.; Burgueño, J.; González-Camacho, J.M.; Pérez-Elizalde, S.; Beyene, Y.; et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci. 2017, 22, 961–975. [Google Scholar] [CrossRef] [PubMed]
Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer International Publishing: Cham, Switzerland, 2022; ISBN 978-3-030-89009-4. [Google Scholar]
Zingaretti, L.M.; Gezan, S.A.; Ferrão, L.F.V.; Osorio, L.F.; Monfort, A.; Muñoz, P.R.; Whitaker, V.M.; Pérez-Enciso, M. Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Front. Plant Sci. 2020, 11, 25. [Google Scholar] [CrossRef] [PubMed]
Friedman, J.H. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67. [Google Scholar] [CrossRef]
Adnan, R.M.; Liang, Z.; Heddam, S.; Zounemat-Kermani, M.; Kisi, O.; Li, B. Least Square Support Vector Machine and Multivariate Adaptive Regression Splines for Streamflow Prediction in Mountainous Basin Using Hydro-Meteorological Data as Inputs. J. Hydrol. 2020, 586, 124371. [Google Scholar] [CrossRef]
Cruz, C.D. Genes Software—Extended and Integrated with the R, Matlab and Selegen. Acta Sci. Agron. 2016, 38, 547–552. [Google Scholar] [CrossRef]
Zhang, H.; Yin, L.; Wang, M.; Yuan, X.; Liu, X. Factors Affecting the Accuracy of Genomic Selection for Agricultural Economic Traits in Maize, Cattle, and Pig Populations. Front. Genet. 2019, 10, 189. [Google Scholar] [CrossRef] [PubMed]
Azevedo, C.F.; Nascimento, M.; Fontes, V.C.; Silva, F.F.E.; de Resende, M.D.V.; Cruz, C.D. GenomicLand: Software for Genome-Wide Association Studies and Genomic Prediction. Acta Sci. Agron. 2019, 41, e45361. [Google Scholar] [CrossRef]
Huang, H.; Ji, X.; Xia, F.; Huang, S.; Shang, X.; Chen, H.; Zhang, M.; Dahlgren, R.A.; Mei, K. Multivariate Adaptive Regression Splines for Estimating Riverine Constituent Concentrations. Hydrol. Process. 2020, 34, 1213–1227. [Google Scholar] [CrossRef]
Abdulelah Al-Sudani, Z.; Salih, S.Q.; Sharafati, A.; Yaseen, Z.M. Development of Multivariate Adaptive Regression Spline Integrated with Differential Evolution Model for Streamflow Simulation. J. Hydrol. 2019, 573, 1–12. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, NY, USA, 2009; ISBN 978-0-387-84857-0. [Google Scholar]
Milborrow, S.; Hastei, T.; Tibshirani, R.; Miller, A.; Lumley, T. Earth: Multivariate Adaptive Regression Splines. R Package Version 5.1.1. 2019. Available online: https://CRAN.R-project.org/package=earth (accessed on 11 March 2023).
Park, J.; Kim, J. Defining Heatwave Thresholds Using an Inductive Machine Learning Approach. PLoS ONE 2018, 13, e0206872. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023. [Google Scholar]
Covarrubias-Pazaran, G. Genome-Assisted Prediction of Quantitative Traits Using the R Package Sommer. PLoS ONE 2016, 11, e0156744. [Google Scholar] [CrossRef] [PubMed]
Cleveland, W.S.; Devlin, S.J. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. J. Am. Stat. Assoc. 1988, 83, 596–610. [Google Scholar] [CrossRef]
Vos, P.G.; Paulo, M.J.; Voorrips, R.E.; Visser, R.G.F.; van Eck, H.J.; van Eeuwijk, F.A. Evaluation of LD Decay and Various LD-Decay Estimators in Simulated and SNP-Array Data of Tetraploid Potato. Theor. Appl. Genet. 2017, 130, 123–135. [Google Scholar] [CrossRef]
Otyama, P.I.; Wilkey, A.; Kulkarni, R.; Assefa, T.; Chu, Y.; Clevenger, J.; O’Connor, D.J.; Wright, G.C.; Dezern, S.W.; MacDonald, G.E.; et al. Evaluation of Linkage Disequilibrium, Population Structure, and Genetic Diversity in the U.S. Peanut Mini Core Collection. BMC Genom. 2019, 20, 481. [Google Scholar] [CrossRef]
Jannink, J.-L.; Lorenz, A.J.; Iwata, H. Genomic Selection in Plant Breeding: From Theory to Practice. Brief. Funct. Genom. 2010, 9, 166–177. [Google Scholar] [CrossRef]
Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
McHugh, M.L. Interrater Reliability: The Kappa Statistic. Biochem. Medica 2012, 22, 276–282. [Google Scholar] [CrossRef]
Toro, M.A.; Varona, L. A Note on Mate Allocation for Dominance Handling in Genomic Selection. Genet. Sel. Evol. 2010, 42, 33. [Google Scholar] [CrossRef]
Henderson, C.R. Best Linear Unbiased Prediction of Nonadditive Genetic Merits in Noninbred Populations. J. Anim. Sci. 1985, 60, 111–117. [Google Scholar] [CrossRef]
Mackay, T.F. Q&A: Genetic Analysis of Quantitative Traits. J. Biol. 2009, 8, 23. [Google Scholar] [CrossRef]
Barbosa, I.D.P.; da Silva, M.J.; da Costa, W.G.; de Castro Sant’Anna, I.; Nascimento, M.; Cruz, C.D. Genome-Enabled Prediction through Machine Learning Methods Considering Different Levels of Trait Complexity. Crop Sci. 2021, 61, 1890–1902. [Google Scholar] [CrossRef]
Li, D.H.W.; Chen, W.; Li, S.; Lou, S. Estimation of Hourly Global Solar Radiation Using Multivariate Adaptive Regression Spline (MARS)—A Case Study of Hong Kong. Energy 2019, 186, 115857. [Google Scholar] [CrossRef]
Phillips, P.C. Epistasis—The Essential Role of Gene Interactions in the Structure and Evolution of Genetic Systems. Nat. Rev. Genet. 2008, 9, 855–867. [Google Scholar] [CrossRef]
Holland, J.B. Genetic Architecture of Complex Traits in Plants. Curr. Opin. Plant Biol. 2007, 10, 156–161. [Google Scholar] [CrossRef] [PubMed]
Hayes, B.J.; Pryce, J.; Chamberlain, A.J.; Bowman, P.J.; Goddard, M.E. Genetic Architecture of Complex Traits and Accuracy of Genomic Prediction: Coat Colour, Milk-Fat Percentage, and Type in Holstein Cattle as Contrasting Model Traits. PLOS Genet. 2010, 6, e1001139. [Google Scholar] [CrossRef] [PubMed]
Barreto, C.A.V.; das Graças Dias, K.O.; de Sousa, I.C.; Azevedo, C.F.; Nascimento, A.C.C.; Guimarães, L.J.M.; Guimarães, C.T.; Pastina, M.M.; Nascimento, M. Genomic Prediction in Multi-Environment Trials in Maize Using Statistical and Machine Learning Methods. Sci. Rep. 2024, 14, 1062. [Google Scholar] [CrossRef]
Mathew, B.; Léon, J.; Sannemann, W.; Sillanpää, M.J. Detection of Epistasis for Flowering Time Using Bayesian Multilocus Estimation in a Barley MAGIC Population. Genetics 2018, 208, 525–536. [Google Scholar] [CrossRef]
Nascimento, M.; Nascimento, A.C.C.; Azevedo, C.F.; de Oliveira, A.C.B.; Caixeta, E.T.; Jarquin, D. Enhancing Genomic Prediction with Stacking Ensemble Learning in Arabica Coffee. Front. Plant Sci. 2024, 15, 1373318. [Google Scholar] [CrossRef]
De Sousa, I.C.; Barreto, C.A.V.; Caixeta, E.T.; Nascimento, A.C.C.; Azevedo, C.F.; Alkimim, E.R.; Nascimento, M. The Trade-off between Density Marker Panels Size and Predictive Ability of Genomic Prediction for Agronomic Traits in Coffea Canephora. Euphytica 2024, 220, 46. [Google Scholar] [CrossRef]

Figure 1. Generalized R² (GRSq) of the Multivariate Adaptive Regression Splines models of degree 1 (MARS1) and degree 2 (MARS2) for scenarios with 8, 40, 80, and 120 QTLs.

Figure 2. Predictive ability of the models: Additive Effect Genomic Best Linear Unbiased Predictor (GBLUP-A), Additive and Dominant Effects Genomic Best Linear Unbiased Predictor (GBLUP-AD), Additive and Epistatic Effects Genomic Best Linear Unbiased Predictor (GBLUP-AE), and Multivariate Adaptive Regression Splines of degree 2 (MARS2), for scenarios with 8, 40, 80, and 120 QTLs controlling the trait and heritabilities of 0.3, 0.5, and 0.8.

Figure 3. Cohen’s Kappa coefficient for the selected top 10% of individuals under the models: Additive Effect Genomic Best Linear Unbiased Predictor (GBLUP-A), Additive and Dominant Effects Genomic Best Linear Unbiased Predictor (GBLUP-AD), Additive and Epistatic Effects Genomic Best Linear Unbiased Predictor (GBLUP-AE), and Multivariate Adaptive Regression Splines of degree 2 (MARS2), for scenarios combining the number of QTLs controlling the trait (8, 40, 80, and 120) and heritabilities of 0.3, 0.5, and 0.8.
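
The agreement statistic summarized in Figure 3 can be illustrated with a minimal Python sketch. It assumes that agreement is measured between two binary "selected / not selected" indicators, each obtained by keeping the top 10% of individuals under one ranking (for example, reference values versus predicted GEBVs). The function names and the toy data are hypothetical and are not taken from the authors' analysis code.

```python
from typing import List, Sequence


def top_fraction_mask(values: Sequence[float], fraction: float = 0.10) -> List[bool]:
    """Flag the individuals whose values fall in the top `fraction` of the ranking."""
    n = len(values)
    k = max(1, round(n * fraction))  # number of individuals to select
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    selected = set(order[:k])
    return [i in selected for i in range(n)]


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary classifications of the same individuals."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # proportion selected under the first ranking
    p_b = sum(b) / n  # proportion selected under the second ranking
    # Agreement expected by chance: both select, or both reject.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both indicators are constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (hypothetical): 20 individuals, so the top 10% contains 2 of them.
reference = [float(i) for i in range(20)]
predicted = list(reference)
predicted[18] = -1.0  # one of the two best individuals is ranked last by the predictor

kappa = cohen_kappa(top_fraction_mask(reference), top_fraction_mask(predicted))
```

In this toy case the two selections share one of their two individuals, and the chance correction in `p_expected` is what keeps the many jointly unselected individuals from inflating the agreement.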

Figure 4. LD decay plot generated with the simulated genome. The red lines represent the fitted spline.

Figure 5. Time required to fit the Additive Effect Genomic Best Linear Unbiased Predictor (GBLUP-A), Additive and Dominant Effects Genomic Best Linear Unbiased Predictor (GBLUP-AD), and Multivariate Adaptive Regression Splines of degree 2 (MARS2) models for combinations of the number of QTLs controlling the trait (8, 40, 80, and 120) and heritability (0.3, 0.5, and 0.8).

Table 1. Evaluated scenarios (S1 to S12) with the number of trait loci (QTLs) across different values of broad-sense heritability (h²).

Heritability (h²)	8 QTLs	40 QTLs	80 QTLs	120 QTLs
0.3	S1	S4	S7	S10
0.5	S2	S5	S8	S11
0.8	S3	S6	S9	S12
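
The scenario grid in Table 1 can be written down as a small lookup structure. The sketch below is illustrative only: the names `HERITABILITIES`, `QTL_COUNTS`, and `build_scenarios` are hypothetical, and the labels follow the table's ordering (heritability varies fastest within each number of QTLs).

```python
from itertools import product
from typing import Dict

HERITABILITIES = (0.3, 0.5, 0.8)  # broad-sense heritability levels
QTL_COUNTS = (8, 40, 80, 120)     # number of QTLs controlling the trait


def build_scenarios() -> Dict[str, Dict[str, float]]:
    """Map scenario labels S1..S12 to their (number of QTLs, heritability) pair."""
    scenarios = {}
    # product() varies the last factor fastest, matching the column-wise
    # numbering of Table 1 (S1-S3 for 8 QTLs, S4-S6 for 40 QTLs, and so on).
    for idx, (n_qtl, h2) in enumerate(product(QTL_COUNTS, HERITABILITIES), start=1):
        scenarios[f"S{idx}"] = {"n_qtl": n_qtl, "h2": h2}
    return scenarios


scenarios = build_scenarios()
```

A lookup such as `scenarios["S5"]` then returns the 40-QTL, h² = 0.5 cell of the table.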

Table 2. Number of markers selected by MARS for scenarios combining different numbers of QTLs (8, 40, 80, and 120) and heritabilities (0.3, 0.5, and 0.8).

Heritability (h²)	8 QTLs	40 QTLs	80 QTLs	120 QTLs
0.3	14	12	16	17
0.5	25	16	14	18
0.8	22	18	18	17

Table 3. Average linkage disequilibrium among the markers selected by the Additive Effect Genomic Best Linear Unbiased Predictor (GBLUP-A), Additive and Dominant Effects Genomic Best Linear Unbiased Predictor (GBLUP-AD), Additive and Epistatic Effects Genomic Best Linear Unbiased Predictor (GBLUP-AE), and Multivariate Adaptive Regression Splines of degree 2 (MARS2) for the scenarios with 8, 40, 80, and 120 QTLs and heritabilities of 0.3, 0.5, and 0.8.

Number of QTLs	h²	MARS2	GBLUP-A	GBLUP-AD	GBLUP-AE
8	0.3	0.569	0.782	0.881	0.782
8	0.5	0.534	0.831	0.831	0.831
8	0.8	0.625	0.804	0.707	0.804
40	0.3	0.636	0.612	0.919	0.727
40	0.5	0.621	0.945	0.760	0.605
40	0.8	0.587	0.762	0.930	0.605
80	0.3	0.569	0.859	0.817	0.699
80	0.5	0.468	0.762	0.628	0.533
80	0.8	0.601	0.929	0.961	0.606
120	0.3	0.624	0.896	0.751	0.527
120	0.5	0.523	0.866	0.543	0.639
120	0.8	0.492	0.868	0.837	0.621

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
