An Improved Expectation–Maximization Bayesian Algorithm for GWAS

Zhang, Ganwen; Zhao, Jianini; Wang, Jieru; Lin, Guo; Li, Lin; Ban, Fengfei; Zhu, Meiting; Wen, Yangjun; Zhang, Jin

doi:10.3390/math12131944


by Ganwen Zhang †, Jianini Zhao †, Jieru Wang, Guo Lin, Lin Li, Fengfei Ban, Meiting Zhu, Yangjun Wen * and Jin Zhang *

College of Science, Nanjing Agricultural University, Nanjing 210095, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2024, 12(13), 1944; https://doi.org/10.3390/math12131944

Submission received: 7 May 2024 / Revised: 2 June 2024 / Accepted: 20 June 2024 / Published: 23 June 2024

(This article belongs to the Section Mathematical Biology)


Abstract:

Genome-wide association studies (GWASs) are flexible and comprehensive tools for identifying single nucleotide polymorphisms (SNPs) associated with complex traits or diseases. Whole-genome Bayesian models are an effective way of incorporating important prior information into modeling, and Bayesian methods have been widely used in association analysis. However, Bayesian analysis is often not feasible because of the high-throughput genotypes and large sample sizes involved. In this study, we propose a new Bayesian algorithm under the mixed linear model framework: the expectation and maximization BayesB Improved algorithm (emBBI). The emBBI algorithm first corrects for polygenic and environmental noise and reduces the dimension of the marker set; it then estimates and tests marker effects using emBayesB and the LOD test, respectively. We conducted two simulation experiments and analyzed a real dataset related to flowering time in Arabidopsis to validate the new algorithm. In the simulation studies, the emBBI algorithm was more flexible and accurate than established methods, and it performed well under complex genetic backgrounds. The analysis of the real Arabidopsis dataset further illustrates the advantages of the emBBI algorithm for GWAS by detecting known genes. Furthermore, 12 candidate genes were identified in the neighborhood of the significant flowering-related quantitative trait nucleotides (QTNs) in Arabidopsis. We also performed enrichment analysis and tissue expression analysis of the candidate genes, which will help to better understand the genetic basis of flowering-related traits in Arabidopsis.

Keywords:

GWAS; Bayesian method; mixed linear model; candidate gene

MSC:

92D10; 62F15

1. Introduction

Genome-wide association studies (GWASs) are a powerful tool for identifying genetic variations associated with complex diseases and important traits [1]. Over the past few decades, GWAS has been successfully applied in genetic analysis. Hundreds of thousands of genes have been identified to be associated with target traits in humans [1,2,3,4], animals [5,6,7,8], and plants [9,10,11,12,13]. This is essential for dissecting complex diseases as well as for innovation in genetics and germplasm.

The mixed linear model (MLM) approach [14,15], which accounts for both population structure (Q) and kinship (K), was introduced into GWASs and significantly increased the power of quantitative trait nucleotide (QTN) detection. Building on this, efficient mixed-model association (EMMA) [16] and GEMMA [17] treat the polygenic effect as a random effect when fitting the mixed model, which has become the mainstream approach in GWAS. Further methods continue to emerge, such as EMMA eXpedited (EMMAX) [18], FaST-LMM [19], SUPER [20], FarmCPU [21], and FastGWA [22]. Owing to their ability to dissect genetic variants and their computational speed, all these methods have been successfully applied in genomic analysis. However, most of them perform a one-dimensional marker scan, testing one marker at a time, which requires multiple-testing correction of the significance threshold. The widely used Bonferroni correction is often too conservative, which may hinder the detection of important loci in GWASs [13,23].

For a complex trait governed by multiple QTNs, it is reasonable to consider the effects of multiple QTNs in the model as this aligns with the internal genetic mechanism of these traits. The whole-genome Bayesian models are an effective way of incorporating important prior information into modeling. Bayesian methods [24,25,26,27] are widely used in association analysis. Bayesian algorithms based on MCMC correctly estimate the locations and genetic effects of QTNs when the dimensionality and data size are small. Additionally, several Bayesian methods can also perform both GWASs and genome selection, such as emBayesB (emBB) [27], expectation and maximization BayesA (emBA) [28], expectation and maximization BayesLasso (emBL) [29,30,31], expectation and maximization Gaussian maximum likelihood (emML) [29,32], expectation–maximization ridge regression (emRR) [33], and least absolute shrinkage and selection operator (LASSO) [34]. The current biotechnology has generated abundant high-throughput molecular markers and high-dimensional sample sizes, promoting genetic study in complex trait dissection. However, Bayesian analysis is often not feasible due to the large-scale data involved.

In this study, to improve the accuracy of GWASs in identifying important QTNs and mining genes, we combine mixed linear models with the Bayesian concept and propose a new flexible multi-stage method, namely, the emBBI (expectation and maximization BayesB Improved) algorithm, which takes into account the population structure and polygenic background in the mixed linear model [23,35]. In the first stage, the emBBI algorithm whitens the polygenic and environmental noise. Next, a variable reduction stage removes the unassociated loci from the model, and the emBayesB method is then applied to estimate the QTN effects. In the last stage, the LOD test is used to assess the significance of potential QTNs. A series of simulation experiments and a real dataset analysis were carried out to illustrate the advantages of this new method. For comparison, the existing methods emBB, emRR, emML, emBL, and emBA were used to analyze the same datasets. The results of both the simulation experiments and the real dataset analysis demonstrate the advantages of our approach in QTN detection and testing.

2. Materials and Methods

2.1. Genetic Model

Consider the following mixed linear model relating phenotypes $y$ to genotypes $Z$. Assuming n individuals and p genetic markers, with p >> n, it can be described as:

$$y = W α + Z γ + u + ε$$

(1)

where $y = (y_{1}, \dots, y_{n})^{T}$ is an $n \times 1$ vector of phenotypes measured on n individuals, with $y_{i}$ the phenotype of the ith individual; $α$ is a $c \times 1$ vector of fixed effects, including the intercept, population structure effects, and so on; $W$ is the corresponding $n \times c$ design matrix for $α$; $Z$ is an $n \times p$ matrix of marker genotypes; $γ$ is a $p \times 1$ vector of random marker effects, each distributed as $N (0, σ_{γ}^{2})$, where $σ_{γ}^{2}$ is the variance of the marker effects; $u \sim MVN (0, σ_{g}^{2} K)$ is an $n \times 1$ random vector of polygenic effects, where $σ_{g}^{2}$ is the variance of the polygenic background and $K$ is a known $n \times n$ genetic relationship matrix between individuals; and $ε \sim MVN (0, σ^{2} I_{n})$ is an $n \times 1$ vector of residual errors, where $σ^{2}$ is the variance of the residual error and $I_{n}$ is the $n \times n$ identity matrix. MVN denotes the multivariate normal distribution.

As $γ$ is treated as a random effect, the variance of $y$ in Model (1) is as follows:

$$\mathrm{var} (y) = σ_{γ}^{2} Z Z^{T} + σ_{g}^{2} K + σ^{2} I_{n} = σ^{2} (λ_{γ} Z Z^{T} + λ_{g} K + I_{n})$$

(2)

where $λ_{γ} = σ_{γ}^{2} / σ^{2}$ is the ratio of the genetic variance to the variance of the residual error, and $λ_{g} = σ_{g}^{2} / σ^{2}$ is the ratio of the variance of the polygenic background to the variance of the residual error.

2.2. Genome-Wide Association Analysis of the emBBI Algorithm

The emBBI algorithm is a multi-stage approach for GWAS, which simultaneously estimates the regression effect and performs hypothesis testing. We describe it as the following four stages:

2.2.1. Polygenic and Residual Noise Whitening Stage

First, following Wen et al. [23], we whiten the polygenic and residual noise. Estimating the two ratios $λ_{γ}$ and $λ_{g}$ for every marker imposes a heavy computational burden in GWAS. The polygenic variance is always larger than zero, whereas most markers are not associated with the trait, so we assume $λ_{γ} = 0$ when estimating $λ_{g}$. That is, we obtain ${\hat{λ}}_{g}$ from the reduced form of Model (1), which drops $Z γ$ and retains only the polygenic background, and we replace $λ_{g}$ in (2) with ${\hat{λ}}_{g}$ [23,36]. This avoids the time-consuming re-estimation of $λ_{g}$ for each marker scanned. Thus,

$$\mathrm{var} (y) = σ^{2} (λ_{γ} Z Z^{T} + {\hat{λ}}_{g} K + I_{n}) = σ^{2} (λ_{γ} Z Z^{T} + B)$$

(3)

where $B = {\hat{λ}}_{g} K + I_{n}$ is a symmetric positive definite matrix with the eigen (or spectral) decomposition

$$B = Q Λ Q^{T} = (Q Λ^{\frac{1}{2}} Q^{T}) (Q Λ^{\frac{1}{2}} Q^{T})$$

(4)

where $Q$ is orthogonal and $Λ$ is a diagonal matrix of positive eigenvalues. Letting $C = Q Λ^{- \frac{1}{2}} Q^{T}$, Model (1) is transformed into the following:

$$y_{c} = W_{c} α + Z_{c} γ + ε_{c}$$

(5)

where $y_{c} = C y$, $W_{c} = C W$, $Z_{c} = C Z$, and $ε_{c} = C u + C ε \sim MVN (0, σ^{2} I_{n})$ [13,36].
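The transformation in (4) and (5) can be illustrated with a minimal NumPy sketch. The toy kinship matrix, the value 0.8 used for the estimated ratio, and the function name `whitening_matrix` are illustrative assumptions and are not taken from the paper; the sketch only checks that C B C^T equals the identity, which is the property the whitening stage relies on.

```python
import numpy as np

def whitening_matrix(K, lambda_g_hat):
    """Return C = Q Lambda^{-1/2} Q^T for B = lambda_g_hat * K + I_n."""
    n = K.shape[0]
    B = lambda_g_hat * K + np.eye(n)
    eigvals, Q = np.linalg.eigh(B)        # B is symmetric, so eigh applies
    # Dividing by sqrt(eigvals) scales column j of Q by eigvals[j]^{-1/2}.
    return (Q / np.sqrt(eigvals)) @ Q.T

# Toy data (hypothetical sizes and values, for illustration only).
rng = np.random.default_rng(0)
n, p = 6, 4
G = rng.standard_normal((n, 10))
K = G @ G.T / 10                          # toy positive semidefinite kinship
Z = rng.integers(0, 3, size=(n, p)).astype(float)
y = rng.standard_normal(n)

lambda_g_hat = 0.8                        # assumed estimate of the variance ratio
C = whitening_matrix(K, lambda_g_hat)
y_c, Z_c = C @ y, C @ Z                   # transformed phenotype and genotypes

B = lambda_g_hat * K + np.eye(n)
print(np.allclose(C @ B @ C.T, np.eye(n)))
```

Because C is built once from the estimated ratio, the same matrix is reused for the phenotype, the fixed-effect design matrix, and every marker column.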

2.2.2. Variable Reduction Stage

Advanced biotechnology now generates millions of genetic markers, and it is impractical to analyze so many markers in a Bayesian model. Meanwhile, several studies have shown that most quantitative traits are controlled by a small proportion of genes [35,37], which means that most SNPs are not associated with the target trait; dimension reduction is therefore critical. In the variable reduction stage, loci showing no association were removed according to ordinary least squares under Model (5), and the top m most promising markers were selected to construct the reduced model for the subsequent EM stage. In this study, the top 200 loci were retained for the next stage [38,39].
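One possible way to carry out such a screening step is sketched below. The text does not state the ranking criterion, so ranking markers by the absolute t-statistic from single-marker least-squares fits is an assumption, as are the function name `top_m_markers` and the toy data (which are treated as already whitened).

```python
import numpy as np

def top_m_markers(y_c, W_c, Z_c, m):
    """Rank markers by |t| from single-marker OLS fits of y_c on [W_c, z_j]."""
    n, c = W_c.shape
    # Project the fixed effects out of y_c and every marker column once.
    P = np.eye(n) - W_c @ np.linalg.pinv(W_c)
    y_r, Z_r = P @ y_c, P @ Z_c
    zz = np.einsum("ij,ij->j", Z_r, Z_r)      # z_j' z_j for each marker
    beta = (Z_r.T @ y_r) / zz                 # OLS slope for each marker
    rss = y_r @ y_r - beta**2 * zz            # residual sum of squares
    se = np.sqrt(rss / (n - c - 1) / zz)      # standard error of each slope
    t_abs = np.abs(beta / se)
    return np.argsort(t_abs)[::-1][:m]        # indices of the m largest |t|

# Toy example: marker 7 is given a large effect (hypothetical values).
rng = np.random.default_rng(42)
n, p = 200, 50
Z_c = rng.binomial(2, 0.3, size=(n, p)).astype(float)
W_c = np.ones((n, 1))                         # intercept only
y_c = 10.0 + 2.0 * Z_c[:, 7] + rng.standard_normal(n)

selected = top_m_markers(y_c, W_c, Z_c, m=5)
print(selected)
```

In the study itself m is 200; the small value here only keeps the toy example readable.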

2.2.3. EM Algorithm Stage

In the emBayesB method [27], we focus on the following linear model:

$$y_{c} = Z_{c}^{*} γ + ε_{c}$$

(6)

where $y_{c}$ is the $n \times 1$ vector of phenotypic values after polygenic background correction, as in Model (5); $γ$ is an $m \times 1$ random vector of SNP effects, where m is the number of SNPs retained after the variable reduction stage; $Z_{c}^{*}$ is the corresponding corrected $n \times m$ genotype matrix for $γ$; and $ε_{c}$ is an $n \times 1$ vector of residuals, assumed to be the same as in Model (5). Hence, $y_{c} | γ \sim N (Z_{c}^{*} γ, σ^{2} I_{n})$.

First, we consider the prior distribution of the SNP effect $γ$. An unknown indicator variable $h_{j}$ is defined to indicate whether the jth SNP is in linkage disequilibrium (LD) with a QTN. If $h_{j} = 1$ (the jth SNP is in LD with a QTN), the SNP effect $γ_{j}$ is assumed to follow a double-exponential (DE) distribution; otherwise ($h_{j} = 0$), $γ_{j}$ is assumed to follow a Dirac delta (DD) distribution. Hence, the conditional distribution of $γ_{j}$ given $h_{j}$ is

$$p (γ_{j} | h_{j}) = \begin{cases} 0.5 λ \exp (- λ | γ_{j} |), & h_{j} = 1 \\ δ (γ_{j}), & h_{j} = 0 \end{cases}$$

(7)

We assume that a proportion $τ$ of the SNPs are in LD with at least one QTN. Assuming independence of the m SNP effects, the joint prior for $h$ and $γ$ is as follows:

$$p (h, γ) = \prod_{j = 1}^{m} p (h_{j}) p (γ_{j} | h_{j}) = \prod_{j = 1}^{m} {[0.5 τ λ \exp (- λ |γ_{j}|)]}^{h_{j}} {[(1 - τ) δ (γ_{j})]}^{1 - h_{j}}$$

(8)

Up to an additive constant, the log posterior is given by:

$$\log p (h, γ | y_{c}) = \sum_{j = 1}^{m} h_{j} \log (0.5 τ λ) + \sum_{j = 1}^{m} (1 - h_{j}) \log (1 - τ) - λ \sum_{j = 1}^{m} h_{j} |γ_{j}| + \sum_{j = 1}^{m} (1 - h_{j}) \log (δ (γ_{j})) - \frac{n}{2} \log σ^{2} - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} {\left(y_{c, i} - \sum_{j = 1}^{m} Z_{c, i j}^{*} γ_{j}\right)}^{2} + \text{const}$$

(9)

where $y_{c, i}$ is the ith element of $y_{c}$ and $Z_{c, i j}^{*}$ is the (i, j)th element of $Z_{c}^{*}$.

The indicator variable $h_{j}$ is treated as missing data, and the log posterior is maximized with an EM algorithm. In the E-step, the posterior probability that SNP j is in LD with at least one QTN at iteration k, $τ_{j}^{k}$, has an analytical expression derived in [27]:

$$τ_{j}^{k} = E [h_{j} | {\hat{γ}}^{k}, y_{c}]$$

(10)

For the M-step, we fix $τ_{j}^{k}$ and maximize $E_{h} [\log p (h, γ | y_{c})]$ with respect to the parameters $γ_{j}$, $τ$, $λ$, and $σ^{2}$. Setting the derivatives to zero and rearranging, we obtain the following expressions at iteration k:

$${\hat{γ}}_{j} = τ_{j}^{k} {DE}_{mode} + (1 - τ_{j}^{k}) {DD}_{mode}$$

(11)

$$\hat{τ} = \frac{1}{m} 1^{T} τ^{k}$$

(12)

$$\hat{λ} = \frac{1^{T} τ^{k}}{{(τ^{k})}^{T} |\hat{γ}|}$$

(13)

$${\hat{σ}}^{2} = \frac{1}{n} {(y_{c} - Z_{c}^{*} \hat{γ})}^{T} (y_{c} - Z_{c}^{*} \hat{γ})$$

(14)

where $τ^{k}$ is the vector of posterior probabilities at iteration k, $1$ is a vector of ones, $|\hat{γ}|$ is the vector of absolute values of the estimated effects, m is the number of SNPs in Model (6), ${DE}_{mode}$ is the posterior mode of the DE distribution, and ${DD}_{mode}$ is the posterior mode of the DD distribution.

The EM algorithm for emBayesB is summarized as follows:

Provide the initial values for the parameters.
E-step: for each SNP, calculate $γ_{j}$ and the posterior probability $τ_{j}^{k}$ as shown in (10).
M-step: update $\hat{γ}$, $\hat{τ}$, $\hat{λ}$, and ${\hat{σ}}^{2}$ according to Equations (11)–(14).

The E-step and M-step are repeated until the convergence criterion is satisfied. To limit the computational burden, we set the maximum number of iterations of the EM algorithm to 200 [29] in this study.
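The hyperparameter updates in (12)–(14) are simple enough to sketch directly. The following is a minimal illustration, not the bWGR or emBayesB implementation: the function name and the toy inputs are hypothetical, the posterior probabilities and effect estimates are taken as given rather than computed by an E-step, and averaging over the m SNPs of the reduced model is an assumption about how (12) applies after variable reduction.

```python
def m_step_hyperparameters(tau_k, gamma_hat, y_c, Z_star):
    """Update tau, lambda and sigma^2 from the current tau^k and gamma-hat."""
    m, n = len(tau_k), len(y_c)
    # (12): mean posterior probability over the SNPs in the model.
    tau_hat = sum(tau_k) / m
    # (13): sum of tau^k divided by the tau-weighted sum of |gamma-hat|.
    # The denominator is zero if every estimated effect is zero.
    lambda_hat = sum(tau_k) / sum(t * abs(g) for t, g in zip(tau_k, gamma_hat))
    # (14): residual sum of squares divided by n.
    resid = [
        y_c[i] - sum(Z_star[i][j] * gamma_hat[j] for j in range(m))
        for i in range(n)
    ]
    sigma2_hat = sum(r * r for r in resid) / n
    return tau_hat, lambda_hat, sigma2_hat

# Toy inputs (hypothetical values chosen so the arithmetic is easy to follow).
tau_k = [0.9, 0.1]
gamma_hat = [2.0, 0.0]
Z_star = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_c = [2.5, 0.0, 1.5]

tau_hat, lambda_hat, sigma2_hat = m_step_hyperparameters(tau_k, gamma_hat, y_c, Z_star)
print(tau_hat, lambda_hat, sigma2_hat)
```

In a full EM loop these updates would alternate with the E-step in (10) and the effect update in (11) until convergence or until the iteration cap is reached.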

2.2.4. Likelihood Ratio (LR) Test

Based on the estimates of the SNP effects $γ_{j}$ from the EM algorithm, all markers with $| γ_{j} | < 10^{- 4}$ were regarded as not associated with the quantitative trait; the remaining markers were regarded as potentially associated with the trait. To test the null hypothesis $H_{0} : γ_{(j)} = 0$ (that is, no QTN at the jth retained marker), the LR test was conducted as follows:

$${LR}_{j} = - 2 [L (θ_{- j}) - L (θ)]$$

(15)

where $θ = {(γ_{(1)}, \dots, γ_{(j)}, \dots, γ_{(m)}, σ^{2})}^{T}$ is the full parameter vector, $θ_{- j} = {(γ_{(1)}, \dots, γ_{(j - 1)}, γ_{(j + 1)}, \dots, γ_{(m)}, σ^{2})}^{T}$ is the parameter vector that excludes $γ_{(j)}$, and $L (\cdot)$ is the log-likelihood function. As pointed out by Kao et al. [40], selecting critical values for significant QTNs becomes complicated when multiple QTNs are tested. For simplicity, we used the logarithm of odds (LOD) score as the criterion in the real data analysis [41,42], with $LOD = LR / (2 \ln 10) \approx LR / 4.61$. The critical value for significance was set to LOD = 2.5, which corresponds to a p-value of 0.00069, obtained from $p\text{-value} = \Pr (χ^{2} (1) > 2.5 \times 4.61) = 0.00069$, because the LR statistic follows a Chi-square distribution with one degree of freedom under the null hypothesis.
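The conversion from a LOD score to a p-value can be checked with a few lines of standard-library Python. This sketch uses the relation LR = 2 ln(10) × LOD, which matches the factor 4.61 in the text, and the identity that the upper tail of a Chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)); the function name is illustrative.

```python
import math

def lod_to_pvalue(lod):
    """p-value of a LOD score under a chi-square(1) null distribution."""
    lr = lod * 2.0 * math.log(10.0)          # LR = 2 ln(10) * LOD, about 4.61 * LOD
    # For one degree of freedom, Pr(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lr / 2.0))

p_value = lod_to_pvalue(2.5)
print(round(p_value, 5))
```

For LOD = 2.5 this reproduces the threshold p-value of about 0.00069 quoted above.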

2.3. Comparison Algorithm

In this study, we compare several classical Bayesian methods to demonstrate the effectiveness and superiority of emBBI.

Expectation and maximization BayesB (emBB [27]): This method assumes that a proportion of SNPs are in LD with at least one QTN. The SNP effect follows the DE distribution if the SNP is in LD with a QTN and the DD distribution otherwise. Its computational speed is improved compared to MCMC-based BayesB.

Expectation–maximization Bayesian ridge regression (emRR [33]): This method assumes that all regression coefficients have an equal variance. It obtains the posterior distribution of the parameters by automatically introducing a regularization term, thereby avoiding overfitting in large-scale likelihood estimation.

Expectation–Maximization Gaussian Maximum Likelihood (emML [29]): This is a classical Bayesian estimation algorithm that estimates the parameters of interest by maximizing the likelihood. It is a straightforward and stable approach under the Bayesian framework, with excellent estimation properties as well [32].

Expectation and Maximization Bayesian Lasso (emBL [29]): This method assumes that the SNP effect follows a double-exponential prior distribution. It constructs a full Bayesian analysis, providing credible intervals for the estimates, and the value of λ can be chosen by marginal maximum likelihood or hyperprior methods. It is a flexible method [30,31].

Expectation Maximization BayesA (emBA [28]): This method assumes that all regression coefficients (SNP effects) follow a normal distribution, with the variance parameters further assigned a Chi-square prior. The hyperparameters are directly related to genetic structure, and compared to the MCMC-based version, the accuracy and computational efficiency of emBA are improved.

All Bayesian algorithms were implemented by using the R program package bWGR (version 2.2.9, http://github.com/cran/bWGR (accessed on 5 May 2024)).

2.4. Experimental Materials

2.4.1. Simulation Datasets

In this study, we performed Monte Carlo simulation experiments to verify the advantages of the new algorithm. The simulation datasets were generated from the mixed linear model, with genotypes simulated under Hardy–Weinberg equilibrium and minor allele frequencies (MAFs) in the interval [0.1, 0.5]. Each simulated dataset contained 2000 individuals and 10,000 genetic variants (SNPs). Both the population mean and the residuals were set to 10.0.
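Genotypes of this kind can be generated as in the following sketch. Under Hardy–Weinberg equilibrium, the minor-allele count of each individual at a SNP follows a Binomial(2, MAF) distribution. Drawing the MAFs uniformly from [0.1, 0.5] is an assumption (the text only gives the interval), and the dimensions are reduced from the 2000 × 10,000 design to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 200                          # reduced sizes for illustration only

# Assumption: MAFs are drawn uniformly from the stated interval.
maf = rng.uniform(0.1, 0.5, size=p)

# Under HWE the genotype (minor-allele count 0, 1 or 2) is Binomial(2, maf).
Z = rng.binomial(2, maf, size=(n, p))

# Empirical allele frequencies should be close to the simulated MAFs.
empirical_freq = Z.mean(axis=0) / 2.0
print(np.abs(empirical_freq - maf).max())
```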

In this study, we conducted two simulation experiments. The first simulation involved a single fixed-position QTN located at the 98th marker with 0.1 heritability. The purpose of this experiment is to illustrate the estimation accuracy for a QTN at a fixed position. For the second simulation experiment, we randomly selected 50 QTNs from the 10,000 genetic variants. Each QTN had an MAF greater than 0.3, and the proportion of phenotypic variance explained by all QTNs was 50%. We repeated each simulation experiment 100 times. Because genetic structure differs among species and populations, for each simulation we further examined fifteen scenarios, combining three background noise levels with five sample sizes: polygenic backgrounds of two, five, and ten times, and sample sizes of 500, 1000, 2000, 3000, and 5000.

2.4.2. Arabidopsis Datasets

The well-known Arabidopsis dataset of Atwell et al. (2010) [43] consists of 199 Arabidopsis inbred lines, with 216,130 SNPs and 107 traits. Among these traits, three traits related to flowering time were used to validate the performance of the various methods in this study: (1) SDV: days to vernalized flowering under short days of sunlight; (2) FT10: days to flowering at 10 °C; and (3) FT22: days to flowering at 22 °C. We performed quality control using the Plink tool with the options "--bfile data --maf 0.01 --make-bed --out output" to remove SNPs with MAF < 0.01 from the Arabidopsis dataset. After filtering, 215,961 SNPs were used for the real data analysis. Following quality control and the removal of missing data, 159, 194, and 193 lines remained for SDV, FT10, and FT22, respectively. The densities of all markers and of MAF are shown in Figure 1A,B, respectively.

2.5. Candidate Gene Identification and Enrichment Analysis

All the identified significant putative QTNs of different methods were used to mine the known genes or candidate genes using the Arabidopsis Information Resource (TAIR, https://www.arabidopsis.org, accessed on 10 March 2024). According to the LD information, the neighboring regions within 20 kb of all significant loci were used to search for the known genes.
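The 20 kb neighborhood search amounts to an interval-overlap query, sketched below. The gene names and coordinates are hypothetical placeholders, not TAIR annotations, and treating a gene as a hit when any part of it overlaps the window is an assumption about how "within 20 kb" was applied.

```python
def genes_near(snp_chr, snp_pos, genes, window=20_000):
    """Return names of genes overlapping [snp_pos - window, snp_pos + window]."""
    lo, hi = snp_pos - window, snp_pos + window
    return [
        name
        for name, chrom, start, end in genes
        if chrom == snp_chr and start <= hi and end >= lo
    ]

# Hypothetical annotation records: (name, chromosome, start, end).
genes = [
    ("geneA", 2, 95_000, 98_000),     # ends 2 kb upstream of the SNP
    ("geneB", 2, 119_500, 125_000),   # starts 19.5 kb downstream of the SNP
    ("geneC", 2, 130_000, 133_000),   # 30 kb away, outside the window
    ("geneD", 5, 99_000, 101_000),    # same coordinates, different chromosome
]

hits = genes_near(snp_chr=2, snp_pos=100_000, genes=genes)
print(hits)
```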

To gain insight into the genetic basis, we performed gene ontology (GO) enrichment analysis based on all the genes in the neighborhood of the significant loci. Genes enriched in the significant pathways were considered candidate genes. This process provides more information on biological functions, pathways, or cellular localizations [44]. The online tool agriGO (http://systemsbiology.cau.edu.cn/agriGOv2/#, accessed on 15 March 2023) [45] was used to perform GO enrichment analysis concerning the biological process (BP), molecular function (MF), and cellular component (CC) categories for the candidate genes, and Fisher’s exact test (p-value < 0.05) was used to select enriched GO terms. The R package “pheatmap” (version 1.0.12) was used to create a heatmap from the results of the GO enrichment analysis for the candidate genes.
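For a single GO term, a one-sided Fisher's exact test is equivalent to the upper tail of a hypergeometric distribution, which can be computed exactly with the standard library. The sketch below shows that calculation only; whether agriGO applies a one-sided or two-sided test, and how it corrects for multiple terms, is not stated in the text, and the counts used are hypothetical.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """Pr(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 10 annotated genes in the background, 5 with the term,
# 5 candidate genes, all 5 of which carry the term.
p_enrich = enrichment_pvalue(k=5, n=5, K=5, N=10)
print(p_enrich)
```

A term would be retained when this p-value falls below the chosen cutoff, 0.05 in this study.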

2.6. Tissue-Specific Expression Analysis

In this study, the ATH1 database (http://bar.utoronto.ca/efp, accessed on 17 April 2023) was applied to display the expression levels of candidate genes in various tissues or organs. The obtained results of the expression level are based on the GCOS expression signal (from Affymetrix GeneChip). To visualize the expression of candidate genes in various tissues or organs, we used the “pheatmap” package (version 1.0.12) in the R program to draw heatmaps to show the expression information.

3. Results

3.1. Experimental Results of Simulated Data

We first compared the model estimation evaluation of emBBI with established methods, including emBB, emRR, emML, emBL, and emBA through two Monte Carlo simulation experiments. Each simulation dataset contained 2000 individuals with 10,000 genetic variables. For these two simulation experiments, we considered a QTN with a fixed position and 50 QTNs with random positions, respectively.

For simulation experiment I, a QTN at a fixed position (the 98th marker) with 0.1 heritability was simulated. The power of all the methods to detect the QTN was almost 100%, which indicates that emBBI and the other methods easily capture the genetic variant. The mean squared error (MSE) between the true effect $γ$ and the estimate $\hat{γ}$ was adopted to evaluate the accuracy of each method, calculated as follows:

$$MSE = \frac{1}{m k} \sum_{i = 1}^{m} \sum_{j = 1}^{k} {(γ_{i j} - {\hat{γ}}_{i j})}^{2}$$

(16)

where $γ_{i j}$ $(i = 1, 2, \dots, m; j = 1, 2, \dots, k)$ is the true effect of the ith SNP in the jth repetition, ${\hat{γ}}_{i j}$ is the corresponding estimate, $m$ is the number of SNPs, and $k$ is the number of repetitions. A smaller MSE indicates higher accuracy.
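As a small worked example of (16), the following sketch computes the MSE for a toy set of two SNPs and two repetitions. The values are made up for illustration, and the nested-list layout (rows are SNPs, columns are repetitions) is an assumption about how the effects are stored.

```python
def mse(true_effects, estimates):
    """Mean squared error over m SNPs (rows) and k repetitions (columns)."""
    m, k = len(true_effects), len(true_effects[0])
    total = sum(
        (true_effects[i][j] - estimates[i][j]) ** 2
        for i in range(m)
        for j in range(k)
    )
    return total / (m * k)

# Toy values: SNP 1 has a true effect of 1.0, SNP 2 has no effect.
true_effects = [[1.0, 1.0], [0.0, 0.0]]
estimates = [[0.5, 1.5], [0.0, 0.2]]

# Squared errors are 0.25, 0.25, 0.0 and 0.04, so the MSE is 0.54 / 4.
print(mse(true_effects, estimates))
```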

As shown in Figure 2, the MSE of emBBI (the red bar) is much lower than that of the other methods. For the sample size of 500, the SNP effect estimation accuracies of emBBI and emBB are much better than those of the other methods, followed by emBL, emBA, emML, and emRR. As the sample size increases, the accuracies of all the other algorithms improve, except for emRR, whose assumption that all regression coefficients have an equal variance might impair its accuracy. Interestingly, although emBBI and emBB have much higher accuracy than the other algorithms, the differences in MSE increase markedly as the polygenic background or the sample size increases. Additionally, the accuracy of all the algorithms decreases as the background noise increases, which implies that the background noise has a great influence on the accuracy of effect estimation. This conclusion is consistent with our understanding. Overall, Figure 2 illustrates the clear advantage of emBBI in effect estimation.

For simulation experiment II, 50 random-position QTNs are associated with the target phenotype. The total phenotypic variance explained (PVE) is 50%, and the average PVE for each QTN is approximately 1%, which indicates that almost all QTNs have a minor effect. Nevertheless, the power of all the methods to detect QTNs is greater than 90%, slightly lower than in simulation I, and the accuracy of QTN effect estimation shows a similar trend. Here, −log(p-value) is used to assess the significance of QTNs for the different methods. The results for sample sizes of 3000 and 5000 are not shown because the likelihood ratio test imposes a huge computational burden. Most methods failed to generate any meaningful results, and all the estimated effects were zero.

Figure S1 shows the violin plots of −log(p-value) after the likelihood ratio test for sample sizes of 500, 1000, and 2000. The results show that, with multiple QTNs, emBBI and emBB yield significantly smaller p-values from the LOD test than the other methods, whereas the other four methods perform similarly to each other. Notably, the emBBI test results were more significant than those of emBB in most situations. For all methods, the p-values decreased as the sample size increased or the genetic background decreased.

In addition, we compared the computation time of the simulation experiments under 100 repetitions. Although emBBI is more accurate and significant, it does not show a large improvement in computational speed compared to the emBB algorithm.

3.2. Results of Real Data Analysis

To further validate the emBBI algorithm, three flowering-related traits, SDV, FT10, and FT22, were analyzed in this study. After removing the missing data, 159, 194, and 193 lines remained for SDV, FT10, and FT22, respectively. The emBBI method and the other classical methods were applied to examine whether there were significant genetic variants associated with the flowering-related Arabidopsis traits.

3.2.1. Analysis of Phenotypic Differences in Arabidopsis Data

The variation in phenotypic data of the three traits is shown in Figure 3 by box plots, histograms, and distribution plots. The mean values of SDV, FT10, and FT22 are 64.86 days, 63.97 days, and 74.72 days (Table 1), respectively. The average of the FT22 trait is smaller than FT10. The higher the temperature, the shorter the flowering phase is. The phenotypic datasets of SDV, FT10, and FT22 are distributed between 27 and 100 days, 40 and 80 days, and 23 and 130 days, respectively. The FT10 trait had the smallest variance, which means that the phenotypic datasets of FT10 were all clustered around the mean. The SD (71.75) and CV (0.96) of FT22 were larger than those of SDV (SD: 35.07 and CV: 0.54) and FT10 (SD: 17.82 and CV: 0.28). Detailed descriptive statistics for the three phenotypes, including mean, SD, CV, minimum, maximum, and range, are shown in Table 1. All the phenotypes approximatively follow the normal distribution according to the Kolmogorov–Smirnov test. The results of descriptive statistics indicate that there might be a different genetic basis for days to flowering traits between days of sunlight and temperature. It can be inferred that the real datasets are suitable for GWASs.
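The coefficients of variation quoted above follow directly from CV = SD / mean. The short check below uses only the means and SDs given in the text; the dictionary layout is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation: standard deviation divided by the mean."""
    return sd / mean

# (mean, SD) pairs as quoted in the text for the three traits.
table1 = {"SDV": (64.86, 35.07), "FT10": (63.97, 17.82), "FT22": (74.72, 71.75)}

cv = {
    trait: round(coefficient_of_variation(sd, mean), 2)
    for trait, (mean, sd) in table1.items()
}
print(cv)
```

The rounded values agree with the CVs of 0.54, 0.28, and 0.96 reported for SDV, FT10, and FT22.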

3.2.2. GWASs and Known Genes of QTNs

In total, 199 inbred lines with 215,961 SNPs were applied to carry out GWAS for flowering-related Arabidopsis traits, including SDV, FT10, and FT22 by emBBI and the other algorithms. The densities of all SNPs and MAF are shown in Figure 1. These SNPs were evenly distributed across the five chromosomes (Figure 1A).

QTNs passing the significance threshold (LOD = 2.5, equivalent to p-value = 0.00069) were identified for each method, and the genes in their neighborhoods were mined from TAIR (https://www.arabidopsis.org, accessed on 10 March 2024). The known genes are listed in Table 2, and the Manhattan plots are shown in Figures S2–S4. The emBBI algorithm detected 3 (Table S1 and Figure S2), 6 (Table S1 and Figure S3), and 12 (Table 2 and Figure S4) known genes associated with SDV, FT10, and FT22, respectively, for a total of 21 confirmed genes associated with the three flowering-related traits. In contrast, the other methods, emBB, emBA, emBL, emML, and emRR, detected 6, 5, 10, 19, and 13 known genes across the three traits, respectively (Table 2, Table S1, and Figures S2–S4). Interestingly, the emBBI method dissected three clusters of genes associated with the same trait, FT22 (Table 2 and Figure S4A), including the genes AT2G38185, AT2G38195, and AT2G38220 in the neighborhood of the SNP located at 16020457 bp on chromosome 2. More importantly, the AT5G20280 gene was detected simultaneously for all three flowering-related traits, SDV, FT10, and FT22 (Table 2 and Table S1), and the AT5G24930 gene was detected for both SDV and FT22 (Table 2 and Table S1). These results suggest that emBBI is more powerful in detecting important genes (confirmed by TAIR) associated with flowering-related traits.

3.2.3. Functional Enrichment Analysis of Candidate Genes

In order to gain insight into the genetic basis of the candidate genes, GO enrichment analysis was performed in this study, which is a powerful bioinformatics tool to better understand the underlying biological processes, molecular functions, and cellular components of candidate genes. According to the results of the GO functional enrichment study, 1655 candidate genes for QTNs are significantly enriched with 234 GO terms related to various biological processes (p-value < 0.05). The results of the enrichment analysis heatmap of all the candidate genes are shown in Figure 4, and the pathway is shown in Supplementary Figure S5. The important GO terms of the candidate genes are marked in red in the rectangular boxes.

In adjacent areas of the significant QTNs, 12 candidate genes (Figure 4) that have not been reported in previous studies are involved in several biological and metabolic processes during the Arabidopsis growth period, such as flower development. This suggests that these candidate genes have a potential impact on the target traits. For example, the detected candidate gene AT5G37930 is involved in multicellular biological development, zinc ion binding, ubiquitin protein transferase activity, and protein binding. According to the GO analysis (Supplementary Figure S5), reproductive processes, such as postembryonic development (GO:0009791), are functionally related to flower development in Arabidopsis, and cellular macromolecular metabolic processes (GO:0044260), nucleus (GO:0005634), and ion binding (GO:0043167) are the most significant terms in BP, CC, and MF, respectively. In the last layer (Supplementary Figure S5), 144 genes were enriched in the protein modification process (GO:0006464), in which the novel candidate genes AT5G37890, AT5G37930, AT1G32060, AT5G37870, AT2G13950, and AT5G37910 were involved (Figure 4); these genes play an important role in the growth and reproduction of Arabidopsis thaliana. These results provide a critical biological basis for mining novel genes.

3.2.4. Expression Profiling of Candidate Genes

The expression levels of the candidate genes in different organs or tissues are available in the ATH1 database (http://bar.utoronto.ca/efp, accessed on 17 April 2024), including seeds (both dried and soaked), stamens, stems, leaves, leaf roots, inflorescences, and so on. For the candidate genes for the SDV trait, the GCOS expression signals in different tissues or organs are illustrated in Figure 5 (normalized before plotting the heat map). As shown in Figure 5, the gene AT4G36060 has the highest GCOS expression level in dry seeds. The genes AT2G02090, AT2G02140, and AT4G01575 all had their highest expression levels in the Arabidopsis mature pollen period. Genes AT5G20240, AT1G23010, AT1G23060, and AT5G20270 showed high expression levels in the flowering stage.

4. Discussion

The emBayesB [27] method is a fast, accurate, and unbiased algorithm based on iterated conditional expectation, developed by incorporating the Expectation–Maximization (EM) algorithm. It is a robust algorithm for both GWASs and genomic selection (GS), and its key advantage is that it dynamically updates the proportion of SNPs in LD with QTNs during the iterative process, resulting in relatively high estimation accuracy. However, advanced biotechnology generates millions of molecular markers, and classical Bayesian algorithms are inflexible in such high-dimensional data analysis. Meanwhile, their limitations in whitening the population structure and polygenic background noise lead to inaccurate SNP effect estimation. In this study, we propose a new algorithm, emBBI, under the mixed linear model framework. This algorithm first assumes that the marker effects are random and employs the model transformation of FASTmrEMMA to whiten the covariance matrices of the polygenic and environmental noise. After dimension reduction, the emBayesB method and the LOD test are applied to estimate and test the marker effects. The results of this study show that controlling population structure and polygenic background noise is crucial for emBBI in GWASs and improves SNP effect estimation, which provides an alternative method for genetic analysis.

We compared emBBI with the established FASTmrEMMA method in both simulation experiments and real data analysis. The power of both methods to detect the fixed QTN is almost 100% under the three polygenic backgrounds. The accuracy of emBBI is slightly better, while FASTmrEMMA is computationally faster. For random-position QTNs, FASTmrEMMA performs slightly better. The computational time and accuracy follow a trend similar to that in simulation I, with the gap becoming wider as the polygenic background becomes stronger. In the real data analysis, FASTmrEMMA detects 11 known genes for SDV, which is fewer than emBBI (12 known genes). The main reason is that, although both emBBI and FASTmrEMMA have a polygenic correction stage to whiten the noise, their subsequent steps differ. FASTmrEMMA is a combination of single-locus and multi-locus analysis, while emBBI employs the least squares method to reduce the dimensions; then, it estimates and tests the marker effects using emBayesB and the LOD test, respectively.

In the real data analysis, the emBBI algorithm detected 21 known genes associated with the three flowering-related traits. The number of known genes detected by emBBI is considerably higher than the number detected by the other comparison methods: it is more than three times the number detected by the traditional emBB and four times the number detected by emBA. With its multi-stage process, the emBBI method identified three clusters of genes associated with FT22 (Table 2 and Figure S4). Notably, the gene AT5G20280 was detected for all three flowering-related traits. Together, these results indicate that emBBI is more powerful in detecting significant QTNs and important genes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12131944/s1. Table S1. The identified genes of the two traits SDV and FT10 in Arabidopsis using the emBBI, emBB, emRR, emML, emBL, and emBA methods. Figure S1. Violin plots of LOD test p-values in simulation II. Figure S2. Manhattan plot for the trait SDV by six methods. Figure S3. Manhattan plot for the trait FT10 by six methods. Figure S4. Manhattan plot for the trait FT22 by six methods. Figure S5. Annotated hierarchical tree of GO (slim) annotations for candidate genes. It consists of three categories: (A) biological process (BP), (B) molecular function (MF), and (C) cellular component (CC). The significant (p ≤ 0.05) GO terms are labeled with colored boxes, and the color in each box reflects the enrichment of differential genes in the GO term; the darker the color, the more significant the enrichment.

Author Contributions

Conceived and supervised, J.Z. (Jin Zhang), Y.W.; methodology, G.Z., J.Z. (Jianini Zhao) and L.L.; investigation, G.Z., J.W., G.L. and F.B.; performing analysis, G.Z., J.Z. (Jianini Zhao) and L.L.; visualization, M.Z. and L.L.; writing—original draft, J.Z. (Jin Zhang) and L.L.; writing—review and editing, J.Z. (Jin Zhang), Y.W. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Innovation and Entrepreneurship Program of the Nanjing Agricultural University (grant number 202310307369Z).

Data Availability Statement

All data required for this article are included within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

GWASs: genome-wide association studies; SNP: single nucleotide polymorphism; emBBI: expectation and maximization BayesB Improved; LOD: logarithm of odds; QTN: quantitative trait nucleotide; MLM: mixed linear model; EMMA: efficient mixed-model association; GEMMA: genome-wide efficient mixed-model association; EMMAX: EMMA eXpedited; FarmCPU: Fixed and random model Circulating Probability Unification; emBB: expectation and maximization BayesB; emBA: expectation and maximization BayesA; emBL: expectation and maximization BayesLasso; emML: expectation and maximization Gaussian maximum likelihood; emRR: expectation and maximization ridge regression; LASSO: least absolute shrinkage and selection operator; MVN: multivariate normal distribution; LD: linkage disequilibrium; DE: double-exponential; DD: Dirac Delta; LR: likelihood ratio; MCMC: Markov chain Monte Carlo; ML: maximum likelihood; MAF: minor allele frequency; SDV: days to vernalized flowering under short days of sunlight; FT10: days to flowering at 10 °C; FT22: days to flowering at 22 °C; TAIR: The Arabidopsis Information Resource; GO: gene ontology; BP: biological process; MF: molecular function; CC: cellular component; MSE: mean square error; PVE: phenotypic variance explained; SD: standard deviation; CV error: cross-validation error; and FASTmrEMMA: fast multi-locus random-SNP-effect EMMA.

References

1. Uffelmann, E.; Huang, Q.Q.; Munung, N.S.; de Vries, J.; Okada, Y.; Martin, A.R.; Martin, H.C.; Lappalainen, T.; Posthuma, D. Genome-wide association studies. Nat. Rev. Methods Primers 2021, 1, 59.
2. Visscher, P.M.; Wray, N.R.; Zhang, Q.; Sklar, P.; McCarthy, M.I.; Brown, M.A.; Yang, J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 2017, 101, 5–22.
3. Frayling, T.M.; Timpson, N.J.; Weedon, M.N.; Zeggini, E.; Freathy, R.M.; Lindgren, C.M.; Perry, J.R.B.; Elliott, K.S.; Lango, H.; Rayner, N.W.; et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 2007, 316, 889–894.
4. Wang, K.; Zhang, H.; Kugathasan, S.; Annese, V.; Bradfield, J.R.; Russell, R.K.; Sleiman, P.M.A.; Imielinski, M.; Glessner, J.; Hou, C.; et al. Diverse Genome-wide Association Studies Associate the IL12/IL23 Pathway with Crohn Disease. Am. J. Hum. Genet. 2009, 84, 399–405.
5. Ma, J.W.; Yang, J.; Zhou, L.S.; Ren, J.; Liu, X.X.; Zhang, H.; Yang, B.; Zhang, Z.Y.; Ma, H.B.; Xie, X.H.; et al. A Splice Mutation in the Gene Causes High Glycogen Content and Low Meat Quality in Pig Skeletal Muscle. PLoS Genet. 2014, 10, e1004710.
6. Fan, Q.C.; Wu, P.F.; Dai, G.J.; Zhang, G.X.; Zhang, T.; Xue, Q.; Shi, H.Q.; Wang, J.Y. Identification of 19 loci for reproductive traits in a local Chinese chicken by genome-wide study. Genet. Mol. Res. 2017, 16, 1–8.
7. Demars, J.; Fabre, S.; Sarry, J.; Rossetti, R.; Gilbert, H.; Persani, L.; Tosser-Klopp, G.; Mulsant, P.; Nowak, Z.; Drobik, W.; et al. Genome-Wide Association Studies Identify Two Novel Mutations Responsible for an Atypical Hyperprolificacy Phenotype in Sheep. PLoS Genet. 2013, 9, e1003482.
8. Lin, H.; Zhou, Z.; Zhao, J.; Zhou, T.; Bai, H.; Ke, Q.; Pu, F.; Zheng, W.; Xu, P. Genome-Wide Association Study Identifies Genomic Loci of Sex Determination and Gonadosomatic Index Traits in Large Yellow Croaker (Larimichthys crocea). Mar. Biotechnol. 2021, 23, 127–139.
9. Zhao, K.; Tung, C.W.; Eizenga, G.C.; Wright, M.H.; Ali, M.L.; Price, A.H.; Norton, G.J.; Islam, M.R.; Reynolds, A.; Mezey, J.; et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in. Nat. Commun. 2011, 2, 467.
10. Huang, X.H.; Wei, X.H.; Sang, T.; Zhao, Q.A.; Feng, Q.; Zhao, Y.; Li, C.Y.; Zhu, C.R.; Lu, T.T.; Zhang, Z.W.; et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 2010, 42, 961–976.
11. Li, H.; Peng, Z.Y.; Yang, X.H.; Wang, W.D.; Fu, J.J.; Wang, J.H.; Han, Y.J.; Chai, Y.C.; Guo, T.T.; Yang, N.; et al. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat. Genet. 2013, 45, 43–50.
12. Chao, Z.F.; Chen, Y.Y.; Ji, C.; Wang, Y.L.; Huang, X.; Zhang, C.Y.; Yang, J.; Song, T.; Wu, J.C.; Guo, L.X.; et al. A genome-wide association study identifies a transporter for zinc uploading to maize kernels. EMBO Rep. 2023, 24, e55542.
13. Zhang, J.; Chen, M.; Wen, Y.; Zhang, Y.; Lu, Y.; Wang, S.; Chen, J. A Fast Multi-Locus Ridge Regression Algorithm for High-Dimensional Genome-Wide Association Studies. Front. Genet. 2021, 12, 649196.
14. Yu, J.; Pressoir, G.; Briggs, W.H.; Vroh Bi, I.; Yamasaki, M.; Doebley, J.F.; McMullen, M.D.; Gaut, B.S.; Nielsen, D.M.; Holland, J.B.; et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006, 38, 203–208.
15. Zhang, Z.W.; Ersoz, E.; Lai, C.Q.; Todhunter, R.J.; Tiwari, H.K.; Gore, M.A.; Bradbury, P.J.; Yu, J.M.; Arnett, D.K.; Ordovas, J.M.; et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 2010, 42, 355–360.
16. Kang, H.M.; Zaitlen, N.A.; Wade, C.M.; Kirby, A.; Heckerman, D.; Daly, M.J.; Eskin, E. Efficient control of population structure in model organism association mapping. Genetics 2008, 178, 1709–1723.
17. Zhou, X.; Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012, 44, 821–824.
18. Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.Y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010, 42, 348–354.
19. Lippert, C.; Listgarten, J.; Liu, Y.; Kadie, C.M.; Davidson, R.I.; Heckerman, D. FaST linear mixed models for genome-wide association studies. Nat. Methods 2011, 8, 833–835.
20. Wang, Q.S.; Tian, F.; Pan, Y.C.; Buckler, E.S.; Zhang, Z.W. A SUPER Powerful Method for Genome Wide Association Study. PLoS ONE 2014, 9, e107684.
21. Liu, X.; Huang, M.; Fan, B.; Buckler, E.S.; Zhang, Z. Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 2016, 12, e1005767.
22. Jiang, L.; Zheng, Z.; Qi, T.; Kemper, K.E.; Wray, N.R.; Visscher, P.M.; Yang, J. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019, 51, 1749–1755.
23. Wen, Y.J.; Zhang, H.; Ni, Y.L.; Huang, B.; Zhang, J.; Feng, J.Y.; Wang, S.B.; Dunwell, J.M.; Zhang, Y.M.; Wu, R. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief. Bioinform. 2018, 19, 700–712.
24. Iwata, H.; Uga, Y.; Yoshioka, Y.; Ebana, K.; Hayashi, T. Bayesian association mapping of multiple quantitative trait loci and its application to the analysis of genetic variation among L. germplasms. Theor. Appl. Genet. 2007, 114, 1437–1449.
25. Zhang, J.; Yue, C.; Zhang, Y.M. Bias correction for estimated QTL effects using the penalized maximum likelihood method. Heredity 2012, 108, 396–402.
26. Moser, G.; Lee, S.H.; Hayes, B.J.; Goddard, M.E.; Wray, N.R.; Visscher, P.M. Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model. PLoS Genet. 2015, 11, e1004969.
27. Shepherd, R.K.; Meuwissen, T.H.; Woolliams, J.A. Genomic selection and complex trait prediction using a fast EM algorithm applied to genome-wide markers. BMC Bioinform. 2010, 11, 529.
28. Hayashi, T.; Iwata, H. EM algorithm for Bayesian estimation of genomic breeding values. BMC Genet. 2010, 11, 3.
29. Xavier, A.; Muir, W.M.; Rainey, K.M. bWGR: Bayesian whole-genome regression. Bioinformatics 2020, 36, 1957–1959.
30. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1996, 58, 267–288.
31. Park, T.; Casella, G. The Bayesian Lasso. J. Am. Stat. Assoc. 2008, 103, 681–686.
32. Swallow, W.H.; Monahan, J.F. Monte Carlo Comparison of ANOVA, MIVQUE, REML, and ML Estimators of Variance Components. Technometrics 1984, 26, 47–57.
33. da Silva, F.A.; Viana, A.P.; Correa, C.C.G.; Santos, E.A.; de Oliveira, J.A.V.S.; Andrade, J.D.G.; Ribeiro, R.M.; Glória, L.S. Bayesian ridge regression shows the best fit for SSR markers in Psidium guajava among Bayesian models. Sci. Rep. 2021, 11, 13639.
34. Yi, N.J.; Xu, S.H. Bayesian LASSO for quantitative trait loci mapping. Genetics 2008, 179, 1045–1055.
35. Zhang, J.; Feng, J.Y.; Ni, Y.L.; Wen, Y.J.; Niu, Y.; Tamba, C.L.; Yue, C.; Song, Q.; Zhang, Y.M. pLARmEB: Integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity 2017, 118, 517–524.
36. Wen, Y.J.; Zhang, Y.W.; Zhang, J.; Feng, J.Y.; Zhang, Y.M. The improved FASTmrEMMA and GCIM algorithms for genome-wide association and linkage studies in large mapping populations. Crop J. 2020, 8, 723–732.
37. Wen, Y.J.; Zhang, Y.W.; Zhang, J.; Feng, J.Y.; Dunwell, J.M.; Zhang, Y.M. An efficient multi-locus mixed model framework for the detection of small and linked QTLs in F2. Brief. Bioinform. 2019, 20, 1913–1924.
38. Sun, J.; Wu, Q.; Shen, D.; Wen, Y.; Liu, F.; Gao, Y.; Ding, J.; Zhang, J. TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies. Sci. Rep. 2019, 9, 18034.
39. Tamba, C.L.; Ni, Y.L.; Zhang, Y.M. Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies. PLoS Comput. Biol. 2017, 13, e1005357.
40. Kao, C.H.; Zeng, Z.B.; Teasdale, R.D. Multiple interval mapping for quantitative trait loci. Genetics 1999, 152, 1203–1216.
41. Lander, E.; Kruglyak, L. Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 1995, 11, 241–247.
42. Qin, H.; Guo, W.; Zhang, Y.M.; Zhang, T. QTL mapping of yield and fiber traits based on a four-way cross population in Gossypium hirsutum L. Theor. Appl. Genet. 2008, 117, 883–894.
43. Atwell, S.; Huang, Y.S.; Vilhjalmsson, B.J.; Willems, G.; Horton, M.; Li, Y.; Meng, D.; Platt, A.; Tarone, A.M.; Hu, T.T.; et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 2010, 465, 627–631.
44. Akond, Z.; Ahsan, M.A.; Alam, M.; Mollah, M.N.H. Robustification of GWAS to explore effective SNPs addressing the challenges of hidden population stratification and polygenic effects. Sci. Rep. 2021, 11, 13060.
45. Tian, T.; Liu, Y.; Yan, H.; You, Q.; Yi, X.; Du, Z.; Xu, W.; Su, Z. agriGO v2.0: A GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. 2017, 45, W122–W129.

Figure 1. (A) Marker density of the natural population of Arabidopsis with 1 Mb window size. (B) MAF of the natural population of Arabidopsis.

Figure 2. The MSE of SNP effects using the emBBI, emBB, emRR, emML, emBL, and emBA algorithms under two-, five-, and ten-times polygenic background and sample size combinations of 500, 1000, 2000, 3000, and 5000 (in x label) in simulation experiment I.

Figure 3. The descriptive statistics of the phenotypic data for the three flowering-related traits, SDV, FT10, and FT22.

Figure 4. The heat map of GO enrichment analysis of candidate genes for SDV, FT10, and FT22.

Figure 5. Heat Map for Expression Profile Analysis.

Table 1. Statistical analysis of three Arabidopsis flowering-related traits.

Trait	Mean	SD	CV	Min	Max	Range
SDV	64.86	35.07	0.54	26.33	200.00	173.67
FT10	63.97	17.82	0.28	41.00	121.00	80.00
FT22	74.72	71.75	0.96	23.30	250.00	226.70
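The derived columns of Table 1 can be cross-checked from the reported means, standard deviations, minima, and maxima. The short sketch below assumes that the CV column is the coefficient of variation (SD divided by mean) and that Range is Max minus Min; both readings are inferred from the numbers rather than stated in the table, and the helper name `summarize` is hypothetical.

```python
# (mean, sd, min, max) per trait, copied from Table 1.
table1 = {
    "SDV": (64.86, 35.07, 26.33, 200.00),
    "FT10": (63.97, 17.82, 41.00, 121.00),
    "FT22": (74.72, 71.75, 23.30, 250.00),
}

def summarize(mean, sd, lo, hi):
    """Return (coefficient of variation, range), both rounded to two decimals."""
    return round(sd / mean, 2), round(hi - lo, 2)

derived = {trait: summarize(*values) for trait, values in table1.items()}
for trait, (cv, value_range) in derived.items():
    print(f"{trait}: CV = {cv:.2f}, Range = {value_range:.2f}")
```

Under these assumptions the recomputed values agree with the table (for example, SDV gives a CV of 0.54 and a range of 173.67), and FT22 shows by far the largest relative dispersion of the three traits.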

Table 2. The identified genes of FT22 in Arabidopsis using the emBBI, emBB, emRR, emML, emBL, and emBA methods.

Chr.	Position	Gene	Method	Chr.	Position	Gene	Method
1	2747091	AT1G08660	emRR, emBB	2	13862690	AT2G32700	emRR
1	2752442	AT1G08660	emRR	2	10555534	AT2G24790	emBBI
1	2779526	AT1G08730	emRR	2	14696964	AT2G34880	emBBI
1	6470743	AT1G18750	emBB	2	16020457	AT2G38185	emBBI
1	4345034	AT1G12790	emBL			AT2G38195
1	4953573	AT1G14440	emBL			AT2G38220
1	2779077	AT1G08660	emML	3	18923922	AT3G50870	emRR
1	2779077	AT1G08730	emML	4	17263477	AT4G36620	emRR
1	3180545	AT1G09780	emML	4	13976417	AT4G28190	emBBI
1	3849924	AT1G11410	emML	5	13923880	AT5G35750	emRR
1	9053189	AT1G26220	emML	5	14296026	AT5G36260	emBA
	9054289			5	14296180	AT5G36260	emBA
	9055133			5	3163806	AT5G10140	emBBI
	9072307			5	6851199	AT5G20280	emBBI
	9073534			5	8609688	AT5G24930	emBBI
	9082795			5	19938064	AT5G49150	emBBI
1	9483525	AT1G27320	emML	5	26786046	AT5G67180	emBBI
	9486308			5	26793959	AT5G67180	emBBI
	9488653

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

