1. Introduction
In recent years, the development of high-throughput sequencing technology and the application of second- and third-generation sequencing platforms have generated genetic variation data of unprecedented scale and dimensionality. SNPs (single nucleotide polymorphisms) and CNVs (copy number variations) have become common genetic markers for studying the genetic mechanisms of traits. Linkage disequilibrium-based association analysis has achieved great success in detecting the pathogenic genes of human diseases and the genetic architecture of complex traits in animals and plants [1,2,3]. A genome-wide association study (GWAS) uses millions of SNPs or CNVs, obtained by high-throughput genotyping, as genetic markers to identify causal genes by association analysis. GWAS has achieved initial success in genetic studies of humans, animals and plants. GWAS falls into two categories: common variant association studies (CVAS, minor allele frequency (MAF) > 5%) and rare variant association studies (RVAS, MAF ≤ 5%), where variants with MAF between 1% and 5% are called low-frequency variants and variants with MAF < 1% are called rare variants. Most of the causal loci identified by current GWAS are common variants and explain only a small proportion of the phenotypic variation. From the viewpoint of biological evolution and population genetics, most mutant alleles are low-frequency, and the loci controlling complex traits are generally low-frequency variants [4]. Associations between low-frequency variants and complex traits have been reported [5,6,7], and some GWAS methods for dealing with low-frequency variants have also been proposed [8,9,10].
Univariate association testing, in which the variants in a gene region are tested one at a time, is an effective and well-developed method for common variants. For rare variants, however, testing each variant in isolation overlooks information contained in the surrounding genetic region, such as the locations of individual variants and the correlations between them, so the power to detect a rare variant is easily underestimated or overestimated. To overcome this problem, many association analysis methods based on genomic regions have been proposed, including OMNI [11], eSCAN [12], SKAT [13], SKAT-O [14], CauchyGM-O [15] and tpSSU [16].
Although many region-based association analysis methods exist, they can generally be divided into three types. The first type is based on the idea of merging variants [10,17,18,19]. Such methods integrate all the variants in a region into a single new variable and test that variable for association. A common approach is to sum all the variants in the region to form the new variable and then test it [17]. This reduces the loss of power caused by the large number of degrees of freedom. The disadvantage is that all effects in the tested region must act in the same direction; otherwise, effects of opposite sign cancel in the merged variable and the test loses power.
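The direction problem of merging-based tests can be illustrated with a small sketch. The code below is not taken from any of the cited methods: it assumes the simplest merging rule (an unweighted sum of minor-allele counts), and the toy genotype matrix and effect values are hypothetical.

```python
def burden_scores(genotypes):
    """Merge each individual's genotype row (0/1/2 minor-allele counts)
    into one variable by summing across the region."""
    return [sum(row) for row in genotypes]


def genetic_values(genotypes, effects):
    """Noise-free additive genetic value of each individual."""
    return [sum(g * b for g, b in zip(row, effects)) for row in genotypes]


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy region: 6 individuals x 2 variants.
G = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
burden = burden_scores(G)

# Both variants act in the same direction: the merged variable tracks the trait.
r_same = pearson(burden, genetic_values(G, [1.0, 1.0]))
# Effects of opposite sign cancel in the merged variable.
r_opp = pearson(burden, genetic_values(G, [1.0, -1.0]))
```

In this toy example the merged variable is perfectly correlated with the genetic value when both effects are positive and uncorrelated with it when the effects have opposite signs, which is the situation the variance component and FDA-based methods are designed to handle.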
The second type of method is based on the variance component test of the mixed model [20,21,22,23,24]. The variance component test was proposed to solve the problem of effect direction. It does not focus on how to combine rare variants; instead, it assumes that the genetic effects of rare variants follow a normal distribution. By testing the variance component of these random effects, the association between rare variants and phenotypic traits can be studied more effectively [25]. This kind of method requires a kernel function to measure the genetic similarity between any two individuals in the tested region. The variance component approach places few requirements on the direction of effects.
The third type of method is based on functional data analysis (FDA) [26,27,28,29], proposed in recent years. To exploit high-density genetic marker data, the genetic model is transformed from a traditional multiple linear regression model into a functional linear model (FLM). In the FLM, the genetic effects are represented by a coefficient function expressed as a set of basis functions and their coefficients. By testing these coefficients, we can determine whether the region contains a significantly non-zero genetic effect. Many previous studies have shown that methods based on the functional linear regression model have higher power than those based on merging or on variance component tests of the mixed model [26,27,30]. The application of the functional linear regression model in gene association analysis has been explored in many directions (additive, dominant and epistatic effects) [31,32], but researchers have focused mainly on power, as if power were the only criterion for evaluating a method. Power is certainly an extremely important indicator of the quality of a gene association analysis method, and FDA-based methods perform very well in this respect. However, a problem remains: effects within a gene region are sparse, and traditional association analysis methods cannot shrink the zero-effect parts of a region. They therefore achieve high power but also tend to identify noise as association signals. An FDA-based method that effectively compresses the sparse region without reducing power too much would thus give better practical results. In addition, quantitative indicators that measure the impact of genetic variation on phenotypic traits would allow the relationship between phenotypic traits and genetic loci to be analyzed more accurately.
Lin et al. [33] proposed the fSCAD (functional smoothly clipped absolute deviation) method, which improves the original FLM by adding a SCAD (smoothly clipped absolute deviation) penalty term. This method can shrink the estimate to exactly zero in the zero regions of the effect function without over-shrinking it in the non-zero regions, which means that unassociated variants can be ignored in gene association analysis while the truly associated variants are retained.
In view of these advantages of fSCAD and the problems of FDA-based gene association analysis methods, this paper proposes a new gene association analysis approach based on FDA [33], called the sparse functional data association test (SFDAT). Computer simulation was used to evaluate its effect coefficient estimation accuracy, type I error rate and power. A real data set of O. sativa was analyzed by SFDAT to demonstrate its applicability to real data.
2. Theory and Methods
2.1. Genetic Model
Let $y_i$ be the phenotypic value of the $i$-th individual. For the $i$-th individual, the traditional linear genetic model can be expressed as:
$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i \qquad (1)$$
where $\mu$ is the overall mean, $x_{ij}$ is the genotype of the $i$-th individual at the $j$-th marker (if A and a represent a pair of alleles, then $x_{ij}$ is taken as 2 when the genotype is AA, 1 when it is Aa, and 0 when it is aa), $\beta_j$ represents the effect coefficient of the $j$-th genetic marker, $\varepsilon_i \sim N(0, \sigma_e^2)$, $\sigma_e^2$ is the environmental variance, and $p$ is the number of genetic markers. As the number of genetic markers increases, the degrees of freedom increase and the multicollinearity among variables becomes more serious, eventually reducing estimation accuracy and power. This is especially true when the genetic markers are low-frequency variants. To reduce the degrees of freedom of the model and the multicollinearity caused by low-frequency variants, the functional linear model (FLM) can be used instead of the multiple linear genetic model:
$$y_i = \mu + \int_0^T x_i(t)\,\beta(t)\,dt + \varepsilon_i \qquad (2)$$
where the $\varepsilon_i$ are independent and normally distributed with zero mean and variance $\sigma_e^2$, and $[0, T]$ represents the genomic region under consideration, that is, a DNA fragment containing multiple SNP loci, some of which may affect the target quantitative trait. The discrete genetic markers in Equation (1) are converted into a continuous genetic marker function $x_i(t)$, and the marker effects $\beta_j$ are likewise converted into a continuous genetic effect function $\beta(t)$.
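The genotype coding and the two model forms described above can be sketched in a few lines of Python. This is an illustration only: the function names, the optional intercept `mu` and the toy inputs are assumptions, and the integral of the functional model is approximated here by a simple midpoint rule rather than by the B-spline expansion used in the paper.

```python
def code_genotype(genotype, counted_allele="A"):
    """Additive coding: copies of the counted allele (AA = 2, Aa = 1, aa = 0)."""
    return sum(1 for allele in genotype if allele == counted_allele)


def linear_genetic_value(x, beta, mu=0.0):
    """mu + sum_j x_j * beta_j for one individual (noise-free part of the
    multiple linear genetic model)."""
    return mu + sum(xj * bj for xj, bj in zip(x, beta))


def functional_genetic_value(x_fn, beta_fn, T, mu=0.0, n_grid=1000):
    """mu + integral over [0, T] of x(t) * beta(t) dt, approximated with
    the midpoint rule on n_grid equal subintervals."""
    h = T / n_grid
    total = sum(x_fn((k + 0.5) * h) * beta_fn((k + 0.5) * h) for k in range(n_grid))
    return mu + h * total


# Hypothetical individual with genotypes AA, Aa, aa at three markers.
x = [code_genotype(g) for g in ("AA", "Aa", "aa")]
value = linear_genetic_value(x, beta=[0.5, 1.0, 3.0])
```

The discrete model needs one coefficient per marker, whereas the functional form replaces the sum by an integral over the region, which is what allows the number of parameters to be controlled by the number of basis functions instead of the number of markers.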
For Equation (2), the B-spline function is used to fit the genetic variants and the genetic effects. According to the functional data analysis method [
26,
34]: First, let
variants be in a sequence of their physical locations
which constitutes the genomic region
; Second, a series of B-spline basis functions are defined and let
be a B-spline basis function; Third, define
equidistant nodes
in the interval
. After that, these discrete genetic variants can be expanded as a continuous function:
where the coefficient could be obtained by minimizing the following equation:
Let , , .
The coefficient
is estimated to be
. Similar to the genetic variants, the genetic effects also can be expanded as
where
,
.
Let ,, , then .
The functional linear model of Equation (2) can be rewritten as
Let
, then Equation (6) can be rewritten as
The form of the linear regression equation is:
In the actual operation, the integral interval can be converted into .
2.2. Parameter Estimation
It can be seen from the equations established above that the genetic variants and effects in the functional linear genetic model are transformed into a continuous function through B-spline. Finally, the functional genetic model is transformed into the traditional linear regression model. In order to obtain the local sparse estimation of
, we use the method of parameter estimation based on the penalty function proposed by Lin et al. [
33]. Lin et al. [
33] proposed that
and
in Equation (2) could be estimated by minimizing the following loss function:
where
as defined above,
is the
order differential operator,
, which we usually take
is
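norm. As defined by Fan and Li [35], the SCAD penalty $p_\lambda(u)$ is:
$$p_\lambda(u) = \begin{cases} \lambda u, & 0 \le u \le \lambda, \\ -\dfrac{u^2 - 2a\lambda u + \lambda^2}{2(a-1)}, & \lambda < u < a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & u \ge a\lambda. \end{cases}$$
Its domain is $[0, +\infty)$; the value of $a$ can be taken as 3.7, as suggested by Fan and Li [35], and the value of $\lambda$ is determined by the sample size. The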
term in the loss function
is the roughness penalty of
, it controls the smoothness of
, where parameter
can further adjust the severity of roughness penalty,
is called the smoothing parameter. Due to the roughness penalty, the functional linear regression model FLM has a certain ability to resist “noise”.
is the local sparse penalty of
, it can compress the tiny
directly to zero. The parameter
determines how tiny value would be compressed,
is called the compression parameter. In addition, the local sparse penalty will play different constraints according to the specific form of
.
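The behavior of the local sparse penalty is easiest to see from the SCAD function itself. The following is a minimal Python sketch of the standard three-piece SCAD penalty of Fan and Li [35], evaluated at the absolute value of its argument; the function name is hypothetical, and the sketch covers only the pointwise penalty, not the integral over the region or the full SLoS loss.

```python
def scad_penalty(u, lam, a=3.7):
    """SCAD penalty p_lam(|u|) with compression parameter lam > 0 and shape
    parameter a > 2 (a = 3.7 is the value suggested by Fan and Li)."""
    if lam <= 0 or a <= 2:
        raise ValueError("require lam > 0 and a > 2")
    u = abs(u)
    if u <= lam:
        # Small values: penalized linearly, like the lasso.
        return lam * u
    if u < a * lam:
        # Intermediate values: quadratic transition.
        return -(u * u - 2 * a * lam * u + lam * lam) / (2 * (a - 1))
    # Large values: constant penalty, so large effects are not shrunk further.
    return (a + 1) * lam * lam / 2
```

Because the penalty is linear near zero and flat beyond `a * lam`, small effect values are pushed to exactly zero while large effects are left essentially unpenalized; the compression parameter controls where this transition happens.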
For Equation (9), when , is the same as the loss function of the ordinary functional linear regression model (FLM), and the estimator is called the ordinary least-squares (OLS) estimator; when , equals the loss function of the smoothed functional linear regression model, and the estimator is called the smoothing spline (Smooth) estimator; when , is the loss function of the locally sparse functional linear regression model, and the estimator is called the smooth and locally sparse (SLoS) estimator. The SLoS loss function has two advantages: first, rough estimates caused by spurious correlations are smoothed by the roughness penalty; second, small and insignificant effects are compressed directly to zero, which further reduces false positives.
2.3. Test Statistics
Another major problem in the genetic study for quantitative traits is whether the association between genetic regions and phenotypic traits is real existence. In general, we consider the following hypothesis-testing questions:
Since the genetic effect function is the expansion of the basis function, the above assumption is equal to the following assumptions:
The statistic can be defined as:
where
is the regression square sum of Equation (11), and
is the residual square sum of Equation (11).
The above statistical test is for the entire gene region
, let us move on to the test for the subgene area. Suppose
is the zero value area of
and
is non-zero value area of
, then
. The estimate
has the oracle property for
under the SLoS estimator. Lin et al. [
33] have proven the following conclusion: if
for any
, then the estimation of
for any
in probability. In other words, if the estimation of
for any
, then
for any
in probability. Suppose
is the zero value area of
and
is the non-zero value area of
,
is convergence to
in probability and
is convergence to
in probability. Therefore, the zero and non-zero area of
represent that of
. From the viewpoint of statistical genetics, the effect function
being zero means that there is no association between the phenotypic trait and locus
; the effect function
being non-zero means that there is an association between the phenotypic trait and locus
.
Therefore, it is necessary to establish indicators for SFDAT to estimate the effect function in zero or non-zero regions. On the one hand, it can measure the accuracy of the model estimation; on the other hand, it can provide a reference for a more accurate analysis of functional linear model regional association.
2.4. Indicators of Estimated Accuracy
A gene region contains numerous SNP loci, and the identified position of a causal SNP will inevitably deviate somewhat from its true position. We therefore regard the region of 200 loci centered on the causal SNP as the region of acceptable deviation (RAD); that is, an identified causal SNP is accepted if it lies within the RAD. To evaluate both the ability of the estimated effect function to compress zero regions and its accuracy in identifying non-zero regions, we use the correct selection ratio (CSR) of zero regions outside the RAD and the discovery length (DL) of non-zero regions inside the RAD, defined, respectively, by
 and are denoted as the length of in zero region and non-zero region. and represent the real and estimated effect values at the locus , respectively. For a good test method, the CSR should be large enough to handle regions without association signals effectively; meanwhile, the more precisely the method identifies the association signals, the lower its DL.
For areas with zero or non-zero effects in the integration region, the following integrated squared errors (ISE) are defined by Lin et al. [33]:
where
is the length of zero areas,
is the length of non-zero areas. ISE$_0$ and ISE$_1$ can be used to measure the error between the estimated $\hat{\beta}(t)$ and the true $\beta(t)$ on zero and non-zero areas, respectively. In addition, the predictive performance of the model is judged by the prediction mean squared error (PMSE):
where
test is the test individual set,
is the number of samples,
and
are estimated of
and
.
According to the above definitions, we use ISE$_0$, ISE$_1$ and PMSE as criteria for evaluating the accuracy of effect estimates for gene regions. To determine the degree of fit for zero effects, we define
denotes a set of variants loci in which no association exists,
denotes the number of elements in set
.
and
, respectively, denote estimated effects and actual effect values on locus
in set
. It indicates the degree of the overall deviation of the true and estimated values at the zero effects, the lower ISE
0, the more accurate estimation of zero effects. To determine the degree of fitting on the non-zero effects, we define
represents the set of associated variants loci in the region,
represents the number of elements in set
.
and
represent the estimated and actual effect values on locus
in set
, respectively. It indicates the degree of overall deviation between true and estimated values at the non-zero effects, the lower ISE
1, the more precise estimation of non-zero effects. To determine the degree of fit for the genetic model, we define
represents the test individual set, represents the number of individuals in the test set, represents the true trait value of the th individual in the test set and represents the predicted value of the th individual in the test set. PMSE indicates the overall deviation between the true and predicted trait values in the test set; the lower the PMSE, the stronger the predictive ability.
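The three accuracy indicators can be sketched as follows. This is a hedged illustration rather than the study's implementation: it assumes that the locus-level ISE$_0$ and ISE$_1$ are mean squared deviations over the zero-effect and non-zero-effect loci, respectively, and that PMSE is the mean squared prediction error over the test set; the exact normalization used in the paper may differ, and all function names are hypothetical.

```python
def mean_squared_deviation(estimated, actual):
    """Average of squared differences between two equally long lists."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(estimated)


def ise_zero_nonzero(beta_hat, beta_true):
    """Return (ISE0, ISE1): deviations over loci whose true effect is zero
    and over loci whose true effect is non-zero (assumed definitions)."""
    zero = [j for j, b in enumerate(beta_true) if b == 0]
    nonzero = [j for j, b in enumerate(beta_true) if b != 0]
    ise0 = mean_squared_deviation([beta_hat[j] for j in zero],
                                  [beta_true[j] for j in zero])
    ise1 = mean_squared_deviation([beta_hat[j] for j in nonzero],
                                  [beta_true[j] for j in nonzero])
    return ise0, ise1


def pmse(y_true, y_pred):
    """Prediction mean squared error over the test individuals."""
    return mean_squared_deviation(y_pred, y_true)
```

A sparse estimator that sets most zero-effect loci exactly to zero yields a small ISE$_0$ by construction, which is why this indicator separates SFDAT from estimators without a sparsity penalty.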
2.5. Determining the Tuning Parameters
In the SFDAT method, the choice of parameter
value is not very important [
36] as long as the selected
is large enough to reflect the local appearance of
(including the area of zero). For selection of
and
, a series of candidate values is given, and the optimal values are selected by cross-validation, generalized cross-validation, BIC (Bayesian information criterion), AIC (Akaike information criterion) or RIC (risk inflation criterion).
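Selecting the two tuning parameters from candidate grids can be sketched as a plain grid search. The sketch below makes several assumptions that are not taken from the paper: it uses a common Gaussian form of BIC, n log(RSS/n) + df log(n), and it relies on a hypothetical user-supplied callable `fit(gamma, lam)` that returns the residual sum of squares and the effective degrees of freedom of the fitted model.

```python
import math
from itertools import product


def gaussian_bic(rss, n, df):
    """A common Gaussian BIC form (assumed here): n*log(RSS/n) + df*log(n)."""
    return n * math.log(rss / n) + df * math.log(n)


def select_by_bic(fit, gammas, lambdas, n):
    """Try every (gamma, lam) pair and keep the one with the smallest BIC.

    fit(gamma, lam) is a hypothetical callable returning (rss, df).
    Returns (best_bic, best_gamma, best_lam)."""
    best = None
    for gamma, lam in product(gammas, lambdas):
        rss, df = fit(gamma, lam)
        score = gaussian_bic(rss, n, df)
        if best is None or score < best[0]:
            best = (score, gamma, lam)
    return best
```

The same loop works for AIC, RIC or a cross-validated error; only the scoring function changes.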
3. Simulation Studies
To verify the feasibility and effectiveness of the SFDAT method, a computer simulation was carried out. Simulated SNP genotype data were used to study the power, type I error rate and estimation accuracy of the method. For comparison, the OLS and Smooth methods were also evaluated. The simulation code was written in R.
The power and the type I error rate are obtained by the following steps: first, test Equation (2) to obtain the p value of the genetic model under each hypothesis; second, count the number of p values below a given threshold; third, divide this count by the total number of simulations. The resulting ratio is the power under the alternative (non-zero) hypothesis and the type I error rate under the null hypothesis. The reasoning is as follows: under the alternative hypothesis, the region contains an associated variant, so a p value below the threshold is a correct detection of the association between the phenotypic trait and the gene region, and the ratio is the power; under the null hypothesis, the region contains no associated variant, so a p value below the threshold is a false association, and the ratio is the type I error rate.
Finally, a test set (an additional 100 individuals from the simulated SNP genotype data) was generated for each simulation to calculate the PMSE.
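The counting procedure for power and type I error rate amounts to an empirical rejection rate. The following minimal Python sketch shows the computation; the function name and the example p values are hypothetical.

```python
def rejection_rate(p_values, alpha):
    """Fraction of simulation replicates whose p value is below alpha.

    Applied to replicates simulated under the alternative hypothesis this is
    the empirical power; applied to replicates simulated under the null
    hypothesis it is the empirical type I error rate."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Hypothetical p values from four replicates.
example_rate = rejection_rate([0.001, 0.2, 0.009, 0.5], alpha=0.01)
```

A well-calibrated test gives a rejection rate close to alpha on null replicates, so the same function supports both evaluations.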
3.1. Simulated SNP Genotype Data
In the simulation, we consider both linkage equilibrium and linkage disequilibrium. For the linkage equilibrium simulation, the simulated SNP genotype data set consists of many simulated gene regions, each containing 900 SNPs, and the MAFs of the SNPs within each region are generated from the uniform distribution
. Specifically, we first generate the MAF of each SNP from the uniform distribution, then generate the corresponding SNP genotypes from that MAF, and finally assemble the simulated gene region from these SNPs. For the linkage disequilibrium simulation, we generate the simulated SNP genotypes following Wang and Pan [37,38] and set the linkage disequilibrium measure to 0.2.
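The linkage equilibrium part of this procedure can be sketched as follows. The sketch is an assumption-laden illustration, not the study's R code: it assumes Hardy-Weinberg proportions (each genotype is the sum of two independent draws of the minor allele) and independence between SNPs, and the MAF bounds passed by the caller are illustrative placeholders rather than the bounds used in the paper.

```python
import random


def simulate_region(n_individuals, n_snps, maf_low, maf_high, seed=None):
    """Simulate one gene region under linkage equilibrium.

    Returns (mafs, genotypes), where genotypes[i][j] in {0, 1, 2} is the
    minor-allele count of individual i at SNP j."""
    rng = random.Random(seed)
    # Step 1: draw one MAF per SNP from a uniform distribution.
    mafs = [rng.uniform(maf_low, maf_high) for _ in range(n_snps)]
    # Step 2: draw each genotype as two independent Bernoulli(maf) alleles.
    genotypes = [
        [int(rng.random() < maf) + int(rng.random() < maf) for maf in mafs]
        for _ in range(n_individuals)
    ]
    return mafs, genotypes
```

For example, `simulate_region(2000, 900, 0.05, 0.5, seed=1)` would produce a region with the sample size and number of SNPs used in the simulation, with MAF bounds chosen only for illustration.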
In the simulation, the number of basis functions is 15, the order is 4, and the number of nodes is 11. A set of smoothing parameters [,,,,] and a set of compression parameters [0.03, 0.04, 0.05, 0.06] are given as candidates, and the optimal parameters are selected automatically according to BIC.
For both the linkage equilibrium and linkage disequilibrium simulations, four kinds of gene regions are considered: rare-variant, low-frequency-variant, common-variant and mixed-variant gene regions. The rare-variant gene region consists of rare variants whose MAFs are generated from the uniform distribution ; the low-frequency-variant gene region consists of low-frequency variants whose MAFs are generated from the uniform distribution ; the common-variant gene region consists of common variants whose MAFs are generated from the uniform distribution ; and the mixed-variant gene region is randomly composed of 60% rare variants, 15% low-frequency variants and 25% common variants.
Simulated phenotypic trait values were generated by
,
where is the set of causal SNPs, . Two different types of simulation cases were considered:
Case I: Three scenarios are considered. Scenario I: one positive causal SNP at locus 450 in the gene region; Scenario II: one positive causal SNP at each of loci 100 and 800; Scenario III: a positive causal SNP at locus 100 and a negative causal SNP at locus 800. The effect size in each scenario is fixed at 5 ( = 5).
Case II: The value of genetic effect is ( is the minor allele frequency of the SNP). A total of 27 scenarios are considered: the number of causal SNPs is 5, 10 or 20, so that the proportion of associated variants in the gene region (900 SNPs) is 1/180, 1/90 or 1/45; the proportion of negative effects among the causal SNPs is 0%, 20% or 40%; and the parameter in the genetic effect equals 3, 5 or 7.
The simulated gene regions of Case I and Case II are shown in Figure 1. For each case, we generated 2000 samples for each gene region, and all simulations were replicated 1000 times. CSR and DL are calculated in Case I, and power, ISE$_0$, ISE$_1$ and PMSE are calculated in Case II.
3.2. The Power Evaluation and the Estimation of Indicators
Figure 2 and Figure 3 show the CSR and DL, respectively, of OLS, Smooth and SFDAT in the three scenarios of Case I. SFDAT can compress the estimated effect function to 0 over most of the region of unassociated SNPs. Neither OLS nor Smooth has this ability; both estimate the coefficients of SNP loci with no genetic effect as non-zero. SFDAT compresses the zero-effect region more strongly for common and low-frequency variants than for mixed and rare variants, indicating that rare variants in a gene region may limit its compression ability. SFDAT performs best in gene regions with only one causal SNP and worst in gene regions with one positive and one negative causal SNP. SFDAT also performs better under linkage disequilibrium than under linkage equilibrium. As can be seen from Figure 3, OLS, Smooth and SFDAT can all find causal signals in the RAD, but the signal regions found by SFDAT are more concentrated. OLS and Smooth report a causal signal at every locus in the RAD, whereas SFDAT identifies causal SNPs in only part of the RAD in all cases. This is because OLS and Smooth cannot compress the zero region, so noise fluctuations in the estimated effect function at unassociated SNP loci are mistaken for effects at all loci. SFDAT can not only detect regions of unassociated SNPs but also accurately identify causal SNP loci. When there is only one causal SNP in the gene region, SFDAT locates rare variants more accurately. Under Scenarios II and III (gene regions with two causal SNPs), the DL$_{100}$ found by SFDAT is similar to DL$_{800}$. In general, the DL under linkage disequilibrium is smaller than that under linkage equilibrium; that is, the locations of the identified causal SNPs are more concentrated.
Table 1 reports the power of the three methods in Case II at the 0.01 significance level. Under linkage equilibrium, the power for common-variant and low-frequency-variant regions is similar, and the power for rare-variant and mixed-variant regions is similar. The power of the former pair is higher than that of the latter pair, and the power for rare-variant regions is lower than that for mixed-variant regions. This is because the former contain no rare variants, whereas the latter do, and mixed-variant regions contain fewer rare variants than rare-variant regions. Gene regions without rare variants therefore have higher power, and a gene region containing common variants has higher power than one containing only rare variants. For common-variant and low-frequency-variant regions, OLS, Smooth and SFDAT have similar power in all situations, with OLS slightly better. For rare-variant and mixed-variant regions, however, the power of OLS is markedly higher. Power increases with the number of causal SNPs and with the effect size, and decreases as the proportion of negative effects increases. In rare-variant regions, detection becomes unstable as the number of causal SNPs in the gene region decreases. Power is higher in all cases when the number of causal SNPs in the gene region reaches 20. Similar patterns are observed under linkage disequilibrium, but power is greatly improved compared with linkage equilibrium. This can be attributed to the correlation among loci, which strengthens the overall effect of the gene region.
The ISE$_0$ and its standard deviation for the three methods under linkage equilibrium and linkage disequilibrium in Case II are shown in Table 2. Under linkage equilibrium, the following patterns appear: the standard deviation increases as the MAF decreases, which shows that the zero-effect region is fitted better for common variants; the ISE$_0$ and its standard deviation are largest for OLS, while Smooth and SFDAT give smaller and similar results; the standard deviation of ISE$_0$ is larger for Smooth than for SFDAT when the number of causal SNPs in the gene region is small; the ISE$_0$ values are similar (that is, the goodness of fit is close) when the number of causal SNPs and the effect size are the same, regardless of the proportion of negative effects; and the ISE$_0$ and its standard deviation increase as the number of causal SNPs and the effect size increase. Compared with linkage equilibrium, ISE$_0$ under linkage disequilibrium decreases significantly in common-variant and mixed-variant regions and increases slightly in low-frequency-variant and rare-variant regions. This also confirms that the models fit the zero region of common variants better.
Table 3 displays the ISE$_1$ and its standard deviation for the three methods under linkage equilibrium and linkage disequilibrium in Case II. Under linkage equilibrium, the standard deviation increases as the MAF decreases, which indicates that all three methods fit the non-zero region better for common variants; the ISE$_1$ and its standard deviation are similar across the three methods as the effect size or the proportion of negative effects increases. Under linkage disequilibrium, the fitting results of OLS, Smooth and SFDAT for the non-zero regions of common variants, low-frequency variants and rare variants decrease slightly. For rare-variant regions without negative causal SNPs, the fitting results are also slightly lower than under linkage equilibrium, but when the rare-variant region contains negative causal SNPs, the ISE$_1$ under linkage disequilibrium is significantly lower than that under linkage equilibrium.
The PMSE and its standard deviation for the three methods under linkage equilibrium and linkage disequilibrium in Case II are shown in Table 4. In general, the functional genetic model fits better for regions containing rare variants than for regions containing common variants. The PMSE and its standard deviation are similar across the three methods. Both increase with the number of causal SNPs and with the effect size, while the proportion of negative effects has little influence. Compared with linkage equilibrium, the PMSE and its standard deviation under linkage disequilibrium decrease significantly for common, low-frequency and rare variants but increase slightly for mixed variants.
3.3. The Type I Error Rate
We consider five sample sizes (500, 1000, 1500, 2000 and 2500), and each sample size is simulated 10,000 times. Simulated traits are generated by the model
, where
,
. The significance levels are 0.05, 0.01, 0.001 and 0.0001.
Table 5 summarizes the type I error rates of OLS, Smooth and SFDAT under the linkage equilibrium and linkage disequilibrium simulations. Smooth and SFDAT control the type I error rate correctly across all sample sizes and significance levels, whereas the type I error rate of OLS is severely inflated, which means that using OLS for gene association analysis produces false positives that can lead to false associations. SFDAT appears conservative when the sample size and significance level are relatively small, but its type I error rate approaches the nominal level as both increase. Relative to linkage equilibrium, the type I error rates of the three methods under linkage disequilibrium increase for smaller sample sizes (500, 1000, 1500) and decrease for larger sample sizes (2000, 2500).
Taking Figure 2 and Figure 3 together with Table 1, Table 2, Table 3, Table 4 and Table 5, SFDAT is competitive with OLS and Smooth in locating non-causal SNP regions and identifying causal SNP regions, as well as in power, type I error rate and the other indicators. OLS has somewhat higher power, but its ISE$_0$, the standard deviation of its ISE$_0$ and its type I error rates are larger than those of Smooth and SFDAT, which indicates that association analysis with the OLS method may produce false associations and that zero effects may be recognized as non-zero effects. Although Smooth performs similarly to SFDAT in power, ISE$_1$ and PMSE, it cannot shrink sparse regions, which usually leads to noise fluctuations in real data applications and, in turn, to false associations. Therefore, considering its power and type I error rates together with the other indicators (CSR, DL, ISE$_0$, ISE$_1$ and PMSE), SFDAT is an effective method that maintains high power while substantially reducing false positives.
4. Application to O. sativa Data Set
We apply SFDAT to a data set of 413 diverse rice (O. sativa) varieties from 82 countries [39] to demonstrate its applicability beyond the above simulation. This data set covers six broad categories of phenotype: plant morphology-related traits; yield-related traits; seed and grain morphology-related traits; stress-related phenotypes; cooking, eating and nutritional-quality-related traits; and plant-development-related traits. Almost all phenotypes were measured in the field at Stuttgart, Arkansas, with two replicates per year during the growing season (May–October) in 2006 and 2007. For details of the experimental design, see Zhao et al. [39].
Zhao [
39] verified two candidate genes related to Culm habit,
D10 and
D14. So, Culm habit and its causal SNP
D10 and
D14 are chosen for the association analysis test for displaying the feasibility of SFDAT in the real application. The samples with missing values were eliminated and matched with the genotype data, the remaining 356 samples were for follow-up analysis. The genotype data contains a total of 44,100 markers on 12 chromosomes. The missing genotype was estimated, and SNPs with a minimum allele frequency of less than 0.005 were deleted. Finally, there were 32,185 SNPs left. Each chromosome is sequentially divided into several gene regions which contains 1000 SNPs, and merges the set of less than 1000 SNPs in the chromosome with the previous gene region, 24 gene regions are obtained finally. Causal SNP
D10 and
D14 are located in the 4th and 9th gene regions, respectively. The number of SNPs and the
p-value of the association analysis for each gene region are shown in
Table 6. Significant SNP loci were identified in the 3rd, 4th, 9th and 14th gene regions, which means that SFDAT can detect the gene regions where the causal SNPs are located. The whole computation took 37 s on an Intel Core 2.50 GHz CPU. This result indicates that gene-region association analysis based on the SFDAT method is practically and computationally feasible.
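The preprocessing described above (filtering by minor allele frequency, then cutting each chromosome into 1000-SNP gene regions and merging a short trailing set into the previous region) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the 0/1/2 genotype coding are assumptions, and the sketch handles one chromosome at a time.

```python
from typing import List, Tuple


def minor_allele_frequency(genotypes: List[int]) -> float:
    """MAF of one SNP, assuming genotypes are coded as 0/1/2 allele copies
    with no missing values (hypothetical coding, for illustration only)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)


def partition_regions(n_snps: int, size: int = 1000) -> List[Tuple[int, int]]:
    """Split SNP indices 0..n_snps-1 of one chromosome into consecutive
    half-open (start, end) regions of `size` SNPs.

    A trailing block with fewer than `size` SNPs is merged into the
    previous region, as described in the text.
    """
    if n_snps <= 0:
        return []
    regions = [(s, min(s + size, n_snps)) for s in range(0, n_snps, size)]
    if len(regions) > 1 and regions[-1][1] - regions[-1][0] < size:
        last = regions.pop()
        regions[-1] = (regions[-1][0], last[1])
    return regions
```

For example, under these assumptions a chromosome with 2500 retained SNPs yields two regions, the second holding 1500 SNPs, and an SNP would be dropped when `minor_allele_frequency(...) < 0.005`.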
Next, to compare the performance of OLS, Smooth and SFDAT on real data, florets per panicle, brown rice seed length and flowering time in Aberdeen are chosen as the phenotypes for subsequent analysis.
SSD1,
GS3 and
Hd1, which were verified by Zhao, are chosen as the respective candidate SNPs for these phenotypes; their locations on the chromosomes are shown in
Figure 4. The phenotype and genotype data were processed as above, leaving 383, 350 and 334 samples, respectively, for follow-up analysis. We regard a set of 1000 SNPs as a gene region to be tested. For each candidate SNP, we divide its surrounding region, containing a total of 8000 markers, into eight gene regions to be tested and ensure that these gene regions contain no SNP significantly associated with the phenotype other than the candidate SNP.
SSD1,
GS3 and
Hd1 are located in the 6th, 4th and 3rd of the eight gene regions to be tested, respectively. We perform association analysis on the eight gene regions for each candidate SNP. The associated gene regions detected by OLS, Smooth and SFDAT at different significance levels are shown in
Table 7. OLS, Smooth and SFDAT can all detect the gene region containing the candidate SNP at different significance levels. However, severe false associations appear with OLS and Smooth, which detect significant SNP loci in gene regions without the candidate SNP. In particular, in the association analysis of the eight gene regions around
GS3, OLS and Smooth report that all gene regions contain significant SNP loci when the significance level is equal to 0.05, 0.01 and 0.001. SFDAT shows a robust ability for accurate positioning. Compared with OLS and Smooth, SFDAT can shrink the regions without the candidate SNP and accurately identify the regions containing it. SFDAT does detect some gene regions that do not contain the candidate SNP at some significance levels, but these gene regions are close to the gene region where the candidate SNP is located.
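The comparison above amounts to asking, for each significance level, whether the region holding the candidate SNP is flagged and which other regions are flagged as well. A minimal sketch of that bookkeeping is given below; the function names are hypothetical and the p-values in the usage note are made up for illustration, not taken from Table 7.

```python
from typing import Dict, List, Sequence


def detected_regions(p_values: Sequence[float], alpha: float) -> List[int]:
    """Return the 1-based indices of regions whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values, start=1) if p < alpha]


def summarize(
    p_values: Sequence[float],
    candidate_region: int,
    alphas: Sequence[float] = (0.05, 0.01, 0.001),
) -> Dict[float, dict]:
    """For each significance level, report whether the candidate region is
    detected and which other regions are flagged (possible false associations)."""
    summary = {}
    for alpha in alphas:
        hits = detected_regions(p_values, alpha)
        summary[alpha] = {
            "candidate_detected": candidate_region in hits,
            "other_regions": [r for r in hits if r != candidate_region],
        }
    return summary
```

With illustrative p-values such as `[0.4, 0.2, 0.03, 1e-6, 0.008, 0.3, 0.6, 0.9]` and the candidate in region 4, `summarize` flags regions 3 and 5 as additional detections at 0.05, only region 5 at 0.01, and none at 0.001, which mirrors the pattern of extra detections lying next to the candidate region.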
5. Discussion
GWAS is a strategy that uses millions of SNPs across the genome as molecular genetic markers to conduct genome-wide association analysis and thereby identify the genetic variation that affects complex traits. In 2005,
Science magazine reported the first GWAS study of age-related macular degeneration [
40]. For more than a decade, research on genome-wide association analysis has grown rapidly, but most methods are aimed at common variants. In recent years, more and more scholars have begun to pay attention to rare variants. With the development of new-generation high-throughput sequencing technology, terabytes or more of sequence data are generated every day, and the data are gradually changing from discrete to dense. Such data can be regarded as continuous, which has motivated the use of functional data analysis methods. The above analysis shows that functional data analysis can handle both common and rare variants [
26]. In recent years, an increasing number of articles applying functional data analysis to genome-wide association analysis have been published [
26,
31,
32,
41,
42,
43,
44]. The association analysis method based on the functional linear regression model can estimate not only the additive and dominance effects of genes but also their epistatic effects [
31,
32], and it has been extended to the study of dynamic development and multiple traits. Li [
45] proposed a longitudinal functional data association test (LFDAT) based on the function–function regression model, which provides a feasible method for studying the formation and expression of longitudinal traits. Li [
46] put forward an integrative functional linear model for GWAS with multiple traits, which effectively accommodates the high dimensionality of SNPs and the correlation among multiple traits. However, current analysis methods operate only at the level of SNP gene regions, so they cannot determine whether individual SNPs within a gene region are associated with phenotypic traits.
In practice, gene loci are in linkage disequilibrium with each other. The simulation results of this paper also show that most of the indicators under linkage disequilibrium are better than those under linkage equilibrium; in particular, the power under linkage disequilibrium is much higher. Because there is already a considerable gap between the simulation results at a linkage disequilibrium measure of 0.2 and those under linkage equilibrium, this paper does not further explore simulations with stronger linkage disequilibrium.
Figure 5 shows the estimated effect function curves obtained by the OLS, Smooth and SFDAT methods. Firstly, from
Figure 5, it can be seen that the effect functions estimated by the OLS and Smooth methods fluctuate frequently. Compared with the OLS and Smooth methods, the effect function estimated by the SFDAT method shrinks the effect values whose true values are zero to around zero while retaining the non-zero part of the true effect. Together with the CSR and DL of SFDAT, this shows that the smoothing function can indeed remove some “noise”, which also explains why false associations are detected by the OLS and Smooth methods. Secondly, the Smooth and SFDAT methods give similar estimates of the non-zero genetic effects. This shows that the compression capability of the SFDAT method can be regarded as further compression of the effect values close to zero on top of the smooth estimate, while retaining the non-zero effects. Therefore, if the compression parameter is small and there are no effect values worth compressing in the gene region, the estimated effect function of the SFDAT method should be consistent with that of the Smooth method. This is why the Smooth and SFDAT methods have similar power, ISE
0, ISE
1, PMSE and type I error rates in computer simulations, although SFDAT still has relatively higher accuracy. In addition, in real-data applications, QTL analysis often takes a lot of time. Therefore, during GWAS, all gene regions can first be scanned quickly with SFDAT to find the gene regions where significant SNP loci are located, and QTL analysis can then be performed on these regions to locate the significant SNPs accurately, which can save considerable time and improve efficiency.
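The idea that SFDAT further compresses near-zero values of a smooth estimate while leaving clearly non-zero values alone can be illustrated with a toy example. The sketch below uses plain hard-thresholding as a stand-in; it is not the SFDAT penalty or estimator, and the function name, the parameter `lam` and the numbers are assumptions made only for illustration.

```python
from typing import List


def shrink_small_effects(smooth_estimate: List[float], lam: float) -> List[float]:
    """Toy stand-in for sparse shrinkage: set values with |value| <= lam to
    zero and leave larger values untouched (hard-thresholding)."""
    return [0.0 if abs(b) <= lam else b for b in smooth_estimate]


# Made-up "smooth" estimate: small noise around zero plus one non-zero bump.
smooth = [0.02, -0.03, 0.01, 0.45, 0.60, 0.40, -0.02, 0.03]

# A moderate threshold removes the near-zero fluctuations but keeps the bump.
sparse = shrink_small_effects(smooth, lam=0.05)

# A very small threshold changes nothing, so the result coincides with the
# smooth estimate, matching the remark about a small compression parameter.
unchanged = shrink_small_effects(smooth, lam=0.001)
```

In this toy setting `sparse` keeps only the three bump values, while `unchanged` equals `smooth`, which is the behaviour the paragraph attributes to SFDAT relative to Smooth.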
It must be pointed out that Luo et al. [
26] proposed a functional linear model (FLM) for QTL association analysis based on next-generation high-throughput sequencing (NGS) data. FLM, the smoothed functional linear model (SFLM) and eight other statistical methods (WSS, VT, RVT1, RVT2, PCA regression, multiple regression, simple regression and SKAT) were compared in six cases (homogeneous effects, i.e., all in the same direction; heterogeneous effects, i.e., in different directions; only low-frequency variants; all variants; different proportions of causal variants; different proportions of variants included). They found that FLM and SFLM perform similarly in each case and are clearly superior to the other methods, including the collapsing-based methods RVT1 and RVT2 and the kernel-based method SKAT. The FLM and SFLM proposed by Luo et al. [
26] differ from our OLS and Smooth methods in the loss function: the first term of the loss function for the OLS, Smooth and SFDAT methods is divided by the sample size, whereas the first term of the loss function for the FLM and SFLM methods is not. However, the idea of FLM and SFLM is similar to that of the OLS, Smooth and SFDAT methods, and SFLM and the Smooth method differ only by an adjustment of parameters. Although the OLS, Smooth and SFDAT methods have not been compared with traditional methods, the results of Luo et al. [
26] and of our simulations suggest that the SFDAT method is a good method. It should be noted that in this paper SFDAT searches single gene regions one by one. The method can be extended to the detection of multiple gene regions, which is our next research direction.
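The remark that SFLM and the Smooth method differ only by an adjustment of parameters can be made explicit with a short derivation. The notation below is assumed for illustration and is not taken from the paper: y_i is the phenotype of sample i, x_i(t) its genotype function, β(t) the effect function, P(β) a generic penalty (for example a roughness penalty) and λ, λ' the tuning parameters; covariates and the intercept are omitted.

```latex
% Loss with the first term divided by the sample size n (as for OLS, Smooth, SFDAT):
\begin{align}
\hat\beta_{A} &= \arg\min_{\beta}\;
  \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \int x_i(t)\,\beta(t)\,dt\Bigr)^{2}
  + \lambda\,P(\beta), \\
% Loss without the division by n (as for FLM, SFLM):
\hat\beta_{B} &= \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \int x_i(t)\,\beta(t)\,dt\Bigr)^{2}
  + \lambda'\,P(\beta).
\end{align}
% Multiplying the first objective by n > 0 does not change its minimizer:
\begin{equation}
n\Bigl[\frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \int x_i(t)\,\beta(t)\,dt\Bigr)^{2}
  + \lambda\,P(\beta)\Bigr]
= \sum_{i=1}^{n}\Bigl(y_i - \int x_i(t)\,\beta(t)\,dt\Bigr)^{2}
  + n\lambda\,P(\beta),
\end{equation}
% so \hat\beta_{A} = \hat\beta_{B} whenever \lambda' = n\lambda.
```

Under these assumptions the two formulations give the same estimate once the tuning parameter is rescaled by n, and they coincide exactly when no penalty is used (λ = λ' = 0).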
When applying the functional linear regression model to gene association analysis, we mainly convert it into a classical linear regression model for parameter estimation and statistical testing. However, gene variants include both common and rare variants, and dedicated methods exist for rare variant detection [
47]. Statistical tests designed specifically for the functional linear regression model have also been proposed by some scholars [
48]. Therefore, how to better perform statistical tests in gene association analysis using functional linear regression models remains to be studied further.