1. Introduction
Genetic analysis of rare variants is considered one of the most important ways to account for the portion of genetic variation that has not yet been explained [1]. Previously, the lack of catalogs for inferring the genotypes of rare variants and the high cost of sequencing made in-depth studies of rare variants impractical [2,3]. The development of high-throughput sequencing technology [4] has since enabled scientists to obtain SNP data, including a large amount of data on rare variants, cheaply and efficiently [5,6]. However, many existing tools and methods were designed for common variants, so efficient and practical tools for rare-variant association analysis are still lacking. At present, single-marker association analysis is the most commonly used method of gene association analysis. If this method is applied directly to rare variants, however, loci with a moderate or low effect are unlikely to be found [5,6]: the effect of a single rare-variant locus is small and not easily detected, so many valuable associated loci will be missed.
To better detect sites with weak effects, several methods have been proposed that make weak effects more significant by concentrating the association information of a whole gene region: (a) collapsing methods [7,8,9], which compress multiple loci directly into a new variable, so that associated rare variants with weak effects distributed across multiple loci are aggregated and become easier to detect; (b) kernel-based methods [10,11,12]. When the variance of a set of random variables is 0, all variables in the set take the same value, so a kernel-based method only needs to test whether the variance component of the effects corresponding to all genotype variables in the region is 0; (c) methods based on functional data analysis [13,14,15,16]. Functional data analysis converts the discrete loci into a continuous variable through basis functions, so only the coefficients of the effect function corresponding to that continuous variable need to be tested in association analysis. Another strategy is to examine multiple loci simultaneously and determine the significance of each locus [17,18]; this is more effective than single-marker association analysis because it considers the interrelationships between loci. The simplest such method uses a multivariate linear model as the test model for the multi-locus test [17]. However, when only a small fraction of the loci included in the test are associated, the many degrees of freedom contributed by the unassociated loci lead to a loss of power.
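As an illustration of the collapsing idea in (a), the following minimal sketch compresses the rare loci of a region into a single variable per individual. It is only one possible form of collapsing and is not taken from the cited papers; the function names, the MAF threshold of 0.01, and the toy genotypes are assumptions made for the example.

```python
def carrier_indicator(genotypes, mafs, maf_threshold=0.01):
    """Return 1 if the individual carries a minor allele at any rare locus, else 0.

    genotypes: minor-allele counts (0, 1 or 2) at each locus of the region.
    mafs: minor allele frequency of each locus.
    maf_threshold: hypothetical cut-off below which a locus counts as rare.
    """
    rare = [g for g, f in zip(genotypes, mafs) if f < maf_threshold]
    return 1 if any(g > 0 for g in rare) else 0


def weighted_burden(genotypes, weights):
    """Collapse a region into a weighted sum of minor-allele counts."""
    return sum(w * g for w, g in zip(weights, genotypes))


# Toy data (assumed): 4 individuals, 4 loci, the third locus is common.
mafs = [0.005, 0.002, 0.2, 0.008]
genotype_matrix = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 2, 1],
    [0, 0, 0, 0],
]

indicators = [carrier_indicator(row, mafs) for row in genotype_matrix]

# Weight 1 for rare loci and 0 for common loci (an assumed weighting scheme).
weights = [1.0 if f < 0.01 else 0.0 for f in mafs]
burdens = [weighted_burden(row, weights) for row in genotype_matrix]
```

The collapsed variable (`indicators` or `burdens`) would then replace the individual loci as the single predictor in an association test for the region.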
The above-mentioned methods, whether based on aggregating the information of a gene region or on multi-locus analysis, each have advantages and disadvantages, and continuous improvement by researchers has steadily overcome their shortcomings and raised their performance. The single-gene-region approach can aggregate small effects, while the multi-locus approach can improve power by considering the interrelationships among multiple variables. By combining the two, we can expect to obtain a multi-region analysis method with the advantages of both. In addition, in practice a phenotype tends to be controlled by a few gene regions. Some of these regions have apparent effects and others have weak effects. A strong effect is easy to recognize, but a weak effect can easily be concealed by a stronger one, so that the corresponding part of the phenotype may even be attributed to the regions with apparent effects.
Some researchers have already studied the combination of the two ideas, i.e., aggregating the genetic information within gene regions and using the interrelationships among multiple gene regions. One example is Turkmen and Lin [19], who extended the statistical tests PDT (Pedigree Disequilibrium Test [20]) and FBAT (Family-Based Association Test [21]) and proposed block analysis methods. First, the gene sequence to be analyzed is divided into blocks in a certain way, under the assumption that the variants within a block are interdependent while the blocks are mutually independent; second, the PDT or FBAT method is used to analyze the loci of each block; third, the results within each block are summarized by means of the sum of squares of standardized variances. After the statistic of each block is obtained, the sum of squares of these statistics is calculated. After these two rounds of aggregation, a statistic following a chi-square distribution is obtained, which serves as the association statistic for the large gene region composed of the blocks. The other example is Ayers and Cordell [22], who improved the Group Lasso and Sparse Group Lasso methods [23], enabling them to distinguish common and rare variants within a group. This method can test multiple groups at the same time, treating each group as a variable and considering the relationships between groups.
The functional-data-based method for a single gene region expresses the high-density genetic markers as functional data through integration and analyzes the region with a linear model. Many studies have shown that this is an effective way to improve the power of gene association analysis [13,14,24]. If the functional linear model considers multiple gene regions at the same time, it can not only improve power by considering the interactions between gene regions, but also separate the effects of the individual gene regions through the properties of the linear model, making gene regions with weak effects more apparent and easier to find. Therefore, we explore gene association analysis of multi-gene regions based on functional data analysis, aiming to find a method with higher power and better detection of associated loci with weak effects, and to provide a reference for researchers interested in this field.
4. Discussion
In this paper, a total of five analysis methods are proposed for the analysis of multi-gene regions. They fall into three categories: the common multi-gene region method, the multi-gene region weighted method, and the multi-gene region loci weighted method. The Step method combines two ideas: aggregating the genetic information within each region and exploiting the relationships among gene regions. The simulation results showed that the power of the Step method is higher than that of the FLM and SLoS methods for a single gene region, and even higher than that of the region-weighted SLoS method. This means that considering the relationships among gene regions improves power. The multi-gene region loci weighted method is the most complex but also the most effective: in the simulations, its power is much higher, and its false positive rate much lower, than those of the unweighted single-gene region analysis method. For the SLoS method, the common multi-gene region method performs only slightly better in simulation than the common single-gene region method, which nevertheless shows that even the simplest multi-gene region method can improve power relative to single-gene region analysis. In general, the Step and LW-Step methods give the better simulation results for association analysis of multi-gene regions. With the degrees of freedom of the test statistic modified, the Multi-SLoS, W-SLoS and LW-SLoS methods are also feasible for association analysis of multi-gene regions. With the multi-gene region analysis methods, the association results for common-variant multi-gene regions are better than those for rare-variant multi-gene regions.
The multi-gene region loci weighted method extends the weighting idea of Belonogova et al. [32]. However, our work differs from theirs in two respects: first, the weighting idea was not applied to multi-gene regions in their paper; second, the coefficients of the functional data are estimated by a smoothing method in our paper and by the least squares method in theirs.
A Fourier basis is used to fit the genotype data in the functional expansion. Previous studies have compared the Fourier basis with the spline basis and obtained similar results (the papers on functional gene-association analysis cited above include such comparisons). Some people question how genes can be represented by periodic Fourier bases. A possible explanation is that both bases extract the same amount of information from the gene regions they summarize; across many comparisons, the results of the two bases differed only slightly. Even where the Fourier basis appeared to perform better in some cases, the difference was not large enough for the authors of those papers to conclude that it is the better choice. In practice both bases are used, and either can be chosen. We selected the Fourier basis because it requires choosing only the number of basis functions, whereas the spline basis requires choosing both the number of basis functions and their order, so the Fourier basis is much simpler to specify. The Step and LW-Step methods require the Fourier basis.
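The following sketch illustrates why the Fourier basis needs only one tuning choice, the number of basis functions. It evaluates the basis on positions rescaled to [0, 1] and fits one individual's genotypes by ordinary least squares. Note that least squares is used here only for simplicity and is not the smoothing estimator used in this paper; the function names and the toy positions are assumptions made for the example.

```python
import math


def fourier_basis(t, n_basis):
    """Evaluate the first n_basis Fourier basis functions at t in [0, 1].

    Order: 1, sin(2*pi*t), cos(2*pi*t), sin(4*pi*t), cos(4*pi*t), ...
    """
    values = [1.0]
    k = 1
    while len(values) < n_basis:
        values.append(math.sin(2 * math.pi * k * t))
        if len(values) < n_basis:
            values.append(math.cos(2 * math.pi * k * t))
        k += 1
    return values


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_genotype_function(positions, genotypes, n_basis):
    """Least-squares Fourier coefficients for one individual's genotypes."""
    lo, hi = min(positions), max(positions)
    t = [(p - lo) / (hi - lo) for p in positions]  # rescale positions to [0, 1]
    phi = [fourier_basis(ti, n_basis) for ti in t]
    # Normal equations: (Phi^T Phi) c = Phi^T g
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(n_basis)]
            for i in range(n_basis)]
    rhs = [sum(row[i] * g for row, g in zip(phi, genotypes))
           for i in range(n_basis)]
    return solve_linear(gram, rhs)


# Toy example (assumed): six loci with physical positions in kb.
positions = [0, 10, 35, 60, 80, 100]
coefs = fit_genotype_function(positions, [1, 1, 1, 1, 1, 1], n_basis=3)
```

The basis takes the same values at t = 0 and t = 1, which is the periodicity that the question above refers to. A spline basis would require an order and a set of knots in addition to the number of basis functions.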
Regarding the compression and smoothing parameters, this paper does not necessarily choose the values that give each method its best performance; instead, the parameters are selected according to the most common standards and methods (such as 10-fold cross-validation). A specific selection strategy is as follows: first, determine the selection criterion; then determine the approximate range of the compression and smoothing parameters according to the method and the actual situation; finally, use a computer program to search for the optimal parameters. This process must be repeated when processing real data, because the composition of the gene regions in real data varies and the parameters must change accordingly. In the simulation, however, the gene regions in a given scenario are composed in the same way. Therefore, to save computing resources, a suitable parameter value is selected directly in this paper.
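The three-step strategy described here (criterion, candidate range, computer search) can be sketched as a grid search with 10-fold cross-validation. The sketch below uses a one-predictor ridge regression as a stand-in model, since the estimators of this paper are not reproduced here; the function names, the candidate grid, and the toy data are assumptions made for the example.

```python
def ridge_slope(x, y, lam):
    """Slope of a no-intercept ridge regression with penalty lam (stand-in model)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_error(x, y, lam, n_folds=10):
    """Mean squared prediction error of ridge_slope under n_folds-fold CV."""
    n = len(x)
    total = 0.0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        beta = ridge_slope([x[i] for i in train_idx],
                           [y[i] for i in train_idx], lam)
        total += sum((y[i] - beta * x[i]) ** 2 for i in test_idx)
    return total / n


def select_parameter(x, y, grid, n_folds=10):
    """Return the candidate in grid with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, n_folds))


# Toy data (assumed): a noiseless linear relation, so no shrinkage is optimal.
x = [float(i) for i in range(1, 31)]
y = [2.0 * v for v in x]
grid = [0.0, 0.1, 1.0, 10.0]  # assumed candidate range
best = select_parameter(x, y, grid)
```

With real data the same search would be rerun for each data set, because the best value depends on the composition of the gene regions.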
One might ask why we do not simply analyze the regions as a single larger gene region, which functional data analysis can accommodate merely by increasing the number of nodes and the number of basis functions, instead of using the multi-gene region approach. One reason is that a large gene region can be divided into smaller gene regions, and the multi-gene region method can then give more detailed test results. Another reason is that, as mentioned earlier, because it considers the interrelationships between gene regions, the multi-gene region method can better identify regions with relatively weak effects and has higher power. Thus, compared to single-gene region testing, multi-gene region testing can detect gene regions whose effect is smaller or overlaps with that of other regions.
We focused on independent SNPs for common variants. Rare variants come from the rare haplotype dataset (the R package SKAT [33] contains such a dataset, produced by simulation from allele frequency and linkage disequilibrium information in European populations) with a length of 200 kb. However, we also simulated linkage disequilibrium in the common-variant multi-gene region of Scenario I, using the common multi-gene region method and the multi-gene region weighted method. When the r² measure of linkage disequilibrium is between 0.25 and 0.64, the power of the SLoS and Multi-SLoS methods is higher under linkage disequilibrium than under linkage equilibrium, but their false positive rates also increase significantly. The power of the Step method decreases slightly, and its false positive rate remains largely unchanged. The power and false positive rate of the FLM method do not change significantly. This indicates that linkage disequilibrium among loci makes the SLoS and Multi-SLoS methods more likely to misidentify non-associated gene regions; their association analysis between gene regions and quantitative traits is therefore unstable under linkage disequilibrium, whereas the Step and FLM methods are more stable. The reason may lie in the parameter estimation of the SLoS and Multi-SLoS methods. Furthermore, we compared the simulation results of forward selection and backward selection and found that the power of forward selection is slightly higher than that of backward selection, but its false positive rate is far greater. Finally, we studied the distribution of allele frequencies of the loci. The results show that power is higher when the allele frequencies follow a normal distribution than when they follow a uniform distribution, which may be because the MAFs under the normal distribution are mostly smaller than those under the uniform distribution. According to the expression of the effect function, the effect size is then larger, which makes the locus easier to detect.
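For reference, the r² measure of linkage disequilibrium between two biallelic loci can be computed from haplotype data as D² / (p_A (1 − p_A) p_B (1 − p_B)), where D = p_AB − p_A p_B. The sketch below uses this standard definition; the function name and the toy haplotypes are assumptions made for the example and are not the simulated data of this paper.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation r^2 between two biallelic loci.

    hap_a, hap_b: lists of 0/1 alleles, one entry per haplotype, same order.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # allele frequency, locus A
    p_b = sum(hap_b) / n                                   # allele frequency, locus B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic locus")
    return d * d / denom


# Toy haplotypes (assumed): eight haplotypes at two loci.
locus_a = [1, 1, 1, 0, 0, 0, 0, 0]
locus_b = [1, 1, 0, 0, 0, 0, 0, 0]
r2 = ld_r2(locus_a, locus_b)  # 5/9, inside the 0.25 to 0.64 range discussed above
```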
The model we considered is idealized: it is a study under basic assumptions only, without considering population structure or related individuals, and with no missing genotypes. In practice, missing genotypes are inevitable. In that case, statistical methods can be used to fill in the missing data before the discrete genetic data are converted into continuous functions. In future research, we will consider incorporating models of population structure and relatedness.