Simulation Research on the Methods of Multi-Gene Region Association Analysis Based on a Functional Linear Model

Li, Shijing; Zhou, Fujie; Shen, Jiayu; Zhang, Hui; Wen, Yongxian

doi:10.3390/genes13030455

¹ College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China

² Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Genes 2022, 13(3), 455; https://doi.org/10.3390/genes13030455

Submission received: 30 January 2022 / Revised: 26 February 2022 / Accepted: 27 February 2022 / Published: 2 March 2022

(This article belongs to the Section Bioinformatics)


Abstract

Genome-wide association analysis is an important approach for identifying genetic variants associated with complex traits. Complex traits are affected not only by single gene loci but also by the interaction of multiple gene loci. Studies of the association between gene regions and quantitative traits are of great significance in revealing the genetic mechanisms of biological development. There have been many studies on single-gene region association analysis, but applications of functional linear models to multi-gene region association analysis remain limited. In this paper, a functional multi-gene region association analysis test method is proposed based on the functional linear model. The test method is studied by computer simulation from three directions: the common multi-gene region method, the multi-gene region weighted method, and the multi-gene region loci weighted method. The following conclusions are obtained from the simulations: (a) the functional multi-gene region association analysis test method has higher power than the functional single-gene region association analysis test method; (b) the functional multi-gene region weighted method performs better than the common functional multi-gene region method; (c) the functional multi-gene region loci weighted method performs best among the three directions; (d) the Step method and the Multi-gene region loci weighted Step method perform best in general for multi-gene regions. The functional multi-gene region association analysis test method can theoretically provide a feasible approach to studying complex traits affected by multiple genes.

Keywords: functional linear model; multi-gene regions; association analysis; region weighted; loci weighted

1. Introduction

Genetic analysis of rare variants is considered one of the most important ways to account for the genetic variation that has not yet been explained [1]. Although the lack of catalogs for inferring the genotypes of rare variants and the high cost of sequencing previously made in-depth studies of rare variants impossible [2,3], the development of high-throughput sequencing technology [4] has enabled scientists to obtain SNP data, which contain a large amount of information on rare variants, cheaply and efficiently [5,6]. However, many earlier tools and methods were designed for common variants, so efficient and practical tools for rare-variant association analysis are still lacking. At present, single-marker association analysis is the most commonly used method of gene association analysis. If this method is applied directly to rare variants, however, its limitations make it impossible to find loci with moderate or low effects [5,6]. The effect of a rare-variant locus is small and not easily detected, so single-marker association analysis will miss many valuable associated loci.

To better find loci with weak effects, several methods have been proposed that make weak effects more detectable by concentrating the association information of a whole gene region: (a) collapsing methods [7,8,9], which compress multiple loci directly into a new variable, so that associated rare variants with weak effects distributed over multiple loci are aggregated and become easier to find; (b) kernel-based methods [10,11,12]. When the variance of a set of random variables is 0, all the variables take the same value, so a kernel-based method only needs to test whether the variance component of the effects corresponding to all genotype variables in the whole region is 0; (c) methods based on functional data analysis [13,14,15,16]. Functional data analysis converts the discrete loci into a continuous variable through basis functions, and only the coefficients of the effect function corresponding to that continuous variable need to be tested in association analysis. Another strategy is to examine multiple loci at the same time and determine the significance of each locus [17,18]; this is more effective than single-marker association analysis because it considers the interrelationships between loci. The simplest such method uses the multivariate linear model as the test model for the multi-locus test [17]. However, when only a small part of the loci included in the test are associated, the many degrees of freedom spent on the unassociated loci lead to a loss of power.

The above-mentioned methods, whether based on aggregating gene region information or on multi-locus analysis, have their advantages and disadvantages, and through continuous improvement their shortcomings have been progressively overcome and their performance has improved. The single-gene region approach can aggregate small effects, while the multi-locus approach can improve power by considering the interrelationships between multiple variables. If we combine the two, we can expect to obtain a multi-region analysis method with the advantages of both. In addition, in practice a phenotype tends to be controlled by a few gene regions. Some of these gene region effects are strong and some are weak; a strong effect is easy to recognize, but a weak effect can easily be concealed by a stronger one, so that the corresponding part of the phenotype is attributed to the gene regions with apparent effects.

At present, some scholars have studied the combination of the two ideas, i.e., aggregating genetic information within gene regions and using the interrelationships among multiple gene regions. One approach is that of Turkmen and Lin [19], who extended the PDT (Pedigree Disequilibrium Test [20]) and FBAT (Family-Based Association Test [21]) statistics and proposed block analysis methods. First, the gene sequence to be analyzed is divided into blocks, assuming that the variants within a block are interdependent but that the blocks are mutually independent; second, the PDT or FBAT method is used to analyze the loci of each block; third, the results within each block are aggregated by a sum of squares of standardized variances. After the statistic of each block is obtained, the sum of squares of these statistics is calculated again. After the information has been aggregated twice, a statistic following a chi-square distribution is obtained, which is used as the association statistic for the large gene region composed of these blocks. The other approach is that of Ayers and Cordell [22], who improved the Group Lasso and Group Sparse Lasso methods [23], enabling them to distinguish common and rare variants within a group. This method can test multiple groups at the same time, treating each group as a variable and considering the relationships between different groups.

The analysis method of a single gene region based on functional data is a method to express the high-density genetic markers as functional data through integration and analyze the region through a linear model. Many experts have shown that this is an effective way to improve the power of gene association [13,14,24]. If the functional linear model considers multiple gene regions at the same time, it can not only improve the power by considering the interaction between multi-gene regions, but also isolate the effects of each gene region through the characteristics of the linear model, making the gene regions with weak effects more obvious and easier to find. Therefore, we will explore the method of gene association analysis of multi-gene regions based on functional data analysis, hoping to find a method with higher power and better detection of association loci with weak effects, and provide some references for researchers interested in this field in the future.

2. Materials and Methods

2.1. Statistic Model

2.1.1. Common Multi-Gene Region Method

Suppose that there are n individuals in a genetic population. The genome region [0, T] contains SNPs at the ordered positions t_{1} < t_{2} < \dots < t_{M}, and the genetic association analysis is carried out without considering group structure. Let y_{i} be the quantitative trait value of the i-th individual. Since the population structure of the sample is not considered, the traditional linear genetic model can be expressed as

y_{i} = μ_{0} + \sum_{j = 1}^{M} x_{i j} β_{j} + ε_{i}, i = 1, 2, \dots, n .

(1)

where μ_{0} is the overall mean of the model; x_{i j} is the genotype code of the i-th individual at the j-th marker (if A and a represent a pair of alleles, x_{i j} is 2 when the genotype is AA, 1 when it is Aa, and 0 when it is aa); β_{j} represents the effect coefficient of the j-th genetic marker; ε_{i} ~ N (0, σ^{2}), where σ^{2} is the environmental (residual) variance; and M is the number of genetic markers. As the number of genetic markers increases, the degrees of freedom of the model increase and the multicollinearity among variables becomes more serious, eventually reducing estimation accuracy and power. This is especially true when the genetic markers are low-frequency variants. When the discrete variants are at ultrahigh density, the variants in an interval can be treated as continuous, and the functional linear model (FLM) can be used instead of the multiple linear genetic model:

y_{i} = μ_{0} + \int_{0}^{T} X_{i} (t) β (t) d t + ε_{i}, i = 1, 2, \dots, n .

(2)

where the ε_{i} are independent and normally distributed with zero mean and variance σ^{2}, and [0, T] represents the genomic region under consideration. The discrete genotypes x_{i j} in Equation (1) are converted into a continuous genotype function X_{i} (t) in Equation (2); X_{i} (t) is now a random function, and the marker effects β_{j} are likewise converted into a continuous genetic effect function β (t).

Step

The functional linear genetic model of single-gene region is generally in the following form

y_{i} = μ_{0} + \int_{0}^{T} X_{i} (t) β (t) d t + ε_{i}, i = 1, 2, 3, \dots, n .

(3)

When the single-gene region is extended to the multi-gene region, the original linear genetic model becomes the following form

y_{i} = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i} (t) β_{p} (t) d t + ε_{i}, i = 1, 2, 3, \dots, n .

(4)

Suppose that there are P gene regions, and that the p-th gene region (p = 1, 2, \dots, P) contains SNPs at the positions t_{1}^{p} < t_{2}^{p} < \dots < t_{M}^{p}. Every gene region is rescaled to [0, T]: the lower bound of its interval is mapped to zero and the upper bound to T. According to the method of functional data analysis, a set of basis functions φ_{p 1} (t), φ_{p 2} (t), \dots, φ_{p K_{G}} (t) and coefficients d_{p i 1}, d_{p i 2}, \dots, d_{p i K_{G}} can be used to expand X_{p i} (t) as

X_{p i} (t) = \sum_{k = 1}^{K_{G}} d_{p i k} φ_{p k} (t), p = 1, 2, \dots, P; i = 1, 2, \dots, n .

(5)

Similarly, a set of basis functions ϕ_{p 1} (t), ϕ_{p 2} (t), \dots, ϕ_{p K_{β}} (t) and coefficients b_{p 1}, b_{p 2}, \dots, b_{p K_{β}} can be used to expand β_{p} (t) as

β_{p} (t) = \sum_{k^{'} = 1}^{K_{β}} b_{p k^{'}} ϕ_{p k^{'}} (t), p = 1, 2, \dots, P .

(6)

Using the expansions of X_{p i} (t) and β_{p} (t), model (4) becomes

y_{i} = μ_{0} + \sum_{p = 1}^{P} \sum_{k = 1}^{K_{G}} \sum_{k^{'} = 1}^{K_{β}} d_{p i k} \int_{0}^{T} φ_{p k} (t) ϕ_{p k^{'}} (t) d t \cdot b_{p k^{'}} + ε_{i}, i = 1, 2, 3, \dots, n .

(7)

Let d_{p i} = {[d_{p i 1}, d_{p i 2}, \dots, d_{p i K_{G}}]}_{K_{G} \times 1}^{T}, b_{p} = {[b_{p 1}, b_{p 2}, b_{p 3}, \dots, b_{p K_{β}}]}_{K_{β} \times 1}^{T}, and

Φ_{p} = {[\begin{matrix} \int_{0}^{T} φ_{p 1} (t) ϕ_{p 1} (t) d t & \dots & \int_{0}^{T} φ_{p 1} (t) ϕ_{p K_{β}} (t) d t \\ ⋮ & \dots & ⋮ \\ \int_{0}^{T} φ_{p K_{G}} (t) ϕ_{p 1} (t) d t & \dots & \int_{0}^{T} φ_{p K_{G}} (t) ϕ_{p K_{β}} (t) d t \end{matrix}]}_{K_{G} \times K_{β}}

Then model (7) becomes

y_{i} = μ_{0} + \sum_{p = 1}^{P} d_{p i}^{T} Φ_{p} b_{p} + ε_{i}, i = 1, 2, 3, \dots, n .

(8)

where φ_{p 1} (t), φ_{p 2} (t), \dots, φ_{p K_{G}} (t) and ϕ_{p 1} (t), ϕ_{p 2} (t), \dots, ϕ_{p K_{β}} (t) are sets of orthonormal basis functions. Usually, we choose the same basis functions for both sets, so that K_{G} = K_{β} and Φ_{p} reduces to the identity matrix. Therefore, the above genetic model can be further simplified as

y_{i} = μ_{0} + \sum_{p = 1}^{P} d_{p i}^{T} b_{p} + ε_{i}, i = 1, 2, 3, \dots, n .

(9)

At this point, the genetic model has been transformed into the ordinary multiple linear regression model of Equation (9), for which variables can be screened by stepwise regression [25,26,27]. Because only the relationship between whole gene regions and the trait is of interest, and d_{p i} carries the genetic information of the p-th gene region, each d_{p i} is added to or removed from the model as a single “variable”. The P gene regions are thus screened as P variables.
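The reduction from model (8) to model (9) rests on the matrix Φ_{p} being the identity when both expansions use the same orthonormal basis. The following is a minimal numerical sketch of that fact, assuming an orthonormal Fourier basis on [0, T]; the basis choice and the helper names `fourier_basis` and `gram_matrix` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fourier_basis(t, K, T=1.0):
    """Evaluate the first K orthonormal Fourier basis functions on [0, T].

    Returns an array of shape (len(t), K). (Illustrative basis choice.)
    """
    t = np.asarray(t, dtype=float)
    cols = [np.full_like(t, 1.0 / np.sqrt(T))]  # constant function
    j = 1
    while len(cols) < K:
        w = 2.0 * np.pi * j / T
        cols.append(np.sqrt(2.0 / T) * np.sin(w * t))
        if len(cols) < K:
            cols.append(np.sqrt(2.0 / T) * np.cos(w * t))
        j += 1
    return np.column_stack(cols)

def gram_matrix(K, T=1.0, n_grid=2001):
    """Approximate Phi[k, k'] = integral over [0, T] of phi_k(t) * phi_k'(t)."""
    t = np.linspace(0.0, T, n_grid)
    B = fourier_basis(t, K, T)
    # Trapezoidal quadrature weights on a uniform grid.
    w = np.full(n_grid, T / (n_grid - 1))
    w[0] *= 0.5
    w[-1] *= 0.5
    return B.T @ (B * w[:, None])

Phi = gram_matrix(K=5)

# With Phi equal to the identity, d^T Phi b and d^T b coincide.
rng = np.random.default_rng(1)
d = rng.normal(size=5)
b = rng.normal(size=5)
lhs = d @ Phi @ b
rhs = d @ b
```

Under these assumptions `Phi` is numerically the 5 × 5 identity, so the term d_{p i}^{T} Φ_{p} b_{p} in model (8) and the term d_{p i}^{T} b_{p} in model (9) agree. With a non-orthonormal basis (for example B-splines) this simplification would not hold as written.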

There are three ways of screening variables in regression: forward selection, backward selection, and combined forward and backward (stepwise) selection. Here, backward selection is performed: all gene regions to be analyzed are put into the model at the beginning, and gene regions are then removed step by step until a reduced model is obtained. The AIC (Akaike Information Criterion) is used at each step to determine which gene region should be removed from the model:

\mathrm{AIC} = n \ln (\mathrm{RSS} / n) + 2 K,

(10)

where RSS represents the residual sum of squares of the current model, and K represents the number of unknown parameters, that is, the total number of elements of all d_{p i} in the model. To decide which d_{p i} to remove, we tentatively delete each d_{p i} in the current model in turn and calculate the AIC of the model formed by the remaining gene regions. We take the model with the minimum AIC and proceed to the next step. These steps are repeated until deleting any gene region no longer lowers the AIC, which yields the reduced model. Finally, the partial F test commonly used in multiple linear regression is applied to each d_{p i} remaining in the model, and the corresponding p-value is taken as the measure of association between the gene region and the quantitative trait.
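The backward selection described above can be sketched as follows. This is a minimal illustration, assuming each gene region is supplied as an n × K_G block of basis coefficients and that the model is fitted by ordinary least squares with an intercept; the function names and the simulated data are hypothetical. The partial F statistic is returned with its degrees of freedom; the p-value would then be taken from the upper tail of the corresponding F distribution (not computed here).

```python
import numpy as np

def rss(y, blocks):
    """Residual sum of squares of a least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y))] + list(blocks))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def aic(y, blocks):
    """AIC = n * ln(RSS / n) + 2 * K, with K the number of block columns."""
    n = len(y)
    k = sum(b.shape[1] for b in blocks)
    return n * np.log(rss(y, blocks) / n) + 2 * k

def backward_select(y, regions):
    """Remove whole regions one at a time while doing so lowers the AIC."""
    kept = list(range(len(regions)))
    current = aic(y, [regions[p] for p in kept])
    while kept:
        # Tentatively delete each remaining region and record the AIC.
        trials = [(aic(y, [regions[q] for q in kept if q != p]), p)
                  for p in kept]
        best, drop = min(trials)
        if best >= current:
            break  # no deletion lowers the AIC any further
        kept.remove(drop)
        current = best
    return kept

def partial_f(y, regions, kept, p):
    """Partial F statistic for region p within the reduced model `kept`."""
    n = len(y)
    full = [regions[q] for q in kept]
    reduced = [regions[q] for q in kept if q != p]
    df1 = regions[p].shape[1]
    df2 = n - sum(b.shape[1] for b in full) - 1
    rss_full = rss(y, full)
    f_stat = ((rss(y, reduced) - rss_full) / df1) / (rss_full / df2)
    return f_stat, df1, df2

# Simulated example (hypothetical data): only region 0 affects the trait.
rng = np.random.default_rng(0)
n, P, K = 200, 4, 3
regions = [rng.normal(size=(n, K)) for _ in range(P)]
y = 1.0 + regions[0] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

kept = backward_select(y, regions)
f_stat, df1, df2 = partial_f(y, regions, kept, 0)
```

In this simulated setting the truly associated region 0 is retained and has a large partial F statistic; regions without effect may or may not be removed, depending on the sample.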

Multi-SLoS

Lin et al. [28] proposed a locally sparse functional linear model (the SLoS method, Smooth and Locally Sparse method). By adding the fSCAD (functional Smoothly Clipped Absolute Deviation) penalty to the smoothing penalty term, the functional linear model gains the ability to identify the sparse part of the estimated effect function \hat{β} (t) and to shrink the estimate there to null. That paper uses a single region as its example, but a model for multiple regions was also proposed:

y_{i} = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i} (t) β_{p} (t) d t + ε_{i}, i = 1, 2, 3, \dots, n .

(11)

According to the description in that paper, to estimate the corresponding β (t) = {(β_{1} (t), β_{2} (t), \dots, β_{P} (t))}^{T}, we only need to minimize the corresponding loss function:

\begin{aligned} Q (β, μ_{0}) = {} & \frac{1}{n} \sum_{i = 1}^{n} {[y_{i} - μ_{0} - \sum_{p = 1}^{P} \int_{0}^{T} X_{p i} (t) β_{p} (t) d t]}^{2} \\ & + \sum_{p = 1}^{P} γ_{p} {‖ D^{m} β_{p} ‖}^{2} + \sum_{p = 1}^{P} \frac{M}{T} \int_{0}^{T} p_{λ_{p}}^{'} (| β_{p} (t) |) d t \end{aligned}

(12)

Specific algorithms can be found in Lin et al. [28]. To distinguish it from the SLoS method for a single region, we refer to this version as the Multi-SLoS method. The Multi-SLoS method is locally sparse, that is, it can identify where β (t) is null and where it is non-null. We hope to use this local sparsity for gene association testing, which raises the problem of testing the significance of individual gene regions in model (11) fitted by the Multi-SLoS method. The following describes in detail how to conduct the test, based on the work of Lin et al. [28], for multi-gene region association analysis.
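To make the sparsity-inducing term of the loss function more concrete, the following sketch evaluates a SCAD-type penalty pointwise and integrates it over the region. It assumes that the penalty in Equation (12) is built on the standard SCAD penalty of Fan and Li with the conventional constant a = 3.7; the source defers these details to Lin et al. [28], so the function names, the default a, and the treatment of M as a given constant are assumptions of this illustration.

```python
import numpy as np

def scad(u, lam, a=3.7):
    """Standard SCAD penalty evaluated at |u| (assumed form, a > 2)."""
    u = np.abs(np.asarray(u, dtype=float))
    linear = lam * u                                        # |u| <= lam
    quad = (2 * a * lam * u - u**2 - lam**2) / (2 * (a - 1))  # lam < |u| <= a*lam
    flat = lam**2 * (a + 1) / 2                             # |u| > a*lam
    return np.where(u <= lam, linear, np.where(u <= a * lam, quad, flat))

def fscad_term(beta_on_grid, t_grid, lam, M, a=3.7):
    """(M / T) * integral of scad(|beta(t)|) dt, by the trapezoidal rule.

    beta_on_grid holds beta(t) on t_grid; M is the constant multiplying
    the integral in the loss function and is passed in as given.
    """
    vals = scad(beta_on_grid, lam, a)
    dt = np.diff(t_grid)
    integral = float(np.sum(dt * (vals[:-1] + vals[1:]) / 2))
    T = t_grid[-1] - t_grid[0]
    return M / T * integral

t_grid = np.linspace(0.0, 1.0, 101)
# An effect function that is exactly null on the first half of the region.
beta_vals = np.where(t_grid < 0.5, 0.0, 2.0 * (t_grid - 0.5))
penalty = fscad_term(beta_vals, t_grid, lam=0.5, M=10)
```

The penalty grows linearly for small |β(t)| and is constant for large |β(t)|, which is why small effects are pushed to exactly null while large effects are left nearly unshrunk.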

The functional linear model of a single gene region can be transformed into a multivariate linear model of several variables (the number of basis functions plus intercept), and then directly test whether the estimated coefficients of the basis functions are all null [13,29,30]. Multi-regions association analysis can be done similarly. However, when multi-gene regions are tested, not only must more variables be tested, but also the degrees of freedom of the model should be adjusted. In addition, we found that the results obtained after adjusting the degrees of freedom of the Multi-SLoS method as follows would be more consistent with the features of the method and the actual results in the follow-up study of polygenic regions. The reasons for the adjustment of degrees of freedom are given below.

Let the null gene regions denote gene regions where the estimated effect values {\hat{β}}_{p} (t) (p = 1, 2, \dots, P) are all null, and non-null gene regions denote gene regions where the estimated effect values {\hat{β}}_{p} (t) are not all null. The Multi-SLoS method directly shrinks the estimated effect values {\hat{β}}_{p} (t) of null gene regions to null and thereby separates the P gene regions into non-null and null ones, whether or not this separation is correct (the results can be seen in the later simulation). The regions where {\hat{β}}_{p} (t) is shrunk to null have been identified as {\hat{β}}_{p} (t) = 0 at some stage of the estimation and have no effect on the estimated model, so their degrees of freedom should be removed from the calculation. Similarly, within non-null gene regions there are sub-regions where the effect is also shrunk to null; these sub-regions likewise have no effect on the estimated model, and their degrees of freedom should be deducted as well.

Combined with the adjustment of degrees of freedom, the partial F test of the Multi-SLoS method is given below.

Let B represent the set of subscripts of gene regions with non-null effect, and b represent one element of the set B. For the above linear genetic model, the following method is used to test the association between gene region b and quantitative traits.

(a): Calculate the residual sum of squares of the full model including all non-null gene regions:

{\hat{y}}_{i} = μ_{0} + \sum_{k \in B} \int_{0}^{T} X_{k i} (t) β_{k} (t) d t, i = 1, 2, 3, \dots, n .

(13)

\mathrm{SSE} (\mathrm{full}) = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(14)

(b): Calculate the residual sum of squares of the reduced model excluding gene region b:

{\hat{y}}_{i} = μ_{0} + \sum_{k \in B \setminus \{b\}} \int_{0}^{T} X_{k i} (t) β_{k} (t) d t

(15)

\mathrm{SSE} (\mathrm{reduced}) = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(16)

(c): Adjust the degrees of freedom:

The adjusted degrees of freedom of SSE(full), denoted freedom_adj(full), are the number of individuals n, minus the total number of non-null basis function coefficients in all non-null gene regions, minus 1.

The adjusted degrees of freedom of SSE(reduced) − SSE(full), denoted freedom_adj(reduced), are the number of non-null basis function coefficients in gene region b.

(d): Calculate the corresponding F statistic and p-value:

F = \frac{[\mathrm{SSE} (\mathrm{reduced}) - \mathrm{SSE} (\mathrm{full})] / \mathrm{freedom}_{adj} (\mathrm{reduced})}{\mathrm{SSE} (\mathrm{full}) / \mathrm{freedom}_{adj} (\mathrm{full})} \sim F (\mathrm{freedom}_{adj} (\mathrm{reduced}), \mathrm{freedom}_{adj} (\mathrm{full}))
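Steps (a) to (d) can be condensed into a short function. The sketch below assumes that the two residual sums of squares and the number of non-null basis function coefficients per non-null gene region have already been obtained from a fitted Multi-SLoS model; the function name, the argument names, and the numbers in the example are hypothetical.

```python
def adjusted_partial_f(sse_full, sse_reduced, n, nonnull_counts, b):
    """Partial F statistic with the adjusted degrees of freedom of step (c).

    nonnull_counts maps each non-null gene region (the set B) to its
    number of non-null basis function coefficients; b is the region tested.
    Returns (F, freedom_adj(reduced), freedom_adj(full)).
    """
    df_full = n - sum(nonnull_counts.values()) - 1
    df_reduced = nonnull_counts[b]
    if df_full <= 0 or df_reduced <= 0:
        raise ValueError("adjusted degrees of freedom must be positive")
    f_stat = ((sse_reduced - sse_full) / df_reduced) / (sse_full / df_full)
    return f_stat, df_reduced, df_full

# Hypothetical numbers: two non-null regions with 4 and 5 non-null
# coefficients, n = 50 individuals, testing region 0.
f_stat, df1, df2 = adjusted_partial_f(
    sse_full=80.0, sse_reduced=100.0, n=50,
    nonnull_counts={0: 4, 2: 5}, b=0,
)
# Here df2 = 50 - 9 - 1 = 40 and F = (20 / 4) / (80 / 40) = 2.5.
```

The p-value is then the upper-tail probability of the F(df1, df2) distribution at `f_stat`, which any statistics library can supply.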

2.1.2. Multi-Gene Region Weighted Method

In common gene association analysis, common variants can be easily identified if the associated loci contain both rare and common variants, but rare variants are difficult to detect because of their small effects. A similar situation is likely to occur in the association analysis of multi-gene regions: if associated loci exist both in regions containing only common variants and in regions containing only rare variants, the associated regions with rare variants are difficult to identify. The common solution to this problem in gene association analysis is to assign different weights to different types of variants. The same approach is applied here to gene regions: different types of gene regions are assigned different weights, to eliminate differences that arise from different allele frequencies rather than from different degrees of association with the phenotype.

Weighted SLoS (W-SLoS)

The SLoS approach was applied to multi-gene regions in Section 2.1.1, which raises the question: would the power improve further if different weights were given to different types of gene regions? The loss function of the SLoS method is:

\begin{aligned} Q (β, μ_{0}) = {} & \frac{1}{n} \sum_{i = 1}^{n} {[y_{i} - μ_{0} - \sum_{p = 1}^{P} \int_{0}^{T} X_{p i} (t) β_{p} (t) d t]}^{2} \\ & + \sum_{p = 1}^{P} γ_{p} {‖ D^{m} β_{p} ‖}^{2} + \sum_{p = 1}^{P} \frac{M}{T} \int_{0}^{T} p_{λ_{p}}^{'} (| β_{p} (t) |) d t \end{aligned}

(17)

It can be seen from the loss function that different gene regions can be assigned different weights by adjusting the region-specific parameters γ_{p} and λ_{p} (the latter through the penalty p_{λ_{p}}^{'}). Therefore, based on the research of Lin et al. [28], we made appropriate adjustments to the code of the slos R package provided by Lin et al., so that the method is not only theoretically feasible but also runs smoothly in practice. Finally, the Weighted SLoS method differs from the Multi-SLoS method only in the added weights, so the same statistical test can be used to test the significance of each gene region.

2.1.3. Multi-Gene Region Loci Weighted Method

Although it is possible to distinguish rare variants from common variants and then divide them into rare variants regions and common variants regions for analysis in the multi-gene region analysis, it is more common in the actual situation that both common variants and rare variants exist in a gene region to be analyzed. Therefore, the multi-gene region loci weighted method is proposed, which is a more general method of combining functional data analysis by assigning different weights to each locus within each region rather than to the gene region, as in Section 2.1.2.

Multi-Gene Region Loci Weighted Step (LW-Step)

Similar to Section 2.1.1, there are n individuals in a genetic population. The genome region [0, T] contains SNPs at the ordered positions t_{1} < t_{2} < \dots < t_{M}, and the genetic association analysis is carried out under no group structure. Accordingly, the genetic markers are x_{i 1}, x_{i 2}, \dots, x_{i M} (i = 1, 2, \dots, n). Let y_{i} be the quantitative trait value of the i-th individual. Since the population structure of the sample is not considered, the traditional linear genetic model can be expressed as

y_{i} = μ_{0} + \sum_{j = 1}^{M} x_{i j} β_{j} + ε_{i}, i = 1, 2, \dots, n .

(18)

With the increase of the number of genetic markers, the functional linear model (FLM) can be used instead of the multiple linear genetic model

y_{i} = μ_{0} + \int_{0}^{T} X_{i} (t) β (t) d t + ε_{i}, i = 1, 2, \dots, n .

(19)

A set of basis functions φ_{1} (t), φ_{2} (t), φ_{3} (t), \dots, φ_{K_{G}} (t) and coefficients d_{i 1}, d_{i 2}, d_{i 3}, \dots, d_{i K_{G}} can be used to expand X_{i} (t) as

X_{i} (t) = \sum_{k = 1}^{K_{G}} d_{i k} φ_{k} (t)

(20)

According to the functional data analysis method [31], let X_{i} = [x_{i 1}, x_{i 2}, \dots, x_{i M}] represent the genotype vector of the i-th individual, and let

d_{i} = {[d_{i 1}, d_{i 2}, \dots, d_{i K_{G}}]}^{T}, φ (t) = {[φ_{1} (t), φ_{2} (t), \dots, φ_{K_{G}} (t)]}^{T}, φ = {[φ (t_{1}), φ (t_{2}), \dots, φ (t_{M})]}^{T},

that is,

φ = [\begin{matrix} φ^{T} (t_{1}) \\ φ^{T} (t_{2}) \\ ⋮ \\ φ^{T} (t_{M}) \end{matrix}] = {[\begin{matrix} φ_{1} (t_{1}) & φ_{2} (t_{1}) & \dots & φ_{K_{G}} (t_{1}) \\ φ_{1} (t_{2}) & φ_{2} (t_{2}) & \dots & φ_{K_{G}} (t_{2}) \\ ⋮ & ⋮ & \dots & ⋮ \\ φ_{1} (t_{M}) & φ_{2} (t_{M}) & \dots & φ_{K_{G}} (t_{M}) \end{matrix}]}_{M \times K_{G}}

Then

X_{i} (t) = d_{i}^{T} φ (t), i = 1, 2, \dots, n .

Evaluating this expansion at the M marker positions gives

φ d_{i} = [\begin{matrix} d_{i}^{T} φ (t_{1}) \\ d_{i}^{T} φ (t_{2}) \\ ⋮ \\ d_{i}^{T} φ (t_{M}) \end{matrix}] = [\begin{matrix} \sum_{k = 1}^{K_{G}} φ_{k} (t_{1}) d_{i k} \\ \sum_{k = 1}^{K_{G}} φ_{k} (t_{2}) d_{i k} \\ ⋮ \\ \sum_{k = 1}^{K_{G}} φ_{k} (t_{M}) d_{i k} \end{matrix}] = [\begin{matrix} X_{i} (t_{1}) \\ X_{i} (t_{2}) \\ ⋮ \\ X_{i} (t_{M}) \end{matrix}] = X_{i} (t)

(21)

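The coefficient vector d_{i} is estimated by penalized least squares, that is, by minimizing a residual sum of squares plus a roughness penalty:

\mathrm{PENSSE}_{λ_{x}} (x) = {(X_{i}^{T} - φ d_{i})}^{T} (X_{i}^{T} - φ d_{i}) + λ_{x} \int_{0}^{T} {[D^{2} X (t)]}^{2} d t,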

(22)

\int_{0}^{T} {[D^{2} X (t)]}^{2} d t = \int_{0}^{T} d_{i}^{T} [D^{2} φ (t)] {[D^{2} φ (t)]}^{T} d_{i} d t = d_{i}^{T} R_{2} d_{i} .

(23)

Here, R_{2} is a penalty matrix with entries

{[R_{2}]}_{j k} = \int_{0}^{T} [D^{2} φ_{j} (t)] [D^{2} φ_{k} (t)] d t, j = 1, 2, \dots, K_{G}; k = 1, 2, \dots, K_{G} .

(24)

The solution is {\hat{d}}_{i} = {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ^{T} X_{i}^{T}, so that

X_{i} (t) = X_{i} φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t), i = 1, 2, \dots, n .

(25)
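The closed-form solution for the coefficients can be illustrated numerically. The sketch below assumes an orthonormal Fourier basis on [0, T], for which the penalty matrix R_{2} of Equation (24) is diagonal with entries equal to the fourth power of each basis function's angular frequency; the basis choice, the function names, and the example data are assumptions of this illustration.

```python
import numpy as np

def fourier_design(t, K, T=1.0):
    """Basis matrix B[m, k] = phi_k(t_m) and the angular frequency of each phi_k."""
    t = np.asarray(t, dtype=float)
    cols = [np.full_like(t, 1.0 / np.sqrt(T))]
    omega = [0.0]
    j = 1
    while len(cols) < K:
        w = 2.0 * np.pi * j / T
        cols.append(np.sqrt(2.0 / T) * np.sin(w * t))
        omega.append(w)
        if len(cols) < K:
            cols.append(np.sqrt(2.0 / T) * np.cos(w * t))
            omega.append(w)
        j += 1
    return np.column_stack(cols), np.array(omega)

def smooth_coefficients(X, t, K, lam, T=1.0):
    """Row i of the result is d_i-hat = (B^T B + lam * R2)^{-1} B^T X_i^T.

    X is the n x M genotype matrix and t holds the M marker positions.
    """
    B, omega = fourier_design(t, K, T)
    # For this basis, D^2 phi_k = -omega_k^2 * phi_k, and orthonormality
    # makes R2 diagonal with entries omega_k^4.
    R2 = np.diag(omega**4)
    A = B.T @ B + lam * R2
    return np.linalg.solve(A, B.T @ X.T).T

# Hypothetical genotypes coded 0/1/2 for 4 individuals at 20 loci.
rng = np.random.default_rng(2)
positions = np.sort(rng.uniform(0.0, 1.0, size=20))
genotypes = rng.integers(0, 3, size=(4, 20)).astype(float)
d_hat = smooth_coefficients(genotypes, positions, K=5, lam=1e-4)
```

Each row of `d_hat` plays the role of d_{i} in the models above. Setting `lam` to zero gives an ordinary least-squares fit of the basis to the genotypes, while larger values shrink the coefficients of the higher-frequency basis functions.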

In addition,

β (t) = \sum_{k^{'} = 1}^{K_{β}} b_{k^{'}} ϕ_{k^{'}} (t) = {[ϕ (t)]}^{T} b,

(26)

where ϕ (t) = {[ϕ_{1} (t), ϕ_{2} (t), \dots, ϕ_{K_{β}} (t)]}^{T} and b = {(b_{1}, b_{2}, \dots, b_{K_{β}})}^{T}. Combining X_{i} (t), β (t) and the functional linear model

y_{i} = μ_{0} + \int_{0}^{T} X_{i} (t) β (t) d t + ε_{i},

(27)

the following can be obtained

y_{i} = μ_{0} + X_{i} W b + ε_{i}, i = 1, 2, \dots, n .

(28)

where

W = φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} \int_{0}^{T} φ (t) [ϕ (t)]^{T} d t .

Next, as in Belonogova et al. [32], an M \times M diagonal matrix Θ is designed, where each element on the diagonal is the weight applied to the corresponding locus of the genotype data X_{i} = [x_{i 1}, x_{i 2}, \dots, x_{i M}]. The weights can be determined from the Beta distribution

\mathrm{Beta} (\mathrm{MAF}_{i j}, a_{1}, a_{2}), i = 1, 2, \dots, n, j = 1, 2, \dots, M .

where a_{1}, a_{2} are preset parameters, and \mathrm{MAF}_{i j} represents the j-th genotype frequency of the i-th individual. The diagonal matrix Θ is embedded into the simplified functional linear equation [32]

y_{i} = μ_{0} + X_{i} Θ W b + ε_{i}, i = 1, 2, \dots, n .

(29)
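The construction of Θ and of the weighted genotypes X_{i} Θ can be sketched as follows. This illustration assumes that the weight of locus j is the Beta(a_{1}, a_{2}) density evaluated at the minor allele frequency of that locus, estimated from the sample; the defaults a_{1} = 1 and a_{2} = 25, which up-weight rare variants, are a common choice and are not stated in the text above. All function names and the example data are hypothetical.

```python
import math

def beta_pdf(x, a1, a2):
    """Density of the Beta(a1, a2) distribution at 0 < x < 1."""
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    return math.exp(log_norm + (a1 - 1) * math.log(x) + (a2 - 1) * math.log(1 - x))

def minor_allele_freqs(genotypes):
    """Per-locus minor allele frequency from 0/1/2 genotype rows."""
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = []
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2 * n)
        freqs.append(min(p, 1 - p))
    return freqs

def weight_genotypes(genotypes, a1=1.0, a2=25.0):
    """Return (mafs, weights, weighted), where weights is the diagonal of
    Theta and weighted[i] is the row vector X_i Theta."""
    mafs = minor_allele_freqs(genotypes)
    # A monomorphic locus (MAF = 0) carries no information; give it weight 0.
    weights = [beta_pdf(f, a1, a2) if f > 0 else 0.0 for f in mafs]
    weighted = [[x * w for x, w in zip(row, weights)] for row in genotypes]
    return mafs, weights, weighted

# Hypothetical data: locus 0 is rare (MAF 0.125), locus 1 is common (MAF 0.5).
genotypes = [[0, 1], [1, 1], [0, 2], [0, 0]]
mafs, weights, weighted = weight_genotypes(genotypes)
```

With these defaults the rare locus receives a much larger weight than the common one, which is the intended effect of the weighting. Other choices of a_{1} and a_{2} change how strongly rare variants are favored.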

This gives us a weighted functional linear model for a single gene region. The corresponding functional linear model for multiple gene regions is

y_{i} = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i} (t) β_{p} (t) d t + ε_{i}, i = 1, 2, \dots, n .

(30)

All the assumptions about gene regions are the same as before. According to the specific situation of each region, a weight matrix Θ_{p} is added to assign different weights to the loci in the p-th region, which gives

y_{i} = μ_{0} + \sum_{p = 1}^{P} X_{p i} Θ_{p} W_{p} b_{p} + ε_{i}, i = 1, 2, \dots, n .

(31)

The derivation above explains the method in theory; the actual processing is simpler. Returning to the fitting of genotype data, the problem of loci weighting can be viewed from another perspective. First, let X_{i}^{*} = X_{i} Θ, and then expand X_{i}^{*} with the same functional smoothing parameters, so that

X_{i} (t) = X_{i} φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t)

(32)

X_{i}^{*} (t) = X_{i}^{*} φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t) = X_{i} Θ φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t)

(33)

Both expansions use the same smoothing parameter, basis functions, number of basis functions, and nodes, and these factors determine φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t), which is therefore identical in the expansions of X_{i} (t) and X_{i}^{*} (t). The only difference between X_{i} (t) and X_{i}^{*} (t) is the weight matrix Θ. This result can be used to derive the loci weighted functional linear model for a single gene region as follows

\begin{aligned} y_{i} & = μ_{0} + X_{i} Θ W b + ε_{i} \\ & = μ_{0} + X_{i} Θ φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} \int_{0}^{T} φ (t) {[ϕ (t)]}^{T} d t \, b + ε_{i} \\ & = μ_{0} + X_{i}^{*} φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} \int_{0}^{T} φ (t) {[ϕ (t)]}^{T} d t \, b + ε_{i} \\ & = μ_{0} + \int_{0}^{T} X_{i}^{*} φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} φ (t) {[ϕ (t)]}^{T} b \, d t + ε_{i} \\ & = μ_{0} + \int_{0}^{T} X_{i}^{*} (t) β (t) d t + ε_{i} \end{aligned}

(34)

where

W = φ {[φ^{T} φ + λ_{x} R_{2}]}^{- 1} \int_{0}^{T} φ (t) [ϕ (t)]^{T} d t,

i = 1, 2, \dots, n .

It can be seen from the derivation that the loci weighted functional linear model for a single gene region amounts to transforming the original genotype data, by weighting, into new functional data X_{i}^{*} (t), and then fitting the functional linear model with X_{i}^{*} (t). We extend this result to the multi-gene region loci weighted functional linear model, with all assumptions the same as in the section above. The model takes the form of the ordinary multi-gene region functional linear model

y_{i} = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i}^{*} (t) β_{p} (t) d t + ε_{i}, i = 1, 2, \dots, n

(35)

Then,

\begin{aligned} y_{i} & = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i}^{*} (t) β_{p} (t) d t + ε_{i} \\ & = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i}^{*} φ_{p} {[φ_{p}^{T} φ_{p} + λ_{x} R_{2}]}^{- 1} φ_{p} (t) {[ϕ_{p} (t)]}^{T} b_{p} \, d t + ε_{i} \\ & = μ_{0} + \sum_{p = 1}^{P} X_{p i}^{*} φ_{p} {[φ_{p}^{T} φ_{p} + λ_{x} R_{2}]}^{- 1} \int_{0}^{T} φ_{p} (t) {[ϕ_{p} (t)]}^{T} d t \, b_{p} + ε_{i} \end{aligned}

(36)

where

φ_{p} (t) = {[φ_{p 1} (t), φ_{p 2} (t), \dots, φ_{p K_{G}} (t)]}^{T},

ϕ_{p} (t) = {[ϕ_{p 1} (t), ϕ_{p 2} (t), \dots, ϕ_{p K_{β}} (t)]}^{T}, φ_{p} = {[φ_{p} (t_{1}^{p}), φ_{p} (t_{2}^{p}), \dots, φ_{p} (t_{M}^{p})]}^{T},

and

φ_{p} = [\begin{matrix} φ_{p}^{T} (t_{1}^{p}) \\ φ_{p}^{T} (t_{2}^{p}) \\ ⋮ \\ φ_{p}^{T} (t_{M}^{p}) \end{matrix}] = {[\begin{matrix} φ_{p 1} (t_{1}^{p}) & φ_{p 2} (t_{1}^{p}) & \dots & φ_{p K_{G}} (t_{1}^{p}) \\ φ_{p 1} (t_{2}^{p}) & φ_{p 2} (t_{2}^{p}) & \dots & φ_{p K_{G}} (t_{2}^{p}) \\ ⋮ & ⋮ & \dots & ⋮ \\ φ_{p 1} (t_{M}^{p}) & φ_{p 2} (t_{M}^{p}) & \dots & φ_{p K_{G}} (t_{M}^{p}) \end{matrix}]}_{M \times K_{G}}, p = 1, 2, \dots, P .

Let

d_{p i}^{*} = X_{p i}^{*} φ_{p} {[φ_{p}^{T} φ_{p} + λ_{x} R_{2}]}^{- 1}

Φ_{p} = {[\begin{matrix} \int_{0}^{T} φ_{p 1} (t) ϕ_{p 1} (t) d t & \dots & \int_{0}^{T} φ_{p 1} (t) ϕ_{p K_{β}} (t) d t \\ ⋮ & \dots & ⋮ \\ \int_{0}^{T} φ_{p K_{G}} (t) ϕ_{p 1} (t) d t & \dots & \int_{0}^{T} φ_{p K_{G}} (t) ϕ_{p K_{β}} (t) d t \end{matrix}]}_{K_{G} \times K_{β}}

Then

y_{i} = μ_{0} + \sum_{p = 1}^{P} d_{p i}^{*} Φ_{p} b_{p} + ε_{i}, i = 1, 2, \dots, n .

(37)

For the same reasons as in the Step method, when

φ_{p 1} (t), φ_{p 2} (t), \dots, φ_{p K_{G}} (t)

and

ϕ_{p 1} (t), ϕ_{p 2} (t), \dots, ϕ_{p K_{β}} (t)

are the same set of orthonormal basis functions, the matrix Φ_p reduces to the identity matrix and the difference between individuals is reflected mainly in the coefficients. Therefore, it is reasonable to take the coefficients of the basis functions as new variables for the subsequent analysis, and to simplify the model as follows

y_{i} = μ_{0} + \sum_{p = 1}^{P} d_{p i}^{*} b_{p} + ε_{i}, i = 1, 2, \dots, n .

(38)
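The simplification from (37) to (38) relies on the matrix of basis inner products being the identity when both expansions use the same orthonormal basis. The following is a minimal numerical check in Python, assuming the standard orthonormal Fourier basis on [0, T]; the function names `fourier_basis` and `gram_matrix` are illustrative only and do not come from any package used in this paper.

```python
import math

def fourier_basis(t, K, T=1.0):
    """Evaluate the first K orthonormal Fourier basis functions on [0, T] at t."""
    values = [1.0 / math.sqrt(T)]          # constant term
    k = 1
    while len(values) < K:
        w = 2.0 * math.pi * k / T
        values.append(math.sqrt(2.0 / T) * math.sin(w * t))
        if len(values) < K:
            values.append(math.sqrt(2.0 / T) * math.cos(w * t))
        k += 1
    return values

def gram_matrix(K, T=1.0, n_grid=2000):
    """Approximate G[k][l] = integral over [0, T] of phi_k(t) * phi_l(t) dt
    with a Riemann sum on an equally spaced grid."""
    h = T / n_grid
    G = [[0.0] * K for _ in range(K)]
    for i in range(n_grid):
        b = fourier_basis(i * h, K, T)
        for k in range(K):
            for l in range(K):
                G[k][l] += b[k] * b[l] * h
    return G

G = gram_matrix(K=5)
max_dev = max(abs(G[k][l] - (1.0 if k == l else 0.0))
              for k in range(5) for l in range(5))
print("largest deviation from the identity matrix:", max_dev)
```

If the two expansions used different bases, or a non-orthonormal basis such as B-splines, this matrix would not be the identity and the factor would have to be kept as in (37).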

The model becomes a multivariate linear model with the coefficients of the basis functions in each region as the new variables. The difference from the Step method is that the coefficients are now obtained by functional data analysis of the weighted genotype data. Each gene region is treated as a 'variable' in a stepwise regression, and a partial F test is conducted on the final simplified model to obtain the significance level of each gene region. The method is called the Multi-gene region loci weighted Step (LW-Step).
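The partial F test on a whole gene region can be illustrated as follows. This is only a sketch under stated assumptions: each region contributes a block of basis-coefficient columns, the full and reduced models are fitted by ordinary least squares, and the stepwise selection itself is not reproduced. The helper names `solve`, `rss`, and `partial_f` are hypothetical; the p-value would be obtained from the F distribution with the returned degrees of freedom.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(k))) ** 2
               for i in range(n))

def partial_f(regions, y, p):
    """Partial F statistic for region p.

    regions[q] is the list of coefficient columns of region q (each column has
    one value per individual). Returns (F, df1, df2).
    """
    full = [col for reg in regions for col in reg]
    reduced = [col for q, reg in enumerate(regions) if q != p for col in reg]
    rss_full, rss_reduced = rss(full, y), rss(reduced, y)
    df1 = len(regions[p])               # coefficients dropped with region p
    df2 = len(y) - 1 - len(full)        # residual df of the full model
    return ((rss_reduced - rss_full) / df1) / (rss_full / df2), df1, df2
```

Testing the whole block of coefficients at once is what lets a region be judged as a single 'variable' rather than locus by locus.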

Multi-Gene Region Loci Weighted SLoS Method (LW-SLoS)

For the multi-gene region, the model becomes the ordinary multi-gene region functional linear model

y_{i} = μ_{0} + \sum_{p = 1}^{P} \int_{0}^{T} X_{p i}^{*} (t) β_{p} (t) d t + ε_{i}, i = 1, 2, \dots, n .

(39)

The SLoS method can be used to solve the model, and the statistical test proposed for the Multi-SLoS method can be used to test it. To distinguish it from the other SLoS variants, we call this method LW-SLoS, which stands for Loci Weighted SLoS.

2.2. Design of Simulation Evaluation

2.2.1. Simulation of Common Multi-Gene Region Method

In the simulation analysis of the common multi-gene region method, 25 gene regions are generated at a time and spliced together as the "multi-gene region" to be analyzed in one pass. We design four different multi-gene regions: (a) rare variant multi-gene regions, in which all variants are rare variants (MAF (Minor Allele Frequency) < 0.01); (b) common variant multi-gene regions, in which all variants are common variants; (c) hybrid variant multi-gene regions I, in which 15 gene regions are common variant gene regions and 10 gene regions are rare variant gene regions; (d) hybrid variant multi-gene regions II, in which each gene region is a mixture of 60% rare variants and 40% common variants. These designs are intended to identify the scenarios in which multi-gene analysis is applicable and to show how multi-gene region analysis behaves across them.

The rare variant gene regions and common variant gene regions in the multi-gene region are generated as follows. For rare variant gene regions, a 5 kb gene segment is randomly selected each time as a gene region from the rare haplotype dataset with a length of 200 kb (the R language SKAT package [33] contains such a dataset, produced by simulating allele frequency and linkage disequilibrium information in European populations); 2000 individuals are then randomly selected from the 10,000 individuals twice to synthesize the diploid rare variant gene region. For common variant gene regions, the allele frequencies of the gene loci are first generated from a uniform distribution; then, according to the frequency of each locus, haplotype datasets of common variants with the same structure and size as the rare variant dataset are generated by simple random sampling; finally, the common variant gene regions are generated in the same way as the rare variant gene regions. The four multi-gene regions are composed as follows: the rare variant multi-gene regions consist of 25 rare variant gene regions; the common variant multi-gene regions consist of 25 common variant gene regions; the hybrid variant multi-gene regions I consist of 15 common variant gene regions and 10 rare variant gene regions, spliced in random order; the hybrid variant multi-gene regions II are generated by first making 25 rare variant regions and 25 common variant regions of the same size and structure, and then randomly selecting 40% of the rare variants in the first rare variant gene region and replacing them with the corresponding genotypes of the first common variant gene region. This is repeated until the 25th gene region has been generated.
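The generation of a common variant gene region can be sketched as below. This is an illustration, not the simulation code used in this study: the frequency range (0.01 to 0.5), the independence of loci, sampling with replacement, and the function names `simulate_common_haplotypes` and `pair_haplotypes` are all assumptions made for the example, and the sizes are reduced to keep it fast.

```python
import random

def simulate_common_haplotypes(n_hap, n_loci, rng, maf_low=0.01, maf_high=0.5):
    """Draw one allele frequency per locus from a uniform distribution, then
    sample every haplotype allele independently with that frequency."""
    freqs = [rng.uniform(maf_low, maf_high) for _ in range(n_loci)]
    haps = [[1 if rng.random() < f else 0 for f in freqs] for _ in range(n_hap)]
    return haps, freqs

def pair_haplotypes(haps, n_individuals, rng):
    """Build diploid genotypes (0, 1 or 2 copies of the minor allele) by
    summing two haplotypes drawn at random, with replacement."""
    genotypes = []
    for _ in range(n_individuals):
        h1, h2 = rng.choice(haps), rng.choice(haps)
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return genotypes

rng = random.Random(2022)
haps, freqs = simulate_common_haplotypes(n_hap=1000, n_loci=20, rng=rng)
genotypes = pair_haplotypes(haps, n_individuals=200, rng=rng)
print(len(genotypes), "individuals x", len(genotypes[0]), "loci")
```

For rare variant regions, `pair_haplotypes` would be applied in the same way to a 5 kb window of the SKAT haplotype data instead of the simulated haplotypes.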

In the power simulation, associated loci must be designated and quantitative traits must be simulated for the analysis. Therefore, in each simulation, five of the 25 gene regions spliced into the “multi-gene region” are selected as the associated gene regions, and three loci are then randomly selected from each associated gene region as the associated loci. The simulated traits are generated with the additive effect model. Three scenarios are considered for the effects of the associated loci: Scenario I, all effect directions are positive; Scenario II, the effect directions of all associated loci are negative in two of the five associated regions; Scenario III, the effect direction of one locus in each associated region is negative. The absolute value of the effect is determined by the following effect model

| β (t_{j}) | = | \log_{10} (2 \cdot \mathrm{MAF}_{t_{j}}) | / 4 \times 1.5, \quad t_{1} \leq \dots \leq t_{j} \leq \dots \leq t_{M} .

(40)

where

\mathrm{MAF}_{t_{j}}

represents the minor allele frequency of the associated locus at position t_j in the associated gene regions. In the false positive rate simulation, random numbers are generated from the normal distribution

N (0, 0.1)

and used as the simulated phenotypic values, since no associated gene loci are assumed to exist in the gene regions.
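As a worked illustration of effect model (40), the sketch below evaluates the effect size for a given MAF and generates a trait from the additive model. Two points are assumptions rather than statements from the text: the expression is read left to right, i.e. (|log10(2 MAF)| / 4) × 1.5, and the residual standard deviation `noise_sd` is a hypothetical parameter. The function names are illustrative.

```python
import math
import random

def effect_size(maf):
    """|beta| = |log10(2 * MAF)| / 4 * 1.5, evaluated left to right (assumed)."""
    return abs(math.log10(2.0 * maf)) / 4.0 * 1.5

def simulate_trait(genotypes, causal, signs, mafs, rng, noise_sd=1.0):
    """Additive model: y_i = sum over causal loci of sign * |beta| * genotype,
    plus Gaussian noise with standard deviation noise_sd (hypothetical value)."""
    betas = [s * effect_size(mafs[j]) for j, s in zip(causal, signs)]
    return [sum(b * g[j] for b, j in zip(betas, causal)) + rng.gauss(0.0, noise_sd)
            for g in genotypes]

# Under this reading, rarer variants receive larger effects:
for maf in (0.005, 0.05, 0.25):
    print(maf, effect_size(maf))
```

With this reading, a locus with MAF = 0.005 has |β| = 2 / 4 × 1.5 = 0.75, while a locus with MAF = 0.05 has |β| = 0.375.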

In every simulation, a multi-gene region composed of 25 gene regions is analyzed, and the simulation is repeated 100 times under each association effect hypothesis. In total, 2500 (

100 \times 25

) gene regions are analyzed. In each scenario, 5 gene regions are assumed to be associated regions and 20 gene regions are assumed to be unassociated regions. That is, there are 500 associated gene regions and 2000 unassociated gene regions in every case. For a given significance level

α

, model, method, and scenario, the number of truly associated gene regions that are identified as significant is

n_{1}

. The number of gene regions that are identified as significant but are in fact unassociated is

n_{2}

. The power is

\frac{n_{1}}{500}

and the false positive rate (Type I error rate) is

\frac{n_{2}}{2000}

.
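The two rates defined above can be computed from per-region test results as in the following sketch. The function name `power_and_fpr` and the boolean-list input format are illustrative choices, and the counts in the example are made up to show the arithmetic; they are not results of this study.

```python
def power_and_fpr(significant, associated):
    """Return (power, false positive rate).

    significant[i] is True if region i was declared significant;
    associated[i] is True if region i truly contains associated loci.
    """
    n1 = sum(1 for s, a in zip(significant, associated) if s and a)
    n2 = sum(1 for s, a in zip(significant, associated) if s and not a)
    n_assoc = sum(1 for a in associated if a)
    n_unassoc = len(associated) - n_assoc
    return n1 / n_assoc, n2 / n_unassoc

# 100 replicates x 25 regions: 500 associated and 2000 unassociated regions.
associated = [True] * 500 + [False] * 2000
significant = [True] * 400 + [False] * 100 + [True] * 100 + [False] * 1900
power, fpr = power_and_fpr(significant, associated)
print(power, fpr)
```

Here n1 = 400 and n2 = 100, so the power is 400/500 and the false positive rate is 100/2000.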

To compare the Step and Multi-SLoS methods of multi-gene region analysis with single-gene region functional methods, the SLoS method and the FLM method are also run in this simulation. The single-gene region methods analyze the sub-regions of the simulated multi-gene region one by one, and the results are then summarized to test the multi-gene regions. The FLM method was proposed by Svishcheva et al. [14] as a functional gene region analysis method for family and population gene data. The authors provide an R package, “FREGAT”, that implements the method. The method can also be used for genetic association analysis in populations with no family relationships. For the SLoS and Multi-SLoS methods, the compression parameter and smoothing parameter are set to 0.1 and 0.1 in the common variant multi-gene region simulation, to 0.01 and 0.01 in the rare variant multi-gene region simulation, and to 0.05 and 0.05 in the simulations of hybrid variant multi-gene regions I and II. Fitting the SLoS and Multi-SLoS models requires the "SLoS" package of the R language [28], while the statistical testing part is supplemented by R code written by us. In the simulation, genotype data are converted into functional data: all gene regions are smoothed with 25 Fourier basis functions, the number of nodes equals the number of variants in the gene region, and the nodes are equally spaced. The number of basis functions of the effect function follows the default settings of the respective software. The basis functions and node settings are the same in the subsequent simulations unless otherwise stated.

2.2.2. Simulation of Multi-Gene Region Weighted Method

We designed three different multi-gene regions: (a) rare variant multi-gene regions, in which all variants are rare variants (MAF < 0.01); (b) common variant multi-gene regions, in which all variants are common variants; (c) hybrid variant multi-gene regions, in which 15 gene regions are common variant gene regions and 10 are rare variant gene regions. The simulation of the multi-gene region weighted method targets only the hybrid variant multi-gene regions analyzed in the simulation of the common multi-gene region method. Of the five associated regions, two are common variant gene regions and three are rare variant gene regions. The remaining settings are the same as in the simulations of the common multi-gene region method, except that the power and false positive rates are also counted separately for the common variant gene sub-regions and the rare variant gene sub-regions in the multi-gene regions. Taking the rare variant gene regions as an example: the power of the rare variant gene sub-regions is the total number of significant and truly associated gene regions among the rare variant gene sub-regions divided by the total number of rare variant gene sub-regions in the multi-gene regions; the false positive rate of the rare variant gene sub-regions is the total number of significant but unassociated gene regions among the rare variant gene sub-regions divided by the total number of rare variant gene sub-regions in the multi-gene regions.

Three methods are used in this simulation: Step, FLM, and Weighted SLoS (W-SLoS). In the simulation of the W-SLoS method, the smoothing parameter of the common variant gene sub-regions is 0.02 and the compression parameter is 0.05; the smoothing parameter and compression parameter of the rare variant gene sub-regions are 0.01 and 0.0025, respectively. All settings for the nodes and the effects of genetic variants are the same as in the computer simulations for the common multi-gene region method (Section 2.2.1).

2.2.3. Simulation of Multi-Gene Region Loci Weighted Method

The multi-gene region generated in the simulation of the multi-gene region loci weighted analysis is composed of 10 mixed variant gene regions. The proportions of rare variants in the 10 regions are [0.7, 0.8, 0.6, 0.95, 0.9, 0.95, 0.7, 0.9, 0.8, 0.6], respectively. Common and rare variants are randomly distributed within every sub-region of the simulated multi-gene region, and the gene loci are weighted within the sub-regions. The rare variants and common variants in each gene region are generated in the same way as in the common multi-gene region simulations. This multi-gene region is analyzed 100 times during the simulations. In the power simulation, the associated gene regions are preset as the 2nd, 4th, 5th, 7th, and 10th gene regions, whose proportions of rare variants are [0.8, 0.95, 0.9, 0.7, 0.6]; the 4th and 5th gene regions are adjacent. Although the associated gene regions are fixed, the associated loci in each gene region are randomly drawn in each simulation from the loci with a minor allele frequency of less than 0.02 within the corresponding gene region. The model and method used to simulate phenotypic values and the effect sizes of the associated loci are the same as in the common multi-gene region simulations. Three scenarios are considered for the effect directions of the associated loci: Scenario I, the effect directions of all associated loci are positive; Scenario II, the effect directions of all associated loci are negative in the 4th and 7th gene regions; Scenario III, one locus is chosen at random from the associated loci in each associated region and its effect direction is negative.
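The three effect-direction scenarios can be written out explicitly as in the sketch below. It assumes three associated loci per associated region, as in the common multi-gene region simulation; the function name `effect_signs`, the integer scenario codes, and the dictionary layout are illustrative choices.

```python
import random

ASSOCIATED_REGIONS = [2, 4, 5, 7, 10]   # 1-based indices of the associated regions

def effect_signs(scenario, rng, loci_per_region=3, negative_regions=(4, 7)):
    """Return {region: [sign per associated locus]} for Scenario 1, 2 or 3."""
    if scenario not in (1, 2, 3):
        raise ValueError("scenario must be 1, 2 or 3")
    signs = {}
    for region in ASSOCIATED_REGIONS:
        s = [1] * loci_per_region                    # Scenario I: all positive
        if scenario == 2 and region in negative_regions:
            s = [-1] * loci_per_region               # Scenario II: whole region negative
        elif scenario == 3:
            s[rng.randrange(loci_per_region)] = -1   # Scenario III: one random locus
        signs[region] = s
    return signs

rng = random.Random(7)
for scenario in (1, 2, 3):
    print(scenario, effect_signs(scenario, rng))
```

Each sign is then multiplied by the absolute effect size of the corresponding locus before the trait is generated.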

For power and false positive rates in this simulation, suppose that, among the regions identified as significant under a given condition, the number of regions that are truly associated is

n_{1}

, while the number of regions that are unassociated is

n_{2}

. Then the total power in the simulation is

\frac{n_{1}}{500}

, and the total false positive rate (Type I error rate) is

\frac{n_{2}}{500}

. Suppose that the number of times the 2nd, 4th, 5th, 7th, and 10th gene regions are identified as significant is

n_{i 1}, (i = 2, 4, 5, 7, 10)

, respectively. Then, the powers of the corresponding gene sub-regions are

\frac{n_{i 1}}{100}, (i = 2, 4, 5, 7, 10)

. Suppose that the number of times the 1st, 3rd, 6th, 8th, and 9th gene regions are identified as significant is

n_{i 2}, (i = 1, 3, 6, 8, 9)

(in fact, those gene regions are not associated); then the false positive rates of the corresponding gene sub-regions are

\frac{n_{i 2}}{100}, (i = 1, 3, 6, 8, 9)

.

Four methods are used in the simulation: Step, Loci Weighted Step (LW-Step), Multi-SLoS, and Loci Weighted SLoS (LW-SLoS). The smoothing and compression parameters of the Multi-SLoS and LW-Step methods are set to 0.001. The multi-gene region loci weighted method requires weights for the different loci, which are specified in this paper through the weight matrix. Here, the weight matrix is a diagonal matrix, and each diagonal element, the weight of the corresponding gene locus, is generated through the beta distribution:

\mathrm{Beta} (\mathrm{MAF}_{i}, 1, 10) .
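One way to read this weight is as the Beta(1, 10) density evaluated at the MAF of each locus, which is the convention commonly used for variant weights in rare-variant tests such as SKAT. The sketch below builds the diagonal weight matrix under that assumed reading; the function names `beta_weight` and `weight_matrix` are illustrative, and other readings of the notation are possible.

```python
import math

def beta_weight(maf, a=1.0, b=10.0):
    """Beta(a, b) density at maf: maf^(a-1) * (1-maf)^(b-1) / B(a, b)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # B(a, b)
    return maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0) / norm

def weight_matrix(mafs, a=1.0, b=10.0):
    """Diagonal weight matrix with one beta-density weight per locus."""
    M = len(mafs)
    return [[beta_weight(mafs[i], a, b) if i == j else 0.0 for j in range(M)]
            for i in range(M)]

mafs = [0.002, 0.01, 0.05, 0.3]
for maf, row in zip(mafs, weight_matrix(mafs)):
    print(maf, max(row))
```

With a = 1 and b = 10 the density simplifies to 10 (1 - MAF)^9, so rare variants receive weights close to 10 while common variants are strongly down-weighted.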

3. Results

3.1. Results of the Common Multi-Gene Region Method

It can be seen from Table 1 that, in the rare variant multi-gene regions, the power of the Step method is the highest among the four methods at each significance level, and the power of the Multi-SLoS method is not higher than that of FLM and Step but slightly higher than that of SLoS; in hybrid variant multi-gene regions I and II, the power of the Step method is higher than that of the FLM method, but the power of the Multi-SLoS method is not higher than that of the SLoS method. Therefore, in terms of power, multi-gene region analysis has a comparative advantage in rare variant multi-gene regions, and the Step method has the best power in the four simulated multi-gene regions. Comparing power across the different types of gene regions, the SLoS and Multi-SLoS methods have higher power for multi-gene regions with common variants; the Step method is less powerful in multi-gene regions that contain common variants; the FLM method is more effective in multi-gene regions consisting of only rare or only common variants, but its performance in hybrid variant multi-gene regions I is anomalous.

The results are further analyzed together with the simulation results of false positive rates in Table 2. In the rare variant multi-gene regions, the Multi-SLoS method has a higher false positive rate than the SLoS method, and the Step method has a higher false positive rate than the FLM method under the different effect directions and at higher significance levels. The false positive rates of both the Step method and the Multi-SLoS method are very low in the common variant multi-gene regions. In hybrid variant multi-gene regions II, the Step and Multi-SLoS methods have lower false positive rates than the FLM and SLoS methods. Therefore, adding common variants to the sub-regions of the multi-gene regions reduces the false positive rates.

On the one hand, the multi-gene region Step method based on functional data analysis has the best power and performance in false positive rates. Indeed, it can better find some associated regions that cannot be found in single gene regions, especially some associated gene regions with micro effects. On the other hand, the Multi-SLoS method has no significant advantages over the SLoS method and needs further improvement and adjustment.

3.2. Results of Multi-Gene Region Weighted Method

As can be seen from the power simulation results in Table 1 and Table 3, the highest power of the Multi-SLoS method is 76.2% and that of the SLoS method is 78.4%, whereas the lowest power of W-SLoS is 84% for the hybrid variant multi-gene region I (see Table 1). Therefore, weighting the multi-gene regions significantly improves the power of the SLoS method. The FLM method still does not perform well in this case. Together with the previous simulation, this shows that the FLM method performs well when the gene sub-regions in the multi-gene region are of the same type, but poorly when the multi-gene region mixes multiple types of gene regions; the method may therefore be more suitable for multi-gene regions whose sub-regions are of the same type. The Step method is still the most powerful of the methods compared, both in overall power and in gene regions with common or rare variants. Almost all of the common variant gene regions are identified, and the rare variant gene regions have lower power than the common variant gene regions. In general, the three methods identify common variant gene regions more easily, while higher power is still needed for rare variant gene regions.

Table 4 shows the simulation results of false positive rates. The false positive rates of FLM are clearly larger than those of the other methods. Among the three methods, the false positive rates of the W-SLoS method are the lowest, followed by those of the Step method. That is, the W-SLoS method is not as effective as the Step method at detecting gene regions, but the gene regions it selects are more reliable.

By weighting the sub-regions, the performance of the multi-gene region SLoS method exceeds that of the single-gene region SLoS method. This indicates that weighting can further improve the power of the multi-gene region analysis model.

3.3. Results of Multi-Gene Region Loci Weighted Method

Table 5 and Table 6 show the power and false positive rate simulation results of the multi-gene region loci weighted method. In the power simulation results, the highest power of the three unweighted methods (FLM, Multi-SLoS, and Step) is 94.2%, and the highest power of the two loci weighted methods is 98.2%. This suggests that the multi-gene region loci weighted methods can indeed improve the power in this simulation. In the false positive rate simulation, the false positive rates of LW-Step are higher than those of Step, but lower than those of the other three methods. The LW-SLoS method has much higher false positive rates than the LW-Step method, but the false positive rates of LW-SLoS approach those of LW-Step as the significance level increases.

Since the multi-gene region is fixed across simulations, so are the associated gene sub-regions, and the power and false positive rate of each gene sub-region can be calculated. Figure 1 and Figure 2 show, respectively, the power of the 2nd, 4th, 5th, 7th, and 10th (associated) gene regions and the false positive rates of the 1st, 3rd, 6th, 8th, and 9th (unassociated) gene regions. Figure 1 shows that the power differs across gene regions and effect hypotheses, and that the proportion of common and rare variants in a gene region affects the power of the region. Figure 2 shows that, in general, the LW-SLoS method has higher false positive rates in the 1st and 3rd gene regions. The proportions of rare variants in these two gene regions are 0.7 and 0.6, the second smallest and the smallest among the five unassociated gene regions.

In general, the multi-gene region loci weighted method has an advantage in multi-gene regions where the proportions of the variant types differ between gene regions. Moreover, the simulation results of power and false positive rates show that this gain does not simply come from lowering the threshold: the method improves the power of the analysis at reasonable false positive rates. The simulation further analyzes the power and false positive rates of the different gene sub-regions within the same multi-gene region. The results show that the presence of some common variants in the gene sub-regions can improve the power of the method and reduce false positive rates.

4. Discussion

In this paper, a total of five analysis methods are proposed for the analysis of multi-gene regions, which can be divided into three categories: the common multi-gene region method, the multi-gene region weighted method, and the multi-gene region loci weighted method. The Step method combines two ideas: using the gene information within each region and using the relationship among the gene regions. The simulation results show that the power of the Step method is higher than that of the FLM and SLoS methods for a single gene region, and even than that of the region-weighted SLoS method. This means that the power is improved by considering the relationship among gene regions. The multi-gene region loci weighted method is the most complex but also the most effective. Its power in the simulations is much higher than that of the unweighted single-gene region analysis methods, and its false positive rate is much lower. For the SLoS method, the simulation results of the common multi-gene region method are only slightly better than those of the common single-gene region analysis method, which nevertheless shows that even the simplest multi-gene region analysis method can improve the power of the analysis compared with the single-gene region analysis method. In general, the simulation results indicate that the Step and LW-Step methods are the better methods for association analysis of multi-gene regions. By modifying the degrees of freedom of the test statistic

F

, the Multi-SLoS, W-SLoS, and LW-SLoS methods are feasible for association analysis of multi-gene regions. With the multi-gene region analysis methods, the association analysis results for common variant multi-gene regions are better than those for rare variant multi-gene regions.

The multi-gene region loci weighted method is a further extension of the weighting idea of Belonogova et al. [32]. However, there are some differences between our work and Belonogova’s paper: firstly, the weighting idea was not applied to multi-gene regions in their paper; secondly, the coefficients of the functional data are estimated by the smoothing method in our paper and by the least squares method in Belonogova’s paper.

The Fourier basis is selected to fit the genotype data in the functional expansion. Previous studies have compared the Fourier basis with the spline basis and obtained similar results (the previously cited papers on functional gene-association analysis include such comparisons). Some have questioned how genes can be represented by periodic Fourier bases. Perhaps both bases extract a similar amount of information from the gene regions they summarize; across many comparisons, the results of the two bases differ only slightly. Even though the Fourier basis might look better in some cases, the difference is not large enough for the authors of those papers to conclude that it is the better choice. Usually both bases are used, and either one can be chosen. The Fourier basis is selected in this paper because it only requires the number of basis functions to be determined, whereas the spline basis requires both the number and the order of the basis functions, so selecting the Fourier basis is much simpler. The Fourier basis must be chosen for the Step and LW-Step methods.

Regarding the selection of compression parameters and smoothing parameters, this paper does not necessarily choose the values that give the methods the best performance; the parameters are selected according to common standards and methods (such as 10-fold cross-validation). A specific parameter selection strategy can be as follows: first, determine the selection criteria; then, determine the approximate ranges of the compression and smoothing parameters according to the method and the actual situation; finally, use a computer program to screen out the optimal parameters. This process must be repeated when processing actual data, because the composition of the gene regions changes and the parameters must change accordingly. In the simulation, however, gene regions of the same type are generated in the same way. Therefore, to save computing resources, a suitable parameter value is selected directly in this paper.

One might ask: why not analyze the multi-gene region as a single larger gene region, which functional data analysis can accommodate simply by increasing the number of nodes and the number of basis functions? One reason is that a large gene region can be divided into smaller gene regions, and the multi-gene region method then gives more detailed test results. Another reason is that, as mentioned earlier in this article, because the interrelationship between different gene regions is taken into account, the multi-gene region method can better identify regions with relatively weak effects and has higher power. So, compared to single-gene region testing, multi-gene region testing can detect gene regions where the effect is smaller or where the effect overlaps with that of other regions.

We focused on independent SNPs for common variants. Rare variants come from the rare haplotype dataset with a length of 200 kb (the R language SKAT package [33] contains such a dataset, produced by simulating allele frequency and linkage disequilibrium information in European populations). However, we also simulated linkage disequilibrium in the common variant multi-gene region under Scenario I for the common multi-gene region method and the multi-gene region weighted method. When the r² measure of linkage disequilibrium is between 0.25 and 0.64, the power of the SLoS and Multi-SLoS methods under linkage disequilibrium is higher than under linkage equilibrium, but the false positive rates also increase significantly. The power of the Step method decreases slightly, and its false positive rates remain largely unchanged. The power and false positive rates of the FLM method do not change significantly. This indicates that linkage disequilibrium among the gene loci makes the SLoS and Multi-SLoS methods more likely to misidentify non-associated gene regions. For the SLoS and Multi-SLoS methods, the association analysis between gene regions and quantitative traits is thus sensitive to linkage disequilibrium and unstable, while the Step and FLM methods are more stable. The reason may lie in the parameter estimation of the SLoS and Multi-SLoS methods. Furthermore, we compared the simulation results of forward selection and backward selection and found that the power of forward selection is slightly higher than that of backward selection, but its false positive rates are far greater. Finally, we studied the distribution of allele frequencies of the gene loci. The results show that the power when the allele frequencies follow a normal distribution is higher than when they follow a uniform distribution, which may be because the MAFs under the normal distribution are mostly smaller than those under the uniform distribution. According to the effect function, the effect sizes under the normal distribution are then larger, which makes the loci easier to detect.

The model we considered is an idealized one: it studies only the basic assumptions, without considering population structure or populations with relatives, and assumes no missing genotypes. In practice, missing genotypes are inevitable. In that case, statistical methods can be used to impute the missing data before the discrete genetic data are converted into continuous functions. In future research, we will consider incorporating models of population structure and populations with relatives.

Author Contributions

Conceptualization, F.Z.; data curation, Y.W.; formal analysis, S.L. and F.Z.; software, S.L. and F.Z.; supervision, Y.W.; validation, S.L.; writing —original draft, S.L., F.Z. and Y.W.; writing—review and editing, J.S., H.Z. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number [32071892], the Natural Science Foundation of Fujian of China grant number [2021J01126], and Science and Technology Innovation Special Foundation of Fujian Agriculture and Forestry University of China grant number [CXZX2020109A].

Acknowledgments

We thank the reviewers for their helpful comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

Bansal, V.; Libiger, O.; Torkamani, A.; Schork, N.J. Statistical analysis strategies for association studies involving rare variants. Nat. Rev. Genet. 2010, 11, 773–785.
Asimit, J.; Zeggini, E. Rare variant association analysis methods for complex traits. Annu. Rev. Genet. 2010, 44, 293–308.
Gibson, G. Rare and common variants: Twenty arguments. Nat. Rev. Genet. 2012, 13, 135–145.
Buermans, H.P.J.; Dunnen, J.T.D. Next generation sequencing technology: Advances and applications. Biochim. Biophys. Acta-Mol. Basis Dis. 2014, 1842, 1932–1941.
The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1092 human genomes. Nature 2012, 491, 56–65.
Nelson, M.R.; Wegmann, D.; Ehm, M.G.; Kessner, D.; Jean, P.S.; Verzilli, C.; Shen, J.; Tang, Z.; Bacanu, S.-A.; Fraser, D.; et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 2012, 337, 100–104.
Madsen, B.E.; Browning, S.R. A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genet. 2009, 5, e1000384.
Han, F.; Pan, W. A Data-Adaptive Sum Test for Disease Association with Multiple Common or Rare Variants. Hum. Hered. 2010, 70, 42–54.
Price, A.L.; Kryukov, G.V.; de Bakker, P.I.W.; Purcell, S.M.; Staples, J.; Wei, L.-J.; Sunyaev, S.R. Pooled Association Tests for Rare Variants in Exon-Resequencing Studies. Am. J. Hum. Genet. 2010, 86, 832–838.
Liu, D.; Ghosh, D.; Lin, X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform. 2008, 9, 292.
Wu, M.C.; Kraft, P.; Epstein, M.P.; Taylor, D.; Chanock, S.J.; Hunter, D.J.; Lin, X. Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. Am. J. Hum. Genet. 2010, 86, 929–942.
Lee, S.; Abecasis, G.R.; Boehnke, M.; Lin, X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. Am. J. Hum. Genet. 2014, 95, 5–23.
Luo, L.; Zhu, Y.; Xiong, M. Quantitative trait locus analysis for next-generation sequencing with the functional linear models. J. Med. Genet. 2012, 49, 513–524.
Svishcheva, G.R.; Belonogova, N.M.; Axenovich, T.I. Region-Based Association Test for Familial Data under Functional Linear Models. PLoS ONE 2015, 10, e0128999.
Zhang, F.; Xie, D.; Liang, M.; Xiong, M. Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits. PLoS Genet. 2016, 12, e1005965.
Li, Y.; Wang, F.; Wu, M.; Ma, S. Integrative functional linear model for genome-wide association studies with multiple traits. Biostatistics 2020, 1–17.
Wessel, J.; Schork, N.J. Generalized Genomic Distance–Based Regression Methodology for Multilocus Association Analysis. Am. J. Hum. Genet. 2006, 79, 792–806.
Mukhopadhyay, I.; Feingold, E.; Weeks, D.E.; Thalamuthu, A. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet. Epidemiol. 2010, 34, 213–221.
Turkmen, A.S.; Lin, S. Blocking Approach for Identification of Rare Variants in Family-Based Association Studies. PLoS ONE 2014, 9, e86126.
Martin, E.R.; Monks, S.A.; Warren, L.L.; Kaplan, N.L. A Test for Linkage and Association in General Pedigrees: The Pedigree Disequilibrium Test. Am. J. Hum. Genet. 2000, 67, 146–154.
Laird, N.; Lange, C. Family-based designs in the age of large-scale gene-association studies. Nat. Rev. Genet. 2006, 7, 385–394.
Ayers, K.L.; Cordell, H.J. Identification of Grouped Rare and Common Variants via Penalized Logistic Regression. Genet. Epidemiol. 2013, 37, 592–602.
Friedman, J.; Hastie, T.; Tibshirani, R. A Note on the Group Lasso and a Sparse Group Lasso; Technical Report; Stanford University: Stanford, CA, USA, 2010.
Fan, R.; Wang, Y.; Mills, J.L.; Wilson, A.F.; Bailey-Wilson, J.E.; Xiong, M. Functional Linear Models for Association Analysis of Quantitative Traits. Genet. Epidemiol. 2013, 37, 726–742.
Hastie, T.J. Generalized additive models. In Statistical Models in S; Chambers, J.M., Hastie, T.J., Eds.; T Bell Laboratories: Yonkers, NY, USA, 1992.
Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002.
Izenman, A.J. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning; Springer: New York, NY, USA, 2008.
Lin, Z.H.; Cao, J.G.; Wang, L.L.; Wang, H. Locally Sparse Estimator for Functional Linear Regression Models. J. Comput. Graph. Stat. 2017, 26, 306–318.
Weisberg, S. Applied Linear Regression, 4th ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2013; ISBN 9781118594858.
Su, Y.R.; Di, C.Z.; Hsu, L. Hypothesis testing in functional linear models. Biometrics 2017, 73, 551–561.
Ramsay, J.O.; Silverman, B.W. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005.
Belonogova, N.M.; Svishcheva, G.R.; Wilson, J.F.; Campbell, H.; Axenovich, T.I. Weighted functional linear regression models for gene-based association analysis. PLoS ONE 2018, 13, e0190486.
Schaffner, S.F.; Foo, C.; Gabriel, S.; Reich, D.; Daly, M.J.; Altshuler, D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005, 15, 1576–1583.

Figure 1. Simulated power for gene regions 2, 4, 5, 7, and 10 using the multi-gene region loci weighted method. Scenario I: the effects of all associated loci are positive. Scenario II: the effects of all associated loci in the 4th and 7th gene regions are negative. Scenario III: in every associated region, one randomly chosen locus has a negative effect.

Figure 2. Simulated false positive rates for gene regions 1, 3, 6, 8, and 9 using the multi-gene region loci weighted method. Scenario I: the effects of all associated loci are positive. Scenario II: the effects of all associated loci in the 4th and 7th gene regions are negative. Scenario III: in every associated region, one randomly chosen locus has a negative effect.

Table 1. Simulated power for four types of multi-gene regions under the common multi-gene region method.

Gene Effect	α	Common Variant Multi-Gene Region				Rare Variant Multi-Gene Region				Hybrid Variant Multi-Gene Region I				Hybrid Variant Multi-Gene Region II
		Step	Multi-SLoS	FLM	SLoS	Step	Multi-SLoS	FLM	SLoS	Step	Multi-SLoS	FLM	SLoS	Step	Multi-SLoS	FLM	SLoS
Scenario I	0.05	0.9560	0.8360	0.9680	0.8540	0.9880	0.7120	0.9820	0.7140	0.9720	0.7480	0.3120	0.7720	0.9000	0.8460	0.8980	0.8600
	0.01	0.9560	0.8300	0.9540	0.8500	0.9880	0.7020	0.9600	0.6880	0.9700	0.7440	0.2640	0.7700	0.8980	0.8460	0.8720	0.8520
	0.001	0.9480	0.8260	0.9240	0.8480	0.9880	0.6860	0.9200	0.6660	0.9640	0.7400	0.2420	0.7620	0.8900	0.8440	0.8620	0.8480
	1 × 10⁻⁴	0.9220	0.8120	0.8780	0.8360	0.9860	0.6820	0.8820	0.6580	0.9600	0.7320	0.2220	0.7540	0.8780	0.8240	0.8460	0.8260
	1 × 10⁻⁶	0.8600	0.7540	0.7920	0.7600	0.9780	0.6720	0.8380	0.6420	0.9200	0.7120	0.2000	0.7240	0.8660	0.8120	0.8020	0.8000
	1 × 10⁻⁸	0.7840	0.6600	0.6680	0.6500	0.9680	0.6660	0.7880	0.6300	0.8780	0.6900	0.1780	0.6800	0.8520	0.7920	0.7420	0.7560
Scenario II	0.05	0.9680	0.8560	0.9800	0.8860	0.9860	0.6900	0.9520	0.6900	0.9600	0.7340	0.3040	0.7520	0.8800	0.8240	0.8900	0.8540
	0.01	0.9660	0.8460	0.9640	0.8840	0.9860	0.6740	0.9400	0.6580	0.9600	0.7340	0.2640	0.7480	0.8800	0.8200	0.8560	0.8540
	0.001	0.9540	0.8360	0.9440	0.8840	0.9860	0.6600	0.9240	0.6480	0.9560	0.7320	0.2360	0.7400	0.8760	0.8160	0.8320	0.8320
	1 × 10⁻⁴	0.9360	0.8200	0.9080	0.8720	0.9840	0.6580	0.8980	0.6400	0.9480	0.7280	0.2280	0.7240	0.8720	0.8080	0.8100	0.8060
	1 × 10⁻⁶	0.8940	0.7440	0.8040	0.7920	0.9840	0.6460	0.8420	0.6280	0.9080	0.7040	0.2040	0.6880	0.8400	0.7820	0.7620	0.7620
	1 × 10⁻⁸	0.8080	0.6640	0.6900	0.6740	0.9680	0.6400	0.7980	0.6140	0.8560	0.6820	0.1860	0.6400	0.8120	0.7580	0.7180	0.7200
Scenario III	0.05	0.9560	0.8240	0.9740	0.8420	0.9940	0.6980	0.9660	0.6680	0.9480	0.7620	0.2880	0.7840	0.8900	0.8360	0.8900	0.8540
	0.01	0.9540	0.8200	0.9460	0.8420	0.9940	0.6660	0.9480	0.6300	0.9480	0.7580	0.2380	0.7760	0.8900	0.8340	0.8700	0.8500
	0.001	0.9380	0.8100	0.9120	0.8400	0.9940	0.6440	0.9200	0.6080	0.9380	0.7480	0.2180	0.7600	0.8880	0.8300	0.8440	0.8340
	1 × 10⁻⁴	0.9120	0.7940	0.8680	0.8280	0.9900	0.6280	0.8900	0.5940	0.9240	0.7320	0.2020	0.7380	0.8760	0.8160	0.8220	0.8140
	1 × 10⁻⁶	0.8420	0.6940	0.7680	0.7200	0.9860	0.6160	0.8340	0.5740	0.8860	0.7000	0.1840	0.6900	0.8540	0.7860	0.7780	0.7820
	1 × 10⁻⁸	0.7620	0.6000	0.6720	0.6340	0.9700	0.6000	0.7880	0.5540	0.8360	0.6500	0.1640	0.6180	0.8400	0.7580	0.7300	0.7340
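
For readers who want to reproduce entries of this kind, the sketch below shows the standard way an empirical rejection rate is computed from simulation replicates: the fraction of replicates whose p-value falls below the significance level α. It is a minimal illustration, not the authors' code. The function name `empirical_rejection_rate` and the ten p-values are hypothetical and are not taken from the paper's simulations. Applied to a region with a true effect, the rate estimates power (as in Table 1 and Figure 1); applied to a region with no effect, it estimates the false positive rate (as in Figure 2).

```python
from typing import Dict, Iterable, Sequence


def empirical_rejection_rate(
    p_values: Sequence[float], alphas: Iterable[float]
) -> Dict[float, float]:
    """Return, for each alpha, the fraction of replicates with p-value < alpha."""
    if not p_values:
        raise ValueError("need at least one simulation replicate")
    n = len(p_values)
    return {alpha: sum(p < alpha for p in p_values) / n for alpha in alphas}


# Hypothetical p-values from 10 replicates of one test (illustrative only,
# not data from the paper).
p_values = [1e-9, 3e-7, 2e-5, 4e-4, 0.002, 0.008, 0.03, 0.04, 0.2, 0.6]

# The same significance levels as the rows of Table 1.
alphas = [0.05, 0.01, 0.001, 1e-4, 1e-6, 1e-8]

rates = empirical_rejection_rate(p_values, alphas)
for alpha in alphas:
    print(f"alpha={alpha:g}  rejection rate={rates[alpha]:.4f}")
```

Because the rate is a proportion over replicates, its resolution is limited by the number of replicates: with 10 replicates, as here, it can only change in steps of 0.1.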