*2.6. Regularization Method*

In human genome-wide association studies, regularization methods based on penalized likelihood are widely used to identify disease-related genes or genetic regions because they are computationally efficient for high-dimensional genomic data [62–68]. The penalized likelihood function with an elastic-net penalty is defined as

$$Q(\beta) = -l(\beta) + \lambda \alpha \sum_{j=1}^{p} |\beta_j| + \lambda (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \tag{1}$$

where *l*(*β*) is the log-likelihood function, *β* is the *p*-dimensional coefficient vector, *λ* ≥ 0 is a tuning parameter for sparsity, and *α* ∈ [0, 1] is a tuning parameter for smoothness. When *α* = 1, the minimizer of *Q*(*β*) is the least absolute shrinkage and selection operator (LASSO) estimate [69]. The estimated coefficient vector consists mostly of zeros, with only a few nonzero values. The selection probability of each SNP was computed from 100 bootstrap samples: within each bootstrap sample, only SNPs with nonzero coefficients were counted as selected, and the selection probability of a SNP is the proportion of bootstrap samples in which it was selected. Finally, we identified the top-ranked SNPs by their selection probability.
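The bootstrap selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: a squared-error loss stands in for the negative log-likelihood −*l*(*β*), *α* is fixed at 1 (LASSO), and the fit uses a small pure-Python coordinate-descent solver. The function names `soft_threshold`, `lasso_cd` and `selection_probability`, and the parameters `lam`, `n_iter`, `tol`, `n_boot` and `seed`, are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100, tol=1e-8):
    """Minimise (1/2n) * ||y - X b||^2 + lam * sum_j |b_j| by coordinate descent.

    Assumptions: squared-error loss replaces -l(beta), alpha = 1, and the
    columns of X and the response y are centred so no intercept is needed.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual for beta = 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:  # constant column carries no information
                continue
            # Covariance of column j with the partial residual (excluding j).
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta


def selection_probability(X, y, lam, n_boot=100, seed=0):
    """Proportion of bootstrap samples in which each SNP has a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        # Resample individuals with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]
```

Calling `selection_probability(X, y, lam)` on a genotype matrix `X` (rows are individuals, columns are SNPs) returns one value in [0, 1] per SNP; sorting these values in descending order gives the ranking of SNPs. For a binary disease phenotype, the squared-error loss here would need to be replaced by the appropriate log-likelihood.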

To select significant SNPs, we used two types of selection-probability threshold, each of which can control the number of falsely selected SNPs. The first is the theoretical threshold proposed in [70]. The second is the empirical threshold [71], which is essentially a quantile of the empirical distribution of selection probabilities obtained by permutation. The extensive simulation studies reported there demonstrated that the number of falsely selected SNPs can be controlled when the empirical threshold is applied to high-dimensional genomic data. The theoretical threshold (*π*<sub>*θ*</sub>) and the empirical threshold (*π*<sup>∗</sup><sub>*θ*</sub>) can be written as:

$$
\pi_{\theta} = \frac{q_{\Lambda}^{2}}{2\theta p} + \frac{1}{2} \quad \text{and} \quad \pi_{\theta}^{*} = \frac{1}{B} \sum_{b=1}^{B} SP_{(b)}^{[\theta]}(I_{b}), \tag{2}
$$


where *θ* is the upper bound of the expected number of false discoveries, *q*<sub>Λ</sub> is the average number of selected SNPs, *B* is the number of permutations, and *I*<sub>*b*</sub> is the *b*-th randomly permuted sample. We denote by *SP*<sup>[*θ*]</sup><sub>(*b*)</sub> the *θ*-th largest selection probability for the *b*-th permuted sample, where the selection probabilities are sorted in descending order such that *SP*<sup>[1]</sup><sub>(*b*)</sub> > ··· > *SP*<sup>[*p*]</sup><sub>(*b*)</sub>. We chose *θ* = 1, so that the expected number of SNPs falsely selected by each threshold is bounded above by *θ* = 1.
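The two thresholds in Eq. (2) can be computed with a few lines of code. The sketch below is illustrative only: the function names `theoretical_threshold` and `empirical_threshold` are hypothetical, and the selection probabilities for each permuted sample are assumed to have been computed beforehand and passed in as one list per permutation.

```python
def theoretical_threshold(q_lambda, theta, p):
    """pi_theta = q_Lambda^2 / (2 * theta * p) + 1/2, following Eq. (2)."""
    return q_lambda ** 2 / (2.0 * theta * p) + 0.5


def empirical_threshold(perm_selection_probs, theta):
    """Average over permutations of the theta-th largest selection probability.

    perm_selection_probs: one list of p selection probabilities per
    permuted sample (assumed precomputed); theta is 1-based.
    """
    top = []
    for sp in perm_selection_probs:
        ranked = sorted(sp, reverse=True)  # descending order
        top.append(ranked[theta - 1])
    return sum(top) / len(top)
```

For instance, with an average of 10 selected SNPs among *p* = 1000 and *θ* = 1, `theoretical_threshold(10, 1, 1000)` evaluates 100/2000 + 1/2. Because the first term grows with the square of the average number of selected SNPs, the theoretical threshold exceeds 1 whenever that square is larger than *θp*, in which case no SNP can pass it.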
