Integrating External Controls by Regression Calibration for Genome-Wide Association Study

Zhu, Lirong; Yan, Shijia; Cao, Xuewei; Zhang, Shuanglin; Sha, Qiuying

doi:10.3390/genes15010067

Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA

* Author to whom correspondence should be addressed.

Genes 2024, 15(1), 67; https://doi.org/10.3390/genes15010067

Submission received: 13 December 2023 / Revised: 30 December 2023 / Accepted: 1 January 2024 / Published: 3 January 2024

(This article belongs to the Special Issue Genome-Wide Identification: Recent Trends in Genomic Studies, Volume II)

Abstract

Genome-wide association studies (GWAS) have successfully revealed many disease-associated genetic variants. In a case-control study, adequate power for an association test can be achieved with a large sample size, but genotyping large samples is expensive. A cost-effective strategy to boost power is to integrate external control samples from publicly available genotype data. However, the naive integration of external controls may inflate type I error rates if systematic differences between studies (batch effects), such as differences in sequencing platforms, genotype-calling procedures, and population stratification, are ignored. To account for batch effects, we propose an approach that integrates External Controls into the Association Test by Regression Calibration (iECAT-RC) in case-control association studies. Extensive simulation studies show that iECAT-RC not only controls type I error rates but also boosts statistical power in all models. We also apply iECAT-RC to the UK Biobank data for M72 fibroblastic disorders, treating genotype calling as the batch effect. Four SNPs associated with fibroblastic disorders were detected by iECAT-RC and by the two comparison methods, iECAT-Score and Internal. However, our method has a higher probability of identifying these significant SNPs in the scenario of an unbalanced case-control association study.

Keywords: genome-wide association test; case-control study; batch effect; data integration

1. Introduction

Genome-wide association studies (GWASs) play a major role in associating specific genetic variants with continuous or dichotomous phenotypes [1,2,3]. Sometimes, researchers may have limited access to individuals’ genetic information regarding specific phenotypes, and large-scale genetic studies can be expensive and resource-intensive [4]. Thus, with a small sample size in a GWAS, an association test could have low power and may also increase the possibility of false-positive findings, especially for infrequent variants (i.e., minor allele frequency (MAF) < 5%), where MAF refers to the frequency at which the less common allele occurs in a given population [5,6].

The rapid development of sequencing technologies has promoted substantial advancement in GWASs, particularly in obtaining comprehensive genetic information from limited samples [7,8]. This advancement provides an opportunity to enhance the power of single-variant association tests in case-control studies, and several approaches have been proposed. Firstly, the utilization of time-to-event data in case-control studies provides valuable insights into the timing and dynamics of events. However, this approach may lose information compared to cohort studies because of potential censoring, where some individuals do not experience the event of interest by the end of the study or analysis. Secondly, the integration of sequenced samples from internal and external sources provides a great opportunity for identifying novel genetic associations and increasing the statistical power of single-variant association tests [9]. Specifically, internal sources encompass data generated or collected within the study, which typically include genotype data from genotyping arrays or sequencing platforms, and external sources refer to data obtained from outside the immediate study, which may differ in sequencing platforms, genotype-calling procedures, population structure, and so forth. Nevertheless, the integration of sequenced samples from internal and external studies is challenging [10]. By incorporating sequenced samples from other studies as an external control sample, a single study can significantly increase the power of single-variant tests without incurring additional sequencing costs. However, systematic differences (batch effects) arise from various sources, such as different genotyping arrays or sequencing platforms. Integrating sequenced samples from internal and external studies without accounting for these batch effects could inflate type I error rates and increase the possibility of false-positive findings in association studies [11].

Recently, several likelihood-based methods have been proposed to tackle the systematic differences between internal genotyped data and external genotyped data [12]. Liu and Leal proposed the SEQCHIP method to correct bias when integrating genotype data in rare-variant association studies [13]. Derkach et al. proposed another method that substitutes the genotype calls with the expected values given by observed sequence data to account for differential read depths between studies [14]. Chen and Lin proposed regression calibration (RC) methods aimed at addressing the differential sequencing errors between cases and controls [15]. Despite these powerful methods, the calculation of genotype probabilities and the management of sequence read data are challenging in terms of both complexity and cost, particularly in large-scale genetic studies. Therefore, the Proxy External Controls Association Test (ProxECAT) only utilizes allele frequencies of internal cases and external controls to estimate the enrichment of rare variants within a gene [16]. However, the absence of internal controls potentially limits the power of the association test. In contrast, the Integrating External Controls into Association Test (iECAT) uses allele counts from internal cases, internal controls, and external controls to conduct the rare-variant association test [11]. Subsequently, a Bayesian approach is employed to assess the presence of batch effects by comparing the odds ratio estimates between internal controls and combined controls of internal and external studies. External controls that are not subject to batch effects are then integrated with internal samples to increase the sample size. It has been demonstrated that this method can control type I error rates, as well as improve the power of the association test. However, this method cannot adjust for covariates such as age, gender, and so on [11]. 
Based on the aforementioned method, Li and Lee proposed a novel score-based test that constructs a shrinkage score statistic using internal samples and external control samples, allowing for covariate adjustment for region-based tests [17]. However, the power increase of this method in association testing by integrating external controls is limited for extremely unbalanced case-control studies.

In this study, we present a novel approach that integrates External Controls into Association Tests by Regression Calibration (iECAT-RC) to incorporate external control samples in case-control studies. The objective of this research is to boost the statistical power of the single-variant association test by integrating external controls with the adjustment of batch effects. Our approach adjusts the genotypes of an external control sample to approximate the same distribution as that of the genotypes in the internal control sample through regression calibration. Furthermore, we apply the saddlepoint approximation [18] and efficient resampling [19] methods to control type I error rates with imbalanced case-control and low minor allele count (MAC) scenarios, respectively.

2. Materials and Methods

A dichotomous phenotype with case and control states was considered. A case is an individual exhibiting a specific characteristic, coded as 1, whereas a control is an individual who does not exhibit this characteristic, coded as 0. It was assumed that the internal study had sample size $n^{I}$, with $n_{0}^{I}$ controls and $n_{1}^{I}$ cases, so that $n_{0}^{I} + n_{1}^{I} = n^{I}$; the external study had $n_{0}^{E}$ controls. For the $i$th subject, let $Y_{i} \in \{0, 1\}$ be the dichotomous phenotype. Let $G_{1}, G_{2}, \dots, G_{n_{0}^{I}}$ denote the genotypes of the internal control sample, $G_{n_{0}^{I} + 1}, G_{n_{0}^{I} + 2}, \dots, G_{n^{I}}$ the genotypes of the internal case sample, and $g_{1}, g_{2}, \dots, g_{n_{0}^{E}}$ the genotypes of the external control sample at a genetic variant. Each genotype is the number of copies of the minor allele carried by the subject at that variant and takes the value 0, 1, or 2. $X_{i}^{I}$ denotes the first $p$ principal components of the internal genotypes, and $X_{i}^{E}$ denotes the first $p$ principal components of the external genotypes for the $i$th subject. $p = 10$ was used in our simulation studies and real data analysis [20].

Motivated by the iECAT-Score method [21], we propose a new method that integrates external controls into association tests to boost statistical power. Our proposed method involves three steps: (1) adjusting the genotypes of external controls using regression calibration, (2) conducting a single-variant association test, and (3) calibrating the single-variant test using the saddlepoint approximation (SPA) [18] and efficient resampling (ER) [19] methods, which address imbalanced case-control ratios and low MAC, respectively. By following these three steps, the iECAT-RC method reduces the impact of batch effects and improves the power of the single-variant association test.

Step 1. Adjusting the Genotypes of External Controls by Regression Calibration

To adjust the genotype of external control samples for the batch effect, we propose using the following procedure:

(1): Without loss of generality, $n_{0}^{E} \geq n_{0}^{I}$ is assumed. A total of $n_{0}^{I}$ individuals with genotypes $g_{k 1}, \dots, g_{k n_{0}^{I}}$ are chosen from the external control samples.
(2): A linear regression model $G_{i} = β_{0}^{(k)} + β_{1}^{(k)} g_{k i} + α_{I}^{(k)} X_{i}^{I} + α_{E}^{(k)} X_{k i}^{E}$ is assumed for $i = 1, \dots, n_{0}^{I}$, where ${\hat{β}}^{(k)} = {({\hat{β}}_{0}^{(k)}, {\hat{β}}_{1}^{(k)}, {\hat{α}}_{I}^{(k)}, {\hat{α}}_{E}^{(k)})}^{T}$ is the least-squares estimate of $β^{(k)} = {(β_{0}^{(k)}, β_{1}^{(k)}, α_{I}^{(k)}, α_{E}^{(k)})}^{T}$.
(3): Steps (1) and (2) are repeated $K$ times. ${\hat{β}}^{(1)}, \dots, {\hat{β}}^{(K)}$ are obtained and the average value $\hat{β} = {({\hat{β}}_{0}, {\hat{β}}_{1}, {\hat{α}}_{I}, {\hat{α}}_{E})}^{T} = \sum_{k = 1}^{K} {\hat{β}}^{(k)} / K$ is calculated. Let $G_{n^{I} + i} = {\hat{β}}_{0} + {\hat{β}}_{1} g_{i} + {\hat{α}}_{I} X_{i}^{I} + {\hat{α}}_{E} X_{i}^{E}$ for $i = 1, \dots, n_{0}^{I}$. When $G_{n^{I} + i} < a_{0}$, let $G_{n^{I} + i}$ be 0, where $a_{0}$ is determined such that the frequency of 0 in the internal control genotypes is equal to the frequency of 0 in $G_{n^{I} + i}$ for $i = 1, \dots, n_{0}^{I}$. When $a_{0} \leq G_{n^{I} + i} < a_{1}$, let $G_{n^{I} + i}$ be 1, where $a_{1}$ is determined such that the frequency of 1 in the internal control genotypes is equal to the frequency of 1 in $G_{n^{I} + i}$ for $i = 1, \dots, n_{0}^{I}$. When $G_{n^{I} + i} \geq a_{1}$, let $G_{n^{I} + i}$ be 2.

The above procedure is repeated until $G_{n^{I} + i}$ is obtained for $i = 1, \dots, n_{0}^{E}$. Then, the association test is performed based on the internal case-control data and external control data with genotypes $G_{1}, G_{2}, \dots, G_{n_{0}^{I}}, G_{n_{0}^{I} + 1}, G_{n_{0}^{I} + 2}, \dots, G_{n^{I}}, G_{n^{I} + 1}, \dots, G_{n^{I} + n_{0}^{E}}$.
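The calibration procedure above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes the principal-component covariates are dropped (so each regression has only an intercept and a slope), it calibrates the whole external sample at once rather than in batches of $n_{0}^{I}$, and it picks the thresholds $a_{0}$ and $a_{1}$ by rank. The function names `fit_line` and `calibrate_external` are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope of y on x (no covariates)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope

def calibrate_external(internal_ctrl, external_ctrl, K=50, seed=0):
    """Return external control genotypes recoded to 0/1/2 after calibration."""
    rng = random.Random(seed)
    n0, m = len(internal_ctrl), len(external_ctrl)
    b0 = b1 = 0.0
    for _ in range(K):
        subset = rng.sample(external_ctrl, n0)         # step (1): draw n0 external controls
        est0, est1 = fit_line(subset, internal_ctrl)   # step (2): least-squares fit
        b0 += est0 / K                                 # step (3): average over K repeats
        b1 += est1 / K
    fitted = [b0 + b1 * g for g in external_ctrl]
    # Rank-based thresholds (assumption): the lowest fitted values become 0 and
    # the next block becomes 1, so that the 0/1/2 frequencies match those of the
    # internal controls. Ties are broken by sample order.
    c0 = round(m * internal_ctrl.count(0) / n0)
    c1 = min(round(m * internal_ctrl.count(1) / n0), m - c0)
    order = sorted(range(m), key=lambda i: fitted[i])
    adjusted = [2] * m
    for rank, i in enumerate(order):
        if rank < c0:
            adjusted[i] = 0
        elif rank < c0 + c1:
            adjusted[i] = 1
    return adjusted
```

Without the principal components the fitted values take only three distinct levels, so the frequency matching is driven largely by how ties are broken; the full model in step (2) yields continuous fitted values and avoids this.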

Step 2. Single-Variant Association Test

The adjusted genotypes of the internal and external studies are integrated. Let $G = {(G_{1}, G_{2}, \dots, G_{n})}^{T}$ be the genotype vector at a variant of interest for $n$ subjects, where $n = n^{I} + n_{0}^{E}$. Assuming a total of $q$ covariates, the phenotype $Y_{i}$ is linked to the covariate vector $Z_{i}$ and the genotype $G_{i}$ through the logistic regression model $\mathrm{logit} [P (Y_{i} = 1 | Z_{i}, G_{i})] = Z_{i}^{T} α + G_{i} β$, where $Y_{i}$ follows a Bernoulli distribution, $α$ is a $q \times 1$ coefficient vector for the $q$ covariates, including the intercept, and $β$ is the genotype effect at the variant. Evaluating the association between the phenotype and the genotype at the variant is then equivalent to testing $H_{0} : β = 0$.

Let $μ = \{μ_{i}\} = \{P (Y_{i} = 1 | Z_{i})\}$ and let ${\hat{μ}}_{i}$ be the maximum-likelihood estimate of $μ_{i}$ under $H_{0}$. In the score test, the score is given by $S = {\tilde{G}}^{T} (Y - \hat{μ})$, where $Y = {(Y_{1}, \dots, Y_{n})}^{T}$, $\tilde{G} = \{{\tilde{G}}_{i}\} = G - Z {(Z^{T} V Z)}^{-1} Z^{T} V G$, $Z$ is the $n \times q$ covariate matrix whose $i$th row is $Z_{i}^{T}$, and $V = \mathrm{diag} \{{\hat{μ}}_{i} (1 - {\hat{μ}}_{i})\}$ [2]. Under the null hypothesis of no genetic effect, $E (S) = 0$ and $Var (S) = \sum_{i = 1}^{n} {\tilde{G}}_{i}^{2} {\hat{μ}}_{i} (1 - {\hat{μ}}_{i})$. Then, the score test statistic $T_{Score} = S^{2} / Var (S)$ asymptotically follows the chi-square distribution with 1 degree of freedom, and the p-value can be obtained as $p = P (χ_{1}^{2} > S^{2} / Var (S))$.
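As a rough illustration of the score test, the sketch below covers the special case in which the only covariate is the intercept. Under that assumption, ${\hat{μ}}_{i}$ is the sample case fraction for every subject and $\tilde{G}$ reduces to the mean-centered genotype vector. The function names are hypothetical, and the chi-square tail probability uses the identity $P (χ_{1}^{2} > x) = \mathrm{erfc} (\sqrt{x / 2})$.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def score_test_intercept_only(G, Y):
    """Score test of H0: beta = 0 with an intercept-only null model."""
    n = len(Y)
    mu_hat = sum(Y) / n                    # MLE of P(Y = 1) under H0
    g_bar = sum(G) / n
    G_tilde = [g - g_bar for g in G]       # projection onto the intercept removed
    S = sum(gt * (y - mu_hat) for gt, y in zip(G_tilde, Y))
    var_S = mu_hat * (1.0 - mu_hat) * sum(gt * gt for gt in G_tilde)
    stat = S * S / var_S
    return S, var_S, stat, chi2_sf_1df(stat)
```

With additional covariates, ${\hat{μ}}_{i}$ would come from a fitted null logistic regression and $\tilde{G}$ from the weighted projection given above; this sketch does not cover that case.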

Step 3. Calibrating Single-Variant Test Using the SPA and ER Methods

The single-variant score test statistic approximately follows the normal distribution under the null hypothesis. For balanced case-control studies with common variants, variance estimates derived from this asymptotic test behave well. However, when the case-control ratio is not balanced or the MAC is low, leading to extremely low allele frequencies, the underlying distribution of the test statistic may be highly skewed. Thus, the conventional asymptotic score test underperforms in such scenarios and may produce conservative or anticonservative results [22,23].

To account for an unbalanced case-control ratio, the SPA method is applied to obtain the p-value [18]. When the MAC is low ($MAC < 10$), the ER method is used to obtain the p-value [19].

(1). SPA Method

SPA is an improvement over the normal approximation, which uses only the mean and variance to approximate the underlying distribution, whereas SPA uses the entire cumulant-generating function (CGF). Given the score test statistic $S = \sum_{i = 1}^{n} {\tilde{G}}_{i} (Y_{i} - {\hat{μ}}_{i})$, the estimated CGF of $S$ is $K (t) = \log (E_{H_{0}} (e^{t S})) = \sum_{i = 1}^{n} \log (1 - {\hat{μ}}_{i} + {\hat{μ}}_{i} e^{{\tilde{G}}_{i} t}) - t \sum_{i = 1}^{n} {\tilde{G}}_{i} {\hat{μ}}_{i}$. According to the SPA method, the distribution of $S$ can be estimated by

$$Pr (S < s) \approx \tilde{F} (s) = Φ \left\{ω + \frac{1}{ω} \log \left(\frac{ν}{ω}\right)\right\},$$

where $ω = \mathrm{sgn} (\hat{t}) \sqrt{2 (\hat{t} s - K (\hat{t}))}$, $ν = \hat{t} \sqrt{K'' (\hat{t})}$, $K' (t)$ and $K'' (t)$ are the estimated first- and second-order derivatives of $K$, $\hat{t}$ is the solution to the equation $K' (\hat{t}) = s$, and $Φ$ is the distribution function of the standard normal distribution [18]. The p-value can be obtained using the R package SPAtest.
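The saddlepoint formula can be evaluated numerically as in the following sketch, which is an illustration under stated assumptions rather than the SPAtest implementation. The saddlepoint $\hat{t}$ is found by bisection on a fixed bracket (an assumption that suits moderate genotype values), and the code falls back to the normal approximation when $\hat{t}$ is very close to zero, where the formula has a removable singularity. All function names are hypothetical.

```python
import math

def cgf(t, g, mu):
    """K(t) for S = sum_i g_i (Y_i - mu_i) with Y_i ~ Bernoulli(mu_i)."""
    return sum(math.log(1 - m + m * math.exp(gi * t)) - t * gi * m
               for gi, m in zip(g, mu))

def cgf_d1(t, g, mu):
    """First derivative K'(t)."""
    return sum(m * gi * math.exp(gi * t) / (1 - m + m * math.exp(gi * t)) - gi * m
               for gi, m in zip(g, mu))

def cgf_d2(t, g, mu):
    """Second derivative K''(t)."""
    return sum((1 - m) * m * gi * gi * math.exp(gi * t)
               / (1 - m + m * math.exp(gi * t)) ** 2
               for gi, m in zip(g, mu))

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def spa_cdf(s, g, mu, bracket=20.0):
    """Saddlepoint approximation to Pr(S < s)."""
    lo, hi = -bracket, bracket
    for _ in range(200):                   # K' is increasing, so bisection works
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid, g, mu) < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    if abs(t_hat) < 1e-4:                  # near the mean: normal approximation
        return normal_cdf(s / math.sqrt(cgf_d2(0.0, g, mu)))
    omega = math.copysign(
        math.sqrt(max(2.0 * (t_hat * s - cgf(t_hat, g, mu)), 0.0)), t_hat)
    nu = t_hat * math.sqrt(cgf_d2(t_hat, g, mu))
    return normal_cdf(omega + math.log(nu / omega) / omega)
```

For a balanced design the result stays close to the normal approximation; the two diverge as the case-control ratio becomes unbalanced and the distribution of $S$ becomes skewed.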

(2). ER Method

The ER method is used for rare-variant association tests with binary traits. Given the phenotype $Y$, genotype $G$, and covariate $Z$, the p-value of the ER method is defined as

$$\Pr (Q \geq \hat{Q} | Y, G, Z) = \sum_{d = 0}^{m} \Pr (Q \geq \hat{Q} | D = d, Y, G, Z) \Pr (D = d | Y, G, Z),$$

where $\hat{Q}$ is the score test statistic computed from the original phenotype, $m$ is the number of individuals carrying a minor allele, and $D$ is the number of cases among these $m$ individuals [19]. The p-value can be obtained using the R package SKAT.

3. Simulations

To evaluate the performance of the proposed iECAT-RC method in terms of type I error rates and power, we carried out simulation studies under a series of scenarios. We generated the binary phenotypes with cases and controls from the logistic regression model $\mathrm{logit} [P (Y = 1 | Z, G)] = α_{0} + 0.5 Z_{1} + 0.5 Z_{2} + β G + ε$, where $Z_{1}$ is a continuous covariate generated from the standard normal distribution, $Z_{2}$ is a binary covariate taking the values $0$ and $1$, each with probability $0.5$, $α_{0}$ is chosen such that the disease prevalence is $0.05$, $G$ is the genotype at a variant generated from the binomial distribution $\mathrm{Bin} (2, MAF)$, $β$ is the effect size of the variant, and $ε$ follows a standard normal distribution. The $MAF$ was sampled from the empirical Mini-Exome genotype data provided by GAW17, which include $24{,}487$ variants in $3205$ genes, as introduced in Sha et al. [2].
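The phenotype-generating model can be sketched as follows. Several details are assumptions made for illustration only: the MAF is passed in as a fixed number rather than sampled from the GAW17 data, $α_{0}$ is found by bisection so that the average disease probability in the simulated sample equals the target prevalence (the text does not say how $α_{0}$ was chosen), and the subsequent sampling of cases and controls is omitted. The function names are hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_population(n, maf, beta, prevalence=0.05, seed=1):
    """Simulate genotypes and binary phenotypes from the logistic model."""
    rng = random.Random(seed)
    G, lin = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)                          # continuous covariate
        z2 = 1 if rng.random() < 0.5 else 0               # binary covariate
        g = (rng.random() < maf) + (rng.random() < maf)   # Bin(2, MAF) genotype
        eps = rng.gauss(0.0, 1.0)                         # standard normal error
        G.append(g)
        lin.append(0.5 * z1 + 0.5 * z2 + beta * g + eps)
    # Bisection on alpha_0 (assumption): the mean disease probability is
    # increasing in alpha_0, so the target prevalence has a unique solution.
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sum(expit(mid + v) for v in lin) / n < prevalence:
            lo = mid
        else:
            hi = mid
    alpha0 = 0.5 * (lo + hi)
    Y = [1 if rng.random() < expit(alpha0 + v) else 0 for v in lin]
    return alpha0, G, Y, lin
```

Internal cases and controls and external controls would then be drawn from such a population according to the chosen sample-size ratio.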

To simulate the batch effect between internal and external control studies, we first defined the differential variant size (DVS) as the proportion of variants with different MAFs between the internal and external control samples. For these variants, we randomly generated the MAFs of the external controls under two scenarios to mimic the degree of the batch effect: (1) $\mathrm{Uniform} (0.1 q, 4 q)$ and (2) $2 q$, where $q$ is the MAF of the corresponding variant in the internal sample. Subsequently, we considered different numbers of cases and controls in the internal sample and of controls in the external sample. We set the following three ratios between the internal cases, internal controls, and external controls $(n_{1}^{I} : n_{0}^{I} : n_{0}^{E})$: (1) $5000 : 5000 : 10{,}000$, (2) $6667 : 3333 : 10{,}000$, and (3) $500 : 5000 : 10{,}000$. Thus, we considered a total of six models. Model 1: the ratio $(n_{1}^{I} : n_{0}^{I} : n_{0}^{E})$ is $5000 : 5000 : 10{,}000$