A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia

Liyanage, Janaka S. S.; Estepp, Jeremie H.; Srivastava, Kumar; Rashkin, Sara R.; Sheehan, Vivien A.; Hankins, Jane S.; Takemoto, Clifford M.; Li, Yun; Cui, Yuehua; Mori, Motomi; Burgess, Stephen; DeBaun, Michael R.; Kang, Guolian

doi:10.3390/math10203743

Open AccessArticle

A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia

by

Janaka S. S. Liyanage

^1,†

,

Jeremie H. Estepp

^2,†,

Kumar Srivastava

¹,

Sara R. Rashkin

³

,

Vivien A. Sheehan

⁴,

Jane S. Hankins

³

,

Clifford M. Takemoto

³,

Yun Li

^5,6,7

,

Yuehua Cui

⁸

,

Motomi Mori

¹,

Stephen Burgess

^9,10,

Michael R. DeBaun

¹¹

and

Guolian Kang

^1,*

¹

Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA

²

Departments of Global Pediatric Medicine and Hematology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA

³

Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA

⁴

Department of Pediatrics, Emory University School of Medicine, Children’s Healthcare of Atlanta, Atlanta, GA 30322, USA

⁵

Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

⁶

Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

⁷

Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

⁸

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA

⁹

MRC Integrative Epidemiology Unit, University of Bristol, Bristol BS8 2BN, UK

¹⁰

Department of Public Health and Primary Care, University of Cambridge, Cambridge CB2 0SR, UK

¹¹

Department of Pediatrics, Vanderbilt-Meharry Sickle Cell Disease Center of Excellence, Vanderbilt University Medical Center, Nashville, TN 37232, USA

Show full affiliation list

Hide full affiliation list

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2022, 10(20), 3743; https://doi.org/10.3390/math10203743

Submission received: 18 August 2022 / Revised: 3 October 2022 / Accepted: 9 October 2022 / Published: 12 October 2022

(This article belongs to the Special Issue Mathematical and Statistical Methods in Detecting and Modeling Abnormalities, Extremes, and Rare Events)

Download

Browse Figures

Versions Notes

Abstract

Mendelian randomization (MR) is increasingly employed as a technique to assess the causation of a risk factor on an outcome using observational data. The two-stage least-squares (2SLS) procedure is commonly used to examine the causation using genetic variants as the instrument variables. The validity of 2SLS relies on a representative sample randomly selected from a study cohort or a population for genome-wide association study (GWAS), which is not always true in practice. For example, the extreme phenotype sequencing (EPS) design is widely used to investigate genetic determinants of an outcome in GWAS as it bears many advantages such as efficiency, low sequencing or genotyping cost, and large power in detecting the involvement of rare genetic variants in disease etiology. In this paper, we develop a novel, versatile, and efficient approach, namely MR analysis under Extreme or random Phenotype Sampling (MREPS), for one-sample MR analysis based on samples drawn through either the random sampling design or the nonrandom EPS design. In simulations, MREPS provides unbiased estimates for causal effects, correct type I errors for causal effect testing. Furthermore, it is robust under different study designs and has high power. These results demonstrate the superiority of MREPS over the widely used standard 2SLS approach. We applied MREPS to assess and highlight the causal effect of total fetal hemoglobin on anemia risk in patients with sickle cell anemia using two independent cohort studies. A user-friendly Shiny app web interface was implemented for professionals to easily explore the MREPS.

Keywords:

Mendelian randomization; extreme phenotype sequencing; causal inference; genome-wide association studies; next generation sequencing studies

MSC:

62D20

1. Introduction

Historically, making causal inferences on the effects of exposures or risk factors on important health outcomes in humans required executing a controlled experiment, but this approach is limited due to the necessity of exposure of subjects to risk factors, as well as other logistical considerations. Instead, researchers rely on observational studies of individuals with a history of exposure to a risk factor(s) of interest. Although making causal inferences is crucial to better understand the pathophysiology of a disease, conclusions on causality from observational studies are difficult to establish due to confounding and reverse causation effects [1,2,3]. This limitation in observational studies can be partly overcome by Mendelian randomization (MR) analyses, by utilizing genetic variants as instrumental variables (IVs) for the risk factor. Utilizing genetic variants as IVs facilitates analyzing them as natural experiments within observational cohorts, allowing for causal inference between outcomes and risk factors. With this advantage, MR has gained immense popularity as an epidemiologic methodology for evaluating causal inferences from observational studies.

The advent of genome-wide association studies (GWAS), either based on genotyping array or next generation sequencing (NGS), has been pivotal in better understanding the epidemiology and pathophysiology of many human diseases. Additionally, cohorts with existing genotype/phenotype data support exploration of causal influence of genetic variants in disease onset and severity leveraging MR methodologies. However, performing genomic sequencing of observational cohorts, frequently composed of a large number of participants, is daunting due to cost and many other reasons such as secondary findings and privacy loss. As a solution to these drawbacks, the employment of extreme phenotyping sequencing (EPS) design has been utilized with GWAS data. For more detail, see [4,5,6,7,8,9,10,11,12,13]. For EPS design, individuals are identified from the extreme quantiles of a selected phenotype or trait value distribution. The EPS design bears advantages of cost effectiveness, higher power in detecting rare variants, and efficiency, but, by only evaluating the extremes of phenotype distribution, EPS violates random sampling, a basic tenet of statistical methodology aimed at preventing bias. With this limitation, the EPS design has been largely constrained to typical case-control analyses.

The set-valued approach proposed by [14] with the maximum likelihood framework has unlocked the potential of EPS design by turning it into a procedure that can provide unbiased estimates even under the nonrandom extreme tails sampling process. This approach can produce versatile statistical procedures that can be universally applied to both nonrandom EPS and random sampling by turning the random sampling into a special case of EPS [15]. Because of this success, the set-valued approach has been utilized by [16] in investigating the association of genetic variants of secondary traits under the EPS design. By utilizing the framework suggested by [16], along with the set-valued approach, [15] have proposed a versatile mediation analysis approach, which can produce unbiased estimates under both EPS and random sampling for investigating the mediation effect of genotype through a mediator on a continuous phenotype. These studies show the utilization of the set-valued approach under the EPS design has the capacity to produce unbiased results similar to results obtained under random sampling. In this study, we report the capacity of the combination of EPS design and the set-valued approach in analyzing the causal association of endogenous risk factors on health outcomes by integrating the concepts developed in MR.

Under MR, the concepts and methods of IV and two-stage least-squares (2SLS), originally developed in econometrics [17], are utilized to explore the causal relationships between risk factors and health outcomes. The 2SLS procedure fits two regression models, 1st and 2nd stage regressions, on collected data. In order to obtain the most unbiased estimates of the model parameters via these regressions, it is essential to feed data drawn from a random sampling framework. A direct application of this conventional 2SLS under EPS design is inappropriate, as EPS severally violates the random sampling, which is the essential bias reducing procedure for statistical methods. This constraint limits the further exploration of the causal effects between risk factors and health outcomes by utilizing the genetic variants collected under already executed EPS design study, limiting the further extraction of the important information of the disease etiology. If exploring the causal relationship is also a priority, it is not necessary to redesign another random observational study and carry out another sequencing process, increasing the cost and sacrificing the efficiency and statistical power. However, if the 2SLS approach is versatile enough for handling EPS data, then we can explore all this information from already available EPS design without redesigning a new separate study, turning the EPS into an even more efficient study design that could address multiple statistical questions simultaneously by deploying multiple statistical methodologies under only a single EPS design. These plausible benefits, along with the surge of the EPS and MR designs, demand a new, versatile approach in exploring the causal relationship between risk factors and health outcomes, regardless of if the samples are drawn randomly or not.

Motivated by these potential benefits that the EPS design could bring, we developed novel MR estimation and testing procedures for the causal effect of the risk factor under the maximum likelihood framework by utilizing the limited information maximum likelihood (LIML) [18,19,20], which is the counterpart maximum likelihood approach of the 2SLS. We called this approach MR analysis under Extreme or random Phenotype Sampling (MREPS). Due to the capacity of producing unbiased estimates under the EPS design as well as turning the random sampling into a special case of EPS design, which is an essential characteristic for considering in the development of versatile approaches for both random and EPS samples, we utilized the set-valued approach for deriving versatile maximum likelihood function under both EPS and random sampling.

When deriving the models by considering the associated cost effectiveness, we consider three possible sample selection strategies: a random sampling selection and two EPS samples selection from the original cohort. One consideration for EPS samples is to draw the samples from the extreme tails of the health outcome value distribution of interest. The other is to draw the EPS samples from the extreme tails of the original primary outcome of interest in GWAS, which will become an endogenous risk factor for another clinical or health outcome of interest in another study. For example, in a benign ethnic neutropenia GWAS study, the samples were originally selected based on extreme values of white blood cell count (WBC) (dbGaP Study Accession: phs000507.v1.p1), which is a risk factor of many clinical or health outcomes such as coronary artery disease. Extensive simulations demonstrated that MREPS outperform the standard 2SLS approach under two EPS designs, while producing almost comparable results under the random sampling. The loss of information due to the estimation under the samples drawn from the original cohort is minimum under MREPS, regardless of whether samples are drawn either randomly or from extreme tails. This demonstrates that MREPS offers cost effective solutions without sacrificing desired statistical properties in terms of bias, standard errors, power, and coverage percentages of the confidence intervals (CIs). We then illustrate the use of the proposed MREPS in an example considering the causal effect of total fetal hemoglobin (HbF) on anemia in sickle cell anemia (SCA) patients enrolled in the Sickle Cell Clinical Research Intervention Program (SCCRIP) [21] and Baylor College of Medicine (BCM) as the discovery cohort [22] and using the Silent Cerebral Infarct Transfusion (SIT) Trial data [23] as the validation cohort. This is the first study to demonstrate a causal relationship between HbF (%) and total hemoglobin (Hb) concentration (g/dL) in SCA patients using MR analysis on observational data.

The rest of the article is arranged in the following order. Section 2 discusses the development of the statistical models, estimation process, and testing procedure for the causal effect of the risk factor. Section 3 describes and presents the results of simulation studies. Section 4 discusses the real data application with the results. Finally, a concluding discussion is given in Section 5.

2. Method

In this section, we discuss the development of the versatile statistical models for estimating and testing the causal effect of a risk factor on an outcome under the random sampling and EPS designs by integrating the MR framework.

2.1. Model

We consider a large cohort of N random individuals. The EPS and random samples of size n (<N) are drawn from this large cohort, which will be discussed under each design. Let

Y_{i}

be the continuous outcome of interest and

X_{i}

be the continuous risk factor for

i = 1, \dots, N

. Let

Z_{i} = {(1, Z_{i 1}, \dots, Z_{i k})}^{T}

be the

k + 1

-dimensional confounding vector for an integer k and

G_{i}

be the genetic instrument variable, which takes values 0, 1, and 2, for the number of minor alleles of a single nucleotide polymorphism (SNP), or it can be any combined effect of multiple SNPs such as polygenic or allele scores. Further, we assume

G_{i}

satisfies all valid instrument variable assumptions [24].

We utilize the concepts of the general MR framework and the set-valued approach [14] for constructing the versatile statistical model under EPS and random samples. We consider

G_{i}

as a valid instrument variable. Let the outcome, risk factor and confounding factors be linearly associated. Assume the risk factor, genetic instrumental variable, and confounding factors are linearly associated. Then the versatile model is given by:

\begin{matrix} Y_{i} & = β_{1}^{T} Z_{i} + β_{2} X_{i} + ϵ_{y} \\ X_{i} & = γ_{1}^{T} Z_{i} + γ_{2} G_{i} + ϵ_{x} \\ S_{i} & = I (X_{i} \in X) I (Y_{i} \in Y), where X = (x_{0}, x_{1}] \cup [x_{2}, x_{3}) and Y = (y_{0}, y_{1}] \cup [y_{2}, y_{3}) . \end{matrix}

(1)

Here,

β_{1}^{T} = (β_{10}, β_{11}, \dots, β_{1 k})

,

γ_{1}^{T} = (γ_{10}, γ_{11}, \dots, γ_{1 k})

are regression parameters, including the intercepts

β_{10}, γ_{10}

;

β_{2}

is the effect of the risk factor on the outcome, which is also the parameter of interest;

γ_{2}

is the effect of the genetic instrumental variable on the risk factor. Note that to determine if the genetic variant

G_{i}

is a valid IV, we are interested in estimating and testing the significance of

γ_{2}

, besides the primary interests of estimating and testing

β_{2}

. The associated random error terms

ϵ_{y}

and

ϵ_{x}

of each model are assumed to be correlated and followed the bivariate normal density

N (0, Σ)

with mean

0^{T} = (0, 0)

and covariance matrix

Σ = (\begin{matrix} σ_{y y} & σ_{x y} \\ σ_{x y} & σ_{x x} \end{matrix}) .

The correlation of the error terms is given by

ρ = σ_{x y} / \sqrt{σ_{x x} σ_{y y}}

. The thresholds

x_{0} < x_{1} \leq x_{2} < x_{3}

determine the extreme sample selection region of

X_{i}

, and the thresholds of the extreme sample selection region for

Y_{i}

are

y_{0} < y_{1} \leq y_{2} < y_{3}

. Note that the first and second relations in (1) are the 2nd and 1st stage regressions in the general MR framework, respectively.

There is not any known real-world application of the EPS design based on the sample selection from the extreme tails of both

X_{i}

and

Y_{i}

values, simultaneously. Because of this limitation, we ignore this hypothetical design, regardless of the possibility of implementing the model (1). For EPS designs based only on either extreme tail of

X_{i}

or

Y_{i}

values and the random sampling design, the supports

X

and

Y

should be appropriately chosen for each model by choosing appropriate cutoff values

x_{0}, x_{1}, x_{2}, x_{3}, y_{0}, y_{1}, y_{2},

and

y_{3}

. The form of the indicator

S_{i}

also changes accordingly. In the following sections, we discuss the utilization of model (1) under each design by choosing the appropriate supports and the indicator

S_{i}

.

2.1.1. MREPS for Samples from the Extreme Tails of Outcome Distribution

The cutoff values

y_{0} < y_{1} < y_{2} < y_{3}

, with

y_{1} \neq y_{2}

, set the thresholds for sample selection under the EPS design based on the extreme outcome values. With this setting

Y = (y_{0}, y_{1}) \cup (y_{2}, y_{3})

. The restriction on the support

X

is relaxed by choosing

x_{0} = - \infty, x_{3} = \infty

, and

x_{1} = x_{2}

so that

X

becomes

(- \infty, \infty)

, which results

I (X_{i} \in X) = 1

. As a result, the indicator

S_{i}

reduces to

S_{i} = I (Y_{i} \in Y)

. The n individuals whose outcome values fall into the extreme intervals

(y_{0}, y_{1})

and

(y_{2}, y_{3})

are chosen for the EPS design without imposing any restriction on the risk factor value

X_{i}

. We denote this EPS design as EPS^Y throughout the paper.

2.1.2. MREPS for Samples from the Extreme Tails of Risk Factor Distribution

For the EPS design constructed upon the risk factor, say EPS^X, let the n individuals whose risk factors fall into either

(x_{0}, x_{1})

or

(x_{2}, x_{3})

, with

x_{1} \neq x_{2}

, are chosen as the EPS samples without imposing any restriction on the outcome values by choosing

y_{0} = - \infty, y_{3} = \infty,

and

y_{1} = y_{2}

. Note that under this setting,

X = (x_{0}, x_{1}) \cup (x_{2}, x_{3})

, and

S_{i}

reduces to

S_{i} = I (X_{i} \in X)

as

Y = (- \infty, \infty) .

2.1.3. MREPS for Random Samples

For random sampling, we relax all restrictions on both supports

X

and

Y

by choosing

x_{0} = y_{0} = - \infty, x_{3} = y_{3} = \infty

, and

x_{1} = x_{2}, y_{1} = y_{2}

. With this setting,

X = Y = (- \infty, \infty)

. As a result,

S_{i}

becomes 1 for all i. Under this cutoff selection, the EPS samples turn into the whole cohort sample of N individuals as

S_{i} = 1

for all individuals. For the random subsampling design, n individuals can be randomly sampled from the cohort of N individuals. Therefore, the model (1) can be applied to analyze these n individuals. This transformation capability of the set-valued approach implies that the general MR model given by the first two relations of (1) is a special case of the MREPS model (1), indicating that MREPS can be applied to the general MR model, EPS^X and EPS^Y. With this adaptive capability, MREPS turns out to be a versatile, universal approach for MR, regardless of if the samples are random or EPS.

2.2. Estimation and Hypothesis Testing

Notice that under the general MR model given by the first two relations of (1), the typical approach is to employ 2SLS regression to estimate the model parameters under the random sampling framework. However, this approach is not valid for the EPS designs under the extreme tails sampling given by the indicator

S_{i}

as it violates the random sampling upon which the regression is based. To overcome this restriction, we propose a maximum likelihood approach by utilizing LIML, which is the counterpart maximum likelihood approach of 2SLS, and integrating the set-valued approach proposed by [14].

2.2.1. Samples Drawn from the Extreme Tails of Outcome

For the samples drawn from the extreme tails of the outcome and for given

Z_{i}, G_{i}

and

S_{i}

, the likelihood function for the observations

X_{i}

and

Y_{i}

can be expressed as

\begin{matrix} l_{1} = \prod_{i = 1}^{n} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}, S_{i} = 1), \end{matrix}

(2)

where

\begin{matrix} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}, S_{i} = 1) = \frac{\Pr (Y_{i} = y_{i}, X_{i} = x_{i}, S_{i} = 1 | Z_{i}, G_{i})}{\Pr (S_{i} = 1 | Z_{i}, G_{i})} \end{matrix}

(3)

Note that

X = (- \infty, \infty)

and

Y = (y_{0}, y_{1}) \cup (y_{2}, y_{3})

for this sample selection, and, hence,

S_{i} = I (Y_{i} \in Y)

. If

y_{i} \in Y

, then

\begin{matrix} \Pr (Y_{i} = y_{i}, X_{i} = x_{i}, S_{i} = 1 | Z_{i}, G_{i}) \\ = \frac{1}{2 π \sqrt{σ_{x x} σ_{y y} (1 - ρ^{2})}} exp {\frac{- 1}{2 (1 - ρ^{2})} [{(\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}})}^{2} + {(\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})}^{2} \\ - 2 ρ (\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}}) (\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})]} . \end{matrix}

(4)

Otherwise,

\Pr (Y_{i} = y_{i}, X_{i} = x_{i}, S_{i} = 1 | Z_{i}, G_{i}) = 0

. The marginal probability can be calculated by

\begin{matrix} \Pr (S_{i} = 1 | Z_{i}, G_{i}) \\ = \frac{1}{2 π \sqrt{σ_{x x} σ_{y y} (1 - ρ^{2})}} \int_{Y} \int_{- \infty}^{\infty} exp {\frac{- 1}{2 (1 - ρ^{2})} [{(\frac{y - β_{1}^{T} Z_{i} - β_{2} x}{\sqrt{σ_{y y}}})}^{2} \\ + {(\frac{x - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})}^{2} - 2 ρ (\frac{y - β_{1}^{T} Z_{i} - β_{2} x}{\sqrt{σ_{y y}}}) (\frac{x - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})]} d x d y . \end{matrix}

(5)

Let

θ_{1} = γ_{1} Z_{i} + γ_{2} G_{i}, θ_{2} = β_{1}^{T} Z_{i}, σ_{1}^{2} = (1 - ρ^{2}) σ_{x x} σ_{y y} / σ_{2}^{2}

, and

σ_{2}^{2} = σ_{x x} β_{2}^{2} + σ_{y y} + 2 ρ β_{2} \sqrt{σ_{x x} σ_{y y}}

. Then the probability (5) can be reduced to

\begin{matrix} \Pr (S_{i} = & 1 | Z_{i}, G_{i}) \\ = F_{(θ_{2} + θ_{1} β_{2}, σ_{2})} (y_{1}) - F_{(θ_{2} + θ_{1} β_{2}, σ_{2})} (y_{0}) + F_{(θ_{2} + θ_{1} β_{2}, σ_{2})} (y_{3}) - F_{(θ_{2} + θ_{1} β_{2}, σ_{2})} (y_{2}), \end{matrix}

(6)

where

F_{(θ_{2} + θ_{1} β_{2}, σ_{2})} (\cdot)

is the normal cumulative distribution with mean

θ_{2} + θ_{1} β_{2}

and standard deviation

σ_{2}

. The derivation of this relation is given in Supplementary Materials Section S1.1. Using the relations (6), (4) and (3) along with (2), the likelihood can be derived as

\begin{matrix} l = \prod_{i = 1}^{n} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}, S_{i} = 1) . \end{matrix}

(7)

Closed form solutions for the optimum parameters do not exist for the likelihood (7) as it is a nonlinear function. By utilizing appropriate numerical optimization procedures such as quasi- Newton, we can obtain the maximum likelihood estimators numerically from (7). Note that the likelihood given by (4) is the bivariate normal density, which is also LIML. See [19] for more details about LIML.

Next, to investigate the validity of

G_{i}

as an IV statistically, we test the significance of the parameter

γ_{2}

by utilizing a Wald type test statistic

{\hat{γ}}_{2}^{2} / \hat{v a r} ({\hat{γ}}_{2})

comparing

H_{0} : γ_{2} = 0

vs.

H_{1} : γ_{2} \neq 0

. Similarly, to test the significance of the effect of the risk factor, we propose the Wald type test statistic,

{\hat{β}}_{2}^{2} / \hat{v a r} ({\hat{β}}_{2})

comparing

H_{0} : β_{2} = 0

vs.

H_{1} : β_{2} \neq 0

. Here,

{\hat{γ}}_{2}

and

{\hat{β}}_{2}

are the MLEs of

γ_{2}

and

β_{2}

obtained by maximizing (7), and

\hat{v a r} ({\hat{γ}}_{2})

and

\hat{v a r} ({\hat{β}}_{2})

are, respectively, the squared standard errors of

{\hat{γ}}_{2}

and

{\hat{β}}_{2}

, which can be obtained via the resultant Hessian matrix from the end of the numerical optimization process of (7). These test statistics follow a

χ^{2}

distribution with one degree of freedom under the null hypotheses. Note that the conclusions of the above hypothesis tests should be made sequentially by first assessing the significance of

γ_{2}

as a valid instrument effect. If

γ_{2}

is not significant then the corresponding genetic variant is not a valid instrument, and, hence, testing the significance of

β_{2}

meaningless.

2.2.2. Samples Drawn from the Extreme Tails of Risk Factor

Similar to the above section, for the samples drawn from the extreme tails of the risk factor, for given

Z_{i}, G_{i}

and

S_{i}

, the likelihood function for the observations

X_{i}

and

Y_{i}

is still given by the relations (2) and (3). Recall that the supports

X

and

Y

are, respectively, given by

(x_{0}, x_{1}) \cup (x_{2}, x_{3})

and

(- \infty, \infty)

along with

S_{i} = I (X_{i} \in X)

for this sample draw. The corresponding numerator and denominator of (3) can be evaluated as bellow.

If

x_{i} \in X

, then

\begin{matrix} \Pr (Y_{i} = y_{i}, X_{i} = x_{i}, S_{i} = 1 | Z_{i}, G_{i}) \\ = \frac{1}{2 π \sqrt{σ_{x x} σ_{y y} (1 - ρ^{2})}} exp {\frac{- 1}{2 (1 - ρ^{2})} [{(\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}})}^{2} + {(\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})}^{2} \\ - 2 ρ (\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}}) (\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})]} . \end{matrix}

(8)

Otherwise,

\Pr (Y_{i} = y_{i}, X_{i} = x_{i}, S_{i} = 1 | Z_{i}, G_{i}) = 0

.

\begin{matrix} \Pr (S_{i} = 1 | Z_{i}, G_{i}) \\ = \frac{1}{2 π \sqrt{σ_{x x} σ_{y y} (1 - ρ^{2})}} \int_{- \infty}^{\infty} \int_{X} exp {\frac{- 1}{2 (1 - ρ^{2})} [{(\frac{y - β_{1}^{T} Z_{i} - β_{2} x}{\sqrt{σ_{y y}}})}^{2} \\ + {(\frac{x - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})}^{2} - 2 ρ (\frac{y - β_{1}^{T} Z_{i} - β_{2} x}{\sqrt{σ_{y y}}}) (\frac{x - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})]} d x d y . \end{matrix}

(9)

By following a similar approach as above section, it can be shown that

\begin{matrix} \Pr (S_{i} = 1 | Z_{i}, G_{i}) = F_{(θ_{1}, \sqrt{σ_{x x}})} (x_{1}) - F_{(θ_{1}, \sqrt{σ_{x x}})} (x_{0}) + F_{(θ_{1}, \sqrt{σ_{x x}})} (x_{3}) - F_{(θ_{1}, \sqrt{σ_{x x}})} (x_{2}), \end{matrix}

(10)

where

F_{(θ_{1}, \sqrt{σ_{x x}})} (\cdot)

is normal cumulative distribution with mean

θ_{1}

and standard deviation

\sqrt{σ_{x x}}

. Using the relations (8), (10) and (3) along with (2), the likelihood can be derived as

\begin{matrix} l = \prod_{i = 1}^{n} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}, S_{i} = 1) . \end{matrix}

(11)

By utilizing an appropriate numerical optimization procedure, MLEs can be obtained from (11).

Using the same Wald test statistic in the above section, hypothesis testing can be executed for testing the significance of the effects of the genetic variant and the risk factor, sequentially.

2.2.3. Random Samples

Note that, for a random sample, we choose

x_{1} = x_{2}

and

y_{1} = y_{2}

by setting

x_{0} = y_{0} = - \infty

and

x_{3} = y_{3} = \infty

. Therefore, the indicator

S_{i}

and the selection probability

\Pr (S_{i} = 1 | Z_{i}, G_{i})

in the denominator of (3) becomes 1 for all i. Hence, the likelihood function reduces to

l^{'} = \prod_{i = 1}^{n} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}),

where

\begin{matrix} \Pr (Y_{i} = y_{i}, X_{i} = x_{i} | Z_{i}, G_{i}) \\ = \frac{1}{2 π \sqrt{σ_{x x} σ_{y y} (1 - ρ^{2})}} exp {\frac{- 1}{2 (1 - ρ^{2})} [{(\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}})}^{2} + {(\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})}^{2} \\ - 2 ρ (\frac{y_{i} - β_{1}^{T} Z_{i} - β_{2} x_{i}}{\sqrt{σ_{y y}}}) (\frac{x_{i} - γ_{1}^{T} Z_{i} - γ_{2} G_{i}}{\sqrt{σ_{x x}}})]}, \end{matrix}

(12)

transforming to the LIML, which is the equivalent maximum likelihood approach for 2SLS, under random sample. Using this relationship, the final log likelihood can be calculated, and estimation and the hypothesis testing can be carried out as mentioned above two sections.

3. Simulations

To investigate the proposed MREPS methods in estimation and hypothesis testing, thorough simulations were carried out by comparison with 2SLS procedure. We utilized the widely used

t s l s ()

function, which fits 1st and 2nd stage regression models under the standard 2SLS procedure in estimating model parameters, in the “sem” R package [25] for 2SLS approach. To obtain MLEs from the corresponding likelihood functions, general non-linear optimization procedure implemented by utilizing the R function solnp() in “Rsolnp” package [26,27]. At the end of the optimization process, solnp() outputs the Hessian matrix evaluated via the resulting MLEs. We used this resulting Hessian matrix for evaluating the standard errors (SEs) of the MLEs. This section describes and summarizes the simulation studies along with the findings.

For each setting, 1000 replicates were considered. We first simulated a larger cohort of N individuals. The subsamples of size n were drawn randomly and from the extreme tails of outcome and risk factor to represent the application of each model discussed in above section under the random sampling, EPS^Y and EPS^X. For the EPS designs, the n subjects were drawn from the upper and lower q quantile from the distribution of primary trait values

Y_{i}

for EPS^Y and risk factor values

X_{i}

for EPS^X. For EPS^Y, the support

X

was set to be

(- \infty, \infty)

and for EPS^X, the support

Y

was set to be

(- \infty, \infty)

. For these two designs, n satisfies the relation

n = 2 q N

, where N is the original cohort size. This selection sets the cutoffs

x_{1}, x_{2}

and

y_{1}, y_{2}

under the the corresponding EPS^X and EPS^Y, accordingly. For EPS^X, the values

x_{0}, x_{3}

were chosen to be -∞ and ∞, respectively. The cutoffs

y_{0}, y_{3}

were, respectively, chosen to be -∞ and ∞ for EPS^Y. We first considered

q = 0.01

for assessing the different effect sizes of

β_{2}

. We then considered a set of q ranging from 0.005 to 0.045 with increment of 0.01 for assessing the effect of different n given

N = 20, 000

on the performance of MREPS and 2SLS.

We considered two scenarios in terms of the instrument variables in our simulations. We first considered a single SNP as an IV. The genotype of the SNP with a minor allele frequency (MAF) of 0.3 was generated under the Hardy–Weinberg equilibrium (HWE) assumption. In this setting, the term

γ_{2} G_{i}

is used directly in the regression of the risk factor

X_{i}

on the outcome

Y_{i}

in the model (1) by setting

γ_{2} = 0.5

.

We then considered multiple SNPs each with a MAF of 0.3 contributing to the risk factor. For this multiple SNP setting, we first simulated the genotypes of multiple SNPs independently and then simulated the risk factor according to four situations below to mimic different practical situations, which is similar to scenarios considered by [28]:

(a): Equal-sized combined effects of 9 and 25 independent SNPs: two separate cases were considered by using different numbers, 9 and 25, of SNPs. The genotype data of each SNP was independently generated under HWE as mentioned above. Then, the equal sized combined effect was considered by using $γ_{2} G_{i} = γ_{2} \sum_{j = 1}^{9} G_{i j}$ , where $γ_{2} = 0.17$ and $\sum_{j = 1}^{9} G_{i j}$ is the summation of the 9 independent SNP genotype values for the ith individual. Similarly, for the 25 SNPs, $γ_{2} G_{i} = γ_{2} \sum_{j = 1}^{25} G_{i j}$ , where $γ_{2} = 0.07$ and $\sum_{j = 1}^{25} G_{i j}$ is the summation of the 25 independent SNP genotype values for the ith individual. This case represents the utilization of the summary of multiple SNPs with equal-sized association with the risk factor as an IV.
(b): Different-sized combined effects of 9 and 25 SNPs: with this setting, $γ_{2} G_{i}$ turned out to be $\sum_{j = 1}^{J} γ_{2_{j}} G_{i j}$ for $J = 9, 25$ . Each independently generated SNP under HWE was multiplied by randomly generated $γ_{2_{j}}$ from $N (0.18, 0.018)$ for 9 SNPs and from $N (0.06, 0.018)$ for 25 SNPs separately. Then, the summation of these products considered as $γ_{2} G_{i}$ under 9 and 25 SNPs separately for each i. This scenario represents the utilization of the summary of multiple SNPs with different-sized association with the risk factor as an IV.
(c): A few large and many small effects of 9 and 25 SNPs: two SNPs generated under H-WE were multiplied by 0.46 and the rest SNPs multiplied by 0.092 separately for 9 and 25 SNPs. Next, for each i, the summations of these products separately considered as $γ_{2} G_{i}$ for 9 and 25 SNPs. This represents the utilization of the summary of multiple SNPs with very few large associations and a large number of week associations with the risk factor as an IV.
(d): A combination of valid and invalid SNPs as an IV, 9 and 25 SNPs: Among 9 SNPs g-enerated under HWE, 4 SNPs were multiplied by 0.37 while multiplying the rest by 0, and the summation of the resulting values was considered as $γ_{2} G_{i}$ for the ith individual. Similarly, among 25 SNPs, 12 SNPs were multiplied by 0.14 while multiplying the rest by 0, and the resulting summation was considered as $γ_{2} G_{i}$ for the ith individual. This case represents the utilization of the summary of few valid and many invalid SNPs as an IV.

We also considered two confounding covariates (

k = 2

). The components

Z_{i 1}, Z_{i 2}

of the confounding vector

Z_{i} = (1, Z_{i 1}, Z_{i 2})

were generated using

Z_{i 1} \sim N (0, 1)

and

Z_{i 2} \sim B e r n o u l l i (p = 0.5)

for the ith individual. By choosing

σ_{x x} = 4, σ_{y y} = 2

and

σ_{x y} = 0.8

for the covariance matrix

Σ

, the correlated error terms

ϵ_{x}

and

ϵ_{y}

were generated from bivariate normal density with the mean vector

0^{T} = (0, 0)

and the covariance matrix

Σ

. By using the generated

Z_{i}, G_{i}

and

ϵ_{x}

in the second relation of (1), the endogenous risk factor

X_{i}

was generated first, and, using this

X_{i}, Z_{i}

and

ϵ_{y}

, the outcome

Y_{i}

was generated using the first relation of (1). After the data were generated, we defined an unweighted polygenic score (PGS) as the IV in multiple SNP settings for MR data analysis because (1) naïve use of the data under analysis to choose SNPs to define PGS or for deriving weights leads to substantial bias; (2) weighted PGS increases the power, but the increase is small when SNPs have similar effect sizes [28].

3.1. Comparison: Bias, Rejection Proportion, Standard Error, and Coverage Percentage

Our major interest was to estimate and test the significance of the effect of the risk factor on the outcome, which can be achieved by estimating and testing

β_{2}

via the proposed estimation and testing processes. To investigate these procedures in estimation and testing, first we assessed the bias and rejection proportion of the null hypothesis using the single SNP setting described above for

G_{i}

. In achieving this, we set

β_{1}^{T} = (0.5, 0.4, 0.7)

and

γ_{1}^{T} = (0.3, 0.4, 0.5)

. To investigate the behavior of bias and rejection proportion due to the change of the effect size of

β_{2}

, the sequence of points from 0 to 0.6 with an increment of 0.1 was considered for

β_{2}

. Figure 1 displays the plots of

β_{2}

vs.

{\hat{β}}_{2}

and the rejection proportion under the random sampling, EPS^X and EPS^Y. Under the random sampling, both MREPS and 2SLS produced almost the same results. The

β_{2}

vs.

{\hat{β}}_{2}

curves overlapped with the true

β_{2}

vs.

β_{2}

curve. The rejection proportion plots of MREPS and 2SLS overlapped with each other, becoming comparable approaches under the random sampling, by producing almost same unbiased estimates and the rejection proportions. However, under EPS^Y, 2SLS became a highly biased approach, and the bias gradually increased as

β_{2}

increases. MREPS still overlapped with the true curve, becoming an unbiased estimator. Although the rejection proportion curves overlap with each other, the estimate of the true causal effect by 2SLS was uninterpretable due to highly biased estimator. Under EPS^X, MREPS was still the unbiased estimator, overlapping with the true curve. 2SLS consistently maintained the bias by consistently overestimating the true

β_{2}

. 2SLS produced unreliable rejection proportion, which is almost always 1, even when

β_{2} = 0

. Therefore, under EPS^X and EPS^Y, we cannot use 2SLS for MR analysis because of biased and uninterpretable parameter estimates and uncontrolled type I error rate.

Under the multiple-SNPs settings (a)–(d), we accessed MREPS by comparison with 2SLS quantitatively in terms of average bias, average SE, bias percentages, rejection proportion of the null hypothesis and the coverage percentage of the 95% CI under 9 and 25 SNPs, separately. We kept the true values fixed at the same vectors above and considered the values

{0, 0.2, 0.4}

for

β_{2}

. Table 1 summarizes the results under EPS^Y. The corresponding results for EPS^X and random sampling are given in Supplemental Table S1. From Table 1, we observed that MREPS controlled the bias percentages well below 10% for all none zero

β_{2}

settings, while 2SLS failed to control the bias percentages. The resulting bias percentages were above 100% for all nonzero

β_{2}

settings, becoming a highly biased approach. MREPS produced the smallest average SEs in all settings. The coverage percentages of 95% CIs of MREPS were all close to 95 for all settings. However, the coverage percentages of 2SLS CI became close to 95 only when

β_{2} = 0

. For all nonzero

β_{2}

coverage percentages decreased significantly as

β_{2}

increased under each setting. The only acceptable statistics under 2SLS was the rejection rate of the null hypothesis. However, reliability of these rejection rates was questionable due to the highly biased estimates and the relatively large SEs. From Supplement Table S1, under the random sampling, comparable results were observed from both 2SLS and MREPS. However, for EPS^X, the worst results were observed from 2SLS, while observing overall acceptable better results from MREPS. High bias percentages, poor coverage percentages (below 2.2 for all cases) and unreliable (almost 100% rejection rates for the null hypothesis, regardless of the change of the magnitude of the effect size) were observed from 2SLS. These overall observations are consistent with the observations displayed in Figure 1.

Overall, based on Table 1, Supplement Table S1 and Figure 1, we can conclude that both MREPS and 2SLS were comparable unbiased procedures under the random sampling. However, MREPS became the only reliable approach under both EPS^X and EPS^Y, becoming a versatile approach that can be utilized under the random as well as EPS samples in MR analysis.

3.2. Comparison: The Optimum Cost-Effective Subsample Size

One strong and important implication of the above results is that the possibility of reducing the genotyping or sequencing cost by drawing a representative subsample from the original cohort and then executing genotyping or sequencing on the selected subsample. The obvious trade off consequences of such subsample selection is the inevitable sacrifices of the desired qualities of the estimates and test statistics such as increment of bias and reduction of statistical power. However, the above observations imply that MREPS is the most appropriate solution which can optimize such trade off consequences favorably under any of EPS^X, EPS^Y or random subsample drawing. To explore this capability further numerically, we carried out simulations to explore the optimal subsample size n that required to chosen under EPS^X, EPS^Y and random sampling. We considered a cohort of size N = 20,000. For parameter

β_{2}

, the values 0.1 annd 0.4 were considered. For subsample size n the sequence of values from 200 to 2000 were considered with the increment of 200, which corresponds to quantile, q, ranging from 0.005 to 0.045 with the increment of 0.01 under EPS^X and EPS^Y. For IV, a single SNP was considered.

Figure 2 displays the behavior of the estimator

{\hat{β}}_{2}

due to the change of the subsample size n or due to the change of the corresponding quantile value q under EPS^X and EPS^Y. Under the considered simulation setting, for the random subsample,

n = 600

is an ideal sample size, and there is not much gain for other large sample sizes, indicating the consistency property is achieved. Both 2SLS and MREPS procedures produced roughly the same estimators. For both EPS^X and EPS^Y, the ideal optimal sample size can be as small as 200 (100 in both tails) under MREPS, and there is not any substantial gain for the larger sample sizes. However, regardless of the increment of the sample size n, 2SLS consistently produced nearly the same bias, indicating that 2SLS fails to recover the tradeoff loss due to the subsample selection under EPS^X and EPS^Y.

From these figures we can conclude that there are almost no losses with the valid statistical approach of MREPS due to the subsample selection, in estimation of

β_{2}

. This demonstrates the cost effectiveness of MREPS in reducing the genotyping or sequencing cost without sacrificing desired statistical properties of the estimator.

An important implication of this simulation setting is that it can be treated as a pilot simulation study to determine the required EPS sample size for obtaining unbiased causal effect in advance before conducting the real study. By setting a range for the sample size based on the available resources, simulations can be executed on several sample values chosen from the range to determine the optimal size. This study would be a foundation and it can be adapted based on the objectives and the type of data, including prospective, retrospective, exiting, or newly acquiring data under the EPS design.

4. Unravelling the Effect of HbF on Anemia in Sickle Cell Anemia

To examine the performance of the proposed MREPS approach in real data analysis, we applied the MREPS and 2SLS approaches to assess whether there is a causal effect of HbF on Hb as described below.

Background: Sickle cell anemia (SCA; HbSS and S

β^{0}

-thalathemia) is a catastrophic hemoglo-binopathy caused by a single gene mutation located on chromosome 11 that is defined by chronic hemolytic anemia, recurrent microvascular occlusion, tissue ischemia, and inflammation [29]. Acute SCA-related events (i.e., vaso-occlusive pain, acute chest syndrome, and splenic sequestration) initially manifest during infancy, and young children often develop subclinical end-organ damage [30,31,32,33,34,35,36,37,38,39] that frequently progresses and contributes to morbidity and early mortality for adults with SCA. The degree of anemia (Hb concentration) in individuals with SCA is a known risk factor for disease severity, and the overall percentage of fetal hemoglobin (HbF) produced by an individual instead of sickle hemoglobin (HbS) is a strong modifier of disease severity, including low HbF being an independent risk factor for early mortality [40]. GWAS and basic science experiments have established that HbF is heritable and modulated by three known genes of

B C L 11 A, H B S 1 L - M Y B

and the extended

β

-globin locus (in total 11 SNPs), which account for over 30% of variance observed in endogenous HbF levels in individuals with SCA, and polymorphisms in these genes are associated with higher Hb concentrations [22,41,42,43,44]. Hydroxyurea is an FDA approved therapy for SCA that largely works through HbF induction [45], with induction of HbF levels of over 20% being associated with marked clinical benefits and higher overall Hb concentrations [46], and induction of HbF is required for improvements in Hb concentrations in murine models [47]. However, no study has shown a causal effect of HbF on Hb in SCA patients.

Data preparation and analysis: 585 SCA participants of the Sickle Cell Clinical Research Intervention Program (SCCRIP) study and Baylor College of Medicine (BCM), aged ≥2 years, were used for genetic associations with cross-sectional baseline HbF [22] and were analyzed in this study. The design of SCCRIP has been described previously [22]. There were no participants on hydroxyurea therapy for at least 90 days prior to HbF measurements. Ninety-four participants had received a blood transfusion within the past 90 days. Whole genome sequencing were performed using Illumina HiSeq X10 sequencers at the HudsonAlpha Institute for Biotechnology Genomic Services Laboratory (Huntsville, AL). We used 11 HbF SNPs based on results from [22]. We defined polygenic score of HbF (PGS^HbF) as the summation of the number of high HbF alleles across these 11 SNPs, which is opposite to the PGS^HbF defined by [22] that equals to the summation of the number of low HbF alleles. The first five principal components (PCs) were calculated in the original cohort [22]. The covariates we used here were the same as those used in [22], but we further added blood transfusion history as an additional covariate because it was associated with Hb and HbF [48].

To validate the causal effect of HbF on Hb based on PGS^HbF and each individual SNP identified in the discovery cohort of SCCRIP/BCM, we analyzed the 724 patients in SIT Trial. The cohort description and GWAS data have been reported by [23]. The missing genotypes for this study were imputed using the Michigan Imputation Server for chromosomes 2, 6, and 11 using the 1000 genomes Phase 3 data as African reference population samples [49]. After imputation, one SNP (

r s 66650371

) from

H B S 1 L - M Y B

remained unavailable. The dose data for the remaining ten SNPs were extracted and analyzed as continuous variables. PGS^HbF was calculated as the unweighted sum of the dosage for the high HbF from ten SNPs using dosage data and was analyzed as a continuous variable. The first five principal components (PCs) were calculated in the original cohort [23] and included in the model to control for potential population stratification. The covariates for which we adjusted here are age, sex, diagnosis (HbSS or S

β_{0}

), and 5 PCs.

To show the performance of the MREPS approach on data collected under EPS^Y and EPS^X, we then combined the two cohorts to increase the cohort size and mimicked the two EPS designs. For choosing the two extreme tail regions for EPS sample selection under EPS^Y, we carried out a pilot simulation by using the simulation setting of Figure 2 as a foundation. The size of the combined cohort of SIT and SCCRIP is 1309, which was considered as the size of the original cohort. The percentiles 10%, 20%, 30%, 40%, and 50% were chosen as the candidate cutoff values for the extreme regions. The corresponding sample sizes determined by these cutoffs are 261, 523, 785, 1047, and 1309, respectively. Note that 50% percentile represents the entire cohort of size 1309. The parameters

β_{2}

and

γ_{2}

were set at 0.11 and 1.4 based on the literature [50] and this study. MAF was set as 0.3. The corresponding results are given by Figure S3 and Table S3 of Supplementary Materials. Based on these results, the 20% percentile for the extreme tail sample selection gave a simulated 80% power to detect a causal effect of HbF on Hb in this study at a significance level of 0.05. Without financial budget information, we chose 20% percentile only based on the power. In the real practice, the extreme percentile might be determined based on the available resources and the statistical power. We analyzed 492 participants with Hb in the lowest and highest 20% percentile, which was used to mimic EPS^Y. We also analyzed 512 participants with HbF in the lowest and highest 20% percentile, which was used to mimic EPS^X. For comparison, we also randomly sampled 518 individuals for these MR analyses. Because in SIT, PGS^HbF was defined based on 10 SNPs, we then removed SNP

r s 6650371

from PGS^HbF in SCCRIP/BCM. In the combined data analysis, we adjusted for the common covariates described above with additional covariate cohort (SCCRIP/BCM or SIT). These combined cohort and mimicked EPS analyses were only used to demonstrate the performance of the proposed new MREPS approach.

Pleiotropy: A common problem in MR analyses is the presence of horizontal pleiotropic effects, i.e., a genetic variant has associations with more than one risk factor on different causal pathways [3]. As these three genes are the genetic determinants of HbF, and higher HbF is necessary to lead to higher Hb, it is very unlikely that these three genes affect Hb via other pathways. However, to test the violations of the IV assumptions, we used three approaches. First, we checked if the three genes related to HbF are also significantly associated with the other risk factors of anemia. Second, we used the MR-Egger method to test for pleiotropy [51]. This performs a regression of the SNP-Hb associations on the SNP-HbF associations with the intercept left unconstrained to test for the evidence of bias-generating pleiotropy. Finally, we used the Sargan test for overidentification for pleiotropy [52]. The null hypothesis in this test is that all exogenous instruments are, in fact, exogenous and uncorrelated with the model residuals, and all over-identifying restrictions are, therefore, valid.

Invalid instruments: Here, we only considered three known HbF modifiers. The F statistic from the first stage linear regression in 2SLS or the statistical test for

γ_{2}

was used to evaluate the strength of the genetic instrumental variable. Therefore, we did not analyze

r s 3834466

as an individual SNP IV because it was not significantly associated with HbF in both cohorts. We also did not include the other GWAS hits of HbF because they are not validated in independent studies or in biological functional studies.

SCCRIP/BCM Results: The PGS^HbF was not associated with any potential confounding variables, including age, sex, blood transfusion history, site, SS/S

β_{0}

, and 5 PCs based on the Spearman correlation analysis and the two-sample Wilcoxon rank sum test. In SCCRIP/BCM, the Sargan test provided evidence of pleiotropic effects in HbF (

p = 0.0005

), which was not supported by the MR Egger intercept test (

p = 0.36

). In contrast, in SIT, both Sargan test (

p = 0.15

) and MR Egger intercept test (

p = 0.72

) did not provide any evidence of pleiotropic effects in HbF.

A strong association between HbF and Hb was observed after adjusting for covariates (est [estimate] = 0.09, SE = 0.007,

p < 0.0001

) in Supplementary Figure S1. PGS^HbF was also associated with HbF (est = 1.64, SE = 0.1,

p < 0.0001

) and Hb (est = 0.17, SE = 0.02,

p < 0.0001

) (Supplementary Figure S1). After including both HbF and PGS^HbF in the same model, PGS^HbF was not independently significantly associated with Hb (est = 0.04, SE = 0.03,

p = 0.13

), although HbF was still independently significantly associated with Hb (est = 0.08, SE = 0.008,

p < 0.0001

). This implies that PGS^HbF does not have a direct effect on Hb except its effect on Hb via HbF, which supports the third MR analysis assumption [53].

Using the entire group (Figure 3, Table 2), MREPS analysis using PGS^HbF as the IV showed that there was a significant relationship between HbF and Hb (est = 0.10, SE = 0.013, p = 1

\times 10^{- 16}

). All 10 SNPs were significantly associated with HbF. Using each of 10 individual SNP as an IV in the model, there was a significant relationship between HbF and Hb (min p = 6.7

\times 10^{- 10}

for

B C L 11 A

SNP

r s 1427407

and max p

= 0.029

for

H B B

SNP

r s 28440105

) at a level of 0.05 (Table 2, Supplemental Table S2). The strongest effect for testing the causal effect of HbF on Hb induced by the

B C L 11 A

SNP

r s 1427407

is because this SNP explains more than 20% of the variance of HbF, which is almost 2/3 of the variance explained by PGS^HbF that includes the contributions from all 11 SNPs from three genes [22,54]. In contrast, 2SLS only rejected PGS^HbF and 9 out of 10 analyzed SNPs (90%). p-values for each test obtained by 2SLS were larger than those obtained by MREPS. Similarly, even though Benjamini–Hochberg approach [55] was used to control false discovery rate (FDR) to adjust for multiple comparisons, MREPS still rejected all 11 tests (PGS^HbF + 10 SNPs) at FDR controlled at 0.05. However, 2SLS only rejected 9 (82%) of them:

r s 968857

with a raw p value of 0.04 was not significant after multiple comparison correction. This was consistent with our previous simulation result.

SIT validation findings: To validate the findings from the entire cohort analysis from SCCRIP/BCM, we analyzed SIT data as an entire group. We first confirmed the relationships among HbF, HB and PGS^HbF (see Supplementary Figure S1). There were no relations between PGS^HbF and any potential confounding factor, including age, sex, SS/S

β_{0}

, and 5 PCs based on the Spearman correlation analysis and the two-sample Wilcoxon rank sum test. When we included both HbF and PGS^HbF in the same model, both PGS^HbF and HbF were significantly associated with Hb although the variance in Hb explained by HbF was 3.5 times higher than that by PGS^HbF (14% vs. 4%).

MREPS analysis validated that there was a significant relationship between HbF and Hb at a level of 0.05 using PGS^HbF or 6 SNPs as the IV (Figure 3, Table 2). In contrast, 2SLS only showed a significant relationship between HbF and Hb when PGS^HbF and 5 SNPs used as IVs at a level of 0.05 (Table 2, Supplemental Table S2). With FDR controlled at 0.05, the same conclusions held. Using SIT cohort, we replicated the finding that there is a causal effect of HbF on Hb predicted from genetic variants from these three genes. This is the first time that the causal effect of HbF on Hb was shown in SCA patients using the MR approach.

Combined SCCRIP/BCM and SIT and EPS designs: After combing SCCRIP/BCM with SIT, MREPS analysis showed that there was a significant relationship between HbF and Hb at a level of 0.05 using PGS^HbF or 8 SNPs (90%) as the IV (Figure 4, Table 2, Supplemental Table S2). However, 2SLS missed two SNPs. The analysis results under the mimicked EPS and random sampling design are displayed in Figure 4, Supplemental Table S2 and Figure S2. Using the samples selected under EPS^Y (Figure 4), MREPS identified 8 out of 10 tests (80%), while 2SLS identified 6 (60%) at a level of 0.05. Similarly, using the samples selected under EPS^X (Supplemental Figure S2), both MREPS and 2SLS only identified the relationship via IVs of PGS^HbF and two

B C L 11 A

SNPs (

r s 1427407

and

r s 7606173

) at a level of 0.05. In contrast, using the randomly selected samples (Supplemental Figure S2), MREPS identified 7 out of 10 (70%) tests with correct effect direction at a level of 0.05, but 2SLS only identified 6 (60%). With FDR controlled at 0.05, the same conclusions held.

5. Discussion

Due to the capability of making causal inferences in observational studies, MR is an increasingly used approach in genetic epidemiology. The EPS design is another rapidly surging approach in GWAS studies as it is efficient, cost effective and bearing higher statistical power in rare genetic variants detection in disease etiology. In this study, we developed a maximum likelihood approach (MREPS) by bridging these two studies while preserving the benefits these two approaches offer. We compared MREPS with the standard 2SLS approach under the two EPS and random sampling designs. Although both methods provided comparable unbiased results under the random sampling design, 2SLS failed to produce unbiased estimates for the causal effect under the two EPS designs. As a result, it produced lowest coverage CIs under the EPS designs. In contrast, MREPS preserved these statistics under the EPS designs. MREPS pushed the capabilities of EPS design another step further, as it could produce unbiased estimates under the nonrandom EPS sample selection. The loss of statistical properties (such as bias, SE, and power) of the estimators and test statistics of MREPS is minimum under the random and EPS sample selection. With this capacity, MREPS preserves the cost benefit and the required statistical properties of the estimates simultaneously, regardless of whether the samples are EPS or random. This makes the MREPS a versatile and reliable approach that can be employed under the random and nonrandom extreme tails sampling designs. MREPS satisfies the necessity of having a versatile approach for exploring the causal effect of risk factors under the observational studies due to the rise of nonrandom sample selection designs such as the EPS designs which are commonly used in GWAS/NGS.

In an MR analysis for HbF on anemia risk using MREPS, we found that the higher predicted HbF calculated based on PGS^HbF or some individual SNPs protects patients from anemia using the observation SCCRIP/BCM cohort data, which was validated in another independent SIT study. This is the first study in the literature to show the causal effect of HbF on Hb in SCA patients using MR analysis of observation data. In the data analyses using the combined SCCRIP/BCM and the SIT data, the corresponding EPS^Y, EPS^X and randomly sampling from the combined cohort demonstrated the same conclusions as those obtained from the simulation studies. An important implication of the mimicked EPS^Y, EPS^X, and random sampling compared to whole cohort analysis is that only a 40% of samples was used from the large combined cohort for these designs to get close conclusions to the conclusions derived from the combined large cohort. A 60% percent of the data from the large cohort was not used, which means that about 60% of associated cost, time, and resources under the utilization of EPS was saved compared to collecting samples/data on the whole samples, no matter EPS design is for prospective or retrospective studies.

In SCA patients, high HbF is strongly correlated with a low rate of acute vaso-occlusive pain, acute chest syndrome, fewer leg ulcers, and longevity [56]. There is an association of HbF with priapism, renal functional impairment, cerebrovascular disease, and, perhaps, sickle vasculopathy as estimated by tricuspid regurgitant velocity, although there is less conclusive evidence [56]. However, showing potential causal effect of HbF on these clinical outcomes in SCA patients is strongly prohibited because of the limitations of traditional analysis in observational studies. Therefore, MR analysis, which uses PGS^HbF as an IV, allows us to examine whether there is potential causal effect of HbF on these SCA-related clinical complications using observation data. Our work demonstrating the causal effect of HbF on anemia provides a theoretical foundation for assessing the causal effect of HbF on the other clinical complications listed above because of the strong biological evidence showing the causal effect of three genes of

B C L 11 A, H B S 1 L - M Y B

and the extended

β

-globin locus on HbF. We will investigate these research topics in the future.

Another advantage that MREPS offers is that it can estimate all model parameters in the proposed statistical models in Section 2. Because of this advantage, MREPS can be expanded to investigate the validity of the MR assumptions: (a) the association between the risk factor and the IV, and (b) pleiotropy effect. The proposed approach can already address the assumption (a). By utilizing the models (1) and (2) with unobservable confound U used by [51], by utilizing EM algorithm under the MREPS framework, these objectives can be achieved. The possibility of this development will be considered in future.

Although MREPS was proposed for MR analysis using genetic variants as IVs, it can also be readily applied to analyze any epidemiology or clinical or population studies in which data on the outcome or risk factor variable were collected under the random sampling or EPS designs because we can replace genetic variables with the non-genetic IV variables. This will expand the original 2SLS proposed in econometrics [17] to study designs other than random sampling design.

The MREPS framework can also be extended to several future developments. Here, we only considered continuous outcome and continuous risk factors. We can extend it to binary outcomes and/or binary risk factors by applying set-valued framework. In some public health studies, matched case-control design can be applied. By replacing the current likelihood function with clustered likelihood function, we can also extend MREPS to handle the causal inference with genetic data collected under matched case-control design. The MREPS framework provides a theoretical foundation for all these future considerations.

The main objective of these methodology developments is to explore the causation from the data collected under the EPS design because of the lack of valid approaches. Identifying valid SNPs for IVs is a common challenge for all MR approaches including the proposed models. The valid results are not guaranteed from the models under the violations of the IV assumptions. We recommend to use the models with carefully chosen IVs. The pleiotropy effect under the violation of IV assumptions can be tested by MR-Egger and Sargan tests before utilizing the models. When applying the models for sickle cell disease data, in this study, we relied on the existing literature evidence of genetic associations for choosing SNPs as valid IVs. Associating such literature evidence for genetic associations is highly recommend for obtaining valid results from MREPS. MREPS faces the same challenges as 2SLS when exploring causation under the lack of literature evidence of genetic associations.

In conclusion, we have introduced a versatile and efficient method for MR that is comparable to 2SLS under a random sampling design but outperforms 2SLS under EPS designs. We, for the first time, have demonstrated the causal effect of lower endogenous HbF on increased anemia risk in SCA patients. A user-friendly shiny app for MREPS was implemented and is readily available for academic researchers to explore MR analysis.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math10203743/s1.

Author Contributions

Conceptualization, J.S.S.L., J.H.E., S.B., Y.L., Y.C. and G.K.; methodology, J.S.S.L. and G.K.; software, J.S.S.L. and G.K.; validation, J.S.S.L. and G.K.; formal analysis, J.S.S.L. and G.K.; investigation, J.S.S.L., G.K., J.S.H., J.H.E., V.A.S. and M.R.D.; resources, J.H.E., V.A.S., J.S.H., C.M.T., M.R.D. and M.M.; data curation, J.H.E., V.A.S., M.R.D. and J.S.H.; writing—original draft preparation, J.S.S.L., J.H.E. and G.K.; writing—review and editing, J.S.S.L., J.H.E., K.S., S.R.R., V.A.S., J.S.H., M.R.D., Y.L., Y.C., M.M., S.B., C.M.T. and G.K.; visualization, J.S.S.L. and G.K.; supervision, G.K.; project administration, G.K.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was institutionally funded by American Lebanese Syrian Associated Charities (ALSAC) at St. Jude Children’s Research Hospital. The APC was funded by St. Jude Children’s Research Hospital.

Informed Consent Statement

Not applicable.

Data Availability Statement

WGS data for 585 SCCRIP/BCM individuals are available via St. Jude Cloud (https://platform.stjude.cloud/data/cohorts, Accession number: SJC-DS-1006) upon request and subsequent approval by the sickle cell genomic project steering committee.

Code Availability

R codes for MREPS are implemented in the Shiny web application (https://statphyskang.shinyapps.io/statphyskang/) (accessed on 8 October 2022).

Acknowledgments

This research is supported by the American Lebanese and Syrian Associated Charities (ALSAC). We acknowledge the High-Performance Computing Facility (HPCF) at SJCRH for providing shared HPC resources that have contributed to the research results reported within this article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MR	Mendalian randomization
2SLS	Two-stage least-squares
GWAS	Genome-wide association study
EPS	Extreme phenotype sequencing
MREPS	MR analysis under extreme or random phenotype sampling
IV	Instrumental variables
NGS	Next generation sequencing
WBC	White blood cell count
CI	Confidence intervals
HbF	Total fetal hemoglobin
Hb	Hemoglobin
SCA	Sickle cell anemia
SCCRIP	Sickle cell clinical research intervention program
BCM	Baylor college of medicine
SIT	Silent Cerebral Infarct Transfusion
LIML	Limited information maximum likelihood
MLE	Maximum likelihood estimator
SE	Standard errors
${EPS}^{X}$	EPS design constructed upon the risk factor
${EPS}^{Y}$	EPS design constructed upon the outcome
HWE	Hardy–Weinberg equilibrium
SNP	Single nucleotide polymorphisms
HbS	Sickle hemoglobin
FDR	False discovery rate
CP	Coverage percentage
${PGS}^{HbF}$	Polygenic Score of HbF

References

Davey Smith, G.; Ebrahim, S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease?*. Int. J. Epidemiol. 2003, 32, 1–22. [Google Scholar] [CrossRef] [PubMed]
Davey Smith, G.; Hemani, G. Mendelian randomization: Genetic anchors for causal inference in epidemiological studies. Hum. Mol. Genet. 2014, 23, R89–R98. [Google Scholar] [CrossRef]
Lawlor, D.A.; Harbord, R.M.; Sterne, J.A.C.; Timpson, N.; Davey Smith, G. Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Stat. Med. 2008, 27, 1133–1163. [Google Scholar] [CrossRef]
Lamina, C. Digging into the extremes: A useful approach for the analysis of rare variants with continuous traits? BMC Proc. 2011, 5, S105. [Google Scholar] [CrossRef] [PubMed]
Dorr, C.; Wu, B.; Remmel, R.; Muthusamy, A.; Schladt, D.; Abrahante, J.E.; Guan, W.; Mannon, R.; Matas, A.; Oetting, W.; et al. Identification of genetic variants associated with tacrolimus metabolism in kidney transplant recipients by extreme phenotype sampling and next generation sequencing. Pharm. J. 2019, 19, 375–389. [Google Scholar] [CrossRef] [PubMed]
Kleinstein, S.E.; Rein, M.; Abdelmalek, M.F.; Guy, C.D.; Goldstein, D.B.; Diehl, A.M.; Moylan, C.A. Whole-Exome Sequencing Study of Extreme Phenotypes of NAFLD. Hepatol. Commun. 2018, 2, 1021–1029. [Google Scholar] [CrossRef] [PubMed]
Peloso, G.; Rader, D.; Gabriel, S.; Kathiresan, S.; Daly, M.J.; Neale, B.M. Phenotypic extremes in rare variant study designs. Eur. J. Hum. Genet. 2016, 24, 924–930. [Google Scholar] [CrossRef] [PubMed]
Amanat, S.; Requena, T.; Lopez-Escamez, J.A. A Systematic Review of Extreme Phenotype Strategies to Search for Rare Variants in Genetic Studies of Complex Disorders. Genes 2020, 11, 987. [Google Scholar] [CrossRef] [PubMed]
Amanat, S.; Gallego-Martinez, A.; Sollini, J.; Perez-Carpena, P.; Espinosa-Sánchez, J.; Aran, I.; Soto-Varela, A.; Batuecas-caletrio, A.; Canlon, B.; May, P.; et al. Burden of Rare Variants in Synaptic Genes in Patients with Severe Tinnitus: An Exome Based Extreme Phenotype Study. EBioMedicine 2020, 66, 103309. [Google Scholar] [CrossRef]
Perez-Gracia, J.L.; Pita, G.; Segura, V.; Pajares, M.J.; Fusco, J.P.; Andueza, M.P.; Sanchez-Bayona, R.; Guruceaga, E.; Mora, M.I.; Gurpide, A.; et al. Whole exome sequencing of germline DNA of individuals presenting extreme phenotypes of high and low risk to develop tobacco-induced lung adenocarcinoma (LUAD) according to KRAS status. J. Clin. Oncol. 2019, 37, 1540. [Google Scholar] [CrossRef]
Emond, M.J.; Louie, T.; Emerson, J.; Zhao, W.; Mathias, R.A.; Knowles, M.R.; Wright, F.A.; Rieder, M.J.; Tabor, H.K.; Nickerson, D.A.; et al. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nat. Genet. 2012, 44, 886–889. [Google Scholar] [CrossRef] [PubMed]
Li, Y.; Levran, O.; Kim, J.; Zhang, T.; Chen, X.; Suo, C. Extreme sampling design in genetic association mapping of quantitative trait loci using balanced and unbalanced case-control samples. Sci. Rep. 2019, 9, l15504. [Google Scholar] [CrossRef] [PubMed]
Xu, C.; Fang, J.; Shen, H.; Wang, Y.P.; Deng, H.W. EPS-LASSO: Test for high-dimensional regression under extreme phenotype sampling of continuous traits. Bioinformatics 2018, 34, 1996–2003. [Google Scholar] [CrossRef] [PubMed]
Kang, G.; Bi, W.; Zhao, Y.; Zhang, J.F.; Yang, J.J.; Xu, H.; Loh, M.L.; Hunger, S.P.; Relling, M.V.P.S.; Cheng, C. A new system identification approach to identify genetic variants in sequencing studies for a binary phenotype. Hum. Hered. 2014, 78, 104–116. [Google Scholar] [CrossRef] [PubMed]
Liyanage, J.S.S.; Estepp, J.H.; Srivastava, K.; Li, Y.; Mori, M.; Kang, G. GMEPS: A fast and efficient likelihood approach for genome-wide mediation analysis under extreme phenotype sequencing. Stat. Appl. Genet. Mol. Biol. 2022, 21. [Google Scholar] [CrossRef]
Bi, W.; Li, Y.; Smeltzer, M.P.; Gao, G.; Zhao, S.; Kang, G. STEPS: An efficient prospective likelihood approach to genetic association analyses of secondary traits in extreme phenotype sequencing. Biostatistics 2020, 21, 33–49. [Google Scholar] [CrossRef]
Angrist, J.D.; Imbens, G.W. Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity. J. Am. Stat. Assoc. 1995, 90, 431–442. [Google Scholar] [CrossRef]
Anderson, T. Origins of the limited information maximum likelihood and two-stage least squares estimators. J. Econom. 2005, 127, 1–16. [Google Scholar] [CrossRef]
Hayashi, F. Econometrics; Princeton University Press: Princeton, NJ, USA, 2000. [Google Scholar]
Anderson, T.W.; Kunitomo, N.; Sawa, T. Evaluation of the Distribution Function of the Limited Information Maximum Likelihood Estimator. Econometrica 1982, 50, 1009–1027. [Google Scholar] [CrossRef][Green Version]
Hankins, J.; Estepp, J.; Hodges, J.; Villavicencio, M.; Robison, L.; Weiss, M.; Kang, G.; Schreiber, J.; Porter, J.; Kaste, S.; et al. Sickle Cell Clinical Research and Intervention Program (SCCRIP): A lifespan cohort study for sickle cell disease progression from the pediatric stage into adulthood. Pediatr. Blood Cancer 2018, 65, e27228. [Google Scholar] [CrossRef]
Rampersaud, E.; Kang, G.; Palmer, L.E.; Rashkin, S.R.; Wang, S.; Bi, W.; Alberts, N.M.; Anghelescu, D.; Barton, M.; Birch, K.; et al. A polygenic score for acute vaso-occlusive pain in pediatric sickle cell disease. Blood Adv. 2021, 5, 2839–2851. [Google Scholar] [CrossRef] [PubMed]
Chaturvedi, S.; Bhatnagar, P.; Bean, C.; Steinberg, M.; Milton, J.; Casella, J.; Barron-Casella, E.; Arking, D.; DeBaun, M. Genome-wide association study to identify variants associated with acute severe vaso-occlusive pain in sickle cell anemia. Blood 2017, 130, 686–688. [Google Scholar] [CrossRef] [PubMed]
Baiocchi, M.; Cheng, J.; Small, D.S. Instrumental variable methods for causal inference. Stat. Med. 2014, 33, 2297–2340. [Google Scholar] [CrossRef] [PubMed]
Fox, J.; Nie, Z.; Byrnes, J. sem: Structural Equation Models. R Package Version 3.1-15. 2022. Available online: https://CRAN.R-project.org/package=sem.
Ghalanos, A.; Theussl, S. Rsolnp: General Non-linear Optimization Using Augmented Lagrange Multiplier Method. R package version 1.16. 2015. Available online: https://CRAN.R-project.org/package=Rsolnp.
Ye, Y. Interior Algorithms for Linear, Quadratic, and Linearly Constrained Non-Linear Programming. Ph.D. Thesis, Department of ESS, Stanford University, Stanford, CA, USA, 1987. [Google Scholar]
Burgess, S.; Thompson, S. Use of allele scores as instrumental variables for Mendelian randomization. Int. J. Epidemiol. 2013, 42, 1134–1144. [Google Scholar] [CrossRef] [PubMed]
Ware, R.; de Montalembert, M.; Tshilolo, L.; Abboud, M. Sickle cell disease. Lancet 2017, 390, 311–323. [Google Scholar] [CrossRef]
Gill, F.; Sleeper, L.; Weiner, S.; Brown, A.; Bellevue, R.; Grover, R.; Pegelow, C.; Vichinsky, E. Clinical events in the first decade in a cohort of infants with sickle cell disease. Cooperative Study of Sickle Cell Disease. Blood 1995, 86, 776–783. [Google Scholar] [CrossRef]
Statius van Eps, L.; Pinedo-Veels, C.; de Vries, G.; de Koning, J. Nature of concentrating defect in sickle-cell nephropathy. Microradioangiographic studies. Lancet 1970, 1, 450–452. [Google Scholar] [CrossRef]
Lester, L.; Sodt, P.; Hutcheon, N.; Arcilla, R. Cardiac abnormalities in children with sickle cell anemia. Chest 1990, 98, 1169–1174. [Google Scholar] [CrossRef]
Ohene-Frempong, K.; Weiner, S.; Sleeper, L.; Miller, S.; Embury, S.; Moohr, J.; Wethers, D.; Pegelow, C.; Gill, F. Cerebrovascular accidents in sickle cell disease: Rates and risk factors. Blood 1998, 91, 288–294. [Google Scholar]
McCarville, M.B.; Luo, Z.; Huang, X.; Rees, R.C.; Rogers, Z.R.; Miller, S.T.; Thompson, B.; Kalpatthi, R.; Wang, W.C.; BABY HUG Investigators. Abdominal ultrasound with scintigraphic and clinical correlates in infants with sickle cell anemia: Baseline data from the BABY HUG trial. AJR Am. J. Roentgenol. 2011, 196, 1399–1404. [Google Scholar] [CrossRef][Green Version]
Rogers, Z.R.; Wang, W.C.; Luo, Z.; Iyer, R.V.; Shalaby-Rana, E.; Dertinger, S.D.; Shulkin, B.L.; Miller, J.H.; Files, B.; Lane, P.A.; et al. Biomarkers of splenic function in infants with sickle cell anemia: Baseline data from the BABY HUG Trial. Blood 2011, 117, 2614–2617. [Google Scholar] [CrossRef] [PubMed]
Ware, R.E.; Rees, R.C.; Sarnaik, S.A.; Iyer, R.V.; Alvarez, O.A.; Casella, J.F.; Shulkin, B.L.; Shalaby-Rana, E.; Strife, C.F.; Miller, J.H.; et al. Renal function in infants with sickle cell anemia: Baseline data from the BABY HUG trial. J. Pediatr. 2010, 156, 66–70. [Google Scholar] [CrossRef] [PubMed]
Pavlakis, S.; Rees, R.; Huang, X.; Brown, R.; Casella, J.; Iyer, R.; Kalpatthi, R.; Luden, J.; Miller, S.; Rogers, Z.; et al. Transcranial doppler ultrasonography (TCD) in infants with sickle cell anemia: Baseline data from the BABY HUG trial. Pediatr. Blood Cancer 2010, 54, 256–259. [Google Scholar] [CrossRef] [PubMed]
Miller, S.; Wang, W.; Iyer, R.; Rana, S.; Lane, P.; Ware, R.; Li, D.; Rees, R.; BABY-HUG Investigators. Urine concentrating ability in infants with sickle cell disease: Baseline data from the phase III trial of hydroxyurea (BABY HUG). Pediatr. Blood Cancer 2010, 54, 265–268. [Google Scholar] [CrossRef] [PubMed]
Wang, W.; Pavlakis, S.; Helton, K.; McKinstry, R.; Casella, J.; Adams, R.; Rees, R.; BABY HUG Investigators. MRI abnormalities of the brain in one-year-old children with sickle cell anemia. Pediatr. Blood Cancer 2008, 51, 643–646. [Google Scholar] [CrossRef]
Platt, O.; Brambilla, D.; Rosse, W.; Milner, P.; Castro, O.; Steinberg, M.; Klug, P. Mortality in sickle cell disease. Life expectancy and risk factors for early death. N. Engl. J. Med. 1994, 330, 1639–1644. [Google Scholar] [CrossRef]
Danjou, F.; Zoledziewska, M.; Sidore, C.; Steri, M.; Busonero, F.; Maschio, A.; Mulas, A.; Perseu, L.; Barella, S.; Porcu, E.; et al. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat. Genet. 2015, 47, 1264–1271. [Google Scholar] [CrossRef]
Stadhouders, R.; Aktuna, S.; Thongjuea, S.; Aghajanirefah, A.; Pourfarzad, F.; van Jcken, W.; Lenhard, B.; Rooks, H.; Best, S.; Menzel, S.; et al. HBS1L-MYB intergenic variants modulate fetal hemoglobin via long-range MYB enhancers. J. Clin. Investig. 2014, 124, 1699–1710. [Google Scholar] [CrossRef] [PubMed]
Stadhouders, R.; Thongjuea, S.; Andrieu-Soler, C.; Palstra, R.; Bryne, J.; Van Den Heuvel, A.; Stevens, M.; de Boer, E.; Kockx, C.; van der Sloot, A.; et al. Dynamic long-range chromatin interactions control Myb proto-oncogene transcription during erythroid development. EMBO J. 2012, 31, 986–999. [Google Scholar] [CrossRef]
Suzuki, M.; Yamazaki, H.; Mukai, H.; Motohashi, H.; Shi, L.; Tanabe, O.; Engel, J.; Yamamoto, M. Disruption of the Hbs1l-Myb locus causes hereditary persistence of fetal hemoglobin in a mouse model. Mol. Cell. Biol. 2013, 33, 1687–1695. [Google Scholar] [CrossRef] [PubMed]
Yawn, B.P.; Buchanan, G.; Afenyi-Annan, A.; Ballas, S.; Hassell, K.; James, A.; Jordan, L.; Lanzkron, S.; Lottenberg, R.; Savage, W.; et al. Management of sickle cell disease: Summary of the 2014 evidence-based report by expert panel members. JAMA 2014, 312, 1033–1048. [Google Scholar] [CrossRef]
Estepp, J.; Smeltzer, M.; Kang, G.; Li, C.; Wang, W.; Abrams, C.; Aygun, B.; Ware, R.; Nottage, K.; Hankins, J. A clinically meaningful fetal hemoglobin threshold for children with sickle cell anemia during hydroxyurea therapy. Am. J. Hematol. 2017, 92, 1333–1339. [Google Scholar] [CrossRef] [PubMed]
Lebensburger, J.; Pestina, T.; Ware, R.; Boyd, K.; Persons, D. Hydroxyurea therapy requires HbF induction for clinical benefit in a sickle cell mouse model. Haematologica 2010, 95, 1599–1603. [Google Scholar] [CrossRef] [PubMed]
Meier, E.; Byrnes, C.; Weissman, M.; Noel, P.; Luban, N.; Miller, J. Expression patterns of fetal hemoglobin in sickle cell erythrocytes are both patient- and treatment-specific during childhood. Pediatr. Blood Cancer 2011, 56, 103–109. [Google Scholar] [CrossRef]
Das, S.; Forer, L.; Schönherr, S.; Sidore, C.; Locke, A.; Kwong, A.; Vrieze, S.; Chew, E.; Levy, S.; McGue, M.; et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016, 48, 1284–1287. [Google Scholar] [CrossRef]
Gardner, K.; Fulford, T.; Silver, N.; Rooks, H.; Angelis, N.; Allman, M.; Nkya, S.; Makani, J.; Howard, J.; Kesse-Adu, R.; et al. g(HbF): A genetic model of fetal hemoglobin in sickle cell disease. Blood Adv. 2018, 2, 235–239. [Google Scholar] [CrossRef] [PubMed]
Bowden, J.; Davey, S.G.; Burgess, S. Mendelian randomization with invalid instruments: Effect estimation and bias detection through egger regression. Int. J. Epidemiol. 2015, 44, 512–525. [Google Scholar] [CrossRef]
Sargan, J.D. The Estimation of Economic Relationships using Instrumental Variables. Econometrica 1958, 26, 393–415. [Google Scholar] [CrossRef]
Burgess, S.; Thompson, S.G. Mendelian Randomization: Methods for Causal Inference Using Genetic Variants; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021. [Google Scholar]
Bhanushali, A.A.; Patra, P.; Nair, D.; Verma, H.; Das, B. Genetic variant in the BCL11A (rs1427407), but not HBS1-MYB (rs6934903) loci associate with fetal hemoglobin levels in Indian sickle cell disease patients. Blood Cells Mol. Dis. 2015, 54, 4–8. [Google Scholar] [CrossRef]
Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. 1995, 57, 289–300. [Google Scholar] [CrossRef]
Steinberg, M.; Sebastiani, P. Genetic modifiers of sickle cell disease. Am. J. Hematol. 2012, 87, 795–803. [Google Scholar] [CrossRef] [PubMed]

Figure 1. (Top) panel: comparison of estimators obtained via MREPS and 2SLS under the random (left), EPS^Y (middle) and EPS^X (right) designs. (Bottom) panel: rejection proportion of

H_{0}

comparison for MREPS and 2SLS under the random (left), EPS^Y (middle) and EPS^X (right) designs.

Figure 1. (Top) panel: comparison of estimators obtained via MREPS and 2SLS under the random (left), EPS^Y (middle) and EPS^X (right) designs. (Bottom) panel: rejection proportion of

H_{0}

comparison for MREPS and 2SLS under the random (left), EPS^Y (middle) and EPS^X (right) designs.

Figure 2. (Left): the behavior of the estimators due to the change of sample size, obtained through MREPS and 2SLS under the random design. (Middle and right): the behavior of the estimators, due to the change of sample size or the proportion of the extreme tails, obtained through MREPS and 2SLS under the EPS^Y and EPS^X, respectively.

Figure 3. HbF induced by

B C L 11 A

,

H B S 1 L - M Y B

, and the extended

β

-globin locus causally influences Hb in SCCRIP/BCM (left) and SIT (right) patients. (Top) panels: X-axis: Genetic association with HbF (%). Y-axis: Genetic association with Hb (g/dL). The genetic associations were done using linear regression models, adjusting for the covariates age, sex, sickle diagnosis (Yes/No), and top 5 PCs. (Bottom) panels: X-axis: Genetic association with HbF (%); Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). The Mendelian randomization analysis was done using MREPS. SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.

Figure 3. HbF induced by

B C L 11 A

,

H B S 1 L - M Y B

, and the extended

β

-globin locus causally influences Hb in SCCRIP/BCM (left) and SIT (right) patients. (Top) panels: X-axis: Genetic association with HbF (%). Y-axis: Genetic association with Hb (g/dL). The genetic associations were done using linear regression models, adjusting for the covariates age, sex, sickle diagnosis (Yes/No), and top 5 PCs. (Bottom) panels: X-axis: Genetic association with HbF (%); Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). The Mendelian randomization analysis was done using MREPS. SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.

Figure 4. Top left and right panels: comparison between the effects of PGS^HbF on HbF and Hb and correlation between HbF and Hb in SCCRIP/BCM and SIT combined cohort and EPS^Y derived from SCCRIP/BCM and SIT combined cohort, respectively. The p value of the top right plot was calculated using STEPS [16], and the p values of the other plots were calculated using linear regression model, adjusting for covariates of cubic-spline–defined age, sex, sickle diagnosis (Yes/No), and top 5 PCs. Bottom panel: HbF induced by

B C L 11 A

and

H B S 1 L - M Y B

causally influences Hb in SCCRIP/BCM and SIT combined (left) and EPS^Y (right) patients. X-axis: Genetic association with HbF (%). Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.

Figure 4. Top left and right panels: comparison between the effects of PGS^HbF on HbF and Hb and correlation between HbF and Hb in SCCRIP/BCM and SIT combined cohort and EPS^Y derived from SCCRIP/BCM and SIT combined cohort, respectively. The p value of the top right plot was calculated using STEPS [16], and the p values of the other plots were calculated using linear regression model, adjusting for covariates of cubic-spline–defined age, sex, sickle diagnosis (Yes/No), and top 5 PCs. Bottom panel: HbF induced by

B C L 11 A

and

H B S 1 L - M Y B

causally influences Hb in SCCRIP/BCM and SIT combined (left) and EPS^Y (right) patients. X-axis: Genetic association with HbF (%). Y-axis: (Causal) Effect of HbF on Hb (g/dL/%). SCCRIP/BMC = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion trial.

Table 1. Average estimators, average standard errors, bias percentages, rejection proportions, and coverage percentages obtained through MREPS and 2SLS under EPS^Y design.

		Mean ${\hat{β}}_{2}$		Mean SE		Bias %		Rejection Proportion		CP
Number of SNPs	$β_{2}$	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS
(a) Equal-sized IV—risk factor associations
9	0	−0.004	−0.02	0.07	0.3	-	-	0.06	0.05	94.6	95.1
	0.2	0.21	0.54	0.08	0.17	−3.19	−169.66	0.77	0.86	93.5	42.1
	0.4	0.41	0.82	0.09	0.1	−2.02	−105.3	0.99	1.0	94.5	5.9
25	0	−0.01	−0.02	0.11	1.13	-	-	0.05	0.05	95.3	94.7
	0.2	0.19	0.51	0.12	0.28	4.84	−155.91	0.48	0.64	95	62.7
	0.4	0.41	0.82	0.13	0.16	−2.6	−104.03	0.89	0.97	94.6	25.7
(b) Different-sized IV—risk factor associations
9	0	−0.005	−0.04	0.07	0.28	-	-	0.04	0.03	96.2	96.8
	0.2	0.2	0.53	0.07	0.16	−1.02	−166.89	0.79	0.88	94.3	39.1
	0.4	0.4	0.82	0.08	0.1	−0.99	−104.72	1.0	1.0	94.2	4.2
25	0	−0.01	−0.11	0.12	0.58	-	-	0.04	0.04	95.7	95.7
	0.2	0.18	0.51	0.12	0.28	8.28	−153.52	0.46	0.61	95.4	61.4
	0.4	0.41	0.81	0.13	0.17	−1.96	−102.81	0.86	0.96	94.8	27.8
(c) A combination of few large and many small IV - risk factor associations
9	0	−0.0006	−0.01	0.05	0.21	-	-	0.06	0.05	94.1	95.1
	0.2	0.2	0.54	0.06	0.12	−1.45	−170.85	0.94	0.97	93.2	20.1
	0.4	0.41	0.82	0.06	0.07	−1.34	−105.3	1.0	1.0	93	0.5
25	0	−0.0002	−0.01	0.05	0.18	-	-	0.05	0.06	94.7	94.5
	0.2	0.2	0.54	0.05	0.1	−1.15	−171.2	0.97	0.99	93.3	12.4
	0.4	0.4	0.82	0.06	0.06	−1.1	−104.93	1.0	1.0	93.1	0.1
(d) A combination of valid and invalid IV - risk factor associations
9	0	−0.002	−0.02	0.05	0.19	-	-	0.06	0.05	93.9	95.2
	0.2	0.2	0.54	0.05	0.11	−1.42	−170.38	0.95	0.97	91.5	18.7
	0.4	0.4	0.82	0.06	0.07	−1.04	−104.89	1.0	1.0	93	0.2
25	0	−0.005	−0.03	0.08	0.31	-	-	0.05	0.05	94.9	95.5
	0.2	0.21	0.54	0.08	0.17	−4.16	−170.31	0.75	0.84	93.4	43.8
	0.4	0.41	0.82	0.09	0.11	−2.91	−106.12	0.99	1.0	93.9	6.7

SE = Standard Error, CP = Coverage Percentage.

Table 2. MR analyses to assessing the causal effect of HbF on Hb of patients from the two cohorts of SCCRIP/BCM and SIT. Instrumental variables are PGS^HbF and

r s 1427407

.

Table 2. MR analyses to assessing the causal effect of HbF on Hb of patients from the two cohorts of SCCRIP/BCM and SIT. Instrumental variables are PGS^HbF and

r s 1427407

.

		${\hat{β}}_{2}$		SE ( ${\hat{β}}_{2}$ )		p-Value ( $β_{2}$ )		${\hat{γ}}_{2}$		SE ( ${\hat{γ}}_{2}$ )		p-Value ( $γ_{2}$ )
Cohort/Design	IV	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS	MREPS	2SLS
SCCRIP/BCM	PGS^HbF	0.1	0.1	0.013	0.013	$1 \times 10^{- 16}$	$1 \times 10^{- 14}$	1.6	-	0.083	-	0	-
SCCRIP/BCM	$r s 1427407$	0.096	0.095	0.016	0.016	$7 \times 10^{- 10}$	$1 \times 10^{- 8}$	4.5	-	0.3	-	0	-
SIT	PGS^HbF	0.15	0.15	0.014	0.023	0	$3 \times 10^{- 10}$	1.3	-	0.14	-	0	-
SIT	$r s 1427407$	0.15	0.15	0.024	0.027	$7 \times 10^{- 10}$	$1 \times 10^{- 7}$	3.3	-	0.52	-	$2 \times 10^{- 10}$	-
Combined	PGS^HbF	0.11	0.11	0.01	0.01	0	0	1.4	-	0.076	-	0	-
Combined	$r s 1427407$	0.11	0.11	0.011	0.012	0	0	4	-	0.27	-	0	-
EPS^Y	PGS^HbF	0.11	0.11	0.015	0.016	$2 \times 10^{- 13}$	$3 \times 10^{- 12}$	1.4	-	0.13	-	0	-
EPS^Y	$r s 1427407$	0.12	0.12	0.017	0.019	$2 \times 10^{- 12}$	$1 \times 10^{- 9}$	3.8	-	0.44	-	0	-
EPS^X	PGS^HbF	0.098	0.11	0.013	0.018	$1 \times 10^{- 14}$	$5 \times 10^{- 10}$	1.7	-	0.18	-	0	-
EPS^X	$r s 1427407$	0.095	0.11	0.013	0.019	$4 \times 10^{- 13}$	$3 \times 10^{- 8}$	5.4	-	0.67	-	$7 \times 10^{- 16}$	-

SE = Standard Error; SCCRIP/BCM = Sickle Cell Clinical Research Intervention Program study and Baylor College of Medicine; SIT = Silent Cerebral Infarct Transfusion study cohort; HbF = total fetal hemoglobin; PGSHbF = Polygenic Score of HbF.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liyanage, J.S.S.; Estepp, J.H.; Srivastava, K.; Rashkin, S.R.; Sheehan, V.A.; Hankins, J.S.; Takemoto, C.M.; Li, Y.; Cui, Y.; Mori, M.; et al. A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics 2022, 10, 3743. https://doi.org/10.3390/math10203743

AMA Style

Liyanage JSS, Estepp JH, Srivastava K, Rashkin SR, Sheehan VA, Hankins JS, Takemoto CM, Li Y, Cui Y, Mori M, et al. A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics. 2022; 10(20):3743. https://doi.org/10.3390/math10203743

Chicago/Turabian Style

Liyanage, Janaka S. S., Jeremie H. Estepp, Kumar Srivastava, Sara R. Rashkin, Vivien A. Sheehan, Jane S. Hankins, Clifford M. Takemoto, Yun Li, Yuehua Cui, Motomi Mori, and et al. 2022. "A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia" Mathematics 10, no. 20: 3743. https://doi.org/10.3390/math10203743

APA Style

Liyanage, J. S. S., Estepp, J. H., Srivastava, K., Rashkin, S. R., Sheehan, V. A., Hankins, J. S., Takemoto, C. M., Li, Y., Cui, Y., Mori, M., Burgess, S., DeBaun, M. R., & Kang, G. (2022). A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia. Mathematics, 10(20), 3743. https://doi.org/10.3390/math10203743

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Versatile and Efficient Novel Approach for Mendelian Randomization Analysis with Application to Assess the Causal Effect of Fetal Hemoglobin on Anemia in Sickle Cell Anemia

Abstract

1. Introduction

2. Method

2.1. Model

2.1.1. MREPS for Samples from the Extreme Tails of Outcome Distribution

2.1.2. MREPS for Samples from the Extreme Tails of Risk Factor Distribution

2.1.3. MREPS for Random Samples

2.2. Estimation and Hypothesis Testing

2.2.1. Samples Drawn from the Extreme Tails of Outcome

2.2.2. Samples Drawn from the Extreme Tails of Risk Factor

2.2.3. Random Samples

3. Simulations

3.1. Comparison: Bias, Rejection Proportion, Standard Error, and Coverage Percentage

3.2. Comparison: The Optimum Cost-Effective Subsample Size

4. Unravelling the Effect of HbF on Anemia in Sickle Cell Anemia

5. Discussion

Supplementary Materials

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Code Availability

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI