1. Introduction
Mediation analysis, the study of cause-and-effect relationships that operate through intermediate variables known as mediators, has become increasingly popular over the past decades. Mediation considers how a third variable lies on the causal pathway between the exposure variable and the outcome variable. In the mediation analysis framework, the effect of the exposure on the outcome that does not pass through the mediator is called the direct effect, and the causal effect through the mediator is called the indirect effect. Causal inference involving mediating factors was initially established by social scientists and psychologists in their fields. In recent years, mediation methodologies have been extended to epidemiology and healthcare research to better understand how treatment effects arise and which interventions can change the outcome of interest by targeting specific mediators. There can be one or more mediators, and both direct and indirect effects can be estimated and tested in a counterfactual approach by modeling covariance and correlation matrices [1,2,3,4,5].
The approaches to mediation analysis include traditional methodologies such as the difference method and the product method using structural equation modeling (SEM) [6]. SEM is derived from path analysis and allows latent variables and confounders to be incorporated into the analysis of mediating relationships [7]. Both SEM and path analysis require adequate linear model specification. To relax this requirement, the counterfactual framework was developed in the past decade to model the mediator or outcome variable without linearity assumptions [8]. Within the counterfactual framework, causal inference models both observed and potential outcomes, and the unobserved potential outcomes are treated as missing data. This approach is more flexible in incorporating nonlinear models, such as survival hazard models and marginal structural models (MSMs), for estimation of the indirect effects [3,9,10,11]. In addition to relaxing the linearity assumptions, the framework allows for interaction between the exposure and the mediator.
Microbiome research focuses on the relationships between the microbiome and its host and has revealed profound connections between the microbiome and human health. Numerous studies have examined the effects of the microbiome on health status. Evidence has shown that the gut microbiota is associated with many chronic diseases, such as obesity [12,13,14,15], cardiovascular disease [16], inflammatory bowel disease [17,18,19], diabetes [20,21], fatty liver disease [22], liver cirrhosis [23], and colorectal cancer [24]. The microbiome composition can be quantified using 16S rRNA [25,26] or shotgun sequencing technology [27,28] or quantification of microbial absolute abundance differences (QWDs) [29] as markers to categorize organisms into taxonomic groups with specific taxonomic identity, which are called operational taxonomic units (OTUs). Researchers have shown that OTU counts are usually skewed and heavy-tailed with excessive zeros [30,31,32].
Although mediation frameworks are well established, there is a lack of research on applications with the microbiome as a mediator. Classical models such as linear and logistic regression cannot accommodate the features of microbiome sequencing data without sacrificing the information in the zero part, and such data violate the normality and constant-variance assumptions of these models. Recent studies on the human microbiome have shown its potential mediation effects between risk factors and human diseases. For example, Sohn and Li [33] proposed a sparse compositional mediation framework to investigate the mediating effect of the gut microbiome between fat intake and body mass index. They established a compositional mediation model directly in the simplex space and applied the additive log-ratio transformation to the composition. Zhang [34] mainly used linear structural equation models for the mediator and the outcome. To account for the compositional features, they used isometric log-ratio transformed mediators to construct a mediation model in Euclidean space and applied methods valid in that space. They studied the relative mediation effect of a specific component in contrast to the rest of the composition. These two studies are specific to relative-abundance microbiome data and use a structural equation modeling framework. Other researchers have used counterfactual frameworks. Wu et al. [35] established a mediation analysis approach for a zero-inflated mediator using a model-based standardization technique. They assumed that the mediator followed a zero-inflated beta distribution for relative abundance. For the outcome model, they separately modeled the count part and the zero part using an indicator. Zhang et al. [36] used inverse probability weighting-based mediation analysis for a specific measure called the interventional indirect effect.
In current microbiome research, there is a shortage of causal framework models for understanding how the exposure affects the disease outcome through the microbiome. Although the aforementioned methods can estimate the mediation effect of microbiome features, they estimate the mediation effects parametrically and therefore rely strongly on model assumptions. In addition, the existing studies have all focused on the relative abundance of the microbiome rather than count data. Our objective was to develop a semiparametric causal modeling framework to formulate a zero-inflated mediation model for microbiome count data. We aimed to estimate the mediation effect of the microbiome by simultaneously modeling the zero part and the count part of the mediator using semiparametric estimators. This method avoids the specification of an outcome model and of exposure–covariate interactions and makes positivity violations easier to identify.
This paper is structured as follows: We provide the background of the counterfactual framework and its notation and assumptions at the beginning of Section 2. Models for the microbiome, such as zero-inflated models, are introduced in Section 2.1, followed by the proposed model in Section 2.2. Statistical estimation procedures are provided in Section 2.3. Our simulation studies to assess the performance of the proposed model in comparison with existing approaches are described in Section 3.1 and Section 3.2, followed by a real-data application of the proposed model and the comparison models in Section 3.3, with a discussion in Section 4.
2. Materials and Methods
For mediation analysis, interest usually focuses on separating the indirect pathway from the total causal effect between exposure and outcome. Counterfactual variables [37] are introduced to better estimate the effects. Directed acyclic graphs (DAGs) are an intuitive way to describe a causal structure. An example is provided in Figure 1, where the variables defined in the graph are used below to define the models. Let $A_i$ denote the exposure for subject $i$, $Y_i(a, m)$ denote the outcome achieved for subject $i$ if the subject was exposed to level $a$ and mediator level $m$, and $M_i(a)$ denote the mediator achieved for subject $i$ if the exposure level is $a$. When the mediator is randomized, nested counterfactuals are defined as $Y_i(a, M_i(a^*))$. When $a = a^*$, the counterfactual variable is simply the observed outcome.
We define the causal effects of interest using counterfactual notation. The effect that one would observe is referred to as the total effect (TE) of an exposure, and it is defined as the difference between $E[Y(1, M(1))]$ and $E[Y(0, M(0))]$ if the exposure is binary. More generally, we can define $\mathrm{TE} = E[Y(a, M(a))] - E[Y(a^*, M(a^*))]$. When the mediator is held at $M(0)$ and the exposure level is changed, the difference between $E[Y(1, M(0))]$ and $E[Y(0, M(0))]$ is referred to as the natural direct effect (NDE). Likewise, the natural indirect effect (NIE) occurs when fixing the exposure level and changing the mediator, i.e., $E[Y(1, M(1))] - E[Y(1, M(0))]$ or $E[Y(0, M(1))] - E[Y(0, M(0))]$. The total effect is the sum of the natural direct effect (NDE) and natural indirect effect (NIE):
$$\mathrm{TE} = \mathrm{NDE} + \mathrm{NIE}.$$
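The decomposition can be checked by adding and subtracting a cross-world term. The sketch below assumes a binary exposure, effects defined on the expectation (risk-difference) scale, and the nested counterfactual written as $Y(a, M(a^*))$; the scale and notation used by the authors may differ.

```latex
\begin{aligned}
\mathrm{TE} &= E[Y(1, M(1))] - E[Y(0, M(0))] \\
            &= \underbrace{E[Y(1, M(1))] - E[Y(1, M(0))]}_{\mathrm{NIE}}
             + \underbrace{E[Y(1, M(0))] - E[Y(0, M(0))]}_{\mathrm{NDE}} .
\end{aligned}
```

The term $E[Y(1, M(0))]$ is never observed for any subject, which is why the identification assumptions listed next are needed.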
In any mediation analysis or causal inference, the causal mediation interpretation relies on a number of assumptions, and most of them are untestable.
Sequential Ignorability: There is no unmeasured confounding of the exposure–outcome relationship, the exposure–mediator relationship, or the mediator–outcome relationship. Mathematically, we have:
Consistency: The consistency assumption is that the counterfactuals take the observed values when the risk factor or treatment and mediator are actively set to the values they would have had. Mathematically, we have
Positivity: All levels of exposure and mediator have a nonzero probability for any values of the confounders. Mathematically, we have
Identification of Natural Effects: In order to identify natural effects, the potential outcome $Y(a, m)$ is assumed to be independent of the potential mediator $M(a^*)$ whenever $a$ and $a^*$ are different. In other words, this assumption ensures that there are no confounders of the mediator–outcome relationship that are affected by exposure, so that the direct and indirect effects are two distinct pathways between exposure and outcome. Mathematically, we have
Note that regardless of which mediation analysis approach is used, the assumption of no unmeasured confounding is required [6].
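For reference, these four assumptions are commonly formalized in the counterfactual mediation literature as shown below. The symbol $C$ for the set of measured confounders and the exact conditioning sets are assumptions made here for illustration; the authors' own statements may differ in notation.

```latex
% Sequential ignorability (no unmeasured confounding), for all a, a^*, m:
Y(a, m) \perp\!\!\!\perp A \mid C, \qquad
M(a) \perp\!\!\!\perp A \mid C, \qquad
Y(a, m) \perp\!\!\!\perp M \mid A = a,\, C

% Consistency:
A = a \;\Rightarrow\; M = M(a), \qquad
A = a,\ M = m \;\Rightarrow\; Y = Y(a, m)

% Positivity:
P(A = a \mid C = c) > 0, \qquad
P(M = m \mid A = a, C = c) > 0

% Identification of natural effects (cross-world independence):
Y(a, m) \perp\!\!\!\perp M(a^*) \mid C
```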
If all the above assumptions hold, then
which can be estimated via:
Model-based standardization, which models the outcome, mediator, and exposure
Inverse probability weighting (IPW) [38,39], which weights the observed outcomes using a combination of the exposure and mediator probabilities given the covariates
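The two estimation routes correspond to two equivalent ways of writing the identified quantity. As a sketch, assuming a discrete mediator, measured confounders denoted $C$ (an assumed symbol), and the nested counterfactual $Y(a, M(a^*))$, the well-known mediation formula and its weighting representation are:

```latex
% Model-based standardization (mediation formula):
E[Y(a, M(a^*))] = \sum_{c} \sum_{m}
  E[Y \mid A = a, M = m, C = c]\;
  P(M = m \mid A = a^*, C = c)\; P(C = c)

% Inverse probability weighting representation:
E[Y(a, M(a^*))] = E\!\left[
  \frac{\mathbf{1}\{A = a\}}{P(A = a \mid C)}\,
  \frac{P(M \mid A = a^*, C)}{P(M \mid A = a, C)}\, Y \right]
```

The first form requires an outcome model; the second replaces it with models for the exposure and the mediator only.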
2.1. Zero-Inflated and Zero-Hurdle Models
Mixture models have been proposed to avoid the loss of information when data contain excess zeros. Zero-inflated (ZI) models were first introduced by Cohen [40] and became widely accepted after the work of Mullahy [41] and Lambert [42]. ZI models are mixtures of a discrete distribution and a point mass at zero. Structural zeros are distinguished from count data, and the counts are usually assumed to follow a Poisson or negative binomial (NB) distribution. The density function for ZI models can be generally defined as
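A generic form of the ZI density is sketched below. The symbols $\pi$ (probability of a structural zero) and $f$ (the count distribution, e.g., Poisson or NB) are chosen here for illustration and may not match the authors' notation.

```latex
P(Y = y) =
\begin{cases}
\pi + (1 - \pi)\, f(0), & y = 0, \\[4pt]
(1 - \pi)\, f(y),       & y = 1, 2, \ldots
\end{cases}
```

Under this form a zero can arise either from the point mass (a structural zero) or from the count distribution (a sampling zero).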
On the other hand, zero-hurdle (ZH) models [41,43,44] process the data in two stages to account for the excessive number of zeros. The first part is a binary model that determines whether the outcome is zero or positive. Logistic regression models are usually used for the first part to incorporate the effects of the covariates on the probability of an observation being zero. For the second part, distributions truncated at zero are used, conditioning on the nonzero count outcomes. Zero-truncated regression models are then applied to incorporate the covariate effects on the nonzero distribution.
Xu et al. [31] compared several commonly used models for microbiome abundances, including standard parametric and nonparametric models, zero-hurdle models, and zero-inflated models, under varying degrees of zero inflation, with or without dispersion in the count component, and with different directions of the covariate effect on the structural zero and count components. The simulation studies showed that the ZH and ZI models outperformed the other models in terms of type I error, power, and goodness-of-fit measures and were more accurate and efficient in parameter estimation. Additionally, ZH models were more stable when structural zeros were absent. Therefore, we used the zero-hurdle negative binomial (ZHNB) model as the mediator model in the mediation framework.
For ZHNB models, the response variable
has the distribution
where
and
is a dispersion parameter that is assumed not to depend on covariates and
. We can estimate the expectation of the variable
given the exposure and confounders by
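As a minimal sketch of a ZHNB distribution and its mean, the Python code below uses one common parameterization: a zero probability `pi0`, an NB mean `mu`, and a dispersion `alpha` with NB variance `mu + alpha * mu**2`. The function names and this parameterization are assumptions for illustration and are not taken from the paper.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha (assumed parameterization)."""
    r = 1.0 / alpha  # NB size parameter
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zhnb_pmf(k, pi0, mu, alpha):
    """Zero-hurdle NB pmf: P(M = 0) = pi0; positive counts follow a zero-truncated NB."""
    if k == 0:
        return pi0
    return (1.0 - pi0) * nb_pmf(k, mu, alpha) / (1.0 - nb_pmf(0, mu, alpha))


def zhnb_mean(pi0, mu, alpha):
    """E[M] = (1 - pi0) * mu / (1 - P_NB(0)), the mean implied by the hurdle structure."""
    return (1.0 - pi0) * mu / (1.0 - nb_pmf(0, mu, alpha))
```

In a regression setting, `pi0` and `mu` would be functions of the exposure and confounders through the logit and log links, respectively.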
2.2. Inverse Probability Weighting Two-Part Model
To incorporate a variable with excessive zeros as the mediator, we decompose the mediation effect of the microbiome into two components that are inherent in zero-inflated distributions: one attributed to a zero part and one attributed to a count part (Figure 2). The zero part, which includes the structural zeros and the sampling zeros, reflects whether an OTU is present in the data, while the count part explains the change in the outcome of interest resulting from a unit change in the OTU counts. To simultaneously model the zero part and the count part of the mediator, we developed a weighting-based approach. Following Lange et al.’s work [38], we propose a semiparametric approach for estimating the direct and indirect effects while avoiding specification of the outcome model. In this weighting-based approach, the estimation of exposure–covariate interactions and a separate averaging step are also avoided. Furthermore, the IPW approach is less computationally intensive and easier to formulate and implement. The weighting-based approach estimates the expectation of the nested counterfactual outcome $Y(a, M(a^*))$ as
The IP weights remove confounding by creating a pseudo-population in which the confounders and the exposure are not associated. Subjects with a low probability of receiving the exposure level that they actually received have high weights, which results in unstable estimators with large variance. Stabilized inverse probability weights mitigate this problem by multiplying the original weights by the marginal probability of the observed exposure value, which does not condition on the confounders. We weight the outcomes by the stabilized inverse probability of each individual’s exposure status and mediator levels to obtain the estimates.
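The stabilization step for a binary exposure can be sketched as follows. The code assumes that the conditional probabilities P(A = 1 | C) have already been fitted (for example, by logistic regression) and estimates the marginal probability by the sample proportion; the function name and inputs are hypothetical.

```python
def stabilized_exposure_weights(a, propensity):
    """Stabilized weights P(A = a_i) / P(A = a_i | C_i) for a binary exposure.

    a          : list of observed exposures coded 0/1
    propensity : list of fitted probabilities P(A = 1 | C_i), assumed given
    """
    p_marginal = sum(a) / len(a)  # sample estimate of P(A = 1)
    weights = []
    for a_i, p_i in zip(a, propensity):
        numerator = p_marginal if a_i == 1 else 1.0 - p_marginal
        denominator = p_i if a_i == 1 else 1.0 - p_i
        weights.append(numerator / denominator)
    return weights
```

A subject who was exposed despite a fitted probability of 0.2 receives an unstabilized weight of 5; with a marginal exposure probability of 0.5 the stabilized weight is 2.5, which illustrates how stabilization reduces extreme weights.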
As $A$ and $Y$ are binary variables, we model them both using a generalized linear model (GLM) with a logit link function. Due to the characteristics of the mediator, we consider a ZH model to estimate the indirect effects. Specifically, for the exposure model with binary exposure, we have a model with a logit link:
and
can be written as
The stabilized weighting function for the exposure can be written as
For the mediator model, we consider a zero-hurdle negative binomial model with a logit link when the mediator is zero and a log link for the nonzero mediators:
Then, the probability of zero
is defined as
, and the mean of counts
is defined as
. The stabilized weighting function for the mediator is
Note that the models used for calculating the weights are not fit to the replicate dataset but to the original one.
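A mediator weight in the style of Lange et al. can be sketched as the ratio of the mediator probability under the counterfactual exposure a* to that under the observed exposure a. The code below assumes a ZHNB mediator model with a single scalar confounder `c`, hypothetical coefficient vectors `gamma` (zero part, logit link) and `beta` (count part, log link), and a dispersion `alpha`; it illustrates the construction and is not the authors' exact weighting function.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def zhnb_pmf(m, pi0, mu, alpha):
    """Zero-hurdle NB pmf with zero probability pi0, NB mean mu, dispersion alpha."""
    if m == 0:
        return pi0
    r = 1.0 / alpha
    log_nb = (math.lgamma(m + r) - math.lgamma(r) - math.lgamma(m + 1)
              + r * math.log(r / (r + mu)) + m * math.log(mu / (r + mu)))
    nb_zero = (r / (r + mu)) ** r  # NB probability of a zero count
    return (1.0 - pi0) * math.exp(log_nb) / (1.0 - nb_zero)


def mediator_weight(m, a, a_star, c, gamma, beta, alpha):
    """Ratio P(M = m | A = a*, C = c) / P(M = m | A = a, C = c)."""
    def density(a_val):
        # Zero part: logit link; count part: log link (coefficients are hypothetical).
        pi0 = expit(gamma[0] + gamma[1] * a_val + gamma[2] * c)
        mu = math.exp(beta[0] + beta[1] * a_val + beta[2] * c)
        return zhnb_pmf(m, pi0, mu, alpha)
    return density(a_star) / density(a)
```

When a* equals a the weight is 1, so only the replicated rows with a counterfactual exposure different from the observed one are reweighted. The coefficients would come from the mediator model fitted to the original dataset.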
2.3. Estimation of the Stabilized Direct and Indirect Effects
One of our major interests in mediation analysis lies in estimating the NDE, NIE, and TE of the counterfactual framework. Because we use the stabilized version of the weights, we refer to the targeted estimands as SNDE, SNIE, and STE.
The outcomes are weighted by the inverse probability of each individual’s exposure status and mediator levels by fitting a weighted logistic regression model. The combined stabilized weights are the product of the exposure weights and the mediator weights. The natural direct and indirect effects can then be estimated accordingly:
With the parameter estimates, we are able to calculate SNDE, SNIE, and STE and their empirical confidence intervals using bootstrapping. The detailed algorithm for the estimation of SNDE, SNIE, and STE can be found in Appendix A.
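An empirical bootstrap confidence interval can be sketched generically as below. This is a percentile bootstrap under assumed defaults (1000 resamples, 95% level, fixed seed); the function name and the `estimator` callback are hypothetical, and the authors' algorithm in Appendix A may differ in its details.

```python
import random


def percentile_bootstrap_ci(data, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(data).

    data      : list of per-subject records
    estimator : function mapping a list of records to a scalar estimate
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample subjects with replacement and re-estimate.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()
    lower_index = int((1.0 - level) / 2.0 * n_boot)
    upper_index = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    return estimates[lower_index], estimates[upper_index]
```

In the present setting, `estimator` would refit the exposure and mediator models, recompute the weights, fit the weighted outcome model on each resample, and return one of SNDE, SNIE, or STE.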
4. Discussion
In this study, we developed an innovative counterfactual mediation framework for the microbiome as a mediator that accommodates the zero-inflation characteristics of the microbial mediator. In particular, we incorporated the inverse probability weighting method for parameter estimation and used zero-hurdle models for the zero-inflated mediator in count form. We showed that a zero-inflated mediator can be decomposed into two components: the first captures whether an OTU is present for the subject, and the second captures the change in the outcome per unit increase in the OTU count. We also constructed bootstrap standard errors for the microbial variables in a real-data application and provided the corresponding empirical confidence intervals. Simulation studies demonstrated the robust performance of the proposed method in estimating mediation effects, including both direct and indirect effects. The weighting-based approach had less bias than the model-based approach when consonant or dissonant effects were present in the microbiome data. If the true models are known, model-based mediation models should, in theory, work well. However, the true models of the exposure–mediator, mediator–outcome, and exposure–outcome relationships are usually unknown, especially when dealing with complex human microbiome data. The weighting-based mediation model avoids the specification of the outcome model and exposure–covariate interactions, resulting in more robust estimation when the true relationships are unknown. In addition, the simulation revealed that ZHNB models fit the microbiome mediator better than GLM, especially when excessive zeros were present. In the DIABIMMUNE dataset, the model-based M-ZHNB models estimated a larger SNIE than W-ZHNB for both taxa, Lactobacillus and Lactococcus.
Several extensions of our proposed approach could be explored in the future. We considered the situation where both exposure and outcome are binary. One natural extension is to accommodate multilevel or continuous exposures and/or outcomes. Although our framework can handle microbiome count data as mediators, it would be desirable to extend the proposed weighting mediation framework from count data to relative abundance in future research. The human microbiota is a dynamic ecosystem whose members interact with one another and change over time. In this study, we focused on the mediation effect of one OTU at a single time point to address the inability of existing mediation frameworks to handle the overdispersion and zero inflation of OTU abundances. Multiple OTUs acting as microbiome mediators could interact with one another, which is the next step we plan to explore. Because OTUs change dynamically, it is also important to explore the meaning of a mediation effect of OTUs over time, and we will investigate longitudinal human microbiota data in the future.