Article

Identifiability and Estimation for Potential-Outcome Means with Misclassified Outcomes

by Shaojie Wei, Chao Zhang, Zhi Geng and Shanshan Luo *
School of Mathematics and Statistics, Beijing Technology and Business University, Beijing 100048, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(18), 2801; https://doi.org/10.3390/math12182801
Submission received: 11 August 2024 / Revised: 1 September 2024 / Accepted: 2 September 2024 / Published: 10 September 2024
(This article belongs to the Special Issue Computational Statistics and Data Analysis, 2nd Edition)

Abstract

Potential outcomes play a fundamental role in many causal inference problems. If the potential-outcome means are identifiable, a series of causal effect measures, including the risk difference, the risk ratio, and the treatment benefit rate, among others, can also be identified. However, current identification and estimation methods for these means often implicitly assume that the collected data are measured precisely. In many fields such as medicine and economics, collected variables such as medical diagnostic results and individual wage data may be subject to measurement error. Misclassification, as a non-classical measurement error, can lead to severely biased estimates in causal inference. In this paper, we leverage a combined sample to study the identifiability of potential-outcome means corresponding to different treatment levels under a plausible misclassification assumption for the outcome, allowing the misclassification probability to depend not only on the true outcome but also on the covariates. Furthermore, based on the semiparametric theory framework, we propose multiply-robust and semiparametrically efficient estimators for the means, which remain consistent even under partial misspecification of the observed data law. Simulation studies and a real data analysis demonstrate the satisfactory performance of the proposed method.

1. Introduction

Explaining causal relationships is paramount for understanding how treatments influence outcomes within various domains such as medicine, economics, and social sciences. Over the past few decades, envisioning the potential outcomes of each unit in a population under different treatment conditions has become a staple of causal inference. Various causal measures of interest can be defined using the potential-outcome means [1,2,3]. In other words, once we identify the potential-outcome means, we can identify a series of causal measures.
Randomized experiments are viewed as the gold standard for causal inference. The means of subsamples corresponding to different treatment levels can serve as consistent estimators of the corresponding potential-outcome means. However, in practice, the randomized assignment of the treatment is often difficult to implement due to ethical concerns, high costs, and so on. Observational studies are in general more feasible. Under the assumption that the collected data are precisely measured, researchers can use various methods to adjust for confounding and obtain consistent estimates of the potential-outcome means in these studies, including regression [4], matching [5], and inverse probability weighting [6], among others. Nevertheless, in many observational studies, researchers face challenges in obtaining accurate data. For instance, in classic studies on the impact of smoking on lung cancer, smokers may misreport their smoking status; when economists study the determinants of income, respondents may misreport their earnings; in the job market, applicants may falsify their educational qualifications. VanderWeele and Li [7] pointed out that measurement error is one of the key threats to statistical analysis in observational studies. Measurement errors cause the observed data distribution to deviate from the true distribution, which can further lead to biased or even entirely incorrect conclusions in causal inference. For a long time, many researchers have focused on the estimation of model parameters under measurement error or misclassified data (discrete variable data with measurement errors). The well-known book by Carroll et al. [8] covered many measurement error models, emphasizing the bias-corrected techniques. 
Schennach [9] reviewed much significant progress made in developing estimation and inference methods in the presence of mismeasured data, especially describing approaches that rely on validation data techniques or auxiliary variables, e.g., repeated measurements, multiple indicators, measurement systems with factor model structure, instrumental variables, and panel data. Although the statistical analysis of measurement error data has a long history, surprisingly, with the advent of the big data era and the introduction of new experimental methods, this topic has remained quite active, experiencing a recent increase in attention and research activity. For example, Amorim et al. [10], Tao et al. [11], and Amorim et al. [12] addressed measurement error challenges in multi-phase studies, which include data collected from one or more rounds of validation processes.
Recently, many researchers in the field of causal inference have also focused significantly on the measurement error issue. Boatman et al. [13] examined the estimation of causal effects through a weighted strategy in randomized clinical trials, addressing noncompliance measured with error. Gravel and Platt [14] proposed a method for estimating the marginal causal odds ratio when outcomes are misclassified, utilizing internal validation information. Yanagi [15] provided identification and estimation results for local average treatment effects with misclassified treatment by leveraging an exogenous variable. Shu and Yi [16,17,18] investigated the use of inverse probability weighting estimation for causal effects, integrating validation data sets when the data collected were prone to errors. Edwards et al. [19] introduced a reparameterized imputation method for addressing measurement error, applicable for estimating counterfactual risk functions or hazard ratios with internal or external validation data. Richardson et al. [20] employed a reference population to achieve identifiability for causal effects in the presence of measurement error in continuous treatments. Several other studies have addressed measurement error issues in the context of instrumental variable methods or mediation analysis. Notable examples include VanderWeele et al. [21], Jiang and Ding [22], and Cheng et al. [23], among others. Currently, the literature in this field primarily focuses on the misclassification of causal variables, such as treatment variables or mediators, with relatively less attention given to the misclassification of outcome variables. Moreover, for identifiability, while many existing studies address the identifiability of various causal measures, they often do not directly consider the identifiability of the means of the potential outcome variables. 
Identifying these means enables the identification of a range of causal measures, such as the risk difference, the risk ratio, and the odds ratio. For estimation, current approaches to similar problems frequently depend on specific parametric models, so the validity of the resulting estimators is contingent on the correct specification of those models, and the estimators are not necessarily efficient. The development of semiparametric efficiency theory and of multiply-robust estimators for causal inference with measurement error is still incomplete.
Our article contributes to the prior literature as follows. First, we directly provide identification results for the potential-outcome means using a special combined sample, composed of a primary sample and a validation sample, where no individual has complete data. By identifying these means, we provide the identification results for various causal measures. Second, based on the semiparametric theory framework [24], we derive the efficient influence functions for different potential-outcome means under the observed data law. To our knowledge, no similar results have been reported in the context of causal inference with misclassification. Third, building on the efficient influence functions, we develop efficient and multiply-robust estimators for the means. Beyond the target estimands and multiply-robust estimation methods used, our setup and assumptions differ from those in some existing literature on causal inference involving misclassified outcome variables. We consider a more relaxed assumption about the misclassification mechanism compared to [16,18]: in addition to the true outcome variable, we allow the misclassification probability to depend on the covariates. Under our setup, the primary sample contains the collected covariates, the treatment variable of interest, and the misclassified outcome variable but lacks information on the true outcome variable. The validation sample, on the other hand, includes the covariates, the misclassified outcome variable, and the information needed to determine the true outcome variable values. This setup aligns well with practical applications and is different from previous work [14,16,18], as the validation sample in this work may come from a validation study where the target treatment variable is different from that of the primary study, thus lacking information on the target treatment variable in the primary study. 
This scenario is typical in the data fusion literature, where the variables collected in each study may differ [25]. Similarly, there are no complete data for any subject in our setting.
The structure of the paper is as follows. In Section 2, we outline the formal setup, the assumptions necessary for our error-prone outcome context, and the nonparametric identification results for the potential-outcome means. Section 3 discusses the efficient influence functions for the means and proposes the multiply-robust estimation approach. The asymptotic properties of the proposed method are also provided. Section 4 and Section 5 demonstrate the finite-sample performance of the proposed method through simulation studies and real data analysis, respectively. The discussion is presented in Section 6. Technical proofs are provided in Appendix A.

2. Setup, Assumptions, and Nonparametric Identification

Assume we have obtained two samples from two studies: one of main interest, called the primary study, and another called the validation study. In the primary study, the outcome variable cannot be measured accurately, and the collected data are denoted as $\{(X_i, T_i, W_i), i = 1, \ldots, n_1\}$, where $X$ represents the covariates, $T$ is the binary treatment of interest, and $W$ is the misclassified version of the true binary outcome variable $Y$. Here $T_i = 1$ indicates that individual $i$ is assigned to the treatment group, and $T_i = 0$ indicates assignment to the control group. In the primary sample, we assume no information can be used to determine the true value of $Y_i$. In the validation study, we have information that helps ascertain the true outcome values, e.g., carbon monoxide levels can help determine smoking cessation. That is, we actually obtain the data for both $Y$ and $W$ in the validation study. However, the study may not include the treatment $T$, because it may be designed to evaluate another treatment $T^*$ influencing the same outcome $Y$ as the primary study. The collected validation data are denoted as $\{(X_i, W_i, Y_i), i = n_1 + 1, \ldots, n_1 + n_2\}$. Using either sample alone makes it difficult to identify the potential-outcome means, but combining the two samples makes the identification and estimation of the means and a series of causal measures possible. The problem considered in our data analysis aligns with this setup: one data set includes the treatment of interest but only contains the misclassified outcome variable, while the validation data set provides both the accurate and misclassified outcomes but does not collect the treatment of interest. Let $R_i$ be an indicator, where $R_i = 1$ indicates that individual $i$ belongs to the primary sample, and $R_i = 0$ indicates that individual $i$ belongs to the validation sample. Then, the combined sample can be expressed as $\{O_i = (R_i, X_i, W_i, R_i T_i, (1 - R_i) Y_i), i = 1, \ldots, n\}$, where $n = n_1 + n_2$. In the following, we omit the subscript $i$ wherever this causes no confusion.
Under the potential-outcome framework [26], let $Y_0$ denote the potential outcome if an individual were assigned to $T = 0$ and $Y_1$ the potential outcome if the individual were assigned to $T = 1$. Only one of $Y_1$ and $Y_0$ can be observed for an individual, and we assume $Y = T Y_1 + (1 - T) Y_0$ (called the consistency assumption) throughout the paper. In this paper, we aim to identify and estimate the potential-outcome means $a_1 = E(Y_1) = P(Y_1 = 1)$ and $a_0 = E(Y_0) = P(Y_0 = 1)$. Many causal measures are defined using the potential-outcome means, such as the risk difference (RD)
$$E(Y_1 - Y_0),$$
the risk ratio (RR)
$$E(Y_1) / E(Y_0),$$
or the odds ratio
$$\frac{P(Y_1 = 1)}{P(Y_1 = 0)} \Big/ \frac{P(Y_0 = 1)}{P(Y_0 = 0)} = \frac{E(Y_1)}{1 - E(Y_1)} \Big/ \frac{E(Y_0)}{1 - E(Y_0)}.$$
Some existing studies have explained and compared the definitions and applicability of these causal measures [2,3,27,28,29]. In addition, when treatment does not cause harm, that is, $Y_1 \geq Y_0$, referred to as the monotonicity assumption [30], the joint distribution of the potential outcomes can be written as
$$P(Y_1 = j, Y_0 = k) = \begin{cases} P(Y_1 = 0) = 1 - E(Y_1), & \text{if } j = 0,\ k = 0, \\ P(Y_0 = 1) = E(Y_0), & \text{if } j = 1,\ k = 1, \\ 1 - P(Y_1 = 0) - P(Y_0 = 1) = E(Y_1) - E(Y_0), & \text{if } j = 1,\ k = 0. \end{cases}$$
Assume that $Y = 1$ indicates a positive or beneficial result. Then, the first joint probability represents the rate of “never benefit” (NBR), the second represents the rate of “always benefit” (ABR), and the third represents the treatment benefit rate (TBR) [31]. The joint distribution of the potential outcomes also plays an important role in causal attribution [32,33]. It is evident from the definitions that if the potential-outcome means are identifiable, all of the above quantities are identifiable. To proceed, we list the following assumptions:
Assumption 1.
$T \perp\!\!\!\perp (Y_1, Y_0) \mid X$, that is, the potential outcomes are independent of the treatment conditional on the observed covariates $X$.
Assumption 2.
$P(W = 1 \mid Y = 1, X) - P(W = 1 \mid Y = 0, X) \neq 0$.
Assumption 3.
$W \perp\!\!\!\perp T \mid (X, Y)$, that is, the misclassified outcome $W$ is independent of the treatment $T$ conditional on the true outcome $Y$ and the observed covariates $X$.
Assumption 4.
$R \perp\!\!\!\perp (X, T, W, Y)$.
Assumption 5.
$0 < \pi(x) = P(T = 1 \mid X = x) < 1$ for any $x \in Q$, where $Q$ is the support set of $X$.
Assumption 1 is referred to as the unconfoundedness assumption [34]. It is considered plausible when the covariates $X$ are rich enough to include all common causes of both the treatment and the outcome variables. Assumption 2 implies that the misclassified outcome $W$ is correlated with the true outcome $Y$ conditional on $X$. Assumption 3 allows the misclassification probability to depend on the covariates $X$ and is weaker than the traditional non-differential misclassification assumption used in Shu and Yi [16]. Assumptions 2 and 3 are plausible when the misclassification error is directly influenced by $Y$ and $X$ but is independent of the treatment assignment. Misclassification errors due to self-reporting generally align with Assumptions 2 and 3. Assumption 4 is naturally satisfied when the validation sample can be viewed as a random subsample drawn from the population of the primary study. Assumption 5 is called the overlap assumption in the causal inference literature. It holds when each individual has a non-zero probability of being assigned to each treatment level.
Under Assumptions 1 and 5, when precise data are available, the potential-outcome means can be identified as
$$E(Y_t) = \begin{cases} E\{E(Y_1 \mid X)\} = E\{E(Y \mid X, T = 1)\} = E\left\{\dfrac{T}{\pi(X)}\, Y\right\}, & \text{if } t = 1, \\[2ex] E\{E(Y_0 \mid X)\} = E\{E(Y \mid X, T = 0)\} = E\left\{\dfrac{1 - T}{1 - \pi(X)}\, Y\right\}, & \text{if } t = 0, \end{cases} \tag{1}$$
and then the regression-based and the weighting-based estimators can be established by using the plug-in method based on (1). The consistency of these methods potentially requires the collection of precise outcome variable data, and the naive estimators that directly use the misclassified outcome W may lead to a seriously biased estimation of the potential-outcome means.
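As a minimal sketch of the two plug-in estimators based on (1) when the outcome is measured precisely, the Python code below uses a single binary covariate so that neither the cell means nor the propensity score needs a parametric model. The data-generating values and the function names `ipw_means` and `regression_means` are our own illustrative assumptions and do not come from the paper.

```python
import numpy as np

def ipw_means(y, t, pi):
    """Weighting-based plug-in for (a_1, a_0) from Equation (1)."""
    a1 = np.mean(t * y / pi)              # sample analogue of E{T Y / pi(X)}
    a0 = np.mean((1 - t) * y / (1 - pi))  # sample analogue of E{(1 - T) Y / (1 - pi(X))}
    return a1, a0

def regression_means(y, t, x):
    """Regression-based plug-in for a discrete covariate x: average the
    cell means E(Y | X, T = arm) over the sample distribution of X."""
    out = []
    for arm in (1, 0):
        mu = np.empty(len(y))
        for v in np.unique(x):
            mu[x == v] = y[(x == v) & (t == arm)].mean()
        out.append(mu.mean())
    return tuple(out)

# Hypothetical data: binary X, known propensity score, precisely measured Y.
rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, n)
pi = np.where(x == 1, 0.7, 0.3)           # P(T = 1 | X)
t = rng.binomial(1, pi)
y = rng.binomial(1, 0.2 + 0.3 * t + 0.2 * x)

# By construction a_1 = 0.2 + 0.3 + 0.2 * 0.5 = 0.6 and a_0 = 0.2 + 0.2 * 0.5 = 0.3.
a1_ipw, a0_ipw = ipw_means(y, t, pi)
a1_reg, a0_reg = regression_means(y, t, x)
print(a1_ipw, a0_ipw, a1_reg, a0_reg)
```

Replacing `y` with a misclassified version in either function would give the naive estimators mentioned above.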
Let $f(\cdot)$ denote the probability density or mass function of a random variable (vector). To tackle the case where the binary outcome $Y$ is subject to misclassification, we first define some functions for simplicity of exposition:
$$B_{W0}(X) = E(W \mid Y = 0, X),$$
and
$$B_W(X) = E(W \mid Y = 1, X) - E(W \mid Y = 0, X).$$
The following theorem provides the nonparametric identification results for the potential-outcome means under our setup.
Theorem 1.
Suppose Assumptions 1–4 hold. Then, $a_1$ can be identified as
$$a_1 = E\left\{\frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{B_W(X)}\right\} \equiv E\{a_1(X)\}, \tag{2}$$
and $a_0$ can be identified as
$$a_0 = E\left\{\frac{E(W \mid T = 0, X) - E(W \mid Y = 0, X)}{B_W(X)}\right\} \equiv E\{a_0(X)\}. \tag{3}$$
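A brief sketch of why the ratio inside the expectation recovers the conditional mean of the potential outcome may be helpful. The steps below are our own reading of how consistency and Assumptions 1 to 3 enter, written with $B_{W0}(X) = E(W \mid Y = 0, X)$; the formal proof is the one in Appendix A.

```latex
\begin{aligned}
E(W \mid T = t, X)
  &= \sum_{y \in \{0,1\}} E(W \mid Y = y, T = t, X)\, P(Y = y \mid T = t, X) \\
  &= \sum_{y \in \{0,1\}} E(W \mid Y = y, X)\, P(Y = y \mid T = t, X)
     && \text{(Assumption 3)} \\
  &= B_{W0}(X) + B_W(X)\, P(Y = 1 \mid T = t, X) \\
  &= B_{W0}(X) + B_W(X)\, P(Y_t = 1 \mid X)
     && \text{(consistency and Assumption 1)} .
\end{aligned}
```

Since $B_W(X) \neq 0$ by Assumption 2, solving for $P(Y_t = 1 \mid X)$ gives $a_t(X) = \{E(W \mid T = t, X) - B_{W0}(X)\} / B_W(X)$, and averaging over $X$ gives $a_t$.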
Note that under Assumption 4, we have
$$E(W \mid T = 1, X) = E(W \mid T = 1, X, R = 1),$$
$$E(W \mid Y = 1, X) = E(W \mid Y = 1, X, R = 0),$$
so the identification results (2) and (3) leverage information from both the primary sample and the validation sample. The identification results for a series of causal measures follow from those for the potential-outcome means. For example, the risk difference can be identified as
$$E\left\{\frac{E(W \mid T = 1, X) - E(W \mid T = 0, X)}{B_W(X)}\right\},$$
and the risk ratio can be identified as
$$E\left\{\frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{B_W(X)}\right\} \Big/ E\left\{\frac{E(W \mid T = 0, X) - E(W \mid Y = 0, X)}{B_W(X)}\right\}.$$
Similar results can also be obtained for the causal odds ratio, and the joint distribution of the potential outcomes under the monotonicity assumption.
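Because every measure in this section is a simple function of the two means, the mapping can be written in a few lines. The following Python sketch is illustrative only: the function name `causal_measures` and the input values are our own assumptions, and the last three entries are meaningful only under the monotonicity assumption.

```python
def causal_measures(a1, a0):
    """Map the potential-outcome means (a_1, a_0), both strictly between 0 and 1,
    to the causal measures of Section 2. NBR, ABR and TBR are valid only under
    the monotonicity assumption Y_1 >= Y_0."""
    return {
        "RD": a1 - a0,                              # risk difference
        "RR": a1 / a0,                              # risk ratio
        "OR": (a1 / (1 - a1)) / (a0 / (1 - a0)),    # odds ratio
        "NBR": 1 - a1,                              # P(Y_1 = 0, Y_0 = 0)
        "ABR": a0,                                  # P(Y_1 = 1, Y_0 = 1)
        "TBR": a1 - a0,                             # P(Y_1 = 1, Y_0 = 0)
    }

# Hypothetical values of the two means, chosen only for illustration.
measures = causal_measures(0.65, 0.35)
print(measures)
```

In practice the inputs would be estimates of $a_1$ and $a_0$, so the same function turns any estimator of the means into estimators of the derived measures.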
Based on the identification results of Theorem 1, we can first estimate the four functions $E(W \mid T = 1, X)$, $E(W \mid T = 0, X)$, $E(W \mid Y = 1, X)$, and $E(W \mid Y = 0, X)$ by assuming parametric models and fitting them with common parametric regression methods. Subsequently, we can obtain estimates of $a_1$ and $a_0$ using the plug-in method. However, the estimators obtained through this strategy may not be efficient and rely heavily on the model specifications.
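The plug-in strategy can be sketched as follows. To keep the example self-contained we replace the parametric regressions by cell means over a single binary covariate, which is a simplification of our own; the simulated values, the `-1` code for unobserved entries, and the function name `plug_in_means` are likewise assumptions made for illustration.

```python
import numpy as np

def plug_in_means(x, w, r, t, y):
    """Nonparametric plug-in version of (2) and (3) for a discrete covariate x.
    t is read only where r == 1 (primary sample) and y only where r == 0
    (validation sample)."""
    prim, val = (r == 1), (r == 0)
    a1_x = np.empty(len(x))
    a0_x = np.empty(len(x))
    for v in np.unique(x):
        cell = x == v
        ew_t1 = w[prim & cell & (t == 1)].mean()  # E(W | T = 1, X = v)
        ew_t0 = w[prim & cell & (t == 0)].mean()  # E(W | T = 0, X = v)
        ew_y1 = w[val & cell & (y == 1)].mean()   # E(W | Y = 1, X = v)
        ew_y0 = w[val & cell & (y == 0)].mean()   # E(W | Y = 0, X = v)
        b_w = ew_y1 - ew_y0                       # estimate of B_W(v)
        a1_x[cell] = (ew_t1 - ew_y0) / b_w
        a0_x[cell] = (ew_t0 - ew_y0) / b_w
    # Average a_t(X) over the covariate distribution of the combined sample.
    return a1_x.mean(), a0_x.mean()

# Hypothetical combined sample; by construction a_1 = 0.65 and a_0 = 0.35.
rng = np.random.default_rng(1)
n = 400_000
x = rng.integers(0, 2, n)
t_full = rng.binomial(1, np.where(x == 1, 0.6, 0.4))
y_full = rng.binomial(1, 0.3 + 0.3 * t_full + 0.1 * x)
b_w0 = np.where(x == 1, 0.2, 0.1)                 # P(W = 1 | Y = 0, X)
b_w = np.where(x == 1, 0.6, 0.7)                  # B_W(X)
w = rng.binomial(1, b_w0 + y_full * b_w)
r = rng.binomial(1, 0.5, n)                       # 1 = primary, 0 = validation
t = np.where(r == 1, t_full, -1)                  # T unobserved in validation sample
y = np.where(r == 0, y_full, -1)                  # Y unobserved in primary sample

a1_hat, a0_hat = plug_in_means(x, w, r, t, y)
print(a1_hat, a0_hat)
```

No individual in this sample contributes both `t` and `y`, which mirrors the setting in which no subject has complete data.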

3. Efficient and Multiply-Robust Estimation Based on the Semiparametric Theory

In practice, the specification of parametric models may not always be accurate. Multiply-robust estimators, which remain consistent when one, but not necessarily all, of several sets of model assumptions is correctly specified, are generally preferable. In the causal inference literature with accurate data, a common strategy for constructing multiply-robust estimators is to study the efficient influence function of the parameter to be estimated [35,36], which requires semiparametric theory. From the efficient influence function, efficient estimators can then be constructed, and their multiply-robust properties can be investigated. In this section, we derive the efficient influence functions for $a_1$ and $a_0$ within the framework of semiparametric theory in the presence of a misclassified outcome variable. Furthermore, we propose efficient estimators of $a_1$ and $a_0$ and explore their multiply-robust properties.
To facilitate understanding, we give a short review of asymptotic linearity and influence functions in the semiparametric theory framework [24]. An estimator $\hat{\rho}$ of a $p$-dimensional parameter $\rho$ is said to be asymptotically linear if there exists a $p$-dimensional function $\psi(O; \rho)$ of the observed data $O$ such that $E\{\psi(O; \rho)\} = 0$, $\operatorname{Var}\{\psi(O; \rho)\} < \infty$, and
$$n^{1/2}(\hat{\rho} - \rho) = n^{-1/2} \sum_{i=1}^{n} \psi(O_i; \rho) + o_p(1),$$
where $\psi(O; \rho)$ is called the influence function of $\hat{\rho}$. The influence function with the lowest variance is referred to as the efficient influence function, and an estimator whose influence function is the efficient one is semiparametrically efficient.
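A familiar textbook illustration, which is not taken from the paper, is the sample mean $\hat{\rho} = n^{-1} \sum_{i=1}^{n} O_i$ of a scalar variable $O$ with mean $\rho$ and finite variance. In that case the expansion holds exactly, with no remainder term:

```latex
n^{1/2}(\hat{\rho} - \rho) = n^{-1/2} \sum_{i=1}^{n} (O_i - \rho),
\qquad
\psi(O; \rho) = O - \rho,
\qquad
E\{\psi(O; \rho)\} = 0,
\qquad
\operatorname{Var}\{\psi(O; \rho)\} = \operatorname{Var}(O).
```

The asymptotic variance of an asymptotically linear estimator is the variance of its influence function, which is why the influence function with the lowest variance defines the efficiency bound.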
We proceed by deriving the efficient influence function for $a_t$, $t \in \{0, 1\}$, in the next theorem, within our setting. The detailed proof of the theorem can be found in Appendix A. Define
$$z(t) = tT + (1 - t)(1 - T),$$
$$\phi_{1t} = \frac{R}{q}\, \frac{z(t)}{f(T \mid X)\, B_W(X)} \left\{W - B_W(X)\, a_t(X) - B_{W0}(X)\right\},$$
$$\phi_2 = \frac{1 - R}{1 - q}\, \frac{1 - Y}{f(Y \mid X)\, B_W(X)} \left\{W - B_{W0}(X)\right\},$$
and
$$\phi_{3t} = \frac{1 - R}{1 - q}\, \frac{a_t(X)}{B_W(X)}\, \frac{2Y - 1}{f(Y \mid X)} \left\{W - Y B_W(X) - B_{W0}(X)\right\},$$
where $t \in \{0, 1\}$ and $q = P(R = 1)$.
Theorem 2.
Under Assumptions 1–5, the efficient influence function for $a_1$ can be expressed as
$$\mathrm{EIF}_{a_1} = \phi_{11} - \phi_2 - \phi_{31} + a_1(X) - a_1, \tag{4}$$
and the efficient influence function for $a_0$ can be expressed as
$$\mathrm{EIF}_{a_0} = \phi_{10} - \phi_2 - \phi_{30} + a_0(X) - a_0. \tag{5}$$
Theorem 2 provides the efficient influence functions (4) and (5) for the two potential-outcome means in the presence of a misclassified outcome variable. According to the semiparametric theory framework, the semiparametric efficiency bound for $a_1$ is $\operatorname{Var}(\mathrm{EIF}_{a_1})$, and that for $a_0$ is $\operatorname{Var}(\mathrm{EIF}_{a_0})$. The efficient influence function of the mean $a_t$ involves five distinct components of the observed data law: $f(T \mid X)$, $f(Y \mid X)$, $B_{W0}(X)$, $B_W(X)$, and $a_t(X)$. We posit the corresponding parametric working models $f(T \mid X; \alpha)$, $f(Y \mid X; \beta)$, $B_{W0}(X; \gamma)$, $B_W(X; \eta)$, and $a_t(X; \theta_t)$. Let $\hat{\alpha}$, $\hat{\beta}$, $\hat{\gamma}$, $\hat{\eta}$, and $\hat{\theta}_t$ denote estimators of the nuisance parameters that are consistent when their corresponding models are correct. Plugging them in yields estimates of $\phi_{1t}$, $\phi_2$, and $\phi_{3t}$, denoted as $\hat{\phi}_{1t}$, $\hat{\phi}_2$, and $\hat{\phi}_{3t}$.
Constructing estimators from the efficient influence functions is a common approach in semiparametric efficient estimation. Let $E_n(\cdot)$ denote the empirical mean operator with sample size $n$, that is, $E_n\{f(O)\} = \frac{1}{n} \sum_{i=1}^{n} f(O_i)$ for any function $f(\cdot)$ of the observed data $O$. Theorem 2 and the definition of the influence function imply that we can derive estimators for $a_1$ and $a_0$ by solving
$$E_n\{\hat{\phi}_{11} - \hat{\phi}_2 - \hat{\phi}_{31} + \hat{a}_1(X) - a_1\} = 0$$
and
$$E_n\{\hat{\phi}_{10} - \hat{\phi}_2 - \hat{\phi}_{30} + \hat{a}_0(X) - a_0\} = 0,$$
and the resulting estimators can be written as
$$\hat{a}_1^{mr} = E_n\{\hat{\phi}_{11} - \hat{\phi}_2 - \hat{\phi}_{31} + \hat{a}_1(X)\}$$
and
$$\hat{a}_0^{mr} = E_n\{\hat{\phi}_{10} - \hat{\phi}_2 - \hat{\phi}_{30} + \hat{a}_0(X)\}.$$
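Once nuisance values are available, the estimator is a plain sample average. The Python sketch below computes it for given nuisance arrays and checks it on simulated data, first with all true nuisance functions and then with deliberately wrong values of $a_t(X)$ and $E(W \mid Y = 0, X)$ while $B_W(X)$, $f(T \mid X)$, and $f(Y \mid X)$ stay correct. The data-generating values, the `-1` code for unobserved entries, and the function name `mr_estimate` are our own assumptions; the nuisance functions are supplied directly here, not fitted as they would be in practice.

```python
import numpy as np

def mr_estimate(t_level, r, w, t, y, q, f_t, f_y, b_w0, b_w, a_t):
    """Sample average of phi_1t - phi_2 - phi_3t + a_t(X) for given nuisance values.
    f_t[i] = f(T_i | X_i) is used only where r == 1, f_y[i] = f(Y_i | X_i) only
    where r == 0; b_w0, b_w and a_t are evaluated at every X_i."""
    prim, val = (r == 1), (r == 0)
    phi1 = np.zeros(len(w))
    phi2 = np.zeros(len(w))
    phi3 = np.zeros(len(w))
    z = t_level * t[prim] + (1 - t_level) * (1 - t[prim])   # z(t)
    phi1[prim] = z / (q * f_t[prim] * b_w[prim]) * (
        w[prim] - b_w[prim] * a_t[prim] - b_w0[prim])
    yv = y[val]
    phi2[val] = (1 - yv) / ((1 - q) * f_y[val] * b_w[val]) * (w[val] - b_w0[val])
    phi3[val] = a_t[val] / ((1 - q) * b_w[val]) * (2 * yv - 1) / f_y[val] * (
        w[val] - yv * b_w[val] - b_w0[val])
    return np.mean(phi1 - phi2 - phi3 + a_t)

# Hypothetical combined sample with a binary covariate; a_1 = 0.65 by construction.
rng = np.random.default_rng(2)
n = 400_000
x = rng.integers(0, 2, n)
pi = np.where(x == 1, 0.6, 0.4)                    # P(T = 1 | X)
t_full = rng.binomial(1, pi)
y_full = rng.binomial(1, 0.3 + 0.3 * t_full + 0.1 * x)
b_w0 = np.where(x == 1, 0.2, 0.1)                  # E(W | Y = 0, X)
b_w = np.where(x == 1, 0.6, 0.7)                   # B_W(X)
w = rng.binomial(1, b_w0 + y_full * b_w)
r = rng.binomial(1, 0.5, n)
t = np.where(r == 1, t_full, -1)                   # T missing in validation sample
y = np.where(r == 0, y_full, -1)                   # Y missing in primary sample

# True nuisance functions implied by the simulation design.
f_t = np.where(t == 1, pi, 1 - pi)
p_y1 = 0.3 + 0.3 * pi + 0.1 * x                    # P(Y = 1 | X)
f_y = np.where(y == 1, p_y1, 1 - p_y1)
a1_x = 0.6 + 0.1 * x                               # P(Y_1 = 1 | X)
q = r.mean()

est_all = mr_estimate(1, r, w, t, y, q, f_t, f_y, b_w0, b_w, a1_x)
# Wrong a_1(X) and wrong E(W | Y = 0, X); B_W, f(T | X), f(Y | X) kept correct.
est_partial = mr_estimate(1, r, w, t, y, q, f_t, f_y,
                          np.full(n, 0.3), b_w, np.full(n, 0.5))
print(est_all, est_partial)
```

Both averages should be close to 0.65 in this design, which illustrates numerically how the correction terms can compensate for some misspecified components.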
Additionally, because our method directly targets the potential-outcome means corresponding to different treatment levels, we can easily derive estimators for the series of causal measures listed in Section 2. The proposed estimators are derived from estimating equations constructed using the efficient influence function. When all the involved models are correctly specified, it is straightforward to prove that the proposed estimators are efficient. Furthermore, we show that the proposed estimators are multiply robust, in that they remain consistent when at least one of the following model assumptions holds. We first list the three model assumptions:
  • $\mathcal{M}_{1t}$: The models $a_t(X; \theta_t)$, $B_{W0}(X; \gamma)$, and $B_W(X; \eta)$ are correctly specified, meaning that there exist $\theta_{t0}$, $\gamma_0$, and $\eta_0$ such that $a_t(X; \theta_{t0})$, $B_{W0}(X; \gamma_0)$, and $B_W(X; \eta_0)$ equal their true counterparts.
  • $\mathcal{M}_2$: The models $B_W(X; \eta)$, $f(T \mid X; \alpha)$, and $f(Y \mid X; \beta)$ are correctly specified, meaning that there exist $\eta_0$, $\alpha_0$, and $\beta_0$ such that $B_W(X; \eta_0)$, $f(T \mid X; \alpha_0)$, and $f(Y \mid X; \beta_0)$ equal their true counterparts.
  • $\mathcal{M}_{3t}$: The models $a_t(X; \theta_t)$, $f(T \mid X; \alpha)$, and $f(Y \mid X; \beta)$ are correctly specified, meaning that there exist $\theta_{t0}$, $\alpha_0$, and $\beta_0$ such that $a_t(X; \theta_{t0})$, $f(T \mid X; \alpha_0)$, and $f(Y \mid X; \beta_0)$ equal their true counterparts.
The following theorem formally demonstrates the properties of the proposed estimators.
Theorem 3.
Suppose the standard regularity conditions [37] (pp. 2121–2123) and Assumptions 1–5 hold. Then, the proposed estimator $\hat{a}_t^{mr}$ is consistent and asymptotically normal for $a_t$ under the union model $\mathcal{M}_{union}^{t} = \mathcal{M}_{1t} \cup \mathcal{M}_2 \cup \mathcal{M}_{3t}$. Moreover, $\hat{a}_t^{mr}$ attains the semiparametric efficiency bound when all the working models in $\mathcal{M}_{union}^{t}$ are correctly specified.
Theorem 3 ensures the multiple robustness of our proposed estimation method: the resulting estimators are consistent as long as at least one of the listed model assumptions holds. The model sets under different assumptions reflect various combinations of the components of the data law, and when all the involved components are correct, our proposed estimators for $a_1$ and $a_0$ attain their corresponding semiparametric efficiency bounds, which represent the minimum possible asymptotic variances among all regular semiparametric estimators.
Next, we discuss in detail the estimation strategies for the nuisance parameters. Of note, the models in the three assumptions overlap. The model sets in $\mathcal{M}_{1t}$ and $\mathcal{M}_2$ both contain $B_W(X; \eta)$, and those in $\mathcal{M}_{1t}$ and $\mathcal{M}_{3t}$ both contain $a_t(X; \theta_t)$. This implies that the multiply-robust estimator $\hat{a}_t^{mr}$ requires a consistent estimator of $\eta$ under $\mathcal{M}_{1t} \cup \mathcal{M}_2$, and a consistent estimator of $\theta_t$ under $\mathcal{M}_{1t} \cup \mathcal{M}_{3t}$. To achieve this, we extend the doubly robust g-estimation [38] to our measurement error setting with the combined sample. First, according to the form of the adopted parametric models $f(T \mid X; \alpha)$, $f(Y \mid X; \beta)$, and $B_{W0}(X; \gamma)$, we apply common parametric regression methods to obtain estimators $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ for $\alpha$, $\beta$, and $\gamma$, respectively. Second, we propose estimators $\hat{\eta}$ and $\hat{\theta}_t$ for $\eta$ and $\theta_t$ by solving
$$0 = E_n\left[\frac{1 - R}{1 - q}\, k_0(X) \left\{W - Y B_W(X; \eta) - B_{W0}(X; \hat{\gamma})\right\} \frac{2Y - 1}{f(Y \mid X; \hat{\beta})}\right], \tag{6}$$
and
$$\begin{aligned} 0 = {} & E_n\left[\frac{R}{q}\, k_1(X)\, \frac{z(t)}{f(T \mid X; \hat{\alpha})} \left\{W - B_W(X; \hat{\eta})\, a_t(X; \theta_t) - B_{W0}(X; \hat{\gamma})\right\}\right] \\ & - E_n\left[\frac{1 - R}{1 - q}\, k_2(X) \left\{\frac{1 - Y}{f(Y \mid X; \hat{\beta})} \left\{W - B_{W0}(X; \hat{\gamma})\right\} + \frac{2Y - 1}{f(Y \mid X; \hat{\beta})} \left\{W - Y B_W(X; \hat{\eta}) - B_{W0}(X; \hat{\gamma})\right\} a_t(X; \theta_t)\right\}\right], \end{aligned} \tag{7}$$
respectively, where $k_0(X)$ is a user-specified index function of the same dimension as $\eta$, and $k_1(X)$ and $k_2(X)$ are user-specified index functions of the same dimension as $\theta_t$. We have the following theorem for $\hat{\eta}$ and $\hat{\theta}_t$.
Theorem 4.
Suppose the standard regularity conditions [37] (pp. 2121–2123) and Assumptions 1–5 hold. When estimating $a_t$, the proposed estimator $\hat{\eta}$ is consistent and asymptotically normal under $\mathcal{M}_{1t} \cup \mathcal{M}_2$, and $\hat{\theta}_t$ is consistent and asymptotically normal under $\mathcal{M}_{1t} \cup \mathcal{M}_{3t}$.
Finally, we plug the estimates of the nuisance parameters into the estimating equations to obtain the multiply-robust estimator $\hat{a}_t^{mr}$. The entire estimation procedure can be viewed as solving a joint system of estimating equations for the parameters of interest and the nuisance parameters. Thus, the asymptotic variances of the proposed estimators can be derived using standard M-estimation theory [37,39]. To facilitate computation, the bootstrap method is commonly employed in practice for variance estimation and the construction of confidence intervals.
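A nonparametric bootstrap for the combined sample can be sketched as below. This is a generic illustration under our own assumptions: the helper `bootstrap_se_ci`, the dictionary layout of the data, the percentile interval, and the toy estimator are hypothetical, and in a real analysis the `estimator` argument would refit all nuisance models and recompute the multiply-robust estimate on each resample.

```python
import numpy as np

def bootstrap_se_ci(estimator, data, n_boot=500, level=0.95, seed=0):
    """Resample rows of the combined sample with replacement, recompute the
    estimator on each resample, and return the bootstrap standard error and
    a percentile confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(next(iter(data.values())))
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                 # row indices drawn with replacement
        reps[b] = estimator({k: v[idx] for k, v in data.items()})
    lo, hi = np.quantile(reps, [(1 - level) / 2, 1 - (1 - level) / 2])
    return reps.std(ddof=1), (lo, hi)

# Toy usage: the estimator here is just the mean of a binary variable.
rng = np.random.default_rng(4)
data = {"w": rng.binomial(1, 0.4, 2000)}
se, (ci_lo, ci_hi) = bootstrap_se_ci(lambda d: d["w"].mean(), data)
print(se, ci_lo, ci_hi)
```

Resampling whole rows keeps each individual's sample indicator with its observed variables, so the primary and validation proportions vary across resamples as they would across repeated studies.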

4. Simulation Studies

In this section, we conducted simulation studies to (1) verify the multiple robustness of our proposed method and (2) evaluate the finite-sample performance of the resulting estimators in the measurement error setting. Our simulation comprised two examples. In Example 1, we considered the two potential-outcome means, the risk difference, and the risk ratio as the quantities of interest. In Example 2, our data-generating mechanism ensured the validity of the monotonicity assumption, and we considered the joint probabilities of the potential outcomes as the estimands of interest. In both examples, bias, root-mean-square error (RMSE), and standard error (SE) were used as the evaluation criteria, and all results were based on 1000 repeated simulation experiments with sample sizes $n = 2000$ and $n = 5000$ under four cases:
(1) All the models are correctly specified;
(2) Only the assumption $\mathcal{M}_{1t}$ holds;
(3) Only the assumption $\mathcal{M}_2$ holds;
(4) Only the assumption $\mathcal{M}_{3t}$ holds.
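The three evaluation criteria can be computed from the replicated estimates as in the following Python sketch. We assume here that SE means the empirical standard deviation of the estimates across replications, which the text does not state explicitly; the function name `summarize` and the toy numbers are also our own.

```python
import numpy as np

def summarize(estimates, truth):
    """Bias, RMSE and SE of replicated estimates of a scalar estimand.
    SE is taken as the empirical standard deviation across replications
    (an assumption about the definition used in the simulations)."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - truth
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    se = est.std(ddof=1)
    return bias, rmse, se

# Toy replicates around a true value of 1.0.
bias, rmse, se = summarize([0.9, 1.1, 1.0, 1.2], truth=1.0)
print(bias, rmse, se)
```

When the bias is small, RMSE and SE are close to each other, since the mean squared error is the sum of the squared bias and the variance.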
In Example 1, similar to Wang and Tchetgen Tchetgen [35], the baseline covariates $X$ included an intercept and a variable $X_1$ uniformly distributed on $(-1, -0.5) \cup (0.5, 1)$, and we considered the following data-generating mechanism:
$$P(T = 1 \mid X) = \operatorname{expit}(\alpha_0^{\mathrm{T}} X),$$
$$P(Y = 1 \mid T, X) = \operatorname{expit}\{\beta_0^{\mathrm{T}} (X, T)^{\mathrm{T}}\},$$
$$B_{W0}(X) = \operatorname{expit}(\gamma_0^{\mathrm{T}} X),$$
$$B_W(X) = \tanh(\eta_0^{\mathrm{T}} X),$$
where $\alpha_0 = (0.5, 0.5)^{\mathrm{T}}$, $\beta_0 = (0.3, 0.3, 1)^{\mathrm{T}}$, $\gamma_0 = (0.5, 1)^{\mathrm{T}}$, $\eta_0 = (0, 0.5)^{\mathrm{T}}$, and $\operatorname{expit}(\cdot) = \exp(\cdot) / \{1 + \exp(\cdot)\}$. Under this mechanism, we simulated the treatment $T$, the true outcome $Y$, and the misclassified outcome $W$ by noting
$$P(W = 1 \mid Y, X) = B_{W0}(X) + Y B_W(X).$$
The generated data represented realizations of $(X, T, Y, W)$. We then generated the sample indicator $R \sim \text{Bernoulli}(0.5)$. A primary sample was created with $R = 1$, recording only the realizations of $(X, T, W)$, and a validation sample was created with $R = 0$, recording only the realizations of $(X, W, Y)$. The two samples were combined to form the simulated data set. With the above data-generating process, the true values of $a_1$ and $a_0$, computed by the Monte Carlo approach, were around $0.7832$ and $0.5732$, respectively. To implement the proposed method, we chose the identity functions of the covariates as the user-specified index functions. The true parametric models for the five functions involved in constructing the multiply-robust estimator for $a_t$ can be derived easily. For estimating the parameters in models $f(T \mid X; \alpha)$, $f(Y \mid X; \beta)$, and $B_{W0}(X; \gamma)$, we used the maximum likelihood method. For models $a_t(X; \theta_t)$ and $B_W(X; \eta)$, parameter estimation was conducted by solving the estimating equations (6) and (7), implemented through the optim function in R with the quasi-Newton method specified. Finally, we obtained the resulting estimators using the plug-in method. When considering model misspecification, we used $X_1^* = \exp(X_1)$, instead of $X_1$, to fit the involved models.
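The data-generating steps of Example 1 can be sketched in Python as follows. The function name `simulate_example1`, the `-1` code for unobserved entries, and in particular the coefficient signs passed in the usage lines are our own assumptions: the value of `gamma` below is chosen only so that $P(W = 1 \mid Y, X)$ stays within $[0, 1]$ and is not a value confirmed by the text.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate_example1(n, rng, alpha, beta, gamma, eta, q=0.5):
    """Draw one combined sample following the structure of Example 1."""
    # X_1 uniform on (-1, -0.5) U (0.5, 1): random sign times Uniform(0.5, 1).
    x1 = rng.choice([-1.0, 1.0], n) * rng.uniform(0.5, 1.0, n)
    X = np.column_stack([np.ones(n), x1])            # intercept and X_1
    t = rng.binomial(1, expit(X @ alpha))
    y = rng.binomial(1, expit(np.column_stack([X, t]) @ beta))
    p_w = expit(X @ gamma) + y * np.tanh(X @ eta)    # B_W0(X) + Y * B_W(X)
    if p_w.min() < 0 or p_w.max() > 1:
        raise ValueError("P(W = 1 | Y, X) left [0, 1]; check the coefficients")
    w = rng.binomial(1, p_w)
    r = rng.binomial(1, q, n)                        # 1 = primary, 0 = validation
    return {
        "X": X, "w": w, "r": r,
        "t": np.where(r == 1, t, -1),                # T recorded only in primary sample
        "y": np.where(r == 0, y, -1),                # Y recorded only in validation sample
        # Monte Carlo values of a_1 and a_0 under the outcome model.
        "true_a1": expit(np.column_stack([X, np.ones(n)]) @ beta).mean(),
        "true_a0": expit(np.column_stack([X, np.zeros(n)]) @ beta).mean(),
    }

# Coefficient values with assumed signs (hypothetical, see the note above).
alpha = np.array([0.5, 0.5])
beta = np.array([0.3, 0.3, 1.0])
gamma = np.array([0.5, -1.0])
eta = np.array([0.0, 0.5])
sim = simulate_example1(200_000, np.random.default_rng(3), alpha, beta, gamma, eta)
print(sim["true_a1"], sim["true_a0"])
```

The validity check on `p_w` is a design choice of this sketch: it fails loudly if a coefficient combination would imply a misclassification probability outside the unit interval.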
Table 1 reports the simulation results for all four cases in Example 1. The multiply-robust method exhibited small bias, RMSE, and SE for the four estimands across all cases. The SE results were close to the RMSE results, and both decreased as the sample size increased. Compared to the other three estimands, the estimated risk ratio tended to have a larger RMSE and SE. This is because the estimated risk ratio is the ratio of two estimates, where a small denominator can lead to a significant increase in variance. In addition, with a sample size of 5000 and all models correctly specified, the estimators exhibited the smallest SE. These simulation results support the previous theoretical results. For clarity, the results of naive estimates that ignored measurement errors are not included in the table. Naive estimates were severely biased. For example, with a sample size of 5000 and correctly specified propensity scores, the bias of the naive estimate of $a_1$ using the inverse probability weighting method based on (1) was close to $0.4$. Moreover, the bias did not significantly decrease even with a larger sample size.
In Example 2, let $X = (1, X_1)$, where $X_1$ comes from the standard normal distribution. Let $I(\cdot)$ denote the indicator function that takes one when the input is greater than zero, and zero otherwise. The treatment $T$, the true outcome $Y$, and the misclassified outcome $W$ were generated from the following models:
$$P(T = 1 \mid X) = \mathrm{expit}(\alpha_1^{\mathrm{T}} X),$$
$$P(Y = 1 \mid T, X) = I\{\beta_1^{\mathrm{T}} (1, X, T)^{\mathrm{T}} + \epsilon > 0\},$$
$$B_{W0}(X) = \mathrm{expit}(\gamma_1^{\mathrm{T}} X),$$
$$B_W(X) = \tanh(\eta_1^{\mathrm{T}} X),$$
where $\alpha_1 = (0.2, 0.4)^{\mathrm{T}}$, $\beta_1 = (0.5, 0.5, 1)^{\mathrm{T}}$, $\gamma_1 = (0.5, 1)^{\mathrm{T}}$, $\eta_1 = (0, 0.5)^{\mathrm{T}}$, and $\epsilon \sim N(0, 1)$. Under this data generation mechanism, the potential outcome under the treatment level $T = 1$ was no less than the potential outcome under the treatment level $T = 0$. Under this condition, the joint probabilities of the potential outcomes could be estimated through the potential-outcome means. We divided the generated data into two parts and combined them into a sample using the same method as in Example 1. With the use of the Monte Carlo approach, the true values of $a_1$ and $a_0$ in Example 2 were found to be around $0.6732$ and $0.3272$, respectively. The choices of the index functions and the parameter estimation in the parametric models were similar to those in Example 1. We calculated the three different joint probabilities, NBR, ABR, and TBR, under the four different cases. To introduce model misspecification, we fitted the models with the nonlinear transformation $X_1^2 + 3 X_1$ in place of the original variable $X_1$.
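The link between the joint probabilities and the potential-outcome means can be written out explicitly. The display below is a sketch that uses only the monotonicity condition $Y_1 \geq Y_0$ stated above; which of the three joint probabilities is labelled NBR, ABR, or TBR follows the definitions given earlier in the paper and is not restated here.

$$
\begin{aligned}
P(Y_1 = 0, Y_0 = 1) &= 0, \\
P(Y_1 = 1, Y_0 = 1) &= P(Y_0 = 1) = a_0, \\
P(Y_1 = 0, Y_0 = 0) &= P(Y_1 = 0) = 1 - a_1, \\
P(Y_1 = 1, Y_0 = 0) &= 1 - a_0 - (1 - a_1) = a_1 - a_0 .
\end{aligned}
$$

With the Monte Carlo values reported for Example 2 ($a_1 \approx 0.6732$, $a_0 \approx 0.3272$), these expressions evaluate to approximately $0.3272$, $0.3268$, and $0.3460$, respectively.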
Table 2 reports the simulation results for Example 2. For all three probabilities, the proposed multiply-robust method exhibited relatively small bias, RMSE, and SE. When the sample size increased from 2000 to 5000, the RMSE and SE of the three probabilities decreased. Additionally, although the proposed method exhibited the smallest SE when all models were correctly specified, the SE of the estimators did not significantly increase in the other three cases, where two of the involved models were misspecified. This was similar to the simulation results in Example 1.

5. Data Analysis

In this section, we aim to analyze a publicly available data set from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is a survey carried out across all 50 states in the United States by the Centers for Disease Control and Prevention (CDC), and it gathers data on health-related behaviors (such as smoking habits), healthcare access, and chronic conditions. Detailed information on the survey design, sampling methods, data collection, and statistical weighting is available on the CDC website (http://www.cdc.gov/brfss, accessed on 2 September 2024). The data were also previously analyzed by many other researchers [40,41,42].
The main purpose of our analysis was to explore the impact of smoking on obesity (defined as having a body mass index (BMI) of at least 30 kg/m²). We aimed to determine whether smoking had a significant effect on obesity rates and, if so, whether this effect was positive or negative. Understanding this relationship can provide valuable insights into the potential health implications of smoking and inform public health strategies [43,44,45]. Current smokers were identified using the SMOKER2 (Computed Smoking Status) variable. In our study, individuals who responded “Current Smoker - Now Smokes Every Day” or “Current Smoker - Now Smokes Some Days” were classified as smokers (SMOKER2 = 1 or 2), as done in the work by Sharbaugh [46]. The weight and height data in the BRFSS data set are self-reported and prone to measurement errors, making the obesity data derived from the BMI data susceptible to inaccuracies. In other words, besides the treatment variable indicating whether an individual is a smoker, the BRFSS data set includes a misclassified version of the outcome variable, but not the true outcome variable. The National Health and Nutrition Examination Survey (NHANES) is another survey designed to assess the health and nutritional status of adults and children in the United States. The NHANES data set contains not only self-reported weights and heights but also measured weights and heights. However, it does not include data on smoking habits, the treatment variable of interest. We therefore used the NHANES data as the validation data set, and the obesity rate calculated from the BMI information in this validation data set is considered to be accurate. The NHANES data set can be accessed at https://www.cdc.gov/nchs/nhanes (accessed on 2 September 2024). According to the data set, approximately ten percent of the data are reported incorrectly.
Alongside the binary treatment variable for smoking and the binary outcome variable for obesity, we incorporated age, gender, race, and education as the control variables, as these factors can influence both the treatment and the outcome. For both data sets, our analysis sample was limited to the 2018 survey year. We preprocessed the data using some standard procedures, such as removing cases with missing information, converting the calculated BMI data into a binary format indicating obesity, and encoding nominal variables.
For the purpose of model fitting, logistic regression models were employed for $f(T \mid X)$, $f(Y \mid X)$, $B_{W0}(X)$, and $a_t(X)$, and a hyperbolic tangent function was utilized for modeling $B_W(X)$, where $X$ represents the control covariates previously discussed. We present the analysis results in Table 3, utilizing point estimates, standard errors, and the 95% confidence intervals for method evaluation. The standard errors and confidence intervals were derived using the bootstrap method with 500 bootstrap samples.
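A generic version of the bootstrap step can be sketched as follows. The text does not say which interval type was used or how the two samples were resampled, so this sketch assumes percentile intervals and simple row resampling of a single array; `bootstrap_se_ci` is a hypothetical helper, and the sample mean stands in for the multiply-robust estimator.

```python
import numpy as np

def bootstrap_se_ci(data, estimator, n_boot=500, level=0.95, seed=0):
    """Nonparametric bootstrap standard error and percentile interval."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        reps[b] = estimator(data[idx])     # re-estimate on the resampled data
    se = reps.std(ddof=1)                  # SD of the bootstrap replicates
    lower, upper = np.quantile(reps, [(1 - level) / 2, (1 + level) / 2])
    return se, (lower, upper)

# Usage on placeholder data, with the sample mean as a stand-in estimator.
rng = np.random.default_rng(2)
data = rng.normal(loc=0.7, scale=1.0, size=2000)
se, (lower, upper) = bootstrap_se_ci(data, np.mean, n_boot=500)
```

For a combined sample, one reasonable variant is to resample the primary and validation records separately so that each bootstrap data set keeps the original sample sizes; that choice is also an assumption rather than something stated in the text.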
Table 3 presents the analysis results for the same four estimands as those in Example 1 of the simulation studies. The point estimates of the potential-outcome means corresponding to different treatment levels both fell within the interval [ 0 , 1 ] . The point estimates of the risk difference and the risk ratio indicated that smoking reduces obesity rates, which is consistent with some previous studies [43,47]. This is mainly because nicotine is thought to suppress appetite and increase metabolism, potentially leading to lower body weight among smokers. Therefore, smokers have a lower obesity rate compared to non-smokers. However, as noted by Munafò [48], although smoking may be associated with a lower body weight, it poses significant health risks, including heart disease, lung disease, and cancer. From a long-term health perspective, smoking is not a healthy way to control weight and can lead to more severe health issues. In addition, for the four estimands, the standard errors calculated by the bootstrap method were small, and the estimated 95 % confidence intervals did not include zero. The results suggest that the proposed estimation method may perform satisfactorily in real-world scenarios. Additionally, when the monotonicity assumption held (in our example, this meant that smoking did not cause weight gain for any individual), we could also calculate the NBR, ABR, and TBR.

6. Discussion

In fields like medicine and economics, measurement errors are common. This paper highlighted the importance of potential-outcome means in causal inference and addressed the challenge of identification and estimation for the means with misclassified outcome data. By leveraging both the primary and the validation samples, we ensured the identifiability of potential-outcome means under a plausible assumption that the misclassification probability could depend on both the true outcomes and the baseline covariates. Furthermore, we proposed the multiply-robust and efficient estimators based on the semiparametric theory, which remained consistent even with partial misspecification of the observed data law. Extensive simulation experiments and real data analysis underscored the robustness and effectiveness of our method. Of note, the causal measures discussed in this paper can be defined using the potential-outcome means. However, the quantile treatment effects, as another common measure of causal effect, cannot be directly defined using the potential-outcome means. Therefore, the methods proposed in this paper cannot be directly applied to quantile treatment effects. In future work, we will consider how to address the identification and estimation of quantile treatment effects under measurement error data within the semiparametric theoretical framework.

Author Contributions

Conceptualization, S.W.; methodology, S.W.; software, C.Z.; validation, Z.G.; formal analysis, S.L.; writing—original draft preparation, S.W.; writing—review and editing, Z.G. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant numbers: 12071015, 12401378, 12301370).

Data Availability Statement

The data used in this study are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Theorem 1.
Note that
$$
\begin{aligned}
E(W \mid T = 1, X) &= E(W \mid Y = 1, T = 1, X)\, P(Y = 1 \mid T = 1, X) + E(W \mid Y = 0, T = 1, X)\, P(Y = 0 \mid T = 1, X) \\
&= E(W \mid Y = 1, X)\, P(Y = 1 \mid T = 1, X) + E(W \mid Y = 0, X)\, P(Y = 0 \mid T = 1, X) \\
&= E(W \mid Y = 1, X)\, E(Y_1 \mid X) + E(W \mid Y = 0, X)\, \{1 - E(Y_1 \mid X)\} \\
&= \{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)\}\, E(Y_1 \mid X) + E(W \mid Y = 0, X),
\end{aligned}
$$
where the second equality and the third equality are due to Assumption 3 and Assumption 1, respectively. Under Assumption 2, we have
$$
a_1 = E\left\{ \frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)} \right\}, \tag{A1}
$$
and the components in (A1) can all be identified using the combined data under Assumption 4. The identification result for $a_0$ can be proven in a similar manner. □
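The identification formula can be checked numerically on a small discrete example. The sketch below is not part of the proof: the binary covariate and every probability value are hypothetical, and $E(W \mid T = 1, X)$ is built from the last line of the chain of equalities, $B_W(X) a_1(X) + B_{W0}(X)$, with $B_{W0}(X) = E(W \mid Y = 0, X)$ and $B_W(X) = E(W \mid Y = 1, X) - E(W \mid Y = 0, X)$.

```python
import numpy as np

# Hypothetical law for a binary covariate X (index 0 and 1).
p_x = np.array([0.4, 0.6])     # P(X = x)
a1_x = np.array([0.7, 0.5])    # a_1(X) = E(Y_1 | X)
bw0 = np.array([0.10, 0.20])   # B_W0(X) = E(W | Y = 0, X)
bw = np.array([0.75, 0.60])    # B_W(X) = E(W | Y = 1, X) - E(W | Y = 0, X)

e_w_y0 = bw0                   # identified from the validation sample
e_w_y1 = bw0 + bw              # identified from the validation sample
e_w_t1 = bw * a1_x + bw0       # E(W | T = 1, X), identified from the primary sample

a1_true = np.sum(p_x * a1_x)
# Naive target that treats the misclassified W as if it were Y.
a1_naive = np.sum(p_x * e_w_t1)
# Identification formula: average the ratio over the distribution of X.
a1_identified = np.sum(p_x * (e_w_t1 - e_w_y0) / (e_w_y1 - e_w_y0))
```

In this example the ratio recovers $a_1(X)$ at each covariate value, so `a1_identified` coincides with `a1_true`, while `a1_naive` does not.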
Proof of Theorem 2.
We base our proof on the semiparametric efficiency theory in Tsiatis [24] and Chen [49]. Let $O = (R, X, W, RT, (1 - R)Y)^{\mathrm{T}}$ be the vector of all observed variables. The density function $f_\rho(O)$ of the observed variables $O$ with a parametric path $\rho$ can be factorized as
$$
\begin{aligned}
f_\rho(O) &= q_\rho^{R} (1 - q_\rho)^{1 - R} f_\rho(Y, W, X)^{1 - R} f_\rho(T, W, X)^{R} \\
&= q_\rho^{R} (1 - q_\rho)^{1 - R} f_\rho(X)^{1 - R} f_\rho(Y \mid X)^{1 - R} f_\rho(W \mid Y, X)^{1 - R} f_\rho(X)^{R} f_\rho(T \mid X)^{R} f_\rho(W \mid T, X)^{R} \\
&= q_\rho^{R} (1 - q_\rho)^{1 - R} f_\rho(X) f_\rho(Y \mid X)^{1 - R} f_\rho(W \mid Y, X)^{1 - R} f_\rho(T \mid X)^{R} f_\rho(W \mid T, X)^{R},
\end{aligned}
$$
where $q_\rho$ denotes the probability of $R$ being equal to one. The corresponding score function can be written as
$$
S_\rho(O) = \frac{\partial \log f_\rho(O)}{\partial \rho} = c (R - q_\rho) + S_\rho(X) + (1 - R) S_\rho(Y \mid X) + (1 - R) S_\rho(W \mid Y, X) + R\, S_\rho(T \mid X) + R\, S_\rho(W \mid T, X),
$$
where $c$ is a constant. Our subsequent proof takes $a_1$ as the example; the results for $a_0$ can be proven in a similar manner. According to the semiparametric theory, the efficient influence function $\psi_{eif}(O)$ we seek for $a_1$ needs to satisfy $E_\rho\{\psi_{eif}(O)\} = 0$ and
$$
\dot{a}_1 = \partial E_\rho\{a_1(X)\} / \partial \rho = E_\rho\{\psi_{eif}(O) S_\rho(O)\},
$$
where the dot is used to denote the partial derivative with respect to $\rho$. In the following, for the sake of clarity, we omit the subscript $\rho$ of $q$, $f(\cdot)$, $E(\cdot)$, and $S(\cdot)$. Under the given assumptions, by a simple calculation, the partial derivative yields
$$
\begin{aligned}
\dot{a}_1 &= E\left\{ \frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)} S(X) \right\} + E\left[ \frac{\partial}{\partial \rho} \left\{ \frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)} \right\} \right] \\
&= E\left\{ \frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)} S(X) \right\} + E\left\{ \frac{\dot{E}(W \mid T = 1, X)}{B_W(X)} \right\} - E\left\{ \frac{\dot{E}(W \mid Y = 0, X)}{B_W(X)} \right\} - E\left\{ \frac{a_1(X) \dot{B}_W(X)}{B_W(X)} \right\} \\
&\triangleq L_1 + L_2 - L_3 - L_4 .
\end{aligned}
$$
Note that
$$
\begin{aligned}
\dot{E}(W \mid T = 1, X) &= E\{W S(W \mid T = 1, X) \mid T = 1, X\} \\
&= E[\{W - E(W \mid T = 1, X)\} S(W \mid T = 1, X) \mid T = 1, X, R = 1] \\
&= E\left[ \frac{R T \{W - E(W \mid T = 1, X)\}}{q f(T = 1 \mid X)} S(W \mid T = 1, X) \,\Big|\, X \right] \\
&= E\left[ \frac{R T \{W - E(W \mid T = 1, X)\}}{q f(T = 1 \mid X)} R\, S(W \mid T, X) \,\Big|\, X \right] \\
&= E\left[ \frac{R T \{W - E(W \mid T = 1, X)\}}{q f(T = 1 \mid X)} \{S(O) - c(R - q) - S(X) - S(T \mid X)\} \,\Big|\, X \right] \\
&\triangleq N_1 - N_2 - N_3 - N_4 ,
\end{aligned}
$$
where
$$
\begin{aligned}
N_2 &= E\left[ \frac{R T \{W - E(W \mid T = 1, X)\}}{q f(T = 1 \mid X)} \{c (1 - q)\} \,\Big|\, X \right] = E[\{W - E(W \mid T = 1, X)\} c (1 - q) \mid R = 1, T = 1, X] = 0, \\
N_3 &= E[\{W - E(W \mid T = 1, X)\} S(X) \mid R = 1, T = 1, X] = 0, \\
N_4 &= E[\{W - E(W \mid T = 1, X)\} S(T = 1 \mid X) \mid R = 1, T = 1, X] = S(T = 1 \mid X)\, E[\{W - E(W \mid T = 1, X)\} \mid R = 1, T = 1, X] = 0,
\end{aligned}
$$
and
$$
\begin{aligned}
\dot{E}(W \mid Y = 0, X) &= E\{W S(W \mid Y = 0, X) \mid Y = 0, X\} \\
&= E[\{W - E(W \mid Y = 0, X)\} S(W \mid Y = 0, X) \mid Y = 0, X, R = 0] \\
&= E\left[ \frac{(1 - R)(1 - Y) \{W - E(W \mid Y = 0, X)\}}{(1 - q) f(Y = 0 \mid X)} S(W \mid Y, X) \,\Big|\, X \right] \\
&= E\left[ \frac{(1 - R)(1 - Y) \{W - E(W \mid Y = 0, X)\}}{(1 - q) f(Y = 0 \mid X)} \{S(O) - c(R - q) - S(X) - S(Y \mid X)\} \,\Big|\, X \right] \\
&\triangleq M_1 - M_2 - M_3 - M_4 ,
\end{aligned}
$$
where
$$
\begin{aligned}
M_2 &= E[\{W - E(W \mid Y = 0, X)\} c (0 - q) \mid X, Y = 0, R = 0] = 0, \\
M_3 &= S(X)\, E[\{W - E(W \mid Y = 0, X)\} \mid X, Y = 0, R = 0] = 0, \\
M_4 &= S(Y = 0 \mid X)\, E[\{W - E(W \mid Y = 0, X)\} \mid X, Y = 0, R = 0] = 0 .
\end{aligned}
$$
Similarly,
$$
\dot{E}(W \mid Y = 1, X) = E\left[ \frac{(1 - R) Y \{W - E(W \mid Y = 1, X)\}}{(1 - q) f(Y = 1 \mid X)} S(O) \,\Big|\, X \right].
$$
Then, we have
$$
L_1 = E[\{a_1(X) - a_1\} S(O)], \quad L_2 = E\left[ \frac{R T \{W - E(W \mid T = 1, X)\}}{q f(T \mid X) B_W(X)} S(O) \right],
$$
$$
L_3 = E\left[ \frac{(1 - R)(1 - Y) \{W - E(W \mid Y = 0, X)\}}{(1 - q) f(Y \mid X) B_W(X)} S(O) \right], \quad L_4 = E\left[ \frac{a_1(X)}{B_W(X)} \frac{(1 - R)(2Y - 1) \{W - E(W \mid Y, X)\}}{(1 - q) f(Y \mid X)} S(O) \right].
$$
Meanwhile, it is straightforward to see that
$$
E(W \mid T = 1, X) = B_W(X) a_1(X) + B_{W0}(X)
$$
and
$$
E(W \mid Y, X) = Y B_W(X) + B_{W0}(X).
$$
Combining the above results, we have
$$
\mathrm{EIF}_{a_1} = \phi_{11} - \phi_{2} - \phi_{31} + a_1(X) - a_1 .
$$
A similar proof process can be employed to obtain the efficient influence function for a 0 . □
Proof of Theorem 3.
Taking $a_1$ as the example, we first assume that as the sample size tends to infinity, the estimators $\hat{\alpha}$, $\hat{\beta}$, $\hat{\gamma}$, $\hat{\eta}$, and $\hat{\theta}_1$ converge in probability to the constants $\alpha^*$, $\beta^*$, $\gamma^*$, $\eta^*$, and $\theta_1^*$. For clarity of presentation, we write $a_1(X; \theta_1^*) = a_1^*(X)$, and likewise for the other models. Under the given assumptions, the limit value $a_1^*$ of $\hat{a}_1^{mr}$ can be written as
$$
\begin{aligned}
a_1^* ={}& E\left[ \frac{T}{f^*(T \mid X) B_W^*(X)} \{W - B_W^*(X) a_1^*(X) - B_{W0}^*(X)\} \right] - E\left[ \frac{1 - Y}{f^*(Y \mid X) B_W^*(X)} \{W - B_{W0}^*(X)\} \right] \\
& - E\left[ \frac{a_1^*(X)}{B_W^*(X)} \frac{2Y - 1}{f^*(Y \mid X)} \{W - Y B_W^*(X) - B_{W0}^*(X)\} \right] + E\{a_1^*(X)\},
\end{aligned}
$$
and we need to prove that $a_1^*$ equals $a_1$ under $\mathcal{M}_{\mathrm{union}1}$.
Firstly, when $\mathcal{M}_{11}$ holds, meaning that $a_1^*(X) = a_1(X)$, $B_{W0}^*(X) = B_{W0}(X)$, and $B_W^*(X) = B_W(X)$, note that
$$
\begin{aligned}
& E\left[ \frac{T}{f^*(T \mid X) B_W(X)} \{W - B_W(X) a_1(X) - B_{W0}(X)\} \,\Big|\, X, T \right] \\
&\quad = \frac{T}{f^*(T \mid X) B_W(X)} E\{W - B_W(X) a_1(X) - B_{W0}(X) \mid X, T\} \\
&\quad = \frac{T}{f^*(T \mid X) B_W(X)} \{E(W \mid X, T = 1) - B_W(X) a_1(X) - B_{W0}(X)\} = 0 .
\end{aligned}
$$
Similarly,
$$
E\left[ \frac{1 - Y}{f^*(Y \mid X) B_W(X)} \{W - B_{W0}(X)\} \,\Big|\, Y, X \right] = 0,
$$
and
$$
E\left[ \frac{a_1(X)}{B_W(X)} \frac{2Y - 1}{f^*(Y \mid X)} \{W - Y B_W(X) - B_{W0}(X)\} \,\Big|\, Y, X \right] = 0 .
$$
Thus, by the law of iterated expectations, we have
$$
\begin{aligned}
a_1^* ={}& E\left[ \frac{T}{f^*(T \mid X) B_W(X)} \{W - B_W(X) a_1(X) - B_{W0}(X)\} \right] - E\left[ \frac{1 - Y}{f^*(Y \mid X) B_W(X)} \{W - B_{W0}(X)\} \right] \\
& - E\left[ \frac{a_1(X)}{B_W(X)} \frac{2Y - 1}{f^*(Y \mid X)} \{W - Y B_W(X) - B_{W0}(X)\} \right] + E\{a_1(X)\} \\
={}& 0 + E\{a_1(X)\} = a_1 .
\end{aligned}
$$
Secondly, when $\mathcal{M}_{2}$ holds, meaning that $B_W^*(X) = B_W(X)$, $f^*(T \mid X) = f(T \mid X)$, and $f^*(Y \mid X) = f(Y \mid X)$, note that
$$
E\left[ \frac{T}{f(T \mid X) B_W(X)} \{W - B_W(X) a_1^*(X) - B_{W0}^*(X)\} \right] = E\left\{ \frac{E(W \mid T = 1, X) - B_W(X) a_1^*(X) - B_{W0}^*(X)}{B_W(X)} \right\},
$$
$$
E\left[ \frac{1 - Y}{f(Y \mid X) B_W(X)} \{W - B_{W0}^*(X)\} \right] = E\left\{ \frac{E(W \mid Y = 0, X) - B_{W0}^*(X)}{B_W(X)} \right\},
$$
and
$$
\begin{aligned}
& E\left[ \frac{2Y - 1}{B_W(X) f(Y \mid X)} a_1^*(X) \{W - Y B_W(X) - B_{W0}^*(X)\} \right] \\
&\quad = E\left[ \frac{2Y - 1}{B_W(X) f(Y \mid X)} a_1^*(X) W - \frac{Y B_W(X)}{f(Y \mid X)} \frac{a_1^*(X)}{B_W(X)} \right] \\
&\quad = E\left[ \frac{a_1^*(X)}{B_W(X)} \{E(W \mid Y = 1, X) - E(W \mid Y = 0, X)\} \right] - E\{a_1^*(X)\} = 0 .
\end{aligned}
$$
Thus, we have
$$
\begin{aligned}
a_1^* ={}& E\left[ \frac{T}{f(T \mid X) B_W(X)} \{W - B_W(X) a_1^*(X) - B_{W0}^*(X)\} \right] - E\left[ \frac{1 - Y}{f(Y \mid X) B_W(X)} \{W - B_{W0}^*(X)\} \right] \\
& - E\left[ \frac{a_1^*(X)}{B_W(X)} \frac{2Y - 1}{f(Y \mid X)} \{W - Y B_W(X) - B_{W0}^*(X)\} \right] + E\{a_1^*(X)\} = a_1 .
\end{aligned}
$$
Thirdly, when $\mathcal{M}_{31}$ holds, meaning that $a_1^*(X) = a_1(X)$, $f^*(T \mid X) = f(T \mid X)$, and $f^*(Y \mid X) = f(Y \mid X)$, note that
$$
\begin{aligned}
E\left[ \frac{T}{f(T \mid X) B_W^*(X)} \{W - B_W^*(X) a_1(X) - B_{W0}^*(X)\} \right] &= E\left[ \frac{T}{f(T \mid X) B_W^*(X)} W - a_1(X) - \frac{B_{W0}^*(X)}{B_W^*(X)} \right] \\
&= E\left[ \frac{E(W \mid T = 1, X)}{B_W^*(X)} - a_1(X) - \frac{B_{W0}^*(X)}{B_W^*(X)} \right],
\end{aligned}
$$
$$
E\left[ \frac{1 - Y}{f(Y \mid X) B_W^*(X)} \{W - B_{W0}^*(X)\} \right] = E\left[ \frac{E(W \mid Y = 0, X)}{B_W^*(X)} - \frac{B_{W0}^*(X)}{B_W^*(X)} \right],
$$
and
$$
E\left[ \frac{2Y - 1}{B_W^*(X) f(Y \mid X)} a_1(X) \{W - Y B_W^*(X) - B_{W0}^*(X)\} \right] = E\left\{ \frac{E(W \mid T = 1, X) - E(W \mid Y = 0, X)}{B_W^*(X)} \right\} - E\{a_1(X)\},
$$
and then we have
$$
\begin{aligned}
a_1^* ={}& E\left[ \frac{T}{f(T \mid X) B_W^*(X)} \{W - B_W^*(X) a_1(X) - B_{W0}^*(X)\} \right] - E\left[ \frac{1 - Y}{f(Y \mid X) B_W^*(X)} \{W - B_{W0}^*(X)\} \right] \\
& - E\left[ \frac{a_1(X)}{B_W^*(X)} \frac{2Y - 1}{f(Y \mid X)} \{W - Y B_W^*(X) - B_{W0}^*(X)\} \right] + E\{a_1(X)\} = a_1 .
\end{aligned}
$$
The property of $a_0$ can be proven similarly, and we conclude the proof for the first claim according to the M-estimation theory [37]. The second claim of the semiparametric efficiency follows from Theorem 2 when all models are correct. □
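The three cases of the proof can be checked numerically with exact expectations on a discrete law. The sketch below is an illustration rather than part of the argument: the binary covariate, all probability values, and the deliberately wrong working models are hypothetical, and each expectation in the limit expression is taken with respect to the law of the variables it involves, with ignorability used to write $f(Y = 1 \mid X) = a_1(X) f(T = 1 \mid X) + a_0(X) f(T = 0 \mid X)$.

```python
import numpy as np

# True law on a binary covariate X (all numbers are hypothetical).
p_x = np.array([0.4, 0.6])      # P(X = x)
e_x = np.array([0.3, 0.6])      # f(T = 1 | X)
a1_x = np.array([0.7, 0.5])     # a_1(X)
a0_x = np.array([0.4, 0.2])     # a_0(X)
bw0 = np.array([0.10, 0.20])    # B_W0(X)
bw = np.array([0.75, 0.60])     # B_W(X)

py1 = a1_x * e_x + a0_x * (1 - e_x)   # f(Y = 1 | X)
ew_t1 = bw * a1_x + bw0               # E(W | T = 1, X)
ew_y1, ew_y0 = bw0 + bw, bw0          # E(W | Y = 1, X), E(W | Y = 0, X)
a1 = np.sum(p_x * a1_x)               # true potential-outcome mean

def limit_a1(f_t1, f_y1, a1_m, bw0_m, bw_m):
    """Limit a_1^* of the estimator for given working models.

    f_t1, f_y1: working values of f(T = 1 | X) and f(Y = 1 | X);
    a1_m, bw0_m, bw_m: working values of a_1(X), B_W0(X), B_W(X).
    """
    term1 = e_x / f_t1 * (ew_t1 - bw_m * a1_m - bw0_m) / bw_m
    term2 = (1 - py1) / (1 - f_y1) * (ew_y0 - bw0_m) / bw_m
    term3 = a1_m / bw_m * (
        py1 / f_y1 * (ew_y1 - bw_m - bw0_m)
        - (1 - py1) / (1 - f_y1) * (ew_y0 - bw0_m)
    )
    return np.sum(p_x * (term1 - term2 - term3 + a1_m))

# Deliberately misspecified working models.
f_bad = np.array([0.5, 0.5])
a1_bad = np.array([0.6, 0.6])
bw0_bad = np.array([0.15, 0.15])
bw_bad = np.array([0.5, 0.5])

limit_m11 = limit_a1(f_bad, f_bad, a1_x, bw0, bw)           # M_11 holds
limit_m2 = limit_a1(e_x, py1, a1_bad, bw0_bad, bw)          # M_2 holds
limit_m31 = limit_a1(e_x, py1, a1_x, bw0_bad, bw_bad)       # M_31 holds
limit_none = limit_a1(f_bad, f_bad, a1_bad, bw0_bad, bw_bad)  # none holds
```

In this example the first three limits equal the true $a_1$, matching the three cases above, while the limit with every working model misspecified does not.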
Proof of Theorem 4.
To prove Theorem 4, under the given assumptions, we actually need to prove
$$
0 = E\left[ \{W - Y B_W(X; \eta) - B_{W0}^*(X)\} \frac{2Y - 1}{f^*(Y \mid X)} \right],
$$
under $\mathcal{M}_{1t} \cup \mathcal{M}_{2}$, and
$$
0 = E\left[ \frac{Z(t)}{f^*(T \mid X)} \{W - B_W^*(X) a_t(X; \theta_t) - B_{W0}^*(X)\} \right] - E\left[ \frac{1 - Y}{f^*(Y \mid X)} \{W - B_{W0}^*(X)\} + \frac{2Y - 1}{f^*(Y \mid X)} \{W - Y B_W^*(X) - B_{W0}^*(X)\} a_t(X; \theta_t) \right]
$$
under $\mathcal{M}_{1t} \cup \mathcal{M}_{3t}$. The unbiasedness of these estimating equations can be obtained in a manner similar to the proof of Theorem 3, and we omit the specific details here. □

References

1. Brumback, B.; Berg, A. On effect-measure modification: Relationships among changes in the relative risk, odds ratio, and risk difference. Stat. Med. 2008, 27, 3453–3465.
2. Ukoumunne, O.C.; Forbes, A.B.; Carlin, J.B.; Gulliford, M.C. Comparison of the risk difference, risk ratio and odds ratio scales for quantifying the unadjusted intervention effect in cluster randomized trials. Stat. Med. 2008, 27, 5143–5155.
3. Kim, H.Y. Statistical notes for clinical researchers: Risk difference, risk ratio, and odds ratio. Restor. Dent. Endod. 2017, 42, 72–76.
4. VanderWeele, T.J. Can sophisticated study designs with regression analyses of observational data provide causal inferences? JAMA Psychiatry 2021, 78, 244–246.
5. Abadie, A.; Spiess, J. Robust post-matching inference. J. Am. Stat. Assoc. 2022, 117, 983–995.
6. Ma, X.; Wang, J. Robust inference using inverse probability weighting. J. Am. Stat. Assoc. 2020, 115, 1851–1860.
7. VanderWeele, T.J.; Li, Y. Simple sensitivity analysis for differential measurement error. Am. J. Epidemiol. 2019, 188, 1823–1829.
8. Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; Crainiceanu, C.M. Measurement Error in Nonlinear Models: A Modern Perspective; Chapman and Hall/CRC: Boca Raton, FL, USA, 2006.
9. Schennach, S.M. Recent advances in the measurement error literature. Annu. Rev. Econ. 2016, 8, 341–377.
10. Amorim, G.; Tao, R.; Lotspeich, S.; Shaw, P.A.; Lumley, T.; Shepherd, B.E. Two-phase sampling designs for data validation in settings with covariate measurement error and continuous outcome. J. R. Stat. Soc. Ser. B 2021, 184, 1368–1389.
11. Tao, R.; Lotspeich, S.C.; Amorim, G.; Shaw, P.A.; Shepherd, B.E. Efficient semiparametric inference for two-phase studies with outcome and covariate measurement errors. Stat. Med. 2021, 40, 725–738.
12. Amorim, G.; Tao, R.; Lotspeich, S.; Shaw, P.A.; Lumley, T.; Patel, R.C.; Shepherd, B.E. Three-phase generalized raking and multiple imputation estimators to address error-prone data. Stat. Med. 2024, 43, 379–394.
13. Boatman, J.A.; Vock, D.M.; Koopmeiners, J.S.; Donny, E.C. Estimating causal effects from a randomized clinical trial when noncompliance is measured with error. Biostatistics 2018, 19, 103–118.
14. Gravel, C.A.; Platt, R.W. Weighted estimation for confounded binary outcomes subject to misclassification. Stat. Med. 2018, 37, 425–436.
15. Yanagi, T. Inference on local average treatment effects for misclassified treatment. Econom. Rev. 2019, 38, 938–960.
16. Shu, D.; Yi, G.Y. Causal inference with measurement error in outcomes: Bias analysis and estimation methods. Stat. Methods Med. Res. 2019, 28, 2049–2068.
17. Shu, D.; Yi, G.Y. Inverse-probability-of-treatment weighted estimation of causal parameters in the presence of error-contaminated and time-dependent confounders. Biom. J. 2019, 61, 1507–1525.
18. Shu, D.; Yi, G.Y. Weighted causal inference methods with mismeasured covariates and misclassified outcomes. Stat. Med. 2019, 38, 1835–1854.
19. Edwards, J.K.; Cole, S.R.; Fox, M.P. Flexibly accounting for exposure misclassification with external validation data. Am. J. Epidemiol. 2020, 189, 850–860.
20. Richardson, D.B.; Keil, A.P.; Edwards, J.K.; Cole, S.R.; Tchetgen Tchetgen, E. A bespoke instrumental variable approach to correction for exposure measurement error. Am. J. Epidemiol. 2022, 191, 1954–1961.
21. VanderWeele, T.J.; Valeri, L.; Ogburn, E.L. The role of measurement error and misclassification in mediation analysis: Mediation and measurement error. Epidemiology 2012, 23, 561–564.
22. Jiang, Z.; Ding, P. Measurement errors in the binary instrumental variable model. Biometrika 2020, 107, 238–245.
23. Cheng, C.; Spiegelman, D.; Li, F. Mediation analysis in the presence of continuous exposure measurement error. Stat. Med. 2023, 42, 1669–1686.
24. Tsiatis, A.A. Semiparametric Theory and Missing Data; Springer: New York, NY, USA, 2006.
25. Shi, X.; Pan, Z.; Miao, W. Data integration in causal inference. Wiley Interdiscip. Rev. Comput. Stat. 2023, 15, e1581.
26. Rubin, D.B. Causal inference using potential outcomes: Design, modeling, decisions. J. Am. Stat. Assoc. 2005, 100, 322–331.
27. Greenland, S. Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. Am. J. Epidemiol. 2004, 160, 301–305.
28. Olivier, J.; May, W.L.; Bell, M.L. Relative effect sizes for measures of risk. Commun. Stat-Theor. Methods 2017, 46, 6774–6781.
29. Doi, S.A.; Furuya-Kanamori, L.; Xu, C.; Chivese, T.; Lin, L.; Musa, O.A.; Hindy, G.; Thalib, L.; Harrell, F.E. The odds ratio is “portable” across baseline risk but not the relative risk: Time to do away with the log link in binomial regression. J. Clin. Epidemiol. 2022, 142, 288–293.
30. Ding, P.; Geng, Z.; Yan, W.; Zhou, X.H. Identifiability and estimation of causal effects by principal stratification with outcomes truncated by death. J. Am. Stat. Assoc. 2011, 106, 1578–1591.
31. He, Y.; Zheng, L.; Luo, P. Treatment benefit and treatment harm rates with nonignorable missing covariate, endpoint, or treatment. Mathematics 2023, 11, 4459.
32. Lu, Z.; Geng, Z.; Li, W.; Zhu, S.; Jia, J. Evaluating causes of effects by posterior effects of causes. Biometrika 2022, 110, 449–465.
33. Shingaki, R.; Kuroki, M. Identification and estimation of joint probabilities of potential outcomes in observational studies with covariate information. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 26475–26486.
34. Imbens, G.W. Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econ. Stat. 2004, 86, 4–29.
35. Wang, L.; Tchetgen Tchetgen, E. Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 2018, 80, 531–550.
36. Shi, X.; Miao, W.; Nelson, J.C.; Tchetgen Tchetgen, E. Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. J. R. Stat. Soc. Ser. B. Stat. Methodol. 2020, 82, 521–540.
37. Newey, W.K.; McFadden, D. Large sample estimation and hypothesis testing. Handb. Econ. 1994, 4, 2111–2245.
38. Robins, J.M. Correcting for non-compliance in randomized trials using structural nested mean models. Commun. Stat. Theory Methods 1994, 23, 2379–2412.
39. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000; Volume 3.
40. Lurbet, M.F.; Rojano, B.; Whittaker Brown, S.A.; Busse, P.; Holguin, F.; Federman, A.D.; Wisnivesky, J.P. Obesity trends among asthma patients in the United States: A population-based study. Ann. Glob. Health 2019, 85, 10.
41. Wallace, J.; Jiang, K.; Goldsmith-Pinkham, P.; Song, Z. Changes in racial and ethnic disparities in access to care and health among US adults at age 65 years. JAMA Intern. Med. 2021, 181, 1207–1215.
42. Daher, M.; Al Rifai, M.; Kherallah, R.Y.; Rodriguez, F.; Mahtta, D.; Michos, E.D.; Khan, S.U.; Petersen, L.A.; Virani, S.S. Gender disparities in difficulty accessing healthcare and cost-related medication non-adherence: The CDC behavioral risk factor surveillance system (BRFSS) survey. Prev. Med. 2021, 153, 106779.
43. Flegal, K.M.; Troiano, R.P.; Pamuk, E.R.; Kuczmarski, R.J.; Campbell, S.M. The influence of smoking cessation on the prevalence of overweight in the United States. N. Engl. J. Med. 1995, 333, 1165–1170.
44. Chiolero, A.; Faeh, D.; Paccaud, F.; Cornuz, J. Consequences of smoking for body weight, body fat distribution, and insulin resistance. Am. J. Clin. Nutr. 2008, 87, 801–809.
45. Schwartz, A.; Bellissimo, N. Nicotine and energy balance: A review examining the effect of nicotine on hormonal appetite regulation and energy expenditure. Appetite 2021, 164, 105260.
46. Sharbaugh, M.S.; Althouse, A.D.; Thoma, F.W.; Lee, J.S.; Figueredo, V.M.; Mulukutla, S.R. Impact of cigarette taxes on smoking prevalence from 2001–2015: A report using the Behavioral and Risk Factor Surveillance Survey (BRFSS). PLoS ONE 2018, 13, e0204416.
47. Filozof, C.; Fernández Pinilla, M.C.; Fernández-Cruz, A. Smoking cessation and weight gain. Obes. Rev. 2004, 5, 95–103.
48. Munafò, M.R.; Tilling, K.; Ben-Shlomo, Y. Smoking status and body mass index: A longitudinal study. Nicotine Tob. Res. 2009, 11, 765–771.
49. Chen, X.; Hong, H.; Tarozzi, A. Semiparametric efficiency in GMM models with auxiliary data. Ann. Stat. 2008, 36, 808–843.
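Table 1. Performance of the proposed method in Example 1, with checkmarks indicating assumptions hold.

| $\mathcal{M}_{1t}$ | $\mathcal{M}_{2}$ | $\mathcal{M}_{3t}$ | Estimand | Bias (n = 2000) | RMSE (n = 2000) | SE (n = 2000) | Bias (n = 5000) | RMSE (n = 5000) | SE (n = 5000) |
|---|---|---|---|---|---|---|---|---|---|
| | | | $a_1$ | 0.0068 | 0.0785 | 0.0782 | 0.0007 | 0.0462 | 0.0462 |
| | | | $a_0$ | −0.0040 | 0.0914 | 0.0913 | 0.0002 | 0.0528 | 0.0528 |
| | | | RD | 0.0108 | 0.1000 | 0.0984 | 0.0005 | 0.0602 | 0.0602 |
| | | | RR | 0.0541 | 0.2391 | 0.2335 | 0.0119 | 0.1211 | 0.1206 |
| | | | $a_1$ | 0.0007 | 0.0771 | 0.0771 | 0.0003 | 0.0471 | 0.0471 |
| | | | $a_0$ | −0.0024 | 0.0892 | 0.0892 | −0.0019 | 0.0533 | 0.0533 |
| | | | RD | 0.0031 | 0.0988 | 0.0988 | 0.0022 | 0.0604 | 0.0604 |
| | | | RR | 0.0369 | 0.2447 | 0.2420 | 0.0151 | 0.1353 | 0.1345 |
| | | | $a_1$ | −0.0086 | 0.0809 | 0.0805 | −0.0101 | 0.0496 | 0.0486 |
| | | | $a_0$ | −0.0143 | 0.0929 | 0.0919 | −0.0102 | 0.0564 | 0.0555 |
| | | | RD | 0.0058 | 0.0985 | 0.0984 | 0.0002 | 0.0622 | 0.0623 |
| | | | RR | 0.0520 | 0.2531 | 0.2479 | 0.0183 | 0.1448 | 0.1437 |
| | | | $a_1$ | −0.0246 | 0.0909 | 0.0876 | −0.0252 | 0.0582 | 0.0525 |
| | | | $a_0$ | −0.0297 | 0.0996 | 0.0952 | −0.0285 | 0.0667 | 0.0603 |
| | | | RD | 0.0051 | 0.1033 | 0.1033 | 0.0034 | 0.0643 | 0.0643 |
| | | | RR | 0.0660 | 0.2757 | 0.2678 | 0.0394 | 0.1627 | 0.1579 |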
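Table 2. Performance of the proposed method in Example 2, with checkmarks indicating assumptions hold.

| $\mathcal{M}_{1t}$ | $\mathcal{M}_{2}$ | $\mathcal{M}_{3t}$ | Estimand | Bias (n = 2000) | RMSE (n = 2000) | SE (n = 2000) | Bias (n = 5000) | RMSE (n = 5000) | SE (n = 5000) |
|---|---|---|---|---|---|---|---|---|---|
| | | | NBR | −0.0171 | 0.1596 | 0.1588 | 0.0050 | 0.1296 | 0.1295 |
| | | | ABR | −0.0077 | 0.1609 | 0.1608 | 0.0075 | 0.1219 | 0.1217 |
| | | | TBR | −0.0094 | 0.1984 | 0.1983 | −0.0125 | 0.1162 | 0.1158 |
| | | | NBR | −0.0382 | 0.1860 | 0.1822 | 0.0456 | 0.1399 | 0.1301 |
| | | | ABR | −0.0810 | 0.2547 | 0.2416 | −0.0760 | 0.1577 | 0.1383 |
| | | | TBR | 0.0428 | 0.2658 | 0.2625 | 0.0304 | 0.1551 | 0.1521 |
| | | | NBR | 0.0252 | 0.1717 | 0.1699 | −0.0243 | 0.1393 | 0.1389 |
| | | | ABR | −0.0054 | 0.2165 | 0.2164 | 0.0044 | 0.1223 | 0.1223 |
| | | | TBR | 0.0306 | 0.2162 | 0.2141 | 0.0199 | 0.1221 | 0.1206 |
| | | | NBR | 0.0456 | 0.1728 | 0.1668 | 0.0466 | 0.1328 | 0.1328 |
| | | | ABR | −0.0674 | 0.2103 | 0.1993 | −0.0562 | 0.1321 | 0.1296 |
| | | | TBR | 0.0219 | 0.2183 | 0.2173 | 0.0096 | 0.1318 | 0.1315 |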
Table 3. Summary of the analysis results using the proposed method.

| Estimand | Point Estimate | SE | 95% Confidence Interval |
|---|---|---|---|
| $a_1$ | 0.7109 | 0.0082 | (0.6954, 0.7267) |
| $a_0$ | 0.8206 | 0.0069 | (0.8076, 0.8350) |
| RD | −0.1097 | 0.0083 | (−0.1261, −0.0935) |
| RR | 0.8663 | 0.0096 | (0.8470, 0.8856) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wei, S.; Zhang, C.; Geng, Z.; Luo, S. Identifiability and Estimation for Potential-Outcome Means with Misclassified Outcomes. Mathematics 2024, 12, 2801. https://doi.org/10.3390/math12182801
