1. Introduction
Non-compliance and missing data commonly arise in studies of economics, medicine, and public health. Non-compliance arises when some participants do not adhere to their assigned treatments, while missing data arise when researchers are unable to collect information for some participants [1]. Such missingness may complicate the inference of causal effects if the missing-data mechanism is non-ignorable [2,3]. Missingness is termed ignorable if it depends only on the observed data; otherwise, it is termed non-ignorable [4,5]. Identifying the complier average causal effect becomes challenging in the presence of both non-compliance and non-ignorable missing values, since neither the full data distribution nor the causal effect can be identified without additional assumptions.
To address non-compliance, one can assume that the response follows a specific distribution, such as an exponential family, and use maximum likelihood estimation. The authors of [6,7,8] presented estimation of the complier average causal effect (CACE) by maximum likelihood via the EM algorithm [9]. The maximum likelihood method has the advantage of relaxing the exclusion restrictions, which can be unrealistic, particularly in natural experiments [10]. Another approach to addressing non-compliance is the instrumental variable method [11]. Instrumental variables are a viable alternative when stringent assumptions about the response distribution could result in mis-specification.
The shadow variable strategy is a mainstream solution for establishing identifiability under non-ignorable missingness [12]. Analogous approaches employ instrumental variables, as suggested by [13]. Nonetheless, selecting appropriate instrumental or shadow variables poses challenges, particularly among a multitude of covariates [14]. The authors of [15] illustrate that stronger assumptions on the response mechanism enable identifiability to be established from the distribution of the observed data. When the response follows an exponential family, as noted by [16,17], identifiability under non-ignorable missingness is readily attainable without instrumental variables.
It is critically important to integrate causal inference with research on non-ignorable missing data [18,19]. The authors of [20,21,22] introduced estimators for the complier average causal effect parameters under the assumption that the missing data are ignorable; if the missingness is instead non-ignorable, these methods may yield biased estimators. The recent literature has introduced strategies for addressing missing covariates or outcomes under non-ignorable missingness, including, but not limited to, the following papers. The study in [1] first explored the identifiability of parameters in randomized clinical trials with non-compliance and non-ignorably missing binary outcomes. The study in [2] examined semi-parametric identifiability and developed an estimation method for the complier average causal effect in randomized clinical trials with non-ignorably missing continuous outcomes. The study in [23] formulated a shadow-variable method to identify and estimate the complier average causal effect parameters when the outcomes are non-ignorably missing. The study in [19] tackled non-ignorably missing confounders in causal analysis of observational data, demonstrating that causal effects are identifiable when the missingness mechanism is outcome-independent conditional on the treatment and the potentially missing confounders. The study in [24] developed semi-parametric estimators of the average causal effect that account for non-ignorably missing confounders, under the premise that the missingness is independent of the outcome. The study in [25] addressed the identification of the treatment benefit rate and treatment harm rate when the treatment, endpoints, or covariates are missing. The study in [26] investigated a treatment-independent missingness assumption that facilitates the identification of causal effects when confounders are non-ignorably missing.
In the existing literature, limited attention has been given to settings in which a general missing-mechanism model is adopted to address non-ignorable missingness. The study in [2] utilized an outcome-dependent non-ignorable missingness model without incorporating an auxiliary variable. The study in [3] proposed an auxiliary variable that serves as both a shadow variable for the non-ignorable missingness and an instrumental variable for the causal effect. However, these models assume the absence of covariates, despite their common presence in applied research. In this paper, we examine the complier average causal effect parameters under non-ignorable missingness, including covariates to mitigate sensitivity to the violation of specific identification assumptions. The missing-data mechanism is assumed to follow a logistic model in which the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates. We establish the identifiability of the models under mild conditions by assuming that the outcome follows a normal distribution. We develop a computational method that estimates the model parameters through a two-step likelihood approach employing subgroup analysis. The bootstrap method is employed for variance estimation, and the effectiveness of our approach is confirmed through simulation. The proposed method is applied to the household income dataset from the Chinese Household Income Project Survey 2013.
The rest of this article is organized as follows. Section 2 introduces the general modeling framework, with notation and assumptions. In Section 3, we give the theoretical results on the identifiability of the parameters and the estimation approach. The performance of the proposed method is evaluated through simulation studies in Section 4. The application to the CHIP data is presented in Section 5. Concluding remarks are provided in Section 6.
2. Notation and Assumptions
Let $Y_i$ denote the individual outcome and let $X_i$ represent the individual covariate vector, for $i = 1, \ldots, n$. $Z_i$ denotes the randomized treatment assignment for the $i$th unit in the study. We assign $Z_i = 1$ if the $i$th individual is allocated to the treatment group and $Z_i = 0$ if it is assigned to the control group. Moreover, $D_i$ represents the treatment received by the $i$th individual, with $D_i = 1$ indicating treatment and $D_i = 0$ indicating no treatment. $D_i(z)$ and $Y_i(z)$ denote the potential treatment received and the potential outcome under the assigned treatment $z$, respectively. We define $R_i$ as the binary response indicator for $Y_i$, where $R_i = 1$ if $Y_i$ is observed and $R_i = 0$ if $Y_i$ is missing. Additionally, we define $Y_i(z, d)$ as the potential outcome under the assigned treatment $z$ and actual treatment $d$.
Following [6], we define $U_i$ as the compliance status of the $i$th patient, which is determined as follows:
$$U_i = \begin{cases} c, & \text{if } D_i(1) = 1 \text{ and } D_i(0) = 0, \\ n, & \text{if } D_i(1) = 0 \text{ and } D_i(0) = 0, \\ a, & \text{if } D_i(1) = 1 \text{ and } D_i(0) = 1, \\ d, & \text{if } D_i(1) = 0 \text{ and } D_i(0) = 1, \end{cases}$$
where the potential intermediate outcomes $c$, $n$, $a$, and $d$ stand for complier, never-taker, always-taker, and defier, respectively. The complier average causal effect given the predictors $X_i = x$ equals
$$\mathrm{CACE}(x) = E\{Y_i(1) - Y_i(0) \mid U_i = c, X_i = x\}.$$
Next, we provide the assumptions necessary to ensure the identifiability of the complier average causal effect under non-ignorable missingness. These are formalized as follows [1,11]:
Assumption 1 (Stable unit treatment value assumption, SUTVA). If $Z_i = z$, then $D_i = D_i(z)$. And, if $Z_i = z$ and $D_i = d$, then $Y_i = Y_i(z, d)$.
Assumption 2 (Randomization). The treatment assignment Z is randomized.
Assumption 3 (Exclusion restrictions). $Y_i(1, d) = Y_i(0, d)$ for $d \in \{0, 1\}$.
Assumption 4 (First-stage). Given $X_i = x$, there is a non-zero average causal effect of Z on D, i.e., $E\{D_i(1) - D_i(0) \mid X_i = x\} \neq 0$.
Assumption 5 (Monotonicity). $D_i(1) \geq D_i(0)$ for all $i$. Defining $\pi_u = P(U_i = u)$, where $u \in \{c, n, a, d\}$, we then have $\pi_d = 0$.
Assumption 6 (Compound exclusion restrictions). $\{Y_i(1, d), R_i(1, d)\} = \{Y_i(0, d), R_i(0, d)\}$ for $d \in \{0, 1\}$.
Assumption 7 (Non-ignorable missingness). $E(R_i \mid Y_i, D_i, X_i, Z_i) = E(R_i \mid Y_i, D_i, X_i)$, and this response probability may depend on the outcome $Y_i$ itself.
Assumption 8 (Subgroup). $U$ is independent of the variable $X$.
Assumption 9 (Normal). The conditional density of the outcome variable $Y$ has the following normal form:
$$f(y \mid U = u, X = x) = \frac{1}{\sqrt{2\pi}\,\sigma_u}\exp\left\{-\frac{(y - \mu_u - \beta_u^{\top} x)^2}{2\sigma_u^2}\right\},$$
where $u \in \{c, n, a\}$ is the compliance status.
Assumption 1 posits that $D_i = D_i(Z_i)$ and $Y_i = Y_i(Z_i, D_i)$, indicating that the observed outcome equals the potential outcome evaluated at the observed treatment value, and the observed treatment equals the potential treatment evaluated at the assigned treatment [27]. This assumption is typically reasonable when dealing with randomly sampled units. Assumption 2 [1] means that Z is as good as randomly assigned. Assumption 3 asserts that, given $D = d$, $Y(1, d) = Y(0, d)$. This concept encapsulates the fundamental principle of instrumental variable procedures, indicating that any influence of Z on Y must be channeled through an impact of Z on D [28]. Under Assumption 3, with D fixed, the conditional density of Y remains independent of Z for never-takers and always-takers. Assumption 4 stipulates that, for every stratum defined by $X = x$, the average causal effect of Z on D is strictly positive. Assumption 5 asserts the absence of defiers within the population. Assumptions 1–5 are standard assumptions in causal effect models.
Assumption 6 [2] is credible in a double-blinded clinical trial, as patients are unaware of the treatment assigned to them, and the treatment assignment is performed through randomization. Assumption 7 suggests that the absence of the outcome variable is accounted for by the outcome itself, the treatment received, and the covariates: the expectation of $R$ given $Y$, $D$, $X$, and $Z$ is equal to the expectation of $R$ given $Y$, $D$, and $X$. The missing-data mechanism is assumed to follow a logistic model, expressed as
$$P(R = 1 \mid Y = y, D = d, X = x) = \frac{\exp(\alpha_0 + \alpha_1 y + \alpha_2 d + \alpha_3^{\top} x)}{1 + \exp(\alpha_0 + \alpha_1 y + \alpha_2 d + \alpha_3^{\top} x)}.$$
Assumption 8 incorporates the independence between X and U into the computational method for estimating the model parameters. This inclusion enhances the robustness and validity of the estimation procedure, particularly through the two-step maximum likelihood estimation approach using subgroup analysis. The purpose of Assumption 9 is to establish the identifiability of the models under mild conditions, facilitating the creation of a computational method to estimate the model parameters.
3. Identifiability and Estimation
In this section, we initially construct a model for non-compliant data and missing data, outlining the steps to solve the model. Subsequently, we theoretically establish the identifiability of the model involved in each step. Given that the second step entails a system of non-linear equations and integral operations, which can be challenging to handle analytically, we present a numerical calculation method to address this complexity.
We introduce a convenient two-stage estimation process based on the random sample. Given the covariates, the joint density function of the observed variables can be expressed as follows:

Assumptions 2 and 8 ensure the validity of the equation above. Assumption 3 implies that the conditional density of the outcome does not depend on Z for never-takers, and likewise for always-takers. As stated in [2], the full likelihood can be expressed as

where the component likelihoods are defined below. It is evident from (2) that the maximum likelihood estimator of the compliance proportions is equivalent to the moment estimator, thus ensuring its identifiability. Next, we propose a two-stage estimation process and establish the identifiability of the models involved in these two steps.
Step 1. Based on Assumptions 3 and 7, the conditional density of the outcome Y for never-takers and always-takers, as well as the missing-data models, are independent of Z. The likelihood function computed on the never-takers is (4) and the likelihood function computed on the always-takers is (5). We can maximize (4) and (5) to obtain the maximum likelihood estimators of the never-taker and always-taker parameters.

Step 2. We substitute the Step 1 estimators into (3) and (6) and exploit the numerical characteristics of the mixed normal distribution together with the inverse probability weighting method (more details can be found in [29]) to obtain the estimators of the complier parameters.
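The "numerical characteristics of the mixed normal distribution" used in Step 2 amount to moment identities for a two-component mixture: if an observed subgroup mean mixes compliers (proportion p) with never-takers whose mean and variance were already estimated in Step 1, the complier mean and variance follow by rearranging the mixture moments. A minimal sketch with hypothetical argument names:

```python
def complier_mean(m, mu_n, p):
    """Recover the complier mean from a two-component mixture mean.

    m    : observed subgroup mean, m = p * mu_c + (1 - p) * mu_n
    mu_n : never-taker mean (estimated in Step 1)
    p    : complier proportion (from the moment estimator)
    """
    if not 0 < p <= 1:
        raise ValueError("complier proportion must lie in (0, 1]")
    return (m - (1.0 - p) * mu_n) / p

def complier_var(m, s2, mu_n, sig2_n, p):
    """Recover the complier variance from the mixture mean m and variance s2,
    using the second-moment identity
    s2 + m**2 = p*(sig2_c + mu_c**2) + (1-p)*(sig2_n + mu_n**2)."""
    mu_c = complier_mean(m, mu_n, p)
    return (s2 + m * m - (1.0 - p) * (sig2_n + mu_n * mu_n)) / p - mu_c * mu_c
```

For instance, an equal mixture (p = 0.5) of compliers with mean 2, variance 1 and never-takers with mean 0, variance 1 has overall mean 1 and variance 2, and the two functions recover the complier moments exactly.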
Non-ignorable missingness in Y presents challenges to the identifiability of the observed likelihood functions (4) and (5), as emphasized by [12]. The next theorem shows that the parameters are all identifiable under mild conditions.
Theorem 1. Suppose there exists one continuous covariate; if the sign of some element of is known, then the vector of parameters is identifiable, where and .
Proof. (i) First, we establish the identifiability of the never-taker parameters. Suppose there exist two sets of parameters such that (7) holds for all values of the arguments. The authors of [12] noted that identifiability holds if (7) implies the equality of the two parameter sets. Without loss of generality, suppose the covariate is a continuous variable that can take any real value, and let the other covariates be omitted, since we can consider them as fixed while it varies; the proof for the remaining case can be obtained by mimicking the following process. After a change of variable, (7) can be transformed into the form (8). Applying a suitable operation on both sides of (8) yields (9), and applying a further operation on both sides of (9) yields (10). If the degenerate parameter configurations held, (10) would reduce to an equality contradicting the working assumption, so we consider the non-degenerate case. Examining the roots of the derivatives of the functions involved, each possible configuration of the intersection points of the relevant line and curve leads, by evaluating (10) or its derivative at the intersection points, to a contradiction. Thus, the corresponding parameters of the two sets must coincide, and (10) reduces to (11). Evaluating (11), or its derivative, at the intersection points again rules out every configuration except one. Recall the condition that the sign of some element of the parameter vector is assumed to be known; therefore, the case in which the two candidate parameters carry opposite signs is impossible. Consequently, the slope parameters coincide, (11) reduces further, and the remaining equality follows because the function involved has only one maximum point. The condition in (8) then reduces to (12), from which we can readily obtain the equality of the remaining parameters. Thus, the never-taker parameter vector is identifiable. In an analogous manner, the always-taker parameter vector is identifiable.

(ii) Similar to the proof above, we now only need to demonstrate the following: if (13) holds, the corresponding parameters coincide, so that they are identifiable. Substituting the equalities already established into (13), (13) reduces to a form from which we can readily obtain the remaining equalities; thus, the complier parameters are identifiable. The argument for the other group is analogous.

Therefore, the conclusion of Theorem 1 follows. □
The sign of some element of the missing-mechanism parameter vector is easy to verify. According to [30], factors such as respondents’ cognitive level, motivation, and social status influence the non-response probability. Leveraging this insight, we can speculate on the trend of the non-response probability and infer the signs of the parameters in the missing-mechanism model. For example, in a household income survey, low-income individuals may be less inclined to disclose their true income, which pins down the sign of the coefficient on income.
Given the similarity between the methods for estimating the always-taker parameters and those for estimating the never-taker parameters, we delineate the estimation methods separately in two steps. The likelihood function computed on the always-takers is (4). Thus, the maximization problem in (4) with respect to the parameters is equivalent to maximizing (16).
For convenience, we adopt a shorthand notation for the parameter vector. The score equations can be derived by taking the derivatives of (16) with respect to the parameters, yielding (17). However, computing the integral involved in (17) is challenging. To circumvent this difficulty, we propose utilizing Monte Carlo approximation. Given the current parameter values, we can straightforwardly sample Y directly from the conditional distribution with the corresponding density function; let the draws denote a sample of size m. As stated in [17], the additional variability introduced by resampling becomes ignorable when the resample size m is sufficiently large, so we use ≈ here. We record the parameter value at the tth iteration and then obtain the updated estimate as follows:
The above formula is iterated until the following convergence criterion is satisfied:
where tol is a given constant that can be arbitrarily small. Ultimately, we obtain the estimators of the parameters for the always-takers; in the same manner, we obtain those for the never-takers. In practice, the above computation procedure is straightforward to implement, as it relies on the use of the empirical distribution.
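The integral in the score equations is an expectation under a conditional normal density, so the Monte Carlo approximation amounts to averaging the integrand over a large sample of normal draws. A generic sketch, where the integrand `g` stands in for the actual score terms:

```python
import random

def mc_expectation(g, mu, sigma, m=100_000, seed=0):
    """Approximate E[g(Y)] for Y ~ N(mu, sigma^2) by a Monte Carlo average.

    The added resampling variability becomes ignorable when m is large
    relative to the sample size.
    """
    rng = random.Random(seed)
    return sum(g(rng.gauss(mu, sigma)) for _ in range(m)) / m

# e.g. E[Y^2] for Y ~ N(1, 1) is mu^2 + sigma^2 = 2
approx = mc_expectation(lambda y: y * y, mu=1.0, sigma=1.0)
```

In the estimation procedure, this average replaces the exact integral inside each score-equation update of the iterative scheme.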
According to Assumption 7, the absence of Y is unrelated to Z. We can therefore focus on the individuals with observed outcomes and, using the inverse probability weighting method, obtain the following estimating equation:

From this estimating equation we readily obtain the first set of estimators and, considering the numerical characteristics of the mixed normal distribution, we have

By solving the two equations above, we can derive the estimators of the complier parameters; those for the other arm follow in the same manner. We thus arrive at the complier average causal effect given the predictors as

Considering the variance of the mixed normal distribution, we also obtain an estimator of the corresponding variance. In practice, we can utilize the bootstrap method to approximate the sampling variance of the estimator of the CACE.
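The bootstrap variance approximation mentioned above can be sketched generically: resample whole units with replacement, recompute the estimator on each resample, and take the empirical variance of the replicates. The `estimator` argument below is an arbitrary stand-in for the CACE estimator:

```python
import random

def bootstrap_variance(data, estimator, b=500, seed=0):
    """Non-parametric bootstrap variance of estimator(data).

    data      : list of observation records (resampled as whole units)
    estimator : function mapping a dataset to a scalar estimate
    b         : number of bootstrap replications
    """
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(b):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    mean = sum(reps) / b
    return sum((r - mean) ** 2 for r in reps) / (b - 1)

# illustrative check: bootstrap variance of a sample mean
v = bootstrap_variance(list(range(100)), lambda d: sum(d) / len(d))
```

For the sample mean of 0..99 the bootstrap variance should be close to the plug-in value var(data)/n ≈ 8.3.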
4. Simulations
In this section, we perform simulations to assess the effectiveness of the outlined estimation procedure. We divide all samples into six groups, with the number of units in each group denoted accordingly. The covariates in all groups follow a two-dimensional normal distribution, with the following mean vector and covariance matrix, respectively:
Conditional on the group, the response Y is generated from a normal model whose parameters differ across the compliance strata: the never-taker and always-taker groups use their own regression models, and the complier groups are simulated via the corresponding conditional distributions in each arm. The missing-mechanism model is a logistic model whose parameters differ between the two arms, and the missing rates for the outcome in the two arms differ accordingly. The Monte Carlo samples have a fixed size.
Table 1 and Table 2 report the empirical bias, the standard deviation obtained by non-parametric bootstrap, and the coverage of Wald-type confidence intervals. The numerical results of Table 1 demonstrate highly accurate estimation of the compliance proportions, the missing-mechanism model parameters, the always-taker parameters, and the never-taker parameters, which means that the proposed method can still accurately estimate the parameters of the causal model under non-ignorable missingness. The numerical results of Table 2 demonstrate highly accurate estimation of the complier parameters and the CACE components. However, there is some under-coverage in the Wald confidence intervals for certain parameters.
The estimators of the compliance proportions exhibit significantly smaller bias and variance than the other parameters, attributable to the larger sample size used in their estimation. Conversely, the estimators of the always-taker and never-taker parameters show similar bias and variance, due to their comparable missing rates. The subpar estimation results for the complier parameters arise from various factors, including the numerical characteristics of the normal distribution, which we aim to address in future work.
5. Real Data Analysis
The CHIP study monitors income distribution and economic factors among rural, rural-to-urban migrant, and urban households in China [31]. The purpose of this study is to investigate whether there is a significant change in the rate of return to education when men transition from rural to migrant status. The survey captures details such as employment status, education level, age, income, and other relevant information. Data collection involves systematic random sampling from various regions to ensure geographic representation. The survey includes data from cities and towns in fifteen provinces, representing different regions across the country: Liaoning, Shanxi, Jiangsu, Shandong, Guangdong, Anhui, Henan, Sichuan, Hunan, Hubei, Gansu, Xinjiang, Yunnan, Beijing, and Chongqing, covering the north, the eastern coastal areas, the interior, and the west of China.
Each respondent’s actual annual income serves as a proxy for the individual’s earnings E, used as the response in the Mincer earnings function [32]. We compute the number of years since leaving school on the basis that Chinese children typically start school at 7 years old, and the years since the onset of labor market experience on the basis that Chinese individuals who have reached the age of 16 can legally participate in the labor market.
The rural population’s migration for work primarily depends on two factors: the economic development of the home region and the distance between home and work. If the region’s economic development is favorable, migration may be less likely; conversely, shorter distances between home and workplace increase the likelihood of migration. Hence, leveraging Chinese administrative division data, we assign $Z_i = 1$ if the individual’s household head’s registered residence is Beijing, Shanxi, Guangdong, Hubei, Chongqing, or Liaoning, and $Z_i = 0$ otherwise. The sample sizes of the observed data are shown in Table 3. In the 2013 wave of the study, the rural and migrant sub-samples accounted for 62.97% and 4.54% of the dataset, respectively, which also includes urban individuals. However, in China in 2013, the percentages of rural and migrant households were 45.77% and 13.30%, respectively [33]. To align with this distribution, we adjust the sample weights; specifically, each migrant respondent is weighted four times as much as a rural one, since $(13.30/4.54)/(45.77/62.97) \approx 4$.
The Mincer earnings function is a single-equation model that explains wage income as a function of education and work experience, which can be expressed as follows:
$$\ln E = \beta_0 + \beta_1 S + \beta_2 \mathrm{Exp} + \beta_3 \mathrm{Exp}^2 + \varepsilon.$$
In the specification, E represents earnings (in CNY), S indicates years of schooling, and Exp stands for potential work experience. The error $\varepsilon$ is an unobserved normal random variable with mean 0 and variance $\sigma^2$. Without considering the cost of schooling, the returns to education are given by $\beta_1$.
The missing-data mechanism is modeled using the following model:

Using Assumption 9 described in Section 2 in this setting, model (19) implies a normal distribution of the outcome:

where $x$ is the realization of the covariate vector $X$. Thus, the complier average causal effect, which captures the causal effect of migrant work on the earnings of compliers, is estimated from these quantities. Specifically, the corresponding coefficient gives the estimated causal effect of migrant work on the returns to education of a complier.
The analysis of the 2013 CHIP data was conducted in R version 4.3.2 (R was created by the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and first released in August 1993). Utilizing the estimation methods outlined in Section 3 of this paper yielded the results presented in Table 4 and Table 5. The results from Table 4 show that log-income has a significant positive effect on the probability of missingness, as evidenced by the estimated coefficients.
Table 5 illustrates that the estimated returns to education differ between migrant compliers and rural compliers, and the difference indicates that migrant work enhances the returns to education for compliers. Additionally, comparing the estimated intercepts, we infer that the initial wages of migrant compliers are notably higher than those of rural compliers. This suggests that migrant work also contributes to higher initial wages for individuals. In conclusion, the estimated returns to education vary significantly among the different target groups. However, it is noteworthy that the estimated returns to education of migrant workers and rural residents are consistently lower than those of urban residents across the board, with an average difference of 8.4%, as documented by [34]. This finding scientifically validates the social and economic significance of migrant work from a human capital perspective, offering a basis for decision-making aimed at improving the condition of Chinese rural labor.
6. Conclusions
In this study, the challenge of identifying and estimating the complier average causal effect parameters under non-ignorable missingness is tackled by incorporating covariates to mitigate the sensitivity to the violation of specific identification assumptions. The missing-data mechanism is assumed to follow a logistic model, wherein the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates, giving it an advantage over the assumptions proposed in [2,3]. The identifiability of the models is established under mild conditions by assuming that the outcome follows a normal distribution. A computational method is developed to estimate the model parameters through a two-step likelihood estimation approach utilizing subgroup analysis.
Some studies in the literature discuss the consistency of parameter estimation in the presence of non-ignorable missingness and non-compliance. The authors of [23] demonstrated that the parameter estimators are consistent under the assumption of a correctly specified missing-mechanism model. The authors of [26] obtained the consistency of the parameters of interest even when confounders are missing not at random. The authors of [3] established asymptotic results for the estimators when the missing outcome depends only on itself. Establishing such results based on subgroup analysis poses significant challenges when the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates; we plan to address this aspect in future research to enhance our methodology.
There are many directions worthy of further research. A possible extension in this research area involves utilizing instrumental variables to transform the identifiability of the observation likelihood into the identifiability of the parameters of interest. Indeed, we propose a relaxation of the exclusion restriction based on likelihood analysis, resulting in a parametric model characterized by mixtures of distributions. Furthermore, we can adopt a semi-parametric model for theoretical modeling, incorporating more sophisticated structures into the missing mechanism models and regression models.