1. Introduction
Regularization is an important approach for estimation in regression models with a relatively large number of parameters because it provides a stable numerical procedure and better prediction while avoiding overfitting. This approach has attracted much attention in the recent literature, mainly due to its applications in variable selection problems in high-dimensional models where conventional statistical methods are theoretically and computationally infeasible. To address the variable selection problem in sparse regression, various regularization methods have been proposed, e.g., bridge regression (Frank and Friedman 1993), the Lasso (Tibshirani 1996), SCAD (Fan and Li 2001), the adaptive Lasso (Zou 2006), MCP (Zhang et al. 2010), the elastic net (Zou and Hastie 2005) and the Dantzig selector (Candes et al. 2007). More detailed reviews of regularization methods can be found in Fan and Lv (2010) and Negahban et al. (2012).
In real data analysis, it is common that some predictors cannot be observed directly or measured precisely. For example, the long-term average systolic blood pressure and cholesterol level are important factors of cardiovascular disease, and both are usually measured with error. In a lung cancer risk study, the inhaled dose of air pollutants cannot be measured precisely and is approximated by the average level of pollutants within a certain area. In regression models, it is well known that if some predictors are measured with error, ordinary estimation procedures ignoring the ME are biased and inconsistent. However, the impact of ME on regularized estimation procedures is less clear. For example, in a linear model, the naive least squares estimate for the regression coefficient of a mismeasured predictor is attenuated towards zero while, on the other hand, the variance of the corresponding observed surrogate predictor is inflated. Therefore, the combined effect of these two factors may cause false-positive or false-negative results in the selection outcome, as illustrated in Example 1.
Research on regularized estimation in ME models is sparse. Some authors considered penalized versions of the usual error correction methods, assuming the ME covariance matrix is known or can be estimated using replicate data. For example, Liang and Li (2009) applied the penalized least squares method with attenuation correction and quantile estimation with orthogonal regression adjustment in a partial linear model. Ma and Li (2010) studied general parametric and semiparametric models using the method of penalized estimating equations. Further, Huang and Zhang (2013) used penalized score functions, while Zhang et al. (2017) used a prediction criterion for variable selection in linear ME models.
Another major approach to estimation in ME models is the instrumental variable (IV) method. This method has been used to treat the endogeneity problem in high-dimensional regression models by Fan and Liao (2014), who proposed the focused generalized method of moments estimator. Lin et al. (2015) studied a two-stage regularization method for selecting relevant instruments and predictors in linear models under the assumption that the random errors are jointly normally distributed. Zhong et al. (2020) proposed a two-stage estimation procedure with instrumental variables for a dummy endogenous variable.
All these works mainly focus on a general endogeneity problem in linear models or with binary response variables. So far, there are few, if any, published studies focusing specifically on the IV approach to measurement error problems. In this paper, we aim to fill this gap. Specifically, we extend the IV method to study the variable selection and estimation problem in generalized linear models with ME. This method does not require the distribution or covariance matrix of the ME to be known. It is an extension of the method of conditional moments of Wang and Hsiao (2011). The proposed selection procedure and estimator are consistent and enjoy the oracle property under general conditions.
The rest of the paper is organized as follows. In Section 2, we introduce the regularized instrumental variable method and study its asymptotic properties. Section 3 treats the special case of the linear model. Numerical examples are given in Section 4, followed by a real data example in Section 5. Technical details are relegated to Appendix A.
2. The Model and Estimation Method
Suppose the response variable $Y$ has the conditional mean function
$$E(Y \mid X, Z) = g(X, Z; \theta), \qquad (1)$$
where $X$ is a vector of error-prone predictors in low dimension, $Z$ is a vector of error-free predictors, $g$ is a link function and $\theta$ is the vector of unknown regression parameters. Equation (1) includes the generalized linear models as well as the so-called single index models as special cases. We assume that the observed surrogate predictors are
$$W = X + \delta, \qquad (2)$$
where $\delta$ is a random ME. Further, we assume that there are instrumental variables (IV) $G$ observed besides the main sample $\{(Y_i, W_i, Z_i),\ i = 1, \dots, n\}$. The usual requirement for an IV is that it is correlated with the unobserved predictor $X$ but independent of the ME $\delta$, and is conditionally independent of $Y$ given $(X, Z)$. Following the literature (Wang 2021; Wang and Hsiao 2011), we assume that the IV $G$ is related with $X$ through
$$X = \Gamma G + U, \qquad (3)$$
where $\Gamma$ is the $p \times q$ matrix of unknown parameters which is assumed to have full rank $p$, and $U$ is independent of $G$, has mean zero and density $f_U(\cdot; \tau)$ with unknown parameters $\tau$. It is further assumed that the ME $\delta$ in (2) satisfies $E(\delta \mid X, Z, G) = 0$. Throughout this paper, we assume that $Z$ is exogenous and all expectations are taken conditionally on it; however, $Z$ is suppressed to simplify notation. We also adopt the common assumption in the ME literature that the ME $\delta$ is nondifferential, which implies that $E(Y \mid X, Z, W) = E(Y \mid X, Z)$.
Now, we consider the estimation of the unknown parameters in (1)–(3) given an i.i.d. random sample $\{(Y_i, W_i, Z_i, G_i),\ i = 1, \dots, n\}$. First, substituting (3) into (2) results in a usual linear regression equation
$$W = \Gamma G + U + \delta$$
and, therefore, $\Gamma$ can be consistently estimated by the least squares estimator
$$\hat\Gamma = \left(\sum_{i=1}^n W_i G_i^\top\right)\left(\sum_{i=1}^n G_i G_i^\top\right)^{-1}.$$
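To illustrate, here is a minimal numerical sketch of this first-stage regression; the dimensions, parameter values and variable names are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 2, 3                          # sample size, dim(X), dim(G)

Gamma = rng.normal(size=(p, q))              # coefficient matrix, full rank p a.s.
G = rng.normal(size=(n, q))                  # instrumental variables
X = G @ Gamma.T + rng.normal(size=(n, p))    # equation (3): X = Gamma G + U
W = X + rng.normal(scale=0.5, size=(n, p))   # equation (2): surrogate W = X + delta

# Least squares estimator of Gamma from the linear regression of W on G.
Gamma_hat = np.linalg.lstsq(G, W, rcond=None)[0].T
print(np.round(Gamma_hat - Gamma, 2))        # estimation error shrinks as n grows
```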
In the following, we focus on the estimation of the other parameters of main interest $(\theta, \tau)$ in model (1)–(3). Specifically, we propose an estimator based on the first two conditional moments $E(Y \mid Z, G)$ and $E(Y^2 \mid Z, G)$. To simplify notation, we denote $m_1(Z, G; \theta, \tau) = E(Y \mid Z, G)$ and $m_2(Z, G; \theta, \tau) = E(Y^2 \mid Z, G)$. Then, the two conditional moments can be written together as
$$m(Z, G; \theta, \tau) = E\{h(X, Z; \theta) \mid Z, G\} = \int h(\Gamma G + u, Z; \theta) f_U(u; \tau)\, du, \qquad (4)$$
where $h(X, Z; \theta)$ denotes the stacked vector of the first two conditional moments of $Y$ given $(X, Z)$ implied by model (1). Then, the loss function for estimating $(\theta, \tau)$ is defined as
$$Q_n(\theta, \tau) = \sum_{i=1}^n \{V_i - m(Z_i, G_i; \theta, \tau)\}^\top A_i \{V_i - m(Z_i, G_i; \theta, \tau)\}, \qquad (5)$$
where $V_i = (Y_i, Y_i^2)^\top$ and $A_i = A(Z_i, G_i)$ is a semipositive definite matrix which may depend on $(Z_i, G_i)$.
One of the main features of the high-dimensional variable selection framework is the sparsity of the model, where many regression parameters in $\theta$ have a true value of zero. In the following, we denote the true parameter value of $\theta$ as $\theta_0$, the index set of non-zero coefficients as $J = \{j : \theta_{0j} \neq 0\}$ and its complement set as $J^c$. We further denote $\theta_J = (\theta_j, j \in J)^\top$, $\theta_{J^c} = (\theta_j, j \in J^c)^\top$ and, without loss of generality, $\theta = (\theta_J^\top, \theta_{J^c}^\top)^\top$. Similarly, for a matrix $M$, let $M_J$ be the matrix consisting of the rows of $M$ corresponding to the index set $J$, and, for a vector $v$, let $v_J$ be the vector consisting of the components of $v$ corresponding to $J$. Finally, the proposed regularized IV estimator is defined as the minimizer of the objective function
$$L_n(\theta, \tau) = Q_n(\theta, \tau) + n \sum_{j} p_\lambda(|\theta_j|), \qquad (6)$$
where $p_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. Let $(\theta_0, \tau_0)$ be the true values of the model parameters.
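For concreteness, the SCAD penalty of Fan and Li (2001), which we use in the numerical examples below, has a simple closed form; a minimal sketch (the default $a = 3.7$ follows their recommendation):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|) of Fan and Li (2001), evaluated elementwise."""
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    out = np.where(small, lam * t, 0.0)                            # linear near zero
    out = np.where(mid, (2 * a * lam * t - t**2 - lam**2)
                          / (2 * (a - 1)), out)                    # quadratic middle
    out = np.where(t > a * lam, lam**2 * (a + 1) / 2, out)         # constant tail
    return out
```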
Theorem 1. Under Assumptions A1–A5 in Appendix A, suppose the penalty function satisfies
$$a_n = \max_{j \in J} |p'_\lambda(|\theta_{0j}|)| = O(n^{-1/2}) \quad \text{and} \quad b_n = \max_{j \in J} |p''_\lambda(|\theta_{0j}|)| \to 0.$$
Then there exists a local minimizer $(\hat\theta, \hat\tau)$ of the objective function (6) such that $\|\hat\theta - \theta_0\| = O_p(n^{-1/2})$. Further, let $\hat\theta_J$ and $\hat\theta_{J^c}$ denote the subvectors of $\hat\theta$ corresponding to $J$ and $J^c$, respectively. We have the following results.
Theorem 2. If $\lambda \to 0$, $\sqrt{n}\lambda \to \infty$ and $\liminf_{n\to\infty} \liminf_{t\to 0^+} p'_\lambda(t)/\lambda > 0$, then with probability approaching 1, the root-$n$ consistent estimator $(\hat\theta, \hat\tau)$ in Theorem 1 satisfies
(a) $\hat\theta_{J^c} = 0$;
(b) $\sqrt{n}(\hat\theta_J - \theta_{0J})$ has an asymptotic normal distribution $N(0, \Sigma_J)$, where the covariance matrix $\Sigma_J$ has a sandwich form determined by the derivatives of the moment function (4), the weights $A_i$ and the penalty, and is given explicitly in Appendix A.

From the proof of the above theorem in the Appendix, it can be seen that the covariance matrix $\Sigma_J$ can be estimated by its sample analogue $\hat\Sigma_J$, obtained by replacing the population moments in $\Sigma_J$ with their empirical counterparts evaluated at the estimates $(\hat\theta, \hat\tau)$.
Though the estimator is consistent regardless of the choice of $A_i$, there theoretically exists an optimal weight matrix yielding the most efficient estimator. Following Wang and Hsiao (2011), the optimal weight matrix is given by
$$A_i^{\mathrm{opt}} = \{\mathrm{Var}(V_i \mid Z_i, G_i)\}^{-1}.$$
Since the optimal weight matrix involves unknown parameters, $A_i^{\mathrm{opt}}$ can be calculated via a two-stage estimation procedure. First, the objective function is minimized using the identity matrix as the weight matrix. In the second stage, the estimators are obtained with the optimal weight matrix, which is calculated using the estimates from the first stage.
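A schematic of this two-stage weighting is sketched below, under our own simplifying assumptions: `m(theta, Z, G)` is a hypothetical user-supplied function returning the $n \times 2$ matrix of stacked moments from (4), and the conditional variance in the optimal weight is approximated by a pooled residual covariance.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_loss(theta, V, Z, G, m, weights):
    """Weighted moment loss (5): sum_i r_i' A_i r_i with r_i = V_i - m_i."""
    resid = V - m(theta, Z, G)                  # n x 2 residual matrix
    return np.einsum('ni,nij,nj->', resid, weights, resid)

def two_stage_estimate(theta_init, V, Z, G, m):
    n, k = V.shape
    # Stage 1: identity weight matrix for every observation.
    A1 = np.broadcast_to(np.eye(k), (n, k, k))
    stage1 = minimize(gmm_loss, theta_init, args=(V, Z, G, m, A1))
    # Stage 2: plug stage-1 estimates into the weight A_i = Var(V_i|Z_i,G_i)^{-1},
    # here approximated by the pooled residual covariance (a sketch assumption).
    resid = V - m(stage1.x, Z, G)
    A_opt = np.linalg.inv(np.cov(resid.T))
    A2 = np.broadcast_to(A_opt, (n, k, k))
    return minimize(gmm_loss, theta_init, args=(V, Z, G, m, A2))
```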
As noted in Abarin and Wang (2012), for some models like the gamma log-linear and Poisson log-linear models, the analytical form of the expectation (4) can be obtained for some error distributions $f_U$. For example, when the random error $u$ follows a univariate normal distribution $N(0, \sigma_u^2)$, the integral in (4) has a closed-form expression; e.g., for the log-linear mean function, the first moment becomes
$$m_1(Z, G; \theta, \tau) = \exp(\beta \Gamma G + \gamma^\top Z + \beta^2 \sigma_u^2 / 2)$$
by the moment generating function of the normal distribution, where $\beta$ is the coefficient of $X$ and $\gamma$ is the coefficient vector of $Z$. With the closed-form expression, the computational burden is eased considerably. On the other hand, in situations where the integral in (4) does not have an analytical form, Monte Carlo methods (e.g., importance sampling) can be used to approximate the integral. Specifically, we follow the suggestions in Wang and Hsiao (2011) to calculate (4) as follows.
- (1) Choose a candidate distribution whose density function $\phi$ is known;
- (2) Generate an i.i.d. random sample $u_1, \dots, u_S$ from the density function $\phi$;
- (3) Calculate the Monte Carlo approximation of $m(Z_i, G_i; \theta, \tau)$ as
$$\hat m_S(Z_i, G_i; \theta, \tau) = \frac{1}{S} \sum_{s=1}^{S} h(\hat\Gamma G_i + u_s, Z_i; \theta) \frac{f_U(u_s; \tau)}{\phi(u_s)};$$
- (4) Apply the gradient descent method to the approximated loss function
$$\hat L_n(\theta, \tau) = \sum_{i=1}^n \{V_i - \hat m_S(Z_i, G_i; \theta, \tau)\}^\top A_i \{V_i - \hat m_S(Z_i, G_i; \theta, \tau)\} + n \sum_j p_\lambda(|\theta_j|),$$
where $V_i$ and $A_i$ are as defined in (5).
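To make step (3) concrete, here is a minimal numerical sketch of the importance-sampling approximation for the first moment of a Poisson log-linear model; all parameter values and names (`beta`, `gamma_g`, `sigma_u`) are illustrative choices of ours, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative setting: E(Y|x) = exp(beta * x), x = gamma_g * g + u,
# with u ~ N(0, sigma_u^2).
beta, gamma_g, sigma_u, g = 0.8, 1.2, 0.6, 0.5

# Importance sampling with candidate density phi = N(0, 1).
S = 200_000
u = rng.normal(size=S)                               # draws from the candidate phi
weights = norm.pdf(u, scale=sigma_u) / norm.pdf(u)   # f_U(u; tau) / phi(u)
m1_mc = np.mean(np.exp(beta * (gamma_g * g + u)) * weights)

# Closed form via the normal MGF: E exp(beta*u) = exp(beta^2 sigma_u^2 / 2).
m1_exact = np.exp(beta * gamma_g * g + beta**2 * sigma_u**2 / 2)
print(m1_mc, m1_exact)   # the two values should agree closely
```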
For some penalty functions like SCAD and MCP, the quantities $a_n$ and $b_n$ in Theorem 1 are both zero when the tuning parameter $\lambda$ is sufficiently small. Hence, the resulting estimator has the oracle performance: $\hat\theta_{J^c} = 0$ with probability approaching one, and the asymptotic distribution of $\hat\theta_J$ is that of Theorem 2(b) without any bias term induced by the penalty.
3. Linear ME Model
For the linear regression model, the proposed regularized IV method simplifies to a regularized two-stage least squares method when the weight matrix $A_i$ is the identity. Specifically, consider a linear model
$$Y = \beta^\top X + \gamma^\top Z + \varepsilon, \qquad (7)$$
where $X$ satisfies (2) and (3) and the model error $\varepsilon$ has mean zero and is independent of $(X, Z, G)$. Without loss of generality, assume the intercept is zero. The regularized instrumental variable estimator is defined as the minimizer of the following objective function
$$L_n(\theta) = \sum_{i=1}^n (Y_i - \beta^\top \hat\Gamma G_i - \gamma^\top Z_i)^2 + n \sum_j p_\lambda(|\theta_j|), \qquad (8)$$
where $\theta = (\beta^\top, \gamma^\top)^\top$ and $\hat\Gamma$ is the least squares estimator of $\Gamma$. Since the naive estimator is, in general, inconsistent in both estimation and selection, the observed covariates $W_i$ are replaced by their corrected version $\hat X_i = \hat\Gamma G_i$ based on the instrumental variables. Furthermore, since the objective function in (8) does not involve an independent random sample $\{\hat X_i\}$, due to the involvement of $\hat\Gamma$, the standard results for regularized linear regression cannot be applied directly. A schematic implementation of this two-stage procedure is sketched below; the asymptotic results for the linear regression model then follow.
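As an illustration (not the paper's code), here is a compact simulation of this corrected two-stage fit, using the Lasso from scikit-learn as an off-the-shelf stand-in for the SCAD penalty and taking $\Gamma = I$ for simplicity.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, q = 400, 8
beta_true = np.array([1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0])

G = rng.normal(size=(n, q))                  # instruments, one per covariate
X = G + rng.normal(size=(n, q))              # equation (3) with Gamma = I
W = X + rng.normal(scale=0.7, size=(n, q))   # surrogates contaminated by ME
y = X @ beta_true + rng.normal(size=n)

# Stage 1: predict X from the instruments (here column by column).
Gamma_hat = np.linalg.lstsq(G, W, rcond=None)[0]
X_hat = G @ Gamma_hat

# Stage 2: regularized least squares on the corrected covariates.
fit_iv = LassoCV(cv=5).fit(X_hat, y)
fit_naive = LassoCV(cv=5).fit(W, y)          # naive fit ignoring the ME
print("IV:   ", np.round(fit_iv.coef_, 2))
print("naive:", np.round(fit_naive.coef_, 2))
```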
Corollary 1. If $a_n = O(n^{-1/2})$, $b_n \to 0$ and $E(\tilde X \tilde X^\top)$ is positive definite, where $\tilde X = ((\Gamma G)^\top, Z^\top)^\top$, then there exists a local minimizer $\hat\theta$ of (8) such that $\|\hat\theta - \theta_0\| = O_p(n^{-1/2})$.
Corollary 2. If $\lambda \to 0$, $\sqrt{n}\lambda \to \infty$ and $\liminf_{n\to\infty}\liminf_{t\to 0^+} p'_\lambda(t)/\lambda > 0$, then with probability approaching 1, the root-$n$ consistent estimator $\hat\theta$ in (8) satisfies (a) $\hat\theta_{J^c} = 0$;
(b) $\sqrt{n}(\hat\theta_J - \theta_{0J})$ has an asymptotic normal distribution whose covariance matrix, given in Appendix A, accounts for the first-stage estimation of $\Gamma$; here, for a matrix $M$, $M_J$ denotes the matrix consisting of the rows of $M$ corresponding to the index set $J$.
4. Numerical Examples
In this section, we conduct simulations to assess the finite sample performance of the proposed instrumental variable estimator (IVE) for variable selection as well as parameter estimation. For comparison purposes, we also calculate the regularized estimator (TRE) using the true data $(Y_i, X_i, Z_i)$, and the naive estimator (NAE) using the observed sample $(Y_i, W_i, Z_i)$. The proposed method is implemented with the SCAD penalty function. The tuning parameter is selected by BIC, which consistently recovers the true model for the SCAD penalty (Wang et al. 2007). To assess the selection performance, we calculate the false-positive (FP) rate, the average number of zero coefficients incorrectly estimated as non-zero, and the false-negative (FN) rate, the average number of non-zero coefficients incorrectly estimated as zero. We also calculate the Matthews correlation coefficient (MCC), a general measure summarizing the confusion matrix of true/false positives/negatives, defined as
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
The MCC ranges from −1 to 1, where a large value indicates good selection. Finally, we calculate the mean squared error (MSE) of the parameter estimates to assess estimation accuracy.
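For reference, these selection metrics can be computed from an estimated coefficient vector as follows (a small helper of ours, not code from the paper):

```python
import numpy as np

def selection_metrics(theta_hat, theta_true, tol=1e-8):
    """FP, FN counts and MCC for a fitted sparse coefficient vector."""
    sel = np.abs(theta_hat) > tol            # selected (non-zero) estimates
    truth = np.abs(theta_true) > tol         # truly non-zero coefficients
    tp = np.sum(sel & truth)
    tn = np.sum(~sel & ~truth)
    fp = np.sum(sel & ~truth)                # zeros wrongly kept in the model
    fn = np.sum(~sel & truth)                # signals wrongly dropped
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return fp, fn, mcc
```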
Example 1. First, we consider a linear model $Y = \beta X + \gamma^\top Z + \varepsilon$, where the error-free covariates $Z$ and the instrument $G$ are jointly generated from a multivariate normal distribution with a prescribed correlation structure. In addition, the true covariate $X$ is generated as $X = G + U$, where $\varepsilon$ and $U$ are standard normal. The observed surrogate is generated as $W = X + \delta$, where $\delta$ follows a normal distribution with mean zero and variance $\sigma_\delta^2$.
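A sketch of this data-generating mechanism; the dimensions, coefficient values and the AR(1)-type correlation are illustrative assumptions of ours, since the exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_z = 200, 6                      # sample size and dim(Z); illustrative
sigma_delta = 0.5                    # ME standard deviation (our choice)

# AR(1)-type covariance for (Z, G), a common choice in such simulations.
dim = p_z + 1
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(dim), np.arange(dim)))
ZG = rng.multivariate_normal(np.zeros(dim), Sigma, size=n)
Z, G = ZG[:, :p_z], ZG[:, p_z]

X = G + rng.normal(size=n)                     # true error-prone covariate
W = X + rng.normal(scale=sigma_delta, size=n)  # observed surrogate
beta, gamma = 1.0, np.array([2.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = beta * X + Z @ gamma + rng.normal(size=n)  # response
```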
Figure 1 shows the estimated coefficients, FP and FN, for various values of the ME variance $\sigma_\delta^2$ with a fixed sample size. The results of the naive method (NAE) are on the left-hand side, while the results of the IVE method are on the right-hand side. Both the FP and FN increase with $\sigma_\delta^2$ for the naive method, as seen from the bottom left graph. In contrast, the IV estimator is robust against the magnitude of $\sigma_\delta^2$. The simulation results are reported in Table 1. It can be seen that the naive method has both high FP and FN in the selection results. The increase in FN is due to the fact that the error-prone covariate is incorrectly dropped from the model, as shown in Table 1. On the other hand, the TR and IV methods perform well in recovering the true model. The selection results of the three methods with increasing sample sizes are reported in Table 2. As the sample size increases, both FP and FN decrease for the TR and IV methods, whereas the FP increases for the naive method. In addition, the MCC and MSE are better for the TR and IV methods than for the naive method. The selection by the naive method is biased regardless of the sample size.

Example 2. In this example, we consider a logistic model where $Y$ follows the Bernoulli distribution whose mean function is the logistic function of the linear predictor. The covariates are jointly generated from a multivariate normal distribution with the same correlation structure, and the rest of the model setting is the same as in Example 1. The simulation results are shown on the left-hand side of Table 3. The results show patterns similar to Example 1, where the values of FP and FN are both low for the TR and IV methods compared with the NA method.

Example 3. In this example, we consider the Poisson model for $Y$ with a log-linear mean function, and the rest of the model setting is the same as in Example 2. The simulation results are shown on the right-hand side of Table 3. It can be seen that in the Poisson log-linear model, the naive method performs the worst among all three methods, where FP and FN remain at a high level. In contrast, the results from the IV method are similar to those of the TR method, where the values of FP, FN and MSE are close to zero and the MCC is close to one.

Example 4. In this example, we consider the linear model of Example 1 with a relatively high dimension $p$, where only a few coefficients are non-zero. The simulation results in Table 4 show that the proposed IVE method performs similarly to the small-$p$ scenarios and, in particular, clearly outperforms the naive method.

Example 5. In this example, we consider the same linear model with different parameter settings, in which the coefficient of the error-prone covariate $X$ is 0. The simulation results are presented in Table 5. It can be observed that the ME has virtually no effect on the FN. Also, as the sample size increases, the IV estimation performs nearly the same as the TR method.

Example 6. In this example, we again consider the linear model of Example 1, but now the covariate $X$ and the surrogate $W$ are generated jointly with all the other covariates. The rest of the model setting remains the same as in Example 1. The results, shown in Table 6, are similar to those in Example 1, regardless of the data-generating mechanism. The IV estimator performs better than the NA estimator as the sample size increases.

5. Real Data Example
In this section, we apply the proposed method to a real dataset. The Mobility Program Clinical Research Unit of St. Michael's Hospital conducted research studying the prognostic factors of work productivity after a limb injury. The dataset was collected through the Work Limitations Questionnaire (WLQ) from a group of injured workers attending a Shoulder & Elbow Specialty clinic, which is managed by the Workplace Safety & Insurance Board of Ontario, Canada. The WLQ, developed by Lerner et al. (2001) and Lerner et al. (2012), offers a way of measuring how health problems affect job performance and productivity loss at work. The WLQ has demonstrated good criterion validity and has been adopted by several research groups, e.g., Ida et al. (2012) and Tang et al. (2011). The 168 recruited participants were worker compensation claimants who may or may not have been working at the time of initial clinic attendance. Typically, injured workers were referred to these clinics if they had a chronic work-related upper limb injury of more than six months' duration without sufficient recovery.
In this paper, we are interested in exploring the prognostic factors of the WLQ index. The response variable, the work limitations questionnaire index, evaluates the proportion of time that difficulty is experienced in four domains: time management, physical demands, mental–interpersonal demands and output demands. This index quantifies the productivity loss at work as a result of health disorders. The predictors (prognostic factors) are supervisor support; lower quick Disabilities of the Arm, Shoulder and Hand (DASH) score; better mental health factor score; better physical health factor score; age; lower von Korff pain intensity score; lower von Korff pain disability score; and lower shoulder pain and disability index. The instrumental variables are organization support and decision authority. Work disability is an important issue in public health because the productivity loss at work can exceed the direct medical cost. In the literature, supervisor support is associated with the productivity and health outcomes of workers. Physical and mental disorders are also significantly related to work loss. For example, positive support from a supervisor is associated with a low degree of stress and low sickness absence among employees (Nielsen et al. 2006; Stansfeld et al. 1997). Physical–mental comorbidity is also found to have an additive effect on work loss (Buist-Bouwman et al. 2005). The estimation results are presented in Table 7. It can be observed that, besides the covariates lower quick DASH score and better mental health factor score that are retained in the model by the naive method, the IV method also keeps supervisor support and the lower shoulder pain and disability index.
6. Conclusions and Discussion
Although regularized regression methods have been widely investigated in the literature, most published works assume the data are precisely measured. Some researchers have studied high-dimensional measurement error models, assuming the ME covariance matrix is known or can be estimated using replicate data. However, replicate data are not always available in real applications. Instrumental data, in contrast, are more flexible and relatively easy to obtain. Technically, the IV assumption is weaker than that of replicate measurements. Although the IV approach has been used by some authors to study general endogeneity problems in linear models, very few studies focus specifically on ME problems. Developing methodologies in this particular context allows us to obtain more insight into ME issues, such as their impact on variable selection and parameter estimation in high-dimensional models.
In this paper, we extended the instrumental variable method to the regularized estimation setup to correct for ME effects in both linear and generalized linear ME models. Besides the attenuation effect, the ME also affects the selection results in various settings. The proposed estimator is shown to have the oracle property, being consistent in both variable selection and parameter estimation. The asymptotic distributions of the proposed estimator are derived for both linear and generalized linear ME models. Extensive simulation studies for linear, logistic and Poisson log-linear models are conducted to examine the performance of the proposed estimator as well as the naive estimator. The simulation results show that the proposed estimator performs well in various model settings with finite sample sizes. The extension of the proposed method to nonlinear models is of interest for future research.
In this paper, we have assumed that the possibly mismeasured covariates are of low dimension. In the future, it will be important to study the case where a large number of covariates are measured with error.