Abstract
Under the assumption of missing response data, empirical likelihood inference is studied via composite quantile regression. Firstly, three empirical likelihood ratios of composite quantile regression are given and proved to be asymptotically chi-squared distributed. Secondly, confidence intervals are constructed for the regression coefficients without estimating the asymptotic covariance. Thirdly, three estimators are presented for the regression parameters, and their asymptotic distributions are derived. The finite sample performance is assessed through simulation studies, and symmetric confidence intervals for the parameters are constructed. Finally, the effectiveness of the proposed methods is illustrated by analyzing a real-world data set.
1. Introduction
Missing data are common in opinion polls, biostatistics, and many scientific experiments. Consequently, numerous papers on the statistical analysis of missing data have been published (see [1,2,3,4,5,6]), and several kinds of methods have been proposed to handle the missing data problem, including imputation methods [7], complete-case (CC) analysis methods [8], likelihood-based methods [9], and inverse probability weighted (IPW) methods [10]. The imputation method under MAR is the most popular and effective of these methods for dealing with missing data. This method replaces each missing data point with an appropriate value. For missing response data, well-known imputation methods include semi-parametric imputation [11], kernel regression imputation [12], and linear regression imputation [13].
Quantile regression (QR), on the other hand, is an indispensable and adaptable tool in statistical analysis. Besides its elegant mathematical properties and promising performance, it can directly estimate the effects of the covariates at different quantiles rather than only at the center of the distribution, which is a limitation of traditional least squares regression (LSR) methods. QR is also less sensitive and more robust to outliers. However, since the estimation efficiency of quantile regression is easily affected by the specific quantile chosen, Zou and Yuan [14] introduced composite quantile regression (CQR) for estimating the unknown parameters of a linear model. Recently, Yang and Liu [15] studied a linear model with missing covariates using the CQR and IPW methods. Jin et al. [16] considered penalized weighted composite quantile regression for partially linear varying coefficient models with missing covariates. Zou et al. [17] discussed heteroscedastic partially linear varying-coefficient models with missing censoring indicators using the CQR method.
Furthermore, the empirical likelihood (EL) method introduced by Owen [18] can be used to construct confidence intervals. This method offers a number of advantages over traditional normal approximation techniques. For instance, the shape and orientation of the confidence intervals or regions produced by the empirical likelihood method are determined by the data. The resulting confidence intervals or regions have many attractive features, such as a range-preserving property, the circumvention of asymptotic variance estimation, and a flexible shape. Recently, many papers have studied empirical likelihood inference for quantile regression models. For instance, Whang [19] proposed a smoothed empirical likelihood method, provided estimates of the parameters of quantile regression models, and constructed confidence regions for the model parameters. More analogous works can be found in [3,20,21,22].
As mentioned above, although [15,16] discussed the estimation of composite quantile regression with missing covariates, the authors considered neither the empirical likelihood of the composite quantile regression nor response data missing at random. Ref. [6] studied smoothed empirical likelihood for quantile regression models with response data missing at random, but it did not consider the composite quantile regression model. In addition, although [17] investigated composite quantile regression for heteroscedastic partially linear varying-coefficient models with missing censoring indicators, the authors considered missing censoring indicators rather than response data missing at random; furthermore, they did not study the empirical likelihood of the composite quantile regression model. Thus, empirical likelihood for composite quantile regression models with missing response data has seldom been considered in the literature so far. In this paper, we focus on empirical likelihood inference for composite quantile regression models with missing response data and establish some theoretical results.
The highlights of this article are as follows: Firstly, three empirical likelihood ratios of composite quantile regression are given and proved to be asymptotically distributed according to the chi-squared law. Secondly, without the estimation of the asymptotic covariance, the confidence intervals for the regression coefficients are constructed. Thirdly, a class of estimators for the regression parameters is presented to derive their asymptotic distribution.
The rest of this paper is organized as follows. In Section 2, we will construct three EL ratios and estimators for the regression parameters of a QR model with missing response data. In Section 3, we will derive some asymptotic properties of the proposed procedure. In Section 4, we will undertake a series of simulation studies to evaluate the performance of the proposed method. An application to a real-world data set is used to illustrate the effectiveness of our approach in Section 5. The discussion and conclusions are presented in Section 6. The proofs of the asymptotic results are provided in Appendix A.
2. Empirical Likelihood for Composite Quantile Regression with Missing Response Data
Consider the classic linear model
$$Y = X^{T}\beta + \varepsilon, \qquad (1)$$
where $Y$ is the response, $\beta$ is a vector of unknown regression coefficients, $X$ is a vector of covariates, and $\varepsilon$ is a random error satisfying
$$P(\varepsilon \le b_k \mid X) = \tau_k,$$
where $b_k$ is the $\tau_k$-quantile of $\varepsilon$ ($k = 1, \ldots, K$), and K is the number of quantiles. For the above model, we focus on the case where X is observed completely in a sample of size n and some Y values may be missing. In other words, we obtain an incomplete sample $\{(X_i, Y_i, \delta_i)\}_{i=1}^{n}$ from model (1), where $\delta_i = 0$ if $Y_i$ is missing, and $\delta_i = 1$ otherwise, and all $X_i$ are observed. Throughout this paper, we assume that Y is missing at random (MAR). The MAR assumption implies that $\delta$ and Y are conditionally independent given X. That is, $P(\delta = 1 \mid Y, X) = P(\delta = 1 \mid X)$. MAR is reasonable in many practical situations and is a common assumption for statistical analysis with missing data [8].
2.1. Complete-Case Linear Composite Quantile Regression Empirical Likelihood
By model (1), the complete data CQR estimator of solves
where is the quantile loss function. It is easy to show that
where is the function of the indicator. By Equation (3) and the EL method idea, the auxiliary random vector is defined as follows:
when is the true value of the parameter. It can be shown that by Equation (3). An empirical log-likelihood ratio function for can be defined based on . However, because the quantiles are unknown, cannot be directly used to make inferences for . Therefore, we replace , with their estimators from (2); thus,
by the simple calculation of
Similar to the proof of Theorem 1 in [14], we have . Hence, it is easy to show that
Then, by , we can obtain , i.e., the auxiliary random vector is asymptotically unbiased; therefore, the complete-case composite quantile empirical log-likelihood (CCQEL) for is defined as
Then, the empirical likelihood estimation of the linear composite quantile under complete data can be defined as :
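As a rough illustration of the composite quantile objective that underlies the estimator in this subsection, the following sketch evaluates the quantile (check) loss and a complete-case CQR objective in the spirit of Zou and Yuan [14]. The function names, the restriction to a scalar covariate, and the equally spaced levels k/(K + 1) are assumptions made for illustration; they are not taken from the displayed equations.

```python
def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def equally_spaced_levels(K):
    """Assumed quantile levels tau_k = k / (K + 1), k = 1..K."""
    return [k / (K + 1) for k in range(1, K + 1)]


def cqr_objective(beta, intercepts, taus, xs, ys, deltas):
    """Complete-case CQR objective for a scalar covariate (illustrative).

    Sums rho_{tau_k}(y_i - b_k - x_i * beta) over the K quantile levels
    and over the observations whose response is observed (delta_i = 1).
    """
    total = 0.0
    for x, y, delta in zip(xs, ys, deltas):
        if delta == 0:  # missing response: skipped in the complete-case fit
            continue
        for b_k, tau_k in zip(intercepts, taus):
            total += check_loss(y - b_k - x * beta, tau_k)
    return total


# Tiny usage example with one missing response (None).
xs = [1.0, 2.0, 3.0]
ys = [1.0, None, 2.5]
deltas = [1, 0, 1]
taus = equally_spaced_levels(3)
value = cqr_objective(0.5, [-0.5, 0.0, 0.5], taus, xs, ys, deltas)
```

Minimizing this objective jointly over the slope and the K intercepts would give a complete-case CQR fit; the sketch only evaluates the objective and leaves the optimizer open.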
2.2. Weighted Composite Quantile Empirical Likelihood
In accordance with the method in Section 2.1, a weighted composite quantile empirical log-likelihood ratio function for can be defined as follows:
where
and is referred to as the selection probability function. It should be noted that the selection probability in (6) is assumed to be known. A kernel smoothing method can be used to estimate the selection probability if it is unknown. The estimate for can be defined as follows:
where represents a kernel function, and denotes a sequence of positive numbers that tend to zero, which controls the amount of smoothing used in the estimations. Therefore, a weighted composite quantile empirical log-likelihood, say , can be obtained by replacing with its estimator . That is,
where Then, the empirical likelihood estimation of the linear weighted composite quantile can be defined as :
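The kernel smoothing step for an unknown selection probability can be sketched as a Nadaraya-Watson type estimate, that is, a kernel-weighted average of the missingness indicators. This is a minimal sketch under stated assumptions: the scalar covariate, the Epanechnikov kernel (a second-order kernel, whereas the conditions in Section 3 allow a kernel of general order r), and all function names are hypothetical and not taken from the paper.

```python
def epanechnikov(u):
    """Epanechnikov kernel, compactly supported on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def estimate_select_prob(x0, xs, deltas, h, kernel=epanechnikov):
    """Kernel estimate of P(delta = 1 | X = x0) for a scalar covariate.

    xs: observed covariates; deltas: 1 if the response is observed, else 0;
    h: positive bandwidth controlling the amount of smoothing.
    """
    weights = [kernel((x0 - x) / h) for x in xs]
    denom = sum(weights)
    if denom == 0.0:
        raise ValueError("no observations within the bandwidth of x0")
    return sum(w * d for w, d in zip(weights, deltas)) / denom


# Usage: the observation at x = 5.0 lies outside the kernel window.
xs = [0.0, 0.1, 0.2, 5.0]
deltas = [1, 0, 1, 0]
p_hat = estimate_select_prob(0.1, xs, deltas, h=0.5)
```

The estimate is a weighted average of zeros and ones, so it always lies in [0, 1]. In practice it may need to be bounded away from zero before it is used as an inverse weight.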
2.3. Imputation Composite Quantile Empirical Likelihood
For estimations of the CCQEL and the WCQEL, the information contained in sample data is not fully utilized because only the information of the observed data is used in constructing the EL ratio; the coverage accuracy of the confidence region is reduced when there are many missing values. To solve the problem, is imputed if is missing. The following introduces auxiliary random data:
where . Thus, an imputation composite quantile empirical log-likelihood (ICQEL) ratio is defined as
The ratio is more appropriate than the quantile weighted empirical likelihood ratio, as it makes optimal use of the information contained in the data. In addition, the empirical likelihood estimation of the linear imputation composite quantile of can be defined as :
3. Asymptotic Properties
Let be an integer. Denote by and the conditional density and conditional distribution functions of given , respectively, and denote by the density function of X. Let c be a positive constant that does not depend on n and may take a different value at each appearance. The following conditions are necessary for the results to be valid:
(C1) : are independent and identically distributed random vectors.
(C2) Both and have bounded partial derivatives up to order r almost surely, and
(C3) This condition is made up of the following two aspects:
(a) is bounded; it is compactly supported on .
(b) is a kernel function of order r, and there are positive constants, denoted by and , and a positive real number, denoted by , such that the following inequality holds:
(C4) , where , and when , .
(C5) The positive bandwidth parameter h satisfies .
(C6) has a bounded support, and matrices A and B are nonsingular, where B and A are defined in Theorem 2.
(C7) The conditional distribution of Y given is absolutely continuous with a density function strictly bounded away from zero and infinity at the conditional quantiles, .
In this section, the asymptotic distributions of the CQEL ratios and the estimators proposed in Section 2.1, Section 2.2 and Section 2.3 will be considered. Firstly, the asymptotic distributions are established for , , and .
Theorem 1.
Suppose that Conditions hold. If β is the true parameter, then
where is taken to be , , or ; is the chi-squared distribution with d degrees of freedom; and indicates convergence in distribution.
Let be the quantile of the for . Using Theorem 1, we are able to derive an approximate confidence region for β, which is defined by
Theorem 1 can also be used to test the hypothesis : . One could reject at level α if .
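To make the test based on Theorem 1 concrete, the sketch below computes an Owen-type empirical log-likelihood ratio statistic for scalar auxiliary values and compares it with a chi-squared quantile. It is a simplified illustration and not the paper's procedure: it assumes a scalar parameter (d = 1), takes the auxiliary values as given, solves for the Lagrange multiplier by bisection, and uses hypothetical function names.

```python
import math
from statistics import NormalDist


def el_statistic(z, n_iter=200):
    """-2 log empirical likelihood ratio for scalar auxiliary values z_i.

    Maximizes prod(n * p_i) subject to sum(p_i) = 1 and sum(p_i * z_i) = 0.
    The solution is p_i = 1 / (n * (1 + lam * z_i)), where lam solves
    sum(z_i / (1 + lam * z_i)) = 0.
    """
    z_min, z_max = min(z), max(z)
    if z_min >= 0.0 or z_max <= 0.0:
        return math.inf  # zero is not inside the convex hull of the z_i
    # lam must keep every 1 + lam * z_i positive.
    lo, hi = -1.0 / z_max, -1.0 / z_min
    for _ in range(n_iter):  # bisection: the score is decreasing in lam
        lam = 0.5 * (lo + hi)
        score = sum(zi / (1.0 + lam * zi) for zi in z)
        if score > 0.0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)


def chi2_quantile_1df(level):
    """Quantile of the chi-squared distribution with 1 degree of freedom.

    Uses the identity chi2_1 = Z**2 for a standard normal Z, so this
    helper covers only the scalar-parameter case d = 1.
    """
    return NormalDist().inv_cdf(0.5 * (1.0 + level)) ** 2


def reject_null(z, alpha=0.05):
    """Reject H0 when the EL statistic exceeds the (1 - alpha) quantile."""
    return el_statistic(z) > chi2_quantile_1df(1.0 - alpha)
```

A confidence region can be obtained by evaluating `el_statistic` on the auxiliary values computed at each candidate parameter value and keeping the candidates for which `reject_null` is false. For d > 1 the multiplier becomes a vector and a chi-squared quantile with d degrees of freedom is needed.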
In order to compare the EL method with the asymptotic normal method, the following theorem gives the asymptotic normality of and , where is taken to be , , or .
Theorem 2.
Suppose that Conditions hold. Then,
where , , , is the conditional density of ε when , and is taken to be , , , or ; we have when is taken to be and ; we have when is taken to be ; and we have when .
To construct the confidence region for , it is necessary to estimate the asymptotic covariance matrix using , where , and . We can prove that is a consistent estimator of . Thus, by Theorem 2, we have
It follows that
Accordingly, the confidence regions of can be constructed using (8).
4. Simulation Study
In order to study the finite sample performance of the proposed method, we performed some simulations. The following two models are considered:
1. Homoscedastic model:
2. Heteroscedastic model:
Here, the variable X was simulated from the , and . We consider the case where and . In the simulation study, for computational convenience, the composite level was chosen as , so the quantiles were taken as . Consider the following three selection probability functions:
(a)
(b)
(c)
Approximately, 0.26, and are the average missing rates corresponding to the three cases.
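To illustrate how missing responses of this kind can be generated, the sketch below draws a sample from a simple linear model and deletes each response with a probability that depends on X only, which is what the MAR assumption requires. The logistic selection probability, the slope value, the standard normal covariate and error, and the function names are hypothetical; they are not the functions (a)-(c) or the models used in the study.

```python
import math
import random


def logistic_select_prob(x):
    """Hypothetical selection probability P(delta = 1 | X = x)."""
    return 1.0 / (1.0 + math.exp(-(1.0 + 0.5 * x)))


def simulate_mar_sample(n, beta, select_prob, seed=0):
    """Draw (x, y, delta) triples; y is None when the response is missing."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = beta * x + rng.gauss(0.0, 1.0)
        # Missingness depends on x only, never on y: the MAR mechanism.
        delta = 1 if rng.random() < select_prob(x) else 0
        sample.append((x, y if delta == 1 else None, delta))
    return sample


sample = simulate_mar_sample(500, beta=2.0, select_prob=logistic_select_prob)
missing_rate = 1.0 - sum(delta for _, _, delta in sample) / len(sample)
```

Averaging `missing_rate` over many replicates gives a Monte Carlo approximation of the average missing rate implied by a given selection probability function.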
The kernel function was taken to be
In total, 2000 Monte Carlo random samples of size and 200 were generated, and the cross-validation method was used to select the optimal bandwidths . Consider the confidence intervals of on model 1 and model 2. Five methods were used, including imputed quantile empirical likelihood (IQEL) in [6], composite QR empirical likelihood without missing data (NCQEL), imputed composite quantile empirical likelihood (ICQEL), and the normal approximation in Theorem 2. In the following, for convenience of expression, the normal approximation confidence intervals for and are denoted as NA and NA , respectively. The corresponding empirical coverage probabilities and the average lengths of the confidence intervals were then computed with a nominal level and . Table 1, Table 2, Table 3 and Table 4 show the results.
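The two summary measures reported in the tables can be computed from the Monte Carlo replicates as sketched below. The function name and the toy intervals are hypothetical and serve only to show the bookkeeping; they are not results from the study.

```python
def summarize_intervals(intervals, true_value):
    """Empirical coverage probability and average length of intervals.

    intervals: list of (lower, upper) pairs, one per Monte Carlo replicate.
    """
    if not intervals:
        raise ValueError("at least one interval is required")
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    avg_length = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return covered / len(intervals), avg_length


# Four hypothetical replicates; the true parameter value is 1.0.
replicates = [(0.5, 1.5), (0.8, 1.6), (1.1, 1.9), (0.2, 1.0)]
coverage, avg_length = summarize_intervals(replicates, true_value=1.0)
```

A method performs well when its empirical coverage is close to the nominal level while its average interval length stays short.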
Table 1.
Average lengths of the confidence intervals for in model 1, calculated for different forms of the selection probability function and different values of sample size n, with a nominal level of .
Table 2.
Empirical coverage probabilities of the intervals for in model 1, calculated for different forms of the selection probability function and different values of the sample size n, with a nominal level of 0.95.
Table 3.
Average lengths of the confidence intervals for in model 2 for different forms of the selection probability function and different values of sample size n under the nominal level of .
Table 4.
Empirical coverage probabilities of the intervals for in model 2 for different forms of the selection probability function and different values of the sample size n under the nominal level of 0.95.
The results in Table 1, Table 2, Table 3 and Table 4 show the following. (1) For case 1, when , the average length of the confidence intervals for in model 1 was 0.9389 under the ICQEL method and 0.9346 under the IQEL method. So, the ICQEL method gives higher coverage probabilities but slightly longer intervals compared to the other two methods. For cases 2 and 3, ICQEL performs better than the other two methods in the sense that its confidence intervals have higher coverage probabilities and uniformly shorter average lengths. This indicates that CQR imputation is necessary when the missing rate is large. (2) Both IQEL and ICQEL have slightly longer intervals but higher coverage probabilities than NA () and NA (). In addition, the confidence intervals obtained by NA () and NA () have nearly equal lengths and coverage probabilities in the same case. (3) For every given missing rate, as the sample size n increases, all the interval lengths decrease and the empirical coverage probabilities increase. The missing rate also affects the interval length and coverage probability: for every fixed sample size, the coverage probability generally decreases and the interval length increases as the missing rate increases. However, for the ICQEL and IQEL methods, the two quantities do not change significantly because QR imputation is used in both methods. Moreover, for the heteroscedastic model, ICQEL continues to outperform the other methods.
5. A Real-World Example
The data originally obtained by Engel [23], a classical example of the declining share of personal income spent on food, are analyzed in this section in order to verify the results obtained in this paper. The data set comprises 235 budget surveys of 19th-century European working-class households and has no missing data. Engel data can be accessed directly from the R package. In order to illustrate our method using the data set, we deleted some of the response values at random to create artificial missing data. Assume that in these data, of the response values are missing. The following linear QR model
was considered, where X is the centered annual household income in Belgian francs, and Y is the household’s centered annual food expenditure.
Now, based on the proposed ICQEL and NCQEL methods, the confidence intervals and the estimators of are presented, and the quantiles are taken as with . The quantile in the IQEL method is taken as . The results are presented in Table 5. Table 5 shows that the confidence interval obtained by the IQEL method is longer than that obtained by the ICQEL method. The confidence intervals obtained by ICQEL and NCQEL are close to each other. These results agree well with the simulation results.
Table 5.
The confidence intervals and estimators of based on IQEL, ICQEL, and NCQEL in the Engel data analysis.
6. Conclusions and Discussions
In this paper, a CQEL method is proposed for the analysis of a QR model with missing response data. Three empirical likelihood ratios of CQR for the regression parameter, namely the CCQEL ratio, WCQEL ratio, and ICQEL ratio, were proposed and proved to be asymptotically chi-squared distributed. Also, three CQEL estimators for the regression parameter were constructed and shown to be asymptotically normal. The benefits of the CQEL method were demonstrated through a simulation study and an analysis of a real-world data set.
While this paper focuses on the empirical likelihood estimation of composite quantile regression, other areas of empirical likelihood estimation could also be explored, such as modal composite quantile regression or expectation quantile regression with missing data. Furthermore, data are frequently not missing at random, and composite quantile regression under such a non-random missingness mechanism can be considered in future work.
Author Contributions
Methodology, S.L.; software, Y.Z.; formal analysis, S.L.; investigation, S.L. and Y.Z.; writing—original draft, S.L.; writing—review and editing, C.-y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 12271420), the Natural Science Foundation of Shaanxi Province of China (2024JC-YBMS-007), and the Planning Project of Yulin Science and Technology Bureau of Shaanxi Province of China (CXY-2021-117).
Data Availability Statement
Data are contained within the article.
Acknowledgments
The authors are grateful to all the reviewers for their constructive comments and suggestions that led to significant improvements to the original manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
The following lemmas are instrumental in proving the theorems given in Section 3.
Lemma A1.
Assume that Conditions – hold. Then,
where .
Proof.
Similar to the proof of the theorem in [14], this lemma can be easily derived. □
Lemma A2.
Suppose that the regularity Conditions – hold. If β is the true parameter in (1), we have that
and
which are all true where takes , , or ; ; and ; and when , takes ; when , takes ; and when , .
Proof.
First is the proof when and equalities (A1) and (A2) both hold, . Let . The following formula can be obtained by simple calculation:
where , and A simple calculation is available as follows:
We have
In addition, there are , , and Hence, we obtain that
We have
From the central limit theorem,
Next, we prove
Let , be the jth component of , and be the jth component of ; by Lemma A.2 of [24], we have that
Furthermore, in accordance with Lemma A1, we have that
Hence, we have
According to formulas (A4), (A5), and (A8), it is proved that Equation (A1) holds. On the other hand, using the above idea of the proof, we can prove that Equation (A2) holds. Secondly, it is proved that Equations (A1) and (A2) hold when , . Note that
because
where
A simple calculation is available as follows:
We have that
In addition, there are , , and . Hence, we have
Hence, we obtain the following formula:
From the central limit theorem,
It is not difficult to prove from the above idea of . By referencing the proof of Theorem 3 of [25], and by Conditions , and , we have
Because , by (A12), we have , which is similar to the proof of . So, we can prove that Equation (A1) holds by (A10) and (A11). In addition, using the above proof idea, we can prove that (A2) holds.
It is easy to prove that we have
and
Therefore, it is possible to obtain
Lemma A3.
Suppose that regularity Conditions – hold. If β is the true parameter in (1), we have that
where takes , , or , and
and when , takes ; when , takes ; and when , .
Proof.
(a) This proves the conclusion when , where . A simple calculation is available as follows:
It can be obtained from the law of large numbers that . We prove . Let us define the matrix as the component of the matrix . Similarly, let us define the matrix as the rth component of the matrix , where . Subsequently, the Cauchy–Schwarz inequality is employed to derive the following result:
From Lemmas A1 and A2, we can see that and . Hence, . Using a similar argument, we can prove , . So, we prove that Equation (A14) holds.
(b) This proves the conclusion when , where . This is because
where
In accordance with the law of large numbers, it can be inferred that
Proof of Theorem 1.
The Lagrange multiplier method allows us to represent in the following way:
where is a vector and satisfies the solution of the following equation:
Then, Taylor expansion is applied to (A16), and by Lemma A2 and (A18), we have
and by Equation (A17), we can obtain
By Lemma A2 and (A18), we can obtain
Then, by (A19), we have
The proof of Theorem 1 is derived from the combination of Lemmas A2 and A3. □
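For orientation, the Lagrange multiplier and Taylor expansion steps used in the proof follow the usual Owen-type pattern [18], which can be written in generic form as below. The notation is an assumption made for this sketch: $Z_i$ stands for whichever auxiliary random vector is used, $\lambda$ for the Lagrange multiplier, and the remainder orders are those of the standard argument rather than a restatement of the paper's displays.

```latex
\[
\begin{aligned}
0 &= \frac{1}{n}\sum_{i=1}^{n}\frac{Z_i}{1+\lambda^{T}Z_i}
   = \frac{1}{n}\sum_{i=1}^{n} Z_i
     - \Big(\frac{1}{n}\sum_{i=1}^{n} Z_i Z_i^{T}\Big)\lambda + o_p(n^{-1/2}),\\
\lambda &= \Big(\sum_{i=1}^{n} Z_i Z_i^{T}\Big)^{-1}\sum_{i=1}^{n} Z_i + o_p(n^{-1/2}),\\
2\sum_{i=1}^{n}\log\big(1+\lambda^{T}Z_i\big)
  &= 2\sum_{i=1}^{n}\Big\{\lambda^{T}Z_i - \tfrac{1}{2}\big(\lambda^{T}Z_i\big)^{2}\Big\} + o_p(1)\\
  &= \Big(n^{-1/2}\sum_{i=1}^{n} Z_i\Big)^{T}
     \Big(n^{-1}\sum_{i=1}^{n} Z_i Z_i^{T}\Big)^{-1}
     \Big(n^{-1/2}\sum_{i=1}^{n} Z_i\Big) + o_p(1).
\end{aligned}
\]
```

The chi-squared limit then follows when the normalized sum is asymptotically normal with a covariance matrix equal to the limit of the averaged outer products, which is the role played by Lemmas A2 and A3.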
References
- Liu, H.; Yang, H.; Peng, C. Weighted composite quantile regression for single index model with missing covariates at random. Comput. Stat. 2019, 34, 1711–1740.
- Chen, X.; Wan, A.T.K.; Zhou, Y. Efficient quantile regression analysis with missing observations. J. Am. Stat. Assoc. 2015, 110, 723–741.
- Luo, S.H.; Yan, Y.X.; Zhang, C.Y. Two-Stage estimation of partially linear varying coefficient quantile regression model with missing data. Mathematics 2024, 12, 578.
- Xue, L.G.; Zhu, L.X. Empirical likelihood in a partially linear single-index model with censored response data. Comput. Stat. Data Anal. 2024, 193, 107912.
- Luo, S.H.; Zhang, C.Y.; Wang, M.H. Composite quantile regression for varying coefficient models with response data missing at random. Symmetry 2019, 11, 1065.
- Luo, S.H.; Mei, C.L.; Zhang, C.Y. Smoothed empirical likelihood for quantile regression models with response data missing at random. AStA-Adv. Stat. Anal. 2017, 15, 95–116.
- Aerts, M.; Claeskens, G.; Hens, N.; Molenberghs, G. Local multiple imputation. Biometrika 2002, 89, 375–388.
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Schafer, J.; Graham, J. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
- Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995, 90, 106–121.
- Wang, Q.; Linton, O.; Härdle, W. Semiparametric regression analysis with missing response at random. J. Am. Stat. Assoc. 2004, 99, 334–345.
- Wang, Q.; Rao, J.N.K. Empirical likelihood-based inference under imputation for missing response data. Ann. Stat. 2002, 30, 896–924.
- Xue, L.G. Empirical likelihood for linear models with missing responses. J. Multivar. Anal. 2009, 100, 1353–1366.
- Zou, H.; Yuan, M. Composite quantile regression and the oracle model selection theory. Ann. Stat. 2008, 36, 1108–1126.
- Yang, H.; Liu, H.L. Penalized weighted composite quantile estimators with missing covariates. Stat. Pap. 2016, 57, 69–88.
- Jin, J.; Ma, T.; Dai, J.; Liu, S. Penalized weighted composite quantile regression for partially linear varying coefficient models with missing covariates. Comput. Stat. 2020, 36, 541–575.
- Zou, Y.Y.; Fan, G.L.; Zhang, R.Q. Composite quantile regression for heteroscedastic partially linear varying-coefficient models with missing censoring indicators. J. Stat. Comput. Simul. 2023, 93, 341–365.
- Owen, A.B. Empirical likelihood ratio confidence regions. Ann. Stat. 1990, 18, 90–120.
- Whang, Y.J. Smoothed empirical likelihood methods for quantile regression models. Econom. Theory 2006, 22, 173–205.
- Zhao, P.X.; Lin, X.S.; Lin, L. Empirical likelihood for composite quantile regression modeling. J. Appl. Math. Comput. 2015, 48, 321–333.
- Wang, J.F.; Jiang, W.J.; Xu, F.Y.; Fu, W.X. Weighted composite quantile regression with censoring indicators missing at random. Commun. Stat.-Theory Methods 2021, 50, 2900–2917.
- Sun, J.; Ma, Y.Y. Empirical likelihood weighted composite quantile regression with partially missing covariates. J. Nonparametric Stat. 2017, 29, 137–150.
- Engel, E. Die Productions- und Consumtionsverhältnisse des Königreichs Sachsen. Stat. Burdes 1857, 8, 1–54.
- Zhao, P.X.; Xue, L.G. Empirical likelihood inferences for semiparametric varying coefficient partially linear models with longitudinal data. Commun. Stat.-Theory Methods 2010, 39, 1898–1914.
- Wong, H.; Guo, S.; Chen, M. On locally weighted estimation and hypothesis testing of varying-coefficient models with missing covariates. J. Stat. Plan. Inference 2009, 139, 2933–2951.
- Otsu, T. Conditional empirical likelihood estimation and inference for quantile regression models. J. Econom. 2008, 142, 508–538.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).