Article

Optimal Model Averaging Estimation for the Varying-Coefficient Partially Linear Models with Missing Responses

Jie Zeng, Weihu Cheng and Guozhi Hu
1 School of Mathematics and Statistics, Hefei Normal University, Hefei 230601, China
2 Faculty of Science, Beijing University of Technology, Beijing 100124, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(8), 1883; https://doi.org/10.3390/math11081883
Submission received: 9 March 2023 / Revised: 7 April 2023 / Accepted: 12 April 2023 / Published: 16 April 2023
(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Abstract: In this paper, we propose a model averaging estimation for the varying-coefficient partially linear models with missing responses. Within this context, we construct an HR $C_p$ weight choice criterion that exhibits asymptotic optimality under certain assumptions. Our model averaging procedure can simultaneously address the uncertainty on which covariates to include and the uncertainty on whether a covariate should enter the linear or the nonlinear component of the model. The simulation results in comparison with some related strategies strongly favor our proposal. A real dataset is analyzed to illustrate the practical application as well.

1. Introduction

Model averaging, an alternative to model selection, addresses both model uncertainty and estimation uncertainty by appropriately compromising over the set of candidate models instead of picking only one of them, and this generally leads to a much smaller risk than that encountered in model selection. Over the past decade, various model averaging approaches with optimal large-sample properties have been actively proposed for the complete data setting, such as the following: Mallows model averaging [1,2], optimal mean squared error averaging [3], jackknife model averaging [4,5,6], heteroscedasticity-robust $C_p$ (HR $C_p$) model averaging [7], model averaging based on the Kullback–Leibler distance [8], model averaging in a kernel regression setup [9], and model averaging based on K-fold cross-validation [10], among others.
In practice, many datasets in clinical trials, opinion polls and market research surveys contain missing values. As far as we know, compared with the large body of research regarding model averaging for fully observed data, much less attention has been paid to performing optimal model averaging in the presence of missing data. Reference [11] studied a model averaging method applicable to situations in which covariates are missing completely at random, by adapting a Mallows criterion based on the data from complete cases. Reference [12] broadened the analysis in [11] to a fragmentary data and heteroscedasticity setup. By applying the HR $C_p$ approach in [7], Reference [13] developed an optimal model averaging method in the presence of responses missing at random (MAR). In the context of missing response data, Reference [14] constructed a model averaging method based on a delete-one cross-validation criterion. Reference [15] proposed a two-step model averaging procedure for high-dimensional regression with responses missing at random.
The aforementioned model averaging methods in a missing data setting are asymptotically optimal in the sense of minimizing the squared error loss in a large sample case, but they all concentrate mainly on the simple linear regression model. In the context of missing data, it would be interesting to study model averaging in the varying-coefficient partially linear model (VCPLM) introduced by [16], which allows interactions between a covariate and an unknown function through effect modifiers. Due to its flexible specification and explanatory power, this model has received extensive attention over the past decades. Different kinds of approaches have been proposed to estimate the VCPLM, such as the following: estimation based on local polynomial fitting [17], the general series method [18], and profile least squares estimation [19]. References [20,21,22,23] developed various variable selection procedures for the VCPLM. As for model averaging in the VCPLM, only the following works have been conducted. For the measurement error model and the missing data model, References [24,25], respectively, established the limiting distribution of the resulting model averaging estimators of the unknown parameters of interest under the local misspecification framework. As pointed out by [26], this framework, which was suggested by [27], is a useful tool for asymptotic analysis, but its realism is subject to considerable criticism. Additionally, these two works studied existing model averaging strategies based on the focused information criterion, but did not consider any new model averaging method with asymptotic optimality. When all data are available, References [26,28] developed two asymptotically optimal model averaging approaches for the VCPLM, based on a Mallows-type criterion and a jackknife criterion, respectively.
As far as we know, there remains no optimal model averaging approach developed for the VCPLM with missing responses. The main goal of the current paper is to fill this gap. To the best of our knowledge, this paper is the first to study an asymptotically optimal model averaging approach for the VCPLM in the presence of responses MAR without the local misspecification assumption. However, existing results are difficult to extend directly to our setup, for the following two reasons. Firstly, existing optimal model averaging approaches for the VCPLM with complete data, such as the Mallows model averaging method proposed by [26] and the jackknife model averaging method advocated by [28], cannot be directly applied to our problem. Secondly, in contrast with the linear missing data models studied by [13,14], our analysis is significantly complicated by two kinds of uncertainty in the VCPLM: uncertainty in the selection of variables, and uncertainty on whether a covariate should be allocated to the linear or the nonlinear component of the model. These uncertainties have not been investigated much in the VCPLM literature. Motivated by these two challenges, we suggest a new model averaging approach for the VCPLM with responses MAR via the HR $C_p$ criterion. The new approach is developed by introducing a synthetic response based on an inverse probability weighted (IPW) technique, after which HR $C_p$ model averaging can be conducted easily. Under certain assumptions, the weights selected by minimizing the HR $C_p$ criterion are demonstrated to be asymptotically optimal. Furthermore, we numerically illustrate that our method is superior to its rivals in several designs with different kinds of model uncertainty. The detailed research procedures and methods can be found in Figure 1.
The remainder of this article is organized as follows. We construct the model averaging estimator and establish its asymptotic optimality in Section 2. A simulation study is conducted in Section 3 to illustrate the finite sample performance of our strategy, and a real data example is provided in Section 4. Section 5 contains some conclusions. Detailed proofs of the main results are relegated to Appendix A.

2. Model Averaging Estimation

2.1. Model and Estimators

We considered the following VCPLM:
$$ y_i = \mu_i + \epsilon_i = X_i^{\top}\beta + Z_i^{\top}\alpha(u_i) + \epsilon_i = \sum_{p=1}^{\infty} x_{ip}\beta_p + \sum_{q=1}^{\infty} z_{iq}\alpha_q(u_i) + \epsilon_i, \quad i = 1, \dots, n, \qquad (1) $$
where $y_i$ is a scalar response variable, $(X_i, Z_i, u_i)$ are covariates with $X_i$ and $Z_i$ being countably infinite-dimensional, $\beta$ is an unknown coefficient vector associated with $X_i$, $\alpha(\cdot)$ is an unknown coefficient function vector associated with $Z_i$, and $\epsilon_i$ is a random error with $E(\epsilon_i \mid X_i, Z_i, u_i) = 0$ and $E(\epsilon_i^2 \mid X_i, Z_i, u_i) = \sigma^2$. As in [26,29], we assume that the dimension of $u_i$ is one. Model (1) is flexible enough to cover a variety of existing models, such as the linear model studied by [1,4], the partially linear model studied by [30] and the varying-coefficient model studied by [29]. For this model, we focus on the case where all covariates are always fully observed while some observations of the response variable may be missing. Specifically, we assume that $y_i$ is MAR in the sense that:
$$ P(\delta_i = 1 \mid y_i, X_i, Z_i, u_i) = P(\delta_i = 1 \mid X_i, Z_i, u_i) \equiv \pi(X_i, Z_i, u_i), \qquad (2) $$
where $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ otherwise, and the selection probability function $\pi(X_i, Z_i, u_i)$ is bounded away from 0.
As in most of the model averaging literature, we aimed to estimate the conditional mean of the response data $Y = (y_1, \dots, y_n)^{\top}$, i.e., $\mu = (\mu_1, \dots, \mu_n)^{\top}$, which is especially useful in prediction. However, owing to the presence of missing data, none of the existing optimal model averaging estimations for complete data could be directly utilized in our setting. We addressed this problem by introducing a synthetic response $H_{\pi,i} = \delta_i y_i / \pi(X_i, Z_i, u_i)$. By the aforementioned MAR assumption and some simple calculations, it is easy to observe that $E(H_{\pi,i} \mid X_i, Z_i, u_i) = E(y_i \mid X_i, Z_i, u_i) = \mu_i$ and $\mathrm{Var}(H_{\pi,i} \mid X_i, Z_i, u_i) = \sigma_{\pi,i}^2$, where $\sigma_{\pi,i}^2 = [\{\pi(X_i, Z_i, u_i)\}^{-1} - 1]\mu_i^2 + \{\pi(X_i, Z_i, u_i)\}^{-1}\sigma^2$. Therefore, under Model (1) and the MAR assumption, we have:
$$ H_{\pi,i} = \mu_i + \epsilon_{\pi,i}, \quad i = 1, \dots, n, \qquad (3) $$
where $\epsilon_{\pi,i} = H_{\pi,i} - E(y_i \mid X_i, Z_i, u_i)$ satisfies $E(\epsilon_{\pi,i} \mid X_i, Z_i, u_i) = 0$ and $\mathrm{Var}(\epsilon_{\pi,i} \mid X_i, Z_i, u_i) = \sigma_{\pi,i}^2$. As is apparent, in Model (3) the completely observed cases are weighted by their corresponding inverse selection probabilities, while the missing cases are weighted by zeros. Then, the analysis is conducted on the basis of the weighted data. By introducing the fully observed synthetic response $H_{\pi,i}$, we obtain a new Model (3) whose conditional expectation is equivalent to that of Model (1). Thus, the HR $C_p$ model averaging estimation for $\mu_i$, the conditional mean of Model (1), can alternatively be derived by studying the HR $C_p$ model averaging estimation for Model (3) with the synthetic data when $\pi(X_i, Z_i, u_i)$ is known.
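To make the construction concrete, the following R sketch builds the synthetic response from the observed data; the helper name `make_synthetic_response` is ours, and the selection probabilities `pi_i` are assumed known, as in this subsection.

```r
# A minimal sketch of H_{pi,i} = delta_i * y_i / pi(X_i, Z_i, u_i), assuming
# the selection probabilities pi_i are known; y may contain NAs where delta == 0.
make_synthetic_response <- function(y, delta, pi_i) {
  ifelse(delta == 1, y / pi_i, 0)  # observed cases inverse-weighted, missing cases get 0
}
```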
Suppose that there are $M$ candidate VCPLMs to approximate the true data-generating process of $y_i$ given in (1), where the $m$th candidate VCPLM comprises $p_m$ covariates in $X_i$ and $q_m$ covariates in $Z_i$. Accordingly, there are $M$ candidate models to approximate Model (3), and the $m$th candidate model contains the same covariates as the $m$th candidate VCPLM for (1). Specifically, the $m$th candidate model is:
$$ H_{\pi,i} = X_{(m),i}^{\top}\beta_{(m)} + Z_{(m),i}^{\top}\alpha_{(m)}(u_i) + e_{(m),i} + \epsilon_{\pi,i}, \quad i = 1, \dots, n, \qquad (4) $$
where $X_{(m),i}$ is the $p_m$-dimensional sub-vector of $X_i$ and $\beta_{(m)}$ is the corresponding unknown coefficient vector, $Z_{(m),i} = (z_{(m),i1}, \dots, z_{(m),iq_m})^{\top}$ is the $q_m$-dimensional sub-vector of $Z_i$ and $\alpha_{(m)}(u_i) = (\alpha_{(m),1}(u_i), \dots, \alpha_{(m),q_m}(u_i))^{\top}$ is the corresponding unknown coefficient function vector, and $e_{(m),i} = \mu_i - X_{(m),i}^{\top}\beta_{(m)} - Z_{(m),i}^{\top}\alpha_{(m)}(u_i)$ denotes the approximation error of the $m$th candidate model. Details of the model averaging estimation procedure in our setup are provided below.
We employed the polynomial spline-based smoothing strategy to estimate each coefficient function first. Without loss of generality, suppose that the covariate $u$ is distributed on the compact interval $[0,1]$. Denote the polynomial spline space of degree $\varrho$ on $[0,1]$ by $\Psi$. We introduce a sequence of knots on $[0,1]$: $k_{-\varrho} = \cdots = k_{-1} = k_0 = 0 < k_1 < \cdots < k_{J_n} < 1 = k_{J_n+1} = \cdots = k_{J_n+\varrho+1}$, where the number of interior knots $J_n$ increases with the sample size $n$. The spline basis functions are polynomials of degree $\varrho$ on all sub-intervals $[k_j, k_{j+1})$, $j = 0, \dots, J_n - 1$, and $[k_{J_n}, 1]$, and are $(\varrho - 1)$-times continuously differentiable on $[0,1]$. Let $B(\cdot) = (B_{-\varrho}(\cdot), \dots, B_{J_n}(\cdot))^{\top}$ be the vector of B-spline basis functions in the space $\Psi$. According to B-spline theory, there exists a $B(u)^{\top}\theta_{(m),q}$ in $\Psi$ for some $(J_n + \varrho + 1)$-dimensional spline coefficient vector $\theta_{(m),q} = (\theta_{(m),q,-\varrho}, \dots, \theta_{(m),q,J_n})^{\top}$ such that $\max_{m,q} \sup_{u \in [0,1]} |\alpha_{(m),q}(u) - B(u)^{\top}\theta_{(m),q}| = O((J_n + \varrho + 1)^{-d})$, where $\alpha_{(m),q}(u)$ is the $q$th element of $\alpha_{(m)}(u)$. We would like to estimate $\beta_{(m)}$ and $\theta_{(m)} = (\theta_{(m),1}^{\top}, \dots, \theta_{(m),q_m}^{\top})^{\top}$ by the least squares method based on the criterion:
$$ \min_{\beta_{(m)},\, \theta_{(m)}} \sum_{i=1}^{n} \Big\{ H_{\pi,i} - X_{(m),i}^{\top}\beta_{(m)} - \sum_{q=1}^{q_m} z_{(m),iq} B(u_i)^{\top}\theta_{(m),q} \Big\}^2. \qquad (5) $$
Let $G_{(m),i} = (z_{(m),i1} B(u_i)^{\top}, \dots, z_{(m),iq_m} B(u_i)^{\top})^{\top}$ be a $\{q_m(J_n + \varrho + 1)\}$-dimensional vector. Denote $H_\pi = (H_{\pi,1}, \dots, H_{\pi,n})^{\top}$, $X_{(m)} = (X_{(m),1}, \dots, X_{(m),n})^{\top}$ and $G_{(m)} = (G_{(m),1}, \dots, G_{(m),n})^{\top}$. Here, we assume that the regressor matrix $\tilde{X}_{(m)} = (X_{(m)}, G_{(m)})$ has full column rank $l_m = p_m + \{q_m(J_n + \varrho + 1)\}$. The solution to the minimization problem provided in (5) can be expressed as:
$$ \hat{\beta}_{(m,\pi)} = \{X_{(m)}^{\top}(I - Q_{(m)})X_{(m)}\}^{-1} X_{(m)}^{\top}(I - Q_{(m)}) H_{\pi}, \qquad (6) $$
$$ \hat{\theta}_{(m,\pi)} = (G_{(m)}^{\top}G_{(m)})^{-1} G_{(m)}^{\top}(H_{\pi} - X_{(m)}\hat{\beta}_{(m,\pi)}), \qquad (7) $$
where $Q_{(m)} = G_{(m)}(G_{(m)}^{\top}G_{(m)})^{-1}G_{(m)}^{\top}$. Let $\Phi_{(m)} = (I - Q_{(m)})X_{(m)}$; then, the estimator of $\mu$ under the $m$th candidate model follows:
$$ \hat{\mu}_{(m,\pi)} = X_{(m)}\hat{\beta}_{(m,\pi)} + G_{(m)}\hat{\theta}_{(m,\pi)} = \{Q_{(m)} + \Phi_{(m)}(\Phi_{(m)}^{\top}\Phi_{(m)})^{-1}\Phi_{(m)}^{\top}\} H_{\pi}. \qquad (8) $$
Denoting $P_{(m)} = Q_{(m)} + \Phi_{(m)}(\Phi_{(m)}^{\top}\Phi_{(m)})^{-1}\Phi_{(m)}^{\top}$, we obtain $\hat{\mu}_{(m,\pi)} = P_{(m)} H_{\pi}$.
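The per-model computation can be sketched in R as follows; `fit_candidate` is a hypothetical helper of ours, and it assumes the synthetic response `H` has already been built.

```r
library(splines)

# A sketch of the estimator for one candidate model, following (6)-(8).
# X_m and Z_m are the candidate's linear and varying-coefficient covariate
# matrices, u is the effect modifier, and df = degree + 1 + interior knots.
fit_candidate <- function(H, X_m, Z_m, u, df = 4) {
  n <- length(H)
  B <- bs(u, df = df, degree = 3, intercept = TRUE)      # B-spline basis B(u_i)
  # rows of G are G_{(m),i} = (z_{(m),i1} B(u_i)', ..., z_{(m),iq_m} B(u_i)')
  G <- do.call(cbind, lapply(seq_len(ncol(Z_m)), function(q) Z_m[, q] * B))
  Q <- G %*% solve(crossprod(G), t(G))                   # Q_{(m)}
  IQ <- diag(n) - Q
  beta_hat <- solve(t(X_m) %*% IQ %*% X_m, t(X_m) %*% IQ %*% H)        # (6)
  theta_hat <- solve(crossprod(G), crossprod(G, H - X_m %*% beta_hat)) # (7)
  Phi <- IQ %*% X_m
  P_m <- Q + Phi %*% solve(crossprod(Phi), t(Phi))       # P_{(m)}
  list(mu_hat = drop(X_m %*% beta_hat + G %*% theta_hat), P = P_m)     # (8)
}
```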
To smooth estimators across all candidate models, we may define the model averaging estimator of μ as:
$$ \hat{\mu}_{\pi}(w) = \sum_{m=1}^{M} w_m \hat{\mu}_{(m,\pi)} = \sum_{m=1}^{M} w_m P_{(m)} H_{\pi} \equiv P(w) H_{\pi}, \qquad (9) $$
where $w = (w_1, \dots, w_M)^{\top}$ is a weight vector in the set $W = \{w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1\}$.

2.2. Weight Choice Criterion and Asymptotically Optimal Property

Obviously, the weight vector $w$, which represents the contribution of each candidate model to the final estimate, plays a central role in (9). Our weight choice criterion is motivated by the HR $C_p$ method of [7], which was designed for the complete data setting, and is defined as follows:
$$ C_{\pi}(w) = \| H_{\pi} - \hat{\mu}_{\pi}(w) \|^2 + 2 \sum_{i=1}^{n} \hat{\epsilon}_{\pi,i}^2 P_{ii}(w), \qquad (10) $$
where $\hat{\epsilon}_{\pi,i}$ is the residual from a preliminary estimation and $P_{ii}(w)$ is the $i$th diagonal element of the matrix $P(w)$. As suggested by [7], $\hat{\epsilon}_{\pi,i}$ can be obtained from a model, indexed by $M^*$, that includes all the regressors in the candidate models. That is:
$$ \hat{\epsilon}_{\pi} = \sqrt{n/(n - l_{M^*})}\, (I - P_{(M^*)}) H_{\pi}, \qquad (11) $$
where $l_{M^*}$ is the rank of the regressor matrix in model $M^*$ and $\hat{\epsilon}_{\pi} = (\hat{\epsilon}_{\pi,1}, \dots, \hat{\epsilon}_{\pi,n})^{\top}$.
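A one-line R sketch of this preliminary step, under hypothetical names (`P_star` and `l_star` are the hat matrix and rank of model $M^*$):

```r
# Preliminary residuals of (11) from the model M* containing all regressors.
prelim_resid <- function(H, P_star, l_star) {
  n <- length(H)
  sqrt(n / (n - l_star)) * drop((diag(n) - P_star) %*% H)
}
```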
So far, we have assumed that the selection probability function is known. This is, of course, not the case in real-world data analysis, and the proposed criterion (10) is, hence, computationally infeasible. To obtain a feasible criterion in practice, we need to estimate $\pi(X_i, Z_i, u_i)$ first. Following much of the missing data literature, and under the MAR assumption defined above, we assume that for an unknown parameter vector $\eta$ and $T_i = (X_i^{\top}, Z_i^{\top}, u_i)^{\top}$ we have:
$$ \pi(X_i, Z_i, u_i) = \pi(T_i, \eta), \qquad (12) $$
for some function $\pi(\cdot, \eta)$ whose form is known up to the finite-dimensional parameter $\eta$. Let $\hat{\eta}$ be the maximum likelihood estimator (MLE) of $\eta$. Then, the selection probability function can be estimated by $\pi(T_i, \hat{\eta})$. In what follows, a Greek letter indexed by $\hat{\pi}$ denotes the quantity obtained by replacing $\pi(X_i, Z_i, u_i)$ in its expression with the estimator $\pi(T_i, \hat{\eta})$. A feasible form of the weight choice criterion based on the HR $C_p$ method is, thus, given by:
$$ C_{\hat{\pi}}(w) = \| H_{\hat{\pi}} - \hat{\mu}_{\hat{\pi}}(w) \|^2 + 2 \sum_{i=1}^{n} \hat{\epsilon}_{\hat{\pi},i}^2 P_{ii}(w), \qquad (13) $$
and the weight vector can be obtained by:
$$ \hat{w} = \arg\min_{w \in W} C_{\hat{\pi}}(w). \qquad (14) $$
Then, the corresponding model averaging estimator of $\mu$ can be expressed as $\hat{\mu}_{\hat{\pi}}(\hat{w})$, and its asymptotic optimality can be established under some regularity conditions.
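Since $C_{\hat{\pi}}(w)$ is quadratic in $w$, the weight search (14) is a quadratic program over the simplex $W$. The following R sketch, with hypothetical names, solves it with the quadprog package; `mu_hats` is the $n \times M$ matrix of candidate estimators $P_{(m)} H_{\hat{\pi}}$ and `pen` is the $M$-vector with $m$th element $\sum_i \hat{\epsilon}_{\hat{\pi},i}^2 P_{(m),ii}$. In a logistic specification, $\pi(T_i, \hat{\eta})$ itself can be obtained as, e.g., `fitted(glm(delta ~ u + x1, family = binomial))`.

```r
library(quadprog)

# A sketch of weight selection by the feasible HR Cp criterion (13)-(14):
# C(w) = ||H - mu_hats w||^2 + 2 * pen' w, minimized over the simplex.
select_weights <- function(H, mu_hats, pen) {
  M <- ncol(mu_hats)
  Dmat <- 2 * crossprod(mu_hats)           # quadratic part: 2 D'D
  diag(Dmat) <- diag(Dmat) + 1e-8          # tiny ridge in case D'D is singular
  dvec <- 2 * (drop(crossprod(mu_hats, H)) - pen)
  Amat <- cbind(rep(1, M), diag(M))        # sum(w) = 1 (equality), then w >= 0
  bvec <- c(1, rep(0, M))
  sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  pmax(sol$solution, 0)                    # clip tiny negative round-off
}
```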
Some notation and definitions are required before we list these conditions. Write $l(\eta) = E[\delta \log \pi(T, \eta) + (1 - \delta)\log\{1 - \pi(T, \eta)\}]$, $X = (X_1, \dots, X_n)^{\top}$, $Z = (Z_1, \dots, Z_n)^{\top}$ and $U = (u_1, \dots, u_n)^{\top}$. Define the squared error loss of $\hat{\mu}_{\pi}(w)$ and the corresponding risk as $L_{\pi}(w) = \| \hat{\mu}_{\pi}(w) - \mu \|^2$ and $R_{\pi}(w) = E(L_{\pi}(w) \mid X, Z, U)$. Let $\xi_{\pi} = \inf_{w \in W} R_{\pi}(w)$, let $w_m^0$ be the $M \times 1$ vector whose $m$th element is 1 and whose other elements are 0, and let $\Theta_{\eta}$ be the parameter space of $\eta$. Define $r$ as a positive integer and $\tau \in (0, 1]$ such that $d = r + \tau > 0.5$. Let $S$ be the collection of functions $s$ on $[0,1]$ whose $r$th derivative $s^{(r)}$ exists and satisfies a Lipschitz condition of order $\tau$, i.e.,
$$ | s^{(r)}(t^*) - s^{(r)}(t) | \le C_s | t^* - t |^{\tau}, \quad \text{for } 0 \le t^*, t \le 1, $$
where $C_s$ is a positive constant. All limiting processes discussed throughout the paper are under $n \to \infty$. The conditions needed to derive the asymptotic optimality are as follows:
  • (Condition (C.1)) $l(\eta)$ has a unique maximum at $\eta_0$ in $\Theta_\eta$, where $\eta_0$ is an inner point of $\Theta_\eta$ and $\Theta_\eta$ is compact. $\pi(T_i, \eta) \ge C_\pi > 0$, and $\pi(T_i, \eta)$ is twice continuously differentiable with respect to $\eta$, where $C_\pi$ is a constant. $\max_{1 \le i \le n} \| \partial \pi(T_i, \eta) / \partial \eta \| = O_p(1)$ for all $\eta$ in a neighborhood of $\eta_0$.
  • (Condition (C.2)) $\max_{1 \le i \le n} E(\epsilon_i^{4K} \mid X_i, Z_i, u_i) \le C_\epsilon < \infty$ for some integer $1 \le K < \infty$ and for some constant $C_\epsilon$. There exists a constant $C_\mu$ such that $\max_{1 \le i \le n} |\mu_i| \le C_\mu$.
  • (Condition (C.3)) $M \xi_\pi^{-2K} \sum_{m=1}^{M} \{R_\pi(w_m^0)\}^K \to 0$, where $K$ is given in Condition (C.2).
  • (Condition (C.4)) Each coefficient function $\alpha_q(\cdot) \in S$.
  • (Condition (C.5)) The density function of $u$, say $f$, is bounded away from 0 and infinity on $[0,1]$.
  • (Condition (C.6)) $\max_{1 \le m \le M} \max_{1 \le i \le n} P_{(m),ii} = O(n^{-1/2})$, where $P_{(m),ii}$ denotes the $i$th diagonal element of $P_{(m)}$.
  • (Condition (C.7)) $n^{1/2} / \xi_\pi \to 0$.
  • (Condition (C.8)) $l_{M^*} = O(n^{1/2})$.
Condition (C.1) is from [31] and is similar to Condition (C1) of [13]; it ensures the consistency and asymptotic normality of the MLE $\hat{\eta}$. The first part of Condition (C.2) is a commonly used assumption on the conditional moments of the random error term in the model averaging literature; see, for example, [2,4,26]. The second part of Condition (C.2) is the same as assumption (C.2) of [32], which bounds the conditional expectation $\mu_i$. Condition (C.3) not only requires $\xi_\pi \to \infty$, but also requires that $M$ and $\max_{1 \le m \le M} R_\pi(w_m^0)$ tend to infinity slowly enough. Such a condition can be viewed as an analogous version of Assumption 2.3 in [7], in which the authors proposed the HR $C_p$ model averaging method in a complete data setting. Conditions (C.4) and (C.5) are two general requirements that are necessary for studies of the B-spline basis; see [29,33]. Condition (C.6), an assumption that excludes peculiar models, is from [7]. A similar condition, frequently used in studies of optimal model averaging based on cross-validation, can be found in assumption (5.2) of [34] and (24) of [5]. Condition (C.7) states that $\xi_\pi$ approaches infinity at a rate faster than $n^{1/2}$; it is the same as Condition (C.3) of [35] and is implied by (A3) of [36]. Condition (C.8) limits the growth rate of the number of covariates. A similar condition is used in other model averaging studies, such as (22) in [5]. In fact, (22) in [5] can be obtained by combining our Conditions (C.7) and (C.8).
The following theorem states the asymptotic optimality of the model averaging estimator based on the feasible HR $C_p$ criterion.
Theorem 1.
Suppose that Conditions (C.1)–(C.8) hold. Then, we have
$$ \frac{L_{\hat{\pi}}(\hat{w})}{\inf_{w \in W} L_{\hat{\pi}}(w)} \to 1 $$
in probability as $n \to \infty$.
Theorem 1 reveals that, when the selection probability function is estimated by $\pi(T_i, \hat{\eta})$ and the listed conditions are satisfied, the weight vector $\hat{w}$ selected by the feasible HR $C_p$ criterion leads to a squared error loss that is asymptotically identical to that of the infeasible best possible weight vector. This indicates the asymptotic optimality of the resulting model averaging estimator $\hat{\mu}_{\hat{\pi}}(\hat{w})$. The detailed proof of Theorem 1 is given in Appendix A.

3. A Simulation Study

In this section, we conduct a simulation study with five designs to evaluate the performance of the proposed method, including selection of the interior knot number and a comparison of several model selection and model averaging procedures.

3.1. Data Generation Process

Our setup was based on the setting of [26], except that the response variable is subject to missingness. Specifically, we generated data from the following model:
$$ y_i = \mu_i + \epsilon_i = \sum_{p=1}^{200} x_{ip}\beta_p + \sum_{q=1}^{200} z_{iq}\alpha_q(u_i) + \epsilon_i, $$
where $X_i = (x_{i1}, \dots, x_{i,200})^{\top}$ and $Z_i = (z_{i1}, \dots, z_{i,200})^{\top}$ are drawn from a multivariate normal distribution with mean $0$ and covariance matrix $\Lambda = (\lambda_{ij})$ with $\lambda_{ij} = 0.5^{|i-j|}$, $u_i \sim \mathrm{Uniform}(0,1)$, and $\epsilon_i \sim N(0, \zeta^2(x_{i2}^2 + 0.01))$. We changed the value of $\zeta$ so that the population $R^2 = \mathrm{var}(\mu_i)/\mathrm{var}(y_i)$ varied from $0.1$ to $0.9$, where $\mathrm{var}(\cdot)$ was the sample variance. The coefficients of the linear part were set as $\beta_p = 1/p^2$, and the coefficient functions were determined by $\alpha_q(u_i) = \sin(2\pi q u_i)/q$. Under the MAR assumption, we generated the missingness indicator $\delta_i$ from the following two logistic regression models, respectively:
  • Case 1: $\mathrm{logit}\{P(\delta_i = 1 \mid X_i, Z_i, u_i)\} = 1.2 + 0.5u_i + 0.5x_{i1}$;
  • Case 2: $\mathrm{logit}\{P(\delta_i = 1 \mid X_i, Z_i, u_i)\} = 0.1 + 0.7u_i + 0.7x_{i1}$.
For the preceding two cases, the average missing rates (MR) were about 20% and 40%, respectively. In this simulation, we assumed that the parametric function $\pi(T_i, \eta)$ applied in our proposed method was correctly specified in both cases.
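The following R sketch reproduces this data-generating process for $n = 200$, $\zeta = 1$ and Case 1; the variable names are ours.

```r
set.seed(2023)
n <- 200; p <- 200; zeta <- 1
R_chol <- chol(0.5^abs(outer(1:p, 1:p, "-")))   # covariance Lambda with 0.5^|i-j|
X <- matrix(rnorm(n * p), n, p) %*% R_chol
Z <- matrix(rnorm(n * p), n, p) %*% R_chol
u <- runif(n)
beta <- 1 / (1:p)^2
mu <- drop(X %*% beta) +
  rowSums(sapply(1:p, function(q) Z[, q] * sin(2 * pi * q * u) / q))
eps <- rnorm(n, sd = sqrt(zeta^2 * (X[, 2]^2 + 0.01)))  # heteroscedastic error
y <- mu + eps
# Case 1 missingness: logit{P(delta = 1)} = 1.2 + 0.5 u + 0.5 x_{i1}
delta <- rbinom(n, 1, plogis(1.2 + 0.5 * u + 0.5 * X[, 1]))
y_obs <- ifelse(delta == 1, y, NA)              # responses missing at random
```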
To investigate the performance of the methods as comprehensively as possible, the sample sizes were taken to be $n = 100$ and $n = 200$, and five simulation designs with different $M$ and covariate settings were considered. These five designs are displayed in Table 1, in which INT($\cdot$) returns the nearest integer to the corresponding element. So, in Designs 1 and 3, $M = 14$ and $18$ for the two sample sizes. We required every candidate model to contain at least one covariate in the linear part, leading to $2^5 - 1$ candidate models in Designs 2 and 4. In Design 5, each candidate model included at least one covariate of $\{x_{i1}, x_{i2}, x_{i3}, z_{i1}\}$ in the linear part and one covariate of $\{x_{i1}, x_{i2}, x_{i3}, z_{i1}\}$ in the nonparametric part, and each covariate could not appear in both parts. This led to $C_4^1(2^3 - 1) + C_4^2(2^2 - 1) + C_4^3 = 50$ candidate models. In summary, in the first four designs (Designs 1 and 3 for the nested case and Designs 2 and 4 for the non-nested case), there was a priori knowledge of which covariates should enter the nonparametric part of the model, but the specification of the linear part was uncertain. The last design incorporated two types of uncertainty: uncertainty in the choice of variables, and uncertainty on whether a variable should be in the linear or the nonparametric part given that it is already included in the model.

3.2. Estimation and Comparison

3.2.1. Selection of the Knot Number

We used cubic B-splines to approximate each nonparametric function, and the spline basis matrix was produced by the function "bs(·, df)" in the "splines" package of R, where the degrees of freedom satisfy df = 4 + number of interior knots. We assessed the effect of the knot number on the performance of our proposal based on the following risk:
$$ L_{\mu} = \frac{1}{1000} \sum_{r=1}^{1000} \big\| \hat{\mu}_{\hat{\pi}}(\hat{w})^{(r)} - \mu \big\|^2, $$
where 1000 is the number of simulation trials and $\hat{\mu}_{\hat{\pi}}(\hat{w})^{(r)}$ is the model averaging estimator of $\mu$ in the $r$th run.
We set $\zeta = 1$ and $n = 200$ to show the impact of the number of interior knots on the risk of our proposed procedure in the five designs. Since the simulated results were similar for Designs 1 and 2, and for Designs 3 and 4, we only report the results from Designs 1, 3 and 5, which are presented in Figure 2. This figure plots the risk against df for a variety of combinations of designs and missing rates. From Figure 2, we note that, for almost all situations considered, the risk generally tended to increase with the number of knots. In other words, a larger number of knots yielded a more serious oversmoothing effect and, hence, lower estimation accuracy. As suggested by this figure, for our proposed model averaging method, we specified df = 4, which corresponded to the smallest risk. Therefore, in this simulation, we adopted the suggestion of applying df = 4 for all five designs. In other words, the number of knots was set to 0 in our analysis, which results in a basis for ordinary polynomial regression. The number of knots of the B-spline basis function was also set to 0 in [29], which examined the influence of the knot number on the model averaging method for the varying-coefficient model when all data are available.
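A sketch of this risk computation in R, where `simulate_design` and `fit_ma` are hypothetical wrappers for the data generation above and the model averaging steps of Section 2:

```r
# Monte Carlo risk L_mu for a given df over R replications.
risk_for_df <- function(df, R = 1000) {
  mean(replicate(R, {
    dat <- simulate_design()            # assumed to return a list with y, mu, ...
    mu_hat <- fit_ma(dat, df = df)      # model averaging estimate of mu
    sum((mu_hat - dat$mu)^2)
  }))
}
# e.g., sapply(4:10, risk_for_df) traces curves like those in Figure 2.
```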

3.2.2. Alternative Methods

We conducted simulation experiments to assess the finite-sample performance of our proposed model averaging approach, called the HR $C_p$ approach, in the VCPLM with missing data. We compared it with four alternatives, in which the missing data problem was addressed by the IPW method discussed in Section 2. The alternatives included two well-known model selection methods (AIC and BIC) and two widely used model averaging methods (SAIC and SBIC). Along the lines of [32], we defined the AIC and BIC scores under the varying-coefficient partially linear missing data framework as:
$$ \mathrm{AIC}_m = \log(\hat{\sigma}_{(m,\hat{\pi})}^2) + 2n^{-1}\mathrm{tr}(P_{(m)}), $$
and
$$ \mathrm{BIC}_m = \log(\hat{\sigma}_{(m,\hat{\pi})}^2) + n^{-1}\mathrm{tr}(P_{(m)})\log(n), $$
where $\hat{\sigma}_{(m,\hat{\pi})}^2 = n^{-1}\| H_{\hat{\pi}} - \hat{\mu}_{(m,\hat{\pi})} \|^2$. These two model selection methods select the model with the smallest score of the corresponding information criterion. The two model averaging methods, SAIC and SBIC, respectively, assign the weights:
$$ w_{\mathrm{AIC}_m} = \exp(-\mathrm{AIC}_m/2) \Big/ \sum_{m=1}^{M} \exp(-\mathrm{AIC}_m/2) $$
and
$$ w_{\mathrm{BIC}_m} = \exp(-\mathrm{BIC}_m/2) \Big/ \sum_{m=1}^{M} \exp(-\mathrm{BIC}_m/2) $$
to the $m$th candidate model (see the sketch after this paragraph). As suggested by a referee, we also compared our proposal with the Mallows model averaging approach of [29] combined with a complete-case analysis, which simply excludes the individuals with missingness (denoted CC-MMA). We evaluated the performance of these six methods by computing their risks, and the corresponding results for Designs 1–5 are displayed in Figures 3–7, respectively. For better comparison, all risks were normalized by the risk of the AIC model selection method.
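The scores and weights above can be sketched in R as follows; `sigma2` is the $M$-vector of $\hat{\sigma}_{(m,\hat{\pi})}^2$ values and `trP` the $M$-vector of $\mathrm{tr}(P_{(m)})$, with names of our choosing.

```r
ic_methods <- function(sigma2, trP, n) {
  aic <- log(sigma2) + 2 * trP / n
  bic <- log(sigma2) + trP * log(n) / n
  list(AIC  = which.min(aic),                       # selected model index
       BIC  = which.min(bic),
       SAIC = exp(-aic / 2) / sum(exp(-aic / 2)),   # smoothed AIC weights
       SBIC = exp(-bic / 2) / sum(exp(-bic / 2)))   # smoothed BIC weights
}
```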
Besides, following an anonymous referee's suggestion, we compared the computation times of the different model selection and averaging methods. To be more specific, we examined the computation time in seconds of the six methods in the five designs when $n = 100$, $R^2 = 0.1$ and MR = 20%. The corresponding results are listed in Table 2.

3.3. Simulation Results

3.3.1. Risk Comparison

From these five figures, we observe that, in general, the model averaging approaches worked better than the model selection approaches. As shown in most figures, the risk difference in favor of model averaging over model selection was more pronounced when $R^2$ was small or moderate than when $R^2$ was large. This is hardly surprising, as it is hard to identify a single best model in the presence of much noise (corresponding to a small $R^2$), while model averaging shields against selecting a very poor model by compromising across all possible models. On the other hand, when $R^2$ was large, model selection could sometimes be a better strategy than model averaging. A possible reason is that the small noise in the data allows the model selection strategy to select the right model with very high frequency.
As for the comparison of the HR $C_p$ method with its rivals, we found that, no matter whether the candidate models were nested or not, our proposed model averaging method yielded the smallest risk in almost all combinations of simulation designs, sample sizes and missing rates considered, although when $R^2$ was very high, the information criterion-based model averaging methods could sometimes be marginally preferable to ours. The superiority of our method was more marked in Design 5, which was subject to two kinds of uncertainty simultaneously (uncertainty in covariate inclusion and uncertainty in structure), than in Designs 1–4, which only involved uncertainty in the specification of the linear part. This finding provides evidence that our model averaging method is most effective when both the linear and nonlinear components of the model are uncertain, as in most real-world applications. The good performance of our method in finite samples can be partially explained by noting that the optimality of the HR $C_p$ estimator does not depend on the correct specification of the candidate models. As expected, the information criterion-based model averaging methods invariably produced more accurate estimators than their model selection counterparts. The advantage of our approach became more noticeable as the missing rate increased.
To sum up, within the context of the VCPLM with missing responses, and when the missing data are handled by an IPW method, our proposed HR $C_p$ model averaging method performs better than information criterion-based model selection and averaging methods in terms of risk, especially when the model is characterized by much noise. By and large, our results parallel those of [26], which investigated model averaging in the VCPLM with complete data. Additionally, we found evidence that our proposed IPW-based model averaging method, HR $C_p$, enjoys significantly smaller risk than the model averaging method based on complete-case analysis, CC-MMA.

3.3.2. Computation Time Comparison

According to Table 2, it is hardly surprising that the model selection methods always needed less computation time than the model averaging methods in all designs. Among the model averaging methods, the two data-driven methods (CC-MMA and HR $C_p$) spent slightly more time than the two information criterion-based methods (SAIC and SBIC). As for the comparison between CC-MMA and HR $C_p$, it was expected that our method would be slightly slower than CC-MMA because of the need to estimate the unknown propensity score function. In general, from the perspective of computation time, our method was slightly inferior to the other methods, but it greatly dominated its competitors in terms of estimation accuracy. Thus, it is worthwhile to carry out the HR $C_p$ model averaging method to obtain a comparatively accurate estimator, even if a little computation time has to be sacrificed.

4. Real Data Analysis

In this section, we applied our model averaging method to analyze data on aged patients from 36 for-profit nursing homes in San Diego, California, provided in [37] and studied by [26,38]. The response variable, $y$, was the natural logarithm of the number of days in the nursing home. The five covariates were: $x_1$, a binary variable indicating whether the patient was treated at a nursing home; $x_2$, a binary variable indicating whether the patient was male; $x_3$, a binary variable indicating whether the patient was married; $x_4$, a health status variable, with a smaller value indicating a better health condition; and $u = (\mathrm{age} - 64)/(102 - 64)$, the normalized age of the patient, which served as the effect modifier, with age ranging from 65 to 102.
We considered fitting the data by the VCPLM, but we were not sure which of $x_1$, $x_2$, $x_3$ and $x_4$ to include, and we were uncertain whether to assign a variable to the linear or the nonparametric part. Therefore, we considered all possibilities, namely, a variable being in the linear part, in the nonparametric part, or not in the model. Similar to the simulation study, we required all candidate models to include at least one linear and one nonparametric variable. This resulted in 50 possible models. In our analysis, we ignored 332 censored observations from the original data and focused on the remaining 1269 uncensored sample points. Further, we randomly selected $n_0$ observations from the 1269 uncensored observations as the training set and took the remaining $n_1 = n - n_0$ observations as the test set, where $n_0 = 700, 800, 900, 1000$ and $1100$. Since the data points we used were fully observed, to illustrate the application of our method we artificially created missing responses in the training data according to the following missing data mechanism:
$$ \mathrm{logit}\{P(\delta_i = 1 \mid X_i, Z_i, u_i)\} = 1 + 0.4u_i + 0.4x_{i1}. $$
Hence, the corresponding mean missing rate was about 20 % .
We employed the observations in the training set to obtain estimators of the model parameters in each candidate model, and then performed four model averaging (HR $C_p$, CC-MMA, SAIC and SBIC) and two model selection (AIC and BIC) procedures. We fitted each candidate model by applying the estimation method introduced in Section 2. Cubic B-splines were adopted to approximate each coefficient function. Following the suggestion from the simulation study, we set the number of knots to 0. We then evaluated the predictive performance of these six approaches by computing their mean squared prediction error (MSPE). As suggested by [4,26], the observations in the test set were utilized to compute the MSPE as follows:
$$ \mathrm{MSPE} = \frac{1}{n_1} \sum_{i=n_0+1}^{n} (y_i - \hat{\mu}_i)^2, $$
where $\hat{\mu}_i$ is the predicted value for the $i$th patient under each approach. We repeated the above process 500 times and calculated the mean, median and standard deviation (SD) of the MSPEs of the six strategies across the replications. For convenience of comparison, all MSPEs were normalized by dividing by the MSPE of AIC, which we refer to as the relative MSPE (RMSPE). The results are summarized in Table 3.
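One replication of this exercise can be sketched as follows; `dat` is assumed to hold the 1269 uncensored observations with columns `y`, `x1`, ..., `x4`, `u`, and `fit_predict` is a hypothetical wrapper returning test-set predictions for one of the six methods.

```r
one_replication <- function(dat, n0, fit_predict) {
  idx <- sample(nrow(dat), n0)
  train <- dat[idx, ]; test <- dat[-idx, ]
  pr <- plogis(1 + 0.4 * train$u + 0.4 * train$x1)   # artificial MAR mechanism
  train$delta <- rbinom(nrow(train), 1, pr)
  mu_hat <- fit_predict(train, test)                 # predictions on the test set
  mean((test$y - mu_hat)^2)                          # MSPE
}
```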
The results in Table 3 show that, in almost all situations, our proposed HR $C_p$ method had the best predictive efficiency among the six approaches considered. The superiority of our method was particularly obvious in terms of the mean and median, since the smallest mean and median were invariably produced by our method for all training sample sizes. The SBIC always yielded a mean and median that were second to the HR $C_p$ but the best among the remaining five methods. As for the comparison of SDs, we found evidence that our method had an edge over the other methods when $n_0$ was no less than 1000, while the SBIC frequently yielded the smallest SD when $n_0$ was less than 1000. This implies that our HR $C_p$ method outperforms the SBIC method when the size of the training set is large. We further note that all numbers in this table are smaller than 1, which implies that the AIC was the worst method among those considered, irrespective of the performance yardstick.
We also provide the Diebold–Mariano test results for the differences in MSPE, which are displayed in Table 4. A positive/negative test statistic in this table denotes that the estimator in the numerator leads to a bigger/smaller MSPE than the estimator in the denominator. The test statistics and $p$-values listed in columns 3, 6, 7 and 9 provide evidence that the MSPE differences between our proposed HR $C_p$ estimator and the BIC, SAIC, AIC and CC-MMA estimators were statistically significant for all training set sizes. Considering the HR $C_p$ and SBIC estimators, column 8 demonstrates that the advantage of HR $C_p$ over SBIC was statistically significant in the cases with $n_0 = 1000$ and $1100$. However, the same cannot be said about the differences in performance between the HR $C_p$ and SBIC estimators when $n_0$ was less than 1000, as presented in column 8. This result reinforces the intuition that the HR $C_p$ estimator is more reliable than the SBIC estimator when the training set size is large. The test results shown in columns 3–7 indicate that the MSPE differences between the AIC estimator and the remaining five estimators were statistically significant in all situations. The test results given in columns 3, 8, 9 and 10 imply the same about the differences between the BIC and the other five estimators.
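The test itself is available in R; assuming `e1` and `e2` are the test-set prediction error vectors of two methods (say, HR $C_p$ and SBIC) from the same replication, a call of the following form, with squared-error loss (`power = 2`), matches the comparison reported here.

```r
library(forecast)
dm.test(e1, e2, alternative = "two.sided", h = 1, power = 2)
```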

5. Conclusions

Considering model averaging estimation in the VCPLM with missing responses, we proposed an HR $C_p$ weight choice criterion and its feasible form. Our model averaging process can jointly incorporate two layers of model uncertainty: the first concerns which covariates to include, and the second concerns whether a covariate should be in the linear or the nonparametric component. The resulting model averaging estimator is shown to be asymptotically optimal in the sense of achieving the lowest possible squared error loss under certain regularity conditions. The simulation results demonstrated that, in several designs with different types of model uncertainty, our model averaging method always performed much better than existing methods. The real data analysis also revealed the superiority of the proposed strategy.
There are still many issues deserving future research. Firstly, we only considered model averaging for the VCPLM in the context of missing response data, so it would be worthwhile to consider cases where some covariates are also subject to missingness, or where missing data arise in a more general framework, such as the generalized VCPLM, which permits a discrete response variable. Secondly, in our analysis the missing data mechanism was MAR. The development of a model averaging procedure in the more natural, but more complex, non-ignorable missing data case and the establishment of its asymptotic properties remain challenging and warrant future study. Thirdly, our procedure is applicable only when the dimension parameters $p_m$ and $q_m$ are less than the sample size $n$. The development of an asymptotically optimal model averaging method for the high-dimensional VCPLM with missing data is meaningful and, thus, merits future research.

Author Contributions

Conceptualization, W.C.; methodology, J.Z., W.C. and G.H.; software, J.Z. and G.H.; supervision, W.C. and G.H.; writing-original draft, J.Z.; writing—review and editing, G.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Zeng is supported by the Important Natural Science Foundation of Colleges and Universities of Anhui Province (No.KJ2021A0929). The work of Hu is supported by the Important Natural Science Foundation of Colleges and Universities of Anhui Province (No.KJ2021A0930).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in real data analysis is available at: https://www.stats.ox.ac.uk/pub/datasets/csb/ (accessed on 27 January 2023).

Acknowledgments

The authors would like to thank the reviewers and editors for their careful reading and constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Lemma A1.
If Conditions (C.1) and (C.2) hold, then there exists a positive constant $C_{\epsilon_\pi}$ such that:
$$ \max_{1 \le i \le n} E(\epsilon_{\pi,i}^{4K} \mid X_i, Z_i, u_i) \le C_{\epsilon_\pi}, $$
where $K$ is given in Condition (C.2).
Proof of Lemma A1. 
Note that:
$$ |\epsilon_{\pi,i}| = |H_{\pi,i} - \mu_i| = \left| \frac{\delta_i}{\pi(X_i, Z_i, u_i)} y_i - \mu_i \right| \le \frac{|\mu_i| + |\epsilon_i|}{\pi(X_i, Z_i, u_i)} + |\mu_i| \le \frac{|\mu_i| + |\epsilon_i|}{C_\pi} + |\mu_i| \le \frac{C_\mu}{C_\pi} + C_\mu + \frac{|\epsilon_i|}{C_\pi}, $$
where the second inequality follows from Condition (C.1) and the third from Condition (C.2). Let $C_1 = C_\mu/C_\pi + C_\mu$. By means of the $C_p$ inequality, we have:
$$ |\epsilon_{\pi,i}|^{4K} \le 2^{4K-1} \left( C_1^{4K} + \frac{1}{C_\pi^{4K}} |\epsilon_i|^{4K} \right). $$
According to Condition (C.2), we obtain:
$$ \max_{1 \le i \le n} E(\epsilon_{\pi,i}^{4K} \mid X_i, Z_i, u_i) \le C_{\epsilon_\pi}, $$
where $C_{\epsilon_\pi} = 2^{4K-1}(C_1^{4K} + C_\pi^{-4K} C_\epsilon)$. □
Lemma A2.
Under Conditions (C.1) and (C.2), one has $\| H_\pi - H_{\hat{\pi}} \|^2 = O_p(1)$.
Proof of Lemma A2. 
By the Cauchy–Schwarz inequality and a Taylor expansion, this lemma can be proved using arguments similar to those in the proof of Lemma 1 of [13], so we omit the details here. □
Proof of Theorem 1. 
Let $\bar{\lambda}(\cdot)$ denote the largest singular value of a matrix, $\tilde{P}(w)$ be the $n \times n$ diagonal matrix whose $i$th diagonal element is $P_{ii}(w)$, $\Omega_\pi$ be the $n \times n$ diagonal matrix whose $i$th diagonal element is $\sigma_{\pi,i}^2$, $A(w) = I - P(w)$, and $\epsilon_\pi = (\epsilon_{\pi,1}, \dots, \epsilon_{\pi,n})^{\top}$. From Lemma A1, we obtain $\bar{\lambda}(\Omega_\pi) = O(1)$. After some simple calculations, we know that $P_{(m)}$ is an idempotent matrix with $\bar{\lambda}(P_{(m)}) \le 1$ and, hence, $\bar{\lambda}(P(w)) \le \sum_{m=1}^{M} w_m \bar{\lambda}(P_{(m)}) \le 1$ for any $w \in W$. Observe that:
$$ C_{\hat{\pi}}(w) = \| H_{\hat{\pi}} - \hat{\mu}_{\hat{\pi}}(w) \|^2 + 2 \hat{\epsilon}_{\hat{\pi}}^{\top} \tilde{P}(w) \hat{\epsilon}_{\hat{\pi}} = \| H_{\hat{\pi}} - \mu \|^2 + L_{\hat{\pi}}(w) + 2 b_n(w) + 2 d_n(w), $$
where $b_n(w) = (H_{\hat{\pi}} - H_\pi)^{\top}\{\mu - \hat{\mu}_{\hat{\pi}}(w)\}$ and $d_n(w) = \epsilon_\pi^{\top}\{\mu - \hat{\mu}_{\hat{\pi}}(w)\} + \hat{\epsilon}_{\hat{\pi}}^{\top}\tilde{P}(w)\hat{\epsilon}_{\hat{\pi}}$. Since $\| H_{\hat{\pi}} - \mu \|^2$ is unrelated to $w$, minimizing $C_{\hat{\pi}}(w)$ is equivalent to minimizing $C_{\hat{\pi}}(w) - \| H_{\hat{\pi}} - \mu \|^2$. Therefore, to prove Theorem 1, we only need to verify that:
$$ \sup_{w \in W} \left| \frac{L_{\hat{\pi}}(w)}{R_\pi(w)} - 1 \right| = o_p(1), \qquad \mathrm{(A1)} $$
$$ \sup_{w \in W} \left| \frac{b_n(w)}{R_\pi(w)} \right| = o_p(1), \qquad \mathrm{(A2)} $$
$$ \sup_{w \in W} \left| \frac{d_n(w)}{R_\pi(w)} \right| = o_p(1). \qquad \mathrm{(A3)} $$
By the fact that
$$ \left| \frac{L_{\hat{\pi}}(w)}{R_\pi(w)} - 1 \right| = \left| \frac{\| \mu - \hat{\mu}_\pi(w) + \hat{\mu}_\pi(w) - \hat{\mu}_{\hat{\pi}}(w) \|^2}{R_\pi(w)} - 1 \right| \le \left| \frac{L_\pi(w)}{R_\pi(w)} - 1 \right| + 2 \left\{ \frac{L_\pi(w)}{R_\pi(w)} \right\}^{1/2} \frac{\| \hat{\mu}_\pi(w) - \hat{\mu}_{\hat{\pi}}(w) \|}{\{R_\pi(w)\}^{1/2}} + \frac{\| \hat{\mu}_\pi(w) - \hat{\mu}_{\hat{\pi}}(w) \|^2}{R_\pi(w)}, $$
and
$$ \| \hat{\mu}_\pi(w) - \hat{\mu}_{\hat{\pi}}(w) \|^2 = \| P(w) H_\pi - P(w) H_{\hat{\pi}} \|^2 \le \{\bar{\lambda}(P(w))\}^2 \| H_\pi - H_{\hat{\pi}} \|^2 \le \| H_\pi - H_{\hat{\pi}} \|^2, $$
it is readily seen that the result of (A1) is valid if
$$ \sup_{w \in W} \left| \frac{L_\pi(w)}{R_\pi(w)} - 1 \right| = o_p(1), \qquad \mathrm{(A4)} $$
and
$$ \sup_{w \in W} \frac{\| H_\pi - H_{\hat{\pi}} \|^2}{R_\pi(w)} = o_p(1). \qquad \mathrm{(A5)} $$
Note that $L_\pi(w) - R_\pi(w) = \| P(w)\epsilon_\pi \|^2 - 2\epsilon_\pi^{\top} P(w) A(w) \mu - \mathrm{tr}\{P(w) P^{\top}(w) \Omega_\pi\}$; so, to prove (A4), it is sufficient to show that:
$$ \sup_{w \in W} \left| \frac{\| P(w)\epsilon_\pi \|^2 - \mathrm{tr}\{P(w)P^{\top}(w)\Omega_\pi\}}{R_\pi(w)} \right| = o_p(1), \qquad \mathrm{(A6)} $$
and
$$ \sup_{w \in W} \left| \frac{\epsilon_\pi^{\top} P(w) A(w) \mu}{R_\pi(w)} \right| = o_p(1). \qquad \mathrm{(A7)} $$
We observe, for any $\nu > 0$, that:
$$
\begin{aligned}
&\Pr\Big( \sup_{w \in W} \frac{ \big| \|P(w)\epsilon_\pi\|^2 - \mathrm{tr}\{P(w)P^{\top}(w)\Omega_\pi\} \big| }{R_\pi(w)} > \nu \,\Big|\, X, Z, U \Big) \\
&\quad\le \sum_{m=1}^{M} \sum_{m^*=1}^{M} \Pr\Big( \big| \epsilon_\pi^{\top} P^{\top}(w_m^0) P(w_{m^*}^0) \epsilon_\pi - \mathrm{tr}\{P^{\top}(w_m^0) P(w_{m^*}^0) \Omega_\pi\} \big| > \nu \xi_\pi \,\Big|\, X, Z, U \Big) \\
&\quad\le \nu^{-2K} \xi_\pi^{-2K} \sum_{m=1}^{M} \sum_{m^*=1}^{M} E\Big( \big| \epsilon_\pi^{\top} P^{\top}(w_m^0) P(w_{m^*}^0) \epsilon_\pi - \mathrm{tr}\{P^{\top}(w_m^0) P(w_{m^*}^0) \Omega_\pi\} \big|^{2K} \,\Big|\, X, Z, U \Big) \\
&\quad\le C_2 \nu^{-2K} \xi_\pi^{-2K} \sum_{m=1}^{M} \sum_{m^*=1}^{M} \Big[ \mathrm{tr}\big\{ P^{\top}(w_m^0) P(w_{m^*}^0) \Omega_\pi P^{\top}(w_{m^*}^0) P(w_m^0) \Omega_\pi \big\} \Big]^K \\
&\quad\le C_2 \nu^{-2K} \xi_\pi^{-2K} \{\bar{\lambda}(\Omega_\pi)\}^K \{\bar{\lambda}(P(w_{m^*}^0))\}^{2K} M \sum_{m=1}^{M} \Big[ \mathrm{tr}\big\{ P^{\top}(w_m^0) P(w_m^0) \Omega_\pi \big\} \Big]^K \\
&\quad\le C_2 \nu^{-2K} \xi_\pi^{-2K} \{\bar{\lambda}(\Omega_\pi)\}^K M \sum_{m=1}^{M} \{R_\pi(w_m^0)\}^K = o_p(1),
\end{aligned}
$$
where $C_2$ is a constant; the second inequality follows from Chebyshev's inequality, the third inequality from Theorem 2 of [39], and the last inequality from $\bar{\lambda}(P(w_{m^*}^0)) \le 1$ and $\mathrm{tr}\{P^{\top}(w_m^0) P(w_m^0) \Omega_\pi\} \le R_\pi(w_m^0)$, while the final equality is ensured by Condition (C.3). Then, (A6) holds because of the following fact:
$$ \Pr\Big( \sup_{w \in W} \frac{ \big| \|P(w)\epsilon_\pi\|^2 - \mathrm{tr}\{P(w)P^{\top}(w)\Omega_\pi\} \big| }{R_\pi(w)} > \nu \Big) = E\Big\{ \Pr\Big( \sup_{w \in W} \frac{ \big| \|P(w)\epsilon_\pi\|^2 - \mathrm{tr}\{P(w)P^{\top}(w)\Omega_\pi\} \big| }{R_\pi(w)} > \nu \,\Big|\, X, Z, U \Big) \Big\} = o(1).
$$
By means of similar steps, we obtain:
$$
\begin{aligned}
&\Pr\Big( \sup_{w \in W} \frac{ | \epsilon_\pi^{\top} P(w) A(w) \mu | }{R_\pi(w)} > \nu \,\Big|\, X, Z, U \Big) \le \sum_{m=1}^{M} \sum_{m^*=1}^{M} \Pr\big( | \epsilon_\pi^{\top} P(w_m^0) A(w_{m^*}^0) \mu | > \nu \xi_\pi \,\big|\, X, Z, U \big) \\
&\quad\le \nu^{-2K} \xi_\pi^{-2K} \sum_{m=1}^{M} \sum_{m^*=1}^{M} E\big( | \epsilon_\pi^{\top} P(w_m^0) A(w_{m^*}^0) \mu |^{2K} \,\big|\, X, Z, U \big) \le C_3 \nu^{-2K} \xi_\pi^{-2K} \sum_{m=1}^{M} \sum_{m^*=1}^{M} \big\| \Omega_\pi^{1/2} P(w_m^0) A(w_{m^*}^0) \mu \big\|^{2K} \\
&\quad\le C_3 \nu^{-2K} \xi_\pi^{-2K} \sum_{m=1}^{M} \sum_{m^*=1}^{M} \{\bar{\lambda}(P(w_m^0))\}^{2K} \{\bar{\lambda}(\Omega_\pi)\}^K \| A(w_{m^*}^0) \mu \|^{2K} \le C_3 \nu^{-2K} \xi_\pi^{-2K} \{\bar{\lambda}(\Omega_\pi)\}^K M \sum_{m^*=1}^{M} \{ R_\pi(w_{m^*}^0) \}^K = o_p(1),
\end{aligned}
$$
where $C_3$ is a constant, and the last inequality is due to $\bar{\lambda}(P(w_m^0)) \le 1$ and $\| A(w_{m^*}^0)\mu \|^2 \le R_\pi(w_{m^*}^0)$. Therefore, (A7) is satisfied by the previous argument, which, along with (A6), implies (A4). On the other hand, (A5) can easily be obtained from Lemma A2 and Condition (C.7). So, (A1) is correct.
From the Cauchy–Schwarz inequality, (A1), Lemma A2 and Condition (C.7), one has:
$$ \sup_{w \in W} \left| \frac{b_n(w)}{R_\pi(w)} \right| \le \sup_{w \in W} \frac{ \big\{ \| H_{\hat{\pi}} - H_\pi \|^2 \, \| \mu - \hat{\mu}_{\hat{\pi}}(w) \|^2 \big\}^{1/2} }{R_\pi(w)} \le \| H_{\hat{\pi}} - H_\pi \| \sup_{w \in W} \left\{ \frac{L_{\hat{\pi}}(w)}{R_\pi(w)} \right\}^{1/2} \sup_{w \in W} \left\{ \frac{1}{R_\pi(w)} \right\}^{1/2} = o_p(1). $$
So, (A2) is true. In what follows, we provide the proof of (A3), which yields the desired result of Theorem 1.
By the Cauchy–Schwarz inequality and some algebraic manipulations, we obtain:
$$
\begin{aligned}
|d_n(w)| &= \big| \epsilon_\pi^{\top}\{\mu - \hat{\mu}_{\hat{\pi}}(w)\} + \hat{\epsilon}_{\hat{\pi}}^{\top}\tilde{P}(w)\hat{\epsilon}_{\hat{\pi}} \big| \\
&\le | \epsilon_\pi^{\top} A(w) \mu | + \big| \epsilon_\pi^{\top} P(w) \epsilon_\pi - \mathrm{tr}\{\Omega_\pi P(w)\} \big| + \| P(w)\epsilon_\pi \| \cdot \| H_\pi - H_{\hat{\pi}} \| \\
&\quad + \frac{n}{n - l_{M^*}} \bar{\lambda}(\tilde{P}(w)) \| H_\pi - H_{\hat{\pi}} \|^2 + \frac{2n}{n - l_{M^*}} \bar{\lambda}(\tilde{P}(w)) \| H_\pi - H_{\hat{\pi}} \| \cdot \| H_\pi \| \\
&\quad + \big| \hat{\epsilon}_\pi^{\top} \tilde{P}(w) \hat{\epsilon}_\pi - \mathrm{tr}\{\Omega_\pi P(w)\} \big|.
\end{aligned}
$$
Therefore, (A3) is implied by:
$$ \sup_{w \in W} \frac{ | \epsilon_\pi^{\top} A(w) \mu | }{R_\pi(w)} = o_p(1), \qquad \mathrm{(A8)} $$
$$ \sup_{w \in W} \frac{ \big| \epsilon_\pi^{\top} P(w) \epsilon_\pi - \mathrm{tr}\{\Omega_\pi P(w)\} \big| }{R_\pi(w)} = o_p(1), \qquad \mathrm{(A9)} $$
$$ \sup_{w \in W} \frac{ \big| \hat{\epsilon}_\pi^{\top} \tilde{P}(w) \hat{\epsilon}_\pi - \mathrm{tr}\{\Omega_\pi P(w)\} \big| }{R_\pi(w)} = o_p(1), \qquad \mathrm{(A10)} $$
$$ \sup_{w \in W} \frac{ \| P(w) \epsilon_\pi \| }{R_\pi(w)} = o_p(1), \qquad \mathrm{(A11)} $$
$$ \sup_{w \in W} \frac{n}{n - l_{M^*}} \frac{ \bar{\lambda}(\tilde{P}(w)) \| H_\pi - H_{\hat{\pi}} \|^2 }{R_\pi(w)} = o_p(1), \qquad \mathrm{(A12)} $$
and
$$ \sup_{w \in W} \frac{n}{n - l_{M^*}} \frac{ \bar{\lambda}(\tilde{P}(w)) \| H_\pi \| }{R_\pi(w)} = o_p(1). \qquad \mathrm{(A13)} $$
Similar to the proofs of (A7) and (A6), respectively, it is not difficult to obtain (A8) and (A9). As for (A10), it is readily seen that:
$$
\begin{aligned}
\sup_{w \in W} \frac{ \big| \hat{\epsilon}_\pi^{\top} \tilde{P}(w) \hat{\epsilon}_\pi - \mathrm{tr}\{\Omega_\pi P(w)\} \big| }{R_\pi(w)} &\le \sup_{w \in W} \big| \hat{\epsilon}_\pi^{\top} \tilde{P}(w) \hat{\epsilon}_\pi - \mathrm{tr}\{\Omega_\pi \tilde{P}(w)\} \big| \big/ \xi_\pi \\
&\le \sup_{w \in W} \big| \hat{\epsilon}_\pi^{\top} \tilde{P}(w) \hat{\epsilon}_\pi - \epsilon_\pi^{\top} \tilde{P}(w) \epsilon_\pi \big| \big/ \xi_\pi + \sup_{w \in W} \big| \epsilon_\pi^{\top} \tilde{P}(w) \epsilon_\pi - \mathrm{tr}\{\Omega_\pi \tilde{P}(w)\} \big| \big/ \xi_\pi. \qquad \mathrm{(A14)}
\end{aligned}
$$
Following an argument similar to that used in [7], we know that both terms in the second line of (A14) are $o_p(1)$. So, (A10) is valid. We now prove (A11) and (A12). From Lemma A1, we find that $E(\epsilon_{\pi,i}^4) = E\{E(\epsilon_{\pi,i}^4 \mid X_i, Z_i, u_i)\}$ is bounded, and, thus, $\| \epsilon_\pi \| = (\sum_{i=1}^{n} \epsilon_{\pi,i}^2)^{1/2} = O_p(n^{1/2})$. Consequently, based on Condition (C.7), we have:
$$ \sup_{w \in W} \frac{ \| P(w) \epsilon_\pi \| }{R_\pi(w)} \le \bar{\lambda}(P(w)) \| \epsilon_\pi \| / \xi_\pi \le O_p(n^{1/2}) / \xi_\pi = o_p(1). $$
So, we establish (A11). By Condition (C.6), it is easy to show that $\sup_{w \in W} \bar{\lambda}(\tilde{P}(w)) = O_p(n^{-1/2})$. This, together with Conditions (C.7) and (C.8) and Lemma A2, yields:
$$ \sup_{w \in W} \frac{n}{n - l_{M^*}} \frac{ \bar{\lambda}(\tilde{P}(w)) \| H_\pi - H_{\hat{\pi}} \|^2 }{R_\pi(w)} \le \frac{n}{n - l_{M^*}} \sup_{w \in W} \bar{\lambda}(\tilde{P}(w)) \, \| H_\pi - H_{\hat{\pi}} \|^2 \, \xi_\pi^{-1} = O(1)\, O_p(n^{-1/2})\, O_p(1)\, o_p(n^{-1/2}) = o_p(1). $$
So, (A12) is valid. From the triangle inequality, Condition (C.2) and Lemma A1, we see that $\| H_\pi \| \le \| \mu \| + \| \epsilon_\pi \| = O_p(n^{1/2})$. Hence, following the steps used to prove (A12), (A13) is valid. The proof of Theorem 1 is, thus, completed. □

References

  1. Hansen, B.E. Least squares model averaging. Econometrica 2007, 75, 1175–1189. [Google Scholar] [CrossRef]
  2. Wan, A.T.K.; Zhang, X.; Zou, G. Least squares model averaging by Mallows criterion. J. Economet. 2010, 156, 277–283. [Google Scholar] [CrossRef]
  3. Liang, H.; Zou, G.; Wan, A.T.K.; Zhang, X. Optimal weight choice for frequentist model average estimators. J. Am. Stat. Assoc. 2011, 106, 1053–1066. [Google Scholar] [CrossRef]
  4. Hansen, B.E.; Racine, J.S. Jackknife model averaging. J. Economet. 2012, 167, 38–46. [Google Scholar] [CrossRef]
  5. Zhang, X.; Wan, A.T.K.; Zou, G. Model averaging by jackknife criterion in models with dependent data. J. Economet. 2013, 174, 82–94. [Google Scholar] [CrossRef]
  6. Lu, X.; Su, L. Jackknife model averaging for quantile regressions. J. Economet. 2015, 188, 40–58. [Google Scholar] [CrossRef]
  7. Liu, Q.; Okui, R. Heteroscedasticity-robust Cp model averaging. Economet. J. 2013, 16, 463–472. [Google Scholar] [CrossRef]
  8. Zhang, X.; Zou, G.; Carroll, R.J. Model averaging based on Kullback-Leibler distance. Stat. Sinica 2015, 25, 1583–1598. [Google Scholar] [CrossRef] [PubMed]
  9. Zhu, R.; Zhang, X.; Wan, A.T.K.; Zou, G. Kernel averaging estimators. J. Bus. Econ. Stat. 2022, 41, 157–169. [Google Scholar] [CrossRef]
  10. Zhang, X.; Liu, C.A. Model averaging prediction by K-fold cross-validation. J. Economet. 2022, in press. [Google Scholar]
  11. Zhang, X. Model averaging with covariates that are missing completely at random. Econ. Lett. 2013, 121, 360–363. [Google Scholar] [CrossRef]
  12. Fang, F.; Lan, W.; Tong, J.; Shao, J. Model averaging for prediction with fragmentary data. J. Bus. Econ. Stat. 2019, 37, 517–527. [Google Scholar] [CrossRef]
  13. Wei, Y.; Wang, Q.; Liu, W. Model averaging for linear models with responses missing at random. Ann. I. Stat. Math. 2021, 73, 535–553. [Google Scholar] [CrossRef]
  14. Wei, Y.; Wang, Q. Cross-validation-based model averaging in linear models with responses missing at random. Stat. Probabil. Lett. 2021, 171, 108990. [Google Scholar] [CrossRef]
  15. Xie, J.; Yan, X.; Tang, N. A model-averaging method for high-dimensional regression with missing responses at random. Stat. Sinica 2021, 31, 1005–1026. [Google Scholar] [CrossRef]
  16. Li, Q.; Huang, C.J.; Li, D.; Fu, T.T. Semiparametric smooth coefficient models. J. Bus. Econ. Stat. 2002, 20, 412–422. [Google Scholar] [CrossRef]
  17. Zhang, W.; Lee, S.Y.; Song, X. Local polynomial fitting in semivarying coefficient model. J. Multivariate Anal. 2002, 82, 166–188. [Google Scholar] [CrossRef]
  18. Ahmad, I.; Leelahanon, S.; Li, Q. Efficient estimation of a semiparametric partially linear varying coefficient model. Ann. Stat. 2005, 33, 258–283. [Google Scholar] [CrossRef]
  19. Fan, J.; Huang, T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 2005, 11, 1031–1057. [Google Scholar] [CrossRef]
  20. Li, R.; Liang, H. Variable selection in semiparametric regression modeling. Ann. Stat. 2008, 36, 261–286. [Google Scholar] [CrossRef]
  21. Zhao, P.; Xue, L. Variable selection for semiparametric varying coefficient partially linear models. Stat. Probabil. Lett. 2009, 79, 2148–2157. [Google Scholar] [CrossRef]
  22. Zhao, P.; Xue, L. Variable selection for semiparametric varying coefficient partially linear errors-in-variables models. J. Multivariate Anal. 2010, 101, 1872–1883. [Google Scholar] [CrossRef]
  23. Zhao, W.; Zhang, R.; Liu, J.; Lv, Y. Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Ann. I. Stat. Math. 2014, 66, 165–191. [Google Scholar] [CrossRef]
  24. Wang, H.; Zou, G.; Wan, A.T.K. Model averaging for varying-coefficient partially linear measurement error models. Electron. J. Stat. 2012, 6, 1017–1039. [Google Scholar] [CrossRef]
  25. Zeng, J.; Cheng, W.; Hu, G.; Rong, Y. Model averaging procedure for varying-coefficient partially linear models with missing responses. J. Korean Stat. Soc. 2018, 47, 379–394. [Google Scholar] [CrossRef]
  26. Zhu, R.; Wan, A.T.K.; Zhang, X.; Zou, G. A Mallows-type model averaging estimator for the varying-coefficient partially linear model. J. Am. Stat. Assoc. 2019, 114, 882–892. [Google Scholar] [CrossRef]
  27. Hjort, N.L.; Claeskens, G. Frequentist model average estimators. J. Am. Stat. Assoc. 2003, 98, 879–899. [Google Scholar] [CrossRef]
  28. Hu, G.; Cheng, W.; Zeng, J. Model averaging by jackknife criterion for varying-coefficient partially linear models. Commun. Stat.-Theor. M. 2020, 49, 2671–2689. [Google Scholar] [CrossRef]
  29. Xia, X. Model averaging prediction for nonparametric varying-coefficient models with B-spline smoothing. Stat. Pap. 2021, 62, 2885–2905. [Google Scholar] [CrossRef]
  30. Zhang, X.; Wang, W. Optimal model averaging estimation for partially linear models. Stat. Sinica 2019, 29, 693–718. [Google Scholar] [CrossRef]
  31. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  32. Liang, Z.; Chen, X.; Zhou, Y. Mallows model averaging estimation for linear regression model with right censored data. Acta Math. Appl. Sin. E. 2022, 38, 5–23. [Google Scholar] [CrossRef]
  33. Zhang, X.; Liang, H. Focused information criterion and model averaging for generalized additive partial linear models. Ann. Stat. 2011, 39, 174–200. [Google Scholar] [CrossRef]
  34. Li, K.C. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. Ann. Stat. 1987, 15, 958–975. [Google Scholar] [CrossRef]
  35. Zhang, X.; Yu, D.; Zou, G.; Liang, H. Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. J. Am. Stat. Assoc. 2016, 111, 1775–1790. [Google Scholar] [CrossRef]
  36. Ando, T.; Li, K.C. A weighted-relaxed model averaging approach for high-dimensional generalized linear models. Ann. Stat. 2017, 45, 2654–2679. [Google Scholar] [CrossRef]
  37. Morris, C.N.; Norton, E.C.; Zhou, X.H. Parametric duration analysis of nursing home usage. In Case Studies in Biometry; Lang, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L., Greenhouse, J., Eds.; Wiley: New York, NY, USA, 1994. [Google Scholar]
  38. Fan, J.; Lin, H.; Zhou, Y. Local partial-likelihood estimation for lifetime data. Ann. Stat. 2006, 34, 290–325. [Google Scholar] [CrossRef]
  39. Whittle, P. Bounds for the moments of linear and quadratic forms in independent variables. Theor. Probab. Appl. 1960, 5, 331–335. [Google Scholar] [CrossRef]
Figure 1. The flow chart of our research.
Figure 2. The curves of the risk with the number of knots over 1000 replications.
Figure 3. Risk comparisons for Design 1.
Figure 4. Risk comparisons for Design 2.
Figure 5. Risk comparisons for Design 3.
Figure 6. Risk comparisons for Design 4.
Figure 7. Risk comparisons for Design 5.
Table 1. Summary of designs in simulation study.

| Design | M | Covariate Setting |
|---|---|---|
| 1 | INT($3n^{1/3}$) | All candidate models shared a common nonparametric structure of $z_{i1}\alpha_1(u_i)$, and their parametric parts were a set of $\{x_{i1}, x_{i2}, \dots, x_{iM}\}$, with the $m$th candidate model including the first $m$ covariates. In other words, all of the candidate models were nested. |
| 2 | $2^5 - 1$ | Identical to Design 1 except that all candidate models were non-nested, and their linear parts were constructed by varying combinations of $\{x_{i1}, x_{i2}, \dots, x_{i5}\}$. |
| 3 | INT($3n^{1/3}$) | Identical to Design 1 except that all candidate models shared a common nonparametric structure of $z_{i1}\alpha_1(u_i) + z_{i2}\alpha_2(u_i)$. |
| 4 | $2^5 - 1$ | Identical to Design 2 except that all candidate models shared a common nonparametric structure of $z_{i1}\alpha_1(u_i) + z_{i2}\alpha_2(u_i)$. |
| 5 | 50 | The covariate set was $\{x_{i1}, x_{i2}, x_{i3}, z_{i1}\}$. Each candidate model included at least one covariate in the linear part and one covariate in the nonparametric part, and each covariate could not appear in both parts. |
Table 2. Averaged computation time in seconds over 3 runs, when $n = 100$, $R^2 = 0.1$ and MR = 20%.

| Method | Design 1 | Design 2 | Design 3 | Design 4 | Design 5 |
|---|---|---|---|---|---|
| AIC | 0.213 | 0.223 | 0.220 | 0.223 | 0.248 |
| BIC | 0.220 | 0.229 | 0.219 | 0.218 | 0.247 |
| SAIC | 0.222 | 0.232 | 0.224 | 0.225 | 0.254 |
| SBIC | 0.224 | 0.229 | 0.222 | 0.222 | 0.249 |
| CC-MMA | 0.239 | 0.233 | 0.232 | 0.242 | 0.261 |
| HR $C_p$ | 0.251 | 0.242 | 0.246 | 0.254 | 0.284 |
Table 3. The mean, median and SD of RMSPE across 500 repetitions.

| $n_0$ | | BIC | SAIC | SBIC | CC-MMA | HR $C_p$ |
|---|---|---|---|---|---|---|
| 700 | mean | 0.991 | 0.984 | 0.981 | 0.989 | 0.980 |
| | median | 0.997 | 0.989 | 0.988 | 0.993 | 0.985 |
| | SD | 0.624 | 0.660 | 0.573 | 0.622 | 0.619 |
| 800 | mean | 0.993 | 0.987 | 0.985 | 0.990 | 0.982 |
| | median | 0.997 | 0.990 | 0.988 | 0.994 | 0.985 |
| | SD | 0.882 | 0.909 | 0.866 | 0.881 | 0.884 |
| 900 | mean | 0.994 | 0.988 | 0.987 | 0.991 | 0.984 |
| | median | 0.995 | 0.989 | 0.988 | 0.992 | 0.986 |
| | SD | 0.827 | 0.861 | 0.792 | 0.847 | 0.836 |
| 1000 | mean | 0.995 | 0.989 | 0.988 | 0.991 | 0.985 |
| | median | 0.997 | 0.989 | 0.989 | 0.992 | 0.986 |
| | SD | 0.890 | 0.885 | 0.883 | 0.888 | 0.876 |
| 1100 | mean | 0.995 | 0.990 | 0.990 | 0.991 | 0.986 |
| | median | 0.998 | 0.993 | 0.991 | 0.992 | 0.990 |
| | SD | 0.968 | 0.968 | 0.957 | 0.966 | 0.939 |
Table 4. Diebold–Mariano test results for the differences in MSPE. Each column header gives the pair compared as numerator/denominator.

| $n_0$ | | AIC/BIC | AIC/SAIC | AIC/SBIC | AIC/CC-MMA | AIC/HR$C_p$ | BIC/SAIC | BIC/SBIC | BIC/CC-MMA |
|---|---|---|---|---|---|---|---|---|---|
| 700 | DM | 3.622 | 9.693 | 7.738 | 4.147 | 10.528 | 6.196 | 15.908 | 2.165 |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.030 |
| 800 | DM | 5.345 | 15.916 | 11.589 | 9.472 | 15.832 | 6.216 | 18.979 | 10.863 |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 900 | DM | 3.353 | 10.127 | 8.009 | 4.725 | 11.867 | 5.502 | 14.992 | 5.128 |
| | p-value | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1000 | DM | 2.930 | 9.012 | 7.165 | 4.192 | 12.665 | 7.697 | 17.102 | 3.214 |
| | p-value | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | 0.001 |
| 1100 | DM | 3.550 | 12.475 | 8.565 | 7.291 | 13.101 | 5.299 | 12.739 | 4.395 |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |

| $n_0$ | | BIC/HR$C_p$ | SAIC/SBIC | SAIC/CC-MMA | SAIC/HR$C_p$ | SBIC/CC-MMA | SBIC/HR$C_p$ | CC-MMA/HR$C_p$ |
|---|---|---|---|---|---|---|---|---|
| 700 | DM | 9.173 | 3.452 | −4.682 | 6.001 | −7.245 | 0.942 | 11.426 |
| | p-value | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.346 | 0.000 |
| 800 | DM | 12.102 | 3.276 | −4.274 | 8.501 | −6.835 | 1.827 | 12.183 |
| | p-value | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.068 | 0.000 |
| 900 | DM | 8.740 | 2.935 | −1.231 | 7.078 | −5.352 | 1.301 | 10.278 |
| | p-value | 0.000 | 0.000 | 0.218 | 0.000 | 0.000 | 0.193 | 0.000 |
| 1000 | DM | 10.586 | 1.404 | −2.053 | 8.353 | −2.975 | 3.537 | 9.486 |
| | p-value | 0.000 | 0.160 | 0.040 | 0.000 | 0.003 | 0.000 | 0.000 |
| 1100 | DM | 9.937 | 1.154 | −0.892 | 7.721 | −1.626 | 4.149 | 11.254 |
| | p-value | 0.006 | 0.249 | 0.372 | 0.000 | 0.104 | 0.000 | 0.000 |