Article

Robust-BD Estimation and Inference for General Partially Linear Models

Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
*
Author to whom correspondence should be addressed.
Entropy 2017, 19(11), 625; https://doi.org/10.3390/e19110625
Submission received: 10 October 2017 / Revised: 15 November 2017 / Accepted: 16 November 2017 / Published: 20 November 2017

Abstract

The classical quadratic loss for the partially linear model (PLM) and the likelihood function for the generalized PLM are not resistant to outliers. This motivates us to propose a class of "robust-Bregman divergence (BD)" estimators of both the parametric and nonparametric components in the general partially linear model (GPLM), which allows the distribution of the response variable to be partially specified, without being fully known. Using the local-polynomial function estimation method, we propose a computationally efficient procedure for obtaining "robust-BD" estimators and establish the consistency and asymptotic normality of the "robust-BD" estimator of the parametric component $\beta_o$. For inference procedures on $\beta_o$ in the GPLM, we show that the Wald-type test statistic $W_n$ constructed from the "robust-BD" estimators is asymptotically distribution free under the null, whereas the likelihood ratio-type test statistic $\Lambda_n$ is not. This provides an insight into the distinction from the asymptotic equivalence (Fan and Huang 2005) between $W_n$ and $\Lambda_n$ in the PLM constructed from profile least-squares estimators using the non-robust quadratic loss. Numerical examples illustrate the computational effectiveness of the proposed "robust-BD" estimators and robust Wald-type test in the presence of outlying observations.

1. Introduction

Semiparametric models, such as the partially linear model (PLM) and generalized PLM, play an important role in statistics, biostatistics, economics and engineering studies [1,2,3,4,5]. For the response variable Y and covariates $(\mathbf{X}, T)$, where $\mathbf{X} = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ and $T \in \mathcal{T} \subseteq \mathbb{R}^D$, the PLM, which is widely used for continuous responses Y, describes the model structure according to:

$$Y = \mathbf{X}^T \beta_o + \eta_o(T) + \epsilon, \qquad E(\epsilon \mid \mathbf{X}, T) = 0, \tag{1}$$

where $\beta_o = (\beta_{1;o}, \ldots, \beta_{d;o})^T$ is a vector of unknown parameters and $\eta_o(\cdot)$ is an unknown smooth function; the generalized PLM, which is more suited to discrete responses Y and extends the generalized linear model [6], assumes:

$$m(\mathbf{x}, t) = E(Y \mid \mathbf{X} = \mathbf{x}, T = t) = F^{-1}(\mathbf{x}^T \beta_o + \eta_o(t)), \tag{2}$$

$$Y \mid (\mathbf{X}, T) \sim \text{exponential family of distributions}, \tag{3}$$

where F is a known link function. Typically, the parametric component $\beta_o$ is of primary interest, while the nonparametric component $\eta_o(\cdot)$ serves as a nuisance function. For clarity of illustration, this paper focuses on $D = 1$. An important application of the PLM to brain fMRI data was given in [7] for detecting activated brain voxels in response to external stimuli. There, $\beta_o$ corresponds to the hemodynamic response values, which are of primary interest to neuroscientists, and $\eta_o(\cdot)$ is the slowly drifting baseline over time. Determining whether a voxel is activated can be formulated as testing the linear hypotheses,

$$H_0: \mathbf{A} \beta_o = \mathbf{g}_0 \quad \text{versus} \quad H_1: \mathbf{A} \beta_o \ne \mathbf{g}_0, \tag{4}$$

where $\mathbf{A}$ is a given $k \times d$ full row rank matrix and $\mathbf{g}_0$ is a known $k \times 1$ vector.
Estimation of the parametric and nonparametric components of the PLM and generalized PLM has received much attention in the literature. However, the existing work has some limitations: (i) The generalized PLM assumes that $Y \mid (\mathbf{X}, T)$ follows the distribution in (3), so that the likelihood function is fully available. From a practical viewpoint, results for the generalized PLM are not applicable to situations where the distribution of $Y \mid (\mathbf{X}, T)$ either departs from (3) or is incompletely known. (ii) Some commonly used error measures, such as the quadratic loss in the PLM for Gaussian-type responses (see, for example, [7,8]) and the (negative) likelihood function used in the generalized PLM, are not resistant to outliers. The work in [9] studied robust inference based on the kernel regression method for the generalized PLM with a canonical link, using either the (negative) likelihood or (negative) quasi-likelihood as the error measure, and illustrated numerical examples with dimension $d = 1$. However, the quasi-likelihood is not suitable for the exponential loss function (defined in Section 2.1), commonly used in machine learning and data mining. (iii) The work in [8] developed the inference of (4) for the PLM, via the classical quadratic loss as the error measure, and demonstrated that the asymptotic distributions of the likelihood ratio-type statistic and the Wald statistic under the null of (4) are both $\chi_k^2$. It remains unknown whether this conclusion holds when the tests are constructed from robust estimators.
Without completely specifying the distribution of $Y \mid (\mathbf{X}, T)$, we assume:

$$\text{var}(Y \mid \mathbf{X} = \mathbf{x}, T = t) = V(m(\mathbf{x}, t)), \tag{5}$$

with a known functional form of $V(\cdot)$. We refer to a model specified by (2) and (5) as the "general partially linear model" (GPLM). This paper aims to develop robust estimation of the GPLM and robust inference for $\beta_o$, allowing the distribution of $Y \mid (\mathbf{X}, T)$ to be partially specified. To introduce robust estimation, we adopt a broader class of robust error measures, called the "robust-Bregman divergence (BD)" developed in [10] for the GLM, in which BD includes the quadratic loss, the (negative) quasi-likelihood, the exponential loss and many other commonly used error measures as special cases. We propose "robust-BD estimators" for both the parametric and nonparametric components of the GPLM. Distinct from the explicit-form estimators for the PLM using the classical quadratic loss (see [8]), the "robust-BD estimators" for the GPLM do not have closed-form expressions, which makes the theoretical derivation challenging. Moreover, the robust-BD estimators, as numerical solutions to non-linear optimization problems, pose key implementation challenges. Our major contributions are given below.
  • The robust fitting of the nonparametric component $\eta_o(\cdot)$ is formulated using the local-polynomial regression technique [11]. See Section 2.3.
  • We develop a coordinate descent algorithm for the robust-BD estimator of $\beta_o$, which is computationally efficient, particularly when the dimension d is large. See Section 3.
  • Theorems 1 and 2 establish the consistency and asymptotic normality of the proposed robust-BD estimator of $\beta_o$ under the GPLM. See Section 4.
  • For robust inference on $\beta_o$, we propose a robust version of the Wald-type test statistic $W_n$, based on the robust-BD estimators, and justify its validity in Theorems 3–5. It is shown to be asymptotically $\chi^2$ (central) under the null, thus distribution free, and $\chi^2$ (noncentral) under contiguous alternatives. Hence, this result, when applied to the exponential loss as well as other loss functions in the wider class of BD, is practically feasible. See Section 5.1.
  • For robust inference on $\beta_o$, we re-examine the likelihood ratio-type test statistic $\Lambda_n$, constructed by replacing the negative log-likelihood with the robust-BD. Our Theorem 6 reveals that the asymptotic null distribution of $\Lambda_n$ is generally not $\chi^2$, but a linear combination of independent $\chi^2$ variables, with weights relying on unknown quantities. Even in the particular case of using the classical-BD, the limit distribution is not invariant to re-scaling the generating function of the BD. Moreover, the limit null distribution of $\Lambda_n$ (in either the non-robust or robust version) using the exponential loss, which does not belong to the (negative) quasi-likelihood but falls in BD, is always a weighted $\chi^2$, thus limiting its use in practical applications. See Section 5.2.
Simulation studies in Section 6 demonstrate that the proposed class of robust-BD estimators and the robust Wald-type test either compare well with or perform better than their classical non-robust counterparts: the former are less sensitive to outliers than the latter, and both perform comparably well in non-contaminated cases. Section 7 illustrates some real data applications. Section 8 ends the paper with brief discussions. Details of technical derivations are relegated to Appendix A.

2. Robust-BD and Robust-BD Estimators

This section starts with a brief review of BD in Section 2.1 and "robust-BD" in Section 2.2, followed by the proposed "robust-BD" estimators of $\eta_o(\cdot)$ and $\beta_o$ in Section 2.3 and Section 2.4.

2.1. Classical-BD

To broaden the scope of robust estimation and inference, we consider a class of error measures motivated by the Bregman divergence (BD). For a given concave q-function, [12] defined a bivariate function,

$$Q_q(\nu, \mu) = -q(\nu) + q(\mu) + (\nu - \mu)\, q'(\mu). \tag{6}$$

We call $Q_q$ the BD and call q the generating q-function of the BD. For example, a function $q(\mu) = a\mu - \mu^2$ for some constant a yields the quadratic loss $Q_q(Y, \mu) = (Y - \mu)^2$. For a binary response variable Y, $q(\mu) = \min\{\mu, (1 - \mu)\}$ gives the misclassification loss $Q_q(Y, \mu) = I\{Y \ne I(\mu > 1/2)\}$, where $I(\cdot)$ is an indicator function; $q(\mu) = -2\{\mu \log(\mu) + (1 - \mu)\log(1 - \mu)\}$ gives the Bernoulli deviance loss $Q_q(Y, \mu) = -2\{Y \log(\mu) + (1 - Y)\log(1 - \mu)\}$; $q(\mu) = 2\min\{\mu, (1 - \mu)\}$ results in the hinge loss $Q_q(Y, \mu) = \max\{1 - (2Y - 1)\,\text{sign}(\mu - 0.5), 0\}$ of the support vector machine; $q(\mu) = 2\{\mu(1 - \mu)\}^{1/2}$ yields the exponential loss $Q_q(Y, \mu) = \exp[-(Y - 0.5)\log\{\mu/(1 - \mu)\}]$ used in AdaBoost [13]. Moreover, [14] showed that if:

$$q(\mu) = \int_a^{\mu} \frac{s - \mu}{V(s)}\, ds, \tag{7}$$

with a finite constant a such that the integral is well defined, then $Q_q(y, \mu)$ matches the "classical (negative) quasi-likelihood" function.
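To make the correspondence between generating q-functions and their induced BDs concrete, the following minimal sketch (in Python; all function names are ours, not from the paper) numerically verifies via (6) that the quadratic q-function reproduces the quadratic loss and that $q(\mu) = 2\{\mu(1 - \mu)\}^{1/2}$ reproduces the exponential loss.

```python
import numpy as np

def bregman(q, dq, y, mu):
    """Classical BD Q_q(y, mu) = -q(y) + q(mu) + (y - mu) q'(mu), Eq. (6)."""
    return -q(y) + q(mu) + (y - mu) * dq(mu)

# q(mu) = a*mu - mu^2 recovers the quadratic loss (y - mu)^2 for any a.
a = 3.0
q_quad  = lambda mu: a * mu - mu**2
dq_quad = lambda mu: a - 2.0 * mu

# q(mu) = 2*sqrt(mu(1 - mu)) recovers the exponential loss of AdaBoost.
q_exp  = lambda mu: 2.0 * np.sqrt(mu * (1.0 - mu))
dq_exp = lambda mu: (1.0 - 2.0 * mu) / np.sqrt(mu * (1.0 - mu))

mu = np.linspace(0.05, 0.95, 10)
for y in (0.0, 1.0):                      # binary response
    assert np.allclose(bregman(q_quad, dq_quad, y, mu), (y - mu)**2)
    exp_loss = np.exp(-(y - 0.5) * np.log(mu / (1.0 - mu)))
    assert np.allclose(bregman(q_exp, dq_exp, y, mu), exp_loss)
print("quadratic and exponential losses recovered from their q-functions")
```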

2.2. Robust-BD $\rho_q(y, \mu)$

Let $r(y, \mu) = (y - \mu)/\sqrt{V(\mu)}$ denote the Pearson residual, which reduces to the standardized residual for linear models. In contrast to the "classical-BD", denoted by $Q_q$ in (6), the "robust-BD" developed in [10] for the GLM [6] is formed by:

$$\rho_q(y, \mu) = \int_y^{\mu} \psi(r(y, s))\, \{q''(s)\sqrt{V(s)}\}\, ds - G(\mu), \tag{8}$$

where $\psi(r)$ is chosen to be a bounded, odd function, such as the Huber $\psi$-function [15], $\psi(r) = r \min(1, c/|r|)$, and the bias-correction term, $G(\mu)$, entails the Fisher consistency of the parameter estimator and satisfies:

$$G'(\mu) = G_1(\mu)\, \{q''(\mu)\sqrt{V(\mu)}\}, \tag{9}$$

with

$$G_1(m(\mathbf{x}, t)) = E\{\psi(r(Y, m(\mathbf{x}, t))) \mid \mathbf{X} = \mathbf{x}, T = t\}.$$
We now discuss features of the "robust-BD". To facilitate the discussion, we first introduce some necessary notation. Assume that the quantities:

$$p_j(y; \theta) = \frac{\partial^j}{\partial \theta^j}\, \rho_q(y, F^{-1}(\theta)), \quad j = 0, 1, \ldots, \tag{10}$$

exist finitely up to any order required. Then, we have the following expressions,

$$\begin{aligned}
p_1(y; \theta) &= \{\psi(r(y, \mu)) - G_1(\mu)\}\, \{q''(\mu)\sqrt{V(\mu)}\}/F'(\mu), \\
p_2(y; \theta) &= A_0(y, \mu) + \{\psi(r(y, \mu)) - G_1(\mu)\}\, A_1(\mu), \\
p_3(y; \theta) &= A_2(y, \mu) + \{\psi(r(y, \mu)) - G_1(\mu)\}\, A_1'(\mu)/F'(\mu),
\end{aligned} \tag{11}$$

where $\mu = F^{-1}(\theta)$,

$$A_0(y, \mu) = \Big[\psi'(r(y, \mu))\Big\{1 + \frac{(y - \mu)\, V'(\mu)}{2 V(\mu)}\Big\} + G_1'(\mu)\sqrt{V(\mu)}\Big]\, \frac{\{-q''(\mu)\}}{\{F'(\mu)\}^2},$$

$A_1(\mu) = [\{q^{(3)}(\mu)\sqrt{V(\mu)} + 2^{-1} q''(\mu) V'(\mu)/\sqrt{V(\mu)}\}\, F'(\mu) - q''(\mu)\sqrt{V(\mu)}\, F''(\mu)]/\{F'(\mu)\}^3$ and $A_2(y, \mu) = [\partial A_0(y, \mu)/\partial \mu + \partial\{\psi(r(y, \mu)) - G_1(\mu)\}/\partial \mu \cdot A_1(\mu)]/F'(\mu)$. In particular, $p_1(y; \theta)$ contains $\psi(r)$; $p_2(y; \theta)$ contains $\psi(r)$, $\psi'(r)$ and $\psi'(r) r$; $p_3(y; \theta)$ contains $\psi(r)$, $\psi'(r)$, $\psi'(r) r$, $\psi''(r)$, $\psi''(r) r$ and $\psi''(r) r^2$, where $r = r(y, \mu) = (y - \mu)/\sqrt{V(\mu)}$ denotes the Pearson residual. Accordingly, $\{p_j(y; \theta) : j = 1, 2, 3\}$ depend on y through $\psi(r)$ and its derivatives coupled with r. Then, we observe from (9) and (11) that:

$$E\{p_1(Y; \mathbf{X}^T \beta_o + \eta_o(T)) \mid \mathbf{X}, T\} = 0. \tag{12}$$
In the particular case of $\psi(r) = r$, it is readily seen from (9) that $G_1(\cdot) = 0$, and thus $G(\cdot) = 0$. In that case, the proposed "robust-BD" $\rho_q(y, \mu)$ reduces to the "classical-BD" $Q_q(y, \mu)$.
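As an illustration of (8), the sketch below evaluates the robust-BD by numerical quadrature in the simplest Gaussian setting, assuming $V(s) \equiv 1$, an identity link and the quadratic generating function $q(s) = -s^2$ (so $q''(s) = -2$ and, by the symmetry of the Huber $\psi$ under symmetric errors, the bias-correction term G vanishes). In this case $\rho_q$ reduces to the familiar Huber loss, which the code checks against the closed form; the helper names are ours.

```python
import numpy as np
from scipy.integrate import quad

C = 1.345  # Huber tuning constant

def psi(r, c=C):
    """Huber psi-function: psi(r) = r * min(1, c/|r|)."""
    return np.clip(r, -c, c)

def rho_q(y, mu, c=C):
    """Robust-BD rho_q(y, mu) of Eq. (8) for the Gaussian sketch:
    V(s) = 1, q(s) = -s^2 (so q''(s) = -2), and G = 0 by symmetry."""
    val, _ = quad(lambda s: psi(y - s, c) * (-2.0), y, mu)
    return val

def huber_loss(y, mu, c=C):
    """Closed form: (y - mu)^2 if |y - mu| <= c, else 2c|y - mu| - c^2."""
    a = abs(y - mu)
    return a**2 if a <= c else 2.0 * c * a - c**2

for y, mu in [(0.3, 0.9), (2.0, -1.5), (0.0, 5.0)]:
    assert np.isclose(rho_q(y, mu), huber_loss(y, mu))
print("robust-BD with Huber psi reproduces the Huber loss in this sketch")
```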

2.3. Local-Polynomial Robust-BD Estimator of $\eta_o(\cdot)$

Let $\{(Y_i, \mathbf{X}_i, T_i)\}_{i=1}^n$ be i.i.d. observations of $(Y, \mathbf{X}, T)$ captured by the GPLM in (2) and (5), where the dimension $d \ge 1$ is a finite integer. From (2), it is directly observed that if the true value of $\beta_o$ is known, then estimating $\eta_o(\cdot)$ becomes estimating a nonparametric function; conversely, if the actual form of $\eta_o(\cdot)$ is available, then estimating $\beta_o$ amounts to estimating a vector parameter.
To motivate the estimation of $\eta_o(\cdot)$ at a fitting point t, a proper way to characterize $\eta_o(t)$ is desired. For any given value of $\beta$, define:

$$S(a; t, \beta) = E\{\rho_q(Y, F^{-1}(\mathbf{X}^T \beta + a))\, w_1(\mathbf{X}) \mid T = t\}, \tag{13}$$

where a is a scalar, $\rho_q(y, \mu)$ is the "robust-BD" defined in (8), which aims to guard against outlying observations in the response space of Y, and $w_1(\cdot) \ge 0$ is a given bounded weight function that downweights high-leverage points in the covariate space of $\mathbf{X}$. See Section 6 and Section 7 for an example of $w_1(\mathbf{x})$. Set:

$$\eta_\beta(t) = \arg\min_{a \in \mathbb{R}^1} S(a; t, \beta). \tag{14}$$

Theoretically, $\eta_o(t) = \eta_{\beta_o}(t)$ will be assumed (in Condition A3) for obtaining asymptotically unbiased estimators of $\eta_o(\cdot)$. This property indeed holds, for example, when the classical quadratic loss combined with an identity link is used in (14). Thus, we call $\eta_\beta(\cdot)$ the "surrogate function" for $\eta_o(\cdot)$.
The characterization of the surrogate function $\eta_\beta(t)$ in (14) enables us to develop its robust-BD estimator $\widehat{\eta}_\beta(t)$ based on nonparametric function estimation. Assume that $\eta_o(\cdot)$ is $(p+1)$-times continuously differentiable at the fitting point t. Denote by $\mathbf{a}_o(t) = (\eta_o(t), (\eta_o)^{(1)}(t), \ldots, (\eta_o)^{(p)}(t)/p!)^T \in \mathbb{R}^{p+1}$ the vector consisting of $\eta_o(t)$ along with its (re-scaled) derivatives. For observed covariates $T_i$ close to the point t, the Taylor expansion implies that:

$$\eta_o(T_i) \approx \eta_o(t) + (T_i - t)(\eta_o)^{(1)}(t) + \cdots + (T_i - t)^p (\eta_o)^{(p)}(t)/p! = \mathbf{t}_i(t)^T \mathbf{a}_o(t), \tag{15}$$

where $\mathbf{t}_i(t) = (1, (T_i - t), \ldots, (T_i - t)^p)^T$. For any given value of $\beta$, let $\widehat{\mathbf{a}}(t; \beta) = (\widehat{a}_0(t; \beta), \widehat{a}_1(t; \beta), \ldots, \widehat{a}_p(t; \beta))^T$ be the minimizer of the criterion function,

$$S_n(\mathbf{a}; t, \beta) = \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\mathbf{X}_i^T \beta + \mathbf{t}_i(t)^T \mathbf{a}))\, w_1(\mathbf{X}_i)\, K_h(T_i - t), \tag{16}$$

with respect to $\mathbf{a} \in \mathbb{R}^{p+1}$, where $K_h(\cdot) = K(\cdot/h)/h$ is re-scaled from a kernel function K and $h > 0$ is termed a bandwidth parameter. The first entry of $\widehat{\mathbf{a}}(t; \beta)$ supplies the local-polynomial robust-BD estimator $\widehat{\eta}_\beta(t)$ of $\eta_\beta(t)$, i.e.,

$$\widehat{\eta}_\beta(t) = \mathbf{e}_{1, p+1}^T \arg\min_{\mathbf{a} \in \mathbb{R}^{p+1}} S_n(\mathbf{a}; t, \beta), \tag{17}$$

where $\mathbf{e}_{j, p+1}$ denotes the j-th column of the $(p+1) \times (p+1)$ identity matrix.
Note that the reliance of $\widehat{\eta}_\beta(t)$ on $\beta$ means that it is not guaranteed to be consistent for $\eta_o(t)$. Nonetheless, it is anticipated from the uniform consistency of $\widehat{\eta}_{\widehat{\beta}}$ in Lemma 1 that $\widehat{\eta}_{\widehat{\beta}}(t)$ will offer a valid estimator of $\eta_o(t)$, provided that $\widehat{\beta}$ consistently estimates $\beta_o$. Section 2.4 will discuss our proposed robust-BD estimator $\widehat{\beta}$. Furthermore, Lemma 1 will assume (in Condition A1) that $\eta_\beta(t)$ is the unique minimizer of $S(a; t, \beta)$ with respect to a.
Remark 1.
The case of "kernel estimation", or locally-constant estimation, corresponds to the choice of degree $p = 0$ in (15). In that case, the criterion function in (16) and the estimator in (17) reduce to:

$$S_n(a; t, \beta) = \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\mathbf{X}_i^T \beta + a))\, w_1(\mathbf{X}_i)\, K_h(T_i - t), \tag{18}$$

$$\widehat{\eta}_\beta(t) = \arg\min_{a \in \mathbb{R}^1} S_n(a; t, \beta), \tag{19}$$

respectively.
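The kernel case (18)-(19) is simple enough to sketch directly. The toy code below (our own illustration under the Gaussian sketch used earlier, where the robust-BD reduces to the Huber loss; not the authors' implementation) minimizes $S_n(a; t, \beta)$ over the scalar a, using the Epanechnikov kernel employed later in Section 6.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(r, c=1.345):
    """Huber loss: the robust-BD rho_q of Eq. (8) in the Gaussian sketch
    (identity link, V = 1, quadratic q, G = 0)."""
    a = np.abs(r)
    return np.where(a <= c, a**2, 2.0 * c * a - c**2)

def eta_hat(t, beta, X, T, Y, h):
    """Local-constant (p = 0) robust-BD estimate of eta_beta(t), Eq. (19),
    with w_1 = 1 and the Epanechnikov kernel."""
    K = 0.75 * np.maximum(1.0 - ((T - t) / h)**2, 0.0) / h  # K_h(T_i - t)
    r = Y - X @ beta                                        # partial residuals
    S_n = lambda a: np.mean(huber(r - a) * K)               # Eq. (18)
    return minimize_scalar(S_n, bounds=(r.min(), r.max()),
                           method="bounded").x

# toy illustration: eta_o(0.25) = 2 sin(1.5 pi) = -2 should be recovered
rng = np.random.default_rng(0)
n, beta_o = 500, np.array([2.0, -2.0])       # our own toy design
X = rng.normal(size=(n, 2)); T = rng.uniform(size=n)
Y = X @ beta_o + 2.0 * np.sin(np.pi * (1.0 + 2.0 * T)) + rng.normal(size=n)
print(eta_hat(0.25, beta_o, X, T, Y, h=0.1), 2.0 * np.sin(1.5 * np.pi))
```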

2.4. Robust-BD Estimator of $\beta_o$

For any given value of $\beta$, define:

$$J(\beta, \eta_\beta) = E\{\rho_q(Y, F^{-1}(\mathbf{X}^T \beta + \eta_\beta(T)))\, w_2(\mathbf{X})\}, \tag{20}$$

where $\eta_\beta(\cdot)$ is as defined in (14) and $w_2(\cdot)$ plays the same role as $w_1(\cdot)$ in (13). Theoretically, it is anticipated that:

$$\beta_o = \arg\min_{\beta \in \mathbb{R}^d} J(\beta, \eta_\beta), \tag{21}$$

which holds, for example, when the classical quadratic loss combined with an identity link is used. To estimate $\beta_o$, it is natural to replace (20) by its sample-based criterion,

$$J_n(\beta, \widehat{\eta}_\beta) = \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\mathbf{X}_i^T \beta + \widehat{\eta}_\beta(T_i)))\, w_2(\mathbf{X}_i), \tag{22}$$

where $\widehat{\eta}_\beta(\cdot)$ is as defined in (17). Hence, a parametric estimator of $\beta_o$ is provided by:

$$\widehat{\beta} = \arg\min_{\beta \in \mathbb{R}^d} J_n(\beta, \widehat{\eta}_\beta). \tag{23}$$

Finally, the estimator of $\eta_o(\cdot)$ is given by:

$$\widehat{\eta}(\cdot) = \widehat{\eta}_{\widehat{\beta}}(\cdot).$$
To achieve asymptotic normality of $\widehat{\beta}$, Theorem 2 assumes (in Condition A2) that $\beta_o$ is the unique minimizer in (21), a standard condition for consistent M-estimators [16].
By comparison, $w_1(\cdot)$ in (16) is used to robustify the covariates $\mathbf{X}_i$ in estimating $\eta_o(\cdot)$, $w_2(\cdot)$ in (22) is used to robustify the covariates $\mathbf{X}_i$ in estimating $\beta_o$, and $\rho_q(\cdot, \cdot)$ serves to robustify the responses $Y_i$ in both estimation procedures.

3. Two-Step Iterative Algorithm for Robust-BD Estimation

In the special case of the classical quadratic loss combined with an identity link function, the robust-BD estimators of the parametric and nonparametric components have the explicit expressions,

$$\widehat{\beta} = (\widetilde{\mathbf{X}}^T \mathbf{w}_2 \widetilde{\mathbf{X}})^{-1} (\widetilde{\mathbf{X}}^T \mathbf{w}_2 \widetilde{\mathbf{y}}), \qquad (\widehat{\eta}(T_1), \ldots, \widehat{\eta}(T_n))^T = \mathbf{S}_h (\mathbf{y} - \mathbf{X} \widehat{\beta}), \tag{24}$$

where $\mathbf{w}_2 = \text{diag}(w_2(\mathbf{X}_1), \ldots, w_2(\mathbf{X}_n))$, $\widetilde{\mathbf{y}} = (\mathbf{I} - \mathbf{S}_h)\mathbf{y}$, $\widetilde{\mathbf{X}} = (\mathbf{I} - \mathbf{S}_h)\mathbf{X}$, with $\mathbf{I}$ an identity matrix, $\mathbf{y} = (Y_1, \ldots, Y_n)^T$, $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)^T$ the design matrix,

$$\mathbf{S}_h = \begin{pmatrix} \mathbf{e}_{1, p+1}^T [\{\mathbf{T}(T_1)\}^T \mathbf{W}_{w_1; K}(T_1)\, \mathbf{T}(T_1)]^{-1} \{\mathbf{T}(T_1)\}^T \mathbf{W}_{w_1; K}(T_1) \\ \vdots \\ \mathbf{e}_{1, p+1}^T [\{\mathbf{T}(T_n)\}^T \mathbf{W}_{w_1; K}(T_n)\, \mathbf{T}(T_n)]^{-1} \{\mathbf{T}(T_n)\}^T \mathbf{W}_{w_1; K}(T_n) \end{pmatrix},$$

and:

$$\mathbf{T}(t) = (\mathbf{t}_1(t), \ldots, \mathbf{t}_n(t))^T, \qquad \mathbf{W}_{w_1; K}(t) = \text{diag}\{w_1(\mathbf{X}_i)\, K_h(T_i - t) : i = 1, \ldots, n\}.$$

When $w_1(\mathbf{x}) = w_2(\mathbf{x}) \equiv 1$, (24) reduces to the "profile least-squares estimators" of [8].
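For the quadratic-loss special case, (24) can be coded in a few lines. The sketch below (our own illustration, assuming $w_1 = w_2 \equiv 1$, i.e., the profile least-squares case, the local-linear degree $p = 1$ and the Epanechnikov kernel) builds the smoother matrix $\mathbf{S}_h$ row by row and then forms $\widehat{\beta}$ and $\widehat{\eta}$.

```python
import numpy as np

def smoother_matrix(T, h, p=1):
    """Rows of S_h: local-polynomial hat weights
    e_1^T [T(t)^T W(t) T(t)]^{-1} T(t)^T W(t), with w_1 = 1 and the
    Epanechnikov kernel. Assumes enough points fall in each window."""
    n = len(T); S = np.zeros((n, n))
    for i, t in enumerate(T):
        u = (T - t) / h
        K = 0.75 * np.maximum(1.0 - u**2, 0.0) / h
        Tt = np.vander(T - t, p + 1, increasing=True)  # columns 1, (T_j - t), ...
        A = Tt.T * K                                   # T(t)^T W(t)
        S[i] = np.linalg.solve(A @ Tt, A)[0]           # first row only
    return S

def profile_ls(X, T, Y, h):
    """Profile least-squares estimators of Eq. (24) with w_2 = 1."""
    S = smoother_matrix(T, h)
    Xt, Yt = X - S @ X, Y - S @ Y                      # (I - S_h) X, (I - S_h) y
    beta = np.linalg.solve(Xt.T @ Xt, Xt.T @ Yt)
    eta = S @ (Y - X @ beta)                           # fitted eta(T_1..T_n)
    return beta, eta
```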
In other cases, the robust-BD estimators from (17) and (23) do not have closed-form expressions and must be solved numerically, which is computationally challenging and intensive. We now describe a two-step robust proposal for iteratively estimating $\beta_o$ and $\eta_o(\cdot)$. Let $\widehat{\beta}^{[k-1]}$ and $\{\widehat{\eta}^{[k-1]}(T_i)\}_{i=1}^n$ denote the estimates in the $(k-1)$-th iteration, where $\widehat{\eta}^{[k-1]}(\cdot) = \widehat{\eta}_{\widehat{\beta}^{[k-1]}}(\cdot)$. The k-th iteration consists of the two steps below.
  • Step 1: Instead of solving (23) directly, we solve a surrogate optimization problem, $\widehat{\beta}^{[k]} = \arg\min_{\beta \in \mathbb{R}^d} J_n(\beta, \widehat{\eta}^{[k-1]})$. This minimizer approximates $\widehat{\beta}$.
  • Step 2: Obtain $\widehat{\eta}^{[k]}(T_i) = \widehat{\eta}_{\widehat{\beta}^{[k]}}(T_i)$, $i = 1, \ldots, n$, where $\widehat{\eta}_\beta(t)$ is defined in (17).
The algorithm terminates once $\|\widehat{\beta}^{[k]} - \widehat{\beta}^{[k-1]}\|$ falls below some pre-specified threshold and all $\{\widehat{\eta}^{[k]}(T_i)\}_{i=1}^n$ stabilize.

3.1. Step 1

For the above two-step algorithm, we first elaborate on the procedure for acquiring $\widehat{\beta}^{[k]}$ in Step 1, by extending the coordinate descent (CD) iterative algorithm [17], designed for penalized estimation, to the current robust-BD estimation; it is computationally efficient. For any given value of $\eta$, a Taylor expansion around some initial estimate $\widetilde{\beta}$ (for example, $\widehat{\beta}^{[k-1]}$) gives the weighted quadratic approximation,

$$\rho_q(Y_i, F^{-1}(\mathbf{X}_i^T \beta + \eta)) \approx \frac{1}{2}\, s_i^I\, (Z_i^I - \mathbf{X}_i^T \beta)^2 + C_i,$$

where $C_i$ is a constant not depending on $\beta$,

$$s_i^I = p_2(Y_i; \mathbf{X}_i^T \widetilde{\beta} + \eta), \qquad Z_i^I = \mathbf{X}_i^T \widetilde{\beta} - p_1(Y_i; \mathbf{X}_i^T \widetilde{\beta} + \eta)/p_2(Y_i; \mathbf{X}_i^T \widetilde{\beta} + \eta),$$

with $p_j(y; \theta)$ defined in (10). Hence,

$$J_n(\beta, \eta) = \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\mathbf{X}_i^T \beta + \eta))\, w_2(\mathbf{X}_i) \approx \frac{1}{2} \sum_{i=1}^n n^{-1} s_i^I\, w_2(\mathbf{X}_i)\, (Z_i^I - \mathbf{X}_i^T \beta)^2 + \text{constant}.$$

Thus, it suffices to minimize $\sum_{i=1}^n s_i^I\, w_2(\mathbf{X}_i)\, (Z_i^I - \mathbf{X}_i^T \beta)^2$ with respect to $\beta$, using a coordinate descent (CD) updating procedure. Suppose that the current estimate is $\widehat{\beta}^{\text{old}} = (\widehat{\beta}_1^{\text{old}}, \ldots, \widehat{\beta}_d^{\text{old}})^T$, with the current residual vector $\widehat{\mathbf{r}}^{\text{old}} = (\widehat{r}_1^{\text{old}}, \ldots, \widehat{r}_n^{\text{old}})^T = \mathbf{z}^I - \mathbf{X} \widehat{\beta}^{\text{old}}$, where $\mathbf{z}^I = (Z_1^I, \ldots, Z_n^I)^T$ is the vector of pseudo-responses. Adopting the Newton–Raphson algorithm, the estimate of the j-th coordinate is updated from $\widehat{\beta}_j^{\text{old}}$ to:

$$\widehat{\beta}_j^{\text{new}} = \widehat{\beta}_j^{\text{old}} + \frac{\sum_{i=1}^n \{s_i^I\, w_2(\mathbf{X}_i)\}\, \widehat{r}_i^{\text{old}} X_{i,j}}{\sum_{i=1}^n \{s_i^I\, w_2(\mathbf{X}_i)\}\, X_{i,j}^2}.$$

Accordingly, the residuals are updated to:

$$\widehat{r}_i^{\text{new}} = \widehat{r}_i^{\text{old}} - X_{i,j}\, (\widehat{\beta}_j^{\text{new}} - \widehat{\beta}_j^{\text{old}}), \quad i = 1, \ldots, n.$$

Cycling through $j = 1, \ldots, d$, we obtain the estimate $\widehat{\beta}^{\text{new}} = (\widehat{\beta}_1^{\text{new}}, \ldots, \widehat{\beta}_d^{\text{new}})^T$; a sketch of one such sweep is given below. Now, set $\eta = \widehat{\eta}^{[k-1]}$ and $\widetilde{\beta} = \widehat{\beta}^{[k-1]}$. Iterate the process of weighted quadratic approximation followed by CD updating until the estimate $\widehat{\beta}^{\text{new}}$ stabilizes to the solution $\widehat{\beta}^{[k]}$.
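One CD sweep of the above updates can be sketched as follows, taking as inputs the pseudo-responses $Z_i^I$ and weights $s_i^I$ computed from $p_1$ and $p_2$ at the current iterate (variable names are ours).

```python
import numpy as np

def cd_sweep(beta, X, z, s, w2):
    """One coordinate-descent sweep for minimizing
    sum_i s_i w2_i (z_i - x_i^T beta)^2 over beta, using the O(n)
    residual-updating identities of Section 3.1. Inputs: pseudo-responses
    z = Z^I and weights s = s^I from the quadratic approximation."""
    beta = np.asarray(beta, dtype=float).copy()
    r = z - X @ beta                 # current residual vector
    sw = s * w2                      # combined weights s_i^I w_2(X_i)
    for j in range(X.shape[1]):
        xj = X[:, j]
        delta = (sw * r) @ xj / (sw * xj**2).sum()   # coordinate update
        beta[j] += delta
        r -= xj * delta              # refresh residuals after the update
    return beta, r
```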
The validity of $\widehat{\beta}^{[k]}$ in Step 1 converging to the true parameter $\beta_o$ is justified as follows. (i) Standard results for M-estimation [16] indicate that the minimizer of $J_n(\beta, \eta_{\beta_o})$ is consistent for $\beta_o$. (ii) According to Theorem 1 (ii) in Section 4.1, $\sup_{t \in \mathcal{T}} |\widehat{\eta}_{\widehat{\beta}}(t) - \eta_{\beta_o}(t)| \stackrel{P}{\longrightarrow} 0$ for a compact set $\mathcal{T}$, where $\stackrel{P}{\longrightarrow}$ stands for convergence in probability. Derivations similar to those of (A4) give $\sup_{\beta \in \mathcal{K}} |J_n(\beta, \widehat{\eta}_{\widehat{\beta}}) - J_n(\beta, \eta_{\beta_o})| \stackrel{P}{\longrightarrow} 0$ for any compact set $\mathcal{K}$. Thus, minimizing $J_n(\beta, \widehat{\eta}_{\widehat{\beta}})$ is asymptotically equivalent to minimizing $J_n(\beta, \eta_{\beta_o})$. (iii) Similarly, provided that $\widehat{\beta}^{[k-1]}$ is close to $\widehat{\beta}$, minimizing $J_n(\beta, \widehat{\eta}_{\widehat{\beta}^{[k-1]}})$ is asymptotically equivalent to minimizing $J_n(\beta, \widehat{\eta}_{\widehat{\beta}})$. Assembling these three results with the definition of $\widehat{\beta}^{[k]}$ yields:

$$\widehat{\beta}^{[k]} = \arg\min_\beta J_n(\beta, \widehat{\eta}_{\widehat{\beta}^{[k-1]}}) = \arg\min_\beta J_n(\beta, \widehat{\eta}_{\widehat{\beta}}) + o_P(1) = \arg\min_\beta J_n(\beta, \eta_{\beta_o}) + o_P(1) = \beta_o + o_P(1).$$

3.2. Step 2

In Step 2, obtaining $\widehat{\eta}_\beta(t)$ for any given values of $\beta$ and t is equivalent to minimizing $S_n(\mathbf{a}; t, \beta)$ in (16). Notice that the dimension $(p+1)$ of $\mathbf{a}$ is typically low, with degrees $p = 0$ or $p = 1$ being the most commonly used in practice. Hence, the minimizer of $S_n(\mathbf{a}; t, \beta)$ can be obtained by directly applying the Newton–Raphson iteration: for $k = 0, 1, \ldots,$

$$\mathbf{a}^{[k+1]}(t; \beta) = \mathbf{a}^{[k]}(t; \beta) - \Big\{\frac{\partial^2 S_n(\mathbf{a}; t, \beta)}{\partial \mathbf{a}\, \partial \mathbf{a}^T}\Big|_{\mathbf{a} = \mathbf{a}^{[k]}(t; \beta)}\Big\}^{-1}\, \frac{\partial S_n(\mathbf{a}; t, \beta)}{\partial \mathbf{a}}\Big|_{\mathbf{a} = \mathbf{a}^{[k]}(t; \beta)},$$

where $\mathbf{a}^{[k]}(t; \beta)$ denotes the estimate in the k-th iteration, and:

$$\frac{\partial S_n(\mathbf{a}; t, \beta)}{\partial \mathbf{a}} = \frac{1}{n} \sum_{i=1}^n p_1(Y_i; \mathbf{X}_i^T \beta + \mathbf{t}_i(t)^T \mathbf{a})\, \mathbf{t}_i(t)\, w_1(\mathbf{X}_i)\, K_h(T_i - t),$$
$$\frac{\partial^2 S_n(\mathbf{a}; t, \beta)}{\partial \mathbf{a}\, \partial \mathbf{a}^T} = \frac{1}{n} \sum_{i=1}^n p_2(Y_i; \mathbf{X}_i^T \beta + \mathbf{t}_i(t)^T \mathbf{a})\, \mathbf{t}_i(t)\, \mathbf{t}_i(t)^T\, w_1(\mathbf{X}_i)\, K_h(T_i - t).$$

The iterations terminate once the estimate $\widehat{\eta}^{[k+1]}(t) = \mathbf{e}_{1, p+1}^T \mathbf{a}^{[k+1]}(t; \beta)$ stabilizes.
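A minimal sketch of this Newton–Raphson iteration for the local-linear case $p = 1$, again specialized to the Gaussian sketch where $p_1(y; \theta) = -2\psi(y - \theta)$ and $p_2(y; \theta) = 2\psi'(y - \theta)$, is given below; a robust starting value and a nonsingular local Hessian (enough unclipped observations in the window) are assumed.

```python
import numpy as np

def nr_local_fit(t, beta, X, T, Y, h, p=1, iters=20, c=1.345):
    """Newton-Raphson for the local-polynomial coefficients a in Eq. (16),
    Gaussian sketch with w_1 = 1: p_1 = -2 psi(y - th), p_2 = 2 psi'(y - th)."""
    psi  = lambda r: np.clip(r, -c, c)
    dpsi = lambda r: (np.abs(r) <= c).astype(float)
    u = (T - t) / h
    K = 0.75 * np.maximum(1.0 - u**2, 0.0) / h          # Epanechnikov K_h
    Tt = np.vander(T - t, p + 1, increasing=True)        # rows t_i(t)^T
    a = np.zeros(p + 1)
    a[0] = np.median((Y - X @ beta)[K > 0])              # robust start
    n = len(Y)
    for _ in range(iters):
        th = X @ beta + Tt @ a
        g = (Tt.T * (-2.0 * psi(Y - th) * K)).sum(axis=1) / n   # gradient
        H = (Tt.T * (2.0 * dpsi(Y - th) * K)) @ Tt / n          # Hessian
        a -= np.linalg.solve(H, g)                       # NR update
    return a[0]                                          # eta_hat_beta(t)
```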
Our numerical studies of the robust-BD estimation indicate that (i) the kernel regression method can be both faster and more stable than the local-linear method; (ii) for estimating the nonparametric component $\eta_o(\cdot)$, the local-linear method outperforms the kernel method, especially at the edges of the points $\{T_i\}_{i=1}^n$; (iii) for the performance of the robust estimation of $\beta_o$, which is of major interest, there is a relatively negligible difference between using the kernel and local-linear methods in estimating nonparametric components.

4. Asymptotic Properties of the Robust-BD Estimators

This section investigates the asymptotic behavior of the robust-BD estimators $\widehat{\beta}$ and $\widehat{\eta}_{\widehat{\beta}}$ under regularity conditions. The consistency of $\widehat{\beta}$ to $\beta_o$ and the uniform consistency of $\widehat{\eta}_{\widehat{\beta}}$ to $\eta_o$ are given in Theorem 1; the asymptotic normality of $\widehat{\beta}$ is obtained in Theorem 2. For ease of exposition, the asymptotic results are derived using local-linear estimation with degree $p = 1$. Analogous results can be obtained for local-polynomial methods with lengthier technical details and are omitted.
We assume that $T \in \mathcal{T}$, and let $\mathcal{T}_0 \subseteq \mathcal{T}$ be a compact set. For any continuous function $v: \mathcal{T} \to \mathbb{R}$, define $\|v\|_\infty = \sup_{t \in \mathcal{T}} |v(t)|$ and $\|v\|_{\mathcal{T}_0; \infty} = \sup_{t \in \mathcal{T}_0} |v(t)|$. For a matrix $\mathbf{M}$, its j-th, smallest and largest eigenvalues are denoted by $\lambda_j(\mathbf{M})$, $\lambda_{\min}(\mathbf{M})$ and $\lambda_{\max}(\mathbf{M})$, respectively. Let $\|\mathbf{M}\| = \sup_{\|\mathbf{x}\| = 1} \|\mathbf{M}\mathbf{x}\| = \{\lambda_{\max}(\mathbf{M}^T \mathbf{M})\}^{1/2}$ be the matrix $L_2$ norm. Denote by $\stackrel{P}{\longrightarrow}$ convergence in probability and by $\stackrel{D}{\longrightarrow}$ convergence in distribution.

4.1. Consistency

We first present Lemma 1, which states the uniform consistency of $\widehat{\eta}_\beta(\cdot)$ to the surrogate function $\eta_\beta(\cdot)$. Theorem 1 gives the consistency of $\widehat{\beta}$ and $\widehat{\eta}_{\widehat{\beta}}$.
Lemma 1
(For the nonparametric surrogate $\eta_\beta(\cdot)$). Let $\mathcal{K} \subseteq \mathbb{R}^d$ and $\mathcal{T}_0 \subseteq \mathcal{T}$ be compact sets. Assume Condition A1 and Condition B in the Appendix. If $n \to \infty$, $h \to 0$, $nh \to \infty$ and $\log(1/h)/(nh) \to 0$, then $\sup_{\beta \in \mathcal{K}} \|\widehat{\eta}_\beta - \eta_\beta\|_{\mathcal{T}_0; \infty} \stackrel{P}{\longrightarrow} 0$.
Theorem 1
(For $\beta_o$ and $\eta_o(\cdot)$). Assume the conditions in Lemma 1.
(i)
If there exists a compact set $\mathcal{K}_1$ such that $\lim_{n \to \infty} P(\widehat{\beta} \in \mathcal{K}_1) = 1$ and Condition A2 holds, then $\widehat{\beta} \stackrel{P}{\longrightarrow} \beta_o$.
(ii)
Moreover, if Condition A3 holds, then $\|\widehat{\eta}_{\widehat{\beta}} - \eta_o\|_{\mathcal{T}_0; \infty} \stackrel{P}{\longrightarrow} 0$.

4.2. Asymptotic Normality

The asymptotic normality of $\widehat{\beta}$ is provided in Theorem 2.
Theorem 2
(For the parametric part $\beta_o$). Assume Condition A and Condition B in the Appendix. If $n \to \infty$, $nh^4 \to 0$ and $\log(1/h)/(nh^2) \to 0$, then:

$$\sqrt{n}\, (\widehat{\beta} - \beta_o) \stackrel{D}{\longrightarrow} N(\mathbf{0},\, \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}),$$

where:

$$\mathbf{H}_0 = E\Big[p_2(Y; \mathbf{X}^T \beta_o + \eta_o(T)) \Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\} \Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\}^T w_2(\mathbf{X})\Big], \tag{25}$$

and:

$$\mathbf{\Omega}_0 = E\Big(p_1^2(Y; \mathbf{X}^T \beta_o + \eta_o(T)) \Big[\Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\} w_2(\mathbf{X}) - \frac{\gamma(T)}{g_2(T; T, \beta_o)}\, w_1(\mathbf{X})\Big] \Big[\Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\} w_2(\mathbf{X}) - \frac{\gamma(T)}{g_2(T; T, \beta_o)}\, w_1(\mathbf{X})\Big]^T\Big), \tag{26}$$

with:

$$\gamma(t) = E\Big[p_2(Y; \mathbf{X}^T \beta_o + \eta_o(t)) \Big\{\mathbf{X} + \frac{\partial \eta_\beta(t)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\} w_2(\mathbf{X}) \,\Big|\, T = t\Big], \qquad g_2(t; t, \beta) = E\{p_2(Y; \mathbf{X}^T \beta + \eta_\beta(t))\, w_1(\mathbf{X}) \mid T = t\}. \tag{27}$$

From Condition A1, (13) and (14), we can show that if $w_1(\cdot) \equiv C w_2(\cdot)$ for some constant $C \in (0, \infty)$, then $\gamma(t) = \mathbf{0}$. In that case, $\mathbf{\Omega}_0$ simplifies to:

$$\mathbf{\Omega}_0 = E\Big[p_1^2(Y; \mathbf{X}^T \beta_o + \eta_o(T)) \Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\} \Big\{\mathbf{X} + \frac{\partial \eta_\beta(T)}{\partial \beta}\Big|_{\beta = \beta_o}\Big\}^T w_2^2(\mathbf{X})\Big].$$

Consider the conventional PLM in (1), estimated using the classical quadratic loss, identity link and $w_1(\cdot) = w_2(\cdot) \equiv 1$. If $\text{var}(\epsilon \mid \mathbf{X}, T) \equiv \sigma^2$, then $\mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1} = \sigma^2 [E\{\text{var}(\mathbf{X} \mid T)\}]^{-1}$, and thus the result of Theorem 2 agrees with that in [18].
Remark 2.
Theorem 2 implies the root-n convergence rate of $\widehat{\beta}$. This differs from $\widehat{\eta}_{\widehat{\beta}}(t)$, which converges at a rate involving both the sample size n and the bandwidth h, as seen in the proofs of Lemma 1 and Theorem 2.

5. Robust Inference for $\beta_o$ Based on BD

In many statistical applications, we wish to check whether a subset of the explanatory variables is statistically significant. Specific examples include:

$$H_0: \beta_{j;o} = 0 \ \text{ for } j = j_0, \qquad H_0: \beta_{j;o} = 0 \ \text{ for } j = j_1, \ldots, j_2.$$

These forms of linear hypotheses about $\beta_o$ are formulated more generally as (4).

5.1. Wald-Type Test $W_n$

We propose a robust version of the Wald-type test statistic,

$$W_n = n\, (\mathbf{A}\widehat{\beta} - \mathbf{g}_0)^T (\mathbf{A} \widehat{\mathbf{H}}_0^{-1} \widehat{\mathbf{\Omega}}_0 \widehat{\mathbf{H}}_0^{-1} \mathbf{A}^T)^{-1} (\mathbf{A}\widehat{\beta} - \mathbf{g}_0), \tag{28}$$

based on the robust-BD estimator $\widehat{\beta}$ proposed in Section 2.4, where $\widehat{\mathbf{\Omega}}_0$ and $\widehat{\mathbf{H}}_0$ are estimates of $\mathbf{\Omega}_0$ and $\mathbf{H}_0$ satisfying $\widehat{\mathbf{H}}_0^{-1} \widehat{\mathbf{\Omega}}_0 \widehat{\mathbf{H}}_0^{-1} \stackrel{P}{\longrightarrow} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$. For example,

$$\widehat{\mathbf{H}}_0 = \frac{1}{n} \sum_{i=1}^n p_2(Y_i; \mathbf{X}_i^T \widehat{\beta} + \widehat{\eta}_{\widehat{\beta}}(T_i)) \Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\} \Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\}^T w_2(\mathbf{X}_i),$$

and:

$$\widehat{\mathbf{\Omega}}_0 = \frac{1}{n} \sum_{i=1}^n p_1^2(Y_i; \mathbf{X}_i^T \widehat{\beta} + \widehat{\eta}_{\widehat{\beta}}(T_i)) \Big[\Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\} w_2(\mathbf{X}_i) - \frac{\widehat{\gamma}(T_i)}{\widehat{g}_2(T_i; T_i, \widehat{\beta})}\, w_1(\mathbf{X}_i)\Big] \Big[\Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\} w_2(\mathbf{X}_i) - \frac{\widehat{\gamma}(T_i)}{\widehat{g}_2(T_i; T_i, \widehat{\beta})}\, w_1(\mathbf{X}_i)\Big]^T,$$

fulfill the requirement, where:

$$\frac{\partial \widehat{\eta}_\beta(t)}{\partial \beta} = -\frac{\sum_{k=1}^n p_2(Y_k; \mathbf{X}_k^T \beta + \widehat{\eta}_\beta(t))\, \mathbf{X}_k\, w_1(\mathbf{X}_k)\, K_h(T_k - t)}{\sum_{k=1}^n p_2(Y_k; \mathbf{X}_k^T \beta + \widehat{\eta}_\beta(t))\, w_1(\mathbf{X}_k)\, K_h(T_k - t)},$$
$$\widehat{\gamma}(t) = \frac{1}{n} \sum_{k=1}^n p_2(Y_k; \mathbf{X}_k^T \widehat{\beta} + \widehat{\eta}_{\widehat{\beta}}(t)) \Big\{\mathbf{X}_k + \frac{\partial \widehat{\eta}_\beta(t)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\} w_2(\mathbf{X}_k)\, K_h(T_k - t),$$
$$\widehat{g}_2(t; t, \beta) = \frac{1}{n} \sum_{k=1}^n p_2(Y_k; \mathbf{X}_k^T \beta + \widehat{\eta}_\beta(t))\, w_1(\mathbf{X}_k)\, K_h(T_k - t).$$

Again, we can verify that if $w_1(\cdot) \equiv C w_2(\cdot)$ for some constant $C \in (0, \infty)$ and $\widehat{\eta}_\beta(t)$ is obtained from the kernel estimation method, then $\widehat{\gamma}(t) = \mathbf{0}$, and hence $\widehat{\mathbf{\Omega}}_0$ simplifies to:

$$\widehat{\mathbf{\Omega}}_0 = \frac{1}{n} \sum_{i=1}^n p_1^2(Y_i; \mathbf{X}_i^T \widehat{\beta} + \widehat{\eta}_{\widehat{\beta}}(T_i)) \Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\} \Big\{\mathbf{X}_i + \frac{\partial \widehat{\eta}_\beta(T_i)}{\partial \beta}\Big|_{\beta = \widehat{\beta}}\Big\}^T w_2^2(\mathbf{X}_i).$$

Theorem 3 shows that, under the null, $W_n$ is asymptotically distributed as $\chi_k^2$ for large n, and hence is asymptotically distribution free.
Theorem 3
(Wald-type test based on robust-BD under $H_0$). Assume the conditions in Theorem 2, and $\widehat{\mathbf{H}}_0^{-1} \widehat{\mathbf{\Omega}}_0 \widehat{\mathbf{H}}_0^{-1} \stackrel{P}{\longrightarrow} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$ in (28). Then, under $H_0$ in (4), we have that:

$$W_n \stackrel{D}{\longrightarrow} \chi_k^2.$$
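Given $\widehat{\beta}$ and a consistent sandwich estimate of $\mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$, computing $W_n$ and its asymptotic p-value is immediate; a minimal sketch (function names ours):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, V_hat, A, g0, n):
    """Robust Wald-type statistic W_n of Eq. (28) and its asymptotic p-value.
    V_hat is any consistent estimate of the sandwich H_0^{-1} Omega_0 H_0^{-1}."""
    diff = A @ beta_hat - g0
    Wn = n * diff @ np.linalg.solve(A @ V_hat @ A.T, diff)
    return Wn, chi2.sf(Wn, df=A.shape[0])   # k = row rank of A

# e.g., testing H_0: beta_1 = 0 and beta_3 = 0 in a model with d = 4
A = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 0.]])
g0 = np.zeros(2)
```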
Theorem 4 indicates that $W_n$ has non-trivial local power for detecting contiguous alternatives approaching the null at the rate $n^{-1/2}$:

$$H_{1n}: \mathbf{A}\beta_o - \mathbf{g}_0 = \mathbf{c}/\sqrt{n}\, \{1 + o(1)\}, \tag{29}$$

where $\mathbf{c} = (c_1, \ldots, c_k)^T \ne \mathbf{0}$.
Theorem 4
(Wald-type test based on robust-BD under $H_{1n}$). Assume the conditions in Theorem 2, and $\widehat{\mathbf{H}}_0^{-1} \widehat{\mathbf{\Omega}}_0 \widehat{\mathbf{H}}_0^{-1} \stackrel{P}{\longrightarrow} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$ in (28). Then, under $H_{1n}$ in (29), $W_n \stackrel{D}{\longrightarrow} \chi_k^2(\tau^2)$, where $\tau^2 = \mathbf{c}^T (\mathbf{A} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1} \mathbf{A}^T)^{-1} \mathbf{c} > 0$.
To appreciate the discriminating power of $W_n$ in assessing significance, the asymptotic power is analyzed. Theorem 5 shows that under the fixed alternative $H_1$, $W_n \stackrel{P}{\longrightarrow} +\infty$ at the rate n. Thus, $W_n$ has power approaching one against fixed alternatives.
Theorem 5
(Wald-type test based on robust-BD under $H_1$). Assume the conditions in Theorem 2, and $\widehat{\mathbf{H}}_0^{-1} \widehat{\mathbf{\Omega}}_0 \widehat{\mathbf{H}}_0^{-1} \stackrel{P}{\longrightarrow} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$ in (28). Then, under $H_1$ in (4), $n^{-1} W_n \ge \lambda_{\max}^{-1}(\mathbf{A} \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1} \mathbf{A}^T)\, \|\mathbf{A}\beta_o - \mathbf{g}_0\|^2 + o_P(1)$.
For the conventional PLM in (1) estimated using the non-robust quadratic loss, [8] showed the asymptotic equivalence of the Wald-type test and the likelihood ratio-type test. Our results in Section 5.2 reveal that this equivalence breaks down when the estimators are obtained using robust loss functions.

5.2. Likelihood Ratio-Type Test $\Lambda_n$

This section explores the extent to which the likelihood ratio-type test extends to the "robust-BD" for testing the null hypothesis in (4) for the GPLM. The robust-BD test statistic is:

$$\Lambda_n = 2n \Big\{\min_{\beta \in \mathbb{R}^d:\, \mathbf{A}\beta = \mathbf{g}_0} J_n(\beta, \widehat{\eta}_\beta) - J_n(\widehat{\beta}, \widehat{\eta}_{\widehat{\beta}})\Big\}, \tag{30}$$

where $\widehat{\beta}$ is the robust-BD estimator of $\beta_o$ developed in Section 2.4.
Theorem 6 indicates that the limit distribution of $\Lambda_n$ under $H_0$ is a linear combination of independent chi-squared variables, with weights relying on unknown quantities; thus, it is not distribution free.
Theorem 6
(Likelihood ratio-type test based on robust-BD under $H_0$). Assume the conditions in Theorem 2.
(i)
Under $H_0$ in (4), we obtain:

$$\Lambda_n \stackrel{D}{\longrightarrow} \sum_{j=1}^k \lambda_j\{(\mathbf{A} \mathbf{H}_0^{-1} \mathbf{A}^T)^{-1} (\mathbf{A} \mathbf{V}_0 \mathbf{A}^T)\}\, Z_j^2, \tag{31}$$

where $\mathbf{V}_0 = \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$ and $\{Z_j\}_{j=1}^k \stackrel{\text{i.i.d.}}{\sim} N(0, 1)$.
(ii)
Moreover, if $\psi(r) = r$, $w_1(\mathbf{x}) = w_2(\mathbf{x}) \equiv 1$, and the generating q-function of the BD satisfies:

$$q''(m(\mathbf{x}, t)) = -\frac{C}{V(m(\mathbf{x}, t))}, \quad \text{for a constant } C > 0,$$

then, under $H_0$ in (4), we have that $\Lambda_n/C \stackrel{D}{\longrightarrow} \chi_k^2$.
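Because the weights $\lambda_j$ depend on unknown quantities, a practical use of Theorem 6 would require plug-in estimates of $\mathbf{H}_0$ and $\mathbf{V}_0$ and Monte Carlo simulation of the weighted chi-squared limit; a hypothetical sketch (names and defaults ours):

```python
import numpy as np

def lambda_null_quantile(H0, V0, A, prob=0.95, nsim=200_000, seed=0):
    """Monte Carlo quantile of sum_j lambda_j Z_j^2, the weighted chi-squared
    null limit of Lambda_n in Theorem 6, given plug-in estimates H0, V0."""
    M = np.linalg.solve(A @ np.linalg.solve(H0, A.T), A @ V0 @ A.T)
    lam = np.linalg.eigvals(M).real          # weights lambda_j
    rng = np.random.default_rng(seed)
    Z2 = rng.standard_normal((nsim, len(lam)))**2
    return np.quantile(Z2 @ lam, prob)       # simulated critical value
```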
Theorem 7 states that $\Lambda_n$ has non-trivial local power for detecting contiguous alternatives approaching the null at the rate $n^{-1/2}$, and that $\Lambda_n \stackrel{P}{\longrightarrow} +\infty$ at the rate n under $H_1$, thus having power approaching one against fixed alternatives.
Theorem 7
(Likelihood ratio-type test based on robust-BD under $H_{1n}$ and $H_1$). Assume the conditions in Theorem 2. Let $\mathbf{V}_0 = \mathbf{H}_0^{-1} \mathbf{\Omega}_0 \mathbf{H}_0^{-1}$ and $\lambda_j = \lambda_j\{(\mathbf{A} \mathbf{H}_0^{-1} \mathbf{A}^T)^{-1} (\mathbf{A} \mathbf{V}_0 \mathbf{A}^T)\}$, $j = 1, \ldots, k$.
(i)
Under $H_{1n}$ in (29), $\Lambda_n \stackrel{D}{\longrightarrow} \sum_{j=1}^k (\sqrt{\lambda_j}\, Z_j + \mathbf{e}_{j,k}^T \mathbf{S} \mathbf{c})^2$, where $\{Z_j\}_{j=1}^k \stackrel{\text{i.i.d.}}{\sim} N(0, 1)$, and $\mathbf{S}$ is a matrix satisfying $\mathbf{S}^T \mathbf{S} = (\mathbf{A} \mathbf{H}_0^{-1} \mathbf{A}^T)^{-1}$ and $\mathbf{S} (\mathbf{A} \mathbf{V}_0 \mathbf{A}^T) \mathbf{S}^T = \text{diag}(\lambda_1, \ldots, \lambda_k)$.
(ii)
Under $H_1$ in (4), $n^{-1} \Lambda_n \ge c\, \|\mathbf{A}\beta_o - \mathbf{g}_0\|^2 + o_P(1)$ for a constant $c > 0$.

5.3. Comparison between $W_n$ and $\Lambda_n$

In summary, the test $W_n$ has several advantages over the test $\Lambda_n$. First, the asymptotic null distribution of $W_n$ is distribution free, whereas the asymptotic null distribution of $\Lambda_n$ in general depends on unknown quantities. Second, $W_n$ is invariant to re-scaling the generating q-function of the BD, but $\Lambda_n$ is not. Third, the computational expense of $W_n$ is much lower than that of $\Lambda_n$, partly because the integration operations for $\rho_q$ are involved in $\Lambda_n$ but not in $W_n$, and partly because $\Lambda_n$ requires both unrestricted and restricted parameter estimates, while $W_n$ remains usable in cases where restricted parameter estimates are difficult to compute. Thus, $W_n$ will be the focus of the numerical studies in Section 6.

6. Simulation Study

We conduct simulation evaluations of the performance of the robust-BD estimation methods for general partially linear models. We use the Huber $\psi$-function with $c = 1.345$. The weight functions are chosen to be $w_1(\mathbf{x}) = w_2(\mathbf{x}) = 1/\{1 + \sum_{j=1}^d ((x_j - m_j)/s_j)^2\}^{1/2}$, where $\mathbf{x} = (x_1, \ldots, x_d)^T$, and $m_j$ and $s_j$ denote the sample median and sample median absolute deviation of $\{X_{i,j} : i = 1, \ldots, n\}$, respectively, $j = 1, \ldots, d$. As a comparison, the classical non-robust estimation counterparts correspond to using $\psi(r) = r$ and $w_1(\mathbf{x}) = w_2(\mathbf{x}) \equiv 1$. Throughout the numerical work, the Epanechnikov kernel function $K(t) = 0.75 \max(1 - t^2, 0)$ is used. All these choices (among many others) are made for feasibility; issues concerning the trade-off between robustness and efficiency are not pursued further in this paper.
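For concreteness, this weight function can be computed as follows (a small sketch; the function name is ours):

```python
import numpy as np

def make_weights(X):
    """Downweighting w(x) = 1 / {1 + sum_j ((x_j - m_j)/s_j)^2}^{1/2},
    with m_j, s_j the coordinatewise median and median absolute deviation
    of the observed X (assumes no s_j is zero)."""
    m = np.median(X, axis=0)
    s = np.median(np.abs(X - m), axis=0)     # MAD per coordinate
    return 1.0 / np.sqrt(1.0 + (((X - m) / s)**2).sum(axis=1))
```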
The following setup is used in the simulation studies. The sample size is $n = 200$, and the number of replications is 500. (Incorporating a nonparametric component in the GPLM calls for a larger n as the number of covariates increases, for better numerical performance.) Local-linear robust-BD estimation is illustrated with the bandwidth parameter h set to 20% of the interval length of the variable T. Results using other data-driven choices of h are similar and are omitted.

6.1. Bernoulli Responses

We generate observations $\{(\mathbf{X}_i, T_i, Y_i)\}_{i=1}^n$ randomly from the model,

$$Y \mid (\mathbf{X}, T) \sim \text{Bernoulli}(m(\mathbf{X}, T)), \qquad \mathbf{X} \sim N(\mathbf{0}, \mathbf{\Sigma}), \qquad T \sim \text{Uniform}(0, 1),$$

where $\mathbf{\Sigma} = (\sigma_{jk})$ with $\sigma_{jk} = 0.2^{|j-k|}$, and $\mathbf{X}$ is independent of T. The link function is $\text{logit}\{m(\mathbf{x}, t)\} = \mathbf{x}^T \beta_o + \eta_o(t)$, where $\beta_o = (2, 2, 0, 0)^T$ and $\eta_o(t) = 2 \sin\{\pi(1 + 2t)\}$. Both the deviance loss and the exponential loss are employed as the BD.
For each dataset generated from the true model, we create a contaminated dataset in which 10 data points are contaminated as follows: $(X_{i,j}, Y_i)$ is replaced by $(X_{i,j}^*, Y_i^*)$, where $Y_i^* = 1 - Y_i$, $i = 1, \ldots, 5$, and

$$X_{1,2}^* = 5\, \text{sign}(U_1 - 0.5), \quad X_{2,2}^* = 5\, \text{sign}(U_2 - 0.5), \quad X_{3,2}^* = 5\, \text{sign}(U_3 - 0.5), \quad X_{4,4}^* = 5\, \text{sign}(U_4 - 0.5), \quad X_{5,1}^* = 5\, \text{sign}(U_5 - 0.5),$$
$$X_{6,2}^* = 5\, \text{sign}(U_6 - 0.5), \quad X_{7,3}^* = 5\, \text{sign}(U_7 - 0.5), \quad X_{8,4}^* = 5\, \text{sign}(U_8 - 0.5), \quad X_{9,2}^* = 5\, \text{sign}(U_9 - 0.5), \quad X_{10,3}^* = 5\, \text{sign}(U_{10} - 0.5),$$

with $\{U_i\} \stackrel{\text{i.i.d.}}{\sim} \text{Uniform}(0, 1)$.
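The data-generating and contamination scheme can be sketched as follows (our own illustration; the seed and function name are arbitrary, and the contamination signs are random since $\text{sign}(U - 0.5) = \pm 1$ with equal probability):

```python
import numpy as np

def simulate_contaminated(n=200, seed=0):
    """Bernoulli GPLM of Section 6.1 with the first 10 cases contaminated:
    responses flipped for i = 1..5 and one covariate pushed to +/- 5."""
    rng = np.random.default_rng(seed)
    d, beta_o = 4, np.array([2., 2., 0., 0.])
    Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    T = rng.uniform(size=n)
    eta = 2.0 * np.sin(np.pi * (1.0 + 2.0 * T))
    m = 1.0 / (1.0 + np.exp(-(X @ beta_o + eta)))       # logit link
    Y = rng.binomial(1, m).astype(float)
    Y[:5] = 1.0 - Y[:5]                                 # flip 5 responses
    cols = [1, 1, 1, 3, 0, 1, 2, 3, 1, 2]               # 0-indexed coordinates
    X[np.arange(10), cols] = 5.0 * np.sign(rng.uniform(size=10) - 0.5)
    return X, T, Y
```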
Figure 1 and Figure 2 compare the boxplots of $(\widehat{\beta}_j - \beta_{j;o})$, $j = 1, \ldots, d$, based on the non-robust and robust-BD estimates, where the deviance loss and the exponential loss are used as the BD in the top and bottom panels, respectively. As seen from Figure 1, in the absence of contamination, both the non-robust and robust methods perform comparably well; moreover, the bias of the non-robust method using the exponential loss (with $p_2(y; \theta)$ unbounded) is larger than that using the deviance loss (with $p_2(y; \theta)$ bounded). In the presence of contamination, Figure 2 reveals that the robust method is more effective in reducing the estimation bias without excessively inflating the estimation variance.
For each replication, we calculate $\text{MSE}(\widehat{\eta}) = n^{-1} \sum_{i=1}^n \{\widehat{\eta}_{\widehat{\beta}}(t_i) - \eta_o(t_i)\}^2$. Figure 3 and Figure 4 compare the plots of $\widehat{\eta}_{\widehat{\beta}}(t)$ from typical samples, using the non-robust and robust-BD estimates, where the deviance loss and the exponential loss are used as the BD in the top and bottom panels, respectively. The typical sample in each panel is selected so that its MSE value corresponds to the 50th percentile among the MSE-ranked values from the 500 replications. These fitted curves reveal little difference between the robust and non-robust methods in the absence of contamination. In contaminated cases, the robust estimates perform slightly better than the non-robust estimates. Moreover, the boundary bias of the curve estimates at the edges of the points $\{T_i\}_{i=1}^n$ arising from the local-constant method can be ameliorated by using the local-linear method.

6.2. Gaussian Responses

We generate independent observations $\{(\mathbf{X}_i, T_i, Y_i)\}_{i=1}^n$ from $(\mathbf{X}, T, Y)$ satisfying:

$$Y \mid (\mathbf{X}, T) \sim N(m(\mathbf{X}, T), \sigma^2), \qquad (\mathbf{X}, \Phi^{-1}(T)) \sim N(\mathbf{0}, \mathbf{\Sigma}),$$

where $\sigma = 1$, $\mathbf{\Sigma} = (\sigma_{jk})$ with $\sigma_{jk} = 0.2^{|j-k|}$, and $\Phi$ denotes the CDF of the standard normal distribution. The link function is $m(\mathbf{x}, t) = \mathbf{x}^T \beta_o + \eta_o(t)$, where $\beta_o = (2, 2, 1, 1, 0, 0)^T$ and $\eta_o(t) = 2 \sin\{\pi(1 + 2t)\}$. The quadratic loss is utilized as the BD.
For each dataset simulated from the true model, a contaminated dataset is created in which 10 data points $(X_{i,j}, Y_i)$ are subject to contamination. They are replaced by $(X_{i,j}^*, Y_i^*)$, where $Y_i^* = Y_i\, I\{|Y_i - m(\mathbf{X}_i, T_i)|/\sigma > 2\} + 15\, I\{|Y_i - m(\mathbf{X}_i, T_i)|/\sigma \le 2\}$, $i = 1, \ldots, 10$,

$$X_{1,2}^* = 5\, \text{sign}(U_1 - 0.5), \quad X_{2,2}^* = 5\, \text{sign}(U_2 - 0.5), \quad X_{3,2}^* = 5\, \text{sign}(U_3 - 0.5), \quad X_{4,4}^* = 5\, \text{sign}(U_4 - 0.5), \quad X_{5,6}^* = 5\, \text{sign}(U_5 - 0.5),$$
$$X_{6,1}^* = 5\, \text{sign}(U_6 - 0.5), \quad X_{7,2}^* = 5\, \text{sign}(U_7 - 0.5), \quad X_{8,3}^* = 5\, \text{sign}(U_8 - 0.5), \quad X_{9,4}^* = 5\, \text{sign}(U_9 - 0.5), \quad X_{10,5}^* = 5\, \text{sign}(U_{10} - 0.5),$$

with $\{U_i\} \stackrel{\text{i.i.d.}}{\sim} \text{Uniform}(0, 1)$.
with { U i } i . i . d . Uniform ( 0 , 1 ) .
Figure 5 and Figure 6 compare the boxplots of $(\widehat{\beta}_j - \beta_{j;o})$, $j = 1, \ldots, d$, in the top panels, and the plots of $\widehat{\eta}_{\widehat{\beta}}(t)$ from typical samples, in the bottom panels, using the non-robust and robust-BD estimates. The typical samples are selected as in Section 6.1. The simulation results in Figure 5 indicate that the robust method performs as well as the non-robust method for estimating both the parameter vector and the nonparametric curve in non-contaminated cases. Figure 6 reveals that the robust estimates are less sensitive to outliers than the non-robust counterparts. Indeed, the non-robust method yields a noticeable bias in the parametric estimates, and its nonparametric estimates are worse than those of the robust method.
Figure 7 gives the QQ plots of the (first to 95th) percentiles of the Wald-type statistic $W_n$ versus those of the $\chi_2^2$ distribution for testing the null hypothesis:

$$H_0: \beta_{2;o} = 2 \ \text{ and } \ \beta_{4;o} = 1. \tag{32}$$

The plots show that, in both the clean and contaminated cases, the robust $W_n$ (in the right panels) closely follows the $\chi_2^2$ distribution, lending support to Theorem 3. On the other hand, the non-robust $W_n$ agrees well with the $\chi_2^2$ distribution for clean data, but the presence of a small number of outlying data points severely distorts the sampling distribution of the non-robust $W_n$ (in the bottom left panel) away from the $\chi_2^2$ distribution, yielding inaccurate levels of the test.
To assess the stability of the power of the Wald-type test for testing the hypothesis (32), we evaluate the power along a sequence of alternatives with parameters $\beta_o + \Delta\, \mathbf{c}$ for each given $\Delta$, where $\mathbf{c} = -\beta_o + (1, \ldots, 1)^T$. Figure 8 plots the empirical rejection rates of the null model in the non-contaminated and contaminated cases. The price to pay for the robust $W_n$ is a small loss of power in the non-contaminated case. Under contamination, however, very different behavior is observed. The power curve of the robust $W_n$ is close to that attained in the non-contaminated case. In contrast, the non-robust $W_n$ is less informative, since its power curve is much lower than that of the robust $W_n$ against the alternative hypotheses with $\Delta \ne 0$, yet higher than the nominal level at the null hypothesis with $\Delta = 0$.

7. Real Data Analysis

Two real datasets are analyzed. In both cases, the quadratic loss is used as the BD, and the nonparametric function is fitted via the local-linear regression method, with the bandwidth parameter chosen to be 25% of the interval length of the variable T. The choices of the Huber $\psi$-function and the weight functions are identical to those in Section 6.

7.1. Example 1

The dataset studied in [19] consists of 2447 observations on three variables, log(wage), age and education, for women. It is of interest to learn how wages change with years of age and years of education. One anticipates a regression function of $Y = \log(\text{wage})$ that is increasing in both $T = \text{age}$ and $X_1 = \text{education}$. We fit a partially linear model $Y = \eta(T) + \beta_1 X_1 + \epsilon$. Profiles of the fitted nonparametric function $\widehat{\eta}(\cdot)$ in Figure 9 indeed exhibit an overall upward trend in age. The coefficient estimate is $\widehat{\beta}_1 = 0.0809$ with standard error 0.0042 using the non-robust method, and $\widehat{\beta}_1 = 0.1334$ with standard error 0.0046 using the robust method. The robust estimates are qualitatively similar to the non-robust counterparts. Our evaluation, based on both the non-robust and robust methods, supports the predictions of the theoretical and empirical literature in socio-economic studies.

7.2. Example 2

We analyze an employee dataset (Example 11.3 of [20]) of the Fifth National Bank of Springfield, based on data from 1995. The bank, whose name has been changed, was charged in court with paying its female employees substantially smaller salaries than its male employees. For each of its 208 employees, the dataset contains seven variables: EducLev (education level), JobGrade (job grade), YrHired (year the employee was hired), YrBorn (year the employee was born), Female (indicator of being female), YrsPrior (years of work experience at another bank before joining the Fifth National Bank) and Salary (current annual salary in thousands of dollars).
To explain the variation in salary, we fit a partially linear model, $Y = \eta(T) + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \epsilon$, for $Y = \log(\text{Salary})$, $T = \text{Age}$, $X_1 = \text{Female}$, $X_2 = \text{YrHired}$, $X_3 = \text{EducLev}$, $X_4 = \text{JobGrade}$ and $X_5 = \text{YrsPrior}$, where $\text{Age} = 95 - \text{YrBorn}$. Table 1 presents the parameter estimates and their standard errors (in parentheses), along with p-values calculated from the Wald-type test $W_n$. Figure 10 depicts the estimated nonparametric functions.
It is interesting to note that for this dataset, the robust and non-robust methods lead to different conclusions. For example, from Table 1, the non-robust method gives a negative estimate of the gender parameter $\beta_1$, which may be interpreted as evidence of discrimination against female employees in salary and lends support to the plaintiff. In contrast, the robust method yields $\widehat{\beta}_1 > 0$, which does not indicate an adverse effect of gender. (A similar conclusion based on penalized likelihood was obtained in Section 4.1 of [21].) Moreover, the estimated nonparametric functions $\widehat{\eta}(\cdot)$ obtained from the non-robust and robust methods are qualitatively different: the former does not deliver a monotonically increasing pattern in Age, whereas the latter does. Whether the difference is caused by outlying observations is an interesting issue for further investigation.

8. Discussion

Over the past two decades, nonparametric inference procedures for testing hypotheses concerning nonparametric regression functions have been developed extensively. See [22,23,24,25,26] and the references therein. The work on the generalized likelihood ratio test [24] sheds light on nonparametric inference, based on function estimation under nonparametric models, using the quadratic loss function as the error measure. These works do not directly address robust procedures. Future work will explore inference on nonparametric functions, such as $\eta_o(t)$ in the GPLM associated with a scalar variable T, and the additive structure $\sum_{d=1}^D \eta_d^o(t_d)$ as in [27] with a vector variable $T = (T_1, \ldots, T_D)$, estimated via the "robust-BD" as the error measure, in the presence of possibly outlying data points.
This paper utilizes the class BD of loss functions, the optimal choice of which depends on the specific setting and criterion. For example, regression and classification will utilize different loss functions, so further study of optimality is desirable.
Some recent work on partially linear models in econometrics includes [28,29,30]. There, the nonparametric function is approximated via linear expansions, with the number of coefficients diverging with n. Developing inference procedures that are resistant to outliers in that setting could be of interest.

Acknowledgments

The authors thank the two referees for insightful comments and suggestions. The research is supported by the U.S. NSF Grants DMS–1712418, DMS–1505367, CMMI–1536978 and DMS–1308872, the Wisconsin Alumni Research Foundation and the National Natural Science Foundation of China Grant 11690014.

Author Contributions

C.Z. conceived and designed the experiments; C.Z. analyzed the data; Z.Z. contributed to discussions and analysis tools; C.Z. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Main Results

Throughout the proofs, C represents a generic finite constant. We impose some regularity conditions, which may not be the weakest possible, but which facilitate the technical derivations.
Notation:
For integers $j \ge 0$, $\mu_j(K) = \int u^j K(u)\, du$; $\mathbf{c}_p = (\mu_{p+1}(K), \ldots, \mu_{2p+1}(K))^T$; $\mathbf{S} = (\mu_{j+k-2}(K))_{1 \le j, k \le p+1}$. Define $\eta(\mathbf{x}, t) = F(m(\mathbf{x}, t)) = \mathbf{x}^T \beta_o + \eta_o(t)$ and $\eta_i = \eta(\mathbf{X}_i, T_i)$. Set $\eta_i(t; \beta) = \mathbf{X}_i^T \beta + \eta_\beta(t) + \sum_{k=1}^p (T_i - t)^k \eta_\beta^{(k)}(t)/k!$; $g_1(\tau; t, \beta) = E\{p_1(Y_i; \eta_i(t; \beta))\, w_1(\mathbf{X}_i) \mid T_i = \tau\}$; $g_2(\tau; t, \beta) = E\{p_2(Y_i; \eta_i(t; \beta))\, w_1(\mathbf{X}_i) \mid T_i = \tau\}$.
Condition A:
A1.
$\eta_\beta(t)$ is the unique minimizer of $S(a; t, \beta)$ with respect to $a \in \mathbb{R}^1$.
A2.
$\beta_o \in \mathbb{R}^d$ is the unique minimizer of $J(\beta, \eta_\beta)$ with respect to $\beta$, where $d \ge 1$.
A3.
$\eta_o(\cdot) = \eta_{\beta_o}(\cdot)$.
Condition B:
B1.
The function $\rho_q(y, \mu)$ is continuous and bounded. The functions $p_1(y; \theta)$, $p_2(y; \theta)$, $p_3(y; \theta)$, $w_1(\cdot)$ and $w_2(\cdot)$ are bounded; $p_2(y; \theta)$ is continuous in $\theta$.
B2.
The kernel function K is Lipschitz continuous, a symmetric probability density function with bounded support. The matrix $\mathbf{S}$ is positive definite.
B3.
The marginal density $f_T(t)$ of T is continuous and uniformly bounded away from zero for $t \in \mathcal{T}_0$.
B4.
The function $S(a; t, \beta)$ is continuous, and $\eta_\beta(t)$ is a continuous function of $(t, \beta)$.
B5.
$g_2(\tau; t, \beta)$ is continuous in $\tau$; $g_2(t; t, \beta)$ is continuous in $t \in \mathcal{T}_0$.
B6.
The functions $\eta_\beta(t)$ and $\eta_o(t)$ are $(p+1)$-times continuously differentiable at t.
B7.
The link function $F(\cdot)$ is a monotone increasing bijection, $F^{(3)}(\cdot)$ is continuous, and $F^{(1)}(\cdot) > 0$. The matrix $\text{var}(\mathbf{X} \mid T = t)$ is positive definite for a.e. t.
B8.
The matrix $\mathbf{H}_0$ in (25) is invertible; $\mathbf{\Omega}_0$ in (26) is positive definite.
B9.
$\widehat{\eta}_\beta(t)$ and $\eta_\beta(t)$ are continuously differentiable with respect to $(t, \beta)$, and twice continuously differentiable with respect to $\beta$, such that for any $1 \le j, k \le d$, $\frac{\partial^2}{\partial \beta_j \partial \beta_k} \eta_\beta(t)\big|_{\beta = \beta_o}$ is bounded. Furthermore, for any $1 \le j, k \le d$, $\frac{\partial^2}{\partial \beta_j \partial \beta_k} \eta_\beta(t)$ satisfies the equicontinuity condition:

$$\forall \varepsilon > 0,\ \exists \delta_\varepsilon > 0:\ \|\beta_1 - \beta_o\| < \delta_\varepsilon \ \Longrightarrow\ \Big\|\frac{\partial^2}{\partial \beta_j \partial \beta_k} \eta_\beta\Big|_{\beta = \beta_1} - \frac{\partial^2}{\partial \beta_j \partial \beta_k} \eta_\beta\Big|_{\beta = \beta_o}\Big\|_\infty < \varepsilon.$$

Note that Conditions A, B2–B5 and B8–B9 were used similarly in [9]. Conditions B1 and B7 follow [10]. Condition B6 is due to the local p-th-degree polynomial regression estimation.
Proof of Lemma 1.
From Condition A1, we obtain $E\{p_1(Y; \mathbf{X}^T \beta + \eta_\beta(t))\, w_1(\mathbf{X}) \mid T = t\} = 0$ and $E\{p_2(Y; \mathbf{X}^T \beta + \eta_\beta(t))\, w_1(\mathbf{X}) \mid T = t\} > 0$, i.e.,

$$g_1(t; t, \beta) = E\{p_1(Y; \mathbf{X}^T \beta + \eta_\beta(t))\, w_1(\mathbf{X}) \mid T = t\} = 0, \tag{A1}$$

$$g_2(t; t, \beta) = E\{p_2(Y; \mathbf{X}^T \beta + \eta_\beta(t))\, w_1(\mathbf{X}) \mid T = t\} > 0. \tag{A2}$$

Denote by $\eta_\beta^{(0, \ldots, p)}(t) = (\eta_\beta(t), \eta_\beta^{(1)}(t), \ldots, \eta_\beta^{(p)}(t)/p!)^T$ the vector of $\eta_\beta(t)$ along with its re-scaled derivatives with respect to t up to order p. Note that:

$$\eta_i(t; \beta) = \mathbf{X}_i^T \beta + \sum_{k=0}^p \frac{(T_i - t)^k \eta_\beta^{(k)}(t)}{k!} = \mathbf{X}_i^T \beta + \mathbf{t}_i(t)^T \eta_\beta^{(0, \ldots, p)}(t) = \mathbf{X}_i^T \beta + \{\mathbf{H}^{-1}\mathbf{t}_i(t)\}^T \mathbf{H}\, \eta_\beta^{(0, \ldots, p)}(t) = \mathbf{X}_i^T \beta + \mathbf{t}_i^*(t)^T \mathbf{H}\, \eta_\beta^{(0, \ldots, p)}(t),$$

where $\mathbf{H} = \text{diag}\{(1, h, \ldots, h^p)\}$ and $\mathbf{t}_i^*(t) = \mathbf{H}^{-1}\mathbf{t}_i(t) = (1, (T_i - t)/h, \ldots, (T_i - t)^p/h^p)^T$ denotes the re-scaled $\mathbf{t}_i(t)$. Then:

$$\mathbf{X}_i^T \beta + \mathbf{t}_i(t)^T \mathbf{a} = \mathbf{X}_i^T \beta + \mathbf{t}_i^*(t)^T \mathbf{H}\mathbf{a} = \mathbf{X}_i^T \beta + \mathbf{t}_i^*(t)^T \mathbf{H}\, \eta_\beta^{(0, \ldots, p)}(t) + \mathbf{t}_i^*(t)^T \mathbf{H}\{\mathbf{a} - \eta_\beta^{(0, \ldots, p)}(t)\} = \eta_i(t; \beta) + \mathbf{t}_i^*(t)^T \mathbf{H}\{\mathbf{a} - \eta_\beta^{(0, \ldots, p)}(t)\}.$$

Hence, we rewrite (16) as:

$$S_n(\mathbf{a}; t, \beta) = \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\eta_i(t; \beta) + \mathbf{t}_i^*(t)^T \mathbf{H}\{\mathbf{a} - \eta_\beta^{(0, \ldots, p)}(t)\}))\, w_1(\mathbf{X}_i)\, K_h(T_i - t).$$

Therefore, $\widehat{\mathbf{a}}(t, \beta)$ minimizing $S_n(\mathbf{a}; t, \beta)$ is equivalent to the one minimizing:

$$\frac{1}{n} \sum_{i=1}^n \{\rho_q(Y_i, F^{-1}(\eta_i(t; \beta) + \mathbf{t}_i^*(t)^T \mathbf{H}\{\mathbf{a} - \eta_\beta^{(0, \ldots, p)}(t)\})) - \rho_q(Y_i, F^{-1}(\eta_i(t; \beta)))\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t)$$

with respect to $\mathbf{a}$. It follows that $\widehat{\mathbf{a}}^*(t, \beta)$, defined by $\widehat{\mathbf{a}}^*(t, \beta) = \sqrt{nh}\, \mathbf{H}\{\widehat{\mathbf{a}}(t, \beta) - \eta_\beta^{(0, \ldots, p)}(t)\}$, minimizes:

$$G_n(\mathbf{a}^*; t, \beta) = nh\Big[\frac{1}{n} \sum_{i=1}^n \{\rho_q(Y_i, F^{-1}(\eta_i(t; \beta) + a_n \mathbf{t}_i^*(t)^T \mathbf{a}^*)) - \rho_q(Y_i, F^{-1}(\eta_i(t; \beta)))\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big]$$

with respect to $\mathbf{a}^* \in \mathbb{R}^{p+1}$, where $a_n = 1/\sqrt{nh}$. Note that for any fixed $\mathbf{a}^*$, $|\mathbf{t}_i^*(t)^T \mathbf{a}^*| \le C$. By Taylor expansion,

$$G_n(\mathbf{a}^*; t, \beta) = nh\Big(a_n \frac{1}{n} \sum_{i=1}^n p_1(Y_i; \eta_i(t; \beta))\, \{\mathbf{t}_i^*(t)^T \mathbf{a}^*\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t) + a_n^2\, \frac{1}{2}\, \frac{1}{n} \sum_{i=1}^n p_2(Y_i; \eta_i(t; \beta))\, \{\mathbf{t}_i^*(t)^T \mathbf{a}^*\}^2\, w_1(\mathbf{X}_i)\, K_h(T_i - t) + a_n^3\, \frac{1}{6}\, \frac{1}{n} \sum_{i=1}^n p_3(Y_i; \eta_i^{**}(t; \beta))\, \{\mathbf{t}_i^*(t)^T \mathbf{a}^*\}^3\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big) \equiv I_{n,1} + I_{n,2} + I_{n,3},$$
where $\eta_i^{**}(t; \beta)$ lies between $\eta_i(t; \beta)$ and $\eta_i(t; \beta) + a_n \mathbf{t}_i^*(t)^T \mathbf{a}^*$. We notice that:

$$I_{n,1} = \sqrt{nh}\, \mathbf{W}_n(t, \beta)^T \mathbf{a}^*,$$

where:

$$\mathbf{W}_n(t, \beta) = \frac{1}{n} \sum_{i=1}^n p_1(Y_i; \eta_i(t; \beta))\, \mathbf{t}_i^*(t)\, w_1(\mathbf{X}_i)\, K_h(T_i - t);$$

also, Lemma A1 implies:

$$I_{n,2} = nh\, a_n^2\, \frac{1}{2}\, \mathbf{a}^{*T} \Big[\frac{1}{n} \sum_{i=1}^n p_2(Y_i; \eta_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big] \mathbf{a}^* = \frac{1}{2}\, \mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^* + o_P(1),$$

where:

$$\mathbf{S}_2(t, \beta) = g_2(t; t, \beta)\, f_T(t)\, \mathbf{S} > 0$$

by (A2), Conditions B2 and B5; and (by using $X_n = O_P(E(|X_n|))$):

$$I_{n,3} \le C\, O_P(nh\, a_n^3) = O_P(1/\sqrt{nh}) = o_P(1).$$

Then:

$$G_n(\mathbf{a}^*; t, \beta) = \sqrt{nh}\, \mathbf{W}_n(t, \beta)^T \mathbf{a}^* + \frac{1}{2}\, \mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^* + o_P(1),$$

where $\mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^* = (\mathbf{a}^{*T} \mathbf{S}\mathbf{a}^*)\, g_2(t; t, \beta)\, f_T(t)$ is continuous in $t \in \mathcal{T}_0$ by B3 and B5.
We now examine $\mathbf{W}_n(t, \beta)$. Note that:

$$\text{var}\{\mathbf{W}_n(t, \beta)\} = \frac{1}{n}\, \text{var}\{p_1(Y_i; \eta_i(t; \beta))\, \mathbf{t}_i^*(t)\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\} \le \frac{1}{n}\, E\big[p_1^2(Y_i; \eta_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1^2(\mathbf{X}_i)\, \{K_h(T_i - t)\}^2\big] \le \frac{C}{n}\, E\Big[\frac{1}{h^2}\, K^2\Big(\frac{T_i - t}{h}\Big)\Big] = O\Big(\frac{1}{nh}\Big).$$

To evaluate $E\{\mathbf{W}_n(t, \beta)\}$, it is easy to see that for each $j \in \{0, 1, \ldots, p\}$,

$$\mathbf{e}_{j+1, p+1}^T E\{\mathbf{W}_n(t, \beta)\} = E\{p_1(Y_i; \eta_i(t; \beta))\, \mathbf{e}_{j+1, p+1}^T \mathbf{t}_i^*(t)\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\} = E\Big[p_1(Y_i; \eta_i(t; \beta)) \Big(\frac{T_i - t}{h}\Big)^j w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big] = E\Big[E\{p_1(Y_i; \eta_i(t; \beta))\, w_1(\mathbf{X}_i) \mid T_i\} \Big(\frac{T_i - t}{h}\Big)^j K_h(T_i - t)\Big] = E\Big[g_1(T_i; t, \beta) \Big(\frac{T_i - t}{h}\Big)^j K_h(T_i - t)\Big] = \int g_1(y; t, \beta) \Big(\frac{y - t}{h}\Big)^j \frac{1}{h} K\Big(\frac{y - t}{h}\Big) f_T(y)\, dy = \int g_1(t + hx; t, \beta)\, x^j K(x)\, f_T(t + hx)\, dx.$$

Note that by Taylor expansion,

$$\eta_\beta(t + hx) = \sum_{k=0}^p \frac{(hx)^k \eta_\beta^{(k)}(t)}{k!} + \frac{(hx)^{p+1} \eta_\beta^{(p+1)}(t)}{(p+1)!} + o(h^{p+1}).$$

This, combined with the facts (A1) and (A2), gives:

$$g_1(t + hx; t, \beta) = E\Big[p_1\Big(Y; \mathbf{X}^T \beta + \sum_{k=0}^p \frac{(hx)^k \eta_\beta^{(k)}(t)}{k!}\Big)\, w_1(\mathbf{X}) \,\Big|\, T = t + hx\Big] = E\Big[p_1(Y; \mathbf{X}^T \beta + \eta_\beta(t + hx))\, w_1(\mathbf{X}) + p_2(Y; \mathbf{X}^T \beta + \eta_\beta(t + hx)) \Big\{\sum_{k=0}^p \frac{(hx)^k \eta_\beta^{(k)}(t)}{k!} - \eta_\beta(t + hx)\Big\}\, w_1(\mathbf{X}) \,\Big|\, T = t + hx\Big] + o(h^{p+1}) = g_1(t + hx; t + hx, \beta) - \frac{(hx)^{p+1} \eta_\beta^{(p+1)}(t)}{(p+1)!}\, g_2(t + hx; t + hx, \beta) + o(h^{p+1}) = -\frac{(hx)^{p+1} \eta_\beta^{(p+1)}(t)}{(p+1)!}\, g_2(t + hx; t + hx, \beta) + o(h^{p+1}).$$

Thus, using the continuity of $g_2(t; t, \beta)$ and $f_T(t)$ in t, we obtain:

$$E\{\mathbf{W}_n(t, \beta)\} = -\mathbf{c}_p\, \frac{\eta_\beta^{(p+1)}(t)}{(p+1)!}\, g_2(t; t, \beta)\, f_T(t)\, h^{p+1} + o(h^{p+1})$$

uniformly in $(t, \beta)$. Thus, we conclude that $\sqrt{nh}\, \mathbf{W}_n(t, \beta) = O_P(1)$ when $nh^{2p+3} = O(1)$.
By Lemma A2,

$$\sup_{\mathbf{a}^* \in \Theta,\, t \in \mathcal{T}_0,\, \beta \in \mathcal{K}} \Big|G_n(\mathbf{a}^*; t, \beta) - \sqrt{nh}\, \mathbf{W}_n(t, \beta)^T \mathbf{a}^* - \frac{1}{2}\, \mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^*\Big| = o_P(1).$$

This, along with Lemma A.1 of [18], yields:

$$\sup_{t \in \mathcal{T}_0,\, \beta \in \mathcal{K}} \|\widehat{\mathbf{a}}^*(t, \beta) + \{\mathbf{S}_2(t, \beta)\}^{-1} \sqrt{nh}\, \mathbf{W}_n(t, \beta)\| = o_P(1),$$

the first entry of which satisfies:

$$\sup_{t \in \mathcal{T}_0,\, \beta \in \mathcal{K}} |\sqrt{nh}\, \{\widehat{\eta}_\beta(t) - \eta_\beta(t)\} + \mathbf{e}_{1, p+1}^T \{\mathbf{S}_2(t, \beta)\}^{-1} \sqrt{nh}\, \mathbf{W}_n(t, \beta)| = o_P(1),$$

namely, $\sup_{t \in \mathcal{T}_0, \beta \in \mathcal{K}} |\widehat{\eta}_\beta(t) - \eta_\beta(t) + \mathbf{e}_{1, p+1}^T \{\mathbf{S}_2(t, \beta)\}^{-1} \mathbf{W}_n(t, \beta)| = o_P(1/\sqrt{nh})$. By [31], $\sup_{t \in \mathcal{T}_0, \beta \in \mathcal{K}} \|\mathbf{W}_n(t, \beta) - E\{\mathbf{W}_n(t, \beta)\}\| = O_P(\{\log(1/h)/(nh)\}^{1/2})$. Furthermore,

$$\{\mathbf{S}_2(t, \beta)\}^{-1} E\{\mathbf{W}_n(t, \beta)\} = -\mathbf{S}^{-1} \mathbf{c}_p\, \frac{\eta_\beta^{(p+1)}(t)}{(p+1)!}\, h^{p+1} + o(h^{p+1})$$

uniformly in $(t, \beta)$. Therefore,

$$\sup_{t \in \mathcal{T}_0,\, \beta \in \mathcal{K}} \Big|\widehat{\eta}_\beta(t) - \eta_\beta(t) - \mathbf{e}_{1, p+1}^T \mathbf{S}^{-1} \mathbf{c}_p\, \frac{\eta_\beta^{(p+1)}(t)}{(p+1)!}\, h^{p+1}\Big| = o_P(1).$$

This yields:

$$\sup_{\beta \in \mathcal{K}} \sup_{t \in \mathcal{T}_0} \Big|\widehat{\eta}_\beta(t) - \eta_\beta(t) - \mathbf{e}_{1, p+1}^T \mathbf{S}^{-1} \mathbf{c}_p\, \frac{\eta_\beta^{(p+1)}(t)}{(p+1)!}\, h^{p+1}\Big| \le \sup_{t \in \mathcal{T}_0,\, \beta \in \mathcal{K}} \Big|\widehat{\eta}_\beta(t) - \eta_\beta(t) - \mathbf{e}_{1, p+1}^T \mathbf{S}^{-1} \mathbf{c}_p\, \frac{\eta_\beta^{(p+1)}(t)}{(p+1)!}\, h^{p+1}\Big| = o_P(1).$$

Note that for $p = 1$, $\mathbf{e}_{1, p+1}^T \mathbf{S}^{-1} \mathbf{c}_p = \mu_2(K)$. This completes the proof. ☐
Lemma A1.
Assume Condition B in the Appendix. If $n \to \infty$, $h \to 0$ and $nh \to \infty$, then for given $t \in \mathcal{T}_0$ and $\beta \in \mathcal{K}$,

$$\frac{1}{n} \sum_{i=1}^n p_2(Y_i; \eta_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t) = \mathbf{S}_2(t, \beta) + o_P(1),$$

where $\mathbf{S}_2(t, \beta) = g_2(t; t, \beta)\, f_T(t)\, \mathbf{S}$, with $\mathbf{S} = (\mu_{j+k-2}(K))_{1 \le j, k \le p+1}$ and $\mu_j(K) = \int u^j K(u)\, du$, $j = 0, 1, \ldots, 2p$.
Proof. 
Recall the $(p+1) \times (p+1)$ matrix $\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T = (((T_i - t)/h)^{j+k-2})_{1 \le j, k \le p+1}$. Set $X_j = \frac{1}{n} \sum_{i=1}^n p_2(Y_i; \eta_i(t; \beta)) ((T_i - t)/h)^j w_1(\mathbf{X}_i)\, K_h(T_i - t)$ for $j = 0, 1, \ldots, 2p$. We observe that:

$$E(X_j) = \frac{1}{n} \sum_{i=1}^n E\Big[E\{p_2(Y_i; \eta_i(t; \beta))\, w_1(\mathbf{X}_i) \mid T_i\} \Big(\frac{T_i - t}{h}\Big)^j K_h(T_i - t)\Big] = E\Big[g_2(T; t, \beta) \Big(\frac{T - t}{h}\Big)^j K_h(T - t)\Big] = \int g_2(y; t, \beta) \Big(\frac{y - t}{h}\Big)^j \frac{1}{h} K\Big(\frac{y - t}{h}\Big) f_T(y)\, dy = \int g_2(t + hx; t, \beta)\, x^j K(x)\, f_T(t + hx)\, dx = g_2(t; t, \beta)\, f_T(t)\, \mu_j(K) + o(1),$$

using the continuity of $g_2(\tau; t, \beta)$ in $\tau$ and of $f_T(t)$ in t. Similarly,

$$\text{var}(X_j) = \frac{1}{n^2} \sum_{i=1}^n \text{var}\Big[p_2(Y_i; \eta_i(t; \beta)) \Big(\frac{T_i - t}{h}\Big)^j w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big] \le \frac{1}{n^2} \sum_{i=1}^n E\Big[p_2^2(Y_i; \eta_i(t; \beta)) \Big(\frac{T_i - t}{h}\Big)^{2j} w_1^2(\mathbf{X}_i)\, \{K_h(T_i - t)\}^2\Big] \le \frac{C}{nh}.$$

This completes the proof. ☐
Lemma A2.
Assume Condition B. If $n \to \infty$, $h \to 0$, $nh \to \infty$ and $\log(1/h)/(nh) \to 0$, then $\sup_{\mathbf{a}^* \in \Theta, t \in \mathcal{T}_0, \beta \in \mathcal{K}} |G_n(\mathbf{a}^*; t, \beta) - \sqrt{nh}\, \mathbf{W}_n(t, \beta)^T \mathbf{a}^* - 2^{-1} \mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^*| = o_P(1)$, with $\Theta$ a compact set in $\mathbb{R}^{p+1}$.
Proof. 
Let $D_n(\mathbf{a}^*; t, \beta) = G_n(\mathbf{a}^*; t, \beta) - \sqrt{nh}\, \mathbf{W}_n(t, \beta)^T \mathbf{a}^*$. Note that:

$$D_n(\mathbf{a}^*; t, \beta) = nh\Big[\frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\eta_i(t; \beta) + a_n \mathbf{t}_i^*(t)^T \mathbf{a}^*))\, w_1(\mathbf{X}_i)\, K_h(T_i - t) - \frac{1}{n} \sum_{i=1}^n \rho_q(Y_i, F^{-1}(\eta_i(t; \beta)))\, w_1(\mathbf{X}_i)\, K_h(T_i - t) - \frac{1}{n} \sum_{i=1}^n p_1(Y_i; \eta_i(t; \beta))\, \{a_n \mathbf{t}_i^*(t)^T \mathbf{a}^*\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big] = \frac{1}{2}\, \mathbf{a}^{*T} \Big[\frac{1}{n} \sum_{i=1}^n p_2(Y_i; \widetilde{\eta}_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t)\Big] \mathbf{a}^*,$$

where $a_n = 1/\sqrt{nh}$ and $\widetilde{\eta}_i(t; \beta)$ lies between $\eta_i(t; \beta)$ and $\eta_i(t; \beta) + a_n \mathbf{t}_i^*(t)^T \mathbf{a}^*$. Then:

$$|D_n(\mathbf{a}^*; t, \beta) - 2^{-1} \mathbf{a}^{*T} \mathbf{S}_2(t, \beta)\, \mathbf{a}^*| = \frac{1}{2}\Big|\mathbf{a}^{*T} \Big[\frac{1}{n} \sum_{i=1}^n p_2(Y_i; \widetilde{\eta}_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t) - \mathbf{S}_2(t, \beta)\Big] \mathbf{a}^*\Big| \le \frac{\|\mathbf{a}^*\|^2}{2} \Big\|\frac{1}{n} \sum_{i=1}^n p_2(Y_i; \widetilde{\eta}_i(t; \beta))\, \{\mathbf{t}_i^*(t)\, \mathbf{t}_i^*(t)^T\}\, w_1(\mathbf{X}_i)\, K_h(T_i - t) - \mathbf{S}_2(t, \beta)\Big\|.$$

The proof is completed by applying [31]. ☐
Proof of Theorem 1.
Before showing Theorem 1, we need Proposition A1 (whose proof is omitted), where the following notation will be used. Denote by $C^1(\mathcal{T})$ the set of continuously differentiable functions on $\mathcal{T}$. Let $V(\beta)$ denote a neighborhood of $\beta \in \mathcal{K}$ with $V(\beta) \subseteq \mathcal{K}$, and let $H_\delta(\beta) = \{u \in C^1(\mathcal{T}) : \|u - \eta_\beta\|_\infty \le \delta,\ \|\partial u/\partial t - \partial\eta_\beta/\partial t\|_\infty \le \delta\}$ denote a neighborhood of $\eta_\beta$.
Proposition A1.
Let $\{(Y_i, X_i, T_i)\}_{i=1}^n$ be independent observations of $(Y, X, T)$ modeled by (2) and (5). Assume that the random variable $T$ is distributed on $\mathcal{T}$. Let $\mathcal{K}$ and $H_1(\beta)$ be compact sets, $g(\cdot;\cdot): \mathbb{R}^2 \to \mathbb{R}$ be a continuous and bounded function, $W(x,t): \mathbb{R}^{d+1} \to \mathbb{R}$ be such that $E\{|W(X,T)|\} < \infty$, and $\eta_\beta(t) = \eta(t,\beta): \mathbb{R}^{d+1} \to \mathbb{R}$ be a continuous function of $(t,\beta)$. Then:
(i)
$E\{g(Y; X^T\theta + v(T))\,W(X,T)\} \to E\{g(Y; X^T\beta + \eta_\beta(T))\,W(X,T)\}$ as $\|\theta - \beta\| + \|v - \eta_\beta\|_\infty \to 0$;
(ii)
$\sup_{\theta\in\mathcal{K}}\bigl|n^{-1}\sum_{i=1}^n g(Y_i; X_i^T\theta + \eta_\theta(T_i))\,W(X_i,T_i) - E\{g(Y; X^T\theta + \eta_\theta(T))\,W(X,T)\}\bigr| \xrightarrow{P} 0$ as $n \to \infty$;
(iii)
if, in addition, $\mathcal{T}$ is compact and $\eta_\beta \in C^1(\mathcal{T})$, then $\sup_{\theta\in\mathcal{K},\,v\in H_1(\beta)}$
$\bigl|n^{-1}\sum_{i=1}^n g(Y_i; X_i^T\theta + v(T_i))\,W(X_i,T_i) - E\{g(Y; X^T\theta + v(T))\,W(X,T)\}\bigr| \xrightarrow{P} 0$ as $n \to \infty$.
For part (i), we first show that for any compact set $\mathcal{K}$ in $\mathbb{R}^d$,
$$\sup_{\beta\in\mathcal{K}}|J_n(\beta,\hat\eta_\beta) - J(\beta,\eta_\beta)| \xrightarrow{P} 0. \tag{A3}$$
It suffices to show $\sup_{\beta\in\mathcal{K}}|J_n(\beta,\eta_\beta) - J(\beta,\eta_\beta)| \xrightarrow{P} 0$, which follows from Proposition A1 (ii), and:
$$\sup_{\beta\in\mathcal{K}}|J_n(\beta,\hat\eta_\beta) - J_n(\beta,\eta_\beta)| \xrightarrow{P} 0. \tag{A4}$$
To show (A4), note that for any $\varepsilon > 0$, we can choose a compact set $\mathcal{T}_0$ such that $P(T_i \notin \mathcal{T}_0) < \varepsilon$. Then:
$$
\begin{aligned}
J_n(\beta,\hat\eta_\beta) - J_n(\beta,\eta_\beta) &= \frac{1}{n}\sum_{i=1}^n \bigl\{\rho_q(Y_i, F^{-1}(X_i^T\beta + \hat\eta_\beta(T_i))) - \rho_q(Y_i, F^{-1}(X_i^T\beta + \eta_\beta(T_i)))\bigr\}\,w_2(X_i)\,I(T_i \in \mathcal{T}_0) \\
&\quad + \frac{1}{n}\sum_{i=1}^n \bigl\{\rho_q(Y_i, F^{-1}(X_i^T\beta + \hat\eta_\beta(T_i))) - \rho_q(Y_i, F^{-1}(X_i^T\beta + \eta_\beta(T_i)))\bigr\}\,w_2(X_i)\,I(T_i \notin \mathcal{T}_0).
\end{aligned}
$$
For $T_i \in \mathcal{T}_0$, by the mean-value theorem,
$$|\rho_q(Y_i, F^{-1}(X_i^T\beta + \hat\eta_\beta(T_i))) - \rho_q(Y_i, F^{-1}(X_i^T\beta + \eta_\beta(T_i)))| = |p_1(Y_i; X_i^T\beta + \eta_{i,\beta})\{\hat\eta_\beta(T_i) - \eta_\beta(T_i)\}| \le \|p_1(\cdot;\cdot)\|_\infty \sup_{\beta\in\mathcal{K}}\|\hat\eta_\beta - \eta_\beta\|_{\mathcal{T}_0;\infty},$$
where $\eta_{i,\beta}$ is located between $\hat\eta_\beta(T_i)$ and $\eta_\beta(T_i)$. For $T_i \notin \mathcal{T}_0$, it follows that:
$$|\rho_q(Y_i, F^{-1}(X_i^T\beta + \hat\eta_\beta(T_i))) - \rho_q(Y_i, F^{-1}(X_i^T\beta + \eta_\beta(T_i)))| \le 2\|\rho_q(\cdot,\cdot)\|_\infty.$$
Hence,
$$|J_n(\beta,\hat\eta_\beta) - J_n(\beta,\eta_\beta)| \le \Bigl\{\|p_1(\cdot;\cdot)\|_\infty \sup_{\beta\in\mathcal{K}}\|\hat\eta_\beta - \eta_\beta\|_{\mathcal{T}_0;\infty} + 2\|\rho_q(\cdot,\cdot)\|_\infty\,T_n\Bigr\}\,\|w_2\|_\infty \le 2\varepsilon\,\|\rho_q(\cdot,\cdot)\|_\infty\|w_2\|_\infty\{1 + o_P(1)\},$$
where the last inequality is entailed by Lemma 1 (which makes the first term $o_P(1)$) and the law of large numbers for $T_n = n^{-1}\sum_{i=1}^n I(T_i \notin \mathcal{T}_0)$. Since $\varepsilon > 0$ is arbitrary, this completes the proof of (A3). The proof of $\hat\beta \xrightarrow{P} \beta_o$ follows from combining Lemma A-1 of [1] with (A3) and Condition A2.
Part (ii) follows from Lemma 1, Part (i), and Condition B5 for $\eta_\beta(t)$. ☐
Proof of Theorem 2.
Similar to the proof of Lemma 1, it can be shown that
$$\Bigl|\hat\eta_\beta(t) - \eta_\beta(t) + e_{1,p+1}^T\{S_2(t,\beta)\}^{-1}\frac{1}{n}\sum_{i=1}^n p_1(Y_i;\eta_i(t;\beta))\,t_i(t)\,w_1(X_i)\,K_h(T_i - t)\Bigr| = O_P\bigl(h^2 a_n + a_n^2\log(1/h)\bigr).$$
Note that for $p = 1$,
$$e_{1,p+1}^T\{S_2(t,\beta)\}^{-1}t_i(t) = \frac{1}{g_2(t;t,\beta)\,f_T(t)}\,(1,\ 0)\begin{pmatrix}1 & 0\\ 0 & 1/\mu_2(K)\end{pmatrix}\begin{pmatrix}1\\ (T_i - t)/h\end{pmatrix} = \frac{1}{g_2(t;t,\beta)\,f_T(t)}.$$
Thus:
$$\Bigl|\hat\eta_\beta(t) - \eta_\beta(t) + \frac{1}{n\,f_T(t)\,g_2(t;t,\beta)}\sum_{i=1}^n p_1(Y_i;\eta_i(t;\beta))\,w_1(X_i)\,K_h(T_i - t)\Bigr| = O_P\bigl(h^2 a_n + a_n^2\log(1/h)\bigr).$$
Consider $\hat\beta$ defined in (23). Note that:
$$X_i^T\beta + \hat\eta_\beta(T_i) = X_i^T\beta_o + X_i^T(\beta - \beta_o) + \hat\eta_{(\beta-\beta_o)+\beta_o}(T_i) = X_i^T\beta_o + c_n X_i^T\{\sqrt{n}(\beta - \beta_o)\} + \hat\eta_{c_n\{\sqrt{n}(\beta-\beta_o)\}+\beta_o}(T_i),$$
where $c_n = 1/\sqrt{n}$. Then, $\hat\theta = \sqrt{n}(\hat\beta - \beta_o)$ minimizes:
$$J_n(\theta) = n\Bigl[\frac{1}{n}\sum_{i=1}^n \bigl\{\rho_q\bigl(Y_i, F^{-1}(X_i^T\beta_o + c_n X_i^T\theta + \hat\eta_{c_n\theta+\beta_o}(T_i))\bigr)\,w_2(X_i) - \rho_q\bigl(Y_i, F^{-1}(X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\bigr)\,w_2(X_i)\bigr\}\Bigr]$$
with respect to $\theta$. By Taylor expansion,
$$
\begin{aligned}
J_n(\theta) &= n\Bigl(\frac{1}{n}\sum_{i=1}^n p_1(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\bigl[c_n X_i^T\theta + \{\hat\eta_{c_n\theta+\beta_o}(T_i) - \hat\eta_{\beta_o}(T_i)\}\bigr]\,w_2(X_i) \\
&\qquad + \frac{1}{2n}\sum_{i=1}^n p_2(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\bigl[c_n X_i^T\theta + \{\hat\eta_{c_n\theta+\beta_o}(T_i) - \hat\eta_{\beta_o}(T_i)\}\bigr]^2\,w_2(X_i) \\
&\qquad + \frac{1}{6n}\sum_{i=1}^n p_3(Y_i;\eta_i^*)\bigl[c_n X_i^T\theta + \{\hat\eta_{c_n\theta+\beta_o}(T_i) - \hat\eta_{\beta_o}(T_i)\}\bigr]^3\,w_2(X_i)\Bigr) = I_{n,1} + I_{n,2} + I_{n,3},
\end{aligned}
$$
where $\eta_i^*$ is located between $X_i^T\beta_o + \hat\eta_{\beta_o}(T_i)$ and $X_i^T\beta_o + c_n X_i^T\theta + \hat\eta_{c_n\theta+\beta_o}(T_i)$,
$$
\begin{aligned}
I_{n,1} &= \sum_{i=1}^n p_1(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\Bigl[c_n X_i^T\theta + \frac{\partial\hat\eta_\beta(T_i)}{\partial\beta^T}\Big|_{\beta=\beta_n} c_n\theta\Bigr]\,w_2(X_i) \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n p_1(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\hat\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_n}\Bigr\}^T\theta\,w_2(X_i) \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n p_1(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}^T\theta\,w_2(X_i) + o_P(1), \\
I_{n,2} &= \frac{1}{2}\theta^T\Bigl[\frac{1}{n}\sum_{i=1}^n p_2(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\hat\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_n}\Bigr\}\Bigl\{X_i + \frac{\partial\hat\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_n}\Bigr\}^T w_2(X_i)\Bigr]\theta = \frac{1}{2}\theta^T B_2\theta + o_P(1), \\
I_{n,3} &= o_P(1),
\end{aligned}
$$
where $\beta_n$ is located between $\beta_o$ and $c_n\theta + \beta_o$, and $B_2 = H_0$ by Lemma 1, Condition A3 and Proposition A1. Thus:
$$J_n(\theta) = I_{n,1}^T\theta + \frac{1}{2}\theta^T B_2\theta + o_P(1), \tag{A5}$$
where $I_{n,1} = \frac{1}{\sqrt{n}}\sum_{i=1}^n p_1(Y_i; X_i^T\beta_o + \hat\eta_{\beta_o}(T_i))\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}|_{\beta=\beta_o}\}\,w_2(X_i)$. Note that:
$$
\begin{aligned}
I_{n,1} &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \Bigl[p_1(Y_i; X_i^T\beta_o + \eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_i) \\
&\qquad + p_2(Y_i; X_i^T\beta_o + \eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_i)\,\{\hat\eta_{\beta_o}(T_i) - \eta_{\beta_o}(T_i)\} \\
&\qquad + \frac{1}{2}p_3(Y_i;\eta_i^*)\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_i)\,\{\hat\eta_{\beta_o}(T_i) - \eta_{\beta_o}(T_i)\}^2\Bigr] = T_{n,1} + T_{n,2} + T_{n,3},
\end{aligned}
$$
where $\eta_i^*$ is between $X_i^T\beta_o + \hat\eta_{\beta_o}(T_i)$ and $X_i^T\beta_o + \eta_{\beta_o}(T_i)$,
$$
\begin{aligned}
T_{n,3} &= o_P(1), \\
T_{n,2} &= \frac{1}{\sqrt{n}}\sum_{i=1}^n p_2(Y_i; X_i^T\beta_o + \eta_{\beta_o}(T_i))\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_i) \times \frac{(-1)}{n\,f_T(T_i)\,g_2(T_i;T_i,\beta_o)}\sum_{j=1}^n p_1(Y_j;\eta_j(T_i;\beta_o))\,w_1(X_j)\,K_h(T_j - T_i) \\
&= -\frac{1}{\sqrt{n}}\sum_{j=1}^n \frac{p_1(Y_j;\eta(X_j,T_j))\,w_1(X_j)}{g_2(T_j;T_j,\beta_o)}\,E\Bigl[p_2(Y_j;\eta(X_j,T_j))\Bigl\{X_j + \frac{\partial\eta_\beta(T_j)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_j)\Big|\,T_j\Bigr] + o_P(1) \\
&= -\frac{1}{\sqrt{n}}\sum_{j=1}^n p_1(Y_j;\eta(X_j,T_j))\,\frac{\gamma(T_j)}{g_2(T_j;T_j,\beta_o)}\,w_1(X_j) + o_P(1),
\end{aligned}
$$
with:
$$\gamma(t) = E\Bigl[p_2(Y;\eta(X,T))\Bigl\{X + \frac{\partial\eta_\beta(T)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X)\Big|\,T = t\Bigr].$$
Therefore,
$$I_{n,1} = \frac{1}{\sqrt{n}}\sum_{i=1}^n p_1(Y_i;\eta(X_i,T_i))\Bigl[\Bigl\{X_i + \frac{\partial\eta_\beta(T_i)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X_i) - \frac{\gamma(T_i)}{g_2(T_i;T_i,\beta_o)}\,w_1(X_i)\Bigr] + o_P(1).$$
By the central limit theorem,
$$I_{n,1} \xrightarrow{D} N(0, \Omega_0), \tag{A6}$$
where:
$$\Omega_0 = E\biggl(p_1^2(Y;\eta(X,T))\Bigl[\Bigl\{X + \frac{\partial\eta_\beta(T)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X) - \frac{\gamma(T)}{g_2(T;T,\beta_o)}\,w_1(X)\Bigr]\Bigl[\Bigl\{X + \frac{\partial\eta_\beta(T)}{\partial\beta}\Big|_{\beta=\beta_o}\Bigr\}\,w_2(X) - \frac{\gamma(T)}{g_2(T;T,\beta_o)}\,w_1(X)\Bigr]^T\biggr).$$
From (A5) and (A6), $\hat\theta = -B_2^{-1}I_{n,1} + o_P(1)$. This implies that $\sqrt{n}(\hat\beta - \beta_o) \xrightarrow{D} N(0, H_0^{-1}\Omega_0 H_0^{-1})$. ☐
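In practice, the limiting covariance $H_0^{-1}\Omega_0 H_0^{-1}$ is replaced by the plug-in sandwich estimate $\hat V_n = \hat H_0^{-1}\hat\Omega_0\hat H_0^{-1}$ used in Theorem 3 below. Here is a minimal sketch (an illustration only: the arrays `scores` and `hessians` stand for per-observation estimates of the bracketed influence terms in $I_{n,1}$ and of the curvature contributions to $H_0$, which the sketch assumes are already computed):

```python
import numpy as np

def sandwich_cov(scores, hessians):
    """Plug-in sandwich covariance estimate H^{-1} Omega H^{-1} (a sketch).

    scores   : (n, d) array of estimated per-observation influence vectors
    hessians : (n, d, d) array of per-observation curvature contributions
    """
    H_hat = hessians.mean(axis=0)                        # estimate of H_0
    Omega_hat = (scores[:, :, None] * scores[:, None, :]).mean(axis=0)
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ Omega_hat @ H_inv                     # estimate of V_0

# Then sqrt(n) (beta_hat - beta_o) is approximately N(0, sandwich_cov(...)).
```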
Proof of Theorem 3.
Denote $V_0 = H_0^{-1}\Omega_0 H_0^{-1}$ and $\hat V_n = \hat H_0^{-1}\hat\Omega_0\hat H_0^{-1}$. Note that $A\hat\beta - g_0 = A(\hat\beta - \beta_o) + (A\beta_o - g_0)$. Thus:
$$(A\hat V_n A^T)^{-1/2}\sqrt{n}(A\hat\beta - g_0) = (A\hat V_n A^T)^{-1/2}\{A\sqrt{n}(\hat\beta - \beta_o)\} + (A\hat V_n A^T)^{-1/2}\{\sqrt{n}(A\beta_o - g_0)\} \equiv I_1 + I_2,$$
which implies that $W_n = \|I_1 + I_2\|^2$. Arguments for Theorem 2 give $I_1 \xrightarrow{D} N(0, I_k)$. Under $H_0$ in (4), $I_2 = 0$, and thus $(I_1 + I_2) \xrightarrow{D} N(0, I_k)$, so that $W_n \xrightarrow{D} \chi^2_k$, which completes the proof. ☐
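For completeness, the Wald-type statistic reduces to the familiar quadratic form $W_n = n(A\hat\beta - g_0)^T(A\hat V_n A^T)^{-1}(A\hat\beta - g_0)$, whose null p-value is read off the $\chi^2_k$ law. A hedged sketch (an illustration, not the authors' code; $\hat\beta$, $\hat V_n$, $A$ and $g_0$ are assumed given):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, V_hat, A, g0, n):
    """Robust Wald-type test of H_0: A beta_o = g_0 (illustrative sketch).
    Returns (W_n, p-value); W_n is asymptotically chi^2_k under H_0."""
    k = A.shape[0]
    r = A @ beta_hat - g0                            # restriction residual
    W = n * r @ np.linalg.solve(A @ V_hat @ A.T, r)  # n r^T (A V A^T)^{-1} r
    return W, chi2.sf(W, df=k)
```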
Proof of Theorem 4.
Follow the notation and proof in Theorem 3. Under $H_{1n}$ in (29), $I_2 \xrightarrow{P} (AV_0A^T)^{-1/2}c$, and thus $(I_1 + I_2) \xrightarrow{D} N((AV_0A^T)^{-1/2}c, I_k)$. This completes the proof. ☐
Proof of Theorem 5.
Following the notation and proof in Theorem 3, $W_n = \|I_1\|^2 + 2I_1^TI_2 + \|I_2\|^2$. We see that $\|I_1\|^2 \xrightarrow{D} \chi^2_k$. Under $H_1$ in (4), $I_2 = (AV_0A^T)^{-1/2}\sqrt{n}(A\beta_o - g_0)\{1 + o_P(1)\}$, which means $\|I_2\|^2 = n(A\beta_o - g_0)^T(AV_0A^T)^{-1}(A\beta_o - g_0)\{1 + o_P(1)\}$ and thus $I_1^TI_2 = O_P(\sqrt{n})$. Hence, $n^{-1}W_n \ge \lambda_{\min}\{(AV_0A^T)^{-1}\}\,\|A\beta_o - g_0\|^2 + o_P(1)$. This completes the proof. ☐
Proof of Theorem 6.
Denote $J_n(\beta) = J_n(\beta,\hat\eta_\beta)$. For the matrix $A$ in (4), there exists a $(d-k)\times d$ matrix $B$ satisfying $BB^T = I_{d-k}$ and $AB^T = 0$. Therefore, $A\beta = g_0$ is equivalent to $\beta = B^T\gamma + b_0$ for some vector $\gamma \in \mathbb{R}^{d-k}$ and $b_0 = A^T(AA^T)^{-1}g_0$. Then, minimizing $J_n(\beta)$ subject to $A\beta = g_0$ is equivalent to minimizing $J_n(B^T\gamma + b_0)$ with respect to $\gamma$, and we denote by $\hat\gamma$ the minimizer. Furthermore, under $H_0$ in (4), we have $\beta_o = B^T\gamma_0 + b_0$ for $\gamma_0 = B\beta_o$, and $\hat\gamma - \gamma_0 \xrightarrow{P} 0$.
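Numerically, such a matrix $B$ can be taken from an orthonormal basis of the null space of $A$, e.g., via the singular value decomposition. A small sketch (our illustration, not the authors' code):

```python
import numpy as np

def null_space_param(A, g0):
    """Return (B, b0) with B B^T = I_{d-k} and A B^T = 0, so that
    {beta : A beta = g0} = {B^T gamma + b0 : gamma in R^{d-k}}."""
    k = A.shape[0]
    _, _, Vt = np.linalg.svd(A)                 # rows k..d-1 of Vt span null(A)
    B = Vt[k:, :]                               # (d - k) x d, orthonormal rows
    b0 = A.T @ np.linalg.solve(A @ A.T, g0)     # b0 = A^T (A A^T)^{-1} g0
    return B, b0

# sanity check on a random full-row-rank instance
rng = np.random.default_rng(1)
A, g0 = rng.normal(size=(2, 5)), rng.normal(size=2)
B, b0 = null_space_param(A, g0)
gamma = rng.normal(size=3)
print(np.allclose(B @ B.T, np.eye(3)),          # B B^T = I_{d-k}
      np.allclose(A @ B.T, 0.0),                # A B^T = 0
      np.allclose(A @ (B.T @ gamma + b0), g0))  # A beta = g0 on the subspace
```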
For Part (i), using the Taylor expansion around $\hat\beta$, we get:
$$J_n(B^T\hat\gamma + b_0) - J_n(\hat\beta) = \frac{1}{2n}\{\sqrt{n}(B^T\hat\gamma + b_0 - \hat\beta)\}^T\,\nabla^2 J_n(\tilde\beta)\,\{\sqrt{n}(B^T\hat\gamma + b_0 - \hat\beta)\}, \tag{A7}$$
where $\tilde\beta$ is between $B^T\hat\gamma + b_0$ and $\hat\beta$. We now discuss $B^T\hat\gamma + b_0 - \hat\beta$. From the proof of Theorem 2, $(\hat\beta - \beta_o) = -H_0^{-1}\nabla J_n(\beta_o)\{1 + o_P(1)\}$, where $\nabla J_n(\beta_o) = \{I_{n,1} + o_P(1)\}/\sqrt{n}$. Similar arguments deduce $\hat\gamma - \gamma_0 = -(BH_0B^T)^{-1}B\,\nabla J_n(\beta_o)\{1 + o_P(1)\}$. Thus, under $H_0$ in (4),
$$B^T\hat\gamma + b_0 - \hat\beta = B^T(\hat\gamma - \gamma_0) - (\hat\beta - \beta_o) = H_0^{-1/2}\,P_{H_0^{-1/2}A^T}\,H_0^{-1/2}\,\nabla J_n(\beta_o)\{1 + o_P(1)\},$$
where $P_{H_0^{-1/2}A^T} = H_0^{-1/2}A^T(AH_0^{-1}A^T)^{-1}AH_0^{-1/2}$ is the projection matrix onto the column space of $H_0^{-1/2}A^T$,
and thus by (A6),
$$\sqrt{n}(B^T\hat\gamma + b_0 - \hat\beta) \xrightarrow{D} H_0^{-1/2}\,P_{H_0^{-1/2}A^T}\,H_0^{-1/2}\,\Omega_0^{1/2}Z, \tag{A8}$$
where $Z = (Z_1,\ldots,Z_d)^T \sim N(0, I_d)$. Combining the fact $\nabla^2 J_n(\tilde\beta) \xrightarrow{P} H_0$ with (A7) and (A8) gives:
$$\Lambda_n = \{\sqrt{n}(B^T\hat\gamma + b_0 - \hat\beta)\}^T H_0 \{\sqrt{n}(B^T\hat\gamma + b_0 - \hat\beta)\}\{1 + o_P(1)\} \xrightarrow{D} Z^T\Omega_0^{1/2}H_0^{-1/2}\,P_{H_0^{-1/2}A^T}\,H_0^{-1/2}\Omega_0^{1/2}Z = \sum_{j=1}^d \lambda_j\bigl(\Omega_0^{1/2}H_0^{-1/2}P_{H_0^{-1/2}A^T}H_0^{-1/2}\Omega_0^{1/2}\bigr)Z_j^2 = \sum_{j=1}^k \lambda_j\bigl\{(AH_0^{-1}A^T)^{-1}(AV_0A^T)\bigr\}Z_j^2. \tag{A9}$$
This proves Part (i).
For Part (ii), using $\psi(r) = r$, $w_1(x) = w_2(x) \equiv 1$ and (31), we obtain $\Omega_0 = C\,H_0$ for a constant $C > 0$, and thus $AV_0A^T = C\,(AH_0^{-1}A^T)$. Thus, (A9) $= C\sum_{j=1}^k Z_j^2 \sim C\,\chi^2_k$, which completes the proof. ☐
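Unlike $W_n$, the null limit of $\Lambda_n$ in Part (i) depends on the unknown eigenvalues $\lambda_j\{(AH_0^{-1}A^T)^{-1}(AV_0A^T)\}$, so its critical values are not distribution free; with plug-in estimates of $H_0$ and $V_0$ they can, however, be simulated. A hedged sketch (an illustration; `H_hat` and `V_hat` are assumed available):

```python
import numpy as np

def lambda_weights(A, H_hat, V_hat):
    """Eigenvalues lambda_j{(A H^{-1} A^T)^{-1} (A V A^T)} that weight the
    chi^2_1 terms in the null limit of Lambda_n (plug-in illustration)."""
    AHA = A @ np.linalg.solve(H_hat, A.T)       # A H^{-1} A^T
    AVA = A @ V_hat @ A.T                       # A V A^T
    return np.linalg.eigvals(np.linalg.solve(AHA, AVA)).real

def simulate_null(lam, n_draws=100_000, seed=0):
    """Monte Carlo draws of sum_j lam_j Z_j^2 with Z_j iid N(0, 1)."""
    rng = np.random.default_rng(seed)
    return (rng.standard_normal((n_draws, lam.size))**2) @ lam

# 5% critical value: np.quantile(simulate_null(lambda_weights(A, H, V)), 0.95)
```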
Proof of Theorem 7.
The proofs are similar to those of Theorems 4–6. The lengthy details are omitted. ☐

References

  1. Andrews, D. Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 1994, 62, 43–72.
  2. Robinson, P.M. Root-n consistent semiparametric regression. Econometrica 1988, 56, 931–954.
  3. Speckman, P. Kernel smoothing in partial linear models. J. R. Statist. Soc. B 1988, 50, 413–436.
  4. Yatchew, A. An elementary estimator of the partial linear model. Econ. Lett. 1997, 57, 135–143.
  5. Fan, J.; Li, R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Am. Stat. Assoc. 2004, 99, 710–723.
  6. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman & Hall: London, UK, 1989.
  7. Zhang, C.M.; Yu, T. Semiparametric detection of significant activation for brain fMRI. Ann. Stat. 2008, 36, 1693–1725.
  8. Fan, J.; Huang, T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 2005, 11, 1031–1057.
  9. Boente, G.; He, X.; Zhou, J. Robust estimates in generalized partially linear models. Ann. Stat. 2006, 34, 2856–2878.
  10. Zhang, C.M.; Guo, X.; Cheng, C.; Zhang, Z.J. Robust-BD estimation and inference for varying-dimensional general linear models. Stat. Sin. 2014, 24, 653–673.
  11. Fan, J.; Gijbels, I. Local Polynomial Modeling and Its Applications; Chapman and Hall: London, UK, 1996.
  12. Brègman, L.M. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 620–631.
  13. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2001.
  14. Zhang, C.M.; Jiang, Y.; Shang, Z. New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Can. J. Stat. 2009, 37, 119–139.
  15. Huber, P. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101.
  16. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 1998.
  17. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22.
  18. Carroll, R.; Fan, J.; Gijbels, I.; Wand, M. Generalized partially linear single-index models. J. Am. Stat. Assoc. 1997, 92, 477–489.
  19. Mukarjee, H.; Stern, S. Feasible nonparametric estimation of multiargument monotone functions. J. Am. Stat. Assoc. 1994, 89, 77–80.
  20. Albright, S.C.; Winston, W.L.; Zappe, C.J. Data Analysis and Decision Making with Microsoft Excel; Duxbury Press: Pacific Grove, CA, USA, 1999.
  21. Fan, J.; Peng, H. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 2004, 32, 928–961.
  22. Dette, H. A consistent test for the functional form of a regression based on a difference of variance estimators. Ann. Stat. 1999, 27, 1012–1050.
  23. Dette, H.; von Lieres und Wilkau, C. Testing additivity by kernel-based methods. Bernoulli 2001, 7, 669–697.
  24. Fan, J.; Zhang, C.M.; Zhang, J. Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Stat. 2001, 29, 153–193.
  25. Hong, Y.M.; Lee, Y.J. A loss function approach to model specification testing and its relative efficiency. Ann. Stat. 2013, 41, 1166–1203.
  26. Zheng, J.X. A consistent test of functional form via nonparametric estimation techniques. J. Econ. 1996, 75, 263–289.
  27. Opsomer, J.D.; Ruppert, D. A root-n consistent backfitting estimator for semiparametric additive modeling. J. Comput. Graph. Stat. 1999, 8, 715–732.
  28. Belloni, A.; Chernozhukov, V.; Hansen, C. Inference on treatment effects after selection amongst high-dimensional controls. Rev. Econ. Stud. 2014, 81, 608–650.
  29. Cattaneo, M.D.; Jansson, M.; Newey, W.K. Alternative asymptotics and the partially linear model with many regressors. Econ. Theory 2016, 1–25.
  30. Cattaneo, M.D.; Jansson, M.; Newey, W.K. Treatment effects with many covariates and heteroskedasticity. arXiv 2015, arXiv:1507.02493.
  31. Mack, Y.P.; Silverman, B.W. Weak and strong uniform consistency of kernel regression estimates. Z. Wahrsch. Verw. Gebiete 1982, 61, 405–415.
Figure 1. Simulated Bernoulli response data without contamination. Boxplots of $(\hat\beta_j - \beta_{j;o})$, $j = 1,\ldots,d$ (from left to right). (Left panels): non-robust method; (right panels): robust method.
Figure 2. Simulated Bernoulli response data with contamination. The captions are identical to those in Figure 1.
Figure 3. Simulated Bernoulli response data without contamination. Plots of $\eta_o(t)$ and $\hat\eta_{\hat\beta}(t)$. (Left panels): non-robust method; (right panels): robust method.
Figure 4. Simulated Bernoulli response data with contamination. Plots of $\eta_o(t)$ and $\hat\eta_{\hat\beta}(t)$. (Left panels): non-robust method; (right panels): robust method.
Figure 5. Simulated Gaussian response data without contamination. Top panels: boxplots of $(\hat\beta_j - \beta_{j;o})$, $j = 1,\ldots,d$ (from left to right). Bottom panels: plots of $\eta_o(t)$ and $\hat\eta_{\hat\beta}(t)$. (Left panels): non-robust method; (right panels): robust method.
Figure 6. Simulated Gaussian response data with contamination. Top panels: boxplots of $(\hat\beta_j - \beta_{j;o})$, $j = 1,\ldots,d$ (from left to right). Bottom panels: plots of $\eta_o(t)$ and $\hat\eta_{\hat\beta}(t)$. (Left panels): non-robust method; (right panels): robust method.
Figure 7. Simulated Gaussian response data with contamination. Empirical quantiles (on the y-axis) of the Wald-type statistics $W_n$ versus quantiles (on the x-axis) of the $\chi^2$ distribution. Solid line: the 45 degree reference line. (Left panels): non-robust method; (right panels): robust method.
Figure 8. Observed power curves of tests for the Gaussian response data. The dashed line corresponds to the non-robust Wald-type test $W_n$; the solid line corresponds to the robust $W_n$; the dotted line indicates the 5% nominal level. (Left panels): non-contaminated case; (right panels): contaminated case.
Figure 9. The dataset in [19]. (Left panels): estimate of $\eta(T)$ via the non-robust quadratic loss; (right panels): estimate of $\eta(T)$ via the robust quadratic loss.
Figure 10. The dataset in [20]. (Left panel): estimate of $\eta(T)$ via the non-robust quadratic loss; (right panel): estimate of $\eta(T)$ via the robust quadratic loss.
Table 1. Parameter estimates and p-values for the partially linear model of the dataset in [20].

Variable | Classical-BD Estimate (s.e.) | p-Value of $W_n$ | Robust-BD Estimate (s.e.) | p-Value of $W_n$
Female | −0.0491 (0.0232) | 0.0339 | 0.0530 (0.0323) | 0.1010
YrHired | −0.0093 (0.0026) | 0.0005 | 0.0359 (0.0086) | 0.0000
EducLev | 0.0179 (0.0079) | 0.0228 | −0.0133 (0.0131) | 0.3103
JobGrade | 0.0899 (0.0075) | 0.0000 | 0.1672 (0.0168) | 0.0000
YrsPrior | 0.0033 (0.0023) | 0.1528 | −0.0050 (0.0061) | 0.4104
