Article

Estimation for Varying Coefficient Models with Hierarchical Structure

1 School of Mathematics and Statistics, Zhengzhou University, Zhengzhou 450001, China
2 Henan Key Laboratory of Financial Engineering, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(2), 132; https://doi.org/10.3390/math9020132
Submission received: 30 November 2020 / Revised: 5 January 2021 / Accepted: 7 January 2021 / Published: 9 January 2021
(This article belongs to the Section Probability and Statistics)

Abstract

The varying coefficient (VC) model is a generalization of the ordinary linear model that retains strong interpretability while offering the flexibility of a nonparametric model. In this paper, we investigate a VC model with hierarchical structure. A unified variable selection method for the VC model is proposed, which can simultaneously select the nonzero effects and estimate the unknown coefficient functions. Meanwhile, the selected model enforces the hierarchical structure; that is, interaction terms can be selected into the model only if the corresponding main effects are in the model. The kernel method is employed to estimate the varying coefficient functions, and a combined overlapped group Lasso regularization is introduced to implement variable selection while preserving the hierarchical structure. It is proved that the proposed penalized estimators have the oracle property; that is, the coefficients are estimated as well as if the true model were known in advance. Simulation studies and a real data analysis are carried out to examine the performance of the proposed method in the finite sample case.

1. Introduction

The varying coefficient model [1] is defined as
$$Y = \sum_{j=1}^{d} X_j \beta_j(U) + \varepsilon, \qquad (1)$$
where $Y$ is the response variable, $(X_1, X_2, \ldots, X_d, U)$ are its associated covariates, $\varepsilon$ is the error term, and the $\beta_j(U)$ are unknown smooth coefficient functions of an observable continuous covariate $U$. As in the linear model, the predictors $X_j$ ($j = 1, 2, \ldots, d$) affect the response variable $Y$ linearly, but their coefficients are allowed to change smoothly with the covariate $U$. Thus, each value of $U$ is associated with a different linear model, which allows one to examine the extent to which the association between the response $Y$ and the covariates $X_j$ varies over the covariate $U$ (see [1,2]). In environmental data analysis [2], one objective of the study is to investigate the association between the level of pollutants and the number of daily total hospital admissions for circulatory and respiratory problems, as well as to examine how the association varies over time, where $U = t$ is time. Moreover, by avoiding both a fully parametric specification and a fully multivariate nonparametric structure, the VC model can significantly reduce the modelling bias and avoid the "curse of dimensionality".
Owing to the good interpretability and flexibility of the varying coefficient model, a considerable amount of literature has been published on estimation and hypothesis testing for the VC model since it was introduced (see, e.g., [2,3,4,5,6]). Recently, variable selection and model detection for the varying coefficient model have gained much attention. For instance, Wang and Xia [7] combined the ideas of local polynomial smoothing and the Lasso (least absolute shrinkage and selection operator [8]) to estimate the coefficients and select variables simultaneously; Zhao and Xue [9] employed basis function approximations and the SCAD (smoothly clipped absolute deviation [10]) penalty for the semiparametric varying coefficient partially linear model; Tang et al. [11] developed a unified variable selection approach for varying coefficient models; Li et al. [12] studied model selection and structure specification for generalized semi-varying coefficient models; and He et al. [13] introduced a dimensionality reduction and variable selection method for multivariate varying-coefficient models with a large number of covariates.
In many complex situations, main effects combined with interaction effects may be sufficient to characterize the relationship between the response and the predictors. In social, political, and economic problems and in genome-wide association studies, it is useful to identify nontrivial interactions between covariates when modeling selection results, product sales, social networks, stock market changes, and disease risk. Recent years have seen a surge of interest in interaction identification in the high dimensional setting. For instance, Hall and Xue [14] proposed a recursive approach to identify important interactions among covariates; Niu et al. [15] proposed a forward-selection-based screening method for identifying interactions in ultra-high dimensional data; Kong et al. [16] suggested a two-stage interaction identification method, called interaction pursuit via distance correlation, for high dimensional multi-response regression; and Radchenko and James [17] investigated variable selection for nonlinear additive regression models with interaction structures by a group regularization method. Specifically, for the high dimensional linear model with interaction terms,
$$Y = \sum_{j=1}^{d} X_j \beta_j + \sum_{1 \le j < k \le d} X_j X_k \phi_{jk} + \varepsilon, \qquad (2)$$
some important works include but are not limited to the following: Choi et al. [18] reparameterized the coefficients of the interaction terms; Bien et al. [19] added a set of convex constraints to the Lasso to produce sparse interaction terms; Zhao et al. [20] introduced the composite absolute penalties family, defining groups with particular overlapping patterns to express the relationships between the predictors; and Lim and Hastie [21] developed a method for learning pairwise interactions via hierarchical group-Lasso regularization. A key feature of these models is their hierarchical structure: as the interaction effects are derived from the main effects, the interaction terms exist only if the main terms are significant in the model. This is also referred to as the marginality principle in generalized linear models [22] or strong heredity in the analysis of designed experiments [23].
Compared to the parametric model (2), the VC model with interaction terms is its direct extension to the nonparametric case, where the coefficients are unknown smooth functions of some covariate. However, the estimation methods for model (2) cannot be directly extended to the VC model with interaction terms. Variable selection for the VC model including interaction effects is also important, since ignoring important predictors can lead to biased results, while including irrelevant predictors may lead to a loss of efficiency. Moreover, variable selection for the VC model with interaction terms is even more complex, since nonzero functional coefficients rather than nonzero parameters need to be identified. In this paper, we aim to develop a unified variable selection method for the VC model with hierarchical structure, which not only identifies the significant variables with nonzero functional coefficients but also ensures that the selected model keeps the hierarchical structure; that is, interaction terms can be selected into the model only if the corresponding main effects are in the model. Firstly, the kernel smoothing method is employed to obtain initial estimates of the varying coefficient functions. Secondly, local penalized least squares estimates with an overlapped group Lasso penalty are proposed to simultaneously achieve variable selection and coefficient estimation, and the estimators enforce the hierarchical structure. Thirdly, it is proved that the proposed estimators have the oracle property; that is, the functional coefficients are estimated as well as if the true model were known in advance.
The rest of the paper is organized as follows. In Section 2, we propose the local penalized least squares estimator with group Lasso penalty, which enforces the hierarchical structure. The asymptotic properties of the estimators are investigated in Section 3. In Section 4, we conduct some simulations and the Boston housing data analysis to assess the finite sample performance of the new estimators. The conclusion and future works are discussed in Section 5. Proofs of the theorems are postponed to Appendix A.

2. Modeling and Estimation

The varying coefficient model with hierarchical structure is defined as follows,
$$Y_i = \sum_{j=1}^{d} X_{ij}\beta_j(U_i) + \sum_{1 \le j < k \le d} X_{ij}X_{ik}\phi_{jk}(U_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (3)$$
where $Y_i$ is the response variable, $X_{ij}$, $j = 1, 2, \ldots, d$, are the predictive variables, $U_i$ is the index covariate, $\varepsilon_i$ is the random error satisfying $E(\varepsilon_i \mid X_i, U_i) = 0$ and $\mathrm{Var}(\varepsilon_i \mid X_i, U_i) = \sigma^2(U_i)$, $\beta_j(U_i)$ is the unknown smooth coefficient function, and $\phi_{jk}(U_i)$ is the coefficient function associated with the interaction term $X_{ij}X_{ik}$. The hierarchical structure means that the interaction term $X_{ij}X_{ik}$ exists if and only if both $X_{ij}$ and $X_{ik}$ exist in the model; namely, $\int \phi_{jk}^2(u)\,du \ne 0$ holds if and only if both $\int \beta_j^2(u)\,du \ne 0$ and $\int \beta_k^2(u)\,du \ne 0$ hold, for any $i = 1, 2, \ldots, n$, $1 \le j < k \le d$. Model (3) is a useful extension of the linear model with hierarchical structure (2) (see, e.g., [18,19,24,25,26,27]), which maintains the good interpretability of parametric models while retaining the flexibility of nonparametric models.
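To make the hierarchy constraint concrete, the short Python sketch below (ours, not from the paper; all names are hypothetical) checks whether a candidate selected model satisfies strong heredity:

```python
def satisfies_strong_heredity(main_effects, interactions):
    """True if every selected interaction (j, k) has both main effects selected.

    main_effects: set of selected indices j; interactions: set of pairs (j, k)."""
    return all(j in main_effects and k in main_effects
               for j, k in interactions)

# {(2, 3)} is admissible only when 2 and 3 are both selected main effects:
print(satisfies_strong_heredity({1, 2, 3}, {(2, 3)}))  # True
print(satisfies_strong_heredity({1, 2}, {(2, 3)}))     # False
```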
To simplify the representation of model (3), we list the following definitions,
$$\begin{aligned}
X_i &= (X_{i1}, X_{i2}, \ldots, X_{id})^\tau,\\
Z_i &= (X_{i1}X_{i2}, \ldots, X_{i1}X_{id}, X_{i2}X_{i3}, \ldots, X_{i2}X_{id}, \ldots, X_{i(d-1)}X_{id})^\tau,\\
\phi(U_i) &= (\phi_{12}(U_i), \phi_{13}(U_i), \ldots, \phi_{1d}(U_i), \phi_{23}(U_i), \phi_{24}(U_i), \ldots, \phi_{2d}(U_i), \ldots, \phi_{(d-1)d}(U_i))^\tau,\\
\beta(U_i) &= (\beta_1(U_i), \beta_2(U_i), \ldots, \beta_d(U_i))^\tau,\\
W_i &= (X_i^\tau, Z_i^\tau)^\tau, \qquad \alpha(U_i) = (\beta(U_i)^\tau, \phi(U_i)^\tau)^\tau,\\
\boldsymbol{\beta}_j &= (\beta_j(U_1), \beta_j(U_2), \ldots, \beta_j(U_n))^\tau, \qquad \boldsymbol{\phi}_{jk} = (\phi_{jk}(U_1), \phi_{jk}(U_2), \ldots, \phi_{jk}(U_n))^\tau,
\end{aligned}$$
where the superscript “ τ ” means transposition operation. Then, model (3) can be reformulated as
$$Y_i = W_i^\tau \alpha(U_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$
Let $\Lambda = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_d, \boldsymbol{\phi}_{12}, \boldsymbol{\phi}_{13}, \ldots, \boldsymbol{\phi}_{(d-1)d})$, the support set of the important main effects be $\mathcal{M} = \{j : \|\boldsymbol{\beta}_j\| > 0\}$, and the support set of the important interaction effects be $\mathcal{I} = \{(j,k) : \|\boldsymbol{\phi}_{jk}\| > 0,\ j, k \in \mathcal{M}\}$, where $\|\cdot\|$ is the $L_2$ norm.
To obtain the initial estimate of coefficient matrix Λ , we minimize the following objective function,
$$Q(\Lambda) = \sum_{t=1}^{n}\sum_{i=1}^{n}\left(Y_i - W_i^\tau \alpha(U_t)\right)^2 K_h(U_t - U_i), \qquad (4)$$
where $K_h(\cdot) = \frac{1}{h}K(\cdot/h)$, $K(\cdot)$ is a kernel function satisfying Condition C5, and $h > 0$ is the bandwidth. Denote the solution to the objective function (4) by $\tilde{\Lambda}$; the $t$th row of $\tilde{\Lambda}$, for $t = 1, 2, \ldots, n$, has the closed form
$$\tilde{\alpha}(U_t) = \left[\frac{1}{n}\sum_{i=1}^{n} W_iW_i^\tau K_h(U_t - U_i)\right]^{-1}\frac{1}{n}\sum_{i=1}^{n} W_iY_iK_h(U_t - U_i). \qquad (5)$$
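As an illustration, a minimal numpy sketch of the unpenalized estimate (5) might look as follows; the function names are ours, and a tiny ridge term is added purely for numerical stability:

```python
import numpy as np

def gaussian_kernel(u, h):
    """K_h(u) = K(u / h) / h with the Gaussian kernel used in Section 4."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def initial_estimate(W, Y, U, h, ridge=1e-8):
    """Unpenalized kernel estimate alpha~(U_t) of Equation (5), row by row.

    W: (n, p) design of main effects and pairwise products; Y: (n,); U: (n,)."""
    n, p = W.shape
    alpha = np.empty((n, p))
    for t in range(n):
        k = gaussian_kernel(U[t] - U, h)      # weights K_h(U_t - U_i)
        A = (W * k[:, None]).T @ W / n        # (1/n) sum_i W_i W_i^T K_h
        b = W.T @ (k * Y) / n                 # (1/n) sum_i W_i Y_i K_h
        alpha[t] = np.linalg.solve(A + ridge * np.eye(p), b)
    return alpha
```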
By assumption, only the columns of $\Lambda$ indexed by the support sets $\mathcal{M}$ and $\mathcal{I}$ are nonzero, so the main task of variable selection is to identify the sparse columns of $\Lambda$ efficiently. Meanwhile, to maintain the hierarchical structure of the model, we apply the idea of the group Lasso proposed by Yuan and Lin [28] and give the following local penalized least squares estimation,
$$\hat{\Lambda}_\lambda = \arg\min_\Lambda Q_\lambda(\Lambda) = \arg\min_{\boldsymbol{\beta}_j, \boldsymbol{\phi}_{jk},\, 1\le j<k\le d}\ \sum_{t=1}^{n}\sum_{i=1}^{n}\left(Y_i - W_i^\tau\alpha(U_t)\right)^2K_h(U_t-U_i) + \sum_{j=1}^{d}\left[\lambda_j^1\sqrt{\|\boldsymbol{\beta}_j\|^2 + \sum_{k:k\ne j}\|\boldsymbol{\phi}_{jk}\|^2} + \sum_{k:j<k}\lambda_{jk}^2\|\boldsymbol{\phi}_{jk}\|\right], \qquad (6)$$
where we assume $\boldsymbol{\phi}_{jk} = \boldsymbol{\phi}_{kj}$, as is common in models with hierarchical structure; this assumption means that the interaction effects do not depend on the order of the two covariates, and it also significantly reduces the computational burden by lessening the number of functional coefficients from the order of $O(d^2)$ to $O(d^2/2)$ (see [18,19,25,27]). Here, $\lambda_j^1$ and $\lambda_{jk}^2$ are tuning parameters. For simplicity of calculation, we use the local quadratic approximation (see [10,29,30]) in each step of the iteration. Take $\tilde{\Lambda}$ as the initial estimator, that is, $\hat{\Lambda}_\lambda^{(0)} = \tilde{\Lambda}$; then, at the $(m+1)$th step, the objective function can be approximately represented as follows,
$$\begin{aligned}
Q_\lambda^{(m+1)}(\Lambda) &\approx \sum_{t=1}^{n}\sum_{i=1}^{n}\{Y_i - W_i^\tau\alpha(U_t)\}^2K_h(U_t-U_i) + \sum_{j=1}^{d}\left[\lambda_j^1\frac{\|\boldsymbol{\beta}_j\|^2 + \sum_{k:k\ne j}\|\boldsymbol{\phi}_{jk}\|^2}{\sqrt{\|\hat{\boldsymbol{\beta}}_j^{(m)}\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|^2}} + \sum_{k:j<k}\lambda_{jk}^2\frac{\|\boldsymbol{\phi}_{jk}\|^2}{\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|}\right]\\
&= \sum_{t=1}^{n}\left(\sum_{i=1}^{n}\{Y_i - W_i^\tau\alpha(U_t)\}^2K_h(U_t-U_i) + \sum_{j=1}^{d}\left[\lambda_j^1\frac{\beta_j^2(U_t) + \sum_{k:k\ne j}\phi_{jk}^2(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j^{(m)}\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|^2}} + \sum_{k:j<k}\lambda_{jk}^2\frac{\phi_{jk}^2(U_t)}{\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|}\right]\right).
\end{aligned}$$
By minimizing $Q_\lambda^{(m+1)}(\Lambda)$, we have
$$\hat{\alpha}_\lambda(U_t)^{(m+1)} = \left[\sum_{i=1}^{n}W_iW_i^\tau K_h(U_t-U_i) + G^{(m)} + H^{(m)}\right]^{-1}\sum_{i=1}^{n}W_iY_iK_h(U_t-U_i), \qquad (7)$$
where
$$G^{(m)} = \mathrm{diag}\Big(M_1^{(m)},\ I_{d-1}\lambda_1^1\gamma_1^{(m)} + M_2^{(m)},\ I_{d-2}\lambda_2^1\gamma_2^{(m)} + M_3^{(m)},\ \ldots,\ \lambda_{d-1}^1\gamma_{d-1}^{(m)} + M_d^{(m)}\Big),$$
$$H^{(m)} = \mathrm{diag}\Big(0_{d\times d},\ \lambda_{12}^2\zeta_{12}^{(m)},\ \lambda_{13}^2\zeta_{13}^{(m)},\ \ldots,\ \lambda_{(d-1)d}^2\zeta_{(d-1)d}^{(m)}\Big),$$
and $M_j^{(m)} = \mathrm{diag}\big(\lambda_j^1\gamma_j^{(m)}, \lambda_{j+1}^1\gamma_{j+1}^{(m)}, \ldots, \lambda_d^1\gamma_d^{(m)}\big)$, $\gamma_j^{(m)} = 1\big/\sqrt{\|\hat{\boldsymbol{\beta}}_j^{(m)}\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|^2}$, and $\zeta_{jk}^{(m)} = 1\big/\|\hat{\boldsymbol{\phi}}_{jk}^{(m)}\|$ for $1 \le j < k \le d$.
We regard $\hat{\Lambda}_\lambda$ as the Kernel Lasso (KLasso) estimator, and the specific implementation procedure can be arranged as follows:
(1) Take the non-penalized estimator $\tilde{\Lambda}$ as the initial estimator, $\hat{\Lambda}_\lambda^{(0)} = \tilde{\Lambda}$.
(2) According to the method discussed above, iterate (7) until convergence; specifically, the iteration stops when $\max\big\{|\hat{\beta}_j^{(m+1)} - \hat{\beta}_j^{(m)}|, |\hat{\phi}_{jk}^{(m+1)} - \hat{\phi}_{jk}^{(m)}|, 1 \le j < k \le d\big\}$ is less than $10^{-3}$. For model sparsity, $\hat{\boldsymbol{\beta}}_j$ and $\hat{\boldsymbol{\phi}}_{jk}$ are set to 0 when they are less than a small threshold $c_r$ (in our simulations, we choose $c_r = 0.5$); a sketch of this loop is given after the list.
(3) Obtain the KLasso estimator $\hat{\Lambda}_\lambda$.
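For concreteness, a rough numpy sketch of Steps (1)–(3) is given below. It reuses the `gaussian_kernel` and `initial_estimate` helpers from the earlier sketch and builds the diagonal of $G^{(m)} + H^{(m)}$ directly, so each interaction coefficient inherits both main-effect penalties in addition to its own; all names and defaults are our assumptions rather than the paper's:

```python
import numpy as np
from itertools import combinations

def klasso(W, Y, U, h, lam1, lam2, max_iter=50, tol=1e-3, cr=0.5, eps=1e-10):
    """Rough sketch of the KLasso iteration in Steps (1)-(3); names are ours.

    W columns: the d main effects followed by the d(d-1)/2 products X_j X_k,
    ordered as in the definition of Z_i; lam1[j] = lambda_j^1 and
    lam2[(j, k)] = lambda_jk^2."""
    n, p = W.shape
    d = int((np.sqrt(8 * p + 1) - 1) / 2)            # solve d + d(d-1)/2 = p
    pairs = list(combinations(range(d), 2))
    col = {jk: d + m for m, jk in enumerate(pairs)}  # column index of each pair
    alpha = initial_estimate(W, Y, U, h)             # Step (1): Lambda~
    for _ in range(max_iter):
        nb = np.linalg.norm(alpha[:, :d], axis=0)    # ||beta_j|| over the grid
        pn = {jk: np.linalg.norm(alpha[:, col[jk]]) for jk in pairs}
        gamma = 1.0 / np.sqrt(
            nb ** 2 + np.array([sum(pn[jk] ** 2 for jk in pairs if j in jk)
                                for j in range(d)]) + eps)
        # Diagonal of G^(m) + H^(m): each phi_jk inherits both main penalties.
        pen = np.concatenate([
            np.array([lam1[j] * gamma[j] for j in range(d)]),
            np.array([lam1[j] * gamma[j] + lam1[k] * gamma[k]
                      + lam2[(j, k)] / (pn[(j, k)] + eps) for j, k in pairs])])
        new = np.empty_like(alpha)
        for t in range(n):                           # Step (2): update (7)
            kw = gaussian_kernel(U[t] - U, h)
            A = (W * kw[:, None]).T @ W + np.diag(pen)
            new[t] = np.linalg.solve(A, W.T @ (kw * Y))
        done = np.max(np.abs(new - alpha)) < tol
        alpha = new
        if done:
            break
    alpha[:, np.linalg.norm(alpha, axis=0) < cr] = 0.0   # sparsify small groups
    return alpha                                     # Step (3): Lambda^_lambda
```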
Selection of the bandwidth $h$ and the tuning parameters $\lambda_j^1$ and $\lambda_{jk}^2$ ($1 \le j < k \le d$) is another important issue for the VC model. Since there are two types of parameters to be considered, a common approach is grid search, such as cross validation (CV) or generalized cross validation (GCV) [31]. However, this would be too expensive computationally; thus, we choose $h$ in the kernel estimation of the coefficient functions by the CV criterion. For the tuning parameters, we apply the idea of Zou and Li [32], where large coefficients should be given small penalties while small coefficients should be given large penalties, and the tuning parameters can be chosen as
$$\lambda_j^1 = \frac{\lambda_0 n^{-\frac12}}{\sqrt{\|\tilde{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\tilde{\boldsymbol{\phi}}_{jk}\|^2}}, \qquad \lambda_{jk}^2 = \frac{\lambda_0 n^{-\frac12}}{\|\tilde{\boldsymbol{\phi}}_{jk}\|}. \qquad (8)$$
Then, only one parameter, $\lambda_0$, needs to be considered, which is selected according to the BIC criterion
$$\mathrm{BIC}_\lambda = \log(\mathrm{RSS}_\lambda) + d_\lambda\frac{\log(nh)}{nh}, \qquad (9)$$
where $d_\lambda$ is the number of nonzero coefficient functions determined by $\hat{\Lambda}_\lambda$ and
$$\mathrm{RSS}_\lambda = \frac{1}{n^2}\sum_{t=1}^{n}\sum_{i=1}^{n}\{Y_i - W_i^\tau\hat{\alpha}_\lambda(U_t)\}^2K_h(U_t-U_i). \qquad (10)$$
Then, $\lambda_0$ can be obtained by minimizing $\mathrm{BIC}_\lambda$.
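Under the same assumptions as the previous sketches (and reusing their helpers), the search over $\lambda_0$ might be organized as follows; the grid and helper names are ours:

```python
import numpy as np
from itertools import combinations

def select_lambda0(W, Y, U, h, lambda0_grid, eps=1e-10):
    """Sketch of the BIC search over lambda_0 in (8)-(10); names are ours."""
    n, p = W.shape
    d = int((np.sqrt(8 * p + 1) - 1) / 2)
    pairs = list(combinations(range(d), 2))
    col = {jk: d + m for m, jk in enumerate(pairs)}
    a0 = initial_estimate(W, Y, U, h)                # pilot estimate Lambda~
    nb = np.linalg.norm(a0[:, :d], axis=0)
    pn = {jk: np.linalg.norm(a0[:, col[jk]]) for jk in pairs}
    best = (np.inf, None, None)
    for lam0 in lambda0_grid:
        # Adaptive tuning parameters of Equation (8).
        lam1 = {j: lam0 / np.sqrt(n) / np.sqrt(
            nb[j] ** 2 + sum(pn[jk] ** 2 for jk in pairs if j in jk) + eps)
            for j in range(d)}
        lam2 = {jk: lam0 / np.sqrt(n) / (pn[jk] + eps) for jk in pairs}
        alpha = klasso(W, Y, U, h, lam1, lam2)
        # RSS_lambda of (10) and BIC of (9); d_lam counts nonzero functions.
        rss = sum(gaussian_kernel(U[t] - U, h) @ (Y - W @ alpha[t]) ** 2
                  for t in range(n)) / n ** 2
        d_lam = int(np.sum(np.linalg.norm(alpha, axis=0) > 0))
        bic = np.log(rss) + d_lam * np.log(n * h) / (n * h)
        if bic < best[0]:
            best = (bic, lam0, alpha)
    return best   # (BIC value, selected lambda_0, fitted coefficient matrix)
```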

3. Theoretical Properties

In this section, we establish the asymptotic properties of the proposed estimator, including model selection consistency and the oracle property. First, we introduce some notation and list the regularity conditions. Define $a_n = \max\{\lambda_j^1, \lambda_{kl}^2 : j \in \mathcal{M}, (k,l) \in \mathcal{I}\}$ and $b_n = \min\{\lambda_j^1, \lambda_{kl}^2 : j \in \mathcal{M}^c, (k,l) \in \mathcal{I}^c\}$. Let $X_{\mathcal{M}}$ denote the matrix generated from $X$ with columns indexed by $\mathcal{M}$, and $Z_{\mathcal{I}}$ the matrix generated from $Z$ with columns indexed by $\mathcal{I}$; let $W_S = (X_{\mathcal{M}}, Z_{\mathcal{I}})$, $W_{S^c} = (X_{\mathcal{M}^c}, Z_{\mathcal{I}^c})$, $\hat{\mathcal{M}} = \{j : \|\hat{\boldsymbol{\beta}}_j\| > 0\}$, $\hat{\mathcal{I}} = \{(j,k) : \|\hat{\boldsymbol{\phi}}_{jk}\| > 0\}$, $\hat{\alpha}_S(U_t) = (\hat{\beta}_{\mathcal{M}}^\tau(U_t), \hat{\phi}_{\mathcal{I}}^\tau(U_t))^\tau$, and $\hat{\alpha}_{S^c}(U_t) = (\hat{\beta}_{\mathcal{M}^c}^\tau(U_t), \hat{\phi}_{\mathcal{I}^c}^\tau(U_t))^\tau$. The notations $o_p(\cdot)$ and $O_p(\cdot)$ are defined as follows for random variables $\xi$ and $\eta$: $\xi = o_p(\eta)$ means that, for all $\epsilon > 0$, $P(|\xi/\eta| > \epsilon) \to 0$ as $n \to \infty$; and $\xi = O_p(\eta)$ means that, for all $\epsilon > 0$, there exists $c > 0$ such that $P(|\xi/\eta| > c) < \epsilon$ when $n$ is sufficiently large. The following traditional conditions (see [3]) are also needed.
C1. For $1 \le i \le n$, the covariate $X_i$ is independent of the error $\varepsilon_i$.
C2. The covariate $W_i$ has a finite $p$th moment, i.e., $E\|W_i\|^p < \infty$, where $p \ge 2$.
C3. The density function $f(U)$ of $U$ is continuous and has a second-order derivative.
C4. $\Omega(U) = E(W_iW_i^\tau \mid U_i = U)$ has a second-order derivative, while $E(\|W_i\|^4 \mid U_i = U)$ and $E(\varepsilon_i^2 \mid U_i = U)$ are both bounded.
C5. $K(\cdot)$ is a symmetric kernel function which satisfies $\int K(s)\,ds = 1$, $\int s^2K(s)\,ds = \mu_2 < \infty$, $\int K^2(s)\,ds = \nu < \infty$, and $\int s^2K^2(s)\,ds = \nu_2 < \infty$.
C6. $\beta_j(U)$ and $\phi_{jk}(U)$ are all bounded and have second-order continuous derivatives for $j \in \mathcal{M}$ and $(j,k) \in \mathcal{I}$.
Theorem 1.
Suppose C1–C6 hold; if $h \sim n^{-\frac15}$, $(nh)^{\frac12}a_n \to 0$, and $(nh)^{\frac12}b_n \to \infty$, then the following results hold:
(1) $P(\hat{\mathcal{M}} = \mathcal{M}) \to 1$ as $n \to \infty$;
(2) $P(\hat{\mathcal{I}} = \mathcal{I}) \to 1$ as $n \to \infty$.
Theorem 1 shows that the KLasso estimator can select the true model consistently. Then, we discuss the oracle properties of the KLasso estimator in Theorem 2.
Theorem 2.
Suppose C1–C6 hold; if $h \sim n^{-\frac15}$, $(nh)^{\frac12}a_n \to 0$, and $(nh)^{\frac12}b_n \to \infty$, then we have
$$\sup_{U_t \in [0,1]}\big\|\hat{\alpha}_{\lambda,S}(U_t) - \tilde{\alpha}_S(U_t)\big\| = o_p(n^{-\frac25}),$$
for $1 \le t \le n$.
Note that the optimal convergence rate of the oracle estimator is $O_p(n^{-\frac25})$. We also observe that the difference in convergence rate between the KLasso estimator and the oracle estimator is negligible over the univariate index set. Thus, we can conclude that the KLasso estimator shares the same asymptotic properties as the oracle estimator. Proofs of these two theorems are given in Appendix A.

4. Simulation Study and Real Data Analysis

4.1. Simulation Study

In this section, three examples are used to assess the proposed procedures in terms of estimation of the varying coefficient functions and variable selection. Data with sample sizes $n = 100, 200, 500$ are independently generated from the following models:
$$\begin{aligned}
\text{Model 1:}\quad Y_i &= \sin(\pi U_i)X_{i1} + 0.5\exp(U_i)X_{i2} + U_i^3X_{i3} + \cos(\pi U_i)X_{i2}X_{i3} + \varepsilon_i,\\
\text{Model 2:}\quad Y_i &= [(U_i - 0.5)^2 + 1]X_{i1} + 0.5\exp(U_i)X_{i2} + U_i^3X_{i3} + \cos(\pi U_i)X_{i4}\\
&\quad + [U_i^3 + \log(2U_i + 1)]X_{i1}X_{i2} + \sin(2\pi U_i)X_{i2}X_{i3} + \varepsilon_i,\\
\text{Model 3:}\quad Y_i &= 2U_i^2X_{i1} + 0.5\exp(U_i)X_{i2} + U_i^{1/3}X_{i3} + [\cos(2\pi U_i) + \cos(\pi U_i)]X_{i4}\\
&\quad + \sin(2\pi U_i)X_{i2}X_{i3} + [\sin(2\pi U_i) + \sin(\pi U_i)]X_{i2}X_{i4} + \varepsilon_i,
\end{aligned}$$
where the functional coefficients mainly include trigonometric functions with different periods, an exponential function, and power functions with different locations and powers. The random vector $(X_{i1}, \ldots, X_{i10})^\tau$ follows a multivariate normal distribution with zero means and $\mathrm{Cov}(X_{ij}, X_{ik}) = 0.5^{|k-j|}$ for $1 \le j < k \le 10$. The index variable $U_i \sim \mathrm{Uniform}[0,1]$, and the random error $\varepsilon_i$ follows a standard normal distribution. In the estimation procedures, the Gaussian kernel function $K(u) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{u^2}{2})$ is employed. The initial estimated coefficient matrix $\tilde{\Lambda}$ is obtained by minimizing (4), the optimal bandwidth is selected via the CV criterion, and the selected bandwidth is also used in the KLasso estimation procedure.
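As a sketch of this data-generating process (ours; column 0 corresponds to $X_{i1}$, and the seed is arbitrary), Model 1 can be simulated as:

```python
import numpy as np

def generate_model1(n, d=10, rho=0.5, seed=0):
    """Sketch of the Model 1 data-generating process described above."""
    rng = np.random.default_rng(seed)
    # AR(1)-type covariance: Cov(X_ij, X_ik) = 0.5 ** |k - j|.
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    U = rng.uniform(0.0, 1.0, size=n)
    eps = rng.standard_normal(n)
    # Columns 0, 1, 2 play the roles of X_i1, X_i2, X_i3.
    Y = (np.sin(np.pi * U) * X[:, 0] + 0.5 * np.exp(U) * X[:, 1]
         + U ** 3 * X[:, 2] + np.cos(np.pi * U) * X[:, 1] * X[:, 2] + eps)
    return X, U, Y
```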
To assess the estimation accuracy of the KLasso estimator, the empirical integrated squared errors defined as follows are computed,
$$\mathrm{ISE}(\hat{\beta}_j) = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\beta}_j(U_i) - \beta_j(U_i)\big)^2, \qquad \mathrm{ISE}(\hat{\phi}_{jk}) = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\phi}_{jk}(U_i) - \phi_{jk}(U_i)\big)^2, \qquad \text{for } j \in \mathcal{M},\ (j,k) \in \mathcal{I}.$$
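A direct numpy translation of these quantities (a sketch with our names) is:

```python
import numpy as np

def ise(est_curve, true_fn, U):
    """Empirical ISE of one estimated coefficient curve on the observed U_i."""
    return np.mean((est_curve - true_fn(U)) ** 2)

# Example for Model 1 (hypothetical): ISE of beta^_1 against sin(pi * u).
# ise(alpha[:, 0], lambda u: np.sin(np.pi * u), U)
```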
As a benchmark for comparison, the mean empirical integrated squared errors (MISEs) and their standard errors (in parentheses) of the oracle estimators and the proposed estimators for the three models are reported in Table 1, Table 2 and Table 3, respectively. All the empirical results were computed with the R software [33] based on 1000 replications. From Table 1, Table 2 and Table 3, we can see that: (1) the MISEs decrease quickly as the sample size increases for both the oracle estimator and the KLasso estimator; (2) the proposed estimator performs comparably to the oracle estimator in terms of coefficient estimation for moderate sample sizes; and (3) the estimators for Model 1 have smaller MISEs than those of the other two models, so the number and complexity of the nonzero coefficient functions to be estimated may affect the performance of the proposed procedure in finite samples.
Let CM denote the frequency with which the nonzero coefficients are correctly estimated as nonzero, CZ the frequency with which the zero coefficients are correctly estimated as zero, and CS the frequency with which the model is correctly selected, meaning that exactly the nonzero coefficients are estimated as nonzero; one way these frequencies can be computed is sketched below. CM, CZ, and CS are summarized in Table 4. For the three models, as the sample size increases, CM, CZ, and CS all reach 100%, which implies that the proposed method can identify the model well. In addition, Model 1 has the largest probability of being correctly selected among the three models.
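Under one natural reading of these definitions (a replication counts toward CM only if every nonzero coefficient is kept, and toward CZ only if every zero coefficient is dropped), the frequencies can be computed as in the following sketch; the names are ours:

```python
import numpy as np

def selection_rates(selected_runs, true_support):
    """CM, CZ, CS over replications. selected_runs: (replications, coefficients)
    boolean array of selection flags; true_support: true nonzero pattern."""
    S = np.asarray(selected_runs, dtype=bool)
    T = np.asarray(true_support, dtype=bool)
    cm = S[:, T].all(axis=1).mean()        # every nonzero coefficient kept
    cz = (~S[:, ~T]).all(axis=1).mean()    # every zero coefficient dropped
    cs = (S == T).all(axis=1).mean()       # exactly the true model selected
    return cm, cz, cs
```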
Besides, we also depict the quantile curves of the estimated coefficient functions at the fixed series $U_1, U_2, \ldots, U_{46}$, where $U_i = 0.05 + (i-1)\times 0.02$, $i = 1, 2, \ldots, 46$. Figure 1, Figure 2 and Figure 3 show the quantile curves for Models 1–3 with sample size 200, respectively. From these figures, we can see that the main effects and interaction effects are correctly selected and consistently estimated. Meanwhile, the estimated curves tend to underestimate at the peaks and overestimate at the valleys of the curves. In summary, the proposed method for VC models with hierarchical structure works well.

4.2. The Boston Housing Data Analysis

To further investigate the performance of our method, we apply the proposed method to the Boston housing data, which concern the median value of owner-occupied homes (MV) for 506 census tracts in 1970. The dataset "Boston" is available in the R package "MASS" [34]. Following the basic housing value equation of Harrison and Rubinfeld [35] and the studies of Fan and Huang [3] and Wang and Xia [7], we consider seven covariates here: CRIM (per capita crime rate by town), RM (average number of rooms per dwelling), NOX (nitric oxides concentration), PTRATIO (pupil-teacher ratio by town), TAX (full-value property-tax rate per $10,000), AGE (proportion of owner-occupied units built prior to 1940), and LSTAT (percentage of lower status of the population). A log transformation of MV and power transformations of RM, NOX, and LSTAT are employed to fit the Boston data well. For simplicity of representation, CRIM, RM$^2$, TAX, NOX$^2$, PTRATIO, and AGE are denoted by $X_1, \ldots, X_6$, respectively, and they are all scaled to mean zero and standard deviation 1. Meanwhile, we take $X_0 \equiv 1$ as the intercept term (INT), log(MV) as the response $Y$, and LSTAT$^{1/2}$ as the index variable $U$. Consequently, the varying coefficient model with hierarchical structure
$$Y_i = \beta_0(U_i) + \sum_{j=1}^{6}X_{ij}\beta_j(U_i) + \sum_{1\le j<k\le 6}X_{ij}X_{ik}\phi_{jk}(U_i) + \varepsilon_i$$
is fitted to the Boston housing data.
The data are divided into training data and testing data with sample sizes 405 and 101, respectively, by random sampling. The estimates are based on the training data, and the performance of the proposed procedures is evaluated on the testing data. The CV criterion suggests a bandwidth $h = 0.31$, and the optimal tuning parameter selected by the BIC criterion is $\hat{\lambda}_0 = 1.8$. In the implementation, variables whose estimated functional coefficients are less than 0.1 in magnitude are regarded as insignificant. The results show that INT, CRIM, RM$^2$, TAX, PTRATIO, AGE, and RM$^2\times$TAX are significant while the others are not. The estimated coefficient function curves for these relevant variables are depicted in Figure 4. For the testing data, the multiple $R^2$ of our proposed procedure is 0.8314, compared with 0.8021 for the VC model without interaction terms. Thus, the proposed VC model with hierarchical structure fits the testing data slightly better.
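For reference, the preprocessing described above might be sketched as follows; the lower-case column names assume the MASS "Boston" data exported to a pandas DataFrame, and everything here is our assumption rather than the paper's code:

```python
import numpy as np
import pandas as pd

def prepare_boston(df):
    """Sketch of the transformations above; assumes a DataFrame with the MASS
    'Boston' columns crim, rm, nox, ptratio, tax, age, lstat, medv."""
    X = pd.DataFrame({
        "CRIM": df["crim"], "RM2": df["rm"] ** 2, "TAX": df["tax"],
        "NOX2": df["nox"] ** 2, "PTRATIO": df["ptratio"], "AGE": df["age"]})
    X = (X - X.mean()) / X.std()      # scale to mean zero, standard deviation 1
    X.insert(0, "INT", 1.0)           # intercept term X_0 = 1
    Y = np.log(df["medv"])            # response: log(MV)
    U = np.sqrt(df["lstat"])          # index variable: LSTAT^(1/2)
    return X.to_numpy(), U.to_numpy(), Y.to_numpy()

# A 405/101 train/test split by random sampling, as in the text:
# idx = np.random.default_rng(0).permutation(506)
# train, test = idx[:405], idx[405:]
```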

5. Conclusions and Future Works

In this paper, VC models with hierarchical structure are investigated, and a unified variable selection procedure is proposed, which can simultaneously select the nonzero effects and estimate the unknown coefficient functions while the selected model enforces the hierarchical structure. It is proved that the proposed penalized estimators have the oracle property; that is, the coefficients are estimated as well as if the true model were known in advance. Simulation studies and the Boston housing data analysis are carried out to examine the performance of the proposed method in the finite sample case.
However, we mainly focus on a fixed dimensionality of the predictive covariates in this paper. We will investigate the VC model with hierarchical structure in the case of diverging dimensionality of the predictors in the future. Estimation and variable selection for the generalized varying coefficient model with hierarchical structure, as well as estimation for the semiparametric varying coefficient partially linear model with hierarchical structure, are also interesting topics that deserve to be studied.

Author Contributions

Methodology, F.L. and S.F.; Software, F.L.; Writing—original draft, Y.L.; Writing—review and editing, S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant Nos. U1404104 and 11501522), the National Statistical Science Research Project of China (Grant No. 2019LY18), the Foundation of the Henan Educational Committee (Grant No. 21A910004), and the Training Fund for Basic Research Program of Zhengzhou University (Grant No. 32211591).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Before proving Theorems 1 and 2, we first establish the following lemmas, which give the asymptotic properties of $\hat{\Lambda}_\lambda$. Without loss of generality, we suppose $\mathcal{M} = \{1, 2, \ldots, d_0\}$ and $\mathcal{I} = \{(j,k) : 1 \le j < k \le d_1\}$, where $d_1 < d_0$. Let $\alpha_S(U_i) = (\beta_1(U_i), \ldots, \beta_{d_0}(U_i), \phi_{12}(U_i), \ldots, \phi_{(d_1-1)d_1}(U_i))^\tau$, $\beta_{\mathcal{M}} = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_{d_0})$, and $\phi_{\mathcal{I}} = (\boldsymbol{\phi}_{12}, \boldsymbol{\phi}_{13}, \ldots, \boldsymbol{\phi}_{(d_1-1)d_1})$.
Lemma A1.
Suppose Conditions C1–C6 hold, $h \sim n^{-\frac15}$, $(nh)^{\frac12}a_n \to 0$, and $(nh)^{\frac12}b_n \to \infty$; then we have
$$\frac{1}{n}\sum_{t=1}^{n}\big\|\hat{\alpha}_\lambda(U_t) - \alpha(U_t)\big\|^2 = O_p(n^{-\frac45}).$$
Proof. 
We use $A = (a_{ij}) \in \mathbb{R}^{n \times \frac{d(d+1)}{2}}$ to denote an arbitrary $n \times \frac{d(d+1)}{2}$ matrix; the rows of $A$ are denoted $\eta_1^\tau, \ldots, \eta_n^\tau$, and the columns $\rho_1, \ldots, \rho_d, \rho_{12}, \ldots, \rho_{1d}, \rho_{23}, \ldots, \rho_{2d}, \ldots, \rho_{(d-1)d}$. Using the $L_2$ norm, define $\|A\|^2 = \sum a_{ij}^2$. According to Fan and Li [10], it suffices to show that, for any small $\epsilon > 0$, there exists a constant $C > 0$ such that
$$\liminf_{n\to\infty} P\Big(\inf_{n^{-1}\|A\|^2 = C} Q_\lambda\big(\Lambda + (nh)^{-\frac12}A\big) > Q_\lambda(\Lambda)\Big) \ge 1 - \epsilon.$$
Define $D_n(A) = Q_\lambda(\Lambda + (nh)^{-\frac12}A) - Q_\lambda(\Lambda)$. Then, we have
$$\begin{aligned}
D_n(A) &= \sum_{t=1}^{n}\sum_{i=1}^{n}\Big[(nh)^{-1}\eta_t^\tau W_iW_i^\tau\eta_t - 2(nh)^{-\frac12}\big(Y_i - W_i^\tau\alpha(U_t)\big)W_i^\tau\eta_t\Big]K_h(U_t-U_i)\\
&\quad + \sum_{j=1}^{d}\lambda_j^1\left[\sqrt{\big\|\boldsymbol{\beta}_j + (nh)^{-\frac12}\rho_j\big\|^2 + \sum_{k:k\ne j}\big\|\boldsymbol{\phi}_{jk} + (nh)^{-\frac12}\rho_{jk}\big\|^2} - \sqrt{\|\boldsymbol{\beta}_j\|^2 + \sum_{k=j+1}^{d}\|\boldsymbol{\phi}_{jk}\|^2}\right]\\
&\quad + \sum_{j=1}^{d}\sum_{k=j+1}^{d}\lambda_{jk}^2\Big[\big\|\boldsymbol{\phi}_{jk} + (nh)^{-\frac12}\rho_{jk}\big\| - \|\boldsymbol{\phi}_{jk}\|\Big]\\
&= R_0 + R_1 + R_2.
\end{aligned}$$
Next, we discuss $R_i$, $i = 0, 1, 2$, in turn. For $R_0$,
$$\begin{aligned}
R_0 &= \sum_{t=1}^{n}\sum_{i=1}^{n}\Big[(nh)^{-1}\eta_t^\tau W_iW_i^\tau\eta_t - 2(nh)^{-\frac12}\big(Y_i - W_i^\tau\alpha(U_t)\big)W_i^\tau\eta_t\Big]K_h(U_t-U_i)\\
&= \sum_{t=1}^{n}\frac{1}{h}\eta_t^\tau\Big[\frac{1}{n}\sum_{i=1}^{n}W_iW_i^\tau K_h(U_t-U_i)\Big]\eta_t - 2(nh)^{-\frac12}\sum_{t=1}^{n}\sum_{i=1}^{n}\eta_t^\tau W_i\big[(\alpha(U_i)-\alpha(U_t))^\tau W_i + \varepsilon_i\big]K_h(U_t-U_i)\\
&= \frac{1}{h}\sum_{t=1}^{n}\eta_t^\tau\Sigma(U_t)\eta_t - 2\frac{1}{h}\sum_{t=1}^{n}\eta_t^\tau e_t\\
&\ge nh^{-1}\lambda_{min}\Big(n^{-1}\sum_{t=1}^{n}\|\eta_t\|^2\Big) - 2nh^{-1}\Big(n^{-1}\sum_{t=1}^{n}\|\eta_t\|^2\Big)^{\frac12}\Big(n^{-1}\sum_{t=1}^{n}\|e_t\|^2\Big)^{\frac12}\\
&\ge nh^{-1}\lambda_{min}C - 2nh^{-1}C^{\frac12},
\end{aligned}$$
where $\Sigma(U_t) = n^{-1}\sum_{i=1}^{n}W_iW_i^\tau K_h(U_t-U_i)$, $e_t = n^{-\frac12}h^{\frac12}\sum_{i=1}^{n}W_i\big[(\alpha(U_i)-\alpha(U_t))^\tau W_i + \varepsilon_i\big]K_h(U_t-U_i)$, $\lambda_{min}(U_t)$ is the minimum eigenvalue of $\Sigma(U_t)$, $\lambda_{min}$ is the minimum of $\{\lambda_{min}(U_t), t = 1, 2, \ldots, n\}$, and $n^{-1}\sum_{t=1}^{n}\|e_t\|^2 = O_p(1)$, which we prove in Lemma A2. Since $n^{-1}\|A\|^2 = C$ and $a_n = \max\{\lambda_j^1, \lambda_{jk}^2 : j \in \mathcal{M}, (j,k) \in \mathcal{I}\}$, it is easy to show that
$$R_1 = \sum_{j=1}^{d}\lambda_j^1\left[\sqrt{\big\|\boldsymbol{\beta}_j + (nh)^{-\frac12}\rho_j\big\|^2 + \sum_{k:k\ne j}\big\|\boldsymbol{\phi}_{jk} + (nh)^{-\frac12}\rho_{jk}\big\|^2} - \sqrt{\|\boldsymbol{\beta}_j\|^2 + \sum_{k=j+1}^{d}\|\boldsymbol{\phi}_{jk}\|^2}\right] \ge -\sum_{j=1}^{d_0}\lambda_j^1\sqrt{(nh)^{-1}\|\rho_j\|^2 + (nh)^{-1}\sum_{k=j+1}^{d_0}\|\rho_{jk}\|^2} \ge -d_0C^{\frac12}h^{-\frac12}a_n,$$
$$R_2 = \sum_{j=1}^{d}\sum_{k:j<k\le d}\lambda_{jk}^2\Big[\big\|\boldsymbol{\phi}_{jk} + (nh)^{-\frac12}\rho_{jk}\big\| - \|\boldsymbol{\phi}_{jk}\|\Big] \ge -\sum_{j=1}^{d_0}\sum_{k:j<k\le d_0}\lambda_{jk}^2(nh)^{-\frac12}\|\rho_{jk}\| \ge -(nh)^{-\frac12}n^{\frac12}C^{\frac12}a_n = -C^{\frac12}h^{-\frac12}a_n.$$
Consequently, we have
$$hn^{-1}D_n(A) = hn^{-1}(R_0 + R_1 + R_2) \ge hn^{-1}\big(nh^{-1}\lambda_{min}C - 2nh^{-1}C^{\frac12} - d_0a_nC^{\frac12}h^{-\frac12} - C^{\frac12}h^{-\frac12}a_n\big) = \lambda_{min}C - 2C^{\frac12} - n^{-\frac12}h^{\frac12}(d_0a_n + a_n)C^{\frac12} = \lambda_{min}C - 2C^{\frac12} - n^{-\frac15}\big((d_0+1)(nh)^{\frac12}a_n\big)C^{\frac12} > 0$$
when $C$ is large enough, and the result of Lemma A1 holds. □
Lemma A2.
Suppose C1–C6 hold; then $n^{-1}\sum_{t=1}^{n}\|e_t\|^2 = O_p(1)$.
Proof. 
By straightforward algebra, we have
$$\begin{aligned}
E\Big(n^{-1}\sum_{t=1}^{n}\|e_t\|^2\Big) &= n^{-2}h\sum_{t=1}^{n}E\Bigg[\sum_{i\ne j}^{n}(\alpha(U_i)-\alpha(U_t))^\tau W_iW_i^\tau W_jW_j^\tau(\alpha(U_j)-\alpha(U_t))K_h(U_t-U_i)K_h(U_t-U_j)\\
&\qquad + \sum_{i=j}^{n}(\alpha(U_i)-\alpha(U_t))^\tau W_iW_i^\tau W_iW_i^\tau(\alpha(U_i)-\alpha(U_t))K_h^2(U_t-U_i) + \sum_{i=1}^{n}\varepsilon_i^2W_i^\tau W_iK_h^2(U_t-U_i)\Bigg]\\
&= n^{-2}h\sum_{t=1}^{n}E\Big(\sum_{i\ne j}e_{t1} + \sum_{i=j}e_{t2} + \sum_{i=1}^{n}e_{t3}\Big).
\end{aligned}$$
By Taylor expansion, we have
$$\begin{aligned}
E(e_{t1}) &= E\big(E(e_{t1}\mid U_i, U_j)\big)\\
&= E\Big(E\big((\alpha(U_i)-\alpha(U_t))^\tau W_iW_i^\tau W_jW_j^\tau(\alpha(U_j)-\alpha(U_t))K_h(U_t-U_i)K_h(U_t-U_j)\mid U_i, U_j\big)\Big)\\
&= E\Big(E\big(\alpha'(U_t)^\tau W_iW_i^\tau W_jW_j^\tau\alpha'(U_t)(U_t-U_i)(U_t-U_j)K_h(U_t-U_i)K_h(U_t-U_j)\\
&\qquad + \|W_i\|^2\|W_j\|^2(U_t-U_i)^2(U_t-U_j)^2K_h(U_t-U_i)K_h(U_t-U_j)\mid U_i, U_j\big)\Big)\\
&= E_1 + E_2.
\end{aligned}$$
When $t = i$ or $t = j$, $E_1 = E_2 = 0$; for $t \ne i$ and $t \ne j$, according to C3,
$$\begin{aligned}
E_1 &= E\Big(E\big(\alpha'(U_t)^\tau W_iW_i^\tau W_jW_j^\tau\alpha'(U_t)(U_t-U_i)(U_t-U_j)K_h(U_t-U_i)K_h(U_t-U_j)\mid U_t, U_i, U_j\big)\Big)\\
&= E\Big(\alpha'(U_t)^\tau E\big(W_iW_i^\tau W_jW_j^\tau\mid U_i, U_j\big)\alpha'(U_t)(U_t-U_i)(U_t-U_j)K_h(U_t-U_i)K_h(U_t-U_j)\Big)\\
&= \iiint \alpha'(U_t)^\tau\Omega(U_i)\Omega(U_j)\alpha'(U_t)(U_t-U_i)(U_t-U_j)K_h(U_t-U_i)K_h(U_t-U_j)f(U_t)f(U_i)f(U_j)\,dU_t\,dU_i\,dU_j\\
&= h^2\iiint \alpha'(U_t)^\tau\Omega(U_t+hs_1)\Omega(U_t+hs_2)f(U_t+hs_1)f(U_t+hs_2)\alpha'(U_t)s_1s_2K(s_1)K(s_2)\,dU_t\,ds_1\,ds_2\\
&= h^2\iiint \alpha'(U_t)^\tau\big(\tilde{\omega}(U_t) + \tilde{\omega}_1(U_t)hs_1 + \tilde{\omega}_2(U_t)hs_2 + C_3(s_1^2+s_2^2)h^2\big)\alpha'(U_t)s_1s_2K(s_1)K(s_2)\,dU_t\,ds_1\,ds_2\\
&= O(h^4),
\end{aligned}$$
since $\iint\big(\tilde{\omega}(U_t) + \tilde{\omega}_1(U_t)hs_1 + \tilde{\omega}_2(U_t)hs_2\big)s_1s_2K(s_1)K(s_2)\,ds_1\,ds_2 = 0$. Similarly, we can get $E_2 = O(h^4)$, so $E(e_{t1}) = O(h^4)$; likewise, $E(e_{t2}) = O(h)$.
Next, we consider $E\big(\sum_{i=1}^{n}e_{t3}\big)$. Define $g(U_t) = f^2(U_t)$, and suppose that $\tilde{E}\big(\|W_t\|^2\varepsilon_t^2\mid U_t = u\big) = \int E\big(\|W_t\|^2\varepsilon_t^2\mid U_t\big)g(U_t)\,dU_t$ is bounded.
$$\begin{aligned}
E\Big(\sum_{i=1}^{n}e_{t3}\Big) &= (n-1)E\big(\|W_i\|^2\varepsilon_i^2K_h^2(U_t-U_i)\big) + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\\
&= (n-1)\iint E\big(\|W_i\|^2\varepsilon_i^2\mid U_i\big)K_h^2(U_t-U_i)f(U_t)f(U_i)\,dU_t\,dU_i + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\\
&= (n-1)h^{-1}\iint E\big(\|W_i\|^2\varepsilon_i^2\mid U_i\big)K^2(v)f(U_i+hv)f(U_i)\,dU_i\,dv + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\\
&= (n-1)h^{-1}\iint E\big(\|W_i\|^2\varepsilon_i^2\mid U_i\big)K^2(v)f(U_i)\big(f(U_i) + f'(U_i)hv + C_1h^2v^2\big)\,dU_i\,dv + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\\
&= (n-1)h^{-1}\Big\{\int E\big(\|W_i\|^2\varepsilon_i^2\mid U_i\big)g(U_i)\,dU_i\int K^2(v)\,dv + C_1h^3\int E\big(\|W_i\|^2\varepsilon_i^2\mid U_i\big)f(U_i)\,dU_i\int K^2(v)v^2\,dv\Big\} + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\\
&\le (n-1)h^{-1}\Big\{\nu\tilde{E}\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big) + C_1h^3\nu_2E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big)\Big\} + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i = U\big),
\end{aligned}$$
where $\nu = \int K^2(v)\,dv$ and $\nu_2 = \int K^2(v)v^2\,dv$. We can get
$$n^{-1}hE\Big(\sum_{i=1}^{n}e_{t3}\Big) = n^{-1}h\Big\{(n-1)h^{-1}\nu\tilde{E}\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big) + C_1h^3\nu_2E\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big) + K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big)\Big\} \le \nu\tilde{E}\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big) + n^{-\frac95}C_1\nu_2E\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big) + n^{-\frac65}K^2(0)E\big(\|W_i\|^2\varepsilon_i^2\mid U_i=U\big) < \infty,$$
since $n^{-\frac95}, n^{-\frac65} \to 0$. Thus, $n^{-1}\sum_{t=1}^{n}\|e_t\|^2 = n^{-1}h\big(O_p(n^2h^4) + nO_p(h)\big) + O_p(1) = O_p(1)$ is proved. □
Proof of Theorem 1.
According to the definitions above, we have $\mathcal{M}^c = \{d_0+1, d_0+2, \ldots, d\} = \{j : \boldsymbol{\beta}_j = \mathbf{0}\}$ and $\mathcal{I}^c = \{(j,k) : d_1 < j < k \le d_0 \le d$, or $d_1 \le j < d_0 < k \le d$, or $d_0 < j < k \le d\} = \{(j,k) : \boldsymbol{\phi}_{jk} = \mathbf{0}\}$. Meanwhile, $\hat{\mathcal{M}}^c = \{j : \hat{\boldsymbol{\beta}}_j = \mathbf{0}\}$ and $\hat{\mathcal{I}}^c = \{(j,k) : \hat{\boldsymbol{\phi}}_{jk} = \mathbf{0}\}$.
We first prove that $P(\hat{\mathcal{M}}^c = \mathcal{M}^c) \to 1$ as $n \to \infty$; that is, for any $j \in \mathcal{M}^c$, $P(\hat{\beta}_j(U_t) = 0) \to 1$ for $1 \le t \le n$. If this were not true, $\hat{\beta}_j(U_t) \ne 0$ would have to solve the following normal equation,
$$\begin{aligned}
0 = \frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\beta_j(U_t)} &= -2\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}\big(Y_i - W_i^\tau\hat{\alpha}_\lambda(U_t)\big)K_h(U_t-U_i) + \frac{2\lambda_j^1\hat{\beta}_j(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}}\\
&= -2\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}\big(Y_i - W_i^\tau\alpha(U_t) - W_i^\tau(\hat{\alpha}_\lambda(U_t)-\alpha(U_t))\big)K_h(U_t-U_i) + \frac{2\lambda_j^1\hat{\beta}_j(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}}\\
&= -2\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}\varepsilon_iK_h(U_t-U_i) + 2\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}W_i^\tau\big(\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big)K_h(U_t-U_i) + 2(nh)^{\frac12}\lambda_j^1\frac{(nh)^{-\frac12}\hat{\beta}_j(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}};
\end{aligned}$$
by Conditions C1–C6 and Lemma A2, we know that $\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}\varepsilon_iK_h(U_t-U_i) = O_p(n^2)$. Meanwhile, according to Lemma A1, $\|\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\| = O_p(n^{-\frac25})$, so $\sum_{t=1}^{n}\sum_{i=1}^{n}X_{ij}W_i^\tau\big(\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big)K_h(U_t-U_i) \le \sum_{t=1}^{n}\sum_{i=1}^{n}\big\|X_{ij}W_i^\tau K_h(U_t-U_i)\big\|\,\big\|\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big\| = O_p(n^2h^2)$, and we have $\frac{\hat{\beta}_j(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} = O_p(1)$. In addition, for $b_n = \min\{\lambda_j^1, \lambda_{kl}^2 : j \in \mathcal{M}^c, (k,l) \in \mathcal{I}^c\}$, we get $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\beta_j(U_t)} = O_p(n^2) + O_p(n^2h^2) + (nh)^{\frac12}O_p\big((nh)^{\frac12}\lambda_j^1\big) \ge O_p(n^2) + O_p(n^2h^2) + (nh)^{\frac12}O_p\big((nh)^{\frac12}b_n\big) \to \infty$, since $(nh)^{\frac12}b_n \to \infty$. This clearly shows that, for $j \in \mathcal{M}^c$, there is no solution of $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\beta_j(U_t)} = 0$ with $\hat{\beta}_j(U_t) \ne 0$; that is, any $j \in \mathcal{M}^c$ satisfies $j \in \hat{\mathcal{M}}^c$, so $P(\hat{\mathcal{M}}^c = \mathcal{M}^c) \to 1$ is proved, and naturally we get $P(\hat{\mathcal{M}} = \mathcal{M}) \to 1$.
Next, we turn to $P(\hat{\mathcal{I}}^c = \mathcal{I}^c) \to 1$ as $n \to \infty$; i.e., for any $(j,k) \in \mathcal{I}^c$, $P(\hat{\phi}_{jk}(U_t) = 0) \to 1$ for $1 \le t \le n$. Again, if the claim were not true, $\hat{\phi}_{jk}(U_t) \ne 0$ would have to solve the following normal equation,
$$\begin{aligned}
0 = \frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\phi_{jk}(U_t)} &= -2\sum_{t=1}^{n}\sum_{i=1}^{n}Z_{ij}\big(Y_i - W_i^\tau\hat{\alpha}_\lambda(U_t)\big)K_h(U_t-U_i) + \frac{2\lambda_j^1\hat{\phi}_{jk}(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} + \frac{2\lambda_k^1\hat{\phi}_{jk}(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_k\|^2 + \sum_{j:j\ne k}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} + \frac{2\lambda_{jk}^2\hat{\phi}_{jk}(U_t)}{\|\hat{\boldsymbol{\phi}}_{jk}\|}\\
&= -2\sum_{t=1}^{n}\sum_{i=1}^{n}Z_{ij}\varepsilon_iK_h(U_t-U_i) + 2\sum_{t=1}^{n}\sum_{i=1}^{n}Z_{ij}W_i^\tau\big(\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big)K_h(U_t-U_i)\\
&\quad + 2(nh)^{\frac12}\lambda_j^1\frac{(nh)^{-\frac12}\hat{\phi}_{jk}(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} + 2(nh)^{\frac12}\lambda_k^1\frac{(nh)^{-\frac12}\hat{\phi}_{jk}(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_k\|^2 + \sum_{j:j\ne k}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} + 2(nh)^{\frac12}\lambda_{jk}^2\frac{(nh)^{-\frac12}\hat{\phi}_{jk}(U_t)}{\|\hat{\boldsymbol{\phi}}_{jk}\|}.
\end{aligned}$$
Referring to the definition, we consider the following three situations for $(j,k) \in \mathcal{I}^c$:
(1) $(j,k) \in \mathcal{I}^c$ and $j, k \in \mathcal{M}$;
(2) $(j,k) \in \mathcal{I}^c$, $j \in \mathcal{M}$ and $k \in \mathcal{M}^c$; and
(3) $(j,k) \in \mathcal{I}^c$ and $j, k \in \mathcal{M}^c$.
By Conditions C1–C6 and Lemma A2, we know that $\sum_{t=1}^{n}\sum_{i=1}^{n}Z_{ij}\varepsilon_iK_h(U_t-U_i) = O_p(n^2)$. Meanwhile, according to Lemma A1, $\|\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\| = O_p(n^{-\frac25})$, so $\sum_{t=1}^{n}\sum_{i=1}^{n}Z_{ij}W_i^\tau\big(\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big)K_h(U_t-U_i) \le \sum_{t=1}^{n}\sum_{i=1}^{n}\big\|Z_{ij}W_i^\tau K_h(U_t-U_i)\big\|\,\big\|\hat{\alpha}_\lambda(U_t)-\alpha(U_t)\big\| = O_p(n^2h^2)$, and we have $\frac{\hat{\phi}_{jk}(U_t)}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:k\ne j}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} = O_p(1)$ and $\frac{\hat{\phi}_{jk}(U_t)}{\|\hat{\boldsymbol{\phi}}_{jk}\|} = O_p(1)$. We get $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\phi_{jk}(U_t)} = O_p(n^2) + O_p(n^2h^2) + (nh)^{\frac12}O_p\big((nh)^{\frac12}(\lambda_j^1 + \lambda_k^1 + \lambda_{jk}^2)\big)$. In addition, for $a_n = \max\{\lambda_j^1, \lambda_k^1, \lambda_{kl}^2 : j \in \mathcal{M}, (k,l) \in \mathcal{I}\}$, we can set $\nu_n = \min\{\lambda_j^1, \lambda_k^1, \lambda_{kl}^2 : j \in \mathcal{M}, (k,l) \in \mathcal{I}\}$, so that $(nh)^{\frac12}\nu_n \le (nh)^{\frac12}a_n \to 0$, while $b_n = \min\{\lambda_j^1, \lambda_k^1, \lambda_{kl}^2 : j \in \mathcal{M}^c, (k,l) \in \mathcal{I}^c\}$.
Then, for Situations (1) and (2), $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\phi_{jk}(U_t)} \ge O_p(n^2) + O_p(n^2h^2) + (nh)^{\frac12}O_p\big((nh)^{\frac12}(\nu_n + b_n)\big) \to \infty$, since $(nh)^{\frac12}\nu_n \to 0$ and $(nh)^{\frac12}b_n \to \infty$.
For Situation (3), $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\phi_{jk}(U_t)} \ge O_p(n^2) + O_p(n^2h^2) + (nh)^{\frac12}O_p\big((nh)^{\frac12}(b_n + b_n)\big) \to \infty$, since $(nh)^{\frac12}b_n \to \infty$.
This clearly shows that, for $(j,k) \in \mathcal{I}^c$, there is no solution of $\frac{\partial Q_\lambda(\hat{\Lambda})}{\partial\phi_{jk}(U_t)} = 0$ with $\hat{\phi}_{jk}(U_t) \ne 0$; that is, any $(j,k) \in \mathcal{I}^c$ satisfies $(j,k) \in \hat{\mathcal{I}}^c$, thus $P(\hat{\mathcal{I}}^c = \mathcal{I}^c) \to 1$ is proved; naturally, we get $P(\hat{\mathcal{I}} = \mathcal{I}) \to 1$. □
Proof of Theorem 2.
By Theorem 1, we know immediately that $\hat{\alpha}_{\lambda,S^c}(U_t) = 0$ with probability tending to 1; thus, $\hat{\alpha}_{\lambda,S}(U_t)$ must be the solution of
$$-\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{n}W_{iS}\big(Y_i - W_{iS}^\tau\hat{\alpha}_{\lambda,S}(U_t)\big)K_h(U_t-U_i) + \frac{1}{n}\sum_{j=1}^{d_0}\lambda_j^1\gamma_j\Big(\mathbf{1}_j\mathbf{1}_j^\tau + \sum_{k:j<k}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\Big)\hat{\alpha}_{\lambda,S}(U_t) + \frac{1}{n}\sum_{j=1}^{d_0}\sum_{k:j<k}\lambda_k^1\gamma_k\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t) + \frac{1}{n}\sum_{j=1}^{d_1}\sum_{k:j<k}\lambda_{jk}^2\zeta_{jk}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t) = 0,$$
which means that $\hat{\alpha}_{\lambda,S}$ has the closed form
$$\hat{\alpha}_{\lambda,S}(U_t) = \Big[\frac{1}{n}\sum_{i=1}^{n}W_{iS}W_{iS}^\tau K_h(U_t-U_i)\Big]^{-1}\frac{1}{n}\Big[\sum_{i=1}^{n}W_{iS}Y_iK_h(U_t-U_i) - \sum_{j=1}^{d_0}\lambda_j^1\gamma_j\Big(\mathbf{1}_j\mathbf{1}_j^\tau + \sum_{k:j<k}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\Big)\hat{\alpha}_{\lambda,S}(U_t) - \sum_{j=1}^{d_0}\sum_{k:j<k}\lambda_k^1\gamma_k\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t) - \sum_{j=1}^{d_1}\sum_{k:j<k}\lambda_{jk}^2\zeta_{jk}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t)\Big],$$
where both $\mathbf{1}_j$ and $\mathbf{1}_{jk}$ are $\big(d_0 + \frac{d_1(d_1+1)}{2}\big)$-dimensional unit vectors with components either 1 or 0, satisfying $\hat{\alpha}_{\lambda,S}(U_t)^\tau\mathbf{1}_j = \hat{\beta}_j(U_t)$ and $\hat{\alpha}_{\lambda,S}(U_t)^\tau\mathbf{1}_{jk} = \hat{\phi}_{jk}(U_t)$. The oracle estimator is defined as follows,
$$\tilde{\alpha}_S(U_t) = \Big[\frac{1}{n}\sum_{i=1}^{n}W_{iS}W_{iS}^\tau K_h(U_t-U_i)\Big]^{-1}\frac{1}{n}\sum_{i=1}^{n}W_{iS}Y_iK_h(U_t-U_i).$$
Then, we have
$$\begin{aligned}
\big\|\hat{\alpha}_{\lambda,S}(U_t) - \tilde{\alpha}_S(U_t)\big\| &= \bigg\|\Sigma_S^{-1}(U_t)\times\frac{1}{n}\Big[\sum_{j=1}^{d_0}\lambda_j^1\gamma_j\Big(\mathbf{1}_j\mathbf{1}_j^\tau + \sum_{k:j<k}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\Big)\hat{\alpha}_{\lambda,S}(U_t) + \sum_{j=1}^{d_0}\sum_{k:j<k}\lambda_k^1\gamma_k\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t) + \sum_{j=1}^{d_1}\sum_{k:j<k\le d_1}\lambda_{jk}^2\zeta_{jk}\mathbf{1}_{jk}\mathbf{1}_{jk}^\tau\hat{\alpha}_{\lambda,S}(U_t)\Big]\bigg\|\\
&\le \lambda_{S,max}\cdot a_n\times\frac{1}{n}\left[2\sum_{j=1}^{d_0}\frac{\sqrt{\hat{\beta}_j^2(U_t) + \sum_{k:j<k}\hat{\phi}_{jk}^2(U_t)}}{\sqrt{\|\hat{\boldsymbol{\beta}}_j\|^2 + \sum_{k:j<k}\|\hat{\boldsymbol{\phi}}_{jk}\|^2}} + \sum_{j=1}^{d_1}\sum_{k:j<k\le d_1}\frac{|\hat{\phi}_{jk}(U_t)|}{\|\hat{\boldsymbol{\phi}}_{jk}\|}\right]\\
&\le \lambda_{S,max}\cdot a_n\times\frac{1}{n}\Big(2d_0 + \frac{d_1(d_1+1)}{2}\Big) = \Big(2d_0 + \frac{d_1(d_1+1)}{2}\Big)\lambda_{S,max}\,n^{-\frac35}(nh)^{\frac12}a_n,
\end{aligned}$$
where $\lambda_{S,max}$ is the maximum eigenvalue of $\Sigma_S^{-1}(U_t)$. Since $(nh)^{\frac12}a_n \to 0$, we know that $\max_t\big\|\hat{\alpha}_{\lambda,S}(U_t) - \tilde{\alpha}_S(U_t)\big\| \le \big(2d_0 + \frac{d_1(d_1+1)}{2}\big)\lambda_{S,max}\,n^{-\frac35}(nh)^{\frac12}a_n = o_p(n^{-\frac35})$, and Theorem 2 is proved. □

References

1. Hastie, T.; Tibshirani, R. Varying-coefficient models. J. R. Stat. Soc. Ser. B 1993, 55, 757–779.
2. Fan, J.; Zhang, W. Statistical estimation in varying coefficient models. Ann. Stat. 1999, 27, 1491–1518.
3. Fan, J.; Huang, T. Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli 2005, 11, 1031–1057.
4. Fan, J.; Zhang, W. Simultaneous confidence bands and hypothesis testing in varying-coefficient models. Scand. J. Stat. 2000, 27, 715–731.
5. Zhu, N.H. Two-stage local Walsh average estimation of generalized varying coefficient models. Acta Math. Appl. Sin. Engl. Ser. 2015, 31, 623–642.
6. Li, Z.H.; Liu, J.S.; Wu, X.L. Variable bandwidth and one step local M-estimation of varying coefficient models. Appl. Math. A J. Chin. Univ. 2009, 4, 379–390.
7. Wang, H.; Xia, Y. Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 2009, 104, 747–757.
8. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
9. Zhao, P.; Xue, L. Variable selection for semiparametric varying coefficient partially linear models. Stat. Probab. Lett. 2009, 79, 2148–2157.
10. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
11. Tang, Y.; Wang, H.J.; Zhu, Z.; Song, X. A unified variable selection approach for varying coefficient models. Stat. Sin. 2012, 22, 601–628.
12. Li, D.; Ke, Y.; Zhang, W. Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models. Ann. Stat. 2015, 43, 2676–2705.
13. He, K.; Lian, H.; Ma, S.; Huang, J. Dimensionality reduction and variable selection in multivariate varying-coefficient models with a large number of covariates. J. Am. Stat. Assoc. 2018, 113, 746–754.
14. Hall, P.; Xue, J.H. On selecting interacting features from high-dimensional data. Comput. Stat. Data Anal. 2014, 71, 694–708.
15. Niu, Y.S.; Hao, N.; Zhang, H.H. Interaction screening by partial correlation. Stat. Interface 2018, 11, 317–325.
16. Kong, Y.; Li, D.; Fan, Y.; Lv, J. Interaction pursuit in high-dimensional multi-response regression via distance correlation. Ann. Stat. 2017, 45, 897–922.
17. Radchenko, P.; James, G. Variable selection using adaptive nonlinear interaction structure in high dimensions. J. Am. Stat. Assoc. 2010, 105, 1541–1553.
18. Choi, N.; Li, W.; Zhu, J. Variable selection with strong heredity constraint and its oracle property. J. Am. Stat. Assoc. 2010, 105, 354–364.
19. Bien, J.; Taylor, J.; Tibshirani, R. A lasso for hierarchical interactions. Ann. Stat. 2013, 41, 1111–1141.
20. Zhao, P.; Rocha, G.; Yu, B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 2009, 37, 3468–3497.
21. Lim, M.; Hastie, T. Learning interactions via hierarchical group lasso regularization. J. Comput. Graph. Stat. 2015, 24, 627–654.
22. Nelder, J.A. The statistics of linear models: Back to basics. Stat. Comput. 1994, 4, 221–234.
23. Hamada, M.; Wu, C.J. Analysis of designed experiments with complex aliasing. J. Qual. Technol. 1992, 24, 130–137.
24. Haris, A.; Witten, D.; Simon, N. Convex modeling of interactions with strong heredity. J. Comput. Graph. Stat. 2016, 25, 981–1004.
25. Ning, H.; Zhang, H.H. A note on high-dimensional linear regression with interactions. Am. Stat. 2017, 71, 291–297.
26. Ning, H.; Yang, F.; Zhang, H.H. Model selection for high dimensional quadratic regression via regularization. J. Am. Stat. Assoc. 2018, 113, 615–625.
27. She, Y.; Wang, Z.; Jiang, H. Group regularized estimation under structural hierarchy. J. Am. Stat. Assoc. 2018, 113, 445–454.
28. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67.
29. Hu, T.; Xia, Y. Adaptive semi-varying coefficient model selection. Stat. Sin. 2012, 22, 575–599.
30. Hunter, D.; Li, R. Variable selection using MM algorithms. Ann. Stat. 2005, 33, 1617–1642.
31. Craven, P.; Wahba, G. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 1979, 31, 377–403.
32. Zou, H.; Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 2008, 36, 1509–1533.
33. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 16 March 2020).
34. Venables, W.; Ripley, B. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002; ISBN 0-387-95457-0.
35. Harrison, D.; Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. 1978, 5, 81–102.
Figure 1. Estimated quantile curves for Model 1. Dash-dot lines are 0.95 quantile curves, dotted lines are 0.50 quantile curves, dashed lines are 0.05 quantile curves, and solid lines are the true function curves. The dash-dot and dashed lines indicate 90% confidence bands.
Figure 2. Estimated quantile curves for Model 2. Dash-dot lines are 0.95 quantile curves, dotted lines are 0.50 quantile curves, dashed lines are 0.05 quantile curves, and solid lines are the true function curves. The dash-dot and dashed lines indicate 90% confidence bands.
Figure 3. Estimated quantile curves for Model 3. Dash-dot lines are 0.95 quantile curves, dotted lines are 0.50 quantile curves, dashed lines are 0.05 quantile curves, and solid lines are the true function curves. The dash-dot and dashed lines indicate 90% confidence bands.
Figure 4. Estimated functional coefficient curves of the relevant variables.
Table 1. MISEs of the functional coefficient estimators for Model 1 (standard errors in parentheses).

| Estimator | n = 100 | n = 200 | n = 500 |
|---|---|---|---|
| $\hat{\beta}_1(u)$ | 0.0684 (0.0856) | 0.0285 (0.0189) | 0.0174 (0.0087) |
| $\hat{\beta}_{ora,1}(u)$ | 0.0553 (0.0385) | 0.0277 (0.0181) | 0.0170 (0.0087) |
| $\hat{\beta}_2(u)$ | 0.0531 (0.0508) | 0.0309 (0.0213) | 0.0213 (0.0110) |
| $\hat{\beta}_{ora,2}(u)$ | 0.0476 (0.0406) | 0.0305 (0.0211) | 0.0211 (0.0107) |
| $\hat{\beta}_3(u)$ | 0.0483 (0.0403) | 0.0277 (0.0199) | 0.0183 (0.0096) |
| $\hat{\beta}_{ora,3}(u)$ | 0.0410 (0.0325) | 0.0256 (0.0167) | 0.0169 (0.0086) |
| $\hat{\phi}_{23}(u)$ | 0.0811 (0.0715) | 0.0231 (0.0256) | 0.0124 (0.0082) |
| $\hat{\phi}_{ora,23}(u)$ | 0.0435 (0.0343) | 0.0198 (0.0134) | 0.0111 (0.0056) |
Table 2. MISEs of the functional coefficient estimators for Model 2 (standard errors in parentheses).

| Estimator | n = 100 | n = 200 | n = 500 |
|---|---|---|---|
| $\hat{\beta}_1(u)$ | 0.0605 (0.0433) | 0.0332 (0.0215) | 0.0173 (0.0084) |
| $\hat{\beta}_{ora,1}(u)$ | 0.0576 (0.0407) | 0.0329 (0.0212) | 0.0172 (0.0083) |
| $\hat{\beta}_2(u)$ | 0.0832 (0.0608) | 0.0428 (0.0278) | 0.0221 (0.0107) |
| $\hat{\beta}_{ora,2}(u)$ | 0.0789 (0.0551) | 0.0426 (0.0270) | 0.0221 (0.0106) |
| $\hat{\beta}_3(u)$ | 0.0820 (0.0584) | 0.0411 (0.0253) | 0.0221 (0.0116) |
| $\hat{\beta}_{ora,3}(u)$ | 0.0753 (0.0498) | 0.0419 (0.0254) | 0.0225 (0.0117) |
| $\hat{\beta}_4(u)$ | 0.0866 (0.0882) | 0.0358 (0.0323) | 0.0182 (0.0095) |
| $\hat{\beta}_{ora,4}(u)$ | 0.0605 (0.0429) | 0.0325 (0.0212) | 0.0173 (0.0088) |
| $\hat{\phi}_{12}(u)$ | 0.0815 (0.0685) | 0.0329 (0.0213) | 0.0147 (0.0081) |
| $\hat{\phi}_{ora,12}(u)$ | 0.0658 (0.0486) | 0.0317 (0.0202) | 0.0143 (0.0077) |
| $\hat{\phi}_{23}(u)$ | 0.1695 (0.1023) | 0.0486 (0.0328) | 0.0185 (0.0093) |
| $\hat{\phi}_{ora,23}(u)$ | 0.1123 (0.0652) | 0.0428 (0.0242) | 0.0168 (0.0082) |
Table 3. MISEs of the functional coefficient estimators for Model 3 (standard errors in parentheses).

| Estimator | n = 100 | n = 200 | n = 500 |
|---|---|---|---|
| $\hat{\beta}_1(u)$ | 0.0784 (0.0751) | 0.0357 (0.0232) | 0.0188 (0.0096) |
| $\hat{\beta}_{ora,1}(u)$ | 0.0702 (0.0536) | 0.0345 (0.0224) | 0.0189 (0.0096) |
| $\hat{\beta}_2(u)$ | 0.0938 (0.0753) | 0.0483 (0.0338) | 0.0253 (0.0127) |
| $\hat{\beta}_{ora,2}(u)$ | 0.0918 (0.0717) | 0.0484 (0.0334) | 0.0253 (0.0125) |
| $\hat{\beta}_3(u)$ | 0.0819 (0.0654) | 0.0435 (0.0305) | 0.0242 (0.0119) |
| $\hat{\beta}_{ora,3}(u)$ | 0.0795 (0.0595) | 0.0435 (0.0297) | 0.0244 (0.0119) |
| $\hat{\beta}_4(u)$ | 0.2343 (0.1158) | 0.0658 (0.0385) | 0.0227 (0.0113) |
| $\hat{\beta}_{ora,4}(u)$ | 0.2127 (0.1052) | 0.0628 (0.0367) | 0.0216 (0.0104) |
| $\hat{\phi}_{23}(u)$ | 0.1462 (0.1213) | 0.0734 (0.0660) | 0.0247 (0.0126) |
| $\hat{\phi}_{ora,23}(u)$ | 0.1265 (0.0897) | 0.0668 (0.0483) | 0.0239 (0.0118) |
| $\hat{\phi}_{24}(u)$ | 0.2088 (0.1638) | 0.0933 (0.0616) | 0.0330 (0.0166) |
| $\hat{\phi}_{ora,24}(u)$ | 0.1842 (0.1159) | 0.0909 (0.0534) | 0.0305 (0.0152) |
Table 4. CM, CZ, and CS for Models 1–3.

| Model | n | CM | CZ | CS |
|---|---|---|---|---|
| Model 1 | 100 | 0.922 | 0.711 | 0.658 |
| | 200 | 0.996 | 0.985 | 0.982 |
| | 500 | 1.000 | 1.000 | 1.000 |
| Model 2 | 100 | 0.884 | 0.575 | 0.524 |
| | 200 | 0.987 | 0.972 | 0.960 |
| | 500 | 1.000 | 1.000 | 1.000 |
| Model 3 | 100 | 0.932 | 0.533 | 0.517 |
| | 200 | 0.973 | 0.952 | 0.937 |
| | 500 | 1.000 | 1.000 | 1.000 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
