**A Nuisance-Free Inference Procedure Accounting for the Unknown Missingness with Application to Electronic Health Records**

#### **Jiwei Zhao <sup>1,\*</sup> and Chi Chen <sup>2</sup>**


Received: 24 August 2020; Accepted: 12 October 2020; Published: 14 October 2020

**Abstract:** We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider might be a traditional setting or a modern high-dimensional setting where the sparsity assumption is usually imposed and the regularization technique is popularly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt the conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via some data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for the post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., patient reported outcomes or real world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.

**Keywords:** nuisance; post-selection inference; missingness mechanism; regularization; asymptotic theory; unconventional likelihood

#### **1. Introduction**

A major step towards scientific discovery is to identify useful associations from various features and to quantify their uncertainties. This usually warrants building a regression model for an outcome variable and estimating the coefficient associated with each feature as well as the precision of the estimator. Besides the traditional regression with a small dimensionality, with advances in biotechnology, the modern high-dimensional regression usually posits a sparse parameter in the model, and then applies regularization to select the significant features in order to recover the sparsity. In particular, the post-selection inference could be challenging in a regularized regression framework. In this paper, our main interest is to consider a regression model where the outcome variable is prone to missing values. We study both the traditional setting where regularization is not needed and the modern one with regularization.

The missing data issue is an inevitable concern for statistical analysis in various disciplines ranging from biomedical studies to social sciences. In many applications, the occurrence of missing data is usually not the investigator's primary interest but complicates the statistical analysis. The validity of any method devised for missing data heavily depends on the assumption of the missingness mechanism [1]. Unfortunately, such assumptions are largely unknown and difficult, if not infeasible, to test empirically. Therefore, one prefers to concentrate on analyzing the regression model for the outcome variable, while treating the mechanism model as a nuisance. A flexible assumption imposed on the mechanism at the minimum level provides protection against model misspecification at this level.

While it is indeed promising to regard the missingness mechanism as a nuisance with a flexible assumption, a potential issue is the model identifiability problem if the mechanism contains missing-not-at-random cases, i.e., allowing the mechanism to depend on the missing values themselves. In the past few years, researchers have made great progress on this topic by introducing a so-called instrument. This instrument could be a shadow variable [2–7] or an instrumental variable [8,9]. Both approaches are reasonable and are suitable for different applications. In this paper, we adopt the shadow variable approach as it facilitates the interpretability of the regression model for the outcome. The details of the shadow variable approach will be articulated later throughout the paper.

Therefore, we proceed with a semiparametric framework where our primary interest is a parametric regression, e.g., a linear model, where the statistical task is to estimate the parameter of interest and conduct statistical inference (particularly post-selection inference for the setting with regularization). For the nuisance missingness mechanism, we only impose a nonparametric assumption without specifying a concrete form. We encode the shadow variable as *Z*, which is one component of the covariate **X**. In general, a shadow variable with a smaller dimensionality allows more flexibility of the missingness mechanism. Therefore, although it could be multidimensional, we only consider univariate *Z* throughout the paper. With all of these ingredients, we analyze a conditional likelihood approach which will eventually result in a nuisance-free procedure for parameter estimation and statistical inference.

There are at least two extra highlights of our proposed method that are worth mentioning. The first pertains to the algorithm and computation. Although it looks complicated at first sight, we show that, via some data manipulation strategy, the conditional likelihood function can be analytically written as the likelihood of a conventional logistic regression with some prespecified format. Therefore, our objective function can be readily optimized by many existing software packages. This greatly alleviates the computational burden of our procedure. Second, while the variance estimation under the traditional setting is straightforward following the asymptotic approximation, it is challenging for the setting with regularization. To resolve this problem, we present an easy-to-implement data-driven method to estimate the variance of the regularized estimator via a data perturbation technique. It is noted that the current literature on the inference procedure for regularized estimation in the presence of missing values is very scarce. The authors of [10–12] all considered the model selection problem under high dimensionality with missing data; however, none of them studied the post-selection inference in this context.

The remainder of the paper is structured as follows. In Section 2, we first lay out our model formulation and introduce the shadow variable and the conditional likelihood. Section 3 details the traditional setting without regularization: we present our algorithm for maximizing the conditional likelihood function, the theory deriving the asymptotic representation of our proposed estimator, and how to estimate its variance. In Section 4, we devote ourselves to the modern setting where the sparsity assumption is imposed and the regularization technique is adopted; the algorithm and theory, as well as the variance estimation through the data perturbation technique, are presented. In Section 5, we conduct comprehensive simulation studies to examine the finite sample performance of our proposed estimator as well as to compare it with some existing methods. Section 6 applies our method to the regression model for the albumin level, which suffers from a large amount of missing values, in the MIMIC-III study [13]. The paper is concluded with a discussion in Section 7.

#### **2. Methodology**

Denote the outcome variable as *Y* and the covariate as **X**. We assume $\mathbf{X} = (\mathbf{U}^{\mathsf{T}}, Z)^{\mathsf{T}}$, where **U** is *p*-dimensional and *Z* is univariate, with a detailed interpretation given later. We consider the linear model

$$Y = \alpha + \boldsymbol{\beta}^{\mathsf{T}} \mathbf{U} + \gamma Z + \epsilon, \tag{1}$$

where *β* is also *p*-dimensional, *α* and *γ* are scalars, the true value of *γ*, denoted *γ*<sub>0</sub>, is nonzero, and $\epsilon \sim N(0, \sigma^2)$. We consider the situation where *Y* has missing values while **X** is fully observed. We introduce a binary variable *R* to indicate missingness: *R* = 1 if *Y* is observed and *R* = 0 if *Y* is missing. To allow the greatest flexibility of the missingness mechanism model, we assume

$$\operatorname{pr}(R=1 \mid Y, \mathbf{X}) = \operatorname{pr}(R=1 \mid Y, \mathbf{U}) = s(Y, \mathbf{U}),\tag{2}$$

where *s*(·) merely represents an unknown and unspecified function not depending on *Z*. We reiterate that, since assumption (2) is nonparametric in flavor and does not specify a concrete form of *s*(·), one need not worry about misspecifying the mechanism model. Moreover, as it allows dependence on *Y*, assumption (2) covers, besides missing-completely-at-random (MCAR) and many scenarios of missing-at-random (MAR), various situations of missing-not-at-random (MNAR).

We term *Z* the shadow variable, following the works in [5–7,14]. Its existence depends on whether it is sensible that *Z* and *R* are conditionally independent (given *Y* and **U**) and that *Y* heavily relies on *Z* (as *γ*<sub>0</sub> ≠ 0). There are many examples in the literature documenting that the existence of *Z* is practically reasonable. In applications, a surrogate or a proxy of the outcome variable *Y* that does not itself affect the missingness mechanism could be a good choice for the shadow variable *Z*.

We assume independent and identically distributed observations $\{r\_i, y\_i, \mathbf{u}\_i, z\_i\}$ for *i* = 1, ..., *N*, where the first *n* subjects are free of missing data. We now present an *s*(·)-free procedure via the use of the conditional likelihood. Denote $\mathbf{V} = (Y, \mathbf{U}^{\mathsf{T}})^{\mathsf{T}}$. We start with

$$\prod\_{i=1}^{n} p(\mathbf{v}\_i \mid z\_i, r\_i = 1) = \prod\_{i=1}^{n} \frac{s(\mathbf{v}\_i)}{g(z\_i)}\, p(\mathbf{v}\_i \mid z\_i),$$

where $g(z\_i) = \operatorname{pr}(r\_i = 1 \mid z\_i) = \int \operatorname{pr}(r\_i = 1 \mid \mathbf{v})\, p(\mathbf{v} \mid z\_i)\, d\mathbf{v}$ and *p*(· | ·) is a generic notation for a conditional probability density/mass function. If **V** were univariate, denoting $\mathcal{A}$ as the rank statistic of $\{v\_1, \dots, v\_n\}$, then

$$\prod\_{i=1}^{n} p(v\_i \mid z\_i, r\_i = 1) = p(v\_1, \dots, v\_n \mid z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1)$$

$$= p(\mathcal{A} \mid v\_{(1)}, \dots, v\_{(n)}, z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1)\, p(v\_{(1)}, \dots, v\_{(n)} \mid z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1). \tag{3}$$

The conditional likelihood that we use, the first term on the right hand side of (3), is exactly

$$p(\mathcal{A} \mid v\_{(1)}, \dots, v\_{(n)}, z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1) = \frac{p(v\_1, \dots, v\_n \mid z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1)}{p(v\_{(1)}, \dots, v\_{(n)} \mid z\_1, \dots, z\_n, r\_1 = \dots = r\_n = 1)}$$

$$= \frac{\prod\_{i=1}^n p(v\_i \mid z\_i, r\_i = 1)}{\sum\_{\omega \in \Omega} \prod\_{i=1}^n p(v\_{\omega(i)} \mid z\_i, r\_i = 1)} = \frac{\prod\_{i=1}^n p(v\_i \mid z\_i)}{\sum\_{\omega \in \Omega} \prod\_{i=1}^n p(v\_{\omega(i)} \mid z\_i)},\tag{4}$$

where Ω represents the collection of all one-to-one mappings from {1, ..., *n*} to {1, ..., *n*}. Now (4) is nuisance-free and can be used to estimate the unknown parameters in $p(v\_i \mid z\_i)$.

Although **V** is multidimensional in our case, the idea presented above can still be applied and it leads to

$$\frac{\prod\_{i=1}^{n} p(y\_i, \mathbf{u}\_i \mid z\_i, r\_i = 1)}{\sum\_{\omega \in \Omega} \prod\_{i=1}^{n} p(y\_{\omega(i)}, \mathbf{u}\_{\omega(i)} \mid z\_i, r\_i = 1)} = \frac{\prod\_{i=1}^{n} p(y\_i, \mathbf{u}\_i \mid z\_i)}{\sum\_{\omega \in \Omega} \prod\_{i=1}^{n} p(y\_{\omega(i)}, \mathbf{u}\_{\omega(i)} \mid z\_i)}. \tag{5}$$

Furthermore, to simplify the computation, we adopt the pairwise version of (5), following the previous discussion of the pairwise pseudo-likelihood in [15], which results in

$$\prod\_{1 \le i < j \le n} \frac{p(y\_i, \mathbf{u}\_i \mid z\_i)\, p(y\_j, \mathbf{u}\_j \mid z\_j)}{p(y\_i, \mathbf{u}\_i \mid z\_i)\, p(y\_j, \mathbf{u}\_j \mid z\_j) + p(y\_i, \mathbf{u}\_i \mid z\_j)\, p(y\_j, \mathbf{u}\_j \mid z\_i)}.$$

After plugging in model (1) and some algebra, the objective eventually becomes to minimize

$$L(\boldsymbol{\theta}) = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \phi\_{ij}(\boldsymbol{\theta}) = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} r\_i r\_j \log \left\{ 1 + W\_{ij} \exp(\boldsymbol{\theta}^{\mathsf{T}} \mathbf{d}\_{ij}) \right\},\tag{6}$$

where $\boldsymbol{\theta} = (\widetilde{\gamma}, \widetilde{\boldsymbol{\beta}}^{\mathsf{T}})^{\mathsf{T}}$, $\widetilde{\gamma} = \gamma/\sigma^2$, $\widetilde{\boldsymbol{\beta}} = \widetilde{\gamma}\boldsymbol{\beta}$, $\mathbf{d}\_{ij} = (-y\_{i\backslash j} z\_{i\backslash j}, \mathbf{u}\_{i\backslash j}^{\mathsf{T}} z\_{i\backslash j})^{\mathsf{T}}$, $y\_{i\backslash j} = y\_i - y\_j$, $\mathbf{u}\_{i\backslash j} = \mathbf{u}\_i - \mathbf{u}\_j$, $z\_{i\backslash j} = z\_i - z\_j$, and $W\_{ij} = p(z\_i \mid \mathbf{u}\_j)\, p(z\_j \mid \mathbf{u}\_i) / \{p(z\_i \mid \mathbf{u}\_i)\, p(z\_j \mid \mathbf{u}\_j)\}$.
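To make the estimation target concrete, the objective (6) can be evaluated directly from the data. Below is a minimal numpy sketch, assuming the matrix of log *W<sub>ij</sub>* values has already been computed; the function name and array layout are our own illustrative choices, not part of the paper.

```python
import numpy as np

def pairwise_objective(theta, y, u, z, r, logW):
    """Evaluate the pairwise conditional-likelihood objective (6).

    theta : (p+1,) array, (gamma_tilde, beta_tilde) as defined in the text
    y, z  : (N,) arrays; u : (N, p) array; r : (N,) 0/1 missingness indicator
    logW  : (N, N) array with logW[i, j] = log W_ij (diagonal is unused)
    """
    N = len(y)
    total = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            if r[i] == 0 or r[j] == 0:
                continue  # only fully observed pairs contribute (r_i r_j factor)
            zd = z[i] - z[j]
            # d_ij = (-(y_i - y_j) z_d, (u_i - u_j)^T z_d)^T
            d = np.concatenate(([-(y[i] - y[j]) * zd], (u[i] - u[j]) * zd))
            total += np.log1p(np.exp(logW[i, j] + theta @ d))
    return total / (N * (N - 1) / 2)
```

Note that pairs involving a subject with missing *Y* contribute zero, exactly as the *r<sub>i</sub>r<sub>j</sub>* factor in (6) dictates, while the normalizing constant still counts all pairs.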

Denote the minimizer of (6) as $\widehat{\boldsymbol{\theta}}$. By checking that

$$\frac{\partial^2 \phi\_{ij}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} = r\_i r\_j \{1 + W\_{ij} \exp(\boldsymbol{\theta}^{\mathsf{T}} \mathbf{d}\_{ij})\}^{-2}\, W\_{ij} \exp(\boldsymbol{\theta}^{\mathsf{T}} \mathbf{d}\_{ij})\, \mathbf{d}\_{ij} \mathbf{d}\_{ij}^{\mathsf{T}}$$

is positive definite, $\widehat{\boldsymbol{\theta}}$ uniquely exists. To compute $\widehat{\boldsymbol{\theta}}$, one also needs a model for $W\_{ij}$. Fortunately, this model depends only on the fully observed data $\mathbf{x}\_i$ and $\mathbf{x}\_j$. Essentially any existing parametric, semiparametric, or nonparametric modeling technique for *p*(*z* | **u**) can be used, and $W\_{ij}$ can be estimated accordingly. Throughout, we denote $\widehat{W}\_{ij}$ as an available well-behaved estimator of $W\_{ij}$. Although our procedure stems from *p*(*y*, **u** | *z*, *r* = 1), which relies only on the data $\{y\_i, \mathbf{x}\_i\}$ with $r\_i = 1$, the data $\{\mathbf{x}\_i\}$ with $r\_i = 0$ are also used in the process of estimating $W\_{ij}$. Therefore, all observed data, from both completely observed subjects and partially observed subjects, are utilized in our procedure.
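As one concrete choice, suppose a working normal linear model $z \mid \mathbf{u} \sim N(a + \mathbf{b}^{\mathsf{T}}\mathbf{u}, \tau^2)$ is adopted for *p*(*z* | **u**). Under this working model the ratio collapses to the closed form $\log W\_{ij} = -(z\_i - z\_j)(m\_i - m\_j)/\tau^2$ with $m\_i = a + \mathbf{b}^{\mathsf{T}}\mathbf{u}\_i$, which a least-squares fit on all *N* subjects estimates directly. A sketch (our own helper, not the authors' code):

```python
import numpy as np

def logW_normal_linear(u, z):
    """log W_ij under a working normal linear model z | u ~ N(a + b'u, tau^2).

    Under this model the density ratio simplifies to
        log W_ij = -(z_i - z_j)(m_i - m_j) / tau^2,   m_i = a + b'u_i.
    Returns an (N, N) matrix of log W_ij values.
    """
    X = np.column_stack([np.ones(len(z)), u])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)   # OLS fit of z on (1, u)
    m = X @ coef                                   # fitted means a + b'u_i
    resid = z - m
    tau2 = resid @ resid / (len(z) - X.shape[1])   # residual variance
    zd = z[:, None] - z[None, :]
    md = m[:, None] - m[None, :]
    return -zd * md / tau2
```

The sign is intuitive: when *z<sub>i</sub>* − *z<sub>j</sub>* and the fitted means move together, the swapped pairing is less plausible and log *W<sub>ij</sub>* is negative.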

One can notice that, due to assumption (2), which allows the greatest flexibility of the mechanism model, and the adoption of the conditional likelihood, not all of the parameters *α*, *β*, *γ*, and *σ*<sup>2</sup> are estimable. Nevertheless, the parameter *β*, which quantifies the association between *Y* and **U** after adjusting for *Z* and is of primary scientific interest, is fully estimable. The remainder of the paper focuses on the estimation and inference of *β*, as well as the variable selection procedure based on *β*.

Before moving on, we give some comparison with the existing literature to underline the novel contributions we make in this paper. Based on a slightly different but more restrictive missingness mechanism assumption, namely pr(*R* = 1 | *Y*, **X**) = *a*(*Y*)*b*(**X**), Refs. [16–18] used a similar idea to analyze non-ignorable missing data for a generalized linear model and a semiparametric proportional likelihood ratio model, respectively. They focused on different aspects of how to use the conditional likelihoods and their consequences, such as the partial identifiability issue and the large bias issue. In this paper, we focus on the linear model (1), and we have just shown that the parameter *β* is fully identifiable. The method presented in this paper can be applied to different models, but their identifiability problems, or other relevant issues, have to be analyzed on a case-by-case basis. For instance, Ref. [19] studied the parameter estimation problem in a logistic regression model with a low dimensionality under assumption (2). They showed that, different from the current paper, all the unknown parameters are identifiable in their context. However, because of the complexity of their objective function, the algorithm studied in [19] is nontrivial and cannot be extended to a high-dimensional setting.

#### **3. Traditional Setting without Regularization**

**Computation**. Directly minimizing *L*(*θ*) is feasible; however, it is computationally involved. By rearranging the terms in *L*(*θ*), we realize that it can be rewritten as the negative log-likelihood function of a standard logistic regression model. To be more specific, let *k* index the pair (*i*, *j*), with *k* = 1, ..., *K* and $K = \binom{n}{2}$. Then,

$$L(\boldsymbol{\theta}) = \frac{1}{K} \sum\_{k=1}^{K} \log \left\{ 1 + \exp\left( s\_k \boldsymbol{\theta}^{\mathsf{T}} \mathbf{t}\_k + \log \widehat{W}\_k \right) \right\},\tag{7}$$

where $s\_k = -\operatorname{sign}(z\_{i\backslash j})$ and $\mathbf{t}\_k = (|z\_{i\backslash j}| y\_{i\backslash j}, -|z\_{i\backslash j}| \mathbf{u}\_{i\backslash j}^{\mathsf{T}})^{\mathsf{T}}$. Denote $g\_k = I\{z\_{i\backslash j} > 0\}$; then one can show that the summand in (7), $\log\{1 + \exp(s\_k \boldsymbol{\theta}^{\mathsf{T}} \mathbf{t}\_k + \log \widehat{W}\_k)\}$, equals

$$-\left[ g\_k \left( \boldsymbol{\theta}^{\mathsf{T}} \mathbf{t}\_k + s\_k \log \widehat{W}\_k \right) - \log\left\{ 1 + \exp\left( \boldsymbol{\theta}^{\mathsf{T}} \mathbf{t}\_k + s\_k \log \widehat{W}\_k \right) \right\} \right],$$

which is the contribution of the *k*-th subject to the negative log-likelihood of a logistic regression with $g\_k$ as the response, *θ* as the coefficient, $\mathbf{t}\_k$ as the covariate, and $s\_k \log \widehat{W}\_k$ as the offset term, but without an intercept. Therefore, $\widehat{\boldsymbol{\theta}}$ can be obtained by fitting the aforementioned logistic regression model. Algorithm 1 describes the steps for data manipulation and model fitting to estimate *θ* under this traditional setting.

**Algorithm 1** Minimization of (6) without penalization

1: **Inputs:** $\{y\_i, \mathbf{u}\_i, z\_i\}$ and $\widehat{W}\_{ij}$, for *i* = 1, ..., *n* and *j* = 1, ..., *n*
2: **Initialize:** *k* ← 0
3: **for** *j* ∈ {2 : *n*} **do**
4: &emsp;**for** *i* ∈ {1 : (*j* − 1)} **do**
5: &emsp;&emsp;*k* ← *k* + 1
6: &emsp;&emsp;$y\_{i\backslash j} \leftarrow y\_i - y\_j$, $\mathbf{u}\_{i\backslash j} \leftarrow \mathbf{u}\_i - \mathbf{u}\_j$, $z\_{i\backslash j} \leftarrow z\_i - z\_j$, $\widehat{W}\_k \leftarrow \widehat{W}\_{ij}$
7: &emsp;&emsp;$g\_k \leftarrow I\{z\_{i\backslash j} > 0\}$
8: &emsp;&emsp;$s\_k \leftarrow -\operatorname{sign}(z\_{i\backslash j})$
9: &emsp;&emsp;$\mathbf{t}\_k \leftarrow (|z\_{i\backslash j}| y\_{i\backslash j}, -|z\_{i\backslash j}| \mathbf{u}\_{i\backslash j}^{\mathsf{T}})^{\mathsf{T}}$
10: Fit a logistic regression with response **g**, covariates **t**, offsets $s\_k \log \widehat{W}\_k$, and no intercept.

11: **Outputs:** $\widehat{\boldsymbol{\theta}}$
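Algorithm 1 amounts to building the pairwise arrays and fitting a no-intercept logistic regression with an offset. The sketch below does both with a plain Newton-Raphson solver rather than an existing GLM routine, so it stays self-contained; the function name, defaults, and the small ridge term in the Newton step are our own illustrative choices.

```python
import numpy as np

def fit_conditional_logistic(y, u, z, r, logW, n_iter=50):
    """Algorithm 1 sketch: build the pairwise design over fully observed
    subjects, then fit the offset logistic regression g_k ~ t_k (no intercept,
    offset s_k * log W_k) by Newton-Raphson; returns theta_hat."""
    obs = np.where(r == 1)[0]
    G, T, O = [], [], []
    for a in range(len(obs)):
        for b in range(a + 1, len(obs)):
            i, j = obs[a], obs[b]
            zd, yd, ud = z[i] - z[j], y[i] - y[j], u[i] - u[j]
            if zd == 0:
                continue  # ties in z carry no information in this reduction
            G.append(1.0 if zd > 0 else 0.0)                       # g_k
            T.append(np.concatenate(([abs(zd) * yd], -abs(zd) * ud)))  # t_k
            O.append(-np.sign(zd) * logW[i, j])                    # s_k log W_k
    G, T, O = np.array(G), np.array(T), np.array(O)
    theta = np.zeros(T.shape[1])
    for _ in range(n_iter):                                        # Newton steps
        p = 1.0 / (1.0 + np.exp(-(T @ theta + O)))
        grad = T.T @ (G - p)
        hess = (T * (p * (1 - p))[:, None]).T @ T
        theta += np.linalg.solve(hess + 1e-8 * np.eye(len(theta)), grad)
    return theta
```

Recall from Section 2 that the recovered coefficient is $\boldsymbol{\theta} = (\gamma/\sigma^2, \gamma\boldsymbol{\beta}/\sigma^2)$, not (*γ*, *β*) itself.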

**Asymptotic Theory**. The asymptotic theory of $\widehat{\boldsymbol{\theta}}$ involves a model of *p*(*z* | **u**), which does not contain any missing values, and therefore any statistical model, whether parametric, semiparametric, or nonparametric, can be used. For simplicity, we only discuss the parametric case here; any further elaborations are deferred to Section 7. For a parametric model *p*(*z* | **u**; *η*), one can apply the standard maximum likelihood estimator $\widehat{\eta}$. Here, we simply assume

$$\sqrt{N}\left(\widehat{\eta} - \eta\_0\right) = -\mathbf{G}^{-1} \frac{1}{\sqrt{N}} \sum\_{i=1}^{N} \frac{\partial}{\partial \eta} \log\left\{p(z\_i \mid \mathbf{u}\_i; \eta\_0)\right\} + o\_p(1),\tag{8}$$

where $\mathbf{G} = E\left[\frac{\partial^2}{\partial \eta \partial \eta^{\mathsf{T}}} \log\{p(z \mid \mathbf{u}; \eta\_0)\}\right]$, $E\left\|\frac{\partial^2}{\partial \eta \partial \eta^{\mathsf{T}}} \log\{p(z \mid \mathbf{u}; \eta\_0)\}\right\|^2 < \infty$, *η*<sub>0</sub> is the true value of *η*, and $\|\mathbf{M}\| = \sqrt{\operatorname{trace}(\mathbf{M}\mathbf{M}^{\mathsf{T}})}$ for a matrix **M**. With this prerequisite, we have the following result for $\widehat{\boldsymbol{\theta}}$; its proof is provided in Appendix A.

**Theorem 1.** *Assume (8) as well as* $E\left\|\frac{\partial^2 \phi\_{ij}(\boldsymbol{\theta}\_0, \eta\_0)}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}}\right\|^2 < \infty$*. Denote θ*<sub>0</sub> *the true value of θ. Then*
$$\sqrt{N}\left(\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}\_0\right) \xrightarrow{d} N\left(\mathbf{0}, \mathbf{A}^{-1}\boldsymbol{\Sigma}\mathbf{A}^{-1}\right),$$

$$\begin{split} \text{where } &\mathbf{A} = E\left\{ \frac{\partial^2 \phi\_{ij}(\boldsymbol{\theta}\_0, \eta\_0)}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \right\}, \quad \boldsymbol{\Sigma} = 4E\left\{ \boldsymbol{\lambda}\_{12}(\boldsymbol{\theta}\_0, \eta\_0)\, \boldsymbol{\lambda}\_{13}(\boldsymbol{\theta}\_0, \eta\_0)^{\mathsf{T}} \right\}, \quad \boldsymbol{\lambda}\_{ij}(\boldsymbol{\theta}\_0, \eta\_0) = \mathbf{B}\mathbf{G}^{-1}\mathbf{M}\_{ij}(\eta\_0) - \mathbf{N}\_{ij}(\boldsymbol{\theta}\_0, \eta\_0), \\ &\mathbf{B} = E\left\{ \frac{\partial^2 \phi\_{ij}(\boldsymbol{\theta}\_0, \eta\_0)}{\partial \boldsymbol{\theta} \partial \eta^{\mathsf{T}}} \right\}, \quad \mathbf{M}\_{ij}(\eta\_0) = \frac{1}{2}\left\{ \frac{\partial}{\partial \eta} \log p(z\_i \mid \mathbf{u}\_i; \eta\_0) + \frac{\partial}{\partial \eta} \log p(z\_j \mid \mathbf{u}\_j; \eta\_0) \right\}, \\ &\text{and } \mathbf{N}\_{ij}(\boldsymbol{\theta}\_0, \eta\_0) = \partial \phi\_{ij}(\boldsymbol{\theta}\_0, \eta\_0)/\partial \boldsymbol{\theta}. \end{split}$$

If one prefers the asymptotic result for $\widehat{\boldsymbol{\beta}}$, we have

**Corollary 1.** *Let* **C** *be a p* × (*p* + 1) *matrix such that* **C***θ* = *β, i.e.,*

$$\mathbf{C} = \begin{pmatrix} 0 & 1/\widetilde{\gamma}\_0 & 0 & \cdots & 0 \\ 0 & 0 & 1/\widetilde{\gamma}\_0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1/\widetilde{\gamma}\_0 \end{pmatrix}$$

*Denote β*<sub>0</sub> *the true value of β. Then, following Theorem 1, we have* $\sqrt{N}\left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\_0\right) \xrightarrow{d} N\left(\mathbf{0}, \mathbf{C}\mathbf{A}^{-1}\boldsymbol{\Sigma}\mathbf{A}^{-1}\mathbf{C}^{\mathsf{T}}\right)$*.*

**Variance Estimation**. With Theorem 1 and Corollary 1, the variance estimation is straightforward using the plug-in strategy. Note that $\operatorname{var}(\widehat{\boldsymbol{\theta}}) = \frac{1}{N}\mathbf{A}^{-1}\boldsymbol{\Sigma}\mathbf{A}^{-1}$; one therefore has the estimate $\widehat{\operatorname{var}}(\widehat{\boldsymbol{\theta}}) = \frac{1}{N}\widehat{\mathbf{A}}^{-1}\widehat{\boldsymbol{\Sigma}}\widehat{\mathbf{A}}^{-1}$, where

$$\widehat{\mathbf{A}} = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \frac{\partial^2 \phi\_{ij}(\widehat{\boldsymbol{\theta}}, \widehat{\eta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}}, \quad \widehat{\boldsymbol{\Sigma}} = \frac{4}{N-1} \sum\_{i=1}^{N} \left[ \frac{1}{N-1} \sum\_{j=1, j \ne i}^{N} \left\{ \widehat{\mathbf{B}}\widehat{\mathbf{G}}^{-1}\mathbf{M}\_{ij}(\widehat{\eta}) - \mathbf{N}\_{ij}(\widehat{\boldsymbol{\theta}}, \widehat{\eta}) \right\} \right]^{\otimes 2},$$

$$\widehat{\mathbf{B}} = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \frac{\partial^2 \phi\_{ij}(\widehat{\boldsymbol{\theta}}, \widehat{\eta})}{\partial \boldsymbol{\theta} \partial \eta^{\mathsf{T}}}, \quad \widehat{\mathbf{G}} = \frac{1}{N} \sum\_{i=1}^{N} \frac{\partial^2}{\partial \eta \partial \eta^{\mathsf{T}}} \log\{p(z\_i \mid \mathbf{u}\_i; \widehat{\eta})\}.$$
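The pieces of this plug-in formula can be assembled mechanically once the pairwise derivatives are available. The sketch below illustrates only the sandwich structure, under the simplifying assumption that *η* is known (i.e., it drops the $\widehat{\mathbf{B}}\widehat{\mathbf{G}}^{-1}\mathbf{M}\_{ij}$ correction term); it is a pedagogical reduction of the full estimator in the text, with our own function and argument names.

```python
import numpy as np

def sandwich_variance(pair_scores, pair_hessians, pair_index, N):
    """Simplified plug-in sandwich variance N^{-1} A^{-1} Sigma A^{-1} for the
    pairwise estimator, ignoring the correction for estimating eta.

    pair_scores   : (K, d) array of dphi_ij/dtheta at theta_hat
    pair_hessians : (K, d, d) array of second derivatives at theta_hat
    pair_index    : (K, 2) array of the (i, j) subject indices for each pair
    """
    K, d = pair_scores.shape
    A = pair_hessians.mean(axis=0)
    # Hajek-type projection: average the pairwise scores touching each subject
    proj = np.zeros((N, d))
    counts = np.zeros(N)
    for k in range(K):
        i, j = pair_index[k]
        proj[i] += pair_scores[k]; proj[j] += pair_scores[k]
        counts[i] += 1; counts[j] += 1
    proj /= np.maximum(counts, 1)[:, None]
    Sigma = 4 * (proj.T @ proj) / N
    Ainv = np.linalg.inv(A)
    return Ainv @ Sigma @ Ainv / N
```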

#### **4. Modern Setting with Regularization**

In the past few decades, it has become a standard practice to consider the high-dimensional regression model, where one assumes the parameter *β* is sparse and often uses the regularization technique to recover the sparsity. While it is a prominent problem to analyze this type of model when the data are prone to missing values, the literature is quite scarce primarily because it is cumbersome to rigorously address the missingness under high dimensionality. Therefore, it is valuable to extend the nuisance-free likelihood procedure proposed in Section 3 to the setting with regularization.

**Computation**. Regularization is a powerful technique to identify the zero elements of a sparse parameter in a regression model. Various penalty functions have been extensively studied, such as LASSO [20], SCAD [21], and MCP [22]. In particular, we study the adaptive LASSO penalty [23] with the objective of minimizing the following function

$$L\_{\lambda}(\boldsymbol{\theta}) = L(\boldsymbol{\theta}) + \sum\_{j=1}^{p} \lambda \left| \widehat{\widetilde{\beta}}\_{j} \right|^{-1} \left| \widetilde{\beta}\_{j} \right|,\tag{9}$$

where *λ* > 0 is the tuning parameter. Following [23], $\widehat{\widetilde{\beta}}\_j$ is a root-*N*-consistent estimator of $\widetilde{\beta}\_j$; for example, one can use the estimator obtained by minimizing the unregularized objective function (6). Obviously, the penalty term in (9) does not alter the numerical characteristics of *L*(*θ*) that we presented in Section 3: $L\_{\lambda}(\boldsymbol{\theta})$ is essentially the regularized log-likelihood of a logistic regression model of the same format as discussed in (7).

To choose the tuning parameter *λ*, one can follow either the cross-validation method or various information-based criteria. Fortunately, all of these approaches have been extensively studied in the literature. In this paper, we follow the Bayesian information criterion (BIC) to determine *λ*. Specifically, we choose *λ* to be the minimizer of the following BIC function

$$\text{BIC}(\lambda) = 2L(\widehat{\boldsymbol{\theta}}\_{\lambda}) + p\_{\lambda} \frac{\log(n)}{n},$$

where $p\_{\lambda}$ is the number of nonzero elements in $\widehat{\widetilde{\boldsymbol{\beta}}}\_{\lambda}$ and the minimizer of (9) is denoted $\widehat{\boldsymbol{\theta}}\_{\lambda} = (\widehat{\widetilde{\gamma}}\_{\lambda}, \widehat{\widetilde{\boldsymbol{\beta}}}\_{\lambda}^{\mathsf{T}})^{\mathsf{T}}$. We summarize the whole computation pipeline as Algorithm 2 below.

#### **Algorithm 2** Minimization of (9) with the ALASSO penalty

1: **Inputs:** $\{y\_i, \mathbf{u}\_i, z\_i\}$ and $\widehat{W}\_{ij}$, for *i* = 1, ..., *n* and *j* = 1, ..., *n*
2: **Initialize:** *k* ← 0
3: **for** *j* ∈ {2 : *n*} **do**
4: &emsp;**for** *i* ∈ {1 : (*j* − 1)} **do**
5: &emsp;&emsp;*k* ← *k* + 1
6: &emsp;&emsp;$y\_{i\backslash j} \leftarrow y\_i - y\_j$, $\mathbf{u}\_{i\backslash j} \leftarrow \mathbf{u}\_i - \mathbf{u}\_j$, $z\_{i\backslash j} \leftarrow z\_i - z\_j$, $\widehat{W}\_k \leftarrow \widehat{W}\_{ij}$
7: &emsp;&emsp;$g\_k \leftarrow I\{z\_{i\backslash j} > 0\}$
8: &emsp;&emsp;$s\_k \leftarrow -\operatorname{sign}(z\_{i\backslash j})$
9: &emsp;&emsp;$\mathbf{t}\_k \leftarrow (|z\_{i\backslash j}| y\_{i\backslash j}, -|z\_{i\backslash j}| \mathbf{u}\_{i\backslash j}^{\mathsf{T}})^{\mathsf{T}}$
10: Fit a logistic regression with response **g**, covariates **t**, offsets $s\_k \log \widehat{W}\_k$, no intercept, and the adaptive LASSO penalty in (9), with *λ* chosen by BIC.
11: **Outputs:** $\widehat{\boldsymbol{\theta}}\_{\lambda}$
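For the penalized fit, the weighted L1 problem can be solved by proximal gradient descent (ISTA) with soft-thresholding, and the BIC grid search sits on top. A self-contained numpy sketch, assuming the pairwise arrays **t**, **g** and the offsets from the algorithms above have already been built; `fit_alasso` and `select_lambda` are our own illustrative helpers, not the authors' code.

```python
import numpy as np

def fit_alasso(T, G, O, theta_init, lam, n_iter=500):
    """Minimize the offset logistic loss plus the adaptive-LASSO penalty
    lam * |beta_j| / |beta_init_j| via proximal gradient descent.
    The first coordinate (gamma_tilde) is left unpenalized."""
    K, d = T.shape
    w = np.zeros(d)
    w[1:] = lam / np.maximum(np.abs(theta_init[1:]), 1e-10)  # adaptive weights
    step = 4.0 * K / np.linalg.norm(T, 2) ** 2               # 1/L for the loss
    theta = theta_init.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(T @ theta + O)))
        theta = theta - step * (T.T @ (p - G) / K)           # gradient step
        # soft-thresholding: proximal step for the weighted L1 penalty
        theta = np.sign(theta) * np.maximum(np.abs(theta) - step * w, 0.0)
    return theta

def select_lambda(T, G, O, theta_init, grid, n):
    """Pick lambda minimizing BIC(lam) = 2 L(theta_lam) + p_lam log(n)/n."""
    best, best_bic = None, np.inf
    for lam in grid:
        th = fit_alasso(T, G, O, theta_init, lam)
        eta = T @ th + O
        # numerically stable mean logistic negative log-likelihood
        L = np.mean(np.log1p(np.exp(-np.abs(eta))) + np.maximum(eta, 0) - G * eta)
        bic = 2 * L + np.count_nonzero(th[1:]) * np.log(n) / n
        if bic < best_bic:
            best, best_bic = th, bic
    return best
```

A natural choice for `theta_init` is the unregularized fit from Algorithm 1, which also supplies the adaptive weights.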
**Asymptotic Theory**. Recall that $\boldsymbol{\theta} = (\widetilde{\gamma}, \widetilde{\boldsymbol{\beta}}^{\mathsf{T}})^{\mathsf{T}}$. Without loss of generality, we assume the first $p\_0$ parameters in $\widetilde{\boldsymbol{\beta}}$ are nonzero, where $1 \le p\_0 < p$. For simplicity, we denote $\boldsymbol{\theta}\_T = (\widetilde{\gamma}, \widetilde{\beta}\_1, \dots, \widetilde{\beta}\_{p\_0})^{\mathsf{T}}$ as the vector of nonzero components and $\boldsymbol{\theta}\_{T^c} = (\widetilde{\beta}\_{p\_0+1}, \dots, \widetilde{\beta}\_p)^{\mathsf{T}}$ as the vector of zeros.

In Theorem 1, we defined $\mathbf{A} = E\left\{\frac{\partial^2 \phi\_{ij}(\boldsymbol{\theta}\_0, \eta\_0)}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}}\right\}$, a $(p+1) \times (p+1)$ matrix. Now we assume it can be partitioned as
$$\mathbf{A} = \begin{pmatrix} \mathbf{A}\_1 & \mathbf{A}\_2 \\ \mathbf{A}\_2^{\mathsf{T}} & \mathbf{A}\_3 \end{pmatrix},$$
where $\mathbf{A}\_1$ is the $(p\_0+1) \times (p\_0+1)$ submatrix corresponding to $\boldsymbol{\theta}\_T$. Similarly, we defined $\boldsymbol{\Sigma} = 4E\left\{\boldsymbol{\lambda}\_{12}(\boldsymbol{\theta}\_0, \eta\_0)\boldsymbol{\lambda}\_{13}(\boldsymbol{\theta}\_0, \eta\_0)^{\mathsf{T}}\right\}$, and we also assume it can be partitioned as
$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}\_1 & \boldsymbol{\Sigma}\_2 \\ \boldsymbol{\Sigma}\_2^{\mathsf{T}} & \boldsymbol{\Sigma}\_3 \end{pmatrix},$$
where $\boldsymbol{\Sigma}\_1$ is the $(p\_0+1) \times (p\_0+1)$ submatrix corresponding to $\boldsymbol{\theta}\_T$ as well. We denote the minimizer of (9) as $\widehat{\boldsymbol{\theta}}\_{\lambda} = (\widehat{\boldsymbol{\theta}}\_{\lambda,T}^{\mathsf{T}}, \widehat{\boldsymbol{\theta}}\_{\lambda,T^c}^{\mathsf{T}})^{\mathsf{T}}$ and its true value as $\boldsymbol{\theta}\_0 = (\boldsymbol{\theta}\_{0,T}^{\mathsf{T}}, \boldsymbol{\theta}\_{0,T^c}^{\mathsf{T}})^{\mathsf{T}}$.

Now, we present the oracle property pertaining to $\widehat{\boldsymbol{\theta}}\_{\lambda}$, which includes the asymptotic normality of the nonzero components and the variable selection consistency. The proof is provided in Appendix B.

**Theorem 2.** *Assume (8), that* $\mathbf{A}\_1$ *is positive definite, and that* $E\left\|\frac{\partial \phi\_{ij}(\boldsymbol{\theta}, \eta\_0)}{\partial \boldsymbol{\theta}}\right\|^2 < \infty$ *for each θ in a neighborhood of θ*<sub>0</sub>*. We also assume* $\sqrt{N}\lambda \to 0$ *and* $N\lambda \to \infty$*. Then,*

$$\sqrt{N}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T} - \boldsymbol{\theta}\_{0,T}\right) \xrightarrow{d} N\left(\mathbf{0}, \mathbf{A}\_1^{-1} \boldsymbol{\Sigma}\_1 \mathbf{A}\_1^{-1}\right).$$

*In addition, let* $T\_N = \{j \in \{1, \dots, p\} : \widehat{\widetilde{\beta}}\_{j,\lambda} \ne 0\}$ *and* $T = \{j \in \{1, \dots, p\} : \widetilde{\beta}\_{j,0} \ne 0\}$*; then* $\lim\_{N \to \infty} \operatorname{pr}(T\_N = T) = 1$*.*

**Variance Estimation**. Although the above theory provides a rigorous justification for the asymptotic property of $\widehat{\boldsymbol{\theta}}\_{\lambda}$, in practice it does not guide the standard error estimation. Here, we propose a data perturbation approach for the variance estimation. Specifically, following [24], we generate a set of independent and identically distributed positive random variables $\boldsymbol{\Xi} = \{\xi\_i, i = 1, \dots, N\}$ with $E(\xi\_i) = 1$ and $\operatorname{var}(\xi\_i) = 1$, e.g., from the standard exponential distribution. Since our objective function has a U-statistic structure, we perturb it by attaching the weight $\kappa\_{ij} = \xi\_i \xi\_j$ to each of its pairwise terms. We first obtain the estimator $\widehat{\boldsymbol{\theta}}^{\star}$ by minimizing the perturbed version of (6):

$$L^\star(\boldsymbol{\theta}) = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \kappa\_{ij} \phi\_{ij}(\boldsymbol{\theta}).$$

Then, we obtain the estimator $\widehat{\boldsymbol{\theta}}\_{\lambda}^{\star}$ by minimizing the perturbed version of (9):

$$L\_{\lambda}^{\star}(\boldsymbol{\theta}) = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \kappa\_{ij} \phi\_{ij}(\boldsymbol{\theta}) + \sum\_{j=1}^{p} \frac{\lambda}{\left| \widehat{\widetilde{\beta}}\_{j}^{\star} \right|} \left| \widetilde{\beta}\_{j} \right|,$$

where the optimal *λ* is also computed by the BIC. We repeat this data perturbation scheme a large number of times, say, *M*.

Following the theory in [25,26], under some regularity conditions, one can first show that $\sqrt{N}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T}^{\star} - \boldsymbol{\theta}\_{0,T}\right)$ converges in distribution to $N(\mathbf{0}, \mathbf{A}\_1^{-1}\boldsymbol{\Sigma}\_1\mathbf{A}\_1^{-1})$, the same limiting distribution as that of $\sqrt{N}\left(\widehat{\boldsymbol{\theta}}\_{\lambda} - \boldsymbol{\theta}\_0\right)$. Furthermore, one can also show $\operatorname{pr}^{*}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T^c}^{\star} = \mathbf{0}\right) \to 1$, where $\operatorname{pr}^{*}$ is the probability measure generated by the original data $\mathcal{X}$ and the perturbation data **Ξ**. In addition, one can show that the distribution of $\sqrt{N}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T}^{\star} - \widehat{\boldsymbol{\theta}}\_{\lambda,T}\right)$ conditional on the data can be used to approximate the unconditional distribution of $\sqrt{N}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T} - \boldsymbol{\theta}\_{0,T}\right)$, and that $\operatorname{pr}^{*}\left(\widehat{\boldsymbol{\theta}}\_{\lambda,T^c}^{\star} = \mathbf{0} \mid \mathcal{X}\right) \to 1$.

To construct a confidence interval for $\theta_j$, the $j$-th coordinate of $\boldsymbol{\theta}$, the lower and upper bounds can be taken as $\widehat{\theta}^{\star}_{\lambda,j,\alpha/2}$ and $\widehat{\theta}^{\star}_{\lambda,j,1-\alpha/2}$, respectively, where $\widehat{\theta}^{\star}_{\lambda,j,q}$ denotes the $q$-th quantile of $\left\{\widehat{\theta}^{\star}_{\lambda,j,m}, m = 1, \ldots, M\right\}$.
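In implementation, this perturbation scheme amounts to a simple loop. The sketch below is a minimal illustration under our assumptions: `fit_penalized` is a hypothetical stand-in for the minimizer of the perturbed objective (9) with BIC-tuned $\lambda$ (it is not the authors' routine), and the percentile bounds follow the quantile rule just described.

```python
import numpy as np

def perturbation_ci(data, fit_penalized, M=500, alpha=0.05, seed=0):
    """Percentile confidence intervals from M data perturbations.

    Each replicate draws i.i.d. xi_i from the standard exponential
    distribution (mean 1, variance 1); `fit_penalized` is assumed to
    accept the per-subject multipliers xi, weight every pairwise term
    of the U-statistic objective by kappa_ij = xi_i * xi_j internally,
    and return the fitted theta vector.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    estimates = []
    for _ in range(M):
        xi = rng.exponential(scale=1.0, size=N)  # E(xi)=1, var(xi)=1
        estimates.append(fit_penalized(data, xi))
    estimates = np.asarray(estimates)            # shape (M, dim of theta)
    lower = np.quantile(estimates, alpha / 2, axis=0)
    upper = np.quantile(estimates, 1 - alpha / 2, axis=0)
    se = estimates.std(axis=0, ddof=1)           # perturbation standard error
    return lower, upper, se
```

The same loop also yields the standard error estimate as the sample standard deviation of the $M$ perturbed estimates.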

#### **5. Simulation Studies**

We conduct comprehensive simulation studies to evaluate the finite sample performance of our proposed estimators and also compare with some currently existing methods. We first present the results under the model without regularization, then with regularization.

#### *5.1. Scenarios without Regularization*

For the proposed estimator studied in Section 3, we generate $\{R_i, Y_i, \mathbf{U}_i^{\mathrm{T}}, Z_i\}$, $i = 1, \ldots, N$, independent and identically distributed copies of $(R, Y, \mathbf{U}^{\mathrm{T}}, Z)$, as follows. We first generate the random vector $\mathbf{U} = (U_1, \ldots, U_p)^{\mathrm{T}}$ with $U_i \sim N(0.5, 1)$ and $p = 4$, and then generate $Z = \alpha_z + \boldsymbol{\eta}^{\mathrm{T}}\mathbf{U} + \epsilon_z$ with $\alpha_z = 0.5$, $\boldsymbol{\eta} = (-0.5, 1, -1, 1.5)^{\mathrm{T}}$, and $\epsilon_z \sim N(0, 1)$. Afterwards, the outcome variable $Y$ is generated following model (1) with $\alpha = -1$, $\boldsymbol{\beta} = (-0.5, 1, -1, 1.5)^{\mathrm{T}}$, $\gamma = 0.5$, and $\epsilon \sim N(0, 1)$, and the missingness indicator $R$ is generated following $\mathrm{pr}(R = 1 \mid Y, \mathbf{U}) = I(Y < 2.5, U_1 < 2, U_2 < 2, U_3 < 2, U_4 < 2)$, which results in around 40% missing values. We examine two situations with sample sizes $N = 500$ and $N = 1000$, respectively. Besides the estimator studied in Section 3 (Proposed), we also implement the estimator using all simulated data (FullData) and the estimator using completely observed subjects only (CC). Based on 1000 simulation replicates, for each of the three estimators, we summarize the sample bias, sample standard deviation, estimated standard error, and coverage probability of the 95% confidence intervals in Table 1.
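For concreteness, this data-generating process can be scripted as below. This is a sketch under our assumptions: model (1) is taken to have the linear form $Y = \alpha + \boldsymbol{\beta}^{\mathrm{T}}\mathbf{U} + \gamma Z + \epsilon$ (our reading, since model (1) is not restated in this section), and the function name is ours.

```python
import numpy as np

def generate_section_5_1(N, seed=0):
    """Simulate one replicate of the Section 5.1 design (MNAR indicator).

    Assumes model (1) has the linear form Y = alpha + beta'U + gamma*Z + eps.
    Returns (R, Y, U, Z); Y is set to NaN where R = 0.
    """
    rng = np.random.default_rng(seed)
    p = 4
    U = rng.normal(loc=0.5, scale=1.0, size=(N, p))
    eta = np.array([-0.5, 1.0, -1.0, 1.5])
    Z = 0.5 + U @ eta + rng.normal(size=N)                 # alpha_z = 0.5
    beta = np.array([-0.5, 1.0, -1.0, 1.5])
    Y = -1.0 + U @ beta + 0.5 * Z + rng.normal(size=N)     # alpha = -1, gamma = 0.5
    # deterministic MNAR rule: observed only when Y and every U_j fall below thresholds
    R = ((Y < 2.5) & (U < 2).all(axis=1)).astype(int)
    Y_obs = np.where(R == 1, Y, np.nan)
    return R, Y_obs, U, Z
```

With these parameter values, the simulated missingness rate comes out near the 40% reported in the text.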


**Table 1.** In Section 5.1, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of 95% confidence interval of the estimator of FullData (using all simulated data), CC (using only completely observed subjects), and of the proposed estimator studied in Section 3.

Furthermore, we consider a similar simulation setting where the data generation is the same as above except for a logistic missingness mechanism model with $\mathrm{logit}\{\mathrm{pr}(R = 1 \mid Y, \mathbf{U})\} = 3 - 2Y + 0.5U_1 - U_2 + U_3 - 1.5U_4$, which also results in around 40% missing values. We replicate the analysis; the results are shown in Table 2.

We can reach the following conclusions from Tables 1 and 2. For the estimator Proposed, although its bias is slightly larger than that of the benchmark FullData, it is still very close to zero. The sample standard deviation and the estimated standard error are rather close to each other. The sample coverage probability of the estimated 95% confidence interval is also very close to the nominal level. This observation matches our theoretical justification in Theorem 1 well. On the contrary, the estimator CC is clearly biased, resulting in empirical coverage far from the nominal level, and therefore is not recommended for use in practice. It is also clear that, compared to the benchmark FullData, the estimator Proposed incurs some loss of estimation efficiency. This is because the proposed method uses the conditional likelihood approach, which completely eliminates the effect of the nuisance.


**Table 2.** In Section 5.1, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), and coverage probability (CP) of 95% confidence interval of the estimator of FullData (using all simulated data), CC (using only completely observed subjects), and of the proposed estimator studied in Section 3, with a logistic missingness mechanism model.

#### *5.2. Scenarios with Regularization*

For the estimator studied in Section 4, the independent and identically distributed samples are generated as follows. The variable $\mathbf{U} = (U_1, \ldots, U_p)^{\mathrm{T}}$ is generated from $\mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_u)$ with $\boldsymbol{\Sigma}_u = (0.5^{|i-j|})_{1 \le i,j \le p}$ and $p = 8$. Then, the shadow variable $Z$ is generated following $Z = \alpha_z + \boldsymbol{\eta}^{\mathrm{T}}\mathbf{U} + \epsilon_z$ with $\alpha_z = 0$, $\boldsymbol{\eta} = (-0.5, 0.5, -1, 1, -0.5, 0.5, -1, 1)^{\mathrm{T}}$, and $\epsilon_z \sim N(0, 1)$. The outcome variable $Y$ is generated from model (1) with $\alpha = 0$, $\boldsymbol{\beta} = (3, 1.5, 0, 0, 2, 0, 0, 0)^{\mathrm{T}}$, $\gamma = 3$, $\epsilon \sim N(0, \sigma^2)$, and $\sigma = 3$. The distribution of the missingness indicator follows from $\mathrm{logit}\{\mathrm{pr}(R = 1 \mid Y, \mathbf{U})\} = 5 + 5Y + 0.2U_1 + 0.2U_7$, which results in about 45% missing values. Similar to Section 5.1, we also examine two situations with sample sizes $N = 500$ and $N = 1000$, respectively, and we implement the three estimators FullData, CC, and Proposed. When the estimator Proposed is implemented, we perform $M = 500$ perturbations in order to obtain the confidence intervals for the unknown parameters. The results summarized below are based on 1000 simulation replicates.
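This design can likewise be scripted; the sketch below again assumes the linear form of model (1) stated earlier (our reading), and uses a logistic latent-noise representation of the missingness model. The function name is ours.

```python
import numpy as np

def generate_section_5_2(N, seed=0):
    """Simulate one replicate of the Section 5.2 design (logistic missingness).

    Assumes model (1) has the linear form Y = alpha + beta'U + gamma*Z + eps.
    """
    rng = np.random.default_rng(seed)
    p = 8
    # AR(1)-type covariance: Sigma_u[i, j] = 0.5 ** |i - j|
    idx = np.arange(p)
    Sigma_u = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    U = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma_u, size=N)
    eta = np.array([-0.5, 0.5, -1.0, 1.0, -0.5, 0.5, -1.0, 1.0])
    Z = U @ eta + rng.normal(size=N)                        # alpha_z = 0
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    Y = U @ beta + 3.0 * Z + rng.normal(scale=3.0, size=N)  # gamma = 3, sigma = 3
    # logit pr(R=1 | Y, U) = 5 + 5Y + 0.2*U1 + 0.2*U7;
    # R = 1 iff a standard logistic noise falls below the linear predictor
    lin = 5 + 5 * Y + 0.2 * U[:, 0] + 0.2 * U[:, 6]
    R = (rng.logistic(size=N) < lin).astype(int)
    return R, Y, U, Z
```

Drawing $R$ by comparing standard logistic noise to the linear predictor avoids evaluating the sigmoid at extreme arguments, which would otherwise overflow here because the linear predictor has a large variance.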

Figure 1 shows the $L_1$, $L_2$, and $L_{\infty}$ norms of the bias for the three different estimators. As the sample size increases, the estimation bias decreases for every method. It is also clear that the bias of the Proposed estimator is larger than that of the benchmark FullData, but much smaller than that of the method CC.

**Figure 1.** In Section 5.2, *L*<sup>1</sup> (1st column), *L*<sup>2</sup> (2nd column), and *L*<sup>∞</sup> (3rd column) norms of the estimation bias of the estimator of FullData (using all simulated data), CC (using only completely observed subjects), and of the proposed estimator studied in Section 4.


We present the statistical inference results in Table 3 for $N = 500$ and Table 4 for $N = 1000$, respectively, including the sample bias, sample standard deviation, estimated standard error, coverage probability, and length of the 95% confidence interval for the three different methods. For the nonzero $\beta$'s as well as $\widetilde{\gamma}$, similar to Section 5.1, the method CC clearly produces coverage probabilities far from the nominal level and hence is not reliable. For the method Proposed, its estimation bias is quite close to zero, and its sample standard deviation and estimated standard error are quite close to each other. The coverage probability of the confidence interval converges to the nominal level 95% as the sample size grows. For the noise (zero) $\beta$'s, the coverage probabilities of the three methods are all close to 1, reflecting the variable selection consistency in the oracle property, even for the CC method. Furthermore, a very nice finite sample property of our proposed estimator is that it produces the confidence interval with the shortest length, which can be clearly seen from both Tables 3 and 4.


**Table 3.** In Section 5.2, with sample size *N* = 500, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), coverage probability (CP), and length (Length) of 95% confidence interval of the estimator of FullData (using all simulated data), CC (using only completely observed subjects) and of the proposed estimator studied in Section 4.




**Table 4.** In Section 5.2, with sample size *N* = 1000, sample bias (Bias), sample standard deviation (SD), estimated standard error (SE), coverage probability (CP), and length (Length) of 95% confidence interval of the estimator of FullData (using all simulated data), CC (using only completely observed subjects) and of the proposed estimator studied in Section 4.

#### **6. Real Data Application**

The Medical Information Mart for Intensive Care III (MIMIC-III) is an openly available electronic health records (EHR) database, developed by the MIT Lab for Computational Physiology [13], comprising de-identified health-related data associated with intensive care unit patients, with rich information including demographics, vital signs, laboratory tests, medications, and more.

Our initial motivation for this data analysis is to understand the missingness mechanism for some laboratory test biomarkers in this EHR system. Since EHR data are collected in a non-prescheduled fashion, i.e., only when the patient seeks care or the physician orders care, the visiting process could be potentially informative about the patients' risk categories. Therefore, it is very plausible that the data are missing not at random, or a mix of missing not at random and missing at random [27,28]. During a preliminary data cleaning process, we observed an interesting phenomenon: compared to most biomarkers, which usually have <3% missing values, the albumin level in the blood sample, a very indicative biomarker associated with different types of diseases [29], has around 30% missingness.

To further understand this phenomenon, we concentrate on a subset of the data with sample size $N = 1359$, in which 421 samples have missing values in the albumin level but all other variables are complete. We aim to apply the proposed method to the study of the albumin level ($Y$). The calcium level in the blood sample, which is free of missing data, has been shown in the biomedical literature to be highly correlated with the albumin level [30–32]; therefore, we adopt the calcium level as the shadow variable $Z$. Seventeen other variables comprise the vector $\mathbf{U}$: demographics (age and gender), chart events (respiratory rate, glucose, heart rate, systolic blood pressure, diastolic blood pressure, and temperature), other laboratory tests (urea nitrogen, platelets, magnesium, hematocrit, red blood cell, white blood cell, and peripheral capillary oxygen saturation (SpO2)), and aggregated metrics (simplified acute physiology score (SAPS-II) and sequential organ failure assessment score (SOFA)).

We implement the proposed estimator studied in Section 4 to achieve both variable selection and post-selection inference. We also compare it with the CC method which naively fits the regularized linear regression with the ALASSO penalty. For each of the methods, we apply the data perturbation scheme presented in Section 4 with *M* = 500 for standard error estimation. The results are summarized in Table 5. The solution path of the Proposed method, as the tuning parameter *λ* varies, is also provided in Figure 2.


**Table 5.** In Section 6, the parameter estimate (Estimate), standard error (SE), and confidence interval (CI) of the estimator of CC (using only completely observed subjects) and of the proposed estimator studied in Section 4 in the MIMIC-III study.

**Figure 2.** In Section 6, as the tuning parameter $\lambda$ varies, the solution path of the proposed estimator in the MIMIC-III study. The optimal $\lambda$, $\lambda^{*}$, equals 1.0030, and $\log \lambda^{*} = 0.0030$.

In general, both methods achieve the goal of variable selection and post-selection inference by leveraging the regularization technique coupled with the data perturbation strategy, and identify many variables as noise with zero coefficients. In particular, the Proposed method provides larger effects for the calcium level (the shadow variable) and the red blood cell count, whereas a smaller effect for the aggregated SOFA score. The Proposed method identifies the body temperature and the white blood cell count as nonsignificant variables, which are selected as nonzero but with very small effects by the CC method. It is also worth mentioning that the Proposed method assigns a quite significant coefficient to the magnesium level, whose relationship with the albumin level has been extensively investigated in the scientific literature [33–35].

#### **7. Discussion**

In this paper, we provide a systematic approach for parameter estimation and statistical inference, in both the traditional linear model where regularization is not needed and the modern regularized regression setting, when the outcome variable is prone to missing values and the missingness mechanism can be arbitrarily flexible. A pivotal ingredient of our procedure is the shadow variable $Z$, which overcomes the model identifiability problem and enables the nuisance-free conditional likelihood procedure.

Certainly, any method has its own limitations and could potentially be improved. One needs a model $p(z \mid \mathbf{u})$ to implement the proposed estimator in Sections 3 and 4. As its modeling does not involve any missing data, we simply use parametric maximum likelihood estimation in our algorithm as well as in the theoretical justification. Indeed, any statistical or machine learning method can be used for modeling $p(z \mid \mathbf{u})$. For instance, one could consider a semiparametric model [36], e.g.,

$$p(z \mid \mathbf{u}; \boldsymbol{\eta}, F) = \frac{\exp(\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}\, z) f(z)}{\int \exp(\boldsymbol{\eta}^{\mathrm{T}} \mathbf{u}\, z)\, dF(z)},$$

where $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_p)^{\mathrm{T}}$ is a vector of unknown parameters and $f(z)$ is the density of an unknown baseline distribution function $F$ with respect to some dominating measure $\nu$. With this model fitted, $W_{ij}$ can be simplified to $W_{ij} = \exp(-z_{i\backslash j}\,\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}_{i\backslash j})$. Therefore, a similar conditional likelihood approach can be used to estimate $\boldsymbol{\eta}$ without estimating the nonparametric component $f(z)$.
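As a concrete illustration, under this exponential tilt model the baseline $f(z)$ cancels in every pairwise ratio, so the weights depend on $\boldsymbol{\eta}$ only. The sketch below computes all $W_{ij}$ at once, under our reading that $z_{i\backslash j} = z_i - z_j$ and $\mathbf{u}_{i\backslash j} = \mathbf{u}_i - \mathbf{u}_j$ denote pairwise differences; the function name is ours and this is not the authors' code.

```python
import numpy as np

def pairwise_weights(Z, U, eta):
    """Pairwise weights W_ij = exp(-(z_i - z_j) * eta'(u_i - u_j)).

    Under the exponential tilt model p(z | u) proportional to
    exp(eta'u z) f(z), the unknown baseline f(z) cancels in every
    pairwise ratio, so W depends on eta only.
    """
    s = U @ eta                       # eta'u_i for each subject
    # outer differences give z_i - z_j and eta'(u_i - u_j) for all pairs
    W = np.exp(-(Z[:, None] - Z[None, :]) * (s[:, None] - s[None, :]))
    return W
```

Note that $W_{ij} = W_{ji}$ (both differences flip sign, so their product is unchanged) and $W_{ii} = 1$.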

**Author Contributions:** Conceptualization, J.Z.; Experiment, J.Z. and C.C.; Writing, J.Z. and C.C.; Supervision, J.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** Jiwei Zhao is supported by the National Science Foundation award 1953526.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Theorem 1**

**Proof.** Note that $\widehat{\boldsymbol{\theta}}$ is obtained by solving the estimating equation $\partial L(\widehat{\boldsymbol{\theta}}, \widehat{\boldsymbol{\eta}})/\partial\boldsymbol{\theta} = \mathbf{0}$, which is equivalent to

$$\left\{\frac{\partial L(\hat{\theta},\hat{\eta})}{\partial\theta} - \frac{\partial L(\theta\_0,\hat{\eta})}{\partial\theta}\right\} + \left\{\frac{\partial L(\theta\_0,\hat{\eta})}{\partial\theta} - \frac{\partial L(\theta\_0,\eta\_0)}{\partial\theta}\right\} + \frac{\partial L(\theta\_0,\eta\_0)}{\partial\theta} = 0. \tag{A1}$$

Specifically,

$$\frac{\partial L(\hat{\theta}, \hat{\eta})}{\partial \theta} - \frac{\partial L(\theta\_0, \hat{\eta})}{\partial \theta} = \frac{\partial^2 L(\theta\_0, \hat{\eta})}{\partial \theta \partial \theta^T} \left(\hat{\theta} - \theta\_0\right) + o\_p\left(N^{-\frac{1}{2}}\right),\tag{A2}$$

by Taylor expansion. Similarly,

$$\frac{\partial L(\boldsymbol{\theta}\_{0},\widehat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}} - \frac{\partial L(\boldsymbol{\theta}\_{0},\boldsymbol{\eta}\_{0})}{\partial \boldsymbol{\theta}} = \frac{\partial^{2} L(\boldsymbol{\theta}\_{0},\boldsymbol{\eta}\_{0})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\eta}^{\mathrm{T}}} \left(\widehat{\boldsymbol{\eta}} - \boldsymbol{\eta}\_{0}\right) + o\_{p}\left(\boldsymbol{N}^{-\frac{1}{2}}\right). \tag{A3}$$

Plugging (A2) and (A3) into (A1), we obtain the following equation,

$$
\sqrt{N}\frac{\partial^2 L(\boldsymbol{\theta}\_0, \boldsymbol{\hat{\eta}})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top} \left(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\_0\right) + \sqrt{N}\frac{\partial^2 L(\boldsymbol{\theta}\_0, \boldsymbol{\eta}\_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\eta}^\top} \left(\hat{\boldsymbol{\eta}} - \boldsymbol{\eta}\_0\right) + \sqrt{N}\frac{\partial L(\boldsymbol{\theta}\_0, \boldsymbol{\eta}\_0)}{\partial\boldsymbol{\theta}} + o\_p(1) = 0. \tag{A4}
$$

As $\sqrt{N}(\widehat{\boldsymbol{\eta}} - \boldsymbol{\eta}_0) = -\mathbf{G}^{-1}\sqrt{N}\,\frac{1}{N}\sum_{i=1}^{N}\frac{\partial}{\partial\boldsymbol{\eta}}\log\{p(z_i \mid \mathbf{u}_i; \boldsymbol{\eta}_0)\} + o_p(1)$ from the asymptotic property of $\widehat{\boldsymbol{\eta}}$, (A4) is equivalent to

$$\begin{split} &\sqrt{N}\frac{\partial^2 L(\boldsymbol{\theta}\_0, \boldsymbol{\hat{\eta}})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} \left(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\_0\right) + \frac{\partial^2 L(\boldsymbol{\theta}\_0, \boldsymbol{\eta}\_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\eta}^T} \left[ -\mathbf{G}^{-1}\sqrt{N}\frac{1}{N} \sum\_{i=1}^N \frac{\partial}{\partial\boldsymbol{\eta}} \log \left\{ p(\boldsymbol{z}\_i \mid \mathbf{u}\_i; \boldsymbol{\eta}\_0) \right\} \right] \\ &+ \quad \sqrt{N}\frac{\partial L(\boldsymbol{\theta}\_0, \boldsymbol{\eta}\_0)}{\partial\boldsymbol{\theta}} + o\_p(1) = 0. \end{split}$$

Thus,

$$\begin{split} &\quad\sqrt{N}\left(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}\_{0}\right) \\ &=&-\left\{\frac{\partial^{2}L(\boldsymbol{\theta}\_{0},\boldsymbol{\hat{\eta}})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\top}}\right\}^{-1}\times\left\{\frac{\partial^{2}L(\boldsymbol{\theta}\_{0},\boldsymbol{\eta}\_{0})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\eta}^{\top}}\left[-\mathbf{G}^{-1}\sqrt{N}\frac{1}{N}\sum\_{i=1}^{N}\frac{\partial}{\partial\boldsymbol{\eta}}\log\left\{p(\boldsymbol{z}\_{i}\mid\mathbf{u}\_{i};\boldsymbol{\eta}\_{0})\right\}\right]+\sqrt{N}\frac{\partial L(\boldsymbol{\theta}\_{0},\boldsymbol{\eta}\_{0})}{\partial\boldsymbol{\theta}}\right\} \\ &\quad+o\_{p}(1) \\ &=&-\mathbf{A}^{-1}\left\{\mathbf{B}\left[-\mathbf{G}^{-1}\sqrt{N}\frac{1}{N}\sum\_{i=1}^{N}\frac{\partial}{\partial\boldsymbol{\eta}}\log\left\{p(\boldsymbol{z}\_{i}\mid\mathbf{u}\_{i};\boldsymbol{\eta}\_{0})\right\}\right]+\sqrt{N}\frac{\partial L(\boldsymbol{\theta}\_{0},\boldsymbol{\eta}\_{0})}{\partial\boldsymbol{\theta}}\right\}+o\_{p}(1),\end{split}\tag{A5}$$

where $\frac{\partial^2 L(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\mathrm{T}}} \xrightarrow{p} \mathbf{A} = \mathrm{E}\left\{\frac{\partial^2 \phi_{ij}(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\mathrm{T}}}\right\}$, and $\frac{\partial^2 L(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\eta}^{\mathrm{T}}} \xrightarrow{p} \mathbf{B} = \mathrm{E}\left\{\frac{\partial^2 \phi_{ij}(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial\boldsymbol{\theta}\partial\boldsymbol{\eta}^{\mathrm{T}}}\right\}$. In addition, we need to form a projection of $\frac{1}{N}\sum_{i=1}^{N}\frac{\partial}{\partial\boldsymbol{\eta}}\log\{p(z_i \mid \mathbf{u}_i; \boldsymbol{\eta}_0)\}$ in (A5) through

$$\frac{1}{N} \sum\_{i=1}^{N} \frac{\partial}{\partial \eta} \log \left\{ p(z\_i \mid \mathbf{u}\_i; \eta\_0) \right\} = \binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \frac{1}{2} \left[ \frac{\partial}{\partial \eta} \log \left\{ p(z\_i \mid \mathbf{u}\_i; \eta\_0) \right\} + \frac{\partial}{\partial \eta} \log \left\{ p(z\_j \mid \mathbf{u}\_j; \eta\_0) \right\} \right],$$

and

$$\frac{\partial L(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial \boldsymbol{\theta}} = \binom{N}{2}^{-1} \sum_{1 \le i < j \le N} \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial \boldsymbol{\theta}}.$$

To sum up, (A5) can be written as

$$\sqrt{N}\left(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\_{0}\right) = \mathbf{A}^{-1}\sqrt{N}\binom{N}{2}^{-1} \sum\_{1 \le i < j \le N} \left\{ \mathbf{BG}^{-1}\mathbf{M}\_{ij}(\boldsymbol{\eta}\_{0}) - \mathbf{N}\_{ij}(\boldsymbol{\theta}\_{0}, \boldsymbol{\eta}\_{0}) \right\} + o\_p(1),$$

where $\mathbf{M}_{ij}(\boldsymbol{\eta}_0) = \frac{1}{2}\left[\frac{\partial}{\partial\boldsymbol{\eta}}\log\{p(z_i \mid \mathbf{u}_i; \boldsymbol{\eta}_0)\} + \frac{\partial}{\partial\boldsymbol{\eta}}\log\{p(z_j \mid \mathbf{u}_j; \boldsymbol{\eta}_0)\}\right]$ and $\mathbf{N}_{ij}(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0) = \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \boldsymbol{\eta}_0)}{\partial\boldsymbol{\theta}}$.

#### **Appendix B. Proof of Theorem 2**

**Proof.** Define function

$$q_{ij}(\boldsymbol{\theta}) = \phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right) - \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}}) - \left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}} \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}} = O_p\left(\frac{1}{N}\right), \tag{A6}$$

and we can form a U-statistic based on *qij*(*θ*) as

$$\begin{split} Q\_N(\boldsymbol{\theta}) &= \frac{2}{N(N-1)} \sum\_{1 \le i < j \le N} q\_{ij}(\boldsymbol{\theta}) \\ &= L\left(\boldsymbol{\theta}\_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}\right) - L(\boldsymbol{\theta}\_0) - \frac{1}{\sqrt{N}} \cdot \frac{2}{N(N-1)} \boldsymbol{\theta}^\top \sum\_{1 \le i < j \le N} \frac{\partial \phi\_{ij}(\boldsymbol{\theta}\_0, \hat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}}. \end{split}$$

The variance of $Q_N(\boldsymbol{\theta})$ is bounded by $\mathrm{var}\{Q_N(\boldsymbol{\theta})\} \le \frac{2}{N}\mathrm{var}\{q_{ij}(\boldsymbol{\theta})\}$, from Corollary 3.2 of [37]. Meanwhile, $\frac{2}{N}\mathrm{var}\{q_{ij}(\boldsymbol{\theta})\} = \frac{2}{N}\left[\mathrm{E}\{q_{ij}(\boldsymbol{\theta})^2\} - \mathrm{E}\{q_{ij}(\boldsymbol{\theta})\}^2\right] \le \frac{2}{N}\mathrm{E}\{q_{ij}(\boldsymbol{\theta})^2\}$, as $\mathrm{E}\{q_{ij}(\boldsymbol{\theta})\}^2 \ge 0$. As $\phi_{ij}(\boldsymbol{\theta}, \widehat{\boldsymbol{\eta}})$ is convex and differentiable at $\boldsymbol{\theta}_0$, we can conclude

$$\phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right) - \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}}) \ge \left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}} \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}}, \tag{A7}$$

from which we can obtain *qij*(*θ*) ≥ 0. Similarly,

$$\phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right) - \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}}) \le \left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}} \frac{\partial \phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)}{\partial \boldsymbol{\theta}}. \tag{A8}$$

From (A6)–(A8), we can conclude

$$0 \le q_{ij}(\boldsymbol{\theta}) \le \left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}} \left\{\frac{\partial \phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)}{\partial \boldsymbol{\theta}} - \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}}\right\}.$$

Therefore, we can bound

$$\frac{2}{N}\mathrm{E}\left\{q_{ij}(\boldsymbol{\theta})^2\right\} \le \frac{2}{N}\left(\frac{1}{\sqrt{N}}\right)^2 \mathrm{E}\left[\boldsymbol{\theta}^{\mathrm{T}}\left\{\frac{\partial \phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)}{\partial\boldsymbol{\theta}} - \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial\boldsymbol{\theta}}\right\}\right]^2.$$

The term $\boldsymbol{\theta}^{\mathrm{T}}\left\{\frac{\partial \phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)}{\partial\boldsymbol{\theta}} - \frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial\boldsymbol{\theta}}\right\} \xrightarrow{p} 0$ as $N \to \infty$. Thus, $\mathrm{var}\{N \cdot Q_N(\boldsymbol{\theta})\} \to 0$ and consequently

$$N \cdot Q\_N(\theta) - N \cdot \mathbb{E}\{Q\_N(\theta)\} \xrightarrow{p} 0.\tag{A9}$$

Meanwhile, $\mathrm{E}\{Q_N(\boldsymbol{\theta})\} = \mathrm{E}\left\{\phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)\right\} - \mathrm{E}\left\{\phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})\right\}$. Eventually, from (A9) we have

$$\begin{split} &N\left\{L\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}\right) - L(\boldsymbol{\theta}_0)\right\} - \boldsymbol{\theta}^{\mathrm{T}}\sqrt{N}\frac{2}{N(N-1)}\sum_{1 \le i < j \le N}\frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial\boldsymbol{\theta}} \\ &\quad - N\left[\mathrm{E}\left\{\phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)\right\} - \mathrm{E}\left\{\phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})\right\}\right] \xrightarrow{p} 0. \end{split} \tag{A10}$$

The third term on the left side of (A10) satisfies

$$\begin{split} &N\left[\mathrm{E}\left\{\phi_{ij}\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}, \widehat{\boldsymbol{\eta}}\right)\right\} - \mathrm{E}\left\{\phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})\right\}\right] \\ &= N\left[\mathrm{E}\left\{\phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}}) + \left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}}\frac{\partial \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial\boldsymbol{\theta}} + \frac{1}{2}\left(\frac{\boldsymbol{\theta}}{\sqrt{N}}\right)^{\mathrm{T}}\frac{\partial^2 \phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\mathrm{T}}}\frac{\boldsymbol{\theta}}{\sqrt{N}} + o_p\left(\frac{1}{N}\right)\right\} - \mathrm{E}\left\{\phi_{ij}(\boldsymbol{\theta}_0, \widehat{\boldsymbol{\eta}})\right\}\right] \\ &\xrightarrow{p} \frac{1}{2}\boldsymbol{\theta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\theta}. \end{split}$$

By CLT for U-statistics,

$$\sqrt{N} \left[ \frac{2}{N(N-1)} \sum\_{1 \le i < j \le N} \frac{\partial \phi\_{ij}(\boldsymbol{\theta}\_0, \hat{\boldsymbol{\eta}})}{\partial \boldsymbol{\theta}} \right] \stackrel{d}{\to} N(\boldsymbol{0}, \boldsymbol{\Sigma}).$$

Using Slutsky's theorem, we can simplify (A10) as

$$N\left\{L\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}\right) - L(\boldsymbol{\theta}_0)\right\} \xrightarrow{d} \frac{1}{2}\boldsymbol{\theta}^{\mathrm{T}}\mathbf{A}\boldsymbol{\theta} + \boldsymbol{\theta}^{\mathrm{T}}\mathbf{W},$$

where $\mathbf{W} \sim N(\mathbf{0}, \boldsymbol{\Sigma})$. Based on convexity [38], for every compact set $\mathbf{K} \subset \mathbb{R}^{p+1}$, we have

$$\left[N\left\{L\left(\boldsymbol{\theta}\_{0}+\frac{\boldsymbol{\theta}}{\sqrt{N}},\widehat{\boldsymbol{\eta}}\right)-L\left(\boldsymbol{\theta}\_{0},\widehat{\boldsymbol{\eta}}\right)\right\}:\boldsymbol{\theta}\in\mathbf{K}\right]\stackrel{d}{\to}\left\{\frac{1}{2}\boldsymbol{\theta}^{\top}\mathbf{A}\boldsymbol{\theta}+\boldsymbol{\theta}^{\top}\mathbf{W}:\boldsymbol{\theta}\in\mathbf{K}\right\}.\tag{A11}$$

Now we develop the large-sample properties of the adaptive LASSO penalty term in the objective function. We rewrite the penalty term as

$$N\sum_{j=1}^{p}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left|\widetilde{\beta}_{j,0} + \frac{\widetilde{\beta}_j}{\sqrt{N}}\right| - N\sum_{j=1}^{p}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left|\widetilde{\beta}_{j,0}\right|.$$

From Theorem 1, we have already obtained $\sqrt{N}\left(\widehat{\widetilde{\beta}}_j - \widetilde{\beta}_{j,0}\right) = O_p(1)$. Meanwhile, $N\lambda \to \infty$ and $\sqrt{N}\lambda \to 0$. If $\widetilde{\beta}_{j,0} \neq 0$, then $\sqrt{N}\lambda/\left|\widehat{\widetilde{\beta}}_j\right| \xrightarrow{p} 0$ and $\left|\sqrt{N}\widetilde{\beta}_{j,0} + \widetilde{\beta}_j\right| - \left|\sqrt{N}\widetilde{\beta}_{j,0}\right| \to \mathrm{sign}(\widetilde{\beta}_{j,0})\,\widetilde{\beta}_j$. Eventually,

$$N\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left(\left|\widetilde{\beta}_{j,0} + \frac{\widetilde{\beta}_j}{\sqrt{N}}\right| - \left|\widetilde{\beta}_{j,0}\right|\right) = \sqrt{N}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left(\left|\sqrt{N}\widetilde{\beta}_{j,0} + \widetilde{\beta}_j\right| - \left|\sqrt{N}\widetilde{\beta}_{j,0}\right|\right) \xrightarrow{p} 0.$$

If $\widetilde{\beta}_{j,0} = 0$, then $\sqrt{N}\lambda/\left|\widehat{\widetilde{\beta}}_j\right| = N\lambda/\left(\sqrt{N}\left|\widehat{\widetilde{\beta}}_j\right|\right) \xrightarrow{p} \infty$, and consequently

$$N\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left(\left|\widetilde{\beta}_{j,0} + \frac{\widetilde{\beta}_j}{\sqrt{N}}\right| - \left|\widetilde{\beta}_{j,0}\right|\right) = \sqrt{N}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left|\widetilde{\beta}_j\right| \xrightarrow{p} \begin{cases} 0, & \text{if } \widetilde{\beta}_j = 0, \\ \infty, & \text{if } \widetilde{\beta}_j \neq 0. \end{cases}$$

Therefore, we can summarize

$$N\sum_{j=1}^{p}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left(\left|\widetilde{\beta}_{j,0} + \frac{\widetilde{\beta}_j}{\sqrt{N}}\right| - \left|\widetilde{\beta}_{j,0}\right|\right) \xrightarrow{p} \begin{cases} 0, & \text{if } \widetilde{\boldsymbol{\beta}} = (\widetilde{\beta}_1, \ldots, \widetilde{\beta}_{p_0}, 0, \ldots, 0)^{\mathrm{T}}, \\ \infty, & \text{otherwise}. \end{cases}$$

Since the limit function takes the value infinity, we cannot use the standard argument based on uniform convergence in probability on compact sets [39]; instead, we apply the slightly more involved notion of epi-convergence. Thus, based on the works in [23,40,41], we have

$$N\left\{L\left(\boldsymbol{\theta}_0 + \frac{\boldsymbol{\theta}}{\sqrt{N}}\right) - L(\boldsymbol{\theta}_0)\right\} + N\sum_{j=1}^{p}\frac{\lambda}{\left|\widehat{\widetilde{\beta}}_j\right|}\left(\left|\widetilde{\beta}_{j,0} + \frac{\widetilde{\beta}_j}{\sqrt{N}}\right| - \left|\widetilde{\beta}_{j,0}\right|\right) \xrightarrow{e\text{-}d} V(\boldsymbol{\theta}), \tag{A12}$$

and

$$V(\boldsymbol{\theta}) = \begin{cases} \frac{1}{2}\boldsymbol{\theta}_T^{\mathrm{T}}\mathbf{A}_1\boldsymbol{\theta}_T + \boldsymbol{\theta}_T^{\mathrm{T}}\mathbf{W}_T, & \text{if } \boldsymbol{\theta} = (\widetilde{\gamma}, \widetilde{\beta}_1, \ldots, \widetilde{\beta}_{p_0}, 0, \ldots, 0)^{\mathrm{T}}, \\ \infty, & \text{otherwise}, \end{cases}$$

and $\mathbf{W}_T \sim N(\mathbf{0}, \boldsymbol{\Sigma}_1)$. Specifically, the left side of (A12) is minimized at $\boldsymbol{\theta} = \sqrt{N}\left(\widehat{\boldsymbol{\theta}}_\lambda - \boldsymbol{\theta}_0\right)$, and $V(\boldsymbol{\theta})$ has the unique minimizer $\left(-(\mathbf{A}_1^{-1}\mathbf{W}_T)^{\mathrm{T}}, \mathbf{0}^{\mathrm{T}}\right)^{\mathrm{T}}$, obtained by setting $\partial V(\boldsymbol{\theta})/\partial\boldsymbol{\theta} = \mathbf{0}$. Therefore, the convergence of minimizers [40] can be concluded from (A12):

$$
\sqrt{N}\left(\hat{\boldsymbol{\theta}}\_{\lambda,T} - \boldsymbol{\theta}\_{0,T}\right) \stackrel{d}{\to} -\mathbf{A}\_1^{-1}\mathbf{W}\_T \text{ and } \sqrt{N}\left(\hat{\boldsymbol{\theta}}\_{\lambda,T^c} - \boldsymbol{\theta}\_{0,T^c}\right) \stackrel{d}{\to} \mathbf{0}.\tag{A13}
$$

For *j* ∈ *T*,

$$\mathrm{pr}(j \notin T_N) = \mathrm{pr}\left(\widehat{\widetilde{\beta}}_{j,\lambda} = 0\right) \to 0.$$

Thus, $\mathrm{pr}(T \subset T_N) \to 1$. In addition, $\widehat{\boldsymbol{\theta}}_\lambda$ minimizes the convex objective function $L_\lambda(\boldsymbol{\theta})$, so that $\mathbf{0} \in \partial L_\lambda(\widehat{\boldsymbol{\theta}}_\lambda)$. As $L_\lambda(\boldsymbol{\theta})$ might be nondifferentiable and its gradient does not exist for some $\boldsymbol{\theta}$, we use $\partial L_\lambda(\boldsymbol{\theta})$ to denote an arbitrary selection from the subgradient of $L_\lambda(\boldsymbol{\theta})$. Taking the subgradient of the objective function with the adaptive LASSO penalty, we obtain

$$
\partial L\_{\lambda}(\widehat{\boldsymbol{\theta}}\_{\lambda}) = \partial L(\widehat{\boldsymbol{\theta}}\_{\lambda}) + \partial\left(\sum\_{j=1}^{p} \frac{\lambda}{\left|\widetilde{\beta}\_{j}\right|} \left|\widehat{\beta}\_{j,\lambda}\right|\right).
$$
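For the penalty term, recall the subgradient of the absolute value function, which is what produces the $\text{sign}(\cdot)$ term appearing in (A14):

```latex
\partial |x| =
\begin{cases}
  \{\operatorname{sign}(x)\}, & x \neq 0, \\
  [-1,\, 1],                 & x = 0 .
\end{cases}
```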

For $j \notin T$, $\text{pr}(j \in T\_N)$ can be upper bounded by

$$\text{pr}\left(\partial\_{j} L(\widehat{\boldsymbol{\theta}}\_{\lambda}) + \frac{\lambda}{\left|\widetilde{\beta}\_{j}\right|}\,\text{sign}\left(\widehat{\beta}\_{j,\lambda}\right) = 0\right) \le \text{pr}\left(\sqrt{N}\left|\partial\_{j} L(\widehat{\boldsymbol{\theta}}\_{\lambda})\right| = \sqrt{N}\frac{\lambda}{\left|\widetilde{\beta}\_{j}\right|}\right),\tag{A14}$$

where $\partial\_j$ denotes the $j$-th coordinate of the subgradient, and $\sqrt{N}\lambda / |\widetilde{\beta}\_j| \xrightarrow{p} \infty$ for $j \notin T$.
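The divergence $\sqrt{N}\lambda / |\widetilde{\beta}\_j| \xrightarrow{p} \infty$ for $j \notin T$ can be seen as follows; this sketch assumes the standard adaptive LASSO conditions that the initial estimator $\widetilde{\beta}\_j$ is $\sqrt{N}$-consistent and that $N\lambda \to \infty$:

```latex
\beta_{j,0} = 0 \;\;(j \notin T)
\;\Longrightarrow\;
\sqrt{N}\,\widetilde{\beta}_j = O_p(1)
\;\Longrightarrow\;
\frac{\sqrt{N}\,\lambda}{|\widetilde{\beta}_j|}
  = \frac{N\lambda}{\sqrt{N}\,|\widetilde{\beta}_j|}
  \xrightarrow{p} \infty .
```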

We can expand the subgradient $\sqrt{N}\,\partial L(\widehat{\boldsymbol{\theta}}\_\lambda)$ as

$$\sqrt{N}\partial L(\widehat{\boldsymbol{\theta}}\_{\lambda}) = \sqrt{N}\left\{\partial L(\widehat{\boldsymbol{\theta}}\_{\lambda}) - \partial L(\boldsymbol{\theta}\_{0}) - \mathbf{A}\left(\widehat{\boldsymbol{\theta}}\_{\lambda} - \boldsymbol{\theta}\_{0}\right)\right\} + \sqrt{N}\partial L(\boldsymbol{\theta}\_{0}) + \sqrt{N}\mathbf{A}\left(\widehat{\boldsymbol{\theta}}\_{\lambda} - \boldsymbol{\theta}\_{0}\right), \tag{A15}$$

where $\sqrt{N}\,\partial L(\boldsymbol{\theta}\_0)$ is bounded in probability, and $\sqrt{N}\mathbf{A}(\widehat{\boldsymbol{\theta}}\_\lambda - \boldsymbol{\theta}\_0)$ converges in distribution by (A13) and hence is bounded in probability as well. By Theorem 1 of the work in [42],

$$\sup\_{\left|\widehat{\theta}\_{\lambda}-\theta\_{0}\right|\leq M/\sqrt{N}} \left| \partial L(\widehat{\theta}\_{\lambda}) - \partial L(\theta\_{0}) - \mathbf{A} \left( \widehat{\theta}\_{\lambda} - \theta\_{0} \right) \right| = o\_{p} \left( \frac{1}{\sqrt{N}} \right).$$

Therefore, $\sqrt{N}\left\{\partial L(\widehat{\boldsymbol{\theta}}\_\lambda) - \partial L(\boldsymbol{\theta}\_0) - \mathbf{A}(\widehat{\boldsymbol{\theta}}\_\lambda - \boldsymbol{\theta}\_0)\right\} \xrightarrow{p} \mathbf{0}$. Finally, $\sqrt{N}\,\partial\_j L(\widehat{\boldsymbol{\theta}}\_\lambda)$ is bounded in probability, so the right side of (A14) converges to 0, which proves $\text{pr}(j \in T\_N) \to 0$ for $j \notin T$.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
