**Nonparametric Statistical Inference with an Emphasis on Information-Theoretic Methods**

Editor

**Jan Mielniczuk**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor*
Jan Mielniczuk
Institute of Computer Science, Polish Academy of Sciences
Faculty of Mathematics and Information Science, Warsaw University of Technology
Poland

*Editorial Office*
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special issues/Non stat Inf).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-4297-3 (Hbk) ISBN 978-3-0365-4298-0 (PDF)**

Cover image courtesy of Monika Śliwowska

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.



## **About the Editor**

#### **Jan Mielniczuk**

Jan Mielniczuk is a full professor at the Institute of Computer Science, Polish Academy of Sciences, and a professor at the Faculty of Mathematics and Information Science of Warsaw University of Technology. His main research contributions concern computational statistics and data mining, particularly time-series modelling and prediction, inference for high-dimensional and misspecified data, model selection, computer-intensive methods, asymptotic analysis and quantification of dependence. He is the author and coauthor of two books and over eighty articles.

## *Editorial* **Nonparametric Statistical Inference with an Emphasis on Information-Theoretic Methods**

**Jan Mielniczuk 1,2**


The presented volume addresses some vital problems in contemporary statistical reasoning. One of them is the high dimensionality of the studied phenomenon and its consequences for formal statistical inference. A huge number of studies have been devoted to proposing new solutions and/or to modifying existing ones in order to account for the specificity of high-dimensional data. However, frequently, these methods work well for precisely defined parametric models and fail when misspecification occurs. Thus, there is a growing need to develop non-parametric and robust procedures accounting for this problem and to study existing methods when misspecification is suspected. This is discussed in several papers in this volume under various scenarios. Furthermore, information-theoretic methods, due to their generality, are of special interest in this context, e.g., when variable selection is envisaged. Frequently, the approach to account for high dimensionality is based on the penalization of classic statistical procedures, and this line of reasoning is discussed here. Moreover, in a multivariate scenario, there is a need to define and study analogues of statistical measures designed for the univariate or bivariate case, and this approach is represented by the study on tail dependence indices. An important area of statistical research is devoted to time series analysis, especially in multivariate cases and in non-standard observability scenarios; two papers in the volume address this issue. Furthermore, information-theoretic tools used to shed new light on the generalization risk in learnability theory are covered here.

In [1], a general class of non-stationary multivariate processes is considered based on *p*-dimensional Bernoulli shifts, which, in particular, encompasses multivariate linear processes with time-varying coefficients. A locally stationary model is proposed, under which the covariance matrix Σ(*t*) is piecewise Lipschitz continuous except at a certain number of breaks (change points). The problem of the non-parametric estimation of change points is addressed, as well as that of graph support recovery, specifically the estimation of the set {(*j*, *k*) : |Σ(*t*)⁻¹(*j*, *k*)| > *u*} for a given threshold *u* and precision matrix Σ(*t*)⁻¹. It is shown that in both problems, one can obtain theoretical guarantees of the accuracy of the estimation procedures using the proposed kernel-smoothed constrained ℓ₁-minimization approach.

In [2], the problem of support recovery is considered for a semiparametric binary model in which the posterior probability of the response is given by *q*(*β*ᵀ*x*), where *q* is an unknown response function. The problem is dealt with by applying the penalized empirical risk minimization approach for a convex loss *φ*. This has nice information-theoretic connotations when *φ* is the logistic loss, as, in this case, we aim at estimating the averaged Kullback–Leibler projection of *q*(*β*ᵀ*x*) onto the family of logistic models. For a high-dimensional setting and random subgaussian regressors, conditions are studied under which the minimizer *β̂* of the penalized empirical risk converges to the vector *β*∗ corresponding to the Kullback–Leibler projection. This is used to establish selection consistency of the

**Citation:** Mielniczuk, J. Nonparametric Statistical Inference with an Emphasis on Information-Theoretic Methods. *Entropy* **2022**, *24*, 553. https:// doi.org/10.3390/e24040553

Received: 28 March 2022 Accepted: 12 April 2022 Published: 15 April 2022



Generalized Information Criterion (GIC) based on *β̂* for Lipschitz and convex *φ* under Linear Regression Conditions. The resulting Screening and Selection (SS) procedure is studied in numerical experiments.
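As a loose illustration of this approach (not the SS procedure itself), the following sketch fits an ℓ₁-penalized logistic regression to data generated from a misspecified binary model, so that the fitted model targets the Kullback–Leibler projection onto logistic models. The probit response function, sample sizes, penalty level, and use of scikit-learn are all illustrative assumptions:

```python
# Minimal sketch: penalized empirical risk minimization with logistic loss
# under misspecification (probit response, logistic working model).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.standard_normal((n, p))                    # subgaussian regressors
beta = np.zeros(p)
beta[:3] = [2.0, -2.0, 1.5]                        # sparse true direction
q = norm.cdf(X @ beta)                             # misspecified (probit) response function
y = rng.binomial(1, q)

# Lasso-type penalty: the minimizer approximates the KL projection beta*
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
support = np.flatnonzero(np.abs(fit.coef_[0]) > 1e-6)
print(support)   # screening step: should include the true active set {0, 1, 2}
```

Despite the wrong link function, the support of the projection coincides here with that of the true direction, which is the point exploited by the screening step.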

Ref. [3] addresses one of the main issues of learnability theory, namely the properties of the generalization risk for a given learning algorithm L. I. Alabdulmohsin introduces a new concept of the uniform generalization of L with a rate *ε*, which stipulates that the generalization risk is less than *ε* for any bounded loss function *l*(·, ·) such that *l*(·, *h*) depends on the underlying sample only through the hypothesis *h* chosen by L. An information-theoretic characterization of this property is given in terms of the variational information *J*(*ẑ*, *h*) between a single observation *ẑ* and the chosen hypothesis *h* (Theorem 2). In Theorem 4, a probabilistic inequality for the deviation of the empirical risk from the true risk is given in terms of *J*(*ẑ*, *h*). Moreover, the concept of the learning capacity of L, analogous to the concept of Shannon channel capacity, is introduced and studied.

Ref. [4], similarly to [2], deals with the classification problem of a binary variable under misspecification. It focuses on establishing a general upper bound on the excess risk, i.e., the difference between the risk of the linear classifier *β̂*ᵀ*x*, obtained as a minimizer of the penalized empirical risk pertaining to a convex function *φ*, and the Bayes risk in such a case (Theorem 1). The crucial part of the bound is the probability that |*β̂* − *β*∗|₁ exceeds a certain threshold, where *β*∗ is the minimizer of the theoretical risk pertaining to *φ*. Interestingly, the authors are able to bound this probability, provided the predictors are multivariate subgaussian, for the non-Lipschitz quadratic loss *φ*(*t*) = (1 − *t*)², which is rarely studied in the classification context. The second part of the paper deals with the consistency of the thresholded Lasso selector under the Linear Regression Conditions mentioned above and again for quadratic loss. The result complements the results on selection consistency studied in [2].

The paper [5] is an insightful study of tail dependence indices in the multivariate case from a novel perspective, which sheds new light on their similarities and differences. Namely, a set of five natural properties is introduced that should be satisfied by such indices, and existing proposals (Frahm's extremal dependence, Li's tail dependence and Schmid and Schmidt's tail dependence measures) are investigated in this context. Further properties of these indices are studied, such as their behavior with increasing dimension of the vector. The delicate problem of estimating the tail indices is addressed, and the consistency of the introduced estimators is studied. Their performance is illustrated using the EURO STOXX 50 index.
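For orientation, in the bivariate special case the classical upper tail dependence coefficient is λ_U = lim_{u→1} P(V > u | U > u) for the copula scale (U, V). The toy sketch below (an independent illustration, not one of the multivariate indices studied in [5]) estimates it empirically from ranks at a finite threshold; the Gaussian pair and threshold are illustrative assumptions:

```python
# Empirical finite-threshold estimate of the upper tail dependence coefficient
# lambda_U(u) = P(V > u | U > u), computed from rank-based pseudo-observations.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
z = rng.standard_normal(n)
x = z + 0.3 * rng.standard_normal(n)    # strongly dependent pair
y = z + 0.3 * rng.standard_normal(n)

u = 0.95
U = np.argsort(np.argsort(x)) / n       # pseudo-observations (ranks / n)
V = np.argsort(np.argsort(y)) / n
lam_hat = np.mean((U > u) & (V > u)) / (1 - u)
print(lam_hat)    # finite-threshold estimate; tends to 0 as u -> 1 for Gaussian pairs
```

The finite-threshold estimate is substantial here even though the Gaussian copula is asymptotically tail independent, which illustrates why estimating tail indices is delicate.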

Ref. [6] considers non-parametric variable selection based on information-theoretic criteria. In such an approach, the maximization of conditional mutual information *CMI* = *I*(*X*; *Y* | *XS*) is often considered in greedy selection, where *Y* is the response, *XS* is a vector of already chosen predictors, and *X* is a candidate for a possible augmentation of *XS*. Frequently, conditional mutual information is replaced by approximations resulting from the Möbius expansion or some modifications of these approximations. In the paper, two criteria obtained in such a way, namely Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI), are analyzed, together with CMI, in a certain dependence model called the Generative Tree Model. It is shown that the two considered criteria may lead to a different order of chosen variables than the order induced by CMI, and CIFE may disregard a significant part of the active variables. The analysis is based on formulae for the entropy of the multivariate Gaussian mixture and its mutual information with the mixing variables derived in the paper, which are interesting in their own right.
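A minimal sketch of these criteria on discrete data (using one common form of CIFE and JMI with plug-in entropy estimates, not the paper's Gaussian-mixture formulae; the XOR example is an illustrative assumption) may clarify how the redundancy and conditional-redundancy terms enter:

```python
# Greedy information-theoretic scoring with plug-in entropy estimates.
# CIFE(X) = I(X;Y) - sum_{s in S} [I(X;X_s) - I(X;X_s|Y)]
# JMI(X)  = I(X;Y) - (1/|S|) sum_{s in S} [I(X;X_s) - I(X;X_s|Y)]
import numpy as np
from collections import Counter

def H(*cols):
    """Joint entropy (nats) of discrete columns, from empirical frequencies."""
    joint = list(zip(*cols))
    n = len(joint)
    return -sum((c / n) * np.log(c / n) for c in Counter(joint).values())

def I(x, y):                         # mutual information I(X;Y)
    return H(x) + H(y) - H(x, y)

def I_cond(x, y, z):                 # conditional mutual information I(X;Y|Z)
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def cife(x, y, selected):
    return I(x, y) - sum(I(x, s) - I_cond(x, s, y) for s in selected)

def jmi(x, y, selected):
    if not selected:
        return I(x, y)
    return I(x, y) - sum(I(x, s) - I_cond(x, s, y) for s in selected) / len(selected)

rng = np.random.default_rng(1)
x1 = rng.integers(0, 2, 5000)
x2 = rng.integers(0, 2, 5000)
y = x1 ^ x2                          # XOR: jointly informative, marginally useless
# With x1 already selected, both criteria reward x2 through the I(X;X_s|Y) term:
print(cife(x2, y, [x1]), jmi(x2, y, [x1]))   # both close to log(2) = 0.693 nats
```

Here *I*(*X₂*; *Y*) ≈ 0, so a purely marginal criterion would discard *X₂*; the conditional-redundancy term is what recovers its joint relevance.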

In [7], the authors consider a semiparametric stationary time series model of the form *Zₜ* = *xₜ*ᵀ*β* + *f*(*sₜ*) + *εₜ*, where *xₜ* is a vector of random explanatory variables, *sₜ* is a temporal covariate, and *εₜ* is an autoregressive process. Moreover, *Zₜ* is subject to random censoring from the right, and *f* is a linear combination of B-spline basis functions of order *q* with a corresponding vector of coefficients *α*. The penalized adaptive spline approach is developed in the paper to tackle the data irregularity and is then applied to an unbiased synthetic transformation of *Zₜ*. The bias and covariance structure of the obtained estimators of *α* and *β* are derived, and their consistency is studied.
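A stripped-down sketch of such a partially linear fit, with no censoring, no penalty, no autoregressive errors, and a truncated-power cubic spline basis standing in for the paper's penalized B-splines (all illustrative assumptions), is:

```python
# Partially linear regression Z_t = x_t' beta + f(s_t) + eps_t fitted by
# least squares on [linear part | spline basis for f].
import numpy as np

rng = np.random.default_rng(8)
n = 1000
x = rng.standard_normal((n, 2))
s = np.linspace(0, 1, n)                       # temporal covariate
f = np.sin(2 * np.pi * s)                      # smooth nonparametric component
Z = x @ np.array([1.0, -0.5]) + f + 0.1 * rng.standard_normal(n)

knots = np.linspace(0, 1, 8)[1:-1]             # interior knots
B = np.column_stack([np.ones(n), s, s**2, s**3] +
                    [np.clip(s - k, 0, None)**3 for k in knots])  # cubic splines
D = np.column_stack([x, B])                    # combined design matrix
coef, *_ = np.linalg.lstsq(D, Z, rcond=None)
print(coef[:2])     # estimates of beta, close to (1, -0.5)
```

The censoring handled in [7] via the synthetic transformation, and the adaptive penalty, are precisely what this sketch omits.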

Ref. [8] addresses the practically important and intensively researched problem of accounting for outliers in the estimation process when fitting the multiple linear regression model. The approach is based on the L2E parametric method proposed by the first author, which consists of finding the minimizer of the estimated Integrated Squared Error (ISE) in a parametric family of densities { *f*(*x*|*θ*)}. The proposed extension introduces an additional parameter *w*, which loosely corresponds to the mixture proportion of the main (outlier-free) component of the density, and the minimization is now performed in the family {*w f*(*x*|*θ*)} with respect to both *θ* and *w*. The authors then convincingly show, by analyzing several examples, that the proposed method yields a much more adequate fit of the residuals than least squares, and additional insight into data interpretation is sometimes possible.
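A minimal sketch of the L2E idea in this regression setting (an independent illustration based on the description above, with the normal residual model, the data, and the optimizer as illustrative assumptions) is:

```python
# L2E-type robust regression: minimize the estimated ISE of the weighted
# component w * N(0, s^2) fitted to the residuals, jointly over (beta, s, w).
import numpy as np
from scipy.optimize import minimize

def l2e_criterion(params, X, y):
    beta, log_s, w = params[:-2], params[-2], params[-1]
    s = np.exp(log_s)                           # positive scale via log-parametrization
    r = y - X @ beta
    phi = np.exp(-r**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)
    # ISE estimate for w*N(0, s^2): w^2/(2*sqrt(pi)*s) - 2*w*mean(phi(r_i))
    return w**2 / (2 * np.sqrt(np.pi) * s) - 2 * w * phi.mean()

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
y = X @ np.array([1.0, 2.0]) + 0.2 * rng.standard_normal(n)
y[:30] += 8.0                                   # 10% gross outliers

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # LS start, pulled up by outliers
x0 = np.concatenate([beta_ls, [0.0, 0.9]])       # [beta, log s, w]
res = minimize(l2e_criterion, x0, args=(X, y), method="Nelder-Mead",
               options={"maxiter": 20000, "maxfev": 20000})
beta_hat = res.x[:2]
print(beta_hat)     # robust fit, close to (1, 2); least squares is not
```

The fitted *w* below 1 signals that part of the sample is not covered by the main component, which is the diagnostic insight mentioned above.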

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **Estimation of Dynamic Networks for High-Dimensional Nonstationary Time Series**

#### **Mengyu Xu <sup>1</sup>, Xiaohui Chen <sup>2</sup> and Wei Biao Wu <sup>3,\*</sup>**


Received: 14 November 2019; Accepted: 26 December 2019; Published: 31 December 2019

**Abstract:** This paper is concerned with the estimation of time-varying networks for high-dimensional nonstationary time series. Two types of dynamic behaviors are considered: structural breaks (i.e., abrupt change points) and smooth changes. To simultaneously handle these two types of time-varying features, a two-step approach is proposed: multiple change point locations are first identified on the basis of comparing the difference between the localized averages on sample covariance matrices, and then graph supports are recovered on the basis of a kernelized time-varying constrained *L*1-minimization for inverse matrix estimation (CLIME) estimator on each segment. We derive the rates of convergence for estimating the change points and precision matrices under mild moment and dependence conditions. In particular, we show that this two-step approach is consistent in estimating the change points and the piecewise smooth precision matrix function, under a certain high-dimensional scaling limit. The method is applied to the analysis of network structure of the S&P 500 index between 2003 and 2008.

**Keywords:** high-dimensional time series; nonstationarity; network estimation; change points; kernel estimation

#### **1. Introduction**

Networks are useful tools to visualize the relational information among a large number of variables. An undirected graphical model belongs to a rich class of statistical network models that encodes conditional independence [1]. Canonically, Gaussian graphical models (or their normalized version partial correlations [2]) can be represented by the inverse covariance matrix (i.e., the precision matrix), where a zero entry is associated with a missing edge between two vertices in the graph. Specifically, two vertices are not connected if and only if they are conditionally independent, given the value of all other variables.
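This correspondence can be checked numerically; in the following illustrative sketch (our own toy example, not part of the paper's method), a Gaussian Markov chain X1 – X2 – X3 yields a precision matrix whose (1,3) entry is approximately zero, reflecting the conditional independence of X1 and X3 given X2:

```python
# Zeros of the precision matrix encode missing edges in a Gaussian graphical model.
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + rng.standard_normal(n)      # chain: x1 -- x2 -- x3
x3 = 0.8 * x2 + rng.standard_normal(n)

S = np.cov(np.vstack([x1, x2, x3]))         # 3 x 3 sample covariance
Omega = np.linalg.inv(S)                    # precision matrix
print(np.round(Omega, 2))   # (1,3) entry near 0: x1, x3 conditionally independent given x2
```

The nonzero off-diagonal entries (1,2) and (2,3) correspond to the two edges of the chain; their negative sign reflects the positive partial correlations.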

On one hand, there is a large volume of literature on estimating the (static) precision matrix for graphical models in the high-dimensional setting, where the sample size and the dimension are both large [3–16]. Most of the earlier work along this line assumes that the underlying network is time-invariant. This assumption is quite restrictive in practice and hardly plausible for many real-world applications, such as gene regulatory networks, social networks, and the stock market, where the underlying data generating mechanisms are often dynamic. On the other hand, dynamic random networks have been extensively studied from the perspective of large random graphs, such as community detection and edge probability estimation for dynamic stochastic block models (DSBMs) [17–30]. Such approaches do not model the sampling distributions of the error (or noise), since the "true" networks are connected with random edges sampled from certain probability models, such as the Erdős–Rényi graphs [31] and random geometric graphs [32].

In this paper, we view the (time-varying) networks of interest as non-random graphs. We adopt the graph signal processing approach for denoising the nonstationary time series and target estimating the *true unknown* underlying graphs. Despite the recent attempts towards more flexible time-varying models [33–40], there are still a number of major limitations in the current high-dimensional literature. First, theoretical analysis was derived under the fundamental assumption that the observations are either temporally *independent*, or the temporal dependence has very specific forms, such as Gaussian processes or (linear) vector autoregression (VAR) [14,33,34,37,41–43]. Such dynamic structures are unduly demanding in view that many time series encountered in real applications have very complex nonlinear spatial-temporal dependency [44,45]. Second, most existing work assumes the data have time-varying distributions with sufficiently light tails, such as Gaussian graphical models and Ising models [33,34,36,41,42]. Third, in change point estimation problems for high-dimensional time series, piecewise constancy is widely used [41,42,46,47], which can be fragile in practice. For instance, financial data often appear to have time-dependent cross-volatility with structural breaks [48]. For resting-state fMRI signals, correlation analysis reveals both slowly varying and abruptly changing characteristics corresponding to modularities in brain functional networks [49,50].

Advances in analyzing high-dimensional (stationary) time series have been made recently to address the aforementioned nonlinear spatial-temporal dependency issue [14,37,43,51–57]. In [53,56,57], the authors considered the theoretical properties of regularized estimation of covariance and precision matrices, based on various dependence measures of high-dimensional time series. Reference [38] considered the non-paranormal graphs that evolve with a random variable. Reference [37] discussed the joint estimation of Gaussian graphical models based on a stationary VAR(1) model with special coefficient matrices, which may also depend on certain covariates. The authors applied a constrained *L*1-minimization for inverse matrix estimation (CLIME) estimator with a kernel estimator of covariance matrix and developed consistency in the graph recovery at a given time point. Reference [14] studied the recovery of the Granger causality across time and nodes assuming a stationary Gaussian VAR model with unknown order.

In this paper, we focus on the recovery of time-varying undirected graphs on the basis of the regularized estimation of the precision matrices for a general class of nonstationary time series. We simultaneously model two types of dynamics: abrupt changes with an unknown number of change points and the smooth evolution between the change points. In particular, we study a class of high-dimensional *piecewise locally stationary processes* in a general nonlinear temporal dependency framework, where the observations are allowed to have a finite polynomial moment.

More specifically, there are two main goals of this paper: first, to estimate the change point locations, as well as the number of change points, and second, to estimate the smooth precision matrix functions between the change points. Accordingly, our proposed method contains two steps. In the first step, the maximum norm of the local difference matrix is computed at each time point and the jumps in the covariance matrices are detected at the locations where the maximum norms are above a certain threshold. In the second step, the precision matrices before and after the jump are estimated by a regularized kernel smoothing estimator. These two steps are recursively performed until a stopping criterion is met. Moreover, a boundary correction procedure based on data reflection is considered to reduce the bias near the change point.

We provide an asymptotic theory to justify the proposed method in high dimensions: point-wise and uniform rates of convergence are derived for the change point estimation and graph recovery under mild and interpretable conditions. The convergence rates are determined via subtle interplay among the sample size, dimensionality, temporal dependence, moment condition, and the choice of bandwidth in the kernel estimator. Our results are significantly more involved than problems for sub-Gaussian tails and independent samples. We highlight that uniform consistency in terms of time-varying network structure recovery is much more challenging than pointwise consistency. For the multiple change point detection problem, we also characterize the threshold of the difference statistic that gives a consistent selection of the number of change points.

We fix some notation: positive, finite, and non-random constants, independent of the sample size $n$ and dimension $p$, are denoted by $C, C_1, C_2, \ldots$, whose values may differ from line to line. For sequences of real numbers $a_n$ and $b_n$, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if $\limsup_{n\to\infty}(a_n/b_n) \le C$ for some constant $C < \infty$, and $a_n = o(b_n)$ if $\lim_{n\to\infty}(a_n/b_n) = 0$. We say $a_n \asymp b_n$ if $a_n = O(b_n)$ and $b_n = O(a_n)$. For a sequence of random variables $Y_n$ and a corresponding set of constants $a_n$, denote $Y_n = O_{\mathbb{P}}(a_n)$ if for any $\varepsilon > 0$ there is a constant $C > 0$ such that $\mathbb{P}(|Y_n|/a_n > C) < \varepsilon$ for all $n$. For a vector $\mathbf{x} \in \mathbb{R}^p$, we write $|\mathbf{x}| = (\sum_{j=1}^{p} x_j^2)^{1/2}$. For a matrix $\Sigma = (\sigma_{jk})$, $|\Sigma|_1 = \sum_{j,k} |\sigma_{jk}|$, $|\Sigma|_{\infty} = \max_{j,k} |\sigma_{jk}|$, $|\Sigma|_{L_1} = \max_k \sum_j |\sigma_{jk}|$, $|\Sigma|_F = (\sum_{j,k} \sigma_{jk}^2)^{1/2}$ and $\rho(\Sigma) = \max\{|\Sigma \mathbf{x}| : |\mathbf{x}| = 1\}$. For a random vector $\mathbf{z} \in \mathbb{R}^p$, write $\mathbf{z} \in \mathcal{L}^a$, $a > 0$, if $\|\mathbf{z}\|_a := [\mathbb{E}(|\mathbf{z}|^a)]^{1/a} < \infty$. Let $\|\mathbf{z}\| = \|\mathbf{z}\|_2$. Denote $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$.

The rest of the paper is organized as follows: Section 2 presents the time series model, as well as the main assumptions, which can simultaneously capture the smooth and abrupt changes. In Section 3, we introduce the two-step method that first segments the time series based on the difference between the localized averages on sample covariance matrices and then recovers the graph support based on a kernelized CLIME estimator. In Section 4, we state the main theoretical results for the change point estimation and support recovery. Simulation examples are presented in Section 5 and a real data application is given in Section 6. Proof of main results can be found in Section 7.

#### **2. Time Series Model**

We first introduce a class of causal vector stochastic processes. Next, we state the assumptions needed to derive an asymptotic theory in Section 4 and explain their implications. Let $\varepsilon_i \in \mathbb{R}^p$, $i \in \mathbb{Z}$, be independent and identically distributed (i.i.d.) random vectors and $\mathcal{F}_i = (\ldots, \varepsilon_{i-1}, \varepsilon_i)$ be a shift process. Let $\mathbf{X}_i^{\circ}(t) = (X_{i1}^{\circ}(t), \ldots, X_{ip}^{\circ}(t))$ be a $p$-dimensional nonstationary time series generated by

$$\mathbf{X}_{i}^{\circ}(t) = \mathbf{H}(\mathcal{F}_{i}; t), \tag{1}$$

where $\mathbf{H}(\cdot\,; \cdot) = \big(H_1(\cdot\,; \cdot), \ldots, H_p(\cdot\,; \cdot)\big)$ is an $\mathbb{R}^p$-valued jointly measurable function. Suppose we observe the data points $\mathbf{X}_i = \mathbf{X}_{i,n} = \mathbf{X}_i^{\circ}(t_i)$ at the evenly spaced time points $t_i = i/n$, $i = 1, 2, \ldots, n$,

$$\mathbf{X}_{i,n} = \mathbf{H}(\mathcal{F}_{i}; i/n). \tag{2}$$

We drop the subscript $n$ in $\mathbf{X}_{i,n}$ in the rest of this section. Since our focus is to study the second-order properties, the data are assumed to have mean zero.

Model (1) was first introduced in [58]. The stochastic process $\{\mathbf{X}_i^{\circ}(t)\}_{i \in \mathbb{Z},\, t \in [0,1)}$ can be thought of as a triangular array system, doubly indexed by $i$ and $t$, while the observations $(\mathbf{X}_i)_{i=1}^{n}$ are sampled from the diagonal of the array. On one hand, when fixing the time index $t$, the (vertical) process $\{\mathbf{X}_i^{\circ}(t)\}_{i \in \mathbb{Z}}$ is stationary. On the other hand, since $\mathbf{H}(\mathcal{F}_i; t_i)$ is allowed to vary with $t_i$, the diagonal process (2) is able to capture nonstationarity.

The process $(\mathbf{X}_i)_{i \in \mathbb{Z}}$ is causal or non-anticipative, as $\mathbf{X}_i$ is an output of the past innovations $(\varepsilon_j)_{j \le i}$ and does not depend on future innovations. In fact, it covers a broad range of linear and nonlinear, stationary and non-stationary processes, such as vector autoregressive moving average processes, locally stationary processes, Markov chains, and nonlinear functional processes [53,58–61].

Motivated by real applications where nonstationary time series data can involve both abrupt breaks and smooth variation between the breaks, we model the underlying processes as piecewise locally stationary with a finite number of structural breaks.

**Definition 1** (Piecewise locally stationary time series model)**.** *Define* $\mathrm{PLS}_{\iota}([0,1], L)$ *as the collection of mean-zero piecewise locally stationary processes on* $[0,1]$*: for each* $(X(t))_{0 \le t \le 1} \in \mathrm{PLS}_{\iota}([0,1], L)$*, there is a nonnegative integer* $\iota$ *such that* $X(t)$ *is piecewise stochastic Lipschitz continuous in* $t$ *with Lipschitz constant* $L$ *on the intervals* $[t^{(l)}, t^{(l+1)})$, $l = 0, \cdots, \iota$*, where* $0 = t^{(0)} < t^{(1)} < \cdots < t^{(\iota)} < t^{(\iota+1)} = 1$*. A vector stochastic process* $(\mathbf{X}(t))_{0 \le t \le 1} \in \mathrm{PLS}_{\iota}([0,1], L)$ *if all its coordinates belong to* $\mathrm{PLS}_{\iota}([0,1], L)$*. For the process* $(\mathbf{X}_0^{\circ}(t))_{0 \le t \le 1}$ *defined in (1), this means that there exist a non-negative integer* $\iota$ *and a constant* $L > 0$*, such that*

$$\max_{1 \le j \le p} \left\| H_j(\mathcal{F}_0; t) - H_j(\mathcal{F}_0; t') \right\| \le L|t - t'| \quad \text{for all } t^{(l)} \le t, t' < t^{(l+1)},\ 0 \le l \le \iota.$$

**Remark 1.** *If we assume* $(\mathbf{X}_i^{\circ}(t))_{0 \le t \le 1} \in \mathrm{PLS}_{\iota}([0,1], L)$, $i \in \mathbb{Z}$*, then it follows that for each* $i' = i-k, \ldots, i+k$*, where* $k/n \to 0$*, and such that* $t^{(l)} \le i/n,\, i'/n < t^{(l+1)}$ *for some* $0 \le l \le \iota$*, we have*

$$\max_{1 \le j \le p} \left\| H_j(\mathcal{F}_{i'}; i/n) - H_j(\mathcal{F}_{i'}; i'/n) \right\| \le Lk/n = o(1).$$

*In other words, within a locally stationary time period, in a local window of* $i$*,* $(X_{i'j})_{i-k \le i' \le i+k}$ *can be approximated by the stationary process* $(X_{i'j}^{\circ}(i/n))_{i-k \le i' \le i+k}$ *for each* $j = 1, \ldots, p$*. This justifies the terminology of local stationarity.*

The covariance matrix function of the underlying process is $\Sigma(t) = \big(\sigma_{jk}(t)\big)_{1 \le j,k \le p}$, $t \in [0,1]$, where $\sigma_{jk}(t) = \mathbb{E}\big(H_j(\mathcal{F}_0; t) H_k(\mathcal{F}_0; t)\big)$, and the precision matrix function is $\Omega(t) = \Sigma(t)^{-1} = \big(\omega_{jk}(t)\big)_{1 \le j,k \le p}$. The graph at time $t$ is denoted by $G(t) = (\mathcal{V}, \mathcal{E}(t))$, where $\mathcal{V}$ is the vertex set and $\mathcal{E}(t) = \{(j,k) : \omega_{jk}(t) \neq 0\}$. Note that $(\mathbf{X}_i^{\circ}(t))_t \in \mathrm{PLS}_{\iota}([0,1], L)$, $i \in \mathbb{Z}$, implies piecewise Lipschitz continuity of $\Sigma(t)$ except at the breaks $t^{(1)}, \ldots, t^{(\iota)}$. In particular, if $\sup_{0 \le t \le 1} \max_{1 \le j \le p} \|H_j(\mathcal{F}_0; t)\| \le C$ for some constant $C > 0$, then

$$|\Sigma(s) - \Sigma(t)|_{\infty} \le 2CL|s - t|, \qquad \forall s, t \in [t^{(l)}, t^{(l+1)}),\ l = 0, \ldots, \iota. \tag{3}$$

The reverse direction is not necessarily true, i.e., (3) does not imply $(\mathbf{X}_i^{\circ}(t))_t \in \mathrm{PLS}_{\iota}([0,1], L)$, $i \in \mathbb{Z}$, in general. As a trivial example, let $\varepsilon_{ij} = -2^{-1/2}$ with probability $2/3$ and $\sqrt{2}$ with probability $1/3$, i.i.d. for all $i, j$ (so that $\mathbb{E}\varepsilon_{ij} = 0$ and $\mathbb{E}\varepsilon_{ij}^2 = 1$). At time $t_k = k/n$, let $X_{ij}^{\circ}(t_k) = (-1)^k \sqrt{t_k}\, \varepsilon_{ij}$. Then for any $k$ and $k'$ such that $k + k'$ is odd, $|\Sigma(t_k) - \Sigma(t_{k'})|_{\infty} = |t_k - t_{k'}|$, while $\|X_{01}^{\circ}(t_k) - X_{01}^{\circ}(t_{k'})\|_2 = \sqrt{t_k} + \sqrt{t_{k'}}$.
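This example can be verified numerically. The sketch below checks that the two-point innovation (taking the value $-2^{-1/2}$ with probability $2/3$, as required for mean zero and unit variance, and $\sqrt{2}$ with probability $1/3$) yields the stated $L^2$ distance for an odd $k + k'$; the particular values of $t_k$ are illustrative:

```python
# Numerical check of the counterexample: mean zero, unit variance, and
# L2 distance sqrt(t_k) + sqrt(t_k') when the signs (-1)^k, (-1)^k' differ.
import numpy as np

rng = np.random.default_rng(4)
eps = rng.choice([-2**-0.5, 2**0.5], p=[2/3, 1/3], size=1_000_000)
print(eps.mean(), eps.var())            # approximately 0 and 1

tk, tk2 = 0.25, 0.36                    # t_k with k even, t_k' with k' odd
diff = np.sqrt(tk) * eps - (-1) * np.sqrt(tk2) * eps
rms = np.sqrt((diff**2).mean())
print(rms)                              # approximately sqrt(0.25) + sqrt(0.36) = 1.1
```

The $L^2$ distance stays bounded away from zero as $|t_k - t_{k'}| \to 0$, so no stochastic Lipschitz bound holds even though the covariance function is Lipschitz.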

**Assumption 1** (Piecewise smoothness)**.** *(i) Assume* $(\mathbf{X}_i^{\circ}(t))_{0 \le t \le 1} \in \mathrm{PLS}_{\iota}([0,1], L)$ *for each* $i \in \mathbb{Z}$*, where* $L > 0$ *and* $\iota \ge 0$ *are constants independent of* $n$ *and* $p$*. (ii) For each* $l = 0, \ldots, \iota$*, and* $1 \le j, k \le p$*, we have* $\sigma_{jk}(t) \in \mathcal{C}^2[t^{(l)}, t^{(l+1)})$*.*

Now we introduce the temporal dependence measure. We quantify the dependence of $\{\mathbf{X}_i^{\circ}(t)\}_{i \in \mathbb{Z}}$ by the dependence adjusted norm (DAN) (cf. [62]). Let $\varepsilon_i'$ be an independent copy of $\varepsilon_i$ and $\mathcal{F}_{i,\{m\}} = (\ldots, \varepsilon_{i-m-1}, \varepsilon_{i-m}', \varepsilon_{i-m+1}, \ldots, \varepsilon_i)$. Denote $\mathbf{X}_{i,\{m\}}^{\circ}(t) = \big(X_{i1,\{m\}}^{\circ}(t), \ldots, X_{ip,\{m\}}^{\circ}(t)\big)$, where $X_{ij,\{m\}}^{\circ}(t) = H_j(\mathcal{F}_{i,\{m\}}; t)$, $1 \le j \le p$. Here $\mathbf{X}_{i,\{m\}}^{\circ}(t)$ is a coupled version of $\mathbf{X}_i^{\circ}(t)$, with the same generating mechanism and input, except that $\varepsilon_{i-m}$ is replaced by an independent copy $\varepsilon_{i-m}'$.

**Definition 2** (Dependence adjusted norm (DAN))**.** *Let constants* $a \ge 1$, $A > 0$*. Assume* $\sup_{0 \le t \le 1} \|X_{1j}^{\circ}(t)\|_a < \infty$, $j = 1, \ldots, p$*. Define the uniform functional dependence measure for the sequences* $(X_{ij}^{\circ}(t))_{i \in \mathbb{Z},\, t \in [0,1]}$ *of form (1) as*

$$\theta_{m,a,j} = \sup_{0 \le t \le 1} \| X_{ij}^{\circ}(t) - X_{ij,\{m\}}^{\circ}(t) \|_a, \quad j = 1, \ldots, p,$$

*and* $\Theta_{m,a,j} = \sum_{i=m}^{\infty} \theta_{i,a,j}$*. The dependence adjusted norm of* $(X_{ij}^{\circ}(t))_{i \in \mathbb{Z},\, t \in [0,1]}$ *is defined as*

$$\left\| X_{\cdot,j} \right\|_{a,A} = \sup_{m \ge 0} (m+1)^A\, \Theta_{m,a,j},$$

*whenever* $\|X_{\cdot,j}\|_{a,A} < \infty$*.*

Intuitively, the physical dependence measure quantifies the stochastic difference between the random variable and its coupled version obtained by replacing a past innovation. Indeed, $\theta_{m,a,j}$ measures the impact on $X_{ij}^{\circ}(t)$, uniformly over $t$, of replacing $\varepsilon_{i-m}$ while freezing all the other inputs, while $\Theta_{m,a,j}$ quantifies the cumulative influence of replacing $\varepsilon_{-m}$ on $(X_{ij}^{\circ}(t))_{i \ge 0}$, uniformly over $t$. Then $\|X_{\cdot,j}\|_{a,A}$ controls the uniform polynomial decay in the lag of the cumulative physical dependence, where $a$ depends on the tail of the marginal distributions of $X_{1j}^{\circ}(t)$ and $A$ quantifies the polynomial decay power and thus the temporal dependence strength. It is clear that $\|X_{\cdot,j}\|_{a,A}$ is a semi-norm, i.e., it is subadditive and absolutely homogeneous.
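For a concrete illustration, consider a scalar linear process $X_i = \sum_{m \ge 0} a^m \varepsilon_{i-m}$ with i.i.d. standard normal innovations: replacing $\varepsilon_{i-m}$ by an independent copy changes $X_i$ by $a^m(\varepsilon - \varepsilon')$, so $\theta_{m,2} = a^m \sqrt{2}$. A short Monte Carlo sketch (with illustrative parameters, not from the paper) confirms this:

```python
# Functional dependence measure theta_{m,2} for a scalar AR(1)-type linear process:
# swapping the innovation at lag m perturbs X_i by a^m * (eps - eps').
import numpy as np

rng = np.random.default_rng(5)
a, m, N = 0.7, 3, 200_000
eps, eps_prime = rng.standard_normal(N), rng.standard_normal(N)
delta = a**m * (eps - eps_prime)        # X_i - X_{i,{m}} for the linear process
theta_m = np.sqrt((delta**2).mean())    # Monte Carlo estimate of the L2 norm
print(theta_m, a**m * np.sqrt(2))       # both approximately 0.485
```

Since $\theta_{m,2}$ decays geometrically in $m$ here, the cumulative measure $\Theta_{m,2}$ is finite and the DAN is finite for every polynomial rate $A$, i.e., the process is short-memory in the sense above.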

**Assumption 2** (Dependence and moment conditions)**.** *Let* $\mathbf{X}^\circ_i(t)$ *be defined as in (1) and* $\mathbf{X}_i$ *as in (2). There exist* $q > 2$ *and* $A > 0$ *such that*

$$\nu_{2q} := \sup_{t \in [0,1]} \max_{1 \le j \le p} \mathbb{E} |X^\circ_{1j}(t)|^{2q} < \infty \qquad \text{and} \qquad N_{\mathbf{X}, 2q} := \max_{1 \le j \le p} \|X_{\cdot,j}\|_{2q, A} < \infty. \tag{4}$$

We let $M_{X,q} := \big(\sum_{1\le j\le p} \|X_{\cdot,j}\|^q_{2q,A}\big)^{1/q}$ and write $N_X = N_{\mathbf{X},4}$, $M_X = M_{X,2}$. The quantities $M_{X,q}$ and $N_{\mathbf{X},2q}$ measure the aggregated $L^q$-norm effect and the largest effect of the element-wise DANs, respectively. Both quantities play a role in the convergence rates of our estimator.

Obviously, we have $\|X_{ij} - X_{ij,\{m\}}\|_a \le \theta_{m,a,j}$ and $\max_{1\le j\le p} \mathbb{E}|X_{ij}|^{2q} \le \nu_{2q}$ for all $1 \le i \le n$. In contrast to other works on high-dimensional covariance matrix and network estimation, where sub-Gaussian tails and independence are the keys to consistent estimation, Assumption 2 only requires that the time series have a finite polynomial moment, and it allows linear and nonlinear processes with short memory in the time domain.

**Example 1** (Vector linear process)**.** *Consider the following vector linear process model*

$$\mathbf{H}(\mathcal{F}_i; t) = \sum_{m=0}^{\infty} A_m(t)\, \varepsilon_{i-m},$$

*where* $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{ip})^\top$ *and the* $\varepsilon_{ij}$ *are i.i.d. with mean* 0 *and variance* 1*, and* $\|\varepsilon_{ij}\|_q \le C_q$ *for each* $i \in \mathbb{Z}$ *and* $1 \le j \le p$ *with some constants* $q > 2$ *and* $C_q > 0$*. The vector linear process is commonly seen in the literature and applications [63]. It includes the time-varying VAR model, where* $A_m(t) = A(t)^m$*, as a special case.*

*Suppose that the coefficient matrices* $A_m(t) = (a_{m,jk}(t))_{1\le j,k\le p}$, $m = 0, 1, \ldots$ *satisfy the following conditions.*

*(A1) For each* $1 \le j, k \le p$, $a_{m,jk}(t) \in \mathcal{C}^2[0,1]$*.*


*Note that*

$$\begin{aligned} \sigma_{jk}(t) &= \sum_{m \ge 0} A_{m,j\cdot}(t)\, A_{m,k\cdot}^\top(t), \\ \Theta_{m,q,j} &\le 2C_q \sqrt{q-1}\, \sup_{0 \le t \le 1} \sum_{i=m}^{\infty} \big(A_{i,j\cdot}(t) A_{i,j\cdot}^\top(t)\big)^{1/2}, \\ \|X^\circ_{ij}(t) - X^\circ_{ij}(t')\|_2^2 &= \sum_{m=0}^{\infty} \sum_{k=1}^{p} [a_{m,jk}(t) - a_{m,jk}(t')]^2, \end{aligned}$$

*where* $A_{m,j\cdot}(t)$ *is the $j$th row of* $A_m(t)$*. Under conditions (A1)–(A3), one can easily verify that for each* $1 \le j, k \le p$ *the process satisfies: (1)* $\sigma_{jk}(t) \in \mathcal{C}^2[0,1]$*; (2)* $\|X_{\cdot,j}\|_{q,A} \le C_q \sqrt{q-1}\, C_{A,j}$ *(due to Burkholder's inequality, cf. [64]); (3)* $|H_j(\mathcal{F}_0; t) - H_j(\mathcal{F}_0; t')| \le L|t - t'|$*.*

*Conditions (A1)–(A3) implicitly impose smoothness of each entry of the coefficient matrices, sparsity of each column and of its evolution, and a polynomial decay rate in the lag $m$ of each entry and its derivative.*

For $1 \le l \le \iota$, let $\delta_{jk}(t^{(l)}) := \sigma_{jk}(t^{(l)}) - \sigma_{jk}(t^{(l)}-)$ and $\Delta(t^{(l)}) = \big(\delta_{jk}(t^{(l)})\big)_{1\le j,k\le p}$, where $\sigma_{jk}(t^{(l)}-) = \lim_{t \to t^{(l)}-} \sigma_{jk}(t)$ is well-defined in view of (3). We assume that the change points are separated and sizeable.

**Assumption 3** (Separability and sizeability of change points)**.** *There exist positive constants* $c_1 \in (0,1)$ *and* $c_2 > 0$*, independent of* $n$ *and* $p$*, such that* $\min_{0\le l\le \iota}(t^{(l+1)} - t^{(l)}) \ge c_1$ *and* $\min_{1\le l\le \iota} \delta(t^{(l)}) \ge c_2$*, where* $\delta(t^{(l)}) := |\Delta(t^{(l)})|_\infty$*.*

In the high-dimensional context, we assume that the inverse covariance matrices are sparse in the sense that their $L_1$ norms are bounded.

**Assumption 4** (Sparsity of precision matrices)**.** *The precision matrix satisfies* $|\Omega(t)|_{L_1} \le \kappa_p$ *for each* $t \in [0,1]$*, where* $\kappa_p$ *is allowed to grow with* $p$*.*

If we further assume that the eigenvalues of the covariance matrices are bounded from below and above, i.e., there exists a constant $0 < c < 1$ such that $c \le \inf_{t\in[0,1]} |\Sigma(t)|_2 \le \sup_{t\in[0,1]} |\Sigma(t)|_2 \le c^{-1}$, then the covariance and precision matrices are well-conditioned. In particular, since $|\Omega(t) - \Omega(t')|_2 \le c^{-2}|\Sigma(t) - \Sigma(t')|_2$, a small perturbation in the covariance matrix guarantees a small change of the same order in the precision matrix under the spectral norm.
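The perturbation bound follows from the exact identity $\Omega - \Omega' = \Omega(\Sigma' - \Sigma)\Omega'$ and submultiplicativity of the spectral norm. A quick numerical sanity check (on an arbitrary well-conditioned matrix, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 6
Q = rng.standard_normal((p, p))
Sigma = Q @ Q.T / p + np.eye(p)                 # well-conditioned covariance
Pert = rng.standard_normal((p, p))
Pert = 1e-3 * (Pert + Pert.T)                   # small symmetric perturbation
Sigma2 = Sigma + Pert

Om, Om2 = np.linalg.inv(Sigma), np.linalg.inv(Sigma2)
# |Omega - Omega'|_2 <= |Omega|_2 |Omega'|_2 |Sigma - Sigma'|_2
lhs = np.linalg.norm(Om - Om2, 2)
bound = np.linalg.norm(Om, 2) * np.linalg.norm(Om2, 2) * np.linalg.norm(Pert, 2)
print(lhs <= bound + 1e-12)
```

Under the eigenvalue assumption, $|\Omega|_2\,|\Omega'|_2 \le c^{-2}$, which gives the bound stated above.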

#### **3. Method: Change Point Estimation and Support Recovery**

In graphical models (such as the Gaussian graphical model or the partial correlation graph), network structures related to correlations or partial correlations are second-order characteristics of the data distribution. Specifically, the existence of edges coincides with non-zero entries of the inverse covariance matrix. We consider the dynamics of time series with both structural breaks and smooth changes. The piecewise stochastic Lipschitz continuity in Definition 1 allows the time series to have discontinuities in the covariance matrix function at the time points $t^{(l)}$, $l = 1, \ldots, \iota$ (i.e., change points), while only smooth changes (i.e., twice continuous differentiability of the covariance matrix function, as in Assumption 1) can occur between the change points.

In the presence of change points, we must first remove them before applying any smoothing procedure, since $|\Omega(t) - \Omega(t-)|_\infty \ge |\Sigma(t)|_{L_1}^{-1} |\Sigma(t-)|_{L_1}^{-1} |\Delta(t)|_\infty$; that is, a non-negligible abrupt change in the covariance matrix results in a substantial change of the graph structure for sparse and smooth covariance matrices. Thus our proposed graph recovery method consists of two steps: change point detection and support recovery.

Let $h \equiv h_n > 0$ be a bandwidth parameter such that $h = o(1)$ and $n^{-1} = o(h)$, and let $\mathcal{D}_h(0) = \{h, h + 1/n, \ldots, 1-h\}$ be a search grid in $(0,1)$. Define

$$D(s) = n^{-1} \left( \sum_{i=0}^{hn-1} \mathbf{X}_{ns-i} \mathbf{X}_{ns-i}^{\top} - \sum_{i=1}^{hn} \mathbf{X}_{ns+i} \mathbf{X}_{ns+i}^{\top} \right), \qquad s \in \mathcal{D}_h(0). \tag{5}$$

To estimate the change points, compute

$$\hat{s}_1 = \operatorname{argmax}_{s \in \mathcal{D}_h(0)} |D(s)|_{\infty}. \tag{6}$$

The following steps are performed recursively. For *l* = 1, 2, . . ., let

$$\mathcal{D}_h(l) = \mathcal{D}_h(l-1) \cap [\hat{s}_l - 2h,\, \hat{s}_l + 2h]^{c}, \tag{7}$$

$$\hat{s}_{l+1} = \operatorname{argmax}_{s \in \mathcal{D}_h(l)} |D(s)|_{\infty} \tag{8}$$

until the following criterion is attained:

$$\max_{s \in \mathcal{D}_h(l)} |D(s)|_{\infty} < \nu, \tag{9}$$

where $\nu$ is an early stopping threshold. The value of $\nu$, determined in Section 4, depends on the dimension and sample size, as well as on the serial dependence level, tail condition, and local smoothness. Since our method only uses data in a localized neighborhood, multiple change points can be estimated and ranked in a single pass, which offers some computational advantage over the binary segmentation algorithm [41,46].
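The single-pass search (5)–(9) can be sketched as follows. This is a toy illustration on a process whose variance jumps at two known times; the bandwidth $h$, threshold $\nu$, and jump sizes are illustrative choices, not the paper's calibrated values.

```python
import numpy as np

def detect_change_points(X, h, nu):
    """Single-pass search: at each grid index, D compares the hn samples of
    X_i X_i^T to the left of the point with the hn samples to the right
    (cf. (5)); peaks of |D|_inf above nu are declared change points, and a
    4h-neighborhood of each declared point is removed before the next pass."""
    n, p = X.shape
    hn = int(h * n)
    grid = np.arange(hn, n - hn)            # indices ns for s in [h, 1-h]
    stat = np.empty(len(grid))
    for g, i in enumerate(grid):
        left, right = X[i - hn:i], X[i:i + hn]
        D = (left.T @ left - right.T @ right) / n
        stat[g] = np.abs(D).max()
    active = np.ones(len(grid), dtype=bool)
    found = []
    while active.any():
        g = np.flatnonzero(active)[np.argmax(stat[active])]
        if stat[g] < nu:                    # early stopping criterion (9)
            break
        found.append(grid[g] / n)
        active &= np.abs(grid - grid[g]) > 2 * hn   # exclusion step (7)
    return sorted(found)

# toy data: all p coordinates have variance jumps at t = 0.3 and t = 0.7
rng = np.random.default_rng(1)
n, p = 4000, 5
X = rng.standard_normal((n, p))
X[int(0.3 * n):int(0.7 * n)] *= 2.0
cps = detect_change_points(X, h=0.1, nu=0.2)
print(cps)
```

Because each peak uses only the $2hn$ observations around it, all change points fall out of one ranked scan rather than a recursive segmentation.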

Once the change points are estimated, in the second step we consider recovering the networks from the locally stationary time series before and after the structural breaks. In [11], where $X_i$, $i = 1, \ldots, n$ are assumed to share an identical covariance matrix, the precision matrix $\hat{\Omega}$ is estimated as

$$\hat{\Omega}_{\lambda} = \arg\min_{\Omega \in \mathbb{R}^{p \times p}} |\Omega|_1 \quad \text{s.t. } |\hat{\Sigma}\Omega - \mathrm{Id}_p|_{\infty} \le \lambda, \tag{10}$$

where Σˆ is the sample covariance matrix. Inspired by (10), we apply a kernelized time-varying (tv-) CLIME estimator for the covariance matrix functions of the multiple pieces of locally stationary processes before and after the structural breaks. Let

$$
\hat{\Sigma}(t) = \sum\_{i=1}^{n} w(t, t\_i) \mathbf{X}\_i \mathbf{X}\_i^\top,\tag{11}
$$

where

$$w(t, t_i) = \frac{K_b(t_i, t)}{\sum_{i'=1}^{n} K_b(t_{i'}, t)} \tag{12}$$

and $K_b(u, v) = K(|u - v|/b)/b$. The bandwidth parameter $b$ satisfies $b = o(1)$ and $n^{-1} = o(b)$. Denote $B_n = nb$. The kernel function $K(\cdot)$ is chosen to satisfy the following properties.

**Assumption 5** (Regularity of kernel function)**.** *The kernel function* $K(\cdot)$ *is non-negative, symmetric, and Lipschitz continuous with bounded support in* $[-1,1]$*, and* $\int_{-1}^{1} K(u)\,du = 1$*.*

Assumption 5 is a common requirement on kernel functions and is fulfilled by a range of kernels, such as the uniform, triangular, and Epanechnikov kernels. Now the tv-CLIME estimator of the precision matrix $\Omega(t)$ is defined by $\tilde{\Omega}(t) = \big(\tilde{\omega}_{jk}(t)\big)_{1\le j,k\le p}$, where $\tilde{\omega}_{jk}(t)$ is the entry of smaller magnitude between $\hat{\omega}_{jk}(t)$ and $\hat{\omega}_{kj}(t)$, and $\hat{\Omega}(t) \equiv \hat{\Omega}_\lambda(t) = (\hat{\omega}_{jk}(t))_{1\le j,k\le p}$,

$$\hat{\Omega}\_{\lambda}(t) = \arg\min\_{\Omega \in \mathbb{R}^{p \times p}} |\Omega|\_1 \quad \text{s.t.} \; |\hat{\Sigma}(t)\Omega - \mathrm{Id}\_p|\_{\infty} \le \lambda. \tag{13}$$

A similar hybrid of kernel smoothing and the CLIME method, for estimating sparse and smooth transition matrices in a high-dimensional VAR model, has been considered in [65], where change points are not considered. Thus, in the current setting, we need to carefully control the effect of (consistently) removing the change points before smoothing.
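To make the kernelized estimator concrete, here is a hedged sketch of (11)–(13): the kernel-weighted covariance, the column-wise constrained $\ell_1$ program solved as a linear program via `scipy.optimize.linprog`, and the min-magnitude symmetrization. The i.i.d. test data, tridiagonal precision matrix, and all tuning values are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(sigma_hat, j, lam):
    """One CLIME column: min |w|_1 s.t. |sigma_hat @ w - e_j|_inf <= lam,
    written as an LP in (u, v) >= 0 with w = u - v."""
    p = sigma_hat.shape[0]
    e = np.zeros(p); e[j] = 1.0
    c = np.ones(2 * p)
    A = np.vstack([np.hstack([sigma_hat, -sigma_hat]),
                   np.hstack([-sigma_hat, sigma_hat])])
    b = np.concatenate([lam + e, lam - e])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

def tv_clime(X, t, b, lam):
    """Kernel-smoothed covariance (11)-(12) with a triangular kernel,
    column-wise CLIME (13), then min-magnitude symmetrization."""
    n, p = X.shape
    u = np.abs(np.arange(1, n + 1) / n - t) / b
    k = np.where(u <= 1, 1 - u, 0.0)            # triangular kernel
    w = k / k.sum()
    sigma_hat = (X * w[:, None]).T @ X
    omega = np.column_stack([clime_column(sigma_hat, j, lam) for j in range(p)])
    pick = np.abs(omega) <= np.abs(omega.T)     # keep smaller-magnitude entry
    return np.where(pick, omega, omega.T)

rng = np.random.default_rng(2)
p = 10
prec = np.eye(p) + 0.4 * (np.diag(np.ones(p - 1), 1) + np.diag(np.ones(p - 1), -1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(prec), size=2000)
omega_hat = tv_clime(X, t=0.5, b=0.4, lam=0.1)
err = np.abs(omega_hat - prec).max()
print(err)
```

The i.i.d. check only exercises the mechanics; in the paper's setting the same smoother is applied to locally stationary segments between estimated change points.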

Then, the network is estimated by the "effective support" defined as follows.

$$\hat{G}(t;u) = \big(\hat{g}_{jk}(t;u)\big)_{1 \le j,k \le p}, \quad \text{where } \hat{g}_{jk}(t;u) = \mathbb{I}\left\{ |\tilde{\omega}_{jk}(t)| \ge u \right\}. \tag{14}$$

It should be noted that the (vanilla) kernel smoothing estimator (11) of the covariance matrix does not adjust for the boundary effect due to the change points in the covariance matrix function. Thus, in the neighborhood of the change points, a larger bias can be induced in estimating $\Sigma(t)$ by $\hat{\Sigma}(t)$. As a remedy, we apply the following reflection procedure for boundary correction. Denote $\hat{\mathcal{T}}_d(j) := [\hat{s}_j - d, \hat{s}_j + d)$ for $d \in (0,1)$, and suppose $t \in \hat{\mathcal{T}}_{b+h^2}(j)$ for some $1 \le j \le \hat{\iota}$. We replace (11) by

$$\hat{\Sigma}(t) = \sum_{i=1}^{n} w(t, t_i)\, \tilde{\mathbf{X}}_i \tilde{\mathbf{X}}_i^\top,$$

and then apply the rest of the tv-CLIME approach. Here

$$\tilde{\mathbf{X}}_i = \begin{cases} \mathbf{X}_i & \text{if } (i - n\hat{s}_j)(nt - n\hat{s}_j) \ge 0, \\ \mathbf{X}_{2\lceil n\hat{s}_j \rceil - i} & \text{otherwise}, \end{cases}$$

that is, observations on the opposite side of the estimated change point from $t$ are replaced by their mirror images about the change point.
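The reflection rule above can be sketched as follows; the index arithmetic and the clipping at the sample edge are illustrative choices consistent with the rule, not the paper's exact implementation.

```python
import numpy as np

def reflect_about_cp(X, s_hat, t):
    """Boundary-corrected sample: rows on the opposite side of the estimated
    change point s_hat from the evaluation time t are replaced by their mirror
    images about the change point, so the kernel window at t only ever
    averages data from one regime."""
    n = X.shape[0]
    cp = int(round(s_hat * n))
    idx = np.arange(n)
    same_side = (idx - cp) * (int(round(t * n)) - cp) >= 0
    mirror = np.clip(2 * cp - idx, 0, n - 1)   # reflect indices across cp
    return np.where(same_side[:, None], X, X[mirror])

X = np.arange(10, dtype=float).reshape(-1, 1)  # rows 0..9, "jump" at row 6
Xr = reflect_about_cp(X, s_hat=0.6, t=0.8)     # t is right of the change point
print(Xr.ravel())
```

Rows on the same side as $t$ are untouched; rows on the far side are overwritten by post-change observations, which is what removes the boundary bias in $\hat{\Sigma}(t)$ near the jump.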

#### **4. Theoretical Results**

In this section, we derive theoretical guarantees for the change point estimation and graph support recovery. Roughly speaking, Propositions 1 and 2 below show that, under appropriate conditions, if each element of the covariance matrix varies smoothly in time, one can obtain an accurate snapshot estimate of the precision matrices, as well as of the time-varying graphs, with high probability via the proposed kernel-smoothed constrained $\ell_1$ minimization approach.

Define $J_{q,A}(n,p) = M_{X,q}\big(p\,\varrho_{q,A}(n)\big)^{1/q}$, where $\varrho_{q,A}(n) = n$, $n(\log n)^{1+2q}$, or $n^{q/2 - Aq}$ if $A > 1/2 - 1/q$, $A = 1/2 - 1/q$, or $0 < A < 1/2 - 1/q$, respectively.

**Proposition 1** (Rate of convergence for estimating precision matrices: pointwise and uniform)**.** *Suppose Assumptions 2, 4, and 5 hold with ι* = 0*. Let Bn* = *bn for n*−<sup>1</sup> = *o*(*b*) *and b* = *o*(1)*.*

*(i)* **Pointwise.** *Choose the parameter* $\lambda^\circ \ge C\kappa_p\big(b^2 + B_n^{-1} J_{q,A}(B_n, p) + N_X (\log p / B_n)^{1/2}\big)$ *in the tv-CLIME estimator* $\hat{\Omega}_{\lambda^\circ}(t)$ *in (13), where* $C$ *is a sufficiently large constant independent of* $n$ *and* $p$*. Then for any* $t \in [b, 1-b]$*, we have*

$$|\hat{\Omega}_{\lambda^\circ}(t) - \Omega(t)|_{\infty} = O_{\mathbb{P}}(\kappa_p \lambda^\circ). \tag{16}$$

*(ii)* **Uniform.** *Choose* $\lambda^\ast \ge C\kappa_p\big(b^2 + B_n^{-1} J_{q,A}(n, p) + N_X B_n^{-1} (n \log p)^{1/2}\big)$ *in the tv-CLIME estimator* $\hat{\Omega}_{\lambda^\ast}(t)$ *in (13), where* $C$ *is a sufficiently large constant independent of* $n$ *and* $p$*. Then we have*

$$\sup_{t \in [b, 1-b]} |\hat{\Omega}_{\lambda^\ast}(t) - \Omega(t)|_{\infty} = O_{\mathbb{P}}(\kappa_p \lambda^\ast). \tag{17}$$

The optimal order of the bandwidth parameter $b = b_\sharp$ in (17) is the solution to the equation

$$b^2 = B_n^{-1} \max\big(J_{q,A}(n,p),\; N_X (n\log p)^{1/2}\big),$$

which implies that the closed-form expression for $b_\sharp$ is

$$b_{\sharp} = C_1\big(n^{-1} J_{q,A}(n,p)\big)^{1/3} + C_2 N_X^{1/3}\, n^{-1/6} (\log p)^{1/6}$$

for some constants *C*<sup>1</sup> and *C*<sup>2</sup> that are independent of *n* and *p*.

Given a finite sample, distinguishing the small entries in the precision matrix from noise is challenging. Since a smaller magnitude of an element of the precision matrix implies a weaker connection of the corresponding edge in the graphical model, we instead consider the estimation of *significant* edges in the graph. Define the set of *significant* edges at level $u$ as $\mathcal{E}^*(t; u) = \big\{(j,k) : g^*_{jk}(t; u) \ne 0\big\}$, where

$$g^*_{jk}(t;u) = \mathbb{I}\left\{|\omega_{jk}(t)| > u\right\}.$$

Then, as a consequence of (17), we have the following support recovery consistency result.

**Proposition 2** (Consistency of support recovery: significant edges)**.** *Choose* $u = u_\sharp := C_0 \kappa_p^2 b_\sharp^2$*, where* $C_0$ *is a sufficiently large constant independent of* $n$ *and* $p$*. Suppose that* $u_\sharp = o(1)$ *as* $n, p \to \infty$*. Then, under the conditions of Proposition 1, we have that as* $n, p \to \infty$*,*

$$\mathbb{P}\left(\sup_{t\in[b,1-b]}\sum_{(j,k)\in\mathcal{E}^{c}(t)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_{\sharp})\neq 0\right\}\neq 0\right)\to 0,\tag{18}$$

$$\mathbb{P}\left(\sup_{t\in[b,1-b]}\sum_{(j,k)\in\mathcal{E}^*(t;2u_\sharp)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_\sharp)=0\right\}\neq 0\right)\to 0.\tag{19}$$

Proposition 2 shows that the pattern of significant edges in the time-varying true graphs $G(t)$, $t \in [b, 1-b]$, can be correctly recovered with high probability. However, it remains an open question to what extent the edges with magnitude below $u_\sharp$ can be consistently estimated; this can be naturally studied in the multiple hypothesis testing framework. Nonetheless, hypothesis testing for graphical models on nonstationary high-dimensional time series is rather challenging, and we leave it as a future problem.

Propositions 1 and 2 together yield that consistent estimation of the precision matrices and of the graphs can be achieved before and after the change points. Now we provide the theoretical result for the change point estimation. Theorem 1 below shows that if the change points are separated and sizeable, then we can consistently identify them via the single-pass segmentation approach under suitable conditions. Denote

$$h_{\diamond} = C_1\big(n^{-1} J_{q,A}(n,p)\big)^{1/3} + C_2 N_X^{1/3}\, n^{-1/6} (\log p)^{1/6},$$

where *C*<sup>1</sup> and *C*<sup>2</sup> are constants independent of *n* and *p*.

**Theorem 1** (Consistency of change point estimation)**.** *Assume* $\mathbf{X}_i \in \mathbb{R}^p$ *admits the form (2). Suppose that Assumptions 2 and 3 are satisfied. Choose the bandwidth* $h = h_\diamond$ *and* $\nu = (1+L)h_\diamond^2$ *in (5) and (9), respectively. Assume that* $h_\diamond = o(1)$ *as* $n, p \to \infty$*. Then there exist constants* $C_1, C_2, C_3$ *independent of* $n$ *and* $p$*, such that*

$$\mathbb{P}(|\hat{\iota}-\iota|>0) \leq C_{1} \Big( \frac{p\,\varrho_{q,A}(n)\, M_{X,q}^{q}\,\nu_{2q}^{q}}{n^{q} c_{2}^{q}} \Big)^{1/3} + C_{2}\, p^{2}\exp\left\{-C_{3}\Big(\frac{n\log^{2}(p)}{N_{X}^{2}}\Big)^{1/3}\right\}.\tag{20}$$

*Furthermore, on the event* $\{\hat{\iota} = \iota\}$*, the ordered change point estimators* $\hat{s}_{(1)} < \hat{s}_{(2)} < \cdots < \hat{s}_{(\hat{\iota})}$ *defined in (6) and (8) satisfy*

$$\max_{1 \le j \le \iota} |\hat{s}_{(j)} - t^{(j)}| = O_{\mathbb{P}}(h_{\diamond}^2). \tag{21}$$

Proposition 2 and Theorem 1 together indicate the consistency of the snapshot estimation of the time-varying graphs before and after the change points. In a close neighborhood of the change points, we have the following result for the recovery of the time-varying network. Denote by $\mathcal{S} := [b, 1-b] \cap \big(\cup_{1\le j\le \hat{\iota}} \hat{\mathcal{T}}_{h_\diamond^2+b}(j)\big)^c$ the time intervals between the estimated change points, and by $\mathcal{N} := [0, b) \cup \big(\cup_{1\le j\le \hat{\iota}} (\hat{\mathcal{T}}_{h_\diamond^2+b}(j) \cap \hat{\mathcal{T}}^c_{h_\diamond^2}(j))\big) \cup (1-b, 1]$ the recoverable neighborhood of the jumps.

**Theorem 2.** *Let Assumptions 2 to 5 be satisfied. We have the following results as n*, *p* → ∞*.*

*(i)* **Between change points.** *For* $t \in \mathcal{S}$*, take* $b = b_\sharp$ *and* $u = u_\sharp$*, where* $b_\sharp$ *and* $u_\sharp$ *are defined in Proposition 2. Suppose* $u_\sharp = o(1)$*. We have*

$$\sup_{t \in \mathcal{S}} \max_{j,k} |\hat{\sigma}_{jk}(t) - \sigma_{jk}(t)| = O_{\mathbb{P}}(b_{\sharp}^2). \tag{22}$$

*Choose the penalty parameter as* $\lambda_\sharp := C_1 \kappa_p b_\sharp^2$*, where* $C_1$ *is a constant independent of* $n$ *and* $p$*. Then*

$$\sup\_{t \in \mathcal{S}} |\hat{\Omega}\_{\lambda\_{\sharp}}(t) - \Omega(t)|\_{\infty} = O\_{\mathbb{P}}(\kappa\_p^2 b\_{\sharp}^2).$$

*Moreover,*

$$\mathbb{P}\left(\sup_{t\in\mathcal{S}}\sum_{(j,k)\in\mathcal{E}^{c}(t)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_{\sharp})\neq 0\right\}=0\right)\to 1,\tag{23}$$

$$\mathbb{P}\left(\sup_{t\in\mathcal{S}}\sum_{(j,k)\in\mathcal{E}^*(t;2u_\sharp)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_{\sharp})=0\right\}=0\right)\to 1.\tag{24}$$

*(ii)* **Around change points.** *For* $t \in \mathcal{N}$*, take* $b = b_\star := C_1\big(n^{-1} J_{q,A}(n,p)\big)^{1/2} + C_2 N_X^{1/2}\, n^{-1/4} (\log p)^{1/4}$ *and* $u = u_\star := C_0 \kappa_p^2 b_\star$*, where* $C_0$*,* $C_1$*, and* $C_2$ *are constants independent of* $n$ *and* $p$*. Suppose* $u_\star = o(1)$*. We have*

$$\sup_{t \in \mathcal{N}} \max_{j,k} |\hat{\sigma}_{jk}(t) - \sigma_{jk}(t)| = O_{\mathbb{P}}(b_{\star}).$$

*Choose the penalty parameter as* $\lambda_\star := C_1 \kappa_p b_\star$*, where* $C_1$ *is a constant independent of* $n$ *and* $p$*. Then*

$$\sup\_{t \in \mathcal{N}} |\hat{\Omega}\_{\lambda\_\star}(t) - \Omega(t)|\_{\infty} = O\_\mathbb{P}(\kappa\_p^2 b\_\star). \tag{25}$$

*Moreover,*

$$\mathbb{P}\left(\sup_{t\in\mathcal{N}}\sum_{(j,k)\in\mathcal{E}^c(t)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_\star)\neq 0\right\}=0\right)\to 1,\tag{26}$$

$$\mathbb{P}\left(\sup_{t\in\mathcal{N}}\sum_{(j,k)\in\mathcal{E}^*(t;2u_\star)}\mathbb{I}\left\{\hat{g}_{jk}(t;u_\star)=0\right\}=0\right)\to 1.\tag{27}$$

Note that the convergence rates for the covariance and precision matrix entries in case (ii), around the jump locations, are slower than those for points well separated from the jump locations in case (i). This is because, on the boundary, the smoothness condition may no longer hold after the reflection; indeed, we only exploit the Lipschitz continuity of the covariance matrix function there. Thus we lose one degree of regularity, and the bias term $b_\sharp^2$ in the convergence rate for the between-jump region becomes $b_\star$ around the jumps. We also note that in the smaller neighborhood of the jumps $\mathcal{J} := \cup_{1\le j\le \hat{\iota}} \hat{\mathcal{T}}_{h_\diamond^2}(j)$, due to the larger error in the change point estimation, consistent recovery of the graphs is not achievable.

#### **5. A Simulation Study**

We simulate data from the following multivariate time series model:

$$X_i = \sum_{m=0}^{100} A_m(i)\, \varepsilon_{i-m}, \qquad i = 1, \ldots, n,$$

where $A_m(i) \in \mathbb{R}^{p\times p}$, $1 \le m \le 100$, $1 \le i \le n$, and $\varepsilon_{i-m} = (\varepsilon_{i-m,1}, \ldots, \varepsilon_{i-m,p})^\top$, with the $\varepsilon_{m,k}$, $m \in \mathbb{Z}$, $k = 1, \ldots, p$, generated as i.i.d. standardized $t(8)$ random variables. In the simulation, we fix $n = 1000$ and vary $p = 50$ and $p = 100$. For each $m = 1, \ldots, 100$, the coefficient matrices are $A_m(i) = (1+m)^{-\beta} B_m(i)$, where $\beta = 1$ and $B_m(1)$ is a $p \times p$ block diagonal matrix. The $5 \times 5$ diagonal blocks in $B_m(i)$ are fixed with i.i.d. $N(0,1)$ entries, and all the other entries are 0.
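The baseline data-generating process (before the jumps and coefficient evolution are added) can be sketched as follows. For brevity this sketch freezes the coefficients in time and shortens the lag horizon to $M = 20$; the dimensions and seed are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, M, beta = 200, 10, 20, 1.0   # shortened horizon M for illustration

def block_diag_noise():
    """Block diagonal B_m with 5x5 i.i.d. N(0,1) blocks, zeros elsewhere."""
    B = np.zeros((p, p))
    for s in range(0, p, 5):
        B[s:s + 5, s:s + 5] = rng.standard_normal((5, 5))
    return B

# A_m = (1+m)^{-beta} B_m, m = 0..M
A = [(1 + m) ** (-beta) * block_diag_noise() for m in range(M + 1)]

# standardized t(8) innovations: Var(t_8) = 8/6, so divide by its sqrt
eps = rng.standard_t(df=8, size=(n + M, p)) / np.sqrt(8 / 6)

# X_i = sum_{m=0}^{M} A_m eps_{i-m}
X = np.stack([sum(A[m] @ eps[M + i - m] for m in range(M + 1))
              for i in range(n)])
print(X.shape)
```

The block-diagonal $B_m$ keeps the effective dimension of the network fixed as $p$ grows, which is why the ROC curves reported below are similar for $p = 50$ and $p = 100$.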

We set the number of abrupt changes to $\iota = 2$ with $(nt^{(1)}, nt^{(2)}) = (300, 650)$. The matrix $A_0(i)$ is set to be a zero matrix for $i = 1, 2, \ldots, 299$, while $A_0(i) = A_0(299) + \alpha\alpha^\top$ for $i = 300, 301, \ldots, 649$, and $A_0(i) = A_0(649) - \alpha\alpha^\top$ for $i = 650, 651, \ldots, 1000$, where the first 20 entries of $\alpha$ are taken to be a constant $\delta_0$ and the others are 0.

We let the coefficient matrices $A_1(i) = \{a_{1,jk}(i)\}_{1\le j,k\le p}$ evolve at each time point, such that two entries are soft-thresholded and another two elements increase. Specifically, at time $i$, we randomly select two elements from the support of $A_1(i)$, denoted $a_{1,j_l k_l}(i)$, $l = 1, 2$, with $a_{1,j_l k_l}(i) \ne 0$, and set them to $\operatorname{sign}\big(a_{1,j_l k_l}(i)\big)\big(|a_{1,j_l k_l}(i)| - 0.05\big)_+$. We also randomly select two elements from $A_1(i)$ and increase their values by 0.03.

Figures 1 and 2 show the support of the true covariance matrices at *i* = 100, 200, . . . , 900.

In detecting the change points, the cutoff value $\nu$ is chosen as follows. After removing the neighborhoods of the detected change points, we obtain the ordered peak heights $\mathcal{D}_h^{(1)} \ge \mathcal{D}_h^{(2)} \ge \cdots \ge \mathcal{D}_h^{(\mathfrak{l})}$, where $\mathfrak{l}$ is the number of steps obtained from (9) with $\nu = 0$. For $l = 1, 2, \ldots, \mathfrak{l} - 1$, compute

$$\mathcal{R}\_h^{(l)} = \frac{\mathcal{D}\_h^{(l)}}{\mathcal{D}\_h^{(l+1)}}.$$

We let $\hat{\iota} = \arg\max_{1 \le l \le \mathfrak{l}-1} \mathcal{R}_h^{(l)}$ and set $\nu = \mathcal{D}_h^{(\hat{\iota})}$.
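The ratio rule can be illustrated with hypothetical peak heights (the numbers below are made up for illustration): two genuine jumps followed by noise-level peaks, with the largest consecutive ratio marking the cut between signal and noise.

```python
import numpy as np

# ordered peak heights |D|_inf from a zero-threshold run (illustrative values)
D = np.array([0.92, 0.61, 0.08, 0.07, 0.05])

ratios = D[:-1] / D[1:]                 # R_h^{(l)} = D_h^{(l)} / D_h^{(l+1)}
iota_hat = int(np.argmax(ratios)) + 1   # number of declared change points
nu = D[iota_hat - 1]                    # threshold = smallest retained peak
print(iota_hat, nu)
```

Here the ratio 0.61/0.08 dominates, so two change points are declared and the stopping threshold is set at the second peak height.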

**Figure 1.** Support of the true covariance matrices, *p* = 50.

**Figure 2.** Support of the true covariance matrices, *p* = 100.

We report the number of estimated jumps and the average absolute estimation error, where the latter is the mean distance between the estimated and the true change points. As shown in Tables 1 and 2, there is an apparent improvement in the estimation accuracy as the jump magnitude increases and the dimension decreases. The detection is relatively robust to the choice of bandwidth.



**Table 2.** Number of estimated change points.

We evaluate the support recovery performance of the time-varying CLIME on the lattice $100, 200, \ldots, 900$ with $\lambda = 0.02, 0.06, 0.1$. We take the uniform kernel and fix the bandwidth at 0.2. At each time point $t_0$, two quantities are computed, sensitivity and specificity, defined as:

$$\begin{split} \text{sensitivity} &= \frac{\sum_{1 \le j,k \le p} \mathbb{I}\{\hat{g}_{jk}(t_0;u) \ne 0,\; g^*_{jk}(t_0;u) \ne 0\}}{\sum_{1 \le j,k \le p} \mathbb{I}\{g^*_{jk}(t_0;u) \ne 0\}},\\ \text{specificity} &= \frac{\sum_{1 \le j,k \le p} \mathbb{I}\{\hat{g}_{jk}(t_0;u) = 0,\; g^*_{jk}(t_0;u) = 0\}}{\sum_{1 \le j,k \le p} \mathbb{I}\{g^*_{jk}(t_0;u) = 0\}}. \end{split}$$

We plot the receiver operating characteristic (ROC) curve, that is, sensitivity against 1 − specificity. From Figures 3 and 4 we observe that, due to the screening step, the support recovery is robust to the choice of $\lambda$, except at the change points, where a non-negligible estimation error of the covariance matrix is induced and the overall estimation is less accurate. As the effective dimension of the network remains the same for $p = 50$ and $p = 100$ by the construction of the coefficient matrices $A_m(i)$, there is no significant difference between the ROC curves at the two dimensions.
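The sensitivity and specificity defined above reduce to simple entry counts over the estimated and true supports; a minimal sketch on a hypothetical $3 \times 3$ pair of supports:

```python
import numpy as np

def sens_spec(G_hat, G_true):
    """Sensitivity and specificity of a recovered support against the truth."""
    est, tru = G_hat != 0, G_true != 0
    sens = (est & tru).sum() / tru.sum()       # recovered true edges
    spec = (~est & ~tru).sum() / (~tru).sum()  # correctly excluded non-edges
    return sens, spec

# illustrative supports (not from the simulation)
G_true = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
G_hat  = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 1]])
s, c = sens_spec(G_hat, G_true)
print(s, c)
```

Sweeping the threshold $u$ (or the penalty $\lambda$) and plotting sensitivity against $1 - $ specificity traces out the ROC curves of Figures 3 and 4.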

**Figure 3.** ROC curve of the time-varying CLIME, *p* = 50.

**Figure 4.** ROC curve of the time-varying CLIME, *p* = 100.

#### **6. A Real Data Application**

Understanding the interconnections among financial entities and how they vary over time provides investors and policy makers with insights into risk control and decision making. Reference [66] presents a comprehensive study of the applications of network theory in financial systems. In this section, we apply our method to a real financial dataset from Yahoo! Finance (finance.yahoo.com). The data matrix contains the daily closing prices of 420 stocks that remained in the S&P 500 index from 2 January 2002 through 30 December 2011. In total, there are $n = 2519$ time points. We select the 100 stocks with the largest volatility and consider their log-returns; that is, for $j = 1, \ldots, 100$,

$$X_{ij} = \log\left(p_{i+1,j}/p_{ij}\right),$$

where $p_{ij}$ is the daily closing price of stock $j$ at time point $i$. We first compute the statistics (5) and (6) for change point detection and look at the top three statistics for different bandwidths. For bandwidth $h = n^{-1/5} \approx 0.21$, we rank the test statistic and find that the location of the top change point is 7 February 2008 ($n\hat{s}_1 = 1536$), as shown in Figure 5. The detected change point is quite robust to a variety of bandwidth choices. Our result is partially consistent with the change point detection method in [48]. In particular, the two breaks in 2006 and 2007 were also found in [48], and it is conjectured that the 2007 break may be associated with the U.S. housing market collapse. Meanwhile, it is interesting to observe the increased volatility before the 2008 financial crisis.

**Figure 5.** Break size $|D(s)|_\infty$ from 4 February 2004 to 30 November 2009.

Next, we estimate the time-varying networks before and after the change point at 26 May 2006, which has the largest jump size. Specifically, we look at four time points, 813, 828, 888, and 903, corresponding to 23 March 2006, 13 April 2006, 11 July 2006, and 1 August 2006. We use tv-CLIME (13) with the Epanechnikov kernel and the same bandwidth as in the change point detection to estimate the networks at the four points. The optimal tuning parameter $\lambda$ is selected automatically according to the stability approach [67]. It is observed that the first two time points (813 and 828) and the last two (888 and 903) are more similar to each other than across the change point at time 858. The estimated networks are shown in Figure 6; networks in the first and second rows are estimated before and after the estimated change point at time 858, respectively. At each time point, companies in the same sector tend to cluster together, such as the companies in the Energy sector: OXY, NOV, TSO, MRO, and DO (highlighted in cyan). In addition, the distance matrix of the estimated networks, counting the number of differing edges between each pair of the four time points, is estimated as

$$
\begin{pmatrix}
0 & 332 & 350 & 396 \\
332 & 0 & 394 & 428 \\
350 & 394 & 0 & 234 \\
396 & 428 & 234 & 0
\end{pmatrix}.
$$

**Figure 6.** Estimated networks at time points 813, 828, 888, and 903, corresponding to 23 March 2006, 13 April 2006, 11 July 2006, and 1 August 2006. Colors correspond to the nine sectors in the S&P dataset.

#### **7. Proof of Main Results**

#### *7.1. Preliminary Lemmas*

**Lemma 1.** *Let* $(Y_i)_{i\in\mathbb{Z}}$ *be a sequence that admits (2). Assume* $Y_i \in \mathcal{L}^q$ *for* $i = 1, 2, \ldots$*, and that the dependence adjusted norm (DAN) of the corresponding underlying array* $(Y^\circ_i(t))$ *satisfies* $\|Y_\cdot\|_{q,A} < \infty$ *for* $q > 2$ *and* $A > 0$*. Let* $(w(t, t_i))_{i=1}^n$ *be defined in (12) and suppose that the kernel function* $K(\cdot)$ *satisfies Assumption 5. Denote* $\varrho_{q,A}(n) = n$, $n(\log n)^{1+2q}$, *or* $n^{q/2-Aq}$ *if* $A > 1/2 - 1/q$, $A = 1/2 - 1/q$, *or* $0 < A < 1/2 - 1/q$*, respectively. Then there exist constants* $C_1$, $C_2$, *and* $C_3$ *independent of* $n$*, such that for all* $x > 0$*,*

$$\sup\_{t \in (0,1)} \mathbb{P}\left( \left| \sum\_{i=1}^{n} w(t, t\_i) \left( Y\_i - \mathbb{E}(Y\_i) \right) \right| > x \right) \leq C\_1 \frac{\mathcal{O}\_{q,A}(B\_n) \left\| Y\_\cdot \right\|\_{q,A}^q}{B\_n^q x^q} + C\_2 \exp\left( \frac{-C\_3 B\_n x^2}{\left\| Y\_\cdot \right\|\_{2,A}^2} \right), \tag{28}$$

$$\mathbb{P}\left(\sup\_{t\in(0,1)}\left|\sum\_{i=1}^{n}w(t,t\_{i})\left(Y\_{i}-\mathbb{E}(Y\_{i})\right)\right|>x\right)\leq C\_{1}\frac{\mathcal{O}\_{q,A}(n)\left\|Y\_\cdot\right\|\_{q,A}^{q}}{B\_{n}^{q}x^{q}}+C\_{2}\exp\left(\frac{-C\_{3}B\_{n}^{2}x^{2}}{n\left\|Y\_\cdot\right\|\_{2,A}^{2}}\right).\tag{29}$$

**Proof.** Let $S\_i = \sum\_{j=1}^{i}\left(Y\_j - \mathbb{E}(Y\_j)\right)$. Note that

$$\begin{aligned} \sup\_{t \in (0,1)} \left| \sum\_{i=1}^n w(t, t\_i) \left(Y\_i - \mathbb{E}(Y\_i)\right) \right| &= \sup\_{t \in (0,1)} \left| \sum\_{i=1}^n w(t, t\_i) (S\_i - S\_{i-1}) \right| \\ &\leq \sup\_t \left| \sum\_{i=1}^{n-1} \left( w(t, t\_i) - w(t, t\_{i+1}) \right) S\_i \right| + \sup\_t |w(t, t\_n) S\_n| \\ &\lesssim B\_n^{-1} \max\_{1 \leq i \leq n} |S\_i|, \end{aligned}$$

where the last inequality follows from the fact that $\sup\_t \sum\_{i=1}^{n-1} |w(t, t\_i) - w(t, t\_{i+1})| \lesssim B\_n^{-1}$, due to Assumption 5.
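The first inequality in the display above is summation by parts (Abel's identity) applied to the weights $w(t, t\_i)$. A quick numerical check of that identity, with our own helper name:

```python
def abel(w, y):
    """Check summation by parts: with S_0 = 0 and S_i the partial sums of y,
    sum_i w_i (S_i - S_{i-1}) = sum_{i<n} (w_i - w_{i+1}) S_i + w_n S_n."""
    n = len(y)
    S = [0.0]
    for v in y:
        S.append(S[-1] + v)  # S[i] holds the i-th partial sum
    lhs = sum(w[i] * (S[i + 1] - S[i]) for i in range(n))
    rhs = sum((w[i] - w[i + 1]) * S[i + 1] for i in range(n - 1)) + w[n - 1] * S[n]
    return lhs, rhs
```

Since $S\_i - S\_{i-1} = y\_i$, the left-hand side is just $\sum\_i w\_i y\_i$, so both sides agree for any weights and data.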

To see (29), it suffices to show

$$\mathbb{P}\left(\max\_{1\le i\le n}|S\_i|>\mathbf{x}\right)\le\mathbb{C}\_1\frac{\mathcal{O}\_{q,A}(n)\left\Vert\mathcal{Y}\_{\cdot}\right\Vert\_{q,A}^q}{\mathbf{x}^q}+\mathbb{C}\_2\exp\left(\frac{-\mathbb{C}\_3\mathbf{x}^2}{n\left\Vert\mathcal{Y}\_{\cdot}\right\Vert\_{2,A}^2}\right).\tag{30}$$

Now, we develop a probability deviation inequality for $\max\_{1\le i\le n} |\sum\_{j=1}^{i} a\_j Y\_j|$, where $a\_j \ge 0$, $1 \le j \le n$, are constants such that $\sum\_{1\le j\le n} a\_j = 1$. Denote $\mathcal{P}\_0(Y\_i) = \mathbb{E}(Y\_i|\varepsilon\_i) - \mathbb{E}(Y\_i)$ and

$$\mathcal{P}\_k(Y\_i) = \mathbb{E}(Y\_i | \varepsilon\_{i-k}, \dots, \varepsilon\_i) - \mathbb{E}(Y\_i | \varepsilon\_{i-k+1}, \dots, \varepsilon\_i).$$

Then we can write

$$\begin{aligned} \max\_{1 \le i \le n} \left| \sum\_{j=1}^{i} a\_{j} \left(Y\_{j} - \mathbb{E}(Y\_j)\right) \right| \le {} & \max\_{1 \le i \le n} \left| \sum\_{j=1}^{i} a\_{j} \mathcal{P}\_{0} (Y\_{j}) \right| + \max\_{1 \le i \le n} \left| \sum\_{k=1}^{n} \sum\_{j=1}^{i} a\_{j} \mathcal{P}\_{k} (Y\_{j}) \right| \\ & + \max\_{1 \le i \le n} \left| \sum\_{k=n+1}^{\infty} \sum\_{j=1}^{i} a\_{j} \mathcal{P}\_{k} (Y\_{j}) \right|. \end{aligned} \tag{31}$$

Note that (P0(*Yj*))*j*∈<sup>Z</sup> is an independent sequence. By Nagaev's inequality and Ottaviani's inequality, we have that

$$\begin{aligned} \mathbb{P}\left(\max\_{1 \le i \le n} \left| \sum\_{j=1}^{i} a\_j \mathcal{P}\_0(Y\_j)\right| \ge x\right) &\lesssim \frac{\sum\_{j=1}^{n} a\_j^q \left\|\mathcal{P}\_0(Y\_j)\right\|\_q^q}{x^q} + \exp\left(-\frac{C\_3 x^2}{\sum\_{j=1}^{n} a\_j^2 \left\|\mathcal{P}\_0(Y\_j)\right\|\_2^2}\right) \\ &\lesssim \frac{\sum\_{j=1}^{n} a\_j^q}{x^q}\, \sup\_j \left\|Y\_j\right\|\_q^q + \exp\left(-C\_3 \frac{x^2}{\sum\_{j=1}^{n} a\_j^2\, \sup\_j \left\|Y\_j\right\|\_2^2}\right), \end{aligned} \tag{32}$$

where the last inequality holds because $\|\mathcal{P}\_0(Y\_j)\|\_q \le 2\|Y\_j\|\_q$ by Jensen's inequality. Since $\sum\_{j=i+1}^{n} a\_j \mathcal{P}\_k(Y\_j)$ is a martingale difference sequence with respect to $\sigma(\varepsilon\_{i+1-k}, \varepsilon\_{i+2-k}, \dots)$, we have that $|\sum\_{k=n+1}^{\infty} \sum\_{j=i+1}^{n} a\_j \mathcal{P}\_k(Y\_j)|$ is a non-negative sub-martingale. Then, by Doob's inequality and Burkholder's inequality, we have

$$\begin{aligned} & \mathbb{P}\left(\max\_{1 \le i \le n} \left| \sum\_{k=n+1}^{\infty} \sum\_{j=1}^{i} a\_{j} \mathcal{P}\_{k}(Y\_{j}) \right| \ge x \right) \\ & \quad \le \mathbb{P}\left( \left| \sum\_{k=n+1}^{\infty} \sum\_{j=1}^{n} a\_{j} \mathcal{P}\_{k}(Y\_{j}) \right| \ge \frac{x}{2} \right) + \mathbb{P}\left( \max\_{1 \le i \le n} \left| \sum\_{k=n+1}^{\infty} \sum\_{j=1+i}^{n} a\_{j} \mathcal{P}\_{k}(Y\_{j}) \right| \ge \frac{x}{2} \right) \\ & \quad \lesssim \frac{\left\| \sum\_{k=n+1}^{\infty} \sum\_{j=1}^{n} a\_{j} \mathcal{P}\_{k}(Y\_{j}) \right\|\_{q}^{q}}{x^{q}} \lesssim \frac{\left( \sum\_{j=1}^{n} a\_{j}^{2} \right)^{q/2} \Theta\_{n,q}^{q}}{x^{q}} \le \frac{\Theta\_{n,q}^{q}\, n^{q/2 - 1} \sum\_{j=1}^{n} a\_{j}^{q}}{x^{q}}. \end{aligned} \tag{33}$$

Now, we deal with the term $\max\_{1\le i\le n} |\sum\_{k=1}^{n} \sum\_{j=1}^{i} a\_j \mathcal{P}\_k(Y\_j)|$. Define $a\_m = \min(2^m, n)$ and $M\_n = \lceil \log n / \log 2 \rceil$. Then

$$\max\_{1 \le i \le n} \left| \sum\_{k=1}^{n} \sum\_{j=1}^{i} a\_j \mathcal{P}\_k(\boldsymbol{Y}\_j) \right| \le \sum\_{m=1}^{M\_n} \max\_{1 \le i \le n} \left| \sum\_{l=1}^{\lceil i/a\_m \rceil} \sum\_{j=1+(l-1)a\_m}^{\min(l a\_m, i)} \sum\_{k=1+a\_{m-1}}^{a\_m} a\_j \mathcal{P}\_k(\boldsymbol{Y}\_j) \right|. \tag{34}$$

Let $\mathcal{A}\_{odd} = \{1 \le l \le \lceil i/a\_m \rceil : l \text{ is odd}\}$ and $\mathcal{A}\_{even} = \{1 \le l \le \lceil i/a\_m \rceil : l \text{ is even}\}$. We have

$$\mathbb{P}\left(\max\_{1 \le i \le n} \left| \sum\_{l=1}^{\lceil i/a\_m \rceil} Z\_{l,m,i} \right| \ge \mathbf{x} \right) \le \mathbb{P}\left(\max\_{1 \le i \le n} \left| \sum\_{\mathcal{A}\_{odd}} Z\_{l,m,i} \right| \ge \mathbf{x}/2 \right) + \mathbb{P}\left(\max\_{1 \le i \le n} \left| \sum\_{\mathcal{A}\_{even}} Z\_{l,m,i} \right| \ge \mathbf{x}/2 \right),$$

where $Z\_{l,m,i} := \sum\_{j=1+(l-1)a\_m}^{\min(l a\_m, i)} a\_j \mathcal{P}\_{a\_{m-1}}^{a\_m}(Y\_j)$ is independent of $Z\_{l+2,m,i}$ for $1 \le l \le \lceil i/a\_m \rceil$, $1 \le m \le M\_n$, $1 \le i \le n$, as $\mathcal{P}\_{a\_{m-1}}^{a\_m}(Y\_j) := \sum\_{k=1+a\_{m-1}}^{a\_m} \mathcal{P}\_k(Y\_j)$ is $a\_m$-dependent. Therefore, we can apply Ottaviani's inequality and Nagaev's inequality for independent variables. As a consequence,

$$\mathbb{P}\left(\max\_{1\le i\le n} \Big|\sum\_{l=1}^{\lceil i/a\_m \rceil} Z\_{l,m,i} \Big| \ge x\right) \lesssim \frac{\sum\_{1\le l\le \lceil n/a\_m \rceil} \|Z\_{l,m,n}\|\_{q}^{q}}{x^{q}} + \exp\left(-\frac{C\_3 x^2}{\sum\_{1\le l\le \lceil n/a\_m \rceil} \|Z\_{l,m,n}\|\_{2}^{2}}\right).$$

Again, by Burkholder's inequality, we have that for *q* ≥ 2,

$$\begin{aligned} ||Z\_{l,m,n}||\_q &\leq \sum\_{k=1+a\_{m-1}}^{a\_m} ||\sum\_{j=1+(l-1)a\_m}^{\min(la\_m,n)} \alpha\_j \mathcal{P}\_k(\mathcal{Y}\_j)||\_q \\ &\lesssim \left(\sum\_{j=1+(l-1)a\_m}^{\min(la\_m,n)} \alpha\_j^2\right)^{1/2} (\Theta\_{a\_{m-1}} - \Theta\_{a\_m}).\end{aligned}$$

Note that $\sum\_{j=1+(l-1)a\_m}^{\min(l a\_m, n)} a\_j^2 \le a\_m^{(q-2)/q} \big(\sum\_{j=1+(l-1)a\_m}^{\min(l a\_m, n)} a\_j^q\big)^{2/q}$. Let $\tau\_m = m^{-2} / \sum\_{m=1}^{M\_n} m^{-2}$; then $\tau\_m \asymp m^{-2}$, since $1 \le \sum\_{m=1}^{M\_n} m^{-2} \le \pi^2/6$. With respect to (34), we have that

$$\begin{aligned} \mathbb{P}\left(\max\_{1\le i\le n} \left| \sum\_{k=1}^{n} \sum\_{j=1}^{i} a\_j \mathcal{P}\_{k}(Y\_{j}) \right| \ge x \right) & \le \sum\_{m=1}^{M\_n} \mathbb{P}\left(\max\_{1\le i\le n} \left| \sum\_{l=1}^{\lceil i/a\_m \rceil} Z\_{l,m,i} \right| \ge \tau\_m x \right) \\ & \lesssim \frac{\sum\_{j=1}^{n} a\_{j}^{q}}{x^{q}} \, \| Y\_\cdot \|\_{q,A}^{q} \sum\_{m=1}^{M\_n} \tau\_m^{-q} a\_m^{(1/2-A)q-1} + \sum\_{m=1}^{M\_n} \exp\left( -\frac{C\_{3} x^2 \tau\_m^2 a\_m^{2A}}{\sum\_{j=1}^{n} a\_j^2 \| Y\_\cdot \|\_{2,A}^{2}} \right). \end{aligned} \tag{35}$$

Note that $\sum\_{m=1}^{M\_n} \tau\_m^{-q} a\_m^{(1/2-A)q-1} \lesssim n^{-1} \mathcal{O}\_{q,A}(n)$, and

$$\sum\_{m=1}^{M\_n} \exp\left(-\frac{C\_3 x^2 \tau\_m^2 a\_m^{2A}}{\sum\_{j=1}^n a\_j^2 \|Y\_\cdot\|\_{2,A}^2}\right) \lesssim \exp\left(-\frac{C\_3 x^2}{\sum\_{j=1}^n a\_j^2 \|Y\_\cdot\|\_{2,A}^2}\right).$$

Combining (31), (32), (33), and (35), we obtain

$$\begin{aligned} & \mathbb{P}\left(\max\_{1\le i\le n} \left|\sum\_{j=1}^{i} a\_j \left(Y\_j - \mathbb{E}(Y\_j)\right)\right| > x\right) \\ & \quad \leq C\_1 \frac{\mathcal{O}\_{q,A}(n) \sum\_{j=1}^{n} a\_j^q \|Y\_\cdot\|\_{q,A}^q}{n x^q} + C\_2 \exp\left(\frac{-C\_3 x^2}{\sum\_{j=1}^{n} a\_j^2 \|Y\_\cdot\|\_{2,A}^2}\right). \end{aligned} \tag{36}$$

Now, we have (30) by taking $a\_j = n^{-1}$ for $j = 1, \dots, n$. Note that since $K(\cdot)$ has bounded support, for any given $t \in [b, 1-b]$, we have

$$\begin{aligned} & \mathbb{P}\left(\Big|\sum\_{i=1}^{n} w(t, t\_{i}) (Y\_{i} - \mathbb{E}Y\_{i})\Big| > x\right) = \mathbb{P}\left(\Big|\sum\_{i=-B\_{n}}^{B\_{n}} w(t, t\_{\lfloor tn \rfloor + i}) (Y\_{\lfloor tn \rfloor + i} - \mathbb{E}Y\_{\lfloor tn \rfloor + i})\Big| > x\right) \\ & \quad \leq C\_{1} \frac{\mathcal{O}\_{q,A}(B\_n) \sum\_{i=-B\_n}^{B\_n} w(t, t\_{\lfloor tn \rfloor + i})^{q} \left\|Y\_\cdot\right\|\_{q,A}^{q}}{x^{q}} + C\_{2} \exp\left(\frac{-C\_{3} x^{2}}{\sum\_{i=-B\_n}^{B\_n} w(t, t\_{\lfloor tn \rfloor + i})^{2} \left\|Y\_\cdot\right\|\_{2,A}^{2}}\right). \end{aligned}$$

Therefore, (28) follows from (36) by taking $a\_j = w(t, t\_{\lfloor tn \rfloor + j})$ and noting that, for any $t \in [b, 1-b]$, $\sum\_{i=-B\_n}^{B\_n} w(t, t\_{\lfloor tn \rfloor + i})^{\beta} \lesssim B\_n^{1-\beta}$ for a constant $\beta \ge 2$.

**Lemma 2.** *Suppose $(X\_{ij})\_{i\in\mathbb{Z}, 1\le j\le p}$ satisfies Assumption 2. Furthermore, let Assumption 5 hold. Let $\mathcal{O}\_{q,A}(n)$ be defined as in Lemma 1. Then there exist constants $C\_1$, $C\_2$, and $C\_3$ independent of $n$ and $p$, such that for all $x > 0$, we have*

$$\begin{aligned} \sup\_{t \in (0,1)} & \mathbb{P} \left( \left| \sum\_{i=1}^{n} w \left( t, t\_i \right) \left( X\_i X\_i^\top - \mathbb{E} (X\_i X\_i^\top) \right) \right|\_{\infty} \geq x \right) \\ & \leq C\_1 \nu\_{2q}^q \frac{p\, \mathcal{O}\_{q,A}(B\_n) M\_{X,q}^q}{B\_n^q x^q} + C\_2 p^2 \exp \left( -C\_3 \frac{B\_n x^2}{\nu\_4^2 N\_X^2} \right) \end{aligned} \tag{37}$$

*and*

$$\begin{aligned} \mathbb{P}\Big(\sup\_{t \in (0,1)} & \left| \sum\_{i=1}^{n} w(t, t\_i) \left( X\_i X\_i^\top - \mathbb{E}(X\_i X\_i^\top) \right) \right|\_{\infty} \geq x \Big) \\ & \leq C\_1 \nu\_{2q}^q \frac{p\, \mathcal{O}\_{q,A}(n) M\_{X,q}^q}{B\_n^q x^q} + C\_2 p^2 \exp \left( -C\_3 \frac{B\_n^2 x^2}{n \nu\_4^2 N\_X^2} \right). \end{aligned} \tag{38}$$

**Proof.** For $1 \le j, k \le p$, let $Y\_{i,jk} = X\_{ij} X\_{ik}$. We now check the conditions of Lemma 1 for $(Y\_{i,jk})\_{1\le i\le n}$. Denote $Y\_{i,jk,\{m\}} = X\_{ij,\{m\}} X\_{ik,\{m\}}$. Then the uniform functional dependence measure of $(Y\_{i,jk})\_i$ is

$$\begin{aligned} \theta\_{m,q,jk}^{\mathcal{Y}} &= \sup\_{i} ||\mathcal{Y}\_{i,jk} - \mathcal{Y}\_{i,jk,\{m\}}||\_{q} \\ &= \sup\_{i} ||\mathcal{X}\_{ij}\mathcal{X}\_{ik} - \mathcal{X}\_{ij,\{m\}}\mathcal{X}\_{ik,\{m\}}||\_{q} \\ &\leq \sup\_{i} ||\mathcal{X}\_{ij}(\mathcal{X}\_{ik} - \mathcal{X}\_{ik,\{m\}})||\_{q} + \sup\_{i} ||\mathcal{X}\_{ik,\{m\}}(\mathcal{X}\_{ij} - \mathcal{X}\_{ij,\{m\}})||\_{q}. \end{aligned}$$

Thus the DAN of the process *Y*·,*jk* satisfies that

$$\|Y\_{\cdot,jk}\|\_{q,A} \le \sup\_i \|X\_{ij}\|\_{2q} \|X\_{\cdot k}\|\_{2q,A} + \sup\_i \|X\_{ik}\|\_{2q} \|X\_{\cdot j}\|\_{2q,A} \le \nu\_{2q} \left(\|X\_{\cdot k}\|\_{2q,A} + \|X\_{\cdot j}\|\_{2q,A}\right).$$

The result follows immediately from Lemma 1 and the Bonferroni inequality.

**Lemma 3.** *We adopt the notation of Lemma 2. Suppose Assumptions 1, 2, and 5 hold with $\iota = 0$. Recall $B\_n = nb$, where $b \to 0$ and $B\_n/\sqrt{n} \to \infty$ as $n \to \infty$. Then there exists a constant $C$ independent of $n$ and $p$ such that $\hat{\Sigma}(t)$ in (11) satisfies, for any $t \in [c, 1-c]$,*

$$|\hat{\Sigma}(t) - \Sigma(t)|\_{\infty} = O\_{\mathbb{P}}\left(b^2 + M\_{X,q}\nu\_{2q}B\_n^{-1}\left(p\,\mathcal{O}\_{q,A}(B\_n)\right)^{1/q} + \nu\_4 N\_X(\log p/B\_n)^{1/2}\right). \tag{39}$$

*Furthermore,*

$$\sup\_{t \in [c, 1-c]} |\hat{\Sigma}(t) - \Sigma(t)|\_{\infty} = O\_{\mathbb{P}}\left(b^2 + M\_{X,q} \nu\_{2q} B\_n^{-1} \left(p\,\mathcal{O}\_{q,A}(n)\right)^{1/q} + \nu\_4 N\_X B\_n^{-1} \left[n \log p\right]^{1/2}\right). \tag{40}$$

**Proof.** First, we have

$$\mathbb{E}\hat{\sigma}\_{jk}(t) - \sigma\_{jk}(t) = \sum\_{i=1}^{n} w(t, t\_i) \left[\sigma\_{jk}(t\_i) - \sigma\_{jk}(t)\right].$$

Approximating the discrete summation by an integral, we obtain, for all $1 \le j, k \le p$,

$$\sup\_{t \in [b, 1-b]} \left| \mathbb{E} \hat{\sigma}\_{jk}(t) - \sigma\_{jk}(t) - \int\_{-1}^{1} K(u) \left[\sigma\_{jk}(ub + t) - \sigma\_{jk}(t)\right] du \right| = O\left( B\_n^{-1} \right).$$

By Assumption 1, we have

$$
\sigma\_{jk}(ub + t) - \sigma\_{jk}(t) = ub\, \sigma\_{jk}'(t) + \frac{1}{2} u^2 b^2 \sigma\_{jk}''(t) + o(b^2 u^2).
$$

Thus, we have $\sup\_{t\in[c,1-c]} |\mathbb{E}\hat{\Sigma}(t) - \Sigma(t)|\_{\infty} = O\left(B\_n^{-1} + b^2\right)$, in view of Assumption 5. By Lemma 2, we have

$$\sup\_{t \in (0,1)} \mathbb{P}\left( \left| \hat{\Sigma}(t) - \mathbb{E}\hat{\Sigma}(t) \right|\_{\infty} \geq x \right) \leq C\_1 p \nu\_{2q}^q \frac{M\_{X,q}^q \mathcal{O}\_{q,A}(B\_n)}{B\_n^q x^q} + C\_2 p^2 \exp\left( -C\_3 \frac{B\_n x^2}{\nu\_4^2 N\_X^2} \right).$$

Denote $u = C\_4 \left( M\_{X,q}\nu\_{2q}B\_n^{-1}\left(p\,\mathcal{O}\_{q,A}(B\_n)\right)^{1/q} + \nu\_4 N\_X(\log p/B\_n)^{1/2} \right)$ for a large enough constant $C\_4$; then for any $t \in (0, 1)$,

$$\left|\hat{\Sigma}(t) - \mathbb{E}\hat{\Sigma}(t)\right|\_{\infty} = O\_{\mathbb{P}}(u).$$

Thus (39) is proved. The result (40) can be obtained similarly.

#### *7.2. Proof of Main Results*

**Proof of Proposition 1.** Given (39) and (40), the proof of (16) is standard (see, e.g., Theorem 6 of [11]). For $\lambda^{\diamond}$ and $\lambda^{\*}$ given in Proposition 1, by Lemma 3 we have, respectively,

$$\lambda^{\diamond} \ge \sup\_{t} \mathbb{E}\left(\kappa\_p |\hat{\Sigma}(t) - \Sigma(t)|\_{\infty}\right),\tag{41}$$

$$\lambda^{\*} \ge \mathbb{E}\left(\kappa\_p \sup\_t |\hat{\Sigma}(t) - \Sigma(t)|\_{\infty}\right). \tag{42}$$

Then note that for any $t \in [0, 1]$ and any $\lambda > 0$,

$$\begin{aligned} |\hat{\Omega}\_{\lambda}(t) - \Omega(t)|\_{\infty} & \leq |\Omega(t)|\_{L\_{1}} |\Sigma(t)\hat{\Omega}\_{\lambda}(t) - \mathrm{Id}\_{p}|\_{\infty} \\ & \leq |\Omega(t)|\_{L\_{1}} \left[ |\hat{\Sigma}(t)\hat{\Omega}\_{\lambda}(t) - \mathrm{Id}\_{p}|\_{\infty} + |(\hat{\Sigma}(t) - \Sigma(t)) \Omega(t)|\_{\infty} + |\hat{\Omega}\_{\lambda}(t) - \Omega(t)|\_{L\_1} |\hat{\Sigma}(t) - \Sigma(t)|\_{\infty} \right], \end{aligned}$$

where, by construction, we have $|\hat{\Sigma}(t)\hat{\Omega}\_{\lambda}(t) - \mathrm{Id}\_p|\_{\infty} \le \lambda$ and $|\hat{\Omega}\_{\lambda}(t) - \Omega(t)|\_{L\_1} \le 2\kappa\_p$. Consequently,

$$|\hat{\Omega}\_{\lambda}(t) - \Omega(t)|\_{\infty} \le \kappa\_p \left(\lambda + 3\kappa\_p |\hat{\Sigma}(t) - \Sigma(t)|\_{\infty}\right). \tag{43}$$

Then (16) and (17) follow from (41)–(43).

**Proof of Proposition 2.** Proposition 2 is an immediate consequence of (17).

**Proof of Theorem 1.** Denote by $r\_j$, $1 \le j \le \iota$, the time points of jump, ordered decreasingly in the sense of the infinity norm of the jumps of the covariance matrices, i.e., $|\Delta(r\_1)|\_{\infty} \ge |\Delta(r\_2)|\_{\infty} \ge \dots \ge |\Delta(r\_{\iota})|\_{\infty} \ge |\Delta(s)|\_{\infty}$ for $s \in (0,1) \cap \{r\_1, \dots, r\_{\iota}\}^c$. (Temporal order is applied if there is a tie.) Let $\mathcal{T}\_h(j) = [r\_j - h, r\_j + h)$. For $h = o(1)$, as a result of Assumption 3, $\mathcal{T}\_h(j) \cap \mathcal{T}\_h(i) = \emptyset$ if $i \neq j$ for $n$ sufficiently large. That is to say, each time point $s \in (0,1)$ is in the neighborhood of at most one change point.

For any $s \in [t^{(j)}, t^{(j+1)})$, $j = 0, 1, \dots, \iota$, denote $\mathbb{D}(s) = \mathbb{E}[D(s)]$ and

$$\mathbb{D}^{\diamond}(s) = \begin{cases} (h - s + t^{(j)})\, \Delta(t^{(j)}), & t^{(j)} \le s < t^{(j)} + h, \\ 0, & t^{(j)} + h \le s < t^{(j+1)} - h, \\ (h + s - t^{(j+1)})\, \Delta(t^{(j+1)}), & t^{(j+1)} - h \le s \le t^{(j+1)}. \end{cases} \tag{44}$$
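The piecewise mean process in (44) is a tent function that peaks at the change points. The following scalar sketch (with the jump matrices $\Delta(t^{(j)})$ replaced by scalar jump sizes; names and values are ours for illustration) makes this explicit.

```python
def d_diamond(s, t_j, t_j1, h, delta_j, delta_j1):
    """Scalar version of the piecewise mean process in (44) on [t_j, t_j1]."""
    if t_j <= s < t_j + h:            # decaying linearly away from the left change point
        return (h - s + t_j) * delta_j
    if t_j + h <= s < t_j1 - h:       # identically zero away from both change points
        return 0.0
    return (h + s - t_j1) * delta_j1  # rising linearly toward the right change point
```

On a grid, $|\mathbb{D}^{\diamond}(s)|$ attains its maximum $h|\Delta|$ exactly at the change points, which is the property used below when locating $r\_1$.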

Then, for $s \in \cup\_{1\le j\le \iota}[t^{(j)} + h,\, t^{(j+1)} - h)$, by (3) we have

$$|\Sigma(s+t) - \Sigma(s)|\_{\infty} \le L|t|, \qquad \forall\, |t| \le h,$$

and we can easily verify that

$$\sup\_{s \in [0,1]} |\mathbb{D}(s) - \mathbb{D}^\diamond(s)|\_{\infty} \le Lh^2. \tag{45}$$

Note that $|\mathbb{D}^{\diamond}(s)|\_{\infty}$ is maximized at $s = r\_1$ and $|\mathbb{D}^{\diamond}(r\_1)|\_{\infty} = h|\Delta(r\_1)|\_{\infty}$. By the triangle inequality, we have that, for some positive constant $C$ and any $s \in [0, 1]$,

$$\begin{array}{rclcrcl}|\mathbb{D}(r\_1)|\_{\infty} - |\mathbb{D}(s)|\_{\infty} & \geq & hc\_2 - |\mathbb{D}(r\_1) - \mathbb{D}^\diamond(r\_1)|\_{\infty} - |\mathbb{D}^\diamond(s)|\_{\infty} - |\mathbb{D}(s) - \mathbb{D}^\diamond(s)|\_{\infty} \\ & \geq & hc\_2 - |\mathbb{D}^\diamond(s)|\_{\infty} - 2Lh^2 \\ & \geq & c\_2(|s - r\_1| \wedge h) - 2Lh^2. \end{array} \tag{46}$$

On the other hand, since |*D*(*r*1)|<sup>∞</sup> ≤ |*D*(*s*ˆ1)|∞, we have

$$\begin{array}{rcll}|\mathbb{D}(r\_{1})|\_{\infty} - |\mathbb{D}(\hat{s}\_{1})|\_{\infty} & \leq |D(r\_{1})|\_{\infty} - |D(\hat{s}\_{1})|\_{\infty} + |\mathbb{D}(r\_{1}) - D(r\_{1})|\_{\infty} + |\mathbb{D}(\hat{s}\_{1}) - D(\hat{s}\_{1})|\_{\infty} \\ & \leq |\mathbb{D}(r\_{1}) - D(r\_{1})|\_{\infty} + |\mathbb{D}(\hat{s}\_{1}) - D(\hat{s}\_{1})|\_{\infty} .\end{array} \tag{47}$$

Denote the event $\mathcal{A} := \{\sup\_{s\in[h,1-h]} |D(s) - \mathbb{D}(s)|\_{\infty} \le h^2\}$ and let $\mathbf{Y}\_i = (Y\_{i,jk})\_{1\le j,k\le p}$, $Y\_{i,jk} = X\_{ij}X\_{ik} - \sigma\_{i,jk}$. Note that

$$|D\_{jk}(s) - \mathbb{D}\_{jk}(s)| = \frac{1}{n} \left| \sum\_{i=1}^{hn} Y\_{\lfloor ns \rfloor + 1 - i,\, jk} - \sum\_{i=1}^{hn} Y\_{\lfloor ns \rfloor + i,\, jk} \right|. \tag{48}$$

By Lemma 2, we have for any *x* > 0,

$$\mathbb{P}\left(\sup\_{s\in[h,1-h]}|D(s)-\mathbb{D}(s)|\_{\infty}\geq x\right)\leq C\_{1}\frac{p\,\mathcal{O}\_{q,A}(n)M\_{X,q}^{q}\nu\_{2q}^{q}}{n^{q}x^{q}}+C\_{2}p^{2}\exp\left(-C\_{3}\frac{nx^{2}}{N\_{X}^{2}}\right).\tag{49}$$

It follows that

$$|\mathbb{D}(r\_1)|\_{\infty} - |\mathbb{D}(\hat{s}\_1)|\_{\infty} = O\_{\mathbb{P}}\left(h^{-1} n^{-1} \left(p\,\mathcal{O}\_{q,A}(n)\right)^{1/q} M\_{X,q} \nu\_{2q} + N\_X h^{-1} (n^{-1}\log p)^{1/2}\right).$$

Taking $h = h\_{\diamond}$, we have

$$|\hat{s}\_1 - r\_1| = O\_{\mathbb{P}}(h\_{\diamond}^2).$$

Furthermore, we have

$$\mathbb{P}(\mathcal{A}) \ge 1 - C\_1 \left(\frac{p\,\mathcal{O}\_{q,A}(n) M\_{X,q}^q \nu\_{2q}^q}{n^q c\_2^q}\right)^{1/3} - C\_2 p^2 \exp\left(-C\_3 \left(\frac{n \log^2 p}{N\_X^2}\right)^{1/3}\right).$$

Let $\mathcal{A}\_k := \{\max\_{1\le j\le k} |\hat{s}\_j - r\_j| \le c\_2^{-1} 2(L+1) h\_{\diamond}^2\}$ for some $1 \le k \le \iota$, and assume $\mathcal{A}\_k \subset \mathcal{A}$. Under $\mathcal{A}\_k$, we have that $[r\_j - h, r\_j + h) \subset \hat{\mathcal{T}}\_{2h}(j) := [\hat{s}\_j - 2h, \hat{s}\_j + 2h)$ for $1 \le j \le k$ and $r\_{k+1} \notin \cup\_{1\le j\le k} \hat{\mathcal{T}}\_{2h}(j)$, as a consequence of Assumption 3. According to (46) and (47), if $\mathcal{A}$ is true, then $|\hat{s}\_{k+1} - r\_{k+1}| \le c\_2^{-1} 2(L+1) h\_{\diamond}^2$, which implies $\mathcal{A}\_{k+1} \subset \mathcal{A}$. The result (21) follows by induction.

Suppose $\mathcal{A}$ holds. By the choice of $\nu$, as a consequence of (45) and (49) and the fact that $h\_{\diamond}^2 \lesssim \nu$, we have that

$$\sup\_{s \in [0,1]} |D(s) - \mathbb{D}^\diamond(s)|\_{\infty} \le \nu.$$

As a result,

$$\min\_{1 \le j \le \iota} |D(r\_j)|\_{\infty} \ge c\_2 h\_{\diamond} - \nu \ge \nu,$$

i.e., $\hat{\iota} \ge \iota$. On the other hand, since $\cup\_{1\le j\le \iota}\hat{\mathcal{T}}\_{2h\_{\diamond}}(j)$ is excluded from the search region for $\hat{s}\_{\iota+1}$, we have

$$\sup\_{s \in \left(\cup\_{1 \le j \le \iota} \hat{\mathcal{T}}\_{2h\_{\diamond}}(j)\right)^{c}} |D(s)|\_{\infty} \le \nu.$$

In other words, $\mathcal{A} \subset \{\hat{\iota} = \iota\}$. Thus, (20) is proved.

**Proof of Theorem 2.** We adopt the notation of the proof of Theorem 1 and assume that $\mathcal{E}$ holds. Similarly to Lemma 3, we have by Lemma 2 that, for any $t \in (0, 1)$,

$$\left| \hat{\Sigma}(t) - \mathbb{E}\hat{\Sigma}(t) \right|\_{\infty} = O\_{\mathbb{P}}(u),$$

where $u = C\_4 \left( M\_{X,q}\nu\_{2q}B\_n^{-1}\left(p\,\mathcal{O}\_{q,A}(B\_n)\right)^{1/q} + \nu\_4 N\_X(\log p/B\_n)^{1/2} \right)$ for a large enough constant $C\_4$.

Under $\mathcal{E}$, we have $\mathcal{T}\_b(j) \subset \hat{\mathcal{T}}\_{b+h\_{\diamond}^2}(j)$. For $t \in \left(\cup\_{1\le j\le \iota} \hat{\mathcal{T}}\_{b+h\_{\diamond}^2}(j)\right)^c \cap [b, 1-b]$, we have that for all $1 \le j, k \le p$,

$$\begin{aligned} \left| \mathbb{E} \hat{\sigma}\_{jk}(t) - \sigma\_{jk}(t) \right| &= \left|\int\_{-1}^{1} K(u) \left[\sigma\_{jk}(ub+t) - \sigma\_{jk}(t)\right] du\right| + O\left(B\_{n}^{-1}\right) \\ &= \left|b\,\sigma\_{jk}'(t) \int\_{-1}^{1} uK(u)\, du + \left(\frac{1}{2} b^{2} \sigma\_{jk}''(t) + o(b^{2})\right) \int\_{-1}^{1} u^{2} K(u)\, du\right| + O\left(B\_{n}^{-1}\right) \\ &= O(b^{2} + B\_{n}^{-1}). \end{aligned}$$

On the other hand, for $t \in \cup\_{1\le j\le \iota}\left(\hat{\mathcal{T}}\_{b+h\_{\diamond}^2}(j) \cap \mathcal{T}^{c}\_{h\_{\diamond}^2}(j)\right) \cup [0, b] \cup [1-b, 1]$, due to reflection, we no longer have differentiability. As a result of the Lipschitz continuity, we get

$$\left| \mathbb{E} \hat{\sigma}\_{jk}(t) - \sigma\_{jk}(t) \right| = \left|\int\_{-1}^{1} K(u) \left[\sigma\_{jk}(ub+t) - \sigma\_{jk}(t)\right] du\right| + O\left(B\_n^{-1}\right) = O(b + B\_n^{-1}).$$

The result (22) follows from the choices of $b$. The rest of the proof is similar to that of Proposition 1 and Theorem 1.

**Author Contributions:** Methodology, M.X., X.C. and W.B.W.; writing—original draft preparation, M.X., X.C. and W.B.W.; writing—review and editing, M.X., X.C. and W.B.W.; software, M.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** X.C.'s research is supported in part by NSF CAREER Award DMS-1752614 and UIUC Research Board Award RB18099. W.B.W.'s research is supported in part by NSF DMS-1405410.

**Acknowledgments:** X.C. acknowledges that part of this work was carried out at the MIT Institute for Data, System, and Society (IDSS).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors**

#### **Mariusz Kubkowski 1,2,† and Jan Mielniczuk 1,2,***∗***,†**


Received: 13 November 2019; Accepted: 24 January 2020; Published: 28 January 2020

**Abstract:** We consider selection of random predictors for a high-dimensional regression problem with a binary response for a general loss function. An important special case is when the binary model is semi-parametric and the response function is misspecified under a parametric model fit. When the true response coincides with a postulated parametric response for a certain value of the parameter, we obtain a common framework for parametric inference. Both cases of correct specification and misspecification are covered in this contribution. Variable selection for such a scenario aims at recovering the support of the minimizer of the associated risk with large probability. We propose a two-step Screening-Selection (SS) procedure, which consists of screening and ordering predictors by the Lasso method and then selecting the subset of predictors that minimizes the Generalized Information Criterion for the corresponding nested family of models. We prove consistency of the proposed selection method under conditions that allow for a much larger number of predictors than the number of observations. For the semi-parametric case, when the distribution of random predictors satisfies the linear regressions condition, the true and the estimated parameters are collinear and their common support can be consistently identified. This partly explains the robustness of selection procedures to response function misspecification.

**Keywords:** high-dimensional regression; loss function; random predictors; misspecification; consistent selection; subgaussianity; generalized information criterion; robustness

#### **1. Introduction**

Consider a random pair $(X, Y) \in R^p \times \{0, 1\}$ and the corresponding response function defined as the a posteriori probability $q(x) = P(Y = 1|X = x)$. Estimation of the a posteriori probability is of paramount importance in machine learning and statistics, since many frequently applied methods, e.g., logistic or tree-based classifiers, rely on it. One of the main estimation methods for $q$ is the parametric approach, in which the response function is assumed to have the parametric form

$$q(\mathbf{x}) = q\_0(\boldsymbol{\beta}^T \mathbf{x}) \tag{1}$$

for some fixed $\beta$ and known $q\_0(x)$. If Equation (1) holds, that is, the underlying structure is correctly specified, then it is known that

$$\beta = \operatorname{argmin}\_{b \in R^{p}} \left\{ -E\_{X,Y}\left(Y \log q\_0(b^T X) + (1 - Y) \log(1 - q\_0(b^T X))\right) \right\},\tag{2}$$

or, equivalently (cf., e.g., [1])

$$\beta = \operatorname{argmin}\_{b} E\_X KL(q(X), q\_0(X^T b)),\tag{3}$$

where *EX f*(*X*) is the expected value of a random variable *f*(*X*) and *KL*(*q*(*X*), *q*0(*XTb*)) is Kullback–Leibler distance between the binary distributions with success probabilities *q*(*X*) and *q*0(*XTb*):

$$KL(q(X), q\_0(X^Tb)) = q(X)\log\frac{q(X)}{q\_0(X^Tb)} + (1 - q(X))\log\frac{1 - q(X)}{1 - q\_0(X^Tb)}.$$
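For concreteness, here is a small Python sketch (function names are ours) of the Bernoulli Kullback–Leibler distance above, together with the identity that makes Equations (2) and (3) equivalent: the cross-entropy risk and the KL risk differ by the entropy of Bernoulli($q$), which does not depend on $b$.

```python
import math

def bernoulli_kl(q, q0):
    """KL distance between Bernoulli(q) and Bernoulli(q0), as in the display above."""
    return q * math.log(q / q0) + (1.0 - q) * math.log((1.0 - q) / (1.0 - q0))

def bernoulli_ce(q, q0):
    """Expected minus log-likelihood (cross-entropy) of Bernoulli(q0) under Bernoulli(q)."""
    return -q * math.log(q0) - (1.0 - q) * math.log(1.0 - q0)

def bernoulli_entropy(q):
    """Entropy of Bernoulli(q), which is free of the fitted parameter."""
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)
```

Since `bernoulli_ce(q, q0) = bernoulli_kl(q, q0) + bernoulli_entropy(q)`, minimizing the risk in Equation (2) over $b$ is the same as minimizing the KL risk in Equation (3).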

The equalities in Equations (2) and (3) form the theoretical underpinning of the (conditional) maximum likelihood (ML) method, as the expression under the expected value in Equation (2) is the conditional log-likelihood of $Y$ given $X$ in the parametric model. Moreover, it is a crucial property needed to show that, under appropriate conditions, ML estimates approximate $\beta$.

However, more frequently than not, the model in Equation (1) does not hold, i.e., the response $q$ is misspecified, and ML estimators approximate not $\beta$ but the quantity defined by the right-hand side of Equation (3), namely

$$\beta^\* = \operatorname{argmin}\_b E\_X KL(q(X), q\_0(X^T b)).\tag{4}$$

Thus, a parametric fit using the conditional ML method, which is the most popular approach to modeling binary response, also has a very intuitive geometric and information-theoretic flavor. Indeed, by fitting a parametric model, we approximate the unknown $q$ by $q\_0(\beta^{\*T}x)$, which yields the averaged KL projection of $q$ onto the set of parametric models $\{q\_0(b^Tx)\}\_{b\in R^p}$. A typical situation is the semi-parametric framework in which the true response function satisfies

$$q(\mathbf{x}) = \tilde{q}(\boldsymbol{\beta}^T \mathbf{x}) \tag{5}$$

for some unknown $\tilde{q}(x)$, and the model in Equation (1) is fitted with $\tilde{q} \neq q\_0$. An important problem is then how $\beta^\*$ in Equation (4) relates to $\beta$ in Equation (5). In particular, a frequently asked question is what can be said about the support of $\beta = (\beta\_1, \dots, \beta\_p)^T$, i.e., the set $\{i : \beta\_i \neq 0\}$, which consists of indices of predictors that truly influence $Y$. More specifically, the interplay between the support of $\beta$ and the analogously defined support of $\beta^\*$ is of importance, as the latter is consistently estimated, and the support of the ML estimator is frequently considered an approximation of the set of true predictors. Variable selection, or equivalently the support recovery of $\beta$ in a high-dimensional setting, is one of the most intensively studied subjects in contemporary statistics and machine learning. This is related to many applications in bioinformatics, biology, image processing, spatiotemporal analysis, and other research areas (see [2–4]). It is usually studied under a correct model specification, i.e., under the assumption that the data are generated following a given parametric model (e.g., logistic or, in the case of quantitative $Y$, linear model).

Consider the following example: let $\tilde{q}(x) = q\_L(x^3)$, where $q\_L(x) = e^x/(1+e^x)$ is the logistic function. Define the regression model by $P(Y=1|X) = \tilde{q}(\beta^TX) = q\_L((X\_1 + X\_2)^3)$, where $X = (X\_1, \dots, X\_p)$ is an $N(0, I\_{p\times p})$-distributed vector of predictors, $p > 2$, and $\beta = (1, 1, 0, \dots, 0) \in R^p$. Then, the considered model is obviously misspecified when the family of logistic models is fitted. However, it turns out in this case that, as $X$ is elliptically contoured, $\beta^\* = \eta\beta = \eta(1, 1, 0, \dots, 0)$ with $\eta \neq 0$ (see [5]), and thus the supports of $\beta$ and $\beta^\*$ coincide. Thus, in this case, despite misspecification, variable selection, i.e., finding out that $X\_1$ and $X\_2$ are the only active predictors, can be carried out using the methods described below.
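The conclusion of this example can be checked numerically. Below is a hedged Monte Carlo sketch (our own construction, with arbitrary sample size, step size, and seed, not taken from the paper): data are generated from $q\_L((X\_1+X\_2)^3)$, the misspecified logistic model is fitted by plain gradient descent on the empirical logistic risk, and the fitted vector should have its third coordinate near zero and its first two coordinates nearly equal, in accordance with $\beta^\* = \eta(1, 1, 0)$.

```python
import math, random

random.seed(1)
p, n = 3, 2000

def q_L(t):
    # logistic function, guarded against overflow for large |t|
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

# simulate X ~ N(0, I_p) and Y from the misspecified cubic-link model
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Y = [1 if random.random() < q_L((x[0] + x[1]) ** 3) else 0 for x in X]

# fit the (misspecified) logistic model by gradient descent on the empirical logistic risk
b = [0.0] * p
for _ in range(200):
    g = [0.0] * p
    for x, y in zip(X, Y):
        e = q_L(sum(bj * xj for bj, xj in zip(b, x))) - y  # residual of the logistic fit
        for j in range(p):
            g[j] += e * x[j] / n
    b = [bj - 0.5 * gj for bj, gj in zip(b, g)]
```

The agreement of the fitted direction with $(1, 1, 0)$ reflects the elliptical-contour argument of [5]; the scalar $\eta$ itself depends on the link and is not identified by the support.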

For recent contributions to the study of Kullback–Leibler projections onto the logistic model (which coincide with Equation (4) for the logistic loss; see below) and references, we refer to the works of Kubkowski and Mielniczuk [6], Kubkowski and Mielniczuk [7] and Kubkowski [8]. We also refer to the work of Lu et al. [9], where the asymptotic distribution of the adaptive Lasso is studied under misspecification in the case of a fixed number of deterministic predictors. Questions of robustness analysis revolve around the interplay between *β* and *β*∗, in particular under what conditions the directions of *β* and *β*∗ coincide (cf. the important contributions by Brillinger [10] and Ruud [11]).

In the present paper, we discuss this problem in a more general non-parametric setting. Namely, the minus conditional log-likelihood −(*y* log *q*0(*bTx*) + (1 − *y*)log(1 − *q*0(*bTx*))) is replaced by a general loss function of the form

$$l(b, \mathbf{x}, y) = \rho(b^T \mathbf{x}, y),\tag{6}$$

where *<sup>ρ</sup>* : *<sup>R</sup>* × {0, 1} → *<sup>R</sup>* is some function, *<sup>b</sup>*, *<sup>x</sup>* <sup>∈</sup> *<sup>R</sup>p*, *<sup>y</sup>* ∈ {0, 1}, and

$$\mathcal{R}(b) = E\_{X,Y}\, l(b, X, Y)$$

is the associated risk function for *<sup>b</sup>* <sup>∈</sup> *<sup>R</sup>p*. Our aim is to determine a support of *<sup>β</sup>*∗, where

$$
\beta^\* = \arg\min\_{b \in \mathbb{R}^{p\_n}} \mathcal{R}(b). \tag{7}
$$

Predictors corresponding to non-zero coordinates of *β*∗ are called active predictors, and the vector *β*∗ is called the pseudo-true vector.

The most popular loss functions are related to minus log-likelihood of specific parametric models such as logistic loss

$$l\_{\log ist}(b, \mathbf{x}, \mathbf{y}) = -yb^T\mathbf{x} + \log(1 + \exp(b^T\mathbf{x})) $$

related to *q*0(*bTx*) = exp(*bTx*)/(1 + exp(*bTx*)), probit loss

$$l\_{\text{probit}}(b, \mathbf{x}, y) = -y \log \Phi(b^T \mathbf{x}) - (1 - y) \log(1 - \Phi(b^T \mathbf{x})) $$

related to *q*0(*bTx*) = Φ(*bTx*), or quadratic loss *llin*(*b*, *x*, *y*) = (*y* − *bTx*)2/2 related to linear regression and a quantitative response. Other losses which do not correspond to any parametric model, such as the Huber loss (see [12]), are constructed with the specific aim of inducing certain desired properties of the corresponding estimators, such as robustness to outliers. We show in the following that the variable selection problem can be studied for a general loss function under certain analytic properties such as convexity and the Lipschitz property.
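The losses above can be written down directly; the following minimal sketch (function names are ours) implements the logistic and quadratic losses together with an empirical-mean risk evaluator.

```python
import numpy as np

def logistic_loss(s, y):
    # rho(s, y) = -y*s + log(1 + exp(s)); logaddexp keeps this stable for large |s|
    return -y * s + np.logaddexp(0.0, s)

def quadratic_loss(s, y):
    # rho(s, y) = (y - s)^2 / 2, the loss behind linear regression
    return 0.5 * (y - s) ** 2

def risk_hat(b, X, Y, rho=logistic_loss):
    # empirical risk: the mean of rho(b^T X_i, Y_i) over the sample
    return float(np.mean(rho(X @ b, Y)))
```

For example, at *b* = 0 the logistic loss equals log 2 for either label, so the empirical risk of the zero vector is log 2 regardless of the data.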

For a fixed number *p* of predictors smaller than the sample size *n*, the statistical consequences of misspecification of a semi-parametric regression model were intensively studied by H. White and his collaborators in the 1980s. The concept of a projection on the fitted parametric model is central to these investigations, which show how the distribution of the maximum likelihood estimator of *β*∗ centered by *β*∗ changes under misspecification (cf., e.g., [13,14]). However, in the case when *p* > *n*, the maximum likelihood estimator, which is a natural tool for the fixed *p* ≤ *n* case, is ill-defined and a natural question arises: What can be estimated and by what methods?

The aim of the present paper is to study the above problem in the high-dimensional setting. To this end, we introduce a two-stage approach in which the first stage is based on Lasso estimation (cf., e.g., [2])

$$\hat{\beta}\_L = \operatorname{argmin}\_{b \in R^{p\_n}} \{ R\_n(b) + \lambda\_L \sum\_{i=1}^{p\_n} |b\_i| \}, \tag{8}$$

where *b* = (*b*1,..., *bpn* )*<sup>T</sup>* and the empirical risk *Rn*(*b*) corresponding to *R*(*b*) is

$$R\_n(b) = n^{-1} \sum\_{i=1}^n \rho(b^T X\_i, Y\_i).$$

Parameter *λL* > 0 is the Lasso penalty, which penalizes large *l*1-norms of potential candidates for a solution. Note that the criterion function in Equation (8) for *ρ*(*s*, *y*) = log(1 + exp(−*s*(2*y* − 1))) can be viewed as the penalized empirical risk for the logistic loss. The Lasso estimator is thoroughly studied in the case of the linear model when the considered loss is the square loss (see, e.g., [2,4] for references and an overview of the subject), and some of the papers treat the case when such a model is fitted to *Y* which is not necessarily linearly dependent on the regressors (cf. [15]). In this case, the regression model is misspecified with respect to the linear fit. However, similar results are scarce for other scenarios, in particular for the logistic fit under misspecification. One of the notable exceptions is Negahban et al. [16], who studied the behavior of Lasso estimates for a general loss function and possibly misspecified models.

The output of the first stage is the Lasso estimate *β*ˆ*L*. The second stage consists of ordering the predictors according to the absolute values of the corresponding non-zero coordinates of the Lasso estimator and then minimizing the Generalized Information Criterion (GIC) on the resulting nested family. This is a variant of the SOS (Screening-Ordering-Selection) procedure introduced in [17]. Let *s*ˆ∗ be the model chosen by the GIC procedure.
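The two stages can be sketched as follows (a minimal illustration, not the authors' implementation: scikit-learn's l1-penalized logistic regression with `C = 1/(n*lam)` stands in for Equation (8), and the default `lam` of order (log *p*/*n*)^1/2 and BIC-type penalty *an* = log *n* are our own choices).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nloglik(b, X, Y):
    # n * R_n(b) for the logistic loss (no intercept, matching rho(b^T x, y))
    s = X @ b
    return float(np.sum(-Y * s + np.logaddexp(0.0, s)))

def ss_select(X, Y, lam=None, a_n=None):
    """Stage 1: Lasso screening and ordering; Stage 2: GIC on the nested family."""
    n, p = X.shape
    lam = np.sqrt(np.log(p) / n) if lam is None else lam   # (log p / n)^{1/2} rate
    a_n = np.log(n) if a_n is None else a_n                # BIC-type penalty factor
    lasso = LogisticRegression(penalty="l1", C=1.0 / (n * lam),
                               solver="liblinear", fit_intercept=False).fit(X, Y)
    coef = np.abs(lasso.coef_.ravel())
    order = [j for j in np.argsort(-coef) if coef[j] > 0]  # non-zero coords, largest first
    best_w, best_gic = (), np.inf
    for k in range(1, len(order) + 1):                     # nested family of models
        w = sorted(order[:k])
        refit = LogisticRegression(C=1e6, fit_intercept=False,
                                   max_iter=1000).fit(X[:, w], Y)
        gic = nloglik(refit.coef_.ravel(), X[:, w], Y) + a_n * k   # Equation (17)
        if gic < best_gic:
            best_w, best_gic = tuple(w), gic
    return best_w
```

On data generated from a correctly specified logistic model with two strong active predictors, this sketch returns exactly those two indices.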

Our main contributions are as follows:


We now discuss how our results relate to previous work. Most variable selection methods in the high-dimensional case are studied for deterministic regressors; here, our results concern random regressors with subgaussian distributions. Note that the random-regressors scenario is much more realistic for experimental data than the deterministic one. The stated results, to the best of our knowledge, are not available for random predictors even when the model is correctly specified. As to the novelty of the SS procedure, for its second stage we assume that the number of active predictors is bounded by a deterministic sequence *kn* tending to infinity and we minimize GIC on a family M of models whose sizes also satisfy this condition. Such an exhaustive search was proposed in [19] for linear models and extended to GLMs in [20] (cf. [21]). In these papers, GIC was optimized over all possible subsets of regressors with cardinality not exceeding a certain constant *kn*. Such a method is feasible for practical purposes only when *pn* is small. Here, we consider a similar set-up but with important differences: M is a data-dependent small nested family of models and optimization of GIC is considered in the case when the original model is misspecified. The regressors are supposed random and the assumptions are carefully tailored to this case. We also stress that the presented results cover the case when the regression model is correctly specified and Equation (5) is satisfied.

In numerical experiments, we study the performance of a grid version of logistic and linear SOS and compare it to several of its Lasso-based competitors.

The paper is organized as follows. Section 2 contains auxiliaries, including new useful probability inequalities for the empirical risk in the case of subgaussian random variables (Lemma 2). In Section 3, we prove a bound on the approximation error of Lasso when the loss function is convex and Lipschitz and the regressors are random (Theorem 1). This yields the separation property of Lasso. In Theorems 2 and 3 of Section 4, we prove GIC consistency on a nested family, which in particular can be built according to the order in which the Lasso coordinates are included in the fitted model. In Section 5.1, we discuss consequences of the proved results for the semi-parametric binary model when the distribution of predictors satisfies the linear regressions condition. In Section 6, we numerically compare the performance of the two-stage selection method for two closely related models, one of which is a logistic model and the second one is misspecified.

#### **2. Definitions and Auxiliary Results**

In the following, we allow the random vector (*X*,*Y*), *q*(*x*), and *p* to depend on the sample size *n*, i.e., (*X*,*Y*) = (*X*(*n*),*Y*(*n*)) ∈ *Rpn* × {0, 1} and *qn*(*x*) = *P*(*Y*(*n*) = 1|*X*(*n*) = *x*). We assume that *n* copies *X*(*n*)1, ... , *X*(*n*)*n* of a random vector *X*(*n*) in *Rpn* are observed together with the corresponding binary responses *Y*(*n*)1, ... , *Y*(*n*)*n*. Moreover, we assume that the observations (*X*(*n*)*i*, *Y*(*n*)*i*), *i* = 1, ... , *n* are independent and identically distributed (iid). If this condition is satisfied for each *n*, but not necessarily across different *n* and *m*, i.e., the distribution of (*X*(*n*)*i*, *Y*(*n*)*i*) may differ from that of (*X*(*m*)*j*, *Y*(*m*)*j*) or they may be dependent for *m* ≠ *n*, then such a framework is called a triangular scenario. A frequently considered scenario is the sequential one. In this case, when the sample size *n* increases, we observe values of new predictors in addition to the ones observed earlier. This is a special case of the above scheme, as then *X*(*n*+1)*i* = (*X*(*n*)*T i*, *Xi*,*pn*+1, ... , *Xi*,*p*(*n*+1))*T*. In the following, we skip the upper index *n* if no ambiguity arises. Moreover, we write *q*(*x*) = *qn*(*x*). We impose a condition on the distributions of the random predictors and assume in the following that the coordinates *Xij* of *Xi* are subgaussian *Subg*(*σ*2*jn*) with subgaussianity parameter *σ*2*jn*, i.e., it holds that (see [22])

$$E\exp(tX\_{ij}) \le \exp(t^2\sigma\_{jn}^2/2)\tag{9}$$

for all *t* ∈ *R*. This condition basically says that the tails of *Xij* do not decrease more slowly than the tails of the normal distribution *N*(0, *σ*2*jn*). For future reference, let

$$s\_n^2 = \max\_{j=1,\ldots,p\_n} \sigma\_{jn}^2$$
 
$$\gamma^2 := \limsup s\_n^2 < \infty. \tag{10}$$
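As a quick numerical illustration of Equation (9) (ours, not from the paper): a Rademacher variable (±1 with probability 1/2) is *Subg*(1), since its moment generating function E exp(*tX*) = cosh *t* is bounded by exp(*t*2/2) for every real *t*.

```python
import numpy as np

# Check E exp(tX) = cosh(t) <= exp(t^2 / 2) on a grid of t values
t = np.linspace(-10, 10, 2001)
mgf = np.cosh(t)          # exact mgf of a Rademacher variable
bound = np.exp(t ** 2 / 2)  # subgaussian bound with sigma^2 = 1
print(bool(np.all(mgf <= bound)))
```

The inequality also follows analytically by comparing the series term by term: *t*2*k*/(2*k*)! ≤ *t*2*k*/(2*k* *k*!).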

We assume moreover that *Xi*1, ... , *Xipn* are linearly independent in the sense that an arbitrary linear combination of them is not constant almost everywhere. We consider a general form of the response function *q*(*x*) = *P*(*Y* = 1|*X* = *x*) and assume that, for the given loss function, *β*∗ as defined in Equation (7) exists and is unique. For *s* ⊆ {1, ... , *pn*}, let *β*∗(*s*) be defined as in Equation (7) when the minimum is taken over *b* with support in *s*. We let

$$s^\* = \operatorname{supp}(\beta^\*(\{1, \dots, p\_n\})) = \{i \le p\_n \colon \beta^\*\_i \ne 0\}$$

denote the support of *β*∗({1, . . . , *pn*}) with *β*∗({1, . . . , *pn*}) = (*β*∗1, . . . , *β*∗*pn*)*T*.

Let *<sup>v</sup><sup>π</sup>* = (*vj*<sup>1</sup> , ... , *vjk* )*<sup>T</sup>* <sup>∈</sup> *<sup>R</sup>*|*π*<sup>|</sup> for *<sup>v</sup>* <sup>∈</sup> *<sup>R</sup>pn* and *<sup>π</sup>* <sup>=</sup> {*j*1, ... , *jk*}⊆{1, ... , *pn*}. Let *<sup>β</sup>*<sup>∗</sup> *<sup>s</sup>*<sup>∗</sup> <sup>∈</sup> *<sup>R</sup>*|*s*∗| be *β*<sup>∗</sup> = *β*∗({1, ... , *pn*}) restricted to its support *s*∗. Note that if *s*<sup>∗</sup> ⊆ *s*, then provided projections are unique (see Section 2) we have

$$
\beta\_{s^\*}^\* = \beta^\*(s^\*) = \beta^\*(s)\_{s^\*}.
$$

Note that this implies that for every superset *s* ⊇ *s*∗ of *s*∗ the projection *β*∗(*s*) on the model pertaining to *s* is obtained by appending the projection *β*∗(*s*∗) with an appropriate number of zeros. Moreover, let

$$
\beta\_{\min}^\* = \min\_{i \in s^\*} |\beta\_i^\*|.
$$

We remark that *β*∗, *s*∗ and *β*∗*min* may depend on *n*. We stress that *β*∗*min* is an important quantity in the development here, as it turns out that it must not decrease too quickly in order to obtain approximation results for *β*ˆ*L* (see Theorem 1). Note that, when the parametric model is correctly specified, i.e., *q*(*x*) = *q*0(*βTx*) for some *β* with *l* being the associated log-likelihood loss, if *s* is the support of *β*, then *s* = *s*∗.

First, we discuss quantities and assumptions needed for the first step of SS procedure. We consider cones of the form:

$$\mathcal{C}\_{\varepsilon} = \left\{ \Delta \in \mathbb{R}^{p\_n} \colon \ ||\Delta\_{s^{\*c}}||\_1 \le (3 + \varepsilon) ||\Delta\_{s^\*}||\_1 \right\},\tag{11}$$

where *ε* > 0, *s*∗*c* = {1, ... , *pn*} \ *s*∗ and Δ*s*∗ = (Δ*s*∗1, ... , Δ*s*∗|*s*∗|) for *s*∗ = {*s*∗1, ... , *s*∗|*s*∗|}. Cones C*ε* are of special importance because we prove that *β*ˆ*L* − *β*∗ ∈ C*ε* (see Lemma 3). In addition, we note that, since the *l*1-norm is decomposable in the sense that ||*vA*||1 + ||*vAc*||1 = ||*v*||1, the definition of the cone above can be equivalently stated as

$$\mathcal{C}\_{\varepsilon} = \{ \Delta \in \mathbb{R}^{p\_n} \colon ||\Delta||\_1 \le (4+\varepsilon) ||\Delta\_{s^\*}||\_1 \}.$$

Thus, <sup>C</sup>*<sup>ε</sup>* consists of vectors which do not put too much mass on the complement of *<sup>s</sup>*∗. Let *<sup>H</sup>* <sup>∈</sup> *<sup>R</sup>pn*×*pn* be a fixed non-negative definite matrix. For cone C*ε*, we define a quantity *κH*(*ε*) which can be regarded as a restricted minimal eigenvalue of a matrix in high-dimensional set-up:

$$\kappa\_H(\varepsilon) = \inf\_{\Delta \in \mathcal{C}\_\varepsilon \backslash \{0\}} \frac{\Delta^T H \Delta}{\Delta^T \Delta}. \tag{12}$$

In the considered context, *H* is usually taken as the Hessian *D*2*R*(*β*∗) and, e.g., for the quadratic loss, it equals *EXXT*. When *H* is non-negative definite but not strictly positive definite, its smallest eigenvalue *λ*1 = 0 and thus infΔ∈*Rp*\{0} Δ*TH*Δ/Δ*T*Δ = *λ*1 = 0. That is why we have to restrict the minimization in Equation (12) in order to have *κH*(*ε*) > 0 in the high-dimensional case. As we prove that Δ0 = *β*ˆ*L* − *β*∗ ∈ C*ε* and use the bound 0 < *κH*(*ε*) ≤ Δ*T*0*H*Δ0/Δ*T*0Δ0, it is useful to restrict the minimization in Equation (12) to C*ε* \ {0}. Let *R* and *Rn* be the risk and the empirical risk defined above. Moreover, we introduce the following notation:

$$\mathcal{W}(b) = \mathcal{R}(b) - \mathcal{R}(\boldsymbol{\beta}^\*),\tag{13}$$

$$\mathcal{W}\_n(b) = \mathcal{R}\_n(b) - \mathcal{R}\_n(\beta^\*), \tag{14}$$

$$B\_p(r) = \{ \Delta \in R^{p\_n} \colon ||\Delta||\_p \le r \}, \text{ for } p = 1, 2,\tag{15}$$

$$S(r) = \sup\_{b \in R^{p\_n}\colon\, b - \beta^\* \in B\_1(r)} |\mathcal{W}(b) - \mathcal{W}\_n(b)|. \tag{16}$$

Note that *ERn*(*b*) = *R*(*b*). Thus, *S*(*r*) corresponds to oscillation of centred empirical risk over ball *B*1(*r*). We need the following Margin Condition (MC) in Lemma 3 and Theorem 1:

(MC) There exist *<sup>ϑ</sup>*,*ε*, *<sup>δ</sup>* <sup>&</sup>gt; 0 and non-negative definite matrix *<sup>H</sup>* <sup>∈</sup> *<sup>R</sup>pn*×*pn* such that for all *<sup>b</sup>* with *b* − *β*<sup>∗</sup> ∈ C*<sup>ε</sup>* ∩ *B*1(*δ*) we have

$$R(b) - R(\boldsymbol{\beta}^\*) \ge \frac{\theta}{2} (b - \boldsymbol{\beta}^\*)^T H (b - \boldsymbol{\beta}^\*).$$

The above condition can be viewed as a weaker version of strong convexity of the function *R* (when the right-hand side is replaced by *ϑ*||*b* − *β*∗||22) in a restricted neighbourhood of *β*∗ (namely, in the intersection of the ball *B*1(*δ*) and the cone C*ε*). We stress the fact that *H* is not required to be positive definite, as in Section 3 we use Condition (MC) together with conditions stronger than *κH*(*ε*) > 0 which imply that the right-hand side of the inequality in (MC) is positive. We also do not require here twice differentiability of *R*. We note in particular that Condition (MC) is satisfied in the case of the logistic loss, *X* being a bounded random variable and *H* = *D*2*R*(*β*∗) (see [23–25]). It is also easily seen that (MC) is satisfied for the quadratic loss, *X* such that *E*||*X*||22 < ∞ and *H* = *D*2*R*(*β*∗). A condition similar to (MC) (called Restricted Strict Convexity) was considered in [16] for the empirical risk *Rn*:

$$R\_n(\beta^\* + \Delta) - R\_n(\beta^\*) \ge DR\_n(\beta^\*)^T \Delta + \kappa\_L ||\Delta||^2 - \tau^2(\beta^\*)$$

for all Δ ∈ *C*(3,*s*∗), some *κ<sup>L</sup>* > 0, and tolerance function *τ*. Note however that MC is a deterministic condition, whereas Restricted Strict Convexity has to be satisfied for random empirical risk function.

Another important assumption, used in Theorem 1 and Lemma 2, is the Lipschitz property of *ρ* :

$$\text{(LL) } \exists L > 0 \; \forall b\_1, b\_2 \in \mathbb{R}, \mathcal{Y} \in \{0, 1\} \colon \left| \rho(b\_1, \mathcal{Y}) - \rho(b\_2, \mathcal{Y}) \right| \le L|b\_1 - b\_2|.$$

Now, we discuss preliminaries needed for the development of the second step of the SS procedure. Let |*w*| stand for the cardinality of *w*. For the second step of the procedure, we consider an arbitrary family M ⊆ 2{1,...,*pn*} of models (which are identified with subsets of {1, ... , *pn*} and may be data-dependent) such that *s*∗ ∈ M and ∀*w* ∈ M : |*w*| ≤ *kn* a.e., where *kn* ∈ *N*+ is some deterministic sequence. We define the Generalized Information Criterion (GIC) as:

$$GIC(w) = nR\_n(\hat{\beta}(w)) + a\_n|w|, \tag{17}$$

where

$$\hat{\beta}(w) = \underset{b \in R^{p\_n} \colon \: b\_{w^c} = 0\_{|w^c|}}{\operatorname{arg\,min}} R\_n(b)$$

is the ML estimator for model *w*, as the minimization above is taken over all vectors *b* with support in *w*. Parameter *an* > 0 is a penalty factor depending on the sample size *n* which weighs how important the complexity of the model, described by the number of its variables |*w*|, is. Typical examples of *an* include:


AIC, BIC and EBIC were introduced by Akaike [26], Schwarz [27], and Chen and Chen [19], respectively. Note that for *n* ≥ 8 the BIC penalty is larger than the AIC penalty and, in its turn, the EBIC penalty is larger than the BIC penalty.
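Assuming the usual textbook choices *an* = 2 (AIC), *an* = log *n* (BIC) and *an* = log *n* + 2*γ* log *pn* (EBIC with parameter *γ* > 0), the penalty ordering stated above can be checked directly:

```python
import math

def gic_penalty(n, p, kind="BIC", gamma=0.5):
    # Standard GIC penalty factors a_n (textbook values, assumed here)
    if kind == "AIC":
        return 2.0
    if kind == "BIC":
        return math.log(n)
    if kind == "EBIC":
        return math.log(n) + 2.0 * gamma * math.log(p)
    raise ValueError(kind)

# For n >= 8 we have log(n) >= log(8) > 2, so AIC < BIC < EBIC
n, p = 8, 100
print(gic_penalty(n, p, "AIC") < gic_penalty(n, p, "BIC") < gic_penalty(n, p, "EBIC"))
```

The threshold *n* = 8 appears because log 8 ≈ 2.08 just exceeds the AIC value 2.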

We study properties of *Sk*(*r*) for *k* = 1, 2, where:

$$S\_k(r) = \sup\_{b \in D\_k: b - \beta^\* \in \mathcal{B}\_2(r)} |(\mathcal{W}\_n(b) - \mathcal{W}(b))|\tag{18}$$

and is the maximal absolute deviation of the centred empirical risk *Wn*(·) from its expectation *W*(·); the sets *Dk* for *k* = 1, 2 are defined as follows:

$$D\_1 = \{ b \in R^{p\_n} \colon \exists w \in \mathcal{M} \colon \left| w \right| \le k\_n \land s^\* \subset w \land \text{supp} \, b \subseteq w \},\tag{19}$$

$$D\_2 = \{ b \in R^{p\_n} \colon \text{supp}\, b \subset s^\* \}. \tag{20}$$

The idea here is simply to consider sets *Di* consisting of vectors having no more than *kn* non-zero coordinates. However, for |*s*∗| ≤ *kn*, we need that for *b* ∈ *Di* we have |supp(*b* − *β*∗)| ≤ *kn*, which we exploit in Lemma 2. This entails the additional condition in the definition of *D*1. Moreover, in Section 4, we consider the following condition *Cϵ*(*w*) for *ϵ* > 0, *w* ⊆ {1, . . . , *pn*} and some *θ* > 0:

$$C\_{\epsilon}(w) \colon \quad R(b) - R(\beta^\*) \ge \theta ||b - \beta^\*||\_2^2 \ \text{ for all } b \in \mathbb{R}^{p\_n} \text{ such that } \operatorname{supp} b \subseteq w \text{ and } b - \beta^\* \in B\_2(\epsilon).$$

We observe also that, although Conditions (MC) and *Cϵ*(*w*) are similar, they are not equivalent, as they hold for *v* = *b* − *β*∗ belonging to different sets: *B*1(*r*) ∩ C*ε* and *B*2(*ϵ*) ∩ {Δ ∈ *Rpn* : supp Δ ⊆ *w*}, respectively. If the minimal eigenvalue *λmin* of the matrix *H* in Condition (MC) is positive and Condition (MC) holds for *b* − *β*∗ ∈ *B*1(*r*) (instead of for *b* − *β*∗ ∈ C*ε* ∩ *B*1(*r*)), then we have for *b* − *β*∗ ∈ *B*2(*r*/√*pn*) ⊆ *B*1(*r*):

$$R(b) - R(\boldsymbol{\beta}^\*) \ge \frac{\theta}{2} (b - \boldsymbol{\beta}^\*)^T H (b - \boldsymbol{\beta}^\*) \ge \frac{\theta \lambda\_{\text{min}}}{2} ||b - \boldsymbol{\beta}^\*||\_2^2.$$

Furthermore, if *λmax* is the maximal eigenvalue of *H* and Condition *Cϵ*(*w*) holds for all *v* = *b* − *β*∗ ∈ *B*2(*r*) without the restriction on supp *b*, then we have for *b* − *β*∗ ∈ *B*1(*r*) ⊆ *B*2(*r*):

$$R(b) - R(\beta^\*) \ge \theta ||b - \beta^\*||\_2^2 \ge \frac{\theta}{\lambda\_{\max}} (b - \beta^\*)^T H (b - \beta^\*).$$

Thus, Condition (MC) holds in this case. A condition similar to *Cϵ*(*w*) for the empirical risk *Rn* was considered by Kim and Jeon [28] (formula (2.1)) in the context of GIC minimization. It turns out that Condition *Cϵ*(*w*), together with *ρ*(·, *y*) being convex for all *y* and satisfying Lipschitz Condition (LL), is sufficient to establish bounds which ensure GIC consistency for *kn* ln *pn* = *o*(*n*) and *kn* ln *pn* = *o*(*an*) (see Corollaries 2 and 3). First, we state the following basic inequality. *W*(*v*) and *S*(*r*) are defined above the definition of the Margin Condition.

**Lemma 1.** *(Basic inequality). Let ρ*(·, *y*) *be convex function for all y*. *If for some r* > 0 *we have*

$$\mu = \frac{r}{r + ||\hat{\beta}\_L - \beta^\*||\_1}, \quad v = \mu \hat{\beta}\_L + (1 - \mu)\beta^\*,$$

*then*

$$\mathcal{W}(\upsilon) + \lambda ||\upsilon - \beta^\*||\_1 \le S(r) + 2\lambda ||\upsilon\_{s^\*} - \beta^\*\_{s^\*}||\_1.$$

The proof of the lemma is moved to Appendix A. In view of the decomposability of the *l*1-distance, ||*v* − *β*∗||1 = ||(*v* − *β*∗)*s*∗||1 + ||(*v* − *β*∗)*s*∗*c*||1; thus, it follows from the lemma that when *S*(*r*) is small, ||(*v* − *β*∗)*s*∗*c*||1 is not large in comparison with ||(*v* − *β*∗)*s*∗||1.

Quantities *Sk*(*r*) are defined in Equation (18). Recall that *S*(*r*) is an oscillation taken over the ball *B*1(*r*), whereas *Si*, *i* = 1, 2, are oscillations taken over the ball *B*2(*r*) with restrictions on the support of *b*.

**Lemma 2.** *Let ρ*(·, *y*) *be convex function for all y and satisfy Lipschitz Condition (LL). Assume that Xij for <sup>j</sup>* <sup>≥</sup> <sup>1</sup> *are subgaussian Subg*(*σ*<sup>2</sup> *jn*)*, where σjn* ≤ *sn. Then, for r*, *t* > 0*:*

$$1. \quad P(S(r) > t) \le \frac{8Lrs\_n\sqrt{\log(p\_n \vee 2)}}{t\sqrt{n}},$$

$$2. \quad P(S\_1(r) \ge t) \le \frac{8Lrs\_n\sqrt{k\_n\log(p\_n \vee 2)}}{t\sqrt{n}},$$

$$3. \quad P(S\_2(r) \ge t) \le \frac{4Lrs\_n\sqrt{|s^\*|}}{t\sqrt{n}}.$$

The proof of the Lemma above, which relies on the Chebyshev inequality, the symmetrization inequality (see Lemma 2.3.1 of [29]), and the Talagrand–Ledoux inequality ([30], Theorem 4.12), is moved to Appendix A. In the case when *β*∗ does not depend on *n* and thus its support does not change, Part 3 implies in particular that *S*2(*r*) is of the order *n*−1/2 in probability.

#### **3. Properties of Lasso for a General Loss Function and Random Predictors**

The main result in this section is Theorem 1. The idea of the proof is based on the fact that, if *S*(*r*) defined in Equation (16) is sufficiently small (condition *S*(*r*) ≤ *C*¯*λr* is satisfied), then *β*ˆ*L* lies in the ball {Δ ∈ *Rpn* : ||Δ − *β*∗||1 ≤ *r*} (see Lemma 3). Using a tail inequality for *S*(*r*) proved in Lemma 2, we obtain Theorem 1. Note that *κH*(*ε*) has to be bounded away from 0 (condition 2|*s*∗|*λ* ≤ *κH*(*ε*)*ϑC*˜*r*). Convexity of *ρ*(·, *y*) below is understood as convexity for both *y* = 0, 1.

**Lemma 3.** *Let ρ*(·, *y*) *be a convex function and assume that λ* > 0. *Moreover, assume margin Condition (MC) with constants ϑ*, *ε*, *δ* > 0 *and some non-negative definite matrix H* ∈ *Rpn*×*pn. If for some r* ∈ (0, *δ*] *we have S*(*r*) ≤ *C*¯*λr and* 2|*s*∗|*λ* ≤ *κH*(*ε*)*ϑC*˜*r, where C*¯ = *ε*/(8 + 2*ε*) *and C*˜ = 2/(4 + *ε*), *then*

$$||\hat{\beta}\_L - \beta^\*||\_1 \le r.$$

The proof of the lemma is moved to the Appendix A.

The first main result provides an exponential inequality for *P*(||*β*ˆ*L* − *β*∗||1 ≤ *β*∗*min*/2). The threshold *β*∗*min*/2 is crucial there as it ensures separation: max*i*∈*s*∗*c* |*β*ˆ*L*,*i*| ≤ min*i*∈*s*∗ |*β*ˆ*L*,*i*| (see the proof of Corollary 1).

**Theorem 1.** *Let ρ*(·, *y*) *be convex function for all y and satisfy Lipschitz Condition (LL). Assume that Xij* ∼ *Subg*(*σ*<sup>2</sup> *jn*)*, β*<sup>∗</sup> *exists and is unique, margin Condition (MC) is satisfied for ε*, *δ*, *ϑ* > 0*, non-negative definite matrix H* <sup>∈</sup> *<sup>R</sup>pn*×*pn and let*

$$\frac{2|s^\*|\lambda}{\vartheta \kappa\_H(\varepsilon)} \le \tilde{C} \min\left\{\frac{\beta^\*\_{\min}}{2}, \delta\right\},$$

*where C*˜ = 2/(4 + *ε*). *Then,*

$$P\left(||\hat{\beta}\_L - \beta^\*||\_1 \le \frac{\beta\_{\min}^\*}{2}\right) \ge 1 - 2p\_n e^{-\frac{n\varepsilon^2 \lambda^2}{A}},$$

*where A* = 128*L*2(4 + *ε*)2*s*<sup>2</sup> *n.*

**Proof.** Let

$$m = \min\left\{\frac{\beta^\*\_{\min}}{2}, \delta\right\}.$$

Lemmas 2 and 3 imply that:

$$\begin{aligned} P\left(||\hat{\beta}\_L - \beta^\*||\_1 > \frac{\beta\_{\min}^\*}{2}\right) &\leq P\left(||\hat{\beta}\_L - \beta^\*||\_1 > m\right) \leq P\left(S(m) > \bar{C}\lambda m\right) \\ &\leq 2p\_n e^{-\frac{n\varepsilon^2\lambda^2}{128L^2(4+\varepsilon)^2 s\_n^2}}.\end{aligned}$$

**Corollary 1.** *(Separation property) If assumptions of Theorem 1 are satisfied,*

$$
\lambda = \frac{8L s\_n (4 + \varepsilon) \phi}{\varepsilon} \sqrt{\frac{2 \log(2p\_n)}{n}}
$$

*for some φ* > 1 *and κH*(*ε*) > *d for some d*,*ε* > 0 *for large n,* |*s*∗|*λ* = *o*(min{*β*<sup>∗</sup> *min*, 1}), *then*

$$P\left(||\hat{\beta}\_L - \beta^\*||\_1 \le \frac{\beta^\*\_{\min}}{2}\right) \to 1.$$

*Moreover,*

$$P\left(\max\_{i\in s^{\*c}} |\hat{\beta}\_{L,i}| \le \min\_{i\in s^{\*}} |\hat{\beta}\_{L,i}|\right) \to 1.$$

**Proof.** The first part of the corollary follows directly from Theorem 1 and the observation that:

$$P\left(||\hat{\beta}\_L - \beta^\*||\_1 > \frac{\beta\_{\min}^\*}{2}\right) \le e^{\log(2p\_n) - \frac{n\varepsilon^2 \lambda^2}{128L^2 (4+\varepsilon)^2 s\_n^2}} = e^{\log(2p\_n)(1-\phi^2)} \to 0.$$

Now, we prove that condition ||*β*<sup>ˆ</sup> *<sup>L</sup>* <sup>−</sup> *<sup>β</sup>*∗||<sup>1</sup> <sup>≤</sup> *<sup>β</sup>*<sup>∗</sup> *min*/2 implies separation property

$$\max\_{i \in s^{\*c}} |\hat{\beta}\_{L,i}| \le \min\_{i \in s^{\*}} |\hat{\beta}\_{L,i}| \,. \tag{21}$$

Indeed, observe that for all *j* ∈ {1, . . . , *pn*} we have:

$$\frac{\beta\_{\min}^\*}{2} \ge ||\hat{\beta}\_L - \beta^\*||\_1 \ge |\hat{\beta}\_{L,j} - \beta\_j^\*|. \tag{22}$$

If *j* ∈ *s*∗, then using triangle inequality yields:

$$|\hat{\beta}\_{L,j} - \beta\_j^\*| \ge |\beta\_j^\*| - |\hat{\beta}\_{L,j}| \ge \beta\_{\min}^\* - |\hat{\beta}\_{L,j}|.$$

Hence, from the above inequality and Equation (22), we obtain for *<sup>j</sup>* <sup>∈</sup> *<sup>s</sup>*∗: <sup>|</sup>*β*<sup>ˆ</sup> *<sup>L</sup>*,*j*| ≥ *<sup>β</sup>*<sup>∗</sup> *min*/2. If *<sup>j</sup>* <sup>∈</sup> *<sup>s</sup>*∗*c*, then *β*∗ *<sup>j</sup>* <sup>=</sup> 0 and Equation (22) takes the form: <sup>|</sup>*β*<sup>ˆ</sup> *<sup>L</sup>*,*j*| ≤ *<sup>β</sup>*<sup>∗</sup> *min*/2. This ends the proof.

We note that the separation property in Equation (21) means that, when *λ* is chosen in an appropriate manner, recovery of *s*∗ is feasible with large probability if all predictors whose Lasso coefficients exceed a certain threshold in absolute value are chosen. The threshold unfortunately depends on unknown parameters of the model. However, the separation property allows us to restrict attention to a nested family of models and thus to decrease significantly the computational complexity of the problem. This is dealt with in the next section. Note moreover that if *γ* in Equation (10) is finite, then *λ* defined in the Corollary is of order (log *pn*/*n*)1/2, which is the optimal order of the Lasso penalty in the case of deterministic regressors (see, e.g., [2]).
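The separation property can be observed in a small simulation (ours, for illustration only; *n*, *p*, the signal strength and the penalty constant are arbitrary choices, with *λ* of the order (log *pn*/*n*)^1/2 as in Corollary 1).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p, s = 1000, 50, 3
beta = np.zeros(p)
beta[:s] = 2.0                                   # s* = {1, 2, 3}, beta*_min bounded away from 0
X = rng.standard_normal((n, p))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

lam = np.sqrt(np.log(p) / n)                     # (log p / n)^{1/2} rate
fit = LogisticRegression(penalty="l1", C=1.0 / (n * lam), solver="liblinear",
                         fit_intercept=False).fit(X, Y)
coef = np.abs(fit.coef_.ravel())
# Separation: every active coordinate dominates every inactive one in absolute value
print(bool(coef[:s].min() > coef[s:].max()))
```

Here the inactive Lasso coordinates are shrunk to (or very near) zero while the active ones stay well above them, so a single threshold between the two groups recovers *s*∗.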

#### **4. GIC Consistency for a General Loss Function and Random Predictors**

Theorems 2 and 3 state probability inequalities related to the behavior of GIC on supersets and subsets of *s*∗, respectively. In a nutshell, we show for supersets and subsets separately that the probability that the minimum of GIC is not attained at *s*∗ is exponentially small. Corollaries 2 and 3 present asymptotic conditions for GIC consistency in the aforementioned situations. Corollary 4 gathers the conclusions of Theorem 1 and Corollaries 1–3 to show consistency of the SS procedure (see [17] for consistency of the SOS procedure for a linear model with deterministic predictors) in the case of subgaussian variables. Note that in the Theorem below we want to consider minimization of GIC in Equation (23) over all supersets of *s*∗, as in our applications M is data-dependent. As the number of such possible subsets is at least $\binom{p\_n - |s^\*|}{k\_n - |s^\*|}$, the proof has to be more involved than a reasoning based on the Bonferroni inequality.

**Theorem 2.** *Assume that ρ*(·, *y*) *is a convex, Lipschitz function with constant L* > 0*, Xij* ∼ *Subg*(*σ*2*jn*), *and condition Cϵ*(*w*) *holds for some ϵ*, *θ* > 0 *and for every w* ⊆ {1, ... , *pn*} *such that* |*w*| ≤ *kn. Then, for any r* < *ϵ, we have:*

$$P(\min\_{w \in \mathcal{M} \colon s^\* \subset w} GIC(w) \le GIC(s^\*)) \le 2p\_n e^{-\frac{a\_n^2}{B}} + 2p\_n e^{-\frac{nD}{k\_n}},\tag{23}$$

*where B* = 32*nL*2*r*2*kns*2*n and D* = *θ*2*r*2/(512*L*2*s*2*n*)*.* **Proof.** If *s*∗ ⊂ *w* ∈ M and *β*ˆ(*w*) − *β*∗ ∈ *B*2(*r*), then in view of the inequalities *Rn*(*β*ˆ(*s*∗)) ≤ *Rn*(*β*∗) and *R*(*β*∗) ≤ *R*(*b*) we have:

$$\begin{split} R\_n(\hat{\beta}(s^\*)) - R\_n(\hat{\beta}(w)) &\leq \sup\_{b \in D\_1 \colon b - \beta^\* \in B\_2(r)} \left( R\_n(\beta^\*) - R\_n(b) \right) \\ &\leq \sup\_{b \in D\_1 \colon b - \beta^\* \in B\_2(r)} \left( (R\_n(\beta^\*) - R(\beta^\*)) - (R\_n(b) - R(b)) \right) \\ &\leq \sup\_{b \in D\_1 \colon b - \beta^\* \in B\_2(r)} |R\_n(b) - R(b) - (R\_n(\beta^\*) - R(\beta^\*))| \\ &= S\_1(r). \end{split}$$

Note that $a_n(|w|-|s^*|)\ge a_n$. Hence, if we have $GIC(w)\le GIC(s^*)$ for some $w\supset s^*$, then we obtain $nR_n(\hat\beta(s^*)) - nR_n(\hat\beta(w)) \ge a_n(|w|-|s^*|)$, and from the above inequality we get $S_1(r)\ge a_n/n$. Furthermore, if $\hat\beta(w)-\beta^*\in B_2(r)^c$ and $r<\varepsilon$, then consider:

$$v = u\hat\beta(w) + (1 - u)\beta^*,$$

where $u = r/(r + \|\hat\beta(w) - \beta^*\|_2)$. Then

$$\|v - \beta^*\|_2 = u\|\hat\beta(w) - \beta^*\|_2 = r \cdot \frac{\|\hat\beta(w) - \beta^*\|_2}{r + \|\hat\beta(w) - \beta^*\|_2} \ge \frac{r}{2},$$

as the function $x/(x+r)$ is increasing with respect to $x$ for $x>0$ and $\|\hat\beta(w)-\beta^*\|_2\ge r$. Moreover, we have $\|v-\beta^*\|_2\le r<\varepsilon$. Hence, in view of condition $C(w)$, we get:

$$R(v) - R(\beta^\*) \ge \theta ||v - \beta^\*||\_2^2 \ge \frac{\theta r^2}{4}.$$

From convexity of *Rn*, we have:

$$R_n(v) \le u(R_n(\hat\beta(w)) - R_n(\beta^*)) + R_n(\beta^*) \le R_n(\beta^*).$$

Let $\operatorname{supp} v$ denote the support of the vector $v$. We observe that $\operatorname{supp} v \subseteq \operatorname{supp}\hat\beta(w) \cup \operatorname{supp}\beta^* \subseteq w$, hence $v\in D_1$. Finally, we have:

$$S_1(r) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) \ge R(v) - R(\beta^*) \ge \frac{\theta r^2}{4}.$$

Hence, we obtain the following sequence of inequalities:

$$\begin{aligned} &P(\min_{w \in \mathcal{M}:\, s^* \subset w} GIC(w) \le GIC(s^*))\\ &\quad\le P(S_1(r) \ge \tfrac{a_n}{n},\ \forall w \in \mathcal{M}:\ \hat\beta(w) - \beta^* \in B_2(r))\\ &\qquad + P(\exists w \in \mathcal{M}:\ s^* \subset w \wedge \hat\beta(w) - \beta^* \in B_2(r)^c)\\ &\quad\le P(S_1(r) \ge \tfrac{a_n}{n}) + P(S_1(r) \ge \tfrac{\theta r^2}{4})\\ &\quad\le 2p_n e^{-\frac{a_n^2}{32nL^2r^2k_ns_n^2}} + 2p_n e^{-\frac{n\theta^2 r^2}{512L^2k_ns_n^2}}. \end{aligned}$$

**Corollary 2.** *Assume that the conditions of Theorem 2 hold, $k_n\ln(p_n\vee 2) = o(n)$ and $\liminf_{n\to\infty} \frac{D_n a_n}{k_n\log(2p_n)} > 1$, where $D_n^{-1} = 128L^2s_n^2\phi/\theta$ for some $\phi > 1$. Then, we have*

$$P(\min\_{w \in \mathcal{M}: s^\* \subset w} GIC(w) \le GIC(s^\*)) \to 0.$$

**Proof.** We choose the radius $r$ of $B_2(r)$ in a special way. Namely, we take:

$$r\_n^2 = \frac{512\phi^2 L^2 s\_n^2 \log(2p\_n) k\_n}{n\theta^2}$$

for some $\phi > 1$. In view of the assumptions, $r_n \to 0$. Consider $n_0$ such that $r_n < \varepsilon$ for all $n \ge n_0$. Hence, the second term of the upper bound in Equation (23) for $r = r_n$ is equal to:

$$2p_n e^{-\frac{n\theta^2 r_n^2}{512L^2k_ns_n^2}} = e^{\log(2p_n)(1-\phi^2)} \to 0.$$

Similarly, the first term of the upper bound in Equation (23) is equal to:

$$2p_n e^{-\frac{a_n^2}{32nL^2r_n^2k_ns_n^2}} = e^{\log(2p_n)\left(1 - \frac{a_n^2\theta^2}{2^{14}\phi^2L^4k_n^2s_n^4\log^2(2p_n)}\right)} = e^{\log(2p_n)\left(1 - \frac{D_n^2a_n^2}{k_n^2\log^2(2p_n)}\right)} \to 0.$$

These two convergences end the proof.

The most restrictive condition of Corollary 2 is $\liminf_{n\to\infty} \frac{D_n a_n}{k_n\log(2p_n)} > 1$, which is slightly weaker than $k_n\ln(p_n\vee 2) = o(a_n)$. The following remark, proved in Appendix A, gives sufficient conditions for consistency of BIC and EBIC penalties, which do not satisfy the condition $k_n\log(p_n) = o(a_n)$.
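For intuition, the two penalties used later in Section 6 can be compared numerically with the $k_n\log(2p_n)$ term appearing in the condition above. The sketch below uses the experimental sizes $n = 500$, $p_n = 150$ from Section 6; the value $k_n = 4$ is an illustrative assumption, not taken from the paper.

```python
import math

def a_n_bic(n):
    # BIC penalty (Section 6): a_n = log n
    return math.log(n)

def a_n_ebic1(n, p_n):
    # EBIC1 penalty (Section 6): a_n = log n + 2 log p_n
    return math.log(n) + 2 * math.log(p_n)

n, p_n, k_n = 500, 150, 4            # k_n = 4 is an illustrative choice
threshold = k_n * math.log(2 * p_n)  # the k_n log(2 p_n) term

print(round(a_n_bic(n), 2), round(a_n_ebic1(n, p_n), 2), round(threshold, 2))
# → 6.21 16.24 22.82
```

With these sizes both penalties are smaller than $k_n\log(2p_n)$, so the factor $D_n$ matters for the condition to hold, which is why Remark 1 imposes the extra assumption $D_n \ge A$.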

**Remark 1.** *If in Corollary 2 we assume $D_n \ge A$ for some $A > 0$, then the condition $\liminf_{n\to\infty} \frac{D_n a_n}{k_n\log(2p_n)} > 1$ holds when:*


Theorem 3 is an analog of Theorem 2 for subsets of *s*∗.

**Theorem 3.** *Assume that $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim\mathrm{Subg}(\sigma_{jn}^2)$, condition $C(s^*)$ holds for some $\varepsilon,\theta>0$, and $8a_n|s^*| \le \theta n \min\{\varepsilon^2, \beta^{*2}_{\min}\}$. Then, we have:*

$$P(\min_{w \in \mathcal{M}:\, w \subset s^*} GIC(w) \le GIC(s^*)) \le \sqrt{2}e^{-n \min\{\varepsilon,\, \beta^*_{\min}\}^2 E},$$

*where $E = \theta^2/(2^{12}L^2s_n^2|s^*|)$.*

**Proof.** Suppose that for some *w* ⊂ *s*<sup>∗</sup> we have *GIC*(*w*) ≤ *GIC*(*s*∗). This is equivalent to:

$$nR_n(\hat\beta(s^*)) - nR_n(\hat\beta(w)) \ge a_n(|w| - |s^*|).$$

In view of the inequalities $R_n(\hat\beta(s^*)) \le R_n(\beta^*)$ and $a_n(|w|-|s^*|) \ge -a_n|s^*|$, we obtain:

$$nR_n(\beta^*) - nR_n(\hat\beta(w)) \ge -a_n|s^*|.$$

Let $v = u\hat\beta(w) + (1-u)\beta^*$ for some $u\in[0,1]$ to be specified later. From convexity of $\rho$, we obtain:

$$nR_n(\beta^*) - nR_n(v) \ge nu(R_n(\beta^*) - R_n(\hat\beta(w))) \ge -ua_n|s^*| \ge -a_n|s^*|.\tag{24}$$

We consider two cases separately:

(1) $\beta^*_{\min} > \varepsilon$.

First, observe that

$$8a_n|s^*| \le \theta\varepsilon^2 n,\tag{25}$$

which follows from our assumption. Let $u = \varepsilon/(\varepsilon + \|\hat\beta(w) - \beta^*\|_2)$ and

$$v = u\hat\beta(w) + (1 - u)\beta^*.\tag{26}$$

Note that $\|\hat\beta(w) - \beta^*\|_2 \ge \|\beta^*_{s^*\setminus w}\|_2 \ge \beta^*_{\min}$. Then, as the function $d(x) = x/(x+c)$ is increasing and bounded from above by 1 for $x, c > 0$, we obtain:

$$\varepsilon \ge \|v - \beta^*\|_2 = \frac{\varepsilon\|\hat\beta(w) - \beta^*\|_2}{\varepsilon + \|\hat\beta(w) - \beta^*\|_2} \ge \frac{\varepsilon\beta^*_{\min}}{\varepsilon + \beta^*_{\min}} > \frac{\varepsilon^2}{2\varepsilon} = \frac{\varepsilon}{2}.\tag{27}$$

Hence, in view of *C*(*s*∗) condition, we have:

$$R(v) - R(\beta^*) > \theta\frac{\varepsilon^2}{4}.$$

Using Equations (24)–(26) and the above inequality yields:

$$S_2(\varepsilon) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) > \theta\frac{\varepsilon^2}{4} - \frac{a_n}{n}|s^*| \ge \frac{\theta\varepsilon^2}{8}.$$

Thus, in view of Lemma 2, we obtain:

$$P(\min_{w \in \mathcal{M}:\, w \subset s^*} GIC(w) \le GIC(s^*)) \le P\left(S_2(\varepsilon) > \frac{\theta\varepsilon^2}{8}\right) \le \sqrt{2}e^{-\frac{n\theta^2\varepsilon^2}{2^{12}L^2s_n^2|s^*|}}.\tag{28}$$

(2) $\beta^*_{\min} \le \varepsilon$.

In this case, we take $u = \beta^*_{\min}/(\beta^*_{\min} + \|\hat\beta(w) - \beta^*\|_2)$ and define $v$ as in Equation (26). Analogously as in Equation (27), we have:

$$\frac{\beta\_{\min}^\*}{2} \le ||v - \beta^\*||\_2 \le \beta\_{\min}^\*.$$

Hence, in view of *C*(*s*∗) condition, we have:

$$R(v) - R(\beta^*) \ge \theta\frac{\beta_{\min}^{*2}}{4}.$$

Using Equation (24) and the above inequality yields:

$$S_2(\beta^*_{\min}) \ge R_n(\beta^*) - R(\beta^*) - (R_n(v) - R(v)) \ge \theta\frac{\beta_{\min}^{*2}}{4} - \frac{a_n}{n}|s^*| \ge \frac{\theta}{8}\beta_{\min}^{*2}.$$

Thus, in view of Lemma 2, we obtain:

$$P(\min_{w \in \mathcal{M}:\, w \subset s^*} GIC(w) \le GIC(s^*)) \le P\left(S_2(\beta^*_{\min}) \ge \frac{\theta}{8}\beta_{\min}^{*2}\right) \le \sqrt{2}e^{-\frac{n\theta^2\beta_{\min}^{*2}}{2^{12}L^2s_n^2|s^*|}}.\tag{29}$$

By combining Equations (28) and (29), the theorem follows.

**Corollary 3.** *Assume that the loss $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim\mathrm{Subg}(\sigma_{jn}^2)$, condition $C(s^*)$ holds for some $\varepsilon,\theta>0$ and $a_n|s^*| = o(n\min\{1, \beta^*_{\min}\}^2)$. Then*

$$P(\min\_{w \in \mathcal{M}: w \subset s^\*} GIC(w) \le GIC(s^\*)) \to 0.$$

**Proof.** First, observe that as *an* → ∞

$$a\_n|s^\*| = o(n \min\{1, \beta^\*\_{min}\}^2),$$

implies

$$|s^\*| = o(n \min\{1, \beta^\*\_{\min}\}^2),$$

and thus in view of Theorem 3 we have

$$P(\min\_{w \in \mathcal{M}: w \subseteq s^\*} GIC(w) \le GIC(s^\*)) \to 0.$$

#### **5. Selection Consistency of SS Procedure**

In this section, we combine the results of the two previous sections to establish consistency of the two-step SS procedure. It consists of constructing a nested family of models $\mathcal{M}$ using the magnitudes of Lasso coefficients and then finding the minimizer of GIC over this family. As $\mathcal{M}$ is data dependent, to establish consistency of the procedure we use Corollaries 2 and 3, in which the minimizer of GIC is considered over *all* subsets and supersets of $s^*$.

The SS (Screening and Selection) procedure is defined as follows:

1. Choose some $\lambda > 0$.

2. Find $\hat\beta_L = \operatorname{arg\,min}_{b\in R^{p_n+1}} R_n(b) + \lambda\|b\|_1$.

3. Sort the nonzero coordinates of $\hat\beta_L$ according to decreasing magnitudes: $|\hat\beta_{L,j_1}| \ge \dots \ge |\hat\beta_{L,j_k}|$, where $k = |\operatorname{supp}\hat\beta_L|$.

4. Define the nested family $\mathcal{M}_{SS} = \{\emptyset, \{j_1\}, \{j_1, j_2\}, \dots, \{j_1, \dots, j_k\}\}$.

5. Find $\hat s^* = \operatorname{arg\,min}_{w\in\mathcal{M}_{SS}} GIC(w)$.

The SS procedure is a modification of the SOS procedure in [17] designed for linear models. Since the ordering step considered in [17] is omitted in the proposed modification, we abbreviate the name to SS.

Corollary 4 and Remark 2 describe situations in which the SS procedure is selection consistent. In them, we use the assumptions imposed in Sections 2 and 3 together with an assumption that $s^*$ contains no more than $k_n$ elements, where $k_n$ is some deterministic sequence of integers. Let $\mathcal{M}_{SS}$ be the nested family constructed in Step 4 of the SS procedure.

**Corollary 4.** *Assume that $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim\mathrm{Subg}(\sigma_{jn}^2)$ and $\beta^*$ exists and is unique. If $k_n\in N_+$ is some sequence, margin condition (MC) is satisfied for some $\vartheta,\delta,\varepsilon > 0$, condition $C(w)$ holds for some $\varepsilon,\theta > 0$ and for every $w\subseteq\{1,\dots,p_n\}$ such that $|w|\le k_n$, and the following conditions are fulfilled:*



*then for SS procedure we have*

$$P(\hat{s}^* = s^*) \to 1.$$

**Proof.** In view of Corollary 1, which follows from the separation property in Equation (22), we obtain $P(s^* \in \mathcal{M}_{SS}) \to 1$. Let:

$$A\_1 = \{ \min\_{w \in \mathcal{M}\_{SS} : w \supset s^\*, |w| \le k\_n} GIC(w) \le GIC(s^\*) \},$$

$$A\_2 = \{ \min\_{w \in \mathcal{M}\_{SS} : w \supset s^\*, |w| > k\_n} GIC(w) \le GIC(s^\*) \},$$

$$B = \{ \forall w \in \mathcal{M}\_{SS} : |w| \le k\_n \}.$$

Then, since $A_2 \cap B = \emptyset$, we have from the union inequality and Corollary 2:

$$P(\min\_{w \in \mathcal{M}\_{SS} : w \supset s^\*} GIC(w) \le GIC(s^\*)) = P(A\_1 \cup A\_2) = P(A\_1 \cup (A\_2 \cap B^c))$$

$$\le P(A\_1) + P(B^c) \to 0. \tag{30}$$

In an analogous way, using |*s*∗| ≤ *kn* and Corollary 3 yields:

$$P(\min\_{w \in \mathcal{M}\_{SS}: w \subset s^\*} GIC(w) \le GIC(s^\*)) \to 0. \tag{31}$$

Now, observe that in view of the definition of $\hat s^*$ and the union inequality:

$$\begin{aligned} P(\hat s^* = s^*) &= P(\min_{w \in \mathcal{M}_{SS}:\, w \ne s^*} GIC(w) > GIC(s^*))\\ &\ge 1 - P(\min_{w \in \mathcal{M}_{SS}:\, w \subset s^*} GIC(w) \le GIC(s^*))\\ &\quad - P(\min_{w \in \mathcal{M}_{SS}:\, w \supset s^*} GIC(w) \le GIC(s^*)). \end{aligned}$$

Thus, $P(\hat s^* = s^*) \to 1$ in view of the above inequality and Equations (30) and (31).

#### *5.1. Case of Misspecified Semi-Parametric Model*

Consider now the important case of the misspecified semi-parametric model defined in Equation (5) for which function *q*˜ is unknown and may be arbitrary. An interesting question is whether information about *β* can be recovered when misspecification occurs. The answer is positive under some additional assumptions on distribution of random predictors. Assume additionally that *X* satisfies

$$E(X|\beta^TX) = u_0 + u\beta^TX,\tag{32}$$

where $\beta$ is the true parameter. Thus, the regressions of $X$ given $\beta^TX$ have to be linear. We stress that the conditioning on $\beta^TX$ involves only the true $\beta$ in Equation (5). Then, it is known (cf. [5,10,11]) that $\beta^* = \eta\beta$ and $\eta \ne 0$ if $\mathrm{Cov}(Y, X) \ne 0$. Note that, because $\beta$ and $\beta^*$ are collinear and $\eta \ne 0$, it follows that $s = s^*$. This is important in practical applications, as it shows that the position of the optimal separating direction given by $\beta$ can be consistently recovered. It is also worth mentioning that, if Equation (32) is satisfied, the direction of $\beta$ coincides with the direction of the first canonical vector. We refer to the work of Kubkowski and Mielniczuk [7] for the proof and to the work of Kubkowski and Mielniczuk [6] for discussion and up-to-date references on this problem. The linear regressions condition in Equation (32) is satisfied, e.g., by elliptically contoured distributions, in particular by the multivariate normal distribution. We note that it is proved in [18] that Equation (32) holds approximately for the majority of $\beta$. When Equation (32) holds exactly, the proportionality constant $\eta$ can be calculated numerically for known $\tilde q$ and $\beta$. We can thus state the following result provided Equation (32) is satisfied.
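The collinearity $\beta^* = \eta\beta$ can be illustrated by a small simulation. The sketch below (all settings ad hoc, not taken from the paper's experiments) draws two independent standard normal predictors, so that Equation (32) holds, generates $Y$ from the misspecified response $q_L((x_1+x_2)^3)$ with $\beta = (1,1)^T$, and fits a logistic model by Newton's method; the fitted direction then comes out close to $(1,1)/\sqrt{2}$.

```python
import math, random

random.seed(0)

def q_L(t):  # logistic response function
    return 1.0 / (1.0 + math.exp(-t))

# data from the misspecified model: P(Y=1 | X=x) = q_L((x1 + x2)^3)
n = 4000
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
Y = [1 if random.random() < q_L((x1 + x2) ** 3) else 0 for x1, x2 in X]

# fit a logistic model by Newton-Raphson (no intercept: it is 0 by symmetry)
b = [0.0, 0.0]
for _ in range(30):
    g = [0.0, 0.0]                       # gradient of the empirical risk
    H = [[1e-9, 0.0], [0.0, 1e-9]]       # Hessian, slightly ridge-stabilized
    for (x1, x2), y in zip(X, Y):
        p = q_L(b[0] * x1 + b[1] * x2)
        g[0] += (p - y) * x1
        g[1] += (p - y) * x2
        w = p * (1.0 - p)
        H[0][0] += w * x1 * x1
        H[0][1] += w * x1 * x2
        H[1][1] += w * x2 * x2
    H[1][0] = H[0][1]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    b[0] -= (H[1][1] * g[0] - H[0][1] * g[1]) / det
    b[1] -= (H[0][0] * g[1] - H[1][0] * g[0]) / det

norm = math.hypot(b[0], b[1])
print([round(c / norm, 2) for c in b])  # direction close to [0.71, 0.71]
```

The normalized estimate approximates $\beta/\|\beta\|_2 \approx (0.71, 0.71)^T$, while its length estimates $\eta\|\beta\|_2$; the exact value of $\eta$ depends on $\tilde q$ and can be computed numerically, as noted above.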

**Corollary 5.** *Assume that Equation (32) and the assumptions of Corollary 4 are satisfied. Moreover, $\mathrm{Cov}(Y, X) \ne 0$. Then, $P(\hat s^* = s) \to 1$.*

**Remark 2.** *If $p_n = O(e^{cn^\gamma})$ for some $c > 0$, $\gamma \in (0, 1/2)$, $\xi \in (0, 0.5 - \gamma)$, $u \in (0, 0.5 - \gamma - \xi)$, $k_n = O(n^\xi)$, $\lambda = C_n\log(p_n)/n$, $C_n = O(n^u)$, $C_n \to +\infty$, $n^{-\gamma/2} = O(\beta^*_{\min})$ and $a_n = dn^{\frac{1}{2}-u}$, then the assumptions imposed on the asymptotic behavior of parameters in Corollary 4 are satisfied.*

Note that $p_n$ is allowed to grow exponentially: $\log p_n = O(n^\gamma)$; however, $\beta^*_{\min}$ may not decrease to 0 too quickly relative to the growth of $p_n$: $n^{-\gamma/2} = O(\beta^*_{\min})$.

**Remark 3.** *We note that, to apply Corollary 4 to the two-step procedure based on Lasso, it is required that $|s^*| \le k_n$ and that the support of the Lasso estimator contains, with probability tending to 1, no more than $k_n$ elements. Some results bounding $|\operatorname{supp}\hat\beta_L|$ are available for deterministic $X$ (see [31]) and for random $X$ (see [32]), but they are too weak to be useful for EBIC penalties. The other possibility to prove consistency of the two-step procedure is to modify its first step by using thresholded Lasso (see [33]) corresponding to the $k_n'$ largest Lasso coefficients, where $k_n' \in N$ is such that $k_n = o(k_n')$. This is a subject of ongoing research.*

#### **6. Numerical Experiments**

#### *6.1. Selection Procedures*

We note that the original procedure is defined for a single $\lambda$ only. In the simulations discussed below, we implemented modifications of the SS procedure introduced in Section 5. In practice, it is generally more convenient to consider in the first step some sequence of penalty parameters $\lambda_1 > \dots > \lambda_m > 0$ instead of only one $\lambda$, in order to avoid choosing the "best" $\lambda$. For the fixed sequence $\lambda_1, \dots, \lambda_m$, we construct corresponding families $\mathcal{M}_1, \dots, \mathcal{M}_m$ analogously to $\mathcal{M}$ in Step 4 of the SS procedure. Thus, we arrive at the following SSnet procedure, which is a modification of the SOSnet procedure in [17]. Below, $\tilde b$ is the vector $b$ with the first coordinate, corresponding to the intercept, omitted: $b = (b_0, \tilde b^T)^T$:


1. Choose a sequence $\lambda_1 > \dots > \lambda_m > 0$.

2. Find $\hat\beta_L^{(i)} = \operatorname{arg\,min}_{b\in R^{p_n+1}} R_n(b) + \lambda_i\|\tilde b\|_1$ for $i = 1, \dots, m$.

3. Sort the nonzero coordinates of $\hat{\tilde\beta}_L^{(i)}$ according to decreasing magnitudes: $|\hat\beta^{(i)}_{L,j_1^{(i)}}| \ge \dots \ge |\hat\beta^{(i)}_{L,j_{k_i}^{(i)}}|$ for $i = 1, \dots, m$.

4. Define $\mathcal{M}_i = \{\{j_1^{(i)}\}, \{j_1^{(i)}, j_2^{(i)}\}, \dots, \{j_1^{(i)}, j_2^{(i)}, \dots, j_{k_i}^{(i)}\}\}$ for $i = 1, \dots, m$.

5. Define $\mathcal{M} = \{\emptyset\} \cup \bigcup_{i=1}^{m} \mathcal{M}_i$.

6. Find $\hat s^* = \operatorname{arg\,min}_{w\in\mathcal{M}} GIC(w)$, where
$$GIC(w) = \min_{b \in R^{p_n+1}:\ \operatorname{supp} \tilde b \subseteq w} nR_n(b) + a_n(|w| + 1).$$

Instead of constructing families $\mathcal{M}_i$ for each $\lambda_i$ in the SSnet procedure, $\lambda$ can be chosen by cross-validation using the 1SE rule (see [34]), and then the SS procedure is applied for such $\lambda$. We call this procedure SSCV. The last procedure considered was introduced by Fan and Tang [35] and is the Lasso procedure with penalty parameter $\hat\lambda$ chosen in a data-dependent way analogously to SSCV. Namely, it is the minimizer of the GIC criterion with $a_n = \log(\log n)\cdot\log p_n$, in which the ML estimator has been replaced by the Lasso estimator with penalty $\lambda$. Once $\hat\beta_L(\hat\lambda_L)$ is calculated, $\hat s^*$ is defined as its support. The procedure is called LFT in the sequel.
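The LFT rule reduces to a one-line minimization over the $\lambda$ path; in the sketch below the per-$\lambda$ supports and risks are hypothetical placeholders for the quantities computed from Lasso fits.

```python
import math

def lft_select(path, n, p):
    """Fan-Tang-style choice: minimize GIC with a_n = log(log n) * log(p)
    over a Lasso path; `path` lists (lambda, support, risk) triples, where
    `risk` is R_n evaluated at the Lasso estimator for that lambda."""
    a_n = math.log(math.log(n)) * math.log(p)
    lam, support, _ = min(path, key=lambda t: n * t[2] + a_n * len(t[1]))
    return lam, support

# hypothetical path: the risk decreases as the support grows
path = [(0.30, {1}, 0.60), (0.10, {1, 2}, 0.40), (0.03, {1, 2, 5, 9}, 0.39)]
lam, supp = lft_select(path, n=500, p=150)
print(lam, sorted(supp))  # → 0.1 [1, 2]
```

The penalty $a_n \approx 9.2$ here makes the two extra predictors of the last path point too expensive for the small risk reduction they bring.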

We list below versions of the above procedures along with the R packages that were used to choose the sequence $\lambda_1, \dots, \lambda_m$ and to compute the Lasso estimator. The following packages were chosen based on selection performance after initial tests for each loss and procedure:


The following functions were used to optimize *Rn* in GIC minimization step for each loss:


Before applying the investigated procedures, each column of the matrix $\mathbb{X} = (X_1, \dots, X_n)^T$ was standardized, as the Lasso estimator $\hat\beta_L$ depends on the scaling of predictors. We set the length of the $\lambda_i$ sequence to $m = 20$. Moreover, in all procedures we considered only $\lambda_i$ for which $|\hat s_L^{(i)}| \le n$ because, when $|\hat s_L^{(i)}| > n$, Lasso and ML solutions are not unique (see [32,36]). For Huber loss, we set the parameter $\delta = 1/10$ (see [12]). The number of folds in SSCV was set to $K = 10$.

Each simulation run consisted of $L$ repetitions, during which samples $\mathbb{X}_k = (X_1^{(k)}, \dots, X_n^{(k)})^T$ and $\mathbf{Y}_k = (Y_1^{(k)}, \dots, Y_n^{(k)})^T$ were generated for $k = 1, \dots, L$. For the $k$th sample $(\mathbb{X}_k, \mathbf{Y}_k)$, the estimator $\hat s_k^*$ of the set of active predictors was obtained by a given procedure as the support of $\hat{\tilde\beta}(\hat s_k^*)$, where

$$\hat\beta(\hat s_k^*) = (\hat\beta_0(\hat s_k^*), \hat{\tilde\beta}(\hat s_k^*)^T)^T = \operatorname*{arg\,min}_{b \in R^{p_n+1}:\ \operatorname{supp} \tilde b \subseteq \hat s_k^*} \frac{1}{n}\sum_{i=1}^n \rho(b^TX_i^{(k)}, Y_i^{(k)})$$

is the ML estimator for the $k$th sample. We denote by $\mathcal{M}^{(k)}$ the family $\mathcal{M}$ obtained by a given procedure for the $k$th sample.

In our numerical experiments, we computed the following measures of selection performance, which gauge the co-direction of the true parameter $\beta$ and $\hat\beta$ and the interplay between $s^*$ and $\hat s^*$:

• $ANGLE = \frac{1}{L}\sum_{k=1}^{L} \arccos\left|\cos\angle(\tilde\beta, \hat{\tilde\beta}(\hat s_k^*))\right|$, where

$$\cos\angle(\tilde\beta, \hat{\tilde\beta}(\hat s_k^*)) = \frac{\sum_{j=1}^{p_n}\tilde\beta_j\hat{\tilde\beta}_j(\hat s_k^*)}{\|\tilde\beta\|_2\,\|\hat{\tilde\beta}(\hat s_k^*)\|_2},$$

and we let $\cos\angle(\tilde\beta, \hat{\tilde\beta}(\hat s_k^*)) = 0$ if $\|\tilde\beta\|_2\|\hat{\tilde\beta}(\hat s_k^*)\|_2 = 0$,


Thus, $ANGLE$ is equal to the angle between the true parameter (with intercept omitted) and its post-model-selection estimator, averaged over simulations; $P_{inc}$ is the fraction of simulations for which the family $\mathcal{M}^{(k)}$ contains the true model $s^*$; and $P_{equal}$ and $P_{supset}$ are the fractions of simulations in which SSnet chooses the true model or its superset, respectively.
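The $ANGLE$ summary is a direct transcription of its definition; the sketch below implements it with the zero-denominator convention from the text, and the two estimates in the demo are made-up illustrations (one nearly proportional to $\tilde\beta$, one orthogonal to it).

```python
import math

def cos_angle(beta, beta_hat):
    """Cosine of the angle between true and estimated coefficient vectors
    (intercepts omitted); defined as 0 when either vector is zero."""
    dot = sum(b * bh for b, bh in zip(beta, beta_hat))
    denom = (math.sqrt(sum(b * b for b in beta))
             * math.sqrt(sum(bh * bh for bh in beta_hat)))
    return dot / denom if denom > 0 else 0.0

def angle_measure(beta, estimates):
    """ANGLE: average over repetitions of arccos |cos angle|."""
    return sum(math.acos(abs(cos_angle(beta, bh)))
               for bh in estimates) / len(estimates)

beta = [3.0, 3.0, 1.0, 1.0]            # e.g., the beta*_{s*} of Model M1 below
estimates = [[3.1, 2.9, 1.0, 1.1],     # nearly proportional: angle near 0
             [3.0, -3.0, 1.0, -1.0]]   # orthogonal to beta: angle pi/2
print(round(angle_measure(beta, estimates), 3))
```

The first estimate contributes an angle close to 0 and the second exactly $\pi/2$, so the average lies strictly between them.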

#### *6.2. Regression Models Considered*

To investigate the behavior of the two-step procedure under misspecification, we considered two similar models with different sets of predictors. As the sets of predictors differ, this results in correct specification of the first model (Model M1) and misspecification of the second (Model M2).

Namely, in Model M1, we generated *<sup>n</sup>* observations (*Xi*,*Yi*) <sup>∈</sup> *<sup>R</sup>p*+<sup>1</sup> × {0, 1} for *<sup>i</sup>* <sup>=</sup> 1, ... , *<sup>n</sup>* such that:

$$\begin{aligned} X_{i0} &= 1,\ X_{i1} = Z_{i1},\ X_{i2} = Z_{i2},\ X_{ij} = Z_{i,j-7} \text{ for } j = 10, \dots, p,\\ X_{i3} &= X_{i1}^2,\ X_{i4} = X_{i2}^2,\ X_{i5} = X_{i1}X_{i2},\\ X_{i6} &= X_{i1}^2X_{i2},\ X_{i7} = X_{i1}X_{i2}^2,\ X_{i8} = X_{i1}^3,\ X_{i9} = X_{i2}^3, \end{aligned}$$

where $Z_i = (Z_{i1}, \dots, Z_{ip})^T \sim N_p(0_p, \Sigma)$, $\Sigma = [\rho^{|i-j|}]_{i,j=1,\dots,p}$ and $\rho \in (-1, 1)$. We consider the response function $q(x) = q_L(x^3)$ for $x \in R$, $s = \{1, 2\}$ and $\beta_s = (1, 1)^T$. Thus,

$$\begin{aligned} P(Y_i = 1|X_i = x_i) &= q(\beta_s^Tx_{i,s}) = q(x_{i1} + x_{i2}) = q_L((x_{i1} + x_{i2})^3)\\ &= q_L(x_{i1}^3 + x_{i2}^3 + 3x_{i1}^2x_{i2} + 3x_{i1}x_{i2}^2)\\ &= q_L(3x_{i6} + 3x_{i7} + x_{i8} + x_{i9}). \end{aligned}$$

We observe that the last equality implies that the above binary model is correctly specified with respect to the family of fitted logistic models, and $X_6$, $X_7$, $X_8$ and $X_9$ are the four active predictors, whereas the remaining ones play no role in the prediction of $Y$. Hence, $s^* = \{6, 7, 8, 9\}$ and $\beta^*_{s^*} = (3, 3, 1, 1)^T$ are, respectively, the set of indices of active predictors and the non-zero coefficients of the projection onto the family of logistic models.
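Model M1 can be simulated directly; since $\Sigma = [\rho^{|i-j|}]$ is the covariance of a stationary AR(1) sequence, $Z$ can be generated by the recursion $Z_j = \rho Z_{j-1} + \sqrt{1-\rho^2}\,\epsilon_j$ instead of a Cholesky factorization. The sketch below is a minimal generator, not the code used in the experiments.

```python
import math, random

def gen_model_m1(n, p, rho, seed=0):
    """Generate (X, Y) from Model M1: Z ~ N_p(0, [rho^|i-j|]), polynomial
    features in Z_1, Z_2, and P(Y=1|X) = q_L((x_1 + x_2)^3)."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        # stationary AR(1) process has exactly the covariance rho^|i-j|
        z = [rng.gauss(0, 1)]
        for _ in range(p - 1):
            z.append(rho * z[-1] + math.sqrt(1 - rho * rho) * rng.gauss(0, 1))
        z1, z2 = z[0], z[1]
        x = [1.0, z1, z2,                  # X_0 (intercept), X_1, X_2
             z1 * z1, z2 * z2, z1 * z2,    # X_3, X_4, X_5
             z1 * z1 * z2, z1 * z2 * z2,   # X_6, X_7
             z1 ** 3, z2 ** 3]             # X_8, X_9
        x += z[2:p - 7]                    # X_10, ..., X_p are Z_3, ..., Z_{p-7}
        prob = 1.0 / (1.0 + math.exp(-((z1 + z2) ** 3)))
        X.append(x)
        Y.append(1 if rng.random() < prob else 0)
    return X, Y

X, Y = gen_model_m1(n=500, p=150, rho=0.15)
print(len(X), len(X[0]))  # → 500 151
```

Each row has $p + 1 = 151$ entries (intercept plus $p$ predictors), matching the design used with $n = 500$, $p = 150$ below.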

We considered the following parameters in numerical experiments: *n* = 500, *p* = 150, *ρ* ∈ {−0.9 + 0.15 · *k* : *k* = 0, 1, ... , 12}, and *L* = 500 (the number of generated datasets for each combination of parameters). We investigated procedures SSnet, SSCV, and LFT using logistic, quadratic, and Huber (cf. [12]) loss functions. For procedures SSnet and SSCV, we used GIC penalties with:

• *an* = log *n* (BIC); and

• *an* = log *n* + 2 log *pn* (EBIC1).

In Model M2, we generated $n$ observations $(X_i, Y_i) \in R^{p+1} \times \{0, 1\}$ for $i = 1, \dots, n$ such that $X_i = (X_{i0}, X_{i1}, \dots, X_{ip})^T$ and $(X_{i1}, \dots, X_{ip})^T \sim N_p(0_p, \Sigma)$, $\Sigma = [\rho^{|i-j|}]_{i,j=1,\dots,p}$ and $\rho \in (-1, 1)$. The response function is $q(x) = q_L(x^3)$ for $x \in R$, $s = \{1, 2\}$ and $\beta_s = (1, 1)^T$. This means that:

$$P(Y_i = 1|X_i = x_i) = q(\beta_s^Tx_{i,s}) = q(x_{i1} + x_{i2}) = q_L((x_{i1} + x_{i2})^3).$$

In comparison to Model M1, this model does not contain monomials of $X_{i1}$ and $X_{i2}$ of degree higher than 1 in its set of predictors. We observe that this binary model is misspecified with respect to the fitted family of logistic models, because $q(x_{i1} + x_{i2}) \not\equiv q_L(\beta^Tx_i)$ for any $\beta \in R^{p+1}$. However, in this case, the linear regressions condition in Equation (32) is satisfied for $X$, as it follows a normal distribution (see [5,7]). Hence, in view of Proposition 3.8 in [6], we have $s^*_{log} = \{1, 2\}$ and $\beta^*_{log,s^*_{log}} = \eta(1, 1)^T$ for some $\eta > 0$. The parameters $n$, $p$, $\rho$ as well as $L$ were chosen as for Model M1.

#### *6.3. Results for Models M1 and M2*

We first discuss the behavior of $P_{inc}$, $P_{equal}$ and $P_{supset}$ for the considered procedures. We observe that the values of $P_{inc}$ for SSCV and SSnet are close to 1 for low correlations in Model M2 for every tested loss (see Figure 1). In Model M1, $P_{inc}$ attains the largest values for the SSnet procedure and logistic loss for low correlations, which is because in most cases the corresponding family $\mathcal{M}$ is the largest among the families created by the considered procedures. $P_{inc}$ is close to 0 in Model M1 for quadratic and Huber loss, which results in low values of the remaining indices. This may be due to strong dependences between predictors in Model M1; note that we have, e.g., $\mathrm{Cor}(X_{i1}, X_{i8}) = 3/\sqrt{15} \approx 0.77$. It is seen that in Model M1 the inclusion probability $P_{inc}$ is much lower than in Model M2 (except for negative correlations). It is also seen that $P_{inc}$ for SSCV is larger than for LFT, and that LFT fails with respect to $P_{inc}$ in M1.


**Figure 1.** *Pinc* for Models M1 and M2.

In Model M1, the largest values of $P_{equal}$ are attained for SSnet with BIC penalty; the second best is SSCV with EBIC1 penalty (see Figure 2). In Model M2, $P_{equal}$ is close to 1 for SSnet and SSCV with EBIC1 penalty and is much larger than $P_{equal}$ for the corresponding versions using BIC penalty. We also note that the choice of loss is relevant only for larger correlations. These results confirm the theoretical result of Theorem 2.1 in [5], which shows that collinearity holds for a broad class of loss functions. We also observe that, although in Model M2 the remaining procedures do not select $s^*$ with high probability, they select its superset, as indicated by the values of $P_{supset}$ (see Figure 3). This analysis is confirmed by an analysis of the $ANGLE$ measure (see Figure 4), which attains values close to 0 when $P_{supset}$ is close to 1. Low values of the $ANGLE$ measure mean that the estimated vector $\hat{\tilde\beta}(\hat s_k^*)$ is approximately proportional to $\tilde\beta$, which is the case for Model M2, where normal predictors satisfy the linear regressions condition. Note that the angles between $\hat{\tilde\beta}(\hat s_k^*)$ and $\tilde\beta^*$ in Model M1 differ significantly even though Model M1 is well specified. In addition, for the best performing procedures in both models and *any* loss considered, $P_{equal}$ is much larger in Model M2 than in Model M1, even though the latter is correctly specified. This shows that choosing a simple misspecified model which retains crucial characteristics of the well specified large model, instead of the latter, might be beneficial.

In Model M1, procedures with BIC penalty perform better than those with EBIC1 penalty; however, the gain in $P_{equal}$ is much smaller than the gain from using EBIC1 in Model M2. The LFT procedure performs poorly in Model M1 and reasonably well in Model M2. The overall winner in both models is SSnet. SSCV performs only slightly worse than SSnet in Model M2 but significantly worse in Model M1.

An analysis of the computing times of the first and second stages of each procedure shows that the SSnet procedure creates large families $\mathcal{M}$, so that GIC minimization becomes computationally intensive. We also observe that the first stage of SSCV is more time consuming than that of SSnet, which is caused by the multiple fitting of Lasso in cross-validation. However, SSCV is much faster than SSnet in the second stage.

We conclude that, in the considered experiments, SSnet with EBIC1 penalty works best in most cases; however, even for the winning procedure, strong dependence of predictors results in deterioration of its performance. It is also clear from our experiments that the choice of GIC penalty is crucial for performance. A modification of the SS procedure which would perform satisfactorily for large correlations is still an open problem.

**Figure 2.** *Pequal* for Models M1 and M2.

**Figure 3.** *Psupset* for Models M1 and M2.

**Figure 4.** *ANGLE* for Models M1 and M2.

#### **7. Discussion**

In this paper, we study the problem of selecting the set of active variables in a binary regression model when the number of all predictors $p$ is much larger than the number of observations $n$ and active predictors are sparse among all predictors, i.e., their number is significantly smaller than $p$. We consider a general binary model and a fit based on minimization of the empirical risk corresponding to a general loss function. This scenario encompasses the case, common in practice, when the underlying semi-parametric model is misspecified, i.e., the assumed response function is different from the true one. For random predictors, we show that in such a case the two-step procedure based on Lasso consistently estimates the support of the pseudo-true vector $\beta^*$. Under the linear regressions condition and a semi-parametric model, this implies consistent recovery of the subset of active predictors. This partly explains why selection procedures perform satisfactorily even when the fitted model is wrong. We show that, by using the two-step procedure, we can successfully reduce the dimension of the model chosen by Lasso. Moreover, for the two-step procedure in the case of random predictors, we do not require the restrictive conditions on the experimental matrix needed for Lasso support consistency with deterministic predictors, such as the irrepresentable condition. Our experiments show satisfactory behavior of the proposed SSnet procedure with EBIC1 penalty.

Future research directions include studying the performance of the SS procedure without the subgaussianity assumption and, of practical importance, an automatic choice of the penalty for the GIC criterion. Moreover, finding a modification of the SS procedure that performs satisfactorily for large correlations remains an open problem. It would also be of interest to find conditions weaker than Equation (32) under which collinearity of $\beta$ and $\beta^*$ holds (see [18] for a different angle on this problem).

**Author Contributions:** Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research of the second author was partially supported by Polish National Science Center grant 2015/17/B/ST6/01878.

**Acknowledgments:** The comments by the two referees, which helped to improve presentation of the original version of the manuscript, are gratefully acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Proof of Lemma 1:

**Proof.** Observe first that the function $R_n$ is convex, as $\rho$ is convex. Moreover, from the definition of $\hat{\beta}_L$, we get the inequality:

$$W_n(\hat{\beta}_L) = R_n(\hat{\beta}_L) - R_n(\beta^*) \le \lambda\left(||\beta^*||_1 - ||\hat{\beta}_L||_1\right).\tag{A1}$$

Note that $v - \beta^* \in B_1(r)$, as we have:

$$||v - \beta^*||_1 = \frac{||\hat{\beta}_L - \beta^*||_1}{r + ||\hat{\beta}_L - \beta^*||_1} \cdot r \le r.\tag{A2}$$

By the definition of $W_n$, the convexity of $R_n$, Equation (A2) and the definition of $S$, we have:

$$\begin{split} W(v) &= W(v) - W_n(v) + R_n(v) - R_n(\beta^*) \\ &\le W(v) - W_n(v) + u\left(R_n(\hat{\beta}_L) - R_n(\beta^*)\right) \le S(r) + u\,W_n(\hat{\beta}_L). \end{split}\tag{A3}$$

From the convexity of the $l_1$ norm, Equations (A1) and (A3), the equality $||\beta^*||_1 = ||\beta^*_{s^*}||_1$, and the triangle inequality, it follows that:

$$\begin{split} W(v) + \lambda||v||_1 &\le W(v) + \lambda u||\hat{\beta}_L||_1 + \lambda(1-u)||\beta^*||_1 \\ &\le S(r) + u\,W_n(\hat{\beta}_L) + u\lambda\left(||\hat{\beta}_L||_1 - ||\beta^*||_1\right) + \lambda||\beta^*||_1 \\ &\le S(r) + \lambda||\beta^*||_1 \le S(r) + \lambda||\beta^*_{s^*} - v_{s^*}||_1 + \lambda||v_{s^*}||_1. \end{split}\tag{A4}$$

Hence,

$$\begin{aligned} W(v) + \lambda||v - \beta^*||_1 &= \left(W(v) + \lambda||v||_1\right) + \lambda\left(||v - \beta^*||_1 - ||v||_1\right) \\ &\le S(r) + \lambda||\beta^*_{s^*} - v_{s^*}||_1 + \lambda||v_{s^*}||_1 + \lambda\left(||v - \beta^*||_1 - ||v||_1\right) = S(r) + 2\lambda||\beta^*_{s^*} - v_{s^*}||_1. \end{aligned}$$

We prove now Lemma A1 needed in the proof of Lemma 2 below.

**Lemma A1.** *Assume that $S \sim \mathrm{Subg}(\sigma^2)$ and $T$ is a random variable such that $|T| \le M$, where $M$ is some positive constant and $S$ and $T$ are independent. Then, $ST \sim \mathrm{Subg}(M^2\sigma^2)$.*

**Proof.** Observe that:

$$E e^{tST} = E\left(E\left(e^{tST} \mid T\right)\right) \le E e^{\frac{t^2 T^2 \sigma^2}{2}} \le e^{\frac{t^2 M^2 \sigma^2}{2}}.$$
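The bound in Lemma A1 can also be checked numerically. The sketch below (a Monte-Carlo illustration with arbitrarily chosen $\sigma$, $M$ and $t$, not part of the paper) estimates $E e^{tST}$ for a Gaussian $S$ and a bounded uniform $T$, and compares it with the subgaussian bound $e^{t^2 M^2 \sigma^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# S ~ N(0, sigma^2) is Subg(sigma^2); T is bounded by M and independent of S.
sigma, M, N = 1.0, 2.0, 200_000
S = sigma * rng.standard_normal(N)
T = rng.uniform(-M, M, N)

# Lemma A1 asserts E exp(t*S*T) <= exp(t^2 * M^2 * sigma^2 / 2) for all t.
for t in [0.2, 0.5, 1.0]:
    mgf = np.mean(np.exp(t * S * T))
    bound = np.exp(t**2 * M**2 * sigma**2 / 2)
    print(f"t={t}: estimated MGF {mgf:.3f}, bound {bound:.3f}")
    assert mgf <= bound
```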

Proof of Lemma 2.

**Proof.** From the Chebyshev inequality (first inequality below), the symmetrization inequality (see Lemma 2.3.1 of [29]) and the Talagrand–Ledoux inequality ([30], Theorem 4.12), we have for $t > 0$ and $(\varepsilon_i)_{i=1,\dots,n}$ being Rademacher variables independent of $(X_i)_{i=1,\dots,n}$:

$$\begin{split} P(S(r) > t) &\le \frac{ES(r)}{t} \\ &\le \frac{2}{tn} E \sup_{b \in \mathbb{R}^{p_n}:\, b - \beta^* \in B_1(r)} \left|\sum_{i=1}^n \varepsilon_i\left(\rho(X_i^T b, Y_i) - \rho(X_i^T \beta^*, Y_i)\right)\right| \\ &\le \frac{4L}{tn} E \sup_{b \in \mathbb{R}^{p_n}:\, b - \beta^* \in B_1(r)} \left|\sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*)\right|. \end{split}\tag{A5}$$

We observe that $\varepsilon_i X_{ij} \sim \mathrm{Subg}(\sigma^2_{jn})$ in view of Lemma A1. Hence, using independence, we obtain $\sum_{i=1}^n \varepsilon_i X_{ij} \sim \mathrm{Subg}(n\sigma^2_{jn})$ and thus $\sum_{i=1}^n \varepsilon_i X_{ij} \sim \mathrm{Subg}(n s_n^2)$. Applying the Hölder inequality and the following inequality (see Lemma 2.2 of [37]):

$$E\left\|\sum_{i=1}^{n}\varepsilon_i X_i\right\|_\infty \le \sqrt{n}\,s_n\sqrt{2\ln(2p_n)} \le 2 s_n\sqrt{n\ln(p_n \vee 2)},\tag{A6}$$


we have:

$$\begin{split} \frac{4L}{tn} E \sup_{b \in \mathbb{R}^{p_n}:\, b - \beta^* \in B_1(r)} \left|\sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*)\right| &\le \frac{4Lr}{t} E \max_{j \in \{1,\dots,p_n\}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_{ij}\right| \\ &\le \frac{8Lr s_n \sqrt{\ln(p_n \vee 2)}}{t\sqrt{n}}. \end{split}$$

From this, Part 1 follows. In the proofs of Parts 2 and 3, the first inequalities are the same as in Equation (A5), with the suprema taken over the corresponding sets. Using the Cauchy–Schwarz inequality, the inequality $||v||_2 \le \sqrt{|v|}\,||v||_\infty$, the inequality $||v_\pi||_\infty \le ||v||_\infty$ for $\pi \subseteq \{1,\dots,p_n\}$, and Equation (A6) yields:

$$\begin{split} P(S_1(r) \ge t) &\le \frac{4L}{nt} E \sup_{b \in D_1:\, b - \beta^* \in B_2(r)} \left|\sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*)\right| \\ &\le \frac{4Lr}{nt} E \max_{\pi \subseteq \{1,\dots,p_n\},\, |\pi| \le k_n} \left\|\sum_{i=1}^n \varepsilon_i X_{i,\pi}\right\|_2 \\ &\le \frac{4Lr}{nt} E \max_{\pi \subseteq \{1,\dots,p_n\},\, |\pi| \le k_n} \sqrt{|\pi|} \left\|\sum_{i=1}^n \varepsilon_i X_{i,\pi}\right\|_\infty \\ &\le \frac{4Lr\sqrt{k_n}}{nt} E\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|_\infty \le \frac{8Lr\sqrt{k_n}\, s_n \sqrt{\ln(p_n \vee 2)}}{t\sqrt{n}}. \end{split}$$

Similarly, for $S_2(r)$, using the Cauchy–Schwarz inequality, the inequality $||v_\pi||_2 \le ||v_{s^*}||_2$, which is valid for $\pi \subseteq s^*$, the definition of the $l_2$ norm and the inequality $E|Z| \le \sqrt{EZ^2} \le \sigma$ for $Z \sim \mathrm{Subg}(\sigma^2)$, we obtain:

$$\begin{split} P(S_2(r) \ge t) &\le \frac{4L}{nt} E \sup_{b \in D_2:\, b - \beta^* \in B_2(r)} \left|\sum_{i=1}^n \varepsilon_i X_i^T (b - \beta^*)\right| \\ &\le \frac{4Lr}{nt} E \max_{\pi \subseteq s^*} \left\|\sum_{i=1}^n \varepsilon_i X_{i,\pi}\right\|_2 \le \frac{4Lr}{nt} E\left\|\sum_{i=1}^n \varepsilon_i X_{i,s^*}\right\|_2 \\ &\le \frac{4Lr}{nt}\sqrt{E\left\|\sum_{i=1}^n \varepsilon_i X_{i,s^*}\right\|_2^2} = \frac{4Lr}{nt}\sqrt{\sum_{j \in s^*} E\left(\sum_{i=1}^n \varepsilon_i X_{ij}\right)^2} \le \frac{4Lr\sqrt{|s^*|}\, s_n}{t\sqrt{n}}. \end{split}$$
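The maximal inequality (A6), which drives all three parts of this proof, can be illustrated numerically. The following sketch (with hypothetical dimensions; standard Gaussian entries, so $s_n = 1$) compares a Monte-Carlo estimate of $E\|\sum_i \varepsilon_i X_i\|_\infty$ with the bound $2 s_n \sqrt{n \ln(p_n \vee 2)}$:

```python
import numpy as np

rng = np.random.default_rng(2)

n, p_n, s_n, reps = 200, 1000, 1.0, 200
norms = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p_n))        # rows X_i with Subg(1) entries
    eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher signs
    norms[r] = np.abs(eps @ X).max()         # ||sum_i eps_i X_i||_inf
empirical = norms.mean()
bound = 2.0 * s_n * np.sqrt(n * np.log(max(p_n, 2)))
print(f"estimate {empirical:.1f} <= bound {bound:.1f}")
```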

Proof of Lemma 3.

**Proof.** Let $u$ and $v$ be defined as in Lemma 1. Observe that $||v - \beta^*||_1 \le r/2$ is equivalent to $||\hat{\beta}_L - \beta^*||_1 \le r$, as the function $f(x) = rx/(x+r)$ is increasing, $f(r) = r/2$ and $f(||\hat{\beta}_L - \beta^*||_1) = ||v - \beta^*||_1$. Let $C = 1/(4+\varepsilon)$. We consider two cases:

(i) $||v_{s^*} - \beta^*_{s^*}||_1 \le Cr$.

In this case, from the basic inequality (Lemma 1), we have:

$$||v - \beta^*||_1 \le \lambda^{-1}\left(W(v) + \lambda||v - \beta^*||_1\right) \le \lambda^{-1}S(r) + 2||v_{s^*} - \beta^*_{s^*}||_1 \le Cr + 2Cr = \frac{r}{2}.$$

(ii) $||v_{s^*} - \beta^*_{s^*}||_1 > Cr$.

Note that $||v_{s^{*c}}||_1 < (1-C)r$; otherwise, we would have $||v - \beta^*||_1 > r$, which contradicts Equation (A2) in the proof of Lemma 1. Now, we observe that $v - \beta^* \in \mathcal{C}_\varepsilon$, as we have from the definition of $C$ and the assumption of this case:

$$||v_{s^{*c}}||_1 < (1-C)r = (3+\varepsilon)Cr < (3+\varepsilon)||v_{s^*} - \beta^*_{s^*}||_1.$$

By the inequality between the $l_1$ and $l_2$ norms, the definition of $\kappa_H(\varepsilon)$, the inequality $ca^2/4 + b^2/c \ge ab$, and the margin condition (MC) (which holds because $v - \beta^* \in B_1(r) \subseteq B_1(\delta)$ in view of Equation (A2)), we conclude that:

$$\begin{split} ||v_{s^*} - \beta^*_{s^*}||_1 &\le \sqrt{|s^*|}\,||v_{s^*} - \beta^*_{s^*}||_2 \le \sqrt{|s^*|}\,||v - \beta^*||_2 \\ &\le \sqrt{|s^*|}\sqrt{\frac{(v - \beta^*)^T H (v - \beta^*)}{\kappa_H(\varepsilon)}} \end{split}\tag{A7}$$

$$\le \frac{\theta(v - \beta^*)^T H (v - \beta^*)}{4\lambda} + \frac{|s^*|\lambda}{\theta\kappa_H(\varepsilon)} \le \frac{W(v)}{2\lambda} + \frac{|s^*|\lambda}{\theta\kappa_H(\varepsilon)}.\tag{A8}$$

Hence, from the basic inequality (Lemma 1) and the inequality above, it follows that:

$$W(v) + \lambda||v - \beta^*||_1 \le S(r) + 2\lambda||v_{s^*} - \beta^*_{s^*}||_1 \le S(r) + W(v) + \frac{2|s^*|\lambda^2}{\theta\kappa_H(\varepsilon)}.$$

Subtracting $W(v)$ from both sides of the above inequality and using the assumption on $S$, the bound on $|s^*|$, and the definition of $\tilde{C}$ yields:

$$||v - \beta^*||_1 \le \frac{S(r)}{\lambda} + \frac{2|s^*|\lambda}{\theta\kappa_H(\varepsilon)} \le \tilde{C}r + \frac{2|s^*|\lambda}{\theta\kappa_H(\varepsilon)} \le (\tilde{C} + \tilde{C})r = \frac{r}{2}.$$

Proof of Remark 1.

**Proof.** The condition $\liminf_{n \to \infty} \frac{D_n a_n}{k_n \log(2p_n)} > 1$ is equivalent to the existence of some $u > 0$ such that for almost all $n$ we have:

$$D_n a_n - (1+u)k_n \log(2p_n) > 0.$$

(1) We observe that, if

$$A a_n - (1+u)k_n \log(2p_n) > 0,$$

then the above condition is satisfied. For BIC, we have:

$$A \log n > (1+u)k_n \log(2p_n) > 0,$$

which is equivalent to the condition (1) of the Remark.

(2) We observe that using the inequalities $k_n \le C$, $2A\gamma - (1+u)C \ge 0$ and $p_n \ge 1$ yields for $n > 2^{(1+u)C/A}$:

$$\begin{aligned} A(\log n + 2\gamma \log p_n) - (1+u)k_n \log(2p_n) &\ge A(\log n + 2\gamma \log p_n) - (1+u)C \log(2p_n) \\ &= (2A\gamma - (1+u)C)\log p_n + A\log n - (1+u)C\log 2 \\ &\ge A\log n - (1+u)C\log 2 > 0. \end{aligned}$$

(3) In this case, we check, similarly to (2), that

$$\begin{aligned} A(\log n + 2\gamma \log p_n) - (1+u)k_n \log(2p_n) &\ge A(\log n + 2\gamma \log p_n) - (1+u)C \log(2p_n) \\ &= (2A\gamma - (1+u)C)\log p_n + A\log n - (1+u)C\log 2 > 0. \end{aligned}$$

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Towards a Unified Theory of Learning and Information**

#### **Ibrahim Alabdulmohsin**

Google Research, 8002 Zürich, Switzerland; ibomohsin@google.com

Received: 10 February 2020; Accepted: 6 April 2020; Published: 13 April 2020

**Abstract:** In this paper, we introduce the notion of "learning capacity" for algorithms that learn from data, which is analogous to the Shannon channel capacity for communication systems. We show how "learning capacity" bridges the gap between statistical learning theory and information theory, and we will use it to derive generalization bounds for finite hypothesis spaces, differential privacy, and countable domains, among others. Moreover, we prove that under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule that has a vanishing learning capacity is equivalent to the assertion that the hypothesis space has a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory. In addition, we show how the learning capacity of an algorithm provides important qualitative results, such as on the relation between generalization and algorithmic stability, information leakage, and data processing. Finally, we conclude by listing some open problems and suggesting future directions of research.

**Keywords:** statistical learning theory; information theory; entropy; parameter estimation; learning systems; privacy; prediction methods

#### **1. Introduction**

#### *1.1. Generalization Risk*

A central goal when learning from data is to strike a balance between underfitting and overfitting. Mathematically, this requirement can be translated into an optimization problem with two competing objectives. First, we would like the learning algorithm to produce a hypothesis (i.e., an answer) that performs well on the empirical sample. This goal can be easily achieved by using a *rich* hypothesis space that can "explain" any observations. Second, we would like to guarantee that the performance of the hypothesis on the empirical data (a.k.a. training error) is a good approximation of its performance with respect to the unknown underlying distribution (a.k.a. test error). This goal can be achieved by *limiting* the complexity of the hypothesis space. The first condition mitigates underfitting while the latter condition mitigates overfitting.

Formally, suppose we have a learning algorithm $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ that receives a sample $\mathbf{s} = \{\mathbf{z}_1, \dots, \mathbf{z}_m\}$, which comprises $m$ i.i.d. observations $\mathbf{z}_i \sim p(z)$, and uses $\mathbf{s}$ to select a hypothesis $\mathbf{h} \in \mathcal{H}$. Let $l$ be a loss function defined on the product space $\mathcal{Z} \times \mathcal{H}$. For instance, $l$ can be the mean-square error (MSE) in regression or the 0–1 error in classification. Then, the goal of learning from data is to select a hypothesis $\mathbf{h} \in \mathcal{H}$ such that its *true risk* $R(\mathbf{h})$, defined by

$$\mathcal{R}(h) = \mathbb{E}\_{\mathbf{z} \sim p(z)}[l(\mathbf{z}, h)],\tag{1}$$

*Entropy* **2020**, *22*, 438; doi:10.3390/e22040438 www.mdpi.com/journal/entropy

is small. However, this optimization problem is often difficult to solve exactly since the underlying distribution of observations $p(z)$ is seldom known. Instead, one notes that the true risk $R(\mathbf{h})$ can be decomposed into a sum of two terms:

$$R(\mathbf{h}) = R_{\mathbf{s}}(\mathbf{h}) + \left[R(\mathbf{h}) - R_{\mathbf{s}}(\mathbf{h})\right],$$

where $R_{\mathbf{s}}(\mathbf{h}) = \mathbb{E}_{\mathbf{z}\sim\mathbf{s}}[l(\mathbf{z}, \mathbf{h})] \doteq (1/m)\sum_{z\in\mathbf{s}} l(z, \mathbf{h})$, and both terms can be tackled separately. The first term in the equation above corresponds to the *empirical risk* on the training sample $\mathbf{s}$. The second term corresponds to the *generalization risk*. Hence, by minimizing both terms, one obtains a learning algorithm whose true risk is small.
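The decomposition is exact by definition, as the following toy computation (our illustration, not from the paper) shows for a plug-in mean estimator under the squared loss $l(z, h) = (z-h)^2$:

```python
import numpy as np

rng = np.random.default_rng(3)

m = 50
s = rng.normal(0.0, 1.0, m)   # training sample from p(z) = N(0, 1)
h = s.mean()                  # ERM hypothesis for the squared loss

R_emp = np.mean((s - h) ** 2)   # empirical risk R_s(h)
R_true = 1.0 + h**2             # true risk E[(z - h)^2] for z ~ N(0, 1)
gap = R_true - R_emp            # generalization risk

# The true risk equals the empirical risk plus the generalization risk exactly.
assert abs(R_emp + gap - R_true) < 1e-12
print(f"R(h)={R_true:.3f}, R_s(h)={R_emp:.3f}, gap={gap:.3f}")
```

Because $h$ was fit on $\mathbf{s}$, the empirical risk is biased downward, which is exactly why the gap term is positive on average.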

Minimizing the empirical risk can be achieved using tractable approximations to the *empirical risk minimization* (ERM) procedure, such as stochastic convex optimization [1,2]. However, the generalization risk is often difficult to deal with directly because the underlying distribution is often unknown. Instead, it is a common practice to bound it *analytically*. By establishing analytical conditions for generalization, one hopes to design better learning algorithms that both perform well empirically and generalize as well into the future.

Several methods have been proposed in the past for bounding the generalization risk of learning algorithms. Some examples of popular approaches include uniform convergence, algorithmic stability, Rademacher and Gaussian complexities, and the PAC–Bayesian framework [3–7].

The proliferation of such bounds can be understood upon noting that the generalization risk of a learning algorithm is influenced by multiple factors, such as the domain Z, the hypothesis space H, and the mapping from Z to H. Hence, one may derive new generalization bounds by imposing conditions on any of such components. For example, the Vapnik–Chervonenkis (VC) theory derives generalization bounds by assuming constraints on H whereas stability bounds, e.g., [6,8,9], are derived by assuming constraints on the mapping from Z to H.

Rather than showing that certain conditions are sufficient for generalization, we will establish in this paper conditions that are both *necessary and sufficient*. More precisely, we will show that the "uniform" generalization risk of a learning algorithm admits an *information-theoretic* characterization. In particular, it is *equal* to the total variation distance between the joint distribution of the hypothesis **h** and a single random training example **z**ˆ ∼ **s**, on one hand, and the product of their marginal distributions, on the other hand. Hence, it is analogous to the mutual information between **h** and **z**ˆ. Since uniform generalization is an information-theoretic quantity, information-theoretic tools, such as the data-processing inequality and the chain rules of entropy [10], can be used to analyze the performance of machine learning algorithms. For example, we will illustrate this fact by presenting a simple proof of the classical generalization bound in the finite hypothesis space setting using, solely, information-theoretic inequalities without any reference to the union bound.

#### *1.2. Types of Generalization*

Generalization bounds can be stated either in expectation or in probability. Let *l* : Z×H→ [0, 1] be some loss function with a bounded range. Then, we have the following definitions:

**Definition 1** (Generalization in Expectation)**.** *The expected generalization risk of a learning algorithm* L : <sup>Z</sup>*<sup>m</sup>* → H *with respect to a loss l* : Z×H→ [0, 1] *is defined by:*

$$R_{gen}(\mathcal{L}) = \mathbb{E}_{\mathbf{h}}\left[R(\mathbf{h})\right] - \mathbb{E}_{\mathbf{s},\mathbf{h}}\,\mathbb{E}_{\hat{\mathbf{z}}\sim\mathbf{s}}\left[l(\hat{\mathbf{z}}, \mathbf{h})\right],\tag{2}$$

*where R*(*h*) *is defined in Equation (1), and the expectation is taken over the random choice of s and the internal randomness of* L*. A learning algorithm* L *generalizes in expectation if Rgen*(L) → 0 *as m* → ∞ *for all distributions p*(*z*)*.*

**Definition 2** (Generalization in Probability)**.** *A learning algorithm* L *generalizes in probability if for any* $\epsilon > 0$*, we have:*

$$P\left\{\left|R(\mathbf{h}) - \mathbb{E}_{\hat{\mathbf{z}}\sim\mathbf{s}}[l(\hat{\mathbf{z}}, \mathbf{h})]\right| > \epsilon\right\} \to 0 \quad \text{as } m \to \infty,$$

*where the probability is evaluated over the randomness of s and the internal randomness of the learning algorithm.*

In general, both types of generalization have been used to analyze machine learning algorithms. For instance, generalization in probability is used in the VC theory to analyze algorithms with finite VC dimensions, such as linear classifiers [3]. Generalization in expectation, on the other hand, was used to analyze learning algorithms, such as the stochastic gradient descent (SGD), differential privacy, and ridge regression [11–14]. Generalization in expectation is often simpler to analyze, but it provides a weaker performance guarantee.

#### *1.3. Paper Outline*

In this paper, a third notion of generalization is introduced, which is called *uniform* generalization. Uniform generalization also provides generalization bounds in expectation, but it is stronger than the traditional form of generalization in expectation in Definition 1 because it requires that the generalization risk vanishes uniformly in expectation across *all* bounded parametric loss functions (hence the name). In this paper, a loss function *l* : Z×H→ [0, 1] is called "parametric" if it is conditionally independent of the original training sample given the learned hypothesis *h* ∈ H.

As mentioned earlier, the *uniform* generalization risk is *equal* to an information-theoretic quantity and it yields classical results in statistical learning theory. Perhaps more importantly, and unlike traditional in-expectation guarantees that do not imply concentration, we will show that uniform generalization in expectation implies generalization in probability. Hence, all of the uniform generalization bounds derived in this paper hold both in expectation and with a high probability.

The theory of uniform generalization bridges the gap between information theory and statistical learning theory. For example, we will establish an equivalence relation between the VC dimension, on one hand, and another quantity that is quite analogous to the Shannon channel capacity, on the other hand. Needless to say, the VC dimension and the Shannon channel capacity are arguably the most central concepts in statistical learning theory and information theory, respectively. This connection between the two concepts is obtained via the notion of the "learning capacity" that we introduce in this paper, which is the supremum of the uniform generalization risk across all input distributions. We will compute the learning capacities of many machine learning algorithms and show how they match known bounds on the generalization risk up to logarithmic factors.

In general, the main aim of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Despite the fact that "uniform generalization" might appear to be a strong condition at first sight, one of the central themes that is emphasized repeatedly throughout this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice. It is not a condition that machine learning practitioners need to impose or enforce. We believe this holds because any learning algorithm is a *channel* from the space of training samples to the hypothesis space, so its risk of overfitting can be analyzed by studying the properties of this mapping itself. Such an approach yields the uniform generalization bounds that are derived in this paper.

While we strive to introduce foundational results in this work, there are many important questions that remain unanswered. We conclude this paper by listing some of those open problems and suggesting future directions of research.

#### **2. Notation**

The notation used in this paper is fairly standard. Important exceptions are listed here. If **x** is a random variable that takes its values from a finite set **s** uniformly at random, we write **x** ∼ **s** to denote such a distribution. If **<sup>x</sup>** is a boolean random variable (i.e., a predicate), then <sup>I</sup>{**x**} <sup>=</sup> 1 if and only if **<sup>x</sup>** is true, otherwise <sup>I</sup>{**x**} <sup>=</sup> 0. In general, random variables are denoted with boldface letters **<sup>x</sup>**, instances of random variables are denoted with small letters *x*, matrices are denoted with capital letters *X*, and alphabets (i.e., fixed sets) are denoted with calligraphic typeface X (except L that will be reserved for the learning algorithm and D that will be reserved for the input distribution as is customary in the literature).

Throughout this paper, we will always write Z to denote the space of observations (a.k.a. *domain*) and write <sup>H</sup> to denote the hypothesis space (a.k.a. *range*). A learning algorithm <sup>L</sup> : <sup>Z</sup>*<sup>m</sup>* → H is formally treated as a stochastic map, where the hypothesis **h** ∈ H can be a deterministic or a randomized function of the training sample **<sup>s</sup>** ∈ Z*m*. Given a 0–1 loss function *<sup>l</sup>* : H×Z→{0, 1}, we will abuse terminology slightly by speaking about the " VC dimension of H" when we actually mean the VC dimension of the loss class {*l*(·, *h*) : *h* ∈ H}.

In addition, given two probability measures $p$ and $q$ defined on the same space, we will write $\langle p, q\rangle$ to denote the *overlapping coefficient* between $p$ and $q$. That is, $\langle p, q\rangle = 1 - ||p, q||_T$, where $||p, q||_T = \frac{1}{2}||p - q||_1$ is the total variation distance.
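For two discrete distributions, these quantities are straightforward to compute. The small sketch below (toy distributions of our own choosing) also checks the identity $\langle p, q\rangle = \sum_x \min(p(x), q(x))$, which holds in the discrete case:

```python
import numpy as np

# Two toy distributions on a three-point space.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

tv = 0.5 * np.abs(p - q).sum()   # total variation distance ||p, q||_T
overlap = 1.0 - tv               # overlapping coefficient <p, q>

# In the discrete case, the overlap equals the mass shared by p and q.
assert np.isclose(overlap, np.minimum(p, q).sum())
print(f"TV = {tv:.1f}, overlap = {overlap:.1f}")  # TV = 0.3, overlap = 0.7
```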

Moreover, we will use the *order in probability* notation for real-valued *random* variables. Here, we adopt the notation used by [15] and [16]. In particular, let $\mathbf{x} = \mathbf{x}_n$ be a real-valued random variable that depends on some parameter $n \in \mathbb{N}$. Then, we will write $\mathbf{x}_n = O_p(f(n))$ if for any $\delta > 0$, there exist absolute constants $C$ and $n_0$ such that for any fixed $n \ge n_0$, the inequality $|\mathbf{x}_n| < C|f(n)|$ holds with a probability of, at least, $1 - \delta$. In other words, the ratio $\mathbf{x}_n/f(n)$ is *stochastically* bounded [15]. Similarly, we write $\mathbf{x}_n = o_p(f(n))$ if $\mathbf{x}_n/f(n)$ converges to zero in probability. As an example, if $\mathbf{x} \sim \mathcal{N}(0, I_d)$ is a standard multivariate Gaussian vector, then $||\mathbf{x}||_2 = O_p(\sqrt{d})$ even though $||\mathbf{x}||_2$ can be arbitrarily large. Intuitively, the probability of the event $||\mathbf{x}||_2 \ge d^{\frac{1}{2}+\epsilon}$ when $\epsilon > 0$ goes to zero as $d \to \infty$, so $||\mathbf{x}||_2$ is *effectively* of the order $O(\sqrt{d})$.
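The $||\mathbf{x}||_2 = O_p(\sqrt{d})$ example can be seen directly by simulation: the ratio $||\mathbf{x}||_2/\sqrt{d}$ stays near 1 as $d$ grows (a quick illustration with arbitrarily chosen dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)

ratios = {}
for d in [10, 100, 10_000]:
    x = rng.standard_normal(d)            # x ~ N(0, I_d)
    ratios[d] = np.linalg.norm(x) / np.sqrt(d)
    print(f"d={d}: ||x||_2 / sqrt(d) = {ratios[d]:.3f}")
```

For large $d$, the ratio concentrates sharply around 1 (its standard deviation is of order $1/\sqrt{d}$), which is the content of the stochastic boundedness claim.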

#### **3. Related Work**

A learning algorithm is called *consistent* if the true risk of its hypothesis **h** converges to the optimal true risk in H, i.e., inf*h*∈H *<sup>R</sup>*(*h*), as *<sup>m</sup>* → <sup>∞</sup> in a distribution agnostic manner. A learning problem, which is a tuple (Z, H, *l*) with *l* being a loss function defined on the product space Z×H, is called *learnable* if it admits a consistent learning algorithm. It can be shown that learnability is equivalent to uniform convergence for supervised classification and regression even though uniform convergence is not necessary in the general setting [17].

Unlike learnability, the subject of generalization looks into how representative the empirical risk *R***s**(**h**) is to the true risk *R*(**h**) as discussed earlier. It can be rightfully considered as an extension to the *law of large numbers*, which is one of the earliest and most important results in probability theory and statistics. However, unlike the law of large numbers, which assumes that observations are independent and identically distributed, the subject of generalization in machine learning addresses the case where the losses *l*(**z***i*, **h**) are no longer i.i.d. due to the fact that **h** is selected according to the training sample **s** and **z***<sup>i</sup>* ∈ **s**.

Similar to learnability, uniform convergence is, by definition, sufficient for generalization but it is not necessary because the learning algorithm might restrict its search space to a smaller subset of H. So, in addition to uniform convergence bounds, several other methods have been introduced for bounding the generalization risk, such as using algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [5–7,18–20]. Classical concentration of measure inequalities, such as using the union bound, form the building blocks of such rich theories.

In this work, we address the subject of generalization in machine learning from an information-theoretic point of view. We will show that if the hypothesis $\mathbf{h}$ conveys "little" information about a random single training example $\hat{\mathbf{z}} \sim \mathbf{s}$, then the difference between $\mathbb{E}_{\hat{\mathbf{z}}\sim\mathbf{s}}[l(\hat{\mathbf{z}}, \mathbf{h})]$ and $\mathbb{E}_{\mathbf{z}\sim p(z)}[l(\mathbf{z}, \mathbf{h})]$ will be small with a high probability. The measure of information we use here is given by the notion of *variational information* $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h})$ between the hypothesis $\mathbf{h}$ and a single random training example $\hat{\mathbf{z}} \sim \mathbf{s}$. Variational information, also sometimes called *T*-information [14], is an instance of the class of *informativity* measures using *f*-divergences, which can be motivated axiomatically [21,22]. Unlike traditional methods, we will prove that $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h})$ is *equal* to the "uniform" generalization risk; it is not just an upper bound.
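The variational information $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h})$ can be estimated directly in a toy discrete setting. The sketch below (our own illustrative example: Bernoulli data with a majority-vote "hypothesis") estimates the joint distribution of $(\hat{\mathbf{z}}, \mathbf{h})$ by simulation and computes the total variation distance to the product of its marginals:

```python
import numpy as np

rng = np.random.default_rng(5)

# z ~ Bernoulli(1/2); the hypothesis h is the majority label of an m-sample,
# and z_hat is one training example drawn uniformly from the sample.
m, reps = 5, 200_000
samples = rng.integers(0, 2, size=(reps, m))
h = (samples.sum(axis=1) * 2 > m).astype(int)          # majority vote
z_hat = samples[np.arange(reps), rng.integers(0, m, reps)]

joint = np.zeros((2, 2))
np.add.at(joint, (z_hat, h), 1.0)
joint /= reps

# J(z_hat; h): TV distance between the joint and the product of the marginals.
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
J = 0.5 * np.abs(joint - product).sum()
print(f"estimated J(z_hat; h) = {J:.3f}")  # analytically 3/16 = 0.1875 for m = 5
```

The estimate is strictly positive: the majority-vote hypothesis leaks information about individual training examples, and by the theory developed in this paper that leakage equals its uniform generalization risk.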

Information-theoretic approaches of analyzing the generalization risk of learning algorithms, such as the one proposed in this paper, have found applications in adaptive data analysis. This includes the work of [12] using the *max-information*, the work of [23] and [24] using the *mutual information*, and the work of [14] using the *leave-one-out* information. One key contribution of our work is to show that one should examine the relationship between the hypothesis and a *single* random training example, instead of examining the relationship between the hypothesis and the full training sample as is customary in the literature. The gap between such two approaches is strict. For example, Theorem 8 in Section 5.5 presents an example of when a learning algorithm can have a vanishing uniform generalization risk even when the mutual information between the learned hypothesis and the training sample can be made arbitrarily large.

#### **4. Uniform Generalization**

#### *4.1. Preliminary Definitions*

In this paper, we consider the general setting of learning introduced by Vapnik [3]. To reiterate, we have an observation space (a.k.a. domain) Z and a hypothesis space H. Our learning algorithm L receives a set of *<sup>m</sup>* observations **<sup>s</sup>** <sup>=</sup> {**z**1, ... , **<sup>z</sup>***m*}∈Z*<sup>m</sup>* generated i.i.d. from some fixed unknown distribution *<sup>p</sup>*(*z*), and picks a hypothesis **h** ∈ H according to some probability distribution *p*(*h* | **s**). In other words, L is a channel from **s** to **h**. In this paper, we allow the hypothesis **h** to be any *summary statistic* of the training set. It can be an answer to a query, a measure of central tendency, or a mapping from the input space to the output space. In fact, we even allow **h** to be a subset of the training set itself. In formal terms, L is a stochastic map between the two random variables **<sup>s</sup>** ∈ Z*<sup>m</sup>* and **<sup>h</sup>** ∈ H, where the exact interpretation of those random variables is irrelevant. Moreover, we assume that there exists a non-negative bounded loss function *l*(*z*, *h*) ∈ [0, 1] that is used to measure the fitness of the hypothesis *h* ∈ H on the observation *z* ∈ Z.

For any fixed hypothesis $h \in \mathcal{H}$, we define its true risk $\mathcal{R}(h)$ by Equation (1) and denote its empirical risk on the training sample $\mathbf{s}$ by $\mathcal{R}\_{\mathbf{s}}(h)$. We also define the true and empirical risks of the *learning algorithm* $\mathcal{L}$ as the expectation of the corresponding risk of its hypothesis:

$$\mathcal{R}(\mathcal{L}) = \mathbb{E}\_{\mathbf{s}}\, \mathbb{E}\_{\mathbf{h} \sim p(h \mid \mathbf{s})} \left[ \mathcal{R}(\mathbf{h}) \right] = \mathbb{E}\_{\mathbf{h}} \left[ \mathcal{R}(\mathbf{h}) \right] \tag{3}$$

$$\hat{\mathcal{R}}(\mathcal{L}) = \mathbb{E}\_{\mathbf{s}}\, \mathbb{E}\_{\mathbf{h} \sim p(h \mid \mathbf{s})} \left[ \mathcal{R}\_{\mathbf{s}}(\mathbf{h}) \right] = \mathbb{E}\_{\mathbf{s}, \mathbf{h}} \left[ \mathcal{R}\_{\mathbf{s}}(\mathbf{h}) \right] \tag{4}$$

Finally, the generalization risk of the learning algorithm is defined by:

$$\mathcal{R}\_{\text{gen}}(\mathcal{L}) \doteq \mathcal{R}(\mathcal{L}) - \hat{\mathcal{R}}(\mathcal{L}) \tag{5}$$

Next, we define uniform generalization:

**Definition 3** (Parametric Loss)**.** *A loss function $l(\cdot, h) : \mathcal{Z} \to [0, 1]$ is called parametric if it is conditionally independent of the training sample given the hypothesis $h \in \mathcal{H}$; that is, if it satisfies the Markov chain $\mathbf{s} \to \mathbf{h} \to l(\cdot, \mathbf{h})$.*

**Definition 4** (Uniform Generalization)**.** *A learning algorithm $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ generalizes uniformly with rate $\epsilon \ge 0$ if, for all bounded parametric losses $l : \mathcal{Z} \times \mathcal{H} \to [0, 1]$, we have $|\mathcal{R}\_{gen}(\mathcal{L})| \le \epsilon$, where $\mathcal{R}\_{gen}(\mathcal{L})$ is given in Equation (5).*

Informally, Definition 4 states that once a hypothesis **h** is selected by a learning algorithm $\mathcal{L}$ that achieves uniform generalization, no "adversary" can post-process the hypothesis in a manner that causes over-fitting to occur. Equivalently, uniform generalization implies that the empirical performance of **h** on the sample **s** will remain close to its performance with respect to the underlying distribution, regardless of how that performance is measured. For example, the loss function $l : \mathcal{Z} \times \mathcal{H} \to [0, 1]$ in Equation (5) can be the misclassification error rate as in the traditional classification setting, a cost-sensitive error rate as in fraud detection and medical diagnosis [25], or the Brier score as in probabilistic predictions [26]. The generalization guarantee would hold in any case.

#### *4.2. Variational Information*

Given two random variables **x** and **y**, the *variational information* between them is defined to be the total variation distance between the joint distribution $p(\mathbf{x}, \mathbf{y})$ and the product of marginals $p(\mathbf{x}) \cdot p(\mathbf{y})$. We denote this by $\mathcal{J}(\mathbf{x}; \mathbf{y})$. By definition:

$$\mathcal{J}(\mathbf{x}; \mathbf{y}) = \big|\big| p(\mathbf{x}, \mathbf{y}),\; p(\mathbf{x}) \cdot p(\mathbf{y}) \big|\big|\_{\mathcal{T}} = \mathbb{E}\_{\mathbf{x}} \big|\big| p(\mathbf{y}),\; p(\mathbf{y} \mid \mathbf{x}) \big|\big|\_{\mathcal{T}}$$

Note that 0 ≤ J (**x**; **y**) ≤ 1. We describe some of the important properties of variational information in this section. The reader may consult the appendices for detailed proofs.
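The definition can be verified numerically on small discrete joints. The following sketch is our own illustration (it assumes NumPy; the function name `variational_information` is ours, not the paper's): it computes $\mathcal{J}(\mathbf{x}; \mathbf{y})$ from a joint probability table and confirms that independence gives $\mathcal{J} = 0$ while $0 \le \mathcal{J} \le 1$ always holds.

```python
import numpy as np

def variational_information(p_xy):
    """J(x; y): total variation distance between the joint p(x, y)
    and the product of its marginals p(x) * p(y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return 0.5 * np.abs(p_xy - p_x * p_y).sum()

# Independence gives J = 0; a deterministic coupling pushes J towards 1.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])
print(variational_information(indep))    # → 0.0
print(variational_information(coupled))  # → 0.5
```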

**Lemma 1** (Data Processing Inequality)**.** *If $\mathbf{x} \to \mathbf{y} \to \mathbf{z}$ is a Markov chain, then:*

$$\mathcal{J}(\mathbf{x}; \mathbf{z}) \le \mathcal{J}(\mathbf{y}; \mathbf{z})$$

This *data processing inequality* holds, in general, for all information measures defined via $f$-divergences [21,22].

**Lemma 2** (Information Cannot Hurt)**.** *For any random variables $\mathbf{x} \in \mathcal{X}$, $\mathbf{y} \in \mathcal{Y}$, and $\mathbf{z} \in \mathcal{Z}$, we have:*

$$\mathcal{J}(\mathbf{x}; \mathbf{y}) \le \mathcal{J}(\mathbf{x}; (\mathbf{y}, \mathbf{z}))$$

**Proof.** The proof is in Appendix A.

Finally, we derive a chain rule for the variational information.

**Definition 5** (Conditional Variational Information)**.** *The conditional variational information between the two random variables $\mathbf{x}$ and $\mathbf{y}$ given $\mathbf{z}$ is defined by:*

$$\mathcal{J}(\mathbf{x};\, \mathbf{y} \mid \mathbf{z}) = \mathbb{E}\_{\mathbf{z}} \left[ \big|\big| p(\mathbf{x}, \mathbf{y} \mid \mathbf{z}),\; p(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{y} \mid \mathbf{z}) \big|\big|\_{\mathcal{T}} \right],$$

*which is analogous to the conditional mutual information in information theory [10].*

**Theorem 1** (Chain Rule)**.** *Let $(h\_1, \ldots, h\_k)$ be a sequence of random variables. Then, for any random variable $z$, we have:*

$$\mathcal{J}(z; (h\_1, \ldots, h\_k)) \le \sum\_{t=1}^{k} \mathcal{J}(z;\, h\_t \mid (h\_1, \ldots, h\_{t-1}))$$

**Proof.** The proof is in Appendix B.

Although the chain rule above provides an upper bound, the upper bound is tight in the following sense:

**Proposition 1.** *For any random variables $x$, $y$, and $z$, we have $\mathcal{J}(x; (y, z)) - \mathcal{J}(x; z \mid y) \le \mathcal{J}(x; y)$ and $\mathcal{J}(x; (y, z)) - \mathcal{J}(x; y) \le \mathcal{J}(x; z \mid y)$.*

**Proof.** The proof is in Appendix C.

In other words, the inequality in the chain rule J (**x**; (**y**, **z**)) ≤ J (**x**; **y**) + J (**x**; **z** | **y**) becomes an equality if:

$$\min \{ \mathcal{J}(\mathbf{x}; \mathbf{y}), \mathcal{J}(\mathbf{x}; \mathbf{z} \mid \mathbf{y}) \} = 0$$
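Both the chain rule and Lemma 2 can be sanity-checked numerically. The sketch below is our own illustration (it assumes NumPy; the variable names are ours): it draws a random joint distribution over three binary variables and verifies the inequality of Theorem 1 together with Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # Total variation distance between two distributions given as arrays.
    return 0.5 * np.abs(p - q).sum()

# A random joint distribution p(x, y, z) with binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

p_x = p.sum(axis=(1, 2))
p_yz = p.sum(axis=0)
j_x_yz = tv(p, p_x[:, None, None] * p_yz[None, :, :])           # J(x; (y, z))

p_xy = p.sum(axis=2)
j_x_y = tv(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))  # J(x; y)

# J(x; z | y) = E_y || p(x, z | y), p(x | y) p(z | y) ||_T
j_x_z_given_y = 0.0
for y in range(2):
    p_y = p[:, y, :].sum()
    cond = p[:, y, :] / p_y                                      # p(x, z | y)
    j_x_z_given_y += p_y * tv(cond, np.outer(cond.sum(axis=1), cond.sum(axis=0)))

assert j_x_y <= j_x_yz + 1e-12                   # Lemma 2: information cannot hurt
assert j_x_yz <= j_x_y + j_x_z_given_y + 1e-12   # Theorem 1: chain rule
```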

The chain rule provides a recipe for computing the bias of a composition of hypotheses (**h**1, ... , **h***k*). Recently, [23] proposed an *information budget* framework for controlling the bias of estimators by controlling the mutual information between **h** and the training sample **s**. The proposed framework rests on the chain rule of mutual information. Here, we note that the argument for the information budget framework also holds when using the variational information due to the chain rule above.

#### *4.3. Equivalence Result*

Our first main theorem states that the uniform generalization risk has a precise information-theoretic characterization.

**Theorem 2.** *Given a fixed constant $0 \le \epsilon \le 1$ and a learning algorithm $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ that selects a hypothesis $h \in \mathcal{H}$ according to a training sample $s = \{z\_1, \ldots, z\_m\}$, where $z\_i \sim p(z)$ are i.i.d., $\mathcal{L}$ generalizes uniformly with rate $\epsilon$ if and only if $\mathcal{J}(h; \hat{z}) \le \epsilon$, where $\hat{z} \sim s$ is a single random training example.*

**Proof.** Let $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm that receives a finite set of training examples $\mathbf{s} = \{\mathbf{z}\_1, \ldots, \mathbf{z}\_m\} \in \mathcal{Z}^m$ drawn i.i.d. from a fixed unknown distribution $p(z)$. Let $\mathbf{h} \sim p(h \mid \mathbf{s})$ be the hypothesis chosen by $\mathcal{L}$ (which can be deterministic or randomized), and write $\hat{\mathbf{z}} \sim \mathbf{s}$ to denote a random variable that selects its value uniformly at random from the training sample $\mathbf{s}$. Clearly, $\hat{\mathbf{z}}$ and $\mathbf{h}$ are not independent in general. To simplify notation, we write $\mathbf{l} = l(\cdot, \mathbf{h}) : \mathcal{Z} \to [0, 1]$ for the loss function. Note that $\mathbf{l}$ is itself a random variable that satisfies the Markov chain $\mathbf{s} \to \mathbf{h} \to \mathbf{l}$. The claim is that $\mathcal{L}$ generalizes uniformly with rate $\epsilon$ across all parametric loss functions $\mathbf{l}$ if and only if $\mathcal{J}(\mathbf{h}; \hat{\mathbf{z}}) \le \epsilon$.

By the Markov property, we have *p*(**l**|**h**, **s**) = *p*(**l**|**h**). By definition, the true and empirical risks of L are given by:

$$\mathcal{R}(\mathcal{L}) = \mathbb{E}\_{\mathbf{s}}\, \mathbb{E}\_{\mathbf{l} \mid \mathbf{s}}\, \mathbb{E}\_{\mathbf{z} \sim p(z)}\, \mathbf{l}(\mathbf{z}) = \mathbb{E}\_{\mathbf{l}}\, \mathbb{E}\_{\mathbf{z} \sim p(z)}\, \mathbf{l}(\mathbf{z}) \tag{6}$$

$$\hat{\mathcal{R}}(\mathcal{L}) = \mathbb{E}\_{\mathbf{s}}\, \mathbb{E}\_{\mathbf{l} \mid \mathbf{s}}\, \mathbb{E}\_{\hat{\mathbf{z}} \sim \mathbf{s}}\, \mathbf{l}(\hat{\mathbf{z}}) = \mathbb{E}\_{\mathbf{l}}\, \mathbb{E}\_{\mathbf{s} \mid \mathbf{l}}\, \mathbb{E}\_{\hat{\mathbf{z}} \sim \mathbf{s}}\, \mathbf{l}(\hat{\mathbf{z}}) \tag{7}$$

Because **z**ˆ ∼ **s** is a random variable whose value is chosen uniformly at random with replacement from the training set **s**, its marginal distribution is *p*(*z*). Its *conditional* distribution given **l** can be different, however, because both **l** and **z**ˆ depend on the training set **s**. However, they are both *conditionally* independent of each other given **s**. By marginalization, we have:

$$p(\hat{\mathbf{z}} \mid \mathbf{l}) = \mathbb{E}\_{\mathbf{s} \mid \mathbf{l}}\, p(\hat{\mathbf{z}} \mid \mathbf{s}, \mathbf{l}) = \mathbb{E}\_{\mathbf{s} \mid \mathbf{l}}\, p(\hat{\mathbf{z}} \mid \mathbf{s})$$

Combining this with Equations (6) and (7) yields $\mathcal{R}(\mathcal{L}) = \mathbb{E}\_{\mathbf{l}}\, \mathbb{E}\_{\hat{\mathbf{z}}}\, \mathbf{l}(\hat{\mathbf{z}})$ and $\hat{\mathcal{R}}(\mathcal{L}) = \mathbb{E}\_{\mathbf{l}}\, \mathbb{E}\_{\hat{\mathbf{z}} \mid \mathbf{l}}\, \mathbf{l}(\hat{\mathbf{z}})$. Both equations imply that:

$$\mathcal{R}(\mathcal{L}) - \hat{\mathcal{R}}(\mathcal{L}) = \mathbb{E}\_{\mathbf{l}} \left[ \mathbb{E}\_{\hat{\mathbf{z}}}\, \mathbf{l}(\hat{\mathbf{z}}) - \mathbb{E}\_{\hat{\mathbf{z}} \mid \mathbf{l}}\, \mathbf{l}(\hat{\mathbf{z}}) \right]$$

Now, we would like to sandwich the right-hand side between upper and lower bounds. To do this, we note that if *p*1(*z*) and *p*2(*z*) are two distributions defined on the same domain Z and *f* : Z → [0, 1], then:

$$\left| \mathbb{E}\_{\mathbf{z} \sim p\_1(z)} f(\mathbf{z}) - \mathbb{E}\_{\mathbf{z} \sim p\_2(z)} f(\mathbf{z}) \right| \le \big|\big| p\_1(z),\; p\_2(z) \big|\big|\_{\mathcal{T}}$$

where $||p\_1(z),\, p\_2(z)||\_{\mathcal{T}}$ is the total variation distance. This result can be proven immediately by considering the two regions $\{z \in \mathcal{Z} : p\_1(z) > p\_2(z)\}$ and $\{z \in \mathcal{Z} : p\_1(z) < p\_2(z)\}$ separately. In addition, the bound is tight because it holds with equality for the function $f(z) = \mathbb{I}\{p\_1(z) \ge p\_2(z)\}$. Consequently:

$$\left| \mathcal{R}(\mathcal{L}) - \hat{\mathcal{R}}(\mathcal{L}) \right| \le \mathcal{J}(\mathbf{l}; \hat{\mathbf{z}})$$

Finally, from the Markov chain **z**ˆ → **s** → **h** → **l** and the data processing inequality, we have J (**l**; **z**ˆ) ≤ J (**h**; ˆ**z**). Plugging this into the earlier inequality yields the bound:

$$\left| \mathcal{R}(\mathcal{L}) - \hat{\mathcal{R}}(\mathcal{L}) \right| \le \mathcal{J}(\mathbf{h}; \hat{\mathbf{z}})$$

To prove the converse, define:

$$\begin{aligned} l^{\star}(z, \mathbf{h}) &= \mathbb{I}\{ p(\hat{\mathbf{z}} = z) \ge p(\hat{\mathbf{z}} = z \mid \mathbf{h}) \} \\ &= \mathbb{I}\{ p(\hat{\mathbf{z}} = z) \ge \mathbb{E}\_{\mathbf{s} \mid \mathbf{h}} \left[ p\_{\hat{\mathbf{z}} \sim \mathbf{s}}(\hat{\mathbf{z}} = z) \right] \} \end{aligned}$$

The loss $l^{\star}(z, \mathbf{h})$ is independent of the training sample given $\mathbf{h}$ because $p(\hat{\mathbf{z}} = z \mid \mathbf{h})$ is evaluated by taking the expectation over all training samples conditioned on $\mathbf{h}$. Hence, $l^{\star}(z, \mathbf{h})$ is a 0–1 loss defined on the product space $\mathcal{Z} \times \mathcal{H}$ that satisfies the Markov chain $\mathbf{s} \to \mathbf{h} \to \mathbf{l}$. However, given this choice of loss, we have:

$$\begin{aligned} \left| \mathcal{R}(\mathcal{L}) - \hat{\mathcal{R}}(\mathcal{L}) \right| &= \mathbb{E}\_{\mathbf{h}} \left[ \mathbb{E}\_{\hat{\mathbf{z}}}\, \mathbb{I}\{ p(\hat{\mathbf{z}}) \ge p(\hat{\mathbf{z}} \mid \mathbf{h}) \} - \mathbb{E}\_{\hat{\mathbf{z}} \mid \mathbf{h}}\, \mathbb{I}\{ p(\hat{\mathbf{z}}) \ge p(\hat{\mathbf{z}} \mid \mathbf{h}) \} \right] \\ &= \mathbb{E}\_{\mathbf{h}} \big|\big| p(\hat{\mathbf{z}}),\; p(\hat{\mathbf{z}} \mid \mathbf{h}) \big|\big|\_{\mathcal{T}} = \mathcal{J}(\mathbf{h}; \hat{\mathbf{z}}) \end{aligned}$$

Hence, the variational information $\mathcal{J}(\mathbf{h}; \hat{\mathbf{z}})$ not only provides an upper bound on the uniform generalization risk; it is also a lower bound on it. Therefore, $\mathcal{J}(\mathbf{h}; \hat{\mathbf{z}})$ is equal to the uniform generalization risk.

**Remark 1.** *One important observation about Theorem 2 is that the variational information is measured between the hypothesis $h$ and a* single *training example $\hat{z}$, which is quite different from previous works that considered the mutual information with the entire training sample $s$. By considering $\hat{z}$ rather than $s$, we quantify the uniform generalization risk with equality, and the resulting bound is not vacuous even if the learning algorithm is deterministic. By contrast, $\mathcal{J}(s; h)$ may yield vacuous bounds when $\mathcal{L}$ is deterministic and both $\mathcal{Z}$ and $\mathcal{H}$ are uncountable.*

For concreteness, we illustrate how to compute the uniform generalization risk (or, equivalently, the variational information) in two simple examples. Here, $B(k; \phi, n) = \binom{n}{k} \phi^k (1 - \phi)^{n-k}$ is the binomial distribution. The first example is a special case of a more general theorem that will be presented later in Section 5.2.

**Example 1.** *Suppose that the observations $z\_i \in \{0, 1\}$ are i.i.d. Bernoulli trials with $p(z\_i = 1) = \phi$, and that the hypothesis produced by $\mathcal{L}$ is the empirical average $h = \frac{1}{m} \sum\_{i=1}^{m} z\_i$. Because $p(h = k/m \mid \hat{z} = 1) = B(k - 1; \phi, m - 1)$ and $p(h = k/m \mid \hat{z} = 0) = B(k; \phi, m - 1)$, it can be shown that, assuming $\phi m$ is an integer, the uniform generalization risk of this learning algorithm is given by:*

$$\mathcal{J}\left(\hat{z}; h\right) = \frac{2}{m} \left(1 - \phi\right)^{(1-\phi)m} \phi^{1 + m\phi} \left(1 + m\phi\right) \binom{m}{m\phi + 1} \tag{8}$$

*This is maximized when $\phi = 1/2$, in which case the uniform generalization risk can be bounded using the Stirling approximation [27] by $1/\sqrt{2\pi m}$ up to a first-order term.*

**Proof.** First, the probability of obtaining a hypothesis $\mathbf{h} = \frac{k}{m}$, where $k \in \{0, 1, \ldots, m\}$, after $m$ Bernoulli trials follows the binomial distribution:

$$p(\mathbf{h} = \tfrac{k}{m}) = \binom{m}{k} \phi^k \left(1 - \phi\right)^{m-k}$$

We use the identity:

$$\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) = \sum\_{k=0}^{m} p(\mathbf{h} = \tfrac{k}{m})\, \big|\big| p(\hat{\mathbf{z}}),\; p(\hat{\mathbf{z}} \mid \mathbf{h} = \tfrac{k}{m}) \big|\big|\_{\mathcal{T}}$$

However, $p(\hat{\mathbf{z}})$ is Bernoulli with probability of success $\phi$, while $p(\hat{\mathbf{z}} \mid \mathbf{h} = \frac{k}{m})$ is Bernoulli with probability of success $\frac{k}{m}$. The total variation distance between two such Bernoulli distributions is $\left| \phi - \frac{k}{m} \right|$. So, we obtain:

$$\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) = \sum\_{k=0}^{m} \binom{m}{k} \phi^{k} \left(1-\phi\right)^{m-k} \left| \phi - \frac{k}{m} \right| \tag{9}$$

This is the *mean deviation* of $\mathbf{h}$ around $\phi$, i.e., $1/m$ times the mean deviation of the binomial count $m\mathbf{h}$. Assuming $\phi m$ is an integer, the latter mean deviation is given by de Moivre's formula:

$$MD = 2\left(1 - \phi\right)^{(1-\phi)m} \phi^{1+m\phi} \left(1 + m\phi\right) \binom{m}{m\phi + 1} \tag{10}$$

The mean deviation is maximized when $\phi = \frac{1}{2}$. This gives us:

$$\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) \le \frac{1}{m\, 2^m} \left(1 + \frac{m}{2}\right) \binom{m}{m/2 + 1} \sim \frac{1}{\sqrt{2 \pi m}},$$

where in the last step we expanded the binomial coefficient and used Stirling's approximation [27].
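The computation above can be replayed numerically. The following sketch is our own check (constants chosen arbitrarily): it evaluates the sum in Equation (9) directly and compares it with de Moivre's mean-deviation formula for the binomial count scaled by $1/m$, and with the asymptotic value $1/\sqrt{2\pi m}$.

```python
import math

def j_bernoulli_mean(m, phi):
    """J(z_hat; h) for the empirical mean of m Bernoulli(phi) trials,
    evaluated via the sum in Equation (9)."""
    return sum(math.comb(m, k) * phi ** k * (1 - phi) ** (m - k) * abs(phi - k / m)
               for k in range(m + 1))

m, phi = 100, 0.5
exact = j_bernoulli_mean(m, phi)

# de Moivre's formula for the mean deviation of the binomial count,
# divided by m to recover E|phi - h| (phi * m must be an integer here).
k0 = int(m * phi)
de_moivre = 2 * (1 - phi) ** (m - k0) * phi ** (k0 + 1) * (k0 + 1) * math.comb(m, k0 + 1) / m

print(exact, de_moivre, 1 / math.sqrt(2 * math.pi * m))
```

The two exact expressions agree to machine precision, and both are within a first-order term of $1/\sqrt{2\pi m}$.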

**Example 2.** *Suppose that the domain is $\mathcal{Z} = \{1, 2, 3, \ldots, K\}$ for some $K < \infty$, where $p(z = k) = 1/K$ for all $k \in \mathcal{Z}$. Let the hypothesis space be $\mathcal{H} = \mathcal{Z}$, where $p(h = k)$ equals the fraction of times the value $k$ is observed in the training sample $s = \{z\_1, \ldots, z\_m\}$. For example, if $s = \{1, 3, 2, 1, 1, 3\}$, the hypothesis $h$ is chosen from the set $\{1, 2, 3\}$ with respective probabilities $\{1/2, 1/6, 1/3\}$. Then, the variational information is given by:*

$$\mathcal{J}(\hat{z}; h) = \frac{1}{m} \left( 1 - \frac{1}{K} \right)$$

**Proof.** By symmetry, $p(\mathbf{h} = k) = 1/K$ for all $k \in \{1, 2, \ldots, K\}$. Let $\hat{\mathbf{z}} = x$. By Bayes' rule, we have:

$$p(\hat{\mathbf{z}} = x \mid \mathbf{h} = k) = p(\mathbf{h} = k \mid \hat{\mathbf{z}} = x) \cdot \frac{p(\hat{\mathbf{z}} = x)}{p(\mathbf{h} = k)} = p(\mathbf{h} = k \mid \hat{\mathbf{z}} = x)$$

However, given one observation **z**ˆ = *x*, the probability of selecting a hypothesis **h** = *k* depends on two cases:

$$p(\mathbf{h} = k \mid \hat{\mathbf{z}} = x) = \begin{cases} q & \text{if } k = x \\ r & \text{if } k \ne x \end{cases}$$

for some values *q* ≥ 0 and *r* ≥ 0 such that *q* + (*K* − 1)*r* = 1. To find *q*, we use the definition of L:

$$q = \frac{1}{m} + \frac{1}{K} \cdot \frac{m-1}{m} = \frac{1}{K} + \frac{1}{m} \left(1 - \frac{1}{K}\right)$$

This holds because L is equivalent to an algorithm that selects a single observation in the set **s** uniformly at random. So, to satisfy the condition *q* + (*K* − 1)*r* = 1, we have:

$$r = \frac{1}{K} - \frac{1}{mK}$$

Now, we are ready to find the desired expression.

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) &= \frac{1}{2} \sum\_{x \in \mathcal{Z}} p(\hat{\mathbf{z}} = x) \sum\_{k \in \mathcal{Z}} \left| p(\mathbf{h} = k) - p(\mathbf{h} = k \mid \hat{\mathbf{z}} = x) \right| \\ &= \frac{1}{2} \sum\_{k \in \mathcal{Z}} \left| p(\mathbf{h} = k) - p(\mathbf{h} = k \mid \hat{\mathbf{z}} = 1) \right| \\ &= \frac{1}{2} \left[ \frac{1}{m} \left(1 - \frac{1}{K}\right) + \frac{K - 1}{mK} \right] = \frac{1}{m} \left( 1 - \frac{1}{K} \right) \quad \Box \end{aligned}$$
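The constants $q$ and $r$ above can be double-checked by simulation. The sketch below is our own illustration (the sample sizes are arbitrary): it estimates $q$ by Monte Carlo and recovers $\mathcal{J}(\hat{z}; h) = \frac{1}{m}\left(1 - \frac{1}{K}\right)$ from the closed forms.

```python
import random

random.seed(0)
m, K, trials = 5, 3, 200_000

hits = total = 0
for _ in range(trials):
    s = [random.randrange(K) for _ in range(m)]  # i.i.d. uniform sample
    z_hat = random.choice(s)                     # a single random training example
    h = random.choice(s)                         # L picks a uniform element of s
    if z_hat == 0:                               # condition on z_hat = 0
        total += 1
        hits += (h == 0)

q_mc = hits / total                              # Monte Carlo estimate of q
q = 1 / K + (1 / m) * (1 - 1 / K)                # closed form from the proof
r = (1 - q) / (K - 1)

# J(z_hat; h) from q and r, as in the last display of the proof.
j = 0.5 * (abs(1 / K - q) + (K - 1) * abs(1 / K - r))
print(q_mc, q, j, (1 / m) * (1 - 1 / K))
```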

Note that the variational information in Example 2 is $\Theta(1/m)$, which is smaller than the variational information in Example 1. This is not a coincidence; the difference between the two examples is related to *data processing*. Specifically, suppose that $K = 2$ in Example 2 and let $\mathbf{h}\_2$ be its hypothesis. Let $\mathbf{h}\_1$ be the hypothesis in Example 1. Then, we have the Markov chain $\mathbf{s} \to \mathbf{h}\_1 \to \mathbf{h}\_2$ because $\mathbf{h}\_2$ is Bernoulli with parameter $\mathbf{h}\_1$.

#### *4.4. Learning Capacity*

The variational information depends on the distribution of observations *p*(*z*), which is seldom known in practice. To construct a distribution-free bound on the uniform generalization risk, we introduce the following quantity:

**Definition 6** (Learning Capacity)**.** *The learning capacity of an algorithm* L *is defined by:*

$$\mathcal{C}(\mathcal{L}) \doteq \sup\_{p(z)} \left\{ \mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) \right\}, \tag{11}$$

*where $\mathbf{h}$ and $\hat{\mathbf{z}}$ are as defined in Theorem 2.*

The above quantity is analogous to the Shannon channel capacity, except that it is measured in the total variation distance. It quantifies the capacity for overfitting of a given learning algorithm. For example, the learning capacity of the algorithm in Example 1 is $1/\sqrt{2\pi m}$ up to a first-order term, as proved earlier, so its capacity for overfitting is larger than that of the learning algorithm in Example 2.
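Since the supremum in Equation (11) ranges over distributions $p(z)$, it can be approximated numerically by a parameter sweep. For the algorithm of Example 1, the following sketch (our own; the grid resolution is arbitrary) locates the worst-case Bernoulli parameter near $\phi = 1/2$ and compares the grid estimate with $1/\sqrt{2\pi m}$.

```python
import math

def j_bernoulli_mean(m, phi):
    # Equation (9) from Example 1: J(z_hat; h) for the empirical mean.
    return sum(math.comb(m, k) * phi ** k * (1 - phi) ** (m - k) * abs(phi - k / m)
               for k in range(m + 1))

m = 50
grid = [i / 100 for i in range(1, 100)]
values = [j_bernoulli_mean(m, p) for p in grid]
capacity_est = max(values)                 # grid approximation of C(L)
worst = grid[values.index(capacity_est)]

print(worst, capacity_est, 1 / math.sqrt(2 * math.pi * m))
```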

Theorem 2 reveals that $\mathcal{C}(\mathcal{L})$ admits at least three *equivalent* interpretations.

Throughout the sequel, we analyze the properties of $\mathcal{C}(\mathcal{L})$ and derive upper bounds on it under various conditions, such as the finite hypothesis space setting and differential privacy.

#### *4.5. The Definition of Hypothesis*

In the proof of Theorem 2, the Markov chain $\hat{\mathbf{z}} \to \mathbf{s} \to \mathbf{h} \to \mathbf{l}(\cdot, \mathbf{h})$ is used. Essentially, it states that the loss function $\mathbf{l}(\cdot, \mathbf{h}) : \mathcal{Z} \to [0, 1]$, which is itself a random variable, must be parameterized entirely by the hypothesis $\mathbf{h}$, as stated in Definition 3. Next, we list a few examples that highlight this point.

**Example 3** (Input Normalization)**.** *If the data is normalized prior to training, such as using min-max or z-score normalization, then the normalization parameters are included in the definition of the hypothesis h.*

**Example 4** (Feature Selection)**.** *If the observations $z$ comprise $d$ features and feature selection is applied prior to training a model $v$ (such as in classification or clustering), then the hypothesis $h$ is the composition $(u, v)$, where $u \in \{0, 1\}^d$ encodes the set of features selected by the feature selection algorithm.*

**Example 5** (Cross Validation)**.** *Hyper-parameter tuning is a common practice in machine learning. This includes choosing the tradeoff parameter $C$ in support vector machines (SVM) [28] or the bandwidth $\gamma$ in radial basis function (RBF) networks [29]. However, not all hyper-parameters are encoded in the hypothesis $h$. For instance, the tradeoff constant $C$ is never used during prediction, so it is omitted from the definition of $h$, but the bandwidth parameter $\gamma$ is included if it is selected based on the training sample.*

In order to illustrate why the Markov chain $\hat{\mathbf{z}} \to \mathbf{s} \to \mathbf{h} \to \mathbf{l}(\cdot, \mathbf{h})$ is important, consider the following simple scenario. Suppose we have a mixture of two Gaussians in $\mathbb{R}^d$, one corresponding to the positive class and one corresponding to the negative class. If $z$-score normalization is applied before training a linear classifier, then the generalization risk might increase with normalization because the final hypothesis now includes more information about the training sample (see Lemma 2). Figure 1 shows this effect when $d = 1$. As the figure illustrates, normalization is often important in order to assign equal weights to all features, but it can increase the generalization risk as well.

**Figure 1.** This figure corresponds to a classification problem in one dimension in which a classifier is a threshold between positive and negative examples. The $x$-axis is the number of training examples, and the $y$-axis is the generalization risk. The red curve (top) shows the difference between training and test accuracy when $z$-score normalization is applied before learning a classifier; the blue curve (bottom) shows the same difference when the data is not normalized.

#### *4.6. Concentration*

The notion of uniform generalization in Definition 4 provides *in-expectation* guarantees. In this section, we show that whereas traditional generalization in expectation does not imply concentration, *uniform* generalization in expectation implies concentration. In fact, we will use the chain rule in Theorem 1 to derive a Markov-type inequality. After that, we show that the bound is tight.

We begin by showing why a non-uniform generalization in expectation does not imply concentration.

**Proposition 2.** *There exists a learning algorithm $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ and a parametric loss $l : \mathcal{Z} \times \mathcal{H} \to [0, 1]$ such that the expected generalization risk is $\mathcal{R}\_{gen}(\mathcal{L}) = 0$ even though $p\left\{ |R(h) - R\_{s}(h)| = \frac{1}{2} \right\} = 1$, where the probability is evaluated over the randomness of $s$ and the internal randomness of $\mathcal{L}$.*

**Proof.** Let $\mathcal{Z} = [0, 1]$ be an instance space with a continuous marginal density $p(z)$, and let $\mathcal{Y} = \{-1, +1\}$ be the target set. Let $h^{\star} : \mathcal{Z} \to \{-1, +1\}$ be some *fixed* predictor such that $p\{h^{\star}(\mathbf{z}) = 1\} = \frac{1}{2}$, where the probability is evaluated over the random choice of $\mathbf{z} \in \mathcal{Z}$. In other words, the marginal distribution of the labels predicted by $h^{\star}$ is uniform over the set $\{-1, +1\}$. These assumptions are satisfied, for example, if $p(z)$ is uniform in $[0, 1]$ and $h^{\star}(z) = \mathbb{I}\{z < 1/2\}$.

Next, let the hypothesis space $\mathcal{H}$ be the set of predictors from $\mathcal{Z}$ to $\{-1, +1\}$ that output a label in $\{-1, +1\}$ uniformly at random everywhere in $\mathcal{Z}$ except at a finite number of points. Define the parametric loss by $l(z; h) = \mathbb{I}\left\{ h(z) \ne h^{\star}(z) \right\}$.

Next, we construct a learning algorithm L that generalizes perfectly in expectation but does not generalize in probability. The learning algorithm L simply picks **h** ∈ {**h**0, **h**1} at random with equal probability. The two hypotheses are:

$$\mathbf{h}\_0(z) = \begin{cases} -h^{\star}(z) & \text{if } z \in \mathbf{s} \\ \text{Uniform}\{-1, +1\} & \text{if } z \notin \mathbf{s} \end{cases}$$

$$\mathbf{h}\_1(z) = \begin{cases} h^{\star}(z) & \text{if } z \in \mathbf{s} \\ \text{Uniform}\{-1, +1\} & \text{if } z \notin \mathbf{s} \end{cases}$$

Because $\mathcal{Z}$ is uncountable, so that the probability of seeing the same observation $\mathbf{z}$ twice is zero, $R(\mathbf{h}) = \frac{1}{2}$ for this learning algorithm. Thus:

$$\mathcal{R}\_{gen}(\mathcal{L}) = \mathbb{E}\_{\mathbf{s}, \mathbf{h}} \left[ R(\mathbf{h}) - R\_{\mathbf{s}}(\mathbf{h}) \right] = 0$$

However, the empirical risk for any $\mathbf{s}$ satisfies $R\_{\mathbf{s}}(\mathbf{h}) \in \{0, 1\}$, while the true risk always satisfies $R(\mathbf{h}) = \frac{1}{2}$, as mentioned earlier. Hence, the statement of the proposition follows.

There are many ways of seeing why the algorithm in Proposition 2 does not generalize *uniformly* in expectation. The simplest is to use the equivalence between uniform generalization and variational information stated in Theorem 2. Given the hypothesis $\mathbf{h} \in \{\mathbf{h}\_0, \mathbf{h}\_1\}$ learned by the algorithm constructed in the proposition, the conditional distribution of an individual training example $p(\hat{\mathbf{z}} \mid \mathbf{h})$ is uniform over the sample $\mathbf{s}$. This follows from the fact that the hypothesis $\mathbf{h}$ encodes the entire sample $\mathbf{s}$. However, the probability of seeing the same observation twice is zero (by construction). Hence, $||p(\hat{\mathbf{z}}),\ p(\hat{\mathbf{z}} \mid \mathbf{h})||\_{\mathcal{T}} = 1$. This shows that $\mathcal{C}(\mathcal{L}) = 1$.

The example in Proposition 2 reveals an interesting property of non-uniform generalization. Namely, *non-uniform* generalization can be sensitive to every bit of information provided by the hypothesis. In the example above, the hypothesis **h** is encoded by the pair (**s**, **k**), where **k** ∈ {0, 1} determines which of the two hypotheses {**h**0, **h**1} is selected. The discrepancy between generalization in expectation and generalization in probability happens because **k** is added into the hypothesis.

Next, we use the chain rule in Theorem 1 to prove that uniform generalization, on the other hand, is a *robust* property of learning algorithms. More precisely, if **k** has a finite domain, then a hypothesis **h** generalizes uniformly in expectation if and only if the pair (**h**, **k**) generalizes uniformly in expectation. Hence, adding any finite amount of information (in bits) to a hypothesis cannot alter its uniform generalization property in a significant way.

**Theorem 3.** *Let $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm whose hypothesis is $h \in \mathcal{H}$. Let $k \in \mathcal{K}$ be a different hypothesis obtained from the same sample $s$. If $\hat{z} \sim s$, then:*

$$\mathcal{J}(\hat{z}; (h, k)) \le \left(2 + \frac{|\mathcal{K}|}{2}\right) \cdot \mathcal{J}(\hat{z}; h) + \sqrt{\frac{\log |\mathcal{K}|}{2m}}$$

**Proof.** The proof is in Appendix D.

We use Theorem 3, next, to prove that uniform generalization in expectation implies generalization in probability. The proof is by contradiction. Suppose we have a hypothesis $\mathbf{h}$ that generalizes uniformly in expectation, but there exists a parametric loss $l : \mathcal{Z} \times \mathcal{H} \to [0, 1]$ that does not generalize in probability. We derive a contradiction from these two assumptions by showing that appending a little information to the hypothesis $\mathbf{h}$ allows us to construct a *different* parametric loss: the appended information states whether the empirical risk w.r.t. $l$ is greater than, approximately equal to, or less than the true risk w.r.t. the same loss, which is described in, at most, two bits. Knowing this additional information, we can define a new parametric loss that does not generalize in expectation, which contradicts the definition of uniform generalization.

**Theorem 4.** *Let $\mathcal{L} : \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm whose risk is evaluated using a parametric loss $l : \mathcal{Z} \times \mathcal{H} \to [0, 1]$. Then:*

$$p\left\{ \left| R\_{\mathbf{s}}(\mathbf{h}) - R(\mathbf{h}) \right| \ge t \right\} \le \frac{7}{2t} \left[ \mathcal{J}\left(\hat{\mathbf{z}}; \mathbf{h}\right) + \sqrt{\frac{2 \log 3}{49\, m}} \right],$$

*where the probability is evaluated over the random choice of <sup>s</sup> and the internal randomness of* <sup>L</sup>*.*

**Proof.** Let *l* : Z×H→ [0, 1] be a parametric loss function and write:

$$\kappa(t) = p\left\{ \left| R\_{\mathbf{s}}(\mathbf{h}) - R(\mathbf{h}) \right| \ge t \right\} \tag{12}$$

Consider the new pair of hypotheses (**h**, **k**), where:

$$\mathbf{k} = \begin{cases} +1, & \text{if } R\_{\mathbf{s}}(\mathbf{h}) \ge R(\mathbf{h}) + t, \\ -1, & \text{if } R\_{\mathbf{s}}(\mathbf{h}) \le R(\mathbf{h}) - t, \\ 0, & \text{otherwise} \end{cases}$$

Then, by Theorem 3, the uniform generalization risk in expectation for the composition of hypotheses $(\mathbf{h}, \mathbf{k})$ is bounded by $\frac{7}{2} \mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) + \sqrt{\frac{\log 3}{2m}}$. This holds uniformly across all parametric loss functions that satisfy the Markov chain $\mathbf{s} \to (\mathbf{h}, \mathbf{k}) \to \mathbf{l}(\cdot, (\mathbf{h}, \mathbf{k}))$. Next, consider the parametric loss:

$$\mathbf{l}(z,(\mathbf{h},\mathbf{k})) = \begin{cases} l(z;\mathbf{h}) & \text{if } \mathbf{k} = +1 \\ 1 - l(z;\mathbf{h}) & \text{if } \mathbf{k} = -1 \\ 0 & \text{otherwise} \end{cases}$$

Note that $\mathbf{l}(z, (\mathbf{h}, \mathbf{k}))$ is parametric with respect to the composition of hypotheses $(\mathbf{h}, \mathbf{k})$. Using Equation (12), the expected generalization risk w.r.t. $\mathbf{l}(z, (\mathbf{h}, \mathbf{k}))$ is at least as large as $t\, \kappa(t)$. Therefore, by Theorems 2 and 3, we have $t\, \kappa(t) \le \frac{7}{2} \mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) + \sqrt{\frac{\log 3}{2m}}$, which is equivalent to the statement of the theorem. (Note: the proof assumes that the loss function $\mathbf{l}$ has access to the underlying distribution. This assumption is valid because the underlying distribution $p(z)$ is fixed and does not depend on any random outcomes, such as $\mathbf{s}$ or $\mathbf{h}$.)

Theorem 4 reveals that uniform generalization is sufficient for concentration to hold. Importantly, the generalization bound depends on the learning algorithm L only via its variational information J (**z**ˆ; **h**). Hence, by controlling the uniform generalization risk, one improves the generalization risk of L both in expectation and with a high probability.

The same proof technique used in Theorem 4 also implies the following concentration bound, which is useful when *I*(**h**; **s**) = *o*(*m*) where *I*(**x**; **y**) is the Shannon mutual information. The following bound is similar to the bound derived by [23] using properties of sub-Gaussian loss functions.

**Proposition 3.** *Let* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *be a learning algorithm, whose risk is evaluated using a parametric loss function* $l: \mathcal{Z}\times\mathcal{H}\to [0, 1]$*. Then:*

$$p\left\{\left|R\_{\mathbf{s}}(\mathbf{h}) - R(\mathbf{h})\right| \geq t\right\} \leq \frac{1}{t} \sqrt{\frac{I(\mathbf{s};\mathbf{h}) + 2}{2m}}.$$

**Proof.** The proof is in Appendix E.

Note that having a vanishing mutual information, i.e., *I*(**s**; **h**) = *o*(*m*), which is the setting recently considered in the work of [23], is a *strictly stronger* condition than uniform generalization. For instance, we will later construct *deterministic* learning algorithms that generalize uniformly in expectation even though *I*(**s**; **h**) is unbounded (see Theorem 8). By contrast, *I*(**s**; **h**) = *o*(*m*) is sufficient for J (**z**ˆ; **h**) → 0 to hold.

Finally, we note that the concentration bound depends linearly on the variational information $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h})$. Typically, $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) = O(1/\sqrt{m})$. By contrast, the VC bound provides an exponential decay in $m$ [3,17]. Can the concentration bound in Theorem 4 be improved? The following proposition answers this question in the negative.

**Proposition 4.** *For any rational* $0 < t < 1$*, there exists a learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$*, a distribution* $p(z)$*, and a parametric loss* $l: \mathcal{Z}\times\mathcal{H}\to [0, 1]$ *such that:*

$$p\left\{\left|R\_{\mathbf{s}}(\mathbf{h}) - R(\mathbf{h})\right| = t\right\} = \frac{\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h})}{t},$$

*where the probability is evaluated over the random choice of <sup>s</sup> and the internal randomness of* <sup>L</sup>*.*

**Proof.** The proof is in Appendix F.

Proposition 4 shows that, without making any additional assumptions beyond that of uniform generalization, the concentration bound in Theorem 4 is tight up to constant factors. Essentially, the only difference between the upper and the lower bounds is a vanishing $O(1/\sqrt{m})$ term that is *independent* of $\mathcal{L}$.

#### **5. Properties of the Learning Capacity**

In this section, we derive bounds on the learning capacity under various settings. We also describe some of its important properties.

#### *5.1. Data Processing*

The relationship between learning capacity and data processing is presented in Lemma 1. Given the random variables **x**, **y**, and **z** and the Markov chain **x** → **y** → **z**, we always have J (**x**; **z**) ≤ J (**x**; **y**). Hence, we have a *partial order* on learning algorithms. This presents us with an important qualitative insight into the design of machine learning algorithms.

Suppose we have two different hypotheses $\mathbf{h}\_1$ and $\mathbf{h}\_2$. We will say that $\mathbf{h}\_2$ contains *less information* than $\mathbf{h}\_1$ if the Markov chain $\mathbf{s} \to \mathbf{h}\_1 \to \mathbf{h}\_2$ holds. For example, if the observations $\mathbf{z}\_i \in \{0, 1\}$ are Bernoulli trials, then $\mathbf{h}\_1 \in \mathbb{R}$ can be the empirical average as given in Example 1, while $\mathbf{h}\_2 \in \{0, 1\}$ can be the label that occurs most often in the training set. Because $\mathbf{h}\_2 = \mathbb{I}\{\mathbf{h}\_1 \geq 1/2\}$, the hypothesis $\mathbf{h}\_2$ contains strictly less information about the original training set than $\mathbf{h}\_1$. Formally, we have $\mathbf{s} \to \mathbf{h}\_1 \to \mathbf{h}\_2$. In this case, $\mathbf{h}\_2$ enjoys a better *uniform* generalization bound because of data processing. Intuitively, we know that such a result should hold because $\mathbf{h}\_2$ is less dependent on the original training set than $\mathbf{h}\_1$. Hence, one can improve the uniform generalization bound (or, equivalently, the learning capacity) of a learning algorithm by post-processing its hypothesis $\mathbf{h}$ in a manner that is conditionally independent of the original training set given $\mathbf{h}$.
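As a small illustration of ours (not from the paper), the following Python sketch implements this Bernoulli example: the majority label $\mathbf{h}\_2$ is computed from the empirical average $\mathbf{h}\_1$ alone, so the Markov chain $\mathbf{s} \to \mathbf{h}\_1 \to \mathbf{h}\_2$ holds by construction.

```python
import random

def h1_empirical_average(sample):
    """h1: the empirical average of Bernoulli observations (a real number)."""
    return sum(sample) / len(sample)

def h2_majority_label(h1):
    """h2: the majority label, computed from h1 alone.

    Since h2 is a deterministic function of h1 (and not of the sample s),
    the Markov chain s -> h1 -> h2 holds by construction, so the data
    processing inequality gives J(z; h2) <= J(z; h1).
    """
    return 1 if h1 >= 0.5 else 0

random.seed(0)
s = [random.randint(0, 1) for _ in range(100)]
h1 = h1_empirical_average(s)
h2 = h2_majority_label(h1)
# h2 agrees with the label that occurs most often in s.
assert h2 == (1 if sum(s) >= len(s) / 2 else 0)
```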

**Example 6.** *Post-processing hypotheses is a common technique in machine learning. This includes sparsifying the coefficient vector* $w \in \mathbb{R}^d$ *in linear methods, where* $w\_j$ *is set to zero if it has a small absolute magnitude. It also includes methods that have been proposed to reduce the number of support vectors in SVM by exploiting linear dependence [30], or some methods for decision tree pruning. By the data processing inequality, such techniques reduce the learning capacity and, as a consequence, mitigate the risk of overfitting.*

Needless to say, better generalization does not immediately translate into a smaller true risk. This is because the empirical risk itself may increase when the hypothesis $\mathbf{h}$ is post-processed *independently* of the original training sample.

#### *5.2. Effective Domain Size*

Next, we look into how the size of the domain Z limits the learning capacity. First, we start with the following definition:

**Definition 7** (Lazy Learning)**.** *A learning algorithm* $\mathcal{L}$ *is called* lazy *if the training sample* $\mathbf{s} \in \mathcal{Z}^m$ *can be reconstructed perfectly from the hypothesis* $\mathbf{h} \in \mathcal{H}$*. In other words,* $H(\mathbf{s} \mid \mathbf{h}) = 0$*, where* $H$ *is the Shannon entropy. Equivalently, the mapping from* $\mathbf{s}$ *to* $\mathbf{h}$ *is injective.*

One common example of a lazy learner is instance-based learning when **h** = **s**. Despite their simple nature, lazy learners are useful in practice. They are useful theoretical tools as well. In particular, because of the fact that *H*(**s**|**h**) = 0 and the data processing inequality, the learning capacity of a lazy learner provides an upper bound to the learning capacity of *any* possible learning algorithm. Therefore, we can relate the learning capacity *C*(L) to the size of the domain Z by determining the learning capacity of lazy learners. Because the size of Z is usually infinite, we introduce the following definition of *effective* set size.

**Definition 8.** *In a countable space* $\mathcal{Z}$ *endowed with a probability mass function* $p(z)$*, the effective size of* $\mathcal{Z}$ *w.r.t.* $p(z)$ *is defined by:*

$$\mathbf{Ess}\_{p(z)}(\mathcal{Z}) \doteq 1 + \left(\sum\_{z\in\mathcal{Z}} \sqrt{p(z)\,(1 - p(z))}\right)^2.$$

At one extreme, if *p*(*z*) is *uniform* over a finite alphabet Z, then **Ess***p*(*z*) (Z) = |Z|. At the other extreme, if *p*(*z*) is a Kronecker delta distribution, then **Ess***p*(*z*) (Z) = 1. As proved next, this notion of effective set size *determines* the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allows us to relate the learning capacity to a property of the domain Z.
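Both extremes are easy to verify numerically. The following Python sketch (our own illustration) implements the effective size of Definition 8, $\mathbf{Ess}\_{p(z)}(\mathcal{Z}) = 1 + \big(\sum\_{z} \sqrt{p(z)(1-p(z))}\big)^2$, and checks the uniform, Kronecker delta, and Bernoulli cases:

```python
from math import sqrt

def effective_size(p):
    """Effective domain size (Definition 8):
    Ess_p(Z) = 1 + (sum_z sqrt(p(z) * (1 - p(z)))) ** 2."""
    return 1 + sum(sqrt(pz * (1 - pz)) for pz in p) ** 2

# Uniform distribution over |Z| = 5: the effective size equals |Z|.
assert abs(effective_size([0.2] * 5) - 5) < 1e-9

# Kronecker delta distribution: the effective size equals 1.
assert abs(effective_size([1.0, 0.0, 0.0]) - 1) < 1e-9

# Bernoulli(phi): the effective size is 1 + 4*phi*(1 - phi),
# matching the constant in de Moivre's rate discussed below Theorem 5.
phi = 0.3
assert abs(effective_size([phi, 1 - phi]) - (1 + 4 * phi * (1 - phi))) < 1e-9
```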

**Theorem 5.** *Let* $\mathcal{Z}$ *be a countable space endowed with a probability mass function* $p(z)$*. Let* $\mathbf{s}$ *be a set of* $m$ *i.i.d. observations* $\mathbf{z}\_i \sim p(z)$*. Define* $p\_{\mathbf{s}}(z)$ *to be the empirical probability mass function that results from drawing observations uniformly at random from* $\mathbf{s}$*. Then:*

$$\mathbb{E}\_{\mathbf{s}}\,||p(z),\,p\_{\mathbf{s}}(z)||\_{\mathcal{T}} = \sqrt{\frac{\mathbf{Ess}\_{p(z)}\left[\mathcal{Z}\right]-1}{2\,\pi\,m}} + o(1/\sqrt{m}),$$

*where* $\mathbf{Ess}\_{p(z)}[\mathcal{Z}]$ *is the effective size of* $\mathcal{Z}$ *(see Definition 8).*

**Proof.** The proof is in Appendix G.

A special case of Theorem 5 was proved by de Moivre in the 1730s, who showed that the empirical mean of i.i.d. Bernoulli trials with a probability of success $\varphi$ converges to the true mean with rate $\sqrt{2\varphi(1-\varphi)/(\pi m)}$. This is believed to be the first appearance of the square-root law in statistical inference in the literature [31]. Because the effective domain size of the Bernoulli distribution, according to Definition 8, is given by $1 + 4\varphi(1-\varphi)$, Theorem 5 agrees with, and in fact generalizes, de Moivre's result.
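De Moivre's special case offers a quick empirical sanity check of Theorem 5. In the following Monte Carlo sketch (ours; the sample size and number of trials are arbitrary choices), the total variation distance between the empirical and true Bernoulli pmfs reduces to $|\hat{\varphi} - \varphi|$:

```python
import random
from math import pi, sqrt

# Monte Carlo check of the rate in Theorem 5 for a Bernoulli(1/2) source.
# For two Bernoulli pmfs, the total variation distance is |phi_hat - phi|.
random.seed(42)
phi, m, trials = 0.5, 1000, 2000

total = 0.0
for _ in range(trials):
    phi_hat = sum(random.random() < phi for _ in range(m)) / m
    total += abs(phi_hat - phi)

empirical_rate = total / trials
# Predicted leading term: sqrt((Ess - 1) / (2*pi*m)), with Ess - 1 = 4*phi*(1 - phi).
predicted_rate = sqrt(2 * phi * (1 - phi) / (pi * m))
assert abs(empirical_rate - predicted_rate) / predicted_rate < 0.1
```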

**Corollary 1.** *Let* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *be a learning algorithm whose hypothesis is* $\mathbf{h} \in \mathcal{H}$*. Then,* $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) \leq \sqrt{\frac{\mathbf{Ess}\_{p(z)}[\mathcal{Z}]-1}{2\,\pi\,m}} + o(1/\sqrt{m})$*. Moreover, the bound is achieved by lazy learners.*

**Proof.** Let $\tilde{\mathbf{h}}$ be the hypothesis produced by a lazy learner; the simplest example is $\tilde{\mathbf{h}} = \mathbf{s}$, i.e., the training sample itself. Then, we always have the Markov chain $\mathbf{s} \to \tilde{\mathbf{h}} \to \mathbf{h}$ for any hypothesis $\mathbf{h} \in \mathcal{H}$. Therefore, by the data processing inequality, we have $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) \leq \mathcal{J}(\hat{\mathbf{z}}; \tilde{\mathbf{h}})$. By Theorem 5, we have:

$$\mathcal{J}(\hat{\mathbf{z}}; \tilde{\mathbf{h}}) = \sqrt{\frac{\mathbf{Ess}\_{p(z)}\left[\mathcal{Z}\right]-1}{2\,\pi\,m}} + o(1/\sqrt{m}).$$

Hence, the statement of the corollary follows.

**Corollary 2.** *For any learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$*, we have* $C(\mathcal{L}) \leq \sqrt{\frac{|\mathcal{Z}|-1}{2\,\pi\,m}} + o(1/\sqrt{m})$*.*

**Proof.** The function $f(p) = \sum\_{z} \sqrt{p(z)(1 - p(z))}$ is both concave over the probability simplex and permutation-invariant. Hence, by symmetry, the maximum effective domain size must be achieved at the uniform distribution $p(z) = 1/|\mathcal{Z}|$, in which case $\mathbf{Ess}\_{p(z)}[\mathcal{Z}] = |\mathcal{Z}|$.
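The claim that the uniform distribution maximizes the effective size can also be probed numerically; in the following sketch (our illustration), random points on the probability simplex never exceed the uniform value:

```python
import random
from math import sqrt

def effective_size(p):
    # Ess_p(Z) = 1 + (sum_z sqrt(p(z) * (1 - p(z)))) ** 2  (Definition 8)
    return 1 + sum(sqrt(pz * (1 - pz)) for pz in p) ** 2

random.seed(1)
n = 6
uniform_value = effective_size([1.0 / n] * n)  # equals |Z| = 6

# Random points on the probability simplex never exceed the uniform value.
for _ in range(10_000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    p = [x / total for x in w]
    assert effective_size(p) <= uniform_value + 1e-9
```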

#### *5.3. Finite Hypothesis Space*

Next, we look into the role of the *size* of the hypothesis space. This is formalized by the following theorem.

**Theorem 6.** *Let* $\mathbf{h} \in \mathcal{H}$ *be the hypothesis produced by a learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$*. Then:*

$$C(\mathcal{L}) \le \sqrt{\frac{H(\mathbf{h})}{2\,m}} \le \sqrt{\frac{\log|\mathcal{H}|}{2\,m}}$$

*where* H *is the Shannon entropy measured in nats.*

**Proof.** If we let *I*(**x**; **y**) be the mutual information between the r.v.'s **x** and **y** and let **s** = {**z**1, **z**2, ... , **z***m*} be the training set, we have:

$$\begin{aligned} I(\mathbf{s}; \mathbf{h}) &= H(\mathbf{s}) - H(\mathbf{s} \mid \mathbf{h}) \\ &= \left[ \sum\_{i=1}^{m} H(\mathbf{z}\_{i}) \right] - \left[ H(\mathbf{z}\_{1} \mid \mathbf{h}) + H(\mathbf{z}\_{2} \mid \mathbf{z}\_{1}, \mathbf{h}) + \dotsb \right] \end{aligned}$$

Because conditioning reduces entropy, i.e., *H*(**x**|**y**) ≤ *H*(**x**) for any r.v.'s **x** and **y**, we have:

$$I(\mathbf{s}; \mathbf{h}) \ge \sum\_{i=1}^{m} \left[ H(\mathbf{z}\_i) - H(\mathbf{z}\_i \mid \mathbf{h}) \right] = m \left[ H(\hat{\mathbf{z}}) - H(\hat{\mathbf{z}} \mid \mathbf{h}) \right],$$

Therefore:

$$I(\mathbf{\hat{z}}; \mathbf{h}) \le \frac{I(\mathbf{s}; \mathbf{h})}{m} \tag{13}$$

Next, we use *Pinsker's inequality* [10], which states that for any probability measures $p$ and $q$: $||p, q||\_{\mathcal{T}} \leq \sqrt{D(p\,||\,q)/2}$, where $||p, q||\_{\mathcal{T}}$ is the total variation distance and $D(p\,||\,q)$ is the Kullback-Leibler divergence measured in nats. If we recall that $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) = ||p(\hat{\mathbf{z}})\,p(\mathbf{h}),\ p(\hat{\mathbf{z}}, \mathbf{h})||\_{\mathcal{T}}$ while the mutual information is $I(\hat{\mathbf{z}}; \mathbf{h}) = D(p(\hat{\mathbf{z}}, \mathbf{h})\,||\,p(\hat{\mathbf{z}})\,p(\mathbf{h}))$, we deduce from Pinsker's inequality and Equation (13):

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) &= ||p(\hat{\mathbf{z}})\,p(\mathbf{h}),\ p(\hat{\mathbf{z}}, \mathbf{h})||\_{\mathcal{T}} \\ &\leq \sqrt{\frac{I(\hat{\mathbf{z}}; \mathbf{h})}{2}} \leq \sqrt{\frac{I(\mathbf{s}; \mathbf{h})}{2m}} \leq \sqrt{\frac{H(\mathbf{h})}{2m}} \leq \sqrt{\frac{\log|\mathcal{H}|}{2m}}. \end{aligned}$$

Theorem 6 re-establishes the classical PAC result for the finite hypothesis space setting. However, unlike its typical proofs, the proof presented here is purely information-theoretic and does not make any reference to the union bound.
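Pinsker's inequality, the key step of the proof above, is straightforward to check numerically. The following Python sketch (ours) verifies it on randomly drawn distributions over a small alphabet:

```python
import random
from math import log, sqrt

def tv_distance(p, q):
    """Total variation distance ||p, q||_T = (1/2) * sum_z |p(z) - q(z)|."""
    return 0.5 * sum(abs(pz - qz) for pz, qz in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pz * log(pz / qz) for pz, qz in zip(p, q) if pz > 0)

random.seed(0)
for _ in range(1000):
    # Two random, strictly positive distributions on a 5-letter alphabet.
    p = [random.random() + 1e-3 for _ in range(5)]
    q = [random.random() + 1e-3 for _ in range(5)]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Pinsker's inequality: ||p, q||_T <= sqrt(D(p || q) / 2).
    assert tv_distance(p, q) <= sqrt(kl_divergence(p, q) / 2) + 1e-12
```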

#### *5.4. Differential Privacy*

Randomization reduces the risk for overfitting. One common randomization technique in machine learning is differential privacy [32,33], which addresses the goal of obtaining useful information about the sample **s** as a whole without revealing a lot of information about any individual observation. Here, we show that differentially-private learning algorithms have small learning capacities.

**Definition 9** ([33])**.** *A randomized learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *is* $(\epsilon, \delta)$ *differentially private if for any* $\mathcal{O} \subseteq \mathcal{H}$ *and any two samples* $\mathbf{s}$ *and* $\mathbf{s}'$ *that differ in one observation only, we have:*

$$p(\mathbf{h} \in \mathcal{O} \mid \mathbf{s}) \le e^{\epsilon} \cdot p(\mathbf{h} \in \mathcal{O} \mid \mathbf{s}') + \delta$$

**Proposition 5.** *If a learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *is* $(\epsilon, \delta)$ *differentially private, then:* $\mathcal{J}(\hat{\mathbf{z}}; \mathbf{h}) \leq (e^{\epsilon} - 1 + \delta)/2$*.*

**Proof.** The proof is in Appendix H.

Not surprisingly, the differential privacy parameters $(\epsilon, \delta)$ control the uniform generalization risk, where small values of $\epsilon$ and $\delta$ lead to a reduced risk for overfitting.
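As a concrete, hedged illustration of an $(\epsilon, 0)$ differentially-private mechanism (a standard textbook example, not one analyzed in the paper), randomized response releases a single bit while keeping the likelihood ratio between neighboring inputs within $e^{\epsilon}$:

```python
import random
from math import exp

def randomized_response(bit, epsilon, rng):
    """Release one bit with epsilon-differential privacy: report the truth
    with probability e^eps / (e^eps + 1), and the flipped bit otherwise."""
    p_truth = exp(epsilon) / (exp(epsilon) + 1)
    return bit if rng.random() < p_truth else 1 - bit

# Privacy check: for every output o, the likelihood ratio
# p(o | bit = 0) / p(o | bit = 1) must lie within [e^-eps, e^eps].
epsilon = 0.5
p_truth = exp(epsilon) / (exp(epsilon) + 1)
for o in (0, 1):
    p_given_0 = p_truth if o == 0 else 1 - p_truth
    p_given_1 = p_truth if o == 1 else 1 - p_truth
    ratio = p_given_0 / p_given_1
    assert exp(-epsilon) - 1e-12 <= ratio <= exp(epsilon) + 1e-12

private_bit = randomized_response(1, epsilon, random.Random(7))
assert private_bit in (0, 1)
```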

#### *5.5. Empirical Risk Minimization of 0–1 Loss Classes*

Empirical risk minimization (ERM) of a stochastic loss is a popular approach for learning from data. It is often regarded as the default strategy to use, due to its simplicity, generality, and statistical efficiency [1,3,13,34]. Given a fixed hypothesis space $\mathcal{H}$, a domain $\mathcal{Z}$, and a loss function $l: \mathcal{H}\times\mathcal{Z}\to \mathbb{R}$, the ERM learning rule selects the hypothesis $\hat{\mathbf{h}}\_{\mathbf{s}}$ that minimizes the empirical risk:

$$\hat{\mathbf{h}}\_{\mathsf{s}} = \arg\min\_{h \in \mathcal{H}} \left\{ L\_{\mathsf{s}}(h) = \frac{1}{|\mathsf{s}|} \sum\_{\mathbf{z}\_i \in \mathsf{s}} l(\mathbf{z}\_i, h) \right\},\tag{14}$$

By contrast, the true risk minimizer $\mathbf{h}^{\star}$ is:

$$\mathbf{h}^\* = \arg\min\_{h \in \mathcal{H}} \left\{ L(h) = \mathbb{E}\_{\mathbf{z} \sim p(z)} \left[ l(\mathbf{z}, h) \right] \right\}. \tag{15}$$

Hence, learning via ERM is justified if $L(\hat{\mathbf{h}}\_{\mathbf{s}}) \leq L(\mathbf{h}^{\star}) + \epsilon$ for some $\epsilon \ll 1$. If such a condition holds and $\epsilon \to 0$ as the sample size $m$ increases, the ERM learning rule is called *consistent*.
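For intuition, consistency can be observed in a toy simulation (our own sketch, with hypothetical constants): learning a threshold on $[0, 1]$ under the 0–1 loss, the excess true risk of the ERM hypothesis shrinks as the sample size grows.

```python
import random

# ERM for a threshold class: observations (x, y) with x ~ Uniform[0, 1],
# y = sign(x - 0.3), and loss l(x, y, h) = 1{y * (x - h) <= 0}.
TRUE_THRESHOLD = 0.3

def draw_sample(m, rng):
    xs = [rng.random() for _ in range(m)]
    return [(x, 1 if x >= TRUE_THRESHOLD else -1) for x in xs]

def erm_threshold(sample):
    """An empirical risk minimizer: the midpoint between the largest
    negative example and the smallest positive example."""
    neg = [x for x, y in sample if y == -1]
    pos = [x for x, y in sample if y == +1]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    return (lo + hi) / 2

def true_risk(h):
    # For x ~ Uniform[0, 1], the error region is the interval between h
    # and the true threshold, so the excess risk is |h - TRUE_THRESHOLD|.
    return abs(h - TRUE_THRESHOLD)

rng = random.Random(3)
risk_small = sum(true_risk(erm_threshold(draw_sample(20, rng))) for _ in range(200)) / 200
risk_large = sum(true_risk(erm_threshold(draw_sample(2000, rng))) for _ in range(200)) / 200
assert risk_large < risk_small  # the excess risk vanishes as m grows
```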

Uniform generalization is a sufficient condition for the consistency of empirical risk minimization (ERM). To see this, we have by definition:

$$\begin{aligned} \mathbb{E}\_{\mathfrak{s}}[L\_{\mathfrak{s}}(\hat{\mathbf{h}}\_{\mathfrak{s}})] &= \mathbb{E}\_{\mathfrak{s}}[\min\_{h \in \mathcal{H}} L\_{\mathfrak{s}}(h)] \\ &\leq \min\_{h \in \mathcal{H}} \left\{ \mathbb{E}\_{\mathfrak{s}}[L\_{\mathfrak{s}}(h)] \right\} = \min\_{h \in \mathcal{H}} L(h) = R(\mathbf{h}^{\star}), \end{aligned}$$

From this, we conclude that:

$$\mathbb{E}\_{\mathbf{s}}\mathcal{R}(\hat{\mathbf{h}}\_{\mathbf{s}}) - \mathcal{R}(\mathbf{h}^{\star}) \leq \mathbb{E}\_{\mathbf{s}}\mathcal{R}(\hat{\mathbf{h}}\_{\mathbf{s}}) - \mathbb{E}\_{\mathbf{s}}[L\_{\mathbf{s}}(\hat{\mathbf{h}}\_{\mathbf{s}})] \leq C(\mathcal{L}),$$

where $C(\mathcal{L})$ is the learning capacity of the empirical risk minimization rule. The last inequality follows from Theorem 2. In addition, because $R(\hat{\mathbf{h}}\_{\mathbf{s}}) - R(\mathbf{h}^{\star}) \geq 0$, we have by the Markov inequality:

$$\mathbb{P}\_{\mathbf{s}}\left\{\mathcal{R}(\hat{\mathbf{h}}\_{\mathbf{s}})-\mathcal{R}(\mathbf{h}^{\star})\geq t\right\}\leq\frac{\mathbb{E}\_{\mathbf{s}}\mathcal{R}(\hat{\mathbf{h}}\_{\mathbf{s}})-\mathcal{R}(\mathbf{h}^{\star})}{t}\leq\frac{\mathcal{C}(\mathcal{L})}{t}$$

Hence, the ERM learning rule is consistent if *C*(L) → 0 as *m* → ∞. Next, we describe when such a condition on *C*(L) holds for 0–1 loss classes. To do that, we begin with two familiar definitions from statistical learning theory.

**Definition 10** (Shattered Set)**.** *Given a domain* $\mathcal{Z}$*, a hypothesis space* $\mathcal{H}$*, and a 0–1 loss function* $l: \mathcal{Z}\times\mathcal{H}\to \{0, 1\}$*, a set* $\{z\_1, \dots, z\_d\}$ *is said to be shattered by* $\mathcal{H}$ *with respect to the function* $l$ *if for any labeling* $I \in \{0, 1\}^d$*, there exists a hypothesis* $h\_I \in \mathcal{H}$ *such that* $(l(z\_1, h\_I), \dots, l(z\_d, h\_I)) = I$*.*

**Example 7.** *Let* $\mathcal{Z} = \mathcal{H} = \mathbb{R}$ *and let the loss function be* $l(z, h) = \mathbb{I}\{z - h \geq 0\}$*. Then, any singleton set* $\{z\}$ *is shattered by* $\mathcal{H}$ *since we always have the two hypotheses* $h\_0 = z - 1$ *and* $h\_1 = z + 1$*. However, no set of two points in* $\mathcal{Z}$ *can be shattered by* $\mathcal{H}$*. By contrast, if the hypothesis is a pair* $(h, c) \in \mathbb{R} \times \mathbb{R}$ *and the loss function is* $l(z, (h, c)) = \mathbb{I}\{c\,z - h \geq 0\}$*, then any set of two distinct examples* $\{z\_1, z\_2\}$ *is shattered by the hypothesis space.*
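The shattering condition can be checked mechanically on a finite grid of hypotheses. The following Python sketch (ours; the grid is an assumption) verifies the first claim of Example 7: singletons are shattered by thresholds, but no pair of points is:

```python
from itertools import product

def shatters(points, hypotheses, loss):
    """Definition 10: the points are shattered if every labeling in
    {0,1}^d is realized by (loss(z_1, h), ..., loss(z_d, h)) for some h."""
    realized = {tuple(loss(z, h) for z in points) for h in hypotheses}
    return all(lab in realized for lab in product((0, 1), repeat=len(points)))

# Example 7: Z = H = R with loss l(z, h) = 1{z - h >= 0},
# approximated here by a finite grid of threshold hypotheses.
loss_threshold = lambda z, h: int(z - h >= 0)
candidates = [k / 10 for k in range(-20, 41)]

# Any singleton set is shattered ...
assert shatters([1.0], candidates, loss_threshold)
# ... but no pair is: with z1 < z2, the labeling (1, 0) is unrealizable.
assert not shatters([1.0, 2.0], candidates, loss_threshold)
```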

**Definition 11** (VC Dimension)**.** *The VC dimension of a hypothesis space* H *with respect to a domain* Z *and a 0–1 loss l* : Z×H→{0, 1} *is the maximum cardinality of a set of points in* Z *that can be shattered by* H *with respect to l.*

The VC dimension is arguably the most fundamental concept in statistical learning theory because it provides a crisp characterization of learnability for 0–1 loss classes. Next, we show that the VC dimension has, in fact, an equivalence characterization with the learning capacity *C*(L). Specifically, under the Axiom of Choice, an ERM learning rule exists that has a vanishing learning capacity *C*(L) if and only if the 0–1 loss class has a finite VC dimension.

Before we establish this important result, we describe why ERM by itself is not sufficient for uniform generalization to hold even when the hypothesis space has a finite VC dimension.

**Proposition 6.** *For any sample size* $m \geq 1$ *and a positive constant* $\epsilon > 0$*, there exists a hypothesis space* $\mathcal{H}$*, a domain* $\mathcal{Z}$*, and a 0–1 loss* $l: \mathcal{Z}\times\mathcal{H}\to \{0, 1\}$ *such that: (1)* $\mathcal{H}$ *has a VC dimension* $d = 1$*, and (2) a learning algorithm* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *exists that outputs an empirical risk minimizer* $\hat{h}\_{s}$ *with* $\mathcal{J}(\hat{z}; \hat{h}\_{s}) \geq 1 - \epsilon$*.*

**Proof.** Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \{+1, -1\}$, and let the loss be $l(x, y, h) = \mathbb{I}\{y \cdot (x - h) \leq 0\}$. In other words, the goal is to learn a threshold in the unit interval that separates the positive from the negative examples. Let $\mathbf{x} \in \mathcal{X}$ be uniformly distributed in $[0, 1]$ and let the labels be generated by an error-free separator. Then, for any training sample $\mathbf{s} \in \mathcal{Z}^m$, the set of all empirical risk minimizers $\hat{\mathbf{H}}$ is:

$$\hat{\mathbf{H}} = \left\{ h \in [0, 1] : \, y\_i = \mathbf{sign}(x\_i - h), \quad \forall i \in \{1, \dots, m\} \right\}$$

In particular, **H**ˆ is an interval, which has the power of the continuum, so it can be used to encode the entire training sample.

Fix *δ* > 0 in advance, which can be made arbitrarily small. Then, the probability over the random choice of the sample that <sup>|</sup>**H**<sup>ˆ</sup> <sup>|</sup> <sup>&</sup>lt; *<sup>δ</sup>* can be made arbitrarily small for a sufficiently small *<sup>δ</sup>* <sup>&</sup>gt; 0, where <sup>|</sup>**H**<sup>ˆ</sup> <sup>|</sup> is the length of the interval.

Let $\hat{\mathbf{h}} \in \hat{\mathbf{H}}$ be a hypothesis that lies at the middle of $\hat{\mathbf{H}}$, i.e.:

$$\hat{\mathbf{h}} = \frac{1}{2} \left[ \max\_{x\_i \in \mathbf{s}:\, y\_i = -1} x\_i + \min\_{x\_i \in \mathbf{s}:\, y\_i = +1} x\_i \right]$$

Let $k = 1 + \log\_2(1/\delta)$. Then, $[\hat{\mathbf{h}} - 2^{-k}, \hat{\mathbf{h}} + 2^{-k}] \subseteq \hat{\mathbf{H}}$ holds with a high probability (which can be made arbitrarily close to 1 for a sufficiently small $\delta$). Let $\tilde{\mathbf{h}}$ be a hypothesis whose binary expansion agrees with $\hat{\mathbf{h}}$ in its first $k + 1$ bits and encodes the entire training sample in the rest of the bits.

Finally, the output of the learning algorithm is $\hat{\mathbf{h}}\_{\mathbf{s}}$, which is given by the following rule: output $\tilde{\mathbf{h}}$ whenever $[\hat{\mathbf{h}} - 2^{-k}, \hat{\mathbf{h}} + 2^{-k}] \subseteq \hat{\mathbf{H}}$ (in which case $\tilde{\mathbf{h}}$ is itself an empirical risk minimizer, since it differs from $\hat{\mathbf{h}}$ by at most $2^{-(k+1)}$), and output $\hat{\mathbf{h}}$ otherwise.
Now, define the following *different* parametric loss $l: \mathcal{Z}\times\mathcal{H} \to [0, 1]$ to be a function that first uses $\hat{\mathbf{h}}\_{\mathbf{s}}$ to *decode* the training sample $\mathbf{s}$ based on the coding method constructed above and, then, assigns 1 if and only if $x \in \mathbf{s}$. To reiterate, this decoding succeeds with a probability that can be made arbitrarily high for a sufficiently small $\delta > 0$. Clearly, $l$ is a loss defined on the product space $\mathcal{Z}\times\mathcal{H}$ and has a bounded range. However, the generalization risk w.r.t. $l$ is, at least, equal to the probability that $|\hat{\mathbf{H}}| \geq \delta$, which can be made arbitrarily close to 1. Hence, the statement of the proposition holds.

Proposition 6 shows that one cannot obtain a non-trivial bound on the uniform generalization risk of an ERM learning rule in terms of the VC dimension $d$ and the sample size $m$ without making some additional assumptions. Next, we prove that an ERM learning rule *exists* that satisfies the uniform generalization property if the hypothesis space has a finite VC dimension. We begin by recalling a fundamental result in modern set theory. A non-empty set $Q$ is said to be *well-ordered* if $Q$ is endowed with a total order $\preceq$ such that every non-empty subset of $Q$ contains a least element. The following fundamental result, which was published in 1904, is due to Ernst Zermelo [35].

**Theorem 7** (Well-Ordering Theorem)**.** *Under the Axiom of Choice, every non-empty set can be well-ordered.*

**Theorem 8.** *Given a hypothesis space* $\mathcal{H}$*, a domain* $\mathcal{Z}$*, and a 0–1 loss* $l: \mathcal{H}\times\mathcal{Z} \to \{0, 1\}$*, let* $\preceq$ *be a well-ordering on* $\mathcal{H}$ *and let* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *be the learning rule that outputs the "least" empirical risk minimizer of the training sample* $\mathbf{s} \in \mathcal{Z}^m$ *according to* $\preceq$*. Then,* $C(\mathcal{L}) \to 0$ *as* $m \to \infty$ *if* $\mathcal{H}$ *has a finite VC dimension. In particular:*

$$C(\mathcal{L}) \le \frac{3}{\sqrt{m}} + \sqrt{\frac{1 + d\log\frac{2cm}{d}}{m}}.$$

*where d is the VC dimension of* H*, provided that m* ≥ *d.*

**Proof.** The proof is in Appendix I.

Next, we prove a converse statement. Before we do this, we present a learning problem that shows why a converse to Theorem 8 is not generally possible without making some additional assumptions. Hence, our converse will be later established for the binary classification setting only.

**Example 8** (Subset Learning Problem)**.** *Let* $\mathcal{Z} = \{1, 2, 3, \dots, d\}$ *be a finite set of positive integers. Let* $\mathcal{H} = 2^{\mathcal{Z}}$ *and define the 0–1 loss of a hypothesis* $h \in \mathcal{H}$ *to be* $l(z, h) = \mathbb{I}\{z \notin h\}$*. Then, the VC dimension is* $d$*. However, the learning rule that outputs* $h = \mathcal{Z}$ *is always an ERM learning rule that generalizes uniformly with rate zero, regardless of the sample size and the distribution of observations.*

The previous example shows that a converse to Theorem 8 is not generally possible without making some additional assumptions. In particular, in the Subset Learning Problem, the VC dimension is not an accurate measure of the complexity of the hypothesis space $\mathcal{H}$ because many hypotheses dominate others (i.e., perform better across all distributions of observations). For example, the hypothesis $h = \{1, 2, 3\}$ dominates $h' = \{1\}$ because there is no distribution on observations for which $h'$ outperforms $h$. In fact, the hypothesis $h = \mathcal{Z}$ dominates all other hypotheses.

Consequently, in order to prove a lower bound for all ERM rules, we focus on the standard binary classification setting.

**Theorem 9.** *In any fixed domain* $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$*, let the hypothesis space* $\mathcal{H}$ *be a concept class on* $\mathcal{X}$ *and let* $l(x, y, h) = \mathbb{I}\{y \neq h(x)\}$ *be the misclassification error. Then, any ERM learning rule* $\mathcal{L}$ *w.r.t.* $l$ *has a learning capacity* $C(\mathcal{L})$ *that is bounded from below by* $C(\mathcal{L}) \geq \frac{1}{2}\left(1 - \frac{1}{d}\right)^m$*, where* $m$ *is the training sample size and* $d$ *is the VC dimension of* $\mathcal{H}$*.*

**Proof.** The proof is in Appendix J.

Using both Theorems 8 and 9, we arrive at the following equivalence characterization of the VC dimension of a concept class with the learning capacity.

**Theorem 10.** *Given a fixed domain* $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$*, let the hypothesis space* $\mathcal{H}$ *be a concept class on* $\mathcal{X}$ *and let* $l(x, y, h) = \mathbb{I}\{y \neq h(x)\}$ *be the misclassification error. Let* $m$ *be the sample size. Then, the following statements are equivalent under the Axiom of Choice:*

1. $\mathcal{H}$ *has a finite VC dimension.*
2. *An ERM learning rule* $\mathcal{L}: \mathcal{Z}^m \to \mathcal{H}$ *exists whose learning capacity satisfies* $C(\mathcal{L}) \to 0$ *as* $m \to \infty$ *across all distributions of observations.*
**Proof.** The lower bound in Theorem 9 holds for all ERM learning rules. Hence, an ERM learning rule that generalizes uniformly with a vanishing rate across all distributions exists only if $\mathcal{H}$ has a finite VC dimension. Conversely, under the Axiom of Choice, $\mathcal{H}$ can always be well-ordered by Theorem 7, so, by Theorem 8, a finite VC dimension is also sufficient to guarantee the existence of an ERM learning rule that generalizes uniformly.

Theorem 10 presents a characterization of the VC dimension in terms of information theory. According to the theorem, an ERM learning rule can be constructed that does not encode the training sample *if and only if* the hypothesis space has a finite VC dimension.

**Remark 2.** *One method of constructing a well-ordering on a hypothesis space* H *is to use the fact that computers are equipped with finite precisions. Hence, in practice, every hypothesis space is enumerable, from which the normal ordering of the integers forms a valid well-ordering on* H*.*
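Remark 2 can be made concrete. With finite precision, the hypothesis space is enumerable, and enumeration order is a valid well-ordering; the rule of Theorem 8 then simply returns the first empirical risk minimizer in that order. A minimal Python sketch (our illustration; the grid, loss, and sample are hypothetical):

```python
def least_erm(sample, hypotheses, loss):
    """Theorem 8's rule: among all empirical risk minimizers, return the
    first one in the enumeration (Python's min returns the first minimal
    element, which implements the tie-breaking by the well-ordering)."""
    return min(hypotheses, key=lambda h: sum(loss(z, h) for z in sample))

# Finite-precision threshold hypotheses, enumerated in a fixed order.
hypotheses = [k / 100 for k in range(101)]
loss = lambda z, h: int(z[1] * (z[0] - h) <= 0)  # z = (x, y), y in {-1, +1}

sample = [(0.1, -1), (0.2, -1), (0.7, +1), (0.9, +1)]
h_least = least_erm(sample, hypotheses, loss)
# Every grid threshold in (0.2, 0.7) attains zero empirical risk; the rule
# deterministically picks the least one in the enumeration.
assert h_least == 0.21
```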

#### **6. Concluding Remarks**

In this paper, we introduced the notion of "learning capacity" for algorithms that learn from data, which is analogous to the Shannon capacity of communication channels. Learning capacity is an information-theoretic quantity that measures the contribution of a single training example to the final hypothesis. It has three equivalent interpretations: (1) as a tight upper bound on the uniform generalization risk, (2) as a measure of information leakage, and (3) as a measure of algorithmic stability. Furthermore, by establishing a chain rule for learning capacity, concentration bounds were derived, which revealed that the learning capacity controlled both the expectation of the generalization risk and its variance. Moreover, the relationship between algorithmic stability and data processing revealed that algorithmic stability can be improved by post-processing the learned hypothesis.

Throughout this paper, we provided several bounds on the learning capacity under various settings. For instance, we established a relationship between algorithmic stability and the effective size of the domain of observations, which can be interpreted as a formal justification for dimensionality reduction methods. Moreover, we showed how learning capacity recovered classical bounds, such as in the finite hypothesis space setting, and derived new bounds for other settings as well, such as differential privacy. We also established that, under the Axiom of Choice, the existence of an empirical risk minimization (ERM) rule for 0–1 loss classes that had a vanishing learning capacity was equivalent to the assertion that the hypothesis space had a finite Vapnik–Chervonenkis (VC) dimension, thus establishing an equivalence relation between two of the most fundamental concepts in statistical learning theory and information theory.

More generally, the intent of this work is to bring to light a new information-theoretic approach for analyzing machine learning algorithms. Despite the fact that "uniform generalization" might appear to be a strong condition at first sight, one of the central claims of this paper is that uniform generalization is, in fact, a natural condition that arises commonly in practice. It is not a condition to require or enforce! We believe this holds because any learning algorithm is a *channel* from the space of training samples to the hypothesis space. Because learning is a mapping between two spaces, its risk for overfitting should be determined from the mapping itself (i.e., independently of the choice of the loss function). Such an approach yields the uniform generalization bounds that are derived in this paper.

It is worth highlighting that uniform generalization bounds can be established for many other settings that have not been discussed in this paper, and the approach has found some promising applications. Using sample compression schemes, one can show that any learnable hypothesis space is also learnable by an algorithm that achieves uniform generalization [36]. Also, generalization bounds for stochastic convex optimization yield information criteria for model selection that can outperform the popular Akaike information criterion (AIC) and Schwarz's Bayesian information criterion (BIC) [37]. More recently, uniform generalization has inspired the development of new approaches for structured regression as well [38].

#### **7. Further Research Directions**

Before we conclude, we suggest future directions of research and list some open problems.

#### *7.1. Induced VC Dimension*

The variational information J (**z**ˆ; **h**) provides an upper bound on the generalization risk of the learning algorithm L across all parametric loss classes. This upper bound is *achievable* by the generalization risk of the *binary reconstruction loss*:

$$l(z, \mathbf{h}) = \mathbb{I}\{p(z \in \mathbf{s} \mid \mathbf{h}) \; \geq \; p(z \in \mathbf{s})\},\tag{16}$$

which assigns the value one to observations *z* ∈ Z that are *more* likely to have been present in the training sample **s** upon knowing **h**, and assigns zero otherwise. In expectation, the generalization risk of this parametric loss is the worst generalization risk across all parametric loss classes.

Let both *p*(*z*) and *p*(*h*|*z*) be fixed; the first is the distribution of observations while the second is entirely determined by the learning algorithm L. Then, because the loss in Equation (16) is binary, it has a VC dimension, which we will call the *induced VC dimension* of the learning algorithm L [39]. Note that this induced VC dimension is defined for all learning problems, including regression and clustering, but it is *distribution-dependent*, which is quite unlike the traditional VC dimension of hypothesis spaces.

There are many open questions related to the *induced VC dimension* of learning algorithms. For instance, while a finite VC dimension implies a small variational information, when does the converse also hold? Can we obtain a non-trivial bound on the induced VC dimension of a learning algorithm $\mathcal{L}$ upon knowing its uniform generalization risk $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h})$? Along similar lines, suppose that $\mathcal{L}$ is an empirical risk minimization (ERM) algorithm of a 0–1 loss class that may or may not use an appropriate tie-breaking rule (in light of what was discussed in Section 5.5). Is there a non-trivial relation between the VC dimension of the 0–1 loss that is being minimized and the induced VC dimension of the ERM learning algorithm?

#### *7.2. Unsupervised Model Selection*

Information criteria (such as AIC and BIC) are sometimes used in the unsupervised learning setting for model selection, such as when determining the value of *k* in the popular *k*-means algorithm [40]. Given that the notion of uniform generalization is developed in the *general* setting of learning, should the learning capacity $C(\mathcal{L})$ serve as a model selection criterion in the unsupervised setting? Why or why not?
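For concreteness, here is one common way BIC is used to pick $k$ in $k$-means. This is a heuristic sketch, not the paper's proposal: the synthetic data, the deterministic quantile initialization, and the spherical-Gaussian BIC surrogate are all our assumptions.

```python
import numpy as np

def kmeans_1d(x, k, iters=50):
    """Plain Lloyd's algorithm in one dimension with deterministic quantile init."""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return ((x - centers[labels]) ** 2).sum()   # residual sum of squares

def bic(rss, m, k):
    """Spherical-Gaussian BIC surrogate: fit term plus a log(m) penalty per center."""
    return m * np.log(rss / m + 1e-12) + k * np.log(m)

rng = np.random.default_rng(0)
# Three well-separated 1-D clusters.
x = np.concatenate([rng.normal(mu, 0.5, 50) for mu in (0.0, 10.0, 20.0)])

scores = {k: bic(kmeans_1d(x, k), len(x), k) for k in range(1, 7)}
# Underfitting with k = 1 or k = 2 scores far worse than k = 3.
```

The open question in the text is whether the learning capacity $C(\mathcal{L})$ would be a better-founded criterion than such penalized-likelihood surrogates.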

#### *7.3. Effective Domain Size*

The effective size of the domain of a random variable **z** in Definition 8 satisfies some intuitive properties and violates others. For instance, it reduces to the size of the domain |Z| when the distribution is uniform. Moreover, if **z** is Bernoulli, the effective domain size is determined by the *variance* of the Bernoulli distribution. Importantly, this notion is well-motivated because it determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense. As a result, it allowed us to relate the learning capacity to a property of the domain Z.
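Definition 8 is not restated in this excerpt, but the asymptotic expression derived in Appendix G implies the closed form $\mathrm{Ess}[\mathcal{Z};p] = 1 + \big(\sum_z \sqrt{p_z(1-p_z)}\big)^2$. A short sketch under that assumption verifies the two properties just mentioned:

```python
import numpy as np

def effective_domain_size(p):
    """Ess[Z; p] = 1 + (sum_z sqrt(p_z (1 - p_z)))^2, the closed form implied
    by the asymptotic rate in Appendix G (an assumption here, since
    Definition 8 is stated elsewhere in the paper)."""
    p = np.asarray(p, dtype=float)
    return 1.0 + np.sqrt(p * (1.0 - p)).sum() ** 2

# Uniform distribution on n outcomes: Ess equals |Z| = n.
print(effective_domain_size(np.full(5, 0.2)))   # approximately 5.0

# Bernoulli(q): Ess = 1 + 4 * Var = 1 + 4 q (1 - q).
q = 0.3
print(effective_domain_size([q, 1 - q]))        # approximately 1.84
```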

However, such a notion of effective domain size has some surprising properties. For instance, the effective size of the domain of two *independent* random variables is not equal to the product of the effective size of each individual domain! In rate distortion theory, a similar phenomenon is observed. Reference [10] explains this observation by stating that "rectangular grid points (arising from independent descriptions) do not fill up the space efficiently." Can the effective domain size in Definition 8 be motivated using rate distortion theory?
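The asymptotic expression in Appendix G implies the closed form $\mathrm{Ess}[\mathcal{Z};p] = 1 + \big(\sum_z \sqrt{p_z(1-p_z)}\big)^2$ (an assumption here, since Definition 8 appears elsewhere in the paper); under it, the non-multiplicativity is easy to exhibit numerically for two independent Bernoulli(0.3) variables:

```python
import numpy as np

def ess(p):
    # Assumed closed form for the effective domain size (cf. Appendix G).
    p = np.asarray(p, dtype=float)
    return 1.0 + np.sqrt(p * (1.0 - p)).sum() ** 2

px = np.array([0.3, 0.7])        # Bernoulli(0.3)
py = np.array([0.3, 0.7])        # an independent copy
pxy = np.outer(px, py).ravel()   # joint distribution on the 4 outcome pairs

print(ess(px) * ess(py))   # product of individual effective sizes: about 3.39
print(ess(pxy))            # effective size of the joint domain: about 3.56
```

The two quantities disagree, mirroring the rate-distortion observation quoted above.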

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A. Proof of Lemma 2**

With no loss of generality, let's assume that all domains are enumerable. We have:

$$\begin{aligned} \mathcal{J}(\mathbf{x};(\mathbf{y},\mathbf{z})) &= 1 - \sum_{x,y,z}\min\big\{p(\mathbf{x}=x)\,p(\mathbf{y}=y,\mathbf{z}=z),\ p(\mathbf{x}=x,\mathbf{y}=y,\mathbf{z}=z)\big\} \\ &= 1 - \sum_{x} p(\mathbf{x}=x)\sum_{y,z}\min\big\{p(\mathbf{y}=y,\mathbf{z}=z),\ p(\mathbf{y}=y,\mathbf{z}=z\mid\mathbf{x}=x)\big\} \end{aligned}$$

However, the minimum of the sums is always at least the sum of the minimums. That is:

$$\min\Big\{\sum_{i}\alpha_i,\ \sum_{i}\beta_i\Big\} \ge \sum_{i}\min\{\alpha_i,\ \beta_i\}.$$

Using the marginalization $p(\mathbf{y}=y) = \sum_{z} p(\mathbf{y}=y,\mathbf{z}=z)$ and the above inequality, we obtain:

$$\begin{aligned} \mathcal{J}(\mathbf{x};(\mathbf{y},\mathbf{z})) &\ge 1 - \sum_{x} p(\mathbf{x}=x)\sum_{y}\min\Big\{\sum_{z} p(\mathbf{y}=y,\mathbf{z}=z),\ \sum_{z} p(\mathbf{y}=y,\mathbf{z}=z\mid\mathbf{x}=x)\Big\} \\ &= 1 - \sum_{x} p(\mathbf{x}=x)\sum_{y}\min\big\{p(\mathbf{y}=y),\ p(\mathbf{y}=y\mid\mathbf{x}=x)\big\} \\ &= \mathcal{J}(\mathbf{x};\mathbf{y}) \end{aligned}$$

#### **Appendix B. Proof of Theorem 1**

We will first prove the inequality when *k* = 2. First, we write by definition:

$$\mathcal{J}(\mathbf{z};(\mathbf{h}_1,\mathbf{h}_2)) = ||\,p(\mathbf{z},\mathbf{h}_1,\mathbf{h}_2),\ p(\mathbf{z})\,p(\mathbf{h}_1,\mathbf{h}_2)\,||_{\mathcal{T}}$$

Using the fact that the total variation distance is related to the $\ell_1$ distance by $||P,\ Q||_{\mathcal{T}} = \frac{1}{2}\,||P - Q||_1$, we have:

$$\begin{aligned} \mathcal{J}(\mathbf{z};(\mathbf{h}_1,\mathbf{h}_2)) &= \tfrac{1}{2}\,\big|\big|\, p(\mathbf{z},\mathbf{h}_1,\mathbf{h}_2) - p(\mathbf{z})\,p(\mathbf{h}_1,\mathbf{h}_2)\,\big|\big|_1 \\ &= \tfrac{1}{2}\,\big|\big|\, p(\mathbf{z},\mathbf{h}_1)\,p(\mathbf{h}_2\mid\mathbf{z},\mathbf{h}_1) - p(\mathbf{z})\,p(\mathbf{h}_1)\,p(\mathbf{h}_2\mid\mathbf{h}_1)\,\big|\big|_1 \\ &= \tfrac{1}{2}\,\big|\big|\, \big[p(\mathbf{z},\mathbf{h}_1) - p(\mathbf{z})\,p(\mathbf{h}_1)\big]\cdot p(\mathbf{h}_2\mid\mathbf{h}_1) + p(\mathbf{z},\mathbf{h}_1)\cdot\big[p(\mathbf{h}_2\mid\mathbf{z},\mathbf{h}_1) - p(\mathbf{h}_2\mid\mathbf{h}_1)\big]\,\big|\big|_1 \end{aligned}$$

Using the triangle inequality:

$$\mathcal{J}(\mathbf{z};(\mathbf{h}_1,\mathbf{h}_2)) \le \tfrac{1}{2}\,\big|\big|\,\big[p(\mathbf{z},\mathbf{h}_1) - p(\mathbf{z})\,p(\mathbf{h}_1)\big]\cdot p(\mathbf{h}_2\mid\mathbf{h}_1)\,\big|\big|_1 + \tfrac{1}{2}\,\big|\big|\, p(\mathbf{z},\mathbf{h}_1)\cdot\big[p(\mathbf{h}_2\mid\mathbf{z},\mathbf{h}_1) - p(\mathbf{h}_2\mid\mathbf{h}_1)\big]\,\big|\big|_1$$

The above inequality is interpreted by expanding the $\ell_1$ distance into a sum of absolute values of terms in the product space $\mathcal{Z}\times\mathcal{H}_1\times\mathcal{H}_2$, where $\mathbf{h}_k \in \mathcal{H}_k$. Next, we bound each term on the right-hand side separately. For the first term, we note that:

$$\tfrac{1}{2}\,\big|\big|\,\big[p(\mathbf{z},\mathbf{h}_1) - p(\mathbf{z})\,p(\mathbf{h}_1)\big]\cdot p(\mathbf{h}_2\mid\mathbf{h}_1)\,\big|\big|_1 = \tfrac{1}{2}\,\big|\big|\, p(\mathbf{z},\mathbf{h}_1) - p(\mathbf{z})\,p(\mathbf{h}_1)\,\big|\big|_1 = \mathcal{J}(\mathbf{z};\mathbf{h}_1) \tag{A1}$$

The equality holds by expanding the $\ell_1$ distance and using the fact that $\sum_{h_2} p(\mathbf{h}_2 = h_2 \mid \mathbf{h}_1) = 1$. Meanwhile, the second term can be re-written as:

$$\begin{aligned} &\tfrac{1}{2}\,\big|\big|\, p(\mathbf{z},\mathbf{h}_1)\cdot\big[p(\mathbf{h}_2\mid\mathbf{z},\mathbf{h}_1) - p(\mathbf{h}_2\mid\mathbf{h}_1)\big]\,\big|\big|_1 \\ &= \tfrac{1}{2}\,\big|\big|\, p(\mathbf{h}_1)\cdot\big[p(\mathbf{h}_2,\mathbf{z}\mid\mathbf{h}_1) - p(\mathbf{z}\mid\mathbf{h}_1)\,p(\mathbf{h}_2\mid\mathbf{h}_1)\big]\,\big|\big|_1 \\ &= \mathbb{E}_{\mathbf{h}_1}\Big[\,\big|\big|\, p(\mathbf{h}_2,\mathbf{z}\mid\mathbf{h}_1),\ p(\mathbf{z}\mid\mathbf{h}_1)\,p(\mathbf{h}_2\mid\mathbf{h}_1)\,\big|\big|_{\mathcal{T}}\,\Big] \\ &= \mathcal{J}(\mathbf{z};\,\mathbf{h}_2\mid\mathbf{h}_1) \end{aligned} \tag{A2}$$

Combining Equations (A1) and (A2) yields the inequality:

$$\mathcal{J}(\mathbf{z};(\mathbf{h}_1,\mathbf{h}_2)) \le \mathcal{J}(\mathbf{z};\mathbf{h}_1) + \mathcal{J}(\mathbf{z};\,\mathbf{h}_2\mid\mathbf{h}_1) \tag{A3}$$

Next, we use Equation (A3) to prove the general statement for all $k \ge 1$ by writing:

$$\mathcal{J}(\mathbf{z}; (\mathbf{h}\_1, \dots, \mathbf{h}\_k)) \le \mathcal{J}(\mathbf{z}; \mathbf{h}\_k \mid (\mathbf{h}\_1, \dots, \mathbf{h}\_{k-1})) + \mathcal{J}(\mathbf{z}; (\mathbf{h}\_1, \dots, \mathbf{h}\_{k-1})) $$

Repeating the same inequality on the last term on the right-hand side yields the statement of the theorem.
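The chain rule just proved can be sanity-checked numerically on random discrete distributions. The following is a verification sketch; the enumeration over a $3\times 3\times 3$ joint table is our choice:

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    """Total variation distance ||p, q||_T = (1/2) ||p - q||_1."""
    return 0.5 * np.abs(p - q).sum()

for _ in range(100):
    # Random joint distribution p(z, h1, h2) on a 3 x 3 x 3 domain.
    p = rng.random((3, 3, 3))
    p /= p.sum()

    pz = p.sum(axis=(1, 2))
    ph1 = p.sum(axis=(0, 2))
    ph1h2 = p.sum(axis=0)

    # J(z; (h1, h2)) = || p(z, h1, h2), p(z) p(h1, h2) ||_T
    lhs = tv(p.ravel(), (pz[:, None, None] * ph1h2[None, :, :]).ravel())

    # J(z; h1)
    pzh1 = p.sum(axis=2)
    j1 = tv(pzh1.ravel(), np.outer(pz, ph1).ravel())

    # J(z; h2 | h1) = E_{h1} || p(z, h2 | h1), p(z | h1) p(h2 | h1) ||_T
    j2 = 0.0
    for b in range(3):
        cond = p[:, b, :] / ph1[b]          # p(z, h2 | h1 = b)
        j2 += ph1[b] * tv(cond.ravel(),
                          np.outer(cond.sum(axis=1), cond.sum(axis=0)).ravel())

    assert lhs <= j1 + j2 + 1e-12           # Equation (A3)
```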

#### **Appendix C. Proof of Proposition 1**

By the triangle inequality:

$$\begin{aligned} \mathcal{J}(\mathbf{x};\mathbf{z}\mid\mathbf{y}) &= \mathbb{E}_{\mathbf{y}}\,||\, p(\mathbf{x}\mid\mathbf{y})\,p(\mathbf{z}\mid\mathbf{y}),\ p(\mathbf{x},\mathbf{z}\mid\mathbf{y})\,||_{\mathcal{T}} \\ &= \mathbb{E}_{\mathbf{x},\mathbf{y}}\,||\, p(\mathbf{z}\mid\mathbf{y}),\ p(\mathbf{z}\mid\mathbf{x},\mathbf{y})\,||_{\mathcal{T}} \\ &\le \mathbb{E}_{\mathbf{x},\mathbf{y}}\,||\, p(\mathbf{z}\mid\mathbf{y}),\ p(\mathbf{z})\,||_{\mathcal{T}} + \mathbb{E}_{\mathbf{x},\mathbf{y}}\,||\, p(\mathbf{z}),\ p(\mathbf{z}\mid\mathbf{x},\mathbf{y})\,||_{\mathcal{T}} \\ &= \mathbb{E}_{\mathbf{y}}\,||\, p(\mathbf{z}\mid\mathbf{y}),\ p(\mathbf{z})\,||_{\mathcal{T}} + \mathbb{E}_{\mathbf{x},\mathbf{y}}\,||\, p(\mathbf{z}),\ p(\mathbf{z}\mid\mathbf{x},\mathbf{y})\,||_{\mathcal{T}} \\ &= \mathcal{J}(\mathbf{y};\mathbf{z}) + \mathcal{J}(\mathbf{z};(\mathbf{x},\mathbf{y})) \end{aligned}$$

Therefore:

$$\mathcal{J}(\mathbf{z};(\mathbf{x},\mathbf{y})) \ge \mathcal{J}(\mathbf{x};\mathbf{z}\mid\mathbf{y}) - \mathcal{J}(\mathbf{y};\mathbf{z})$$

Combining this with the following chain rule of Theorem 2:

$$\mathcal{J}(\mathbf{z};\,(\mathbf{x},\mathbf{y})) \le \mathcal{J}(\mathbf{x};\,\mathbf{z}\,\big|\,\mathbf{y}) + \mathcal{J}(\mathbf{y};\,\mathbf{z})$$

yields:

$$\left| \mathcal{J}(\mathbf{z}; (\mathbf{x}, \mathbf{y})) - \mathcal{J}(\mathbf{x}; \mathbf{z} \mid \mathbf{y}) \right| \le \mathcal{J}(\mathbf{y}; \mathbf{z}) $$

Or equivalently, after relabeling the variables:

$$\left| \mathcal{J}(\mathbf{x}; (\mathbf{y}, \mathbf{z})) - \mathcal{J}(\mathbf{x}; \mathbf{z} \mid \mathbf{y}) \right| \le \mathcal{J}(\mathbf{x}; \mathbf{y}) \tag{A4}$$

To prove the other inequality, we use Lemma 2. We have:

$$\mathcal{J}(\mathbf{x};\mathbf{y}) \le \mathcal{J}(\mathbf{x};(\mathbf{y},\mathbf{z})) \le \mathcal{J}(\mathbf{x};\mathbf{y}) + \mathcal{J}(\mathbf{x};\mathbf{z}\mid\mathbf{y}),$$

where the first inequality follows from Lemma 2 and the second inequality follows from the chain rule. Thus, we obtain the desired bound:

$$\left| \mathcal{J}(\mathbf{x}; (\mathbf{y}, \mathbf{z})) - \mathcal{J}(\mathbf{x}; \mathbf{y}) \right| \le \mathcal{J}(\mathbf{x}; \mathbf{z} \mid \mathbf{y}) \tag{A5}$$

Both Equations (A4) and (A5) imply that the chain rule is tight. More precisely, the inequality can be made arbitrarily close to an equality when one of the two terms in the upper bound is chosen to be arbitrarily close to zero.

#### **Appendix D. Proof of Theorem 3**

We will use the following fact:

**Fact 1.** *Let f* : X → [0, 1] *be a function with a bounded range in the interval* [0, 1]*. Let p*1(*x*) *and p*2(*x*) *be two different probability measures defined on the same space* X *. Then:*

$$\Big|\,\mathbb{E}_{\mathbf{x}\sim p_1(x)}\, f(\mathbf{x}) - \mathbb{E}_{\mathbf{x}\sim p_2(x)}\, f(\mathbf{x})\,\Big| \le ||\, p_1(\mathbf{x}),\ p_2(\mathbf{x})\,||_{\mathcal{T}}$$

**First Setting**: We first consider the following scenario. Suppose a learning algorithm L produces a hypothesis **h** ∈ H from some marginal distribution *p*(*h*) *independently* of the training sample **s**. Afterwards, L produces a second hypothesis **k** ∈ K according to *p*(*k* | **h**, **s**). In other words, **k** depends on both **h** and **s** but the latter two random variables are independent of each other. Under this scenario, we have:

$$\mathcal{J}(\hat{\mathbf{z}};(\mathbf{h},\mathbf{k})) = \mathcal{J}(\hat{\mathbf{z}};\mathbf{k}\mid\mathbf{h}),$$

where the equality follows from the chain rule in Theorem 1, the statement of Proposition 1, and the fact that J (**z**ˆ; **h**) = 0.

The conditional variational information is written as:

$$\mathcal{J}(\hat{\mathbf{z}};\mathbf{k}\mid\mathbf{h}) = \mathbb{E}_{\mathbf{h}}\,||\, p(\hat{\mathbf{z}})\cdot p(\mathbf{k}\mid\mathbf{h}),\ p(\hat{\mathbf{z}},\mathbf{k}\mid\mathbf{h})\,||_{\mathcal{T}},$$

where we used the fact that *p*(**z**ˆ|**h**) = *p*(**z**ˆ). By marginalization:

$$p(\mathbf{k}\mid\mathbf{h}) = \mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}\big[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})\big] = \mathbb{E}_{\hat{\mathbf{z}}'\sim p(z)}\big[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})\big].$$

Similarly:

$$p(\mathbf{\hat{z}}, \mathbf{k} | \mathbf{h}) = p(\mathbf{\hat{z}} | \mathbf{h}) \cdot p(\mathbf{k} | \mathbf{\hat{z}}, \mathbf{h}) = p(\mathbf{\hat{z}}) \cdot p(\mathbf{k} | \mathbf{\hat{z}}, \mathbf{h})$$

Therefore:

$$\mathcal{J}(\mathbf{\hat{z}};\mathbf{k}\mid\mathbf{h}) = \mathbb{E}\_{\mathbf{h}}\mathbb{E}\_{\mathbf{\hat{z}}}||\mathbb{E}\_{\mathbf{\hat{z}}'}[p(\mathbf{k}|\mathbf{\hat{z}'},\mathbf{h})] \,, \ p(\mathbf{k}|\mathbf{\hat{z}},\mathbf{h})||\_{\mathcal{T}}$$

Next, we note that since **h** is independent of the sample **s**, the variational information between **z**ˆ ∼ **s** and **k** ∈ K can be bounded using Theorem 6. This follows because **h** is selected independently of the sample **s**, and, hence, the i.i.d. property of the observations **z***<sup>i</sup>* continues to hold. Therefore, we obtain:

$$\mathbb{E}\_{\mathbf{h}}\mathbb{E}\_{\mathbf{k}}||\mathbb{E}\_{\mathbf{z}'}[p(\mathbf{k}|\mathbf{z}',\mathbf{h})] \,, \ p(\mathbf{k}|\mathbf{z},\mathbf{h})||\_{\mathcal{T}} \leq \sqrt{\frac{\log|\mathcal{K}|}{2m}}\tag{A6}$$

Because *p*(**k**|**z**ˆ, **h**) is arbitrary in our derivation, the above bound holds for any distribution of observations *p*(*z*), any distribution *p*(*h*), and any family of conditional distributions *p*(*k*|**z**ˆ, **h**).

**Original Setting**: Next, we return to the original setting where both **h** ∈ H and **k** ∈ K are chosen *according* to the training sample **s**. We have:

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}};\mathbf{k}\mid\mathbf{h}) &= \mathbb{E}_{\mathbf{h}}\,||\, p(\hat{\mathbf{z}}\mid\mathbf{h})\cdot p(\mathbf{k}\mid\mathbf{h}),\ p(\hat{\mathbf{z}},\mathbf{k}\mid\mathbf{h})\,||_{\mathcal{T}} \\ &= \mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\, p(\mathbf{k}\mid\mathbf{h}),\ p(\mathbf{k}\mid\hat{\mathbf{z}},\mathbf{h})\,||_{\mathcal{T}} \\ &= \mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ p(\mathbf{k}\mid\hat{\mathbf{z}},\mathbf{h})\,||_{\mathcal{T}} \\ &\le \mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ \mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})]\,||_{\mathcal{T}} + \mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ p(\mathbf{k}\mid\hat{\mathbf{z}},\mathbf{h})\,||_{\mathcal{T}} \end{aligned} \tag{A7}$$

In the last line, we used the triangle inequality.

Next, we would like to bound the first term. Using the fact that the total variation distance is related to the $\ell_1$ distance by $||p,\ q||_{\mathcal{T}} = \frac{1}{2}\,||p - q||_1$, we have:

$$\begin{aligned} &\mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ \mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})]\,||_{\mathcal{T}} \\ &= \mathbb{E}_{\mathbf{h}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ \mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})]\,||_{\mathcal{T}} \\ &= \frac{1}{2}\,\mathbb{E}_{\mathbf{h}}\sum_{k\in\mathcal{K}}\Big|\,\mathbb{E}_{\hat{\mathbf{z}}'\mid\mathbf{h}}[p(\mathbf{k}=k\mid\hat{\mathbf{z}}',\mathbf{h})] - \mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}=k\mid\hat{\mathbf{z}}',\mathbf{h})]\,\Big| \\ &\le \frac{1}{2}\sum_{k\in\mathcal{K}}\mathbb{E}_{\mathbf{h}}\,||\, p(\hat{\mathbf{z}}'\mid\mathbf{h}),\ p(\hat{\mathbf{z}}')\,||_{\mathcal{T}} \\ &= \frac{1}{2}\sum_{k\in\mathcal{K}}\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) = \frac{|\mathcal{K}|}{2}\,\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) \end{aligned} \tag{A8}$$

Here, the inequality follows from Fact 1.

Next, we bound the second term in Equation (A7). Using Fact 1 and our earlier result in Equation (A6):

$$\begin{aligned} &\mathbb{E}_{\mathbf{h},\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ p(\mathbf{k}\mid\hat{\mathbf{z}},\mathbf{h})\,||_{\mathcal{T}} \\ &\le \mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) + \mathbb{E}_{\mathbf{h}}\mathbb{E}_{\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'}[p(\mathbf{k}\mid\hat{\mathbf{z}}',\mathbf{h})],\ p(\mathbf{k}\mid\hat{\mathbf{z}},\mathbf{h})\,||_{\mathcal{T}} \\ &\le \mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) + \sqrt{\frac{\log|\mathcal{K}|}{2m}} \end{aligned} \tag{A9}$$

Combining all results in Equations (A7)–(A9):

$$\mathcal{J}(\hat{\mathbf{z}};\mathbf{k}\mid\mathbf{h}) \le \Big[1 + \frac{|\mathcal{K}|}{2}\Big]\,\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) + \sqrt{\frac{\log|\mathcal{K}|}{2m}} \tag{A10}$$

This, along with the chain rule, implies the statement of the theorem.

#### **Appendix E. Proof of Proposition 3**

Let *I*(**x**; **y**) denote the mutual information between **x** and **y** and let *H*(**x**) denote the Shannon entropy of the random variable **x** measured in nats (i.e., using natural logarithms). As before, we write **s** = (**z**1, ... , **z***m*). We have:

$$\begin{aligned} I(\mathbf{s};(\mathbf{h},\mathbf{k})) &= H(\mathbf{s}) - H(\mathbf{s}\mid\mathbf{h},\mathbf{k}) \\ &= \sum_{i=1}^{m} H(\mathbf{z}_i) - \sum_{i=1}^{m} H(\mathbf{z}_i\mid\mathbf{h},\mathbf{k},\mathbf{z}_1,\dots,\mathbf{z}_{i-1}) \\ &\ge \sum_{i=1}^{m}\big[H(\mathbf{z}_i) - H(\mathbf{z}_i\mid\mathbf{h},\mathbf{k})\big] = m\, I(\hat{\mathbf{z}};(\mathbf{h},\mathbf{k})) \end{aligned}$$

The second line uses the chain rule for entropy, and the third line follows from the fact that conditioning reduces entropy. We obtain:

$$I(\mathbf{\hat{z}}; \mathbf{h}, \mathbf{k}) \le \frac{I(\mathbf{s}; (\mathbf{h}, \mathbf{k}))}{m}$$

By Pinsker's inequality:

$$\mathcal{J}(\mathbf{\hat{z}}; (\mathbf{h}, \mathbf{k})) \le \sqrt{\frac{I(\mathbf{\hat{z}}; (\mathbf{h}, \mathbf{k}))}{2}} \le \sqrt{\frac{I(\mathbf{s}; (\mathbf{h}, \mathbf{k}))}{2m}}$$

Using the chain rule for mutual information:

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}};(\mathbf{h},\mathbf{k})) &\le \sqrt{\frac{I(\mathbf{s};(\mathbf{h},\mathbf{k}))}{2m}} = \sqrt{\frac{I(\mathbf{s};\mathbf{h}) + I(\mathbf{s};\mathbf{k}\mid\mathbf{h})}{2m}} \\ &\le \sqrt{\frac{I(\mathbf{s};\mathbf{h}) + H(\mathbf{k})}{2m}} \le \sqrt{\frac{I(\mathbf{s};\mathbf{h}) + \log|\mathcal{K}|}{2m}} \end{aligned}$$

The desired bound follows by applying the same proof technique of Theorem 4 on the last uniform generalization bound, and using the fact that log 3 < 2.
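The only analytic ingredients above beyond the chain rule for mutual information are Pinsker's inequality and the entropy bound $H(\mathbf{k}) \le \log|\mathcal{K}|$; the former is easy to verify numerically (a sanity-check sketch with randomly drawn distributions, our construction):

```python
import numpy as np

rng = np.random.default_rng(7)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    """KL divergence in nats, matching the paper's convention."""
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

for _ in range(1000):
    p = rng.random(6); p /= p.sum()
    q = rng.random(6); q /= q.sum()
    # Pinsker's inequality: || p, q ||_T <= sqrt( KL(p || q) / 2 ).
    assert tv(p, q) <= np.sqrt(kl(p, q) / 2) + 1e-12
```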

#### **Appendix F. Proof of Proposition 4**

Before we prove the statement of the proposition, we begin with the following lemma:

**Lemma A1.** *Let the observation space $\mathcal{Z}$ be the interval $[0,1]$, where $p(z)$ is continuous on $[0,1]$. Let $\mathbf{h}\subseteq\mathbf{s}$, with $|\mathbf{h}| = k$, be a set of $k$ examples picked at random without replacement from the training sample $\mathbf{s}$. Then $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) = \frac{k}{m}$.*

**Proof.** First, we note that $p(\hat{\mathbf{z}}\mid\mathbf{h})$ is a *mixture* of two distributions: one that is uniform on $\mathbf{h}$ with probability $k/m$, and the original distribution $p(z)$ with probability $1 - k/m$. By Jensen's inequality, we have $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) \le k/m$. Second, let the parametric loss be $l(z;\mathbf{h}) = \mathbb{I}\{z\in\mathbf{h}\}$. Then $|R_{gen}(\mathcal{L})| = \frac{k}{m}$. By Theorem 2, we have $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) \ge |R_{gen}(\mathcal{L})| = k/m$. Both bounds imply the statement of the lemma.

Now, we prove Proposition 4. Consider the setting where Z = [0, 1] and suppose that the observations **z** ∈ Z have a continuous marginal distribution. Because *t* is a rational number, let the sample size *m* be chosen such that *k* = *t m* is an integer.

Let $\mathbf{s} = \{\mathbf{z}_1,\dots,\mathbf{z}_m\}$ be the training set, and let the hypothesis $\mathbf{h}$ be given by $\mathbf{h} = \{\mathbf{z}_1,\dots,\mathbf{z}_k\}$ with some probability $\delta > 0$ and $\mathbf{h} = \{\}$ otherwise. Here, the $k$ instances $\mathbf{z}_i\in\mathbf{h}$ are picked uniformly at random without replacement from the sample $\mathbf{s}$. To determine the variational information between $\hat{\mathbf{z}}$ and $\mathbf{h}$, we consider the two cases:

- With probability $\delta$, the hypothesis $\mathbf{h}$ is a random $k$-subset of $\mathbf{s}$, and by Lemma A1 the expected total variation distance between $p(\hat{\mathbf{z}})$ and $p(\hat{\mathbf{z}}\mid\mathbf{h})$ in this case equals $k/m = t$.
- With probability $1-\delta$, we have $\mathbf{h} = \{\}$, so $p(\hat{\mathbf{z}}\mid\mathbf{h}) = p(\hat{\mathbf{z}})$ and the total variation distance is zero.
So, by combining the two cases above, we deduce that:

$$\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) = \mathbb{E}_{\mathbf{h}}\,||\, p(\hat{\mathbf{z}}),\ p(\hat{\mathbf{z}}\mid\mathbf{h})\,||_{\mathcal{T}} = t\,\delta.$$

Therefore, $\mathcal{L}$ generalizes uniformly with the rate $t\delta$. Next, let the parametric loss be given by $l(z;\mathbf{h}) = \mathbb{I}\{z\in\mathbf{h}\}$. With this loss:

$$p\Big\{\big|\hat{\mathcal{R}}_{\mathbf{s}}(\mathbf{h}) - \mathcal{R}(\mathbf{h})\big| = t\Big\} = \delta = \frac{\mathcal{J}(\hat{\mathbf{z}};\mathbf{h})}{t},$$

which is the statement of the proposition.

#### **Appendix G. Proof of Theorem 5**

Because $\mathcal{Z}$ is countable, we will assume without loss of generality that $\mathcal{Z} = \{1, 2, 3, \dots\}$, and we will write $p_z = p(\hat{\mathbf{z}} = z)$ to denote the marginal distribution of observations. Since all lazy learners are equivalent, we will look into the lazy learner whose hypothesis $\mathbf{h}$ is equal to the training sample $\mathbf{s}$ itself up to a permutation. Let $m_z$ denote the number of times $z\in\mathcal{Z}$ was observed in the training sample. Note that $p(\hat{\mathbf{z}} = z\mid\mathbf{h}) = p_{\mathbf{s}}(z)$, and so $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) = \mathbb{E}_{\mathbf{s}}\,||\, p(z),\ p_{\mathbf{s}}(z)\,||_{\mathcal{T}}$.

We have:

$$p(\mathbf{h}) = p(\mathbf{s}) = \binom{m}{m_1, m_2, \dots}\, p_1^{m_1}\, p_2^{m_2}\cdots$$

Using the relation $||p,\ q||_{\mathcal{T}} = \frac{1}{2}\,||p - q||_1$ for any two probability distributions $p$ and $q$, we obtain:

$$\mathbb{E}_{\mathbf{h}}\,||\, p(\hat{\mathbf{z}}) - p(\hat{\mathbf{z}}\mid\mathbf{h})\,||_1 = \sum_{k\ge 1}\ \sum_{m_1+m_2+\cdots=m}\binom{m}{m_1,m_2,\dots}\, p_1^{m_1} p_2^{m_2}\cdots\,\Big|\frac{m_k}{m} - p_k\Big|$$

For the inner summation, we write:

$$\begin{aligned} &\sum_{m_1+m_2+\cdots=m}\binom{m}{m_1,m_2,\dots}\, p_1^{m_1} p_2^{m_2}\cdots\,\Big|\frac{m_k}{m} - p_k\Big| \\ &= \sum_{s=0}^{m}\binom{m}{s}\, p_k^{s}\,\Big|\frac{s}{m} - p_k\Big|\sum_{m_1+\cdots+m_{k-1}+m_{k+1}+\cdots=m-s}\binom{m-s}{m_1,\dots,m_{k-1},m_{k+1},\dots}\, p_1^{m_1}\cdots p_{k-1}^{m_{k-1}}\, p_{k+1}^{m_{k+1}}\cdots \end{aligned}$$

Using the multinomial series, we simplify the right-hand side into:

$$\sum_{s=0}^{m}\binom{m}{s}\, p_k^{s}\,(1-p_k)^{m-s}\,\Big|\frac{s}{m} - p_k\Big|$$

Now, we use *De Moivre's formula* for the mean deviation of the binomial random variable (see the proof of Example 1). This gives us:

$$\begin{aligned} &\sum_{m_1+m_2+\cdots=m}\binom{m}{m_1,m_2,\dots}\, p_1^{m_1} p_2^{m_2}\cdots\,\Big|\frac{m_k}{m} - p_k\Big| \\ &= \sum_{s=0}^{m}\binom{m}{s}\, p_k^{s}\,(1-p_k)^{m-s}\,\Big|\frac{s}{m} - p_k\Big| \\ &= \frac{2}{m}\,(1-p_k)^{(1-p_k)m}\, p_k^{1+m p_k}\,\frac{m!}{(p_k m)!\,\big((1-p_k)m - 1\big)!} \end{aligned}$$

Using *Stirling's approximation* to the factorial [17], we obtain the simple asymptotic expression:

$$\sum_{m_1+m_2+\cdots=m}\binom{m}{m_1,m_2,\dots}\, p_1^{m_1} p_2^{m_2}\cdots\,\Big|\frac{m_k}{m} - p_k\Big| \sim \sqrt{\frac{2 p_k(1-p_k)}{\pi m}}$$

Plugging this into the earlier expression for J (**z**ˆ; **h**) yields:

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) &\sim \frac{1}{2}\sum_{k=1,2,3,\dots}\sqrt{\frac{2 p_k(1-p_k)}{\pi m}} \\ &= \sqrt{\frac{\mathrm{Ess}[\mathcal{Z};\, p(z)] - 1}{2\pi m}} \end{aligned}$$

Due to the tightness of the Stirling approximation, the asymptotic expression for the variational information is tight. Because $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) = \mathbb{E}_{\mathbf{s}}\,||\, p(z),\ p_{\mathbf{s}}(z)\,||_{\mathcal{T}}$, we deduce that:

$$\mathbb{E}_{\mathbf{s}}\,||\, p(z),\ p_{\mathbf{s}}(z)\,||_{\mathcal{T}} \sim \sqrt{\frac{\mathrm{Ess}[\mathcal{Z};\, p(z)] - 1}{2\pi m}},$$

which provides the asymptotic rate of convergence of an empirical probability mass function to the true distribution.
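The asymptotic rate just derived can be checked by Monte Carlo simulation. The following is a sketch; the uniform distribution on four symbols and the trial counts are our choices:

```python
import numpy as np

rng = np.random.default_rng(3)

p = np.full(4, 0.25)       # true distribution on Z = {1, 2, 3, 4}
m = 400                    # sample size
trials = 2000

# Monte Carlo estimate of E_s || p(z), p_s(z) ||_T.
total = 0.0
for _ in range(trials):
    counts = rng.multinomial(m, p)
    total += 0.5 * np.abs(counts / m - p).sum()
estimate = total / trials

# Predicted rate sqrt( (Ess[Z; p] - 1) / (2 pi m) ), with the closed form
# Ess = 1 + (sum_k sqrt(p_k (1 - p_k)))^2 implied by the derivation above.
ess = 1.0 + np.sqrt(p * (1.0 - p)).sum() ** 2
predicted = np.sqrt((ess - 1.0) / (2.0 * np.pi * m))

print(estimate, predicted)   # both close to 0.0345 for these settings
```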

#### **Appendix H. Proof of Proposition 5**

First, we note that for any two adjacent samples $\mathbf{s}$ and $\mathbf{s}'$ and any $\mathcal{O}\subseteq\mathcal{H}$, we have in the differential privacy setting:

$$p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}) - p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}') \le (e^{\epsilon} - 1)\, p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}') + \delta$$

Similarly, we have:

$$\begin{split} p(\mathbf{h}\in\mathcal{O}|\mathbf{s}) - p(\mathbf{h}\in\mathcal{O}|\mathbf{s'}) &\geq (e^{-\epsilon} - 1) \ p(\mathbf{h}\in\mathcal{O}|\mathbf{s'}) - e^{-\epsilon}\delta \\ &= -\left[ (1 - e^{-\epsilon}) \ p(\mathbf{h}\in\mathcal{O}|\mathbf{s'}) + e^{-\epsilon}\delta \right] \\ &\geq -e^{\epsilon} \left[ (1 - e^{-\epsilon}) \ p(\mathbf{h}\in\mathcal{O}|\mathbf{s'}) + e^{-\epsilon}\delta \right] \\ &= -\left[ (e^{\epsilon} - 1)p(\mathbf{h}\in\mathcal{O}|\mathbf{s'}) + \delta \right] \end{split}$$

Both results imply that:

$$\begin{aligned} \big| p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}) - p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}')\big| &\le (e^{\epsilon} - 1)\, p(\mathbf{h}\in\mathcal{O}\mid\mathbf{s}') + \delta \\ &\le e^{\epsilon} - 1 + \delta \end{aligned} \tag{A11}$$

We write:

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) &= \mathbb{E}_{\hat{\mathbf{z}}}\,||\, p(\mathbf{h}\mid\hat{\mathbf{z}}),\ p(\mathbf{h})\,||_{\mathcal{T}} \\ &= \frac{1}{2}\,\mathbb{E}_{\hat{\mathbf{z}}}\,||\,\mathbb{E}_{\hat{\mathbf{z}}'}\big[p(\mathbf{h}\mid\hat{\mathbf{z}}) - p(\mathbf{h}\mid\hat{\mathbf{z}}')\big]\,||_1 \\ &\le \frac{1}{2}\,\mathbb{E}_{\hat{\mathbf{z}},\hat{\mathbf{z}}'}\,||\, p(\mathbf{h}\mid\hat{\mathbf{z}}) - p(\mathbf{h}\mid\hat{\mathbf{z}}')\,||_1 \end{aligned}$$

The last inequality follows by convexity. Next, let $\mathbf{s}_{m-1}$ be a sample that contains $m-1$ observations drawn i.i.d. from $p(z)$. Then:

$$\begin{aligned} \mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) &\le \frac{1}{2}\,\mathbb{E}_{\hat{\mathbf{z}},\hat{\mathbf{z}}'}\,||\,\mathbb{E}_{\mathbf{s}_{m-1}}\big[p(\mathbf{h}\mid\hat{\mathbf{z}},\mathbf{s}_{m-1}) - p(\mathbf{h}\mid\hat{\mathbf{z}}',\mathbf{s}_{m-1})\big]\,||_1 \\ &\le \frac{1}{2}\,\mathbb{E}_{\mathbf{s},\mathbf{s}'}\,||\, p(\mathbf{h}\mid\mathbf{s}) - p(\mathbf{h}\mid\mathbf{s}')\,||_1, \end{aligned}$$

where $\mathbf{s}$, $\mathbf{s}'$ are two adjacent samples. Finally, we use Equation (A11) to arrive at the statement of the proposition.
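A pure $\epsilon$-differentially-private mechanism on a single bit (randomized response; our toy example, with $\delta = 0$ and $m = 1$) illustrates the resulting relation $\mathcal{J}(\hat{\mathbf{z}};\mathbf{h}) \le e^{\epsilon} - 1 + \delta$ from Equation (A11):

```python
import numpy as np

gamma = 0.25                          # flip probability of randomized response
eps = np.log((1 - gamma) / gamma)     # privacy parameter: log(3) here, delta = 0

pz = np.array([0.5, 0.5])             # marginal distribution of the single bit z
# Conditional p(h | z): keep z with probability 1 - gamma, flip it otherwise.
ph_given_z = np.array([[1 - gamma, gamma],
                       [gamma, 1 - gamma]])
ph = pz @ ph_given_z                  # marginal p(h)

# J(z^; h) = E_z || p(h | z), p(h) ||_T for a sample of size m = 1.
J = sum(pz[z] * 0.5 * np.abs(ph_given_z[z] - ph).sum() for z in range(2))

print(J, np.exp(eps) - 1)             # 0.25 versus 2.0: the bound holds
```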

#### **Appendix I. Proof of Theorem 8**

The proof is similar to the classical VC argument. Given a fixed hypothesis space $\mathcal{H}$, a fixed domain $\mathcal{Z}$, and a 0–1 loss function $l:\mathcal{H}\times\mathcal{Z}\to\{0,1\}$, let $\mathbf{s} = \{\mathbf{z}_1,\dots,\mathbf{z}_m\}$ be a training sample that comprises $m$ i.i.d. observations. Define the *restriction* of $\mathcal{H}$ to $\mathbf{s}$ by:

$$\mathcal{F}_{\mathbf{s}} = \Big\{\big(l(\mathbf{z}_1, h),\dots,l(\mathbf{z}_m, h)\big)\ :\ h\in\mathcal{H}\Big\}.$$

In other words, F**<sup>s</sup>** is the set of all possible realizations of the 0–1 loss for the elements in **s** by hypotheses in H. We can introduce an *equivalence relation* between the elements of H w.r.t. the sample **s**. Specifically, we say that for *h* , *h* ∈ H, we have *h* ≡**<sup>s</sup>** *h* if and only if:

$$\left(l(\mathbf{z}\_1, h'), \dots, l(\mathbf{z}\_m, h')\right) = \left(l(\mathbf{z}\_1, h''), \dots, l(\mathbf{z}\_m, h'')\right).$$

It is trivial to see that this defines an equivalence relation; i.e., it is reflexive, symmetric, and transitive. Let the set of equivalence classes w.r.t. **s** be denoted H**s**. Note that we have a one-to-one correspondence between the members of F**<sup>s</sup>** and the members of H**s**. Moreover, H**<sup>s</sup>** is a *partitioning* of H.

We use the standard twin-sample trick, where $\mathbf{s}_2 = \mathbf{s}\cup\mathbf{s}'\in\mathcal{Z}^{2m}$ and $\mathcal{L}$ learns based on $\mathbf{s}$ only. For any fixed $h\in\mathcal{H}$, let $f:\mathcal{H}\times\mathcal{Z}\to[0,1]$ be an arbitrary loss function, which can be different from the loss $l$ that is optimized during training. A Hoeffding bound for sampling without replacement [41] states that:

$$p\left\{\left|\mathbb{E}\_{\mathbf{z}\sim\mathsf{s}}[f(\mathbf{z},\mathsf{h})]-\mathbb{E}\_{\mathbf{z}\sim\mathsf{s}\_{2}}[f(\mathbf{z},\mathsf{h})]\right|\geq\epsilon\right\}\leq 2\exp\{-2\epsilon^{2}m\}\tag{A12}$$

Hence:

$$\begin{aligned} &p\Big\{\big|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}[f(\mathbf{z},h)] - \mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}[f(\mathbf{z},h)]\big| \ge \epsilon\Big\} \\ &\le p\Big\{\big|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}[f(\mathbf{z},h)] - \mathbb{E}_{\mathbf{z}\sim\mathbf{s}_2}[f(\mathbf{z},h)]\big| \ge \frac{\epsilon}{2}\Big\} + p\Big\{\big|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}[f(\mathbf{z},h)] - \mathbb{E}_{\mathbf{z}\sim\mathbf{s}_2}[f(\mathbf{z},h)]\big| \ge \frac{\epsilon}{2}\Big\} \\ &\le 4\exp\{-\epsilon^2 m/2\} \end{aligned}$$

This happens for a hypothesis *h* ∈ H that is fixed independently of the random split of **s**<sup>2</sup> into training and ghost samples. When *h* is selected according to the random split of **s**2, then we need to employ the union bound.

For any subset $H\subseteq\mathcal{H}$, let $\min(H)$ be the least element in $H$ according to the assumed well-ordering of $\mathcal{H}$. Let $\mathcal{H}_{\mathbf{s}}$ be as defined previously and write $H_{min}(\mathbf{s}) = \{\min(H_k)\ :\ H_k\in\mathcal{H}_{\mathbf{s}}\}$. Then, it is easy to observe that the ERM learning rule of Theorem 2 must select one of the hypotheses in $H_{min}(\mathbf{s}_2)$ regardless of the split $\mathbf{s}_2 = \mathbf{s}\cup\mathbf{s}'$. This holds because $\mathcal{H}_{\mathbf{s}_2}$ is a *finer* partitioning of $\mathcal{H}$ than $\mathcal{H}_{\mathbf{s}}$: every member of $\mathcal{H}_{\mathbf{s}}$ is a union of some finite number of members of $\mathcal{H}_{\mathbf{s}_2}$. By the well-ordering property, the "least" element among the empirical risk minimizers must be in $H_{min}(\mathbf{s}_2)$.

Hence, there are at most $\tau_{\mathcal{H}}(2m)$ possible hypotheses given $\mathbf{s}_2$, where $\tau_{\mathcal{H}}(m)$ is the growth function (sometimes referred to as the shattering coefficient), and those hypotheses can be fixed independently of the random splitting of $\mathbf{s}_2$ into a training sample $\mathbf{s}$ and a ghost sample $\mathbf{s}'$.

Consequently, we have by the union bound:

$$\begin{aligned} &p\Big\{\sup_{h\in H_{min}(\mathbf{s}\cup\mathbf{s}')}\big|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}[f(\mathbf{z},h)] - \mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}[f(\mathbf{z},h)]\big| \ge \epsilon\Big\} \\ &\le 4\,\tau_{\mathcal{H}}(2m)\exp\Big\{-\frac{\epsilon^2 m}{2}\Big\} \le 4\Big(\frac{2em}{d}\Big)^{d}\exp\Big\{-\frac{\epsilon^2 m}{2}\Big\}, \end{aligned}$$

where *d* is the VC dimension of H. Finally, to bound the generalization risk in expectation, we use Lemma A.4 in [13], which implies that if *m* ≥ *d*:

$$\begin{aligned} &\mathbb{E}_{\mathbf{s},\mathbf{s}'}\Big[\sup_{h\in H_{min}(\mathbf{s}\cup\mathbf{s}')}\big|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}[f(\mathbf{z},h)] - \mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}[f(\mathbf{z},h)]\big|\Big] \\ &\le \sqrt{\frac{2}{m}}\left(2 + \sqrt{\log 2 + d\log\frac{2em}{d}}\right) \\ &\le \sqrt{\frac{2}{m}}\left(2 + \sqrt{1 + d\log\frac{2em}{d}}\right) \le \frac{3 + \sqrt{1 + d\log\frac{2em}{d}}}{\sqrt{m}} \end{aligned}$$

Writing **ĥ** for the *least* empirical risk minimizer w.r.t. the training sample **s**:

$$\begin{split}R_{\mathrm{gen}}(\mathcal{L})&=\mathbb{E}_{\mathbf{s}}\left[\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]-\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]\right]\\&\le\mathbb{E}_{\mathbf{s}}\left|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]-\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]\right|\\&=\mathbb{E}_{\mathbf{s}}\left|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]-\mathbb{E}_{\mathbf{s}'}\,\mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]\right|\\&\le\mathbb{E}_{\mathbf{s},\mathbf{s}'}\left|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]-\mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}\left[f(\mathbf{z},\hat{\mathbf{h}})\right]\right|\\&\le\mathbb{E}_{\mathbf{s},\mathbf{s}'}\sup_{h\in H_{\min}(\mathbf{s}\cup\mathbf{s}')}\left|\mathbb{E}_{\mathbf{z}\sim\mathbf{s}}\left[f(\mathbf{z},h)\right]-\mathbb{E}_{\mathbf{z}\sim\mathbf{s}'}\left[f(\mathbf{z},h)\right]\right|\\&\le\frac{3+\sqrt{1+d\log\frac{2cm}{d}}}{\sqrt{m}}\end{split}$$

Because this bound in expectation holds for any single loss *f* : *H* ×Z→ [0, 1], it holds for the following loss function:

$$l^\star(z,h) = \mathbb{I}\{p(z \in \mathbf{s}|h) > p(z \in \mathbf{s})\},$$

which is a deterministic 0–1 loss function of *h* that assigns to *z* ∈ Z the value 1 if and only if our knowledge of *h* increases the probability that *z* belongs to the training sample. However, the generalization risk in expectation for the loss *l*<sup>⋆</sup> is equal to the variational information J(**ĥ**; **ẑ**), as shown in the proof of Theorem 2. Hence, we have the bound stated in the theorem:

$$\mathcal{J}(\hat{\mathbf{h}}; \hat{\mathbf{z}}) \le \frac{3 + \sqrt{1 + d \log \frac{2cm}{d}}}{\sqrt{m}}.$$

Because this is a distribution-free bound, we have:

$$\mathcal{C}(\mathcal{L}) \le \frac{3 + \sqrt{1 + d \log \frac{2cm}{d}}}{\sqrt{m}}$$

#### **Appendix J. Proof of Theorem 9**

Let *X* = {*x*<sub>1</sub>, ... , *x<sub>d</sub>*} be a set of *d* points that is shattered by hypotheses in H. By definition, this implies that for any possible 0–1 labeling *I* ∈ {0, 1}<sup>*d*</sup>, there exists a hypothesis *h<sub>I</sub>* ∈ H such that (*h<sub>I</sub>*(*x*<sub>1</sub>), ... , *h<sub>I</sub>*(*x<sub>d</sub>*)) = *I*.
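The shattering property invoked here can be made concrete on a toy class. The following sketch (ours, purely illustrative, not part of the proof) verifies by grid search that two-dimensional halfspace classifiers sign(*w*·*x* + *b*) realize all 2<sup>3</sup> labelings of three affinely independent points, i.e., shatter them; the helper `shatters` and the point sets are our own assumptions:

```python
import itertools
import numpy as np

def shatters(points, labelings, angles=64, offsets=41):
    """Check by grid search that halfspaces sign(w.x + b) realize every labeling."""
    pts = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, 2 * np.pi, angles, endpoint=False)
    ws = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # candidate unit normals
    bs = np.linspace(-2.0, 2.0, offsets)                      # candidate offsets
    realized = set()
    for w in ws:
        proj = pts @ w
        for b in bs:
            realized.add(tuple(1 if v + b >= 0 else -1 for v in proj))
    return all(lab in realized for lab in labelings)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
all_labelings = list(itertools.product([-1, 1], repeat=3))
print(shatters(points, all_labelings))  # every labeling of the 3 points is realized
```

Conversely, no halfspace can assign −1 to all three points above and +1 to the interior point (0.25, 0.25), since an affine function's value at an interior point is a convex combination of its values at the hull points; the grid search accordingly fails on that labeling.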

Given an ERM learning rule L whose hypothesis is denoted **ĥ**<sub>**s**</sub>, let *p*(*x*) be the uniform distribution of instances over *X* and define:

$$y(\mathbf{x}) = \arg\min_{y \in \{+1, -1\}} p_{\mathbf{s}}\left\{ \hat{\mathbf{h}}_{\mathbf{s}}(\mathbf{x}) = y \mid \mathbf{x} \notin \mathbf{s} \right\}.$$

In other words, *y*(**x**) is the least probable class that is assigned by L to the instance **x** when **x** is unseen in the training sample. Let *p*(*z*) with **z** = (**x**, **y**) denote the uniform distribution of instances over *X* with **y** given by the labeling rule above.

By drawing a training sample **s** ∈ Z<sup>*m*</sup> of *m* i.i.d. observations from *p*(*z*), our first task is to bound the expected number of *distinct* values in *X* that are not observed in the training sample. Let:

$$E\_i = \mathbb{I}\{\mathbf{x}\_i \notin \mathbf{s}\}$$

Then, the expected number of *distinct* values in *X* that are not observed in the training sample **s** is:

$$\sum\_{i=1}^d \mathbb{E}[E\_i] = \sum\_{i=1}^d \left(1 - \frac{1}{d}\right)^m = d\left(1 - \frac{1}{d}\right)^m$$

Here, we used the linearity of expectation, which holds even when the random variables are not independent. This shows that the expected *fraction* of instances in *X* that are not seen in the sample **s** is (1 − 1/*d*)<sup>*m*</sup>.
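The closed-form expectation above is easy to confirm by simulation; a small self-contained sketch of ours (illustration only, not part of the proof), drawing *m* uniform indices from {1, ... , *d*} and counting the values never observed:

```python
import random

def expected_unseen(d, m):
    """Closed-form expected number of distinct values of {1,...,d} absent
    from m i.i.d. uniform draws: d * (1 - 1/d)**m."""
    return d * (1.0 - 1.0 / d) ** m

def simulated_unseen(d, m, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = {rng.randrange(d) for _ in range(m)}
        total += d - len(seen)   # distinct values never drawn in this trial
    return total / trials

d, m = 10, 15
print(expected_unseen(d, m), simulated_unseen(d, m))  # the two values agree closely
```

Dividing both quantities by *d* recovers the expected unseen *fraction* (1 − 1/*d*)<sup>*m*</sup> used in the lower-bound construction.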

Next, given an ERM learning rule that outputs an empirical risk minimizer, the training error of this learning algorithm is zero because *X* is shattered by H. However, for any learning rule L, the expected error rate on the unseen examples is at least 1/2 by construction. Therefore, there exists a distribution *p*(*z*) for which the generalization risk is at least (1/2)(1 − 1/*d*)<sup>*m*</sup>.

By Theorem 2, the learning capacity is an upper bound on the maximum generalization risk across all distributions of observations and all parametric loss functions. Consequently:

$$\mathcal{C}(\mathcal{L}) \ge \frac{1}{2} \left( 1 - \frac{1}{d} \right)^{m},$$

which is the statement of the theorem.

#### **References**




© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Prediction and Variable Selection in High-Dimensional Misspecified Binary Classification**

**Konrad Furmańczyk 1,\*,† and Wojciech Rejchel 2,†**


Received: 20 April 2020; Accepted: 11 May 2020; Published: 13 May 2020

**Abstract:** In this paper, we consider prediction and variable selection in misspecified binary classification models under the high-dimensional scenario. We focus on two approaches to classification which are computationally efficient but lead to model misspecification. The first one is to apply penalized logistic regression to classification data which possibly do not follow the logistic model. The second method is even more radical: we simply treat the class labels of objects as if they were numbers and apply penalized linear regression. In this paper, we investigate these two approaches thoroughly and provide conditions which guarantee that they are successful in prediction and variable selection. Our results hold even if the number of predictors is much larger than the sample size. The paper is complemented by experimental results.

**Keywords:** misclassification risk; model misspecification; penalized estimation; supervised classification; variable selection consistency

#### **1. Introduction**

Large-scale data sets, where the number of predictors significantly exceeds the number of observations, have become common in many practical problems from, among others, biology or genetics. Currently, the analysis of such data sets is a fundamental challenge in statistics and machine learning. High-dimensional prediction and variable selection are arguably the most popular and intensively studied topics in this field. There are many methods trying to solve these problems, such as those based on penalized estimation [1,2]. The main representative of them is Lasso [3], which is based on *l*<sub>1</sub>-norm penalization. Its properties in model selection, estimation and prediction have been deeply investigated, among others, in [2,4–10]. The results obtained in the above papers can be applied only if some specific assumptions are satisfied. For instance, these conditions concern the relation between the response variable and predictors. However, it is quite common that a complex data set does not satisfy these model assumptions, or they are difficult to verify, which means that the considered model is specified incorrectly. The model misspecification problem is the core of the current paper. We investigate this topic in the context of high-dimensional binary classification (binary regression).

In the classification problem we are to predict or to guess the class label of an object on the basis of its observed predictors. The object is described by the random vector (*X*, *Y*), where *X* ∈ R<sup>*p*</sup> is a vector of predictors and *Y* ∈ {−1, 1} is the class label of the object. A classifier is defined as a measurable function *f* : R<sup>*p*</sup> → R, which determines the label of an object in the following way:

if *f*(*x*) ≥ 0, then we predict that *y* = 1; otherwise, we guess that *y* = −1.

The most natural approach is to look for a classifier *f* , which minimizes the misclassification risk (probability of incorrect classification)

$$R\left(f\right) = P(Y=1, f(X) < 0) + P(Y=-1, f(X) \ge 0). \tag{1}$$

Let *η*(*x*) = *P*(*Y* = 1|*X* = *x*). It is clear that *fB*(*x*) = sign(2*η*(*x*) − 1) minimizes the risk (1) in the family of all classifiers. It is called the Bayes classifier and we denote its risk as *RB* = *R* (*fB*). Obviously, in practice we do not know the function *η*, so we cannot find the Bayes classifier. However, if we possess a training sample (*X*1,*Y*1), ... ,(*Xn*,*Yn*) containing independent copies of (*X*,*Y*), then we can consider a sample analog of (1), namely the empirical misclassification risk

$$\frac{1}{n}\sum\_{i=1}^{n}\left[\mathbb{I}(Y\_i = 1, f(X\_i) < 0) + \mathbb{I}(Y\_i = -1, f(X\_i) \ge 0)\right],\tag{2}$$

where I is the indicator function. Then a minimizer of (2) could be used as our estimator.

The main difficulty in this approach lies in the discontinuity of the function (2), which makes finding its minimizer computationally hard and ineffective. To overcome this problem, one usually replaces the discontinuous loss function by a convex surrogate *φ* : R → [0, ∞], for instance the logistic loss, the hinge loss or the exponential loss. Then we obtain the convex empirical risk

$$\bar{Q}(f) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i)). \tag{3}$$

In the high-dimensional case one usually obtains an estimator by minimizing the penalized version of (3). These techniques have been successfully used in classification theory and have led to boosting algorithms [11], support vector machines [12] and Lasso estimators [3]. In this paper we are mainly interested in Lasso estimators, because they are able to solve both the variable selection and prediction problems simultaneously, while the first two algorithms were developed mainly for prediction.
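To make the replacement of (2) by (3) concrete, here is a minimal numpy sketch of ours (not from the paper), computing the empirical misclassification risk and the convex empirical risk for the logistic, hinge and quadratic losses as the surrogates *φ*:

```python
import numpy as np

def zero_one_risk(y, f):
    """Empirical misclassification risk (2): y in {-1,+1}, f real-valued scores."""
    pred = np.where(f >= 0, 1, -1)
    return np.mean(pred != y)

def surrogate_risk(y, f, phi):
    """Convex empirical risk (3): average of phi(y_i * f(x_i))."""
    return np.mean(phi(y * f))

logistic  = lambda t: np.log1p(np.exp(-t))    # logistic loss
hinge     = lambda t: np.maximum(0.0, 1 - t)  # hinge loss
quadratic = lambda t: (1 - t) ** 2            # quadratic loss

# illustrative labels and classifier scores; signs disagree at indices 2 and 4
y = np.array([1, -1, 1, 1, -1])
f = np.array([0.8, -1.2, -0.3, 2.0, 0.1])
print(zero_one_risk(y, f))                    # → 0.4
for phi in (logistic, hinge, quadratic):
    print(surrogate_risk(y, f, phi))          # smooth/convex proxies of the 0-1 risk
```

Unlike (2), each surrogate risk is a convex (here even differentiable or piecewise-linear) function of the scores, which is what makes penalized minimization tractable.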

Thus, we consider linear classifiers

$$f_b(x) = b_0 + \sum_{j=1}^{p} b_j x_j, \tag{4}$$

where *b* = (*b*<sub>0</sub>, *b*<sub>1</sub>, ... , *b<sub>p</sub>*) ∈ R<sup>*p*+1</sup>. For a fixed loss function *φ* we define the Lasso estimator as

$$\hat{b} = \arg\min_{b \in \mathbb{R}^{p+1}} \quad \bar{Q}(f_b) + \lambda \sum_{j=1}^{p} |b_j|, \tag{5}$$

where *λ* is a positive tuning parameter, which provides a balance between minimizing the empirical risk and the penalty. The form of the penalty is crucial, because its singularity at the origin implies that some coordinates of the minimizer *b̂* are exactly equal to zero if *λ* is sufficiently large. Thus, calculating (5) we simultaneously select significant predictors in the model and estimate their coefficients, so we are also able to predict the class of new objects. The function *Q*¯(*f<sub>b</sub>*) and the penalty are convex, so (5) is a convex minimization problem, which is an important fact from both practical and theoretical points of view. Notice that the intercept *b*<sub>0</sub> is not penalized in (5).
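For the quadratic loss, (5) is an ordinary Lasso problem on the ±1 labels, since (1 − *yf*)<sup>2</sup> = (*y* − *f*)<sup>2</sup> for *y* ∈ {−1, 1}, so it can be solved by proximal gradient descent (ISTA) with soft-thresholding. The sketch below is our own minimal illustration under that observation, not the algorithm of [27,28]; it keeps the intercept *b*<sub>0</sub> unpenalized, as in (5), and the synthetic data are an assumption for the demo:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t*|.|_1: shrink each coordinate toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_quadratic(X, y, lam, n_iter=2000):
    """ISTA for (1/n) * sum (y_i - b0 - x_i.b)^2 + lam * |b|_1 (intercept unpenalized).
    For y in {-1,+1} this coincides with the quadratic-loss Lasso classifier (5)."""
    n, p = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])           # prepend intercept column
    b = np.zeros(p + 1)
    step = 1.0 / (2 * np.linalg.eigvalsh(Xt.T @ Xt / n).max())  # 1/Lipschitz const.
    for _ in range(n_iter):
        grad = 2.0 / n * Xt.T @ (Xt @ b - y)       # gradient of the quadratic risk
        b = b - step * grad
        b[1:] = soft_threshold(b[1:], step * lam)  # penalize everything but b0
    return b

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.5] + [0.0] * (p - 2))     # only first two predictors matter
y = np.where(X @ beta + 0.3 * rng.standard_normal(n) >= 0, 1.0, -1.0)
b = lasso_quadratic(X, y, lam=0.3)
print(np.nonzero(np.abs(b[1:]) > 1e-8)[0])         # support of the selected model
```

On such data the recovered support typically contains the two truly relevant predictors while most irrelevant coefficients are set exactly to zero, illustrating the simultaneous selection-and-estimation property discussed above.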

The random vector (5) is an estimator of

$$b_* = \arg\min_{b \in \mathbb{R}^{p+1}} \quad Q(f_b), \tag{6}$$

where *Q*(*f<sub>b</sub>*) = *Eφ*(*Y f<sub>b</sub>*(*X*)). In this paper we are mainly interested in minimizers (6) corresponding to the quadratic and logistic loss functions. The latter has a nice information-theoretic interpretation: it can be viewed as the Kullback–Leibler projection of the unknown *η* on logistic models [13]. The Kullback–Leibler divergence [14] plays an important role in information theory and statistics; for instance, it is involved in information criteria for model selection [15] and in detecting influential observations [16].

In general, the classifier corresponding to (6) need not coincide with the Bayes classifier. Obviously, we want to have a "good" estimator, which means that its misclassification risk should be as close to the risk of the Bayes classifier as possible. In other words, its excess risk

$$\mathcal{E}\left(\hat{b}, f_B\right) = E_D R(\hat{b}) - R_B \tag{7}$$

should be small, where *E<sub>D</sub>* is the expectation with respect to the data *D* = {(*X*<sub>1</sub>, *Y*<sub>1</sub>), ... , (*X<sub>n</sub>*, *Y<sub>n</sub>*)} and we write simply *R*(*b*) instead of *R*(*f<sub>b</sub>*). Our goal is to study the excess risk (7) for the estimator (5) with different loss functions *φ*. We do this by deriving upper bounds on (7).

In the excess risk (7) we compare two misclassification risks defined in (1). In the literature one can also find a different approach, which replaces the misclassification risks *R*(·) in (7) by the convex risks *Q*(·). In that case the excess risk depends on the loss function *φ*. To deal with this fact one uses the results from [17,18], which relate the excess risk (7) to its analog based on the convex risk *Q*(·). In this paper we do not follow this route and work, right from the beginning, with an excess risk independent of *φ*. Only the estimator (5) depends on the loss *φ*.

In this paper we are also interested in variable selection. We investigate this problem in the following semiparametric model

$$\eta(x) = g\Big(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\Big), \tag{8}$$

where *η*(*x*) = *P*(*Y* = 1|*X* = *x*), *β* ∈ R<sup>*p*+1</sup> is the true parameter and *g* is an unknown function. Thus, we suppose that predictors influence the class probability through the function *g* of the linear combination *β*<sub>0</sub> + ∑<sup>*p*</sup><sub>*j*=1</sub> *β<sub>j</sub>x<sub>j</sub>*. The goal of variable selection is the identification of the set of significant predictors

$$\mathbb{T} = \{1 \le j \le p : \beta\_j \ne 0\}.\tag{9}$$

Obviously, in the model (8) we cannot estimate the intercept *β*<sub>0</sub> and we can identify the vector (*β*<sub>1</sub>, ... , *β<sub>p</sub>*) only up to a multiplicative constant, because any shift or scale change in *β*<sub>0</sub> + ∑<sup>*p*</sup><sub>*j*=1</sub> *β<sub>j</sub>X<sub>j</sub>* can be absorbed by *g*. However, we show in Section 5 that in many situations the Lasso estimator (5) can properly identify the set (9).

The literature on the classification problem is comprehensive. We just mention a few references: [12,19–21]. The predictive quality of classifiers is often investigated by obtaining upper bounds for their excess risks. It is an important problem and was studied thoroughly, among others, in [17,18,22–24]. The variable selection and predictive properties of estimators in the high-dimensional scenario were studied, for instance, in [2,10,13,25,26]. In the current paper we investigate the behaviour of classifiers in possibly misspecified high-dimensional classification, which appears frequently in practice. For instance, while working with binary regression one often assumes incorrectly that the data follow the logistic regression model. Then the problem is solved using the Lasso penalized maximum likelihood method. Another approach to binary regression, which is widely used due to its computational simplicity, is just treating the labels *Y<sub>i</sub>* as if they were numbers and applying standard Lasso. For instance, such a method is used in ([1], Subsections 4.2 and 4.3) or ([2], Subsection 2.4.1). These two approaches to classification sometimes give unexpectedly good results in variable selection and prediction, but the reason for this phenomenon has not been deeply studied in the literature. Among the above-mentioned papers only [2,13,25] take up this issue. However, [25] focuses mainly on the predictive properties of Lasso classifiers with the hinge loss. Bühlmann and van de Geer [2] and Kubkowski and Mielniczuk [13] study general Lipschitz loss functions. The latter paper considers only the variable selection problem. In [2] one also investigates prediction, but they do not study classification with the quadratic loss.

In this paper we are interested in both variable selection and predictive properties of classifiers with convex (but not necessarily Lipschitz) loss functions. The prominent example is classification with the quadratic loss function, which has not been investigated so far in the context of the high-dimensional misspecified model. In this case the estimator (5) can be calculated efficiently using existing algorithms, for instance [27] or [28], even if the number of predictors is much larger than the sample size. This makes the estimator very attractive when working with large data sets. In [28] an efficient algorithm is also provided for Lasso estimators with the logistic loss in the high-dimensional scenario. Therefore, misspecified classification with the logistic loss plays an important role in this paper as well. Our goal is to study such estimators thoroughly and provide conditions which guarantee that they are successful in prediction and variable selection.

The paper is organized as follows: in the next section we provide basic notations and assumptions, which are used in this paper. In Section 3 we study predictive properties of Lasso estimators with different loss functions. We will see that these properties depend strongly on the estimation quality of estimators, which is studied in Section 4. In Section 5 we consider variable selection. In Section 6 we show numerical experiments, which describe the quality of estimators in practice. The proofs and auxiliary results are relegated to Appendix A.

#### **2. Assumptions and Notation**

In this paper we work in the high-dimensional scenario *p* >> *n*. As usual, we assume that the number of predictors *p* can vary with the sample size *n*, which could be denoted as *p*(*n*) = *p<sub>n</sub>*. However, to make notation simpler we omit the lower index and write *p* instead of *p<sub>n</sub>*. The same applies to the other objects appearing in this paper.

In the further sections we will need the following notation: the Kullback–Leibler (*KL*) distance between *π*<sub>1</sub>, *π*<sub>2</sub> ∈ (0, 1), defined as

$$KL(\pi_1, \pi_2) = \pi_1 \log \left(\frac{\pi_1}{\pi_2}\right) + (1 - \pi_1) \log \left(\frac{1 - \pi_1}{1 - \pi_2}\right). \tag{10}$$

Obviously, we have *KL*(*π*<sub>1</sub>, *π*<sub>2</sub>) ≥ 0 and *KL*(*π*<sub>1</sub>, *π*<sub>2</sub>) = 0 if and only if *π*<sub>1</sub> = *π*<sub>2</sub>. Moreover, the *KL* distance need not be symmetric;
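Both properties are easy to verify numerically with a direct implementation of (10); an illustrative sketch of ours:

```python
import math

def kl_bernoulli(p1, p2):
    """Kullback-Leibler distance (10) between success probabilities p1 and p2."""
    assert 0 < p2 < 1, "second argument must avoid {0,1} for finiteness"
    def term(a, b):
        # convention 0 * log(0/b) = 0
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p1, p2) + term(1.0 - p1, 1.0 - p2)

print(kl_bernoulli(0.5, 0.5))                          # → 0.0
print(kl_bernoulli(0.9, 0.5), kl_bernoulli(0.5, 0.9))  # asymmetry of KL
```

Swapping the arguments changes the value, confirming that *KL* is a divergence rather than a metric.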

and the support of the target parameter *b*<sub>∗</sub><sup>quad</sup>:

$$T = \{ 1 \le j \le p : (b_*^{quad})_j \ne 0 \}. \tag{11}$$

Notice that the intercept is not contained in (11) even if it is nonzero. We also specify assumptions, which are used in this paper.

**Assumption 1.** *We assume that* (*X*, *Y*), (*X*<sub>1</sub>, *Y*<sub>1</sub>), ... , (*X<sub>n</sub>*, *Y<sub>n</sub>*) *are i.i.d. random vectors. Moreover, predictors are univariate subgaussian, i.e., for each a* ∈ R *and j* ∈ {1, ... , *p*} *we have E* exp(*aX<sub>j</sub>*) ≤ exp(*σ*<sub>*j*</sub><sup>2</sup>*a*<sup>2</sup>/2) *for positive numbers σ<sub>j</sub>. We also denote σ* = max<sub>1≤*j*≤*p*</sub> *σ<sub>j</sub>. Finally, we suppose that the matrix H* = *E*[*XX*<sup>⊤</sup>] *is positive definite and H<sub>jj</sub>* = 1 *for j* = 1, ... , *p*.

In Sections 4 and 5 we need a stronger version of Assumption 1.

**Assumption 2.** *We suppose that the subvector of predictors X<sub>T</sub> is subgaussian with the coefficient σ*<sub>0</sub> > 0*, i.e., for each u* ∈ R<sup>|*T*|</sup> *we have E* exp(*u*<sup>⊤</sup>*X<sub>T</sub>*) ≤ exp(*σ*<sub>0</sub><sup>2</sup> *u*<sup>⊤</sup>*H<sub>T</sub>u*/2), *where H<sub>T</sub>* = (*E*[*X*<sub>1*j*</sub>*X*<sub>1*k*</sub>])<sub>*j*,*k*∈*T*</sub>. *The remaining conditions are as in Assumption 1. We also denote σ* = max(*σ*<sub>0</sub>, *σ<sub>j</sub>*, *j* ∉ *T*).

Subgaussianity of predictors is a standard assumption while working with random predictors in high-dimensional models, cf. [13]. In particular, Assumption 1 implies that *E*[*X*] = 0 and *σ* ≥ 1 [29].

#### **3. Predictive Properties of Classifiers**

In this part of the paper we study the prediction properties of classifiers with convex loss functions. To do this, we look for upper bounds on the excess risk (7) of estimators.

As usual the excess risk in (7) can be decomposed as

$$E_D R(\hat{b}) - R(b_*) \quad + \quad R(b_*) - R_B. \tag{12}$$

The second term in (12) is the approximation risk and compares the predictive ability of the "best" linear classifier (6) to the Bayes classifier. The first term in (12) is called the estimation risk and describes how the estimation process influences the predictive properties of classifiers.

In the next theorem we bound from above the estimation risk of classifiers. To make the result more transparent we use notations *PD* and *PX* in (13), which indicate explicitly which probability we consider, i.e., *PD* is probability with respect to the data *D* and *PX* is with respect to the new object *X*. In further results we omit these lower indexes and believe that it does not lead to confusion.

**Theorem 1.** *For c* > 0 *we consider the event* Ω = {|*b̂* − *b*<sub>∗</sub>|<sub>1</sub> ≤ *c*}. *We have*

$$E_D R(\hat{b}) - R(b_*) \le 2P_D(\Omega^c) + P_X(|b_*^\top \tilde{X}| \le c|\tilde{X}|_\infty). \tag{13}$$

In Theorem 1 we obtain the upper bound for the estimation risk. This risk becomes small if we establish that the probability of the event Ω<sup>*c*</sup> is small and the sequence *c*, which is involved in Ω and in the second term on the right-hand side of (13), decreases sufficiently fast to zero. Therefore, Theorem 1 shows that to have a small estimation risk it is enough to prove that for each *ε* ∈ (0, 1) there exists *c* such that

$$P(|\hat{b} - b_*|_1 \le c) \ge 1 - \varepsilon. \tag{14}$$

Moreover, the numbers *ε* and *c* should be sufficiently small. This property will be studied thoroughly in the next section. Notice that the first term on the right-hand side of (13) reflects how well (5) estimates (6). Moreover, the second expression on the right-hand side of (13) can be bounded from above if predictors are sufficiently regular, for instance subgaussian.

So far, we have been interested in the estimation risk of estimators. In the next result we establish the upper bound for the approximation risk as well. This bound, combined with (13), enables us to bound from above the excess risk of estimators. We prove this fact for the quadratic loss *φ*(*t*) = (1 − *t*)<sup>2</sup> and the logistic loss *φ*(*t*) = log(1 + *e*<sup>−*t*</sup>), which play prominent roles in this paper.

**Theorem 2.** *Suppose that Assumption 1 is fulfilled. Moreover, suppose that the random variable b*<sub>∗</sub><sup>⊤</sup>*X̃ has a density h, which is continuous on the interval U* = [−2*σc*√(log *p*), 2*σc*√(log *p*)]*, and let h̃* = sup<sub>*u*∈*U*</sub> *h*(*u*).

*(a) We have*

$$\mathcal{E}\left(\hat{b}^{quad}, f_B\right) \quad \le \quad 2P(\Omega^c) + 4\sigma \tilde{h}^{quad} c\sqrt{\log p} + 2/p \tag{15}$$

$$+\quad \sqrt{E\left[2\eta(X) - 1 - (b_*^{quad})^\top \tilde{X}\right]^2}\,, \tag{16}$$

*where h̃*<sup>quad</sup> *refers to the density h of* (*b*<sub>∗</sub><sup>quad</sup>)<sup>⊤</sup>*X̃*.

*(b) Let η*<sub>log</sub>(*u*) = 1/(1 + exp(−*u*)). *Then we obtain*

$$\mathcal{E}(\hat{b}^{\log}, f_B) \quad \le \quad 2P(\Omega^c) + 4\sigma \tilde{h}^{\log} c\sqrt{\log p} + 2/p \tag{17}$$

$$+\quad \sqrt{2E\left[KL\left(\eta(X), \eta_{\log}((b_*^{\log})^\top \tilde{X})\right)\right]}\,, \tag{18}$$

*where KL*(·, ·) *is the Kullback–Leibler distance defined in* (10) *and h̃*<sup>log</sup> *refers to the density h of* (*b*<sub>∗</sub><sup>log</sup>)<sup>⊤</sup>*X̃. Additionally, assuming that there exists δ* ∈ (0, 1) *such that δ* ≤ *η*(*X*) ≤ 1 − *δ and δ* ≤ *η*<sub>log</sub>((*b*<sub>∗</sub><sup>log</sup>)<sup>⊤</sup>*X̃*) ≤ 1 − *δ*, *we have*

$$E\left[KL\left(\eta(X), \eta_{\log}((b_*^{\log})^\top \tilde{X})\right)\right] \le (2\delta(1-\delta))^{-1}\, E\left[\eta(X) - \eta_{\log}((b_*^{\log})^\top \tilde{X})\right]^2. \tag{19}$$

In Theorem 2 we establish upper bounds on the excess risks of the Lasso estimators (5). They describe the predictive properties of these classifiers. In this paper we consider linear classifiers, so the misclassification risk of an estimator is close to the Bayes risk if the "truth" can be approximated linearly in a satisfactory way. For the classifier with the logistic loss this fact is described by (18) and (19), which measure the distance between the true success probability and the one in logistic regression. In particular, when the true model is logistic, then (18) and (19) vanish. The expression (16) relates to the approximation error in the case of the quadratic loss. It measures how well the conditional expectation *E*[*Y*|*X*] can be described by the "best" (with respect to the loss *φ*) linear function (*b*<sub>∗</sub><sup>quad</sup>)<sup>⊤</sup>*X̃*.

The right-hand sides of (15) and (17) relate to the estimation risk. They have already been discussed after Theorem 1. Using subgaussianity of predictors we have made them more explicit. The main ingredient of the bounds in Theorem 2, namely *P*(Ω<sup>*c*</sup>), is studied in the next section.

Results in Theorem 2 refer to Lasso estimators with the quadratic and logistic loss functions. Similar results are given in ([2], Theorem 6.4). They refer to the case where the convex excess risk is considered, i.e., the misclassification risks *R*(·) are replaced by the convex risks *Q*(·) in (7). Moreover, these results do not consider Lasso estimators with the quadratic loss applied to classification, which is an approach playing a key role in the current paper. Furthermore, in ([2], Theorem 6.4) the estimation error *b̂* − *b*<sub>∗</sub> is measured in the *l*<sub>1</sub>-norm, which is enough for prediction. However, for variable selection the *l*<sub>∞</sub>-norm gives better results. Such results will be established in Sections 4 and 5. Finally, the results of [2] need more restrictive assumptions than ours. For instance, predictors should be bounded and the function *f<sub>b*<sub>∗</sub></sub>* should be sufficiently close to *f<sub>B</sub>* in the supremum norm.

Analogous bounds to those in Theorem 2 can be obtained for other loss functions if we combine Theorem 1 with the results of [17]. Finally, we should stress that the estimator *b̂* need not rely on the Lasso method. All we require is that the bound (14) can be established for this estimator.

#### **4. On the Event** Ω

In this section we show that probability of the event Ω can be close to one. Such results for classification models with Lipschitz loss functions were established in [2,13]. Therefore, we focus on the quadratic loss function, which is obviously non-Lipschitz. This loss function is important from the practical point of view, but was not considered in these papers. Moreover, in our results the estimation error in Ω can be measured in the *lq*-norms, *q* ≥ 1, not only in the *l*1-norm as in [2,13]. Bounds in the *l*∞-norm lead to better results in variable selection, which are given in Section 5.

We start with introducing the cone invertibility factor (CIF), which plays a significant role in investigating properties of estimators based on the Lasso penalty [9]. In the case *n* > *p* one usually uses the minimal eigenvalue of the matrix *X̃*<sup>⊤</sup>*X̃*/*n* to express the strength of correlations between predictors. Obviously, in the high-dimensional scenario this value is equal to zero, and the minimal eigenvalue needs to be replaced by some other measure of predictor interdependency, which would describe the potential for consistent estimation of model parameters.

For *ξ* > 1 we define a cone

$$\mathcal{C}(\xi) = \{ b \in \mathbb{R}^{p+1} : |b_{\tilde{T}^c}|_1 \le \xi |b_{\tilde{T}}|_1 \},$$

where we recall that *T̃* = *T* ∪ {0}. In the case when *p* >> *n*, three different characteristics measuring the potential for consistent estimation of the model parameters have been introduced:


$$RE(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{b^\top \tilde{\mathbb{X}}^\top \tilde{\mathbb{X}} b / n}{|b|_2^2}\,,$$


$$K(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{|T|\, b^\top \tilde{\mathbb{X}}^\top \tilde{\mathbb{X}} b / n}{|b_T|_1^2}\,,$$


$$\bar{F}_q(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{|T|^{1/q}\, |\tilde{\mathbb{X}}^\top \tilde{\mathbb{X}} b / n|_\infty}{|b|_q}\,.$$

In this article we will use the CIF, because this factor allows for a sharp formulation of convergence results for all *l<sub>q</sub>* norms with *q* ≥ 1; see ([9], Section 3.2). The population (non-random) version of the CIF is given by

$$F_q(\xi) = \inf_{0 \ne b \in \mathcal{C}(\xi)} \frac{|T|^{1/q} |\tilde{H} b|_\infty}{|b|_q}\,,$$

where *H̃* = *E X̃X̃*<sup>⊤</sup>. The key property of the random and population versions of the CIF, *F̄<sub>q</sub>*(*ξ*) and *F<sub>q</sub>*(*ξ*), is that, in contrast to the smallest eigenvalues of the matrices *X̃*<sup>⊤</sup>*X̃*/*n* and *H̃*, they can be close to each other in the high-dimensional setting; see ([30], Lemma 4.1) or ([31], Corollary 10.1). This fact is used in the proof of Theorem 3 (given below).
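The infimum defining the CIF has no closed form in general, but it can be estimated from above by random search over the cone. The sketch below is our own rough illustration under simplifying assumptions: it works with a generic positive definite matrix *H* and support *T* (ignoring the intercept coordinate), projects random directions into the cone by rescaling their off-support part, and the AR(1) matrix is an arbitrary example; random search only yields an upper estimate of the infimum, never the exact value:

```python
import numpy as np

def cif_upper_bound(H, T, xi, q=np.inf, n_samples=20000, seed=0):
    """Monte Carlo upper bound on F_q(xi) = inf_{0 != b in C(xi)} |T|^{1/q} |H b|_inf / |b|_q.
    Random directions are forced into the cone C(xi) = {|b_{T^c}|_1 <= xi |b_T|_1}
    by rescaling the off-support part, so every sampled b is feasible."""
    rng = np.random.default_rng(seed)
    p = H.shape[0]
    Tc = np.setdiff1d(np.arange(p), T)
    t_factor = len(T) ** (1.0 / q) if np.isfinite(q) else 1.0
    best = np.inf
    for _ in range(n_samples):
        b = rng.standard_normal(p)
        lhs, rhs = np.abs(b[Tc]).sum(), xi * np.abs(b[T]).sum()
        if lhs > rhs:                       # scale b_{T^c} onto the cone boundary
            b[Tc] *= rhs / lhs
        ratio = t_factor * np.abs(H @ b).max() / np.linalg.norm(b, q)
        best = min(best, ratio)
    return best

p, rho = 20, 0.5
H = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) matrix (example)
T = np.array([0, 1, 2])
print(cif_upper_bound(H, T, xi=3.0))   # an upper estimate of F_inf(3) for this H
```

Since *H* here is positive definite, the estimated value stays strictly positive, which is exactly the property that makes consistent estimation possible in the cone.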

Next, we state the main results of this section.

**Theorem 3.** *Let a* ∈ (0, 1), *q* ≥ 1 *and ξ* > 1 *be arbitrary. Suppose that Assumption 2 is satisfied and*

$$n \ge \frac{K_1 |T|^2 \sigma^4 (1 + \xi)^2 \log(p/a)}{F_q^2(\xi)} \tag{20}$$

*and*

$$\lambda \ge K_2\, \frac{\xi + 1}{\xi - 1}\, \sigma^2 \sqrt{\frac{\log(p/a)}{n}}, \tag{21}$$

*where K*1, *K*<sup>2</sup> *are universal constants. Then there exists a universal constant K*<sup>3</sup> > 0 *such that with probability at least* 1 − *K*3*a we have*

$$|\hat{b}^{quad} - b_*^{quad}|_q \le \frac{2\xi |T|^{1/q} \lambda}{(\xi + 1) F_q(\xi)}\,. \tag{22}$$

In Theorem 3 we provide the upper bound for the estimation error of the Lasso estimator with the quadratic loss function. This result gives conditions for estimation consistency of *b̂*<sup>quad</sup> in the high-dimensional scenario, i.e., the number of predictors can be significantly greater than the sample size. Indeed, consistency in the *l*<sub>∞</sub>-norm holds, e.g., when *p* = exp(*n*<sup>*a*<sub>1</sub></sup>), |*T*| = *n*<sup>*a*<sub>2</sub></sup>, *a* = exp(−*n*<sup>*a*<sub>1</sub></sup>), where *a*<sub>1</sub> + 2*a*<sub>2</sub> < 1. Moreover, *λ* is taken as the right-hand side of the inequality (21) and, finally, *F*<sub>∞</sub>(*ξ*) is bounded from below (or slowly converging to 0) and *σ* is bounded from above (or slowly diverging to ∞).

The choice of the *λ* parameter is difficult in practice, which is a common drawback of Lasso estimators. However, Theorem 3 gives us a hint on how to choose *λ*. The "safe" choice of *λ* is the right-hand side of the inequality (21), so, roughly speaking, *λ* should be proportional to √(log(*p*)/*n*). In the experimental part of the paper the parameter *λ* is chosen using the cross-validation method. As we will observe, it gives satisfactory results for the Lasso estimators in both prediction and variable selection.

Theorem 3 is a crucial fact, which gives the upper bound for (15) in Theorem 2. Namely, taking *q* = 1, *a* = 1/*p* and *λ* equal to the right-hand side of the inequality (21), we obtain the following consequence of Theorem 3.

**Corollary 1.** *Suppose that Assumption 2 is satisfied. Moreover, assume that there exist ξ*<sub>0</sub> > 1 *and constants C*<sub>1</sub> > 0 *and C*<sub>2</sub> < ∞ *such that F*<sub>1</sub>(*ξ*<sub>0</sub>) ≥ *C*<sub>1</sub> *and σ* ≤ *C*<sub>2</sub>*. If n* ≥ *K*<sub>1</sub>|*T*|<sup>2</sup> log *p*, *then*

$$P\left(|\hat{b}^{quad} - b_*^{quad}|_1 \le K_2 |T| \sqrt{\frac{\log p}{n}}\right) \ge 1 - K_3/p, \tag{23}$$

*where the constants $K_1$ and $K_2$ depend only on $\xi_0$, $C_1$, $C_2$, and $K_3$ is a universal constant provided in Theorem 3.*

The above result applies to Lasso estimators with the quadratic loss. For the logistic loss an analogous result is obtained in ([13], Theorem 1). In fact, the results there cover quite general Lipschitz loss functions, which can be useful in extending Theorem 2 to such cases.

#### **5. Variable Selection Properties of Estimators**

In Section 3 we were interested in predictive properties of estimators. In this part of the paper we focus on variable selection, which is another important problem in high-dimensional statistics. As we have already noticed, upper bounds for the probability of the event Ω are crucial in proving results concerning prediction. They also play a key role in establishing results on variable selection. In this section we again focus on the Lasso estimators with the quadratic loss function. Analogous results for Lipschitz loss functions were considered in ([13], Corollary 1).

In the variable selection problem we want to find significant predictors, which, roughly speaking, give us some information on the observed phenomenon. We consider this problem in the semiparametric model defined in (8). In this case the set of significant predictors is given by (9). As we have already mentioned, the vectors $\beta$ and $b_*^{quad}$ need not be the same. However, it is proved in [32] that for a real number $\gamma$ the relation

$$(b\_{\ast}^{\text{quad}})\_{j} = \gamma \beta\_{j}, \quad j = 1, \ldots, p \tag{24}$$

holds under Assumption 3, which is now stated.

**Assumption 3.** *Let $\mathring{\beta} = (\beta_1, \ldots, \beta_p)$. We assume that for each $\theta \in \mathbb{R}^p$ the conditional expectation $E\left[\theta^\top X \mid \mathring{\beta}^\top X\right]$ exists and*

$$E\left[\theta^\top X \mid \mathring{\beta}^\top X\right] = d_\theta \, \mathring{\beta}^\top X$$

*for a real number $d_\theta \in \mathbb{R}$.*

The coefficient $\gamma$ in (24) can be easily calculated. Namely, we have

$$\gamma = \frac{E\left[Y\mathring{\beta}^\top X\right]}{\mathring{\beta}^\top H \mathring{\beta}} = \frac{2E\left[g(\beta^\top X)\mathring{\beta}^\top X\right]}{\mathring{\beta}^\top H \mathring{\beta}}.$$

Standard arguments [33] show that $\gamma$ is nonzero if *g* is monotonic. In this case the set of significant predictors defined in (9) equals the set *T* defined in (11).

Assumption 3 is a well-known condition in the literature, see e.g., [13,32,34–36]. It is always satisfied in the simple regression model (i.e., when $X_1 \in \mathbb{R}$ is the only predictor), which is often used for initial screening of explanatory variables, see, e.g., [37]. It is also satisfied when *X* comes from an *elliptical distribution*, such as the multivariate normal distribution or the multivariate *t*-distribution. In [38] it is argued that Assumption 3 is not a restrictive condition when the number of predictors is large, which is the case that we focus on in this paper.

Now, we state the results of this part of the paper. We will use the notation $b_{min}^{quad} = \min_{j\in T} |(b_*^{quad})_j|$.

**Corollary 2.** *Suppose that the conditions of Theorem 3 are satisfied for $q = \infty$. If $b_{min}^{quad} \ge \frac{4\xi\lambda}{(\xi+1)F_\infty(\xi)}$, then*

$$P\left(\forall_{j\in T,\, k\notin T} \quad |\hat{b}_j^{quad}| > |\hat{b}_k^{quad}|\right) \ge 1 - K_3 a,$$

*where K*<sup>3</sup> *is the universal constant from Theorem 3.*

In Corollary 2 we show that the Lasso estimator with the quadratic loss is able to separate predictors if the nonzero coefficients of $\mathring{b}_*^{quad}$ are large enough in absolute value. In the case that *T* equals (9) (i.e., *T* is the set of significant predictors) we can prove that the thresholded Lasso estimator finds the true model with high probability. This fact is stated in the next result. The thresholded Lasso estimator is denoted by $\hat{b}_{th}^{quad}$ and defined as

$$(\hat{b}_{th}^{quad})_j = \hat{b}_j^{quad} \, \mathbb{I}(|\hat{b}_j^{quad}| \ge \delta), \quad j = 1, \dots, p, \tag{25}$$

where $\delta > 0$ is a threshold. We set $(\hat{b}_{th}^{quad})_0 = \hat{b}_0^{quad}$ and denote $\hat{T}_{th} = \{1 \le j \le p : (\hat{b}_{th}^{quad})_j \ne 0\}$.
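The thresholding rule (25) is straightforward to implement. The sketch below is a minimal numpy illustration (the paper's experiments use R/glmnet; the coefficient vector here is a made-up example, not the output of the actual procedure):

```python
import numpy as np

def threshold_lasso(b_hat, delta):
    # Thresholded Lasso (25): keep coordinate j only if |b_hat[j]| >= delta.
    # The intercept b_hat[0] is not thresholded, as in the paper.
    b_th = np.where(np.abs(b_hat) >= delta, b_hat, 0.0)
    b_th[0] = b_hat[0]
    return b_th

b_hat = np.array([0.4, 1.1, -0.03, 0.8, 0.02])  # hypothetical Lasso fit
b_th = threshold_lasso(b_hat, delta=0.1)
T_hat = np.flatnonzero(b_th[1:]) + 1            # selected predictors
```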

**Corollary 3.** *Let g in (8) be monotonic. We suppose that Assumption 3 and conditions of Theorem 3 are satisfied for q* = ∞. *If*

$$\frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)} < \delta \le b_{min}^{quad}/2,$$

*then*

$$P\left(\hat{T}_{th} = T\right) \ge 1 - K_3 a,$$

*where K*<sup>3</sup> *is the universal constant from Theorem 3.*

Corollary 3 states that the Lasso estimator after thresholding is able to find the true model with high probability, if the threshold is appropriately chosen. However, Corollary 3 does not give a constructive way of choosing the threshold, because both endpoints of the interval $\left[\frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)}, \, b_{min}^{quad}/2\right]$ are unknown. This is not surprising and has already been observed, for instance, in linear models ([9], Theorem 8). In the literature one can find methods that help to choose a threshold in practice, for instance the approach relying on information criteria developed in [39,40].

Finally, we discuss the condition of Corollary 3 that $b_{min}^{quad}$ cannot be too small, i.e., $b_{min}^{quad} \ge \frac{4\xi\lambda}{(\xi+1)F_\infty(\xi)}$. We know that $(b_*^{quad})_j = \gamma\beta_j$ for $j = 1, \ldots, p$, so this condition requires that

$$\min_{j \in T} |\beta_j| \ge \frac{4\xi\lambda}{|\gamma|(\xi+1)F_\infty(\xi)}. \tag{26}$$

Compared to the analogous condition for Lasso estimators in well-specified models, we observe that the denominator in (26) contains the additional factor $|\gamma|$. This number is usually smaller than one, which means that in misspecified models the Lasso estimator needs a larger sample size to work well. This phenomenon is typical for misspecified models, and similar restrictions hold for competing methods [13].

#### **6. Numerical Experiments**

In this section we present a simulation study in which we compare the accuracy of the considered estimators in prediction and variable selection.

We consider the model (8) with predictors generated from the *p*-dimensional normal distribution $N(0, H)$, where $H_{jj} = 1$ and $H_{jk} = 0.5$ for $j \ne k$. The true parameter is

$$\beta = (1, \underbrace{\pm 1, \pm 1, \ldots, \pm 1}\_{10}, 0, 0, \ldots, 0), \tag{27}$$

where signs are chosen at random. The first coordinate in (27) corresponds to the intercept and the next ten coefficients relate to significant predictors in the model. We study two cases:


In each scenario we generate the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ for $n \in \{100, 350, 600\}$. The corresponding numbers of predictors are $p \in \{100, 1225, 3600\}$, so the number of predictors significantly exceeds the sample size in the experiments. For every model we consider two Lasso estimators with unpenalized intercepts (5): the first one with the logistic loss and the second one with the quadratic loss. They are denoted by "logistic" and "quadratic", respectively. To compute them we use the "glmnet" package [28] in the "R" software [41]. The tuning parameters *λ* are chosen on the basis of 10-fold cross-validation.
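For a self-contained illustration of the data-generating design, the following Python sketch reproduces the equicorrelated predictors and the parameter (27); the logistic link used for the responses is one possible instance of model (8), chosen here purely as an illustrative assumption (the actual experiments are run in R with glmnet):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 100, 10                    # sample size, dimension, # signals

# Equicorrelated covariance: H_jj = 1, H_jk = 0.5 for j != k
H = np.full((p, p), 0.5) + 0.5 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), H, size=n)

# True parameter (27): intercept 1, ten +-1 coefficients with random signs
beta = np.zeros(p + 1)
beta[0] = 1.0
beta[1:k + 1] = rng.choice([-1.0, 1.0], size=k)

# Responses in {-1, 1}; the logistic link is an illustrative assumption
eta = beta[0] + X @ beta[1:]
Y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-eta)), 1, -1)
```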

Observe that applying the Lasso estimator with the logistic loss function to Scenario 1 leads to a well-specified model, while using the quadratic loss implies misspecification. In Scenario 2 both estimators work in misspecified models.

Simulations for each scenario are repeated 300 times.

To describe the quality of estimators in variable selection we calculate two values:


In this way we want to confirm that the considered estimators are able to separate predictors, as established in Section 5. Using TD we also study the "screening" properties of the estimators, which are weaker requirements than separability.

The classification accuracy of estimators is measured in the following way: we generate a test sample containing 1000 objects. On this set we calculate


The results of the experiments are collected in Tables 1 and 2. By the "oracle" we mean the classifier that works only with significant predictors and uses the function *g* from the true model (8) in the estimation process.


**Table 1.** Results for Scenario 1.


**Table 2.** Results for Scenario 2.

Finally, we also compare execution time of both algorithms. In Table 3 we show the averaged relative time difference

$$\frac{t^{\log} - t^{\text{quad}}}{t^{\text{quad}}} \,, \tag{28}$$

where $t^{quad}$ and $t^{\log}$ are the times of computing the Lasso with the quadratic and logistic loss functions, respectively.

**Table 3.** Relative time difference (28) of algorithms.


Looking at the results of the experiments, we observe that both estimators perform satisfactorily. Their predictive accuracy is relatively close to the oracle, especially when the sample size is larger. In variable selection we see that both estimators are able to find significant predictors and to separate predictors in both scenarios. Again we can notice that the properties of the estimators improve as *n* increases.

In Scenario 2 the quality of both estimators in prediction and variable selection is comparable. In Scenario 1, which is well-specified for the Lasso with the logistic loss, we observe its dominance over the Lasso with the quadratic loss. However, this dominance is not large. Therefore, using the Lasso with the quadratic loss we obtain slightly worse accuracy, but the algorithm is computationally faster. Computational efficiency is especially important when we study large data sets. As we can see in Table 3, the execution times of the estimators are almost the same for *n* = 350, but for *n* = 600 the relative time difference becomes greater than 10%.

**Author Contributions:** Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research of K.F. was partially supported by Warsaw University of Life Sciences (SGGW).

**Acknowledgments:** We would like to thank J. Mielniczuk and the reviewers for their valuable comments, which have improved the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proofs and Auxiliary Results**

This section contains proofs of results from the paper. Additional lemmas are also provided.

#### *Appendix A.1. Results from Section 3*

**Proof of Theorem 1.** For arbitrary $b \in \mathbb{R}^{p+1}$ the averaged misclassification risk of $f_b$ can be expressed as

$$E_D R(b) = E_D E_{(X,Y)} \left[ I(Y=1)I(b^\top \tilde{X} < 0) + I(Y=-1)I(b^\top \tilde{X} \ge 0) \right]. \tag{A1}$$

Moreover, we have

$$I(Y=-1)I(b^\top \tilde{X} \ge 0) = I(Y=-1)\left[1 - I(b^\top \tilde{X} < 0)\right]. \tag{A2}$$

 

Applying (A1) and (A2) to $\hat{b}$ and $b_*$, we obtain

$$\begin{aligned} &\left|E_D R(\hat{b}) - R(b_*)\right| \\ &= \left|E_D E_{(X,Y)}\left[I(Y=1) - I(Y=-1)\right] \left[I(\hat{b}^\top \tilde{X} < 0) - I(b_*^\top \tilde{X} < 0)\right]\right| \\ &\le \left|E_D E_{(X,Y)}\left[I(\hat{b}^\top \tilde{X} < 0) - I(b_*^\top \tilde{X} < 0)\right]\right| \\ &= \left|P(\hat{b}^\top \tilde{X} < 0, b_*^\top \tilde{X} \ge 0) + P(\hat{b}^\top \tilde{X} \ge 0, b_*^\top \tilde{X} < 0)\right|, \end{aligned}$$

where *P* is probability with respect to both the data *D* and the new object *X*. Observe that on the event Ω we have

$$\hat{b}^\top \tilde{X} \le c|\tilde{X}|_\infty + b_*^\top \tilde{X},$$

so

$$\begin{aligned} &P(\hat{b}^\top \tilde{X} \ge 0, b_*^\top \tilde{X} < 0) = P(\hat{b}^\top \tilde{X} \ge 0, b_*^\top \tilde{X} < 0, \Omega) \\ &\quad + P(\hat{b}^\top \tilde{X} \ge 0, b_*^\top \tilde{X} < 0, \Omega^c) \\ &\le P_X(-c|\tilde{X}|_\infty \le b_*^\top \tilde{X} < 0) + P_D(\Omega^c). \end{aligned}$$

Analogously, using the fact that on Ω we have

$$\hat{b}^\top \tilde{X} \ge -c|\tilde{X}|_\infty + b_*^\top \tilde{X},$$

we obtain

$$P(\hat{b}^\top \tilde{X} < 0, b_*^\top \tilde{X} \ge 0) \le P_X(0 \le b_*^\top \tilde{X} \le c|\tilde{X}|_\infty) + P_D(\Omega^c),$$

which finishes the proof.

**Lemma A1.** *Suppose that Assumption 1 is fulfilled. Moreover, suppose that the random variable $b_*^\top \tilde{X}$ has a density $h$, which is continuous on the interval $U = [-2\sigma c\sqrt{\log p}, \, 2\sigma c\sqrt{\log p}]$, and let $\tilde{h} = \sup_{u\in U} h(u)$. Then*

$$P_X(|b_*^\top \tilde{X}| \le c|\tilde{X}|_\infty) \le 4\sigma\tilde{h}c\sqrt{\log p} + 2/p. \tag{A3}$$

**Proof.** For simplicity, we omit the lower index *X* in probability *PX* in this proof. We take *a* > 1 and obtain inequalities

$$\begin{split} P(|b_*^\top \tilde{X}| \le c|\tilde{X}|_\infty) &\le P(|b_*^\top \tilde{X}| \le c|\tilde{X}|_\infty, |\tilde{X}|_\infty \le a) \\ &\quad + P(|b_*^\top \tilde{X}| \le c|\tilde{X}|_\infty, |\tilde{X}|_\infty > a) \\ &\le P(|b_*^\top \tilde{X}| \le ca) + P(|\tilde{X}|_\infty > a). \end{split} \tag{A4}$$

The second term in (A4) equals $P(|X|_\infty > a)$, because $a > 1$. It can be handled using subgaussianity of *X* as follows: take $z > 0$ and notice that by the Markov inequality and the fact that $\exp(|u|) \le \exp(u) + \exp(-u)$ for each $u \in \mathbb{R}$, we obtain

$$\begin{aligned} P(|X|_\infty > a) &\le e^{-za} E \exp(z|X|_\infty) \le e^{-za} \sum_{j=1}^{p} E \exp(z|X_j|) \\ &\le 2p \exp(\sigma^2 z^2/2 - az). \end{aligned}$$

Taking $z = a/\sigma^2$, we obtain

$$P(|X|\_{\infty} > a) \le 2p \exp(-a^2/(2\sigma^2)).$$

Then we choose $a = 2\sigma\sqrt{\log p}$, which is not smaller than one, because $\sigma \ge 1$ from Assumption 1.

Finally, the first term in (A4) can be bounded from above by $2ca\tilde{h} = 4\sigma\tilde{h}c\sqrt{\log p}$ by the mean value theorem.
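As a numerical sanity check of the tail bound used above, the following sketch compares the empirical exceedance frequency of $|X|_\infty$ with the bound $2p\exp(-a^2/(2\sigma^2)) = 2/p$ for standard normal (hence subgaussian) coordinates; the sample size and dimension are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma = 50, 1.0
a = 2 * sigma * np.sqrt(np.log(p))              # the threshold chosen in the proof

X = rng.standard_normal((20000, p))             # N(0,1) coordinates are 1-subgaussian
emp = np.mean(np.abs(X).max(axis=1) > a)        # empirical P(|X|_inf > a)
bound = 2 * p * np.exp(-a**2 / (2 * sigma**2))  # equals 2/p here

assert emp <= bound                             # the bound holds, with room to spare
```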

**Proof of Theorem 2.** The right-hand sides of (15) and (17) are upper bounds on the estimation risk. They are obtained using Theorem 1 and Lemma A1. The expressions (16) and (18) are upper bounds for the approximation risk in the case of estimators with the quadratic and logistic loss functions, respectively. In particular, (16) follows from ([17], Theorem 2.1) applied to $f_{b_*^{quad}}$ and Example 3.1. Establishing (18) is similar: we just use ([17], Theorem 2.1) applied to $f_{b_*^{\log}}$ and Example 3.5 to show that

$$R(b_*^{\log}) - R_B \le \sqrt{2E\left[KL\left(\eta(X), \eta_{\log}((b_*^{\log})^\top \tilde{X})\right)\right]}, \tag{A5}$$

where the Kullback–Leibler distance *KL*(·, ·) is defined in (10).

Next, we define the function $h(a) = a \log a + (1-a)\log(1-a)$ for $a \in (0,1)$. Clearly, we have $KL(a,b) = h(a) - h(b) - h'(b)(a-b)$ and $h''(a) = (a(1-a))^{-1}$. Therefore, from the mean value theorem

$$KL(a,b) = \frac{(a-b)^2}{2c(1-c)}\tag{A6}$$

for some *c* between *a* and *b*. To finish the proof we apply (A6) to the right-hand side of (A5) with *δ* < *c* < 1 − *δ*.
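The identity above is easy to check numerically. The following sketch verifies that the Bernoulli Kullback–Leibler divergence is bounded by $(a-b)^2/(2\delta(1-\delta))$ whenever both arguments lie in $[\delta, 1-\delta]$, which is exactly how (A6) is used in the proof:

```python
import math

def kl_bernoulli(a, b):
    # KL(a, b) = h(a) - h(b) - h'(b)(a - b) with h(a) = a log a + (1-a) log(1-a)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# (A6): KL(a, b) = (a-b)^2 / (2 c (1-c)) for some c between a and b, hence
# KL(a, b) <= (a-b)^2 / (2 delta (1-delta)) when a, b lie in [delta, 1-delta].
delta = 0.1
for a in (0.2, 0.5, 0.8):
    for b in (0.3, 0.6, 0.9):
        assert kl_bernoulli(a, b) <= (a - b) ** 2 / (2 * delta * (1 - delta))
```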

#### *Appendix A.2. Results from Section 4*

To simplify notation, in this section we write $\hat{b}$, $b_*$ for $\hat{b}^{quad}$, $b_*^{quad}$, respectively. Moreover, we denote $\mathring{b}_* = ((b_*)_1, \ldots, (b_*)_p)$.

We start with establishing results, which help us to prove Theorem 3.

**Lemma A2.** *For $\mathring{b}_* = H^{-1}E\left[XY\right]$ we have $\mathring{b}_*^\top H \mathring{b}_* \le 1$.*

**Proof.** The proof is elementary and based on the inequality

$$0 \le E\left[E\left[Y|X\right] - \mathring{b}_*^\top X\right]^2. \tag{A7}$$

The right-hand side of (A7) can be expressed as

$$E\left[\left(E\left[Y|X\right]\right)^2 - 2\mathring{b}_*^\top E\left[XY|X\right] + \mathring{b}_*^\top X X^\top \mathring{b}_*\right] = E\left[\left(E\left[Y|X\right]\right)^2\right] - 2\mathring{b}_*^\top E\left[XY\right] + \mathring{b}_*^\top H \mathring{b}_*. \tag{A8}$$

Using ˚ *<sup>b</sup>*<sup>∗</sup> <sup>=</sup> *<sup>H</sup>*−1*<sup>E</sup>* [*XY*] , we have ˚ *b* <sup>∗</sup> *<sup>E</sup>* [*XY*] <sup>=</sup> ˚ *b* <sup>∗</sup> *HH*−1*<sup>E</sup>* [*XY*] <sup>=</sup> ˚ *b* <sup>∗</sup> *<sup>H</sup>*˚ *b*<sup>∗</sup> and we can bound from above the right-hand side of (A8) by

$$E\left[\left.Y^{2}\right]-\left.b\_{\ast}^{\top}H\theta\_{\ast\vee}\right]$$

which finishes the proof.

The next result is given in ([42], Corollary 8.2).

**Lemma A3.** *Suppose that $Z_1, \ldots, Z_n$ are i.i.d. random variables and there exists $L > 0$ such that $C^2 = E \exp(|Z_1|/L)$ is finite. Then for arbitrary $u > 0$*

$$P\left(\frac{1}{n}\sum_{i=1}^n (Z_i - E\left[Z_i\right]) > 2L\left(C\sqrt{\frac{2u}{n}} + \frac{u}{n}\right)\right) \le \exp(-u).$$

**Lemma A4.** *For arbitrary j* = 1, . . . , *p and u* > 0 *we have*

$$P\left(\frac{2}{n}\sum_{i=1}^{n}X_{ij}(X_i^\top \mathring{b}_* + E\left[Y\right] - Y_i) > 16.4\sigma^2 \left(3\sqrt{\frac{2u}{n}} + \frac{u}{n}\right)\right) \le \exp(-u). \tag{A9}$$

**Proof.** Fix *j* = 1, ... , *p* and *u* > 0. Recall that *H*˚ *b*<sup>∗</sup> = *E* [*YX*] and *E* [*X*] = 0. Thus, we work with an average of i.i.d. centred random variables, so we can use Lemma A3. We only have to find *L*, *C* > 0 such that

$$E \exp\left( \left| X_j(X^\top \mathring{b}_* + E\left[Y\right] - Y) \right| / L \right) \le C^2, \tag{A10}$$

where $X_j$ is the $j$-th coordinate of $X$. For all positive numbers $a, b$ we have the inequality $ab \le \frac{a^2}{2} + \frac{b^2}{2}$. Therefore, we have

$$|X_j(X^\top \mathring{b}_* + E\left[Y\right] - Y)| \le \frac{X_j^2}{2} + (X^\top \mathring{b}_*)^2 + 4.$$

Applying this fact and the Schwarz inequality we obtain

$$E\exp\left(|X_j(X^\top \mathring{b}_* + E\left[Y\right] - Y)|/L\right) \le \exp\left(\frac{4}{L}\right)\sqrt{E\exp\left(\frac{X_j^2}{L}\right)E\exp\left(\frac{2(X^\top \mathring{b}_*)^2}{L}\right)}. \tag{A11}$$

The variable $X_j$ is subgaussian, so using ([43], Lemma 7.4) we can bound the first expectation in (A11) by $\left(1 - \frac{2\sigma^2}{L}\right)^{-1/2}$, provided that $L > 2\sigma^2$. The second expectation in (A11) can be bounded using subgaussianity of $X^\top \mathring{b}_*$, ([43], Lemma 7.4) and Lemma A2 in the following way:

$$E \exp\left(\frac{2(X^\top \mathring{b}_*)^2}{L}\right) \le \left(1 - \frac{4\sigma^2}{L}\right)^{-1/2},$$

provided that $4\sigma^2 < L$. Taking $L = 4.1\sigma^2$, we can bound $\exp(4/L) \le 2.7$, because $H_{jj} = 1$ implies that $\sigma \ge 1$. Thus, we obtain $C \le 3$, where $C$ is the constant in the upper bound (A10). This finishes the proof.

**Lemma A5.** *Suppose that the assumptions of Theorem 3 are satisfied. Then for arbitrary $a \in (0,1)$, $q \ge 1$, $\xi > 1$, with probability at least $1 - Ka$ we have $\bar{F}_q(\xi) \ge F_q(\xi)/2$, where $K$ is a universal constant.*

**Proof.** Fix $a \in (0,1)$, $q \ge 1$, $\xi > 1$. We start by considering the $l_\infty$-norm of the matrix

$$\left|\frac{1}{n}\mathbb{X}^\top\mathbb{X} - E\left[\tilde{X}\tilde{X}^\top\right]\right|_\infty = \max\Bigg(\max_{j,k=1,\ldots,p} \left|\frac{1}{n}\sum_{i=1}^{n} X_{ij}X_{ik} - E\left[X_j X_k\right]\right|, \tag{A12}$$

$$\max_{j=1,\ldots,p} \left| \frac{1}{n} \sum_{i=1}^{n} X_{ij} \right|\Bigg). \tag{A13}$$

We focus only on the term (A12), because (A13) can be treated similarly. Thus, fix $j, k \in \{1, \ldots, p\}$. Using subgaussianity of the predictors, Lemma A3 and argumentation similar to the proof of Lemma A4, we have

$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik} - E\left[X_{1j}X_{1k}\right]\right| > K_2\sigma^2\sqrt{\frac{\log(p^2/a)}{n}}\right) \le \frac{2a}{p^2},$$

where $K_2$ is a universal constant. The values of the constants $K_i$ that appear in this proof can change from line to line.

Therefore, using union bounds we obtain

$$P\left(\left|\frac{1}{n}\mathbb{X}^\top\mathbb{X} - E\left[\tilde{X}\tilde{X}^\top\right]\right|_\infty > K_2\sigma^2\sqrt{\frac{\log(p^2/a)}{n}}\right) \le K_3 a.$$

Proceeding similarly to the proof of ([30], Lemma 4.1) we have the following probabilistic inequality

$$\bar{F}_q(\xi) \ge F_q(\xi) - K_2(1 + \xi)|T|\sigma^2 \sqrt{\frac{\log(p^2/a)}{n}}.$$

To finish the proof we use (20) with *K*<sup>1</sup> being sufficiently large.

**Proof of Theorem 3.** Let *a* ∈ (0, 1), *q* ≥ 1, *ξ* > 1 be arbitrary. The main part of the proof is to show that with high probability

$$|\hat{b} - b_*|_q \le \frac{\xi|T|^{1/q}\lambda}{(\xi+1)\bar{F}_q(\xi)}. \tag{A14}$$

Then we apply Lemma A5 to obtain (22).

Thus, we focus on showing that (A14) holds with high probability. Denote $\mathcal{A} = \{|\nabla \bar{Q}(b_*)|_\infty \le \frac{\xi-1}{\xi+1}\lambda\}$. We start with bounding the probability of $\mathcal{A}$ from below. Recall that $b_*$ is the minimizer of $Q(b) = E(1 - Yb^\top \tilde{X})^2$, which can be easily calculated, namely

$$\mathring{b}_* = H^{-1}E\left[XY\right] \quad \text{and} \quad (b_*)_0 = E\left[Y\right].$$

For every $j = 1, \ldots, p$ the $j$-th partial derivative of $\bar{Q}(b)$ at $b_*$ is

$$\nabla_j \bar{Q}(b_*) = \frac{2}{n} \sum_{i=1}^{n} X_{ij} (X_i^\top \mathring{b}_* + E\left[Y\right] - Y_i). \tag{A15}$$

The derivative with respect to $b_0$ is

$$\nabla_0 \bar{Q}(b_*) = \frac{2}{n} \sum_{i=1}^n (X_i^\top \mathring{b}_* + E\left[Y\right] - Y_i). \tag{A16}$$

Taking *λ*, which satisfies (21), and using union bounds, we obtain that

$$P(\mathcal{A}^c) \le \sum_{j=0}^p P\left( |\nabla_j \bar{Q}(b_*)| > K_2 \sigma^2 \sqrt{\frac{\log(p/a)}{n}} \right). \tag{A17}$$

Consider a summand on the right-hand side of (A17) corresponding to $j \in \{1, \ldots, p\}$. In view of (A15) we can handle it using Lemma A4: we just take $u = \log(p/a)$ and sufficiently large $K_2$. The probability of the first term on the right-hand side of (A17), which corresponds to $j = 0$, can be bounded from above analogously to the proof of Lemma A4. The argument is even simpler, so we omit it.

In the further argumentation we consider only the event $\mathcal{A}$. Moreover, we denote $\theta = \hat{b} - b_*$, where $\hat{b}$ is a minimizer of the convex function (5), which is equivalent to

$$\begin{cases} \nabla_j \bar{Q}(\hat{b}) = -\lambda \, \mathrm{sign}(\hat{b}_j) & \text{for} \quad \hat{b}_j \neq 0; \\ |\nabla_j \bar{Q}(\hat{b})| \le \lambda & \text{for} \quad \hat{b}_j = 0; \\ \nabla_0 \bar{Q}(\hat{b}) = 0, \end{cases} \tag{A18}$$

where *j* = 1, . . . , *p*.

First, we prove that $\theta \in \mathcal{C}(\xi)$. Here our argumentation is standard [9]. From (A18) and the fact that $|\theta|_1 = |\theta_T|_1 + |\theta_{T^c}|_1 + |\theta_0|$ we can calculate

$$\begin{split} 0 &\le 2\theta^\top\mathbb{X}^\top\mathbb{X}\theta/n = \theta^\top\left[\nabla\bar{Q}(\hat{b}) - \nabla\bar{Q}(b_*)\right] \\ &= \sum_{j\in T}\theta_j\nabla_j\bar{Q}(\hat{b}) + \sum_{j\in T^c}\hat{b}_j\nabla_j\bar{Q}(\hat{b}) - \theta^\top\nabla\bar{Q}(b_*) \\ &\le \lambda\sum_{j\in T}|\theta_j| - \lambda\sum_{j\in T^c}|\hat{b}_j| + |\theta|_1|\nabla\bar{Q}(b_*)|_\infty \\ &= \left[\lambda + |\nabla\bar{Q}(b_*)|_\infty\right]|\theta_T|_1 + \left[|\nabla\bar{Q}(b_*)|_\infty - \lambda\right]|\theta_{T^c}|_1 + |\theta_0|\,|\nabla\bar{Q}(b_*)|_\infty. \end{split}$$

Thus, using the fact that we consider the event $\mathcal{A}$, we get

$$|\theta_{T^c}|_1 \le \frac{\lambda + |\nabla \bar{Q}(b_*)|_\infty}{\lambda - |\nabla \bar{Q}(b_*)|_\infty} |\theta_T|_1 + \frac{|\nabla \bar{Q}(b_*)|_\infty}{\lambda - |\nabla \bar{Q}(b_*)|_\infty} |\theta_0| \le \xi\left(|\theta_T|_1 + |\theta_0|\right).$$

Therefore, from the definition of $\bar{F}_q(\xi)$ we have

$$|\hat{b} - b_*|_q \le \frac{|T|^{1/q} \left|\mathbb{X}^\top\mathbb{X}(\hat{b} - b_*)/n\right|_\infty}{\bar{F}_q(\xi)} \le |T|^{1/q} \frac{|\nabla\bar{Q}(\hat{b})|_\infty/2 + |\nabla\bar{Q}(b_*)|_\infty/2}{\bar{F}_q(\xi)}.$$

Using (A18) and the fact that we are on $\mathcal{A}$, we obtain (A14).

#### *Appendix A.3. Results from Section 5*

**Proof of Corollary 2.** The proof is a simple consequence of the bound (22) with $q = \infty$ obtained in Theorem 3. Indeed, for arbitrary predictors $j \in T$ and $k \notin T$ we obtain

$$\begin{split} |\hat{b}_j^{quad}| &\ge |(b_*^{quad})_j| - |\hat{b}_j^{quad} - (b_*^{quad})_j| \ge b_{min}^{quad} - |\hat{b}^{quad} - b_*^{quad}|_\infty \\ &> \frac{2\xi\lambda}{(\xi+1)F_\infty(\xi)} \ge |\hat{b}^{quad} - b_*^{quad}|_\infty \ge |\hat{b}_k^{quad} - (b_*^{quad})_k| = |\hat{b}_k^{quad}|. \end{split}$$

**Proof of Corollary 3.** The proof is almost the same as the proof of Corollary 2, so it is omitted.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Multivariate Tail Coefficients: Properties and Estimation**

#### **Irène Gijbels <sup>1,\*</sup>, Vojtěch Kika <sup>1,2</sup> and Marek Omelka <sup>2</sup>**


Received: 3 April 2020; Accepted: 25 June 2020; Published: 30 June 2020

**Abstract:** Multivariate tail coefficients are an important tool when investigating dependencies between extreme events for different components of a random vector. Although bivariate tail coefficients are well-studied, this is, to a lesser extent, the case for multivariate tail coefficients. This paper contributes to this research area by (i) providing a thorough study of properties of existing multivariate tail coefficients in the light of a set of desirable properties; (ii) proposing some new multivariate tail measurements; (iii) dealing with estimation of the discussed coefficients and establishing asymptotic consistency; and, (iv) studying the behavior of tail measurements with increasing dimension of the random vector. A set of illustrative examples is given, and practical use of the tail measurements is demonstrated in a data analysis with a focus on dependencies between stocks that are part of the EURO STOXX 50 market index.

**Keywords:** archimedean copula; consistency; estimation; extreme-value copula; tail dependency; multivariate analysis

**MSC:** Primary: 60Exx; Secondary: 62H20; 62G32

#### **1. Introduction**

Assume that we have a *d*-variate random vector and we are interested in the tendency of the components to achieve extreme values simultaneously, that is, to jointly take extremely small or extremely large values. In the bivariate setting, when *d* = 2, this so-called tail dependence has been studied thoroughly in the literature. Bivariate lower and upper tail coefficients appeared, for example, in [1], but the idea of studying bivariate extremes dates back to [2]. These coefficients, being conditional probabilities of an extreme event given that another event is also extreme, have become the standard tool for quantifying tail dependence of a bivariate random vector. Later, a generalization to arbitrary dimension *d* became of interest. The presence of more than two components, however, brings difficulties in defining tail dependency, and several proposals have appeared in the literature. These include the proposals made by [3,4] or [5], who adopted different strategies for conditioning in general dimensions. Further proposals were made for specific copula families, for example, by [6] for Archimedean copulas or by [7] for extreme-value copulas.

In this paper, we aim to contribute to the discussion on the appropriateness of multivariate tail coefficients, from the viewpoint of properties that one would desire such coefficients to have. This study also entails the proposal of some new multivariate tail measures, for which we establish the properties. We investigate the estimation of the discussed multivariate tail coefficients and establish consistency of all estimators. It is also of particular interest to find out how tail dependence measures behave when the dimension *d* increases.

The organization of the paper is as follows. In Section 2, we briefly review some basic concepts about copulas and classes of copulas that will be needed in subsequent sections. Section 3 is devoted to the study of various multivariate tail dependence measures, whereas Section 7 discusses statistical estimation of these measures, including consistency properties. Section 4 investigates some further probabilistic properties of the multivariate tail dependence measures. Section 5 studies the behavior of the tail coefficient measures for Archimedean copulas when the dimension increases to infinity. A variety of illustrative examples is provided in Section 6, and it accompanies the studies that are presented in Sections 3 and 5. Finally, in Section 8, it is demonstrated how multivariate tail coefficients contribute in getting insights into dependencies between stocks that are part of the EURO STOXX 50 market index.

#### **2. Multivariate Copulas**

In this section, we briefly introduce concepts and notation from copula theory that will be necessary in the rest of this text. For more details on copulas, see e.g., [8].

#### *2.1. Basic Properties. Survival and Marginal Copulas*

Suppose that we have a *d*-variate random vector $X = (X_1, \ldots, X_d)$ with joint distribution function *F*. Let further $F_j$ denote the continuous marginal distribution function of $X_j$ for $j = 1, \ldots, d$. Sklar's theorem [9] describes the relationship between the joint distribution function and the marginals, given by a unique copula function $C_d : [0,1]^d \to [0,1]$ such that

$$F(x_1, \dots, x_d) = C_d(F_1(x_1), \dots, F_d(x_d)), \quad (x_1, \dots, x_d)^\top \in \mathbb{R}^d.$$

We denote the set of all *d*-variate copulas by $\mathrm{Cop}(d)$. From the above relationship, it is easily seen that the random vector $U = (U_1, \dots, U_d) = (F_1(X_1), \dots, F_d(X_d))$ has joint distribution function $C_d$; that is, with $u = (u_1, \dots, u_d) \in [0,1]^d$, $C_d(u) = \mathbb{P}(U \le u)$. Inequalities between vectors in this text are understood component-wise.

The survival function $\overline{C}_d$ associated to a copula $C_d$ is defined as $\overline{C}_d(u) = \mathbb{P}(U > u)$. The survival copula $C_d^S$ associated to a copula $C_d$ is defined as the copula of the random vector $\mathbf{1} - U$, that is,

$$C_d^S(u) = \mathbb{P}(\mathbf{1} - U \le u) = \overline{C}_d(\mathbf{1} - u).\tag{1}$$

Let $\pi$ be a permutation of the set of indices $\{1, \dots, d\}$, i.e., $\pi : \{1, \dots, d\} \to \{1, \dots, d\}$. The copula $C_d^\pi$ is defined from a copula $C_d$ as [10]

$$C_d^\pi(u_1, \dots, u_d) = C_d(u_{\pi(1)}, \dots, u_{\pi(d)}), \quad \forall u \in [0,1]^d.$$

At every point of the unit hypercube $[0,1]^d$, the value of a copula $C_d$ is restricted by the lower Fréchet bound $W_d(u) = \max\bigl(\sum_{j=1}^d u_j - d + 1, 0\bigr)$ and the upper Fréchet bound $M_d(u) = \min(u_1, \dots, u_d)$. In other words,

$$W_d(u) \le C_d(u) \le M_d(u), \quad \forall u \in [0,1]^d.$$

The function $M_d$ is a copula for any $d \ge 2$ and is often called the comonotonicity copula, since it is the copula of a random vector $X$ any component of which can be expressed as a strictly increasing function of any other component. If the components of a random vector $X$ are mutually independent, the copula of $X$ is the independence copula $\Pi_d(u) = \prod_{j=1}^d u_j$.

The copula associated to any subset of components of a $d$-dimensional random vector $X$ is called a marginal copula of $C_d$. A marginal copula can be calculated from the original copula by setting the arguments corresponding to the omitted components to 1. For example, the marginal copula $C_{d-1}^{(1,\dots,d-1)}$ of $(X_1, \dots, X_{d-1})$ can be obtained as

$$C_{d-1}^{(1,\dots,d-1)}(u_1, \dots, u_{d-1}) = C_d(u_1, \dots, u_{d-1}, 1),$$

where $C_d$ is the copula of $X$. Marginal copulas can be used to calculate the survival function $\overline{C}_d$ of a copula $C_d$, since

$$\overline{C}_d(u) = 1 + \sum_{j=1}^d (-1)^j \sum_{1 \le k_1 < \dots < k_j \le d} C_j^{(k_1, \dots, k_j)}(u_{k_1}, \dots, u_{k_j}).\tag{2}$$
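The inclusion–exclusion identity (2) is easy to check numerically. The sketch below (an illustration with our own function names, not code from the paper) evaluates $\overline{C}_d$ from an arbitrary copula function by setting the unused coordinates to 1, and compares the result against the two cases where the survival function is known in closed form, $\Pi_d$ and $M_d$.

```python
from itertools import combinations
from math import prod

def survival_from_copula(C, u):
    # Eq. (2): P(U > u) = 1 + sum_j (-1)^j sum_{k1<...<kj} C^{(k1,...,kj)}(u_{k1},...,u_{kj}),
    # where each marginal copula is C with the remaining coordinates set to 1.
    d = len(u)
    total = 1.0
    for j in range(1, d + 1):
        for idx in combinations(range(d), j):
            total += (-1) ** j * C([u[k] if k in idx else 1.0 for k in range(d)])
    return total

indep = lambda v: prod(v)   # independence copula Pi_d
comon = lambda v: min(v)    # comonotonicity copula M_d

u = [0.3, 0.5, 0.8]
print(survival_from_copula(indep, u))  # ~ (1-0.3)(1-0.5)(1-0.8) = 0.07
print(survival_from_copula(comon, u))  # ~ 1 - max(u) = 0.2
```

For $\Pi_d$ the survival function factors as $\prod_j (1-u_j)$, and for $M_d$ it equals $1 - \max(u_1,\dots,u_d)$, which is what the inclusion–exclusion sum recovers.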

#### *2.2. Classes of Archimedean and Extreme-Value Copulas*

In the study here, we pay particular attention to two classes of copulas: multivariate extreme-value copulas and multivariate Archimedean copulas.

**Definition 1.** *A d-variate copula Cd is called an extreme-value copula if it satisfies*

$$C_d(u_1, \dots, u_d) = \left[ C_d\left( u_1^{1/m}, \dots, u_d^{1/m} \right) \right]^m$$

*for every integer $m \ge 1$ and $u \in [0,1]^d$.*
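The max-stability condition of Definition 1 can be checked numerically for a concrete family. A minimal sketch (our own function name), assuming the standard bivariate Gumbel-Hougaard copula $C(u,v) = \exp\{-[(-\log u)^\theta + (-\log v)^\theta]^{1/\theta}\}$, whose generator reappears in Example 3:

```python
from math import exp, log

def gumbel2(u, v, theta):
    # Bivariate Gumbel-Hougaard copula; an extreme-value copula for theta >= 1.
    return exp(-((-log(u)) ** theta + (-log(v)) ** theta) ** (1.0 / theta))

# Max-stability of Definition 1: C(u1, u2) = [C(u1^(1/m), u2^(1/m))]^m.
theta, m = 2.5, 7
u, v = 0.4, 0.7
lhs = gumbel2(u, v, theta)
rhs = gumbel2(u ** (1.0 / m), v ** (1.0 / m), theta) ** m
print(abs(lhs - rhs))  # numerically zero
```

The identity holds exactly here because the stable tail dependence function of this family is homogeneous of order 1, so raising the arguments to the power $1/m$ scales the exponent by $1/m$.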

This is only one of several equivalent ways to define extreme-value copulas; for other definitions and properties see, for example, ref. [11]. Every extreme-value copula $C_d$ can be expressed in terms of a so-called stable tail dependence function $\ell_d : [0,\infty)^d \to [0,\infty)$ as

$$C_d(u_1, \dots, u_d) = \exp(-\ell_d(-\log u_1, \dots, -\log u_d)).\tag{3}$$

Denote by $\Delta_{d-1}$ the unit simplex in $\mathbb{R}^d$,

$$\Delta\_{d-1} = \left\{ (w\_1, \dots, w\_d) \in [0, \infty)^d : w\_1 + \dots + w\_d = 1 \right\}.$$

Every extreme-value copula can be equivalently expressed in terms of a Pickands dependence function $A_d : \Delta_{d-1} \to [1/d, 1]$ as

$$\begin{aligned} \mathbb{C}\_d(u\_1, \dots, u\_d) &= \exp\left[ \left( \sum\_{j=1}^d \log u\_j \right) A\_d \left( \frac{\log u\_1}{\sum\_{j=1}^d \log u\_j}, \dots, \frac{\log u\_d}{\sum\_{j=1}^d \log u\_j} \right) \right] \\ &= \left( \prod\_{j=1}^d u\_j \right)^{A\_d \left( \frac{\log u\_1}{\sum\_{j=1}^d \log u\_j}, \dots, \frac{\log u\_d}{\sum\_{j=1}^d \log u\_j} \right)} \end{aligned} \tag{4}$$

The function $A_d$ is the restriction of the function $\ell_d$ to the unit simplex and is given by

$$A\_d \left( \frac{\mathbf{x}\_1}{\sum\_{j=1}^d \mathbf{x}\_j}, \dots, \frac{\mathbf{x}\_d}{\sum\_{j=1}^d \mathbf{x}\_j} \right) = \frac{1}{\mathbf{x}\_1 + \dots + \mathbf{x}\_d} \ell\_d(\mathbf{x}\_1, \dots, \mathbf{x}\_d). \tag{5}$$

Further, $A_d$ is convex and satisfies $\max(w_1, \dots, w_d) \le A_d(w_1, \dots, w_d) \le 1$ for $w = (w_1, \dots, w_d) \in \Delta_{d-1}$. The comonotonicity copula $M_d$ and the independence copula $\Pi_d$ are both extreme-value copulas, with respective Pickands dependence functions $A_d(w) = \max(w_1, \dots, w_d)$ and $A_d(w) = 1$, i.e., the lower and upper bounds above.
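Representation (4) can be sanity-checked numerically. The following sketch (function names are ours) builds an extreme-value copula from a Pickands dependence function; the two boundary functions $A_d(w) = 1$ and $A_d(w) = \max(w_1,\dots,w_d)$ recover $\Pi_d$ and $M_d$, while the Gumbel-type function $A_d(w) = (\sum_j w_j^\theta)^{1/\theta}$ gives an intermediate case.

```python
from math import exp, log

def ev_copula_from_pickands(u, A):
    # Eq. (4): C(u) = exp( (sum_j log u_j) * A(log u_1 / S, ..., log u_d / S) ),
    # with S = sum_j log u_j; valid for u_j in (0, 1).
    s = sum(log(x) for x in u)
    w = tuple(log(x) / s for x in u)
    return exp(s * A(w))

theta = 3.0
A_gumbel = lambda w: sum(x ** theta for x in w) ** (1.0 / theta)  # Gumbel-type Pickands fn
A_indep = lambda w: 1.0                                           # independence
A_comon = lambda w: max(w)                                        # comonotonicity

u = (0.2, 0.6, 0.9)
print(ev_copula_from_pickands(u, A_indep))  # prod(u) = 0.108
print(ev_copula_from_pickands(u, A_comon))  # min(u) = 0.2
print(ev_copula_from_pickands(u, A_gumbel)) # between the two bounds
```

The Fréchet-type ordering $\Pi_d(u) \le C_d(u) \le M_d(u)$ shows up directly in the three printed values.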

Note that if $A_d(1/d, \dots, 1/d) = 1/d$, then the corresponding copula must be the comonotonicity copula $M_d$. Indeed, if $A_d(1/d, \dots, 1/d) = 1/d$, it follows from (4) that $C_d(u, \dots, u) = u$ for every $u \in (0,1)$. For any copula $C_d$, we have $C_d(u) \le M_d(u)$ for all $u \in [0,1]^d$ (the upper Fréchet bound), while monotonicity gives $C_d(u) \ge C_d(\min(u_1, \dots, u_d), \dots, \min(u_1, \dots, u_d))$, where the latter quantity equals $\min(u_1, \dots, u_d)$ in this case. Consequently, $C_d(u) \ge M_d(u)$ for all $u \in [0,1]^d$ and, hence, $C_d = M_d$.

Similarly, if $A_d(1/d, \dots, 1/d) = 1$, then the corresponding copula $C_d$ must be the independence copula $\Pi_d$. To see this, first suppose that there exists a point $w = (w_1, \dots, w_{d-1}, 1 - \sum_{j=1}^{d-1} w_j) \in \Delta_{d-1}$ such that $A_d(w) = c < 1$. Now, define a point $z \in \Delta_{d-1}$ by setting $z_j = (1 - w_j)/(d-1)$ for $j = 1, \dots, d-1$ and $z_d = 1 - \sum_{j=1}^{d-1} z_j = \sum_{j=1}^{d-1} w_j/(d-1)$. Because $A_d$ is a convex function, then

$$1 = A\_d\left(\frac{1}{d}, \dots, \frac{1}{d}\right) = A\_d\left(\frac{1}{d}w + \left(1 - \frac{1}{d}\right)z\right) \le \frac{1}{d}A\_d(w) + \frac{d-1}{d}A\_d(z) \le \frac{c+d-1}{d} < 1$$

which is a contradiction. This means that $A_d(w) = 1$ for every $w \in \Delta_{d-1}$. Immediately from (4), we get that $C_d(u) = \prod_{j=1}^d u_j$ for every $u \in [0,1]^d$ and, hence, $C_d = \Pi_d$.

Finally, from Definition 1, it follows that the marginal copula of an extreme-value copula is also an extreme-value copula.

We next provide an illustrative example.

**Example 1.** *Let $C_d$ be the d-variate extreme-value copula of $(X_1, \dots, X_d)$ and $C_{d+1}$ be the $(d+1)$-variate copula of $(X_1, \dots, X_d, X_{d+1})$, where $X_{d+1}$ is independent of $(X_1, \dots, X_d)$, that is,*

$$
C_{d+1}(u_1, \dots, u_d, u_{d+1}) = C_d(u_1, \dots, u_d)\, u_{d+1}.
$$

*Subsequently, from Definition 1, $C_{d+1}$ is also an extreme-value copula. The stable tail dependence function $\ell_{d+1}$ can be expressed, using* (3)*, as*

$$\ell\_{d+1}(\mathbf{x}\_1, \dots, \mathbf{x}\_{d+1}) = -\log(\mathbb{C}\_{d+1}(e^{-\mathbf{x}\_1}, \dots, e^{-\mathbf{x}\_{d+1}})) = \ell\_d(\mathbf{x}\_1, \dots, \mathbf{x}\_d) + \mathbf{x}\_{d+1}.$$

*Then from* (5)

$$A\_{d+1} \left( \frac{\mathbf{x}\_1}{\sum\_{j=1}^{d+1} \mathbf{x}\_j}, \dots, \frac{\mathbf{x}\_{d+1}}{\sum\_{j=1}^{d+1} \mathbf{x}\_j} \right) = \frac{\left(\sum\_{j=1}^{d} \mathbf{x}\_j\right) A\_d \left(\frac{\mathbf{x}\_1}{\sum\_{j=1}^{d} \mathbf{x}\_j}, \dots, \frac{\mathbf{x}\_d}{\sum\_{j=1}^{d} \mathbf{x}\_j}\right) + \mathbf{x}\_{d+1}}{\sum\_{j=1}^{d+1} \mathbf{x}\_j}$$

*and in particular*

$$A_{d+1}\left(\frac{1}{d+1},\ldots,\frac{1}{d+1}\right) = \frac{1}{d+1}\left(d\, A_d\left(\frac{1}{d},\ldots,\frac{1}{d}\right) + 1\right).$$

Another class of copulas that we consider is the class of multivariate Archimedean copulas, thoroughly discussed, for example, in [12].

**Definition 2** (Archimedean copula)**.** *A non-increasing and continuous function $\psi : [0,\infty) \to [0,1]$ which satisfies $\psi(0) = 1$ and $\lim_{x\to\infty} \psi(x) = 0$, and which is strictly decreasing on $[0, \inf\{x : \psi(x) = 0\})$, is called an Archimedean generator. A d-dimensional copula $C_d$ is called Archimedean if, for every $u \in [0,1]^d$, it admits the representation*

$$\mathcal{C}\_d(\mu) = \psi \left[ \psi^{-1}(\mu\_1) + \dots + \psi^{-1}(\mu\_d) \right],$$

*for some Archimedean generator $\psi$ and its inverse $\psi^{-1} : (0,1] \to [0,\infty)$, where, by convention, $\psi(\infty) = 0$ and $\psi^{-1}(0) = \inf\{u : \psi(u) = 0\}$.*

In [12], the authors also provide a characterization of an Archimedean generator leading to some Archimedean copula by means of the following definition and proposition.

**Definition 3** (*d*-monotone function)**.** *A real function f is called d-monotone on the interval* [0, ∞)*, where d* ≥ 2*, if it is continuous on* [0, ∞) *and differentiable on* (0, ∞) *up to the order d* − 2 *and the derivatives satisfy*

$$(-1)^k f^{(k)}(x) \ge 0, \text{ for } k = 0, 1, \dots, d - 2$$

*for any $x \in (0,\infty)$, and if further $(-1)^{d-2} f^{(d-2)}$ is non-increasing and convex on $(0,\infty)$. If $f$ has derivatives of all orders on $(0,\infty)$ and $(-1)^k f^{(k)}(x) \ge 0$ for any $x \in (0,\infty)$ and any $k = 0, 1, \dots$, then $f$ is called completely monotone.*

This notion is precisely what is needed to characterize which Archimedean generators generate copulas.

**Proposition 1** (Characterization of Archimedean copulas)**.** *Let ψ be an Archimedean generator and d* ≥ 2*. Subsequently, Cd* : [0, 1] *<sup>d</sup>* <sup>→</sup> [0, 1] *given by*

$$\mathbb{C}\_d(\mathfrak{u}) = \psi \left[ \psi^{-1}(\mathfrak{u}\_1) + \dots + \psi^{-1}(\mathfrak{u}\_d) \right],$$

*is a d-dimensional copula if and only if ψ is d-monotone on* [0, ∞)*.*

**Corollary 1.** *An Archimedean generator ψ can generate a copula in any dimension if and only if it is completely monotone.*

Most of the well-known Archimedean generators are completely monotone; these are also strict generators, for which $\psi^{-1}(0) = \infty$. However, the range of admissible parameter values possibly depends on the dimension. We illustrate this with the Clayton copula family.
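As a small illustration of Proposition 1 (a sketch with our own helper names, not code from the paper), the Clayton generator of Example 2 below can be plugged into the Archimedean representation and compared with the familiar bivariate Clayton closed form $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$, valid for $\theta > 0$:

```python
def clayton_psi(t, theta):
    # Clayton generator; strict (completely monotone) for theta > 0.
    return (1.0 + theta * t) ** (-1.0 / theta)

def clayton_psi_inv(u, theta):
    # Inverse of the Clayton generator on (0, 1].
    return (u ** (-theta) - 1.0) / theta

def archimedean(u, psi, psi_inv):
    # C(u) = psi( psi^{-1}(u_1) + ... + psi^{-1}(u_d) )
    return psi(sum(psi_inv(x) for x in u))

theta = 2.0
u = (0.3, 0.6)
c = archimedean(u, lambda t: clayton_psi(t, theta), lambda x: clayton_psi_inv(x, theta))
closed = (u[0] ** -theta + u[1] ** -theta - 1.0) ** (-1.0 / theta)
print(abs(c - closed))  # numerically zero
```

The agreement is exact because $\psi^{-1}(u_1) + \psi^{-1}(u_2) = (u_1^{-\theta} + u_2^{-\theta} - 2)/\theta$, and applying $\psi$ absorbs the factor $\theta$ again.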

**Example 2.** *Let $C_d$ be the d-variate Clayton copula with parameter $\theta$. In the bivariate case, its generator is defined as $\psi_\theta(t) = (1 + \theta t)_+^{-1/\theta}$ with $\theta \ge -1$. However, $\psi_\theta$ is d-monotone only for $\theta \ge -1/(d-1)$ (see [12]). That is, if we want to consider the Clayton copula in any dimension, we have to restrict ourselves to $\theta \ge 0$, where the case $\theta = 0$ is defined as the limit $\theta \searrow 0$ and, in fact, corresponds to the independence copula.*

*Figure 1 shows how the generator of the Clayton family depends on the parameter $\theta$. When $\theta < 0$ and, thus, $\psi_\theta$ is not completely monotone, there exists $t \in (0,\infty)$ such that $\psi_\theta(t) = 0$. Otherwise, for $\theta \ge 0$, $\lim_{t\to\infty} \psi_\theta(t) = 0$, but $\psi_\theta(t) > 0$ for every $t \in (0,\infty)$.*

In Figure 1, we see the most common shape of the generator function. The following lemma focuses on the behavior of generators close to *t* = 0 and is useful later in this text.

**Figure 1.** Generator of Clayton copula with parameters −0.3 (dash-dotted line), 1 (solid line), 5 (dashed line) and 10 (dotted line).

**Lemma 1.** *Let $\psi$ be an Archimedean generator that generates a copula, differentiable on $(0, \epsilon)$ for some $\epsilon > 0$. Then $\psi'(0^+) = \lim_{t\searrow 0} \psi'(t)$ can only take values in $[-\infty, 0)$.*

**Proof.** It can be easily shown that $\psi$ is a convex function on $[0,\infty)$ [13] (Theorem 6.3.3). That means that $\psi'$ is a non-decreasing function on $(0,\infty)$. Additionally, from Definition 2, $\psi$ is strictly decreasing on $[0, \inf\{x : \psi(x) = 0\})$. That is, $\psi'$ is negative on $(0, \inf\{x : \psi(x) = 0\})$, which implies that $\psi'(0^+) \le 0$. Suppose now that $\psi'(0^+) = 0$. Then, by the negativity of $\psi'$ on $(0, \inf\{x : \psi(x) = 0\})$, $\psi'$ would have to decrease somewhere, which contradicts the fact that $\psi'$ is non-decreasing on $(0,\infty)$.

The following example shows that $\psi'(0^+)$ can be equal to $-\infty$.

**Example 3.** *Let $\psi_\theta(t) = \exp(-t^{1/\theta})$ for $\theta \ge 1$, which is the generator of the Gumbel-Hougaard family. Then*

$$\psi\_{\theta}'(0^+) = \lim\_{t \searrow 0} \frac{-1}{\theta} \exp(-t^{1/\theta}) t^{1/\theta - 1} = \begin{cases} -1, & \text{if } \theta = 1, \\ -\infty, & \text{if } \theta > 1. \end{cases}$$

*Recall that $\theta = 1$ corresponds to the independence copula. Figure 2 shows how the generator of the Gumbel-Hougaard family depends on the parameter $\theta$.*
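The limit in Example 3 is also easy to see numerically (a quick illustrative check, with our own function name): evaluating $\psi_\theta'(t)$ at shrinking $t$ shows convergence to $-1$ for $\theta = 1$ and divergence to $-\infty$ for $\theta > 1$.

```python
from math import exp

def gumbel_psi_prime(t, theta):
    # Derivative of psi_theta(t) = exp(-t^{1/theta}):
    # psi'(t) = -(1/theta) * exp(-t^{1/theta}) * t^{1/theta - 1}
    return -(1.0 / theta) * exp(-t ** (1.0 / theta)) * t ** (1.0 / theta - 1.0)

for t in (1e-2, 1e-4, 1e-6):
    print(gumbel_psi_prime(t, 1.0), gumbel_psi_prime(t, 2.0))
# theta = 1: values approach -1; theta = 2: values blow up towards -infinity
```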

**Figure 2.** Generator of the Gumbel-Hougaard copula with parameters 1 (dash-dotted line), 2 (solid line), 5 (dashed line) and 10 (dotted line).

#### **3. Tail Coefficients**

In the bivariate case (i.e., *d* = 2), lower and upper tail coefficients are defined, respectively, as

$$\lambda_L(C_2) = \lim_{u \searrow 0} \mathbb{P}(U_2 \le u \,|\, U_1 \le u) = \lim_{u \searrow 0} \mathbb{P}(U_1 \le u \,|\, U_2 \le u) = \lim_{u \searrow 0} \frac{C_2(u,u)}{u},$$

$$\lambda_U(C_2) = \lim_{u \nearrow 1} \mathbb{P}(U_2 > u \,|\, U_1 > u) = \lim_{u \nearrow 1} \mathbb{P}(U_1 > u \,|\, U_2 > u) = \lim_{u \nearrow 1} \frac{1 - 2u + C_2(u,u)}{1 - u},$$

if the limits above exist. Throughout the text, when defining these and other tail coefficients, we will assume the existence of the limits involved. The general idea behind tail coefficients is to measure how likely it is that one random variable is extreme given that another variable is extreme. These coefficients take values between 0 and 1, since they are limits of probabilities.

For extreme-value copulas, tail coefficients can be expressed as functions of Pickands dependence function *A*<sup>2</sup> corresponding to the copula *C*<sup>2</sup> as

$$\lambda_L(C_2) = \begin{cases} 1 & \text{if } A_2(1/2, 1/2) = 1/2, \\ 0 & \text{otherwise,} \end{cases} \qquad \lambda_U(C_2) = 2(1 - A_2(1/2, 1/2)), \tag{6}$$

see [11]. That is, unless the studied copula is the comonotonicity copula, extreme-value copulas do not possess any lower tail dependence. Recall that, when *A*2(1/2, 1/2) = 1, the corresponding copula must be the independence copula Π2. Therefore, an extreme-value copula possesses upper tail dependence, unless the copula is the independence copula.

In the case of Archimedean copulas, the tail coefficients can be expressed via the corresponding generator $\psi$ as

$$\lambda_L(C_2) = 2 \lim_{u \searrow 0} \frac{\psi'(2\psi^{-1}(u))}{\psi'(\psi^{-1}(u))}, \qquad \lambda_U(C_2) = 2 - 2 \lim_{u \nearrow 1} \frac{\psi'(2\psi^{-1}(u))}{\psi'(\psi^{-1}(u))} = 2 - 2 \lim_{t \searrow 0} \frac{\psi'(2t)}{\psi'(t)},$$

see [14]. Note that both tail coefficients depend only on the behavior of the generator $\psi$ near the points $0$ and $\psi^{-1}(0)$. Recall that, in the case of strict Archimedean generators, the latter is equal to $\infty$.
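These limit expressions can be evaluated numerically for a concrete generator. The sketch below (our own helper names; for a strict generator $\psi^{-1}(u) \to \infty$ as $u \searrow 0$, so the $\lambda_L$ limit becomes a limit at $t \to \infty$) uses the Clayton generator, whose tail coefficients are known to be $\lambda_L = 2^{-1/\theta}$ and $\lambda_U = 0$ for $\theta > 0$:

```python
def clayton_psi_prime(t, theta):
    # psi(t) = (1 + theta t)^{-1/theta}, so psi'(t) = -(1 + theta t)^{-1/theta - 1}
    return -(1.0 + theta * t) ** (-1.0 / theta - 1.0)

theta = 2.0
# lambda_L = 2 * lim_{t -> infinity} psi'(2t) / psi'(t)  (strict generator)
t_big = 1e8
lam_L = 2.0 * clayton_psi_prime(2 * t_big, theta) / clayton_psi_prime(t_big, theta)
# lambda_U = 2 - 2 * lim_{t -> 0} psi'(2t) / psi'(t)
t_small = 1e-8
lam_U = 2.0 - 2.0 * clayton_psi_prime(2 * t_small, theta) / clayton_psi_prime(t_small, theta)
print(lam_L, 2.0 ** (-1.0 / theta))  # both ~ 0.7071: Clayton lower tail dependence
print(lam_U)                          # ~ 0: Clayton has no upper tail dependence
```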

Given their meaning and mathematical expression, tail coefficients cannot be generalized to general dimension $d \ge 2$ in a straightforward and unique way. We first propose a set of desirable properties that are expected to hold for any multivariate tail coefficient $t_d : \mathrm{Cop}(d) \to \mathbb{R}$ and for any $d$-variate copulas $C_d$ and $C_{d,m}$, $m = 1, 2, \dots$. The following properties are stated under the working condition that all tail coefficients ($t_d(C_d)$, $t_{d+1}(C_{d+1})$, $t_d(C_{d,m})$, and so on) exist.


(*T*1) (Normalization) $t_d(\Pi_d) = 0$ and $t_d(M_d) = 1$.

(*T*2) (Continuity) If $C_{d,m} \to C_d$ uniformly on $[0,1]^d$ as $m \to \infty$, then $t_d(C_{d,m}) \to t_d(C_d)$.

(*T*3) (Permutation invariance) $t_d(C_d^\pi) = t_d(C_d)$ for every permutation $\pi$ of $\{1, \dots, d\}$.

(*T*4) For $X_{d+1}$ independent of $(X_1, \dots, X_d)$,

$$t_d(C_d) \ge t_{d+1}(C_{d+1}).$$

Property (*T*4) could be formulated in a slightly stricter way, as

(*T*4′) For $X_{d+1}$ independent of $(X_1, \dots, X_d)$, there exists a constant $k_d(t_d) \in [0,1]$, not depending on $C_d$, such that

$$t\_{d+1}(\mathbb{C}\_{d+1}) = k\_d(t\_d) \cdot t\_d(\mathbb{C}\_d).$$

Because both lower and upper tail dependence are of interest, we usually consider $t_d$ as having two versions, $t_{U,d}$ and $t_{L,d}$, focusing on upper tail dependence (variables simultaneously large) or lower tail dependence (variables simultaneously small), respectively. Thus, we can also consider the following property:

$$(T_5) \quad \text{(Duality)} \quad t_{L,d}(C_d^S) = t_{U,d}(C_d).$$

In general, some of the desirable properties above are easy to enforce. If one starts with a candidate coefficient $t_d^*$, property (*T*1) can be achieved by defining

$$t\_d(\mathbb{C}\_d) = \frac{t\_d^\*(\mathbb{C}\_d) - t\_d^\*(\Pi\_d)}{t\_d^\*(M\_d) - t\_d^\*(\Pi\_d)}.$$

Property (*T*3) can be achieved by averaging the candidate coefficient $t_d^*$ over all permutations:

$$t\_d(\mathcal{C}\_d) = \frac{1}{d!} \sum\_{\pi \in \mathcal{S}\_d} t\_d^\*(\mathcal{C}\_d^{\pi}),$$

where $S_d$ denotes the set of all permutations of $\{1, \dots, d\}$. Note, however, that, especially in high dimensions, this significantly increases the computational complexity. In the case of property (*T*5), we can simply use it to define an upper tail coefficient from the lower tail one (or the other way around).

In the following, we briefly review multivariate tail coefficients proposed in the literature and elaborate on their behavior with respect to the desirable properties (*T*1)–(*T*5). For brevity of presentation, we refer to (*T*4) or its variant (*T*4′) as the "addition property". To simplify the notation, the subscript $d$ of $t_d$, denoting the dimension, will sometimes be omitted when the dimension is clear from the argument of the functional $t$.

#### *3.1. Frahm's Extremal Dependence Coefficient*

Frahm (see [3]) considered lower and upper extremal dependence coefficients $\varepsilon_L$ and $\varepsilon_U$, respectively, defined as

$$\begin{split} \varepsilon_L(C_d) &= \lim_{u \searrow 0} \mathbb{P}(U_{\max} \le u \,|\, U_{\min} \le u) = \lim_{u \searrow 0} \frac{\mathbb{P}(U_{\max} \le u)}{\mathbb{P}(U_{\min} \le u)} = \lim_{u \searrow 0} \frac{C_d(u\mathbf{1})}{1 - \overline{C}_d(u\mathbf{1})}, \\ \varepsilon_U(C_d) &= \lim_{u \nearrow 1} \mathbb{P}(U_{\min} > u \,|\, U_{\max} > u) = \lim_{u \nearrow 1} \frac{\mathbb{P}(U_{\min} > u)}{\mathbb{P}(U_{\max} > u)} = \lim_{u \nearrow 1} \frac{\overline{C}_d(u\mathbf{1})}{1 - C_d(u\mathbf{1})}, \end{split} \tag{7}$$

given that the limits exist, where $U_{\max} = \max(U_1, \dots, U_d)$ and $U_{\min} = \min(U_1, \dots, U_d)$. In the bivariate case, these coefficients are not equal to $\lambda_L$ and $\lambda_U$, respectively. More specifically, for any copula $C_2$ (see [3]),

$$
\varepsilon_L(C_2) = \frac{\lambda_L(C_2)}{2 - \lambda_L(C_2)}, \qquad \varepsilon_U(C_2) = \frac{\lambda_U(C_2)}{2 - \lambda_U(C_2)}.
$$

Thus, we can consider them more as a different type of tail dependence coefficient than as a generalization of the bivariate tail coefficients.

For extreme-value copulas, the extremal dependence coefficients can be stated in terms of the Pickands dependence function. Let $C_d$ be an extreme-value copula with Pickands dependence function $A_d$, and denote the Pickands dependence function of the marginal copula $C_j^{(k_1,\dots,k_j)}$ by $A_j^{(k_1,\dots,k_j)}$. Subsequently,

$$\begin{aligned} C_d(t, \dots, t) &= \exp\{d \log(t)\, A_d(1/d, \dots, 1/d)\} = t^{d A_d(1/d, \dots, 1/d)}, \\ \overline{C}_d(t, \dots, t) &= 1 + \sum_{j=1}^d (-1)^j \sum_{1 \le k_1 < \dots < k_j \le d} t^{j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j)} \end{aligned} \tag{8}$$

$$= 1 + \sum_{j=1}^d (-1)^j \sum_{1 \le k_1 < \dots < k_j \le d} t^{j A_d(w_1, \dots, w_d)}, \tag{9}$$

where $w_\ell = 1/j$ if $\ell \in \{k_1, \dots, k_j\}$ and $w_\ell = 0$ otherwise. As opposed to (8), expression (9) only involves the overall $d$-dimensional Pickands dependence function. This might be helpful, for example, for estimation, since then not all of the lower-dimensional Pickands dependence functions in (8) need to be estimated.

Thus, for the lower extremal dependence coefficient, one obtains

$$\varepsilon_L(C_d) = \lim_{t \searrow 0} \frac{t^{d A_d(1/d, \dots, 1/d)}}{\sum_{j=1}^d (-1)^{j+1} \sum_{1 \le k_1 < \dots < k_j \le d} t^{j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j)}} = \begin{cases} 1 & \text{if } A_d(1/d, \dots, 1/d) = 1/d, \\ 0 & \text{otherwise,} \end{cases} \tag{10}$$

because, unless $A_d(1/d, \dots, 1/d) = 1/d$, the polynomial (in $t$) in the denominator contains terms of lower degree than the polynomial in the numerator. We can see that this behavior resembles $\lambda_L$ for bivariate extreme-value copulas, since the only extreme-value copula possessing lower tail dependence is the comonotonicity copula.

For the upper extremal dependence coefficient, we can calculate

$$\begin{split} \varepsilon_U(C_d) &= \lim_{t \nearrow 1} \frac{1 + \sum_{j=1}^d (-1)^j \sum_{1 \le k_1 < \dots < k_j \le d} t^{j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j)}}{1 - t^{d A_d(1/d, \dots, 1/d)}} \\ &= \lim_{t \nearrow 1} \frac{\sum_{j=1}^d (-1)^j \sum_{1 \le k_1 < \dots < k_j \le d} j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j)\, t^{j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j) - 1}}{-d A_d(1/d, \dots, 1/d)\, t^{d A_d(1/d, \dots, 1/d) - 1}} \\ &= \frac{\sum_{j=1}^d (-1)^{j+1} \sum_{1 \le k_1 < \dots < k_j \le d} j A_j^{(k_1, \dots, k_j)}(1/j, \dots, 1/j)}{d A_d(1/d, \dots, 1/d)} \\ &= \frac{\sum_{j=1}^d (-1)^{j+1} \sum_{1 \le k_1 < \dots < k_j \le d} j A_d(w_1, \dots, w_d)}{d A_d(1/d, \dots, 1/d)}, \end{split} \tag{11}$$

where, as above, $w_\ell = 1/j$ if $\ell \in \{k_1, \dots, k_j\}$ and $w_\ell = 0$ otherwise.

We next look into the tail coefficients (7) for Archimedean copulas. Let $\{C_d\}_{d \ge 2}$ be a sequence of $d$-dimensional Archimedean copulas with (the same) generator $\psi$. Subsequently,

$$\begin{aligned} \mathbb{C}\_d(\mathfrak{u}, \dots, \mathfrak{u}) &= \psi(d\psi^{-1}(\mathfrak{u})), \\ \overline{\mathbb{C}}\_d(\mathfrak{u}, \dots, \mathfrak{u}) &= 1 + \sum\_{j=1}^d (-1)^j \binom{d}{j} \psi(j\psi^{-1}(\mathfrak{u})). \end{aligned}$$

The corresponding derivatives, if they exist, are

$$\begin{aligned} \mathcal{C}'\_d(\mu, \dots, \mu) &= \psi'(d\psi^{-1}(\mu))d(\psi^{-1})'(\mu), \\ \overline{\mathcal{C}}'\_d(\mu, \dots, \mu) &= \sum\_{j=1}^d (-1)^j \binom{d}{j} \psi'(j\psi^{-1}(\mu)) j(\psi^{-1})'(\mu). \end{aligned}$$

Afterwards, the extremal dependence coefficients can be expressed as

$$\begin{split} \varepsilon_L(C_d) &= \lim_{u \searrow 0} \frac{C_d(u\mathbf{1})}{1 - \overline{C}_d(u\mathbf{1})} = \lim_{u \searrow 0} \frac{\psi(d\psi^{-1}(u))}{\sum_{j=1}^d (-1)^{j+1} \binom{d}{j} \psi(j\psi^{-1}(u))} \\ &= \lim_{u \searrow 0} \frac{\psi'(d\psi^{-1}(u))\, d}{\sum_{j=1}^d (-1)^{j+1} \binom{d}{j} \psi'(j\psi^{-1}(u))\, j} \end{split} \tag{12}$$

and

$$\begin{split} \varepsilon_U(C_d) &= \lim_{u \nearrow 1} \frac{\overline{C}_d(u\mathbf{1})}{1 - C_d(u\mathbf{1})} = \lim_{u \nearrow 1} \frac{1 + \sum_{j=1}^d (-1)^j \binom{d}{j} \psi(j\psi^{-1}(u))}{1 - \psi(d\psi^{-1}(u))} \\ &= \lim_{u \nearrow 1} \frac{\sum_{j=1}^d (-1)^j \binom{d}{j} \psi'(j\psi^{-1}(u))\, j}{-\psi'(d\psi^{-1}(u))\, d} = \lim_{t \searrow 0} \frac{\sum_{j=1}^d (-1)^j \binom{d}{j} \psi'(jt)\, j}{-\psi'(dt)\, d}, \end{split} \tag{13}$$

where we used L'Hospital's rule to obtain the last equation in (12) and the second equation in the derivation towards (13). Recall that $\psi^{-1}(1) = 0$ and $\psi^{-1}(0) = \inf\{u : \psi(u) = 0\}$. One can see that using L'Hospital's rule does not resolve the $0/0$ limit problem for general $\psi$; knowledge of the precise behavior of $\psi$ is thus crucial for calculating the coefficients $\varepsilon_L(C_d)$ and $\varepsilon_U(C_d)$.

As will be illustrated in Section 6, Archimedean copulas can have both extremal dependence coefficients non-zero, depending on the generator. For $\varepsilon_U$, one additional assumption on the generator $\psi$ is useful. Because (from the definition of the generator) $\lim_{u \nearrow 1} \psi^{-1}(u) = 0$, if the additional condition $\psi'(0^+) > -\infty$ is fulfilled, we get

$$\varepsilon_U(C_d) = \frac{\sum_{j=1}^d (-1)^j \binom{d}{j} \psi'(0^+)\, j}{-\psi'(0^+)\, d} = \sum_{j=1}^d (-1)^{j+1} \binom{d-1}{j-1} = 0,$$

using that, by Lemma 1, $\psi'(0^+)$ cannot be equal to zero. In other words, if $\psi'(0^+) > -\infty$, then the corresponding Archimedean copula is upper tail independent, in every dimension.

Next, we investigate which of the desirable properties (*T*1)–(*T*5) are satisfied by Frahm's extremal dependence coefficients $\varepsilon_L$ and $\varepsilon_U$.

**Proposition 2.** *Frahm's extremal dependence coefficients $\varepsilon_L$ and $\varepsilon_U$ satisfy the normalization property* (*T*1)*, the permutation invariance property* (*T*3)*, the addition property* (*T*4′) *with $k_d(\varepsilon_L) = k_d(\varepsilon_U) = 0$ for every $d \ge 2$, and* (*T*5)*.*

**Proof.** Normalization property (*T*1) follows from straightforward calculations

$$\begin{aligned} \varepsilon_L(M_d) &= \lim_{u \searrow 0} \frac{u}{1 - (1-u)} = 1, & \varepsilon_U(M_d) &= \lim_{u \nearrow 1} \frac{1-u}{1-u} = 1, \\ \varepsilon_L(\Pi_d) &= \lim_{u \searrow 0} \frac{u^d}{1 - (1-u)^d} = 0, & \varepsilon_U(\Pi_d) &= \lim_{u \nearrow 1} \frac{(1-u)^d}{1 - u^d} = 0. \end{aligned}$$

The permutation invariance property (*T*3) follows immediately from the fact that the coefficients depend only on $U_{\max}$ and $U_{\min}$, which do not depend on the order of the components of the random vector.

Look now into the addition of an independent component, i.e., property (*T*4′). To distinguish between dimensions, we use the notation $U_{\max,d} = \max(U_1, \dots, U_d)$ and $U_{\min,d} = \min(U_1, \dots, U_d)$. For $X_{d+1}$ independent of $(X_1, \dots, X_d)$, we have $\mathbb{P}(U_{\min,d+1} \le u) \ge \mathbb{P}(U_{\min,d} \le u)$ and $\mathbb{P}(U_{\max,d+1} > u) \ge \mathbb{P}(U_{\max,d} > u)$ for every $u \in [0,1]$. Further, $\mathbb{P}(U_{\max,d+1} \le u) = \mathbb{P}(U_{\max,d} \le u, U_{d+1} \le u) = u\, \mathbb{P}(U_{\max,d} \le u)$ and, similarly, $\mathbb{P}(U_{\min,d+1} > u) = \mathbb{P}(U_{\min,d} > u, U_{d+1} > u) = (1-u)\, \mathbb{P}(U_{\min,d} > u)$. Thus,

$$\begin{split} \varepsilon_L(C_{d+1}) &= \lim_{u \searrow 0} \frac{\mathbb{P}(U_{\max,d+1} \le u)}{\mathbb{P}(U_{\min,d+1} \le u)} \le \lim_{u \searrow 0} \frac{u\, \mathbb{P}(U_{\max,d} \le u)}{\mathbb{P}(U_{\min,d} \le u)} = 0 \cdot \varepsilon_L(C_d) = 0, \\ \varepsilon_U(C_{d+1}) &= \lim_{u \nearrow 1} \frac{\mathbb{P}(U_{\min,d+1} > u)}{\mathbb{P}(U_{\max,d+1} > u)} \le \lim_{u \nearrow 1} \frac{(1-u)\, \mathbb{P}(U_{\min,d} > u)}{\mathbb{P}(U_{\max,d} > u)} = 0 \cdot \varepsilon_U(C_d) = 0, \end{split}$$

which means that the addition property (*T*4′) holds with constants $k_d(\varepsilon_L) = k_d(\varepsilon_U) = 0$ for every $d \ge 2$.

We next look into the duality property (*T*5). Using relation (1) between the survival function and the survival copula, the coefficients $\varepsilon_L$ and $\varepsilon_U$ can be rewritten as

$$\begin{aligned} \varepsilon_L(C_d) &= \lim_{u \searrow 0} \frac{C_d(u\mathbf{1})}{1 - \overline{C}_d(u\mathbf{1})} = \lim_{u \searrow 0} \frac{C_d(u\mathbf{1})}{1 - C_d^S(\mathbf{1} - u\mathbf{1})}, \\ \varepsilon_U(C_d) &= \lim_{u \nearrow 1} \frac{\overline{C}_d(u\mathbf{1})}{1 - C_d(u\mathbf{1})} = \lim_{u \nearrow 1} \frac{C_d^S(\mathbf{1} - u\mathbf{1})}{1 - C_d(u\mathbf{1})}, \end{aligned}$$

and thus

$$\varepsilon_L(C_d^S) = \lim_{u \searrow 0} \frac{C_d^S(u\mathbf{1})}{1 - C_d(\mathbf{1} - u\mathbf{1})} = \lim_{v \nearrow 1} \frac{C_d^S(\mathbf{1} - v\mathbf{1})}{1 - C_d(v\mathbf{1})} = \varepsilon_U(C_d),$$

where the substitution $v = 1 - u$ was used. This proves the validity of the duality property (*T*5).

We suspect that the continuity property (*T*2) does not hold in full generality for most multivariate tail coefficients. To gain insight into this, consider the following example with a sequence of copulas $\{C_{d,m}\}$ given by

$$C_{d,m}(u) = M_d(u)\, \mathbf{1}\left\{\min\{u_1, \dots, u_d\} \le \frac{1}{m}\right\} + \left(\frac{1}{m} + \frac{\Pi_d\left(u - \frac{1}{m}\mathbf{1}\right)}{\left(1 - \frac{1}{m}\right)^{d-1}}\right) \mathbf{1}\left\{\min\{u_1, \dots, u_d\} > \frac{1}{m}\right\}.$$

Note that the distribution given by $C_{d,m}$ is uniform on the set $\left[\frac{1}{m}, 1\right]^d$ and corresponds to the upper Fréchet bound $M_d$ otherwise. Note that $C_{d,m}$ is a copula with an ordinal sum representation; see [8] (Section 3.2.2).

It is easily seen that *C<sub>d,m</sub>* → Π<sub>*d*</sub> as *m* → ∞, uniformly on [0, 1]<sup>*d*</sup>. Note that *ε<sub>L</sub>*(*C<sub>d,m</sub>*) = 1 for each *m* ∈ ℕ. On the other hand, *ε<sub>L</sub>*(Π<sub>*d*</sub>) = 0. Hence, for this sequence of copulas, the continuity property (*T*2) does not hold.
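For *d* = 2, the failure of (*T*2) in this example is easy to check numerically. A minimal sketch (Python; the function names are ours): for any fixed *m*, the ratio defining *ε<sub>L</sub>* equals 1 for all *u* ≤ 1/*m*, while for the independence copula it vanishes as *u* ↘ 0.

```python
def C2m(u, v, m):
    # bivariate ordinal-sum copula from the example: comonotone below level 1/m
    if min(u, v) <= 1.0/m:
        return min(u, v)
    return 1.0/m + (u - 1.0/m)*(v - 1.0/m)/(1.0 - 1.0/m)

def frahm_lower_ratio(C, u):
    # bivariate ratio C(u,u) / (1 - Cbar(u,u)), using Cbar(u,u) = 1 - 2u + C(u,u)
    return C(u, u) / (2.0*u - C(u, u))

u = 1e-4
for m in (10, 100, 1000):
    print(m, frahm_lower_ratio(lambda a, b: C2m(a, b, m), u))  # equals 1 for u <= 1/m
print(frahm_lower_ratio(lambda a, b: a*b, u))  # independence: u/(2-u), nearly 0
```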

However, a continuity property may hold, in general, under more specific conditions on the copula sequences. One such condition is that of a sequence of contaminated copulas, defined as follows.

Let *C<sub>d</sub>* and *B<sub>d,m</sub>*, for *m* = 1, 2, ..., be *d*-variate copulas, and let (*ε<sub>m</sub>*) be a sequence of numbers in [0, 1]. One considers the sequence of contaminated copulas

$$\mathbb{C}\_{d,m} = (1 - \varepsilon\_m)\mathbb{C}\_d + \varepsilon\_m B\_{d,m}.\tag{14}$$

Note that *C<sub>d,m</sub>* is a convex combination of the copulas *C<sub>d</sub>* and *B<sub>d,m</sub>* and, hence, is itself a copula; see, e.g., [8]. The interest is in the behavior of a tail coefficient for the sequence *C<sub>d,m</sub>* when *ε<sub>m</sub>* → 0, as *m* → ∞.

Proposition 3 establishes a continuity property for Frahm's extremal dependence coefficient.

**Proposition 3.** *Suppose that, for d-variate copulas C<sub>d</sub> and C<sub>d,m</sub>, m* = 1, 2, ..., *there exists ε* > 0*, such that*

$$\frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} \to \frac{\mathbb{C}\_d(u\mathbf{1})}{1-\overline{\mathbb{C}}\_d(u\mathbf{1})} \quad \text{uniformly on } (0,\epsilon), \text{ as } m \to \infty. \tag{15}$$

*Further assume that ε<sub>L</sub>*(*C<sub>d,m</sub>*) *exists for every m* = 1, 2, .... *Then ε<sub>L</sub>*(*C<sub>d,m</sub>*) → *ε<sub>L</sub>*(*C<sub>d</sub>*) *as m* → ∞*. In particular, condition* (15) *is satisfied for a sequence of contaminated copulas, as in* (14)*, for which ε<sub>m</sub>* → 0 *as m* → ∞*, provided that ε<sub>L</sub>*(*C<sub>d</sub>*) *exists.*

**Proof.** Assumption (15) allows us to use the Moore–Osgood theorem to interchange the limits and, thus,

$$\lim\_{m \to \infty} \varepsilon\_L(\mathbb{C}\_{d,m}) = \lim\_{m \to \infty} \lim\_{u \searrow 0} \frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} = \lim\_{u \searrow 0} \lim\_{m \to \infty} \frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} = \varepsilon\_L(\mathbb{C}\_d).$$

Suppose now that we have a sequence of contaminated copulas for which *ε<sub>m</sub>* → 0, as *m* → ∞. One then calculates

$$\begin{split} \frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} - \frac{\mathbb{C}\_{d}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d}(u\mathbf{1})} &= \frac{\mathbb{C}\_{d,m}(u\mathbf{1}) - \mathbb{C}\_{d}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} + \frac{\mathbb{C}\_{d}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} - \frac{\mathbb{C}\_{d}(u\mathbf{1})}{1-\overline{\mathbb{C}}\_{d}(u\mathbf{1})} \\ &= \frac{\varepsilon\_{m}(\mathbb{B}\_{d,m}(u\mathbf{1}) - \mathbb{C}\_{d}(u\mathbf{1}))}{1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1})} + \frac{\mathbb{C}\_{d}(u\mathbf{1})\varepsilon\_{m}(\overline{\mathbb{B}}\_{d,m}(u\mathbf{1}) - \overline{\mathbb{C}}\_{d}(u\mathbf{1}))}{(1-\overline{\mathbb{C}}\_{d,m}(u\mathbf{1}))(1-\overline{\mathbb{C}}\_{d}(u\mathbf{1}))}. \end{split} \tag{16}$$

One next realizes that max{*B<sub>d,m</sub>*(*u***1**), *C<sub>d</sub>*(*u***1**)} ≤ *u* and min{1 − *C̄<sub>d,m</sub>*(*u***1**), 1 − *C̄<sub>d</sub>*(*u***1**)} ≥ *u*. Furthermore, with the help of Formula (2) for the survival function of a copula, one gets *B̄<sub>d,m</sub>*(*u***1**) − *C̄<sub>d</sub>*(*u***1**) = *O*(*u*). Thus, one can bound

$$\left| \frac{\mathbb{C}\_{d,m}(\boldsymbol{\mu}\mathbf{1})}{1 - \overline{\mathbb{C}}\_{d,m}(\boldsymbol{\mu}\mathbf{1})} - \frac{\mathbb{C}\_{d}(\boldsymbol{\mu}\mathbf{1})}{1 - \overline{\mathbb{C}}\_{d}(\boldsymbol{\mu}\mathbf{1})} \right| \leq \frac{\varepsilon\_{m}\boldsymbol{\mu}}{\boldsymbol{\mu}} + \frac{\boldsymbol{\mu}\,\varepsilon\_{m}\boldsymbol{O}(\boldsymbol{\mu})}{\boldsymbol{\mu}^{2}} = \varepsilon\_{m}\boldsymbol{O}(1),$$

which implies (15).

Analogously, a similar result can be stated for *ε<sub>U</sub>*.
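Proposition 3 can also be illustrated numerically. In the sketch below (Python; a bivariate Clayton copula with *θ* = 1, contaminated by the comonotone copula *M*<sub>2</sub> with *ε<sub>m</sub>* = 1/*m*, is our assumed example), the ratio defining *ε<sub>L</sub>* for the contaminated copula approaches *ε<sub>L</sub>*(Clayton) = 1/3 as *m* grows.

```python
def clayton(u, v, theta=1.0):
    # bivariate Clayton copula; theta = 1 is an assumed example value
    return (u**(-theta) + v**(-theta) - 1.0)**(-1.0/theta)

def frahm_lower_ratio(C, u):
    # bivariate ratio C(u,u) / (1 - Cbar(u,u)), with Cbar(u,u) = 1 - 2u + C(u,u)
    return C(u, u) / (2.0*u - C(u, u))

def contaminated_ratio(m, u=1e-6):
    # C_{2,m} = (1 - eps_m) * Clayton + eps_m * M_2, with eps_m = 1/m
    eps_m = 1.0/m
    Cm = lambda a, b: (1.0 - eps_m)*clayton(a, b) + eps_m*min(a, b)
    return frahm_lower_ratio(Cm, u)

# eps_L(Clayton, theta=1) = 1/3; the contaminated ratios approach it as m grows
for m in (2, 10, 100, 1000):
    print(m, contaminated_ratio(m))
```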

#### *3.2. Li's Tail Dependence Parameter*

Suppose that ∅ ≠ *I<sub>h</sub>* ⊂ {1, ..., *d*} is a subset of indices such that |*I<sub>h</sub>*| = *h*, and let *J<sub>d−h</sub>* = {1, ..., *d*} \ *I<sub>h</sub>*. Li [4] (Def. 1.2) then defines the so-called lower and upper tail dependence parameters, as follows

$$\begin{aligned} \lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d) &= \lim\_{u\searrow 0} \mathbb{P}(U\_i \le u, \forall i \in I\_h \,|\, U\_j \le u, \forall j \in J\_{d-h}),\\ \lambda\_U^{I\_h|J\_{d-h}}(\mathbb{C}\_d) &= \lim\_{u\nearrow 1} \mathbb{P}(U\_i > u, \forall i \in I\_h \,|\, U\_j > u, \forall j \in J\_{d-h}), \end{aligned}$$

given that the expressions exist. It is evident that these coefficients heavily depend on the choice of the set *I<sub>h</sub>*. Additionally, this generalization includes the usual bivariate tail dependence coefficients *λ<sub>L</sub>* and *λ<sub>U</sub>*, by letting *h* = 1, *I*<sub>1</sub> = {1} and *J*<sub>1</sub> = {2}, or the other way around. Li [4] further states that *λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d</sub>*) = *λ<sub>U</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d</sub><sup>S</sup>*) and, therefore, duality property (*T*5) is fulfilled.

One can also notice that, for exchangeable copulas (i.e., copulas symmetric in their arguments), the dependence parameters are in fact functions of the cardinality *h* rather than of the particular contents of *I<sub>h</sub>*. In this case, it is worth introducing the alternative notation

$$\lambda\_L^{1,\dots,h|h+1,\dots,d}(\mathbb{C}\_d) = \lim\_{u \searrow 0} \mathbb{P}(U\_1 \le u, \dots, U\_h \le u \,|\, U\_{h+1} \le u, \dots, U\_d \le u),$$

$$\lambda\_{U}^{1,\dots,h|h+1,\dots,d}(\mathbb{C}\_d) = \lim\_{u \nearrow 1} \mathbb{P}(U\_1 > u, \dots, U\_h > u \,|\, U\_{h+1} > u, \dots, U\_d > u).$$

In [15], it is shown that these coefficients can be rewritten using one-sided derivatives of the diagonal section *δ<sub>C<sub>d</sub></sub>*(*u*) = *C<sub>d</sub>*(*u*, ..., *u*) of the corresponding copula in the following way:

$$\begin{aligned} \lambda\_L^{1,\dots,h|h+1,\dots,d}(\mathbb{C}\_d) &= \frac{\delta\_{\mathbb{C}\_d}'(0^+)}{\delta\_{(h+1)\dots d}'(0^+)},\\ \lambda\_{U}^{1,\dots,h|h+1,\dots,d}(\mathbb{C}\_d) &= \frac{\sum\_{j=1}^d (-1)^{j+1} \sum\_{1 \le k\_1 < \dots < k\_j \le d} \delta\_{k\_1\dots k\_j}'(1^-)}{\sum\_{j=1}^{d-h} (-1)^{j+1} \sum\_{h+1 \le k\_1 < \dots < k\_j \le d} \delta\_{k\_1\dots k\_j}'(1^-)}, \end{aligned}$$

where *δ<sub>k<sub>1</sub>...k<sub>j</sub></sub>* denotes the diagonal section of the copula *C<sub>j</sub><sup>(k<sub>1</sub>,...,k<sub>j</sub>)</sup>*.

Additionally, the authors in [15] comment on the connection with Frahm's extremal dependence coefficients *ε<sub>L</sub>* and *ε<sub>U</sub>*, which can be expressed as

$$\begin{aligned} \varepsilon\_{L}(\mathbb{C}\_{d}) &= \frac{\delta\_{\mathbb{C}\_d}'(0^{+})}{\sum\_{j=1}^{d}(-1)^{j+1}\sum\_{1\le k\_{1}<\dots<k\_{j}\le d}\delta\_{k\_{1}\dots k\_{j}}'(0^{+})},\\ \varepsilon\_{U}(\mathbb{C}\_{d}) &= \frac{\sum\_{j=1}^{d}(-1)^{j+1}\sum\_{1\le k\_{1}<\dots<k\_{j}\le d}\delta\_{k\_{1}\dots k\_{j}}'(1^{-})}{\delta\_{\mathbb{C}\_d}'(1^{-})}, \end{aligned}$$

if all of the above quantities exist.

De Luca and Rivieccio [6] (Def. 2) also use this way of measuring tail dependence, although they consider it for Archimedean copulas only, since there the measures can be expressed by means of the generator *ψ*, as

$$\lambda\_L^{1,\dots,h|h+1,\dots,d} = \lim\_{u\searrow 0} \frac{\mathbb{C}\_d(u,\dots,u)}{\mathbb{C}\_{d-h}^{(h+1,\dots,d)}(u,\dots,u)} = \lim\_{u\searrow 0} \frac{\psi(d\psi^{-1}(u))}{\psi((d-h)\psi^{-1}(u))}$$

$$= \lim\_{u\searrow 0} \frac{d\,\psi'(d\psi^{-1}(u))}{(d-h)\,\psi'((d-h)\psi^{-1}(u))},\tag{17}$$

$$\begin{split} \lambda\_{U}^{1,\dots,h|h+1,\dots,d} &= \lim\_{u\nearrow 1} \frac{\overline{\mathbb{C}}\_d(u,\dots,u)}{\overline{\mathbb{C}}\_{d-h}^{(h+1,\dots,d)}(u,\dots,u)} = \lim\_{u\nearrow 1} \frac{1 + \sum\_{j=1}^d (-1)^j \binom{d}{j} \psi(j\psi^{-1}(u))}{1 + \sum\_{j=1}^{d-h} (-1)^j \binom{d-h}{j} \psi(j\psi^{-1}(u))} \\ &= \lim\_{u\nearrow 1} \frac{\sum\_{j=1}^d (-1)^j \binom{d}{j}\, j\, \psi'(j\psi^{-1}(u))}{\sum\_{j=1}^{d-h} (-1)^j \binom{d-h}{j}\, j\, \psi'(j\psi^{-1}(u))},\tag{18} \end{split}$$

where l'Hospital's rule was applied to obtain the last equality in each of (17) and (18). In contrast to Frahm's coefficient, here the additional condition *ψ*′(0<sup>+</sup>) > −∞ is not helpful, since it leads to

$$\lambda\_{U}^{1,\dots,h|h+1,\dots,d} = \frac{\sum\_{j=1}^d (-1)^j \binom{d}{j} \psi'(0^+)\,j}{\sum\_{j=1}^{d-h} (-1)^j \binom{d-h}{j} \psi'(0^+)\,j} = \frac{\sum\_{j=1}^d (-1)^j \binom{d}{j}\, j}{\sum\_{j=1}^{d-h} (-1)^j \binom{d-h}{j}\, j},$$

where both the numerator and the denominator are equal to zero.
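For a concrete Archimedean family, the limit in (17) is easy to evaluate numerically. The sketch below (Python) uses the Clayton generator *ψ*(*t*) = (1 + *t*)<sup>−1/θ</sup>; the values *d* = 4, *h* = 2, *θ* = 2 are illustration choices of ours. For this family, the limit has the closed form ((*d* − *h*)/*d*)<sup>1/θ</sup>.

```python
def psi(t, theta):
    # Clayton generator psi(t) = (1 + t)^(-1/theta)
    return (1.0 + t)**(-1.0/theta)

def psi_inv(u, theta):
    return u**(-theta) - 1.0

def li_lower(d, h, theta, u=1e-4):
    # ratio psi(d * psi^{-1}(u)) / psi((d-h) * psi^{-1}(u)) at a small u, cf. (17)
    t = psi_inv(u, theta)
    return psi(d*t, theta) / psi((d - h)*t, theta)

d, h, theta = 4, 2, 2.0
approx = li_lower(d, h, theta)
exact = ((d - h)/d)**(1.0/theta)  # closed-form limit for the Clayton family
print(approx, exact)
```

Evaluating the ratio at *u* = 10⁻⁴ already reproduces the closed-form limit to high accuracy, since the convergence is of order 1/*ψ*⁻¹(*u*).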

**Proposition 4.** *Li's tail dependence parameters λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup> and λ<sub>U</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup> satisfy normalization property* (*T*1)*, addition property* (*T*4)*, and duality property* (*T*5)*.*

**Proof.** Duality property (*T*5) was shown in [4]. Normalization property (*T*1) follows from straightforward calculations using (17) and (18):

$$\lambda\_L^{I\_h|J\_{d-h}}(M\_d) = \lim\_{u\searrow 0} \frac{u}{u} = 1, \quad \lambda\_L^{I\_h|J\_{d-h}}(\Pi\_d) = \lim\_{u\searrow 0} \frac{u^d}{u^{d-h}} = 0.$$

For *λ<sub>U</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*, it follows from duality property (*T*5).

We now check property (*T*4), the addition of an independent random component. Suppose first that the added independent component belongs to the set *I<sub>h+1</sub>*. Then

$$\lambda\_L^{I\_{h+1}|J\_{d-h}}(\mathbb{C}\_{d+1}) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})\,u}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} = 0 \cdot \lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d) = 0.$$

If the added independent component belongs to the set *J<sub>d−h+1</sub>*, then, from the definition of the coefficient,

$$
\lambda\_L^{I\_h|J\_{d-h+1}}(\mathbb{C}\_{d+1}) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})\,u}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})\,u} = \lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d).
$$

Showing the addition property for *λ<sub>U</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>* is analogous.

The proof of Proposition 4 shows that, in fact, property (*T*4) is fulfilled if one distinguishes two cases. If the added independent component belongs to the set *I<sub>h+1</sub>*, then (*T*4) holds with *k<sub>d</sub>*(*λ<sub>L</sub>*) = *k<sub>d</sub>*(*λ<sub>U</sub>*) = 0 for every *d* ≥ 2. However, if the added independent component belongs to the set *J<sub>d−h+1</sub>*, then *k<sub>d</sub>*(*λ<sub>L</sub>*) = *k<sub>d</sub>*(*λ<sub>U</sub>*) = 1 for every *d* ≥ 2.

Permutation invariance (*T*3) does not hold in general. However, if one restricts attention to permutations that permute indices within *I<sub>h</sub>* and within *J<sub>d−h</sub>*, but not across these two sets, then *λ<sub>L</sub>* and *λ<sub>U</sub>* are invariant with respect to such permutations. Further, one might consider the special case *h* = *d* − 1, that is, conditioning on only one variable. Then, for any permutation *π*,

$$\lambda\_L^{I\_{d-1}|J\_1}(\mathbb{C}\_d^{\pi}) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d^{\pi}(u\mathbf{1})}{u} = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{u} = \lambda\_L^{I\_{d-1}|J\_1}(\mathbb{C}\_d) \tag{19}$$

and, analogously for *λ<sub>U</sub>*, we have *λ<sub>U</sub><sup>I<sub>d−1</sub>|J<sub>1</sub></sup>*(*C<sub>d</sub><sup>π</sup>*) = *λ<sub>U</sub><sup>I<sub>d−1</sub>|J<sub>1</sub></sup>*(*C<sub>d</sub>*).

A continuity property can be shown under a specific condition on the copula sequence as is established in Proposition 5.

**Proposition 5.** *Suppose that, for d-variate copulas C<sub>d</sub> and C<sub>d,m</sub>, m* = 1, 2, ..., *there exists ε* > 0*, such that*

$$\frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} \to \frac{\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} \quad \text{uniformly on } (0,\varepsilon), \text{ as } m \to \infty. \tag{20}$$

*Further assume that λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d,m</sub>*) *exists for every m* = 1, 2, ..., *as well as λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d</sub>*)*. Then λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d,m</sub>*) → *λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d</sub>*) *as m* → ∞*.*

*In particular, condition* (20) *holds for a sequence of contaminated copulas, see* (14)*, for which ε<sub>m</sub>* → 0*, as m* → ∞*, and*

$$\limsup\_{m \to \infty}\, \sup\_{u \in (0, \varepsilon)} \frac{B\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} < \infty,\tag{21}$$

*and λ<sub>L</sub><sup>I<sub>h</sub>|J<sub>d−h</sub></sup>*(*C<sub>d</sub>*) *exists.*

**Proof.** The first part of Proposition 5 is proven along the same lines as the proof of Proposition 3 and hence omitted here.

Consider now a sequence of contaminated copulas satisfying, in addition, (21). We need to show that (20) holds. To see this, note that, similarly as in (16), one gets

$$\frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} - \frac{\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} = \frac{\varepsilon\_{m}(B\_{d,m}(u\mathbf{1}) - \mathbb{C}\_{d}(u\mathbf{1}))}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} + \frac{\mathbb{C}\_{d}(u\mathbf{1})\,\varepsilon\_{m}(\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1}) - B\_{d-h,m}^{J\_{d-h}}(u\mathbf{1}))}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})\,\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})}.\tag{22}$$

Further note that, for all sufficiently large *m* and all *u* ∈ (0, *ε*),

$$\frac{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} \le \frac{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})}{(1-\varepsilon\_m)\,\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} = \frac{1}{1-\varepsilon\_m} \le 2.\tag{23}$$

Combining (21), (22) and (23) now yields that (for all sufficiently large *m*)

$$\begin{split} \left| \frac{\mathbb{C}\_{d,m}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} - \frac{\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} \right| &\leq \frac{\varepsilon\_{m}B\_{d,m}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} + \frac{\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} + \frac{\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})\,B\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})\,\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} + \frac{\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})} \\ &\leq \frac{2\varepsilon\_{m}B\_{d,m}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} + \frac{2\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} + \frac{2\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})\,B\_{d-h,m}^{J\_{d-h}}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})\,\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} + \frac{2\varepsilon\_{m}\mathbb{C}\_{d}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} \\ &= \varepsilon\_{m}O(1), \end{split}$$

where the *O*(1)-term does not depend on *u*. Thus, one can conclude that condition (20) of Proposition 5 is satisfied.

An analogous result as the one stated in Proposition 5 can be stated for *λU*.

#### *3.3. Schmid's and Schmidt's Tail Dependence Measure*

Schmid and Schmidt (see [5] (Sec. 3.3)) considered a generalization of tail coefficients based on a multivariate conditional version of Spearman's rho, which is defined as

$$\rho(\mathbb{C}\_d, g) = \frac{\int\_{[0,1]^d} \mathbb{C}\_d(\mathbf{u})g(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[0,1]^d} \Pi\_d(\mathbf{u})g(\mathbf{u}) \, \mathrm{d}\mathbf{u}}{\int\_{[0,1]^d} M\_d(\mathbf{u})g(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[0,1]^d} \Pi\_d(\mathbf{u})g(\mathbf{u}) \, \mathrm{d}\mathbf{u}}$$

for some non-negative measurable function *g*, provided that the integrals exist. The choice *g*(**u**) = **1**(**u** ∈ [0, *p*]<sup>*d*</sup>) leads to

$$\rho\_1(\mathbb{C}\_d, p) = \frac{\int\_{[0,p]^d} \mathbb{C}\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[0,p]^d} \Pi\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u}}{\int\_{[0,p]^d} M\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[0,p]^d} \Pi\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u}}$$

and the multivariate tail dependence measure is defined as

$$\lambda\_{L,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \rho\_1(\mathbb{C}\_d, p) = \lim\_{p \searrow 0} \frac{d+1}{p^{d+1}} \int\_{[0,p]^d} \mathbb{C}\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u},\tag{24}$$

provided that the limit exists. Similarly, they define

$$\lambda\_{U,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{\int\_{[1-p,1]^d} \mathbb{C}\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[1-p,1]^d} \Pi\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u}}{\int\_{[1-p,1]^d} M\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \int\_{[1-p,1]^d} \Pi\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u}}. \tag{25}$$

Note that these coefficients are not equal to *λ<sub>L</sub>* and *λ<sub>U</sub>*, respectively, in the bivariate case, so they can be considered a different type of tail dependence coefficient rather than a generalization.
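The limit in (24) can be approximated by Monte Carlo integration over [0, *p*]<sup>*d*</sup> for a small *p*. A minimal sketch (Python; the values of *p*, the sample size, and the choice of *M*<sub>3</sub> and Π<sub>3</sub> as test copulas are ours): for *M<sub>d</sub>* the coefficient equals 1 for every *p*, while for Π<sub>*d*</sub> it is of order *p*<sup>*d*−1</sup>.

```python
import math
import random

def schmid_schmidt_lower(C, d, p=0.01, n=200_000, seed=7):
    # Monte Carlo version of (24): (d+1)/p^(d+1) * integral of C over [0,p]^d
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        u = [rng.uniform(0.0, p) for _ in range(d)]
        acc += C(u)
    integral = (p**d) * acc / n          # volume times average value
    return (d + 1) / p**(d + 1) * integral

M3 = lambda u: min(u)          # upper Frechet bound M_d
PI3 = lambda u: math.prod(u)   # independence copula Pi_d

print(schmid_schmidt_lower(M3, d=3))   # should be close to 1
print(schmid_schmidt_lower(PI3, d=3))  # should be close to 0
```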

**Proposition 6.** *Schmid's and Schmidt's tail dependence measure λ<sub>L,S</sub> satisfies normalization property* (*T*1)*, permutation invariance property* (*T*3)*, and addition property* (*T*4)*, with k<sub>d</sub>*(*λ<sub>L,S</sub>*) = 0 *for every d* ≥ 2*.*

**Proof.** Normalization property (*T*1) and permutation invariance (*T*3) follow from the normalization property and permutation invariance of Spearman's rho, see, for example [16]. When adding an independent component, one gets

$$\lambda\_{L,S}(\mathbb{C}\_{d+1}) = \lim\_{p\searrow 0} \frac{d+2}{p^{d+2}} \int\_{[0,p]^{d+1}} \mathbb{C}\_d(\mathbf{u})\,u\_{d+1} \,\mathrm{d}\mathbf{u}\,\mathrm{d}u\_{d+1} = \lim\_{p\searrow 0} \frac{p(d+2)}{2(d+1)}\, \frac{d+1}{p^{d+1}} \int\_{[0,p]^d} \mathbb{C}\_d(\mathbf{u}) \,\mathrm{d}\mathbf{u} = 0.$$

This finishes the proof.

In order for duality property (*T*5) to hold, the upper version should rather be defined as

$$\lambda\_{U,S}^\*(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{d+1}{p^{d+1}} \int\_{[0,p]^d} \mathbb{C}\_d^S(\mathbf{u}) \, \mathrm{d}\mathbf{u}.\tag{26}$$

This seems to be more logical, since *λU*,*S*(*Cd*) can only be expressed, after substituting

$$\int\_{[1-p,1]^d} \Pi\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} = \left[\frac{p(2-p)}{2}\right]^d \qquad \text{and} \qquad \int\_{[1-p,1]^d} M\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} = p^d - \frac{d}{d+1} p^{d+1} \tag{27}$$

into (25), as

$$\lambda\_{U,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{\int\_{[1-p,1]^d} \mathbb{C}\_d(\mathbf{u}) \, \mathrm{d}\mathbf{u} - \left[\frac{p(2-p)}{2}\right]^d}{p^d - \frac{d}{d+1}p^{d+1} - \left[\frac{p(2-p)}{2}\right]^d},$$

which cannot be simplified further. It is easy to show that, in the bivariate case (i.e., *d* = 2), the coefficients *λ<sub>U,S</sub>*(*C<sub>d</sub>*) and *λ<sup>∗</sup><sub>U,S</sub>*(*C<sub>d</sub>*) coincide. For a general dimension *d* > 2, however, they can differ.

The continuity property (*T*2) cannot be shown in full generality, but a continuity property is fulfilled in the special case of a sequence of contaminated copulas, as in (14).

**Proposition 7.** *Consider a sequence of contaminated copulas, C<sub>d,m</sub>* = (1 − *ε<sub>m</sub>*)*C<sub>d</sub>* + *ε<sub>m</sub>B<sub>d,m</sub>, such that ε<sub>m</sub>* → 0*, as m* → ∞*, and λ<sub>L,S</sub>*(*C<sub>d</sub>*) *exists. Then, as m* → ∞*,*

$$
\lambda\_{L,S}(\mathsf{C}\_{d,m}) \to \lambda\_{L,S}(\mathsf{C}\_d).
$$

**Proof.** Direct calculation gives

$$\lim\_{m \to \infty} \lambda\_{L,S}(\mathbb{C}\_{d,m}) = \lim\_{m \to \infty} \left[ (1 - \varepsilon\_m) \lambda\_{L,S}(\mathbb{C}\_d) + \varepsilon\_m \lambda\_{L,S}(B\_{d,m}) \right] = \lambda\_{L,S}(\mathbb{C}\_d)$$

since *λL*,*S*(*Bd*,*m*) is bounded.

#### *3.4. Tail Dependence of Extreme-Value Copulas*

As stated in (6), bivariate tail coefficients for extreme-value copulas can simply be expressed using the corresponding Pickands dependence function. Thus, tail dependence is fully determined by the value of the Pickands dependence function *A*<sub>2</sub> at the point (1/2, 1/2). The range of values of *A*<sub>2</sub> is limited by max(*w*<sub>1</sub>, *w*<sub>2</sub>) ≤ *A*<sub>2</sub>(*w*<sub>1</sub>, *w*<sub>2</sub>) ≤ 1, which also gives 1/2 ≤ *A*<sub>2</sub>(1/2, 1/2) ≤ 1, where the bivariate tail coefficient *λ<sub>U</sub>* gets larger as *A*<sub>2</sub>(1/2, 1/2) gets closer to 1/2. On the other hand, *A*<sub>2</sub>(1/2, 1/2) = 1 means tail independence. Following this idea, and given that also for the *d*-dimensional Pickands dependence function *A<sub>d</sub>* associated to a copula *C<sub>d</sub>* we have 1/*d* ≤ *A<sub>d</sub>*(1/*d*, ..., 1/*d*) ≤ 1, a measure of tail dependence for *d*-dimensional extreme-value copulas can be based on the difference 1 − *A<sub>d</sub>*(1/*d*, ..., 1/*d*). After proper standardization, this leads to

$$
\lambda\_{U,E}(\mathbb{C}\_d) = \frac{d}{d-1} \left(1 - A\_d(1/d, \dots, 1/d)\right). \tag{28}
$$

Note that such a coefficient is a translation of the extremal coefficient *θ<sub>E</sub>* = *d* · *A<sub>d</sub>*(1/*d*, ..., 1/*d*), given in [17] or [7]; the name extremal coefficient stems from [17]. Schlather and Tawn [18] give a simple interpretation of *θ<sub>E</sub>*, related to the number of independent variables involved in the corresponding *d*-variate random vector.
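As an illustration of (28), consider the *d*-variate Gumbel (logistic) extreme-value copula, whose Pickands dependence function is *A<sub>d</sub>*(**w**) = (∑<sub>*i*</sub> *w<sub>i</sub><sup>θ</sup>*)<sup>1/θ</sup>; this family and the *θ* values below are our assumed example. A short Python sketch shows that *λ<sub>U,E</sub>* is 0 at independence (*θ* = 1) and approaches 1 as *θ* → ∞ (comonotonicity).

```python
def pickands_gumbel(w, theta):
    # Pickands dependence function of the Gumbel (logistic) EV copula
    return sum(x**theta for x in w)**(1.0/theta)

def lambda_UE(A, d):
    # eq. (28): (d/(d-1)) * (1 - A_d(1/d, ..., 1/d))
    return d/(d - 1.0) * (1.0 - A([1.0/d]*d))

d = 3
for theta in (1.0, 2.0, 10.0, 1000.0):
    print(theta, lambda_UE(lambda w: pickands_gumbel(w, theta), d))
```

The coefficient increases monotonically in *θ*, matching the interpretation of *θ<sub>E</sub>* as an effective number of independent components.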

**Proposition 8.** *The multivariate tail dependence coefficient λ<sub>U,E</sub> in* (28) *satisfies normalization property* (*T*1)*, continuity property* (*T*2)*, permutation invariance property* (*T*3)*, and addition property* (*T*4)*, with k<sub>d</sub>*(*λ<sub>U,E</sub>*) = (*d* − 1)/*d for every d* ≥ 2*.*

**Proof.** Normalization (*T*1) and permutation invariance (*T*3) follow immediately from the definition of *λ<sub>U,E</sub>*. If lim<sub>*m*→∞</sub> *C<sub>d,m</sub>*(**u**) = *C<sub>d</sub>*(**u**), ∀**u** ∈ [0, 1]<sup>*d*</sup>, then also lim<sub>*m*→∞</sub> *A<sub>d,m</sub>*(**w**) = *A<sub>d</sub>*(**w**), ∀**w** ∈ Δ<sub>*d*−1</sub>, which proves the validity of (*T*2). For *X<sub>d+1</sub>* independent of (*X*<sub>1</sub>, ..., *X<sub>d</sub>*), we can use Example 1 and obtain

$$\begin{aligned} \lambda\_{U,E}(\mathbb{C}\_{d+1}) &= \frac{d+1}{d} \left( 1 - A\_{d+1} \left( \frac{1}{d+1}, \dots, \frac{1}{d+1} \right) \right) = \frac{d+1}{d} \left( 1 - \frac{1}{d+1} \left( dA\_d \left( \frac{1}{d}, \dots, \frac{1}{d} \right) + 1 \right) \right) \\ &= 1 - A\_d \left( \frac{1}{d}, \dots, \frac{1}{d} \right) \\ &= \frac{d-1}{d} \lambda\_{U,E}(\mathbb{C}\_d). \end{aligned}$$

**Remark 1.** *The duality property* (*T*5) *is not applicable, since the survival copula of an extreme-value copula does not have to be an extreme-value copula.*

#### *3.5. Tail Dependence Using Subvectors*

A common element of the multivariate tail dependence measures discussed in Sections 3.1–3.3 is that they focus on the extremal behavior of all *d* components of a random vector *X*. However, one could also be interested in knowing whether there is any kind of tail dependence present in the vector, that is, even for subvectors of *X*. An interesting observation can be made for tail dependence measures that satisfy property (*T*4) with *k<sub>d</sub>* = 0 for every *d* ≥ 2. Assume that *X* and *Y* are independent random variables. Then any tail measure t<sub>2</sub>(*C*<sub>2</sub>) would be zero for the random couple (*X*, *Y*), and no matter which random component we add, the tail measure for the extended random vector would stay 0. In other words, for any such tail dependence measure, this leads to tail independence of the *d*-dimensional random vector (*X*, ..., *X*, *Y*), no matter what *d* is. Considering tail dependence of subvectors would be of particular interest in this case.

Suppose that we have a multivariate tail coefficient *μ<sub>L,d</sub>* that can be calculated for a general dimension *d* ≥ 2. Suppose further that this coefficient only depends on the strength of tail dependence when all of the components of a random vector are simultaneously large or small. This is the case for all multivariate tail coefficients mentioned in Sections 3.1–3.3. We can then introduce a tail coefficient given by

$$\begin{split} \mu\_{L}(\mathbb{C}\_{d}) &= \sum\_{j=2}^{d} w\_{d,j} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d} \mu\_{L,j}(\mathbb{C}\_j^{(\ell\_{1}, \ldots, \ell\_{j})}) \\ &= \sum\_{j=2}^{d} \widetilde{w}\_{d,j} \frac{1}{\binom{d}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d} \mu\_{L,j}(\mathbb{C}\_j^{(\ell\_{1}, \ldots, \ell\_{j})}), \end{split} \tag{29}$$

where $\frac{1}{\binom{d}{j}} \sum\_{1 \le \ell\_1 < \cdots < \ell\_j \le d} \mu\_{L,j}(\mathbb{C}\_j^{(\ell\_1,\ldots,\ell\_j)})$ can be interpreted as an average tail dependence measure per dimension, and where $\widetilde{w}\_{d,j} = w\_{d,j}\binom{d}{j}$. This measure deals with a disadvantage of current multivariate tail coefficients, which assign a value of 0 to copulas where *d* − 1 components are highly dependent in their tails and the *d*-th component is independent. When dealing with possible stock losses, for example, this situation should also be captured by a tail coefficient.

Recall that the weight *w̃<sub>d,j</sub>* corresponds to the importance given to the average tail dependence within all the *j*-dimensional subvectors of *X*. Because tail dependence in a higher dimension is more severe, as more extremes occur simultaneously, it is natural to assume *w̃<sub>d,2</sub>* ≤ *w̃<sub>d,3</sub>* ≤ ··· ≤ *w̃<sub>d,d</sub>*. However, such an assumption excludes other approaches to measuring tail dependence. For example, setting *w̃<sub>d,2</sub>* = 1 and *w̃<sub>d,j</sub>* = 0 for *j* = 3, ..., *d* would lead to the construction of a tail dependence measure as the average of all pairwise measures. If the underlying bivariate measure satisfies (*T*1), (*T*2), (*T*3), and (*T*5) with *d* = 2 only, these properties are carried over to the pairwise measure. Additionally, (*T*4) can be shown similarly as in Proposition 1 in [16]. Despite possibly fulfilling the desirable properties, all of the higher-dimensional dependencies are ignored, a clear drawback of such a pairwise approach. In the sequel, we focus on the setting *w̃<sub>d,2</sub>* ≤ *w̃<sub>d,3</sub>* ≤ ··· ≤ *w̃<sub>d,d</sub>*.

**Proposition 9.** *Suppose that the tail dependence measures μ<sub>L,j</sub> satisfy normalization property* (*T*1)*, continuity property* (*T*2)*, permutation invariance property* (*T*3)*, and duality property* (*T*5)*, for j* = 2, ..., *d. Further assume that* $\sum\_{j=2}^{d} \widetilde{w}\_{d,j} = 1$*. Then the coefficient μ<sub>L</sub> in* (29) *also satisfies properties* (*T*1)*,* (*T*2)*,* (*T*3)*, and* (*T*5)*.*

**Proof.** Clearly, *μ<sub>L</sub>*(Π<sub>*d*</sub>) = 0 and $\mu\_L(M\_d) = \sum\_{j=2}^{d} \widetilde{w}\_{d,j} = 1$. The continuity, permutation invariance, and duality properties follow from the corresponding properties of the *μ<sub>L,j</sub>*.

What happens upon the addition of an independent component (property (*T*4)) is not so straightforward, since the weights differ depending on the overall dimension *d*. The addition of an independent component increases the dimension and, thus, possibly changes all of the weights. However, one could try to come up with a weighting scheme that guarantees the fulfilment of property (*T*4). Consider *X<sub>d+1</sub>* independent of (*X*<sub>1</sub>, ..., *X<sub>d</sub>*). Suppose that the input tail dependence measures *μ<sub>L,j</sub>* satisfy property (*T*4), with *k<sub>j</sub>* = *k<sub>j</sub>*(*μ<sub>L,j</sub>*) for *j* = 2, ..., *d*. First, we express *μ<sub>L</sub>* for the random vector (*X*<sub>1</sub>, ..., *X<sub>d+1</sub>*) as

$$\begin{split} \mu\_{L}(\mathbf{C}\_{d+1}) &= \sum\_{j=2}^{d+1} \tilde{w}\_{d+1,j} \frac{1}{\binom{d+1}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d+1} \mu\_{L,j}(\mathbf{C}\_{j}^{(\ell\_{1},...,\ell\_{j})}) \\ &= \sum\_{j=2}^{d} \tilde{w}\_{d+1,j} \frac{1}{\binom{d+1}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d} \mu\_{L,j}(\mathbf{C}\_{j}^{(\ell\_{1},...,\ell\_{j})}) \\ &+ \sum\_{j=2}^{d+1} \tilde{w}\_{d+1,j} \frac{1}{\binom{d+1}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j-1} \le d} \mu\_{L,j}(\mathbf{C}\_{j}^{(\ell\_{1},...,\ell\_{j-1},d+1)}). \end{split} \tag{30}$$

Now, using property (*T*4) in (30), together with the fact that for index *j* = 2 the corresponding summand is *μ<sub>L,2</sub>*(Π<sub>2</sub>) = 0 and, thus, this index can be omitted, one obtains

$$\begin{split} \mu\_{L}(\mathbb{C}\_{d+1}) &= \sum\_{j=2}^{d} \widetilde{w}\_{d+1,j} \frac{d+1-j}{d+1} \frac{1}{\binom{d}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d} \mu\_{L,j}(\mathbb{C}\_{j}^{(\ell\_{1},\ldots,\ell\_{j})}) \\ &+ \sum\_{j=3}^{d+1} \widetilde{w}\_{d+1,j} \frac{k\_{j-1}}{\binom{d+1}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j-1} \le d} \mu\_{L,j-1}(\mathbb{C}\_{j-1}^{(\ell\_{1},\ldots,\ell\_{j-1})}) \\ &= \sum\_{j=2}^{d} \left( \widetilde{w}\_{d+1,j} \frac{d+1-j}{d+1} + \widetilde{w}\_{d+1,j+1} k\_{j} \frac{j+1}{d+1} \right) \frac{1}{\binom{d}{j}} \sum\_{1 \le \ell\_{1} < \cdots < \ell\_{j} \le d} \mu\_{L,j}(\mathbb{C}\_{j}^{(\ell\_{1},\ldots,\ell\_{j})}) \end{split}$$

which is equal to *μL*(*Cd*) with weights given as

$$
\widetilde{w}\_{d,j} = \widetilde{w}\_{d+1,j} \frac{d+1-j}{d+1} + \widetilde{w}\_{d+1,j+1} k\_j \frac{j+1}{d+1}
$$

for every *j* = 2, . . . , *d*. A sufficient criterion for fulfillment of property (*T*4) would thus be to have

$$
\widetilde{w}\_{d,j} \ge \widetilde{w}\_{d+1,j}\frac{d+1-j}{d+1} + \widetilde{w}\_{d+1,j+1}k\_j\frac{j+1}{d+1} \tag{31}
$$

for every *j* = 2, ..., *d*. Knowing the values *k<sub>j</sub>*, *w̃<sub>d,j</sub>*, and *w̃<sub>d+1,j</sub>*, for *j* = 2, ..., *d*, as well as *w̃<sub>d+1,d+1</sub>*, one can check (31).

One rather general method of weight selection can then be as follows. Suppose that one wants the proportions of the weights *w<sub>d,d<sub>1</sub></sub>* and *w<sub>d,d<sub>2</sub></sub>*, corresponding to two subdimensions *d*<sub>1</sub> and *d*<sub>2</sub>, not to depend on the overall dimension *d*. This can be achieved by setting recursively *w<sub>d+1,j</sub>* = *r<sub>d+1</sub>w<sub>d,j</sub>* for *j* = 2, ..., *d* and *w<sub>d+1,d+1</sub>* = 1 − ∑<sub>*j*=2</sub><sup>*d*</sup> *w<sub>d+1,j</sub>* = 1 − *r<sub>d+1</sub>*. The initial condition is obviously given as *w*<sub>2,2</sub> = 1. To obtain *w̃<sub>d,2</sub>* ≤ *w̃<sub>d,3</sub>* ≤ ··· ≤ *w̃<sub>d,d</sub>*, one needs *r<sub>d</sub>* ∈ [0, 1/2] for every *d* = 3, 4, .... Values of *r<sub>d</sub>* closer to 0 give more weight to the *d*-th dimension; values close to 1/2 limit its influence. If we further assume that *r<sub>d</sub>* = *r*, that is, *r<sub>d</sub>* does not depend on *d*, this simplifies to

$$w\_{d,j} = r^{d-j} (1 - r)^{\mathbf{1}\{j > 2\}}$$

for *d* = 2, ... and *j* = 2, ... , *d*. We next check the condition in (31) for this particular weight selection. Condition (31) can be rewritten as

$$1 \ge r \frac{d+1-j}{d+1} + k\_j \frac{j+1}{d+1}, \quad \text{for every } j = 2, \dots, d. \tag{32}$$

If *k<sub>j</sub>* = 1 for every *j*, as in one case of Li's tail dependence parameter, condition (32) allows only one selection of *r*, namely *r* = 0. On the other hand, if *k<sub>j</sub>* = 0 for every *j*, then *r* can take any value in [0, 1/2]. Looking from the other perspective, if *r* = 1/2, then condition (32) is satisfied if

$$k\_j \le \frac{d + 1/2}{d + 1}, \quad \text{for every } j = 2, \dots, d.$$

Let us recall that these conditions can only be seen as sufficient, not necessary. A precise study of what happens when an independent component is added requires knowledge of the weighting scheme and knowledge of the underlying input tail dependence measure.
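The recursive weighting scheme above is easy to experiment with numerically. The following is a minimal sketch (an illustration only, not code from the paper): it builds the weights from the recursion *wd*+1,*j* = *rwd*,*j*, *wd*+1,*d*+1 = 1 − *r*, starting from *w*2,2 = 1, and checks that they sum to one and are nondecreasing in *j* for *r* ≤ 1/2.

```python
# Sketch (illustration only): weights from the recursion
# w_{d+1,j} = r * w_{d,j} (j = 2,...,d), w_{d+1,d+1} = 1 - r, with w_{2,2} = 1.

def weights(d, r):
    """Return {j: w_{d,j}} for j = 2, ..., d, built recursively."""
    w = {2: 1.0}                                 # initial condition w_{2,2} = 1
    for dim in range(3, d + 1):
        w = {j: r * wj for j, wj in w.items()}   # scale existing weights by r
        w[dim] = 1.0 - r                         # weight of the new dimension
    return w

r, d = 0.3, 6
w = weights(d, r)
assert abs(sum(w.values()) - 1.0) < 1e-12        # weights sum to one
ws = [w[j] for j in range(2, d + 1)]
assert all(a <= b for a, b in zip(ws, ws[1:]))   # nondecreasing, since r <= 1/2
for j in range(2, d + 1):                        # closed form implied by the recursion
    assert abs(w[j] - r ** (d - j) * ((1 - r) if j > 2 else 1.0)) < 1e-12
print([round(w[j], 4) for j in range(2, d + 1)])
```

For *r* = 0.3 and *d* = 6, for instance, the weights are approximately 0.0081, 0.0189, 0.063, 0.21, 0.7.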

In summary, the above discussion reveals that a measure that is able to detect tail dependence not only in a random vector as a whole, but also in lower-dimensional subvectors, can be constructed. A simple and interpretable weighting scheme proposed above can be used, such that several desirable properties of the tail dependence measure are guaranteed.

#### *3.6. Overview of Multivariate Tail Coefficients and Properties*

For convenience of the reader, we list in Table 1 all of the discussed tail dependence measures, with reference to their section number, and indicate which properties they satisfy.


**Table 1.** Overview of multivariate tail coefficients and their properties.

#### **4. Multivariate Tail Coefficients: Further Properties**

In Section 3, the focus was on properties (*T*1)–(*T*5). In this section, we aim at exploring some further properties that might be of special interest. In particular, we investigate the following types of properties. Here, t*d*(*Cd*) denotes a multivariate tail coefficient for *Cd* ∈ Cop(*d*). When needed, we specify whether it concerns a lower or an upper tail coefficient, referring to them as t*L*,*d*(*Cd*) and t*U*,*d*(*Cd*), respectively.

• *Expansion property (P*1*)*.

Given is a random vector *X* = (*X*1, ... , *Xd*) with copula *Cd*. One adds one random component *Xd*+1 to *X*. Denote the copula of the expanded random vector (*X*, *Xd*+1) by *Cd*+1. How does t*d*+1(*Cd*+1) compare to t*d*(*Cd*)? Does it hold that t*d*+1(*Cd*+1) ≤ t*d*(*Cd*)?

• *Monotonicity property (P*2*)*. Consider two copulas *Cd*,1, *Cd*,2 ∈ Cop(*d*) such that *Cd*,1(*u*) ≤ *Cd*,2(*u*) for all *u* ∈ [0, 1]<sup>*d*</sup>. Does the following hold?

(i) t*L*,*d*(*Cd*,1) ≤ t*L*,*d*(*Cd*,2); (ii) t*U*,*d*(*Cd*,1) ≤ t*U*,*d*(*Cd*,2)?

• *Convex combination property (P*3*)*.

Suppose that the copula *Cd* can be written as *Cd* = *αCd*,1 + (1 − *α*)*Cd*,2 for *α* ∈ [0, 1], and *Cd*,1, *Cd*,2 ∈ Cop(*d*). What can we say about the comparison between t*d*(*Cd*) and *α*t*d*(*Cd*,1)+(1 − *α*)t*d*(*Cd*,2)?

For extreme-value copulas, we look into geometric combinations instead.

The logic behind property (*P*1) comes from the perception of a tail coefficient as the probability that extreme events of the components of the random vector happen simultaneously. Thus, when another component is added, the probability of having joint extreme events cannot increase. However, there is no such limitation from below, and adding a component can immediately decrease the coefficient to zero.

In the next subsections, we briefly discuss these properties for the multivariate tail coefficients discussed in Section 3.

#### *4.1. Expansion Property (P1)*

For Frahm's coefficient, it holds that *εL*(*Cd*+1) ≤ *εL*(*Cd*), and analogously for the upper coefficient. This result can be found in Proposition 2 of [3].

For Li's tail dependence parameters, we need to distinguish two cases. If we add the new component to the set *Ih*, then we have

$$\lambda\_L^{I\_{h+1}|J\_{d-h}}(\mathbb{C}\_{d+1}) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_{d+1}(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} \le \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{\mathbb{C}\_{d-h}^{J\_{d-h}}(u\mathbf{1})} = \lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d).$$

However, if the component is added to the set *Jd*−*h*, no relationship can be shown in general. A special situation occurs when the component *Xd*+1 added to the set *Jd*−*h* is just a duplicate of a component that is already included in *Jd*−*h*. Then, obviously, *λ*<sup>*Ih*|*Jd*−*h*+1</sup><sub>*L*</sub>(*Cd*+1) = *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*).

For Schmid's and Schmidt's tail dependence measures, one cannot say, in general, how the coefficient *λL*,*S*(*Cd*+1) behaves when compared to *λL*,*S*(*Cd*). As can be seen from (24), the integral expression decreases with increasing dimension *d*, but, at the same time, the normalizing constant increases with *d*.

For the tail coefficient for extreme-value copulas, *λU*,*E*(*Cd*), it follows from Example 7 in Section 6 that the addition of another component can lead to an increase in this coefficient. See, in particular, Figure 5.

#### *4.2. Monotonicity Property (P2)*

Concerning the monotonicity property (*P*2), it is easily seen that (*P*2)(i) holds for Frahm's lower extremal dependence coefficient *εL*(*Cd*) if we additionally assume that *Cd*,1(*u*) ≤ *Cd*,2(*u*) for *u* in some neighborhood of **0**. Similarly, we need to assume that *Cd*,1(*u*) ≤ *Cd*,2(*u*) for *u* in some neighborhood of **1** in order to show that (*P*2)(ii) holds.

For Li's tail dependence parameters, property (*P*2) does not hold in general. This is illustrated via the following example in case *d* = 4. Consider a random vector (*U*1, *U*2, *U*3, *U*4) with uniform marginals and with distribution function a Clayton copula with parameter *θ* > 0 (see Example 6), given by *C*4,1(*u*) = (*u*1<sup>−*θ*</sup> + *u*2<sup>−*θ*</sup> + *u*3<sup>−*θ*</sup> + *u*4<sup>−*θ*</sup> − 3)<sup>−1/*θ*</sup> (see (39)). We denote this first copula by *C*4,1. Note that the random vector (*U*1, *U*2, *U*3) has as joint distribution a three-dimensional Clayton copula with parameter *θ*, which we denote by *C*3. The vector (*U*1, *U*2, *U*4) has the same joint distribution *C*3. Next, we consider the copula of the random vector (*U*1, *U*2, *U*3, *U*3), which we denote by *C*4,2. One has that, for all *u* ∈ [0, 1]<sup>4</sup>,

$$\begin{aligned} \mathsf{C}\_{4,1}(\mathsf{u}) &= \mathsf{P}\left(\mathsf{U}\_{1} \leq \mathsf{u}\_{1}, \mathsf{U}\_{2} \leq \mathsf{u}\_{2}, \mathsf{U}\_{3} \leq \mathsf{u}\_{3}, \mathsf{U}\_{4} \leq \mathsf{u}\_{4}\right) \\ &\leq \min\left(\mathsf{P}\left(\mathsf{U}\_{1} \leq \mathsf{u}\_{1}, \mathsf{U}\_{2} \leq \mathsf{u}\_{2}, \mathsf{U}\_{3} \leq \mathsf{u}\_{3}\right), \mathsf{P}\left(\mathsf{U}\_{1} \leq \mathsf{u}\_{1}, \mathsf{U}\_{2} \leq \mathsf{u}\_{2}, \mathsf{U}\_{4} \leq \mathsf{u}\_{4}\right)\right) \\ &= \min\left(\mathsf{C}\_{3}(\mathsf{u}\_{1}, \mathsf{u}\_{2}, \mathsf{u}\_{3}), \mathsf{C}\_{3}(\mathsf{u}\_{1}, \mathsf{u}\_{2}, \mathsf{u}\_{4})\right) \\ &= \mathsf{C}\_{3}(\mathsf{u}\_{1}, \mathsf{u}\_{2}, \min\left(\mathsf{u}\_{3}, \mathsf{u}\_{4}\right)) \\ &= \mathsf{C}\_{4,2}(\mathsf{u}). \end{aligned}$$

In Example 6, we calculate Li's lower tail dependence parameter for a *d*-variate Clayton copula, which equals *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*) = ((*d* − *h*)/*d*)<sup>1/*θ*</sup> (see (41)). Applying this in the setting of the current example leads to

$$\lambda\_L^{1,2\mid 3,4}(\mathbb{C}\_{4,2}) = \lambda\_L^{1,2\mid 3}(\mathbb{C}\_3) = \left(\frac{1}{3}\right)^{1/\theta} < \left(\frac{2}{4}\right)^{1/\theta} = \lambda\_L^{1,2\mid 3,4}(\mathbb{C}\_{4,1}),$$

which thus contradicts monotonicity property (*P*2)(i).

From the definition of Schmid's and Schmidt's tail dependence measure, it is immediate that the monotonicity property (*P*2) holds.

For the tail coefficient for extreme-value copulas, *λU*,*E*, defined in (28), the monotonicity property (*P*2) holds. To see this, recall from (3) that, for an extreme-value copula *Cd*,1, we can express its stable tail dependence function as

$$\ell\_{\mathbb{C}\_{d,1}}(x\_1, \dots, x\_d) = -\log(\mathbb{C}\_{d,1}(e^{-x\_1}, \dots, e^{-x\_d})),\tag{33}$$

and, hence, using that *Cd*,1 ≤ *Cd*,2, it follows that ℓ<sub>*Cd*,1</sub> ≥ ℓ<sub>*Cd*,2</sub>. The same inequality holds for the Pickands dependence function *Ad*,1, which is the restriction of the stable tail dependence function ℓ<sub>*Cd*,1</sub> to the unit simplex. Hence, *Cd*,1 ≤ *Cd*,2 also implies that *A*<sub>*Cd*,1</sub> ≥ *A*<sub>*Cd*,2</sub>. From the definition of the tail coefficient in (28), it thus follows that *λU*,*E*(*Cd*,1) ≤ *λU*,*E*(*Cd*,2).

#### *4.3. Investigation of a Tail Coefficient for a Convex/Geometric Combination (Property (P3))*

Consider a copula *Cd* that is a convex combination of two copulas *Cd*,1 and *Cd*,2, i.e., *Cd* = *αCd*,1 + (1 − *α*)*Cd*,2 for *α* ∈ [0, 1]. For the survival function, we then also have *C̄d* = *αC̄d*,1 + (1 − *α*)*C̄d*,2.

Before stating the results for the various multivariate tail coefficients, we first make the following observation. For *α*, *a*, *b*, *c*, *d* ∈ [0, 1] with *c*, *d* > 0, it is straightforward to show that

$$\frac{a}{c} \le \frac{\alpha a + (1 - \alpha)b}{\alpha c + (1 - \alpha)d} \le \frac{b}{d} \Longleftrightarrow \frac{a}{c} \le \frac{b}{d}.\tag{34}$$
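As a quick sanity check, the ordering in (34) — for *a*/*c* ≤ *b*/*d*, the mixed ratio always lies between the two — can be verified numerically. This sketch is an illustration only:

```python
# Numerical sanity check of the mediant-type ordering in (34):
# if a/c <= b/d, then a/c <= (alpha*a + (1-alpha)*b) / (alpha*c + (1-alpha)*d) <= b/d.
import random

random.seed(0)
for _ in range(1000):
    a, b, c, d = (random.uniform(0.01, 1.0) for _ in range(4))
    if a / c > b / d:
        a, b, c, d = b, a, d, c        # enforce a/c <= b/d
    alpha = random.random()
    mixed = (alpha * a + (1 - alpha) * b) / (alpha * c + (1 - alpha) * d)
    assert a / c - 1e-12 <= mixed <= b / d + 1e-12
print("mediant ordering verified on 1000 random draws")
```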

Frahm's lower extremal dependence coefficient for the copula *Cd* is given by

$$\epsilon\_L(\mathbb{C}\_d) = \lim\_{u \searrow 0} \frac{\alpha \mathbb{C}\_{d,1}(u\mathbf{1}) + (1 - \alpha) \mathbb{C}\_{d,2}(u\mathbf{1})}{\alpha (1 - \overline{\mathbb{C}}\_{d,1}(u\mathbf{1})) + (1 - \alpha)(1 - \overline{\mathbb{C}}\_{d,2}(u\mathbf{1}))}.$$

Using (34), it then follows that, if *εL*(*Cd*,1) ≤ *εL*(*Cd*,2), then

$$
\epsilon\_L(\mathbb{C}\_{d,1}) \le \epsilon\_L(\mathbb{C}\_d) \le \epsilon\_L(\mathbb{C}\_{d,2}).
$$

The same conclusion can be found for Frahm's upper extremal dependence coefficient *εU*.

Li's lower tail dependence parameter for *Cd*, a convex mixture of copulas, equals

$$\lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d) = \lim\_{u \searrow 0} \frac{\alpha\mathbb{C}\_{d,1}(u\mathbf{1}) + (1-\alpha)\mathbb{C}\_{d,2}(u\mathbf{1})}{\alpha\mathbb{C}\_{d-h,1}^{J\_{d-h}}(u\mathbf{1}) + (1-\alpha)\mathbb{C}\_{d-h,2}^{J\_{d-h}}(u\mathbf{1})},$$

and an application of (34) gives that, if *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*,1) ≤ *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*,2), then *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*,1) ≤ *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*) ≤ *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*,2). The same conclusion can be found for Li's upper tail dependence parameter *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*U*</sub>.

Schmid's and Schmidt's lower tail dependence measure for a convex mixture of copulas is

$$\lambda\_{L,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{d+1}{p^{d+1}} \int\_{[0,p]^d} \left[ \alpha \mathbb{C}\_{d,1}(\mathfrak{u}) + (1-\alpha) \mathbb{C}\_{d,2}(\mathfrak{u}) \right] \mathrm{d}\mathfrak{u} = \alpha \lambda\_{L,S}(\mathbb{C}\_{d,1}) + (1-\alpha) \lambda\_{L,S}(\mathbb{C}\_{d,2}).$$

For an extreme-value copula, it does not make sense to look at convex combinations of two extreme-value copulas, since it cannot be shown, in general, that such a convex combination would again be an extreme-value copula. A more natural way to combine two extreme-value copulas *Cd*,1 and *Cd*,2 is by means of a geometric combination, i.e., by considering *Cd* = *Cd*,1<sup>*α*</sup>*Cd*,2<sup>1−*α*</sup>, with *α* ∈ [0, 1]. In, for example, Falk et al. [19] (p. 123), it was shown that a convex combination of two Pickands dependence functions is again a Pickands dependence function. Denoting by *Ad*,1 and *Ad*,2 the Pickands dependence functions of *Cd*,1 and *Cd*,2, respectively, it then follows from (33) that the Pickands dependence function *Ad* of *Cd* = *Cd*,1<sup>*α*</sup>*Cd*,2<sup>1−*α*</sup> is given by *Ad* = *αAd*,1 + (1 − *α*)*Ad*,2. From this, it is seen that *Cd* is again an extreme-value copula. For the tail dependence coefficient for extreme-value copulas, it thus holds that

$$\begin{aligned} \lambda\_{U,E}(\mathbb{C}\_d) &= \frac{d}{d-1} \left(1 - \alpha A\_{d,1}(1/d, \dots, 1/d) - (1 - \alpha) A\_{d,2}(1/d, \dots, 1/d)\right) \\ &= \alpha \lambda\_{U,E}(\mathbb{C}\_{d,1}) + (1 - \alpha) \lambda\_{U,E}(\mathbb{C}\_{d,2}), \end{aligned}$$

i.e., the coefficient *λU*,*<sup>E</sup>* of a geometric mean of two extreme-value copulas is equal to the corresponding convex combination of the coefficients of the concerned two copulas.

#### **5. Tail Coefficients for Archimedean Copulas in Increasing Dimension**

A natural question to examine is the influence of increasing dimension on possible multivariate tail dependence. If one restricts attention to the class of Archimedean copulas, several results can be achieved, even though problems with interchanging limits, similar to those encountered when studying the continuity property (*T*2), occur. First, let us formulate a useful lemma that describes the behavior of the main diagonal of Archimedean copulas when the dimension increases.

**Lemma 2.** *Let* {*Cd*} *be a sequence of d-dimensional Archimedean copulas with (the same) generator ψ. Then for u* ∈ [0, 1) *and v* ∈ (0, 1]

$$\begin{aligned} \lim\_{d \to \infty} \mathbb{C}\_d(u, \dots, u) &= 0, \\ \lim\_{d \to \infty} \overline{\mathbb{C}}\_d(v, \dots, v) &= 0. \end{aligned}$$

**Proof.** The proof is along the same lines as the proof of Proposition 9 in [16].
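Lemma 2 can be illustrated numerically. A minimal sketch, assuming the Clayton copula of Example 6, whose diagonal has the explicit form *Cd*(*u*, ... , *u*) = (*du*<sup>−*θ*</sup> − *d* + 1)<sup>−1/*θ*</sup> (illustration only):

```python
# Illustration of Lemma 2 for the Clayton generator (Example 6):
# the diagonal C_d(u,...,u) of an Archimedean copula with a fixed
# generator tends to 0 as the dimension d grows.

def clayton_diag(u, d, theta):
    """Diagonal section C_d(u,...,u) of the d-variate Clayton copula."""
    return (d * u ** (-theta) - d + 1) ** (-1.0 / theta)

u, theta = 0.9, 2.0
vals = [clayton_diag(u, d, theta) for d in (2, 10, 100, 10000)]
assert all(x > y for x, y in zip(vals, vals[1:]))   # strictly decreasing in d
assert vals[-1] < 0.05                              # already close to 0
print([round(v, 4) for v in vals])
```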

This lemma can be used in the following statements that focus on the individual multivariate tail coefficients. The first one to be examined is Frahm's extremal dependence coefficient *εL*.

**Proposition 10.** *Let* {*Cd*} *be a sequence of d-dimensional Archimedean copulas with (the same) generator ψ. Further assume that*

$$\lim\_{d \to \infty} \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_d(u\mathbf{1})} = \lim\_{u \searrow 0} \lim\_{d \to \infty} \frac{\mathbb{C}\_d(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_d(u\mathbf{1})}.$$

*Then*

$$\lim\_{d \to \infty} \epsilon\_L(\mathcal{C}\_d) = 0.$$

**Proof.** The statement follows by the direct application of Lemma 2, since then

$$\lim\_{d \to \infty} \epsilon\_L(\mathbb{C}\_d) = \lim\_{u \searrow 0} \lim\_{d \to \infty} \frac{\mathbb{C}\_d(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_d(u\mathbf{1})} = 0.$$

An analogous result could be stated for *εU*.

**Remark 2.** *The condition on interchanging limits is, in general, difficult to check. However, we discuss some examples in which the condition can be checked. A first example is that of the independence copula Cd*(*u*) = Π(*u*)*, for which Cd*(*u***1**) = *u<sup>d</sup> and C̄d*(*u***1**) = (1 − *u*)*<sup>d</sup>. Hence,* lim*u*↘0 *Cd*(*u***1**)/(1 − *C̄d*(*u***1**)) = 0 *for every d* ≥ 2*. Furthermore,* lim*d*→∞ *Cd*(*u***1**)/(1 − *C̄d*(*u***1**)) = 0 *for all u* ∈ [0, 1)*. Consequently, in this example, the condition of interchanging limits holds. A second example is the Gumbel–Hougaard copula, also considered in Example 7 in Section 6. For this copula, it can be seen that, as in the previous example, the two concerned limits (when u* → 0 *and when d* → ∞*) are zero and, hence, interchanging the limits is also valid in this example.*

*Proposition 10 further shows that if we construct estimators (based on values of u close to* 0 *or close to* 1*) of the limits above for Archimedean copulas in high dimensions, these will be very close to* 0*.*

For Li's tail dependence parameters *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub> and *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*U*</sub>, the situation is further complicated by the necessary selection of *Ih* and *Jd*−*h* and, in particular, of the cardinality *h*. However, if the cardinality of the set *Jd*−*h* is kept constant when the dimension *d* increases, the following result can be achieved.

**Proposition 11.** *Let* {*Cd*} *be a sequence of d-dimensional Archimedean copulas with (the same) generator ψ, and let h in the definition of λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub> *be given as h*(*d*) = *d* − *h*<sup>∗</sup> *for a constant h*<sup>∗</sup>*. Further assume that*

$$\lim\_{d \to \infty} \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{\mathbb{C}\_{h^\*}(u\mathbf{1})} = \lim\_{u \searrow 0} \lim\_{d \to \infty} \frac{\mathbb{C}\_d(u\mathbf{1})}{\mathbb{C}\_{h^\*}(u\mathbf{1})}.$$

*Subsequently*

$$\lim\_{d \to \infty} \lambda\_L^{I\_{d-h^\*}|J\_{h^\*}}(\mathbb{C}\_d) = 0.$$

**Proof.** Using Lemma 2, we obtain

$$\lim\_{d \to \infty} \lambda\_L^{I\_{d-h^\*}|J\_{h^\*}}(\mathbb{C}\_d) = \lim\_{u \searrow 0} \lim\_{d \to \infty} \frac{\mathbb{C}\_d(u\mathbf{1})}{\mathbb{C}\_{h^\*}(u\mathbf{1})} = 0,$$

from which the statement of this proposition follows.

An analogous statement could be formulated for *λU*.

What can one learn from the results in this section? Archimedean copulas may not be very appropriate in high dimensions because of their symmetry, but they are a convenient class of copulas to use. It is good to be aware, though, that, when the dimension increases, the tail dependence of Archimedean copulas vanishes, at least from the perspective of *εL*, *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub> and their upper tail counterparts.

Obtaining similar results for different classes of copulas would also be of interest, for example, for extreme-value copulas with restrictions on the Pickands dependence function. However, this is complicated by the fact that, unlike Archimedean copulas, extreme-value copulas do not share a structure that can be carried through different dimensions. Some insights into this behavior are gained from the examples given in Section 6. That section includes examples on both Archimedean and extreme-value copulas, as well as examples outside these classes.

#### **6. Illustrative Examples**

**Example 4.** *Farlie–Gumbel–Morgenstern copula.*

Let *Cd* be a *d*-dimensional Farlie–Gumbel–Morgenstern copula defined as

$$\mathbb{C}\_{d}(u) = u\_{1}u\_{2}\dots u\_{d} \left[ 1 + \sum\_{j=2}^{d} \sum\_{1 \le k\_{1} < \dots < k\_{j} \le d} \alpha\_{k\_{1},\dots,k\_{j}} \left( 1 - u\_{k\_{1}} \right) \dots \left( 1 - u\_{k\_{j}} \right) \right], \tag{35}$$

where the parameters have to satisfy the following 2*<sup>d</sup>* conditions

$$1 + \sum\_{j=2}^{d} \sum\_{1 \le k\_1 < \cdots < k\_j \le d} \alpha\_{k\_1, \dots, k\_j} \varepsilon\_{k\_1} \cdots \varepsilon\_{k\_j} \ge 0, \quad \forall \varepsilon\_1, \dots, \varepsilon\_d \in \{-1, 1\}.$$

This copula is neither an Archimedean nor an extreme-value copula.

We first consider Frahm's extremal dependence coefficients *εL* and *εU*. From (35), up to a constant, *Cd*(*u***1**) ≈ *u<sup>d</sup>* when *u* ≈ 0. Further, plugging (35) into (2) gives that 1 − *C̄d*(*u***1**) behaves, up to constants, like a polynomial *u* − *u*<sup>2</sup> + ... when *u* ≈ 0. Thus,

$$\epsilon\_L(\mathbb{C}\_d) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_d(u\mathbf{1})} = 0,$$

because the polynomial in the numerator converges to zero faster than the polynomial in the denominator. Similarly, one obtains

$$\epsilon\_U(\mathbb{C}\_d) = \lim\_{u \nearrow 1} \frac{\overline{\mathbb{C}}\_d(u\mathbf{1})}{1 - \mathbb{C}\_d(u\mathbf{1})} = 0.$$

While examining *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub> and *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*U*</sub>, the very same arguments are of use. No matter how one chooses the index sets *Ih* and *Jd*−*h*,

$$
\lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d) = \lambda\_U^{I\_h|J\_{d-h}}(\mathbb{C}\_d) = 0
$$

since, again, the corresponding limits contain ratios of polynomials, such that the polynomials in the numerators converge to zero faster than the polynomials in the denominators.

To obtain *λL*,*S*, the integral ∫<sub>[0,*p*]*<sup>d</sup>*</sub> *Cd*(*u*) d*u* needs to be calculated. Consider now the special case in which the only non-zero parameter is *α* = *α*1,...,*d*. Then

$$\int\_{[0,p]^d} \mathbb{C}\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} = \int\_{[0,p]^d} u\_1 u\_2 \dots u\_d \left[ 1 + \alpha(1 - u\_1) \dots (1 - u\_d) \right] \, \mathrm{d}\mathfrak{u} = \left( \frac{p^2}{2} \right)^d + \alpha \left( \frac{3p^2 - 2p^3}{6} \right)^d.$$

Going back to general *Cd*, we can notice that the resulting integral would always be a polynomial in *p*, with the lowest power being 2*d* and thus

$$\lambda\_{L,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{d+1}{p^{d+1}} p^{2d} = 0.$$

A similar calculation leads to *λU*,*S*(*Cd*) = 0. Some further calculations (not presented here) also show that *λ*<sup>∗</sup><sub>*U*,*S*</sub>(*Cd*) = 0.
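The special-case integral above can be checked by simple quadrature. The following sketch (illustration only, *d* = 2, midpoint rule) compares the closed form (*p*²/2)*<sup>d</sup>* + *α*((3*p*² − 2*p*³)/6)*<sup>d</sup>* with a numerical approximation:

```python
# Quadrature check (d = 2 only) of the special-case FGM integral:
# int_{[0,p]^d} C_d(u) du = (p^2/2)^d + alpha * ((3p^2 - 2p^3)/6)^d,
# where C_2(u1,u2) = u1*u2*(1 + alpha*(1-u1)*(1-u2)).

def fgm2(u1, u2, alpha):
    return u1 * u2 * (1 + alpha * (1 - u1) * (1 - u2))

p, alpha, n = 0.4, 0.5, 400
h = p / n
# midpoint rule on [0, p]^2
num = sum(fgm2((i + 0.5) * h, (j + 0.5) * h, alpha)
          for i in range(n) for j in range(n)) * h * h
closed = (p ** 2 / 2) ** 2 + alpha * ((3 * p ** 2 - 2 * p ** 3) / 6) ** 2
assert abs(num - closed) < 1e-6
print(num, closed)
```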

From the perspective of all the above tail dependence coefficients, the Farlie–Gumbel–Morgenstern copula does not possess any tail dependence.

**Example 5.** *Cuadras–Augé copula.*

Let *Cd* be a *d*-variate Cuadras–Augé copula, that is, of the form

$$C\_d(u\_1, \dots, u\_d) = [\min(u\_1, \dots, u\_d)]^\theta (u\_1 u\_2 \dots u\_d)^{1-\theta}$$

for *θ* ∈ [0, 1]. The Cuadras–Augé copula combines the comonotonicity copula *Md* with the independence copula Π*d*. If *θ* = 0, then *Cd* becomes Π*d*. If *θ* = 1, then *Cd* becomes *Md*.

We again start with calculating *εL* and *εU*. From (2), we find

$$\overline{\mathbb{C}}\_d(u\mathbf{1}) = 1 + \sum\_{j=1}^d \left[ (-1)^j \binom{d}{j} u^{j - (j-1)\theta} \right]$$

and Frahm's lower extremal dependence coefficient *εL* is thus given as

$$\begin{split} \varepsilon\_{L}(\mathbb{C}\_{d}) &= \lim\_{u \searrow 0} \frac{\mathbb{C}\_{d}(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_{d}(u\mathbf{1})} = \lim\_{u \searrow 0} \frac{u^{d - (d - 1)\theta}}{\sum\limits\_{j=1}^{d} \left[ (-1)^{j + 1} \binom{d}{j} u^{j - (j - 1)\theta} \right]} \\ &= \lim\_{u \searrow 0} \frac{u^{d - (d - 1)\theta - 1}}{\sum\limits\_{j=1}^{d} \left[ (-1)^{j + 1} \binom{d}{j} u^{j - (j - 1)\theta - 1} \right]} = \begin{cases} 1 & \text{if } \theta = 1, \\ 0 & \text{if } \theta \in [0, 1) \end{cases} \end{split}$$

since, if *θ* ∈ [0, 1), the polynomial in *u* in the numerator converges to zero faster than the polynomial in the denominator. For *εU*, using L'Hospital's rule leads to

$$\begin{split} \epsilon\_{U}(\mathbb{C}\_{d}) &= \lim\_{u \nearrow 1} \frac{\overline{\mathbb{C}}\_{d}(u\mathbf{1})}{1 - \mathbb{C}\_{d}(u\mathbf{1})} = \lim\_{u \nearrow 1} \frac{1 + \sum\_{j=1}^{d} \left\{(-1)^{j} \binom{d}{j} u^{j - (j-1)\theta} \right\}}{1 - u^{d - (d-1)\theta}} \\ &= \lim\_{u \nearrow 1} \frac{\sum\_{j=1}^{d} \left\{(-1)^{j} \binom{d}{j} \left[j - (j-1)\theta\right] u^{j - (j-1)\theta - 1} \right\}}{-(d - (d-1)\theta) u^{d - (d-1)\theta - 1}} \\ &= \frac{\sum\_{j=1}^{d} \left\{(-1)^{j} \binom{d}{j} \left[j - (j-1)\theta\right] \right\}}{-(d - (d-1)\theta)} \end{split}$$

$$\begin{aligned} &= \frac{(1-\theta)\sum\_{j=1}^{d}\left[(-1)^{j}\binom{d}{j}j\right]+\theta\sum\_{j=1}^{d}\left[(-1)^{j}\binom{d}{j}\right]}{-(d-(d-1)\theta)}\\ &= \frac{0-\theta}{-(d-(d-1)\theta)}=\frac{\theta}{d-(d-1)\theta}.\end{aligned}$$

These values coincide with those calculated in [20] for a more general group of copulas. One can also notice that

$$\lim\_{d \to \infty} \epsilon\_U(\mathbb{C}\_d) = \begin{cases} 1 & \text{if } \theta = 1, \\ 0 & \text{if } \theta \in [0, 1). \end{cases}$$

In other words, if the parameter *θ* is smaller than 1, any sign of tail dependence disappears when the dimension increases. If *θ* = 1, then *εU*(*Cd*) = 1 for every *d* ≥ 2, which is no surprise since, in that case, *Cd* is the comonotonicity copula *Md*. This behavior is illustrated in Figure 3, which details the influence of the parameter *θ* on the speed of decrease of *εU*(*Cd*) when *d* increases.
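The closed form *εU*(*Cd*) = *θ*/(*d* − (*d* − 1)*θ*) derived above can be checked numerically by evaluating the defining ratio at *u* close to 1, using the inclusion–exclusion expansion of the survival diagonal given earlier. A sketch (illustration only):

```python
# Numerical illustration of eps_U(C_d) = theta / (d - (d-1)*theta) for the
# Cuadras-Auge copula, evaluating the defining ratio at u close to 1.
from math import comb

def survival_diag(u, d, theta):
    """Survival diagonal C-bar_d(u*1) via the inclusion-exclusion expansion."""
    return 1 + sum((-1) ** j * comb(d, j) * u ** (j - (j - 1) * theta)
                   for j in range(1, d + 1))

def copula_diag(u, d, theta):
    """Diagonal C_d(u*1) = u^(d - (d-1)*theta)."""
    return u ** (d - (d - 1) * theta)

theta, d, u = 0.7, 4, 1 - 1e-7
ratio = survival_diag(u, d, theta) / (1 - copula_diag(u, d, theta))
assert abs(ratio - theta / (d - (d - 1) * theta)) < 1e-4
print(ratio)
```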

A Cuadras–Augé copula is an exchangeable copula, i.e., it is invariant with respect to the order of its arguments. Therefore, when calculating Li's tail dependence parameters, only the cardinalities of the index sets *Ih* and *Jd*−*h* play a role. Subsequently,

$$\lambda\_L^{I\_h|J\_{d-h}}(\mathbb{C}\_d) = \lim\_{u \searrow 0} \frac{u^{d-(d-1)\theta}}{u^{d-h-(d-h-1)\theta}} = \begin{cases} 1 & \text{if } \theta = 1, \\ 0 & \text{if } \theta \in [0,1) \end{cases}$$

and by using L'Hospital's rule

$$\lambda\_{U}^{I\_{h}|J\_{d-h}}(\mathbb{C}\_{d}) = \lim\_{u\nearrow 1} \frac{1+\sum\_{j=1}^{d}(-1)^{j}\binom{d}{j}u^{j-(j-1)\theta}}{1+\sum\_{j=1}^{d-h}(-1)^{j}\binom{d-h}{j}u^{j-(j-1)\theta}} = \frac{\sum\_{j=1}^{d}(-1)^{j}\binom{d}{j}(j-(j-1)\theta)}{\sum\_{j=1}^{d-h}(-1)^{j}\binom{d-h}{j}(j-(j-1)\theta)}.\tag{36}$$

If *θ* = 1, then *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*U*</sub>(*Cd*) = 1, as expected, and it does not depend on the conditioning sets *Ih* and *Jd*−*h*.

For Schmid's and Schmidt's lower tail dependence measure *λL*,*S*(*Cd*), defined in (24), we first need to calculate the integral ∫<sub>[0,*p*]*<sup>d</sup>*</sub> *Cd*(*u*) d*u*. A straightforward calculation gives that

$$\int\_{[0,p]^d} \mathbb{C}\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} = \frac{d}{(2-\theta)^d} p^{(2-\theta)(d-1)+2} B\left(\frac{2}{2-\theta}, d\right),$$

where *B*(*s*, *t*) = ∫<sub>0</sub><sup>1</sup> *x*<sup>*s*−1</sup>(1 − *x*)<sup>*t*−1</sup> d*x* is the Beta function. We then get

$$
\lambda\_{L,S}(\mathbb{C}\_d) = \lim\_{p \searrow 0} \frac{d+1}{p^{d+1}} \frac{d}{(2-\theta)^d} p^{(2-\theta)(d-1)+2} B\left(\frac{2}{2-\theta}, d\right),
$$

which equals 1 when *θ* = 1 and 0 when *θ* ∈ [0, 1). Schmid's and Schmidt's lower tail dependence measure thus equals Frahm's lower extremal dependence coefficient *εL* as well as Li's lower tail dependence parameter *λ*<sup>*Ih*|*Jd*−*h*</sup><sub>*L*</sub>(*Cd*).
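The closed-form integral above can be checked numerically in low dimension. A sketch for *d* = 2 (illustration only; the Beta function is evaluated via the Gamma function):

```python
# Quadrature check (d = 2 only) of the Cuadras-Auge integral
# int_{[0,p]^d} C_d(u) du = d/(2-theta)^d * p^((2-theta)(d-1)+2) * B(2/(2-theta), d).
from math import gamma

def beta(s, t):
    """Beta function B(s, t) = Gamma(s)Gamma(t)/Gamma(s+t)."""
    return gamma(s) * gamma(t) / gamma(s + t)

def ca2(u, v, theta):
    """Bivariate Cuadras-Auge copula min(u,v)^theta * (u*v)^(1-theta)."""
    return min(u, v) ** theta * (u * v) ** (1 - theta)

theta, p, n = 0.5, 0.5, 400
h = p / n
# midpoint rule on [0, p]^2
num = sum(ca2((i + 0.5) * h, (j + 0.5) * h, theta)
          for i in range(n) for j in range(n)) * h * h
closed = 2 / (2 - theta) ** 2 * p ** ((2 - theta) + 2) * beta(2 / (2 - theta), 2)
assert abs(num - closed) / closed < 5e-3
print(num, closed)
```

For *θ* = 1 and *d* = 2 the formula reduces to *p*³/3, the integral of the comonotonicity copula over [0, *p*]², which is a quick consistency check of the constant.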

Determining Schmid's and Schmidt's upper tail dependence measure *λU*,*S*(*Cd*) in (25) is less straightforward. This dependence measure involves three integrals. Because its expression concerns the limit when *p* → 0, it suffices to investigate the behavior of the numerator and the denominator of (25) for *p* close to 0. From (27) it is easy to see that, for *p* close to 0,

$$\int\_{[1-p,1]^d} \Pi\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} = p^d - \frac{d}{2} p^{d+1} + o\left(p^{d+1}\right),$$

and, hence, the denominator of (25) behaves, for *p* close to 0, as

$$\int\_{[1-p,1]^d} M\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} - \int\_{[1-p,1]^d} \Pi\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} = \frac{d(d-1)}{2(d+1)} p^{d+1} + o\left(p^{d+1}\right). \tag{37}$$

For the integral ∫<sub>[1−*p*,1]*<sup>d</sup>*</sub> *Cd*(*u*) d*u*, note that, since *Cd* is an exchangeable copula, we can divide the integration domain [1 − *p*, 1]<sup>*d*</sup> into *d* parts depending on which argument from *u*1, ... , *ud* is minimal. The integrals over each of the *d* parts are equal. We get

$$\begin{aligned} \int\_{[1-p,1]^d} \mathbb{C}\_d(\boldsymbol{u}) \, \mathrm{d}\boldsymbol{u} &= \quad d \int\_{1-p}^1 \boldsymbol{u}\_1 \left( \prod\_{j=2}^d \int\_{\boldsymbol{u}\_1}^1 \boldsymbol{u}\_j^{1-\theta} \, \mathrm{d}\boldsymbol{u}\_j \right) \, \mathrm{d}\boldsymbol{u}\_1 \\ &= \quad d \int\_{1-p}^1 \boldsymbol{u}\_1 \left( \frac{1-\boldsymbol{u}\_1^{2-\theta}}{2-\theta} \right)^{d-1} \, \mathrm{d}\boldsymbol{u}\_1 \\ &= \quad \frac{d}{(2-\theta)^{d-1}} \int\_{1-p}^1 \boldsymbol{u}\_1 \left( 1-\boldsymbol{u}\_1^{2-\theta} \right)^{d-1} \, \mathrm{d}\boldsymbol{u}\_1 \\ &= \quad p^d + \left[ \theta \frac{d(d-1)}{2(d+1)} - \frac{d}{2} \right] p^{d+1} + o\left(p^{d+1}\right), \end{aligned}$$

where the approximation, valid for *p* close to 0, is based on a careful evaluation of the integral. For brevity, we do not include the details here. Consequently the numerator of (25) behaves, for *p* close to 0, as

$$\int\_{[1-p,1]^d} \mathbb{C}\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} - \int\_{[1-p,1]^d} \Pi\_d(\mathfrak{u}) \, \mathrm{d}\mathfrak{u} = \theta \frac{d(d-1)}{2(d+1)} p^{d+1} + o\left(p^{d+1}\right). \tag{38}$$


Combining (37) and (38) reveals that *λU*,*S*(*Cd*) = *θ* for all *d* ≥ 2. Other calculations (omitted here for brevity) lead to *λ*<sup>∗</sup><sub>*U*,*S*</sub>(*Cd*) = *θ*.

A Cuadras–Augé copula is also an extreme-value copula. This can be seen through the following calculation, where the notation *u*(1) = min(*u*1,..., *ud*) is used. One gets

$$\begin{aligned} \mathbb{C}\_{d}(\boldsymbol{u}\_{1},\ldots,\boldsymbol{u}\_{d}) &= [\boldsymbol{u}\_{(1)}]^{\theta} (\boldsymbol{u}\_{1}\boldsymbol{u}\_{2}\ldots\boldsymbol{u}\_{d})^{1-\theta} = \exp\left\{\theta \log\left(\boldsymbol{u}\_{(1)}\right) + (1-\theta) \sum\_{j=1}^{d} \log(\boldsymbol{u}\_{j})\right\} \\ &= \exp\left\{ \left(\frac{\theta \log\left(\boldsymbol{u}\_{(1)}\right)}{\log(\boldsymbol{u}\_{1}\boldsymbol{u}\_{2}\ldots\boldsymbol{u}\_{d})} + \frac{(1-\theta)\sum\_{j=1}^{d} \log(\boldsymbol{u}\_{j})}{\log(\boldsymbol{u}\_{1}\boldsymbol{u}\_{2}\ldots\boldsymbol{u}\_{d})}\right) \log(\boldsymbol{u}\_{1}\boldsymbol{u}\_{2}\ldots\boldsymbol{u}\_{d})\right\} \end{aligned}$$

and, thus, *Cd* is an extreme-value copula with Pickands dependence function

$$A\_d(w\_1, \ldots, w\_d) = \theta \max(w\_1, \ldots, w\_d) + (1 - \theta) \sum\_{j=1}^d w\_j.$$

This allows for calculating the tail coefficient for extreme-value copulas, *λU*,*E*, as

$$
\lambda\_{U,E}(\mathbb{C}\_d) = \frac{d}{d-1} \left( 1 - \frac{\theta}{d} - (1 - \theta) \right) = \theta.
$$

In the case of the Cuadras–Augé copula, tail dependence measured by *λU*,*E* does not depend on the dimension *d*. For illustration, the values of *λU*,*E*(*Cd*) are included in Figure 3. One can see that *εU* and *λU*,*E* behave very differently, both in terms of shapes and values.

**Figure 3.** Frahm's upper extremal dependence coefficient (black line) and tail dependence coefficient for extreme-value copulas *λU*,*<sup>E</sup>* (grey line) for a Cuadras–Augé copula with parameters 0.9 (solid line), 0.99 (dashed line) and 0.999 (dotted line) as a function of the dimension of the copula.

**Example 6.** *Clayton copula.*

Let *Cd* be a *d*-variate Clayton family copula defined as

$$\mathcal{C}\_d(u) = \left(\sum\_{j=1}^d u\_j^{-\theta} - d + 1\right)^{-1/\theta} \tag{39}$$

for *θ* > 0. The Clayton copula is an Archimedean copula and the behavior of its generator is studied in Example 2.

For Frahm's lower extremal dependence coefficient, either using (12) or by factoring out as below, one obtains

$$\epsilon\_L(\mathbb{C}\_d) = \lim\_{u \searrow 0} \frac{\mathbb{C}\_d(u\mathbf{1})}{1 - \overline{\mathbb{C}}\_d(u\mathbf{1})} = \lim\_{u \searrow 0} \frac{u(d - du^\theta + u^\theta)^{-1/\theta}}{\sum\_{j=1}^d (-1)^{j+1} \binom{d}{j} u(j - ju^\theta + u^\theta)^{-1/\theta}}$$

$$= \frac{d^{-1/\theta}}{\sum\_{j=1}^d (-1)^{j+1} \binom{d}{j} j^{-1/\theta}},\tag{40}$$

whereas, for Frahm's upper extremal dependence coefficient, using (13) with the derivative of the Clayton generator, *ψ*′(*t*) = −(1 + *θt*)<sup>−(1+*θ*)/*θ*</sup>, one finds

$$\epsilon\_U(\mathbb{C}\_d) = \lim\_{t \searrow 0} \frac{\sum\_{j=1}^d (-1)^j \binom{d}{j} \psi'(jt) j}{-\psi'(dt)d} = \frac{-\sum\_{j=1}^d (-1)^j \binom{d}{j} j}{d} = \sum\_{j=1}^d (-1)^{j+1} \binom{d-1}{j-1} = 0.$$

An analytical calculation of lim*d*→∞ *εL*(*Cd*) is not possible; however, insight can be gained by plotting *εL*(*Cd*) as a function of the dimension *d*. This is done in Figure 4. From the plot, it is evident that *εL*(*Cd*) decreases when the dimension increases. However, for larger parameter values, the decrease seems to be slow.

**Figure 4.** Frahm's lower extremal dependence coefficient for Clayton copula with parameters 1 (solid line), 5 (dashed line) and 10 (dotted line) as a function of the dimension of the copula.
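The decrease visible in Figure 4 can be reproduced by evaluating (40) directly. A short sketch (illustration only):

```python
# Evaluation of Frahm's lower coefficient for the Clayton copula via (40):
# eps_L(C_d) = d^(-1/theta) / sum_{j=1}^d (-1)^(j+1) * C(d,j) * j^(-1/theta),
# illustrating the decrease in the dimension d seen in Figure 4.
from math import comb

def eps_L_clayton(d, theta):
    num = d ** (-1.0 / theta)
    den = sum((-1) ** (j + 1) * comb(d, j) * j ** (-1.0 / theta)
              for j in range(1, d + 1))
    return num / den

theta = 5.0
vals = [eps_L_clayton(d, theta) for d in range(2, 11)]
assert all(x > y for x, y in zip(vals, vals[1:]))   # decreasing in d
assert 0 < vals[-1] < vals[0] < 1
print([round(v, 4) for v in vals])
```

For *d* = 2, the expression reduces to the familiar bivariate relation *εL* = *λL*/(2 − *λL*) with *λL* = 2<sup>−1/*θ*</sup>, which is a quick consistency check.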

A Clayton copula is also an exchangeable copula and, thus, when calculating Li's tail dependence parameters, only the cardinality of the index sets $I_h$ and $J_{d-h}$ comes into play. Then

$$\begin{split} \lambda_L^{I_h|J_{d-h}}(\mathcal{C}_d) &= \lim_{u \searrow 0} \frac{\left(du^{-\theta} - d + 1\right)^{-1/\theta}}{\left((d-h)u^{-\theta} - (d-h) + 1\right)^{-1/\theta}} = \lim_{u \searrow 0} \frac{\left(d - du^{\theta} + u^{\theta}\right)^{-1/\theta}}{\left(d - h - (d-h)u^{\theta} + u^{\theta}\right)^{-1/\theta}} \\ &= \left(\frac{d-h}{d}\right)^{1/\theta}. \end{split} \tag{41}$$

If, as in Proposition 11, the cardinality of $J_{d-h}$ is kept constant (equal to $h^*$) when the dimension increases, then

$$\lim_{d \to \infty} \lambda_L^{I_h|J_{d-h}}(\mathcal{C}_d) = 0. \tag{42}$$

In fact, in this example, even a milder condition is sufficient for achieving (42). If $h = h(d)$ is linked to the dimension such that $\lim_{d\to\infty} (d - h(d))/d = 0$, then (42) holds. However, for large values of the parameter $\theta$, the convergence in (42) might be very slow. By applying L'Hospital's rule $(d-h)$ times, one can also calculate
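Since (41) is available in closed form, the limit behavior in (42) and its slow convergence for large $\theta$ are easy to inspect numerically; a small sketch (the function name is ours):

```python
def li_lower_clayton(d, h, theta):
    """Evaluate (41): Li's lower tail dependence parameter of the
    d-variate Clayton copula, conditioning on d - h components."""
    return ((d - h) / d) ** (1.0 / theta)

# With fixed conditioning cardinality d - h = 1 (i.e. h* = 1), the
# parameter tends to 0 as d grows, but slowly for large theta:
for theta in (1.0, 10.0):
    print(theta, [round(li_lower_clayton(d, d - 1, theta), 3)
                  for d in (10, 100, 1000)])
```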

$$
\lambda_U^{I_h|J_{d-h}}(\mathcal{C}_d) = 0.
$$

Spearman's rho for the Clayton copula cannot be explicitly calculated and, thus, the values of $\lambda_{L,S}$ and $\lambda_{U,S}$ are unknown.

**Example 7.** *Gumbel-Hougaard copula.*

Let *Cd* be a *d*-variate Gumbel–Hougaard copula, defined as

$$\mathcal{C}_d(u) = \exp\left\{-\left[\sum_{j=1}^d (-\log u_j)^{\theta}\right]^{1/\theta}\right\}$$

where $\theta \ge 1$. The Gumbel–Hougaard copula is the only copula (family) that is both an extreme-value and an Archimedean copula, as proved in [21] (Sec. 2). The behavior of its Archimedean generator is studied in Example 3. Note that $\theta = 1$ corresponds to the independence copula $\Pi_d$ and the limiting case $\theta \to \infty$ corresponds to the comonotonicity copula $M_d$.

As expected (see (10)), for an extreme-value copula that is not the comonotonicity copula, Frahm's lower extremal dependence coefficient is

$$\epsilon_L(\mathcal{C}_d) = \lim_{u \searrow 0} \frac{\mathcal{C}_d(u\mathbf{1})}{1 - \overline{\mathcal{C}}_d(u\mathbf{1})} = \lim_{u \searrow 0} \frac{u^{d^{1/\theta}}}{\sum_{j=1}^d (-1)^{j+1} \binom{d}{j} u^{j^{1/\theta}}} = 0$$

since the power of $u$ in the numerator converges to zero faster than the dominant power of $u$ in the denominator. For Frahm's upper extremal dependence coefficient, by using (13) with the derivative of the Gumbel–Hougaard generator $\psi'(t) = -\frac{1}{\theta} \exp(-t^{1/\theta})\, t^{1/\theta - 1}$, one obtains

$$\epsilon_U(\mathcal{C}_d) = \lim_{t \searrow 0} \frac{\sum_{j=1}^{d} (-1)^{j} \binom{d}{j} \psi'(jt)\, j}{-\psi'(dt)\, d} = \lim_{t \searrow 0} \frac{\frac{-1}{\theta} t^{1/\theta - 1} \sum_{j=1}^{d} (-1)^{j} \binom{d}{j} \exp(-(jt)^{1/\theta})\, j^{1/\theta}}{\frac{1}{\theta} t^{1/\theta - 1} \exp(-(dt)^{1/\theta})\, d^{1/\theta}}$$

$$= \frac{\sum\_{j=1}^{d} (-1)^{j+1} \binom{d}{j} j^{1/\theta}}{d^{1/\theta}}.\tag{43}$$

An analytical calculation of $\lim_{d\to\infty} \epsilon_U(\mathcal{C}_d)$ is not possible; however, insight can be gained by plotting $\epsilon_U(\mathcal{C}_d)$ as a function of the dimension $d$. This is done in Figure 5. It is evident that $\epsilon_U(\mathcal{C}_d)$ decreases when the dimension increases; but, the decrease seems to be slow for larger parameter values. When comparing Figures 4 and 5, one might come to the conclusion that $\epsilon_L$ for the Clayton copula with parameter $\theta$ is equal to $\epsilon_U$ for the Gumbel–Hougaard copula with the same parameter $\theta$. Despite their similarity, that is not true, as can be easily checked by calculating both of the quantities for any pair $(d, \theta)$.
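The claim that the two curves differ, despite looking alike, can indeed be checked by evaluating (40) and (43) for a single pair $(d, \theta)$; a sketch (both function names are ours):

```python
from math import comb

def frahm_lower_clayton(d, theta):
    """Equation (40): Frahm's lower coefficient, Clayton copula."""
    num = d ** (-1.0 / theta)
    den = sum((-1) ** (j + 1) * comb(d, j) * j ** (-1.0 / theta)
              for j in range(1, d + 1))
    return num / den

def frahm_upper_gumbel(d, theta):
    """Equation (43): Frahm's upper coefficient, Gumbel-Hougaard copula."""
    num = sum((-1) ** (j + 1) * comb(d, j) * j ** (1.0 / theta)
              for j in range(1, d + 1))
    return num / d ** (1.0 / theta)

# Already at d = 3, theta = 2 the two values visibly disagree:
print(frahm_lower_clayton(3, 2.0))   # ~0.397
print(frahm_upper_gumbel(3, 2.0))    # ~0.283
```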

**Figure 5.** Frahm's upper extremal dependence coefficient (black line) and the tail dependence coefficient for extreme-value copulas $\lambda_{U,E}$ (grey line) for a Gumbel–Hougaard copula with parameters 2 (solid line), 5 (dashed line) and 10 (dotted line), as a function of the dimension of the copula.

When calculating Li's tail dependence parameters, one uses that the Gumbel–Hougaard copula is also an exchangeable copula and, thus, only the cardinality of the index sets $I_h$ and $J_{d-h}$ plays a role. Then

$$\lambda_L^{I_h|J_{d-h}}(\mathcal{C}_d) = \lim_{u \searrow 0} \frac{u^{d^{1/\theta}}}{u^{(d-h)^{1/\theta}}} = 0.$$

If $\theta = 1$, then $\lambda_U^{I_h|J_{d-h}}(\mathcal{C}_d) = 0$; otherwise, by using L'Hospital's rule,

$$\lambda_U^{I_h|J_{d-h}}(\mathcal{C}_d) = \lim_{u \nearrow 1} \frac{\sum_{j=0}^{d} (-1)^{j} \binom{d}{j} u^{j^{1/\theta}}}{\sum_{j=0}^{d-h} (-1)^{j} \binom{d-h}{j} u^{j^{1/\theta}}} = \frac{\sum_{j=1}^{d} (-1)^{j} \binom{d}{j}\, j^{1/\theta}}{\sum_{j=1}^{d-h} (-1)^{j} \binom{d-h}{j}\, j^{1/\theta}}. \tag{44}$$

This function of the parameter $\theta$, the dimension $d$ and the cardinality $h$ is rather involved; it is depicted in Figure 6 for different parameter choices and also two different selections of $h$. In one of the cases, $h = d - 1$, which corresponds to $h^* = 1$ in Proposition 11. In the other case, the number of components on which we condition, $h^* = h^*(d)$, is chosen to increase with $d$, specifically $h^*(d) = \lceil \sqrt{d}\, \rceil$. For $h^* = 1$ (and thus the setting of Proposition 11), the tail coefficient slowly decreases with the dimension, as expected. An interesting behavior is seen for $h^*(d) = \lceil \sqrt{d}\, \rceil$, where the tail coefficient seems to be, except for instability in low dimensions, constant, independently of the choice of the parameter $\theta$.

**Figure 6.** Li's upper tail dependence parameter with $h^* = 1$ (black line) and with $h^* = \lceil \sqrt{d}\, \rceil$ (grey line) for a Gumbel–Hougaard copula with parameters 2 (solid line), 5 (dashed line) and 10 (dotted line), as a function of the dimension of the copula.

Spearman's rho for a Gumbel–Hougaard copula cannot be calculated explicitly and, thus, the values of $\lambda_{L,S}$ and $\lambda_{U,S}$ are unknown.

The Pickands dependence function $A_d$ of a Gumbel–Hougaard copula is

$$A\_d(w) = (w\_1 + \dots + w\_d)^{-1} (w\_1^{\theta} + \dots + w\_d^{\theta})^{1/\theta}$$

and thus

$$
\lambda_{U,E}(\mathcal{C}_d) = \frac{d - d^{1/\theta}}{d - 1}.
$$

Note that $\lim_{d\to\infty} \lambda_{U,E}(\mathcal{C}_d) = 1$. From our perspective, such behavior is rather counter-intuitive and should be taken into account when using this tail coefficient.
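The counter-intuitive limit is immediate from the closed form above; a quick numerical sketch (the function name is ours):

```python
def lambda_UE_gumbel(d, theta):
    """lambda_{U,E} of the d-variate Gumbel-Hougaard copula."""
    return (d - d ** (1.0 / theta)) / (d - 1)

# Even for moderate dependence (theta = 2), the coefficient creeps
# towards 1 purely because the dimension grows:
for d in (2, 10, 100, 10000):
    print(d, round(lambda_UE_gumbel(d, 2.0), 4))
```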

An overview of the results obtained in the illustrative examples is given in Table 2.



#### **7. Estimation of Tail Coefficients**

Before we move to the estimation of tail coefficients itself, we introduce the setting and notation for the estimation.

#### *7.1. Preliminaries*

Let $X_1, \ldots, X_n$ be a random sample of a $d$-dimensional random vector with copula $C_d$, where $X_i = (X_{1,i}, \ldots, X_{d,i})$ for $i \in \{1, \ldots, n\}$. Throughout this section, the dimension $d$ of the copula $C_d$ is arbitrary but fixed and, thus, for simplicity of notation, we omit the subscript $d$ in $C_d$.

We consider the empirical copula

$$\widehat{C}_n(u) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left(\widehat{U}_{1,i} \le u_1, \ldots, \widehat{U}_{d,i} \le u_d\right), \tag{45}$$

where

$$
\widehat{U}_{j,i} = \widehat{F}_{j,n}(X_{j,i}), \quad \text{with} \quad \widehat{F}_{j,n}(x) = \frac{1}{n+1} \sum_{i=1}^{n} \mathbf{1}\left(X_{j,i} \le x\right), \quad x \in \mathbb{R}.
$$

Similarly, we define the empirical survival function as

$$
\widehat{\overline{C}}_n(u) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\left(\widehat{U}_{1,i} > u_1, \ldots, \widehat{U}_{d,i} > u_d\right).
$$
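In code, the pseudo-observations, the empirical copula (45), and its survival counterpart can be sketched as follows (a minimal NumPy version assuming continuous margins, i.e. no ties; all names are ours):

```python
import numpy as np

def pseudo_obs(X):
    """Pseudo-observations U-hat_{j,i} = rank/(n+1), computed per margin
    of an (n, d) data matrix; assumes no ties (continuous margins)."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # 1 = smallest
    return ranks / (n + 1)

def empirical_copula(U, u):
    """C-hat_n(u) of (45), evaluated from pseudo-observations U."""
    return np.mean(np.all(U <= np.asarray(u), axis=1))

def empirical_survival(U, u):
    """The empirical survival function evaluated at u."""
    return np.mean(np.all(U > np.asarray(u), axis=1))

# Comonotone toy sample: both margins carry identical ranks.
z = np.arange(9.0)
U = pseudo_obs(np.column_stack([z, z]))
print(empirical_copula(U, [0.5, 0.5]), empirical_survival(U, [0.5, 0.5]))
```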

For extreme-value copulas, one can take advantage of estimation methods for the Pickands dependence function or the stable tail dependence function. The estimation of these was discussed, for example, in [22–24], or [7]. We briefly discuss the estimator for the Pickands dependence function, as proposed in [7].

Madogram Estimator of Pickands Dependence Function

The multivariate $w$-madogram, as introduced in [7], is, for $w \in \Delta_{d-1}$, defined as

$$\nu_d(w) = \mathbb{E}\left(\bigvee_{j=1}^d F_j^{1/w_j}(X_j) - \frac{1}{d} \sum_{j=1}^d F_j^{1/w_j}(X_j)\right),$$

where $u^{1/w_j} = 0$ by convention if $w_j = 0$ and $0 < u < 1$. The authors in [7] further show a relation between the Pickands dependence function and the madogram, given by

$$A\_d(w) = \frac{\nu\_d(w) + c(w)}{1 - \nu\_d(w) - c(w)}$$

where $c(w) = d^{-1} \sum_{j=1}^d w_j/(1 + w_j)$. This leads to the following estimator of the Pickands dependence function,

$$
\widehat{A}_n^{\mathrm{MD}}(w) = \frac{\widehat{\nu}_n(w) + c(w)}{1 - \widehat{\nu}_n(w) - c(w)},
$$

with

$$
\widehat{\nu}_n(w) = \frac{1}{n} \sum_{i=1}^{n} \left(\bigvee_{j=1}^{d} \widehat{F}_{j,n}^{1/w_j}(X_{j,i}) - \frac{1}{d} \sum_{j=1}^{d} \widehat{F}_{j,n}^{1/w_j}(X_{j,i})\right).
$$

However, the estimator $\widehat{A}_n^{\mathrm{MD}}$ is not necessarily a proper Pickands dependence function. To deal with this problem, the authors of [7] propose an estimator based on Bernstein polynomials that overcomes this issue and results in a proper Pickands dependence function.
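A direct implementation of $\widehat{\nu}_n$ and $\widehat{A}_n^{\mathrm{MD}}$ is short (a hedged sketch without the Bernstein correction; all names are ours). On a perfectly comonotone sample the madogram is exactly zero, so $\widehat{A}_n^{\mathrm{MD}}(1/2, 1/2) = c(w)/(1 - c(w)) = 1/2$, matching the known Pickands function $A_2(w) = \max(w_1, w_2)$ of the comonotonicity copula:

```python
import numpy as np

def pseudo_obs(X):
    """Ranks/(n+1) per margin; assumes continuous margins (no ties)."""
    n = X.shape[0]
    return (X.argsort(axis=0).argsort(axis=0) + 1) / (n + 1)

def pickands_madogram(X, w):
    """Madogram estimator A-hat_n^MD(w) for w in the unit simplex."""
    U = pseudo_obs(X)
    w = np.asarray(w, dtype=float)
    P = np.zeros_like(U)
    pos = w > 0
    # convention: u^(1/w_j) = 0 when w_j = 0
    P[:, pos] = U[:, pos] ** (1.0 / w[pos])
    nu = np.mean(P.max(axis=1) - P.mean(axis=1))   # nu-hat_n(w)
    c = np.mean(w / (1.0 + w))                     # c(w)
    return (nu + c) / (1.0 - nu - c)

z = np.arange(50.0)
X = np.column_stack([z, z])              # comonotone sample
print(pickands_madogram(X, [0.5, 0.5]))  # ~0.5
```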

#### *7.2. Estimation of the Various Tail Coefficients*

#### 7.2.1. Estimation of Frahm's Extremal Dependence Coefficient

The estimation of Frahm's extremal dependence coefficients has not been discussed in the literature so far. However, a straightforward approach is to consider empirical approximations of the quantities in definition (7), i.e.,

$$
\widehat{\epsilon}_L = \frac{\widehat{C}_n(u_n, \ldots, u_n)}{1 - \widehat{\overline{C}}_n(u_n, \ldots, u_n)}, \qquad \widehat{\epsilon}_U = \frac{\widehat{\overline{C}}_n(1 - u_n, \ldots, 1 - u_n)}{1 - \widehat{C}_n(1 - u_n, \ldots, 1 - u_n)},
$$

where $\{u_n\}$ is a sequence of positive numbers converging to zero. The choice of $u_n$ is crucial for the performance of the estimator. Small values of $u_n$ provide an estimator with low bias but large variance; large values of $u_n$ provide an estimator with large bias but small variance. Note that, in applications, it is useful to think of $u_n$ as $u_n = \frac{k_n}{n+1}$, where $k_n$ stands for the number of extreme values used in the estimation procedure.
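A plug-in version of these estimators with $u_n = k/(n+1)$ takes only a few lines (a sketch; names ours). On a perfectly comonotone sample both empirical coefficients equal 1 for any $k$, in line with $\epsilon_L(M_d) = \epsilon_U(M_d) = 1$:

```python
import numpy as np

def pseudo_obs(X):
    n = X.shape[0]
    return (X.argsort(axis=0).argsort(axis=0) + 1) / (n + 1)

def frahm_estimators(X, k):
    """Empirical Frahm coefficients with threshold u_n = k/(n+1)."""
    U = pseudo_obs(X)
    n = U.shape[0]
    u = k / (n + 1)
    C_low = np.mean(np.all(U <= u, axis=1))      # C-hat_n(u_n 1)
    S_low = np.mean(np.all(U > u, axis=1))       # survival at u_n 1
    S_up = np.mean(np.all(U > 1 - u, axis=1))    # survival at (1-u_n) 1
    C_up = np.mean(np.all(U <= 1 - u, axis=1))   # C-hat_n((1-u_n) 1)
    return C_low / (1.0 - S_low), S_up / (1.0 - C_up)

z = np.arange(200.0)
X = np.column_stack([z, z, z])   # comonotone sample in d = 3
print(frahm_estimators(X, 20))   # both approximately 1
```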

Alternatively, if the underlying copula is known to be an extreme-value copula, the estimator can be based on the estimator of the Pickands dependence function plugged into (11). This results in the following estimator,

$$\widehat{\epsilon}_U^{\mathrm{MD}} = \frac{\sum_{j=1}^{d} (-1)^{j+1} \sum_{1 \le k_1 < \cdots < k_j \le d}\, j\, \widehat{A}_n^{\mathrm{MD}}(w_1, \ldots, w_d)}{d\, \widehat{A}_n^{\mathrm{MD}}(1/d, \ldots, 1/d)},$$

with $w_\ell = 1/j$ if $\ell \in \{k_1, \ldots, k_j\}$ and $w_\ell = 0$ otherwise.

#### 7.2.2. Estimation of Li's Tail Dependence Parameters

Similarly as for Frahm's extremal dependence coefficients, one can introduce the following estimators,

$$
\widehat{\lambda}_L^{I_h|J_{d-h}} = \frac{\widehat{C}_n(u_n, \ldots, u_n)}{\widehat{C}_n^{J_{d-h}}(u_n, \ldots, u_n)}, \qquad \widehat{\lambda}_U^{I_h|J_{d-h}} = \frac{\widehat{\overline{C}}_n(1 - u_n, \ldots, 1 - u_n)}{\widehat{\overline{C}}_n^{J_{d-h}}(1 - u_n, \ldots, 1 - u_n)}.
$$
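Given the empirical copula, Li's estimators follow directly, with $\widehat{C}_n^{J_{d-h}}$ simply the empirical copula of the margins in $J_{d-h}$; a sketch of the lower estimator (names ours):

```python
import numpy as np

def pseudo_obs(X):
    n = X.shape[0]
    return (X.argsort(axis=0).argsort(axis=0) + 1) / (n + 1)

def li_lower_estimator(X, J, k):
    """lambda-hat_L^{I_h|J_{d-h}} with u_n = k/(n+1); J lists the
    column indices of the conditioning margins J_{d-h}."""
    U = pseudo_obs(X)
    n = U.shape[0]
    u = k / (n + 1)
    num = np.mean(np.all(U <= u, axis=1))          # all d margins
    den = np.mean(np.all(U[:, J] <= u, axis=1))    # margins in J only
    return num / den

z = np.arange(100.0)
X = np.column_stack([z, z, z])         # comonotone: ratio is exactly 1
print(li_lower_estimator(X, [0], 10))  # 1.0
```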

#### 7.2.3. Estimation of Schmid's and Schmidt's Tail Dependence Measure

Also in this case, one can make use of the empirical copula (45). Recall the definition of $\lambda_{L,S}$ in (24) and consider $p$ small. More precisely, let $p_n$ be a small positive number. One can then calculate

$$\int_{[0,p_n]^d} \widehat{C}_n(u)\, \mathrm{d}u = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d \left(p_n - \widehat{U}_{j,i}\right)_+. \tag{46}$$

The estimator of $\lambda_{L,S}$ that could then be considered is of the form

$$\frac{(d+1)}{n \; p\_n^{d+1}} \sum\_{i=1}^n \prod\_{j=1}^d \left(p\_n - \widehat{U}\_{j,i}\right)\_+.$$

However, this quantity does not attain the value 1 for a sample from a comonotonicity copula; see the related discussion in [25]. This problem worsens as $p_n$ gets smaller. Thus, we propose to use an estimator defined as

$$
\widehat{\lambda}_{L,S} = \frac{\sum_{i=1}^n \prod_{j=1}^d \left(p_n - \widehat{U}_{j,i}\right)_+}{\sum_{i=1}^n \left[\left(p_n - \frac{i}{n+1}\right)_+\right]^d},
$$

where the denominator is based on estimating $\int_{[0,p_n]^d} M_d(u)\, \mathrm{d}u$ using (46) and the fact that, for a sample from a comonotonicity copula, $\widehat{U}_{1,i} = \cdots = \widehat{U}_{d,i}$ for every $i \in \{1, \ldots, n\}$ almost surely. Analogous arguments lead to an estimator of $\lambda^*_{U,S}$, as defined in (26), given by

$$
\widehat{\lambda}^*_{U,S} = \frac{\sum_{i=1}^{n} \prod_{j=1}^{d} \left(p_n - (1 - \widehat{U}_{j,i})\right)_+}{\sum_{i=1}^{n} \left[\left(p_n - \frac{i}{n+1}\right)_+\right]^d}.
$$
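The role of the comonotone denominator is easy to verify in code: on a comonotone sample the numerator and denominator coincide term by term, so the estimator returns exactly 1. A sketch (names ours):

```python
import numpy as np

def pseudo_obs(X):
    n = X.shape[0]
    return (X.argsort(axis=0).argsort(axis=0) + 1) / (n + 1)

def schmid_schmidt_lower(X, p):
    """lambda-hat_{L,S}: normalized integral of C-hat_n over [0, p]^d."""
    U = pseudo_obs(X)
    n, d = U.shape
    num = np.sum(np.prod(np.clip(p - U, 0.0, None), axis=1))
    i = np.arange(1, n + 1)
    den = np.sum(np.clip(p - i / (n + 1), 0.0, None) ** d)
    return num / den

z = np.arange(100.0)
X = np.column_stack([z, z])          # comonotone sample
print(schmid_schmidt_lower(X, 0.2))  # 1.0
```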

#### 7.2.4. Estimation of $\lambda_{U,E}$, the Proposed Tail Coefficient for Extreme-Value Copulas

Because the coefficient $\lambda_{U,E}$ in (28) is a function of the Pickands dependence function $A_d$, its estimation can again be based on the estimation of $A_d$. For example, the madogram estimator $\widehat{A}_n^{\mathrm{MD}}$ can be used, which results in the following estimator,

$$
\widehat{\lambda}_{U,E}^{\mathrm{MD}} = \frac{d}{d-1}\left(1 - \widehat{A}_n^{\mathrm{MD}}(1/d, \ldots, 1/d)\right).
$$

The consistency results for the suggested estimators can be found in the following propositions.

**Proposition 12.** *Suppose that $C_d$ is a d-variate extreme-value copula. Then the estimators $\widehat{\epsilon}_U^{\mathrm{MD}}$ and $\widehat{\lambda}_{U,E}^{\mathrm{MD}}$ are strongly consistent.*

**Proof.** The statement of the proposition follows from Theorem 2.4(b) in [7], which states that

$$\sup_{w \in \Delta_{d-1}} \left| \widehat{A}_n^{\mathrm{MD}}(w) - A_d(w) \right| \xrightarrow[n \to \infty]{\text{a.s.}} 0.$$

**Proposition 13.** *Suppose that $u_n, p_n \in (n^{-\delta}, n^{-\gamma})$ for some $0 < \gamma < \delta < 1$.*

*(i) Then $\widehat{\epsilon}_L$ and $\widehat{\epsilon}_U$ are weakly consistent.*

*(ii) Then $\widehat{\lambda}_{L,S}$ and $\widehat{\lambda}^*_{U,S}$ are weakly consistent.*

*(iii) Further suppose that $n\, C^{J_{d-h}}(u_n \mathbf{1}) \to \infty$. Then the following implications hold. If $\lim_{\gamma \to 0} \lim_{u \to 0+} \frac{C^{J_{d-h}}(u(1+\gamma)\mathbf{1})}{C^{J_{d-h}}(u\mathbf{1})} = 1$, then $\widehat{\lambda}_L^{I_h|J_{d-h}}$ is weakly consistent. If $\lim_{\gamma \to 0} \lim_{u \to 0+} \frac{\overline{C}^{J_{d-h}}((1 - u(1+\gamma))\mathbf{1})}{\overline{C}^{J_{d-h}}((1 - u)\mathbf{1})} = 1$, then $\widehat{\lambda}_U^{I_h|J_{d-h}}$ is weakly consistent.*

**Proof.** We will only deal with the estimators of the lower dependence coefficients $\widehat{\epsilon}_L$, $\widehat{\lambda}_{L,S}$ and $\widehat{\lambda}_L^{I_h|J_{d-h}}$. The estimators of the upper dependence coefficients can be handled completely analogously.

*Showing (i)*.

With the help of (A.22) of [26], one gets that, for each $\beta < \frac{1}{2}$,

$$
\widehat{U}_{j,i} = U_{j,i} + U_{j,i}^{\beta}\, O_P\left(\frac{1}{\sqrt{n}}\right), \quad \text{uniformly in } j \in \{1, \ldots, d\},\ i \in \{1, \ldots, n\}.
$$

This, together with Lemma A3 in [27] (see also (A.12) in [26]), implies that, for each $\varepsilon > 0$, with probability arbitrarily close to 1, for all sufficiently large $n$ it holds that

$$\left[U_{j,i} \le u_n(1-\varepsilon)\right] \subseteq \left[\widehat{U}_{j,i} \le u_n\right] \subseteq \left[U_{j,i} \le u_n(1+\varepsilon)\right], \quad \text{for all } j, i. \tag{47}$$

Denote

$$G\_n(\mathbf{u}) = \frac{1}{n} \sum\_{i=1}^n \mathbf{1} \{ \mathbf{U}\_i \le \mathbf{u} \}.$$

Subsequently, conditionally on (47) and with the help of Chebyshev's inequality, one gets that

$$\widehat{C}_n(u_n \mathbf{1}) \le G_n(u_n(1+\varepsilon)\mathbf{1}) = C(u_n(1+\varepsilon)\mathbf{1}) + \sqrt{C\left(u_n(1+\varepsilon)\mathbf{1}\right)}\, O_P\left(\frac{1}{\sqrt{n}}\right) \tag{48}$$

$$= C(u_n \mathbf{1}) + \varepsilon\, O(u_n) + \sqrt{u_n}\, O_P\left(\frac{1}{\sqrt{n}}\right). \tag{49}$$

Analogously, also

$$
\widehat{C}_n(u_n \mathbf{1}) \ge C(u_n \mathbf{1}) + \varepsilon\, O(u_n) + \sqrt{u_n}\, O_P\left(\frac{1}{\sqrt{n}}\right). \tag{50}
$$

As *ε* > 0 is arbitrary, one can combine (49) and (50) to deduce that

$$
\widehat{C}_n(u_n \mathbf{1}) = C(u_n \mathbf{1}) + o_P(u_n). \tag{51}
$$

Completely analogously with the help of (2), one can show that

$$1 - \widehat{\overline{C}}_n(u_n \mathbf{1}) = 1 - \overline{C}(u_n \mathbf{1}) + o_P(u_n). \tag{52}$$

Further note that

$$1 - \overline{C}(u_n \mathbf{1}) = \mathbb{P}\Big(\min_{j} U_j \le u_n\Big) \ge \mathbb{P}(U_1 \le u_n) = u_n. \tag{53}$$

Now combining (51), (52) and (53) yields that

$$\widehat{\epsilon}_L = \frac{\widehat{C}_n(u_n \mathbf{1})}{1 - \widehat{\overline{C}}_n(u_n \mathbf{1})} = \frac{C(u_n \mathbf{1}) + o_P(u_n)}{1 - \overline{C}(u_n \mathbf{1}) + o_P(u_n)} = \frac{C(u_n \mathbf{1})}{1 - \overline{C}(u_n \mathbf{1})} + o_P(1) \xrightarrow[n \to \infty]{P} \epsilon_L.$$

*Showing (ii).*

First of all, note that it is sufficient to show that

$$I_n = \frac{d+1}{p_n^{d+1}} \int_{[0,p_n]^d} \left[\widehat{C}_n(u) - C(u)\right] \mathrm{d}u = o_P(1). \tag{54}$$

Further, it is straightforward to bound

$$\frac{d+1}{p_n^{d+1}} \int_{[0,p_n]^d \setminus \left[\frac{p_n}{\log n},\, p_n\right]^d} \left| \widehat{C}_n(u) - C(u) \right| \mathrm{d}u \le \frac{d+1}{p_n^{d+1}} \int_{[0,p_n]^d \setminus \left[\frac{p_n}{\log n},\, p_n\right]^d} \left\{ 2 \min\{u_1, \ldots, u_d\} + \frac{1}{n} \right\} \mathrm{d}u$$

$$\le \frac{2d(d+1)}{p_n^{d+1}} \int_0^{p_n} \cdots \int_0^{p_n} \left[ \int_0^{\frac{p_n}{\log n}} u_1\, \mathrm{d}u_1 \right] \mathrm{d}u_2 \ldots \mathrm{d}u_d + O\left(\frac{1}{n\, p_n}\right) = O\left(\frac{1}{\log^2 n}\right) = o(1). \tag{55}$$

Now, (47) holds uniformly for $u_n \in [\frac{p_n}{\log n}, p_n]$. Thus, analogously to the derivation of (51), one can also show that, uniformly in $u \in [\frac{p_n}{\log n}, p_n]^d$,

$$
\widehat{C}_n(u) = C(u) + o_P\left(\sum_{j=1}^d u_j\right),
$$

which further implies

$$\frac{d+1}{p_n^{d+1}} \int_{\left[\frac{p_n}{\log n},\, p_n\right]^d} \left| \widehat{C}_n(u) - C(u) \right| \mathrm{d}u = o_P(1). \tag{56}$$

Now, combining (55) and (56) yields (54).

*Showing (iii).*

To prove the weak consistency of $\widehat{\lambda}_L^{I_h|J_{d-h}}$, it is sufficient to show that

$$\frac{\widehat{C}_n(u_n \mathbf{1}) - C(u_n \mathbf{1})}{C^{J_{d-h}}(u_n \mathbf{1})} \xrightarrow[n\to\infty]{P} 0 \quad \text{and} \quad \frac{\widehat{C}_n^{J_{d-h}}(u_n \mathbf{1})}{C^{J_{d-h}}(u_n \mathbf{1})} \xrightarrow[n\to\infty]{P} 1. \tag{57}$$

We start with the second convergence. Similarly as in (48), for each $\varepsilon > 0$, with probability arbitrarily close to 1, for all sufficiently large $n$ one can bound

$$\frac{G_n^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})}\, \frac{C^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} \le \frac{\widehat{C}_n^{J_{d-h}}(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} \le \frac{G_n^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}\, \frac{C^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})}.$$

Now, by the assumption in (iii), the ratios $\frac{C^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})}$ and $\frac{C^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})}$ can be made arbitrarily close to 1 for $\varepsilon$ close enough to zero and $n$ large enough. Further, by Chebyshev's inequality,

$$\frac{G_n^{J_{d-h}}\left(u_n(1+\varepsilon)\mathbf{1}\right)}{C^{J_{d-h}}\left(u_n(1+\varepsilon)\mathbf{1}\right)} = 1 + O_P\left(\frac{1}{\sqrt{n\, C^{J_{d-h}}\left(u_n(1+\varepsilon)\mathbf{1}\right)}}\right) \xrightarrow[n\to\infty]{P} 1$$

and, similarly, one can also show that $\frac{G_n^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n(1-\varepsilon)\mathbf{1})} \xrightarrow[n\to\infty]{P} 1$. This concludes the proof of the second convergence in (57).

To show the first convergence in (57), one can proceed as in (48) (exploiting (47)) and arrive at

$$\begin{split} \frac{\widehat{C}_n(u_n\mathbf{1}) - C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} &\le \frac{G_n(u_n(1+\varepsilon)\mathbf{1}) - C(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} + \frac{C(u_n(1+\varepsilon)\mathbf{1}) - C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} \\ &= O_P\left(\frac{1}{\sqrt{n\, C^{J_{d-h}}(u_n\mathbf{1})}}\right) + \frac{C(u_n(1+\varepsilon)\mathbf{1}) - C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})}. \end{split}$$

Now, the second term on the right-hand side of the last inequality can be rewritten as

$$\frac{C(u_n(1+\varepsilon)\mathbf{1}) - C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} = \frac{C(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}\, \frac{C^{J_{d-h}}(u_n(1+\varepsilon)\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})} - \frac{C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})},$$

which, thanks to the assumptions of the theorem and the existence of $\lambda_L^{I_h|J_{d-h}}$, can be made arbitrarily small by taking $\varepsilon$ small enough and $n$ sufficiently large.

As an analogous lower bound can be derived for $\frac{\widehat{C}_n(u_n\mathbf{1}) - C(u_n\mathbf{1})}{C^{J_{d-h}}(u_n\mathbf{1})}$, one can conclude that the first convergence in (57) also holds.

#### **8. Real Data Application**

In this section, we illustrate the practical use of the multivariate tail coefficients via a real data example. The data concern stock prices of companies that are constituents of the EURO STOXX 50 market index. The EURO STOXX 50 index is based on the largest and most liquid stocks in the eurozone. Daily adjusted prices of these stocks are publicly available at https://finance.yahoo.com/ (downloaded 19 March 2020). The selected time period is 15 years, starting on 18 March 2005 and ending on 18 March 2020. Note that this period covers both the global financial crisis of 2007–2008 and the sharp decline of the markets caused by the COVID-19 coronavirus pandemic in early 2020. All the calculations are done in the statistical software R [28]. The R codes for the data application, written by the authors, are available at https://www.karlin.mff.cuni.cz/~omelka/codes.php.

The preprocessing of the data was done as follows. The stocks are traded on different stock exchanges and thus might differ in trading days. The union of all trading days is used, and missing data introduced by this step are filled in by linear interpolation. No data were missing on the first or the last day of the studied time range. Negative log-returns are calculated from the adjusted stock prices, and an ARMA(1,1)–GARCH(1,1) model is fitted to each of the variables (stocks), similarly as, for example, in [29], to which we also refer for the detailed model specification. Fitting an ARMA(1,1)–GARCH(1,1) model to every stock does not necessarily provide the best achievable model, but residual checks show that the models are adequate. The standardized residuals obtained from these univariate models are used as the final dataset for calculating the various tail coefficients. The total number of observations is *n* = 3847. Table 3 summarizes the stocks used for the analysis.
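The return-construction step can be sketched as follows (a Python stand-in for the authors' R code; the ARMA–GARCH filtering itself would rely on an external package, e.g. `arch`, and is only indicated here):

```python
import numpy as np

def negative_log_returns(prices):
    """Negative log-returns from adjusted prices, with missing days
    (NaN, e.g. from taking the union of trading days) first filled in
    by linear interpolation."""
    p = np.asarray(prices, dtype=float).copy()
    idx = np.arange(p.size)
    miss = np.isnan(p)
    p[miss] = np.interp(idx[miss], idx[~miss], p[~miss])
    return -np.diff(np.log(p))

# A one-day gap is interpolated before differencing; each resulting
# series would then be filtered by ARMA(1,1)-GARCH(1,1), and the
# standardized residuals fed to the tail coefficient estimators.
r = negative_log_returns([100.0, np.nan, 121.0])
print(r)  # two returns summing to -log(1.21)
```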


**Table 3.** List of selected stocks for the analysis.

It is of interest here to discuss the tendency of extremely low returns to happen simultaneously, which translates into calculating upper tail coefficients while working with negative log-returns. This also allows us to use the methods that assume the data come from an extreme-value copula.

Six different settings are considered: stocks from Group 1 (G1), from Group 2 (G2), from Group 3 (G3), from G1 and G2, from G1 and G3, and finally stocks from G2 and G3. The dimension *d* is equal to 3 for the first three settings and equal to 6 for the last three settings.

Six different estimators are considered: $\widehat{\epsilon}_U$, $\widehat{\epsilon}_U^{\mathrm{MD}}$, $\widehat{\lambda}^*_{U,S}$, $\widehat{\lambda}_{U,E}$, and $\widehat{\lambda}_U^{I_h|J_{d-h}}$ with two different selections of the conditioning sets $I_h$ and $J_{d-h}$. In one case, $h^* = d - h = 1$ and we condition on only one variable. The specific choice of that one variable does not impact the result, as follows from (19). The analysis with conditioning on only one variable shows how the rest of the group is affected by the behavior of one stock. In the other case, we condition on all of the stocks except the one with the largest market capitalization within the group. This analysis indicates how the largest player is affected by the behavior of the rest of the group.

The estimators that are functions of the number of data points $k$ (recall from Section 7.2 that a common choice is $u_n = k_n/(n+1)$, with $k_n = k$ here) do not provide one specific estimate but rather a function of $k$. Selecting an, in some sense, best possible $k$ requires further study. Intuitively, one should look at the lowest $k$ for which the estimator is not too volatile. This idea was used in [30] for estimating bivariate tail coefficients by finding a plateau in the considered estimator as a function of $k$. The results of the analysis are summarized in Figures 7 and 8 and Table 4. Examining Figure 7, it seems that $k$ around 100 would be a reasonable choice for the tail coefficients of Frahm and of Schmid and Schmidt for these data. For Li's tail dependence parameters, Figure 8 suggests that, when conditioning on more than one variable, a larger value of $k$ is needed, for example $k = 200$.

For the tail dependence measurements for extreme-value copulas, we include the coefficient $\lambda_{U,E}$ and the original extremal coefficient $\theta_E$ (see [17]), where the latter can be computed from the former, since $\theta_E = d\big(1 - \frac{d-1}{d}\lambda_{U,E}\big)$. Recall that the various tail coefficient estimators estimate different quantities and, therefore, their values should not be compared to each other. However, a few general conclusions can be made based on Figures 7 and 8. Clearly, all the studied groups possess a certain amount of tail dependence. The combinations of groups also seem to be tail dependent, although the strength of dependence is smaller. Groups G2 and G3 seem to be slightly more tail dependent than G1, which suggests that sharing an industry influences tail dependence more than sharing a geographical location.

**Table 4.** Estimated tail coefficients for extreme-value copulas.

**Figure 7.** Various estimated tail coefficients. (**a**) Estimator $\widehat{\epsilon}_U$ for 3-variate groups; the corresponding symbols represent values of $\widehat{\epsilon}_U^{\mathrm{MD}}$ (not a function of $k$). (**b**) Estimator $\widehat{\epsilon}_U$ for 6-variate groups; the corresponding symbols represent values of $\widehat{\epsilon}_U^{\mathrm{MD}}$ (not a function of $k$). (**c**) Estimator $\widehat{\lambda}^*_{U,S}$ for 3-variate groups. (**d**) Estimator $\widehat{\lambda}^*_{U,S}$ for 6-variate groups.

**Figure 8.** Various estimated tail coefficients. (**a**) Estimator $\widehat{\lambda}\_U^{I\_2|J\_1}$ for 3-variate groups with conditioning on one stock; (**b**) estimator $\widehat{\lambda}\_U^{I\_5|J\_1}$ for 6-variate groups with conditioning on one stock; (**c**) estimator $\widehat{\lambda}\_U^{I\_1|J\_2}$ for 3-variate groups with conditioning on all but the stock with the highest market capitalization; (**d**) estimator $\widehat{\lambda}\_U^{I\_1|J\_5}$ for 6-variate groups with conditioning on all but the stock with the highest market capitalization.

The estimator of Frahm's extremal dependence coefficient in Figure 7a,b is clearly the smallest of all the estimators, which is in line with its "strict" definition in (7). The dots, representing the estimates under the assumption that the underlying copula is an extreme-value copula, are greater than the fully non-parametric estimates. This indicates that assuming an underlying extreme-value copula might not be appropriate.

The estimator of Schmid's and Schmidt's tail dependence measure in Figure 7c,d is much smoother as a function of *k* than the other estimators. However, it tends to move towards 0 or 1 for very low *k*.

The estimator $\widehat{\lambda}\_U^{I\_2|J\_1}$ in Figure 8a suggests that, for all three groups, the probability of two stocks having an extremely low return given that the third stock has an extremely low return is approximately 0.2. The estimator $\widehat{\lambda}\_U^{I\_1|J\_5}$ in Figure 8d, on the other hand, suggests that, in all three group combinations, the largest company is heavily affected if the remaining five stocks have extremely low returns. For the group combinations G1 + G3 and G2 + G3, the estimated tail coefficient is, in fact, equal to 1.
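Coefficients of the $\lambda^{I|J}$ type are conditional tail probabilities, which can be illustrated with a naive rank-based empirical estimate (a simplified sketch of ours; the estimators analyzed in the paper are more refined):

```python
import numpy as np

def empirical_cond_tail(data, I, J, k):
    """Naive rank-based estimate of P(all columns in I are among their k
    lowest values | all columns in J are among their k lowest values).

    A simplified sketch of a lower-tail coefficient of the lambda^{I|J}
    type; names and the estimator itself are ours, not from the paper."""
    ranks = data.argsort(axis=0).argsort(axis=0)  # 0 = smallest observation
    low = ranks < k                               # lower-tail indicators
    cond = low[:, J].all(axis=1)                  # conditioning event
    joint = cond & low[:, I].all(axis=1)
    return joint.sum() / cond.sum() if cond.any() else np.nan

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 1))
data = z + 0.5 * rng.standard_normal((5000, 3))  # positively dependent columns
print(round(empirical_cond_tail(data, I=[0], J=[1, 2], k=250), 2))
```

With a strong common factor, the conditional tail probability clearly exceeds the unconditional level *k*/*n* = 0.05, mirroring the behavior seen in Figure 8.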

The values of $\widehat{\lambda}\_{U,E}$ and $\widehat{\theta}\_E$ are presented in Table 4. One can notice that these measures also suggest that groups G2 and G3 are slightly more tail dependent than G1 or, in other words, that they likely contain fewer independent components (see [18]).

**Author Contributions:** The authors are listed in alphabetical order, reflecting their intensive collaborative research. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is supported by GOA/12/014 project of the Research Fund KU Leuven. The research of the third author was supported by the grant GACR 19–00015S.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Analysis of Information-Based Nonparametric Variable Selection Criteria**

#### **Małgorzata Łazęcka 1,2 and Jan Mielniczuk 1,2,∗**


Received: 6 August 2020; Accepted: 28 August 2020; Published: 31 August 2020

**Abstract:** We consider a nonparametric Generative Tree Model and discuss the problem of selecting active predictors for the response in such a scenario. We investigate two popular information-based selection criteria: Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI), which are both derived as approximations of the Conditional Mutual Information (CMI) criterion. We show that both CIFE and JMI may exhibit behavior different from that of CMI, resulting in different orders in which predictors are chosen in the variable selection process. Explicit formulae for CMI and its two approximations in the generative tree model are obtained. As a byproduct, we establish expressions for the entropy of a multivariate gaussian mixture and its mutual information with the mixing distribution.

**Keywords:** conditional mutual information; CMI; information measures; nonparametric variable selection criteria; gaussian mixture; conditional infomax feature extraction; CIFE; joint mutual information criterion; JMI; generative tree model; Markov blanket

#### **1. Introduction**

In this paper, we consider theoretical properties of Conditional Mutual Information (CMI) and its approximations in a certain dependence model called the Generative Tree Model (GTM). CMI and its modifications are used in many problems of machine learning, including feature selection, variable importance ranking, causal discovery, and structure learning of dependence networks (see, e.g., Reference [1,2]). They are the cornerstone of nonparametric methods for such problems, meaning that no parametric assumptions on the dependence structure are imposed. However, formal properties of these criteria remain largely unknown. This is mainly due to two problems: firstly, theoretical values of CMI and related quantities are hard to calculate explicitly, especially when the conditioning set has a large dimension; secondly, there are only a few established facts about the behavior of their sample counterparts. This situation has important consequences. In particular, the relevant question of whether certain information-based criteria, such as Conditional Infomax Feature Extraction (CIFE) and Joint Mutual Information (JMI), obtained as approximations of CMI (e.g., by truncation of its Möbius expansion), are approximations in an analytic sense (i.e., whether the difference of the two quantities is negligible) remains unanswered. In this paper, we try to fill this gap. The considered GTM is a model in which the marginal distributions of the predictors are mixtures of gaussians. Exact values of CMI, as well as those of CIFE and JMI, are calculated for this model, which makes it feasible to study their behavior when the parameters of the model and the number of predictors change. In particular, it is shown that CIFE and JMI exhibit different behavior than CMI and may also differ significantly between themselves. We show that, depending on the values of the model parameters, each of the considered criteria, JMI and CIFE, can incorporate inactive variables before active ones into the set of chosen predictors. This, of course, does not mean that important performance criteria, such as the False Discovery Rate (FDR), cannot be controlled for CIFE and JMI, but it should serve as a cautionary note that their similarity to CMI, despite their derivation, is not guaranteed. As a byproduct, we establish expressions for the entropy of a multivariate gaussian mixture and its mutual information with the mixing distribution, which are of independent interest.

We stress that our approach is intrinsically nonparametric and focuses on using nonparametric measures of conditional dependence for feature selection. By studying their theoretical behavior for this task, we also learn about the average behavior of their empirical counterparts for large sample sizes.

The Generative Tree Model appears, e.g., in Reference [3]; a non-parametric tree-structured model is also considered, e.g., in Reference [4,5]. Together with the autoregressive model, it is one of the two most common types of generative models. Besides its easily explainable dependence structure, the distributions of the predictors in the considered model are gaussian mixtures, which facilitates calculation of the explicit form of information-based selection criteria.

The paper is structured as follows. Section 2 contains information-theoretic preliminaries, some necessary facts on information-based feature selection, and the derivation of the CIFE and JMI criteria as approximations of CMI. Section 3 contains the derivation of entropy and mutual information for gaussian mixtures. In Section 4, the behavior of CMI, CIFE, and JMI is studied in the GTM. Section 5 concludes.

#### **2. Preliminaries**

We denote by *p*(*x*), *x* ∈ R*d*, the probability density function corresponding to a continuous variable *X* on R*d*. The joint density of *X* and a variable *Y* will be denoted by *p*(*x*, *y*). In the following, *Y* will denote a discrete random response to be predicted using the multivariate vector *X*.

Below, we discuss some information-theoretic preliminaries, which lead, at the end of Section 2.1, to the Möbius decomposition of mutual information. This decomposition is used in Section 2.3 to construct the CIFE approximation of CMI. In addition, properties of mutual information discussed in Section 2.1 are used in Section 2.3 to justify the JMI criterion.

#### *2.1. Information-Theoretic Measures of Dependence*

The (differential) entropy for continuous random variable *X* is defined as

$$H(\mathbf{X}) = -\int\_{\mathbb{R}^d} p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x} \tag{1}$$

and quantifies the uncertainty of observing random values of *X*. Note that the definition above is valid regardless of the dimensionality *d* of the range of *X*. For discrete *X*, we replace the integral in (1) by a sum and the density *p*(*x*) by the probability mass function. In the following, we will frequently consider subvectors of *X* = (*X*1, ... , *Xp*), the vector of all potential predictors of the discrete response *Y*. The conditional entropy of *X* given discrete *Y* is written as

$$H(X|Y) = \sum\_{y \in \mathcal{Y}} p(y)H(X|Y=y). \tag{2}$$

When *Z* is continuous, the conditional entropy *H*(*X*|*Z*) is defined as $\mathbb{E}\_Z H(X | Z = z)$, i.e.,

$$H(X|Z) = -\int p(z) \int \frac{p(\mathbf{x}, z)}{p(z)} \log\left(\frac{p(\mathbf{x}, z)}{p(z)}\right) \, d\mathbf{x} dz = -\int p(\mathbf{x}, z) \log\left(\frac{p(\mathbf{x}, z)}{p(z)}\right) \, d\mathbf{x} dz,\tag{3}$$

where *p*(*x*, *z*) and *p*(*z*) denote joint density of (*X*, *Z*) and density of *Z*, respectively. The mutual information (MI) between *X* and *Y* is

$$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). \tag{4}$$

This can be interpreted as the amount of uncertainty in *X* (respectively, *Y*) which is removed when *Y* (respectively, *X*) is known, which is consistent with the intuitive meaning of mutual information as the amount of information that one variable provides about another. It determines how similar the joint distribution is to the product of the marginal distributions when the Kullback-Leibler divergence is used as the similarity measure (cf. Reference [6], Equation (8.49)). Thus, *I*(*X*,*Y*) may be viewed as a nonparametric measure of dependence. Note that, as *I*(*X*,*Y*) is symmetric, it only shows the strength of dependence but not its direction. In contrast to the correlation coefficient, MI is able to discover non-linear relationships, as it equals zero if and only if *X* and *Y* are independent. It is easily seen that *I*(*X*,*Y*) = *H*(*X*) + *H*(*Y*) − *H*(*X*,*Y*). A natural extension of MI is the conditional mutual information (CMI) defined as
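The identity *I*(*X*,*Y*) = *H*(*X*) + *H*(*Y*) − *H*(*X*,*Y*) is easy to verify numerically for a discrete joint distribution; the following minimal sketch (our illustration, not part of the original derivation) computes MI in nats for a toy 2 × 2 joint pmf:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability vector or array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 log 0 := 0
    return -np.sum(p * np.log(p))

# Toy joint pmf of (X, Y) on {0, 1} x {0, 1}
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi = entropy(px) + entropy(py) - entropy(pxy)
print(round(mi, 4))  # 0.1927 nats
```

For a product distribution (e.g., `np.outer(px, py)`), the same computation returns 0, in line with the independence characterization above.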

$$I(\mathbf{X}, \mathbf{Y}|\mathbf{Z}) = H(\mathbf{X}|\mathbf{Z}) - H(\mathbf{X}|\mathbf{Y}, \mathbf{Z}) = \int p(\mathbf{z}) \int p(\mathbf{x}, y|\mathbf{z}) \log \frac{p(\mathbf{x}, y|\mathbf{z})}{p(\mathbf{x}|\mathbf{z}) p(y|\mathbf{z})} d\mathbf{x} dy d\mathbf{z},\tag{5}$$

which measures the conditional dependence between *X* and *Y* given *Z*. When *Z* is a discrete random variable, the first integral is replaced by a sum. Note that the conditional mutual information is the mutual information of *X* and *Y* given *Z* = *z*, averaged over the values *z* of *Z*, and it equals zero if and only if *X* and *Y* are conditionally independent given *Z*. An important property of MI is the chain rule, which connects *I*((*X*1, *X*2),*Y*) with *I*(*X*1,*Y*):

$$I((X\_1, X\_2), Y) = I(X\_1, Y) + I(X\_2, Y|X\_1). \tag{6}$$

For more properties of the basic measures described above, we refer to Reference [6,7]. We now define interaction information (II, [8]), which is a useful tool for decomposing the mutual information between a multivariate random variable *XS* and *Y* (see Formula (13) below). The 3-way interaction information is defined as

$$II(X\_1, X\_2, Y) = I((X\_1, X\_2), Y) - I(X\_1, Y) - I(X\_2, Y). \tag{7}$$

This is frequently interpreted as the part of *I*((*X*1, *X*2),*Y*) which remains after subtraction of the individual informations between *Y* and *X*1 and between *Y* and *X*2. The definition indicates, in particular, that *II*(*X*1, *X*2,*Y*) is symmetric. Note that it follows from (6) that

$$II(X\_1, X\_2, Y) = I(X\_1, Y | X\_2) - I(X\_1, Y) = I(X\_2, Y | X\_1) - I(X\_2, Y), \tag{8}$$

which is consistent with the intuitive meaning of interaction as a situation in which the effect of one variable on the class variable *Y* depends on the value of another variable. By expanding all the mutual informations on the RHS of (7), we obtain

$$II(X\_1, X\_2, Y) = -H(X\_1) - H(X\_2) - H(Y) + H(X\_1, Y) + H(X\_2, Y) + H(X\_1, X\_2) - H(X\_1, X\_2, Y). \tag{9}$$

The 3-way *II* can be extended to the general case of *p* variables. The *p*-way interaction information [9,10] is

$$II(X\_1, \ldots, X\_p) = -\sum\_{T \subseteq \{1, \ldots, p\}} (-1)^{p - |T|} H(X\_T) \,. \tag{10}$$

For *p* = 2, (10) reduces to mutual information, whereas, for *p* = 3, it reduces to (9).
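Formula (10) can be evaluated directly for a discrete joint distribution by summing entropies of all marginals with alternating signs. The sketch below (our illustration; the pmf is a toy example) checks that, for *p* = 2, (10) indeed reduces to mutual information:

```python
from itertools import combinations
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def interaction_information(joint):
    """p-way interaction information (10) for a discrete joint pmf.

    `joint` is a p-dimensional array; H(X_T) is the entropy of the marginal
    obtained by summing out the axes outside T."""
    p = joint.ndim
    ii = 0.0
    for k in range(1, p + 1):
        for T in combinations(range(p), k):
            rest = tuple(ax for ax in range(p) if ax not in T)
            ii -= (-1) ** (p - k) * entropy(joint.sum(axis=rest))
    return ii

# For p = 2, (10) reduces to the mutual information I(X1, X2):
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(round(interaction_information(pxy), 4))  # 0.1927 nats
```

The empty-subset term is omitted since *H* of an empty collection is zero; for *p* = 3 the same loop reproduces (9).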

We now present two useful properties of the introduced measures. We start with the 3-way interaction information and note that it inherits the chain-rule property from MI, namely

$$II(X\_1, (X\_2, X\_3), Y) = II(X\_1, X\_3, Y) + II(X\_1, X\_2, Y | X\_3), \tag{11}$$

where *II*(*X*1, *X*2,*Y*|*X*3) is defined analogously to (7) by replacing the mutual informations on the RHS by conditional mutual informations given *X*3. This is easily proved by writing, in view of (6):

$$II(X\_1, (X\_2, X\_3), Y) = I(X\_1, (X\_2, X\_3) | Y) - I(X\_1, (X\_2, X\_3)) =$$

$$I(X\_1, X\_3 | Y) + I(X\_1, X\_2 | Y, X\_3) - \left[ I(X\_1, X\_3) + I(X\_1, X\_2 | X\_3) \right] \tag{12}$$

and using (8) in the above equalities. Namely, joining the first and the third expressions together (and, likewise, the second and the fourth), we obtain that the RHS equals *II*(*X*1, *X*3,*Y*) + *II*(*X*1, *X*2,*Y*|*X*3).

We also state the Möbius representation of mutual information, which plays an important role in the following development. For *S* ⊆ {1, 2, ... , *p*}, let *XS* be the random vector whose coordinates have indices in *S*. The Möbius representation [10–12] states that *I*(*XS*,*Y*) can be recovered from interaction informations:

$$I(X\_S, Y) = \sum\_{k=1}^{|S|} \sum\_{\{t\_1, \ldots, t\_k\} \subseteq S} II(X\_{t\_1}, \ldots, X\_{t\_k}, Y), \tag{13}$$

where |*S*| denotes the number of elements of the set *S*.

#### *2.2. Information-Based Feature Selection*

We consider a discrete class variable *Y* and *p* features *X*1, ... , *Xp*. We do not impose any assumptions on the dependence between *Y* and *X*1, ... , *Xp*, i.e., we view its distributional structure in a nonparametric way. Let *XS* denote a subset of features indexed by a set *S* ⊆ {1, ... , *p*}. As *I*(*XS*,*Y*) does not decrease when *S* is replaced by its superset *S*′ ⊇ *S*, the problem of finding arg max*S* *I*(*XS*,*Y*) has the trivial solution $S\_{full} = \{1, 2, \ldots, p\}$. Thus, one usually tries to optimize the mutual information between *XS* and *Y* under some constraints on the size |*S*| of *S*. The most intuitive approach is an analogue of *k*-best subset selection in regression, which tries to identify a feature subset of a fixed size 1 ≤ *k* ≤ *p* that maximizes the joint mutual information with the class variable *Y*. However, this is infeasible for large *k* because the search space grows exponentially with the number of features. As a result, various greedy algorithms have been developed, including forward selection, backward elimination, and genetic algorithms. They are based on the observation that

$$\underset{j \in S^c}{\arg\max} \left[ I(X\_{S \cup \{j\}}, Y) - I(X\_S, Y) \right] = \underset{j \in S^c}{\arg\max} \, I(X\_j, Y | X\_S), \tag{14}$$

where *<sup>S</sup><sup>c</sup>* <sup>=</sup> {1, ... , *<sup>p</sup>*} \ *<sup>S</sup>* is a complement of *<sup>S</sup>*. The equality in (14) follows from (6). In each step, the most promising candidate is added. In the case of ties in (14), the variable satisfying it with the smallest index is chosen.
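The greedy rule (14) can be sketched for discrete samples using plug-in entropy estimates. The code below is a simplified illustration (the data, helper names, and plug-in estimator are our choices, not the paper's); it selects the two informative features before the noise feature:

```python
import numpy as np
from collections import Counter

def joint_entropy(*cols):
    """Plug-in joint entropy (nats) of discrete sample columns; H() = 0."""
    if not cols:
        return 0.0
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum((c / n) * np.log(c / n) for c in counts.values())

def cmi(xj, y, xs):
    """Plug-in estimate of I(X_j, Y | X_S); `xs` is a list of selected columns."""
    return (joint_entropy(xj, *xs) + joint_entropy(y, *xs)
            - joint_entropy(xj, y, *xs) - joint_entropy(*xs))

def greedy_select(X, y, k):
    """Forward selection by rule (14); ties broken by the smallest index."""
    S, rest = [], list(range(X.shape[1]))
    for _ in range(k):
        xs = [X[:, i] for i in S]
        best = max(rest, key=lambda j: (cmi(X[:, j], y, xs), -j))
        S.append(best)
        rest.remove(best)
    return S

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
X = np.column_stack([
    y ^ (rng.random(n) < 0.1),  # strongly informative noisy copy of y
    rng.integers(0, 2, n),      # pure noise, irrelevant
    y ^ (rng.random(n) < 0.2),  # weaker, independently informative copy
])
print(greedy_select(X, y, 2))
```

At each step, the candidate maximizing the plug-in CMI given the already selected set is added, exactly as prescribed by (14).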

#### *2.3. Approximations of CMI: CIFE and JMI Criteria*

Observe that it follows from (13)

$$I(X\_{S \cup \{j\}}, Y) - I(X\_S, Y) = I(X\_j, Y | X\_S) = \sum\_{k=0}^{|S|} \sum\_{\{t\_1, \ldots, t\_k\} \subseteq S} II(X\_{t\_1}, \ldots, X\_{t\_k}, X\_j, Y). \tag{15}$$

Direct application of the above formula to find the maximizer in (14) is infeasible, as estimation of a specific interaction information of order *k* requires $O(C^k)$ observations. The above formula allows us, however, to obtain various natural approximations of CMI. The first-order approximation does not take interactions between features into account, which is why the second-order approximation, obtained by taking the first two terms in (15), is usually considered. The corresponding score for a candidate feature *Xj* is

$$CIFE(X\_j, Y | X\_S) = I(X\_j, Y) + \sum\_{i \in S} II(X\_i, X\_j, Y) = I(X\_j, Y) + \sum\_{i \in S} \left[ I(X\_i, X\_j | Y) - I(X\_i, X\_j) \right]. \tag{16}$$

The acronym CIFE stands for Conditional Infomax Feature Extraction; the measure was introduced in Reference [13]. Observe that if interactions of order 3 and higher between predictors are 0, i.e., $II(X\_{t\_1}, \ldots, X\_{t\_k}, X\_j, Y) = 0$ for *k* ≥ 2, then CIFE coincides with CMI. In Reference [2], it is shown that CMI also coincides with CIFE if certain dependence assumptions on the vector (*X*,*Y*) are satisfied. In view of the discussion above, CIFE can be viewed as a natural approximation to CMI.

Observe that, in (16), we take into account not only the relevance of the candidate feature but also the possible interactions between the already selected features and the candidate feature. Empirical evaluation indicates that (16) is among the most successful MI-based methods; see Reference [2] for an extensive comparison of several MI-based feature selection approaches. We also mention in this context Reference [14], in which stopping rules for CIFE-based methods are considered.

Some additional assumptions lead to other score functions. We now present the reasoning leading to the Joint Mutual Information criterion (JMI) (cf. Reference [12], on which the derivation below is based). Namely, writing $S = \{j\_1, \ldots, j\_{|S|}\}$, we have for $i \in S$

$$I(X\_j, X\_S) = I(X\_j, X\_i) + I(X\_j, X\_{S \setminus \{i\}} | X\_i).$$

Summing these equalities over all *i* ∈ *S* and dividing by |*S*|, we obtain

$$I(X\_j, X\_S) = \frac{1}{|S|} \sum\_{i \in S} I(X\_j, X\_i) + \frac{1}{|S|} \sum\_{i \in S} I(X\_j, X\_{S \setminus \{i\}} | X\_i)$$

and analogously

$$I(X\_j, X\_S | Y) = \frac{1}{|S|} \sum\_{i \in S} I(X\_j, X\_i | Y) + \frac{1}{|S|} \sum\_{i \in S} I(X\_j, X\_{S \setminus \{i\}} | X\_i, Y).$$

Subtracting the two last equations and using (8), we obtain

$$I(X\_j, Y | X\_S) = I(X\_j, Y) + \frac{1}{|S|} \sum\_{i \in S} II(X\_j, X\_i, Y) + \frac{1}{|S|} \sum\_{i \in S} II(X\_j, X\_{S \setminus \{i\}}, Y | X\_i). \tag{17}$$

Moreover, it follows from (8) that, when *Xj* is independent of $X\_{S \setminus \{i\}}$ given *Xi*, and these variables are also conditionally independent given (*Xi*, *Y*), the last sum is 0, and we obtain the equality

$$JMI(X\_j, Y | X\_S) = I(X\_j, Y) + \frac{1}{|S|} \sum\_{i \in S} II(X\_j, X\_i, Y) = I(X\_j, Y) + \frac{1}{|S|} \sum\_{i \in S} \left[ I(X\_j, X\_i | Y) - I(X\_j, X\_i) \right]. \tag{18}$$

This is the Joint Mutual Information criterion (JMI) introduced in Reference [15]. Note that (18) together with (8) implies another useful representation

$$JMI(X\_j, Y | X\_S) = I(X\_j, Y) + \frac{1}{|S|} \sum\_{i \in S} \left[ I(X\_j, Y | X\_i) - I(X\_j, Y) \right] = \frac{1}{|S|} \sum\_{i \in S} I(X\_j, Y | X\_i). \tag{19}$$

JMI can be viewed as an approximation of CMI when the independence assumptions on which the above derivation is based are satisfied only approximately. Observe that *JMI*(*Xj*,*Y*|*XS*) differs from *CIFE*(*Xj*,*Y*|*XS*) in that the influence of the sum of interaction informations *II*(*Xj*, *Xi*,*Y*) is down-weighted by the factor $|S|^{-1}$ instead of 1. This is sometimes interpreted as coping with the 'redundancy over-scaled' problem (cf. Reference [2]). When the terms *I*(*Xj*, *Xi*|*Y*) are omitted from the sum above, the minimal redundancy maximal relevance (mRMR) criterion is obtained [16]. We note that approximations of CMI, such as CIFE or JMI, can be used in place of CMI in (14). As the derivation in both cases is quite intuitive, it is natural to ask how the approximations compare when used for selection. This is the primary aim of the present paper. The theoretical behavior of such methods will be investigated in the following sections. Note that we do not consider empirical counterparts of the above selection rules; we investigate how they would behave provided their values were known exactly.
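The difference between the two scores can be seen in a small numerical sketch (our illustration, using plug-in estimates on toy binary data): for a candidate that is redundant with the already selected features, CIFE subtracts the full sum of negative interaction terms, while JMI averages them, so the CIFE score is lower:

```python
import numpy as np
from collections import Counter

def joint_entropy(*cols):
    """Plug-in joint entropy (nats) of discrete sample columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum((c / n) * np.log(c / n) for c in counts.values())

def mi(a, b):
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def cond_mi(a, b, c):
    return (joint_entropy(a, c) + joint_entropy(b, c)
            - joint_entropy(a, b, c) - joint_entropy(c))

def cife(xj, y, sel):
    # (16): I(Xj, Y) + sum_i [ I(Xi, Xj | Y) - I(Xi, Xj) ]
    return mi(xj, y) + sum(cond_mi(xi, xj, y) - mi(xi, xj) for xi in sel)

def jmi(xj, y, sel):
    # (19): |S|^{-1} sum_i I(Xj, Y | Xi)
    return sum(cond_mi(xj, y, xi) for xi in sel) / len(sel)

rng = np.random.default_rng(1)
n = 2000
y = rng.integers(0, 2, n)
# Two selected features and one candidate, all noisy copies of y, hence redundant
sel = [y ^ (rng.random(n) < 0.1), y ^ (rng.random(n) < 0.1)]
xj = y ^ (rng.random(n) < 0.1)
print(cife(xj, y, sel) < jmi(xj, y, sel))  # CIFE penalizes redundancy more
```

With mRMR one would simply drop the `cond_mi(xi, xj, y)` terms from the CIFE sum; the three criteria thus differ only in how the redundancy terms are weighted.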

#### **3. Auxiliary Results: Information Measures for Gaussian Mixtures**

In this section, we prove some results on information-theoretic properties of gaussian mixtures which are necessary to analyze the behavior of CMI, CIFE, and JMI in the Generative Tree Model defined below.

In the next section, we will consider a gaussian Generative Tree Model in which the main components have marginal distributions that are mixtures of normal distributions. Namely, if *Y* has Bernoulli distribution *Y* ∼ Bern (1/2) (i.e., it takes values 0 and 1, each with probability 1/2) and the conditional distribution of *X* given *Y* is N (*μY*, Σ), then *X* is a mixture of two normal distributions, N (0, Σ) and N (*μ*, Σ), with equal weights. Thus, in this section, we state auxiliary results on the entropy of such a random variable and its mutual information with the mixing distribution. The result for the entropy of a multivariate gaussian mixture is, to the best of our knowledge, new; for the univariate case, it was derived in Reference [17]. Bounds and approximations of the entropy of a gaussian mixture are used, e.g., in signal processing; see, e.g., Reference [18,19]. Consider the *d*-dimensional gaussian mixture *X* defined as

$$X \sim \frac{1}{2} \mathcal{N} \left( 0, I\_d \right) + \frac{1}{2} \mathcal{N} \left( \mu, I\_d \right), \tag{20}$$

where '∼' signifies 'distributed as'.

**Theorem 1.** *Differential entropy of X in (20) equals*

$$H(X) = h(\|\mu\|) + \frac{d-1}{2}\log(2\pi e),$$

*where h*(*a*) *is the differential entropy of the one-dimensional gaussian mixture* $2^{-1}\{\mathcal{N}(0, 1) + \mathcal{N}(a, 1)\}$ *for a* > 0*:*

$$h(a) = -\int\_{\mathbb{R}} \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \right) dx. \tag{21}$$

**Proof.** In order to avoid burdensome notation, we prove the theorem for *d* = 2 only. By the definition of differential entropy, we have

$$H(X) = -\int \frac{1}{2} \left( f\_0(\mathbf{x}\_1, \mathbf{x}\_2) + f\_{\mu}(\mathbf{x}\_1, \mathbf{x}\_2) \right) \log \left( \frac{1}{2} (f\_0(\mathbf{x}\_1, \mathbf{x}\_2) + f\_{\mu}(\mathbf{x}\_1, \mathbf{x}\_2)) \right) d\mathbf{x}\_1 d\mathbf{x}\_2$$

where *X* is defined in (20) for *d* = 2, and *f<sup>μ</sup>* denotes the density of normal distribution with a mean *μ* and a covariance matrix *I*2.

We calculate the integral above by changing the variables according to the following rotation:

$$
\begin{pmatrix} y\_1 \\ y\_2 \end{pmatrix} = \begin{pmatrix} \frac{\mu\_1}{||\mu||} & -\frac{\mu\_2}{||\mu||} \\ \frac{\mu\_2}{||\mu||} & \frac{\mu\_1}{||\mu||} \end{pmatrix} \begin{pmatrix} x\_1 \\ x\_2 \end{pmatrix}.
$$

The transformed densities *f*0 and *fμ* are equal to

$$f\_0(y\_1, y\_2) = \frac{1}{2\pi} \exp\left(-\frac{y\_1^2 + y\_2^2}{2}\right)$$

and

$$f\_{\mu}(y\_1, y\_2) = \frac{1}{2\pi} \exp\left(-\frac{(y\_1 - ||\mu||)^2 + y\_2^2}{2}\right).$$

Applying the above transformation, we can decompose *H*(*X*) into the sum of two integrals as follows:

$$\begin{split} H(X) &= -\int\_{\mathbb{R}} \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{1}{2}y\_1^2} + e^{-\frac{1}{2}(y\_1 - \|\mu\|)^2} \right) \log \left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{1}{2}y\_1^2} + e^{-\frac{1}{2}(y\_1 - \|\mu\|)^2} \right) \right) dy\_1 \\ &\quad - \int\_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y\_2^2} \log \left( \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y\_2^2} \right) dy\_2 = h(\|\mu\|) + \frac{1}{2} \log(2\pi e), \end{split}$$

where, in the last equality, the value *H*(*Z*) = log(2*πe*)/2 for an *N*(0, 1) variable *Z* is used. This ends the proof.

The result above is now generalized to the case of arbitrary covariance matrix Σ. The general case will follow from Theorem 1 and the scaling property of differential entropy under linear transformations.

**Theorem 2.** *Differential entropy of*

$$X \sim \frac{1}{2} \mathcal{N} \left( 0, \Sigma \right) + \frac{1}{2} \mathcal{N} \left( \mu, \Sigma \right),$$

*equals*

$$H(X) = h\left(\left\|\Sigma^{-1/2}\mu\right\|\right) + \frac{d-1}{2}\log(2\pi e) + \frac{1}{2}\log\left(\det\Sigma\right).$$

**Proof.** We apply Theorem 1 to the multivariate random variable $Y = \Sigma^{-1/2} X$. We obtain

$$H(Y) = h\left(\left\|\Sigma^{-1/2}\mu\right\|\right) + \frac{d-1}{2}\log(2\pi e).$$

Using the scaling property of differential entropy [6], we have

$$H(X) = H(Y) + \frac{1}{2} \log(\det \Sigma),$$

which completes the proof.

Similarly, we obtain a formula for the mutual information of a gaussian mixture and its mixing distribution. We use the shorthand *X*|*Y* = *y* to denote the random variable whose distribution coincides with the conditional distribution *P*(*X*|*Y* = *y*).

**Theorem 3.** *Mutual information of X and Y where Y* ∼ *Bern* (1/2) *and X*|*Y* = *y* ∼ N (*yμ*, Σ) *equals*

$$I(X,Y) = h\left(\left\|\Sigma^{-1/2}\mu\right\|\right) - \frac{1}{2}\log(2\pi e). \tag{22}$$

**Proof.** We will use here the fact that the entropy of multidimensional normal distribution *Z* ∼ N (*μZ*, Σ) equals (cf. Reference [6], Theorem 8.4.1)

$$H(Z) = \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log(\det \Sigma).$$

Therefore, we have

$$I(X,Y) = H(X) - H(X|Y) = h\left(\left\|\Sigma^{-1/2}\mu\right\|\right) - \frac{1}{2}\log(2\pi e),\tag{23}$$

as

$$H(X|Y) = \frac{1}{2}H(X|Y=0) + \frac{1}{2}H(X|Y=1),\tag{24}$$

where *H*(*X*|*Y* = *i*) stands for the entropy of *X* on the stratum *Y* = *i*. We notice that *H*(*X*|*Y* = *i*) = *H*(*Z*), as the distribution of *X* on stratum *Y* = *i* is normal with covariance matrix Σ, and its entropy does not depend on the mean.

We note that, in Reference [17], the entropy of the one-dimensional gaussian mixture $2^{-1}(\mathcal{N}(a, 1) + \mathcal{N}(-a, 1))$ is calculated as *he*(*a*), where *he*(*a*) is given in an integral form. As the entropy is invariant with respect to translation, the function *h*(*a*) defined above equals *he*(*a*/2). The behavior of *h* and its first two derivatives is shown in Figure 1. It indicates that the function *h* is strictly increasing; this fact is also stated in Reference [17] without proof and is proved formally below. Strict monotonicity of *h* plays a crucial role in determining the order in which variables are included in the set of active variables. Note that *h*(0) = log(2*πe*)/2, which is the entropy of the standard normal *N*(0, 1) variable. Values of *h* need to be calculated numerically.
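A simple numerical sketch of this computation (our illustration; the grid limits and step are ad hoc) evaluates (21) with a trapezoidal rule and checks the properties noted above:

```python
import numpy as np

def h(a):
    """Numerical differential entropy (nats) of 2^{-1}{N(0,1) + N(a,1)},
    computed by a trapezoidal rule on a wide grid; a >= 0."""
    x = np.linspace(-12.0, 12.0 + a, 100001)
    g = (np.exp(-x**2 / 2) + np.exp(-(x - a)**2 / 2)) / (2 * np.sqrt(2 * np.pi))
    f = -g * np.log(g)
    dx = x[1] - x[0]
    return (f.sum() - 0.5 * (f[0] + f[-1])) * dx  # trapezoidal rule

half_log_2pie = 0.5 * np.log(2 * np.pi * np.e)
print(abs(h(0.0) - half_log_2pie) < 1e-6)  # h(0) = (1/2) log(2*pi*e)
print(h(1.0) < h(2.0) < h(3.0))            # strict monotonicity (Lemma 1)
print(h(2.0) - half_log_2pie > 0)          # I(X, Y) >= 0 via (22)
```

By (22), the quantity `h(a) - half_log_2pie` is exactly the mutual information *I*(*X*,*Y*) for a mixture with $\|\Sigma^{-1/2}\mu\| = a$.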

**Figure 1.** Behavior of the function *h* and its first two derivatives. Horizontal lines in the left chart correspond to the bounds of *h* and equal $\frac{1}{2}\log(2\pi e)$ and $\frac{1}{2}\log(2\pi e) + \log 2$, respectively.

**Lemma 1.** *The differential entropy h*(*a*) *of the gaussian mixture defined in Theorem 1 is a strictly increasing function of a.*

**Proof.** It is easy to see that *h* is differentiable and that, for the calculation of its derivative, integration and differentiation in (21) can be interchanged. We show that the derivative of *h* is positive. By standard manipulations, using the fact that $x \exp(-x^2/2)$ is an odd function for the second equality below, we have

$$\begin{split} h'(a) &= -\frac{1}{2\sqrt{2\pi}} \int\_{\mathbb{R}} \left[ (x-a) e^{-\frac{(x-a)^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \right) + (x-a) e^{-\frac{(x-a)^2}{2}} \right] dx \\ &= -\frac{1}{2\sqrt{2\pi}} \int\_{\mathbb{R}} (x-a) e^{-\frac{(x-a)^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \right) dx \\ &= -\frac{1}{2\sqrt{2\pi}} \int\_{\mathbb{R}} x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \right) \right) dx \\ &= -\frac{1}{2\sqrt{2\pi}} \int\_0^{\infty} x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \right) \right) dx - \frac{1}{2\sqrt{2\pi}} \int\_{-\infty}^0 x e^{-\frac{x^2}{2}} \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \right) \right) dx \\ &= \frac{1}{2\sqrt{2\pi}} \int\_0^{\infty} x e^{-\frac{x^2}{2}} \left[ \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x-a)^2}{2}} \right) \right) - \log\left( \frac{1}{2\sqrt{2\pi}} \left( e^{-\frac{x^2}{2}} + e^{-\frac{(x+a)^2}{2}} \right) \right) \right] dx. \end{split}$$

We have used a change of variables for the third and the fifth equalities above. It follows from the last expression that $h'(a) > 0$, as $(x-a)^2 < (x+a)^2$ for *x* > 0 and *a* > 0; therefore, *h* is strictly increasing.

**Remark 1.** *Note that Theorems 2 and 3 in conjunction with Lemma 1 show that the entropy of a mixture of two gaussians with the same covariance matrix, and its mutual information with the mixing distribution, are strictly increasing functions of the norm* $\|\Sigma^{-1/2}\mu\|$*. In particular, for* Σ = *I, the entropy increases as the distance between the centers of the two gaussians increases. In addition, it follows from (22) and I*(*X*,*Y*) ≥ 0 *that h*(*s*) ≥ log(2*πe*)/2 *for any s* > 0*.*

**Remark 2.** *We call a random variable <sup>X</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup> a generalized mixture when there exist diffeomorphisms fi* : <sup>R</sup> <sup>→</sup> <sup>R</sup> *such that* (*f*1(*X*1), ... *fp*(*Xd*)) <sup>∼</sup> <sup>2</sup>−1(<sup>N</sup> (0, *Id*) + <sup>N</sup> (*μ*, *Id*))*. Then, it follows from Theorem 2 that, analogously to Reference [20], that total correlation of X (cf. Reference [21]) defined as T*(*X*) = ∑*d <sup>i</sup>*=<sup>1</sup> *H*(*Xi*) − *H*(*X*) *equals for generalized mixture X*

$$TC(X) = \sum\_{i=1}^{d} h(|\mu\_i|) - h(||\mu||) + (1 - d)\log(2\pi e)/2\omega$$

*where μ* = (*μ*<sub>1</sub>, ..., *μ<sub>d</sub>*)<sup>*T*</sup>*.*

#### **4. Main Results: Behavior of Information-Based Criteria in Generative Tree Model**

In the following, we define a special gaussian Generative Tree Model and investigate how the greedy procedure based on (14), as well as its analogues in which CMI is replaced by JMI and CIFE, behaves in this model. Theorem 3, proved in the previous section, will yield explicit formulae for CMIs in this model, whereas the strict monotonicity of the function *h*(·), proved in Lemma 1, will be essential to compare the values of *I*(*X<sub>j</sub>*, *Y*|*X<sub>S</sub>*) for different candidates *X<sub>j</sub>*.

#### *4.1. Generative Tree Model*

We will consider the Generative Tree Model with the tree structure illustrated in Figure 2. The Data Generating Process described by this model yields the distribution of the random vector (*Y*, *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>, *X*<sub>1</sub><sup>(1)</sup>) such that:

$$Y \sim \text{Bern}(1/2), \quad X_i|Y \sim \mathcal{N}\left(\gamma^{i-1}Y, 1\right) \text{ for } i \in \{1, 2, \dots, k+1\}, \quad X_1^{(1)}|X_1 \sim \mathcal{N}(X_1, 1), \tag{25}$$

where 0 < *γ* ≤ 1 is a parameter. Thus, first the value *Y* = 0, 1 is generated, with both values 0 and 1 having the same probability 1/2; then, *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub> are generated as normal variables with variance 1 and mean equal to *γ*<sup>*i*−1</sup>*Y*. Finally, once the value of *X*<sub>1</sub> is obtained, *X*<sub>1</sub><sup>(1)</sup> is generated from a normal distribution with variance 1 and mean equal to *X*<sub>1</sub>. Thus, in the sense specified above, *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub> are the children of *Y*, and *X*<sub>1</sub><sup>(1)</sup> is the child of *X*<sub>1</sub>. The parameter *γ* controls how difficult the problem of feature selection is. Namely, the smaller the parameter *γ* is, the less information *X<sub>i</sub>* holds about *Y* for *i* ∈ {1, 2, ..., *k* + 1}. We will refer to the model defined above as M<sub>*k*,*γ*</sub>. Abusing the notation slightly, we denote by *p*(*y*, *x<sub>i</sub>*), *p*(*x*<sub>1</sub>, *x*<sub>1</sub><sup>(1)</sup>) bivariate densities and by *p*(*y*), *p*(*x<sub>i</sub>*), *p*(*x*<sub>1</sub><sup>(1)</sup>) marginal densities. With this notation, the joint density *p*(*y*, *x*<sub>1</sub>, ..., *x*<sub>*k*+1</sub>, *x*<sub>1</sub><sup>(1)</sup>) equals

$$p(y)\left[\prod\_{i=1}^{k+1} \frac{p(y, \mathbf{x}\_i)}{p(y)}\right] \frac{p(\mathbf{x}\_1, \mathbf{x}\_1^{(1)})}{p(\mathbf{x}\_1)} = \frac{p(\mathbf{x}\_1, \mathbf{x}\_1^{(1)})}{p(\mathbf{x}\_1)p(\mathbf{x}\_1^{(1)})} \prod\_{i=1}^{k+1} \frac{p(y, \mathbf{x}\_i)}{p(y)p(\mathbf{x}\_i)} \left[\prod\_{i=1}^{k+1} p(\mathbf{x}\_i)\right] p(y)p(\mathbf{x}\_1^{(1)}),$$

which can be more succinctly written as

$$\prod_{(i,j)\in E} \frac{p(z_i, z_j)}{p(z_i)p(z_j)} \prod_{i\in V} p(z_i),$$

after renaming the variables to *z<sub>i</sub>*, *i* = 1, ..., *k* + 3, with *E* and *V* standing for the edges and vertices of the graph shown in Figure 2 (cf. formula (4.1) in Reference [4]).

**Figure 2.** Generative Tree Model under consideration.

The above model generalizes the model discussed in Reference [3], although some branches which are irrelevant to our considerations are omitted. The values of the conditional mutual information *I*(*X*<sub>*k*+1</sub>, *Y*|*X<sub>S</sub>*) in this model, where *S* = {1, 2, ..., *k*}, are shown in Figure 3 as a function of *k* for different *γ*. We prove in the following that *I*(*X*<sub>*k*+1</sub>, *Y*|*X<sub>S</sub>*) > 0; thus, *X*<sub>*k*+1</sub> carries non-null predictive information about *Y* even when the variables *X*<sub>1</sub>, ..., *X<sub>k</sub>* are already chosen as predictors. We note that *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*|*X<sub>S</sub>*) = 0 for every *γ* ∈ (0, 1] and every *X<sub>S</sub>* containing *X*<sub>1</sub>. Thus, {*X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>} is the Markov Blanket (cf., e.g., Reference [22]) of *Y* among the predictors {*X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>, *X*<sub>1</sub><sup>(1)</sup>}, and {*X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>} is sufficient for *Y* (cf. Reference [23]). A more general model may be considered which incorporates children of every vertex *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub> and several levels of progeny. Here, we show how one variable, *X*<sub>1</sub><sup>(1)</sup>, which does not belong to the Markov Blanket of *Y*, is treated differently by the considered selection rules.
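Sampling from M<sub>*k*,*γ*</sub> follows the data generating process literally. The sketch below (our own illustration, with hypothetical function names) draws observations and checks the marginal means E*X<sub>i</sub>* = *γ*<sup>*i*−1</sup>/2:

```python
import random

def sample_gtm(n, k, gamma, seed=0):
    """Draw n rows (y, [x_1..x_{k+1}], x1_child) from the generative tree model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randrange(2)                                    # Y ~ Bern(1/2)
        xs = [rng.gauss(gamma ** (i - 1) * y, 1.0) for i in range(1, k + 2)]
        x1_child = rng.gauss(xs[0], 1.0)                        # X_1^(1) | X_1 ~ N(X_1, 1)
        rows.append((y, xs, x1_child))
    return rows

rows = sample_gtm(20000, k=2, gamma=2 / 3)
mean_x1 = sum(r[1][0] for r in rows) / len(rows)
mean_x2 = sum(r[1][1] for r in rows) / len(rows)
print(round(mean_x1, 2), round(mean_x2, 2))
```

With *k* = 2 and *γ* = 2/3, the empirical means of *X*<sub>1</sub> and *X*<sub>2</sub> should be close to 1/2 and 1/3, respectively.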

Intuitively, for 0 < *γ* < 1 and *l* < *n*, *X<sub>l</sub>* carries more information about *Y* than *X<sub>n</sub>*; moreover, *X*<sub>1</sub><sup>(1)</sup> is redundant once *X*<sub>1</sub> has been chosen. Thus, predictors should be chosen in the order *X*<sub>1</sub>, *X*<sub>2</sub>, ..., *X*<sub>*k*+1</sub>. For *γ* = 1, the order of selection of *X<sub>i</sub>* is also *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>, in concordance with our convention of breaking ties, but *X*<sub>1</sub><sup>(1)</sup> should not be chosen. We show in the following that CMI chooses variables in this order; however, the order with respect to its approximations, CIFE and JMI, may be different. We also note that an alternative way of representing the predictors is

$$X_i = \gamma^{i-1}Y + \varepsilon_i, \qquad X_1^{(1)} = X_1 + \varepsilon_{k+2}, \tag{26}$$

for *i* = 1, . . . , *k* + 1, where *ε*1,...,*εk*+<sup>2</sup> are i.i.d. *N*(0, 1). Thus, in particular

$$a_k Y = \sum_{i=1}^{k+1} X_i - \sum_{i=1}^{k+1} \varepsilon_i,$$

with *a<sub>k</sub>* = (1 − *γ*<sup>*k*+1</sup>)/(1 − *γ*). Moreover, it is seen that E*X<sub>i</sub>* = *γ*<sup>*i*−1</sup>E*Y* = *γ*<sup>*i*−1</sup>/2.

**Figure 3.** Behavior of conditional mutual information *I*(*Xk*+1,*Y*|*X*1, *X*2, ... , *Xk*) as a function of *k* for different *γ* values.

It is shown in Reference [2] that maximization of *I*(*X<sub>j</sub>*, *Y*|*X<sub>S</sub>*) is equivalent to maximization of *CIFE*(*X<sub>j</sub>*, *Y*|*X<sub>S</sub>*) provided that the selected features in *X<sub>S</sub>* are independent and class-conditionally independent given the unselected feature *X<sub>j</sub>*. It is easily seen that these properties do not hold in the considered GTM for *S* = {1, ..., *l*} and *j* = *l* + 1 with *l* ≤ *k*. It can also be seen by a direct calculation that CMI differs from CIFE in the GTM. Take *S* = {1, 2} and *X<sub>j</sub>* = *X*<sub>1</sub><sup>(1)</sup>. Then, note that the difference between these quantities equals

$$I(X_j, Y | X_S) - I(X_j, Y) - \sum_{i \in S} II(X_i, X_j, Y). \tag{27}$$

Moreover, using conditional independence, we have

$$II(X_1, X_1^{(1)}, Y) = I(X_1^{(1)}, Y|X_1) - I(X_1^{(1)}, Y) = -I(X_1^{(1)}, Y)$$

and

$$II(X\_2, X\_1^{(1)}, Y) = I(X\_1^{(1)}, X\_2 | Y) - I(X\_1^{(1)}, X\_2) = -I(X\_1^{(1)}, X\_2);$$

thus, plugging the above equalities into (27) and using *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*|*X*<sub>1</sub>, *X*<sub>2</sub>) = 0, we obtain that the expression there equals *I*(*X*<sub>1</sub><sup>(1)</sup>, *X*<sub>2</sub>), which is strictly positive in the considered GTM.

Similar considerations concerning the conditions stated above (18) show that maximization of JMI is not equivalent to maximization of CMI in the GTM. Namely, if *S* = {1, 2} and *j* ∈ {3, ..., *k* + 1}, then it is easily seen that *I*(*X<sub>j</sub>*, *X*<sub>*S*∖{*i*}</sub>|*X<sub>i</sub>*) > 0 and *I*(*X<sub>j</sub>*, *X*<sub>*S*∖{*i*}</sub>|*X<sub>i</sub>*, *Y*) = 0 for *i* = 1, 2; thus, the last term in (17) is negative.

In order to support this numerically in a specific case, consider *γ* = 2/3. In the first column of Table 1a, the MI values *I*(*X<sub>i</sub>*, *Y*), *i* = 1, ..., 4 are shown for this value of *γ*. They were calculated in Reference [3] using simulations, while here they are based on (23) and numerical evaluation of *h*(‖Σ<sup>−1/2</sup>*μ*‖). Additionally, in Table 1, the CMI values from subsequent steps and the JMI and CIFE values in this model are shown. As a foretaste of the analysis which follows, note that, in view of panel (b) of the table, JMI erroneously chooses *X*<sub>1</sub><sup>(1)</sup> in the third step instead of *X*<sub>3</sub>, in contrast to CIFE (cf. part (c) of the table), which chooses *X*<sub>1</sub>, *X*<sub>2</sub>, *X*<sub>3</sub> in the right order. Note also that, in this case, *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*) is the second largest mutual information with *Y*; thus, when the filter based solely on this information is considered, *X*<sub>1</sub><sup>(1)</sup> is chosen at the second step (after *X*<sub>1</sub>).

We note that an analysis of the behavior of CMI and its approximations, including CIFE and JMI, has been given in Reference [24], Section 6, for a simple model containing 4 predictors. Here, we analyze the behavior of these measures of conditional dependence for the general model M<sub>*k*,*γ*</sub>, which involves an arbitrary number of predictors having varying dependence with *Y*.

**Table 1.** The criteria (Conditional Mutual Information (CMI), Joint Mutual Information (JMI), Conditional Infomax Feature Extraction (CIFE)) values for *k* = 2 and *γ* = 2/3. A value of the chosen variable in each step and for each criterion is in bold.


#### *4.2. Behavior of CMI*

First of all, we show that the criterion based on conditional mutual information (CMI), without any modifications, chooses the correct variables in the right order. It has already been noted that *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*|*X<sub>S</sub>*) = 0 for *S* = {1, ..., *k*}. Now, we show that *I*(*X*<sub>*k*+1</sub>, *Y*|*X<sub>S</sub>*) > 0 for every *k*. Namely, applying Theorem 3 and the chain rule for mutual information

$$I(X_{S \cup \{k+1\}}, Y) = I(X_S, Y) + I(X_{k+1}, Y | X_S),$$

we obtain

$$I(\mathbf{X}\_{k+1}, \mathbf{Y} | \mathbf{X}\_S) = h\left(\sqrt{\sum\_{i=0}^k \gamma^{2i}}\right) - h\left(\sqrt{\sum\_{i=0}^{k-1} \gamma^{2i}}\right) > 0,\tag{28}$$

where the inequality follows as *h* is a strictly increasing function. Thus, we have proved that *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*|*X<sub>S</sub>*) = 0 < *I*(*X*<sub>*k*+1</sub>, *Y*|*X<sub>S</sub>*) for *S* = {1, ..., *k*}, for every *k*. Whence, for *S* = {1, ..., *l*} and *l* < *k*, we have that

$$\operatorname*{arg\,max}_{Z \in S^c} I(Z, Y|X_S) = X_{l+1};$$

thus, CMI chooses the predictors in the correct order. Figure 3 shows the behavior of *g*(*k*, *γ*) = *I*(*X*<sub>*k*+1</sub>, *Y*|*X*<sub>1</sub>, ..., *X<sub>k</sub>*) as a function of *k* for various *γ*. Note that it follows from Figure 3 that *g*(·, *γ*) is decreasing. This means that the additional information on *Y* obtained when *X*<sub>*k*+1</sub> is incorporated gets smaller with *k*. Now, we study the order in which predictors are chosen with respect to JMI and CIFE.
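Using Equation (28), the gain *g*(*k*, *γ*) can be evaluated numerically. The following sketch (ours, with a quadrature-based *h*) illustrates that the gain is positive and shrinks with *k*, consistent with Figure 3:

```python
import math

def h(a, lo=-12.0, n=20000):
    # entropy of (1/2)N(0,1) + (1/2)N(a,1) via midpoint quadrature
    hi = a + 12.0
    dx = (hi - lo) / n
    c = 1.0 / math.sqrt(2.0 * math.pi)
    ent = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = 0.5 * c * (math.exp(-0.5 * x * x) + math.exp(-0.5 * (x - a) ** 2))
        if p > 0.0:
            ent -= p * math.log(p) * dx
    return ent

def cmi_gain(k, gamma):
    # Equation (28): I(X_{k+1}, Y | X_1..X_k)
    s_new = math.sqrt(sum(gamma ** (2 * i) for i in range(k + 1)))
    s_old = math.sqrt(sum(gamma ** (2 * i) for i in range(k)))
    return h(s_new) - h(s_old)

gains = [cmi_gain(k, 0.8) for k in (1, 2, 3, 4)]
print([round(g, 4) for g in gains])
```

For *γ* = 0.8 the successive gains should be strictly positive and decreasing in *k*, mirroring the monotone behavior of *g*(·, *γ*) described above.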

#### *4.3. Behavior of JMI*

The main objective of this section is to examine the performance of the JMI criterion in the Generative Tree Model for different values of the parameter *γ*. We will show that:

- for *γ* = 1, JMI chooses the variables in the correct order *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub> and does not select the redundant variable *X*<sub>1</sub><sup>(1)</sup> before them;
- for 0 < *γ* < 1 and *k* sufficiently large, JMI incorrectly chooses *X*<sub>1</sub><sup>(1)</sup> before *X*<sub>*k*+1</sub>.

Consider the model above and assume that the set of indices of the currently chosen variables equals *S* = {1, 2, ..., *k*}. For *i* ∈ {1, 2, ..., *k*}, we apply the chain rule (6) and Theorem 3, with the following covariance matrices and mean vectors for *I*((*X<sub>i</sub>*, *Z*), *Y*) (cf. (26)):

$$
\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \mu = \begin{pmatrix} \gamma^{i-1} \\ \gamma^k \end{pmatrix} \text{ and } \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \mu = \begin{pmatrix} \gamma^{i-1} \\ 1 \end{pmatrix}, \tag{29}
$$

respectively, for *Z* = *X*<sub>*k*+1</sub> and *Z* = *X*<sub>1</sub><sup>(1)</sup>. Then, we have

$$I(X\_{k+1}, Y | X\_i) = h\left(\sqrt{\gamma^{2k} + \gamma^{2(i-1)}}\right) - h\left(\gamma^{i-1}\right),\tag{30}$$

$$I(X\_1^{(1)}, Y | X\_i) = h\left(\sqrt{\gamma^{2(i-1)} + \frac{1}{2}}\right) - h\left(\gamma^{i-1}\right) \text{ for } i \neq 1,\tag{31}$$

$$I(X\_1^{(1)}, Y | X\_1) = 0.\tag{32}$$

The last equation follows from the fact that *X*<sub>1</sub><sup>(1)</sup> and *Y* are conditionally independent given *X*<sub>1</sub>.

From the definition of *JMI*(*X*, *Y*|*X<sub>S</sub>*), abbreviated from now on to *JMI*(*X*|*X<sub>S</sub>*) to simplify notation, we obtain

$$k\,JMI(X_{k+1}|X_S) = \sum_{i=1}^{k} \left( h\left(\sqrt{\gamma^{2k} + \gamma^{2(i-1)}}\right) - h\left(\gamma^{i-1}\right) \right), \tag{33}$$

$$k\,JMI(X_1^{(1)}|X_S) = \begin{cases} 0 & \text{if } k=1\\ \sum_{i=2}^k \left( h\left(\sqrt{\gamma^{2(i-1)} + \frac{1}{2}}\right) - h\left(\gamma^{i-1}\right) \right) & \text{if } k>1 \end{cases} \tag{34}$$

We observe that the variables *X*<sub>1</sub>, *X*<sub>2</sub>, ... are chosen in order according to JMI, as, for *S* = {1, ..., *l*} and *l* < *m* < *n*, we have *JMI*(*X<sub>m</sub>*|*X<sub>S</sub>*) > *JMI*(*X<sub>n</sub>*|*X<sub>S</sub>*). For *γ* = 1, the right-hand sides of the last two expressions equal *k*(*h*(√2) − *h*(1)) and (*k* − 1)(*h*(√(3/2)) − *h*(1)), respectively. Thus, for *γ* = 1, we have *JMI*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) > *JMI*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*), which means that the variables are chosen in the order *X*<sub>1</sub>, ..., *X*<sub>*k*+1</sub>, and *X*<sub>1</sub><sup>(1)</sup> is not chosen before them when the JMI criterion is used. Although, for *γ* = 1, the JMI criterion does not select this redundant feature, we note that, for *k* → ∞, *S* = {1, ..., *k*}, and *γ* = 1,

$$JMI(X_1^{(1)}|X_S) \to h\left(\sqrt{\frac{3}{2}}\right) - h\left(1\right) > 0,$$

which differs from *I*(*X*<sub>1</sub><sup>(1)</sup>, *Y*|*X<sub>S</sub>*) = 0 for all *k* ≥ 1. We note also that, in this case, *JMI*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) does not depend on *k*, in contrast to *I*(*X*<sub>*k*+1</sub>, *Y*|*X<sub>S</sub>*).

Now, we consider the case 0 < *γ* < 1. We want to show that, for sufficiently large *k* and *S* = {1, ..., *k*}, the JMI criterion chooses *X*<sub>1</sub><sup>(1)</sup>, since

$$JMI(X\_{k+1}|X\_S) < JMI(X\_1^{(1)}|X\_S).$$

The last inequality is equivalent to

$$\sum\_{i=2}^{k} \left( h\left(\sqrt{\gamma^{2(i-1)} + \frac{1}{2}}\right) - h\left(\sqrt{\gamma^{2k} + \gamma^{2(i-1)}}\right) \right) > h(\sqrt{1 + \gamma^{2k}}) - h\left(1\right). \tag{35}$$

The right-hand side tends to 0 as *k* → ∞. For the left-hand side, note that, for *k* > −log<sub>*γ*</sub>2/2, we have *γ*<sup>2*k*</sup> < 1/2, and all summands of the sum above are positive, as *h* is an increasing function. Thus, bounding the sum from below by its first term, we have

$$\sum_{i=2}^k \left( h\left(\sqrt{\gamma^{2(i-1)} + \frac{1}{2}}\right) - h\left(\sqrt{\gamma^{2k} + \gamma^{2(i-1)}}\right) \right) \ge h\left(\sqrt{\gamma^2 + \frac{1}{2}}\right) - h\left(\sqrt{\gamma^{2k} + \gamma^2}\right) > 0,$$

and the lower bound tends to *h*(√(*γ*<sup>2</sup> + 1/2)) − *h*(*γ*) > 0 as *k* → ∞. Since the right-hand side of (35) tends to 0, inequality (35) holds for all sufficiently large *k*.

The minimal *k* for which the JMI criterion incorrectly chooses *X*<sub>1</sub><sup>(1)</sup>, i.e., the first *k* for which (35) holds, is shown in Figure 4. The values of the JMI criterion for the variables *X*<sub>*k*+1</sub> and *X*<sub>1</sub><sup>(1)</sup> are shown in Figure 5. Figure 4 indicates that *X*<sub>1</sub><sup>(1)</sup> is chosen early; for *γ* ≤ 0.8, it happens in the third step at the latest.
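The crossover behind Figure 4 can be reproduced from (33) and (34). The sketch below (our illustration, with a quadrature-based *h* and our own function names) searches for the first *k* at which (35) holds:

```python
import math

def h(a, lo=-12.0, n=8000):
    # entropy of (1/2)N(0,1) + (1/2)N(a,1) via midpoint quadrature
    hi = a + 12.0
    dx = (hi - lo) / n
    c = 1.0 / math.sqrt(2.0 * math.pi)
    ent = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = 0.5 * c * (math.exp(-0.5 * x * x) + math.exp(-0.5 * (x - a) ** 2))
        if p > 0.0:
            ent -= p * math.log(p) * dx
    return ent

def jmi_next(k, gamma):
    # Equation (33) divided by k: JMI score of the relevant candidate X_{k+1}
    return sum(h(math.sqrt(gamma ** (2 * k) + gamma ** (2 * (i - 1)))) - h(gamma ** (i - 1))
               for i in range(1, k + 1)) / k

def jmi_child(k, gamma):
    # Equation (34) divided by k: JMI score of the redundant candidate X_1^(1)
    if k == 1:
        return 0.0
    return sum(h(math.sqrt(gamma ** (2 * (i - 1)) + 0.5)) - h(gamma ** (i - 1))
               for i in range(2, k + 1)) / k

def first_bad_step(gamma, k_max=10):
    # smallest k for which JMI prefers X_1^(1) over X_{k+1}, i.e., (35) holds
    for k in range(1, k_max + 1):
        if jmi_child(k, gamma) > jmi_next(k, gamma):
            return k
    return None

k_bad = first_bad_step(0.5)
print(k_bad)
```

For *γ* = 0.5 a small crossover step should be found, whereas the same search for *γ* = 1 finds none, in line with the previous discussion.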

**Figure 4.** Minimal *k* for which *JMI*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) < *JMI*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*), 0 < *γ* < 1.

**Figure 5.** The behavior of JMI in the generative tree model: *JMI*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) and *JMI*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*).

#### *4.4. Behavior of CIFE and Its Comparison with JMI*

The aim of this section is to show that, although both the JMI and CIFE criteria are developed as approximations to conditional mutual information, their behavior in the generative tree model differs. We will show that:

- for 0 < *γ* < 1, we have *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) > *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*) for any *k*, so the redundant variable is not preferred;
- for *γ* = 1 and *k* sufficiently large, *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) < *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*);
- for *γ* = 1, both *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) and *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*) tend to −∞ as *k* → ∞.

Thus, CIFE behaves very differently from JMI in the Generative Tree Model.

Analogously to formulae for JMI, we have the following formulae for CIFE (*S* = {1, . . . , *k*}):

$$\begin{aligned} \text{CIFE}(\mathbf{X}\_{k+1}|\mathbf{X}\_{\mathcal{S}}) &= (1-k)\left(h\left(\gamma^{k}\right) - \frac{1}{2}\log(2\pi e)\right) + \sum\_{i=1}^{k} \left(h\left(\sqrt{\gamma^{2k} + \gamma^{2(i-1)}}\right) - h\left(\gamma^{i-1}\right)\right), \\ \text{CIFE}(\mathbf{X}\_{1}^{(1)}|\mathbf{X}\_{\mathcal{S}}) &= \begin{cases} 0 & \text{if } k=1\\ \left(1-k\right)\left(h(1) - \frac{1}{2}\log(2\pi e)\right) + \sum\_{i=2}^{k} \left(h\left(\sqrt{\gamma^{2(i-1)} + \frac{1}{2}}\right) - h\left(\gamma^{i-1}\right)\right) & \text{if } k>1 \end{cases} \end{aligned}$$

For *γ* = 1, we have

$$\begin{aligned} CIFE(X_{k+1}|X_S) &= (1-k)\left(h(1) - \frac{1}{2}\log(2\pi e)\right) + \sum_{i=1}^{k} \left(h\left(\sqrt{2}\right) - h\left(1\right)\right) \\ &= h\left(1\right) - \frac{1}{2}\log(2\pi e) - k\left(2h(1) - h(\sqrt{2}) - \frac{1}{2}\log(2\pi e)\right), \\ CIFE(X_1^{(1)}|X_S) &= (1-k)\left(2h(1) - \frac{1}{2}\log(2\pi e) - h\left(\sqrt{\frac{3}{2}}\right)\right). \end{aligned}$$

Note that both expressions above are linear functions of *k*. Comparison of their slopes, in view of *h*(√(3/2)) < *h*(√2) (as *h* is an increasing function), yields that, for sufficiently large *k*, we obtain *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) < *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*). The behavior of CIFE for 0 < *γ* < 1 in the case of *X*<sub>*k*+1</sub> and *X*<sub>1</sub><sup>(1)</sup> is shown in Figure 6, and the difference between *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) and *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*) in Figure 7. The values below 0 in the last plot occur only for *γ* = 1; thus, for 0 < *γ* < 1, we have *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) > *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*) for any *k*.

**Figure 6.** The behavior of CIFE in the generative tree model: *CIFE*(*X*<sub>*k*+1</sub>|*X<sub>S</sub>*) and *CIFE*(*X*<sub>1</sub><sup>(1)</sup>|*X<sub>S</sub>*).

**Figure 7.** Difference between the values of JMI for *X*<sub>*k*+1</sub> and *X*<sub>1</sub><sup>(1)</sup> (**left panel**) and the analogous difference for CIFE (**right panel**). Values below 0 mean that the variable *X*<sub>1</sub><sup>(1)</sup> is chosen.

Furthermore, as 2*h*(1) − (1/2)log(2*πe*) − *h*(√(3/2)) ≈ 0.0642 > 0, we have, for *γ* = 1,

$$CIFE(X_1^{(1)}|X_S) \to -\infty \text{ as } k \to \infty,$$

and, as 2*h*(1) − *h*(√2) − (1/2)log(2*πe*) ≈ 0.0215 > 0, we have

$$\text{CIFE}(X\_{k+1}|X\_S) \to -\infty \text{ as } k \to \infty.$$

In order to understand the consequences of this property, let us momentarily assume that one introduces an intuitive stopping rule which says that the candidate *X*<sub>*j*<sub>0</sub></sub> with *j*<sub>0</sub> = arg max<sub>*j*∈*S<sup>c</sup>*</sub> *CIFE*(*X<sub>j</sub>*, *Y*|*X<sub>S</sub>*) is appended only when *CIFE*(*X*<sub>*j*<sub>0</sub></sub>, *Y*|*X<sub>S</sub>*) > 0. Then, the Positive Selection Rate (PSR) of such a selection procedure may become arbitrarily small in the model M<sub>*k*,*γ*</sub> for fixed *γ* and sufficiently large *k*. PSR is defined as |*t̂* ∩ *t*|/|*t*|, where *t* = {1, ..., *k* + 1} is the set of indices of the Markov Blanket of *Y* and *t̂* is the set of indices of the chosen variables.
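The two constants quoted above, and the resulting negative drift of CIFE at *γ* = 1, can be checked numerically; a sketch (ours, with a quadrature-based *h* and our own names):

```python
import math

def h(a, lo=-12.0, n=20000):
    # entropy of (1/2)N(0,1) + (1/2)N(a,1) via midpoint quadrature
    hi = a + 12.0
    dx = (hi - lo) / n
    c = 1.0 / math.sqrt(2.0 * math.pi)
    ent = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = 0.5 * c * (math.exp(-0.5 * x * x) + math.exp(-0.5 * (x - a) ** 2))
        if p > 0.0:
            ent -= p * math.log(p) * dx
    return ent

half_log_2pie = 0.5 * math.log(2.0 * math.pi * math.e)
slope_next = 2 * h(1.0) - h(math.sqrt(2.0)) - half_log_2pie      # should be near 0.0215
slope_child = 2 * h(1.0) - half_log_2pie - h(math.sqrt(1.5))     # should be near 0.0642

def cife_next(k):    # CIFE(X_{k+1}|X_S) at gamma = 1
    return h(1.0) - half_log_2pie - k * slope_next

def cife_child(k):   # CIFE(X_1^(1)|X_S) at gamma = 1
    return (1 - k) * slope_child

print(round(slope_next, 4), round(slope_child, 4))
print(round(cife_next(50), 3), round(cife_child(50), 3))
```

Both CIFE sequences eventually become negative, which is what makes the positive-threshold stopping rule discard active variables and drive PSR down.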

#### **5. Conclusions**

We have considered M<sub>*k*,*γ*</sub>, a special case of the Generative Tree Model, and investigated the behavior of CMI and the related criteria JMI and CIFE in this model. We have shown that, despite the fact that both of these criteria are derived as approximations of CMI under certain dependence conditions, their behavior may greatly differ from that of CMI, in the sense that they may switch the order of variable importance and treat inactive variables as more relevant than active ones. In particular, this occurs for JMI when *γ* < 1 and for CIFE when *γ* = 1. We have also shown a drawback of the CIFE procedure which consists in disregarding a significant part of the active variables, so that PSR may become arbitrarily small in the model M<sub>*k*,*γ*</sub> for large *k*. As a byproduct, we obtained formulae for the entropy of a multivariate gaussian mixture and its mutual information with the mixing variable. We have also shown that the entropy of the gaussian mixture is a strictly increasing function of the euclidean distance between the two centers of its components. Note that, in this paper, we investigated the behavior of theoretical CMI and its approximations in the GTM; for their empirical versions, we may expect an exacerbation of the effects described here.

**Author Contributions:** Conceptualization, M.Ł.; Formal analysis, J.M. and M.Ł.; Methodology, J.M. and M.Ł.; Supervision, J.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** Comments of two referees which helped to improve presentation of the original version of the manuscript are gratefully acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Robust Multiple Regression**

**David W. Scott 1,\*,† and Zhipeng Wang 1,2,†**


**Abstract:** As modern data analysis pushes the boundaries of classical statistics, it is timely to reexamine alternate approaches to dealing with outliers in multiple regression. As sample sizes and the number of predictors increase, interactive methodology becomes less effective. Likewise, with limited understanding of the underlying contamination process, diagnostics are likely to fail as well. In this article, we advocate for a non-likelihood procedure that attempts to quantify the fraction of bad data as a part of the estimation step. These ideas also allow for the selection of important predictors under some assumptions. As there are many robust algorithms available, running several and looking for interesting differences is a sensible strategy for understanding the nature of the outliers.

**Keywords:** minimum distance estimation; maximum likelihood estimation; influence functions

#### **1. Introduction**

We examine how to approach bad data in the classical multiple regression setting. We are given a sample of *n* vectors, {(**x**<sub>*i*</sub>, *y<sub>i</sub>*), *i* = 1, 2, ..., *n*}. We have *p* predictors; hence, **x**<sub>*i*</sub> ∈ R<sup>*p*</sup>. The random variable model we consider is *Y<sub>i</sub>* = **x**<sup>*t*</sup><sub>*i*</sub>*β* + *ε<sub>i</sub>*, where *ε<sub>i</sub>* represents the (random) unexplained portion of the response. In vector form, we have

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

where **Y** is the *n* × 1 vector of responses, **X** is the *n* × *p* matrix whose *n* rows contain the predictor vectors, and ***ε*** is the vector of random errors. Minimizing the sum of squared errors leads to the well-known formula

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}. \tag{1}
$$

Since *β*ˆ is a linear combination of the responses, any outliers will exert a corresponding influence on the parameter estimates. Alternatively, outliers in the predictor vectors can exert a strong influence on the estimated parameter vector. With modern gigabit datasets, both types of outliers may be expected. Outliers in the predictor space may or may not be viewed as errors. In either case, they may result in high leverage, as any prediction errors there that are very large would account for a large fraction of the SSE; thus, we would expect *β*ˆ to pay attention and try to rotate to minimize that effect. In practice, it is more common to assume the features are measured accurately and without error and to focus on outliers in the response space. We will adopt this framework initially.
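The influence of a single response outlier on the estimate (1) is easy to demonstrate. Below is a small synthetic sketch (ours) using the closed-form simple-regression case of Equation (1):

```python
import random

def simple_ols(x, y):
    # least-squares fit y ≈ b0 + b1 x, the p = 2 (intercept + slope) case of Equation (1)
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

rng = random.Random(1)
x = [i / 49 for i in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.2) for xi in x]

b0_clean, b1_clean = simple_ols(x, y)

y_bad = list(y)
y_bad[-1] += 50.0          # one gross response outlier at the right edge of the design
b0_bad, b1_bad = simple_ols(x, y_bad)

print(round(b1_clean, 2), round(b1_bad, 2))
```

A single corrupted response at a high-leverage design point drags the slope upward by several units, which is exactly the sensitivity discussed above.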

#### **2. Strategies for Handling Outliers in the Response Space**

Denote the multivariate normal PDF by *φ*(**x**|*μ*, Σ). Although it is not required, if we assume the distribution of the error vector is multivariate normal with zero mean and covariance matrix Σ = *σ*<sup>2</sup><sub>*ε*</sub>*I<sub>n</sub>*, maximizing the likelihood

**Citation:** Scott, D.W.; Wang, Z. Robust Multiple Regression. *Entropy* **2021**, *23*, 88. https://doi.org/10.3390/ e23010088

Received: 16 December 2020 Accepted: 5 January 2021 Published: 9 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( https://creativecommons.org/ licenses/by/4.0/).

$$\begin{split} \prod_{i=1}^{n} \phi(\epsilon_i | 0, \sigma_\epsilon^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma_\epsilon^2}} \exp(-\epsilon_i^2 / 2\sigma_\epsilon^2) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma_\epsilon^2}} \exp(-(y_i - \mathbf{x}_i^t \boldsymbol{\beta})^2 / 2\sigma_\epsilon^2) \end{split} \tag{2}$$

may be shown to be equivalent to minimizing the residual sum of squares

$$\begin{split} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 &= \sum_{i=1}^{n} (y_i - \mathbf{x}_i^t \boldsymbol{\beta})^2 \\ &= (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \end{split} \tag{3}$$

over *β*, leading to the least squares estimator given in Equation (1), where *y*ˆ*<sup>i</sup>* = **x***<sup>t</sup> <sup>i</sup>β*<sup>ˆ</sup> is the predicted response. Again we remark that the least squares criterion in Equation (3) is often invoked without assuming the errors are independent and normally distributed.

Robust estimation of the parameters of the normal distribution as in Equation (2) is a well-studied topic. In particular, the likelihood is modified so as to avoid the use of the non-robust squared errors found in Equation (3). For example, *ε*<sup>2</sup><sub>*i*</sub> may be modified to be bounded from above, or an even more extreme modification may give it a redescending shape (to zero); see [1–3]. Either approach requires the specification of meta-parameters that explicitly control the shape of the resulting influence function. Typically, this is done by an iterative process in which the residuals are computed and a robust estimate of their scale is obtained, for example, the median of the absolute residuals.

As an alternative, we advocate making an assumption about the explicit shape of the residual distribution, for example, *ε* ∼ *N*(0, *σ*<sup>2</sup><sub>*ε*</sub>). With such an assumption, it is possible to replace likelihood and influence function approaches with a minimum distance criterion. As we shall show, the advantage of doing so is that an explicit estimate of the fraction of contaminated data may be obtained. In the next section, we briefly describe this approach and the estimation equations.

#### **3. Minimum Distance Estimation**

We follow the derivation of the *L*2*E* algorithm described by Scott [4]. Suppose we have a random sample {*xi*, *i* = 1, 2, ... , *n*} from an unknown density function *g*(*x*), which we propose to model with the parametric density *f*(*x*|*θ*). Either *x* or *θ* may be multivariate in the following. Then as an alternative to evaluating potential parameter values of *θ* with respect to the likelihood, we consider instead estimates of how close the two densities are in the integrated squared or *L*<sup>2</sup> sense:

$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} \widehat{\int \left( f(\mathbf{x}|\theta) - g(\mathbf{x}) \right)^2 d\mathbf{x}} \tag{4}$$

$$=\operatorname*{arg\,min}_{\theta} \left[ \widehat{\int f(\mathbf{x}|\theta)^2 d\mathbf{x}} - \widehat{\int 2f(\mathbf{x}|\theta)g(\mathbf{x})d\mathbf{x}} + \widehat{\int g(\mathbf{x})^2 d\mathbf{x}} \right] \tag{5}$$

$$=\operatorname*{arg\,min}_{\theta} \left[ \int f(\mathbf{x}|\theta)^2 d\mathbf{x} - 2\widehat{E f(\mathbf{X}|\theta)} \right] \tag{6}$$

$$=\operatorname*{arg\,min}_{\theta} \left[ \int f(\mathbf{x}|\theta)^2 d\mathbf{x} - \frac{2}{n} \sum_{i=1}^{n} f(\mathbf{x}_i|\theta) \right]. \tag{7}$$

Notes: In Equation (4), the hat on the integral sign indicates we are seeking a data-based estimator for that integral; in Equation (5), we have simply expanded the integrand into three individual integrals, the first of which can be calculated explicitly for any posited value of *θ* and need not be estimated; in Equation (6), we have omitted the hat on the first integral and eliminated entirely the third integral since it is a constant with respect to *θ*, and we have observed that the middle integral is (by definition) the expectation of our density model at a random point *X* ∼ *g*(*x*); and finally, in Equation (7), we have substituted an unbiased estimate of that expectation. Note that the quantity in brackets in Equation (7) is fully data-based, assuming the first integral exists for all values of *θ*. Scott calls the resulting estimator *L*2*E* as it minimizes an *L*<sup>2</sup> criterion.

We illustrate this estimator with the 2-parameter *N*(*μ*, *σ*2) model. Then the criterion in Equation (7) becomes

$$(\hat{\mu}, \hat{\sigma}) = \operatorname*{arg\,min}_{(\mu, \sigma)} \left[ \frac{1}{2\sqrt{\pi}\,\sigma} - \frac{2}{n} \sum_{i=1}^{n} \phi(x_i | \mu, \sigma^2) \right]. \tag{8}$$

We illustrate this estimator on a sample of 10<sup>4</sup> points from the normal mixture 0.10*N*(1, 0.2<sup>2</sup>) + 0.75*N*(5, 1) + 0.15*N*(9, 0.5<sup>2</sup>). The L2E and MLE curves are shown in the left frame of Figure 1.

**Figure 1.** (**Left**) MLE and L2E estimates together with a histogram; (**Right**) partial L2E estimate.
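A minimal numerical version of criterion (8) — our sketch, using a crude grid search in place of a proper optimizer, on a fresh sample from the same mixture — reproduces the robustness of the location estimate and the inflation of the scale estimate:

```python
import math
import random

rng = random.Random(7)

def draw():  # one point from 0.10 N(1, 0.2^2) + 0.75 N(5, 1) + 0.15 N(9, 0.5^2)
    u = rng.random()
    if u < 0.10:
        return rng.gauss(1.0, 0.2)
    if u < 0.85:
        return rng.gauss(5.0, 1.0)
    return rng.gauss(9.0, 0.5)

data = [draw() for _ in range(2000)]

def l2e_crit(mu, sigma):
    # Equation (8): 1/(2 sqrt(pi) sigma) - (2/n) sum phi(x_i | mu, sigma^2)
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    s = sum(c * math.exp(-0.5 * ((xi - mu) / sigma) ** 2) for xi in data)
    return 1.0 / (2.0 * math.sqrt(math.pi) * sigma) - 2.0 * s / len(data)

mus = [0.2 * i for i in range(51)]            # 0.0 .. 10.0
sigmas = [0.5 + 0.25 * j for j in range(11)]  # 0.5 .. 3.0
mu_l2e, sig_l2e = min(((m, s) for m in mus for s in sigmas),
                      key=lambda p: l2e_crit(*p))

mu_mle = sum(data) / len(data)
var_mle = sum((xi - mu_mle) ** 2 for xi in data) / len(data)
print(round(mu_l2e, 1), round(sig_l2e, 2),
      round(mu_mle, 2), round(math.sqrt(var_mle), 2))
```

The L2E location lands near the central component at 5 with a moderately inflated scale, while the MLE scale is blown up by the outer components, as described in the discussion of Figure 1.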

A careful examination of the L2E derivation in Equation (4) shows that we crucially used the fact that *g*(*x*) was a density function, but nowhere did we require the model *f*(*x*|*θ*) to also be a bona fide density function. Scott proposed fitting a partial mixture model, namely

$$f(x|\theta) = w \cdot \phi(x|\mu, \sigma^2),$$

which he called a partial density component. (Here, the L2E criterion could be applied to a full 3-component normal mixture density.) When applied to the previous data, the fitted curve is shown in the right frame of Figure 1.

We discuss these 3 estimators briefly. The MLE is simply (*x*¯,*s*), and the nonrobustness of both parameters is clearly illustrated. Next, the L2E estimate of the mean is clearly robust, but the scale estimate is also inflated compared to the true value *σ* = 1. After reflection, this is the result of the fitted model having an area equal to 1. The closest normal curve is close to the central portion of the mixture, but with standard deviation inflated by a third. Note that the fitted curve completely ignores the outer mixture components. However, when the 3-parameter partial density component model is fitted, *w*ˆ = 0.759, which suggests that some 24% of the data are not captured by the minimum distance fit. Thus the estimation step itself conveys important information about the adequacy of the fit. By way of contrast, a graphical diagnosis of the MLE fit such as a *q*–*q* plot would show the fit is also inadequate, but give no explicit guidance as to how much data are outliers and what the correct parameters might be. Note that the parameter estimates of the mean and standard deviation by partial L2E are both robust, although the estimate of *σ* is inflated by 3%, reflecting some overlap of the third mixture component with the central component. Thus, we should not assume *w*ˆ is an unbiased estimate of the fraction of "good data", but rather an upper bound on it.

With the insight gained by this example, we shift now to the problem at hand, namely, multiple regression. We will use the partial L2E formulation in order to gain insight into the portion of data not adequately modeled by the linear model.

#### **4. Minimum Distance Estimation for Multiple Regression**

If we are willing to make the (perhaps rather strong but explicit) assumption that the error random variables follow a normal distribution, the appropriate model is

$$
\epsilon \sim N(0, \sigma_\epsilon^2).
$$

Given initial estimates for *β*, *σ<sub>ε</sub>*, and *w*, we use any nonlinear optimization routine (for example, *nlminb* in the R language) to minimize the criterion of Equation (7), which here takes the form

$$\frac{w^2}{2\sqrt{\pi}\,\sigma_\epsilon} - \frac{2w}{n}\sum_{i=1}^n \phi\left(y_i - \mathbf{x}_i^t \boldsymbol{\beta} \,|\, 0, \sigma_\epsilon^{2}\right) \tag{9}$$

over the *p* + 2 parameters (*β*, *σ<sub>ε</sub>*, *w*). In practice, the intercept may be coded as another parameter, or a column of 1s may be included in the design matrix **X**. Notice that the residuals are assumed to be normal (at least partially) and centered at 0. It is convenient to use the least-squares estimates to initialize the L2E algorithm. In some cases, there may be more than one solution to Equation (9), especially if using the partial component model. In every case, the fitted value of *w* should offer clear guidance.
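To make the procedure concrete, the following sketch (ours, not the authors' implementation) minimizes criterion (9) on synthetic data with 10% response outliers, using a crude finite-difference descent in place of *nlminb*; all parameter and function names are ours:

```python
import math
import random

rng = random.Random(3)
n = 300
x = [i / (n - 1) for i in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 0.3) for xi in x]
for i in range(0, n, 10):          # 10% gross response outliers, spread over the design
    y[i] += 8.0

def crit(params):
    # Equation (9); sigma and w kept positive / in (0,1) via log and logistic transforms
    b0, b1, log_sig, t = params
    sig = math.exp(log_sig)
    w = 1.0 / (1.0 + math.exp(-t))
    c = 1.0 / (sig * math.sqrt(2.0 * math.pi))
    s = sum(c * math.exp(-0.5 * ((yi - b0 - b1 * xi) / sig) ** 2)
            for xi, yi in zip(x, y))
    return w * w / (2.0 * math.sqrt(math.pi) * sig) - 2.0 * w * s / n

# initialize at the (contaminated) least-squares fit, sigma = 1, w = 0.9
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1_ols = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0_ols = ybar - b1_ols * xbar
params = [b0_ols, b1_ols, 0.0, math.log(0.9 / 0.1)]

step = 0.5
val = crit(params)
for _ in range(400):               # finite-difference gradient descent with backtracking
    eps = 1e-5
    grad = [(crit(params[:j] + [params[j] + eps] + params[j + 1:]) - val) / eps
            for j in range(4)]
    while step > 1e-8:
        trial = [p - step * g for p, g in zip(params, grad)]
        tval = crit(trial)
        if tval < val:
            params, val = trial, tval
            step *= 1.5
            break
        step *= 0.5

b0_l2e, b1_l2e = params[0], params[1]
w_l2e = 1.0 / (1.0 + math.exp(-params[3]))
print(round(b0_ols, 2), round(b0_l2e, 2), round(w_l2e, 2))
```

The outliers inflate the least-squares intercept, while the partial L2E fit recovers it; the fitted *w* estimates the fraction of data consistent with the linear model and should sit near 0.9 here.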

#### **5. Examples**

*5.1. Hertzsprung–Russell Diagram CYG OB1 Data*

These data (*n* = 47) are well-studied due to the strong influence of the four very bright giant stars observed at low temperatures [5]; see Figure 2. In fact, the slope of the least-squares line in the left frame has the wrong sign.

**Figure 2.** (**Left**) MLE (blue) and L2E (red) regression estimates for the Hertzsprung–Russell data; (**Middle**) kernel (blue) and normal (green) densities of the least squares residuals; and (**Right**) kernel (red) and normal (green) densities of the L2E residuals. See text.

In the middle frame, we examine the residuals from the least-squares fit. The residuals are shown along the *x*-axis, together with a kernel density estimate (blue), which has a bimodal shape [6]. The green curve shows the presumed normal fit *N*(0, *σ*ˆ<sup>2</sup>), where *σ*ˆ = 0.558. Since this is just a bivariate case, it is easy to see that the bimodal shape of the residuals does not convey the correct size of the population of outliers. In higher dimensions, such inference about the nature and quantity of outliers only becomes more difficult.

In the right frame, we examine the residuals from the L2E fit. We begin by noting that the fraction of "good data" is around 92%, indicating 3.8 outliers. The kernel density estimate of the residuals is shown in red. The fitted normal curve to the residuals is the partial normal component given by

$$0.919 \cdot N(0, 0.394^2)$$

and is shown again in green. The least-squares scale estimate (0.558) is some 42% larger than the L2E estimate (0.394). Examining the residuals closely, there are possibly two more stars, with residual values 1.09 and 1.49, that may bear closer scrutiny. Finally, the assumption of a normal shape for the residuals seems warranted by the close agreement of the red and green curves around the origin in this figure.

#### *5.2. Boston Housing Data*

This dataset was first analyzed by economists who were interested in the effect that air pollution (nitric oxide) had on median housing prices per census tract [7]. A description of the data may be found at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html.

We begin by fitting the full least-squares and L2E multiple regression models with *p* = 13 predictors to the median housing price for the 506 census tracts. All 14 variables were standardized; see Table 1. Thus the intercept for the least-squares model is known to be zero. All of the LS coefficients were significant except for INDUS and AGE. L2E puts more weight on AGE and RM, and less on NOX, RAD, and LSTAT, compared to least squares.

**Table 1.** The multiple regression parameter estimates for LS and L2E are given in the first two rows. The variable importance counts are given in the last two rows; see text.


In Figure 3, we display histograms of the residuals as well as a normal curve with mean zero and the estimated standard deviation of the residuals. The estimated value of *σ* is 0.509 and *R*<sup>2</sup> = 0.74 for LS; however, *σ*ˆ is only 0.240 for L2E, with *w*ˆ = 0.845. Examining the curves in Figure 3, we see that the least-squares model tends to overestimate the median housing value. Our interpretation of the L2E result is that the simple multiple linear regression model provides an adequate fit to at most 84.5% of the data. (This interpretation relies critically on the assumption that the residuals actually follow the normal distribution.) In particular, the L2E model is saying that very accurate predictions of the most expensive median housing census tracts are not possible with these 13 predictors.

In Figure 4, the L2E residuals (in standardized units) are displayed for the 506 census tracts. The dark blue and dark red shaded tracts are more than 3 standard units from their predictions. The expanded scale shown in Figure 5 shows that the largest residuals (outliers) are in the central Boston to Cambridge region. Similar maps of the LS residuals show much less structure, as is apparent from the top histogram in Figure 3.

**Figure 3.** LS and L2E residual analysis; see text.

**Figure 4.** Full map of the L2E residuals in the Boston region; see text.

Next, we briefly examine whether subsets of the predictors are sufficient for prediction. In Figure 6, we display the residual standard deviation for all 8191 such models. Apparently, as few as 5 variables provide as good a prediction as the full model above. In the bottom two rows of Table 1, we tabulate the variables that entered the best 100 models as the number of variables ranges from 5 to 8. The variables RM, LSTAT, PTRATIO, and DIS appear in almost all of those 400 models. The additional three variables ZN, B, and CHAS appear at least half the time. However, the L2E fits for these models have standard errors often 50% larger than the full model. Variable selection remains a noisy process, although this simple counting procedure can prove informative.
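The subset-counting idea above can be sketched generically: enumerate predictor subsets, score each by the residual standard deviation of its fit, and tally how often each variable appears among the best models of each size. This illustration uses plain least squares rather than L2E, with made-up variable names, so it is a sketch of the counting procedure only, not the authors' code.

```python
import itertools
import numpy as np

def best_subset_counts(X, y, names, subset_sizes, top=100):
    """For each subset size k, rank all k-variable least-squares fits by the
    residual standard deviation, then count how often each variable enters
    the `top` best models of that size (cf. the last two rows of Table 1)."""
    n = len(y)
    counts = {name: 0 for name in names}
    for k in subset_sizes:
        scored = []
        for subset in itertools.combinations(range(X.shape[1]), k):
            Xs = np.column_stack([np.ones(n), X[:, subset]])   # intercept column
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            scored.append((np.std(y - Xs @ beta), subset))     # score = residual SD
        scored.sort(key=lambda t: t[0])                        # best models first
        for _, subset in scored[:top]:
            for j in subset:
                counts[names[j]] += 1
    return counts
```

Variables with high counts across sizes are the stable performers; rarely counted variables contribute little beyond noise.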

**Figure 7.** Histograms of the critical temperatures; see text.

We showcase the histograms and fitted curves for the L2E regression residuals in Figure 9. The blue curve is fitted to the negative residuals and the green curve to the positive residuals. Our interpretation of the L2E result is that the points with positive and negative residuals from the L2E regression fit the two major clusters of the critical temperature very well. In particular, the L2E model yields a narrower distribution of residuals, and the fit explains the bimodal distribution of the critical temperatures. On a practical note, the same L2E values (to five significant digits) were obtained starting either with the LS parameters or with a vector of zeros, for any initial choice of *w*.


**Figure 8.** Histogram and normal curve for LS residual; see text.

**Figure 9.** Histogram and fitting kernel density estimation curves for L2E residuals; see text.

#### **6. Discussion**

Maximum likelihood, entropy, and Kullback–Leibler estimators are examples of divergence-based rather than distance-based criteria. It is well known that these are not robust in their native form. Donoho and Liu argued that all minimum distance estimators are inherently robust [9]. Other minimum distance criteria (e.g., *L*<sup>1</sup> or Hellinger) exist with some properties superior to L2E, such as being dimensionless. However, none are fully data-based and unbiased. Often a kernel density estimate is placed in the role of *g*(*x*), which introduces an auxiliary parameter that is problematic to calibrate. Furthermore, numerical integration is almost always necessary. Numerical optimization of a criterion involving numerical integration severely limits the number of parameters and the dimension that can be considered.

The L2E approach with multivariate normal mixture models benefits greatly from the following closed form integral:

$$\int\_{\mathbb{R}^p} \phi(\mathbf{x}|\mu\_1, \Sigma\_1) \, \phi(\mathbf{x}|\mu\_2, \Sigma\_2) \, d\mathbf{x} = \phi(\mathbf{0}|\mu\_1 - \mu\_2, \Sigma\_1 + \Sigma\_2),$$

whose proof follows from the Fourier transform of normal convolutions; see the appendix of Wand and Jones [10]. Thus the robust multiple regression problem could be approached by fitting the parameter vector (*μ*, Σ, *w*) to the random variable vector (**x**, *y*) and then computing the conditional expectation. In two dimensions, the number of parameters is 2 + 3 + 1 compared to the multiple regression parameter vector (*β*, *σ*, *w*), which has 2+1+1 parameters (including the intercept). The advantage is much greater as *p* increases, as the full covariance matrix requires *p*(*p* + 1)/2 parameters alone.
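The closed-form identity is easy to verify numerically. The following one-dimensional check with a simple Riemann sum is our own illustration, not part of the paper:

```python
import numpy as np

def phi(x, mu, var):
    """Normal density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Left side: integral of the product of two normal densities over the real line.
grid = np.linspace(-30.0, 30.0, 200001)
dx = grid[1] - grid[0]
mu1, v1, mu2, v2 = 1.3, 0.7, -0.4, 2.1
lhs = float(np.sum(phi(grid, mu1, v1) * phi(grid, mu2, v2)) * dx)

# Right side: the closed form phi(0 | mu1 - mu2, v1 + v2).
rhs = float(phi(0.0, mu1 - mu2, v1 + v2))
```

Because the integrand decays to zero far before the grid endpoints, the Riemann sum matches the closed form to near machine precision.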

To illustrate this approach, we computed the MLE and L2E parameter estimates for the Hertzsprung–Russell data [11]. The solutions are depicted in Figure 10 by three level sets corresponding to 1-, 2-, and 3-*σ* contours. These data are not perfectly modeled by the bivariate normal PDF; however, the direct regression solutions shown in the left frame of Figure 2 are immediately evident. The estimate of *w*ˆ here was 0.937, which is slightly larger than the estimate shown in the right frame of Figure 2.

**Figure 10.** MLE (blue) and L2E (red) bivariate normal estimates for the Hertzsprung–Russell data.

If the full correlation structure is of interest, then the extra work required to robustly estimate the parameters may be warranted. For **x** ∈ ℝ*<sup>p</sup>*, this requires estimation of *p* + *p*(*p* + 1)/2 + 1 = (*p* + 2)(*p* + 1)/2 parameters. In ℝ<sup>10</sup> this means estimating 66 parameters, which is on the edge of optimization feasibility currently. Many simulated bivariate examples of partial mixture fits with 1–3 normal components are given in Scott [11]. When the number of fitted components is less than the true number, initialization can result in alternative solutions. Some correctly isolate components; others combine them in interesting ways. Software to fit such mixtures and multiple regression models may be found at http://www.stat.rice.edu/~scottdw/ under the *Download software and papers* tab.

We have not focused on the theoretical properties of L2E in this article. However, given the simple summation form of the L2E criterion in Equation (7), the asymptotic normality of the estimated parameters may be shown. Such general results are to be found in Basu et al. [12], for example. Regularization of L2E regression, such as the *L*<sup>1</sup> penalty in LASSO, has been considered by Ma et al. [13]. LASSO can aid in the selection of variables in a regression setting.

#### **7. Conclusions**

The ubiquitousness of massive datasets has only increased the need for robust methods. In this article, we advocate applying numerous robust procedures, including L2E, in order to find similarities and differences among their results. Many robust procedures focus on high breakdown as a figure of merit; however, even those algorithms may falter in the regression setting; see Hawkins and Olive [14]. Manual inspection of such high-dimensional data is not feasible. Similarly, graphical tools for the inspection of residuals are of limited utility; however, see Olive [15] for a specific idea for multivariate regression. The partial L2E procedure described in this article puts the burden of interpretation where it can more reasonably be expected to succeed, namely, in the estimation phase. Points tentatively tagged as outliers may still be inspected in aggregate for underlying cause. Such points may have residuals greater than some multiple of the estimated residual standard deviation, *σ*ˆ, or simply be the largest 100(1 − *w*ˆ)% of the residuals in magnitude. In either case, the understanding of the data is much greater than with least squares in the high-dimensional case.

**Author Contributions:** Conceptualization, D.W.S. and Z.W.; methodology, D.W.S.; software, D.W.S. and Z.W.; validation, D.W.S. and Z.W.; formal analysis, D.W.S. and Z.W.; investigation, D.W.S. and Z.W.; resources, D.W.S.; data curation, D.W.S. and Z.W.; writing—original draft preparation, D.W.S. and Z.W.; writing—review and editing, D.W.S. and Z.W.; visualization, D.W.S.; supervision, D.W.S.; project administration, D.W.S.; funding acquisition, D.W.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank the Rice University Center for Research Computing and the Department of Statistics for their support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Right-Censored Time Series Modeling by Modified Semi-Parametric A-Spline Estimator**

**Dursun Aydın 1, Syed Ejaz Ahmed <sup>2</sup> and Ersin Yılmaz 1,\***


**Abstract:** This paper focuses on the adaptive spline (A-spline) fitting of the semiparametric regression model to time series data with right-censored observations. Typically, there are two main problems that need to be solved in such a case: dealing with censored data and obtaining a proper A-spline estimator for the components of the semiparametric model. The first problem is traditionally solved by the synthetic data approach based on the Kaplan–Meier estimator. In practice, although the synthetic data technique is one of the most widely used solutions for right-censored observations, the transformed data's structure is distorted, especially for heavily censored datasets, due to the nature of the approach. In this paper, we introduced a modified semiparametric estimator based on the A-spline approach to overcome data irregularity with minimum information loss and to resolve the second problem described above. In addition, the semiparametric B-spline estimator was used as a benchmark method to gauge the success of the A-spline estimator. To this end, a detailed Monte Carlo simulation study and a real data sample were carried out to evaluate the performance of the proposed estimator and to make a practical comparison.

**Keywords:** adaptive splines; B-splines; right-censored data; semiparametric regression; synthetic data transformation; time series

#### **1. Introduction**

Time series datasets are censored from the right under specific conditions, such as a detection limit or an insufficient observation process. Consider a device which cannot measure values above a certain point, known as a detection limit. Since the device cannot determine the real value of an observation above its detection limit, such observations are recorded as right-censored data points. The hourly cloud ceiling heights data collected by the National Center for Atmospheric Research (NCAR) and modelled by [1,2] can be used as an example of a right-censored time series. Although right-censored time series are encountered frequently in the real world, there are very few studies in the literature on the estimation of right-censored time series. This may be because censorship is an unwanted data irregularity for researchers, and it is therefore often ignored or handled by outdated techniques.

To solve the censorship problem before modelling the time series, reference [1] used the Gaussian imputation technique to estimate the series using modified ARMA models. In a similar manner, references [2,3] solved the censorship problem by using data imputation techniques. The common ground of these studies is the use of imputation and data augmentation methods to estimate the regression models with autoregressive errors for right-censored time series. On the other hand, there is an easier way to handle the censorship problem called synthetic data transformation. Although data imputation techniques have some merits, they are generally based on iterative algorithms and their calculations are costly. Reference [4] estimated the temporally correlated and right-censored

**Citation:** Aydın, D.; Ahmed, S.E.; Yılmaz, E. Right-Censored Time Series Modeling by Modified Semi-Parametric A-Spline Estimator. *Entropy* **2021**, *23*, 1586. https:// doi.org/10.3390/e23121586

Academic Editor: Jan Mielniczuk

Received: 19 October 2021 Accepted: 22 November 2021 Published: 27 November 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

series by the Nadaraya–Watson estimator nonparametrically, solving the censorship problem using a data transformation technique. Various data transformation (or synthetic data) methods have been proposed and studied in the literature for independent and identically distributed (i.i.d.) datasets; see, for example, [5–7]. Because synthetic data transformation manipulates the data structure, which is disadvantageous, this solution method is no longer the preferred technique for right-censored time series. This paper aims to propose a method which can overcome this disadvantage of the synthetic data transformation method.

Note that the studies mentioned above consider the modeling of time series data using parametric or nonparametric methods. The data structure of a real-world time series is generally not suitable for parametric modelling, because parametric methods require rigid assumptions to reach reasonable estimates. Single-index nonparametric models, on the other hand, are very flexible, which is an important advantage over parametric methods, and there are valuable studies on the subject [2,8,9]. However, nonparametric approaches lose statistical efficiency as the number of covariates increases. In addition, it should be noted that, when a time series dataset is right-censored, the weaknesses of both methods are further amplified.

Considering the issues mentioned above, this paper adopts a semiparametric regression model for estimating right-censored time series. Although several researchers have introduced different types of semiparametric estimators for time series data, such as [10,11], there remains a significant gap in the research regarding the modelling of right-censored time series data. To address this gap, our paper proposes a modified semiparametric A-spline (AS) estimator based on synthetic data transformation. Thus, the bidirectional flexibility of the semiparametric model is exploited, and the censorship problem is effectively solved.

The paper is designed as follows: the methodology and fundamental ideas about the right-censored semiparametric time series model with autoregressive errors and the synthetic data transformation method are given in Section 2. Section 3 introduces a modified AS estimator for the parametric and nonparametric components of the right-censored time series model, and a semiparametric B-spline (BS) estimator is given as a benchmark. Section 4 involves the statistical properties and evaluation criteria for both the modified AS and benchmark BS methods. Section 5 introduces some additional information about the penalty term of the semiparametric AS approach. Sections 6 and 7 contain a detailed Monte Carlo simulation study and a real-world data example, respectively. Conclusions are presented in Section 8.

#### **2. Background**

The classical semiparametric model can be defined as a hybrid model with a finite dimensional parametric component and a nonparametric component having an infinite dimensional nuisance parameter. See [12–15] for additional information. In both theory and practice, the semiparametric model brings a new perspective to data modeling, since it includes both parametric and nonparametric components. As mentioned in the previous section, it is well-suited to time series data, because it brings the advantages of the semiparametric model to time series analysis.

Suppose that a time series dataset {*Zt*, **x***t*, *st*, *t* = 1, 2, . . . , *n* } satisfies an uncensored semiparametric time series model of the form:

$$Z\_t = \mathbf{x}\_t \boldsymbol{\beta} + f(s\_t) + \varepsilon\_t, \ a = s\_1 < \cdots < s\_n = b,\tag{1}$$

where the *Zt*'s are the observations of a stationary time series, **x***<sup>t</sup>* = (*xt*1, ... , *xtp*) and **x**1, ... , **x***<sup>n</sup>* are known *p*-dimensional vectors of the explanatory variables, β = (*β*1, *β*2, ... , *βp*) is an unknown *p*-dimensional vector of the regression coefficients to be estimated, *f*(.) is an unknown smooth function that describes the relationship between *Zt* and a nonparametric temporal covariate *st*, and finally, the *εt*'s are the stationary autoregressive error terms generated by:

$$
\varepsilon\_t = \rho\_1 \varepsilon\_{t-1} + \dots + \rho\_k \varepsilon\_{t-k} + u\_t, \tag{2}
$$

where *ρ*1, ... , *ρk* are the autoregressive coefficients, and *ut* denotes independent and identically distributed random error terms with mean zero and constant variance. Model (1) does not include lagged *Zt*'s and has auto-correlated errors. This makes it a suitable model for the semiparametric regression analysis of certain kinds of time series.

A common problem in practice is that the dependent observations *Zt* cannot be perfectly collected due to limitations such as the detection limit of a measurement tool or the end time of the study. To express this situation algebraically, we assume that the *Zt*'s are censored from the right by a non-negative random variable *Ct* representing the detection limit. Therefore, instead of observing the values of *Zt*, we now observe:

$$Y\_t = \min(Z\_t, C\_t) \text{ and } \delta\_t = \begin{cases} 1 & \text{if } Z\_t \le C\_t \text{ (uncensored)}\\ 0 & \text{if } Z\_t > C\_t \text{ (censored)} \end{cases}\tag{3}$$

where *δt*'s denote the censoring information. Suppose that we are interested in estimating the mean semiparametric regression function. The distribution of the observable random variables does not identify the mean regression function uniquely. However, this problem can be solved as follows.

Let *FZ*(*α*) = *P*(*Z* ≤ *α*), *GC*(*α*) = *P*(*C* ≤ *α*), and *HY*(*α*) = *P*(*Y* ≤ *α*), *α* ∈ ℝ, be the cumulative distribution functions of the non-negative random variables *Zt*, *Ct*, and *Yt*, respectively. If the random variables *Zt* and *Ct* are independent, then the survival function 1 − *HY*(*α*) of the observed response variable *Yt* follows from the basic relationship between *FZ* and *GC*:

$$\bar{H}\_Y(\alpha) = 1 - H\_Y(\alpha) = (1 - F\_Z(\alpha)) \cdot (1 - G\_C(\alpha)).\tag{4}$$

Given a random sample from the distribution of (*Yt*, **x***t*, *st*, *δt*), it is of interest to examine the explanatory variables' effect on the observations of the time series (i.e., the response variable) by estimating the regression function *E*(*Yt*|**x***t*,*st*) = **x***t*β + *f*(*st*), the conditional mean of the time series *Yt*. However, because of the censoring, ordinary methods cannot be applied directly to estimate the regression function. To overcome censoring, a data transformation technique should be used. One of the most widely used techniques is synthetic data transformation, detailed in the section below.

#### *Synthetic Data*

To extend the penalized sum of squares approach to right-censored semiparametric regression analysis, we updated the synthetic data approach developed by [5]. The first step is to create an unbiased synthetic response variable whose expectation equals that of the original response, and then to obtain the penalized squares estimator by means of this synthetic variable. The main goal of this transformation is to account for the censoring effect on the distribution of the response variable. In the case of censored data, the authors of [16,17] used the synthetic data approach.

In the synthetic approach, we replace the observed variable *Yt* with transformed data *YtG*; the transformation maintains the conditional expectation of the original variable *Zt*. To describe this situation, it is easiest to proceed directly using the cumulative distributions given in Lemma 1 below. Note also that if *GC* is known, then it is possible to transform the observed data {(*Yt*, *δt*), *t* = 1, . . . , *n*} into unbiased synthetic data, given by:

$$Y\_{tG} = \frac{\delta\_t Y\_t}{1 - G\_C(Y\_t)},\tag{5}$$

where *GC*(.) is the distribution function of the censoring time *Ct*, as defined before. It should be noted that the distribution of *GC* is rarely known. In this case, we use the Kaplan–Meier estimator defined by:

$$1 - \hat{G}\_C(y) = \prod\_{t=1}^{n} \left(\frac{n-t}{n-t+1}\right)^{\mathbb{I}[Y\_{(t)} \le y,\ \delta\_{(t)} = 0]}, \ y \ge 0,\tag{6}$$

where *Y*(1) ≤ ... ≤ *Y*(*n*) are the sorted values of *Y*1, ... , *Yn* and *δ*(*t*) is the *δt* associated with *Y*(*t*). The estimator in Equation (6) has the following properties: (a) some *Y*(*i*) can be identical, in which case the ranking of *Y*1, ... , *Yn* into *Y*(1) ≤ ... ≤ *Y*(*n*) is not unique; however, the Kaplan–Meier estimator allows us to define the ranking of the *Yt* uniquely; (b) *G*ˆ*<sup>C</sup>*(.) has jumps only at the censored observations of the time series (see [18]).

Substituting *G*ˆ *<sup>C</sup>*(.) for *GC*(.) in Equation (5), we construct the following synthetic data, given by:

$$\mathcal{Y}\_{t\hat{G}} = \frac{\delta\_t \mathcal{Y}\_t}{1 - \hat{G}\_C(\mathcal{Y}\_t)}. \tag{7}$$

Then, one practical consequence of the following lemma is that the synthetic data *YtG*<sup>ˆ</sup> and the completely observed responses *Zt* have the same conditional expectations, as claimed before.
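A direct implementation of Equations (6) and (7) — the Kaplan–Meier estimate of the censoring distribution and the resulting synthetic responses — might look as follows. This is a naive O(*n*²) sketch for illustration; the helper name `synthetic_response` is ours, and *δt* = 1 marks an uncensored point:

```python
import numpy as np

def synthetic_response(y, delta):
    """Transform right-censored pairs (y_t, delta_t) into synthetic data
    delta_t * y_t / (1 - G_hat(y_t)), where G_hat is the Kaplan-Meier
    estimate of the censoring distribution (Equations (6) and (7))."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(y)
    order = np.argsort(y, kind="stable")            # ranks Y_(1) <= ... <= Y_(n)
    y_sorted, d_sorted = y[order], delta[order]
    out = np.zeros(n)
    for i in range(n):
        if delta[i] == 0:
            continue                                # censored: delta_t * y_t = 0
        # 1 - G_hat(y_i): product over censored sorted points with Y_(t) <= y_i,
        # rank t (1-indexed) contributing the factor (n - t) / (n - t + 1)
        surv = 1.0
        for t in range(n):
            if d_sorted[t] == 0 and y_sorted[t] <= y[i]:
                surv *= (n - t - 1) / (n - t)
        out[i] = y[i] / surv
    return out
```

With no censoring the transformation is the identity; uncensored points above a censored observation are inflated to compensate for the probability mass lost to censoring, which is how the conditional expectation of *Zt* is preserved.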

**Lemma 1.** *Consider time series data Zt denoted as the response variable. If the data are censored by the random censoring variable C with distribution GC, transform the observed series Yt* = *min*(*Zt*, *Ct*) *into YtG* = *δtYt*/(1 − *GC*(*Yt*)) *in an unbiased form, as defined in Equation (5). It can then be easily verified that E*[*YtG*|**x***t*,*st*] = *E*[*Zt*|**x***t*,*st*] = **x***t*β + *f*(*st*)*. However, GC is generally unknown, as mentioned before; therefore, YtG*ˆ*, defined in Equation (7), is used instead of YtG. Because G*ˆ*<sup>C</sup>* → *GC as n* → ∞ *(see [5]), it is ensured that E*[*YtG*ˆ |**x***t*,*st*] ∼= *E*[*YtG*|**x***t*,*st*] = **x***t*β + *f*(*st*)*.*

Let us consider *τHY* = sup{*α* : *HY*(*α*) < 1}, where *HY*(.) is defined right after Equation (3). In the literature, the convergence rate of the Kaplan–Meier estimator is examined in two settings: (i) restriction of the time interval to [0, *α*] with *α* < *τHY*; (ii) extension of the time interval to [0, *τHY*] (see [19] for a more detailed discussion). Here, the convergence rate of the Kaplan–Meier estimator is inspected with regard to case (ii). However, [0, *τHY*] cannot be used without some strong conditions, given by:


Details about conditions (i)–(iii) were studied by [20]. The convergence of *G*ˆ → *G* over the interval [0, *τHY*] can then be guaranteed. Reference [19] clearly shows both strong and weak convergence at the rate *n*−*<sup>ϑ</sup>*, where 0 ≤ *ϑ* ≤ 1/2.

The proof of Lemma 1 is given in Appendix A.

The major concern of this paper is to overcome the censoring problem and to estimate the semiparametric time series model efficiently. To achieve this goal, we use two different approaches, the BS and modified AS estimators. In the following sections, we apply these approaches to the transformed data to estimate the time series observations under random right-censorship.

#### **3. Estimating the Semiparametric Model Based on the BS Estimator**

We first introduce the BS estimator considered for estimating the components of model (1). A univariate B-spline is constructed from piecewise polynomial functions of degree *q* whose derivatives up to order (*q* − 1) are continuous at each knot point *r*1, ... , *rk*. The set of BSs of degree *q* over the knot vector *r* = (*r*1, ... , *rk*) is a vector space of dimension *q* + *k* + 1. In addition, note that *k* denotes the number of interior knots, while *q* ≥ 0 indicates the

polynomial order. For example, polynomials of order *q* = 0, 1, 2, and 3 correspond to constant, linear, quadratic, and cubic BS basis functions, respectively. If the knots are equally spaced (i.e., separated by the same distance *h* = *rk*+1 − *rk*), the knot points and the corresponding BSs are called uniform.

**Definition 1.** *Given an ordered knot vector <sup>r</sup>* <sup>=</sup> {*r*<sup>1</sup> <sup>≤</sup> *<sup>r</sup>*<sup>2</sup> <sup>≤</sup> ... <sup>≤</sup> *rk*} *in the domain of covariate st, the i-th BS basis functions* {*Bi*,*q*(*st*), *i* = 1, 2, . . . , *q* + *k* + 1} *of degree q* = 0 *and q* > 0 *can be defined recursively, respectively, as:*

$$B\_{i,0}(s) := \begin{cases} 1 & if \quad r\_i \le s \le r\_{i+1} \\ 0, & otherwise \end{cases},\tag{8}$$

$$B\_{i,q}(s) \ = \frac{s - r\_i}{r\_{i+q} - r\_i} B\_{i,q-1}(s) + \frac{r\_{i+q+1} - s}{r\_{i+q+1} - r\_{i+1}} B\_{i+1,q-1}(s). \tag{9}$$

*Note that if the denominator of Equation (9) is equal to zero, then the BS basis function is assumed to be zero. From Equations (8) and (9), a set of* (*q* + *k* + 1) *basis functions have the following important properties:*

*(a) The BS basis functions form a partition of unity,* ∑<sup>*q*+*k*+1</sup><sub>*i*=1</sub> *Bi*,*q*(*s*) = 1*; (b) for all values of covariate st*, *Bi*,*q*(*s*) ≥ 0*; and (c) Bi*,*q*(*s*) *is nonzero only on the interval* [*ri*, *ri*+*q*+1]*.*

Reference [21] proposes an algorithm to evaluate Equation (9). See also the work of [22] for a more detailed discussion of the BS approximation. Note also that a BS curve can be uniquely represented as a linear combination of the BS basis functions in Equation (9), as given in the next section. References [23,24] are recent studies of BSs.
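The recursion in Equations (8) and (9) can be coded directly. The sketch below is naive and unoptimized (production code would use de Boor's algorithm, as in [21]), indexes from zero, and assumes the supplied knot vector is long enough for the requested basis function:

```python
def bspline_basis(i, q, r, s):
    """Evaluate the i-th B-spline basis function of degree q on knot vector r
    at point s, via the recursion of Equations (8) and (9).
    Terms with a zero denominator are taken as zero, per the convention."""
    if q == 0:
        return 1.0 if r[i] <= s < r[i + 1] else 0.0   # Equation (8)
    left = 0.0
    if r[i + q] != r[i]:
        left = (s - r[i]) / (r[i + q] - r[i]) * bspline_basis(i, q - 1, r, s)
    right = 0.0
    if r[i + q + 1] != r[i + 1]:
        right = ((r[i + q + 1] - s) / (r[i + q + 1] - r[i + 1])
                 * bspline_basis(i + 1, q - 1, r, s))
    return left + right                                # Equation (9)
```

On uniform knots, properties (a) and (b) are easy to check numerically: the basis functions are non-negative and sum to one on the interior of the knot range.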

#### *3.1. BS Estimator*

As previously noted, in this paper, we fit semiparametric time series model (1) with right-censored data. For this purpose, the BS estimator can be used as an approximation method. Using the synthetic data in Equation (7), we estimate the parametric and nonparametric components of model (1) so that the sum of the squared differences between the censored time series values *YtG*<sup>ˆ</sup> and (**x***t*β + *f*(*st*)) is minimized. Assume that *f*(.) is a smooth function that can be approximated by a linear combination of the BS basis functions in Equations (8) and (9):

$$f(s) \cong \sum\_{i=1}^{m} \alpha\_i B\_{i,q}(s) = \mathbf{B}\boldsymbol{\alpha}, \quad m = q + k + 1,\tag{10}$$

where *m* = (*q* + *k* + 1) is the total number of BS basis functions being used, the *αi*'s are the coefficients (or control points) for each BS, **B** is an (*n* × *m*)-dimensional matrix of the BSs defined by Equation (9), and α = (*α*1, ... , *αm*) is the parameter vector of the BS function. Note also that the autoregressive errors in model (1) follow an *n*-dimensional multivariate normal distribution with a zero mean and stationary (*n* × *n*) covariance matrix **Σ**, that is, (*ε*1, ... , *εn*)*<sup>T</sup>* ∼ *Nn*(**0**, **Σ**), where **Σ** is a symmetric and positive definite matrix with elements:

$$\Sigma = \frac{\sigma\_u^2}{1 - \rho^2} \mathbf{R}, \ R(t, j) = \rho^{|t - j|}, \ 1 \le (t, j) \le n. \tag{11}$$

Throughout the paper, we write **V** = **Σ**<sup>−1</sup>. Note that **V** is generally unknown; however, its elements can be obtained by generalized least squares (GLS) based on an iterative process. Then, as in [25], a penalized BS study combining BS and difference penalties, the estimates of the components of semiparametric model (1) are obtained by minimizing the penalized sum of squares (*PSS*) criterion:

$$PSS = \sum_{t=1}^{n} \mathbf{V}\left\{ Y_{t\hat{G}} - \sum_{j=1}^{p} x_{tj}\beta_j - \sum_{i=1}^{m} \alpha_i B_{i,q}(s_t) \right\}^2 + \lambda \sum_{i=q+1}^{m} \left(\Delta^q \alpha_i\right)^2, \tag{12}$$

where $\Delta\alpha_i = (\alpha_i - \alpha_{i-1})$ is the first-order difference penalty on the coefficients of the BSs. The other differences can be defined as follows:

$$
\begin{aligned}
\Delta^2 \alpha_i &= \Delta(\Delta\alpha_i) \\
&= (\alpha_i - \alpha_{i-1}) - (\alpha_{i-1} - \alpha_{i-2}) \\
&= \alpha_i - 2\alpha_{i-1} + \alpha_{i-2}
\end{aligned} \tag{13}
$$

and similarly:

$$
\Delta^q \alpha_i = \Delta\left(\Delta^{q-1}\alpha_i\right). \tag{14}
$$

Note that if degree $q = 0$ in Equation (12), we obtain semiparametric ridge regression based on BSs. When $\lambda = 0$ in Equation (12), we have the minimization equation of ordinary least squares regression with correlated errors. If $\lambda > 0$, the penalty influences only the main diagonal and $q$ sub-diagonals (on each side of the main diagonal) of the banded system, due to the limited overlap of the BSs.

We rewrite the minimization criterion described as Equation (12) in a matrix and vector notation:

$$PSS = \left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right)'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right) + \lambda\left\|\mathbf{D}\boldsymbol{\alpha}\right\|^2, \tag{15}$$

where $\|\cdot\|$ denotes the Euclidean norm, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$, $\mathbf{Y}_{\hat{G}} = \left(Y_{1\hat{G}}, \ldots, Y_{n\hat{G}}\right)'$ is the synthetic response vector defined in Equation (7), $\lambda > 0$ is a smoothing parameter, and $\mathbf{D}$ denotes the matrix form of the difference operator $(\Delta^q)$ defined in Equation (13). For example, the $(m-2) \times m$-dimensional banded matrix $\mathbf{D}$ corresponding to the second-order difference penalty is given by:

$$\mathbf{D} = \begin{bmatrix} 1 & -2 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & -2 & 1 \end{bmatrix}. \tag{16}$$
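The matrix in Equation (16) can be generated mechanically for any order of differencing; a short sketch (the size m = 7 is an arbitrary choice):

```python
import numpy as np

m, order = 7, 2                           # m coefficients, second-order differences
# q-th order difference matrix: apply np.diff to the rows of the identity
D = np.diff(np.eye(m), n=order, axis=0)   # each row encodes one Delta^2
print(D.shape)   # (5, 7): an (m - 2) x m banded matrix for order 2
print(D[0])      # [ 1. -2.  1.  0.  0.  0.  0.]
# the penalty in Equation (15) is then lam * alpha @ D.T @ D @ alpha
```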

From simple algebraic operations, it follows that the solution to the minimization problem in Equation (15) satisfies the following block matrix equation:

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{V}\mathbf{X} & \mathbf{X}'\mathbf{V}\mathbf{B} \\
\mathbf{B}'\mathbf{V}\mathbf{X} & \mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}
\end{pmatrix}
\begin{pmatrix}
\boldsymbol{\beta} \\
\boldsymbol{\alpha}
\end{pmatrix} = \begin{pmatrix}
\mathbf{X}' \\
\mathbf{B}'
\end{pmatrix} \mathbf{V}\mathbf{Y}_{\hat{G}}. \tag{17}
$$

Given a parameter λ > 0, the corresponding estimators based on BSs for vectors β and α can be easily obtained by:

$$\hat{\boldsymbol{\alpha}}_{BS} = \left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right]^{-1}\mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\hat{\boldsymbol{\beta}}_{BS}\right), \tag{18}$$

and:

$$\hat{\boldsymbol{\beta}}_{BS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{Y}_{\hat{G}}, \tag{19}$$

where $\mathbf{A}_{BS} = \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right]^{-1}\mathbf{B}'\mathbf{V}$. It should be noted that the estimates of the unknown regression function in the censored semiparametric model are obtained by:

$$\hat{\mathbf{f}}_{BS} = \mathbf{B}\hat{\boldsymbol{\alpha}}_{BS} = \left[\hat{f}(s_1), \ldots, \hat{f}(s_n)\right]'. \tag{20}$$

From Equations (19) and (20), we see that the fitted values of dependent time series data can be written as:

$$\hat{\boldsymbol{\mu}}_{BS} = \left(\mathbf{X}\hat{\boldsymbol{\beta}}_{BS} + \hat{\mathbf{f}}_{BS}\right) = \mathbf{H}_{BS}\mathbf{Y}_{\hat{G}} = E[Y \mid X, \mathbf{s}], \tag{21}$$

where **H***BS* is a hat matrix for BSs and computed as follows:

$$\mathbf{H}_{BS} = \left[\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\left(\mathbf{I} - \mathbf{M}_{BS}\right) + \mathbf{M}_{BS}\right], \tag{22}$$

where $\mathbf{M}_{BS} = \mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right]^{-1}\mathbf{B}'\mathbf{V}$.
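A minimal numerical sketch ties Equations (17) and (18) together; the function name `bs_estimator`, the toy sizes, and the identity $\mathbf{V}$ are all assumptions made for illustration, not the authors' implementation:

```python
import numpy as np

def bs_estimator(X, B, V, D, lam, y):
    """Sketch of the penalized GLS fit: solve the block system of
    Equation (17) for (beta, alpha) in one shot."""
    lhs = np.block([[X.T @ V @ X, X.T @ V @ B],
                    [B.T @ V @ X, B.T @ V @ B + lam * D.T @ D]])
    rhs = np.concatenate([X.T @ V @ y, B.T @ V @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]                  # beta_hat, alpha_hat

# toy data; all sizes and V = I (i.i.d. errors) are illustrative assumptions
rng = np.random.default_rng(0)
n, p, m = 40, 2, 7
X = rng.standard_normal((n, p))
B = rng.random((n, m))
V = np.eye(n)
D = np.diff(np.eye(m), n=2, axis=0)
y = rng.standard_normal(n)
beta_hat, alpha_hat = bs_estimator(X, B, V, D, lam=1.0, y=y)

# consistency with Equation (18): alpha solves the ridge system given beta
chk = np.linalg.solve(B.T @ V @ B + 1.0 * D.T @ D, B.T @ V @ (y - X @ beta_hat))
print(np.allclose(alpha_hat, chk))            # True
```

Solving the block system directly is algebraically equivalent to the back-substitution form of Equations (18) and (19), which the final check confirms.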

#### *3.2. AS Estimator*

The adaptive spline (AS) applies an adaptive ridge penalty to the BS method, which makes it more flexible for knot determination. The AS concept is explained in [26] in a nonparametric context. In this paper, however, we generalize this estimation concept to the semiparametric setting based on synthetic response observations. It should be noted that the location and number of knots are crucially important for the synthetic data transformation; this issue is discussed in detail in Section 4.3. The point here is that a more efficient estimator based on synthetic responses is needed, as most existing smoothing techniques (spline smoothing, kernel smoothing, etc.) cannot properly handle synthetic data. This article aims to solve this issue with the AS estimator.

When a BS is defined on the knots $r_1 \le r_2 \le \ldots \le r_k$ such that $\Delta^q\alpha_i = 0$ for some $i$-th knot, it may be reparametrized as a BS on the knots $r_1, r_2, \ldots, r_{i-1}, r_{i+1}, \ldots, r_k$. Accordingly, when $m = (q + k + 1)$, we want to put a penalty on the number of non-zero differences, as indicated below:

$$\lambda \sum_{i=q+1}^{m} \left\|\Delta^q \alpha_i\right\|_0, \tag{23}$$

where $\Delta^q\alpha_i$ is the $q$-th-order difference operator and $\|\Delta^q\alpha_i\|_0$ is the $L_0$-norm of the differences, that is, $\|\Delta^q\alpha_i\|_0 = 0$ if $\Delta^q\alpha_i = 0$ and $\|\Delta^q\alpha_i\|_0 = 1$ otherwise, and $\lambda$ is a positive penalty parameter that governs the tradeoff between the goodness of fit to the data and the smoothness of the fitted curve. This penalty enables us to remove a knot $r_i$ that is not relevant to the smoothing problem, to join the neighboring intervals $[r_{i-1}, r_i)$ and $[r_i, r_{i+1})$, and to continue fitting with a BS defined over the remaining knot points. Note also that as $\lambda \to 0$, the fitted curve becomes a BS with knots $r_i$, $i = 1, 2, \ldots, k$, and as $\lambda \to \infty$, the fitted function becomes a polynomial of degree $q$.

It should be emphasized that an important feature of the adaptive ridge penalty is that Equation (23) cannot be differentiated due to the $L_0$-norm; as a result, the fitting process is numerically intractable. An approximate solution for dealing with the $L_0$-norm is provided by [27,28]. Following these authors, we approximate the $L_0$-norm using an iterative process referred to as the "adaptive ridge", based on synthetic data. The new criterion function is expressed by the following weighted penalized sum of squares:

$$WPSS = \left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right)'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right) + \lambda \sum_{i=q+1}^{m} w_i \left(\Delta^q\alpha_i\right)^2, \tag{24}$$

where the $w_i$'s denote positive weights. It should be noted that the penalty is close to the $L_0$-norm of the differences when the weights are iteratively computed from the BS parameter vector $\boldsymbol{\alpha}$ according to:

$$w_i = \left[\left(\Delta^q \alpha_i\right)^2 + \gamma^2\right]^{-1}, \quad \gamma > 0, \tag{25}$$

where *γ* is a constant properly determined by the researcher.
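The weight update of Equation (25) is one line of code; the coefficient values below are an illustrative assumption:

```python
import numpy as np

def adaptive_ridge_weights(alpha, order=2, gamma=1e-5):
    """Weights of Equation (25): w_i = ((Delta^q alpha_i)^2 + gamma^2)^(-1)."""
    d = np.diff(alpha, n=order)        # q-th order differences of the coefficients
    return 1.0 / (d**2 + gamma**2)

# toy coefficient vector: the first three points lie on a line, the rest do not
alpha = np.array([0.0, 1.0, 2.0, 3.0, 3.5, 5.0])
d = np.diff(alpha, n=2)                # [0, 0, -0.5, 1]
w = adaptive_ridge_weights(alpha)
# w * d^2 approximates the L0-norm: ~0 where the difference vanishes, ~1 otherwise
print(np.round(w * d**2, 3))           # [0. 0. 1. 1.]
```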

**Remark 1.** *There are a few important points to note about the selection of γ. If $(\Delta^q\alpha_i)^2 \ll \gamma^2$, then the magnitude of $w_i$ can be quite large (close to $\gamma^{-2}$), and the penalty term becomes $w_i(\Delta^q\alpha_i)^2 \cong 0$. Conversely, if $(\Delta^q\alpha_i)^2 \gg \gamma^2$, then $w_i(\Delta^q\alpha_i)^2 \cong \|\Delta^q\alpha_i\|_0 = 1$. This convergence gives us a measure of how relevant the $i$-th knot point is. In practice, one possible choice, suggested by [28], is $\gamma = 10^{-5}$. They select the knots (denoted $r_i^*$) with a weighted difference bigger than 0.99. The number of parameters of the chosen BS is $m_\lambda = q + k_\lambda + 1$, where $k_\lambda$ denotes the number of selected knot points.*

Note that reference [28] provides a figure showing the effect of different norm degrees (*q*) on the quality of estimation. It can be seen there that the performance of the estimation does not change for different values of *γ* when the norm degree is zero (*q* = 0); however, *γ* seriously affects the performance when *q* > 0.

For some *λ* > 0 and non-negative weights, the *WPSS* of Equation (24) can be rewritten as:

$$WPSS = \left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right)'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\boldsymbol{\beta} - \mathbf{B}\boldsymbol{\alpha}\right) + \lambda\boldsymbol{\alpha}'\mathbf{K}\boldsymbol{\alpha}, \tag{26}$$

where $\mathbf{K} = \mathbf{D}'\mathbf{W}\mathbf{D}$ is a penalty matrix with $\mathbf{W} = \text{diag}\left(w_{q+1}, \ldots, w_m\right)$, and $\mathbf{D}$ is the matrix form of the difference operator $\Delta^q$, as defined in Equation (13). Simple algebraic operations show that the solution to the minimization problem *WPSS* in Equation (26) satisfies the block matrix equation:

$$
\begin{pmatrix}
\mathbf{X}'\mathbf{V}\mathbf{X} & \mathbf{X}'\mathbf{V}\mathbf{B} \\
\mathbf{B}'\mathbf{V}\mathbf{X} & \mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}
\end{pmatrix}
\begin{pmatrix}
\boldsymbol{\beta} \\
\boldsymbol{\alpha}
\end{pmatrix} = \begin{pmatrix}
\mathbf{X}' \\
\mathbf{B}'
\end{pmatrix} \mathbf{V}\mathbf{Y}_{\hat{G}}. \tag{27}
$$

By arguments similar to those for the BS approach, the corresponding estimators $\hat{\boldsymbol{\alpha}}_{AS}$ and $\hat{\boldsymbol{\beta}}_{AS}$ of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, based on the right-censored semiparametric time series model (1) with correlated data, can be easily obtained, respectively, as:

$$\hat{\boldsymbol{\alpha}}_{AS} = \left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\hat{\boldsymbol{\beta}}_{AS}\right), \tag{28}$$

and:

$$\hat{\boldsymbol{\beta}}_{AS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{Y}_{\hat{G}}, \tag{29}$$

where $\mathbf{A}_{AS} = \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}$. The proofs and derivations of Equations (28) and (29) are given in Appendix B. Notice that the estimates corresponding to the nonparametric part of the semiparametric model (1) are obtained from Equation (28) as described in the following equation:

$$\hat{\mathbf{f}}_{AS} = \mathbf{B}\hat{\boldsymbol{\alpha}}_{AS} = \left[\hat{f}(s_1), \ldots, \hat{f}(s_n)\right]'. \tag{30}$$

From Equations (29) and (30), we can see that the fitted values of the dependent time series data can be obtained as:

$$
\hat{\boldsymbol{\mu}}_{AS} = \left(\mathbf{X}\hat{\boldsymbol{\beta}}_{AS} + \hat{\mathbf{f}}_{AS}\right) = \mathbf{H}_{AS}\mathbf{Y}_{\hat{G}} = E[Y \mid X, \mathbf{s}], \tag{31}
$$

where **H***AS* denotes the hat matrix, given by:

$$\mathbf{H}_{AS} = \left[\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\left(\mathbf{I} - \mathbf{M}_{AS}\right) + \mathbf{M}_{AS}\right], \tag{32}$$

with $\mathbf{M}_{AS} = \mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}$.

To make the computation efficient, the penalty term $\mathbf{D}'\mathbf{W}\mathbf{D}$ is calculated within the iterative process instead of constructing the matrix $\mathbf{D}$ and the knot set individually. The iterative procedure is given in Algorithm 1 below.

**Algorithm 1**. Iterative algorithm for the modified A-spline (AS) estimator $\hat{\boldsymbol{\alpha}}_{AS}$.

**Input**: $\mathbf{X}$, $\mathbf{s}$, $\mathbf{Y}_{\hat{G}}$. **Output**: $\hat{\boldsymbol{\beta}}^{(i)}_{AS} = \left(\hat{\beta}^{(i)}_1, \hat{\beta}^{(i)}_2, \ldots, \hat{\beta}^{(i)}_p\right)$, $\hat{\boldsymbol{\alpha}}^{(i)}_{AS} = \left(\hat{\alpha}^{(i)}_1, \hat{\alpha}^{(i)}_2, \ldots, \hat{\alpha}^{(i)}_{q+k+1}\right)$

1: **Begin**
2: Set initial values $\boldsymbol{\beta}^{(0)} = \mathbf{1}_p$, $\boldsymbol{\alpha}^{(0)} = \mathbf{0}_{q+k+1}$ and $\mathbf{W}^{(0)} = \mathbf{I}$ to start the iterative process
3: **do** until the weighted differences converge to the $L_0$-norm
4: $\quad\hat{\boldsymbol{\beta}}^{(i)}_{AS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}\right)\mathbf{Y}_{\hat{G}}$
5: $\quad\hat{\boldsymbol{\alpha}}^{(i)}_{AS} = \left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\hat{\boldsymbol{\beta}}^{(i)}_{AS}\right)$
6: $\quad$Set $\gamma = 10^{-5}$
7: $\quad w^{(i)}_i = \left[\left(\Delta^q\alpha^{(i)}_i\right)^2 + \gamma^2\right]^{-1}$
8: $\quad\hat{\boldsymbol{\beta}}_{AS} = \hat{\boldsymbol{\beta}}^{(i)}_{AS}$, $\hat{\boldsymbol{\alpha}}_{AS} = \hat{\boldsymbol{\alpha}}^{(i)}_{AS}$, $\mathbf{W} = \text{diag}\left(w^{(i)}_i\right)$
9: **end**
10: Compute $\mathbf{r}^{(i*)}$ by the criterion $\left\|\Delta^q\boldsymbol{\alpha}^{(i)}_{AS}\right\|^2\mathbf{W}^{(i)} > 0.99$
11: Return $\hat{\boldsymbol{\beta}}^{(i)}_{AS} = \left(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p\right)$, $\hat{\boldsymbol{\alpha}}^{(i)}_{AS} = \left(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{q+k+1}\right)$
12: **End**

**Remark 2.** *For the constant value $\gamma = 10^{-5}$, the iteration process repeats between step 3 and step 9 until the pre-determined tolerance $\delta = 10^{-4}$ is reached, where $\delta = n^{-1}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_{i\hat{G}}\right|$. From our experience, about 20 iterations are needed to achieve convergence.*

Notice that the complexity and efficiency of Algorithm 1 can be analyzed from several aspects:

(i) Number of local searches: the algorithm does not involve a local search procedure, which is an advantage for the speed of Algorithm 1;

(ii) Number of nested loops: because there is only a single iteration loop (no nested loops), the algorithm's order of growth is *O*(*n*);

(iii) Asymptotic behavior: as noted above, Algorithm 1 is *O*(*n*), which means its convergence speed compares favorably with that of the alternative BS method.

As mentioned at the beginning of this section, the choice of an optimum smoothing parameter λ is required for both semiparametric BS and AS estimators. In this context, the improved Akaike information criterion (*AICc*) proposed by [29] is used, which is computed with the following equation:

$$AIC_c(\lambda) = \log\left(\hat{\sigma}^2\right) + 1 + \frac{2\{tr(\mathbf{H}) + 1\}}{n - tr(\mathbf{H}) - 2}, \tag{33}$$

where $\hat{\sigma}^2$ is the estimate of the model variance, which is estimated separately for the two methods in the next section, and $\mathbf{H}$ denotes the hat matrix of either method: it is replaced by $\mathbf{H}_{AS}$ for the AS method and by $\mathbf{H}_{BS}$ for the BS method, respectively.
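A small sketch of how Equation (33) can drive the choice of λ; the ridge-type smoother below is only a stand-in for $\mathbf{H}_{BS}$ or $\mathbf{H}_{AS}$, and all numerical values are assumptions:

```python
import numpy as np

def aicc(H, y):
    """Improved AIC of Equation (33) for a linear smoother y_hat = H y,
    with the variance estimated as in Equations (37)/(45)."""
    n = len(y)
    resid = (np.eye(n) - H) @ y
    df = np.trace((np.eye(n) - H).T @ (np.eye(n) - H))
    sigma2 = (resid @ resid) / df
    tr_h = np.trace(H)
    return np.log(sigma2) + 1.0 + 2.0 * (tr_h + 1.0) / (n - tr_h - 2.0)

# pick lambda on a grid for an illustrative polynomial ridge smoother
rng = np.random.default_rng(2)
n = 30
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
Phi = np.vander(x, 6, increasing=True)
grid = [1e-4, 1e-2, 1.0]
scores = [aicc(Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(6), Phi.T), y)
          for lam in grid]
best_lam = grid[int(np.argmin(scores))]
```

In the paper's setting, the same loop would be run with the hat matrices of Equations (22) and (32) in place of the illustrative smoother.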

#### **4. Statistical Properties of the Estimators**

In this paper, we introduced the semiparametric AS and BS estimators for the estimation of the right-censored time series model. It should be noted that these two methods are used here for the first time in a time series estimation setting; we therefore examine their statistical properties. In particular, the error terms obtained from the estimates of both methods and the estimators of the parametric and nonparametric components are inspected, and their properties are derived.

#### *4.1. Properties of the Semiparametric BS Estimator*

Firstly, the parametric component is inspected. As is well known, in a parametric context the error can be decomposed into bias and variance terms that characterize the quality of the estimator. Accordingly, the estimator $\hat{\boldsymbol{\beta}}_{BS}$ of the vector of parametric coefficients can be expanded as follows:

$$\hat{\boldsymbol{\beta}}_{BS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{Y}_{\hat{G}} = \boldsymbol{\beta} + \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{f}, \tag{34}$$

where the matrices $\mathbf{V}$, $\mathbf{A}_{BS}$ and $\mathbf{M}_{BS}$ are as defined in Section 3.1 and $\mathbf{f} = [f(s_1), f(s_2), \ldots, f(s_n)]'$. From here, the bias $B\left(\hat{\boldsymbol{\beta}}_{BS}\right)$ and variance-covariance $V\left(\hat{\boldsymbol{\beta}}_{BS}\right)$ of the estimator $\hat{\boldsymbol{\beta}}_{BS}$ can be computed as follows:

$$B\left(\hat{\boldsymbol{\beta}}_{BS}\right) = E\left(\hat{\boldsymbol{\beta}}_{BS}\right) - \boldsymbol{\beta} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{f}, \tag{35}$$

$$V\left(\hat{\boldsymbol{\beta}}_{BS}\right) = \sigma^2\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}, \tag{36}$$

where $\sigma^2$ is the variance of the fitted semiparametric model. Since this variance is generally unknown, the BS-based estimate $\hat{\sigma}^2_{BS}$ is used instead of $\sigma^2$; it can be computed from the residual sum of squares (RSS):

$$\hat{\sigma}^2_{BS} = \frac{RSS}{tr(\mathbf{I} - \mathbf{H}_{BS})^2} = \frac{\left\|\left(\mathbf{I} - \mathbf{H}_{BS}\right)\mathbf{Y}_{\hat{G}}\right\|^2}{tr\left[\left(\mathbf{I} - \mathbf{H}_{BS}\right)'\left(\mathbf{I} - \mathbf{H}_{BS}\right)\right]}, \tag{37}$$

where $tr(\mathbf{I} - \mathbf{H}_{BS})^2 = n - 2tr(\mathbf{H}_{BS}) + tr\left(\mathbf{H}'_{BS}\mathbf{H}_{BS}\right)$ denotes the degrees of freedom. In addition, $tr\left(\mathbf{H}'_{BS}\mathbf{H}_{BS}\right)$ needs $O(n)$ algebraic operations. In the context of the BS, if the data have a normal distribution, $\hat{\sigma}^2_{BS}$ is asymptotically unbiased.

Secondly, the properties of the estimated nonparametric component $\hat{\boldsymbol{\alpha}}_{BS} = \left(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{q+k+1}\right)$ are given here. The bias of $\hat{\boldsymbol{\alpha}}$ is one of the quality measurements for the estimated model. It is based on the conditional expectation $E[\hat{\boldsymbol{\alpha}} \mid s_t]$, given by:

$$E[\hat{\boldsymbol{\alpha}}_{BS} \mid s_t] = \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right)^{-1}\mathbf{B}'\mathbf{V}\mathbf{B}\boldsymbol{\alpha}. \tag{38}$$

From that, the bias is given by:

$$\begin{aligned} Bias(\hat{\boldsymbol{\alpha}}_{BS}) &= E[\hat{\boldsymbol{\alpha}}_{BS} \mid s_t] - \boldsymbol{\alpha} \\ &= \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right)^{-1}\mathbf{B}'\mathbf{V}\mathbf{f} - \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right)^{-1}\mathbf{B}'\mathbf{V}\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{BS}\right)\mathbf{f} - \boldsymbol{\alpha}. \end{aligned} \tag{39}$$

Accordingly, the covariance of αˆ *BS* can be computed as:

$$Cov(\hat{\boldsymbol{\alpha}}_{BS}) = \hat{\sigma}^2_{BS}\frac{1}{n}\left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right)^{-1}\left(\mathbf{B}'\mathbf{V}\mathbf{B}\right)\left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{D}\right)^{-1}, \tag{40}$$

where $\hat{\sigma}^2_{BS}$ is defined by Equation (37). In addition, to assess the performance of $\hat{\mathbf{f}}_{BS} = \mathbf{B}\hat{\boldsymbol{\alpha}}_{BS}$, the mean squared error criterion $RMSE\left(\mathbf{f}, \hat{\mathbf{f}}_{BS}\right)$ is used:

$$RMSE(\mathbf{f}, \hat{\mathbf{f}}_{BS}) = n^{-1}\sum_{t=1}^{n}\left[f(s_t) - \hat{f}_{BS}(s_t)\right]^2 = n^{-1}(\mathbf{f} - \hat{\mathbf{f}}_{BS})'(\mathbf{f} - \hat{\mathbf{f}}_{BS}). \tag{41}$$

#### *4.2. Properties of the Semiparametric AS Estimator*

As in Section 4.1, the same properties of the parametric and nonparametric components are given here for the AS estimator. The expansion needed to derive the bias and variance of $\hat{\boldsymbol{\beta}}_{AS}$ is:

$$\hat{\boldsymbol{\beta}}_{AS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{Y}_{\hat{G}} = \boldsymbol{\beta} + \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{f}, \tag{42}$$

where **A***AS* and **M***AS* are given in Section 3.2. Now, the bias and the covariance matrix of the estimator βˆ *AS* can be provided by:

$$B\left(\hat{\boldsymbol{\beta}}_{AS}\right) = E\left(\hat{\boldsymbol{\beta}}_{AS}\right) - \boldsymbol{\beta} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{f}, \tag{43}$$

$$V\left(\hat{\boldsymbol{\beta}}_{AS}\right) = \sigma^2\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}, \tag{44}$$

where $\sigma^2$ is the variance of the fitted semiparametric model. Similarly to Equation (37), instead of the model variance, the estimate $\hat{\sigma}^2_{AS}$ is obtained as follows:

$$\hat{\sigma}^2_{AS} = \frac{RSS}{tr(\mathbf{I} - \mathbf{H}_{AS})^2} = \frac{\left\|\left(\mathbf{I} - \mathbf{H}_{AS}\right)\mathbf{Y}_{\hat{G}}\right\|^2}{tr\left[\left(\mathbf{I} - \mathbf{H}_{AS}\right)'\left(\mathbf{I} - \mathbf{H}_{AS}\right)\right]}. \tag{45}$$

The properties of the estimated nonparametric component $\hat{\boldsymbol{\alpha}}_{AS} = \left(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{q+k+1}\right)$ for the AS method are described below. The bias and the variance of the AS estimator $\hat{\boldsymbol{\alpha}}_{AS}$ can be given, respectively, as:

$$\begin{aligned} Bias(\hat{\boldsymbol{\alpha}}_{AS}) &= E[\hat{\boldsymbol{\alpha}}_{AS} \mid s_t] - \boldsymbol{\alpha} \\ &= \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{W}\mathbf{D}\right)^{-1}\mathbf{B}'\mathbf{V}\mathbf{f} - \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{W}\mathbf{D}\right)^{-1}\mathbf{B}'\mathbf{V}\mathbf{X}\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{f} - \boldsymbol{\alpha} \end{aligned} \tag{46}$$

and

$$Cov(\hat{\boldsymbol{\alpha}}_{AS}) = \hat{\sigma}^2_{AS}\frac{1}{n}\left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{W}\mathbf{D}\right)^{-1}\left(\mathbf{B}'\mathbf{V}\mathbf{B}\right)\left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{D}'\mathbf{W}\mathbf{D}\right)^{-1}. \tag{47}$$

Thus, the value of $RMSE\left(\mathbf{f}, \hat{\mathbf{f}}_{AS}\right)$ for $\hat{\mathbf{f}}_{AS} = \mathbf{B}\hat{\boldsymbol{\alpha}}_{AS}$, similar to Equation (41), is calculated as follows:

$$RMSE(\mathbf{f}, \hat{\mathbf{f}}_{AS}) = n^{-1}\sum_{t=1}^{n}\left[f(s_t) - \hat{f}_{AS}(s_t)\right]^2 = n^{-1}(\mathbf{f} - \hat{\mathbf{f}}_{AS})'(\mathbf{f} - \hat{\mathbf{f}}_{AS}). \tag{48}$$

#### *4.3. Quality Measures for the Fitted Model*

After assessing the parametric and nonparametric components of the model in Sections 4.1 and 4.2, several measurements are introduced in this section to evaluate the overall model performance. In the literature on time series modelling, mean absolute percentage error (*MAPE*), mean absolute error (*MAE*), and mean squared error (*MSE*) are the most commonly used performance criteria. To represent these criteria, *MAPE* is preferred in this study. In addition, median absolute error (*MedAE*) was used, which allowed us to account for missing or censored data. Generalized *MSE* (*GMSE*) and the ratio of *GMSE* (*RGMSE*) proposed by [30] and [2], respectively, were used to measure the quality of the fitted time series model. The aforementioned criteria can be defined as follows:

$$MAPE\left(\mathbf{Y}_{\hat{G}}, \hat{\mathbf{Y}}_{\hat{G}}\right) = n^{-1}\sum_{t=1}^{n}\left|\frac{Y_{t\hat{G}} - \hat{Y}_{t\hat{G}}}{Y_{t\hat{G}}}\right|, \quad MedAE\left(\mathbf{Y}_{\hat{G}}, \hat{\mathbf{Y}}_{\hat{G}}\right) = Median\left(\left|\mathbf{Y}_{\hat{G}} - \hat{\mathbf{Y}}_{\hat{G}}\right|\right),$$
$$GMSE\left(\mathbf{Y}_{\hat{G}}, \hat{\mathbf{Y}}_{\hat{G}}\right) = \left(\mathbf{Y}_{\hat{G}} - \hat{\mathbf{Y}}_{\hat{G}}\right)'E\left(\mathbf{Y}_{\hat{G}}\mathbf{Y}_{\hat{G}}'\right)\left(\mathbf{Y}_{\hat{G}} - \hat{\mathbf{Y}}_{\hat{G}}\right),$$

where $\hat{Y}_{t\hat{G}}$ and $\hat{\mathbf{Y}}_{\hat{G}}$ denote the fitted values and the fitted vector of the dependent variable for any estimation method. Here, they are replaced by $\hat{Y}_{t\hat{G}_{BS}}$ and $\hat{\mathbf{Y}}_{\hat{G}_{BS}}$ for the BS, and by $\hat{Y}_{t\hat{G}_{AS}}$ and $\hat{\mathbf{Y}}_{\hat{G}_{AS}}$ for the AS. In addition, to make a more meaningful comparison between the AS and BS estimators, the *RGMSE* is defined below.

**Definition 2.** *The ratio of GMSE can be defined as follows:*

$$RGMSE\left(\hat{\mathbf{Y}}_{\hat{G}_{BS}}, \hat{\mathbf{Y}}_{\hat{G}_{AS}}\right) = \frac{GMSE\left(\hat{\mathbf{Y}}_{\hat{G}_{AS}}\right)}{GMSE\left(\hat{\mathbf{Y}}_{\hat{G}_{BS}}\right)}. \tag{49}$$

Regarding the RGMSE criterion, if $RGMSE\left(\hat{\mathbf{Y}}_{\hat{G}_{BS}}, \hat{\mathbf{Y}}_{\hat{G}_{AS}}\right) < 1$, then the model fitted by the AS method performs better than that fitted by the BS method.
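The criteria above can be sketched directly; the toy fitted vectors and the identity weighting matrix standing in for $E\left(\mathbf{Y}_{\hat{G}}\mathbf{Y}_{\hat{G}}'\right)$ are assumptions for illustration:

```python
import numpy as np

def mape(y, y_hat):
    # mean absolute percentage error (requires nonzero y)
    return np.mean(np.abs((y - y_hat) / y))

def medae(y, y_hat):
    return np.median(np.abs(y - y_hat))

def gmse(y, y_hat, S):
    # generalized MSE with a supplied weighting matrix S ~ E(Y Y')
    r = y - y_hat
    return r @ S @ r

def rgmse(y, y_hat_bs, y_hat_as, S):
    # Equation (49): values below 1 favour the AS fit over the BS fit
    return gmse(y, y_hat_as, S) / gmse(y, y_hat_bs, S)

# toy response and two competing fits (illustrative values)
y = np.array([1.0, 2.0, 4.0])
fit_bs = np.array([1.2, 1.6, 4.4])
fit_as = np.array([1.1, 1.8, 4.2])
S = np.eye(3)
print(round(mape(y, fit_as), 3))          # 0.083
print(rgmse(y, fit_bs, fit_as, S) < 1.0)  # True: AS is closer to y here
```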

#### **5. Further Information for Adaptive-Ridge Penalty**

The semiparametric AS estimator proposed for the right-censored time series model, with its adaptive nature, aims to provide high-quality estimates despite the censoring. To approximate the $L_0$-norm given in Equation (23), the most suitable knot locations can be chosen thanks to the weighted penalty term. Thus, the model avoids the disadvantages of the synthetic data transformation, which gives higher magnitudes to uncensored observations.

This section is designed to inspect some of the large sample properties of the modified AS estimator under right-censored data. It should be noted that adaptive ridge penalty in the setting of regression has been studied by many authors; see for example [25,26,28]. However, the aforementioned studies consider adaptive ridge penalty individually, not as a part of a semiparametric time series model. This section provides basic information for the large sample properties of the proposed AS estimator in the context of a semiparametric time series model.

As previously stated, the AS approximation is a modified version of the P-spline (penalized BS) estimator proposed by [31]. The AS method differs from BSs chiefly through the $L_0$-norm in its penalty term, and the AS estimator is obtained by an iterative process with determined weights, as described in Section 3.2. Moreover, beyond its existing uses in the literature, the AS method is applied here to model censored time series. For these reasons, we make several important assumptions. The large-sample properties are established under the assumptions given below:

**Assumption 1.** *The minimization problem for the semiparametric AS is given in Equation (26). To make this expression more general, it can be rewritten as follows:*

$$PSS(\boldsymbol{\alpha}; \lambda) = \sum_{t=1}^{n}\mathbf{V}\left\{Y_{t\hat{G}} - \sum_{l=1}^{p}x_{tl}\beta_l - \sum_{j=1}^{m}\alpha_j B_{j,q}(s_t)\right\}^2 + \lambda\sum_{j=q+1}^{q+k+1}\left\|\Delta^q\alpha_j\right\|_\tau, \tag{50}$$

*where $\|\Delta^q\alpha_j\|_\tau$ represents the τ-norm of the penalty term. The first assumption is τ → 0, which allows approximation of the $L_0$-norm via the weights acquired in the iterative process. Otherwise, the $L_0$-norm requires overly complex calculations, which makes the method impractical. From our knowledge of the literature, when τ → 0, as in Equation (26), the minimization of Equation (50) works by penalizing the non-zero coefficients $\alpha_j$'s, as shown by [32].*

**Assumption 2.** *When $\hat{\boldsymbol{\alpha}}_{AS}$ is examined asymptotically, the objective function of Equation (26) may not have a global minimum, since it is not clearly convex. However, if we assume that:*

$$\mathbf{R}_n = \frac{1}{n}\sum_{i=1}^{n}\mathbf{B}_i\mathbf{B}_i' \to \mathbf{R}, \tag{51}$$

*then it is possible to point out some important aspects of asymptotic consistency. Therefore, it should be presumed that R is a non-negative definite matrix and:*

$$\frac{1}{q+k+1}\max_{1 \le i \le n}\mathbf{B}_i'\mathbf{B}_i \to 0, \tag{52}$$

*where the diagonal elements of $\mathbf{R}$ are equal to* 1.

**Assumption 3.** *$\mathbf{B}_j'\mathbf{B}_j$, $\left(\mathbf{B}_j'\mathbf{B}_j\right)^{-1}$, and $\mathbf{R}$ are assumed to be full-rank matrices. Under the assumptions given above, to see the asymptotic consistency of $\hat{\boldsymbol{\alpha}}_{AS}$ and $\hat{\boldsymbol{\beta}}_{AS}$, an equation can be obtained from Equation (50) as follows:*

$$M_n\left(\hat{\boldsymbol{\alpha}}_{AS_n}, \hat{\boldsymbol{\beta}}_{AS_n}\right) = \sum_{t=1}^{n}\mathbf{V}\left\{Y_{t\hat{G}} - \sum_{l=1}^{p}x_{tl}\hat{\beta}_{AS_n l} - \sum_{j=1}^{m}\hat{\alpha}_{AS_n j}B_{j,q}(s_t)\right\}^2 + \lambda_n\sum_{i=q+1}^{q+k+1}\left\|\Delta^q\hat{\alpha}_{AS_n i}\right\|_\tau, \tag{53}$$

*where $\hat{\boldsymbol{\alpha}}_{AS_n}$, $\hat{\boldsymbol{\beta}}_{AS_n}$ denote the limiting case of the estimators for $\lambda_n = O(n)$. Note that Equation (52) is ensured by the following Theorem 1.*

**Theorem 1.** *Under Assumptions 1–3 and $\lambda_n \to \lambda \ge 0$, $\left(\hat{\boldsymbol{\beta}}_{AS_n}, \hat{\boldsymbol{\alpha}}_{AS_n}\right) \xrightarrow{d} \arg\min(M_n)$, where:*

$$M_n\left(\hat{\boldsymbol{\beta}}_{AS_n}, \hat{\boldsymbol{\alpha}}_{AS_n}\right) = \left[\left(\hat{\boldsymbol{\beta}}_{AS_n}, \hat{\boldsymbol{\alpha}}_{AS_n}\right)' - \left(\boldsymbol{\beta}, \boldsymbol{\alpha}\right)'\right]'\mathbf{R}\left[\left(\hat{\boldsymbol{\beta}}_{AS_n}, \hat{\boldsymbol{\alpha}}_{AS_n}\right)' - \left(\boldsymbol{\beta}, \boldsymbol{\alpha}\right)'\right] + \lambda_n\sum_{i=q+1}^{m}\left\|\Delta^q\alpha_i\right\|_\tau. \tag{54}$$

*Therefore, for optimal $\lambda_n = O(1)$, the pair $\left(\hat{\beta}_{AS_n}, \hat{\alpha}_{AS_n}\right)$ can be regarded as a consistent AS estimator of $(\beta, \alpha)$. In this context, as $n \to \infty$, $\left(\hat{\beta}_{AS_n}, \hat{\alpha}_{AS_n}\right) \to (\beta, \alpha)$.*

For the proof of Theorem 1, see Appendix C.

To clearly indicate the place of Assumptions 1–3 in the estimation process, the following explanations are given for each assumption.


identifiable and to avoid an ill-posed problem, $\mathbf{B}'\mathbf{B}$ must be a full-rank matrix.

• Assumption 3 complements Assumption 2. Thus, asymptotic consistency can be confirmed by Assumption 3; from this, it can be said that Assumption 3 indirectly depends on the dataset.

#### *5.1. Asymptotic Distribution and Consistency of the Proposed Estimator*

In this section, the estimator of the parametric component, $\hat{\beta}_{AS}$, is inspected in terms of asymptotic consistency and asymptotic distribution.

Assume the following regularity conditions:

(i) $\mathbf{F}_n = n^{-1}\left(\mathbf{X}_i^T\mathbf{V} - \mathbf{A}\right)\mathbf{X}_i \to \mathbf{F}$ for a non-negative definite matrix $\mathbf{F}$;

(ii) $n^{-1}\max_{1\le i\le n}\left(\mathbf{X}_i^T\mathbf{V} - \mathbf{A}\right)\mathbf{X}_i \to 0$;


Here, condition (ii) indicates that the diagonal elements of $\mathbf{F}$ and $\mathbf{F}_n$ are identical and equal to one, because the covariates are scaled. To obtain the asymptotic distribution of $\hat{\beta}_{AS}$, "nearly singular" designs are considered, since $\tau \to 0$ for $\mathbf{F}_n$. Thus, it can be ensured that $\mathbf{F}_n \to \mathbf{F}$ asymptotically. On the other hand, $\mathbf{F}_n$ and $\mathbf{F}$ are assumed to be non-singular in Section 5.1.

To show the consistency and asymptotic normality of the semiparametric AS estimator when conditions (i), (ii), and (iii) are ensured with non-singular **F**, first the case of *τ* ≥ 1 is considered, followed by the case of *τ* < 1.

Let $\hat{\beta}_{AS_n}$ be the asymptotic estimator. The consistency of $\hat{\beta}_{AS_n}$ can be established by using the following minimization function:

$$\psi_n\left(\beta_{AS_n}, f(s_t)\right) = n^{-1}\sum_{t=1}^{n}\left[Y_t - \mathbf{x}_t\beta_{AS_n} - f(s_t)\right]^2 + \lambda_n n^{-1}\sum_{j=1}^{p}\left|\beta_{(j)AS_n}\right|^{\tau}. \tag{55}$$

The following theorem shows the consistency of $\hat{\beta}_{AS_n}$ under the validated additional assumption $\lambda_n = O(n)$.
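As a concrete reading of Equation (55), the following Python sketch evaluates the penalized criterion for a candidate $\beta$ and a fitted nonparametric part (the function and argument names are hypothetical, not from the authors' code):

```python
import numpy as np

def psi_n(beta, f_s, y, x, lam, tau):
    """Penalized least-squares criterion of Equation (55) (sketch).

    beta : (p,) candidate coefficients; f_s : (n,) nonparametric part
    evaluated at s_t; y : (n,) synthetic responses; x : (n, p) lagged
    covariates; lam, tau : penalty level and bridge exponent.
    """
    n = len(y)
    resid = y - x @ beta - f_s
    # n^{-1} * RSS  +  lambda_n * n^{-1} * sum |beta_j|^tau
    return resid @ resid / n + lam / n * np.sum(np.abs(beta) ** tau)
```

In practice this criterion would be minimized over $\beta$ for a fixed spline fit, with $\tau < 1$ making the problem non-convex, as discussed below.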

**Theorem 2.** *Assume that $\mathbf{F}$ is non-singular, $\hat{f}(s_t)$ behaves stably, and $\lambda_n n^{-1} \to \lambda_0 \ge 0$. Then, as $n \to \infty$:*

$$\hat{\beta}_{AS_n} \xrightarrow{d} \beta, \tag{56}$$

*where $\hat{\beta}_{AS_n}$ is a consistent estimator of $\beta$. The proof of this theorem is given in Appendix D. For $\lambda_n = O(n)$, $\operatorname{argmin}(\psi) = \beta$, and therefore $\hat{\beta}_{AS_n}$ is a consistent estimator.*

It should be emphasized that $\lambda_n = O(n)$ is sufficient to show the consistency of $\hat{\beta}_{AS_n}$. However, this depends on the rate of growth of $\lambda_n$. When $\lambda_n$ grows more slowly, a limiting distribution of $\sqrt{n}\left(\hat{\beta}_{AS_n} - \beta\right)$ exists. It is clear from Theorem 2 that the mean of the limiting distribution of $\sqrt{n}\left(\hat{\beta}_{AS_n} - \beta\right)$ converges to zero due to the consistency of $\hat{\beta}_{AS_n}$. In addition, its asymptotic variance can be obtained based on conditions (i) and (iv) as $\sigma^2\mathbf{F}^{-1}$. Accordingly, the asymptotic distribution of the semiparametric AS estimator is written as:

$$\Theta = \sqrt{n}\left(\hat{\beta}_{AS_n} - \beta\right) \xrightarrow{d} N\left[0,\ \sigma^2\mathbf{F}^{-1}\right]. \tag{57}$$

However, the limiting distribution depends on whether $\tau < 1$ or $\tau \ge 1$. In the context of this paper, Theorem 3 gives the limiting distribution of $\hat{\beta}_{AS_n}$ when $\tau < 1$.

**Theorem 3.** *Assume that $\tau < 1$ and $\lambda_n/n^{\tau/2} \to \lambda_0 \ge 0$. Then:*

$$\Theta = \sqrt{n}\left(\hat{\beta}_{AS_n} - \beta\right) \xrightarrow{d} \operatorname{argmin}(\xi), \tag{58}$$

*where $\xi(\theta) = -2\theta^T\mathbf{F} + \theta^T\mathbf{F}\theta + \lambda_0\sum_{j=1}^{p}\left|\theta_j\right|^{\tau} I\left(\beta_j = 0\right)$. The proof of Theorem 3 is given in Appendix E.*

#### **6. Simulation Study**

In this section, a simulation study was conducted to inspect the finite-sample behaviors and performances of the two semiparametric estimators $(\hat{\alpha}_{BS}, \hat{\beta}_{BS})$ and $(\hat{\alpha}_{AS}, \hat{\beta}_{AS})$ under right-censored time series. These estimators were then compared through the quality measurements given in Section 4. The simulation scenarios are designed as follows:


**Algorithm 2.** Generation of censoring variable $C_t$.

**Input:** Completely observed $Z_t$
**Output:** Right-censored dependent variable $Y_t$
1: For a given censoring level (CL), produce $\delta_t = I(Z_t \leq C_t)$ from the binomial distribution
2: **for** ($t$ in 1 to $n$)
3: **if** ($\delta_t = 0$)
4: **while** ($Z_t \leq C_t$)
5: generate $C_t \sim N\left(\mu_Z, \sigma_Z^2\right)$
6: **else**
7: $C_t = Z_t$
8: **end** (for loop in step 2)
9: **for** ($t$ in 1 to $n$)
10: **if** ($Z_t \leq C_t$)
11: $Y_t = Z_t$
12: **else**
13: $Y_t = C_t$
14: **end** (for loop in step 9)
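The steps of Algorithm 2 can be sketched in Python as follows (a minimal illustration; the names and the use of NumPy are our own assumptions, and the binomial draw in step 1 is taken as an independent Bernoulli$(1-\text{CL})$ per observation):

```python
import numpy as np

def algorithm2(z, cl, seed=None):
    """Right-censor a fully observed series z at approximate level cl."""
    rng = np.random.default_rng(seed)
    n = len(z)
    mu_z, sigma_z = z.mean(), z.std()
    # Step 1: delta_t = I(Z_t <= C_t), with P(delta_t = 0) = CL
    delta = rng.binomial(1, 1.0 - cl, size=n)
    c = z.copy()  # delta_t = 1  ->  C_t = Z_t (observation kept)
    for t in range(n):
        if delta[t] == 0:
            # Steps 4-5: redraw C_t ~ N(mu_Z, sigma_Z^2) until it censors Z_t
            c[t] = rng.normal(mu_z, sigma_z)
            while z[t] <= c[t]:
                c[t] = rng.normal(mu_z, sigma_z)
    # Steps 9-14: Y_t = Z_t if uncensored, C_t otherwise
    return np.where(z <= c, z, c), delta
```

The returned pair `(y, delta)` gives the right-censored series and the censoring indicators used in the synthetic-data transformation.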


For each CL in the simulation experiments, we generated 1000 random samples for size *n* = 50, 100, and 200.

The results of the simulation study were divided into three parts for parametric components, nonparametric components, and overall model performance. Accordingly, the outcomes of the estimated models, comparative results, and corresponding comments are given together in the following tables and figures. To understand the simulated datasets and the scenarios, examples of some of the simulation configurations are given in Figure 1. Panel (a) shows the dataset for a small sample size and low censorship. Panel (b) shows the case in which the censoring level is very high. Panels (c) and (d) indicate the cases of medium and large sample sizes with censoring levels of 20% and 40%, respectively.

#### *6.1. Assessing the Parametric Component*

In this section, the performances of the two methods were compared in terms of the parametric components of the right-censored semiparametric linear models generated by the simulation. It should also be noted that in this simulation study, 54 different configurations were analyzed to provide a broad perspective on the adequacy of each method. The results for the parametric components in the simulation study are displayed in Table 1 and Figure 2. Note that bold scores indicate the best (minimum) scores.

From a careful inspection of Table 1, it can be seen that the behaviors of the BS and AS change noticeably in different scenarios. Let us look at low and medium CLs for *n* = 50; under these conditions, the BS has remarkable superiority over the AS. This can be interpreted as the BS fitting the data better when the data structure is less distorted by censorship. However, for *CL* = 40%, which means the data are heavily censored, the AS method gives better scores.

As the sample size increases, the bias and variance values of the two methods become closer; nevertheless, the AS provides more efficient performance in estimating the parametric component. Regarding the parametric component, it should be emphasized that the AS behaves as expected and gives the best scores in cases of heavy censorship.

In general, the best scores for each method can be evaluated in terms of the bias and variance results. When we examine the bias results of the regression coefficients, the AS method gives the best score in only 12 out of 27 configurations, while the BS method does so in 15. However, regarding the variances, the AS gives the best score in 18 of 27 configurations, while the BS is superior in only 9. In Figure 2, Panels (a–c) show the calculated biases for each simulation repetition for all cases when the sample size is small, medium, and large.

**Figure 1.** Some of the datasets generated using Algorithm 2 including both fully observed and censored data points for different censoring levels and sample sizes.



The bolded values indicate the best scores.

#### *6.2. Evaluating the Nonparametric Component*

As in the case of parametric components, we constructed 1000 estimates of the regression function *f*(.), which is the nonparametric component of model (1). For each method, 1000 replications were carried out, and the estimated bias, variance and *RMSE* values were computed for each estimator. This section is designed to show the simulated results related to the nonparametric component.
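The bias, variance, and RMSE summaries reported in this section can be computed from the stacked replications along the following lines (a hypothetical helper, not the authors' code):

```python
import numpy as np

def replication_summary(estimates, truth):
    """Pointwise bias, variance, and RMSE over simulation replications.

    estimates : (R, n) array, one row per replication (e.g., f_hat(s_t));
    truth : (n,) true values being estimated.
    """
    mean_est = estimates.mean(axis=0)
    bias = mean_est - truth                       # E[f_hat] - f
    var = estimates.var(axis=0)                   # Var[f_hat]
    rmse = np.sqrt(((estimates - truth) ** 2).mean(axis=0))
    return bias, var, rmse
```

Averaging these pointwise quantities over $s_t$ gives single summary numbers of the kind reported in Table 2.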

The results in Table 2 show that the AS method proves its efficiency for the estimation of the nonparametric component when time series data are moderately to heavily censored. On the other hand, for *CL* = 5%, the BS method gives better results for all sample sizes according to our evaluation metrics. One of the main reasons for this is that the BS adapts to the knots more than the AS. Consequently, when the data points are manipulated by censorship, these knots force the BS to make inefficient estimates. At this point, the knot determination of the AS, based on the weights given in Equation (24), diminishes the effect of the censorship. That is why the AS method performs better under moderately and heavily censored time series data.

**Figure 2.** Boxplots of bias values for both the AS and BS methods for all configurations. In the x-axis, b1, b2, and b3 denote *β*1, *β*2, and *β*3; A1, A2, and A3 denote biases obtained from the AS method for CLs of 5%, 20%, and 40%. Similarly, B1, B2, and B3 denote biases for the BS method, when CLs are 5%, 20%, and 40%.


**Table 2.** Outcomes from the fitted nonparametric components.

The bolded values indicate the best scores.

Figure 3, consisting of four panels (a)–(d), illustrates the performance of the AS and BS methods in nonparametric curve estimation for different simulation configurations. Panel (a) shows the estimated curves for a small sample size and a medium censoring level. Similarly, Panel (b) shows the case of a medium sample size and a high censoring level. Panel (c) indicates the estimated curves for a small sample size and a low censoring level. Finally, Panel (d) shows the estimated curves when the sample size is large and the censoring level is medium. When Panels (a) and (c) are analyzed comparatively, the effect of the censoring level can be seen. At first glance, the distortion of both curves is noticeable. However, the BS method is insufficient to represent censored time series compared to the AS method. In addition, Panel (b) shows that when data are heavily censored, the BS curve is drawn towards the x = 0 line, due to the presence of zero values in the synthetic response variable. Finally, Panel (d) indicates that although the time series contains censored data points, the quality of the estimates from both the AS and BS methods improves as the sample size increases.

**Figure 3.** Data points, real regression functions, and curves fitted by two methods. In the legend of the plots, f(A) and f(B) represent function estimates obtained from the AS and BS methods, respectively.

#### *6.3. Assessing the Performances of Methods*

This section involves the results for the overall model estimations obtained from the AS and BS methods. Although results are given for the parametric and nonparametric components in the previous sections, a separate review of the whole model estimation is required for a sound comparison. Accordingly, the performance scores for *MAPE*, *MedAE*, and *GMSE* are given in Table 3, and Figure 4 is drawn to illustrate the *RGMSE* values.
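For reference, the first two of these measures can be computed as follows (a sketch; *GMSE* and *RGMSE* follow the definitions given in Section 4 and are omitted here):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error; assumes no zero values in y."""
    return np.mean(np.abs((y - y_hat) / y)) * 100.0

def medae(y, y_hat):
    """Median absolute error: robust to the extreme residuals that
    censoring can induce, which is why it is emphasized in the text."""
    return np.median(np.abs(y - y_hat))
```
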


**Table 3.** The values of performances from the AS and BS methods.

The bolded values indicate the best scores.

**Figure 4.** 360◦ bar chart for the *RGMSE*s of all simulation combinations.

When Table 3 is examined, it can be seen that the results obtained for the model estimates differ slightly from the previous results, as expected. The total error accumulated from the estimation of the parametric and nonparametric components is one of the reasons for this discrepancy. In addition, considering the situations where the two methods produce extremely similar scores, this difference can be understood better. Note that the AR(1) model shows poor performance, owing to its parametric and linear structure. For the large sample size (n = 200), the scores of the models are close to each other; nevertheless, it is clearly seen that the AS and BS methods are much better at estimating right-censored time series.

As can be seen from the bolded scores, the AS method generally performs better. From Table 3, it can be seen that the *MAPE* values obtained by BS are better for *n* = 50. However, as mentioned earlier, in this study, the *MedAE* criterion, which is not frequently used for time series data, is used to measure the durability of the predictions. When the scores of this criterion are examined, it is understood that, as stated from the beginning of the study, the BS method has more successful estimates under low censorship levels, but the AS method is superior for medium and high censorship levels.

Figure 4 includes the *RGMSE* scores for both the AS and BS methods that are formed by the ratio of the *GMSE* values of each method. In Figure 4, the difference between the qualities of the estimates is clearly very small for *CL* = 5%. However, the difference becomes more significant for *CL* = 20% and *CL* = 40%. Note that for *CL* = 5%, the BS method gives smaller ratio values, which confirms the results given in Table 3. As stated before, the AS method is demonstrably superior at higher censorship levels, which can be seen in Figure 4 for all sample sizes.

#### **7. Real-World Data**

This section is designed to show how the newly introduced semiparametric AS estimator and the benchmark BS method behave with a real right-censored time series dataset. For this purpose, we consider unemployment duration data involving the monthly unemployment duration rates for Turkey between 2004 and 2019; this dataset is available at https://ec.europa.eu/eurostat/databrowser/view/UNE\_RT\_M\_\_custom\_1635127/default/table?lang=en. In the dataset, the last three months of 2004 and the last three months of 2019 cannot be observed correctly. Therefore, these data points can be censored from the right by the detection limit zero, because none of the data points are negative. Accordingly, the introduced semiparametric methods, AS and BS, can be used for this time series analysis. In addition, as in the simulation study, the results of the AR model are given in the following tables. However, differently from the simulation study, an AR(2) model was used for the real data study, because the optimal lag value is determined as *lag* = 2 from Table 4. Before the modelling procedure, the stationarity of the time series data was tested with the augmented Dickey–Fuller (ADF) test, and the suitable lag was determined under the null hypothesis $H_0: y_t \text{ is non-stationary}$. The test results are given in Table 4 below:
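The ADF regression behind Table 4 can be sketched as follows (a minimal NumPy implementation of the test statistic only; function and argument names are our own, the statistic must be compared against standard Dickey–Fuller critical values, and a full analysis would use a dedicated routine such as statsmodels' `adfuller`):

```python
import numpy as np

def adf_stat(y, lag):
    """t-statistic on the unit-root coefficient in the ADF regression.

    Regresses dy_t on an intercept, y_{t-1}, and `lag` lagged differences;
    a value well below about -2.87 (5% level, constant-only case) rejects
    the non-stationarity null.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    n = len(dy) - lag
    cols = [np.ones(n), y[lag:-1]]            # intercept, y_{t-1}
    for i in range(1, lag + 1):
        cols.append(dy[lag - i:-i])           # lagged differences
    X = np.column_stack(cols)
    z = dy[lag:]
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])       # t-stat of y_{t-1}
```
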

**Table 4.** Augmented Dickey–Fuller (ADF) test results for the stationarity of time series data and the determination of the appropriate lag.


Bold scores are significant at the 95% confidence level.

Table 4 shows that the second lag for this time series is suitable for the modelling. From this information, the semiparametric time series model can be given by:

$$UED_t = \beta_1 UED_{t-1} + \beta_2 UED_{t-2} + f(s_t) + \varepsilon_t, \quad t = 1, \ldots, 186, \tag{59}$$

where the $UED_t$ represent the dependent time series of the unemployment duration ratio, $UED_{t-1}$ and $UED_{t-2}$ denote the first and second lags of the dependent series $UED_t$ that are used as covariates, $s_t = (1, \ldots, n)^T$ denotes the seasonality, and finally, the $\varepsilon_t$ are the stationary autoregressive error terms given in Equation (2). The estimation of model (59) is realized by both the AS and BS methods, and the results are presented in Tables 5 and 6 and Figure 5.

**Table 5.** The performances of the BS and AS methods for the estimation of both parametric and nonparametric components.


The bolded values indicate the best scores.


**Table 6.** Scores of performance measures for the AS and BS methods obtained from the whole model estimation.

The bolded values indicate the best scores.

**Figure 5.** Estimated curves for the seasonality *f*(*st*) obtained from the AS and BS methods.

Table 5 gives the bias and variance values for the estimated regression coefficients $\hat{\beta} = \left(\hat{\beta}_1, \hat{\beta}_2\right)^T$ and $\hat{\alpha} = \left(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{q+k+1}\right)^T$. Accordingly, the AS method gives smaller bias and variance values than the BS method regarding $\hat{\beta}$. Moreover, the AS method has better bias values for $\hat{\alpha}$, but the BS method gives smaller variance values for $\hat{\alpha}$ than the AS method. Overall, the AS and BS methods give similar values, because the data properties are $n = 186$ and $CL = 8.1\%$. Thus, it can be seen that the results for the unemployment duration data confirm the simulation outputs.

In addition, it should be noted that the outcomes obtained from the estimated model (59) are given in Table 6, together with *RMSE* scores for the estimated nonparametric function $f(s_t)$. Upon close inspection, it is clearly seen from the results that the AS method produces the best scores. It should be emphasized that the largest difference between the methods regarding the performance criteria is in *MedAE*, which indicates the strength of the AS method under censorship. Table 6 also indicates that the results of the AR(2) model are worse than those of the other two methods, as in the simulation study. Note that because the sample size of the real data, $n = 186$, is close to the simulation configurations with $n = 200$, the scores are relatively close to each other. Figure 5 is given to compare the AS and BS methods in representing data under censorship.

As can be seen in Figure 5, the estimated curves are quite similar due to the data properties of a large sample size and a low CL. The effect of synthetic data manipulation is obvious in the figure with zero values. Like the simulation study, the BS method is affected by these zero values more than the AS method. The reason for this is that the knots of the AS method are determined by iteratively calculated weights. Therefore, the optimal knot sequence diminishes the effect of censorship.

#### **8. Concluding Remarks**

This paper demonstrated the estimation of right-censored time series data using a newly introduced semiparametric AS estimator, with the BS method as a benchmark for comparison. The results obtained from both a simulation study and a real data example showed that the introduced AS method achieves superior modelling of right-censored time series data in a semiparametric context. Comparative outcomes also support that the AS method provides better performance scores than the BS method in most simulation configurations and in the real data example. The most important factor in the success of the AS method is its adaptive nature, based on iteratively calculated weights. In the AS method, the weights are responsible for determining and controlling the penalty term and, correspondingly, for obtaining the optimal knot points. Accordingly, our findings showed that the proposed method provides an advantage in modelling right-censored time series over the benchmark.

The simulation study examined the performance of the methods in three parts: the outcomes for the estimated parametric component (Table 1 and Figure 2), the nonparametric component (Table 2 and Figure 3), and the whole semiparametric model (Table 3 and Figure 4). The unemployment data estimation was evaluated for bias and variance (Table 5) and by the criteria *MAPE*, *MedAE*, *GMSE*, and *RGMSE* (Table 6). Given the outcomes of the simulation study and the real data example, our general and detailed conclusions are as follows:


Finally, as can be understood from the whole paper, the AS method is superior to the BS method for estimating right-censored time series, in both theory and practice.

**Author Contributions:** Conceptualization, S.E.A. and D.A.; methodology, S.E.A. and D.A.; software, D.A. and E.Y.; validation, S.E.A., D.A. and E.Y.; formal analysis, D.A. and E.Y.; investigation, D.A. and E.Y.; resources, S.E.A. and D.A.; data curation, E.Y.; writing—original draft preparation, S.E.A., D.A. and E.Y.; writing—review and editing, S.E.A., D.A. and E.Y.; visualization, E.Y.; supervision, S.E.A.; funding acquisition, S.E.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** Professor Ahmed's research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** We consider unemployment duration data involving the monthly unemployment duration rates for Turkey between 2004 and 2019; this dataset is available at https://ec.europa.eu/eurostat/databrowser/view/UNE\_RT\_M\_\_custom\_1635127/default/table?lang=en.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Proof of Lemma 1.** Lemma 1 can be ensured based on the common censorship assumption that *Zt* and *Ct* are independent. From that, the proof can be written as follows:

$$\begin{aligned} E\left[Y_{tG}\mid \mathbf{x}, s\right] &= E\left[\frac{\delta_t Z_t}{1-G(Z_t)} \,\middle|\, \mathbf{x}, s\right] = E\left[\frac{\delta_t Z_t}{\bar{G}(Z_t)} \,\middle|\, \mathbf{x}, s\right] = E\left[\frac{I(Z_t \le C_t)\min(Z_t, C_t)}{\bar{G}\left[\min(Z_t, C_t)\right]} \,\middle|\, \mathbf{x}, s\right] \\ &= E\left[I(Z_t \le C_t)\frac{Z_t}{\bar{G}(Z_t)} \,\middle|\, \mathbf{x}, s\right] = E\left[E\left[\frac{Z_t}{\bar{G}(Z_t)} I(Z_t \le C_t) \,\middle|\, Z_t, \mathbf{x}, s\right] \,\middle|\, \mathbf{x}, s\right] \\ &= E\left[\frac{Z_t}{\bar{G}(Z_t)}\bar{G}(Z_t) \,\middle|\, \mathbf{x}, s\right] = E\left[Z_t \mid \mathbf{x}, s\right]. \end{aligned} \tag{A1}$$

Thus, Lemma 1 is proven. Here, $\bar{G}(\cdot) = 1 - G(\cdot)$. Generally, the distribution $G(\cdot)$ is unknown; therefore, its Kaplan–Meier estimator $\hat{G}(\cdot)$, given in Equation (5), is used instead of $G(\cdot)$.
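A sketch of this synthetic-data transformation, with the Kaplan–Meier estimator of the censoring distribution $G$ built from the censored observations (names hypothetical; ties are handled naively):

```python
import numpy as np

def synthetic_responses(y, delta):
    """Synthetic responses Y_G = delta * Y / (1 - G_hat(Y)) (sketch).

    y : observed (possibly censored) values min(Z_t, C_t);
    delta : 1 if uncensored (Z_t <= C_t), 0 if censored.
    G_hat is the Kaplan-Meier estimator of the censoring distribution,
    so censored points (delta == 0) act as the 'events' here.
    """
    n = len(y)
    order = np.argsort(y)
    surv = 1.0               # running K-M estimate of 1 - G_hat
    g_bar = np.empty(n)      # 1 - G_hat evaluated just before each Y_(i)
    for rank, idx in enumerate(order):
        g_bar[idx] = surv
        if delta[idx] == 0:  # a censoring time was observed here
            surv *= 1.0 - 1.0 / (n - rank)
    return delta * y / g_bar
```

With no censoring the transformation leaves the data unchanged, while censored points are set to zero and uncensored points are inflated to preserve the conditional mean, as Lemma 1 states.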

#### **Appendix B**

Derivations of Equations (29) and (30).

To show the derivations of Equations (29) and (30), two equations obtained from Equation (27) are written as:

$$\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)\beta + \mathbf{X}'\mathbf{V}\mathbf{B}\alpha = \mathbf{X}'\mathbf{V}\mathbf{Y}_{\hat{G}}, \quad \mathbf{B}'\mathbf{V}\mathbf{X}\beta + \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right)\alpha = \mathbf{B}'\mathbf{V}\mathbf{Y}_{\hat{G}}. \tag{A2}$$

From Equation (A2), $\hat{\alpha}_{AS}$ can be acquired by algebraic operations:

$$\left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right)\alpha = \mathbf{B}'\mathbf{V}\mathbf{Y}_{\hat{G}} - \mathbf{B}'\mathbf{V}\mathbf{X}\beta, \quad \left(\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right)\alpha = \mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\beta\right). \tag{A3}$$

Thus, if $\beta$ is replaced by $\hat{\beta}_{AS}$, then $\hat{\alpha}_{AS}$ can be written as:

$$\hat{\alpha}_{AS} = \left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\hat{\beta}_{AS}\right). \tag{A4}$$

Thus, Equation (29) is derived. Accordingly, the derivation of $\hat{\beta}_{AS}$ can be obtained by using Equation (A2):

$$\begin{aligned}
\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)\beta + \mathbf{X}'\mathbf{V}\mathbf{B}\left[\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\left(\mathbf{Y}_{\hat{G}} - \mathbf{X}\beta\right)\right] &= \mathbf{X}'\mathbf{V}\mathbf{Y}_{\hat{G}}, \\
\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right)\beta + \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\mathbf{Y}_{\hat{G}} - \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\mathbf{X}\beta &= \mathbf{X}'\mathbf{V}\mathbf{Y}_{\hat{G}}, \\
\left[\left(\mathbf{X}'\mathbf{V}\mathbf{X}\right) - \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\mathbf{X}\right]\beta &= \mathbf{X}'\mathbf{V}\mathbf{Y}_{\hat{G}} - \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}\mathbf{Y}_{\hat{G}}.
\end{aligned} \tag{A5}$$

To simplify the calculations, let $\mathbf{A}_{AS} = \mathbf{X}'\mathbf{V}\mathbf{B}\left[\mathbf{B}'\mathbf{V}\mathbf{B} + \lambda\mathbf{K}\right]^{-1}\mathbf{B}'\mathbf{V}$. Therefore,

$$\left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]\beta = \left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{Y}_{\hat{G}}, \quad \hat{\beta}_{AS} = \left[\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V} - \mathbf{A}_{AS}\right)\mathbf{Y}_{\hat{G}}. \tag{A6}$$

The derivations of Equations (29) and (30) are thus completed.
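The closed-form computation of the two estimators derived above can be sketched as follows (argument names are illustrative; the right-hand side follows the third line of Equation (A5)):

```python
import numpy as np

def as_estimators(x, b, v, k, y_g, lam):
    """Closed-form semiparametric estimators per Appendix B (sketch).

    x : (n, p) parametric design; b : (n, m) spline basis; v : (n, n)
    weight matrix; k : (m, m) penalty matrix; y_g : (n,) synthetic
    responses; lam : smoothing parameter lambda.
    """
    s = b.T @ v @ b + lam * k
    # A_AS = X'VB [B'VB + lam K]^{-1} B'V
    a_as = x.T @ v @ b @ np.linalg.solve(s, b.T @ v)
    # beta_hat solves [(X'V - A_AS) X] beta = (X'V - A_AS) Y_G, cf. (A5)-(A6)
    beta_hat = np.linalg.solve((x.T @ v - a_as) @ x, (x.T @ v - a_as) @ y_g)
    # alpha_hat from Equation (A4)
    alpha_hat = np.linalg.solve(s, b.T @ v @ (y_g - x @ beta_hat))
    return beta_hat, alpha_hat
```

With $\lambda = 0$ and a noiseless response lying in the column space of $[\mathbf{X}\ \mathbf{B}]$, the true coefficients are recovered exactly, which is a useful sanity check on the algebra.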

#### **Appendix C**

**Proof of Theorem 1.** To validate Theorem 1, the necessary equations are given by:

$$\sup_{\hat{\alpha}_{AS_n}\in Q}\left|M_n\left(\hat{\alpha}_{AS_n}\right) - M\left(\hat{\alpha}_{AS_n}\right) - \sigma_{\varepsilon}^2\right| \xrightarrow{p} 0, \tag{A7}$$

where $\sigma_{\varepsilon}^2$ is the variance of the model defined in Equation (7) and $Q$ is a compact set in a metric space. By using Equations (54)–(57), it can be seen that:

$$\hat{\alpha}_{AS_n} \to \alpha, \text{ as } n \to \infty. \tag{A8}$$

See [33] for more details.

#### **Appendix D**

**Proof of Theorem 2.** Under the ensured regularity conditions (i)–(iv), $\operatorname{plim}\left(\hat{\beta}_{AS_n}\right)$ is written as follows:

$$\begin{aligned}
\operatorname{plim}\left(\hat{\beta}_{AS_n}\right) &= \beta + \operatorname{plim}\left(n^{-1}\left[\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\mathbf{f}\right) + \operatorname{plim}\left(n^{-1}\left[\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\mathbf{X}\right]^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\boldsymbol{\varepsilon}\right) \\
&= \beta + \operatorname{plim}\left\{\left(n^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\mathbf{X}\right)^{-1}\right\}\operatorname{plim}\left\{n^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\left[\mathbf{f}+\boldsymbol{\varepsilon}\right]\right\}.
\end{aligned} \tag{A9}$$

Because $\mathbf{f}$ can be counted as a nuisance parameter, and from assumptions (i) and (ii), $\operatorname{plim}\left\{\left(n^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\mathbf{X}\right)^{-1}\right\} = \mathbf{F}_n^{-1}$ and $\operatorname{plim}\left\{n^{-1}\left(\mathbf{X}'\mathbf{V}-\mathbf{A}_{AS}\right)\left[\mathbf{f}+\boldsymbol{\varepsilon}\right]\right\} = o(1)$. Therefore, the second term on the right-hand side of Equation (A9) goes to zero. Thus, it is obtained that:

$$\operatorname{argmin}(\psi_n) \xrightarrow{p} \operatorname{argmin}(\psi), \quad \hat{\beta}_{AS_n} \xrightarrow{d} \beta. \tag{A10}$$

Note that the results obtained above are for $\tau \ge 1$, which means $\psi_n$ has a convex structure (see [34,35]). However, the proposed AS estimator includes the case of $\tau < 1$, so that $\psi_n$ is not convex. In this case, Equation (A10) is processed differently as:

$$\psi_n\left(\beta_{AS_n}, f(s_t)\right) > n^{-1}\sum_{t=1}^{n}\left[Y_t - \mathbf{x}_t\beta_{AS_n} - f(s_t)\right]^2 = \psi_n^{(0)}\left(\beta_{AS_n}, f(s_t)\right). \tag{A11}$$

Note that Equation (A11) is validated for all $\hat{\beta}_{AS_n}$. Moreover, $\operatorname{argmin}(\psi_n) = O_p(1)$, because $\psi_n^{(0)} = O_p(1)$.

#### **Appendix E**

**Proof of Theorem 3.** To prove Theorem 3, some more involved expressions are needed for the minimization criterion $\xi$, owing to the non-convex structure when $\tau < 1$. These are given by:

$$\xi_n(\theta) = \sum_{t=1}^{n}\left[\left(\varepsilon_t - \frac{\theta^T\mathbf{x}_t}{\sqrt{n}}\right)^2 - \varepsilon_t^2\right] + \lambda_n\sum_{j=1}^{p}\left[\left|\beta_j + \frac{\theta_j}{\sqrt{n}}\right|^{\tau} - \left|\beta_j\right|^{\tau}\right]. \tag{A12}$$

Because $\lambda_n = O\left(n^{\tau/2}\right) = o\left(\sqrt{n}\right)$, the following expression is obtained:

$$\lambda_n\sum_{j=1}^{p}\left[\left|\beta_j + \frac{\theta_j}{\sqrt{n}}\right|^{\tau} - \left|\beta_j\right|^{\tau}\right] \xrightarrow{d} \lambda_0\sum_{j=1}^{p}\left|\theta_j\right|^{\tau} I\{\beta_j = 0\}. \tag{A13}$$

Then the convergence is realized as follows:

$$\operatorname{argmin}(\xi_n) \xrightarrow{d} \operatorname{argmin}(\xi). \tag{A14}$$

Thus, the proof is finished. It is important to note that, for $\tau < 1$, the non-zero regression coefficients of the model can be estimated without asymptotic bias, provided that the zero coefficients are shrunk to zero with positive probability.

#### **References**

