*Article* **Robust Variable Selection Based on Penalized Composite Quantile Regression for High-Dimensional Single-Index Models**

**Yunquan Song \*, Zitong Li and Minglu Fang**

College of Science, China University of Petroleum, Qingdao 266580, China; slofzt\_l@163.com (Z.L.); pdmath@163.com (M.F.)

**\*** Correspondence: syqfly1980@163.com

**Abstract:** The single-index model is an intuitive extension of the linear regression model. It has been increasingly popular due to its flexibility in modeling. In this work, we focus on the estimators of the parameters and the unknown link function for the single-index model in a high-dimensional situation. The SCAD and Laplace error penalty (LEP)-based penalized composite quantile regression estimators, which could realize variable selection and estimation simultaneously, are proposed; a practical iterative algorithm is introduced to obtain the efficient and robust estimators. The choices of the tuning parameters, the bandwidth, and the initial values are also discussed. Furthermore, under some mild conditions, we show the large sample properties and oracle property of the SCAD and Laplace penalized composite quantile regression estimators. Finally, we evaluated the performances of the proposed estimators by two numerical simulations and a real data application.

**Keywords:** single-index models; composite quantile regression; SCAD; Laplace error penalty (LEP)

**MSC:** 62F12; 62G08; 62G20; 62J07

#### **1. Introduction**

As a generalized regression model, the single-index regression model has a wide range of applications in finance, economics, biomedicine, and other fields. The single-index regression model not only avoids the so-called "curse of dimensionality" of nonparametric models, but also significantly improves the efficiency of model estimation and reveals the relationship between the response variable and the high-dimensional covariates, thereby retaining the good interpretability of parametric models and the flexibility of nonparametric models simultaneously [1–5]. However, the single-index regression model also inherits the shortcomings of classical regression models. For example, in practical applications, especially under heavy-tailed error distributions, the bounded error variance that a single-index regression model requires is difficult to satisfy. Moreover, in the mean regression setting, the estimation results of a single-index regression model are very sensitive to extreme values. To overcome these drawbacks, robust regression methods are necessary when fitting the single-index regression model to real data.

**Citation:** Song, Y.; Li, Z.; Fang, M. Robust Variable Selection Based on Penalized Composite Quantile Regression for High-Dimensional Single-Index Models. *Mathematics* **2022**, *10*, 2000. https://doi.org/10.3390/math10122000

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 23 March 2022 Accepted: 23 May 2022 Published: 10 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Compared with mean regression, the quantile regression proposed by [6] can measure the effect of the explanatory variables not only on the center of the distribution but also on the upper and lower tails of the response variable. Quantile regression (QR) estimation, which is not restricted by the error distribution and can effectively avoid the impact of outliers, is more robust than least squares estimation. Furthermore, in order to utilize the information at different quantiles adequately, composite quantile regression (CQR) was proposed by [7]. [8] added the SCAD-L2 penalty to the loss function and proposed a robust variable selection method based on weighted composite quantile regression (WCQR), which makes variable selection insensitive to high-leverage points and outliers. In this article, we study the estimation and variable selection of the single-index quantile regression model, which is specified in the following form

$$Y = g(\mathbf{X}^{\top}\boldsymbol{\gamma}) + \varepsilon, \tag{1}$$

where $Y$ is the response variable, $\mathbf{X}$ is a $d$-dimensional covariate vector, $\boldsymbol{\gamma}$ is an unknown parameter vector, $g(\cdot)$ is an unknown link function, and $\varepsilon$ is the random error whose $\tau$th conditional quantile is zero, i.e., $P(\varepsilon \le 0 \mid \mathbf{X}) = \tau$. For identifiability, we assume that $\|\boldsymbol{\gamma}\| = 1$ and that the first component of $\boldsymbol{\gamma}$ is positive, where $\|\cdot\|$ denotes the Euclidean norm.
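This identifiability convention can be enforced on any candidate coefficient vector by a simple normalization; a minimal sketch in Python (the function name is ours, for illustration only):

```python
import numpy as np

def normalize_index(gamma):
    """Rescale gamma to unit Euclidean norm and flip its sign so that the
    first component is positive, matching the identifiability convention."""
    gamma = np.asarray(gamma, dtype=float)
    gamma = gamma / np.linalg.norm(gamma)
    if gamma[0] < 0:
        gamma = -gamma
    return gamma
```

Because $g$ is unknown, $(\boldsymbol{\gamma}, g)$ and $(c\boldsymbol{\gamma}, g(\cdot/c))$ give the same model; fixing the norm and the sign removes this ambiguity.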

There are two estimation problems for the single-index quantile regression model: the estimation of the parameters and the estimation of the link function. The study of estimation for single-index quantile regression models began with [9], which generalized the average derivative method. Meanwhile, [10] proposed a simple algorithm for quantile regression in single-index models and proved the asymptotic properties of the estimators. [3] proposed D-vine copula-based quantile regression, a new algorithm that does not require an accurate assumption about the shape of the conditional quantiles and avoids typical defects of linear models, such as multicollinearity. [11] proposed a non-iterative composite quantile regression (NICQR) estimation algorithm for the single-index quantile regression model, which has high computational efficiency and is suitable for analyzing massive data sets.

In real data, the model is often sparse. The covariates inevitably include a few irrelevant and unnecessary variables when modeling real data, which can degrade the efficiency of the resulting estimation procedure and increase the complexity of the model. In the case of linear models, many authors have considered variable selection via penalized least squares, which allows simultaneous selection of variables and estimation of regression parameters. Several penalty functions, including the SCAD [12], the adaptive LASSO [13], and the adaptive elastic net [14], have been shown to possess favorable theoretical properties: unbiasedness, sparsity, and continuity, which are regarded as the basic properties that a good estimator should enjoy [15]; estimators based on these penalties also enjoy the oracle property. [5] combined the SCAD penalty variable selection method with LM-ANN for modeling, making good use of the advantages of SCAD in dimension reduction and the efficiency of LM-ANN in modeling nonlinear relationships.

Similar to the linear regression model, the set of predictors for the single-index quantile regression model can contain a large number of irrelevant variables. Therefore, it is important to select the relevant variables when fitting the single-index quantile regression model. However, the problem of variable selection for the high-dimensional single-index quantile regression model is not well settled in the literature. In recent years, many significant research results have emerged on the variable selection problem of the single-index quantile regression model. [16] proposed a non-iterative estimation and variable selection method for the single-index quantile regression model; the initial value and the weight of the penalty function were obtained via an inverse regression technique, which is the key to this method. [17] combined least absolute deviations (LAD) and SCAD for single-index models. However, we note that SCAD is a piecewise continuous spline function. Because of this structure, different splines require different derivative formulas, so different derivative formulas must be matched to different splines when carrying out the penalty optimization. This adds to the programming complexity. Therefore, [18] proposed a continuous, bounded, smooth penalty function, the Laplace error penalty (LEP), that does not have a piecewise spline structure, and proved its oracle property. The LEP is infinitely differentiable everywhere except at the origin and is therefore much smoother than SCAD. Furthermore, the LEP can approximate the *L*<sup>0</sup> penalty, which is viewed as the optimal penalty, arbitrarily closely. Moreover, the LEP yields a convex objective function for optimization under mild conditions, so that it is easier to compute and to obtain a unique optimal solution with the desired properties.

In this paper, we combine the composite quantile regression method with the SCAD penalty and the Laplace error penalty to construct two sparse estimators for the single-index quantile regression model. Our method realizes variable selection and parameter estimation simultaneously. In addition, we prove that the proposed estimators have large sample properties, including $\sqrt{n}$-consistency and the oracle property. A simulation study shows that our method has some resistance to heavy-tailed errors and outliers and attains higher accuracy in parameter estimation.

The rest of this paper is organized as follows. In Section 2, the SCAD penalized composite quantile regression and the Laplace penalized composite quantile regression for single-index models are introduced; furthermore, an iterative algorithm for the single-index model is described, and the selections of the bandwidth, tuning parameters, and initial values are discussed. In Section 3, we state the large sample properties of the SCAD and Laplace penalized composite quantile estimators for single-index models. In Section 4, we illustrate our method and algorithm with two numerical simulations and real data. Section 5 contains some concluding remarks. Technical proofs and the algorithm based on the LEP are relegated to Appendix A and Appendix B, respectively.

#### **2. Problem Setting and Methodology**

#### *2.1. Composite Quantile-SCAD Method for Single-Index Models*

We assume $\{(\mathbf{X}_i, Y_i), i = 1, 2, \ldots, n\}$ are $n$ independent samples from the single-index model (1). Note that there are two estimation problems: the estimation of the parameter vector $\boldsymbol{\gamma}$ and the estimation of the link function $g(\cdot)$. Given an accurate estimate of $\boldsymbol{\gamma}$, the link function $g(\cdot)$ can be locally approximated by a linear function

$$g(\mathbf{X}^{\top}\boldsymbol{\gamma}) \approx g(u) + g'(u)(\mathbf{X}^{\top}\boldsymbol{\gamma} - u) = a + b(\mathbf{X}^{\top}\boldsymbol{\gamma} - u),\tag{2}$$

for $\mathbf{X}^{\top}\boldsymbol{\gamma}$ in the neighborhood of $u$, where $a = g(u)$ and $b = g'(u)$ are local constants. Namely, based on an accurate estimate of $\boldsymbol{\gamma}$, we can obtain good local linear estimates of $g(u)$ and $g'(u)$, namely $\widehat{g}(u) = \widehat{a}$ and $\widehat{g}'(u) = \widehat{b}$. So our main interest is to estimate the parameter vector. Following [19], we adopt the MAVE estimate of $\boldsymbol{\gamma}$, which is obtained by solving the minimization problem

$$\min_{a,b,\|\gamma\|=1} \sum_{j=1}^{n} \sum_{i=1}^{n} [Y_i - a_j - b_j(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})]^2 w_{ij}, \tag{3}$$

where $w_{ij} = k_h(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma}) / \sum_{l=1}^{n} k_h(\mathbf{X}_l^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})$, $a = (a_1, a_2, \ldots, a_n)^{\top}$, $b = (b_1, b_2, \ldots, b_n)^{\top}$, $k_h(\cdot) = k(\cdot/h)/h$, $k(\cdot)$ is a symmetric kernel function, and $h$ is the bandwidth. [20] combined the MAVE method and the LASSO to obtain a sparse estimate (sim-lasso) of $\boldsymbol{\gamma}$ by solving the following minimization problem

$$\min_{a,b,\|\gamma\|=1} \sum_{j=1}^{n} \sum_{i=1}^{n} [Y_i - a_j - b_j(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})]^2 w_{ij} + \lambda \sum_{j=1}^{n} | b_j | \sum_{k=1}^{d} | \gamma_k |, \tag{4}$$

where $\lambda$ is a nonnegative penalty parameter. Note that the above target loss function is the squared loss of the least squares (LS) method; naturally, the LAD criterion extends to the single-index model as an alternative to the LS method. [17] combined LAD with SCAD to construct a sparse estimator of $\boldsymbol{\gamma}$ by solving the following minimization problem

$$\min\_{a,b,||\gamma||=1} \sum\_{j=1}^{n} \sum\_{i=1}^{n} \mid Y\_i - a\_j - b\_j(\mathbf{X}\_i^\top \gamma - \mathbf{X}\_j^\top \gamma) \mid w\_{ij} + \sum\_{j=1}^{n} \mid b\_j \mid \sum\_{k=1}^{d} p\_\lambda(\mid \gamma\_k \mid),\tag{5}$$

where $p_{\lambda}(\cdot)$ is the SCAD penalty function proposed by [12]; it is defined in terms of its first-order derivative. For $\theta > 0$,

$$p'\_{\lambda}(\theta) = \lambda \{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)\_+}{(a-1)\lambda} I(\theta > \lambda) \},\tag{6}$$

where $a > 2$, $\lambda$ is a nonnegative penalty parameter, and the notation $Z_+$ stands for the positive part of $Z$. The LAD is a special case of quantile regression, corresponding to the quantile 1/2. This motivates us to generalize composite quantile regression to single-index models. Combining the SCAD penalty function with composite quantile regression, we obtain a sparse estimate $\widehat{\boldsymbol{\gamma}}^{qr.sim.scad}$ of the parameter $\boldsymbol{\gamma}$ as the solution to the following minimization problem

$$\min_{a,b,\|\gamma\|=1} \sum_{j=1}^{n} \sum_{q=1}^{Q} \sum_{i=1}^{n} \rho_{\tau_q}[Y_i - a_j - b_j(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})] w_{ij} + \sum_{j=1}^{n} | b_j | \sum_{k=1}^{d} p_{\lambda}(| \gamma_k |), \tag{7}$$

where $\tau_q = \frac{q}{Q+1} \in (0,1)$ stands for the $\tau_q$-quantile level, $q = 1, 2, \ldots, Q$, with $Q$ the number of quantiles, and $\rho_{\tau_q}(z) = \tau_q z \cdot I_{[0,\infty)}(z) - (1-\tau_q) z \cdot I_{(-\infty,0)}(z)$ is the $\tau_q$-quantile loss function. In addition, we assume that the $\tau_q$-quantile of the random error $\varepsilon$ is 0; thus, $g(\mathbf{X}^{\top}\boldsymbol{\gamma})$ is the conditional $\tau_q$-quantile of the response variable $Y$. We denote the target function in (7) by $Q^{S}_{\lambda}(a, b, \boldsymbol{\gamma})$.
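For concreteness, the check loss $\rho_{\tau}$ and the equally weighted composite loss over the levels $\tau_q = q/(Q+1)$ can be sketched as follows (a minimal illustration of the loss only, not the full penalized objective):

```python
import numpy as np

def check_loss(z, tau):
    """Quantile check loss rho_tau(z): tau*z for z >= 0 and -(1-tau)*z for z < 0."""
    z = np.asarray(z, dtype=float)
    return z * (tau - (z < 0))

def composite_quantile_loss(resid, Q=5):
    """Sum of check losses over the Q quantile levels tau_q = q/(Q+1)."""
    taus = np.arange(1, Q + 1) / (Q + 1)
    return float(sum(check_loss(resid, t).sum() for t in taus))
```

Averaging information across several quantile levels is what gives CQR its robustness relative to a single-quantile (e.g., LAD) fit.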

#### *2.2. Composite Quantile–LEP Method for Single-Index Models*

The Laplace error penalty function, with two tuning parameters, was proposed by [18]. Unlike other penalty functions, it is constructed directly as a bounded smooth function rather than as a piecewise spline. Its shape is similar to that of SCAD, but it is much smoother, which prompts us to apply it to composite quantile regression for single-index models. Combining the LEP with composite quantile regression, we obtain a sparse estimate $\widehat{\boldsymbol{\gamma}}^{qr.sim.lep}$ of the parameter $\boldsymbol{\gamma}$ as the solution to the following minimization problem

$$\min_{a,b,\|\gamma\|=1} \sum_{j=1}^{n} \sum_{q=1}^{Q} \sum_{i=1}^{n} \rho_{\tau_q}[Y_i - a_j - b_j(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})] w_{ij} + \sum_{j=1}^{n} | b_j | \sum_{k=1}^{d} p_{\lambda,\kappa}(| \gamma_k |), \tag{8}$$

where $p_{\lambda,\kappa}(\cdot)$ is the LEP. For $\theta > 0$,

$$p_{\lambda,\kappa}(\theta) = \lambda \left(1 - e^{-\frac{\theta}{\kappa}}\right),\tag{9}$$

where $\lambda$ and $\kappa$ are two nonnegative tuning parameters, regularizing the magnitude of the penalty and controlling the degree of approximation to the $L_0$ penalty, respectively. This penalty function is called the Laplace penalty function because the function $e^{-\theta/\kappa}$ has the form of the Laplace density. We denote the target function in (8) by $Q^{S}_{\lambda,\kappa}(a, b, \boldsymbol{\gamma})$.
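The LEP itself is a one-line function; a sketch illustrating Eq. (9) and how a shrinking $\kappa$ pushes it toward the $L_0$ penalty $\lambda \cdot I(\theta > 0)$:

```python
import numpy as np

def lep(theta, lam, kappa):
    """Laplace error penalty p_{lambda,kappa}(theta) = lam*(1 - exp(-theta/kappa)), Eq. (9)."""
    return lam * (1.0 - np.exp(-np.asarray(theta, dtype=float) / kappa))
```

The penalty is bounded above by $\lambda$, so large coefficients incur essentially the same penalty as moderate ones, which is the source of its near-unbiasedness.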

#### *2.3. Computation*

Given an initial estimate $\widehat{\boldsymbol{\gamma}}$, the SCAD penalty function can be locally linearly approximated as follows [21]. For $\widehat{\gamma}_j \neq 0$,

$$p_{\lambda}(| \gamma_j |) \approx p_{\lambda}(| \widehat{\gamma}_j |) + p_{\lambda}'(| \widehat{\gamma}_j |)(| \gamma_j | - | \widehat{\gamma}_j |),\tag{10}$$

for $\gamma_j \approx \widehat{\gamma}_j$. Removing the irrelevant constant terms, (7) can be rewritten as

$$\min\_{\mathbf{a}, \mathbf{b}, \|\gamma\|=1} \sum\_{j=1}^{n} \sum\_{q=1}^{Q} \sum\_{i=1}^{n} \rho\_{\tau\_{\mathbf{q}}} [Y\_i - a\_j - b\_j(\mathbf{X}\_i^{\top} \gamma - \mathbf{X}\_j^{\top} \gamma)] w\_{ij} + \sum\_{j=1}^{n} |\; b\_j \mid \sum\_{k=1}^{d} p'\_{\lambda}(|\; \widehat{\gamma}\_k \mid) \mid \gamma\_k \mid. \tag{11}$$

We denote the target function in (11) by $Q^{S*}_{\lambda}(a, b, \boldsymbol{\gamma})$. Since $Q^{S*}_{\lambda}$ is invariant to the scale of $\boldsymbol{\gamma}$, when minimizing $Q^{S*}_{\lambda}$ we enforce the constraint $\|\boldsymbol{\gamma}\| = 1$ by normalizing $\boldsymbol{\gamma}$ to unit length.
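The linearization in (10)–(11) replaces the SCAD penalty with its tangent line at the current estimate, turning the penalty into a weighted $L_1$ term. A minimal sketch (the closed-form SCAD value used below is the standard antiderivative of (6), stated here for completeness):

```python
import numpy as np

def scad_pen(theta, lam, a=3.7):
    """SCAD penalty value p_lambda(theta), the antiderivative of Eq. (6), theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    flat = lam ** 2 * (a + 1) / 2.0  # constant value for theta > a*lam
    mid = (2 * a * lam * theta - theta ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(theta <= lam, lam * theta, np.where(theta <= a * lam, mid, flat))

def scad_deriv(theta, lam, a=3.7):
    """First derivative p'_lambda(theta) of SCAD, as given in Eq. (6)."""
    theta = np.asarray(theta, dtype=float)
    return lam * np.where(theta <= lam, 1.0,
                          np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam))

def lla(theta, theta0, lam, a=3.7):
    """Tangent-line (local linear) approximation of p_lambda at theta0, Eq. (10)."""
    return scad_pen(theta0, lam, a) + scad_deriv(theta0, lam, a) * (theta - theta0)
```

Because SCAD is concave on $(0,\infty)$, the tangent line majorizes the penalty, so each linearized step does not increase the original objective.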

In order to obtain an accurate estimate of *γ* and *g*(·), we introduce a new iterative algorithm. Then our estimation procedure is described in detail as follows:


Step 1. Given $\widehat{\boldsymbol{\gamma}}$, obtain $\{\widehat{a}_j, \widehat{b}_j, j = 1, 2, \ldots, n\}$ by solving, for each $j$,

$$\min_{(a_j, b_j)} \sum_{i=1}^{n} \sum_{q=1}^{Q} \rho_{\tau_q}[Y_i - a_j - b_j(\mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - \mathbf{X}_j^{\top}\widehat{\boldsymbol{\gamma}})] w_{ij} + | b_j | \sum_{k=1}^{d} p_{\lambda}'(| \widehat{\gamma}_k |) | \widehat{\gamma}_k | = \min_{(a_j, b_j)} \sum_{i=1}^{n+1} \sum_{q=1}^{Q} \rho\Big[Y_i^* - (A, B) \begin{pmatrix} a_j \\ b_j \end{pmatrix}\Big] w_{ij}^*, \tag{12}$$

where $h$ is the optimal bandwidth, $(\rho, Y_i^*, A, B, w_{ij}^*) = (\rho_{\tau_q}, Y_i, 1, \mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - \mathbf{X}_j^{\top}\widehat{\boldsymbol{\gamma}}, w_{ij})$ for $i = 1, 2, \ldots, n$, and $(\rho, Y_i^*, A, B, w_{ij}^*) = (1/Q, 0, 0, \sum_{k=1}^{d} p_{\lambda}'(| \widehat{\gamma}_k |) | \widehat{\gamma}_k |, 1)$ for $i = n+1$. The *rq*(·) function in the R package "quantreg" is helpful for obtaining $\{\widehat{a}_j, \widehat{b}_j, j = 1, 2, \ldots, n\}$. Moreover, for the SCAD penalty, we can apply a difference-of-convex algorithm [22] to simplify the computation.

Step 2. Given $\{\widehat{a}_j, \widehat{b}_j, j = 1, 2, \ldots, n\}$, update $\widehat{\boldsymbol{\gamma}}$ by solving

$$\min_{\boldsymbol{\gamma}} \sum_{j=1}^{n} \sum_{q=1}^{Q} \sum_{i=1}^{n} \rho_{\tau_q}[Y_i - \widehat{a}_j - \widehat{b}_j(\mathbf{X}_i^{\top}\boldsymbol{\gamma} - \mathbf{X}_j^{\top}\boldsymbol{\gamma})] w_{ij} + \sum_{j=1}^{n} | \widehat{b}_j | \sum_{k=1}^{d} p_{\lambda}'(| \widehat{\gamma}_k |) | \gamma_k |. \tag{13}$$

We can apply a fast and efficient coordinate descent algorithm [23] if *d* is very large, or combine the MM algorithm [24] to reduce the calculation.


Step 3. Repeat Steps 1 and 2 until convergence. Finally, given the resulting estimate $\widehat{\boldsymbol{\gamma}}$, the link function $g(\cdot)$ is estimated at any point $u$ by $\widehat{g}(u) = \widehat{a}$, where

$$(\widehat{a}, \widehat{b}) = \arg\min_{(a,b)} \sum_{q=1}^{Q} \sum_{i=1}^{n} \rho_{\tau_q}[Y_i - a - b(\mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - u)]\, k_h(\mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - u).\tag{14}$$

**Remark 1.** *The above algorithm is aimed at the SCAD penalty function. Moreover, similarly, we can obtain the other algorithm for the Laplace penalty function, replacing SCAD with LEP.*

#### *2.4. The Selections of Bandwidth, Tuning Parameters, and Initial Value*

The selection of the bandwidth plays a crucially important role in local polynomial smoothing because it controls the curvature of the fitted function. Cross-validation (CV) and generalized cross-validation (GCV) can be utilized to choose a proper bandwidth, but these methods are not computationally practical due to the large amount of calculation. For local linear quantile regression, [25] obtained an approximately optimal bandwidth under suitable assumptions and found the rule-of-thumb bandwidth $h_{\tau} = h_m\{\tau(1-\tau)/\psi^2(\Phi^{-1}(\tau))\}^{1/5}$, where $\psi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively, and $h_m$ is the optimal bandwidth of least squares regression. There are many algorithms for the selection of $h_m$. [26] found that the rule-of-thumb bandwidth $h_m = \{4/(d+2)\}^{1/(4+d)} n^{-1/(4+d)}$ works fairly well in single-index models, where $d$ is the dimension of the kernel function. We combine a multiplier depending only on $\tau$ with the optimal bandwidth $h_m$ of the LS regression to obtain an $h_{\tau}$ with good properties.
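Both rule-of-thumb formulas are straightforward to evaluate; a sketch using only the Python standard library, where `NormalDist` supplies $\psi$ and $\Phi^{-1}$:

```python
from statistics import NormalDist

def rule_of_thumb_bandwidth(n, d, tau):
    """h_m = {4/(d+2)}^{1/(4+d)} n^{-1/(4+d)} and
    h_tau = h_m * {tau(1-tau)/psi^2(Phi^{-1}(tau))}^{1/5}."""
    h_m = (4.0 / (d + 2)) ** (1.0 / (4 + d)) * n ** (-1.0 / (4 + d))
    z = NormalDist().inv_cdf(tau)   # Phi^{-1}(tau)
    psi = NormalDist().pdf(z)       # standard normal density at z
    h_tau = h_m * (tau * (1 - tau) / psi ** 2) ** 0.2
    return h_m, h_tau
```

By the symmetry of the normal density, $h_{\tau}$ is the same for $\tau$ and $1-\tau$, and it inflates $h_m$ most in the tails, where quantile estimation needs more smoothing.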

There are several selection methods for SCAD's nonnegative tuning parameter $\lambda$, such as CV, GCV, AIC, and BIC. Following [27], we utilize the BIC criterion to choose a proper tuning parameter for SCAD in this paper:

$$\mathrm{BIC}(\lambda) = \frac{1}{\widetilde{\sigma}} \sum_{i=1}^{n} \sum_{q=1}^{Q} \rho_{\tau_q}[Y_i - \widehat{g}(\mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}}(\lambda))] + \log(n) \cdot df / 2,\tag{15}$$

where !*<sup>σ</sup>* = (1/*n*) <sup>∑</sup>*<sup>n</sup> <sup>i</sup>*=<sup>1</sup> <sup>∑</sup>*<sup>Q</sup> <sup>q</sup>*=<sup>1</sup> *ρτ<sup>q</sup>* [*Yi* <sup>−</sup> *<sup>g</sup>*(*X <sup>i</sup> <sup>γ</sup>*!)] with *<sup>γ</sup>*! being the composite quantile estimator without penalty and *d f* is the number of non-zero coefficients of *<sup>γ</sup>* (*λ*). Then, we chose the optimal tuning parameter by minimizing the above criteria. Moreover, for LEP, we utilized the extended Bayesian information criterion (EBIC) [28]to choose proper tuning parameters *λ* and *κ*.

$$\mathrm{EBIC}(\lambda, \kappa) = \log(\widehat{\sigma}^2) + \frac{\log(n) + \log\log(d)}{n}\, df,\tag{16}$$

where $\widehat{\sigma}^2 = 1/(n-1) \sum_{i=1}^{n} \sum_{q=1}^{Q} \rho_{\tau_q}[Y_i - \widehat{g}(\mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}})]$ and $df$ is the number of non-zero coefficients of $\widehat{\boldsymbol{\gamma}}(\lambda, \kappa)$. Similarly, to select the best tuning parameters, we minimize the above criterion over a grid of $(\lambda, \kappa)$ values.
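Given the fitted composite quantile losses, both criteria are simple to evaluate; a minimal sketch (the argument names are ours, for illustration):

```python
import math

def bic_scad(cqr_loss, sigma_tilde, n, df):
    """BIC(lambda) of Eq. (15): scaled composite quantile fit plus log(n)*df/2.

    cqr_loss: the double sum of check losses at the penalized fit;
    sigma_tilde: the scale from the unpenalized composite quantile fit."""
    return cqr_loss / sigma_tilde + math.log(n) * df / 2.0

def ebic_lep(sigma2_hat, n, d, df):
    """EBIC(lambda, kappa) of Eq. (16), with the extra log log(d) term
    that guards against overselection when d is large."""
    return math.log(sigma2_hat) + (math.log(n) + math.log(math.log(d))) * df / n
```

In both criteria, the candidate tuning parameters that minimize the value over the grid are retained.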

The initial value of the unknown parameter is required at the beginning of the iteration of our algorithm. A convenient choice is $\widetilde{\boldsymbol{\gamma}}/\|\widetilde{\boldsymbol{\gamma}}\|$, where $\widetilde{\boldsymbol{\gamma}}$ is the composite quantile estimator without penalty.

#### **3. Theoretical Properties**

A good estimate is supposed to satisfy unbiasedness, continuity, and the so-called oracle property. Thus, in this section, we discuss the large sample properties of the SCAD penalized composite quantile regression and the Laplace penalized composite quantile regression for single-index models. We consider the data $\{(\mathbf{X}_i, Y_i), i = 1, 2, \ldots, n\}$ consisting of $n$ observations from model (1), as in Section 2. Moreover, let $\mathbf{X}_i = (\mathbf{X}_{i1}^{\top}, \mathbf{X}_{i2}^{\top})^{\top}$, $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^{\top}, \boldsymbol{\gamma}_2^{\top})^{\top}$, $\mathbf{X}_{i1} \in \mathbb{R}^{s}$, $\mathbf{X}_{i2} \in \mathbb{R}^{d-s}$. In addition, $\boldsymbol{\gamma}_0 = (\boldsymbol{\gamma}_{10}^{\top}, \boldsymbol{\gamma}_{20}^{\top})^{\top}$ stands for the true regression parameter of model (1) with $\|\boldsymbol{\gamma}_0\| = 1$, where the $s$ components of $\boldsymbol{\gamma}_{10}$ are non-zero. We suppose the following regularity conditions hold:


$$\begin{aligned} C_0 &= E\{g'(\mathbf{X}^{\top}\boldsymbol{\gamma}_0)^{2}[\mathbf{X} - E(\mathbf{X}\,|\,\mathbf{X}^{\top}\boldsymbol{\gamma}_0)][\mathbf{X} - E(\mathbf{X}\,|\,\mathbf{X}^{\top}\boldsymbol{\gamma}_0)]^{\top}\} \\ C_1 &= E\{f_{Y}(g(\mathbf{X}^{\top}\boldsymbol{\gamma}_0))\,g'(\mathbf{X}^{\top}\boldsymbol{\gamma}_0)^{2}[\mathbf{X} - E(\mathbf{X}\,|\,\mathbf{X}^{\top}\boldsymbol{\gamma}_0)][\mathbf{X} - E(\mathbf{X}\,|\,\mathbf{X}^{\top}\boldsymbol{\gamma}_0)]^{\top}\} \end{aligned}$$

(vi) *h* → 0 and *nh* → ∞.

Given $(\widehat{a}_j, \widehat{b}_j)$, let $H = \sum_{j=1}^{n} | \widehat{b}_j |$, $a_n = \max\{p_{\lambda}'(| \gamma_{0j} |) : \gamma_{0j} \neq 0\}$, and let $\widehat{\boldsymbol{\gamma}}^{qr.sim.scad}$ be the solution of (7). We should note that under condition (ii), the first derivative of $g$ is bounded; thus, $H = O(n)$.

**Theorem 1.** *Under conditions (i)–(v), if* $\max\{p_{\lambda}''(| \gamma_{0k} |) : \gamma_{0k} \neq 0\} \to 0$ *and* $a_n = O(n^{-1/2})$*, then there exists a local minimizer of* (7) *such that* $\|\widehat{\boldsymbol{\gamma}}^{qr.sim.scad} - \boldsymbol{\gamma}_0\| = O_P(n^{-1/2} + a_n)$ *with* $\|\widehat{\boldsymbol{\gamma}}^{qr.sim.scad}\| = \|\boldsymbol{\gamma}_0\| = 1$*.*

According to Theorem 1, there exists a $\sqrt{n}$-consistent SCAD penalized composite quantile regression estimate of $\boldsymbol{\gamma}$ if a proper tuning parameter $\lambda$ is selected. Let $c_n = \{p_{\lambda}'(| \gamma_{01} |)\,\mathrm{sgn}(\gamma_{01}), \ldots, p_{\lambda}'(| \gamma_{0s} |)\,\mathrm{sgn}(\gamma_{0s})\}^{\top}$ and $\Sigma_{\lambda} = \mathrm{diag}(p_{\lambda}''(| \gamma_{01} |), \ldots, p_{\lambda}''(| \gamma_{0s} |))$.

**Lemma 1.** *Under the same conditions as in Theorem 1, assume that*

$$\liminf_{n \to +\infty} \liminf_{\theta \to 0^+} p_{\lambda_n}'(\theta)/\lambda_n > 0. \tag{17}$$

*If* $\lambda \to 0$ *and* $\sqrt{H}\lambda \to \infty$ *as* $n \to \infty$*, then with probability tending to 1, for any given* $\boldsymbol{\gamma}_1$ *satisfying* $\|\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_{10}\| = O_P(n^{-1/2})$ *and any constant* $C$*, we have*

$$Q_{\lambda}^{S}((\boldsymbol{\gamma}_1^{\top}, \mathbf{0}^{\top})^{\top}) = \min_{\|\boldsymbol{\gamma}_2\| \le C n^{-1/2}} Q_{\lambda}^{S}((\boldsymbol{\gamma}_1^{\top}, \boldsymbol{\gamma}_2^{\top})^{\top}). \tag{18}$$

**Theorem 2.** *Under the same conditions as in Theorem 1, assume that the penalty function* $p_{\lambda}(| \theta |)$ *satisfies condition* (17)*. If* $\lambda \to 0$ *and* $\sqrt{H}\lambda \to \infty$*, then with probability tending to 1, the* $\sqrt{n}$*-consistent local minimizer* $\widehat{\boldsymbol{\gamma}}^{qr.sim.scad} = ((\widehat{\boldsymbol{\gamma}}_1^{qr.sim.scad})^{\top}, (\widehat{\boldsymbol{\gamma}}_2^{qr.sim.scad})^{\top})^{\top}$ *in Theorem 1 must satisfy:*

*(i) Sparsity:* $\widehat{\boldsymbol{\gamma}}_2^{qr.sim.scad} = \mathbf{0}$*. (ii) Asymptotic normality:*

$$\sqrt{n}\{(Q C_{11} + H\Sigma_{\lambda}/n)(\widehat{\boldsymbol{\gamma}}_1^{qr.sim.scad} - \boldsymbol{\gamma}_{10}) + H c_n/n\} \xrightarrow{D} N(0,\, 0.2\, C_{01}),$$

*where* $C_{11}$ *is the top-left* $s \times s$ *sub-matrix of* $C_1$ *and* $C_{01}$ *is the top-left* $s \times s$ *sub-matrix of* $C_0$*.*

Theorem 2 shows that the SCAD penalized composite quantile regression estimator has the so-called oracle property when $\lambda \to 0$ and $\sqrt{H}\lambda \to \infty$.

**Remark 2.** *In this section, we discussed the large sample properties of the SCAD penalized composite quantile estimator* $\widehat{\boldsymbol{\gamma}}^{qr.sim.scad}$ *in detail. Similarly, we can show the large sample properties of the Laplace penalized composite quantile estimator* $\widehat{\boldsymbol{\gamma}}^{qr.sim.lep}$*.*

#### **4. Numerical Studies**

*4.1. Simulation Studies*

In this section, we evaluate the proposed estimators by simulation studies, comparing the performances of different penalized estimators with the oracle estimator. In particular, in order to reduce the computational burden, we take the Gaussian kernel as the kernel function in our simulations. Moreover, we do not tune the parameter $a$ and set $a = 3.7$, as suggested by [12] for the SCAD penalty, and we set the number of quantiles to $Q = 5$. Next, we compare the performances of the following four estimates for the single-index model:


In order to evaluate the performances of the above estimators, we consider the following criteria: the MAD (the mean absolute deviation of the fitted index), NC (the average number of non-zero coefficients correctly estimated to be non-zero), and NIC (the average number of zero coefficients incorrectly estimated to be non-zero), as defined in the notes to Table 1.


Additionally, an estimated coefficient is treated as 0 if its absolute value is smaller than $10^{-6}$.

**Scenario 1.** We assume that the single-index model has the following form:

$$Y = 2\mathbf{X}^{\top}\boldsymbol{\gamma}_0 + 10\exp\left(-(\mathbf{X}^{\top}\boldsymbol{\gamma}_0)^2/5\right) + \varepsilon,$$

where $\boldsymbol{\gamma}_0 = \boldsymbol{\gamma}/\|\boldsymbol{\gamma}\|$ with $\boldsymbol{\gamma} = (1, -1, 2, -0.5, 0, \ldots, 0)^{\top}$ being a 15-dimensional vector with only four non-zero values (the true coefficients). The $X$-variables are generated from the multivariate normal distribution with the correlation between $X_i$ and $X_j$ set to $0.5^{|i-j|}$, and the $Y$-variable is generated from the above model. Then, to examine the impacts of different error distributions, we consider the following five error distributions:


In order to perform the simulations, we generated 200 replicates with moderate sample sizes, *n* = 100, 200. Then, the corresponding results are reported in Table 1.
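The Scenario 1 data-generating process can be sketched as follows; the error distribution here is taken to be standard normal purely for illustration, since the study varies it over five distributions:

```python
import numpy as np

def gen_scenario1(n, d=15, rho=0.5, seed=0):
    """Draw (X, Y) with Y = 2u + 10*exp(-u^2/5) + eps, u = X^T gamma_0,
    Corr(X_i, X_j) = rho^|i-j|, and gamma_0 the normalized true coefficients."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(d)
    gamma[:4] = [1.0, -1.0, 2.0, -0.5]   # the four non-zero true coefficients
    gamma0 = gamma / np.linalg.norm(gamma)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    u = X @ gamma0
    Y = 2.0 * u + 10.0 * np.exp(-u ** 2 / 5.0) + rng.standard_normal(n)
    return X, Y, gamma0
```

Swapping `rng.standard_normal(n)` for draws from a heavier-tailed or contaminated distribution reproduces the robustness comparisons across error settings.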


**Table 1.** Simulation results for Scenario 1 based on 200 replications.

MAD (the mean absolute deviation) of $\widehat{\boldsymbol{\gamma}}$: $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} | \mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - \mathbf{X}_i^{\top}\boldsymbol{\gamma}_0 |$; NC: the average number of non-zero coefficients that are correctly estimated to be non-zero; NIC: the average number of zero coefficients that are incorrectly estimated to be non-zero.
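The criteria in the table notes can be computed as follows (a sketch; the $10^{-6}$ zero-threshold follows the convention stated earlier):

```python
import numpy as np

def mad(X, gamma_hat, gamma0):
    """Mean absolute deviation of the fitted index values over the sample."""
    return float(np.mean(np.abs(X @ gamma_hat - X @ gamma0)))

def nc_nic(gamma_hat, gamma0, tol=1e-6):
    """NC: true non-zeros estimated non-zero; NIC: true zeros estimated non-zero."""
    est_nz = np.abs(gamma_hat) >= tol
    true_nz = gamma0 != 0
    return int(np.sum(est_nz & true_nz)), int(np.sum(est_nz & ~true_nz))
```

Averaging these quantities over the 200 replications yields the entries reported in Tables 1 and 2.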

**Scenario 2.** The model set-up is similar to Scenario 1, except that the link function is $\mathbf{X}^{\top}\boldsymbol{\gamma}_0 + 4\sqrt{| \mathbf{X}^{\top}\boldsymbol{\gamma}_0 + 1 |}$. These link functions were also analyzed by [20]. Table 2 summarizes the corresponding results.


**Table 2.** Simulation results for Scenario 2 based on 200 replications.

MAD (the mean absolute deviation) of $\widehat{\boldsymbol{\gamma}}$: $\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} | \mathbf{X}_i^{\top}\widehat{\boldsymbol{\gamma}} - \mathbf{X}_i^{\top}\boldsymbol{\gamma}_0 |$; NC: the average number of non-zero coefficients that are correctly estimated to be non-zero; NIC: the average number of zero coefficients that are incorrectly estimated to be non-zero.

From Tables 1 and 2, we note that cqr.sim.lep performs best, cqr.sim.scad second best, and lad.sim.scad worst across the five error distributions. This is consistent with our theoretical findings. Furthermore, all estimators improve as the sample size increases.

#### *4.2. Real Data Application: Boston Housing Data*

In this section, the methods are illustrated through an analysis of the Boston housing data. The data (506 observations and 14 variables) are available in the R package 'MASS', and the definitions of the dependent variable (MEDV) and the explanatory variables are described in Table 3. We checked for missing values through the violin plot of each variable; Figures 1 and 2 show the violin plots of the first and last seven variables, respectively. It is clear from Figures 1 and 2 that there are obvious outliers in the CRIM and B (Black) columns. To examine the linear relationships among the variables, a heat map is given in Figure 3. It shows that the variables RM, PTRATIO, LSTAT, and MEDV have certain correlations, and that the correlation coefficients between INDUS and NOX, CRIM and RAD, RAD and TAX, and INDUS and TAX are 0.7, 0.8, 0.9, and 0.7, respectively. Given these high correlations among the covariates, the single-index regression model can be considered.

Boston housing data have been utilized in many regression studies, and the potential relationship between MEDV and the *X*-variables has also been found [10,17]. For single-index quantile regression, [10] introduced a practical algorithm in which the unknown link function *g*(·) is estimated by local linear quantile regression and the parametric index is estimated through linear quantile regression. However, the authors did not consider variable selection. For single-index regression models, [17] considered penalized LAD regression, which deals with variable selection and estimation simultaneously. However, the LAD estimator is only the special case of the quantile estimator in which the quantile *τ* is equal to 0.5. Moreover, the two studies mentioned above only used a single quantile. In this article, we construct new sparse estimators for single-index quantile regression models based on the composite quantile regression method combined with the SCAD penalty and the Laplace error penalty.

**Figure 1.** Violin diagram of the first seven variables.

**Figure 2.** Violin diagram of the last seven variables.

Due to the sparsity of data in the tail regions, the estimated quantile curves may cross at both tails, similar to [14]. The results of the real data example and the simulation studies confirm the reasonableness and effectiveness of our method in practice.

To improve numerical performance, we standardize the response variable MEDV and all predictor variables except CHAS before applying our method. An estimated coefficient is treated as 0 if its absolute value is smaller than 10<sup>−12</sup>. The corresponding results are reported in Table 4.
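These preprocessing steps can be sketched as follows (assuming the data have been exported to a `pandas` data frame, e.g., from R's MASS package; the column names and function names are our own):

```python
import numpy as np
import pandas as pd

def preprocess(df, skip=("chas",)):
    """Standardise every column to zero mean and unit variance, leaving the
    binary CHAS indicator untouched, as done before fitting the model."""
    out = df.copy()
    for col in out.columns:
        if col not in skip:
            out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out

def sparsify(coefs, tol=1e-12):
    """Set estimated coefficients below the 1e-12 threshold to exactly zero."""
    coefs = np.asarray(coefs, dtype=float).copy()
    coefs[np.abs(coefs) < tol] = 0.0
    return coefs
```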

From Table 4, we see that all methods achieve variable selection and parameter estimation simultaneously in this real problem. Moreover, all methods choose the sparse model including RM, DIS, PTRATIO, and LSTAT, the same as the model using all predictors without penalty (cqr.sim), and the estimates of cqr.sim.lep are the closest to those of cqr.sim. These results indicate that only four explanatory variables are significant and the rest are irrelevant.


**Table 3.** Description of variables for Boston housing data.


**Figure 3.** Heat map between the variables.



**Table 4.** Coefficient estimates for Boston housing data.

#### **5. Conclusions**

In this article, we propose SCAD and Laplace penalized composite quantile regression estimators for single-index models in a high-dimensional setting. Compared with the least squares method, composite quantile regression yields estimators that are robust to heavy-tailed error distributions and outliers. We then introduced a practical iterative algorithm, based on composite quantile regression, that uses local linear smoothing to estimate the unknown link function. This method realizes variable selection and estimation simultaneously by combining two kinds of penalty functions with composite quantile regression. In addition, we proved that the proposed estimators have large sample properties, including $\sqrt{n}$-consistency and the oracle property. Furthermore, the estimators were evaluated and illustrated by numerical studies. From the Boston housing data, we can draw the conclusion that all three estimation methods select the sparse model with the same significant variables, while the estimates of cqr.sim.lep are the closest to those of cqr.sim. This reveals that using or not using the LEP penalty acts essentially the same when we estimate the link function.

**Author Contributions:** Formal analysis, Z.L.; Methodology, Y.S.; Software, M.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the NNSF project (61503412) of China and the NSF project (ZR2019MA016) of the Shandong Province of China.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Theorems**

**Proof of Theorem 1.** Given $(\widehat{a}_j, \widehat{b}_j)$, we first introduce some notation. Let $\alpha_n = n^{-1/2} + a_n$, $Y_{ij} = Y_i - \widehat{a}_j - \widehat{b}_j \mathbf{X}_{ij}^{\top}\gamma_0$, $\mathbf{X}_{ij} = \mathbf{X}_i - \mathbf{X}_j$, and $\gamma = \alpha_n \mathbf{u} + \gamma_0$, where $\mathbf{u}$ is a $d$-dimensional vector. Define $S(\mathbf{u}) = \sum_{i,j,q} \rho_{\tau_q}[Y_{ij} - \alpha_n \widehat{b}_j \mathbf{X}_{ij}^{\top}\mathbf{u}] w_{ij} - \sum_{i,j,q} \rho_{\tau_q}(Y_{ij}) w_{ij}$, and set $\|\mathbf{u}\| = C$, where $C$ is a large enough constant.

Our purpose is to prove that $\widehat{\gamma}_{qr.sim.scad}$ is $\sqrt{n}$-consistent; that is, to show that for any given $\epsilon > 0$ and large $n$, there is a large enough constant $C$ such that

$$P\Big\{\inf_{\|\mathbf{u}\|=C} Q_{\lambda}^{S}(\gamma_0 + \alpha_n \mathbf{u}) > Q_{\lambda}^{S}(\gamma_0)\Big\} \ge 1 - \epsilon,\tag{A1}$$

which implies that there exists a local minimizer $\widehat{\gamma}$ in the ball $\{\alpha_n \mathbf{u} + \gamma_0 : \|\mathbf{u}\| \le C\}$ such that $\|\widehat{\gamma} - \gamma_0\| = O_P(\alpha_n)$ with probability at least $1 - \epsilon$. Let

$$\begin{split} D_n(\mathbf{u}) &= Q_{\lambda}^{S}(\alpha_n \mathbf{u} + \gamma_0) - Q_{\lambda}^{S}(\gamma_0) \\ &\ge S(\mathbf{u}) + H \sum_{k=1}^{s} \big[ p_\lambda(|\alpha_n u_k + \gamma_{0k}|) - p_\lambda(|\gamma_{0k}|) \big] \\ &=: I + II \end{split} \tag{A2}$$

First, we consider *I* and partition it into *I*<sup>1</sup> and *I*2. We have

$$\begin{split} I &= \sum_{i,j,q} \rho_{\tau_q}\big[Y_{ij} - \widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u}\big] w_{ij} - \sum_{i,j,q} \rho_{\tau_q}(Y_{ij}) w_{ij} \\ &= \sum_{i,j,q} w_{ij}\big\{\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u}\big[I(Y_{ij} < \widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u}) - \tau_q\big]\big\} + \sum_{i,j} Q w_{ij} Y_{ij}\, I(\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u} < Y_{ij} < 0) \\ &=: I_1 + I_2 \end{split} \tag{A3}$$

Obviously,

$$\begin{split} |I_1| &\le \Big|\sum_{i,j} Q w_{ij}\, \widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u}\Big| \\ &= \sqrt{n}\,\alpha_n \|\mathbf{u}\| \,\Big\|\sum_{i,j} Q w_{ij}\, \widehat{b}_j \mathbf{X}_{ij}/\sqrt{n}\Big\| \end{split} \tag{A4}$$

Note that $\sum_{i,j} Q w_{ij} \widehat{b}_j \mathbf{X}_{ij}/\sqrt{n} = O_P(1)$; see the proof of Theorem 2 of [10]. Therefore, $I_1 = O_P(\sqrt{n}\alpha_n)\|\mathbf{u}\| = O_P(n\alpha_n^2)\|\mathbf{u}\|$. Then, we can obtain the mean and variance of $I_2$ by direct computation.

$$\begin{split} EI_2 &= \sum_{i,j} Q w_{ij} \int_{-\infty}^{+\infty} y \cdot I(\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u} < y < 0) \cdot f_{Y_{ij}}(y)\,dy \\ &= \sum_{i,j} Q w_{ij} \int_{\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u}}^{0} y \cdot f_{Y_{ij}}(y)\,dy \\ &= \sum_{i,j} Q w_{ij} \cdot P(\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u} < Y_{ij} < 0) \\ &= \sum_{i,j} Q w_{ij} P_{ij} \end{split} \tag{A5}$$

where $f_{Y_{ij}}(\cdot)$ is the probability density function of $Y_{ij}$ and $P_{ij}$ stands for $P(\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u} < Y_{ij} < 0)$. Note that $EI_2 > 0$ and, furthermore, $EI_2 \to 0$. Moreover,

$$\begin{split} Var(I_2) &= \sum_{i,j} Q w_{ij} \cdot E\big[Y_{ij} \cdot I(\widehat{b}_j \mathbf{X}_{ij}^{\top}\alpha_n \mathbf{u} < Y_{ij} < 0) - P_{ij}\big]^2 \\ &\le \sum_{i,j} Q w_{ij} \big[\max_{i,j}|Y_{ij}|\, P_{ij} - P_{ij}^2\big] \\ &= \sum_{i,j} Q w_{ij} P_{ij}\big[\max_{i,j}|Y_{ij}| - P_{ij}\big] \to 0 \end{split} \tag{A6}$$

In addition, by taking a Taylor expansion of $P_\lambda(|\gamma_k|)$ and applying the basic inequality, we obtain

$$\begin{split} II &= H \sum_{k=1}^{s}\big[P_\lambda(|\gamma_{0k} + \alpha_n u_k|) - P_\lambda(|\gamma_{0k}|)\big] \\ &= H \sum_{k=1}^{s}\big[\alpha_n P'_\lambda(|\gamma_{0k}|)\,\mathrm{sgn}(\gamma_{0k})\, u_k + 0.5\,\alpha_n^2 P''_\lambda(|\gamma_{0k}|)\, u_k^2\big] \\ &\le \sqrt{s}\, H a_n \alpha_n \|\mathbf{u}\| + H \max_{1\le k\le s} P''_\lambda(|\gamma_{0k}|)\,\alpha_n^2 \|\mathbf{u}\|^2 \end{split} \tag{A7}$$

According to the conditions $H = O_P(n)$ and $\max\{P''_\lambda(|\gamma_{0k}|) : \gamma_{0k} \ne 0\} \to 0$, and with $\|\mathbf{u}\| = C$, $D_n(\mathbf{u})$ in (A2) is mainly determined by $I_2$, which is positive. Thus, we prove (A1).

**Proof of Lemma 1.** Since $\gamma = \alpha_n \mathbf{u} + \gamma_0$, let $\mathbf{u}_1 = \alpha_n^{-1}(\gamma_1 - \gamma_{01})$, $\mathbf{u}_2 = \alpha_n^{-1}(\gamma_2 - \gamma_{02})$, and $\mathbf{u} = (\mathbf{u}_1^{\top}, \mathbf{u}_2^{\top})^{\top}$. After some computation, we obtain

$$\begin{split} &Q^{\mathbb{S}}\_{\lambda}((\gamma\_{1}^{\top},\mathbf{0}^{\top})^{\top}) - Q^{\mathbb{S}}\_{\lambda}((\gamma\_{1}^{\top},\gamma\_{2}^{\top})^{\top}) \\ &= \mathcal{S}((\boldsymbol{u}\_{1}^{\top},\mathbf{0}^{\top})^{\top}) - \mathcal{S}((\boldsymbol{u}\_{1}^{\top},\boldsymbol{u}\_{2}^{\top})^{\top}) - H \sum\_{k=s+1}^{d} P\_{\lambda}(|\;\gamma\_{k}|) \end{split} \tag{A8}$$

According to the proof of Theorem 1 and $\|\mathbf{u}\| = O(1)$, we have $I_1 = O_P(n\alpha_n^2)$ and $I_2 = O_P(1)$; thus, $S((\mathbf{u}_1^{\top}, \mathbf{u}_2^{\top})^{\top}) = I = O_P(1)$. Similarly, $S((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top}) = O_P(1)$. By the mean value theorem and $P_\lambda(0) = 0$, we obtain the following inequality:

$$\begin{split} H \sum_{k=s+1}^{d} P_\lambda(|\gamma_k|) &= H \sum_{k=s+1}^{d} P'_\lambda(|\gamma_k^{*}|)\,|\gamma_k| \\ &\ge H\lambda \Big(\liminf_{\lambda \to 0}\,\liminf_{\theta \to 0^{+}} P'_\lambda(\theta)/\lambda\Big) \sum_{k=s+1}^{d} |\gamma_k| \end{split} \tag{A9}$$

where $0 < |\gamma_k^{*}| < |\gamma_k|$ $(k = s+1, \ldots, d)$. Since $\sqrt{H}\lambda \to \infty$, $H\lambda = \sqrt{H}(\sqrt{H}\lambda)$ is of higher order than $O(\sqrt{H})$, which implies that the last term of (A8) dominates in magnitude. As a result, $Q_\lambda^{S}((\gamma_1^{\top}, \mathbf{0}^{\top})^{\top}) - Q_\lambda^{S}((\gamma_1^{\top}, \gamma_2^{\top})^{\top}) < 0$ for large $n$. This proves Lemma 1.

#### **Proof of Theorem 2.**

(i) follows from Lemma 1. (ii) By partitioning $\mathbf{u} = (\mathbf{u}_1^{\top}, \mathbf{u}_2^{\top})^{\top}$ and using $P_\lambda(0) = 0$ and (A2), we have

$$\begin{split} D_n((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top}) &= S((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top}) + H \sum_{k=1}^{s}\big[P_\lambda(|\gamma_{0k} + \alpha_n u_k|) - P_\lambda(|\gamma_{0k}|)\big] \\ &= S((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top}) + P_\lambda(\mathbf{u}_1) \end{split} \tag{A10}$$

where $P_\lambda(\mathbf{u}_1) = H \sum_{k=1}^{s}\big[P_\lambda(|\gamma_{0k} + \alpha_n u_k|) - P_\lambda(|\gamma_{0k}|)\big]$. Moreover, by Taylor's expansion, $P_\lambda(\mathbf{u}_1)$ can be rewritten as

$$P_\lambda(\mathbf{u}_1) = H\alpha_n \mathbf{c}_n^{\top}\mathbf{u}_1 + \frac{1}{2}H\alpha_n^2\, \mathbf{u}_1^{\top}\Sigma_\lambda \mathbf{u}_1 \tag{A11}$$

Let

$$\begin{aligned} \delta^{*} &= \widehat{a}_j + \widehat{b}_j \mathbf{X}_{1ij}^{\top}(\gamma_{01} + \alpha_n \widehat{\mathbf{u}}_1) \\ \delta_1^{*} &= \widehat{a}_j + \widehat{b}_j \mathbf{X}_{1ij}^{\top}\gamma_{01} \end{aligned}$$

To find the $\widehat{\mathbf{u}}_1$ minimizing $D_n((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top})$, we differentiate and set $D'_n((\mathbf{u}_1^{\top}, \mathbf{0}^{\top})^{\top}) = 0$, which yields the following equation:

$$-\sum_{i,j,q} w_{ij}\,\widehat{b}_j \mathbf{X}_{1ij}\,\alpha_n\big[\tau_q - I(Y_i - \widehat{a}_j - \widehat{b}_j \mathbf{X}_{1ij}^{\top}(\gamma_{01} + \alpha_n \widehat{\mathbf{u}}_1) < 0)\big] + H\alpha_n \mathbf{c}_n + H\alpha_n^2\, \Sigma_\lambda \widehat{\mathbf{u}}_1 = 0 \tag{A12}$$

That is,

$$-\frac{1}{n}\sum_{i,j,q}\widehat{b}_j \mathbf{X}_{1ij} w_{ij}\big[I(Y_i < \delta^{*}) - \tau_q\big] = \frac{1}{n}\big[H\mathbf{c}_n + H\alpha_n\, \Sigma_\lambda \widehat{\mathbf{u}}_1\big] \tag{A13}$$

Let

$$\begin{aligned} Z\_1 &= n^{-1/2} \sum\_{i,j,q} \hat{b}\_j \mathbf{X}\_{1ij} w\_{ij} [I(Y\_i < \delta\_1^\*) - \pi\_q] \\ B\_1 &= n^{-1} \sum\_{i,j,q} \hat{b}\_j \mathbf{X}\_{1ij} w\_{ij} [F\_Y(\delta\_1^\*) - F\_Y(\delta^\*)] \\ B\_2 &= n^{-1} \sum\_{i,j,q} \hat{b}\_j \mathbf{X}\_{1ij} w\_{ij} \{ [I(Y\_i < \delta\_1^\*) - I(Y\_i < \delta^\*)] - [F\_Y(\delta\_1^\*) - F\_Y(\delta^\*)] \} \end{aligned}$$

where *FY*(·) is the cumulative distribution function of *Y*. Therefore, we have

$$-\frac{1}{n}\sum\_{i,j,q}\widehat{b}\_{\dot{j}}\mathbf{X}\_{1ij}w\_{\dot{j}\dot{j}}[I(Y\_i<\delta^\*)-\pi\_q]=-\frac{1}{\sqrt{n}}Z\_1+B\_1+B\_2\tag{A14}$$

By taking a Taylor expansion of $F_Y(\cdot)$, we can obtain that

$$\begin{split} B_1 &= -Q\alpha_n n^{-1}\sum_{i,j}\widehat{b}_j^2\, f_Y(\delta_1^{*})\, w_{ij}\,\mathbf{X}_{1ij}\mathbf{X}_{1ij}^{\top}\widehat{\mathbf{u}}_1 \\ &\to -Q\alpha_n \mathbf{C}_{11}\widehat{\mathbf{u}}_1 \end{split} \tag{A15}$$

where $f_Y(\cdot)$ is the probability density function of $Y$. By direct calculation of the mean and variance as in [15], we have $B_2 = o_P(Q/\sqrt{n}) = o_P(1/\sqrt{n})$. Moreover, combining (A14), (A15), and $\widehat{\mathbf{u}}_1 = \alpha_n^{-1}(\widehat{\gamma}_1 - \gamma_{01})$, (A13) can be rewritten in the following form:

$$\sqrt{n}\Big\{\Big(Q\mathbf{C}_{11} + \frac{H}{n}\Sigma_\lambda\Big)(\widehat{\gamma}_1 - \gamma_{01}) + \frac{H}{n}\mathbf{c}_n\Big\} = -Z_1 + o_P\Big(\frac{1}{\sqrt{n}}\Big) \tag{A16}$$

Note that $Z_1 \xrightarrow{D} Q\,N(0,\, 0.25\,\mathbf{C}_{01})$. We can obtain that

$$\sqrt{n}\Big\{\Big(Q\mathbf{C}_{11} + \frac{H}{n}\Sigma_\lambda\Big)(\widehat{\gamma}_1 - \gamma_{01}) + \frac{H}{n}\mathbf{c}_n\Big\} \xrightarrow{D} N(0,\, 0.25\,Q^2\,\mathbf{C}_{01}) \tag{A17}$$

Thus, we prove Theorem 2.

**Remark A1.** *Above, we prove the $\sqrt{n}$-consistency and oracle property of the SCAD penalized composite quantile estimator $\widehat{\gamma}_{qr.sim.scad}$. Similarly, we can show the same properties for the Laplace penalized composite quantile estimator $\widehat{\gamma}_{qr.sim.lep}$.*

#### **Appendix B. The Algorithm Based on LEP**

Similar to the SCAD penalty function, by the local linear approximation of LEP and removal of a few irrelevant terms, (8) can be rewritten as

$$\min_{\mathbf{a},\mathbf{b},\|\gamma\|=1}\;\sum_{j=1}^{n}\sum_{q=1}^{Q}\sum_{i=1}^{n}\rho_{\tau_q}\big[Y_i - a_j - b_j(\mathbf{X}_i^{\top}\gamma - \mathbf{X}_j^{\top}\gamma)\big]w_{ij} + \sum_{j=1}^{n}|b_j|\sum_{k=1}^{d}p'_{\lambda,\kappa}(|\widehat{\gamma}_k|)\,|\gamma_k| \tag{A18}$$

We denote the target function in (A18) by $Q^{S*}_{\lambda,\kappa}(a, b, \gamma)$. The iterative algorithm based on LEP is then as follows.

Step 0. We obtain an initial estimate $\widehat{\gamma}$ of $\gamma$, standardized such that $\|\widehat{\gamma}\| = 1$ and $\widehat{\gamma}_1 > 0$.

Step 1. Given an estimate $\widehat{\gamma}$, we obtain $\{\widehat{a}_j, \widehat{b}_j,\ j = 1, 2, \ldots, n\}$ by solving

$$\begin{split} &\min_{(a_j,b_j)}\sum_{i=1}^{n}\sum_{q=1}^{Q}\rho_{\tau_q}\big[Y_i - a_j - b_j(\mathbf{X}_i^{\top}\widehat{\gamma} - \mathbf{X}_j^{\top}\widehat{\gamma})\big]w_{ij} + |b_j|\sum_{k=1}^{d}p'_{\lambda,\kappa}(|\widehat{\gamma}_k|)\,|\widehat{\gamma}_k| \\ &= \min_{(a_j,b_j)}\sum_{i=1}^{n+1}\sum_{q=1}^{Q}\rho\Big[Y_i^{*} - (A, B)\binom{a_j}{b_j}\Big]w_{ij}^{*}, \end{split} \tag{A19}$$

where $h$ is the optimal bandwidth, $(\rho, Y_i^{*}, A, B, w_{ij}^{*}) = (\rho_{\tau_q},\, Y_i,\, 1,\, \mathbf{X}_i^{\top}\widehat{\gamma} - \mathbf{X}_j^{\top}\widehat{\gamma},\, w_{ij})$ for $i = 1, 2, \ldots, n$, and $(\rho, Y_i^{*}, A, B, w_{ij}^{*}) = (1/Q,\, 0,\, 0,\, \sum_{k=1}^{d}p'_{\lambda,\kappa}(|\widehat{\gamma}_k|)\,|\widehat{\gamma}_k|,\, 1)$ for $i = n+1$.

Step 2. Given { *aj*, *bj*, *<sup>j</sup>* <sup>=</sup> 1, 2, . . . , *<sup>n</sup>*}, update *<sup>γ</sup>* by solving

$$\min_{\gamma}\sum_{j=1}^{n}\sum_{q=1}^{Q}\sum_{i=1}^{n}\rho_{\tau_q}\big[Y_i - \widehat{a}_j - \widehat{b}_j(\mathbf{X}_i^{\top}\gamma - \mathbf{X}_j^{\top}\gamma)\big]w_{ij} + \sum_{j=1}^{n}|\widehat{b}_j|\sum_{k=1}^{d}p'_{\lambda,\kappa}(|\widehat{\gamma}_k|)\,|\gamma_k|. \tag{A20}$$

Step 3. Scale $\widehat{b} \leftarrow \mathrm{sgn}(\widehat{\gamma}_1)\,\|\widehat{\gamma}\|\,\widehat{b}$, and $\widehat{\gamma} \leftarrow \mathrm{sgn}(\widehat{\gamma}_1)\,\widehat{\gamma}/\|\widehat{\gamma}\|$.


Step 4. Repeat Steps 1–3 until convergence. Finally, with the optimal bandwidth $\widehat{h}$, obtain

$$(\widehat{a}, \widehat{b}) = \arg\min_{(a,b)}\sum_{q=1}^{Q}\sum_{i=1}^{n}\rho_{\tau_q}\big[Y_i - a - b(\mathbf{X}_i^{\top}\widehat{\gamma} - u)\big]\,k_{\widehat{h}}(\mathbf{X}_i^{\top}\widehat{\gamma} - u). \tag{A21}$$
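As an illustration of this final step, the local linear composite quantile fit at a point $u$ can be obtained by direct minimization of the kernel-weighted check loss (a sketch under the assumption of a Gaussian kernel; the paper's actual implementation and bandwidth choice may differ):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(r, tau):
    # rho_tau(r) = r * (tau - I(r < 0))
    return r * (tau - (r < 0))

def local_cqr_fit(Y, index, u, h, taus):
    """Local linear composite quantile fit at a point u: minimise
    sum_q sum_i rho_{tau_q}(Y_i - a - b * (index_i - u)) * K_h(index_i - u),
    with a Gaussian kernel assumed for K_h."""
    z = index - u
    w = np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

    def objective(theta):
        a, b = theta
        return sum(np.sum(check_loss(Y - a - b * z, tau) * w) for tau in taus)

    res = minimize(objective, x0=[np.median(Y), 0.0], method="Nelder-Mead")
    return res.x  # (a_hat, b_hat)
```

In local linear smoothing, the intercept $\widehat{a}$ plays the role of $\widehat{g}(u)$ and the slope $\widehat{b}$ estimates $g'(u)$.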

#### **References**


### *Article* **The FEDHC Bayesian Network Learning Algorithm**

**Michail Tsagris**

Department of Economics, University of Crete, Gallos Campus, 74100 Rethymnon, Greece; mtsagris@uoc.gr

**Abstract:** The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Further, the paper shows that the only implementation of MMHC in the statistical software *R* is prohibitively expensive, and a new implementation is offered. In addition, specifically for the case of continuous data, a robust-to-outliers version of FEDHC, which can be adopted by other BN learning algorithms, is proposed. FEDHC is tested via Monte Carlo simulations that distinctly show that it is computationally efficient and that it produces Bayesian networks of similar, or higher, accuracy than MMHC and PCHC. Finally, an application of the FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software *R*.

**Keywords:** causality; Bayesian networks; scalability

**MSC:** 62H22

#### **1. Introduction**

Learning the causal relationships among variables using non-experimental data is of high importance in many scientific fields, such as economics and econometrics (for a general definition of causality specifically in economics and econometrics, see [1]). When the aim is particularly to recreate the causal mechanism that generated the data, graphical models, such as causal networks and Bayesian Networks (BNs) (also known as Bayes networks, belief networks, decision networks, Bayes(ian) models or probabilistic directed acyclic graphical models) are frequently employed. The advantages of BNs include simultaneous variable selection among all variables, and hence, the detection of conditional associations between variables. On a different route, BNs form the scheme for synthetic population generation [2], and have been used synergetically with agent-based models [3,4].

BNs enjoy applications to numerous fields, but the focus of the current paper is on fields related to economics applications, such as production economics [5], macroeconomics [6] and environmental resource economics [7]. Applications of BNs can also be found in financial econometrics [8], banking and finance [9], credit scoring [10], insurance [11] and customer service [12], to name a few. Despite the plethora of applications of BNs, not many BN algorithms exist, and most importantly, fewer are publicly available in free software environments, such as the statistical software *R*. The Max-Min Hill Climbing (MMHC) [13] is an example of a widely used BN learning algorithm (The relevant paper is one of the classic papers in the Artificial Intelligence field, and has received more than 1870 citations according to scholar.google.com as of July 2022.) that is publicly available, in the *R* package *bnlearn* [14]. PC Hill Climbing (PCHC) [15] is a recently suggested hybrid algorithm that is also publicly available, in the *R* package *pchc* [16].

Ref. [15] showed that when the sample size is in the order of hundreds of thousands, MMHC's implementation in the *R* package *bnlearn* requires more than a day with continuous data, even with 50 variables. On the contrary, PCHC is a computationally efficient and scalable BN learning algorithm [15]. With modern technology and vast data generation, the computational cost is a considerable parameter. Every novel algorithm must be computationally efficient and scalable to large sample sizes. Seen from the green economy point of view, this cost also has an economic and environmental impact; a faster algorithm will produce results in a more timely manner, facilitating faster decision making, consuming less energy and hence reducing its carbon footprint.

**Citation:** Tsagris, M. The FEDHC Bayesian Network Learning Algorithm. *Mathematics* **2022**, *10*, 2604. https://doi.org/10.3390/math10152604

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 23 May 2022; Accepted: 20 July 2022; Published: 26 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Moving along those lines, this paper proposes a new computationally highly efficient algorithm termed Forward with Early Dropping Hill Climbing (FEDHC), which is publicly available in the *R* package *pchc*. FEDHC shares common ideas with PCHC and MMHC. It applies the Forward Backward with Early Dropping (FBED) variable selection algorithm [17] to each variable as a means of skeleton identification, followed by a Hill Climbing (HC) scoring phase. FEDHC can handle millions of observations in just a few minutes, and retains similar or better accuracy levels than PCHC and MMHC. With continuous data, FEDHC performs fewer errors than PCHC, but the converse is true with categorical data. FEDHC further enjoys the same scalability property as PCHC, and its computational cost is proportional to the sample size of the data. Increasing the sample size by a factor increases the execution time by the same factor. Finally, a new, computationally efficient implementation of MMHC is offered that is also publicly available in the *R* package *pchc*.

The choice of the BN learning algorithm is not only computational cost-dependent, but also quality-dependent. Regardless of the algorithm used, the quality of the learned BN can be significantly affected by outliers. Robustness to outliers is an important aspect that, surprisingly enough, has not attracted substantial research attention in the field of BN learning. Ref. [18] was the first to propose a robustified version of the PC algorithm, replacing the empirical standard deviations with robust scale estimates. Ref. [19], on the other hand, removed the outliers, but their algorithm is only applicable to BNs with a known topology. Robustification of the proposed FEDHC algorithm takes place by adopting techniques from the robust statistics literature. The key concept is to identify and remove the outliers prior to applying FEDHC.
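A minimal sketch of this identify-and-remove strategy uses Mahalanobis distances with a chi-square cutoff (a simplified illustration: the robust-statistics versions replace the sample mean and covariance with robust estimates such as MCD; the function name and cutoff level are our own choices):

```python
import numpy as np
from scipy.stats import chi2

def remove_outliers(X, alpha=0.975):
    """Drop rows whose squared Mahalanobis distance exceeds the chi-square
    cutoff with d degrees of freedom (d = number of columns). Simplified:
    robust variants would use e.g. MCD location/scatter instead."""
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    keep = d2 <= chi2.ppf(alpha, df=X.shape[1])
    return X[keep], keep
```

The cleaned matrix `X[keep]` is then passed to the BN learning algorithm.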

The rest of the paper is structured as follows. Preliminaries regarding BNs that will assist in making the paper comprehensible, and a brief presentation of the PCHC and MMHC algorithms, are given in Section 2. The FEDHC is introduced in Section 3, along with its robustified version (the robustified versions are applicable to PCHC and MMHC as well), which will be shown to be remarkably insensitive to outliers. The theoretical properties and computational details of FEDHC, and the conditional independence tests utilised for continuous and categorical data, are delineated in the same section. Section 4 contains Monte Carlo simulation studies comparing FEDHC to PCHC and MMHC in terms of accuracy, computational efficiency and number of tests performed. Section 5 illustrates FEDHC, PCHC and MMHC on two real cross-sectional datasets using the *R* package *pchc*. The first dataset contains continuous data on the credit history of a sample of applicants for a type of credit card [20]. The second dataset contains categorical data on household income plus some more demographic variables. Finally, Section 6 contains the conclusions drawn from this paper.

#### **2. Preliminaries**

Graphical models or probabilistic graphical models visually express the conditional (in)dependencies between random variables (*Vi*, *i* = 1, ... , *D*). Nodes (or vertices) represent the variables *Vi*, and edges between the nodes indicate relationships; for example, *Vi* − *Vj* indicates a relationship between the variables *Vi* and *Vj*. Directed graphs are graphical models that contain arrows instead of edges, indicating the direction of the relationship; for example, *Vi* → *Vj*. The parents of a node *Vi* are the nodes whose arrows point towards *Vi*. Consequently, node *Vi* is termed the child of those nodes, and the common parents of those nodes are called spouses. Directed acyclic graphs (DAGs) are stricter in the sense that they impose no cycles on these directions: for any path between *Vi* and *Vj*, *Vi* → *Vk* → ... → *Vj*, no path from *Vj* back to *Vi* (*Vj* → ... → *Vi*) exists. In other words, the natural order between parents and children, or ancestors and descendants, is mandated.

#### *2.1. Bayesian Networks*

Assume there is a collection **V** of *D* variables whose joint distribution *P* is known. The BN (a BN is a special case of a DAG) [21,22] *B* = ⟨*G*, **V**⟩ arises from linking *P* to *G* through the Markov condition (or Markov property), which states that each variable is conditionally independent of its non-descendants, given its parents. By using this condition, the joint distribution *P* can be factorised as:

$$P(V\_1, \ldots, V\_D) = \prod\_{i=1}^D P(V\_i | Pa(V\_i)),\tag{1}$$

where *Pa*(*Vi*) denotes the parents set of *Vi* in *G*.
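The factorization (1) can be illustrated on a toy binary DAG V1 → V3 ← V2 (a hypothetical example; the conditional probability tables are invented for illustration):

```python
# Hypothetical binary DAG V1 -> V3 <- V2 with invented conditional
# probability tables; `parents` records Pa(V_i) for each node.
parents = {"V1": [], "V2": [], "V3": ["V1", "V2"]}
p_v1 = 0.6                 # P(V1 = 1)
p_v2 = 0.3                 # P(V2 = 1)
p_v3 = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95}  # P(V3 = 1 | V1, V2)

def bern(p, x):
    """P(X = x) for a Bernoulli(p) variable."""
    return p if x == 1 else 1 - p

def joint(v1, v2, v3):
    """P(V1, V2, V3) = P(V1) * P(V2) * P(V3 | V1, V2), per Equation (1)."""
    return bern(p_v1, v1) * bern(p_v2, v2) * bern(p_v3[(v1, v2)], v3)

# The eight joint probabilities sum to one, as any valid factorization must.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```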

If *G* entails only conditional independencies in *P*, and all conditional independencies in *P* are entailed by *G* based on the Markov condition, then *G* and *P* are faithful to each other, and *G* is a perfect map of *P* [22].

The BN whose edges can be interpreted causally is called causal BN; an edge *Vi* → *Vj* exists if *Vi* is a direct cause of *Vj*. A necessary assumption made by the algorithms under study is causal sufficiency; there are no latent (hidden, non-observed) variables among the observed variables **V**.

The triplet (*Vi*, *Vk*, *Vj*), where *Vi* → *Vk* ← *Vj*, is called a V-structure. If there is no edge between *Vi* and *Vj*, the node *Vk* is called an unshielded collider. In Figure 1, the triplet (*V*1, *V*3, *V*2) is a V-structure as there is no edge between *V*<sup>1</sup> and *V*2, and hence, node *V*<sup>3</sup> is an unshielded collider. The unshielded collider *Vk* implies that *Vi* and *Vj* are marginally independent but become dependent when conditioning on *Vk*, provided that the faithfulness property holds true [22]. Conversely, the triplet of nodes (*Vi*, *Vk*, *Vj*) such that *Vk* → *Vi* and *Vk* → *Vj* is termed a Λ-structure (nodes *V*3, *V*<sup>4</sup> and *V*<sup>5</sup> in Figure 1 form such an example). The Λ-structure implies that *Vi* and *Vj* are conditionally independent, given *Vk*.

**Figure 1.** An example of a DAG. Nodes V1 and V2 are the parents of V3, whose children are nodes V4 and V5. V2 is the spouse of V1 (and vice versa, V1 is the spouse of V2).
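The behaviour of an unshielded collider can be checked empirically: the two parents are (nearly) uncorrelated marginally, but conditioning on the collider induces dependence (a simulation sketch; the sample size and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# V-structure Vi -> Vk <- Vj: Vi and Vj are independent causes of Vk.
vi = rng.standard_normal(n)
vj = rng.standard_normal(n)
vk = vi + vj + 0.5 * rng.standard_normal(n)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

marginal = np.corrcoef(vi, vj)[0, 1]    # close to 0: marginal independence
conditional = partial_corr(vi, vj, vk)  # strongly negative: collider opened
```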

Two or more BNs are called Markov equivalent if and only if they have the same skeleton and the same V-structures [23]. The set of all Markov equivalent BNs forms the Markov equivalence class, which can be represented by a completed partially directed acyclic graph (CPDAG); in addition to directed edges, it contains undirected edges. (Undirected edges may be oriented either way in the BNs of the Markov equivalence class (in the set of all valid combinations), while directed and missing edges are shared among all equivalent BNs.)

#### *2.2. Classes of BN Learning Algorithms*

BN learning algorithms are typically constraint-based, score-based or hybrid. Constraint-based learning algorithms, such as PC (PC stands for Peter and Clark, after Peter Spirtes and Clark Glymour, the two researchers who invented it) [24] and FCI [22], employ conditional independence (CI) tests to discover the structure of the network (skeleton), and then orient the edges by repetitively applying orientation rules. On the contrary, score-based methods [25–27] assign a score to the whole network and perform a search in the space of BNs to identify a high-scoring network. Hybrid algorithms, such as MMHC [13] and PCHC [15], combine both aforementioned approaches; they first perform CI tests to discover the skeleton of the BN, and then employ a scoring method to direct the edges in the space of BNs. For a series of CI tests for continuous and categorical data, please see Appendix A.

#### *2.3. PCHC and MMHC Algorithms*

The PCHC's skeleton identification phase is the same as that of the PC algorithm [15]. The phase commences with all pairwise unconditional associations and removes the edges between pairs that are not statistically significantly associated. Subsequently, CI tests are performed with the cardinality of the conditioning set (denoted by *k*) increasing by 1 at a time. At every step, the conditioning set consists of subsets of the neighbours adjacent to each variable *Vi* (*adj*(*G*, *Vi*)). This process is repeated until no more edges can be removed.
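As an illustration of this phase, a minimal Python sketch of a PC-style skeleton search is given below. It uses Fisher-z tests of partial correlations; the helper names, the significance level and the three-variable chain example are illustrative choices of ours, not the paper's implementation.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def partial_corr(R, i, j, S):
    """Partial correlation of variables i and j given the set S,
    from the inverse of the relevant submatrix of the correlation matrix R."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def skeleton(X, alpha=0.05, max_k=3):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}
    for k in range(max_k + 1):          # conditioning-set size grows by 1
        for i, j in combinations(range(p), 2):
            if j not in adj[i]:
                continue
            # condition on subsets of the current neighbours of i (minus j)
            for S in combinations(sorted(adj[i] - {j}), k):
                r = partial_corr(R, i, j, S)
                z = abs(np.arctanh(r)) * np.sqrt(n - len(S) - 3)  # Fisher z
                if 2 * stats.norm.sf(z) > alpha:  # not significant: drop edge
                    adj[i].discard(j)
                    adj[j].discard(i)
                    break
    return {frozenset((i, j)) for i in adj for j in adj[i]}

# Chain x0 -> x1 -> x2: the edge x0--x2 vanishes once we condition on x1.
rng = np.random.default_rng(3)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
sk = skeleton(np.column_stack([x0, x1, x2]), alpha=0.01)
```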

Ref. [22] suggested three heuristics to select the pairs of variables, and the order is crucial, as a poor ordering can yield erroneous results. The optimal process, for a given variable *Vi*, is to first test those variables in **V** that are least probabilistically dependent on *Vi*, conditional on those subsets of variables that are most probabilistically dependent on *Vi*. Note that the pairs are first ordered according to the third heuristic of [22], so the order of selection of the pairs is deterministic. Hence, the skeleton identification phase is independent of the order in which the variables are located in the dataset [28].

MMHC's skeleton identification phase performs a variable selection process for each variable (call it the target variable, *Vi*), described as follows. A search for its statistically significantly associated variables is performed via unconditional statistical tests. The associations are stored, and the variable with the highest association (*Vj*) is chosen; an edge is added between *Vi* and *Vj*, and all non-statistically significant variables are excluded from further consideration. In the second step, all CI tests between the target variable and the previously identified variables, conditional on the previously selected variable, are performed (*Vi* ⊥⊥ *Vm*|*Vj*, *m* ≠ *i*, *j*), and the non-statistically significant variables are neglected. The previously stored associations are updated; for each variable, the minimum between the old and the new association is stored. The variable with the highest association (Max-Min heuristic) is selected next. In subsequent steps, while the set of the selected variables increases, the conditioning set does not, as its cardinality is at most equal to *k*. (This algorithm resembles classical forward variable selection in statistics, with two distinct differences. First, at each step, non-significant variables are excluded from future searches, instead of conditioning on all selected variables. Secondly, the CI test for the next variable conditions upon all possible subsets, up to a pre-specified cardinality, of the already selected variables.) Upon completion, a backward phase, in the same spirit as the forward one, is applied to remove falsely detected variables.

This variable selection process is repeated for all variables. A detected edge is retained only if it was identified from both of its incident variables. If, for example, *Vj* was found to be associated with *Vi*, but *Vi* was not found to be associated with *Vj*, then no edge between *Vi* and *Vj* is added.
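This symmetrisation amounts to an AND-rule over the per-variable selections; a minimal sketch (function name is ours):

```python
def and_rule(neighbours):
    """Combine per-variable selections: keep an edge i-j only if j was
    selected for i AND i was selected for j."""
    edges = set()
    for i, nbrs in neighbours.items():
        for j in nbrs:
            if i in neighbours.get(j, set()):
                edges.add(frozenset((i, j)))
    return edges

# Vj was found for Vi, but Vi was not found for Vj: no edge is added.
print(and_rule({"Vi": {"Vj"}, "Vj": set()}))
# Mutual detection does produce an edge.
print(and_rule({"A": {"B"}, "B": {"A"}}))
```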

A slightly modified version of MMHC's skeleton identification phase is implemented in the *R* package *pchc*. The backward phase is not performed, in order to make the algorithm faster. To distinguish between them, *bnlearn*'s implementation will be denoted by MMHC-1, and *pchc*'s implementation will be denoted by MMHC-2, hereafter.

The orientation of the discovered edges takes place in the second, Hill Climbing (HC) scoring, phase of PCHC and MMHC, and is the same phase employed by FEDHC as well.

#### **3. The FEDHC BN Learning Algorithm**

Similarly to PCHC and MMHC, the skeleton identification phase of FEDHC relies on a variable selection algorithm. Thus, prior to introducing the FEDHC algorithm, the Forward Backward with Early Dropping (FBED) variable selection algorithm [17] is briefly presented.

#### *3.1. The FBED Variable Selection Algorithm*

In the classical forward selection algorithm, all available predictor variables are repeatedly considered, and their statistical significance is tested at each step. Assume that out of 10,000 predictor variables, only 10 are selected. This implies that almost 10,000 × 10 regression models must be fitted, and the same number of statistical tests must be executed. The computational cost is tremendous, rendering this algorithm impractical and, hence, prohibitive. The authors of [17] introduced the FBED algorithm as a speed-up modification of the traditional forward selection algorithm coupled with the backward selection algorithm [29]. FBED relies on the Early Dropping heuristic to speed up the forward selection. The heuristic drops the statistically non-significant predictor variables at each step, thus removing them from further consideration, resulting in a computationally dramatically cheaper algorithm, which is presented in Algorithm 1. Especially in the case of continuous data, this phase can be completed in a computationally efficient manner using only the correlation matrix (see Appendix B for more details).

#### **Algorithm 1** The FBED variable selection algorithm.


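A minimal Python sketch of a single forward run with Early Dropping is given below. The Fisher-z partial-correlation test and the synthetic example are illustrative choices of ours, not the paper's exact implementation, and the backward phase is omitted (as FEDHC itself skips it).

```python
import numpy as np
from scipy import stats

def pcor_pvalue(y, x, Z):
    """Fisher-z p-value for the partial correlation of y and x given Z."""
    n = len(y)
    if Z.shape[1] > 0:
        # residualise both y and x on [intercept, Z]
        Q, _ = np.linalg.qr(np.column_stack([np.ones(n), Z]))
        y = y - Q @ (Q.T @ y)
        x = x - Q @ (Q.T @ x)
    r = np.corrcoef(x, y)[0, 1]
    z = abs(np.arctanh(r)) * np.sqrt(n - Z.shape[1] - 3)
    return 2 * stats.norm.sf(z)

def fbed_forward(y, X, alpha=0.05):
    """Single forward run with the Early Dropping heuristic."""
    p = X.shape[1]
    selected, pool = [], list(range(p))
    while pool:
        pvals = {j: pcor_pvalue(y, X[:, j], X[:, selected]) for j in pool}
        # Early Dropping: non-significant candidates leave the pool for good
        pool = [j for j in pool if pvals[j] < alpha]
        if pool:
            best = min(pool, key=pvals.get)   # most significant candidate
            selected.append(best)
            pool.remove(best)
    return selected

# y depends on the first two of twenty predictors.
rng = np.random.default_rng(5)
n, p = 2000, 20
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
sel = fbed_forward(y, X, alpha=0.01)
```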

#### *3.2. Skeleton Identification Phase of the FEDHC Algorithm*

The skeleton identification phase of the FEDHC algorithm is the one presented in Algorithm 2, but it must be stressed that the backward phase of FBED is not performed, so as to reduce the computational cost. The FBED algorithm (Algorithm 1) is used for each variable (call it the target variable, *Vi*). This variable selection process is repeated for all variables, and a detected edge is retained only if it was identified from both of its incident variables. If, for example, *Vj* was found to be associated with *Vi*, but *Vi* was not found to be associated with *Vj*, then no edge between *Vi* and *Vj* is added.

#### **Algorithm 2** Skeleton identification phase of the FEDHC algorithm.


#### *3.3. Hill Climbing Phase of the FEDHC Algorithm*

The first phase of FEDHC, MMHC and PCHC is to discover any possible edges between the nodes, using CI tests. In the second phase, a search for the optimal DAG is performed, where edges are turned into arrows or are deleted towards the maximisation of a score metric. This scoring phase performs a greedy HC search in the space of BNs, commencing with an empty graph [13]. The edge deletion or direction reversal that leads to the largest increase in score, in the space of BNs, is applied, and the search continues recursively. (This implies that every time an edge removal or arrow reversal is implemented, a check for cycles is performed. If cycles are created, the operation is cancelled regardless of whether it increases the score.) The fundamental difference from standard greedy search is that the search is constrained to the orientation of the edges discovered by the skeleton identification phase (for more information, see [13]).

Tabu search is an iterative local searching procedure adopted by [13] for this purpose. Its performance is enhanced by using a list where the last 100 structures explored are stored, while searching in the neighbourhood of each solution. The search is thereby capable of escaping from a local optimum, in which normal local search techniques often become stuck. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima [13]. This change may actually reduce the score. When a number of changes (10–15) occur without an increase in the maximum score ever encountered during the search, the algorithm terminates, and the overall best scoring structure is returned.

The Bayesian Information Criterion (BIC) [30] is a frequently used score for continuous data, while other options include the multivariate normal log-likelihood, the Akaike Information Criterion (AIC) and the Bayesian Gaussian equivalent [31] score. (The term "*equivalent*" refers to their attractive property of giving the same score to equivalent structures (Markov equivalent BNs), i.e., structures that are statistically indistinguishable [13].) The Bayesian Dirichlet equivalent (BDE) [32], the BDe uniform score (BDeu) [27], the multinomial log-likelihood score [33] and the MDL score [34,35] are four scoring options for discrete data.
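For continuous data, the BIC of a BN decomposes into node-wise Gaussian regressions of each child on its parents, which is exactly the decomposability the HC search exploits. A minimal sketch (the helper names are ours; the scoring routines in *bnlearn* and *pchc* differ in implementation details):

```python
import numpy as np

def node_bic(X, child, parents):
    """Gaussian BIC contribution of one node: regress the child on its parents."""
    n = X.shape[0]
    y = X[:, child]
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in parents])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    k = Z.shape[1] + 1                      # coefficients plus error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return loglik - 0.5 * k * np.log(n)

def network_bic(X, dag):
    """dag: dict mapping each node to the list of its parents.
    The score decomposes as a sum over nodes."""
    return sum(node_bic(X, v, ps) for v, ps in dag.items())

# The true chain x0 -> x1 -> x2 should outscore the empty graph.
rng = np.random.default_rng(2)
n = 1000
x0 = rng.normal(size=n)
x1 = x0 + 0.5 * rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
true_dag = {0: [], 1: [0], 2: [1]}
empty_dag = {0: [], 1: [], 2: []}
```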

The combination of the FBED algorithm during the skeleton identification phase with the HC scoring method forms the FEDHC algorithm. Interestingly enough, the skeleton identification phase of FEDHC performs substantially fewer statistical tests than PCHC and MMHC.

#### *3.4. Prior Knowledge*

All BN learning algorithms are agnostic of the true relationships among the input data. It is customary, though, for practitioners and researchers to have prior knowledge of the necessary directions (forbidden or not) of some of the relationships among the variables. For instance, variables such as sex or age cannot be caused by any economic or demographic variables. Economic theory (or theory from any other field) can further assist in improving the quality of the fitted BN by imposing or forbidding directions among some variables. All of this prior information can be inserted into the scoring phase of the aforementioned BN learning algorithms, leading to fewer errors and more realistic BNs.

#### *3.5. Theoretical Properties of FEDHC*

The theoretical properties and guarantees of MMHC and PCHC can be found in [13,15], respectively. As for FEDHC, while there is no theoretical guarantee for its skeleton identification phase, Ref. [17] showed that running FBED with two repeats recovers the MB of the response variable if the joint distribution of the response and the predictor variables can be faithfully represented by a BN. When used for BN learning, though, FBED need not be run more than once for each variable. In this case, FBED, similarly to MMHC, will identify the children and parents of a variable *Vi*, but not the spouses of the children [17], as this is not necessary during the skeleton identification phase. When FBED is run on the children of the variable *Vi*, it will again identify the children's parents, who are the spouses of the variable *Vi*. Hence, upon completion, the FEDHC algorithm will have identified the MB of each variable.

Additionally, the Early Dropping heuristic not only reduces the computational time, but also mitigates the problem of multiple testing, in some sense [17]. When FBED is run only once (as in the current situation), in the worst-case scenario, it is expected to select about *α* · *D* variables (where *α* is the significance level), since all other variables will be dropped in the first (filtering) phase. However, simulation studies have shown that FBED selects fewer false positives than expected, and the authors' recommendation is to reduce the number of runs to further limit the number of falsely selected variables, a strategy that FEDHC follows by default.

Similarly to MMHC, FEDHC is a local learning algorithm; hence, during the HC phase, the overall score is decomposed [13], exploiting the Markov property of BNs (1). Local learning has several advantages (see [13]), and the scores (BDe, BIC, etc.) are locally consistent [25].

#### *3.6. Robustification of the FEDHC Algorithm for Continuous Data*

It is not uncommon for economic datasets to contain outliers, observations with values far from the rest of the data. Income is such an example, but if outliers appear only in that variable, their effect will be minor. The effect of outliers propagates when they exist in more variables, and in order to mitigate their effect, they must be identified in the multivariate space. If these outliers are not detected or not treated properly, BN learning algorithms will yield erroneous results. FEDHC employs the Reweighted Minimum Covariance Determinant (RMCD) [36,37] as a means to identify outliers and remove them. (The reason why one cannot use a robust correlation matrix directly is that the independence property between two variables no longer holds true; the robust correlation between any two variables depends on the other variables, so adding or removing a variable modifies the correlation structure [38].)

The RMCD estimator is computed in two stages. In the first stage, a subset of *h* observations (*n*/2 ≤ *h* < *n*) is selected, such that the covariance matrix has the smallest determinant, and a robust mean vector is also computed. The second stage is a re-weighting scheme that increases the efficiency of the estimator, while preserving its high-breakdown properties. A weight *wi* = 0 is given to observations whose first-stage robust distance exceeds a threshold value; otherwise, the weight is *wi* = 1 (*i* = 1, ... , *n*). Using the re-weighted robust covariance matrix and mean vector, robust Mahalanobis distances are computed, $d^2_{i(RMCD)} = \left(\mathbf{x}_i - \tilde{\boldsymbol{\mu}}_{(RMCD)}\right)^T \tilde{\boldsymbol{\Sigma}}^{-1}_{(RMCD)} \left(\mathbf{x}_i - \tilde{\boldsymbol{\mu}}_{(RMCD)}\right)$, and proper cut-off values are required to detect the outliers. Those cut-off values are based on the following accurate approximations [39,40]:

$$d_{i(RMCD)}^2 \sim \begin{cases} \frac{\left(w-1\right)^2}{w}\,\mathrm{Be}\left(\frac{D}{2}, \frac{w-D-1}{2}\right) & \text{if } w_i = 1\\ \frac{w+1}{w}\,\frac{\left(w-1\right)D}{w-D}\,F_{D,w-D} & \text{if } w_i = 0, \end{cases}$$

where $w = \sum_{i=1}^{n} w_i$, and *Be* and *F* denote the Beta and F distributions, respectively.

The observations whose Mahalanobis distance $d^2_{i(RMCD)}$ exceeds the 97.5% quantile of the corresponding distribution (*Be* or *F*) are considered to be outliers and are hence removed from the dataset. The remainder of the dataset, assumed to be outlier-free, is used by FEDHC to learn the BN.
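The two cut-offs can be computed directly from the Beta and F quantile functions; a small sketch of the approximations above (the function name and the example values of *w* and *D* are ours):

```python
from scipy import stats

def rmcd_cutoffs(w, D, q=0.975):
    """97.5% cut-off values for the squared robust distances, following the
    Beta/F approximations. w: sum of the reweighting weights, D: dimension."""
    cut_w1 = (w - 1) ** 2 / w * stats.beta.ppf(q, D / 2, (w - D - 1) / 2)
    cut_w0 = (w + 1) / w * (w - 1) * D / (w - D) * stats.f.ppf(q, D, w - D)
    return cut_w1, cut_w0

# Example: 950 observations kept by the re-weighting, 10 variables.
# Both cut-offs should be close to the 97.5% chi-squared(10) quantile.
c1, c0 = rmcd_cutoffs(w=950, D=10)
print(round(c1, 2), round(c0, 2))
```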

The default value for *h* is [(*n* + *p* + 1)/2], where [·] denotes the integer part. This value was proven to yield the highest breakdown point [41] but, on the other hand, a low efficiency. Changing *h* yields an estimator with weaker robustness properties and increases the computational cost of the RMCD estimator. For these reasons, this is the default value used inside the robustification process of the FEDHC algorithm.

The case of *n* < *p*, and in general *n* ≪ *p* (a very high-dimensional case), can be treated in a similar way, by replacing the RMCD estimator with the high-dimensional MCD approach of [42].

#### **4. Monte Carlo Simulation Studies**

Extensive experiments were conducted on simulated data to investigate the quality of the estimation of FEDHC compared to PCHC and MMHC-2. MMHC-1 participated in the simulation studies only with categorical data and not with continuous data, because it is prohibitively expensive. The continuous data were generated by synthetic BNs that contained various numbers of nodes, with an average of three and five neighbours (edges) for each variable. For each case, 50 random BNs were generated with Gaussian data and various sample sizes. Categorical data were generated utilising two famous (in the BN learning community) realistic BNs, with varying sample sizes. The *R* packages *MXM* [43] and *bnlearn* were used for the BN generation, and the *R* packages *pchc* and *bnlearn* were utilised to apply the FEDHC, PCHC, MMHC-2 and MMHC-1 algorithms, respectively. All simulations were implemented on a desktop computer with an Intel Core i5-9500 CPU at 3.00 GHz with 48 GB RAM and an SSD installed. The *R* codes for the simulation studies are available in the Supplementary Material.

The metrics of quality of the learned BNs were the structural Hamming distance (SHD) [13] of the estimated DAG from the true DAG (this is defined as the number of operations required to make the estimated graph equal to the true graph; instead of the true DAG, the Markov equivalence graph of the true BN is used, that is, some edges have been un-oriented, as their direction cannot be statistically decided, and the transformation from the DAG to the Markov equivalence graph was carried out using Chickering's algorithm [44]), the computational cost and the number of tests performed during the skeleton identification phase, and the total duration of the algorithm. PCHC and MMHC-1 have been implemented in *C++*; hence, the comparison of the execution times is not entirely fair at the programming language level. FEDHC and MMHC-2 have been implemented in *R* (skeleton identification phase) and *C++* (scoring phase).

#### *4.1. Synthetic BNs with Continuous Data*

Let *X* be a variable in *G*, and let *Pa*(*X*) denote the parents of *X* in *G*. The procedure used to generate the data for *X* is summarised in the steps below.


The average number of connecting edges (neighbours) was set to 3 and 5. The higher the number of edges, the denser the network is and the harder the inference becomes. The sample sizes considered were *n* = (100, 500, 1000, 5000, 1 × 10<sup>4</sup>, 3 × 10<sup>4</sup>, 5 × 10<sup>4</sup>, 1 × 10<sup>5</sup>, 3 × 10<sup>5</sup>, 5 × 10<sup>5</sup>, 1 × 10<sup>6</sup>, 3 × 10<sup>6</sup>, 5 × 10<sup>6</sup>).
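A generation scheme of this kind can be sketched as a linear Gaussian recipe applied in topological order, with each variable being a linear combination of its parents plus noise. The coefficient range below is an assumption for illustration, not necessarily the values used in the simulations.

```python
import numpy as np

def simulate_gaussian_bn(parents_of, n, rng):
    """Simulate n samples from a linear Gaussian BN whose nodes are listed
    in topological order (dict order). Coefficient range 0.5-1 is assumed."""
    p = len(parents_of)
    X = np.empty((n, p))
    for v, parents in parents_of.items():
        X[:, v] = rng.normal(size=n)               # noise term
        if parents:
            betas = rng.uniform(0.5, 1.0, size=len(parents))
            X[:, v] += X[:, parents] @ betas       # f(Pa(X)) = Pa(X) beta
    return X

rng = np.random.default_rng(1)
X = simulate_gaussian_bn({0: [], 1: [0], 2: [0, 1]}, n=1000, rng=rng)
print(X.shape)
```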

Figures 2–5 present the SHD and the number of CI tests performed by each algorithm for a range of sample sizes (in log-scale). With three neighbours on average per node, the differences in the SHD are rather small, yet FEDHC achieves lower values. With five neighbours on average, though, the differences are more significant and increase with increasing sample sizes. As for the number of CI tests executed during the skeleton identification phase, FEDHC is the evident winner, as it executes up to six times fewer tests, regardless of the average number of neighbours. Moreover, the SHD of FEDHC is lower than the SHD of its competitors in all cases, and the difference is magnified when the sample size is in the order of millions.

The common observation for all algorithms is that as the dimensionality increases, a greater sample size is required for the SHD to reach low levels. With 20 and 30 variables, the SHD may reach single-digit figures (with three neighbours on average), but with 50 and 100 variables, it never drops below 10. A similar conclusion is drawn when there are five neighbours on average.

**Figure 2.** SHD and number of CI tests against log of sample size for 20 and 30 dimensions, with **3 neighbours** on average. (**a**) SHD vs. log of sample size. (**b**) Number of CI tests vs. log of sample size. (**c**) SHD vs. log of sample size. (**d**) Number of CI tests vs. log of sample size.


**Figure 3.** SHD and number of CI tests against log of sample size for 50 and 100 dimensions, with **3 neighbours** on average. (**a**) SHD vs. log of sample size. (**b**) Number of CI tests vs. log of sample size. (**c**) SHD vs. log of sample size. (**d**) Number of CI tests vs. log of sample size.

**Figure 4.** SHD and number of CI tests against log of sample size for 20 and 30 dimensions, with **5 neighbours** on average. (**a**) SHD vs. log of sample size. (**b**) Number of CI tests vs. log of sample size. (**c**) SHD vs. log of sample size. (**d**) Number of CI tests vs. log of sample size.

**Figure 5.** SHD and number of CI tests against log of sample size for various dimensions, with **5 neighbours** on average. (**a**) SHD vs. log of sample size. (**b**) Number of CI tests vs. log of sample size. (**c**) SHD vs. log of sample size. (**d**) Number of CI tests vs. log of sample size.

#### *4.2. Robustified FEDHC for Continuous Data*

An examination of the robustified FEDHC involves two axes of comparison: the outlier-free case and the case with outliers present. At first, the performances of the raw and the robustified FEDHC in the outlier-free case are evaluated.

Figures 6 and 7 signify that there is no loss in accuracy when using the robustified FEDHC over the raw FEDHC. Computationally speaking, though, the raw FEDHC is significantly faster than the robustified FEDHC. For small sample sizes, the cost of the robustified FEDHC can be 10 times higher, while for large sample sizes, its cost can be double, or only 5% higher than that of the raw FEDHC.

**Figure 6.** Ratios of SHD and computational cost against log of sample size for various dimensions, with **3 neighbours** and **5 neighbours** on average. The ratios depict the errors and computational cost of the raw FEDHC relative to the robustified FEDHC with NO outliers. (**a**,**c**) 3 neighbours. (**b**,**d**) 5 neighbours.


**Figure 7.** Ratios of SHD and computational cost against log of sample size for various dimensions, with **3 neighbours** and **5 neighbours** on average. The ratios depict the errors and computational cost of the raw FEDHC relative to the robustified FEDHC with NO outliers. (**a**,**c**) 3 neighbours. (**b**,**d**) 5 neighbours.

The performances of the raw FEDHC and of the robustified FEDHC in the presence of extreme outliers are evaluated next. The BN generation scheme is the one described in Section 4.1, with the exception that 5% of outlying observations were added. The sample sizes considered are smaller because, although FEDHC is computationally efficient, it becomes really slow in the presence of outliers.

The results presented in Figures 8 and 9 evidently show the gain of using the robustified FEDHC over the raw FEDHC. The SHD of the raw FEDHC increases by at least 100%, and by up to 700% with 3 neighbours on average and 50 variables. The duration of the raw FEDHC also increases substantially. (Similar patterns were observed for the duration of the skeleton learning phase and for the number of CI tests.) The raw FEDHC becomes more than 200 times slower than the robustified version with hundreds of thousands of observations and 50 variables. This is attributed to the HC phase of the raw FEDHC, which consumes a tremendous amount of time; the outliers produce noise that escalates the labour of the HC phase.


**Figure 8.** Ratios of SHD and computational cost against log of sample size for 20 and 30 dimensions with **3 neighbours** and **5 neighbours** on average. The ratios depict the errors and computational cost of the raw FEDHC relative to the robustified FEDHC with 5% outliers. (**a**,**c**) 3 neighbours. (**b**,**d**) 5 neighbours.

**Figure 9.** Ratios of SHD and computational cost against log of sample size for 50 and 100 dimensions with **3 neighbours** and **5 neighbours** on average. The ratios depict the errors and computational cost of the raw FEDHC relative to the robustified FEDHC with 5% outliers. (**a**,**c**) 3 neighbours. (**b**,**d**) 5 neighbours.

#### *4.3. Realistic BNs with Categorical Data*

The *f*(*Pa*(*X*)) function utilised in the continuous data case relies on the *β* coefficients. The larger the magnitude of their values, the stronger the associations represented by the edges become, and hence, the easier they are to identify. For BNs with categorical data, one could apply the same generation technique and then discretise the simulated data. To avoid biased or optimistic estimates favouring one or the other method, two real BNs with categorical data were utilised to simulate data. These are (a) the *Insurance* BN, used for evaluating car insurance risks [45], which consists of 27 variables (nodes) and 52 (directed) edges, and (b) the *Alarm* BN, designed to provide an alarm message system for patient monitoring, which consists of 37 variables and 45 (directed) edges. The *R* package *bnlearn* contains a few thousand categorical instantiations from these BNs, but for the purpose of the simulation studies, more instantiations were generated using the same package. The sample sizes considered were *n* = (1 × 10<sup>4</sup>, 2 × 10<sup>4</sup>, 5 × 10<sup>4</sup>, 1 × 10<sup>5</sup>, 2 × 10<sup>5</sup>, 5 × 10<sup>5</sup>, 1 × 10<sup>6</sup>, 2 × 10<sup>6</sup>, 5 × 10<sup>6</sup>).

Figure 10 shows the SHD and the number of CI tests executed by each algorithm against the sample size. MMHC-1 evidently has the poorest performance in both axes of comparison. Our implementation (MMHC-2) performs substantially better, but the overall winner is PCHC. FEDHC, on the other hand, performs better than MMHC-1 and is the second-best option.

**Figure 10.** SHD and number of CI tests against log of sample size with **categorical** data. (**a**,**c**) SHD vs. log of sample size. (**b**,**d**) Number of CI tests vs. log of sample size.

#### *4.4. Scalability of FEDHC*

The computational cost of each algorithm was also measured and appears in Figure 11 as a function of the sample size. The empirical slopes of all lines in Figure 11 are nearly equal to 1, indicating that the computational cost of FEDHC, PCHC and MMHC-2 scales linearly with the sample size: for any percentage-wise increase in the sample size, the execution time increases by the same percentage. The computational cost of MMHC-1 was not evaluated for the categorical data case because, similarly to the continuous data case, it was too high to evaluate.

It is perhaps surprising that the computational cost of FEDHC is similar to that of PCHC and MMHC-2. In fact, the skeleton identification phase requires about the same amount of time for all of them, only 8 s with 5 million observations. The scoring phase is the most expensive phase of the algorithms, absorbing 73–99% of the total computation time. Regarding FEDHC and MMHC-2, whose initial phase has been implemented in *R*, this can be attributed to the fact that the partial correlation calculations in FEDHC are heavier than those in MMHC-2, because the conditioning set in the former can grow larger than that in MMHC-2, which is always bounded by a pre-specified value *k*. Thus, MMHC-2 performs more, but computationally lighter, calculations than FEDHC.

**Continuous data with 5 neighbours on average.**


**Figure 11.** Scalability of the algorithms with respect to the sample size for some selected cases. The results for the other cases convey the same message and are thus not presented. The left column refers to the skeleton identification phase, whereas the right column refers to both phases.

### **5. Illustration of the Algorithms on Real Economics Data Using the** *R* **Package** *pchc*

The *R* package *pchc* is used in two examples, with the datasets used in [15], to illustrate the performance of FEDHC against its competitors, PCHC, MMHC-1 and MMHC-2. More details about the package can be found in Appendix C. The advantages of BNs have already been discussed in [15]; hence, the examples focus on the comparison of FEDHC to PCHC, MMHC-1 and MMHC-2.

#### *5.1. Expenditure Data*

The first example concerns a dataset with continuous variables containing information on the monthly credit card expenditure of individuals. This is the **Expenditure** dataset [20], publicly accessible via the *R* package *AER* [46]. The dataset contains information about 1319 observations (10% of the original dataset) on the following 12 variables: whether the application for a credit card was accepted or not (**Card**), the number of major derogatory reports (**Reports**), the age in years plus twelfths of a year (**Age**), the yearly income in \$10,000s (**Income**), the ratio of monthly credit card expenditure to yearly income (**Share**), the average monthly credit card expenditure (**Expenditure**), whether the person owns their home or rents (**Owner**), whether the person is self-employed or not (**Selfemp**), the number of dependents plus 1 (**Dependents**), the number of months living at the current address (**Months**), the number of major credit cards held (**Majorcards**) and the number of active credit accounts (**Active**).

The *R* package *AER* contains the data and must be loaded and processed for the algorithms to run.

```
> library(AER)      ## CreditCard data are available here
> library(bnlearn)  ## To run MMHC-1
> data(CreditCard)  ## load the data
> x <- CreditCard
> colnames(x) <- c("Card", "Reports", "Age", "Income", "Share", "Expenditure",
+                  "Owner", "Selfemp", "Dependents", "Months", "Majorcards", "Active")
## Prepare the data
> for (i in 1:12) x[, i] <- as.numeric(x[, i])
> x <- as.matrix(x)
> x[, c(1, 7, 8)] <- x[, c(1, 7, 8)] - 1
## Run all 4 algorithms
> a1 <- bnlearn::mmhc( as.data.frame(x), restrict.args =
+       list(alpha = 0.05, test = "zf") )
> a2 <- pchc::mmhc(x, alpha = 0.05)
> a3 <- pchc::pchc(x, alpha = 0.05)
> a4 <- pchc::fedhc(x, alpha = 0.05)
```
In order to plot the fitted BNs of each algorithm, the following commands were used.

```
> pchc::bnplot(a1)
> pchc::bnplot(a2$dag)
> pchc::bnplot(a3$dag)
> pchc::bnplot(a4$dag)
```

This example shows the practical usefulness of BNs. Evidently, this small-scale experiment demonstrates that companies can customise their products according to the key factors that determine consumers' behaviours. Instead of selecting one variable only, a researcher or practitioner can identify the relationships among all of the variables by estimating the causal mechanistic system that generated the data. The BN can further reveal information about the variables that are statistically significantly related.

According to FEDHC (Figure 12a), the age of the individual affects their income, the number of months they have been living at their current address, whether they own their home or not, and the ratio of their monthly credit card expenditure to their yearly income. The only variables associated with the number of major derogatory reports (Reports) are whether the consumer's application for a credit card was accepted or not (Card) and the number of active credit accounts (Active). In fact, these two variables are parents of Reports, as the arrows are directed towards it. A third advantage of BNs is that they provide a solution to the variable selection problem. The parents of the variable Majorcards (number of major credit cards held) are Card (whether the application for a credit card was accepted or not) and Income (yearly income in \$10,000s), its only child is Active (number of active credit accounts) and its only spouse (parent of Active) is Owner (whether the consumer owns their home). The collection of those parents, children and spouses forms the MB of Majorcards. That is, any other variable does not increase the information on the number of major credit cards held by the consumer. For any given variable, one can straightforwardly obtain (and visualise) its MB, which can be used for the construction of the appropriate linear regression model.

Figure 12 contains the BNs fitted to the expenditure data using both implementations of MMHC, the PCHC and the FEDHC algorithms, with the variables sorted in topological order [44], a tree-like structure. The BICs of the BNs learned by MMHC-1 and MMHC-2 are equal to −32,171.75 and −32,171.22, respectively, while for PCHC and FEDHC, they are both equal to −32,171.75. This is an indication that all four algorithms produced BNs of nearly the same quality. On closer examination of the graphs, one can detect some differences between the algorithms. For instance, **Age** is directly related to **Active** according to PCHC and MMHC-2, but not according to FEDHC and MMHC-1. Further, all algorithms have identified **Owner** as the parent of **Income**, and not vice versa. This is related to the prior knowledge discussed in Section 3.4, and will be examined in the next categorical example dataset.

**Figure 12. Expenditure data**. Estimated BNs using (**a**) FEDHC, (**b**) PCHC, (**c**) MMHC-1 and (**d**) MMHC-2.

#### *5.2. Income Data*

The second example dataset contains categorical variables and originates from an example in the book "The Elements of Statistical Learning" [47]; it is publicly available from the *R* package *arules* [48]. It consists of 6876 instances (obtained from the original dataset with 9409 instances, by removing observations with a missing annual income) and a mixture of 13 categorical and continuous demographic variables. The continuous variables (age, education, income, years in bay area, number in household and number of children) were discretised based on their median values, as suggested by the authors of the package. **Income**: "\$0–\$40,000" or "\$40,000+", **Sex**: "male" or "female", **Marriage**: "married", "cohabitation", "divorced", "widowed" or "single", **Age**: "14–34" or "35+", **Education**: "college graduate" or "no college graduate", **Occupation**: "professional/managerial", "sales", "laborer", "clerical/service", "homemaker", "student", "military", "retired" or "unemployed", **Bay** (number of years in bay area): "1–9" or "10+", **No of people** (number of people living in the house): "1" or "2+", **Children**: "0" or "1+", **Rent**: "own", "rent" or "live with parents/family", **Type**: "house", "condominium", "apartment", "mobile home" or "other", **Ethnicity**: "American Indian", "Asian", "black", "east Indian", "hispanic", "white", "pacific islander" or "other" and **Language** (language spoken at home): "english", "spanish" or "other".

The dataset is first loaded via the *R* package *arules* and is preprocessed, as suggested in *arules*.

```
> library(arules)
> data(IncomeESL)
## remove incomplete cases
> IncomeESL <- IncomeESL[complete.cases(IncomeESL), ]
## preparing the data set
> IncomeESL[["income"]] <- factor((as.numeric(IncomeESL[["income"]]) > 6) + 1,
+ levels = 1:2, labels = c("0-40,000", "40,000+"))
> IncomeESL[["age"]] <- factor((as.numeric(IncomeESL[["age"]]) > 3) + 1,
+ levels = 1:2, labels = c("14-34", "35+"))
> IncomeESL[["education"]] <- factor((as.numeric(IncomeESL[["education"]]) > 4) +
+ 1, levels = 1:2, labels = c("no college graduate", "college graduate"))
> IncomeESL[["years in bay area"]] <- factor(
+ (as.numeric(IncomeESL[["years in bay area"]]) > 4) + 1,
+ levels = 1:2, labels = c("1-9", "10+"))
> IncomeESL[["number in household"]] <- factor(
+ (as.numeric(IncomeESL[["number in household"]]) > 3) + 1,
+ levels = 1:2, labels = c("1", "2+"))
> IncomeESL[["number of children"]] <- factor(
+ (as.numeric(IncomeESL[["number of children"]]) > 1) + 0,
+ levels = 0:1, labels = c("0", "1+"))
```

Some more steps are required prior to running the BN algorithms.

```
> x <- IncomeESL
> x <- x[, -8]
```

```
> colnames(x) <- c( "Income", "Sex", "Marriage", "Age", "Education", "Occupation",
+ "Bay", "No of people", "Children", "Rent", "Type", "Ethnicity", "Language" )
> nam <- colnames(x)
```
The importance of prior knowledge incorporation discussed in Section 3.4 becomes evident in this example. Prior knowledge can be added in the **blacklist** argument denoting the forbidden directions (arrows).

```
> black <- matrix(nrow = 26, ncol = 2)
> black <- as.data.frame(black)
> for (i in 1:13) black[i, ] <- c(nam[i], nam[2])
> for (i in 1:13) black[13 + i, ] <- c(nam[i], nam[4])
> black <- black[-c(2, 17), ]
> black <- rbind( black, c(nam[9], nam[3]) )
> black <- rbind( black, c(nam[3], nam[6]) )
> black <- rbind( black, c(nam[9], nam[6]) )
> black <- rbind( black, c(nam[6], nam[5]) )
> black <- rbind( black, c(nam[3], nam[1]) )
> black <- rbind( black, c(nam[1], nam[5]) )
> black <- rbind( black, c(nam[10], nam[1]) )
> black <- rbind( black, c(nam[10], nam[5]) )
> black <- rbind( black, c(nam[10], nam[6]) )
> black <- rbind( black, c(nam[13], nam[12]) )
> colnames(black) <- c("from", "to")
```
Finally, the four BN algorithms are applied to the Income data.

```
> b1 <- bnlearn::mmhc( x, blacklist = black, restrict.args =
+ list(alpha = 0.05, test = "mi") )
> b2 <- pchc::mmhc(x, method = "cat", alpha = 0.05, blacklist = black,
+ score = "bic")
```

```
> b3 <- pchc::pchc(x, method = "cat", alpha = 0.05, blacklist = black,
+ score = "bic")
> b4 <- pchc::fedhc(x, method = "cat", alpha = 0.05, blacklist = black,
+ score = "bic")
```
Figure 13 presents the fitted BNs of the MMHC-1, MMHC-2, PCHC and FEDHC algorithms. There are some distinct differences between the algorithms. For instance, PCHC is the only algorithm that has not identified **Education** as the parent of **Bay**. Additionally, the BN learned by MMHC-2 is the densest one (contains more arrows), whereas PCHC learned the BN with the fewest arrows.

This example further demonstrates the necessity of prior knowledge. BN learning algorithms fit a model to the data, ignoring the underlying truth and the relevant economic theory. Economic theory can be used as prior knowledge to help mitigate such errors and lead to more truthful BNs. Excluding the blacklist argument (forbidden directions) would yield some irrational directions; for instance, that occupation or age might affect sex, or that marriage affects age, simply because these directions could increase the score. Finally, BNs are related to synthetic population generation, where the data are usually categorical. This task requires the specification of the joint distribution of the data, and BNs accomplish this [2]. Based on the Markov condition (1), the joint distribution can be written down explicitly, allowing for synthetic population generation in a sequential order. One commences by generating values for education and sex. Using these two variables, values for occupation are generated. These values, along with income and age, can be used to generate the marital status, and so on.
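The sequential generation described above can be sketched as follows (a Python sketch; the conditional probability tables are made-up illustrative numbers, not estimates from the income data):

```python
import random

random.seed(42)

# Markov factorisation of a tiny illustrative BN:
# P(Sex, Education, Occupation) = P(Sex) * P(Education) * P(Occupation | Sex, Education).
p_sex = {"male": 0.5, "female": 0.5}
p_edu = {"college": 0.4, "no college": 0.6}
p_occ = {
    ("male", "college"):      {"professional": 0.6, "laborer": 0.4},
    ("male", "no college"):   {"professional": 0.2, "laborer": 0.8},
    ("female", "college"):    {"professional": 0.7, "laborer": 0.3},
    ("female", "no college"): {"professional": 0.3, "laborer": 0.7},
}

def draw(dist):
    """Sample one value from a {value: probability} dictionary."""
    values, weights = zip(*dist.items())
    return random.choices(values, weights=weights, k=1)[0]

def generate_person():
    """Generate one synthetic individual in topological order:
    parents are sampled before their children."""
    sex = draw(p_sex)
    edu = draw(p_edu)
    occ = draw(p_occ[(sex, edu)])
    return {"Sex": sex, "Education": edu, "Occupation": occ}

population = [generate_person() for _ in range(1000)]
```

Each node is sampled conditionally on its already-generated parents, which is exactly the sequential scheme the Markov condition permits.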

**Figure 13. Income data.** Estimated BNs using (**a**) FEDHC, (**b**) PCHC, (**c**) MMHC-1 and (**d**) MMHC-2.

#### **6. Conclusions**

This paper proposed combining the first phase of the FBED variable selection algorithm with the HC scoring phase, leading to a new hybrid algorithm, termed FEDHC. Additionally, a new implementation of the MMHC algorithm was provided. Finally, the paper presented robustified (against outliers) versions of FEDHC, PCHC and MMHC. The robustified version of FEDHC was shown to be nearly 40 times faster than the raw version, and yielded BNs of higher quality when outliers were present. Simulation studies showed that, in terms of computational efficiency, FEDHC is comparable to PCHC, and along with MMHC-2, FEDHC was able to fit BNs to continuous data with sample sizes in the order of hundreds of thousands in a few seconds, and in the order of millions in a few minutes. It must be highlighted, though, that the skeleton identification phases of FEDHC and MMHC-1 have been implemented in *R* and not in *C++*. Additionally, FEDHC always executed significantly fewer CI tests than its competitors. Ultimately, in terms of accuracy, FEDHC outperformed its competitors with continuous data, and it was more accurate than or on par with MMHC-1 and MMHC-2 with categorical data, but less accurate than PCHC.

The rationale of MMHC and PCHC is to perform variable selection for each node, and then to apply HC to the resulting network. In the same spirit, Ref. [49] used LASSO for variable selection with the aim of constructing an undirected network. The HC phase could be incorporated in the graphical LASSO to learn the underlying BN. Broadly speaking, the combination of a network learning phase with a scoring phase can yield hybrid algorithms. Other modern hybrid methods for BN learning include [50] on hybrid structure learning and sampling. They combine constraint-based pruning with MCMC inference schemes (also to improve the overall search space) and find a combination that is relatively efficient, with relatively good performance. The constraint-based part is interchangeable, and could connect well with MMHC, PCHC or FEDHC.

FEDHC is not the first algorithm to have outperformed MMHC. Recent algorithms include PCHC [15], the SP algorithm for Gaussian DAGs [51] and NOTEARS [52]. The algorithms of [53,54] were also shown to outperform MMHC in the presence of latent confounders, which were not examined here. The advantage of the latter two is that they employ non-parametric tests, such as the kernel CI test, thus allowing for non-linear relationships. BN algorithms that detect non-linear relationships among the variables, such as those proposed in [53,54], were not covered in this paper. Further, our comparative analysis only included MMHC [13] and PCHC [15], due to their close relationship with FEDHC.

Future research includes a comparison of all algorithms along several further directions. For instance, (a) the effect of the Pearson and Spearman CI tests, and of the *X*<sup>2</sup> and *G*<sup>2</sup> CI tests, (b) the effect of outliers, (c) the effect of the scoring methods (Tabu search and HC), (d) the effect of the average number of neighbours (network density), and (e) the effect of the number of variables on the quality of the BN learned by either algorithm. These directions can be used to numerically evaluate the asymptotic properties of the BN learning algorithms with tens of millions of observations. Another interesting direction is the incorporation of fast non-linear CI tests, such as the distance correlation [55–58]. The distance correlation could be utilised during the skeleton identification phase of FEDHC, mainly because FEDHC performs fewer CI tests than its competitors.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math10152604/s1, *R* codes for simulation studies.

**Funding:** This research received no external funding.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Conditional Independence Tests**

The type of CI test executed during the skeleton identification phase depends upon the nature of the data. Let *X* and *Y* be two random variables, and **Z** be a (possibly empty) set of random variables. Statistically speaking, *X* and *Y* are conditionally independent given **Z** (*X* ⊥⊥ *Y*|**Z**) if *P*(*X*,*Y*|**Z**) = *P*(*X*|**Z**) · *P*(*Y*|**Z**), and this holds for all values of *X*, *Y* and **Z**. Equivalently, the conditional independence of *X* and *Y* given **Z** implies *P*(*X*|*Y*, **Z**) = *P*(*X*|**Z**) and *P*(*Y*|*X*, **Z**) = *P*(*Y*|**Z**).
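The definition can be checked numerically on a toy joint distribution that satisfies the factorisation by construction (a Python sketch; all probability tables are illustrative):

```python
from itertools import product

# Toy joint distribution built so that X is independent of Y given Z:
# P(x, y, z) = P(z) * P(x | z) * P(y | z).
p_z = {0: 0.3, 1: 0.7}
p_x_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.6, 1: 0.4}}
p_y_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_z[z][x] * p_y_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def cond_xy(x, y, z):   # P(X = x, Y = y | Z = z)
    return joint[(x, y, z)] / p_z[z]

def cond_x(x, z):       # P(X = x | Z = z), marginalising over Y
    return sum(joint[(x, y, z)] for y in (0, 1)) / p_z[z]

def cond_y(y, z):       # P(Y = y | Z = z), marginalising over X
    return sum(joint[(x, y, z)] for x in (0, 1)) / p_z[z]

# P(X, Y | Z) = P(X | Z) * P(Y | Z) must hold for every cell.
factorises = all(abs(cond_xy(x, y, z) - cond_x(x, z) * cond_y(y, z)) < 1e-12
                 for x, y, z in product((0, 1), repeat=3))
```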

#### *Appendix A.1. Pearson Correlation for Continuous Data*

A frequently employed CI test for two continuous variables *X* and *Y* conditional on a set of variables **Z** is the partial correlation test [59] that assumes linear relationships among the variables. The test statistic for the partial Pearson correlation is given by:

$$T_P = \frac{1}{2} \left| \log \frac{1 + r_{X,Y|\mathbf{Z}}}{1 - r_{X,Y|\mathbf{Z}}} \right| \sqrt{n - |\mathbf{Z}| - 3},\tag{A1}$$

where *n* is the sample size, |**Z**| denotes the number of conditioning variables and $r_{X,Y|\mathbf{Z}}$ is the partial Pearson correlation of *X* and *Y* conditioning on **Z**. (The partial correlation is efficiently computed using the correlation matrix of *X*, *Y* and **Z** [59]. In the *R* package *Rfast*, the implementation of the PC algorithm compares $T_P$ (A1) against a *t* distribution with $n - |\mathbf{Z}| - 3$ degrees of freedom, whereas the MMHC algorithm in the *R* package *bnlearn* compares $T_P$ against the standard normal distribution. The differences are evident in small sample sizes, but become negligible when the sample sizes are in the order of a few tens.) When **Z** is empty (|**Z**| = 0), the partial correlation reduces to the usual Pearson correlation coefficient.
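As a sketch, the statistic in (A1) is straightforward to compute from a sample; the unconditional case |**Z**| = 0 is shown below, and the helper names are our own (Python, illustrative):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_stat(r, n, nz=0):
    """Test statistic T_P of (A1); nz = |Z|, the number of conditioning
    variables (0 for the unconditional test)."""
    return 0.5 * abs(math.log((1 + r) / (1 - r))) * math.sqrt(n - nz - 3)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]   # nearly linear in x

t_stat = fisher_z_stat(pearson(x, y), len(x))   # unconditional: |Z| = 0
# A large t_stat (compared against N(0,1) or a t distribution) rejects independence.
```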

#### *Appendix A.2. Spearman Correlation for Continuous Data*

The non-parametric alternative, which is more robust to outliers, is the Spearman correlation coefficient. The Spearman correlation is equivalent to the Pearson correlation applied to the ranks of the variables. Its test statistic, however, is given by $T_S = T_P \times 1.029563$ [60,61].
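The rank-based equivalence can be verified numerically: on a monotone but non-linear relationship, the Pearson correlation of the ranks equals 1, while the raw Pearson correlation does not (a Python sketch with illustrative data):

```python
import math

def ranks(v):
    """Rank transform (the example has no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [2.0, 9.0, 1.0, 7.0, 5.0, 4.0]
y = [8.0, 100.0, 3.0, 40.0, 20.0, 15.0]   # monotone in x, but non-linear

rho_s = pearson(ranks(x), ranks(y))   # Spearman = Pearson on the ranks
rho_p = pearson(x, y)                 # raw Pearson, reduced by the non-linearity
# The scaled statistic T_S = T_P(ranks) * 1.029563 would then be compared
# against the standard normal distribution.
```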

#### *Appendix A.3. G*<sup>2</sup> *Test of Independence for Categorical Data*

The *G*<sup>2</sup> test of independence of two categorical variables *X* and *Y*, conditional on a set of variables **Z**, is defined as [62]:

$$G^2 = 2\sum_{l} \sum_{i,j} O_{ij|l} \log \frac{O_{ij|l}}{E_{ij|l}},\tag{A2}$$

where $O_{ij|l}$ are the observed frequencies of the *i*-th and *j*-th values of *X* and *Y*, respectively, for the *l*-th value of **Z**, and $E_{ij|l}$ are their corresponding expected frequencies, computed as $E_{ij|l} = \frac{O_{i+|l}\, O_{+j|l}}{O_{++|l}}$, where $O_{i+|l} = \sum_{j} O_{ij|l}$, $O_{+j|l} = \sum_{i} O_{ij|l}$ and $O_{++|l} = n_l$. Under the conditional independence assumption, the $G^2$ test statistic follows the $\chi^2$ distribution with $(|X|-1)(|Y|-1)|\mathbf{Z}|$ degrees of freedom, where $|X|$ and $|Y|$ denote the numbers of distinct values of *X* and *Y*, and $|\mathbf{Z}|$ refers to the cardinality of **Z**, the total number of value combinations of **Z**.
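A minimal sketch of the statistic in (A2), computed over a list of per-stratum contingency tables, one table per value *l* of **Z** (Python; the function name and data are illustrative):

```python
import math

def g2_stat(tables):
    """G^2 statistic of (A2): `tables` is a list of 2-D contingency tables of
    X versus Y, one table per value l of the conditioning set Z."""
    g2 = 0.0
    for t in tables:
        n_l = sum(sum(row) for row in t)
        row_tot = [sum(row) for row in t]                # O_{i+|l}
        col_tot = [sum(t[i][j] for i in range(len(t)))   # O_{+j|l}
                   for j in range(len(t[0]))]
        for i in range(len(t)):
            for j in range(len(t[0])):
                e = row_tot[i] * col_tot[j] / n_l        # expected E_{ij|l}
                o = t[i][j]                              # observed O_{ij|l}
                if o > 0:                                # lim_{x->0} x log x = 0
                    g2 += 2.0 * o * math.log(o / e)
    return g2

# Two strata (|Z| = 2); within each one, X and Y are close to independent.
tables = [[[20, 30], [22, 28]],
          [[10, 40], [12, 38]]]
g2 = g2_stat(tables)   # compared against chi^2 with (2-1)*(2-1)*2 = 2 df
```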

#### Appendix A.3.1. *X*<sup>2</sup> Test of Independence for Categorical Data

Alternatively, one could use the Pearson test statistic $X^2 = \sum_{l} \sum_{i,j} \frac{\left(O_{ij|l} - E_{ij|l}\right)^2}{E_{ij|l}}$, which has the same asymptotic properties as the $G^2$ test statistic (A2). The drawback of $X^2$ is that it cannot be computed when $E_{ij|l} = 0$. On the contrary, $G^2$ is computable in such cases, since $\lim_{x \to 0} x \log x = 0$. When **Z** is the empty set, both tests examine the unconditional association between the variables *X* and *Y*. (For a practical comparison between the two tests based on extensive simulation studies, see [63].)

#### Appendix A.3.2. Permutation-Based *p*-Values

The aforementioned test statistics produce asymptotic *p*-values. In the case of small sample sizes, computationally intensive methods such as permutations might be preferable. With continuous variables, for instance, when testing for unconditional independence, the idea is to permute the pairs multiple times and, each time, calculate the relevant test statistic. For the conditional independence of *X* and *Y* given **Z**, the partial correlation is computed from the residuals of two linear regression models, *X*∼**Z** and *Y*∼**Z**, and the pairs of residual vectors are permuted multiple times. With categorical variables, this approach is more complicated, and care must be taken so as to retain the row and column totals of the resulting contingency tables. In either case, the *p*-value is computed as the proportion of times that the permuted test statistics exceed the observed test statistic computed on the original data. Permutation-based techniques have been shown to improve the quality of BNs [64] in small-sample cases. On the contrary, the FEDHC algorithm aims at making inferences on datasets with large sample sizes, for which asymptotic statistical tests are valid and reliable enough to produce the correct decisions.
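For the unconditional continuous case, the permutation scheme can be sketched as follows (a Python sketch; `perm_pvalue` and the add-one correction are our own illustrative choices):

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def perm_pvalue(x, y, B=500, seed=7):
    """Permutation p-value for the unconditional independence of x and y:
    shuffle one variable B times and count how often the permuted |r|
    reaches the observed |r| (with an add-one correction)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    yp = list(y)
    exceed = 0
    for _ in range(B):
        rng.shuffle(yp)
        if abs(pearson(x, yp)) >= observed:
            exceed += 1
    return (exceed + 1) / (B + 1)

x = [float(i) for i in range(20)]
p_dep = perm_pvalue(x, x)   # perfectly dependent pair: tiny p-value
p_any = perm_pvalue(x, [float((7 * i) % 20) for i in range(20)])
```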

#### **Appendix B. Computational Details of FEDHC**

With continuous data, the correlation matrix is computed once and utilised throughout the skeleton identification phase. FEDHC returns the correlation matrix and the matrix of the *p*-values of all pairwise associations, which are useful in a second run of the algorithm with a different significance level. This is a significant advantage when BNs have to be fitted to large-scale datasets, as the correlation matrix can be given as an input to FEDHC to further reduce its computational cost.

The partial correlation coefficient is given by:

$$r_{X,Y|\mathbf{Z}} = \begin{cases} \dfrac{R_{X,Y} - R_{X,Z} R_{Y,Z}}{\sqrt{\left(1 - R_{X,Z}^2\right)\left(1 - R_{Y,Z}^2\right)}} & \text{if } |\mathbf{Z}| = 1 \\[2ex] -\dfrac{\mathbf{A}_{1,2}}{\sqrt{\mathbf{A}_{1,1}\, \mathbf{A}_{2,2}}} & \text{if } |\mathbf{Z}| > 1 \end{cases},$$

where $R_{X,Y}$ is the correlation between the variables *X* and *Y*, and $R_{X,Z}$ and $R_{Y,Z}$ denote the correlations between *X* & *Z* and *Y* & *Z*, respectively. $\mathbf{A} = \mathbf{R}^{-1}_{X,Y,\mathbf{Z}}$, where $\mathbf{R}_{X,Y,\mathbf{Z}}$ denotes the sub-correlation matrix of the variables *X*, *Y*, **Z**, and $\mathbf{A}_{i,j}$ symbolises the element in the *i*-th row and *j*-th column of $\mathbf{A}$.
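The two branches agree when |**Z**| = 1, since the matrix-inverse expression specialises to the closed form; a self-contained check with a 3 × 3 inversion via the adjugate and illustrative correlation values (Python sketch):

```python
import math

def partial_corr_z1(rxy, rxz, ryz):
    """Closed-form branch for |Z| = 1."""
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

def invert3(m):
    """Inverse of a 3x3 matrix via the adjugate (no external libraries)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e*i - f*h) - b * (d*i - f*g) + c * (d*h - e*g)
    adj = [[ (e*i - f*h), -(b*i - c*h),  (b*f - c*e)],
           [-(d*i - f*g),  (a*i - c*g), -(a*f - c*d)],
           [ (d*h - e*g), -(a*h - b*g),  (a*e - b*d)]]
    return [[adj[r][s] / det for s in range(3)] for r in range(3)]

# Illustrative correlations among X, Y and a single conditioning variable Z.
rxy, rxz, ryz = 0.6, 0.5, 0.4
R = [[1.0, rxy, rxz],
     [rxy, 1.0, ryz],
     [rxz, ryz, 1.0]]

A = invert3(R)                                   # A = R^{-1}
r_matrix = -A[0][1] / math.sqrt(A[0][0] * A[1][1])
r_closed = partial_corr_z1(rxy, rxz, ryz)        # both branches must agree
```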

The CI tests executed during the initial phase compute the logarithm of the *p*-value, instead of the *p*-value itself, to avoid the numerical underflows observed with a large test statistic, which would otherwise produce a *p*-value equal to 0. Additionally, the computational cost of FEDHC's first phase can be further reduced via parallel programming.
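A sketch of the underflow issue and the log-*p*-value remedy for a normal test statistic; the asymptotic tail expansion used for large statistics below is a standard one, not necessarily the package's exact implementation (Python, illustrative):

```python
import math

def log_tail(z):
    """Asymptotic log of the two-sided p-value 2*P(N(0,1) > z) for large z,
    using the standard expansion P(N(0,1) > z) ~ phi(z)/z * (1 - 1/z^2 + 3/z^4)."""
    return (math.log(2.0) - 0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
            - math.log(z) + math.log(1.0 - 1.0 / z**2 + 3.0 / z**4))

def log_pvalue(z):
    """log of the two-sided normal p-value; finite even when the p-value
    itself underflows to 0 in double precision."""
    z = abs(z)
    p = math.erfc(z / math.sqrt(2.0))   # equals 2*P(N(0,1) > z)
    return math.log(p) if p > 0.0 else log_tail(z)

lp = log_pvalue(40.0)   # the p-value itself would be exactly 0.0 here
```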

It is also possible to store the *p*-values of each CI test for future reference. When a different significance level must be used, this will further decrease the associated computational cost of the skeleton identification phase in a second run. However, as shown in Section 4.4, the cost of this phase is very small (a few seconds), even for millions of observations. The largest portion of this phase's computational cost is attributed to the calculation of the correlation matrix, which can be passed into subsequent runs of the algorithm.

Finally, Ref. [15] disregarded the potential of applying the PC-orientation rules [22,24] prior to the scoring phase as a means of improving the performance of FEDHC and MMHC, and this is not pursued any further.

### **Appendix C. The** *R* **Package** *pchc*

The package *pchc* was first launched in *R* in July 2020 and initially contained the PCHC algorithm. It now includes the FEDHC and MMHC-2 algorithms, functions for testing (un)conditional independence with continuous and categorical data, data generation, BN visualisation and utility functions. It imports the *R* packages *bnlearn* and *Rfast*, and the built-in package *stats*. *pchc* is distributed as part of the CRAN R package repository and is compatible with MacOS-X, Windows, Solaris and Linux operating systems. Once the package is installed and loaded,

```
> install.packages("pchc")
> library(pchc)
```

it is ready to use without internet connection. The signature of the function **fedhc**, along with a short explanation of its arguments, is displayed below.

```
> fedhc(x, method = "pearson", alpha = 0.05, robust = FALSE, ini.stat = NULL,
+ R = NULL, restart = 10, score = "bic-g", blacklist = NULL, whitelist = NULL)
```

The output of the **fedhc** function is a list including:


#### **References**


*Article* ***Jewel 2.0*: An Improved Joint Estimation Method for Multiple Gaussian Graphical Models**

**Claudia Angelini 1,†, Daniela De Canditiis 2,† and Anna Plaksienko 1,\*,†**


**Abstract:** In this paper, we consider the problem of estimating the graphs of conditional dependencies between variables (i.e., graphical models) from multiple datasets under Gaussian settings. We present *jewel 2.0*, which improves our previous method *jewel 1.0* by modeling commonality and class-specific differences in the graph structures and better estimating graphs with hubs, making this new approach more appealing for biological data applications. We introduce these two improvements by modifying the regression-based problem formulation and the corresponding minimization algorithm. We also present, for the first time in the multiple graphs setting, a stability selection procedure to reduce the number of false positives in the estimated graphs. Finally, we illustrate the performance of *jewel 2.0* through simulated and real data examples. The method is implemented in the new version of the R package jewel.

**Keywords:** group lasso penalty; data integration; network estimation; stability selection

**MSC:** 62A09; 62J07; 92-08

#### **1. Introduction**

Gaussian graphical models (GGMs) are becoming essential tools for studying the relationships between Gaussian variables. They are pervasive in many fields, especially biology and medicine, where the variables are usually genes or proteins, and the edges represent their interactions. In addition, researchers now often have several datasets measuring the same set of variables, i.e., data collected under slightly different conditions, on varying types of equipment, in different labs, or even for sub-types of disease. In this case, we expect most connections between variables to be the same or similar across the datasets; hence, GGM joint estimation is desirable to improve the estimation accuracy.

In [1], we proposed *jewel*, a technique for the joint estimation of a GGM in the multiple dataset framework under the hypothesis that the graph structure is the same across the classes. The *jewel* technique uses a node-wise regression approach (based on [2]) with a group lasso penalty and has the advantage of providing a symmetric estimate of the adjacency matrix, which encodes the graph's structure. In this paper, we present *jewel 2.0*, which attempts to extend *jewel* by addressing several aspects and enlarging its range of applications.

Firstly, we modified the problem formulation to allow the simultaneous modeling of commonalities and differences between datasets. In particular, we estimated a graph for each class or dataset, assuming that all graphs share a considerable amount of information (denoted as the common part in the text, representing the edges present in all graphs) but can exhibit some class-specific differences (representing the edges present only in some graphs). Such an assumption significantly extends the range of applications, since the estimation of class-specific differences is of interest in several contexts, for example, when we want to estimate gene regulatory networks associated with different disease sub-types.


**Citation:** Angelini, C. ; De Canditiis, D.; Plaksienko, A. *Jewel 2.0*: An Improved Joint Estimation Method for Multiple Gaussian Graphical Models. *Mathematics* **2022**, *10*, 3983. https://doi.org/10.3390/math 10213983

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 5 September 2022 Accepted: 22 October 2022 Published: 26 October 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

In such cases, the common part captures the main mechanisms associated with the disease, while the class-specific model captures the differences related to a particular sub-type. Although other joint estimation methods (see Section 3.3) have previously allowed this extension, *jewel 2.0* is the first method that both provides symmetric estimates of the graphs (as with our previous *jewel 1.0*) and accounts for differences between graphs.

Secondly, we improved the method's performance when applied to graphs with hubs, which often describe biological networks. In the numerical studies presented in [1], we noticed that the performance of *jewel*, as well as of the other joint estimation methods, deteriorates when the underlying graphs have prominent hubs (i.e., a few nodes with high degrees). We argued that this phenomenon was due to the type of "democratic" penalty used in the problem formulation that does not allow some vertices to get the power to become a hub/leader. Therefore, in *jewel 2.0*, we introduced specific weights into the penalty function to enable hubs to emerge. The weights reduce the penalization of edges incident to the potential hub node, allowing even more edges to "join" that vertex. This modification considerably improves the performance of *jewel 2.0*, as shown by numerical simulations carried out on synthetic graphs.

Finally, in this paper, we also incorporate a stability selection procedure to reduce the number of false positives (i.e., estimated edges that are not truly present in the underlying graph). In the high-dimensional context (i.e., when *p* >> *n*), methods often suffer from increasing false positives or lack of power, depending on the choice of the regularization parameter. Although simulations show that the overall performance is competitive, estimated graphs must be sparse to be interpretable, but the presence of false positive edges decreases the interpretability of the estimated network. We observed a significant presence of incorrectly estimated edges in all methods we compared (under high dimensional settings), although ROC curves are not dramatically impacted. Stability selection procedures can alleviate the problem of false positives at the price of a slight loss of statistical power. Our stability procedure extends the ideas of [3] to the multiple graph context, suggesting the choice of only edges that persistently appear in the several runs of the method on subsampled data; see Section 2.4 for details. To the best of our knowledge, *jewel 2.0* is the first method that employs a stability selection procedure to refine the estimated networks in the context of joint inference.

The rest of the paper is organized as follows. In Section 2, we provide the necessary background and introduce the *jewel 2.0* method as well as the numerical algorithm for its evaluation. In Section 3, we use synthetic data to demonstrate the performance of *jewel 2.0* in several scenarios and present a comparison with other existing methods. Then, in Section 4, we demonstrate the application of *jewel 2.0* to breast cancer gene expression datasets. Finally, in Section 5, we discuss our study's advantages and limitations and provide directions for future work.

#### **2. Materials and Methods**

This section introduces the mathematical notations and the assumptions made in this paper. Then, we describe the penalization model that allows the approach both to incorporate differences among classes and to handle the presence of hubs. After that, we propose the *jewel 2.0* algorithm for the joint estimation of Gaussian graphical models. Finally, the section also describes the choice of the regularization parameters and the stability selection procedure and highlights the differences from our previous method, *jewel 1.0*.

#### *2.1. Problem Set-Up*

Let $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \dots, \mathbf{X}^{(K)}$ be $K \ge 2$ datasets containing measurements of (almost) the same variables under *K* different but similar conditions (e.g., sub-types of disease) or collected in distinct classes (e.g., different equipment or laboratories). Each dataset $\mathbf{X}^{(k)}$ is an $n_k \times p_k$ data matrix with $n_k$ observations of $p_k$ variables. We assume that the observations $(x^{(k)}_1, \dots, x^{(k)}_{n_k})$ are independent and identically distributed samples from a $p_k$-variate Gaussian distribution with zero mean and covariance matrix $\boldsymbol{\Sigma}^{(k)}$.

Every distribution $\mathcal{N}(0, \boldsymbol{\Sigma}^{(k)})$ is associated with a graphical model $G^{(k)} = (V_k, E_k)$, where the vertices correspond to the random variables, i.e., $V_k = \{X^{(k)}_1, \dots, X^{(k)}_{p_k}\}$, and the absence of an edge (with $E_k \subseteq V_k \times V_k$) implies the conditional independence of the corresponding variables, i.e., $(i,j) \notin E_k \Leftrightarrow X^{(k)}_i \perp\!\!\!\perp X^{(k)}_j \mid X^{(k)}_{\{l,\ l \neq i,j\}}$. It is well known that the support of the precision matrix encodes the graph structure, i.e., $(i,j) \in E_k \Leftrightarrow \Omega^{(k)}_{ij} \neq 0$, where $\boldsymbol{\Omega}^{(k)} = (\boldsymbol{\Sigma}^{(k)})^{-1}$ is the true precision matrix for the *k*-th class. Therefore, for each dataset, the problem of graph estimation becomes equivalent to estimating the support of the precision matrix, as illustrated in Figure 1. However, when the *K* datasets share some dependency structure (as occurs when we consider similar datasets), a joint inference can be more powerful and accurate, as already shown in [1] and the references therein. The joint inference requires modeling how the information about the graphs $G^{(k)}$ is shared across the different datasets.

**Figure 1.** Idea of regression-based graphical model estimation for a single dataset **X**: estimate the regression coefficient matrix **Θ** from the data and use its support as an adjacency matrix of the graph *G*. See text for details.
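The support equivalence between the precision matrix and the regression-coefficient matrix can be illustrated directly; `omega` below is a made-up 3 × 3 precision matrix, and `theta` follows the construction $\Theta_{ij} = -\Omega_{ij}/\Omega_{ii}$, $\Theta_{ii} = 0$ used by *jewel* (Python sketch):

```python
# Made-up 3-variable precision matrix: variables 1 and 3 are conditionally
# independent given variable 2, encoded by Omega_{13} = Omega_{31} = 0.
omega = [[ 2.0, -1.0,  0.0],
         [-1.0,  2.0, -0.5],
         [ 0.0, -0.5,  1.0]]
p = len(omega)

# Regression-coefficient matrix: Theta_ij = -Omega_ij / Omega_ii, Theta_ii = 0.
theta = [[0.0 if i == j else -omega[i][j] / omega[i][i] for j in range(p)]
         for i in range(p)]

# Since Omega_ii > 0, the off-diagonal supports coincide, so either matrix
# yields the same edge set of the graph G.
edges_omega = {(i, j) for i in range(p) for j in range(p)
               if i != j and omega[i][j] != 0.0}
edges_theta = {(i, j) for i in range(p) for j in range(p)
               if i != j and theta[i][j] != 0.0}
```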

In [1], we proposed the following joint estimation approach. We started from the assumption that, although the covariance matrices $\boldsymbol{\Sigma}^{(k)}$, $k = 1, \dots, K$, might be different, the structures of the *K* graphs coincide, i.e., the graph *G* is the same across all datasets. Therefore, we needed to estimate the same support from *K* precision matrices. However, instead of doing that directly, we estimated the support of *K* regression coefficient matrices. We defined them as $\boldsymbol{\Theta}^{(k)} \in \mathbb{R}^{p_k \times p_k}$ with entries $\Theta^{(k)}_{ij} = -\Omega^{(k)}_{ij} / \Omega^{(k)}_{ii}$ and $\Theta^{(k)}_{ii} = 0$. By construction, the support of the extra-diagonal entries of $\boldsymbol{\Theta}^{(k)}$ also encodes the graph structure, i.e., $(i,j) \in E_k \Leftrightarrow \Theta^{(k)}_{ij} \neq 0$. Hence, in *jewel 1.0*, we simultaneously estimated $\hat{\boldsymbol{\Theta}}^{(1)}, \dots, \hat{\boldsymbol{\Theta}}^{(K)}$ by solving the following minimization problem:

$$\left(\hat{\boldsymbol{\Theta}}^{(1)}, \dots, \hat{\boldsymbol{\Theta}}^{(K)}\right) = \mathop{\mathrm{arg\,min}}_{\boldsymbol{\Theta}^{(k)} \in \mathbb{R}^{p_k \times p_k},\ k = 1, \dots, K} \left\{ \frac{1}{2} \sum_{k=1}^{K} \frac{1}{n_k} \left\| \mathbf{X}^{(k)} - \mathbf{X}^{(k)} \boldsymbol{\Theta}^{(k)} \right\|_F^2 + \lambda \sum_{i<j} \sqrt{g_{ij}} \left\| \left( \Theta^{(1)}_{ij}, \Theta^{(1)}_{ji}, \dots, \Theta^{(K)}_{ij}, \Theta^{(K)}_{ji} \right) \right\|_2 \right\}, \tag{1}$$

where $g_{ij}$ denotes the cardinality of the group of symmetric elements $(\Theta^{(1)}_{ij}, \Theta^{(1)}_{ji}, \dots, \Theta^{(K)}_{ij}, \Theta^{(K)}_{ji})$ across all the datasets that contain the couple of variables $(i, j)$. As a result, we obtained *K* matrices $(\hat{\boldsymbol{\Theta}}^{(1)}, \dots, \hat{\boldsymbol{\Theta}}^{(K)})$ with the same symmetric support, which represents the adjacency matrix of the common estimated graph $\hat{G}$.
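A direct (unoptimised) evaluation of the objective in Equation (1) helps fix the notation; in this Python sketch all datasets are assumed to share the same *p* variables, so every group has cardinality $g_{ij} = 2K$, and `jewel_objective` is an illustrative name, not part of the package:

```python
import math

def jewel_objective(X_list, theta_list, lam):
    """Value of the jewel 1.0 objective: a per-dataset Frobenius-norm loss plus
    a group lasso penalty over the symmetric groups
    (Theta^(1)_ij, Theta^(1)_ji, ..., Theta^(K)_ij, Theta^(K)_ji)."""
    K = len(X_list)
    p = len(theta_list[0])
    loss = 0.0
    for X, Th in zip(X_list, theta_list):
        n = len(X)
        for r in range(n):
            for j in range(p):
                fit = sum(X[r][m] * Th[m][j] for m in range(p))
                loss += (X[r][j] - fit) ** 2 / n
    loss *= 0.5
    penalty = 0.0
    g = 2 * K   # group cardinality when all K datasets share the p variables
    for i in range(p):
        for j in range(i + 1, p):
            group = [Th[i][j] for Th in theta_list] + [Th[j][i] for Th in theta_list]
            penalty += math.sqrt(g) * math.sqrt(sum(v * v for v in group))
    return loss + lam * penalty

# With Theta = 0 only the loss term remains: 0.5 * (1/n) * ||X||_F^2.
X = [[1.0, 2.0], [3.0, 4.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
val = jewel_objective([X], [zero], lam=0.5)   # = 0.5 * (30 / 2) = 7.5
```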

However, while the symmetry of the solution of Equation (1) is one of the main advantages of our original formulation, forcing the datasets to have exactly the same underlying graph might be a limitation in some cases. For example, the assumption that the datasets **X**(*k*), *k* = 1 ... *K*, share the same graph *G* might be reasonable when they represent the gene expression of the same type of cells, measured in different laboratories or with different instruments. In such cases, we expect the actual underlying regulatory mechanisms to remain the same despite the technology. However, the assumption becomes limited when the datasets represent the gene expression in cells isolated from different sub-types or stages of some disease. In such a context, we expect a common underlying mechanism that can be associated with that disease, but there could be sub-type-specific differences that are not shared across all the datasets but characterize only one (or some, but not all) condition. To overcome this limitation, in this paper, we modify the formulation of the minimization problem in Equation (1) to allow the modeling of both commonalities and differences between the graphs *G*(*k*), *k* = 1 ... *K*. Therefore, by relaxing the assumptions, we increase the range of applications of our approach.

Additionally, in [1] we demonstrated that *jewel 1.0* performs comparably well or even better than other joint approaches for GGMs (i.e., JGL [4] and the proposal of Guo et al. [5]). However, we also showed that all methods' performances decreased significantly when the true graph contained hubs. Scale-free graphs with a power of preferential attachment bigger than 1 constitute typical examples of graphs with hubs. In general, many real-life graphs, such as gene regulatory networks or protein–protein interaction networks, are estimated to follow a power law with exponent between 2 and 3. Therefore, we introduced weights into the penalization problem to allow for a better estimation of hubs. The key idea is that small weights must be assigned to the edges linked to potential hubs so that hubs can emerge (in other words, we allow the preferential attachment of an edge to a hub). We will show that this modification leads to a considerable improvement in performance when some preliminary information on the hubs is available. We note that, although other graph estimation methods with weighted penalties exist for one dataset, e.g., DW-lasso [6], *jewel 2.0* is the first joint method for several datasets with a weighted penalty.

To incorporate both changes, we reformulate the minimization problem as follows. We define the set $V = V_1 \cup \dots \cup V_K$, with $p$ being its cardinality. We define matrices of weights for each graph, $\mathbf{W}^{(k)}$, with entries $W^{(k)}_{ij} \in (0, 1]$. These matrices incorporate information about hubs when available (smaller weights should be assigned to edges incident to hubs, larger weights to the others); otherwise, we set $\mathbf{W}^{(k)} = \mathbf{1}$. For each class $k = 1 \dots K$, we set $\boldsymbol{\Theta}^{(k)} = \boldsymbol{\Xi}^{(k)} + \boldsymbol{\Gamma}^{(k)}$, where the matrices $\boldsymbol{\Xi}^{(k)}$ share the same support structure among the $K$ classes (i.e., they represent the common information shared across different graphs), while the $\boldsymbol{\Gamma}^{(k)}$ are class-specific (hence, they allow modeling differences among the classes). We emphasize that $\boldsymbol{\Xi}^{(k)}$ and $\boldsymbol{\Gamma}^{(k)}$, $k = 1 \dots K$, act as auxiliary (or dummy) variables and do not exactly represent the common and specific parts of the graphs $G^{(k)}$. For example, non-zero elements in position $(i, j)$ of all the $\boldsymbol{\Gamma}^{(k)}$ indicate a common edge, despite appearing in the class-specific $\boldsymbol{\Gamma}^{(k)}$s. In our procedure, we estimate $\hat{G}^{(k)}$ through the support of $\boldsymbol{\Theta}^{(k)} = \boldsymbol{\Xi}^{(k)} + \boldsymbol{\Gamma}^{(k)}$. Once we have estimated $\hat{G}^{(k)}$, $k = 1 \dots K$, we can later identify their common part as the intersection of the estimated graphs and the specific parts as the difference between the common part and the individual estimates. With this notation, we estimate $\hat{\boldsymbol{\Theta}}^{(1)}, \dots, \hat{\boldsymbol{\Theta}}^{(K)}$ by solving the following problem:

$$\begin{split} (\hat{\boldsymbol{\Xi}}^{(1)}, \hat{\boldsymbol{\Gamma}}^{(1)}, \dots, \hat{\boldsymbol{\Xi}}^{(K)}, \hat{\boldsymbol{\Gamma}}^{(K)}) = \operatorname*{arg\,min}_{\substack{\Xi^{(k)},\, \Gamma^{(k)} \in \mathbb{R}^{p_k \times p_k},\ \mathrm{diag}=0 \\ k = 1 \dots K}} \Bigg\{ & \frac{1}{2} \sum_{k=1}^{K} \frac{1}{n_k} \left\| \mathbf{X}^{(k)} - \mathbf{X}^{(k)} \Xi^{(k)} - \mathbf{X}^{(k)} \Gamma^{(k)} \right\|_F^2 \\ & + \lambda_1 \sum_{i<j} \sqrt{g_{ij}}\, \overline{W}_{ij} \sqrt{\sum_{k:\, \{X_i, X_j\} \subset V_k} \left(\Xi^{(k)}_{ij}\right)^2 + \left(\Xi^{(k)}_{ji}\right)^2} \\ & + \lambda_2 \sum_{i<j} \sqrt{2} \sum_{k:\, \{X_i, X_j\} \subset V_k} W^{(k)}_{ij} \sqrt{\left(\Gamma^{(k)}_{ij}\right)^2 + \left(\Gamma^{(k)}_{ji}\right)^2} \Bigg\}, \end{split} \tag{2}$$

where $\overline{W}_{ij} = \frac{1}{K} \sum_k W^{(k)}_{ij}$ and $\lambda_1$, $\lambda_2$ are two regularization parameters.

Note that the minimization problem in Equation (2) has two penalty terms. The first penalty is applied to the entire group $\left(\Xi^{(1)}_{ij}, \Xi^{(1)}_{ji}, \dots, \Xi^{(K)}_{ij}, \Xi^{(K)}_{ji}\right)$ and enforces the presence/absence of an edge across all classes containing that edge (to capture common dependencies). The second penalty is applied independently to each group $\left(\Gamma^{(k)}_{ij}, \Gamma^{(k)}_{ji}\right)$, $k = 1 \dots K$. This enforces the symmetry of the relation between two nodes in each class, allowing class-specific differences to emerge. Figure 2 provides an illustration of the idea.

**Figure 2.** Illustration of the group penalties applied in *jewel 1.0* (**left panel**) and *jewel 2.0* (**right panel**). In *jewel 1.0*, the penalty forces the datasets to share the same graph by either setting to zero or retaining the coefficients associated with the entire group $\left(\Theta^{(1)}_{ij}, \Theta^{(1)}_{ji}, \dots, \Theta^{(K)}_{ij}, \Theta^{(K)}_{ji}\right)$. In *jewel 2.0*, the first penalty is similar to the one in *jewel 1.0*, applied to the entire group $\left(\Xi^{(1)}_{ij}, \Xi^{(1)}_{ji}, \dots, \Xi^{(K)}_{ij}, \Xi^{(K)}_{ji}\right)$. However, the second penalty is applied to each group $\left(\Gamma^{(k)}_{ij}, \Gamma^{(k)}_{ji}\right)$, $k = 1 \dots K$, independently, allowing the approach to set to zero or retain edges in specific classes.

Moreover, the weights $W^{(k)}_{ij}$ in the penalties allow hubs to emerge: by assigning lower weights to the edges linked to potential hubs, we reduce the penalty on those edges. Instead, by choosing $W^{(k)}_{ij} = 1$, we obtain a more democratic approach that is equivalent to an unweighted formulation. The weights could be estimated from the data or provided by the user from prior knowledge, the literature, or databases. Currently, using information already available in the literature (from previous experiments) or knowledge stored in databases is becoming widespread, particularly in omics science, where international projects and consortia have released an enormous amount of data in open form. For example, in [7], the authors suggested using gene networks retrieved from databases to improve survival prediction. Here, we propose using information concerning potentially highly influential genes to better estimate networks with hubs. Section 3.2.2 discusses the choice of weights in detail.

Finally, we note that the minimization problem in Equation (2) is not identifiable in terms of $\boldsymbol{\Xi}^{(k)}$ and $\boldsymbol{\Gamma}^{(k)}$, but it is identifiable in terms of the support of $\boldsymbol{\Theta}^{(k)}$, provided that we estimate it through $\boldsymbol{\Xi}^{(k)} + \boldsymbol{\Gamma}^{(k)}$. Furthermore, as a result of the minimization problem in Equation (2), we obtain $K$ graphs $\hat{G}^{(k)}$ with largely the same structure $G$, while also allowing some class-specific edges. See Figure 3 for an illustration of the idea.

**Figure 3.** Illustration of how *jewel 2.0* estimates the graphs with largely the same support (i.e., the black part of the graphs *G*ˆ(*k*)) but also allows for some class-specific edges (colored part of the graphs *G*ˆ(*k*)). The lower part of the figure demonstrates how the idea translates to the R package jewel described in Section 2.5.

#### *2.2. Jewel 2.0 Algorithm*

To solve the minimization problem in Equation (2), we apply the group descent algorithm presented in [8]. To better describe this algorithm, we rewrite the problem in Equation (2) using an equivalent formulation. To this end, we define the following matrices and vectors (note that, for ease of notation and without loss of generality, we suppose $p_1 = \dots = p_K = p$):


- the vectors $\boldsymbol{\xi}, \boldsymbol{\gamma} \in \mathbb{R}^{p(p-1)K}$ are obtained by concatenating the vectorized matrices $\boldsymbol{\Xi}^{(k)}$ and $\boldsymbol{\Gamma}^{(k)}$ (without their diagonals) over the $K$ classes, while $\boldsymbol{y} \in \mathbb{R}^{Np}$, with $N = \sum_k n_k$, is obtained by stacking the columns $\mathbf{X}^{(k)}_{.j}$, $k = 1 \dots K$, $j = 1 \dots p$;

- the matrix

$$\mathbf{X} = \begin{pmatrix} \mathbf{X}^{(1)}_{.-1} & 0 & \dots & 0 \\ 0 & \mathbf{X}^{(1)}_{.-2} & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \dots & 0 & \mathbf{X}^{(K)}_{.-p} \end{pmatrix}$$

denotes the block-diagonal matrix made up of the matrices $\mathbf{X}^{(k)}_{.-j}$ (the $k$-th dataset without its $j$-th column), $k = 1 \dots K$, $j = 1 \dots p$, with $\dim(\mathbf{X}) = Np \times p(p-1)K$; the diagonal matrix $D^2$ collects the corresponding factors $1/n_k$, so that $\|\cdot\|^2_{D^2}$ reproduces the weighted residual sums of squares of Equation (2).

With these notations, the problem in Equation (2) is equivalent to the following linear regression model with non-overlapping weighted group lasso penalties:

$$(\hat{\boldsymbol{\xi}}, \hat{\boldsymbol{\gamma}}) = \operatorname*{arg\,min}_{\boldsymbol{\xi},\, \boldsymbol{\gamma} \in \mathbb{R}^{p(p-1)K}} \underbrace{\frac{1}{2} \left\| \boldsymbol{y} - \mathbf{X}(\boldsymbol{\xi} + \boldsymbol{\gamma}) \right\|_{D^2}^2 + \lambda_1 \sum_{i<j} \sqrt{g_{ij}}\, \overline{W}_{ij} \left\| \boldsymbol{\xi}_{(ij)} \right\|_2 + \lambda_2 \sum_{i<j} \sqrt{2} \sum_{k} W^{(k)}_{ij} \left\| \boldsymbol{\gamma}^{(k)}_{(ij)} \right\|_2}_{F(\boldsymbol{\xi},\, \boldsymbol{\gamma})}, \tag{3}$$

where $\boldsymbol{\xi}_{(ij)} = \left(\Xi^{(1)}_{ij}, \Xi^{(1)}_{ji}, \dots, \Xi^{(K)}_{ij}, \Xi^{(K)}_{ji}\right)$ and $\boldsymbol{\gamma}^{(k)}_{(ij)} = \left(\Gamma^{(k)}_{ij}, \Gamma^{(k)}_{ji}\right)$ denote the groups penalized in Equation (2).

Function *F*(*ξ*, *γ*) in Equation (3) is jointly convex in *ξ* and *γ*, and the two penalties involve *ξ* or *γ* independently. Hence, we can apply an alternating optimization algorithm as follows:


First, given the current estimate $\hat{\boldsymbol{\gamma}}$, set $\tilde{\boldsymbol{y}} = \boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\gamma}}$ and update

$$\hat{\boldsymbol{\xi}} = \operatorname*{arg\,min}_{\boldsymbol{\xi} \in \mathbb{R}^{p(p-1)K}} \frac{1}{2} \left\| \tilde{\boldsymbol{y}} - \mathbf{X}\boldsymbol{\xi} \right\|_{D^2}^2 + \lambda_1 \sum_{i<j} \sqrt{g_{ij}}\, \overline{W}_{ij} \left\| \boldsymbol{\xi}_{(ij)} \right\|_2;
$$


in the second step, set $\tilde{\boldsymbol{y}} = \boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{\xi}}$ and update

$$\hat{\boldsymbol{\gamma}} = \operatorname*{arg\,min}_{\boldsymbol{\gamma} \in \mathbb{R}^{p(p-1)K}} \frac{1}{2} \left\| \tilde{\boldsymbol{y}} - \mathbf{X}\boldsymbol{\gamma} \right\|_{D^2}^2 + \lambda_2 \sum_{i<j} \sqrt{2} \sum_k W^{(k)}_{ij} \left\| \boldsymbol{\gamma}^{(k)}_{(ij)} \right\|_2.
$$

The two steps are alternated and repeated until convergence.

Since the groups are non-overlapping and orthogonal by construction, each minimization step can be solved using the group descent algorithm of [8]. Algorithm 1 illustrates the steps of the proposed *jewel 2.0* algorithm in detail.
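The elementary operation of this group descent is a group soft-thresholding of each least-squares coordinate block. The following sketch (Python/NumPy; the function name is ours) shows the operator applied to a group vector $z$, where the penalty equals $\lambda_1 \sqrt{g_{ij}}\, \overline{W}_{ij}$ for a common-part group and $\lambda_2 \sqrt{2}\, W^{(k)}_{ij}$ for a class-specific group:

```python
import numpy as np

def group_soft_threshold(z, penalty):
    """Shrink the group vector z toward zero: z <- z * max(0, 1 - penalty / ||z||_2).

    A group whose Euclidean norm falls below `penalty` is set exactly to zero,
    which removes an edge from all classes at once (common part) or from a
    single class (specific part).
    """
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return np.zeros_like(z)
    factor = max(0.0, 1.0 - penalty / norm)
    return factor * z
```

For example, a group $z = (3, 4)$ with $\|z\| = 5$ is shrunk to $(1.5, 2)$ when the penalty is $2.5$ and zeroed out entirely when the penalty exceeds $5$.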

Note that Algorithm 1 generalizes the one presented in [1] to the double penalties (which are alternately updated) in the minimization problem in Equation (2). However, it still uses the idea of the *Active* matrices discussed in [1], where only non-zero entries are updated. The algorithm provided here also has a minor difference compared to the *jewel 1.0* version: now, the order of variables is randomized instead of being 1 → *p*. Although the algorithm converges regardless of the order [8], we decided to use the random order of updates because we have empirically seen that it provides an advantage in the running time without loss in performance. Finally, we note that the matrices and vectors defined in this subsection to describe the algorithm do not need to be explicitly constructed to implement the algorithm, as discussed in [1].

**Algorithm 1** The *jewel 2.0* algorithm.

INPUT: data $\mathbf{X}^{(1)}, \dots, \mathbf{X}^{(K)}$; weights $\mathbf{W}^{(1)}, \dots, \mathbf{W}^{(K)}$; parameters $\lambda_1$, $\lambda_2$, *tol*, and $t_{\max}$.

INITIALIZE: common part $\boldsymbol{\Xi}^{(1,0)}, \dots, \boldsymbol{\Xi}^{(K,0)}$; specific part $\boldsymbol{\Gamma}^{(1,0)}, \dots, \boldsymbol{\Gamma}^{(K,0)}$; residuals (common and specific) $RC^{(k)} = RS^{(k)} = \mathbf{X}^{(k)} - \mathbf{X}^{(k)}\boldsymbol{\Gamma}^{(k,0)} - \mathbf{X}^{(k)}\boldsymbol{\Xi}^{(k,0)}$; matrices of active pairs $\mathbf{Active} = \mathbf{Active}^{(k)}$, $\forall k$, initialized to the strictly upper-triangular matrix of ones.

REPEAT UNTIL CONVERGENCE:

*Fix the specific part $\boldsymbol{\Gamma}^{(1,t)}, \dots, \boldsymbol{\Gamma}^{(K,t)}$; update the COMMON part.* Generate $order_{\Xi}$ by resampling from 1 to $p$.

**for** $j \in order_{\Xi}$ **do**
  **for** $i = j+1 \dots p$ **do**
    **if** $Active_{ij} \neq 0$ **then**
      evaluate $z = \left(z^{(1)}_{ij}, z^{(1)}_{ji}, \dots, z^{(K)}_{ij}, z^{(K)}_{ji}\right)$ by $z^{(k)}_{ij} = \frac{1}{n_k} \mathbf{X}^{(k)\top}_{.i} RC^{(k)}_{.j} + \Xi^{(k,t)}_{ij}$ and $z^{(k)}_{ji} = \frac{1}{n_k} \mathbf{X}^{(k)\top}_{.j} RC^{(k)}_{.i} + \Xi^{(k,t)}_{ji}$
      compute $\overline{W}_{ij} = \sum_k W^{(k)}_{ij} / K$ and evaluate $threshold = 1 - \lambda_1 \sqrt{g_{ij}}\, \overline{W}_{ij} / \|z\|$
      **if** $threshold < 0$ **then** $Active_{ij} \leftarrow 0$ and $z \leftarrow 0$ **else** $z \leftarrow z \cdot threshold$ **end if**
      update residuals $RC^{(k)}_{.j} = RC^{(k)}_{.j} + \mathbf{X}^{(k)}_{.i}\left(\Xi^{(k,t)}_{ij} - z^{(k)}_{ij}\right)$ and $RC^{(k)}_{.i} = RC^{(k)}_{.i} + \mathbf{X}^{(k)}_{.j}\left(\Xi^{(k,t)}_{ji} - z^{(k)}_{ji}\right)$
      update coefficients $\left(\Xi^{(1,t+1)}_{ij}, \Xi^{(1,t+1)}_{ji}, \dots, \Xi^{(K,t+1)}_{ij}, \Xi^{(K,t+1)}_{ji}\right) \leftarrow z$
    **end if**
  **end for**
**end for**

Update residuals $RS^{(k)} = \mathbf{X}^{(k)} - \mathbf{X}^{(k)}\boldsymbol{\Gamma}^{(k,t)} - \mathbf{X}^{(k)}\boldsymbol{\Xi}^{(k,t+1)}$.


*Fix the common part $\boldsymbol{\Xi}^{(1,t+1)}, \dots, \boldsymbol{\Xi}^{(K,t+1)}$; update the SPECIFIC part.* Generate $order_{\Gamma}$ by resampling from 1 to $p$.

**for** $k = 1 \dots K$ **do**
  **for** $j \in order_{\Gamma}$ **do**
    **for** $i = j+1 \dots p$ **do**
      **if** $Active^{(k)}_{ij} \neq 0$ **then**
        evaluate $z = \left(z^{(k)}_{ij}, z^{(k)}_{ji}\right)$ by $z^{(k)}_{ij} = \frac{1}{n_k} \mathbf{X}^{(k)\top}_{.i} RS^{(k)}_{.j} + \Gamma^{(k,t)}_{ij}$ and $z^{(k)}_{ji} = \frac{1}{n_k} \mathbf{X}^{(k)\top}_{.j} RS^{(k)}_{.i} + \Gamma^{(k,t)}_{ji}$
        evaluate $threshold = 1 - \lambda_2 \sqrt{2}\, W^{(k)}_{ij} / \|z\|$
        **if** $threshold < 0$ **then** $Active^{(k)}_{ij} \leftarrow 0$ and $z \leftarrow 0$ **else** $z \leftarrow z \cdot threshold$ **end if**
        update residuals $RS^{(k)}_{.j} = RS^{(k)}_{.j} + \mathbf{X}^{(k)}_{.i}\left(\Gamma^{(k,t)}_{ij} - z^{(k)}_{ij}\right)$ and $RS^{(k)}_{.i} = RS^{(k)}_{.i} + \mathbf{X}^{(k)}_{.j}\left(\Gamma^{(k,t)}_{ji} - z^{(k)}_{ji}\right)$
        update coefficients $\left(\Gamma^{(k,t+1)}_{ij}, \Gamma^{(k,t+1)}_{ji}\right) \leftarrow z$
      **end if**
    **end for**
  **end for**
**end for**

Update residuals $RC^{(k)} = \mathbf{X}^{(k)} - \mathbf{X}^{(k)}\boldsymbol{\Gamma}^{(k,t+1)} - \mathbf{X}^{(k)}\boldsymbol{\Xi}^{(k,t+1)}$. Combine the common and specific parts: $\boldsymbol{\Theta}^{(k,t+1)} = \boldsymbol{\Xi}^{(k,t+1)} + \boldsymbol{\Gamma}^{(k,t+1)}$.

Check convergence: stop if $\sum_k \left\| \boldsymbol{\Theta}^{(k,t+1)} - \boldsymbol{\Theta}^{(k,t)} \right\| \Big/ \sum_k \left\| \boldsymbol{\Theta}^{(k,t)} \right\| < tol$ or $t > t_{\max}$.

OUTPUT: $\hat{G}^{(k)} = \mathrm{supp}\left(\hat{\boldsymbol{\Theta}}^{(k,t^*)}\right)$, where $t^*$ is the last iteration before convergence; $\hat{G} = \cap_k \hat{G}^{(k)}$.

#### *2.3. Selection of Regularization Parameter*

The minimization problem of Equation (2) contains two regularization parameters: $\lambda_1$, which penalizes the common part of the graphs, $\boldsymbol{\Xi}^{(k)}$, $k = 1 \dots K$, and $\lambda_2$, which penalizes the class-specific part of the graphs, $\boldsymbol{\Gamma}^{(k)}$, $k = 1 \dots K$. In general, larger values of the parameters provide a sparser estimate, which can lead to many false negatives. In contrast, small values of the regularization parameters result in dense graphs with more false positives and less interpretability. Therefore, the choice of such parameters is crucial for the success of any regularization method. In this context, the two parameters are not independent, since edges set to zero in the common part might be included in the class-specific part and vice versa.

To better explain the interplay between $\lambda_1$ and $\lambda_2$, we first consider the reparametrization proposed in JGL [4], which connects the regularization parameters to two physical parameters, $\omega_1$ and $\omega_2$, representing the sparsity of the graphs and the similarity across the graphs in different classes:

$$\begin{aligned} \text{sparsity}: \omega\_1 &= \frac{\lambda\_1}{\sqrt{2K}} + \frac{\lambda\_2}{\sqrt{2}},\\ \text{similarity}: \omega\_2 &= \frac{\lambda\_2/\sqrt{2}}{\omega\_1}. \end{aligned}$$

With this reparametrization, we have $\lambda_1 = \sqrt{2K}\,\omega_1(1 - \omega_2)$ and $\lambda_2 = \sqrt{2}\,\omega_1\omega_2$. The motivation behind this reparametrization is that it is easier to elicit prior information on $\omega_1$ and $\omega_2$ than on $\lambda_1$ and $\lambda_2$.
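The two maps can be coded and checked directly; a minimal sketch (Python; the function names are ours):

```python
import math

def omegas_to_lambdas(omega1, omega2, K):
    """Map sparsity (omega1) and similarity (omega2) to the regularization
    parameters: lambda1 = sqrt(2K) * omega1 * (1 - omega2),
                lambda2 = sqrt(2) * omega1 * omega2."""
    lambda1 = math.sqrt(2 * K) * omega1 * (1 - omega2)
    lambda2 = math.sqrt(2) * omega1 * omega2
    return lambda1, lambda2

def lambdas_to_omegas(lambda1, lambda2, K):
    """Inverse map: omega1 = lambda1/sqrt(2K) + lambda2/sqrt(2),
                    omega2 = (lambda2/sqrt(2)) / omega1."""
    omega1 = lambda1 / math.sqrt(2 * K) + lambda2 / math.sqrt(2)
    omega2 = (lambda2 / math.sqrt(2)) / omega1
    return omega1, omega2
```

A quick round trip (e.g., $K = 3$, $\omega_1 = 0.2$, $\omega_2 = 0.5$) recovers the original $\omega$ values, confirming that the two parametrizations are equivalent.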

In a more data-driven spirit, Chapter 9.7 of [9] suggests the use of a proportional relation between the two regularization parameters, namely *λ*<sup>2</sup> = *φλ*1. In the specific setting of matrix decomposition in the multivariate regression, the author proposes the following estimate (see Equation (9.7) of [9]):

$$\phi = \frac{\sqrt{\text{maximum size over all groups}} + \sqrt{\log(\text{#groups})}}{\sqrt{\log(\text{#parameters to estimate})}}.$$
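As a quick arithmetic illustration, the estimate above can be evaluated directly (Python; the function name and the example inputs are ours):

```python
import math

def phi_estimate(max_group_size, n_groups, n_parameters):
    """Proportionality factor phi for lambda_2 = phi * lambda_1,
    following Equation (9.7) of [9]:
    (sqrt(max group size) + sqrt(log #groups)) / sqrt(log #parameters)."""
    return (math.sqrt(max_group_size) + math.sqrt(math.log(n_groups))) \
        / math.sqrt(math.log(n_parameters))
```

For instance, with a maximum group size of 6, 100 groups, and 1000 parameters, the formula gives $\phi \approx 1.75$, the same order of magnitude as the value $\phi = 1.4$ selected empirically below.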

In our case, we still use the relation *λ*<sup>2</sup> = *φλ*1. However, to avoid a computationally intensive data-driven search over *φ*, we chose the value *φ* empirically after extensive numerical experimentation under different settings, where we evaluated ROC curves and edges for a range of values *φ* = 1, 1.1, 1.2, 1.3, 1.4, 1.5. We set *φ* = 1.4 (the value that provided the best performance on average among the tested values) and used it in all the subsequent analyses on synthetic and real datasets.

With the established relation $\lambda_2 = \phi\lambda_1$ and the value of $\phi$ fixed, the complexity of the parameter space reduces to the choice of the single regularization parameter $\lambda_1$. In principle, one could extend the Bayesian information criterion (BIC) as done in [1]. However, there is a long ongoing debate on using BIC in high-dimensional settings; for example, the authors of [10] reported poor performance. In general, BIC might be too liberal in the high-dimensional regime. Therefore, the BIC estimate might represent only a rough empirical estimate of the optimal regularization parameter, and its usage beyond empirical evidence would require studying the conditions that guarantee the recovery of the true structure. Since this paper introduces the stability selection procedure (described in the next section), we suggest that the user choose a value for $\lambda_1$ within a plausible range and then apply the stability selection to mitigate the impact of non-optimal choices on the final estimate. We used this suggestion in Section 4. We further discuss the choice of the regularization parameter in Section 5.

#### *2.4. Stability Selection*

Graph estimation can be regarded as a binary classification problem: for each pair of vertices {*i*, *j*}, we decide whether an edge (*i*, *j*) exists or not. As with any binary classification method, it is essential to estimate as many true edges as possible and avoid false positives for better interpretability, which becomes especially valuable in high dimensions.

In this respect, another important novelty of *jewel 2.0* compared to *jewel 1.0* is the implementation of a stable variable selection procedure that decreases the number of false positives. The idea is that true positive edges should be more stable than false positive edges, because the former result from true numerical significance while the latter are merely the result of chance. With this in mind, one expects that, by repeating the calculation several times (on different realizations with a fixed regularization parameter), the true positive edges appear more frequently (i.e., are more stable) than the false positive edges. Motivated by this argument, we extended the idea presented in [3] for the first time to the multiple graphs estimation problem.

The variable stability selection procedure implemented in *jewel 2.0* works as follows. Fix *#subsets*, representing the number of times we re-run the method on random subsets of the data, and fix *chosen fraction*, representing the fraction of times that an edge must appear in the results produced on random subsets to be considered stable. Then,

- randomly subsample the data matrices $\mathbf{X}^{(1,s)}, \mathbf{X}^{(2,s)}, \dots, \mathbf{X}^{(K,s)}$ of size $n_k/2 \times p_k$ from $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \dots, \mathbf{X}^{(K)}$;
- obtain $\hat{G}^{(k)}(s)$, $k = 1 \dots K$, by applying *jewel 2.0* to the subsampled data;
- repeat the two steps above for $s = 1 \dots$ *#subsets* and retain in the final estimate only the edges that appear in at least a *chosen fraction* of the runs.

In our approach, the regularization parameter *λ*<sup>1</sup> is the same throughout the entire stability selection procedure. One might choose, for example, *λ*<sup>1</sup> obtained with a model selection estimator such as BIC or select a user-defined value (as performed in Section 4) within a reasonable range. The main advantage of the stability selection procedure is that the final estimate of the graphs *G*ˆ(*k*) is not only sparser but also does not critically depend on the specific choice of the regularization parameter. As we show in the numerical experiment, the stability selection procedure effectively refines the graphs (providing sparser solutions) and reduces the number of false positive edges.
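The subsampling scheme can be sketched generically, with the graph estimator abstracted as a callable (Python/NumPy; all names are ours, and `estimate_graphs` stands in for a call to *jewel 2.0* with a fixed $\lambda_1$):

```python
import numpy as np

def stability_selection(X_list, estimate_graphs, n_subsets=25,
                        chosen_fraction=0.6, seed=None):
    """Stability selection over K datasets with a generic joint estimator.

    X_list          : list of K (n_k x p_k) data matrices
    estimate_graphs : callable mapping a list of K data matrices to a list of
                      K boolean (p_k x p_k) adjacency matrices
    """
    rng = np.random.default_rng(seed)
    counts = [np.zeros((X.shape[1], X.shape[1])) for X in X_list]
    for _ in range(n_subsets):
        # randomly subsample floor(n_k / 2) rows of each dataset
        subsets = [X[rng.choice(X.shape[0], X.shape[0] // 2, replace=False)]
                   for X in X_list]
        graphs = estimate_graphs(subsets)
        for k, g in enumerate(graphs):
            counts[k] += g
    # an edge is kept if it appears in at least `chosen_fraction` of the runs
    return [c / n_subsets >= chosen_fraction for c in counts]
```

An edge that the estimator recovers on most random halves of the data survives, while an edge that appears only sporadically (a likely false positive) is discarded.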

#### *2.5. Code Availability*

The *jewel 2.0* approach is implemented in the R package jewel, which is freely available at https://github.com/annaplaksienko/jewel (accessed on 15 October 2022). As illustrated in Figure 3, the main function jewel performs the entire algorithm and is relatively simple to use: the user must provide only the list of $K$ numeric data matrices and a numeric value of $\lambda_1$; all other parameters are optional. See the package documentation and README file for details. All the code to reproduce the simulation studies is provided at https://github.com/annaplaksienko/jewel_simulation_studies (accessed on 15 October 2022).

#### **3. Results on Synthetic Data**

This section presents simulation results to demonstrate the specific advantages of *jewel 2.0* in handling differences among classes, dealing with network hubs, and refining the networks to reduce false positives and improve interpretability. Moreover, we also compare *jewel 2.0* to similar methods for joint Gaussian graphical model estimation. Before presenting the results, in Section 3.1, we briefly describe the data generation procedure for our experimental settings.

#### *3.1. Experimental Settings*

To demonstrate the capability of our approach, we used scale-free graphs, since they describe many social and biological processes. We generated $K$ scale-free graphs $G^{(k)}$ with similar supports using the following procedure. We first generated a scale-free graph $G^{(1)}$ with $p$ vertices with the Barabasi–Albert algorithm [11] (the barabasi.game function from the igraph package [12]). We tuned the graph sparsity with the *m* parameter, which describes the number of edges added at each iteration of the Barabasi–Albert algorithm; the resulting graph $G^{(1)}$ has $mp - (2m - 1)$ edges. We controlled the hub structure with the *power* parameter, which tunes the preferential attachment: with increased power, vertices with more edges are more likely to receive new edges at each step of the Barabasi–Albert algorithm. Hence, by modifying the *power* parameter, we can generate networks with different hub/non-hub structures.

Then, given the graph *<sup>G</sup>*(1), we modified it *<sup>K</sup>* <sup>−</sup> 1 times (to obtain *<sup>K</sup>* graphs total) with igraph's function rewire(..., with = keeping\_degseq), which moves the edges of a given graph, preserving its order and the degree distribution. We tuned the number of differences introduced in each modified graph using the *niter* parameter, i.e., the number of iterations of the algorithm (unless stated otherwise, parameter *niter* = 0.08 ∗ *p*, making the graphs have about 26% of edges not in common, i.e., not in the intersection).

We used the resulting graphs $G^{(k)}$, $k = 1 \dots K$, as the support to construct the precision matrices $\mathbf{\Omega}^{(k)}$. We considered the adjacency matrices of the graphs and replaced all 1s with realizations from the uniform distribution on $[-0.8, -0.2] \cup [0.2, 0.8]$. To ensure positive definiteness of $\mathbf{\Omega}^{(k)}$, we set its diagonal elements equal to $|\mu_{\min}(\mathbf{\Omega}^{(k)})| + 0.1$, with $\mu_{\min}$ being the minimum eigenvalue of the matrix. Then, we constructed the covariance matrices $\mathbf{\Sigma}^{(k)}$ as

$$\Sigma^{(k)}_{ij} = \left(\Omega^{(k)}\right)^{-1}_{ij} \Big/ \sqrt{\left(\Omega^{(k)}\right)^{-1}_{ii} \left(\Omega^{(k)}\right)^{-1}_{jj}}.$$

Finally, we generated the data matrices $\mathbf{X}^{(k)}$ as $n_k = n$ independent identically distributed observations from $\mathcal{N}(0, \mathbf{\Sigma}^{(k)})$.
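The construction above can be sketched as follows (Python/NumPy; function names are ours):

```python
import numpy as np

def precision_from_graph(adj, seed=None):
    """Build a positive-definite precision matrix Omega supported on `adj`."""
    rng = np.random.default_rng(seed)
    p = adj.shape[0]
    # replace the 1s of the adjacency matrix with symmetric draws from the
    # uniform distribution on [-0.8, -0.2] U [0.2, 0.8]
    magnitudes = rng.uniform(0.2, 0.8, size=(p, p))
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    omega = np.triu(adj * magnitudes * signs, k=1)
    omega = omega + omega.T
    # set the diagonal to |mu_min(Omega)| + 0.1 to ensure positive definiteness
    mu_min = np.linalg.eigvalsh(omega).min()
    np.fill_diagonal(omega, abs(mu_min) + 0.1)
    return omega

def covariance_from_precision(omega):
    """Sigma_ij = (Omega^{-1})_ij / sqrt((Omega^{-1})_ii (Omega^{-1})_jj)."""
    inv = np.linalg.inv(omega)
    d = np.sqrt(np.diag(inv))
    return inv / np.outer(d, d)
```

The data are then drawn as, e.g., `rng.multivariate_normal(np.zeros(p), sigma, size=n)`. Shifting the diagonal by $|\mu_{\min}| + 0.1$ guarantees that the smallest eigenvalue of the result is at least $0.1$, so the matrix is invertible.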

In our experimental setting, we chose $K = 3$, $p = 500$, and $n = 100$, and we repeated the above procedure 20 times. We evaluated the performance using the true positive rate $TPR = TP/(TP + FN)$ and the false positive rate $FPR = FP/(FP + TN)$, where $FP$ is the number of false positives, $TP$ the number of true positives, $FN$ the number of false negatives, and $TN$ the number of true negatives. *jewel 2.0* returns as output $K$ graphs $\hat{G}^{(k)}$, instead of the single graph $\hat{G}$ obtained from *jewel 1.0*. Therefore, we first calculated $TPR^{(k)}$ and $FPR^{(k)}$ for each graph and then obtained $TPR$ and $FPR$ by averaging over the $K = 3$ graphs. Finally, we averaged the results over the 20 independent runs.
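These metrics are computed on the upper-triangular edge indicators of the adjacency matrices; a minimal sketch (Python/NumPy; the function name is ours, and we also include the F1-score used later in Section 3.2.3):

```python
import numpy as np

def edge_metrics(true_adj, est_adj):
    """TPR, FPR, and F1 computed on the upper-triangular edge indicators."""
    iu = np.triu_indices_from(true_adj, k=1)
    t, e = true_adj[iu].astype(bool), est_adj[iu].astype(bool)
    TP = np.sum(t & e)      # true edges recovered
    FP = np.sum(~t & e)     # spurious edges
    FN = np.sum(t & ~e)     # missed edges
    TN = np.sum(~t & ~e)    # correctly absent edges
    TPR = TP / (TP + FN) if TP + FN else 0.0
    FPR = FP / (FP + TN) if FP + TN else 0.0
    F1 = 2 * TP / (2 * TP + FP + FN) if TP + FP + FN else 0.0
    return TPR, FPR, F1
```

For $K$ graphs, the per-graph values would then be averaged, as described above.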

Simulations were carried out on an 8-core 4.2 GHz processor and 32 GB RAM computer.

#### *3.2. Performance of Jewel 2.0*

This section illustrates the three main novelties introduced with this work. First, we demonstrate the performance of *jewel 2.0* when the graphs $G^{(k)}$ have varying amounts of differences. Then, we deal with graphs with a non-uniform edge distribution (i.e., graphs with hubs). Finally, we demonstrate how stability selection can reduce the number of false positives and lead to more interpretable graphs.

#### 3.2.1. Performance on Graphs with Varying Differences

We motivated the minimization problem in Equation (2) by the need to estimate graphs *G*(*k*) that share a common part but can have class-specific differences. Therefore, here, we evaluated the performance of *jewel 2.0* when applied to datasets with varying differences between their underlying graphs *G*(*k*).

We set the parameter *niter* in the rewire(..., with = keeping\_degseq(niter)) function to 2, 4, 8, and 10% of the number of vertices $p$, resulting in graphs with 7, 13, 26, and 32% of difference (on average over 20 independent runs). Here, by "difference" we mean the percentage of edges not in the intersection of the $K = 3$ graphs.

Figure 4 reports the ROC curves obtained with the *jewel 2.0* method applied to $K = 3$ datasets of size $p = 500$, $n_k = 100$ with varying underlying graphs. We set the parameter *power* = 1 and compare two sparsity scenarios: $m = 1$, i.e., 499 edges, and $m = 2$, i.e., 997 edges. Here, we set the weights $W^{(k)}_{ij} = 1$ regardless of the graph structure.

As we can observe from Figure 4, *jewel 2.0* demonstrates good performances in all scenarios of varying amounts of differences between *K* graphs. Unsurprisingly, the performance decreases as the graphs become increasingly different, but even for 32% of difference, the method still performs well.

#### 3.2.2. Performance on Graphs with Hubs

In [1], we noticed that the performance of our previous method *jewel 1.0* and of similar joint estimation methods deteriorates as the power decay of the degree distribution increases (i.e., for graphs that contain a few nodes with a significant number of connections, while most of the nodes have limited connections). The problem was already noticeable in our simulations for *power* = 1.5. To face this problem, in this work, we introduced the weights $W^{(k)}_{ij}$ into the model formulation of Equation (2), the idea being to adjust the penalty on edge $(i, j)$ according to the degree of its incident nodes $i$ and $j$. When $i$ and/or $j$ have a high degree, the penalty on the groups involving $\{i, j\}$ is decreased to favor the emergence of the edge $(i, j)$, whereas when $i$ and $j$ have a low degree, the penalty is increased to favor the removal of the edge $(i, j)$.

**Figure 4.** ROC curves for the *jewel 2.0* method applied to $K = 3$ datasets of size $p = 500$, $n_k = 100$ with various differences between their underlying graphs (denoted by different colors). Parameter *power* = 1; all weights are set to $W^{(k)}_{ij} = 1$. **Left panel**: performance for $m = 1$, i.e., 499 edges. **Right panel**: performance for $m = 2$, i.e., 997 edges.

We first describe an ideal procedure to demonstrate the advantages of our weighted approach by using oracle weights obtained from the true degrees of the nodes, which, of course, are not available in real data applications.

Suppose an oracle provides a vector of true degrees for each class $k = 1 \dots K$, denoted $d^{(k)} \in \mathbb{N}^p$, with $\mathbb{N} = \{0, 1, 2, 3, \dots\}$. We then define the oracle weights matrix $\mathbf{W}^{(k)}$ for class $k = 1 \dots K$ by the following formula:

$$\mathbf{W}^{(k)} = \mathbf{W}^{(k)} / \max(\mathbf{W}^{(k)}), \quad \text{with} \quad W\_{ij}^{(k)} = \frac{1}{\sqrt{d\_i^{(k)} \cdot d\_j^{(k)}}}. \tag{4}$$

Note that the re-scaling by $\max(\mathbf{W}^{(k)})$ ensures that the weights lie in the interval $(0, 1]$ and do not depend on the number of nodes $p$. Figure 5 (upper black lines) demonstrates that, with the oracle choice of weights, *jewel 2.0* provides excellent performance. In particular, the oracle weights provide a substantial improvement compared to the no-weight approach (i.e., $\mathbf{W}^{(k)} = \mathbf{1}$, lower black lines in Figure 5).
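Equation (4) translates directly into code; a minimal sketch (Python/NumPy; the function name is ours, and we assume all degrees are at least 1 to avoid division by zero):

```python
import numpy as np

def oracle_weights(degrees):
    """Oracle weight matrix of Equation (4) from a vector of node degrees.

    W_ij = 1 / sqrt(d_i * d_j), rescaled by its maximum so that the weights
    lie in (0, 1]. Only the off-diagonal entries are used as edge weights.
    """
    d = np.asarray(degrees, dtype=float)
    W = 1.0 / np.sqrt(np.outer(d, d))
    return W / W.max()
```

Edges between two high-degree nodes thus receive the smallest weights, and hence the smallest penalty, which is exactly what lets hubs emerge.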

According to the results of the oracle estimator, one might suggest a two-step procedure: first apply *jewel 2.0* with $\mathbf{W}^{(k)} = \mathbf{1}$, then estimate the degree vectors $\hat{d}^{(k)}$, $k = 1 \dots K$, and reapply *jewel 2.0* using $\hat{d}^{(k)}$ as an oracle. However, such a procedure is time-consuming (since it requires running *jewel 2.0* twice) and does not show the same excellent performance as the oracle estimator.

In the spirit of proposing a practical procedure, we noticed that, to retain good performance, it is not necessary to know the weights $W^{(k)}_{ij}$ of Equation (4) precisely; it is sufficient to distinguish potential hubs from other nodes. Furthermore, in many applications, it is reasonable to assume prior information on whether or not specific nodes are hubs (whereas prior information on the exact node degrees is less feasible). For example, in protein–protein interaction networks, the key players are known from the literature, so we can assume that these nodes are hubs without knowing their exact degree. To evaluate whether a simple procedure is effective, we performed the following simulation: for each class $k$, we declared as "hubs" the nodes whose true degree exceeds a cut-off $cut^{(k)}$, assigned a large nominal degree (10) to the hubs and a nominal degree of 1 to all other nodes, and computed the weights $\mathbf{W}^{(k)}$ from these nominal degrees via Equation (4).

We evaluated this procedure for choosing weights by setting $cut^{(k)}$ to select 1, 3, and 5% of the highest-degree vertices as "hubs" for each $k$. The results are reported in Figure 5 with different colors, together with the results obtained using $\mathbf{W}^{(k)} = \mathbf{1}$ and the oracle weights.

**Figure 5.** ROC curves for the *jewel 2.0* method applied to $K = 3$ datasets of size $p = 500$, $n_k = 100$. Parameter *power* = 1.5. The upper black lines demonstrate the excellent performance obtained with the oracle weights, compared to the lower black lines that correspond to $\mathbf{W}^{(k)} = \mathbf{1}$ (i.e., the standard unweighted approach). The performance of the simple procedure that assigns weights as hub/non-hub using prior information is shown in different colors, depending on the percentage of nodes declared hubs. **Left panel**: performance for $m = 1$, i.e., 499 edges. **Right panel**: performance for $m = 2$, i.e., 997 edges.

Figure 5 shows the performance for the power of preferential attachment *power* = 1.5. We can see that, although the simple procedure does not reach the performance of the oracle weights, it outperforms the standard unweighted approach (i.e., $\mathbf{W}^{(k)} = \mathbf{1}$). Moreover, the results are quite robust to the percentage of nodes identified as hubs. The results are also robust to the specific value assigned to the hubs (data not shown); indeed, we applied the same simple procedure assigning 3 or 5 to the hubs instead of 10 and obtained a similar conclusion. Therefore, although an optimal choice of the weights remains an open issue, the weighted approach proposed in Equation (2) can address the problem of graphs with hubs, which represents the Achilles' heel of most of the available methods.

#### 3.2.3. Stability Selection Reduces the Number of False Positives

In this section, we compare the results obtained by running *jewel 2.0* on the same dataset with and without the stability procedure described in Section 2.4, where we fixed *#subsets* = 25, 50, 100 and *chosen fraction* = 0.6, 0.7, 0.8. The results are shown in Figure 6 with different colors and line types. In particular, in the left panel of Figure 6, we show the average (over *K* = 3) order of the estimated graphs *Ĝ*<sup>(*k*)</sup> (i.e., log<sub>2</sub> of the average number of estimated edges) as a measure of graph "sparseness"; in the middle panel, we show the average precision, i.e., the proportion of true positive edges among the estimated ones; and in the right panel, we show the average F1-score, defined as *F*1 = 2*TP*/(2*TP* + *FP* + *FN*), to measure the overall accuracy. As we can observe from Figure 6, the order of the estimated graphs decreases with the stability procedure (i.e., the graphs become sparser), as expected. More importantly, the stability selection procedure reduces the number of false positives at a much higher rate than the loss of true positives: overall, it increases the precision (middle panel) while maintaining good accuracy (right panel). We also note that the gain in performance is obtained even with *#subsets* = 25, meaning that stability selection can be used even when a high number of re-sampled sets is not computationally feasible.
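The stability selection scheme and the evaluation metrics described above can be sketched as follows. All names here are illustrative, not the package's API; in particular, `estimate_edges` stands in for running *jewel 2.0* on a subset of the samples, and the half-size subsets are an assumption (the paper does not specify the subset size here).

```python
import random

def stability_select(sample_ids, estimate_edges, n_subsets=25, chosen_fraction=0.8):
    """Run the estimator on random half-size sample subsets and keep only
    the edges present in at least `chosen_fraction` of the runs."""
    counts = {}
    for _ in range(n_subsets):
        subset = random.sample(sample_ids, len(sample_ids) // 2)
        for e in estimate_edges(subset):
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c >= chosen_fraction * n_subsets}

def precision_f1(true_edges, est_edges):
    """Precision = TP/(TP+FP) and F1 = 2TP/(2TP+FP+FN) over edge sets."""
    tp = len(true_edges & est_edges)
    fp = len(est_edges - true_edges)
    fn = len(true_edges - est_edges)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, f1
```

An edge that the estimator returns unstably (e.g., in fewer than 80% of the re-sampled runs) is discarded, which is the mechanism that trades a small loss of true positives for a large reduction in false positives.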

**Figure 6.** Performance of the *jewel 2.0* method applied to *K* = 3 datasets of size *p* = 500, *n*<sub>*k*</sub> = 100, *m* = 1, *power* = 1, with all weights set to *W*<sup>(*k*)</sup><sub>*ij*</sub> = 1. The solid black line denotes the performance without the stability procedure. Color denotes the varying threshold (% of subsets in which an edge must be present to be chosen); line type denotes the varying number of subsets. **Left panel**: log<sub>2</sub> of the order of the graphs. **Middle panel**: precision. **Right panel**: F1-score.

To highlight the advantage of having more precise estimates at the same accuracy, i.e., estimated graphs with greater "sparseness", in Figure 7 we compare results with and without stability selection for the same graph. It is evident that stability selection yields sparser graphs than those obtained without it. The comparison is provided for *#subsets* = 25 and *chosen fraction* = 0.8. Figure 7 is part of a more extensive comparison on a finer grid of *λ*<sub>1</sub>. Considering the entire study, we also observed that stability selection makes the final graph estimates less influenced by the specific choice of *λ*<sub>1</sub>. To conclude this subsection, we note that although the idea of stability selection is not novel, to our knowledge, *jewel 2.0* is the first joint estimation method to adopt it in the context of multiple graphs.

**Figure 7.** Estimated graph *Ĝ*<sup>(1)</sup> from one of the independent runs. Rows show results with *jewel 2.0* alone and with the additional stability selection procedure (*#subsets* = 25, edges chosen if present in 80% of the results); columns correspond to different values of the regularization parameter *λ*<sub>1</sub>.

#### *3.3. Comparison of Jewel 2.0 with Other Joint Estimation Methods*

In this subsection, we compare the performance of *jewel 2.0* with other methods for joint estimation: the joint graphical lasso (JGL) with group penalty [4] and the proposal of Guo et al. [5]. JGL requires two tuning parameters, *λ*<sub>1</sub> and *λ*<sub>2</sub>, where the first controls the differences between the graphs and the second the graph sparsity. The authors suggested using the relation *λ*<sub>1</sub> = ((1 − *ω*<sub>2</sub>)/(√2 *ω*<sub>2</sub>)) · *λ*<sub>2</sub>. We set the parameter *ω*<sub>2</sub> = 0.7 since it provided better performance in the original paper. Hence, here we vary only their parameter *λ*<sub>2</sub>.
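The JGL tuning relation above can be written as a one-line helper; `jgl_lambda1` is our name for this sketch, not a function of the JGL package, and the default *ω*<sub>2</sub> = 0.7 follows the choice made in the text.

```python
import math

def jgl_lambda1(lambda2, omega2=0.7):
    """JGL tuning relation lambda1 = (1 - omega2) / (sqrt(2) * omega2) * lambda2,
    so that a single free parameter lambda2 determines both penalties."""
    return (1 - omega2) / (math.sqrt(2) * omega2) * lambda2
```

With *ω*<sub>2</sub> = 0.7, each *λ*<sub>2</sub> on the grid maps to *λ*<sub>1</sub> ≈ 0.303 · *λ*<sub>2</sub>, which is how the ROC curve for JGL is traced with one varying parameter.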

We also compared *jewel 2.0* with another joint estimation method, *simone* [13] (which, as discussed in [1], is the most similar to *jewel 2.0*), but we did not include the results in the full comparisons for the following technical reasons. When we provided the range of *λ* parameters from 1 to 0.01 to the simone function, the estimation produced "out of convergence" results for *λ* ≈ 0.15 or lower (depending on the dataset realization). Using the default function parameters, the simone function went out of convergence even faster. This behavior did not allow us to produce a meaningful ROC curve for [13]. We should note that our simulation settings are different from the ones considered in [13], where the number of variables *p* was lower, and the authors explored only the *n* > *p* setting.

We considered four different *m*–*power* scenarios, with parameter *m* = 1, 2 (resulting in 499 edges with 0.4% sparsity and 997 edges with 0.8% sparsity, respectively, the same settings as in all previous experiments) and parameter *power* = 1, 1.5. In the case of *power* = 1.5, we also added *jewel 2.0* with a priori weights to the comparison to demonstrate its advantage. In that case, we estimated the weights using the simple approach with 1% of the vertices chosen as hubs; even though the performance with this setting was slightly worse than with 3% or 5%, we deemed it more realistic. Figure 8 illustrates the results of the comparison study.

As we can see in Figure 8, *jewel 2.0* and JGL demonstrate similar performance in the case of *power* = 1, both superior to the proposal of Guo et al. As already observed, in the case of *power* = 1.5, the performance of all three methods drops drastically; however, once we use *jewel 2.0* with a simple choice of weights, we see a significant gain in performance. Note that neither JGL nor the method of Guo et al. includes a mechanism for dealing with hubs. Therefore, while *jewel 2.0* can adapt its penalization to allow hubs to emerge, the other methods penalize all edges uniformly and do not perform well when graphs have hubs.

**Figure 8.** ROC curves corresponding to different joint estimation methods are denoted by different colors. Each panel demonstrates performance in different *m* − *power* settings [5].

#### **4. Results on Real Data**

The estimation of multiple GGMs can be useful in various contexts. For instance, GGMs allow connections to be found between stock prices (considering different markets as *K* classes), social relations (considering different social media as *K* classes), and so on. In the context of estimating gene networks from gene expression, we can consider gene expression data obtained from different equipment or laboratories (as we did, for example, for *jewel 1.0* [1]), or different sub-types of a disease. Identifying the gene regulatory networks and the differences associated with a particular sub-type can help tailor treatments. Since this paper discusses the advantages of our algorithm from a methodological point of view, in the following we provide only a small example of an application to real data, without investigating the biological implications of our findings.

We applied *jewel 2.0* to gene expression datasets of patients with breast cancer, the most common type of cancer among women. We considered four molecular breast cancer sub-types: luminal A, luminal B, basal-like, and HER2-enriched. Each sub-type can have specific regulatory mechanisms; however, most regulatory mechanisms are shared across all sub-types.

We used the gene expression microarray data available in The Cancer Genome Atlas and described in [14]. Gene expression was measured using Agilent G450 microarrays. The dataset contains 522 primary tumor samples, among which there are *n*<sub>1</sub> = 231 samples of luminal A cancers, *n*<sub>2</sub> = 127 of luminal B cancers, *n*<sub>3</sub> = 98 of basal-like cancers, and *n*<sub>4</sub> = 58 of HER2-enriched cancers. These *K* = 4 sub-types were identified in the original paper [14]. We used these *K* = 4 datasets to reveal sub-type-specific gene regulatory networks.

For the sake of simplicity, we limited our attention to the genes belonging to the following 10 pathways from the Kyoto Encyclopedia of Genes and Genomes database [15]: breast cancer pathway (hsa05224), estrogen signaling pathway (hsa04915), p53 signaling pathway (hsa04115), PI3K-Akt signaling pathway (hsa04151), GnRH signaling pathway (hsa04912), PPAR signaling pathway (hsa03320), Wnt signaling pathway (hsa04310), NF-kappa B signaling pathway (hsa04064), Notch signaling pathway (hsa04330), and Hedgehog signaling pathway (hsa04340). According to the literature, these pathways are associated with breast cancer. They involve 945 genes, of which 901 were measured in our datasets (*p* = 900 of these were also annotated in the STRING database [16]; see below). Therefore, we applied *jewel 2.0* (with weights for hub detection and with stability selection) to the *K* = 4 submatrices restricted to this subset of genes.

We retrieved prior weight information by building a broad network from the STRING database [16] with the *p* = 900 genes. STRING is a database of known and predicted protein–protein interactions, both physical and functional, derived from laboratory experiments, known co-expression, genomic-context predictions, and text mining of the literature and databases. We limited the query to connections from "experiments" and "databases" as active interaction sources and set the minimum required interaction score to the highest value of 0.9. As a result, the STRING network had 775 of the 900 vertices connected to at least one other node (i.e., 125 nodes were isolated) and 7514 edges.

From the STRING network, we chose the 27 vertices (3% of *p* = 900) with the highest degrees as "hubs". In *jewel 2.0*, we therefore set their degree to 10, while the degree of all other vertices was set to 1, and computed the edge weights using Equation (4). We chose the regularization parameter empirically as *λ*<sub>1</sub> = 0.35. We then ran *jewel 2.0* with this parameter and the estimated weights, and performed the stability selection procedure with 25 subsets, choosing edges that appeared in at least 80% of the results. The resulting networks are presented in Figure 9.

The resulting networks *Ĝ*<sup>(*k*)</sup>, *k* = 1, ..., 4, had 3336, 3297, 3318, and 3313 edges, respectively. There were 3171 edges in their intersection (gray edges in Figure 9); i.e., about 4–6% of the edges in each graph were class-specific. For each estimated network, we measured the number of edges in common with the STRING network and observed 418, 417, 408, and 412 shared edges, respectively. The *p*-values of the hypergeometric test assessing the significance of the edge overlaps were strictly less than 10<sup>−10</sup> for all four networks. Therefore, the estimated graphs showed a significant overlap with the STRING network compared to random graphs of the same size. This result supports the quality of the estimated connections, although connections in the STRING database are not tissue- or disease-specific and are not necessarily of the same nature as those estimated with *jewel 2.0* (conditional dependence). Thus, we did not regard a large overlap as the main quality criterion.
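The overlap test can be reproduced with a short stdlib-only sketch. We compute the hypergeometric upper-tail *p*-value in log space (in practice one would call `scipy.stats.hypergeom.sf`; `hypergeom_sf` is our helper name). The counts below assume edges are drawn from the *N* = *p*(*p* − 1)/2 = 404,550 possible pairs for *p* = 900, of which *K* = 7514 are STRING edges.

```python
import math

def log_binom(n, k):
    # log C(n, k); lgamma avoids overflow for large n
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_sf(k_obs, N, K, n):
    """P(X >= k_obs) for X ~ Hypergeometric(N, K, n): the probability of
    observing at least k_obs shared edges when an n-edge graph is drawn at
    random from N possible edges, K of which are STRING edges.
    Summation is done in log space for numerical stability."""
    lo = max(k_obs, n - (N - K), 0)          # smallest feasible overlap
    log_terms = [log_binom(K, k) + log_binom(N - K, n - k) - log_binom(N, n)
                 for k in range(lo, min(K, n) + 1)]
    m = max(log_terms)
    return math.exp(m) * sum(math.exp(t - m) for t in log_terms)
```

With the first estimated network (3336 edges, 418 shared with STRING), `hypergeom_sf(418, 404550, 7514, 3336)` is far below 10<sup>−10</sup>, consistent with the significance reported above (the expected overlap under randomness is only about 62 edges).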

We observe that for each of the *K* estimated graphs, the nodes with the highest degree include the 27 vertices we initially provided as potential hubs. Some of them are key genes in breast cancer, such as LCK, which was shown to influence cell motility in breast cancer, or TRAF6, which promotes tumorigenesis.

Finally, to investigate the impact of the choice of the regularization parameter, we repeated the analysis with a different value (e.g., *λ*<sub>1</sub> = 0.3) and obtained similar results in terms of estimated networks.

**Figure 9.** Networks estimated with *jewel 2.0* with *λ*<sub>1</sub> = 0.35, with weights derived from prior information (the 27 vertices (3% of *p* = 900) with the highest degree in the STRING network are identified as "hubs" with degree 10; all other vertices have degree 1), and with the stability procedure (*#subsets* = 25, *chosen fraction* = 0.8). Edges common to all *K* = 4 graphs are in gray; class-specific edges are colored. Note that vertices that are isolated in the union of these graphs are not plotted, for simplicity.

#### **5. Discussion**

The proposed method represents a significant advancement over our original method *jewel*, presented in [1]. *jewel 2.0* extends the range of applications by relaxing the assumptions and provides several other improvements. Namely, by reformulating the minimization problem as in Equation (2), (i) *jewel 2.0* jointly estimates both the common and the class-specific parts of GGMs from multiple datasets, and (ii) it allows the user to specify weights to better capture graph topologies that present hubs. Moreover, with *jewel 2.0*, we introduced a stability selection procedure in the multiple joint graphical model context. This procedure extends the idea of [3]. It reduces the number of false positives, providing sparser and more explicative graphs, and reduces the influence of the specific choice of the regularization parameter on the estimated graphs. Finally, we demonstrated the performance of *jewel 2.0* in simulations and on a real data application.

Although we showed that *jewel 2.0* performs well, we believe there is still room for improvement. Therefore, in the following, we highlight our study's limitations and directions for future work. We stress that these limitations are not specific to our approach but represent common challenges for all methods available in this context.

The *jewel 2.0* minimization problem requires two regularization parameters, *λ*<sub>1</sub> and *λ*<sub>2</sub>. The choice of the regularization parameters is known to impact the estimator's performance. In general, under high-dimensional settings, the problem of finding suitable estimates of the regularization parameters is still open, with limited theoretical results available to guarantee recovery of the true structure given the chosen parameter. In this paper, we chose to reduce the complexity of the search space by setting *λ*<sub>2</sub> = *φλ*<sub>1</sub>, fixing *φ* to a default value (selected to produce good performance under different settings). The value of *λ*<sub>1</sub> can be either estimated from the data (e.g., using BIC or similar data-driven criteria) or chosen by the user in a plausible range of values. In simulations, we investigated the performance of *jewel 2.0* using the ROC curve over a wide range of *λ*<sub>1</sub>. In the real data example, we chose *λ*<sub>1</sub> empirically and then used stability selection to reduce the impact of this choice on the final estimate. However, a data-driven estimation of *λ*<sub>1</sub>, or of both *φ* and *λ*<sub>1</sub>, would better adapt to the specific data structure. The work of [17] provides an interesting suggestion in this direction.

Another direction of future work consists of improving the stability selection procedure. Inspired by [3], we were the first to introduce a stability selection approach to reduce false positives in the context of multiple graphical model estimation. However, in this context, one can consider further adapting the idea of [3] for setting the bound on the number of false positives or extending the idea of [10]. Both methods used the graphical lasso on a single dataset. Hence, their adaptation to regression-based approaches and multiple datasets might not be straightforward. In general, the study of convergence properties of the resulting estimator would be advisable.

With the weighted approach, we have demonstrated the great improvement that *jewel 2.0* can bring when dealing with graphs containing hubs. In several applications, such as gene networks, knowledge available from the literature might give an indication of which genes can act as hubs. The incorporation of this information in the weights improves the estimate. However, the choice of optimal weights remains an open challenge to be addressed in future works.

Finally, the computational cost of GGM methods on big data remains challenging. Both *jewel 1.0* and *jewel 2.0* perform well for a number of variables *p* of the order of several hundred. Both become unfeasible when applied to the tens or hundreds of thousands of variables that can arise in big data analysis. In [1], we showed that other similar methods suffer from the same problem (some more than others), with our approach being the fastest. Although we have introduced some speed-ups (such as the active shooting approach), we believe there is still much room for improvement.

**Author Contributions:** Conceptualization, C.A., D.D.C. and A.P.; methodology, C.A., D.D.C. and A.P.; software, A.P.; formal analysis, A.P.; investigation, C.A., D.D.C. and A.P.; resources, C.A.; data curation, A.P.; writing—original draft preparation, C.A., D.D.C. and A.P.; writing and editing, C.A., D.D.C. and A.P.; supervision, C.A. and D.D.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the "Antitumor Drugs and Vaccines from the Sea (ADViSE)" project (CUP B43D18000240007–SURF 17061BP000000011) funded by POR Campania FESR 2014-2020 "Technology Platform for Therapeutic Strategies against Cancer"—Action 1.2.1 and 1.2.2.

**Data Availability Statement:** TCGA breast cancer gene expression dataset can be downloaded from https://gdc.cancer.gov/about-data/publications/brca\_2012, (accessed on 8 June 2022) as *BRCA.exp.547.med.txt* file with sub-types information in *BRCA.547.PAM50.SigClust.Subtypes.txt* file.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

