**Consistent Estimation of Generalized Linear Models with High Dimensional Predictors via Stepwise Regression**

#### **Alex Pijyan ¹, Qi Zheng ², Hyokyoung G. Hong ¹,\* and Yi Li ³**


Received: 1 August 2020; Accepted: 28 August 2020; Published: 31 August 2020

**Abstract:** Predictive models play a central role in decision making. Penalized regression approaches, such as least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to downsize variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure can provide a final model; control both false negatives and false positives; and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. Simulations and applications to two clinical studies verify the utility of the method.

**Keywords:** estimation consistency; generalized linear models; high dimensional predictors; model selection; stepwise regression

#### **1. Introduction**

In the era of precision medicine, constructing interpretable and accurate predictive models, based on patients' demographic characteristics, clinical conditions and molecular biomarkers, has been crucial for disease prevention, early diagnosis and targeted therapy [1]. When the number of predictors is moderate, penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO) of [2], have been used to construct predictive models and explain the impacts of the selected predictors. However, in ultrahigh dimensional settings where *p* is in the exponential order of *n*, penalized methods may incur computational challenges [3], may not reach globally optimal solutions and often generate biased estimates [4]. Sure independence screening (SIS), proposed by [5], has emerged as a powerful tool for modeling ultrahigh dimensional data. However, the method relies on a partial faithfulness assumption, which stipulates that jointly important variables must be marginally important, an assumption that may not always be realistic. To relieve this condition, some iterative procedures, such as ISIS [5], have been adopted to repeatedly screen variables based on the residuals from previous iterations, but with heavy computation and unclear theoretical properties. Conditional screening approaches (see, e.g., [6]) have, to some extent, addressed the challenge. However, screening methods do not directly generate a final model, and post-screening regularization methods, such as LASSO, are recommended by [5] to produce a final model.

For generating a final predictive model in ultrahigh dimensional settings, recent years have seen a surging interest in performing forward regression, an old technique for model selection; see [7–9], among many others. Under some regularity conditions and with proper stopping criteria, forward regression can achieve screening consistency and sequentially select variables according to metrics such as AIC, BIC or $R^2$. Also closely related to forward selection is least angle regression (LARS) [10], a widely used model selection algorithm for high-dimensional models. In the generalized linear model setting, refs. [11,12] proposed the differential geometric LARS (dgLARS), based on a differential geometric extension of LARS.

However, these methods have drawbacks. First, once a variable is identified by the forward selection, it is not removable from the list of selected variables. Hence, false positives are unavoidable without a systematic elimination procedure. Second, most of the existing works focus on variable selection and are silent with respect to estimation accuracy.

To address the first issue, several works have proposed adding backward elimination steps once forward selection is accomplished, as backward elimination may further remove false positives from the variables selected by forward selection. For example, refs. [13,14] proposed a stepwise selection for linear regression models in high-dimensional settings and proved model selection consistency. However, it is unclear whether the results hold for high-dimensional generalized linear models (GLMs). Ref. [15] proposed a similar stepwise algorithm in high-dimensional GLM settings, but with no theoretical properties on model selection. Moreover, none of the relevant works have touched upon the accuracy of estimation.

We extend a stepwise regression method to accommodate GLMs with high-dimensional predictors. Our method embraces both model selection and estimation. It starts with an empty model or pre-specified predictors, scans all features and sequentially selects them, and conducts backward elimination once forward selection is completed. Our proposal controls both false negatives and false positives in high dimensional settings: the forward selection steps recruit variables in an inclusive way by allowing some false positives for the sake of avoiding false negatives, while the backward selection steps eliminate the potential false positives from the recruited variables. We use different stopping criteria in the forward and backward selection steps to control the numbers of false positives and false negatives. Moreover, we prove that, under a sparsity assumption on the true model, the proposed approach can discover all of the relevant predictors within a finite number of steps, and that the estimated coefficients are consistent, a property previously unavailable in the literature. Finally, our GLM framework enables our work to accommodate a wide range of data types, such as binary, categorical and count data.

To recap, our proposed method distinguishes itself from the existing stepwise approaches in high dimensional settings. For example, it improves upon [13,14] by extending that work to the broader GLM setting, and upon [15] by establishing theoretical properties.

Compared with the other variable selection and screening works, our method produces a final model in ultrahigh dimensional settings, without applying a pre-screening step which may produce unintended false negatives. Under some regularity conditions, the method identifies or includes the true model with probability going to 1. Moreover, unlike the penalized approaches such as LASSO, the coefficients estimated by our stepwise selection procedure in the final model will be consistent, which are useful for gauging the real effect sizes of risk factors.

#### **2. Method**

Let $(\mathbf{X}_i, Y_i)$, $i = 1, \ldots, n$, denote $n$ independent and identically distributed (i.i.d.) copies of $(\mathbf{X}, Y)$. Here, $\mathbf{X} = (1, X_1, \ldots, X_p)^{\mathrm{T}}$ is a $(p+1)$-dimensional predictor vector with $X_0 = 1$ corresponding to the intercept term, and $Y$ is an outcome. Suppose that the conditional density of $Y$, given $\mathbf{X}$, belongs to a linear exponential family:

$$\pi(Y \mid \mathbf{X}) = \exp\{Y\mathbf{X}^{\mathrm{T}}\boldsymbol{\beta} - b(\mathbf{X}^{\mathrm{T}}\boldsymbol{\beta}) + A(Y)\},\tag{1}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^{\mathrm{T}}$ is the vector of coefficients; $\beta_0$ is the intercept; and $A(\cdot)$ and $b(\cdot)$ are known functions. Model (1), with a canonical link function and a unit dispersion parameter, belongs to a larger exponential family [16]. Further, $b(\cdot)$ is assumed twice continuously differentiable with a non-negative second derivative $b''(\cdot)$. We use $\mu(\cdot)$ and $\sigma(\cdot)$ to denote $b'(\cdot)$ and $b''(\cdot)$, i.e., the mean and variance functions, respectively. For example, $b(\theta) = \log(1 + \exp(\theta))$ for the logistic model and $b(\theta) = \exp(\theta)$ for the Poisson model.

Let $L(u, v) = uv - b(u)$, and let $\mathbb{E}_n\{f(\xi)\} = n^{-1}\sum_{i=1}^{n} f(\xi_i)$ denote the sample mean of $\{f(\xi_i)\}_{i=1}^{n}$ for a sequence of i.i.d. random variables $\xi_i$ ($i = 1, \ldots, n$) and a non-random function $f(\cdot)$. Based on the i.i.d. observations, the log-likelihood function is

$$\ell(\boldsymbol{\beta}) = n^{-1}\sum_{i=1}^{n} L(\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta}, Y_i) = \mathbb{E}_n\{L(\mathbf{X}^{\mathrm{T}}\boldsymbol{\beta}, Y)\}.\tag{2}$$

We use $\boldsymbol{\beta}_* = (\beta_{*0}, \beta_{*1}, \ldots, \beta_{*p})^{\mathrm{T}}$ to denote the true value of $\boldsymbol{\beta}$. Then the true model is $\mathcal{M} = \{j : \beta_{*j} \neq 0, j \ge 1\} \cup \{0\}$, which consists of the intercept and all variables with nonzero effects. The overarching goals of ultrahigh dimensional data analysis are to identify $\mathcal{M}$ and estimate $\beta_{*j}$ for $j \in \mathcal{M}$. While most of the relevant literature [8,9] focuses on estimating $\mathcal{M}$, this work accomplishes both the identification of $\mathcal{M}$ and the estimation of $\beta_{*j}$.

When $p$ is in the exponential order of $n$, we aim to generate a predictive model that contains the true model with high probability and provides consistent estimates of the regression coefficients. We further introduce the following notation. For a generic index set $S \subset \{0, 1, \ldots, p\}$ and a $(p+1)$-dimensional vector $\mathbf{A}$, we use $S^c$ to denote the complement of $S$ and $\mathbf{A}_S = \{A_j : j \in S\}$ to denote the subvector of $\mathbf{A}$ corresponding to $S$. For instance, if $S = \{2, 3, 4\}$, then $\mathbf{X}_{iS} = (X_{i2}, X_{i3}, X_{i4})^{\mathrm{T}}$. Moreover, denote by $\ell_S(\boldsymbol{\beta}_S) = \mathbb{E}_n\{L(\mathbf{X}_S^{\mathrm{T}}\boldsymbol{\beta}_S, Y)\}$ the log-likelihood of the regression model of $Y$ on $\mathbf{X}_S$, and by $\hat{\boldsymbol{\beta}}_S$ the maximizer of $\ell_S(\boldsymbol{\beta}_S)$. Under model (1), we elaborate on the idea of stepwise selection (details in the supplementary materials), consisting of the forward and backward stages.

**Forward stage:** We start with $F_0$, a set of variables that need to be included according to some *a priori* knowledge, such as clinically important factors and conditions. If no such information is available, $F_0$ is set to be $\{0\}$, corresponding to a null model. We sequentially add covariates as follows:

$$F_0 \subset F_1 \subset F_2 \subset \cdots \subset F_k,$$

where $F_k \subset \{0, 1, \ldots, p\}$ is the index set of the selected covariates upon completion of the $k$th step, with $k \ge 0$. At the $(k+1)$th step, we append new variables to $F_k$ one at a time and refit GLMs: for every $j \in F_k^c$, we let $F_{k,j} = F_k \cup \{j\}$, obtain $\hat{\boldsymbol{\beta}}_{F_{k,j}}$ by maximizing $\ell_{F_{k,j}}(\boldsymbol{\beta}_{F_{k,j}})$, and compute the increment of the log-likelihood,

$$\ell_{F_{k,j}}(\hat{\boldsymbol{\beta}}_{F_{k,j}}) - \ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}).$$

Then the index of a new candidate variable is determined to be

$$j_{k+1} = \mathop{\arg\max}_{j \in F_k^c}\left\{\ell_{F_{k,j}}(\hat{\boldsymbol{\beta}}_{F_{k,j}}) - \ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k})\right\}.$$

Additionally, we update $F_{k+1} = F_k \cup \{j_{k+1}\}$. We then need to decide whether to stop at the $k$th step or move on to the $(k+1)$th step with $F_{k+1}$. To do so, we use the following EBIC criterion:

$$\mathrm{EBIC}(F_{k+1}) = -2\ell_{F_{k+1}}(\hat{\boldsymbol{\beta}}_{F_{k+1}}) + |F_{k+1}|\, n^{-1}(\log n + 2\eta_1 \log p),\tag{3}$$

where the second term is motivated by [17] and |*F*| denotes the cardinality of a set *F*.

The forward selection stops if $\mathrm{EBIC}(F_{k+1}) > \mathrm{EBIC}(F_k)$. We denote the stopping step by $k^*$ and the set of variables selected so far by $F_{k^*}$.

**Backward stage:** Upon the completion of the forward stage, backward elimination, starting with $B_0 = F_{k^*}$, sequentially drops covariates as follows:

$$B_0 \supset B_1 \supset B_2 \supset \cdots \supset B_k,$$

where $B_k$ is the index set of the remaining covariates upon the completion of the $k$th step of the backward stage, with $k \ge 0$. At the $(k+1)$th backward step and for every $j \in B_k$, we let $B_{k/j} = B_k \setminus \{j\}$, obtain $\hat{\boldsymbol{\beta}}_{B_{k/j}}$ by maximizing $\ell_{B_{k/j}}(\boldsymbol{\beta}_{B_{k/j}})$, and compute the difference of the log-likelihoods between these two nested models:

$$\ell_{B_k}(\hat{\boldsymbol{\beta}}_{B_k}) - \ell_{B_{k/j}}(\hat{\boldsymbol{\beta}}_{B_{k/j}}).$$

The variable that can be removed from the current set of variables is indexed by

$$j_{k+1} = \mathop{\arg\min}_{j \in B_k}\left\{\ell_{B_k}(\hat{\boldsymbol{\beta}}_{B_k}) - \ell_{B_{k/j}}(\hat{\boldsymbol{\beta}}_{B_{k/j}})\right\}.$$

Let $B_{k+1} = B_k \setminus \{j_{k+1}\}$. We determine whether to stop at the $k$th step or move on to the $(k+1)$th step of the backward stage according to the following BIC criterion:

$$\mathrm{BIC}(B_{k+1}) = -2\ell_{B_{k+1}}(\hat{\boldsymbol{\beta}}_{B_{k+1}}) + \eta_2\, n^{-1}|B_{k+1}|\log n.\tag{4}$$

If $\mathrm{BIC}(B_{k+1}) > \mathrm{BIC}(B_k)$, we end the backward stage at the $k$th step. Let $k^{**}$ denote the stopping step; we declare the selected model $B_{k^{**}}$ to be the final model. Thus, $\hat{\mathcal{M}} = B_{k^{**}}$ is the estimate of $\mathcal{M}$. As the backward stage starts with the $k^*$ variables selected by forward selection, $k^{**}$ cannot exceed $k^*$.

A strength of our algorithm, termed STEPWISE hereafter, is the added flexibility of $\eta_1$ and $\eta_2$ in the stopping criteria for controlling false negatives and false positives. For example, a value of $\eta_1$ close to zero in the forward selection step will likely include more variables, and thus incur more false positives but fewer false negatives, whereas a large value of $\eta_1$ will recruit too few variables and cause too many false negatives. Similarly, in the backward selection step, a large $\eta_2$ would eliminate more variables and therefore further reduce false positives, and vice versa for a small $\eta_2$. While finding the optimal $\eta_1$ and $\eta_2$ is not trivial, our numerical experience suggests that a small $\eta_1$ and a large $\eta_2$ may well balance false negatives and false positives. When $\eta_2 = 0$, no variables can be dropped after forward selection; hence, our proposal includes forward selection as a special case.

Moreover, ref. [8] proposed a sequentially conditioning approach based on offset terms that absorb the prior information. However, our numerical experiments indicate that the offset approach may be suboptimal compared to our full stepwise optimization approach, as demonstrated in the simulation studies below.

#### **3. Theoretical Properties**

For a column vector $\mathbf{v}$, let $\|\mathbf{v}\|_q$ denote the $L_q$-norm for any $q \ge 1$. For simplicity, we denote the $L_2$-norm of $\mathbf{v}$ by $\|\mathbf{v}\|$, and denote $\mathbf{v}\mathbf{v}^{\mathrm{T}}$ by $\mathbf{v}^{\otimes 2}$. We use $C_1, C_2, \ldots$ to denote generic constants that do not depend on $n$ and may change from line to line. The following regularity conditions are set.


Condition (1), as assumed in [8,18], is an alternative to the Lipschitz assumption [5,19]. The bound $q$ on the model size allowed in the selection procedure is often required in model-based screening methods; see, e.g., [8,20–22]. The bound should be large enough that the correct model can be included, but not too large; otherwise, excessive noise variables would be included, leading to unstable and inconsistent estimates. Indeed, Conditions (1) and (6) reveal that the range of $q$ depends on the true model size $|\mathcal{M}|$, the minimum signal strength $n^{-\alpha}$ and the total number of covariates $p$. The upper bound of $q$ is $o\big((n^{1-4\alpha}/\log p) \wedge (n^{1/3}/\log p)\big)$, ensuring the consistency of EBIC [17]. Condition (1) also implies that the parameter space under consideration can be restricted to $\mathcal{B} := \{\boldsymbol{\beta} \in \mathbb{R}^{p+1} : \|\boldsymbol{\beta}\|_1 \le K^2\}$ for any model $S$ with $|S| \le q$. Condition (2), as assumed in [23,24], reflects that data are often standardized at the pre-processing stage. Condition (3) ensures that $Y$ has a light tail, and is satisfied by Gaussian and discrete data, such as binary and count data [25]. Condition (4) is satisfied by common GLM models, such as the Gaussian, binomial, Poisson and gamma distributions. Condition (5) represents the sparse Riesz condition [26], and Condition (6) is a strong "irrepresentable" condition, suggesting that $\mathcal{M}$ cannot be represented by a set of variables that does not include the true model. It further implies that adding a signal variable to a mis-specified model will increase the log-likelihood by a certain lower bound [8]. The signal rate is comparable to the conditions required by other sequential methods; see, e.g., [7,22].

Theorem 1 develops a lower bound of the increment of the log-likelihood if the true model M is not yet included in a selected model *S*.

**Theorem 1.** *Suppose Conditions (1)–(6) hold. There exists some constant $C_1$ such that, with probability at least $1 - 6\exp(-6q\log p)$,*

$$\min_{S:\,\mathcal{M} \not\subseteq S,\, |S| < q}\ \max_{j \in S^c}\left\{\ell_{S\cup\{j\}}(\hat{\boldsymbol{\beta}}_{S\cup\{j\}}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\right\} \ge C_1 n^{-2\alpha}.$$

Theorem 1 shows that, before the true model is included in the selected model, we can append a variable which will increase the log-likelihood by at least $C_1 n^{-2\alpha}$ with probability tending to 1. This ensures that in the forward stage, our proposed STEPWISE approach will keep searching for signal variables until the true model is contained. To see this, suppose at the $k$th step of the forward stage that $F_k$ satisfies $\mathcal{M} \not\subseteq F_k$ and $|F_k| < q$, and let $r$ be the index selected by STEPWISE. By Theorem 1, we obtain that, for any $\eta_1 > 0$, when $n$ is sufficiently large,

$$\begin{aligned} \mathrm{EBIC}(F_{k,r}) - \mathrm{EBIC}(F_k) &= -2\ell_{F_{k,r}}(\hat{\boldsymbol{\beta}}_{F_{k,r}}) + (|F_k| + 1)\,n^{-1}(\log n + 2\eta_1\log p) \\ &\quad - \left[-2\ell_{F_k}(\hat{\boldsymbol{\beta}}_{F_k}) + |F_k|\,n^{-1}(\log n + 2\eta_1\log p)\right] \\ &\le -2C_1 n^{-2\alpha} + n^{-1}(\log n + 2\eta_1\log p) < 0, \end{aligned}$$

with probability at least $1 - 6\exp(-6q\log p)$, where the last inequality is due to Condition (6). Therefore, with high probability the forward stage of STEPWISE continues as long as $\mathcal{M} \not\subseteq F_k$ and $|F_k| < q$. We next establish an upper bound on the number of steps in the forward stage needed to include the true model.

**Theorem 2.** *Under the same conditions as in Theorem 1 and if*

$$\max_{S:\,|S|\le q}\ \max_{j \in S^c \cap \mathcal{M}^c}\left|E\left[\left\{Y - \mu\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_j\right]\right| = o(n^{-\alpha}),$$

*then there exists some constant $C_2 > 2$ such that $\mathcal{M} \subset F_k$ for some $F_k$ in the forward stage of STEPWISE with $k \le C_2|\mathcal{M}|$, with probability at least $1 - 18\exp(-4q\log p)$.*

The "max" condition, as assumed in Section 5.3 of [27], relaxes the partial orthogonality assumption that **X**M*<sup>c</sup>* are independent of **X**M, and ensures that with probability tending to 1, appending a signal variable increases log-likelihood more than adding a noise variable does, uniformly over all possible models *<sup>S</sup>* satisfying M 6⊆ *<sup>S</sup>*, <sup>|</sup>*S*<sup>|</sup> <sup>&</sup>lt; *<sup>q</sup>*. This entails that the proposed procedure is much more likely to select a signal variable, in lieu of a noise variable, at each step. Since EBIC is a consistent model selection criterion [28,29], the following theorem guarantees termination of the proposed procedure with M ⊂ *F<sup>k</sup>* for some *k*.

**Theorem 3.** *Under the same conditions as in Theorem 2 and if $\mathcal{M} \not\subset F_{k-1}$ and $\mathcal{M} \subset F_k$, the forward stage stops at the $k$th step with probability at least $1 - \exp(-3q\log p)$.*

Theorem 3 ensures that the forward stage of STEPWISE will stop within a finite number of steps and will cover the true model with probability at least $1 - q\exp(-3q\log p) \ge 1 - \exp(-2q\log p)$. We next consider the backward stage and provide a probability bound of removing a signal from a set in which the set of true signals $\mathcal{M}$ is contained.

**Theorem 4.** *Under the same conditions as in Theorem 2, $\mathrm{BIC}(S\setminus\{r\}) - \mathrm{BIC}(S) > 0$ uniformly over $r \in \mathcal{M}$ and $S$ satisfying $\mathcal{M} \subset S$ and $|S| \le q$, with probability at least $1 - 6\exp(-6q\log p)$.*

Theorem 4 indicates that, with probability at least $1 - 6\exp(-6q\log p)$, BIC would increase when removing a signal variable from a model that contains the true model. That is, with high probability, backward elimination only reduces false positives.

Recall that $F_{k^*}$ denotes the model selected at the end of the forward selection stage. By Theorem 2, $\mathcal{M} \subset F_{k^*}$ with probability at least $1 - 18\exp(-4q\log p)$. Then Theorem 4 implies that at each step of the backward stage, a signal variable will not be removed from the model with probability at least $1 - 6\exp(-6q\log p)$. By Theorem 2, $|F_{k^*}| \le C_2|\mathcal{M}|$. Thus, the backward elimination will carry out at most $(C_2 - 1)|\mathcal{M}|$ steps. Combining the results of Theorems 2 and 4 yields that $\mathcal{M} \subset \hat{\mathcal{M}}$ with probability at least $1 - 18\exp(-4q\log p) - 6(C_2 - 1)|\mathcal{M}|\exp(-6q\log p)$. Let $\hat{\boldsymbol{\beta}}$ be the estimate of $\boldsymbol{\beta}_*$ in model (1) at the termination of STEPWISE. By convention, the estimates of the coefficients of the unselected covariates are 0.

**Theorem 5.** *Under the same conditions as in Theorem 2, we have that $\mathcal{M} \subseteq \hat{\mathcal{M}}$ and*

$$\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_*\| \to 0$$

*in probability.*

The theorem warrants that the proposed STEPWISE yields consistent estimates, a property not shared by many regularized methods, including LASSO; our simulations below verify this. Proofs of the main theorems and lemmas are provided in Appendix A.

#### **4. Simulation Studies**

We compared the proposal with other competing methods, including penalized methods such as the least absolute shrinkage and selection operator (LASSO) and the differential geometric least angle regression (dgLARS) [11,12]; the forward regression (FR) approach [7]; the sequentially conditioning (SC) approach [8]; and screening methods such as sure independence screening (SIS) [5], which is popular in practice. As SIS does not directly generate a predictive model, we applied LASSO to the top $[n/\log(n)]$ variables chosen by SIS and denote the procedure by SIS+LASSO. As the FR, SC and STEPWISE approaches all involve forward searching, to make them comparable we applied the same stopping rule, namely Equation (3) with the same $\gamma$, to their forward steps. In particular, the STEPWISE approach with $\eta_1 = \gamma$ and $\eta_2 = 0$ is equivalent to FR and asymptotically equivalent to SC. By varying $\gamma$ in FR and SC between $\gamma_L$ and $\gamma_H$, we explored the impact of $\gamma$ on inducing false positives and negatives. In our numerical studies, we fixed $\gamma_H = 10$ and set $\gamma_L = \eta_1$. To choose $\eta_1$ and $\eta_2$ in (3) and (4) in STEPWISE, we performed 5-fold cross-validation to minimize the mean squared prediction error (MSPE) and report the results in Table 1. Since the proposed STEPWISE algorithm uses the (E)BIC criterion, for a fair comparison we chose the tuning parameter in dgLARS with the BIC criterion as well, and denote the corresponding approach by dgLARS(BIC). The regularization parameter in LASSO was chosen via the following two approaches: (1) giving the smallest BIC for the models on the LASSO path, denoted by LASSO(BIC); (2) using the one-standard-error rule, denoted by LASSO(1SE), which chooses the most parsimonious model whose error is no more than one standard error above the error of the best model in cross-validation [30].

**Table 1.** The values of *η*<sup>1</sup> and *η*<sup>2</sup> used in the simulation studies.


Note: values for *η*<sup>1</sup> and *η*<sup>2</sup> were searched on the grid {0, 0.25, 0.5, 1} and {1, 2, 3, 3.5, 4, 4.5, 5}, respectively.

Denote by $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^{\mathrm{T}}$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\mathrm{T}}$ the covariate vector for subject $i$ ($i = 1, \ldots, n$) and the true coefficient vector, respectively. The following five examples specify $\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta}$, the inner product of the coefficient and covariate vectors for each individual, which was used to generate outcomes from the normal, binomial and Poisson models.

**Example 1.** *For each i,*

$$c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} = c \times \left(\sum_{j=1}^{p_0}\beta_j X_{ij} + \sum_{j=p_0+1}^{p}\beta_j X_{ij}\right), \quad i = 1, \ldots, n,$$

*where $\beta_j = (-1)^{B_j}(4\log n/\sqrt{n} + |Z_j|)$ for $j = 1, \ldots, p_0$ and $\beta_j = 0$ otherwise; $B_j$ was a binary random variable with $P(B_j = 1) = 0.4$ and $Z_j$ was generated from a standard normal distribution; $p_0 = 8$; the $X_{ij}$s were independently generated from a standardized exponential distribution, that is, $\exp(1) - 1$. Here, and also in the other examples, $c$ (specified later) controls the signal strength.*

**Example 2.** *This scenario is the same as* **Example 1** *except that Xij was independently generated from a standard normal distribution.*

**Example 3.** *For each i,*

$$c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} = c \times \left(\sum_{j=1}^{p_0}\beta_j X_{ij} + \sum_{j=p_0+1}^{p}\beta_j X_{ij}^*\right), \quad i = 1, \ldots, n,$$

*where $\beta_j = 2j$ for $1 \le j \le p_0$ and $p_0 = 5$. We simulated every component of $\mathbf{Z}_i = (Z_{ij}) \in \mathbb{R}^p$ and $\mathbf{W}_i = (W_{ij}) \in \mathbb{R}^p$ independently from a standard normal distribution. Next, we generated $\mathbf{X}_i$ according to $X_{ij} = (Z_{ij} + W_{ij})/\sqrt{2}$ for $1 \le j \le p_0$ and $X^*_{ij} = \big(Z_{ij} + \sum_{j'=1}^{p_0} Z_{ij'}\big)/2$ for $p_0 < j \le p$.*

**Example 4.** *For each i,*

$$c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} = c \times \left(\sum_{j=1}^{500}\beta_j X_{ij} + \sum_{j=501}^{p}\beta_j X_{ij}\right), \quad i = 1, \ldots, n,$$

*where the first 500 $X_{ij}$s were generated from the multivariate normal distribution with mean $\mathbf{0}$ and a covariance matrix with all diagonal entries being 1 and $\mathrm{cov}(X_{ij}, X_{ij'}) = 0.5^{|j-j'|}$ for $1 \le j, j' \le 500$. The remaining $p - 500$ $X_{ij}$s were generated through the autoregressive process $X_{i,501} \sim \mathrm{Unif}(-2, 2)$, $X_{ij} = 0.5X_{i,j-1} + 0.5X^*_{ij}$ for $j = 502, \ldots, p$, where the $X^*_{ij} \sim \mathrm{Unif}(-2, 2)$ were generated independently. The coefficients $\beta_j$ for $j = 1, \ldots, 7, 501, \ldots, 507$ were generated from $(-1)^{B_j}(4\log n/\sqrt{n} + |Z_j|)$, where $B_j$ was a binary random variable with $P(B_j = 1) = 0.4$ and $Z_j$ was drawn from a standard normal distribution. The remaining $\beta_j$ were zero.*

**Example 5.** *For each i,*

$$c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} = c \times \left(-0.5X_{i1} + X_{i2} + 0.5X_{i,100}\right), \quad i = 1, \ldots, n,$$

*where* **X***<sup>i</sup> were generated from a multivariate normal distribution with mean* **0** *and a covariance matrix with all of the diagonal entries being 1 and cov*(*Xij*, *Xij*′) *=* 0.9|*j*−*<sup>j</sup>* ′ | *for 1* ≤ *j, j'* ≤ *p. All of the coefficients were zero except for Xi*<sup>1</sup> *, Xi*<sup>2</sup> *and Xi*,100*.*

**Examples 1 and 3** were adopted from [7], while **Examples 2 and 4** were borrowed from [5,15], respectively. We then generated the responses from the following three models.

**Normal model:** $Y_i = c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta} + \epsilon_i$ with $\epsilon_i \sim N(0, 1)$.
**Binomial model:** $Y_i \sim \mathrm{Bernoulli}\big(\exp(c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta})/\{1 + \exp(c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta})\}\big)$.
**Poisson model:** $Y_i \sim \mathrm{Poisson}\big(\exp(c\mathbf{X}_i^{\mathrm{T}}\boldsymbol{\beta})\big)$.

We considered $n$ = 400 and $p$ = 1000 throughout all of the examples. We specified the magnitude of the coefficients in the GLMs with a constant multiplier, $c$. For Examples 1–5, this constant was set, respectively for the normal, binomial and Poisson models, to be: (1, 1, 0.3), (1, 1.5, 0.3), (1, 1, 0.1), (1, 1.5, 0.3) and (1, 3, 2). For each parameter configuration, we simulated 500 independent data sets. We evaluated the performances of the methods by the criteria of true positives (TP), false positives (FP), the estimated probability of including the true model (PIT), the mean squared error (MSE) of $\hat{\boldsymbol{\beta}}$ and the mean squared prediction error (MSPE); a sketch of these criteria is given below. To compute the MSPE, we randomly partitioned the samples into training (75%) and testing (25%) sets. The models obtained from the training datasets were used to predict the responses in the testing datasets. Tables 2–4 report the average TP, FP, PIT, MSE and MSPE over the 500 datasets, along with the standard deviations. The findings are summarized below.

First, the proposed STEPWISE method was able to detect all of the true signals with nearly zero FPs. Specifically, in all of the examples, STEPWISE outperformed the other methods by detecting more TPs with fewer FPs, whereas LASSO, SIS+LASSO and dgLARS included many more FPs.

Second, though a smaller *γ* in FR and SC led to the inclusion of all TPs with a PIT close to 1, it incurred more FPs. On the other hand, a larger *γ* may eliminate some TPs, resulting in a smaller PIT and a larger MSPE.

Third, for the normal model, the STEPWISE method yielded an MSE close to 0, the smallest among all the competing methods. The binary and Poisson data challenged all of the methods, and the MSEs for all the methods were non-negligible. However, the STEPWISE method still produced the lowest MSE. The results seemed to verify the consistency of $\hat{\boldsymbol{\beta}}$, which distinguished the proposed STEPWISE method from the other regularized methods and highlighted its ability to provide a more accurate means to characterize the effects of high dimensional predictors.


**Table 2.** Normal model.

Note: TP, true positives; FP, false positives; PIT, probability of including all true predictors among the selected predictors; MSE, mean squared error of $\hat{\boldsymbol{\beta}}$; MSPE, mean squared prediction error; numbers in the parentheses are standard deviations; LASSO(BIC), LASSO with the tuning parameter chosen to give the smallest BIC for the models on the LASSO path; LASSO(1SE), LASSO with the tuning parameter chosen by the one-standard-error rule; SIS+LASSO(BIC), sure independence screening by [5] followed by LASSO(BIC); SIS+LASSO(1SE), sure independence screening followed by LASSO(1SE); dgLARS(BIC), differential geometric least angle regression by [11,12] with the tuning parameter chosen to give the smallest BIC on the dgLARS path; SC(*γ*), sequentially conditioning approach by [8]; FR(*γ*), forward regression by [7]; STEPWISE, the proposed method; in FR and SC, the smaller and larger values of *γ* are presented as *γ<sub>L</sub>* and *γ<sub>H</sub>*, respectively; *p*<sub>0</sub> denotes the number of true signals; LASSO(1SE), LASSO(BIC), SIS and dgLARS were conducted via the R packages glmnet [31], ncvreg [32], screening [33] and dglars [34], respectively.


**Table 3.** Binomial model.

Note: abbreviations are explained in the footnote of Table 2.


**Table 4.** Poisson model.

Note: abbreviations are explained in the footnote of Table 2.

#### **5. Real Data Analysis**

#### *5.1. A Study of Gene Regulation in the Mammalian Eye*

To demonstrate the utility of our proposed method, we analyzed a microarray dataset from [35], in which 120 twelve-week-old male rats were selected for eye tissue harvesting. The dataset contained 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array); see [35] for a more detailed description of the data.

Although our method was applicable to the original 31,042 probe sets, many probes turned out to have very small variances and were unlikely to be informative for correlative analyses. Therefore, using variance as the screening criterion, we selected 5000 genes with the largest variances in expressions and correlated them with gene *TRIM32* that has been found to cause Bardet–Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina [36].

We applied the proposed STEPWISE method to the dataset with *n* = 120 and *p* = 5000, and treated the *TRIM32* gene expression as the response variable and the expressions of 5000 genes as the predictors. With no prior biological information available, we started with the empty set. To choose *η*<sup>1</sup> and *η*2, we carried out 5-fold cross-validation to minimize the mean squared prediction error (MSPE) by using the following grid search: *η*<sup>1</sup> = {0, 0.25, 0.5, 1} and *η*<sup>2</sup> = {1, 2, 3, 4, 5}, and set *η*<sup>1</sup> = 1 and *η*<sup>2</sup> = 4. We also performed the same procedure to choose the *γ* for FR and SC. The regularization parameters in LASSO and dgLARS were selected to minimize BIC values.

In the forward step, STEPWISE selected the probes *1376747\_at*, *1381902\_at*, *1382673\_at* and *1375577\_at*, and the backward step eliminated probe *1375577\_at*. The STEPWISE procedure produced the following final predictive model:

*TRIM32* = 4.6208 + 0.2310 × (*1376747\_at*) + 0.1914 × (*1381902\_at*) + 0.1263 × (*1382673\_at*).

Table A1 in Appendix B presents the numbers of overlapping genes among the competing methods. It shows that two of the three probes selected by our method, *1381902\_at* and *1376747\_at*, were also discovered by the other methods, except for dgLARS.

Next, we performed Leave-One-Out Cross-Validation (LOOCV) to obtain the distribution of the model size (MS) and MSPE for the competing methods.

As reported in Table 5 and Figure 1, LASSO, SIS+LASSO and dgLARS tended to select more variables than the forward approaches and STEPWISE. Among all of the methods, STEPWISE selected the fewest variables but with almost the same MSPE as the other methods.

**Table 5.** Comparisons of MSPE among competing methods using the mammalian eye data set.


Note: the mean squared prediction error (MSPE) was averaged over 120 splits. LASSO, least absolute shrinkage and selection operator with the regularization parameter that gives the smallest BIC; SIS+LASSO, sure independence screening by [5] followed by LASSO; dgLARS, differential geometric least angle regression by [11,12] that gives the smallest BIC; SC(*γ*), sequentially conditioning approach by [8]; FR(*γ*), forward regression by [7]; STEPWISE, the proposed method. STEPWISE was performed with *η*<sub>1</sub> = 1 and *η*<sub>2</sub> = 4; FR and SC were performed with *γ* = 1.

**Figure 1.** Box plot of model sizes for each method over 120 different training samples from the mammalian eye data set. STEPWISE was performed with *η*<sup>1</sup> = 1 and *η*<sup>2</sup> = 4, and FR and SC were conducted with *γ* = 1.

#### *5.2. An Esophageal Squamous Cell Carcinoma Study*

Esophageal squamous cell carcinoma (ESCC), the most common histological type of esophageal cancer, is known to be associated with poor overall survival, making early diagnosis crucial for treatment and disease management [37]. Several studies have investigated the roles of circulating microRNAs (miRNAs) in diagnosis of ESCC [38].

Using a clinical study that investigated the roles of miRNAs on the ESCC [39], we aimed to use miRNAs to predict ESCC risks and estimate their impacts on the development of ESCC. Specifically, with a dataset of serum profiling of 2565 miRNAs from 566 ESCC patients and 4965 controls without cancer, we demonstrated the utility of the proposed STEPWISE method in predicting ESCC with miRNAs.

To proceed, we used a balanced sampling scheme (283 cases and 283 controls) for the training dataset; the design of yielding an equal number of cases and controls in the training set has proved useful [39] for handling imbalanced outcomes such as those encountered here. To validate our findings, the samples were randomly divided into a training set ($n_1$ = 566, $p$ = 2565) and a testing set ($n_2$ = 4965, $p$ = 2565).

The training set consisted of 283 patients with ESCC (median age of 65 years, 79% male) and 283 control patients (median age of 68 years, 46.3% male), and the testing set consisted of 283 patients with ESCC (median age of 67 years, 85.7% male) and 4682 control patients (median age of 67.5 years, 44.5% male). Control patients without ESCC came from three sources: 323 individuals from National Cancer Center Biobank (NCCB); 2670 individuals from the Biobank of the National Center for Geriatrics and Gerontology (NCGG); and 1972 individuals from Minoru Clinic (MC). More detailed characteristics of cases and controls in the training and testing sets are given in Table 6.


**Table 6.** Clinicopathological characteristics of study participants of the ESCC data set.

We defined the binary outcome variable to be 1 if the subject was a case and 0 otherwise. As age and gender (0 = female, 1 = male) are important risk factors for ESCC [40,41] and it is common to adjust for them in clinical models, we set the initial set in STEPWISE to be $F_0$ = {age, gender}. With $\eta_1 = 0$ and $\eta_2 = 3.5$, also chosen by 5-fold CV, our procedure retained three miRNAs. More specifically, *miR-4783-3p*, *miR-320b*, *miR-1225-3p* and *miR-6789-5p* were selected among the 2565 miRNAs by the forward stage on the training set, and the backward stage then eliminated *miR-6789-5p*.

In comparison, with $\gamma$ = 0, both FR and SC selected four miRNAs: *miR-4783-3p*, *miR-320b*, *miR-1225-3p* and *miR-6789-5p*. The lists of miRNAs selected by the different methods are given in Table A2 in Appendix B.

Our findings were biologically meaningful, as the selected miRNAs had been identified by other cancer studies as well. Specifically, *miR-320b* was found to promote colorectal cancer proliferation and invasion by competing with its homologous *miR-320a* [42]. In addition, serum levels of *miR-320* family members were associated with clinical parameters and diagnosis in prostate cancer patients [43]. Reference [44] showed that *miR-4783-3p* was one of the miRNAs that could increase the risk of colorectal cancer death among rectal cancer cases. Finally, *miR-1225-5p* inhibited proliferation and metastasis of gastric carcinoma through repressing insulin receptor substrate-1 and activation of *β*-catenin signaling [45].

Aiming to identify a final model without resorting to a pre-screening procedure that may miss out on important biomarkers, we applied STEPWISE to reach the following predictive model for ESCC based on patients' demographics and miRNAs:

$$\mathrm{logit}^{-1}\left(-35.70 + 1.41 \times \textit{miR-4783-3p} + 0.98 \times \textit{miR-320b} + 1.91 \times \textit{miR-1225-3p} + 0.10 \times Age - 2.02 \times Gender\right),$$ where $\mathrm{logit}^{-1}(x) = \exp(x)/(1 + \exp(x))$.

In the testing dataset, the model had an area under the receiver operating characteristic curve (AUC) of 0.99 and achieved a high accuracy of 0.96, with a sensitivity and specificity of 0.97 and 0.95, respectively. Additionally, using the testing cohort, we evaluated the performances of the models sequentially selected by STEPWISE. Starting with a model containing age and gender, STEPWISE selected *miR-4783-3p*, *miR-320b* and *miR-1225-3p* in turn. Figure 2, showing the corresponding receiver operating characteristic (ROC) curves for these sequential models, revealed the improvement gained by sequentially adding predictors to the model and justified the importance of these variables in the final model. In addition, Figure 2e illustrated that adding the extra miRNA selected by FR and SC made little improvement to the model's predictive power.

Furthermore, we conducted subgroup analysis within the testing cohort to study how the sensitivity of the final model differed by cancer stage, one of the most important risk factors. The sensitivities for stages 0 (i.e., non-invasive cancer; $n$ = 27), 1 ($n$ = 128), 2 ($n$ = 57), 3 ($n$ = 61) and 4 ($n$ = 10) were 1.00, 0.98, 0.97, 0.97 and 1.00, respectively. We next evaluated how the specificity varied across controls coming from different data sources. The specificities for the various control groups, namely, NCCB ($n$ = 306), NCGG ($n$ = 2512) and MC ($n$ = 1864), were 0.99, 0.99 and 0.98, respectively. The results indicated the robust performance of the miRNA-based model across cancer stages and data sources.

Finally, to compare STEPWISE with the other competing methods, we repeatedly applied the aforementioned balanced sampling procedure and split the ESCC data into training and testing sets 100 times. We obtained the MSPE and the averages of accuracy, sensitivity, specificity and AUC. Figure 3 reports the model size of each method. Though STEPWISE selected fewer variables than the other variable selection methods (for example, LASSO selected 11–31 variables and dgLARS selected 12–51 variables), it achieved comparable prediction accuracy, specificity, sensitivity and AUC (see Table 7), evidencing the utility of STEPWISE for generating parsimonious models while maintaining competitive predictability.


**Table 7.** Comparisons of competing methods over 100 independent splits of the ESCC data into training and testing sets.

Note: Values were averaged over 100 splits. STEPWISE was performed with *η*<sup>1</sup> = 0 and *η*<sup>2</sup> = 1. SC and FR were performed with *γ* = 1. The regularization parameters in LASSO and dgLARS were selected to minimize the BIC.

We used R software [46] to obtain the numerical results in Sections 4 and 5 with the following packages: ggplot2 [47], ncvreg [32], glmnet [31], dglars [34] and screening [33].

**Figure 2.** *Cont*.

**Figure 2.** Comparisons of ROC curves for the selected models in the ESCC data set, in the sequentially selected order. Model 1: −2.52 + 0.02 × *Age* − 1.86 × *Gender*; Model 2: −20.64 + 0.08 × *Age* − 2.12 × *Gender* + 2.02 × *miR-4783-3p*; Model 3: −24.21 + 0.09 × *Age* − 2.16 × *Gender* + 1.44 × *miR-4783-3p* − 1.31 × *miR-320b*; Model 4: −35.70 + 0.10 × *Age* − 2.02 × *Gender* + 1.40 × *miR-4783-3p* − 0.98 × *miR-320b* + 1.91 × *miR-1225-3p*; Model 5: −53.10 + 0.10 × *Age* − 1.85 × *Gender* + 1.43 × *miR-4783-3p* − 0.92 × *miR-320b* + 1.43 × *miR-1225-3p* + 2.10 × *miR-6789-5p*.

**Figure 3.** Box plot of model sizes for each method based on 100 ESCC training datasets. Performance of STEPWISE is reported with *η*<sup>1</sup> = 0 and *η*<sup>2</sup> = 3.5. Performances of SC and FR are reported with *γ* = 0.

#### **6. Discussion**

We have proposed to apply STEPWISE to produce final models in ultrahigh dimensional settings, without resorting to a pre-screening step. We have shown that the method identifies or includes the true model with probability going to 1, and produces consistent coefficient estimates, which are useful for properly interpreting the actual impacts of risk factors. The theoretical properties of STEPWISE were established under mild conditions, which are worth discussing. As covariates are often standardized in practice for various reasons, Condition (2) is assumed without loss of generality. Conditions (3) and (4) are generally satisfied under common GLM models, including the Gaussian, binomial, Poisson and gamma distributions. Condition (5) is also often satisfied in practice, and Proposition 2 in [26] may be used as a tool to verify it. Conditions (1) and (6) are stated in terms of the true model size $|\mathcal{M}|$ and the minimum signal strength $n^{-\alpha}$, which are unknown in practice. The "irrepresentable" condition (6) is strong and may not hold in some real datasets; see, e.g., [48,49]. However, the condition holds under some commonly used covariance structures, including the AR(1) and compound symmetry structures [48].

As shown in simulation studies and real data analyses, STEPWISE tends to generate models as predictive as the other well-known methods, with fewer variables (Figure 3). Parsimonious models are useful for biomedical studies as they explain data with a small number of important predictors, and offer practitioners a realistic list of biomarkers to investigate. With categorical outcome data frequently observed in biomedical studies (e.g., histology types of cancer), STEPWISE can be extended to accommodate multinomial classification, with more involved notation and computation. We will pursue this elsewhere.

There are several open questions. First, our final model was determined by using (E)BIC, which involves two extra parameters *η*<sup>1</sup> and *η*2. In our numerical experiments, we used cross-validation to choose them, which seemed to work well. However, more in-depth research is needed to find their optimal values to strike a balance between false positives and false negatives. Second, despite our consistent estimates, drawing inferences based on them remains challenging. Statistical inference, which accounts for uncertainty in estimation, is key for properly interpreting analysis results and drawing appropriate conclusions. Our asymptotic results, nevertheless, are a stepping stone toward this important problem.

**Supplementary Materials:** An R package, STEPWISE, was developed and is available at https://github.com/AlexPijyan/STEPWISE, along with the examples shown in the paper.

**Author Contributions:** Conceptualization, Q.Z., H.H. and Y.L.; Formal analysis, A.P.; Methodology, A.P., Q.Z., H.H. and Y.L.; Project administration, H.H.; Software, A.P.; Supervision, H.H.; Writing – original draft, Q.Z., H.H. and Y.L.; Writing – review & editing, H.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded partially by grants from NSF (DMS1915099, DMS1952486) and NIH (R01AG056764, U01CA209414, R03AG067611).

**Acknowledgments:** We are thankful to the Editor, the AE and two referees for insightful suggestions that helped improve the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proofs of Main Theorems**

Since $b(\cdot)$ is twice continuously differentiable with a nonnegative second derivative $b''(\cdot)$, $b_{\max} := \max_{|t| \le K^3}|b(t)|$, $\mu_{\max} := \max_{|t| \le K^3}|b'(t)|$ and $\sigma_{\max} := \sup_{|t| \le K^3}|b''(t)|$ are bounded above, where $L$ and $K$ are some constants from Conditions (1) and (2), respectively. Let $\mathbb{G}_n\{f(\xi)\} = n^{-1/2}\sum_{i=1}^{n}\left(f(\xi_i) - E[f(\xi_i)]\right)$ for a sequence of i.i.d. random variables $\xi_i$ ($i = 1, \ldots, n$) and a non-random function $f(\cdot)$.

Given any $\boldsymbol{\beta}_S$, when a variable $X_r$, $r \in S^c$, is added into the model $S$, we define the augmented log-likelihood as

$$\ell_{S\cup\{r\}}(\boldsymbol{\beta}_{S+r}) := \mathbb{E}_n\left\{L\left(\boldsymbol{\beta}_S^{\mathrm{T}}\mathbf{X}_S + \beta_r X_r, Y\right)\right\}.\tag{A1}$$

We use $\hat{\boldsymbol{\beta}}_{S+r}$ to denote the maximizer of (A1); thus $\hat{\boldsymbol{\beta}}_{S+r} = \hat{\boldsymbol{\beta}}_{S\cup\{r\}}$. In addition, denote the maximizer of $E[\ell_{S\cup\{r\}}(\boldsymbol{\beta}_{S+r})]$ by $\boldsymbol{\beta}^*_{S+r}$. Due to the concavity of the log-likelihood in GLMs with the canonical link, $\boldsymbol{\beta}^*_{S+r}$ is unique.

**Proof of Theorem 1.** Given an index set $S$ and $r \in S^c$, let $\mathcal{B}^0_S(d) = \{\boldsymbol{\beta}_S : \|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le d/(K\sqrt{|S|})\}$, where $d = A_2\sqrt{q^3\log p/n}$ with $A_2$ defined in Lemma A6.

Let Ω be the event that

$$\begin{aligned} &\sup_{|S| \le q,\, \boldsymbol{\beta}_S \in \mathcal{B}_S^0(d)}\left|\mathbb{G}_n\left[L\left(\boldsymbol{\beta}_S^{\mathrm{T}}\mathbf{X}_S, Y\right) - L\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S, Y\right)\right]\right| \le 20A_1 d\sqrt{q\log p} \quad \text{and} \\ &\max_{|S| \le q}\left|\mathbb{G}_n\left[L\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S, Y\right)\right]\right| \le 10(A_1K^2 + b_{\max})\sqrt{q\log p}, \end{aligned}$$

where $A_1$ is some constant defined in Lemma A4. By Lemma A4, $P(\Omega) \ge 1 - 6\exp(-6q\log p)$. Thus, in the rest of the proof, we only consider the sample points in $\Omega$.

In the proof of Lemma A6, we show that $\max_{|S| \le q}\|\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}$ under $\Omega$. Then, given an index set $S$ and $\boldsymbol{\beta}_S$ such that $|S| < q$ and $\|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}$, and for any $j \in S^c$,

$$\begin{aligned} \ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S) &\ge \inf_{\|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}}\left\{\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\boldsymbol{\beta}_S)\right\} \\ &= n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S+j}\mathbf{X}_{S\cup\{j\}}, Y)\right] - n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right] \\ &\quad - \sup_{\|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}} n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{\mathrm{T}}_{S}\mathbf{X}_{S}, Y) - L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right] \\ &\quad + E\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S+j}\mathbf{X}_{S\cup\{j\}}, Y)\right] - E\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right] \\ &\ge -20(A_1K^2 + b_{\max})\sqrt{q\log p/n} - 20A_1A_2\,q^2\log p/n + \frac{\sigma_{\min}\kappa_{\min}}{2}\left\|\boldsymbol{\beta}^*_{S+j} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\right\|^2, \end{aligned}$$

where the second inequality follows from the event Ω and Lemma A5.

By Lemma A1, if $\mathcal{M} \not\subseteq S$, there exists $r \in S^c \cap \mathcal{M}$ such that $\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\| \ge C\sigma^{-1}_{\max}\kappa^{-1}_{\max}n^{-\alpha}$. Thus, there exists some constant $C_1$ that does not depend on $n$ such that

$$\max\_{j \in \mathcal{S}^\*} \ell\_{\mathbb{S} \cup \{j\}}(\hat{\mathfrak{f}}\_{\mathcal{S}+j}) - \ell\_{\mathbb{S}}(\hat{\mathfrak{f}}\_{\mathcal{S}}) \ge \max\_{j \in \mathcal{S}^\*} \ell\_{\mathbb{S} \cup \{j\}}(\mathcal{f}\_{\mathcal{S}+j}^\*) - \ell\_{\mathbb{S}}(\hat{\mathfrak{f}}\_{\mathcal{S}}) \ge \ell\_{\mathbb{S} \cup \{r\}}(\mathcal{f}\_{\mathcal{S}+r}^\*) - \ell\_{\mathbb{S}}(\hat{\mathfrak{f}}\_{\mathcal{S}})$$

$$\ge -20(A\_1 K^2 + b\_{\max}) \sqrt{q \log p / n} - 20A\_1 A\_2 q^2 \log p / n + \frac{\mathcal{C}^2 \sigma\_{\min} \kappa\_{\min} n^{-2\alpha}}{2\sigma\_{\max}^2 \kappa\_{\max}^2} \ge \mathcal{C}\_1 n^{-2\alpha}, \tag{A2}$$

where the first inequality follows from $\hat{\boldsymbol{\beta}}_{S+j}$ being the maximizer of (A1) and the last inequality follows from Conditions (1) and (6).

Withdrawing the restriction to Ω, we obtain that

$$P\left(\min_{S:\,\mathcal{M}\not\subseteq S,\,|S| < q}\ \max_{j\in S^c}\left\{\ell_{S\cup\{j\}}(\hat{\boldsymbol{\beta}}_{S\cup\{j\}}) - \ell_S(\hat{\boldsymbol{\beta}}_S)\right\} \ge C_1 n^{-2\alpha}\right) \ge P(\Omega) \ge 1 - 6\exp(-6q\log p).$$

**Proof of Theorem 2.** We have shown that our forward stage will not stop when $\mathcal{M} \not\subseteq S$ and $|S| < q$, with probability converging to 1.

For any $r \in S^c \cap \mathcal{M}^c$, $\boldsymbol{\beta}^*_{S+r}$ is the unique solution to the equation $E\left[\left\{Y - \mu\left(\boldsymbol{\beta}^{\mathrm{T}}_{S+r}\mathbf{X}_{S\cup\{r\}}\right)\right\}\mathbf{X}_{S\cup\{r\}}\right] = \mathbf{0}$. By the mean value theorem,

$$\begin{aligned} E\left[\left\{Y - \mu\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right] &= E\left[\left\{\mu\left(\boldsymbol{\beta}_*^{\mathrm{T}}\mathbf{X}\right) - \mu\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right] \\ &= E\left[\left\{\mu\left(\boldsymbol{\beta}_*^{\mathrm{T}}\mathbf{X}\right) - \mu\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right] - E\left[\left\{\mu\left(\boldsymbol{\beta}_*^{\mathrm{T}}\mathbf{X}\right) - \mu\left(\boldsymbol{\beta}_{S+r}^{*\mathrm{T}}\mathbf{X}_{S\cup\{r\}}\right)\right\}X_r\right] \\ &= \left(\boldsymbol{\beta}_{S+r}^* - (\boldsymbol{\beta}_S^{*\mathrm{T}}, 0)^{\mathrm{T}}\right)^{\mathrm{T}} E\left[\sigma\left(\tilde{\boldsymbol{\beta}}_{S+r}^{\mathrm{T}}\mathbf{X}_{S\cup\{r\}}\right)\mathbf{X}_{S\cup\{r\}}^{\otimes 2}\right]\mathbf{e}_r, \end{aligned}$$

where $\tilde{\boldsymbol{\beta}}_{S+r}$ is some point between $\boldsymbol{\beta}^*_{S+r}$ and $(\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}$, and $\mathbf{e}_r$ is a vector of length $(|S|+1)$ with the $r$th element being 1.

Since $|\tilde{\boldsymbol{\beta}}^{\mathrm{T}}_{S+r}\mathbf{X}_{S\cup\{r\}}| \le |\boldsymbol{\beta}^{*\mathrm{T}}_{S+r}\mathbf{X}_{S\cup\{r\}}| + |(\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)\mathbf{X}_{S\cup\{r\}}| \le 2K^2$ by Conditions (1) and (2), we have $\sigma(\tilde{\boldsymbol{\beta}}^{\mathrm{T}}_{S+r}\mathbf{X}_{S\cup\{r\}}) \ge \sigma_{\min}$ and

$$o(n^{-\alpha}) = \left|E\left[\left\{Y - \mu\left(\boldsymbol{\beta}_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right]\right| \ge \sigma_{\min}\kappa_{\min}\left\|\boldsymbol{\beta}_{S+r}^* - (\boldsymbol{\beta}_S^{*\mathrm{T}}, 0)^{\mathrm{T}}\right\|.$$

Therefore, $\max_{S:\,|S|\le q,\, r\in S^c\cap\mathcal{M}^c}\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\| = o(n^{-\alpha})$.

Under the event $\Omega$ defined in Theorem 1, $\max_{|S|\le q}\|\hat{\boldsymbol{\beta}}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}$. For any $j \in S^c$,

$$\begin{aligned} \ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\hat{\boldsymbol{\beta}}_S) &\le \sup_{\|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}}\left\{\ell_{S\cup\{j\}}(\boldsymbol{\beta}^*_{S+j}) - \ell_S(\boldsymbol{\beta}_S)\right\} \\ &\le \left|n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S+j}\mathbf{X}_{S\cup\{j\}}, Y)\right]\right| + \left|n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right]\right| \\ &\quad + \sup_{\|\boldsymbol{\beta}_S - \boldsymbol{\beta}^*_S\| \le A_2K^{-1}(q^2\log p/n)^{1/2}}\left|n^{-1/2}\mathbb{G}_n\left[L(\boldsymbol{\beta}^{\mathrm{T}}_{S}\mathbf{X}_{S}, Y) - L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right]\right| \\ &\quad + E\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S+j}\mathbf{X}_{S\cup\{j\}}, Y)\right] - E\left[L(\boldsymbol{\beta}^{*\mathrm{T}}_{S}\mathbf{X}_{S}, Y)\right] \\ &\le 20(A_1K^2 + b_{\max})\sqrt{qn^{-1}\log p} + 20A_1A_2\,q^2n^{-1}\log p + \sigma_{\max}\kappa_{\max}\left\|\boldsymbol{\beta}^*_{S+j} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\right\|^2/2, \end{aligned}$$

where the second inequality follows from the event $\Omega$ and Lemma A5. Since $\max_{S:\,|S|<q,\, r\in S^c\cap\mathcal{M}^c}\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\| = o(n^{-\alpha})$ and $qn^{-1+4\alpha}\log p \to 0$,

$$\begin{aligned} \max_{S:\,|S|<q,\, r\in S^c\cap\mathcal{M}^c}\ \ell_{S\cup\{r\}}(\boldsymbol{\beta}^*_{S+r}) - \ell_S(\hat{\boldsymbol{\beta}}_S) &\le 20(A_1K^2 + b_{\max})\sqrt{qn^{-1}\log p} + 20A_1A_2\,q^2n^{-1}\log p \\ &\quad + \sigma_{\max}\kappa_{\max}\left\|\boldsymbol{\beta}^*_{S+r} - (\boldsymbol{\beta}^{*\mathrm{T}}_S, 0)^{\mathrm{T}}\right\|^2/2 = o(n^{-2\alpha}), \end{aligned}$$

with probability at least $1-6\exp(-6q\log p)$. Then by Lemma A6,

$$\begin{split}
&\max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c} \ell_{S\cup\{r\}}(\hat{\beta}_{S+r})-\ell_S(\hat{\beta}_S) \\
&\le \max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\left|\ell_{S\cup\{r\}}(\hat{\beta}_{S+r})-\ell_{S\cup\{r\}}(\beta^*_{S+r})\right| + \max_{S:|S|<q,\,r\in S^c\cap\mathcal{M}^c}\left|\ell_{S\cup\{r\}}(\beta^*_{S+r})-\ell_S(\hat{\beta}_S)\right| \\
&\le A_3q^2n^{-1}\log p + o(n^{-2\alpha}) = o(n^{-2\alpha}),
\end{split}\tag{A3}$$

with probability at least $1-12\exp(-6q\log p)$.

By Theorem 1, if $\mathcal{M}\nsubseteq S$, the forward stage would select a noise variable with probability less than $18\exp(-6q\log p)$.

For $k>|\mathcal{M}|$, $\mathcal{M}\nsubseteq S_k$ implies that at least $k-|\mathcal{M}|$ noise variables are selected within the $k$ steps. Then for $k=C_2|\mathcal{M}|$ with $C_2>2$,

$$\begin{aligned} \mathbb{P}\left(\mathcal{M}\nsubseteq S\_k\right) &\leq \sum\_{j=k-|\mathcal{M}|}^k \binom{k}{j} \left\{ 18\exp(-6q\log p) \right\}^j \leq |\mathcal{M}| k^{|\mathcal{M}|} \left\{ 18\exp(-6q\log p) \right\}^{k-|\mathcal{M}|} \\ &\leq 18\exp(-6q\log p + \log|\mathcal{M}| + |\mathcal{M}|\log k) \leq 18\exp(-4q\log p). \end{aligned}$$

Therefore, $\mathcal{M}\subset S_{C_2|\mathcal{M}|}$ with probability at least $1-18\exp(-4q\log p)$.

**Proof of Theorem 3.** By Theorem 2, $\mathcal{M}$ will be included in $F_k$ for some $k<q$ with probability going to 1. Therefore, the forward stage stops at the $k$th step if $\mathrm{EBIC}(F_{k+1})>\mathrm{EBIC}(F_k)$.

On the other hand, $\mathrm{EBIC}(F_{k+1})<\mathrm{EBIC}(F_k)$ if and only if $2\ell_{F_{k+1}}(\hat{\beta}_{F_{k+1}})-2\ell_{F_k}(\hat{\beta}_{F_k})\ge(\log n+2\eta_1\log p)/n$. Thus, to show that the forward stage stops at the $k$th step, we only need to show that, with probability tending to 1,

$$2\ell_{F_{k+1}}(\hat{\beta}_{F_{k+1}}) - 2\ell_{F_k}(\hat{\beta}_{F_k}) < (\log n + 2\eta_1\log p)/n,\tag{A4}$$

for all $\eta_1>0$.

To prove (A4), we first verify conditions (A4) and (A5) in [17]. Given any index set $S$ such that $\mathcal{M}\subseteq S$ and $|S|\le q$, let $\beta_{*S}$ be the subvector of $\beta_*$ corresponding to $S$. We obtain that

$$E\left[\left\{Y-\mu\left(\beta_{*S}^{\mathrm{T}}\mathbf{X}_S\right)\right\}\mathbf{X}_S\right] = E\left[E\left[Y-\mu\left(\beta_{*\mathcal{M}}^{\mathrm{T}}\mathbf{X}_{\mathcal{M}}\right)\,\middle|\,\mathbf{X}_S\right]\mathbf{X}_S\right] = \mathbf{0}.$$

This implies $\beta^*_S=\beta_{*S}$.

Given any $\pi\in\mathbb{R}^{|S|}$, let $\mathcal{H}_S:=\left\{h(\pi,\beta_S)=(\sigma_{\max}K^2|S|)^{-1}\sigma\left(\beta_S^{\mathrm{T}}\mathbf{X}_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2,\ \|\pi\|=1,\ \beta_S\in\mathcal{B}^0_S(d)\right\}$. By Conditions (1) and (2), $h(\pi,\beta_S)$ is bounded between $-1$ and $1$ uniformly over $\|\pi\|=1$ and $\beta_S\in\mathcal{B}^0_S(d)$.

By Lemma 2.6.15 in [50], the VC indices of $\mathcal{W}:=\{(K\sqrt{|S|})^{-1}\pi^{\mathrm{T}}\mathbf{X}_S,\ \|\pi\|=1\}$ and $\mathcal{V}:=\{\beta_S^{\mathrm{T}}\mathbf{X}_S,\ \beta_S\in\mathcal{B}^0_S(d)\}$ are bounded by $|S|+2$. For the definitions of the VC index and covering numbers, we refer to pages 83 and 85 in [50]. The VC index of the class $\mathcal{U}:=\{(K^2|S|)^{-1}(\pi^{\mathrm{T}}\mathbf{X}_S)^2,\ \|\pi\|=1\}$ is the VC index of the class of sets $\{(\mathbf{X}_S,t):(K^2|S|)^{-1}(\pi^{\mathrm{T}}\mathbf{X}_S)^2\le t,\ \|\pi\|=1,\ t\in\mathbb{R}\}$. Since $\{(\mathbf{X}_S,t):(K^2|S|)^{-1}(\pi^{\mathrm{T}}\mathbf{X}_S)^2\le t\}=\{(\mathbf{X}_S,t):0<(K\sqrt{|S|})^{-1}\pi^{\mathrm{T}}\mathbf{X}_S\le\sqrt{t}\}\cup\{(\mathbf{X}_S,t):-\sqrt{t}<(K\sqrt{|S|})^{-1}\pi^{\mathrm{T}}\mathbf{X}_S\le 0\}$, each set of $\{(\mathbf{X}_S,t):(K^2|S|)^{-1}(\pi^{\mathrm{T}}\mathbf{X}_S)^2\le t,\ \|\pi\|=1,\ t\in\mathbb{R}\}$ is created by taking finite unions, intersections and complements of the basic sets $\{(\mathbf{X}_S,t):(K\sqrt{|S|})^{-1}\pi^{\mathrm{T}}\mathbf{X}_S<t\}$. Therefore, the VC index of $\{(\mathbf{X}_S,t):(K^2|S|)^{-1}(\pi^{\mathrm{T}}\mathbf{X}_S)^2\le t,\ \|\pi\|=1,\ t\in\mathbb{R}\}$ is of the same order as that of $\{(\mathbf{X}_S,t):(K\sqrt{|S|})^{-1}\pi^{\mathrm{T}}\mathbf{X}_S<t\}$, by Lemma 2.6.17 in [50].

Then by Theorem 2.6.7 in [50], for any probability measure $Q$, there exists some universal constant $C_3$ such that $N(\epsilon,\mathcal{U},L_2(Q))\le(C_3/\epsilon)^{2(|S|+1)}$. Likewise, $N(d\epsilon,\mathcal{V},L_2(Q))\le(C_3/\epsilon)^{2(|S|+1)}$. Given a $\beta_{S,0}\in\mathcal{B}^0_S(d)$, for any $\beta_S$ in the ball $\{\beta_S:\sup_{\mathbf{x}}|\beta_S^{\mathrm{T}}\mathbf{x}-\beta_{S,0}^{\mathrm{T}}\mathbf{x}|<d\epsilon\}$, we have $\sup_{\mathbf{x}}|\sigma(\beta_S^{\mathrm{T}}\mathbf{x})-\sigma(\beta_{S,0}^{\mathrm{T}}\mathbf{x})|<Kd\epsilon$ by Condition (4). Let $\mathcal{V}':=\{\sigma_{\max}^{-1}\sigma(\beta_S^{\mathrm{T}}\mathbf{X}_S),\ \beta_S\in\mathcal{B}^0_S(d)\}$. By the definition of the covering number, $N(Kd\epsilon,\mathcal{V}',L_2(Q))\le(C_3/\epsilon)^{2(|S|+1)}$. Given a $\sigma(\beta_{S,0}^{\mathrm{T}}\mathbf{x})$ and $\pi_0^{\mathrm{T}}\mathbf{x}$, for any $\sigma(\beta_S^{\mathrm{T}}\mathbf{x})$ in the ball $\{\sigma(\beta_S^{\mathrm{T}}\mathbf{x}):\sup_{\mathbf{x}}|\sigma(\beta_S^{\mathrm{T}}\mathbf{x})-\sigma(\beta_{S,0}^{\mathrm{T}}\mathbf{x})|\le Kd\epsilon\}$ and $\pi$ in the ball $\{\pi:\sup_{\mathbf{x}}|(\pi^{\mathrm{T}}\mathbf{x})^2-(\pi_0^{\mathrm{T}}\mathbf{x})^2|<\epsilon\}$, we have $(\sigma_{\max}K^2|S|)^{-1}\sup_{\mathbf{x}}|\sigma(\beta_S^{\mathrm{T}}\mathbf{x})(\pi^{\mathrm{T}}\mathbf{x})^2-\sigma(\beta_{S,0}^{\mathrm{T}}\mathbf{x})(\pi_0^{\mathrm{T}}\mathbf{x})^2|\le(\sigma_{\max}^{-1}Kd+(K^2|S|)^{-1})\epsilon$. Thus, $N((\sigma_{\max}^{-1}Kd+(K^2|S|)^{-1})\epsilon,\mathcal{H}_S,L_2(Q))\le(C_3/\epsilon)^{4(|S|+1)}$, and consequently $N(\epsilon,\mathcal{H}_S,L_2(Q))\le(C_4/\epsilon)^{4(|S|+1)}$ for some constant $C_4$.

By Theorem 1.1 in [51] and $|S|\le q$, we can find some constant $C_5$ such that

$$\begin{aligned}
&P\left(\sup_{\|\pi\|=1,\,\beta_S\in\mathcal{B}^0_S(d)}\left|\mathbb{G}_n[h(\pi,\beta_S)]\right| \ge C_5\sqrt{q\log p}\right) \\
&\le \frac{C_4'}{C_5\sqrt{q\log p}}\left(\frac{C_4'C_5^2q\log p}{4(|S|+1)}\right)^{4(|S|+1)}\exp\left(-2C_5^2q\log p\right) \\
&\le \exp\left(4(|S|+1)\log(C_4'C_5^2q\log p) - 2C_5^2q\log p\right) \le \exp(-5q\log p),
\end{aligned}$$

where $C_4'$ is some constant that depends on $C_4$ only. Thus,

$$\begin{split}
&P\left(\sup_{|S|\le q,\,\|\pi\|=1,\,\beta_S\in\mathcal{B}^0_S(d)}\left|\mathbb{E}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right] - E\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right]\right| \ge C_5K^2\sqrt{q^3\log p/n}\right) \\
&\le \sum_{s=|\mathcal{M}|}^q\left(\frac{ep}{s}\right)^s\exp(-5q\log p) \le \exp(-3q\log p).
\end{split}\tag{A5}$$

By Condition (5), $\sigma_{\min}\kappa_{\min}\le\Lambda\left(E\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\mathbf{X}_S^{\otimes2}\right]\right)\le\sigma_{\max}\kappa_{\max}$, where $\Lambda(\cdot)$ denotes the eigenvalues, for all $\beta_S\in\mathcal{B}^0_S(d)$ and all $S$ with $\mathcal{M}\subseteq S$ and $|S|<q$. Then, by (A5),

$$\sigma_{\min}\kappa_{\min}/2 \le \Lambda\left(\mathbb{E}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_{*S}\right)\mathbf{X}_S^{\otimes2}\right]\right) \le 2\sigma_{\max}\kappa_{\max}$$

uniformly over all $S$ satisfying $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1-\exp(-3q\log p)$. Hence, condition (A4) in [17] is satisfied with probability at least $1-\exp(-3q\log p)$.

Additionally, for any $\beta_S\in\mathcal{B}^0_S(d)$,

$$\begin{split}
&\left|\mathbb{E}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right] - \mathbb{E}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_{*S}\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right]\right| \\
&\le n^{-1/2}\left|\mathbb{G}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right]\right| + n^{-1/2}\left|\mathbb{G}_n\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_{*S}\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right]\right| \\
&\quad + \left|E\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_S\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right] - E\left[\sigma\left(\mathbf{X}_S^{\mathrm{T}}\beta_{*S}\right)\left(\pi^{\mathrm{T}}\mathbf{X}_S\right)^2\right]\right| \\
&\le 2C_5K^2\sqrt{q^3\log p/n} + \mu_{\max}\|\beta_S-\beta_{*S}\|\sqrt{|S|}K\lambda_{\max}.
\end{split}$$

Hence, condition (A5) in [17] is satisfied uniformly over all $S$ such that $\mathcal{M}\subseteq S$ and $|S|\le q$, with probability at least $1-\exp(-3q\log p)$.

Then (A4) can be shown by following the proof of Equation (3.2) in [17]. Thus, our forward stage stops at the $k$th step with probability at least $1-\exp(-3q\log p)$.
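The stopping rule just established is easy to operationalize. The following is a minimal sketch of the EBIC comparison that terminates the forward stage; it is not the authors' implementation, and the logistic-GLM fit via statsmodels, the design matrix `X`, the response `y`, the current path `F_k` and the constant `eta1` are all illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

def ebic_gain_too_small(X, y, F_k, j_new, eta1=1.0):
    """Forward-stage stopping check: stop when adding candidate j_new no longer
    improves EBIC, i.e., when 2*(llf_new - llf_old) < log(n) + 2*eta1*log(p)
    (the unnormalized form of (A4); the paper's log-likelihood is divided by n).
    """
    n, p = X.shape
    fit_old = sm.GLM(y, sm.add_constant(X[:, F_k]),
                     family=sm.families.Binomial()).fit()
    fit_new = sm.GLM(y, sm.add_constant(X[:, F_k + [j_new]]),
                     family=sm.families.Binomial()).fit()
    gain = 2.0 * (fit_new.llf - fit_old.llf)
    return gain < np.log(n) + 2.0 * eta1 * np.log(p)
```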

**Proof of Theorem 4.** Suppose that a covariate $X_r$ is removed from $S$. For any $r\in\mathcal{M}$, since $\mathcal{M}\nsubseteq S\backslash\{r\}$ and $r$ is the only element in $(S\backslash\{r\})^c\cap\mathcal{M}$, by Lemma A1 and (A2),

$$\begin{aligned}
\ell_S(\hat{\beta}_S) - \ell_{S\backslash\{r\}}(\hat{\beta}_{S\backslash\{r\}}) &\ge \ell_S(\beta^*_S) - \ell_{S\backslash\{r\}}(\hat{\beta}_{S\backslash\{r\}}) \\
&= \ell_{S\backslash\{r\}\cup\{r\}}(\beta^*_{S\backslash\{r\}+r}) - \ell_{S\backslash\{r\}}(\hat{\beta}_{S\backslash\{r\}}) \ge C_1n^{-2\alpha}
\end{aligned}$$

with probability at least $1-6\exp(-6q\log p)$. From the proof of Theorem 1, we have, for any $\eta_2>0$, $\mathrm{BIC}(S)-\mathrm{BIC}(S\backslash\{r\})\le-2C_1n^{-2\alpha}+\eta_2n^{-1}\log n<0$, uniformly over $r\in\mathcal{M}$ and $S$ satisfying $\mathcal{M}\subset S$ and $|S|\le q$, with probability at least $1-6\exp(-6q\log p)$.
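For the backward stage, a matching sketch of the BIC comparison behind Theorem 4 is given below. This is only an illustration under the standard BIC form $-2\,\mathrm{llf}+|S|\log n$, with hypothetical inputs `X`, `y`, `S` and `r`; it is not the authors' code.

```python
import numpy as np
import statsmodels.api as sm

def bic_keeps_variable(X, y, S, r):
    """Backward-stage check: keep variable r in S when removing it would raise
    BIC, i.e., when BIC(S) - BIC(S\\{r}) < 0, as in the proof of Theorem 4."""
    n = X.shape[0]
    S_minus = [j for j in S if j != r]
    llf_full = sm.GLM(y, sm.add_constant(X[:, S]),
                      family=sm.families.Binomial()).fit().llf
    llf_red = sm.GLM(y, sm.add_constant(X[:, S_minus]),
                     family=sm.families.Binomial()).fit().llf
    bic_full = -2.0 * llf_full + len(S) * np.log(n)
    bic_red = -2.0 * llf_red + len(S_minus) * np.log(n)
    return bic_full - bic_red < 0.0
```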

**Proof of Theorem 5.** By Theorems 1–3, the event $\Omega_1:=\{|\hat{\mathcal{M}}|\le q\ \text{and}\ \mathcal{M}\subseteq\hat{\mathcal{M}}\}$ holds with probability at least $1-25\exp(-2q\log p)$. Thus, in the rest of the proof, we restrict our attention to $\Omega_1$.

As shown in the proof of Theorem 3, $\beta^*_{\hat{\mathcal{M}}}=\beta_{*\hat{\mathcal{M}}}$. Then by Lemma A6, $\|\hat{\beta}_{\hat{\mathcal{M}}}-\beta^*_{\hat{\mathcal{M}}}\|\le A_2K^{-1}\sqrt{q^2\log p/n}$ with probability at least $1-6\exp(-6q\log p)$. Accounting for the probability of $\Omega_1$, we obtain that

$$\|\hat{\beta}-\beta_*\| = \|\hat{\beta}_{\hat{\mathcal{M}}}-\beta_{*\hat{\mathcal{M}}}\| = \|\hat{\beta}_{\hat{\mathcal{M}}}-\beta^*_{\hat{\mathcal{M}}}\| \le A_2K^{-1}\sqrt{q^2\log p/n}$$

with probability at least $1-31\exp(-2q\log p)$.

#### *Additional Lemmas and Proofs*

**Lemma A1.** *Given a model $S$ such that $|S|<q$ and $\mathcal{M}\nsubseteq S$, under Condition (6), (i): $\exists r\in S^c\cap\mathcal{M}$ such that $\beta^*_{S+r}\neq(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}$. (ii): Suppose Conditions (1), (2) and (6') hold. Then $\exists r\in S^c\cap\mathcal{M}$ such that $\|\beta^*_{S+r}-(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}\|\ge C\sigma_{\max}^{-1}\kappa_{\max}^{-1}n^{-\alpha}$.*

**Proof.** As $\beta^*_{S+j}$ is the maximizer of $E[\ell_{S\cup\{j\}}(\beta_{S+j})]$, by the concavity of $E[\ell_{S\cup\{j\}}(\beta_{S+j})]$, $\beta^*_{S+j}$ is the solution to the equation $E\left[\left\{Y-\mu\left(\beta_S^{\mathrm{T}}\mathbf{X}_S+\beta_jX_j\right)\right\}\mathbf{X}_{S\cup\{j\}}\right]=\mathbf{0}$. (i): Suppose that $\beta^*_{S+j}=(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}$ for all $j\in S^c\cap\mathcal{M}$. Then,

$$\begin{split}
0 &= E\left[\left\{Y-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_j\right] = E\left[\left\{\mu\left(\beta_*^{\mathrm{T}}\mathbf{X}\right)-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_j\right] \\
&\Rightarrow \max_{j\in S^c\cap\mathcal{M}}\left|E\left[\left\{\mu\left(\beta_*^{\mathrm{T}}\mathbf{X}\right)-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_j\right]\right| = 0,
\end{split}$$

which violates Condition (6). Therefore, we can find an $r\in S^c\cap\mathcal{M}$ such that $\beta^*_{S+r}\neq(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}$. (ii): Let $r\in S^c\cap\mathcal{M}$ satisfy $\left|E\left[\left\{\mu\left(\beta_*^{\mathrm{T}}\mathbf{X}\right)-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right]\right|>Cn^{-\alpha}$. Without loss of generality, we assume that $X_r$ is the last element of $\mathbf{X}_{S\cup\{r\}}$. By the mean value theorem,

*E µ β* T ∗**X** − *µ β* ∗T *<sup>S</sup>* **X***<sup>S</sup> X<sup>r</sup>* = *E µ β* T ∗**X** − *µ β* ∗T *<sup>S</sup>* **X***<sup>S</sup> X<sup>r</sup>* − *E µ β* T ∗**X** − *µ β* ∗T *<sup>S</sup>*+*r***X***S*∪{*r*} *X<sup>r</sup>* = *E µ β* ∗T *<sup>S</sup>*+*r***X***S*∪{*r*} − *µ* (*β* ∗T *S* , 0)**X***S*∪{*r*} *X<sup>r</sup>* = *β* ∗T *<sup>S</sup>*+*<sup>r</sup>* − (*β* ∗T *S* , 0) *E* - *σ β*˜T *<sup>S</sup>*+*r***X***S*∪{*r*} **X** ⊗2 *S*∪{*r*} **e***r* , (A6)

where $\tilde{\beta}_{S+r}$ is some point between $\beta^*_{S+r}$ and $(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}$ and $\mathbf{e}_r$ is a vector of length $|S|+1$ with the $r$th element being 1.

As $\tilde{\beta}_{S+r}$ is some point between $\beta^*_{S+r}$ and $(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}$, $|\tilde{\beta}_{S+r}^{\mathrm{T}}\mathbf{X}_{S\cup\{r\}}|\le|\beta_{S+r}^{*\mathrm{T}}\mathbf{X}_{S\cup\{r\}}|+|(\beta_S^{*\mathrm{T}},0)\mathbf{X}_{S\cup\{r\}}|\le 2K^2$, by Conditions (1) and (2). Thus, $|\sigma(\tilde{\beta}_{S+r}^{\mathrm{T}}\mathbf{X}_{S\cup\{r\}})|\le\sigma_{\max}$. By (A6) and Condition (5),

$$\begin{split}
Cn^{-\alpha} &\le \left|E\left[\left\{\mu\left(\beta_*^{\mathrm{T}}\mathbf{X}\right)-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right\}X_r\right]\right| \\
&\le \left\|\beta^*_{S+r}-(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}\right\|\sigma_{\max}\lambda_{\max}\left(E\left[\mathbf{X}_{S\cup\{r\}}^{\otimes2}\right]\right)\|\mathbf{e}_r\| \le \sigma_{\max}\kappa_{\max}\left\|\beta^*_{S+r}-(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}\right\|.
\end{split}$$

Therefore, $\|\beta^*_{S+r}-(\beta_S^{*\mathrm{T}},0)^{\mathrm{T}}\|\ge C\sigma_{\max}^{-1}\kappa_{\max}^{-1}n^{-\alpha}$.

**Lemma A2.** *Let $\xi_i$, $i=1,\dots,n$, be $n$ i.i.d. random variables such that $|\xi_i|\le B$ for a constant $B>0$. Under Conditions (1)–(3), we have $E\left[|Y_i\xi_i-E[Y_i\xi_i]|^m\right]\le m!\left(2B(\sqrt{2}M+\mu_{\max})\right)^m$ for every $m\ge1$.*

**Proof.** By Conditions (1) and (2), $|\beta_*^{\mathrm{T}}\mathbf{X}_i|\le KL$ for all $i\ge1$, and consequently $|\mu(\beta_*^{\mathrm{T}}\mathbf{X}_i)|\le\mu_{\max}$. Then by Condition (3),

$$\begin{aligned}
E[|Y_i|^m] &= E[|\varepsilon_i+\mu(\beta_*^{\mathrm{T}}\mathbf{X}_i)|^m] \le \sum_{t=0}^m\binom{m}{t}E\left[|\varepsilon_i|^t\right]\mu_{\max}^{m-t} \\
&\le \sum_{t=0}^m t!\binom{m}{t}M^t\mu_{\max}^{m-t} \le m!(M+\mu_{\max})^m
\end{aligned}$$

for every $m\ge1$. By the same arguments, it can be shown that, for every $m\ge1$, $E\left[|Y_i\xi_i-E[Y_i\xi_i]|^m\right]\le E\left[\left(|Y_i\xi_i|+|E[Y_i\xi_i]|\right)^m\right]\le m!\left(2B(M+\mu_{\max})\right)^m$.

**Lemma A3.** *Under Conditions (1)–(3), when $n$ is sufficiently large such that $28\sqrt{q\log p/n}<1$, we have $\sup_{\beta\in\mathbb{B}}\left|\mathbb{E}_n\left[L(\beta^{\mathrm{T}}\mathbf{X},Y)\right]\right|\le 2(M+\mu_{\max})K^3+b_{\max}$, with probability $1-2\exp(-10q\log p)$.*

**Proof.** By Condition (2), $\sup_{\beta\in\mathbb{B}}|\beta^{\mathrm{T}}\mathbf{X}|\le K^3$. Thus,

$$\begin{split}
\sup_{\beta\in\mathbb{B}}\left|\mathbb{E}_n\left[L(\beta^{\mathrm{T}}\mathbf{X},Y)\right]\right| &\le \sup_{\beta\in\mathbb{B}}\left|\mathbb{E}_n\left[|Y\beta^{\mathrm{T}}\mathbf{X}|\right]\right| + b_{\max} \\
&\le \left(\left|\mathbb{E}_n\left[|Y|-E[|Y|]\right]\right| + E[|Y|]\right)K^3 + b_{\max} \\
&\le \left|\mathbb{E}_n\left[|Y|-E[|Y|]\right]\right|K^3 + (M+\mu_{\max})K^3 + b_{\max},
\end{split}$$

where the last inequality follows from $E[|Y|]\le M+\mu_{\max}$, as shown in the proof of Lemma A2.

Let *<sup>ξ</sup><sup>i</sup>* <sup>=</sup> <sup>1</sup>{*Y<sup>i</sup>* <sup>&</sup>gt; <sup>0</sup>} − <sup>1</sup>{*Y<sup>i</sup>* <sup>&</sup>lt; <sup>0</sup>}. Thus <sup>|</sup>*ξ<sup>i</sup>* | ≤ 1. By Lemma A2, we have *E* h |*Yi* | − *E* [|*Y<sup>i</sup>* |] *m* i ≤ *m*!(2(*M* + *µ*max))*m*. Applying Bernstein's inequality (e.g., Lemma 2.2.11 in [50]) yields that

$$P\left(\left|\mathbb{E}_n\left[|Y|-E[|Y|]\right]\right| > 10(M+\mu_{\max})\sqrt{q\log p/n}\right) \le 2\exp\left(-\frac{1}{2}\,\frac{196\,q\log p}{4+20\sqrt{q\log p/n}}\right) \le 2\exp(-10q\log p),\tag{A7}$$

when $n$ is sufficiently large such that $20\sqrt{q\log p/n}<1$. Since $10(M+\mu_{\max})\sqrt{q\log p/n}=o(1)$, then

$$P\left(\sup_{\beta\in\mathbb{B}}\left|\mathbb{E}_n\left[L(\beta^{\mathrm{T}}\mathbf{X},Y)\right]\right| \ge 2(M+\mu_{\max})K^3+b_{\max}\right) \le 2\exp(-10q\log p).$$

**Lemma A4.** *Given an index set $S$ and $r\in S^c$, let $\mathcal{B}^0_S(d)=\{\beta_S:\|\beta_S-\beta^*_S\|\le d/(K\sqrt{|S|})\}$ and $A_1:=M+2\mu_{\max}$. Under Conditions (1)–(3), when $n$ is sufficiently large such that $10\sqrt{q\log p/n}<1$, we have*

$$P\left(\sup_{|S|\le q,\,\beta_S\in\mathcal{B}^0_S(d)}\left|\mathbb{G}_n\left[L\left(\beta_S^{\mathrm{T}}\mathbf{X}_S,Y\right)-L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| \ge 20A_1d\sqrt{q\log p}\right) \le 4\exp(-6q\log p)$$

*and*

$$P\left(\sup_{|S|\le q}\left|\mathbb{G}_n\left[L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| \ge 10(A_1K^2+b_{\max})\sqrt{q\log p}\right) \le 2\exp(-8q\log p).$$
**Proof.** (1): Let $\mathcal{R}_{|S|}(d)$ be the $|S|$-dimensional ball centered at $\mathbf{0}$ with radius $d/(K\sqrt{|S|})$. Then $\mathcal{B}^0_S(d)=\mathcal{R}_{|S|}(d)+\beta^*_S$. Let $\mathcal{C}_{|S|}:=\{C(\xi_k)\}$ be a collection of cubes that cover the ball $\mathcal{R}_{|S|}(d)$, where $C(\xi_k)$ is a cube containing $\xi_k$ with sides of length $d/(K\sqrt{|S|}n^2)$ and $\xi_k$ is some point in $\mathcal{R}_{|S|}(d)$. As the volume of $C(\xi_k)$ is $(d/(K\sqrt{|S|}n^2))^{|S|}$ and the volume of $\mathcal{R}_{|S|}(d)$ is less than $(2d/(K\sqrt{|S|}))^{|S|}$, we can select the $\xi_k$ so that no more than $(4n^2)^{|S|}$ cubes are needed to cover $\mathcal{R}_{|S|}(d)$. We thus assume $|\mathcal{C}_{|S|}|\le(4n^2)^{|S|}$. For any $\xi\in C(\xi_k)$, $\|\xi-\xi_k\|\le d/(Kn^2)$. In addition, let $T_{1S}(\xi):=\mathbb{E}_n\left[Y\xi^{\mathrm{T}}\mathbf{X}_S\right]$, $T_{2S}(\xi):=\mathbb{E}_n\left[b\left((\beta^*_S+\xi)^{\mathrm{T}}\mathbf{X}_S\right)-b\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right]$, and $T_S(\xi):=T_{1S}(\xi)-T_{2S}(\xi)$. Given any $\xi\in\mathcal{R}_{|S|}(d)$, there exists $C(\xi_k)\in\mathcal{C}_{|S|}$ such that $\xi\in C(\xi_k)$. Then

$$\begin{aligned}
|T_S(\xi)-E[T_S(\xi)]| &\le |T_S(\xi)-T_S(\xi_k)| + |T_S(\xi_k)-E[T_S(\xi_k)]| + |E[T_S(\xi)]-E[T_S(\xi_k)]| \\
&=: I+II+III.
\end{aligned}$$

We deal with $III$ first. By the mean value theorem, there exists a $\tilde{\xi}$ between $\xi$ and $\xi_k$ such that

$$\begin{split}
|E[T_S(\xi_k)]-E[T_S(\xi)]| &= \left|E\left[Y(\xi_k-\xi)^{\mathrm{T}}\mathbf{X}_S\right] - E\left[\mu\left((\beta^*_S+\tilde{\xi})^{\mathrm{T}}\mathbf{X}_S\right)(\xi_k-\xi)^{\mathrm{T}}\mathbf{X}_S\right]\right| \\
&\le E[|Y|]\,\|\xi_k-\xi\|\|\mathbf{X}_S\| + \mu_{\max}\|\xi_k-\xi\|\|\mathbf{X}_S\| \le (M+2\mu_{\max})d\sqrt{|S|}n^{-2} = A_1d\sqrt{|S|}n^{-2},
\end{split}\tag{A8}$$

where the last inequality follows from Lemma A2 and $A_1=M+2\mu_{\max}$.

Next, we evaluate $II$. By Condition (2), $|\mathbf{X}_{iS}^{\mathrm{T}}\xi|\le\|\mathbf{X}_{iS}\|\|\xi\|\le K\sqrt{|S|}\times d/(K\sqrt{|S|})=d$ for all $\xi\in\mathcal{R}_{|S|}(d)$. Then by Lemma A2,

$$E\left[\left|Y\xi_k^{\mathrm{T}}\mathbf{X}_S - E\left[Y\xi_k^{\mathrm{T}}\mathbf{X}_S\right]\right|^m\right] \le m!\left(2(M+\mu_{\max})d\right)^m.$$

By Bernstein's inequality, when $n$ is sufficiently large such that $10\sqrt{q\log p/n}\le1$,

$$P\left(\left|T_{1S}(\xi_k)-E[T_{1S}(\xi_k)]\right| > 10(M+\mu_{\max})d\sqrt{qn^{-1}\log p}\right) \le 2\exp\left(-\frac{1}{2}\,\frac{100\,q\log p}{4+20\sqrt{q\log p/n}}\right) \le 2\exp(-10q\log p).\tag{A9}$$

Since $\left|b\left((\beta^*_S+\xi_k)^{\mathrm{T}}\mathbf{X}_S\right)-b\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right|\le\mu_{\max}d$, by the same arguments used for (A9), we have

$$P\left(\left|T_{2S}(\xi_k)-E[T_{2S}(\xi_k)]\right| > 10\mu_{\max}d\sqrt{qn^{-1}\log p}\right) \le 2\exp(-10q\log p).\tag{A10}$$

Combining (A9) and (A10) yields that, uniformly over $\xi_k$,

$$\left|T_S(\xi_k)-E[T_S(\xi_k)]\right| \le 10A_1d\sqrt{qn^{-1}\log p},\tag{A11}$$

with probability at least $1-2(4n^2)^{|S|}\exp(-10q\log p)$.

We now assess $I$. Following the same arguments as in Lemma A3,

$$P\left(\sup_{\xi\in C(\xi_k)}\left|T_S(\xi)-T_S(\xi_k)\right| > (2M+3\mu_{\max})d\sqrt{|S|}n^{-2}\right) \le 2\exp(-8q\log p).\tag{A12}$$

Since $\sqrt{|S|}n^{-2}=o\left(\sqrt{qn^{-1}\log p}\right)$, combining (A8), (A11) and (A12) yields that

$$\begin{aligned}
&P\left(\sup_{\xi\in\mathcal{R}_{|S|}(d)}\left|T_S(\xi)-E[T_S(\xi)]\right| \ge 20A_1d\sqrt{qn^{-1}\log p}\right) \\
&\le 2(4n^2)^{|S|}\exp(-10q\log p) + 2\exp(-8q\log p) \le 4\exp(-8q\log p).
\end{aligned}$$

By the combinatoric inequality $\binom{p}{s}\le(ep/s)^s$, we obtain that

$$\begin{aligned}
&P\left(\sup_{|S|\le q,\,\beta_S\in\mathcal{B}^0_S(d)}\left|\mathbb{G}_n\left[L\left(\beta_S^{\mathrm{T}}\mathbf{X}_S,Y\right)-L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| \ge 20A_1d\sqrt{q\log p}\right) \\
&\le \sum_{s=1}^q(ep/s)^s\,4\exp(-8q\log p) \le 4\exp(-6q\log p).
\end{aligned}$$

(2): We evaluate the $m$th moment of $L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)$:

$$\begin{aligned}
E\left[\left|Y\beta_S^{*\mathrm{T}}\mathbf{X}_S - b\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\right|^m\right] &\le E\left[\sum_{t=0}^m\binom{m}{t}|Y|^tK^{2t}b_{\max}^{m-t}\right] \\
&\le \sum_{t=0}^m\binom{m}{t}t!\left((M+\mu_{\max})K^2\right)^tb_{\max}^{m-t} \le m!\left((M+\mu_{\max})K^2+b_{\max}\right)^m.
\end{aligned}$$

Then, by Bernstein's inequality,

$$P\left(\left|\mathbb{G}_n\left[L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| > 10(A_1K^2+b_{\max})\sqrt{q\log p}\right) \le 2\exp(-10q\log p).$$

By the same arguments used in part (1), we obtain that

$$\begin{aligned}
&P\left(\sup_{|S|\le q}\left|\mathbb{G}_n\left[L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| \ge 10(A_1K^2+b_{\max})\sqrt{q\log p}\right) \\
&\le \sum_{s=1}^q(ep/s)^s\,2\exp(-10q\log p) \le 2\exp(-8q\log p).
\end{aligned}$$

**Lemma A5.** *Given a model $S$ and $r\in S^c$, under Conditions (1), (2) and (5), for any $\|\beta_S-\beta^*_S\|\le K/\sqrt{|S|}$, $\sigma_{\min}\kappa_{\min}\|\beta_S-\beta^*_S\|^2/2 \le E[\ell_S(\beta^*_S)]-E[\ell_S(\beta_S)] \le \sigma_{\max}\kappa_{\max}\|\beta_S-\beta^*_S\|^2/2$.*

**Proof.** Due to the concavity of the log-likelihood in GLMs with the canonical link, $E\left[Y\mathbf{X}_S-\mu\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S\right)\mathbf{X}_S\right]=\mathbf{0}$. Then for any $\|\beta_S-\beta^*_S\|\le K/\sqrt{|S|}$, $|\beta_S^{\mathrm{T}}\mathbf{X}_S|\le|\beta_S^{*\mathrm{T}}\mathbf{X}_S|+|(\beta_S-\beta^*_S)^{\mathrm{T}}\mathbf{X}_S|\le K^2+K/\sqrt{|S|}\times K\sqrt{|S|}=2K^2$. Thus, by Taylor's expansion,

$$E[\ell_S(\beta_S)] - E[\ell_S(\beta^*_S)] = -\frac{1}{2}(\beta_S-\beta^*_S)^{\mathrm{T}}E\left[\sigma\left(\tilde{\beta}_S^{\mathrm{T}}\mathbf{X}_S\right)\mathbf{X}_S^{\otimes2}\right](\beta_S-\beta^*_S),$$

where $\tilde{\beta}_S$ is between $\beta_S$ and $\beta^*_S$. By Condition (5), $\sigma_{\min}\kappa_{\min}\|\beta_S-\beta^*_S\|^2/2 \le E[\ell_S(\beta^*_S)]-E[\ell_S(\beta_S)] \le \sigma_{\max}\kappa_{\max}\|\beta_S-\beta^*_S\|^2/2$.

**Lemma A6.** *Under Conditions (1)–(6), there exist some constants $A_2$ and $A_3$ that do not depend on $n$, such that $\|\hat{\beta}_S-\beta^*_S\|\le A_2K^{-1}\sqrt{q^2\log p/n}$ and $|\ell_S(\hat{\beta}_S)-\ell_S(\beta^*_S)|\le A_3q^2\log p/n$ hold uniformly over $S:|S|\le q$, with probability at least $1-6\exp(-6q\log p)$.*

**Proof.** Define

$$\Omega(d) := \left\{\sup_{|S|\le q,\,\beta_S\in\mathcal{B}^0_S(d)}\left|\mathbb{G}_n\left[L\left(\beta_S^{\mathrm{T}}\mathbf{X}_S,Y\right)-L\left(\beta_S^{*\mathrm{T}}\mathbf{X}_S,Y\right)\right]\right| < 20A_1d\sqrt{q\log p}\right\}.$$

By Lemma A4, the event $\Omega(d)$ holds with probability at least $1-4\exp(-6q\log p)$. Thus, in the proof of Lemma A6, we shall assume that $\Omega(d)$ holds with $d=A_2\sqrt{q^3\log p/n}$ for some $A_2>20(\sigma_{\min}\kappa_{\min})^{-1}K^2A_1$.

For any $\|\beta_S-\beta^*_S\|=A_2K^{-1}\sqrt{q^2\log p/n}$, since $\sqrt{q^2\log p/n}\le\sqrt{q^3\log p/n}/\sqrt{|S|}$, we have $\beta_S\in\mathcal{B}^0_S(d)$. By Lemma A5,

$$\begin{split}
&\ell_S(\beta^*_S)-\ell_S(\beta_S) \\
&= \left(\ell_S(\beta^*_S)-E[\ell_S(\beta^*_S)] - \left(\ell_S(\beta_S)-E[\ell_S(\beta_S)]\right)\right) + \left(E[\ell_S(\beta^*_S)]-E[\ell_S(\beta_S)]\right) \\
&\ge \sigma_{\min}\kappa_{\min}\|\beta_S-\beta^*_S\|^2/2 - 20A_1d\sqrt{q\log p/n} \\
&= \sigma_{\min}\kappa_{\min}A_2^2q^2\log p/(K^2n) - 20A_1A_2q^2\log p/n > 0.
\end{split}$$

Thus,

$$\inf_{|S|\le q,\,\|\beta_S-\beta^*_S\|=A_2K^{-1}\sqrt{q^2\log p/n}} \ell_S(\beta^*_S)-\ell_S(\beta_S) > 0.$$

Then by the concavity of $\ell_S(\cdot)$, we obtain that $\max_{|S|\le q}\|\hat{\beta}_S-\beta^*_S\|\le A_2K^{-1}\sqrt{q^2n^{-1}\log p}$. On the other hand, for any $\|\beta_S-\beta^*_S\|\le A_2K^{-1}\sqrt{q^2\log p/n}$,

$$\begin{aligned}
&\left|\ell_S(\beta^*_S)-\ell_S(\beta_S)\right| \\
&\le \left|\ell_S(\beta^*_S)-E[\ell_S(\beta^*_S)] - \left(\ell_S(\beta_S)-E[\ell_S(\beta_S)]\right)\right| + \left|E[\ell_S(\beta^*_S)]-E[\ell_S(\beta_S)]\right| \\
&\le \sigma_{\max}\kappa_{\max}\|\beta_S-\beta^*_S\|^2/2 + 20A_1d\sqrt{q\log p/n} \le A_3q^2n^{-1}\log p,
\end{aligned}$$

where $A_3:=4\sigma_{\max}\lambda_{\max}A_2^2K^{-2}+20A_1A_2$.

#### **Appendix B. Additional Results in the Applications**



Note: Diagonal and off-diagonal elements of the table represent the model sizes for each method and the number of overlapping genes selected by the two methods corresponding to the row and column, respectively.


**Table A2.** Selected miRNAs for ESCC training dataset.

Note: LASSO, SIS+LASSO and dgLARS selected 20, 17 and 33 miRNAs, respectively; we report only the top 10 miRNAs.

#### **References**



**Ant Colony System Optimization for Spatiotemporal Modelling of Combined EEG and MEG Data**

**Eugene A. Opoku 1,\*, Syed Ejaz Ahmed <sup>2</sup> , Yin Song <sup>1</sup> and Farouk S. Nathoo <sup>1</sup>**


**Abstract:** Electroencephalography/Magnetoencephalography (EEG/MEG) source localization involves the estimation of neural activity inside the brain volume that underlies the EEG/MEG measures observed at the sensor array. In this paper, we consider a Bayesian finite spatial mixture model for source reconstruction and implement Ant Colony System (ACS) optimization coupled with Iterated Conditional Modes (ICM) for computing estimates of the neural source activity. Our approach is evaluated using simulation studies and a real data application in which we implement a nonparametric bootstrap for interval estimation. We demonstrate improved performance of the ACS-ICM algorithm as compared to existing methodology for the same spatiotemporal model.

**Keywords:** ant colony system; Bayesian spatial mixture model; inverse problem; nonparametric bootstrap; EEG/MEG data


**Citation:** Opoku, E.A.; Ahmed, S.E.; Song, Y.; Nathoo, F.S. Ant Colony System Optimization for Spatiotemporal Modelling of Combined EEG and MEG Data. *Entropy* **2021**, *23*, 329. https:// doi.org/10.3390/e23030329

Academic Editor: David Holcman

Received: 21 January 2021 Accepted: 7 March 2021 Published: 11 March 2021



#### **1. Introduction**

Electroencephalography (EEG) and Magnetoencephalography (MEG) are two noninvasive approaches for measuring electrical activity of the brain with high temporal resolution. These neuroimaging techniques allow us to study brain dynamics and the complex informational exchange processes in the human brain. They are widely used in many clinical and research applications [1,2], though estimating the evoked-response activity within the brain from electromagnetic fields measured outside of the skull remains a challenging inverse problem with infinitely many different sources within the brain that can produce the same observed data [3].

Proposed solutions to the MEG/EEG inverse problem have been based on distributed source and dipolar methods [4]. In the case of distributed source methods, every location on a fine grid within the brain has associated neural activation source parameters. In this case, the number of unknown current sources exceeds the number of MEG or EEG sensors, and estimation thus requires constraints through regularization or priors to obtain a solution. For such methods, various steps have been taken to regularize the solution by choosing minimum-norm solutions or by limiting the spatiotemporal variation of the solution. These approaches impose *L*<sup>2</sup> or *L*<sup>1</sup> [5] norm regularization constraints that serve to stabilize and condition the source parameter estimates. However, these methods do not consider the temporal nature of the problem. In [6], the authors propose a dynamic state-space model that accounts for both spatial and temporal correlations within and across candidate intra-cortical sources using Bayesian estimation and Kalman filtering. Dipolar methods, on the other hand, assume that the actual current distribution can be explained by a small set of current dipoles with unknown locations, amplitudes and orientations (see [4] for review). Hence, the resulting inverse problem becomes non-linear, and the number of dipoles must also be estimated. Proposed solutions to this problem include algorithms

such as simulated annealing [7] to address nonlinear optimization in the localization of neuromagnetic sources.

From the perspective of Bayesian approaches, the ill-posed nature of the inverse problem requires incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. For instance, the authors of [8] consider Gaussian scale mixture models, with flexible, large covariance components representing spatial patterns of neural activity. The authors of [9] also propose a hierarchical linear model with Gaussian errors in a Parametric Empirical Bayes (PEB) framework whose random terms are drawn from multivariate Gaussian distributions and covariances factor into temporal and spatial components at the sensor and source levels. The authors of [10] propose an application of empirical Bayes to the source reconstruction problem with automatic selection of multiple cortical sources. The authors of [11] developed the Mesostate-Space Model (MSM) based on the assumption that the unknown neural brain activity can be specified in terms of a set of locally distributed and temporally coherent meso-sources for either MEG or EEG data, while the authors of [12] extend this approach to propose a Switching Mesostate-Space Model (SMSM) to allow flexibility by accounting for complex brain processes that cannot be characterized by linear and stationary Gaussian dynamics.

By extending and building on the MSM, the authors of [13] developed a Bayesian spatial finite mixture model incorporating the following two conditions, taken directly from [13]:


This spatiotemporal model jointly describes the combined MEG and EEG data, with brain neural activity modeled through a Gaussian spatial mixture: the neural source activity is described in terms of a few hidden states, each state having its own dynamics, and a Potts model is used to represent the spatial dependence in the mixture.

For the Bayesian mixture model formulated, an Iterated Conditional Modes (ICM) algorithm was developed by the authors of [13] for simultaneous point estimation and model selection for the number of mixture components in the latent process. Whilst ICM is a very simple and computationally efficient algorithm, its convergence is sensitive to starting values and local optima. This issue was left unresolved in [13]. Here we investigate the potential for finding better solutions, focusing on a population-based optimization algorithm, the Ant Colony System (ACS) [14].

ACS is a metaheuristic optimization algorithm inspired by the collective foraging behavior of ants constructing solutions [14]. ACS has been successfully applied in several areas such as clustering, data mining and image segmentation problems [15–17]. ACS is a constructive algorithm that uses an analogue of ant trail pheromones to learn about good features of solutions in combinatorial optimization problems. New solutions are generated using a parameterized probabilistic model, the parameters of which are updated using previously generated solutions so as to direct the search towards promising areas of the solution space. The model used in ACS is known as pheromone, an artificial analogue of the chemical substance used by real ants to mark trails from the nest to food sources. Based on this representation, each artificial ant constructs a part of the solution based on the concentration of pheromone information released by other ants. The amount of pheromone deposited by an ant reflects the quality of the solutions built along the traversed path, and pheromone deposition and evaporation together determine which solution components are added to partial solutions. After some time, and based on further communication among the ants through pheromone information, they tend to follow the same optimal paths, yielding the optimal solution, which in our context is the maximization of the posterior density.

As an alternative to the ICM algorithm, we thus implement the ACS algorithm coupled with a local search ICM algorithm to provide a new approach to model estimation and potentially better estimates of the model parameters. This approach is evaluated and found to provide significant improvements. Within the context of a simpler spatial mixture, ACS has been implemented for a Gaussian Potts mixture model in [18] and has been shown to outperform both the Simulated Annealing and ICM algorithms for parameter and mixture component estimation. The theoretical guarantees associated with simulated annealing reaching a global optimum depend on the choice of a cooling schedule, and the choice of an optimal cooling schedule can be difficult in practice for large spatiotemporal models. ACS has also proved to be competitive with genetic and other optimization algorithms in several tasks, mainly in image classification and the traveling salesman problem [19,20].

Ant Colony Optimization (ACO) algorithms have also been applied to Constraint Satisfaction Problems (CSPs), where ACO solutions face the challenge of high cost and low solution quality. Motivated by this challenge, the authors of [21] proposed Ant Colony Optimization based on information Entropy (ACOE). The idea is to incorporate a local search that uses a crossover operation to optimize the best solution according to the feedback of information entropy, comparing the difference in information entropy between the current global best solution and the best solution in the current iteration. Datasets from four classes of binary CSP test cases were generated and ACOE was implemented for comparison. Results showed that ACOE outperformed Particle Swarm Optimization (PSO), a Differential Evolution (DE) algorithm and Artificial Bee Colony (ABC) in terms of solution quality, data distribution and convergence performance.

To our knowledge, this is the first attempt at solving the neuroelectromagnetic inverse problem for combined EEG/MEG data using a population-based optimization approach combined with a spatial mixture model. The primary contribution of this paper is the design and implementation of the ACS algorithm to the dynamic spatial model and its evaluation. Importantly, we demonstrate improved results both in the estimation of neural activity and model selection uniformly across all conditions considered.

The rest of the paper proceeds as follows. The posterior distribution of the model and the design and implementation of the ACS algorithm are presented in Section 2. In Section 3, our algorithm is investigated using simulation studies and comparisons are made with the existing approach developed in [13]. Section 4 provides an illustration on real data and the development of a nonparametric bootstrap for interval estimation in a study of scrambled face perception. The paper concludes with a summary and directions for future work.

#### *Related Works*

Merging EEG and MEG aims at accounting for information missed by one modality and captured by the other. Fused reconstruction therefore appears promising to reach high temporal and spatial resolutions in brain function imaging. The authors of [22] address the added value of combining EEG and MEG data for distributed source localization, building on the flexibility of parametric empirical Bayes, namely for EEG–MEG data fusion, group level inference and formal hypothesis testing. The proposed approach follows a two-step procedure by first using unimodal or multimodal inference to derive a cortical solution at the group level, and second by using this solution as a prior model for single subject-level inference based on either unimodal or multimodal data. Another popular approach for non-globally optimized solutions of the MEG/EEG inverse problem is based on the use of adaptive Beamformers (BF). However, the BFs are known to fail when dealing with correlated sources acting like poorly tuned spatial filters with a low signal-to-noise ratio (SNR) of the output time series and often meaningless cortical maps of power distribution. To address this limitation, the authors of [23] developed a novel data covariance approach to supply robustness to the beamforming technique when operating in an environment with correlated sources. To reduce the impact of the low spatial resolution of MEG and EEG, the authors of [24] developed a unifying framework for quantifying the spatial fidelity of MEG/EEG source estimates. This method quantifies the spatial fidelity of MEG/EEG estimates from simulated patch activations over the entire neocortex superposed on measured resting-state data. This approach grants more generalizability in the evaluation process that allows for, e.g., comparing linear and non-linear estimates in the whole brain for different Signal-to-Noise Ratios (SNR), number of active sources and activation waveforms. The authors of [25] discuss a solution to the source reconstruction problem and developed a novel hierarchical multiscale Bayesian algorithm for electromagnetic brain imaging using MEG and EEG within the context of sources that vary in spatial extent. In this Bayesian algorithm, the sensor data measurements are defined using a generative probabilistic graphical model that is hierarchical across spatial scales of brain regions and voxels. This algorithm enables robust reconstruction of sources that have different spatial extent, from spatially contiguous clusters of dipoles to isolated dipolar sources.

In [26], the authors propose a methodological framework for inverse-modeling of propagating cortical activity. Within this framework, cortical activity is represented in the spatial frequency domain, which is more natural than the dipole domain when dealing with spatially continuous activity. In dealing with multi-subject MEG/EEG source imaging, the authors of [27] propose a sparse multi-task regression that takes into account inter-subject variabilities, known as the Minimum Wasserstein Estimates (MWE). This work jointly localizes sources for a population of subjects by casting the estimation as a multi-task regression problem built around three key ideas. First, it proposes to use non-linear registration to obtain subject-specific lead field matrices that are spatially aligned. Second, it copes with the issue of inter-subject spatial variability of functional activations using optimal transport. Finally, it makes use of non-convex sparsity priors and joint inference of source estimates to obtain accurate source amplitudes. Various applications of MEG/EEG source reconstruction have been reported in the clinical setting for detection of epileptic spikes [28], identification of the seizure onset zone [29] and presurgical workup of epilepsy patients [30–32].

#### **2. Methods**

This section describes the Bayesian spatial mixture model developed in [13] and the ACS-ICM algorithm.

#### *2.1. Model*

We provide details and a mathematical description of the joint model below. Let $M(t)=(M_1(t),M_2(t),\dots,M_{n_M}(t))'$ and $E(t)=(E_1(t),E_2(t),\dots,E_{n_E}(t))'$ denote the MEG and EEG measurements, respectively, at time $t$, $t=1,\dots,T$, where $n_M$ and $n_E$ denote the number of MEG and EEG sensors. The model assumes:

$$\mathbf{M}(t) = \mathbf{X}\_{\text{M}}\mathbf{S}(t) + \boldsymbol{\varepsilon}\_{\text{M}}(t), \quad \boldsymbol{\varepsilon}\_{\text{M}}(t)|\boldsymbol{\sigma}\_{\text{M}}^{2} \stackrel{\text{iid}}{\sim} MVN(\mathbf{0}, \sigma\_{\text{M}}^{2}\mathbf{H}\_{\text{M}}), \ t = 1, \ldots, T,$$

$$\mathbf{E}(t) = \mathbf{X}\_{\text{E}}\mathbf{S}(t) + \boldsymbol{\varepsilon}\_{\text{E}}(t), \quad \boldsymbol{\varepsilon}\_{\text{E}}(t)|\sigma\_{\text{E}}^{2} \stackrel{\text{iid}}{\sim} MVN(\mathbf{0}, \sigma\_{\text{E}}^{2}\mathbf{H}\_{\text{E}}), \ t = 1, \ldots, T,$$

where *X<sup>M</sup>* and *X<sup>E</sup>* denote *n<sup>M</sup>* × *P* and *n<sup>E</sup>* × *P* forward operators, respectively computed based on Maxwell's equations under the quasi-static assumption [33] for EEG and MEG; *H<sup>M</sup>* and *H<sup>E</sup>* are known *n<sup>M</sup>* × *n<sup>M</sup>* and *n<sup>E</sup>* × *n<sup>E</sup>* matrices, respectively, which can be obtained from baseline data providing information on the covariance structure of EEG and MEG sensor noise; and *S*(*t*) = (*S*1(*t*), . . . *SP*(*t*))′ represents the magnitude and polarity of neural currents sources over a fine grid covering the cortical surface. In this case, *P* represents a large number of point sources of potential neural activity within the brain covering the cortical surface. It is assumed that the *P* cortical locations are embedded in a 3D regular grid composed of *N<sup>v</sup>* voxels to allow efficient computational implementation. Given this grid of voxels, a mapping *v* : {1, ..., *P*} → {1, ..., *Nv*} is defined such that *v*(*j*) is the index

of the voxel containing the *j*th cortical location. We assume a latent Gaussian mixture with allocations at the level of voxels:

$$S_j(t)|\mu(t),\alpha,Z \stackrel{\text{ind}}{\sim} \prod_{l=1}^K N\left(\mu_l(t),\alpha_l\right)^{Z_{v(j)l}}\tag{1}$$

$j=1,\dots,P$, $t=1,\dots,T$; where $Z=(Z_1',Z_2',\dots,Z_{N_v}')'$ is a labeling process defined over the grid of voxels such that for each $v\in\{1,\dots,N_v\}$, $Z_v'=(Z_{v1},Z_{v2},\dots,Z_{vK})$ with $Z_{vl}\in\{0,1\}$ and $\sum_{l=1}^KZ_{vl}=1$; $\mu(t)=(\mu_1(t),\mu_2(t),\dots,\mu_K(t))'=(\mu_1(t),\mu^A(t)')'$, where $\mu^A(t)=(\mu_2(t),\dots,\mu_K(t))'$ denotes the mean of the "active" states over different components of activity and $\mu_1(t)=0$ for all $t$, so that the first component corresponds to an "inactive" state. The variability of the $l$th mixture component about its mean $\mu_l(t)$ is represented by $\alpha_l$, $l=1,\dots,K$.
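As a concrete illustration of Equation (1), the sketch below draws the source vector at one time point given voxel labels, component means and component variances. All dimensions and values here are made up, and `v_map` plays the role of the mapping $v(\cdot)$; this is an assumption-laden toy, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

P, N_v, K = 500, 100, 3                 # cortical locations, voxels, components
v_map = rng.integers(0, N_v, size=P)     # v(j): cortical location -> voxel index
z = rng.integers(0, K, size=N_v)         # voxel labels (argmax form of Z)
alpha = np.array([0.01, 1.0, 2.0])       # component variances alpha_l
mu_t = np.array([0.0, 1.5, -2.0])        # mu(t) at one time point; mu_1(t) = 0

# Equation (1): S_j(t) | Z, mu(t), alpha ~ N(mu_l(t), alpha_l),
# with l the label of the voxel v(j) containing location j
labels = z[v_map]
S_t = rng.normal(mu_t[labels], np.sqrt(alpha[labels]))
```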

The labeling process assigns each voxel to a latent state and is assumed to follow a Potts model:

$$P(Z|\beta) = \frac{\exp\left\{\beta\sum_{i\sim j}\delta(Z_i,Z_j)\right\}}{G(\beta)}, \quad \delta(Z_i,Z_j) = 2Z_i'Z_j - 1,$$

where *G*(*β*) is the normalizing constant for this probability mass function, *β* ≥ 0 is a hyper-parameter that governs the strength of spatial cohesion, and *i* ∼ *j* indicates that voxels *i* and *j* are neighbors, with a first-order neighborhood structure over the 3D regular grid. The mean temporal dynamics for active components is assumed to follow a first-order vector autoregressive process:

$$\mu^A(t) = \mathbf{A}\mu^A(t-1) + \mathbf{a}(t), \quad \mathbf{a}(t)|\sigma_a^2 \stackrel{\text{i.i.d.}}{\sim} MVN(\mathbf{0},\sigma_a^2\mathbf{I}),$$

$t=2,\dots,T$, $\mu^A(1)\sim MVN(\mathbf{0},\sigma^2_{\mu_1}\mathbf{I})$, with $\sigma^2_{\mu_1}$ fixed and known, but $\sigma^2_a$ unknown and assigned an inverse-Gamma$(a_a,b_a)$ hyper-prior. Although in [13] a pseudo-likelihood approximation to the normalizing constant of the Potts model was adopted and a uniform prior was then assigned to the spatial parameter to control the degree of spatial correlation, we fix the inverse temperature parameter and vary it as part of a sensitivity analysis.
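Since the ICM and ACS updates only ever compare labelings at a fixed $\beta$, the intractable constant $G(\beta)$ can be ignored. The sketch below, under our own helper names, evaluates the unnormalized Potts log-probability over first-order (6-connected) neighbor pairs of a 3D voxel grid.

```python
import numpy as np

def grid_neighbors(nx, ny, nz):
    """First-order (6-connected) neighbor pairs (i, j) of an nx x ny x nz grid."""
    idx = np.arange(nx * ny * nz).reshape(nx, ny, nz)
    pairs = []
    for axis in range(3):
        a = np.moveaxis(idx, axis, 0)
        pairs += list(zip(a[:-1].ravel(), a[1:].ravel()))  # adjacent slices
    return pairs

def potts_log_unnorm(z, neighbors, beta):
    """Unnormalized Potts log-probability beta * sum_{i~j} delta(z_i, z_j),
    where delta(z_i, z_j) = 2*1{z_i == z_j} - 1 is the one-hot form 2 Z_i'Z_j - 1.
    G(beta) is omitted: it is constant in z for fixed beta."""
    delta = np.array([2.0 * (z[i] == z[j]) - 1.0 for i, j in neighbors])
    return beta * delta.sum()
```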

For model selection, the number of mixture components $K$ in Equation (1) will not be known a priori, and so it is estimated simultaneously with the model parameters; this approach thus achieves simultaneous point estimation and model selection. A simple estimate of the number of mixture components can be obtained from the estimated allocation variables $\hat{Z}$ when the algorithm is run with a value of $K$ that is larger than the expected number of mixture components, for example $K=15$. The $j$th location on the cortex is allocated to one of the mixture components based on the estimated value of $\hat{Z}_{v(j)}$, where $\hat{Z}_{v(j)}=(\hat{Z}_{v(j)1},\hat{Z}_{v(j)2},\dots,\hat{Z}_{v(j)K})'$ and $\hat{Z}_{v(j)l}=1$ if location $j$ is allocated to component $l\in\{1,\dots,K\}$. When the algorithm is run with a large value of $K$, some mixture components will be left empty, with no voxel locations assigned under $\hat{Z}$; in a sense, these empty components have been automatically pruned out as redundant. The estimated number of mixture components is then the number of non-empty components:

$$\hat{K} = \sum_{l=1}^K I\left\{\sum_{v=1}^{N_v}\hat{Z}_{vl} \neq 0\right\}.$$

This estimator requires us to run the algorithm only once, for a single value of $K$; the number of mixture components assigned a location in $\hat{Z}$ is then counted, and $\hat{K}\le K$.
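In code, this pruning-based estimate amounts to counting the non-empty columns of $\hat{Z}$; a one-function sketch, assuming $\hat{Z}$ is stored as an $N_v\times K$ 0/1 matrix:

```python
import numpy as np

def estimate_K(Z_hat):
    """K_hat = number of mixture components with at least one voxel assigned,
    i.e., the count of non-empty columns of the N_v x K allocation matrix."""
    return int(np.count_nonzero(Z_hat.sum(axis=0)))

# e.g., running with K = 15 and finding 4 non-empty components gives K_hat = 4
```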

#### *2.2. Ant Colony System*

Ant Colony System (ACS) is a population-based optimization algorithm introduced in [14]. The basic structure of this algorithm was designed to solve the traveling salesman problem, in which the aim is to find the shortest path covering a given set of cities without revisiting any one of them. The algorithm was inspired by observation of the foraging behavior of real ants in their colony. This behavior is exploited in artificial ant colonies for the search of approximate solutions to discrete optimization problems, for continuous optimization problems, and for important problems in telecommunications, such as routing and load balancing, telecommunication network design, or problems in bioinformatics [34,35]. At the core of this algorithm is the communication between the ants by means of chemical pheromone trails, which enables them to collectively find short paths between their nest and a food source. The framework of this algorithm can be categorized into four main parts: (1) construction of an agent ant solution, (2) local pheromone update of the solution, (3) improvement of the solution by local search, and (4) global pheromone update of the best solution.

At each step of this constructive algorithm, a decision is made concerning which solution component to add to the sequence of solution components already built. These decisions depend on the pheromone information, which represents the learned experience of adding a particular solution component given the current state of the solution under construction. The accumulated amount of pheromone mirrors the quality of the solution constructed, based on the value of the objective function. The pheromone update aims to concentrate the search in regions of the search space containing high-quality solutions, while a stochastic component facilitates random exploration of the search space. In particular, the reinforcement of solution components depending on the solution quality is an important ingredient of ACS algorithms: learning which components contribute to good solutions can help assemble them into better solutions. In general, the ACS approach attempts to solve an optimization problem by iterating the following two steps: (1) candidate solutions are constructed using a pheromone model, that is, a parameterized probability distribution over the solution space; (2) the candidate solutions are used to modify the pheromone values in a way that is deemed to bias future sampling toward high-quality solutions.

The posterior distribution of the dynamic model takes the form *<sup>P</sup>*(**Θ**|*E*, *<sup>M</sup>*) <sup>=</sup> *P*(**Θ**, *E*, *M*)/*P*(*E*, *M*), where:

$$\begin{split}
P(\boldsymbol{\Theta},E,M) &= P(E,M|\boldsymbol{\Theta})P(\boldsymbol{\Theta}) = P(E|\boldsymbol{\Theta})P(M|\boldsymbol{\Theta})P(\boldsymbol{\Theta}) \\
&= \prod_{t=1}^T MVN\left(E(t);X_ES(t),\sigma_E^2H_E\right)\times MVN\left(M(t);X_MS(t),\sigma_M^2H_M\right) \\
&\quad\times IG\left(\sigma_E^2;a_E,b_E\right)\times IG\left(\sigma_M^2;a_M,b_M\right)\times\left[\prod_{j=1}^P\prod_{t=1}^T\prod_{l=1}^K N\left(S_j(t);\mu_l(t),\alpha_l\right)^{Z_{v(j)l}}\right] \\
&\quad\times\left(\prod_{t=2}^T MVN\left(\mu^A(t);\mathbf{A}\mu^A(t-1),\sigma_a^2\mathbf{I}\right)\right)\times MVN\left(\mu^A(1);\mathbf{0},\sigma_{\mu_1}^2\mathbf{I}\right)\times\mathrm{Potts}(Z;\beta) \\
&\quad\times\prod_{l=1}^K IG\left(\alpha_l;a_a,b_a\right)\times\left[\prod_{i=1}^{K-1}\prod_{j=1}^{K-1} N\left(A_{ij};0,\sigma_A^2\right)\right]\times IG\left(\sigma_a^2;a_a,b_a\right)
\end{split}\tag{2}$$

where *MVN*(**x**; *µ*, **Σ**) denotes the density of the dim(**x**)-dimensional multivariate normal distribution with mean *µ* and covariance **Σ** evaluated at **x**; *IG*(*x*; *a*, *b*) denotes the density of the inverse gamma distribution with parameters *a* and *b* evaluated at *x*; *N*(*x*; *µ*, *σ*²) denotes the density of the normal distribution with mean *µ* and variance *σ*² evaluated at *x*; and Potts(*Z*; *β*) is the joint probability mass function of the Potts model with parameter *β* evaluated at *Z*. Equation (2) is the objective function to be maximized over $\mathbf{\Theta} = \{S(t), Z, \mu(t), \alpha, \sigma_E^2, \sigma_M^2, A, \sigma_a^2\}$.

ACS is based on a set of agents, each representing an artificial ant that constructs solutions as sequences of solution components. Agent ant *k* builds a solution by allocating a label ℓ from the set of voxel labels $\Lambda = \{1, \ldots, K\}$ to each voxel $s \in \{1, \ldots, N_v\}$ based on a probabilistic transition rule $p^k(s,\ell)$, which quantifies the probability of ant *k* assigning voxel *s* to label ℓ. The transition rule depends on the pheromone information $\tau(s,\ell)$ of the coupling (*s*, ℓ), representing the quality of assigning voxel *s* to label ℓ based on the experience gathered by the ants in previous iterations. We let:

$$\ell = \begin{cases} \arg\max_{\iota \in \Lambda} \tau(s,\iota) & \text{if } q \le q_o \\ \text{drawn with probability } p^k(s,\ell) & \text{if } q > q_o \end{cases}$$

$$p^k(s,\ell) = \frac{\tau(s,\ell)}{\sum_{\iota \in \Lambda} \tau(s,\iota)} \tag{3}$$

where ℓ is the label for voxel *s* selected according to the transition rule above, $q \sim \text{Uniform}(0,1)$, and $q_o \in (0,1)$ is a tuning parameter. An artificial ant chooses, with probability $q_o$, the solution component that maximizes the pheromone function $\tau(s,\ell)$, or it performs, with probability $1-q_o$, a probabilistic construction step according to (3). The ACS pheromone system consists of two update rules: one applied while constructing solutions (the local pheromone update rule) and one applied after all ants have finished constructing a solution (the global pheromone update rule). After assigning a label to a voxel, an ant modifies the amount of pheromone of the chosen couple (*s*, ℓ) by applying the local pheromone update (4):

$$\tau(s,\ell) \leftarrow (1-\rho)\tau(s,\ell) + \rho\tau_o \tag{4}$$

where $\rho \in (0,1)$ is a tuning parameter that controls evaporation of the pheromone and $\tau_o$ is the initial pheromone value. This operation simulates the natural process of pheromone evaporation, preventing the algorithm from converging too quickly (all ants constructing the same solution) and getting trapped in a poor solution. In practice, the effect of this local pheromone update is to decrease the pheromone values on the visited solution components via the evaporation term $(1-\rho)\tau(s,\ell)$, making these components less desirable for subsequent ants. The value of the evaporation rate indicates the relative importance of the pheromone values from one iteration to the next. If *ρ* takes a value near 1, the pheromone trail will not have a lasting effect; this increases the random exploration of the search space within each iteration and helps avoid a too rapid convergence of the algorithm toward a sub-optimal region of the parameter space. A small value, in contrast, increases the importance of the pheromone, favoring exploitation of the search space near the current solution.
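A compact sketch of the label-assignment rule and the local update, assuming the pheromone is stored as an $N_v \times K$ matrix `tau`; the function and variable names are illustrative, not part of the original implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def assign_label(tau, s, q_o):
    """Pseudo-random proportional rule: exploit with probability q_o,
    otherwise sample a label from the distribution in Equation (3)."""
    if rng.uniform() <= q_o:
        return int(np.argmax(tau[s]))      # exploitation: highest pheromone
    p = tau[s] / tau[s].sum()              # Equation (3)
    return int(rng.choice(len(p), p=p))    # biased random exploration

def local_update(tau, s, label, rho, tau_o):
    """Equation (4): evaporate and re-deposit pheromone on the chosen couple."""
    tau[s, label] = (1 - rho) * tau[s, label] + rho * tau_o

# toy usage: 5 voxels, K = 3 labels
tau = np.full((5, 3), 0.2)
lab = assign_label(tau, s=0, q_o=0.9)
local_update(tau, s=0, label=lab, rho=0.1, tau_o=0.2)
```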

To improve the constructed solutions and update the remaining model parameters, we incorporate ICM as a local search method. Here, the ICM algorithm is used both for updating the model parameters and for a local search over the mixture allocation variables; the ACS update steps are thus combined with running ICM to convergence at each iteration. After all solutions have been constructed by the combined ACS and ICM steps, their quality is evaluated using the objective function and the best solution is selected. We then use a global update rule, in which pheromone evaporation is again applied to the chosen best solution. Assuming voxel *j* is assigned to label *v* in the best solution, the global update is given as:

$$\tau(j,v) \leftarrow (1-\rho)\tau(j,v) + \rho\tau_o, \qquad \tau(j,k) \leftarrow (1-\rho)\tau(j,k) \text{ for all } k \ne v.$$

The steps described are performed repeatedly until the change in the objective function becomes negligible, and the model parameters from the best solution are returned as the final parameter estimates. The optimal values of the tuning parameters ($q_o$, $\tau_o$, $\rho$) used in our ACS-ICM algorithm depend on the data. To choose them, we run an outer-level optimization on top of the ACS-ICM algorithm, using the Nelder–Mead algorithm [36] to optimize over ($q_o$, $\tau_o$, $\rho$).
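A sketch of this outer-level tuning, assuming a hypothetical wrapper `run_acs_icm` that runs the full algorithm with the given tuning parameters and returns the best objective value attained; `scipy.optimize.minimize` performs the Nelder–Mead search.

```python
from scipy.optimize import minimize

def neg_acs_icm_value(params):
    """Negated best log-posterior from an ACS-ICM run with tuning parameters
    (q_o, tau_o, rho); `run_acs_icm` is a hypothetical wrapper around the
    ACS-ICM algorithm applied to the data at hand."""
    q_o, tau_o, rho = params
    if not (0 < q_o < 1 and tau_o > 0 and 0 < rho < 1):
        return float("inf")  # reject out-of-range tuning parameters
    return -run_acs_icm(q_o=q_o, tau_o=tau_o, rho=rho)

res = minimize(neg_acs_icm_value, x0=[0.5, 0.1, 0.5], method="Nelder-Mead")
q_o_opt, tau_o_opt, rho_opt = res.x
```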

In order to reduce the dimension of the parameter space and the computing time, we apply clustering to the estimated neural sources. This is achieved by implementing a K-means algorithm to cluster the *P* locations on the cortex into a smaller number of *J* ≤ *P* clusters, assuming that $S_j(t) = S_l(t)$ for cortical locations *l*, *j* belonging to the same cluster. We investigated the values *J* = 250, 500, 1000 in our simulation studies. Within the ICM algorithm, the labeling process *Z* is updated using an efficient chequerboard updating scheme [13]. The scheme starts by partitioning *Z* into two blocks *Z* = {*Z<sub>W</sub>*, *Z<sub>B</sub>*} based on a three-dimensional chequerboard arrangement, where *Z<sub>W</sub>* corresponds to "white" voxels and *Z<sub>B</sub>* corresponds to "black" voxels. Under the Markov random field prior with a first-order neighborhood structure, the elements of *Z<sub>W</sub>* are conditionally independent given *Z<sub>B</sub>*, the remaining parameters, and the data *E*, *M*. This allows us to update *Z<sub>W</sub>* in a single step, simultaneously updating its elements from their full conditional distributions. The variables *Z<sub>B</sub>* are updated in the same way.
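The chequerboard split itself is simple to compute; a sketch, assuming voxels are indexed by integer 3D grid coordinates (names are illustrative):

```python
import numpy as np

def checkerboard_split(coords):
    """Partition voxels into 'black' and 'white' sets by the parity of the sum
    of their integer grid coordinates. Under a first-order neighborhood, every
    neighbor of a black voxel is white, so same-colour voxels can be updated
    simultaneously from their full conditionals."""
    coords = np.asarray(coords)           # shape (N_v, 3)
    parity = coords.sum(axis=1) % 2
    return np.where(parity == 0)[0], np.where(parity == 1)[0]

# toy usage on a 4 x 4 x 4 grid of voxels
grid = np.array([(i, j, k) for i in range(4) for j in range(4) for k in range(4)])
black, white = checkerboard_split(grid)
```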

It is well known that the ICM algorithm is sensitive to initial values, and the authors of [13] found this to be the case for the ICM algorithm developed for the spatiotemporal mixture model: the solution obtained, and even the convergence of the algorithm, depend rather heavily on the chosen starting values. In the case of ACS-ICM, regardless of the initial values, the algorithm finds a better solution with the optimal tuning parameters, and this solution tends to be quite stable. This is because ACS-ICM is a stochastic search procedure in which the pheromone update concentrates the search in regions of the search space containing high quality solutions. For a stochastic optimization algorithm, at least two types of convergence can be considered: convergence in value and convergence in solution. With convergence in value, we are interested in the probability that the algorithm generates an optimal solution at least once. In contrast, with convergence in solution we are interested in the probability that the algorithm reaches a state that keeps generating the same optimal solution. The convergence proofs are presented in [37,38]: the authors of [37] proved convergence of the ACS algorithm to the optimal solution with probability 1 − *ε*, and [38] generalized this result to any optimal solution. This supports the argument that, theoretically, the application of ACS-ICM to source reconstruction should improve on ICM.

The local search ICM algorithm procedure is presented in Algorithm 1 and the ACS-ICM algorithm is presented in Algorithm 2. Convergence of the ICM algorithms is monitored by examining the relative change of the Frobenius norm of the estimated neural sources on consecutive iterations.
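A sketch of this stopping criterion, with `S_new` and `S_old` holding the source estimates from consecutive iterations (illustrative names):

```python
import numpy as np

def icm_converged(S_new, S_old, tol=1e-6):
    """Relative change of the Frobenius norm of the estimated neural sources
    between consecutive ICM iterations."""
    denom = max(np.linalg.norm(S_old), 1e-12)  # guard against division by zero
    return np.linalg.norm(S_new - S_old) / denom < tol
```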

Algorithm 1 presents a detailed description of the ICM algorithm. ICM requires the full conditional distribution of each model parameter, and the mode of that distribution is taken as the update step for the parameter. The full conditional distributions are derived and presented in [13]. This ICM algorithm is embedded in our ACS-ICM algorithm.

**Algorithm 1** Iterated Conditional Modes (ICM) Algorithm.

1: $\mathbf{\Theta} = \{S(t), Z, \mu(t), \alpha, \sigma_E^2, \sigma_M^2, A, \sigma_a^2\} \leftarrow$ initial value

2: Converged ← 0

3: **while** Converged = 0 **do**

4: $\sigma_M^2 \leftarrow \left[\sum_{t=1}^{T}\frac{1}{2}\left(M(t) - X_M S(t)\right)'H_M^{-1}\left(M(t) - X_M S(t)\right) + b_M\right] \big/ \left[a_M + \frac{TN_M}{2} + 1\right]$

5: $\sigma_E^2 \leftarrow \left[\sum_{t=1}^{T}\frac{1}{2}\left(E(t) - X_E S(t)\right)'H_E^{-1}\left(E(t) - X_E S(t)\right) + b_E\right] \big/ \left[a_E + \frac{TN_E}{2} + 1\right]$

6: $\sigma_a^2 \leftarrow \left[\sum_{t=2}^{T}\frac{1}{2}\left(\mu^A(t) - A\mu^A(t-1)\right)'\left(\mu^A(t) - A\mu^A(t-1)\right) + b_a\right] \big/ \left[a_a + \frac{(T-1)(K-1)}{2} + 1\right]$

7: $vec(A) \leftarrow \left(\frac{1}{\sigma_a^2}\sum_{t=2}^{T}\mu^A(t)'Kr_t \times C_1^{-1}\right)'$, where $C_1 = \frac{1}{\sigma_A^2}I_{(K-1)^2} + \frac{1}{\sigma_a^2}\sum_{t=2}^{T}Kr_t'Kr_t$ and $Kr_t = \mu^A(t-1)' \otimes I_{K-1}$

8: **for** *l* = 1, . . . , *K* **do**

9: $\alpha_l \leftarrow \left[\sum_{j=1}^{P}\sum_{t=1}^{T}\frac{Z_{v(j)l}(S_j(t)-\mu_l(t))^2}{2} + b_\alpha\right] \big/ \left[\frac{T}{2}\sum_{j=1}^{P}Z_{v(j)l} + a_\alpha + 1\right]$

10: **end for**

11: $\mu^A(1) \leftarrow \left(\left(\sum_{j=1}^{P}(S_j(1)\vec{I}_{K-1})'D_j + \frac{1}{\sigma_a^2}\mu^A(2)'A\right) \times B_1^{-1}\right)'$, where $B_1 = \sum_{j=1}^{P}D_j + \frac{1}{\sigma_a^2}A'A + \frac{1}{\sigma_{\mu_1}^2}I_{K-1}$, $D_j = \text{Diag}\left(\frac{Z_{v(j)1}}{\alpha_1}, \ldots, \frac{Z_{v(j)K-1}}{\alpha_{K-1}}\right)$, and $\vec{I}_{K-1} = (1, 1, \ldots, 1)'$ with $\dim(\vec{I}_{K-1}) = K-1$

12: **for** *t* = 2, . . . , *T* − 1 **do**

13: $\mu^A(t) \leftarrow \left(\left(\sum_{j=1}^{P}(S_j(t)\vec{I}_{K-1})'D_j + \frac{1}{\sigma_a^2}(\mu^A(t+1))'A + \frac{1}{\sigma_a^2}(\mu^A(t-1))'A'\right) \times B_2^{-1}\right)'$, where $B_2 = \sum_{j=1}^{P}D_j + \frac{1}{\sigma_a^2}(A'A + I_{K-1})$

14: **end for**

15: $\mu^A(T) \leftarrow \left(\left(\sum_{j=1}^{P}(S_j(T)\vec{I}_{K-1})'D_j + \frac{1}{\sigma_a^2}(\mu^A(T-1))'A'\right) \times B_3^{-1}\right)'$, where $B_3 = \sum_{j=1}^{P}D_j + \frac{1}{\sigma_a^2}I_{K-1}$

16: **for** *j* = 1, ..., *P* **do**

17: $S_j \leftarrow -\frac{1}{2}\Sigma_{S_j}W_{2j}$, where $S_j = (S_j(1), S_j(2), \ldots, S_j(T))'$, $\Sigma_{S_j}^{-1} = W_{1j}I_T$, $W_{2j}' = (W_{2j}(1), W_{2j}(2), \ldots, W_{2j}(T))$, with
$W_{1j} = \frac{1}{\sigma_M^2}X_M[,j]'H_M^{-1}X_M[,j] + \frac{1}{\sigma_E^2}X_E[,j]'H_E^{-1}X_E[,j] + \sum_{l=1}^{K}\frac{Z_{v(j)l}}{\alpha_l}$ and
$W_{2j}(t) = \frac{1}{\sigma_M^2}\left[-2M(t)'H_M^{-1}X_M[,j] + 2\left(\sum_{v \ne j}X_M[,v]S_v(t)\right)'H_M^{-1}X_M[,j]\right] + \frac{1}{\sigma_E^2}\left[-2E(t)'H_E^{-1}X_E[,j] + 2\left(\sum_{v \ne j}X_E[,v]S_v(t)\right)'H_E^{-1}X_E[,j]\right] - 2\sum_{l=1}^{K}\frac{Z_{v(j)l}\,\mu_l(t)}{\alpha_l}$

where $X_M[,j]$ and $X_E[,j]$ denote the *j*th columns of $X_M$ and $X_E$, respectively.

18: **end for**

19: Let B denote the indices for "black" voxels and W denote the indices for "white" voxels.


20: **for** *κ* ∈ B *simultaneously* **do**

21: $Z_{\kappa q} \leftarrow 1$ and $Z_{\kappa l} \leftarrow 0$, $\forall l \ne q$, where $q = \arg\max_{h \in \{1,\ldots,K\}} P(h)$, and

22: $$P(h) = \frac{\alpha_h^{-TN_{j\kappa}/2} \times \exp\left(-\frac{1}{2}\sum_{j|v(j)=\kappa}\alpha_h^{-1}\sum_{t=1}^{T}(S_j(t)-\mu_h(t))^2 + 2\beta\sum_{v \in \delta_\kappa}Z_{vh}\right)}{\sum_{l=1}^{K}\alpha_l^{-TN_{j\kappa}/2} \times \exp\left(-\frac{1}{2}\sum_{j|v(j)=\kappa}\alpha_l^{-1}\sum_{t=1}^{T}(S_j(t)-\mu_l(t))^2 + 2\beta\sum_{v \in \delta_\kappa}Z_{vl}\right)}$$

where $N_{j\kappa}$ is the number of cortical locations contained in voxel *κ*.

23: **end for**

24: **for** *κ* ∈ W *simultaneously* **do**

25: $Z_{\kappa q} \leftarrow 1$ and $Z_{\kappa l} \leftarrow 0$, $\forall l \ne q$, where $q = \arg\max_{h \in \{1,\ldots,K\}} P(h)$, and

26: $$P(h) = \frac{\alpha_h^{-TN_{j\kappa}/2} \times \exp\left(-\frac{1}{2}\sum_{j|v(j)=\kappa}\alpha_h^{-1}\sum_{t=1}^{T}(S_j(t)-\mu_h(t))^2 + 2\beta\sum_{v \in \delta_\kappa}Z_{vh}\right)}{\sum_{l=1}^{K}\alpha_l^{-TN_{j\kappa}/2} \times \exp\left(-\frac{1}{2}\sum_{j|v(j)=\kappa}\alpha_l^{-1}\sum_{t=1}^{T}(S_j(t)-\mu_l(t))^2 + 2\beta\sum_{v \in \delta_\kappa}Z_{vl}\right)}$$

where $N_{j\kappa}$ is the number of cortical locations contained in voxel *κ*.


27: **end for**

28: If the increase in log *P*(**Θ**, *E*, *M*) is negligible, set Converged ← 1

29: **end while**

#### **Algorithm 2** Ant Colony System (ACS)-ICM Algorithm.

1: **Θ** ← initial value; set tuning parameters $\tau_o$, $q_o$, $\rho$ and $N_{ants}$.

• For each of the $N_{ants}$ ants, construct a candidate solution by assigning a label ℓ to each voxel *s* using the transition probability rule:

$$\ell = \begin{cases} \arg\max_{\iota \in \Lambda} \tau(s,\iota) & \text{if } q \le q_o \\ \text{drawn with probability } p(s,\ell) & \text{if } q > q_o \end{cases}$$

where, if $q > q_o$, the label for voxel *s* is drawn randomly from {1, . . . , *K*} with probability

$$p(s,\ell) = \frac{\tau(s,\ell)}{\sum_{\iota \in \Lambda}\tau(s,\iota)},$$

and where $q \sim \text{Uniform}(0,1)$.

• Assuming voxel *s* is assigned label ℓ, apply the local pheromone update:

$$\tau(s,\ell) \leftarrow (1-\rho)\tau(s,\ell) + \rho\tau_o$$

and for all $k \ne \ell$:

$$\tau(s,k) \leftarrow (1-\rho)\tau(s,k),$$

where $\rho \in (0,1)$ is a tuning parameter representing evaporation of the pheromone trails and $\tau_o > 0$ is the initial pheromone value.

• After all ants have constructed solutions improved by the ICM local search, evaluate each solution with the objective function and select the best one. Assuming voxel *s* is assigned label ℓ in the best solution, apply the global pheromone update:


$$\tau(s,\ell) \leftarrow (1-\rho)\tau(s,\ell) + \rho\tau_o$$

and for all $k \ne \ell$:

$$\tau(s,k) \leftarrow (1-\rho)\tau(s,k).$$

• Check for convergence via the increase in log *P*(**Θ**, *E*, *M*); if not converged, return to the candidate construction step.

• Return the voxel labeling **Z** and model parameters **Θ** from the best solution.
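A compact sketch of the whole ACS-ICM loop follows; `icm_local_search` (Algorithm 1) and `log_posterior` (Equation (2)) are hypothetical callables supplied by the model, and all other names are illustrative.

```python
import numpy as np

def acs_icm(E, M, n_voxels, K, q_o, tau_o, rho, n_ants, max_iters=100, tol=1e-8):
    rng = np.random.default_rng(0)
    tau = np.full((n_voxels, K), tau_o)        # pheromone over (voxel, label) couples
    best_Z, best_theta, best_val = None, None, -np.inf
    for it in range(max_iters):
        prev_best = best_val
        for _ in range(n_ants):
            Z = np.empty(n_voxels, dtype=int)
            for s in range(n_voxels):
                if rng.uniform() <= q_o:       # exploitation
                    Z[s] = np.argmax(tau[s])
                else:                          # exploration, Equation (3)
                    Z[s] = rng.choice(K, p=tau[s] / tau[s].sum())
                # local pheromone update, Equation (4)
                tau[s, Z[s]] = (1 - rho) * tau[s, Z[s]] + rho * tau_o
            Z, theta = icm_local_search(Z, E, M)    # Algorithm 1 as local search
            val = log_posterior(theta, E, M)        # Equation (2)
            if val > best_val:
                best_Z, best_theta, best_val = Z, theta, val
        tau *= (1 - rho)                            # global update on the best solution
        tau[np.arange(n_voxels), best_Z] += rho * tau_o
        if it > 0 and abs(best_val - prev_best) < tol:
            break                                   # negligible improvement: stop
    return best_Z, best_theta, best_val
```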

#### **3. Simulation Studies**

In this section, we use simulation studies to evaluate the performance of our algorithm. The studies assess the quality of the source estimates and the optimized objective function values obtained with our proposed algorithm in comparison to the existing ICM algorithm developed in [13]; comparisons between ACS-ICM and ICM are made on combined simulated EEG and MEG data.

#### *3.1. Proposed Approach*

The MEG and EEG data were both generated from four scenarios with two, three, four and nine latent states corresponding to regions of neural activity. In each of the four cases, one of the states is inactive, while the remaining states represent different regions of brain activity generated by Gaussian signals. The temporal profile of brain activity at each of the brain locations in the activated regions is depicted in Appendix A, Figures A1 and A2. We projected the source activity at 8196 brain locations on the cortex onto the MEG and EEG sensor arrays using the forward operators $X_M$ and $X_E$. The simulated data were then obtained by adding Gaussian noise at each sensor, where the variance of the noise at each sensor was set to 5% of the temporal variance of the signal at that sensor. The number of mixture components *K* was set to the true number of latent states (either two, three, four, or nine) in the model. We simulated 500 replicate datasets and applied both ACS-ICM and ICM to each dataset. For each simulated dataset we ran our algorithm with *J* = 250, 500, 1000 clusters to evaluate how performance varies with this tuning parameter, initializing both algorithms with the same starting values. For each replicate we computed the correlation between the estimated and true sources, $\text{Corr}[(S(1)', S(2)', \ldots, S(T)'), (\hat{S}(1)', \hat{S}(2)', \ldots, \hat{S}(T)')]$, as a measure of agreement; this measure was averaged over the 500 replicate datasets to obtain the average correlation. In addition, we estimated the Mean-Squared Error (MSE) of the estimator $\hat{S}_j(t)$ based on the *R* = 500 simulation replicates for each brain location *j* and time point *t*. The Total MSE (TMSE) was computed by summing the MSEs over brain locations and time points, separately for locations in active and inactive regions.
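A sketch of these agreement metrics, assuming the estimates are collected into an array `S_hat` of shape (R, P, T) with truth `S_true` of shape (P, T) and a boolean mask of active locations (all names illustrative):

```python
import numpy as np

def simulation_metrics(S_hat, S_true, active_mask):
    """Average correlation between estimated and true sources across replicates,
    and total MSE (TMSE) summed over locations and time points, split into
    active and inactive regions."""
    R = S_hat.shape[0]
    corrs = [np.corrcoef(S_hat[r].ravel(), S_true.ravel())[0, 1] for r in range(R)]
    avg_corr = float(np.mean(corrs))
    mse = ((S_hat - S_true[None]) ** 2).mean(axis=0)   # per-(location, time) MSE
    tmse_active = float(mse[active_mask].sum())
    tmse_inactive = float(mse[~active_mask].sum())
    return avg_corr, tmse_active, tmse_inactive
```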

In our simulation studies, the ACS-ICM algorithm had four tuning parameters. The first, denoted $q_o \in (0,1)$, controlled the degree of stochasticity, with larger values corresponding to less stochasticity and thus less random exploration of the parameter space. When a solution is chosen, a second tuning parameter $\tau_o$ controlled the amount of pheromone reinforcing this solution in the information available to the other ants. A third tuning parameter $\rho$ controlled the evaporation of pheromone, and a fourth, $N_{ants}$, controlled the number of ants. The number of ants was fixed at 10, chosen for computing efficiency after observing generally good performance and similar objective function values for $N_{ants} \ge 10$. The remaining tuning parameters ($q_o$, $\tau_o$, $\rho$) were chosen for all simulation cases using the outer-level Nelder–Mead optimization described above.

#### *3.2. Simulation Results*

#### 3.2.1. Evaluation of Neural Source Estimation

We present the average correlation between the estimated values and the truth for the algorithms considered in our study in Appendix A, Table A1. Inspecting Table A1, we observe that for every value of the true number of latent states (two, three, four, or nine), the estimates obtained from the ACS-ICM algorithm yielded a higher average correlation than those obtained from ICM. With respect to the number of clusters, ACS-ICM likewise resulted in a higher average correlation than ICM uniformly for all cluster sizes (250, 500, 1000). In summary, the average correlation was significantly improved when estimates were computed using the ACS-ICM algorithm, for both large and small numbers of latent states as well as for all cluster sizes. In addition, we present in Appendix A, Figure A4, violin plots comparing the correlation values obtained from each algorithm for the different simulation cases across all replicates. These plots show the entire distribution and provide a fuller assessment of each algorithm across simulation replicates. Observing Figure A4, we can see that ACS-ICM provides the highest correlation values uniformly in all simulation scenarios.

The TMSE for all simulation scenarios is presented in Appendix A, Table A2. To improve the readability of the TMSE results, we computed the relative percentage improvement in TMSE of the neural source estimators from ICM to ACS-ICM. With ICM as the reference algorithm, the relative percentage improvement is defined as the difference in TMSE between ICM and ACS-ICM divided by the ICM TMSE value and multiplied by 100. The results of this computation are presented in Table 1. In all simulation scenarios in Table 1, ACS-ICM performed better and showed a significant improvement over ICM. With respect to the number of clusters, ACS-ICM was roughly 10% better than ICM in TMSE when the cluster size was 250. For both small and large numbers of latent states, ACS-ICM showed significant improvements over ICM in the active region; thus ACS-ICM outperforms ICM in active regions for both small and large numbers of latent states. The total MSEs were also decomposed into total variance and total squared bias for the same cases of the simulation depicted in Table 1. Considering active regions with different numbers of clusters, ACS-ICM was better than ICM in total squared bias, as measured by the percentage of relative change; based on the total variance, we noticed a similar positive change from ICM to ACS-ICM uniformly for all values of *K*. For inactive regions, ACS-ICM was likewise better than ICM in both total variance and squared bias for all simulation cases considered. Overall, the TMSE results demonstrate a significant improvement from our algorithm in both total squared bias and variance, observed uniformly across all conditions.
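In symbols, with ICM as the reference algorithm, the quantity reported in Table 1 is:

$$\text{Relative improvement (\%)} = 100 \times \frac{\text{TMSE}_{\text{ICM}} - \text{TMSE}_{\text{ACS-ICM}}}{\text{TMSE}_{\text{ICM}}}, \qquad \text{TMSE} = \sum_{j,t}\left[\text{Var}\big(\hat{S}_j(t)\big) + \text{Bias}\big(\hat{S}_j(t)\big)^2\right].$$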

We present in Figure 1 boxplots comparing the final objective function values obtained from each of the algorithms for the different simulation scenarios across all replicates. Again, a clear pattern emerged showing that ACS-ICM yielded the highest objective function values uniformly in all cases. Overall, ACS-ICM outperformed the ICM algorithm uniformly with respect to both neural source estimates and the values of the objective function. This indicates the superiority of the ACS-ICM algorithm over ICM for computing neural source estimates for the spatiotemporal model.

#### 3.2.2. Evaluation of Mixture Component Estimation

In addition to evaluating point estimation and objective function maximization, we also evaluated model selection, comparing $\hat{K}_{ACS}$ and $\hat{K}_{ICM}$, the estimators of the number of mixture components obtained from ACS-ICM and ICM, respectively, and the sampling distributions of these estimators. The following five scenarios were considered in our experiments: (1) *K* = 2; (2) *K* = 3; (3) *K* = 4 with three Gaussian signals; (4) *K* = 4 with two Gaussian signals and one sinusoid; and (5) *K* = 9 with eight Gaussian signals.


**Table 1.** Simulation study I—percentage of relative improvement in Total Mean-Squared Error (TMSE) of the neural source estimators, decomposed into variance and squared bias, from ICM to ACS-ICM. The total was obtained separately for locations in active regions and for the inactive region.


**Figure 1.** Box-plots comparing the objective function values obtained in the simulation studies for the ICM and ACS-ICM algorithms. The first row corresponds to the case when *K* = 2, second row corresponds to when *K* = 3, third row is when *K* = 4 and the last row is when *K* = 9.

We simulated data for each of the five scenarios, adding 5% Gaussian noise at the sensors, with 1000 replicate datasets in each case. The algorithms were run with an upper bound of *K* = 10 for each of the 5000 simulated datasets. For each dataset we computed the value of the estimator; histograms representing the sampling distributions for each of the five cases are presented in Figure 2, with the first row illustrating the sampling distribution of $\hat{K}_{ICM}$ (panels (a)–(e)) and the second row that of $\hat{K}_{ACS}$ (panels (f)–(j)). Observing Figure 2, where the true signals are well separated, in all cases except the one with a larger number of latent states (*K* = 9) the mode of the sampling distribution corresponds to the true number of latent states for both the ACS-ICM and ICM algorithms. In the case of nine neural sources, ACS-ICM gave better results than ICM. Additionally, Table 2 reports the bias and mean-squared error of the estimators $\hat{K}_{ACS}$ and $\hat{K}_{ICM}$. From Table 2, both ACS-ICM and ICM are biased, overestimating the number of components when the true number of latent states is small and underestimating it when it is large. More importantly, the estimate of the number of mixture components obtained from ACS-ICM exhibited the best performance in terms of both bias and MSE uniformly across all cases considered, in the sense that $|\text{Bias}(\hat{K}_{ACS})| < |\text{Bias}(\hat{K}_{ICM})|$ and $\text{MSE}(\hat{K}_{ACS}) < \text{MSE}(\hat{K}_{ICM})$.

We repeated the simulation studies for all five cases with true signals that are less well separated, obtained by altering the true signals as depicted in Appendix A, Figure A3. Histograms for each of the five cases are presented in Figure 3, illustrating the sampling distributions of $\hat{K}_{ICM}$ (panels (a)–(e)) and $\hat{K}_{ACS}$ (panels (f)–(j)). In this case, the mode of the sampling distribution corresponds to the true number of latent states when *K* = 2 and *K* = 3, but not in the cases with four and nine latent states, for either algorithm. In Table 2 we compare the bias and mean square error of $\hat{K}_{ICM}$ and $\hat{K}_{ACS}$ under these simulation settings. Again, ACS-ICM outperformed ICM in terms of bias and mean square error: $|\text{Bias}(\hat{K}_{ACS})| < |\text{Bias}(\hat{K}_{ICM})|$ and $\text{MSE}(\hat{K}_{ACS}) < \text{MSE}(\hat{K}_{ICM})$. In summary, for model selection, the results presented in Table 2 show an overall better performance of ACS-ICM over ICM uniformly across all conditions considered.

Whereas the ACS-ICM algorithm showed superiority in the quality of its source estimates, a drawback is that it is computationally expensive relative to ICM, owing to its population-based and iterative procedure. Notwithstanding, this may not be a serious limitation for source localization problems, which in most situations do not require real-time solutions. With regard to computation time, on the Niagara cluster running R software on a single core (Intel Skylake 2.4 GHz, AVX512), ICM computed source estimates in approximately 2 min, whereas ACS-ICM required roughly 6 h and 30 min.


**Table 2.** Simulation study II—bias and Mean Square Error (MSE) of the estimated number of mixture components ($\hat{K}$) from the 1000 simulation replicates when the algorithms were run with *K* = 10.

**Figure 2.** Histograms illustrating the sampling distribution of $\hat{K}$ in the case where the true signals were well separated in the simulation studies. The first row corresponds to the sampling distribution of $\hat{K}_{ICM}$: panel (**a**), *K* = 2; panel (**b**), *K* = 3; panel (**c**), *K* = 4 with three Gaussian signals; panel (**d**), *K* = 4 with two Gaussian signals and one sinusoid; panel (**e**), *K* = 9 with eight Gaussian signals. The second row corresponds to the sampling distribution of $\hat{K}_{ACS}$: panel (**f**), *K* = 2; panel (**g**), *K* = 3; panel (**h**), *K* = 4 with three Gaussian signals; panel (**i**), *K* = 4 with two Gaussian signals and one sinusoid; panel (**j**), *K* = 9 with eight Gaussian signals. In each case the vertical red line indicates the true number of latent states underlying the simulated data.

**Figure 3.** Histograms illustrating the sampling distribution of $\hat{K}$ in the case where the true signals were less well-separated in the simulation studies. The first row corresponds to the sampling distribution of $\hat{K}_{ICM}$: panel (**a**), *K* = 2; panel (**b**), *K* = 3; panel (**c**), *K* = 4 with three Gaussian signals; panel (**d**), *K* = 4 with two Gaussian signals and one sinusoid; panel (**e**), *K* = 9 with eight Gaussian signals. The second row corresponds to the sampling distribution of $\hat{K}_{ACS}$: panel (**f**), *K* = 2; panel (**g**), *K* = 3; panel (**h**), *K* = 4 with three Gaussian signals; panel (**i**), *K* = 4 with two Gaussian signals and one sinusoid; panel (**j**), *K* = 9 with eight Gaussian signals. In each case the vertical red line indicates the true number of latent states underlying the simulated data.

#### **4. Application to Scrambled Face Perception MEG/EEG Data**

In this section, we apply our methodology to EEG and MEG data measuring an event-related response to the visual presentation of scrambled faces in a face perception study. In addition, we demonstrate how a nonparametric bootstrap can be used to obtain standard errors, confidence intervals and T-maps. The data from both MEG and EEG were obtained from a single subject in an experimental paradigm that involved repeated random presentation of a picture showing either a face or a scrambled face while the subject was required to make a symmetry judgement. The scrambled faces were created through 2D Fourier transformation, random phase permutation, inverse transformation and outline-masking of each face. The experiment involved a sequence of trials, each lasting 1800 ms, in which the subject was presented with one of the pictures for a period of 600 ms while making a four-way, left–right symmetry judgement as brain activity was recorded over the array. Both scrambled and unscrambled faces were presented to the subject; however, our analysis focuses only on trials involving scrambled faces. This produced a multivariate time series for each trial, and the trial-specific time series were then averaged across trials to create a single multivariate time series; the average evoked response is depicted in Figure 4, panel (a), for the MEG data, and panel (c), for the EEG data. From a spatial perspective, at a given time point each array recorded a spatial field such as those depicted in Figure 4, panel (b), which shows the MEG spatial field at a particular time point, and Figure 4, panel (d), which shows the EEG spatial field at the same time point. The degree of inter-trial variability was quite low. The experiment was conducted while EEG data were recorded, and then again on the same subject while MEG data were recorded.

The EEG data were acquired on a 128-sensor ActiveTwo system at a high sampling rate of 2048 Hz and down-sampled to 200 Hz. The EEG data were re-referenced to the average over good channels. The resulting EEG data were a trial-specific multivariate time series containing 128 sensors, 161 time points and 344 trials. The MEG data were acquired on 274 sensors with a CTF/VSM system at a high sampling rate of 480 Hz and down-sampled to 200 Hz. The MEG data were a trial-specific multivariate time series containing 274 sensors, 161 time points and 336 trials. We extracted a temporal segment of the data from time point *t* = 50 to *t* = 100, resulting in 51 time points for both the EEG and MEG data. For the analysis, the trial-specific time series were averaged across trials to produce a single average evoked response. A detailed description of the data and related analyses can be found in [9,39,40]. In addition, the open access data repository used for the analysis is available at https://www.fil.ion.ucl.ac.uk/spm/data/mmfaces (accessed on 14 November 2020).

We set the upper bound at *K* = 10 mixture components, the number of voxels at $n_v = 560$, *β* = 0.3 (the hyperparameter of the Potts model) and a cluster size of *J* = 250. For the real data application, the tuning parameters $(q_o, \tau_o, \rho, N_{ants}) = (0.43, 0.05, 0.64, 10)$ were again selected using the Nelder–Mead algorithm. First, the ICM algorithm was run to convergence and the estimates obtained were used as the initial values for the ACS-ICM algorithm. Our primary interest lies in the estimated neural sources $\hat{S}(t)$, and we computed the total power of the estimated sources obtained from both algorithms at each brain location, which was then mapped onto the cortex. The cortical maps showing the spatial patterns of the estimated power of the reconstructed sources are displayed in Figure 5; the first and second rows depict the results obtained from the ICM and ACS-ICM algorithms, respectively. As shown in Figure 5, the greatest power occurred on the bilateral ventral occipital cortex for the estimated sources from both ACS-ICM and ICM. Interestingly, the ACS-ICM estimates also differed from the ICM results in the left ventral frontal and right ventral temporal regions: ACS-ICM detected higher power, whereas ICM showed low activation in these regions. These regions are responsible for high-level visual processing, so the cortical power map appears to capture regions that would be expected to show scrambled face-related activity. To compare the overall quality of the estimates from ACS-ICM versus ICM, we show the final objective function values obtained from the two algorithms in Figure 6. Clearly, the application of ACS-ICM has led to higher quality estimates with much larger posterior density values.

The ACS-ICM algorithm used to maximize the posterior distribution produces only point estimates of the neural source activity. In addition to the point estimates, we applied a nonparametric bootstrap to the trial-specific multivariate time series to obtain confidence interval estimates and characterize the variability in our source estimates, which is another extension of [13]. The interval estimates were constructed by resampling the trial-specific MEG/EEG time series data with replacement. From each resampled dataset, we obtained the average evoked response and then ran the ACS-ICM algorithm, for a total of 400 nonparametric bootstrap replicates. This procedure was made feasible by parallel computation on a large number of computing cores. We constructed a cortical map of the bootstrap standard deviations of the total power of the estimated sources. To account for uncertainty in our point estimates, we also constructed a T-map, depicted in Figure 7: the ratio of the ACS-ICM point estimate of the source activity to its bootstrap standard deviation. The T-map arguably gives the best depiction of reconstructed power, since it accounts for standard errors that a simple map of the point estimates does not. Broadly, the T-map indicates results similar to those obtained from the point estimates, in particular with respect to high power activation on the bilateral ventral occipital cortex and the right ventral temporal region. An interesting observation from the T-map is the detection of a high signal in the left ventral temporal region, where the point estimate showed low activation.
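A sketch of this bootstrap, assuming trial arrays `trials_E` and `trials_M` of shapes (n_trials, sensors, T) and a hypothetical wrapper `run_acs_icm` that returns the estimated sources $\hat{S}$ of shape (P, T):

```python
import numpy as np

def bootstrap_power_sd(trials_E, trials_M, n_boot=400, seed=0):
    """Resample trials with replacement, average to an evoked response,
    rerun ACS-ICM, and return the bootstrap standard deviation of the
    total power at each cortical location."""
    rng = np.random.default_rng(seed)
    powers = []
    for _ in range(n_boot):
        iE = rng.integers(0, len(trials_E), len(trials_E))
        iM = rng.integers(0, len(trials_M), len(trials_M))
        E_bar = trials_E[iE].mean(axis=0)
        M_bar = trials_M[iM].mean(axis=0)
        S_hat = run_acs_icm(E_bar, M_bar)          # hypothetical wrapper
        powers.append((S_hat ** 2).sum(axis=1))    # total power per location
    return np.stack(powers).std(axis=0)

# T-map: point estimate of the power divided by its bootstrap standard deviation
# t_map = point_power / bootstrap_power_sd(trials_E, trials_M)
```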

**Figure 4.** The Magnetoencephalography (MEG) and Electroencephalography (EEG) data considered in the face perception study: panels (**a**,**c**) show the time series observed at each MEG sensor and EEG sensor, respectively; panels (**b**,**d**) depict the spatially interpolated values of the MEG data and the EEG data, respectively, each observed at *t* = 80, roughly 200 ms after presentation of the stimulus. In panels (**b**,**d**) the black circles correspond to the sensor locations after projecting these locations onto a 2-dimensional grid (for presentation).

**Figure 5.** Brain activation for scrambled faces—the power of the estimated source activity $\sum_{t=1}^{T}\hat{S}_j(t)^2$ at each location *j* of the cortical surface. **Row 1** displays results from the ICM algorithm applied to the combined MEG and EEG data; **Row 2** displays results from ACS-ICM applied to the combined MEG and EEG data.

**Figure 6.** Objective function values obtained from the data with the ACS-ICM (**left**) and ICM (**right**) algorithms.

In addition, we present the temporal summary from our bootstrap replicates, representing the interval estimate of the estimated temporal profile of brain activity at the peak location of the T-map; the 95% confidence interval is depicted in Figure 8. One key component of our work is varying the inverse temperature parameter for sensitivity analysis. We fixed the inverse temperature at *β* = 0.1 and *β* = 0.44 and ran the ACS-ICM algorithm to convergence, with *K* = 10, $n_v = 560$ and a cluster size of *J* = 250. For *β* = 0.1, the results are depicted in the first row of Figure 9 and indicate activation on the bilateral ventral occipital cortex. At *β* = 0.44, the power map results from ACS-ICM, depicted in the second row of Figure 9, differ from those at *β* = 0.1. In particular, the highest power signals occurred in the right ventral temporal region, where there was low activation for *β* = 0.1.

For our real data application we also applied both algorithms with *J* = 500 clusters to evaluate how the performance varies with this tuning parameter. The results are displayed in Appendix B; those obtained from ACS-ICM are shown in the second row of Figure A5. Examining Figure A5, ACS-ICM indicates results similar to those obtained with *J* = 250, in particular with respect to activation on the bilateral ventral occipital cortex. For the corresponding sensitivity analysis, results obtained using inverse temperatures *β* = 0.1 and *β* = 0.44 are displayed in Figure A6. We observe that, with ACS-ICM, the spatial spread of the high power occurs on the bilateral ventral occipital cortex. In addition, the source estimates obtained from ACS-ICM indicate bilateral activation in the occipital cortex, and activation in the right temporal and right frontal regions of the brain; these estimated source locations reveal activation in areas known to be involved in the processing of visual stimuli. More interestingly, ACS-ICM also detected high power in a region on the corpus callosum; this may be an artifact of the ill-posedness of the inverse problem, which admits an infinite number of possible source configurations.

In our real data analysis, the required computation time for ICM was 3 min on a single core (Intel Skylake 2.4 GHz, AVX512) with R software, whereas the computation time for ACS-ICM was roughly 7 h. The choice of cluster size affects the computation time required by the algorithm: ACS-ICM required approximately 7 h for a cluster size of 250 and 12 h for a cluster size of 500. While the increase in computation is substantial, we have demonstrated uniform improvements in the quality of the solutions, in terms of both source estimation and model selection. Furthermore, the bootstrap can be implemented in parallel on a computing cluster to obtain standard errors with no increase in the required computation time.

**Figure 7.** The spatial profile of brain activity from ACS-ICM based on our bootstrap replicates. **Row 1** displays standard deviations of the total power of the estimated source activity; **Row 2** displays the T-map.

**Figure 8.** The 95% confidence interval for the estimated temporal profile of brain activity at the peak location of the T-map from the bootstrap replicates.

**Figure 9.** Brain activation for scrambled faces—the power of the estimated source activity $\sum_{t=1}^{T}\hat{S}_j(t)^2$ at each location *j* of the cortical surface. **Row 1** displays results from the ACS-ICM algorithm applied to the combined MEG and EEG data with *β* = 0.1; **Row 2** displays results from ACS-ICM applied to the combined MEG and EEG data with *β* = 0.44.

#### *Residual Diagnostics for the Scrambled Faces MEG and EEG Data*

We assessed the goodness of fit of the model by examining the residual time series plots, normal quantile–quantile plots and plots of residuals versus fitted values after running the ACS-ICM and ICM algorithms. The residuals were computed for both EEG and MEG after applying each algorithm, as $\hat{\epsilon}_M(t) = M(t) - X_M\hat{S}(t)$ and $\hat{\epsilon}_E(t) = E(t) - X_E\hat{S}(t)$ at each time point *t* = 1, . . . , *T*. If the assumed model generated the observed data, the residuals should be draws from a mean-zero Gaussian distribution. The residual time series plots for EEG and MEG from the ACS-ICM algorithm are displayed in Figure 10, panels (a) and (b); Figure 11, panels (a) and (b), depicts the corresponding residual time series plots obtained from ICM. Examining the plots, the residual time series obtained from the two algorithms exhibit similar patterns for MEG and EEG, but with notable improvements in the ACS-ICM estimates. For the EEG data, some sensors retain relatively large peaks under ICM, whereas ACS-ICM leaves fewer residual patterns. For the MEG data, the residuals obtained from ACS-ICM reveal few sensors with remaining peaks, compared to ICM, where more sensors show large peaks and residuals.
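A sketch of these diagnostics, assuming the fitted sources `S_hat` and forward operators are available as arrays (illustrative names; `scipy.stats.probplot` supplies the normal quantile–quantile values):

```python
import numpy as np
from scipy import stats

def residuals(E, M, X_E, X_M, S_hat):
    """Residuals at every time point: eps_E(t) = E(t) - X_E S_hat(t) and
    eps_M(t) = M(t) - X_M S_hat(t); E is (N_E, T), M is (N_M, T), S_hat is (P, T)."""
    return E - X_E @ S_hat, M - X_M @ S_hat

# normal quantile-quantile values for the pooled EEG residuals
# eps_E, eps_M = residuals(E, M, X_E, X_M, S_hat)
# (osm, osr), (slope, intercept, r) = stats.probplot(eps_E.ravel(), dist="norm")
```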

In Figures 10 and 11, panels (c) and (d), we show plots of the residuals versus fitted values from ACS-ICM and ICM. For the EEG data, the ACS-ICM residuals reveal fewer extreme values and smaller residual patterns, whereas more outliers are seen in the residuals obtained from ICM; the ICM residuals are characterized by higher values to the left of zero and lower values to the right of zero. For the MEG data, the residuals obtained from ACS-ICM likewise show fewer extreme values and a smaller residual pattern, with a broadly similar appearance for the ICM residuals apart from a few extreme values outside the zero band; overall, more extreme values appear in the ICM residual plot than in the ACS-ICM plot, signifying the improvement of ACS-ICM over ICM. Inspecting Figure 10, panels (e) and (f), which show the normal quantile–quantile plots for the EEG and MEG residuals obtained from ACS-ICM, we observe no deviation from normality for either the EEG or MEG data; hence, the Gaussian assumption holds for the ACS-ICM fit. For ICM, Figure 11, panels (e) and (f), depicts the normal quantile–quantile plots for the EEG and MEG data; here we observe a clear divergence from the normal distribution for both sets of residuals. In particular, there is a strong deviation from normality in the left and right tails of the distribution for the EEG data, and a deviation from normality in the right tail for the MEG data.

In summary, the residual analysis revealed that the ACS-ICM algorithm produced estimates with a better fit of the spatial mixture model to the EEG and MEG data than ICM. Our proposed approach thus leads to uniform improvements in point estimation and model selection in the simulation studies and, in our application, to larger objective function values and improved model fit based on the residual analysis.


**Figure 10.** Brain activation for scrambled faces using the ACS-ICM algorithm—residual diagnostics: time series of residuals, (**a**) EEG, (**b**) MEG; residuals versus fitted values, (**c**) EEG, (**d**) MEG; residual normal quantile–quantile plots, (**e**) EEG, (**f**) MEG.

**Figure 11.** Brain activation for scrambled faces using the ICM algorithm—residual diagnostics: time series of residuals, (**a**) EEG, (**b**) MEG; residuals versus fitted values, (**c**) EEG, (**d**) MEG; residual normal quantile–quantile plots, (**e**) EEG, (**f**) MEG.

#### **5. Discussion and Conclusions**

In this section, we summarize the numerical results obtained in the data analysis, the limitations of the proposed approach and the prospects for future research. We have developed an ACS-ICM algorithm for spatiotemporal modeling of combined MEG/EEG data to solve the neuroelectromagnetic inverse problem. Adopting a Bayesian finite mixture model with a Potts model as a spatial prior, the focus of our work has been to improve source localization estimates, model selection and model fit. The primary contribution is the design and implementation of the ACS-ICM algorithm as an approach for source localization that performs better than ICM, uniformly in every setting of the simulation studies and in the real data application. Another key development is the technique for choosing the tuning parameters of ACS-ICM: an outer-level optimization that numerically optimizes the choice of the tuning parameters for the algorithm, ensuring that the tuning parameters selected are appropriate to the data and the complexity of the problem.

#### *5.1. Numerical Results*

In our simulation studies, we observed four significant improvements of ACS-ICM over ICM: (1) the ACS-ICM neural source estimates provided improved correlation between the estimated and true sources uniformly across all settings considered; (2) the objective function values (posterior density values) obtained from ACS-ICM were larger than those obtained from ICM uniformly across all settings considered; (3) ACS-ICM showed significant improvement in total mean square error for all cluster sizes considered; and (4) ACS-ICM exhibited improved performance in terms of both bias and mean square error for the non-regular problem of estimating the number of mixture components. Moreover, the application of ACS-ICM to real data led to higher quality estimates with larger maximized posterior density values. These improvements demonstrate the advantage of the ACS-ICM algorithm over ICM in both the face perception analysis and the simulation studies. In addition to implementing the ACS-ICM algorithm for point estimation, we demonstrated how a nonparametric bootstrap can be used to obtain standard errors, confidence intervals and T-maps for the proposed methodology, so as to account for uncertainty in our point estimates of the neural source activity.

#### *5.2. Limitations of the Proposed Approach*

An important limitation of the simulation studies is the use of white noise added to the signals. Real MEG/EEG data would contain structured noise arising from, e.g., motion, and such noise would be spatially correlated. Spatially correlated noise would make the simulation scenarios more challenging, and we expect it would result in a decline in performance. We did not pursue this scenario in our simulations and will consider it in future studies.

#### *5.3. Prospects for Future Research*

In our current work, we have implemented ACS-ICM for the spatial mixture model developed in [13]. In the future we hope to extend the model by considering a more robust error structure in the MEG/EEG model. The model currently assumes that the errors are independent in time; this will be extended by allowing for an autoregressive structure. A second extension would be to relax the assumption that the errors have a Gaussian distribution by incorporating a multivariate t distribution for the error terms. Integrating these extensions, we will develop a new joint model for the MEG/EEG data and implement the ACS-ICM and ICM algorithms for the neuroelectromagnetic inverse problem.

Furthermore, when we obtained the source estimates from the ACS-ICM algorithm, we mapped a function of them (the total power) onto the cortex without thresholding, so that all locations with estimated power are visible. In future studies we hope to map the total power onto the cortex with a threshold, so that the locations with the highest power stand out. To choose the threshold in a principled way, our next objective is to extend this work by thresholding the cortical maps using random field theory [41]. Random field theory is mainly applied to thresholding problems encountered in functional imaging: it solves the problem of finding the height threshold for a smooth statistical map that gives the required family-wise error rate. Going forward, the idea is to take the point estimate obtained from ACS-ICM and the standard errors (obtained from the bootstrap) to provide estimates of p-values for t-statistics pertaining to the number of activated voxels comprising a particular region.

It should be noted that the ACS-ICM algorithm and the spatial model developed here can also be applied to studies involving multiple subjects. Expanding from a single-subject model to one developed for multiple subjects would be of great interest for the MEG/EEG inverse problem. This will be based on a fully Bayesian analysis using a divide-and-conquer Markov Chain Monte Carlo (MCMC) method [42]: partition the data into subsets, perform local inference on each subset separately, and combine the results to obtain a global posterior approximation.

**Author Contributions:** Conceptualization, E.A.O. and F.S.N.; methodology, E.A.O., F.S.N. and Y.S.; formal analysis, E.A.O.; writing—original draft preparation, E.A.O.; writing—review and editing, E.A.O., S.E.A., Y.S. and F.S.N.; supervision, F.S.N. and S.E.A.; funding acquisition, F.S.N. and S.E.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and supported by infrastructure provided by Compute Canada (www.computecanada.ca (accessed on 14 November 2020)) and data sourced from [13].

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here https://www.fil.ion.ucl.ac.uk/spm/data/mmfaces (accessed on 26 January 2021).

**Acknowledgments:** This research is supported by the Visual and Automated Disease Analytics (VADA) graduate training program.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


**Figure A1.** The true signal $S_j(t)$ used in each of the distinct active and inactive regions in the simulation studies of Section 3, for *K* = 2, panel (**a**); *K* = 3, panel (**c**); and *K* = 4, panels (**e**,**g**), depicted in the left column. The right column presents the true partition of the cortex into active and inactive states for the corresponding cases: *K* = 2, panel (**b**); *K* = 3, panel (**d**); and *K* = 4, panels (**f**,**h**).


**Table A1.** Simulation study I—average (Ave.) correlation between the neural source estimates and the true sources for the ICM and ACS-ICM algorithms.


**Figure A2.** The true signal $S_j(t)$, panel (**a**), and the true partition of the cortex into active and inactive states for the case of *K* = 9 states, panels (**b**)–(**f**), used in the simulation studies of Section 3.

**Figure A3.** The true signal $S_j(t)$ used in each of the distinct active and inactive regions for *K* = 2, panel (**a**); *K* = 3, panel (**b**); *K* = 4, panels (**c**,**d**); and *K* = 9, panel (**e**), in the simulation study of Section 3.2, in the second part of the study where the mixture components were less well separated.


**Table A2.** Simulation study I—Total Mean-Squared Error (TMSE) of the neural source estimators decomposed into variance and squared bias for the ICM and ACS-ICM algorithms. This total was obtained separately for locations in active regions and then for the inactive region.

**Figure A4.** Violin plots comparing the correlation values obtained in the simulation studies for the ICM and ACS-ICM algorithms. The first row corresponds to the case when *K* = 2, the second row corresponds to when *K* = 3, the third row is when *K* = 4 and the last row is when *K* = 9.

#### **Appendix B**

We present results from the ACS-ICM and ICM algorithms with tuning parameters for the cluster size of 500. The cortical maps displaying the spatial patterns of the total power for estimated sources are represented below.

**Figure A5.** Brain activation for scrambled faces—the power of the estimated source activity $\sum_{t=1}^{T}\hat{S}_j(t)^2$ at each location *j* of the cortical surface. **Row 1** displays results from the ICM algorithm applied to the combined MEG and EEG data; **Row 2** displays results from ACS-ICM applied to the combined MEG and EEG data.

**Figure A6.** Brain activation for scrambled faces—the power of the estimated source activity $\sum_{t=1}^{T}\hat{S}_j(t)^2$ at each location *j* of the cortical surface. **Row 1** displays results from ACS-ICM applied to the combined MEG and EEG data with *β* = 0.1; **Row 2** displays results from ACS-ICM applied to the combined MEG and EEG data with *β* = 0.44.

**Figure A7.** The spatial profile of brain activity from ACS-ICM based on our bootstrap replicates. **Row 1** displays standard deviations of the total power of the estimated source activity; **Row 2** displays the T-map.

#### **References**


