### 2.3.3. Granger Causality

We also characterize the simulated and EEG data using Granger causality (GC). Like TE, GC is derived from Wiener's definition of causality, and the two measures, in their original forms, are equivalent for Gaussian variables [36]. Briefly, for two stationary time series $\mathbf{x} = \{x_t\}_{t=1}^{T}$ and $\mathbf{y} = \{y_t\}_{t=1}^{T}$, the Granger causality from $\mathbf{x}$ to $\mathbf{y}$ is defined as:

$$\text{GC}(\mathbf{x}\rightarrow\mathbf{y}) = \log\left(\frac{\text{var}\{\mathbf{e}\}}{\text{var}\{\mathbf{e}'\}}\right),\tag{12}$$

where $\mathbf{e}, \mathbf{e}' \in \mathbb{R}^{T-o}$ are vectors holding the residuals or prediction errors of two autoregressive models, and $\text{var}\{\cdot\}$ stands for the variance operator. The errors in $\mathbf{e}$ come from an autoregressive process of order $o$ that predicts $\mathbf{y}$ from its past values alone. On the other hand, the errors in $\mathbf{e}'$ come from a bivariate autoregressive process of order $o$ that predicts $\mathbf{y}$ from the past values of $\mathbf{y}$ and $\mathbf{x}$ [11]. If the past of $\mathbf{x}$ improves the prediction of $\mathbf{y}$, then $\text{var}\{\mathbf{e}\} > \text{var}\{\mathbf{e}'\}$ and $\text{GC}(\mathbf{x} \rightarrow \mathbf{y}) > 0$; if it does not, then $\text{var}\{\mathbf{e}\} \approx \text{var}\{\mathbf{e}'\}$ and $\text{GC}(\mathbf{x} \rightarrow \mathbf{y}) \rightarrow 0$. In addition, in analogy to the concept of phase TE, we define $GC_{\theta}(\mathbf{x} \rightarrow \mathbf{y}, f) = GC(\theta_{x} \rightarrow \theta_{y})$, where $\theta_{x}$ and $\theta_{y}$ are instantaneous phase time series obtained by filtering $\mathbf{x}$ and $\mathbf{y}$ at frequency $f$, as a measure within the framework of GC that captures phase-based interactions.
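As an illustration, the following Python sketch estimates $\text{GC}(\mathbf{x} \rightarrow \mathbf{y})$ by fitting the two autoregressive models with ordinary least squares; the function name is ours, and the sketch omits intercept terms, model-order selection, and stationarity checks. For $GC_{\theta}$, the same function can be applied to instantaneous phase series obtained, e.g., by band-pass filtering followed by the Hilbert transform.

```python
import numpy as np

def granger_causality(x, y, order):
    """Estimate GC(x -> y) as the log ratio of residual variances (Eq. 12)
    of a restricted (y-only) and a full (y and x) autoregressive model."""
    T = len(y)
    target = y[order:]  # values to predict, t = order, ..., T-1
    # Lagged regressors: column k holds the series delayed by k samples.
    past_y = np.column_stack([y[order - k:T - k] for k in range(1, order + 1)])
    past_x = np.column_stack([x[order - k:T - k] for k in range(1, order + 1)])

    # Restricted model: predict y from its own past alone -> residuals e.
    coef_r, *_ = np.linalg.lstsq(past_y, target, rcond=None)
    e = target - past_y @ coef_r

    # Full model: predict y from the past of y and x -> residuals e'.
    regressors = np.hstack([past_y, past_x])
    coef_f, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    e_prime = target - regressors @ coef_f

    return np.log(np.var(e) / np.var(e_prime))
```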

## *2.4. Kernel-Based Relevance Analysis*

When characterizing EEG data through effective brain connectivity measures for BCI-related applications, two common and related issues can arise. First, all-to-all channel connectivity analyses result in a large number of features, many of which may not provide useful information to discriminate between the conditions of the BCI paradigm of interest [10]. This can add noise and complexity to any subsequent analysis stage. Second, estimating such a large number of pairwise channel connectivities can be computationally expensive, especially for measures such as TE and single-trial $TE_{\theta}$ [8], which can hinder their inclusion in practical BCI systems. Both problems could be addressed by identifying the set of pairwise channel connectivities that are relevant for discriminating between specific conditions, which would also lead to a clearer neurophysiological interpretation of the obtained results [6,10].

To that end, we explore a relevance analysis strategy based on centered kernel alignment (CKA). CKA quantifies the similarity between two sample spaces by comparing two characterizing kernel functions [29]. First, assume we have a feature matrix $\mathbf{\Phi} \in \mathbb{R}^{N \times P}$ and a corresponding vector of labels $\mathbf{l} \in \mathbb{Z}^{N}$, with $N$ the number of observations and $P$ the number of features. For the case of connectivity-based EEG analysis, each element in $\mathbf{\Phi}$ holds a connectivity value for a pair of channels, with each row of $\mathbf{\Phi}$ containing multiple connectivity values (features) estimated for a single trial or observation. The corresponding element in $\mathbf{l}$ holds a label identifying the condition associated with that trial. Next, we define two kernel matrices $\mathbf{K}_{\Phi} \in \mathbb{R}^{N \times N}$ and $\mathbf{K}_{l} \in \mathbb{R}^{N \times N}$. The first matrix holds elements $k_{ij}^{\Phi} = \kappa_{\Phi}(\boldsymbol{\phi}_{i}, \boldsymbol{\phi}_{j})$, with $\boldsymbol{\phi}_{i}, \boldsymbol{\phi}_{j} \in \mathbb{R}^{P}$ row vectors belonging to $\mathbf{\Phi}$, and

$$\kappa_{\Phi}(\boldsymbol{\phi}_{i}, \boldsymbol{\phi}_{j}; \sigma) = \exp\left(-\frac{d^{2}(\boldsymbol{\phi}_{i}, \boldsymbol{\phi}_{j})}{2\sigma^{2}}\right), \tag{13}$$

a radial basis function (RBF) kernel [37], where $d^{2}(\cdot, \cdot)$ is a distance operator and $\sigma \in \mathbb{R}^{+}$ is the kernel's bandwidth. The second matrix has elements $k_{ij}^{l} = \kappa_{l}(l_{i}, l_{j})$, with $l_{i}, l_{j} \in \mathbf{l}$, and

$$\kappa_{l}(l_{i}, l_{j}) = \delta(l_{i} - l_{j}), \tag{14}$$

a Dirac kernel, where $\delta(\cdot)$ stands for the Dirac delta.
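Concretely, both kernel matrices can be assembled in a few lines. The sketch below (names are ours) uses the plain Euclidean distance inside the RBF kernel of Equation (13); the Mahalanobis form actually adopted in this work is introduced in Equation (16) below.

```python
import numpy as np

def kernel_matrices(Phi, labels, sigma):
    """Build K_Phi (RBF kernel, Eq. 13) and K_l (Dirac kernel, Eq. 14)
    for a feature matrix Phi of shape (N, P) and N integer labels."""
    # Squared pairwise Euclidean distances between trials (rows of Phi).
    sq_norms = np.sum(Phi**2, axis=1)
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Phi @ Phi.T, 0.0)
    K_phi = np.exp(-d2 / (2.0 * sigma**2))
    # Dirac kernel: 1 where two trials share a label, 0 otherwise.
    K_l = (labels[:, None] == labels[None, :]).astype(float)
    return K_phi, K_l
```

Then, the CKA can be estimated as: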

$$\rho(\bar{\mathbf{K}}_{\Phi}, \bar{\mathbf{K}}_{l}) = \frac{\langle \bar{\mathbf{K}}_{\Phi}, \bar{\mathbf{K}}_{l} \rangle_{\text{F}}}{\left( \langle \bar{\mathbf{K}}_{\Phi}, \bar{\mathbf{K}}_{\Phi} \rangle_{\text{F}} \, \langle \bar{\mathbf{K}}_{l}, \bar{\mathbf{K}}_{l} \rangle_{\text{F}} \right)^{1/2}}, \tag{15}$$

where $\bar{\mathbf{K}} \in \mathbb{R}^{N \times N}$ is the centered version of $\mathbf{K}$, obtained as $\bar{\mathbf{K}} = \tilde{\mathbf{I}} \mathbf{K} \tilde{\mathbf{I}}$, with $\tilde{\mathbf{I}} = \mathbf{I} - \mathbf{1}\mathbf{1}^{\top}/N$ the empirical centering matrix, where $\mathbf{I} \in \mathbb{R}^{N \times N}$ is the identity matrix and $\mathbf{1} \in \mathbb{R}^{N}$ is an all-ones vector, and $\langle \bar{\mathbf{K}}, \bar{\mathbf{K}}' \rangle_{\text{F}} = \text{tr}(\bar{\mathbf{K}} \bar{\mathbf{K}}'^{\top})$ denotes the matrix Frobenius inner product.
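With both kernels in hand, Equation (15) reduces to a handful of matrix operations, as in the following minimal sketch (again with our own naming):

```python
import numpy as np

def cka(K_phi, K_l):
    """Centered kernel alignment between two N x N kernel matrices (Eq. 15)."""
    N = K_phi.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N    # empirical centering matrix
    Kp, Kl = H @ K_phi @ H, H @ K_l @ H    # centered kernels
    num = np.sum(Kp * Kl)                  # Frobenius inner product <Kp, Kl>_F
    return num / np.sqrt(np.sum(Kp * Kp) * np.sum(Kl * Kl))
```

Now, for $\kappa_{\Phi}$, we select as distance operator the Mahalanobis distance: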

$$d_{\mathbf{A}}^{2}(\boldsymbol{\phi}_{i}, \boldsymbol{\phi}_{j}) = (\boldsymbol{\phi}_{i} - \boldsymbol{\phi}_{j}) \, \mathbf{\Gamma} \mathbf{\Gamma}^{\top} (\boldsymbol{\phi}_{i} - \boldsymbol{\phi}_{j})^{\top}, \tag{16}$$

where $\mathbf{\Gamma} \in \mathbb{R}^{P \times Q}$, with $Q \leq P$, is a linear projection matrix, and $\mathbf{A} = \mathbf{\Gamma}\mathbf{\Gamma}^{\top}$ is the corresponding inverse covariance matrix. Afterward, the projection matrix $\mathbf{\Gamma}$ is obtained by solving the following optimization problem:

$$\hat{\mathbf{\Gamma}} = \arg\max_{\mathbf{\Gamma}} \log\left(\rho(\bar{\mathbf{K}}_{\Phi}, \bar{\mathbf{K}}_{l}; \mathbf{\Gamma})\right), \tag{17}$$

where the logarithm function is used for mathematical convenience. $\hat{\mathbf{\Gamma}}$ can be estimated through standard stochastic gradient descent, as detailed in [38], via the update rule:

$$\mathbf{\Gamma}^{r+1} = \mathbf{\Gamma}^{r} - \mu_{\mathbf{\Gamma}}^{r} \, \nabla_{\mathbf{\Gamma}}\left(\hat{\rho}(\bar{\mathbf{K}}_{\Phi}, \bar{\mathbf{K}}_{l})\right), \tag{18}$$

where $\mu_{\mathbf{\Gamma}} \in \mathbb{R}^{+}$ is the step size of the learning rule, and $r$ indicates a time step.
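Reference [38] derives the analytic gradient $\nabla_{\mathbf{\Gamma}}$; purely as an illustration of Equations (16)–(18), the sketch below maximizes $\log \rho$ with a finite-difference gradient instead (our simplification, impractical for large $P$), reusing the `cka` helper from the previous sketch; all other names are ours.

```python
import numpy as np

def fit_projection(Phi, labels, Q, sigma=1.0, lr=0.05, steps=200, eps=1e-5):
    """Search for Gamma (Eq. 17) by gradient ascent on log CKA,
    with a numerical gradient standing in for the analytic one of [38]."""
    rng = np.random.default_rng(0)
    N, P = Phi.shape
    Gamma = rng.normal(scale=0.1, size=(P, Q))
    K_l = (labels[:, None] == labels[None, :]).astype(float)  # Dirac kernel

    def objective(G):
        # Mahalanobis RBF kernel of Eqs. (13) and (16): project the rows
        # of Phi, then take Euclidean distances in the projected space.
        Z = Phi @ G
        sq = np.sum(Z**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
        K_phi = np.exp(-d2 / (2.0 * sigma**2))
        return np.log(cka(K_phi, K_l))  # cka() from the sketch above

    for _ in range(steps):
        base = objective(Gamma)
        grad = np.zeros_like(Gamma)
        for idx in np.ndindex(*Gamma.shape):  # finite-difference gradient
            G_pert = Gamma.copy()
            G_pert[idx] += eps
            grad[idx] = (objective(G_pert) - base) / eps
        Gamma += lr * grad  # ascent on log rho, i.e., descent on its negative (Eq. 18)
    return Gamma
```

Finally, we quantify the contribution of each feature to the projection matrix $\hat{\mathbf{\Gamma}}$, which maximizes the alignment between the feature and label spaces, by building a relevance index vector $\boldsymbol{\varrho} \in \mathbb{R}^{P}$, whose elements are defined as: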

$$\varrho_{p} = \sum_{q=1}^{Q} |\gamma_{pq}|; \quad \forall p \in \{1, \ldots, P\}, \; \gamma_{pq} \in \hat{\mathbf{\Gamma}}. \tag{19}$$

$\boldsymbol{\varrho}$ can then be used to rank the features in $\mathbf{\Phi}$ according to their discrimination capability. A high $\varrho_{p}$ value indicates that the $p$-th feature in $\mathbf{\Phi}$, in our case a connection between a specific pair of channels, is relevant when it comes to distinguishing between the conditions contained in the label vector $\mathbf{l}$.
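In code, Equation (19) and the subsequent ranking reduce to a row-wise sum of absolute weights over the fitted projection; a short sketch (names are ours):

```python
import numpy as np

def relevance_index(Gamma_hat):
    """Relevance of each of the P features (Eq. 19): sum of absolute
    projection weights across the Q columns of Gamma_hat."""
    return np.sum(np.abs(Gamma_hat), axis=1)

# Example: rank channel-pair features from most to least relevant,
# assuming Gamma_hat was obtained with fit_projection() above.
# ranking = np.argsort(relevance_index(Gamma_hat))[::-1]
```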
