#### **2. Background**

In this section, the theoretical background on the preprocessing, feature extraction, and classification stages is presented.

#### *2.1. Blind Source Separation*

BSS is an approach that estimates and recovers the original sources using only the information of the observed mixtures at the recording channels. For the instantaneous and ideal case, where the source signals arrive at the sensors at the same time, as in EEG, the mathematical model of BSS can be expressed as

$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t),\tag{1}$$

where **x**(*t*) is the vector of the mixed signals, **A** is the unknown non-singular mixing matrix, and **s**(*t*) is the vector of sources [2]. BSS then consists of estimating a separation matrix **B** such that, in the ideal case, **B** = **A**<sup>−1</sup>; the estimated sources **s**′(*t*) are then computed as

$$\mathbf{s}'(t) = \mathbf{B}\mathbf{x}(t).\tag{2}$$
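
For illustration, the following is a minimal NumPy sketch of Eqs. (1) and (2), assuming (for this noiseless example only) that **A** is known, so that **B** = **A**<sup>−1</sup> can be formed directly; in the blind setting of Section 2.1, **B** must instead be estimated from **x**(*t*) alone:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))      # two source signals s(t), stacked row-wise
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])           # non-singular mixing matrix (known here for illustration)
x = A @ s                            # observed mixtures, Eq. (1)
B = np.linalg.inv(A)                 # ideal separation matrix B = A^(-1)
s_hat = B @ x                        # recovered sources, Eq. (2)
assert np.allclose(s_hat, s)         # exact recovery in the noiseless, known-A case
```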

BSS algorithms are grouped into two main categories: (1) Second-Order Statistics (SOS) methods, built from correlations of time-delayed, windowed signals [36]; and (2) Higher-Order Statistics (HOS) methods, which are based on the optimization of relevant features of the Probability Density Function (PDF) [37]. The former, SOS algorithms, search for the independence of signals using the criterion of correlation between them; however, uncorrelatedness does not imply independence in all cases. HOS algorithms instead assume that the independent sources have non-Gaussian PDFs, while the mixtures, as sums of independent variables, present an approximately Gaussian distribution, which is valid in most cases, including EEG signals [38].

The classical algebraic approach of ICA is based on negentropy, which measures the entropy *H*(**y**) of a random variable with respect to the entropy *H<sub>g</sub>*(**y**) of a Gaussian variable with the same mean and variance. By definition, negentropy is the Kullback–Leibler divergence between a probability density *p*(**y**) and the Gaussian density *g*(**y**) of the same mean and variance. The negentropy *J* of **y** is defined as

$$J(\mathbf{y}) = H\_g(\mathbf{y}) - H(\mathbf{y}).\tag{3}$$

According to the definition of mutual information, *I*(**y**) can be expressed as

$$I(\mathbf{y}) = J(\mathbf{y}) - \sum\_{i} J(y\_i) + \frac{1}{2} \log \frac{\prod\_{i} v\_{ii}}{\det \mathbf{V}},\tag{4}$$

where **V** is the variance–covariance matrix of **y**, with diagonal elements *v<sub>ii</sub>*. Since maximizing independence means minimizing mutual information, maximizing the sum of marginal negentropies after whitening the observations is equivalent to minimizing mutual information. This method is also similar to those based on the notion of kurtosis. Under these conditions, the separation problem consists in the search for a rotation matrix such that ∑<sub>*i*</sub> *H*(*y<sub>i</sub>*) is minimal. Second-order moments alone are not sufficient to decide whether non-Gaussian variables are independent. The use of cross-cumulants, on the other hand, makes it possible to establish whether variables are independent: if all the cross-cumulants of all orders of a set of random variables are null, then the random variables are independent.
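
As one concrete instance of this HOS approach, the sketch below uses scikit-learn's `FastICA`, a fixed-point algorithm that maximizes negentropy after whitening; the synthetic sources, mixing matrix, and parameter values are illustrative assumptions, and the sources are recovered only up to permutation and scaling:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
# Three non-Gaussian sources (illustrative): sinusoid, square wave, Laplacian noise
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]
A = rng.normal(size=(3, 3))                        # unknown mixing matrix, Eq. (1)
x = s @ A.T                                        # observed mixtures (samples x channels)

ica = FastICA(n_components=3, whiten="unit-variance", random_state=0)
s_hat = ica.fit_transform(x)                       # estimated sources s'(t), Eq. (2)
B = ica.components_                                # estimated separation matrix B
# s_hat matches s up to permutation and scaling, the usual ICA indeterminacies
```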

#### *2.2. Wavelet Transform*

EEG recordings are, in the first instance, time series, from which it is possible to obtain information about the temporal evolution of some characteristics; in this space, however, no frequency information is available. Conversely, the Fourier transform makes it possible to identify information in the frequency domain, but the temporal evolution of the frequency components is lost. The Continuous Wavelet Transform (CWT) generates 2D maps from 1D time series containing information on time and scale, the latter with a logarithmic relationship to the frequency components. Unlike the Short-Time Fourier Transform (STFT), the CWT performs a multiresolution analysis [39]. The CWT is described by

$$\mathcal{W}\_{\mathbf{x}}(a,\tau) = \frac{1}{\sqrt{|a|}} \int\_{-\infty}^{\infty} \mathbf{x}(t)\, \psi^\* \left(\frac{t-\tau}{a}\right) \, dt,\tag{5}$$

where *a* is the scale factor, *ψ* is the mother wavelet, and *τ* is the time shift of the mother wavelet along the signal *x*(*t*).
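
A direct discretization of Eq. (5) can be written in a few lines of NumPy; the complex Morlet mother wavelet, the 250 Hz sampling rate, and the 4–40 Hz frequency range below are illustrative assumptions, not the parameters used in this work:

```python
import numpy as np

def morlet_cwt(x, scales, fs, w0=6.0):
    """Discretization of Eq. (5) with a complex Morlet mother wavelet.
    x: 1-D signal; scales: scale factors a (seconds); fs: sampling rate (Hz)."""
    dt = 1.0 / fs
    n = x.size
    tau = (np.arange(n) - n // 2) * dt             # symmetric time grid for the wavelet
    coeffs = np.empty((len(scales), n), dtype=complex)
    for k, a in enumerate(scales):
        u = tau / a
        psi = np.pi ** -0.25 * np.exp(1j * w0 * u - u ** 2 / 2)   # Morlet wavelet
        # For this wavelet, convolving x with psi(t/a) realizes the
        # correlation with psi*((t - tau)/a) in Eq. (5)
        coeffs[k] = np.convolve(x, psi, mode="same") * dt / np.sqrt(a)
    return coeffs

# Example: time-scale map of a 10 Hz burst sampled at 250 Hz (typical EEG rate, assumed)
fs = 250.0
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) * (t > 0.5) * (t < 1.5)
freqs = np.linspace(4, 40, 64)                     # frequencies of interest (Hz)
scales = 6.0 / (2 * np.pi * freqs)                 # Morlet: a ~ w0 / (2*pi*f)
W = morlet_cwt(x, scales, fs)                      # |W| is the 2D scalogram
```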

#### *2.3. Convolutional Neural Network*

The Convolutional Neural Network (CNN) architecture comprises a sequence of convolutional and sub-sampling layers in which the values of the convolutional kernels are learned via an optimization algorithm. With a fixed number of filters in each layer, each filter is convolved across the width and height of the input in the forward pass. The output of this operation is a two-dimensional feature map of the responses of that filter, which detects a particular pattern. This is followed by a Rectified Linear Unit (ReLU), which introduces non-linearity into the network through a rectifying function. The governing equation of the convolution operation is given as

$$h\_i^\ell = f(c) = f\left(\sum\_{n=1}^{M} W\_{ni}^\ell \cdot h\_n^{\ell-1} + b\_i^\ell\right),\tag{6}$$

with the ReLU function defined as

$$f(c) = \begin{cases} 0, & c < 0, \\ c, & c \ge 0, \end{cases} \tag{7}$$

where *h<sub>i</sub><sup>ℓ</sup>* is the *i*-th output of layer ℓ, *W<sub>ni</sub><sup>ℓ</sup>* is the convolutional kernel operated on the *n*-th map of the (ℓ − 1)-th layer to produce the *i*-th output of the ℓ-th layer, and *b<sub>i</sub><sup>ℓ</sup>* is the bias term; *f*(*c*) is an activation function imposed on the output *c* of the convolution. The optimization algorithm focuses on learning the convolutional kernels *W*. The output of each convolutional operation is denoted as a feature map. After convolution, the sub-sampling layer reduces the dimension of the feature maps by representing a neighborhood of values in each feature map by a single value.
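
To make Eqs. (6) and (7) concrete, the following is a naive, loop-based NumPy sketch of a single convolutional layer; shapes and variable names are illustrative, and the implementation favors clarity over speed:

```python
import numpy as np

def relu(c):
    """Rectified Linear Unit, Eq. (7)."""
    return np.maximum(c, 0.0)

def conv_layer(h_prev, W, b):
    """Forward pass of one convolutional layer, Eq. (6) (valid convolution, stride 1).
    h_prev: (M, H, W_in) input maps from layer l-1
    W:      (M, I, kH, kW) kernels;  b: (I,) biases
    returns (I, H-kH+1, W_in-kW+1) output feature maps of layer l."""
    M, H, Wd = h_prev.shape
    _, I, kH, kW = W.shape
    out = np.zeros((I, H - kH + 1, Wd - kW + 1))
    for i in range(I):                       # one output map per filter
        for n in range(M):                   # sum contributions over the M input maps
            for y in range(out.shape[1]):
                for x in range(out.shape[2]):
                    out[i, y, x] += np.sum(W[n, i] * h_prev[n, y:y + kH, x:x + kW])
        out[i] += b[i]                       # bias term b_i
    return relu(out)                         # f(c), Eq. (7)

maps = conv_layer(np.random.randn(3, 8, 8), np.random.randn(3, 4, 3, 3), np.zeros(4))
print(maps.shape)                            # (4, 6, 6)
```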

Additionally, CNNs are also known as shift-invariant or space-invariant artificial neural networks (SIANN), owing to their shared-weights architecture and translation-invariance characteristics [40]. This particular feature makes CNNs appropriate for dealing with the problem described above, namely the loss of order in each trial processed by BSS algorithms. The parameters of each convolutional layer are the number of filters, the spatial extent (kernel size) of each filter, the stride, and the amount of zero padding, as illustrated in the sketch below.
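
Assuming an implementation in PyTorch, these four parameters map directly onto the arguments of `torch.nn.Conv2d`; the specific values below are illustrative only:

```python
import torch
import torch.nn as nn

# Illustrative values: 16 filters of size 3x3, stride 1, zero padding 1
layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),                                     # Eq. (7)
)
feature_maps = layer(torch.randn(1, 1, 64, 64))    # -> shape (1, 16, 64, 64)
```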


Another important operation, usually applied in each convolutional layer, is max-pooling, a sample-based discretization process. The objective is to down-sample an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. This is done in part to help prevent over-fitting by providing an abstracted form of the representation. It also reduces the computational cost by reducing the number of parameters to learn, and provides basic translation invariance to the internal representation [41].
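
A minimal NumPy sketch of non-overlapping max-pooling follows; the pool size and array shapes are illustrative:

```python
import numpy as np

def max_pool(feature_maps, p=2):
    """Non-overlapping p x p max-pooling over (channels, height, width) maps."""
    c, h, w = feature_maps.shape
    trimmed = feature_maps[:, :h - h % p, :w - w % p]   # drop rows/cols that do not fit
    return trimmed.reshape(c, h // p, p, w // p, p).max(axis=(2, 4))

fm = np.arange(32.0).reshape(2, 4, 4)
print(max_pool(fm).shape)        # (2, 2, 2): each 2x2 neighbourhood -> its maximum
```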

#### **3. Methodology**

In this section, the dataset format, the selected electrodes, and the proposed approach for MI classification are described.
