#### *2.2. Preprocessing*

Preprocessing of EEG signals covers signal cleaning and enhancement. EEG signals are weak and easily contaminated by noise from internal and external sources, so these steps are essential to keep noise from degrading the subsequent classification. The body itself produces electrical activity through blinking, eye or muscle movement, and even heartbeats, and this activity blends with the EEG signals. Whether to remove these artifacts should be considered carefully, because they may carry relevant information about the emotional state and could improve the performance of emotion recognition algorithms. If filters are used, they must be applied with caution to avoid distorting the signal.

The three filter types commonly used in EEG are (1) low-frequency filters, (2) high-frequency filters, and (3) notch filters. Electrical engineers know the first two as high-pass and low-pass filters, respectively. Together, they are used to retain frequencies between 1 and 50–60 Hz.

For EEG signal processing, filters such as Butterworth, Chebyshev, or inverse Chebyshev are preferred [39]. Each has specific characteristics that need to be weighed. A Butterworth filter has a flat response in the passband and the stopband but a wide transition zone. The Chebyshev filter has ripple in the passband, a steeper transition, and a monotonic stopband. The inverse Chebyshev filter has a flat response in the passband, a narrow transition, and ripple in the stopband. To prevent a phase shift, a zero-phase Butterworth filter should be used: it runs forward and then backward over the signal, so the phase shifts of the two passes cancel.
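The forward-backward idea can be sketched with SciPy. This is a minimal illustration, not a recommended pipeline: the sampling rate, filter order, the 1–50 Hz band edges, and the synthetic test signal are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_zero_phase(x, fs, low=1.0, high=50.0, order=4):
    """Zero-phase Butterworth band-pass applied along the last axis."""
    # Second-order sections are numerically more robust than (b, a) coefficients.
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # sosfiltfilt filters forward and then backward, cancelling the phase shift.
    return sosfiltfilt(sos, x, axis=-1)

# Synthetic signal (assumed): 10 Hz rhythm + slow drift + high-frequency noise.
fs = 256.0                                  # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
alpha = np.sin(2 * np.pi * 10 * t)
raw = alpha + 2.0 * np.sin(2 * np.pi * 0.2 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

clean = bandpass_zero_phase(raw, fs)
```

Because the filter is applied in both directions, the 10 Hz component in `clean` stays aligned in time with the original rhythm; the price is that the effective attenuation is squared and the method only works offline, on a recorded segment.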

Another preprocessing objective is to remove noise generated by an external source, such as power line interference [40]. Notch filters block a specific frequency rather than a broad frequency range. This filter is designed to eliminate the frequency introduced by the electrical network, which is 50 or 60 Hz depending on the mains frequency of the specific country.
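A notch filter of this kind might be sketched as follows. The 50 Hz mains frequency, the quality factor of 30, and the synthetic signal are assumptions for illustration; a 60 Hz network would use `mains_hz=60.0`.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def remove_mains(x, fs, mains_hz=50.0, quality=30.0):
    """Suppress a narrow band around the mains frequency (zero-phase)."""
    # A higher quality factor gives a narrower notch (bandwidth = mains_hz / quality).
    b, a = iirnotch(mains_hz, quality, fs=fs)
    return filtfilt(b, a, x, axis=-1)

# Synthetic example (assumed): a 10 Hz rhythm contaminated by 50 Hz interference.
fs = 256.0
t = np.arange(0, 4, 1 / fs)
eeg_like = np.sin(2 * np.pi * 10 * t)
contaminated = eeg_like + 0.8 * np.sin(2 * np.pi * 50 * t)

denoised = remove_mains(contaminated, fs)
```

Away from the edges of the segment, `denoised` closely follows `eeg_like`, because the notch leaves frequencies far from 50 Hz almost untouched.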

All of these filters are appropriate for artifact elimination in EEG signals. However, as previously noted, care must be taken when using them. In general, filters can distort the waveform and structure of the EEG signal in the time domain. Hence, filtering should be kept to a minimum to avoid losing EEG signal information.

Nevertheless, preprocessing helps to separate different signals and sources. Table 3 shows methods used for preprocessing EEG signals [41] and the percentage of the literature from 2015 to 2020 that mentions them. Independent Component Analysis (ICA) and Principal Component Analysis (PCA) apply blind source separation to isolate the source signal from noise in multi-channel recordings, so they can be used for artifact removal and noise reduction. Common Average Reference (CAR) is suited to noise reduction. The Surface Laplacian (SL) is applied for spatial filtering to improve the signal's spatial resolution. The Common Spatial Patterns (CSP) algorithm finds spatial filters that can help distinguish signals corresponding to muscular movements.
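Of these methods, CAR is the simplest to write down: at every time sample, the mean across all channels is subtracted from each channel. The sketch below assumes an array of shape `(n_channels, n_samples)` and uses synthetic data in which the same noise is added to every channel.

```python
import numpy as np

def common_average_reference(eeg):
    """Subtract the instantaneous mean over channels from every channel.

    eeg: array of shape (n_channels, n_samples).
    """
    return eeg - eeg.mean(axis=0, keepdims=True)

# Synthetic example (assumed): independent channel activity plus shared noise.
rng = np.random.default_rng(0)
n_channels, n_samples = 8, 1000
sources = rng.standard_normal((n_channels, n_samples))
shared_noise = 5.0 * rng.standard_normal(n_samples)   # identical on all channels
recorded = sources + shared_noise                     # broadcasts over channels

car = common_average_reference(recorded)
```

Noise that is identical on all channels cancels exactly, which is why CAR works best when the electrodes cover the scalp evenly and the interference is common to all of them.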


**Table 3.** Frequently used preprocessing methods of EEG signals.

Each of the most widely used preprocessing algorithms therefore has its own benefits. The usage percentage column of Table 3 shows that the most used preprocessing algorithms are PCA (50.1%), ICA (26.8%), and CSP (17.7%).

#### *2.3. Feature Extraction*

Once signals are noise free, the BCI needs to extract essential features, which will be fed to the classifier. Features can be computed in the domain of (1) time, (2) frequency, (3) time-frequency, or (4) space, as shown in Table 4 [31,38,39]. This table presents the most popular techniques used for feature extraction, their domain, advantages, and limitations.

Time-domain features include the event-related potential (ERP), Hjorth features, and higher-order crossing (HOC) [58–60], independent component analysis (ICA), principal component analysis (PCA), and Higuchi's fractal dimensions (FD) as a measure of signal complexity and self-similarity in this domain. There are also statistical measures, such as power, mean, standard deviation, variance, skewness, kurtosis, relative band energy, and entropy. The latter evaluates signal randomness [61].
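As an illustration of a time-domain feature, the three Hjorth parameters (activity, mobility, and complexity) can be computed from the variances of a signal and its successive differences. The sketch below uses the common discrete definition based on first differences; the test signals are assumptions, and mobility is expressed per sample rather than per second.

```python
import numpy as np

def hjorth_parameters(x):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)          # first difference approximates the derivative
    ddx = np.diff(dx)        # second difference
    var_x, var_dx, var_ddx = np.var(x), np.var(dx), np.var(ddx)
    activity = var_x                                  # signal power
    mobility = np.sqrt(var_dx / var_x)                # proxy for mean frequency
    complexity = np.sqrt(var_ddx / var_dx) / mobility # deviation from a pure sine
    return activity, mobility, complexity

# Assumed test signals: a pure 10 Hz sine and white noise.
fs = 256.0
t = np.arange(0, 2, 1 / fs)
sine = np.sin(2 * np.pi * 10 * t)
noise = np.random.default_rng(0).standard_normal(t.size)

act_sine, mob_sine, comp_sine = hjorth_parameters(sine)
act_noise, mob_noise, comp_noise = hjorth_parameters(noise)
```

A pure sine has a complexity close to 1, and less regular signals such as noise score higher, which is what makes the parameter useful as a compact descriptor of waveform shape.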

Among frequency-domain methods, the most popular is the fast Fourier transform (FFT). Auto-regressive (AR) modeling is an alternative to Fourier-based methods for computing the frequency spectrum of a signal [62,63].
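A typical use of the FFT in this context is to estimate the power in each EEG frequency band. The following sketch computes a simple periodogram with NumPy; the band limits are conventional values that vary between studies and are not taken from this review, and the test signal is synthetic.

```python
import numpy as np

# Conventional band edges in Hz (assumption; exact limits differ across studies).
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}

def band_powers(x, fs, bands=BANDS):
    """Mean periodogram power of a 1-D signal in each frequency band."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                               # remove the DC offset
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)    # frequency of each FFT bin
    power = np.abs(np.fft.rfft(x)) ** 2 / x.size   # periodogram, arbitrary scale
    return {name: float(power[(freqs >= lo) & (freqs < hi)].mean())
            for name, (lo, hi) in bands.items()}

# Assumed test signal: a 10 Hz rhythm in weak white noise.
fs = 128.0
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 10 * t) + 0.2 * rng.standard_normal(t.size)

powers = band_powers(signal, fs)
dominant = max(powers, key=powers.get)   # band with the highest mean power
```

In practice, an averaged estimate such as Welch's method is often preferred over a single periodogram because it has lower variance.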

The time-frequency domain exploits variations in time and frequency, which are very descriptive of the neural activities. For this, wavelet transform (WT) and wavelet packet decomposition (WPD) are used [62].

Spatial information is also considered when describing the characteristics of EEG signals. For this dimension, signals are referenced to digitally linked ears (DLE) values, which are computed from the left and right earlobe potentials as follows:

$$V_{\varepsilon}^{DLE} = V_{\varepsilon} - \frac{1}{2}(V_{A1} + V_{A2}), \tag{1}$$

where *V*<sub>*A*1</sub> and *V*<sub>*A*2</sub> are the reference voltages on the left and right earlobes. Thus, EEG data is broken down electrode by electrode. Consequently, each channel contains spatial information about the location of its source.
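Equation (1) translates directly into array arithmetic. The sketch below assumes the electrode data is stored as `(n_channels, n_samples)` and the two earlobe recordings as vectors of length `n_samples`; the function name and the small numeric example are illustrative.

```python
import numpy as np

def rereference_dle(v, v_a1, v_a2):
    """Apply Equation (1): subtract the mean earlobe potential from each electrode.

    v: (n_channels, n_samples) electrode voltages.
    v_a1, v_a2: (n_samples,) voltages at the left and right earlobes.
    """
    v = np.asarray(v, dtype=float)
    reference = 0.5 * (np.asarray(v_a1, dtype=float) + np.asarray(v_a2, dtype=float))
    return v - reference          # the reference broadcasts over all channels

# Small worked example with two electrodes and two time samples.
v = np.array([[10.0, 12.0],
              [4.0, 6.0]])
v_a1 = np.array([2.0, 4.0])
v_a2 = np.array([4.0, 8.0])

v_dle = rereference_dle(v, v_a1, v_a2)
```

Here the reference is 3 at the first sample and 6 at the second, so the first electrode becomes 7 and 6, and the second becomes 1 and 0.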

For spatial computation, the surface Laplacian (SL) algorithm reduces volume conduction effects dramatically. SL also improves EEG spatial resolution by reducing the distortion produced by volume conduction and reference electrodes [47].

Figure 4 shows EEG signals in the time domain, the frequency domain, and spatial information.

**Figure 4.** Frequency domain, time domain, and spatial information [63].



**Table 4.** Feature extraction algorithms.



According to [97], emotions emerge as the synchronization of various subsystems. Several authors use indexes of synchronized activity in different parts of the brain. The effectiveness of these indexes was demonstrated in [98] by calculating the correlation dimension of a group of EEG signals. In [98], other methods were also used to calculate the synchronization of different areas of the brain. Synchronization indexes are a promising approach to emotion recognition that deserves further research.

Table 4 shows the most commonly used algorithms and their respective mention percentages in the literature: (1) WT (26%), (2) PCA (19.7%), (3) Hjorth (17%), (4) ICA (11.3%), and (5) statistical measures (8.6%).

#### *2.4. Feature Selection*

The feature selection process is vital because it identifies the signal properties that best describe the EEG characteristics to be classified. In BCI systems, the feature vector generally has high dimensionality [99]. Feature selection reduces the number of input variables for the classifier and should not be confused with dimensionality reduction. Both processes decrease the number of attributes in the data, but dimensionality reduction does so by combining features into new ones.
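The difference can be made concrete with a small NumPy sketch. The data is random, the indices in `keep` are hypothetical (a real selector would choose them by some criterion), and PCA is used here only as one familiar example of dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # 100 observations, 5 features

# Feature selection: keep a subset of the ORIGINAL columns unchanged.
keep = [0, 1, 2]                           # hypothetical choice of features
X_selected = X[:, keep]

# Dimensionality reduction (PCA via SVD): build NEW features as linear
# combinations of all original columns, then keep the first three.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ components[:3].T
```

Both results have three columns, but each column of `X_selected` is still a measured feature, whereas each column of `X_reduced` mixes all five original features with the weights in `components`.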

A feature selection method does not change the features but excludes some of them according to specific usefulness criteria. Feature selection methods aim to achieve the best results while processing the least amount of data. They remove attributes that do not contribute to the classification because they are irrelevant or redundant, which yields simpler classification models that are faster and perform better. Additionally, feature selection reduces the likelihood of overfitting in regular datasets, with flexible models, or when the dataset has many features but few observations.

One classification of feature selection methods, based on the number of variables, divides them into two classes: (1) univariate and (2) multivariate. Univariate methods consider the input features one by one. Multivariate methods consider whole groups of features together.

Another classification distinguishes feature selection methods as filter, wrapper, and built-in (embedded) algorithms.


#### Examples of Feature Selection Algorithms

The following are some examples of algorithms for feature selection:

• Effect-size (ES)-based feature selection is a filter method. In the univariate case, Cohen's *d* is an appropriate effect size for comparisons between two means [100]. If the means of two groups do not differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant. The effect size is calculated by taking the difference between the two group means and dividing it by the standard deviation of one of the groups. Univariate methods may discard features that could have provided useful information. Multivariate ES-based selection helps remove several features with redundant information, therefore selecting fewer features while retaining the most information [58]. It considers all the dependencies between features when evaluating them, for example, by calculating the Mahalanobis distance using the covariance structure of the noise.

• Min-redundancy max-relevance (mRMR) is a filter method [101]. This algorithm compares the mutual information between each feature and each class at the output. Mutual information between two random variables *x* and *y* is calculated as:

$$I(\mathbf{x}; y) = \iint p(\mathbf{x}, y) \log \frac{p(\mathbf{x}, y)}{p(\mathbf{x})p(y)} d\mathbf{x} dy,\tag{2}$$

where *p*(*x*) and *p*(*y*) are the marginal probability density functions of *x* and *y*, respectively, and *p*(*x*, *y*) is their joint probability density function. If *I*(*x*; *y*) equals zero, the two random variables *x* and *y* are statistically independent [58]. mRMR maximizes *I*(*x<sub>i</sub>*; *y*) between each feature *x<sub>i</sub>* and the target vector *y*, and minimizes the average mutual information *I*(*x<sub>i</sub>*; *x<sub>j</sub>*) between pairs of features.
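A greedy version of this idea can be sketched in a few lines. Two simplifications are assumed here: the integral in Equation (2) is replaced by a histogram estimate, and relevance and redundancy are combined by simple subtraction, which is only one of several possible criteria. The function names and the synthetic data are illustrative.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(x; y) in nats (discretized Equation (2))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mrmr(X, y, k, bins=8):
    """Greedily pick k feature indices by relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y, bins) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]    # start with the most relevant one
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, j], X[:, s], bins)
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

# Synthetic data (assumed): the label depends on two independent bits a and b.
rng = np.random.default_rng(0)
n = 2000
a, b = rng.integers(0, 2, n), rng.integers(0, 2, n)
y = 2 * a + b
f0 = a + 0.1 * rng.standard_normal(n)     # informative about a
f1 = f0 + 0.01 * rng.standard_normal(n)   # near-duplicate of f0 (redundant)
f2 = b + 0.1 * rng.standard_normal(n)     # informative about b
f3 = rng.standard_normal(n)               # pure noise
X = np.column_stack([f0, f1, f2, f3])

selected = mrmr(X, y, k=2)
```

With this data, the two selected features cover both bits: the near-duplicate pair `f0` and `f1` is never chosen together, because the second copy adds redundancy without adding relevance.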


Table 5 shows feature selection algorithms and their percentage of usage in the literature. Genetic algorithms are frequently used (32.3%), followed by SDA (17.7%), wrapper methods (15.6%), and mRMR (11.5%).


**Table 5.** Feature selection methods used in the literature (2015–2020) in percentages (%).

#### *2.5. Classification Algorithms*

Classification algorithms can be categorized by model framework [56,57]. The categories may be (1) generative-discriminative, (2) static-dynamic, (3) stable-unstable, and (4) regularized [102–104].

There are two approaches to selecting the classifier that works best under given conditions in emotion recognition [56]. The first identifies the best classifier for a given BCI device. The second identifies the best classifier for a given set of features.

For synchronous BCIs, dynamic classifiers and ensemble combinations have shown better performance than SVMs. For asynchronous BCIs, authors in this field have not determined an optimal classifier. However, it seems that dynamic classifiers perform better than static classifiers [56] because they better handle the identification of the onset of mental processes.

Under the second approach, discriminative classifiers have been found to perform better than generative classifiers, principally in the presence of noise or outliers. Discriminative classifiers like SVM generally handle high dimensionality in the features better. If the training set is small, simple techniques like LDA classifiers may yield satisfactory results [58].

#### 2.5.1. Generative-Discriminative

These classifier models generally address supervised learning problems by fitting probability distributions to the data. A generative model specifies the distribution of each class using the joint probability distribution p(x,y) and Bayes' theorem. A discriminative model finds the decision boundary between the categories using the conditional probability distribution p(y|x). Generative models include the following classifiers: Naïve Bayes, Bayesian networks, Markov random fields, and hidden Markov models (HMM).
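The relation between the two families follows from Bayes' theorem. A generative classifier estimates the class-conditional density p(x|y) and the prior p(y), and derives the posterior from them, whereas a discriminative classifier models the posterior p(y|x) directly. In the expression below, the predicted label $\hat{y}$ and the summation index $y'$ are notation introduced here for illustration:

$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\sum_{y'} p(x \mid y')\,p(y')}, \qquad \hat{y} = \arg\max_{y}\, p(y \mid x).$$

Both families therefore end with the same decision rule and differ only in which quantities are learned from the data.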

#### 2.5.2. Static-Dynamic Classification

Static-dynamic classification concerns how the model treats variations over time. A static model is trained on the data once and then uses the trained model to classify a single feature vector at a time. In a dynamic model, the system is updated continually. Thus, dynamic models can take a sequence of feature vectors and capture temporal dynamics.

The multilayer perceptron (MLP) can be considered a static classifier. An example of a dynamic classifier is the hidden Markov model (HMM), because it can classify a sequence of feature vectors.

#### 2.5.3. Stable-Unstable

Stable classifiers usually have low complexity, and their performance is not affected by small variations of the training set. For example, k-nearest neighbors (kNN) is a common stable classifier. Unstable classifiers have high complexity and show considerable changes in performance with minor variations of the training set. Examples of unstable classifiers are the linear support vector machine (SVM), the multilayer perceptron (MLP), and the bilinear recurrent neural network (BLR-NN).
