#### 2.1.1. Filter Methods

Filter methods select features according to the value of different metrics, usually statistical criteria, independently of any learning algorithm.

In the context of text mining, it is common to use the bag-of-words approach so that each word is taken as a unique feature. Chi-squared was used to filter the most relevant terms in a text mining algorithm to estimate credit score at Deutsche Bank [3].
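As a minimal sketch of this kind of filter (using NumPy, with a toy document-term matrix invented for illustration, not the data or code of [3]), each term can be scored by the Chi-squared statistic of its contingency table with the class:

```python
import numpy as np

def chi2_score(presence, labels):
    """Chi-squared statistic between a binary term-presence vector
    and class labels, from the 2xK contingency table."""
    presence, labels = np.asarray(presence), np.asarray(labels)
    n = len(labels)
    score = 0.0
    for p in (0, 1):
        for c in np.unique(labels):
            observed = np.sum((presence == p) & (labels == c))
            expected = np.sum(presence == p) * np.sum(labels == c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

# Toy bag-of-words matrix: rows are documents, columns are terms.
docs = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 1, 1],
                 [0, 1, 0]])
labels = np.array([0, 0, 1, 1])
scores = [chi2_score(docs[:, j], labels) for j in range(docs.shape[1])]
# Terms 0 and 1 separate the classes perfectly; term 2 is uninformative.
ranked = np.argsort(scores)[::-1]
```

Keeping only the top-ranked terms is then the filtering step proper; note that no model has to be trained to compute the scores, which is what characterises filter methods.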

Also in the context of text mining, in [4] tweets are analysed in order to determine the impact of their sentiment on stock market movements. The authors likewise use a filter method, the Fisher score, to select the most relevant features.
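The Fisher score itself has a simple closed form: the between-class scatter of a feature divided by its within-class scatter. A minimal NumPy sketch on toy data invented for the example:

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: between-class scatter divided by
    within-class scatter; larger values mean more discriminative."""
    X, y = np.asarray(X, float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num, den = np.zeros(X.shape[1]), np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den  # assumes non-zero within-class variance

# Feature 0 separates the two classes; feature 1 is noise.
X = np.array([[0.0, 1.0], [0.2, 0.0], [1.0, 0.5], [1.2, 1.5]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
```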

**Citation:** Lopez-Miguel, I.D. Survey on Preprocessing Techniques for Big Data Projects. *Eng. Proc.* **2021**, *7*, 14. https://doi.org/10.3390/engproc2021007014


Academic Editors: Joaquim de Moura, Marco A. González, Javier Pereira and Manuel G. Penedo

Published: 7 October 2021


Based on Chi-squared and the GUIDE regression tree, Loh [5,6] presents a technique to perform feature selection in a large genomic dataset.

Other filter methods include information gain [7], correlation [8], variance similarity [9], and dispersion ratio [10].

Some work has been done to adapt these methods to the big data context, such as in [11], where a framework to parallelise and scale some of these algorithms is introduced.

#### 2.1.2. Embedded Methods

In embedded methods, feature selection is performed during the process of fitting a model to a given dataset.

SVM-RFE (Support Vector Machine Recursive Feature Elimination), introduced in [12] to analyse DNA microarrays, has shown its power in several applications, such as bioinformatics [13].
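The elimination loop behind SVM-RFE can be sketched as follows. For brevity, a bare perceptron stands in for the SVM as the linear model whose weights rank the features (a full SVM solver is beyond the scope of a sketch), and the data are invented toy values:

```python
import numpy as np

def train_perceptron(X, y, epochs=50):
    """Tiny linear classifier used here as a stand-in for the SVM;
    y must be in {-1, +1}. Returns the learned weight vector."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified: update
                w += yi * xi
                b += yi
    return w

def rfe(X, y, n_keep):
    """Recursive feature elimination: repeatedly fit the linear model
    and drop the feature with the smallest absolute weight."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = train_perceptron(X[:, remaining], y)
        worst = int(np.argmin(np.abs(w)))
        remaining.pop(worst)
    return remaining

# Feature 0 carries the class signal; feature 1 is near-noise.
X = np.array([[2.0, 0.1], [1.5, -0.2], [-2.0, 0.1], [-1.8, -0.1]])
y = np.array([1, 1, -1, -1])
kept = rfe(X, y, n_keep=1)
```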

The Feature Selection-Perceptron (FS-P) [14] technique has been used in a proton (1H) magnetic resonance spectroscopy (MRS) database to select features that could better predict brain tumours.

Based on a more complex neural network, the embedded method BlogReg is introduced in [15], where it is applied to data collected from the sensors of a robot.

#### 2.1.3. Wrapper Methods

Wrapper methods refer to an iterative process in which candidate subsets of features are evaluated one at a time, using the learning algorithm itself to score each subset.
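The idea can be sketched generically. Here a greedy forward-selection wrapper uses leave-one-out 1-nearest-neighbour accuracy as the evaluator (an illustrative stand-in for the wrapped classifiers discussed below), on toy data invented for the example:

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier,
    used as the subset evaluator in the wrapper."""
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        correct += y[np.argmin(d)] == y[i]
    return correct / len(X)

def forward_selection(X, y, n_select):
    """Greedy wrapper: at each step, add the feature whose inclusion
    gives the best evaluator score."""
    chosen = []
    while len(chosen) < n_select:
        best_f, best_acc = None, -1.0
        for f in range(X.shape[1]):
            if f in chosen:
                continue
            acc = loo_1nn_accuracy(X[:, chosen + [f]], y)
            if acc > best_acc:
                best_f, best_acc = f, acc
        chosen.append(best_f)
    return chosen

# Feature 0 separates the classes; feature 1 is misleading noise.
X = np.array([[0.0, 5.0], [0.1, 0.0], [1.0, 5.1], [1.1, 0.2]])
y = np.array([0, 0, 1, 1])
selected = forward_selection(X, y, n_select=1)
```

Because each candidate subset requires retraining the evaluator, wrappers are costlier than filters, which is why adapting them to big data settings is non-trivial.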

A wrapper method based on the decision tree C4.5 has been used for many years [16]. However, developments based on this method are still ongoing, such as the one from [17], which is applied to healthcare data (Medical Internet of Things).

Another wrapper method is based on the SVM algorithm [18]. It has been widely used since its creation, such as in [19], predicting arrhythmias from cardiac data.

FSSEM (Feature Subset Selection wrapped around EM clustering) [20] is also a wrapper method, as is the popular stepwise approach for regression problems [21].

#### *2.2. Discretisation*

Discretisation is the step in which continuous variables are transformed into categorical ones [2]. There exist multiple classifications of discretisation techniques; here, the division into unsupervised and supervised discretisation is chosen [2].

#### 2.2.1. Unsupervised

Unsupervised discretisation methods do not take into account the target of the learning algorithm when the features are discretised.

Equal width interval discretisation and equal frequency interval discretisation need to be adapted to the big data streaming context, as done in [22].
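In their basic batch form (not the streaming adaptation of [22]), the two classical schemes amount to choosing bin edges either evenly over the value range or at quantiles. A NumPy sketch on skewed toy data invented for the example:

```python
import numpy as np

# Skewed toy variable: most mass near 1-2, two outlying values.
x = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 9.0, 10.0])
k = 2  # number of intervals

# Equal width: edges evenly spaced over [min, max].
width_edges = np.linspace(x.min(), x.max(), k + 1)
width_bins = np.digitize(x, width_edges[1:-1])

# Equal frequency: edges at quantiles, so bins hold ~equal counts.
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(x, freq_edges[1:-1])
```

On this skewed sample, equal width puts six of the eight values in one bin, while equal frequency balances the counts four and four, which illustrates why the choice matters.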

In [23], k-means [24] discretisation is used to transform the target for road detection.

Other methods based on k-means algorithm have been proposed, such as Cokmeans and Bikmeans, used in [25] in the context of microarrays.
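A basic one-dimensional k-means discretiser (the plain algorithm, not the Cokmeans or Bikmeans variants of [25]) can be sketched as follows; the data are invented toy values:

```python
import numpy as np

def kmeans_discretise(x, k, iters=20):
    """Discretise a 1-D variable by clustering its values with k-means
    and using cluster membership as the category."""
    x = np.asarray(x, float)
    # Initialise centroids at evenly spaced quantiles for stability.
    centroids = np.quantile(x, np.linspace(0, 1, k))
    for _ in range(iters):
        # Assign each value to its nearest centroid, then re-centre.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean()
    return labels, centroids

# Three natural value groups around 1, 5, and 9.
x = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2]
labels, centroids = kmeans_discretise(x, k=3)
```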

#### 2.2.2. Supervised

Supervised discretisation does take into account the target of the learning algorithm. One of the most popular methods is based on entropy [26]. This algorithm was parallelised in [27].
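The core of the entropy criterion is to choose the cut point that minimises the weighted class entropy of the resulting intervals. The sketch below finds a single best cut; the full method of [26] applies this recursively with an MDL stopping rule, which is omitted here. Data are invented toy values:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    """Cut point minimising the size-weighted class entropy of the
    two resulting intervals (one step of entropy discretisation)."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best, best_score = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no cut between identical values
        cut = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(y)
        if score < best_score:
            best, best_score = cut, score
    return best

# The classes switch between values 3 and 10, so the cut lands at 6.5.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
cut = best_cut(x, y)
```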

Chi-squared is the basis for ChiMerge [28], ChiSplit [29], and Khiops [30]. They were parallelised in [31] to work for big data problems.
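ChiMerge itself can be sketched compactly: start with one interval per distinct value and repeatedly merge the adjacent pair whose class counts have the lowest Chi-squared statistic. For brevity, the significance-threshold stopping rule of [28] is replaced here by a target number of intervals, and the data are invented toy values:

```python
import numpy as np

def chi2_pair(table):
    """Chi-squared statistic for two adjacent intervals' class counts
    (rows: intervals, columns: classes)."""
    table = np.asarray(table, float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0  # skip empty expected cells
    return np.sum((table[mask] - expected[mask]) ** 2 / expected[mask])

def chimerge(x, y, n_intervals):
    """ChiMerge sketch: one interval per distinct value, then merge the
    adjacent pair with the lowest chi-squared until the target count."""
    classes, values = np.unique(y), np.unique(x)
    counts = [np.array([np.sum((x == v) & (y == c)) for c in classes], float)
              for v in values]
    bounds = list(values)  # lower bound of each interval
    while len(counts) > n_intervals:
        scores = [chi2_pair([counts[i], counts[i + 1]])
                  for i in range(len(counts) - 1)]
        i = int(np.argmin(scores))
        counts[i] = counts[i] + counts.pop(i + 1)
        bounds.pop(i + 1)
    return bounds  # lower bounds of the merged intervals

# The class changes between values 3 and 7, so two intervals remain:
# one starting at 1 and one starting at 7.
x = np.array([1, 2, 3, 7, 8, 9])
y = np.array([0, 0, 0, 1, 1, 1])
bounds = chimerge(x, y, n_intervals=2)
```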

The previously presented approaches are univariate, but there also exist supervised multivariate discretisation (SMD) techniques, such as the one in [32].

### **3. Conclusions**

Due to space limitations, this paper has only presented some feature selection and discretisation techniques, mentioning up-to-date examples of where they are used. There is growing interest in adapting these techniques so that they perform efficiently in the big data context. In this direction, a future line of work is to build a comprehensive and complete taxonomy of up-to-date feature selection and discretisation techniques, together with an experimental evaluation in the big data context.

**Funding:** This research received no external funding.

