*Article* **EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms**

#### **Sašo Karakatič**

Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor 2000, Slovenia; saso.karakatic@um.si

Received: 29 April 2020; Accepted: 27 May 2020; Published: 2 June 2020

**Abstract:** The quality of machine learning models can suffer when inappropriate data is used, which is especially prevalent in high-dimensional and imbalanced data sets. Data preparation and preprocessing can mitigate some problems and can thus result in better models. The use of meta-heuristic and nature-inspired methods for data preprocessing has become common, but these approaches are still not readily available to practitioners through a simple and extendable application programming interface (API). This paper presents EvoPreprocess, an open-source Python framework that preprocesses data with evolutionary and nature-inspired optimization algorithms. The main problems addressed by the framework are *data sampling* (simultaneous over- and under-sampling of data instances), *feature selection* and *data weighting* for supervised machine learning problems. The EvoPreprocess framework provides a simple, object-oriented and parallelized API for these preprocessing tasks and can be used with the scikit-learn and imbalanced-learn Python machine learning libraries. The framework uses well-known self-adaptive nature-inspired meta-heuristic algorithms and can easily be extended with custom optimization and evaluation strategies. The paper presents the architecture of the framework, its use, experimental results, and a comparison to other common preprocessing approaches.

**Keywords:** data sampling; feature selection; instance weighting; nature-inspired algorithms; meta-heuristic algorithms

#### **1. Introduction**

Data preprocessing is one of the standard procedures in data mining, and it can greatly improve the performance of machine learning models or statistical analysis [1]. Three common data preprocessing tasks are addressed by the presented EvoPreprocess framework: feature selection, data sampling, and data weighting. The framework tackles these tasks with supervised machine learning based evaluation. All three tasks deal with inappropriate and high-dimensional data, which can result in over-fitted and non-generalizable machine learning models [2,3].

Many different techniques have been proposed for and applied to these data preprocessing tasks [1,4]: various feature selection methods based on statistics (information gain, covariance, Gini index, *χ*² etc.) [5,6]; under-sampling data with neighborhood cleaning [7] and prototype selection [8]; and over-sampling data with SMOTE [9] and its variants [10]. These methods are mainly deterministic and have limited variability in the resulting solutions. Due to this, researchers have also focused extensively on preprocessing with meta-heuristic optimization methods [11,12].

Meta-heuristic optimization methods provide sufficiently good solutions to NP-hard problems while not guaranteeing that the solutions are globally optimal. Since feature selection, data sampling and data weighting are NP-hard problems [6], meta-heuristics present a valid approach, which is also supported by a wide body of research presented in the following sections. Nature-inspired algorithms have already been applied to data preprocessing. Genetic algorithms were used for feature selection in high-dimensional data sets to select six biomarker genes linked with colon cancer [13,14]. Evolution strategy was used for data sampling in early software defect detection [15]. Also, particle swarm optimization was used for data sampling in the classification of hyperspectral images [16].

While research on the topic is wide and prolific, standard libraries, packages and frameworks are sparse. There are some well-maintained and well-documented data preprocessing libraries [17–22], but none of them provide the ability to use nature-inspired approaches in the Python programming language. The EvoPreprocess framework aims to bridge the gap between data preprocessing and nature-inspired meta-heuristics and to provide an easy-to-use and extendable Python API that can be used by practitioners in their data mining pipelines, or by researchers developing novel nature-inspired methods for the problem of data preprocessing.

This paper presents the implementation details and examples of use of the EvoPreprocess framework, which offers an API for solving the three mentioned data preprocessing tasks with nature-inspired meta-heuristic optimization methods. While data preprocessing approaches for classification tasks are already available, there is a lack of resources for data preprocessing on regression tasks. The presented framework works with both tasks of supervised learning (i.e., classification and regression) and therefore fills this gap.

The novelty of the framework is the following:


The implementation of the preprocessing tasks in the provided framework is on par with or better than other available approaches. Thus, the framework can be used as-is, but its main strength is that it provides a foundation upon which others can build to provide various specialized preprocessing approaches. The framework handles the parallelization and the evaluation of the optimization process and addresses the data leakage problem without any additional input needed.

The rest of the paper is organized as follows. The next section contains the problem formulation of the three preprocessing tasks and a literature overview of data preprocessing with nature-inspired methods. Section 3 contains the implementation details of the presented framework, which is followed by the fourth section with examples of use. Finally, concluding remarks are provided in the fifth section of the paper.

#### **2. Problem Formulation**

Let $X \in \mathbb{R}^{n \times m}$ be a data matrix (data set), where *n* denotes the number of instances (samples) in the data set and *m* is the number of features. The data set *X* consists of instances $x_1, x_2, \dots, x_n$, where $x_i \in \mathbb{R}^m$, and is written as $X = [x_1, x_2, \dots, x_n]$. Also, the data set *X* consists of features $f_1, f_2, \dots, f_m$, where $f_i \in \mathbb{R}^n$, and is denoted as $X = [f_1, f_2, \dots, f_m]$. As we are dealing with a supervised data mining problem, there is also *Y*, a vector of target values to be predicted. If the problem is a classification task, the values in *Y* are nominal, $Y \in \{c_1, c_2, \dots, c_k\}$, where *k* is the number of predefined discrete categories or classes. On the contrary, if we are dealing with regression, the target values in *Y* are continuous, $Y \in \mathbb{R}^n$. The goal of supervised learning is to construct the model *M*, which can map *X* to $\hat{Y}$ while minimizing the difference between the predicted target values $\hat{Y}$ and the true target values *Y*.

**Feature selection** handles the curse of dimensionality when machine learning models tend to over-fit on data with a too large set of features [23]. It is a technique where the most relevant features are selected, while the redundant, noisy or irrelevant features are discarded. The result of using feature selection is improved learning performance, increased computational efficiency, decreased memory storage space needed and more generalizable models [5]. In mathematical terms, feature selection transforms the original data *X* into the new $X_{FS}$ (Equation (1)), with a potentially smaller set of features $f'_i$ than in the original set.

$$\begin{aligned} X &= [f_1, f_2, \dots, f_m] \\ X_{FS} &= [f'_1, f'_2, \dots, f'_l] \\ l &\le m \\ l &> 0 \end{aligned} \tag{1}$$
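As an illustration, the transformation in Equation (1) amounts to selecting a subset of columns from *X*. A minimal NumPy sketch (the variable names are illustrative and not part of the framework's API):

```python
import numpy as np

# Toy data set X: n = 4 instances, m = 5 features
X = np.arange(20).reshape(4, 5)

# A binary feature-selection solution: keep features 0, 2 and 4
selected = np.array([1, 0, 1, 0, 1], dtype=bool)

# New data set X_FS with l = 3 <= m features
X_FS = X[:, selected]
```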

Feature selection has already been addressed with meta-heuristic and nature-inspired optimization methods, as has been demonstrated in review papers [24,25]. Lately, the research topic has gained extensive focus from the nature-inspired optimization research community, with the application of every type of nature-inspired method to the given problem—whale optimization algorithm [26], dragonfly [27] and chaotic dragonfly algorithm [28], grasshopper algorithm [29], grey wolf optimizer [30,31], differential evolution and artificial bee colony [32], crow search [33], swarm optimization [34], genetic algorithm [35] and many others.

On the other hand, **data sampling** and **data weighting** (sometimes *instance weighting*) address the problem of improper ratios of instances in the learning data set [36]. The problem is two-fold—some types of instances can be over-represented (the majority), and other instances can be under-represented (the minority).

Imbalanced data sets can form for various reasons. Either there is a natural imbalance in real-life cases, or certain instances are more difficult to collect. For example, some diseases are not common, and therefore patients with similar symptoms without the rare disease are much more common than patients whose symptoms stem from the rare condition [37,38]. In other cases, a balanced and representative collection of data that reflects the population is sometimes problematic or even impossible. One major cause of this problem in social domains is the well-documented self-selection bias, where only a non-representative group of individuals select themselves into the group [39]. Convenience sampling is another reason for the over-representation of some samples and the under-representation of others [40]. Similarly, the cost-effectiveness of data collection can contribute to the emergence of majority and minority cases, when minority instances are expensive (be it time- or financial cost-wise) to obtain.

Furthermore, machine learning models require an appropriate representation of all types of instances to be able to extract the signal and not confuse it with noise [41]. Data sampling [42–44] and cost-sensitive instance weighting [45] have already been tackled with meta-heuristic methods, evolutionary algorithms, and nature-inspired methods.

In the **data sampling** task, we transform the original data set *X* into $X_S$, with a potentially different distribution of instances. Note that we can (1) *under-sample* the data set–only select the most relevant instances; (2) *over-sample* the data set–introduce copies or new instances to the original set *X*; or (3) *simultaneously under- and over-sample* the data set. The latter removes redundant instances and introduces new ones in the new data set $X_S$. The EvoPreprocess framework uses simultaneous under- and over-sampling, where instances can be removed and copies of existing instances can be introduced. The size $n'$ of the new set $X_S$ can differ from or equal the size *n* of the original set *X*, but the distribution of the instances in $X_S$ should be different from that of *X*. This is shown in Equation (2).

$$\begin{aligned} n &= |X| \\ n' &= |X_S| \\ n' &> 0 \\ X_S &\neq X \\ X_S \cap X &\neq \emptyset \end{aligned} \tag{2}$$
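The simultaneous under- and over-sampling of Equation (2) can be illustrated with an occurrence count per instance. A minimal NumPy sketch, with illustrative values that are not part of the framework:

```python
import numpy as np

# Toy data set with n = 5 instances
X = np.array([[1], [2], [3], [4], [5]])

# Occurrence counts: instance 0 is dropped, instance 3 is duplicated
occurrences = np.array([0, 1, 1, 2, 1])

# Simultaneous under- and over-sampling in one step
X_S = np.repeat(X, occurrences, axis=0)
```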

On the other hand, **data weighting** does not alter the original data set *X*, but introduces an importance factor for the instances in *X*, called weights. The greater the importance of the instance, the bigger the weight for that instance, and vice versa—the lesser the importance, the smaller the instance weight. Fitting the machine learning model on weighted instances is called *cost-sensitive* learning and can be utilized in several different machine learning models [46]. Let us denote the vector of weights as *W*, which consists of individual weights $w_1, w_2, \dots, w_n$, where $w_i \in \mathbb{R}^+$, as is presented in Equation (3).

$$\begin{aligned} W &= [w_1, w_2, \dots, w_n] \\ |W| &= |X| \end{aligned} \tag{3}$$
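Such a weight vector *W* can be passed to cost-sensitive learners through scikit-learn's `sample_weight` fitting parameter. A hedged sketch with toy data (the weights shown are arbitrary, chosen only to illustrate the mechanism):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data set: two classes, the first one under-represented
X = np.array([[0.0], [0.1], [1.0], [1.1], [1.2], [1.3]])
y = np.array([0, 0, 1, 1, 1, 1])

# W: one positive weight per instance; the minority class gets larger weights
W = np.array([3.0, 3.0, 1.0, 1.0, 1.0, 1.0])

# Cost-sensitive fitting: weighted instances count more during training
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y, sample_weight=W)
```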

While there are some rudimentary feature selection, data weighting, and data sampling methods in the Python machine learning framework scikit-learn, there is, again, a lack of evolutionary and nature-inspired methods, either included in this framework or compatible with it as independent open-source libraries. EvoPreprocess intends to fill this gap by providing a scikit-learn compatible toolkit in Python, which can be extended easily with custom nature-inspired algorithms.

#### *2.1. Nature-Inspired Preprocessing Optimization*

*Nature-inspired optimization algorithms* is a broad term for meta-heuristic optimization techniques inspired by nature, more specifically by biological systems, swarm intelligence, and physical and chemical systems [47,48]. In essence, these algorithms look for good-enough solutions to any optimization problem with the formulation [47,49] in Equation (4).

$$\begin{aligned} S_t &= [s_1, s_2, \dots, s_k] \\ S_{t+1} &= A\left(S_t; (p_1, p_2, \dots, p_h); (w_1, w_2, \dots, w_g)\right) \end{aligned} \tag{4}$$

The set $S_t$ is a set of solutions $[s_1, s_2, \dots, s_k]$ in the iteration *t*. The next iteration of solutions $S_{t+1}$ is generated using an algorithm *A* in accordance with the solution set in the previous iteration $S_t$, the parameters $(p_1, p_2, \dots, p_h)$ of the algorithm *A*, and random variables $(w_1, w_2, \dots, w_g)$.
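The iterative scheme of Equation (4) can be sketched as a generic loop; the `optimize` function and the random-jitter step below are illustrative stand-ins, not the framework's implementation:

```python
import random

def optimize(step, solutions, fitness, iterations=100):
    """Generic skeleton of Equation (4): S_{t+1} = A(S_t; parameters; random values)."""
    best = min(solutions, key=fitness)
    for _ in range(iterations):
        solutions = step(solutions)              # operators of algorithm A
        candidate = min(solutions, key=fitness)  # best solution this iteration
        if fitness(candidate) < fitness(best):
            best = candidate                     # keep the best solution found so far
    return best

# Illustration: a trivial random-jitter "algorithm" minimizing (x - 3)^2
rng = random.Random(42)
initial = [rng.uniform(-10, 10) for _ in range(10)]
step = lambda S: [x + rng.uniform(-0.5, 0.5) for x in S]
result = optimize(step, initial, lambda x: (x - 3) ** 2)
```

Since the best solution found so far is always kept, the returned fitness can never be worse than that of the best initial solution.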

These algorithms have already been successfully applied to the preprocessing tasks [11,12,25], proving the validity and the efficacy of these methods.

#### 2.1.1. Solution Encoding for Preprocessing Tasks

The basis of any optimization method is the solution *s* to the problem, which is encoded in such a way that change operators can be applied to it to minimize the distance between the solution *s* and the optimal solution. A broad set of nature-inspired algorithms work with an encoding of one solution in the form of an array of values. If the values in the array are continuous numbers, we are dealing with a continuous optimization problem. Alternatively, we have a discrete optimization problem when the values in the array are discrete. Even though some problems need a discrete solution, one can use continuous optimization techniques whose solutions can be mapped to the discrete search space [50]. Figure 1 shows the common solution encoding *s* for feature selection, data sampling, and data weighting and the transformation from a continuous search space to a discrete one, where this is applicable (data sampling and feature selection).

**Figure 1.** Encoding and decoding of solutions for data preprocessing tasks.

As Figure 1 shows, the encoding of *data sampling* with discrete values is straightforward: one value in the encoding array corresponds to one instance from the original data set *X*, and the scalar value represents the number of occurrences in the sampled data set $X_S$. When using continuous optimization, one can use a mapping function *m*, which splits the continuous search space into bins, each with its corresponding discrete value. Note that the discrete occurrence values can be any non-negative integer, while the values in the solution encoding can be any non-negative real number. This is reflected in Equation (5).

$$\begin{aligned} s_S &= [e_1, e_2, \dots, e_k] \\ e_i &\in \mathbb{R}_{\ge 0} \\ G_S &= m(s_S) \\ m &: \mathbb{R}_{\ge 0} \to \mathbb{N}_0 \\ G_S &= [g_1, g_2, \dots, g_k] \\ g_i &\in \mathbb{N}_0 \end{aligned} \tag{5}$$
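The mapping *m* of Equation (5) can be implemented by binning the continuous values; a minimal sketch with arbitrary bin edges (`numpy.digitize` assigns each value the index of its bin):

```python
import numpy as np

# A continuous solution encoding s_S with values in [0, 1)
s_S = np.array([0.05, 0.40, 0.75, 0.99])

# Mapping m: values below 0.33 map to 0 occurrences, below 0.66 to 1, the rest to 2
edges = np.array([0.33, 0.66])
G_S = np.digitize(s_S, edges)  # discrete occurrence counts g_i
```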

The encoding for *data weighting* is even more straightforward. Again, each value in the array corresponds to one instance in the data set *X*, and the scalar values represent the actual weights *W* of the instances. There is no need for the mapping function, as long as we limit the interval of allowed values for the scalars $e_i$ in the solution to $[0, max\_weight]$, as shown in Equation (6). Some implementations of machine learning algorithms accept only weights up to 1, thus limiting the search space to [0, 1]; others have no such limit, broadening the search space to [0, ∞).

$$\begin{aligned} s_W &= [e_1, e_2, \dots, e_k] \\ e_i &\in \mathbb{R}_{\ge 0} \end{aligned} \tag{6}$$

Encoding for *feature selection* is in the form of an array of binary values: the feature is either present (value 1) in or absent (value 0) from the changed data set $X_{FS}$ (see Equation (7)). Using the continuous solution encoding, one should again use the mapping function *m*, which splits the search space into two bins (with arbitrary limits), one for the feature being present and one for it being absent.

$$\begin{aligned} s_{FS} &= [e_1, e_2, \dots, e_k] \\ e_i &\in \mathbb{R}_{\ge 0} \\ G_{FS} &= m(s_{FS}) \\ m &: \mathbb{R}_{\ge 0} \to \{0, 1\} \\ G_{FS} &= [g_1, g_2, \dots, g_k] \\ g_i &\in \{0, 1\} \end{aligned} \tag{7}$$

2.1.2. Self-Adaptive Solutions

The presented solution encoding shows that there are different options for mapping the encoding to the actual data set. These mapping settings are the following:


In general, these values can all be set arbitrarily, but they could also be one of the objectives of the optimization process itself. The values of these parameters can seriously influence the quality of the results in preprocessing tasks [51–53], as they can guide the evolution of the optimization towards the global optimum rather than a local one. The implementation of the preprocessing tasks in EvoPreprocess uses this self-adaptive approach with additional genes in the genotype [54].

In the *data weighting* task, there is one additional gene, which corresponds to the maximum weight that can be assigned. All of the genes corresponding to the weights are normalized to this maximum value, while the minimum is left at 0. When using an imbalanced data set, it is preferable that bigger differences in weights are possible.

In *data sampling*, the mapping of the interval [0, 1] to the instance occurrence count is set. Here, $n_{max}$ setting genes are added to the genotype, where $n_{max}$ is the maximum number of occurrences of an individual instance after it is over-sampled. Each setting gene represents the size of the mapping interval of an individual occurrence count. Figure 2 presents the process of splitting the encoded genome into the sampling and the self-adaptation parts and mapping it to the solution. Note that the mapping intervals in Figure 2 are just an example of one such self-adaptation and are not the final values used in all mappings. These values differ between solutions and are also data set dependent.

**Figure 2.** The self-adaptation with mapping genes in the encoding for data sampling task. The mapping values presented in this example are determined by the genotype and are not fixed, but adapt during the evolution process.

Some heavily imbalanced data sets could be better analysed if the emphasis is on under-sampling: the interval for 0 occurrences (the absence of the instance) would become bigger in the process. Other data sets could be better served by more minority instances, and therefore the intervals for many occurrences become bigger within the optimization process.
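The self-adaptive decoding described above can be sketched as follows; the genome split and the interval widths are illustrative assumptions, not the exact values used in the framework:

```python
import numpy as np

def decode_sampling(genome, n_instances):
    """Split the genome into sampling genes and trailing setting genes.

    The setting genes (hypothetical layout) encode the relative widths of
    the mapping intervals for occurrence counts 0, 1, ..., n_max - 1."""
    sampling = genome[:n_instances]
    settings = genome[n_instances:]
    widths = settings / settings.sum()   # normalize widths so they span [0, 1]
    edges = np.cumsum(widths)[:-1]       # interval borders inside [0, 1]
    return np.digitize(sampling, edges)  # occurrence count for each instance

genome = np.array([0.10, 0.50, 0.95, 0.40,   # sampling part: 4 instances
                   0.30, 0.30, 0.40])        # setting part: widths for counts 0, 1, 2
occurrences = decode_sampling(genome, n_instances=4)
```

Because the interval widths are themselves part of the genome, the optimization can shift the balance between under- and over-sampling during evolution.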

With *feature selection*, only one setting gene is added to the genotype. This gene represents the size of the first interval, which maps other genes to the absence of the feature. Wider data sets with more features could be analysed with better results if more features are removed; therefore, a larger number in the setting gene is preferred when more emphasis is to be given to the shrinkage of the data set, as more features are left unselected. Again, the mapping intervals in Figure 3 are just an example for that particular mapping gene in the encoded genotype.

**Figure 3.** The self-adaptation with a mapping gene in the encoding for the feature selection task. The mapping values presented in this example are determined by the genotype and are not fixed, but adapt during the evolution process.

The self-adaptation is implemented in the framework but could be removed in the extension or customization of the individual optimization process. It is included in the framework as it is an integral part of the optimization process in recent state-of-the-art papers [53,55–57] on data preprocessing with nature-inspired methods.

#### 2.1.3. Optimization Process for Preprocessing Tasks

All nature-inspired methods run in iterations during which the optimization process is applied. The broad overview of the nature-inspired optimization algorithm is shown in Figure 4 and is based on the rudimentary framework for nature-inspired optimization of feature selection by [25].

**Figure 4.** The optimization process of preprocessing data with nature-inspired optimization methods in EvoPreprocess.

The optimization process starts with the initialization of solutions, which is either random or based on some heuristic. After that, the iteration loop starts. First, every solution is evaluated, which will be discussed later in this section. Next, the optimization operators are applied to the solutions; these are specific to each nature-inspired algorithm. The general goal of the operators is to select, repair, change, combine, improve, or mutate solutions in the direction of a perceived (either local or, hopefully, global) optimum. The iteration loop is stopped when a predefined ending criterion is reached, be it the maximum number of iterations, the maximum number of iterations during which the solutions stagnate, a target solution quality, or others.

Nature-inspired algorithms strive to keep good solutions and discard the bad ones. Here, the method of evaluating solutions plays an important role. One should define one or more objectives that solutions should meet, and the evaluation function grades the quality of the solution based on these objectives. As solutions are evaluated in every iteration, this is usually the most computationally time-consuming of all steps in the optimization process.

When dealing with the optimization of the data set for the machine learning process, multiple objectives have been considered in the literature [25,58–61]. Usually, one of the most important objectives is the quality of the model fitted on the given data set. For classification tasks, the standard classification metrics can be used: accuracy, error rate, F-measure, G-measure, or area under the ROC curve (AUC). For regression tasks, the following regression metrics can be used: mean squared error, mean absolute error, or explained variance. By default, the EvoPreprocess framework uses single-objective optimization with either the error rate or F-score for classification tasks and the mean squared error for regression tasks. As the later sections will show, the framework is easily extendable to multiple objectives, be it the size of the data set or others.

It is important to note that not all researchers consider the problem of data leakage. *Data leakage* occurs when information from outside the training data set is used in the fitting of the machine learning model. This manifests in over-fitted models that perform exceptionally well on the training set but poorly on the hold-out testing set. A common mistake that leads to data leakage is not using a validation set when optimizing the model fitting. Applying the same logic to nature-inspired preprocessing, data leakage occurs when the same data are used in the evaluation process and the final testing process. The EvoPreprocess framework automatically holds out separate validation sets and thus prevents data leakage.
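A leakage-free fitness evaluation can be sketched as follows: the candidate solution is scored on a held-out validation split, so the final test data never influence the search. The function and the toy data are illustrative, not the framework's actual implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def evaluate_solution(X, y, selected_features):
    """Fitness of a feature-selection solution, scored on a held-out
    validation split so that the final test set never enters the search."""
    X_sel = X[:, selected_features]
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_sel, y, test_size=0.3, random_state=0, stratify=y)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return 1.0 - f1_score(y_val, model.predict(X_val))  # lower is better

# Toy data: 40 instances, 5 features, only feature 0 is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)
fitness = evaluate_solution(X, y, np.array([True, False, True, False, False]))
```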

#### *2.2. Nature-Inspired Algorithms*

A large number of different nature-inspired algorithms have been proposed in recent years. Recent meta-heuristic research on nature-inspired algorithms was reviewed by Lones [62] in 2020, who concluded that most recent innovations are usually small variations of already existing optimization operators. Still, there are some novel algorithms worth further investigation: polar bear optimization algorithm [63], bison algorithm optimization [64], butterfly optimization [65], cheetah based optimization algorithm [66], coyote optimization algorithm [67] and squirrel search algorithm [68]. Due to the number of nature-inspired algorithms, a broad overview is beyond the scope of this paper. Consequently, this section provides the formulation of the well-established and most used nature-inspired algorithms.

First, the *Genetic algorithm* (GA) is one of the cornerstones of nature-inspired algorithms, presented by Sampson [69] in 1976, but research efforts on its variations and novel applications are still numerous. The basic operators here are the selection of solutions (called individuals), which are then used in the crossover (the mixing of genotypes) that forms new individuals, which in turn have a chance to go through a mutation procedure (random changing of the genotype). The crossover is an exploitation operator, where the solutions are varied to find their best variants. On the other hand, the mutation is a prime example of an exploration operator, which prevents the optimization from getting stuck in a local optimum. Each iteration is called a generation, and generations repeat until predefined criteria are met, be it in the form of a maximum number of generations or a stagnation limit. Most of the following optimization algorithms use a variation of the presented operators.

Next, *Differential evolution* (DE) [70] includes the same operators as GA (selection, crossover and mutation), but multiple solutions (called agents) are combined differently. In its basic form, the crossover of three agents is not done with the simple mixing of genes (like in GA); instead, the calculation from Equation (8) is used for each gene. Here $x_{parent1}$, $x_{parent2}$ and $x_{parent3}$ are gene values from the three parents, $x_{new}$ is the new value of the gene, and *F* is the differential weight parameter.

$$x_{new} = x_{parent1} + F \cdot \left(x_{parent2} - x_{parent3}\right) \tag{8}$$
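Equation (8) applied gene-wise over NumPy arrays (a minimal sketch, not tied to any particular DE implementation):

```python
import numpy as np

def de_crossover(parent1, parent2, parent3, F=0.8):
    """Gene-wise differential combination of three parents, Equation (8)."""
    return parent1 + F * (parent2 - parent3)

x_new = de_crossover(np.array([1.0, 2.0]),
                     np.array([3.0, 1.0]),
                     np.array([2.0, 0.0]))
```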

*Evolution strategy* (ES) [71] is an optimization algorithm with operators similar to GA, but the emphasis is on the selection and mutation of individuals. The most common variants are the following: (*μ*/*ρ*, *λ*), where only new solutions form the next generation, and (*μ*/*ρ* + *λ*), where the old solutions compete with the new ones. The parameter *μ* denotes the number of selected solutions for the crossover and mutation, from which *ρ* parents are derived. The value *λ* denotes the size of the generation.

A variation of Evolution strategy is the *Harmony search* (HS) optimization algorithm [72], which mimics the improvisation of a musician. The main search operator is the pitch adjustment, which adds random noise to existing solutions (called harmonies) or creates a new random harmony. The creation of a new harmony is shown in Equation (9), where $x_{old}$ denotes the old gene from the old genotype, $x_{new}$ is the new gene, $b_{range}$ denotes the range of maximal improvisation (change of the solution), and *ε* is a random number in the interval [−1, 1].

$$x_{new} = x_{old} + b_{range} \cdot \varepsilon \tag{9}$$

The next group of nature-inspired algorithms mimics the behaviour of swarms; these are called *swarm optimization* algorithms. *Particle swarm optimization* (PSO) [73] is a prime example. Here, each solution (called a particle) is supplemented with its velocity. This velocity represents the change from the particle's current position (genotype encoding) in the new iteration. After each iteration, the velocities are recalculated to direct the particle towards the best position. The moving of particles in the search space represents both the exploitation (moving around the best solution) and the exploration (moving towards the best solution) parts of the optimization. The iterative moving of particles stops when one of the following criteria is satisfied: the maximum number of iterations is reached, the stagnation limit is reached, or the particles converge to one best position.

One variation of PSO is the *Artificial bee colony* (ABC) algorithm [74], which imitates the foraging process of a honey bee swarm. This optimization algorithm consists of three operators which modify existing solutions. First, the employed bees use local search to exploit already existing solutions. Next, the onlooker bees, which serve as a selection operator, search for new sources in the vicinity of existing ones (exploitation). Last, the scout bees are used for exploration, where they use random search to find new food sources (new solutions).

Next, the *Bat algorithm* (BA) [75] is a variant of PSO which imitates swarms of microbats: every solution (called a bat) still has a velocity, but also emits pulses with varying levels of loudness and frequency. The velocity of a bat changes in accordance with the pulses from other bats, and the pulses are determined by the quality of the solution. Equation (10) shows the procedure for updating the genes of an individual bat. Here $x_{old}$ and $x_{new}$ denote the old and the new genes respectively, $v_{old}$ and $v_{new}$ are the old and the new velocities of the bat, and $f_i$, $f_{min}$ and $f_{max}$ are the current, minimal and maximal frequencies. *β* is a random number in the interval [0, 1].

$$\begin{aligned} f_i &= f_{min} + (f_{max} - f_{min}) \cdot \beta \\ v_{new} &= v_{old} + (x_{old} - x_{best}) \cdot f_i \\ x_{new} &= x_{old} + v_{new} \end{aligned} \tag{10}$$
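The update rules of Equation (10) for a single bat can be sketched directly (the frequency bounds below are illustrative):

```python
import numpy as np

def bat_update(x_old, v_old, x_best, beta, f_min=0.0, f_max=2.0):
    """One velocity and position update of a single bat, Equation (10)."""
    f_i = f_min + (f_max - f_min) * beta    # pulse frequency
    v_new = v_old + (x_old - x_best) * f_i  # new velocity
    x_new = x_old + v_new                   # new position
    return x_new, v_new

x_new, v_new = bat_update(np.array([1.0]), np.array([0.1]),
                          np.array([0.5]), beta=0.5)
```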

Finally, one of the widely used swarm algorithms is *Cuckoo search* (CS) [76], which imitates the laying of eggs in foreign nests by cuckoo birds. The solutions are eggs in the nests; these are repositioned to new nests every iteration (exploitation), while the worst ones are abandoned. The modification of the solutions is done with an optimization operator called the Lévy flight, which takes the form of long random flights (exploration) or short random flights (exploitation). The migration of the eggs with the Lévy flight is shown in Equation (11), where *α* denotes the size of the maximal flight and *Levy*(*λ*) represents a random number from the Lévy distribution.

$$x_{new} = x_{old} + \alpha \cdot Levy(\lambda) \tag{11}$$
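Equation (11) can be sketched with a simple heavy-tailed, power-law step standing in for the Lévy-distributed random number (a common simplification; full implementations often use Mantegna's algorithm instead):

```python
import numpy as np

def levy_step(rng, lam=1.5):
    """Heavy-tailed random step; a simple power-law stand-in for Levy(lambda)."""
    u = rng.uniform(1e-9, 1.0)
    sign = rng.choice([-1.0, 1.0])
    return sign * u ** (-1.0 / lam)  # magnitude >= 1, occasionally very large

def cuckoo_move(x_old, rng, alpha=0.01, lam=1.5):
    """One Levy-flight migration of a solution, as in Equation (11)."""
    return x_old + alpha * levy_step(rng, lam)

rng = np.random.default_rng(0)
x_new = cuckoo_move(1.0, rng)
```

The rare long steps of the heavy-tailed distribution give the exploration behaviour, while the frequent short steps provide exploitation around the current solution.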

#### *2.3. Computation Complexity*

In general, most nature-inspired algorithms have a time complexity of $O(m \cdot p \cdot C_{operators} + p \cdot C_{fitness\,eval})$, where *m* is the size of the solution, *p* is the number of modified and evaluated solutions during the whole process, $C_{operators}$ is the complexity of the operators (i.e., crossover, recombination, mutation, selection, random jumps, etc.), and $C_{fitness\,eval}$ is the complexity of the evaluation.

The evaluation of solutions is the most computationally expensive part. One of the classification/regression algorithms must be used in order to (1) build the model, (2) make the predictions, and (3) evaluate the predictions. This part relies heavily on the chosen solution evaluator (the classification algorithm used). The evaluation of the predictions has a fixed *O*(*n*) complexity, dependent on the *n* samples in the data set. Training the models and making predictions vary: the decision tree with $O_{training}(n^2 p)$ and $O_{prediction}(p)$, linear and logistic regression with $O_{training}(p^2 n + p^3)$ and $O_{prediction}(p)$, and naive Bayes with $O_{training}(np)$ and $O_{prediction}(p)$. Some of the ensemble methods have an additional factor of the number of models in the ensemble (i.e., Random forest and AdaBoost), and neural networks depend on the network architecture. Usually, the more complex the training process, the more complex the patterns that can be extracted from the data and, consequently, the better the predictions.

If the evaluator and its use are fixed (as in the experimental part of the paper), optimization time varies in relation to the nature-inspired optimization algorithms used. Considering only the array-style solution encoding, the basic exploration and exploitation operators (i.e., crossover, recombination, mutation and random jumps) have *O*(*m* ∗ *r* ∗ *Coperation*) complexity, where *m* is the size of the encoded solution, *r* is the number of solutions used in the operator, and *Coperation* is the complexity of the operation (i.e., linear combination, sum, distance calculation...). As the size *m* of the solution is usually fixed, researchers must optimize other aspects of the algorithms to achieve shorter computation times. Thus, the fastest optimization algorithms are the ones with the fewest operators, with the fewest solutions participating in each operator, or with the least complex operators. For example, the genetic algorithm has multiple operators: crossover, mutation, elitism and selection (i.e., tournament selection, where fitness values are compared); although each of them is relatively simple, their number makes it one of the more time-consuming algorithms.

Furthermore, algorithms whose operators can be vectorized can take advantage of computational speed-ups when low-level massive computation calls can be used. This is especially prevalent in swarm algorithms, where calculating distances or velocities and then moving the solutions can be done with matrix operations for the whole solution set at once, instead of individually for every solution. If the speed of the data preprocessing is of the essence, this should be taken into account in the implementation of the optimization algorithms. The EvoPreprocess framework already provides parallel runs of the optimization process on the data set folds, but its speed is still reliant on the nature-inspired algorithms and evaluators used in the process.
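The speed-up from vectorization can be illustrated with a minimal NumPy sketch: a swarm-style position update written once as a per-solution loop and once as a single broadcasted matrix expression over the whole solution set. The update rule here is a simplified stand-in, not one of the algorithms discussed above; both forms produce identical results.

```python
import numpy as np

rng = np.random.default_rng(0)
pop, dim = 50, 10
positions = rng.random((pop, dim))    # the whole solution set as one matrix
velocities = rng.random((pop, dim))
best = positions[0]                   # some current best solution

# Loop version: move one solution at a time
moved_loop = np.empty_like(positions)
for i in range(pop):
    moved_loop[i] = positions[i] + velocities[i] + 0.5 * (best - positions[i])

# Vectorized version: one broadcasted expression over the entire swarm,
# which maps to a few low-level bulk array operations
moved_vec = positions + velocities + 0.5 * (best - positions)

assert np.allclose(moved_loop, moved_vec)
```

The vectorized form delegates the inner loop to optimized array routines, which is where the speed-up described above comes from.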

#### **3. EvoPreprocess Framework**

The present section provides a detailed architecture description of the EvoPreprocess framework, which is meant to serve as a basis for further third party additions and extensions.

The EvoPreprocess framework includes three main modules:

• data\_sampling for simultaneous over- and under-sampling of data instances,
• data\_weighting for optimizing the weights of data instances, and
• feature\_selection for selecting a subset of features.

Each of the modules contains two files: one with the **task** class, which implements the preprocessing task itself (EvoSampling, EvoWeighting and EvoFeatureSelection), and one with the **benchmark** class, which is used in the evaluation of candidate solutions (SamplingBenchmark, WeightingBenchmark and FeatureSelectionBenchmark).

Figure 5 shows a UML class diagram for the EvoPreprocess framework and its relation to the Python packages scikit-learn, imbalanced-learn and NiaPy. Custom implementations of benchmark classes are in the classes CustomSamplingBenchmark, CustomWeightingBenchmark and CustomFeatureSelectionBenchmark, which are denoted with a dark background color.

**Figure 5.** UML class diagram of the EvoPreprocess framework.

#### *3.1. Task Classes*

The task class for data sampling is EvoSampling, which extends the imbalanced-learn class for sampling data BaseOverSampler. The task class for feature selection, EvoFeatureSelection, extends the \_BaseFilter class from scikit-learn. The main task class for data weighting does not use any parent class. All three task classes are initialized with the following parameters:

• evaluator: the scikit-learn compatible classifier or regressor used to evaluate candidate solutions,
• optimizer: the NiaPy nature-inspired optimization algorithm to be used,
• n\_folds: the number of folds for the internal train-validation split,
• n\_runs: the number of independent optimization runs on each fold,
• n\_jobs: the number of runs executed in parallel,
• random\_seed: the seed for reproducible results,
• benchmark: the benchmark class used in the evaluation of solutions, and
• optimizer\_settings: a dictionary of custom settings for the optimizer.

The base optimization procedure pseudo-code is demonstrated in Algorithm 1, where the train-validation split (lines 3 and 4) and multiple parallel runs (for loop in lines 6, 7 and 8) are shown. Note that the *XT* data set given to the procedure should already be the training set and not the whole data set *X*. The further splitting of *XT* into training and validation data sets ensures that data leakage does not happen. To ensure that random splitting into training and validation sets does not produce split-dependent results, stratified *k*-fold splitting is applied (into n\_folds folds). As nature-inspired optimization methods are non-deterministic, the optimization is done multiple times (the n\_runs parameter) and is run in parallel, with every optimization on a separate CPU core. The reduction (aggregation) of the results into one final result is done in different ways, depending on the preprocessing task.
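The split-and-run structure described above can be sketched with scikit-learn's StratifiedKFold. The optimize\_fold function below is a hypothetical stand-in for one nature-inspired optimization run, and the runs are executed sequentially here for brevity, whereas EvoPreprocess launches them in parallel on separate CPU cores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

def optimize_fold(train_idx, valid_idx, run):
    """Stand-in for one nature-inspired optimization run (hypothetical).

    A real run would evolve candidate solutions on the training fold and
    score them on the validation fold; here we only record the run setup.
    """
    return {"run": run, "n_train": len(train_idx), "n_valid": len(valid_idx)}

X, y = load_breast_cancer(return_X_y=True)
n_folds, n_runs = 3, 2
results = []
# Stratified splitting keeps the class distribution in every fold,
# so results do not depend on one lucky random split
for train_idx, valid_idx in StratifiedKFold(n_splits=n_folds, shuffle=True,
                                            random_state=1).split(X, y):
    for run in range(n_runs):  # repeated runs tame non-determinism
        results.append(optimize_fold(train_idx, valid_idx, run))

# Reduction step: the n_folds * n_runs partial results are aggregated
# into one final result, in a task-dependent way
assert len(results) == n_folds * n_runs
```

The aggregation at the end is where the task-specific reduction (e.g., averaging weights, or voting on selected features) would take place.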


The preprocessing procedure from Algorithm 1 is called in a different way in every task, which is a consequence of the different inheritance of every task: data sampling inherits from the imbalanced-learn BaseOverSampler, feature selection inherits from the scikit-learn \_BaseFilter, and data weighting does not inherit from any class.

All three task classes contain the following functions.



In addition to those functions, the class EvoSampling also contains the following function, which is used for sampling the data.

• \_fit\_resample private function, which gets the data to be sampled X and the corresponding target values y. This is the implementation of the abstract function from the imbalanced-learn package; it is called from the public fit\_resample function of the BaseOverSampler class and provides the possibility of inclusion in imbalanced-learn pipelines. It contains the main logic of the sampling process. The function returns a tuple with two values: X\_S, which is the sampled X, and y\_S, which is the sampled y.

The class EvoWeighting also contains the following function.

• reweight, which gets the data to be reweighted X and the corresponding target values y. This function returns the array weights with weights for every instance in X.

The class EvoFeatureSelection contains the following functions.


#### *3.2. Evaluation of Solutions with Benchmark Classes*

The second part of the classes in all the modules are the **benchmark** classes, which are helper classes meant to be used in the evaluation of the sampling, weighting or feature selection tasks. The implementation of a custom fitness evaluation function should be done by replacing or extending these classes. The benchmark classes are initialized with the data X and the target values y, the train\_indices and valid\_indices of the train-validation split, the evaluator used to build and assess the model, and the random\_seed.

The benchmark classes must all provide one function: a function that returns the evaluation function of the task. This architecture is in accordance with the NiaPy benchmark classes for its optimization methods and was used in EvoPreprocess for compatibility reasons.
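This contract can be illustrated with a minimal, NiaPy-style benchmark class. The attribute and method names (Lower, Upper, function) follow the benchmark convention of the NiaPy version used at the time of the paper, and the sphere fitness is a placeholder: a real EvoPreprocess benchmark would train the evaluator on the training fold and return, e.g., the error rate on the validation fold.

```python
class SketchBenchmark:
    """Sketch of a NiaPy-style benchmark: solution bounds plus a factory
    method that returns the actual fitness function (hypothetical example,
    not one of the shipped EvoPreprocess benchmark classes)."""

    def __init__(self):
        self.Lower = 0.0   # lower bound of the encoded solution values
        self.Upper = 1.0   # upper bound of the encoded solution values

    def function(self):
        # The returned callable receives the problem dimension and the
        # encoded solution, and returns the fitness to be minimized.
        def evaluate(D, solution):
            # Placeholder fitness: the sphere function over D components
            return sum(x ** 2 for x in solution[:D])
        return evaluate

fitness = SketchBenchmark().function()
print(fitness(3, [0.0, 0.0, 0.0]))  # -> 0.0
```

Replacing the body of evaluate with a model-based score is all that is needed to plug a custom evaluation into the optimization loop.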

Three benchmark classes are provided, each for its task to be evaluated:

• SamplingBenchmark for the data sampling task,
• WeightingBenchmark for the data weighting task, and
• FeatureSelectionBenchmark for the feature selection task.

All provided benchmark classes evaluate the task in the same way: the evaluator is trained on the training set selected with train\_indices and evaluated on the validation set selected with valid\_indices. If the provided evaluator is a classifier, the sampled, weighted or feature-selected data are evaluated with the *error rate*, which is defined in Equation (12) and represents the ratio of misclassified instances [77].

$$Error\ rate = \frac{1}{n} \sum_{i=1}^{n} z_i, \qquad z_i = \begin{cases} 0 & \text{if } y_i = \hat{y}_i \\ 1 & \text{if } y_i \neq \hat{y}_i \end{cases} \tag{12}$$

The *F-score* can also be used to evaluate the classification quality, where the balance between precision and recall is considered [77]. As the optimization strives to minimize the solution values, 1 − *F-score* is used in that case. If the evaluator is a regressor, the new data set is evaluated with the *mean squared error*, presented in Equation (13), which calculates the average of the squared errors in predicting the outcome variable [77].

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2. \tag{13}$$
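Both evaluation measures translate directly into NumPy; a minimal sketch of Equations (12) and (13):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Equation (12): the share of misclassified instances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def mse(y_true, y_pred):
    """Equation (13): the mean of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(error_rate([1, 0, 1, 1], [1, 1, 1, 1]))  # -> 0.25
print(mse([3.0, 2.0], [2.0, 2.0]))             # -> 0.5
```

Since both measures decrease as predictions improve, they can be minimized directly, which is why 1 − F-score (rather than the F-score itself) is used when that measure is chosen.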

#### **4. Examples of Use**

This section presents examples of using EvoPreprocess in all three supported data preprocessing tasks: first, the problem of data sampling, next, data weighting, and finally, the feature selection problem. Before using the framework, the EvoPreprocess package and the Python packages it builds upon must be installed: NiaPy, scikit-learn, imbalanced-learn, NumPy and pandas.

#### *4.1. Data Sampling*

One of the problems addressed by EvoPreprocess is data sampling; or, more specifically, simultaneous under- and over-sampling of data with the class EvoSampling in the module data\_sampling and its function fit\_resample(). This function returns two arrays: a two-dimensional array of sampled instances (rows) and features (columns), and a one-dimensional array of the target values of the corresponding instances from the first array. The code below shows the basic use of the framework for resampling data for a classification problem, with default parameter values.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_sampling import EvoSampling
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape, len(dataset.target))
(569, 30) 569
>>> X_resampled, y_resampled = EvoSampling().fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(341, 30) 341
```
The results of the run show that there were 569 instances in the data set before sampling and 341 instances after the sampling, which means that more instances were removed from the data set than added.

The following code shows the usage of data sampling on a data set for a regression problem. The code also demonstrates the setting of parameter values: a non-default optimizer (evolution strategy), 5 folds for validation, 5 individual runs on each fold split and 4 parallel executions of individual runs.

```
>>> from sklearn.datasets import load_boston
>>> from sklearn.tree import DecisionTreeRegressor
>>> import NiaPy.algorithms.basic as nia
>>> from EvoPreprocess.data_sampling import EvoSampling, SamplingBenchmark
>>>
>>> dataset = load_boston()
>>> print(dataset.data.shape, len(dataset.target))
(506, 13) 506
>>> X_resampled, y_resampled = EvoSampling(
        evaluator=DecisionTreeRegressor(),
        optimizer=nia.EvolutionStrategy,
        n_folds=5,
        n_runs=5,
        n_jobs=4,
        benchmark=SamplingBenchmark
    ).fit_resample(dataset.data, dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(703, 13) 703
```
Note that, in this case, the optimized resampled set is bigger (703 instances) than the original non-resampled data set (506 instances). This is a clear example of when more instances are added than removed from the set. If only under-sampling is preferred instead of simultaneous under- and over-sampling, the appropriate under-sampling benchmark class would be provided as the value for the benchmark parameter. In this example, a CART regression decision tree is used as the evaluator for the resampled data sets (DecisionTreeRegressor), but any scikit-learn regressor could take its place.

#### *4.2. Data Weighting*

Some scikit-learn models can handle weighted instances. The class EvoWeighting from the module data\_weighting optimizes the weights of individual instances with the call of the reweight() function. This function serves to find instance weights that can lead to better classification or regression results; it returns an array of real numbers, the weights of the instances in the same order as the instances in the given data set. The following code shows a basic example of reweighting and the resulting array of weights.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_weighting import EvoWeighting
>>>
>>> dataset = load_breast_cancer()
>>> instance_weights = EvoWeighting().reweight(dataset.data,
                                               dataset.target)
>>> print(instance_weights)
[1.568983893273244 1.2899430717992133 ... 0.7248390003761751]
```
The following code shows an example of combining the data weight optimization with the scikit-learn decision tree classifier, which supports weighted instances. The accuracies of both classifiers, the one fitted with unweighted data and the one built with weighted data, are output.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from EvoPreprocess.data_weighting import EvoWeighting
>>>
>>> random_seed = 1234
>>> dataset = load_breast_cancer()
>>> X_train, X_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target,
        test_size=0.33,
        random_state=random_seed)
>>> cls = DecisionTreeClassifier(random_state=random_seed)
>>> cls.fit(X_train, y_train)
>>>
>>> print(X_train.shape,
          accuracy_score(y_test, cls.predict(X_test)),
          sep=': ')
(381, 30): 0.8936170212765957
>>> instance_weights = EvoWeighting(random_seed=random_seed).reweight(X_train,
                                                                      y_train)
>>> cls.fit(X_train, y_train, sample_weight=instance_weights)
>>> print(X_train.shape,
          accuracy_score(y_test, cls.predict(X_test)),
          sep=': ')
(381, 30): 0.9042553191489362
```
The example shows that the number of instances stays at 381 and the number of features stays at 30, but the accuracy rises from 89.36% with unweighted data to 90.43% with weighted instances.

#### *4.3. Feature Selection*

Another task performed by the EvoPreprocess framework is feature selection. This is done with nature-inspired algorithms from the NiaPy framework and can be used independently, or as one of the steps in a scikit-learn pipeline. Feature selection is used by constructing the EvoFeatureSelection class from the module feature\_selection and calling the function fit\_transform(), which returns the new data set with only the selected features.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape)
(569, 30)
>>> X_new = EvoFeatureSelection().fit_transform(dataset.data,
                                                dataset.target)
>>> print(X_new.shape)
(569, 17)
```
The results of the example demonstrate that the original data set contains 30 features, while the modified data set after the feature selection contains only 17 features.

Again, numerous settings can be changed: the nature-inspired algorithm used for the optimization process, the classifier/regressor used for the evaluation, the number of folds for validation, the number of repeated runs for each fold, the number of parallel runs and the random seed. The following code shows an example of combining EvoFeatureSelection with a regressor from scikit-learn and the evaluation of the quality of both approaches: the regressor built with the original data set, and the regressor built with the modified data set with only some features selected.

```
>>> from sklearn.datasets import load_boston
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeRegressor
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> random_seed = 654
>>> dataset = load_boston()
>>> X_train, X_test, y_train, y_test = train_test_split(
        dataset.data,
        dataset.target,
        test_size=0.33,
        random_state=random_seed)
>>> model = DecisionTreeRegressor(random_state=random_seed)
>>> model.fit(X_train, y_train)
>>> print(X_train.shape,
          mean_squared_error(y_test, model.predict(X_test)),
          sep=': ')
(339, 13): 24.475748502994012
>>> evo = EvoFeatureSelection(evaluator=model, random_seed=random_seed)
>>> X_train_new = evo.fit_transform(X_train, y_train)
>>>
>>> model.fit(X_train_new, y_train)
>>> X_test_new = evo.transform(X_test)
>>> print(X_train_new.shape,
          mean_squared_error(y_test, model.predict(X_test_new)),
          sep=': ')
(339, 6): 18.03443113772455
```
The results show that the decision tree regressor fitted on the new data set, with only 6 features selected, outperforms the decision tree regressor built with the original data set with all 13 features: an MSE of 18.03 for the regressor with the feature-selected data set vs. an MSE of 24.48 for the regressor with the original data set.

#### *4.4. Compatibility and Extendability*

The compatibility with existing well-established data analysis and machine learning libraries is one of the main features of EvoPreprocess. For this reason, all the modules accept the data in the form of the extensively used NumPy arrays [78] and pandas DataFrames [79]. The following examples demonstrate the further compatibility of EvoPreprocess with the scikit-learn and imbalanced-learn Python machine learning packages.

scikit-learn already includes various feature selection methods and supports their usage in the provided machine learning pipelines. EvoFeatureSelection extends scikit-learn's feature selection base class, so it can be included in a pipeline, as the following code demonstrates.

```
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.datasets import load_boston
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeRegressor
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>>
>>> random_seed = 987
>>> dataset = load_boston()
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
        dataset.data,
        dataset.target,
        test_size=0.33,
        random_state=random_seed)
>>> model = DecisionTreeRegressor(random_state=random_seed)
>>> model.fit(X_train, y_train)
>>> print(mean_squared_error(y_test, model.predict(X_test)))
20.227544910179642
>>> pipeline = Pipeline(steps=[
        ('feature_selection', EvoFeatureSelection(
            evaluator=LinearRegression(),
            n_folds=4,
            n_runs=8,
            random_seed=random_seed)),
        ('regressor', DecisionTreeRegressor(random_state=random_seed))
    ])
>>> pipeline.fit(X_train, y_train)
>>> print(mean_squared_error(y_test, pipeline.predict(X_test)))
19.073532934131734
```
As the example demonstrates, the pipeline with EvoFeatureSelection builds a better regressor than the model without the feature selection, as its MSE is 19.07 vs. the original regressor's MSE of 20.23. The example also shows that, in any of the preprocessing tasks, one can choose an evaluator that differs from the final classifier or regressor. Here, linear regression is chosen as the evaluator, but the decision tree regressor is used in the final model fitting. Using a different, lightweight evaluation model can be useful when the final decision models are computationally expensive to fit.

All three EvoPreprocess tasks are compatible and can be used in imbalanced-learn pipelines [17], as is demonstrated in the following code. Note that both tasks in the example are parallelized and the number of simultaneous runs on different CPU cores is set with the n\_jobs parameter (default value of None utilizes all cores available). The reproducibility of the results is guaranteed with setting the random\_seed parameter.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from imblearn.pipeline import Pipeline
>>> from EvoPreprocess.feature_selection import EvoFeatureSelection
>>> from EvoPreprocess.data_sampling import EvoSampling
>>>
>>> random_seed = 1111
>>> dataset = load_breast_cancer()
>>> X_train, X_test, y_train, y_test = train_test_split(
        dataset.data,
        dataset.target,
        test_size=0.33,
        random_state=random_seed)
>>>
>>> cls = DecisionTreeClassifier(random_state=random_seed)
>>> cls.fit(X_train, y_train)
>>> print(accuracy_score(y_test, cls.predict(X_test)))
0.8829787234042553
>>> pipeline = Pipeline(steps=[
        ('feature_selection', EvoFeatureSelection(n_folds=10,
                                                  random_seed=random_seed)),
        ('data_sampling', EvoSampling(n_folds=10,
                                      random_seed=random_seed)),
        ('classifier', DecisionTreeClassifier(random_state=random_seed))])
>>> pipeline.fit(X_train, y_train)
>>> print(accuracy_score(y_test, pipeline.predict(X_test)))
0.9148936170212766
```
The results show that the pipeline with both feature selection and data sampling results in a better classifier than the one without these preprocessing steps, as the accuracy of the pipeline is 91.49% vs. 88.30% for the classifier without the preprocessing steps.

Any of the EvoPreprocess tasks can be further tuned with custom settings of the optimizer (the nature-inspired algorithm used to optimize the task). This is done with the parameter optimizer\_settings, which takes the form of a Python dictionary. The code listing below shows one such example, where the bat algorithm [80] is used for the optimization process and samples a new data set with only 335 instances instead of the original 569. One should refer to the NiaPy package for the available nature-inspired optimization methods and their settings.

```
>>> from sklearn.datasets import load_breast_cancer
>>> from EvoPreprocess.data_sampling import EvoSampling
>>> import NiaPy.algorithms.basic as nia
>>>
>>> dataset = load_breast_cancer()
>>> print(dataset.data.shape, len(dataset.target))
(569, 30) 569
>>> settings = {'NP': 1000, 'A': 0.5, 'r': 0.5, 'Qmin': 0.0, 'Qmax': 2.0}
>>> X_resampled, y_resampled = EvoSampling(optimizer=nia.BatAlgorithm,
                                           optimizer_settings=settings
                                           ).fit_resample(dataset.data,
                                                          dataset.target)
>>> print(X_resampled.shape, len(y_resampled))
(335, 30) 335
```
As EvoPreprocess uses any NiaPy compatible continuous optimization algorithm, one can implement their own, or customize and extend the existing ones. The customization is also possible on the evaluation functions (i.e., favoring or limiting to under-sampling with more emphasis on a smaller data set than on the quality of the model). This can be done with the extension or replacement of benchmark classes (SamplingBenchmark, FeatureSelectionBenchmark and WeightingBenchmark). The following code listing shows the usage of the custom optimizer (random search) and custom benchmark function for data sampling (a mixture of both model quality and size of the data set).

```
>>> import numpy as np
>>> from NiaPy.algorithms import Algorithm
>>> from numpy import apply_along_axis, math
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.utils import safe_indexing
>>> from EvoPreprocess.data_sampling import EvoSampling
>>> from EvoPreprocess.data_sampling.SamplingBenchmark import SamplingBenchmark
>>>
>>> class RandomSearch(Algorithm):
        Name = ['RandomSearch', 'RS']
        def runIteration(self, task, pop, fpop, xb, fxb, **dparams):
            pop = task.Lower + self.Rand.rand(self.NP, task.D) * task.bRange
            fpop = apply_along_axis(task.eval, 1, pop)
            return pop, fpop, {}
>>>
>>> class CustomSamplingBenchmark(SamplingBenchmark):