**2. Background**

Applications of genetic algorithms (GAs) to the analysis of medical data have made it possible to tackle complex problems such as disease screening, diagnosis, treatment planning, pharmacovigilance, prognosis and health care management [26]. GAs have been applied to many medical fields, among which we can highlight Radiology, Oncology, Cardiology, Endocrinology, Pulmonology and Pediatrics. In this context, GAs have been used for edge detection of images obtained from Magnetic Resonance Imaging (MRI), Computed Tomography (CT) and ultrasound [27–29]. Using these kinds of algorithms, different methods have been proposed to detect microcalcifications in mammograms, leading to the diagnosis of breast cancer [30–32]. In other studies, GAs have been used to fuse MRI images with Positron Emission Tomography (PET) images in order to generate colored images of breast cancer [33].

In other works [34], a methodology based on a Micro-Genetic Algorithm (MGA) was used to generate the training set that best detects solitary lung nodules. The designed algorithm detects lung nodules with about 86% sensitivity, 98% specificity, and 97.5% accuracy. In [35], the authors proposed a model combining Particle Swarm Optimization (PSO), a GA and a Support Vector Machine (SVM) for feature selection and classification of CT, MRI and ultrasound images. The proposed method detected lung cancer with an accuracy of 89.5%.

GAs have also been used to detect patients with some type of carcinoma through Microarray Technology. For example, in [36], a GA combined with an Artificial Bee Colony (ABC) algorithm was proposed to classify cancer in patients through the extraction of features from microarray data. The method was tested on a colon carcinoma dataset, two different Leukemia datasets, a dataset of patients with lung carcinoma, and one of patients with Small, Round-Blue Cell Tumors (SRBCT). It achieved an accuracy of almost 100% while selecting very few biomarkers.

In the area of Pediatrics, GAs are also being used to detect diseases such as autism from gene expression microarrays. In [37], a GA was proposed as a feature selection engine, with an SVM as the classifier used to validate the set of selected features. This work reached an accuracy greater than 86% for one of the datasets used and 92.93% for the other, outperforming previous works.

There are other applications of GAs aimed at making predictions from the data acquired from blood tests. In [38], a GA is used to optimize the performance of an Artificial Neural Network (ANN) to detect Coronary Artery Disease (CAD). Through this approach, the authors show that CAD can be detected without angiography, thus avoiding its high cost and main side effects. In another context, electrocardiogram (ECG) signals have been used in cardiology to detect cardiac arrhythmias [39]. In this work, a method linking a Genetic Algorithm with a Backpropagation Neural Network (GA-BPNN) was proposed, reducing the dimension of the datasets by 50% and achieving 99% accuracy. This makes the method suitable for automatic identification of cardiac arrhythmias.

As stated at the beginning of this section, many more applications of GAs to medicine can be consulted in the literature [40–42]. Since the efficacy of GAs in medicine has been proved, we now turn to a more recent family of algorithms, Genetic Programming, which includes a GA as its base operation. Genetic Programming (GP) is a kind of GA whose main difference with respect to standard GAs is that it produces expressions (functions or programs) as outputs rather than data [43–45]. An example of the use of this kind of algorithm in the medical field is shown in [46]. In this work, a GP algorithm is proposed to automatically create the best mathematical formula combining a set of preselected features from a Magnetoencephalography (MEG) dataset. To evaluate the generated formulas, a K-nearest neighbor (KNN) algorithm is used. This approach achieved 91.75% sensitivity and 92.99% specificity in the diagnosis of epilepsy.

GP is also used to provide diagnoses from MRI images by evaluating the spine condition of patients [47]. The GP algorithm proposed in this work uses a fitness function based on expert knowledge, in this case that of a neuroradiologist. The rules rendered in each generation of the algorithm are evaluated and then compared with the true results in order to select the rules with the smallest difference. By combining the GP algorithm and expert knowledge, the accuracy reached was greater than 90% for the conditions evaluated.

Another example of GP applied to medicine is image classification [48]. In this work, a GP algorithm is proposed to create and evolve tree-based classifiers whose aim is to diagnose active tuberculosis from raw X-ray images. The proposed framework achieved competitive classification performance and superior speed compared to methods that rely upon image processing and feature extraction.

In general terms, GP is a flexible and powerful evolutionary technique that uses a set of functions and terminals to produce computable expressions. Hence, this research presents a GP method to render rule-based classifiers for knowledge discovery from medical data. Among the advantages of classifiers that generate comprehensible knowledge is their high expressiveness, which allows them to render models that are very easy to interpret. Such rules can be altered to handle missing values and noise in the attributes of the dataset, are relatively easy to obtain, and are very fast at classifying new patterns (or data) [49]. Moreover, a very important advantage of such rules for machine learning is that they are intuitively comprehensible to the user [50,51]. Relatedly, they are not only used to classify: they also represent, by themselves, a process of knowledge discovery, providing the user with new insights into the data and their application domain [52].

#### **3. Materials and Methods**

#### *3.1. Evolutionary Strategy to Build Rule-Based Classifiers (ESRBC)*

This section presents our main proposal, the evolutionary method (ESRBC) to render rule-based classifiers. We describe the strategy followed by ESRBC, the individuals, the crossover and mutation operators, and the fitness functions. Individuals represent logical rules, adopting an internal representation of a linear sequence of clauses (or comparisons) separated by AND conjunctions. The individuals built in this proposal follow the Michigan style [24,50,53,54]; hence, each individual encodes a single rule (with a linear chromosome of variable length), where each rule is associated with the class of the dataset it represents. Therefore, an individual can be evaluated as True or False according to the pattern evaluated in the antecedent of the rule. Depending on this evaluation, the pattern may or may not belong to the class assigned to the rule.

As explained, the individuals generated by ESRBC represent logical rules of the form *IF* <*CLAUSES*> *THEN* <*CLASS*>, where <*CLAUSES*> is a set of clauses (or comparisons) separated by AND conjunctions, and <*CLASS*> is the class of the dataset represented by the rule or, in other words, the class to which the rule belongs. A more detailed representation of a rule is as follows:

$$IF\ (at_1\ o_1\ val_1)\ AND\ (at_2\ o_2\ val_2)\ AND\ \cdots\ AND\ (at_n\ o_n\ val_n)\ THEN\ class = k$$

where (*at*<sub>*i*</sub> *o*<sub>*i*</sub> *val*<sub>*i*</sub>) is clause number *i*, *at*<sub>*i*</sub> is an attribute of the dataset, *o*<sub>*i*</sub> is a comparison operator from the set {<, >, ≤, ≥, =, ≠}, *val*<sub>*i*</sub> is a value from the set of all possible values admitted by *at*<sub>*i*</sub>, and *k* is the class of the dataset covered by the rule. An example of logical rules representing a dataset with attributes {*p*, *q*, *r*, *s*, *t*} and two classes {0, 1} is as follows:

$$IF\ (p > 12.3)\ AND\ (p \le 15)\ AND\ (s \ne 3.4)\ THEN\ class = 0,$$

$$IF\ (p \le 12.3)\ AND\ (r > 7.4)\ AND\ (t \ge 2)\ THEN\ class = 1,$$

which means that, if a specific pattern (*p*<sub>*i*</sub>, *q*<sub>*i*</sub>, *r*<sub>*i*</sub>, *s*<sub>*i*</sub>, *t*<sub>*i*</sub>) from the domain of the dataset has values *p*<sub>*i*</sub> and *s*<sub>*i*</sub> that hold the antecedent of the class-0 rule, then such a pattern belongs to class 0. Likewise, if the attribute values *p*<sub>*i*</sub>, *r*<sub>*i*</sub>, *t*<sub>*i*</sub> hold the antecedent of the class-1 rule, then the pattern is in class 1. Keep in mind that the challenge each rule learned from a dataset must meet is generalization. In other words, the set of rules representing a dataset should generalize enough that the pattern space is properly partitioned, with each region of the space covered as much as possible by the set of rules.
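To make this representation concrete, the following sketch (in Python with illustrative names; the actual implementation is in C++) encodes a rule as a list of (attribute, operator, value) clauses plus its class, and evaluates the two example rules on a pattern:

```python
# Illustrative sketch of the rule representation described above:
# a rule is a linear sequence of clauses joined by AND, plus a class.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def holds(rule, pattern):
    """True iff every clause of the rule is satisfied by the pattern."""
    clauses, _cls = rule
    return all(OPS[op](pattern[attr], val) for attr, op, val in clauses)

# The two example rules over attributes {p, q, r, s, t}:
rule0 = ([("p", ">", 12.3), ("p", "<=", 15), ("s", "!=", 3.4)], 0)
rule1 = ([("p", "<=", 12.3), ("r", ">", 7.4), ("t", ">=", 2)], 1)

pattern = {"p": 13.0, "q": 0.0, "r": 1.0, "s": 2.0, "t": 5.0}
print(holds(rule0, pattern))  # all three clauses hold -> True
print(holds(rule1, pattern))  # p = 13.0 > 12.3 fails the first clause -> False
```

Here a pattern satisfying every clause of a rule is assigned that rule's class, exactly as in the examples above.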

Continuing with the description of the rule concept, we define the length of a rule as the number of clauses that form it. The evolutionary algorithm (EA) of our approach, which is responsible for the search for a diverse set of rules, adopts the sequential covering strategy for each class of a dataset [51]. Sequential covering is a technique that discovers one rule at a time. The EA is executed multiple times to build a complete set of rules representing each class of a dataset. During each execution, the best rule evolved by the EA is added to the set of previously discovered rules, and the patterns covered by this rule are removed from the dataset. The process is repeated until there are no more patterns to be covered. The steps followed by this methodology can be summarized as follows:

1. Run the EA to evolve the best rule for the current class.
2. Add that rule to the set of previously discovered rules.
3. Remove the patterns covered by the rule from the dataset.
4. Repeat until there are no more patterns to be covered.
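The sequential-covering loop described above can be sketched as follows; `run_ea` and `covers` are hypothetical stand-ins for the EA and the rule-coverage test:

```python
# A minimal sketch (illustrative names) of sequential covering:
# discover one rule at a time until every pattern is covered.
def sequential_covering(patterns, run_ea, covers):
    rules = []
    remaining = list(patterns)
    while remaining:
        best_rule = run_ea(remaining)          # evolve a single rule
        rules.append(best_rule)                # add it to the discovered set
        remaining = [p for p in remaining
                     if not covers(best_rule, p)]  # drop covered patterns
    return rules

# Toy usage: each "rule" is just the first uncovered pattern and covers
# exactly that pattern, so the loop discovers one rule per pattern.
rule_set = sequential_covering([1, 5, 9],
                               run_ea=lambda rest: rest[0],
                               covers=lambda rule, p: p == rule)
print(rule_set)  # [1, 5, 9]
```

In ESRBC, `run_ea` would be a full evolutionary run and `covers` the rule-antecedent evaluation; the loop structure is the same.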


## *3.2. Fitness Functions*

This section introduces the fitness functions used in the evolutionary algorithm of our approach. The fitness functions defined here are based on the concept of accuracy [52,55,56]. The accuracy of a rule is the fraction of patterns from its class covered by the rule. According to this definition, we introduce two variants of fitness functions based on accuracy. First, however, we need to define two functions that evaluate a pattern *e* on a rule *r*. The first function is *g* acting on *r* and *e*, i.e., *g*(*r*, *e*), which computes the number of clauses of *r* evaluated True when *e* is evaluated on *r*. The second function defines the evaluation of a pattern *e* on *r*, denoted *r*(*e*), in the following way:

$$r(e) = \begin{cases} 1, & \text{if all the clauses of } r \text{ evaluate to True on } e \text{ (in this case we say } e \text{ holds } r);\\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Note that *g*(*r*, *e*) counts the number of clauses of *r* holding for a pattern *e*, whereas *r*(*e*) evaluates the rule to 1 (True) if it covers pattern *e* (all its clauses become True). Additionally, if we want to specify the class of both *r* and *e*, we write *r*<sup>*i*</sup> and *e*<sup>*i*</sup> respectively, where *i* is a class of the dataset.

Finally, the two fitness functions are given below; both define a maximization problem. The first objective of *f*<sub>1</sub> assesses accuracy based on the number of clauses turned True by patterns of the target class, whereas the second objective acts as a penalty for patterns not belonging to the class of the rule whose values make the clauses of the rule True. The same applies to *f*<sub>2</sub>, but, in this case, accuracy is assessed by considering the number of patterns holding a rule *r*. *f*<sub>1</sub> is intended for the first generations of the evolutionary algorithm, when rules have been created at random and no pattern holds them. The use of *f*<sub>2</sub> makes more sense in a second stage of the evolutionary algorithm (after applying *f*<sub>1</sub>), once the rendered rules have reached a certain learning level.

#### **Definition 1.** *Fitness function f*<sub>1</sub>.

*If D is a labeled dataset with k classes, C*<sub>*i*</sub> *a class of D and r*<sup>*i*</sup> *a rule of C*<sub>*i*</sub>*, and consider i*, *j* ∈ {0, 1, ··· , *k* − 1}*. Then, we define fitness function f*<sub>1</sub> *applied to r*<sup>*i*</sup> *as:*

$$f_1(r^i) = \frac{1}{|C_i| \cdot |r^i|} \sum_{\forall e \in C_i} g(r^i, e) - \frac{1}{|D| - |C_i|} \sum_{\forall e' \in C_j, j \neq i} g(r^i, e') + 3. \tag{2}$$

#### **Definition 2.** *Fitness function f*<sub>2</sub>.

*Under the same conditions given in Definition 1, we define fitness function f*<sub>2</sub> *applied to a rule r*<sup>*i*</sup> *as:*

$$f_2(r^i) = \frac{1}{|C_i|} \sum_{\forall e \in C_i} r^i(e) - \frac{1}{|D| - |C_i|} \sum_{\forall e' \in C_j, j \neq i} r^i(e') + 3. \tag{3}$$

Both fitness functions define a maximization problem: the bigger their values, the fitter the evaluated rules. In the first fitness function, the first objective measures a kind of accuracy using *g*, computing the number of clauses evaluated True in the current rule for all patterns of its class. The second objective measures the number of clauses evaluated True by the current rule for all patterns belonging to a class different from the rule's class. This fitness function is useful for evaluating rules built in the first generations of the EA, where the rule accuracy is zero. The second fitness function measures the number of patterns from the rule's class holding the rule versus the number of patterns of other classes holding the rule.
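The two fitness functions of Eqs. (2) and (3) can be sketched as follows, assuming (as an illustration) that a rule is a list of (attribute, operator, value) clauses and the dataset *D* is a list of (pattern, label) pairs:

```python
# Sketch of g, r(e), f1 (Eq. (2)) and f2 (Eq. (3)); names are illustrative.
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def g(rule, pattern):
    """Number of clauses of the rule evaluated True on the pattern."""
    return sum(OPS[op](pattern[a], v) for a, op, v in rule)

def r_eval(rule, pattern):
    """1 if the pattern holds the rule (all clauses True), else 0."""
    return int(g(rule, pattern) == len(rule))

def f1(rule, cls, D):
    inside = [p for p, y in D if y == cls]      # patterns of the rule's class
    outside = [p for p, y in D if y != cls]     # patterns of other classes
    return (sum(g(rule, p) for p in inside) / (len(inside) * len(rule))
            - sum(g(rule, p) for p in outside) / len(outside) + 3)

def f2(rule, cls, D):
    inside = [p for p, y in D if y == cls]
    outside = [p for p, y in D if y != cls]
    return (sum(r_eval(rule, p) for p in inside) / len(inside)
            - sum(r_eval(rule, p) for p in outside) / len(outside) + 3)

D = [({"p": 14.0, "s": 1.0}, 0), ({"p": 10.0, "s": 3.4}, 1)]
rule = [("p", ">", 12.3), ("p", "<=", 15), ("s", "!=", 3.4)]
print(f1(rule, 0, D))  # 3/3 - 1/1 + 3 = 3.0
print(f2(rule, 0, D))  # 1/1 - 0/1 + 3 = 4.0
```

The toy run shows the difference between the two: *f*<sub>1</sub> rewards partially satisfied rules clause by clause, whereas *f*<sub>2</sub> only rewards rules that fully cover patterns.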

## *3.3. Genetic Operators*

The crossover operator used in this method to recombine clauses from two parent rules into two new child rules behaves like the classical operator [57]. That is, the crossover operator selects a random position (with a uniform distribution) in the two parent rules and exchanges two segments of clauses between them to obtain two children, each inheriting part of the clauses (genetic code) of its parents. In other words, given two rules, the position of a clause is randomly selected; then, the clauses located on the right or left side of both rules (which is also decided at random) are exchanged to create two new rules. The mutation operator is responsible for providing new information to the generated individuals. In this case, we provide three types of mutation operations by defining a mutation group for each one:
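The one-point exchange for two variable-length parent rules can be sketched as follows, assuming (as an illustration) a single cut position bounded by the shorter parent and a right-side tail swap:

```python
# Illustrative one-point crossover for variable-length clause lists.
import random

def crossover(parent_a, parent_b, rng=random):
    """Swap the clause tails of two parent rules at a random cut point."""
    cut = rng.randrange(1, min(len(parent_a), len(parent_b)))
    child1 = parent_a[:cut] + parent_b[cut:]   # head of a, tail of b
    child2 = parent_b[:cut] + parent_a[cut:]   # head of b, tail of a
    return child1, child2

a = [("p", ">", 12.3), ("p", "<=", 15), ("s", "!=", 3.4)]
b = [("r", ">", 7.4), ("t", ">=", 2)]
c1, c2 = crossover(a, b, random.Random(0))
print(c1)  # [('p', '>', 12.3), ('t', '>=', 2)]
print(c2)  # [('r', '>', 7.4), ('p', '<=', 15), ('s', '!=', 3.4)]
```

Note that the total number of clauses is preserved but redistributed, so children may be longer or shorter than either parent, consistent with variable-length rules.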


The mutation operator applied in each mutation of the rules is selected at random. Note also that the goal of defining the compound mutation operators M2 and M3 is to create different mutation levels from the basic mutation operator M1. This allows us to explore different degrees of alteration in the individuals yielded from generation to generation: each of these operators performs a minor (M1), medium (M2) or higher (M3) level of alteration on the individuals.
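Since the precise definitions of M1, M2 and M3 are given elsewhere, the following sketch only illustrates the layered idea: here we assume (hypothetically) that M1 perturbs the threshold value of one random clause, and that M2 and M3 compound M1 by applying it two and three times:

```python
# Illustrative mutation levels; the real M1/M2/M3 definitions may differ.
import random

def m1(rule, rng=random, scale=1.0):
    """Basic mutation: perturb the value of one randomly chosen clause."""
    mutated = list(rule)
    i = rng.randrange(len(mutated))
    attr, op, val = mutated[i]
    mutated[i] = (attr, op, val + rng.uniform(-scale, scale))
    return mutated

def m2(rule, rng=random):
    return m1(m1(rule, rng), rng)      # medium alteration level

def m3(rule, rng=random):
    return m1(m2(rule, rng), rng)      # higher alteration level

rule = [("p", ">", 12.3), ("t", ">=", 2.0)]
print(m1(rule, random.Random(1)))
```

The point of the layering is that selecting among M1, M2 and M3 at random varies how strongly each individual is altered per generation.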

#### *3.4. Running the Evolutionary Algorithm*

Once the genetic operators have been defined, the evolutionary algorithm (EA) of our proposal, ESRBC, is responsible for discovering each rule covering different parts of the search space, in the hope that the rules generalize. The EA is run following the general scheme of evolutionary algorithms [57,58], with the particularity of introducing elitism, which is transmitted from generation to generation, and adopting tournament selection as the selection method.
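The selection scheme named above can be sketched as follows, with a hypothetical `next_generation` that carries the elite forward and fills the rest of the population by binary tournaments (parameter names are illustrative):

```python
# Illustrative tournament selection with elitism.
import random

def tournament(population, fitness, k=2, rng=random):
    """Pick the fittest of k randomly sampled individuals."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

def next_generation(population, fitness, rng=random):
    elite = max(population, key=fitness)          # elitism: keep the best
    offspring = [tournament(population, fitness, rng=rng)
                 for _ in range(len(population) - 1)]
    return [elite] + offspring

pop = [1, 5, 3, 9, 2]
new_pop = next_generation(pop, fitness=lambda x: x, rng=random.Random(0))
print(new_pop[0])  # the elite individual, 9
```

In the full EA the offspring would additionally undergo crossover and mutation before forming the new population; this sketch shows only the selection and elitism step.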

Aside from the above, the EA includes an evolutionary strategy of local search (Algorithm 1, ESLS), which acts on the population or on the fittest individual returned by the EA. In fact, the option of executing Algorithm 1 on a population or on a single individual is a parameter of the algorithm. The term local search is used because Algorithm 1 is based on mutation operators and, in each generation, replaces only individuals that have improved their fitness value after the mating process. The goal of this strategy is to refine the solutions of the EA by making an in-depth search. Hopefully, the individuals from the EA are close enough to a global optimum; Algorithm 1 is then in charge of finding such an optimum. This idea has been taken from [59] and implemented in [60] with good results. The idea is as follows:

• *Run a genetic algorithm (GA) until it slows down, then let a local optimizer take over the last generation (and/or best individual) of the GA. Hopefully, the GA is very close to the global optimum.*

ESLS is defined below. This strategy improves a population of individuals, or a single individual, given by ESRBC. Finally, both ESRBC and Algorithm 1 were implemented in the *C++ programming language*, whereas the experiments were performed under *R-Project* [61].
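The local-search idea can be sketched as a mutation-only refinement loop that accepts a candidate only when its fitness improves; `mutate` and `fitness` are hypothetical stand-ins for ESRBC's operators, not the actual Algorithm 1:

```python
# Illustrative mutation-based local search: replace only on improvement.
import random

def local_search(individual, fitness, mutate, generations=100):
    best, best_fit = individual, fitness(individual)
    for _ in range(generations):
        candidate = mutate(best)
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:            # keep only improved mutants
            best, best_fit = candidate, cand_fit
    return best

# Toy usage: maximize -(x - 4)^2 starting near the optimum at x = 4.
rng = random.Random(0)
result = local_search(3.0, fitness=lambda x: -(x - 4) ** 2,
                      mutate=lambda x: x + rng.uniform(-0.5, 0.5))
print(round(result, 2))
```

Because only improvements are accepted, the search never moves away from the starting solution's fitness, which matches the role of ESLS as a refinement stage after the EA has done the broad exploration.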
