**1. Introduction**

Current data management and storage methods have been challenged by the sharp increase in the amount of available medical data. Obtaining valuable information in the process of knowledge discovery has become problematic. There is an urgent need for new tools and approaches that overcome the present-day limitations of computational medicine by converting large quantities of data into knowledge. Novel methods will make it possible to go beyond simple data description, providing knowledge in the form of models. Through abstract data models, it is possible to create highly reliable prediction systems [1–14].

The process of knowledge discovery from data involves, among other techniques, machine learning. Our interest is to select or combine techniques that perform well in prediction tasks on medical datasets. In medicine, prediction systems are most frequently applied in diagnosis and prognosis. According to previous research on the development of diagnosis systems, it is possible to determine the presence or absence of a disorder through the interpretation of patient data [15]. These systems are used specifically in the diagnosis of patients. Prognosis systems use the collected information to predict the progress of the condition a patient is suffering from or to determine whether a patient may develop a disease in the future. Moreover, they are used to choose the most effective treatment based on the patient's symptoms and different medical factors [16].

In the context of diagnosis and prognosis, the aim of using intelligent systems based on machine learning techniques is knowledge discovery from the collected information. Sometimes, the discovered knowledge is expressed as a probabilistic model that relates the clinical features of patients to a stage of the target disease. In other cases, a rule-based representation is selected to provide the expert with an explanation of why a certain decision was made. Knowledge representations such as those described above are known as white-box systems and are the focus of this research, because they express part of the knowledge directly. Finally, there are other cases in which the system is designed as a black box for decision-making, where the system only shows the prediction results. All of these techniques are suitable for making a diagnosis and prognosis of a patient's condition [17].

For the reasons explained above, this research proposes a system that generates classifiers based on genetic programming (GP) and is capable of inducing sets of rules that represent the relationship between the disease and the symptoms experienced by patients. Our goal is therefore to build a rule-based classifier and compare its ability to correctly classify data with that of other previously proposed methods. Finally, we analyze the rules obtained by our approach to determine the most important attributes of the dataset. In this case, the system performs a feature filtering process [18–21]. Rule-based classifiers are an attractive approach since the structure of IF/THEN rules is well known and can easily be interpreted for knowledge discovery. Hence, such rules not only classify unknown patterns but also disclose knowledge about the class structure and problem domain. The goal of a rule-based classifier is to find a set of rules that suits a labeled dataset. That is, the discovered rules should represent the target dataset and cover each region of the search space. The application of GP to the building of rule-based classifiers has been the basis of works such as [22–25]. Our ultimate goal is to provide the expert with an initial interpretation of the data through our rule-based model, which can serve as a starting point in the study of the disease. To this end, we also provide a visual interpretation of the data, which supports the process of knowledge discovery.
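To make the notion of an IF/THEN rule set concrete, the following minimal Python sketch shows one possible way to represent such rules, classify a patient record, measure accuracy on labeled data, and count how often each attribute appears in the rules. It is an illustration only and not the encoding used by the proposal: the names `Condition`, `Rule`, `classify`, `accuracy` and `attribute_usage`, the attributes `glucose` and `bmi`, the thresholds, and the three toy records are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical set of comparison operators allowed in a rule condition.
OPS: Dict[str, Callable[[float, float], bool]] = {
    "<=": operator.le,
    ">": operator.gt,
}


@dataclass(frozen=True)
class Condition:
    """One test of the form: attribute <op> threshold."""
    attribute: str
    op: str
    threshold: float

    def holds(self, record: Dict[str, float]) -> bool:
        return OPS[self.op](record[self.attribute], self.threshold)


@dataclass(frozen=True)
class Rule:
    """IF all conditions hold THEN predict `label`."""
    conditions: Tuple[Condition, ...]
    label: str

    def covers(self, record: Dict[str, float]) -> bool:
        return all(c.holds(record) for c in self.conditions)


def classify(rules: Sequence[Rule], record: Dict[str, float], default: str) -> str:
    """Return the label of the first rule covering the record, else `default`."""
    for rule in rules:
        if rule.covers(record):
            return rule.label
    return default


def accuracy(rules: Sequence[Rule], records: List[Dict[str, float]],
             labels: List[str], default: str) -> float:
    """Fraction of records whose predicted label matches the true label."""
    hits = sum(classify(rules, r, default) == y for r, y in zip(records, labels))
    return hits / len(records)


def attribute_usage(rules: Sequence[Rule]) -> Dict[str, int]:
    """Count how many conditions mention each attribute (a crude relevance cue)."""
    counts: Dict[str, int] = {}
    for rule in rules:
        for c in rule.conditions:
            counts[c.attribute] = counts.get(c.attribute, 0) + 1
    return counts


# Made-up rules and records, for illustration only (not real clinical data).
rules = [
    Rule((Condition("glucose", ">", 140.0), Condition("bmi", ">", 30.0)), "positive"),
    Rule((Condition("glucose", "<=", 140.0),), "negative"),
]
records = [
    {"glucose": 160.0, "bmi": 33.0},
    {"glucose": 110.0, "bmi": 27.0},
    {"glucose": 150.0, "bmi": 24.0},  # no rule covers this record
]
labels = ["positive", "negative", "positive"]

print(accuracy(rules, records, labels, default="negative"))
print(attribute_usage(rules))
```

The third toy record falls in a region that no rule covers, so it receives the default label. This is the kind of gap that the requirement to cover each region of the search space is meant to avoid. Counting attribute occurrences is only a simple stand-in for analyzing which attributes the rules rely on.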

In summary, medical databases store large amounts of data about the health condition of patients. Such a volume of information is ideal for the application of machine learning techniques, which can transform data into knowledge by analyzing the relationships provided by the model. This mechanism provides a means of hypothesis validation [6,9]. The rest of this manuscript is organized as follows. Section 2 deals with the background related to this research. Section 3.1 describes the main features of our proposal: encoding, fitness functions, genetic operators and running strategy. Section 4 describes the employed datasets, an analysis of their structure and distribution, the experiments to select the best mutation operators for each medical dataset, and an accuracy comparison of our approach with other machine learning methods. At the end of that section, the rules discovered by the proposal and the most influential attributes of the datasets are analyzed. The Conclusions, Appendix A (classifiers of our proposal), Appendix B (mutation operator experiments), and the references form the final part of this document.
