Article

Predicting the Risk of Overweight and Obesity in Madrid—A Binary Classification Approach with Evolutionary Feature Selection

by Daniel Parra 1, Alberto Gutiérrez-Gallego 1, Oscar Garnica 1, Jose Manuel Velasco 1, Khaoula Zekri-Nechar 2, José J. Zamorano-León 3, Natalia de las Heras 4 and J. Ignacio Hidalgo 1,*
1 Computer Architecture and Automation Department, Faculty of Computer Science, Universidad Complutense de Madrid, 28040 Madrid, Spain
2 Department of Medicine, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
3 Public Health and Maternal and Child Health Department, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
4 Department of Physiology, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(16), 8251; https://doi.org/10.3390/app12168251
Submission received: 8 July 2022 / Revised: 15 August 2022 / Accepted: 16 August 2022 / Published: 18 August 2022
(This article belongs to the Special Issue Evolutionary Computation: Theories, Techniques, and Applications)

Abstract: In this paper, we experimented with a set of machine-learning classifiers for predicting the risk of a person being overweight or obese, taking into account his/her dietary habits and socioeconomic information. We investigate ten different machine-learning algorithms combined with four feature-selection strategies (two evolutionary feature-selection methods, one feature-selection method from the literature, and no feature selection). We tackle the problem under a binary classification approach with evolutionary feature selection. In particular, we use a genetic algorithm to select the set of variables (features) that optimizes the accuracy of the classifiers. As an additional contribution, we designed a variant of the Stud GA, a particular structure of the selection operator in which a reduced set of elitist solutions dominates the process. The genetic algorithm uses a direct binary encoding, allowing a more efficient evaluation of the individuals. We use a dataset with information from more than 1170 people in the Spanish Region of Madrid. Both evolutionary and classical feature-selection methods were successfully applied to Gradient Boosting and Decision Tree algorithms, reaching accuracy values of up to 79% and increasing the average accuracy by two points, respectively.

1. Introduction

A person is considered overweight when her/his Body Mass Index (BMI) is higher than 25 and obese if it is over 30. BMI is computed by dividing the weight of the person (in kilograms) by the square of her/his height (in meters) [1]. According to the Spanish National Institute of Statistics [2], the rate of individuals with obesity in Spain has increased from 7.4% in 1987 to 17.4% in 2017; in just 30 years, it has grown by a factor of 2.4. This health problem affects all sectors of the population, although it is not equally distributed. In particular, men are more prone to developing overweight/obesity than women. The number of cases of childhood obesity has also increased, reaching 10.3% of children between 2 and 17 years old. The World Health Organization [3] also reports that overweight people are more prone to developing cerebrovascular and respiratory problems and gallbladder disease, and that excess weight may increase the risk of different types of cancer. Hence, it is necessary to prevent future cases of overweight or obesity.
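As a quick illustration of these definitions, the following minimal sketch (our own, not taken from the paper) computes BMI and maps it to the categories used throughout this work:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by the square of height in meters."""
    return weight_kg / height_m ** 2

def risk_label(weight_kg: float, height_m: float) -> str:
    """Map BMI to the categories used in the paper (>= 25 overweight, >= 30 obese)."""
    value = bmi(weight_kg, height_m)
    if value >= 30:
        return "obese"
    if value >= 25:
        return "overweight"
    return "non-overweight/non-obese"

print(risk_label(82, 1.75))  # BMI of about 26.8 -> "overweight"
```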
Some people may have a certain predisposition to suffer from overweight/obesity due to genetics. Overweight and obesity are usually a consequence of social behaviors, such as high-fat meals and sedentary lifestyles. Prevention should begin at early ages to consolidate healthy lifestyle habits, and those who suffer from any of these disorders should seek the help and advice of medical staff. Regarding prevention, it would be essential to know the relationships among lifestyle indicators, overweight, and obesity. In this context, a system predicting the risk of developing overweight and obesity could be very useful. In this paper, we show that it is possible to analyze the different factors and habits of a person and obtain the relationship among them by machine-learning techniques. In particular, we investigate the performance of a set of machine-learning classification algorithms when classifying people as overweight/obese versus non-overweight/non-obese using information about lifestyle habits. This information was collected by a pool of 14 different institutions participating in the consortium of the GenObIA project (work financed by the Community of Madrid and the European Social Fund through the GENOBIA-CM project with reference S2017/BMD-3773).
In machine learning (ML), classification is related to the assignment of class labels to data in the problem domain. A critical element in this type of problem is the selection of the variables used as predictors (feature selection). Using all variables is not always the best option; sometimes it is necessary to discard some of them in order to avoid noise in the data and to reduce the computational load. Feature-selection (FS) methods aim to reduce dimensionality, reduce execution times, and improve model results. In this work, four different approaches to FS have been tested: no FS, classical FS, and two FS methods using genetic algorithms (GAs). In particular, we investigate FS by a GA using two selection operators: a classical tournament selection and Stud selection, a new method proposed in this paper. Experimental results show that machine-learning algorithms are good classifiers when combined with evolutionary feature-selection methods for this particular problem.
The main contributions of this work are:
  • We evaluate different configurations of a set of ML classifiers that predict whether a person is overweight or obese based on information concerning lifestyle and dietary habits.
  • We use a dataset with data from more than 1170 people, which, to the best of our knowledge, makes this the largest such study in Spain.
  • We explore four different feature-selection methods.
  • We propose the Stud selection operator, a variant of the stud GA algorithm presented in [4], which can be adapted to other evolutionary algorithms.
The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 describes the evolutionary algorithms for feature selection and implementation details. Section 4 explains the experimental setup, and Section 5 collects the experimental results. Section 6 presents the statistical analysis, and finally Section 7 contains the conclusions and future work.

2. Literature Review

Machine learning [5] is a constantly evolving branch of artificial intelligence related to those algorithms that try to simulate human intelligence using information from their environment. Techniques based on machine learning have been used in different fields, such as finance [6], pattern recognition [7,8], and medical applications [9,10].
In machine learning, it is important to carefully choose which features should be used and which should be discarded to construct the models. Medical datasets commonly present a small number of cases with a large number of variables, thus introducing different problems such as dimensionality and high computational requirements [11]. To deal with these obstacles, the use of feature-selection techniques has been proposed to select the variables that provide the greatest value. According to [12], three common FS categories are: filters, wrappers, and embedded methods.
Filter methods stand out mainly for their speed and scalability, being of great help in extremely large datasets. Through a series of statistical processes, scores are assigned to the different variables; the ones with the highest scores will be used to create the model. The great limitation of these methods is that they do not consider the relationship between variables. Two examples of this category are Pearson’s correlation coefficient, which allows us to quantify the linear dependence between two variables, and mutual information, which seeks to reduce the uncertainty of a random variable through knowledge of another random variable [12,13].
Wrapper methods search for the most appropriate subset of variables using the selected predictor as a black box to score the different subsets of variables generated throughout the different iterations. Wrapper methods [13] have a high computational cost since they require training and testing for each possible subset. Sequential Feature Selection (SFS) [14] starts with an empty set and adds the different variables individually, looking for the one that contributes the most value to the set. Once identified, this variable will be permanently incorporated into the subset, and then the next iteration will follow, repeating the same process until the desired number of variables is obtained. Based on this implementation, it is possible to find different variants, such as Sequential Backward Selection (SBS), which starts with the complete set of variables and reduces it through iterations, or Sequential Floating Forward Selection (SFFS) [15], which is based on the SFS method and incorporates a backward component with the SBS.
An example of the use of this type of technique is [16], where the authors apply a wrapper method, called Recursive Feature Elimination with Cross-Validation (RFECV), to select the best variables for a classification problem in the medical domain and obtain an accuracy improvement. In our work, RFECV is one of the techniques against which we compare the performance of our proposal, since it is a good FS alternative and has the benefits of wrapper methods. Heuristic search methods applied to feature selection can also be considered part of the wrapper methods. Examples of the use of genetic algorithms for feature selection can be found in the medical field, and these methods are particularly interesting for large datasets. In [17], an optimization algorithm based on a genetic algorithm is proposed that optimizes the values of the SVM parameters, obtaining an optimal subset of features and improving the classification accuracy. Embedded methods aim to reduce the computational time required to reclassify subsets of distinct variables. To do so, they try to combine the advantages of the filter and wrapper methods. One of the main characteristics of embedded methods is the introduction of feature selection as part of the training process rather than as a separate phase, i.e., the feature-selection process becomes an integral part of the model.
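For reference, RFECV is available in scikit-learn; the following is a minimal sketch of how it is typically used, with synthetic data and illustrative parameters rather than the exact configuration used later in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the survey data: 1000 subjects, 40 candidate features.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=8, random_state=0)

# Recursively drop the least important feature and keep the subset with the best
# cross-validated accuracy.
selector = RFECV(
    estimator=GradientBoostingClassifier(random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=10),
    scoring="accuracy",
)
selector.fit(X, y)
print("selected features:", selector.n_features_)
```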
Our proposal, Stud selection, is a variant of the stud GA algorithm presented in [4]. In stud GA, the fittest individual is considered the Stud, and the rest of the population is crossed with it to obtain the offspring. In addition, stud GA maintains diversity using the Hamming distance between the Stud and the individual that will serve as the second parent. If the diversity is above a set threshold, the crossover is performed to produce an offspring; otherwise, the current second parent is mutated to produce the offspring.
In this work we have adapted this strategy to the particularities of our data. We found that crossover based on the Hamming distance introduced a large computational load without yielding significantly better solutions. Therefore, we decided to use a simple one-point crossover and to compensate for the possible loss of diversity with four studs.
In relation to the study of factors related to obesity, Hudson Reddon [18] deals with physical activity and genetic predisposition to obesity. With this purpose, the impact of exercise on 14 variants predisposing to obesity is analyzed. Physical activity is able to reduce the impact of the fat mass and obesity-associated (FTO) gene variation and of obesity genetic risk scores. The results include the identification of an interaction between physical activity and the FTO SNP rs1421085, a single-nucleotide polymorphism of the fat mass and obesity-associated gene, in a prospective cohort of six ethnic groups. According to this study, prevention programs with a heavy physical activity load can be a very important resource to combat obesity.
In [19], the authors try to identify risk factors for overweight and obesity using machine-learning techniques (regression and classification). Their main contributions are the identification of factors related to obesity/overweight, the analysis of these factors, and their respective variable analysis. The issue we find with this work is the use of the variable “weight” since it is a variable that is unknown in the future and is part of the BMI formula (the value to be predicted). In our opinion, weight cannot be used in a classification method.
There are examples of the use of evolutionary computation in similar environments. Ref. [20] deals with the prediction of obesity in children using a hybrid approach, combining Naïve Bayes (a supervised learning algorithm that applies Bayes' theorem with the "naive" assumption of conditional independence between every pair of features) with a GA. In this case, the use of Naïve Bayes in prediction presented problems when dealing with zero-value parameters, and as a solution, the authors propose using a GA for parameter optimization. The initial experiment to identify the usability of their approach indicated a 75% improvement in accuracy. Similarly, the present work proposes a genetic algorithm to support the classification model in use by selecting the most useful features.
Among the studies dealing with overweight or obesity, datasets are often small and contain limited information. There are also occasions where the decisions made may be questionable, such as the use of weight in the dataset.
The work presented here seeks to predict the risk of overweight and obesity in Madrid. With this aim, a binary classification approach with evolutionary feature selection is proposed. Hence, we provide the most relevant variables for the classification algorithm through an evolutionary process. A particular structure of the feature-selection process has been developed. Additionally, a high-quality dataset has been used, composed of detailed information about the habits of the individuals and their health.

3. Methodology

This section explains the methodology applied in this work and how the feature selection for the classification problem was performed using genetic algorithms. Figure 1 shows a diagram of the feature-selection process and the generation of the classification model.
In order to apply the methodology, we need to select three main items:
  • The machine-learning technique.
  • The dataset, which defines the features.
  • The FS method, and if it applies, the parameters of the genetic algorithm.
From the original dataset, a curation process is performed, which also defines the initial dataset to be used. After that, the FS process is applied, and the selected features will be used to train the ML algorithm. The best classification ML model will be chosen after analyzing the results, their accuracy, and the number of false negatives.
The objective of a GA is to find the best solution of a problem through the iterative transformation (using crossover and mutation operators [21]) of an initial set of potential solutions (population). For each solution (individual), its performance (fitness) is evaluated, and, based on this value, the fittest ones will have a higher probability of passing to the next iteration. After a certain number of iterations, one of the candidates will be selected as the solution to the problem.
Four key concepts need to be considered when designing GAs:
  • The encoding of the problem.
  • The size of the population and the initialization method.
  • The selection method including the fitness function.
  • The processes by which the changes are introduced in the next iteration, including the probabilities and parameters [22].
After the execution of the GA, the fittest individual of the last generation represents the set of features (variables) selected to train the ML models.
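To make this loop concrete, the following is a generic GA skeleton (our own illustration; the problem-specific encoding, fitness, and operators used in this work are described in the rest of this section):

```python
import random

def run_ga(init_individual, fitness, select, crossover, mutate,
           pop_size=50, generations=100):
    """Generic GA loop: evaluate the population, select parents, recombine, mutate, iterate."""
    population = [init_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [mutate(crossover(select(population, scores),
                                       select(population, scores)))
                      for _ in range(pop_size)]
    # The fittest individual of the final population is returned as the solution.
    return max(population, key=fitness)
```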

Evaluation Metrics

Table 1 describes a confusion matrix. It is a numeric matrix that shows the number of correct and incorrect predictions of the model for each class. The confusion matrix of Table 1 is for algorithms classifying data into two classes (binary classification): Positive and Negative. To construct it, the following values are computed:
  • Number of True Positives (TPs): the class assigned to the sample by the model is Positive and it is also the real class.
  • Number of False Negatives (FNs): the class assigned to the sample by the model is Negative, but the real class is Positive.
  • Number of False Positives (FPs): the class assigned to the sample by the model is Positive, but the real class is Negative.
  • Number of True Negatives (TNs): the class assigned to the sample by the model is Negative and it is also the real class.
  • Number of Total Samples (Total).
From those counts, we can derive different metrics to measure the model’s performance.
  • Accuracy: Percentage of data correctly classified.
    $\text{Accuracy} = \frac{TP + TN}{\text{Total}}$
  • Misclassification Rate: Percentage of misclassified data.
    $\text{Misclassification Rate} = \frac{FP + FN}{\text{Total}}$
  • Precision: The percentage of positive predictions that are correct.
    $\text{Precision} = \frac{TP}{TP + FP}$
  • Recall: True positive rate, the percentage of positive-class samples that are correctly identified.
    $\text{Recall} = \frac{TP}{TP + FN}$
Precision and recall can be associated with the positive and negative classes. In our case, we will focus on reducing the number of false negatives since these would be cases of overweight or obesity that our model does not detect. In other words, we seek to obtain a high recall of the positive class.
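As a small worked illustration of these definitions (our own example, not taken from the paper's results):

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, misclassification rate, precision, and recall from the four counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # true positive rate; the quantity we want high for the positive class
    }

# Toy counts: 70 overweight/obese subjects correctly flagged, 30 missed,
# 25 false alarms, and 75 correct negatives.
print(confusion_metrics(tp=70, fn=30, fp=25, tn=75))
```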
The GA selects the best set of features for prediction. A solution is represented by a binary string (chromosome) with as many positions as features available in the curated dataset. Each of the positions (genes) in the chromosome corresponds to a feature. The value of the gene indicates whether the feature is selected (1) or not (0) as a predictor variable. The initial population is generated randomly.
Since we have a balanced dataset, to evaluate an individual (a model) we used a classical cross-validation scheme with stratification [23]. Each model is trained using only the features expressed as 1s in the individual's genotype. The average accuracy rate of the 10 folds, F, is used as the fitness function:
$F = \frac{1}{10} \sum_{i=1}^{10} \text{Accuracy}_i$
where $\text{Accuracy}_i$ is the accuracy obtained for each one of the cross-validation folds [24].
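A sketch of how such a fitness evaluation can be wired up with scikit-learn follows; the classifier and the 10-fold setup are placeholders that mirror the description above, not the authors' exact code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fitness(individual, X, y):
    """Mean accuracy over stratified 10-fold CV, using only the features whose gene is 1."""
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():  # an individual selecting no features gets the worst possible fitness
        return 0.0
    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(
        model, X[:, mask], y,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
        scoring="accuracy",
    )
    return scores.mean()
```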
In this paper, we propose a variation of the Stud GA method and we compare its performance with a traditional tournament implementation.
The Stud selection method works as follows. First, the four best individuals of the generation are selected, and form the Stud candidates group. Second, the two best individuals pass to the following iteration without crossover. Finally, the rest of the population is completed by applying the crossover operator to a pair formed by: (i) a member of the Stud candidates group and (ii) another individual of the population in the event that the probability of crossover is met.
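The following sketch reflects our reading of this Stud selection step; the crossover operator is passed in as a callable, and the behavior when the crossover probability is not met is an assumption (the second parent is copied unchanged):

```python
import random

def stud_generation(population, fitness_values, p_cross, crossover):
    """Build the next generation with Stud selection: four studs, two elites kept as-is."""
    ranked = [ind for _, ind in sorted(zip(fitness_values, population),
                                       key=lambda pair: pair[0], reverse=True)]
    studs = ranked[:4]                 # Stud candidates group
    next_gen = [ranked[0], ranked[1]]  # the two best individuals pass unchanged
    while len(next_gen) < len(population):
        stud = random.choice(studs)
        mate = random.choice(population)
        if random.random() < p_cross:
            child = crossover(stud, mate)
        else:
            child = list(mate)         # assumption: no crossover -> copy the second parent
        next_gen.append(child)
    return next_gen
```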
For the tournament selection, we use a simple implementation with a selection pressure of five [25]. As usual in the literature [26], we use the term selection pressure for the size of the tournament pool. Adjusting this parameter allows us to find a trade-off between exploration and exploitation of the fitness landscape. A value of five prioritizes exploitation over exploration and worked well in the preliminary experiments with our datasets.
After the selection of the individuals, we apply a single-point crossover, choosing a point in the chromosome of the two selected individuals and generating one offspring. With this purpose, we combine the information from one of them up to the crossover point and complete it with the remaining information from the other individual.
A random mutation is introduced in the individual with a very low probability. This mutation affects a gene of the individual, flipping its value.
The parameters used for the GA were a crossover probability of 0.82, a mutation probability of 0.09, a population of 50 individuals, and 100 generations.
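For illustration, a sketch of the tournament selection with a pool of five, the single-point crossover, and the bit-flip mutation, using the parameter values just listed (illustrative code under our assumptions, not the authors' implementation):

```python
import random

P_CROSS, P_MUT, POP_SIZE, GENERATIONS = 0.82, 0.09, 50, 100

def tournament(population, fitness_values, pool_size=5):
    """Return the fittest of pool_size randomly drawn individuals (selection pressure = 5)."""
    contenders = random.sample(range(len(population)), pool_size)
    best = max(contenders, key=lambda i: fitness_values[i])
    return population[best]

def one_point_crossover(parent_a, parent_b):
    """Combine parent_a up to a random point with the remaining genes of parent_b."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def bit_flip(individual, p_mut=P_MUT):
    """With a low probability, flip the value of one randomly chosen gene."""
    child = list(individual)
    if random.random() < p_mut:
        gene = random.randrange(len(child))
        child[gene] ^= 1
    return child
```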

4. Experimental Setup

4.1. Dataset

The original dataset is the result of surveys carried out at the different centers of the GenObIA consortium, including universities and hospitals. The information collected in these surveys includes lifestyle, nutrition habits, and information about pathologies suffered by the person in the past.

4.1.1. Data Curation

The original dataset, Appendix A.2, Table A2, is composed of a total of 93 variables and 1179 subjects, among which we find:
  • One subject identifier;
  • Thirteen variables of general information about the subject such as weight, age, education, stress, etc.;
  • Seven variables related to alcoholic drinks, distinguishing between distilled and fermented drinks;
  • Seven variables on smoking habits, such as the number of cigarettes, pipes, and cigars and, for ex-smokers, the time since quitting;
  • Fifteen variables related to pathologies, such as types of cancer, sleep apnea, and type 2 diabetes mellitus, among others;
  • Thirty-four variables on nutritional habits, including information on the portions of different types of food and the points of adherence to the Mediterranean diet derived from these portions;
  • Sixteen variables related to physical exercise and its intensity.
The dataset is balanced in terms of the predicted variable, overweight/obesity (BMI ≥ 25), with 48% being obese/overweight individuals and 52% being non-obese/non-overweight individuals. Therefore, we considered it unnecessary to use classification techniques focused on imbalanced datasets.
In order to avoid repeated information that may introduce noise into the system, a reduction in the number of variables was performed. An example of such a reduction is the case of the variables referring to food intake, which were replaced by a single variable, namely adherence to the Mediterranean Diet (ADH). The original dataset contains a set of variables related to food: 16 associated with the servings and, derived from these, another 16 measuring adherence to the Mediterranean diet using points. If there are more than eight points in total, the subject is considered to have a high ADH.
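A minimal sketch of this reduction is shown below; the per-food column names are hypothetical placeholders, since the real dataset uses 16 such adherence-point columns:

```python
import pandas as pd

# Hypothetical names for the per-food adherence-point columns (the real dataset has 16 of them).
point_columns = ["adh_olive_oil", "adh_vegetables", "adh_fruit", "adh_legumes"]

def add_adherence(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-food adherence points into a single binary ADH variable."""
    df = df.copy()
    total_points = df[point_columns].sum(axis=1)
    df["ADH"] = (total_points > 8).astype(int)  # more than eight points -> high adherence
    return df.drop(columns=point_columns)
```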
In addition, some redundant variables were eliminated. For instance, the dataset initially contains two variables referring to exercise that were computed from a set of survey features: Cal_IPAQ, which reports the calories burned as a function of physical exercise; and IPAQ, which reports information on the exercise performed and its intensity. IPAQ takes into account the duration and intensity of exercise, weighting the value of sedentary, moderate, and vigorous exercise. Cal_IPAQ includes weight as a variable in its calculation and therefore cannot be used, since weight is also part of the formula for computing BMI. Hence, we use only IPAQ as a training variable.
Some features, such as the place or institution where the sample was obtained (center), were also removed from the dataset, since they showed a high correlation with BMI due to the differing nature of the populations at those places (police officers, sports teams, retired people, etc.).
After processing the data, which in our case was supervised by the medical staff that participated in the project, the total number of variables was reduced, from 93 to 41, as shown in Appendix A.1, Table A1. This table shows the different features of the study, providing its identifier, name, short description, and type.

4.1.2. Dataset with Pathologies

Two datasets were generated using the variables of Appendix A.1. One of them, called the dataset with pathologies, includes the variables related to pathologies. This dataset includes most of the features, in particular all the variables except variables 37 to 40. The objective of this dataset is to evaluate the classifiers with the standard data of the surveys, which include information on the health record of the person.

4.1.3. Dataset without Pathologies

There is a set of variables related to pathologies. When dealing with this type of variable, it is necessary to consider whether a pathology is a cause or a consequence of overweight/obesity. An example is the variable number 33, Apnea, which indicates whether the subject suffers from sleep apnea. Usually, overweight or obese people suffer from this problem. However, it is not necessarily true that because they suffer from sleep apnea, they are suffering from overweight/obesity. The same applies to other pathologies. In order to evaluate this kind of artifact, we create a new dataset, selecting all the variables in Appendix A.1, but excluding those of pathological type and variable 11.

5. Experimental Results

We performed experiments combining ten different algorithms as classifiers and four different feature-selection strategies (two evolutionary feature-selection methods, one feature selection from the literature, and no feature selection) on the two datasets explained above (With and Without Pathologies).
Table 2 and Table 3 present the experimental results. These tables contain one row for each configuration, identified by an acronym (ID), and 11 additional columns: the name of the algorithm (ALGORITHM), the feature-selection strategy (FS), the number of variables of the dataset (VARIABLES), the accuracy of the best solution (BEST), the accuracy of the worst solution (WORST), the mean (MEAN), and the standard deviation (STD) of 30 runs for the configuration in the row. In addition, the last four columns show the precision (PRECISION_0 and PRECISION_1) and recall values (RECALL_0 and RECALL_1) for class 0 (non-overweight/non-obese) and class 1 (overweight or obese) for the best solution with this algorithm.
The interpretation of the FS column is:
  • Stud-GA: Evolutionary feature selection with Stud selection operator.
  • Tournament-GA: Evolutionary feature selection with tournament selection operator.
  • RFECV: Feature selection with recursive feature elimination (RFE) with cross-validation (CV).
  • No-FS: No feature selection applied in the configuration.
As mentioned, ten classification algorithms were used, using the implementations available in the Scikit-learn Python library [27]:
  • Decision Tree (DT): Its objective is to create a model to predict the value of a target variable by learning simple decision rules from data characteristics.
  • Gradient Boosting (GB): An additive model is created in a stepwise way, allowing the optimization of arbitrary differentiable loss functions. At each step, a regression tree is fitted to the negative gradient of the given loss function.
  • Adaboost (ADB): A meta-estimator that starts by fitting a classifier on the original dataset and next fits additional copies of the classifier for the same dataset where the weights of the misclassified instances are adjusted in order to make the subsequent classifiers concentrate on the difficult cases.
  • Bagging (BG): Fits base classifiers on random subsets of the original dataset and aggregates their individual predictions into a final prediction.
  • Bernoulli Naive Bayes (BNB): This classifier is useful for discrete data and is designed to handle binary/boolean features.
  • Extra Trees (ET): A meta estimator that fits random Decision Trees on multiple subsamples of the dataset and uses the average to improve predictive accuracy and control overfitting.
  • Gaussian Naive Bayes (GNB): Another Naïve Bayes model. This classifier is used when the values of the predictors are continuous.
  • Logistic Regression (LR): This algorithm attempts to predict the probability that a given data entry will belong to a category. Just as linear regression assumes that the data follow a linear function, logistic regression models the data using the sigmoid function.
  • Random Forest (RFC): This technique fits several Decision Tree classifiers on multiple subsamples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
  • XGBoost (XGB): A tree boosting system that stands out for its scalability and is widely used by data scientists.
Appendix A.3, Table A3 provides a table with information on the parameters used for each model.
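For illustration, the battery of classifiers listed above can be assembled as follows; the hyperparameters shown are library defaults (plus an illustrative max_iter for Logistic Regression), not the values of Table A3, and XGBoost comes from the separate xgboost package rather than scikit-learn:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT": DecisionTreeClassifier(),
    "GB": GradientBoostingClassifier(),
    "ADB": AdaBoostClassifier(),
    "BG": BaggingClassifier(),
    "BNB": BernoulliNB(),
    "ET": ExtraTreesClassifier(),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(),
    "XGB": XGBClassifier(),
}
```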
Regarding evolutionary feature-selection methods, we focus their application on three models. The first one is Gradient Boosting [28], as it obtained consistently good results among the set of classifiers without feature selection. The second is XGBoost [29], a state-of-the-art machine-learning algorithm that allows us to gauge the goodness of the results of the other algorithms. The third is Decision Tree [30], because models based on trees provide solutions with a straightforward interpretation for the medical staff. The development of understandable solutions for medical doctors is one of the main objectives of our research. Understanding why a model makes a particular prediction can be as crucial as prediction accuracy in the medical field. In some cases, the best results are obtained with complex models that are difficult to interpret. Thanks to the SHAP library [31], it is possible to obtain an importance value for each feature of a particular predictor. The SHAP algorithm aims to explain the outcome of machine-learning models, representing the results by means of graphs; it is based on the Shapley values of game theory. In particular, in this paper we will focus on the graphs that allow us to show the impact of the different variables in the model (a minimal sketch of how such plots are produced is given after the lists below). To understand these graphs, two factors must be taken into account: the position of the points on the horizontal axis and the color. Let us take Figure 2 as an example.
  • The color of the dots denotes the numerical value of the variable. In the case of age, the redder the dot, the higher the age, and the bluer the dot, the lower the age. In the case of sex, the red color represents the female sex and the blue color represents the male sex. In the case of education, the red color represents higher levels of education, while the blue color represents people with low levels of education or no education.
  • The position of the points on the horizontal axis in our study indicates the probability of overweight/obesity. The further to the right the point is (positive values), the greater the probability that the person suffers from this problem. The further to the left (negative values), the lower the probability.
Thus, we can see as an example that Figure 2 provides us with the following information:
  • Age is an important factor for the probability of being overweight/obese. We can clearly see that higher ages (red dots) have a higher probability than lower ages (blue dots).
  • Gender is an unequivocal factor, with male gender (blue dots) being related to a higher probability of being overweight/obese while female gender (red dots) is similarly distributed but in the negative range.
  • Education is another important factor to take into account. Low levels of education (blue dots) have a higher probability while high levels (red dots) are related to a lower probability.
  • In the case of other variables, such as job, for example, we can see that the color of the points is intermixed, indicating no correlation with the probability of being overweight/obese.
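As referenced above, a sketch of how such SHAP summary plots are typically produced for a tree-based model; the model and data here are synthetic placeholders, and the plot in Figure 2 corresponds to the paper's RFC model rather than this toy example:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the curated dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values for tree ensembles; the summary (beeswarm)
# plot places each sample on the horizontal axis by its SHAP value and colors it
# by the feature value, which is how Figure 2 should be read.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```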

5.1. Results Using the Dataset with Pathologies

5.1.1. Results without Feature Selection

In the scenario with pathologies, Table 2 summarizes the results of the algorithms with and without feature selection (No-FS). The average accuracy rate is 0.6953, and the standard deviation is 0.0297, with DT, ADB, BNB, and GNB being below this average. The algorithms with the best results are GB and RFC, with a mean that is close to 0.74.
Figure 2 shows the impact of the different variables of the RFC model. The age variable is in first place: younger people, represented by the lowest values (blue), are mainly found in the left part of the graph, indicating a lower risk of being overweight/obese. In contrast, the higher values of this variable, corresponding to older people (red), indicate a higher risk of suffering from these health problems. In the case of sex, being male implies a higher probability of being classified in the overweight/obesity class. Third, the first variable associated with a type of pathology, apnea, appears. The variable apnea has all the blue points very close to the zero point, while the red points extend along the positive part of the axis. This indicates that subjects suffering from this pathology are likely to be overweight/obese, but otherwise the variable does not have a significant impact on the prediction.

5.1.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)

As shown in Table 2, using RFECV, an average accuracy rate of 0.7324 was achieved, reaching a maximum of 0.7797. A total of 17 variables were selected, including stress, some of the variables related to alcoholic drinks, education, time since quitting smoking, and physical exercise. The most frequently selected variables, such as sex, age, and apnea, are also among the variables obtained. Figure 3 shows the evolution of the accuracy rate for the different algorithms and datasets, taking the average of the cross-validation runs with RFECV in relation to the number of variables chosen. The average accuracy rate increases until it reaches 17 variables and then starts to decrease, probably due to the selection of variables that introduce noise.

5.1.3. Results for Gradient Boosting with Evolutionary Feature Selection

The results presented below are obtained with GB using evolutionary FS.
  • GA with Stud selection for Gradient Boosting (GB-S): In this case, the selection method reduced the number of variables used from 37 to 19, keeping its classification rate, and even obtaining a slight improvement, reaching an average accuracy of 0.7382, as shown in Table 2. Three of the variables selected were age, sex, and apnea, these being the most frequently chosen. Other variables to be highlighted are those related to smoking, education, earning, and adherence to the Mediterranean diet.
Figure 4 shows the graph corresponding to the Gradient Boosting model with Stud selection for the dataset with pathologies. The first variables that appear are age, sex, and apnea (as in the previous cases). In the case of education, it is shown that a lower level of education increases the probability of being overweight or obese. On the other hand, those individuals who suffer from metabolic syndrome are also more likely to be overweight or obese, but as in the case of apnea, if the individual does not suffer from this pathology, the variable does not have such a strong impact. Again, this can be seen in that the blue dots are clustered next to the zero point, while the red dots are spread over the positive values.
  • GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method with selection pressure of five, 23 variables were selected by the GA, achieving an average accuracy of 0.7332, as can be found in Table 2. Similar to the previous case, among the features selected, some of the most common ones are sex, age, adherence to the Mediterranean diet, and some new additions, which were variables related to heart disease and different types of cancer. Other variables to note are the appearance of distilled/fermented beverages, education, earnings, and stress.

5.1.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)

Using RFECV, a total of 13 variables were selected, reaching an accuracy rate of 0.6962 and, in the best case, up to 0.7559, as shown in Table 2. Some of the variables are education, alcoholic drinks, physical exercise, and diabetes, among others. Again, the most common variables are included: sex, age, and apnea. Figure 3 shows the evolution of the accuracy rate; in this case, the maximum is reached with 13 variables, and the accuracy decreases when the number of variables increases.

5.1.5. Results for Decision Tree with Evolutionary Feature Selection

The results obtained with DT using evolutionary FS are presented below.
  • GA with Stud selection for Decision Tree (DT-S): Fourteen variables were selected, obtaining an average accuracy of 0.7150 and, in the best case, up to 0.7661, as can be verified in Table 2. Among the variables used were sex, age, and seven variables related to pathologies.
Figure 5 shows the impact of the different variables in the model Stud with Decision Tree with pathologies. Once again, age, sex and apnea are at the top of the list. In this case, diabetes appears with a similar distribution to apnea, although with less impact. The appearance of fermented beverages is also interesting.
  • GA with tournament selection for Decision Tree (DT-T): Thanks to the feature selection, with only 16 variables, an accuracy reaching 0.7492 was achieved in the best cases and an average of 0.7103, as shown in Table 2.
Some of the variables used for this case are metabolic syndrome, those related to cancer, fermented drinks, specifically those related to wine, and the variable heart_angina. Moreover, as in the previous cases, sex and age appear.
Figure 6 shows a heat map representing the frequency with which the different FS models (columns) selected each of the variables (rows). Depending on the color of each cell, it is possible to get an idea of the number of times a variable was selected, with the cool colors being those cases with the lowest number of occurrences and the warm colors being those that appeared most frequently. In the case of the Decision Tree, there is greater diversity in the variables chosen by the models, since the colors are not so intense, unlike Gradient Boosting. Despite the differences between the variables chosen for the models, the extremes (top and bottom) of the heat map show a similar range of colors in most cases. The most frequent variables were sex, age, education, apnea, alcoholic beverages, both fermented and distilled, and metabolic syndrome.

5.1.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)

This time the average accuracy rate achieved was 0.7085, reaching a maximum of 0.7424, as shown in Table 2. A total of 20 variables were used, including sex, age, education, metabolic syndrome, apnea, IPAQ, and different variables associated with alcoholic beverages, among others. Figure 3 shows the evolution of the accuracy rate with the number of variables chosen. In this case, the maximum value is obtained with 20 variables.

5.1.7. Results for XGBoost with Evolutionary Feature Selection

The results obtained by applying the evolutionary FS to XGBoost are presented below.
  • GA with Stud selection for XGBoost (XGB-S): As shown in Table 2, an average accuracy rate of 0.7216 and a maximum of 0.7831 were achieved using 20 variables. Similar to other cases, variables such as sex, age, edu, and those related to alcoholic beverages appear. Nine of the selected variables belong to the group of pathologies. Among them, we find several types of cancer, apnea, diabetes, and metabolic syndrome.
  • GA with tournament selection for XGBoost (XGB-T): In this case, using a total of 19 variables, a mean accuracy ratio of 0.7029 was obtained, and a maximum of 0.7559 was reached, as seen in Table 2. Among the most common variables, we found sex, edu, and apnea, but age did not appear. In addition, five variables related to smoking and up to nine related to pathologies were selected.

5.2. Results without Pathologies

A new batch of experiments was performed with the dataset without pathologies, using the same algorithms as in the previous section. The results with this new dataset are worse than those obtained using the dataset with pathologies. It may indicate that the models obtained using the dataset with pathologies gave more importance to these variables, which may be considered a consequence of overweight/obesity rather than a cause.

5.2.1. Results without Feature Selection

After testing the dataset without pathologies with the algorithms without feature selection, a reduction in the accuracy rate of the models was observed with respect to the results with pathologies, obtaining an average accuracy rate of 0.6825 and a standard deviation of 0.0408. RFC obtained the highest result, reaching 0.7331 accuracy, as can be seen in Table 3.
Figure 7 shows the impact of the variables for the RFC model, using the dataset without pathologies. Age and sex maintain the highest positions with the same characteristics seen in previous cases. Among the variables referring to eating habits, vegetables and soda stand out. The values of vegetables seem to have an inversely proportional relationship with the possibility of being overweight or obese while in the case of soda this relationship is direct. In Figure 8, we can see the Pearson Correlation Matrix of the variables in Figure 7. As we can see, no significant correlation can be appreciated between variables with the exception of smoke (smoker or non-smoker) and nsmoke (number of cigarettes per day).

5.2.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)

In this configuration, the number of variables selected was 9, with an average accuracy rate of 0.7171 and a best case of 0.7695, as shown in Table 3. Legume intake, vegetable intake, physical exercise, and time as an ex-smoker are some of the selected variables. On the other hand, there are also the more common ones, such as sex, age, and education. In Figure 3, the accuracy reaches its maximum value at nine variables.

5.2.3. Results for Gradient Boosting with Evolutionary Feature Selection

The results obtained using GB with evolutionary FS are presented below.
  • GA with Stud selection for Gradient Boosting (GB-S): Using only 12 variables, it was possible to achieve an average accuracy of 0.7295 and 0.7797 for the best case, a slight improvement compared to the version without feature selection, as shown in Table 3. In this new set of variables, sex and age continue to dominate, similarly to the set with pathologies. Other variables used are those related to hours of sleep, fermented/distilled drinks, soft drinks, legume portions, and education.
The graph of the impact of the variables corresponding to the Gradient Boosting model with Stud selection for the case without pathologies is shown in Figure 9. In the highest positions we find sex, age, education, and soda with the same performance as above. A new variable to highlight is legumes, whose highest values appear in the left zone of the graph. As for alcoholic beverages, we find apparent differences between distilled and fermented beverages; for the higher values of the variable wineWEEK, the probability of being overweight/obese is lower. In the case of spirits, it seems that their intake may be associated with overweight/obesity.
  • GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method, the number of variables used was also 12 with an average accuracy of 0.7307 and for the best-case scenario up to 0.7763, as shown in Table 3. New variables were included: adherence to the Mediterranean diet and the population. Other variables seen previously such as education, soft drinks, distilled/fermented beverages, sex, age, and hours of sleep, among others, also appear.

5.2.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)

For this case, RFECV selected a single variable, age, achieving an average accuracy rate of 0.6821 and a maximum of 0.7186, as shown in Table 3. Classical feature-selection algorithms tend to produce solutions with very few variables; in our case, we are looking for models that explain more extensively the causes of overweight and obesity, so we explore other solutions, such as those based on evolutionary algorithms, that allow us to obtain models with a greater number of variables and similar accuracy. Looking at Figure 3, the accuracy reaches its maximum value with only one variable, but here we have to consider two facts:
  • This value is not very far from the case of using all the variables;
  • From a medical point of view, the use of a single variable model does not provide any help to clinical practice.
Figure 3 shows the variation of the cross-validation score (inside RFECV strategy) for the different algorithms for both the pathology and non-pathology cases. This value is of little interest if we look at the small difference in the values on the horizontal axis and at the fact that we are comparing two different datasets. What really interests us is to see the differences in the number of selected features (peaks marked with a red dot) and how these values vary depending on the algorithms and the datasets.

5.2.5. Results for Decision Tree with Evolutionary Feature Selection

The following results were obtained with the DT using the evolutionary GA.
  • GA with Stud selection for Decision Tree (DT-S): Using the Stud selection method, a total of 14 variables were chosen, with an average accuracy of 0.6934 and 0.7322 for the best case, as can be seen in Table 3. In this case, the age variable was selected but not the sex variable. The variables chosen include education, population, hours of sleep, alcoholic drinks, soft drinks, legume portions, vegetable portions, and the stress variable as a new incorporation.
Figure 10 shows the impact of the different variables for the Decision Tree model with Stud for the set without pathologies. Age appears first again; the next variable is education, and it is observed that, for higher values (higher levels of studies), the probability of being overweight/obese is lower. In the case of soft drinks, higher consumption implies a higher probability of being overweight or obese. Another variable to note is vegetable, where most of the red dots are on the left side of the graph, so higher vegetable intake can be associated with a lower probability of being overweight/obese. It would be interesting to study in the future the causes of the behaviors of the variables stress and ex-smoker.
  • GA with tournament selection for Decision Tree (DT-T): In this case, the number of variables used was 15, obtaining an average accuracy of 0.6862 and reaching 0.7525 in the best case, as shown in Table 3. Among the variables chosen are portions of legumes, fermented drinks, the time without smoking of an ex-smoker, and the IPAQ, the variable that reflects physical exercise. Other variables found are the most common, such as sex, age, soft drinks, and education, among others.
Figure 11 shows the heat map obtained by representing the frequency of variables selected from the dataset without pathologies using the FS models with genetic algorithms. There is a closer correlation between the variables selected with the GB and DT models, compared to the case of the dataset with pathologies. Again the most commonly selected variables are age, sex, and education followed by alcoholic beverages. As a new addition among the most selected variables, we find the variable soda.

5.2.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)

With only two variables, sex and age, an average accuracy rate of 0.6850 and a maximum of 0.7186 were obtained, as shown in Table 3. The evolution of the accuracy rate as a function of the number of variables used is shown in Figure 3. From a medical point of view, the selection of only these two variables does not provide any information.

5.2.7. Results for XGBoost with Evolutionary Feature Selection

Results obtained with the XGBoost using the evolutionary GA were as follows.
  • GA with Stud selection for XGBoost (XGB-S): As shown in Table 3, a mean accuracy rate of 0.6989 and a maximum of 0.7424 were achieved using 12 variables. In this case, among the variables selected, we found some of the most common ones, such as sex, age, edu, or soda. In addition, there are variables such as vegetable or those related to alcoholic beverages.
  • GA with tournament selection for XGBoost (XGB-T): Using the 11 selected variables, an average accuracy rate of 0.6948 and a maximum of 0.7288 were obtained, as can be seen in Table 3. In addition to the most common variables, such as sex, age, education, or soda, we can find variables related to alcoholic beverages, both distilled and fermented, vegetable intake, and earning.
Table 4 gives a sample of the time in seconds taken by the GA for the different algorithms and datasets. One run of each type was carried out individually (no other program was running in parallel) on the same machine to obtain these values. The time is strongly dependent on the algorithm chosen.

6. Statistical Analysis

We performed a nonparametric statistical analysis once the results were obtained for the two proposed problems, with and without pathologies. The Friedman test [32] is a nonparametric [33] alternative to the two-way parametric analysis of variance that tries to detect significant differences between the behavior of two or more algorithms. It can be used to identify if, in a set of k samples (where k ≥ 2), at least two of the samples represent populations with different median values. The first step is to convert the original results, in Table 5, into ranks to produce Table 6. Table 5 shows the average accuracy rate for each of the algorithms on the two datasets.
  • Gather observed results for each algorithm/problem pair.
  • For each problem i, rank values from 1 (best result) to k (worst result). Denote these ranks as:
    $r_i^j, \quad 1 \le j \le k$
    where k is the number of algorithms.
  • For each algorithm j, average the ranks obtained in all problems to obtain the final rank:
    $R_j = \frac{1}{n} \sum_{i=1}^{n} r_i^j$
    where n is the number of datasets.
In Table 6, the best algorithms are Gradient Boosting with Stud selection method, Gradient Boosting with tournament method, and Random Forest without feature selection. In both problems, algorithms with feature selection outperform their traditional version.
Under the null hypothesis (H0), which states that all algorithms behave similarly and therefore their ranks $R_j$ should be equal, the Friedman statistic $F_f$ can be calculated as:
$F_f = \frac{12 n}{k (k + 1)} \left[ \sum_{j} R_j^2 - \frac{k (k + 1)^2}{4} \right]$
which is distributed according to a $\chi^2$ distribution with k − 1 degrees of freedom (here k = 19) [33]. As usual, we define the p-value as the probability of obtaining a result as extreme as or more extreme than the observed one, provided that the null hypothesis is true, and we have chosen a significance threshold of 0.05. For this threshold, the critical value of the $\chi^2$ distribution with 18 degrees of freedom is 28.8693. To reject H0, the statistic must exceed this value. Applying the formula of Friedman's statistic, Equation (7), a value of 34.6421 is obtained, so the critical value is exceeded and the hypothesis H0 is rejected, confirming that there are significant differences between the models.
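A small sketch of this computation from a matrix of accuracies (our own illustration with toy numbers; note that ties are broken arbitrarily here, whereas the standard Friedman test assigns average ranks to ties):

```python
import numpy as np
from scipy.stats import chi2

def friedman_statistic(results):
    """Friedman F_f from an (n datasets x k algorithms) matrix where higher values are better."""
    n, k = results.shape
    # Rank within each dataset: 1 for the best value, k for the worst (ties broken arbitrarily).
    ranks = np.argsort(np.argsort(-results, axis=1), axis=1) + 1
    mean_ranks = ranks.mean(axis=0)  # the R_j values
    f_f = 12 * n / (k * (k + 1)) * (np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4)
    critical = chi2.ppf(0.95, df=k - 1)
    return f_f, critical

# Toy example with 2 datasets and 4 algorithms (not the paper's values).
accuracies = np.array([[0.74, 0.73, 0.70, 0.69],
                       [0.72, 0.73, 0.69, 0.68]])
print(friedman_statistic(accuracies))
```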

7. Conclusions and Future Work

In this work, a set of classification systems for persons at risk of suffering from overweight/obesity has been developed. Four different FS strategies have been employed for the two experimental datasets: one FS method from the literature, two evolutionary FS methods, and no FS. Ten machine-learning models have been employed. For the application of the feature-selection methods, we chose three of these ten models, based on reasons of performance and possible application to medical clinical practice. Thus, the FS methods were successfully applied to the Gradient Boosting, XGBoost, and Decision Tree algorithms. The most important conclusions of this work are:
  • Although not a surprising finding, we have found GA to be a very competent tool to perform feature selection and thus improve the training of classification models.
  • The Stud selection, which uses an elitist set that is always part of the crossover process, achieves promising results.
  • If we look at the fitness of the best individual in each of the final populations, they all maintain a similar standard deviation, perhaps indicating that we might expect to obtain similar results in the future with new data sets.
  • Regarding the variables related to pathologies, it is necessary to identify what is the cause and what is the effect. A good example is sleep apnea. On several occasions, models have relied on apnea when classifying, but this disorder is in many cases caused by overweight or obesity. A person may have apnea due to being overweight or obese but will not necessarily be overweight or obese due to suffering from apnea.
  • Finally, significant differences were found among the algorithms, with Gradient Boosting with feature selection being the one obtaining the best results.
The models developed in this work will be the basis of a recommendation system. This system will be able to warn people about behavioral tendencies that will end up producing overweight and obesity and recommend healthy habits to replace them.

Future Work

Although the overall accuracy of the model is important, from a medical point of view, classifying an individual at risk of overweight/obesity as not at risk is more detrimental than classifying a healthy person as at risk. The model must take this into account, so it would be interesting to find different fitness functions or develop a multi-objective evolutionary algorithm that could increase accuracy and reduce the number of false negatives for at-risk individuals. Therefore, we intend to test with precision and recall as fitness functions instead of accuracy.
In this work, and for space reasons, we have only used accuracy as the fitness function. It therefore remains as future work to investigate other fitness functions, as well as heuristic search or dynamic selection, which have already been tested in the literature of other fields [34,35].
It is also necessary to study carefully the parameters of the evolutionary algorithm, testing with different combinations of population size and number of generations.
It would be highly recommended to increase the volume of the dataset as it could improve the accuracy of the models. As part of the project, genetic information of individuals will be incorporated, so it will be necessary to perform a study on the impact of these variables and their possible interaction with the current ones.
In this study we achieved an accuracy close to 0.8. Can these results be considered good enough? Can they be improved? Are they unacceptable? Although the members of our research team who are specialists in medicine consider these results good, we lack a context with a clear metric to determine where the minimum acceptable level lies and how high we can go. Establishing this scale remains ambitious future work.

Author Contributions

Conceptualization: J.I.H., J.M.V. and J.J.Z.-L.; methodology: J.J.Z.-L., D.P., O.G. and J.I.H.; software: D.P. and A.G.-G.; validation: O.G., J.J.Z.-L. and J.M.V.; formal analysis: J.J.Z.-L., N.d.l.H. and J.I.H.; investigation: D.P. and K.Z.-N.; resources: K.Z.-N., J.J.Z.-L., N.d.l.H. and J.I.H.; data curation: K.Z.-N., D.P., A.G.-G. and N.d.l.H.; writing—original draft preparation: D.P. and A.G.-G.; writing—review and editing: J.M.V., O.G. and J.I.H.; visualization: D.P., A.G.-G. and J.M.V.; supervision: J.I.H.; project administration: J.I.H. and J.J.Z.-L.; funding acquisition: J.J.Z.-L. and J.I.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financed by the regional government of Madrid through project B2017/BMD3773 (GenObIA-CM), co-financed by the EU Structural Funds, and by the Spanish Ministry of Economy and Competitiveness under grants RTI2018-095180-B-I00 and PID2021-125549OB-I00.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Clinical Research Ethics Committee of the Community of Madrid (Comité Ético de Investigación Clínica de la Comunidad de Madrid) and the Regional Clinical Research Ethics Committee (Comité Ético de Investigación Clínica-Regional, CEIC-R). Genetic analyses were always carried out in compliance with the provisions of the Spanish Biomedical Research Law (Law 14/2007) and the Personal Data Protection Law (Law 15/1999).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data that support the findings of this study are available on reasonable request from the corresponding author [J.I. Hidalgo]. The data are not publicly available due to legal restrictions.

Acknowledgments

We would also like to thank the centers that provided the data and made this work possible: Atención Primaria, Hospital Clínico San Carlos, Hospital Universitario 12 de Octubre, Hospital Universitario La Paz, Hospital General Universitario Gregorio Marañón, Hospital Universitario Ramón y Cajal, and Hospital Universitario Infanta Leonor.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Representation of Project Variables

Table A1. Representation of project variables.
ID | Variable | Description | Type
1 | sex | Sex of the person | General information
2 | age | Age of the person in years | General information
3 | pop | Volume of the population where the person resides | General information
4 | edu | Academic level attained by the person | General information
5 | earning | Income level of the person | General information
6 | job | Work type performed by the person | General information
7 | stress | Person self-perceived stress | General information
8 | sleep.8 | The person sleeps more than eight hours | General information
9 | spirit | The person drinks spirits | General information
10 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks
11 | wine_beer | The person drinks beer or wine | Alcoholic drinks
12 | beerWEEK | Units of beer per week | Alcoholic drinks
13 | wineWEEK | Units of red wine per week | Alcoholic drinks
14 | whiteWEEK | Units of white wine per week | Alcoholic drinks
15 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks
16 | smoke | The person smokes | Tobacco
17 | nsmoke | Cigarettes consumed per day | Tobacco
18 | pipe | Pipe tobacco consumed per day | Tobacco
19 | cigar | Cigars consumed per day | Tobacco
20 | exsmokerY | Time since a smoker quit smoking in years | Tobacco
21 | exsmokerUNK | The person has given up smoking but does not remember how long ago | Tobacco
22 | cancer | The person has suffered or suffers from cancer | Pathologies
23 | cancer_mam | The person has suffered or suffers from breast cancer | Pathologies
24 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies
25 | cancer_pros | The person has suffered or suffers from prostate cancer | Pathologies
26 | cancer_lung | The person has suffered or suffers from lung cancer | Pathologies
27 | cancer_other | The person has suffered or suffers from another type of cancer | Pathologies
28 | heart_attack | The person has suffered an acute myocardial infarction | Pathologies
29 | heart_angina | The person has suffered angina pectoris | Pathologies
30 | heart_failure | The person has suffered heart failure | Pathologies
31 | diabetes | The person has type 2 diabetes mellitus | Pathologies
32 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies
33 | apnea | The person suffers from sleep apnea | Pathologies
34 | asthma | The person has asthma | Pathologies
35 | COPD | The person suffers from chronic obstructive pulmonary disease | Pathologies
36 | ADH | The person has adherence to the Mediterranean diet | Nutritional habits
37 | vege | Servings of vegetables consumed by the individual per day | Nutritional habits
38 | soda | Servings of carbonated and/or sweetened drinks consumed by the subject per day | Nutritional habits
39 | legume | Servings of legumes consumed by the subject per week | Nutritional habits
40 | milk | Servings of milk or dairy products consumed by the subject per day | Nutritional habits
41 | IPAQ | Subject scores on the International Physical Activity Questionnaire (IPAQ) | Physical exercise

Appendix A.2. Representation of All Variables in the Original Data Set

Table A2. Representation of all variables in the original data set.
ID | Variable | Description | Type
1 | n | Inclusion number | General information
2 | center | Center | General information
3 | sex | Sex of the person | General information
4 | age | Age of the person in years | General information
5 | height | Height (m) | General information
6 | weight | Weight (kg) | General information
7 | IMC | BMI | General information
8 | waist | Waist circumference (cm) | General information
9 | pop | Volume of the population where the person resides | General information
10 | edu | Academic level attained by the person | General information
11 | earning | Income level of the person | General information
12 | job | Work type performed by the person | General information
13 | stress | Person self-perceived stress | General information
14 | sleep.8 | The person sleeps more than eight hours | General information
15 | spirit | The person drinks spirits | Alcoholic drinks
16 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks
17 | wine_beer | The person drinks beer or wine | Alcoholic drinks
18 | beerWEEK | Units of beer per week | Alcoholic drinks
19 | wineWEEK | Units of red wine per week | Alcoholic drinks
20 | whiteWEEK | Units of white wine per week | Alcoholic drinks
21 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks
22 | smoke | The person smokes | Tobacco
23 | nsmoke | Cigarettes consumed per day | Tobacco
24 | pipe | Pipe tobacco consumed per day | Tobacco
25 | cigar | Cigars consumed per day | Tobacco
26 | exsmokerY | Time since a smoker quit smoking (years) | Tobacco
27 | exsmokerM | Time since a smoker quit smoking (months) | Tobacco
28 | exsmokerUNK | The person has given up smoking but does not remember how long ago | Tobacco
29 | cancer | The person has suffered or suffers from cancer | Pathologies
30 | cancer_mam | The person has suffered or suffers from breast cancer | Pathologies
31 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies
32 | cancer_pros | The person has suffered or suffers from prostate cancer | Pathologies
33 | cancer_lung | The person has suffered or suffers from lung cancer | Pathologies
34 | cancer_other | The person has suffered or suffers from another type of cancer | Pathologies
35 | heart_attack | The person has suffered an acute myocardial infarction | Pathologies
36 | heart_angina | The person has suffered angina pectoris | Pathologies
37 | heart_failure | The person has suffered heart failure | Pathologies
38 | diabetes | The person has type 2 diabetes mellitus | Pathologies
39 | hemo | Glycosylated hemoglobin (%) | Pathologies
40 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies
41 | apnea | The person suffers from sleep apnea | Pathologies
42 | asthma | The person has asthma | Pathologies
43 | COPD | The person suffers from chronic obstructive pulmonary disease | Pathologies
44 | ADH | The person has adherence to the Mediterranean diet | Nutritional habits
45 | ADH_tot | Total ADH points | Nutritional habits
46 | olive | Use olive oil | Nutritional habits
47 | n_olive | Use olive oil (POINTS) | Nutritional habits
48 | tot_olive | Tablespoons of olive oil consumed in total per day | Nutritional habits
49 | ntot_olive | Tablespoons of olive oil consumed in total per day (POINTS) | Nutritional habits
50 | vege | Servings of vegetables consumed per day | Nutritional habits
51 | n_vege | Servings of vegetables consumed per day (POINTS) | Nutritional habits
52 | fruit | Pieces of fruit (including natural juice) consumed per day | Nutritional habits
53 | n_fruit | Pieces of fruit (including natural juice) consumed per day (POINTS) | Nutritional habits
54 | burger | Red meat portions | Nutritional habits
55 | n_burger | Red meat portions (POINTS) | Nutritional habits
56 | cream | Servings of butter, margarine or cream consumed per day | Nutritional habits
57 | n_cream | Servings of butter, margarine or cream consumed per day (POINTS) | Nutritional habits
58 | soda | Glasses of carbonated and/or sweetened beverages per day | Nutritional habits
59 | n_soda | Glasses of carbonated and/or sweetened beverages per day (POINTS) | Nutritional habits
60 | wine_week | Wine consumed per week | Nutritional habits
61 | n_wine_week | Wine consumed per week (POINTS) | Nutritional habits
62 | legume | Servings of legumes per week | Nutritional habits
63 | n_legume | Servings of legumes per week (POINTS) | Nutritional habits
64 | fish | Servings of fish or seafood consumed per week | Nutritional habits
65 | n_fish | Servings of fish or seafood consumed per week (POINTS) | Nutritional habits
66 | cake | Times per week consuming commercial bakery products | Nutritional habits
67 | n_cake | Times per week consuming commercial bakery products (POINTS) | Nutritional habits
68 | nuts | Servings of nuts and dried fruit consumed per week | Nutritional habits
69 | n_nuts | Servings of nuts and dried fruit consumed per week (POINTS) | Nutritional habits
70 | chicken | Preferably consumes chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages | Nutritional habits
71 | n_chicken | Preferably consumes chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages (POINTS) | Nutritional habits
72 | sauce | Times a week eating cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil | Nutritional habits
73 | n_sauce | Times a week eating cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil (POINTS) | Nutritional habits
74 | milk | Milk or dairy products (yogurts, cheese) consumed per day | Nutritional habits
75 | n_milk | Milk or dairy products (yogurts, cheese) consumed per day (POINTS) | Nutritional habits
76 | milk_light | The person takes skimmed dairy products | Nutritional habits
77 | n_milk_light | The person takes skimmed dairy products (POINTS) | Nutritional habits
78 | IPAQ | IPAQ points | Physical exercise
79 | cal_IPAQ | IPAQ calories | Physical exercise
80 | exercise_H | Days of intense physical exercise | Physical exercise
81 | exercise_H_mets | Days of intense physical exercise (METs) | Physical exercise
82 | exercise_H_min | Intense physical exercise in one day (minutes) | Physical exercise
83 | exercise_H_tot | Not sure about the time of intense physical exercise in one day | Physical exercise
84 | exercise_L | Days of moderate physical exercise | Physical exercise
85 | exercise_L_mets | Days of moderate physical exercise (METs) | Physical exercise
86 | exercise_L_min | Moderate physical exercise in one day (minutes) | Physical exercise
87 | exercise_L_tot | Not sure about the time of moderate physical exercise in one day | Physical exercise
88 | exercise_walk | Days of sedentary physical exercise | Physical exercise
89 | exercise_walk_mets | Days of sedentary physical exercise (METs) | Physical exercise
90 | exercise_walk_min | Sedentary physical exercise in one day (minutes) | Physical exercise
91 | exercise_walk_tot | Not sure about the time of sedentary physical exercise in one day | Physical exercise
92 | exercise_sit_min | Time spent sitting during a day (minutes) | Physical exercise
93 | exercise_sit | The person is not sure how much time was spent sitting during a day (minutes) | Physical exercise

Appendix A.3. Models Parameters

Table A3. Details of the model parameters. Only the parameters explicitly listed for each model in the original table are shown; fields marked with a dash in the original table are omitted.
MODEL | Parameters
XGB | Objective = binary:logistic, Learning_Rate = None, Max_Depth = None, n_estimators = 100
GB | ccp_Alpha = 0.0, Criterion = friedman_mse, Learning_Rate = 0.1, Loss = deviance, Max_Depth = 3, Max_Features = None, n_estimators = 100
DT | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = 6, Max_Features = None, Splitter = best
RFC | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = None, Max_Features = auto, n_estimators = 100, Bootstrap = True
ADB | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Learning_Rate = 1.0, Max_Depth = None, Max_Features = None, n_estimators = 500, Splitter = best, Algorithm = SAMME, Base_Estimator = DecisionTreeClassifier
LR | Class_Weight = balanced, Max_iter = 100, Solver = lbfgs, Tol = 0.0001, Penalty = l2
ET | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = None, Max_Features = auto, n_estimators = 250, Bootstrap = False
GNB | Priors = None, Var_Smoothing = 1 × 10−9
BNB | ccp_Alpha = 1.0, Binarize = 0.0, Fit_Prior = True
BG | Max_Features = auto, n_estimators = 250, Bootstrap = True
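For readers who want to reproduce the experimental setup, the following minimal sketch shows how two of the configurations in Table A3 (Gradient Boosting and Decision Tree) might be instantiated with scikit-learn. The parameter mapping is our reading of the table, the variable names are ours, and loss="deviance" assumes a scikit-learn release in which that spelling is still accepted (newer releases call it "log_loss"); it is not the authors' original code.

# Illustrative sketch only: instantiating the Gradient Boosting and Decision Tree
# classifiers with the hyperparameters listed in Table A3.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

gb = GradientBoostingClassifier(
    ccp_alpha=0.0,
    criterion="friedman_mse",
    learning_rate=0.1,
    loss="deviance",      # renamed to "log_loss" in scikit-learn >= 1.1
    max_depth=3,
    max_features=None,
    n_estimators=100,
)

dt = DecisionTreeClassifier(
    ccp_alpha=0.0,
    class_weight=None,
    criterion="gini",
    max_depth=6,
    max_features=None,
    splitter="best",
)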

References

  1. Keys, A.; Fidanza, F.; Karvonen, M.J.; Kimura, N.; Taylor, H.L. Indices of relative weight and obesity. J. Chronic Dis. 1972, 25, 329–343.
  2. Spanish Ministry of Health (Ministerio de Sanidad, Consumo y Bienestar Social). Encuesta Nacional de Salud. España 2017. Available online: https://www.mscbs.gob.es/estadEstudios/estadisticas/encuestaNacional/encuestaNac2017/ENSE2017_notatecnica.pdf (accessed on 15 January 2021).
  3. World Health Organization. Obesity: Preventing and Managing the Global Epidemic; World Health Organization: Geneva, Switzerland, 2000; 252p.
  4. Khatib, W.; Fleming, P.J. The Stud GA: A mini revolution? In Proceedings of the Parallel Problem Solving from Nature—PPSN V, Amsterdam, The Netherlands, 27–30 September 1998; Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.P., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 683–691.
  5. El Naqa, I.; Murphy, M.J. What is machine learning? In Machine Learning in Radiation Oncology; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3–11.
  6. De Prado, M.L. Advances in Financial Machine Learning; John Wiley & Sons: Hoboken, NJ, USA, 2018.
  7. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4.
  8. Braga-Neto, U. Fundamentals of Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2020.
  9. Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artif. Intell. Med. 2001, 23, 89–109.
  10. Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare 2022, 10, 541.
  11. Pirgazi, J.; Alimoradi, M.; Abharian, T.E.; Olyaee, M.H. An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Sci. Rep. 2019, 9, 18580.
  12. Chandrashekar, G.; Sahin, F. A survey on feature-selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
  13. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
  14. Reunanen, J. Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 2003, 3, 1371–1382.
  15. Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett. 1994, 15, 1119–1125.
  16. Misra, P.; Yadav, A.S. Improving the classification accuracy using recursive feature elimination with cross-validation. Int. J. Emerg. Technol. 2020, 11, 659–665.
  17. Kumar, G.R.; Ramachandra, G.; Nagamani, K. An efficient feature selection system to integrating SVM with genetic algorithm for large medical datasets. Int. J. 2014, 4, 272–277.
  18. Reddon, H.; Gerstein, H.C.; Engert, J.C.; Mohan, V.; Bosch, J.; Desai, D.; Bailey, S.D.; Diaz, R.; Yusuf, S.; Anand, S.S.; et al. Physical activity and genetic predisposition to obesity in a multiethnic longitudinal study. Sci. Rep. 2016, 6, 18672.
  19. Chatterjee, A.; Gerdes, M.W.; Martinez, S.G. Identification of Risk Factors Associated with Obesity and Overweight—A Machine Learning Overview. Sensors 2020, 20, 2734.
  20. Muhamad Adnan, M.H.B.; Husain, W.; Abdul Rashid, N. A hybrid approach using Naïve Bayes and Genetic Algorithm for childhood obesity prediction. In Proceedings of the 2012 International Conference on Computer Information Science (ICCIS), Chongqing, China, 17–19 August 2012; Volume 1, pp. 281–285.
  21. Mirjalili, S. Genetic algorithm. In Evolutionary Algorithms and Neural Networks; Springer: Berlin/Heidelberg, Germany, 2019; pp. 43–55.
  22. Affenzeller, M.; Winkler, S.; Wagner, S.; Beham, A. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications; Chapman and Hall/CRC Publishers: London, UK, 2009.
  23. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI'95, Montreal, QC, Canada, 20–25 August 1995; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; Volume 2, pp. 1137–1143.
  24. Rao, R.; Fung, G. On the Dangers of Cross-Validation. An Experimental Evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008; pp. 588–596.
  25. Miller, B.L.; Goldberg, D.E. Genetic algorithms, tournament selection, and the effects of noise. Complex Syst. 1995, 9, 193–212.
  26. Bäck, T. Selective Pressure in Evolutionary Algorithms: A Characterization of Selection Mechanisms. In Proceedings of the First IEEE Conference on Evolutionary Computation, Orlando, FL, USA, 27–29 June 1994; pp. 57–62.
  27. Jolly, K. Machine Learning with Scikit-Learn Quick Start Guide: Classification, Regression, and Clustering Techniques in Python; Packt Publishing Ltd.: Birmingham, UK, 2018.
  28. Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  29. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  30. Myles, A.J.; Feudale, R.N.; Liu, Y.; Woody, N.A.; Brown, S.D. An introduction to Decision Tree modeling. J. Chemom. 2004, 18, 275–285.
  31. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
  32. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18.
  33. Eisinga, R.; Heskes, T.; Pelzer, B.; Te Grotenhuis, M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinform. 2017, 18, 68.
  34. Chen, H.; Jiang, W.; Li, C.; Li, R. A Heuristic Feature Selection Approach for Text Categorization by Using Chaos Optimization and Genetic Algorithm. Math. Probl. Eng. 2013, 2013, 1–6.
  35. Malhotra, R.; Khanna, M. Dynamic selection of fitness function for software change prediction using Particle Swarm Optimization. Inf. Softw. Technol. 2019, 112, 51–67.
Figure 1. Workflow diagram.
Figure 2. Impact of variables with SHAP, Random Forest with pathologies, and no FS.
Figure 3. Mean accuracy evolution graph with RFECV.
Figure 4. Impact of variables with SHAP, Gradient Boosting with Stud selection, with pathologies.
Figure 5. Impact of variables with SHAP, Decision Tree with Stud selection, with pathologies.
Figure 6. Heat map for frequency of the different variables using FS with GA, dataset with pathologies.
Figure 7. Impact of variables with SHAP, Random Forest without pathologies, and no FS.
Figure 8. Pearson Correlation Matrix of variables in Figure 7.
Figure 9. Impact of variables with SHAP, Gradient Boosting with Stud selection, without pathologies.
Figure 10. Impact of variables with SHAP, Decision Tree with Stud selection, without pathologies.
Figure 11. Heat map representation of the frequency of different variables obtained using FS with GA, dataset without pathologies.
Table 1. Example of confusion matrix structure for binary classification.
 | Positive Prediction | Negative Prediction
Positive Class | (TP) | (FN)
Negative Class | (FP) | (TN)
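For reference, and assuming the standard definitions, the per-class precision and recall columns reported in Tables 2 and 3 can be read from this confusion matrix as follows (written here for class 1; the class-0 variants are obtained by exchanging the roles of the two classes):

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision}_1 = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}_1 = \frac{TP}{TP + FN}.
\]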
Table 2. Results of the different algorithms for the set of variables with pathologies. The table shows the algorithm, selection method, number of variables used, best case, worst case, mean, and standard deviation. The algorithm with the highest mean is marked with bold font.
ID | Algorithm | FS | Variables | Best | Worst | Mean | Std ± | Precision_0 | Precision_1 | Recall_0 | Recall_1
DT-S | Decision Tree | Stud-GA | 14 | 0.7661 | 0.6407 | 0.7150 | 0.0309 | 0.7682 | 0.7639 | 0.7733 | 0.7586
DT-T | Decision Tree | Tournament-GA | 16 | 0.7492 | 0.6542 | 0.7103 | 0.0238 | 0.7255 | 0.7746 | 0.7762 | 0.7237
DT-RFECV | Decision Tree | RFECV | 13 | 0.7559 | 0.6373 | 0.6962 | 0.0257 | 0.7597 | 0.7518 | 0.7697 | 0.7413
DT | Decision Tree | No FS | 37 | 0.7254 | 0.6237 | 0.6914 | 0.0232 | 0.7035 | 0.7561 | 0.8013 | 0.6458
XGB-S | XGBoost | Stud-GA | 20 | 0.7831 | 0.6780 | 0.7216 | 0.0245 | 0.7451 | 0.8239 | 0.8201 | 0.7500
XGB-T | XGBoost | Tournament-GA | 19 | 0.7559 | 0.6746 | 0.7029 | 0.0187 | 0.7614 | 0.7479 | 0.8171 | 0.6794
XGB-RFECV | XGBoost | RFECV | 20 | 0.7424 | 0.6746 | 0.7085 | 0.0201 | 0.7677 | 0.7143 | 0.7484 | 0.7353
XGB | XGBoost | No FS | 37 | 0.7593 | 0.6441 | 0.6981 | 0.0225 | 0.7669 | 0.7500 | 0.7911 | 0.7226
GB-S | Gradient Boosting | Stud-GA | 19 | 0.7966 | 0.6881 | 0.7382 | 0.0231 | 0.8204 | 0.7656 | 0.8204 | 0.7656
GB-T | Gradient Boosting | Tournament-GA | 23 | 0.7864 | 0.6644 | 0.7332 | 0.0251 | 0.7738 | 0.8031 | 0.8387 | 0.7286
GB-RFECV | Gradient Boosting | RFECV | 17 | 0.7797 | 0.6915 | 0.7324 | 0.0199 | 0.7727 | 0.7899 | 0.8447 | 0.7015
GB | Gradient Boosting | No FS | 37 | 0.7797 | 0.6814 | 0.7305 | 0.0211 | 0.7683 | 0.7939 | 0.8235 | 0.7324
ADB | AdaBoost | No FS | 37 | 0.6746 | 0.6000 | 0.6424 | 0.0189 | 0.6690 | 0.6800 | 0.6690 | 0.6800
BG | Bagging | No FS | 37 | 0.7458 | 0.6678 | 0.7114 | 0.0210 | 0.7486 | 0.7411 | 0.8253 | 0.6434
BNB | Bernoulli NB | No FS | 37 | 0.7220 | 0.6271 | 0.6675 | 0.0198 | 0.7429 | 0.6917 | 0.7784 | 0.6484
ET | Extra Trees | No FS | 37 | 0.7627 | 0.6508 | 0.7119 | 0.0276 | 0.7419 | 0.7857 | 0.7931 | 0.7333
GNB | Gaussian NB | No FS | 37 | 0.7051 | 0.5932 | 0.6603 | 0.0230 | 0.6792 | 0.7711 | 0.8834 | 0.4848
LR | Logistic Regression | No FS | 37 | 0.7627 | 0.6780 | 0.7098 | 0.0190 | 0.7603 | 0.7651 | 0.7603 | 0.7651
RFC | Random Forest | No FS | 37 | 0.7763 | 0.6644 | 0.7292 | 0.0240 | 0.7582 | 0.7958 | 0.8000 | 0.7533
Table 3. Results of the different algorithms for the set of variables without pathologies. The table shows the algorithm, selection method, number of variables used, best case, worst case, mean, and standard deviation. The algorithm with the highest mean is marked with bold font.
ID | Algorithm | FS | Variables | Best | Worst | Mean | Std ± | Precision_0 | Precision_1 | Recall_0 | Recall_1
DT-S | Decision Tree | Stud-GA | 14 | 0.7322 | 0.6407 | 0.6934 | 0.0212 | 0.7419 | 0.7214 | 0.7468 | 0.7163
DT-T | Decision Tree | Tournament-GA | 15 | 0.7525 | 0.6169 | 0.6862 | 0.0307 | 0.7548 | 0.7500 | 0.7697 | 0.7343
DT-RFECV | Decision Tree | RFECV | 1 | 0.7186 | 0.6373 | 0.6821 | 0.0195 | 0.7200 | 0.7172 | 0.7248 | 0.7123
DT | Decision Tree | No FS | 26 | 0.7153 | 0.6203 | 0.6799 | 0.0252 | 0.7533 | 0.6759 | 0.7062 | 0.7259
XGB-S | XGBoost | Stud-GA | 12 | 0.7424 | 0.6237 | 0.6989 | 0.0260 | 0.7702 | 0.7090 | 0.7607 | 0.7197
XGB-T | XGBoost | Tournament-GA | 11 | 0.7288 | 0.6542 | 0.6948 | 0.0223 | 0.7235 | 0.7360 | 0.7885 | 0.6619
XGB-RFECV | XGBoost | RFECV | 2 | 0.7186 | 0.6475 | 0.6850 | 0.0164 | 0.6746 | 0.7778 | 0.8028 | 0.6405
XGB | XGBoost | No FS | 26 | 0.7254 | 0.6068 | 0.6803 | 0.0295 | 0.7024 | 0.7559 | 0.7919 | 0.6575
GB-S | Gradient Boosting | Stud-GA | 12 | 0.7797 | 0.6814 | 0.7295 | 0.0230 | 0.7578 | 0.8060 | 0.8243 | 0.7347
GB-T | Gradient Boosting | Tournament-GA | 12 | 0.7763 | 0.6610 | 0.7307 | 0.0250 | 0.7636 | 0.7923 | 0.8235 | 0.7254
GB-RFECV | Gradient Boosting | RFECV | 9 | 0.7695 | 0.6610 | 0.7171 | 0.0236 | 0.7471 | 0.8000 | 0.8355 | 0.6993
GB | Gradient Boosting | No FS | 26 | 0.7695 | 0.6678 | 0.7169 | 0.0280 | 0.7857 | 0.7518 | 0.7756 | 0.7626
ADB | AdaBoost | No FS | 26 | 0.6678 | 0.5661 | 0.6236 | 0.0262 | 0.6883 | 0.6454 | 0.6795 | 0.6547
BG | Bagging | No FS | 26 | 0.7695 | 0.6644 | 0.7086 | 0.0254 | 0.7709 | 0.7672 | 0.8364 | 0.6846
BNB | Bernoulli NB | No FS | 26 | 0.6881 | 0.5695 | 0.6154 | 0.0295 | 0.7048 | 0.6667 | 0.7312 | 0.6370
ET | Extra Trees | No FS | 26 | 0.7661 | 0.6542 | 0.7125 | 0.0251 | 0.7197 | 0.8188 | 0.8188 | 0.7197
GNB | Gaussian NB | No FS | 26 | 0.6881 | 0.5763 | 0.6488 | 0.0279 | 0.6550 | 0.7339 | 0.7724 | 0.6067
LR | Logistic Regression | No FS | 26 | 0.7559 | 0.6644 | 0.7060 | 0.0232 | 0.7697 | 0.7385 | 0.7888 | 0.7164
RFC | Random Forest | No FS | 26 | 0.7695 | 0.6983 | 0.7331 | 0.0188 | 0.7987 | 0.7353 | 0.7791 | 0.7576
Table 4. Measured times in seconds for each method and problem.
Problem | GB-T | GB-S | DT-T | DT-S | XGB-T (GPU) | XGB-S (GPU) | XGB-T (CPU) | XGB-S (CPU)
With pathologies | 4641.42 | 5203.35 | 178.31 | 189.13 | 13,140.59 | 13,492.46 | 2130.07 | 2085.69
Without pathologies | 4311.17 | 4384.38 | 200.53 | 194.88 | 13,781.81 | 12,732.41 | 2220.16 | 2558.31
Table 5. Algorithm results by problem (average).
Problem | DT-S | DT-T | DT-RFECV | DT | XGB-S | XGB-T | XGB-RFECV | XGB | GB-S | GB-T | GB-RFECV | GB | ADB | BG | BNB | ET | GNB | LR | RFC
With Pathologies | 0.7150 | 0.7103 | 0.6962 | 0.6914 | 0.7216 | 0.7029 | 0.7085 | 0.6981 | 0.7382 | 0.7332 | 0.7324 | 0.7305 | 0.6424 | 0.7114 | 0.6675 | 0.7119 | 0.6603 | 0.7098 | 0.7292
Without Pathologies | 0.6934 | 0.6862 | 0.6821 | 0.6799 | 0.6989 | 0.6948 | 0.6850 | 0.6803 | 0.7295 | 0.7307 | 0.7171 | 0.7169 | 0.6236 | 0.7086 | 0.6154 | 0.7125 | 0.6488 | 0.7060 | 0.7331
Table 6. Friedman ranks for the different proposed algorithms. Columns GB-S through DT-RFECV use feature selection; columns RFC through ADB do not.
Problem | GB-S | GB-T | GB-RFECV | XGB-S | DT-S | DT-T | XGB-T | XGB-RFECV | DT-RFECV | RFC | GB | ET | BG | LR | XGB | DT | GNB | BNB | ADB
With Pathologies | 1 | 2 | 3 | 6 | 7 | 10 | 13 | 12 | 15 | 5 | 4 | 8 | 9 | 11 | 14 | 16 | 18 | 17 | 19
Without Pathologies | 3 | 2 | 4 | 9 | 11 | 12 | 10 | 13 | 14 | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 19 | 18
Mean | 2 | 2 | 3.5 | 7.5 | 9 | 11 | 11.5 | 12.5 | 14.5 | 3 | 4.5 | 7 | 8 | 9.5 | 14.5 | 16 | 17.5 | 18 | 18.5
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
