Article

Predicting the Risk of Overweight and Obesity in Madrid—A Binary Classification Approach with Evolutionary Feature Selection

by Daniel Parra 1, Alberto Gutiérrez-Gallego 1, Oscar Garnica 1, Jose Manuel Velasco 1, Khaoula Zekri-Nechar 2, José J. Zamorano-León 3, Natalia de las Heras 4 and J. Ignacio Hidalgo 1,*
1 Computer Architecture and Automation Department, Faculty of Computer Science, Universidad Complutense de Madrid, 28040 Madrid, Spain
2 Department of Medicine, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
3 Public Health and Maternal and Child Health Department, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
4 Department of Physiology, Faculty of Medicine, Universidad Complutense de Madrid, 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(16), 8251; https://doi.org/10.3390/app12168251
Submission received: 8 July 2022 / Revised: 15 August 2022 / Accepted: 16 August 2022 / Published: 18 August 2022
(This article belongs to the Special Issue Evolutionary Computation: Theories, Techniques, and Applications)

Abstract: In this paper, we experimented with a set of machine-learning classifiers for predicting the risk of a person being overweight or obese, taking into account his/her dietary habits and socioeconomic information. We investigate ten different machine-learning algorithms combined with four feature-selection strategies (two evolutionary feature-selection methods, one feature-selection method from the literature, and no feature selection). We tackle the problem under a binary classification approach with evolutionary feature selection. In particular, we use a genetic algorithm to select the set of variables (features) that optimizes the accuracy of the classifiers. As an additional contribution, we designed a variant of the Stud GA, a particular structure of the selection operator in which a reduced set of elitist solutions dominates the process. The genetic algorithm uses a direct binary encoding, allowing a more efficient evaluation of the individuals. We use a dataset with information from more than 1170 people in the Spanish Region of Madrid. Both evolutionary and classical feature-selection methods were successfully applied to Gradient Boosting and Decision Tree algorithms, reaching accuracy values of up to 79% and increasing the average accuracy by two points, respectively.

1. Introduction

A person is considered overweight when her/his Body Mass Index (BMI) is higher than 25 and obese if it is over 30. BMI is computed by dividing the weight of the person (in kilograms) by the square of her/his height (in meters) [1]. According to the Spanish National Institute of Statistics [2], the rate of individuals with obesity in Spain has increased from 7.4% in 1987 to 17.4% in 2017; in just 30 years, it has grown by a factor of 2.4. This health problem affects all sectors of the population, although it is not equally distributed. In particular, men are more prone to developing overweight/obesity than women. The number of cases of childhood obesity has also increased, reaching 10.3% of children between 2 and 17 years old. The World Health Organization [3] also reports that overweight people are more prone to developing cerebrovascular and respiratory problems and gallbladder disease, and that excess weight may increase the risk of different types of cancer. Hence, it is necessary to prevent future cases of overweight or obesity.
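As a quick illustration of these definitions, the following minimal sketch (our own, not taken from the paper) computes BMI and maps it to the categories used throughout this work:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by the square of height in meters."""
    return weight_kg / height_m ** 2

def risk_label(weight_kg: float, height_m: float) -> str:
    """Map BMI to the categories used in the paper (>= 25 overweight, >= 30 obese)."""
    value = bmi(weight_kg, height_m)
    if value >= 30:
        return "obese"
    if value >= 25:
        return "overweight"
    return "non-overweight/non-obese"

print(risk_label(82, 1.75))  # BMI of about 26.8 -> "overweight"
```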
Some people may have a certain predisposition to suffer from overweight/obesity due to genetics. Overweight and obesity are usually a consequence of social behaviors, such as high-fat meals and sedentary lifestyles. Prevention should begin at early ages to consolidate healthy lifestyle habits, and those who suffer from any of these disorders should seek the help and advice of medical staff. Regarding prevention, it would be essential to know the relationships among lifestyle indicators, overweight, and obesity. In this context, a system predicting the risk of developing overweight and obesity could be very useful. In this paper, we show that it is possible to analyze the different factors and habits of a person and obtain the relationship among them by machine-learning techniques. In particular, we investigate the performance of a set of machine-learning classification algorithms when classifying people as overweight/obese versus non-overweight/non-obese using information about lifestyle habits. This information was collected by a pool of 14 different institutions participating in the consortium of the GenObIA project (work financed by the Community of Madrid and the European Social Fund through the GENOBIA-CM project with reference S2017/BMD-3773).
In machine learning (ML), classification is related to the assignment of class labels to data in the problem domain. A critical element in this type of problem is the selection of the variables used as predictors (feature selection). Using all variables is not always the best option; sometimes it is necessary to discard some of them in order to avoid noise in the data and to reduce the computational load. Feature-selection (FS) methods aim to reduce dimensionality, reduce execution times, and improve model results. In this work, four different approaches to FS have been tested: no FS, classical FS, and two FS methods using genetic algorithms (GAs). In particular, we investigate FS by a GA using two selection operators: a classical tournament selection and Stud selection, a new method proposed in this paper. Experimental results show that machine-learning algorithms are good classifiers when combined with evolutionary feature-selection methods for this particular problem.
The main contributions of this work are:
  • We evaluate different configurations of a set of ML classifiers that predict whether a person is overweight or obese based on information concerning lifestyle and dietary habits.
  • We use a dataset with data from more than 1170 people, which, to the best of our knowledge, makes this the largest such study in Spain.
  • We explore four different feature-selection methods.
  • We propose the Stud selection operator, a variant of the stud GA algorithm presented in [4], which can be adapted to other evolutionary algorithms.
The rest of the paper is organized as follows. Section 2 provides a brief review of related work. Section 3 describes the evolutionary algorithms for feature selection and implementation details. Section 4 explains the experimental setup, and Section 5 collects the experimental results. Section 6 presents the statistical analysis, and finally Section 7 contains the conclusions and future work.

2. Literature Review

Machine learning [5] is a constantly evolving branch of artificial intelligence related to those algorithms that try to simulate human intelligence using information from their environment. Techniques based on machine learning have been used in different fields, such as finance [6], pattern recognition [7,8], and medical applications [9,10].
In machine learning, it is important to carefully choose which features should be used and which should be discarded to construct the models. Medical datasets commonly present a small number of cases with a large number of variables, thus introducing different problems such as dimensionality and high computational requirements [11]. To deal with these obstacles, the use of feature-selection techniques has been proposed to select the variables that provide the greatest value. According to [12], three common FS categories are: filters, wrappers, and embedded methods.
Filter methods stand out mainly for their speed and scalability, being of great help in extremely large datasets. Through a series of statistical processes, scores are assigned to the different variables; the ones with the highest scores will be used to create the model. The great limitation of these methods is that they do not consider the relationship between variables. Two examples of this category are Pearson’s correlation coefficient, which allows us to quantify the linear dependence between two variables, and mutual information, which seeks to reduce the uncertainty of a random variable through knowledge of another random variable [12,13].
Wrapper methods search for the most appropriate subset of variables using the selected predictor as a black box to score the different subsets of variables generated throughout the different iterations. Wrapper methods [13] have a high computational cost since they require training and testing for each possible subset. Sequential Feature Selection (SFS) [14] starts with an empty set and adds the different variables individually, looking for the one that contributes the most value to the set. Once identified, this variable will be permanently incorporated into the subset, and then the next iteration will follow, repeating the same process until the desired number of variables is obtained. Based on this implementation, it is possible to find different variants, such as Sequential Backward Selection (SBS), which starts with the complete set of variables and reduces it through iterations, or Sequential Floating Forward Selection (SFFS) [15], which is based on the SFS method and incorporates a backward component with the SBS.
An example of the use of this type of technique is [16], where the authors apply a wrapper method, called Recursive Feature Elimination with Cross-Validation (RFECV), to select the best variables for a classification problem in the medical domain and obtain an accuracy improvement. In our work, RFECV is one of the techniques against which we compare the performance of our proposal, since it is a good FS alternative and has the benefits of wrapper methods. Heuristic search methods applied to feature selection can also be considered part of the wrapper methods. Examples of the use of genetic algorithms for feature selection can be found in the medical field, and these methods are particularly interesting for large datasets. In [17], an optimization algorithm based on a genetic algorithm is proposed that optimizes the values of the SVM parameters, obtaining an optimal subset of features and improving the classification accuracy. Embedded methods aim to reduce the computational time required to reclassify subsets of distinct variables. To do so, they try to combine the advantages of the filter and wrapper methods. One of the main characteristics of embedded methods is the introduction of feature selection as part of the training process rather than as a separate phase, i.e., the feature-selection process becomes an integral part of the model.
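For reference, RFECV is available in scikit-learn; the following is a minimal sketch of how it is typically used, with synthetic data and illustrative parameters rather than the exact configuration used later in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the survey data: 1000 subjects, 40 candidate features.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=8, random_state=0)

# Recursively drop the least important feature and keep the subset with the best
# cross-validated accuracy.
selector = RFECV(
    estimator=GradientBoostingClassifier(random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=10),
    scoring="accuracy",
)
selector.fit(X, y)
print("selected features:", selector.n_features_)
```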
Our proposal, Stud selection, is a variant of the stud GA algorithm presented in [4]. In stud GA, the fittest individual is considered the Stud, and the rest of the population is crossed with it to obtain the offspring. In addition, stud GA maintains diversity using the Hamming distance between the Stud and the individual that will serve as the second parent. If the diversity is above a set threshold, the crossover is performed to produce an offspring; otherwise, the current second parent is mutated to produce the offspring.
In this work we have adapted this strategy to the particularities of our data. We found that crossover based on the Hamming distance introduced a large computational load without yielding significantly better solutions. Therefore, we decided to use a simple one-point crossover and to compensate for the possible loss of diversity with four studs.
In relation to the study of factors related to obesity, Hudson Reddon [18] deals with physical activity and genetic predisposition to obesity. With this purpose, the impact of exercise on 14 variants predisposing to obesity is analyzed. Physical activity is able to reduce the impact of the fat mass and obesity-associated (FTO) gene variation and of obesity genetic risk scores. The results include the identification of an interaction between physical activity and the FTO SNP rs1421085, a single-nucleotide polymorphism of the fat mass and obesity-associated gene, in a prospective cohort of six ethnic groups. According to this study, prevention programs with a heavy physical activity load can be a very important resource to combat obesity.
In [19], the authors try to identify risk factors for overweight and obesity using machine-learning techniques (regression and classification). Their main contributions are the identification of factors related to obesity/overweight, the analysis of these factors, and their respective variable analysis. The issue we find with this work is the use of the variable “weight” since it is a variable that is unknown in the future and is part of the BMI formula (the value to be predicted). In our opinion, weight cannot be used in a classification method.
There are examples of the use of evolutionary computation in similar environments. Ref. [20] deals with the prediction of obesity in children using a hybrid approach, combining Naïve Bayes (a supervised learning algorithm that applies Bayes' theorem with the "naive" assumption of conditional independence between every pair of features) with a GA. In this case, the use of Naïve Bayes in prediction presented problems when dealing with zero-value parameters, and as a solution, the authors propose using a GA for parameter optimization. The initial experiment to identify the usability of their approach indicated a 75% improvement in accuracy. Similarly, the present work proposes a genetic algorithm to support the classification model in use by selecting the most useful features.
Among the studies dealing with overweight or obesity, datasets are often small and contain limited information. There are also occasions where the decisions made may be questionable, such as the use of weight in the dataset.
The work presented here seeks to predict the risk of overweight and obesity in Madrid. With this aim, a binary classification approach with evolutionary feature selection is proposed. Hence, we provide the most relevant variables for the classification algorithm through an evolutionary process. A particular structure of the feature-selection process has been developed. Additionally, a high-quality dataset has been used, composed of detailed information about the habits of the individuals and their health.

3. Methodology

This section explains the methodology applied in this work and how the feature selection for the classification problem was performed using genetic algorithms. Figure 1 shows a diagram of the feature-selection process and the generation of the classification model.
In order to apply the methodology, we need to select three main items:
  • The machine-learning technique.
  • The dataset, which defines the features.
  • The FS method, and if it applies, the parameters of the genetic algorithm.
From the original dataset, a curation process is performed, which also defines the initial dataset to be used. After that, the FS process is applied, and the selected features will be used to train the ML algorithm. The best classification ML model will be chosen after analyzing the results, their accuracy, and the number of false negatives.
The objective of a GA is to find the best solution of a problem through the iterative transformation (using crossover and mutation operators [21]) of an initial set of potential solutions (population). For each solution (individual), its performance (fitness) is evaluated, and, based on this value, the fittest ones will have a higher probability of passing to the next iteration. After a certain number of iterations, one of the candidates will be selected as the solution to the problem.
Four key concepts need to be considered when designing GAs:
  • The encoding of the problem.
  • The size of the population and the initialization method.
  • The selection method including the fitness function.
  • The processes by which the changes are introduced in the next iteration, including the probabilities and parameters [22].
After the execution of the GA, the fittest individual of the last generation represents the set of features (variables) selected to train the ML models.
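To make this loop concrete, the following is a generic GA skeleton (our own illustration; the problem-specific encoding, fitness, and operators used in this work are described in the rest of this section):

```python
import random

def run_ga(init_individual, fitness, select, crossover, mutate,
           pop_size=50, generations=100):
    """Generic GA loop: evaluate the population, select parents, recombine, mutate, iterate."""
    population = [init_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        population = [mutate(crossover(select(population, scores),
                                       select(population, scores)))
                      for _ in range(pop_size)]
    # The fittest individual of the final population is returned as the solution.
    return max(population, key=fitness)
```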

Evaluation Metrics

Table 1 describes a confusion matrix. It is a numeric matrix that shows the number of correct and incorrect predictions of the model for each class. The confusion matrix of Table 1 is for algorithms classifying data into two classes (binary classification): Positive and Negative. To construct it, the following values are computed:
  • Number of True Positives (TPs): the class assigned to the sample by the model is Positive and it is also the real class.
  • Number of False Negatives (FNs): the class assigned to the sample by the model is Negative, but the real class is Positive.
  • Number of False Positives (FPs): the class assigned to the sample by the model is Positive, but the real class is Negative.
  • Number of True Negatives (TNs): the class assigned to the sample by the model is Negative and it is also the real class.
  • Number of Total Samples (Total).
From those counts, we can derive different metrics to measure the model’s performance.
  • Accuracy: Percentage of data correctly classified.
    $\text{Accuracy} = \frac{TP + TN}{\text{Total}}$
  • Misclassification Rate: Percentage of misclassified data.
    $\text{Misclassification Rate} = \frac{FP + FN}{\text{Total}}$
  • Precision: The percentage of positive predictions that are correct.
    $\text{Precision} = \frac{TP}{TP + FP}$
  • Recall: True positive rate, the percentage of positive-class samples that are correctly identified.
    $\text{Recall} = \frac{TP}{TP + FN}$
Precision and recall can be associated with the positive and negative classes. In our case, we will focus on reducing the number of false negatives since these would be cases of overweight or obesity that our model does not detect. In other words, we seek to obtain a high recall of the positive class.
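As a small worked illustration of these definitions (our own example, not taken from the paper's results):

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, misclassification rate, precision, and recall from the four counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # true positive rate; the quantity we want high for the positive class
    }

# Toy counts: 70 overweight/obese subjects correctly flagged, 30 missed,
# 25 false alarms, and 75 correct negatives.
print(confusion_metrics(tp=70, fn=30, fp=25, tn=75))
```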
The GA selects the best set of features for prediction. A solution is represented by a binary string (chromosome) with as many positions as features available in the curated dataset. Each of the positions (genes) in the chromosome corresponds to a feature. The value of the gene indicates whether the feature is selected (1) or not (0) as a predictor variable. The initial population is generated randomly.
Since we have a balanced dataset, to evaluate an individual (a model) we used a classical cross-validation scheme with stratification [23]. Each model is trained using only the features expressed as 1s in the individual's genotype. The average accuracy rate of the 10 folds, F, is used as the fitness function:
$F = \frac{1}{10} \sum_{i=1}^{10} \text{Accuracy}_i$
where $\text{Accuracy}_i$ is the accuracy obtained for each one of the cross-validation folds [24].
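A sketch of how such a fitness evaluation can be wired up with scikit-learn follows; the classifier and the 10-fold setup are placeholders that mirror the description above, not the authors' exact code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def fitness(individual, X, y):
    """Mean accuracy over stratified 10-fold CV, using only the features whose gene is 1."""
    mask = np.asarray(individual, dtype=bool)
    if not mask.any():  # an individual selecting no features gets the worst possible fitness
        return 0.0
    model = GradientBoostingClassifier(random_state=0)
    scores = cross_val_score(
        model, X[:, mask], y,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
        scoring="accuracy",
    )
    return scores.mean()
```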
In this paper, we propose a variation of the Stud GA method and we compare its performance with a traditional tournament implementation.
The Stud selection method works as follows. First, the four best individuals of the generation are selected, and form the Stud candidates group. Second, the two best individuals pass to the following iteration without crossover. Finally, the rest of the population is completed by applying the crossover operator to a pair formed by: (i) a member of the Stud candidates group and (ii) another individual of the population in the event that the probability of crossover is met.
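The following sketch reflects our reading of this Stud selection step; the crossover operator is passed in as a callable, and the behavior when the crossover probability is not met is an assumption (the second parent is copied unchanged):

```python
import random

def stud_generation(population, fitness_values, p_cross, crossover):
    """Build the next generation with Stud selection: four studs, two elites kept as-is."""
    ranked = [ind for _, ind in sorted(zip(fitness_values, population),
                                       key=lambda pair: pair[0], reverse=True)]
    studs = ranked[:4]                 # Stud candidates group
    next_gen = [ranked[0], ranked[1]]  # the two best individuals pass unchanged
    while len(next_gen) < len(population):
        stud = random.choice(studs)
        mate = random.choice(population)
        if random.random() < p_cross:
            child = crossover(stud, mate)
        else:
            child = list(mate)         # assumption: no crossover -> copy the second parent
        next_gen.append(child)
    return next_gen
```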
For the tournament selection, we use a simple implementation with a selection pressure of five [25]. As usual in the literature [26], we use the term selection pressure for the size of the tournament pool. Adjusting this parameter allows us to find a trade-off between exploration and exploitation of the fitness landscape. A value of five prioritizes exploitation over exploration and worked well in the preliminary experiments with our datasets.
After the selection of the individuals, we apply a single-point crossover, choosing a point in the chromosome of the two selected individuals and generating one offspring. With this purpose, we combine the information from one of them up to the crossover point and complete it with the remaining information from the other individual.
A random mutation is introduced in the individual with a very low probability. This mutation affects a gene of the individual, flipping its value.
The parameters used for the GA were a crossover probability of 0.82, a mutation probability of 0.09, a population of 50 individuals, and 100 generations.
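For illustration, a sketch of the tournament selection with a pool of five, the single-point crossover, and the bit-flip mutation, using the parameter values just listed (illustrative code under our assumptions, not the authors' implementation):

```python
import random

P_CROSS, P_MUT, POP_SIZE, GENERATIONS = 0.82, 0.09, 50, 100

def tournament(population, fitness_values, pool_size=5):
    """Return the fittest of pool_size randomly drawn individuals (selection pressure = 5)."""
    contenders = random.sample(range(len(population)), pool_size)
    best = max(contenders, key=lambda i: fitness_values[i])
    return population[best]

def one_point_crossover(parent_a, parent_b):
    """Combine parent_a up to a random point with the remaining genes of parent_b."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def bit_flip(individual, p_mut=P_MUT):
    """With a low probability, flip the value of one randomly chosen gene."""
    child = list(individual)
    if random.random() < p_mut:
        gene = random.randrange(len(child))
        child[gene] ^= 1
    return child
```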

4. Experimental Setup

4.1. Dataset

The original dataset is the result of surveys carried out at the different centers of the GenObIA consortium, including universities and hospitals. The information collected in these surveys includes lifestyle, nutrition habits, and information about pathologies suffered by the person in the past.

4.1.1. Data Curation

The original dataset, Appendix A.2, Table A2, is composed of a total of 93 variables and 1179 subjects, among which we find:
  • One subject identifier;
  • Thirteen variables of general information about the subject such as weight, age, education, stress, etc.;
  • Seven variables related to alcoholic drinks, distinguishing between distilled and fermented drinks;
  • Seven variables on smoking habits, such as the number of cigarettes, pipes, and cigars and, for ex-smokers, the time since quitting;
  • Fifteen variables related to pathologies, such as types of cancer, sleep apnea, and type 2 diabetes mellitus, among others;
  • Thirty-four variables on nutritional habits, including information on the portions of different types of food and the points of adherence to the Mediterranean diet derived from these portions;
  • Sixteen variables related to physical exercise and its intensity.
The dataset is balanced in terms of the predicted variable, overweight/obesity (BMI ≥ 25), with 48% being obese/overweight individuals and 52% being non-obese/non-overweight individuals. Therefore, we considered it unnecessary to use classification techniques focused on imbalanced datasets.
In order to avoid repeated information that may introduce noise into the system, a reduction in the number of variables was performed. An example of such a reduction is the case of the variables referring to food intake, which were replaced by a single variable, namely adherence to the Mediterranean Diet (ADH). The original dataset contains a set of variables related to food: 16 associated with the servings and, derived from these, another 16 measuring adherence to the Mediterranean diet using points. If there are more than eight points in total, the subject is considered to have a high ADH.
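A minimal sketch of this reduction is shown below; the per-food column names are hypothetical placeholders, since the real dataset uses 16 such adherence-point columns:

```python
import pandas as pd

# Hypothetical names for the per-food adherence-point columns (the real dataset has 16 of them).
point_columns = ["adh_olive_oil", "adh_vegetables", "adh_fruit", "adh_legumes"]

def add_adherence(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-food adherence points into a single binary ADH variable."""
    df = df.copy()
    total_points = df[point_columns].sum(axis=1)
    df["ADH"] = (total_points > 8).astype(int)  # more than eight points -> high adherence
    return df.drop(columns=point_columns)
```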
In addition, some redundant variables were eliminated. For instance, the dataset initially contains two variables referring to exercise that were computed from a set of survey features: Cal_IPAQ, which reports the calories burned as a function of physical exercise; and IPAQ, which reports information on the exercise performed and its intensity. IPAQ takes into account the duration and intensity of exercise, weighting the value of sedentary, moderate, and vigorous exercise. Cal_IPAQ includes weight as a variable in its calculation and therefore cannot be used, since weight is also part of the formula for computing BMI. Hence, we use only IPAQ as a training variable.
Some features, such as the place or institution where the sample was obtained (center), were also removed from the dataset, since they showed a high correlation with BMI due to the differing nature of the populations at those places (police officers, sports teams, retired people, etc.).
After processing the data, which in our case was supervised by the medical staff that participated in the project, the total number of variables was reduced, from 93 to 41, as shown in Appendix A.1, Table A1. This table shows the different features of the study, providing its identifier, name, short description, and type.

4.1.2. Dataset with Pathologies

Two datasets were generated using the variables of Appendix A.1. One of them, called the dataset with pathologies, includes the variables related to pathologies. This dataset includes most of the features, in particular all the variables except variables 37 to 40. The objective of this dataset is to evaluate the classifiers with the standard data of the surveys, which include information on the health record of the person.

4.1.3. Dataset without Pathologies

There is a set of variables related to pathologies. When dealing with this type of variable, it is necessary to consider whether a pathology is a cause or a consequence of overweight/obesity. An example is the variable number 33, Apnea, which indicates whether the subject suffers from sleep apnea. Usually, overweight or obese people suffer from this problem. However, it is not necessarily true that because they suffer from sleep apnea, they are suffering from overweight/obesity. The same applies to other pathologies. In order to evaluate this kind of artifact, we create a new dataset, selecting all the variables in Appendix A.1, but excluding those of pathological type and variable 11.

5. Experimental Results

We performed experiments combining ten different algorithms as classifiers and four different feature-selection strategies (two evolutionary feature-selection methods, one feature selection from the literature, and no feature selection) on the two datasets explained above (With and Without Pathologies).
Table 2 and Table 3 present the experimental results. These tables contain one row for each configuration, identified by an acronym (ID), and 11 additional columns: the name of the algorithm (ALGORITHM), the feature-selection strategy (FS), the number of variables of the dataset (VARIABLES), the accuracy of the best solution (BEST), the accuracy of the worst solution (WORST), the mean (MEAN), and the standard deviation (STD) of 30 runs for the configuration in the row. In addition, the last four columns show the precision (PRECISION_0 and PRECISION_1) and recall values (RECALL_0 and RECALL_1) for class 0 (non-overweight/non-obese) and class 1 (overweight or obese) for the best solution with this algorithm.
The interpretation of the FS column is:
  • Stud-GA: Evolutionary feature selection with Stud selection operator.
  • Tournament-GA: Evolutionary feature selection with tournament selection operator.
  • RFECV: Feature selection with recursive feature elimination (RFE) with cross-validation (CV).
  • No-FS: No feature selection applied in the configuration.
As mentioned, ten classification algorithms were used, using the implementations available in the Scikit-learn Python library [27]:
  • Decision Tree (DT): Its objective is to create a model to predict the value of a target variable by learning simple decision rules from data characteristics.
  • Gradient Boosting (GB): An additive model is created in a stepwise way, allowing the optimization of arbitrary differentiable loss functions. At each step, a regression tree is fitted to the negative gradient of the given loss function.
  • Adaboost (ADB): A meta-estimator that starts by fitting a classifier on the original dataset and next fits additional copies of the classifier for the same dataset where the weights of the misclassified instances are adjusted in order to make the subsequent classifiers concentrate on the difficult cases.
  • Bagging (BG): Fits base classifiers on random subsets of the original dataset and aggregates their individual predictions into a final prediction.
  • Bernoulli Naive Bayes (BNB): This classifier is useful for discrete data and is designed to handle binary/boolean features.
  • Extra Trees (ET): A meta estimator that fits random Decision Trees on multiple subsamples of the dataset and uses the average to improve predictive accuracy and control overfitting.
  • Gaussian Naive Bayes (GNB): Another Naïve Bayes model. This classifier is used when the values of the predictors are continuous.
  • Logistic Regression (LR): This algorithm attempts to predict the probability that a given data entry will belong to a category. Just as linear regression assumes that the data follow a linear function, logistic regression models the data using the sigmoid function.
  • Random Forest (RFC): This technique fits several Decision Tree classifiers on multiple subsamples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
  • XGBoost (XGB): A tree boosting system that stands out for its scalability and is widely used by data scientists.
Appendix A.3, Table A3 provides a table with information on the parameters used for each model.
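For illustration, the battery of classifiers listed above can be assembled as follows; the hyperparameters shown are library defaults (plus an illustrative max_iter for Logistic Regression), not the values of Table A3, and XGBoost comes from the separate xgboost package rather than scikit-learn:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT": DecisionTreeClassifier(),
    "GB": GradientBoostingClassifier(),
    "ADB": AdaBoostClassifier(),
    "BG": BaggingClassifier(),
    "BNB": BernoulliNB(),
    "ET": ExtraTreesClassifier(),
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(),
    "XGB": XGBClassifier(),
}
```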
Regarding evolutionary feature-selection methods, we focus their application on three models. The first one is Gradient Boosting [28], as it obtained consistently good results among the set of classifiers without feature selection. The second is XGBoost [29], a state-of-the-art machine-learning algorithm that allows us to gauge the goodness of the results of the other algorithms. The third is Decision Tree [30], because models based on trees provide solutions with a straightforward interpretation for the medical staff. The development of understandable solutions for medical doctors is one of the main objectives of our research. Understanding why a model makes a particular prediction can be as crucial as prediction accuracy in the medical field. In some cases, the best results are obtained with complex models that are difficult to interpret. Thanks to the SHAP library [31], it is possible to obtain an importance value for each feature of a particular predictor. The SHAP algorithm aims to explain the outcome of machine-learning models, representing the results by means of graphs; it is based on the Shapley values of game theory. In particular, in this paper we will focus on the graphs that allow us to show the impact of the different variables in the model (a minimal sketch of how such plots are produced is given after the lists below). To understand these graphs, two factors must be taken into account: the position of the points on the horizontal axis and the color. Let us take Figure 2 as an example.
  • The color of the dots denotes the numerical value of the variable. In the case of age, the redder the dot, the higher the age, and the bluer the dot, the lower the age. In the case of sex, the red color represents the female sex and the blue color represents the male sex. In the case of education, the red color represents higher levels of education, while the blue color represents people with low levels of education or no education.
  • The position of the points on the horizontal axis in our study indicates the probability of overweight/obesity. The further to the right the point is (positive values), the greater the probability that the person suffers from this problem. The further to the left (negative values), the lower the probability.
Thus, we can see as an example that Figure 2 provides us with the following information:
  • Age is an important factor for the probability of being overweight/obese. We can clearly see that higher ages (red dots) have a higher probability than lower ages (blue dots).
  • Gender is an unequivocal factor, with male gender (blue dots) being related to a higher probability of being overweight/obese while female gender (red dots) is similarly distributed but in the negative range.
  • Education is another important factor to take into account. Low levels of education (blue dots) have a higher probability while high levels (red dots) are related to a lower probability.
  • In the case of other variables, such as job, for example, we can see that the color of the points is intermixed, indicating no correlation with the probability of being overweight/obese.
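As referenced above, a sketch of how such SHAP summary plots are typically produced for a tree-based model; the model and data here are synthetic placeholders, and the plot in Figure 2 corresponds to the paper's RFC model rather than this toy example:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the curated dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values for tree ensembles; the summary (beeswarm)
# plot places each sample on the horizontal axis by its SHAP value and colors it
# by the feature value, which is how Figure 2 should be read.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```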

5.1. Results Using the Dataset with Pathologies

5.1.1. Results without Feature Selection

In the scenario with pathologies, Table 2 summarizes the results of the algorithms with and without feature selection (No-FS). The average accuracy rate is 0.6953, and the standard deviation is 0.0297, with DT, ADB, BNB, and GNB being below this average. The algorithms with the best results are GB and RFC, with a mean that is close to 0.74.
Figure 2 shows the impact of the different variables of the RFC model. The age variable is in first place: younger people, represented by the lowest values (blue), are mainly found in the left part of the graph, indicating a lower risk of being overweight/obese. In contrast, the higher values of this variable, corresponding to older people (red), indicate a higher risk of suffering from these health problems. In the case of sex, being male implies a higher probability of being classified in the overweight/obesity class. Third, the first variable associated with a type of pathology, apnea, appears. The variable apnea has all the blue points very close to the zero point, while the red points extend along the positive part of the axis. This indicates that subjects suffering from this pathology are likely to be overweight/obese, but otherwise the variable does not have a significant impact on the prediction.

5.1.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)

As shown in Table 2, using RFECV, an average accuracy rate of 0.7324 was achieved, reaching a maximum of 0.7797. A total of 17 variables were selected, including stress, some of the variables related to alcoholic drinks, education, time since quitting smoking, and physical exercise. The most frequently selected variables, such as sex, age, and apnea, are also among the variables obtained. Figure 3 shows the evolution of the accuracy rate for the different algorithms and datasets, taking the average of the cross-validation runs with RFECV in relation to the number of variables chosen. The average accuracy rate increases until it reaches 17 variables and then starts to decrease, probably due to the selection of variables that introduce noise.

5.1.3. Results for Gradient Boosting with Evolutionary Feature Selection

The results presented below are obtained with GB using evolutionary FS.
  • GA with Stud selection for Gradient Boosting (GB-S): In this case, the selection method reduced the number of variables used from 37 to 19, keeping its classification rate, and even obtaining a slight improvement, reaching an average accuracy of 0.7382, as shown in Table 2. Three of the variables selected were age, sex, and apnea, these being the most frequently chosen. Other variables to be highlighted are those related to smoking, education, earning, and adherence to the Mediterranean diet.
Figure 4 shows the graph corresponding to the Gradient Boosting model with Stud selection for the dataset with pathologies. The first variables that appear are age, sex, and apnea (as in the previous cases). In the case of education, it is shown that a lower level of education increases the probability of being overweight or obese. On the other hand, those individuals who suffer from metabolic syndrome are also more likely to be overweight or obese, but as in the case of apnea, if the individual does not suffer from this pathology, the variable does not have such a strong impact. Again, this can be seen in that the blue dots are clustered next to the zero point, while the red dots are spread over the positive values.
  • GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method with selection pressure of five, 23 variables were selected by the GA, achieving an average accuracy of 0.7332, as can be found in Table 2. Similar to the previous case, among the features selected, some of the most common ones are sex, age, adherence to the Mediterranean diet, and some new additions, which were variables related to heart disease and different types of cancer. Other variables to note are the appearance of distilled/fermented beverages, education, earnings, and stress.

5.1.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)

Using RFECV, a total of 13 variables were selected, reaching an accuracy rate of 0.6962 and, in the best case, up to 0.7559, as shown in Table 2. Some of the variables are education, alcoholic drinks, physical exercise, and diabetes, among others. Again, the most common variables are included: sex, age, and apnea. Figure 3 shows the evolution of the accuracy rate; in this case, the maximum is reached with 13 variables, and the accuracy decreases when the number of variables increases.

5.1.5. Results for Decision Tree with Evolutionary Feature Selection

The results obtained with DT using evolutionary FS are presented below.
  • GA with Stud selection for Decision Tree (DT-S): Fourteen variables were selected, obtaining an average accuracy of 0.7150 and, in the best case, up to 0.7661, as can be verified in Table 2. Among the variables used were sex, age, and seven variables related to pathologies.
Figure 5 shows the impact of the different variables in the model Stud with Decision Tree with pathologies. Once again, age, sex and apnea are at the top of the list. In this case, diabetes appears with a similar distribution to apnea, although with less impact. The appearance of fermented beverages is also interesting.
  • GA with tournament selection for Decision Tree (DT-T): Thanks to the feature selection, with only 16 variables, an accuracy reaching 0.7492 was achieved in the best cases and an average of 0.7103, as shown in Table 2.
Some of the variables used for this case are metabolic syndrome, those related to cancer, fermented drinks, specifically those related to wine, and the variable heart_angina. Moreover, as in the previous cases, sex and age appear.
Figure 6 shows a heat map representing the frequency with which the different FS models (columns) selected each of the variables (rows). Depending on the color of each cell, it is possible to get an idea of the number of times a variable was selected, with the cool colors being those cases with the lowest number of occurrences and the warm colors being those that appeared most frequently. In the case of the Decision Tree, there is greater diversity in the variables chosen by the models, since the colors are not so intense, unlike Gradient Boosting. Despite the differences between the variables chosen for the models, the extremes (top and bottom) of the heat map show a similar range of colors in most cases. The most frequent variables were sex, age, education, apnea, alcoholic beverages, both fermented and distilled, and metabolic syndrome.

5.1.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)

This time the average accuracy rate achieved was 0.7085, reaching a maximum of 0.7424, as shown in Table 2. A total of 20 variables were used, including sex, age, education, metabolic syndrome, apnea, IPAQ, and different variables associated with alcoholic beverages, among others. Figure 3 shows the evolution of the accuracy rate with the number of variables chosen. In this case, the maximum value is obtained with 20 variables.

5.1.7. Results for XGBoost with Evolutionary Feature Selection

The results obtained by applying the evolutionary FS to XGBoost are presented below.
  • GA with Stud selection for XGBoost (XGB-S): As shown in Table 2, an average accuracy rate of 0.7216 and a maximum of 0.7831 were achieved using 20 variables. Similar to other cases, variables such as sex, age, edu, and those related to alcoholic beverages appear. Nine of the selected variables belong to the group of pathologies. Among them, we find several types of cancer, apnea, diabetes, and metabolic syndrome.
  • GA with tournament selection for XGBoost (XGB-T): In this case, using a total of 19 variables, a mean accuracy ratio of 0.7029 was obtained, and a maximum of 0.7559 was reached, as seen in Table 2. Among the most common variables, we found sex, edu, and apnea, but age did not appear. In addition, five variables related to smoking and up to nine related to pathologies were selected.

5.2. Results without Pathologies

A new batch of experiments was performed with the dataset without pathologies, using the same algorithms as in the previous section. The results with this new dataset are worse than those obtained using the dataset with pathologies. It may indicate that the models obtained using the dataset with pathologies gave more importance to these variables, which may be considered a consequence of overweight/obesity rather than a cause.

5.2.1. Results without Feature Selection

After testing the dataset without pathologies with the algorithms without feature selection, a reduction in the accuracy rate of the models was observed with respect to the results with pathologies, obtaining an average accuracy rate of 0.6825 and a standard deviation of 0.0408. RFC obtained the highest result, reaching 0.7331 accuracy, as can be seen in Table 3.
Figure 7 shows the impact of the variables for the RFC model, using the dataset without pathologies. Age and sex maintain the highest positions with the same characteristics seen in previous cases. Among the variables referring to eating habits, vegetables and soda stand out. The values of vegetables seem to have an inversely proportional relationship with the possibility of being overweight or obese while in the case of soda this relationship is direct. In Figure 8, we can see the Pearson Correlation Matrix of the variables in Figure 7. As we can see, no significant correlation can be appreciated between variables with the exception of smoke (smoker or non-smoker) and nsmoke (number of cigarettes per day).

5.2.2. Results for Gradient Boosting with Classical Feature Selection (GB-RFECV)

In this configuration, the number of variables selected was 9, with an average accuracy rate of 0.7171 and a best case of 0.7695, as shown in Table 3. Legume intake, vegetable intake, physical exercise, and time as an ex-smoker are some of the selected variables. On the other hand, there are also the more common ones, such as sex, age, and education. In Figure 3, the accuracy reaches its maximum value at nine variables.

5.2.3. Results for Gradient Boosting with Evolutionary Feature Selection

The results obtained using GB with evolutionary FS are presented below.
  • GA with Stud selection for Gradient Boosting (GB-S): Using only 12 variables, it was possible to achieve an average accuracy of 0.7295 and 0.7797 for the best case, a slight improvement compared to the version without feature selection, as shown in Table 3. In this new set of variables, sex and age continue to dominate, similarly to the set with pathologies. Other variables used are those related to hours of sleep, fermented/distilled drinks, soft drinks, legume portions, and education.
The graph of the impact of the variables corresponding to the Gradient Boosting model with Stud selection for the case without pathologies is shown in Figure 9. In the highest positions we find sex, age, education, and soda with the same performance as above. A new variable to highlight is legumes, whose highest values appear in the left zone of the graph. As for alcoholic beverages, we find apparent differences between distilled and fermented beverages; for the higher values of the variable wineWEEK, the probability of being overweight/obese is lower. In the case of spirits, it seems that their intake may be associated with overweight/obesity.
  • GA with tournament selection for Gradient Boosting (GB-T): Using the tournament selection method, the number of variables used was also 12 with an average accuracy of 0.7307 and for the best-case scenario up to 0.7763, as shown in Table 3. New variables were included: adherence to the Mediterranean diet and the population. Other variables seen previously such as education, soft drinks, distilled/fermented beverages, sex, age, and hours of sleep, among others, also appear.

5.2.4. Results for Decision Tree with Classical Feature Selection (DT-RFECV)

For this case, RFECV selected a single variable, age, achieving an average accuracy rate of 0.6821 and a maximum of 0.7186, as shown in Table 3. Classical feature-selection algorithms tend to produce solutions with very few variables; in our case, we are looking for models that explain more extensively the causes of overweight and obesity, so we explore other solutions, such as those based on evolutionary algorithms, that allow us to obtain models with a greater number of variables and similar accuracy. Looking at Figure 3, the accuracy reaches its maximum value with only one variable, but here we have to consider two facts:
  • This value is not very far from the case of using all the variables;
  • From a medical point of view, the use of a single variable model does not provide any help to clinical practice.
Figure 3 shows the variation of the cross-validation score (inside RFECV strategy) for the different algorithms for both the pathology and non-pathology cases. This value is of little interest if we look at the small difference in the values on the horizontal axis and at the fact that we are comparing two different datasets. What really interests us is to see the differences in the number of selected features (peaks marked with a red dot) and how these values vary depending on the algorithms and the datasets.

5.2.5. Results for Decision Tree with Evolutionary Feature Selection

The following results were obtained with the DT using the evolutionary GA.
  • GA with Stud selection for Decision Tree (DT-S): Using the Stud selection method, a total of 14 variables were chosen, with an average accuracy of 0.6934 and 0.7322 for the best case, as can be seen in Table 3. In this case, the age variable was selected but not the sex variable. The variables chosen include education, population, hours of sleep, alcoholic drinks, soft drinks, legume portions, vegetable portions, and the stress variable as a new incorporation.
Figure 10 shows the impact of the different variables for the Decision Tree model with Stud for the set without pathologies. Age appears first again; the next variable is education, and it is observed that, for higher values (higher levels of studies), the probability of being overweight/obese is lower. In the case of soft drinks, higher consumption implies a higher probability of being overweight or obese. Another variable to note is vegetable, where most of the red dots are on the left side of the graph, so higher vegetable intake can be associated with a lower probability of being overweight/obese. It would be interesting to study in the future the causes of the behaviors of the variables stress and ex-smoker.
  • GA with tournament selection for Decision Tree (DT-T): In this case, the number of variables used was 15, obtaining an average accuracy of 0.6862 and reaching 0.7525 in the best case, as shown in Table 3. Among the variables chosen are portions of legumes, fermented drinks, the time without smoking of an ex-smoker, and the IPAQ, the variable that reflects physical exercise. Other variables found are the most common, such as sex, age, soft drinks, and education, among others.
Figure 11 shows the heat map obtained by representing the frequency of variables selected from the dataset without pathologies using the FS models with genetic algorithms. There is a closer correlation between the variables selected with the GB and DT models, compared to the case of the dataset with pathologies. Again the most commonly selected variables are age, sex, and education followed by alcoholic beverages. As a new addition among the most selected variables, we find the variable soda.

5.2.6. Results for XGBoost with Classical Feature Selection (XGB-RFECV)

With only two variables, sex and age, an average accuracy rate of 0.6850 and a maximum of 0.7186 were obtained, as shown in Table 3. The evolution of the accuracy rate as a function of the number of variables used is shown in Figure 3. From a medical point of view, the selection of only these two variables does not provide any information.

5.2.7. Results for XGBoost with Evolutionary Feature Selection

Results obtained with the XGBoost using the evolutionary GA were as follows.
  • GA with Stud selection for XGBoost (XGB-S): As shown in Table 3, a mean accuracy rate of 0.6989 and a maximum of 0.7424 were achieved using 12 variables. In this case, among the variables selected, we found some of the most common ones, such as sex, age, edu, or soda. In addition, there are variables such as vegetable or those related to alcoholic beverages.
  • GA with tournament selection for XGBoost (XGB-T): Using the 11 selected variables, an average accuracy rate of 0.6948 and a maximum of 0.7288 were obtained, as can be seen in Table 3. In addition to the most common variables, such as sex, age, education, or soda, we can find variables related to alcoholic beverages, both distilled and fermented, vegetable intake, and earning.
Table 4 gives a sample of the time in seconds taken by the GA for the different algorithms and datasets. One run of each type was carried out individually (no other program was running in parallel) on the same machine to obtain these values. The time is strongly dependent on the algorithm chosen.

6. Statistical Analysis

We performed a nonparametric statistical analysis once the results were obtained for the two proposed problems, with and without pathologies. The Friedman test [32] is a nonparametric [33] alternative to the two-way parametric analysis of variance that tries to detect significant differences between the behavior of two or more algorithms. It can be used to identify if, in a set of k samples (where k ≥ 2), at least two of the samples represent populations with different median values. The first step is to convert the original results, in Table 5, into ranks to produce Table 6. Table 5 shows the average accuracy rate for each of the algorithms on the two datasets.
  • Gather observed results for each algorithm/problem pair.
  • For each problem i, rank values from 1 (best result) to k (worst result). Denote these ranks as:
    $r_i^j, \quad 1 \le j \le k$
    where k is the number of algorithms.
  • For each algorithm j, average the ranks obtained in all problems to obtain the final rank:
    $R_j = \frac{1}{n} \sum_{i=1}^{n} r_i^j$
    where n is the number of datasets.
In Table 6, the best algorithms are Gradient Boosting with Stud selection method, Gradient Boosting with tournament method, and Random Forest without feature selection. In both problems, algorithms with feature selection outperform their traditional version.
Under the null hypothesis (H0), which states that all algorithms behave similarly and therefore their ranks $R_j$ should be equal, the Friedman statistic $F_f$ can be calculated as:
$F_f = \frac{12 n}{k (k + 1)} \left[ \sum_{j} R_j^2 - \frac{k (k + 1)^2}{4} \right]$
which is distributed according to a $\chi^2$ distribution with k − 1 degrees of freedom (here k = 19) [33]. As usual, we define the p-value as the probability of obtaining a result as extreme as or more extreme than the observed one, provided that the null hypothesis is true, and we have chosen a significance threshold of 0.05. For this threshold, the critical value of the $\chi^2$ distribution with 18 degrees of freedom is 28.8693. To reject H0, the statistic must exceed this value. Applying the formula of Friedman's statistic, Equation (7), a value of 34.6421 is obtained, so the critical value is exceeded and the hypothesis H0 is rejected, confirming that there are significant differences between the models.
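A small sketch of this computation from a matrix of accuracies (our own illustration with toy numbers; note that ties are broken arbitrarily here, whereas the standard Friedman test assigns average ranks to ties):

```python
import numpy as np
from scipy.stats import chi2

def friedman_statistic(results):
    """Friedman F_f from an (n datasets x k algorithms) matrix where higher values are better."""
    n, k = results.shape
    # Rank within each dataset: 1 for the best value, k for the worst (ties broken arbitrarily).
    ranks = np.argsort(np.argsort(-results, axis=1), axis=1) + 1
    mean_ranks = ranks.mean(axis=0)  # the R_j values
    f_f = 12 * n / (k * (k + 1)) * (np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4)
    critical = chi2.ppf(0.95, df=k - 1)
    return f_f, critical

# Toy example with 2 datasets and 4 algorithms (not the paper's values).
accuracies = np.array([[0.74, 0.73, 0.70, 0.69],
                       [0.72, 0.73, 0.69, 0.68]])
print(friedman_statistic(accuracies))
```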

7. Conclusions and Future Work

In this work, a set of classification systems for persons at risk of suffering from overweight/obesity has been developed. Four different FS strategies have been employed for the two experimental datasets: one FS method from the literature, two evolutionary FS methods, and no FS. Ten machine-learning models have been employed. For the application of the feature-selection methods, we chose three of these ten models, based on reasons of performance and possible application to medical clinical practice. Thus, the FS methods were successfully applied to the Gradient Boosting, XGBoost, and Decision Tree algorithms. The most important conclusions of this work are:
  • Although not a surprising finding, we have found GA to be a very competent tool to perform feature selection and thus improve the training of classification models.
  • The Stud selection, which uses an elitist set that is always part of the crossover process, achieves promising results.
  • If we look at the fitness of the best individual in each of the final populations, they all maintain a similar standard deviation, perhaps indicating that we might expect to obtain similar results in the future with new data sets.
  • Regarding the variables related to pathologies, it is necessary to identify what is the cause and what is the effect. A good example is sleep apnea. On several occasions, models have relied on apnea when classifying, but this disorder is in many cases caused by overweight or obesity. A person may have apnea due to being overweight or obese but will not necessarily be overweight or obese due to suffering from apnea.
  • Finally, significant differences were found among the algorithms, with Gradient Boosting with feature selection being the one obtaining the best results.
The models developed in this work will be the basis of a recommendation system. This system will be able to warn people about behavioral tendencies that will end up producing overweight and obesity and recommend healthy habits to replace them.

Future Work

Although the overall accuracy of the model is important, from a medical point of view, classifying an individual at risk of overweight/obesity as not at risk is more detrimental than classifying a healthy person as at risk. The model must take this into account, so it would be interesting to find different fitness functions or develop a multi-objective evolutionary algorithm that could increase accuracy and reduce the number of false negatives for at-risk individuals. Therefore, we intend to test with precision and recall as fitness functions instead of accuracy.
In this work, and for space reasons, we have only used accuracy as the fitness function. It therefore remains as future work to investigate other fitness functions, as well as heuristic search or dynamic selection, which have already been tested in the literature of other fields [34,35].
It is also necessary to study carefully the parameters of the evolutionary algorithm, testing with different combinations of population size and number of generations.
It would be highly recommended to increase the volume of the dataset as it could improve the accuracy of the models. As part of the project, genetic information of individuals will be incorporated, so it will be necessary to perform a study on the impact of these variables and their possible interaction with the current ones.
In this study we achieved an accuracy close to 0.8. Can these results be considered good enough? Can they be improved? Are they unacceptable? Although the members of our research team who are specialists in medicine consider these results good, we lack a context with a clear metric to determine where the minimum acceptable level lies and how high we can go. Establishing this scale remains ambitious future work.

Author Contributions

Conceptualization: J.I.H., J.M.V. and J.J.Z.-L.; methodology: J.J.Z.-L., D.P., O.G. and J.I.H.; software: D.P. and A.G.-G.; validation: O.G., J.J.Z.-L. and J.M.V.; formal analysis: J.J.Z.-L., N.d.l.H. and J.I.H.; investigation: D.P. and K.Z.-N.; resources: K.Z.-N., J.J.Z.-L., N.d.l.H. and J.I.H.; data curation: K.Z.-N., D.P., A.G.-G. and N.d.l.H.; writing—original draft preparation: D.P. and A.G.-G.; writing—review and editing: J.M.V., O.G. and J.I.H.; visualization: D.P., A.G.-G. and J.M.V.; supervision: J.I.H.; project administration: J.I.H. and J.J.Z.-L.; funding acquisition: J.J.Z.-L. and J.I.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financed by the regional government of Madrid through project B2017/BMD3773 (GenObIA-CM), co-financed by the EU Structural Funds, and by the Spanish Ministry of Economy and Competitiveness under grants RTI2018-095180-B-I00 and PID2021-125549OB-I00.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Clinical Research Ethics Committee of the Community of Madrid (Comité Ético de Investigación Clínica de la Comunidad de Madrid) and the Regional Clinical Research Ethics Committee (Comité Ético de Investigación Clínica-Regional, CEIC-R). Genetic analyses were always carried out in compliance with the provisions of the Spanish Biomedical Research Law (Law 14/2007) and the Personal Data Protection Law (Law 15/1999).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data that support the findings of this study are available on reasonable request from the corresponding author [J.I. Hidalgo]. The data are not publicly available due to legal restrictions.

Acknowledgments

We would also like to thank the centers that provided the data and made this work possible: Atención Primaria, Hospital Clínico San Carlos, Hospital Universitario 12 de Octubre, Hospital Universitario La Paz, Hospital General Universitario Gregorio Marañón, Hospital Universitario Ramón y Cajal, and Hospital Universitario Infanta Leonor.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Representation of Project Variables

Table A1. Representation of project variables.
ID | Variable | Description | Type
1 | sex | Sex of the person | General information
2 | age | Age of the person in years | General information
3 | pop | Volume of the population where the person resides | General information
4 | edu | Academic level attained by the person | General information
5 | earning | Income level of the person | General information
6 | job | Work type performed by the person | General information
7 | stress | Person self-perceived stress | General information
8 | sleep.8 | The person sleeps more than eight hours | General information
9 | spirit | The person drinks spirits | General information
10 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks
11 | wine_beer | The person drinks beer or wine | Alcoholic drinks
12 | beerWEEK | Units of beer per week | Alcoholic drinks
13 | wineWEEK | Units of red wine per week | Alcoholic drinks
14 | whiteWEEK | Units of white wine per week | Alcoholic drinks
15 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks
16 | smoke | The person smokes | Tobacco
17 | nsmoke | Cigarettes consumed per day | Tobacco
18 | pipe | Pipe tobacco consumed per day | Tobacco
19 | cigar | Cigars consumed per day | Tobacco
20 | exsmokerY | Time since a smoker quit smoking in years | Tobacco
21 | exsmokerUNK | The person has given up smoking but does not remember how long ago | Tobacco
22 | cancer | The person has suffered or suffers from cancer | Pathologies
23 | cancer_mam | The person has suffered or suffers from breast cancer | Pathologies
24 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies
25 | cancer_pros | The person has suffered or suffers from prostate cancer | Pathologies
26 | cancer_lung | The person has suffered or suffers from lung cancer | Pathologies
27 | cancer_other | The person has suffered or suffers from another type of cancer | Pathologies
28 | heart_attack | The person has suffered an acute myocardial infarction | Pathologies
29 | heart_angina | The person has suffered angina pectoris | Pathologies
30 | heart_failure | The person has suffered heart failure | Pathologies
31 | diabetes | The person has type 2 diabetes mellitus | Pathologies
32 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies
33 | apnea | The person suffers from sleep apnea | Pathologies
34 | asthma | The person has asthma | Pathologies
35 | COPD | The person suffers from chronic obstructive pulmonary disease | Pathologies
36 | ADH | The person has adherence to the Mediterranean diet | Nutritional habits
37 | vege | Servings of vegetables consumed by the individual per day | Nutritional habits
38 | soda | Servings of carbonated and/or sweetened drinks consumed by the subject per day | Nutritional habits
39 | legume | Servings of legumes consumed by the subject per week | Nutritional habits
40 | milk | Servings of milk or dairy products consumed by the subject per day | Nutritional habits
41 | IPAQ | Subject scores on the International Physical Activity Questionnaire (IPAQ) | Physical exercise

Appendix A.2. Representation of All Variables in the Original Data Set

Table A2. Representation of all variables in the original data set.
ID | Variable | Description | Type
1 | n | Inclusion number | General information
2 | center | Center | General information
3 | sex | Sex of the person | General information
4 | age | Age of the person in years | General information
5 | height | Height (m) | General information
6 | weight | Weight (kg) | General information
7 | IMC | BMI | General information
8 | waist | Waist circumference (cm) | General information
9 | pop | Volume of the population where the person resides | General information
10 | edu | Academic level attained by the person | General information
11 | earning | Income level of the person | General information
12 | job | Work type performed by the person | General information
13 | stress | Person self-perceived stress | General information
14 | sleep.8 | The person sleeps more than eight hours | General information
15 | spirit | The person drinks spirits | Alcoholic drinks
16 | spiritWEEK | Units of spirit drinks per week | Alcoholic drinks
17 | wine_beer | The person drinks beer or wine | Alcoholic drinks
18 | beerWEEK | Units of beer per week | Alcoholic drinks
19 | wineWEEK | Units of red wine per week | Alcoholic drinks
20 | whiteWEEK | Units of white wine per week | Alcoholic drinks
21 | pinkWEEK | Units of rosé wine per week | Alcoholic drinks
22 | smoke | The person smokes | Tobacco
23 | nsmoke | Cigarettes consumed per day | Tobacco
24 | pipe | Pipe tobacco consumed per day | Tobacco
25 | cigar | Cigars consumed per day | Tobacco
26 | exsmokerY | Time since a smoker quit smoking (years) | Tobacco
27 | exsmokerM | Time since a smoker quit smoking (months) | Tobacco
28 | exsmokerUNK | The person has given up smoking but does not remember how long ago | Tobacco
29 | cancer | The person has suffered or suffers from cancer | Pathologies
30 | cancer_mam | The person has suffered or suffers from breast cancer | Pathologies
31 | cancer_col | The person has suffered or suffers from colon cancer | Pathologies
32 | cancer_pros | The person has suffered or suffers from prostate cancer | Pathologies
33 | cancer_lung | The person has suffered or suffers from lung cancer | Pathologies
34 | cancer_other | The person has suffered or suffers from another type of cancer | Pathologies
35 | heart_attack | The person has suffered an acute myocardial infarction | Pathologies
36 | heart_angina | The person has suffered angina pectoris | Pathologies
37 | heart_failure | The person has suffered heart failure | Pathologies
38 | diabetes | The person has type 2 diabetes mellitus | Pathologies
39 | hemo | Glycosylated hemoglobin (%) | Pathologies
40 | metabolic_syn | The person suffers from metabolic syndrome | Pathologies
41 | apnea | The person suffers from sleep apnea | Pathologies
42 | asthma | The person has asthma | Pathologies
43 | COPD | The person suffers from chronic obstructive pulmonary disease | Pathologies
44 | ADH | The person has adherence to the Mediterranean diet | Nutritional habits
45 | ADH_tot | Total ADH points | Nutritional habits
46 | olive | Use olive oil | Nutritional habits
47 | n_olive | Use olive oil (POINTS) | Nutritional habits
48 | tot_olive | Tablespoons of olive oil consumed in total per day | Nutritional habits
49 | ntot_olive | Tablespoons of olive oil consumed in total per day (POINTS) | Nutritional habits
50 | vege | Servings of vegetables consumed per day | Nutritional habits
51 | n_vege | Servings of vegetables consumed per day (POINTS) | Nutritional habits
52 | fruit | Pieces of fruit (including natural juice) consumed per day | Nutritional habits
53 | n_fruit | Pieces of fruit (including natural juice) consumed per day (POINTS) | Nutritional habits
54 | burger | Red meat portions | Nutritional habits
55 | n_burger | Red meat portions (POINTS) | Nutritional habits
56 | cream | Servings of butter, margarine or cream consumed per day | Nutritional habits
57 | n_cream | Servings of butter, margarine or cream consumed per day (POINTS) | Nutritional habits
58 | soda | Glasses of carbonated and/or sweetened beverages per day | Nutritional habits
59 | n_soda | Glasses of carbonated and/or sweetened beverages per day (POINTS) | Nutritional habits
60 | wine_week | Wine consumed per week | Nutritional habits
61 | n_wine_week | Wine consumed per week (POINTS) | Nutritional habits
62 | legume | Servings of legumes per week | Nutritional habits
63 | n_legume | Servings of legumes per week (POINTS) | Nutritional habits
64 | fish | Servings of fish or seafood consumed per week | Nutritional habits
65 | n_fish | Servings of fish or seafood consumed per week (POINTS) | Nutritional habits
66 | cake | Times per week consuming commercial bakery products | Nutritional habits
67 | n_cake | Times per week consuming commercial bakery products (POINTS) | Nutritional habits
68 | nuts | Servings of nuts and dried fruit consumed per week | Nutritional habits
69 | n_nuts | Servings of nuts and dried fruit consumed per week (POINTS) | Nutritional habits
70 | chicken | Preferably consumes chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages | Nutritional habits
71 | n_chicken | Preferably consumes chicken, turkey or rabbit meat instead of beef, pork, hamburgers or sausages (POINTS) | Nutritional habits
72 | sauce | Times a week eating cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil | Nutritional habits
73 | n_sauce | Times a week eating cooked vegetables, pasta, rice or other dishes seasoned with a tomato, garlic, onion or leek sauce simmered with olive oil (POINTS) | Nutritional habits
74 | milk | Milk or dairy products (yogurts, cheese) consumed per day | Nutritional habits
75 | n_milk | Milk or dairy products (yogurts, cheese) consumed per day (POINTS) | Nutritional habits
76 | milk_light | The person takes skimmed dairy products | Nutritional habits
77 | n_milk_light | The person takes skimmed dairy products (POINTS) | Nutritional habits
78 | IPAQ | IPAQ points | Physical exercise
79 | cal_IPAQ | IPAQ calories | Physical exercise
80 | exercise_H | Days of intense physical exercise | Physical exercise
81 | exercise_H_mets | Days of intense physical exercise (METs) | Physical exercise
82 | exercise_H_min | Intense physical exercise in one day (minutes) | Physical exercise
83 | exercise_H_tot | Not sure about the time of intense physical exercise in one day | Physical exercise
84 | exercise_L | Days of moderate physical exercise | Physical exercise
85 | exercise_L_mets | Days of moderate physical exercise (METs) | Physical exercise
86 | exercise_L_min | Moderate physical exercise in one day (minutes) | Physical exercise
87 | exercise_L_tot | Not sure about the time of moderate physical exercise in one day | Physical exercise
88 | exercise_walk | Days of sedentary physical exercise | Physical exercise
89 | exercise_walk_mets | Days of sedentary physical exercise (METs) | Physical exercise
90 | exercise_walk_min | Sedentary physical exercise in one day (minutes) | Physical exercise
91 | exercise_walk_tot | Not sure about the time of sedentary physical exercise in one day | Physical exercise
92 | exercise_sit_min | Time spent sitting during a day (minutes) | Physical exercise
93 | exercise_sit | The person is not sure how much time was spent sitting during a day (minutes) | Physical exercise

Appendix A.3. Models Parameters

Table A3. Details of the model parameters. Only the parameters explicitly listed for each model in the original table are shown; fields marked with a dash in the original table are omitted.
MODEL | Parameters
XGB | Objective = binary:logistic, Learning_Rate = None, Max_Depth = None, n_estimators = 100
GB | ccp_Alpha = 0.0, Criterion = friedman_mse, Learning_Rate = 0.1, Loss = deviance, Max_Depth = 3, Max_Features = None, n_estimators = 100
DT | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = 6, Max_Features = None, Splitter = best
RFC | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = None, Max_Features = auto, n_estimators = 100, Bootstrap = True
ADB | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Learning_Rate = 1.0, Max_Depth = None, Max_Features = None, n_estimators = 500, Splitter = best, Algorithm = SAMME, Base_Estimator = DecisionTreeClassifier
LR | Class_Weight = balanced, Max_iter = 100, Solver = lbfgs, Tol = 0.0001, Penalty = l2
ET | ccp_Alpha = 0.0, Class_Weight = None, Criterion = gini, Max_Depth = None, Max_Features = auto, n_estimators = 250, Bootstrap = False
GNB | Priors = None, Var_Smoothing = 1 × 10−9
BNB | ccp_Alpha = 1.0, Binarize = 0.0, Fit_Prior = True
BG | Max_Features = auto, n_estimators = 250, Bootstrap = True
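For readers who want to reproduce the experimental setup, the following minimal sketch shows how two of the configurations in Table A3 (Gradient Boosting and Decision Tree) might be instantiated with scikit-learn. The parameter mapping is our reading of the table, the variable names are ours, and loss="deviance" assumes a scikit-learn release in which that spelling is still accepted (newer releases call it "log_loss"); it is not the authors' original code.

# Illustrative sketch only: instantiating the Gradient Boosting and Decision Tree
# classifiers with the hyperparameters listed in Table A3.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

gb = GradientBoostingClassifier(
    ccp_alpha=0.0,
    criterion="friedman_mse",
    learning_rate=0.1,
    loss="deviance",      # renamed to "log_loss" in scikit-learn >= 1.1
    max_depth=3,
    max_features=None,
    n_estimators=100,
)

dt = DecisionTreeClassifier(
    ccp_alpha=0.0,
    class_weight=None,
    criterion="gini",
    max_depth=6,
    max_features=None,
    splitter="best",
)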

References

  1. Keys, A.; Fidanza, F.; Karvonen, M.J.; Kimura, N.; Taylor, H.L. Indices of relative weight and obesity. J. Chronic Dis. 1972, 25, 329–343.
  2. Spanish Ministry of Health (Ministerio de Sanidad, Consumo y Bienestar Social). Encuesta Nacional de Salud. España 2017. Available online: https://www.mscbs.gob.es/estadEstudios/estadisticas/encuestaNacional/encuestaNac2017/ENSE2017_notatecnica.pdf (accessed on 15 January 2021).
  3. World Health Organization. Obesity: Preventing and Managing the Global Epidemic; World Health Organization: Geneva, Switzerland, 2000; 252p.
  4. Khatib, W.; Fleming, P.J. The Stud GA: A mini revolution? In Proceedings of the Parallel Problem Solving from Nature—PPSN V, Amsterdam, The Netherlands, 27–30 September 1998; Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.P., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 683–691.
  5. El Naqa, I.; Murphy, M.J. What is machine learning? In Machine Learning in Radiation Oncology; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3–11.
  6. De Prado, M.L. Advances in Financial Machine Learning; John Wiley & Sons: Hoboken, NJ, USA, 2018.
  7. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4.
  8. Braga-Neto, U. Fundamentals of Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2020.
  9. Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artif. Intell. Med. 2001, 23, 89–109.
  10. Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare 2022, 10, 541.
  11. Pirgazi, J.; Alimoradi, M.; Abharian, T.E.; Olyaee, M.H. An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Sci. Rep. 2019, 9, 18580.
  12. Chandrashekar, G.; Sahin, F. A survey on feature-selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
  13. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
  14. Reunanen, J. Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 2003, 3, 1371–1382.
  15. Pudil, P.; Novovičová, J.; Kittler, J. Floating search methods in feature selection. Pattern Recognit. Lett. 1994, 15, 1119–1125.
  16. Misra, P.; Yadav, A.S. Improving the classification accuracy using recursive feature elimination with cross-validation. Int. J. Emerg. Technol. 2020, 11, 659–665.
  17. Kumar, G.R.; Ramachandra, G.; Nagamani, K. An efficient feature selection system to integrating SVM with genetic algorithm for large medical datasets. Int. J. 2014, 4, 272–277.
  18. Reddon, H.; Gerstein, H.C.; Engert, J.C.; Mohan, V.; Bosch, J.; Desai, D.; Bailey, S.D.; Diaz, R.; Yusuf, S.; Anand, S.S.; et al. Physical activity and genetic predisposition to obesity in a multiethnic longitudinal study. Sci. Rep. 2016, 6, 18672.
  19. Chatterjee, A.; Gerdes, M.W.; Martinez, S.G. Identification of Risk Factors Associated with Obesity and Overweight—A Machine Learning Overview. Sensors 2020, 20, 2734.
  20. Muhamad Adnan, M.H.B.; Husain, W.; Abdul Rashid, N. A hybrid approach using Naïve Bayes and Genetic Algorithm for childhood obesity prediction. In Proceedings of the 2012 International Conference on Computer Information Science (ICCIS), Chongqing, China, 17–19 August 2012; Volume 1, pp. 281–285.
  21. Mirjalili, S. Genetic algorithm. In Evolutionary Algorithms and Neural Networks; Springer: Berlin/Heidelberg, Germany, 2019; pp. 43–55.
  22. Affenzeller, M.; Winkler, S.; Wagner, S.; Beham, A. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications; Chapman and Hall/CRC Publishers: London, UK, 2009.
  23. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI'95, Montreal, QC, Canada, 20–25 August 1995; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; Volume 2, pp. 1137–1143.
  24. Rao, R.; Fung, G. On the Dangers of Cross-Validation. An Experimental Evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008; pp. 588–596.
  25. Miller, B.L.; Goldberg, D.E. Genetic algorithms, tournament selection, and the effects of noise. Complex Syst. 1995, 9, 193–212.
  26. Bäck, T. Selective Pressure in Evolutionary Algorithms: A Characterization of Selection Mechanisms. In Proceedings of the First IEEE Conference on Evolutionary Computation, Orlando, FL, USA, 27–29 June 1994; pp. 57–62.
  27. Jolly, K. Machine Learning with Scikit-Learn Quick Start Guide: Classification, Regression, and Clustering Techniques in Python; Packt Publishing Ltd.: Birmingham, UK, 2018.
  28. Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  29. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  30. Myles, A.J.; Feudale, R.N.; Liu, Y.; Woody, N.A.; Brown, S.D. An introduction to Decision Tree modeling. J. Chemom. 2004, 18, 275–285.
  31. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
  32. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18.
  33. Eisinga, R.; Heskes, T.; Pelzer, B.; Te Grotenhuis, M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinform. 2017, 18, 68.
  34. Chen, H.; Jiang, W.; Li, C.; Li, R. A Heuristic Feature Selection Approach for Text Categorization by Using Chaos Optimization and Genetic Algorithm. Math. Probl. Eng. 2013, 2013, 1–6.
  35. Malhotra, R.; Khanna, M. Dynamic selection of fitness function for software change prediction using Particle Swarm Optimization. Inf. Softw. Technol. 2019, 112, 51–67.
Figure 1. Workflow diagram.
Figure 2. Impact of variables with SHAP, Random Forest with pathologies, and no FS.
Figure 3. Mean accuracy evolution graph with RFECV.
Figure 4. Impact of variables with SHAP, Gradient Boosting with Stud selection, with pathologies.
Figure 5. Impact of variables with SHAP, Decision Tree with Stud selection, with pathologies.
Figure 6. Heat map for frequency of the different variables using FS with GA, dataset with pathologies.
Figure 7. Impact of variables with SHAP, Random Forest without pathologies, and no FS.
Figure 8. Pearson Correlation Matrix of variables in Figure 7.
Figure 9. Impact of variables with SHAP, Gradient Boosting with Stud selection, without pathologies.
Figure 10. Impact of variables with SHAP, Decision Tree with Stud selection, without pathologies.
Figure 11. Heat map representation of the frequency of different variables obtained using FS with GA, dataset without pathologies.
Table 1. Example of confusion matrix structure for binary classification.
 | Positive Prediction | Negative Prediction
Positive Class | (TP) | (FN)
Negative Class | (FP) | (TN)
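For reference, and assuming the standard definitions, the per-class precision and recall columns reported in Tables 2 and 3 can be read from this confusion matrix as follows (written here for class 1; the class-0 variants are obtained by exchanging the roles of the two classes):

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision}_1 = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}_1 = \frac{TP}{TP + FN}.
\]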
Table 2. Results of the different algorithms for the set of variables with pathologies. The table shows the algorithm, selection method, number of variables used, best case, worst case, mean, and standard deviation. The algorithm with the highest mean is marked with bold font.
ID | Algorithm | FS | Variables | Best | Worst | Mean | Std ± | Precision_0 | Precision_1 | Recall_0 | Recall_1
DT-S | Decision Tree | Stud-GA | 14 | 0.7661 | 0.6407 | 0.7150 | 0.0309 | 0.7682 | 0.7639 | 0.7733 | 0.7586
DT-T | Decision Tree | Tournament-GA | 16 | 0.7492 | 0.6542 | 0.7103 | 0.0238 | 0.7255 | 0.7746 | 0.7762 | 0.7237
DT-RFECV | Decision Tree | RFECV | 13 | 0.7559 | 0.6373 | 0.6962 | 0.0257 | 0.7597 | 0.7518 | 0.7697 | 0.7413
DT | Decision Tree | No FS | 37 | 0.7254 | 0.6237 | 0.6914 | 0.0232 | 0.7035 | 0.7561 | 0.8013 | 0.6458
XGB-S | XGBoost | Stud-GA | 20 | 0.7831 | 0.6780 | 0.7216 | 0.0245 | 0.7451 | 0.8239 | 0.8201 | 0.7500
XGB-T | XGBoost | Tournament-GA | 19 | 0.7559 | 0.6746 | 0.7029 | 0.0187 | 0.7614 | 0.7479 | 0.8171 | 0.6794
XGB-RFECV | XGBoost | RFECV | 20 | 0.7424 | 0.6746 | 0.7085 | 0.0201 | 0.7677 | 0.7143 | 0.7484 | 0.7353
XGB | XGBoost | No FS | 37 | 0.7593 | 0.6441 | 0.6981 | 0.0225 | 0.7669 | 0.7500 | 0.7911 | 0.7226
GB-S | Gradient Boosting | Stud-GA | 19 | 0.7966 | 0.6881 | 0.7382 | 0.0231 | 0.8204 | 0.7656 | 0.8204 | 0.7656
GB-T | Gradient Boosting | Tournament-GA | 23 | 0.7864 | 0.6644 | 0.7332 | 0.0251 | 0.7738 | 0.8031 | 0.8387 | 0.7286
GB-RFECV | Gradient Boosting | RFECV | 17 | 0.7797 | 0.6915 | 0.7324 | 0.0199 | 0.7727 | 0.7899 | 0.8447 | 0.7015
GB | Gradient Boosting | No FS | 37 | 0.7797 | 0.6814 | 0.7305 | 0.0211 | 0.7683 | 0.7939 | 0.8235 | 0.7324
ADB | AdaBoost | No FS | 37 | 0.6746 | 0.6000 | 0.6424 | 0.0189 | 0.6690 | 0.6800 | 0.6690 | 0.6800
BG | Bagging | No FS | 37 | 0.7458 | 0.6678 | 0.7114 | 0.0210 | 0.7486 | 0.7411 | 0.8253 | 0.6434
BNB | Bernoulli NB | No FS | 37 | 0.7220 | 0.6271 | 0.6675 | 0.0198 | 0.7429 | 0.6917 | 0.7784 | 0.6484
ET | Extra Trees | No FS | 37 | 0.7627 | 0.6508 | 0.7119 | 0.0276 | 0.7419 | 0.7857 | 0.7931 | 0.7333
GNB | Gaussian NB | No FS | 37 | 0.7051 | 0.5932 | 0.6603 | 0.0230 | 0.6792 | 0.7711 | 0.8834 | 0.4848
LR | Logistic Regression | No FS | 37 | 0.7627 | 0.6780 | 0.7098 | 0.0190 | 0.7603 | 0.7651 | 0.7603 | 0.7651
RFC | Random Forest | No FS | 37 | 0.7763 | 0.6644 | 0.7292 | 0.0240 | 0.7582 | 0.7958 | 0.8000 | 0.7533
Table 3. Results of the different algorithms for the set of variables without pathologies. The table shows the algorithm, selection method, number of variables used, best case, worst case, mean, and standard deviation. The algorithm with the highest mean is marked with bold font.
ID | Algorithm | FS | Variables | Best | Worst | Mean | Std ± | Precision_0 | Precision_1 | Recall_0 | Recall_1
DT-S | Decision Tree | Stud-GA | 14 | 0.7322 | 0.6407 | 0.6934 | 0.0212 | 0.7419 | 0.7214 | 0.7468 | 0.7163
DT-T | Decision Tree | Tournament-GA | 15 | 0.7525 | 0.6169 | 0.6862 | 0.0307 | 0.7548 | 0.7500 | 0.7697 | 0.7343
DT-RFECV | Decision Tree | RFECV | 1 | 0.7186 | 0.6373 | 0.6821 | 0.0195 | 0.7200 | 0.7172 | 0.7248 | 0.7123
DT | Decision Tree | No FS | 26 | 0.7153 | 0.6203 | 0.6799 | 0.0252 | 0.7533 | 0.6759 | 0.7062 | 0.7259
XGB-S | XGBoost | Stud-GA | 12 | 0.7424 | 0.6237 | 0.6989 | 0.0260 | 0.7702 | 0.7090 | 0.7607 | 0.7197
XGB-T | XGBoost | Tournament-GA | 11 | 0.7288 | 0.6542 | 0.6948 | 0.0223 | 0.7235 | 0.7360 | 0.7885 | 0.6619
XGB-RFECV | XGBoost | RFECV | 2 | 0.7186 | 0.6475 | 0.6850 | 0.0164 | 0.6746 | 0.7778 | 0.8028 | 0.6405
XGB | XGBoost | No FS | 26 | 0.7254 | 0.6068 | 0.6803 | 0.0295 | 0.7024 | 0.7559 | 0.7919 | 0.6575
GB-S | Gradient Boosting | Stud-GA | 12 | 0.7797 | 0.6814 | 0.7295 | 0.0230 | 0.7578 | 0.8060 | 0.8243 | 0.7347
GB-T | Gradient Boosting | Tournament-GA | 12 | 0.7763 | 0.6610 | 0.7307 | 0.0250 | 0.7636 | 0.7923 | 0.8235 | 0.7254
GB-RFECV | Gradient Boosting | RFECV | 9 | 0.7695 | 0.6610 | 0.7171 | 0.0236 | 0.7471 | 0.8000 | 0.8355 | 0.6993
GB | Gradient Boosting | No FS | 26 | 0.7695 | 0.6678 | 0.7169 | 0.0280 | 0.7857 | 0.7518 | 0.7756 | 0.7626
ADB | AdaBoost | No FS | 26 | 0.6678 | 0.5661 | 0.6236 | 0.0262 | 0.6883 | 0.6454 | 0.6795 | 0.6547
BG | Bagging | No FS | 26 | 0.7695 | 0.6644 | 0.7086 | 0.0254 | 0.7709 | 0.7672 | 0.8364 | 0.6846
BNB | Bernoulli NB | No FS | 26 | 0.6881 | 0.5695 | 0.6154 | 0.0295 | 0.7048 | 0.6667 | 0.7312 | 0.6370
ET | Extra Trees | No FS | 26 | 0.7661 | 0.6542 | 0.7125 | 0.0251 | 0.7197 | 0.8188 | 0.8188 | 0.7197
GNB | Gaussian NB | No FS | 26 | 0.6881 | 0.5763 | 0.6488 | 0.0279 | 0.6550 | 0.7339 | 0.7724 | 0.6067
LR | Logistic Regression | No FS | 26 | 0.7559 | 0.6644 | 0.7060 | 0.0232 | 0.7697 | 0.7385 | 0.7888 | 0.7164
RFC | Random Forest | No FS | 26 | 0.7695 | 0.6983 | 0.7331 | 0.0188 | 0.7987 | 0.7353 | 0.7791 | 0.7576
Table 4. Measured times in seconds for each method and problem.
Problem | GB-T | GB-S | DT-T | DT-S | XGB-T (GPU) | XGB-S (GPU) | XGB-T (CPU) | XGB-S (CPU)
With pathologies | 4641.42 | 5203.35 | 178.31 | 189.13 | 13,140.59 | 13,492.46 | 2130.07 | 2085.69
Without pathologies | 4311.17 | 4384.38 | 200.53 | 194.88 | 13,781.81 | 12,732.41 | 2220.16 | 2558.31
Table 5. Algorithm results by problem (average).
Problem | DT-S | DT-T | DT-RFECV | DT | XGB-S | XGB-T | XGB-RFECV | XGB | GB-S | GB-T | GB-RFECV | GB | ADB | BG | BNB | ET | GNB | LR | RFC
With Pathologies | 0.7150 | 0.7103 | 0.6962 | 0.6914 | 0.7216 | 0.7029 | 0.7085 | 0.6981 | 0.7382 | 0.7332 | 0.7324 | 0.7305 | 0.6424 | 0.7114 | 0.6675 | 0.7119 | 0.6603 | 0.7098 | 0.7292
Without Pathologies | 0.6934 | 0.6862 | 0.6821 | 0.6799 | 0.6989 | 0.6948 | 0.6850 | 0.6803 | 0.7295 | 0.7307 | 0.7171 | 0.7169 | 0.6236 | 0.7086 | 0.6154 | 0.7125 | 0.6488 | 0.7060 | 0.7331
Table 6. Friedman ranks for the different proposed algorithms. Columns GB-S through DT-RFECV use feature selection; columns RFC through ADB do not.
Problem | GB-S | GB-T | GB-RFECV | XGB-S | DT-S | DT-T | XGB-T | XGB-RFECV | DT-RFECV | RFC | GB | ET | BG | LR | XGB | DT | GNB | BNB | ADB
With Pathologies | 1 | 2 | 3 | 6 | 7 | 10 | 13 | 12 | 15 | 5 | 4 | 8 | 9 | 11 | 14 | 16 | 18 | 17 | 19
Without Pathologies | 3 | 2 | 4 | 9 | 11 | 12 | 10 | 13 | 14 | 1 | 5 | 6 | 7 | 8 | 15 | 16 | 17 | 19 | 18
Mean | 2 | 2 | 3.5 | 7.5 | 9 | 11 | 11.5 | 12.5 | 14.5 | 3 | 4.5 | 7 | 8 | 9.5 | 14.5 | 16 | 17.5 | 18 | 18.5
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
