1. Introduction
In recent years, there has been growing interest in incorporating artificial intelligence technologies, including machine learning and deep learning, into healthcare and medicine. These technologies are expected to play a transformative role in patient treatment and disease management by automating tasks, streamlining processes, keeping manual intervention to a minimum and simplifying mundane operations for all parties involved [1].
Biomarkers, in turn, are objective measurements drawn from blood, other bodily fluids or tissue that serve as medical signs or indicators of a disease or, more generally, of a person’s health state. A biomarker can consist of either a single metric or a combination of metrics and observations [2]. Table 1 presents some commonly used biomarkers, including waist-to-hip ratio, total cholesterol, systolic blood pressure (SBP) and fasting glucose. Currently, biomarkers are seen as key drivers of personalized patient management and treatment and of drug development.
In this paper, we report recent findings from our research on the link between a person’s standard biochemistry profile (based on blood exams), his/her body mass index (BMI), metabolism as a health state and SBP. These findings expand upon our previous related research, which used deep neural networks and other machine learning paradigms in relation to BMI and nutrition [3,4,5,6].
The motivation for this research was to streamline tasks and improve outcomes by using more commonly available variables to predict health states and, in a sense, to simplify the patient’s journey by minimizing the response time between test, result and recommended action. A further motivation was to address operational issues by applying machine learning methodologies that can be easily carried over into telemedicine applications. For example, Big Data and artificial intelligence can be used to improve decision making, support interventions [7] and add more pathways in healthcare analytics [8]. Mathematical tools and Big Data, as part of advanced machine learning pipelines and artificial intelligence, will eventually become the basis for analysis in diagnostics and pathology [9].
Childhood obesity is a very important issue and is connected to genetic predispositions and behavior. According to the CDC and based on recent studies, obesity prevalence is approximately 13% in the age group of 2 to 5 years, 20% in the age group of 6 to 11 years and 21.2% among 12- to 19-year-olds [10]. Furthermore, childhood obesity is commonly associated with certain communities and specific populations [10]. The age ranges of our dataset can be seen in Figure 1. The dataset consists mostly of young adults and other people above the age of 18, with only a small number of participants belonging to the 7–14 age group. Thus, additional datasets need to be collected to draw reliable conclusions with regard to childhood obesity. Demographics and lifestyle determinants could also prove useful as inputs to machine learning classifiers, alongside other nutritional factors and laboratory data, and should be investigated further. Machine learning tools could be used as predictors of childhood obesity or deployed for automating targeted interventions. This research avenue is, in fact, being followed, and its results will be reported elsewhere.
In this study, we look into the spectrum of metabolic syndrome (MetS), as seen in Figure 2. We follow a more holistic method by looking into engineering shortcuts using linked automation. Building on related literature [11] and expanding on related conclusions [12], we propose a more complete approach that links factors to states and uses said states to extract actions. Since MetS is a specific health state, weight is a factor related to MetS, and triglycerides combined with glucose and cholesterol are its defining factors, we utilize all three to finalize a conclusion and calculate related risk factors and recommendations. In this paper, we implement and test various classifiers that link a person’s standard biochemistry profile (based on blood exams), his/her BMI, metabolism as a health state and SBP. We show that support vector machine-based classifiers are very promising. We also provide a brief comparison with results from our previous works, which were based on deep neural networks [3,4,5,6]. Moreover, a more extensive look into related literature is provided in Section 3.
More specifically, the paper is organized as follows. In Section 2, the key concepts with regard to MetS are summarized, laying the theoretical groundwork on which our research is based. In Section 3, previous related work is highlighted. In Section 4, we discuss the datasets used in our research, as well as important statistical measures that characterize them. In Section 5, we develop and comparatively evaluate various classifiers for BMI prediction based on blood exams, including neural network-based and SVM-based classifiers. Furthermore, we show that a cascaded SVM-based classifier is the most promising, achieving an average correct classification rate of about 84%. In Section 6, we propose and implement a system for MetS prediction, which relies not only on BMI prediction but also on the prediction from blood exams of all MetS defining factors, i.e., total cholesterol, triglycerides and blood pressure (Figure 2). In Section 7, we itemize the key findings of our research and discuss their significance. Finally, in Section 8, we draw conclusions and point to future research avenues in this area.
3. Related Work and Methodology Comparison
In our previous study [3], we tested and evaluated the ability of neural networks to classify people into classes based on their BMI, using a basic biochemical profile extracted via routine blood exams (Figure 4). Four BMI classes are defined in the relevant literature, namely “obese”, “overweight”, “normal” and “underweight”. The classification was first performed with four classes, then with three and finally with two, and a general understanding of the relation was established. At each stage, accuracy increased as the number of classes was reduced, as in Figure 5. Thus, the idea of a cascaded classification method was conceived, in accordance with similar previous approaches to the recommendation problem [25,26]. At the same time, it became apparent that blood exams held sufficient information for a system to be tested on a greater variety of health states. Going forward, the data used would be thoroughly analyzed to ensure that no bias was introduced into the classification system, and the results of each classifier would be cross-examined using comparison matrices. In the current paper, we report on our study of binary and one-class support vector machine (SVM)-based classifiers, which we also compare with several other classifiers.
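The cascade idea can be sketched as a chain of binary decisions, each separating one BMI class from the remaining ones. The sketch below is illustrative only: the stage order, the `cascade_predict` helper and the threshold-based stand-in stages are assumptions made for this example and are not the classifiers evaluated in [3]. In practice, each stage would be a trained binary classifier operating on blood-exam features.

```python
from typing import Callable, List, Sequence, Tuple

# A stage pairs a class label with a binary decision function.
# In a real system each decision function would be a trained binary
# classifier; here they are hypothetical stand-ins.
Stage = Tuple[str, Callable[[Sequence[float]], bool]]


def cascade_predict(stages: Sequence[Stage], fallback: str,
                    features: Sequence[float]) -> str:
    """Return the label of the first stage that accepts the sample.

    If no stage accepts it, the sample falls through to `fallback`.
    """
    for label, accepts in stages:
        if accepts(features):
            return label
    return fallback


# Stand-in stages that threshold a single hypothetical score (features[0]).
# The thresholds are arbitrary placeholders with no medical meaning.
demo_stages: List[Stage] = [
    ("obese", lambda x: x[0] >= 0.75),
    ("overweight", lambda x: x[0] >= 0.50),
    ("normal", lambda x: x[0] >= 0.25),
]

print(cascade_predict(demo_stages, "underweight", [0.6]))  # overweight
```

Because each stage only has to solve a two-class problem, this structure mirrors the observation that accuracy improved as the number of classes per decision was reduced.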
In other related works, a relation between the electrocardiogram (ECG) and MetS has been suggested by deploying neural network-based classifiers [22]. A link between MetS and a variety of demographic data and specific blood tests has also been established with neural networks, and other statistical methods were compared as well [27]. The prediction of MetS using artificial neural networks and clinical data has also been explored, with BMI, age, HDL and LDL identified as defining factors [28]. In another study, an extreme learning machine approach was explored to identify the overweight class from blood exams, using a sample of 500 data points of men and women [12]. In a recent paper, a general diagnosis of MetS was pursued using clinical symptoms integrated with physio-chemical indexes in a smaller study of 586 cases, in which 450 participants had MetS and 136 did not [29]. Also worth noting is a recent paper with an extensive literature review of MetS statistics and machine learning paradigms [11].
More precise comparisons of machine learning methodologies for predicting MetS and BMI can be seen in Table 2 and Table 3, respectively. The main difference between our proposed methodologies and those previously followed by other authors is that the latter, even though technically sound, lack solid medical corroboration. The comparison is limited to papers that follow a pattern similar to ours with respect to feature selection. Important differences between this work and previous related works include the sample size and the fact that samples were in most cases imbalanced and could, thus, lead to biased results, as outlined in Section 4. Another important factor that can add usability to a system is the feature selection methodology. We have ensured that our classification method can assist medical prognosis in a novel manner by not using major defining factors that are already explored (e.g., triglycerides, cholesterol, BMI and glucose), as is the case in the previous related works described in Table 2 and Table 3.
Our aim is to create a streamlined machine learning pipeline that can cover as large a part of the patient’s journey as possible, minimize manual interventions and improve outcomes by minimizing the data that must be fed into the process. We should also mention that, for our experiments, a very large dataset of about 70,000 data points has been utilized, in both balanced and imbalanced states. Clearly, this is a great increase over previous related studies, which have incorporated small samples averaging about 500 imbalanced data points.
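How the balanced state of the dataset was produced is not specified here. As one common option, a balanced subset can be obtained by random undersampling, i.e., by keeping as many samples of every class as the smallest class contains. The sketch below assumes this technique; the `undersample` function and its parameters are hypothetical names introduced for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Any, Hashable, List, Sequence, Tuple


def undersample(samples: Sequence[Any], labels: Sequence[Hashable],
                seed: int = 0) -> List[Tuple[Any, Hashable]]:
    """Randomly undersample so that every class has the same size.

    Assumption: random undersampling stands in for whatever balancing
    procedure was actually used.
    """
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed for reproducibility

    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Made-up toy data: 7 samples of one class, 3 of another.
toy = undersample(list(range(10)), ["normal"] * 7 + ["obese"] * 3)
print(Counter(label for _, label in toy))
```

Undersampling discards data from the larger classes, which is affordable with tens of thousands of data points but would be costly for the small samples used in earlier studies.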
4. Exploratory Data Analysis and Bias Evaluation
Initially, the data were analyzed with respect to BMI using frequency histograms and probability density distributions to determine how the sample is distributed. The same techniques were applied to each variable (blood exams, i.e., the standard biochemistry profile) to determine average values, correlations between variables and other statistical metrics, as in Figure 6. Variables were examined both over the complete dataset and, more precisely, per weight category.
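The per-variable analysis described above can be sketched with the Python standard library. The helper names (`summarize`, `summarize_by_class`, `pearson`) and the toy values are assumptions made for illustration; they are not the study's data or tooling.

```python
import math
import statistics
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Mean, median and sample standard deviation of one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # needs at least two values
    }


def summarize_by_class(values: Sequence[float],
                       classes: Sequence[Hashable]) -> Dict[Hashable, Dict[str, float]]:
    """The same summary, computed separately per weight category."""
    groups = defaultdict(list)
    for value, cls in zip(values, classes):
        groups[cls].append(value)
    return {cls: summarize(group) for cls, group in groups.items()}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two variables."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up toy values for two variables and a weight category per sample.
glucose = [85.0, 90.0, 95.0, 110.0, 120.0, 130.0]
cholesterol = [170.0, 180.0, 175.0, 210.0, 220.0, 240.0]
category = ["normal", "normal", "normal", "obese", "obese", "obese"]

print(summarize(glucose))
print(summarize_by_class(glucose, category))
print(pearson(glucose, cholesterol))
```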
Key observations were drawn by examining important statistical metrics (e.g., mean, median, mode or standard deviation) to better understand the sample set under examination, to create a basis of available tools for future research endeavors and to develop a tool for missing value prediction through entity alignment. The results of this analysis can be seen in Figure 1. The sample is well balanced between the two genders, and the age classes appear normally distributed. Thus, we can reasonably conclude that a network fed with these data is less likely to be biased, since gender is equally represented and a representative group is used for all age classes.
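A balance check of this kind can be expressed as the share of each group in the sample and the ratio between the largest and the smallest group. The following is a minimal sketch; `class_shares` and `imbalance_ratio` are hypothetical helper names, and the labels are made-up toy values.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def class_shares(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of the sample that belongs to each group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Size of the largest group divided by the size of the smallest.

    A value of 1.0 means perfectly balanced groups; larger values
    indicate stronger imbalance.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up toy labels, not the study's data.
gender = ["female", "male", "female", "male", "female", "male"]
print(class_shares(gender))
print(imbalance_ratio(gender))
```

The same two functions can be applied to age classes or to BMI classes to see which groups are under-represented.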
When examining the biochemistry profile, we observe mostly normally distributed values for all variables, as can be seen in Figure 7. The long tails observed in some variables suggest that further investigation could yield valuable information.
7. Summary, Discussion, Itemization of Key Findings and Contribution
In this paper, we proposed a system of superior performance for predicting MetS from blood exams, as in Figure 18. Specifically, blood exams were used as input, but our efforts focused on using as few parameters as possible to initiate the classification process. The classifier in the system predicts the BMI class (“underweight”, “normal”, “overweight” and “obese”) and the MetS state (“MetS present” and “MetS not present”) without using related factors.
Depending on the identified states and using blood exams, the system classifies the blood pressure state or some other health state related to MetS. The system returns related risk factors and recommendations (e.g., diet suggestions, lifestyle changes or commonly used medical interventions). We deployed one-class SVMs to test the validity of our hypothesis and, thus, the ability of the classifier to identify body weight factors in blood exams. Using cascaded classifiers, an average accuracy of 85% was achieved, and BMI classification based on the standard biochemistry profile was optimized for all weight classes. Having validated our initial hypothesis, we explored other pathways by applying similar methodologies.
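How an average accuracy over several weight classes can be computed from a comparison (confusion) matrix is sketched below. Both the overall accuracy and the mean of the per-class correct classification rates are shown, since it is an assumption of this example which of the two the reported figure corresponds to. The function names and the toy labels are hypothetical.

```python
from typing import Hashable, List, Sequence


def confusion_matrix(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Correct predictions divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def mean_per_class_rate(matrix: List[List[int]]) -> float:
    """Average of the per-class correct classification rates.

    Classes with no true samples are skipped.
    """
    rates = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(rates) / len(rates)


# Made-up toy predictions, not results from the study.
truth = ["normal", "normal", "normal", "obese"]
predicted = ["normal", "normal", "obese", "obese"]
cm = confusion_matrix(truth, predicted, ["normal", "obese"])
print(cm)
print(overall_accuracy(cm), mean_per_class_rate(cm))
```

On imbalanced data the two measures can differ noticeably, because overall accuracy is dominated by the largest class.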
Since weight is associated with metabolism and weight class can be identified via blood exams, metabolic syndrome became the new classification target. Using factors related to MetS, a new system was engineered that can identify this particular health state with an accuracy of 84%. Total cholesterol and HDL provided similar results when used alternately as factors of MetS in the classifier, even though the medical literature suggests that HDL is a key biomarker for MetS when combined with triglycerides and glucose.
High blood pressure, being both a factor and an outcome of MetS, was tested in a similar fashion. At this stage, SBP was evaluated using a full biochemistry profile and BMI. The final outcome was a 74% classification accuracy. Testing different scenarios, as described in detail in previous sections, we concluded that using all parameters, as seen in Figure 18, increased accuracy by about 8% compared to using only parts of the biochemistry profile. Having already expanded on methodologies to identify the other parameters, the system as a whole can accurately predict systolic blood pressure even when some values are missing (missing value prediction) by using the available blood exams to classify the BMI and MetS classes, thereby increasing the accuracy of the system. More research is being conducted on this and will be published elsewhere in the near future. Finally, a basic system has been conceptualized (Figure 18) and will be deployed as an interface with the optimized classifier.
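The missing value prediction step can be pictured as filling absent inputs with the output of an upstream classifier before the SBP classifier is applied. The sketch below is a hypothetical illustration: `fill_missing`, the field names and the stub predictors (with arbitrary placeholder thresholds that carry no medical meaning) are assumptions, and they stand in for the trained BMI and MetS classifiers of the actual system.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def fill_missing(record: Record,
                 predictors: Dict[str, Callable[[Record], Any]]) -> Record:
    """Fill fields that are None using the matching predictor.

    Predictors run in insertion order, so a later predictor can rely
    on a value filled in by an earlier one.
    """
    completed = dict(record)  # leave the caller's record untouched
    for field, predict in predictors.items():
        if completed.get(field) is None:
            completed[field] = predict(completed)
    return completed


# Stub predictors standing in for trained classifiers (placeholders only).
def stub_bmi_class(r: Record) -> str:
    return "overweight" if r["glucose"] >= 100.0 else "normal"


def stub_mets(r: Record) -> bool:
    return r["bmi_class"] in ("overweight", "obese") and r["triglycerides"] >= 150.0


predictors = {"bmi_class": stub_bmi_class, "mets": stub_mets}

# Made-up record with two missing values.
record = {"glucose": 130.0, "triglycerides": 180.0,
          "bmi_class": None, "mets": None}
print(fill_missing(record, predictors))
```

The completed record can then be passed to the SBP classifier as if all inputs had been measured; values that were already present are never overwritten.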