Review

A Review on Trending Machine Learning Techniques for Type 2 Diabetes Mellitus Management

by
Panagiotis D. Petridis
1,
Aleksandra S. Kristo
2,3,
Angelos K. Sikalidis
2,4,* and
Ilias K. Kitsas
1
1
Department of Electrical and Computer Engineering, School of Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
2
Department of Food Science and Nutrition, California Polytechnic State University, San Luis Obispo, CA 93407, USA
3
Applied Nutrition Graduate Program, College of Professional Studies, University of New England, 716 Stevens Ave., Portland, ME 04103, USA
4
Center for Health Research, California Polytechnic State University, San Luis Obispo, CA 93407, USA
*
Author to whom correspondence should be addressed.
Informatics 2024, 11(4), 70; https://doi.org/10.3390/informatics11040070
Submission received: 7 August 2024 / Revised: 23 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024
(This article belongs to the Section Machine Learning)

Abstract

Type 2 diabetes mellitus (T2DM) is a chronic disease characterized by elevated blood glucose levels and insulin resistance, leading to multiple organ damage with implications for quality of life and lifespan. In recent years, the rising prevalence of T2DM globally has coincided with the digital transformation of medicine and healthcare, including extensive electronic health records (EHRs) for patients and healthy individuals. Numerous research articles and systematic reviews have been published, producing innovative findings and summarizing current developments and applications of data science in the life sciences, medicine and healthcare. The present review is conducted in the context of T2DM and Machine Learning, examining relatively recent publications using tabular data and demonstrating the relevant use cases, the workflows during model building and the candidate predictors. Our work indicates that Gradient Boosting and other tree-based models are the most successful, with SHAP (SHapley Additive exPlanations) and wrapper algorithms being popular methods for feature interpretation and evaluation, and it highlights urinary markers and dietary intake as emerging diabetes predictors alongside the typical invasive ones. These results could offer insight toward better management of diabetes and open new avenues for research.

1. Introduction

Diabetes is an increasingly prevalent chronic disease impacting the healthspan and lifespan of millions of individuals globally. The International Diabetes Federation (IDF)’s Diabetes Atlas (2021) reports that 10.5% of the adult population (20–79 years) is estimated to have diabetes, with almost half unaware that they are living with this chronic disease. According to the same report, by 2045, 1 in 8 adults, approximately 783 million, is predicted to have diabetes, an increase of 46% [1]. Over 90% of people with diabetes have type 2 diabetes mellitus (T2DM), which is driven by socioeconomic, demographic, environmental, behavioral/lifestyle and genetic factors [2]. Machine Learning is a powerful tool for the prediction of health-related events, as well as for the analysis of massive data sets and the extraction of useful knowledge. Existing studies conducted for the prediction of diabetes, as well as for the identification of factors significantly contributing positively or negatively to diabetes management, address diabetes in a broad context, ranging from diabetic complications, genetic background and the environment to drugs and therapies, as well as healthcare and management regarding diabetes [3,4]. Furthermore, within Machine Learning, work based on Deep Learning and image analysis typically offers less detailed insight into the methodologies utilized. Early studies have demonstrated the development of robust Machine Learning models for the prediction of hypoglycemia in patients with T2DM. These predictive models could play a valuable role in managing hypoglycemia, particularly for vulnerable patients, helping to prevent episodes and optimize treatment strategies [5]. A Machine Learning-assisted assessment of a T2DM care program identified population-level effects and the patient subgroups that benefited most, demonstrating the value of Machine Learning for community-based studies [6]. Moreover, recent research suggests that Machine Learning can summarize patient characteristics and predict T2DM risk [7], although the identification of the most significant factors that predict T2DM is still ongoing [8]. In this manuscript, we review recent research focused on diabetes diagnosis, long-term diabetes prediction and relevant biomarker regression, using tabular data derived from questionnaire, biometric, laboratory, physical and dietary assessments. We analyze each work for its methodology regarding data preprocessing, feature selection, model building, evaluation and feature importance extraction, aiming to unveil patterns, best-performing models and new non-invasive predictors, thus potentially identifying strengths and areas of improvement for different tools and opening new research areas.
The advancement of technology has created new fields and contributed to the development of new tools and know-how transfer that generate interdisciplinary services. The development of omics technologies gave rise to nutrigenomics and supported the notion of personalized and precision nutrition and medicine, thus contributing to approaches that can lower chronic disease risk and maximize quality of life and longevity [9]. Along similar lines, Artificial Intelligence and Machine Learning could very well find applications in the fields of clinical nutrition and medical nutrition therapy, especially in the context of certain chronic diseases such as T2DM that require, beyond diagnostics, ideally, constant monitoring and personalized interventions for optimal management.
There are significant challenges met by healthcare professionals in terms of identifying T2DM patients, monitoring the progress of the disease and managing effectively the increasing numbers of diagnosed patients. Type 2 diabetes symptoms may remain overlooked and persist long before a diagnosis is made, with a significant impact on disease prognosis, while managing the disease is challenging for diagnosed patients due to the continuous and complex demands of the required lifestyle modifications and drug therapy. Therefore, novel technologies and multidisciplinary approaches that can lead to the development of tools to assist health professionals and patients in T2DM diagnosis and management can have a very positive impact from a public health standpoint.
Our work herein aimed to provide an overview and comparison of available approaches, thus creating a foundational level upon which collaborations between developers and healthcare professionals can be established.

2. Diabetes

2.1. Definitions

Diabetes is an insulin-related chronic disease caused by the diminished capacity of the pancreas to produce adequate levels of the hormone insulin and/or by the inability of target tissues to utilize the available glucose as a result of impaired insulin signaling [1,9]. Insulin, produced by beta pancreatic cells, drives glucose clearance into the peripheral tissues, namely, skeletal muscle, the liver and adipose tissue. Elevated blood glucose (hyperglycemia), due to lack of insulin or insulin resistance, over time leads to significant damage to several organs and tissues [10].
There are three main types of diabetes: type 1 diabetes mellitus (T1DM), an autoimmune disease caused by the destruction of the beta pancreatic cells followed by minimal (if any) insulin release; gestational diabetes (GD), which manifests in 10% of pregnancies during the 22nd–24th weeks of gestation, resembles T2DM and typically resolves after delivery; and type 2 diabetes mellitus (T2DM), characterized by chronic hyperglycemia and hyperinsulinemia, as well as insulin resistance, eventually leading to beta cell failure to produce adequate insulin to meet metabolic demands [1,10].

2.2. T2DM Complications and Management

Type 2 diabetes is the most common type of diabetes, representing approximately 90% of diabetic cases; hence, it is the focus of our work herein. Overweight/obesity, age, high blood pressure, physical inactivity, unhealthy nutrition and genetics constitute risk factors for T2DM. The pathology develops due to gradual insulin resistance coupled with a reduced capacity of beta pancreatic cells to produce adequate levels of insulin for effective glucose clearance, finally leading to beta cell exhaustion [10].
Insulin resistance precedes T2DM, driving numerous metabolic perturbations beyond hyperglycemia, including dyslipidemia, hypertension, hyperuricemia, systemic inflammation, endothelial dysfunction and prothrombosis. Prediabetes is a condition that develops due to insulin resistance and precedes T2DM; it is characterized by fasting plasma glucose levels that are elevated but still below the diabetic diagnostic threshold (100–126 mg/dL) [1].
In T2DM, there is an increased risk of developing numerous diabetes complications, mainly related to the heart, blood vessels, eyes, kidneys, nerves, teeth and gums. Complications can be delayed or prevented by managing glucose levels and other symptoms, such as hypertension and hyperlipidemia.
While a diabetes cure remains elusive, disease management is based on lifestyle modifications and/or drug therapy. Successful diabetes management requires robust (frequent and accurate) monitoring of blood glucose levels with an appropriate device, as well as personalized diet and physical activity plans, with the aid of oral medication (e.g., biguanides and sulfonylureas) and/or insulin administration to maintain blood glucose levels within acceptable ranges, i.e., a fasting plasma glucose level of 70–100 mg/dL (3.9–4.9 mmol/L), while limiting glycation reactions, as indicated by a glycosylated hemoglobin (HbA1c) level of <7%. A nutritious and balanced diet including adequate intake of dietary fiber, lean protein and unsaturated fats, along with reduced consumption of refined carbohydrates, red meat and saturated fats, can contribute to better management. Furthermore, regular physical activity such as 30 min of daily walking and avoidance of unhealthy habits such as smoking and alcohol consumption are recommended.
Beyond the technological advancements of continuous blood glucose monitoring (CGM), AI/ML approaches can be applied towards lifestyle modifications, primarily recommendations on dietary and physical activity regimes to aid and support the work of a clinical nutritionist in providing medical nutrition therapy services to diabetic patients.

3. Machine Learning Background

3.1. Models

Machine Learning, as the term denotes, is a domain that studies the creation of intelligent systems which can learn from humans and especially from currently available knowledge. In Machine Learning, the tool that utilizes knowledge is called a model. A model can be trained on current data, which represent current knowledge, and thereafter make predictions for new, unknown data based on that prior knowledge. A model is represented by a mathematical function with tunable parameters that adapts to the data by minimizing an error function, which evaluates how well the model performs. Machine Learning has three main subdomains, namely, classification, clustering and regression, which, depending on the problem, possess different capabilities to make predictions. In this manuscript, only classification and regression will be presented, since the domain of Machine Learning is vast. In classification, the model takes input data which include observable data (features) and their corresponding target variables. The concept of classification is to train a model on observed data and then try to predict the target variables of new, unknown observable data. A target variable is called a class, and it is discrete (0 or 1, Yes or No, etc.). Some classification models, which are utilized in the evaluated studies, are referenced briefly below:
  • Logistic Regression: This model uses a logistic function to estimate the probability of a vector of observable data, Xi, belonging to a class, Y. It emerges as the most straightforward approach to utilize for binary classification of T2DM.
  • Naive Bayes (NB): NB assumes independence between features and applies Bayes’ Rule to predict the class that maximizes the respective likelihoods. In terms of T2DM classification, the probability of each feature within each class is calculated [11].
  • K-Nearest Neighbors (KNN): KNN is an algorithm, which, given an observation, tries to find the K most similar, known data, sums their classes and assigns the majority class to the unknown observation. The similarity can be calculated with various methods, the most common being Euclidean distance, Manhattan and Mahalanobis. It is applied to accurately forecast the onset of T2DM [12] and is preferred over other data-driven Machine Learning algorithms for diabetes risk prediction [13].
  • Support Vector Machine (SVM): This model maps the obtained data onto a space and then tries to find the best hyperplane that separates the data of different classes. The term “best” is evaluated by the margin: the greater the margin, the better the separation is considered to be. SVMs are implemented to predict the diagnosis of diabetes mellitus [14] and are the most successful and widely used algorithms for both biomarker discovery and diabetes mellitus prediction [15].
  • Decision Tree (DT): DT is a data-driven model which does not make any assumption about data distribution but constructs a tree-shaped structure of simple if–else rules based on input features and based on these rules, tries to make predictions. The identification of potential interactions between T2DM risk factors [16] and the determination of risk factors associated with T2DM [17] are integrated in such models.
  • Random Forest (RF): RF is an ensemble method consisting of multiple DTs, much as a forest consists of many trees. An RF sums all DT predictions and elects the class by majority voting. The method is used for optimal feature selection in T2DM risk prediction modeling [1] as well as in T2DM prediction [18,19].
  • Multilayer Perceptron (MLP): MLP is perhaps the simplest form of Artificial Neural Network (ANN). It is a type of feed-forward ANN, simulating the way that the human brain makes decisions. It consists of nodes and artificial neurons, which all cooperate to export a simple numerical value. It applies to T2DM diagnosis and determination of the relative importance of risk factors [20] and diabetes risk prediction [21].
  • Gradient Boosting: Gradient Boosting refers to a broad family of ensemble models which combine trees with the mathematical notion of the gradient. In every iteration, the predictions of trees, also called “weak learners”, are combined to produce a better result than in the previous iteration; after the last iteration, a final prediction is given. Such models include Extreme Gradient Boosting (XGBoost), the Gradient Boosting Machine (GBM), LightGBM and Categorical Boosting (CatBoost). These represent some of the most recent successful developments within the Gradient Boosting framework, presenting low computational complexity when employed in T2DM diagnosis and prediction [22,23], as well as in the identification of T2DM predictors [8].
  • Ensemble Models: In this category, a voting classifier and a stacking classifier are set. They both utilize the concept of a combination of weak classifiers. In the voting classifier, weak classifier prediction is combined with majority, soft or weighted voting to make the final prediction. In the stacking classifier, the weak classifiers produce a probability output and then all are fed to a final classifier, which is trained based on probabilistic outputs rather than conventional features. Ultimately, the final classifier makes a prediction [24,25,26,27,28].
In the regression category, most of the aforementioned models can also work efficiently. However, a classical model that belongs to this category is Linear Regression, in which a linear function is fitted to predict continuous values, typically via the least-squares method.
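To make the model families above concrete, the following minimal sketch (assuming Python with scikit-learn and a synthetic tabular dataset; it is illustrative and not taken from any of the reviewed studies) fits several of the listed classifiers and compares their AUC values.

```python
# Minimal sketch: fitting several of the classifiers listed above on a synthetic
# tabular binary-classification task and comparing their AUC values.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a tabular diabetes dataset (imbalanced 80/20 classes).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```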

3.2. Imputation and Normalization

Missing values or features with very different numerical ranges can lead to poor models. With imputation, missing values are filled with predefined ones, using methods such as Mean/Mode imputation or Multivariate Imputation by Chained Equations (MICE). In the former, missing values take the mean, median or most common value of the respective feature. In the latter, a model such as Linear Regression or Random Forest predicts the missing values, using all the other, non-missing features in the dataset as inputs. The normalization of data is achieved using MinMax or Z-Score scaling. MinMax maps all values into the [0, 1] range, while Z-Score standardizes each feature to zero mean and unit variance. Z-Score is beneficial for Machine Learning models that assume a priori that the data follow a normal distribution.
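The following sketch (assuming scikit-learn; the toy matrix and its values are illustrative only) contrasts simple mean imputation with MICE-style iterative imputation and then applies both scaling schemes.

```python
# Sketch: mean imputation vs. MICE-style iterative imputation, followed by
# MinMax and Z-score scaling. The toy matrix (age, glucose, BMI) is illustrative.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25.0, 110.0, np.nan],
              [40.0, np.nan, 31.5],
              [np.nan, 95.0, 22.0],
              [60.0, 140.0, 28.0]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)    # fill with column means
X_mice = IterativeImputer(random_state=0).fit_transform(X)  # model-based, MICE-like filling

X_minmax = MinMaxScaler().fit_transform(X_mice)             # values mapped to [0, 1]
X_zscore = StandardScaler().fit_transform(X_mice)           # zero mean, unit variance per feature
print(X_minmax.round(2))
print(X_zscore.round(2))
```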

3.3. Balancing

Balancing refers to the ratio between class sizes. When one class has a significantly higher volume of data (number of observations) compared to the others, the trained model could be biased towards this majority class, meaning that the model is more likely to predict the majority class. To this end, resampling techniques, such as majority undersampling, minority oversampling and synthetic minority oversampling, have been developed. Majority undersampling constructs a new dataset by keeping only a specified fraction of the majority class data, so that the class ratio becomes 1. The second technique resamples the minority class data multiple times so that their number in the new dataset equals that of the majority class. Finally, the last one creates new synthetic minority class data by applying linear interpolation among existing minority samples, so that the number of minority samples again matches that of the majority class.
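A brief sketch of the three resampling ideas follows (scikit-learn is used for plain resampling, and SMOTE comes from the separate imbalanced-learn package, which is assumed to be installed; the synthetic two-class data are illustrative only).

```python
# Sketch of the three balancing strategies on a synthetic imbalanced dataset.
import numpy as np
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE  # separate imbalanced-learn package

rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(900, 3))   # majority class (e.g., non-diabetic)
X_minor = rng.normal(1.0, 1.0, size=(100, 3))   # minority class (e.g., diabetic)

# 1) Majority undersampling: keep only as many majority rows as minority rows.
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

# 2) Minority oversampling: repeat minority rows (with replacement) to match the majority.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# 3) SMOTE: synthesize new minority samples by interpolating between neighbours.
X = np.vstack([X_major, X_minor])
y = np.array([0] * len(X_major) + [1] * len(X_minor))
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_smote))  # both classes now contain 900 samples
```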

3.4. Feature Selection

Not all features in a dataset are useful: some do not contribute enough to model performance to justify their added complexity and memory footprint, lead to overfitted models, or behave similarly to other features of the dataset (i.e., are linearly correlated). It is therefore critical to choose the best subset of features. Feature selection techniques can, in general, be classified into the following three main categories:
  • Filter: Pearson correlation coefficient, chi-squared test and ANOVA.
  • Wrapper: Sequential feature selection or backward elimination.
  • Embedded: L1 (LASSO) and Ridge (L2) regularization.
In addition, Principal Component Analysis (PCA) finds the best features according to variance, and SHapley Additive exPlanations (SHAP) is a relatively new method which evaluates feature contributions to predicted classes.
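A compact sketch of the three families plus PCA is shown below (scikit-learn assumed; SHAP would additionally require the shap package and is omitted here; the synthetic data are illustrative only).

```python
# Sketch contrasting filter, wrapper and embedded selection, plus PCA,
# on a synthetic dataset with 20 features of which 5 are informative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter: ANOVA F-test scores each feature independently of any model.
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: sequential forward selection refits a model on candidate feature subsets.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = sfs.fit_transform(X, y)

# Embedded: L1 regularization drives uninformative coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_kept = int((lasso.coef_ != 0).sum())

# PCA: keeps directions of largest variance rather than original features.
X_pca = PCA(n_components=5).fit_transform(X)
print(X_filter.shape, X_wrapper.shape, n_kept, X_pca.shape)
```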

3.5. Evaluation

The evaluation of a model’s predictive capability is perhaps the most important aspect of Machine Learning, since it depicts clearly, with numerical values, how well the model has performed and allows for comparisons between different models. The main idea of model evaluation is the comparison of model-predicted values against the real values in a given instance. Assuming a binary classification problem, where there is a positive and a negative class, the confusion matrix (Figure 1) illustrates the principal evaluation metrics that are used in Machine Learning.
The counts of true-positive (TP), false-positive (FP), false-negative (FN) and true-negative (TN) ground truths and inferences are essential for summarizing model performance. The basic produced metrics are Accuracy, Sensitivity, Specificity, F1-Score and the Area Under Curve (AUC), as listed below.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
The Receiver Operating Characteristic (ROC) curve is a graph plotting the true-positive rate against the false-positive rate, showing the performance of a classification model at all classification thresholds. The AUC is the area under this curve, i.e., under the plot of Sensitivity against (1 − Specificity) (Figure 2).
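These metrics can be computed directly from a confusion matrix, as in the following sketch (scikit-learn and synthetic data assumed), which mirrors the formulas above.

```python
# Sketch: computing Accuracy, Sensitivity, Specificity, F1 and AUC from a
# confusion matrix, matching the formulas above (synthetic data, scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_prob = model.predict_proba(X_te)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)               # true-positive rate (Recall)
specificity = tn / (tn + fp)               # true-negative rate
f1 = 2 * tp / (2 * tp + fp + fn)
auc = roc_auc_score(y_te, y_prob)          # area under the ROC curve
print(f"Acc={accuracy:.3f} Sens={sensitivity:.3f} Spec={specificity:.3f} F1={f1:.3f} AUC={auc:.3f}")
```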

4. Relevant Sections

4.1. Related Work

Diabetes is a prevalent and widely studied topic, with a significant number of studies dealing with the identification of diabetic cases, the long-term prediction of diabetes development and the regression of critical biomarkers such as fasting plasma glucose (FPG) and glycosylated hemoglobin A1c (HbA1c). Of course, there are other use cases, such as the classification/prediction of diabetes complications, as well as different input data, e.g., images [3,4]. For the bibliography search, the research framework was limited to T2DM cases using tabular data. To this end, the extensive review publications by Kavakiotis et al. [3] and Fregoso-Aparicio et al. [4] were identified. The work by Kavakiotis and colleagues considers the following elements: (a) prediction and diagnosis, (b) diabetic complications, (c) genetic background and environment, (d) drugs and therapies, and (e) healthcare and management. The researchers note that supervised algorithms constituted the majority (85%) of those found during their study, as well as the superiority of SVMs and the frequent utilization of clinical data such as electronic health records (EHRs), concluding, optimistically, that applying machine learning and data mining techniques to enriched datasets that include clinical and biological information has many benefits to offer for the diagnosis, etiopathophysiology and treatment of T2DM as systematic exploration continues [3]. The work of Fregoso-Aparicio et al. evaluated approximately 90 research papers, covering a broad spectrum of the intersection between Machine Learning, Deep Learning and diabetes mellitus [4]. The review demonstrates a variety of use cases, ranging from diabetes identification to complication prediction. Their significant findings are oriented along the following axes:
  • Dataset structure;
  • Top-performing models;
  • Most frequent models;
  • Complementary techniques;
  • Ascending models;
  • Performance evaluation.
Thus, clean and well-structured datasets are preferred over large and complex ones. Random Forest and Decision Tree are the top-performing models, achieving AUC values of nearly 0.99. The most frequently used models are Deep Neural Networks, tree-type models and SVMs; however, the advanced capabilities of Deep Neural Networks on noisy and very large datasets are noted. Complementary techniques can improve the performance of every model, so resampling techniques are employed, and for the crucial task of feature selection, Linear Regression and Principal Component Analysis seem to find broader acceptance among scientists. The same authors report that Neural Networks perform better on large datasets, particularly those with over 70,000 records. Finally, the differences in evaluation metrics from study to study caused some difficulties during their screening; it therefore seems better to adopt common metrics such as AUC, Accuracy, Sensitivity and Specificity [4].

4.2. Machine Learning Applications in Diabetes

The applications of statistical analysis and Machine Learning in healthcare and more specifically in diabetes have demonstrated a steady increase in the last two decades, since the development of corresponding programming frameworks has enabled the easy storage, collection, processing and analysis of massive quantities of available data and the employment of statistical and Machine Learning models [29,30]. Regarding diabetes research in relation to Machine Learning, the literature focuses on the identification of people with diabetes, early or long-term (1–10 years) prognosis, and diabetes complication types and prediction. Considering the prevention of diabetes, the goal is the extraction of features (e.g., biomarkers) which are relevant to diabetes occurrence. Then, in the case that these features are configurable, the patient could receive personalized recommendations to apply in his/her lifestyle to minimize the risk of developing diabetes.
Our literature review focused on relatively new research articles or systematic reviews which deal with methods for the prediction of diabetes mellitus or prediabetes utilizing demographic, anthropometric, biometric, laboratory, nutritional and medical history data as input features.
The initial mathematical approaches to diabetes consisted of statistical risk scores exploiting questionnaires completed by large numbers of study participants. Two of the more popular approaches are the Leicester Risk Assessment Score [31], developed by Leicester University, and the Finnish Diabetes Risk Score (FINDRISC) [32], developed by the University of Helsinki. The former utilizes a Logistic Regression model, considering age, ethnicity, sex, first-degree family history of diabetes, antihypertensive therapy or history of hypertension, waist circumference, and Body Mass Index (BMI) to predict current impaired glucose regulation or diabetes mellitus, achieving an AUC of 72%. The latter (i.e., FINDRISC) also employs Logistic Regression and uses sex; age; BMI; use of blood pressure medication; history of high blood glucose; physical activity; daily consumption of vegetables, fruits or berries; and family history of diabetes to predict 10-year diabetes development, achieving an AUC of 86%. These two approaches illustrate a main distinction among diabetes studies: the Leicester Risk approach aims to identify current health conditions, while FINDRISC tries to predict long-term prevalence. There is also significant research that deals with Deep Learning and more specifically with image recognition for the classification of diabetic retinopathy, which is a typical and very well-studied complication, using retinal images as input [3,4]. Other diabetes complication studies utilizing Machine Learning and Deep Learning assess neuropathy and nephropathy [3,4]. Apart from classification challenges, there are also regression methods which are used for the prediction of fasting plasma glucose or HbA1c values, i.e., biomarkers that are clinically considered the best indicators of abnormal glucose regulation and consequently of diabetes mellitus presence [2,3,4,33]. These parameters are also used as markers to assess the quality of diabetes management.
In our work herein, we delved further into the relevant literature and identified a good number of high-quality articles which, considered together, can help identify a principal methodology for diabetes prediction. Next, the identified studies were clustered based on their aims and key methodologies, which were described in a more detailed context and compared/contrasted in terms of their strengths and weaknesses.

4.2.1. Current-State Classification

Current-state detection of diabetes, that is, where the class variable and the independent feature values are registered at the same time, has been investigated in numerous studies [34,35,36,37,38,39,40,41,42,43].
In the study of Lai et al. [34], the researchers trained a GBM, a Random Forest and a Logistic Regression model on a dataset containing 13,309 records from healthy individuals and T2DM patients. The input features were age, sex, fasting plasma glucose (FPG), BMI, triglycerides, systolic blood pressure and low-density lipoprotein (LDL) cholesterol (LDL-c). First, the dataset was split into an 80% training set and a 20% testing set. Then, a misclassification cost matrix was constructed with a false-negative-to-false-positive cost ratio of 3/1 and zero cost for correct predictions. This cost matrix was used along with AUC as the objective functions to tune the hyperparameters of the models using 10-fold cross validation. Due to the class imbalance, the cut-off point of the decision boundary was adjusted such that the misclassification cost was minimized. After this adjustment, each model with the tuned hyperparameters was trained on the entire training set and finally evaluated on the testing set. The best model was reported to be GBM, which achieved an AUC of 84.7%, a misclassification rate of 18.9%, a Sensitivity of 71.6% and a Specificity of 83.7% at a threshold of 0.24. The information gain metric, which measures the amount of information contributed by each predictor, ranked the features in the following order: FPG, LDL-c, BMI. The authors suggested the incorporation of such models into online programs for further assistance to physicians during patient assessments. The developed models were validated on a Canadian population, reflecting the risk patterns of diabetes mellitus among Canadian patients.
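The cut-off adjustment under an asymmetric cost matrix can be illustrated with the following sketch, which is not the authors' code: it sweeps candidate thresholds over a Gradient Boosting model's predicted probabilities on held-out synthetic data and keeps the threshold minimizing the total cost, with false negatives costing three times as much as false positives. In practice, the threshold would be chosen on a validation split rather than the final test set.

```python
# Sketch (not the authors' code): choose the probability cut-off that minimizes
# total misclassification cost when a false negative costs 3x a false positive.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, weights=[0.85, 0.15], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

COST_FN, COST_FP = 3.0, 1.0                       # the 3/1 cost ratio described above
best_threshold, best_cost = 0.5, np.inf
for t in np.linspace(0.05, 0.95, 91):
    pred = (probs >= t).astype(int)
    fn = np.sum((pred == 0) & (y_val == 1))
    fp = np.sum((pred == 1) & (y_val == 0))
    cost = COST_FN * fn + COST_FP * fp
    if cost < best_cost:
        best_threshold, best_cost = t, cost
print(f"selected threshold = {best_threshold:.2f}, total cost = {best_cost:.0f}")
```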
Zou et al. [35] used a 138,000-record dataset containing 14 features, such as age, pulse rate, breathing, diastolic and systolic pressure metrics, biometric data, physique index, FPG, LDL-c and HDL-c. Five sub-datasets were created with random sampling to train the models five times and then calculate the average performance in an independent testing set using 5-fold cross validation. The models which were trained were Random Forest (RF), Decision Tree and a Neural Network. The models were evaluated with different subsets of features using feature selection techniques such as PCA, mRMR, a technique that excluded FPG and one that only included FPG. Using all features, every model achieved the best Accuracy, while training that excluded fasting plasma glucose resulted in the worst models [35]. In the first case, the Accuracy of RF was the best, with a value of 0.8084. Using only fasting plasma glucose as the input feature, the trained models were still better than those achieved with the PCA and mRMR techniques, yielding an Accuracy of 0.7597. Through the feature screening procedure, FPG, weight and age were the most useful. The authors concluded that FPG is a very good predictor of diabetes; however, they added that including more features produces better-performing models. Hence, the need for a more comprehensive approach incorporating other relevant factors or biomarkers is emphasized. As future work, the authors propose the extraction of indicators’ importance and the classification of specific diabetes types [35].
Dinh et al. [27] used the National Health and Nutrition Examination Survey (NHANES) dataset to classify cases of diabetes and undiagnosed diabetes versus prediabetes and healthy, as well as undiagnosed diabetes and prediabetes versus healthy, using different feature subsets such as survey data and laboratory results [34]. First, the data were standardized and then downsampled to produce a balanced dataset, which was then split to derive an 80/20 train/test set. The models which were trained were Logistic Regression, SVM, Random Forest, XGBoost and an ensemble of all the individual models. The ensemble model was a weighted soft voting classifier whose weights were the percentage of each individual model AUC in comparison with the sum of all AUCs. The calculation was performed according to the following equation:
$$w_i = \frac{AUC_i^2}{\sum_{i=1}^{4} AUC_i^2}$$
Each tunable model underwent hyperparameter tuning, and the final performance was calculated using 10-fold cross validation. For the first classification case using only survey data (123 features, 1999–2014 period), XGBoost performed the best, achieving an AUC of 0.862, a Precision of 0.78, a Recall of 0.78 and an F1-Score of 0.78. When laboratory results were included, for the period 1999–2014, XGBoost performed the best with an AUC of 0.957, a Precision of 0.89, a Recall of 0.89 and an F1-Score of 0.89. For the second classification case and for the period 1999–2014 using survey data, the ensemble model showed superior performance, achieving an AUC of 0.737, a Precision of 0.68, a Recall of 0.68 and an F1-Score of 0.68. Utilizing laboratory data for the period 1999–2014, XGBoost classified the cases with the best scores, namely, an AUC of 0.802, a Precision of 0.74, a Recall of 0.74 and an F1-Score of 0.74. The feature importance experiment, based on error rates, showed that waist circumference, age, leg length, sodium intake, blood osmolality, blood urea nitrogen, triglycerides, LDL-c, total cholesterol, carbohydrate intake, diastolic and systolic pressure, BMI and fiber intake were the top-ranked features. The authors concluded that Machine Learning models based on survey questionnaires can provide automated identification mechanisms for patients at risk of diabetes and cardiovascular disease [27]. An important contribution of the study is the identification of factors that contribute to T2DM and cardiovascular disease (CVD); key contributors to the prediction were also identified, which can be further explored for their implications in electronic health records.
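The AUC-squared weighting of the soft voting ensemble can be reproduced in a few lines, as in the hypothetical sketch below (scikit-learn and synthetic data assumed; in the original study, the weights were derived from cross-validated AUCs rather than a single held-out split).

```python
# Sketch (hypothetical re-implementation, not the authors' code): soft voting
# where each base model's probability is weighted by w_i = AUC_i^2 / sum_j(AUC_j^2).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=15, weights=[0.7, 0.3], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

base_models = [LogisticRegression(max_iter=1000), SVC(probability=True),
               RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]

probas, aucs = [], []
for m in base_models:
    m.fit(X_tr, y_tr)
    p = m.predict_proba(X_val)[:, 1]
    probas.append(p)
    aucs.append(roc_auc_score(y_val, p))

weights = np.array(aucs) ** 2
weights /= weights.sum()                            # w_i = AUC_i^2 / sum_j AUC_j^2
ensemble_proba = np.average(np.vstack(probas), axis=0, weights=weights)
print("ensemble AUC:", round(roc_auc_score(y_val, ensemble_proba), 3))
```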
Zhang and coworkers utilized a 36,652-record dataset from a Henan rural cohort study to test the ability of Machine Learning algorithms to predict the risk of T2DM in a rural Chinese population [36]. Among the features considered in their analyses were sociodemographic parameters (e.g., age, sex, income and education), anthropometric values (e.g., waist circumference and waist-to-hip ratio), biometric values (e.g., pulse pressure and heart rate) and laboratory results (LDL-c, HDL-c, TG, insulin, creatinine, uric acid and urine glucose), while personal and family histories of diseases were also factored in (i.e., hypertension, coronary heart disease and family history of T2DM). The positive class was defined as a positive diagnosis of T2DM by a physician or an FPG value greater than or equal to 7.0 mmol/L (126 mg/dL), according to the American Diabetes Association [37]. At the preprocessing step, apart from data cleaning, the Synthetic Minority Oversampling Technique (SMOTE) was employed to overcome the bias introduced by class imbalance. Several linear, non-linear and ensemble models were employed on two datasets, one containing laboratory results and the other excluding them. After hyperparameter tuning through a 10-fold CV grid search, the GBM model produced the best evaluation metrics on both datasets, achieving AUCs of 87.2% and 81.7%, Accuracies of 81.2% and 70.28%, Sensitivities of 76.04% and 78.96%, and Specificities of 81.71% and 69.43%, respectively. Moreover, a comparison of model performance with a fixed versus a dynamic number of variables confirmed that models using a subset of the variables could perform similarly to the model with all variables. Finally, using SHAP attribute evaluation, this study revealed urinary parameters, sweet flavor, age, heart rate and creatinine as the most relevant predictors of T2DM [36].
DeSilva et al. used the NHANES dataset of 16,429 records with nutritional, behavioral, socioeconomic and non-modifiable demographic features (114 nutritional/dietary/food-intake-associated features; 13 other modifiable/health-behavior-associated features; and 12 socioeconomic/demographic features) spanning the years 2007–2016 [38]. Missing values were imputed using the MICE package. The dataset was divided into training, validation and testing subsets. Due to class imbalance, three new dataset variations were created using three sampling techniques, namely, Minority Class Oversampling, Random Oversampling Examples and SMOTE. Logistic Regression, an Artificial Neural Network and Random Forest were trained on the four datasets. For the ANN, parameter tuning was conducted, whereas for the other two models the default parameters and 10-fold cross validation were utilized. The experiments showed that Logistic Regression, trained on the minority-oversampled dataset, was the best-performing model, achieving an AUC value of 75.35%. The odds ratio analysis of the best-performing Logistic Regression indicated folic acid, food folate, self-reported healthiness of diet and calcium as factors that minimize the risk of diabetes, while total number of people in the household, total fat intake, cigarette smoking, weight, BMI and waist circumference were shown to increase the risk for T2DM. The authors concluded that their findings are a step towards personalized clinical nutrition, such as risk-stratified nutritional recommendations and early preventive strategies aimed at high-risk individuals, as well as the nutritional management of individuals with T2DM [38].
Furthermore, Phongying and colleagues evaluated four ordinary Machine Learning models for their efficiency in classifying patients with T2DM on 20,227 records taken from the Department of Medical Services in Bangkok [39]. The dataset, containing 10 typical demographic, biometric, blood pressure and heart rate attributes, as well as family history of diabetes, was normalized with the MinMax algorithm, and the Gain ratio was then utilized to identify the most important features for class prediction. BMI and family history of diabetes emerged as the most influential attributes, and they subsequently played the primary role in the creation of “interaction variables”. As a first step, every continuous attribute was binarized (e.g., age ≥60 equals 1, whereas age <60 equals 0; diastolic blood pressure ≥90 equals 1, whereas diastolic blood pressure <90 equals 0; etc.). Then, owing to their higher importance, BMI and family history of diabetes were each combined with the remaining attributes through “AND” clauses to create interaction variables (e.g., if BMI < 23 AND diastolic blood pressure < 90, then Y = 0; if family history of diabetes = True and age ≥ 60, then Y = 1, 0 otherwise; etc.). The produced interaction variables were added to the existing 10-feature dataset. Finally, after an 80/20 train/test split and hyperparameter tuning, all the models were evaluated on both datasets. Among them, Random Forest, trained on the dataset including interaction terms, yielded the best metrics, namely, 97.5% Accuracy, 97.4% Precision and 96.6% Recall. In this study, new classification models that incorporate optimized hyperparameters and include the interaction of important risk factors affecting diabetes are presented [39].
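The construction of such interaction variables can be sketched with pandas as follows; the column names, cut-offs and the tiny example frame are illustrative, not the study's actual data.

```python
# Sketch of the interaction-variable construction: binarize continuous attributes,
# then AND the two most influential attributes (BMI, family history) with the rest.
# Column names, cut-offs and the tiny example frame are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "bmi": [21.0, 27.5, 30.1, 24.8],
    "age": [45, 63, 71, 58],
    "diastolic_bp": [80, 95, 92, 85],
    "family_history": [0, 1, 1, 0],
})

# Step 1: binarize continuous attributes with clinical cut-offs.
df["bmi_ge_23"] = (df["bmi"] >= 23).astype(int)
df["age_ge_60"] = (df["age"] >= 60).astype(int)
df["dbp_ge_90"] = (df["diastolic_bp"] >= 90).astype(int)

# Step 2: AND the top attributes with each remaining binarized attribute.
for col in ["age_ge_60", "dbp_ge_90"]:
    df[f"bmi_x_{col}"] = df["bmi_ge_23"] & df[col]
    df[f"famhist_x_{col}"] = df["family_history"] & df[col]

print(df.filter(like="_x_"))
```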
Qin et al. also used the NHANES dataset along with Boosting, Random Forest, Logistic Regression and SVM models to identify cases of diabetes [40]. After dataset balancing using SMOTE and attribute evaluation using backward feature selection as the method and the AIC (Akaike Information Criterion) as the objective, 18 attributes were extracted. The attributes were categorized into four relevant sectors: Demographic, Dietary, Examination and Questionnaire features. Regarding the model evaluation, CatBoost showed the best performance, with metrics as follows: AUC: 0.83, Accuracy: 82.1%, Sensitivity: 82% and Specificity: 51.9%. SHAP evaluation was also performed in this study, revealing that sleep time, energy and age are highly influential in terms of diabetes outcome. Hence, the study demonstrates the potential of Machine Learning methods for predicting diabetes using lifestyle-related data [40].
In their research work, Kazerouni et al. [41] used a dataset comprising 100 Iranian T2DM patients from Shohadan Hospital, Tehran, and 100 healthy individuals, matched for age and sex. The authors applied KNN, SVM, Logistic Regression and ANN models while exploring the potential of long non-coding RNA (lncRNA) expression for predicting T2DM and the detection of diabetes on an RNA molecular basis. The SVM model produced the best evaluation metrics, achieving an AUC of 0.95, a Sensitivity of 95% and a Specificity of 86%. The biomarker applied in this study demonstrated a high diagnostic value that could help in T2DM prediction.
A balanced dataset was created by merging data from three datasets (NHANES, MIMIC-III and MIMIC-IV) in a study by Agliata et al. [42]. The study employed a binary classifier, trained from scratch, to identify potential non-linear relationships between the onset of T2DM and a set of parameters obtained from patient measurements. The proposed model achieved a satisfactory level of Accuracy (approximately 86%) and an AUC value of 0.934. This Neural Network-based approach may provide accurate information for personalized medicine, rendering it a valuable resource for decision making.
Uddin et al. [7] evaluated the performance of various classifiers, including DT, Logistic Regression, SVM, Gradient Boosting, XGBoost, RF and an ensemble technique (ET), on a 508-record dataset from Bangladesh. The experimental results showed that the ET outperformed the other classifiers; to further enhance its effectiveness, the researchers fine-tuned and evaluated the hyperparameters of the ET, obtaining an outstanding Accuracy of 99.27% and an F1-score of 99.27%. In addition, using various statistical and Machine Learning models, they determined four key factors as being highly associated with diabetes: age, having diabetes in the family, regular intake of medicine and extreme thirst. Since the proposed diabetes detection system has a high degree of precision, physicians and clinicians may use the proposed framework to assess diabetes risk [7].
Another group of researchers applied the LightGBM model to the Zewditu Memorial Hospital (ZMHDD) dataset from Ethiopia and compared it against typical pattern recognition models [12]. Basic demographic and anthropometric data, as well as blood pressure, cholesterol, pulse rate and FPG data, were utilized to predict the diabetes status of patients. After median imputation of missing values and MinMax normalization, the LightGBM model was trained using 10-fold cross validation for hyperparameter tuning. The optimized model achieved an AUC of 0.98, an Accuracy of 0.98, a Sensitivity of 0.99 and a Specificity of 0.96. The Pearson correlation coefficient showed that FPG, total cholesterol and BMI had the strongest linear relationship with diabetes prevalence among the attributes, even though the coefficient did not exceed 0.37. This highly computationally efficient model outperformed the well-established typical pattern recognition models in all metrics, providing vital help to low-income countries that lack the appropriate hardware to support more complex operations.
In another study, the efficiency of a hard voting ensemble model was examined against simple models on a 1787-record dataset (898 positive cases and 889 negative cases) from Centro Medico Nacional Siglo XXI in Mexico City [26]. Feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO) method yielded 12 features, namely, sociodemographic, anthropometric and laboratory features (urea, High-Density Lipoprotein (HDL) under treatment, triglycerides (TG), diastolic pressure under treatment and systolic pressure without treatment), as well as the existence of hypertension and the administration of lipid-lowering medication. Notably, there were also heart rate and blood pressure data with and without treatment, offering a more objective source of data/knowledge for building the models. Z-score standardization was performed to allow for comparable features. An SVM, a Linear Regression model and an Artificial Neural Network were trained and tuned on 75% of the dataset with 10-fold cross validation, as well as the hard voting ensemble, incorporating the entire dataset. Testing on the remaining 25% of the data showed that the single SVM exhibited a slightly better performance in terms of AUC compared to hard voting, namely, 92.8% versus 90.5%, respectively. The single SVM demonstrated an Accuracy of 89.82%, a Sensitivity of 87.85% and a Specificity of 92.35%. Furthermore, blood lipid level under treatment and hypertension treatment were deemed the most useful features [26].
Interestingly, in a separate study, researchers explored a somewhat reverse strategy [34]. Rather than the typical features used in the previous studies (only age and sex were retained), they examined a variety of symptoms which are characteristic of diabetes: polyuria, polydipsia, polyphagia, sudden involuntary weight loss, weakness, genital thrush, itching and obesity. The dataset used consisted of 520 records, 16 features and 1 target class for diabetes diagnosis, and it underwent balancing through SMOTE using the five nearest neighbors. The relevance and importance of the features were evaluated using the Pearson coefficient, the Gain ratio, Naive Bayes AUC and Random Forest AUC. As a result, polyuria, polydipsia, sudden involuntary weight loss and sex were the most outcome-influencing features. At the evaluation step, a plethora of typical models were trained and tested, and the impact of SMOTE and cross validation was determined. The best-performing models were Random Forest and KNN trained on a balanced dataset using 10-fold cross validation, which achieved AUCs of 99.9% and 98.9%, respectively, and an Accuracy of 98.59%, a Recall of 98.6% and a Precision of 98.6%. The performance analysis presented in this study showed that data preprocessing is a major step in the design of efficient and accurate models for diabetes occurrence [34].

4.2.2. Biomarker Regression

As mentioned previously, apart from classification problems, Machine Learning can be applied to diabetes through regression for the estimation of predictive biomarkers such as FPG and the revelation of factors that relate to FPG variability. To this end, Kopitar et al. utilized models of three conceptually different families, boosting, bagging and Linear Regression, as each family has a different capability to detect hidden patterns and important features [43]. The dataset initially consisted of 27,050 electronic health records of adults with no prior diabetes diagnosis between 2014 and 2017. The first goal of the study was to compare the models' performance against FINDRISC; thus, records that had missing values for any of the features included in FINDRISC were excluded. Outlier detection was conducted using the mean ± 3 × SD rule, and each outlier value was marked as missing. Records and features with more than 50% missing values were excluded. The remaining missing values were imputed with the MICE method. The preprocessing stage yielded a final dataset of 3723 records, 58 features and the FPG target variable. These features can be grouped into the following four groups: lipid profile laboratory results (blood HDL-c, LDL-c, total cholesterol and triglycerides), social determinants of health (consumption of alcohol, smoking, dietary habits and stress), cardiovascular variables (blood pressure measurements and atrial fibrillation history) and history of other health conditions (stroke, hypertension and colon cancer). The data were partitioned into 6-month intervals (T6, T12, T18, T24 and T30) according to the submission date of each record; thus, five sub-datasets were created, and each Machine Learning model was trained on each sub-dataset and validated using 100 rounds of random sampling with replacement (bootstrapping). Linear Regression achieved the lowest Root Mean-Square Error (RMSE) of 0.838 (95% CI: 0.814–0.862) when trained on only the seven features common to FINDRISC. When the entire dataset was available for training and testing, RF achieved the lowest RMSE of 0.745 (95% CI: 0.733–0.757). To measure how well the regressor fitted the actual FPG value given the input features, the R2 coefficient was utilized. When 6 months of data were available, the linear model performed the best, with an average value of 0.310, while RF performed the best when 18 and 30 months of data were available, achieving mean values of 0.340 and 0.368, respectively. Finally, feature importance was assessed for every model across the five time-frame datasets using different metrics (because each model has a different structure), such as the β-coefficient and permutation importance on mean squared error (MSE). Triglyceride level was assessed as the most important feature in LightGBM, while hyperglycemia history was deemed the most important feature in the remaining three models. Among the next set of features that were lower in importance, even if there were some differences in the rankings, age, HDL-c, LDL-c, total cholesterol, systolic pressure, diastolic pressure and weight were in the top 10. The authors concluded that the more data are available, the more stable the models become, with LightGBM achieving the most stable results across the multiple evaluations [43].
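The bootstrap validation loop described above can be sketched as follows (scikit-learn and synthetic regression data assumed; the original study used 100 repetitions and real EHR-derived features, so this is illustrative only).

```python
# Sketch of bootstrap validation for a regressor (not the original code): resample
# with replacement, refit, and score RMSE and R^2 on the out-of-bag rows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import resample

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

rmses, r2s = [], []
for seed in range(20):                                  # the original study used 100 repetitions
    idx = resample(np.arange(len(X)), replace=True, random_state=seed)
    oob = np.setdiff1d(np.arange(len(X)), idx)          # rows not drawn in this bootstrap sample
    model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X[idx], y[idx])
    pred = model.predict(X[oob])
    rmses.append(np.sqrt(mean_squared_error(y[oob], pred)))
    r2s.append(r2_score(y[oob], pred))

print(f"RMSE = {np.mean(rmses):.2f} +/- {np.std(rmses):.2f}, R2 = {np.mean(r2s):.3f}")
```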

4.2.3. Long-Term Prediction

Long-term diabetes prediction, in the sense that the class variable is filled many years after the features are initially recorded, was investigated by several studies utilizing the baseline follow-up method [21,28,44,45,46,47]. Liu et al. [44] applied four Machine Learning algorithms (Logistic Regression, DT, RF and XGBoost) to build prediction models for the risk of incident T2DM on a 127,031-record dataset of adults older than 65 years in Wuhan, China, from 2018 to 2020. Overall, 8298 participants were diagnosed with incident T2DM. The XGBoost model with 21 features demonstrated the best performance for predicting T2DM, achieving an AUC of 0.78, an Accuracy of 75%, a Sensitivity of 64.5% and a Specificity of 75.7%, with FPG, education, exercise, gender and waist circumference as the top-five important predictors. The findings of the study showed that the model can be applied to screen individuals at high risk of T2DM in the early phase and thus has strong potential for intelligent prevention and control of diabetes [44].
Lama et al. examined a cohort of 7949 people with known and unknown family histories of diabetes [45]. The features involved were questionnaires about lifestyle, socioeconomic and psychosocial matters, along with measurements of plasma glucose and insulin in an oral glucose tolerance test (OGTT), glycosylated hemoglobin (HbA1c), blood pressure, weight, height, and waist circumference. In the baseline study, type 2 diabetes mellitus was diagnosed in 51 women and 66 men, and prediabetes in 219 women and 259 men. A first follow-up study was conducted 8–10 years later and a second follow-up study approximately 20 years later, with at least 70% participation. People diagnosed with type 2 diabetes at baseline or at the first follow-up were not invited to the second follow-up. The dataset was partitioned into three sets: a training set, a validation set and a test set. Random Forest was the classifier utilized to predict the development of T2DM 10 years after the initial measurement. SHAP TreeExplainer was used to build an interpretable Machine Learning model to find factors that correlated with high or low diabetes risk. Hyperparameter tuning using 5-fold cross validation on the validation set was conducted to identify the best hyperparameter set for the Random Forest, with a combination of AUC and robustness as the objective function. This function was defined as:
$$S_l = AUC_l^{\mathrm{val}} \cdot (1 - \Theta_l)$$
where
$$\Theta_l = \mu\left(\sigma\left(\tilde{X}_{ijkl}\right)\right)$$
and
$$\tilde{X}_{ijkl} = \frac{X_{ijkl} - \mu(X_{ijkl})}{\sigma(X_{ijkl})}$$
where $X_{ijkl}$ is a tensor of SHAP values per person $i$, feature $j$, cross-validation split $k$ and parameter set $l$, and $\tilde{X}_{ijkl}$ is the standardized tensor with zero mean and unit variance. Finally, $\Theta_l$ is the mean standard deviation of the standardized SHAP values per hyperparameter set $l$. The best hyperparameter set was as follows: number of estimators = 120, min sample leaf = 125, max depth = 4 and number of models = 30, achieving a robustness value of S = 0.630 and an AUC value of 0.779. According to the SHAP value analysis, the features that increase the risk for T2DM are family history of diabetes, high waist-to-hip ratio, high BMI, increased systolic pressure, increased diastolic pressure, low physical activity and male sex. On the other hand, the features that decrease T2DM risk are exercise, higher socioeconomic status and lower age. Also, with the help of a SHAP force plot, personalized risk profiles were extracted to assess the individual risk score, which is called the output value, and to reveal the features that exert the highest impact on the individual risk score. Finally, the authors suggest that this method be introduced in primary healthcare to improve diabetes care by developing more individualized, easily accessible healthcare plans [45].
In another study, regression, tree and Gradient Boosting models were employed to predict the development of diabetes within 9 years [46]. The dataset consisted of 38,379 records and a large volume of demographic, laboratory, pulmonary test, personal history and family history data, and the data were imputed with mean/mode values and split into 80/20 train/test datasets. The tunable models were optimized with stratified 10-fold cross validation. The comparison of the models’ metrics showed that XGBoost achieved the best performance, specifically, a 0.623 AUC, a 0.966 Accuracy, a 0.970 Sensitivity and a 0.690 Specificity. In addition, a survival analysis was performed utilizing Cox Regression and XGBoost Survival Embeddings to provide a clearer picture about the mean time for someone to develop diabetes. Finally, the SHAP attribute evaluation revealed that FPG, HbA1c and family history of diabetes contributed the most to diabetes risk, while, contrary to [36], uric acid contributed the least.
Additionally, researchers evaluated the efficiency of Machine Learning models, such as Random Forest, Gradient Boosting, MLP and Naive Bayes, against the classical Linear Regression (LR) [21]. The cohort study included a total of 3687 participants, and, by using the baseline follow-up method, prediction of 3-year diabetes development was performed, considering demographic, smoking, alcohol, history of health condition and laboratory data. According to the LR feature analysis results, eight factors, including age, family history, impaired fasting glucose (IFG), impaired glucose tolerance (IGT), hypertension, triacylglycerol, alanine aminotransferase (ALT) and gamma glutamyl transpeptidase (GGT), were finally selected as modeling variables. The training phase included a 75/25 train/test split, while the evaluation and 10-fold cross validation showed that Random Forest was the best proposed model. Then, hyperparameter optimization, again using 10-fold cross validation, was performed and the final model was analyzed against SHAP evaluation, and this, in turn, was compared with LR feature analysis. Finally, the AUC of RF was 0.835 [21]. These findings reveal that in real-world epidemiological research, the combination of traditional variable screening and a Machine Learning algorithm to construct a diabetes risk prediction model has a satisfactory clinical application value [21].
In another study, the researchers evaluated a variety of single and ensemble models on the ELSA dataset to predict type 2 diabetes occurrence [28]. The dataset contained a variety of biometric, anthropometric, hematological, lifestyle, sociodemographic and performance index data. Several different feature selection techniques were employed, such as LASSO, correlation and Greedy stepwise methods. The selected method was Greedy stepwise with Naive Bayes, and after the addition of extra features the final dataset consisted of 34 input features, with the class variable indicating the diabetes condition of the person. Random undersampling was conducted to achieve a diabetic distribution per age group comparable to real life, yielding a final dataset of 2331 records. For the evaluation of the models, the procedure involved creating 10 datasets from the existing master dataset, using a stratified 70/30 train/test split. Logistic Regression, Naive Bayes, Decision Tree, Random Forest, Artificial Neural Network, Deep Neural Network, and three ensembles of Random Forest and Logistic Regression, namely, stacking, voting and weighted voting, were employed. Due to class imbalance, threshold adjustment was conducted for each model, with the Youden Index (J = Sensitivity + Specificity − 1) as the objective function. For the weighted soft voting, a bi-objective optimization problem was solved to calculate the best weights for RF and LR, which maximized the Sensitivity and AUC. Indeed, the weighted classifier produced the best results in terms of AUC, with a value of 0.884. In addition, the Sensitivity and Specificity were 0.856 and 0.798, respectively. Finally, the authors concluded that, owing to the superiority of the ensemble models, these can be embedded into recommendation systems to help with prevention strategies against the development of diabetes [28].
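Threshold adjustment with Youden's J can be sketched directly from the ROC operating points, as below (scikit-learn and synthetic data assumed; not the authors' code).

```python
# Sketch (not the authors' code): pick the classification threshold that
# maximizes Youden's J = Sensitivity + Specificity - 1 = TPR - FPR.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)

j = tpr - fpr                                  # Youden's J at every ROC operating point
best = int(np.argmax(j))
print(f"best threshold = {thresholds[best]:.3f}, Youden J = {j[best]:.3f}")
```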
In a large study, a dataset of 500,000 records from the Hanaro Medical Foundation, containing diagnostic results and questionnaires collected over 5 years in baseline follow-up form, was utilized for a multiclass experiment predicting prediabetic, diabetic and healthy status in the following year [47]. Participants who suffered from a relevant condition, such as diabetes, hypertension or hyperlipidemia, or took medication for one of these conditions, were excluded. Due to class imbalance, majority undersampling as well as SMOTE were employed to avoid majority-class bias. The extraction of important features consisted of two stages. In the first stage, continuous and nominal features were screened through ANOVA and chi-square testing, respectively, and in the second stage, 12 features were selected using Recursive Feature Elimination with the impurity index of a decision tree as the criterion. The 12 features were FPG, HbA1c, demographic data, BMI, gamma glutamyl transpeptidase (gamma-GT), uric acid, lifestyle habits (smoking and alcohol consumption) and family history of diabetes; of these, FPG, HbA1c and gamma-GT were the most informative/predictive. Regarding model creation, Logistic Regression, SVM, Random Forest and XGBoost were compared to more sophisticated ensemble classifiers, such as the confusion matrix-based classifier integration approach (CIM), soft voting and stacking. Ten-fold cross validation with grid searching was used for hyperparameter tuning and testing. The CIM model yielded the best results, with Accuracy, Precision and Recall values of 0.77. As a final experiment, the researchers used follow-up data from 5 consecutive years, showing that the more data are available, the better the results achieved. The proposed model can provide both clinicians and patients with valuable information on the incidence of T2DM ahead of time, which would help patients take measures to mitigate T2DM risk, progression and related complications.
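The two-stage screening described above can be approximated with standard scikit-learn components, as in the following hedged sketch; it is not the authors’ implementation, and the split into continuous and nominal columns as well as all parameter values are purely illustrative.

```python
# Sketch of a two-stage feature screen: univariate filters (ANOVA F-test for continuous,
# chi-square for nominal features) followed by RFE with a decision tree criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2, f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)
X_cont, X_nom = X[:, :20], (X[:, 20:] > 0).astype(int)  # pretend the last 10 columns are nominal

# Stage 1: univariate screening
cont_keep = SelectKBest(f_classif, k=12).fit(X_cont, y).get_support(indices=True)
nom_keep = SelectKBest(chi2, k=4).fit(X_nom, y).get_support(indices=True)
X_stage1 = np.hstack([X_cont[:, cont_keep], X_nom[:, nom_keep]])

# Stage 2: recursive elimination down to 12 features using tree impurity-based importance
rfe = RFE(DecisionTreeClassifier(max_depth=5, random_state=0), n_features_to_select=12)
rfe.fit(X_stage1, y)
print("selected feature mask:", rfe.support_)
```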
A summary table of the reviewed literature is provided below in Table 1a,b. The tables below present a holistic overview of the datasets used along with their unique features; complementary techniques, such as data normalization, resampling techniques and most-significant-feature extraction; and the selection of the best models along with their respective evaluation metrics. A breakdown of the studies by their purposes is shown in Figure 3.

5. Discussion

In the work presented here, our findings are organized along the following thematic axes:
  • Types of hypotheses addressing T2DM through Machine Learning using tabular data.
  • Data preprocessing.
  • Features involved.
  • Selection and identification of most important features.
  • Methodology structure towards model building.
  • Evaluation metrics.
  • Best models.

5.1. Types of Hypotheses Addressing Diabetes through Machine Learning Using Tabular Data

Recent applications of tabular data in diabetes research can be categorized into three distinct hypotheses: current-state diabetes identification, long-term diabetes prediction and biomarker regression. As Figure 3 depicts, most hypotheses concern current-state diabetes identification (62.50%), followed by long-term diabetes prediction (31.25%) and, lastly, diabetes biomarker regression (6.25%), out of a total of 16 articles. This distribution is quite reasonable, as each hypothesis presents different data collection requirements. Current-state diabetes identification requires only present health condition information, whereas in the long-term case the class variable describing diabetes status is recorded many years after the baseline measurements are completed, following a baseline follow-up design; thus, current-state studies have a simpler data collection process than long-term ones. Regression of biomarkers such as FPG or HbA1c entails even more complex dataset creation, because the target variable must be measured continually and systematically by the individuals using an invasive method. Moreover, the feature values must be recorded by the individuals alongside each measurement, which could lead to false or incomplete data due to the lack of professional supervision. Among these, the most challenging and interesting use case is long-term forecasting, because it can provide an early assessment of diabetes development.

5.2. Data Preprocessing

Most of the reviewed research articles attach high importance to data preprocessing techniques such as missing-value imputation, data balancing and transformation. Imputation is typically performed with mean/mode values or the more complex MICE method, although, interestingly, most of the work we reviewed herein did not perform imputation but rather excluded records with missing values, given the overall high volume of data available. Therefore, there is no universally accepted method for dealing with missing values; rather, the manner in which the issue is handled depends on the specific problem setting. However, MICE utilization in datasets with relatively low percentages of missing data could capture possible hidden patterns between features and therefore produce more realistic datasets.
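For illustration, the two imputation strategies can be sketched with scikit-learn as follows; IterativeImputer serves here as a MICE-style chained-equations imputer, and the toy DataFrame and column names are purely hypothetical.

```python
# Mean imputation versus MICE-style iterative imputation on a toy table
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

df = pd.DataFrame({
    "age": [45, 52, np.nan, 61, 38],
    "bmi": [27.1, np.nan, 31.4, 29.0, 24.3],
    "fpg": [5.6, 7.2, 6.9, np.nan, 5.1],
})

mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)                 # mean (mode: "most_frequent")
mice_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(df)  # chained-equations style
print(np.round(mice_imputed, 2))
```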
Data balancing is undoubtedly one of the most important stages, because class imbalance biases the entire experiment towards the majority class. As deduced from our literature search, general approaches include oversampling minority instances, undersampling majority instances and synthesizing artificial data (SMOTE) to produce equal instance counts for all classes. Interestingly, Fazakis et al. [28] used undersampling to match the real-life, age-wise distribution of positive diabetic cases, since, with only about 2000 instances, enforcing equal class sizes could have introduced significant bias towards positive diabetic cases. Also, Fazakis et al. [28] and Lai et al. [34] utilized an adjusted decision threshold for their classifiers, yielding very good results.
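A minimal sketch of these balancing options, assuming the imbalanced-learn package is available, is shown below; the class ratio and data are synthetic.

```python
# Oversampling with SMOTE versus random majority undersampling
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=0)
print("original:", Counter(y))

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)               # synthesize minority samples
X_us, y_us = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop majority samples
print("after SMOTE:", Counter(y_sm), "after undersampling:", Counter(y_us))
```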
Data transformation techniques were only included in [15,42] (Standardization) and in [21,37] (MinMax). Standardization generally benefits all model types, whereas MinMax normalization does not affect tree-based models. Notably, [22,37] applied MinMax normalization even though they trained LightGBM and Random Forest, respectively, mainly to allow fair comparisons against other scaling-dependent models (SVM and KNN).
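One common way to keep such comparisons fair, sketched below under the assumption of a scikit-learn workflow, is to attach the scaler to the scale-sensitive estimators inside a pipeline, so that it is fitted only on the training folds, while tree-based models can be trained on the raw features.

```python
# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

svm_clf = make_pipeline(MinMaxScaler(), SVC())
knn_clf = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=7))
print(cross_val_score(svm_clf, X, y, cv=10, scoring="roc_auc").mean())
```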

5.3. Features Involved

The datasets in the studies we reviewed typically comprise several thousand records and include a variety of data representing multiple aspects of human health. Datasets are mainly provided by clinics, hospitals and institutes such as NHANES [33,36,38]. The features included are sociodemographic parameters, such as age, sex, education level, income and marital status; anthropometric/biometric parameters, such as height, weight, BMI, waist circumference, systolic and diastolic blood pressure, and pulse rate; laboratory results, such as FPG, HbA1c, HDL-c, LDL-c, total cholesterol, triglycerides, urea, creatinine and gamma-GT; lifestyle behaviors, such as sleep time, smoking, alcohol consumption and physical activity; family history of diabetes; and dietary consumption data, such as folate, carbohydrates and sugar [36,38]. An advantage of tabular data is the low storage overhead compared to images, which demand orders of magnitude more memory. However, such a plethora of features demands a systematic and time-consuming registration effort, while participants may not be available for all measurements.

5.4. Selection and Identification of the Most Important Features

With respect to the selection and identification of the most important features, the application of feature selection techniques can facilitate the removal of features that appear unrelated to diabetes, thereby simplifying data collection and improving the computational efficiency of model building. As depicted in Figure 4a, the top-performing models utilize a variety of methods, including Wrapper, Filter and Embedded methods. The most notable finding is that most researchers chose to promote models trained on the whole feature set rather than a reduced version thereof. One reason for this is the scope of each study: for example, De Silva et al. [38] examined dietary feature contributions, while Dritsas et al. [13] presented the influence of common diabetic symptoms. Perhaps the most useful result of such studies is the interpretability of the associated factors, so that they can be moderated towards disease risk minimization. With recent advancements in Machine Learning, prediction models are no longer treated as black boxes; feature importance methods allow every feature’s contribution to be quantified. For instance, De Silva et al., using the odds ratio metric, identified folate and total fat consumption, as well as self-reported diet healthiness, among the significant predictors of undiagnosed T2DM [38], while Dritsas et al. highlighted polyuria, polydipsia, sudden weight loss and gender as the best indicators of diabetes, using Pearson correlation, Gain ratio, and Random Forest and Naive Bayes AUC [13]. Another finding is the frequent deployment of the SHAP method in several studies [19,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46], which provides a common interface for all types of models to quantify feature influence globally, as well as the influence of features for a particular participant [44]. Summing up, Figure 4b depicts an enumeration of the best-predictor categories in this review. The glucose-related biomarkers FPG and HbA1c were, as expected, found to be the most important predictors, as they are the principal indicators of diabetes [1]. Age, BMI, heart function metrics, lipid profile and diabetes heredity are consistent between such data-driven techniques and medical practice/experience, in contrast to ethnicity, which was not highlighted by the studies we reviewed. Finally, urinary parameters seem promising in terms of predictive capability [13,35], while dietary factors should be further investigated given their complexity [35,36,44].
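For readers unfamiliar with the mechanics, the following sketch shows how SHAP-based global importance is typically obtained for a fitted tree ensemble; it is not drawn from any of the reviewed studies, and the feature names are invented for illustration.

```python
# Global importance as the mean absolute SHAP value per feature for a tree ensemble
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["fpg", "hba1c", "bmi", "age", "waist", "hdl", "ldl", "sbp", "dbp", "ggt"]
X, y = make_classification(n_samples=1500, n_features=10, n_informative=6, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one value per sample and feature

importance = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {value:.4f}")
```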

5.5. Methodology Structure towards Model Building

In general, the model-building methodology offers wide latitude in its design. Nevertheless, our findings clearly reveal a common pattern in model building. Firstly, several studies used many different feature subsets, arising either from a variety of feature selection methods, from the medical literature or from the hypotheses being examined [12,33,34,35,37,47]. Datasets with laboratory data are compared against non-invasive data to evaluate whether the latter can provide reasonably reliable results [34]. According to Table 1, the vast majority of studies prefer a train/test split with an 80/20 ratio, hyperparameter tuning using mostly grid search, and 10-fold cross validation. Given the volume of data involved, these split percentages and numbers of folds are considered good choices. Figure 5 presents a typical workflow for model building extracted from the reviewed literature, including, for example, balancing at preprocessing, a Wrapper or non-Wrapper feature selection technique, the training procedure, the evaluation metrics and the popular feature importance methods; a condensed code sketch of this workflow follows.
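The sketch below condenses the common workflow (80/20 split, grid search with 10-fold cross validation, held-out evaluation) using scikit-learn; the estimator, parameter grid and data are placeholders rather than a reproduction of any particular study.

```python
# 80/20 split, grid search with 10-fold CV, then evaluation on the held-out test set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=4000, n_features=25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [4, 8, None]},
    scoring="roc_auc",
    cv=10,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("test AUC:", roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1]))
```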

5.6. Evaluation Metrics

In most of the examined studies, the evaluation metrics were the typical ones, such as AUC, Accuracy, Sensitivity (Recall), Specificity, Precision and F1-Score, which together provide a fairly complete picture of performance. However, in use cases such as disease detection, Sensitivity should take priority, as the higher its value, the less likely it is for an individual with diabetes to be classified as healthy. Other metrics used are the misclassification rate [32], the AIC [38], robustness [44] and the Youden index [28]. Researchers should also consider reporting additional existing or custom metrics where appropriate.
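For completeness, the snippet below shows how these reported metrics relate to the entries of the confusion matrix; the labels and predictions are dummy values used only for illustration.

```python
# Deriving Sensitivity, Specificity, Precision and the Youden index from a confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # Recall: diabetic cases correctly flagged
specificity = tn / (tn + fp)  # healthy individuals correctly cleared
precision = tp / (tp + fp)
youden_j = sensitivity + specificity - 1
print(sensitivity, specificity, precision, youden_j)
```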

5.7. Best Models

The general categories of the top models selected in the studies we reviewed herein are depicted in Figure 6, clustered by hypothesis. For the identification purpose, Gradient Boosting models present a clear advantage, with Random Forest following. For long-term forecasting, stacking and voting ensembles share the top place with Random Forest, while for biomarker regression the only model preferred belongs to the Gradient Boosting family. SVM, KNN and Logistic Regression are not preferred, firstly because of their weaker performance and secondly because these models require data transformation, such as MinMax normalization, to cope with feature heterogeneity, which is both time- and resource-consuming for datasets with thousands of records. On the other hand, Gradient Boosting and tree-based models remain unaffected by the different numerical ranges of the features.

5.8. Limitations of Existing Approaches

Machine Learning datasets often come with limitations that can affect the performance and generalizability of the resulting models. Insufficient data can prevent a model from learning meaningful patterns, leading to underfitting and poor generalization; this is the case for several of the existing approaches [6,7,11,24,41]. Similarly, when dealing with imbalanced datasets, certain classes or labels may be underrepresented, so a model may perform well on the majority class but poorly on the minority class, and its predictions may not generalize well to real-world scenarios. Addressing such limitations requires careful data preprocessing, as described in [11]. Furthermore, when research findings are derived from cross-sectional studies without follow-up data, causal and temporal associations cannot be determined [35]. For datasets not coming from a hospital unit or institute, useful information such as biochemical measurements that record detailed health profiles of the participants may be missing; however, acquiring access to such data is time-consuming and difficult for privacy reasons [13]. There are also cases where the limited number of samples is attributed to the high cost of assessment and data collection [41]. Moreover, regarding model validation, only internal validation is performed in most of the approaches [32,41], whereas these prediction models need to be further validated on an external validation set, as performed by De Silva et al. [38].

5.9. Comparison with Previous Reviews

Previous reviews [3,4] examined a broader context of Artificial Intelligence and T2DM, including Deep Learning, unsupervised learning and association rules. More specifically, Kavakiotis et al. provide a more general overview of applied methods, covering hypotheses about diabetic complication prediction, data-driven investigation of drugs and therapies, genetic background and environment, and healthcare management [3]. In contrast to our work, they found SVM to be the best model for diabetes-related hypotheses. A more recent review by Fregoso-Aparicio et al. agrees with our finding that tree-based, ensemble models present the highest performance [4]. In their analyses, they found that Linear Regression coefficients and PCA are the most popular feature selection techniques, and that heterogeneity is the most popular model assessment criterion. SHAPley values were not of great concern in either of the reviews mentioned, whereas, as shown by our study, SHAPley is now a widely accepted interpretation method.
In the present review, recent publications were included along with certain high-quality older studies, all of which consider tabular data and Machine Learning models. An advancement of the current work is its attempt to provide an in-depth, detailed methodological description of several key articles reviewed and considered together, to unveil common successful patterns, to highlight newly emerging features and, subsequently, to provide some new insights into approaching the topic.

6. Conclusions

The identification and very early diagnosis of T2DM are, by their nature, very challenging fields of study. However, such work can help people stay informed about their health condition and prevent further negative developments. In this study, recent Machine Learning applications in T2DM prediction using tabular data, such as demographic, biometric, laboratory, lifestyle and dietary data, were reviewed, focusing on the investigation of common patterns in Machine Learning model implementations and the discovery of both emerging methods and features. Interestingly, Gradient Boosting and tree-based models, usually trained and optimized with grid search, were deemed the most successful, while the SHAPley and Wrapper algorithms appear to be quite popular feature interpretation and evaluation methods. Furthermore, apart from the classical laboratory biomarkers, urinary results and non-invasive dietary information are promising features, opening new considerations which could possibly lead to enhanced prevention or easier management of T2DM. While Gradient Boosting models present a clear advantage, with Random Forest following, it is important to consider the conditions and circumstances as well as any idiosyncrasies of the target population. Utilization of these models can support interesting approaches with applications in health management improvement for a variety of beneficiaries who may not have regular access to healthcare due to idiosyncratic occupational demands or lifestyles [48].

Author Contributions

Conceptualization, I.K.K., A.S.K. and A.K.S.; supervision, I.K.K. and A.K.S.; methodology, P.D.P. and I.K.K.; writing—original draft, P.D.P.; writing—review and editing, A.K.S., A.S.K. and I.K.K. All authors have read and agreed to the published version of the manuscript.

Funding

United States Department of Agriculture—National Institute of Food and Agriculture (USDA-NIFA), grant no.: 2020-70001-31296, awarded to Angelos K. Sikalidis (PI) and Aleksandra S. Kristo (Co-PI).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. International Diabetes Federation. Available online: https://idf.org/about-diabetes/what-is-diabetes/ (accessed on 23 May 2024).
  2. Kristo, A.S.; İzler, K.; Grosskopf, L.; Kerns, J.J.; Sikalidis, A.K. Emotional Eating Is Associated with T2DM in an Urban Turkish Population: A Pilot Study Utilizing Social Media. Diabetology 2024, 5, 286–299.
  3. Kavakiotis, I.; Tsave, O.; Salifoglou, A.; Maglaveras, N.; Vlahavas, I.; Chouvarda, I. Machine Learning and Data Mining Methods in Diabetes Research. Comput. Struct. Biotechnol. J. 2017, 15, 104–116.
  4. Fregoso-Aparicio, L.; Noguez, J.; Montesinos, L.; García-García, J. Machine learning and deep learning predictive models for type 2 diabetes: A systematic review. Diabetol. Metab. Syndr. 2021, 13, 148.
  5. Sudharsan, B.; Peeples, M.; Shomali, M. Hypoglycemia Prediction Using Machine Learning Models for Patients With Type 2 Diabetes. J. Diabetes Sci. Technol. 2015, 9, 86–90.
  6. You, Y.; Doubova, S.V.; Pinto-Masis, D.; Pérez-Cuevas, R.; Borja-Aburto, V.H.; Hubbard, A. Application of machine learning methodology to assess the performance of DIABETIMSS program for patients with type 2 diabetes in family medicine clinics in Mexico. BMC Med. Inform. Decis. Mak. 2019, 19, 221.
  7. Uddin, M.J.; Ahamad, M.M.; Hoque, M.N.; Walid, M.A.A.; Aktar, S.; Alotaibi, N.; Alyami, S.A.; Kabir, M.A.; Moni, M.A. A Comparison of Machine Learning Techniques for the Detection of Type-2 Diabetes Mellitus: Experiences from Bangladesh. Information 2023, 14, 376.
  8. Lugner, M.; Rawshani, A.; Helleryd, E.; Eliasson, B. Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data. Sci. Rep. 2024, 14, 2102.
  9. Sikalidis, A.K. From Food for Survival to Food for Personalized Optimal Health: A Historical Perspective of How Food and Nutrition Gave Rise to Nutrigenomics. J. Am. Coll. Nutr. 2019, 38, 84–95.
  10. Cloete, L. Diabetes mellitus: An overview of the types, symptoms, complications and management. Nurs. Stand. 2022, 37, 61–66.
  11. Iparraguirre-Villanueva, O.; Espinola-Linares, K.; Flores Castañeda, R.O.; Cabanillas-Carbonell, M. Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes. Diagnostics 2023, 13, 2383.
  12. Garcia-Carretero, R.; Vigil-Medina, L.; Mora-Jimenez, I.; Soguero-Ruiz, C.; Barquero-Perez, O.; Ramos-Lopez, J. Use of a K-nearest neighbors model to predict the development of type 2 diabetes within 2 years in an obese, hypertensive population. Med. Biol. Eng. Comput. 2020, 58, 991–1002.
  13. Dritsas, E.; Trigka, M. Data-Driven Machine-Learning Methods for Diabetes Risk Prediction. Sensors 2022, 22, 5304.
  14. Viloria, A.; Herazo-Beltran, Y.; Cabrera, D.; Pineda, O.B. Diabetes diagnostic prediction using vector support machines. Procedia Comput. Sci. 2020, 170, 376–381.
  15. Bernabe-Ortiz, A.; Borjas-Cavero, D.B.; Páucar-Alfaro, J.D.; Carrillo-Larco, R.M. Multimorbidity Patterns among People with Type 2 Diabetes Mellitus: Findings from Lima, Peru. Int. J. Environ. Res. Public Health 2022, 19, 9333.
  16. Ramezankhani, A.; Hadavandi, E.; Pournik, O.; Shahrabi, J.; Azizi, F.; Hadaegh, F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: A decade follow-up in a Middle East prospective cohort study. BMJ Open 2016, 6, e013336.
  17. Esmaily, H.; Tayefi, M.; Doosti, H.; Ghayour-Mobarhan, M.; Nezami, H.; Amirabadizadeh, A. A Comparison between Decision Tree and Random Forest in Determining the Risk Factors Associated with Type 2 Diabetes. J. Res. Health Sci. 2018, 18, e00412.
  18. Aguilera-Venegas, G.; López-Molina, A.; Rojo-Martínez, G.; Galán-García, J.L. Comparing and Tuning Machine Learning Algorithms to Predict Type 2 Diabetes Mellitus. J. Comput. Appl. Math. 2023, 427, 115115.
  19. Wang, X.; Zhai, M.; Ren, Z.; Ren, H.; Li, M.; Quan, D.; Chen, L.; Qiu, L. Exploratory Study on Classification of Diabetes Mellitus through a Combined Random Forest Classifier. BMC Med. Inform. Decis. Mak. 2021, 21, 105.
  20. Borzouei, S.; Soltanian, A.R. Application of an Artificial Neural Network Model for Diagnosing Type 2 Diabetes Mellitus and Determining the Relative Importance of Risk Factors. Epidemiol. Health 2018, 40, e2018007.
  21. Mao, Y.; Zhu, Z.; Pan, S.; Lin, W.; Liang, J.; Huang, H.; Li, L.; Wen, J.; Chen, G. Value of machine learning algorithms for predicting diabetes risk: A subset analysis from a real-world retrospective cohort study. J. Diabetes Investig. 2023, 14, 309–320.
  22. Rufo, D.D.; Debelee, T.G.; Ibenthal, A.; Negera, W.G. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics 2021, 11, 1714.
  23. Khan, A.A.; Qayyum, H.; Liaqat, R.; Ahmad, F.; Nawaz, A.; Younis, B. Optimized Prediction Model for Type 2 Diabetes Mellitus Using Gradient Boosting Algorithm; IEEE Xplore: Piscataway, NJ, USA, 2021.
  24. Alsadi, B.; Musleh, S.; Al-Absi, H.R.; Refaee, M.; Qureshi, R.; El Hajj, N.; Alam, T. An Ensemble-Based Machine Learning Model for Predicting Type 2 Diabetes and Its Effect on Bone Health. BMC Med. Inform. Decis. Mak. 2024, 24, 144.
  25. Ganie, S.M.; Malik, M.B. An Ensemble Machine Learning Approach for Predicting Type-II Diabetes Mellitus Based on Lifestyle Indicators. Healthc. Anal. 2022, 2, 100092.
  26. Morgan-Benita, J.A.; Galván-Tejada, C.E.; Cruz, M.; Galván-Tejada, J.I.; Gamboa-Rosales, H.; Arceo-Olague, J.G.; Luna-García, H.; Celaya-Padilla, J.M. Hard Voting Ensemble Approach for the Detection of Type 2 Diabetes in Mexican Population with Non-Glucose Related Features. Healthcare 2022, 10, 1362.
  27. Dinh, A.; Miertschin, S.; Young, A.; Mohanty, S. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 2019, 19, 211.
  28. Fazakis, N.; Kocsis, O.; Dritsas, E.; Alexiou, S.; Fakotakis, N.; Moustakas, K. Machine Learning Tools for Long-Term Type 2 Diabetes Risk Prediction. IEEE Access 2021, 9, 103737–103757.
  29. Frank, E.; Hall, M.A.; Holmes, G.; Kirkby, R.; Pfahringer, B.; Witten, I.H. Weka: A machine learning workbench for data mining. In Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers; Maimon, O., Rokach, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1305–1314.
  30. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  31. Seabold, S.; Perktold, J. statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28–30 June 2010.
  32. Gray, L.J.; Taub, N.A.; Khunti, K.; Gardiner, E.; Hiles, S.; Webb, D.R.; Srinivasan, B.T.; Davies, M.J. The Leicester Risk Assessment score for detecting undiagnosed Type 2 diabetes and impaired glucose regulation for use in a multiethnic UK setting. Diabet. Med. 2010, 27, 887–895.
  33. Lindstrom, J.; Tuomilehto, J. The Diabetes Risk Score: A practical tool to predict type 2 diabetes risk. Diabetes Care 2003, 26, 725–731.
  34. Lai, H.; Huang, H.; Keshavjee, K.; Guergachi, A.; Gao, X. Predictive models for diabetes mellitus using machine learning techniques. BMC Endocr. Disord. 2019, 19, 101.
  35. Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting Diabetes Mellitus With Machine Learning Techniques. Front. Genet. 2018, 9, 515.
  36. Zhang, L.; Wang, Y.; Niu, M.; Wang, C.; Wang, Z. Machine learning for characterizing risk of type 2 diabetes mellitus in a rural Chinese population: The Henan Rural Cohort Study. Sci. Rep. 2020, 10, 4406.
  37. American Diabetes Association. Diabetes Diagnostic Criteria. Available online: https://diabetes.org/about-diabetes/diagnosis (accessed on 21 June 2024).
  38. De Silva, K.; Lim, S.; Mousa, A.; Teede, H.; Forbes, A.; Demmer, R.T.; Jönsson, D.; Enticott, J. Nutritional markers of undiagnosed type 2 diabetes in adults: Findings of a machine learning analysis with external validation and benchmarking. PLoS ONE 2021, 16, e0250832.
  39. Phongying, M.; Hiriote, S. Diabetes Classification Using Machine Learning Techniques. Computation 2023, 11, 96.
  40. Qin, Y.; Wu, J.; Xiao, W.; Wang, K.; Huang, A.; Liu, B.; Yu, J.; Li, C.; Yu, F.; Ren, Z. Machine Learning Models for Data-Driven Prediction of Diabetes by Lifestyle Type. Int. J. Environ. Res. Public Health 2022, 19, 5027.
  41. Kazerouni, F.; Bayani, A.; Asadi, F.; Saeidi, L.; Parvizi, N.; Mansoori, Z. Type2 Diabetes Mellitus Prediction Using Data Mining Algorithms Based on the Long-Noncoding RNAs Expression: A Comparison of Four Data Mining Approaches. BMC Bioinform. 2020, 21, 372.
  42. Agliata, A.; Giordano, D.; Bardozzo, F.; Bottiglieri, S.; Facchiano, A.; Tagliaferri, R. Machine Learning as a Support for the Diagnosis of Type 2 Diabetes. Int. J. Mol. Sci. 2023, 24, 6775.
  43. Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981.
  44. Liu, Q.; Zhang, M.; He, Y.; Zhang, L.; Zou, J.; Yan, Y.; Guo, Y. Predicting the Risk of Incident Type 2 Diabetes Mellitus in Chinese Elderly Using Machine Learning Techniques. J. Pers. Med. 2022, 12, 905.
  45. Lama, L.; Wilhelmsson, O.; Norlander, E.; Gustafsson, L.; Lager, A.; Tynelius, P.; Wärvik, L.; Östenson, C.G. Machine learning for prediction of diabetes risk in middle-aged Swedish people. Heliyon 2021, 7, e07419.
  46. Shin, J.; Lee, J.; Ko, T.; Lee, K.; Choi, Y.; Kim, H.S. Improving Machine Learning Diabetes Prediction Models for the Utmost Clinical Effectiveness. J. Pers. Med. 2022, 12, 1899.
  47. Deberneh, H.M.; Kim, I. Prediction of Type 2 Diabetes Based on Machine Learning Algorithm. Int. J. Environ. Res. Public Health 2021, 18, 3317.
  48. Sikalidis, A.K.; Kristo, A.S.; Reaves, S.K.; Kurfess, F.J.; DeLay, A.M.; Vasilaky, K.; Donegan, L. Capacity Strengthening Undertaking—Farm Organized Response of Workers against Risk for Diabetes: (C.S.U.—F.O.R.W.A.R.D. with Cal Poly)—A Concept Approach to Tackling Diabetes in Vulnerable and Underserved Farmworkers in California. Sensors 2022, 22, 8299.
Figure 1. A typical confusion matrix.
Figure 2. A Receiver Operating Characteristic (ROC) curve.
Figure 3. Percentages of reviewed studies by purpose.
Figure 4. (a) Top-performing feature selection techniques by category. (b) Topmost important features.
Figure 5. Typical methodology workflow for model building.
Figure 6. Top-performing models categorized by purpose.
Table 1. (a) Summary of reviewed studies on diabetes classification/identification. (b) Summary of reviewed studies on regression/long-term diabetes prediction.
(a)

| Study and Purpose | Dataset | Complementary Techniques | Important Features | Best Model |
|---|---|---|---|---|
| [34] Lai et al. Diabetes classification | 13,309 records from CPCSSN 1: personal data and recent laboratory results | Misclassification cost matrix, grid search, adjusted threshold, 10-fold cross validation and information gain | FPG, HDL, BMI and triglycerides | GBM: AUC 0.847, misclassification rate 0.189, Sensitivity 0.716 and Specificity 0.837 |
| [35] Zou et al. Diabetes identification | 138,000 records with glucose, physical examination, biometric, demographic and laboratory results | Random sampling, mRMR, PCA and 5-fold CV | FPG, weight and age | Random Forest using all 14 available features: Accuracy 0.8084, Sensitivity 0.8495 and Specificity 0.7673 |
| [27] Dinh et al. Diabetes identification | NHANES, survey data and laboratory results | Standardization, majority downsampling, ensemble model weighting optimization, hyperparameter tuning and 10-fold CV | Waist circumference, age, blood osmolality, sodium, blood urea nitrogen and triglycerides | Case 1: (a) with survey data: XGBoost, AUC 0.862, Precision, Recall and F1-Score all 0.78; (b) with laboratory results: AUC 0.957, Precision, Recall and F1-Score all 0.89. Case 2: (a) with survey data: Ensemble, AUC 0.737, Precision, Recall and F1-Score all 0.68; (b) with laboratory results: XGBoost, AUC 0.802, Precision, Recall and F1-Score all 0.74 |
| [36] Zhang et al. Diabetes identification | 36,652 records from the Henan rural cohort, including sociodemographic, anthropometric, biometric, laboratory result and history of disease data | SMOTE, hyperparameter tuning and 10-fold CV | Urinary parameters, sweet flavor, age, heart rate and creatinine | Experiments with and without laboratory results, both XGBoost: AUC 0.872 and 0.817, Accuracy 0.812 and 0.702, Sensitivity 0.76 and 0.789, and Specificity 0.871 and 0.694 |
| [38] De Silva et al. Diabetes identification | 16,429 records from NHANES with nutritional, behavioral, socioeconomic and non-modifiable demographic features | MICE imputation, minority class oversampling, ROSE and SMOTE, hyperparameter tuning, CV and odds ratio | Folate, self-reported diet health, number of people in household, and total fat and cigarette consumption | Logistic Regression trained on the minority-oversampled dataset: AUC 0.746 |
| [39] Phongying et al. Diabetes identification | 20,227 records from the Department of Medical Services in Bangkok, including demographic, biometric, heart pressure and rate results and family history of diabetes data | MinMax normalization, Gain ratio, interaction variables and hyperparameter tuning | BMI and family history of diabetes | Random Forest trained on the interaction-variable dataset: Accuracy 0.975, Precision 0.974 and Recall 0.966 |
| [41] Kazerouni et al. Diabetes identification | 200 records, Shohadan Hospital, Tehran (100 T2DM) | Standardization, 10-fold cross validation | Long non-coding RNA (lncRNA) expression for predicting T2DM and diabetes detection on an RNA molecular basis | SVM: AUC 0.95, Sensitivity 95% and Specificity 86% |
| [42] Agliata et al. Diabetes identification | Balanced dataset (NHANES, MIMIC-III and MIMIC-IV) | Standardization | Glucose level, triglyceride level, HDL, systolic blood pressure, diastolic blood pressure, gender/sex, age, weight and Body Mass Index (BMI) | Binary classifier (NN): Accuracy approximately 86%, AUC 0.934 |
| [7] Uddin et al. Diabetes identification | 508-record dataset from Bangladesh | SMOTE and random oversampling, feature selection by recursive feature elimination | Age, having diabetes in the family, regular intake of medicine and extreme thirst | Ensemble technique: Accuracy 99.27% and F1-Score 99.27% |
| [40] Qin et al. Diabetes identification | 17,833 records (NHANES), including demographic, dietary, examination and questionnaire features | SMOTE, backward feature selection with AIC as objective, and SHAP | Sleep time, energy and age | CatBoost: AUC 0.83, Accuracy 0.821, Sensitivity 0.82 and Specificity 0.519 |
| [22] Rufo et al. Diabetes identification | 2109 records from ZMHDD hospital, including demographic, anthropometric, blood pressure, cholesterol, pulse rate and FBS data | Median imputation, MinMax normalization, Pearson correlation coefficient, hyperparameter tuning and 10-fold CV | FPG, total cholesterol and BMI | LightGBM: AUC 0.98, Accuracy 0.98, Sensitivity 0.99 and Specificity 0.96 |
| [26] Benita et al. Diabetes identification | 1787 records from Centro Medico Nacional Siglo XXI in Mexico City, including sociodemographic, anthropometric and laboratory data, such as HDL and diastolic pressure under treatment and systolic pressure without treatment | Standardization, LASSO feature selection and hyperparameter tuning with 10-fold CV | Lipid level in treatment and hypertension treatment | SVM: AUC 0.928, Accuracy 0.898, Sensitivity 0.878 and Specificity 0.923 |
| [13] Dritsas et al. Diabetes identification | 520-record dataset from Kaggle, including symptoms such as polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, itching and obesity | SMOTE using 5-NN, Pearson coefficient, Gain ratio, AUC of NB and RF, and 10-fold CV | Polyuria, polydipsia, sudden weight loss and gender | Random Forest and KNN: AUCs 0.99 and 0.98, respectively; Accuracy 0.985, Recall 0.986 and Precision 0.986 |
(b)

| Study and Purpose | Dataset | Complementary Techniques | Important Features | Best Model |
|---|---|---|---|---|
| [43] Kopitar et al. FPG regression | 2109 records from ZMHDD hospital, including demographic, anthropometric, blood pressure, cholesterol, pulse rate and FBS data | Outlier detection, MICE imputation, bootstrap random sampling with replacement, R2 model calibration | Hyperglycemia, age, triglyceride, cholesterol and blood pressure results | LightGBM: RMSE 0.8 mmol/L |
| [44] Liu et al. Long-term diabetes prediction | 127,031 records, patients older than 65 years | LASSO feature selection, SHAP | FPG, education, exercise, gender and waist circumference as the top-five important predictors | XGBoost model with 21 features: AUC 0.78, Accuracy 75%, Sensitivity 64.5% and Specificity 75.7% |
| [45] Lama et al. Long-term diabetes prediction | 7949 records: socioeconomic and psychosocial factors, physical and laboratory results, physical activity, diet information and tobacco use | Median imputation, SHAP, 5-fold CV grid search and risk profiles | BMI, waist–hip ratio, age, systolic and diastolic BP, and diabetes heredity | Random Forest: AUC 0.7795 |
| [46] Shin et al. Long-term diabetes prediction | 38,379 records, including demographic, laboratory, pulmonary test, personal history and family history data | Mean/mode imputation, hyperparameter tuning with stratified 10-fold CV, SHAP and survival analysis | FPG, HbA1c and family history of diabetes | XGBoost: AUC 0.623, Accuracy 0.966, Sensitivity 0.970 and Specificity 0.690 |
| [21] Mao et al. Long-term diabetes prediction | 3687 records, including demographic, smoking, drinking, history of health condition and laboratory data | LR feature analysis, hyperparameter tuning, 10-fold CV and SHAP | Age, impaired fasting glucose and impaired glucose tolerance | Random Forest: AUC 0.835 |
| [28] Fazakis et al. Long-term diabetes prediction | 2331 records from ELSA, including biometric, anthropometric, hematological, lifestyle, sociodemographic and performance index variables | Feature selection techniques (LASSO, correlation, Greedy stepwise), random undersampling, adjusted threshold with the Youden index J as objective, multiobjective optimization | Not applicable | Weighted soft voting ensemble with base classifiers LR and RF: AUC 0.884, Sensitivity 0.856 and Specificity 0.798 |
| [47] Deberneh et al. Multiclass long-term diabetes prediction | 500,000 records containing diagnostic results and questionnaires | Majority undersampling and SMOTE, ANOVA, chi-square and RFE, grid search, 10-fold cross validation | FPG, HbA1c and gamma-GT | CIM: Accuracy, Precision and Recall 0.77 |
1 Canadian Primary Care Sentinel Surveillance Network (www.cpcssn.ca, accessed on 8 November 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
