Supervised Machine Learning-Based Models for Predicting Raised Blood Sugar
Abstract
1. Introduction
2. Literature Review
3. Methodology
3.1. Data Collection and Dataset Description
3.2. Data Cleaning and Preparation
- Handling missing values: Exploration of the STEPS dataset revealed several missing-value issues, which were identified and addressed as follows. Records with a consent value of 2 and all remaining variables empty, indicating refusal to participate, were excluded. Records of pregnant women, which lack physical measurements as per the STEPS survey guidelines, were also dropped to maintain accuracy. Special codes used as placeholders for missing categorical values were replaced with the mode of the respective variables. Missing values in conditional variables, such as daily smoking given current smoking, were filled with zero. Missing biochemical measurements stemmed either from lack of consent to participate in Step 3 or from non-adherence to fasting instructions; observations in both cases were dropped from the dataset.
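The cleaning rules above can be sketched in pandas. This is an illustrative sketch only: the column names (consent, pregnant, marital_status, current_smoker, daily_smoker, fasting_glucose) are hypothetical stand-ins for the actual STEPS survey codes.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample standing in for the STEPS records.
df = pd.DataFrame({
    "consent":         [1, 1, 2, 1, 1],
    "pregnant":        [0, 1, 0, 0, 0],
    "marital_status":  ["married", "single", np.nan, np.nan, "married"],
    "current_smoker":  [1, 0, 0, 1, 0],
    "daily_smoker":    [1, np.nan, np.nan, 0, np.nan],
    "fasting_glucose": [5.4, 6.1, np.nan, np.nan, 4.9],
})

df = df[df["consent"] != 2]                   # drop refusals (consent code 2)
df = df[df["pregnant"] != 1]                  # drop pregnant participants
mode = df["marital_status"].mode()[0]         # mode-impute categorical placeholders
df["marital_status"] = df["marital_status"].fillna(mode)
df["daily_smoker"] = df["daily_smoker"].fillna(0)   # conditional variable -> 0
df = df.dropna(subset=["fasting_glucose"])    # drop missing biochemical measurements
```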
- Handling outliers: The STEPS dataset comprises various types of data, including participants’ lifestyle, medical history, and lab test results, and some of these features contain abnormal values. A visual approach, based on boxplots, was employed to detect outliers. Feedback from domain experts, together with published normal ranges, was then used to handle the outlier values either by dropping or by substitution.
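A minimal sketch of boxplot-style outlier screening: the 1.5 × IQR fences below are exactly what a boxplot draws, while the final keep/drop/substitute decision in the study came from domain experts and published normal ranges. The glucose readings are hypothetical.

```python
import pandas as pd

# Hypothetical fasting glucose readings (mmol/L), one implausible value.
values = pd.Series([5.1, 5.6, 4.9, 5.3, 5.0, 31.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # boxplot whisker fences

outliers = values[(values < lower) | (values > upper)]
cleaned = values.clip(lower, upper)  # substitution; values.drop(outliers.index) would drop instead
```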
3.3. Exploratory Data Analysis
3.4. Data Preprocessing
3.4.1. Determining Target Feature
3.4.2. Feature Selection
- The correlation matrix was used to explore the relationships between the independent variables in the processed dataset and to decide which of them could be eliminated due to high correlation, which can degrade model performance. Multicollinearity is not an issue specific to regression models; it can affect classification models as well [47], compromising both their stability and their interpretability [48]. In this study, multicollinearity was handled by eliminating one of any pair of independent features with a correlation coefficient greater than 0.7 [49]. Figure 13 shows the correlation matrix for a subset of independent features related to the participants’ medical history and the raised blood sugar (RBS) target feature. Three pairs of features show high collinearity: prevalent hypertension and taking high blood pressure medication, prevalent diabetes and taking diabetes medication, and prevalent cholesterol and taking raised cholesterol medication. All three pairs have a correlation coefficient of 0.8, which exceeds the threshold of 0.7. In this step, the features related to taking medication were selected for elimination.
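The |r| > 0.7 elimination rule can be sketched as follows. The feature names are hypothetical stand-ins for the medical-history variables in Figure 13; the synthetic `bp_medication` column agrees with `hypertension` about 90% of the time, giving a correlation near 0.8 like the pairs reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
hypertension = rng.integers(0, 2, 500)
# bp_medication mostly mirrors hypertension -> correlation around 0.8.
bp_medication = np.where(rng.random(500) < 0.9, hypertension, 1 - hypertension)
age_group = rng.integers(0, 5, 500)

X = pd.DataFrame({"hypertension": hypertension,
                  "bp_medication": bp_medication,
                  "age_group": age_group})

corr = X.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
X_reduced = X.drop(columns=to_drop)
```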
- The chi-square test was used to identify the categorical independent variables that are correlated with the target variable of raised blood sugar. Figure 14 shows the top 15 categorical features for predicting the target: history of cholesterol, history of hypertension, history of CVD, raised blood pressure, level of sugar intake, history of osteoporosis, physical inactivity, sleep disturbances, gender, former smoker, level of salt intake, current smoker, anxiety and depression (PHQ4), insufficient fruit and vegetable intake, and history of asthma. The history of diabetes was not forwarded to this test, since it is correlated with the outcome variable and could distort the apparent predictive power of the model. This step helped identify the important predictor variables for the raised blood sugar and raised blood pressure prediction models. However, no variable was eliminated in this step; the variables were forwarded to a further phase of exploring feature importance using the random forest classifier.
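The chi-square ranking step can be sketched with scikit-learn's `SelectKBest`. The data here are synthetic binary features rather than the actual STEPS variables; the target is constructed to depend on features 0 and 3 so that the test has something to find.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 20))              # 20 categorical (0/1) features
y = (X[:, 0] | X[:, 3]) & rng.integers(0, 2, 300)   # target tied to features 0 and 3

selector = SelectKBest(chi2, k=15)                  # keep the 15 highest-scoring features
selector.fit(X, y)
top15 = np.flatnonzero(selector.get_support())      # indices of the selected features
```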
- A random forest classifier was utilized as the final step in the feature selection process. All types of variables, categorical or continuous, were forwarded as input, and their contribution to predicting the outcome variables in the proposed models was assessed [50]. Random forest is considered one of the most powerful methods for determining the significant features that contribute to predicting the outcome feature in machine learning models [51]. Figure 15 illustrates the feature importance results obtained by using the random forest classifier to select the top 30 features for predicting the raised blood sugar outcome variable.
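Ranking features by random-forest importance can be sketched as below. The data are synthetic, only features 0 and 1 actually determine the label, and the "top 30" of the study is shown here as a top 3 of 10 to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.random((400, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # only features 0 and 1 are informative

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
top3 = ranking[:3]
```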
3.4.3. Feature Scaling
3.4.4. Dataset Oversampling
3.5. Machine Learning Models
- Decision tree is a supervised machine learning algorithm that can be used to build classification and regression models. It is one of the simplest and most straightforward machine learning algorithms: features are arranged in a tree structure and split recursively according to a chosen impurity criterion, such as entropy, the Gini index, or information gain [55].
- The AdaBoost algorithm, which stands for adaptive boosting, is a supervised machine learning algorithm for building classification models. It combines multiple weak classifiers, commonly one-level decision trees (stumps), to obtain improved performance. AdaBoost is less prone to overfitting than many other learning algorithms; however, it is not a suitable choice for datasets containing outliers and noisy data [56].
- Random forest is one of the most robust supervised machine learning algorithms commonly used for classification tasks. It constructs a forest of decision trees, each trained on a randomly selected subset of the variables, and aggregates the predictions of all the trees to obtain the final output. Through this ensemble technique, the random forest improves performance by mitigating the high variance that is a well-known weakness of single decision trees [57].
- The XGBoost algorithm, short for extreme gradient boosting, is a supervised tree-based machine learning algorithm that improves on the earlier gradient boosting algorithm. It can be applied to regression and classification problems, and is particularly suited to large datasets due to its high accuracy and fast execution time. XGBoost works by passing the outcome of each fitted tree to the next tree sequentially [58].
- Bagging decision trees, where bagging stands for bootstrap aggregating, is a machine learning algorithm that improves the accuracy and stability of decision trees through an ensemble approach. Multiple decision tree classifiers are trained independently, each on a random sample of the training data drawn with replacement (bootstrap sampling). Each decision tree predicts the target variable from its own subset of the data, and the predictions are aggregated to determine the final outcome: by averaging for regression tasks and by voting for classification tasks. Bagging reduces overfitting and variance, which improves the model’s overall performance and robustness [59].
- The multi-layer perceptron (MLP) classifier is a feedforward ANN algorithm used for classification problems. It consists of multiple layers of interconnected nodes with associated weights and uses the iterative backpropagation optimization algorithm to minimize error and optimize classification results. A well-known advantage of the MLP classifier is its ability to model nonlinear relationships in the processed data [60].
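The model families above can be fitted side by side with scikit-learn. This is a sketch on synthetic data, not the study's actual pipeline; XGBoost is omitted here so the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "adaboost":      AdaBoostClassifier(random_state=0),   # boosted weak learners
    "random forest": RandomForestClassifier(random_state=0),
    "bagging trees": BaggingClassifier(random_state=0),    # bootstrap-sampled trees
    "mlp":           MLPClassifier(max_iter=1000, random_state=0),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```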
3.6. Performance Evaluation Criteria
- Confusion matrix: commonly used to summarize the prediction results of machine learning classification models by comparing actual values against predicted values; the other performance metrics can also be calculated from it. The confusion matrix consists of the following four measures:
- TP: the number of records from the positive class predicted correctly by the model.
- FP: the number of records from the negative class predicted incorrectly as a positive class by the model.
- TN: the number of records from the negative class predicted correctly by the model.
- FN: the number of records from the positive class predicted incorrectly as a negative class by the model.
- Accuracy: a performance measure used to evaluate the efficiency of machine learning classification models. It is computed as the ratio of correct predictions to the total number of predictions.
- Precision is calculated as the fraction of correct positive predictions out of all positive predictions. In this model, it represents the proportion of cases correctly predicted as raised blood sugar out of all observations predicted as raised blood sugar.
- Recall metric is the fraction of correctly predicted observations out of all actual observations of raised blood sugar cases.
- F1-Score is calculated as the harmonic mean of precision and recall metrics.
- The Receiver Operating Characteristic (ROC) curve is a visual representation of how well a machine learning model differentiates between classes, obtained by plotting the true positive rate (TPR) against the false positive rate (FPR). Its summary statistic is the area under the curve (AUC): the greater the AUC value, the better the performance of the classification model [62].
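The metrics above can be worked through on a small set of hypothetical predictions (1 = raised blood sugar, 0 = normal):

```python
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Count the four confusion-matrix cells.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 5
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)          # 8/10 = 0.80
precision = tp / (tp + fp)                           # 3/4  = 0.75
recall    = tp / (tp + fn)                           # 3/4  = 0.75
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 0.75
```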
4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- The World Health Organization. Diabetes. Available online: https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 4 March 2024).
- Clark, N.G.; Fox, K.M.; Grandy, S. Symptoms of diabetes and their association with the risk and presence of diabetes: Findings from the study to help improve early evaluation and management of risk factors leading to diabetes (SHIELD). Diabetes Care 2007, 30, 2868–2873.
- Forouhi, N.G.; Wareham, N.J. Epidemiology of diabetes. Medicine 2010, 38, 602–606.
- Zheng, Y.; Ley, S.H.; Hu, F.B. Global aetiology and epidemiology of type 2 diabetes mellitus and its complications. Nat. Rev. Endocrinol. 2017, 14, 88–98.
- Soomro, M.H.; Jabbar, A. Diabetes etiopathology, classification, diagnosis, and epidemiology. In BIDE’s Diabetes Desk Book; Elsevier: Amsterdam, The Netherlands, 2024; pp. 19–42.
- IDF Diabetes Atlas 2021. Available online: https://diabetesatlas.org/atlas/tenth-edition/ (accessed on 19 February 2024).
- Bloomgarden, Z.; Handelsman, Y. Diabetes Epidemiology and Its Implications. In Lipoproteins in Diabetes Mellitus; Springer International Publishing: Cham, Switzerland, 2023; pp. 881–890.
- American Diabetes Association Professional Practice Committee. 12. Retinopathy, Neuropathy, and Foot Care: Standards of Care in Diabetes—2024. Diabetes Care 2024, 47, S231–S243.
- Alqadi, S.F. Diabetes Mellitus and Its Influence on Oral Health: Review. Diabetes Metab. Syndr. Obes. 2024, 17, 107–120.
- Williams, R.; Airey, M. Epidemiology and Public Health Consequences of Diabetes. Curr. Med. Res. Opin. 2002, 18 (Suppl. 1), s1–s12.
- The World Health Organization. The Top 10 Causes of Death. Available online: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (accessed on 30 January 2024).
- Laine, C.; Caro, J.F. Preventing complications in diabetes mellitus: The role of the primary care physician. Med. Clin. N. Am. 1996, 80, 457–474.
- Tiwary, N.; Sharma, N.; Singh, S.; Behl, T.; Zahoor, I. Understanding the Pharmacological and Nanotechnological Facets of Dipeptidyl Peptidase-4 Inhibitors in Type II Diabetes Mellitus: A Paradigm in Therapeutics. Bionanoscience 2023, 14, 211–229.
- American Diabetes Association. 2. Classification and Diagnosis of Diabetes: Standards of Medical Care in Diabetes—2020. Diabetes Care 2020, 43, S14–S31.
- Peng, W.K.; Chen, L.; Boehm, B.O.; Han, J.; Loh, T.P. Molecular phenotyping of oxidative stress in diabetes mellitus with point-of-care NMR system. NPJ Aging Mech. Dis. 2020, 6, 11.
- The World Health Organization. Mean Fasting Blood Glucose. Available online: https://www.who.int/data/gho/indicator-metadata-registry/imr-details/2380 (accessed on 20 February 2024).
- Owess, M.M.; Owda, A.Y.; Owda, M. Decision Support System in Healthcare for Predicting Blood Pressure Disorders. In Proceedings of the 2023 International Conference on Information Technology: Cybersecurity Challenges for Sustainable Cities (ICIT 2023), Amman, Jordan, 9–10 August 2023; pp. 62–67.
- Saleem, T.J.; Chishti, M.A. Exploring the Applications of Machine Learning in Healthcare. Int. J. Sens. Wirel. Commun. Control 2019, 10, 458–472.
- Singh, P.; Singh, N.; Singh, K.K.; Singh, A. Diagnosing of disease using machine learning. In Machine Learning and the Internet of Medical Things in Healthcare; Academic Press: Cambridge, MA, USA, 2021; pp. 89–111.
- Jaiswal, V.; Negi, A.; Pal, T. A review on current advances in machine learning based diabetes prediction. Prim. Care Diabetes 2021, 15, 435–443.
- Zhu, T.; Li, K.; Herrero, P.; Georgiou, P. Deep Learning for Diabetes: A Systematic Review. IEEE J. Biomed. Health Inform. 2021, 25, 2744–2757.
- Varma, K.M.; Panda, B.S. Comparative analysis of Predicting Diabetes Using Machine Learning Techniques. J. Emerg. Technol. Innov. Res. 2019, 6, 522–530.
- Ergün, Ö.N.; İlhan, H.O. Early Stage Diabetes Prediction Using Machine Learning Methods. Avrupa Bilim Teknol. Derg. 2021, 29, 52–57.
- Islam, M.T.; Al-Absi, H.R.H.; Ruagh, E.A.; Alam, T. DiaNet: A Deep Learning Based Architecture to Diagnose Diabetes Using Retinal Images only. IEEE Access 2021, 9, 15686–15695.
- Mahboob Alam, T.; Iqbal, M.A.; Ali, Y.; Wahab, A.; Ijaz, S.; Baig, T.I.; Hussain, A.; Malik, M.A.; Raza, M.M.; Ibrar, S.; et al. A model for early prediction of diabetes. Inform. Med. Unlocked 2019, 16, 100204.
- UCI Machine Learning and Kaggle. Pima Indians Diabetes Database. Available online: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data (accessed on 4 March 2024).
- Khanam, J.J.; Foo, S.Y. A comparison of machine learning algorithms for diabetes prediction. ICT Express 2021, 7, 432–439.
- Kandhasamy, J.P.; Balamurali, S. Performance Analysis of Classifier Models to Predict Diabetes Mellitus. Procedia Comput. Sci. 2015, 47, 45–51.
- Aitbayev, A. Diabetes UCI Dataset. Available online: https://www.kaggle.com/datasets/alakaaay/diabetes-uci-dataset (accessed on 4 March 2024).
- Yahyaoui, A.; Jamil, A.; Rasheed, J.; Yesiltepe, M. A Decision Support System for Diabetes Prediction Using Machine Learning and Deep Learning Techniques. In Proceedings of the 1st International Informatics and Software Engineering Conference (IISEC 2019), Ankara, Turkey, 6–7 November 2019.
- Naz, H.; Ahuja, S. Deep learning approach for diabetes prediction using PIMA Indian dataset. J. Diabetes Metab. Disord. 2020, 19, 391–403.
- Wu, H.; Yang, S.; Huang, Z.; He, J.; Wang, X. Type 2 diabetes mellitus prediction model based on data mining. Inform. Med. Unlocked 2018, 10, 100–107.
- Meng, X.H.; Huang, Y.X.; Rao, D.P.; Zhang, Q.; Liu, Q. Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. Kaohsiung J. Med. Sci. 2013, 29, 93–99.
- Dinh, A.; Miertschin, S.; Young, A.; Mohanty, S.D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 2019, 19, 211.
- Centers for Disease Control and Prevention. NHANES Questionnaires, Datasets, and Related Documentation. Available online: https://wwwn.cdc.gov/nchs/nhanes/Default.aspx (accessed on 4 March 2024).
- Vangeepuram, N.; Liu, B.; Chiu, P.H.; Wang, L.; Pandey, G. Predicting youth diabetes risk using NHANES data and machine learning. Sci. Rep. 2021, 11, 11212.
- Maeta, K.; Nishiyama, Y.; Fujibayashi, K.; Gunji, T.; Sasabe, N.; Iijima, K.; Naito, T. Prediction of Glucose Metabolism Disorder Risk Using a Machine Learning Algorithm: Pilot Study. JMIR Diabetes 2018, 3, e10212.
- Noncommunicable Disease Surveillance, Monitoring and Reporting. Available online: https://www.who.int/teams/noncommunicable-diseases/surveillance/systems-tools/steps (accessed on 20 February 2024).
- Owda, M.; Owda, A.Y.; Fasli, M. An Exploratory Data Analysis and Visualizations of Underprivileged Communities Diabetes Dataset for Public Good. In Proceedings of the 2023 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2023), Venice, Italy, 26–29 October 2023; pp. 581–585.
- Ferrannini, E.; Cushman, W.C. Diabetes and hypertension: The bad companions. Lancet 2012, 380, 601–610.
- De Boer, I.H.; Bangalore, S.; Benetos, A.; Davis, A.M.; Michos, E.D.; Muntner, P.; Rossing, P.; Zoungas, S.; Bakris, G. Diabetes and hypertension: A position statement by the American Diabetes Association. Diabetes Care 2017, 40, 1273–1284.
- Nguyen, N.T.; Magno, C.P.; Lane, K.T.; Hinojosa, M.W.; Lane, J.S. Association of Hypertension, Diabetes, Dyslipidemia, and Metabolic Syndrome with Obesity: Findings from the National Health and Nutrition Examination Survey, 1999 to 2004. J. Am. Coll. Surg. 2008, 207, 928–934.
- Jafar, T.H.; Chaturvedi, N.; Pappas, G. Prevalence of overweight and obesity and their association with hypertension and diabetes mellitus in an Indo-Asian population. CMAJ 2006, 175, 1071–1077.
- Abdullah, A.; Peeters, A.; de Courten, M.; Stoelwinder, J. The magnitude of association between overweight and obesity and the risk of diabetes: A meta-analysis of prospective cohort studies. Diabetes Res. Clin. Pract. 2010, 89, 309–319.
- Amarnath, B.; Balamurugan, S.; Alias, A. Review on feature selection techniques and its impact for effective data classification using UCI machine learning repository dataset. J. Eng. Sci. Technol. 2016, 11, 1639–1646.
- Chen, R.C.; Dewi, C.; Huang, S.W.; Caraka, R.E. Selecting critical features for data classification based on machine learning methods. J. Big Data 2020, 7, 52.
- Misra, P.; Yadav, A.S. Improving the classification accuracy using recursive feature elimination with cross-validation. Int. J. Emerg. Technol. 2020, 11, 659–665.
- Drobnič, F.; Kos, A.; Pustišek, M. On the interpretability of machine learning models and experimental feature selection in case of multicollinear data. Electronics 2020, 9, 761.
- Dormann, C.F.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; Marquéz, J.R.G.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 2013, 36, 27–46.
- Reif, D.M.; Motsinger, A.A.; McKinney, B.A.; Crowe, J.E.; Moore, J.H. Feature selection using a random forests classifier for the integrated analysis of multiple data types. In Proceedings of the 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB ’06), Toronto, ON, Canada, 28–29 September 2006; pp. 171–178.
- Khan, N.M.; Madhav, C.N.; Negi, A.; Thaseen, I.S. Analysis on Improving the Performance of Machine Learning Models Using Feature Selection Technique. In Advances in Intelligent Systems and Computing; Springer International Publishing: Berlin/Heidelberg, Germany, 2020.
- Raju, V.N.G.; Lakshmi, K.P.; Jain, V.M.; Kalidindi, A.; Padma, V. Study the Influence of Normalization/Transformation process on the Accuracy of Supervised Classification. In Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT 2020), Tirunelveli, India, 20–22 August 2020; pp. 729–735.
- Cecchini, V.; Nguyen, T.P.; Pfau, T.; De Landtsheer, S.; Sauter, T. An efficient machine learning method to solve imbalanced data in metabolic disease prediction. In Proceedings of the 2019 11th International Conference on Knowledge and Systems Engineering (KSE 2019), Da Nang, Vietnam, 24–26 October 2019.
- Gosain, A.; Sardana, S. Handling class imbalance problem using oversampling techniques: A review. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2017), Udupi, India, 13–16 September 2017; pp. 79–85.
- Sharma, H.; Kumar, S. A Survey on Decision Tree Algorithms of Classification in Data Mining. Int. J. Sci. Res. 2016, 5, 2094–2097.
- Cao, Y.; Miao, Q.-G.; Liu, J.-C.; Gao, L. Advance and Prospects of AdaBoost Algorithm. Acta Autom. Sin. 2013, 39, 745–758.
- Ziegler, A.; König, I.R. Mining data with random forests: Current options for real-world applications. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2014, 4, 55–63.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
- Abellán, J.; Masegosa, A.R. Bagging decision trees on data sets with classification noise. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010.
- Fiesler, E.; Beale, R. Multilayer perceptrons. In Handbook of Neural Computation; CRC Press: Boca Raton, FL, USA, 2020; pp. C1.2:1–C1.2:30.
- Novaković, J.D.; Veljović, A.; Ilić, S.S.; Papić, Ž.; Tomović, M. Evaluation of Classification Models in Machine Learning. Theory Appl. Math. Comput. Sci. 2017, 7, 39–46.
- Søreide, K. Receiver-operating characteristic curve analysis in diagnostic, prognostic and predictive biomarker research. J. Clin. Pathol. 2009, 62, 1–5.
Ref. | Dataset | Features | Target Feature | Algorithms | Best Model | Outcome % |
---|---|---|---|---|---|---|
[25] | PIDD | age, number of pregnancies, glucose, diabetes pedigree function, blood pressure, skin thickness, insulin, BMI | Diabetic class (yes/no) | ANN, RF, K-means | ANN | Accuracy 75.7 |
[27] | PIDD | number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age | Diabetic class (yes/no) | SVM, DT, KNN, RF, AdaBoost, NB, LR, ANN | ANN | Accuracy 88.6
[28] | UCI Diabetes Dataset | age, gender, polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, visual blurring, itching, irritability, delayed healing, partial paresis, muscle stiffness, alopecia, obesity | Diabetes patient (yes/no) | RF, SVM, KNN, DT J48 | DT J48 (with noisy data); RF (without noisy data) | Accuracy 73.82; Accuracy 100.0
[30] | PIDD | number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age | Diabetic class (yes/no) | RF, SVM, CNN | RF | Accuracy 83.6 |
[31] | PIDD | number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age | Diabetic class (yes/no) | NB, DT, ANN, DL | DL | Accuracy 98.07 |
[32] | PIDD | number of pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age | Diabetic class (yes/no) | K-means and LR | Hybrid model (K-means and LR) | Accuracy 93.9 |
[33] | Privately collected dataset | age, gender, BMI, family history of diabetes, marital status, education level, stress, sleep, physical activity, diet, in-salt taking, and drinking coffee | Reported diabetes diagnosis (yes/no) | ANN, DT, LR | DT | Accuracy 77.87; Sensitivity 80.68; Specificity 75.13
[33] | NHANES | age, waist size, leg length, sodium, fiber, caffeine intake, ethnicity and income | Reported diabetes diagnosis (yes/no) | LR, SVM, RF, XGBoost, Ensemble | XGBoost | ROC AUC 86.2 Precision, Recall, F1-score 78.0 |
[33] | NHANES | age, waist size, leg length, sodium, fiber, caffeine intake, ethnicity and income, HDL, LDL, cholesterol, urine | Reported diabetes diagnosis (yes/no) | LR, SVM, RF, XGBoost, Ensemble | XGBoost | ROC AUC 95.7 Precision, Recall, F1-score 89.0 |
[33] | NHANES | age, waist size, leg length, sodium, fiber, caffeine intake, ethnicity, and income | FBS ≥ 126 (yes/no) | LR, SVM, RF, XGBoost, Ensemble | Ensemble | ROC AUC 73.7 Precision, Recall, F1-score 68.0 |
[34] | NHANES | age, waist size, leg length, sodium, fiber, caffeine intake, ethnicity and income, HDL, LDL, cholesterol, urine | FBS ≥ 126 (yes/no) | LR, SVM, RF, XGBoost, Ensemble | XGBoost | ROC AUC 80.2 Precision, Recall, F1-score 68.0 |
[36] | NHANES | BMI, family history of diabetes, race, hypertension, cholesterol | FBS ≥ 100, or 2hrPG ≥ 140, or HbA1C ≥ 5.7% (yes/no) | RF, AdaBoost, LR, J48, NB, PART, SMO, IBk, LogitBoost | NB | Accuracy 74.5 |
[37] | Privately collected dataset | age, sex, BMI, blood pressure, triglyceride, HDL, LDL, creatinine, total cholesterol, FBS, HbA1C, IRI, PG | FBS ≥ 100, or 2hrPG ≥ 140, or HbA1C ≥ 5.7% (yes/no) | LR, XGBoost | XGBoost | ROC AUC 78.0 |
Algorithm | RF | Bagging DT | MLP | XGBoost | AdaBoost | DT |
---|---|---|---|---|---|---|
Accuracy | 98.4% | 97.4% | 96.3% | 96.4% | 95.2% | 94.8% |
F1-Score | 98.4% | 97.5% | 96.5% | 96.5% | 95.4% | 95.2% |
Precision | 97.1% | 95.3% | 93.2% | 93.5% | 91.5% | 90.9% |
Recall | 99.8% | 99.5% | 99.9% | 99.8% | 99.8% | 99.8% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Owess, M.M.; Owda, A.Y.; Owda, M.; Massad, S. Supervised Machine Learning-Based Models for Predicting Raised Blood Sugar. Int. J. Environ. Res. Public Health 2024, 21, 840. https://doi.org/10.3390/ijerph21070840