An Effective Methodology for Diabetes Prediction in the Case of Class Imbalance
Abstract
1. Introduction
2. Materials and Methods
2.1. Methodology 1: Algorithms 1–3—Classical Approach
2.2. Proposed Methodology: Algorithms 4 and 5—Classification with Data Preprocessing for Class Imbalance
3. Results
3.1. Comparison of Classification Metrics
3.2. Confusion Matrices and AUC–ROC Curves
3.3. Comparison to Other Research
4. Discussion
- The built-in options of Python’s scikit-learn estimators can successfully tackle class imbalance; the relevant parameter is class_weight='balanced'. However, additional steps need to be taken to handle class imbalance so that the model becomes more efficient (a minimal illustration is given in the sketch after this list).
- When the data are imbalanced, randomly resampling and shuffling them before the train/test split can help improve the estimation metrics. This result is robust regardless of the type of cross-validation used.
- Applying our proposed algorithm is a simple and fast way to predict labels under class imbalance. It requires neither additional class-balancing techniques nor preselection of important variables, which saves time and keeps the model easy to analyze. This makes it an effective algorithm for both the initial and further modeling of data with heavy class imbalance.
- Our algorithm does not need a feature selection procedure, thereby avoiding the bias that a feature selection method can introduce.
- As shown, two types of cross-validation can be used. The results are similar, suggesting that the type of cross-validation may not be the key factor for class imbalance; rather, the overall strategy for eliminating the influence of the dominant class may be more important (see the cross-validation comparison in the sketch after this list).
- Despite the relatively small size of the PIMA dataset, both k-fold and stratified k-fold cross-validation are appropriate. This finding contrasts with some researchers' suggestion to use k-fold cross-validation for class imbalance only in large datasets. We highlight that the choice of cross-validation may depend not on the size of the dataset but rather on its characteristics.
- This property makes the model flexible enough to adjust to other issues in the data, not only class imbalance. Therefore, other types of cross-validation can also be used.
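The following minimal sketch (not the authors' exact pipeline) illustrates the points above: setting class_weight='balanced', shuffling the data before the train/test split, and comparing plain k-fold with stratified k-fold cross-validation. The file name diabetes.csv and the Outcome label column are assumptions based on the Kaggle version of the PIMA dataset [1].

```python
# Minimal sketch: class weighting, shuffling before the split, and two cross-validation schemes.
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Load the data and shuffle the rows randomly before any splitting (assumed file/column names).
df = pd.read_csv("diabetes.csv")
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Hold out a test set after shuffling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequencies in the training data.
models = {
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "SVM": SVC(class_weight="balanced"),
}

# Compare plain k-fold and stratified k-fold cross-validation on the training set.
cv_schemes = {
    "KFold (k=5)": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold (k=5)": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

for model_name, model in models.items():
    for cv_name, cv in cv_schemes.items():
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
        print(f"{model_name} / {cv_name}: mean accuracy = {scores.mean():.3f}")
```

In this setup, swapping KFold for StratifiedKFold changes only how the folds are drawn, which is why similar scores under both schemes support the observation that the overall imbalance-handling strategy matters more than the cross-validation type.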
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kaggle. Pima Indians Diabetes Database. Available online: https://www.kaggle.com/uciml/pima-indians-diabetes-database (accessed on 30 June 2024).
- Hounguè, P.; Bigirimana, A. Leveraging Pima Dataset to Diabetes Prediction: Case Study of Deep Neural Network. J. Comput. Commun. 2022, 10, 15–28.
- Traymbak, S.; Issar, N. Data Mining Algorithms in Knowledge Management for Predicting Diabetes After Pregnancy by Using R. Indian J. Comput. Sci. Eng. 2021, 12, 1542–1558.
- Gurcan, F.; Soylu, A. Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis. Cancers 2024, 16, 3417.
- John, A.; Isnin, I.F.B.; Madni, S.H.H.; Muchtar, F.B. Enhanced intrusion detection model based on principal component analysis and variable ensemble machine learning algorithm. Intell. Syst. Appl. 2024, 24, 200442.
- Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515.
- Zhou, H.; Xin, Y.; Li, S. A diabetes prediction model based on Boruta feature selection and ensemble learning. BMC Bioinform. 2023, 24, 224.
- Alghamdi, M.; Al-Mallah, M.; Keteyian, S.; Brawner, C.; Ehrman, J.; Sakr, S. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford Exercise Testing (FIT) project. PLoS ONE 2017, 12, e0179805.
- Rezki, M.K.; Mazdadi, M.I.; Indriani, F.; Muliadi, M.; Saragih, T.H.; Athavale, V.A. Application of SMOTE to address class imbalance in diabetes disease classification utilizing C5.0, Random Forest, and SVM. J. Electron. Electromed. Eng. Med. Inform. 2024, 6, 343–354.
- Wu, Y.; Zhang, L.; Bhatti, U.A.; Huang, M. Interpretable Machine Learning for Personalized Medical Recommendations: A LIME-Based Approach. Diagnostics 2023, 13, 2681.
- Kitova, K.; Ivanov, I.; Hooper, V. Stroke Dataset Modeling: Comparative Study of Machine Learning Classification Methods. Algorithms 2024, 17, 571.
- Mhaskar, H.N.; Pereverzyev, S.V.; Van der Walt, M.D. A Deep Learning Approach to Diabetic Blood Glucose Prediction. Front. Appl. Math. Stat. 2017, 3, 14.
- Islam, I.A.; Milon, M.I. Diabetes Prediction: A Deep Learning Approach. Int. J. Inf. Eng. Electron. Bus. 2019, 11, 21–27.
- Zhou, H.; Myrzashova, R.; Zheng, R. Diabetes Prediction Model Based on an Enhanced Deep Neural Network. EURASIP J. Wirel. Commun. Netw. 2020, 2020, 148.
- Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting Healthcare Trajectories from Medical Records: A Deep Learning Approach. J. Biomed. Inform. 2017, 69, 218–229.
- Naz, H.; Ahuja, S. Deep learning approach for diabetes prediction using PIMA Indian dataset. J. Diabetes Metab. Disord. 2020, 19, 391–403.
- Kulkarni, A.; Chong, D.; Batarseh, F.A. 5—Foundations of data imbalance and solutions for a data democracy. In Data Democracy, 1st ed.; Batarseh, F.A., Yang, R., Eds.; Academic Press: Cambridge, MA, USA, 2020; pp. 83–106.
- Gupta, S.C.; Goel, N. Predictive Modeling and Analytics for Diabetes using Hyperparameter tuned Machine Learning Techniques. Procedia Comput. Sci. 2023, 218, 1257–1269.
- Chang, V.; Bailey, J.; Xu, Q.A.; Sun, Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput. Appl. 2023, 35, 16157–16173.
- Pima-Indians-Diabetes. Available online: https://www.openml.org/search?type=data&status=active&id=43582&sort=runs (accessed on 30 December 2024).
- Tigga, N.P.; Garg, S. Prediction of Type 2 Diabetes using Machine Learning Classification Methods. Procedia Comput. Sci. 2020, 167, 706–716.
- Ejiyi, C.J.; Qin, Z.; Amos, J.; Ejiyi, M.B.; Nnani, A.; Ejiyi, T.U.; Agbesi, V.K.; Diokpo, C.; Okpara, C. A robust predictive diagnosis model for diabetes mellitus using Shapley-incorporated machine learning algorithms. Healthc. Anal. 2023, 3, 100166.
- Ivanov, I.; Toleva, B. An Algorithm to Predict Hepatitis Diagnosis. In Proceedings of the 11th International Scientific Conference on Computer Science, COMSCI 2023, Sofia, Bulgaria, 18 September 2023.
- Agung, E.S.; Rifai, A.P.; Wijayanto, T. Image-based facial emotion recognition using convolutional neural network on emognition dataset. Sci. Rep. 2024, 14, 14429.
- Bhagat, M.; Bakariya, B. Implementation of Logistic Regression on Diabetic Dataset using Train-Test-Split, K-Fold and Stratified K-Fold Approach. Natl. Acad. Sci. Lett. 2022, 45, 401–404.
- Kolipaka, V.R.R.; Namburu, A. K-Fold Validation of Multi Models for Crop Yield Prediction with Improved Sparse Data Clustering Process. Int. J. Intell. Syst. Appl. Eng. 2023, 11, 454–463. Available online: https://ijisae.org/index.php/IJISAE/article/view/3300 (accessed on 20 December 2024).
- Prusty, S.; Patnaik, S.; Dash, S.K. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front. Nanotechnol. 2022, 4, N972421.
- Szeghalmy, S.; Fazekas, A.A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 2023, 23, 2333.
- Al Sadi, K.; Balachandran, W. Leveraging a 7-Layer Long Short-Term Memory Model for Early Detection and Prevention of Diabetes in Oman: An Innovative Approach. Bioengineering 2024, 11, 379.
- Gragnaniello, M.; Marrazzo, V.R.; Borghese, A.; Maresca, L.; Breglio, G.; Riccio, M. Edge-AI Enabled Wearable Device for Non-Invasive Type 1 Diabetes Detection Using ECG Signals. Bioengineering 2025, 12, 4.
- Fuss, F.K.; Tan, A.M.; Weizman, Y. Advanced Dynamic Centre of Pressure Diagnostics with Smart Insoles: Comparison of Diabetic and Healthy Persons for Diagnosing Diabetic Peripheral Neuropathy. Bioengineering 2024, 11, 1241.
- Jiang, H.; Wang, H.; Pan, T.; Liu, Y.; Jing, P.; Liu, Y. Mobile Application and Machine Learning-Driven Scheme for Intelligent Diabetes Progression Analysis and Management Using Multiple Risk Factors. Bioengineering 2024, 11, 1053.
- Mohanty, P.K.; Francis, S.A.J.; Barik, R.K.; Roy, D.S.; Saikia, M.J. Leveraging Shapley Additive Explanations for Feature Selection in Ensemble Models for Diabetes Prediction. Bioengineering 2024, 11, 1215.
- Geantă, M.; Bădescu, D.; Chirca, N.; Nechita, O.C.; Radu, C.G.; Rascu, Ș.; Rădăvoi, D.; Sima, C.; Toma, C.; Jinga, V. The Emerging Role of Large Language Models in Improving Prostate Cancer Literacy. Bioengineering 2024, 11, 654.
- Bekbolatova, M.; Mayer, J.; Ong, C.W.; Toma, M. Transformative Potential of AI in Healthcare: Definitions, Applications, and Navigating the Ethical Landscape and Public Perspectives. Healthcare 2024, 12, 125.
- Maccaro, A.; Stokes, K.; Statham, L.; He, L.; Williams, A.; Pecchia, L.; Piaggio, D. Clearing the Fog: A Scoping Literature Review on the Ethical Issues Surrounding Artificial Intelligence-Based Medical Devices. J. Pers. Med. 2024, 14, 443.
Algorithm 1: Random Forest (%) | Algorithm 2: SVM, KFold(n_splits = 4) (%) | Algorithm 3: SVM, KFold(n_splits = 5) (%) |
---|---|---|
Accuracy = 83.85 | Accuracy = 84.90 | Accuracy = 85.06 |
Precision = 90.82 | Precision = 92.56 | Precision = 90.91 |
Sensitivity = 82.50 | Sensitivity = 84.85 | Sensitivity = 86.54 |
Specificity = 86.11 | Specificity = 85.00 | Specificity = 82.00 |
Algorithm 4 (%) | Algorithm 5 (%) |
---|---|
Accuracy = 95.5 | Accuracy = 95.05 |
Precision = 91.35 | Precision = 91.74 |
Sensitivity = 100.00 | Sensitivity = 100.00 |
Specificity = 91.35 | Specificity = 91.00 |
Algorithm | Confusion Matrix | ROC Curve (AUC) |
---|---|---|
Algorithm 1 | [100 20] [10 62] | AUC = 90.81% |
Algorithm 2 | [112 20] [9 51] | AUC = 88.88% |
Algorithm 3 | [90 14] [9 41] | AUC = 91.70% |
Algorithm 4 | [96 0] [9 95] | AUC = 97.62% |
Algorithm 5 | [100 0] [9 91] | AUC = 96.80% |
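The following illustrative sketch (on synthetic data, not the study's pipeline) shows how a 2×2 confusion matrix and an AUC value like those tabulated above can be obtained with scikit-learn. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels (0, 1); whether the matrices above follow this orientation is an assumption.

```python
# Illustrative sketch: confusion matrix and ROC-AUC for a binary classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic imbalanced binary data standing in for the PIMA features (8 features, 768 samples).
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(class_weight="balanced").fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # printed as [[TN, FP], [FN, TP]]

# ROC-AUC needs a continuous score: decision_function for SVC,
# predict_proba(...)[:, 1] for probabilistic classifiers such as random forests.
y_score = clf.decision_function(X_test)
print(f"AUC = {roc_auc_score(y_test, y_score):.4f}")
```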
Metric | Table 6 of [18] (%) | Table 3 of [21] (%) | Table 13 of [19] (%) |
---|---|---|---|
Accuracy | 80.52 | 75.0 | 79.57 |
Precision | 74.47 | 84.0 | 89.40 |
Sensitivity | 72.72 | 78.95 | 81.33 |
Specificity | 90.74 | 66.10 | 75.0 |
ROC-AUC | – | – | 86.24 |
Models | TP | FN | FP | TN | Total | (FN + FP)/Total |
---|---|---|---|---|---|---|
Extra Tree | 141 | 12 | 18 | 129 | 300 | 0.1 |
RF | 142 | 11 | 11 | 136 | 300 | 0.073 |
AdaBoost | 142 | 11 | 5 | 142 | 300 | 0.053 |
GB | 143 | 12 | 4 | 141 | 300 | 0.053 |
Models | TP | FN | FP | TN | Total | (FN + FP)/Total |
---|---|---|---|---|---|---|
Algorithm 4 | 95 | 9 | 0 | 96 | 200 | 0.045 |
Algorithm 5 | 91 | 9 | 0 | 100 | 200 | 0.045 |
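For clarity, the last column in the two tables above is the fraction of misclassified test samples, (FN + FP)/Total, expressed as a proportion rather than a percentage. A quick check against the tabulated counts for Algorithms 4 and 5:

```python
# Error fraction (FN + FP) / Total computed from the counts tabulated above.
for name, fn, fp, total in [("Algorithm 4", 9, 0, 200), ("Algorithm 5", 9, 0, 200)]:
    print(f"{name}: ({fn} + {fp}) / {total} = {(fn + fp) / total:.3f}")  # both give 0.045
```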
Models | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|---|
Extra Tree | 90.0 | 88.68 | 92.20 | 87.76 |
RF | 92.67 | 92.81 | 92.80 | 92.52 |
AdaBoost | 94.67 | 96.60 | 92.80 | 96.60 |
GB | 94.67 | 97.28 | 92.30 | 97.24 |
Models | Measures |
---|---|
DNN [12] | Accuracy = 94.39% |
DNN [13] | Accuracy = 98.04%; Sensitivity = 98.80%; Specificity = 96.64% |
DNN [14] | Accuracy = 99.4% |
DNN + DT [15] | Accuracy = 98.07%; Sensitivity = 95.52%; Specificity = 99.29% |
DNN + 10-fold cross-validation [2] | Accuracy = 89%; Sensitivity = 87%; Specificity = 91% |
DNN [16] | Accuracy = 98.07% |