Prediction Model of Type 2 Diabetes Mellitus for Oman Prediabetes Patients Using Artificial Neural Network and Six Machine Learning Classifiers
Abstract
1. Introduction
2. Literature Review
3. Materials and Methods
3.1. Dataset
3.2. Variable Selection
- Pregnancies: the total number of pregnancies;
- Glucose: plasma glucose concentration at two hours in an oral glucose tolerance test;
- Blood Pressure: diastolic blood pressure (mm Hg);
- Skin thickness: triceps skin fold thickness (mm);
- Insulin: 2-h serum insulin (µU/mL);
- BMI: body mass index (kg/m²);
- Diabetes pedigree function;
- Age: age (years);
- Outcome: class variable (0 or 1).
3.3. Data Processing and Cleaning
3.3.1. Finding Missing Values in the Dataset
3.3.2. Data Pre-Processing
The first step was to merge data belonging to similar categories.
3.3.3. Exploratory Data Analysis
3.3.4. Filling Outliers in the Data
3.3.5. Data Scaling
Data scaling was applied for all machine learning algorithms and the ANN using a z-score, which centres the data to a mean of 0 and a standard deviation of 1.
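The z-score scaling named in this subsection can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `zscore` and the sample readings are hypothetical, and the sample standard deviation (n − 1 denominator) is an assumption about which variant was used.

```python
from statistics import mean, stdev


def zscore(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical fasting plasma glucose readings (mmol/L), for illustration only.
fpg = [5.6, 5.9, 6.1, 6.4, 6.8]
scaled = zscore(fpg)
print([round(v, 3) for v in scaled])
```

In a train/test setting, a common choice is to compute the mean and standard deviation on the training rows only and reuse them for the test rows, so that no test information leaks into the scaling.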
3.4. Training and Validation Datasets
4. Results and Discussion
- Accuracy is the percentage of correct predictions made by the model. It is defined in terms of true and false positives (TP, FP) and true and false negatives (TN, FN) [41]: Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Sensitivity evaluates a model’s ability to identify true positives; it is the proportion of actual positive diabetes cases predicted correctly [29].
- Specificity evaluates a model’s ability to identify true negatives; it is the proportion of actual negative cases predicted correctly [27].
- Precision is the proportion of true positives among all predicted positives; it is a useful metric when false positives are more costly than false negatives [27].
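The four metrics listed above can all be computed from the cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics from confusion-matrix counts.

    tp/fn: actual positives predicted positive/negative.
    tn/fp: actual negatives predicted negative/positive.
    Assumes each denominator is non-zero.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,     # correct predictions / all predictions
        "sensitivity": tp / (tp + fn),     # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
        "precision": tp / (tp + fp),       # positive predictive value
    }


# Hypothetical counts for illustration only (not from the paper).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can look high even if one class is rarely predicted correctly.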
4.1. Accuracy Analysis Using Confusion Matrix
4.1.1. K-Nearest Neighbours (K-NN)
K-nearest neighbours (K-NN) is an example of this type of supervised ML algorithm.
4.1.2. Support Vector Machine (SVM)
The support vector machine (SVM) works on the concept of margin calculation.
4.1.3. Naive Bayes
Naive Bayes mainly targets the text classification industry.
4.1.4. Decision Tree (DT)
The decision tree (DT) is a supervised ML method for solving classification, prediction, and feature selection problems.
4.1.5. Random Forest (RF)
Random forest (RF) is a supervised machine learning algorithm used widely in classification and regression problems.
4.1.6. Linear Discriminant Analysis
Linear discriminant analysis is a statistical technique that can classify individuals into mutually exclusive and exhaustive groups based on independent variables [44].
4.1.7. Artificial Neural Network (ANN)
The conventional artificial neural network (ANN) consists of layers and weights:
- Input layer: receives the network’s raw input data.
- Hidden layer: the behaviour of a hidden layer is determined by the inputs and by the weights of the connections between the inputs and the hidden-layer neurons. These connection weights decide whether a hidden-layer neuron is active or inactive.
- Output layer: the behaviour of this layer is determined by the outputs of the hidden-layer neurons and by the weights of the connections between those neurons and the output-layer neurons.
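The three layers described above can be illustrated with a minimal forward pass. This is a sketch under stated assumptions: the sigmoid activation, the layer sizes, and all weights are hypothetical, since the text here does not specify the network's architecture or activation functions.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into (0, 1); assumed activation, not from the paper."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense(features, hidden_w, hidden_b)  # hidden activations
    return dense(hidden, out_w, out_b)            # output activations


# Hypothetical scaled inputs and weights, for illustration only.
features = [0.5, -1.2, 0.3]
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]  # 2 hidden neurons x 3 inputs
hidden_b = [0.0, 0.1]
out_w = [[1.5, -2.0]]                            # 1 output neuron x 2 hidden
out_b = [0.05]

prob = forward(features, hidden_w, hidden_b, out_w, out_b)[0]
print(f"predicted probability of the positive class: {prob:.3f}")
```

With a sigmoid output, the single output value can be read as a probability and thresholded (for example at 0.5) to obtain the 0/1 outcome; training the weights is outside the scope of this sketch.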
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- World Health Organization. Noncommunicable Diseases (NCD) Country Profiles. 2018. Available online: https://www.who.int/nmh/countries/omn_en.pdf (accessed on 15 November 2021).
- Peters, S.A.; Huxley, R.R.; Woodward, M. Diabetes as a risk factor for stroke in women compared with men: A systematic review and meta-analysis of 64 cohorts, including 775,385 individuals and 12,539 strokes. Lancet 2014, 383, 1973–1980.
- Vos, T.; Lim, S.S.; Abbiati, C.; Abbas, K.M.; Abbasi, M.; Abbasifard, M.; Abbasi-Kangevari, M.; Abbastabar, H.; Abd-Allah, F.; Abdelalim, A.; et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: A systematic analysis for the Global Burden of Disease Study 2019. Lancet 2020, 396, 1204–1222.
- Aljulifi, M.Z. Prevalence and reasons of increased type 2 diabetes in Gulf Cooperation Council Countries. Saudi Med. J. 2021, 42, 481–490.
- Sarwar, A.; Sharma, V. Comparative analysis of machine learning techniques in prognosis of type II diabetes. AI Soc. 2014, 29, 123–129.
- Kumari, V.A.; Chitra, R. Classification of diabetes disease using support vector machine. Int. J. Adv. Comput. Sci. Appl. 2013, 3, 1797–1801.
- Negi, A.; Jaiswal, V. A First Attempt to Develop a Diabetes Prediction Method Based on Different Global Datasets. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–241.
- Maniruzzaman, M.; Kumar, N.; Menhazul Abedin, M.; Shaykhul Islam, M.; Suri, H.S.; El-Baz, A.S.; Suri, J.S. Comparative approaches for classification of diabetes mellitus data: Machine learning paradigm. Comput. Methods Programs Biomed. 2017, 152, 23–34.
- Olaniyi, E.O.; Adnan, K. Onset diabetes diagnosis using artificial neural network. Int. J. Sci. Eng. Res. 2014, 5, 754–759.
- Wei, S.; Zhao, X.; Miao, C. A comprehensive exploration to the machine learning techniques for diabetes identification. In Proceedings of the 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore, 5–8 February 2018.
- Anwar, F.; Qurat-Ul-Ain, F.A.; Ejaz, M.Y.; Mosavi, A. A comparative analysis on diagnosis of diabetes mellitus using different approaches—A survey. Inform. Med. Unlocked 2020, 21, 100482.
- Swapna, G.; Vinayakumar, R.; Soman, K.P. Diabetes detection using deep learning algorithms. ICT Express 2018, 4, 243–246.
- Chaves, L.; Marques, G. Data Mining Techniques for Early Diagnosis of Diabetes: A Comparative Study. Appl. Sci. 2021, 11, 2218.
- Grądalski, T.; Hołoń, A. Diabetes mellitus in the last weeks of life—Case study and current literature review. Med. Paliatywna 2019, 11, 67–72.
- Mirshahvalad, R.; Zanjani, N.A. Diabetes prediction using the ensemble perceptron algorithm. In Proceedings of the 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Cyprus, 16–17 September 2017; pp. 190–194.
- Kumar. Pima-Indians-Diabetes.csv. Kaggle. 2018. Available online: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv (accessed on 18 June 2021).
- Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance analysis of data mining classification techniques to predict diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
- Khan, N.S.; Muaz, M.H.; Kabir, A.; Islam, M.N. A machine learning-based intelligent system for predicting diabetes. Int. J. Big Data Anal. Healthc. 2019, 4, 20.
- Nai-Arun, N.; Moungmai, R. Comparison of classifiers for the risk of diabetes prediction. Procedia Comput. Sci. 2015, 69, 132–142.
- Kocher, T.; Holtfreter, B.; Petersmann, A.; Eickholz, P.; Hoffmann, T.; Kaner, D.; Kim, T.; Meyle, J.; Schlagenhauf, U.; Doering, S.; et al. Effect of periodontal treatment on HbA1c among patients with prediabetes. J. Dent. Res. 2019, 98, 171–179.
- Meng, X.H.; Huang, Y.X.; Rao, D.P.; Zhang, Q.; Liu, Q. Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. Kaohsiung J. Med. Sci. 2013, 29, 93–99.
- Sheikhi, G.; Altınçay, H. The cost of type II diabetes mellitus: A machine learning perspective. In IFMBE Proceedings, Proceedings of the XIV Mediterranean Conference on Medical and Biological Engineering and Computing, Paphos, Cyprus, 31 March–2 April 2016; Kyriacou, E., Christofides, S., Pattichis, C.S., Eds.; Springer: Cham, Switzerland, 2016; Volume 57, pp. 818–821.
- Iyer, A.; Jeyalatha, S.; Sumbaly, R. Diagnosis of diabetes using classification mining techniques. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1–14.
- Barik, S.; Mohanty, S.; Mohanty, S.; Singh, D. Analysis of Prediction Accuracy of Diabetes Using Classifier and Hybrid Machine Learning Techniques. In Intelligent and Cloud Computing. Smart Innovation, Systems and Technologies; Mishra, D., Buyya, R., Mohapatra, P., Patnaik, S., Eds.; Springer: Singapore, 2021; Volume 153, pp. 399–409.
- Ephzibah, E.P. A hybrid genetic-fuzzy expert system for effective heart disease diagnosis. In Communications in Computer and Information Science, Proceedings of the Advances in Computing and Information Technology, First International Conference, ACITY 2011, Chennai, India, 15–17 July 2011; Wyld, D.C., Wozniak, M., Chaki, N., Meghanathan, N., Nagamalai, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 198, pp. 115–121.
- Zheng, T.; Xie, W.; Xu, L.L.; He, X.Y.; Zhang, Y.; You, M.R.; Yang, G.; Chen, Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int. J. Med. Inform. 2017, 97, 120–127.
- Zou, Q.; Qu, K.Y.; Luo, Y.M.; Yin, D.H.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515.
- Malik, S.; Khadgawat, R.; Anand, S.; Gupta, S. Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva. SpringerPlus 2016, 5, 701.
- Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes Using Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–1636.
- Yuvaraj, N.; Sri Preethaa, K.R. Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster. Clust. Comput. 2019, 22, 1–9.
- Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585.
- Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.
- Maniruzzaman, M.; Rahman, M.J.; Ahammed, B.; Abedin, M.M. Classification and prediction of diabetes disease using machine learning paradigm. Health Inf. Sci. Syst. 2020, 8, 7.
- Saberi-Movahed, F.; Rostami, M.; Berahmand, K.; Karami, S.; Tiwari, P.; Oussalah, M.; Band, S.S. Dual Regularized Unsupervised Feature Selection Based on Matrix Factorization and Minimum Redundancy with application in gene selection. Knowl.-Based Syst. 2022, 256, 109884.
- Ministry of Health Al Shifa System. (n.d.). Available online: https://omanportal.gov.om/wps/wcm/connect/2a19ffae-ade0-428b-9f7c-b30bdd874882/Al%2BShifa_MoH.pdf?MOD=AJPERES (accessed on 29 July 2021).
- Find Missing Values—MATLAB. 2022. Available online: https://www.mathworks.com/help/matlab/ref/ismissing.html?s_tid=doc_ta (accessed on 29 July 2021).
- Fill Missing Values—MATLAB. 2022. Available online: https://www.mathworks.com/help/matlab/ref/fillmissing.html?s_tid=doc_ta (accessed on 29 July 2021).
- Detect and Replace Outliers in Data—MATLAB. Available online: https://www.mathworks.com/help/matlab/ref/filloutliers.html?s_tid=doc_ta (accessed on 30 August 2022).
- Partition Data for Cross-Validation—MATLAB. Available online: https://www.mathworks.com/help/stats/cvpartition.html (accessed on 9 August 2021).
- Mathworks. Normalise Data—MATLAB Normalize. Available online: https://www.mathworks.com/help/matlab/ref/double.normalize.html (accessed on 28 March 2022).
- Lador, S.M. What Metrics Should Be Used for Evaluating a Model on an Imbalanced Data Set? Medium, 22 October 2017. Available online: https://towardsdatascience.com/what-metrics-should-we-use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba (accessed on 27 June 2022).
- Lavrac, N.; Keravnou, E.; Zupan, B. Intelligent Data Analysis in Medicine. In Encyclopedia of Computer Science and Technology; Dekker: New York, NY, USA, 2000; Volume 42.
- Lowd, D.; Domingos, P. Naive Bayes Models for Probability Estimation. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; Available online: https://dl.acm.org/doi/abs/10.1145/1102351.1102418?casa_token=93gP6KZPvIEAAAAA%3AR7o8Y2erGyVaOKEtyDCVmLZLu_Kth5VcLyihYXQ9A0tiFR7eEYRelyjwHAsdpNqnho34tEdNnnk (accessed on 15 May 2021).
- Performance for Diabetes with Linear Discriminant Analysis and Genetic Algorithm. Available online: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9637039 (accessed on 25 January 2022).
- Mathworks. Cross-Entropy Loss for Classification Tasks—MATLAB Crossentropy. Available online: https://www.mathworks.com/help/deeplearning/ref/dlarray.crossentropy.html (accessed on 28 January 2022).
S/N | Patient Data | Details | Notes |
---|---|---|---|
1 | Risk Factors | | |
2 | Examination | | Right side; Left side |
3 | LAB | | 1st reading; Repeat reading; By (MDRD); If indicated; If indicated; By using WHO/ISH risk prediction chart |
4 | Problem List | | |
5 | Patients develop disease transfer to | | |
No. | Are You at Risk for Diabetes? Screening Question | Response | Score |
---|---|---|---|
1 | How old are you? | <40 years; 40–49 years; 50–59 years; >60 years | 0; 1; 2; 3 |
2 | Are you a man or a woman? | Woman; Man | 0; 1 |
3 | If you are a woman, have you been diagnosed with gestational diabetes? | Yes; No | 1; 0 |
4 | Do you have a mother, father, sister, or brother with diabetes? | Yes; No | 1; 0 |
5 | What is your blood glucose level currently? | ≥5.6 and <6.1; ≥6.1 and <7 | 0; 1 |
6 | Are you physically active (30 min/5 days)? | Yes; No | 0; 1 |
7 | What is your weight category? | Normal weight; Overweight; Obese; Morbidly obese | 0; 1; 2; 3 |
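The screening table above amounts to a simple additive score. The sketch below sums the points for one respondent; the function and parameter names are hypothetical, the glucose value is assumed to be in the same units as the table (presumably mmol/L), and treating an age of exactly 60 as the top band is an assumption, since the table lists only ">60".

```python
# Points per weight category, as listed in the screening table.
WEIGHT_POINTS = {"normal": 0, "overweight": 1, "obese": 2, "morbidly obese": 3}


def age_points(age):
    if age < 40:
        return 0
    if age < 50:
        return 1
    if age < 60:
        return 2
    return 3  # assumption: age 60 itself falls in the top band


def glucose_points(glucose):
    if 5.6 <= glucose < 6.1:
        return 0
    if 6.1 <= glucose < 7.0:
        return 1
    raise ValueError("glucose is outside the range covered by the table")


def risk_score(age, is_man, gestational_diabetes, family_history,
               glucose, physically_active, weight_category):
    """Sum the points of the seven screening questions."""
    score = age_points(age)
    score += 1 if is_man else 0
    # The gestational diabetes question applies to women only.
    score += 1 if (not is_man and gestational_diabetes) else 0
    score += 1 if family_history else 0
    score += glucose_points(glucose)
    score += 0 if physically_active else 1
    score += WEIGHT_POINTS[weight_category]
    return score


# Hypothetical respondent, for illustration only.
example = risk_score(age=52, is_man=True, gestational_diabetes=False,
                     family_history=True, glucose=6.3,
                     physically_active=False, weight_category="obese")
print(example)
```

The table gives no cut-off for what counts as high risk, so the sketch returns the raw total and leaves any thresholding to the caller.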
No. | Features | Type |
---|---|---|
1 | Gender | Categorical |
2 | Age | Numeric |
3 | Risk factor (0–8) | Categorical |
4 | Diastolic blood pressure (mmHg) | Numeric |
5 | Height (m) | Numeric |
6 | Weight (kg) | Numeric |
7 | Waist circumference (cm) | Numeric |
8 | Total cholesterol (mmol/L) | Numeric |
9 | Fasting plasma glucose (mmol/L) | Numeric |
10 | HbA1c | Numeric |
11 | Outcome | Categorical |
Dataset | Total | Percentage |
---|---|---|
Training | 737 | 80% |
Testing | 184 | 20% |
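The 80/20 split in the table above (737 training and 184 testing records out of 921) can be reproduced with a shuffled hold-out partition. The reference list points to MATLAB's cvpartition; the Python sketch below is a stand-in under that assumption, not the authors' code, and the function name `holdout_split` and the seed are hypothetical.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = round(n_samples * test_fraction)  # 921 * 0.2 -> 184
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(921)
print(len(train_idx), len(test_idx))
```

A stratified variant, which keeps the ratio of outcome classes equal in both partitions, is often preferable for imbalanced medical data; whether the study stratified its split is not stated here.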
Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) |
---|---|---|---|---|
K-nearest neighbours | 92.39 | 94.44 | 77.27 | 90.00 |
Support vector machine | 96.74 | 98.68 | 87.88 | 83.71 |
Naive Bayes | 96.74 | 98.10 | 88.46 | 87.08 |
Decision tree | 98.37 | 100 | 92.11 | 80.66 |
Random forest | 98.37 | 98.01 | 84.85 | 84.1 |
Linear discriminant analysis | 96.19 | 98.71 | 82.76 | 86.44 |
Artificial neural networks | 97.3 | 93.33 | 97.96 | 93.9 |
Model | Accuracy on PID Dataset (%) | Accuracy on Oman Dataset (%) |
---|---|---|
K-nearest neighbours [4] | 91 | 92.39 |
Support vector machine [6] | 90 | 96.74 |
Naive Bayes [32] | 91 | 96.74 |
Decision tree [32] | 88 | 98.37 |
Random forest [32] | 94 | 98.37 |
Linear discriminant analysis [7] | 81.1 | 96.19 |
ANN [8] | 82 | 97.3 |
Model | Features (PIDD) | Accuracy % (PIDD) | Features (Oman) | Accuracy % (Oman, 8 features) | Features (Oman) | Accuracy % (Oman, 11 features) |
---|---|---|---|---|---|---|
K-nearest neighbours | 8 | 75.1 | 8 | 84.2 | 11 | 92.39 |
Support vector machine | 8 | 78.4 | 8 | 85.3 | 11 | 96.74 |
Naive Bayes | 8 | 77.1 | 8 | 87.5 | 11 | 96.74 |
Decision tree | 8 | 71.89 | 8 | 80.9 | 11 | 98.37 |
Random forest | 8 | 76.47 | 8 | 85.3 | 11 | 98.37 |
Linear discriminant analysis | 8 | 77.7 | 8 | 86.95 | 11 | 96.19 |
Artificial neural networks | 8 | 78.1 | 8 | 86.0 | 11 | 97.3 |
Model | PID Prediction Speed | PID Training Time | Oman Prediction Speed | Oman Training Time |
---|---|---|---|---|
K-nearest neighbours | ~24,000 obs/s * | 0.53 s | ~15,000 obs/s | 0.61 s |
Support vector machine | ~18,000 obs/s | 54.72 s | ~19,000 obs/s | 0.54 s |
Naive Bayes | ~26,000 obs/s | 0.65 s | ~15,000 obs/s | 0.93 s |
Decision tree | ~58,000 obs/s | 0.44 s | ~22,000 obs/s | 1.07 s |
Random forest | ~7000 obs/s | 1.44 s | ~6500 obs/s | 1.67 s |
Linear discriminant analysis | ~35,000 obs/s | 0.78 s | ~17,000 obs/s | 0.93 s |
Artificial neural networks | ~12,000 obs/s | 1.93 s | ~12,000 obs/s | 1.93 s |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Al Sadi, K.; Balachandran, W. Prediction Model of Type 2 Diabetes Mellitus for Oman Prediabetes Patients Using Artificial Neural Network and Six Machine Learning Classifiers. Appl. Sci. 2023, 13, 2344. https://doi.org/10.3390/app13042344