An Ensemble Machine Learning and Data Mining Approach to Enhance Stroke Prediction
Abstract
1. Introduction
1.1. Data Analytics in Healthcare
1.2. Existing Studies on Stroke Prediction
1.3. Potential of Machine Learning and Data Mining in Stroke Prediction
2. Materials and Methods
2.1. Cross-Industry Standard Process for Data Mining (CRISP-DM) Methodology
2.2. Data Understanding
2.3. Data Preparation
2.4. Modeling
2.5. Evaluation
2.6. The Proposed Approach
3. Experimental Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695.
2. World Stroke Organization. Impact of Stroke. World Stroke Organization, 2024. Available online: https://www.world-stroke.org/world-stroke-day-campaign/about-stroke/impact-of-stroke (accessed on 10 October 2022).
3. Stroke Association. Stroke Statistics | Stroke Association. 2024. Available online: https://www.stroke.org.uk/stroke/statistics (accessed on 10 October 2022).
4. Office for National Statistics. Leading Causes of Death, UK - Office for National Statistics. 2024. Available online: https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/causesofdeath/articles/leadingcausesofdeathuk/2001to2018#strengths-and-limitations (accessed on 10 October 2022).
5. Stewart, C. Number of Inpatient Episodes with a Main Diagnosis of Stroke in the United Kingdom (UK) from 2011/12 to 2020/21. 2022. Available online: https://www.statista.com/statistics/1132426/hospital-admissions-for-stroke-in-the-uk/ (accessed on 10 October 2022).
6. Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques. Sensors 2022, 22, 4670.
7. Alhakami, H.; Alraddadi, S.; Alseady, S.; Baz, A.; Alsubait, T. A Hybrid Efficient Data Analytics Framework for Stroke Prediction. IJCSNS Int. J. Comput. Sci. Netw. Secur. 2020, 20, 240–250.
8. Biswas, N.; Uddin, K.M.M.; Rikta, S.T.; Dey, S.K. A comparative analysis of machine learning classifiers for stroke prediction: A predictive analytics approach. Healthc. Anal. 2022, 2, 100116.
9. Wu, Y.; Fang, Y. Stroke Prediction with Machine Learning Methods among Older Chinese. Int. J. Environ. Res. Public Health 2020, 17, 1828.
10. Sailasya, G.; Kumari, G.L.A. Analyzing the Performance of Stroke Prediction using ML Classification Algorithms. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 539–545.
11. Emon, M.U.; Keya, M.S.; Meghla, T.I.; Rahman, M.M.; Mamun, M.S.A.; Kaiser, M.S. Performance Analysis of Machine Learning Approaches in Stroke Prediction. In Proceedings of the Fourth International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 5–7 November 2020; pp. 1464–1469.
12. Cheon, S.; Kim, J.; Lim, J. The Use of Deep Learning to Predict Stroke Patient Mortality. Int. J. Environ. Res. Public Health 2019, 16, 1876.
13. Choi, Y.-A.; Park, S.-J.; Jun, J.-A.; Pyo, C.-S.; Cho, K.-H.; Lee, H.-S.; Yu, J.-H. Deep Learning-Based Stroke Disease Prediction System Using Real-time Bio Signals. Sensors 2021, 21, 4269.
14. Govindarajan, P.; Soundarapandian, R.K.; Gandomi, A.H.; Patan, R.; Jayaraman, P.; Manikandan, R. Classification of stroke disease using machine learning algorithms. Neural Comput. Appl. 2020, 32, 817–828.
15. Dev, S.; Wang, H.; Nwosu, C.S.; Jain, N.; Veeravalli, B.; John, D. A predictive analytics approach for stroke prediction using machine learning. Healthc. Anal. 2022, 2, 100032.
16. World Health Organisation. The Top 10 Causes of Death. 2020. Available online: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (accessed on 30 October 2022).
17. Piovani, D.; Bonovas, S. Real World—Big Data Analytics in Healthcare. Int. J. Environ. Res. Public Health 2022, 19, 11677.
18. Galetsi, P.; Katsaliaki, K.; Kumar, S. Values, challenges and future directions of big data analytics in healthcare: A systematic review. Soc. Sci. Med. 2019, 241, 112533.
19. Khanra, S.; Dhir, A.; Islam, A.K.M.N.; Mäntymäki, M. Big data analytics in healthcare: A systematic literature review. Enterp. Inf. Syst. 2020, 14, 878–912.
20. Latif, J.; Xiao, C.; Imran, A.; Tu, S. Medical Imaging using Machine Learning and Deep Learning Algorithms: A Review. In Proceedings of the 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 30–31 January 2019.
21. Aggarwal, P.; Mishra, N.K.; Fatimah, B.; Singh, P.; Gupta, A.; Joshi, S.D. COVID-19 image classification using deep learning: Advances, challenges and opportunities. Comput. Biol. Med. 2022, 144, 105350.
22. Allen, A.; Iqbal, Z.; Green-Saxena, A.; Hurtado, M.; Hoffman, J. Prediction of diabetic kidney disease with machine learning algorithms, upon the initial diagnosis of type 2 diabetes mellitus. Emerg. Technol. Pharmacol. Ther. 2021, 10, e002560.
23. Dong, Z.; Wang, Q.; Ke, Y.; Zhang, W.; Hong, Q.; Liu, C.; Liu, X.; Yang, J.; Xi, Y.; Shi, J.; et al. Prediction of 3-year risk of diabetic kidney disease using machine learning based on electronic medical records. J. Transl. Med. 2022, 20, 143.
24. Wu, C.-C.; Yeh, W.-C.; Hsu, W.-D.; Islam, M.M.; Nguyen, P.A.; Poly, T.N.; Wang, Y.-C.; Yang, H.-C.; Li, Y.-C. Prediction of fatty liver disease using machine learning algorithms. Comput. Methods Programs Biomed. 2019, 170, 23–29.
25. Mohan, S.; Thirumalai, C.; Srivastava, G. Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques. IEEE Access 2019, 7, 81542–81554.
26. Saboor, A.; Usman, M.; Ali, S.; Samad, A.; Abrar, M.F.; Ullah, N. A Method for Improving Prediction of Human Heart Disease Using Machine Learning Algorithms. Mob. Inf. Syst. 2022, 2022, 1410169.
27. Fedesoriano. Stroke Prediction Dataset. 2020. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 1 May 2024).
28. Smith, A.; Jones, B.; Brown, C. Machine learning in healthcare: A review. J. Med. Inform. 2020, 45, 123–135.
29. Brown, T.; Taylor, L. Ensemble methods for stroke prediction. Int. J. Data Min. Bioinform. 2019, 12, 289–301.
30. Johnson, R.; Williams, D.; Clark, E. Adaptive learning in machine learning models. Health Data Sci. 2021, 33, 222–230.
31. Lee, J.; Kim, H.; Yoon, S. Data mining techniques for predicting stroke. Comput. Biol. Chem. 2018, 76, 54–60.
32. Liu, H.; Long, J.; Nguyen, T. Feature selection and dimensionality reduction techniques for machine learning. J. Artif. Intell. Res. 2019, 65, 315–340.
33. Nguyen, P.; Wong, T.; Gao, H. Personalized healthcare: Predictive modeling and data integration. IEEE Trans. Inf. Technol. Biomed. 2020, 24, 1565–1573.
34. Wang, X.; Li, Y.; Huang, Z. Multi-modal data integration for health prediction. J. Biomed. Inform. 2019, 92, 103–113.
35. Zhou, Q.; Liu, X.; Wang, Y. Evaluating ensemble models for stroke prediction. Bioinform. Adv. 2021, 7, 278–289.
36. Garcia, F.; Johnson, L.; Martinez, M. Clinical applications of machine learning in stroke prediction. J. Clin. Bioinform. 2022, 10, 144–159.
37. Huber, S.; Wiemer, H.; Schneider, D.; Ihlenfeldt, S. DMME: Data mining methodology for engineering applications - A holistic extension to the CRISP-DM model. Procedia CIRP 2019, 79, 403–408.
38. Chucks, P. Diabetes, Hypertension and Stroke Prediction. 2022. Available online: https://www.kaggle.com/datasets/prosperchuks/health-dataset (accessed on 1 May 2024).
39. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
40. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
41. Friedman, J. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
42. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
43. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
44. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. Available online: https://arxiv.org/pdf/1706.09516 (accessed on 10 June 2024).
45. Whitley, D. A genetic algorithm tutorial. Stat. Comput. 1994, 4, 65–85.
Author | Best Model (Accuracy) |
---|---|
[6] | Stacking: 97.4% |
[7] | Random forest: 97.6% |
[8] | SVM: 99.9% |
[9] | Random forest: 78.0% |
[10] | Naïve Bayes: 82.0% |
[11] | Weighted voting: 97.0% |
[12] | DNN: 84.0% |
[13] | CNN-bi-LSTM: 94% |
[14] | ANN: 95.3% |
[15] | Neural network: 77.0% |
Attribute Name | Data Type | Description |
---|---|---|
ID | Numeric | Unique identifier of each patient |
Gender | Categorical | Gender of the patient |
Age | Numeric | Age of the patient |
Hypertension | Numeric | 0 means that the patient does not have hypertension; 1 means that the patient has hypertension |
Heart disease | Numeric | 0 means that the patient does not have heart disease; 1 means that the patient has heart disease |
Ever married | Categorical | “Yes” means that the patient has ever been married; “No” means that the patient has never been married |
Work type | Categorical | The work that each patient does, categorized into ‘children’, ‘Govt_Job’, ‘Never Worked’, and ‘Self-employed’ |
Residence type | Categorical | The residence type of each patient, categorized as rural or urban |
Average glucose level | Numeric | The blood glucose level of the patient |
BMI (Body Mass Index) | Numeric | The BMI of each patient |
Smoking status | Categorical | The smoking status of patients is categorized into ‘Formerly smoked’, ‘Never smoked’, ‘Smokes’, or ‘Unknown’ |
Stroke | Numeric | The target variable indicates whether the patient has had a stroke or not |
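The categorical attributes above have to be turned into numbers before modeling, and the comparison table lists one-hot encoding and label encoding as the pre-processing used. The following is a minimal sketch of both encodings in plain Python. Which attribute received which encoding is an assumption made here for illustration, and the `record` dictionary and its keys are hypothetical.

```python
# Category levels as listed in the attribute table.
SMOKING_LEVELS = ["Formerly smoked", "Never smoked", "Smokes", "Unknown"]
MARRIED_LEVELS = ["No", "Yes"]


def label_encode(value, levels):
    """Map a category to its integer position in the list of levels."""
    return levels.index(value)


def one_hot_encode(value, levels):
    """Map a category to a 0/1 vector with a single 1 at its level."""
    return [1 if value == level else 0 for level in levels]


# Hypothetical patient record (not taken from the dataset).
record = {"ever_married": "Yes", "smoking_status": "Never smoked"}

# Assumption: binary attributes are label encoded, multi-level ones one-hot encoded.
married_code = label_encode(record["ever_married"], MARRIED_LEVELS)
smoking_vector = one_hot_encode(record["smoking_status"], SMOKING_LEVELS)
```

Label encoding keeps a binary attribute in one column, while one-hot encoding avoids implying an order among levels such as the smoking categories.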
Attribute Name | Data Type | Description |
---|---|---|
Age | Numeric | Age of the patient in 13 categories |
Sex | Numeric | Patient gender, where (1) is male and (0) is female |
HighChol | Numeric | Cholesterol status of the patient: (0) cholesterol is not high and (1) cholesterol is high |
CholCheck | Numeric | Cholesterol check history, where (0) indicates no cholesterol check in the past 5 years and (1) indicates at least one cholesterol check in the past 5 years |
BMI | Numeric | Body Mass Index |
Smoker | Numeric | This binary variable indicates whether the patient has smoked more than 100 cigarettes in their entire life |
HeartDiseaseorAttack | Numeric | This is a binary variable which indicates whether the patient has a history of coronary heart disease (CHD) or not |
PhysActivity | Numeric | This variable represents whether the patient has performed any physical activity in the past month, where (0) is no and (1) is yes |
Fruits | Numeric | This binary variable indicates whether the patient consumes one or more fruits per day |
Veggies | Numeric | This binary variable indicates whether the patient consumes one or more vegetables per day |
HvyAlcoholConsump | Numeric | This binary variable represents heavy alcohol consumption of more than 14 drinks per week for men and more than 7 drinks per week for women |
GenHlth | Numeric | This ordinal variable represents the patient’s general health on a scale of 1 to 5, where 1 is excellent and 5 is poor |
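Metrics | Formula |
---|---|
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
Precision | $\frac{TP}{TP + FP}$ |
Recall (sensitivity) | $\frac{TP}{TP + FN}$ |
F1-score | $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ |
Specificity | $\frac{TN}{TN + FP}$ |

Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.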
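To make the evaluation metrics concrete, the sketch below computes them from the four cells of a binary confusion matrix, assuming the conventional definitions (accuracy, precision, recall, F1-score, and specificity in terms of true/false positives and negatives). The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


# Hypothetical counts for 100 patients (illustration only).
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data such as stroke records, a model that predicts only the majority class can still reach high accuracy, which is why recall and specificity are worth reading alongside it.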
Model | Accuracy | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|---|
Gradient Boost | 97.29% | 97.38% | 97.29% | 97.29% | 97% |
Histogram-based Gradient Boosting | 97.87% | 97.89% | 97.87% | 97.87% | 98% |
XGBoost | 97.87% | 97.88% | 97.87% | 97.87% | 98% |
LightGBM | 98.03% | 98.05% | 98.03% | 98.03% | 98% |
CatBoost | 97.45% | 97.50% | 97.45% | 97.45% | 97% |
ExtraTrees Classifier | 98.24% | 98.24% | 98.24% | 98.24% | 98% |
Random forest | 98.03% | 98.05% | 98.03% | 98.03% | 98% |
Random forest without feature scaling and class balancing | 94.50% | 89.49% | 94.50% | 91.93% | 50% |
Random forest with minmax scaler | 96.65% | 96.72% | 96.65% | 96.65% | 97% |
Random forest with standard scaler | 85.32% | 87.69% | 85.32% | 85.04% | 85% |
Random forest without feature scaling + RSCV | 97.18% | 97.26% | 97.18% | 97.18% | 97% |
Random forest with minmax scaler + RSCV | 97.29% | 97.38% | 97.29% | 97.29% | 97% |
Random forest with standard scaler + RSCV | 77.77% | 84.20% | 77.77% | 76.55% | 77% |
ANN without feature scaling and class balancing | 94.60% | 89.50% | 94.60% | 91.98% | 83% |
ANN without feature scaling | 92.13% | 92.14% | 92.13% | 92.13% | 97% |
ANN with minmax scaler | 94.89% | 94.89% | 94.89% | 94.89% | 99% |
ANN with standard scaler | 95.43% | 95.43% | 95.43% | 95.43% | 99% |
ANN without feature scaling + GSCV | 87.93% | 88.50% | 87.93% | 87.86% | 94% |
ANN with minmax scaler + GSCV | 95.96% | 96.05% | 95.96% | 95.96% | 99% |
ANN with standard scaler + GSCV | 96.33% | 96.38% | 96.33% | 96.33% | 99% |
GANN without feature scaling and class balancing | 96.03% | - | - | - | - |
GANN without feature scaling | 75.67% | - | - | - | - |
GANN with minmax scaler | 76.79% | - | - | - | - |
GANN with standard scaler | 79.98% | - | - | - | - |
Model | Accuracy | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|---|
Random forest without feature scaling and class balancing | 92.96% | 89.27% | 92.96% | 90.67% | 52% |
Random forest without feature scaling | 96.40% | 96.50% | 96.40% | 96.40% | 96% |
Random forest with minmax scaler | 96.22% | 96.32% | 96.22% | 96.21% | 96% |
Random forest with standard scaler | 49.61% | 62.95% | 49.61% | 33.07% | 50% |
Random forest without feature scaling + RSCV | 96.74% | 96.88% | 96.74% | 96.73% | 97% |
Random forest with minmax scaler + RSCV | 96.65% | 96.79% | 96.65% | 96.64% | 97% |
Random forest with standard scaler + RSCV | 49.53% | 62.38% | 49.53% | 32.84% | 50% |
ANN without feature scaling and class balancing | 93.65% | 87.70% | 93.65% | 90.58% | 81% |
ANN without feature scaling | 89.38% | 89.50% | 89.38% | 89.37% | 96% |
ANN with minmax scaler | 93.83% | 94.01% | 93.83% | 93.82% | 98% |
ANN with standard scaler | 92.08% | 92.33% | 92.08% | 92.07% | 98% |
ANN without feature scaling + GSCV | 91.27% | 91.44% | 91.27% | 91.26% | 97% |
ANN with minmax scaler + GSCV | 93.17% | 93.61% | 93.17% | 93.15% | 98% |
ANN with standard scaler + GSCV | 91.04% | 91.06% | 91.04% | 91.04% | 97% |
GANN without feature scaling and class balancing | 93.82% | - | - | - | - |
GANN without feature scaling | 72.27% | - | - | - | - |
GANN with minmax scaler | 76.01% | - | - | - | - |
GANN with standard scaler | 73.21% | - | - | - | - |
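The results tables distinguish models trained with a minmax scaler from those trained with a standard scaler. As a minimal sketch, assuming the conventional definitions (rescaling a feature to the range [0, 1], and centering it to zero mean with unit variance), the two transforms can be written in plain Python. The glucose values are hypothetical, and the source does not state which implementation was used.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Center values to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical average glucose levels (illustration only).
glucose = [70.0, 90.0, 110.0, 130.0]
scaled_minmax = minmax_scale(glucose)
scaled_standard = standard_scale(glucose)
```

In practice the scaling parameters would be estimated on the training split only and then applied to the test split, so that no information leaks from the test data.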
Author | Best Algorithm | Data Pre-Processing | Dataset | Accuracy | AUC |
---|---|---|---|---|---|
The proposed model | ExtraTrees Classifier | SMOTE balancing, One-hot encoding, and label encoding | Stroke Prediction Dataset by [27] | 98.24% | 98% |
[7] | Random forest | Normalization and agglomerative hierarchical clustering | Open-Source Healthcare Dataset Stroke Data | 97.62% | 81% |
[8] | SVM | Class balancing and hyperparameter optimization | Stroke Prediction Dataset by [27] | 99.99% | - |
[9] | Random forest | SMOTE balancing | Chinese Longitudinal Healthy Longevity Study (CLHLS) dataset for stroke prediction from 2012 and 2014 | 78.0% | 71% |
[10] | Naïve Bayes | Undersampling class balancing | Stroke Prediction Dataset by [27] | 82.0% | - |
[11] | Weighted voting | Data normalization | Stroke Prediction Dataset by [27] | 97.0% | 93% |
[15] | Neural network | Principal component analysis | Stroke Prediction Dataset by [27] | 77.0% | - |
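Several rows in the table above rely on SMOTE class balancing. The core idea of SMOTE, as described by Chawla et al., is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea for numeric features only. The function name, the parameters, and the example points are hypothetical, and this is not the implementation used in any of the compared studies.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points as (age, average glucose level).
minority = [[67.0, 228.7], [80.0, 105.9], [49.0, 171.2], [79.0, 174.1]]
new_points = smote_sketch(minority, n_synthetic=6)
```

Because every synthetic point lies on a line segment between two real minority samples, oversampling of this kind should be applied to the training split only, after the train/test split.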
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wijaya, R.; Saeed, F.; Samimi, P.; Albarrak, A.M.; Qasem, S.N. An Ensemble Machine Learning and Data Mining Approach to Enhance Stroke Prediction. Bioengineering 2024, 11, 672. https://doi.org/10.3390/bioengineering11070672