A Heart Disease Prediction Model Based on Feature Optimization and Smote-Xgboost Algorithm
Abstract
1. Introduction
- An information gain-based feature selection method is used to select the most informative features from the dataset.
- A hybrid resampling technique that combines oversampling and undersampling is applied to the selected features to handle class imbalance.
- The effectiveness of XGBoost is validated on the preprocessed dataset, and its performance is compared with five baseline methods using confusion matrices.
2. Related Work
3. Method
3.1. Dataset
3.2. Data Preprocessing
3.3. Feature Selection Based on Information Gain
Algorithm 1: The pseudocode of IGFS.
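As a rough illustration of the idea behind information gain-based feature selection (not the authors' IGFS procedure itself), the sketch below scores each categorical feature by how much it reduces the entropy of the class label and ranks the features by that score. The function names and the toy data are assumptions made for this example; numeric features such as Age or LVEF would first need to be discretized, which is not shown.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v) for a categorical feature X."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (feature name, information gain) pairs, highest gain first."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only (not taken from the paper's dataset).
toy_labels = [1, 1, 0, 0, 1, 0, 0, 0]
toy_columns = {
    "Smoke": [1, 1, 0, 0, 1, 0, 0, 0],  # identical to the toy labels, so maximally informative
    "Sex": [1, 0, 1, 0, 1, 0, 1, 0],    # only weakly related to the toy labels
}
ranking = rank_features(toy_columns, toy_labels)
print(ranking)
```

A selection step would then keep the top-ranked features or those above a chosen threshold; the cutoff is a design choice that this sketch leaves open.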
3.4. Imbalanced Data Processing Based on SMOTE-ENN
Algorithm 2: The pseudocode of SMOTE-ENN.
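The general SMOTE-ENN idea can be sketched in a few lines: SMOTE oversamples the minority class by interpolating between neighbouring minority samples, and Edited Nearest Neighbours (ENN) then removes samples whose label disagrees with the majority of their nearest neighbours. This is a simplified, dependency-free sketch and not the authors' Algorithm 2; the helper names, the choice `k=3`, the balancing target, and the toy points are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Drop the closest entry: it is the base point itself (or an exact duplicate).
        neighbour_ids = nearest(base, minority, k + 1)[1:]
        neighbour = minority[rng.choice(neighbour_ids)]
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def edited_nearest_neighbours(points, labels, k):
    """Keep only samples whose label matches the majority vote of their k neighbours."""
    kept_points, kept_labels = [], []
    for i, (point, label) in enumerate(zip(points, labels)):
        others = points[:i] + points[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        votes = Counter(other_labels[j] for j in nearest(point, others, k))
        if votes.most_common(1)[0][0] == label:
            kept_points.append(point)
            kept_labels.append(label)
    return kept_points, kept_labels


def smote_enn(points, labels, minority_label=1, k=3, seed=0):
    """Oversample the minority class up to the majority size, then clean with ENN."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_majority = len(points) - len(minority)
    synthetic = smote(minority, n_majority - len(minority), k, random.Random(seed))
    all_points = list(points) + synthetic
    all_labels = list(labels) + [minority_label] * len(synthetic)
    return edited_nearest_neighbours(all_points, all_labels, k)


# Toy data, invented for illustration: a majority cluster near the origin,
# a small minority cluster near (10, 10), and one majority "noise" point inside it.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2)]
noise = [(10.5, 10.5)]
minority = [(10.0, 10.0), (11.0, 11.0), (10.0, 11.0)]
toy_points = majority + noise + minority
toy_labels = [0] * 8 + [1] * 3

new_points, new_labels = smote_enn(toy_points, toy_labels)
print(Counter(new_labels))
```

In this toy case the noisy majority point inside the minority cluster is the only sample ENN removes. In practice, a maintained implementation such as `SMOTEENN` from the imbalanced-learn package would normally be preferred over a hand-written version.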
3.5. XGBoost
3.6. Baseline Algorithms
3.6.1. Random Forest
3.6.2. K-Nearest Neighbor
3.6.3. Logistic Regression
3.6.4. Decision Tree
3.6.5. Naïve Bayes
4. Performance Evaluation
4.1. Result of Exploratory Data Analysis
4.2. Cross Validation
4.3. Performance Measure
4.4. The Performance of Algorithms
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Index | Feature | Type | Description |
---|---|---|---|
1 | Sex | category | Male = 1; Female = 0 |
2 | Stable_CAD | category | Stable CAD = 0; Unstable CAD = 1 |
3 | Age | numeric | Age in years, [20, 86] |
4 | CVD_history | category | Ischemic cerebrovascular disease = 0; Hemorrhagic cerebrovascular disease = 1 |
5 | Smoke | category | No smoking history = 0; Have smoking history = 1 |
6 | nitrate | category | Hospitalization without nitrate = 0; Hospitalization with nitrate = 1 |
7 | LVEF | numeric | Left ventricular ejection fraction, [18, 88] |
8 | HBG | numeric | Hemoglobin, [55, 193.2] |
9 | BUN | numeric | Blood urea nitrogen, [0.7, 119.0] |
10 | TC | numeric | Total cholesterol, [73, 589] |
11 | SCV_number | numeric | SCV_number, [0, 3] |
12 | DM | category | No diabetes mellitus = 0; Having diabetes mellitus = 1 |
13 | REV_type | category | PCI = 1; CABG = 2 |
14 | LM_lesion | category | No LM_lesion = 0; Having LM_lesion = 1 |
15 | ASA | category | Hospitalization without ASA = 0; Hospitalization with ASA = 1 |
16 | MACCE | category | No MACCE = 0; Occurrence of MACCE = 1 |
MACCE | 0 | 1 | Total |
---|---|---|---|
Number | 3204 | 323 | 3527 |
Percentage | 90.84% | 9.16% | 100% |
Algorithm | Description |
---|---|
RF | TN: 869 samples without MACCE were correctly predicted as no MACCE. TP: 94 samples with MACCE were correctly predicted as MACCE. FP: 91 samples without MACCE were incorrectly predicted as MACCE. FN: 4 samples with MACCE were incorrectly predicted as no MACCE. |
KNN | TN: 873 samples without MACCE were correctly predicted as no MACCE. TP: 97 samples with MACCE were correctly predicted as MACCE. FP: 87 samples without MACCE were incorrectly predicted as MACCE. FN: 1 sample with MACCE was incorrectly predicted as no MACCE. |
LR | TN: 707 samples without MACCE were correctly predicted as no MACCE. TP: 84 samples with MACCE were correctly predicted as MACCE. FP: 253 samples without MACCE were incorrectly predicted as MACCE. FN: 14 samples with MACCE were incorrectly predicted as no MACCE. |
DT | TN: 792 samples without MACCE were correctly predicted as no MACCE. TP: 90 samples with MACCE were correctly predicted as MACCE. FP: 168 samples without MACCE were incorrectly predicted as MACCE. FN: 8 samples with MACCE were incorrectly predicted as no MACCE. |
NB | TN: 712 samples without MACCE were correctly predicted as no MACCE. TP: 90 samples with MACCE were correctly predicted as MACCE. FP: 248 samples without MACCE were incorrectly predicted as MACCE. FN: 8 samples with MACCE were incorrectly predicted as no MACCE. |
XGBoost | TN: 899 samples without MACCE were correctly predicted as no MACCE. TP: 90 samples with MACCE were correctly predicted as MACCE. FP: 61 samples without MACCE were incorrectly predicted as MACCE. FN: 8 samples with MACCE were incorrectly predicted as no MACCE. |
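To make the link between confusion-matrix counts and the usual evaluation measures concrete, the following sketch computes accuracy, precision, recall, and F1-score for the positive (MACCE) class from the four counts. It uses the standard textbook definitions and the counts listed for XGBoost in the table above; the function name is an assumption for this example. These positive-class values need not coincide with scores reported elsewhere in the paper, whose averaging scheme is not stated here.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard binary metrics for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts for XGBoost as stated in the table above.
xgb = classification_metrics(tn=899, fp=61, fn=8, tp=90)
for name, value in xgb.items():
    print(f"{name}: {value:.4f}")
```

With a minority class this small, accuracy is dominated by the true negatives, which is why recall and precision on the MACCE class are worth inspecting separately.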
Algorithm | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Random Forest | 0.9115 | 0.9026 | 0.9615 | 0.9037 |
KNN | 0.9177 | 0.8878 | 0.9933 | 0.9085 |
Logistic Regression | 0.7481 | 0.7625 | 0.8645 | 0.7178 |
Decision Tree | 0.8335 | 0.8308 | 0.9197 | 0.8157 |
Naïve Bayes | 0.7585 | 0.7486 | 0.9214 | 0.7157 |
XGBoost | 0.9344 | 0.9266 | 0.9716 | 0.9486 |
Ranking | Random Forest | Decision Tree | Logistic Regression | XGBoost |
---|---|---|---|---|
1 | HBG | TC | Age | TC |
2 | TC | LVEF | SCV_number | LVEF |
3 | LVEF | HBG | TC | HBG |
4 | BUN | Age | Sex | BUN |
5 | Age | BUN | HBG | Age |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yang, J.; Guan, J. A Heart Disease Prediction Model Based on Feature Optimization and Smote-Xgboost Algorithm. Information 2022, 13, 475. https://doi.org/10.3390/info13100475