Tabular Data Generation to Improve Classification of Liver Disease Diagnosis
Abstract
1. Introduction
2. Related Works
2.1. Liver Disease Diagnosis
2.2. Data Augmentation
3. I.L.P.D. Dataset: Exploratory Data Analysis
3.1. Dataset Description
3.2. Exploratory Data Analysis
4. Model Construction
5. Classification Algorithms
5.1. Artificial Neural Networks (ANN)
5.2. Support Vector Machines (SVM)
5.3. Decision Trees (DT)
5.4. K-Nearest Neighbours Algorithm (K.N.N.)
5.5. Logistic Regression Classifier (L.R.)
6. Data Augmentation Methods
6.1. Generative Adversarial Networks
Algorithm 1: Generative Adversarial Networks (GANs) training algorithm
For number of training iterations do:
    For k steps do:
        Sample a minibatch of m noise samples from the noise prior
        Sample a minibatch of m examples from the data-generating distribution
        Update the discriminator by ascending its stochastic gradient
    End For
    Sample a minibatch of m noise samples from the noise prior
    Update the generator by descending its stochastic gradient
End For
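The loop in Algorithm 1 can be illustrated with a minimal sketch on a one-dimensional toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the real data are drawn from N(4, 1), the generator is the affine map G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), and the function name `train_toy_gan`, the learning rate, and the batch size are hypothetical. The gradients are written out by hand for the standard minimax objective, so only the Python standard library is needed.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(iterations=200, k=1, m=32, lr=0.05, seed=0):
    """Run the loop of Algorithm 1 on a 1-D toy problem.

    Assumed setup (not from the paper): real data ~ N(4, 1),
    generator G(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c).
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.1, 0.0  # discriminator parameters

    for _ in range(iterations):
        for _ in range(k):
            z = [rng.gauss(0.0, 1.0) for _ in range(m)]  # noise prior
            x = [rng.gauss(4.0, 1.0) for _ in range(m)]  # real examples
            g = [a * zi + b for zi in z]                 # generated samples
            # Gradient of (1/m) * sum[log D(x) + log(1 - D(G(z)))] w.r.t. w, c
            gw = sum((1.0 - sigmoid(w * xi + c)) * xi for xi in x) / m
            gw -= sum(sigmoid(w * gi + c) * gi for gi in g) / m
            gc = sum(1.0 - sigmoid(w * xi + c) for xi in x) / m
            gc -= sum(sigmoid(w * gi + c) for gi in g) / m
            w += lr * gw  # discriminator ascends its gradient
            c += lr * gc

        z = [rng.gauss(0.0, 1.0) for _ in range(m)]      # fresh noise
        d_fake = [sigmoid(w * (a * zi + b) + c) for zi in z]
        # Gradient of (1/m) * sum[log(1 - D(G(z)))] w.r.t. a, b
        ga = sum(-di * w * zi for di, zi in zip(d_fake, z)) / m
        gb = sum(-di * w for di in d_fake) / m
        a -= lr * ga  # generator descends its gradient
        b -= lr * gb

    return a, b, w, c


a, b, w, c = train_toy_gan()
print(f"generator offset b = {b:.3f}, discriminator weight w = {w:.3f}")
```

The sketch only mirrors the control flow of the algorithm (k discriminator updates, then one generator update per iteration); a tabular GAN for real data would use neural networks and an automatic differentiation library instead of hand-derived gradients.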
6.2. Synthetic Minority Oversampling Technique
7. Experiments and Evaluation
7.1. Evaluation Performance Measures
7.2. Experimental Results
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Lin, R.-H. An intelligent model for liver disease diagnosis. Artif. Intell. Med. 2009, 47, 53–62.
2. Maddrey, W.C.; Sorrell, M.F.; Schiff, E.R. Schiff’s Diseases of the Liver; John Wiley & Sons: Hoboken, NJ, USA, 2011.
3. Oniśko, A.; Druzdzel, M.J.; Wasyluk, H. Learning Bayesian network parameters from small data sets: Application of Noisy-OR gates. Int. J. Approx. Reason. 2001, 27, 165–182.
4. Babu, M.S.P.; Ramana, B.V.; Kumar, B.R.S. New automatic diagnosis of liver status using bayesian classification. In Proceedings of the International Conference on Intelligent Network and Computing (ICINC), Kuala Lumpur, Malaysia, 26–28 November 2010.
5. Domingos, P. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999; pp. 155–164.
6. Ramana, B.V.; Babu, M.S.P.; Venkateswarlu, N. A critical study of selected classification algorithms for liver disease diagnosis. Int. J. Database Manag. Syst. 2011, 3, 101–114.
7. Kim, S.; Jung, S.; Park, Y.; Lee, J.; Park, J. Effective liver cancer diagnosis method based on machine learning algorithm. In Proceedings of the 2014 7th International Conference on Biomedical Engineering and Informatics, Dalian, China, 14–16 October 2014; pp. 714–718.
8. Al-Qerem, A.; Alsalman, Y.S.; Mansour, K. Image Generation Using Different Models of Generative Adversarial Network. In Proceedings of the 2019 International Arab Conference on Information Technology (ACIT), Al Ain, United Arab Emirates, 3–5 December 2019; pp. 241–245.
9. Al-Qerem, A.; Kharbat, F.; Nashwan, S.; Ashraf, S.; Blaou, K. General model for best feature extraction of EEG using discrete wavelet transform wavelet family and differential evolution. Int. J. Distrib. Sens. Netw. 2020, 16, 1550147720911009.
10. Al-Qerem, A. An efficient machine-learning model based on data augmentation for pain intensity recognition. Egypt. Inform. J. 2020, 21, 241–257.
11. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
12. Borji, A. Pros and cons of gan evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65.
13. Ho, D.; Liang, E.; Chen, X.; Stoica, I.; Abbeel, P. Population based augmentation: Efficient learning of augmentation policy schedules. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2731–2741.
14. Perez, L.; Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv 2017, arXiv:1712.04621.
15. Che, Z.; Cheng, Y.; Zhai, S.; Sun, Z.; Liu, Y. Boosting deep learning risk prediction with generative adversarial networks for electronic health records. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 787–792.
16. Pradhan, A. Support vector machine-a survey. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 82–85.
17. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
18. Al-Qerem, A.; Salem, A.A.; Jebreen, I.; Nabot, A.; Samhan, A. Comparison between Transfer Learning and Data Augmentation on Medical Images Classification. In Proceedings of the 2021 22nd International Arab Conference on Information Technology (ACIT), Muscat, Oman, 21–23 December 2021; pp. 1–7.
19. Jeyalakshmi, K.; Rangaraj, R. Accurate liver disease prediction system using convolutional neural network. Indian J. Sci. Technol. 2021, 14, 1406–1421.
20. Islam, M.K.; Alam, M.M.; Rony, M.R.A.H.; Mohiuddin, K. Statistical Analysis and Identification of Important Factors of Liver Disease using Machine Learning and Deep Learning Architecture. In Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence, Suzhou, China, 15–18 March 2019; pp. 131–137.
21. Sravani, K.; Anushna, G.; Maithraye, I.; Chetan, P.; Yeruva, S. Prediction of Liver Malady Using Advanced Classification Algorithms. In Machine Learning Technologies and Applications: Proceedings of ICACECS 2020; Springer: Singapore, 2021; pp. 39–49.
22. Belavigi, D.; Veena, G.; Harekal, D. Prediction of liver disease using Rprop, SAG and CNN. Int. J. Innov. Technol. Expl. Eng. IJITEE 2019, 8, 3290–3295.
23. Singh, J.; Bagga, S.; Kaur, R. Software-based prediction of liver disease with feature selection and classification techniques. Procedia Comput. Sci. 2020, 167, 1970–1980.
24. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
25. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
26. Tran, T.; Pham, T.; Carneiro, G.; Palmer, L.; Reid, I. A bayesian data augmentation approach for learning deep models. Adv. Neural Inf. Process. Syst. 2017, 30, 2794–2803.
27. Turhan, C.G.; Bilge, H.S. Recent trends in deep generative models: A review. In Proceedings of the 2018 3rd International Conference on Computer Science and Engineering (UBMK), Sarajevo, Bosnia and Herzegovina, 20–23 September 2018; pp. 574–579.
28. Zou, J.; Han, Y.; So, S.-S. Overview of artificial neural networks. Artif. Neural Netw. 2008, 458, 14–22.
29. Ecer, F.; Ardabili, S.; Band, S.S.; Mosavi, A. Training Multilayer Perceptron with Genetic Algorithms and Particle Swarm Optimization for Modeling Stock Price Index Prediction. Entropy 2020, 22, 1239.
30. Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning. Decis. Anal. J. 2022, 3, 100071.
31. Xia, D.; Tang, H.; Sun, S.; Tang, C.; Zhang, B. Landslide Susceptibility Mapping Based on the Germinal Center Optimization Algorithm and Support Vector Classification. Remote Sens. 2022, 14, 2707.
32. Awad, M.; Khanna, R. Support vector machines for classification. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 39–66.
33. Osei-Bryson, K.-M. Evaluation of decision trees: A multi-criteria approach. Comput. Oper. Res. 2004, 31, 1933–1945.
34. Saxena, R.; Sharma, S.K.; Gupta, M.; Sampada, G.C. A Novel Approach for Feature Selection and Classification of Diabetes Mellitus: Machine Learning Methods. Comput. Intell. Neurosci. 2022, 2022, 3820360.
35. Kataria, A.; Singh, M. A review of data classification using k-nearest neighbour algorithm. Int. J. Emerg. Technol. Adv. Eng. 2013, 3, 354–360.
36. Lemon, S.C.; Roy, J.; Clark, M.A.; Friedmann, P.D.; Rakowski, W. Classification and regression tree analysis in public health: Methodological review and comparison with logistic regression. Ann. Behav. Med. 2003, 26, 172–181.
37. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
38. Laakso, M.; Soininen, H.; Partanen, K.; Lehtovirta, M.; Hallikainen, M.; Hänninen, T.; Helkala, E.-L.; Vainio, P.; Riekkinen, P. MRI of the hippocampus in Alzheimer’s disease: Sensitivity, specificity, and analysis of the incorrectly classified subjects. Neurobiol. Aging 1998, 19, 23–31.
39. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Hobart, Australia, 4–8 December 2006; pp. 1015–1021.
40. Dritsas, E.; Trigka, M. Supervised Machine Learning Models for Liver Disease Risk Prediction. Computers 2023, 12, 19.
41. Behera, M.P.; Sarangi, A.; Mishra, D.; Sarangi, S.K. A Hybrid Machine Learning algorithm for Heart and Liver Disease Prediction Using Modified Particle Swarm Optimization with Support Vector Machine. Procedia Comput. Sci. 2023, 218, 818–827.
42. Mostafa, F.; Hasan, E.; Williamson, M.; Khan, H. Statistical Machine Learning Approaches to Liver Disease Prediction. Livers 2021, 1, 294–312.
43. Wu, C.-C.; Yeh, W.-C.; Hsu, W.-D.; Islam, M.M.; Nguyen, P.A.; Poly, T.N.; Wang, Y.-C.; Yang, H.-C.; Li, Y.-C. Prediction of fatty liver disease using machine learning algorithms. Comput. Methods Programs Biomed. 2019, 170, 23–29.
Sl. No | Attribute Name | Attribute Type | Attribute Description |
---|---|---|---|
1. | Age | Numeric | Age of the patient |
2. | Sex | Nominal | Gender of the patient |
3. | Total Bilirubin | Numeric | Quantity of total bilirubin in patient |
4. | Direct Bilirubin | Numeric | Quantity of direct bilirubin in patient |
5. | Alkphos Alkaline Phosphatase | Numeric | Amount of A.L.P. enzyme in patient |
6. | Sgpt Alanine Aminotransferase | Numeric | Amount of S.G.P.T. in patient |
7. | Sgot Aspartate Aminotransferase | Numeric | Amount of S.G.O.T. in patient |
8. | Total Proteins | Numeric | Protein content in patient |
9. | Albumin | Numeric | Amount of albumin in patient |
10. | Albumin and Globulin Ratio | Numeric | Ratio of albumin to globulin in patient |
11. | Class | Numeric [1,2] | Status of liver disease in patient |
Classifier | Case | GAN Accuracy | GAN Recall | GAN Precision | GAN F-Measure | SMOTE Accuracy | SMOTE Recall | SMOTE Precision | SMOTE F-Measure |
---|---|---|---|---|---|---|---|---|---|
SVM | NO-AUG | 0.70669 | 0.98558 | 0.71304 | 0.82745 | 0.8237 | 0.824 | 0.832 | 0.822 |
SVM | DD-AUG | 0.71254 | 0.98745 | 0.72004 | 0.83215 | 0.9182 | 0.918 | 0.923 | 0.918 |
SVM | TD-AUG | 0.71689 | 0.989104 | 0.73041 | 0.83545 | 0.9473 | 0.947 | 0.950 | 0.947 |
SVM | AVG (AUG) | 0.71472 | 0.988277 | 0.72523 | 0.8338 | 0.9328 | 0.933 | 0.937 | 0.933 |
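As a quick consistency check on the table above, the reported F-measure for the SVM NO-AUG row under the GAN columns agrees with the harmonic mean of the listed precision and recall. The sketch below assumes the standard F1 definition, F = 2PR / (P + R); this excerpt does not show the paper's own formula, and the helper name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# SVM, NO-AUG, GAN columns: precision 0.71304, recall 0.98558
svm_no_aug = f_measure(precision=0.71304, recall=0.98558)
print(round(svm_no_aug, 5))  # 0.82745, the value reported in the table
```

The same helper can be applied to any other row to see whether its reported F-measure follows from the listed precision and recall.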
Classifier | Case | GAN Accuracy | GAN Recall | GAN Precision | GAN F-Measure | SMOTE Accuracy | SMOTE Recall | SMOTE Precision | SMOTE F-Measure |
---|---|---|---|---|---|---|---|---|---|
D.T. | NO-AUG | 0.60163 | 0.71875 | 0.73105 | 0.72485 | 0.9462 | 0.946 | 0.948 | 0.946 |
D.T. | DD-AUG | 0.59455 | 0.74575 | 0.76924 | 0.74674 | 0.9746 | 0.975 | 0.975 | 0.975 |
D.T. | TD-AUG | 0.59278 | 0.7575 | 0.76553 | 0.74934 | 0.9829 | 0.983 | 0.983 | 0.983 |
D.T. | AVG | 0.59367 | 0.75163 | 0.76739 | 0.74804 | 0.9788 | 0.979 | 0.979 | 0.979 |
Classifier | Case | GAN Accuracy | GAN Recall | GAN Precision | GAN F-Measure | SMOTE Accuracy | SMOTE Recall | SMOTE Precision | SMOTE F-Measure |
---|---|---|---|---|---|---|---|---|---|
K.N.N. | NO-AUG | 0.67067 | 0.8101 | 0.74889 | 0.77829 | 0.9907 | 0.991 | 0.991 | 0.991 |
K.N.N. | DD-AUG | 0.69455 | 0.8145 | 0.75321 | 0.78524 | 0.9951 | 0.995 | 0.995 | 0.995 |
K.N.N. | TD-AUG | 0.69122 | 0.8784 | 0.77001 | 0.79245 | 0.9968 | 0.997 | 0.997 | 0.997 |
K.N.N. | AVG | 0.69289 | 0.8465 | 0.76161 | 0.78885 | 0.996 | 0.996 | 0.996 | 0.996 |
Classifier | Case | GAN Accuracy | GAN Recall | GAN Precision | GAN F-Measure | SMOTE Accuracy | SMOTE Recall | SMOTE Precision | SMOTE F-Measure |
---|---|---|---|---|---|---|---|---|---|
L.R. | NO-AUG | 0.74889 | 0.9976 | 0.71306 | 0.83166 | 0.9624 | 0.962 | 0.963 | 0.962 |
L.R. | DD-AUG | 0.75451 | 0.9862 | 0.71786 | 0.83517 | 0.9851 | 0.985 | 0.985 | 0.985 |
L.R. | TD-AUG | 0.75007 | 0.9954 | 0.71724 | 0.84006 | 0.9893 | 0.989 | 0.989 | 0.989 |
L.R. | AVG | 0.75229 | 0.9908 | 0.71755 | 0.83762 | 0.9872 | 0.987 | 0.987 | 0.987 |
Classifier | Case | GAN Accuracy | GAN Recall | GAN Precision | GAN F-Measure | SMOTE Accuracy | SMOTE Recall | SMOTE Precision | SMOTE F-Measure |
---|---|---|---|---|---|---|---|---|---|
ANN | NO-AUG | 0.6964 | 0.85817 | 0.75158 | 0.80135 | 0.5588 | 0.559 | 0.559 | 0.549 |
ANN | DD-AUG | 0.7024 | 0.8754 | 0.75471 | 0.80002 | 0.5473 | 0.547 | 0.546 | 0.546 |
ANN | TD-AUG | 0.7094 | 0.8813 | 0.75004 | 0.80081 | 0.5035 | 0.504 | 0.527 | 0.499 |
ANN | AVG | 0.7059 | 0.8784 | 0.75238 | 0.80042 | 0.5254 | 0.526 | 0.537 | 0.523 |
Research | Title | Method and Results |
---|---|---|
[19] | Accurate liver disease prediction system using convolutional neural network | MCNN-LDPS: 90.75%; M.L.P.N.N.: 86.70% |
[20] | Statistical Analysis and Identification of Important Factors of Liver Disease using Machine Learning and Deep Learning Architecture | ANN: 76.07%; DTREE: 76.07%; R.Forest: 74.36%; SVM: 74.35%; MLP: 74.36%; GNB: 74.50%; KNN: 78.63%; Logistic Regression: 73.50% |
[21] | Prediction of Liver Malady Using Advanced Classification Algorithms | ANN: 94.09%; SVM: 78.09% |
[22] | Prediction of Liver Disease using Rprop, S.A.G. and CNN | Rprop: 69.41%; S.A.G.: 68.82%; CNN: 96.07% |
[23] | Software-based prediction of liver disease with feature selection and classification techniques | LR, SMO, RF, NB, J48, and IBk; the best result is L.R.: 77.4% |
[40] | Supervised Machine Learning Models for Liver Disease Risk Prediction | F-measure of 80.1%, a precision of 80.4%, and an A.U.C. equal to 88.4% after SMOTE with 10-fold cross-validation |
[41] | A Hybrid Machine Learning algorithm for Heart and Liver Disease Prediction Using Modified Particle Swarm Optimization with Support Vector Machine | Recall (SVM): 62.93; Recall (P.S.O.S.V.M.): 83.62; Recall (C.P.S.O.S.V.M.): 96.55; Recall (CCPSOSVM): 97.41 |
[42] | Statistical Machine Learning Approaches to Liver Disease Prediction | RF: 98.14% accuracy |
[43] | Prediction of fatty liver disease using machine learning algorithms | Accuracy of R.F., NB, ANN, and LR: 87.48%, 82.65%, 81.85%, and 76.96%, respectively |
Proposed approach | Tabular Data Generation to Improve Classification of Liver Disease Diagnosis | ANN: 0.932 with SMOTE; SVM: 0.9328 with SMOTE; LR: 0.9872 with SMOTE; DT: 0.9788 with SMOTE; K-NN: 0.996 with SMOTE |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Alauthman, M.; Aldweesh, A.; Al-qerem, A.; Aburub, F.; Al-Smadi, Y.; Abaker, A.M.; Alzubi, O.R.; Alzubi, B. Tabular Data Generation to Improve Classification of Liver Disease Diagnosis. Appl. Sci. 2023, 13, 2678. https://doi.org/10.3390/app13042678