Metabolomics Biomarker Discovery to Optimize Hepatocellular Carcinoma Diagnosis: Methodology Integrating AutoML and Explainable Artificial Intelligence
Abstract
1. Introduction
2. Materials and Methods
2.1. Subjects, Data, and Features
2.2. Automated Machine Learning with TPOT
2.3. Model Explanation Using TreeSHAP
2.4. Machine Learning Pipeline
2.5. Statistical Analysis
3. Results
3.1. Univariate Analysis Results
3.2. Model Evaluation and Performance
3.3. Explaining the AutoML Pipeline Ensemble Using SHAP
4. Discussion
4.1. Model Performance and Interpretability Based on Metabolomics Biomarker Discovery
4.2. Comparison with the Previous Literature
4.3. Clinical Implementation, Potential Contributions, and Future Directions
5. Limitation
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Metabolite Name * | CIRR Group | HCC Group | p-Value |
---|---|---|---|
2,3-butanediol 2 | 10650000 (5477273.75) | 9922545 (6109323.25) | 0.519 |
2-hydroxybenzyl alcohol | 548521.5 (175550.5) | 539983.5 (132430) | 0.547 |
alpha-tocopherol | 1406009 (772343) | 1671573.5 (675712) | 0.075 |
alpha-D-glucosamine 1-phosphate | 451731 (6850415.25) | 1359994 (11365839.25) | 0.123 |
arabitol | 378181.5 (390092.75) | 355639.5 (258970.75) | 0.240 |
arachidic acid | 418434.5 (291183.25) | 330568 (430527) | 0.745 |
cholesterol | 92450000 (39000000) | 93400000 (32175000) | 0.554 |
citric acid | 20400000 (11675000) | 18900000 (8450000) | 0.128 |
Creatinine | 1917199 (2062927) | 1211467.5 (1116398.5) | 0.053 |
D-glucose 2 [17,625]/[24,749] D-glucose 1 | 113000000 (37175000) | 111500000 (42125000) | 0.785 |
D-malic acid | 467819.5 (255174.25) | 378374.5 (348014.25) | 0.251 |
D-threitol | 1000500 (1225931) | 654203.5 (405024) | 0.014 |
diglycerol 2 | 221418 (384361) | 170021 (201316.25) | 0.081 |
DL-isoleucine 1 | 1199501.5 (772889) | 1433657 (1229352) | 0.194 |
DL-isoleucine 2 | 1131049 (618355.25) | 1424988 (1050869.25) | 0.051 |
ethanolamine | 1352473 (958388.25) | 1068951 (654469) | 0.075 |
glyceric acid | 683817.5 (560978) | 625615 (518581.75) | 0.577 |
glycine | 9695280.5 (4818167) | 7312240 (2826197.5) | 0.003 |
glycine-d5 deuterated | 23850000 (10650000) | 21500000 (12225000) | 0.293 |
L-sorbose 2 | 1185333 (1842600.25) | 1185925 (1437230.75) | 0.596 |
L-(+) lactic acid | 94950000 (48875000) | 95500000 (44650000) | 0.594 |
L-alanine-2,3,3,3-d4 | 1054093 (1392805.5) | 663959 (978040) | 0.013 |
L-cystine 3 | 2434071 (2331622) | 2399262.5 (2699936.75) | 0.793 |
L-glutamic acid 2 | 384084.5 (364430.5) | 357128.5 (432405) | 0.685 |
L-glutamic acid-2,3,3,4,4-d5 2 | 1178557.5 (414761) | 1010110 (421477) | 0.047 |
L-glutamic acid-2,3,3,4,4-d5 3 (dehydrated) | 1879400 (698975.25) | 1539727 (1043167) | 0.351 |
L-homoserine 3 | 73072.5 (76962.5) | 57437 (43002.75) | 0.383 |
L-leucine 1 | 3958303 (3744917.75) | 4579521 (4642521.5) | 0.091 |
L-phenylalanine-phenyl-d5-2,3,3-d3 2 | 5153990.5 (2038524.25) | 4548072 (2141348.5) | 0.144 |
L-proline 2 | 2773618 (1277443) | 2310765 (1221758) | 0.209 |
L-pyroglutamic acid/glutamic acid | 6786754.5 (2598769) | 5370125.5 (2845517.75) | 0.007 |
L-serine 1 | 2358091 (1530538.25) | 2598062 (1569120.25) | 0.380 |
L-threonine 1 | 3226459 (1786736.5) | 3165103 (1790904) | 0.904 |
L-threonine 2 | 4140405.5 (2479098.5) | 3530755 (2347793.5) | 0.365 |
L-tyrosine-3,3-d2 2 | 25400000 (9275000) | 25950000 (15525000) | 0.868 |
L-valine 1 | 2502653.5 (2499262.75) | 3654925 (3262877.5) | 0.008 |
L-valine 2 | 2380131.5 (2110335.5) | 2615713 (2691378.5) | 0.425 |
lactulose 1 | 223597 (614729) | 186756.5 (535900.75) | 0.621 |
lauric acid | 252586.5 (223034.5) | 205633.5 (197828.75) | 0.170 |
linoleic acid | 5677306.5 (7210283.5) | 9848396 (9099414.25) | 0.006 |
myo-inositol | 8412012 (10620568.5) | 7559708 (5249310) | 0.613 |
Myristic Acid d27 | 6418506.5 (2757576.75) | 6669544 (3364288.5) | 0.634 |
N-acetyl-5-hydroxytryptamine 1 | 105500000 (21350000) | 104500000 (45800000) | 0.968 |
oxalic acid | 28250000 (16350000) | 31300000 (16975000) | 0.322 |
palmitic acid | 27600000 (13550000) | 31750000 (16125000) | 0.179 |
Phenylalanine 1 | 2002936 (1742651) | 1493905.5 (894662.75) | 0.040 |
phosphoric acid | 84250000 (27750000) | 82800000 (39875000) | 0.610 |
putrescine | 896244.5 (812358) | 672440 (488520.75) | 0.342 |
ribitol | 634169 (501046) | 487043.5 (320140) | 0.097 |
ribose | 77076.5 (78126.25) | 43190 (51127.25) | 0.165 |
stearic acid | 44650000 (21650000) | 49150000 (21050000) | 0.182 |
tagatose 1 | 3312035 (16452131) | 1204101 (2889017.25) | 0.006 |
trans-aconitic acid | 112768 (80640.5) | 96625 (70570.5) | 0.457 |
tyramine | 3083548.5 (3382113.25) | 1700785 (2529014) | 0.095 |
tyrosine 2 | 24800000 (11975000) | 24050000 (14600000) | 0.872 |
urea | 98800000 (66050000) | 99850000 (60625000) | 0.788 |
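
The univariate table above can be screened programmatically. The following is a minimal sketch, not the authors' analysis code: it copies a handful of rows from the table, keeps those whose p-value falls below a significance threshold, and reports the direction of change in HCC relative to CIRR. The threshold of 0.05 is an assumption (the conventional choice), as is reading the first number in each cell as the group's central value; the names `ALPHA`, `rows`, and `screen` are hypothetical.

```python
# Assumed significance threshold; the table itself does not state one.
ALPHA = 0.05

# (metabolite, CIRR value, HCC value, p-value), transcribed from the table above.
# Only a subset of rows is included to keep the sketch short.
rows = [
    ("glycine", 9695280.5, 7312240.0, 0.003),
    ("L-valine 1", 2502653.5, 3654925.0, 0.008),
    ("linoleic acid", 5677306.5, 9848396.0, 0.006),
    ("tagatose 1", 3312035.0, 1204101.0, 0.006),
    ("Creatinine", 1917199.0, 1211467.5, 0.053),
    ("urea", 98800000.0, 99850000.0, 0.788),
]


def screen(table_rows, alpha=ALPHA):
    """Return rows with p < alpha, with the HCC/CIRR ratio and direction."""
    kept = []
    for name, cirr, hcc, p_value in table_rows:
        if p_value < alpha:
            direction = "higher in HCC" if hcc > cirr else "lower in HCC"
            kept.append((name, p_value, hcc / cirr, direction))
    # Smallest p-value first; ties keep their original table order.
    return sorted(kept, key=lambda r: r[1])


significant = screen(rows)
for name, p_value, ratio, direction in significant:
    print(f"{name}: p={p_value:.3f}, HCC/CIRR={ratio:.2f} ({direction})")
```

In this subset, glycine and tagatose 1 come out lower in HCC, while L-valine 1 and linoleic acid come out higher; creatinine (p = 0.053) and urea fall above the assumed threshold and are dropped.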
Model | Train Set AUC | Test Set AUC | Train Set Accuracy | Test Set Accuracy | Train Set Sensitivity | Test Set Sensitivity | Train Set Specificity | Test Set Specificity |
---|---|---|---|---|---|---|---|---|
TPOT | 0.80 ± 0.02 | 0.81 | 0.85 ± 0.01 | 0.85 | 0.84 ± 0.03 | 0.84 | 0.85 ± 0.01 | 0.83 |
RF | 0.72 ± 0.02 | 0.70 | 0.74 ± 0.03 | 0.72 | 0.72 ± 0.04 | 0.71 | 0.75 ± 0.03 | 0.71 |
SVM | 0.70 ± 0.03 | 0.68 | 0.72 ± 0.03 | 0.70 | 0.70 ± 0.04 | 0.69 | 0.73 ± 0.04 | 0.70 |
k-NN | 0.66 ± 0.03 | 0.65 | 0.70 ± 0.03 | 0.68 | 0.68 ± 0.04 | 0.67 | 0.72 ± 0.03 | 0.68 |
AutoSklearn | 0.75 ± 0.01 | 0.77 | 0.75 ± 0.02 | 0.73 | 0.70 ± 0.02 | 0.74 | 0.77 ± 0.01 | 0.74 |
H2O AutoML | 0.74 ± 0.02 | 0.75 | 0.76 ± 0.03 | 0.75 | 0.74 ± 0.03 | 0.73 | 0.77 ± 0.02 | 0.73 |
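
To make the comparison in the performance table easier to read, the sketch below ranks the six models by test-set AUC and derives balanced accuracy, (sensitivity + specificity) / 2, from the reported test-set sensitivity and specificity. It is an illustrative reading of the table, not part of the published pipeline; the names `test_metrics`, `balanced_accuracy`, and `ranking` are hypothetical, and balanced accuracy is a derived quantity that the table does not report.

```python
# Test-set metrics transcribed from the table above:
# model -> (AUC, sensitivity, specificity)
test_metrics = {
    "TPOT": (0.81, 0.84, 0.83),
    "RF": (0.70, 0.71, 0.71),
    "SVM": (0.68, 0.69, 0.70),
    "k-NN": (0.65, 0.67, 0.68),
    "AutoSklearn": (0.77, 0.74, 0.74),
    "H2O AutoML": (0.75, 0.73, 0.73),
}


def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity (derived, not reported in the table)."""
    return (sensitivity + specificity) / 2


# Models ordered from highest to lowest test-set AUC.
ranking = sorted(test_metrics, key=lambda m: test_metrics[m][0], reverse=True)

for model in ranking:
    auc, sens, spec = test_metrics[model]
    print(f"{model}: AUC={auc:.2f}, balanced accuracy={balanced_accuracy(sens, spec):.3f}")
```

On these figures the three AutoML frameworks (TPOT, AutoSklearn, H2O AutoML) occupy the top three positions by test-set AUC, ahead of RF, SVM, and k-NN.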
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yagin, F.H.; El Shawi, R.; Algarni, A.; Colak, C.; Al-Hashem, F.; Ardigò, L.P. Metabolomics Biomarker Discovery to Optimize Hepatocellular Carcinoma Diagnosis: Methodology Integrating AutoML and Explainable Artificial Intelligence. Diagnostics 2024, 14, 2049. https://doi.org/10.3390/diagnostics14182049