Pan-Cancer Classification of Gene Expression Data Based on Artificial Neural Network Model
Abstract
1. Introduction
2. Materials and Methods
2.1. Patients, Samples and Gene Expression Data
2.2. Data Processing
2.3. Neural Network Architecture
2.4. Random Forest
2.5. Extreme Gradient Boosting (XGBoost)
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Dataset | GEO ID | # of Cancer Samples | # of Normal Samples |
---|---|---|---|
Bladder Urothelial Carcinoma | GSE13507 | 165 | 10 |
Breast invasive carcinoma | GSE39004 | 61 | 47 |
Colon adenocarcinoma | GSE41657 | 25 | 12 |
Esophageal carcinoma | GSE20347 | 17 | 17 |
Head and Neck squamous cell carcinoma | GSE6631 | 22 | 22 |
Kidney Chromophobe | GSE15641 | 6 | 23 |
Kidney renal clear cell carcinoma | GSE15641 | 32 | 23 |
Kidney renal papillary cell carcinoma | GSE15641 | 11 | 23 |
Liver hepatocellular carcinoma | GSE45267 | 48 | 39 |
Lung squamous cell carcinoma | GSE33479 | 14 | 27 |
Lung adenocarcinoma | GSE10072 | 58 | 49 |
Prostate adenocarcinoma | GSE6919 | 65 | 63 |
Rectum adenocarcinoma | GSE20842 | 65 | 65 |
Stomach adenocarcinoma | GSE2685 | 21 | 8 |
Thyroid carcinoma | GSE33630 | 60 | 45 |
Uterine Corpus Endometrial Carcinoma | GSE17025 | 79 | 12 |
Total |  | 749 | 485 |
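All series in the table above are publicly available from the Gene Expression Omnibus. As a minimal sketch of how one of them could be retrieved programmatically, the snippet below uses the Python package GEOparse; this tooling, the cache directory, and the phenotype handling are assumptions for illustration only (the accession GSE13507 is taken from the table).

```python
# Minimal sketch: fetching one of the listed GEO series with GEOparse
# (assumed tooling, not necessarily the authors' workflow). The phenotype
# column that separates tumor from normal samples differs between series
# and is not handled here.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE13507", destdir="./geo_cache")  # bladder carcinoma series
expr = gse.pivot_samples("VALUE")   # probes x samples expression matrix
pheno = gse.phenotype_data          # per-sample annotations (tissue type, source, ...)
print(expr.shape, pheno.shape)
```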
Model | Parameters |
---|---|
ANN | Number of hidden layers = 2 |
 | Batch size = 8 |
 | Epochs = 200 |
 | Optimizer = Adam |
 | Loss = binary cross-entropy |
 | Hidden-layer activation function = ReLU |
 | Output-layer activation function = sigmoid |
RF | Number of trees = 500 |
 | Minimum size of terminal nodes = 1 |
 | Number of features sampled at each split = sqrt(p), where p is the total number of features |
XGBoost | Loss = mean squared error |
 | Tree method = gpu_hist |
 | Number of estimators = 100 |
 | Learning rate = 0.3 |
 | Gamma = 0 |
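For readers who want to reproduce a comparable setup, the sketch below instantiates the three classifiers with the parameters listed above, using Keras, scikit-learn, and xgboost. It is an illustration under stated assumptions, not the authors' code: the hidden-layer sizes (64 and 32) are not reported in the table and are chosen arbitrarily, scikit-learn's RandomForestClassifier stands in for the randomForest package, and the XGBoost model keeps the library's default binary-classification objective rather than the mean-squared-error loss listed above.

```python
# Hedged sketch of the three models configured with the parameters from the table.
from tensorflow import keras
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# ANN: two hidden layers with ReLU, sigmoid output, Adam optimizer,
# binary cross-entropy loss; the input size is inferred from the training data.
ann = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),    # hidden layer 1 (size assumed)
    keras.layers.Dense(32, activation="relu"),    # hidden layer 2 (size assumed)
    keras.layers.Dense(1, activation="sigmoid"),  # cancer-vs-normal probability
])
ann.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# ann.fit(X_train, y_train, batch_size=8, epochs=200)  # batch size 8, 200 epochs

# Random forest: 500 trees, terminal nodes of minimum size 1,
# sqrt(p) candidate features at each split.
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=1, max_features="sqrt")

# XGBoost: 100 estimators, learning rate 0.3, gamma 0, GPU histogram tree method.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.3, gamma=0, tree_method="gpu_hist")
```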
Dataset | Sensitivity (ANN) | Sensitivity (RF) | Sensitivity (XGBoost) | Specificity (ANN) | Specificity (RF) | Specificity (XGBoost)
---|---|---|---|---|---|---
Bladder Urothelial Carcinoma (GSE13507) | 0.85 ± 0.09 | 0 ± 0 | 0.97 ± 0.03 | 0.58 ± 0.22 | 1 ± 0 | 0.79 ± 0.24
Breast invasive carcinoma (GSE39004) | 0.80 ± 0.1 | 0.84 ± 0.09 | 0.87 ± 0.06 | 0.82 ± 0.14 | 0.79 ± 0.1 | 0.75 ± 0.07
Colon adenocarcinoma (GSE41657) | 0.75 ± 0.24 | 1 ± 0 | 0.97 ± 0.06 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
Esophageal carcinoma (GSE20347) | 0.85 ± 0.27 | 0.97 ± 0.09 | 0.89 ± 0.21 | 1 ± 0 | 0.81 ± 0.28 | 0.98 ± 0.08 |
Head and Neck squamous cell carcinoma (GSE6631) | 0.69 ± 0.27 | 0.96 ± 0.08 | 0.93 ± 0.08 | 0.92 ± 0.09 | 0.84 ± 0.14 | 0.92 ± 0.11 |
Kidney Chromophobe (GSE15641) | 0.76 ± 0.33 | 0.97 ± 0.09 | 0.95 ± 0.16 | 0.84 ± 0.16 | 0.67 ± 0.45 | 0.93 ± 0.14 |
Kidney renal clear cell carcinoma (GSE15641) | 0.97 ± 0.05 | 1 ± 0 | 0.91 ± 0.17 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
Kidney renal papillary cell carcinoma (GSE15641) | 0.72 ± 0.37 | 1 ± 0 | 1 ± 0 | 0.9 ± 0.26 | 1 ± 0 | 0.93 ± 0.08 |
Liver hepatocellular carcinoma (GSE45267) | 0.85 ± 0.1 | 0.97 ± 0.04 | 0.92 ± 0.09 | 0.81 ± 0.12 | 0.73 ± 0.06 | 0.83 ± 0.13 |
Lung squamous cell carcinoma (GSE33479) | 0.77 ± 0.25 | 0.91 ± 0.16 | 0.82 ± 0.15 | 0.87 ± 0.11 | 0.81 ± 0.16 | 0.89 ± 0.20 |
Lung adenocarcinoma (GSE10072) | 0.79 ± 0.14 | 0.99 ± 0.02 | 0.94 ± 0.03 | 0.89 ± 0.08 | 0.95 ± 0.04 | 0.98 ± 0.03 |
Prostate adenocarcinoma (GSE6919) | 0.57 ± 0.09 | 0.55 ± 0.16 | 0.62 ± 0.17 | 0.64 ± 0.14 | 0.69 ± 0.17 | 0.61 ± 0.17 |
Rectum adenocarcinoma (GSE20842) | 0.93 ± 0.05 | 1 ± 0 | 0.99 ± 0.02 | 0.92 ± 0.07 | 1 ± 0 | 1 ± 0 |
Stomach adenocarcinoma (GSE2685) | 0.85 ± 0.11 | 0.92 ± 0.19 | 0.86 ± 0.2 | 1 ± 0 | 0.76 ± 0.22 | 0.96 ± 0.13 |
Thyroid carcinoma (GSE33630) | 0.85 ± 0.1 | 0.89 ± 0.07 | 0.89 ± 0.1 | 0.73 ± 0.15 | 0.92 ± 0.05 | 0.77 ± 0.1 |
Uterine Corpus Endometrial Carcinoma (GSE17025) | 0.94 ± 0.06 | 0.71 ± 0.24 | 0.95 ± 0.07 | 0.83 ± 0.17 | 1 ± 0 | 0.89 ± 0.28 |
Average | 0.81 | 0.85 | 0.90 | 0.86 | 0.87 | 0.89
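Sensitivity and specificity in the table follow the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), reported as mean ± standard deviation across validation folds. The helper below is an illustrative implementation of these formulas; the label coding (1 = cancer, 0 = normal) and the per-fold values are assumptions made for the example.

```python
# Illustrative helper (not from the paper) implementing the definitions used
# in the results table: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),
# assuming labels are coded 1 = cancer, 0 = normal.
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Aggregating per-fold results into the "mean ± std" format of the table
# (the fold values below are made up for the example).
fold_sensitivities = [0.85, 0.80, 0.90]
print(f"{np.mean(fold_sensitivities):.2f} ± {np.std(fold_sensitivities):.2f}")
```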
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).