Using Machine Learning Methods to Study Colorectal Cancer Tumor Micro-Environment and Its Biomarkers
Abstract
1. Introduction
2. Results
2.1. Overview of Datasets
2.2. Class Balancing and Feature Selection
2.3. Classification Rules
2.4. Co-Predictive Network
2.5. Survival Analysis and Differential Expression Analysis
2.6. Immune Infiltrating and Gene Mutation Analysis
3. Discussion
4. Materials and Methods
4.1. Data and Preprocessing
4.2. Differentially Expressed Genes
4.3. Class Imbalance Correction
4.4. Feature Selection
4.4.1. MCFS
4.4.2. Boruta
4.4.3. mRMR
4.4.4. LightGBM
4.5. Incremental Feature Selection
4.5.1. SVM
4.5.2. XGBoost
4.5.3. Random Forest
4.5.4. kNN
4.6. Rule-Based Classification
4.7. Co-Predictive Network
4.8. Protein–Protein Interaction Analysis
4.9. Survival Analysis
4.10. Evaluation of Infiltrating Immune Cells
4.11. Gene Mutation Analysis
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249.
- Arnold, M.; Sierra, M.S.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global patterns and trends in colorectal cancer incidence and mortality. Gut 2017, 66, 683–691.
- Dekker, E.; Tanis, P.J.; Vleugels, J.L.A.; Kasi, P.M.; Wallace, M.B. Colorectal cancer. Lancet 2019, 394, 1467–1480.
- Grady, W.M.; Carethers, J.M. Genomic and epigenetic instability in colorectal cancer pathogenesis. Gastroenterology 2008, 135, 1079–1099.
- Zheng, Z.; Yu, T.; Zhao, X.; Gao, X.; Zhao, Y.; Liu, G. Intratumor heterogeneity: A new perspective on colorectal cancer research. Cancer Med. 2020, 9, 7637–7645.
- Arnadottir, S.S.; Mattesen, T.B.; Vang, S.; Madsen, M.R.; Madsen, A.H.; Birkbak, N.J.; Bramsen, J.B.; Andersen, C.L. Transcriptomic and proteomic intra-tumor heterogeneity of colorectal cancer varies depending on tumor location within the colorectum. PLoS ONE 2020, 15, e0241148.
- Dunne, P.D.; McArt, D.G.; Bradley, C.A.; O’Reilly, P.G.; Barrett, H.L.; Cummins, R.; O’Grady, T.; Arthur, K.; Loughrey, M.B.; Allen, W.L.; et al. Challenging the Cancer Molecular Stratification Dogma: Intratumoral Heterogeneity Undermines Consensus Molecular Subtypes and Potential Diagnostic Value in Colorectal Cancer. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. 2016, 22, 4095–4104.
- Zhuang, Y.; Wang, H.; Jiang, D.; Li, Y.; Feng, L.; Tian, C.; Pu, M.; Wang, X.; Zhang, J.; Hu, Y.; et al. Multi gene mutation signatures in colorectal cancer patients: Predict for the diagnosis, pathological classification, staging and prognosis. BMC Cancer 2021, 21, 380.
- Li, B.-Q.; Huang, T.; Liu, L.; Cai, Y.-D.; Chou, K.-C. Identification of colorectal cancer related genes with mRMR and shortest path in protein-protein interaction network. PLoS ONE 2012, 7, e33393.
- Hozhabri, H.; Lashkari, A.; Razavi, S.M.; Mohammadian, A. Integration of gene expression data identifies key genes and pathways in colorectal cancer. Med. Oncol. 2021, 38, 7.
- Paget, S. The distribution of secondary growths in cancer of the breast. Cancer Metastasis Rev. 1989, 8, 98–101.
- Nenkov, M.; Ma, Y.; Gassler, N.; Chen, Y. Metabolic Reprogramming of Colorectal Cancer Cells and the Microenvironment: Implication for Therapy. Int. J. Mol. Sci. 2021, 22, 6262.
- Bindea, G.; Mlecnik, B.; Tosolini, M.; Kirilovsky, A.; Waldner, M.; Obenauf, A.C.; Angell, H.; Fredriksen, T.; Lafontaine, L.; Berger, A.; et al. Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity 2013, 39, 782–795.
- Hernandez-Camarero, P.; Lopez-Ruiz, E.; Marchal, J.A.; Peran, M. Cancer: A mirrored room between tumor bulk and tumor microenvironment. J. Exp. Clin. Cancer Res. 2021, 40, 217.
- Liu, Z.; Liu, L.; Weng, S.; Guo, C.; Dang, Q.; Xu, H.; Wang, L.; Lu, T.; Zhang, Y.; Sun, Z.; et al. Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer. Nat. Commun. 2022, 13, 816.
- Fortino, V.; Wisgrill, L.; Werner, P.; Suomela, S.; Linder, N.; Jalonen, E.; Suomalainen, A.; Marwah, V.; Kero, M.; Pesonen, M.; et al. Machine-learning–driven biomarker discovery for the discrimination between allergic and irritant contact dermatitis. Proc. Natl. Acad. Sci. USA 2020, 117, 33474–33485.
- Yang, M.; Yang, H.; Ji, L.; Hu, X.; Tian, G.; Wang, B.; Yang, J. A multi-omics machine learning framework in predicting the survival of colorectal cancer patients. Comput. Biol. Med. 2022, 146, 105516.
- Jiang, D.; Liao, J.; Duan, H.; Wu, Q.; Owen, G.; Shu, C.; Chen, L.; He, Y.; Wu, Z.; He, D.; et al. A machine learning-based prognostic predictor for stage III colon cancer. Sci. Rep. 2020, 10, 1–9.
- Draminski, M.; Rada-Iglesias, A.; Enroth, S.; Wadelius, C.; Koronacki, J.; Komorowski, J. Monte Carlo feature selection for supervised classification. Bioinformatics 2008, 24, 110–117.
- Kursa, M.B.; Jankowski, A.; Rudnicki, W.R. Boruta—A system for feature selection. Fundam. Inform. 2010, 101, 271–285.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 52.
- Bhasin, M.; Raghava, G. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 2004, 32, W414–W419.
- Li, W.; Yin, Y.; Quan, X.; Zhang, H. Gene expression value prediction based on XGBoost algorithm. Front. Genet. 2019, 10, 1077.
- Kouzani, A.Z. Subcellular localisation of proteins in fluorescent microscope images using a random forest. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 3926–3932.
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Fajarda, O.; Duarte-Pereira, S.; Silva, R.M.; Oliveira, J.L. Merging microarray studies to identify a common gene expression signature to several structural heart diseases. BioData Min. 2020, 13, 8.
- Rudin, C.; Radin, J. Why are we using black box models in AI when we don’t need to? A lesson from an explainable AI competition. Harv. Data Sci. Rev. 2019, 1, 10–1162.
- Garbulowski, M.; Smolinska, K.; Diamanti, K.; Pan, G.; Maqbool, K.; Feuk, L.; Komorowski, J. Interpretable machine learning reveals dissimilarities between subtypes of autism spectrum disorder. Front. Genet. 2021, 12, 618277.
- Komorowski, J. Learning rule-based models-the rough set approach. In Amsterdam: Comprehensive Biomedical Physics; Uppsala University: Uppsala, Sweden, 2014.
- Kawada, J.-I.; Takeuchi, S.; Imai, H.; Okumura, T.; Horiba, K.; Suzuki, T.; Torii, Y.; Yasuda, K.; Imanaka-Yoshida, K.; Ito, Y. Immune cell infiltration landscapes in pediatric acute myocarditis analyzed by CIBERSORT. J. Cardiol. 2021, 77, 174–178.
- Germline variation in NCF4, an innate immunity gene, is associated with an increased risk of colorectal cancer. Int. J. Cancer 2014, 134, 1399–1407.
- Brouwer-Visser, J.; Cheng, W.-Y.; Bauer-Mehren, A.; Maisel, D.; Lechner, K.; Andersson, E.; Dudley, J.T.; Milletti, F. Regulatory T-cell genes drive altered immune microenvironment in adult solid cancers and allow for immune contextual patient subtyping. Cancer Epidemiol. Biomark. Prev. 2018, 27, 103–112.
- Zhao, K.; Yi, Y.; Ma, Z.; Zhang, W. INHBA is a prognostic biomarker and correlated with immune cell infiltration in cervical cancer. Front. Genet. 2022, 12, 705512.
- Ma, Y.-F.; Li, G.-D.; Sun, X.; Li, X.-X.; Gao, Y.; Gao, C.; Cao, K.X.; Yang, G.W.; Yu, M.W.; Wang, X.M. Identification of FAM107A as a potential biomarker and therapeutic target for prostate carcinoma. Am. J. Transl. Res. 2021, 13, 10163.
- Chen, M.S.; Kim, H.; Jagot-Lacoussiere, L.; Maurel, P. Cadm3 (Necl-1) interferes with the activation of the PI3 kinase/Akt signaling cascade and inhibits Schwann cell myelination in vitro. Glia 2016, 64, 2247–2262.
- Mazzoccoli, G.; Pazienza, V.; Panza, A.; Valvano, M.R.; Benegiamo, G.; Vinciguerra, M.; Andriulli, A.; Piepoli, A. ARNTL2 and SERPINE1: Potential biomarkers for tumor aggressiveness in colorectal cancer. J. Cancer Res. Clin. Oncol. 2012, 138, 501–511.
- Susmi, T.F.; Rahman, A.; Khan, M.M.R.; Yasmin, F.; Islam, M.S.; Nasif, O.; Alharbi, S.A.; Batiha, G.E.-S.; Hossain, M.U. Prognostic and clinicopathological insights of phosphodiesterase 9A gene as novel biomarker in human colorectal cancer. BMC Cancer 2021, 21, 577.
- Wang, G.; Zhou, X.; Li, Y.; Zhao, M.; Zou, Y.; Lu, Q.; Wu, Y. Comprehensive Multiomics Analysis Identified IQGAP3 as a Potential Prognostic Marker in Pan-Cancer. Dis. Markers 2022, 2022, 4822964.
- Wu, M.; Li, X.; Huang, W.; Chen, Y.; Wang, B.; Liu, X. Ubiquitin-conjugating enzyme E2T (UBE2T) promotes colorectal cancer progression by facilitating ubiquitination and degradation of p53. Clin. Res. Hepatol. Gastroenterol. 2021, 45, 101493.
- Sharma, M.; Anandram, S.; Ross, C.; Srivastava, S. FUBP3 regulates chronic myeloid leukaemia progression through PRC2 complex regulated PAK1-ERK signalling. J. Cell. Mol. Med. 2023, 27, 15–29.
- Wang, Z.; Tian, Z.; Song, X.; Zhang, J. Membrane tension sensing molecule-FNBP1 is a prognostic biomarker related to immune infiltration in BRCA, LUAD and STAD. BMC Immunol. 2022, 23, 1.
- Shou-peng, W.; Zi-xue, D.; Jian, M.; Meng, L.; Xiao-dong, L.; Zhuang, Y. Expression and clinical significance of HIST1H2BH in head and neck squamous cell carcinoma. Shanghai J. Stomatol. 2021, 30, 599.
- Chen, S.; Gong, Y.; Shen, Y.; Liu, Y.; Fu, Y.; Dai, Y.; Rehman, A.U.; Tang, L.; Liu, H. INHBA is a novel mediator regulating cellular senescence and immune evasion in colorectal cancer. J. Cancer 2021, 12, 5938–5949.
- Li, X.; Yu, W.; Liang, C.; Xu, Y.; Zhang, M.; Ding, X.; Cai, X. INHBA is a prognostic predictor for patients with colon adenocarcinoma. BMC Cancer 2020, 20, 305.
- Sun, X.; Chen, D.; Jin, Z.; Chen, T.; Lin, A.; Jin, H.; Zhu, Y.; Lai, M. Genome-wide methylation and expression profiling identify methylation-associated genes in colorectal cancer. Epigenomics 2020, 12, 19–36.
- Maresca, K.P.; Chen, J.; Mathur, D.; Giddabasappa, A.; Root, A.; Narula, J.; King, L.; Schaer, D.; Golas, J.; Kobylarz, K.; et al. Preclinical Evaluation of 89Zr-Df-IAB22M2C PET as an Imaging Biomarker for the Development of the GUCY2C-CD3 Bispecific PF-07062119 as a T Cell Engaging Therapy. Mol. Imaging Biol. 2021, 23, 941–951.
- Ren, J.; Guo, W.; Feng, K.; Huang, T.; Cai, Y. Identifying MicroRNA Markers That Predict COVID-19 Severity Using Machine Learning Methods. Life 2022, 12, 1964.
- The Cancer Genome Atlas Research Network; Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Mills Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
- Clough, E.; Barrett, T. The gene expression omnibus database. In Statistical Genomics: Methods and Protocols; Springer: Berlin/Heidelberg, Germany, 2016; pp. 93–110.
- Davis, S.; Meltzer, P.S. GEOquery: A bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 2007, 23, 1846–1847.
- Colaprico, A.; Silva, T.C.; Olsen, C.; Garofano, L.; Cava, C.; Garolini, D.; Sabedot, T.S.; Malta, T.M.; Pagnotta, S.M.; Castiglioni, I.; et al. TCGAbiolinks: An R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016, 44, e71.
- Gautier, L.; Cope, L.; Bolstad, B.M.; Irizarry, R.A. affy—Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20, 307–315.
- Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 2008, 9, 559.
- Leek, J.T.; Johnson, W.E.; Parker, H.S.; Jaffe, A.E.; Storey, J.D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 2012, 28, 882–883.
- Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Pan, X.; Chen, L.; Liu, M.; Niu, Z.; Huang, T.; Cai, Y.-D. Identifying protein subcellular locations with embeddings-based node2loc. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 666–675.
- Hao, M.; Wang, Y.; Bryant, S.H. An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal. Chim. Acta 2014, 806, 117–127.
- Dramiński, M.; Koronacki, J. rmcfs: An R package for Monte Carlo feature selection and interdependency discovery. J. Stat. Softw. 2018, 85, 1–28.
- Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13.
- De Jay, N.; Papillon-Cavanagh, S.; Olsen, C.; El-Hachem, N.; Bontempi, G.; Haibe-Kains, B. mRMRe: An R package for parallelized mRMR ensemble feature selection. Bioinformatics 2013, 29, 2365–2368.
- Wang, D.; Zhang, Y.; Zhao, Y. LightGBM: An effective miRNA classification method in breast cancer patients. In Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics, Newark, NJ, USA, 18–20 October 2017; pp. 7–11.
- Liu, H.; Setiono, R. Incremental feature selection. Appl. Intell. 1998, 9, 217–230.
- Krawczuk, J.; Lukaszuk, T. The feature selection bias problem in relation to high-dimensional gene data. Artif. Intell. Med. 2016, 66, 63–71.
- Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (BBA)-Protein Struct. 1975, 405, 442–451.
- Mandrekar, J.N. Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol. 2010, 5, 1315–1316.
- Dimitriadou, E.; Hornik, K.; Leisch, F.; Meyer, D.; Weingessel, A.; Leisch, M.F. The E1071 Package. Misc Functions of Department of Statistics (e1071), TU Wien. 2006; pp. 297–304. Available online: https://rdrr.io/rforge/e1071/ (accessed on 24 June 2023).
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. Xgboost: Extreme gradient boosting. In R Package Version 04-2; 2015; Volume 1, pp. 1–4.
- RColorBrewer, S.; Liaw, M.A. Package ‘Randomforest’; University of California, Berkeley: Berkeley, CA, USA, 2018.
- Ripley, B.D.; Venable, W. R Package: Class. Functions for Classification 2019. Available online: https://cran.r-project.org/web/packages/class/class.pdf (accessed on 25 March 2023).
- Garbulowski, M.; Diamanti, K.; Smolińska, K.; Baltzer, N.; Stoll, P.; Bornelöv, S.; Øhrn, A.; Feuk, L.; Komorowski, J.R. ROSETTA: An interpretable machine learning framework. BMC Bioinform. 2021, 22, 625905.
- Johnson, D.S. Approximation algorithms for combinatorial problems. In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, Austin, TX, USA, 30 April–2 May 1973; pp. 38–49.
- Lenzerini, M. Data integration: A theoretical perspective. In Proceedings of the Twenty-First ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Madison, WI, USA, 3–5 June 2002; pp. 233–246.
- Smolinska, K.; Garbulowski, M.; Diamanti, K.; Davoy, X.; Anyango, S.O.O.; Barrenäs, F.; Bornelöv, S.; Komorowski, J. VisuNet: An Interactive Tool for Rule Network Visualization of Rule-Based Learning Models. 2021. Available online: https://www.diva-portal.org/smash/get/diva2:1602210/FULLTEXT02 (accessed on 24 June 2023).
- Szklarczyk, D.; Kirsch, R.; Koutrouli, M.; Nastou, K.; Mehryary, F.; Hachilif, R.; Gable, A.L.; Fang, T.; Doncheva, N.T.; Pyysalo, S.; et al. The STRING database in 2023: Protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023, 51, D638–D646.
- Covani, U.; Marconcini, S.; Derchi, G.; Barone, A.; Giacomelli, L. Relationship between human periodontitis and type 2 diabetes at a genomic level: A data-mining study. J. Periodontol. 2009, 80, 1265–1273.
- Tang, Z.; Li, C.; Kang, B.; Gao, G.; Li, C.; Zhang, Z. GEPIA: A web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res. 2017, 45, W98–W102.
- Rich, J.T.; Neely, J.G.; Paniello, R.C.; Voelker, C.C.; Nussenbaum, B.; Wang, E.W. A practical guide to understanding Kaplan-Meier curves. Otolaryngol.—Head Neck Surg. 2010, 143, 331–336.
- Jenkins, S.P. Survival Analysis; Unpublished Manuscript; Institute for Social and Economic Research, University of Essex: Colchester, UK, 2005; Volume 42, pp. 54–56.
- Newman, A.M.; Liu, C.L.; Green, M.R.; Gentles, A.J.; Feng, W.; Xu, Y.; Hoang, C.D.; Diehn, M.; Alizadeh, A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 2015, 12, 453–457.
- Mayakonda, A.; Lin, D.-C.; Assenov, Y.; Plass, C.; Koeffler, H.P. Maftools: Efficient and comprehensive analysis of somatic variants in cancer. Genome Res. 2018, 28, 1747–1756.
Dataset | Source | Series | No. (CRCs) | No. (Controls) | No. (Genes) |
---|---|---|---|---|---|
DS1 | Ryan et al., 2014 [32] | GSE44861 | 56 | 55 | 22,278 |
DS2 | Maisel D et al., 2018 [33] | GSE103512 | 57 | 12 | 20,742 |
DS3 | https://portal.gdc.cancer.gov/ (accessed on 20 February 2023) | TCGA-COAD/READ | 647 | 51 | 19,935 |
Datasets | Confirmed | Tentative | Rejected | Recalculated |
---|---|---|---|---|
DS1 | 51 | 43 | 13,329 | 78 |
DS2 | 84 | 68 | 20,589 | 117 |
DS3 | 146 | 268 | 16,310 | 213 |
Feature Ranking Algorithm | Classification Algorithm | No. Features | MCC | ACC |
---|---|---|---|---|
MCFS | SVM | 16 | 0.756 | 0.875 |
MCFS | XGBoost | 22 | 0.645 | 0.813 |
MCFS | RF | 17 | 0.700 | 0.844 |
MCFS | kNN | 20 | 0.756 | 0.844 |
Boruta | SVM | 13 | 0.756 | 0.875 |
Boruta | XGBoost | 11 | 0.776 | 0.875 |
Boruta | RF | 21 | 0.756 | 0.875 |
Boruta | kNN | 20 | 0.756 | 0.875 |
mRMR | SVM | 10 | 0.756 | 0.812 |
mRMR | XGBoost | 9 | 0.592 | 0.781 |
mRMR | RF | 12 | 0.756 | 0.781 |
mRMR | kNN | 9 | 0.756 | 0.844 |
LightGBM | SVM | 12 | 0.756 | 0.875 |
LightGBM | XGBoost | 5 | 0.625 | 0.813 |
LightGBM | RF | 9 | 0.814 | 0.906 |
LightGBM | kNN | 9 | 0.750 | 0.844 |
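
The MCC and ACC columns above are both derived from a binary confusion matrix. As a minimal sketch of how the two metrics relate, the snippet below computes them from true/false positive and negative counts. The function names and the counts in the example are hypothetical illustrations and are not taken from the article.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    tp, tn, fp, fn = 15, 14, 2, 1
    print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it stays informative when the CRC and control classes are of unequal size.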
Group | Characteristic | DS1 | DS2 | DS3 |
---|---|---|---|---|
Original | No. features | 33 | 52 | 49 |
Original | No. rules (p ≤ 0.05) | 15 | 38 | 12 |
Original | ACC | 81.2% | 97.5% | 98.4% |
Original | AUC | 0.821 | 0.979 | 0.999 |
Merged | No. features | 99 | 121 | 114 |
Merged | No. rules | 10 | 34 | 19 |
Merged | ACC | 84.7% | 84.2% | 98.6% |
Merged | AUC | 0.918 | 0.931 | 0.999 |
Rule statistics | CRC (Basic) | CRC (Recalculated) | Control (Basic) | Control (Recalculated) |
---|---|---|---|---|
Total number of rules | 53 | 53 | 45 | 45 |
Number of rules (p ≤ 0.05) | 31 | 31 | 31 | 32 |
LHS support | 44 | 75 | 55 | 77 |
RHS support | 43 | 75 | 53 | 77 |
Top co-predictors | INHBA, CADM3 | INHBA, FAM107A | INHBA, ANK2 | INHBA, ANK2 |
No. | Rule | Decision | Accuracy | RHS Support | p Value |
---|---|---|---|---|---|
1 | INHBA = 3, FAM107A = 1 | T | 1 | 260 | 1.90 × 10^−93 |
2 | INHBA = 3, CADM3 = 1 | T | 1 | 259 | 5.14 × 10^−93 |
3 | INHBA = 3, ANK2 = 1 | T | 1 | 249 | 1.00 × 10^−88 |
4 | INHBA = 1, ANK2 = 2 | N | 1 | 241 | 3.69 × 10^−82 |
5 | INHBA = 1, FAM107A = 3 | N | 1 | 225 | 8.77 × 10^−76 |
6 | INHBA = 1, CADM3 = 3 | N | 1 | 219 | 1.97 × 10^−73 |
7 | INHBA = 1, CADM3 = 2 | N | 1 | 217 | 1.18 × 10^−72 |
8 | INHBA = 2, FAM107A = 3 | N | 1 | 212 | 1.03 × 10^−70 |
9 | INHBA = 2, ANK2 = 3 | N | 0.9910314 | 221 | 2.86 × 10^−70 |
10 | INHBA = 1, FAM107A = 2 | N | 1 | 209 | 1.47 × 10^−69 |
11 | INHBA = 2, CADM3 = 3 | N | 1 | 208 | 3.56 × 10^−69 |
12 | INHBA = 1, ANK2 = 3 | N | 1 | 196 | 1.31 × 10^−64 |
13 | INHBA = 2, ANK2 = 1 | T | 1 | 188 | 5.90 × 10^−64 |
14 | INHBA = 2, CADM3 = 1 | T | 1 | 180 | 7.18 × 10^−61 |
15 | INHBA = 3, CADM3 = 2 | T | 1 | 176 | 2.43 × 10^−59 |
16 | INHBA = 3, FAM107A = 2 | T | 1 | 175 | 5.85 × 10^−59 |
17 | INHBA = 2, FAM107A = 1 | T | 1 | 173 | 3.37 × 10^−58 |
18 | INHBA = 3, ANK2 = 2 | T | 1 | 169 | 1.10 × 10^−56 |
19 | IQGAP3 = 3 | T | 1 | 41 | 4.23 × 10^−17 |
20 | UBE2T = 3 | T | 1 | 41 | 4.23 × 10^−17 |
21 | SQLE = 3 | T | 1 | 41 | 4.23 × 10^−17 |
22 | CCT3 = 3 | T | 1 | 41 | 4.23 × 10^−17 |
23 | HIST1H2BG = 3 | T | 1 | 40 | 1.92 × 10^−16 |
24 | HIST1H2BG = 1 | N | 1 | 40 | 5.75 × 10^−15 |
25 | IQGAP3 = 1 | N | 1 | 39 | 2.14 × 10^−14 |
26 | UBE2T = 1 | N | 1 | 38 | 7.67 × 10^−14 |
27 | CCT3 = 1 | N | 1 | 38 | 7.67 × 10^−14 |
28 | SQLE = 1 | N | 1 | 37 | 2.67 × 10^−13 |
29 | CENPF = 3, KIAA1199 = 3 | T | 1 | 27 | 2.63 × 10^−9 |
30 | AURKB = 3, FAM83H = 3 | T | 1 | 25 | 2.22 × 10^−8 |
31 | ARNTL2 = 3, PDE9A = 1 | T | 1 | 25 | 3.33 × 10^−8 |
32 | FNBP1 = 1, SQLE = 3 | T | 1 | 24 | 6.26 × 10^−8 |
33 | A2LD1 = 2, BMP3 = 3 | N | 1 | 23 | 7.62 × 10^−7 |
34 | AURKB = 1, FAM83H = 1 | N | 1 | 23 | 7.62 × 10^−7 |
35 | CENPF = 1, KIAA1199 = 1 | N | 1 | 23 | 7.62 × 10^−7 |
36 | FNBP1 = 2, SQLE = 1 | N | 1 | 22 | 1.90 × 10^−6 |
37 | LOC63928 = 3, SOX4 = 2 | N | 1 | 19 | 6.74 × 10^−6 |
38 | INHBA = 3, ANK2 = 3 | T | 1 | 21 | 7.48 × 10^−6 |
39 | FNBP1 = 1, SQLE = 2 | T | 1 | 19 | 8.46 × 10^−6 |
40 | A2LD1 = 3, BMP3 = 1 | T | 1 | 18 | 2.15 × 10^−5 |
41 | FNBP1 = 3, SQLE = 2 | N | 1 | 19 | 2.72 × 10^−5 |
42 | LOC63928 = 1, SOX4 = 3 | T | 0.952381 | 20 | 5.37 × 10^−5 |
43 | CENPF = 1, KIAA1199 = 2 | N | 1 | 18 | 6.40 × 10^−5 |
44 | FNBP1 = 3, SQLE = 1 | N | 1 | 18 | 6.40 × 10^−5 |
45 | CENPF = 2, KIAA1199 = 1 | N | 1 | 18 | 6.40 × 10^−5 |
46 | AURKB = 1, FAM83H = 2 | N | 1 | 18 | 6.40 × 10^−5 |
47 | INHBA = 1, VAV1 = 3 | N | 1 | 16 | 1.02 × 10^−4 |
48 | INHBA = 3, VAV1 = 1 | T | 1 | 16 | 1.43 × 10^−4 |
49 | LOC63928 = 3, SOX4 = 1 | N | 0.9473684 | 18 | 2.01 × 10^−4 |
50 | INHBA = 1, VAV1 = 2 | N | 1 | 15 | 2.45 × 10^−4 |
51 | AURKB = 2, FAM83H = 1 | N | 1 | 16 | 3.42 × 10^−4 |
52 | ARNTL2 = 1, PDE9A = 3 | N | 0.9444444 | 17 | 4.67 × 10^−4 |
53 | AURKB = 3, FAM83H = 2 | T | 1 | 14 | 7.70 × 10^−4 |
54 | AURKB = 2, FAM83H = 3 | T | 1 | 13 | 1.82 × 10^−3 |
55 | A2LD1 = 1, BMP3 = 2 | N | 0.9047619 | 19 | 2.06 × 10^−3 |
56 | ARNTL2 = 2, PDE9A = 3 | N | 0.9333333 | 14 | 5.27 × 10^−3 |
57 | INHBA = 3, VAV1 = 2 | T | 1 | 11 | 8.81 × 10^−3 |
58 | FNBP1 = 2, SQLE = 3 | T | 1 | 11 | 9.78 × 10^−3 |
59 | A2LD1 = 2, BMP3 = 1 | T | 1 | 11 | 9.78 × 10^−3 |
60 | A2LD1 = 1, BMP3 = 1 | T | 1 | 11 | 9.78 × 10^−3 |
61 | A2LD1 = 3, BMP3 = 2 | T | 1 | 10 | 2.23 × 10^−2 |
62 | CENPF = 3, KIAA1199 = 2 | T | 1 | 10 | 2.23 × 10^−2 |
63 | A2LD1 = 3, BMP3 = 3 | N | 1 | 10 | 3.89 × 10^−2 |
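
Each row of the rule table above pairs a conjunction of gene-level conditions with a decision. A minimal sketch of how such rules could be matched against one sample is given below. It assumes that levels 1 to 3 are discretized expression bins and that T and N denote the tumor (CRC) and normal (control) decisions; neither is stated in the table itself. The simple majority vote is a stand-in chosen for illustration, not the classification scheme of the rule-learning software used in the article, and the names `RULES`, `matching_rules`, and `vote` are hypothetical.

```python
from collections import Counter

# Three rules copied from the table (Nos. 1, 4 and 19) as
# (conditions, decision) pairs. Levels are assumed to be discretized
# expression bins; "T"/"N" are assumed to mean tumor/normal.
RULES = [
    ({"INHBA": 3, "FAM107A": 1}, "T"),
    ({"INHBA": 1, "ANK2": 2}, "N"),
    ({"IQGAP3": 3}, "T"),
]


def matching_rules(sample, rules=RULES):
    """Return every rule whose conditions are all satisfied by the sample."""
    return [
        (conditions, decision)
        for conditions, decision in rules
        if all(sample.get(gene) == level for gene, level in conditions.items())
    ]


def vote(sample, rules=RULES):
    """Majority decision over matching rules, or None if no rule fires."""
    votes = Counter(decision for _, decision in matching_rules(sample, rules))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    sample = {"INHBA": 3, "FAM107A": 1, "IQGAP3": 3}  # hypothetical sample
    print(matching_rules(sample))
    print(vote(sample))
```

A fuller implementation would typically weight each vote, for example by rule accuracy or support as listed in the table, instead of counting every matching rule equally.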
mRNA | Expression Level (DS1) | Expression Level (DS2) | Expression Level (DS3) | Predicted Class |
---|---|---|---|---|
ARNTL2 | Downregulated | / | / | CRC |
PDE9A | Upregulated | / | / | CRC |
INHBA | Downregulated | / | / | CRC |
ANK2 | Upregulated | Upregulated | Upregulated | CRC |
FAM107A | / | / | / | CRC |
CADM3 | / | / | Upregulated | CRC |
SQLE | Downregulated | / | / | CRC |
FNBP1 | / | / | / | CRC |
CCT3 | / | / | / | CRC |
CENPF | / | Downregulated | / | CRC |
KIAA1199 | Downregulated | Downregulated | / | CRC |
UBE2T | / | / | / | CRC |
FAM83H | / | / | / | CRC |
IQGAP3 | / | Downregulated | / | CRC |
HIST1H2BG | / | / | / | CRC |
A2LD1 | / | / | / | CRC |
BMP3 | / | / | Upregulated | CRC |
AURKB | / | / | CRC |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wei, W.; Li, Y.; Huang, T. Using Machine Learning Methods to Study Colorectal Cancer Tumor Micro-Environment and Its Biomarkers. Int. J. Mol. Sci. 2023, 24, 11133. https://doi.org/10.3390/ijms241311133