Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Structure and the Multiple Pathways
2.2. Evaluation Criteria for Binary Classification
2.3. SMOTE-Tomek Procedure for Imbalanced Data
2.4. The OGS Approach with Binary Logistic Regression for G-E Interactions
2.5. The Alternative Classification Methods
3. Results
3.1. Simulation Studies: Synthetic Imbalanced Dataset with Complex Gene Structure
3.2. Real Data Application: TCGA LUAD Data
3.3. Real Data Application: TCGA BRCA Data
3.4. Improvement in Predictive Capability for Real Data
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
BMI | Body mass index |
BP | Biological process |
BRCA | Breast invasive carcinoma |
CC | Cellular component |
CV | Cross-validation |
FN | Number of false negatives |
FP | Number of false positives |
G-E | Gene-environment |
G-G | Gene-gene |
GO | Gene ontology |
GWAS | Genome-wide association study |
IR | Imbalanced ratio |
KNNs | K-nearest neighbors |
LDA | Linear discriminant analysis |
LUAD | Lung adenocarcinoma |
MF | Molecular function |
MI | Mutual information |
ML | Machine learning |
MLRs | Multiple logistic regression models |
OGS | Overlapping group screening |
RFs | Random forests |
SKAT | Sequence kernel association test |
SMOTE | Synthetic minority oversampling technique |
SVMs | Support vector machines |
TCGA | The Cancer Genome Atlas |
TN | Number of true negatives |
TP | Number of true positives |
Appendix A. Latent Effect Approach
ML Method | R Package and Function | Hyperparameter | Procedure |
---|---|---|---|
SVM | “e1071”, svm(), tune() | Kernel: “radial” | given |
 | | cost: | CV |
 | | gamma: | CV |
RF | “randomForest”, randomForest(), train() | Kernel: “rectangular” | given |
 | | ntree: 1, 2, …, 500 | CV |
 | | mtry: 1, 2, …, 10 | CV |
KNN | “kknn”, kknn() | k: 1, 2, …, 50 | CV |
LDA | “MASS”, lda() | prior: 0.5 | given |
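The "CV" entries in the hyperparameter table mean that a value is picked from a grid by cross-validation. The following is a minimal sketch of that idea for the KNN row, written in plain Python rather than with the R packages listed above. The function names (`knn_predict`, `cv_accuracy`, `select_k`), the interleaved fold assignment, the use of accuracy as the selection score, and the tie-break toward the smallest k are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds=5):
    """Pooled accuracy of k-NN over n_folds folds (sample i goes to fold i % n_folds)."""
    n = len(xs)
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            correct += int(knn_predict(train_x, train_y, xs[i], k) == ys[i])
    return correct / n


def select_k(xs, ys, grid, n_folds=5):
    """Return the grid value with the best CV accuracy (ties go to the smallest k)."""
    scores = {k: cv_accuracy(xs, ys, k, n_folds) for k in grid}
    best = max(grid, key=lambda k: (scores[k], -k))
    return best, scores


# Toy data (assumed for illustration): two well-separated classes in two dimensions.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(10.0 + 0.1 * i, 10.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
best_k, scores = select_k(xs, ys, grid=range(1, 6))
```

On imbalanced data, a score such as F1 or sensitivity may be a more suitable selection criterion than accuracy; swapping the score only changes `cv_accuracy`.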
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
60:40 | |||||
OGS_Ridge | 0.8358 | 0.8084 | 0.8194 | 0.7958 | 0.8456 |
OGS_Lasso | 0.8308 | 0.7959 | 0.8331 | 0.7961 | 0.8281 |
OGS_ALasso | 0.8636 | 0.8388 | 0.8572 | 0.8327 | 0.8665 |
OGS_SVM | 0.8248 | 0.7476 | 0.8437 | 0.7891 | 0.8163 |
OGS_LDA | 0.7985 | 0.6960 | 0.8709 | 0.7704 | 0.7551 |
OGS_KNN | 0.4941 | 0.4334 | 0.8440 | 0.5662 | 0.2678 |
OGS_RF | 0.6520 | 0.6837 | 0.2756 | 0.3700 | 0.8970 |
70:30 | |||||
OGS_Ridge | 0.7575 | 0.7057 | 0.7609 | 0.6717 | 0.7578 |
OGS_Lasso | 0.7480 | 0.7060 | 0.7525 | 0.6604 | 0.7473 |
OGS_ALasso | 0.7467 | 0.6689 | 0.8219 | 0.6867 | 0.7080 |
OGS_SVM | 0.7790 | 0.6752 | 0.7266 | 0.6867 | 0.8158 |
OGS_LDA | 0.6543 | 0.5016 | 0.7228 | 0.5879 | 0.6210 |
OGS_KNN | 0.4199 | 0.3650 | 0.9101 | 0.5173 | 0.1626 |
OGS_RF | 0.6531 | 0.5085 | 0.5771 | 0.5295 | 0.6932 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
80:20 | |||||
OGS_Ridge | 0.6640 | 0.9071 | 0.6687 | 0.7369 | 0.6529 |
OGS_Lasso | 0.6807 | 0.8994 | 0.6957 | 0.7544 | 0.6220 |
OGS_ALasso | 0.7858 | 0.8984 | 0.8425 | 0.8261 | 0.5490 |
OGS_SVM | 0.7397 | 0.8905 | 0.7767 | 0.8233 | 0.6125 |
OGS_LDA | 0.6090 | 0.8889 | 0.5791 | 0.7000 | 0.7195 |
OGS_KNN | 0.4314 | 0.8390 | 0.3641 | 0.4998 | 0.6842 |
OGS_RF | 0.7375 | 0.8044 | 0.8835 | 0.8397 | 0.2015 |
Location | Left-Lower | Left-Upper | Right-Lower | Right-Middle | Right-Upper | Other | NA |
---|---|---|---|---|---|---|---|
Number | 76 | 119 | 96 | 23 | 180 | 4 | 7 |
Location | Left | Left LIQ | Left LOQ | Left UIQ | Left UOQ |
---|---|---|---|---|---|
Number | 189 | 29 | 40 | 83 | 230 |
Location | Right | Right LIQ | Right LOQ | Right UIQ | Right UOQ |
---|---|---|---|---|---|
Number | 175 | 27 | 49 | 83 | 189 |
Actual class | Positive (predicted) | Negative (predicted) |
---|---|---|
Positive (actual) | number of true positives (TP) | number of false negatives (FN) |
Negative (actual) | number of false positives (FP) | number of true negatives (TN) |
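The five criteria reported in the result tables (accuracy, precision, sensitivity, F1, specificity) follow from the four confusion-matrix counts by their standard definitions. A minimal Python sketch is given here; the function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not taken from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "specificity": specificity,
    }
```

For example, `classification_metrics(tp=40, fn=10, fp=5, tn=45)` gives an accuracy of 0.85, a sensitivity of 0.80, and a specificity of 0.90. For a single confusion matrix, a precision of 1 together with a specificity of 1 corresponds to FP = 0.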
Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
Gene Size | 3 | 3 | 3 | 6 | 6 | 6 | 9 | 9 | 9 | 15 | 15 | 15 | 24 | 24 | 24 | 36 | 36 | 36 | 45 | 45 | 45 | 60 | 60 | 60 | 38 |
Overlapping | 1 | 1 | 0 | 2 | 2 | 0 | 3 | 3 | 0 | 5 | 5 | 0 | 8 | 8 | 0 | 12 | 12 | 0 | 15 | 15 | 0 | 20 | 20 | 0 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
60:40 | |||||
OGS_Ridge | 0.8867 (0.7819) | 0.8673 (0.7571) | 0.8571 (0.8338) | 0.8536 (0.7590) | 0.9043 (0.7456) |
OGS_Lasso | 0.8796 (0.7298) | 0.8436 (0.6772) | 0.8822 (0.8463) | 0.8515 (0.7187) | 0.8777 (0.6509) |
OGS_ALasso | 0.8695 (0.6957) | 0.8286 (0.6336) | 0.8864 (0.8466) | 0.8439 (0.6923) | 0.8581 (0.5945) |
OGS_SVM | 0.8827 (0.8184) | 0.8617 (0.8150) | 0.8281 (0.7070) | 0.8418 (0.7506) | 0.9167 (0.8928) |
OGS_LDA | 0.8737 (0.8265) | 0.8131 (0.7688) | 0.8732 (0.8074) | 0.8403 (0.7849) | 0.8738 (0.8393) |
OGS_KNN | 0.5929 (0.6554) | 0.4809 (0.7277) | 0.7226 (0.2254) | 0.5743 (0.3281) | 0.5109 (0.9405) |
OGS_RF | 0.7007 (0.6354) | 0.6764 (0.5547) | 0.4402 (0.5069) | 0.5248 (0.5187) | 0.8631 (0.7213) |
70:30 | |||||
OGS_Ridge | 0.8284 (0.6571) | 0.7940 (0.6052) | 0.7189 (0.7735) | 0.7157 (0.6109) | 0.8791 (0.6061) |
OGS_Lasso | 0.7641 (0.5780) | 0.6726 (0.5126) | 0.8473 (0.8140) | 0.7069 (0.5563) | 0.7255 (0.4746) |
OGS_ALasso | 0.7515 (0.5476) | 0.6790 (0.4868) | 0.8529 (0.8175) | 0.7102 (0.5337) | 0.7061 (0.4291) |
OGS_SVM | 0.8753 (0.8206) | 0.8531 (0.7831) | 0.7383 (0.5909) | 0.7849 (0.6659) | 0.9382 (0.9244) |
OGS_LDA | 0.8329 (0.8093) | 0.7080 (0.6607) | 0.8104 (0.7970) | 0.7537 (0.7190) | 0.8424 (0.8150) |
OGS_KNN | 0.5178 (0.7120) | 0.3858 (0.7475) | 0.8449 (0.1199) | 0.5253 (0.2089) | 0.3682 (0.9784) |
OGS_RF | 0.7273 (0.6512) | 0.6300 (0.4357) | 0.4375 (0.3801) | 0.4836 (0.3949) | 0.8587 (0.7723) |
Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 1 |
Gene Size | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 26 | 23 |
Overlapping | 3 | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 3 | 0 | 0 | 3 | 0 | 3 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
80:20 | |||||
OGS_Ridge | 0.7996 (0.5055) | 0.8870 (0.9554) | 0.8550 (0.4089) | 0.8386 (0.6226) | 0.6102 (0.8496) |
OGS_Lasso | 0.8013 (0.4993) | 0.8799 (0.9357) | 0.8668 (0.4140) | 0.8765 (0.5815) | 0.5779 (0.8036) |
OGS_ALasso | 0.8149 (0.5104) | 0.8943 (0.9202) | 0.8746 (0.4411) | 0.8535 (0.5914) | 0.6194 (0.7580) |
OGS_SVM | 0.7842 (0.8220) | 0.9055 (0.8473) | 0.8101 (0.9407) | 0.8505 (0.8905) | 0.7055 (0.4121) |
OGS_LDA | 0.6572 (0.8011) | 0.9076 (0.9095) | 0.6146 (0.8259) | 0.7307 (0.8648) | 0.7998 (0.7149) |
OGS_KNN | 0.4094 (0.7755) | 0.8024 (0.7771) | 0.3189 (0.9964) | 0.4532 (0.8726) | 0.6876 (0.0137) |
OGS_RF | 0.7343 (0.7275) | 0.7832 (0.8012) | 0.9052 (0.8638) | 0.8371 (0.8288) | 0.1990 (0.2544) |
IR | Gene (with the SMOTE-Tomek process) | G-E Interaction (with the SMOTE-Tomek process) | Gene (without the SMOTE-Tomek process) | G-E Interaction (without the SMOTE-Tomek process) |
---|---|---|---|---|
Original coefficients | ||||
60:40 | 0.8467 | 0.4667 | 0.8443 | 0.0046 |
70:30 | 0.8410 | 0.4489 | 0.8456 | 0 |
Weaker coefficients (all original coefficients divided by 2) | ||||
60:40 | 0.8527 | 0.4390 | 0.8417 | 0.0033 |
70:30 | 0.8406 | 0.4216 | 0.8441 | 0 |
Factor | Coding | Missing Status | Continuous (C)/Discrete (D) |
---|---|---|---|
Number of pack years smoked | | Yes | C |
Race | white = 1, Asian = 2, black or African American = 3 | Yes | D |
Gender | female = 0, male = 1 | No | D |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.7426 | 1.0000 | 0.7063 | 0.8278 | 1.0000 |
OGS_Lasso | 0.6832 | 1.0000 | 0.6482 | 0.7865 | 1.0000 |
OGS_ALasso | 0.6782 | 1.0000 | 0.6389 | 0.7797 | 1.0000 |
OGS_SVM | 0.8762 | 1.0000 | 0.8716 | 0.9262 | 1.0000 |
OGS_LDA | 0.6436 | 1.0000 | 0.6022 | 0.7517 | 1.0000 |
OGS_KNN | 0.8663 | 1.0000 | 0.8533 | 0.9199 | 1.0000 |
OGS_RF | 0.8861 | 1.0000 | 0.8785 | 0.9301 | 1.0000 |
GO_CC | |||||
OGS_Ridge | 0.7277 | 1.0000 | 0.6945 | 0.8197 | 1.0000 |
OGS_Lasso | 0.6832 | 1.0000 | 0.6424 | 0.7822 | 1.0000 |
OGS_ALasso | 0.6881 | 1.0000 | 0.6480 | 0.7864 | 1.0000 |
OGS_SVM | 0.8465 | 0.9939 | 0.8694 | 0.9109 | 0.9667 |
OGS_LDA | 0.6634 | 1.0000 | 0.6250 | 0.7692 | 1.0000 |
OGS_KNN | 0.8366 | 1.0000 | 0.8162 | 0.8988 | 1.0000 |
OGS_RF | 0.8762 | 1.0000 | 0.8641 | 0.9271 | 1.0000 |
GO_MF | |||||
OGS_Ridge | 0.7624 | 1.0000 | 0.7303 | 0.8441 | 1.0000 |
OGS_Lasso | 0.7376 | 1.0000 | 0.7017 | 0.8247 | 1.0000 |
OGS_ALasso | 0.7475 | 1.0000 | 0.7102 | 0.8306 | 1.0000 |
OGS_SVM | 0.8663 | 1.0000 | 0.8827 | 0.9211 | 1.0000 |
OGS_LDA | 0.7673 | 1.0000 | 0.7310 | 0.8446 | 1.0000 |
OGS_KNN | 0.8713 | 1.0000 | 0.8540 | 0.9212 | 1.0000 |
OGS_RF | 0.9059 | 1.0000 | 0.8959 | 0.9440 | 1.0000 |
Gene | Number of Pack Years Smoked | Race | Gender |
---|---|---|---|
GPT2 | 0.9891 | 1.4863 | 1.2152 |
IDH2 | 0.9993 | 1.0542 | 1.0906 |
L2HGDH | 1.0143 | 1.0690 | 1.0884 |
Variable | Coding | Missing Status | Continuous (C)/Discrete (D) |
---|---|---|---|
Age at initial pathologic diagnosis (years) | | No | C |
Race | white = 1, Asian = 2, black or African American = 3 | Yes | D |
Gender | female = 0, male = 1 | No | D |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.7384 | 0.9996 | 0.7090 | 0.8290 | 0.9977 |
OGS_Lasso | 0.6909 | 1.0000 | 0.6559 | 0.7916 | 1.0000 |
OGS_ALasso | 0.6915 | 1.0000 | 0.6566 | 0.7920 | 1.0000 |
OGS_SVM | 0.8626 | 0.9710 | 0.8743 | 0.9194 | 0.7619 |
OGS_LDA | 0.6691 | 0.9999 | 0.6319 | 0.7711 | 0.9990 |
OGS_KNN | 0.8383 | 0.9928 | 0.8227 | 0.8912 | 0.9780 |
OGS_RF | 0.8811 | 0.9975 | 0.8699 | 0.9280 | 0.9813 |
GO_CC | |||||
OGS_Ridge | 0.7023 | 0.9997 | 0.6686 | 0.8002 | 0.9983 |
OGS_Lasso | 0.6932 | 0.9999 | 0.6581 | 0.7925 | 0.9996 |
OGS_ALasso | 0.7028 | 0.9999 | 0.6689 | 0.8008 | 0.9997 |
OGS_SVM | 0.8471 | 0.9666 | 0.8609 | 0.9097 | 0.7318 |
OGS_LDA | 0.7721 | 0.9990 | 0.7468 | 0.8527 | 0.9936 |
OGS_KNN | 0.7604 | 0.9874 | 0.7359 | 0.8042 | 0.9809 |
OGS_RF | 0.8213 | 0.9989 | 0.8020 | 0.8878 | 0.9923 |
GO_MF | |||||
OGS_Ridge | 0.7350 | 1.0000 | 0.7048 | 0.8256 | 1.0000 |
OGS_Lasso | 0.7166 | 1.0000 | 0.6842 | 0.8119 | 1.0000 |
OGS_ALasso | 0.7227 | 1.0000 | 0.6910 | 0.8168 | 1.0000 |
OGS_SVM | 0.8541 | 0.9605 | 0.8744 | 0.9147 | 0.6803 |
OGS_LDA | 0.7568 | 0.9999 | 0.7291 | 0.8419 | 0.9991 |
OGS_KNN | 0.7886 | 0.9887 | 0.7672 | 0.8305 | 0.9783 |
OGS_RF | 0.8456 | 0.9993 | 0.8286 | 0.9048 | 0.9959 |
Gene | Age at Initial Pathologic Diagnosis (Years) | Race | Gender |
---|---|---|---|
SPRY2 | 1.0246 | 0.7620 | 0.9996 |
CES1 | 1.0456 | 1.6377 | 1.0135 |
CTHRC1 | 1.0022 | 0.8993 | 1.0054 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.8243 | 1.0000 | 0.8008 | 0.8886 | 1.0000 |
OGS_Lasso | 0.7866 | 1.0000 | 0.7579 | 0.8606 | 1.0000 |
OGS_ALasso | 0.7650 | 1.0000 | 0.7333 | 0.8451 | 1.0000 |
OGS_SVM | 0.9716 | 0.9803 | 0.9877 | 0.9838 | 0.8750 |
OGS_LDA | 0.9738 | 0.9963 | 0.9739 | 0.9849 | 0.9779 |
OGS_KNN | 0.5223 | 0.5701 | 0.4715 | 0.4753 | 0.8577 |
OGS_RF | 0.9767 | 0.9862 | 0.9885 | 0.9872 | 0.9117 |
GO_CC | |||||
OGS_Ridge | 0.8020 | 1.0000 | 0.7727 | 0.8718 | 1.0000 |
OGS_Lasso | 0.7475 | 1.0000 | 0.7182 | 0.8360 | 1.0000 |
OGS_ALasso | 0.7327 | 1.0000 | 0.6966 | 0.8212 | 1.0000 |
OGS_SVM | 0.9802 | 0.9889 | 0.9888 | 0.9886 | 0.9129 |
OGS_LDA | 0.9703 | 1.0000 | 0.9674 | 0.9828 | 1.0000 |
OGS_KNN | 0.9604 | 0.9888 | 0.9625 | 0.9775 | 0.9045 |
OGS_RF | 0.9802 | 0.9889 | 0.9890 | 0.9889 | 0.9167 |
GO_MF | |||||
OGS_Ridge | 0.8091 | 1.0000 | 0.7848 | 0.8788 | 1.0000 |
OGS_Lasso | 0.7653 | 1.0000 | 0.7354 | 0.8459 | 1.0000 |
OGS_ALasso | 0.7423 | 1.0000 | 0.7094 | 0.8288 | 1.0000 |
OGS_SVM | 0.9752 | 0.9842 | 0.9881 | 0.9860 | 0.8911 |
OGS_LDA | 0.9684 | 0.9964 | 0.9681 | 0.9819 | 0.9760 |
OGS_KNN | 0.4974 | 0.5468 | 0.4491 | 0.4523 | 0.8466 |
OGS_RF | 0.9657 | 0.9854 | 0.9756 | 0.9782 | 0.8983 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.8119 | 1.0000 | 0.7908 | 0.8829 | 1.0000 |
OGS_Lasso | 0.7552 | 1.0000 | 0.7277 | 0.8418 | 1.0000 |
OGS_ALasso | 0.7586 | 1.0000 | 0.7316 | 0.8446 | 1.0000 |
OGS_SVM | 0.9799 | 0.9867 | 0.9910 | 0.9888 | 0.8869 |
OGS_LDA | 0.9728 | 0.9992 | 0.9706 | 0.9846 | 0.9947 |
OGS_KNN | 0.6156 | 0.7004 | 0.5798 | 0.5835 | 0.9093 |
OGS_RF | 0.9832 | 0.9927 | 0.9887 | 0.9906 | 0.9400 |
GO_CC | |||||
OGS_Ridge | 0.8173 | 1.0000 | 0.7971 | 0.8868 | 1.0000 |
OGS_Lasso | 0.8162 | 0.9991 | 0.7967 | 0.8842 | 0.9928 |
OGS_ALasso | 0.7494 | 1.0000 | 0.7214 | 0.8365 | 1.0000 |
OGS_SVM | 0.9812 | 0.9858 | 0.9935 | 0.9896 | 0.8737 |
OGS_LDA | 0.9773 | 0.9993 | 0.9755 | 0.9872 | 0.9949 |
OGS_KNN | 0.4933 | 0.5406 | 0.4460 | 0.4479 | 0.8864 |
OGS_RF | 0.9856 | 0.9922 | 0.9918 | 0.9920 | 0.9351 |
GO_MF | |||||
OGS_Ridge | 0.8125 | 1.0000 | 0.7912 | 0.8831 | 1.0000 |
OGS_Lasso | 0.7476 | 0.9999 | 0.7189 | 0.8357 | 0.9993 |
OGS_ALasso | 0.7489 | 1.0000 | 0.7203 | 0.8368 | 1.0000 |
OGS_SVM | 0.9826 | 0.9895 | 0.9911 | 0.9903 | 0.9135 |
OGS_LDA | 0.9793 | 0.9995 | 0.9776 | 0.9884 | 0.9960 |
OGS_KNN | 0.4127 | 0.4819 | 0.3567 | 0.3593 | 0.8880 |
OGS_RF | 0.9842 | 0.9930 | 0.9894 | 0.9912 | 0.9438 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, J.-H.; Liu, C.-Y.; Min, Y.-R.; Wu, Z.-H.; Hou, P.-L. Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data. Mathematics 2024, 12, 2209. https://doi.org/10.3390/math12142209