Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data
Abstract
1. Introduction
2. Materials and Methods
2.1. Penalized Logistic Regression Method
2.2. Variable Ranking with MMLR
Algorithm 1. Proposed two-step procedure
Step 1: Randomly sample 70% of the training-set samples without replacement.
Step 2: Count how often each gene is selected across the 100 models fitted for different λ values.
Step 3: Repeat Steps 1 and 2 100 times.
Step 4: Calculate the selection probability of each variable using Equation (10), then rank the variables.
Step 5: Select the top-ranked genes, i.e., those with the highest selection frequency.
Step 6: Fit sparse logistic regression models on the selected genes to build prognostic models.
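The resampling and ranking part of Algorithm 1 (Steps 1 to 5) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: Equation (10) is not reproduced here, so the selection probability is assumed to be the share of all fits (resamples × λ values) in which a variable has a nonzero coefficient; the L1-penalized logistic fit uses a simple proximal gradient solver as a stand-in for whatever solver the paper used; and the names `fit_l1_logistic`, `selection_probabilities`, and `top_variables` are hypothetical. The toy run uses 10 resamples and 4 λ values to keep it fast, whereas Algorithm 1 specifies 100 of each.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, n_iter=100, step=0.5):
    """L1-penalized logistic regression via proximal gradient descent (assumed solver)."""
    n, p = len(X), len(X[0])
    beta, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        intercept -= step * sum(resid) / n  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


def selection_probabilities(X, y, lambdas, n_resamples=100, frac=0.7, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_resamples):                    # Step 3: repeat Steps 1 and 2
        idx = rng.sample(range(n), int(frac * n))   # Step 1: subsample without replacement
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        for lam in lambdas:                         # Step 2: count selections over the lambda grid
            beta = fit_l1_logistic(Xs, ys, lam)
            for j, b in enumerate(beta):
                if b != 0.0:
                    counts[j] += 1
    # Step 4 (assumed normalization): fraction of all fits that selected each variable.
    total = n_resamples * len(lambdas)
    return [c / total for c in counts]


def top_variables(probs, k):
    # Step 5: rank by selection probability (ties broken by index) and keep the top k.
    ranking = sorted(range(len(probs)), key=lambda j: (-probs[j], j))
    return ranking[:k]


# Toy data (assumption): 40 samples, 6 variables, only variable 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [1 if row[0] > 0 else 0 for row in X]

probs = selection_probabilities(X, y, lambdas=[0.2, 0.1, 0.05, 0.02], n_resamples=10)
top = top_variables(probs, k=2)
print("selection probabilities:", [round(q, 2) for q in probs])
print("top variables:", top)
```

For Step 6, the columns listed in `top` would be passed to a sparse logistic regression method (LASSO, MCP, or SCAD in this paper) fitted on the full training set.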
2.3. The Proposed Variable Ranking Method
2.4. Metrics of Performance
3. Results
3.1. Simulation Results
3.2. Real Data Analysis
4. Discussion
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Filtering Method | Metric | Correlation 0.2 | Correlation 0.5 | Correlation 0.7 |
---|---|---|---|---|
PF | Number of true positives | 5.4 (0.765) | 4.21 (1.09) | 3.11 (1.09) |
MMLR | Number of true positives | 4.52 (0.948) | 2.15 (1.26) | 0.29 (0.50) |
PF vs. MMLR | Two-sample t-test (p value) | 1.204 × 10⁻¹¹ | < 2.2 × 10⁻¹⁶ | < 2.2 × 10⁻¹⁶ |
Correlation | Filtering | Method | Accuracy | G-mean | TP | FP | MS |
---|---|---|---|---|---|---|---|
0.2 | PF | SIS-LASSO | 0.856 (0.047) | 0.854 (0.049) | 5.25 (0.757) | 0.019 (0.002) | 24.55 (1.971) |
0.2 | PF | SIS-MCP | 0.878 (0.054) | 0.877 (0.056) | 5.03 (0.937) | 0.006 (0.003) | 11.3 (2.805) |
0.2 | PF | SIS-SCAD | 0.878 (0.053) | 0.876 (0.055) | 5.18 (0.757) | 0.012 (0.005) | 17.24 (5.053) |
0.2 | PF | average | 0.871 (0.051) | 0.869 (0.053) | 5.153 (0.817) | 0.012 (0.003) | 17.697 (3.276) |
0.2 | MMLR | SIS-LASSO | 0.847 (0.056) | 0.844 (0.06) | 4.3 (0.99) | 0.015 (0.002) | 18.73 (2.131) |
0.2 | MMLR | SIS-MCP | 0.86 (0.061) | 0.858 (0.063) | 4.21 (0.988) | 0.006 (0.003) | 10.32 (2.449) |
0.2 | MMLR | SIS-SCAD | 0.861 (0.059) | 0.858 (0.062) | 4.3 (0.99) | 0.011 (0.004) | 14.8 (3.649) |
0.2 | MMLR | average | 0.856 (0.059) | 0.853 (0.062) | 4.27 (0.989) | 0.011 (0.003) | 14.617 (2.743) |
0.5 | PF | SIS-LASSO | 0.886 (0.041) | 0.884 (0.042) | 3.65 (1.266) | 0.019 (0.003) | 22.71 (2.267) |
0.5 | PF | SIS-MCP | 0.869 (0.055) | 0.868 (0.057) | 2.93 (1.409) | 0.008 (0.003) | 10.87 (2.058) |
0.5 | PF | SIS-SCAD | 0.884 (0.048) | 0.883 (0.05) | 3.57 (1.257) | 0.017 (0.004) | 20.06 (3.92) |
0.5 | PF | average | 0.88 (0.048) | 0.878 (0.05) | 3.383 (1.311) | 0.015 (0.003) | 17.88 (2.748) |
0.5 | MMLR | SIS-LASSO | 0.865 (0.046) | 0.863 (0.047) | 1.84 (1.237) | 0.015 (0.003) | 17.02 (2.137) |
0.5 | MMLR | SIS-MCP | 0.858 (0.048) | 0.857 (0.048) | 1.66 (1.233) | 0.008 (0.002) | 9.89 (1.681) |
0.5 | MMLR | SIS-SCAD | 0.863 (0.047) | 0.861 (0.047) | 1.83 (1.28) | 0.014 (0.003) | 15.64 (2.873) |
0.5 | MMLR | average | 0.862 (0.047) | 0.86 (0.047) | 1.777 (1.25) | 0.012 (0.003) | 14.183 (2.23) |
0.7 | PF | SIS-LASSO | 0.911 (0.037) | 0.911 (0.038) | 2.74 (1.16) | 0.019 (0.003) | 21.14 (2.274) |
0.7 | PF | SIS-MCP | 0.899 (0.042) | 0.899 (0.043) | 1.82 (1.158) | 0.007 (0.002) | 8.88 (1.981) |
0.7 | PF | SIS-SCAD | 0.907 (0.038) | 0.907 (0.038) | 2.68 (1.171) | 0.016 (0.004) | 18.88 (3.699) |
0.7 | PF | average | 0.906 (0.039) | 0.906 (0.04) | 2.413 (1.163) | 0.014 (0.003) | 16.3 (2.651) |
0.7 | MMLR | SIS-LASSO | 0.887 (0.037) | 0.886 (0.037) | 0.26 (0.543) | 0.014 (0.002) | 13.72 (1.724) |
0.7 | MMLR | SIS-MCP | 0.881 (0.04) | 0.88 (0.041) | 0.21 (0.498) | 0.008 (0.002) | 7.75 (1.591) |
0.7 | MMLR | SIS-SCAD | 0.888 (0.036) | 0.888 (0.037) | 0.25 (0.52) | 0.013 (0.002) | 13.45 (2.285) |
0.7 | MMLR | average | 0.885 (0.038) | 0.885 (0.038) | 0.24 (0.52) | 0.012 (0.002) | 11.64 (1.867) |
Dataset | Method | Accuracy | AUROC | G-Mean | Model Size |
---|---|---|---|---|---|
Colon | SIS-LASSO | 0.803 (0.098) | 0.886 (0.077) | 0.745 (0.144) | 7.8 (1.47) |
Colon | SIS-MCP | 0.793 (0.097) | 0.864 (0.088) | 0.748 (0.132) | 4.14 (1.054) |
Colon | SIS-SCAD | 0.798 (0.096) | 0.874 (0.082) | 0.753 (0.13) | 6.73 (1.896) |
Lung | SIS-LASSO | 0.976 (0.017) | 0.998 (0.007) | 0.975 (0.019) | 9.53 (1.453) |
Lung | SIS-MCP | 0.952 (0.03) | 0.983 (0.017) | 0.95 (0.032) | 1.09 (0.288) |
Lung | SIS-SCAD | 0.975 (0.021) | 0.997 (0.006) | 0.973 (0.023) | 8.65 (2.222) |
Rank | SIS-LASSO (gene accession ID) | SIS-MCP (gene accession ID) | SIS-SCAD (gene accession ID) |
---|---|---|---|
1 | Hsa.36689 *** (G50753) | Hsa.36689 | Hsa.36689 |
2 | Hsa.692.2 *** (M76378) | Hsa.8147 | Hsa.692.2 |
3 | Hsa.6814 *** (H08393) | Hsa.6814 | Hsa.6814 |
4 | Hsa.1660 *** (H55916) | Hsa.1660 | Hsa.1660 |
5 | Hsa.8147 *** (M63391) | Hsa.692.2 | Hsa.33268 |
6 | Hsa.5392 *** (T62947) | Hsa.12241 ** (T64012) | Hsa.12241 |
7 | Hsa.37937 ** (R87126) | Hsa.33268 | Hsa.5392 |
8 | Hsa.33268 *** (R80427) | Hsa.5392 | Hsa.8147 |
9 | Hsa.3016 ** (T47377) | Hsa.8125 | Hsa.8125 |
10 | Hsa.8125 *** (T71025) | Hsa.37937 | Hsa.3016 |
Rank | SIS-LASSO (gene accession ID) | SIS-MCP (gene accession ID) | SIS-SCAD (gene accession ID) |
---|---|---|---|
1 | 219597_s_at *** (DUOX1) | 209555_s_at | 219597_s_at |
2 | 205357_s_at ** | 209074_s_at | 205357_s_at |
3 | 209555_s_at *** (CD36) | 32625_at | 209555_s_at |
4 | 209875_s_at *** (SPP1) | 206209_s_at * | 209875_s_at |
5 | 203980_at ** | 204271_s_at * | 209074_s_at |
6 | 208982_at ** | 204396_s_at * | 219213_at |
7 | 209074_s_at *** (FAM107A) | 219213_at | 208982_at |
8 | 220170_at ** | 219597_s_at | 220170_at |
9 | 219213_at *** (JAM2) | 219719_at * | 209614_at * |
10 | 32625_at ** | 209875_s_at | 203980_at |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Kim, S.; Kim, J.-M. Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data. Mathematics 2019, 7, 493. https://doi.org/10.3390/math7060493