A Machine Learning Pipeline for Cancer Detection on Microarray Data: The Role of Feature Discretization and Feature Selection †
Abstract
1. Introduction
- We assess the use of Decision Tree (DT) classifiers, which have seldom been used in the literature with this type of data, motivated by their well-known intrinsic explainability;
- We assess the combined effect of feature discretization (FD) and feature selection (FS) techniques (FS has been used more often than FD on this type of data), comparing their individual and joint usage;
- We consistently evaluate our methods using the false negative rate, which is an important metric for diagnostic decisions;
- For each dataset, we identify the best combination of discretization, selection, and classification and assess the improvement of this combination as compared to the baseline results.
2. Related Work on DNA Microarrays
2.1. DNA Microarrays: Acquisition Technique and Resulting Data
- A DNA microarray is composed of a solid surface, arranged in columns and rows, containing thousands of spots;
- Each spot refers to one single gene and contains multiple strands of the same DNA, yielding a unique DNA sequence;
- Each spot location and its corresponding DNA sequence are recorded in a database.
1. Extraction of ribonucleic acid (RNA) from the sample cells, followed by isolation of the messenger RNA (mRNA) from the extracted RNA, because only the mRNA reflects which genes are being expressed.
2. cDNA creation: a DNA copy is made from the mRNA using the reverse transcriptase enzyme, which generates the complementary DNA (cDNA). A label is added to the cDNA of each cell sample (e.g., fluorescent red for cancer cells and fluorescent green for healthy cells). This step is necessary since DNA is more stable than RNA, and the labeling allows the genes to be identified.
3. Hybridization: both cDNA types are added to the DNA microarray, in which each spot already holds many strands of one unique DNA sequence. When mixed together, complementary strands base-pair with each other. Not all cDNA strands bind; those that do not hybridize are washed off.
4. Analysis: the DNA microarray is analyzed with a scanner to find patterns of hybridization by detecting the fluorescent colors.
- A few red cDNA molecules bound to a spot, if the gene is expressed only in the cancer (red) cells;
- A few green cDNA molecules bound to another spot, if the gene is expressed only in the healthy (green) cells;
- Some of both red and green cDNA molecules bound to a single spot on the microarray, yielding a yellow spot; in this case, the gene is expressed both in the cancer and healthy cells;
- Finally, several spots of the microarray do not have a single red or green cDNA strand bound to them; this happens if the gene is not being expressed in either type of cell.
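The four scanner outcomes listed above can be summarized as a small decision rule. The following is only an illustrative sketch: the function name `spot_color`, the normalized intensity scale, and the `threshold` value are assumptions introduced here and are not taken from the acquisition protocol.

```python
def spot_color(red: float, green: float, threshold: float = 0.1) -> str:
    """Map the red/green fluorescence of one spot to a color label.

    `red` and `green` are assumed to be intensities normalized to [0, 1];
    `threshold` is a hypothetical cut-off separating signal from background.
    """
    red_on = red > threshold
    green_on = green > threshold
    if red_on and green_on:
        return "yellow"  # gene expressed in both cancer and healthy cells
    if red_on:
        return "red"     # gene expressed only in the cancer cells
    if green_on:
        return "green"   # gene expressed only in the healthy cells
    return "none"        # gene not expressed in either type of cell


# Example readings for four hypothetical spots.
for reading in [(0.8, 0.0), (0.0, 0.6), (0.7, 0.9), (0.02, 0.01)]:
    print(reading, "->", spot_color(*reading))
```

In practice the scanner output is a continuous intensity ratio rather than four discrete labels, so this rule is best read as a mnemonic for the cases above.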
2.2. Feature Discretization
2.3. Feature Selection
2.4. Classifiers
2.4.1. SVM
2.4.2. DT
2.5. Related Approaches
3. Proposed Approach
3.1. Microarray Datasets and Clinical Tasks
- Detecting the presence of a specific cancer (such as in CNS, Colon, and Ovarian);
- Detecting the re-incidence of a disease (Breast dataset);
- Diagnosing between two types of cancer (Leukemia dataset).
- Distinguishing among different types of cells (Leukemia_3c, Leukemia_4c, and Lymphoma);
- Distinguishing between healthy situation and the presence of cancer (Lung, MLL, and SRBCT).
3.2. Machine Learning Pipeline
- Choosing the techniques under evaluation;
- Building a ML pipeline using data representation/discretization, dimensionality reduction, and data classification techniques;
- Comparing the performance of these techniques, using standard metrics;
- Identifying, for each dataset, the best technique as well as the best subset of features.
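A pipeline of this shape (discretization, then selection, then classification) can be sketched with Scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are synthetic, `KBinsDiscretizer` with `strategy="quantile"` stands in for equal-frequency binning (assuming that is what EFB denotes), and `SelectKBest` with `f_classif` is a substitute for the filters evaluated in the paper (LS, SPEC, FiR, RRFS), which Scikit-learn does not provide. The values `n_bins=5` and `k=20` are arbitrary illustrations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

# Synthetic stand-in for a microarray dataset: few instances, many features.
X, y = make_classification(
    n_samples=40, n_features=200, n_informative=10, random_state=0
)

pipeline = Pipeline([
    # Feature discretization: quantile bins approximate equal-frequency binning.
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")),
    # Feature selection: ANOVA F-score filter used here only as a stand-in.
    ("select", SelectKBest(score_func=f_classif, k=20)),
    # Classification: linear SVM with C = 1.
    ("classify", SVC(kernel="linear", C=1)),
])

# Leave-one-out cross-validation: every stage is refit on each training fold.
y_pred = cross_val_predict(pipeline, X, y, cv=LeaveOneOut())
err = float(np.mean(y_pred != y))
print(f"LOOCV error rate: {err:.2f}")
```

Wrapping all stages in a single `Pipeline` keeps the discretizer and the selector from seeing the held-out instance, which matters when n is this small.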
1. Mapping each nominal class label to a number (for instance, no cancer corresponds to 0, whereas cancer corresponds to 1); this is performed because some algorithms do not accept nominal labels.
2. Filling the missing values with the most frequent value of the corresponding feature. We used the SimpleImputer class from Scikit-learn. This is only required for the Lymphoma dataset, as it is the only one with missing values.
3. Removing constant features, since they provide no information for classification. This is only required for the Breast dataset, in which d is reduced from 24,481 to 24,188.
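The three preprocessing steps above can be illustrated on a toy matrix. The sketch below uses `SimpleImputer` as named in the text; the use of `VarianceThreshold` for removing constant features is an assumption, since the text does not say which tool was used, and the tiny arrays are invented for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

# Toy data: 4 instances, 3 features; one missing value, one constant feature.
labels = np.array(["no cancer", "cancer", "cancer", "no cancer"])
X = np.array([
    [1.0, 5.0, 0.2],
    [2.0, 5.0, np.nan],
    [2.0, 5.0, 0.7],
    [3.0, 5.0, 0.7],
])

# Step 1: map nominal class labels to numbers (no cancer -> 0, cancer -> 1).
label_map = {"no cancer": 0, "cancer": 1}
y = np.array([label_map[label] for label in labels])

# Step 2: fill missing values with the most frequent value of each feature.
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Step 3: drop constant features (zero variance); an assumed implementation.
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X_imputed)

print(y)                # [0 1 1 0]
print(X_imputed[1, 2])  # 0.7
print(X_reduced.shape)  # (4, 2)
```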
Algorithm 1. Machine learning pipeline
Input: 11 DNA microarray datasets, described in Table 1.
Output: Error rate (Err), false negative rate (FNR), false positive rate (FPR), and percentage of the selected features.
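The three rates named in the algorithm output can be computed from true and predicted labels as sketched below. The definitions are the standard ones for a two-class problem (FNR = FN / (FN + TP), FPR = FP / (FP + TN)) and are assumed here rather than taken from the text; the helper name `binary_rates` and the convention that label 1 is the positive (cancer) class are also assumptions.

```python
import numpy as np


def binary_rates(y_true, y_pred, positive=1):
    """Return (Err, FNR, FPR) for a two-class problem.

    Err = fraction of misclassified instances,
    FNR = FN / (FN + TP), FPR = FP / (FP + TN).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    neg = ~pos
    fn = np.sum(pos & (y_pred != positive))  # positives predicted as negative
    fp = np.sum(neg & (y_pred == positive))  # negatives predicted as positive
    err = float(np.mean(y_pred != y_true))
    fnr = float(fn / pos.sum()) if pos.sum() else float("nan")
    fpr = float(fp / neg.sum()) if neg.sum() else float("nan")
    return err, fnr, fpr


# 4 cancer (1) and 6 healthy (0) instances; one miss of each kind.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
err, fnr, fpr = binary_rates(y_true, y_pred)
print(f"Err={err:.2f} FNR={fnr:.2f} FPR={fpr:.2f}")  # Err=0.20 FNR=0.25 FPR=0.17
```

In a diagnostic setting a false negative is a cancer case labeled as healthy, so a low Err can still hide a high FNR when the classes are imbalanced.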
3.3. Evaluation Metrics
4. Experimental Evaluation
- Section 4.1 addresses the baseline classification results without FD and FS, using the SVM and DT classifiers (stages (1), (2), (5), and (6) of the pipeline).
- Section 4.2 refers to the use of FD techniques (stages (1), (3), (5), and (6) of the pipeline).
- Section 4.3 reports the experimental results of FS techniques (stages (1), (4), (5), and (6) of the pipeline).
- Section 4.4 summarizes the best ML pipeline configuration found for each dataset.
- Section 4.5 reports the experimental results towards the explainability of the classification (stage (7)). We show the subsets of features that are most often chosen for each dataset.
4.1. Baseline Classification Results: Stages (1), (2), (5), and (6)
4.2. Feature Discretization Assessment: Stages (1), (3), (5), and (6)
4.3. Feature Selection Assessment: Stages (1), (4), (5), and (6)
4.4. The Complete Pipeline: Best Configuration for Each Dataset
4.5. Explainability: Most Relevant Genes, Stage (7)
- Use of the leave-one-out cross-validation (LOOCV) procedure, which draws n data folds for training/testing;
- On a dataset with n instances, each feature can be chosen up to n times;
- The importance of a feature, both for accurately classifying a dataset across all data folds and for explaining the classification results, is taken to be proportional to the number of times that feature is chosen;
- After the LOOCV procedure, we count the number of times each feature was chosen and we display the corresponding counters in decreasing order.
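The counting procedure described above can be sketched as follows. This is an illustration under assumptions: the data are synthetic, and `SelectKBest` with `f_classif` (keeping `k=5` features per fold) is a stand-in for whichever FS filter is being analyzed; only the fold-wise counting logic mirrors the text.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in: n = 30 instances, d = 50 features.
X, y = make_classification(
    n_samples=30, n_features=50, n_informative=5, random_state=0
)

counts = Counter()
for train_idx, _ in LeaveOneOut().split(X):
    # Fit the filter on the training fold only and record the chosen features.
    selector = SelectKBest(score_func=f_classif, k=5)
    selector.fit(X[train_idx], y[train_idx])
    counts.update(selector.get_support(indices=True).tolist())

# Features in decreasing order of how often they were chosen (at most n times).
ranking = counts.most_common()
for feature_idx, times_chosen in ranking[:5]:
    print(f"feature {feature_idx}: chosen in {times_chosen} of {len(X)} folds")
```

A feature chosen in all n folds is stable under the removal of any single instance, which is the intuition behind using the counter as an importance indicator.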
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alonso-Betanzos, A.; Bolón-Canedo, V.; Morán-Fernández, L.; Sánchez-Marono, N. A Review of Microarray Datasets: Where to Find Them and Specific Characteristics. Methods Mol. Biol. 2019, 1986, 65–85.
- Bishop, C. Neural Networks for Pattern Recognition; Oxford University: Oxford, UK, 1995.
- Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
- Nogueira, A.; Ferreira, A.; Figueiredo, M. A Step Towards the Explainability of Microarray Data for Cancer Diagnosis with Machine Learning Techniques. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), Online, 3–5 February 2022; pp. 362–369.
- Garcia, S.; Luengo, J.; Saez, J.; Lopez, V.; Herrera, F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 2013, 25, 734–750.
- Duda, R.; Hart, P.; Stork, D. Pattern Classification, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2001.
- Escolano, F.; Suau, P.; Bonev, B. Information Theory in Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2009.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009.
- Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L. Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2006.
- Simon, R.; Korn, E.; McShane, L.; Radmacher, M.; Wright, G.; Zhao, Y. Design and Analysis of DNA Microarray Investigations; Springer: New York, NY, USA, 2003.
- Ferreira, A.; Figueiredo, M. Exploiting the bin-class histograms for feature selection on discrete data. In Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Santiago de Compostela, Spain, 17–19 June 2015; Springer: Cham, Switzerland, 2015; pp. 345–353.
- Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396.
- Dougherty, J.; Kohavi, R.; Sahami, M. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995; Elsevier: Amsterdam, The Netherlands, 1995; pp. 194–202.
- Fayyad, U.; Irani, K. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the International Joint Conference on Uncertainty in AI, Washington, DC, USA, 9–11 July 1993; pp. 1022–1027.
- Alpaydin, E. Introduction to Machine Learning, 3rd ed.; The MIT Press: Cambridge, MA, USA, 2014.
- He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; MIT Press: Cambridge, MA, USA; Volume 18, pp. 507–514.
- Zhao, Z.; Liu, H. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 1151–1157.
- Liu, L.; Kang, J.; Yu, J.; Wang, Z. A comparative study on unsupervised feature selection methods for text clustering. In Proceedings of the 2005 International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, China, 30 October–1 November 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 597–601.
- Fisher, R. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
- Yu, L.; Liu, H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the International Conference on Machine Learning (ICML), Washington, DC, USA, 21–24 August 2003; pp. 856–863.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2005, 27, 1226–1238.
- Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, Catania, Italy, 6–8 April 1994; Springer: Berlin/Heidelberg, Germany, 1994; pp. 171–182.
- Ferreira, A.; Figueiredo, M. Efficient feature selection filters for high-dimensional data. Pattern Recognit. Lett. 2012, 33, 1794–1804.
- Zhao, Z.; Morstatter, F.; Sharma, S.; Alelyani, S.; Anand, A.; Liu, H. Advancing Feature Selection Research—ASU Feature Selection Repository; Technical Report; Computer Science & Engineering, Arizona State University: Tempe, AZ, USA, 2010.
- Furey, T.; Cristianini, N.; Duffy, N.; Bednarski, D.; Schummer, M.; Haussler, D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16, 906–914.
- Remeseiro, B.; Bolon-Canedo, V. A review of feature selection methods in medical applications. Comput. Biol. Med. 2019, 112, 103375.
- Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.; O’Sullivan, J. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312.
- Dhal, P.; Azad, C. A comprehensive survey on feature selection in the various fields of machine learning. Appl. Intell. 2022, 52, 4543–4581.
- Lazar, C.; Taminau, J.; Meganck, S.; Steenhoff, D.; Coletta, A.; Molter, C.; Schaetzen, V.; Duque, R.; Bersini, H.; Nowé, A. A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1106–1119.
- Manikandan, G.; Abirami, S. A Survey on Feature Selection and Extraction Techniques for High-Dimensional Microarray Datasets. In Knowledge Computing and its Applications: Knowledge Computing in Specific Domains: Volume II; Springer: Singapore, 2018; pp. 311–333.
- Almugren, N.; Alshamlan, H. A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification. IEEE Access 2019, 7, 78533–78548.
- Arowolo, M.; Adebiyi, M.; Aremu, C.; Adebiyi, A. A survey of dimension reduction and classification methods for RNA-Seq data on malaria vector. J. Big Data 2021, 8, 50.
- Alpaydin, E. Introduction to Machine Learning, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2010.
- Boser, B.; Guyon, I.; Vapnik, V. A training algorithm for optimal margin classifiers. In Proceedings of the Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; ACM Press: New York, NY, USA, 1992; pp. 144–152.
- Burges, C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
- Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1999.
- Hsu, C.; Lin, C. A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw. 2002, 13, 415–425.
- Weston, J.; Watkins, C. Multi-Class Support Vector Machines; Technical Report; Department of Computer Science, Royal Holloway, University of London: London, UK, 1998.
- Breiman, L. Classification and Regression Trees, 1st ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 1984.
- Quinlan, J. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
- Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, CA, USA, 1993.
- Quinlan, J. Bagging, boosting, and C4.5. In Proceedings of the National Conference on Artificial Intelligence, Portland, OR, USA, 4–8 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 725–730.
- Rokach, L.; Maimon, O. Top-down induction of decision trees classifiers—A survey. IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev. 2005, 35, 476–487.
- Yip, W.; Amin, S.; Li, C. A Survey of Classification Techniques for Microarray Data Analysis. In Handbook of Statistical Bioinformatics; Springer: Berlin/Heidelberg, Germany, 2011; pp. 193–223.
- Statnikov, A.; Tsamardinos, I.; Dosbayev, Y.; Aliferis, C. GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int. J. Med. Inform. 2005, 74, 491–503.
- Witten, I.; Frank, E.; Hall, M.; Pal, C. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann: San Mateo, CA, USA, 2016.
- Meyer, P.; Schretter, C.; Bontempi, G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008, 2, 261–274.
- Statnikov, A.; Aliferis, C.; Tsamardinos, I.; Hardin, D.; Levy, S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21, 631–643.
- Diaz-Uriarte, R.; Andres, S. Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7, 3.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Li, Z.; Xie, W.; Liu, T. Efficient feature selection and classification for microarray data. PLoS ONE 2018, 13, e0202167.
- Consiglio, A.; Casalino, G.; Castellano, G.; Grillo, G.; Perlino, E.; Vessio, G.; Licciulli, F. Explaining Ovarian Cancer Gene Expression Profiles with Fuzzy Rules and Genetic Algorithms. Electronics 2021, 10, 375.
- Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23, 2507–2517.
- AbdElNabi, M.L.R.; Wajeeh Jasim, M.; El-Bakry, H.M.; Hamed, N.; Taha, M.; Khalifa, N.E.M. Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques. Symmetry 2020, 12, 408.
- Alonso-González, C.J.; Moro-Sancho, Q.I.; Simon-Hurtado, A.; Varela-Arrabal, R. Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods. Expert Syst. Appl. 2012, 39, 7270–7280.
- Jirapech-Umpai, T.; Aitken, S. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinform. 2005, 6, 148.
- Zhu, Z.; Ong, Y.; Dash, M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit. 2007, 40, 3236–3248.
- Van’t Veer, L.J.; Dai, H.; Van De Vijver, M.J.; He, Y.D.; Hart, A.A.; Mao, M.; Peterse, H.L.; Van Der Kooy, K.; Marton, M.J.; Witteveen, A.T.; et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415, 530–536.
- Pomeroy, S.L.; Tamayo, P.; Gaasenbeek, M.; Sturla, L.M.; Angelo, M.; McLaughlin, M.E.; Kim, J.Y.; Goumnerova, L.C.; Black, P.M.; Lau, C.; et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415, 436–442.
- Alon, U.; Barkai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 1999, 96, 6745–6750.
- Golub, T.; Slonim, D.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.; Coller, H.; Loh, M.; Downing, J.; Caligiuri, M.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537.
- Bhattacharjee, A.; Richards, W.; Staunton, J.; Li, C.; Monti, S.; Vasa, P.; Ladd, C.; Beheshti, J.; Bueno, R.; Gillette, M.; et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 2001, 98, 13790–13795.
- Alizadeh, A.; Eisen, M.; Davis, R.; Ma, C.; Lossos, I.; Rosenwald, A.; Boldrick, J.; Sabet, H.; Tran, T.; Yu, X.; et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403, 503–511.
- Armstrong, S.A.; Staunton, J.E.; Silverman, L.B.; Pieters, R.; den Boer, M.L.; Minden, M.D.; Sallan, S.E.; Lander, E.S.; Golub, T.R.; Korsmeyer, S.J. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 2002, 30, 41–47.
- Basegmez, H.; Sezer, E.; Erol, C. Optimization for Gene Selection and Cancer Classification. Proceedings 2021, 74, 21.
- Khan, J.; Wei, J.; Ringner, M.; Saal, L.; Ladanyi, M.; Westermann, F.; Berthold, F.; Schwab, M.; Antonescu, C.; Peterson, C.; et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 2001, 7, 673–679.
Name | n | d | d/n | c | Instances per Class | Numeric, Categorical |
---|---|---|---|---|---|---|
Breast [58] | 97 | 24,481 | 252.38 | 2 | 46, 51 | 24,188, 293 |
CNS [59] | 60 | 7129 | 118.81 | 2 | 39, 21 | 7129, 0 |
Colon [60] | 62 | 2000 | 32.25 | 2 | 40, 22 | 2000, 0 |
Leukemia [61] | 72 | 7129 | 99.01 | 2 | 47, 25 | 7129, 0 |
Leukemia_3c [61] | 72 | 7129 | 99.01 | 3 | 38, 25, 9 | 7129, 0 |
Leukemia_4c [61] | 72 | 7129 | 99.01 | 4 | 38, 21, 9, 4 | 7129, 0 |
Lung [62] | 203 | 12,600 | 62.06 | 5 | 139, 17, 6, 21, 20 | 12,600, 0 |
Lymphoma [63] | 66 | 4026 | 61.00 | 3 | 46, 11, 9 | 4026, 0 |
MLL [64] | 72 | 12,582 | 174.75 | 3 | 28, 24, 20 | 11,270, 1312 |
Ovarian [65] | 253 | 15,154 | 59.89 | 2 | 162, 91 | 15,151, 3 |
SRBCT [66] | 83 | 2308 | 27.80 | 4 | 29, 11, 18, 25 | 2308, 0 |
Name | Clinical Task Regarding Cancer Detection |
---|---|
Breast | Breast cancer diagnosis |
CNS | Central Nervous System tumor diagnosis |
Colon | Colon tumor diagnosis |
Leukemia | Acute Lymphocytic Leukemia and Acute Myelogenous Leukemia diagnosis |
Leukemia_3c | Distinguishes types of blood cells which became cancerous |
Leukemia_4c | Distinguishes types of blood cells which became cancerous |
Lung | Lung cancer diagnosis |
Lymphoma | Distinguishes subtypes of non-Hodgkin lymphoma |
MLL | Distinguishes types of acute leukemia, including Mixed Lineage Leukemia |
Ovarian | Ovarian cancer diagnosis |
SRBCT | Distinguishes types of Small Round Blue Cell Tumors |
Dataset | Linear Kernel: Err | Linear Kernel: FNR | Linear Kernel: FPR | Poly Kernel: Err | Poly Kernel: FNR | Poly Kernel: FPR | RBF Kernel: Err | RBF Kernel: FNR | RBF Kernel: FPR | Sigmoid Kernel: Err | Sigmoid Kernel: FNR | Sigmoid Kernel: FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Breast | 0.31 | 0.30 | 0.31 | 0.33 | 0.28 | 0.37 | 0.37 | 0.46 | 0.29 | 0.47 | 1.00 | 0.00 |
CNS | 0.33 | 0.62 | 0.18 | 0.37 | 0.62 | 0.23 | 0.35 | 1.00 | 0.00 | 0.35 | 1.00 | 0.00 |
Colon | 0.18 | 0.27 | 0.12 | 0.27 | 0.55 | 0.12 | 0.21 | 0.50 | 0.05 | 0.39 | 0.82 | 0.15 |
Leukemia | 0.01 | – | – | 0.03 | – | – | 0.15 | – | – | 0.35 | – | – |
Leukemia_3c | 0.04 | – | – | 0.06 | – | – | 0.26 | – | – | 0.47 | – | – |
Leukemia_4c | 0.07 | – | – | 0.10 | – | – | 0.32 | – | – | 0.47 | – | – |
Lung | 0.05 | 0.01 | 0.12 | 0.05 | 0.01 | 0.18 | 0.09 | 0.01 | 0.24 | 0.32 | 0.00 | 1.00 |
Lymphoma | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.30 | – | – |
MLL | 0.03 | – | – | 0.06 | – | – | 0.10 | – | – | 0.61 | – | – |
Ovarian | 0.00 | 0.00 | 0.00 | 0.004 | 0.00 | 0.01 | 0.02 | 0.01 | 0.02 | 0.36 | 0.00 | 1.00 |
SRBCT | 0.00 | – | – | 0.01 | – | – | 0.07 | – | – | 0.65 | – | – |
Average | 0.09 | 0.24 | 0.15 | 0.12 | 0.29 | 0.18 | 0.18 | 0.40 | 0.12 | 0.43 | 0.56 | 0.43 |
Std. dev. | 0.12 | 0.23 | 0.10 | 0.13 | 0.26 | 0.12 | 0.13 | 0.37 | 0.12 | 0.11 | 0.47 | 0.47 |
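Dataset | Max Depth = 2: Err | Max Depth = 2: FNR | Max Depth = 2: FPR | Max Depth = 5: Err | Max Depth = 5: FNR | Max Depth = 5: FPR | Max Depth = 7: Err | Max Depth = 7: FNR | Max Depth = 7: FPR | Max Depth = 10: Err | Max Depth = 10: FNR | Max Depth = 10: FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|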
Breast | 0.40 | 0.35 | 0.45 | 0.33 | 0.30 | 0.35 | 0.33 | 0.30 | 0.35 | 0.33 | 0.30 | 0.35 |
CNS | 0.18 | 0.48 | 0.03 | 0.25 | 0.33 | 0.21 | 0.25 | 0.33 | 0.21 | 0.25 | 0.33 | 0.21 |
Colon | 0.18 | 0.36 | 0.08 | 0.19 | 0.23 | 0.18 | 0.19 | 0.23 | 0.18 | 0.19 | 0.23 | 0.18 |
Leukemia | 0.26 | – | – | 0.26 | – | – | 0.26 | – | – | 0.26 | – | – |
Leukemia_3c | 0.15 | – | – | 0.17 | – | – | 0.17 | – | – | 0.17 | – | – |
Leukemia_4c | 0.11 | – | – | 0.15 | – | – | 0.15 | – | – | 0.15 | – | – |
Lung | 0.13 | 0.01 | 0.06 | 0.07 | 0.01 | 0.12 | 0.07 | 0.01 | 0.12 | 0.07 | 0.01 | 0.12 |
Lymphoma | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – |
MLL | 0.08 | – | – | 0.08 | – | – | 0.08 | – | – | 0.08 | – | – |
Ovarian | 0.03 | 0.01 | 0.07 | 0.03 | 0.01 | 0.07 | 0.03 | 0.01 | 0.07 | 0.03 | 0.01 | 0.07 |
SRBCT | 0.27 | – | – | 0.17 | – | – | 0.17 | – | – | 0.17 | – | – |
Average | 0.16 | 0.24 | 0.14 | 0.15 | 0.18 | 0.19 | 0.15 | 0.18 | 0.19 | 0.15 | 0.18 | 0.19 |
Std. dev. | 0.11 | 0.19 | 0.16 | 0.10 | 0.14 | 0.10 | 0.10 | 0.14 | 0.10 | 0.10 | 0.14 | 0.10 |
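Dataset | Num. Bins = 2: Err | Num. Bins = 2: FNR | Num. Bins = 2: FPR | Num. Bins = 3: Err | Num. Bins = 3: FNR | Num. Bins = 3: FPR | Num. Bins = 4: Err | Num. Bins = 4: FNR | Num. Bins = 4: FPR | Num. Bins = 5: Err | Num. Bins = 5: FNR | Num. Bins = 5: FPR | Num. Bins = 6: Err | Num. Bins = 6: FNR | Num. Bins = 6: FPR | Num. Bins = 7: Err | Num. Bins = 7: FNR | Num. Bins = 7: FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|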
Breast | 0.32 | 0.30 | 0.33 | 0.33 | 0.33 | 0.33 | 0.32 | 0.33 | 0.31 | 0.32 | 0.33 | 0.31 | 0.30 | 0.30 | 0.29 | 0.31 | 0.33 | 0.29 |
CNS | 0.35 | 0.71 | 0.15 | 0.30 | 0.62 | 0.13 | 0.38 | 0.71 | 0.21 | 0.32 | 0.62 | 0.15 | 0.32 | 0.62 | 0.15 | 0.37 | 0.67 | 0.21 |
Colon | 0.18 | 0.27 | 0.12 | 0.18 | 0.27 | 0.12 | 0.16 | 0.27 | 0.10 | 0.15 | 0.23 | 0.10 | 0.15 | 0.23 | 0.10 | 0.16 | 0.27 | 0.10 |
Leukemia | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – |
Leukemia_3c | 0.03 | – | – | 0.03 | – | – | 0.03 | – | – | 0.03 | – | – | 0.04 | – | – | 0.04 | – | – |
Leukemia_4c | 0.08 | – | – | 0.07 | – | – | 0.07 | – | – | 0.07 | – | – | 0.07 | – | – | 0.07 | – | – |
Lung | 0.05 | 0.01 | 0.18 | 0.05 | 0.01 | 0.18 | 0.05 | 0.01 | 0.18 | 0.04 | 0.01 | 0.18 | 0.04 | 0.01 | 0.18 | 0.04 | 0.01 | 0.18 |
Lymphoma | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – |
MLL | 0.04 | – | – | 0.03 | – | – | 0.03 | – | – | 0.03 | – | – | 0.03 | – | – | 0.03 | – | – |
Ovarian | 0.004 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
SRBCT | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – |
Average | 0.10 | 0.26 | 0.16 | 0.09 | 0.25 | 0.15 | 0.10 | 0.26 | 0.16 | 0.09 | 0.24 | 0.15 | 0.09 | 0.23 | 0.14 | 0.09 | 0.26 | 0.16 |
Std. dev. | 0.12 | 0.26 | 0.10 | 0.12 | 0.23 | 0.11 | 0.13 | 0.26 | 0.10 | 0.12 | 0.23 | 0.10 | 0.11 | 0.23 | 0.10 | 0.12 | 0.25 | 0.10 |
Dataset | Classifier | Num. Bins | Err | FNR | FPR |
---|---|---|---|---|---|
Breast | SVM | 6 | 0.30 * | 0.30 | 0.29 |
CNS | DT | 5 | 0.18 | 0.33 | 0.10 |
Colon | SVM | 5, 6 | 0.15 * | 0.23 | 0.10 |
Leukemia | SVM, DT | 2, 3, 4, 5, 6, 7 | 0.01 | – | – |
Leukemia_3c | SVM | 2, 3, 4, 5 | 0.03 * | – | – |
Leukemia_4c | SVM | 3, 4, 5, 6, 7 | 0.07 * | – | – |
Lung | SVM | 5, 6, 7 | 0.04 * | 0.01 | 0.18 |
Lymphoma | SVM | 2, 3, 4, 5, 6, 7 | 0.00 | – | – |
MLL | SVM | 3, 4, 5, 6, 7 | 0.03 | – | – |
Ovarian | SVM | 3, 4, 5, 6, 7 | 0.00 | 0.00 | 0.00 |
SRBCT | SVM | 2, 3, 4, 5, 6, 7 | 0.00 | – | – |
Average | – | – | 0.07 | 0.17 | 0.13 |
Std. dev. | – | – | 0.09 | 0.14 | 0.10 |
Dataset | 2: Err | 2: FNR | 2: FPR | 3: Err | 3: FNR | 3: FPR | 4: Err | 4: FNR | 4: FPR | 5: Err | 5: FNR | 5: FPR | 6: Err | 6: FNR | 6: FPR | 7: Err | 7: FNR | 7: FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Breast | 0.30 | 0.35 | 0.25 | 0.30 | 0.35 | 0.25 | 0.46 | 0.50 | 0.43 | 0.32 | 0.37 | 0.27 | 0.47 | 0.59 | 0.37 | 0.48 | 0.61 | 0.37 |
CNS | 0.42 | 0.67 | 0.28 | 0.65 | 0.81 | 0.56 | 0.43 | 0.57 | 0.36 | 0.18 | 0.33 | 0.10 | 0.37 | 0.57 | 0.26 | 0.50 | 0.71 | 0.38 |
Colon | 0.26 | 0.32 | 0.22 | 0.34 | 0.55 | 0.22 | 0.29 | 0.55 | 0.15 | 0.16 | 0.18 | 0.15 | 0.34 | 0.45 | 0.28 | 0.24 | 0.41 | 0.15 |
Leukemia | 0.01 | – | – | 0.11 | – | – | 0.12 | – | – | 0.19 | – | – | 0.08 | – | – | 0.14 | – | – |
Leukemia_3c | 0.19 | – | – | 0.22 | – | – | 0.12 | – | – | 0.14 | – | – | 0.10 | – | – | 0.19 | – | – |
Leukemia_4c | 0.21 | – | – | 0.28 | – | – | 0.10 | – | – | 0.26 | – | – | 0.21 | – | – | 0.17 | – | – |
Lung | 0.27 | 0.07 | 0.47 | 0.17 | 0.01 | 0.12 | 0.16 | 0.02 | 0.12 | 0.15 | 0.03 | 0.41 | 0.18 | 0.05 | 0.35 | 0.15 | 0.03 | 0.29 |
Lymphoma | 0.06 | – | – | 0.06 | – | – | 0.11 | – | – | 0.09 | – | – | 0.09 | – | – | 0.11 | – | – |
MLL | 0.19 | – | – | 0.17 | – | – | 0.25 | – | – | 0.15 | – | – | 0.08 | – | – | 0.18 | – | – |
Ovarian | 0.06 | 0.07 | 0.03 | 0.02 | 0.01 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.01 | 0.05 | 0.02 | 0.01 | 0.04 | 0.03 | 0.01 | 0.07 |
SRBCT | 0.18 | – | – | 0.25 | – | – | 0.22 | – | – | 0.16 | – | – | 0.23 | – | – | 0.17 | – | – |
Average | 0.20 | 0.30 | 0.25 | 0.23 | 0.35 | 0.24 | 0.21 | 0.34 | 0.22 | 0.17 | 0.18 | 0.20 | 0.20 | 0.33 | 0.26 | 0.21 | 0.35 | 0.25 |
Std. dev. | 0.11 | 0.22 | 0.14 | 0.16 | 0.31 | 0.18 | 0.13 | 0.25 | 0.15 | 0.07 | 0.15 | 0.13 | 0.14 | 0.25 | 0.12 | 0.14 | 0.29 | 0.12 |
Unsupervised techniques: LS, SPEC, and RRFS (MM). Supervised techniques: FiR and RRFS (FiR).

Dataset | LS: Err | LS: FNR | LS: FPR | SPEC: Err | SPEC: FNR | SPEC: FPR | RRFS (MM): Err | RRFS (MM): FNR | RRFS (MM): FPR | FiR: Err | FiR: FNR | FiR: FPR | RRFS (FiR): Err | RRFS (FiR): FNR | RRFS (FiR): FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Breast | 0.33 | 0.35 | 0.31 | 0.32 | 0.30 | 0.33 | 0.31 | 0.28 | 0.33 | 0.31 | 0.28 | 0.33 | 0.31 | 0.28 | 0.33 |
CNS | 0.35 | 0.52 | 0.26 | 0.33 | 0.62 | 0.18 | 0.27 | 0.48 | 0.15 | 0.30 | 0.57 | 0.15 | 0.33 | 0.67 | 0.15 |
Colon | 0.16 | 0.27 | 0.10 | 0.19 | 0.32 | 0.12 | 0.21 | 0.36 | 0.12 | 0.19 | 0.32 | 0.12 | 0.18 | 0.27 | 0.12 |
Leukemia | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – | 0.01 | – | – |
Leukemia_3c | 0.04 | – | – | 0.06 | – | – | 0.04 | – | – | 0.04 | – | – | 0.03 | – | – |
Leukemia_4c | 0.08 | – | – | 0.10 | – | – | 0.07 | – | – | 0.07 | – | – | 0.07 | – | – |
Lung | 0.05 | 0.01 | 0.12 | 0.05 | 0.01 | 0.12 | 0.05 | 0.01 | 0.12 | 0.04 | 0.01 | 0.12 | 0.05 | 0.01 | 0.18 |
Lymphoma | 0.00 | – | – | 0.00 | – | – | 0.03 | – | – | 0.00 | – | – | 0.02 | – | – |
MLL | 0.04 | – | – | 0.06 | – | – | 0.03 | – | – | 0.03 | – | – | 0.04 | – | – |
Ovarian | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.004 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
SRBCT | 0.02 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – | 0.00 | – | – |
Average | 0.10 | 0.23 | 0.16 | 0.10 | 0.25 | 0.15 | 0.09 | 0.23 | 0.15 | 0.09 | 0.24 | 0.14 | 0.09 | 0.25 | 0.16 |
Std. dev. | 0.12 | 0.20 | 0.11 | 0.12 | 0.23 | 0.11 | 0.11 | 0.19 | 0.10 | 0.11 | 0.21 | 0.11 | 0.12 | 0.24 | 0.11 |
Unsupervised techniques: LS, SPEC, and RRFS (MM). Supervised techniques: FiR and RRFS (FiR).

Dataset | LS: Err | LS: FNR | LS: FPR | SPEC: Err | SPEC: FNR | SPEC: FPR | RRFS (MM): Err | RRFS (MM): FNR | RRFS (MM): FPR | FiR: Err | FiR: FNR | FiR: FPR | RRFS (FiR): Err | RRFS (FiR): FNR | RRFS (FiR): FPR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Breast | 0.43 | 0.50 | 0.37 | 0.29 | 0.30 | 0.27 | 0.41 | 0.50 | 0.33 | 0.37 | 0.33 | 0.41 | 0.26 | 0.33 | 0.20 |
CNS | 0.38 | 0.67 | 0.23 | 0.22 | 0.38 | 0.13 | 0.28 | 0.43 | 0.21 | 0.32 | 0.38 | 0.28 | 0.32 | 0.38 | 0.28 |
Colon | 0.32 | 0.50 | 0.22 | 0.34 | 0.55 | 0.22 | 0.19 | 0.32 | 0.12 | 0.26 | 0.27 | 0.25 | 0.21 | 0.27 | 0.18 |
Leukemia | 0.14 | – | – | 0.21 | – | – | 0.21 | – | – | 0.25 | – | – | 0.15 | – | – |
Leukemia_3c | 0.07 | – | – | 0.14 | – | – | 0.14 | – | – | 0.17 | – | – | 0.15 | – | – |
Leukemia_4c | 0.12 | – | – | 0.25 | – | – | 0.14 | – | – | 0.11 | – | – | 0.22 | – | – |
Lung | 0.15 | 0.03 | 0.41 | 0.16 | 0.01 | 0.18 | 0.09 | 0.01 | 0.12 | 0.09 | 0.01 | 0.18 | 0.08 | 0.01 | 0.18 |
Lymphoma | 0.23 | – | – | 0.18 | – | – | 0.08 | – | – | 0.08 | – | – | 0.12 | – | – |
MLL | 0.26 | – | – | 0.26 | – | – | 0.14 | – | – | 0.07 | – | – | 0.15 | – | – |
Ovarian | 0.04 | 0.02 | 0.08 | 0.04 | 0.04 | 0.05 | 0.02 | 0.02 | 0.02 | 0.03 | 0.02 | 0.04 | 0.02 | 0.01 | 0.04 |
SRBCT | 0.23 | – | – | 0.20 | – | – | 0.17 | – | – | 0.23 | – | – | 0.17 | – | – |
Average | 0.22 | 0.34 | 0.26 | 0.21 | 0.26 | 0.17 | 0.17 | 0.26 | 0.16 | 0.18 | 0.20 | 0.23 | 0.17 | 0.20 | 0.18 |
Std. dev. | 0.12 | 0.27 | 0.12 | 0.08 | 0.21 | 0.08 | 0.10 | 0.20 | 0.10 | 0.11 | 0.16 | 0.12 | 0.08 | 0.16 | 0.08 |
Dataset | Discretization | Selection | Classification |
---|---|---|---|
Breast | EFB (n_bins = 6) | RRFS (with FiR; = 0.7) | SVM (C = 1; kernel = linear) |
CNS | EFB (n_bins = 5) | SPEC | DT (criterion = entropy, max_depth = 6, and random_state = 42) |
Colon | MDLP | LS | DT (criterion = entropy, max_depth = None, and random_state = 5) |
Leukemia | EFB (n_bins = 2) | LS | SVM (C = 1; kernel = linear) |
Leukemia_3c | EFB (n_bins = 2) | RRFS (with FiR; = 0.7) | SVM (C = 1; kernel = linear) |
Leukemia_4c | EFB (n_bins = 3) | RRFS (with FiR; = 0.7) | SVM (C = 1; kernel = linear) |
Lung | EFB (n_bins = 5) | FiR | SVM (C = 1; kernel = linear) |
Lymphoma | EFB (n_bins = 2) | LS | SVM (C = 1; kernel = linear) |
MLL | EFB (n_bins = 3) | RRFS (with MM; = 0.7) | SVM (C = 1; kernel = linear) |
Ovarian | EFB (n_bins = 3) | RRFS (with FiR; = 0.7) | SVM (C = 1; kernel = linear) |
SRBCT | EFB (n_bins = 2) | SPEC | SVM (C = 1; kernel = linear) |
Dataset | Discretization | Selection | Classification | Err | FNR | FPR | Selected features (fraction) |
---|---|---|---|---|---|---|---|
Breast | EFB (6) | – | SVM (1, linear) | 0.30 | 0.30 | 0.29 | – |
CNS | EFB (5) | – | DT (entropy, 5, 42) | 0.18 | 0.33 | 0.10 | – |
Colon | – | – | DT (entropy, None, 5) | 0.13 | 0.23 | 0.08 | – |
Leukemia | – | LS | SVM (1, linear) | 0.01 | – | – | 0.13 |
Leukemia_3c | – | RRFS (FiR, 0.7) | SVM (1, linear) | 0.03 | – | – | 0.18 |
Leukemia_4c | – | RRFS (FiR, 0.7) | SVM (1, linear) | 0.07 | – | – | 0.17 |
Lung | – | FiR | SVM (1, linear) | 0.04 | 0.01 | 0.12 | 0.67 |
Lymphoma | – | LS | SVM (1, linear) | 0.00 | – | – | 0.22 |
MLL | – | RRFS (MM, 0.7) | SVM (1, linear) | 0.03 | – | – | 0.23 |
Ovarian | – | RRFS (FiR, 0.7) | SVM (1, linear) | 0.00 | 0.00 | 0.00 | 0.04 |
SRBCT | – | SPEC | SVM (1, linear) | 0.00 | – | – | 0.49 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nogueira, A.; Ferreira, A.; Figueiredo, M. A Machine Learning Pipeline for Cancer Detection on Microarray Data: The Role of Feature Discretization and Feature Selection. BioMedInformatics 2023, 3, 585–604. https://doi.org/10.3390/biomedinformatics3030040