Multivariate and Dimensionality-Reduction-Based Machine Learning Techniques for Tumor Classification of RNA-Seq Data
Abstract
1. Introduction
2. Dataset
3. Preprocessing
3.1. Data Transformation
3.2. Dimensionality Reduction
3.2.1. Low Variance Filter
3.2.2. High Correlation Filter
3.3. Random Forest Top Feature ID
3.4. Factor Analysis
- Bartlett Sphericity Test: The Bartlett Sphericity Test [24] checks whether the features (observed variables) are intercorrelated by comparing the observed correlation matrix against the identity matrix. A significant result means the two matrices differ, so the null hypothesis of uncorrelated features is rejected. For our feature set, the p-value was 0 (to reporting precision), indicating that a factor analysis is feasible.
- Kaiser–Meyer–Olkin (KMO) Test: The KMO test estimates the proportion of variance among all the observed variables [24]. KMO values range between 0 and 1, with a value of or more indicating a factor analysis is feasible. For the test of our feature set, the KMO value was , again indicating a factor analysis is feasible.
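As a rough illustration of the two feasibility checks above, the sketch below computes Bartlett's chi-square statistic and the overall KMO value directly from a correlation matrix using NumPy and SciPy. The function names `bartlett_sphericity` and `kmo_overall` and the synthetic data are assumptions made for this example and are not the authors' code; the `factor_analyzer` package listed in the references offers comparable routines.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix of X is the identity.

    X has shape (n_samples, n_features). Returns (statistic, p_value).
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # slogdet is numerically safer than log(det(R)) for near-singular R.
    _, logdet = np.linalg.slogdet(R)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) / 2.0
    p_value = chi2.sf(statistic, dof)
    return statistic, p_value


def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy for X."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    # Partial correlations come from the rescaled inverse correlation matrix.
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)        # squared simple correlations
    q2 = np.sum(partial[off_diag] ** 2)  # squared partial correlations
    return r2 / (r2 + q2)


# Synthetic example (hypothetical data): six observed variables
# driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.5 * rng.normal(size=(500, 6))

stat, p_value = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(f"Bartlett chi-square = {stat:.1f}, p-value = {p_value:.3g}")
print(f"Overall KMO = {kmo:.3f}")
```

With tens of thousands of features and only a few thousand samples, the full correlation matrix is singular, so a check like this is only meaningful on a reduced feature set such as those listed in the dataset table.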
4. Modeling and Validation
4.1. Individual Modeling
4.1.1. Logistic Regression (LogReg)
4.1.2. k-Nearest Neighbor (kNN)
4.1.3. Decision Tree (DTree)
4.1.4. Random Forest (RF)
4.1.5. Neural Network Multilayer Perceptron (NN)
4.1.6. Support Vector Machine
4.2. Ensemble Modeling
4.3. Validation Results
- True positives (TPs): These are the cases where the model correctly predicts a positive outcome when the actual outcome is positive.
- True negatives (TNs): These are the cases where the model correctly predicts a negative outcome when the actual outcome is negative.
- False positives (FPs): These are the cases where the model incorrectly predicts a positive outcome when the actual outcome is negative.
- False negatives (FNs): These are the cases where the model incorrectly predicts a negative outcome when the actual outcome is positive.
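To make the four outcome counts above concrete, the sketch below derives them for one class in a one-vs-rest fashion from true and predicted labels, then computes common metrics built from them. The helper names `one_vs_rest_counts` and `metrics` and the toy labels (which reuse a few tumor codes from the tumor-code table) are assumptions for illustration; the paper's own evaluation code is not shown here.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted != positive and actual != positive:
            tn += 1
        elif predicted == positive and actual != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Toy labels (hypothetical) using tumor codes 5 (breast), 17 and 18 (lung).
y_true = [5, 5, 17, 17, 18, 18, 18, 5]
y_pred = [5, 17, 17, 17, 18, 17, 18, 5]

counts = one_vs_rest_counts(y_true, y_pred, positive=17)
accuracy, precision, recall, f1 = metrics(*counts)
print(counts)  # (2, 4, 2, 0)
print(accuracy, precision, recall, round(f1, 3))
```

In a multiclass setting such as the 18 tumor types here, the same counting is repeated with each class in turn as the positive class.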
5. Related Work Comparisons
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
BFGS | Broyden–Fletcher–Goldfarb–Shanno algorithm |
DNA | Deoxyribonucleic acid |
DTree | Decision tree |
EFA | Exploratory factor analysis |
GDC | Genomic Data Commons |
KMO | Kaiser–Meyer–Olkin |
kNN | k-Nearest Neighbor |
LogReg | Logistic regression |
NCI | National Cancer Institute |
NN | Neural network multilayer perceptron |
RBF | Radial basis function |
RF | Random forest |
RNA | Ribonucleic acid |
RNA-Seq | Ribonucleic acid sequencing |
SMOTE | Synthetic Minority Oversampling Technique |
SVC | Support Vector Classification |
TC1 | Pilot 1 Tumor Classifier project |
TPM | Transcripts per kilobase million |
References
1. Bronakowski, M.; Al-khassaweneh, M.; Al Bataineh, A. Automatic Detection of Clickbait Headlines Using Semantic Analysis and Machine Learning Techniques. Appl. Sci. 2023, 13, 2456.
2. Huette, J.; Al-Khassaweneh, M.; Oakley, J. Using Machine Learning Techniques for Clickbait Classification. In Proceedings of the 2022 IEEE International Conference on Electro Information Technology (eIT), Romeoville, IL, USA, 19–21 May 2022; pp. 091–095.
3. Al Bataineh, A.; Kaur, D.; Al-khassaweneh, M.; Al-sharoa, E. Automated CNN Architectural Design: A Simple and Efficient Methodology for Computer Vision Tasks. Mathematics 2023, 11, 1141.
4. Siegel, R.L.; Miller, K.D.; Wagle, N.S.; Jemal, A. Cancer statistics. CA Cancer J. Clin. 2023, 73, 17–48.
5. O'Keefe, W.; Ide, B.; Al-Khassaweneh, M.; Abuomar, O.; Szczurek, P. A CNN approach for skin cancer classification. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 472–475.
6. Available online: https://www.cancer.gov/about-cancer/understanding/what-is-cancer (accessed on 4 December 2022).
7. Available online: https://www.genome.gov/genetics-glossary/RNA-Ribonucleic-Acid (accessed on 4 December 2022).
8. Behjati, S.; Tarpey, P.S. What is next generation sequencing? Arch. Dis. Child.-Educ. Pract. 2013, 98, 236–238.
9. Mardis, E.R. DNA sequencing technologies: 2006–2016. Nat. Protoc. 2017, 12, 213–218.
10. Elbashir, M.K.; Ezz, M.; Mohammed, M.; Saloum, S.S. Lightweight convolutional neural network for breast cancer classification using RNA-seq gene expression data. IEEE Access 2019, 7, 185338–185348.
11. Rukhsar, L.; Bangyal, W.H.; Ali Khan, M.S.; Ag Ibrahim, A.A.; Nisar, K.; Rawat, D.B. Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Appl. Sci. 2022, 12, 1850.
12. Khalifa, N.E.M.; Taha, M.H.N.; Ali, D.E.; Slowik, A.; Hassanien, A.E. Artificial intelligence technique for gene expression by tumor RNA-Seq data: A novel optimized deep learning approach. IEEE Access 2020, 8, 22874–22883.
13. Bonat, E. Available online: https://medium.com/@ernest-bonat/rna-seq-gene-expression-classification-using-machine-learning-algorithms-de862e60bfd0 (accessed on 4 December 2022).
14. Cascianelli, S.; Molineris, I.; Isella, C.; Masseroli, M.; Medico, E. Machine learning for RNA sequencing-based intrinsic subtyping of breast cancer. Sci. Rep. 2020, 10, 14071.
15. Wang, J.; Dai, X.; Luo, H.; Yan, C.; Zhang, G.; Luo, J. MI_DenseNetCAM: A Novel Pan-Cancer Classification and Prediction Method Based on Mutual Information and Deep Learning Model. Front. Genet. 2021, 12, 670232.
16. Li, Y.; Kang, K.; Krahn, J.M.; Croutwater, N.; Lee, K.; Umbach, D.M.; Li, L. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genom. 2017, 18, 508.
17. Lyu, B.; Haque, A. Deep learning based tumor type classification using gene expression data. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC, USA, 29 August–1 September 2018; pp. 89–96.
18. Available online: https://datascience.cancer.gov/collaborations/joint-design-advanced-computing/cellular-pilot (accessed on 4 December 2022).
19. Available online: https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-6996872 (accessed on 4 December 2022).
20. Available online: https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Tumor_Classifier-hardening/blob/master/TC1-dataprep.ipynb (accessed on 4 December 2022).
21. Zebari, R.; Abdulazeez, A.; Zeebaree, D.; Zebari, D.; Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 2020, 1, 56–70.
22. Available online: https://pypi.org/project/factor-analyzer/ (accessed on 4 December 2022).
23. Rahn, M. Factor Analysis: A Short Introduction, Part 5: Dropping Unimportant Variables from your Analysis. Anal. Factor 2014. Available online: https://www.theanalysisfactor.com/factor-analysis-5/ (accessed on 4 December 2022).
24. Toth, G. Available online: https://www.datasklr.com/principal-component-analysis-and-factor-analysis/factor-analysis (accessed on 4 December 2022).
25. Available online: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning (accessed on 4 December 2022).
26. Available online: https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/TC1 (accessed on 4 December 2022).
Sample # | Tumor Code | Gene 1 | Gene 2 | Gene 3 | … | Gene 60,482 | Gene 60,483 |
---|---|---|---|---|---|---|---|
0 | 29 | | | | … | | |
1 | 5 | | | | … | | |
2 | 29 | | | | … | | |
… | … | … | … | … | … | … | … |
5397 | 14 | | | | … | | |
5398 | 11 | | | | … | | |
5399 | 11 | | | | … | | |
Tumor Code | Tumor Cancer Type | Tumor Code | Tumor Cancer Type |
---|---|---|---|
1 | Leukemia-LAML | 16 | Liver-LIHC |
3 | Bladder-BLCA | 17 | Lung-LUAD |
4 | Brain-LLG | 18 | Lung-LUSC |
5 | Breast-BRCA | 22 | Ovarian-OV |
6 | Cervical-CESC | 25 | Prostate-PRAD |
8 | Colon-COAD | 29 | Skin-SKCM |
11 | Head/Neck-HNSC | 30 | Stomach-STAD |
14 | Kidney-KIRC | 33 | Thyroid-THCA |
15 | Kidney-KIRP | 35 | Uterine-UCEC |
# | Dataset | # Features |
---|---|---|
1 | Baseline dataset | 58,544 |
2 | High-Correlation-Filtered dataset | 46,688 |
3 | Low-Variance-Filtered dataset | 35,333 |
4 | Combined High Correlation/Low Variance dataset | 26,963 |
5 | r ≥ 0.5 Correlated dataset | 3928 |
6 | Top 250 Correlations per Tumor dataset | 4082 |
7 | RF Top Features dataset | 660 |
8 | Factors—Top 250 Correlations per Tumor dataset | 356 |
9 | Factors—Top RF Features dataset | 68 |
Related Work | Best Model and Dataset Type | Accuracy |
---|---|---|
Tumor Classifier Project (TC1) 2019 [26] | 1D Convolutional Neural Network Model – 5400 Samples × 60,484 RNA-Seq Features – 18 Target Tumor Types | |
Bonat 2022 [13] | Light Gradient Boosting Machine Model – 1500 Samples × 60,484 RNA-Seq Features – 5 Target Tumor Types | 1 |
Cascianelli et al. 2020 [14] | Logistic Regression Model – 4731 Samples × 19,737 RNA-Seq Features – 5 Target Breast Tumor Subtypes | 2 |
Wang et al. 2021 [15] | MI_DenseNetCAM Model (Deep Learning) 4 – 10,267 Samples × 20,531 RNA-Seq Features – Reduced to 10,267 × 3600 RNA-Seq Features – 33 Target Tumor Types | |
Li et al. 2017 [16] | k-Nearest Neighbors Model – 9096 Samples × 20,000 RNA-Seq Features – 31 Target Tumor Types | 3 |
Lyu et al. 2018 [17] | 2D Convolutional Neural Network Model – 10,267 Samples × 20,531 RNA-Seq Features – 33 Target Tumor Types | 3 |
Our Top Model 2023 | Support Vector Machine – 5400 Samples × 60,484 RNA-Seq Features – Reduced to 5400 Samples × 68 Factor Features – 18 Target Tumor Types | 2 |
Tumor Code | Tumor | Wang [15] Accuracy | Li [16] Accuracy | Lyu [17] Accuracy | Proposed Model Accuracy |
---|---|---|---|---|---|
1 | Leukemia-LAML | | | | |
3 | Bladder-BLCA | 91% | | | |
4 | Brain-LLG | | | | |
5 | Breast-BRCA | | | | |
6 | Cervical-CESC | | | | |
8 | Colon-COAD | | | | |
11 | Head/Neck-HNSC | | | | |
14 | Kidney-KIRC | | | | |
15 | Kidney-KIRP | | | | |
16 | Liver-LIHC | | | | |
17 | Lung-LUAD | | | | |
18 | Lung-LUSC | | | | |
22 | Ovarian-OV | | | | |
25 | Prostate-PRAD | | | | |
29 | Skin-SKCM | | | | |
30 | Stomach-STAD | –– | | | |
33 | Thyroid-THCA | | | | |
35 | Uterine-UCEC | | | | |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Al-khassaweneh, M.; Bronakowski, M.; Al-Sharoa, E. Multivariate and Dimensionality-Reduction-Based Machine Learning Techniques for Tumor Classification of RNA-Seq Data. Appl. Sci. 2023, 13, 12801. https://doi.org/10.3390/app132312801