Application of Machine Learning Algorithms for Prediction of Tumor T-Cell Immunogens
Abstract
1. Introduction
1.1. The Immune Aspects of Peptide-Based Cancer Vaccines
1.2. Computational Methods for Tumor T-Cell Antigen Prediction
1.2.1. Machine Learning Prediction Tools
1.2.2. Other Prediction Tools
Tool | Year | Method | Availability Online (as of 1 March 2024) |
---|---|---|---|
TTAgP [6] | 2019 | RF | https://github.com/bio-coding/TTAgP |
iTTCA-Hybrid [7] | 2020 | RF, SVM | http://camt.pythonanywhere.com/iTTCA-Hybrid |
iTTCA-RF [8] | 2021 | RF | http://112.124.26.17:7002/ |
TAP [11] | 2021 | ML | No |
PSRTTCA [12] | 2023 | RF | http://pmlabstack.pythonanywhere.com/PSRTTCA |
VaxiJen v2.0 [13] | 2007 | PLS-DA | http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html |
LENS [18] | 2023 | Over two dozen separate tools to generate tumor antigen predictions | https://gitlab.com/landscape-of-effective-neoantigens-software |
OpenVax [19] | 2020 | Bioinformatics pipeline | https://github.com/openvax |
pVACtools [20] | 2020 | Various MHC-I prediction algorithms | https://github.com/griffithlab/pVACtools |
nextNEOpi [21] | 2022 | WES/WGS/RNA-Seq pipeline | https://github.com/icbi-lab/nextNEOpi |
TIminer [22] | 2017 | NGS pipeline | https://bio.tools/timiner |
StackTTCA [26] | 2023 | Stacking ensemble-learning algorithm | No |
2. Materials and Methods
2.1. Datasets
2.2. Descriptors
2.3. Auto-Cross Covariance (ACC) Transformation
- E: the E-descriptor value
- j, k (j ≠ k): the indices of the E-descriptors (j, k = 1–5)
- i: the position of the amino acid in the peptide chain (i = 1, 2, 3, …, n)
- n: the number of amino acids in the peptide
- L: the lag value, i.e., the length of the frame of contiguous amino acids
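The terms defined above can be combined into fixed-length features. The following is a minimal sketch, assuming the commonly used covariance form ACC_jk(L) = Σ_{i=1..n−L} E_{j,i} · E_{k,i+L} / (n − L). The function name `acc_transform`, the feature labels `ACC{j}{k}{L}`, and the two-descriptor table `TOY_E` are illustrative assumptions; the numbers in `TOY_E` are placeholders, not the published E-descriptor values. The sketch computes both the auto-covariance (j = k) and cross-covariance (j ≠ k) terms.

```python
from typing import Dict, Sequence

# Hypothetical placeholder descriptors (two per residue), NOT the real E-descriptors.
TOY_E: Dict[str, Sequence[float]] = {
    "A": (1.0, 0.0),
    "G": (0.0, 1.0),
    "L": (2.0, -1.0),
}


def acc_transform(peptide: str,
                  table: Dict[str, Sequence[float]],
                  max_lag: int) -> Dict[str, float]:
    """Turn a variable-length peptide into a fixed-length ACC feature dict.

    Assumed form: ACC_jk(L) = sum_{i=1}^{n-L} E_{j,i} * E_{k,i+L} / (n - L).
    """
    n = len(peptide)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the peptide length")
    # One descriptor vector per sequence position.
    E = [table[aa] for aa in peptide]
    n_desc = len(E[0])
    features: Dict[str, float] = {}
    for j in range(n_desc):
        for k in range(n_desc):
            for lag in range(1, max_lag + 1):
                total = sum(E[i][j] * E[i + lag][k] for i in range(n - lag))
                # Label uses 1-based descriptor indices followed by the lag.
                features[f"ACC{j + 1}{k + 1}{lag}"] = total / (n - lag)
    return features


features = acc_transform("AGLA", TOY_E, max_lag=2)
```

With d descriptors and a maximum lag of L_max, every peptide yields d × d × L_max values regardless of its length, which is what makes the vectors usable as input to the classifiers in Section 2.4.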
2.4. Machine Learning Methods
2.4.1. k-Nearest Neighbor (kNN)
2.4.2. Linear Discriminant Analysis (LDA)
2.4.3. Quadratic Discriminant Analysis (QDA)
2.4.4. Support Vector Machine (SVM)
2.4.5. Random Forest (RF)
2.4.6. Extreme Gradient Boosting (XGBoost)
2.5. Machine Learning Models Validation
3. Results and Discussion
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ACC | auto- and cross-covariance |
AROC | area under the ROC curve |
CAR | chimeric antigen receptor |
FN | false negatives |
FP | false positives |
GA | genetic algorithm |
IEDB | The Immune Epitope Database |
kNN | k-nearest neighbor |
L | lag-value |
LDA | linear discriminant analysis |
LENS | Landscape of Effective Neoantigens Software |
MCC | Matthews correlation coefficient |
MHC | Major Histocompatibility Complex |
ML | machine learning |
nextNEOpi | Nextflow NEOantigen prediction pipeline |
NGS | next-generation sequencing |
PLS-DA | partial least squares discriminant analysis |
QDA | quadratic discriminant analysis |
RF | random forest |
RNA-seq | RNA sequencing |
ROC | receiver operating characteristic |
SCM | scoring card method |
SMOTE | Synthetic Minority Over-sampling Technique |
SVM | support vector machine |
TIminer | Tumor Immunology Miner |
TN | true negatives |
TP | true positives |
VEP | Variant Effect Predictor |
WES | whole exome sequencing |
WGS | whole genome sequencing |
XGBoost | extreme gradient boosting |
References
1. Singh, T.; Bhattacharya, M.; Mavi, A.K.; Gulati, A.; Rakesh, N.K.S.; Gaur, S.; Kumar, U. Immunogenicity of cancer cells: An overview. Cell Signal. 2024, 113, 110952.
2. Woo, S.R.; Corrales, L.; Gajewski, T.F. Innate immune recognition of cancer. Annu. Rev. Immunol. 2015, 33, 445–474.
3. Tsung, K.; Norton, J.A. In situ vaccine, immunological memory and cancer cure. Hum. Vaccines Immunotherap. 2016, 12, 117–119.
4. Okada, M.; Shimizu, K.; Fujii, S.I. Identification of Neoantigens in Cancer Cells as Targets for Immunotherapy. Int. J. Mol. Sci. 2022, 23, 2594.
5. Soria-Guerra, R.E.; Nieto-Gomez, R.; Govea-Alonso, D.O.; Rosales-Mendoza, S. An overview of bioinformatics tools for epitope prediction: Implications on vaccine development. J. Biomed. Inform. 2015, 53, 405–414.
6. Beltrán, J.F.L.; Herrera, L.B.; Farias, J.G. TTAgP 1.0: A computational tool for the specific prediction of tumor T cell antigens. Comput. Biol. Chem. 2019, 83, 107103.
7. Charoenkwan, P.; Nantasenamat, C.; Hasan, M.M.; Shoombuatong, W. iTTCA-Hybrid: Improved and robust identification of tumor T cell antigens by utilizing hybrid feature representation. Anal. Biochem. 2020, 599, 113747.
8. Jiao, S.; Zou, Q.; Guo, H.; Shi, L. iTTCA-RF: A random forest predictor for tumor T cell antigens. J. Transl. Med. 2021, 19, 449.
9. Kawashima, S.; Ogata, H.; Kanehisa, M. AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999, 27, 368–369.
10. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, P.W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
11. Herrera-Bravo, J.; Herrera, L.B.; Farias, J.G.; Beltrán, J.F. TAP 1.0: A robust immunoinformatic tool for the prediction of tumor T-cell antigens based on AAindex properties. Comput. Biol. Chem. 2021, 91, 107452.
12. Charoenkwan, P.; Pipattanaboon, C.; Nantasenamat, C.; Hasan, M.M.; Moni, M.A.; Lio, P.; Shoombuatong, W. PSRTTCA: A new approach for improving the prediction and characterization of tumor T cell antigens using propensity score representation learning. Comput. Biol. Med. 2023, 152, 106368.
13. Doytchinova, I.A.; Flower, D.R. VaxiJen: A server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinform. 2007, 8, 4.
14. Hellberg, S.; Sjöström, M.; Skagerberg, B.; Wold, S. Peptide quantitative structure-activity relationships, a multivariate approach. J. Med. Chem. 1987, 30, 1126–1135.
15. Wold, S.; Jonsson, J.; Sjöström, M.; Sandberg, M.; Rännar, S. DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least squares projections to latent structures. Anal. Chim. Acta 1993, 277, 239–253.
16. Leardi, R.; Boggia, R.; Terrile, M. Genetic algorithms as a strategy for feature selection. J. Chemom. 1992, 6, 267–281.
17. Ståhle, L.; Wold, S. Partial least squares analysis with cross-validation for the two-class problem: A monte carlo study. J. Chemom. 1987, 1, 185–196.
18. Vensko, S.P.; Olsen, K.; Bortone, D.; Smith, C.C.; Chai, S.; Beckabir, B.; Fini, M.; Jadi, O.; Rubinsteyn, A.; Vincent, B.G. LENS: Landscape of Effective Neoantigens Software. Bioinformatics 2023, 39, 6.
19. Kodysh, J.; Rubinsteyn, A. OpenVax: An open-source computational pipeline for cancer neoantigen prediction. In Bioinformatics for Cancer Immunotherapy; Boegel, S., Ed.; Methods in Molecular Biology; Humana: New York, NY, USA, 2020; Volume 2120, pp. 147–160.
20. Hundal, J.; Kiwala, S.; McMichael, J.; Miller, C.A.; Xia, H.; Wollam, A.T.; Liu, C.J.; Zhao, S.; Feng, Y.Y.; Graubert, A.P.; et al. pVACtools: A Computational Toolkit to Identify and Visualize Cancer Neoantigens. Cancer Immunol. Res. 2020, 8, 409–420.
21. Rieder, D.; Fotakis, G.; Ausserhofer, M.; René, G.; Paster, W.; Trajanoski, Z.; Finotello, F. nextNEOpi: A comprehensive pipeline for computational neoantigen prediction. Bioinformatics 2022, 38, 1131–1132.
22. Tappeiner, E.; Finotello, F.; Charoentong, P.; Mayer, C.; Rieder, D.; Trajanoski, Z. TIminer: NGS data mining pipeline for cancer immunology and immunotherapy. Bioinformatics 2017, 33, 3140–3141.
23. McLaren, W.; Gil, L.; Hunt, S.E.; Riat, H.S.; Ritchie, G.R.S.; Thormann, A.; Flicek, P.; Cunningham, F. The Ensembl Variant Effect Predictor. Genome Biol. 2016, 17, 122.
24. Szolek, A.; Schubert, B.; Mohr, C.; Sturm, M.; Feldhahn, M.; Kohlbacher, O. OptiType: Precision HLA typing from next-generation sequencing data. Bioinformatics 2014, 30, 3310–3316.
25. Jurtz, V.; Paul, S.; Andreatta, M.; Marcatili, P.; Peters, B.; Nielsen, M. NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J. Immunol. 2017, 199, 3360–3368.
26. Charoenkwan, P.; Schaduangrat, N.; Shoombuatong, W. StackTTCA: A stacking ensemble learning-based framework for accurate and high-throughput identification of tumor T cell antigens. BMC Bioinform. 2023, 24, 301.
27. Vita, R.; Overton, J.A.; Greenbaum, J.A.; Ponomarenko, J.; Clark, J.D.; Cantrell, J.R.; Wheeler, D.K.; Gabbard, J.L.; Hix, D.; Sette, A.; et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015, 43, D405–D412.
28. Venkatarajan, M.S.; Braun, W. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J. Mol. Model. 2001, 7, 445–453.
29. Scikit-Learn Machine Learning in Python. Available online: https://scikit-learn.org (accessed on 5 May 2024).
30. Sklearn.Model_Selection.GridSearchCV. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (accessed on 5 May 2024).
31. Goldberger, J.; Hinton, G.E.; Roweis, S.T.; Salakhutdinov, R.R. Neighbourhood components analysis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 513–520.
32. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2008; Section 4.3; pp. 106–119.
33. Bhavsar, H.P.; Panchal, M. A Review on Support Vector Machine for Data Classification. IJARCET 2012, 1, 185–189.
34. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
35. Chen, T.Q.; Guestrin, C. Xgboost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
36. Ojala, M.; Garriga, G.C. Permutation tests for studying classifier performance. J. Mach. Learn. Res. 2010, 11, 1833–1863.
37. Tharwat, A. Classification assessment methods. New Engl. J. Entrepr. 2020, 17, 168–192.
38. Wold, S.; Eriksson, L. Statistical Validation of QSAR Results. In Chemometric Methods in Molecular Design; Weinheim van de Waterbeemd, H., Ed.; Wiley: Hoboken, NJ, USA, 1995; pp. 309–318.
Model | TN | FN | TP | FP | Sensitivity | Specificity | Accuracy | AROC | MCC |
---|---|---|---|---|---|---|---|---|---|
kNN | 49 | 9 | 161 | 121 | 0.95 | 0.29 | 0.62 | 0.62 | 0.31 |
LDA | 92 | 58 | 112 | 78 | 0.66 | 0.54 | 0.60 | 0.61 | 0.20 |
QDA | 115 | 105 | 65 | 55 | 0.38 | 0.68 | 0.53 | 0.53 | 0.06 |
SVM | 138 | 38 | 132 | 32 | 0.78 | 0.81 | 0.80 | 0.88 | 0.59 |
RF | 153 | 48 | 122 | 17 | 0.72 | 0.90 | 0.81 | 0.87 | 0.63 |
XGBoost | 129 | 35 | 135 | 41 | 0.79 | 0.76 | 0.78 | 0.86 | 0.55 |
Model | TN | FN | TP | FP | Sensitivity | Specificity | Accuracy | AROC | MCC |
---|---|---|---|---|---|---|---|---|---|
kNN | 9 | 4 | 38 | 33 | 0.90 | 0.21 | 0.56 | 0.56 | 0.16 |
LDA | 22 | 16 | 26 | 20 | 0.62 | 0.52 | 0.57 | 0.57 | 0.14 |
QDA | 26 | 26 | 16 | 16 | 0.38 | 0.62 | 0.50 | 0.50 | 0.00 |
SVM | 36 | 9 | 33 | 6 | 0.79 | 0.86 | 0.82 | 0.83 | 0.64 |
RF | 39 | 10 | 32 | 3 | 0.76 | 0.93 | 0.85 | 0.80 | 0.70 |
XGBoost | 32 | 9 | 33 | 10 | 0.79 | 0.76 | 0.77 | 0.83 | 0.55 |
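The sensitivity, specificity, accuracy, and MCC columns in the two performance tables follow directly from the TN, FN, TP, and FP counts. As a minimal sketch using the standard definitions (the function name `confusion_metrics` is an illustrative assumption), the RF row of the second table (TN = 39, FN = 10, TP = 32, FP = 3) can be recomputed as follows:

```python
import math
from typing import Dict


def confusion_metrics(tn: int, fn: int, tp: int, fp: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# RF row of the second performance table.
rf = confusion_metrics(tn=39, fn=10, tp=32, fp=3)
rf_rounded = {name: round(value, 2) for name, value in rf.items()}
```

The rounded values agree with the tabulated 0.76, 0.93, 0.85, and 0.70. AROC is not included because it depends on the ranking of prediction scores and cannot be recovered from the four counts alone.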
Model | Top 10 Features |
---|---|
SVM | |
Permutation feature importance | ACC145, ACC137, ACC313, ACC114, ACC117, ACC217, ACC116, ACC147, ACC111, ACC234 |
Drop-column feature importance | ACC313, ACC145, ACC552, ACC446, ACC416, ACC324, ACC247, ACC234, ACC147, ACC141 |
RF | |
Permutation feature importance | ACC511, ACC417, ACC234, ACC117, ACC433, ACC514, ACC247, ACC315, ACC341, ACC254 |
Drop-column feature importance | ACC547, ACC541, ACC537, ACC536, ACC535, ACC534, ACC533, ACC532, ACC531, ACC525 |
XGBoost | |
Permutation feature importance | ACC535, ACC137, ACC516, ACC145, ACC254, ACC151, ACC442, ACC132, ACC531, ACC441 |
Drop-column feature importance | ACC135, ACC511, ACC457, ACC237, ACC455, ACC453, ACC441, ACC432, ACC311, ACC233 |
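The two rankings in the table above come from different procedures: permutation importance shuffles one feature at a time and re-scores an already fitted model, whereas drop-column importance retrains the model without that feature. The following is a minimal sketch of both ideas in plain Python. The nearest-centroid classifier, the synthetic two-feature data, and all function names are assumptions chosen to keep the example self-contained; they stand in for, and are not, the SVM, RF, and XGBoost models or the ACC features of the study.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def fit_centroids(X: Matrix, y: List[int]) -> Dict[int, List[float]]:
    """Toy stand-in model: the per-class mean of every feature."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids: Dict[int, List[float]], x: List[float]) -> int:
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids: Dict[int, List[float]], X: Matrix, y: List[int]) -> float:
    return sum(predict(centroids, x) == lab for x, lab in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, n_repeats=10, seed=0) -> List[float]:
    """Mean accuracy drop after shuffling one column; the model is NOT refitted."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
            drops.append(base - accuracy(centroids, X_perm, y))
        scores.append(sum(drops) / n_repeats)
    return scores


def without_column(X: Matrix, j: int) -> Matrix:
    return [x[:j] + x[j + 1:] for x in X]


def drop_column_importance(X_tr, y_tr, X_te, y_te) -> List[float]:
    """Accuracy drop after REFITTING the model without one column."""
    base = accuracy(fit_centroids(X_tr, y_tr), X_te, y_te)
    scores = []
    for j in range(len(X_tr[0])):
        model = fit_centroids(without_column(X_tr, j), y_tr)
        scores.append(base - accuracy(model, without_column(X_te, j), y_te))
    return scores


# Synthetic data: feature 0 tracks the label, feature 1 is unrelated to it.
y = [i % 2 for i in range(60)]
X = [[y[i] + 0.1 * ((i % 5) - 2), float(i % 3)] for i in range(60)]
X_tr, y_tr, X_te, y_te = X[:30], y[:30], X[30:], y[30:]

model = fit_centroids(X_tr, y_tr)
perm = permutation_importance(model, X_te, y_te)
drop = drop_column_importance(X_tr, y_tr, X_te, y_te)
```

In this toy setting both methods rank the informative feature first and give the unrelated feature an importance of zero. With correlated features, as among ACC terms, the two procedures can disagree, because a retrained model can compensate for a dropped feature while a fitted model cannot compensate for a shuffled one.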
Algorithm | Sensitivity | Specificity | Accuracy | MCC |
---|---|---|---|---|
SVM + RF + XGBoost | 0.76 | 0.91 | 0.83 | 0.67 |
TTAgP 1.0 | 0.60 | 0.00 | 0.30 | −0.50 |
iTTCA-Hybrid | 0.24 | 0.17 | 0.20 | −0.60 |
iTTCA-RF | 0.24 | 0.24 | 0.24 | −0.52 |
PSRTTCA | 0.76 | 0.26 | 0.51 | 0.03 |
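The first row of the comparison table combines three classifiers. The table names the combination (SVM + RF + XGBoost) but not the rule used to merge the predictions, so the sketch below assumes simple majority voting over binary labels; the function name `majority_vote` and the example predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(*model_predictions: Sequence[int]) -> List[int]:
    """Per-sample majority label across models (assumed combination rule).

    With an odd number of binary classifiers a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


# Hypothetical per-peptide predictions (1 = immunogen, 0 = non-immunogen).
svm_pred = [1, 0, 1, 0]
rf_pred = [1, 1, 0, 0]
xgb_pred = [0, 1, 1, 0]

consensus = majority_vote(svm_pred, rf_pred, xgb_pred)
```

A peptide is labeled positive only when at least two of the three models agree, which tends to trade a little sensitivity for higher specificity.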
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sotirov, S.; Dimitrov, I. Application of Machine Learning Algorithms for Prediction of Tumor T-Cell Immunogens. Appl. Sci. 2024, 14, 4034. https://doi.org/10.3390/app14104034