mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features
Abstract
1. Introduction
2. Materials and Methods
2.1. Datasets
2.2. Framework of mRCat
2.3. Feature Encoding
2.3.1. Nucleotide Composition
2.3.2. Three-Tuple Nucleotide Electron–Ion Interaction Pseudopotential
2.3.3. Large Language Model Features
2.4. Performance Evaluation Metrics
3. Results and Discussion
3.1. Comparison of Different Classifiers
3.2. Comparison of Different Encoding Schemes
3.3. Comparison with Other Predictors
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Nucleotide | EIIP |
---|---|
A | 0.1260 |
C | 0.1340 |
G | 0.0806 |
T | 0.1335 |
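
The EIIP values above can be turned into a three-tuple feature vector along the following lines. This is a minimal sketch, not the authors' implementation: it assumes the commonly used PseEIIP formulation, in which each of the 64 trinucleotides is assigned the sum of its three nucleotides' EIIP values, weighted by the trinucleotide's frequency in the sequence. The function name `pse_eiip`, the mapping of U to T, and the skipping of ambiguous bases are illustrative assumptions.

```python
from itertools import product

# EIIP values per nucleotide, copied from the table above.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# All 64 trinucleotides in a fixed (lexicographic) order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def pse_eiip(seq: str) -> list[float]:
    """Hypothetical helper: 64-dimensional vector of (summed EIIP x frequency)."""
    # Assumption: mRNA sequences may use U, which is mapped to T.
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    # Slide a window of width 3 over the sequence with step 1.
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPLETS)
    # Trinucleotide EIIP = sum of its nucleotides' EIIP values,
    # weighted by the trinucleotide's relative frequency.
    return [sum(EIIP[b] for b in t) * counts[t] / total for t in TRIPLETS]


vec = pse_eiip("AUGGCCAUGA")
print(len(vec))  # one entry per trinucleotide
```

The fixed trinucleotide order keeps the feature columns aligned across sequences, which matters when the vectors are concatenated with other feature groups before classification.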

Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
CatBoost + HFs | 0.714 | 0.706 | 0.601 | 0.649 |
CatBoost + LFs | 0.724 | 0.723 | 0.605 | 0.659 |
CatBoost + HFs + LFs (mRCat) | 0.761 | 0.760 | 0.667 | 0.710 |

Model | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
mRCat (CatBoost) | 0.761 | 0.760 | 0.667 | 0.710 |
RNAlight | 0.73 | 0.75 | 0.59 | 0.66 |
SubLocEP | 0.655 | 0.66 | 0.65 | 0.655 |
RNATracker | 0.516 | 0.595 | 0.519 | 0.554 |
mRNALoc | 0.591 | 0.545 | 0.49 | 0.516 |
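
The F1 scores in the comparison table can be cross-checked against the reported precision and recall. The sketch below assumes F1 is the usual harmonic mean of precision and recall, since the metric definitions of Section 2.4 are not reproduced here; the helper name `f1_score` and the `rows` dictionary are illustrative, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each predictor in the table above.
rows = {
    "mRCat (CatBoost)": (0.760, 0.667, 0.710),
    "RNAlight": (0.75, 0.59, 0.66),
    "SubLocEP": (0.66, 0.65, 0.655),
    "RNATracker": (0.595, 0.519, 0.554),
    "mRNALoc": (0.545, 0.49, 0.516),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.3f}, reported F1 = {reported}")
```

Small differences in the last digit are expected, because the tabulated precision and recall are themselves rounded.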
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, X.; Yang, L.; Wang, R. mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features. Biomolecules 2024, 14, 767. https://doi.org/10.3390/biom14070767