lncRNA_Mdeep: An Alignment-Free Predictor for Distinguishing Long Non-Coding RNAs from Protein-Coding Transcripts by Multimodal Deep Learning
Abstract
1. Introduction
2. Results
2.1. Performance of lncRNA_Mdeep
2.1.1. Performance of Different Model Architectures
2.1.2. Effects of Different Hyper-Parameters
2.2. Comparison with Other Existing Methods
2.2.1. Comparison Performance on Human Dataset
2.2.2. Comparison Performance on Cross-Species Datasets
3. Discussion
4. Material and Methods
4.1. Dataset
4.2. LncRNA_Mdeep
4.2.1. Feature Extraction and One-Hot Encoding
4.2.2. High-Level Abstract Representations
4.2.3. Multimodal Framework
4.3. Evaluation Metrics
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
Abbreviations
lncRNA | Long non-coding RNA |
SVM | Support vector machine |
CNN | Convolutional neural network |
RNN | Recurrent neural network |
DNN | Deep neural network |
ORF | Open reading frame |
ACC | Accuracy |
Sn | Sensitivity |
Sp | Specificity |
MCC | Matthews correlation coefficient |
Model | ACC (%) | Sn (%) | Sp (%) | MCC |
---|---|---|---|---|
OFH_DNN | 95.74 ± 1.70 | 94.44 ± 4.89 | 97.04 ± 2.15 | 0.9171 ± 0.0307 |
k-mer_DNN | 96.53 ± 0.41 | 96.40 ± 1.11 | 96.66 ± 0.78 | 0.9307 ± 0.0082 |
One-hot_CNN | 95.82 ± 0.33 | 97.01 ± 0.96 | 94.63 ± 1.19 | 0.9169 ± 0.0064 |
OFH_DNN + k-mer_DNN | 95.97 ± 2.49 | 96.87 ± 1.05 | 95.06 ± 5.71 | 0.9211 ± 0.0449 |
k-mer_DNN + One-hot_CNN | 98.36 ± 0.16 | 98.70 ± 0.42 | 98.03 ± 0.50 | 0.9674 ± 0.0033 |
OFH_DNN + One-hot_CNN | 97.60 ± 1.26 | 97.78 ± 1.58 | 97.43 ± 2.33 | 0.9526 ± 0.0248 |
Decision fusion | 98.42 ± 1.12 | 99.24 ± 0.45 | 97.60 ± 2.59 | 0.9689 ± 0.0212 |
lncRNA_Mdeep | 98.73 ± 0.41 | 98.95 ± 0.54 | 98.52 ± 0.92 | 0.9748 ± 0.0080 |
Methods | ACC (%) | Sn (%) | Sp (%) | MCC |
---|---|---|---|---|
CNCI | 86.40 | 97.42 | 75.38 | 0.7463 |
CPAT | 87.98 | 95.22 | 80.73 | 0.7676 |
PLEK | 77.71 | 97.22 | 58.20 | 0.6019 |
lncRNA-MFDL | 85.47 | 93.43 | 77.50 | 0.7185 |
CPC2 | 77.98 | 94.07 | 61.90 | 0.5911 |
lncRNAnet | 92.18 | 96.63 | 87.73 | 0.8470 |
lncFinder1 | 86.22 | 95.20 | 77.23 | 0.7363 |
lncFinder2 | 86.88 | 95.98 | 77.77 | 0.7501 |
lncRNA_Mdeep | 93.12 | 97.27 | 88.97 | 0.8653 |
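
The four reported metrics follow the standard confusion-matrix definitions. The sketch below is a minimal illustration, not the authors' evaluation code. The function name `binary_metrics` and the counts are hypothetical: they assume 10,000 transcripts per class, a balance inferred only from the fact that each ACC in the table above equals the mean of Sn and Sp. The actual test-set sizes are not given here. Under that assumption, the counts are chosen to match the Sn and Sp of the lncRNA_Mdeep row, and the resulting ACC and MCC agree with that row to within rounding.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (ACC, Sn, Sp, MCC) from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + tn + fp
    acc = (tp + tn) / total          # accuracy
    sn = tp / (tp + fn)              # sensitivity: recall on the positive class
    sp = tn / (tn + fp)              # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; return 0.0 then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc


# Hypothetical counts (assumption: 10,000 positives and 10,000 negatives),
# chosen so that Sn = 97.27% and Sp = 88.97% as in the lncRNA_Mdeep row.
acc, sn, sp, mcc = binary_metrics(tp=9727, fn=273, tn=8897, fp=1103)
print(f"ACC={acc:.2%}  Sn={sn:.2%}  Sp={sp:.2%}  MCC={mcc:.4f}")
```

Because Sn and Sp in the table are themselves rounded, an MCC recomputed this way can differ from the published value in the last digit.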
Species | CNCI | CPAT | PLEK | lncRNA-MFDL | CPC2 | lncRNAnet | lncFinder1 | lncFinder2 | lncRNA_Mdeep |
---|---|---|---|---|---|---|---|---|---|
Mouse | 87.09 | 90.47 | 71.89 | 88.53 | 80.43 | 91.81 | 88.47 | 88.99 | 92.52 |
Arabidopsis | 79.86 | 91.39 | 66.93 | 97.30 | 93.36 | 94.60 | 92.45 | 93.77 | 95.73 |
Bos taurus | 92.88 | 97.13 | 89.32 | 95.51 | 96.10 | 96.30 | 97.00 | 97.03 | 97.33 |
C. elegans | 77.72 | 91.48 | 45.37 | 97.97 | 94.75 | 97.95 | 87.46 | 88.55 | 98.87 |
Chicken | 91.52 | 97.04 | 83.95 | 96.87 | 95.22 | 95.56 | 96.82 | 96.64 | 96.06 |
Chimpanzee | 89.84 | 96.18 | 88.99 | 94.26 | 95.48 | 94.78 | 96.05 | 96.21 | 96.76 |
Frog | 90.60 | 96.40 | 80.90 | 96.14 | 96.34 | 95.53 | 96.92 | 97.26 | 96.80 |
Fruit fly | 92.90 | 96.02 | 74.43 | 96.49 | 94.28 | 95.21 | 95.33 | 95.50 | 96.10 |
Gorilla | 89.37 | 94.99 | 86.75 | 95.12 | 94.12 | 94.31 | 94.72 | 94.87 | 95.65 |
Pig | 91.73 | 96.91 | 87.34 | 96.98 | 95.86 | 95.56 | 96.88 | 96.82 | 96.87 |
Zebrafish | 93.59 | 97.50 | 85.07 | 92.17 | 96.83 | 95.77 | 97.54 | 97.78 | 96.76 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Fan, X.-N.; Zhang, S.-W.; Zhang, S.-Y.; Ni, J.-J. lncRNA_Mdeep: An Alignment-Free Predictor for Distinguishing Long Non-Coding RNAs from Protein-Coding Transcripts by Multimodal Deep Learning. Int. J. Mol. Sci. 2020, 21, 5222. https://doi.org/10.3390/ijms21155222