Machine Learning-Based Models for Detection of Biomarkers of Autoimmune Diseases by Fragmentation and Analysis of miRNA Sequences
Abstract
1. Introduction
- Introducing the transcriptomic fragmentation model for miRNA sequence analysis and autoimmune disease biomarker detection (Model_I);
- Introducing an LSTM model for the detection of biomarkers of autoimmune diseases (Model_II);
- Studying the impact of two library preparation kits (NEBNEXT and NEXTflex) on biomarker detection;
- Applying further experiments to a previously published model (Model_0) using a more extensive dataset;
- Conducting further experiments on both introduced models using three different autoimmune disease datasets, which produced the following findings:
  - Model_I saves 22.4% of the execution time compared to Model_0.
  - Using Model_I, a fragmentation size of 0.2 of a whole sequence file is sufficient for autoimmune disease biomarker detection.
  - Model_I reported sensitivity, specificity, precision, accuracy, and F1 scores of 92.7, 92.8, 94.8, 95.7, and 95.2, respectively, in RA biomarker detection.
  - Model_I reported sensitivity, specificity, precision, accuracy, and F1 scores of 94.6, 94.7, 95.6, 96.8, and 96.1, respectively, in MS biomarker detection.
  - Model_I reported an accuracy score of 89% in analyzing sensitive synthetic data.
  - The sequences prepared by NEBNEXT demonstrated relatively higher detection accuracy.
  - Model_II reported a promising accuracy score of 0.72 for MS biomarker detection, considering the study’s relatively small dataset.
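The five scores reported in the findings above are standard confusion-matrix metrics. The following is a minimal sketch of how they relate to each other for a binary case/control split; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper, which may average the scores differently for its three-class RA experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics as percentages.

    tp/fp/tn/fn are confusion-matrix counts with "case" as the positive class.
    """
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # share of predicted cases that are real cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    scores = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }
    return {name: round(100 * value, 1) for name, value in scores.items()}


# Hypothetical counts, chosen only for illustration (not results from the study).
example = binary_metrics(tp=90, fp=5, tn=95, fn=10)
print(example)
```

Because F1 is the harmonic mean of precision and sensitivity, it always lies between those two values, which is a quick sanity check on any reported set of scores.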
1.1. Background
1.1.1. Fragmentation and End Repair
1.1.2. Adapter Ligation
1.1.3. Polymerase Chain Reaction (PCR) Amplification
2. Materials and Methods
2.1. Datasets
2.2. Development Environment
2.3. Implemented Models
2.3.1. Model_0
2.3.2. Fragmented Sequence Model (Model_I)
2.3.3. Analyzing Datasets Using a Deep Learning Model (Model_II)
3. Results
Model | Dataset | Experiment Name | Objective | Description | Result |
---|---|---|---|---|---|
Model_I | Dataset_VI | Ex01 | | Apply sequence fragmentation among the sub-datasets produced from Dataset_VI (DS1–DSn) with different fragmentation sizes f = (0.1, 0.2, 0.3, and 0.4) and evaluate the model accuracy. | |
Model_I | Dataset_I | Ex02 | To determine the miRNA sequence fragment size that yields the highest classification accuracy on a different disease dataset (Dataset_I). | Apply sequence fragmentation to the sub-datasets produced from Dataset_I (DS1–DSn) with fragmentation sizes f = 0.1, 0.2, 0.3, and 0.4, and evaluate the model accuracy. | The fragmentation size that reported the highest classification accuracy was f = 0.2, which confirms the results of Ex01 (Figure 11). |
Model_I | Dataset_I | Ex03 | To detect the biomarkers of RA using Model_I. | Classify the entire Dataset_I into Cases, Controls, and Synthetic. | The reported accuracy of RA biomarker detection is 95.7 (Figure 12). |
Model_I | Dataset_I | Ex04 | To determine the impact of library preparation kits on disease biomarker detection. | Classify the NEBNEXT samples into RA Case/Control/Synthetic and the NEXTflex samples into RA Case/Control/Synthetic, and compare the resulting classification accuracy. | NEXTflex-prepared samples have a higher potential for detecting RA biomarkers (Figure 13). |
Model_I | Dataset_I | Ex05 | To evaluate the model with sensitive data. | Distinguish between the synthetic data classes (A, B, C, D, and E) prepared by the NEBNEXT/NEXTflex kits. | The classification accuracy scores reported for samples prepared by NEXTflex are relatively higher than those for NEBNEXT (Figure 14). Ex03, Ex04, and Ex05 are consolidated in Figure 15. |
Model_I | Dataset_II | Ex06 | To detect the biomarkers of MS disease | Classifying Dataset_II into MS Cases/Controls | Figure 16 |
Model_I | Dataset_VI | Ex07 | To detect the biomarkers of MS disease and examine Model_0 with more extensive datasets. | Classifying Dataset_VI into MS Cases/Controls | Figure 16 |
Model_II | Dataset_VI | Ex08 | To detect the biomarkers of MS disease. | Classifying Dataset_VI into MS Cases/Controls | Figure 16 |
Model_II | Dataset_VI (Fragmented) | Ex09 | | Run the model on Dataset_VI after fragmenting the sequences with fragmentation sizes of 10%, 20%, and 30% to multiply the number of analyzed sequence samples, and study the impact of the increased sample count on the disease detection accuracy of Model_II. | Figure 17 and Table 5 |
Model_0, Model_I, and Model_II | Dataset_VI | Ex10 | To compare the execution times of the three models. | Apply the three models to the same dataset and analyze the execution times. | Figure 18 |
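The fragmentation step used in Ex01, Ex02, and Ex09 can be illustrated with a small pure-Python sketch. It assumes that a fragmentation size f means each fragment holds a fraction f of the records in a sequence file, and that leftover records that do not fill a whole fragment are discarded; both are assumptions, since the exact splitting rule is not stated here. The function name `fragment_records` is hypothetical, and the study itself lists seqkit as the fragmentation tool.

```python
def fragment_records(records, f):
    """Split `records` into consecutive fragments, each holding about
    a fraction `f` of the original records (assumed splitting rule)."""
    if not 0 < f <= 1:
        raise ValueError("f must be in the interval (0, 1]")
    # The small epsilon guards against floating-point error, e.g. 1 / 0.1.
    size = max(1, int(len(records) * f + 1e-9))
    n_fragments = int(1 / f + 1e-9)
    fragments = [records[start:start + size]
                 for start in range(0, size * n_fragments, size)]
    # Drop empty fragments that appear when there are fewer records than fragments.
    return [fragment for fragment in fragments if fragment]


# Hypothetical read identifiers standing in for the records of one FASTQ file.
reads = [f"read_{i}" for i in range(20)]
for f in (0.1, 0.2, 0.3, 0.4):
    parts = fragment_records(reads, f)
    print(f, len(parts), [len(p) for p in parts])
```

Under these assumptions, one input file yields ten fragments at f = 0.1 and five at f = 0.2, so smaller fragmentation sizes produce more, but shorter, training samples.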
Sequence Fragmentation Size | Number of Dataset_VI Samples after Fragmentation | Prediction Accuracy |
---|---|---|
0.1 | 2390 | 0.61 |
0.2 | 1195 | 0.72 |
0.3 | 717 | 0.65 |
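The sample counts in this table are consistent with each original sample being split into floor(1/f) fragments. The sketch below checks that arithmetic under two assumptions that are inferred rather than stated: that Dataset_VI contains 215 + 24 = 239 samples (the Dataset_II and Dataset_III counts from the dataset table), and that a partial trailing fragment is not counted.

```python
import math

# Assumption: Dataset_VI combines Dataset_II (215 samples) and Dataset_III (24 samples).
BASE_SAMPLES = 215 + 24


def fragments_per_sample(f):
    """Number of whole fragments of relative size f that fit in one sample."""
    # The small epsilon guards against floating-point error in 1 / f.
    return math.floor(1 / f + 1e-9)


def samples_after_fragmentation(base_samples, f):
    """Total number of samples once every original sample is fragmented."""
    return base_samples * fragments_per_sample(f)


counts = {f: samples_after_fragmentation(BASE_SAMPLES, f) for f in (0.1, 0.2, 0.3)}
print(counts)
```

With these assumptions the computed counts match the table: 239 × 10 = 2390, 239 × 5 = 1195, and 239 × 3 = 717.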
4. Discussion
5. Conclusions and Future Work
6. Limitations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Parameter | Dataset_I | Dataset_II (part of Dataset_VI) | Dataset_III (part of Dataset_VI) |
---|---|---|---|
BioProject | PRJNA594317 | PRJNA588268 | PRJNA514238 |
Organism | Synthetic construct; Homo sapiens | Homo sapiens | Homo sapiens |
Datastore filetype | FASTQ, SRA | FASTQ, SRA | FASTQ, SRA |
Datastore provider | GS, NCBI, S3 | GS, NCBI, S3 | GS, NCBI, S3 |
Library Source | Transcriptomic | Transcriptomic | Transcriptomic |
Instrument | Illumina HiSeq 2500 | Illumina HiSeq 2000 | Illumina MiSeq |
Library Layout | SINGLE | SINGLE | SINGLE |
Number of samples | 42 | 215 | 24 |
Number of cases | 6 | 105 (before treatment) | 12 |
Number of Controls | 6 | 110 (after treatment) | 12 |
Number of Synthesized samples | 30 | N/A | N/A |
Disease | RA | MS | MS |
Registration Date | 9 December 2019 | 7 November 2019 | 23 January 2019 |
Specification | Value |
---|---|
Processor | TPU |
RAM | 35.35 GB |
Disk Space | 2 TB |
Data Storage space | Google Drive |
Development platform | Python 3.6 |
Implementation Step | Tool/Library Used |
---|---|
Sequence file download | pysradb |
Sequence quality control | FastQC |
Sequence trimming | Trimmomatic |
FASTQ file conversion and manipulation | Biopython |
File fragmentation | seqkit |
Random forest model implementation | scikit-learn (sklearn) |
LSTM model implementation | TensorFlow |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ali, N.M.; Shaheen, M.; Mabrouk, M.S.; Aborizka, M. Machine Learning-Based Models for Detection of Biomarkers of Autoimmune Diseases by Fragmentation and Analysis of miRNA Sequences. Appl. Sci. 2022, 12, 5583. https://doi.org/10.3390/app12115583