Minimizing Cohort Discrepancies: A Comparative Analysis of Data Normalization Approaches in Biomarker Research
Abstract
:1. Introduction
2. Materials and Methods
- Normalization by total concentration:
- 2.
- Autoscaling normalization:
- 3.
- Quantile normalization:
- 4.
- Probabilistic quotient normalization:
- 5.
- Median ratio normalization:
- 6.
- Trimmed median-m value normalization:
- 7.
- Variance stabilizing normalization.
3. Results
4. Discussion
5. Limitations, Future Prospects, and Suggestions
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
List of Used Abbreviations
HIE—hypoxic–ischemic encephalopathy |
MRN—median ration normalization |
OPLS—orthogonal partial least squares |
PCA—principal component analysis |
PQN—probabilistic quotient normalization |
TMM—trimmed mean m-value normalization |
VSN—variance stabilizing normalization |
VIP—variable importance projection |
References
- Badrick, T. Biological variation: Understanding why it is so important? Pract. Lab. Med. 2021, 23, e00199. [Google Scholar] [CrossRef] [PubMed]
- Higdon, R.; Kolker, E. Can “normal” protein expression ranges be estimated with high-throughput proteomics? J. Proteome Res. 2015, 14, 2398–2407. [Google Scholar] [CrossRef] [PubMed]
- Chelala, L.; O’Connor, E.E.; Barker, P.B.; Zeffiro, T.A. Meta-analysis of brain metabolite differences in HIV infection. NeuroImage Clin. 2020, 28, 102436. [Google Scholar] [CrossRef] [PubMed]
- Cao, W.; Siegel, L.; Zhou, J.; Zhu, M.; Tong, T.; Chen, Y.; Chu, H. Estimating the reference interval from a fixed effects meta-analysis. Res. Synth. Methods 2021, 12, 630–640. [Google Scholar] [CrossRef] [PubMed]
- Lee, J.; Park, J.; Lim, M.-S.; Seong, S.J.; Seo, J.J.; Park, S.M.; Lee, H.W.; Yoon, Y.-R. Quantile normalization approach for liquid chromatography—Mass spectrometry-based metabolomic data from healthy human volunteers. Anal. Sci. 2012, 28, 801–805. [Google Scholar] [CrossRef] [PubMed]
- Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 2006, 78, 4281–4290. [Google Scholar] [CrossRef] [PubMed]
- Huber, W.; Von Heydebreck, A.; Sültmann, H.; Poustka, A.; Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18, S96–S104. [Google Scholar] [CrossRef] [PubMed]
- Anders, S.; Huber, W. Differential expression analysis for sequence count data. Genome Biol. 2010, 11, R106. [Google Scholar] [CrossRef]
- Robinson, M.D.; Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11, R25. [Google Scholar] [CrossRef] [PubMed]
- Brix, F.; Demetrowitsch, T.; Jensen-Kroll, J.; Zacharias, H.U.; Szymczak, S.; Laudes, M.; Schreiber, S.; Schwarz, K. Evaluating the Effect of Data Merging and Postacquisition Normalization on Statistical Analysis of Untargeted High-Resolution Mass Spectrometry Based Urinary Metabolomics Data. Anal. Chem. 2024, 96, 33–40. [Google Scholar] [CrossRef] [PubMed]
- Chua, A.E.; Pfeifer, L.D.; Sekera, E.R.; Hummon, A.B.; Desaire, H. Workflow for Evaluating Normalization Tools for Omics Data Using Supervised and Unsupervised Machine Learning. J. Am. Soc. Mass Spectrom. 2023, 34, 2775–2784. [Google Scholar] [CrossRef] [PubMed]
- Shevtsova, Y.; Starodubtseva, N.; Tokareva, A.; Goryunov, K.; Sadekova, A.; Vedikhina, I.; Ivanetz, T.; Ionov, O.; Frankevich, V.; Plotnikov, E.; et al. Metabolite Biomarkers for Early Ischemic–Hypoxic Encephalopathy: An Experimental Study Using the NeoBase 2 MSMS Kit in a Rat Model. Int. J. Mol. Sci. 2024, 25, 2035. [Google Scholar] [CrossRef] [PubMed]
- Rice, J.E.; Vannucci, R.C.; Brierley, J.B. The influence of immaturity on hypoxic-ischemic brain damage in the rat. Ann. Neurol. 1981, 9, 131–141. [Google Scholar] [CrossRef]
- Edwards, A.B.; Feindel, K.W.; Cross, J.L.; Anderton, R.S.; Clark, V.W.; Knuckey, N.W.; Meloni, B.P. Modification to the Rice-Vannucci perinatal hypoxic-ischaemic encephalopathy model in the P7 rat improves the reliability of cerebral infarct development after 48 hours. J. Neurosci. Methods 2017, 288, 62–71. [Google Scholar] [CrossRef] [PubMed]
- Bolstad, B.M.; Irizarry, R.A.; Astrand, M.; Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19, 185–193. [Google Scholar] [CrossRef] [PubMed]
- Evans, C.; Hardin, J.; Stoebel, D.M. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief. Bioinform. 2018, 19, 776–792. [Google Scholar] [CrossRef] [PubMed]
- Huber, W.; von Heydebreck, A.; Sueltmann, H.; Poustka, A.; Vingron, M. Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Genet. Mol. Biol. 2003, 2, 3. [Google Scholar] [CrossRef] [PubMed]
- Thévenot, E.A.; Roux, A.; Xu, Y.; Ezan, E.; Junot, C. Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index, and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses. J. Proteome Res. 2015, 14, 3322–3335. [Google Scholar] [CrossRef] [PubMed]
- Chen, Y.; Lun, A.T.L.; Smyth, G.K. From reads to genes to pathways: Differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 2016, 5, 1438. [Google Scholar] [CrossRef] [PubMed]
- O’Connell, G.C. Variability in donor leukocyte counts confound the use of common RNA sequencing data normalization strategies in transcriptomic biomarker studies performed with whole blood. Sci. Rep. 2023, 13, 15514. [Google Scholar] [CrossRef] [PubMed]
- Abbas-Aghababazadeh, F.; Li, Q.; Fridley, B.L. Comparison of normalization approaches for gene expression studies completed with highthroughput sequencing. PLoS ONE 2018, 13, e0206312. [Google Scholar] [CrossRef] [PubMed]
- Cook, T.; Ma, Y.; Gamagedara, S. Evaluation of statistical techniques to normalize mass spectrometry-based urinary metabolomics data. J. Pharm. Biomed. Anal. 2020, 177, 112854. [Google Scholar] [CrossRef] [PubMed]
- Dressler, F.F.; Brägelmann, J.; Reischl, M.; Perner, S. Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure. Mol. Cell. Proteom. 2022, 21, 100269. [Google Scholar] [CrossRef] [PubMed]
- Narasimhan, M.; Kannan, S.; Chawade, A.; Bhattacharjee, A.; Govekar, R. Clinical biomarker discovery by SWATH-MS based label-free quantitative proteomics: Impact of criteria for identification of differentiators and data normalization method. J. Transl. Med. 2019, 17, 184. [Google Scholar] [CrossRef] [PubMed]
- Xue, Z.; Wu, D.; Zhang, J.; Pan, Y.; Kan, R.; Gao, J.; Zhou, B. Protective effect and mechanism of procyanidin B2 against hypoxic injury of cardiomyocytes. Heliyon 2023, 9, e21309. [Google Scholar] [CrossRef] [PubMed]
- Pan, Q.; Wang, D.; Chen, D.; Sun, Y.; Feng, X.; Shi, X.; Xu, Y.; Luo, X.; Yu, J.; Li, Y.; et al. Characterizing the effects of hypoxia on the metabolic profiles of mesenchymal stromal cells derived from three tissue sources using chemical isotope labeling liquid chromatography-mass spectrometry. Cell Tissue Res. 2020, 380, 79–91. [Google Scholar] [CrossRef] [PubMed]
- Zhao, M.; Zhu, P.; Fujino, M.; Zhuang, J.; Guo, H.; Sheikh, I.; Zhao, L.; Li, X.-K. Oxidative stress in hypoxic-ischemic encephalopathy: Molecular mechanisms and therapeutic strategies. Int. J. Mol. Sci. 2016, 17, 2078. [Google Scholar] [CrossRef] [PubMed]
- Denihan, N.M.; Kirwan, J.A.; Walsh, B.H.; Dunn, W.B.; Broadhurst, D.I.; Boylan, G.B.; Murray, D.M. Untargeted metabolomic analysis and pathway discovery in perinatal asphyxia and hypoxic-ischaemic encephalopathy. J. Cereb. Blood Flow Metab. 2019, 39, 147–162. [Google Scholar] [CrossRef] [PubMed]
- Kuligowski, J.; Solberg, R.; Sánchez-Illana, Á.; Pankratov, L.; Parra-Llorca, A.; Quintás, G.; Saugstad, O.D.; Vento, M. Plasma metabolite score correlates with Hypoxia time in a newly born piglet model for asphyxia. Redox Biol. 2017, 12, 1–7. [Google Scholar] [CrossRef] [PubMed]
Normalization Method | R2X | R2Y | Q2Y | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|---|
Raw | 0.69 | 0.68 | 0.56 | 0.70 | 0.71 | 0.64 |
Total sum | 0.76 | 0.62 | 0.47 | 0.74 | 0.86 | 0.57 |
Autoscaling | 0.69 | 0.68 | 0.56 | 0.70 | 0.71 | 0.69 |
Quantile normalization | 0.59 | 0.66 | 0.38 | 0.70 | 0.71 | 0.69 |
PQN | 0.56 | 0.72 | 0.55 | 0.67 | 0.79 | 0.5 |
MRN | 0.58 | 0.72 | 0.55 | 0.85 | 0.79 | 0.86 |
TMM | 0.78 | 0.62 | 0.47 | 0.85 | 0.79 | 0.86 |
VSN | 0.26 | 0.89 | 0.72 | 0.74 | 0.86 | 0.57 |
Normalization Method | Advantages | Disadvantages |
---|---|---|
Total sum | The OPLS model performed well, displaying a close distribution of the biomarkers’ importance compared to the other normalized datasets (TMM, PQN, MRN) | The OPLS model’s performance is adversely affected by imbalances in the training data, resulting in a lower quality outcome compared to when the model is trained on raw data. |
Autoscaling | None | The application of this approach does not lead to an improvement in the validation results with the test data. |
Quantile normalization | The distribution of the biomarkers’ importance closely aligns with the distribution observed in the raw data. | The performance of the OPLS model is found to be unsatisfactory, particularly as it demonstrates sensitivity to imbalances within the training data. |
PQN | The OPLS model performed well, displaying a close distribution of the biomarkers’ importance compared to the other normalized datasets (TMM, total sum, MRN), resulting in an enhanced validation outcome for the test data | None |
MRN | The OPLS model performed well, displaying a close distribution of the biomarkers’ importance compared to the other normalized datasets (TMM, total sum, PQN), resulting in an enhanced validation outcome for the test data | None |
TMM | The OPLS model performed well, displaying a close distribution of biomarkers’ importance compared to the other normalized datasets (PQN, total sum, MRN), resulting in an enhanced validation outcome for the test data | The OPLS model’s performance is adversely affected by imbalances in the training data, resulting in a lower quality outcome compared to when the model is trained on raw data. |
VSN | The model’s exceptional quality is demonstrated by achieving the highest sensitivity during the validation on test data, indicating its robust performance and reliability in accurately predicting outcomes. | There has been a significant change in the distribution of the biomarkers’ importance, reflecting a notable shift in the key factors influencing the model’s outcomes. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tokareva, A.; Starodubtseva, N.; Frankevich, V.; Silachev, D. Minimizing Cohort Discrepancies: A Comparative Analysis of Data Normalization Approaches in Biomarker Research. Computation 2024, 12, 137. https://doi.org/10.3390/computation12070137
Tokareva A, Starodubtseva N, Frankevich V, Silachev D. Minimizing Cohort Discrepancies: A Comparative Analysis of Data Normalization Approaches in Biomarker Research. Computation. 2024; 12(7):137. https://doi.org/10.3390/computation12070137
Chicago/Turabian StyleTokareva, Alisa, Natalia Starodubtseva, Vladimir Frankevich, and Denis Silachev. 2024. "Minimizing Cohort Discrepancies: A Comparative Analysis of Data Normalization Approaches in Biomarker Research" Computation 12, no. 7: 137. https://doi.org/10.3390/computation12070137
APA StyleTokareva, A., Starodubtseva, N., Frankevich, V., & Silachev, D. (2024). Minimizing Cohort Discrepancies: A Comparative Analysis of Data Normalization Approaches in Biomarker Research. Computation, 12(7), 137. https://doi.org/10.3390/computation12070137