Simplifying Data Analysis in Biomedical Research: An Automated, User-Friendly Tool
Abstract
1. Introduction
2. Methods
2.1. The User Experience—An Overview
2.2. Enhancing Traditional Sampling through ArsHive
2.3. Software Conceptualization and Validation
2.4. Behind the Scenes: How Our Algorithms Work
Algorithm 1. Algorithm (simplified for visualization) of the advanced equalization function.
Input: Dataset, TargetVariable, VariableList, MinThreshold
Output: EqualizedDataset
Begin
  Initialize equalized dataset as empty
  Calculate initial subpopulation sizes and proportions for TargetVariable
  Set target size to the size of the smallest subpopulation
  For each variable in VariableList do
    Determine the impact score based on deviation from expected proportions
  End For
  Sort VariableList by descending impact scores
  While subpopulation sizes differ from target size do
    For each variable in VariableList do
      Adjust subpopulation proportion to match target proportion
      If subpopulation size < MinThreshold then
        Protect subpopulation from further reduction
      End If
    End For
    Recalculate subpopulation sizes and proportions
  End While
  Output the equalized dataset
End
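The equalization idea in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the ArsHive implementation: it assumes a pandas DataFrame, scores each variable by the summed absolute deviation of its per-group category proportions from the overall proportions, and replaces the iterative proportion adjustment with plain random undersampling of each target group toward the smallest group. The function names `impact_scores` and `equalize`, and the `seed` parameter, are hypothetical.

```python
import pandas as pd


def impact_scores(df, target, variables):
    """Score each variable by how far its per-group proportions deviate
    from the overall proportions (an assumed definition of 'impact')."""
    scores = {}
    for var in variables:
        overall = df[var].value_counts(normalize=True)
        per_group = pd.crosstab(df[target], df[var], normalize="index")
        deviation = per_group.sub(overall, axis=1).abs()
        scores[var] = float(deviation.to_numpy().sum())
    return scores


def equalize(df, target, variables, min_threshold=0, seed=0):
    """Undersample every target group toward the smallest group size.

    No group is cut below min_threshold rows. Returns the equalized
    DataFrame and the variables ranked by descending impact score.
    """
    scores = impact_scores(df, target, variables)
    ranked = sorted(variables, key=scores.get, reverse=True)
    target_size = int(df[target].value_counts().min())
    keep = max(target_size, min_threshold)  # protect small groups
    parts = []
    for _, group in df.groupby(target):
        n = min(len(group), keep)
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).sort_index(), ranked
```

Calling `equalize(df, "group", ["sex", "smoker"])` returns a balanced copy of `df` together with the variable ranking, which could then guide which imbalances to inspect first.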
2.5. Handling Missing Data
2.6. Statistical Test Selection
- For comparisons between two independent groups:
  - If the continuous variable is normally distributed, an independent samples t-test is conducted to compare the means of the two groups.
  - For non-normally distributed data, the Mann–Whitney U (MWU) test is utilized as a non-parametric alternative.
- For comparisons between three or more independent groups:
  - ANOVA is the test of choice for normally distributed data, and is used to compare means across the multiple groups. After performing the ANOVA test, post hoc analyses with Tukey’s test can be used to determine pairwise comparisons between all categories of the target variable and to identify which are significantly different from each other.
  - The Kruskal–Wallis one-way ANOVA test is employed as a non-parametric counterpart to ANOVA when the data is non-normal. Post hoc analyses can also be performed in this case.
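The decision rules listed above can be sketched in Python with SciPy. This is an illustration under stated assumptions rather than the tool's own code: the normality check uses the Shapiro-Wilk test and a 0.05 threshold, neither of which is specified in the text, and the names `is_normal` and `compare_groups` are hypothetical.

```python
from scipy import stats


def is_normal(sample, alpha=0.05):
    """Shapiro-Wilk check (an assumed choice of normality test)."""
    return stats.shapiro(sample).pvalue > alpha


def compare_groups(groups, alpha=0.05):
    """Pick a test for a continuous variable split into independent groups.

    Returns a (test name, p-value) tuple.
    """
    if len(groups) < 2:
        raise ValueError("at least two groups are required")
    # Parametric tests are used only if every group looks normal.
    normal = all(is_normal(g, alpha) for g in groups)
    if len(groups) == 2:
        a, b = groups
        if normal:
            return "independent t-test", stats.ttest_ind(a, b).pvalue
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "Mann-Whitney U", result.pvalue
    if normal:
        return "ANOVA", stats.f_oneway(*groups).pvalue
    return "Kruskal-Wallis", stats.kruskal(*groups).pvalue
```

When the ANOVA branch is selected, a post hoc step could follow, for example with `scipy.stats.tukey_hsd` in recent SciPy releases.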
2.7. Introducing A.D.A.—A Large Language Model Companion Application
Algorithm 2. A.D.A. interaction process within ArsHive.
Input: User-Provided Data File (Excel or CSV format)
Output: Chat Window with Loaded Dataset
Begin
  Initialize ArsHive Software
  A.D.A. button is displayed on ArsHive Software Interface
  When A.D.A. Button is Clicked:
    Prompt User for OpenAI API Key
    Initialize OpenAI API with Provided Key
    Open Data Preprocessing Window
      Display Options to Upload Data File (Excel or CSV)
      User Uploads File
      Load Dataset into A.D.A.
      Calculate Token Count and Associated Cost for Dataset
      Display Token Count and Cost in Preprocessing Window
      Allow User to Modify Dataset if Needed (e.g., Delete Columns)
      Confirm Final Dataset for Analysis
    Transition to Chat Window
      Display Summary of Dataset (Variables, Global Identifier, Token Cost)
      Initialize Chat Session with GPT-4-turbo Model
      Load Dataset Context (Variables, Token Information)
    User Interacts with A.D.A. through Chat Window
      Send Queries Related to Dataset
      Receive Responses from A.D.A. based on GPT-4-turbo Model
      Maintain Session History for Continuous Conversation Context
    Offer Option to Save Session Transcript
    End A.D.A. Interaction on User Command or Window Closure
End
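The token-count and cost step of Algorithm 2 can be approximated with a short, standard-library-only Python sketch. It does not reproduce the tokenizer used by A.D.A.: the figure of roughly four characters per token is a common rule of thumb assumed here, the price per 1000 tokens is a caller-supplied placeholder, and the helper names (`dataset_to_text`, `estimate_tokens`, `estimate_cost`, `drop_columns`) are hypothetical.

```python
import csv
import io
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb, not an exact tokenizer


def dataset_to_text(rows):
    """Serialize rows (header first) to the CSV text that would be sent."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(n_tokens, usd_per_1k_tokens):
    """Cost in USD for n_tokens at a caller-supplied price per 1000 tokens."""
    return n_tokens / 1000 * usd_per_1k_tokens


def drop_columns(rows, names):
    """Remove the named columns, mirroring the 'Delete Columns' option."""
    header = rows[0]
    keep = [i for i, name in enumerate(header) if name not in names]
    return [[row[i] for i in keep] for row in rows]
```

Comparing `estimate_tokens(dataset_to_text(rows))` before and after `drop_columns` gives a quick sense of how much a column contributes to the estimated cost.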
2.8. Quality Assurance and Open Source Philosophy
3. Results and Discussion
3.1. Different Algorithms for Different Needs
- Open-Source Accessibility: No cost, and the open code enhances transparency and fosters a collaborative improvement environment.
- User-Friendly Interface: Simplified inputs enable ease of use for users with varying expertise levels.
- Automated Data Handling: Pre-configured methods for identifying variable types and imputing missing values (mode for binary data and mean for continuous data) facilitate data preparation.
- Exhaustive Reporting: Detailed reports elucidate findings without overwhelming users.
- Data Integrity Checks: Ensures consistency between datasets before and after equalization processes.
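The imputation rule mentioned in the list above (mode for binary data, mean for continuous data) could look like the following pandas sketch. How the tool identifies variable types is not described here, so the sketch assumes a column is binary when it has at most two distinct non-missing values and continuous when it is numeric with more than two; `impute_missing` is a hypothetical name.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with binary columns filled by mode and
    continuous (numeric) columns filled by mean."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not series.isna().any():
            continue  # nothing to impute
        n_unique = series.nunique(dropna=True)
        if n_unique == 0:
            continue  # entirely missing column: left untouched
        if n_unique <= 2:
            # Assumed binary: use the most frequent value.
            out[col] = series.fillna(series.mode(dropna=True).iloc[0])
        elif pd.api.types.is_numeric_dtype(series):
            # Assumed continuous: use the arithmetic mean.
            out[col] = series.fillna(series.mean())
    return out
```

Columns that fit neither rule, such as multi-category text variables, are left unchanged in this sketch so that the gap stays visible to the user.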
3.2. Concise and Easy Reporting Features
3.3. A.D.A.—A Digital Assistant Companion Tool
4. Challenges and Considerations
5. Final Remarks
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Langley, G.R.; Adcock, I.M.; Busquet, F.; Crofton, K.M.; Csernok, E.; Giese, C.; Heinonen, T.; Herrmann, K.; Hofmann-Apitius, M.; Landesmann, B.; et al. Towards a 21st-century roadmap for biomedical research and drug discovery: Consensus report and recommendations. Drug Discov. Today 2017, 22, 327–339.
- Keramaris, N.C.; Kanakaris, N.K.; Tzioupis, C.; Kontakis, G.; Giannoudis, P.V. Translational research: From benchside to bedside. Injury 2008, 39, 643–650.
- Jarvis, M.F.; Williams, M. Irreproducibility in Preclinical Biomedical Research: Perceptions, Uncertainties, and Knowledge Gaps. Trends Pharmacol. Sci. 2016, 37, 290–302.
- Frampton, G.; Whaley, P.; Bennett, M.; Bilotta, G.; Dorne, J.-L.C.M.; Eales, J.; James, K.; Kohl, C.; Land, M.; Livoreil, B.; et al. Principles and framework for assessing the risk of bias for studies included in comparative quantitative environmental systematic reviews. Environ. Evid. 2022, 11, 12.
- Roberts, C.; Torgerson, D.J. Understanding controlled trials: Baseline imbalance in randomised controlled trials. BMJ 1999, 319, 185.
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524.
- Kang, H. The prevention and handling of the missing data. Korean J. Anesthesiol. 2013, 64, 402.
- Palanivinayagam, A.; Damaševičius, R. Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods. Information 2023, 14, 92.
- Griss, J.; Perez-Riverol, Y.; Hermjakob, H.; Vizcaíno, J.A. Identifying novel biomarkers through data mining—A realistic scenario? Proteomics Clin. Appl. 2015, 9, 437–443.
- Bauer, C.; Glintschert, A.; Schuchhardt, J. ProfileDB: A resource for proteomics and cross-omics biomarker discovery. Biochim. Biophys. Acta Proteins Proteom. 2014, 1844, 960–966.
- Diao, Z.; Han, D.; Zhang, R.; Li, J. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J. Adv. Res. 2022, 38, 201–212.
- Williams, C.G.; Lee, H.J.; Asatsuma, T.; Vento-Tormo, R.; Haque, A. An introduction to spatial transcriptomics for biomedical research. Genome Med. 2022, 14, 68.
- Póvoa, P.; Bos, L.D.J.; Coelho, L. The role of proteomics and metabolomics in severe infections. Curr. Opin. Crit. Care 2022, 28, 534–539.
- Araújo, R.; Ramalhete, L.; Ribeiro, E.; Calado, C. Plasma versus Serum Analysis by FTIR Spectroscopy to Capture the Human Physiological State. BioTech 2022, 11, 56.
- Horejs, C.-M. Artificial intelligence identifies new cancer biomarkers. Nat. Rev. Bioeng. 2023, 1, 313.
- Choudhuri, S.; Kaur, T.; Jain, S.; Sharma, C.; Asthana, S. A review on genotoxicity in connection to infertility and cancer. Chem. Biol. Interact. 2021, 345, 109531.
- Ramalhete, L.M.; Araújo, R.; Ferreira, A.; Calado, C.R.C. Proteomics for Biomarker Discovery for Diagnosis and Prognosis of Kidney Transplantation Rejection. Proteomes 2022, 10, 24.
- Vigia, E.; Ramalhete, L.; Ribeiro, R.; Barros, I.; Chumbinho, B.; Filipe, E.; Pena, A.; Bicho, L.; Nobre, A.; Carrelha, S.; et al. Pancreas Rejection in the Artificial Intelligence Era: New Tool for Signal Patients at Risk. J. Pers. Med. 2023, 13, 1071.
- Araújo, R.; Bento, L.F.N.; Fonseca, T.A.H.; Von Rekowski, C.P.; da Cunha, B.R.; Calado, C.R.C. Infection Biomarkers Based on Metabolomics. Metabolites 2022, 12, 92.
- Babu, M.; Snyder, M. Multi-Omics Profiling for Health. Mol. Cell. Proteomics 2023, 22, 100561.
- Subramanian, I.; Verma, S.; Kumar, S.; Jere, A.; Anamika, K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform. Biol. Insights 2020, 14, 117793221989905.
- Li, Y.; Wu, X.; Fang, D.; Luo, Y. Informing immunotherapy with multi-omics driven machine learning. Npj Digit. Med. 2024, 7, 67.
- Ramalhete, L.; Vieira, M.B.; Araújo, R.; Vigia, E.; Aires, I.; Ferreira, A.; Calado, C.R.C. Predicting Cellular Rejection of Renal Allograft Based on the Serum Proteomic Fingerprint. Int. J. Mol. Sci. 2024, 25, 3844.
- Kather, J.N. Artificial intelligence in oncology: Chances and pitfalls. J. Cancer Res. Clin. Oncol. 2023, 149, 7995–7996.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
- Meskó, B. The Impact of Multimodal Large Language Models on Health Care’s Future. J. Med. Internet Res. 2023, 25, e52865.
- Toufiq, M.; Rinchai, D.; Bettacchioli, E.; Kabeer, B.S.A.; Khan, T.; Subba, B.; White, O.; Yurieva, M.; George, J.; Jourde-Chiche, N.; et al. Harnessing large language models (LLMs) for candidate gene prioritization and selection. J. Transl. Med. 2023, 21, 728.
- Elfil, M.; Negida, A. Sampling methods in Clinical Research; an Educational Review. Emergency 2017, 5, e52.
- César, C.C.; Carvalho, M.S. Stratified sampling design and loss to follow-up in survival models: Evaluation of efficiency and bias. BMC Med. Res. Methodol. 2011, 11, 99.
- Kahan, B.C.; Morris, T.P. Reporting and analysis of trials using stratified randomisation in leading medical journals: Review and reanalysis. BMJ 2012, 345, e5840.
- Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1.
- Pollard, T.J.; Johnson, A.E.W.; Raffa, J.D.; Celi, L.A.; Mark, R.G.; Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 2018, 5, 180178.
- Hyland, S.L.; Faltys, M.; Hüser, M.; Lyu, X.; Gumbsch, T.; Esteban, C.; Bock, C.; Horn, M.; Moor, M.; Rieck, B.; et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 2020, 26, 364–373.
- Thorsson, V.; Gibbs, D.L.; Brown, S.D.; Wolf, D.; Bortone, D.S.; Ou Yang, T.-H.; Porta-Pardo, E.; Gao, G.F.; Plaisier, C.L.; Eddy, J.A.; et al. The Immune Landscape of Cancer. Immunity 2018, 48, 812–830.e14.
- Sayers, E.W.; Bolton, E.E.; Brister, J.R.; Canese, K.; Chan, J.; Comeau, D.C.; Connor, R.; Funk, K.; Kelly, C.; Kim, S.; et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022, 50, D20–D26.
- Yang, J.; Liu, Y.; Shang, J.; Chen, Q.; Chen, Q.; Ren, L.; Zhang, N.; Yu, Y.; Li, Z.; Song, Y.; et al. The Quartet Data Portal: Integration of community-wide resources for multiomics quality control. Genome Biol. 2023, 24, 245.
- Hugging Face Tokenization GPT2. Available online: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/tokenization_gpt2.py (accessed on 5 March 2024).
- OpenAI OpenAI API Pricing. Available online: https://openai.com/pricing (accessed on 5 March 2024).
Dataset | Type | Brief Description | Size | Ref. |
---|---|---|---|---|
MIMIC-IV | Clinical | Patient data including physiological variables, treatments, and diagnostics, in the intensive care unit (ICU). | >60,000 ICU admissions | [31] |
eICU Collaborative Research Database | Clinical | Multi-center dataset including vital signs, laboratory work, and Acute Physiology and Chronic Health Evaluation (APACHE) scores. | >200,000 ICU admissions | [32] |
HiRID | Clinical | High-resolution ICU dataset with demographic data and detailed treatment parameters. | >30,000 ICU admissions | [33] |
Georgetown Immuno-oncology registry | Clinical and genomic | Electronic health records including demographic, clinical, and prescription information, used for retrospective outcomes research at the 10 DC-Baltimore based MedStar Health network hospitals and the Hackensack Meridian Health system in New Jersey. | N/A | [34] |
NCBI | Genomic | A collection of databases for biotechnology and biomedicine, including nucleotide sequences, protein sequences, and literature. Size varies by database (e.g., GenBank, PubMed). | Extensive | [35] |
Quartet metabolomics Project | Metabolomics | Metabolite reference materials from B lymphoblastoid cell lines for inter-laboratory proficiency testing and data integration of metabolomics profiling. | N/A | [36] |
Variable Type | Test Type | Comparisons between 2 Groups | Comparisons between 3 or More Groups |
---|---|---|---|
Continuous variables | Parametric | Two-sample t-test | ANOVA |
Continuous variables | Non-parametric | Mann–Whitney U test | Kruskal–Wallis one-way ANOVA test |
Categorical/nominal variables | | Chi-square test or Fisher’s exact test (if the applicability conditions of the first test are not verified) | Chi-square test or Fisher’s exact test (if the applicability conditions of the first test are not verified) |
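For the categorical row of the table above, the choice between the chi-square test and Fisher’s exact test can be sketched with SciPy. The applicability condition is not spelled out in the table, so the sketch assumes the common rule that every expected cell count must be at least 5, and it restricts the Fisher fallback to 2 × 2 tables. The name `compare_categorical` and the `min_expected` parameter are hypothetical.

```python
import numpy as np
from scipy import stats


def compare_categorical(table, min_expected=5):
    """Choose chi-square or Fisher's exact test for a contingency table.

    Returns a (test name, p-value) tuple.
    """
    observed = np.asarray(table)
    # Unpack (statistic, p-value, dof, expected counts).
    _, p_chi2, _, expected = stats.chi2_contingency(observed)
    if (expected >= min_expected).all():
        return "chi-square", p_chi2
    if observed.shape != (2, 2):
        raise ValueError("this sketch only falls back to Fisher for 2x2 tables")
    _, p_exact = stats.fisher_exact(observed)
    return "Fisher's exact", p_exact
```

As a usage example, the gender counts reported for the COVID-19 and control groups, `[[44, 40], [68, 63]]`, have all expected counts well above 5, so the sketch selects the chi-square branch for them.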
Characteristic | Dataset 1 | Dataset 2 | Dataset 3 |
---|---|---|---|
Number of demographic variables | 2 | 3 | 3 |
Number of clinical/therapeutical variables | 2 | 18 | 10 |
Total number of rows (samples) | 26 | 225 | 436 |
Total number of columns (variables #) | 6 | 29 | 3750 |
Total processing time (in seconds *) | 2.06 | 2.78 | 34.06 |
Estimated number of tokens | 173 | 8398 | 2,178,874 |
Variable | Category | COVID-19 (n = 112 Patients) | CONTROL (n = 103 Patients) | p-Value |
---|---|---|---|---|
Gender (n/proportion) | Female | 44 (0.39) | 40 (0.39) | 1.000 * |
 | Male | 68 (0.61) | 63 (0.61) | |
Age (years), median (IQR) | | 59 (21) | 62 (23) | 0.117 ● |
ICU death (n/proportion) | No | 89 (0.79) | 79 (0.77) | 0.745 * |
 | Yes | 23 (0.21) | 24 (0.23) | |
IMV (n/proportion) | No | 36 (0.32) | 33 (0.32) | 1.000 * |
 | Yes | 76 (0.68) | 70 (0.68) | |
ECMO (n/proportion) | No | 104 (0.93) | 95 (0.92) | 1.000 * |
 | Yes | 8 (0.07) | 8 (0.08) | |
Challenges | Considerations/Future Work |
---|---|
Balancing user autonomy with guidance | Educational empowerment with toggles and feedback options |
OpenAI’s API token and TPM limitations | GUI enhancements for visual data manipulation |
Data privacy concerns with cloud processing | Integration with established platforms or standalone option |
Need for local LLM deployment options | Quality assurance and advanced GUI overhaul |
Economical access to LLM capabilities | Privacy-by-design approach with no data retention |
Ensuring relevance in data curation for A.D.A. | Community collaboration and continuous software development |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Araújo, R.; Ramalhete, L.; Viegas, A.; Von Rekowski, C.P.; Fonseca, T.A.H.; Calado, C.R.C.; Bento, L. Simplifying Data Analysis in Biomedical Research: An Automated, User-Friendly Tool. Methods Protoc. 2024, 7, 36. https://doi.org/10.3390/mps7030036