How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models
Abstract
1. Introduction
2. Understanding Large Language Models Beyond the Buzzwords
2.1. The Transformer Revolution
2.2. Strengths, Caveats, and Uncharted Territories
2.3. Where to Draw the Line: Titles, Abstracts, or Full Text?
3. Evaluation Metrics for Classification Tasks
3.1. The Basics of Classification Metrics
3.2. Bringing Metrics to Life: A Practical Example
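As a minimal, self-contained illustration of the metrics discussed in this section, the Python sketch below computes them from the four cells of a screening confusion matrix; the function name and the counts in the example are hypothetical.

```python
# Illustrative only: computing standard screening metrics from a confusion
# matrix, assuming binary ACCEPT/REJECT decisions. All counts are hypothetical.

def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return common classification metrics for an abstract-screening task.

    tp: relevant abstracts the model ACCEPTed (true positives)
    fp: irrelevant abstracts the model ACCEPTed (false positives)
    tn: irrelevant abstracts the model REJECTed (true negatives)
    fn: relevant abstracts the model REJECTed (false negatives)
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}

# Example: 90 relevant and 910 irrelevant abstracts; the model finds 85 of the
# relevant ones but also lets 120 irrelevant ones through.
print(screening_metrics(tp=85, fp=120, tn=790, fn=5))
```

With these hypothetical counts, recall is 85/90 ≈ 0.94 while precision is only 85/205 ≈ 0.41, the typical trade-off of a permissive screening prompt.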
4. Speak and It Shall Be Done: The Art of Prompt Engineering
4.1. Building a Good Prompt
4.2. Zero, One, or Few Shots?
4.3. Soft Versus Strict: Setting the Bar for Inclusion
4.4. Illustrative Examples
“You are assisting in a systematic review on periodontal regeneration comparing Bone Morphogenetic Proteins (BMP) plus any bone graft (BG) versus BG alone. We include RCTs with at least six months of follow-up in adults (≥18 years) with intrabony or furcation defects. If the abstract suggests or does not contradict these criteria, respond with ‘ACCEPT.’ Only respond with ‘REJECT’ if it clearly violates any requirement: an exclusively pediatric population, a follow-up shorter than six months, or no mention of randomization or BMP. Output only the word ‘ACCEPT’ or ‘REJECT’”.
“You are an expert reviewer screening for RCTs of BMP + BG vs. BG alone in adult periodontitis patients (≥18 years) with intrabony or furcation defects. The study must explicitly mention randomization, adult age, the combination of BMP and bone graft, a control group with BG alone, and a follow-up of at least six months. If any criterion is absent or unclear, respond ‘REJECT’. Otherwise, respond ‘ACCEPT’. No additional text”.
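To see how the soft and strict prompts diverge in practice, the short sketch below runs both over the same records and flags disagreements. It is a minimal sketch, assuming the OpenAI Python SDK and an API key in the environment; the model name (gpt-4o), the screen() helper, and the placeholder prompt variables are our own illustrative choices, not part of any prescribed workflow.

```python
# Illustrative sketch: compare the "soft" and "strict" prompts on the same
# abstracts. Assumes the OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY environment variable; all names here are hypothetical.
from openai import OpenAI

client = OpenAI()

SOFT_PROMPT = "..."    # paste the soft prompt text from above
STRICT_PROMPT = "..."  # paste the strict prompt text from above

def screen(system_prompt: str, title: str, abstract: str) -> str:
    """Return the model's one-word screening decision for one record."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model; an assumption, not a requirement
        temperature=0,   # deterministic decoding aids reproducibility
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip()

records = [("Example title", "Example abstract text...")]  # from the search export
for title, abstract in records:
    soft = screen(SOFT_PROMPT, title, abstract)
    strict = screen(STRICT_PROMPT, title, abstract)
    if soft != strict:
        print(f"Disagreement on '{title}': soft={soft}, strict={strict}")
```

Records on which the two prompts disagree are precisely the borderline cases worth routing to a human reviewer.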
4.5. Avoiding Pitfalls with Prompt Refinement
5. Future Horizons and Potential Advancements
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
Appendix A. Practical Guidelines for Prompt Engineering
Appendix A.1. Define Clear Criteria
- Population: Clearly state the target population (e.g., “adult patients, defined as individuals ≥ 18 years”).
- Intervention: Specify the intervention details (e.g., “use of EMD combined with bone graft”).
- Comparison: Define what the intervention is being compared against (e.g., “bone graft alone”).
- Outcome: Identify the primary outcome or endpoints (e.g., “clinical measures of periodontal regeneration”).
- Use Unambiguous Language: Avoid vague terms. For example, write “adult patients (≥18 years)” instead of “adults”; include specific keywords or phrases that the model can recognize as indicators of a particular criterion.
- Incorporate Contextual Examples (if necessary): Where appropriate, provide a brief example or keyword list that exemplifies the criterion, ensuring that the model captures the intended meaning.
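As a minimal sketch of how such criteria can be kept explicit and editable, the snippet below (plain Python; the field names and wording are illustrative, not a fixed schema) assembles a screening prompt from a PICO dictionary rather than from free text.

```python
# Illustrative sketch: turn explicit PICO criteria into a screening prompt,
# so the criteria live in one editable structure instead of free-form prose.

PICO = {
    "Population": "adult patients (≥18 years) with intrabony or furcation defects",
    "Intervention": "EMD combined with a bone graft material",
    "Comparison": "bone graft alone (or any control lacking EMD)",
    "Outcome": "clinical measures of periodontal regeneration (e.g., CAL gain)",
}

def build_prompt(criteria: dict, decision_rule: str) -> str:
    """Render a criteria dictionary and a decision rule as one prompt string."""
    lines = ["Decide whether the abstract should be ACCEPTED or REJECTED."]
    lines += [f"- {label}: {text}" for label, text in criteria.items()]
    lines.append(decision_rule)
    return "\n".join(lines)

prompt = build_prompt(
    PICO,
    "If no criterion is clearly violated, respond only with 'ACCEPT'; "
    "otherwise respond only with 'REJECT'.",
)
print(prompt)
```

Keeping the criteria in a single structure makes iterative refinement (Appendix A.3) a matter of editing one field and re-running the test set.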
Appendix A.2. Choose an Appropriate Prompting Approach
Appendix A.3. Employ Iterative Testing and Refinement
Appendix A.4. Integrate Human Oversight
Appendix B. Comprehensive Examples
- (A) Zero-Shot Soft Prompt (Periodontal Regeneration)
- System:
- You are an AI assistant helping with a systematic review on periodontal regeneration.
- User Prompt:
- You will decide if each article should be ACCEPTED or REJECTED based on the following criteria:
- Population (P): Adult patients (≥18 years) with intrabony or furcation periodontal defects. If the abstract does not mention age, or does not clearly describe non-adult populations, do not penalize.
- Intervention (I): Must involve enamel matrix derivative (EMD) combined with a bone graft material (BG). If either EMD or BG is implied or partially mentioned, do not penalize.
- Comparison (C): Ideally a group that uses BG alone, or some control lacking EMD. If not stated but not contradicted, do not penalize.
- Outcomes (O): Must measure clinical attachment level (CAL) gain or probing depth (PD) reduction, or at least mention standard periodontal parameters. If the abstract does not state outcomes explicitly but mentions “periodontal regeneration”, do not penalize.
- Study design: Must be an RCT or strongly imply random allocation. If uncertain, do not penalize.
- Follow-up: Minimum 6 months. If not stated or unclear, do not penalize unless it says <6 months.
- Decision Rule: If no criterion is explicitly violated, respond only with “ACCEPT”. If any criterion is clearly contradicted (e.g., non-randomized, pediatric population, <6 months follow-up), respond with “REJECT”. Provide no additional explanation.
- Title: {title}
- Abstract: {abstract}
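The {title} and {abstract} placeholders above can be filled mechanically for a whole batch of records. A minimal sketch follows; the file name and column headers are assumptions about the user's own export format, not a required layout.

```python
# Illustrative sketch: fill the {title}/{abstract} placeholders for each
# record in a CSV export of the search results. File and column names are
# hypothetical and should match the user's own export.
import csv

USER_PROMPT_TEMPLATE = "..."  # paste the user prompt above, ending with
                              # "Title: {title}\nAbstract: {abstract}"

with open("search_export.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        prompt = USER_PROMPT_TEMPLATE.format(
            title=row["Title"], abstract=row["Abstract"]
        )
        # Send `prompt` to the model of choice here (see the sketch in
        # Section 4.4), then store the ACCEPT/REJECT decision with the record.
```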
- (B) One-Shot Soft Prompt (Type 2 Diabetes)
- System:
- You are an AI assistant helping with a systematic review on Type 2 Diabetes Mellitus (T2DM) management using SGLT2 inhibitors.
- User Prompt:
- You will decide if each new abstract should be ACCEPTED or REJECTED based on the following criteria:
- Population (P): Adult patients (≥18 years) with Type 2 Diabetes Mellitus. If the abstract does not specify age or does not exclude adults, do not penalize.
- Intervention (I): Must involve at least one SGLT2 inhibitor (e.g., canagliflozin, empagliflozin, dapagliflozin). If the abstract partially suggests an SGLT2 inhibitor, do not penalize.
- Comparison (C): Ideally involves a placebo, standard care, or any other antidiabetic medication. If not stated but not contradicted, do not penalize.
- Outcomes (O): Must report changes in glycemic control (e.g., HbA1c reduction) or related metabolic parameters. If the abstract implies an improvement in glycemic measures without explicitly mentioning numbers, do not penalize.
- Study Design: Must be an RCT or strongly imply random allocation. If uncertain, do not penalize.
- Follow-Up: Minimum 3 months. If not stated or unclear, do not penalize unless the abstract explicitly mentions a follow-up shorter than 3 months.
- Below is one example of an abstract meeting all criteria (labeled ACCEPT):
- Title: [Users should insert the title example here]
- Abstract: [Users should insert the abstract example here]
- Decision: ACCEPT
- Using the example above as a guide, classify the following abstracts into “ACCEPT” or “REJECT”. If no criterion is explicitly violated, respond with “ACCEPT”. If any criterion is clearly contradicted (e.g., pediatric-only study, non-randomized design, explicitly <3 months follow-up), respond with “REJECT”. Provide no additional explanation.
- Title: {title}
- Abstract: {abstract}
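Because both examples instruct the model to answer with a single word, a thin validation layer is useful: anything other than an exact "ACCEPT" or "REJECT" can be flagged for human review, in line with Appendix A.4. A minimal sketch, with hypothetical record identifiers and raw model outputs:

```python
# Illustrative sketch: normalize raw model responses and route anything that
# is not exactly ACCEPT or REJECT to a human reviewer. Inputs are hypothetical.

VALID = {"ACCEPT", "REJECT"}

def validate(raw: str) -> str:
    """Map a raw model response to ACCEPT, REJECT, or NEEDS_REVIEW."""
    decision = raw.strip().upper().rstrip(".")
    return decision if decision in VALID else "NEEDS_REVIEW"

raw_decisions = {
    "pmid:38991234": "ACCEPT",
    "pmid:38990001": "Reject.",                    # normalized to REJECT
    "pmid:38987777": "I would accept this because...",  # flagged for review
}

for record_id, raw in raw_decisions.items():
    print(record_id, "->", validate(raw))
```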
| Model | Developer | Approx. Parameter Count | Architecture | Typical Access/License | Key Features |
|---|---|---|---|---|---|
| GPT-3.5 | OpenAI | ~175 billion (GPT-3) | Transformer-based | Proprietary API (closed source) | Intermediate step between GPT-3 and GPT-4; improved fluency and coherence compared to GPT-3; powers ChatGPT (early versions). |
| GPT-4 | OpenAI | Undisclosed (estimated >1T) | Transformer-based | Proprietary API (closed source) | Advanced reasoning capabilities; improved context window; fewer hallucinations vs. GPT-3.5; strong multilingual performance. |
| Llama 2 | Meta | 7B, 13B, or 70B (various) | Transformer-based | Custom, permissive license | Can be run locally with suitable hardware; strong performance in multilingual benchmarks; more flexible for research. |
| Claude | Anthropic | Not publicly disclosed | Transformer-based | API-based (closed source) | Focus on safety and transparency; uses Constitutional AI approach to reduce harmful or biased outputs. |
| Qwen | Alibaba | Not publicly disclosed | Transformer-based | API-based (closed source) | Emphasizes multilingual capabilities and domain specialization; integrated with Alibaba Cloud services. |
| DeepSeek | DeepSeek AI | 7B–100B | Transformer-based | Open source (Apache 2.0) for certain versions; API access for larger models | Optimized for efficiency and domain-specific tasks; supports multilingual processing; modular architecture for customizable deployments. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).