Assessing LLMs on IDSA Practice Guidelines for the Diagnosis and Treatment of Native Vertebral Osteomyelitis: A Comparison Study
Abstract
1. Introduction
2. Materials and Methods
2.1. Ethical Considerations
2.2. Study Design
2.3. Accuracy Assessment
- (a) Poor: The response contains significant factual inaccuracies that could mislead clinicians and result in potentially harmful clinical decisions.
- (b) Moderate: The response includes moderate factual inaccuracies. While unlikely to result in harm, these responses require clarification to ensure optimal patient care.
- (c) Good: The response contains only minor factual errors and may require limited clarification but is generally safe and clinically usable.
- (d) Excellent: The response is entirely accurate, complete, and aligned with guideline-based recommendations, requiring no clarification.
2.4. Comprehensiveness Assessment
- (a) Not Comprehensive: The response is grossly deficient and lacks necessary clinical details.
- (b) Slightly Comprehensive: The response offers only minimal and basic information.
- (c) Moderately Comprehensive: The response provides an acceptable level of detail but may omit certain nuances.
- (d) Comprehensive: The response thoroughly covers most of the critical aspects outlined in the guidelines.
- (e) Very Comprehensive: The response demonstrates a high level of detail, incorporating extensive and nuanced information fully in line with IDSA guidance.
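
The two rubrics in Sections 2.3 and 2.4 are ordinal scales, so they can be coded as integers before any summary statistics are computed. The sketch below is one possible encoding, not the authors' documented procedure: the numeric codes (1 to 4 for accuracy, 1 to 5 for comprehensiveness), the `summarize` helper, and the example labels are all assumptions made for illustration.

```python
from statistics import mean, median, stdev

# Assumed ordinal codes; the source lists the categories (a)-(d) and (a)-(e)
# but does not state which numbers were assigned to them.
ACCURACY_SCALE = {"Poor": 1, "Moderate": 2, "Good": 3, "Excellent": 4}
COMPREHENSIVENESS_SCALE = {
    "Not Comprehensive": 1,
    "Slightly Comprehensive": 2,
    "Moderately Comprehensive": 3,
    "Comprehensive": 4,
    "Very Comprehensive": 5,
}


def summarize(labels, scale):
    """Convert rubric labels to ordinal scores and return n, mean, SD, median."""
    scores = [scale[label] for label in labels]
    return {
        "n": len(scores),
        "mean": mean(scores),
        # Sample standard deviation needs at least two scores.
        "sd": stdev(scores) if len(scores) > 1 else 0.0,
        "median": median(scores),
    }


# Hypothetical ratings for one model, for illustration only.
example_labels = [
    "Comprehensive",
    "Moderately Comprehensive",
    "Very Comprehensive",
    "Comprehensive",
]
summary = summarize(example_labels, COMPREHENSIVENESS_SCALE)
print(summary)
```

Keeping the label-to-score mapping in one place makes it easy to change the coding if a different convention is preferred.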
2.5. Statistical Analysis
3. Results
4. Discussion
4.1. Limitations
4.2. Practical Uses and Potential Future Paths
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
ANOVA | Analysis of Variance |
CoT | Chain of Thought |
EMR | Electronic Medical Record |
IDSA | Infectious Diseases Society of America |
IRB | Institutional Review Board |
LLM | Large Language Model |
NASS | North American Spine Society |
NVO | Native Vertebral Osteomyelitis |
SD | Standard Deviation |

LLM | Mean ± SD (words) | Minimum (words) | Maximum (words) |
---|---|---|---|
Consensus | 213.2 ± 68.8 | 43.0 | 305.0 |
Gemini | 358.2 ± 60.5 | 273.0 | 475.0 |
ChatGPT 4o Mini | 392.2 ± 97.35 | 179.0 | 543.0 |
ChatGPT 4o | 428.0 ± 45.4 | 98.0 | 768.0 |

LLM | Mean ± SD (characters) | Minimum (characters) | Maximum (characters) |
---|---|---|---|
Consensus | 1313.0 ± 427.2 | 258.0 | 1929.0 |
Gemini | 2223.0 ± 106.7 | 1616.0 | 2971.0 |
ChatGPT 4o Mini | 2551.0 ± 175.6 | 1121.0 | 3249.0 |
ChatGPT 4o | 2705.0 ± 295.3 | 624.0 | 5119.0 |
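
Response-length statistics of this kind (mean, SD, minimum, and maximum in words and characters) can be reproduced with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: words are counted by splitting on whitespace, characters include spaces and punctuation, and the SD is the sample standard deviation. The source does not say which conventions were used, and the `length_stats` helper and the example responses are hypothetical.

```python
from statistics import mean, stdev


def describe(values):
    """Return mean, sample SD, minimum, and maximum of a list of counts."""
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def length_stats(responses):
    """Summarize response lengths in words and in characters."""
    # Assumption: a "word" is any whitespace-delimited token.
    word_counts = [len(text.split()) for text in responses]
    # Assumption: characters include spaces and punctuation.
    char_counts = [len(text) for text in responses]
    return {"words": describe(word_counts), "characters": describe(char_counts)}


# Hypothetical responses, for illustration only.
responses = [
    "MRI is preferred.",
    "Obtain blood cultures before antibiotics.",
]
stats = length_stats(responses)
print(stats["words"])
print(stats["characters"])
```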

Domain | No. of Questions | Model | Moderate | Good | Excellent |
---|---|---|---|---|---|
Clinical Diagnostics | 21 | Gemini | 1 (5%) | 13 (62%) | 7 (33%) |
Clinical Diagnostics | 21 | Consensus | 4 (19%) | 13 (62%) | 4 (19%) |
Clinical Diagnostics | 21 | ChatGPT 4o Mini | 2 (9%) | 10 (48%) | 9 (43%) |
Clinical Diagnostics | 21 | ChatGPT 4o | 4 (19%) | 9 (43%) | 8 (38%) |
Clinical Therapy | 9 | Gemini | 0 (0%) | 2 (22%) | 7 (78%) |
Clinical Therapy | 9 | Consensus | 1 (11%) | 5 (56%) | 3 (33%) |
Clinical Therapy | 9 | ChatGPT 4o Mini | 0 (0%) | 6 (67%) | 3 (33%) |
Clinical Therapy | 9 | ChatGPT 4o | 0 (0%) | 0 (0%) | 9 (100%) |
Clinical Follow-up | 9 | Gemini | 0 (0%) | 3 (33%) | 6 (67%) |
Clinical Follow-up | 9 | Consensus | 2 (22%) | 6 (67%) | 1 (11%) |
Clinical Follow-up | 9 | ChatGPT 4o Mini | 0 (0%) | 7 (78%) | 2 (22%) |
Clinical Follow-up | 9 | ChatGPT 4o | 1 (11%) | 4 (44%) | 4 (44%) |
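
The per-domain counts in the table above can be pooled to give each model's overall share of "Excellent" ratings across all 39 questions (21 + 9 + 9). The sketch below is only an arithmetic illustration: the counts are transcribed from the table, while the pooling itself and the `excellent_share` helper are assumptions and not an analysis reported in the source.

```python
# (Moderate, Good, Excellent) counts per domain, transcribed from the table above.
COUNTS = {
    "Gemini": {
        "Clinical Diagnostics": (1, 13, 7),
        "Clinical Therapy": (0, 2, 7),
        "Clinical Follow-up": (0, 3, 6),
    },
    "Consensus": {
        "Clinical Diagnostics": (4, 13, 4),
        "Clinical Therapy": (1, 5, 3),
        "Clinical Follow-up": (2, 6, 1),
    },
    "ChatGPT 4o Mini": {
        "Clinical Diagnostics": (2, 10, 9),
        "Clinical Therapy": (0, 6, 3),
        "Clinical Follow-up": (0, 7, 2),
    },
    "ChatGPT 4o": {
        "Clinical Diagnostics": (4, 9, 8),
        "Clinical Therapy": (0, 0, 9),
        "Clinical Follow-up": (1, 4, 4),
    },
}


def excellent_share(model):
    """Fraction of all rated questions that received an 'Excellent' rating."""
    rows = list(COUNTS[model].values())
    total = sum(sum(row) for row in rows)
    excellent = sum(row[2] for row in rows)
    return excellent / total


for model in COUNTS:
    print(f"{model}: {excellent_share(model):.1%} rated Excellent")
```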

LLM | n | Comprehensiveness (Mean ± SD) | Comprehensiveness (Median) |
---|---|---|---|
Consensus | 13 | 2.87 ± 0.66 | 3.00 |
Gemini | 13 | 3.82 ± 0.68 | 4.00 |
ChatGPT 4o Mini | 13 | 3.15 ± 0.57 | 3.00 |
ChatGPT 4o | 13 | 3.95 ± 0.79 | 4.00 |

Evaluation Metric | Fleiss’ Kappa (κ) | Interpretation |
---|---|---|
Accuracy Scoring | 0.61 | Substantial Agreement |
Comprehensiveness Scoring | 0.57 | Moderate Agreement |
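
For readers who want to see how an agreement statistic like the one in the table above is obtained, the following is a minimal sketch of Fleiss' kappa computed from a matrix of rating counts. It uses the standard textbook formula. The number of raters, the example matrices, and the helper names are hypothetical, and the interpretation bands follow the commonly used Landis and Koch convention, which is an assumption about how the labels in the table were chosen.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item must have the same number of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    # Undefined (division by zero) when all ratings fall in a single category.
    return (p_bar - p_e) / (1 - p_e)


def interpret(kappa):
    """Landis and Koch style labels (assumed convention)."""
    if kappa < 0:
        return "Poor Agreement"
    if kappa <= 0.20:
        return "Slight Agreement"
    if kappa <= 0.40:
        return "Fair Agreement"
    if kappa <= 0.60:
        return "Moderate Agreement"
    if kappa <= 0.80:
        return "Substantial Agreement"
    return "Almost Perfect Agreement"


# Hypothetical example: 2 items, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
print(perfect, split)
print(interpret(0.61), "|", interpret(0.57))
```

Under this convention, the reported values of 0.61 and 0.57 map to the same labels shown in the table.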
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Milicevic, F.; Ghandour, M.; Khasawneh, M.Y.; Ghasemi, A.R.; Al Zuabi, A.; Smajic, S.; Mahmoud, M.A.; Kabir, K.; Mert, Ü. Assessing LLMs on IDSA Practice Guidelines for the Diagnosis and Treatment of Native Vertebral Osteomyelitis: A Comparison Study. J. Clin. Med. 2025, 14, 4996. https://doi.org/10.3390/jcm14144996