From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
Abstract
1. Introduction
2. Materials and Methods
2.1. General Information and Applied Large Language Models
2.2. Input Source
2.3. Analyses
3. Results
4. Discussion
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Karabacak, M.; Ozkara, B.B.; Margetis, K.; Wintermark, M.; Bisdas, S. The Advent of Generative Language Models in Medical Education. JMIR Med. Educ. 2023, 9, e48163. [Google Scholar] [CrossRef]
- Currie, G.M. Academic integrity and artificial intelligence: Is ChatGPT hype, hero or heresy? Semin. Nucl. Med. 2023, 53, 719–730. [Google Scholar] [CrossRef] [PubMed]
- Susnjak, T.; McIntosh, T.R. ChatGPT: The End of Online Exam Integrity? Educ. Sci. 2024, 14, 656. [Google Scholar] [CrossRef]
- Stribling, D.; Xia, Y.; Amer, M.K.; Graim, K.S.; Mulligan, C.J.; Renne, R. The Model Student: GPT-4 Performance on Graduate Biomedical Science Exams. Sci. Rep. 2024, 14, 5670. [Google Scholar] [CrossRef] [PubMed]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
- Kanjee, Z.; Crowe, B.; Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 2023, 330, 78–80. [Google Scholar] [CrossRef]
- Hoch, C.C.; Wollenberg, B.; Lüers, J.C.; Knoedler, S.; Knoedler, L.; Frank, K.; Cotofana, S.; Alfertshofer, M. ChatGPT’s quiz skills in different otolaryngology subspecialties: An analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur. Arch. Oto-Rhino-Laryngol. 2023, 280, 4271–4278. [Google Scholar] [CrossRef]
- Giannos, P. Evaluating the limits of AI in medical specialisation: ChatGPT’s performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol. Open 2023, 5, e000451. [Google Scholar] [CrossRef]
- Huang, R.S.T.; Lu, K.J.Q.; Meaney, C.; Kemppainen, J.; Punnett, A.; Leung, F.-H. Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study. JMIR Med. Educ. 2023, 9, e50514. [Google Scholar] [CrossRef]
- Jang, D.; Yun, T.R.; Lee, C.Y.; Kwon, Y.K.; Kim, C.E. GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors. PLoS Digit. Health 2023, 2, e0000416. [Google Scholar] [CrossRef]
- Lin, S.Y.; Chan, P.K.; Hsu, W.H.; Kao, C.H. Exploring the proficiency of ChatGPT-4: An evaluation of its performance in the Taiwan advanced medical licensing examination. Digit. Health 2024, 10. [Google Scholar] [CrossRef] [PubMed]
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Leoni Aleman, F.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Uriel, K.; Eran, C.; Eliya, S.; Jonathan, S.; Adam, F.; Eli, M.; Beki, S.; Ido, W. GPT versus resident physicians—A benchmark based on official board scores. NEJM AI 2024, 1, 5. [Google Scholar] [CrossRef]
- Stokel-Walker, C. AI bot ChatGPT writes smart essays—Should professors worry? Nature 2022. [Google Scholar] [CrossRef]
- Biswas, S. ChatGPT and the Future of Medical Writing. Radiology 2023, 307, e223312. [Google Scholar] [CrossRef]
- Gordijn, B.; Have, H.T. ChatGPT: Evolution or revolution? Med. Health Care Philos. 2023, 26, 1–2. [Google Scholar] [CrossRef]
- Stokel-Walker, C.; Van Noorden, R. What ChatGPT and generative AI mean for science. Nature 2023, 614, 214–216. [Google Scholar] [CrossRef]
- Buchmann, E.; Thor, A. Online Exams in the Era of ChatGPT. In Proceedings of the 21. Fachtagung Bildungstechnologien (DELFI), Aachen, Germany, 11–13 September 2023; Available online: https://dl.gi.de/handle/20.500.12116/42240 (accessed on 1 August 2024). [CrossRef]
- Malik, A.A.; Hassan, M.; Rizwan, M.; Mushtaque, I.; Lak, T.A.; Hussain, M. Impact of academic cheating and perceived online learning effectiveness on academic performance during the COVID-19 pandemic among Pakistani students. Front. Psychol. 2023, 14, 1124095. [Google Scholar] [CrossRef]
- Newton, P.M.; Essex, K. How Common is Cheating in Online Exams and did it Increase During the COVID-19 Pandemic? A Systematic Review. J. Acad. Ethics 2024, 22, 323–343. [Google Scholar] [CrossRef]
- Gupta, H.; Varshney, N.; Mishra, S.; Pal, K.K.; Sawant, S.A.; Scaria, K.; Goyal, S.; Baral, C. “John is 50 years old, can his son be 65?” Evaluating NLP Models’ Understanding of Feasibility. arXiv 2022, arXiv:2210.07471. [Google Scholar] [CrossRef]
- Ahmed, Y. Utilization of ChatGPT in Medical Education: Applications and Implications for Curriculum Enhancement. Acta Inform. Medica 2023, 31, 300–305. [Google Scholar] [CrossRef]
- Rodrigues Alessi, M.; Gomes, H.A.; Lopes de Castro, M.; Terumy Okamoto, C. Performance of ChatGPT in Solving Questions From the Progress Test (Brazilian National Medical Exam): A Potential Artificial Intelligence Tool in Medical Practice. Cureus 2024, 16, e64924. [Google Scholar] [CrossRef]
- Ebel, S.; Ehrengut, C.; Denecke, T.; Gößmann, H.; Beeskow, A.B. GPT-4o’s competency in answering the simulated written European Board of Interventional Radiology exam compared to a medical student and experts in Germany and its ability to generate exam items on interventional radiology: A descriptive study. J. Educ. Eval. Health Prof. 2024, 21, 21. [Google Scholar] [CrossRef]
- Al-Naser, Y.; Halka, F.; Ng, B.; Mountford, D.; Sharma, S.; Niure, K.; Yong-Hing, C.; Khosa, F.; Van der Pol, C. Evaluating Artificial Intelligence Competency in Education: Performance of ChatGPT-4 in the American Registry of Radiologic Technologists (ARRT) Radiography Certification Exam. Acad. Radiol. 2024, in press. [Google Scholar] [CrossRef]
- Hsieh, C.H.; Hsieh, H.Y.; Lin, H.P. Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination. Heliyon 2024, 10, e34851. [Google Scholar] [CrossRef]
- Sadeq, M.A.; Ghorab, R.M.F.; Ashry, M.H.; Abozaid, A.M.; Banihani, H.A.; Salem, M.; Aisheh, M.T.A.; Abuzahra, S.; Mourid, M.R.; Assker, M.M.; et al. AI chatbots show promise but limitations on UK medical exam questions: A comparative performance study. Sci. Rep. 2024, 14, 18859. [Google Scholar] [CrossRef]
- Ming, S.; Guo, Q.; Cheng, W.; Lei, B. Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR Med. Educ. 2024, 10, e52784. [Google Scholar] [CrossRef] [PubMed]
- Terwilliger, E.; Bcharah, G.; Bcharah, H.; Bcharah, E.; Richardson, C.; Scheffler, P. Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions with Image Analysis Insights. Cureus 2024, 16, e64204. [Google Scholar] [CrossRef] [PubMed]
- Nicikowski, J.; Szczepański, M.; Miedziaszczyk, M.; Kudliński, B. The potential of ChatGPT in medicine: An example analysis of nephrology specialty exams in Poland. Clin. Kidney J. 2024, 17, sfae193. [Google Scholar] [CrossRef]
- Chow, R.; Hasan, S.; Zheng, A.; Gao, C.; Valdes, G.; Yu, F.; Chhabra, A.; Raman, S.; Choi, J.I.; Lin, H.; et al. The Accuracy of Artificial Intelligence ChatGPT in Oncology Exam Questions. J. Am. Coll. Radiol. JACR 2024, in press. [Google Scholar] [CrossRef]
- Vij, O.; Calver, H.; Myall, N.; Dey, M.; Kouranloo, K. Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS ONE 2024, 19, e0307372. [Google Scholar] [CrossRef] [PubMed]
- Schoch, J.; Schmelz, H.U.; Strauch, A.; Borgmann, H.; Nestler, T. Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: A comparative analysis. World J. Urol. 2024, 42, 445. [Google Scholar] [CrossRef] [PubMed]
- Menekşeoğlu, A.K.; İş, E.E. Comparative performance of artificial intelligence models in physical medicine and rehabilitation board-level questions. Rev. Assoc. Med. Bras. (1992) 2024, 70, e20240241. [Google Scholar] [CrossRef]
- Cherif, H.; Moussa, C.; Missaoui, A.M.; Salouage, I.; Mokaddem, S.; Dhahri, B. Appraisal of ChatGPT’s Aptitude for Medical Education: Comparative Analysis with Third-Year Medical Students in a Pulmonology Examination. JMIR Med. Educ. 2024, 10, e52818. [Google Scholar] [CrossRef]
- Sparks, C.A.; Kraeutler, M.J.; Chester, G.A.; Contrada, E.V.; Zhu, E.; Fasulo, S.M.; Scillia, A.J. Inadequate Performance of ChatGPT on Orthopedic Board-Style Written Exams. Cureus 2024, 16, e62643. [Google Scholar] [CrossRef] [PubMed]
- Zheng, C.; Ye, H.; Guo, J.; Yang, J.; Fei, P.; Yuan, Y.; Huang, D.; Huang, Y.; Peng, J.; Xie, X.; et al. Development and evaluation of a large language model of ophthalmology in Chinese. Br. J. Ophthalmol. 2024, in press. [Google Scholar] [CrossRef] [PubMed]
- Shang, L.; Li, R.; Xue, M.; Guo, Q.; Hou, Y. Evaluating the application of ChatGPT in China’s residency training education: An exploratory study. Med. Teach. 2024, in press. [Google Scholar] [CrossRef]
- Soulage, C.O.; Van Coppenolle, F.; Guebre-Egziabher, F. The conversational AI “ChatGPT” outperforms medical students on a physiology university examination. Adv. Physiol. Educ. 2024, in press. [Google Scholar] [CrossRef]
- Yudovich, M.S.; Makarova, E.; Hague, C.M.; Raman, J.D. Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: A descriptive study. J. Educ. Eval. Health Prof. 2024, 21, 17. [Google Scholar] [CrossRef]
- Patel, E.A.; Fleischer, L.; Filip, P.; Eggerstedt, M.; Hutz, M.; Michaelides, E.; Batra, P.S.; Tajudeen, B.A. Comparative Performance of ChatGPT 3.5 and GPT4 on Rhinology Standardized Board Examination Questions. OTO Open 2024, 8, e164. [Google Scholar] [CrossRef]
- Borna, S.; Gomez-Cabello, C.A.; Pressman, S.M.; Haider, S.A.; Forte, A.J. Comparative Analysis of Large Language Models in Emergency Plastic Surgery Decision-Making: The Role of Physical Exam Data. J. Pers. Med. 2024, 14, 612. [Google Scholar] [CrossRef] [PubMed]
- Han, Y.; Choudhry, H.S.; Simon, M.E.; Katt, B.M. ChatGPT’s Performance on the Hand Surgery Self-Assessment Exam: A Critical Analysis. J. Hand Surg. Glob. Online 2024, 6, 200–205. [Google Scholar] [CrossRef]
- Touma, N.J.; Caterini, J.; Liblk, K. Performance of artificial intelligence on a simulated Canadian urology board exam: Is CHATGPT ready for primetime? Can. Urol. Assoc. J. 2024, 18. [Google Scholar] [CrossRef] [PubMed]
- Suwała, S.; Szulc, P.; Guzowski, C.; Kamińska, B.; Dorobiała, J.; Wojciechowska, K.; Berska, M.; Kubicka, O.; Kosturkiewicz, O.; Kosztulska, B.; et al. ChatGPT-3.5 passes Poland’s medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024, 12, 20503121241257777. [Google Scholar] [CrossRef] [PubMed]
- Alessandri-Bonetti, M.; Liu, H.Y.; Donovan, J.M.; Ziembicki, J.A.; Egro, F.M. A Comparative Analysis of ChatGPT, ChatGPT-4, and Google Bard Performances at the Advanced Burn Life Support Exam. J. Burn Care Res. 2024, 45, 945–948. [Google Scholar] [CrossRef]
- Duggan, R.; Tsuruda, K.M. ChatGPT performance on radiation technologist and therapist entry to practice exams. J. Med. Imaging Radiat. Sci. 2024, 55, 101426. [Google Scholar] [CrossRef]
- Takagi, S.; Koda, M.; Watari, T. The Performance of ChatGPT-4V in Interpreting Images and Tables in the Japanese Medical Licensing Exam. JMIR Med. Educ. 2024, 10, e54283. [Google Scholar] [CrossRef]
- Canillas Del Rey, F.; Canillas Arias, M. Exploring the potential of Artificial Intelligence in Traumatology: Conversational answers to specific questions. Rev. Esp. Cir. Ortop. Traumatol. 2024, in press. [Google Scholar] [CrossRef]
- Powers, A.Y.; McCandless, M.G.; Taussky, P.; Vega, R.A.; Shutran, M.S.; Moses, Z.B. Educational Limitations of ChatGPT in Neurosurgery Board Preparation. Cureus 2024, 16, e58639. [Google Scholar] [CrossRef]
- D’Anna, G.; Van Cauter, S.; Thurnher, M.; Van Goethem, J.; Haller, S. Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard. Neuroradiology 2024, 66, 1245–1250. [Google Scholar] [CrossRef]
- Alexandrou, M.; Mahtani, A.U.; Rempakos, A.; Mutlu, D.; Al Ogaili, A.; Gill, G.S.; Sharma, A.; Prasad, A.; Mastrodemos, O.C.; Sandoval, Y.; et al. Performance of ChatGPT on ACC/SCAI Interventional Cardiology Certification Simulation Exam. JACC Cardiovasc. Interv. 2024, 17, 1292–1293. [Google Scholar] [CrossRef]
- Rojas, M.; Rojas, M.; Burgess, V.; Toro-Pérez, J.; Salehi, S. Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 with Vision in the Chilean Medical Licensing Examination: Observational Study. JMIR Med. Educ. 2024, 10, e55048. [Google Scholar] [CrossRef] [PubMed]
- Lin, J.C.; Kurapati, S.S.; Younessi, D.N.; Scott, I.U.; Gong, D.A. Ethical and Professional Decision-Making Capabilities of Artificial Intelligence Chatbots: Evaluating ChatGPT’s Professional Competencies in Medicine. Med. Sci. Educ. 2024, 34, 331–333. [Google Scholar] [CrossRef]
- Shieh, A.; Tran, B.; He, G.; Kumar, M.; Freed, J.A.; Majety, P. Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci. Rep. 2024, 14, 9330. [Google Scholar] [CrossRef] [PubMed]
- Taesotikul, S.; Singhan, W.; Taesotikul, T. ChatGPT vs pharmacy students in the pharmacotherapy time-limit test: A comparative study in Thailand. Curr. Pharm. Teach. Learn. 2024, 16, 404–410. [Google Scholar] [CrossRef] [PubMed]
- van Nuland, M.; Erdogan, A.; Aςar, C.; Contrucci, R.; Hilbrants, S.; Maanach, L.; Egberts, T.; van der Linden, P.D. Performance of ChatGPT on Factual Knowledge Questions Regarding Clinical Pharmacy. J. Clin. Pharmacol. 2024, 64, 1095–1100. [Google Scholar] [CrossRef]
- Vaishya, R.; Iyengar, K.P.; Patralekh, M.K.; Botchu, R.; Shirodkar, K.; Jain, V.K.; Vaish, A.; Scarlat, M.M. Effectiveness of AI-powered Chatbots in responding to orthopaedic postgraduate exam questions-an observational study. Int. Orthop. 2024, 48, 1963–1969. [Google Scholar] [CrossRef] [PubMed]
- Abbas, A.; Rehman, M.S.; Rehman, S.S. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus 2024, 16, e55991. [Google Scholar] [CrossRef]
- Cabañuz, C.; García-García, M. ChatGPT is an above-average student at the Faculty of Medicine of the University of Zaragoza and an excellent collaborator in the development of teaching materials. Rev. Esp. Patol. 2024, 57, 91–96. [Google Scholar] [CrossRef]
- Fiedler, B.; Azua, E.N.; Phillips, T.; Ahmed, A.S. ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam. J. Shoulder Elb. Surg. 2024, 33, 1888–1893. [Google Scholar] [CrossRef]
- Miao, J.; Thongprayoon, C.; Cheungpasitporn, W.; Cornell, L.D. Performance of GPT-4 Vision on kidney pathology exam questions. Am. J. Clin. Pathol. 2024, 162, 220–226. [Google Scholar] [CrossRef]
- Ghanem, D.; Nassar, J.E.; El Bachour, J.; Hanna, T. ChatGPT Earns American Board Certification in Hand Surgery. Hand Surg. Rehabil. 2024, 43, 101688. [Google Scholar] [CrossRef] [PubMed]
- Noda, M.; Ueno, T.; Koshu, R.; Takaso, Y.; Shimada, M.D.; Saito, C.; Sugimoto, H.; Fushiki, H.; Ito, M.; Nomura, A.; et al. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med. Educ. 2024, 10, e57054. [Google Scholar] [CrossRef]
- Le, M.; Davis, M. ChatGPT Yields a Passing Score on a Pediatric Board Preparatory Exam but Raises Red Flags. Glob. Pediatr. Health 2024, 11, 2333794x241240327. [Google Scholar] [CrossRef]
- Stengel, F.C.; Stienen, M.N.; Ivanov, M.; Gandía-González, M.L.; Raffa, G.; Ganau, M.; Whitfield, P.; Motov, S. Can AI pass the written European Board Examination in Neurological Surgery?-Ethical and practical issues. Brain Spine 2024, 4, 102765. [Google Scholar] [CrossRef]
- Garabet, R.; Mackey, B.P.; Cross, J.; Weingarten, M. ChatGPT-4 Performance on USMLE Step 1 Style Questions and Its Implications for Medical Education: A Comparative Study Across Systems and Disciplines. Med. Sci. Educ. 2024, 34, 145–152. [Google Scholar] [CrossRef] [PubMed]
- Gravina, A.G.; Pellegrino, R.; Palladino, G.; Imperio, G.; Ventura, A.; Federico, A. Charting new AI education in gastroenterology: Cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Dig. Liver Dis. 2024, 56, 1304–1311. [Google Scholar] [CrossRef] [PubMed]
- Nakao, T.; Miki, S.; Nakamura, Y.; Kikuchi, T.; Nomura, Y.; Hanaoka, S.; Yoshikawa, T.; Abe, O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med. Educ. 2024, 10, e54393. [Google Scholar] [CrossRef]
- Ozeri, D.J.; Cohen, A.; Bacharach, N.; Ukashi, O.; Oppenheim, A. Performance of ChatGPT in Israeli Hebrew Internal Medicine National Residency Exam. Isr. Med. Assoc. J. IMAJ 2024, 26, 86–88. [Google Scholar]
- Su, M.C.; Lin, L.E.; Lin, L.H.; Chen, Y.C. Assessing question characteristic influences on ChatGPT’s performance and response-explanation consistency: Insights from Taiwan’s Nursing Licensing Exam. Int. J. Nurs. Stud. 2024, 153, 104717. [Google Scholar] [CrossRef]
- Valdez, D.; Bunnell, A.; Lim, S.Y.; Sadowski, P.; Shepherd, J.A. Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists. J. Clin. Densitom. 2024, 27, 101480. [Google Scholar] [CrossRef] [PubMed]
- Farhat, F.; Chaudhry, B.M.; Nadeem, M.; Sohail, S.S.; Madsen, D. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard. JMIR Med. Educ. 2024, 10, e51523. [Google Scholar] [CrossRef] [PubMed]
- Huang, C.H.; Hsiao, H.J.; Yeh, P.C.; Wu, K.C.; Kao, C.H. Performance of ChatGPT on Stage 1 of the Taiwanese medical licensing exam. Digit. Health 2024, 10, 20552076241233144. [Google Scholar] [CrossRef]
- Zong, H.; Li, J.; Wu, E.; Wu, R.; Lu, J.; Shen, B. Performance of ChatGPT on Chinese national medical licensing examinations: A five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med. Educ. 2024, 24, 143. [Google Scholar] [CrossRef]
- Morreel, S.; Verhoeven, V.; Mathysen, D. Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam. PLoS Digit. Health 2024, 3, e0000349. [Google Scholar] [CrossRef]
- Meyer, A.; Riese, J.; Streichert, T. Comparison of the Performance of GPT-3.5 and GPT-4 with That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med. Educ. 2024, 10, e50965. [Google Scholar] [CrossRef] [PubMed]
- Tanaka, Y.; Nakata, T.; Aiga, K.; Etani, T.; Muramatsu, R.; Katagiri, S.; Kawai, H.; Higashino, F.; Enomoto, M.; Noda, M.; et al. Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan. PLoS Digit. Health 2024, 3, e0000433. [Google Scholar] [CrossRef]
- Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, F.; Herschbach, L.; Griewatz, J.; Masters, K.; Zipfel, S.; Mahling, M. Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. J. Med. Internet Res. 2024, 26, e52113. [Google Scholar] [CrossRef]
- Long, C.; Lowe, K.; Zhang, J.; Santos, A.D.; Alanazi, A.; O’Brien, D.; Wright, E.D.; Cote, D. A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: Performance Study. JMIR Med. Educ. 2024, 10, e49970. [Google Scholar] [CrossRef]
- Kollitsch, L.; Eredics, K.; Marszalek, M.; Rauchenwald, M.; Brookman-May, S.D.; Burger, M.; Körner-Riffard, K.; May, M. How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models’ accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology. World J. Urol. 2024, 42, 20. [Google Scholar] [CrossRef]
- Ting, Y.T.; Hsieh, T.C.; Wang, Y.F.; Kuo, Y.C.; Chen, Y.J.; Chan, P.K.; Kao, C.H. Performance of ChatGPT incorporated chain-of-thought method in bilingual nuclear medicine physician board examinations. Digit. Health 2024, 10, 20552076231224074. [Google Scholar] [CrossRef] [PubMed]
- Shemer, A.; Cohen, M.; Altarescu, A.; Atar-Vardi, M.; Hecht, I.; Dubinsky-Pertzov, B.; Shoshany, N.; Zmujack, S.; Or, L.; Einan-Lifshitz, A.; et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch. Clin. Exp. Ophthalmol. 2024, 262, 2345–2352. [Google Scholar] [CrossRef] [PubMed]
- Sahin, M.C.; Sozer, A.; Kuzucu, P.; Turkmen, T.; Sahin, M.B.; Sozer, E.; Tufek, O.Y.; Nernekli, K.; Emmez, H.; Celtikci, E. Beyond human in neurosurgical exams: ChatGPT’s success in the Turkish neurosurgical society proficiency board exams. Comput. Biol. Med. 2024, 169, 107807. [Google Scholar] [CrossRef]
- Tsoutsanis, P.; Tsoutsanis, A. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam. Comput. Biol. Med. 2024, 168, 107794. [Google Scholar] [CrossRef]
- Savelka, J.; Agarwal, A.; Bogart, C.; Sakr, M. Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code. arXiv 2023, arXiv:2303.08033. [Google Scholar] [CrossRef]
- Angel, M.; Patel, A.; Alachkar, A.; Baldi, P. Clinical Knowledge and Reasoning Abilities of AI Large Language Models in Pharmacy: A Comparative Study on the NAPLEX Exam. bioRxiv 2023. [Google Scholar] [CrossRef]
- Choi, J.; Hickman, K.; Monahan, A.; Schwarcz, D. ChatGPT Goes to Law School. J. Leg. Educ. 2023. Available online: https://ssrn.com/abstract=4335905 (accessed on 1 August 2024). [CrossRef]
- Traoré, S.Y.; Goetsch, T.; Muller, B.; Dabbagh, A.; Liverneaux, P.A. Is ChatGPT able to pass the first part of the European Board of Hand Surgery diploma examination? Hand Surg. Rehabil. 2023, 42, 362–364. [Google Scholar] [CrossRef] [PubMed]
- Moazzam, Z.; Cloyd, J.; Lima, H.A.; Pawlik, T.M. Quality of ChatGPT Responses to Questions Related to Pancreatic Cancer and its Surgical Care. Ann. Surg. Oncol. 2023, 30, 6284–6286. [Google Scholar] [CrossRef]
- Zhu, L.; Mou, W.; Yang, T.; Chen, R. ChatGPT can pass the AHA exams: Open-ended questions outperform multiple-choice format. Resuscitation 2023, 188, 109783. [Google Scholar] [CrossRef]
- Cai, L.Z.; Shaheen, A.; Jin, A.; Fukui, R.; Yi, J.S.; Yannuzzi, N.; Alabiad, C. Performance of Generative Large Language Models on Ophthalmology Board Style Questions. Am. J. Ophthalmol. 2023, 254, 141–149. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z.; Lei, L.; Wu, L.; Sun, R.; Huang, Y.; Long, C.; Liu, X.; Lei, X.; Tang, J.; Huang, M. SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions. arXiv 2023, arXiv:2309.07045. [Google Scholar] [CrossRef]
- Yue, S.; Song, S.; Cheng, X.; Hu, H. Do Large Language Models Understand Conversational Implicature—A case study with a chinese sitcom. arXiv 2024, arXiv:2404.19509. [Google Scholar] [CrossRef]
- Shetty, M.; Ettlinger, M.; Lynch, M. GPT-4, an artificial intelligence large language model, exhibits high levels of accuracy on dermatology specialty certificate exam questions. medRxiv 2023. [Google Scholar] [CrossRef]
- Pokrywka, J.; Kaczmarek, J.; Gorzelańczyk, E. GPT-4 passes most of the 297 written Polish Board Certification Examinations. arXiv 2024, arXiv:2405.01589. [Google Scholar]
- Guerra, G.A.; Hofmann, H.; Sobhani, S.; Hofmann, G.; Gomez, D.; Soroudi, D.; Hopkins, B.S.; Dallas, J.; Pangal, D.J.; Cheok, S.; et al. GPT-4 Artificial Intelligence Model Outperforms ChatGPT, Medical Students, and Neurosurgery Residents on Neurosurgery Written Board-Like Questions. World Neurosurg. 2023, 179, e160–e165. [Google Scholar] [CrossRef]
- van Dis, E.A.M.; Bollen, J.; Zuidema, W.; van Rooij, R.; Bockting, C.L. ChatGPT: Five priorities for research. Nature 2023, 614, 224–226. [Google Scholar] [CrossRef]
- Hua, C. Reinforcement Learning and Feedback Control. In Reinforcement Learning Aided Performance Optimization of Feedback Control Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 27–57. [Google Scholar] [CrossRef]
Autumn Exam in 2022
Index | Image | Topic | Correct (1st Round) | False (1st Round) | No Answer (1st Round) | Correct (2nd Round) | False (2nd Round) | No Answer (2nd Round)
---|---|---|---|---|---|---|---|---
1 | penis | x | x | ||||||
2 | A vertebralis | x | x | ||||||
3 | cerebellum | x | x | ||||||
4 | * | ||||||||
5 | uterus | x | x | ||||||
6 | neuroanatomy of the optic tract | x | x | ||||||
7 | intestines | x | x | ||||||
8 | respiratory system | x | x | ||||||
9 | segmental innervation | x | x | ||||||
10 | * | ||||||||
11 | caecum | x | x | ||||||
12 | lymphatics | x | x | ||||||
13 | stomach | x | x | ||||||
14 | * | ||||||||
15 | inguinal region | x | x | ||||||
16 | coronary blood vessel | x | x | ||||||
17 | * | ||||||||
18 | face | x | x | ||||||
19 | intestines | x | x | ||||||
20 | * | ||||||||
21 | nose | x | x | ||||||
22 | thorax | x | x | ||||||
23 | Bursa omentalis | x | x | ||||||
24 | eye | x | x | ||||||
25 | neuroanatomy of the optic tract | x | x | ||||||
26 | neuroanatomy of the trigeminal nerve | x | x | ||||||
27 | * | ||||||||
28 | neuroanatomy of mechanosensation | x | x ||||||
29 | acromioclavicular joint | x | x | ||||||
30 | eye bulb movement | x | x | ||||||
31 | larynx | x | x | ||||||
32 | neuroanatomy of the trigeminal nerve | x | x | ||||||
33 | hand and wrist anatomy | x | x | ||||||
34 | lymphatics | x | x | ||||||
35 | brain stem nuclei | x | x | ||||||
36 | neuroanatomy of the cortex | x | x | ||||||
37 | * | ||||||||
38 | bile duct | x | x | ||||||
39 | tympanic | x | x | ||||||
40 | neck and throat | x | x | ||||||
41 | continence organs | x | x | ||||||
42 | * | ||||||||
43 | superficial anatomy | x | x | ||||||
44 | * | ||||||||
45 | lymphatics | x | x | ||||||
46 | neuroanatomy of the spinal cord | x | x | ||||||
47 | * | ||||||||
48 | * | ||||||||
49 | hand and wrist anatomy | x | x | ||||||
50 | eyeball | x | x | ||||||
51 | ankle joint and foot | x | x | ||||||
52 | knee joint | x | x | ||||||
53 | abdominal wall | x | x | ||||||
54 | neuroanatomy of the n facialis | x | x | ||||||
55 | hip joint | x | x | ||||||
56 | inguinal region | x | x |
Spring Exam in 2021
Index | Image | Topic | Correct (1st Round) | False (1st Round) | No Answer (1st Round) | Correct (2nd Round) | False (2nd Round) | No Answer (2nd Round) | Re-Challenge
---|---|---|---|---|---|---|---|---|---
1 | nose | x | x | wrong | ||||||
2 | vegetativum | x | x | |||||||
3 | reflexes | x | x | |||||||
4 | hand and wrist | x | x | |||||||
5 | * | |||||||||
6 | hand and wrist | x | x | correct | ||||||
7 | segmental innervation | x | x |||||||
8 | hand and wrist | x | x | |||||||
9 | N medianus | x | x | |||||||
10 | ankle joint and foot | x | x | |||||||
11 | ankle joint and foot | x | x | |||||||
12 | joints | x | x | no change | ||||||
13 | hip joint | x | x | correct | ||||||
14 | hip joint | x | x | |||||||
15 | N obturatorius | x | x | no change | ||||||
16 | hip joint | x | x | |||||||
17 | gluteal region | x | x | |||||||
18 | pelvis | x | x | no change | ||||||
19 | inguinal region | x | x | |||||||
20 | scrotum | x | x | |||||||
21 | lymphatics | x | x | no change | ||||||
22 | pelvis | x | x | |||||||
23 | abdomen | x | x | |||||||
24 | aorta | x | x | |||||||
25 | heart | x | x | |||||||
26 | heart | x | x | |||||||
27 | lymphatics | x | x | |||||||
28 | abdomen | x | x | |||||||
29 | abdomen | x | x | no change | ||||||
30 | abdomen | x | x | no change | ||||||
31 | abdomen | x | x | |||||||
32 | continence organs | x | x | wrong | ||||||
33 | adrenals | x | x | wrong | ||||||
34 | neck and throat | x | x | wrong | ||||||
35 | neck and throat | x | x | no change | ||||||
36 | neck and throat | x | x | no change | ||||||
37 | neck and throat | x | x | no change | ||||||
38 | vegetativum | x | x | |||||||
39 | head | x | x | |||||||
40 | CNS | x | x | |||||||
41 | CNS | x | x | |||||||
42 | CNS | x | x | correct | ||||||
43 | endocrinum | x | x | |||||||
44 | optic tract | x | x | |||||||
45 | eye | x | x | correct | ||||||
46 | eye | x | x | |||||||
47 | eye | x | x | no change | ||||||
48 | * | |||||||||
49 | vestibular | x | x |
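The two tables above record, for each exam question, its topic, whether it was an image-based item (rows marked "*" carry no answer data and presumably correspond to excluded image questions), and whether the answer was correct, false, or absent in each of two rounds. As a rough illustration of how per-question records of this kind can be aggregated into per-round accuracy, the following minimal Python sketch may help; the record layout and the example entries are hypothetical and not taken from the study.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class QuestionResult:
    index: int
    topic: str
    round1: Optional[str]  # "correct", "false", or None (no answer / excluded image item)
    round2: Optional[str]

def accuracy(results: List[QuestionResult], round_attr: str) -> float:
    """Share of answered questions that were correct in the given round."""
    answered = [getattr(r, round_attr) for r in results if getattr(r, round_attr) is not None]
    if not answered:
        return 0.0
    return sum(1 for a in answered if a == "correct") / len(answered)

# Hypothetical records mirroring the table layout above
results = [
    QuestionResult(1, "penis", "correct", "correct"),
    QuestionResult(2, "A. vertebralis", "false", "correct"),
    QuestionResult(4, "image-based (excluded)", None, None),
]

print(f"Round 1 accuracy: {accuracy(results, 'round1'):.0%}")
print(f"Round 2 accuracy: {accuracy(results, 'round2'):.0%}")
```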
Citation | Study Resources | Applied LLM | Main Outcome(s) |
---|---|---|---|
[23] | Progress Test of the Brazilian National Medical Exam | GPT-3.5 | GPT-3.5 outperformed humans |
[24] | European Board of Interventional Radiology exam | GPT-4.o | GPT-4.o outperformed students and consultants; lower accuracy than certificate holders
[25] | American Registry of Radiologic Technologists Radiography Certification Exam | GPT-4 | Higher accuracy on text-based compared to image-based questions; performance varied by domain |
[26] | Taiwan plastic surgery board exam | GPT-3.5 and GPT-4 | ChatGPT-4 outperformed ChatGPT-3.5; ChatGPT-4 passed five out of eight yearly exams
[27] | U.K. medical board exam | Various | All LLMs scored higher on multiple-choice vs. true/false or “choose N” questions; GPT-4.o performed best
[28] | Chinese Medical Licensing exam | GPT-3.5 and GPT-4 | ChatGPT-4 outperformed ChatGPT-3.5; GPT-4 surpassed the accuracy threshold in 14 of 15 subspecialties
[29] | U.S. otolaryngology board exam (preparation tool BoardVitals) | ChatGPT, GPT-4, and Google Bard | GPT-4 outperformed ChatGPT and Bard |
[30] | Polish nephrology speciality exam | GPT-3.5 and GPT-4 | ChatGPT-4 outperformed ChatGPT-3.5; outperformed humans
[31] | U.S. national radiation oncology in-service examination | GPT-3.5 and GPT-4.o | ChatGPT-4 outperformed ChatGPT-3.5; difference in performance by question category
[32] | Membership of the Royal College of Physicians (MRCP) Part 1 exam | GPT-3.5 and GPT-4 | ChatGPT-4 outperformed ChatGPT-3.5; both versions above the historical pass mark for MRCP Part 1
[33] | European Board of Urology exam | GPT-3.5 and GPT-4 | ChatGPT-4 outperformed ChatGPT-3.5; performance varied by domain; GPT-4 passed all exams
[34] | American Board of Physical Medicine and Rehabilitation exam | GPT-3.5, GPT-4, and Google Bard | GPT-4 outperformed ChatGPT and Bard |
[35] | Local pulmonology exam with third-year medical students | GPT-3.5 | Performance varied by domain and question type; performance closely mirrors that of an average medical student |
[36] | Orthopedic Board-Style Written Exam | GPT-3.5 | Performs below a threshold likely to pass the American Board of Orthopedic Surgery (ABOS) Part I written exam |
[37] | Chinese ophthalmology-related exam | Custom-developed LLM (MOPH) | Good performance
[38] | Chinese residency final exam | GPT-3.5 | Potential for personalized Chinese medical education |
[39] | French local physiology university exam | GPT-3.5 | Outperformed humans |
[40] | American urology board exam | GPT-3.5 and GPT-4 | GPT-4 outperformed ChatGPT-3.5; performance relatively poor |
[41] | Otolaryngology (Rhinology) Standardized Board Exam | GPT-3.5 and GPT-4 | GPT-4 outperformed ChatGPT-3.5 and residents |
[42] | Vignettes covering emergency conditions in plastic surgery | GPT-4 and Gemini | GPT-4 outperformed Gemini; AI might support clinical decision-making |
[43] | American Society for Surgery of the Hand exam | GPT-3.5 | Poor performance; performance varied by question type |
[44] | Canadian urology board exam | GPT-4 | GPT-4 underperformed compared to residents
[45] | Polish Medical Final exam | GPT-3.5 | GPT-3.5 underperformed compared to humans; passed 8 out of 11 exams |
[46] | Advanced Burn Life Support exam | GPT-3.5, GPT-4, Bard | GPT-4 outperformed Bard; high accuracy of GPT-3.5 and GPT-4 |
[47] | Canadian Association of Medical Radiation Technologists exam | GPT-4 | Passed exams; poor performance on critical-thinking questions
[48] | Japanese Medical Licensing exam | GPT-4 | Passed exams; performance comparable to humans; performance varied by question type |
[49] | Spanish Orthopedic Surgery and Traumatology exam | GPT-3.5, Bard, Perplexity | GPT-3.5 outperformed Bard and Perplexity
[50] | American Board of Neurological Surgery | ChatGPT | ChatGPT underperformed compared to humans |
[51] | European Society of Neuroradiology exam | GPT-3.5, GPT-4, Bard | GPT-4 outperformed GPT-3.5 and Bard |
[52] | Interventional cardiology exam | GPT-4 | GPT-4 underperformed compared to humans; passed the exam |
[53] | Examen Único Nacional de Conocimientos de Medicina in Chile | GPT-3.5, GPT-4, GPT-4V | All versions passed the exam; GPT-4 and GPT-4V outperformed GPT-3.5
[54] | AAMC PREview® exam | GPT-3.5, GPT-4 | GPT-4 outperformed GPT-3.5; both versions outperformed humans |
[55] | United States Medical Licensing Exam (USMLE) Step 2 | GPT-3.5, GPT-4 | GPT-4 outperformed GPT-3.5; correctly listed differential diagnoses from case reports |
[56] | Thai local 4th year pharmacy exam | GPT-3.5 | GPT-3.5 underperformed compared to humans |
[57] | Dutch clinical pharmacy exam (questions retrieved from the “parate kennis” database) | GPT-4 | GPT-4 outperformed pharmacists
[58] | Postgraduate orthopaedic qualifying exam | GPT-3.5, GPT-4, Bard | Bard outperformed GPT-3.5 and GPT-4 |
[59] | National Board of Medical Examiners sample questions | GPT-4, GPT-3.5, Claude, Bard | GPT-4 outperformed GPT-3.5, Claude, and Bard
[60] | Anatomopathological Diagnostic and Therapeutic Procedures | ChatGPT | GPT-4 outperformed humans |
[61] | American Shoulder and Elbow Surgeons Maintenance of Certification exam | GPT-3.5, GPT-4 | GPT-4 underperformed compared to humans; performance varied by question type |
[62] | Nephrology fellows exam (kidney pathology questions) | GPT-4V | GPT-4V underperformed compared to humans; performance varied by question type
[63] | American Society for Surgery of the Hand Self-Assessment exam | GPT-4 | GPT-4 underperformed compared to humans |
[64] | Japanese Otolaryngology Board Certification exam | GPT-4V | Accuracy rate increased after translation to English; performance varied by question type |
[65] | Pediatric Board Preparatory exam | GPT-3.5, GPT-4 | GPT-4 outperformed GPT-3.5, limitations with complex questions |
[66] | European Board exam in Neurological Surgery | GPT-3.5, Bing, Bard | Bard outperformed GPT-3.5, Bing, and humans; performance varied by question type |
[67] | United States Medical Licensing Exam STEP 1-style questions | GPT-4 | Passed exam |
[68] | Italian gastroenterology-related national residency admission exam | GPT-3.5, Perplexity | GPT-3.5 outperformed Perplexity |
[69] | Japanese National Medical Licensing exam | GPT-4V | 68% accuracy; input of images did not further improve performance
[11] | Taiwan advanced medical licensing exam | GPT-4 | Passed exams; chain-of-thought prompt increased performance |
[70] | Israeli Hebrew Internal Medicine National Residency exam | GPT-3.5 | Suboptimal performance (~37%) |
[71] | Taiwan Nursing Licensing exam | GPT-4 | Passed exam; performance varied by domain |
[72] | International Society for Clinical Densitometry exam | GPT-3, GPT-4 | GPT-4 outperformed GPT-3; GPT-4 passed exam |
[73] | Indian National Eligibility cum Entrance Test | GPT-3.5, GPT-4, Bard | GPT-4 outperformed GPT-3.5 and Bard; GPT-4 and GPT-3.5 passed the exam
[74] | Taiwanese Stage 1 of Senior Professional and Technical Exam for Medical Doctors | GPT-4 | Passed exam; performance varied by domain |
[75] | Chinese national medical licensing exam | GPT-3.5 | Did not pass the exams; performance varied by domain and question type
[76] | Local Belgian medical licensing exam (University of Antwerp) | GPT-3.5, GPT-4, Bard, Bing, Claude Instant, Claude+ | All LLMs passed the exam; Bing and GPT-4 outperformed the other LLMs and humans
[77] | Written German Medical Licensing Examination | GPT-3.5, GPT-4 | GPT-4 outperformed GPT-3.5 and humans
[78] | Japanese National Medical Licensing Examination | GPT-3.5, GPT-4 | GPT-4 outperformed GPT-3.5; optimization of prompts increased performance
[79] | Psychosomatic medicine | GPT-4 | Passed exam |
[80] | Otolaryngology-head and neck surgery certification exam | GPT-4 | Passed exam |
[81] | European Board of Urology (EBU) In-Service Assessment exam | GPT-3.5, GPT-4, Bing | GPT-4 and Bing outperformed GPT-3.5 |
[82] | Taiwanese Nuclear Medicine Specialty Exam | GPT-4 | Passed exam; increased performance when adding the chain-of-thought method
[83] | Medical cases in ophthalmology | GPT-3.5 | GPT-3.5 underperformed compared to residents |
[84] | Turkish Neurosurgical Society Proficiency Board exam | GPT-4 | GPT-4 outperformed humans |
[85] | Multi-Specialty Recruitment Assessment from PassMedicine | GPT-3.5, Llama2, Bard, Bing | Bing outperformed all other LLMs and humans |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).