1. Introduction
Chat Generative Pre-trained Transformer (ChatGPT-3.5) is an artificial intelligence (AI) software developed by OpenAI [1]. ChatGPT is trained on various datasets and modeled to generate human-like responses [2]. ChatGPT was trained on data available up to September 2021, limiting its knowledge base to events before that date [1,3]. ChatGPT is continually being updated and improved by OpenAI [4]. Our study focused on ChatGPT-3.5, the free version of ChatGPT, trained on 175 billion parameters [5]. Within the first couple of months after its launch in November 2022, ChatGPT gained approximately 100 million users [6]. Whereas other online platforms such as Netflix, Airbnb, Twitter, Facebook, and Instagram took several months to years to reach 1 million users, ChatGPT reached 1 million users in 5 days [7]. As of June 2023, ChatGPT had received over 1.6 billion visits, averaging 55 million visitors daily [7].
The National Board of Medical Examiners (NBME) owns the United States Medical Licensing Examination (USMLE) and writes the Step exams that medical students take during their medical education. The USMLE Step 1 exam comprises a number of disciplines, including pathology, physiology, pharmacology, biochemistry and nutrition, microbiology, immunology, gross anatomy and embryology, histology and cell biology, behavioral sciences, and genetics [8]. Pharmacology makes up approximately 15–22 percent of the questions on USMLE Step 1 [8].
Previous studies have assessed the performance of ChatGPT on the USMLE Step exams using the NBME and AMBOSS datasets [9]. ChatGPT answered 64.4 percent of NBME Step 1 questions, 57.8 percent of NBME Step 2 questions, 44 percent of AMBOSS Step 1 questions, and 42 percent of AMBOSS Step 2 questions correctly [9]. The ability of ChatGPT to achieve a passing grade of over 60 percent on the NBME Step 1 exam prompted our interest in assessing its ability to generate NBME-style pharmacology multiple-choice questions that meet the NBME Item-Writing Guidelines [10,11].
Previous research has assessed the ability of ChatGPT to take an Ophthalmic Knowledge Assessment Program (OKAP) exam, which is created for ophthalmology residents [12]. The study found that ChatGPT performed at the level of a first-year resident, providing insight into its capabilities and knowledge base.
ChatGPT offers promising benefits in scientific research, healthcare practice, and healthcare education [13]. In scientific research, it has been found to be a useful tool for academic research and writing [14]. In healthcare practice, ChatGPT shows the potential to assist in documentation, disease risk prediction, improving health literacy, and enhancing diagnostics [15,16].
In healthcare education, ChatGPT has demonstrated the ability to pass exams such as the USMLE Step exams and ophthalmology residency examinations [12,13]. ChatGPT shows promising potential in healthcare and can be a useful tool for medical education [17]. Additionally, ChatGPT has been found to be beneficial for the rapid generation of clinical vignettes, which could help reduce costs for healthcare students [18].
However, there are concerns that ChatGPT can provide inaccurate information and references [19]. With the plethora of resources available during Step preparation, the process can be both time-consuming and overwhelming for students. If ChatGPT-generated questions are medically accurate and abide by the NBME guidelines, they can be a valuable resource for medical students [11].
This study aimed to evaluate ChatGPT-3.5’s capability to generate NBME-style pharmacology multiple-choice questions that adhere to the NBME Item-Writing Guidelines [11]. Our objective was to generate recommendations for enhancing and fine-tuning ChatGPT-generated NBME questions for medical education. If the questions and answers generated by ChatGPT align with the standards outlined in the NBME Item-Writing Guide, ChatGPT has the potential to serve as a valuable resource for medical students seeking free NBME practice questions for the USMLE Step exams [11].
2. Materials and Methods
Stage 1: Defining a suitable prompt for ChatGPT to generate NBME-style pharmacology multiple-choice questions, adhering to the NBME guidelines.
OpenAI provides “GPT best practices” that assist in the use of ChatGPT. Some of the strategies include writing clear instructions by providing details, specifying tasks, providing examples, and specifying the complexity of the response [20]. Another strategy is asking ChatGPT to adopt a persona [20].
To develop suitable prompts for ChatGPT to generate optimal clinical vignettes, we adopted the following steps:
Step 1. Selection of organ systems and related drugs: We selected one medication from each of the following 10 organ systems (corresponding drug in parentheses): hematology/lymphatics (warfarin), neurology (norepinephrine), cardiovascular (metoprolol), respiratory (albuterol), renal (lisinopril), gastrointestinal (bisacodyl), endocrine (metformin), reproductive (norethindrone), musculoskeletal (cyclobenzaprine), and behavioral medicine (trazodone).
Step 2. Prompt engineering: We conducted prompt engineering to formulate questions that align with NBME standards using ‘ChatGPT best practices’. This step involved an iterative process following these stages:
A pharmacology expert provided 12 NBME-style questions as examples.
IT experts developed the initial prompt based on these examples and generated 12 questions.
Refinement for Complexity and Clinical Relevance: The prompts were iteratively refined to enhance complexity and ensure alignment with NBME standards through the following steps:
- (A) Iterative refinement: Initial broad prompts were progressively adjusted, emphasizing clinical scenarios and reducing basic recall. For example, instead of asking “What is the primary side effect of metformin?”, the prompt would specify a more detailed clinical vignette, such as: “Create a scenario involving a diabetic patient prescribed metformin, focusing on a less common but clinically significant adverse effect”.
- (B) Use of clinical variables: Prompts were further developed to incorporate variables like patient history, comorbidities, and lab results. This helped generate more nuanced questions that required deeper analysis, mimicking the integrated reasoning expected in NBME Step exams.
- (C) Expert feedback integration: Ongoing expert reviews provided feedback on medical accuracy, clinical relevance, and adherence to NBME guidelines, such as avoiding simple distractors and ensuring that the questions tested applied medical knowledge [11].
- (D) Adherence to item-writing guidelines: Prompts were fine-tuned based on the NBME Item-Writing Guide recommendations, such as avoiding negatively phrased lead-ins (e.g., “except”) and ensuring that questions could be answered without seeing the multiple-choice options [11].
Validation and iterative refinement through expert review: After multiple iterations, a standardized prompt template was developed and validated by pharmacology experts, ensuring that it consistently generated questions aligned with NBME standards. In addition, the pharmacology experts continuously reviewed the generated questions to identify specific areas for improvement, such as advising the AI to focus on particular drug-related topics or ensuring greater clinical relevance in the scenarios.
The IT expert revised the prompt and generated the subsequent series of questions.
After 12 iterations, the pharmacology experts confirmed the quality of the generated questions, and the refined prompt was adopted as the finalized standard. The standard prompt reads as follows (with parentheses indicating the topic and medication to be inserted): ‘Can you provide an expert-level sample NBME question about the (mechanism of action, indication, OR side effect) of (drug) and furnish the correct answer along with a detailed explanation’?
Step 3. Using the above prompt, we asked ChatGPT to generate a question on either the mechanism of action, indications, or side effects of each of these 10 medications; for each medication, the topic was randomly selected.
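For illustration only, the question-generation workflow in Steps 1–3 could be scripted rather than run interactively; the minimal sketch below assumes the OpenAI Python client (openai ≥ 1.0), an API key in the OPENAI_API_KEY environment variable, and the “gpt-3.5-turbo” model name as a stand-in for ChatGPT-3.5, none of which is part of the study protocol itself.

```python
# Minimal sketch: one NBME-style question per drug using the finalized standard prompt.
# Assumptions (not part of the study): OpenAI Python client >= 1.0, OPENAI_API_KEY set,
# and "gpt-3.5-turbo" as the model name standing in for ChatGPT-3.5.
import random
from openai import OpenAI

client = OpenAI()

DRUGS = ["warfarin", "norepinephrine", "metoprolol", "albuterol", "lisinopril",
         "bisacodyl", "metformin", "norethindrone", "cyclobenzaprine", "trazodone"]
TOPICS = ["mechanism of action", "indication", "side effect"]

PROMPT_TEMPLATE = (
    "Can you provide an expert-level sample NBME question about the {topic} "
    "of {drug} and furnish the correct answer along with a detailed explanation?"
)

for drug in DRUGS:
    topic = random.choice(TOPICS)  # one topic randomly selected per medication (Step 3)
    prompt = PROMPT_TEMPLATE.format(topic=topic, drug=drug)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {drug} ({topic}) ---")
    print(response.choices[0].message.content)
```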
Step 4. The questions and answers generated by ChatGPT were assessed using a grading system based on the NBME Item-Writing Guide, which provides specific criteria for writing the clinical vignettes used in USMLE Step exam questions [11]. We adapted the following 16 criteria from the NBME Item-Writing Guide [11] (a minimal scoring sketch follows the list):
1. Give the correct answer with a medically accurate explanation.
2. The question stem avoids describing the class of the drug in conjunction with the name of the drug.
3. The question applies foundational medical knowledge, rather than basic recall.
4. The clinical vignette is in complete sentence format.
5. The question can be answered without looking at the multiple-choice options, referred to as the “cover-the-option” rule.
6. Avoid long or complex multiple-choice options.
7. Avoid frequency terms within the clinical vignette, such as “often” or “usually”; instead, use “most likely” or “best indicated”.
8. Avoid “none of the above” in the answer choices.
9. Avoid nonparallel, inconsistent answer choices, ensuring all follow the same format and structure.
10. The clinical vignette avoids negatively phrased lead-ins, such as “except”.
11. The clinical vignette avoids grammatical cues that can lead the examinee to the correct answer, such as “an” at the end of the question stem, which would eliminate answer choices that begin with consonants.
12. Avoid grouped or collectively exhaustive answer choices, for instance, “a decrease in X”, “an increase in X”, and “no change in X”.
13. Avoid absolute terms, such as “always” and “never”, in the answer choices.
14. Avoid having the correct answer choice stand out, for example, when one answer choice is longer and more in-depth than the others.
15. Avoid repeated words or phrases in the clinical vignette that cue the correct answer choice.
16. Create a balanced distribution of key terms in the answer choices, ensuring none stand out as too similar or too different from the others [11].
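To make the grading system concrete, the sketch below encodes the 16 adapted criteria as a checklist and computes the 0–16 quality score used in Stage 2. The abbreviated criterion labels and the scoring function are illustrative only, not the instrument the experts used.

```python
# Illustrative checklist of the 16 adapted NBME Item-Writing Guide criteria.
# Each criterion is rated 0 (not met) or 1 (met); the quality score is the sum (0-16).
CRITERIA = {
    1: "Correct answer with medically accurate explanation",
    2: "Stem avoids drug class alongside drug name",
    3: "Applies foundational knowledge, not basic recall",
    4: "Vignette in complete sentences",
    5: "Answerable without options (cover-the-option rule)",
    6: "No long or complex options",
    7: "No frequency terms ('often', 'usually')",
    8: "No 'none of the above'",
    9: "Parallel, consistent options",
    10: "No negatively phrased lead-ins ('except')",
    11: "No grammatical cues",
    12: "No grouped or collectively exhaustive options",
    13: "No absolute terms ('always', 'never')",
    14: "Correct answer does not stand out",
    15: "No repeated words or phrases cueing the answer",
    16: "Balanced key terms across options",
}

def quality_score(ratings):
    """Sum the 0/1 ratings over all 16 criteria into a 0-16 quality score."""
    assert set(ratings) == set(CRITERIA), "ratings must cover all 16 criteria"
    assert all(v in (0, 1) for v in ratings.values()), "each rating must be 0 or 1"
    return sum(ratings.values())

# Example: a question meeting every criterion except 3 and 5 scores 14/16.
example = {k: 1 for k in CRITERIA}
example[3] = example[5] = 0
print(quality_score(example))  # -> 14
```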
Stage 2: Assessing the quality of ChatGPT-generated questions.
To evaluate the quality of the questions generated by ChatGPT, we employed the following steps:
Step 1. Expert panel review: All ChatGPT-generated questions and answers were evaluated by a panel of two pharmacology education experts with significant teaching experience at medical schools in the USA. These experts were well-versed in the NBME Item-Writing Guide and had experience with NBME-style exam questions in pharmacology [11].
Step 2. Calibration process: Two randomly selected ChatGPT-generated questions were presented to the experts for evaluation. The experts independently applied the 16 criteria above, assigning a score of 0 or 1 for each criterion and yielding a quality score ranging from 0 to 16 for each question. A higher score indicated better adherence to the NBME Item-Writing Guide [11]. The experts then convened to discuss and reconcile any discrepancies in their evaluations, arriving at a consensus.
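A simple way to quantify how closely the two experts’ independent ratings matched before reconciliation is raw percent agreement across the 16 binary criteria; the sketch below is illustrative, uses hypothetical ratings, and is not an analysis reported by the study.

```python
# Hypothetical calibration check: percent agreement between the two experts'
# independent 0/1 ratings on the 16 criteria for one calibration question.
expert1 = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # example values only
expert2 = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # example values only

agreements = sum(a == b for a, b in zip(expert1, expert2))
print(f"Agreement: {agreements}/16 criteria ({agreements / 16:.0%})")
```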
Step 3. Independent evaluation: Following the calibration process, the experts individually and independently assessed the remaining ChatGPT-generated questions and provided their respective scores.
Estimated question difficulty: The expert panel also rated the difficulty of each question by answering the question ‘On a scale from very easy to very difficult, how would you rate the difficulty of this question for students?’, with responses ranging from 1 (very easy) to 5 (very difficult).
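Assuming each expert’s per-question quality scores and difficulty ratings are recorded as lists, the sketch below shows how they could be aggregated into summary figures such as the mean quality score out of 16 and the share of questions rated easy; all values are hypothetical placeholders, not the study data.

```python
# Hypothetical aggregation of the two experts' scores for the 10 questions.
from statistics import mean

quality_e1 = [14, 15, 13, 16, 14, 15, 12, 14, 13, 15]  # expert 1, example values only
quality_e2 = [15, 14, 14, 15, 13, 16, 13, 14, 14, 15]  # expert 2, example values only

per_question_mean = [mean(pair) for pair in zip(quality_e1, quality_e2)]
overall_mean = mean(per_question_mean)
print(f"Mean quality score: {overall_mean:.1f}/16 ({overall_mean / 16:.1%})")

difficulty_e1 = [2, 1, 3, 2, 2, 1, 4, 2, 3, 2]  # 1 = very easy ... 5 = very difficult
easy_or_very_easy = sum(r <= 2 for r in difficulty_e1)
print(f"Expert 1 rated {easy_or_very_easy}/10 questions as very easy or easy")
```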
4. Discussion
At face value, ChatGPT-3.5 produced remarkable results against the NBME Item-Writing Guide criteria [11], given the average score of 14.1 out of 16 (88.1%) between the two experts. However, a thorough assessment of the questions and answers reveals that significant improvements are needed. When the frequency of fulfillment of each criterion was evaluated, some criteria were missed more frequently than others. According to both experts 1 and 2, criterion 1, which assessed medical accuracy, was fulfilled by 8 out of 10 (80%) of the ChatGPT-generated questions; thus, two questions contained medically inaccurate information. Given the importance of questions and answers containing medically accurate information, it is vital to ensure this criterion is always fulfilled. ChatGPT-3.5-generated questions that contain medically inaccurate information can be especially problematic when students prepare for USMLE exams. For example, when asked to evaluate the side effects of warfarin, ChatGPT-3.5 stated that thrombocytopenia is a side effect of warfarin and deemed that the correct answer, when thrombocytopenia is not a side effect of warfarin [21]. This can lead students to unknowingly make incorrect associations between drugs and adverse effects; for example, abciximab, rather than warfarin, actually causes thrombocytopenia [21]. Additionally, when the indications for metoprolol were assessed, ChatGPT-3.5 indicated that a patient with a history of myocardial infarction and heart failure should be prescribed the beta-blocker for chronic stable angina rather than hypertension. Although this indication can be true, without the patient’s vitals, there is not enough information to decide between the provided answer choices of hypertension and chronic stable angina. Thus, this question is medically inaccurate.
The ability of ChatGPT to provide factually inaccurate information is potentially due to a phenomenon coined “hallucinations”, defined as the generation of probable-sounding statements that include inaccurate information. These “hallucinations” may arise from the multitude of datasets the model is trained on and the mixing up of facts [22,23]. ChatGPT-3.5 has the potential to provide this medical information but must consistently provide accurate medical information when generating NBME-standard clinical vignettes, answer choices, and explanations for each answer choice. Thus, an AI system dedicated to the healthcare industry, capable of providing medical diagnoses as well as disseminating accurate medical information, could be highly beneficial for medical students, physicians, and patients alike [24].
According to expert 1, criterion 3 was fulfilled by 3 out of 10 (30%) ChatGPT-3.5-generated questions, and according to expert 2, by 4 out of 10 (40%). This was the most frequently missed of the 16 criteria. This criterion assessed the ability of ChatGPT-3.5 to apply foundational medical knowledge, rather than basic recall. Another frequently missed criterion was criterion 5, the ability to answer the question without looking at the multiple-choice options. According to expert 1, 6 out of 10 (60%) questions fulfilled criterion 5, and according to expert 2, 9 out of 10 (90%) did. In addition to assessing basic recall, most questions could be answered without the multiple-choice options, which makes them “pseudo-vignettes”. A pseudo-vignette is defined as a vignette in which the clinical information provided is not required to answer the question, which instead tests basic recall of material [25]. For example, the following questions were posed after a clinical vignette: “Norepinephrine acts on which of the following receptors to exerts its vasoconstrictive effects?”, “What is the primary mechanism of action of albuterol?”, “What is the primary mechanism of action of metformin in the treatment of diabetes?”, and “Which of the following side effects is most likely associated with the use of norethindrone?”. These questions can clearly be answered without any other clinical information and test basic recall rather than the ability to apply medical knowledge. Given how infrequently criterion 3 was followed, we further assessed the complexity and difficulty of the generated questions and answers. According to expert 1, 6 out of 10 (60%) questions were graded as either very simple or simple, and according to expert 2, 7 out of 10 (70%). When assessing difficulty, expert 1 rated 6 out of 10 (60%) questions as either very easy or easy, and expert 2 rated 7 out of 10 (70%) as such. This allows us to conclude that the majority of questions are too simple and too easy for USMLE Step preparation.
According to expert 1, criterion 11 was fulfilled by 9 out of 10 (90%) ChatGPT-3.5-generated questions, and according to expert 2, by 7 out of 10 (70%). Questions missing this criterion provided grammatical cues that prompted examinees to eliminate any answer choices that did not fit the cues. An example of this was seen when ChatGPT was asked to provide an NBME question evaluating the indication for cyclobenzaprine, where the patient in the clinical vignette presented with both acute lower back pain and muscle spasms. Stating two different clinical problems to be addressed by the prescribed medication cued the examinee, leaving “acute musculoskeletal pain with muscle spasms” as the only viable, and correct, option. Subtle cues like this, even when not explicit, may still provide guidance that does not evaluate the examinee’s proficiency in the material, reducing the quality of the question. According to expert 1, criterion 12 was fulfilled by 8 out of 10 (80%) ChatGPT-3.5-generated questions, and according to expert 2, by 10 out of 10 (100%). Providing grouped or exhaustive answer choices can cause the examinee to immediately negate all other options and consider only the grouped answer choices. The example provided for criterion 11 also violated criterion 12, as it offered both “acute exacerbation of chronic obstructive pulmonary disease (COPD)” and “acute musculoskeletal pain with muscle spasms” as answer choices, with the latter being correct. With these two answer choices grouped together as acute conditions, the other answer choices could quickly be eliminated. To provide NBME-standard questions that accurately evaluate the examinee’s knowledge of the given pharmacological concepts, ChatGPT must be able to provide at least five answer choices that can each be evaluated thoroughly to test the examinee’s understanding of the material.
According to expert 1, criterion 15 was fulfilled by 7 out of 10 (70%) ChatGPT-3.5-generated questions, and according to expert 2, by 9 out of 10 (90%). Repeating words or phrases across the clinical vignette and the multiple-choice options can cue the examinee to the correct answer without accurately reflecting their proficiency in the material. For example, when asked to generate a question about the side effects of norethindrone, ChatGPT wrote within the clinical vignette that the patient reported breast tenderness and then provided breast tenderness as the correct answer choice. Providing the same keywords in the clinical vignette and an answer choice negates the question’s ability to evaluate the examinee’s understanding of norethindrone and only evaluates their ability to dissect the clinical vignette. Lastly, according to both expert 1 and expert 2, criterion 16 was fulfilled by 8 out of 10 (80%) ChatGPT-3.5-generated questions. Criterion 16 assigns importance to keeping key terms balanced when they are used within the answer choices. When asked to generate a question about the adverse effects of warfarin, ChatGPT provided both thrombocytopenia and heparin-induced thrombocytopenia as choices, which could lead to all other options being immediately negated. If key terms are used in the answer choices, they should be balanced across all options so that the examinee cannot eliminate certain choices simply because they lack the keywords.
In terms of estimated question difficulties, the experts’ ratings diverged. This could be due to limited calibration or the fact that it is very challenging to estimate the difficulty of a question for students.
While discussing the capabilities of AI and the improvements needed to make it a reliable source for NBME question generation, it is important to consider that AI is constantly evolving, updating, and improving its knowledge base [4]. Notably, a recent ChatGPT update allows the user to request an alternative output if the initial output is unsatisfactory. Lastly, AI output can be highly variable, and the same input may not yield the same output every time.
Future studies should employ a larger expert panel and implement a more comprehensive calibration exercise with more diverse and context-rich examples to assess a greater number of questions. Additionally, the criteria established for evaluating questions generated by ChatGPT should undergo scrutiny by an expert panel to identify any gaps and potentially incorporate additional criteria.
Furthermore, future research should investigate how OpenAI’s newer models, ChatGPT-4 and ChatGPT-4 Turbo, perform compared with ChatGPT-3.5 in generating questions that follow the NBME guidelines [11]. Given their more advanced processing, such comparisons can give insight into the updated capabilities of AI software. Future research should also compare the capabilities of other AI models, such as Google’s Gemini and Meta’s Llama 3, with those of ChatGPT in creating NBME-standardized questions. Comparing the similarities and differences between the various AI models can provide more insight into the advantages of each model in medical education.
The results of our study suggest that ChatGPT-3.5 has the potential to generate NBME-style questions, but there are significant limitations related to the depth of knowledge tested. These findings align with previous research, which highlighted that while AI models can emulate human-like question generation, their depth of clinical understanding remains limited [9]. This observation can be explained through the lens of cognitive load theory, which suggests that effective learning and assessment require engaging learners in complex, higher-order thinking tasks [26,27,28]. Our results show that ChatGPT-3.5 tends to favor questions that test recall rather than application or analysis, reflecting its reliance on pattern recognition over true clinical reasoning.
Furthermore, our study’s findings raise important questions about the suitability of AI for medical education. According to constructivist learning theory, effective education should not only convey factual information but also facilitate the application of knowledge in real-world scenarios [29,30]. The tendency of ChatGPT-3.5 to generate “pseudo-vignettes”, as identified in our analysis, suggests a gap between AI-generated content and the needs of the complex, case-based learning environments advocated by constructivist principles.
These results have practical implications for medical educators, indicating that while AI tools like ChatGPT-3.5 can support question generation, they should be complemented with expert oversight to ensure the development of questions that engage critical thinking. Future research should explore ways to enhance the model’s capacity to generate questions that align with theories of cognitive complexity, such as Bloom’s Taxonomy, which emphasizes the importance of creating questions that test analysis, synthesis, and evaluation. By doing so, we can better integrate AI-generated content into medical education in a manner that promotes deeper learning and prepares students for clinical practice.
Proposed Evaluation Framework for Future Implementation of AI in Exam Generation.
Given the results of our study, which identified several shortcomings of ChatGPT-3.5, including hallucinations and the tendency to produce oversimplified questions, it is crucial to adopt a comprehensive, multi-stage evaluation framework for the future implementation of AI in exam question generation. This framework aims to enhance the reliability, accuracy, and educational value of AI-generated content, particularly for high-stakes assessments like the USMLE. We suggest the following steps:
- Initial automated screening: Implement AI-based tools such as the Google Knowledge Graph API, the Wolfram Alpha API, or MedGPT to perform an initial quality check of the generated content. This stage would identify potential inaccuracies, inconsistencies, or superficial questions that do not meet established standards, flagging them for further review.
- Expert review process: Involve a panel of three to five domain-specific experts, such as pharmacologists and medical educators, to thoroughly assess the generated questions. This expert review would evaluate the medical accuracy, clinical relevance, level appropriateness, and adherence to item-writing standards (such as the NBME guidelines), ensuring that the questions align with the depth and complexity required for medical exams [11].
- Integration of specialized medical knowledge bases: Use external databases such as UpToDate, PubMed, or other clinical guidelines to cross-reference the generated content for factual accuracy. Incorporating these knowledge bases would help validate medical accuracy and ensure that the AI’s knowledge aligns with the latest developments in current clinical practice.
- Iterative feedback loop: Create an iterative process in which expert feedback is used to refine the AI-generated questions. This would involve adjusting prompts, question structures, or explanations, followed by re-evaluation until the content reaches a satisfactory level of quality and complexity.
- Comparison using bootstrapping methods: Recent studies have suggested applying bootstrapping techniques to evaluate exam questions [31]. Bootstrapping could be used to compare the psychometric characteristics of AI-generated questions with those of instructor-written ones, analyzing factors such as the size of confidence intervals, reliability, and question difficulty (see the sketch after this list).
- Pilot testing with medical students: Conduct small-scale pilot testing of AI-generated questions with medical students. This step would assess the perceived difficulty and educational value of the questions under real exam conditions, providing direct feedback on how well the questions prepare students for actual exams like the USMLE.
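As a sketch of how the bootstrapping comparison proposed above could work, the code below resamples hypothetical item difficulties (the proportion of students answering each item correctly) for AI-generated and instructor-written questions and reports a percentile confidence interval for the difference in means. The data, the sample sizes, and the choice of item difficulty as the compared statistic are all assumptions for illustration.

```python
# Illustrative bootstrap comparison of AI-generated vs. instructor-written items.
# Item "difficulty" here is the proportion of students answering the item correctly;
# all values are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=0)

ai_items = np.array([0.82, 0.91, 0.88, 0.79, 0.93, 0.85, 0.90, 0.76, 0.84, 0.89])
instructor_items = np.array([0.71, 0.65, 0.78, 0.69, 0.74, 0.62, 0.80, 0.73, 0.68, 0.75])

n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    ai_sample = rng.choice(ai_items, size=ai_items.size, replace=True)
    instr_sample = rng.choice(instructor_items, size=instructor_items.size, replace=True)
    diffs[i] = ai_sample.mean() - instr_sample.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Mean difficulty difference (AI - instructor): {diffs.mean():.3f}")
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```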
By adopting this multi-stage evaluation framework, future AI-generated questions can be fine-tuned to not only meet NBME standards but also foster critical thinking and higher-order reasoning among medical students. This approach addresses the limitations of current AI models while maximizing their potential as an educational tool.
However, a key consideration for future research is the emphasis on generating questions that test higher-level medical knowledge rather than relying predominantly on basic recall. While our study has demonstrated ChatGPT-3.5’s ability to produce questions that meet many NBME criteria, a notable area for improvement is the focus on deeper analytical and reasoning skills. Higher-order questions challenge learners to apply their knowledge in complex clinical scenarios, which is essential for the development of critical thinking skills in medical education. Future research should explore prompt adjustments and evaluation processes that can guide AI models to create more sophisticated clinical vignettes. This would help ensure that the generated questions not only align with NBME standards but also effectively test learners’ ability to synthesize and apply knowledge in a clinical context.
5. Conclusions
Based on the scores received, the ChatGPT-3.5-generated questions seem to satisfy the NBME guidelines [11]. However, a thorough review shows that the questions are not on par with other Step exam preparation resources. ChatGPT-3.5 produced oversimplified questions and overused repeated words in the clinical vignettes and answer choices, allowing an individual to answer the questions without any medical knowledge. Our data support the need for AI software that is specifically trained on medical and scientific information.
The high fulfillment rates for criteria 2, 6–10, and 13 suggest that these criteria likely involve straightforward and clear rules. It seems that ChatGPT can consistently follow and apply these types of rules. For instance, avoiding complex multiple-choice options (criterion 6) or using consistent answer-choice formats (criterion 9) are structural elements that AI can reliably adhere to. This finding highlights the strength of ChatGPT in following well-defined, rule-based guidelines that do not require deep contextual understanding. In a related vein, the lower fulfillment rate of 4/10 for criterion 3 (applying foundational knowledge) suggests that ChatGPT struggled with applying foundational medical knowledge rather than simple recall. This criterion requires a deeper understanding of medical concepts and their application in various contexts, which is challenging for an AI model primarily trained on text data without practical experience. This finding indicates a significant gap in ChatGPT’s ability to create questions that go beyond basic recall and test higher-order thinking skills.
With regard to criterion 1 (medical accuracy), the 8/10 fulfillment rate shows that there were instances where ChatGPT provided medically inaccurate information. This could be due to “hallucinations”, where the AI generates plausible-sounding but incorrect information. Ensuring medical accuracy is critical but challenging due to the vast and complex nature of medical knowledge. The fulfillment of criterion 5 (the cover-the-option rule) was moderate (7.5/10 on average). The “cover-the-option” rule requires questions to be answerable without seeing the options, which tests the examinee’s true understanding of the material. This criterion was harder for ChatGPT to consistently meet, likely due to its reliance on generating plausible distractors rather than deeply understanding the question context.
The fulfillment rates of 8/10 and 7/10 for criteria 11 and 15 (avoiding grammatical and repetitive cues, respectively) highlight occasional lapses where questions contained grammatical or repetitive cues that could inadvertently guide examinees to the correct answer, pointing to the need for more sophisticated language-processing capabilities to avoid such subtle errors.
The high scores (15/16) in neurology, respiratory, and gastrointestinal areas suggest that ChatGPT’s training data and its ability to generate questions for these specialties are particularly robust. These fields may have clearer, more standardized guidelines and well-documented clinical scenarios that the AI can emulate effectively.
The relatively lower scores in the reproductive and musculoskeletal areas (12.5/16 and 13/16) may reflect more complex or nuanced clinical scenarios that are harder for the AI to accurately capture. Additionally, reproductive health often involves sensitive and varied patient factors, which may not be as well represented in the AI’s training data.