1. Introduction
Chat Generative Pre-trained Transformer (ChatGPT-3.5) is an artificial intelligence (AI) software developed by OpenAI [1]. ChatGPT is trained on various datasets and modeled to generate human-like responses [2]. ChatGPT was trained on data available up to September 2021, limiting its knowledge base to events before that date [1,3]. ChatGPT is continually being updated and improved by OpenAI [4]. Our study focused on ChatGPT-3.5, the free version of ChatGPT, trained on 175 billion parameters [5]. Within the first couple of months after its launch in November 2022, ChatGPT gained approximately 100 million users [6]. Whereas other online platforms such as Netflix, Airbnb, Twitter, Facebook, and Instagram took several months to years to reach 1 million users, ChatGPT reached 1 million users in 5 days [7]. As of June 2023, ChatGPT had received over 1.6 billion visits, averaging 55 million visitors daily [7].
The National Board of Medical Examiners (NBME) owns the United States Medical Licensing Examination (USMLE) and writes the Step exams that medical students take during their medical education. The USMLE Step 1 exam comprises a number of disciplines, including pathology, physiology, pharmacology, biochemistry and nutrition, microbiology, immunology, gross anatomy and embryology, histology and cell biology, behavioral sciences, and genetics [8]. Pharmacology makes up approximately 15–22 percent of the questions on USMLE Step 1 [8].
Previous studies have assessed the performance of ChatGPT on the USMLE Step exams using the NBME and AMBOSS datasets [9]. ChatGPT answered 64.4 percent of NBME Step 1 questions, 57.8 percent of NBME Step 2 questions, 44 percent of AMBOSS Step 1 questions, and 42 percent of AMBOSS Step 2 questions correctly [9]. The ability of ChatGPT to achieve a passing grade of over 60 percent on the NBME Step 1 exam prompted our interest in assessing its ability to generate NBME-style pharmacology multiple-choice questions that meet the NBME Item-Writing Guidelines [10,11].
Previous research has assessed the ability of ChatGPT to take an Ophthalmic Knowledge Assessment Program (OKAP) exam, which is created for ophthalmology residents [12]. The study found that ChatGPT performed at the level of a first-year resident, providing insight into its capabilities and knowledge base.
ChatGPT offers promising benefits in scientific research, healthcare practice, and healthcare education [13]. In scientific research, it has been found to be a useful tool for academic research and writing [14]. In healthcare practice, ChatGPT shows the potential to assist in documentation, disease risk prediction, improving health literacy, and enhancing diagnostics [15,16].
In healthcare education, ChatGPT has demonstrated the ability to pass exams such as the USMLE Step exams and ophthalmology residency examinations [12,13]. ChatGPT shows promising potential in healthcare and can be a useful tool for medical education [17]. Additionally, ChatGPT has been found to be beneficial for the rapid generation of clinical vignettes, which could help reduce costs for healthcare students [18].
However, there are concerns that ChatGPT can provide inaccurate information and references [19]. With the plethora of resources available during Step preparation, the process can be both time-consuming and overwhelming for students. If ChatGPT-generated questions are medically accurate and abide by the NBME guidelines, they can be a valuable resource for medical students [11].
This study aimed to evaluate ChatGPT-3.5’s capability to generate NBME-style pharmacology multiple-choice questions that adhere to the NBME Item-Writing Guidelines [11]. Our objective was to generate recommendations for enhancing and fine-tuning ChatGPT-generated NBME questions for medical education. If the questions and answers generated by ChatGPT align with the standards outlined in the NBME Item-Writing Guide, ChatGPT has the potential to serve as a valuable resource for medical students seeking free NBME practice questions for the USMLE Step exams [11].
2. Materials and Methods
Stage 1: Defining a suitable prompt for ChatGPT to generate NBME-style pharmacology multiple-choice questions, adhering to the NBME guidelines.
OpenAI provides “GPT best practices” that assist in the use of ChatGPT. Some of the strategies include writing clear instructions by providing details, specifying tasks, providing examples, and specifying the complexity of the response [20]. Another strategy is asking ChatGPT to adopt a persona [20].
To develop suitable prompts for ChatGPT to generate optimal clinical vignettes, we adopted the following steps:
Step 1. Selection of organ systems and related drugs: We selected one medication from each of the following 10 organ systems (corresponding drug in parentheses): hematology/lymphatics (warfarin), neurology (norepinephrine), cardiovascular (metoprolol), respiratory (albuterol), renal (lisinopril), gastrointestinal (bisacodyl), endocrine (metformin), reproductive (norethindrone), musculoskeletal (cyclobenzaprine), and behavioral medicine (trazodone).
Step 2. Prompt engineering: We conducted prompt engineering to formulate questions that align with NBME standards using ‘ChatGPT best practices’. This step involved an iterative process following these stages:
A pharmacology expert provided 12 NBME-style questions as examples.
IT experts developed the initial prompt based on these examples and generated 12 questions.
Refinement for Complexity and Clinical Relevance: The prompts were iteratively refined to enhance complexity and ensure alignment with NBME standards through the following steps:
- (A) Iterative refinement: Initial broad prompts were progressively adjusted, emphasizing clinical scenarios and reducing basic recall. For example, instead of asking “What is the primary side effect of metformin?”, the prompt would specify a more detailed clinical vignette, such as: “Create a scenario involving a diabetic patient prescribed metformin, focusing on a less common but clinically significant adverse effect”.
- (B) Use of clinical variables: Prompts were further developed to incorporate variables like patient history, comorbidities, and lab results. This helped generate more nuanced questions that required deeper analysis, mimicking the integrated reasoning expected in NBME Step exams.
- (C) Expert feedback integration: Ongoing expert reviews provided feedback on medical accuracy, clinical relevance, and adherence to NBME guidelines, such as avoiding simple distractors and ensuring that the questions tested applied medical knowledge [11].
- (D) Adherence to item-writing guidelines: Prompts were fine-tuned based on the NBME Item-Writing Guide recommendations, such as avoiding negatively phrased lead-ins (e.g., “except”) and ensuring that questions could be answered without seeing the multiple-choice options [11].
Validation and iterative refinement through expert review: After multiple iterations, a standardized prompt template was developed and validated by pharmacology experts, ensuring that it consistently generated questions aligned with NBME standards. In addition, the pharmacology experts continuously reviewed the generated questions to identify specific areas for improvement, such as advising the AI to focus on particular drug-related topics or ensuring greater clinical relevance in the scenarios.
The IT expert revised the prompt and generated the subsequent series of questions.
After 12 iterations, the pharmacology experts confirmed the quality of the generated questions, and the refined prompt was adopted as the finalized standard. The standard prompt reads as follows (with parentheses indicating the topic and medication to be inserted): ‘Can you provide an expert-level sample NBME question about the (mechanism of action, indication, OR side effect) of (drug) and furnish the correct answer along with a detailed explanation’?
Step 3. Using the above prompt, we asked ChatGPT to generate a question on either the mechanism of action, indications, or side effects of each of these 10 medications; for each medication, the topic was randomly selected.
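For illustration only, the question-generation workflow in Steps 1–3 could be scripted rather than run interactively; the minimal sketch below assumes the OpenAI Python client (openai ≥ 1.0), an API key in the OPENAI_API_KEY environment variable, and the “gpt-3.5-turbo” model name as a stand-in for ChatGPT-3.5, none of which is part of the study protocol itself.

```python
# Minimal sketch: one NBME-style question per drug using the finalized standard prompt.
# Assumptions (not part of the study): OpenAI Python client >= 1.0, OPENAI_API_KEY set,
# and "gpt-3.5-turbo" as the model name standing in for ChatGPT-3.5.
import random
from openai import OpenAI

client = OpenAI()

DRUGS = ["warfarin", "norepinephrine", "metoprolol", "albuterol", "lisinopril",
         "bisacodyl", "metformin", "norethindrone", "cyclobenzaprine", "trazodone"]
TOPICS = ["mechanism of action", "indication", "side effect"]

PROMPT_TEMPLATE = (
    "Can you provide an expert-level sample NBME question about the {topic} "
    "of {drug} and furnish the correct answer along with a detailed explanation?"
)

for drug in DRUGS:
    topic = random.choice(TOPICS)  # one topic randomly selected per medication (Step 3)
    prompt = PROMPT_TEMPLATE.format(topic=topic, drug=drug)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {drug} ({topic}) ---")
    print(response.choices[0].message.content)
```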
Step 4. The questions and answers generated by ChatGPT were assessed using a grading system based on the NBME Item-Writing Guide, which provides specific criteria for writing the clinical vignettes used in USMLE Step exam questions [11]. We adapted the following 16 criteria from the NBME Item-Writing Guide [11] (a minimal scoring sketch follows the list):
1. Give the correct answer with a medically accurate explanation.
2. The question stem avoids describing the class of the drug in conjunction with the name of the drug.
3. The question applies foundational medical knowledge, rather than basic recall.
4. The clinical vignette is in complete sentence format.
5. The question can be answered without looking at the multiple-choice options, referred to as the “cover-the-option” rule.
6. Avoid long or complex multiple-choice options.
7. Avoid frequency terms within the clinical vignette, such as “often” or “usually”; instead, use “most likely” or “best indicated”.
8. Avoid “none of the above” in the answer choices.
9. Avoid nonparallel, inconsistent answer choices, ensuring all follow the same format and structure.
10. The clinical vignette avoids negatively phrased lead-ins, such as “except”.
11. The clinical vignette avoids grammatical cues that can lead the examinee to the correct answer, such as “an” at the end of the question stem, which would eliminate answer choices that begin with consonants.
12. Avoid grouped or collectively exhaustive answer choices, for instance, “a decrease in X”, “an increase in X”, and “no change in X”.
13. Avoid absolute terms, such as “always” and “never”, in the answer choices.
14. Avoid having the correct answer choice stand out, for example, when one answer choice is longer and more in-depth than the others.
15. Avoid repeated words or phrases in the clinical vignette that cue the correct answer choice.
16. Create a balanced distribution of key terms in the answer choices, ensuring none stand out as too similar or too different from the others [11].
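To make the grading system concrete, the sketch below encodes the 16 adapted criteria as a checklist and computes the 0–16 quality score used in Stage 2. The abbreviated criterion labels and the scoring function are illustrative only, not the instrument the experts used.

```python
# Illustrative checklist of the 16 adapted NBME Item-Writing Guide criteria.
# Each criterion is rated 0 (not met) or 1 (met); the quality score is the sum (0-16).
CRITERIA = {
    1: "Correct answer with medically accurate explanation",
    2: "Stem avoids drug class alongside drug name",
    3: "Applies foundational knowledge, not basic recall",
    4: "Vignette in complete sentences",
    5: "Answerable without options (cover-the-option rule)",
    6: "No long or complex options",
    7: "No frequency terms ('often', 'usually')",
    8: "No 'none of the above'",
    9: "Parallel, consistent options",
    10: "No negatively phrased lead-ins ('except')",
    11: "No grammatical cues",
    12: "No grouped or collectively exhaustive options",
    13: "No absolute terms ('always', 'never')",
    14: "Correct answer does not stand out",
    15: "No repeated words or phrases cueing the answer",
    16: "Balanced key terms across options",
}

def quality_score(ratings):
    """Sum the 0/1 ratings over all 16 criteria into a 0-16 quality score."""
    assert set(ratings) == set(CRITERIA), "ratings must cover all 16 criteria"
    assert all(v in (0, 1) for v in ratings.values()), "each rating must be 0 or 1"
    return sum(ratings.values())

# Example: a question meeting every criterion except 3 and 5 scores 14/16.
example = {k: 1 for k in CRITERIA}
example[3] = example[5] = 0
print(quality_score(example))  # -> 14
```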
Stage 2: Assessing the quality of ChatGPT-generated questions.
To evaluate the quality of the questions generated by ChatGPT, we employed the following steps:
Step 1. Expert panel review: All ChatGPT-generated questions and answers were evaluated by a panel of two pharmacology education experts with significant teaching experience at medical schools in the USA. These experts were well-versed in the NBME Item-Writing Guide and had experience with NBME-style exam questions in pharmacology [11].
Step 2. Calibration process: Two randomly selected ChatGPT-generated questions were presented to the experts for evaluation. The experts independently applied the 16 criteria above, assigning a score of 0 or 1 for each criterion and yielding a quality score ranging from 0 to 16 for each question. A higher score indicated better adherence to the NBME Item-Writing Guide [11]. The experts then convened to discuss and reconcile any discrepancies in their evaluations, arriving at a consensus.
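A simple way to quantify how closely the two experts’ independent ratings matched before reconciliation is raw percent agreement across the 16 binary criteria; the sketch below is illustrative, uses hypothetical ratings, and is not an analysis reported by the study.

```python
# Hypothetical calibration check: percent agreement between the two experts'
# independent 0/1 ratings on the 16 criteria for one calibration question.
expert1 = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # example values only
expert2 = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # example values only

agreements = sum(a == b for a, b in zip(expert1, expert2))
print(f"Agreement: {agreements}/16 criteria ({agreements / 16:.0%})")
```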
Step 3. Independent evaluation: Following the calibration process, the experts individually and independently assessed the remaining ChatGPT-generated questions and provided their respective scores.
Estimated question difficulty: The expert panel also rated the difficulty of each question by answering the question ‘On a scale from very easy to very difficult, how would you rate the difficulty of this question for students?’, with responses ranging from 1 (very easy) to 5 (very difficult).
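Assuming each expert’s per-question quality scores and difficulty ratings are recorded as lists, the sketch below shows how they could be aggregated into summary figures such as the mean quality score out of 16 and the share of questions rated easy; all values are hypothetical placeholders, not the study data.

```python
# Hypothetical aggregation of the two experts' scores for the 10 questions.
from statistics import mean

quality_e1 = [14, 15, 13, 16, 14, 15, 12, 14, 13, 15]  # expert 1, example values only
quality_e2 = [15, 14, 14, 15, 13, 16, 13, 14, 14, 15]  # expert 2, example values only

per_question_mean = [mean(pair) for pair in zip(quality_e1, quality_e2)]
overall_mean = mean(per_question_mean)
print(f"Mean quality score: {overall_mean:.1f}/16 ({overall_mean / 16:.1%})")

difficulty_e1 = [2, 1, 3, 2, 2, 1, 4, 2, 3, 2]  # 1 = very easy ... 5 = very difficult
easy_or_very_easy = sum(r <= 2 for r in difficulty_e1)
print(f"Expert 1 rated {easy_or_very_easy}/10 questions as very easy or easy")
```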
4. Discussion
At face value, ChatGPT-3.5 produced remarkable results against the NBME Item-Writing Guide criteria [11], given the average score of 14.1 out of 16 (88.1%) between the two experts. However, a thorough assessment of the questions and answers reveals that significant improvements are needed. When the frequency of fulfillment of each criterion was evaluated, some criteria were missed more frequently than others. According to both experts 1 and 2, criterion 1, which assessed medical accuracy, was fulfilled by 8 out of 10 (80%) of the ChatGPT-generated questions; thus, two questions contained medically inaccurate information. Given the importance of questions and answers containing medically accurate information, it is vital to ensure this criterion is always fulfilled. ChatGPT-3.5-generated questions that contain medically inaccurate information can be especially problematic when students prepare for USMLE exams. For example, when asked to evaluate the side effects of warfarin, ChatGPT-3.5 stated that thrombocytopenia is a side effect of warfarin and deemed that the correct answer, when thrombocytopenia is not a side effect of warfarin [21]. This can lead students to unknowingly make incorrect associations between drugs and adverse effects; for example, abciximab, rather than warfarin, actually causes thrombocytopenia [21]. Additionally, when the indications for metoprolol were assessed, ChatGPT-3.5 indicated that a patient with a history of myocardial infarction and heart failure should be prescribed the beta-blocker for chronic stable angina rather than hypertension. Although this indication can be true, without the patient’s vitals, there is not enough information to decide between the provided answer choices of hypertension and chronic stable angina. Thus, this question is medically inaccurate.
The ability of ChatGPT to provide factually inaccurate information is potentially due to a phenomenon coined “hallucinations”, defined as the generation of probable-sounding statements that include inaccurate information. These “hallucinations” may arise from the multitude of datasets the model is trained on and the mixing up of facts [22,23]. ChatGPT-3.5 has the potential to provide this medical information but must consistently provide accurate medical information when generating NBME-standard clinical vignettes, answer choices, and explanations for each answer choice. Thus, an AI system dedicated to the healthcare industry, capable of providing medical diagnoses as well as disseminating accurate medical information, could be highly beneficial for medical students, physicians, and patients alike [24].
According to expert 1, criterion 3 was fulfilled by 3 out of 10 (30%) ChatGPT-3.5-generated questions, and according to expert 2, by 4 out of 10 (40%). This was the most frequently missed of the 16 criteria. This criterion assessed the ability of ChatGPT-3.5 to apply foundational medical knowledge, rather than basic recall. Another frequently missed criterion was criterion 5, the ability to answer the question without looking at the multiple-choice options. According to expert 1, 6 out of 10 (60%) questions fulfilled criterion 5, and according to expert 2, 9 out of 10 (90%) did. In addition to assessing basic recall, most questions could be answered without the multiple-choice options, which makes them “pseudo-vignettes”. A pseudo-vignette is defined as a vignette in which the clinical information provided is not required to answer the question, which instead tests basic recall of material [25]. For example, the following questions were posed after a clinical vignette: “Norepinephrine acts on which of the following receptors to exerts its vasoconstrictive effects?”, “What is the primary mechanism of action of albuterol?”, “What is the primary mechanism of action of metformin in the treatment of diabetes?”, and “Which of the following side effects is most likely associated with the use of norethindrone?”. These questions can clearly be answered without any other clinical information and test basic recall rather than the ability to apply medical knowledge. Given how infrequently criterion 3 was followed, we further assessed the complexity and difficulty of the generated questions and answers. According to expert 1, 6 out of 10 (60%) questions were graded as either very simple or simple, and according to expert 2, 7 out of 10 (70%). When assessing difficulty, expert 1 rated 6 out of 10 (60%) questions as either very easy or easy, and expert 2 rated 7 out of 10 (70%) as such. This allows us to conclude that the majority of questions are too simple and too easy for USMLE Step preparation.
According to expert 1, criterion 11 was fulfilled by 9 out of 10 (90%) ChatGPT-3.5-generated questions, and according to expert 2, by 7 out of 10 (70%). Questions missing this criterion provided grammatical cues that prompted examinees to eliminate any answer choices that did not fit the cues. An example of this was seen when ChatGPT was asked to provide an NBME question evaluating the indication for cyclobenzaprine, where the patient in the clinical vignette presented with both acute lower back pain and muscle spasms. Stating two different clinical problems to be addressed by the prescribed medication cued the examinee, leaving “acute musculoskeletal pain with muscle spasms” as the only viable, and correct, option. Subtle cues like this, even when not explicit, may still provide guidance that does not evaluate the examinee’s proficiency in the material, reducing the quality of the question. According to expert 1, criterion 12 was fulfilled by 8 out of 10 (80%) ChatGPT-3.5-generated questions, and according to expert 2, by 10 out of 10 (100%). Providing grouped or exhaustive answer choices can cause the examinee to immediately negate all other options and consider only the grouped answer choices. The example provided for criterion 11 also violated criterion 12, as it offered both “acute exacerbation of chronic obstructive pulmonary disease (COPD)” and “acute musculoskeletal pain with muscle spasms” as answer choices, with the latter being correct. With these two answer choices grouped together as acute conditions, the other answer choices could quickly be eliminated. To provide NBME-standard questions that accurately evaluate the examinee’s knowledge of the given pharmacological concepts, ChatGPT must be able to provide at least five answer choices that can each be evaluated thoroughly to test the examinee’s understanding of the material.
According to expert 1, criterion 15 was fulfilled by 7 out of 10 (70%) ChatGPT-3.5-generated questions, and according to expert 2, by 9 out of 10 (90%). Repeating words or phrases across the clinical vignette and the multiple-choice options can cue the examinee to the correct answer without accurately reflecting their proficiency in the material. For example, when asked to generate a question about the side effects of norethindrone, ChatGPT wrote within the clinical vignette that the patient reported breast tenderness and then provided breast tenderness as the correct answer choice. Providing the same keywords in the clinical vignette and an answer choice negates the question’s ability to evaluate the examinee’s understanding of norethindrone and only evaluates their ability to dissect the clinical vignette. Lastly, according to both expert 1 and expert 2, criterion 16 was fulfilled by 8 out of 10 (80%) ChatGPT-3.5-generated questions. Criterion 16 assigns importance to keeping key terms balanced when they are used within the answer choices. When asked to generate a question about the adverse effects of warfarin, ChatGPT provided both thrombocytopenia and heparin-induced thrombocytopenia as choices, which could lead to all other options being immediately negated. If key terms are used in the answer choices, they should be balanced across all options so that the examinee cannot eliminate certain choices simply because they lack the keywords.
In terms of estimated question difficulties, the experts’ ratings diverged. This could be due to limited calibration or the fact that it is very challenging to estimate the difficulty of a question for students.
While discussing the capabilities of AI and the improvements needed to make it a reliable source for NBME question generation, it is important to consider that AI is constantly evolving, updating, and improving its knowledge base [4]. Notably, a recent ChatGPT update allows the user to request an alternative output if the initial output is unsatisfactory. Lastly, AI output can be highly variable, and the same input may not yield the same output every time.
Future studies should employ a larger expert panel and implement a more comprehensive calibration exercise with more diverse and context-rich examples to assess a greater number of questions. Additionally, the criteria established for evaluating questions generated by ChatGPT should undergo scrutiny by an expert panel to identify any gaps and potentially incorporate additional criteria.
Furthermore, future research should investigate how OpenAI’s newer models, ChatGPT-4 and ChatGPT-4 Turbo, perform compared with ChatGPT-3.5 in generating questions that follow the NBME guidelines [11]. Given their more advanced processing, such comparisons can give insight into the updated capabilities of AI software. Future research should also compare the capabilities of other AI models, such as Google’s Gemini and Meta’s Llama 3, with those of ChatGPT in creating NBME-standardized questions. Comparing the similarities and differences between the various AI models can provide more insight into the advantages of each model in medical education.
The results of our study suggest that ChatGPT-3.5 has the potential to generate NBME-style questions, but there are significant limitations related to the depth of knowledge tested. These findings align with previous research, which highlighted that while AI models can emulate human-like question generation, their depth of clinical understanding remains limited [9]. This observation can be explained through the lens of cognitive load theory, which suggests that effective learning and assessment require engaging learners in complex, higher-order thinking tasks [26,27,28]. Our results show that ChatGPT-3.5 tends to favor questions that test recall rather than application or analysis, reflecting its reliance on pattern recognition over true clinical reasoning.
Furthermore, our study’s findings raise important questions about the suitability of AI for medical education. According to constructivist learning theory, effective education should not only convey factual information but also facilitate the application of knowledge in real-world scenarios [29,30]. The tendency of ChatGPT-3.5 to generate “pseudo-vignettes”, as identified in our analysis, suggests a gap between AI-generated content and the needs of the complex, case-based learning environments advocated by constructivist principles.
These results have practical implications for medical educators, indicating that while AI tools like ChatGPT-3.5 can support question generation, they should be complemented with expert oversight to ensure the development of questions that engage critical thinking. Future research should explore ways to enhance the model’s capacity to generate questions that align with theories of cognitive complexity, such as Bloom’s Taxonomy, which emphasizes the importance of creating questions that test analysis, synthesis, and evaluation. By doing so, we can better integrate AI-generated content into medical education in a manner that promotes deeper learning and prepares students for clinical practice.
Proposed Evaluation Framework for Future Implementation of AI in Exam Generation.
Given the results of our study, which identified several shortcomings of ChatGPT-3.5, including hallucinations and the tendency to produce oversimplified questions, it is crucial to adopt a comprehensive, multi-stage evaluation framework for the future implementation of AI in exam question generation. This framework aims to enhance the reliability, accuracy, and educational value of AI-generated content, particularly for high-stakes assessments like the USMLE. We suggest the following steps:
- Initial automated screening: Implement AI-based tools such as the Google Knowledge Graph API, the Wolfram Alpha API, or MedGPT to perform an initial quality check of the generated content. This stage would identify potential inaccuracies, inconsistencies, or superficial questions that do not meet established standards, flagging them for further review.
- Expert review process: Involve a panel of three to five domain-specific experts, such as pharmacologists and medical educators, to thoroughly assess the generated questions. This expert review would evaluate the medical accuracy, clinical relevance, level appropriateness, and adherence to item-writing standards (such as the NBME guidelines), ensuring that the questions align with the depth and complexity required for medical exams [11].
- Integration of specialized medical knowledge bases: Use external databases such as UpToDate, PubMed, or other clinical guidelines to cross-reference the generated content for factual accuracy. Incorporating these knowledge bases would help validate medical accuracy and ensure that the AI’s knowledge aligns with the latest developments in current clinical practice.
- Iterative feedback loop: Create an iterative process in which expert feedback is used to refine the AI-generated questions. This would involve adjusting prompts, question structures, or explanations, followed by re-evaluation until the content reaches a satisfactory level of quality and complexity.
- Comparison using bootstrapping methods: Recent studies have suggested applying bootstrapping techniques to evaluate exam questions [31]. Bootstrapping could be used to compare the psychometric characteristics of AI-generated questions with those of instructor-written ones, analyzing factors such as the size of confidence intervals, reliability, and question difficulty (see the sketch after this list).
- Pilot testing with medical students: Conduct small-scale pilot testing of AI-generated questions with medical students. This step would assess the perceived difficulty and educational value of the questions under real exam conditions, providing direct feedback on how well the questions prepare students for actual exams like the USMLE.
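As a sketch of how the bootstrapping comparison proposed above could work, the code below resamples hypothetical item difficulties (the proportion of students answering each item correctly) for AI-generated and instructor-written questions and reports a percentile confidence interval for the difference in means. The data, the sample sizes, and the choice of item difficulty as the compared statistic are all assumptions for illustration.

```python
# Illustrative bootstrap comparison of AI-generated vs. instructor-written items.
# Item "difficulty" here is the proportion of students answering the item correctly;
# all values are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=0)

ai_items = np.array([0.82, 0.91, 0.88, 0.79, 0.93, 0.85, 0.90, 0.76, 0.84, 0.89])
instructor_items = np.array([0.71, 0.65, 0.78, 0.69, 0.74, 0.62, 0.80, 0.73, 0.68, 0.75])

n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    ai_sample = rng.choice(ai_items, size=ai_items.size, replace=True)
    instr_sample = rng.choice(instructor_items, size=instructor_items.size, replace=True)
    diffs[i] = ai_sample.mean() - instr_sample.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"Mean difficulty difference (AI - instructor): {diffs.mean():.3f}")
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```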
By adopting this multi-stage evaluation framework, future AI-generated questions can be fine-tuned to not only meet NBME standards but also foster critical thinking and higher-order reasoning among medical students. This approach addresses the limitations of current AI models while maximizing their potential as an educational tool.
However, a key consideration for future research is the emphasis on generating questions that test higher-level medical knowledge rather than relying predominantly on basic recall. While our study has demonstrated ChatGPT-3.5’s ability to produce questions that meet many NBME criteria, a notable area for improvement is the focus on deeper analytical and reasoning skills. Higher-order questions challenge learners to apply their knowledge in complex clinical scenarios, which is essential for the development of critical thinking skills in medical education. Future research should explore prompt adjustments and evaluation processes that can guide AI models to create more sophisticated clinical vignettes. This would help ensure that the generated questions not only align with NBME standards but also effectively test learners’ ability to synthesize and apply knowledge in a clinical context.
5. Conclusions
Based on the scores received, the ChatGPT-3.5-generated questions seem to satisfy the NBME guidelines [11]. However, a thorough review shows that the questions are not on par with other Step exam preparation resources. ChatGPT-3.5 produced oversimplified questions and overused repeated words in the clinical vignettes and answer choices, allowing an individual to answer the questions without any medical knowledge. Our data support the need for AI software that is specifically trained on medical and scientific information.
The high fulfillment rates for criteria 2, 6–10, and 13 suggest that these criteria likely involve straightforward and clear rules. It seems that ChatGPT can consistently follow and apply these types of rules. For instance, avoiding complex multiple-choice options (criterion 6) or using consistent answer-choice formats (criterion 9) are structural elements that AI can reliably adhere to. This finding highlights the strength of ChatGPT in following well-defined, rule-based guidelines that do not require deep contextual understanding. In a related vein, the lower fulfillment rate of 4/10 for criterion 3 (applying foundational knowledge) suggests that ChatGPT struggled with applying foundational medical knowledge rather than simple recall. This criterion requires a deeper understanding of medical concepts and their application in various contexts, which is challenging for an AI model primarily trained on text data without practical experience. This finding indicates a significant gap in ChatGPT’s ability to create questions that go beyond basic recall and test higher-order thinking skills.
With regard to criterion 1 (medical accuracy), the 8/10 fulfillment rate shows that there were instances where ChatGPT provided medically inaccurate information. This could be due to “hallucinations”, where the AI generates plausible-sounding but incorrect information. Ensuring medical accuracy is critical but challenging due to the vast and complex nature of medical knowledge. The fulfillment of criterion 5 (the cover-the-option rule) was moderate (7.5/10 on average). The “cover-the-option” rule requires questions to be answerable without seeing the options, which tests the examinee’s true understanding of the material. This criterion was harder for ChatGPT to consistently meet, likely due to its reliance on generating plausible distractors rather than deeply understanding the question context.
The fulfillment rates of 8/10 and 7/10 for criteria 11 and 15 (avoiding grammatical and repetitive cues, respectively) highlight occasional lapses where questions contained grammatical or repetitive cues that could inadvertently guide examinees to the correct answer, pointing to the need for more sophisticated language-processing capabilities to avoid such subtle errors.
The high scores (15/16) in neurology, respiratory, and gastrointestinal areas suggest that ChatGPT’s training data and its ability to generate questions for these specialties are particularly robust. These fields may have clearer, more standardized guidelines and well-documented clinical scenarios that the AI can emulate effectively.
The relatively lower scores in the reproductive and musculoskeletal areas (12.5/16 and 13/16) may reflect more complex or nuanced clinical scenarios that are harder for the AI to accurately capture. Additionally, reproductive health often involves sensitive and varied patient factors, which may not be as well represented in the AI’s training data.