1. Introduction
Medical education is undergoing a transformative shift with the integration of artificial intelligence (AI), particularly through generative AI techniques such as prompt engineering [1] in educational settings. Tools such as ChatGPT, with their advanced ability to analyze patterns and generate contextually relevant content, hold significant potential for this transformation. Their applications include fostering personalized learning pathways [2], supporting educators in content delivery [3], and enhancing preparation for assessments in online environments [4].
Simulations have long been an essential tool, offering learners a controlled environment to practice skills, critical thinking, decision-making, and situational awareness. Studies have demonstrated that simulation-based learning enhances medical students’ competencies, including critical thinking and decision-making skills [5].
Despite its widespread use, traditional simulation training faces significant barriers that limit its accessibility and impact. One of the key challenges lies in the dependence on physical simulation centers, which often feature advanced yet expensive tools, such as high-fidelity mannequins and virtual reality systems. Access to these resources is typically limited by scheduling constraints, high costs, and the need for instructor supervision [6]. Moreover, simulation exercises are frequently designed for group settings, which can make it difficult for individual learners to engage at their own pace or revisit scenarios for further practice. This lack of flexibility also restricts opportunities for learners to experiment, make mistakes, and learn in a fail-safe environment, an essential component of skill acquisition, which underscores the need for innovative approaches to simulation training.
Recent systematic reviews have highlighted the growing role of artificial intelligence (AI) in competency-based medical education. Feigerlova et al. (2025) conducted a systematic review examining the impact of AI on educational outcomes in health professions education, finding that AI-based educational strategies have the potential to enhance learning outcomes, particularly in personalized learning and performance assessment [7]. Similarly, a systematic review by Ma et al. (2024) explored frameworks, programs, and tools aimed at promoting AI competencies among medical students, emphasizing the importance of integrating AI literacy into medical curricula to prepare future healthcare professionals for AI-enhanced clinical environments [8].
While AI-generated medical simulations have been explored in various forms, current applications are often limited to static question-answer models rather than fully interactive, context-specific clinical scenarios. Furthermore, existing research rarely integrates regional clinical guidelines or competency-based learning frameworks into AI-generated content, reducing their applicability to real-world training environments. This study addresses these limitations by proposing a structured methodology for designing, evaluating, and refining AI-driven simulations using advanced prompt engineering techniques. Unlike previous works that primarily explore AI as a passive educational tool, this paper examines how prompt engineering can actively shape interactive learning environments that enhance decision-making and adaptability in medical training.
Our approach enables more dynamic, adaptive, and realistic clinical case simulations that learners can access individually, at any time, and without fear of failure. Such simulations replicate the variability of real-world clinical scenarios, allowing for personalized and iterative learning experiences. We focus on three techniques: chain-of-thought prompting, context augmentation, and role-specific prompting to enhance the realism and educational value of these simulations.
By developing specialized prompts using regional guidelines and educational contexts, we aim to ensure that simulations remain relevant, accessible, and aligned with the latest standards. Despite these advancements, there is a paucity of research on the effectiveness of AI-generated simulations in medical education, which underscores the need for innovative approaches to simulation training [9].
In this study, we introduce the PROMPT+ Framework, a structured approach specifically designed for AI-driven medical simulations. The framework, including its methodological structure and the selected combination of chain-of-thought prompting, contextual augmentation, and role-specific prompting, is an original contribution developed based on our practical experience in medical education and AI-assisted learning environments. Additionally, the clinical case examples used throughout this paper are based on realistic training scenarios created by the authors, ensuring their relevance to competency-based medical training.
The present paper is structured as follows: Section 2 introduces the fundamental prompt engineering techniques that underpin our approach and explores their application in medical education. Building on this foundation, Section 3 presents a use case that demonstrates how a structured prompt methodology can enhance complex clinical simulations, highlighting the advantages of dynamic AI-generated case studies. Section 4 then delves into the PROMPT+ Framework, outlining its systematic approach to designing, evaluating, and refining AI-driven medical simulations. In Section 5, we discuss the limitations of this approach, addressing key challenges such as AI bias, the generalizability of structured prompt engineering beyond protocol-driven domains, and the necessity for empirical validation. Finally, Section 6 summarizes the potential of advanced prompt engineering in medical simulations, emphasizing the integration of clinical guidelines, decision-making support, and role-specific training, while underscoring the importance of human oversight and specialized AI models for responsible implementation.
This approach could contribute to improved scalability and adaptability in medical training, potentially helping healthcare professionals develop the skills needed to navigate complex real-world challenges and supporting educational outcomes at various stages of their careers. While this paper centers on ChatGPT as the primary large language model, the principles and methodologies are applicable to other LLMs as well.
2. Prompt Techniques
Prompt engineering offers a new approach to creating realistic and adaptive medical case simulations [10]. This section outlines three essential techniques (Chain-of-Thought Prompting, Contextual Augmentation, and Role-Specific Prompting) and explains why their combined use was chosen to develop effective medical simulations. Given the complex cognitive and practical demands of medical education, we carefully evaluated multiple advanced prompting techniques and selected these three as the most suitable. Each technique addresses a specific aspect of simulation-based education: Chain-of-Thought Prompting ensures structured reasoning for complex decision-making [11], Contextual Augmentation [12] provides realism and alignment with evidence-based guidelines, and Role-Specific Prompting [13] creates authentic scenarios by simulating professional responsibilities. Together, these techniques complement each other, enabling the creation of comprehensive simulations that support both the cognitive and practical learning needs of medical students. The focus of this study is on optimizing prompt engineering techniques to enhance generative AI-driven medical simulations. Rather than modifying the underlying model architecture, we refine the interaction between users and large language models through structured prompt design. The PROMPT+ Framework is designed to be compatible with various generative AI models and has been tested using GPT-4 (OpenAI) and BioGPT (Microsoft) due to their capabilities in medical text generation. However, the framework is not dependent on these specific models and can be applied to any large language model that supports structured prompt interactions, allowing for flexibility in different educational and institutional contexts.
2.1. Chain-of-Thought Prompting
Chain-of-Thought prompting is an effective technique for generative AI models, such as ChatGPT, enabling them to produce structured, step-by-step reasoning. By incorporating logical steps into the AI’s response, chain-of-thought prompting transforms static case studies into dynamic and interactive educational tools. As demonstrated by Wei et al. (2022), this approach significantly enhances problem-solving and critical thinking across a range of complex tasks, making it particularly valuable for medical education scenarios that require multi-step reasoning and decision-making processes [14].
To ensure precision, it is crucial to distinguish the reasoning process facilitated by Chain-of-Thought Prompting from the “reasoning” framework typically attributed to GPT-o1: the former is an explicit prompting strategy that structures the model’s visible, step-by-step output, whereas the latter refers to reasoning performed internally by the model itself. In medical simulations, Chain-of-Thought Prompting is indispensable for guiding learners through diagnostic, investigative, and therapeutic challenges. By introducing a systematic approach, it fosters the development of clinical reasoning, a core competency essential for accurate diagnosis and effective patient management. Additionally, this method encourages learners to reflect on each step, promoting a deeper understanding of underlying principles and preventing cognitive overload during complex case evaluations.
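As a minimal illustration, the following Python sketch shows how such a step-by-step prompt could be passed to a chat model through the OpenAI API; the model name, wording, and clinical details are illustrative assumptions rather than prescriptions of this framework.

```python
# Illustrative sketch of a Chain-of-Thought style prompt for a polytrauma case.
# Model name and prompt wording are assumptions, not part of the framework itself.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

cot_prompt = (
    "A 35-year-old male presents after a high-speed motor vehicle collision with "
    "paradoxical chest wall movement, BP 90/60 mmHg and HR 130 bpm.\n"
    "Work through the case step by step:\n"
    "1. Summarize the key findings.\n"
    "2. List differential diagnoses and justify each.\n"
    "3. Name the next diagnostic step and explain your reasoning.\n"
    "4. Ask the learner one question before proposing treatment."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any capable chat model can be substituted
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```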
2.2. Contextual Augmentation with Guidelines
Contextual augmentation enhances the realism and educational value of AI-generated simulations by embedding relevant background information, situational details, and evidence-based clinical guidelines. For medical education in Germany, referencing national standards such as the S3-guideline “Polytrauma/Schwerverletzten-Behandlung” ensures that simulations reflect current best practices in trauma care [15]. Incorporating Retrieval-Augmented Generation (RAG) frameworks allows generative AI models like ChatGPT to dynamically retrieve and incorporate the most current guidelines directly into simulations. RAG enables AI systems to access and apply validated external resources, ensuring that learners’ decisions align with up-to-date, evidence-based standards [16]. To evaluate the effective implementation of RAG in ChatGPT, users can employ targeted strategies to ensure accuracy and consistency. Prompts like “Which guideline did you reference?” or “Does this align with the 2023 S3-guideline update?” allow for source verification. Repeated queries with slight variations test the model’s consistency, while cross-referencing answers with validated guidelines ensures alignment. Additionally, introducing known scenarios or deliberate inaccuracies can help assess the model’s ability to detect and correct errors.
However, even with RAG, reliance on generative AI to reference clinical guidelines introduces the potential for inaccuracies or ambiguities, which must be carefully mitigated through human oversight and regular validation of sources.
While ChatGPT is capable of integrating clinical guidelines into its simulations, the reliability of its output depends heavily on the recency and quality of the data it was trained on, as well as the specificity of the prompts. This presents several risks, including outdated recommendations, regional guideline discrepancies (e.g., S3-Leitlinie vs. ATLS), and potential oversimplification or misinterpretation of complex scenarios.
To minimize errors, it is essential to explicitly specify the source and scope of guidelines, ensuring alignment in diagnostic and therapeutic recommendations. Incorporating follow-up prompts to verify which sources the AI references enhances transparency and ensures content validity. Additionally, human verification by educators is crucial to review AI-generated outputs for consistency with the latest guidelines, as ChatGPT should supplement but not replace expert oversight. Where feasible, integrating ChatGPT with real-time access to trusted medical databases or retrieval-augmented generation tools can further ensure that recommendations are based on up-to-date and validated clinical guidelines. By prompting learners to reference evidence-based guidelines and evaluate their relevance, the simulation not only improves clinical reasoning but also reinforces the habit of verifying recommendations through validated sources.
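The sketch below illustrates one possible way to implement contextual augmentation locally: a curated guideline excerpt is injected into the prompt together with an explicit source request. The excerpt text, helper names, and model choice are hypothetical placeholders, not quotations from the S3-guideline.

```python
# Minimal sketch of contextual augmentation: a locally curated guideline excerpt
# is injected into the prompt, followed by an explicit source request.
# The excerpt and helper names are placeholders, not guideline quotations.
from openai import OpenAI

client = OpenAI()

GUIDELINE_SNIPPETS = {
    "tension_pneumothorax": (
        "Excerpt (placeholder): a suspected tension pneumothorax in an unstable "
        "patient should be decompressed immediately, per the referenced S3-guideline."
    ),
}

def augmented_prompt(case_description: str, topic: str) -> str:
    """Combine the case with a validated guideline excerpt and a source request."""
    return (
        f"Clinical case: {case_description}\n\n"
        f"Relevant guideline context: {GUIDELINE_SNIPPETS[topic]}\n\n"
        "Generate the next simulation step strictly in line with this context. "
        "State explicitly which guideline statement you relied on, and say so if "
        "the provided context is insufficient."
    )

prompt = augmented_prompt(
    "35-year-old male, blunt chest trauma, absent breath sounds on the left, BP 80/50 mmHg.",
    "tension_pneumothorax",
)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```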
2.3. Role-Specific Prompting
Role-specific prompting leverages the ability of generative AI models to adapt to clearly defined instructions, as demonstrated in the foundational work by Brown et al. [17] on Few-Shot Learning. By assigning the model a specific role, such as a trauma team leader or medical educator, this technique fosters structured, realistic interactions that enhance the educational experience for both learners and educators. For learners, role-specific prompting immerses them in authentic clinical scenarios, allowing them to act as decision-makers within a controlled, fail-safe environment. For example, in a polytrauma simulation aligned with the S3-guideline, a learner might be prompted with: “As the team leader, prioritize interventions for a 35-year-old patient with a flail chest and pelvic instability, following the ABCDE approach”. This approach builds critical clinical reasoning and leadership skills while promoting active engagement. The principles described by Brown et al. show that models perform better when given explicit, context-rich instructions, an idea that underpins the effectiveness of role-specific prompting.
When using role-specific prompting, explicitly assign the AI a defined role to enhance realism and focus. Effective prompts might include: “As the trauma team leader, prioritize the next steps in stabilizing this patient”. This ensures the simulation aligns with real-world expectations and responsibilities. Encourage learners to adopt complementary roles, allowing for realistic team-based dynamics. Avoid overly generic or unclear roles, as they may reduce context and limit clarity. For example, “Guide the case as a doctor” lacks specificity. Additionally, ensure prompts balance between guiding the learner and leaving room for independent decision-making. Overly directive prompts risk reducing engagement and critical thinking.
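A minimal sketch of role-specific prompting, assuming the OpenAI chat interface, is shown below; the role wording is illustrative and should be adapted to the intended learning objective.

```python
# Sketch of role-specific prompting: the AI is assigned the trauma team leader role
# via a system message, while the learner responds as a team member.
# Role wording is illustrative, not prescribed by this paper.
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": (
            "You are the trauma team leader in a simulated emergency department. "
            "Follow the ABCDE approach, delegate tasks to the learner, and after "
            "each learner decision give brief, guideline-oriented feedback."
        ),
    },
    {
        "role": "user",
        "content": (
            "I am the airway physician. The patient has a flail chest and is "
            "tachypneic. What do you need from me first?"
        ),
    },
]

turn = client.chat.completions.create(model="gpt-4o", messages=messages)
print(turn.choices[0].message.content)
```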
While each of these techniques contributes unique strengths, their integration is essential for creating comprehensive medical simulations. Chain-of-Thought Prompting provides a structured framework for reasoning, ensuring learners can tackle multi-step problems systematically. Contextual Augmentation embeds realism and evidence-based standards, aligning simulations with clinical best practices. Role-Specific Prompting immerses learners in authentic professional scenarios, fostering both critical thinking and interpersonal skills. Together, these methods address the cognitive, contextual, and professional dimensions of medical education. By integrating these techniques, AI-driven simulations can mimic the complexity and variability of real-world clinical scenarios, enabling learners to engage in structured yet dynamic decision-making. To illustrate their practical application, we implemented these methods in a high-fidelity polytrauma simulation, demonstrating how prompt engineering enhances case realism, diagnostic reasoning, and guided intervention strategies.
3. Use Case
To evaluate the combined utility of Chain-of-Thought Prompting, Contextual Augmentation, and Role-Specific Prompting, we applied these techniques to a simulated polytrauma scenario involving a 35-year-old male injured in a high-speed motor vehicle collision. This case presents a complex interplay of diagnostic, investigative, and therapeutic challenges, including a flail chest with respiratory compromise and a hemodynamically unstable pelvic fracture. Each technique contributed distinct strengths to the simulation, enabling a comprehensive approach to learner engagement and clinical reasoning. Chain-of-Thought Prompting was employed to guide learners through step-by-step reasoning processes, ensuring systematic engagement with the clinical scenario.
The following table compares different prompting styles used in AI-driven simulations, illustrating their progression from static case descriptions to interactive, learner-centered approaches (see Table 1).
Initially, a static description of the patient’s condition was generated, focusing on history, examination findings, and potential diagnoses. While functional, this approach lacked opportunities for interaction and critical reasoning. For instance, outputs included findings such as paradoxical chest wall movement, hypotension (BP: 90/60 mmHg), and tachycardia (HR: 130 bpm), with potential diagnoses like flail chest or intra-abdominal injury causing blood loss.
The combined application of advanced techniques (Chain-of-Thought Prompting, Contextual Augmentation, and Role-Specific Prompting) enabled a holistic approach to the polytrauma simulation. Chain-of-Thought Prompting structured the case into manageable steps, fostering systematic reasoning. Contextual Augmentation grounded decisions in evidence-based guidelines, enhancing realism and reliability. Role-Specific Prompting simulated professional responsibilities, promoting engagement and practical decision-making. This integrated framework not only provided a robust foundation for developing procedural competence and adhering to best practices but also demonstrated how learners can identify and recover from errors. By challenging learners to reassess their steps, the simulation emphasized the importance of systematic prioritization and guideline-based decision-making. A detailed example of the interactive dialogue and team dynamics can be found in the Supplementary Materials (Supplemental S1). However, as we refined our simulations, it became evident that traditional prompting approaches lacked the necessary structure to optimize interactivity, clinical depth, and adaptability. This highlighted the need for a systematic methodology to guide and evaluate the prompting process.
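The following sketch compresses the three techniques into a single illustrative prompt for this scenario; the wording is an assumption and does not reproduce the exact prompts documented in the Supplementary Materials.

```python
# Illustrative combination of the three techniques in one prompt for the polytrauma
# use case. Wording is an assumption and not the exact supplementary prompt.
combined_prompt = "\n".join([
    # Role-Specific Prompting: assign a professional role to the model.
    "Act as a trauma team leader supervising a final-year medical student.",
    # Contextual Augmentation: anchor the scenario in the referenced guideline.
    "Base all recommendations on the German S3-guideline for polytrauma care and "
    "name the guideline section you rely on.",
    # Chain-of-Thought Prompting: require explicit, stepwise reasoning.
    "Present the case of a 35-year-old male after a high-speed motor vehicle collision "
    "with flail chest and an unstable pelvic fracture. Proceed step by step "
    "(Airway, Breathing, Circulation, Disability, Exposure), pause after each step, "
    "ask the learner for a decision, and give feedback that explains your reasoning.",
])
print(combined_prompt)
```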
4. PROMPT+ Framework
The integration of three distinct prompting techniques (Chain-of-Thought Prompting, Role-Specific Prompting, Contextual Augmentation) provides a foundation for advancing AI-driven simulations. However, the lack of a structured approach to guide and evaluate the prompting process has highlighted the need for a more systematic methodology. Initial static simulations often fell short in interactivity, clinical depth, and adaptability, underscoring the importance of a comprehensive framework. Therefore, we suggest the PROMPT+ Framework as a flexible and scalable solution, designed to enhance prompt design and evaluation in medical education. The PROMPT+ Framework is designed as a self-assessment tool rather than an AI-driven training system (see Figure 1). Its primary function is to provide a structured methodology for learners to evaluate whether their prompts generate clinically appropriate AI responses. By systematically refining their inputs, users can enhance their ability to interact with generative AI models for medical training purposes. The framework does not require dedicated computational resources or proprietary AI models, as it can be implemented with freely available generative AI platforms.
Moreover, its principles could extend to fields such as legal training, where case-based reasoning is critical, business simulations for leadership development, and engineering education, where complex problem-solving requires structured guidance and iterative learning.
Thus, the prompt engineering framework was developed to ensure that medical simulations align with clinical guidelines, support learning objectives, and foster interactivity. It follows a systematic, iterative workflow comprising six stages:
- 1. Prompt: Prompts are crafted to explicitly define the simulation context, complexity, and learning objectives. Key elements include guideline integration and clarity checks to ensure specific, actionable instructions. For example, a prompt might specify: “Generate a case for a 35-year-old motorcyclist with polytrauma, following the S3-guideline”.
- 2. Review: AI-generated outputs are evaluated for clinical accuracy, coherence, and realism. This includes ensuring adherence to guidelines, appropriate complexity for learners, and logical consistency. For example, the simulation must recommend needle decompression or mini-thoracotomy for tension pneumothorax per the S3-guideline.
- 3. Optimize: Prompts are refined to enhance adaptability, interactivity, and feedback quality. Scenarios are tested for dynamic responses, error recovery, and meaningful decision pathways. For instance, if a learner delays intervention, the AI prompts guidance on prioritizing airway stabilization.
- 4. Measure: The educational impact is assessed by evaluating alignment with learning objectives, engagement, and error correction. Metrics include scenario realism, learner success rates, and the ability to identify and address mistakes, such as highlighting the omission of a pelvic binder.
- 5. Persist: Prompts are regularly updated to reflect new guidelines and maintain relevance. Their reusability across contexts, such as adult and pediatric trauma cases, is also tracked to ensure consistency and flexibility.
- 6. Test: Simulations undergo systematic validation to confirm functionality, reliability, and reproducibility. Prompts must consistently generate coherent outputs, flag errors, and maintain logical integrity across repeated use.
+: In order to help users responsibly apply the framework, the “+” component introduces a pivotal dimension of reflection rooted in the principle of primum non nocere (first, do no harm) [18,19]. This addition calls for a deliberate examination of the ethical and practical implications of AI-generated simulations, fostering a culture of accountability and thoughtful introspection. By encouraging users to question the rationale behind AI outputs and creating opportunities for learners to critically evaluate their decisions, this stage ensures that simulations transcend mere clinical accuracy and remain embedded in the full range of human experience.
To operationalize this ethical dimension, we emphasize the need for standardized validation mechanisms, bias mitigation strategies, and structured human oversight. The specific implementation of these measures will depend on the educational context and the intended use of AI-driven simulations, requiring a tailored approach to determine the appropriate focus. For validation, structured peer-review processes involving domain experts can ensure that AI-generated cases align with current medical standards.
Fundamentally, the “+” component should be seen as a flexible yet essential safeguard, ensuring that AI-enhanced medical training not only fosters clinical accuracy but also prioritizes ethical responsibility, inclusivity, and transparency in its application.
Bias mitigation may include continuous evaluation against diverse patient demographics and targeted dataset curation to address disparities. Human oversight can range from real-time educator intervention during AI-assisted learning to post-simulation debriefings where learners critically reflect on AI-generated recommendations. While RAG improves medical accuracy by anchoring AI-generated outputs to structured knowledge sources, its effectiveness is contingent upon the availability and reliability of those sources. Unlike traditional databases, LLMs do not have direct real-time access to evolving clinical guidelines. This necessitates continuous dataset curation and validation protocols to prevent outdated or regionally inconsistent recommendations. The PROMPT+ Framework addresses this issue by integrating structured dataset audits, where domain experts assess whether AI-generated content remains aligned with updated medical standards.
Ensuring the safety and reliability of generative AI in medical education is a critical component of the PROMPT+ Framework. Large language models such as ChatGPT and BioGPT are trained on heterogeneous datasets, which may introduce biases or reinforce existing disparities in clinical reasoning. Additionally, these models do not inherently guarantee medical accuracy and may generate plausible yet incorrect information. To mitigate these risks, the PROMPT+ Framework incorporates multiple safeguards: (1) a human-in-the-loop validation mechanism, where AI-generated cases undergo expert review before integration into training modules, (2) retrieval-augmented generation to enhance guideline adherence and factual accuracy, and (3) structured error correction, where AI responses are dynamically refined based on expert input and learner feedback.
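As an illustration of safeguards (1) and (3), the following Python sketch models a simple expert-review gate in which an AI-generated case is only released into a training module after approval; the data structures and field names are assumptions, not components of an existing system.

```python
# Hedged sketch of a human-in-the-loop gate: an AI-generated case is released into
# a training module only after expert review; rejected cases are returned for
# prompt refinement. Data structures are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class GeneratedCase:
    case_id: str
    content: str
    expert_approved: bool = False
    reviewer_notes: list[str] = field(default_factory=list)

def expert_review(case: GeneratedCase, guideline_consistent: bool, notes: str) -> GeneratedCase:
    """Record the expert verdict; only guideline-consistent cases are approved."""
    case.reviewer_notes.append(notes)
    case.expert_approved = guideline_consistent
    return case

def release_to_training(case: GeneratedCase) -> bool:
    """Structured error correction: unapproved cases go back for prompt refinement."""
    if not case.expert_approved:
        print(f"{case.case_id}: returned for prompt refinement ({case.reviewer_notes[-1]})")
        return False
    print(f"{case.case_id}: released to the training module")
    return True

case = GeneratedCase("polytrauma_01", "AI-generated polytrauma scenario ...")
case = expert_review(case, guideline_consistent=False, notes="Pelvic binder step missing")
release_to_training(case)
```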
The framework presented in this paper is designed to be adaptable and context-agnostic, enabling users to tailor it to their specific needs and evaluation metrics. As such, it is inherently open and accessible, encouraging adoption across diverse domains beyond medical education. Given its versatility, it is crucial to examine the broader impact of advanced prompt engineering techniques on medical training and their potential to enhance learning outcomes.
The PROMPT+ Framework provides a structured and adaptable approach to AI-driven medical simulations. However, to ensure scientific rigor, its impact must be quantitatively validated. To address this, we propose a systematic evaluation process based on the core PROMPT+ principles, ensuring that each stage of the framework is methodically assessed. A key aspect of this evaluation is the alignment of AI-generated responses with clinical reasoning standards, decision-making accuracy, and adaptability across diverse medical scenarios. This process relies on our three fundamental prompt engineering techniques: Role-Specific Prompting (RSP), Chain-of-Thought Prompting (COTP), and Retrieval-Augmented Generation (RAG). Each technique addresses specific challenges in AI-driven medical simulations, enhancing context awareness, structured diagnostic reasoning, and adherence to evidence-based guidelines.
To ensure objectivity and scientific validity, the framework incorporates quantifiable evaluation metrics, including Likert-scale expert assessments to measure clinical realism and accuracy, pre/post-test comparisons to evaluate decision-making improvements, and error rate tracking to analyze AI reliability in correcting incorrect responses. A detailed breakdown of the technical evaluation criteria, including specific assessment methods and their application within the PROMPT+ structure, is provided in Table 2. This table outlines how each evaluation metric supports decision-making optimization and risk reduction strategies, preventing common AI-driven pitfalls such as hallucinations, oversimplified reasoning, or misalignment with clinical protocols.
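A minimal sketch of how these metrics could be computed is given below; all numbers are hypothetical placeholders and do not represent results reported in this paper.

```python
# Sketch of the quantifiable metrics named above, computed on hypothetical data:
# mean Likert ratings, a pre/post-test difference, and an error-correction rate.
# All numbers are placeholders, not results reported in this paper.
from statistics import mean

expert_likert = {"realism": [4, 5, 4], "accuracy": [4, 4, 5]}   # 1-5 ratings per reviewer
pre_scores, post_scores = [55, 60, 48], [72, 74, 69]            # decision-making test scores (%)
errors_detected, errors_total = 9, 12                           # AI corrections of wrong learner actions

likert_means = {dim: mean(vals) for dim, vals in expert_likert.items()}
learning_gain = mean(post_scores) - mean(pre_scores)
error_correction_rate = errors_detected / errors_total

print(likert_means)  # mean Likert rating per dimension
print(f"Mean pre/post gain: {learning_gain:.1f} percentage points")
print(f"Error-correction rate: {error_correction_rate:.0%}")
```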
To further establish the scientific foundation of the PROMPT+ Framework, we propose a structured validation process in future studies, consisting of a pilot evaluation assessing clinical reasoning improvements in medical students and emergency physicians, a controlled experiment comparing AI-assisted simulations with traditional case-based learning to evaluate efficacy, and an iterative refinement process integrating structured expert feedback to optimize AI-generated case simulations continuously. By integrating these validation steps, we aim to provide a comprehensive, evidence-based assessment of how AI-driven prompt engineering enhances medical education and supports more effective clinical decision-making.
To assess the impact of structured prompting on the quality of AI-driven medical simulations, an expert evaluation was conducted. A single domain expert reviewed ten simulated clinical cases, five of which were designed using structured prompting techniques (RSP, RAG, COTP), while the other five followed an unstructured approach. The evaluation was conducted using a Likert scale (1–5), with the PROMPT+ Framework serving as the assessment criterion. Each case was rated based on six key dimensions, aligning with the framework’s core components.
To maintain consistency, the evaluation focused solely on the initial phase of each simulation, specifically the case introduction and the first five AI-learner interactions (Supplementary Materials (Supplemental S2)).
The mean ratings for structured and unstructured cases based on the PROMPT+ Framework are as follows: Structured cases consistently scored higher across all PROMPT+ evaluation dimensions. The most significant differences were observed in error correction (M: 4.60 vs. 0.20) and structured reasoning (O: 3.80 vs. 1.20), indicating that structured prompting enhances both logical decision-making and AI feedback mechanisms. Clarity of case objectives (P: 3.80 vs. 2.00) and adaptive complexity (T: 3.40 vs. 1.00) were also rated higher, suggesting that structured prompts improve guidance and dynamic adaptation to user input. Adherence to medical guidelines (R: 2.40 vs. 1.60) showed a smaller difference, indicating potential areas for improvement in integrating guideline-based responses (see Table 3).
The results indicate that the PROMPT+ Framework provides a useful structure for evaluating the quality of AI-driven clinical simulations. The expert assessment suggests that cases aligned with structured prompting techniques (RSP, RAG, COTP) demonstrated greater clarity, better alignment with clinical reasoning, and more effective error correction mechanisms. The most substantial benefits were observed in structured reasoning and feedback quality, emphasizing the role of structured prompting in guiding AI responses toward clinically relevant and pedagogically sound interactions. However, given that this evaluation was based on a single expert and limited to the first five AI-learner interactions, further studies with a larger reviewer sample and extended simulation durations are necessary to validate these findings and refine the assessment framework.
5. Discussion
Advanced prompt engineering techniques hold the potential to significantly improve medical education [20,21]. Dynamic and adaptive simulations foster interactive learning, allowing learners to explore clinical scenarios iteratively, which enhances critical thinking and decision-making. Scalability is another major improvement, as diverse and complex scenarios can be generated efficiently, catering to learners at varying levels of expertise. Effective prompt design requires adherence to key principles to maximize clarity, interactivity, and educational value. Prompts should explicitly define patient scenarios, clinical guidelines, and learning objectives to ensure relevance and focus. For instance, specifying, “Generate a simulation for a 45-year-old with blunt abdominal trauma, referencing the S3-guideline”, provides clear direction. Incorporating decision points, such as asking, “What are the risks of delaying imaging in this case?” encourages critical engagement, while explicitly referencing evidence-based standards like the S3-guideline for polytrauma management promotes guideline adherence. Tailoring complexity to the learner’s expertise further enhances educational effectiveness, with foundational tasks suited for beginners and advanced challenges, like identifying complications of a missed pelvic binder, designed for more experienced learners.
This approach aligns with Croskerry’s [22] dual-process theory, which underscores the importance of combining heuristic (System 1) and analytical (System 2) reasoning to improve clinical judgment, an essential component of physician performance. Given the persistently high rates of diagnostic errors, simulations that integrate both systems encourage learners to critically assess their intuitive decisions while developing systematic analytical skills. By fostering this balance between intuition and analysis, simulations mirror the complexity of real-world clinical scenarios, enabling reflective learning from errors and ultimately enhancing decision-making capabilities and patient outcomes.
Conversely, poorly constructed prompts can undermine the simulation’s value. Vague prompts, such as “Generate a trauma case”, often result in generic outputs that lack educational focus. Prompts must also encourage error identification and correction to reinforce learning, as in: “Incorrect: No chest tube was placed. Reassess pneumothorax management”. Overly lengthy or complex prompts should be avoided to prevent cognitive overload for both the AI and the learner. Instead, instructions should remain concise and targeted, focusing on specific aspects, such as, “Focus on airway management steps without addressing unrelated issues”. By following these dos and avoiding common pitfalls, prompt design can effectively support dynamic and engaging medical simulations.
Recent studies, such as the investigation by Vidhani et al., have demonstrated the significant role of prompt engineering in steering ChatGPT-3.5’s performance within specialized educational domains like chemistry. By systematically refining prompts through techniques such as contextual augmentation, iterative adjustments, and direct questioning, the authors achieved a notable improvement in ChatGPT-3.5’s accuracy across various tasks, particularly those categorized as “apply” level on Bloom’s taxonomy [23]. These findings emphasize the critical need for tailored prompts that integrate domain-specific constraints and user-guided contextual cues. This aligns with the principles of the PROMPT+ Framework, which seeks to provide similar systematic guidance in the context of medical education simulations, ensuring relevance, interactivity, and alignment with clinical guidelines.
Despite these benefits, certain challenges need to be addressed: The accuracy of guideline integration depends on the specificity of the prompts and access to data. Errors in guideline application or outdated recommendations can compromise the educational value of simulations. To reduce the risk of incorrect responses, prompts should explicitly encourage the model to acknowledge uncertainty and avoid speculation, such as: “If unsure, state that information may be incomplete or outdated”. Preconfigured system prompts can restrict the model’s scope, ensuring adherence to validated guidelines by instructing: “Base all responses on the S3-guideline; recommend verification if uncertain”. Integrating user-driven verification prompts, such as “What is the source of this recommendation?” enhances transparency. Furthermore, excessive reliance on AI feedback may reduce critical engagement with the material.
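These safeguards can be expressed as a preconfigured message setup; the sketch below is illustrative, and the exact wording would need to be validated for a given curriculum.

```python
# Sketch of the guardrails described above: a restrictive system prompt with an
# explicit uncertainty instruction, plus a user-driven verification follow-up.
# Wording is illustrative, not a prescribed configuration.
GUARDRAIL_SYSTEM_PROMPT = (
    "Base all responses on the S3-guideline for polytrauma care. "
    "If you are unsure, or if the guideline may have changed since your training data, "
    "state that the information may be incomplete or outdated and recommend verification "
    "instead of speculating."
)

VERIFICATION_FOLLOW_UP = (
    "What is the source of this recommendation, and which guideline version does it reflect?"
)

def build_messages(learner_question: str) -> list[dict]:
    """Assemble the message sequence for a single guarded exchange."""
    return [
        {"role": "system", "content": GUARDRAIL_SYSTEM_PROMPT},
        {"role": "user", "content": learner_question},
    ]

print(build_messages("Which pelvic stabilization step comes first in this case?"))
print(VERIFICATION_FOLLOW_UP)  # sent as a follow-up turn after the model's answer
```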
The PROMPT+ Framework sets itself apart by systematically addressing interactivity, adaptability, and guideline alignment in medical education simulations. Unlike frameworks such as CureFun [13], which focus on structured dialogue flows and automated evaluations, PROMPT+ emphasizes a six-stage iterative workflow that integrates prompt design with ethical considerations and evolving guidelines. Additionally, PROMPT+ extends beyond operational assessment by embedding stages like “Measure” and “Persist”, ensuring that simulations align with learning objectives and remain adaptable to new clinical standards. Its “+” component introduces a reflective dimension, fostering critical evaluation of AI outputs and ethical accountability.
While the PROMPT+ Framework focuses on interactivity, adaptability, and guideline alignment in medical education simulations, the SPARRO framework [24] offers a complementary lens by placing emphasis on iterative refinement of human–AI interactions within educational settings. Developed through an ethnographic study in healthcare and nursing education, SPARRO addresses critical challenges such as mistrust of AI-generated outputs and difficulties in crafting effective prompts. The workflow is thoughtfully designed to guide users through key stages, such as strategy formulation, prompt creation, iterative feedback, and ongoing optimization, while fostering collaboration to improve the reliability and effectiveness of AI integration in academic settings. While the SPARRO framework has primarily been used in nursing and healthcare education, its principles hold valuable insights that can be applied to broader contexts, including AI-driven medical simulations. By providing a systematic methodology to refine AI interactions, SPARRO tackles operational concerns such as AI hallucinations and the ethical use of generative outputs. These features align with and potentially enrich the PROMPT+ Framework’s reflective and guideline-centric approach, particularly in scenarios where evidence-based decision-making and tailored learning pathways are critical.
Integrating the SPARRO framework’s iterative refinement with the guideline adherence and reflective learning elements of the PROMPT+ Framework enables the development of an advanced integration model for AI applications in medical education. Such a model could facilitate the creation of adaptive, accurate, and ethical simulations that address various learning needs, fostering trust in and appropriate use of these applications within advanced clinical training contexts. This aligns with broader efforts to tackle the challenges of prompt evaluation and optimization, as evidenced by several complementary approaches that emphasize either scalability or interactivity to refine and assess prompt performance. GLaPE (Gold Label-agnostic Prompt Evaluation) introduces a practical method for evaluating prompts without the need for annotated datasets, relying instead on self-consistency metrics to assess robustness [25]. EvalLM complements this by offering an interactive and iterative framework, allowing users to refine prompts based on natural language-defined criteria [26]. While these approaches provide valuable technical tools for improving prompt evaluation, our framework focuses on the context of medical education simulations to address specific, domain-relevant challenges.
The broad applicability of advanced prompt engineering techniques, as demonstrated in XR simulations for educational, entertainment, and training purposes, underscores their potential in medical education [27]. By leveraging approaches such as contextual augmentation and iterative prompt refinement, the PROMPT+ Framework can bridge gaps between fields, fostering interdisciplinary advancements in AI-driven simulations. Expanding the use of prompt engineering for bias mitigation in educational and clinical AI applications presents a promising avenue. Techniques such as genre-specific prompt crafting, which have proven effective in NLP bias reduction [28], could be adapted to medical contexts to ensure equitable and contextually aware learning scenarios. This approach has already been explored in the context of informal online health queries, where structured prompt templates have been used to address conditions such as obesity [29].
A promising future direction involves developing specialized GPT models [30,31] tailored to specific clinical guidelines and learner profiles. These models could be fine-tuned to incorporate national and international standards, ensuring evidence-based and regionally relevant simulations. A preconfigured system prompt could adjust the model’s behavior to match the learner’s level of expertise, study year, or clinical experience. For example, beginner-level users could receive additional guidance and explanations, while advanced learners would face more complex, self-directed challenges. Such an approach would make simulations “ready-to-start”, even for users without prior knowledge of prompt engineering, thus lowering the barrier to entry and maximizing the accessibility and effectiveness of AI-assisted medical education.
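A level-adjusted system prompt of this kind might be sketched as follows; the level names and wording are hypothetical examples rather than a validated configuration.

```python
# Sketch of a preconfigured, level-adjusted system prompt, as envisioned above.
# Level names and wording are hypothetical; a deployed model would embed a
# validated version of this configuration.
LEVEL_PROFILES = {
    "beginner": "Give additional guidance, define key terms, and explain every decision.",
    "advanced": "Minimize hints, increase case complexity, and challenge the learner to justify each step.",
}

def system_prompt_for(level: str) -> str:
    """Combine a shared guideline anchor with level-specific behavior."""
    base = (
        "You are a medical simulation tutor. Base all content on current national "
        "guidelines and state when verification by an educator is required. "
    )
    return base + LEVEL_PROFILES[level]

print(system_prompt_for("beginner"))
```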
Given the fast-changing environment of AI applications, with its potential opportunities and risks, any development should be explicitly challenged on whether it follows the principles of transparency and responsibility in creating learning environments [32]. Students should be encouraged to reflect on the limits of AI support, potential misuse, and the risk of detachment from human centricity [33].
Limitations
While the PROMPT+ Framework provides a structured approach to AI-driven medical simulations, its implementation is subject to several limitations. A key concern is the inherent bias of large language models like ChatGPT, which are trained on heterogeneous datasets [34] that may reinforce existing disparities in medical education and clinical reasoning. For instance, Sallam (2023) highlighted that ChatGPT might generate inaccurate or misleading information due to biases in its training data, posing risks in medical education contexts [35]. Another limitation lies in the integration of evidence-based guidelines. While contextual augmentation can enhance realism, LLMs do not have direct access to real-time updates of clinical guidelines, which may lead to outdated or regionally inconsistent recommendations [16,36]. Even Retrieval-Augmented Generation approaches are dependent on the availability of structured and validated medical knowledge sources, necessitating continuous human oversight to ensure clinical accuracy. Despite the structured safeguards implemented in the PROMPT+ Framework, challenges remain in ensuring that generative AI models consistently align with evidence-based guidelines and adapt to evolving medical standards. While Retrieval-Augmented Generation improves factual consistency, its reliability is inherently dependent on the quality and scope of the underlying knowledge bases. Additionally, the potential for AI-generated content to deviate from established medical protocols highlights the need for longitudinal monitoring and external validation. Future research should focus on refining automated plausibility checks, developing adaptive feedback systems, and evaluating the effectiveness of human oversight models in AI-assisted decision-making. Establishing standardized validation protocols will be essential for ensuring the long-term reliability of generative AI in medical education.
Moreover, the generalizability of the framework beyond emergency medicine and anesthesiology remains unproven. These fields are characterized by protocol-driven decision-making, making them well suited for structured prompt engineering. For instance, an article by Tangsrivimol et al. (2025) highlighted that while AI can process vast amounts of data, it lacks clinical experience and emotional intelligence [37], which are crucial in specialties requiring nuanced clinical judgment.
Given the distinct cognitive demands of different medical fields, future research should investigate whether the structured prompt engineering techniques proposed in this study are equally effective in domains that require more heuristic-based and differential diagnostic reasoning, such as neurology or primary care.
While structured reasoning, like chain-of-thought prompting, enhances decision-making in standardized clinical workflows, more complex, iterative diagnostic processes, such as those required in neurology or hematology, may demand different approaches. Additionally, interprofessional education and multidisciplinary case simulations, which are essential for collaborative clinical training, remain unexplored within this framework.
Moreover, although this study describes the integration of three prompt engineering techniques, it does not provide a comparative analysis of their individual and combined effects on learning outcomes. This omission limits the ability to determine the relative importance of each technique and identify the optimal combination for maximum effectiveness. While prior research has demonstrated that chain-of-thought prompting improves structured reasoning in complex problem-solving, and that contextual augmentation enhances AI-generated outputs by incorporating domain-specific guidelines, there is limited empirical evidence on how these methods interact in medical education settings.
Finally, the ethical dimension of the PROMPT+ Framework, represented by the “+” component, introduces an important reflective element, yet it lacks concrete guidance on how these principles should be operationalized and monitored in practice. While ethical concerns such as bias mitigation, hallucination control, and accountability in AI-generated medical content are acknowledged, the absence of structured implementation strategies limits the framework’s applicability in real-world educational settings. Future work should focus on developing standardized validation protocols to ensure that AI-generated medical simulations align with current clinical standards and educational objectives. This includes establishing peer review mechanisms where domain experts systematically evaluate AI-generated case scenarios for accuracy, bias, and educational value. Additionally, the creation of best-practice guidelines is essential to provide educators with structured methodologies for integrating AI-driven simulations while maintaining transparency, reliability, and pedagogical soundness.
Beyond structural and validation aspects, future research should systematically assess the educational impact of different prompt engineering techniques. Controlled studies could compare their effectiveness in medical training, analyzing their influence on knowledge retention, diagnostic accuracy, and decision-making speed. Furthermore, longitudinal studies should investigate how AI-enhanced simulations contribute to the development of clinical reasoning skills over time, particularly in comparison to traditional simulation-based training. Establishing these empirical foundations will be critical for refining AI-assisted medical education and optimizing its role in competency-based learning environments. While AI can enhance training, it should not replace human oversight. Studies emphasize the importance of maintaining a “human-in-the-loop” model, where educators retain control over AI-generated content and can intervene when necessary [38].
6. Conclusions
Advanced prompt engineering techniques, including chain-of-thought prompting, contextual augmentation, and role-specific prompting, have demonstrated the potential to enhance medical simulations. By integrating current clinical guidelines, such as the S3-guideline, these simulations ensure relevance and alignment with real-world practice while also offering educators a framework for targeted feedback and assessment. Role-specific prompting allows learners to immerse themselves in authentic clinical scenarios by adopting roles such as team leader, thereby fostering accountability and role-based problem-solving. These techniques could collectively enable the creation of interactive, adaptive, and evidence-based simulations that improve learners’ critical thinking, decision-making, and clinical reasoning skills.
Nevertheless, challenges persist, particularly in maintaining the accuracy and currency of guideline integration and ensuring that both learners and educators balance AI-driven support with critical oversight. Regular updates, robust prompt design, and human verification remain essential for maintaining the quality, reliability, and educational value of AI-driven tools. Future innovations should focus on the development of specialized GPT models tailored to specific clinical guidelines and learner profiles.
In essence, prompt engineering offers a scalable and transformative approach to medical simulation, addressing the growing need for flexible, interactive, and guideline-aligned learning environments. With continued refinement and integration into educational frameworks, AI-powered simulations could play a pivotal role in preparing future healthcare professionals for the complexities of clinical practice. The responsible integration of these new technologies necessitates thoughtful and deliberate reflection on their benefits and limitations.