Article

Using Large Language Models to Retrieve Critical Data from Clinical Processes and Business Rules

1 Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
2 Division of Plastic Surgery, Mayo Clinic, 4500 San Pablo Road, Jacksonville, FL 32224, USA
* Author to whom correspondence should be addressed.
Bioengineering 2025, 12(1), 17; https://doi.org/10.3390/bioengineering12010017
Submission received: 19 November 2024 / Revised: 20 December 2024 / Accepted: 27 December 2024 / Published: 28 December 2024
(This article belongs to the Special Issue Application of Artificial Intelligence in Complex Diseases)

Abstract

Current clinical care relies heavily on complex, rule-based systems for tasks like diagnosis and treatment. However, these systems can be cumbersome and require constant updates. This study explores the potential of the large language model (LLM) LLaMA 2 to address these limitations. We tested LLaMA 2’s performance in interpreting complex clinical process models, such as Mayo Clinic Care Pathway Models (CPMs), and providing accurate clinical recommendations. The LLM was provided with pathway versions encoded in DOT language and embedded with SentenceTransformer, and was then presented with hypothetical patient cases. We compared the token-level accuracy between the LLM output and the ground truth by measuring both node and edge accuracy. LLaMA 2 accurately retrieved the diagnosis, suggested further evaluation, and delivered appropriate management steps, all based on the pathways. The average node accuracy across the different pathways was 0.91 (SD ± 0.045), while the average edge accuracy was 0.92 (SD ± 0.122). This study highlights the potential of LLMs for healthcare information retrieval, especially when relevant data are provided. Future research should focus on improving these models’ interpretability and their integration into existing clinical workflows.

Graphical Abstract

1. Introduction

In the healthcare sector, the design and execution of clinical processes are predominantly governed by rule-based systems [1]. These systems, which encapsulate the subtleties of medical decision-making, are pivotal in various facets of patient care, including disease diagnosis and treatment [1,2,3]. For instance, consider the diagnostic process for coronavirus disease 2019 (COVID-19), a disease that has profoundly impacted global health systems [4,5]. This process typically involves a complex set of rules that take into account a patient’s current symptoms, vaccination history (including COVID-19 vaccinations), the severity of their condition, and the need to differentiate COVID-19 from other respiratory illnesses such as influenza or non-COVID-19 lung infections [5,6]. These rules are meticulously crafted based on clinical guidelines, best practices, and expert knowledge to ensure patients receive timely and accurate care [7,8].
Despite the effectiveness of rule-based systems in guiding clinical decision-making, they are not without limitations [8,9]. One of the primary challenges is the overwhelming volume and complexity of rules required to cover the wide array of clinical scenarios encountered in practice [7]. As our understanding of diseases and treatments evolves, maintaining and updating these rules becomes increasingly challenging. Additionally, rule-based systems may struggle to capture the nuances and context-dependent nature of medical decision-making, leading to suboptimal outcomes in certain situations [10].
The emergence of large language models (LLMs) has demonstrated remarkable logical reasoning capabilities, even for complex clinical rules [11,12]. LLMs have shown their potential as pivotal tools in medical practice by excelling in the distinct nuances of clinical decision support, even without the implementation of fine-tuning techniques [12,13,14,15,16,17,18]. Related studies show that LLMs can supplement traditional Clinical Decision Support Systems’ (CDSSs) optimization by decreasing the number of unnecessary alerts and providing relevant, understandable, and non-redundant recommendations compared to human systems [19]. Furthermore, they can accurately provide timely and individualized diagnosis and treatment [11,20,21]. However, there were some gaps and deficiencies in their knowledge, highlighting their lack of training for these tasks. Conversely, techniques to improve LLMs’ understanding and performance, such as fine-tuning and Retrieval-Augmented Generation (RAG), enable them to better understand the context, input, and output in a specific domain [22,23,24,25,26].
This paper showcases how LLMs can be leveraged to retrieve information from complex rule-based systems in healthcare, thereby enhancing information systems. The process involves encoding complex clinical rules, such as those in Care Pathway Models (CPMs), into DOT (a graph description language), followed by embedding the encoded rules into a vector database. Subsequently, an LLM—LLaMA 2—is employed to retrieve the information to support clinical decision-making (Figure 1). Finally, we analyze the performance of the model by measuring its accuracy at the token level. By harnessing the power of LLMs, healthcare providers can potentially enhance the retrieval of critical information from clinical processes and business rules, leading to more efficient and effective decision-making, improved clinical decision support, and a reduced administrative burden for physicians.

2. Methods

2.1. Selection of LLM

The large language model (LLM) selected for this study is the open-source, publicly available LLaMA 2 model. LLaMA 2 was chosen for its robust performance in natural language understanding and generation tasks, making it suitable for retrieving information from complex clinical processes and business rules.

2.2. Dataset

The clinical guidelines and recommendations used in this study are based on the Mayo Clinic consensus [28], which was derived from a review of existing evidence and guidelines. It is important to note that these recommendations do not replace clinical judgment but serve as a guideline for healthcare professionals. While several CPMs were evaluated to demonstrate the application of the model, we demonstrate a clinical scenario related to the diagnosis and testing of COVID-19 in adults, specifically outpatient testing.

2.3. Encoding the Clinical Procedures

To encode the clinical recommendations provided by CPMs, we utilize DOT language, a format commonly used to represent graphical models, including rules and decision trees. In DOT language, there are three kinds of objects: graphs, nodes, and edges. A node is created when it first appears in the file and an edge is created when nodes are joined by the “->” operator [29]. In the scenario depicted in Figure 2, examples of nodes are the question regarding the presence of COVID-19 vaccination, the presence of symptoms in the last 48 h, and the recommendation of testing with specific clinical criteria. Conversely, edges are the relationship between those three questions.
The clinical procedures were sourced from the AskMayoExpert site [30], which provides detailed clinical guidelines and recommendations for healthcare professionals (Figure 2). See Supplementary Material for a sample of a CPM (COVID-19 testing for adults) in DOT language.
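As a sketch of this encoding step, the COVID-19 testing fragment described above could be rendered in DOT roughly as follows. The node identifiers and labels here are hypothetical stand-ins for the CPM wording, and the helper function is only an illustration, not the study’s actual tooling:

```python
# Sketch of the DOT encoding step for the COVID-19 testing fragment in
# Figure 2. Node ids/labels are hypothetical stand-ins for the CPM wording.

def encode_pathway(name, nodes, edges):
    """Render (id, label) nodes and (src, dst) edges as a DOT digraph."""
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes:
        lines.append(f'    {node_id} [label="{label}"];')
    for src, dst in edges:
        lines.append(f"    {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)

nodes = [
    ("vaccinated", "COVID-19 vaccination received?"),
    ("symptoms_48h", "Symptoms present in the last 48 h?"),
    ("order_test", "Test per outpatient clinical criteria"),
]
edges = [("vaccinated", "symptoms_48h"), ("symptoms_48h", "order_test")]

dot_text = encode_pathway("covid19_testing", nodes, edges)
print(dot_text)
```

As in the description above, each question or recommendation becomes a node, and each edge created with the "->" operator captures the relationship between consecutive decision points.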

2.4. Embedding the Encoded Processes

The encoded clinical processes are then embedded using HuggingFace SentenceTransformer, a library designed for encoding sentences and paragraphs into fixed-dimensional embeddings. This step is crucial for transforming the textual representation of clinical procedures into a format that the LLM can process.
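The shape of this step can be sketched as follows. The paper uses HuggingFace SentenceTransformer; to keep the example self-contained and runnable, a toy hashed bag-of-words embedder stands in for `model.encode()`, preserving the key property that each text maps to one fixed-dimensional vector. The dimension, fragment texts, and function names are illustrative assumptions:

```python
# Toy stand-in for the embedding step. In the paper's pipeline this is
# HuggingFace SentenceTransformer, roughly:
#     model = SentenceTransformer(...)   # model name not stated in the text
#     vectors = model.encode(dot_fragments)
# The hashed bag-of-words below only mimics that interface: every text is
# mapped to one fixed-dimensional, L2-normalized vector.

import hashlib

DIM = 256  # arbitrary fixed embedding dimension for this sketch

def embed(text):
    vec = [0.0] * DIM
    for token in text.lower().split():
        # hash each token into one of DIM buckets and count occurrences
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

dot_fragments = [
    'vaccinated [label="COVID-19 vaccination received?"];',
    "vaccinated -> symptoms_48h;",
]
vectors = [embed(f) for f in dot_fragments]
print(len(vectors), len(vectors[0]))
```

Fixed-dimensional output is what allows the encoded fragments to be stored and compared in a vector database, regardless of the length of the original DOT text.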

2.5. Information Retrieval Approach

The methodology for using the LLM to retrieve critical steps from the encoded clinical processes involves the following steps (Figure 1): First, we input the encoded clinical processes into the LLaMA 2 model. We then leveraged the LLM’s natural language understanding capabilities to interpret the encoded processes and identify critical steps. Finally, we retrieved the critical steps based on the given contents, focusing on key aspects such as patient symptoms, vaccination history, severity of condition, and differentiation from other respiratory illnesses.
By following this approach, we aim to demonstrate the effectiveness of using LLMs for retrieving critical information from complex clinical processes and business rules (Figure 3).
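A minimal sketch of the retrieval idea, under stated assumptions: bag-of-words vectors substitute for the SentenceTransformer embeddings, and the fragment texts are invented for illustration rather than taken from a CPM. The fragment most relevant to the patient case is selected by cosine similarity:

```python
# Retrieval sketch: rank encoded pathway fragments against a patient case
# by cosine similarity. Bag-of-words vectors stand in for the
# SentenceTransformer embeddings used in the paper; fragment texts are
# illustrative, not actual CPM content.

def tokenize(text):
    return [t.strip(".,?:").lower() for t in text.split()]

def embed(text, vocab):
    """Count vector over a shared vocabulary, L2-normalized."""
    toks = tokenize(text)
    counts = [toks.count(w) for w in vocab]
    norm = sum(c * c for c in counts) ** 0.5 or 1.0
    return [c / norm for c in counts]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

fragments = [
    "vaccination history: COVID-19 vaccination in the past 2 weeks",
    "severity: observe at home unless symptoms require ER evaluation",
    "differential: rule out influenza and non-COVID-19 lung infection",
]
query = "patient with cough and fever, recent COVID-19 vaccination"

vocab = sorted({t for text in fragments + [query] for t in tokenize(text)})
db = [(frag, embed(frag, vocab)) for frag in fragments]  # the "vector database"
q = embed(query, vocab)

# The highest-scoring fragment provides the context handed to the LLM.
best = max(db, key=lambda item: cosine(q, item[1]))[0]
print(best)
```

In a full RAG pipeline, the top-ranked fragments would then be passed to LLaMA 2 as context alongside the patient case, rather than returned directly.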

2.6. Measuring Model Accuracy

We compared the token-level accuracy between LLM output and the ground truth by measuring both node and edge accuracy. Ground truth was provided by the DOT language files obtained from the CPMs. Nodes are identified as individual entities, and edges are connections or relationships that link pairs of nodes. First, we parsed the DOT and LLM-generated JSON (JavaScript Object Notation) files describing the graph structures using nodes and edges. Then, we compared the nodes from the DOT files with those from the JSON outputs to determine the overlap. This is performed by calculating the intersection of node sets from both sources. Similarly, edges were compared between the two files, with the intersection of edge sets used to identify matches. Accuracy is computed as the ratio of matched nodes and edges to the total number of those in the ground truth. The following formulas depict how we calculated node and edge accuracy:
Node Accuracy = N_matched / N_ground_truth

Edge Accuracy = E_matched / E_ground_truth

where N_matched refers to the number of nodes in the intersection of the ground truth and the LLM’s output, and N_ground_truth refers to the number of nodes in the DOT language files. Conversely, E_matched is the number of edges in the model’s output that match those in the ground truth, while E_ground_truth is the total number of edges in the DOT language files.
Node accuracy reflects how well the LLM’s output aligns with the expected nodes, while edge accuracy assesses the correctness of the connections. Finally, we calculated the average accuracy across all compared files to measure overall performance. This method provides a comprehensive evaluation of the LLM’s ability to replicate the detailed structure of the ground truth graphs, ensuring both the correctness of individual components and the overall fidelity of the generated outputs.
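The node and edge accuracy computation described above can be sketched as follows; the simplified DOT parser (one statement per line) and the toy ground-truth and LLM outputs are illustrative assumptions, not the study’s actual parsing code or data:

```python
# Sketch of the node/edge accuracy metric: set intersections between the
# ground-truth DOT graph and the LLM-generated graph. The parser handles
# only simple "src -> dst;" statements; contents are illustrative.

import re

def parse_dot(dot_text):
    """Extract node ids and (src, dst) edges from a simple DOT digraph."""
    nodes, edges = set(), set()
    for line in dot_text.splitlines():
        line = line.strip().rstrip(";")
        m = re.match(r"(\w+)\s*->\s*(\w+)", line)
        if m:
            edges.add((m.group(1), m.group(2)))
            nodes.update(m.groups())
        elif re.fullmatch(r"\w+", line):  # bare node declaration
            nodes.add(line)
    return nodes, edges

def accuracy(ground, predicted):
    """Ratio of matched items to the number of ground-truth items."""
    return len(ground & predicted) / len(ground)

ground_dot = """digraph g {
    vaccinated -> symptoms;
    symptoms -> test;
    test -> manage;
}"""

# Nodes/edges parsed from the LLM's JSON output (here given directly)
llm_nodes = {"vaccinated", "symptoms", "test"}
llm_edges = {("vaccinated", "symptoms"), ("symptoms", "test")}

gt_nodes, gt_edges = parse_dot(ground_dot)
print(f"node accuracy: {accuracy(gt_nodes, llm_nodes):.2f}")  # 3 of 4 matched
print(f"edge accuracy: {accuracy(gt_edges, llm_edges):.2f}")  # 2 of 3 matched
```

Because both metrics divide by the ground-truth count, a missing node or edge lowers the score, matching the formulas given above.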

3. Results

We tested the suitability of our model for retrieving clinical processes based on contextual information with a hypothetical case of a patient presenting with cough and fever, disclosing a recent COVID-19 vaccination. As shown in Table 1, the LLM demonstrated promising results by accurately retrieving the diagnosis and suggesting an evaluation for COVID-19 and other relevant respiratory infections. It further advised observation at home unless symptoms necessitated immediate ER evaluation, showcasing its ability to provide nuanced and contextually appropriate recommendations.
In Table 1, we present an example of our results in which our LLM-CDSS suggests the steps that should be taken based on the given context, the patient’s symptoms of cough and fever, and recent COVID-19 vaccination within the past two weeks.
Nineteen additional CPMs were evaluated to measure the model’s token-level accuracy. The average node accuracy across the different pathways was 0.91 (SD ± 0.045), while the average edge accuracy was 0.92 (SD ± 0.122). In Table 2, we present the individual node and edge results per individual CPM.

4. Discussion

Compared to conventional approaches, LLM-based retrieval for clinical decision support reveals several advantages. Traditional methods for clinical decisions often involve the manual review of guidelines and consultation with other healthcare professionals, leading to potential delays and inefficiencies in patient care. Some promising alternatives include CDSSs. Current CDSSs assist physicians with diagnosis, disease management, prescription, and drug control [8] and are especially effective in increasing adherence to clinical guidelines and improving patient safety [2,3,7,8]. Their assistance becomes paramount, especially with the rising prevalence of chronic conditions, the emergence of new diseases, and the expansion of medical knowledge, as the demand for healthcare services and documentation increases, resulting in a higher volume of data usage [1,9].
Nevertheless, clinician satisfaction remains low due to excessive time consumption, workflow interruptions, suboptimal EHR integration, irrelevant recommendations, and poor user-friendliness [7,31]. Additionally, studies have demonstrated that physicians perceived that CDSSs increase their cognitive load [7,32] and feel no need for their assistance or disagree with their recommendations [33]. In contrast, LLMs can retrieve relevant information to respond appropriately to user inputs in a natural and fluent human-like conversation [22]. They perform well in information extraction despite not being specifically trained for that task, even outperforming some models in extracting medications and medical evidence [34]. LLMs’ rapid information retrieval and concise recommendations can save valuable clinical time, potentially enhancing overall efficiency and physician acceptance.
These models’ ability to engage in conversational interactions facilitates their implementation and use in clinical practice compared to regular AI-based CDSSs [14]. Moreover, while it is a difficult task to ensure that healthcare practitioners read, internalize, and implement new clinical guidelines, the rules implicit in them can be encoded into CDSSs [8] and therefore into LLMs. In this study, we demonstrated the effectiveness of LLMs in retrieving critical information from complex clinical processes, as exemplified by the accuracy of diagnosis and management retrieval in the case of a patient with cough and fever post-COVID-19 vaccination. The LLM’s ability to interpret nuanced patient information and provide relevant recommendations showcases its potential utility in clinical decision-making.
Fine-tuned and RAG LLMs have an improved ability to align with the task’s purpose, overcome the base model’s limitations, and be safer, less biased, and less harmful. For this reason, they are better suited for application in healthcare and medicine [12,35]. We leveraged a type of RAG technique that allowed us to embed LLaMA 2 with a specialized CPM based on Mayo Clinic’s clinical guidelines and recommendations. RAG enables systems to access external medical databases in real-time to support their knowledge and provide more accurate responses [25]. We used COVID-19 as an example due to its current relevance and rapidly evolving nature, making up-to-date knowledge crucial for precise and safe management. Our model not only accurately retrieved the diagnosis based on the patient’s presentation but also suggested relevant and contextually appropriate evaluations and follow-ups for proper management. Furthermore, it provided a step-by-step explanation of its train of thought, increasing transparency and explainability.
Our results correlate with those of Oniani et al. [36], who also incorporated COVID-19 clinical practice guidelines (CPGs) into different LLMs using three different methods and achieved significant improvements compared to base models. Meanwhile, additional studies have utilized RAG approaches to implement CPGs into different LLMs for nephrology [25], radiology [37], hepatology [38], neurology, gastroenterology, anesthesia and critical care, infectious diseases, and pediatrics [26], all proving to be superior to baseline models for providing accurate, comprehensive, and safe responses. Moreover, some were more time-efficient and less expensive than human personnel [37].
To ensure the precision and reliability of the model’s performance, we compared the token-level accuracy between LLM output and the ground truth. This meticulous comparison allows for a detailed assessment of how accurately the LLM captures and represents specific data elements, particularly critical in high-stakes fields such as healthcare and legal documentation. Token-level accuracy provides insights into the model’s ability to handle nuanced details and identify discrepancies that could have significant implications. By pinpointing and addressing specific areas of deviation, this approach facilitates targeted refinements and enhances the overall quality of the model’s outputs. Furthermore, validating token-level accuracy ensures that the LLM adheres to established standards and benchmarks, reinforcing its applications’ credibility and trustworthiness. Such rigorous evaluation is indispensable for integrating LLMs into critical systems and decision-making processes, where precision and consistency are crucial. With a node and edge accuracy of 0.91 and 0.92, we can determine that the model appropriately understands individual clinical characteristics and their relationships.
LLMs’ potential to become specialized tools, in addition to their natural understanding of quotidian and technical language and embedded logical capacities [11,34,39,40], is particularly important in complex and ambiguous cases where traditional methods might struggle to provide timely and accurate guidance. Our study underscores their potential to augment clinical decision-making processes, such as in managing contagious diseases and other time-sensitive medical conditions. By leveraging LLMs, healthcare providers can streamline information retrieval processes, enhancing patient care and outcomes, as illustrated in Figure 3. Additionally, LLM-based retrieval has the potential to reduce healthcare costs by optimizing resource utilization and minimizing the need for unnecessary diagnostic tests or consultations [13,39,41], ultimately leading to more efficient healthcare delivery and benefiting both patients and healthcare systems [37].
Our results demonstrate the utility of using DOT language for the accurate and efficient retrieval of clinical practice guidelines, in this case, Care Pathway Models. We tested the model on different clinical scenarios, including infectious diseases, chronic diseases, acute and emergent conditions, and medication toxicity across different specialties. This showcases the broad, versatile, and generalizable applicability of LLMs with our proposed framework for clinical decision support across the medical field. Furthermore, this method also opens a broader path as translating clinical or business rules into DOT language establishes a common framework that bridges these rules with LLMs. By implementing this method to embed clinical guidelines or management algorithms into language models, they become capable of being used in a wide array of clinical settings. Depending on the type of clinical process embedded, LLM gains the ability to provide differential diagnoses, treatment options, pertinent laboratory tests and imaging recommendations, and potential prognoses. Given the versatility of DOT language, the models are not limited by specialty or institution, as demonstrated by our results, where the model demonstrated very high accuracy when tested in several clinical settings, from the emergency department to postoperative settings.
Beyond the immediate benefits highlighted in this work, this approach harnesses powerful computing capabilities to work alongside LLMs, enabling the processing of vast and complex rule-based systems in large hospital networks, such as the Mayo Clinic. This integration facilitates a seamless connection between administrative and clinical operations, significantly reducing operational costs and, most importantly, freeing physicians from administrative tasks, thereby allowing them to devote more time to patient care [12,42].
An additional strength of this approach is that DOT language-encoded business rules are both machine-readable (e.g., by LLMs) and human-readable. This transparency offers substantial advantages in scenarios such as improving and optimizing existing clinical or administrative rules. On top of that, with the assistance of generative AI, this method enables the creation of novel, more efficient rules, thereby enhancing the overall rule-authoring process. This dual readability not only ensures accuracy and consistency but also fosters collaboration among stakeholders, enabling the continuous improvement and adaptation of rules to meet evolving clinical and administrative needs. The combination of DOT language and LLMs thus represents a powerful tool in modern healthcare management, driving innovation and excellence in patient care delivery.
Take, for instance, the creation of new CPMs or updating old ones. At our institution, it is estimated that it may take several months, including gathering subject matter experts (SMEs), building consensus and algorithms, and time for revision. This means that while short CPMs may take up to two weeks, more complex ones can take up to seven months or approximately 100 continuous hours. Conversely, we experimented with the model’s ability to create a new CPM for type 2 diabetes mellitus management based on a straight DOT language structure and obtained an accurate algorithm in just a few minutes. Moreover, when utilizing this new CPM to provide specific patient recommendations for a patient with HbA1c results higher than 7.0%, the LLM retrieved an accurate, straightforward step-by-step guide.
Despite the promising results, there are limitations to consider. It is important to note that the effectiveness of LLMs, and therefore of LLM-based retrieval, is contingent upon high-quality input data and continual model refinement. As these models’ responses are influenced by the data they are trained on, they are subject to reproducing and perpetuating the biases embedded in them [43,44,45]. For instance, if the model is embedded only with CPMs based on a static CPG or on specific patient populations, the model may underperform when implemented in another clinical setting or provide inaccurate advice for patients from minority populations or with extreme clinical conditions [46,47,48,49]. Furthermore, if the models lack contextual understanding, as base models often do, they may generate convincing responses without the proper specificity needed for accurate decision-making, a phenomenon known as hallucination [14]. Although in one study assessing LLMs’ hallucinations, LLaMA and Falcon outperformed commercial models such as GPT-3.5, they still hallucinated and produced inaccurate responses, demonstrating the need for further improvement [50].
According to Harrer [51], we must consider six ethical principles when implementing LLMs in clinical practice: accountability, fairness, data privacy and selection, transparency, explainability, and value and purpose alignment. While these models can streamline decision-making, they must not replace the judgment of physicians who will maintain the final responsibility for patient care. Additionally, LLMs must adhere to patient privacy regulations (e.g., HIPAA) to prevent unauthorized access or misuse of sensitive health data. Ensuring robust encryption, strict access controls, and comprehensive audit trails is essential to maintaining patient trust and confidentiality, especially if LLMs will rely on extensive external datasets for their knowledge bases. While this remains a subject of ongoing research, strategies such as prompt engineering, fine-tuning, and RAG may offer a potential solution [25]. Nevertheless, constant model training and validation remain essential to maintaining accuracy and adaptability to evolving medical knowledge and practices.
By providing LLMs with an external curated knowledge database, such as in our methodology and additional RAG approaches, they have access to specialized and up-to-date sources of information that can be tailored to each different clinical setting. This not only improves the models’ accuracy and reduces hallucinations but also increases the interpretability and explainability of the responses, as they are primarily based on information that can be guaranteed to be truthful and accurate [25,52,53,54,55,56]. As shown in Table 1, when a clinician queries the system, the model not only provides a recommendation but also points to a particular section of the CPM. This traceability helps users understand why the model made a certain suggestion. Additionally, this reduces the portion of the model’s logic hidden within its neural algorithm. Moreover, by embedding structured clinical rules and care pathways, the models can present relevant snippets that directly inform the output, allowing decision-makers to see the contextual reasoning steps. However, it is essential to implement human-in-the-loop approaches, where humans constantly evaluate the veracity, safety, and relevance of the information provided to the model and its performance [57].
Finally, it is essential to mention that our results are limited to a few of Mayo Clinic’s Care Pathway Models, limiting their generalizability to other diseases and institutions. Future research in this area could focus on enhancing the interpretability of LLMs in clinical settings. Developing methods to explain the reasoning behind LLM-based recommendations can improve trust and acceptance among healthcare professionals. Exploring new ways to integrate LLMs into existing clinical workflows and decision support systems can further enhance their utility in healthcare settings.

5. Recommendations for Healthcare Organizations

For healthcare organizations looking to implement LLM-based information retrieval, several recommendations can be made.
  • Data Quality: Ensure that the input data used to train LLMs are of high quality and accurately reflect clinical processes and guidelines [12,58].
  • Model Selection: Choose an LLM that is well suited for the specific healthcare application, considering factors such as model performance, scalability, and interpretability [11,16,17,18,40].
  • Integration: Integrate LLMs into existing clinical workflows and decision support systems to maximize their impact and usability [23].
  • Training and Validation: Continuously train and validate LLMs to ensure their accuracy and adaptability to evolving medical knowledge and practices [12,58,59].
  • Collaboration: Collaborate with healthcare professionals and experts to refine LLM-based information retrieval processes and ensure their relevance and effectiveness in clinical settings [12,57,59].
By following these recommendations, healthcare organizations can harness the power of LLMs to enhance information retrieval processes, improve clinical decision-making, and, ultimately, enhance patient care and outcomes.

6. Conclusions

This study highlights the potential of LLMs in enhancing information retrieval in clinical processes. The results demonstrate the effectiveness of LLMs in accurately retrieving critical information and providing contextually relevant recommendations, showcasing their utility in clinical decision-making and operational efficiency in healthcare. By addressing limitations and exploring future research directions, LLMs can become valuable and powerful tools in enhancing patient care and optimizing healthcare delivery.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering12010017/s1, Sample of Care Pathway Model (AskMayoExpert): Code in DOT language.

Author Contributions

Conceptualization, Y.Y., C.A.G.-C., S.M., Y.P. and A.J.F.; methodology, Y.Y., C.A.G.-C. and Y.P.; software, Y.Y.; validation, Y.Y., C.A.G.-C. and Y.P.; formal analysis, all authors; investigation, all authors; resources, Y.Y., S.M., Y.P. and A.J.F.; data curation, Y.Y. and Y.P.; writing—original draft preparation, Y.Y. and C.A.G.-C.; writing—review and editing, all authors; supervision, Y.Y. and A.J.F.; project administration, Y.Y. and A.J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

CDSSs: Clinical Decision Support Systems
CPGs: Clinical practice guidelines
CPM: Care Pathway Model
EHRs: Electronic Health Records
JSON: JavaScript Object Notation
LLM: Large language model
RAG: Retrieval-Augmented Generation
SMEs: Subject matter experts

References

  1. Papadopoulos, P.; Soflano, M.; Chaudy, Y.; Adejo, W.; Connolly, T.M. A systematic review of technologies and standards used in the development of rule-based clinical decision support systems. Health Technol. 2022, 12, 713–727. [Google Scholar] [CrossRef]
2. Litvin, C.B.; Ornstein, S.M.; Wessell, A.M.; Nemeth, L.S.; Nietert, P.J. Adoption of a clinical decision support system to promote judicious use of antibiotics for acute respiratory infections in primary care. Int. J. Med. Inform. 2012, 81, 521–526.
3. Cricelli, I.; Marconi, E.; Lapi, F. Clinical Decision Support System (CDSS) in primary care: From pragmatic use to the best approach to assess their benefit/risk profile in clinical practice. Curr. Med. Res. Opin. 2022, 38, 827–829.
4. Jiang, F.; Deng, L.; Zhang, L.; Cai, Y.; Cheung, C.W.; Xia, Z. Review of the Clinical Characteristics of Coronavirus Disease 2019 (COVID-19). J. Gen. Intern. Med. 2020, 35, 1545–1549.
5. WHO. WHO COVID-19 Dashboard. Available online: https://data.who.int/dashboards/covid19/cases?n=c (accessed on 15 May 2024).
6. Ahmed, F.; Hossain, M.S.; Islam, R.U.; Andersson, K. An Evolutionary Belief Rule-Based Clinical Decision Support System to Predict COVID-19 Severity under Uncertainty. Appl. Sci. 2021, 11, 5810.
7. Gomez-Cabello, C.A.; Borna, S.; Pressman, S.; Haider, S.A.; Haider, C.R.; Forte, A.J. Artificial-Intelligence-Based Clinical Decision Support Systems in Primary Care: A Scoping Review of Current Clinical Implementations. Eur. J. Investig. Health Psychol. Educ. 2024, 14, 685–698.
8. Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An overview of clinical decision support systems: Benefits, risks, and strategies for success. NPJ Digit. Med. 2020, 3, 17.
9. Ramgopal, S.; Sanchez-Pinto, L.N.; Horvat, C.M.; Carroll, M.S.; Luo, Y.; Florin, T.A. Artificial intelligence-based clinical decision support in pediatrics. Pediatr. Res. 2023, 93, 334–341.
10. Peiffer-Smadja, N.; Rawson, T.M.; Ahmad, R.; Buchard, A.; Georgiou, P.; Lescure, F.X.; Birgand, G.; Holmes, A.H. Machine learning for clinical decision support in infectious diseases: A narrative review of current applications. Clin. Microbiol. Infect. 2020, 26, 584–595.
11. Pressman, S.M.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Forte, A.J. AI in Hand Surgery: Assessing Large Language Models in the Classification and Management of Hand Injuries. J. Clin. Med. 2024, 13, 2832.
12. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776.
13. Al Nazi, Z.; Peng, W. Large language models in healthcare and medical domain: A review. arXiv 2023, arXiv:2401.06775.
14. Ferdush, J.; Begum, M.; Hossain, S.T. ChatGPT and clinical decision support: Scope, application, and limitations. Ann. Biomed. Eng. 2023, 52, 1119–1124.
15. Miao, J.; Thongprayoon, C.; Fulop, T.; Cheungpasitporn, W. Enhancing clinical decision-making: Optimizing ChatGPT’s performance in hypertension care. J. Clin. Hypertens. 2024, 26, 588–593.
16. Borna, S.; Gomez-Cabello, C.A.; Pressman, S.M.; Haider, S.A.; Forte, A.J. Comparative Analysis of Large Language Models in Emergency Plastic Surgery Decision-Making: The Role of Physical Exam Data. J. Pers. Med. 2024, 14, 612.
17. Gomez-Cabello, C.A.; Borna, S.; Pressman, S.M.; Haider, S.A.; Forte, A.J. Large Language Models for Intraoperative Decision Support in Plastic Surgery: A Comparison between ChatGPT-4 and Gemini. Medicina 2024, 60, 957.
18. Haider, S.A.; Pressman, S.M.; Borna, S.; Gomez-Cabello, C.A.; Sehgal, A.; Leibovich, B.C.; Forte, A.J. Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems. Diagnostics 2024, 14, 1491.
19. Liu, S.; Wright, A.P.; Patterson, B.L.; Wanderer, J.P.; Turer, R.W.; Nelson, S.D.; McCoy, A.B.; Sittig, D.F.; Wright, A. Assessing the value of ChatGPT for clinical decision support optimization. medRxiv 2023.
20. Benary, M.; Wang, X.D.; Schmidt, M.; Soll, D.; Hilfenhaus, G.; Nassir, M.; Sigler, C.; Knodler, M.; Keller, U.; Beule, D.; et al. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw. Open 2023, 6, e2343689.
21. Wang, G.; Gao, K.; Liu, Q.; Wu, Y.; Zhang, K.; Zhou, W.; Guo, C. Potential and Limitations of ChatGPT 3.5 and 4.0 as a Source of COVID-19 Information: Comprehensive Comparative Analysis of Generative and Authoritative Information. J. Med. Internet Res. 2023, 25, e49771.
22. Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Dou, Z.; Wen, J.-R. Large language models for information retrieval: A survey. arXiv 2023, arXiv:2308.07107.
23. Hasan, W.U.; Zaman, K.T.; Wang, X.; Li, J.; Xie, B.; Tao, C. Empowering Alzheimer’s caregivers with conversational AI: A novel approach for enhanced communication and personalized support. NPJ Biomed. Innov. 2024, 1, 3.
24. Lakatos, R.; Pollner, P.; Hajdu, A.; Joo, T. Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems. arXiv 2024, arXiv:2403.09727.
25. Miao, J.; Thongprayoon, C.; Suppadungsuk, S.; Garcia Valencia, O.A.; Cheungpasitporn, W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina 2024, 60, 445.
26. Zakka, C.; Shad, R.; Chaurasia, A.; Dalal, A.R.; Kim, J.L.; Moor, M.; Fong, R.; Phillips, C.; Alexander, K.; Ashley, E.; et al. Almanac—Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 2024, 1, AIoa2300068.
27. BioRender. Available online: https://www.biorender.com/ (accessed on 16 May 2024).
28. Mayo Foundation for Medical Education and Research. COVID-19 Navigator. AskMayoExpert Website. Available online: https://askmayoexpert.mayoclinic.org/navigator/covid-19 (accessed on 15 May 2024).
29. Gansner, E.; Koutsofios, E.; North, S. Drawing Graphs with Dot; 2006. Available online: https://www.graphviz.org/pdf/dotguide.pdf (accessed on 15 May 2024).
30. Parkulo, M.A.; Post, J.A.; Ristagno, E.H.; Tande, A.J.; Eggers, S.D.; Wald, M.K. COVID-19 Plus Seasonal Illness: Outpatient Testing (Adult). Available online: https://askmayoexpert.mayoclinic.org/topic/clinical-answers/prt-20503524/cpm-20522078 (accessed on 15 May 2024).
31. Moxey, A.; Robertson, J.; Newby, D.; Hains, I.; Williamson, M.; Pearson, S.A. Computerized clinical decision support for prescribing: Provision does not guarantee uptake. J. Am. Med. Inform. Assoc. 2010, 17, 25–33.
32. Harada, T.; Miyagami, T.; Kunitomo, K.; Shimizu, T. Clinical Decision Support Systems for Diagnosis in Primary Care: A Scoping Review. Int. J. Environ. Res. Public Health 2021, 18, 8435.
33. Meunier, P.Y.; Raynaud, C.; Guimaraes, E.; Gueyffier, F.; Letrilliart, L. Barriers and Facilitators to the Use of Clinical Decision Support Systems in Primary Care: A Mixed-Methods Systematic Review. Ann. Fam. Med. 2023, 21, 57–69.
34. Agrawal, M.; Hegselmann, S.; Lang, H.; Kim, Y.; Sontag, D. Large language models are few-shot clinical information extractors. arXiv 2022, arXiv:2205.12689.
35. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; Hu, X. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 2024, 18, 1–32.
36. Oniani, D.; Wu, X.; Visweswaran, S.; Kapoor, S.; Kooragayalu, S.; Polanska, K.; Wang, Y. Enhancing Large Language Models for Clinical Decision Support by Incorporating Clinical Practice Guidelines. arXiv 2024, arXiv:2401.11120.
37. Rau, A.; Rau, S.; Zoeller, D.; Fink, A.; Tran, H.; Wilpert, C.; Nattenmueller, J.; Neubauer, J.; Bamberg, F.; Reisert, M.; et al. A Context-based Chatbot Surpasses Trained Radiologists and Generic ChatGPT in Following the ACR Appropriateness Guidelines. Radiology 2023, 308, e230970.
38. Ge, J.; Sun, S.; Owens, J.; Galvez, V.; Gologorskaya, O.; Lai, J.C.; Pletcher, M.J.; Lai, K. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology 2024, 80, 1158–1168.
39. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerova, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142.
40. Gomez-Cabello, C.A.; Borna, S.; Pressman, S.M.; Haider, S.A.; Sehgal, A.; Leibovich, B.C.; Forte, A.J. Artificial Intelligence in Postoperative Care: Assessing Large Language Models for Patient Recommendations in Plastic Surgery. Healthcare 2024, 12, 1083.
41. Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 2023, 15, e40895.
42. Tripathi, S.; Sukumaran, R.; Cook, T.S. Efficient healthcare with large language models: Optimizing clinical workflow and enhancing patient care. J. Am. Med. Inform. Assoc. 2024, 31, 1436–1440.
43. Leslie, D.; Mazumder, A.; Peppin, A.; Wolters, M.K.; Hagerty, A. Does “AI” stand for augmenting inequality in the era of covid-19 healthcare? BMJ 2021, 372, n304.
44. Pressman, S.M.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Haider, C.; Forte, A.J. AI and Ethics: A Systematic Review of the Ethical Considerations of Large Language Model Use in Surgery Research. Healthcare 2024, 12, 825.
45. Zaidi, D.; Miller, T. Implicit Bias and Machine Learning in Health Care. South. Med. J. 2023, 116, 62–64.
46. Shekelle, P.G. Clinical Practice Guidelines: What’s Next? JAMA 2018, 320, 757–758.
47. Shaneyfelt, T.M.; Centor, R.M. Reassessment of Clinical Practice Guidelines: Go Gently Into That Good Night. JAMA 2009, 301, 868–869.
48. Morris, Z.S.; Wooding, S.; Grant, J. The answer is 17 years, what is the question: Understanding time lags in translational research. J. R. Soc. Med. 2011, 104, 510–520.
49. Guerra-Farfan, E.; Garcia-Sanchez, Y.; Jornet-Gibert, M.; Nunez, J.H.; Balaguer-Castro, M.; Madden, K. Clinical practice guidelines: The good, the bad, and the ugly. Injury 2023, 54 (Suppl. S3), S26–S29.
50. Umapathi, L.K.; Pal, A.; Sankarasubbu, M. Med-HALT: Medical domain hallucination test for large language models. arXiv 2023, arXiv:2307.15343.
51. Harrer, S. Attention is not all you need: The complicated case of ethically using large language models in healthcare and medicine. EBioMedicine 2023, 90, 104512.
52. Zhang, T.; Patil, S.G.; Jain, N.; Shen, S.; Zaharia, M.; Stoica, I.; Gonzalez, J.E. RAFT: Adapting language model to domain specific RAG. arXiv 2024, arXiv:2403.10131.
53. Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval augmentation reduces hallucination in conversation. arXiv 2021, arXiv:2104.07567.
54. Ong, J.C.L.; Jin, L.; Elangovan, K.; Lim, G.Y.S.; Lim, D.Y.Z.; Sng, G.G.R.; Ke, Y.; Tung, J.Y.M.; Zhong, R.J.; Koh, C.M.Y. Development and Testing of a Novel Large Language Model-Based Clinical Decision Support Systems for Medication Safety in 12 Clinical Specialties. arXiv 2024, arXiv:2402.01741.
55. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv 2021, arXiv:2005.11401.
56. Huang, Y.; Huang, J. A Survey on Retrieval-Augmented Text Generation for Large Language Models. arXiv 2024, arXiv:2404.10981.
57. Afzal, A.; Kowsik, A.; Fani, R.; Matthes, F. Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human in the Loop. arXiv 2024, arXiv:2407.05925.
58. Park, Y.J.; Pillai, A.; Deng, J.; Guo, E.; Gupta, M.; Paget, M.; Naugler, C. Assessing the research landscape and clinical utility of large language models: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 72.
59. Mesko, B.; Topol, E.J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit. Med. 2023, 6, 120.
Figure 1. A summary of the methodology for using a large language model to retrieve clinical information. First, clinical guidelines are selected based on Mayo Clinic consensus. Second, the guidelines are encoded in the DOT language (see Supplementary Material). Third, the encoded guidelines are embedded using the HuggingFace SentenceTransformer. Finally, the large language model interprets the encoded processes and identifies the critical steps for diagnosis and management. Created in BioRender [27].
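The embedding-and-retrieval step described in the Figure 1 caption can be sketched as follows. The study embedded DOT-encoded pathways with the HuggingFace SentenceTransformer; here a toy term-frequency embedding stands in so the example runs without model downloads, and the pathway names and DOT fragments are hypothetical, not the actual CPMs.

```python
import math
import re
from collections import Counter

# Stand-in for SentenceTransformer embeddings: a bag-of-words vector.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical DOT-encoded pathway fragments.
pathways = {
    "covid_testing": 'digraph { n1 [label="Fever or cough?"]; n1 -> n2 [label="YES"]; }',
    "pph": 'digraph { n1 [label="Estimated blood loss exceeds threshold?"]; }',
}
corpus = {name: embed(dot) for name, dot in pathways.items()}

def retrieve(query: str) -> str:
    """Return the pathway whose embedding is closest to the query embedding."""
    q = embed(query)
    return max(corpus, key=lambda name: cosine(q, corpus[name]))

print(retrieve("patient with fever and cough"))  # covid_testing
```

In the study's pipeline, the retrieved pathway text is then passed to LLaMA 2 as context for answering the clinical query.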
Figure 2. An example of a partial Care Pathway Model (CPM) for COVID-19 testing in adults, adapted from the original algorithm obtained from AskMayoExpert.
Figure 3. The use of large language models in clinical practice. The patient comes to the clinic or emergency room and is evaluated by the clinician (1), who concurrently inputs the clinical presentation to the large language model-based clinical decision system (2). The system provides individualized recommendations for diagnosis and treatment based on its specialized training data (3), which will ultimately help the physician determine the appropriate management (4). Created in BioRender [27].
Table 1. A sample output of the large language model (LLaMA-2). Please refer to the Supplementary Material for the node names. CPM, Care Pathway Model.
Query:
A patient, female, 55 years old, visit her physician. Her chief complaints are fever and cough. She said that she had vaccinations two weeks ago. Please give me the next step based on the CPM and the given information.
Response from LLM:
Based on the context provided, the next steps for a patient with a chief complaint of cough and fever who has had COVID-19 vaccination within the past 14 days would be:
  • Go to node 2 (COVID-19 vaccination in past 72 h?)
  • Since the vaccination was given more than 72 h ago, follow the “NO” path to node 5 (Does patient meet criteria for consideration of COVID-19 or other testing?)
  • Since the patient has symptoms of cough and fever, follow the “YES” path to node 5 (Has patient tested positive for COVID-19 in the past 30 days?)
  • Since we do not have this information, assume “NO” for now and follow the “NO” path to node 16 (Select a pathway based on the presence of fever or symptoms unrelated to a pre-existing condition in the past 48 h)
  • Since the patient has fever, follow the “16 -> 18” path (Unexplained symptoms)
  • Since we do not have enough information to determine if the patient has unexplained symptoms, assume “NO” for now and follow the “NO” path to node 19 (No unexplained symptoms)
  • Since we do not have enough information to determine if the patient has had close contact with a person with confirmed COVID-19 in the past 10 days, assume “NO” for now and follow the “NO” path to node 176 (Has patient tested positive for COVID-19 in the past 30 days?)
  • Since we do not have this information, assume “NO” for now and follow the “NO” path to node 187 (If close contact exposure: Quarantine: Not required Testing: Home antigen testing 5 days after last contact Masking: 10 days after last contact Elective outpatient visits: Allowed)
  • Since we do not have enough information to determine if the patient has had close contact with a person with confirmed COVID-19 or if they have unexplained symptoms or if they have had influenza or group A streptococcal pharyngitis, further evaluation and testing may be necessary based on clinical judgment and the patient’s symptoms and history.
Sources: CPM-20522078 (https://askmayoexpert.mayoclinic.org/topic/clinical-answers/prt-20503524/cpm-20522078, accessed on 15 May 2024.)
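The stepwise response above amounts to following labeled edges through a DOT graph. A minimal, runnable sketch of that traversal, using a hypothetical pathway fragment (the real CPM node labels live in the Supplementary Material):

```python
import re

# A tiny DOT-encoded decision fragment (hypothetical labels).
dot = """
digraph CPM {
  n1 [label="COVID-19 vaccination in past 72 h?"];
  n2 [label="Meets criteria for COVID-19 testing?"];
  n3 [label="Order COVID-19 test"];
  n4 [label="Routine care"];
  n1 -> n2 [label="NO"];
  n2 -> n3 [label="YES"];
  n2 -> n4 [label="NO"];
}
"""

# Node lines start with an identifier followed directly by [label=...];
# edge lines contain '->' and so do not match this line-anchored pattern.
nodes = dict(re.findall(r'^\s*(\w+)\s*\[label="([^"]+)"\];', dot, re.M))

edges = {}  # (source, answer) -> target
for src, dst, ans in re.findall(r'(\w+)\s*->\s*(\w+)\s*\[label="([^"]+)"\]', dot):
    edges[(src, ans)] = dst

def walk(start, answers):
    """Follow the pathway from `start`, consuming one YES/NO answer per branch."""
    node, trace = start, [start]
    for ans in answers:
        node = edges.get((node, ans))
        if node is None:
            break
        trace.append(node)
    return trace

print([nodes[n] for n in walk("n1", ["NO", "YES"])])
```

The LLM's job in the study is effectively to reproduce such a trace from the embedded DOT text plus the patient vignette, rather than to execute the graph deterministically as this sketch does.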
Table 2. The token-level accuracy of LLM outputs compared against the ground truth. The accuracies are measured at the node and edge to delineate the capability of LLM in understanding nodes and relationships (edges).
CPM Name | Node Accuracy | Edge Accuracy
Screening for anal dysplasia and cancer in people living with HIV | 0.93 | 1.00
Nonalcoholic fatty liver disease | 0.89 | 1.00
Mpox | 0.94 | 1.00
Symptomatic severe tricuspid regurgitation: Indications for referral | 0.92 | 0.93
Phenytoin or fosphenytoin order alert logic | 0.87 | 1.00
Pediatric pain management: First-line analgesics, adjunctive therapies, and opioid options | 0.90 | 0.73
COVID-19: Outpatient management (child) | 0.91 | 1.00
Diabetic ketoacidosis or hyperosmolar hyperglycemia state in pregnancy | 0.90 | 1.00
Postpartum hemorrhage (PPH) | 0.94 | 0.71
Differentiated thyroid cancer: Postoperative risk stratification | 0.97 | 0.62
Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) | 0.88 | 1.00
Emergency department and inpatient management of atrial fibrillation with rapid ventricular response | 0.95 | 0.79
Management of elevated anion gap acidosis | 0.94 | 0.94
Belzutifan alert logic | 0.85 | 1.00
Sacituzumab-govitecan order alert logic | 0.85 | 1.00
Differentiated thyroid cancer: Radioiodine whole body scan and guide to subsequent management | 0.96 | 0.83
Tricuspid regurgitation | 0.97 | 0.86
COVID-19: Postinfection return to physical activity and sports (child) | 0.89 | 1.00
Preoperative medication management | 0.80 | 1.00
Average Accuracy | 0.91 | 0.92
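Conceptually, the scores in Table 2 compare the model's predicted traversal against a ground-truth traversal at both the node and edge level. The paper measures token-level accuracy; the set-overlap metric and the node IDs below are simplified illustrations of that idea, not the authors' exact procedure.

```python
# Sketch of node/edge accuracy against a ground-truth traversal.
def accuracy(predicted, truth):
    """Fraction of ground-truth items that appear in the prediction."""
    if not truth:
        return 1.0
    return len(set(predicted) & set(truth)) / len(set(truth))

# Hypothetical traversals (node IDs are illustrative).
truth_nodes = ["n1", "n2", "n5", "n16"]
truth_edges = [("n1", "n2"), ("n2", "n5"), ("n5", "n16")]

pred_nodes = ["n1", "n2", "n5", "n19"]      # one node wrong
pred_edges = [("n1", "n2"), ("n2", "n5")]   # one edge missing

node_acc = accuracy(pred_nodes, truth_nodes)  # 3/4 = 0.75
edge_acc = accuracy(pred_edges, truth_edges)  # 2/3
print(round(node_acc, 2), round(edge_acc, 2))
```

Averaging such per-pathway scores over the 19 CPMs yields the summary figures reported in the table (0.91 for nodes, 0.92 for edges).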
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yu, Y.; Gomez-Cabello, C.A.; Makarova, S.; Parte, Y.; Borna, S.; Haider, S.A.; Genovese, A.; Prabha, S.; Forte, A.J. Using Large Language Models to Retrieve Critical Data from Clinical Processes and Business Rules. Bioengineering 2025, 12, 17. https://doi.org/10.3390/bioengineering12010017
