Article

Accuracy Evaluation of GPT-Assisted Differential Diagnosis in Emergency Department

by Fatemeh Shah-Mohammadi and Joseph Finkelstein *
Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT 84112, USA
* Author to whom correspondence should be addressed.
Diagnostics 2024, 14(16), 1779; https://doi.org/10.3390/diagnostics14161779
Submission received: 9 July 2024 / Revised: 10 August 2024 / Accepted: 13 August 2024 / Published: 15 August 2024
(This article belongs to the Special Issue AI-Assisted Diagnostics in Telemedicine and Digital Health)

Abstract

In emergency department (ED) settings, rapid and precise diagnostic evaluations are critical to ensure better patient outcomes and efficient healthcare delivery. This study assesses the accuracy of differential diagnosis lists generated by the third-generation ChatGPT (ChatGPT-3.5) and the fourth-generation ChatGPT (ChatGPT-4) based on electronic health record notes recorded within the first 24 h of ED admission. These models process unstructured text to formulate a ranked list of potential diagnoses. The accuracy of these models was benchmarked against actual discharge diagnoses to evaluate their utility as diagnostic aids. Results indicated that both GPT-3.5 and GPT-4 predicted diagnoses at the body system level with reasonable accuracy, with GPT-4 slightly outperforming its predecessor. However, their performance at the more granular category level was inconsistent, often showing decreased precision. Notably, GPT-4 demonstrated improved accuracy in several critical categories, underscoring its advanced capabilities in managing complex clinical scenarios.

1. Introduction

Diagnostic errors represent a major contributor to iatrogenic harm across all clinical settings including the emergency department (ED). Diagnostic errors surpass the combined morbidity and mortality associated with all other medical errors [1]. Annually, at least 12 million diagnostic errors occur in the United States, with numbers potentially being much higher [1,2]. Misdiagnosis-related harms, affecting between approximately 40,000 and 10 million Americans each year, range from minor to catastrophic, including permanent disability or death [1]. The economic burden of these errors on the U.S. healthcare system could surpass USD 100 billion annually [3,4].
Although diagnostic errors significantly impact patient safety and societal well-being, they often go unnoticed. Typically, these errors are not immediately evident; they usually surface later and are often identified by a different clinician or following incidents of misdiagnosis-related harm. This highlights the urgent need for enhanced diagnostic tools to support physicians in making accurate diagnoses [5]. Among the promising technologies being explored are clinical decision support (CDS) systems, including symptom checkers [6] and differential diagnosis generators [7]. Symptom checkers are designed primarily for the public, whereas differential diagnosis generators cater to healthcare professionals. The inception of computer-aided healthcare traces back to the early 1970s, marked by a strong interest in utilizing computational power to improve care quality. Historically, CDS tools have integrated multiple processes—logical or computational methods, probabilistic evaluations, and heuristic techniques—with many applications employing a combination of algorithms and heuristic rules [8,9]. Despite their potential to enhance diagnostic precision and efficiency, these systems often increase the workload for clinicians [10], largely due to the need for structured input data, which can hinder widespread adoption. In this landscape, artificial intelligence (AI) emerges as a viable alternative in providing healthcare support [11]. AI systems apply advanced algorithms, machine learning techniques, and statistical pattern recognition to process and analyze medical data. Unlike the rule-based counterparts, the modern AI-based systems are designed to evolve, continuously updating and refining their capabilities and outputs based on new data. The integration of AI based on deep learning into CDS systems is accelerating, underscoring the increasing reliance on sophisticated technologies in the healthcare sector. In particular, the advent of generative AI through Large Language Models (LLMs) has been transformative, significantly enhancing diagnostic accuracy, treatment planning, and overall patient care [12,13,14]. These AI systems mimic human cognitive processes and learn autonomously from continuous streams of new data. As they assimilate vast amounts of complex patient information, AI-enhanced CDS systems provide healthcare professionals with invaluable insights, thereby improving the efficacy of clinical decision making and leading to better patient outcomes. This ongoing evolution marks a significant shift in how healthcare systems leverage technology to meet the demands of modern medicine [15,16].
Among AI technologies, LLMs represent a sophisticated class of AI algorithms that have been meticulously trained on vast amounts of textual data. This training enables them to process and produce human-like text, which is proving invaluable in the field of medical diagnostics. The ability of these models to parse and synthesize complex information allows them to provide insights that were previously beyond the reach of automated systems. Tools like Google’s Bard (now Gemini) [17,18], Meta AI’s LLM Meta AI 2 (LLaMA2) [19], and OpenAI’s Chat Generative Pre-Trained Transformer (ChatGPT) [20] are prominent examples of such technologies now accessible to the public and the medical community alike. These generative AI tools are not only innovative due to their underlying technology, but also because of their demonstrated competence in practical applications [21]. Remarkably, models such as ChatGPT have been tested against rigorous standards such as national medical licensing examinations, where they have performed successfully without specific training tailored to these exams. This accomplishment underscores the potential of LLMs to significantly advance medical diagnostics. ChatGPT, in particular, has been the subject of extensive research within the healthcare sector, distinguishing itself as a leader in the application of generative AI [22,23].
Since its November 2022 launch, ChatGPT has rapidly gained significant attention, achieving over 100 million monthly active users within just two months of its release [24]. This success underscores the public’s engagement with the platform. GPT-3.5-turbo and its more advanced iteration, GPT-4, are LLMs that provide chat-based interaction capabilities, allowing for sophisticated responses to complex inquiries and problem-solving tasks [20]. Originally developed as versatile, general-purpose models, there is a growing interest in their applicability to specialized fields, including clinical settings. Preliminary investigations into their utility reveal that GPT-3.5-turbo can generate broadly suitable recommendations for simple cardiovascular disease prevention [25]. Furthermore, in the context of a public social media platform, GPT’s responses to health-related questions were not only favored but also perceived as more empathetic when compared to those from medical professionals [26]. Its efficacy has also been highlighted in studies where it has been used to respond to clinical vignette questions, showcasing its ability to operate as a powerful diagnostic tool [27]. The recent review [28] assessed the capability of ChatGPT to provide medical information on topics frequently discussed by patients with inflammatory bowel disease with their gastroenterologists. Various studies have evaluated how well these AI tools perform in generating differential diagnosis lists, a critical step in the diagnostic process [29,30,31]. While some studies cast a wide net by comparing multiple state-of-the-art models, our study is aimed at assessing the unique capabilities of the zero-shot prompting of ChatGPT-3.5 and ChatGPT-4 to generate differential diagnostic lists for patients admitted to the emergency department (ED) with the purpose of exploring their potential effectiveness as diagnostic aids.

2. Materials and Methods

2.1. Dataset

The main data source for analysis in this paper is the Medical Information Mart for Intensive Care III (MIMIC-III) dataset. This dataset is a widely utilized and comprehensive source of de-identified healthcare data. It contains detailed clinical information from over 60,000 critical care patients admitted to the Beth Israel Deaconess Medical Center (Boston, MA, USA) spanning a period of nearly a decade. This rich dataset includes electronic health records, lab results, prescription records, and clinical notes, making it a valuable resource for medical research, particularly in the fields of critical care, epidemiology, and health informatics. Within the dataset, our analysis was focused on patients who were admitted to the ED, and among all notes documented for these patients, we only used the notes that were documented within the first 24-hour period following admission to the ED. There were 17,971 unique patients admitted to the ED with 22,754 unique admissions. Of 17,971 unique patients, 2758 patients were admitted to the ED more than once. We considered 3000 random admissions.
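To make the cohort-selection step concrete, the sketch below illustrates one way to extract first-day ED notes and sample 3000 admissions with pandas. It assumes the MIMIC-III ADMISSIONS and NOTEEVENTS tables are available as CSV files; the file paths, the use of ADMISSION_LOCATION to flag ED admissions, and the random seed are illustrative assumptions rather than the exact procedure used in the study.

```python
import pandas as pd

# Illustrative cohort selection; column names follow the public MIMIC-III schema,
# but the exact filtering logic used in the study may differ.
admissions = pd.read_csv("ADMISSIONS.csv", parse_dates=["ADMITTIME"])
notes = pd.read_csv("NOTEEVENTS.csv", parse_dates=["CHARTTIME"])

# Keep admissions that came through the emergency department.
ed_admissions = admissions[admissions["ADMISSION_LOCATION"].str.contains("EMERGENCY", na=False)]

# Join notes to their admission and keep only notes charted within the
# first 24 hours after admission.
merged = notes.merge(ed_admissions[["HADM_ID", "ADMITTIME"]], on="HADM_ID")
first_day = merged[(merged["CHARTTIME"] >= merged["ADMITTIME"]) &
                   (merged["CHARTTIME"] <= merged["ADMITTIME"] + pd.Timedelta(hours=24))]

# Concatenate the first-day notes per admission and draw a random sample
# of 3000 admissions for analysis.
per_admission = first_day.groupby("HADM_ID")["TEXT"].apply("\n".join)
sampled = per_admission.sample(n=3000, random_state=42)
```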

2.2. Study Design

We assessed the diagnostic accuracy of differential diagnosis lists produced by ChatGPT-3.5 and ChatGPT-4. The term ‘differential diagnosis’ denotes a list of potential conditions or diseases that might explain a patient’s symptoms and signs, formulated based on the patient’s clinical history, physical examination, and any investigative results, thus facilitating the diagnostic process. We employed the ChatGPT-3.5 model (gpt-3.5-turbo model) and the ChatGPT-4 model (gpt-4). Neither model was specifically trained or enhanced for medical diagnosis.
Clinical notes from the ED served as the basis for predictive modeling, while hospital discharge summaries documented the final diagnoses. These diagnoses are typically documented by responsible clinical staff at the point of care and reflect the consensus medical opinion at the time of patient discharge. The discharge diagnoses represent the final result of comprehensive patient review by a multi-disciplinary team of providers based on the entirety of diagnostic, laboratory, imaging, pathology, and any other relevant information collected throughout the patient hospital stay and reviewed by medical experts. Diagnosis codes within the MIMIC-III database are initially in ICD-9-CM format, which were converted to ICD-10-CM for standardization and contemporary relevance. These standardized ICD-10-CM codes were then utilized as the ground truth against which the GPT-predicted diagnoses were evaluated.
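As an illustration of the code-standardization step, the snippet below maps ICD-9-CM discharge codes to ICD-10-CM through a general equivalence mapping table. The mapping file and its column names are hypothetical, since the paper does not specify which conversion resource was used; only the MIMIC-III DIAGNOSES_ICD table and its ICD9_CODE column come from the public schema.

```python
import pandas as pd

# Hypothetical general equivalence mapping (GEM) table with columns "icd9" and "icd10";
# the actual conversion resource used in the study is not specified.
gem = pd.read_csv("icd9_to_icd10_gem.csv", dtype=str)
icd9_to_icd10 = dict(zip(gem["icd9"], gem["icd10"]))

# MIMIC-III discharge diagnoses (ICD-9-CM) for the sampled admissions.
diagnoses = pd.read_csv("DIAGNOSES_ICD.csv", dtype=str)
diagnoses["ICD10_CODE"] = diagnoses["ICD9_CODE"].map(icd9_to_icd10)

# Codes without a one-to-one equivalent fall out as NaN and would need review.
unmapped = diagnoses[diagnoses["ICD10_CODE"].isna()]
```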
Engaging with LLMs necessitates the application of prompt engineering. Prompt engineering is the practice of crafting inputs or “prompts” for LLMs to effectively direct the model’s response towards a desired output. This technique is crucial because the quality of the input significantly influences the model’s output. Skilled prompt engineering enhances the precision and relevance of the responses from LLMs, making them more useful in practical applications such as content creation, or even complex decision-making scenarios. Effective prompts can reduce the number of iterations needed to reach an accurate or satisfactory answer, which leads to an increase in efficiency. Well-engineered prompts can exploit the full capabilities of an LLM, unlocking sophisticated behaviors and deeper insights from the model that are not immediately apparent through simple queries. Our final prompt selected was the following: “Using the patient text in the following, give, in bullets, the ranked list of most potential clinical diagnoses for this patient: <put the clinical text here>”. The clinical text that needs to be embedded in the prompt is the first-day-documented notes from ED admissions that were concatenated. The integration of these clinical notes into GPT’s framework allows the model to apply its advanced understanding of medical terminology and context to infer plausible medical conditions. This is achieved through the model’s ability to analyze the text for symptomatic mentions, historical medical information, and any initial treatments documented to be able to generate a differential diagnosis list. After prompting the GPT, it generated a ranked list of diagnoses, which was then converted into ICD-10-CM codes. These diagnoses were compared with actual discharge diagnoses. These codes, along with discharge diagnoses codes, were categorized using the Clinical Classifications Software Refined (CCSR) mapping algorithm from the Agency for Healthcare Research and Quality (AHRQ) [32]. The CCSR for ICD-10-CM diagnoses aggregates more than 70,000 ICD-10-CM diagnosis codes into clinically meaningful categories across 22 body systems, which generally follow the structure of the ICD-10-CM diagnosis chapters. The term “body system” is used to describe the organization of conditions within the CCSR tool. The CCSR categories are organized by body system. Each body system is abbreviated using a three-character scheme. Individual CCSR categories are numbered sequentially with the numbering scheme starting at “001” within each body system (i.e., there is a CCSR 001 for each body system). A complete listing of all CCSR categories and their associated descriptions can be found in the CCSR Reference File, available on the CCSR page [33]. For the analysis of body systems, the first three letters of the ‘CCSR CATEGORY 1’ column were extracted. All diagnoses were compared at two levels: the categorical level (over 530 categories) and the body system level (22 categories) [33].
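The sketch below shows how the prompt quoted above could be sent to both models and how predicted ICD-10-CM codes can be rolled up to CCSR body systems. The chat-completions call follows the current openai Python client, and the CCSR reference file name and its ICD-10 column label are assumptions; only the prompt wording, the model names, and the use of the first three characters of 'CCSR CATEGORY 1' come from the study description.

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Using the patient text in the following, give, in bullets, the ranked "
          "list of most potential clinical diagnoses for this patient: {text}")

def differential_diagnosis(note_text: str, model: str = "gpt-4") -> str:
    """Zero-shot prompting of gpt-3.5-turbo or gpt-4 with the study's prompt;
    the client call shown here is illustrative of the current openai package."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=note_text)}],
    )
    return response.choices[0].message.content

# CCSR roll-up: map an ICD-10-CM code to its 'CCSR CATEGORY 1' and take the
# first three characters as the body system (e.g., 'CIR...' -> 'CIR').
ccsr = pd.read_csv("DXCCSR_reference.csv", dtype=str)  # file/column names assumed
icd10_to_category = dict(zip(ccsr["ICD-10-CM CODE"], ccsr["CCSR CATEGORY 1"]))

def body_system(icd10_code: str):
    category = icd10_to_category.get(icd10_code)
    return category[:3] if category else None
```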

2.3. Evaluation

We implemented two approaches to evaluate the diagnostic accuracy of ChatGPT-generated differential diagnoses. First, accuracy was assessed categorically by comparing each ChatGPT diagnosis with its corresponding final discharge diagnosis and categorizing the results into three scenarios. To do so, we first calculated the percentage of matched diagnoses using the formula below (Dx stands for diagnosis).
$\text{Percent of matched Dx} = \dfrac{\text{number of matched Dx}}{\text{number of discharge Dx}} \times 100\%$
After calculating case match percentages for each visit, we classified them into three distinct levels: “no match” or “mismatch” (0%), “partial match” (>0% and <100%), and “full match” (100%). A “no match” was determined when none of the ChatGPT diagnoses aligned with the final discharge diagnoses, while a “full match” was determined when all ChatGPT diagnoses were explicitly diagnosed in the final hospital discharge diagnoses. A “partial match” was considered when only parts of the ChatGPT diagnoses were present in the hospital discharge diagnoses.
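A minimal sketch of this categorical evaluation is shown below, assuming each admission's GPT output and discharge diagnoses have already been reduced to sets of CCSR category codes; the example codes are hypothetical, and the set intersection is one straightforward way to count "matched" diagnoses.

```python
def match_percentage(predicted: set, discharge: set) -> float:
    """Percent of matched Dx = (number of matched Dx / number of discharge Dx) x 100."""
    if not discharge:
        return 0.0
    return 100.0 * len(predicted & discharge) / len(discharge)

def match_level(percent: float) -> str:
    """Classify an admission as 'no match' (0%), 'partial match' (>0% and <100%),
    or 'full match' (100%)."""
    if percent == 0:
        return "no match"
    if percent == 100:
        return "full match"
    return "partial match"

# Hypothetical example: 2 of 4 discharge diagnosis categories were predicted.
pct = match_percentage({"CIR007", "RSP002"}, {"CIR007", "RSP002", "END005", "GEN003"})
print(pct, match_level(pct))  # 50.0 partial match
```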
Second, diagnostic concordance was also evaluated on a systemic level. This involved aggregating the diagnoses into broader body system categories, as delineated by the CCSR. The accuracy for each body system was determined by the ratio of the number of GPT-predicted diagnoses that matched the body system of the discharge diagnoses to the total number of discharge diagnoses recorded for that same body system across all admissions.
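The sketch below outlines this body-system-level aggregation, assuming each admission contributes two lists of three-letter body-system codes (one for GPT predictions, one for discharge diagnoses). Capping matches at the number of discharge diagnoses per system is an interpretation of "predicted diagnoses that matched the body system", not a detail stated in the paper.

```python
from collections import Counter

def body_system_accuracy(per_admission_pairs):
    """Ratio of matched GPT predictions to total discharge diagnoses per body system,
    aggregated across all admissions.

    per_admission_pairs: iterable of (predicted_systems, discharge_systems),
    each a list of three-letter codes such as 'CIR' or 'RSP'.
    """
    matched, total = Counter(), Counter()
    for predicted, discharge in per_admission_pairs:
        discharge_counts = Counter(discharge)
        predicted_counts = Counter(predicted)
        total.update(discharge_counts)
        for system, n in discharge_counts.items():
            # Cap matches at the number of discharge diagnoses in this system
            # (an assumption about how a "match" is counted per admission).
            matched[system] += min(predicted_counts.get(system, 0), n)
    return {system: matched[system] / total[system] for system in total}
```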

3. Results

The analytical dataset comprised 3000 unique ED admissions. On average, patients had 15.87 ± 20.25 discharge diagnoses per admission, and their concatenated first-day ED notes contained 663.40 ± 613.88 GPT tokens.
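The note-length statistic above is expressed in GPT tokens; the snippet below shows one common way to count them with OpenAI's tiktoken library, which is an assumption, since the paper does not state how tokens were counted.

```python
import tiktoken

# Tokenizer matching the GPT-3.5/GPT-4 family of models.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    """Number of GPT tokens in a concatenated first-day ED note."""
    return len(encoding.encode(text))

print(count_tokens("Patient presents with chest pain and shortness of breath."))
```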
According to Table 1, at the body system level, GPT-3.5 and GPT-4 show a relatively similar frequency of ‘No match’ with diagnoses at 9.09% and 9.38%, respectively. The ‘Partial match’ category, which indicates the models’ ability to predict some but not all of the diagnoses, is where both models perform best. GPT-3.5 achieved a partial match 79.53% of the time, while GPT-4 showed a slight increase to 81.27%. However, for a ‘Complete match’, where the models’ predictions fully aligned with the actual diagnoses, both models scored low, with GPT-3.5 at 11.38% and GPT-4 slightly lower at 9.35%.
When assessing performance at the category level, the number of ‘No match’ results increases considerably for both models, with GPT-3.5 at 33.66% and GPT-4 at 36.01%, suggesting a more significant challenge in accurately predicting specific diagnosis categories. ‘Partial matches’ still constitute most of the predictions, with GPT-3.5 at 65.47% and GPT-4 at 63.37%. However, a ‘Complete match’ at the category level is notably rare, with GPT-3.5 achieving a mere 0.87% and GPT-4 at an even lower 0.62%.
Table 2 presents a comparative analysis of the diagnostic accuracy of GPT-3.5 and GPT-4 against actual discharge diagnoses across various body system categories using the CCSR. Each row represents a different body system, with the table organized by the frequency (‘n’) of correct predictions made by each GPT model and the total number of corresponding discharge diagnoses for each system. The columns labeled “GPT-4” and “GPT-3.5” show the counts of accurate predictions across all admissions by the respective models, while the “Discharge Dx” column reflects the total counts of diagnoses documented at discharge for each body system. Figure 1 provides a visual representation of the diagnostic accuracy of the GPT-3.5 and GPT-4 models across various body system categories, as detailed in Table 2. The y-axis in the sub-figures represents the ratio of the number of correct diagnoses predicted by the GPT models (GPT-3.5 and GPT-4) to the total number of discharge diagnoses within the same body system, as recorded in Table 2.
The data exhibited in Table 2 and Figure 1 underscore the comparative efficacy between GPT-4 and GPT-3.5 across various body systems. Both GPT models demonstrated the highest predictive accuracy for ‘Diseases of the circulatory system (CIR)’, with GPT-4 surpassing GPT-3.5, yielding 4017 correct predictions compared to 3523, against 9681 actual discharge diagnoses in this category. This suggests a more refined ability of GPT-4 to contextualize and analyze symptoms pertinent to circulatory conditions. Moreover, a substantial number of correct predictions were made by both models in ‘Diseases of the respiratory system (RSP)’, with GPT-4 once again achieving higher accuracy with 1223 correct predictions compared to GPT-3.5’s 1127, out of 3791 discharge diagnoses. Another area where both models performed relatively well is in ‘Diseases of the digestive system (DIG)’, although with a lower number of correct predictions than the aforementioned systems, standing at 601 for GPT-4 and 571 for GPT-3.5 out of a total of 3090 discharge diagnoses.
In contrast, body systems such as ‘Endocrine, nutritional, and metabolic diseases (END)’ and ‘Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism (BLD)’ showed a notable decline in the number of correct predictions, particularly when considered against the total number of discharge diagnoses present in the dataset for these categories. ‘Mental, behavioral, and neurodevelopmental disease (MBD)’ and ‘Diseases of the musculoskeletal system and connective tissue (MUS)’ likewise received fewer correct predictions. Figure 1, illustrating the accuracy of predictions at the body system level for both GPT models, visually confirms the findings in Table 2.
Figure 2, Figure 3, Figure 4 and Figure 5 delve deeper into the four body systems, where high accuracy was initially observed. These figures generally depict the frequency or prevalence of various medical conditions at the time of hospital discharge, providing a baseline for comparative analysis against model predictions. Additionally, they illustrate the frequency of accurate diagnoses made by the GPT-3.5 and GPT-4 models. Figure 2 and Figure 3 in particular delve into ‘Diseases of the circulatory system (CIR)’ and ‘Diseases of the respiratory system (RSP)’ body systems. For CIR, in the initial part of the chart, where discharge diagnoses are more frequent, both models exhibit relatively high accuracy, suggesting that these conditions, being more common, possibly have clearer diagnostic markers that are well-captured by the models. Notably, GPT-4 generally shows a higher accuracy than GPT-3.5, indicative of its enhanced modeling capabilities. However, as the conditions become less frequent towards the right side of the chart, a marked decrease in the models’ accuracy is observed. This decline in performance might reflect the increased difficulty in diagnosing less common diseases that manifest with subtler or less distinct symptoms.
The detailed results in Figure 3 for the ‘Diseases of the respiratory system (RSP)’ body system level exhibit a differential performance across various respiratory conditions, which provides an insight into the relative strengths of each model in parsing and analyzing specific respiratory-related clinical data. In the case of ‘Pneumonia’, a prevalent condition, both GPT models demonstrate moderately high diagnostic accuracy, aligning closely with the higher frequency of discharge diagnoses. Specifically, GPT-4 shows a marked improvement in diagnostic accuracy for ‘Chronic obstructive pulmonary disease and bronchiectasis’, surpassing GPT-3.5 slightly, suggesting refined capabilities in identifying this condition. Notably, for ‘Acute bronchitis’, both models exhibit low diagnostic accuracy, which does not proportionally reflect the moderate frequency of this diagnosis at discharge, indicating a potential area for model improvement. Overall, the data reveal that, while both models perform well in identifying common respiratory diseases, there remains room for enhancement, especially in conditions with lower discharge frequencies.
Figure 4 and Figure 5 delve deeper into the ‘Neoplasms (NEO)’ and ‘Diseases of digestive system (DIG)’ body systems. Regarding the NEO body system, with a focus on those that exhibit a higher frequency of discharge diagnoses, the category of ‘Nervous System Cancers—Brain’ not only shows a high frequency of discharge diagnoses, but also demonstrates a high accuracy rate of 89.47% with GPT-3.5, although there is a slight decrease to 84.21% with GPT-4. This category is significant due to both the high frequency of cases and the high accuracy observed, particularly with the earlier model iteration. Another key observation is in ‘Respiratory Cancers’, where, despite a high frequency of diagnoses, there is a notable decrease in diagnostic accuracy from GPT-3.5 at 36.00% to GPT-4 at 26.00%, indicating potential areas for improvement in the newer model. Additionally, ‘Urinary System Cancers—Kidney’ shows a significant improvement in accuracy from GPT-3.5 at 25.00% to GPT-4 at 50.00%, reflecting robust enhancements in GPT-4’s diagnostic capabilities.
The analysis of differential diagnosis accuracy for diseases of the digestive system (DIG) by the GPT-3.5 and GPT-4 models reveals notable trends and highlights significant categories. The frequency of discharge diagnoses indicates that certain categories within the digestive system diseases were more prevalent than others. Focusing on categories with higher discharge diagnosis frequencies provides a more relevant insight into the performance of the models. The graph indicates a notable decline in the frequency of discharge diagnoses from left to right, with the most common conditions appearing towards the left side of the graph. Both GPT-3.5 and GPT-4 show higher diagnostic accuracy for the more frequently diagnosed conditions, suggesting that both models perform better with conditions that are more common or perhaps more distinct in their presentation.
For the conditions with the highest discharge frequencies, GPT-4 generally matches or slightly surpasses the performance of GPT-3.5, indicating incremental improvements in model capabilities from one generation to the next. However, for conditions with lower discharge frequencies, the accuracy of both models diminishes, which may reflect the challenges AI models face when dealing with less common or more complex diagnostic scenarios.
In diseases such as ‘Gastroesophageal reflux disease (GERD)’ and ‘Gallbladder disease’, both models demonstrated relatively higher diagnostic accuracy. GPT-4 consistently outperformed GPT-3.5, aligning with the general trend of improved capabilities in later model iterations. This suggests that GPT-4 may be better at identifying symptoms and correlating them with specific conditions within the digestive system. Categories with lower diagnostic accuracy across both models, and which also had significant discharge diagnosis frequencies, included peptic ulcer disease and acute pancreatitis. In these cases, both models struggled to reach a high level of accuracy, which might indicate the complexity and symptom overlap common in these conditions that pose challenges for AI-driven diagnostic tools. For categories with low discharge frequencies, such as chronic pancreatitis and liver diseases, the performance of the models was not as critical from a clinical utility perspective due to the rare nature of these conditions in the dataset. However, where GPT-4 did engage with these rarer conditions, it showed a marginal improvement over GPT-3.5, suggesting incremental advancements in handling fewer common diseases within the digestive system.

4. Discussion

In this study, the accuracy of differential diagnosis lists generated by ChatGPT-3.5 and ChatGPT-4 using the clinical notes recorded during the first 24 h of ED admission was assessed. These models processed unstructured text to formulate a ranked list of potential diagnoses. The accuracy of these models was benchmarked against actual discharge diagnoses to evaluate their utility as diagnostic aids. The results in Table 1 and Table 2 indicate that, while both models are reasonably effective at identifying some correct diagnoses within broader body systems, their performance diminishes significantly when the task is narrowed down to specific categories.
Given the higher ‘Partial match’ rate denoted in Table 1, GPT-4 appears to slightly outperform GPT-3.5 in overall diagnostic prediction at the body system level. According to Table 2 and Figure 1, GPT-4 also outperformed GPT-3.5 in predicting the correct discharge diagnoses. This was observed across most body systems, with the diseases of the circulatory system presenting the highest number of correct predictions by both models. The disparity between the models’ performances and the actual discharge diagnoses highlights potential areas for model refinement. In the circulatory system, certain conditions like ‘Acute myocardial infarction’ and ‘Acute hemorrhagic cerebrovascular disease’ showed relatively high accuracy rates for both GPT models, with GPT-4 generally outperforming GPT-3.5. This suggests that GPT-4’s enhancements in model architecture might be better at interpreting the complex clinical features of severe cardiovascular conditions. Conversely, both models struggled with less common conditions such as ‘Hypotension’ and ‘Postprocedural or postoperative circulatory system complications’, which had low accuracy, indicating difficulties in diagnosing less distinct or infrequent conditions. In conditions with low discharge frequencies like ‘Asthma’ and ‘Respiratory failure; insufficiency; arrest’, both models showed lower accuracy, underscoring potential challenges in capturing the complexities associated with such conditions through AI analysis. It is evident that, while the models can identify common respiratory diseases relatively well, they struggle with less common or more complex conditions.
Overall, across all body systems, our results consistently show that, while GPT-4 tends to outperform GPT-3.5 in most categories, particularly in those involving complex and severe conditions, both models still show room for improvement in less common diseases. The higher frequency of certain conditions at discharge generally correlates with better model performance, suggesting that more common conditions are better represented in the training data, thus making it easier for the models to learn and predict accurately. However, the accuracy diminishes for conditions that are infrequent or have subtler clinical presentations, highlighting an area where future model training could focus on improving. These findings point towards the continued evolution of AI models in medical diagnostics and the need for ongoing advancements to enhance their precision and reliability in clinical applications. In these instances, despite the technological advances embedded in these AI models, the challenge of accurately diagnosing rare conditions remains evident. This pattern underlines the importance of continuous model training and refinement, particularly focusing on less common conditions to potentially improve diagnostic outcomes in clinical settings.
Our results demonstrate the significant potential of LLMs in reducing diagnostic discrepancy, which is especially important in the fast-paced ED setting. In emergency medicine, diagnostic discrepancy is defined as the difference between the diagnosis made by the emergency department physicians and the final diagnosis made by the hospitalists or specialists after the patient has been admitted to the hospital [34]. Diagnostic discrepancy is a well-described phenomenon in healthcare [35,36], and it is common, especially in patients hospitalized via the ED [37,38]. According to previous reports, diagnostic error rates in patients admitted to the ED vary between 0.6% and 64% [39,40]. The variation in the reported error rates, at least in part, is explained by differences in how the diagnostic error was defined, and whether primary diagnoses, all diagnoses, missed diagnoses, or unintentionally delayed diagnoses were used in calculating diagnostic discrepancy [41,42,43]. Diagnostic discrepancy in the ED may result in serious patient harm [35,36] and, in certain instances, may be associated with heightened morbidity and mortality [38]. Diagnostic discrepancies can have significant implications, both for patients and healthcare providers, such as delayed treatment, increased healthcare costs, patient anxiety, distress and continued suffering, decreased patient trust, inappropriate treatment, overuse of resources, unnecessary or excessive referrals and consultations, negative impact on clinical decision making, and unnecessary treatments [44,45,46,47]. The Comparative Effectiveness Review of Diagnostic Errors in the ED, published in 2022 and based on the analysis of 279 studies, concluded that diagnostic discrepancies in the ED represent a significant patient safety challenge and that future research should emphasize areas in which the use of EHR data can facilitate differential diagnostics and a reduction in diagnostic uncertainty [48].
Our results are concordant with the recent findings on the use of LLMs to facilitate differential diagnostics [49,50]. Large Language Models (LLMs) are the most recent advancements in deep learning and present tremendous potential for AI applications in clinical care [51]. Language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT) can perform complex tasks such as text classification, question answering, named entity recognition, and text summarization and classification, which have the potential to augment clinical decision making in providing a differential diagnosis and reducing diagnostic uncertainty [52,53,54,55]. Recent studies reported the high accuracy of GPT-4 in complex diagnostic challenges [56] and generating differential diagnoses early in ED presentations with 98% accuracy [57]. Another study demonstrated that GPT-4 may increase confidence in diagnosis and earlier commencement of appropriate treatment, alert clinicians about missing important diagnoses, and offer suggestions similar to specialists to achieve the correct clinical diagnosis [58]. Similar results were reported by the application of pre-training and fine-tuning of existing BERT models [59,60,61].
As the primary goal of ED care is stabilization of the main organ systems of patients who are acutely ill or injured, comprehensive diagnostic work-up may be affected by the life-saving priorities of the fast-paced ED settings [62]. Previous studies showed that diagnostic delays in ED result in longer hospital stays and higher mortality [63]. Diagnostic tools that facilitate differential diagnostics may reduce diagnostic uncertainty and help arrive at the correct diagnosis earlier [64,65]. Thus, the further exploration of LLMs’ capabilities in facilitating differential diagnostic processes in ED is warranted.

Limitations

Despite the demonstrated capabilities of ChatGPT-3.5 and ChatGPT-4 in enhancing diagnostic accuracy, several limitations must be addressed to ensure their effective integration into clinical settings. One significant limitation is the inconsistency in performance across different diagnostic categories, which can lead to varied reliability in their predictive outputs. While GPT-4 shows improved accuracy in critical diagnostic categories, its performance still varies, especially in more complex or less common conditions, which could undermine its utility in scenarios requiring consistent precision. Moreover, these models depend heavily on the quality and scope of the data they are trained on, limiting their effectiveness in scenarios where patient presentations are atypical or poorly documented in training datasets.
Furthermore, the current evaluation of these models does not fully account for the nuances of real-time clinical decision making, which often involves variables that extend beyond the textual data available in electronic health records. Factors such as patient-specific nuances, practitioner expertise, and interdisciplinary inputs, which are crucial in real-world settings, remain outside the scope of these models. This limitation highlights the need for a hybrid approach that combines AI insights with human oversight. Future applications must focus on creating more adaptable, context-aware systems that can function alongside healthcare professionals, providing support that enhances rather than replaces human judgment, ensuring a balanced integration of technology in healthcare practices.

5. Conclusions

Emergency department (ED) settings necessitate rapid and precise diagnostic evaluations to optimize patient outcomes and streamline healthcare delivery. This study assesses the effectiveness of generative pre-trained transformers, specifically the third-generation GPT-3.5 and fourth-generation GPT-4, in analyzing electronic health record notes from the initial 24 h following ED admission. These models were employed to process unstructured text, generating a hierarchy of potential differential diagnoses, which were subsequently evaluated against actual discharge diagnoses to determine the models’ diagnostic precision and their practical utility as diagnostic aids. The results demonstrate that both GPT-3.5 and GPT-4 hold considerable promise for aiding the early diagnostic process, showcasing commendable accuracy in generating partially correct matches at the body system level. Nonetheless, the study reveals limitations at the more detailed category level, where both models struggled to achieve the precision necessary for fully accurate diagnoses. This discrepancy highlights the need for ongoing improvements in AI technology to refine its capacity to discern and interpret the nuanced details essential for accurate medical diagnosis. While AI exhibits potential as a supportive tool in ED settings for formulating differential diagnoses, its integration into routine clinical practice requires continuous technological enhancements and rigorous clinical validation.

Author Contributions

Conceptualization, J.F.; methodology, J.F. and F.S.-M.; software, F.S.-M.; validation, J.F. and F.S.-M.; formal analysis, F.S.-M.; investigation, F.S.-M.; resources, J.F.; data curation, F.S.-M.; writing—original draft preparation, F.S.-M.; writing—review and editing, J.F.; visualization, F.S.-M.; supervision, J.F.; project administration, J.F. and F.S.-M.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported in part by grant R33HL143317 from the National Heart, Lung, and Blood Institute.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No additional data are deposited at any site other than in this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Newman-Toker, D.E.; Wang, Z.; Zhu, Y.; Nassery, N.; Tehrani, A.S.S.; Schaffer, A.C.; Yu-Moe, C.W.; Clemens, G.D.; Fanai, M.; Siegal, D. Rate of diagnostic errors and serious misdiagnosis-related harms for major vascular events, infections, and cancers: Toward a national incidence estimate using the “Big Three”. Diagnosis 2021, 8, 67–84. [Google Scholar] [CrossRef]
  2. Newman-Toker, D.E.; Pronovost, P.J. Diagnostic errors—The next frontier for patient safety. JAMA 2009, 301, 1060–1062. [Google Scholar] [CrossRef] [PubMed]
  3. Gunderson, C.G.; Bilan, V.P.; Holleck, J.L.; Nickerson, P.; Cherry, B.M.; Chui, P.; Bastian, L.A.; Grimshaw, A.A.; Rodwin, B.A. Prevalence of harmful diagnostic errors in hospitalized adults: A systematic review and meta-analysis. BMJ Qual. Saf. 2020, 29, 1008–1018. [Google Scholar] [CrossRef]
  4. Newman-Toker, D.E.; Tucker, L.J.; on behalf of the Society to Improve Diagnosis in Medicine Policy Committee. Roadmap for Research to Improve Diagnosis, Part 1: Converting National Academy of Medicine Recommendations into Policy Action. 2018. Available online: https://www.improvediagnosis.org/roadmap/ (accessed on 8 July 2024).
  5. Committee on Diagnostic Error in Health Care; Board on Health Care Services; Balogh, E.P.; Miller, B.T. Technology and tools in the diagnostic process. In Improving Diagnosis in Health Care; National Academies Press (US): Washington, DC, USA, 2015. [Google Scholar]
  6. Schmieding, M.L.; Kopka, M.; Schmidt, K.; Schulz-Niethammer, S.; Balzer, F.; Feufel, M.A. Triage accuracy of symptom checker apps: 5-year follow-up evaluation. J. Med. Internet Res. 2022, 24, e31810. [Google Scholar] [CrossRef]
  7. Riches, N.; Panagioti, M.; Alam, R.; Cheraghi-Sohi, S.; Campbell, S.; Esmail, A.; Bower, P. The effectiveness of electronic differential diagnoses (ddx) generators: A systematic review and meta-analysis. PLoS ONE 2016, 11, e0148991. [Google Scholar] [CrossRef]
  8. Greenes, R. Chapter 2—A brief history of clinical decision support: Technical, social, cultural, economic, governmental perspectives. In Clinical Decision Support, 2nd ed.; Academic Press: London, UK, 2014; pp. 49–109. [Google Scholar]
  9. Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An overview of clinical decision support systems: Benefits, risks, and strategies for success. NPJ Digit. Med. 2020, 3, 17. [Google Scholar] [CrossRef]
  10. Meunier, P.; Raynaud, C.; Guimaraes, E.; Gueyffier, F.; Letrilliart, L. Barriers and facilitators to the use of clinical decision support systems in primary care: A mixed-methods systematic review. Ann. Fam. Med. 2023, 21, 57–69. [Google Scholar] [CrossRef]
  11. Wani, S.U.D.; Khan, N.A.; Thakur, G.; Gautam, S.P.; Ali, M.; Alam, P.; Alshehri, S.; Ghoneim, M.M.; Shakeel, F. Utilization of artificial intelligence in disease prevention: Diagnosis, treatment, and implications for the healthcare workforce. Healthcare 2022, 10, 608. [Google Scholar] [CrossRef]
  12. Haug, C.J.; Drazen, J.M. Artificial intelligence and machine learning in clinical medicine, 2023. N. Engl. J. Med. 2023, 388, 1201–1208. [Google Scholar] [CrossRef] [PubMed]
  13. Liu, J.; Wang, C.; Liu, S. Utility of ChatGPT in clinical practice. J. Med. Internet Res. 2023, 25, e48568. [Google Scholar] [CrossRef]
  14. Alowais, S.A.; Alghamdi, S.S.; Alsuhebany, N.; Alqahtani, T.; Alshaya, A.I.; Almohareb, S.N.; Aldairem, A.; Alrashed, M.; Saleh, K.B.; Badreldin, H.A.; et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med. Educ. 2023, 23, 689. [Google Scholar] [CrossRef]
  15. Collins, C.; Dennehy, D.; Conboy, K.; Mikalef, P. Artificial intelligence in information systems research: A systematic literature review and research agenda. Int. J. Inf. Manage. 2021, 60, 102383. [Google Scholar] [CrossRef]
  16. Kawamoto, K.; Finkelstein, J.; Del Fiol, G. Implementing Machine Learning in the Electronic Health Record: Checklist of Essential Considerations. Mayo Clin Proc. 2023, 98, 366–369. [Google Scholar] [CrossRef] [PubMed]
  17. Patrizio, A. Google Gemini (Formerly Bard). TechTarget. Mar 2024. Available online: https://www.techtarget.com/searchenterpriseai/definition/Google-Bard (accessed on 15 July 2024).
  18. Saeidnia, H.R. Welcome to the Gemini era: Google DeepMind and the information industry. Libr. Hi Tech News, 2023; ahead-of-print. [Google Scholar] [CrossRef]
  19. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  20. Achiam, J.; Adler, S.; Agarwal, S. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  21. Sai, S.; Gaur, A.; Sai, R.; Chamola, V.; Guizani, M.; Rodrigues, J.J.P.C. Generative AI for transformative healthcare: A comprehensive study of emerging models, applications, case studies, and limitations. IEEE Access 2024, 12, 31078–31106. [Google Scholar] [CrossRef]
  22. Cascella, M.; Montomoli, J.; Bellini, V.; Bignami, E. Evaluating the feasibility of ChatGPT in healthcare: An analysis of multiple clinical and research scenarios. J. Med. Syst. 2023, 47, 33. [Google Scholar] [CrossRef]
  23. Giannakopoulos, K.; Kavadella, A.; Aaqel Salim, A.; Stamatopoulos, V.; Kaklamanos, E.G. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J. Med. Internet Res. 2023, 25, e51580. [Google Scholar] [CrossRef]
  24. Hu, K. ChatGPT Sets Record for Fastest-Growing User Base—Analyst Note. Reuters. Published 2 February 2023. Available online: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/ (accessed on 7 April 2024).
  25. Sarraju, A.; Bruemmer, D.; Van Iterson, E.; Cho, L.; Rodriguez, F.; Laffin, L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained from a Popular Online Chat-Based Artificial Intelligence Model. JAMA 2023, 329, 842–844. [Google Scholar] [CrossRef]
  26. Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, J.D.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern. Med. 2023, 183, 589–596. [Google Scholar] [CrossRef]
  27. Han, T.; Adams, L.C.; Bressem, K.K.; Busch, F.; Nebelung, S.; Truhn, D. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA 2024, 331, 1320–1321. [Google Scholar] [CrossRef] [PubMed]
  28. Gravina, A.G.; Pellegrino, R.; Cipullo, M.; Palladino, G.; Imperio, G.; Ventura, A.; Auletta, S.; Ciamarra, P.; Federico, A. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients’ questions? An evidence-controlled analysis. World J. Gastroenterol. 2024, 30, 17–33. [Google Scholar] [CrossRef]
  29. Hirosawa, T.; Kawamura, R.; Harada, Y.; Mizuta, K.; Tokumasu, K.; Kaji, Y.; Suzuki, T.; Shimizu, T. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: Diagnostic accuracy evaluation. JMIR Med. Inform. 2023, 11, e48808. [Google Scholar] [CrossRef]
  30. Hirosawa, T.; Mizuta, K.; Harada, Y.; Shimizu, T. Comparative evaluation of diagnostic accuracy between Google Bard and physicians. Am. J. Med. 2023, 136, 1119–1123.e18. [Google Scholar] [CrossRef]
  31. Kanjee, Z.; Crowe, B.; Rodman, A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA 2023, 330, 78–80. [Google Scholar] [CrossRef] [PubMed]
  32. Clinical Classifications Software Refined (CCSR) for ICD-10-CM Diagnoses. Available online: https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp (accessed on 8 July 2024).
  33. Department of Health and Human Service. Available online: https://hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp (accessed on 8 July 2024).
  34. Tipsmark, L.S.; Obel, B.; Andersson, T.; Søgaard, R. Organisational determinants and consequences of diagnostic discrepancy in two large patient groups in the emergency departments: A national study of consecutive episodes between 2008 and 2016. BMC Emerg. Med. 2021, 21, 145. [Google Scholar] [CrossRef]
  35. Hussain, F.; Cooper, A.; Carson-Stevens, A.; Donaldson, L.; Hibbert, P.; Hughes, T.; Edwards, A. Diagnostic error in the emergency department: Learning from national patient safety incident report analysis. BMC Emerg. Med. 2019, 19, 77. [Google Scholar] [CrossRef]
  36. Hautz, W.E.; Kämmer, J.E.; Hautz, S.C.; Sauter, T.C.; Zwaan, L.; Exadaktylos, A.K.; Birrenbach, T.; Maier, V.; Müller, M.; Schauber, S.K. Diagnostic error increases mortality and length of hospital stay in patients presenting through the emergency room. Scand. J. Trauma Resusc. Emerg. Med. 2019, 27, 54. [Google Scholar] [CrossRef]
  37. Wong, K.E.; Parikh, P.D.; Miller, K.C.; Zonfrillo, M.R. Emergency Department and Urgent Care Medical Malpractice Claims 2001–15. West J. Emerg. Med. 2021, 22, 333–338. [Google Scholar] [CrossRef]
  38. Abe, T.; Tokuda, Y.; Shiraishi, A.; Fujishima, S.; Mayumi, T.; Sugiyama, T.; Deshpande, G.A.; Shiino, Y.; Hifumi, T.; Otomo, Y.; et al. JAAM SPICE Study Group. In-hospital mortality associated with the misdiagnosis or unidentified site of infection at admission. Crit. Care 2019, 23, 202. [Google Scholar] [CrossRef]
  39. Avelino-Silva, T.J.; Steinman, M.A. Diagnostic discrepancies between emergency department admissions and hospital discharges among older adults: Secondary analysis on a population-based survey. Sao Paulo Med. J. 2020, 138, 359–367. [Google Scholar] [CrossRef]
  40. Finkelstein, J.; Parvanova, I.; Xing, Z.; Truong, T.T.; Dunn, A. Qualitative Assessment of Implementation of a Discharge Prediction Tool Using RE-AIM Framework. Stud. Health Technol. Inform. 2023, 302, 596–600. [Google Scholar] [CrossRef]
  41. Peng, A.; Rohacek, M.; Ackermann, S.; Ilsemann-Karakoumis, J.; Ghanim, L.; Messmer, A.S.; Misch, F.; Nickel, C.; Bingisser, R. The proportion of correct diagnoses is low in emergency patients with nonspecific complaints presenting to the emergency department. Swiss. Med. Wkly. 2015, 145, w14121. [Google Scholar] [CrossRef]
  42. Berner, E.S.; Graber, M.L. Overconfidence as a Cause of Diagnostic Error in Medicine. Am. J. Med. 2008, 121 (Suppl. S5), 2–23. [Google Scholar] [CrossRef]
  43. Chellis, M.; Olson, J.; Augustine, J.; Hamilton, G. Evaluation of missed diagnoses for patients admitted from the emergency department. Acad. Emerg. Med. 2001, 8, 125–130. [Google Scholar] [CrossRef]
  44. Kachalia, A.; Gandhi, T.K.; Puopolo, A.L.; Yoon, C.; Thomas, E.J.; Griffey, R.; Brennan, T.A.; Studdert, D.M. Missed and delayed diagnoses in the emergency department: A study of closed malpractice claims from 4 liability insurers. Ann. Emerg. Med. 2007, 49, 196–205. [Google Scholar] [CrossRef]
  45. Brown, T.W.; McCarthy, M.L.; Kelen, G.D.; Levy, F. An epidemiologic study of closed emergency department malpractice claims in a national database of physician malpractice insurers. Acad. Emerg. Med. Off. J. Soc. Acad. Emerg. Med. 2010, 17, 553–556. [Google Scholar] [CrossRef]
  46. Trautlein, J.J.; Lambert, R.L.; Miller, J. Malpractice in the emergency department-review of 200 cases. Ann. Emerg. Med. 1984, 13, 709–711. [Google Scholar] [CrossRef]
  47. Newman-Toker, D.E.; Schaffer, A.C.; Yu-Moe, C.W.; Nassery, N.; Saber Tehrani, A.S.; Clemens, G.D.; Wang, Z.; Zhu, Y.; Fanai, M.; Siegal, D. Serious misdiagnosis-related harms in malpractice claims: The “Big Three”—Vascular events, infections, and cancers. Diagnosis 2019, 6, 227–240. [Google Scholar] [CrossRef]
  48. Newman-Toker, D.E.; Peterson, S.M.; Badihian, S.; Hassoon, A.; Nassery, N.; Parizadeh, D.; Wilson, L.M.; Jia, Y.; Omron, R.; Tharmarajah, S.; et al. Diagnostic Errors in the Emergency Department: A Systematic Review [Internet]; Report No.: 22-EHC043; Agency for Healthcare Research and Quality (US): Rockville, MD, USA, 2022.
  49. Cabral, S.; Restrepo, D.; Kanjee, Z.; Wilson, P.; Crowe, B.; Abdulnour, R.E.; Rodman, A. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 2024, 184, 581–583. [Google Scholar] [CrossRef]
  50. Hirosawa, T.; Harada, Y.; Yokose, M.; Sakamoto, T.; Kawamura, R.; Shimizu, T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. Int. J. Environ. Res. Public Health 2023, 20, 3378. [Google Scholar] [CrossRef]
  51. Zhang, P.; Boulos, M.N.K. Generative AI in Medicine and Healthcare: Promises, Opportunities and Challenges. Future Internet 2023, 15, 286. [Google Scholar] [CrossRef]
  52. Savage, T.; Wang, J.; Shieh, L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Med. Inform. 2023, 11, e49886. [Google Scholar] [CrossRef]
  53. Rojas-Carabali, W.; Sen, A.; Agarwal, A.; Tan, G.; Cheung, C.Y.; Rousselot, A.; Agrawal, R.; Liu, R.; Cifuentes-González, C.; Elze, T.; et al. Chatbots Vs. Human Experts: Evaluating Diagnostic Performance of Chatbots in Uveitis and the Perspectives on AI Adoption in Ophthalmology. Ocul. Immunol. Inflamm. 2023, 1–8. [Google Scholar] [CrossRef]
  54. Savage, T.; Nayak, A.; Gallo, R.; Rangan, E.; Chen, J.H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit. Med. 2024, 7, 20. [Google Scholar] [CrossRef]
  55. Wada, A.; Akashi, T.; Shih, G.; Hagiwara, A.; Nishizawa, M.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; Kamagata, K.; et al. Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds. Diagnostics 2024, 14, 1541. [Google Scholar] [CrossRef] [PubMed]
  56. Cui, W.; Kawamoto, K.; Morgan, K.; Finkelstein, J. Reducing Diagnostic Uncertainty in Emergency Departments: The Role of Large Language Models in Age-Specific Diagnostics. In Proceedings of the 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), Orlando, FL, USA, 3–6 June 2024; pp. 525–527. [Google Scholar]
  57. Berg, H.T.; van-Bakel, B.; van-de-Wouw, L.; Jie, K.E.; Schipper, A.; Jansen, H.; O’Connor, R.D.; van-Ginneken, B.; Kurstjens, S. ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation. Ann. Emerg. Med. 2024, 83, 83–86. [Google Scholar] [CrossRef]
  58. Shea, Y.F.; Lee, C.M.Y.; Ip, W.C.T.; Luk, D.W.A.; Wong, S.S.W. Use of GPT-4 to Analyze Medical Records of Patients with Extensive Investigations and Delayed Diagnosis. JAMA Netw. Open 2023, 6, e2325000. [Google Scholar] [CrossRef]
  59. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
  60. Franz, L.; Shrestha, Y.R.; Paudel, B. A Deep Learning Pipeline for Patient Diagnosis Prediction Using Electronic Health Records. arXiv 2020, arXiv:2006.16926. [Google Scholar]
  61. Alam, M.M.; Raff, E.; Oates, T.; Matuszek, C. DDxT: Deep Generative Transformer Models for Differential Diagnosis. arXiv 2023, arXiv:2312.01242. [Google Scholar]
  62. Huo, X.; Finkelstein, J. Analyzing Diagnostic Discrepancies in Emergency Department Using the TriNetX Big Data. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 4896–4898. [Google Scholar]
  63. Finkelstein, J.; Cui, W.; Ferraro, J.P.; Kawamoto, K. Association of Diagnostic Discrepancy with Length of Stay and Mortality in Congestive Heart Failure Patients Admitted to the Emergency Department. AMIA Jt. Summits Transl. Sci. Proc. 2024, 2024, 155–161. [Google Scholar]
  64. Shah-Mohammadi, F.; Finkelstein, J. Combining NLP and Machine Learning for Differential Diagnosis of COPD Exacerbation Using Emergency Room Data. Stud. Health Technol. Inform. 2023, 305, 525–528. [Google Scholar] [CrossRef]
  65. Finkelstein, J.; Cui, W.; Morgan, K.; Kawamoto, K. Reducing Diagnostic Uncertainty Using Large Language Models. In Proceedings of the 2024 IEEE First International Conference on Artificial Intelligence for Medicine, Health and Care (AIMHC), Laguna Hills, CA, USA, 5–7 February 2024; pp. 236–242. [Google Scholar]
Figure 1. Comparative accuracy of GPT model predictions at the body system level: (a) performance of GPT-3.5; (b) performance of GPT-4.
Figure 2. Comparative accuracy of GPT model predictions at the category level for the CIR body system.
Figure 3. Comparative accuracy of GPT model predictions at the category level for the RSP body system.
Figure 4. Comparative accuracy of GPT model predictions at the category level for the NEO body system.
Figure 5. Comparative accuracy of GPT model predictions at the category level for the DIG body system.
Table 1. Differential diagnostic accuracy across GPT-3.5 and GPT-4 models.

Match Level | Body System, GPT-3.5 | Body System, GPT-4 | Category, GPT-3.5 | Category, GPT-4
No match | 9.09% | 9.38% | 33.66% | 36.01%
Partial match | 79.53% | 81.27% | 65.47% | 63.37%
Complete match | 11.38% | 9.35% | 0.87% | 0.62%
No match, partial match, and complete match represent diagnostic match of 0%, higher than 0% and less than 100%, and 100%, respectively.
Table 2. Differential diagnostic accuracy across GPT-3.5 and GPT-4 models at the body system level.

Body System | GPT-4 (n) | GPT-3.5 (n) | Discharge Dx (n)
Diseases of the circulatory system (CIR) | 4017 | 3523 | 9681
Diseases of the respiratory system (RSP) | 1223 | 1127 | 3791
Diseases of the digestive system (DIG) | 601 | 571 | 3090
Endocrine, nutritional, and metabolic diseases (END) | 330 | 282 | 5076
Diseases of the genitourinary system (GEN) | 269 | 239 | 2449
Certain infectious and parasitic diseases (INF) | 245 | 229 | 2077
Neoplasms (NEO) | 221 | 216 | 983
Diseases of the nervous system (NVS) | 158 | 133 | 1767
Mental, behavioral, and neurodevelopmental disease (MBD) | 133 | 130 | 1465
Diseases of the musculoskeletal system and connective tissue (MUS) | 107 | 90 | 1371
Factors influencing health status and contact with health services (FAC) | 106 | 84 | 1943
Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism (BLD) | 80 | 80 | 2131
Diseases of the skin and subcutaneous tissue (SKN) | 37 | 35 | 1372
Dental diseases (DEN) | 6 | 5 | 315
External causes of morbidity (EXT) | 5 | 5 | 1802
Congenital malformations, deformations, and chromosomal abnormalities (MAL) | 5 | 4 | 77
Diseases of the eye and adnexa (EYE) | 3 | 2 | 1971
Pregnancy, childbirth, and the puerperium (PRG) | 2 | 1 | 59