Background/Objectives: Large language models (LLMs) have shown promising results in medical decision support; however, their effectiveness in managing acute cholecystitis and other gallbladder diseases remains insufficiently examined. This study evaluated the performance of a neuro-symbolic LLM system that combines multiple AI agents with symbolic, guideline-based reasoning for acute cholecystitis management and compared its diagnostic accuracy with that of expert physicians from three clinical specialties.
Methods: This multi-center cross-sectional study included 30 case-based questions covering acute cholecystitis and gallbladder diseases, stratified across eight predefined disease categories: acute calculous cholecystitis (n = 6), acute acalculous cholecystitis (n = 2), complicated cholecystitis including gangrenous, emphysematous, and perforated variants (n = 5), chronic cholecystitis and biliary colic (n = 4), gallbladder polyps and adenomyomatosis (n = 3), Mirizzi syndrome (n = 2), gallbladder carcinoma (n = 4), and post-cholecystectomy complications (n = 4). Questions were categorized into diagnosis (n = 10), treatment (n = 10), and complications/prognosis (n = 10). Gold standard answers were established through consensus by an expert panel of two senior general surgeons and one senior emergency medicine physician, each with more than 20 years of clinical experience, using the Tokyo Guidelines 2018 (TG18) as the reference standard for diagnostic criteria, severity grading, and management recommendations. The panel achieved unanimous consensus on all 30 gold standard answers, and all answers were cross-referenced against the primary TG18 publications to ensure a guideline-based rather than solely opinion-based reference standard. This consensus-based, guideline-anchored approach is consistent with established methodologies for gold standard development in AI diagnostic accuracy studies. Performance of a neuro-symbolic LLM system orchestrated via LangGraph v1.0 was compared against 10 general surgery specialists, 10 emergency medicine physicians, and 10 gastroenterology specialists from four tertiary centers in Turkey. The neuro-symbolic system incorporated TG18 as its symbolic knowledge base for diagnostic criteria, severity grading, and management algorithms.
Results: The neuro-symbolic system achieved the highest overall accuracy, 96.7% (29/30), markedly surpassing general surgery specialists (mean 82.3% ± 6.8%), emergency medicine physicians (mean 71.0% ± 8.2%), and gastroenterology specialists (mean 78.7% ± 7.4%). The neuro-symbolic system also outperformed all human groups in every clinical category. Among human participants, general surgeons showed the highest accuracy in treatment decisions (88.0%), while gastroenterologists performed best on diagnostic questions (82.0%). Emergency medicine physicians performed comparably to the other specialties in acute presentation scenarios. ROC analysis revealed excellent discrimination for the neuro-symbolic system (AUC = 0.983) compared with general surgery (AUC = 0.856), gastroenterology (AUC = 0.821), and emergency medicine (AUC = 0.764).
Conclusions: The neuro-symbolic LLM system outperformed all human expert groups on standardized, guideline-concordant case-based assessment of acute cholecystitis management, reflecting its consistent application of encoded TG18 criteria. These findings support its potential as a clinical decision-support tool that augments, rather than replaces, physician expertise, particularly in settings where specialist expertise is limited. However, these results should be interpreted within the constraints of a structured case-based evaluation and do not imply global clinical superiority over human experts.