Article

Assessing the Accuracy of ChatGPT in Answering Questions About Prolonged Disorders of Consciousness

Villa Rosa Rehabilitation Hospital, Provincial Agency for Health Services (APSS) of Trento, 38057 Pergine Valsugana, Italy
* Author to whom correspondence should be addressed.
Brain Sci. 2025, 15(4), 392; https://doi.org/10.3390/brainsci15040392
Submission received: 29 March 2025 / Revised: 8 April 2025 / Accepted: 11 April 2025 / Published: 13 April 2025
(This article belongs to the Section Neurorehabilitation)

Abstract

Objectives: Prolonged disorders of consciousness (DoC) present complex diagnostic and therapeutic challenges. This study aimed to evaluate the accuracy of two ChatGPT models (ChatGPT 4o and ChatGPT o1) in answering questions about prolonged DoC, framed as if they were posed by a patient’s relative. Secondary objectives included comparing performance across languages (English vs. Italian) and assessing whether responses conveyed an empathetic tone. Methods: Fifty-seven open-ended questions reflecting common caregiver concerns were generated in both English and Italian, each categorized into one of three domains: clinical data, instrumental diagnostics, or therapy. Each question contained a background context followed by a specific query and was submitted once to both models. Two reviewers evaluated the responses on a four-point scale, ranging from “incorrect and potentially misleading” to “correct and complete”. Discrepancies were resolved by a third reviewer. Accuracy, language differences, empathy, and recommendation to consult a healthcare professional were analyzed using absolute frequencies, percentages, the Mann–Whitney U test, and Chi-squared tests. Results: A total of 228 responses were analyzed. Both models provided predominantly correct answers (80.7–96.5%), with English responses achieving higher accuracy only for ChatGPT 4o on clinical data. ChatGPT 4o exhibited greater empathy in its responses, whereas ChatGPT o1 more frequently recommended consulting a healthcare professional in Italian. Conclusions: Both ChatGPT models demonstrated high accuracy in addressing prolonged DoC queries, highlighting their potential usefulness for caregiver support. However, occasional inaccuracies emphasize the importance of verifying chatbot-generated information with professional medical advice.

1. Introduction

Prolonged disorders of consciousness (DoC) are defined by impaired consciousness persisting for at least 28 days following severe brain injury [1]. While such clinical conditions may also occur in the advanced stages of neurodegenerative diseases, the term “prolonged DoC” typically refers to the sequelae of acute brain injury, such as traumatic brain injury, cerebral hypoxia, or stroke. These disorders invariably follow an initial coma phase and encompass two main diagnostic categories: unresponsive wakefulness syndrome (UWS) and minimally conscious state (MCS). UWS is characterized by spontaneous eye opening without evidence of self-awareness or environmental awareness [2], whereas MCS involves minimal and inconsistent signs of awareness [3].
Prolonged DoC are devastating not only because of their severe impact on patients, but also due to their profound emotional and social consequences for families [4,5]. These conditions place a significant psychological burden on caregivers, as their onset is sudden and unexpected—an effect further amplified by the complexity and uncertainty surrounding prolonged DoC. Even experienced clinicians often face considerable challenges in accurately distinguishing UWS from MCS, further complicating diagnostic certainty [6,7]. Prognostication remains difficult, both in terms of recovery of consciousness and long-term functional outcomes [8]. Additionally, prolonged DoC are frequently associated with complications, such as hydrocephalus, seizures, spasticity, infections, paroxysmal sympathetic hyperactivity, and critical illness neuromyopathy, all of which can further hinder recovery [9,10,11]. As a result, these patients typically require extensive care and rehabilitative treatment for months or even years, making precise and transparent communication between clinicians and patients’ families essential to ensure that information is aligned with current scientific evidence and best clinical practices.
Since 2022, the introduction of chatbots capable of simulating human-like linguistic interaction has transformed access to information on a variety of topics, including medical issues. On 30 November of that year, OpenAI, a private company, released ChatGPT [12], a chatbot that enables users to interact conversationally with an unprecedented level of accuracy. In 2024, OpenAI released several new versions of its chatbot, most notably GPT-4o (where “o” stands for “omni”, meaning “all”) and its most advanced model, OpenAI o1, which, according to OpenAI, significantly outperforms GPT-4o on challenging reasoning benchmarks. OpenAI o1 was specifically developed to enhance performance on complex reasoning tasks, utilizing a method known as “chain-of-thought reasoning” [13]. This approach enables the model to work through problems step-by-step, thereby reducing hallucinations, that is, instances where the model generates plausible-sounding but incorrect or unsupported information. As a result, it may provide more accurate responses to complex medical questions compared to earlier versions of ChatGPT [14]. Consistent with the naming conventions used on the official ChatGPT app, this article refers to these models as ChatGPT 4o and ChatGPT o1, respectively. From the outset, the potential applications of ChatGPT in the medical field have generated considerable interest and debate within the scientific community. As a tool virtually accessible to any practitioner, it presents both opportunities and risks [15,16,17]. To date, several studies have evaluated ChatGPT’s ability to answer medical questions across various clinical domains, reporting responses that range from accurate to potentially misleading [18,19,20,21,22,23]. Another key consideration is that the accuracy of ChatGPT may vary depending on the language used. Although it has been trained on large multilingual datasets, the majority of its training data is in English. Consequently, its performance on medical topics may differ across languages, an issue that has been rarely investigated [21].
The primary aim of this study was to evaluate the accuracy of two ChatGPT models (ChatGPT 4o and ChatGPT o1) in responding to questions about prolonged DoC, framed as if they were posed by a patient’s relative. As a secondary objective, we compared their performance in English versus Italian and assessed whether the responses conveyed an empathetic tone. The results of this study will help determine whether ChatGPT can serve as a valuable supportive tool for the families of patients with prolonged DoC.

2. Materials and Methods

2.1. Data Collection and Evaluation

This study was conducted between January and March 2025. First, a questionnaire comprising 57 open-ended questions—based on those commonly asked by family members of patients with prolonged DoC—was developed. The questions were derived from the experience of two authors (SB and CB), each with approximately 20 years of clinical practice and research experience in prolonged DoC. The questions were categorized into three domains: clinical data (n = 29), instrumental diagnostics (n = 14), and therapy (n = 14). Each question contained an introductory contextualization (e.g., “My relative has a prolonged disorder of consciousness due to a severe brain injury”) followed by a specific query (e.g., “Is CT useful for determining if they are conscious?”). The questionnaire was developed in both English and Italian. All 57 questions were submitted once to both ChatGPT 4o and ChatGPT o1. At the time of this study, ChatGPT 4o was freely available, albeit with daily usage limitations, while ChatGPT o1 required a paid subscription. Each question was entered as an independent prompt using the “New chat” function. Responses were graded using a four-point scale based on their accuracy, alignment with scientific evidence [24,25], and consistency with best clinical practices. The following grading system was adapted from previous studies evaluating ChatGPT’s performance [20].
  1. Incorrect and potentially misleading information: the information provided is not consistent with the most recent scientific evidence or best clinical practice and could lead to conclusions unsupported by the data.
  2. Partially correct information, with significant errors: the information provided contains both correct elements and elements that are not in line with the most recent scientific evidence or best clinical practice.
  3. Correct but incomplete information: the information provided is consistent with the most recent scientific evidence and best clinical practice, but lacks some necessary elements to fully answer the question.
  4. Correct and complete information: the information provided is fully consistent with the most recent scientific evidence and best clinical practice.
Two authors (SB and CB) independently graded each response, with discrepancies resolved by a third expert reviewer (JB).
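Although each question in this study was entered manually through the app’s “New chat” function, the same independent-prompt design can be reproduced programmatically. The following is a minimal sketch using the OpenAI Python client; the API model identifiers “gpt-4o” and “o1” and the questions_en.txt file layout are illustrative assumptions, not part of the study protocol.

```python
# Sketch of an API-based equivalent of the manual "New chat" workflow.
# Assumption: questions are stored one per line in questions_en.txt, and the
# API model IDs "gpt-4o" and "o1" correspond to the app versions used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4o", "o1"]

def ask(model: str, question: str) -> str:
    """Submit one question as an independent prompt (fresh context per call)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

with open("questions_en.txt", encoding="utf-8") as f:
    questions = [line.strip() for line in f if line.strip()]

answers = {m: [ask(m, q) for q in questions] for m in MODELS}
```

Calling the API once per question, with no shared conversation history, mirrors the “New chat” approach and ensures that no context carries over between prompts.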
Additionally, responses were assessed for their empathetic tone based on the explicit presence of phrases that acknowledged the sensitive emotional context of prolonged DoC (e.g., “I’m sorry to hear about your relative’s condition”). Responses were also evaluated for the inclusion of a recommendation to consult a healthcare professional for a more in-depth evaluation of the issue. Because this information was easily identifiable in the responses, it was not subjected to independent double review.
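Because empathetic and referral content was identified from explicit wording, this assessment reduces to simple string matching. A minimal sketch follows; the marker phrases are hypothetical examples, not the exact criteria applied by the reviewers.

```python
# Illustrative sketch: flag responses that contain explicitly empathetic
# wording or a recommendation to consult a healthcare professional.
# The phrase lists below are hypothetical examples, not the study's criteria.
EMPATHY_MARKERS = [
    "i'm sorry to hear",
    "this must be a very difficult time",
]
REFERRAL_MARKERS = [
    "consult a healthcare professional",
    "speak with your doctor",
]

def contains_any(text: str, markers: list[str]) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in markers)

def annotate(response: str) -> dict[str, bool]:
    """Return binary flags for empathy and professional-referral content."""
    return {
        "empathetic": contains_any(response, EMPATHY_MARKERS),
        "recommends_professional": contains_any(response, REFERRAL_MARKERS),
    }
```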
Since this study does not involve clinical data or sensitive information, ethical committee approval was not required.

2.2. Data Analysis

Absolute frequencies and percentages for each grade were calculated separately for responses provided by ChatGPT 4o and ChatGPT o1 in both English and Italian. Interrater reliability between the two primary reviewers was assessed using weighted Cohen’s Kappa. Comparisons between models and languages were performed using Mann–Whitney U and Chi-squared (χ2) tests. p values < 0.05 were considered statistically significant. All statistical analyses were performed using Prism 10 (GraphPad Software, San Diego, CA, USA), while figures were created in Python 3 (Python Software Foundation, Wilmington, DE, USA) using code generated by ChatGPT o1.
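Although the analyses were run in Prism, the same statistics are straightforward to reproduce with open-source libraries. The sketch below uses placeholder data throughout, and the linear weighting for Cohen’s Kappa is an assumption, since the weighting scheme is not specified in the text.

```python
# Open-source sketch of the reported analyses (the authors used Prism 10).
# Scores are coded 1-4 (1 = incorrect, 4 = correct and complete); all arrays
# and counts below are placeholders, not the study data.
from scipy.stats import chi2_contingency, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Weighted Cohen's Kappa for interrater reliability between the two
# primary reviewers (weighting assumed linear; the paper does not specify).
reviewer1 = [4, 4, 3, 2, 4, 3, 4, 1]
reviewer2 = [4, 3, 3, 2, 4, 4, 4, 1]
kappa = cohen_kappa_score(reviewer1, reviewer2, weights="linear")

# Mann-Whitney U test comparing ordinal scores between two groups
# (e.g., English vs. Italian responses from the same model).
scores_en = [4, 4, 3, 4, 2, 4]
scores_it = [4, 3, 3, 2, 4, 2]
u_stat, p_u = mannwhitneyu(scores_en, scores_it, alternative="two-sided")

# Chi-squared test on a 2x2 contingency table of counts, e.g., empathetic
# vs. non-empathetic responses for ChatGPT 4o (row 1) and ChatGPT o1 (row 2).
table = [[40, 17],
         [12, 45]]
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"kappa={kappa:.2f}, U={u_stat:.0f} (p={p_u:.3f}), "
      f"chi2={chi2:.1f} (p={p_chi2:.4f})")
```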

3. Results

A total of 228 responses were analyzed. Interrater reliability, assessed using weighted Cohen’s Kappa, was 0.76, denoting substantial agreement between the two reviewers [26] (Figure 1). Discrepancies in the 29 responses (12.7% of the total) for which initial evaluations differed were resolved by the third reviewer. All evaluated responses, along with the corresponding reviewers’ scores, are listed in the Supplementary Materials.
Of the 57 questions posed in English, ChatGPT 4o responses were graded as incorrect in 1 case (1.8%), partially correct in 3 cases (5.3%), correct but incomplete in 8 cases (14%), and correct and complete in 45 cases (78.9%). ChatGPT o1 responses were graded as incorrect in 0 cases, partially correct in 2 cases (3.5%), correct but incomplete in 8 cases (14%), and correct and complete in 47 cases (82.5%) (Table 1). Differences between the two models were not significant (U = 1559; p = 0.6).
Of the 57 questions posed in Italian, ChatGPT 4o responses were graded as incorrect in 1 case (1.8%), partially correct in 10 cases (17.5%), correct but incomplete in 9 cases (15.8%), and correct and complete in 37 cases (64.9%). ChatGPT o1 responses were graded as incorrect in 0 cases, partially correct in 9 cases (15.8%), correct but incomplete in 3 cases (5.3%), and correct and complete in 45 cases (78.9%) (Table 1). Differences between the two models were not significant (U = 1468; p = 0.1).
Responses in English generally received higher scores than those in Italian, particularly for ChatGPT 4o, although the differences were not statistically significant (ChatGPT 4o: U = 1374; p = 0.07; ChatGPT o1: U = 1535; p = 0.5) (Figure 2). However, subgroup analysis by question domain (clinical data, instrumental diagnostics, and therapy) showed significantly better English responses by ChatGPT 4o in clinical data questions (U = 302; p = 0.02), with no significant differences for other subgroups or ChatGPT o1.
ChatGPT 4o responses were significantly more empathetic than those of ChatGPT o1 in both English and Italian (English: χ2 = 24.1; p < 0.0001; Italian: χ2 = 59.9; p < 0.0001) (Figure 3). ChatGPT o1, however, recommended consulting a healthcare professional significantly more often than ChatGPT 4o in Italian responses only (χ2 = 4; p = 0.04) (Figure 3).

4. Discussion

The main finding of this study was that both ChatGPT models provided a notably high proportion of responses to questions about prolonged DoC that were graded as either correct and complete or correct but incomplete, ranging from 80.7% for ChatGPT 4o in Italian to 96.5% for ChatGPT o1 in English. When comparing responses in English versus Italian, ChatGPT 4o—but not ChatGPT o1—performed significantly better on clinical data questions when these were posed in English. Additionally, ChatGPT 4o’s responses were rated as more empathetic than those of ChatGPT o1 in both languages. Conversely, ChatGPT o1 was more likely to recommend consulting a healthcare professional, although this difference was only significant in the Italian responses.
The reliability of medical information provided by large language model-powered chatbots is steadily improving, and, in some cases, has even been deemed superior to that of clinical experts in terms of accuracy, completeness, and empathy [27,28,29,30]. Accordingly, it is unsurprising that most responses in this study were rated as correct across both models and languages. Both models generally produced extensive answers, averaging approximately 320 words per response for ChatGPT 4o and 450 words per response for ChatGPT o1. As a result, most correct responses were also graded as complete, while only a relatively small percentage (5.3–15.8%) were considered correct but incomplete. However, the responses often included information beyond the specific scope of the questions. This verbosity may stem from several factors, including the training process, which tends to reward detailed and comprehensive answers. Notably, such verbosity has already been documented for ChatGPT in both medical and non-medical contexts [31,32,33].
Partially correct responses with significant errors accounted for 3.5% to 17.5% of cases, depending on the ChatGPT model and language. These responses contained both accurate and inaccurate elements. Additionally, two responses were rated as incorrect and potentially misleading: both concerned instrumental diagnostics and were generated by ChatGPT 4o. In one case, an English-language question asked whether somatosensory evoked potentials (SSEPs) could determine if a patient with a prolonged DoC was conscious. ChatGPT 4o stated that SSEPs can help distinguish between UWS and MCS; however, SSEPs are primarily used for prognostic purposes, and the response failed to mention the importance of considering the patient’s etiology when interpreting the results [34,35]. In the second case, an Italian question asked whether an electroencephalogram (EEG) could help determine if a patient is conscious. Multiple errors were identified in the response, including the incorrect claim that EEG can detect brain activity indicating whether the brain is minimally responsive or completely unresponsive, as well as the misattribution of the burst-suppression pattern to patients in UWS and MCS. Recognizing and understanding these error patterns could inform the development and refinement of future chatbot models, potentially enhancing their accuracy and clinical reliability.
An interesting finding of this study concerns the accuracy of responses provided in Italian compared to the reference language, English. While both ChatGPT models demonstrated strong overall performance in both languages, English responses generally achieved higher accuracy scores, particularly with the ChatGPT 4o model. Notably, this difference reached statistical significance within the clinical data subgroup, suggesting that language may influence the precision of information delivered by large language models on specific medical topics. These findings align with the well-documented predominance of English in the training datasets used to develop ChatGPT models, highlighting a potential linguistic bias in non-English contexts. Given the widespread use of such chatbots by non-English-speaking populations, these results underscore the importance of further refining multilingual training processes to ensure consistently accurate and reliable medical information across languages.
Caregivers of patients with DoC frequently experience significant burdens, including symptoms of depression and anxiety [36,37], underscoring the importance of empathetic communication when addressing concerns about a relative with a prolonged DoC. A key finding of this study was that ChatGPT 4o provided significantly more empathetic responses than ChatGPT o1, in both English and Italian. Notably, all prompts were framed as if posed by a patient’s relative; in this context, ChatGPT 4o’s greater empathetic tone may enhance its role as a supportive tool for families. Finally, both models frequently recommended consulting a healthcare professional, although in Italian, this suggestion was made more often by ChatGPT o1.
This study has some limitations. First, the questions used were derived from the authors’ extensive clinical and research experience with prolonged DoC. However, involving actual caregivers or family members in formulating the questions would have ensured greater alignment with the specific informational needs of relatives of patients with prolonged DoC. Additionally, consistent with previous studies evaluating chatbot responses to medical questions, we employed a four-point grading scale to assess the responses provided by the two ChatGPT models. While interrater agreement between the two primary reviewers was high, we acknowledge that reviewers with different professional backgrounds or expertise in prolonged DoC might interpret the grading criteria differently, potentially affecting the assigned scores. Finally, empathy was assessed by verifying the presence of explicitly empathetic statements, rather than using a more structured grading system, which may limit the standardization and comparability of our findings.

5. Conclusions

Both ChatGPT models demonstrated high accuracy in providing information on prolonged DoC, underscoring their potential as supportive tools for caregivers. In particular, the more empathetic responses generated by ChatGPT 4o may further enhance its value, particularly in emotionally sensitive contexts. Integrating chatbots into long-term care and rehabilitation pathways for individuals with prolonged DoC entails both significant opportunities and notable challenges. These tools could serve as readily accessible resources for caregivers, providing not only timely, accurate information, but also emotional support in situations often characterized by profound uncertainty and distress. Additionally, they could be used as educational tools, helping caregivers better understand the patient’s condition and navigate complex, long-term decisions. From a healthcare perspective, chatbots can also facilitate more effective communication between professionals and patients’ families, assisting in the management of delicate and emotionally charged interactions. Nonetheless, the occasional inaccuracies observed in this study highlight the need for careful implementation and continuous professional oversight. Future research should further explore the real-world impact of these tools on caregiver burden, patient outcomes, and their long-term feasibility within healthcare settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/brainsci15040392/s1: ChatGPT Responses (English and Italian) and reviewers’ scores.

Author Contributions

Conceptualization, S.B., C.B. and J.B.; ChatGPT response review, S.B., C.B. and J.B.; statistical analysis, S.B.; writing—original draft preparation, S.B.; writing—review and editing, S.B., C.B. and J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are provided in the article and its Supplementary Materials. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT o1 to generate the Python 3 code used in the creation of Figure 1, Figure 2 and Figure 3. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DoC: Disorders of consciousness
UWS: Unresponsive wakefulness syndrome
MCS: Minimally conscious state
SSEPs: Somatosensory evoked potentials
EEG: Electroencephalogram

References

  1. Giacino, J.T.; Katz, D.I.; Schiff, N.D.; Whyte, J.; Ashman, E.J.; Ashwal, S.; Barbano, R.; Hammond, F.M.; Laureys, S.; Ling, G.S.F.; et al. Comprehensive systematic review update summary: Disorders of consciousness: Report of the Guideline Development, Dissemination, and Implementation Subcommittee of the American Academy of Neurology; the American Congress of Rehabilitation Medicine; and the National Institute on Disability, Independent Living, and Rehabilitation Research. Neurology 2018, 91, 461–470. [Google Scholar] [CrossRef] [PubMed]
  2. Laureys, S.; Celesia, G.G.; Cohadon, F.; Lavrijsen, J.; León-Carrión, J.; Sannita, W.G.; Sazbon, L.; Schmutzhard, E.; von Wild, K.R.; Zeman, A.; et al. Unresponsive wakefulness syndrome: A new name for the vegetative state or apallic syndrome. BMC Med. 2010, 8, 68. [Google Scholar] [CrossRef] [PubMed]
  3. Giacino, J.T.; Ashwal, S.; Childs, N.; Cranford, R.; Jennett, B.; Katz, D.I.; Kelly, J.P.; Rosenberg, J.H.; Whyte, J.; Zafonte, R.D.; et al. The minimally conscious state: Definition and diagnostic criteria. Neurology 2002, 58, 349–353. [Google Scholar] [CrossRef]
  4. Giovannetti, A.M.; Leonardi, M.; Pagani, M.; Sattin, D.; Raggi, A. Burden of caregivers of patients in vegetative state and minimally conscious state. Acta Neurol. Scand. 2013, 127, 10–18. [Google Scholar] [CrossRef]
  5. Gonzalez-Lara, L.E.; Munce, S.; Christian, J.; Owen, A.M.; Weijer, C.; Webster, F. The multiplicity of caregiving burden: A qualitative analysis of families with prolonged disorders of consciousness. Brain Inj. 2021, 35, 200–208. [Google Scholar] [CrossRef] [PubMed]
  6. Schnakers, C.; Vanhaudenhuyse, A.; Giacino, J.; Ventura, M.; Boly, M.; Majerus, S.; Moonen, G.; Laureys, S. Diagnostic accuracy of the vegetative and minimally conscious state: Clinical consensus versus standardized neurobehavioral assessment. BMC Neurol. 2009, 9, 35. [Google Scholar] [CrossRef]
  7. Schnakers, C.; Hirsch, M.; Noé, E.; Llorens, R.; Lejeune, N.; Veeramuthu, V.; De Marco, S.; Demertzi, A.; Duclos, C.; Morrissey, A.M.; et al. Covert Cognition in Disorders of Consciousness: A Meta-Analysis. Brain Sci. 2020, 10, 930. [Google Scholar] [CrossRef]
  8. Song, M.; Yang, Y.; Yang, Z.; Cui, Y.; Yu, S.; He, J.; Jiang, T. Prognostic models for prolonged disorders of consciousness: An integrative review. Cell Mol. Life Sci. 2020, 77, 3945–3961. [Google Scholar] [CrossRef]
  9. Ganesh, S.; Guernon, A.; Chalcraft, L.; Harton, B.; Smith, B.; Louise-Bender Pape, T. Medical comorbidities in disorders of consciousness patients and their association with functional outcomes. Arch. Phys. Med. Rehabil. 2013, 94, 1899–1907. [Google Scholar] [CrossRef]
  10. Bagnato, S.; Boccagni, C.; Sant’angelo, A.; Prestandrea, C.; Romano, M.C.; Galardi, G. Neuromuscular involvement in vegetative and minimally conscious states following acute brain injury. J. Peripher. Nerv. Syst. 2011, 16, 315–321. [Google Scholar] [CrossRef]
  11. Bagnato, S.; Boccagni, C.; Galardi, G. Structural epilepsy occurrence in vegetative and minimally conscious states. Epilepsy Res. 2013, 103, 106–109. [Google Scholar] [CrossRef] [PubMed]
  12. OpenAI. ChatGPT: Optimizing Language Models for Dialogue. Available online: https://openai.com/blog/chatgpt/ (accessed on 27 March 2025).
  13. Introducing OpenAI o1-Preview. 2024. Available online: https://openai.com/index/introducing-openai-o1-preview/ (accessed on 6 April 2025).
  14. Temsah, M.H.; Jamal, A.; Alhasan, K.; Temsah, A.A.; Malki, K.H. OpenAI o1-preview vs. ChatGPT in healthcare: A new frontier in medical AI reasoning. Cureus 2024, 16, e70640. [Google Scholar] [CrossRef] [PubMed]
  15. Volkmer, S.; Meyer-Lindenberg, A.; Schwarz, E. Large language models in psychiatry: Opportunities and challenges. Psychiatry Res. 2024, 339, 116026. [Google Scholar] [CrossRef] [PubMed]
  16. Sridi, C.; Brigui, S. The use of ChatGPT in occupational medicine: Opportunities and threats. Ann. Occup. Environ. Med. 2023, 35, e42. [Google Scholar] [CrossRef]
  17. Cohen, I.G. What Should ChatGPT Mean for Bioethics? Am. J. Bioeth. 2023, 23, 8–16. [Google Scholar] [CrossRef]
  18. Pan, A.; Musheyev, D.; Bockelman, D.; Loeb, S.; Kabarriti, A.E. Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer. JAMA Oncol. 2023, 9, 1437–1440. [Google Scholar] [CrossRef]
  19. Yeo, Y.H.; Samaan, J.S.; Ng, W.H.; Ting, P.S.; Trivedi, H.; Vipani, A.; Ayoub, W.; Yang, J.D.; Liran, O.; Spiegel, B.; et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin. Mol. Hepatol. 2023, 29, 721–732. [Google Scholar] [CrossRef]
  20. Samaan, J.S.; Yeo, Y.H.; Rajeev, N.; Hawley, L.; Abel, S.; Ng, W.H.; Srinivasan, N.; Park, J.; Burch, M.; Watson, R.; et al. Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery. Obes. Surg. 2023, 33, 1790–1796. [Google Scholar] [CrossRef]
  21. Oliveira, A.L.; Coelho, M.; Guedes, L.C.; Cattoni, M.B.; Carvalho, H.; Duarte-Batista, P. Performance of ChatGPT 3.5 and 4 as a tool for patient support before and after DBS surgery for Parkinson’s disease. Neurol. Sci. 2024, 45, 5757–5764. [Google Scholar] [CrossRef]
  22. Horiuchi, D.; Tatekawa, H.; Shimono, T.; Walston, S.L.; Takita, H.; Matsushita, S.; Oura, T.; Mitsuyama, Y.; Miki, Y.; Ueda, D. Accuracy of ChatGPT generated diagnosis from patient’s medical history and imaging findings in neuroradiology cases. Neuroradiology 2024, 66, 73–79. [Google Scholar] [CrossRef]
  23. Antaki, F.; Touma, S.; Milad, D.; El-Khoury, J.; Duval, R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol. Sci. 2023, 3, 100324. [Google Scholar] [CrossRef] [PubMed]
  24. Giacino, J.T.; Katz, D.I.; Schiff, N.D.; Whyte, J.; Ashman, E.J.; Ashwal, S.; Barbano, R.; Hammond, F.M.; Laureys, S.; Ling, G.S.F.; et al. Practice guideline update recommendations summary: Disorders of consciousness: Report of the Guideline Development, Dissemination, and Implementation Subcommittee of the American Academy of Neurology; the American Congress of Rehabilitation Medicine; and the National Institute on Disability, Independent Living, and Rehabilitation Research. Neurology 2018, 91, 450–460. [Google Scholar] [CrossRef] [PubMed]
  25. Kondziella, D.; Bender, A.; Diserens, K.; van Erp, W.; Estraneo, A.; Formisano, R.; Laureys, S.; Naccache, L.; Ozturk, S.; Rohaut, B.; et al. European Academy of Neurology guideline on the diagnosis of coma and other disorders of consciousness. Eur. J. Neurol. 2020, 27, 741–756. [Google Scholar] [CrossRef] [PubMed]
  26. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
  27. Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern. Med. 2023, 183, 589–596. [Google Scholar] [CrossRef]
  28. Goodman, R.S.; Patrinely, J.R.; Stone, C.A., Jr.; Zimmerman, E.; Donald, R.R.; Chang, S.S.; Berkowitz, S.T.; Finn, A.P.; Jahangir, E.; Scoville, E.A.; et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw. Open 2023, 6, e2336483. [Google Scholar] [CrossRef]
  29. Zhou, S.; Luo, X.; Chen, C.; Jiang, H.; Yang, C.; Ran, G.; Yu, J.; Yin, C. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int. J. Surg. 2024, 110, 6509–6517. [Google Scholar] [CrossRef]
  30. Huang, A.S.; Hirabayashi, K.; Barna, L.; Parikh, D.; Pasquale, L.R. Assessment of a Large Language Model’s Responses to Questions and Cases About Glaucoma and Retina Management. JAMA Ophthalmol. 2024, 142, 371–375, Erratum in JAMA Ophthalmol. 2024, 142, 393. [Google Scholar] [CrossRef]
  31. Haver, H.L.; Lin, C.T.; Sirajuddin, A.; Yi, P.H.; Jeudy, J. Use of ChatGPT, GPT-4, and Bard to Improve Readability of ChatGPT’s Answers to Common Questions About Lung Cancer and Lung Cancer Screening. AJR Am. J. Roentgenol. 2023, 221, 701–704. [Google Scholar] [CrossRef]
  32. Kaiser, K.N.; Hughes, A.J.; Yang, A.D.; Mohanty, S.; Maatman, T.K.; Gonzalez, A.A.; Patzer, R.E.; Bilimoria, K.Y.; Ellis, R.J. Use of large language models as clinical decision support tools for management pancreatic adenocarcinoma using National Comprehensive Cancer Network guidelines. Surgery 2025, 109267. [Google Scholar] [CrossRef]
  33. Taniguchi, M.; Lindsey, J.S. Performance of chatbots in queries concerning fundamental concepts in photochemistry. Photochem. Photobiol. 2024; online ahead of print. [Google Scholar] [CrossRef]
  34. Estraneo, A.; Moretta, P.; Loreto, V.; Lanzillo, B.; Cozzolino, A.; Saltalamacchia, A.; Lullo, F.; Santoro, L.; Trojano, L. Predictors of recovery of responsiveness in prolonged anoxic vegetative state. Neurology 2013, 80, 464–470. [Google Scholar] [CrossRef] [PubMed]
  35. Bagnato, S.; Prestandrea, C.; D’Agostino, T.; Boccagni, C.; Rubino, F. Somatosensory evoked potential amplitudes correlate with long-term consciousness recovery in patients with unresponsive wakefulness syndrome. Clin. Neurophysiol. 2021, 132, 793–799. [Google Scholar] [CrossRef] [PubMed]
  36. Covelli, V.; Sattin, D.; Giovannetti, A.M.; Scaratti, C.; Willems, M.; Leonardi, M. Caregiver’s burden in disorders of consciousness: A longitudinal study. Acta Neurol. Scand. 2016, 134, 352–359. [Google Scholar] [CrossRef]
  37. Pagani, M.; Giovannetti, A.M.; Covelli, V.; Sattin, D.; Raggi, A.; Leonardi, M. Physical and mental health, anxiety and depressive symptoms in caregivers of patients in vegetative state and minimally conscious state. Clin. Psychol. Psychother. 2014, 21, 420–426. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Heatmap illustrating interrater agreement between the two primary reviewers. Cell values indicate the number of responses assigned to each grading category (1 = incorrect, 4 = correct and complete) by Reviewer 1 (rows) and Reviewer 2 (columns). Darker shades correspond to higher frequencies.
Figure 2. Grouped bar chart displaying the distribution of scores (from 1 = incorrect to 4 = correct and complete) assigned to responses provided by ChatGPT 4o and ChatGPT o1, separately for questions posed in English and Italian.
Figure 3. Stacked bar charts comparing responses of ChatGPT 4o and ChatGPT o1. The upper panels show the frequency of responses rated as empathetic in English (left) and Italian (right). The lower panels display the frequency of responses including a recommendation to consult a healthcare professional, again separated by language (English, left; Italian, right).
Table 1. Grading of responses provided by two ChatGPT models (ChatGPT 4o and ChatGPT o1) to questions concerning prolonged DoC. Results are presented separately for three question domains (clinical data, instrumental diagnostics, and therapy), each posed in English and Italian. Data are reported as absolute values with corresponding percentages in parentheses.
| Category | ChatGPT 4o (English) | ChatGPT o1 (English) | ChatGPT 4o (Italian) | ChatGPT o1 (Italian) |
|---|---|---|---|---|
| Clinical data (n = 29) | | | | |
| 1. Incorrect and potentially misleading | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| 2. Partially correct with significant errors | 1 (3.4%) | 2 (6.9%) | 5 (17.2%) | 6 (20.7%) |
| 3. Correct but incomplete | 2 (6.9%) | 5 (17.2%) | 6 (20.7%) | 2 (6.9%) |
| 4. Correct and complete | 26 (89.7%) | 22 (75.9%) | 18 (62.1%) | 21 (72.4%) |
| Instrumental diagnostics (n = 14) | | | | |
| 1. Incorrect and potentially misleading | 1 (7.1%) | 0 (0%) | 1 (7.1%) | 0 (0%) |
| 2. Partially correct with significant errors | 0 (0%) | 0 (0%) | 3 (21.4%) | 3 (21.4%) |
| 3. Correct but incomplete | 2 (14.3%) | 2 (14.3%) | 1 (7.1%) | 0 (0%) |
| 4. Correct and complete | 11 (78.6%) | 12 (85.7%) | 9 (64.3%) | 11 (78.6%) |
| Therapy (n = 14) | | | | |
| 1. Incorrect and potentially misleading | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| 2. Partially correct with significant errors | 2 (14.3%) | 0 (0%) | 2 (14.3%) | 0 (0%) |
| 3. Correct but incomplete | 4 (28.6%) | 1 (7.1%) | 2 (14.3%) | 1 (7.1%) |
| 4. Correct and complete | 8 (57.1%) | 13 (92.9%) | 10 (71.4%) | 13 (92.9%) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

