Performance of ChatGPT in Pediatric Audiology as Rated by Students and Experts

Ratuszniak, Anna; Gos, Elzbieta; Lorens, Artur; Skarzynski, Piotr Henryk; Skarzynski, Henryk; Jedrzejczak, W. Wiktor

doi:10.3390/jcm14030875

by Anna Ratuszniak ^1,2,*, Elzbieta Gos ^1,2, Artur Lorens ^1,2, Piotr Henryk Skarzynski ^1,2,3,4, Henryk Skarzynski ^1,2 and W. Wiktor Jedrzejczak ^1,2

¹ Institute of Physiology and Pathology of Hearing, Mochnackiego 10 Street, 02-042 Warsaw, Poland

² World Hearing Center, Mokra 17 Street, 05-830 Kajetany, Poland

³ Heart Failure and Cardiac Rehabilitation Department, Faculty of Medicine, Medical University of Warsaw, Banacha 1a Street, 02-097 Warsaw, Poland

⁴ Institute of Sensory Organs, Mokra 1 Street, 05-830 Kajetany, Poland

^* Author to whom correspondence should be addressed.

J. Clin. Med. 2025, 14(3), 875; https://doi.org/10.3390/jcm14030875

Submission received: 13 November 2024 / Revised: 23 January 2025 / Accepted: 26 January 2025 / Published: 28 January 2025

(This article belongs to the Section Otolaryngology)

Abstract

Background: Despite the growing popularity of artificial intelligence (AI)-based systems such as ChatGPT, there is still little evidence of their effectiveness in audiology, particularly in pediatric audiology. The present study aimed to verify the performance of ChatGPT in this field, as assessed by both students and professionals, and to compare its Polish and English versions. Methods: ChatGPT was presented with 20 questions, which were posed twice, first in Polish and then in English. A group of 20 students and 16 professionals in the field of audiology and otolaryngology rated the answers on a Likert scale of 1 to 5 in terms of correctness, relevance, completeness, and linguistic accuracy. Both groups were also asked to assess the usefulness of ChatGPT as a source of information for patients, in educational settings for students, and in professional work. Results: Both students and professionals generally rated ChatGPT’s responses to be satisfactory. For most of the questions, ChatGPT’s responses were rated somewhat higher by the students than the professionals, although statistically significant differences were only evident for completeness and linguistic accuracy. Those who rated ChatGPT’s responses more highly also rated its usefulness more highly. Conclusions: ChatGPT can possibly be used for quick information retrieval, especially by non-experts, but it lacks the depth and reliability required by professionals. The different ratings given by students and professionals, and its language dependency, indicate it works best as a supplementary tool, not as a replacement for verifiable sources, particularly in a healthcare setting.

Keywords: artificial intelligence; ChatGPT; large language models in medicine; health information-seeking behavior; audiology; otorhinolaryngology

1. Introduction

With the proliferation of information and communication technologies, using the internet to seek health information is common [1,2]. This approach is often preferred because of its availability and coverage, convenience, affordability, interactivity, and anonymity [3,4]. Seeking health information online enables a person to quickly learn more about their health problems, manage a health condition, decide about a health option, or change behavior [5]. Increasing numbers of patients are seeking health information online, which has made health information-seeking behavior (HISB) a global trend. Chatbots based on large language models (LLMs) are becoming more commonly used for HISB [6].

One of the most advanced artificial intelligence (AI) language models is ChatGPT, developed by OpenAI (San Francisco, CA, USA) [7]. ChatGPT is an LLM that uses machine-learning techniques to generate human-like text based on a given prompt. Based on a large corpus of text, ChatGPT is able to capture the subtleties of human language, allowing it to generate appropriate and contextually relevant responses across a broad spectrum of topics [8]. Launched in November 2022, it quickly gained popularity and has become one of the fastest-growing web applications ever. According to the latest data, ChatGPT currently has approximately 180 million users [9].

While it appears that ChatGPT can provide significant support in many areas of science and education, there is potential for misuse, including the provision of biased content, limited credibility, the creation of dishonest views and opinions, and others [10,11]. ChatGPT is currently being widely tested in various fields of knowledge, including science and medicine. The potential applications of ChatGPT in the medical field range from identifying potential research topics to assisting professionals in clinical diagnosis. It has been used in the fields of psychiatry, dermatology, ophthalmology, radiology, oncology, neurology, pharmacology, and others [12,13,14,15,16,17,18]. There are numerous studies in the field of otorhinolaryngology [6,19,20], but strangely, there are only a handful in the related field of audiology and none focusing on pediatric issues [21,22,23,24,25]. This presents an unwelcome gap, since audiology deals with important issues surrounding the diagnosis, management, and treatment of hearing loss, as well as balance problems. An audiologist is responsible for fitting and dispensing hearing aids, providing hearing rehabilitation, and helping in the prevention of hearing loss. While hearing problems are usually not life-threatening, they have a not-insignificant impact on society more generally. Unlike a major disease for which a patient will immediately seek help from a professional, it appears likely that a person with a minor hearing problem (or the parent of a child with such a problem) will first seek information from the internet.

ChatGPT and other artificial intelligence tools in medicine can be used by different groups of users. One group is patients and their relatives, who are looking for information and support from the internet. Another group is students who may use this tool as an educational aid. Lastly, there are professionals who are looking for an advisory tool. Within otolaryngology, validation of information from the internet is important for all these groups, although their needs are qualitatively different. The real question is, how valid are AI-based information sources in this field [6]?

This study evaluates audiology information provided by ChatGPT in terms of the correctness, relevance, completeness, and linguistic accuracy of responses to a defined set of questions related to pediatric audiology. We also specifically wanted to know whether students and experts rated the responses similarly, and whether there were differences when questions were presented in English or Polish.

2. Materials and Methods

The first step of the study was to prepare a set of suitable open-ended questions related to pediatric audiology commonly encountered by students. The questions were then checked by the lecturer (A.R.), and finally, 20 questions were asked of ChatGPT version 3.5 and the answers analyzed. The publicly available standard version of ChatGPT was used, which has a setting to change the language of the dialogue—English is the default, but Polish is an option. The questions focused on topics relating to hearing aids (questions 1, 3, 12, 16, 20), diagnosis and audiological testing (5–7, 9–11, 13, 18), and diseases and treatments (2, 4, 8, 14, 15, 17, 19). One of the questions (number 15) was specific to Poland. The questions are listed in Table 1.

The questions were submitted to ChatGPT in Polish (using the Polish language setting) on 12 January 2024. The questions were then translated from Polish into English using DeepL [26] and presented again to ChatGPT using the English language setting on 18 January 2024 (a related study tested ChatGPT’s responses to questions posed on different days or weeks and did not find significant differences due to time [23]). The answers given in English were then translated back into Polish with DeepL, a tool that has been given good reviews [27]. The reasoning here was that the evaluation would not depend on the English-language skill of the evaluators. In this way we could take the translation element out of ChatGPT and compare the performance of its Polish and English settings in answering the same question. This process also simulated the way a person who did not know English would use the app if they knew the English version provided more information. The two versions of answers to the same question (one using the Polish-language route and the other the English-language route, with the second translated back into Polish) were given to participants for evaluation (see supplementary file). The only information they were given was that these two responses were collected at two different points in time.
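The two routes described above can be summarized in a short sketch. It is only an illustration of the workflow, not the procedure the authors actually scripted (the text suggests the steps were carried out by hand in the ChatGPT and DeepL interfaces). The functions `ask` and `translate` are hypothetical placeholders supplied by the caller, and the stand-in implementations below only label the text so the flow can be followed.

```python
from typing import Callable, Dict


def collect_answer_pair(
    question_pl: str,
    ask: Callable[[str, str], str],
    translate: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Return the two Polish-language answers that raters would compare.

    ask(question, language) and translate(text, source, target) are
    hypothetical callables standing in for the chatbot and the translator.
    """
    # Polish route: ask in Polish and keep the Polish answer unchanged.
    answer_pl = ask(question_pl, "PL")

    # English route: translate the question, ask in English,
    # then translate the answer back into Polish.
    question_en = translate(question_pl, "PL", "EN")
    answer_en = ask(question_en, "EN")
    answer_en_as_pl = translate(answer_en, "EN", "PL")

    return {"polish_route": answer_pl, "english_route": answer_en_as_pl}


# Stand-in implementations (assumptions for illustration only).
def fake_ask(question: str, language: str) -> str:
    return f"[{language} answer to: {question}]"


def fake_translate(text: str, source: str, target: str) -> str:
    return f"<{source}->{target}>{text}"


pair = collect_answer_pair("Q", fake_ask, fake_translate)
print(pair["polish_route"])
print(pair["english_route"])
```

Passing the chatbot and the translator in as parameters keeps the sketch independent of any particular service or API.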

The quality of ChatGPT’s responses was evaluated by two groups of participants, students and experts, and the framework established by Wang and Strong [28] was used. This framework has four quality (Q) categories: (1) Intrinsic Q consists of accuracy, objectivity, believability, and reputation; (2) Contextual Q consists of value-added, relevancy, timeliness, completeness, and an appropriate amount of data; (3) Representational Q consists of interpretability, ease of understanding, and representational consistency; (4) Accessibility Q consists of data accessibility and access security. Correctness (Intrinsic Q) was defined as the factual correctness of an answer and the absence of errors. Relevance (Contextual Q) rated how much an answer related to the question. Completeness (Contextual Q) evaluated whether all important information was provided. Finally, linguistic accuracy (Representational Q) was assessed by whether the text sounded natural, whether there were any strange phrases or surprising words used, and whether the technical terms were used properly. We did not evaluate the fourth category (Accessibility Q) because our study was designed so that all participants received the same ChatGPT response sets for evaluation, rather than accessing ChatGPT during the study. A similar approach was used by two other studies that tested ChatGPT [29,30]. Correctness, relevance, completeness, and linguistic accuracy of the answers were rated using a five-point Likert scale (1 = very unsatisfactory, 2 = unsatisfactory, 3 = neutral, 4 = satisfactory, and 5 = very satisfactory).
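As an illustration of how such ratings might be stored and averaged, the sketch below uses one record per rating. The record layout, the rater identifiers, and the scores are all made up for the example; the text does not describe the authors' data format.

```python
from statistics import mean

# Scale labels as given in the text.
LIKERT_LABELS = {
    1: "very unsatisfactory",
    2: "unsatisfactory",
    3: "neutral",
    4: "satisfactory",
    5: "very satisfactory",
}

# The four rated dimensions named in the text.
DIMENSIONS = ("correctness", "relevance", "completeness", "linguistic accuracy")


def validate_score(score):
    """Reject anything that is not a whole number from 1 to 5."""
    if score not in LIKERT_LABELS:
        raise ValueError(f"not a valid Likert score: {score!r}")
    return score


def mean_by_dimension(ratings):
    """Average the scores per dimension, skipping dimensions with no ratings."""
    by_dimension = {name: [] for name in DIMENSIONS}
    for record in ratings:
        by_dimension[record["dimension"]].append(validate_score(record["score"]))
    return {name: mean(scores) for name, scores in by_dimension.items() if scores}


# Made-up records (hypothetical layout: rater, group, language, question, dimension, score).
example = [
    {"rater": "S01", "group": "student", "language": "PL", "question": 1,
     "dimension": "correctness", "score": 4},
    {"rater": "E01", "group": "expert", "language": "PL", "question": 1,
     "dimension": "correctness", "score": 3},
    {"rater": "S01", "group": "student", "language": "PL", "question": 1,
     "dimension": "relevance", "score": 5},
]

print(mean_by_dimension(example))
```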

In addition, participants were asked five general questions about the usefulness of ChatGPT for patients as a source of information, for students in education, and for specialists in professional work (Table 2). Again, the participants were asked to give a score on a five-point Likert scale (1 = very low, 2 = low, 3 = neither low nor high, 4 = high, 5 = very high).

The ChatGPT responses were evaluated by a group of students (the same ones who compiled the list of questions) and a group of experts. The students (n = 20) were in the third year of a bachelor’s degree in audiophonology (similar to audiology and speech-language therapy in other countries). The experts (n = 16) consisted of professionals working in the field of audiology and otolaryngology, with many years of clinical and scientific experience, and comprised medical doctors (n = 2), hearing care professionals (n = 7), and scientists (n = 7). There were 10 who had 20 years or more of professional experience, 2 who had 13 years, 1 who had 12 years, 2 who had 10 years, and 1 who had 5 years. Among them, three were professors, five had a PhD, and the others were medical doctors or had a master’s degree.

The study was approved by the bioethics committee of the Institute of Physiology and Pathology of Hearing (KB.IFPS: 2/2024).

Statistical Analysis

A mixed-design ANOVA (analysis of variance) with Bonferroni adjustment was employed to evaluate the differences in ratings of ChatGPT responses between the students and the experts (the between-subject factor) and between the Polish and English versions (the within-subject factor). The assumptions of normality and of homogeneity of variance and sphericity were checked. An ANOVA model was chosen because it enables interaction between both factors to be analyzed. Analyses were conducted both on the average ratings and for each of the 20 questions separately. A Mann–Whitney U-test was used to compare the usefulness of ChatGPT as perceived by the students and the experts. For some analyses, Pearson correlations were also calculated. A p-value below 0.05 was considered statistically significant. The analysis was conducted using IBM SPSS Statistics v. 24 (IBM Corp, 2016, Armonk, NY, USA).
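The Bonferroni adjustment mentioned above can be sketched in a few lines. The text does not state which family of comparisons was adjusted, so the three p-values below are made up purely to show the arithmetic: each p-value is multiplied by the number of comparisons, capped at 1, and then compared with the 0.05 threshold.

```python
ALPHA = 0.05  # significance threshold stated in the text


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up raw p-values for three hypothetical comparisons.
raw_p = [0.004, 0.020, 0.300]
adjusted = bonferroni_adjust(raw_p)
significant = [p < ALPHA for p in adjusted]

for raw, adj, sig in zip(raw_p, adjusted, significant):
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}, significant: {sig}")
```

In this made-up case the second comparison would pass the threshold unadjusted (0.020) but not after adjustment, which is the conservative behavior the correction is meant to provide.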

3. Results

3.1. General Overview of ChatGPT’s Response Ratings

The average ratings of ChatGPT’s responses in four dimensions (correctness, relevance, completeness, and linguistic accuracy), as assessed by both the students and the experts and for the Polish and English versions, are presented in Table 3, together with the results of the ANOVA analysis.

The general pattern was that ChatGPT’s responses received the highest ratings for relevance (at least 4 points on average), whereas the lowest ratings were for completeness. The experts were generally more critical than the students, while both groups gave about equal ratings for the two language versions. The ANOVA revealed some statistically significant differences in the ratings given by students and experts, as well as between the Polish and English versions (bold numbers in Table 3).

The most pronounced effects were found for the linguistic accuracy of ChatGPT’s responses. The group effect was statistically significant, indicating that experts (M = 3.61; SD = 0.61) rated ChatGPT’s responses significantly lower than students (M = 4.04; SD = 0.61), regardless of the language version. This effect was moderate, η² = 0.11. The language version effect was also statistically significant, indicating that ratings for the English version (M = 3.99; SD = 0.61) were significantly higher than for the Polish version (M = 3.71; SD = 0.72), regardless of whether the rater was a student or an expert. This effect was large, η² = 0.36.
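For readers unfamiliar with the effect size reported here, the relation below may help. The text does not say which variant of η² was used; the formula assumes the partial eta squared that SPSS reports for ANOVA effects, written in terms of the sums of squares (SS), the F statistic, and the degrees of freedom (df) of the effect and its error term.

```latex
\eta_p^2
  = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
  = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
```

Under the commonly cited benchmarks attributed to Cohen (about 0.01 small, 0.06 medium, 0.14 large), a value of 0.11 falls between medium and large, and 0.36 is well above the threshold for a large effect, which matches the labels used in the text.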

A statistically significant difference between students and experts was also found for the completeness of ChatGPT’s responses. Again, experts (M = 3.48; SD = 0.54) rated ChatGPT’s responses significantly lower than students (M = 3.88; SD = 0.53), regardless of the language version. This effect was moderate, η² = 0.11. Similar results were found for correctness, although the effect did not reach statistical significance (p = 0.059). Experts (M = 3.69; SD = 0.47) rated ChatGPT’s responses generally as less correct than students (M = 4.00; SD = 0.48).

For relevance, there was a tendency for ratings to be slightly higher for the Polish version (M = 4.11; SD = 0.61) than for the English version (M = 3.99; SD = 0.47), regardless of whether the rater was a student or an expert; however, this effect did not reach statistical significance (p = 0.066).

3.2. Question-Specific Analysis of ChatGPT’s Response Ratings

ANOVA was also applied to the ratings provided by the students and the experts to both language versions of ChatGPT’s responses to each of the 20 questions. The colors in Figure 1 indicate whether the ratings were higher for students (orange) or experts (green). Saturated colors indicate statistically significant differences, while lighter shades represent non-significant differences.

The trend was that students generally rated ChatGPT’s responses higher than the experts across all questions (except question 15) and all evaluation dimensions. Experts tended to be more critical regarding quality of responses, particularly concerning linguistic accuracy and completeness.

Figure 2 shows whether the ratings were higher for ChatGPT’s responses in the English version (in blue) or in the Polish version (in yellow). Once more, the use of saturated colors indicates that the observed differences are statistically significant, whereas lighter shades represent non-significant differences.

The overall observation is that the English version was rated higher than the Polish version. This is particularly evident in the context of linguistic accuracy and completeness. Again, an exception was the rating of responses to question 15, in which the Polish version was rated higher than the English version consistently across all dimensions.

One question, question 15, related particularly to Polish circumstances. It asked, “What do the yellow and blue certificates signify in the context of hearing screening for children in Poland?” For the Polish version, ChatGPT provided an explanation, whereas in the English version, it responded, “I do not have specific information about the color-coding system used in child health books in Poland for hearing screening certificates” and followed with some advice about where such information could be found (see supplementary files for complete ChatGPT responses).

We examined how the ChatGPT responses to this question, both in the Polish and English versions, were rated by the students and the experts. The results are shown in Table 4.

There were large differences between the ratings of the Polish and English versions of question 15 in terms of correctness, relevance, and completeness. However, the differences between the ratings by students and by experts of question 15 were not statistically significant (although in terms of relevance, the difference almost reached significance, p = 0.065). The interaction effects (group × language version) were significant only in terms of relevance.

For correctness, there was a significant language effect in that the Polish version (M = 4.69; SD = 0.62) was generally rated higher than the English version (M = 1.64; SD = 0.76).

Similarly, a language effect was evident in the relevance domain, where the Polish version (M = 4.64; SD = 0.68) was rated higher than the English version (M = 2.11; SD = 1.28). A significant interaction effect was that the experts rated ChatGPT’s responses significantly higher in the Polish version than in the English version (p < 0.001). The same was true for the students. A difference between the experts and the students was found only in the English version, which was rated significantly higher by the experts than the students (p = 0.014), although the ratings for the Polish version were similar across both groups.

For completeness, a significant language effect was that ratings for the Polish version (M = 4.28; SD = 1.00) were significantly higher than for the English version (M = 1.53; SD = 0.88), irrespective of whether the rater was a student or an expert.

To sum up, ChatGPT’s responses to the specifically Polish question were, in the English version, consistently rated as less correct, less relevant, and less complete than in the Polish version. Only the formal aspect of ChatGPT’s responses to this question, namely, linguistic accuracy, was rated similarly across groups and language versions.

3.3. Assessment of the Usefulness of ChatGPT

Both students and experts evaluated the usefulness of ChatGPT in education, specialized fields, and patient care (Table 2). They also assessed its potential application in professional work and the associated risks of using ChatGPT for obtaining medical information. Their assessments were compared using a Mann–Whitney U-test and are presented in Table 5.
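The Mann–Whitney U statistic used for this comparison can be computed from ranks, as in the sketch below. The ratings are made up for illustration and are not the study's data; the sketch returns only the U statistics and does not compute the p-value that a statistics package would report.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of the first sample."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Made-up usefulness ratings on the 1 to 5 scale.
students = [4, 5, 4, 3, 5]
experts = [3, 3, 4, 2]

u_students, u_experts = mann_whitney_u(students, experts)
print(u_students, u_experts)
```

Because Likert ratings produce many ties, the sketch assigns midranks, which is the usual way to handle tied observations in rank-based tests.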

The ratings (on a scale of 1 to 5) are shown in Table 5 and reflect differing levels of perceived usefulness of ChatGPT. Overall, the students rated ChatGPT more favorably than the experts. Notably, the students gave significantly higher ratings for its usefulness in education (question 1b) compared to the experts. While most other assessments were similar between the two groups, both rated ChatGPT relatively lower for consulting specialized cases (question 1c) and higher as a source of information for patients (question 1a). Ratings for potential professional use (question 2) and the perceived risk to patients (question 3) were moderate across both groups.

Finally, Table 6 combines the scores evaluating the correctness, relevance, completeness, and linguistic accuracy together with assessment of the usefulness of ChatGPT. The table presents the correlations of average scores from the evaluation of questions in both languages (since there were no language-specific correlations). The correlations in Table 6 are for the whole group of 36 evaluators (20 students and 16 experts). There were several significant positive correlations for correctness, completeness, and linguistic accuracy; however, relevance was not significantly correlated with any usefulness score.
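A Pearson correlation of the kind summarized in Table 6 can be sketched as follows. The numbers are made up: `mean_correctness` stands for a hypothetical per-rater average rating and `usefulness` for the same raters' answers to one usefulness question.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / sqrt(s_xx * s_yy)


# Made-up values for four hypothetical raters.
mean_correctness = [3.2, 3.8, 4.1, 4.5]
usefulness = [2, 3, 4, 4]

print(round(pearson_r(mean_correctness, usefulness), 2))
```

A positive coefficient in such a calculation would correspond to the pattern described in the text, where raters who scored the responses more highly also tended to rate usefulness more highly.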

4. Discussion

This evaluation of the quality of ChatGPT responses to a set of questions related to topics in audiology brought out some interesting perspectives. Generally, the responses given by both students and experts were positive, although it should be underlined that they were mostly at the level of “satisfactory”, not “very satisfactory”. This rating of close to “satisfactory” applied to all categories of quality (intrinsic, contextual, and representational). Furthermore, the experts were generally more critical and gave lower ratings for completeness. Even though the responses to questions posed to ChatGPT in English were generally rated slightly higher by both students and experts, the response to question 15 (specific to Polish circumstances) was rated higher when questions were posed in Polish.

The long experience and large knowledge base of a group of experts are expected to be important factors in how the answers provided by ChatGPT are evaluated. Experts are likely to be more critical than students. This study largely confirms this view, in that in almost all cases the experts’ evaluations were lower than those of the students, although the differences were not large enough to be statistically significant in every category analyzed. One might assume that if the answers given were exemplary (e.g., as from an encyclopedia), then the expert and student evaluations would be very high and similar for both groups. If, however, there were differences, one might assume that they were due to greater knowledge and experience, which allows for quicker and easier recognition of shortcomings and ambiguities. The less someone knows about a particular field, the more they will be impressed by a well-formulated ChatGPT answer. We think this could be particularly harmful to patients who could be easily misled by incomplete or imprecise information.

The level of knowledge of ChatGPT in audiology and otolaryngology was rated as quite high, although the mean scores of the responses were closer to 4 (satisfactory) than 5 (very satisfactory), even in the student group. Although the differences in responses between the students and the experts were mostly not statistically significant, they show that students almost always gave higher scores than the experts. In general, the answers were pertinent and correct, but not without errors, so some caution should be exercised in their use. These results are quite similar to ChatGPT’s performance in different disciplines [31,32].

A statistically significant difference in responses between the groups was found for the completeness category, which indicates whether all important information was given and whether the answer was too vague or superficial. It is in this category that differences due to knowledge and experience are most likely to be evident. Given that ChatGPT’s responses were of reasonably good quality, albeit not error-free, ChatGPT can be considered a useful tool for people who wish to obtain quick information in a casual situation (e.g., overhearing something and wanting to check it). However, the results here show that we should have limited trust in AI, and that patients should check information with verified sources. It is noteworthy that ChatGPT has a disclaimer at the bottom of the screen warning that errors are possible and that information should be verified.

This study has shown that linguistic correctness is important, with experts being more critical than students in this area. This is probably because they can more readily navigate through the professional terminology, based on their own daily experience, whereas students only have the knowledge acquired from their teachers. As a result, experts are more likely to spot any inaccuracies or imprecise wording. Experts were easily able to identify common errors in ChatGPT responses such as lack of precision, colloquial word clusters, use of technical terms but in the wrong context, gibberish, generalizations, and truisms.

Comparison of the two language versions (Polish and English) showed that the quality of answers was better with the English setting than with the Polish setting. We were curious to find out whether the Polish and English versions of answers would be the same. Since we are not native speakers and not competent to qualitatively assess English texts, we used the DeepL tool (also based on AI) for translation into Polish. In most cases (the exception being question 15), the translated answers as given by DeepL were rated significantly better than the responses ChatGPT gave to the questions originally provided in Polish. English is the most common language on which ChatGPT has been trained, and studies have shown that it performs better in English than in other languages [13]. Nevertheless, this result is quite surprising in that despite two steps of translation (one from Polish to English to frame the questions, and a second from English to Polish to consistently frame responses), the responses were still rated as better than those obtained by staying within a common Polish framework. At this stage, we cannot say whether this is due to the better performance of ChatGPT in English, the translational skills of DeepL, or both. However, this aspect is worth testing further, since translation tools are freely available on the internet.

At this point, question 15 is worth mentioning, as it is specific to Poland (What do the yellow and blue certificates mean in the context of hearing screening for children in Poland?). In this case, ChatGPT’s response in Polish was rated higher than in English. This shows how care needs to be taken in asking country-specific questions (and questions about other cultural, religious, and linguistic circumstances) and how the language version used can affect the answer supplied. While the English version generally performed slightly better in our case, for certain local topics the native-language version can clearly be better.

Both experts and students rated ChatGPT as useful for patients as a source of information (on average coming close to a “high” rating). However, it was rated lower for students in training (“neither low nor high”), and was rated worst by professionals for consulting on difficult cases (“low”). Two recent papers also note that AI might be more appropriate for gaining patient information than for training health professionals [33,34]. For all our categories, students rated the usefulness of ChatGPT more highly, although a statistically significant difference between students and experts only showed up for the usefulness of ChatGPT as a teaching aid, where the students’ ratings were higher than those of the experts. This may be because experts have a greater level of caution and skepticism towards new technologies, in contrast to the familiarity of searching online for information in the student group [34]. Nevertheless, caution is needed, as ChatGPT has a tendency to provide incorrect or non-existent scientific references, a risk that has been explicitly set out in the fields of education, research, and healthcare [34,35,36]. In our setting, both experts and students gave a neutral assessment of the usefulness of ChatGPT for their own professional work. Nevertheless, the participants who rated ChatGPT’s responses more highly were also more eager to use it professionally or as an education tool (Table 6).

This study sheds light on which categories of quality of health information are most important for people seeking information on the internet. As shown in Table 6, only ratings in the category of linguistic accuracy (Representational Q) correlated with ratings of patient usefulness. This suggests that both experts and students consider linguistic accuracy to be highly important for patients, helping them interpret and understand health information. On the other hand, the strong correlations found in all categories of quality show that no category seems to be more important than another for students in education and for specialists in consulting difficult, specialized cases.

One important limitation of ChatGPT, and shared by all LLMs, is the phenomenon of hallucination—seemingly realistic responses by an AI system that turn out to be incorrect [37]. This creates a potential risk to the patient in that they may be misled by false information. However, both students and experts saw little overall risk to patients from using ChatGPT in that way: there was no correlation between the assessment of risk to patients and any category of health information quality. None of the participants could point to any particularly harmful information in ChatGPT’s responses.

Finally, the major drawback of ChatGPT is that, even if directly asked, it does not provide sources of information or references to scientific papers, a crucial failing when involving scientific or medical knowledge [21]. This makes it impossible to verify any statement provided by ChatGPT, and unless this drawback is overcome, it will remain the biggest limitation of this technology.

To sum up, the practical implications of this study can be categorized into two main areas: (1) the use of chatbots by patients, and (2) their application by professionals. For patients, chatbots offer a promising new avenue for delivering simplified information and providing quick explanations, particularly if integrated with the convenience of voice interaction via a mobile device. This accessibility makes chatbots a valuable tool for addressing common, non-critical questions and enhancing patient engagement. However, for more detailed or specialized information needs, chatbot limitations must be acknowledged, as they will lack the depth, accuracy, and reliability required for more nuanced topics. In a professional context, chatbots are less suited as standalone tools. Instead, they may serve as a preliminary content-drafting resource, such as creating an initial outline for a patient handout. In all cases, it is crucial to combine the outputs of a chatbot with thorough professional review to ensure accuracy and reliability. A cautious approach will mean that a chatbot can complement a professional’s expertise rather than replace it, and thus efficiency can be enhanced without compromising quality.

5. Conclusions

While ChatGPT appears to be a promising tool for initial information retrieval, particularly for non-experts, it falls short in providing the completeness and reliability required by professionals. While students found ChatGPT useful as an educational tool, experts were more skeptical, particularly since the program never provides verifiable references and sometimes provides misleading information. It is therefore recommended that, in the training phase, reliable sources of scientific information such as publications listed in PubMed, etc., be used. The differences in perception between students and experts, as well as between language versions, suggest that while AI can serve as a useful supplementary tool, it cannot replace traditional, verified sources, particularly in critical fields like healthcare. AI offers no quick, error-free, and reliable path to education; the only reliable sources of education remain the standard teaching and educational tools.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm14030875/s1, Questions to ChatGPT in Polish (original version), Questions with ChatGPT answers in Polish (original version), Questions to ChatGPT in English language version, Questions with ChatGPT answers in English language version, ChatGPT’s answers (to questions posed in English) translated into Polish using DeepL.

Author Contributions

Conceptualization, A.R., W.W.J. and P.H.S.; methodology, W.W.J., E.G. and A.L.; formal analysis, A.R., E.G. and W.W.J.; data curation, A.R., E.G. and W.W.J.; writing—original draft preparation, A.R., E.G., A.L. and W.W.J.; writing—review and editing, P.H.S. and H.S.; supervision, A.R., P.H.S. and H.S. All authors discussed the results and implications and commented on the manuscript at all stages. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was approved by the bioethics committee of the Institute of Physiology and Pathology of Hearing (KB.IFPS: 2/2024) on 15 February 2024.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article and its Supplementary Materials.

Acknowledgments

The authors would like to thank the students and experts who participated in this study. The authors also thank Andrew Bell for comments on an earlier version of this manuscript. An earlier version of this manuscript was posted to the medRxiv preprint server: https://www.medrxiv.org/content/10.1101/2024.10.24.24316037v1 (accessed on 28 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bundorf, M.K.; Wagner, T.H.; Singer, S.J.; Baker, L.C. Who Searches the Internet for Health Information? Health Serv. Res. 2006, 41, 819–836.
Maon, S.N.; Hassan, N.M.; Seman, S.A.A. Online Health Information Seeking Behavior Pattern. Adv. Sci. Lett. 2017, 23, 10582–10585.
Powell, J.; Inglis, N.; Ronnie, J.; Large, S. The Characteristics and Motivations of Online Health Information Seekers: Cross-Sectional Survey and Qualitative Interview Study. J. Med. Internet Res. 2011, 13, e1600.
Lagoe, C.; Atkin, D. Health Anxiety in the Digital Age: An Exploration of Psychological Determinants of Online Health Information Seeking. Comput. Hum. Behav. 2015, 52, 484–491.
Ghahramani, F.; Wang, J. Impact of Smartphones on Quality of Life: A Health Information Behavior Perspective. Inf. Syst. Front. 2020, 22, 1275–1290.
Nielsen, J.P.S.; von Buchwald, C.; Grønhøj, C. Validity of the Large Language Model ChatGPT (GPT4) as a Patient Information Source in Otolaryngology by a Variety of Doctors in a Tertiary Otorhinolaryngology Department. Acta Oto-Laryngol. 2023, 143, 779–782.
Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312.
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 1 November 2024).
Number of ChatGPT Users (Apr 2024). Available online: https://explodingtopics.com/blog/chatgpt-users (accessed on 4 April 2024).
Biswas, S.S. Role of Chat GPT in Public Health. Ann. Biomed. Eng. 2023, 51, 868–869.
Pawar, V.V.; Farooqui, S. Ethical Consideration for Implementing AI in Healthcare: A Chat GPT Perspective. Oral. Oncol. 2024, 149, 106682.
Luykx, J.J.; Gerritse, F.; Habets, P.C.; Vinkers, C.H. The Performance of ChatGPT in Generating Answers to Clinical Questions in Psychiatry: A Two-Layer Assessment. World Psychiatry 2023, 22, 479–480.
Lewandowski, M.; Łukowicz, P.; Świetlik, D.; Barańska-Rybak, W. ChatGPT-3.5 and ChatGPT-4 Dermatological Knowledge Level Based on the Specialty Certificate Examination in Dermatology. Clin. Exp. Dermatol. 2024, 49, 686–691.
Samaan, J.S.; Rajeev, N.; Ng, W.H.; Srinivasan, N.; Busam, J.A.; Yeo, Y.H.; Samakar, K. ChatGPT as a Source of Information for Bariatric Surgery Patients: A Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5. Obes. Surg. 2024, 34, 1987–1989.
Zaidat, B.; Shrestha, N.; Rosenberg, A.M.; Ahmed, W.; Rajjoub, R.; Hoang, T.; Mejia, M.R.; Duey, A.H.; Tang, J.E.; Kim, J.S.; et al. Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery. Neurospine 2024, 21, 128–146.
Emile, S.H.; Horesh, N.; Freund, M.; Pellino, G.; Oliveira, L.; Wignakumar, A.; Wexner, S.D. How Appropriate Are Answers of Online Chat-Based Artificial Intelligence (ChatGPT) to Common Questions on Colon Cancer? Surgery 2023, 174, 1273–1275.
Maida, E.; Moccia, M.; Palladino, R.; Borriello, G.; Affinito, G.; Clerico, M.; Repice, A.M.; Di Sapio, A.; Iodice, R.; Spiezia, A.L.; et al. ChatGPT vs. Neurologists: A Cross-Sectional Study Investigating Preference, Satisfaction Ratings and Perceived Empathy in Responses among People Living with Multiple Sclerosis. J. Neurol. 2024, 271, 4057–4066.
Huang, C.; Hong, D.; Chen, X. ChatGPT in Medicine: Evaluating Psoriasis Patient Concerns. Ski. Res. Technol. 2024, 30, e13680.
Topsakal, O.; Akinci, T.C.; Celikoyar, M. Evaluating Patient and Otolaryngologist Dialogues Generated by ChatGPT, Are They Adequate? Res. Sq. 2023.
Moise, A.; Centomo-Bozzo, A.; Orishchak, O.; Alnoury, M.K.; Daniel, S.J. Can ChatGPT Replace an Otolaryngologist in Guiding Parents on Tonsillectomy? Ear Nose Throat J. 2024, 4556, 01455613241230841.
Jedrzejczak, W.W.; Skarzynski, P.H.; Raj-Koziak, D.; Sanfins, M.D.; Hatzopoulos, S.; Kochanek, K. ChatGPT for Tinnitus Information and Support: Response Accuracy and Retest after Three Months. medRxiv 2023.
Jedrzejczak, W.W.; Kochanek, K. Comparison of the Audiological Knowledge of Three Chatbots–ChatGPT, Bing Chat, and Bard. medRxiv 2023.
Kochanek, K.; Skarzynski, H.; Jedrzejczak, W.W. Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing. Cureus 2024, 16, e59857.
Wang, S.; Mo, C.; Chen, Y.; Dai, X.; Wang, H.; Shen, X. Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care. JMIR Med. Educ. 2024, 10, e55595.
Pastucha, M. Do chatbots provide reliable information about mobile apps in audiology? J. Hear. Sci. 2024, 14, 9–15.
Introduction|DeepL API Documentation. Available online: https://developers.deepl.com/docs/ (accessed on 16 October 2024).
Kur, M. Method of Measuring the Effort Related to Post-Editing Machine Translated Outputs Produced in the English>Polish Language Pair by Google, Microsoft and DeepL MT Engines: A Pilot Study. Beyond Philol. Int. J. Linguist. Lit. Stud. Engl. Lang. Teach. 2019, 69–99.
Wang, R.Y.; Strong, D.M. Beyond Accuracy: What Data Quality Means to Data Consumers. J. Manag. Inf. Syst. 1996, 12, 5–33.
Kim, R.; Margolis, A.; Barile, J.; Han, K.; Kalash, S.; Papaioannou, H.; Krevskaya, A.; Milanaik, R. Challenging the Chatbot: An Assessment of ChatGPT’s Diagnoses and Recommendations for DBP Case Studies. J. Dev. Behav. Pediatr. 2024, 45, e8–e13.
He, W.; Zhang, W.; Jin, Y.; Zhou, Q.; Zhang, H.; Xia, Q. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J. Med. Internet Res. 2024, 26, e54706.
Gumilar, K.E.; Indraprasta, B.R.; Faridzi, A.S.; Wibowo, B.M.; Herlambang, A.; Rahestyningtyas, E.; Irawan, B.; Tambunan, Z.; Bustomi, A.F.; Brahmantara, B.N.; et al. Assessment of Large Language Models (LLMs) in Decision-Making Support for Gynecologic Oncology. Comput. Struct. Biotechnol. J. 2024, 23, 4019–4026.
Ismaiel, N.; Nguyen, T.P.; Guo, N.; Carvalho, B.; Sultan, P.; Study Collaborators. The Evaluation of the Performance of ChatGPT in the Management of Labor Analgesia. J. Clin. Anesth. 2024, 98, 111582.
Frosolini, A.; Franz, L.; Benedetti, S.; Vaira, L.A.; de Filippis, C.; Gennaro, P.; Marioni, G.; Gabriele, G. Assessing the Accuracy of ChatGPT References in Head and Neck and ENT Disciplines. Eur. Arch. Otorhinolaryngol. 2023, 280, 5129–5133.
Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare 2023, 11, 887.
Plevris, V.; Papazafeiropoulos, G.; Jiménez Rios, A. Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI 2023, 4, 949–969.
Jędrzejczak, W.W.; Pastucha, M.; Skarżyński, H.; Kochanek, K. Comparison of ChatGPT and Gemini as Sources of References in Otorhinolaryngology. medRxiv 2024.
Chelli, M.; Descamps, J.; Lavoué, V.; Trojani, C.; Azar, M.; Deckert, M.; Raynier, J.-L.; Clowez, G.; Boileau, P.; Ruetsch-Chelli, C. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J. Med. Internet Res. 2024, 26, e53164.

Figure 1. Comparison of students’ and experts’ ratings for ChatGPT’s responses across 20 questions (q1–q20). Legend: orange: average student ratings higher than expert ratings; green: average expert ratings higher than student ratings. Saturated colors are statistically significant differences; light colors are statistically non-significant.

Figure 2. Comparison of English and Polish versions in terms of ratings of ChatGPT’s responses to 20 questions. Legend: blue: English version ratings higher than Polish version ratings; yellow: Polish version ratings higher than English version ratings. Saturated colors are statistically significant differences, and light colors are statistically non-significant; white indicates equal ratings.

Table 1. English versions of the questions that were posed to ChatGPT (translated from Polish using DeepL). The original Polish wording can be found in the Supplementary Material.

No.	Question
1	What is the CROS system?
2	What is conductive hearing loss?
3	What frequencies are responsible for the reception of speech sounds in children?
4	What can be the causes of otitis in a child?
5	How to interpret the test of tonal audiometry, what is the norm of hearing?
6	What is the ABR test in a child?
7	What is newborn hearing screening?
8	What is the course of exudative otitis media in a child?
9	How can hearing be tested in a one-year-old child?
10	What is the verbal audiometry * test?
11	What types of tympanograms are distinguished in impedance audiometry?
12	For what purpose is an ear impression performed in children?
13	How to determine air conduction in the audiogram?
14	What should be done in the case of an abnormal result of hearing screening in a newborn?
15	What do the yellow and blue certificates in the child’s health book mean in the context of hearing screening of children in Poland?
16	What are the contraindications to the use of air-conduction hearing aids?
17	What are the most effective methods of treatment of exudative otitis media in a child?
18	What is the result of a verbal audiometry * test and how to interpret it?
19	What are the results of otoscopy for otosclerosis?
20	What implantable hearing prostheses can be used in children?

* “Verbal audiometry” is a term made up by DeepL when translating from Polish to English; the usual term is “speech audiometry.”

Table 2. Questions asked of the participants relating to the usefulness of ChatGPT.

No.	Question
1a	How do you rate the usefulness of the ChatGPT tool for patients as a source of information?
1b	How do you rate the usefulness of the ChatGPT tool for students in education?
1c	How do you assess the usefulness of the ChatGPT tool for specialists in consulting difficult, specialized cases?
2	How do you assess the possibility of using the ChatGPT tool in your professional work?
3	How do you assess the level of risk to the patient in using ChatGPT to obtain information?

Table 3. Average ratings by students and experts of the Polish and English versions of all ChatGPT responses.

Measure	Group	Polish Version M	Polish Version SD	English Version M	English Version SD	Group Effect (F; p)	Language Version Effect (F; p)	Interaction Effect (F; p)
Correctness	Student	4.02	0.70	3.98	0.45	3.81; p = 0.059	0.54; p = 0.469	0.01; p = 0.914
Correctness	Expert	3.72	0.41	3.66	0.43	3.81; p = 0.059	0.54; p = 0.469	0.01; p = 0.914
Relevance	Student	4.17	0.64	4.11	0.41	0.21; p = 0.648	3.62; p = 0.066	0.69; p = 0.411
Relevance	Expert	4.03	0.59	4.00	0.56	0.21; p = 0.648	3.62; p = 0.066	0.69; p = 0.411
Completeness	Student	3.84	0.73	3.67	0.48	5.00; p = 0.032 *	1.23; p = 0.274	0.02; p = 0.887
Completeness	Expert	3.45	0.45	3.92	0.55	5.00; p = 0.032 *	1.23; p = 0.274	0.02; p = 0.887
Linguistic accuracy	Student	3.93	0.83	4.16	0.65	5.39; p = 0.044 *	19.07; p < 0.001 **	0.53; p = 0.470
Linguistic accuracy	Expert	3.45	0.45	3.78	0.51	5.39; p = 0.044 *	19.07; p < 0.001 **	0.53; p = 0.470

* p < 0.05, ** p < 0.01. M, mean; SD, standard deviation; F, Fisher test statistic; p, p-value.

Table 4. Ratings by students and by experts of ChatGPT’s responses to the Polish and English versions of question 15: “What do the yellow and blue certificates mean in the context of hearing screening for children in Poland?”

Measure	Group	Polish Version M	Polish Version SD	English Version M	English Version SD	Group Effect (F; p)	Language Version Effect (F; p; η²)	Interaction Effect (F; p; η²)
Correctness	Student	4.70	0.66	1.60	0.75	0.06; p = 0.807	282.56; p < 0.001 ** η² = 0.89	0.08; p = 0.785
Correctness	Expert	4.69	0.60	1.69	0.79	0.06; p = 0.807	282.56; p < 0.001 ** η² = 0.89	0.08; p = 0.785
Relevance	Student	4.70	0.66	1.65	0.81	3.62; p = 0.065 η² = 0.10	120.77; p < 0.001 ** η² = 0.78	6.87; p = 0.013 * η² = 0.17
Relevance	Expert	4.56	0.73	2.69	1.54	3.62; p = 0.065 η² = 0.10	120.77; p < 0.001 ** η² = 0.78	6.87; p = 0.013 * η² = 0.17
Completeness	Student	4.20	1.20	1.45	0.89	0.91; p = 0.346	110.48; p < 0.001 ** η² = 0.77	0.00; p > 0.999
Completeness	Expert	4.38	0.72	1.63	0.89	0.91; p = 0.346	110.48; p < 0.001 ** η² = 0.77	0.00; p > 0.999
Linguistic accuracy	Student	4.30	1.03	3.80	1.28	0.08; p = 0.786	2.01; p = 0.165	0.072; p = 0.401
Linguistic accuracy	Expert	4.19	0.54	4.06	1.12	0.08; p = 0.786	2.01; p = 0.165	0.072; p = 0.401

* p < 0.05, ** p < 0.01. M, mean; SD, standard deviation; F, Fisher test statistic; η², effect size; p, p-value.
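
As a reading aid for the η² column, the sketch below shows how an effect size of this kind can be recovered from a reported F value. It assumes that η² here is partial eta squared and that the error degrees of freedom are 34 (36 raters minus 2 groups); neither is stated in the table, so the function name and both constants are illustrative assumptions rather than the authors' procedure.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Assumed degrees of freedom (not given in Table 4):
# 1 for the two-level language factor, 34 for error (36 raters - 2 groups).
DF_EFFECT = 1
DF_ERROR = 34

# Language-version F value for correctness, copied from Table 4.
eta_correctness = partial_eta_squared(282.56, DF_EFFECT, DF_ERROR)
print(round(eta_correctness, 2))  # 0.89
```

Under these assumptions the correctness value agrees with the table. Other rows may differ slightly if the actual degrees of freedom differ from the ones assumed here.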

Table 5. Ratings by students and experts of ChatGPT’s usefulness from Table 2.

Question No.	All Participants M	All Participants SD	Students M	Students SD	Experts M	Experts SD	U; p
1a	3.78	0.72	3.90	0.72	3.63	0.72	120.0; p = 0.160
1b	3.25	0.97	3.55	0.89	2.88	0.96	96.0; p = 0.031 *
1c	2.06	0.83	2.20	0.89	1.87	0.72	127.0; p = 0.264
2	2.86	0.80	3.05	0.83	2.63	0.72	117.0; p = 0.143
3	2.94	0.75	3.00	0.86	2.88	0.62	151.0; p = 0.753

* p < 0.05. M, mean; SD, standard deviation; U, Mann-Whitney test statistic; p = p-value.
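
For readers unfamiliar with the U statistic in Table 5, the following is a minimal sketch of how a Mann-Whitney U value is obtained from two groups of Likert ratings. The ratings below are hypothetical, since individual responses are not listed in this section, and the function name is illustrative. The sketch returns the smaller of the two possible U values; this is an assumption about the reporting convention, although it is consistent with every U in Table 5 lying below 20 × 16 / 2 = 160. No p-value is computed.

```python
from typing import Sequence


def mann_whitney_u(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Return the smaller Mann-Whitney U statistic, using average ranks for ties."""
    pooled = sorted(list(group_a) + list(group_b))

    # Map each distinct value to the mean of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank_of[x] for x in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


# Hypothetical 1-5 ratings, for illustration only (not data from the study).
students = [4, 4, 3, 5, 4]
experts = [3, 3, 4, 2, 3]
print(mann_whitney_u(students, experts))  # 4.0
```

A U close to zero means the two groups barely overlap, while a U close to half the product of the group sizes means the ratings are thoroughly intermixed.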

Table 6. Correlation between the usefulness of ChatGPT and average ratings given to ChatGPT responses. The correlations are for the whole group of 36 evaluators (20 students and 16 experts).

Question No.	Correctness	Relevance	Completeness	Linguistic Accuracy
1a	0.30	0.26	0.30	0.33 *
1b	0.51 **	0.23	0.43 **	0.39 *
1c	0.49 **	0.26	0.38 *	0.37 *
2	0.38 *	0.26	0.41 *	0.50 **
3	0.11	0.03	0.10	−0.07

* p < 0.05, ** p < 0.01.
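
The significance markers in Table 6 can be sanity-checked from the coefficients and the sample size alone. The sketch below converts a correlation coefficient to a t statistic with n − 2 = 34 degrees of freedom and compares it with standard two-tailed critical values. The type of correlation coefficient is not stated in this section, so this is an assumption-laden check (the transformation is exact for Pearson correlations under normality and only approximate for rank correlations), and the function names are illustrative.

```python
import math

N_RATERS = 36      # 20 students + 16 experts, from the Table 6 caption
T_CRIT_05 = 2.032  # two-tailed critical t, df = 34, alpha = 0.05 (standard table value)
T_CRIT_01 = 2.728  # two-tailed critical t, df = 34, alpha = 0.01 (standard table value)


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing a correlation coefficient r against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def stars(r: float, n: int = N_RATERS) -> str:
    """Return '**', '*', or '' using the critical values assumed above."""
    t = abs(t_from_r(r, n))
    if t > T_CRIT_01:
        return "**"
    if t > T_CRIT_05:
        return "*"
    return ""


# Rounded coefficients copied from Table 6.
for r in (0.30, 0.33, 0.41, 0.43):
    print(r, stars(r))
```

With 36 raters the thresholds fall at roughly |r| = 0.33 for p < 0.05 and |r| = 0.42 for p < 0.01, which matches the pattern of asterisks in the table. Because the published coefficients are rounded, values sitting near a threshold should be read with some caution.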

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
