Dialogues with AI: Comparing ChatGPT, Bard, and Human Participants’ Responses in In-Depth Interviews on Adolescent Health Care
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The number of references (19) is too low to show the current state of the art.
Author Response
The number of references (19) is too low to show the current state of the art.
Response: Thank you for your comment. We acknowledge your concern about the number of references in our article. As you undoubtedly know, academic research on AI is rapidly evolving with numerous preprints being published regularly. In our effort to ensure the reliability and credibility of the information presented, we have adopted a stringent approach in selecting our sources, focusing primarily on peer-reviewed publications. While this may have resulted in a smaller number of references, we believe it enhances the quality and validity of the content by drawing from established research findings. However, we appreciate your feedback and will continue to strive for a balanced representation of the current state of the art in AI research.
Reviewer 2 Report
Comments and Suggestions for Authors
Main Points:
- The study aim is not entirely clear
- Methods are too imprecise and may not answer any study question. (e.g. can AI fake a participant?)
- Results are not results but further methodology
- Discussion is like a second result section
- Language is imprecise but very "verbose".
General comments:
Clarify study aim and methods
The authors should state whether any paragraphs and phrases in this article may have been written by ChatGPT.
Abbreviations should be explained the first time they are used.
What was the prompt given to GPT to have these conversations?
Please also discuss the privacy issues that ChatGPT adheres to regarding medical applications.
Generally there are some grammatical errors in the text. They should be revised.
LLMs usually give different answers for the same prompts - how was this problem dealt with?
Which prompts were used?
Introduction:
ChatGPT is a transformer model, which is a deep learning model, and was not created with the help of such models.
Lines 38-39: Needs a reference.
48-59: It is unnecessary to explain what qualitative research is.
60-62: Where does the assumption that it is a collective subconsciousness come from? Please add references.
63; 65: The model is extremely dependent on the research field and thus task.
68-82: Please add references
It should generally be pointed out that ChatGPT was not developed and tested to transfer knowledge, especially not in the field of medicine.
Methods:
The Methods section should describe how the study was conducted, not in detail why.
98-99: Please specify what these profiles look like.
101-102: Specify how the data collection was conducted
Generally, describe the study that was imitated and then refer to it. Make the methods more concise.
103-119: Please refer to the questions you used in the reference study.
120: What is meant by open coding?
Please provide a brief explanation for the sample size of 10+10.
120+129: Please rephrase. This paragraph is not concisely written.
The actual description of the qualitative analysis is missing. Hypothesis and study aim is also missing. Please refer to the reporting guidelines for qualitative research (e.g. SRQR, COREQ).
Results:
The results section is not supposed to add information on methods! It should solely display the actual results of the study. The wording is too vague; more clarity is needed.
Generally a comment: how were the privacy issues that ChatGPT adheres to regarding medical discussions handled?
e.g. 147: “placed significant trust” – Which criteria were used to evaluate this?
156: your own opinion is not appropriate in the Results. Please just describe the outcome of your data analysis.
175,176: whose assumption is it? How did you find this out?
The language should be clear, simple, easy to understand.
209-218: complex language; however, there is no clear message in this paragraph.
Figure 1 is unreasonable. Here it should be shown how it was done and what is being compared; what is the point of this graphic, and what is the methodology behind it?
238-240: unclear, please explain
Discussion:
249: “To streamline the research experience” – What?
246-257: more concise.
253-255: not understandable
255: which concepts
258-260: first real information. Should start the discussion section.
The discussion has the same problems as the results. No clear statements. The discussion is not there to restate the results.
287: “large language models” not “large linguistic models”!!!
287-289: why is this relevant
289: OpenAI is a company, Lambda is an LLM. These are two different things.
290: Lambda is an LLM!
298: “AI relies heavily on factual knowledge” – WRONG: If you state this, give a reference
301: see 298
317-318: this statement is not correct. No static nature, it has a random factor.
320-322: reference
316: AI Revolution? What is the relevance of COVID?
343-348 The authors should focus on their own research and what their contribution to science is. What is the added value to the scientific field?
349-350: which methods, and why do you suggest adapting the researchers' methodology?
350-353: unconcise language, as if written with an LLM
355: ChatGPT states that its LLM improves something
General comment to conclusion: What is your finding? What is the added value to the scientific field?
Comments on the Quality of English Language
Could use some language editing.
Author Response
Main Points:
- The study aim is not entirely clear
- Methods are too imprecise and may not answer any study question. (e.g. can AI fake a participant?)
- Results are not results but further methodology
- Discussion is like a second result section
- Language is imprecise but very "verbose".
Response: We thank you for your constructive feedback on our manuscript. We have incorporated several of your points into this revised version, and you can find a more detailed overview below.
General comments:
Clarify study aim and methods
The authors should state whether any paragraphs and phrases in this article may have been written by ChatGPT.
Abbreviations should be explained the first time they are used.
What was the prompt given to GPT to have these conversations?
Please also discuss the privacy issues that ChatGPT adheres to regarding medical applications.
Generally there are some grammatical errors in the text. They should be revised.
LLMs usually give different answers for the same prompts - how was this problem dealt with?
Which prompts were used?
Response: We have now created a separate section (1.1 study aims) to more clearly highlight the main aims of our study. No paragraphs were written using ChatGPT or other AI. In Appendix A, we now also present an overview of the cases that were presented to both AI and human participants.
Introduction:
ChatGPT is a transformer model, which is a deep learning model, and was not created with the help of such models.
38-39: Needs a reference
48-59: It is unnecessary to explain what qualitative research is.
60-62: Where does the assumption that it is a collective subconsciousness come from? Please add references.
63; 65: The model is extremely dependent on the research field and thus task.
68-82: Please add references
It should generally be pointed out that ChatGPT was not developed and tested to transfer knowledge, especially not in the field of medicine.
Response: Thank you for these in-depth comments. We have made various revisions to our introduction based on your feedback. We have added references in several sections that you highlighted, and added clarifications regarding the characterization of ChatGPT throughout.
Methods:
The Methods section should describe how the study was conducted, not in detail why.
98-99: Please specify what these profiles look like.
101-102: Specify how the data collection was conducted
Generally, describe the study that was imitated and then refer to it. Make the methods more concise.
103-119: Please refer to the questions you used in the reference study.
120: What is meant by open coding?
Please provide a brief explanation for the sample size of 10+10.
120+129: Please rephrase. This paragraph is not concisely written.
The actual description of the qualitative analysis is missing. Hypothesis and study aim is also missing. Please refer to the reporting guidelines for qualitative research (e.g. SRQR, COREQ).
Response: We appreciate your views on our methods section. In the materials and methods section, we highlight that we generated AI personas based on participant profiles in the face-to-face interviews:
AI personas were generated based on the same participant profiles that were recruited in the study by Donck et al. [16]. This approach ensured consistent representation in terms of age, gender, marital status, and parental status. The interactive sessions, designed to simulate online interviews, faithfully mirrored the conditions of data collection established in the original study [16].
In Appendix A, we now also present an overview of the cases that were presented to both AI and human participants. Subsequent follow-up questions were not included as they differed depending on the individual. We created 20 participants (10 women, 10 men) to mirror the number and gender distribution of participants in Donck et al.
Open coding involves dissecting your data into distinct components and assigning "codes" to them for identification. As its name suggests, open coding aims to unlock new theoretical avenues as you initially interact with your qualitative data. The objective of segmenting data and labeling it with codes is to empower the researcher to consistently analyze and juxtapose similar occurrences within the data. This entails gathering all data fragments (e.g., quotes) labeled with a specific code. By doing so, this method challenges preconceived notions and biases, fostering a more objective approach to research.
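To make the constant-comparison step concrete, below is a minimal, purely illustrative Python sketch of the bookkeeping behind open coding: fragments are labeled with codes, and all fragments sharing a code are gathered for comparison. The quoted fragments are invented placeholders, not data from the study; the code names follow the themes identified in our analysis.

```python
# Illustrative sketch only (not the authors' analysis software): open coding
# labels data fragments with codes, then gathers all fragments that share a
# code so they can be consistently analyzed and juxtaposed.
from collections import defaultdict

# Invented placeholder fragments; code names mirror the study's themes.
coded_fragments = [
    ("fragment about how the adolescent ended up in this situation", ["etiology"]),
    ("fragment about the adolescent's right to a private consultation", ["privacy"]),
    ("fragment about who bears responsibility for the decision", ["responsibility", "patient characteristics"]),
]

by_code: dict[str, list[str]] = defaultdict(list)
for fragment, codes in coded_fragments:
    for code in codes:
        by_code[code].append(fragment)

# Juxtapose all fragments labeled with the same code for constant comparison.
for code, fragments in sorted(by_code.items()):
    print(f"{code}: {fragments}")
```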
We revised our methodology with the COREQ-guidelines in mind. The detailed description of the qualitative analysis is here:
The initial phase of data analysis involved open coding, wherein interview outcomes were distilled into summarized concepts. To bolster research robustness and mitigate biases, interviews conducted with ChatGPT and Bard were independently coded by two-person research groups. Collaboration within these groups was integral to the coding process [17]. Subsequently, the axial coding phase was initiated to formulate overarching categories that encompassed the entire interview data for each AI model. In the final selective coding phase, the connections identified earlier were validated through discussions that involved comparing included and omitted data [18]. This coding approach resulted in the identification of primary themes related to confidentiality: etiology, privacy, responsibility, and patient characteristics.
Results:
The results section is not supposed to add information on methods! It should solely display the actual results of the study. The wording is too vague; more clarity is needed.
Generally a comment: how were the privacy issues that ChatGPT adheres to regarding medical discussions handled?
e.g. 147: “placed significant trust” – Which criteria were used to evaluate this?
156: your own opinion is not appropriate in the Results. Please just describe the outcome of your data analysis.
175,176: whose assumption is it? How did you find this out?
The language should be clear, simple, easy to understand.
209-218: complex language; however, there is no clear message in this paragraph.
Figure 1 is unreasonable. Here it should be shown how it was done and what is being compared; what is the point of this graphic, and what is the methodology behind it?
238-240: unclear, please explain
Discussion:
249: “To streamline the research experience” – What?
246-257: more concise.
253-255: not understandable
255: which concepts
258-260: first real information. Should start the discussion section.
The discussion has the same problems as the results. No clear statements. The discussion is not there to restate the results.
287: “large language models” not “large linguistic models”!!!
287-289: why is this relevant
289: OpenAI is a company, Lambda is an LLM. These are two different things.
290: Lambda is an LLM!
298: “AI relies heavily on factual knowledge” – WRONG: If you state this, give a reference
301: see 298
317-318: this statement is not correct. No static nature, it has a random factor.
320-322: reference
316: AI Revolution? What is the relevance of COVID?
343-348 The authors should focus on their own research and what their contribution to science is. What is the added value to the scientific field?
349-350: which methods, and why do you suggest adapting the researchers' methodology?
350-353: unconcise language, as if written with an LLM
355: ChatGPT states that its LLM improves something
General comment to conclusion: What is your finding? What is the added value to the scientific field?
Response: Thank you for these valuable comments on our results and discussion. We have made various revisions based on them, adding to the clarity and succinctness of our manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
1. Provide more details on the methodology, especially on how the AI personas were generated and how they correspond to real human demographics. Additionally, explaining the selection criteria for ChatGPT and Bard versions used would offer insight into the decision-making process and any potential biases.
2. Enhance the section comparing AI and human responses by developing a more structured framework. This could include specific metrics or criteria used for comparison, which would help readers understand the basis for conclusions drawn about the performance and reliability of AI models versus human participants.
3. Expand the discussion on ethical considerations. While the paper briefly mentions the potential for misuse of AI in research, a deeper exploration of the ethical implications, especially regarding privacy, consent, and the authenticity of AI-generated data, would be beneficial.
4. The discussion on the limitations of AI models, particularly Bard, is insightful. However, providing a more critical analysis of why these limitations exist (e.g., training data biases, model architecture) and suggesting potential solutions or areas for future improvement could enrich the paper.
5. The section on future perspectives hints at the exploration of 'cultural bias' in AI models. Expanding on this by proposing specific research questions or methodologies could guide future work in this area. Additionally, considering the rapid evolution of AI technologies, suggesting how ongoing developments might impact the findings would be valuable.
6. If applicable, incorporating statistical analysis to quantify the differences between AI and human responses could strengthen the findings. This could involve using measures of agreement or correlation to assess the consistency between AI-generated responses and human responses.
7. The broader implications section could be expanded to discuss the potential impact of AI in qualitative research beyond the specific context of adolescent health care. This might include implications for data collection, analysis, and the role of AI in enhancing or complementing traditional qualitative research methods.
8. Ensure the paper undergoes thorough proofreading to correct any typographical or grammatical errors. Additionally, considering the use of more visuals or tables to summarize findings and comparisons could make the paper more engaging and easier to digest.
9. Strengthen the literature review by engaging more critically with existing studies on the use of AI in qualitative research. Highlighting gaps that your study addresses and positioning your findings within the broader discourse could provide more context for your contributions.
10. Proofread the paper and remove the grammatical errors.
Comments on the Quality of English Language
Minor editing of English language required.
Author Response
- Provide more details on the methodology, especially on how the AI personas were generated and how they correspond to real human demographics. Additionally, explaining the selection criteria for ChatGPT and Bard versions used would offer insight into the decision-making process and any potential biases.
Response: In the materials and methods section, we highlight that we generated AI personas based on participant profiles in the face-to-face interviews:
AI personas were generated based on the same participant profiles that were recruited in the study by Donck et al. [16]. This approach ensured consistent representation in terms of age, gender, marital status, and parental status. The interactive sessions, designed to simulate online interviews, faithfully mirrored the conditions of data collection established in the original study [16].
Thus, knowing these basic characteristics of each participant in the human interviews, we prompted the respective LLMs with the questions we were posing as if they were, for example, a married 40-year-old man with a 15-year-old daughter. We repeated this procedure for each AI participant to mirror each human participant.
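As a purely hypothetical illustration of this persona-prompting procedure, such a prompt could be assembled from a participant profile as sketched below. The field names and prompt wording are ours, invented for illustration; the study itself used Dutch prompts administered through the chat interfaces.

```python
# Hypothetical sketch of assembling a persona prompt from a participant
# profile; field names and wording are illustrative, not the exact Dutch
# prompts used in the study.
from dataclasses import dataclass

@dataclass
class Profile:
    age: int
    gender: str          # e.g. "man" or "woman"
    marital_status: str  # e.g. "married"
    child_age: int
    child_role: str      # e.g. "daughter" or "son"

def persona_prompt(p: Profile) -> str:
    return (
        f"For the following interview, answer every question as if you are "
        f"a {p.marital_status} {p.age}-year-old {p.gender} with a "
        f"{p.child_age}-year-old {p.child_role}. Stay in character."
    )

# Mirrors the example given above.
print(persona_prompt(Profile(40, "man", "married", 15, "daughter")))
```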
- Enhance the section comparing AI and human responses by developing a more structured framework. This could include specific metrics or criteria used for comparison, which would help readers understand the basis for conclusions drawn about the performance and reliability of AI models versus human participants.
Response: We understand your request for specific metrics, but that implies a quantitative approach, which was not the design of the current (qualitative) study. What we intended to do is compare AI and human responses through the themes that emerged from the interviews conducted with each. Although this approach cannot provide statistically significant results regarding correlation, this is not the end goal of a qualitative approach. Rather, the aim of our qualitative approach is to provide a nuanced understanding of the similarities and differences between AI and human responses within the context of the themes identified in the interviews.
- Expand the discussion on ethical considerations. While the paper briefly mentions the potential for misuse of AI in research, a deeper exploration of the ethical implications, especially regarding privacy, consent, and the authenticity of AI-generated data, would be beneficial.
Response: We agree that a more in-depth discussion could be an added value, and have added this to the introduction:
Transparency regarding the use of AI and clearly distinguishing between AI-generated and human-generated data is essential to prevent misleading interpretations. Failure to do so can erode trust in research findings and compromise the credibility of the research process. Furthermore, ethical considerations extend to the development and deployment of AI models themselves. Developers must adhere to ethical guidelines and principles, such as fairness, transparency, and accountability, to mitigate the risk of bias and ensure the responsible use of AI in research settings [4,5]. Regular audits and assessments of AI models' performance and biases are essential to identify and address potential ethical concerns. Additionally, biased AI models can perpetuate existing biases, further compromising research validity. Safeguards such as transparency and robust validation procedures are necessary to maintain research integrity in the face of evolving AI capabilities.
- The discussion on the limitations of AI models, particularly Bard, is insightful. However, providing a more critical analysis of why these limitations exist (e.g., training data biases, model architecture) and suggesting potential solutions or areas for future improvement could enrich the paper.
Response: We agree with your comment and have added a paragraph in the discussion to address this, not necessarily focusing on Bard but providing a broader analysis of why certain limitations exist in the use of these AI models for the current purposes:
Utilizing AI models such as ChatGPT and Bard in these types of studies also introduces the possibility of biases that can impact the data's quality and interpretation. These models are trained on extensive datasets that may inherently contain biases from the data sources. For instance, if the training data predominantly represents specific demographics or cultural viewpoints, the resulting AI-generated responses may mirror these biases, potentially skewing or inadequately representing the topic under examination. The language used in this study, Dutch, may also introduce biases, as nuances and complexities in other languages could yield different outcomes.
- The section on future perspectives hints at the exploration of 'cultural bias' in AI models. Expanding on this by proposing specific research questions or methodologies could guide future work in this area. Additionally, considering the rapid evolution of AI technologies, suggesting how ongoing developments might impact the findings would be valuable.
Response: While we do not intend to offer specific RQs for future research, we do believe it is valuable to offer various avenues for future research. We have expanded this section considerably with further reflections in line with this and your other comments:
This proof of concept highlights similarities between human and LLM responses, raising ethical discussions about the use of these new technologies in qualitative research. Future considerations should include the performance of AI models in emerging topics such as COVID-19 or the AI revolution, where human opinions may evolve before being accurately reflected in AI models. Additionally, investigating the 'cultural bias' of AI models, which are often developed by American companies, poses an intriguing avenue for exploration. Integrating AI models into qualitative research has the potential to influence interview dynamics, communication skills, and participant comfort. For instance, the use of AI in interviews may alter the interaction dynamic between researchers and participants, potentially affecting rapport building and the depth of responses. Moreover, participants may perceive interactions with AI differently than with humans, impacting their comfort level and willingness to disclose sensitive information. Additionally, researchers may need to adapt their communication strategies when interacting with AI, ensuring clear and precise prompts to elicit meaningful responses.
The notable struggle of AI models, particularly in addressing sensitive topics, brings to light future challenges in their application. These challenges manifest in limitations observed during discussions on subjects like STDs and depression. The nuanced nature of human emotions, the complexity of personal experiences, and the ethical considerations surrounding sensitive topics pose difficulties for AI models. In these instances, AI models, such as Bard and ChatGPT, exhibited limitations such as non-answers or occasional errors. This underscores the current boundary of AI in fully grasping the intricacies of human emotions and experiences. The nature of sensitive topics often involves nuanced understanding, empathy, and context, elements that may be challenging for AI models to comprehend fully. It remains an open question whether LLMs will be able to successfully address these issues in the future.
- If applicable, incorporating statistical analysis to quantify the differences between AI and human responses could strengthen the findings. This could involve using measures of agreement or correlation to assess the consistency between AI-generated responses and human responses.
Response: Thank you for your suggestion. While incorporating statistical analysis could indeed provide valuable insights into the differences between AI and human responses, we must clarify that our study is qualitative in nature, focusing on open-ended responses. As such, our methodology primarily involves qualitative analysis techniques to explore the richness and depth of the data collected. While statistical analyses are not feasible within the scope of this qualitative study, we appreciate your input and recognize the potential benefits of incorporating quantitative methods in future research endeavours.
- The broader implications section could be expanded to discuss the potential impact of AI in qualitative research beyond the specific context of adolescent health care. This might include implications for data collection, analysis, and the role of AI in enhancing or complementing traditional qualitative research methods.
Response: We have made several additions, also in line with Reviewer 4’s comments, to enhance the discussion in this way. We have added two reflections in this light:
- Utilizing AI models such as ChatGPT and Bard in this study introduces the possibility of biases that can impact the data's quality and interpretation. These models are trained on extensive datasets that may inherently contain biases from the data sources. For instance, if the training data predominantly represents specific demographics or cultural viewpoints, the resulting AI-generated responses may mirror these biases, potentially skewing or inadequately representing the topic under examination. The language used in this study, Dutch, may also introduce biases, as nuances and complexities in other languages could yield different outcomes.
- Integrating AI models into qualitative research has the potential to influence interview dynamics, communication skills, and participant comfort. For instance, the use of AI in interviews may alter the interaction dynamic between researchers and participants, potentially affecting rapport building and the depth of responses. Moreover, participants may perceive interactions with AI differently than with humans, impacting their comfort level and willingness to disclose sensitive information. Additionally, researchers may need to adapt their communication strategies when interacting with AI, ensuring clear and precise prompts to elicit meaningful responses.
- Ensure the paper undergoes thorough proofreading to correct any typographical or grammatical errors. Additionally, considering the use of more visuals or tables to summarize findings and comparisons could make the paper more engaging and easier to digest.
Response: We have carefully proofread the paper and made various small grammatical changes throughout. While we acknowledge the importance of visuals in conveying concepts and key data, we contend that Figure 1 offers a concise yet insightful summary of the main findings in our results.
- Strengthen the literature review by engaging more critically with existing studies on the use of AI in qualitative research. Highlighting gaps that your study addresses and positioning your findings within the broader discourse could provide more context for your contributions.
Response: As mentioned in these papers (Morgan, 2023; Christou, 2023) and as we know from our own literature review, there are almost no empirical or peer-reviewed studies published at present where AI or LLMs are empirically used as tools in qualitative research (as we have done in this study). This information has also been added to the ‘this study’ section. One of the only empirical studies entails the use of LLMs in the analysis phase of interview transcripts (Ashwin et al., 2023). Even though this is not a peer-reviewed publication, we would like to mention it as it is a critical assessment of the use of AI in this context. This research finds that caution is needed in using LLMs to annotate interview text due to the risk of bias that can lead to misleading inferences; usually this coding is done by human experts who are aware of the methodological pitfalls and possible biases.
While AI's application is under exploration in academic research, it remains mainly limited to tasks such as idea generation, literature summarization, and essay writing. Most papers discussing the role or use of AI in qualitative research are opinions or perspectives grounded in reflections on potential use rather than practical, empirical application. While these papers offer critical reflections and discuss practical implications for researchers and analysts, they lack comparable methodologies. Currently, the academic community has not fully explored the dynamics of AI in research, leading to significant gaps in our understanding of how AI can be most effectively utilized in this context. We believe that our study may represent one of the early adopters of AI in qualitative research, subject to critical scrutiny, and may provide a basis for reflecting on the advantages and disadvantages of future research endeavors.
Morgan, D. L. (2023). Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. International Journal of Qualitative Methods, 22.
https://doi.org/10.1177/16094069231211248
Christou, P. A. (2023). The Use of Artificial Intelligence (AI) in Qualitative Research for Theory Development. The Qualitative Report, 28(9), 2739-2755.
Ashwin, J., Chhabra, A., & Rao, V. (2023). Using Large Language Models for Qualitative Analysis can Introduce Serious Bias. Washington, D.C.: World Bank Group. http://documents.worldbank.org/curated/en/099433311072326082/IDU09959393309484041660b85d0ab10e497bd1f
- Proofread the paper and remove the grammatical errors.
Response: We have carefully proofread the paper and made various small grammatical changes throughout.
Reviewer 4 Report
Comments and Suggestions for Authors
This research study is great. It is paving the road towards the integration of Generative AI and LLMs in healthcare. More similar research is needed in different medical disciplines.
- 20 interviews might not be statistically significant.
- While Dutch was used, testing in other languages and cultural contexts is important.
- Elaborating on persona creation and potential biases introduced is crucial.
- Exploring how LLMs can complement, not replace, human researchers is key.
- It would be beneficial to know the specific datasets used to train the LLMs.
- The study could explore the potential impact of LLMs on interview dynamics, communication skills and participant comfort.
- The potential for bias, both algorithmic and human, in the coding and analysis process should be further addressed.
Author Response
This research study is great. It is paving the road towards the integration of Generative AI and LLMs in healthcare. More similar research is needed in different medical disciplines.
20 interviews might not be statistically significant.
While Dutch was used, testing in other languages and cultural contexts is important.
Elaborating on persona creation and potential biases introduced is crucial.
Exploring how LLMs can complement, not replace, human researchers is key.
It would be beneficial to know the specific datasets used to train the LLMs.
The study could explore the potential impact of LLMs on interview dynamics, communication skills and participant comfort.
The potential for bias, both algorithmic and human, in the coding and analysis process should be further addressed.
Response: We thank you for your positive feedback on our manuscript. We have incorporated several of your points into this revised version. We have added the following sentences to the discussion:
- Utilizing AI models such as ChatGPT and Bard in this study introduces the possibility of biases that can impact the data's quality and interpretation. These models are trained on extensive datasets that may inherently contain biases from the data sources. For instance, if the training data predominantly represents specific demographics or cultural viewpoints, the resulting AI-generated responses may mirror these biases, potentially skewing or inadequately representing the topic under examination. The language used in this study, Dutch, may also introduce biases, as nuances and complexities in other languages could yield different outcomes.
- Integrating AI models into qualitative research has the potential to influence interview dynamics, communication skills, and participant comfort. For instance, the use of AI in interviews may alter the interaction dynamic between researchers and participants, potentially affecting rapport building and the depth of responses. Moreover, participants may perceive interactions with AI differently than with humans, impacting their comfort level and willingness to disclose sensitive information. Additionally, researchers may need to adapt their communication strategies when interacting with AI, ensuring clear and precise prompts to elicit meaningful responses.
Reviewer 5 Report
Comments and Suggestions for Authors
Strong point:
It showcases an innovative approach by utilizing AI as virtual participants in qualitative research. This approach can be used for validating LLM-based products in the healthcare industry.
Weak points:
Low sample size.
It adopted original cases from Donck et al. without evaluating any potential bias.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I would suggest the following: Copy every single comment in your rebuttal letter and write your reply immediately after each point in a clear and concise manner, preferably with the changed paragraph from the manuscript, so the reviewer does not have to search for him/herself whether, where, and how anything was actually changed as suggested.
Comments on the Quality of English Language
Minor editing required.
Author Response
Reviewer 2
Main Points:
- The study aim is not entirely clear
- Methods are too imprecise and may not answer any study question. (e.g. can AI fake a participant?)
- Results are not results but further methodology
- Discussion is like a second result section
- Language is imprecise but very "verbose".
Response: We thank you for your constructive feedback on our manuscript. We have incorporated several of your points into this revised version, and you can find a more detailed overview below.
General comments:
Clarify study aim and methods
Response: We have now created a separate section (1.1 study aims) to more clearly highlight the main aims of our study.
The authors should state whether any paragraphs and phrases in this article may have been written by ChatGPT.
Response: No sentences have been written by ChatGPT.
Abbreviations should be explained the first time they are used.
Response: Revised.
LLMs usually give different answers for the same prompts - how was this problem dealt with?
Response: When the LLM was asked to take up the role of the next participant, the system was closed (so that the ‘previous’ task was ended) and we ensured that the system was reset (so that the previous answers were not taken into account by the system). The same question was not repeatedly re-fed into the system within one session, as this would indeed result in different answers to repeated prompts.
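In scripted form, the principle behind this reset is simply that each AI participant starts from an empty conversation history. The sketch below illustrates this using the OpenAI Python client as an assumption; the study itself conducted the interviews through the chat interfaces of ChatGPT and Bard, not through this API.

```python
# Minimal sketch of the reset logic described above: every persona gets a
# brand-new message history, so earlier interviews cannot influence later
# answers. Assumes the OpenAI Python client (>= 1.0); illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_interview(persona_prompt: str, questions: list[str]) -> list[str]:
    messages = [{"role": "system", "content": persona_prompt}]  # fresh history
    answers = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        resp = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers  # the next persona starts again from an empty history
```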
Please also discuss the privacy issues that ChatGPT adheres to regarding medical applications.
Response: As our research was not a ‘medical application’ and we did not prompt the system from the context of clinical care, we do not feel that any privacy issues are at play in our project.
Generally there are some grammatical errors in the text. They should be revised.
Response: Revised.
What was the prompt given to GPT to have these conversations?
Which prompts were used?
Response: In Appendix A, we now also present an overview of the cases that were presented to both AI and human participants. As mentioned above, the LLM was ‘reset’ after every interview so that no change in answers was caused by a repetition of a prompt in the same session.
Introduction:
ChatGPT is a transformer model, which is a deep learning model, and was not created with the help of such models.
Response: We have revised this sentence: “It is a transformer model, which uses deep learning algorithms, and is trained through a combination of supervised and reinforcement learning on a petabyte-scale dataset.”
38-39: Needs a reference
Response: Added.
48-59: It is unnecessary to explain what qualitative research is.
Response: We felt this was necessary; not all readers of these types of articles may be equally familiar with qualitative research designs.
60-62: Where does the assumption that it is a collective subconsciousness come from? Please add references.
Response: Added.
68-82: Please add references
Response: Added.
It should generally be pointed out that ChatGPT was not developed and tested to transfer knowledge, especially not in the field of medicine.
Response: This point was added (see line 65-66).
Methods:
The Methods section should describe how the study was conducted, not in detail why.
Response: We believe that we strictly describe the methodology in this section, and do not highlight the underlying motivation for this study, which we outline in the introduction.
98-99: Please specify what these profiles look like.
Response: We highlight that we generated AI personas based on participant profiles in the face-to-face interviews (see line 116-120):
AI personas were generated based on the same participant profiles that were recruited in the study by Donck et al. [16]. This approach ensured consistent representation in terms of age, gender, marital status, and parental status. The interactive sessions, designed to simulate online interviews, faithfully mirrored the conditions of data collection established in the original study [16].
101-102: Specify how the data collection was conducted
Response: We have done so, see line 121-138:
In our methodology, we adapted original cases from Donck et al. [19] into prompts tailored to be compatible with the selected AI platforms. Twenty AI-generated participants (10 AI mothers and 10 AI fathers, aligning with Donck et al. [19]) were engaged in simulated interview sessions that replicated the conditions of the original study. These sessions revolved around four cases related to potential areas of disagreement between parents and children in medical treatment scenarios. These cases, developed collaboratively by a team of pediatricians and sociologists, focused on confidentiality issues within the physician-patient-parent triad in the context of alcohol intoxication, sexually transmitted disease (STD), ultrasound without parental knowledge, and mental health issues (see Appendix A for an overview of the cases). After presenting each case, AI participants were queried about their opinions on whether a physician should share information with them, even if the adolescent requested confidentiality. Consistent inquiries were made regarding the potential influence of the age and sex of the adolescent, as well as the distinction between a general practitioner and a specialist. Responses from the AI models were recorded and stored in a separate database. All interactions, including those with the AI models, were conducted in Dutch, and the entire AI interview cycle took place in August 2023. Instances where ChatGPT-4 or Bard struggled to produce a response were terminated, prompting the initiation of a new interview.
Generally, describe the study that was imitated and then refer to it. Make the methods more concise.
Response: We have made several small revisions throughout to make the methods more concise.
103-119: Please refer to the questions you used in the reference study.
Response: In Appendix A, we now also present an overview of the cases that were presented to both AI and human participants. Subsequent follow-up questions were not included as they differed depending on the individual.
120: What is meant by open coding?
Response: Open coding is a standard analysis technique in qualitative research; it involves dissecting the data into distinct components and assigning "codes" to them for identification. As its name suggests, open coding aims to unlock new theoretical avenues as the researcher initially interacts with the qualitative data. The objective of segmenting data and labeling it with codes is to empower the researcher to consistently analyze and juxtapose similar occurrences within the data. This entails gathering all data fragments (e.g., quotes) labeled with a specific code. By doing so, this method challenges preconceived notions and biases, fostering a more objective approach to research.
Please provide a brief explanation for the sample size of 10+10.
Response: We created 20 participants (10 women, 10 men) to mirror the number and gender distribution of participants in Donck et al. We feel that deviating from the original study could potentially introduce bias and inconsistency into the data.
120+129: Please rephrase. This paragraph is not concisely written.
Response: Revised.
The actual description of the qualitative analysis is missing. Hypothesis and study aim is also missing. Please refer to the reporting guidelines for qualitative research (e.g. SRQR, COREQ).
Response: We revised our methodology with the COREQ-guidelines in mind. The detailed description of the qualitative analysis is here:
The initial phase of data analysis involved open coding, wherein interview outcomes were distilled into summarized concepts. To bolster research robustness and mitigate biases, interviews conducted with ChatGPT and Bard were independently coded by two-person research groups. Collaboration within these groups was integral to the coding process [17]. Subsequently, the axial coding phase was initiated to formulate overarching categories that encompassed the entire interview data for each AI model. In the final selective coding phase, the connections identified earlier were validated through discussions that involved comparing included and omitted data [18]. This coding approach resulted in the identification of primary themes related to confidentiality: etiology, privacy, responsibility, and patient characteristics.
Results:
The results section is not supposed to add information on methods! It should solely display the actual results of the study. The wording is too vague; more clarity if needed.
Response: We were very careful to only present results in this section, and upon rereading we are sure that we have done nothing more than that.
Generally a comment: how were the privacy issues that ChatGPT adheres to regarding medical discussions handled?
e.g. 147: “placed significant trust” – Which criteria were used to evaluate this?
Response: ‘Significant’ was removed.
156: your own opinion is not appropriate in the Results. Please just describe the outcome of your data analysis.
Response: In the sentence below, we merely present observations, not opinions. We observed that ChatGPT deemed the relationship between the adolescent and physician crucial – it would be remiss not to mention this.
In ChatGPT's responses, the relationship between the adolescent and physician was deemed crucial, especially in cases of STDs and alcohol abuse.
175,176: whose assumption is it? How did you find this out?
Response: The assumption was made by ChatGPT; we have revised this sentence in line with this.
The language should be clear, simple, easy to understand.
209-218: complex language; however, there is no clear message in this paragraph.
Response: The key message in this paragraph is highlighted in bold below, and includes that the primary distinctions between responses generated by AI and those by humans were clearest when discussing the concept of privacy.
The primary distinctions between the responses generated by ChatGPT and Bard, and those from human participants, were most evident in their treatment of the concept of 'privacy.' Both ChatGPT and Bard addressed privacy, albeit with distinct nuances, whereas this aspect was notably absent in human responses. In ChatGPT's outputs, privacy emerged as a complex concept intricately tied to the developmental stage of adolescents and the level of responsibility entrusted to them. Bard's responses, on the other hand, also incorporated age considerations, but the legal framework played a more prominent role in shaping decision-making, a facet less emphasized in ChatGPT's outputs. In both instances, the degree of adolescent responsibility correlated with an escalation in the confidentiality maintained between the doctor and the adolescent.
Figure 1 is unreasonable. Here it should be shown how it was done and what is being compared; what is the point of this graphic, and what is the methodology behind it?
Response: The point of this graphic is to succinctly present the results, in the results section, rather than to outline the methodology of this study, which we have done in the methods. More specifically, the figure represents a schematic overview of the main themes of parental perceptions regarding adolescent confidentiality in AI and human participants. This is common in qualitative studies; we also refer to the original publication by Donck et al., where a visual representation of themes and subthemes is presented.
238-240: unclear, please explain
Response: We have removed the sentence in 239-240, which simplifies the message here.
Discussion:
249: “To streamline the research experience” – What?
Response: We have removed ‘to streamline’ in this sentence to avoid confusion.
246-257: more concise.
Response: We feel this section is already quite concise while still delivering essential information to readers, and we could not shorten it further without losing that information.
253-255: not understandable
Response: ‘This aligns with previous findings which indicate that AI can perform as well as, or even better than, humans in knowledge reproduction’
This sentence means that our findings align with previous research that shows that AI can mirror or even exceed human performance in knowledge reproduction. We refer to the relevant studies in the references.
255: which concepts
Response: The concepts are all highlighted in section 4.1.1 through 4.1.4.
258-260: first real information. Should start the discussion section.
Response: We would differ in opinion here; the sentences directly above that also deliver new information.
The discussion has the same problems as the results. No clear statements. The discussion is not there to restate the results.
287: “large language models” not “large linguistic models”!!!
Response: Revised
287-289: why is this relevant
Response: This sentence (added below) is relevant to ensure that readers realize one AI model is not the same as another; different AI models will yield different insights.
Despite both AI models being categorized as large language models [19], their unique algorithms and training datasets yielded different responses, enhancing the value of comparing their outputs.
289: OpenAI is a company, Lambda is an LLM. These are two different things.
Response: Revised.
290: Lambda is an LLM!
Response: Revised.
298: “AI relies heavily on factual knowledge” – WRONG: If you state this, give a reference
Response: Revised:
AI relies heavily on factual knowledge and data, while human responses are more emotion-based, reflecting a nuanced understanding beyond mere factual recall.
317-318: this statement is not correct. No static nature, it has a random factor.
Response: Added.
320-322: reference
Response: Added.
316: AI Revolution? What is the relevance of COVID?
Response: These were general reflections on the role of AI in relation to societal shocks (such as COVID, but it could also be the Israel-Gaza war, geopolitical tensions between the US and China, etc.):
Future considerations should include the performance of AI models in emerging topics such as COVID-19 or the AI revolution, where human opinions may evolve before being accurately reflected in AI models.
343-348 The authors should focus on their own research and what their contribution to science is. What is the added value to the scientific field?
Response: We feel like this is covered in section 4.5 Broader Implications.
349-350: which methods, and why do you suggest adapting the researchers' methodology?
Response: We have revised this sentence for clarity purposes:
The differences observed between AI and human interviews suggest that researchers need to be aware of the potential of employing AI in qualitative studies […]
350-353: unconcise language, as if written with an LLM
Response: Minor revisions were made here.
General comment to conclusion: What is your finding? What is the added value to the scientific field?
Response: We feel like this is covered in section 4.5 Broader Implications.
Reviewer 3 Report
Comments and Suggestions for Authors
Could you please provide the section, heading, page number, and line number for the changes you have made in the revised version.
Comments on the Quality of English Language
I guess minor editing of the English language is still required.
Author Response
Reviewer 3
- Provide more details on the methodology, especially on how the AI personas were generated and how they correspond to real human demographics. Additionally, explaining the selection criteria for ChatGPT and Bard versions used would offer insight into the decision-making process and any potential biases.
Response: In the materials and methods section, we highlight that we generated AI personas based on participant profiles in the face-to-face interviews (see line 116-119):
AI personas were generated based on the same participant profiles that were recruited in the study by Donck et al. [16]. This approach ensured consistent representation in terms of age, gender, marital status, and parental status. The interactive sessions, designed to simulate online interviews, faithfully mirrored the conditions of data collection established in the original study [16].
Thus, knowing these basic characteristics of each participant in the human interviews, we prompted the respective LLMs with the questions we were posing as if they were, for example, a married 40-year-old man with a 15-year-old daughter. We repeated this procedure for each AI participant to mirror each human participant.
- Enhance the section comparing AI and human responses by developing a more structured framework. This could include specific metrics or criteria used for comparison, which would help readers understand the basis for conclusions drawn about the performance and reliability of AI models versus human participants.
Response: We understand your request for specific metrics, but that implies a quantitative approach, which was not the design of the current (qualitative) study. What we intended to do is compare AI and human responses through the themes that emerged from the interviews conducted with each. Although this approach cannot provide statistically significant results regarding correlation, this is not the end goal of a qualitative approach. Rather, the aim of our qualitative approach is to provide a nuanced understanding of the similarities and differences between AI and human responses within the context of the themes identified in the interviews.
- Expand the discussion on ethical considerations. While the paper briefly mentions the potential for misuse of AI in research, a deeper exploration of the ethical implications, especially regarding privacy, consent, and the authenticity of AI-generated data, would be beneficial.
Response: We agree that a more in-depth discussion could be an added value, and have added this to the introduction (see line 76-86):
Transparency regarding the use of AI and clearly distinguishing between AI-generated and human-generated data is essential to prevent misleading interpretations. Failure to do so can erode trust in research findings and compromise the credibility of the research process. Furthermore, ethical considerations extend to the development and deployment of AI models themselves. Developers must adhere to ethical guidelines and principles, such as fairness, transparency, and accountability, to mitigate the risk of bias and ensure the responsible use of AI in research settings [4,5]. Regular audits and assessments of AI models' performance and biases are essential to identify and address potential ethical concerns. Additionally, biased AI models can perpetuate existing biases, further compromising research validity. Safeguards such as transparency and robust validation procedures are necessary to maintain research integrity in the face of evolving AI capabilities.
- The discussion on the limitations of AI models, particularly Bard, is insightful. However, providing a more critical analysis of why these limitations exist (e.g., training data biases, model architecture) and suggesting potential solutions or areas for future improvement could enrich the paper.
Response: We agree with your comment and have added a paragraph in the discussion to address this, not necessarily focusing on Bard but providing a broader analysis of why certain limitations exist in the use of these AI models for the current purposes (see line 310-318):
Utilizing AI models such as ChatGPT and Bard in these types of studies also introduces the possibility of biases that can impact the data's quality and interpretation. These models are trained on extensive datasets that may inherently contain biases from the data sources. For instance, if the training data predominantly represents specific demographics or cultural viewpoints, the resulting AI-generated responses may mirror these biases, potentially skewing or inadequately representing the topic under examination. The language used in this study, Dutch, may also introduce biases, as nuances and complexities in other languages could yield different outcomes.
- The section on future perspectives hints at the exploration of 'cultural bias' in AI models. Expanding on this by proposing specific research questions or methodologies could guide future work in this area. Additionally, considering the rapid evolution of AI technologies, suggesting how ongoing developments might impact the findings would be valuable.
Response: While we do not intend to offer specific RQs for future research, we do believe it is valuable to offer various avenues for future research. We have expanded this section considerably with further reflections in line with this and your other comments (see line 351-375):
This proof of concept highlights similarities between human and LLM responses, raising ethical discussions about the use of these new technologies in qualitative research. Future considerations should include the performance of AI models in emerging topics such as COVID-19 or the AI revolution, where human opinions may evolve before being accurately reflected in AI models. Additionally, investigating the 'cultural bias' of AI models, which are often developed by American companies, poses an intriguing avenue for exploration. Integrating AI models into qualitative research has the potential to influence interview dynamics, communication skills, and participant comfort. For instance, the use of AI in interviews may alter the interaction dynamic between researchers and participants, potentially affecting rapport building and the depth of responses. Moreover, participants may perceive interactions with AI differently than with humans, impacting their comfort level and willingness to disclose sensitive information. Additionally, researchers may need to adapt their communication strategies when interacting with AI, ensuring clear and precise prompts to elicit meaningful responses.
The notable struggle of AI models, particularly in addressing sensitive topics, brings to light future challenges in their application. These challenges manifest in limitations observed during discussions on subjects like STDs and depression. The nuanced nature of human emotions, the complexity of personal experiences, and the ethical considerations surrounding sensitive topics pose difficulties for AI models. In these instances, AI models, such as Bard and ChatGPT, exhibited limitations such as non-answers or occasional errors. This underscores the current boundary of AI in fully grasping the intricacies of human emotions and experiences. The nature of sensitive topics often involves nuanced understanding, empathy, and context, elements that may be challenging for AI models to comprehend fully. It remains an open question whether LLMs will be able to successfully address these issues in the future.
- If applicable, incorporating statistical analysis to quantify the differences between AI and human responses could strengthen the findings. This could involve using measures of agreement or correlation to assess the consistency between AI-generated responses and human responses.
Response: Thank you for your suggestion. While incorporating statistical analysis could indeed provide valuable insights into the differences between AI and human responses, we must clarify that our study is qualitative in nature, focusing on open-ended responses. As such, our methodology primarily involves qualitative analysis techniques to explore the richness and depth of the data collected. While statistical analyses are not feasible within the scope of this qualitative study, we appreciate your input and recognize the potential benefits of incorporating quantitative methods in future research endeavours.
- The broader implications section could be expanded to discuss the potential impact of AI in qualitative research beyond the specific context of adolescent health care. This might include implications for data collection, analysis, and the role of AI in enhancing or complementing traditional qualitative research methods.
Response: We have made several additions, also in line with Reviewer 4’s comments, to enhance the discussion in this way. We have added two reflections in this light:
- Line 310-318: Utilizing AI models such as ChatGPT and Bard in this study introduces the possibility of biases that can impact the data's quality and interpretation. These models are trained on extensive datasets that may inherently contain biases from the data sources. For instance, if the training data predominantly represents specific demographics or cultural viewpoints, the resulting AI-generated responses may mirror these biases, potentially skewing or inadequately representing the topic under examination. The language used in this study, Dutch, may also introduce biases, as nuances and complexities in other languages could yield different outcomes.
- Line 357-363: Integrating AI models into qualitative research has the potential to influence interview dynamics, communication skills, and participant comfort. For instance, the use of AI in interviews may alter the interaction dynamic between researchers and participants, potentially affecting rapport building and the depth of responses. Moreover, participants may perceive interactions with AI differently than with humans, impacting their comfort level and willingness to disclose sensitive information. Additionally, researchers may need to adapt their communication strategies when interacting with AI, ensuring clear and precise prompts to elicit meaningful responses.
- Ensure the paper undergoes thorough proofreading to correct any typographical or grammatical errors. Additionally, considering the use of more visuals or tables to summarize findings and comparisons could make the paper more engaging and easier to digest.
Response: We have carefully proofread the paper and made various small grammatical changes throughout. While we acknowledge the importance of visuals in conveying concepts and key data, we contend that Figure 1 offers a concise yet insightful summary of the main findings in our results.
- Strengthen the literature review by engaging more critically with existing studies on the use of AI in qualitative research. Highlighting gaps that your study addresses and positioning your findings within the broader discourse could provide more context for your contributions.
Response: As mentioned in these papers (Morgan, 2023; Christou, 2023) and as we know from our own literature review, there are almost no empirical or peer-reviewed studies published at present where AI or LLMs are empirically used as tools in qualitative research (as we have done in this study). This information has also been added to the ‘this study’ section. One of the only empirical studies entails the use of LLMs in the analysis phase of interview transcripts (Ashwin et al., 2023). Even though this is not a peer-reviewed publication, we would like to mention it as it is a critical assessment of the use of AI in this context. This research finds that caution is needed in using LLMs to annotate interview text due to the risk of bias that can lead to misleading inferences; usually this coding is done by human experts who are aware of the methodological pitfalls and possible biases.
While AI's application is under exploration in academic research, it remains mainly limited to tasks such as idea generation, literature summarization, and essay writing. Most papers discussing the role or use of AI in qualitative research are opinions or perspectives grounded in reflections on potential use rather than practical, empirical application. While these papers offer critical reflections and discuss practical implications for researchers and analysts, they lack comparable methodologies. Currently, the academic community has not fully explored the dynamics of AI in research, leading to significant gaps in our understanding of how AI can be most effectively utilized in this context. We believe that our study may represent one of the early adopters of AI in qualitative research, subject to critical scrutiny, and may provide a basis for reflecting on the advantages and disadvantages of future research endeavors.
Morgan, D. L. (2023). Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. International Journal of Qualitative Methods, 22.
https://doi.org/10.1177/16094069231211248
Christou, P. A. (2023). The Use of Artificial Intelligence (AI) in Qualitative Research for Theory Development. The Qualitative Report, 28(9), 2739-2755.
Ashwin, J., Chhabra, A., & Rao, V. (2023). Using Large Language Models for Qualitative Analysis can Introduce Serious Bias. Washington, D.C.: World Bank Group. http://documents.worldbank.org/curated/en/099433311072326082/IDU09959393309484041660b85d0ab10e497bd1f
- Proofread the paper and remove the grammatical errors.
Response: We have carefully proofread the paper and made various small grammatical changes throughout.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks for the revision and the answers to the individual issues; the manuscript has been improved.
Comments on the Quality of English Language
Can still be improved.