Article
Peer-Review Record

Qualitative Research Methods for Large Language Models: Conducting Semi-Structured Interviews with ChatGPT and BARD on Computer Science Education

Informatics 2023, 10(4), 78; https://doi.org/10.3390/informatics10040078
by Andreas Dengel *, Rupert Gehrlein *, David Fernes, Sebastian Görlich, Jonas Maurer, Hai Hoang Pham, Gabriel Großmann and Niklas Dietrich genannt Eisermann
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 20 July 2023 / Revised: 13 September 2023 / Accepted: 7 October 2023 / Published: 12 October 2023
(This article belongs to the Topic AI Chatbots: Threat or Opportunity?)

Round 1

Reviewer 1 Report

In the introduction, the authors stop with the research questions; however, it would have been better to point out their contributions in this paper as a summary.

In Section 3.2, the authors mention the participants. However, the total number of participants was not stated. How many participants were interviewed for the content analysis?

Qualitative analysis was performed on two different LLMs. However, to compare their performance, the authors should present some examples of how the models differ from each other and which one performs better; this analysis was not present. Even though it is a qualitative analysis, the average responses of both models could be summarized, for example by response length, contextual differences, or number of tokens, to showcase the comparative study.

Similarly, how these models generate different types of answers when asked the same question with two different contexts should be shown as well.

Another scenario: similar questions with similar context.

These types of scenarios need to be generated and their content analyzed before coming to a conclusion.

Overall it is okay, but the writing and presentation of the paper need improvement.

Author Response

Dear reviewer, thank you for your valuable comments. We tried to take them into account by revising the manuscript. Please let us know if there are any more points that remain unclear after the revision!

Number of participants unclear

We clarified in sections 3.2 and 3.3 that there were three "participants" (the LLMs), each of which was interviewed two times.

Specify differences between LLMs

The differences between the three LLMs are now highlighted more clearly in section 3.5.1.

Why have different identities not been assigned to the LLMs

We further clarified the reason for not using specific roles for the LLMs in sections 3 and 3.3.

 

The authors should present some examples of how the models differ from each other and which one performs better; this analysis was not present.

 

The differences are provided in Appendix B; we now refer to these results in the "Results" section.

 

The average responses of both models could be summarized, for example by response length, contextual differences, or number of tokens, to showcase the comparative study.

We added the information that response length and other hyperparameters were not changed from the default settings, as this is what can be expected from other researchers using these tools (explanation added in the "Research Design" section).
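For readers who want to reproduce this kind of comparative summary themselves, the sketch below shows one way to compute average response length and a rough token count per model. It is a minimal illustration, not part of the study: the directory layout (one text file per response, one folder per model) and the whitespace-based token count are assumptions, and a model-specific tokenizer such as tiktoken would give exact token counts.

```python
from pathlib import Path
from statistics import mean

# Hypothetical layout: one plain-text file per LLM response,
# grouped in one directory per model, e.g. transcripts/chatgpt/*.txt
def summarize(model_dir: str) -> dict:
    responses = [p.read_text(encoding="utf-8") for p in Path(model_dir).glob("*.txt")]
    return {
        "n_responses": len(responses),
        "mean_chars": mean(len(r) for r in responses),
        # crude whitespace tokenization; a real tokenizer would count differently
        "mean_tokens": mean(len(r.split()) for r in responses),
    }

for model_dir in ("transcripts/chatgpt", "transcripts/bard"):
    print(model_dir, summarize(model_dir))
```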

 

Reviewer 2 Report

Thanks to the authors for submitting this interesting research.

My only comment relates to the results, and clarifying this would benefit the reader: are the results you provide for each LLM based on one run or on multiple runs?

Author Response

Dear reviewer, thank you for your valuable comment on clarifying the number of runs. We tried to take it into account by revising the manuscript. Please let us know if there are any more points that remain unclear after the revision!

Number of runs unclear

The number of runs (two per LLM) has been clarified in section 3.3.

Reviewer 3 Report

Hello,

Thank you for your insightful and highly relevant paper! I thoroughly enjoyed reading it. However, I do have some questions that I'd like to have clarified.

1. Could you elaborate further on this point in the paper: "By 'interviewing' these language models, it may be possible to combine the strengths of both qualitative and quantitative approaches. This method would allow researchers to access a vast amount of data containing numerous opinions and perspectives, while also leveraging the probabilistic nature of the models to identify the most probable viewpoints"? My understanding of how these models function differs slightly from what you've suggested. As far as I can tell, there's no guarantee that the viewpoints generated by these large language models (LLMs) reflect a majority opinion. Although you don't explicitly claim this, it's unclear how your idea of utilizing LLMs constructively holds any weight without this assurance. However, I am open to being corrected on this point and am eager to hear your perspective.

2. Based on my understanding of how Large Language Models (LLMs) operate, I'm skeptical about their utility for factual verification, which is a common objective in semi-structured interviews. For instance, when I investigated the usage of a particular algorithm within a public agency, I conducted interviews with multiple users and cross-referenced their responses with protocols and my own examination of the algorithm to ensure factual accuracy. Given the opaque nature of how LLMs function, I would hesitate to use them for fact-checking. However, I have employed them to fulfill another common purpose of semi-structured interviews: understanding public perceptions, exploring different viewpoints, and analyzing their justifications. In such contexts, the source of the information, its bias, or its factual accuracy is less critical as long as it generates useful hypotheses that can be further tested using quantitative methods. My question to you is, why didn't you explore this application in your paper? Or is this an aspect you could consider adding?

3. It appears to me that you haven't fully utilized the interview format, and I'm curious as to why that is the case. You seem to regard the act of asking follow-up questions to attain more precise answers as problematic. However, isn't this a common issue even when interviewing people? In that sense, the limitation isn't specific to LLMs but is more a general challenge in the interview process. One of the strengths of conducting semi-structured interviews is the ability to ask follow-up questions to clarify the interviewee's intent or to prompt more thoughtful responses. Additionally, you don't seem to experiment with assigning different identities to the LLMs, such as "You are a German pedagogue," or engaging them with follow-up questions as one would in a semi-structured interview. Why haven't you considered this approach, especially since it could be done rigorously?

 

4. I would also suggest rephrasing your research questions (RQs). At a glance, it appears that a simple "yes" could answer all three questions without requiring substantial research. It is possible after all to get the LLMs to do these things. The real challenge, it seems to me, lies in determining whether the responses generated by LLMs are well-grounded and meaningful, which is a different issue altogether. This critique is solely about rephrasing and nothing more.

 

 

Author Response

Dear reviewer, thank you for your valuable comments. We tried to take them into account by revising the manuscript. Please let us know if there are any more points that remain unclear after the revision! Also, thank you for the insight: "I conducted interviews with multiple users and cross-referenced their responses with protocols and my own examination of the algorithm to ensure factual accuracy." That is, in fact, also what we would suggest when trying to obtain more coherent perspectives on a topic. Still, the "factual verification" problem persists, so we need to treat the results as "opinions" rather than truths. We added some discussion of this to the limitations and implications sections.

Clarify reasoning for using LLMs for interviews

We rewrote parts of the introduction section to make the reasoning for the interviews clearer.

LLMs cannot guarantee factual information

That is indeed correct! But for conducting "interviews" with LLMs, that was not the goal, as an interview with a human being would also not guarantee factual information. We utilized the probabilistic character of the LLMs to discover likely "opinions" based on the underlying datasets. In order to clarify the possibility of LLMs not delivering factual information, we now extended the discussion of this issue in the Limitations section.

Why have different identities not been assigned to the LLMs

We now clarified the reasoning for not using specific roles for the LLMs in sections 3 and 3.3.
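For context, the sketch below shows what assigning an identity and asking a follow-up question could look like with the OpenAI chat API. It is purely illustrative and not what the study did: the persona text, model name, and questions are hypothetical assumptions, and BARD would require a different client.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical persona and questions, for illustration only;
# the study deliberately assigned no roles to the LLMs.
messages = [
    {"role": "system",
     "content": "You are a German pedagogue specializing in computer science education."},
    {"role": "user",
     "content": "How should computer science be integrated into primary education?"},
]

reply = client.chat.completions.create(model="gpt-4", messages=messages)
answer = reply.choices[0].message.content
print(answer)

# Semi-structured follow-up: keep the answer in the conversation
# history so the next question is asked in context.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user",
                 "content": "Could you expand on the challenges you just mentioned?"})

followup = client.chat.completions.create(model="gpt-4", messages=messages)
print(followup.choices[0].message.content)
```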

Research questions should be rephrased to be clearer

The research questions have been rephrased for clarification. We also rewrote parts of the discussion to address the new questions better (and not only give yes/no answers ;-)).

Reviewer 4 Report

This comes from a sociologist who was frankly confused; it was never clear to me what the goal of the exercise was.  What are you trying to learn?  When we do in depth interviewing (IDI) with humans, we might be trying to get information (we treat them as informants) on the world (for example, if we are trying to piece together what happened in a case of organizational innovation), or we might be trying to get their subjective views or opinions.  But it wasn’t clear to me which direction the authors were taking here—it seemed the latter, but one of the problems with interviewing LLMs is that the more user-friendly ones, even if given a question phrased in subjective terms, revert to objective material, since they “know” the “true” answer.  That might be especially true for a topic involving educational psychology—which is also a topic on which most humans will have no opinions at all, which makes it again a strange choice.

Much of the body of the paper consisted of what appeared to me to be undirected review.  Section 2.2 didn’t seem relevant at all—and the statement of relevance 3.134 seemed completely illogical (a bit like saying there is a problem with X that can be solved by doing some Y, simply because it is not X).  It would be much better to start with existing work on surveying LLMs.  Most of this has been using LLMs to generate different personalities (“what would a Republican say?”), because LLMs have in them many opinions structured in different ways.  If your idea is that we would use similar techniques to get at the LLM itself, that’s different.

Page 6 also seemed totally irrelevant to me.  If this is about computer education, that’s one thing, but if it’s about interviewing LLMs, we don’t need to know this.  If it’s about whether IDI methods work for LLMs, you would present excerpts, focusing on cases of misunderstanding, inconsistency, refusal to answer, and how effective different types of prompts are (e.g., to get the subject to expand on an answer, or to coax an answer out when there is first a refusal to give a real answer).  Instead, just focusing on coded answers seems to throw away all the depth of the information that would lead one to do IDI.

So it seems that it is necessary to have a much better and clearer research question.

The writing is generally acceptable in terms of grammar and style.  The text was garbled 2.52, 2.74-77, 11.495 (RQ1 is given where RQ2 should be).  It should be proofread.

Author Response

Dear reviewer, thank you for your valuable comments. We tried to take them into account by revising the manuscript. Please let us know if there are any more points that remain unclear after the revision!

Clarify reasoning for using LLMs for interviews

We rewrote parts of the introduction section to make the reasoning for the interviews clearer. We also addressed the goal of getting the "subjective views" of the LLMs in order to access the "subjective views" of the individuals within the dataset used for training the LLM.

   

Why have different identities not been assigned to the LLMs

We clarified the reasons for not using specific roles for the LLMs in sections 3 and 3.3. In addition to that, we discussed the possibility of doing so in the discussion and the implications.

Research questions should be rephrased to be clearer

All research questions have been rephrased for clarification. This also led to rewriting parts of the discussion and the implications in order to answer the new questions (and not simply providing yes/no answers). 

Page 6 (section 2.4) not relevant

We shortened section 2.4 by removing some parts that were less relevant for the interviews. We also clarified that we introduce these different approaches to explore how different the solutions to this exemplary "problem" of integrating Computer Science can be.
