Article

The Promises and Pitfalls of Large Language Models as Feedback Providers: A Study of Prompt Engineering and the Quality of AI-Driven Feedback

by Lucas Jasper Jacobsen *,† and Kira Elena Weber †
Faculty of Education, Universität Hamburg, 20146 Hamburg, Germany
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Submission received: 5 December 2024 / Revised: 13 January 2025 / Accepted: 27 January 2025 / Published: 12 February 2025
(This article belongs to the Special Issue Exploring the Use of Artificial Intelligence in Education)

Abstract

Background/Objectives: Artificial intelligence (AI) is transforming higher education (HE), reshaping teaching, learning, and feedback processes. Feedback generated by large language models (LLMs) has shown potential for enhancing student learning outcomes. However, few empirical studies have directly compared the quality of LLM feedback with feedback from novices and experts. This study investigates (1) the types of prompts needed to ensure high-quality LLM feedback in teacher education and (2) how feedback from novices, experts, and LLMs compares in terms of quality. Methods: To address these questions, we developed a theory-driven manual to evaluate prompt quality and designed three prompts of varying quality. Feedback generated by ChatGPT-4 was assessed alongside feedback from novices and experts, who were provided with the highest-quality prompt. Results: Our findings reveal that only the best prompt consistently produced high-quality feedback. Additionally, LLM feedback outperformed novice feedback and, in the categories explanation, questions, and specificity, even surpassed expert feedback in quality while being generated more quickly. Conclusions: These results suggest that LLMs, when guided by well-crafted prompts, can serve as high-quality and efficient alternatives to expert feedback. The findings underscore the importance of prompt quality and emphasize the need for prompt design guidelines to maximize the potential of LLMs in teacher education.

1. Introduction

Artificial intelligence (AI) is transforming not only numerous industries but also the education sector, leveraging technologies like machine learning and natural language processing to improve individual learning experiences [1]. Accordingly, the use of AI in higher education (AIEd) has grown significantly over the past five years [2] and is now reshaping the field of teacher education [3]. This development has sparked extensive debates regarding its impact on the future of teaching and learning [4]. For example, several studies have examined the impact of automated writing evaluation feedback tools on students’ writing performance (for a meta-analysis, see [5]). In general, feedback is considered an integral part of educational processes in higher education (HE), but as Henderson et al. [6] noted, it raises issues as well:
Feedback is a topic of hot debate in universities. Everyone agrees that it is important. However, students report a lot of dissatisfaction: they don’t get what they want from the comments they receive on their work and they don’t find it timely. Teaching staff find it burdensome, are concerned that students do not engage with it and wonder whether the effort they put in is worthwhile. (p. 3)
The provision of feedback represents both an opportunity and a challenge, as it can also have negative consequences for learning processes [7]. Therefore, high-quality feedback is needed in teacher education, characterized by features such as concreteness, activation, and empathy [8]. Unfortunately, the human and financial resources required to provide high-quality feedback are often lacking [9], which is why AI feedback is promising, offering the potential to optimize teaching and learning processes in HE. According to the United Nations Educational, Scientific and Cultural Organization (UNESCO) [10], AI tools should be used to enhance the professional development of teachers, allowing them “to practice skills and receive feedback” (p. 42).
Following up on these ideas, this study will specifically look at the potential of large language model (LLM)-based feedback in teacher education by comparing it to novice and expert feedback. For teachers to use AI tools, they need adequate “application skills” [10] (p. 22). Consequently, we present a theory-driven and evidence-based manual for prompt engineering that can facilitate teachers’ use of AI and improve their ability to apply it in the educational context. We address the following questions in this study: (1) What kinds of prompts are required to ensure high-quality AI feedback? (2) How does LLM feedback, influenced by prompt quality, compare to novice and expert feedback in terms of feedback quality (specific, empathetic, and engaging)?

2. Theoretical Background

2.1. Artificial Intelligence in Higher Education

AIEd has been used for various purposes, including assessment/evaluation, prediction, AI assistance, intelligent tutoring, and managing student learning [2]. Within the broader field of AIEd, LLMs have recently gained prominence due to their ability to generate complex, human-like outputs. These models represent a transformative development that is rapidly reshaping HE practices [2,11], introducing new possibilities for improving student engagement, supporting educators, enhancing access to education, and transforming feedback processes [12,13,14]. In feedback processes specifically, LLMs have emerged as powerful tools due to their ability to provide adaptive, scalable, and timely feedback, addressing long-standing challenges such as time constraints and limited resources.
Nevertheless, empirical research on this emerging technology, particularly in the context of HE, is still in its infancy. For example, studies comparing the quality of LLM-generated feedback with traditional human feedback remain limited, and specific insights into the quality of feedback generated by LLMs are still lacking. This paper addresses these gaps by developing a theory-driven prompt manual, designed to optimize the quality of LLM-generated feedback and ensure its systematic evaluation. By comparing this feedback with that of novices and experts, the study advances our understanding of the potential and limitations of LLMs in HE.

2.2. Prompt Engineering for Large Language Models in Higher Education

To use LLMs effectively in HE, it is crucial to recognize the importance of prompt engineering: research must answer the question of how to write prompts that yield high-quality output. In simple terms, prompt engineering is the process of designing effective questions or stimuli, known as “prompts”, so that an LLM produces clear, relevant answers and the desired results. Although prompt engineering is a fairly new research topic, findings have consistently suggested that the quality of LLM output is not determined merely by the models’ foundational algorithms or training data; equally crucial is the clarity and accuracy of the prompts they are given [15,16,17].
Studies have highlighted different aspects of prompt engineering, e.g., [16,17,18]. For example, Kipp [19] noted that four primary elements (context, question, format, and examples) should serve as modular guidelines for constructing effective prompts. Ekin [18] proposed five factors that influence prompt selection: user intent, model understanding, domain specificity, clarity and specificity, and constraints. In addition, Lo [16] developed the CLEAR framework, which comprises five key elements of effective prompts: concise, logical, explicit, adaptive, and reflective.
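To make these modular guidelines concrete, the short sketch below assembles a prompt from Kipp’s four elements (context, question, format, and examples). It is a minimal illustration only; the role, task, and example texts are hypothetical and are not prompts from this study.

```python
# Minimal sketch: assembling a prompt from four modular elements (context,
# question, format, examples). All content strings are hypothetical.

def build_prompt(context: str, question: str, output_format: str, examples: str) -> str:
    """Join the four prompt modules into a single prompt string."""
    return "\n\n".join([
        f"Context: {context}",
        f"Task: {question}",
        f"Format: {output_format}",
        f"Example: {examples}",
    ])

prompt = build_prompt(
    context="You are an experienced teacher educator; I am a pre-service teacher.",
    question="Give me feedback on the learning goal I formulated for a geometry lesson.",
    output_format="Write in first person, in continuous prose of no more than 250 words.",
    examples="A well-formulated goal uses one observable activity verb, e.g., 'identify'.",
)
print(prompt)
```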
The ability to develop prompts is crucial to support future skills in an increasingly AI-influenced world [10]. However, creating effective prompts can be challenging and may lead to unexpected results [17]. To the best of our knowledge, there are no manuals for analyzing the quality of prompts within HE, and no investigations have been performed to determine whether such guidelines actually improve the output of LLMs. This study aims to develop such a manual and investigate whether there are differences in output when feeding an LLM (ChatGPT-4) with different kinds of prompts.

2.3. Feedback

Feedback is widely recognized as an integral component of individual and institutional learning and developmental processes [20] and thus as a crucial component in HE [6]. Feedback is defined as information offered to an individual concerning their current performance to facilitate improvement in future endeavors [21], and individuals often struggle to effectively reflect on, manage, and adjust their actions or tasks in the absence of appropriate feedback [22]. In the context of teacher education, pre-service teachers receive feedback after actual classroom practice or specific skill training, which could be from peers with a similar knowledge base (novices) or from experts with knowledge authority [22,23]. However, while the incorporation of feedback sessions in teacher education is becoming increasingly prevalent [22,24], feedback designs are often compromised, as feedback from novices is not as high in quality as expert feedback [8]. In addition, educators (experts) frequently express concerns about a lack of time for delivering high-quality feedback [9].

2.3.1. Feedback Quality

Ericsson et al. [25] underscored that substantial enhancements in performance are achievable only through high-quality feedback. Similarly, Prilop et al. [26] showed that the quality of feedback is crucial to its acceptance and for facilitating the continuous development of professional competencies among teachers. Regarding the quality of feedback, Prilop et al. [8,26] provided criteria for effective feedback for teachers based on various studies in other domains (e.g., [27,28]). Summarizing these criteria, effective feedback should consistently be specific, empathetic, and engaging [8,29]. On a cognitive level (specific and engaging), numerous studies (e.g., [30]) have suggested that effective feedback should incorporate both evaluative and tutorial components. Therefore, individuals providing feedback should assess a particular situation with a firm emphasis on content, offer and explain alternative actions, and pose engaging questions. Nicol and Macfarlane-Dick [31] emphasized that feedback fosters self-regulation by not only clarifying good performance but also encouraging students to reflect and engage in self-assessment as part of their learning process. Additionally, Nicol [32] highlighted feedback as a mechanism for promoting learner autonomy, suggesting that students must actively construct meaning from feedback to improve performance and self-regulation. At the affective–motivational level (empathetic), the delivery of feedback is crucial. Ultimately, according to Prins et al. [28], effective peer feedback should be presented in first person. This perspective suggests that feedback is subjective and open to dialog rather than an indisputable fact. In our previous research [26], we found that critiques should always be counterbalanced by positive evaluations. Regarding the criteria for high-quality feedback, a few studies [28,33] have examined the impact of expertise on feedback quality by comparing the feedback provided by novices and experts.

2.3.2. Novice and Expert Feedback

Hattie and Timperley [34] emphasized that feedback can be provided by different agents, such as experts or novices. The disparity in the quality of feedback given by experts and novices has been systematically examined in a few studies. Prins et al. [28] compared expert and novice feedback in medical education, finding that experts utilized more criteria, provided more situation-specific comments and positive remarks, and frequently adopted a first-person perspective style. They also observed that a significant portion of novices either did not pose any reflective questions (59%) or failed to offer alternative suggestions (44%). Similar observations were made in the domain of teacher education [33]. Specifically, expert feedback was more specific, question-rich, and first-person-perspective-oriented than pre-service teachers’ feedback at the bachelor level. Pre-service teachers seldom included specific descriptions of teaching situations in their feedback and rarely utilized activating questions. In sum, expert feedback seems to be of higher quality than novice feedback. However, the provision of adaptive feedback is resource intensive if performed manually for every learner’s task solution, and accordingly, experts in HE often struggle to provide high-quality feedback due to insufficient resources [6]. LLM feedback offers a potential solution [35], but it remains unclear whether LLM feedback is qualitatively equivalent to expert feedback in HE.

2.3.3. Large Language Models as Feedback Providers

The integration of AI into education is changing teaching methods, curriculum planning, and student engagement [36]. Recent studies have investigated the use of LLMs to generate adaptive feedback. For example, in their meta-analysis, Fleckenstein et al. [5] established that the utilization of automated feedback could enhance students’ writing progress. Zhu et al. [37] examined an LLM-powered feedback system in a high school climate activity task and found that it helped students refine their scientific arguments. Sailer et al. [35] investigated the impact of adaptive feedback on pre-service teachers’ diagnostic reasoning, showing that while it improved justification quality in written assignments, it did not enhance diagnostic accuracy. In contrast, static feedback negatively affected learning in dyads. Additionally, Bernius et al. [38] used natural language processing models to generate feedback for student responses in large courses, reducing grading effort by up to 85% and being perceived as highly precise. Kasneci et al. [25] highlighted how LLMs can assist university and high school teachers with research and writing tasks, improving efficiency and reducing the time spent on personalized feedback. In a recent study, Dai et al. [39] investigated the ability of two GPT model versions (GPT-3.5 and GPT-4) to provide feedback on students’ open-ended writing assignments. The feedback generated by GPT-3.5 and GPT-4 was compared to that of human instructors, evaluating three key aspects: readability, the presence of effective feedback components, and reliability in assessing student performance. The results indicate that (1) both GPT-3.5 and GPT-4 consistently produced more readable feedback than human instructors, (2) GPT-4 outperformed GPT-3.5 as well as human instructors by delivering feedback enriched with crucial components such as feeding up, feeding forward, and self-regulation strategies, and (3) GPT-4 exhibited superior feedback reliability compared to GPT-3.5. Considering the results of these previous studies, LLMs appear to be promising feedback givers. However, there is still a lack of empirical evidence in the context of teacher education as well as on the quality of feedback in terms of the criteria for effective feedback (specific, empathetic, and engaging). Moreover, our study addresses the importance of prompt engineering when using LLMs as feedback providers.

3. The Aim of the Study and the Research Questions

This study aims to investigate the potential of LLMs, particularly ChatGPT-4, as feedback providers in higher education. To achieve this, a theory-driven prompt manual was developed to guide the creation of high-quality prompts, which serve as the foundation for analyzing the output generated by the model. The manual was designed to systematically assess and enhance prompt quality, thereby enabling a structured approach to studying LLM feedback. Using this manual, this study seeks to address the following research questions:
(1)
What differences emerge in LLM feedback when prompts of varying quality are used?
(2)
How does LLM feedback, influenced by prompt quality, compare to novice and expert feedback in terms of feedback quality (specific, empathetic, and engaging)?
Figure 1 shows our heuristic working model, which includes the quality of prompts, the quality of feedback, and potential outcomes that should be investigated in future studies.

4. Method

4.1. The Development of a Theory-Driven Prompt Manual

We developed a theory-driven coding manual to analyze prompt quality for LLMs, integrating various prompt engineering approaches. Our design followed Kipp’s [19] four key elements of prompt engineering and considered the five factors influencing prompt selection highlighted by ChatGPT and Ekin [18]. Lastly, we applied Lo’s [16] CLEAR framework to refine each prompt module. This resulted in a manual with eight distinct categories of prompt quality (see Table 1).
Subsequently, we developed three prompts of different quality (poor, medium, and good) using our prompting manual. Following Wittwer et al. [42], we formulated a learning goal containing three types of errors (learning goal: “Students will recognize a right triangle and understand the Pythagorean theorem”; error types: no activity verb, an instructional rather than a learning goal, and multiple learning goals in a single statement) and asked ChatGPT-4 to provide feedback on this learning goal.
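As an illustration of how the manual translates into a prompt score, the sketch below sums the 0–2 ratings over the eight categories in Table 1, giving a maximum of 16 points. The example ratings are placeholders, not the codes assigned to the three study prompts.

```python
# Sketch: scoring a prompt with the eight-category manual (Table 1).
# Each category is rated 0 (suboptimal), 1 (average), or 2 (good); maximum = 16.
# The example ratings below are placeholders, not the study's actual codes.

MANUAL_CATEGORIES = [
    "role", "target_audience", "medium_channel", "mission",
    "format_and_constraints", "conciseness", "domain_specificity", "logic",
]

def score_prompt(ratings: dict) -> int:
    """Sum the 0-2 ratings over all eight categories of the prompt manual."""
    assert set(ratings) == set(MANUAL_CATEGORIES), "rate every category exactly once"
    assert all(r in (0, 1, 2) for r in ratings.values()), "ratings must be 0, 1, or 2"
    return sum(ratings.values())

example_ratings = dict.fromkeys(MANUAL_CATEGORIES, 2)
example_ratings["medium_channel"] = 1
print(f"{score_prompt(example_ratings)} / 16")  # prints "15 / 16"
```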

4.2. Generating LLM Feedback

To examine the impact of prompt quality on feedback generation, ChatGPT-4 was exclusively used as the model for this study. This decision was based on its widespread use in higher education, reflecting real-life applications in HE [4], and its status as the latest and most advanced iteration available through OpenAI during our research. These attributes ensured that the study’s findings were grounded in real-world contexts and aligned with the current state of AI-driven feedback tools. Further supporting this decision, a comparative study by Jacobsen et al. [43] analyzed feedback quality across ChatGPT-4, Gemini Advanced, and Claude 3 within HE contexts and found ChatGPT-4 to be the superior model.
To maintain consistency, feedback was generated within separate, independent conversations, ensuring that each prompt was treated as a distinct input. This approach eliminated potential influences of prior interactions. Additionally, a single, consistent account was used throughout the study, standardizing access to GPT-4 and removing variations caused by differing model configurations or access tiers. These measures ensured that feedback quality was evaluated under uniform and replicable conditions.
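The study itself generated feedback through the ChatGPT-4 interface. Purely as an illustration of the “separate, independent conversations” design, the sketch below shows how the same stateless setup could be reproduced with the OpenAI Python API; the model name and key handling are assumptions, not the study’s configuration.

```python
# Illustrative sketch (not the study's setup): generating each piece of feedback
# in a fresh, stateless request so that no prior interaction influences the output.
# Assumes the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_feedback(prompt: str, n_runs: int = 20, model: str = "gpt-4") -> list:
    """Send the same prompt n_runs times, each as an independent conversation."""
    outputs = []
    for _ in range(n_runs):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],  # no prior turns carried over
        )
        outputs.append(response.choices[0].message.content)
    return outputs

# feedback_for_prompt_3 = generate_feedback(prompt_3_text)  # 20 independent generations
```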

4.3. Assessment of Feedback Quality

To analyze the quality of LLM feedback and answer our first research question, we conducted a quantitative feedback analysis. We adapted the coding scheme of Prilop et al. [8] based on the feedback quality index developed by Prins et al. [28]. Each feedback instance served as a unit of analysis and enabled a thorough content evaluation. The original scheme comprises six categories: evaluation criteria, specificity, suggestions, questions, first-person perspective, and valence (positive/negative). The feedback is assigned a rating of “2” for high quality, “1” for average, and “0” for suboptimal. A detailed explanation of this process can be found in Prilop et al. [44]. We added three categories: errors, explanations, and explanations of suggestions. The error category was necessary due to the tendency of LLMs to hallucinate [45,46], with points deducted in this area. Hallucination in LLMs refers to the generation of information or responses that appear plausible but are factually incorrect or not based on the given input or data. The category explanation was based on the manual by Wu and Schunn [47]. Finally, suggestions were divided into two categories, presence of suggestion and explanation of suggestion, to improve coding accuracy (see Table 2 for the coding manual and inter-coder reliability).
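A minimal sketch of how such a coding could be aggregated is given below, assuming that the eight quality categories are summed (0–2 each) and that coded errors are deducted. This aggregation rule is an assumption consistent with the 16-point totals reported in Section 5.1; the exact rule is not spelled out in the text.

```python
# Sketch of the feedback quality score: eight quality categories rated 0-2 plus a
# deduction for coded errors (hallucinations). The aggregation is an assumption
# consistent with the 16-point totals reported in Section 5.1.

QUALITY_CATEGORIES = [
    "assessment_criteria", "specificity", "explanation", "presence_of_suggestion",
    "explanation_of_suggestion", "questions", "first_person", "valence",
]

def feedback_quality_score(codes: dict, errors: int = 0) -> int:
    """Sum the 0-2 codes over the quality categories and deduct coded errors."""
    total = sum(codes[c] for c in QUALITY_CATEGORIES)
    return max(0, total - errors)

codes = dict.fromkeys(QUALITY_CATEGORIES, 2)   # hypothetical codes, not study data
codes["questions"] = 1
codes["valence"] = 1
print(feedback_quality_score(codes, errors=2))  # (16 - 2 category points) - 2 errors = 12
```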

4.4. Coding of the Feedback

The LLM feedback (20 pieces of feedback from low-quality prompts, 20 from medium-quality prompts, and 20 from high-quality prompts) was coded by three trained student coders. These coders were trained by a member of the research team and initially coded a sample of 20 feedback comments. Any discrepancies were discussed and resolved following the method described by Zottmann et al. [48]. The feedback was then randomly assigned for coding. Fleiss’ kappa [49] (κ) was used to measure agreement between coders, resulting in significant kappa values (see Table 2), indicating reliable coding. Based on the analysis, it became clear which prompt provided better results. Subsequently, the high-quality prompt was presented to 30 pre-service teachers (novices), seven teacher trainers, two educational science professors, one teacher trainer, and one headmaster (experts), who also formulated feedback. This feedback was coded by the same coders.
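For the inter-coder agreement, Fleiss’ kappa can be computed from the three coders’ category codes. The sketch below uses statsmodels on hypothetical ratings (one row per feedback instance, one column per coder); it illustrates the computation only and does not reproduce the study’s data.

```python
# Sketch: Fleiss' kappa for three coders on one coding category.
# Hypothetical ratings; rows = feedback instances, columns = coders, values = codes 0-2.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 1, 2],
    [1, 1, 1],
])

counts, _ = aggregate_raters(ratings)   # instances x categories count table
print(round(fleiss_kappa(counts), 2))   # agreement across the three coders
```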

4.5. Analysis Method

We used our prompt manual to analyze the prompt quality of our three different prompts. We then analyzed differences between LLM feedback (n = 30), expert feedback (n = 11), and novice feedback (n = 30) (independent variables) concerning the different subdimensions of feedback quality (dependent variables) using one-way analyses of variance (ANOVAs), followed by Bonferroni post hoc tests. All statistical calculations were performed using SPSS 26, and we set the significance level at p < 0.05 for all tests.
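For readers without SPSS, the analysis can be approximated in Python. The sketch below runs a one-way ANOVA and Bonferroni-adjusted pairwise comparisons on hypothetical scores for one subdimension; note that SPSS’s Bonferroni post hoc uses the pooled ANOVA error term, so the pairwise t-tests here are only an approximation.

```python
# Sketch: one-way ANOVA with Bonferroni-adjusted pairwise comparisons, mirroring
# the SPSS analysis on hypothetical codes for a single feedback-quality subdimension.
from itertools import combinations
from scipy import stats

groups = {
    "novice": [1, 0, 1, 1, 0, 1, 1, 0],   # placeholder codes, not study data
    "expert": [2, 1, 2, 1, 2, 2, 1, 1],
    "llm":    [2, 2, 1, 2, 2, 2, 1, 2],
}

f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

pairs = list(combinations(groups, 2))
for a, b in pairs:
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    p_bonferroni = min(1.0, p * len(pairs))   # Bonferroni adjustment over 3 comparisons
    print(f"{a} vs. {b}: adjusted p = {p_bonferroni:.3f}")
```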

5. Results

5.1. Differences Between Prompts and Their Output

Regarding the first research question, we fed ChatGPT-4 different types of prompts (to see the prompts, please view the Supplementary Material) and analyzed the outcome in terms of quality as well as the accuracy of the feedback. The first prompt achieved low quality (5 out of 16 points according to the prompt manual). The second prompt contained more details than the first and therefore achieved slightly higher quality (8 out of 16 points). The third prompt had the highest quality, scoring 15 out of a possible 16 points. We generated feedback 20 times for each prompt and coded the results using our feedback quality manual. To compare the feedback, we conducted an ANOVA with Bonferroni post hoc tests. Our results show significant differences between the prompts regarding feedback quality for all subdimensions except valence and presence of suggestions (for more details about descriptive data, see Table 3). Bonferroni-adjusted post hoc tests revealed that the feedback generated with prompt 3 (most sophisticated prompt) performed significantly (p < 0.001) better in the subcategory assessment criteria than prompt 1 (MDiff = 1.50, 95% CI [1.10, 1.90]) and prompt 2 (MDiff = 0.90, 95% CI [0.50, 1.30]). We found the same effect for the categories explanation (prompt 1: MDiff = 0.75, 95% CI [0.41, 1.09], p < 0.001; prompt 2: MDiff = 0.40, 95% CI [0.06, 0.74], p < 0.05), first person (prompt 1: MDiff = 1.05, 95% CI [0.63, 1.47], p < 0.001; prompt 2: MDiff = 0.95, 95% CI [0.53, 1.37], p < 0.001), and questions (prompt 1: MDiff = 0.70, 95% CI [0.28, 1.12], p < 0.001; prompt 2: MDiff = 1.00, 95% CI [0.58, 1.42], p < 0.001). Furthermore, the feedback generated with prompt 3 was significantly (p < 0.001) better than that generated with prompt 1 for the categories explanation of suggestion (MDiff = 0.60, 95% CI [0.23, 0.97]) and specificity (MDiff = 1.25, 95% CI [0.90, 1.60]). For the category error, prompt 2 generated significantly (p < 0.001) more errors than prompt 1 (MDiff = −0.85, 95% CI [−1.34, −0.36]) and prompt 3 (MDiff = −0.95, 95% CI [−1.44, −0.46]).
These quantitative results are illustrated below with examples of feedback generated using the low-quality prompt (prompt 1) and the high-quality prompt (prompt 3). The feedback generated with prompt 1 received zero points for feedback quality from our coders. While it vaguely listed some errors and mentioned aspects of a learning goal, the feedback lacked clarity and would not help a novice improve their learning goal. In contrast, the feedback generated with prompt 3 received 12 out of 16 points. High-quality sections of this feedback are annotated and explained (annotations in square brackets). The following examples illustrate the difference in feedback quality when an LLM is provided with prompts of varying quality, although they are not intended as an in-depth qualitative analysis.
  • LLM feedback generated using prompt 1:
The learning goal “Students will recognize a right triangle and understand the Pythagorean theorem” is clear but could benefit from further specificity. Recognizing right triangles and understanding the Pythagorean theorem are two distinct skills, and clarifying the level of understanding and application expected can enhance the goal. While the goal is relevant to mathematics education, it would be beneficial to outline how success will be measured. Breaking the goal into subgoals, such as identification, understanding, and application, could help in creating appropriate assessments. Finally, aligning this goal with students’ grade level and setting a specific time frame might provide a more structured educational pathway. This feedback aims to enhance the clarity and measurability of the goal, ensuring it is tailored to the students’ needs. [→ The LLM does not write in first person throughout the feedback; the LLM provides no alternatives, poses no questions, and recognizes only one error].
  • LLM feedback generated using prompt 3:
I appreciate the effort you’ve put into formulating this learning goal. It’s clear that you aim to teach a foundational concept in geometry that has a wide range of applications. Let me share my thoughts with you. [→ The LLM writes in first person throughout the feedback].
Firstly, I noticed that the goal contains two distinct objectives: recognizing right triangles and understanding the Pythagorean theorem. [→ The LLM recognizes the first error: multiple learning goals in a single statement]. This could lead to a scattered focus. [→ The LLM explains the relevance of learning goals]. How might you refine it to contain just one learning goal? [→ The LLM suggests alternatives in a cognitively stimulating way and poses a question].
In terms of action verbs, “recognize” and “understand” are quite broad and don’t lend themselves to easy assessment. [→ The LLM recognizes the second error: no activity verb]. Bloom’s taxonomy might help you choose more specific verbs that clarify what you want students to do. [→ The LLM uses theoretical models to address good learning goals]. Perhaps “identify” for recognizing right triangles and “apply” for using the Pythagorean theorem would be more measurable and actionable? [→ The LLM presents alternatives].
The goal is aimed at the learner, which is a positive, but I wonder if it’s concrete enough. How will you measure “understanding” in this context? Could you link this more closely to a specific learning outcome, like solving a problem using the Pythagorean theorem? [→ The LLM recognizes the third error, instructional rather than learning goal, and poses a question].
Lastly, the connection between content and goal is somewhat present but could be more explicit. Could the goal focus more on what students will actually do with their knowledge about right triangles and the Pythagorean theorem in practice? [→ The LLM poses a thought-provoking question].
So, what do you think about these points? How would you modify the learning goal to make it more specific, actionable, and closely related to measurable outcomes? [→ The LLM poses thought-provoking questions].

5.2. Differences Between Novice, LLM, and Expert Feedback

To compare LLM feedback with novice and expert feedback, we provided the highest-quality prompt (prompt 3) to pre-service teachers and experts (see Section 4.4 for expert details). An ANOVA with Bonferroni post hoc tests revealed significant differences among the groups in feedback quality across all subdimensions except empathy, valence, and first person (see Table 4 for descriptive data). The Bonferroni-adjusted post hoc tests confirmed previous findings [26,33], indicating that expert feedback was more concrete, activating, and correct but not more empathetic than that of novices. Expert feedback showed significantly higher quality (p < 0.001) in the subcategories assessment criteria, explanation, questions, presence of suggestions, explanation of suggestions, and specificity. The comparison between novice and LLM feedback showed that LLM feedback outperformed novice feedback in all subcategories except valence and first person. Regarding the difference between LLM and expert feedback, the Bonferroni-adjusted post hoc tests revealed that the LLM feedback had higher quality than expert feedback in the subcategories explanation (MDiff = 0.46, 95% CI [0.17, 0.74], p < 0.001), questions (MDiff = 0.50, 95% CI [0.07, 0.93], p < 0.05), and specificity (MDiff = 0.96, 95% CI [0.52, 1.41]).

6. Discussion

The findings of this study offer compelling insights into the utility and effectiveness of LLM-based feedback in HE. Currently, novice feedback, in the form of peer feedback, is often used in HE, but it is not always conducive to learning [7]. Moreover, it is challenging for experts to provide high-quality feedback in HE due to a lack of human and financial resources [9]. LLM feedback can provide an enriching and economical alternative. A particularly promising result of our study is that feedback generated by the LLM surpassed novice feedback in quality and even rivaled that of experts. Accordingly, our results align with those of Dai et al. [39] while underlining the importance of prompting when using LLMs.
Our first research question addressed what kinds of prompts are needed to generate high-quality LLM feedback. One key finding of our study is the importance of prompt quality in determining the quality of LLM-based feedback. While LLMs can generate high-quality feedback, the output is dependent on the context, mission, specificity, and clarity of the prompts provided. This study reveals that only the prompt with the highest quality could induce the LLM to generate consistent high-quality feedback. When considering the category error, prompt 2 was revealed to be a wolf in sheep’s clothing, having good stylistic properties but resulting in significantly more errors than prompt 1 and more errors than any other prompt or feedback provider in this study. This illustrates the potential of LLMs to hallucinate [45,46] and underscores the importance of careful, theory-driven prompt design. The ability to craft high-quality prompts is a skill that educators need to master (e.g., [10,17]), necessitating a manual or guidelines. In our study, we designed a prompt manual which could and should be used by educators who work with LLMs.
With regard to research question 2, our study supports previous findings [26,33] showing that expert feedback is of higher quality than novice feedback. We found that experts outperformed pre-service teachers in the categories concreteness, activation, and correctness but not in the category empathy. The same was true when we compared LLM and novice feedback. By comparing LLM feedback with expert feedback, we complement these findings, providing new insights regarding feedback processes in HE. Our results show that LLM feedback can outperform expert feedback in the categories explanation, questions, and specificity. This attests to the transformative potential of LLMs in educational settings, offering the promise of scalable, high-quality feedback that could revolutionize the way educators assess student work. Furthermore, the LLM-based feedback was produced in significantly less time than the expert feedback (in our study, ChatGPT-4 produced an average of 49 pieces of feedback in the same amount of time that an expert produced 1 piece of feedback), heralding efficiency gains that could free up educators for more personalized or creative pedagogical endeavors. However, considering our proposed heuristic model, future studies should investigate how LLM-based feedback is perceived by students and whether students’ learning experiences and learning gains can be enhanced by LLM feedback.
Overall, our findings support the results of Dai et al. [39] and lend credence to the promise of LLMs as a viable alternative to expert feedback in HE. However, we must also consider the scope and limitations of LLMs. While they can quickly analyze and generate feedback based on set parameters, LLMs lack the nuanced understanding of individual learners’ psychology, needs, and the socio-cultural context within which learning occurs. LLMs seem to perform particularly well with task-related feedback [39], which corresponds to the feedback level observed in this study. Nevertheless, it is crucial to recognize that expertise is not solely a function of accurate or quick feedback. Experts bring a depth of experience, professional judgment, and a personal touch to their interactions with students. These qualities are currently beyond the reach of AI systems and may prove irreplaceable in educational settings that value not only the transfer of knowledge but also the building of relationships and character. Even if efficiency and quality are the only benchmarks, there was one outlier with multiple errors among the 20 feedback comments generated by the highest-quality prompt. Thus, we posit that experts are still needed but that their tasks should be shifted from providing feedback to monitoring and revising LLM feedback.
These hybrid human–AI feedback approaches present a promising avenue for addressing the limitations of both fully human and fully AI-driven feedback processes. These systems combine the efficiency of AI with the nuanced understanding of human experts to optimize learning outcomes [50]. By leveraging AI for tasks like initial error detection and feedback drafting, and human intelligence for contextualization and deeper interpretation, hybrid systems can address challenges in scalability while maintaining high-quality feedback. Similarly, Miranda et al. [51] proposed a routing framework for optimizing the distribution of feedback tasks between AI and humans. Their findings show that hybrid annotation strategies improved feedback quality and reduced annotation costs compared to using either humans or AI alone. Particularly, their framework successfully routed complex feedback tasks requiring human judgment while delegating routine tasks to AI systems, achieving a balanced and effective workflow. Applying these insights to HE, a hybrid feedback model could help educators efficiently scale feedback processes without compromising the personal and contextual touch that students value.
Beyond efficiency and quality, ethical considerations surrounding LLM implementation in HE demand attention. While LLMs operate without subjective bias, the data they are trained on often contain systemic biases that can unintentionally propagate inequities [52]. To mitigate these risks, practical measures such as implementing diverse and representative training datasets, conducting regular bias assessments, and adopting careful prompt engineering strategies are essential [10]. Additionally, educators should be trained to identify and counteract any unintended biases that may arise in LLM outputs, ensuring fair and equitable applications in educational contexts. Moreover, data privacy and security must be prioritized. As LLMs are integrated into educational settings, robust policies should be established to safeguard sensitive information and adhere to principles of data sovereignty and ethical usage [52]. Educators and policymakers must collaborate to create transparent regulatory frameworks that address these challenges, balancing the benefits of LLMs with the necessity of maintaining ethical standards in education. Echoing Zawacki-Richter et al. [53], “We should not strive for what is technically possible, but always ask ourselves what makes pedagogical sense” (p. 21).

6.1. Limitations and Implications

This study takes an in-depth look at the efficacy of LLMs as tools for generating feedback in HE. An important limitation of our study that warrants discussion is the restricted focus on a single learning goal and a limited set of errors for which feedback was generated. This narrow scope may limit the generalizability of our findings. While we found that the LLM outperforms both novices and experts in providing high-quality feedback for the specific errors we examined, it remains an open question whether these findings would hold true across a broader range of academic subjects and tasks in HE. Educational settings are diverse, encompassing a wide array of subjects, each with their own unique types of content and forms of assessment. Therefore, it would be risky to assume that the efficacy of an LLM in our context would be universally applicable across all educational environments. Future research should aim to diversify the types of tasks and the corresponding feedback. This would provide a more comprehensive understanding of where LLM-based feedback can be most effectively and appropriately utilized in HE. Until such broader research is conducted, the application of our findings should be considered preliminary and best suited for contexts similar to the one we studied. For this reason, we conducted a study in which we compared the feedback quality of three different LLMs giving feedback to 153 pre-service teachers regarding their learning goals in their first teaching practicum [43]. In our study, we focused on the comparison between LLMs, novices, and experts; however, we did not analyze intermediate skill levels or other feedback modalities (e.g., hybrid human–AI approaches). Hybrid approaches, where human input refines or contextualizes AI-generated feedback, could especially leverage the efficiency of AI while ensuring nuanced and context-sensitive guidance (e.g., [50,51]). Exploring these areas in future research could reveal strategies for integrating AI feedback into educational settings in ways that maximize its pedagogical value. Another practical implication of this study is that the relevance of prompt engineering may create a barrier to entry for educators less familiar with the nuances of designing effective prompts, thus necessitating further training or guidance.

6.2. Conclusions

In conclusion, there is compelling evidence supporting the use of LLMs as tools for feedback in HE, particularly in terms of their quality and efficiency. However, their application is not without pitfalls. While the foundational importance of prompt quality is a reaffirmed finding, our study contributes uniquely to the field by comparing feedback quality across novices, experts, and LLMs. Additionally, we developed a theory-driven prompt manual, specifically designed for use by educators in HE and schools. This manual provides a practical tool to ensure consistent, high-quality prompts, enabling educators to harness the transformative potential of LLMs while optimizing feedback mechanisms in educational contexts.
Overall, we found that LLMs have the potential to be valuable tools, but educators must be skilled in prompt engineering and adept at utilizing the tools to achieve optimal results. As Azaria et al. [54] emphasized in the title of their article, “ChatGPT is a Remarkable Tool—For Experts”, the dependence on prompt quality, ethical challenges, and the irreplaceable nuanced inputs from human experts make it a tool to be used cautiously. Future research should explore these dimensions in more detail, possibly leading to a balanced hybrid approach that combines the strengths of both LLM and human expertise in educational feedback mechanisms.
The endeavor to incorporate LLMs in HE is not a question of replacement but of augmentation. By navigating this balance, educators can harness the efficiency and scalability of LLMs while preserving the personalized, nuanced contributions of human expertise. How we navigate this balance will determine the efficacy of such technological solutions in truly enriching the educational landscape.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai6020035/s1, Prompts used in this study for the Large Language Model.

Author Contributions

Conceptualization, L.J.J. and K.E.W.; methodology, L.J.J.; software, K.E.W.; validation, L.J.J. and K.E.W.; formal analysis, K.E.W.; investigation, L.J.J. and K.E.W.; resources, L.J.J.; data curation, K.E.W.; writing—original draft preparation, L.J.J. and K.E.W.; writing—review and editing, L.J.J. and K.E.W.; visualization, K.E.W.; supervision, L.J.J.; project administration, L.J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

In Germany, the criteria set forth by the German Research Foundation (DFG) stipulate that a study must obtain ethical clearance if it subjects participants to significant emotional or physical stress, does not fully disclose the study’s purpose, involves patients, or includes procedures like functional magnetic resonance imaging or transcranial magnetic stimulation. Our research did not meet any of these conditions, so it was not necessary for us to seek ethical approval. The pre-service teachers as well as the experts provided the feedback voluntarily. Moreover, all participants were informed about the study’s purpose and confidentiality as well as data protection information.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

All data generated or analyzed during this study are either included in this published article or can be made available by the authors upon request.

Acknowledgments

We would like to thank the pre-service teachers and experts in our study, as well as the coders, for their efforts. Thank you for participating in the study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Harry, A.; Sayudin, S. Role of AI in education. Interdiscip. J. Hummanity (INJURITY) 2023, 2, 260–268. [Google Scholar] [CrossRef]
  2. Crompton, H.; Burke, D. Artificial intelligence in higher education: The state of the field. Int. J. Educ. Technol. High. Educ. 2023, 20, 22. [Google Scholar] [CrossRef]
  3. Prilop, C.N.; Mah, D.; Jacobsen, L.J.; Hansen, R.R.; Weber, K.E.; Hoya, F. Generative AI in Teacher Education: Using AI-Enhanced Methods to Explore Teacher Educators’ Perceptions; Center for Open Science: Charlottesville, VA, USA, 2024. [Google Scholar] [CrossRef]
  4. von Garrel, J.; Mayer, J. Artificial Intelligence in studies—Use of ChatGPT and AI-based tools among students in Germany. Humanit. Soc. Sci. Commun. 2023, 10, 1–9. [Google Scholar] [CrossRef]
  5. Fleckenstein, J.; Liebenow, L.W.; Meyer, J. Automated feedback and writing: A multi-level meta-analysis of effects on students’ performance. Front. Artif. Intell. 2023, 6, 1162454. [Google Scholar] [CrossRef] [PubMed]
  6. Henderson, M.; Ajjawi, R.; Boud, D.; Molloy, E. (Eds.) The Impact of Feedback in Higher Education: Improving Assessment Outcomes for Learners; Springer International Publishing: Berlin/Heidelberg, Germany, 2019. [Google Scholar] [CrossRef]
  7. Kluger, A.N.; DeNisi, A. The effects of feedback interventions on performance: A historical review, a meta-analysis and a preliminary feedback intervention theory. Psychol. Bull. 1996, 119, 254–284. [Google Scholar] [CrossRef]
  8. Prilop, C.N.; Weber, K.E.; Kleinknecht, M. Entwicklung eines video- und textbasierten Instruments zur Messung kollegialer Feedbackkompetenz von Lehrkräften. In Lehrer. Bildung. Gestalten.: Beiträge zur Empirischen Forschung in der Lehrerbildung; Beltz Juventa Verlag: Weinheim, Germany, 2019. [Google Scholar]
  9. Demszky, D.; Liu, J.; Hill, H.C.; Jurafsky, D.; Piech, C. Can automated feedback improve teachers’ uptake of student ideas? Evidence from a randomized controlled trial in a large-scale online course. Educ. Eval. Policy Anal. 2023, 46, 483–505. [Google Scholar] [CrossRef]
  10. United Nations Educational, Scientific and Cultural Organization (UNESCO). AI Competency Framework for Teachers; UNESCO: Paris, France, 2024. [Google Scholar] [CrossRef]
  11. Mah, D.K.; Groß, N. Artificial intelligence in higher education: Exploring faculty use, self-efficacy, distinct profiles, and professional development needs. Int. J. Educ. Technol. High. Educ. 2024, 21, 58. [Google Scholar] [CrossRef]
  12. Cotton DR, E.; Cotton, P.A.; Shipway, J.R. Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innov. Educ. Teach. Int. 2023, 61, 228–239. [Google Scholar] [CrossRef]
  13. Ifenthaler, D.; Majumdar, R.; Gorissen, P.; Judge, M.; Mishra, S.; Raffaghelli, J.; Shimada, A. Artificial Intelligence in Education: Implications for Policymakers, Researchers, and Practitioners. Technol. Knowl. Learn. 2024, 29, 1693–1710. [Google Scholar] [CrossRef]
  14. Jensen, L.X.; Buhl, A.; Sharma, A.; Bearman, M. Generative AI and higher education: A review of claims from the first months of ChatGPT. High. Educ. 2024. [Google Scholar] [CrossRef]
  15. Bsharat, S.M.; Myrzakhan, A.; Shen, Z. Principled instructions are all you need for questioning LLaMA-1/2, GPT-3.5/4. arXiv 2024, arXiv:2312.16171. [Google Scholar] [CrossRef]
  16. Lo, L.S. The CLEAR path: A framework for enhancing information literacy through prompt engineering. J. Acad. Librariansh. 2023, 49, 102720. [Google Scholar] [CrossRef]
  17. Zamfirescu-Pereira, J.D.; Wong, R.; Hartmann, B.; Yang, Q. Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23), Hamburg, Germany, 23–28 April 2023; ACM: New York, NY, USA, 2023; pp. 1–21. [Google Scholar] [CrossRef]
  18. ChatGPT; Ekin, S. Prompt engineering for ChatGPT: A quick guide to techniques, tips and best practice. Authorea Prepr. 2023. [Google Scholar] [CrossRef]
  19. Kipp, M. Wie Sag Ich’s Meiner KI? Hintergründe und Prinzipien zum #Prompting bei #ChatGPT [Video]. 2023. Available online: https://www.youtube.com/watch?v=cfl7q1llkso&t=2382s (accessed on 20 June 2023).
  20. Weber, K.E.; Prilop, C.N. Videobasiertes Training kollegialen Feedbacks in der Lehrkräftebildung. In Spezifische Aspekte von Trainings pädagogischer Kompetenzen in Abgrenzung zu anderen Lehr-Lern-Situationen in der Lehrkräftebildung: Tagungsband. Online-Tagung an der Uni Rostock am 4. und 5. März 2022 zum Thema “Ist das jetzt schon ein Training? Wie unterscheiden sich Trainings von anderen Lehr-Lern-Situationen in der Lehrkräftebildung?”; Carnein, O., Damnik, G., Krause, G., Vanier, D., Eds.; Publikationsserver RosDok: Rostock, Germany, 2023. [Google Scholar] [CrossRef]
  21. Narciss, S. Designing and evaluating tutoring feedback strategies for digital learning environments on the basis of the interactive feedback model. Digit. Educ. Rev. 2013, 23, 7–26. [Google Scholar]
  22. Weber, K.E.; Gold, B.; Prilop, C.N.; Kleinknecht, M. Promoting pre-service teachers’ professional vision of classroom management during practical school training: Effects of a structured online- and video-based self-reflection and feedback intervention. Teach. Teach. Educ. 2018, 76, 39–49. [Google Scholar] [CrossRef]
  23. Lu, H.-L. Research on peer-coaching in preservice teacher education—A review of literature. Teach. Teach. Educ. 2010, 26, 748–753. [Google Scholar] [CrossRef]
  24. Kraft, M.A.; Blazar, D.; Hogan, D. The effect of teacher coaching on instruction and achievement: A meta-analysis of the causal evidence. Rev. Educ. Res. 2018, 88, 547–588. [Google Scholar] [CrossRef]
  25. Ericsson, K.A.; Krampe, R.T.; Tesch-Römer, C. The role of deliberate practice in the acquisition of expert performance. Psychol. Rev. 1993, 100, 363–406. [Google Scholar] [CrossRef]
  26. Prilop, C.N.; Weber, K.E.; Kleinknecht, M. The role of expert feedback in the development of pre-service teachers’ professional vision of classroom management in an online blended learning environment. Teach. Teach. Educ. 2021, 99, 103276. [Google Scholar] [CrossRef]
  27. Gielen, M.; De Wever, B. Structuring peer assessment: Comparing the impact of the degree of structure on peer feedback content. Comput. Hum. Behav. 2015, 52, 315–325. [Google Scholar] [CrossRef]
  28. Prins, F.; Sluijsmans, D.; Kirschner, P.A. Feedback for general practitioners in training: Quality, styles and preferences. Adv. Health Sci. Educ. 2006, 11, 289–303. [Google Scholar] [CrossRef] [PubMed]
  29. Prilop, C.N.; Weber, K.E. Digital video-based peer feedback training: The effect of expert feedback on pre-service teachers’ peer feedback beliefs and peer feedback quality. Teach. Teach. Educ. 2023, 127, 104099. [Google Scholar] [CrossRef]
  30. Strijbos, J.W.; Narciss, S.; Dünnebier, K. Peer feedback content and sender’s competence level in academic writing revision tasks: Are they critical for feedback perceptions and efficiency? Learn. Instr. 2010, 20, 291–303. [Google Scholar] [CrossRef]
  31. Nicol, D.J.; Macfarlane-Dick, D. Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Stud. High. Educ. 2006, 31, 199–218. [Google Scholar] [CrossRef]
  32. Nicol, D. Assessment for learner self-regulation: Enhancing achievement in the first year using learning technologies. Assess. Eval. High. Educ. 2009, 34, 335–352. [Google Scholar] [CrossRef]
  33. Weber, K.E.; Prilop, C.N.; Kleinknecht, M. Effects of blended and video-based coaching approaches on preservice teachers’ self-efficacy and perceived competence support. Learn. Cult. Soc. Interact. 2019, 22, 103–118. [Google Scholar] [CrossRef]
  34. Hattie, J.; Timperley, H. The power of feedback. Rev. Educ. Res. 2007, 77, 81–112. [Google Scholar] [CrossRef]
  35. Sailer, M.; Bauer, E.; Hofmann, R.; Kiesewetter, J.; Glas, J.; Gurevych, I.; Fischer, F. Adaptive feedback from artificial neural networks facilitates pre-service teachers’ diagnostic reasoning in simulation-based learning. Learn. Instr. 2023, 83, 101620. [Google Scholar] [CrossRef]
  36. Wang, S.; Wang, F.; Zhu, Z.; Wang, J.; Tran, T.; Du, Z. Artificial intelligence in education: A systematic literature review. Expert Syst. Appl. 2024, 252 Pt A, 124167. [Google Scholar] [CrossRef]
  37. Zhu, M.; Liu, O.L.; Lee, H.-S. The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput. Educ. 2020, 143, 103668. [Google Scholar] [CrossRef]
  38. Bernius, J.P.; Krusche, S.; Bruegge, B. Machine learning based feedback on textual student answers in large courses. Comput. Educ. Artif. Intell. 2022, 3, 100081. [Google Scholar] [CrossRef]
  39. Dai, W.; Tsai, Y.S.; Lin, J.; Aldino, A.; Jin, H.; Li, T.; Gasevic, D.; Chen, G. Assessing the proficiency of large language models in automatic feedback generation: An evaluation study. Comput. Educ. Artif. Intell. 2024, 7, 100299. [Google Scholar] [CrossRef]
  40. Narciss, S. Feedback strategies for interactive learning tasks. In Handbook of Research on Educational Communications and Technology, 3rd ed.; Spector, J.M., Merrill, M.D., van Merrienboer, J.J.G., Driscoll, M.P., Eds.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2008; pp. 125–144. [Google Scholar]
  41. Pekrun, R.; Marsh, H.W.; Elliot, A.J.; Stockinger, K.; Perry, R.P.; Vogl, E.; Goetz, T.; van Tilburg, W.A.P.; Lüdtke, O.; Vispoel, W.P. A three-dimensional taxonomy of achievement emotions. J. Personal. Soc. Psychol. 2023, 124, 145–178. [Google Scholar] [CrossRef]
  42. Wittwer, J.; Kratschmayr, L.; Voss, T. Wie gut erkennen Lehrkräfte typische Fehler in der Formulierung von Lernzielen? Unterrichtswissenschaft 2020, 48, 113–128. [Google Scholar] [CrossRef]
  43. Jacobsen, L.J.; Rohlmann, J.; Weber, K.E. AI Feedback in Education: The Impact of Prompt Design and Human Expertise on LLM Performance. OSF Prepr. 2025. [Google Scholar] [CrossRef]
  44. Prilop, C.N.; Weber, K.E.; Kleinknecht, M. Effects of digital video-based feedback environments on pre-service teachers’ feedback competence. Comput. Hum. Behav. 2020, 102, 120–131. [Google Scholar] [CrossRef]
  45. Alkaissi, H.; McFarlane, S.I. Artificial hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179. [Google Scholar] [CrossRef]
  46. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, J.; Dai, W.; Madotto, A.; et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
  47. Wu, Y.; Schunn, C.D. From plans to actions: A process model for why feedback features influence feedback implementation. Instr. Sci. 2021, 49, 365–394. [Google Scholar] [CrossRef]
  48. Zottmann, J.M.; Stegmann, K.; Strijbos, J.-W.; Vogel, F.; Wecker, C.; Fischer, F. Computer-supported collaborative learning with digital video cases in teacher education: The impact of teaching experience on knowledge convergence. Comput. Hum. Behav. 2013, 5, 2100–2108. [Google Scholar] [CrossRef]
  49. Fleiss, J.L.; Cohen, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Meas. 1973, 33, 613–619. [Google Scholar] [CrossRef]
  50. Weber, F.; Krause, T.; Müller, L. Enhancing legal writing skills: The impact of formative feedback in a hybrid intelligence system. Br. J. Educ. Technol. 2024. [Google Scholar] [CrossRef]
  51. Miranda, L.J.V.; Wang, Y.; Elazar, Y.; Kumar, S.; Pyatkin, V.; Brahman, F.; Smith, N.A.; Hajishirzi, H.; Dasigi, P. Hybrid preferences: Learning to route instances for human vs. AI feedback. arXiv 2025, arXiv:2410.19133. [Google Scholar] [CrossRef]
  52. German Ethics Council (Deutscher Ethikrat). Mensch und Maschine—Herausforderungen durch künstliche Intelligenz. 2023. Available online: https://www.ethikrat.org (accessed on 10 January 2025).
  53. Zawacki-Richter, O.; Marín, V.I.; Bond, M.; Gouveneur, F. Systematic review of research on artificial intelligence applications in higher education—Where are the educators? Int. J. Educ. Technol. High. Educ. 2019, 16, 39. [Google Scholar] [CrossRef]
  54. Azaria, A.; Azoulay, R.; Reches, S. ChatGPT is a remarkable tool–For experts. arXiv 2023, arXiv:2306.03102. [Google Scholar] [CrossRef]
Figure 1. Heuristic working model adapted from Narciss [40] and Pekrun et al. [41].
Table 1. Prompt manual to ensure the development of high-quality prompts.
Category: Context; Subcategory: Role
  Good (2): The role of the LLM and of the person asking the question is explained.
  Average (1): Only one of the roles is explained.
  Suboptimal (0): Neither the role of the LLM nor the role of the person asking the question is explained.
  Example: “You are a mathematics tutor assisting a high school student with geometry problems. I am the teacher creating a learning goal for this task”.

Category: Context; Subcategory: Target audience
  Good (2): There is a clearly defined and described target audience.
  Average (1): The target audience is roughly described.
  Suboptimal (0): The target audience is not specified.
  Example: “The audience is high school students learning geometry in grade 10, with a focus on foundational concepts like the Pythagorean theorem”.

Category: Context; Subcategory: Medium/channel
  Good (2): The medium or channel in which the information is presented is clearly described.
  Average (1): The medium or channel in which the information is presented is roughly described.
  Suboptimal (0): The medium or channel in which the information is presented is not mentioned.
  Clarifying comment: The “medium or channel” specifies the style or platform in which the output is intended to be presented, such as a Twitter post, an academic essay, an email, or a PowerPoint slide. Defining the format or medium ensures that the LLM tailors its tone, structure, and level of detail to meet the specific requirements of the chosen communication method, making the response more relevant and effective.

Category: Mission; Subcategory: Mission/question
  Good (2): The mission of the LLM is clearly described.
  Average (1): The mission of the LLM is roughly described.
  Suboptimal (0): The mission of the LLM is not clear.
  Example: “The mission is to create a clear and specific learning goal for a high school geometry class that will help students understand the Pythagorean theorem and its applications”.

Category: Clarity and specificity; Subcategory: Format and constraints
  Good (2): Stylistic properties as well as length specifications are described.
  Average (1): Either stylistic properties are described or a length specification is given.
  Suboptimal (0): Neither stylistic properties nor length specifications are given.
  Example: “You should provide feedback in concise bullet points, each not exceeding 20 words, and the response should fit within 200 words”.

Category: Clarity and specificity; Subcategory: Conciseness
  Good (2): The prompt contains only information that is directly related and relevant to the output, and it is clear and concise.
  Average (1): The prompt is concise with little superfluous information.
  Suboptimal (0): The prompt contains a lot of information that is irrelevant to the mission/question.
  Clarifying comment: “Conciseness” evaluates whether the prompt contains only information that is essential and directly relevant to the task or mission. Unnecessary details can dilute the focus of the response and reduce efficiency. A concise prompt ensures that the LLM concentrates on the key elements, avoiding extraneous or unrelated content.

Category: Clarity and specificity; Subcategory: Domain specificity
  Good (2): Technical terms are used correctly and give the LLM the opportunity to refer to them in the answer.
  Average (1): Technical terms are used sporadically or without explanation.
  Suboptimal (0): No specific vocabulary that is relevant to the subject area of the question is used.
  Example: “Use terminology like ’right triangle’, ’hypotenuse’, and ’Pythagorean theorem’ to ensure alignment with geometry concepts”.

Category: Clarity and specificity; Subcategory: Logic
  Good (2): The prompt has a very good reading flow, internal logical coherence, a very coherent sequence of information, and a clearly understandable connection between the content and mission.
  Average (1): The prompt fulfills only some of the conditions of the coding “2”.
  Suboptimal (0): The prompt is illogically constructed.
  Clarifying comment: “Logic” assesses the internal coherence and sequence of information in the prompt. A logical prompt presents ideas in a clear, structured, and step-by-step manner, ensuring that the LLM can understand the relationships between different elements of the task. Prompts with good logic provide a smooth flow of information, making it easier for the LLM to generate accurate and meaningful responses.
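The manual above can also be read as a simple scoring rubric. The following minimal sketch, in Python, encodes the eight subcategories and sums the 0–2 codes a rater assigns; the category and subcategory names follow Table 1, whereas the idea of aggregating codes into category and total scores is an illustrative assumption made here, not a procedure prescribed by the manual.

# Illustrative sketch only: names mirror Table 1; the aggregation is an assumption.
PROMPT_MANUAL = {
    "Context": ["Role", "Target audience", "Medium/channel"],
    "Mission": ["Mission/question"],
    "Clarity and specificity": ["Format and constraints", "Conciseness",
                                "Domain specificity", "Logic"],
}

def score_prompt(ratings):
    """Sum the 0-2 codes assigned to each subcategory into category and total scores."""
    per_category = {
        category: sum(ratings.get(sub, 0) for sub in subcategories)
        for category, subcategories in PROMPT_MANUAL.items()
    }
    total = sum(per_category.values())
    maximum = 2 * sum(len(subs) for subs in PROMPT_MANUAL.values())
    return per_category, total, maximum

# Hypothetical codes assigned to a single prompt by one rater:
ratings = {
    "Role": 2, "Target audience": 2, "Medium/channel": 1, "Mission/question": 2,
    "Format and constraints": 1, "Conciseness": 2, "Domain specificity": 1, "Logic": 2,
}
print(score_prompt(ratings))  # ({'Context': 5, 'Mission': 2, 'Clarity and specificity': 6}, 13, 16)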
Table 2. Content analysis of feedback quality: categories, examples, and inter-coder reliability (Fleiss kappa).
Category: Assessment criteria (κ = 0.81)
  Good (2): Aspects of a good learning goal are addressed using technical terms/theoretical models.
  Average (1): Aspects of a good learning goal are addressed without technical terms/theoretical models.
  Sub-optimal (0): Aspects of a good learning goal are not addressed.
  Good feedback example: “However, the learning goal, as currently stated, has room for improvement. The verb ‘recognize’ is on the lower end of Bloom’s taxonomy; it’s more about recall than application or analysis.” (LLM feedback 3.30)

Category: Specificity (κ = 0.81)
  Good (2): All three error types are named and explicitly explained.
  Average (1): Two types of errors are named and explicitly explained.
  Sub-optimal (0): One type of error is named and explicitly explained.
  Good feedback example: “Your goal contains two separate objectives: […] Next, the verbs you’ve chosen, ‘recognize’ and ‘understand’, are a bit vague in the context of Bloom’s taxonomy […] And how do you envision this learning goal relating back to the learner? […]” (LLM feedback 3.28)

Category: Explanation (κ = 0.86)
  Good (2): A detailed explanation is given regarding why the aspects of a good learning goal are relevant.
  Average (1): A brief explanation is given of why the aspects of a good learning goal are relevant.
  Sub-optimal (0): No explanation is given regarding why the aspects of a good learning goal are relevant.
  Good feedback example: “According to best practices, it’s beneficial to focus on just one learning goal at a time. This makes it clearer for both you and the students, streamlining the assessment process.” (LLM feedback 3.14)

Category: Presence of suggestions for improvement (κ = 0.86)
  Good (2): Alternatives are suggested in a cognitively stimulating way.
  Average (1): Alternatives are presented in concrete terms.
  Sub-optimal (0): No alternatives are named.
  Good feedback example: “A more targeted learning goal will focus on just one of these. Which one is your priority?” (LLM feedback 3.28)

Category: Explanation of suggestions (κ = 0.82)
  Good (2): Alternatives are explained in detail.
  Average (1): Alternatives are briefly explained.
  Sub-optimal (0): Alternatives are not explained.
  Good feedback example: “This would align the goal more closely with achieving deeper understanding and skill utilization. […] This goal is learner-centered, contains only one focus, and involves higher-level thinking skills. It also makes the intended learning outcome clear.” (LLM feedback 3.30)

Category: Errors (κ = 0.90)
  Code −2: The feedback includes several content errors regarding learning goals.
  Code −1: The feedback includes one error regarding learning goals.
  Code 0: The feedback does not include errors regarding learning goals.

Category: Questions (κ = 1.00)
  Good (2): The activating question is posed.
  Average (1): The clarifying question is posed.
  Sub-optimal (0): No questions are posed.
  Good feedback example: “So, what specific skill or understanding are you hoping your students will gain by the end of this lesson?” (LLM feedback 3.28)

Category: First person (κ = 0.90)
  Good (2): The feedback is written in first person throughout.
  Average (1): The feedback is occasionally written in first person.
  Sub-optimal (0): The feedback is not written in first person.
  Good feedback example: “I appreciate the effort you’ve put into formulating this learning goal for your future teachers. […] Let me share my thoughts with you. Firstly, I noticed […]” (LLM feedback 3.23)

Category: Valence (κ = 0.76)
  Good (2): There is a balance between positive and negative feedback.
  Average (1): The feedback is mainly positive.
  Sub-optimal (0): The feedback is mainly negative.
  Good feedback example: “I don’t think this learning goal is well worded. […] However, I like that your learning goal is formulated in a very clear and structured way.” (Novice feedback 13)
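For readers who wish to reproduce the inter-coder reliability analysis on their own coding data, the following minimal sketch shows how Fleiss’ kappa (cf. [49]) could be computed in Python with statsmodels. The small ratings matrix is invented for illustration and does not reflect the study’s data; statsmodels is one possible tool, not necessarily the one used by the authors.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Invented example: rows are feedback texts, columns are coders, entries are the
# codes (0, 1, 2) assigned for one category such as "Specificity".
ratings = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 1, 2],
    [1, 1, 1],
])

# Convert the subject-by-rater matrix into a subject-by-category count table,
# then compute Fleiss' kappa on the counts.
counts, categories = aggregate_raters(ratings)
print(round(fleiss_kappa(counts, method="fleiss"), 2))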
Table 3. The quality of the feedback generated using the three different prompts.
Concreteness: Assessment criteria
  Prompt 1: M = 0.45, SD = 0.76, Min. = 0, Max. = 2
  Prompt 2: M = 1.05, SD = 0.39, Min. = 0, Max. = 2
  Prompt 3: M = 1.95, SD = 0.22, Min. = 1, Max. = 2
Concreteness: Explanation
  Prompt 1: M = 0.25, SD = 0.44, Min. = 0, Max. = 1
  Prompt 2: M = 0.60, SD = 0.50, Min. = 0, Max. = 1
  Prompt 3: M = 1.00, SD = 0.32, Min. = 0, Max. = 2
Empathy: First person
  Prompt 1: M = 0.00, SD = 0.00, Min. = 0, Max. = 0
  Prompt 2: M = 0.10, SD = 0.45, Min. = 0, Max. = 2
  Prompt 3: M = 1.05, SD = 0.83, Min. = 0, Max. = 2
Empathy: Valence
  Prompt 1: M = 0.85, SD = 0.56, Min. = 0, Max. = 2
  Prompt 2: M = 1.00, SD = 0.00, Min. = 1, Max. = 1
  Prompt 3: M = 1.00, SD = 0.00, Min. = 1, Max. = 1
Activation: Questions
  Prompt 1: M = 1.20, SD = 0.52, Min. = 0, Max. = 2
  Prompt 2: M = 0.90, SD = 0.72, Min. = 0, Max. = 2
  Prompt 3: M = 1.90, SD = 0.31, Min. = 1, Max. = 2
Activation: Presence of suggestions for improvement
  Prompt 1: M = 1.15, SD = 0.75, Min. = 0, Max. = 2
  Prompt 2: M = 1.15, SD = 0.37, Min. = 1, Max. = 2
  Prompt 3: M = 1.50, SD = 0.51, Min. = 1, Max. = 2
Activation: Explanation of suggestions
  Prompt 1: M = 0.50, SD = 0.51, Min. = 0, Max. = 1
  Prompt 2: M = 1.25, SD = 0.55, Min. = 0, Max. = 2
  Prompt 3: M = 1.10, SD = 0.31, Min. = 1, Max. = 2
Correctness: Specificity
  Prompt 1: M = 0.10, SD = 0.30, Min. = 0, Max. = 1
  Prompt 2: M = 1.05, SD = 0.39, Min. = 0, Max. = 2
  Prompt 3: M = 1.35, SD = 0.59, Min. = 0, Max. = 2
Correctness: Errors
  Prompt 1: M = −0.40, SD = 0.50, Min. = −1, Max. = 0
  Prompt 2: M = −1.25, SD = 0.79, Min. = −2, Max. = 0
  Prompt 3: M = −0.30, SD = 0.57, Min. = −2, Max. = 0
Table 4. Quality of novice, expert, and LLM feedback.
Concreteness: Assessment criteria
  Peers: M = 0.63, SD = 0.81, Min. = 0, Max. = 2
  Experts: M = 1.64, SD = 0.51, Min. = 1, Max. = 2
  ChatGPT-4: M = 1.97, SD = 0.18, Min. = 1, Max. = 2
Concreteness: Explanation
  Peers: M = 0.10, SD = 0.31, Min. = 0, Max. = 1
  Experts: M = 0.55, SD = 0.52, Min. = 0, Max. = 1
  ChatGPT-4: M = 1.00, SD = 0.26, Min. = 0, Max. = 2
Empathy: First person
  Peers: M = 1.10, SD = 0.71, Min. = 0, Max. = 2
  Experts: M = 1.18, SD = 0.60, Min. = 0, Max. = 2
  ChatGPT-4: M = 1.10, SD = 0.76, Min. = 0, Max. = 2
Empathy: Valence
  Peers: M = 1.10, SD = 0.30, Min. = 1, Max. = 2
  Experts: M = 1.25, SD = 0.50, Min. = 1, Max. = 2
  ChatGPT-4: M = 1.00, SD = 0.39, Min. = 0, Max. = 2
Activation: Questions
  Peers: M = 0.17, SD = 0.38, Min. = 0, Max. = 1
  Experts: M = 1.36, SD = 0.81, Min. = 0, Max. = 2
  ChatGPT-4: M = 1.86, SD = 0.44, Min. = 0, Max. = 2
Activation: Presence of suggestions for improvement
  Peers: M = 0.87, SD = 0.82, Min. = 0, Max. = 2
  Experts: M = 1.73, SD = 0.47, Min. = 1, Max. = 2
  ChatGPT-4: M = 1.57, SD = 0.50, Min. = 1, Max. = 2
Activation: Explanation of suggestions
  Peers: M = 0.30, SD = 0.54, Min. = 0, Max. = 2
  Experts: M = 0.82, SD = 0.60, Min. = 0, Max. = 2
  ChatGPT-4: M = 1.13, SD = 0.35, Min. = 1, Max. = 2
Correctness: Specificity
  Peers: M = 0.17, SD = 0.38, Min. = 0, Max. = 1
  Experts: M = 0.64, SD = 0.67, Min. = 0, Max. = 2
  ChatGPT-4: M = 1.60, SD = 0.56, Min. = 0, Max. = 2
Correctness: Errors
  Peers: M = −0.73, SD = 0.87, Min. = −2, Max. = 0
  Experts: M = −0.18, SD = 0.60, Min. = −2, Max. = 0
  ChatGPT-4: M = −0.17, SD = 0.46, Min. = −2, Max. = 0
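Tables 3 and 4 report means, standard deviations, minima, and maxima per category and feedback source. As a small, hypothetical sketch (not the authors’ analysis code), such a summary can be produced from long-format coding data with pandas; the data frame below is invented for illustration.

import pandas as pd

# Hypothetical long-format coding data: one row per feedback text, with the code
# assigned in each category (codes for "Errors" range from -2 to 0).
df = pd.DataFrame({
    "source": ["Peers", "Peers", "Experts", "Experts", "ChatGPT-4", "ChatGPT-4"],
    "specificity": [0, 1, 1, 0, 2, 1],
    "errors": [-1, 0, 0, -1, 0, 0],
})

# M, SD, Min., Max. per feedback source, mirroring the layout of Tables 3 and 4.
summary = (
    df.groupby("source")[["specificity", "errors"]]
      .agg(["mean", "std", "min", "max"])
      .round(2)
)
print(summary)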