Article

Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy

1 Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea
2 Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(2), 671; https://doi.org/10.3390/app15020671
Submission received: 10 December 2024 / Revised: 1 January 2025 / Accepted: 9 January 2025 / Published: 11 January 2025
(This article belongs to the Special Issue Intelligent Systems and Tools for Education)

Abstract

The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.

1. Introduction

Recent advancements in large language models (LLMs) are bringing about significant changes in the field of education. For instance, LLMs are playing crucial roles in various educational applications such as content generation [1], summarization [2], feedback generation [3], and assessment [4]. These developments are contributing to reducing teachers’ workload, enhancing the efficiency of online education, and providing personalized feedback to students.
LLMs are increasingly being utilized in the field of education, in some cases even replacing the role of teachers [5,6]. In particular, as LLM-generated feedback is being used in tutoring systems between teachers and students, its importance in educational environments is growing significantly. However, despite LLMs demonstrating excellent performance in generating natural and contextually appropriate responses, it remains unclear whether these models can provide consistent evaluation results in educational settings. Various factors, such as position bias and differences in generation results depending on prompts, can influence evaluation outcomes [7,8], raising the possibility that LLM evaluations could lack consistency. If LLMs lack consistency as evaluators, the objectivity and reliability of their assessment results will be undermined. Therefore, this study aims to verify the consistency and reliability of feedback evaluation when utilizing LLMs as evaluators in the field of education. Figure 1 visually represents an overview of this research.
We proposed five categories as evaluation criteria, drawing from standards suggested in the educational field for assessing teacher feedback. Based on these criteria, we aimed to verify the consistency and reliability of using LLMs as evaluators by automatically assessing LLM-generated feedback. We utilized state-of-the-art open-source language models and commercial language models, evaluating feedback through various prompt strategies and measuring the agreement between evaluation results. Furthermore, we comprehensively analyzed the quality and consistency of feedback provided by LLMs by employing diverse models as both feedback generators and evaluators. Lastly, through human evaluation, we compared the assessment criteria that humans find challenging with the evaluation results of LLMs to assess how consistently LLMs provide feedback in comparison to human evaluators.
The findings from our analysis are as follows:
  • For criteria with low agreement among human evaluators, the reliability of LLM evaluators also tended to be low.
  • Inconsistent responses were observed from LLMs, even when identical prompts were provided.
  • Low inter-agreement among different LLMs indicated a lack of consistency in evaluations across models.
  • Models that are large-scale or belong to the Llama family tended to generate responses with human-comparable consistency for some evaluation criteria.
The remainder of this paper is structured as follows. Section 2 introduces related research, and Section 3 explains the proposed method for feedback generation and evaluation. Section 4 covers the experimental design, while Section 5 includes the experimental results and their analysis. Finally, Section 6 presents the conclusion of this paper.

2. Related Works

The growing integration of LLMs in education and their application as evaluators has spurred significant advancements and challenges in the field. This section provides an overview of how LLMs are transforming educational applications and their emerging role as evaluators.

2.1. Large Language Models in Education

The emergence of LLMs has significantly advanced the field of education [9,10]. The integration of LLMs with educational technology spans a broad spectrum of applications, including automated short answer grading [11], automated essay scoring [12], automatic question generation [13], multiple-choice question generation [14,15], and chatbots [16].
In particular, the potential for application in intelligent tutoring systems is gaining attention, where providing immediate and appropriate feedback to students is crucial. Such feedback has a positive impact on improving students’ cognitive, emotional, and motivational outcomes [17]. In the field of programming, research has been conducted focusing on generating feedback for student submissions [3]. Furthermore, studies on automated feedback systems to provide better feedback to students have been actively pursued [18]. Moreover, by providing personalized feedback based on the teacher’s language style, these systems are reducing teachers’ workload and enhancing the efficiency of online education [19].
Furthermore, LLMs are being extensively utilized not only for generating feedback but also for evaluating the quality and effectiveness of provided feedback. Ref. [4] examined the ability of generative AI to provide specific feedback that aligns with various educational objectives. For instance, research has been conducted on generating feedback for programming assignments and evaluating its quality [20], and in the field of mathematics, a framework for feedback generation and evaluation has been proposed [21]. As such, LLMs are demonstrating their usefulness and potential across various academic disciplines.
The integration of AI agents into educational applications has gained increasing attention, particularly focusing on leveraging multi-agent systems to enhance the quality of automated feedback and educational processes. Studies have proposed multi-agent frameworks where one agent generates feedback, and another validates or refines it, addressing common challenges such as over-praise and over-inference [22]. In the context of coding assignment grading, multi-agent systems have been introduced to mimic human-like grading nuances through agent interactions, enabling more reliable automated assessment [23]. Furthermore, frameworks like AutoGen [24] have demonstrated the versatility of multi-agent architectures, where agents communicate and collaborate to perform tasks across various domains, such as mathematics and coding. Similarly, SimClass [25] proposed a multi-agent-based classroom simulation framework designed to enhance learning outcomes by fostering teacher–student and student–student collaboration. These advancements not only address the limitations of traditional LLM-based applications but also demonstrate the transformative potential of multi-agent systems in educational environments, paving the way for scalable and collaborative solutions.

2.2. Large Language Models as Evaluators

With the advancement of LLMs, the concept of ‘LLMs-as-evaluators’ or ‘LLMs-as-judges’, where LLMs are used to evaluate the output of other LLMs, is being widely researched [21,26,27,28,29,30]. Recent studies have proposed an evaluation method that moves beyond relying on a single LLM, employing various structures and multiple LLMs to assess model output quality [31,32,33,34].
LLMs-as-evaluators can be utilized even in the absence of reference outputs and are considered a cost-effective alternative to human evaluation. However, there are several limitations to using LLMs-as-evaluators. Judgments by LLMs can be vulnerable to position bias (e.g., preference for the first response), knowledge bias (e.g., lack of necessary knowledge), and format bias (e.g., preference for how judgments are presented to the LLM) [7,8]. These issues can lead to debates concerning the consistency and reliability of LLMs-as-evaluators.
While LLMs serve as a useful alternative to human evaluation, they remain controversial due to potential biases and variability in their decision-making processes. In this study, we explored the consistency and reliability of various open-source and closed-source LLMs as feedback evaluators.

3. Automatic Feedback Generation and Evaluation

3.1. Student Correct and Incorrect Response Collection

We utilized the MCTest dataset [35] to construct an intelligent tutoring system between teachers and students. MCTest is a collection of stories and associated multiple-choice questions designed to facilitate research in the area of machine text comprehension. It serves as a foundation for simulating the types of questions teachers might ask students in an intelligent tutoring system.
The dataset comprises two subsets, MC160 and MC500, each tailored for elementary school students but varying in scope and complexity. MC160 includes 160 stories suitable for grades 1 to 4, while MC500 expands this to 500 stories and incorporates enhancements such as automation and error corrections. Table 1 provides a statistical overview of the dataset.
To simulate incorrect answer situations that may occur in real educational environments, we directly constructed datasets of students’ correct and incorrect responses. Correct responses were taken from the ‘answer’ field provided in the MCTest dataset, while incorrect responses were randomly selected from the remaining options in the ‘answer_options’ field, excluding the correct answer. These responses were used by the LLM to generate feedback. Figure 2 shows a sample of this dataset.
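As an illustration of this construction step (not code from the original study), the following minimal Python sketch pairs each question's correct answer with a randomly sampled distractor; the field names follow the 'answer' and 'answer_options' fields described above, while the function name and the sample record are hypothetical.

```python
import random

def build_response_pair(item: dict, rng: random.Random) -> dict:
    """Pair the correct answer with one randomly chosen distractor for a question.

    Assumes each item exposes 'question', 'answer', and 'answer_options' fields,
    as described above; other MCTest exports may use different field names.
    """
    correct = item["answer"]
    # Incorrect response: a random option other than the correct answer.
    distractors = [opt for opt in item["answer_options"] if opt != correct]
    incorrect = rng.choice(distractors)
    return {
        "question": item["question"],
        "correct_response": correct,
        "incorrect_response": incorrect,
    }

# Hypothetical record used only to show the expected shape of the data.
sample = {
    "question": "What did Rudy think was the best gift?",
    "answer": "the red ball",
    "answer_options": ["the red ball", "a bone", "a hat", "a book"],
}
print(build_response_pair(sample, random.Random(42)))
```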

3.2. Automatic Feedback Generation

In this section, we provide a detailed explanation of the feedback generation process. Figure 3 represents an overview of this process.

3.2.1. Feedback Generation Criteria

The story (S), question (Q), student’s incorrect response (IR), and student’s correct response (CR) are used to generate teacher feedback. The feedback should be accurate and guide the student towards the correct answer. It should also include an explanation of any misunderstandings while encouraging the student in a supportive and educational manner. We organized these criteria into five main categories, which were inspired by previous studies [4,21,36]. Figure 4 shows a comparison of the feedback criteria proposed in this study and in previous studies.
  • Correct (COR.): The teacher’s feedback is expected to contain no incorrect statements and to be relevant to the current question and the student’s response.
  • Revealing (REV.): The teacher’s feedback should not directly reveal the correct answer to the student.
  • Guidance (GUI.): The teacher’s feedback should provide guidance that helps the student move towards the correct answer.
  • Diagnostic (DIA.): The teacher’s feedback should accurately identify and address any misunderstandings or errors in the student’s response.
  • Encouragement (ENC.): The teacher’s feedback should maintain a positive or encouraging tone.

3.2.2. Prompt Strategy

We designed two types of prompts for feedback generation in this study:
  • Without (w/o) criteria: This prompt excludes information about the five feedback criteria when generating teachers’ feedback.
  • With (w/) criteria: This prompt includes information about the five feedback criteria when generating teachers’ feedback.
Examples of the prompts can be seen in Figure 5. All feedback was restricted to be generated as a single sentence.
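As a concrete illustration of the two strategies, the following minimal Python sketch assembles both prompt variants. The criterion descriptions and the prompt wording are paraphrases for illustration only and do not reproduce the exact prompts shown in Figure 5.

```python
# Illustrative criterion descriptions paraphrased from Section 3.2.1.
FEEDBACK_CRITERIA = {
    "Correct": "contains no incorrect statements and is relevant to the question and response",
    "Revealing": "does not directly reveal the correct answer",
    "Guidance": "guides the student towards the correct answer",
    "Diagnostic": "identifies and addresses the misunderstanding in the student's response",
    "Encouragement": "maintains a positive or encouraging tone",
}

def build_generation_prompt(story, question, incorrect, correct, with_criteria=True):
    """Assemble a w/ criteria or w/o criteria feedback-generation prompt (wording is illustrative)."""
    lines = [
        "You are a teacher. Write exactly ONE sentence of feedback for the student.",
        f"Story: {story}",
        f"Question: {question}",
        f"Student's incorrect response: {incorrect}",
        f"Correct response: {correct}",
    ]
    if with_criteria:
        lines.append("The feedback must satisfy the following criteria:")
        lines.extend(f"- {name}: {desc}" for name, desc in FEEDBACK_CRITERIA.items())
    return "\n".join(lines)

prompt = build_generation_prompt("<story text>", "<question>", "<wrong option>", "<answer>", with_criteria=False)
```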

3.3. Automatic Feedback Evaluation

3.3.1. Feedback Evaluation Criteria

In this study, we aimed to evaluate how well the teacher feedback generated through LLMs aligns with educational objectives. To achieve this, we designed a scoring rubric that assesses feedback on five aspects, applying the same criteria used in feedback generation. Each evaluation criterion is represented by a binary-valued label, with 1 (True) awarded if the criterion is met and 0 (False) if it is not.
Previous studies have evaluated feedback using various methods, such as a 5-point scale [4] or a binary scale [20,21]. However, in this study, we adopted a binary evaluation method for several reasons. In preliminary experiments, when LLMs were instructed to evaluate feedback using a 5-point scale, they exhibited inconsistent adherence to the criteria. Specifically, the evaluations frequently skewed toward extreme scores (e.g., 1 or 5), while the use of intermediate scores led to a lack of consistent scoring patterns. This variability suggested that increasing the granularity of the scoring scale did not enhance sensitivity to subtle differences in feedback quality; rather, it exacerbated inconsistencies in the evaluation results.
In contrast, the binary evaluation method provides a clear and straightforward framework for determining whether specific elements of feedback meet predefined criteria. This simplicity not only enhances the consistency of LLM evaluations but also allows for a more systematic verification of the reliability of LLM-generated feedback.
Future research will focus on improving LLMs’ ability to accurately and consistently utilize more complex scoring systems, such as the 5-point scale. By addressing these challenges, we aim to expand the applicability of multi-point scoring systems and improve the precision of feedback quality assessments.
As illustrated in Figure 3, when a prompt containing the story (S) and the generated teacher feedback (F) is provided as input, the feedback evaluation model generates R evaluation results ($E_1$ to $E_R$) based on that prompt. Examples of feedback generation and evaluation can be found in Table A3 and Table A4.
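Because each criterion is a binary label, an evaluator's raw answer has to be mapped to 0/1 values before agreement can be computed. The sketch below shows one way to do this, assuming the evaluator is instructed to answer each criterion with True/False or 1/0; this output format is an assumption made for illustration, not the exact protocol used in the study.

```python
import re

CRITERIA = ["COR", "REV", "GUI", "DIA", "ENC"]

def parse_binary_evaluation(raw_output: str) -> dict:
    """Map an evaluator's free-text answer to one 0/1 label per criterion (format assumed)."""
    labels = {}
    for crit in CRITERIA:
        match = re.search(rf"{crit}\W*(True|False|1|0)", raw_output, re.IGNORECASE)
        if match is None:
            labels[crit] = None  # a missing answer can be re-queried or discarded
        else:
            labels[crit] = 1 if match.group(1).lower() in ("true", "1") else 0
    return labels

print(parse_binary_evaluation("COR: True, REV: True, GUI: False, DIA: 0, ENC: 1"))
# {'COR': 1, 'REV': 1, 'GUI': 0, 'DIA': 0, 'ENC': 1}
```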

3.3.2. Prompt Strategy

We designed two types of feedback evaluation prompts:
  • Grouped: This prompt includes all feedback evaluation criteria within a single prompt.
  • Individual: This prompt includes information about only one feedback evaluation criterion at a time and is issued independently a total of five times (once for each of the five feedback evaluation criteria).
Examples of the prompts can be seen in Figure 6.
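The difference between the two strategies can be summarized in a short sketch: the grouped variant issues one query covering all five criteria, whereas the individual variant issues five independent queries. The prompt wording below is illustrative and does not reproduce Figure 6 verbatim.

```python
def build_evaluation_prompts(story, feedback, criteria, strategy="grouped"):
    """Return one prompt (grouped) or five prompts (individual); wording is illustrative."""
    header = (
        "You are evaluating a teacher's feedback. For each criterion, answer 1 if it is met and 0 otherwise.\n"
        f"Story: {story}\n"
        f"Teacher feedback: {feedback}\n"
    )
    if strategy == "grouped":
        body = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
        return [header + "Criteria:\n" + body]
    # Individual strategy: one independent prompt per criterion.
    return [header + f"Criterion ({name}): {desc}" for name, desc in criteria.items()]

criteria = {"Correct": "no incorrect statements", "Revealing": "does not reveal the answer",
            "Guidance": "guides towards the answer", "Diagnostic": "addresses the misunderstanding",
            "Encouragement": "positive tone"}
grouped_prompts = build_evaluation_prompts("<story>", "<feedback>", criteria, "grouped")        # 1 prompt
individual_prompts = build_evaluation_prompts("<story>", "<feedback>", criteria, "individual")  # 5 prompts
```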

4. Experimental Setup

4.1. Models

  • Feedback Generator 
For feedback generation, we utilized two closed-source models, OpenAI’s GPT-4 [37] and Anthropic’s Claude-3, along with an open-source model, Llama-3.1-70B [38]. The selection of these models was based on their state-of-the-art performance in natural language generation tasks and their ability to produce coherent, contextually relevant feedback in educational settings. GPT-4 and Claude-3 were chosen for their robust handling of diverse prompts and complex queries, while Llama-3.1-70B was selected to evaluate the potential of open-source models in matching or complementing the capabilities of proprietary models. Table 2 represents the model names and API versions. 
  • Feedback Evaluator 
For feedback evaluation, we considered four open-source models: Llama-3.1-8B [38], Llama-3.1-70B, Gemma-9B [39], and Gemma-27B [39]. Additionally, we used GPT-4o as a closed-source model. While Claude was utilized as a feedback generator, it was excluded as an evaluator due to its API rate limits and token constraints, which made it less feasible for large-scale and iterative evaluation tasks compared to the more flexible OpenAI API. Table 3 represents the model names and API versions.

4.2. Implementation Details

For our experiment, we queried closed-source models using the OpenAI and Anthropic Python libraries, while all open-source models were queried through the HuggingFace [40] pipeline. For Llama-3.1-70B, due to its size, we employed 4-bit quantization with low-rank adaptation (LoRA) [41]. We standardized the temperature setting across all models to 1.0. All open-source models were run on a single NVIDIA RTX A6000 GPU.
To ensure a fair evaluation, each LLM was executed five times. This repetition helps to mitigate any anomalies and ensures that our results are robust and reliable, providing a comprehensive assessment of each model’s performance. The costs associated with API calls for closed-source models used in both feedback generation and feedback evaluation can be found in Table A1 and Table A2 (see Appendix A).
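A minimal sketch of this querying setup is shown below, assuming the OpenAI Python client for the closed-source model and the HuggingFace transformers pipeline with 4-bit quantization for the open-source models. The model identifiers, token limit, and helper names are illustrative and do not reproduce the exact configuration used in the experiments.

```python
from openai import OpenAI
from transformers import BitsAndBytesConfig, pipeline

N_RUNS = 5  # each LLM is executed five times per item, as described above

def query_openai(prompt: str, model: str = "gpt-4o") -> list[str]:
    """Query a closed-source model N_RUNS times at temperature 1.0 (reads OPENAI_API_KEY)."""
    client = OpenAI()
    outputs = []
    for _ in range(N_RUNS):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

def query_open_source(prompt: str, model_id: str = "meta-llama/Llama-3.1-70B-Instruct") -> list[str]:
    """Query an open-source model N_RUNS times via the HuggingFace pipeline with 4-bit loading."""
    generator = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
        device_map="auto",
    )
    return [
        generator(prompt, max_new_tokens=256, do_sample=True, temperature=1.0)[0]["generated_text"]
        for _ in range(N_RUNS)
    ]
```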

4.3. Evaluation Metrics

In our LLMs-as-evaluators experiment, we employed two metrics to measure agreement between responses: inner agreement and inter-agreement.
Inner agreement is a metric that measures the consistency of responses when a single LLM generates multiple answers, allowing us to evaluate the reliability of the evaluator. An overview of the calculation process can be found in Figure 7a, and the pseudocode is detailed in Algorithm A1.
A single LLM evaluator ($E$) generates $R$ responses ($E_1, \ldots, E_R$). If all responses $E_1$ to $E_R$ yield the same result, a score of 1 is assigned; otherwise, a score of 0 is given. In other words, a score of 1 is assigned only in the case of unanimous agreement. The formula for calculating inner agreement is as follows:
$$\text{Inner-agreement} = \begin{cases} 1 & \text{if } \sum_{r=1}^{R} E_r = 0 \ \text{or} \ \sum_{r=1}^{R} E_r = R, \\ 0 & \text{otherwise}. \end{cases} \quad (1)$$
The inter-agreement measures the agreement between responses from different evaluators. This metric demonstrates how similarly various evaluators judge the same query. A high inter-agreement signifies that diverse evaluators provide stable and consistent results. An overview of the entire calculation process can be found in Figure 7b, and the pseudocode is detailed in Algorithm A2.
Suppose there are $M$ feedback evaluators ($E^1, \ldots, E^M$), and each evaluator $m$ generates $R$ responses ($E_1^m, \ldots, E_R^m$). For each evaluator, majority voting is performed on the $R$ responses, yielding $MV_m$. Then, if all majority voting results from $MV_1$ to $MV_M$ are the same, a score of 1 is assigned; otherwise, a score of 0 is given. That is, similar to inner agreement, a score of 1 is assigned only in the case of unanimous agreement. The formula for calculating inter-agreement is as follows:
$$\text{Inter-agreement} = \begin{cases} 1 & \text{if } \sum_{m=1}^{M} MV_m = 0 \ \text{or} \ \sum_{m=1}^{M} MV_m = M, \\ 0 & \text{otherwise}. \end{cases} \quad (2)$$
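Both metrics are defined per item and averaged over all N items, as in Algorithms A1 and A2 (Appendix B). The following Python sketch implements this reading; the data layout (nested lists of binary labels) is an assumption made for illustration.

```python
from collections import Counter
from statistics import mean

def inner_agreement(responses_per_item: list[list[int]]) -> float:
    """Fraction of items on which a single evaluator's R responses are unanimous."""
    return mean(1.0 if len(set(resps)) == 1 else 0.0 for resps in responses_per_item)

def majority_vote(responses: list[int]) -> int:
    """Most common label among an evaluator's R responses for one item."""
    return Counter(responses).most_common(1)[0][0]

def inter_agreement(items: list[list[list[int]]]) -> float:
    """Fraction of items on which the per-evaluator majority votes are unanimous.

    Input shape: items x evaluators x R responses (binary labels).
    """
    scores = []
    for per_evaluator_responses in items:
        votes = [majority_vote(resps) for resps in per_evaluator_responses]
        scores.append(1.0 if len(set(votes)) == 1 else 0.0)
    return mean(scores)

# Toy example with binary labels.
evaluator_A = [[1, 1, 1], [0, 0, 1]]            # 2 items, R = 3 responses each
print(inner_agreement(evaluator_A))              # 0.5: unanimous on the first item only
all_evaluators = [
    [[1, 1, 1], [1, 1, 0], [1, 1, 1]],          # item 1: majority votes 1, 1, 1 -> agreement
    [[0, 0, 1], [1, 1, 1], [0, 0, 0]],          # item 2: majority votes 0, 1, 0 -> no agreement
]
print(inter_agreement(all_evaluators))           # 0.5
```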

4.4. Human Evaluation Setup

In this study, we conducted a human evaluation to compare the assessment results of LLMs with those of human evaluators, aiming to validate the reliability of the LLM evaluators’ judgments. The evaluation involved five human annotators; each human evaluator was instructed to assess the five feedback criteria (Correct, Revealing, Guidance, Diagnostic, Encouragement) using binary-valued labels, in the same manner as the LLM evaluators. Additionally, the annotators were asked to provide a confidence score ranging from 1 (low) to 5 (high) for each criterion they evaluated. This approach allows for additional analysis that considers the confidence levels of the evaluators.
For the evaluation, 50 samples were randomly selected from each dataset. Additionally, the evaluators were tasked with assessing the feedback generated by three feedback generators: GPT-4o, Claude-3, and Llama-3.1-70B. Details about the human evaluation can be found in Appendix C.

5. Experimental Results and Analysis

5.1. Human Agreement

All human annotators evaluated the feedback generated by each feedback generator according to the five feedback criteria (Correct, Revealing, Guidance, Diagnostic, Encouragement). Figure 8 presents the average human agreement on the MC160 for two feedback generation prompt strategies (w/ criteria and w/o criteria).
Overall, all human annotators showed high agreement in the Correct and Revealing categories. Among the feedback generation models, Claude-3 demonstrated high agreement. Notably, in the case of ‘w/ criteria’, where feedback evaluation criteria were clearly included in the prompt, high agreement was also observed in the Guidance category. However, even human annotators experienced difficulties in evaluation, showing low agreement in the Diagnostic and Encouragement categories.
As shown in Figure 9, evaluation criteria with low agreement among human annotators, such as Diagnostic and Encouragement, also showed low evaluator confidence scores. This indicates that the Diagnostic and Encouragement categories confused the majority of human annotators, making it difficult to reach correct assessments. The graph for the w/o criteria can be found in Figure A2. A similar trend was observed in the MC500 dataset, as illustrated in Figure A1, Figure A3, and Figure A4 (see Appendix D for details).

5.2. LLM Inner Agreement

In this section, we analyzed the consistency of LLM responses by measuring the inner agreement of each feedback evaluator. Figure 10 presents the results for the MC160 dataset when GPT-4o was used as the feedback generation model, with the generation prompt strategy set to w/ criteria. The top graph (Figure 10a) shows the results for the grouped evaluation prompt strategy, while the bottom graph (Figure 10b) depicts the results for the individual strategy.
Overall, in the Correct and Revealing criteria, high inner agreement was observed across all feedback evaluators, except for Llama-3.1-8B. This indicates that most LLMs performed consistent evaluations when provided with the same prompts. In contrast, the Guidance, Diagnostic, and Encouragement criteria exhibited relatively lower agreement among the feedback evaluators. This suggests that these criteria posed challenges for the LLMs in generating consistent responses. Additionally, differences in agreement were observed more prominently based on model families rather than model sizes, with the Gemma family models showing higher agreement compared to the Llama family. GPT-4o demonstrated agreement levels that fall between the Llama and Gemma families.

5.3. LLM Inter-Agreement

In this section, we analyzed the inter-agreement by measuring the level of agreement among different feedback evaluators. Through inter-agreement, we can determine how consistently evaluations were made among different models. Figure 11 presents the results for the MC160 dataset with the generation prompt strategy set to w/ criteria. The key findings are as follows.
Both the Grouped and Individual prompt strategies show high agreement in the Correct and Revealing criteria. This trend is similar to the inner agreement observed in Section 5.2. In contrast, the Guidance, Diagnostic, and Encouragement criteria exhibit low agreement, with Diagnostic showing particularly low agreement across all models and prompt strategies. This indicates that certain criteria are challenging for models to evaluate consistently. Although the performance of Guidance and Encouragement varies depending on the prompt strategy, they still exhibit low agreement.
Overall, high agreement is observed for the Correct and Revealing criteria, indicating that most models can provide consistent evaluations. However, for the remaining criteria, there is a significant drop in consistency among the models. The inconsistency in responses generated by the models suggests that further improvements are necessary in these criteria.

5.4. Comparison Between Human and LLM Evaluations

LLMs showed consistent results for certain criteria such as Correct and Revealing; however, from a comprehensive perspective, there are still issues with consistency and reliability. In this section, we conducted an analysis comparing human evaluation results with those from LLMs. We verified the reliability of the ensemble LLM evaluation by assuming the majority vote results from human annotators as the true label. We then compared this true label with the accuracy of the predicted label, which was determined by the majority vote of LLM evaluators.
We proposed two majority voting methods (a sketch of both appears after this list):
  • Internal (Macro) majority voting: This method involves performing majority voting within each LLM and then combining the majority results from multiple LLMs to make a final decision.
  • Global (Micro) majority voting: This method pools the responses from all M evaluators (R responses each) and applies a single majority vote across the pooled set to make a final decision.
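A minimal sketch of the two voting schemes, assuming binary labels and a fixed number of responses per evaluator for one item and criterion, is given below; the function names and toy data are illustrative.

```python
from collections import Counter
from typing import List

def internal_majority_vote(responses_per_evaluator: List[List[int]]) -> int:
    """Internal (macro) voting: majority within each LLM first, then majority across the LLMs."""
    per_llm_votes = [Counter(r).most_common(1)[0][0] for r in responses_per_evaluator]
    return Counter(per_llm_votes).most_common(1)[0][0]

def global_majority_vote(responses_per_evaluator: List[List[int]]) -> int:
    """Global (micro) voting: pool every response from every LLM and take one majority vote."""
    pooled = [label for responses in responses_per_evaluator for label in responses]
    return Counter(pooled).most_common(1)[0][0]

# Toy example: 3 evaluators with 5 binary responses each for one item and criterion.
item = [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
print(internal_majority_vote(item))  # per-LLM votes 0, 1, 1 -> 1
print(global_majority_vote(item))    # pooled: nine 1s vs six 0s -> 1
```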
Figure 12 presents the accuracy comparison between human evaluations and LLM evaluations on the MC160 dataset with the generation prompt strategy set to w/ criteria. Overall, regardless of the majority voting method or the type of feedback generator model used, high accuracy was observed for the Correct, Revealing, and Guidance criteria. This indicates alignment between the majority vote results of human annotators and those of LLM evaluators. In contrast, lower accuracy was noted for the Diagnostic and Encouragement criteria, suggesting that both humans and LLMs struggle to provide consistent responses for these specific feedback criteria, highlighting the need for further research to address these challenges.
Of particular note is the discrepancy between the high agreement among human annotators for the Guidance criterion, as shown in Figure 8a, and the very low inter-agreement observed in Figure 11a. Despite this contradictory result, the accuracy for Guidance in Figure 12 remains high across models, indicating the need for additional research.
Table 4 presents the results of majority voting based on LLM-generated feedback evaluations on the MC160 dataset. Among the five feedback evaluators, Correct and Revealing scored in the high 4-point range on average, while Guidance and Encouragement averaged in the high 3-point range, and Diagnostic scored in the low 3-point range. When considering the average voting counts, agreement among LLM evaluators appeared low. However, after performing majority voting, the results were similar to the majority vote of human annotators, leading to a high overall accuracy.

5.5. Analysis of Score Distribution in Human and LLM Evaluations

In this section, we analyzed the score distribution between human evaluation and LLM evaluation. Here, the score is defined as the total sum of the evaluations, which comprehensively reflects the five aspects of the feedback. Each evaluation criterion is scored as either 0 or 1, and the score is calculated as follows:
Score = COR + REV + GUI + DIA + ENC    (3)
Figure 13 illustrates the score distribution of human annotators and various LLM evaluators on the MC160 dataset. Regardless of the evaluation prompt strategy, human annotators consistently scored within a high range (approximately 4.0 to 5.0) across all feedback generators, indicating that the generated feedback met an average of at least four out of five evaluation criteria. However, the score distribution of human annotators exhibited greater variability compared to those of the LLM evaluators, suggesting that not all generated feedback met the feedback criteria equally well. For Llama-3.1-8B, it consistently recorded the lowest scores in all cases and exhibited the highest variability among the LLM evaluators. This trend is similar to the lower agreement observed in the inner agreement analysis (refer to Figure 10).
When analyzing LLM evaluators divided into three groups (GPT-4o, Llama family, and Gemma family), the Gemma family demonstrated a narrow distribution range, showing stable and high agreement. In the Llama and Gemma families, as the model size increased, the score distribution tended to narrow and become more stable, and an upward trend in scores was also observed.
Figure 14 represents the score distribution when evaluation criteria information was not included in the prompt during feedback generation. In this case, the generated feedback often failed to meet the actual criteria, resulting in a wider score distribution for human annotators, ranging from 2.5 to 4.0. However, the score distribution for LLM evaluators showed a pattern similar to that of w/ criteria, suggesting that there may have been instances where items that should have received a score of 0 were instead given a score of 1. This indicates that LLMs tend to assign a score of 1 regardless of the evaluation criteria. A similar trend was observed in the MC500 dataset, as shown in Figure A22 and Figure A23.

5.6. Analysis of Score Distribution Based on the Number of Generated Feedback Sentences

In the previous experiments, evaluations were conducted using only a single sentence of feedback generated by the feedback generator. In this section, we expand the number of generated feedback sentences to five, aiming to analyze the performance variance in evaluation results. The goal is to assess the impact of the number of feedback sentences on the evaluation scores. The score was calculated using the same method as in Equation (3).
This experiment was conducted using the MC160 dataset, focusing solely on the GPT-4o model as the feedback generator. As shown in Figure 15, when the number of feedback sentences was increased to five, there was minimal variance in the scores. This suggests that the GPT-4o model is capable of maintaining consistent evaluation results even as the number of feedback sentences increases. Results for other prompt strategies can be found in Figure A24 (see Appendix D for details).
Therefore, considering the minimal impact of the number of feedback sentences on evaluation results, it can be concluded that there is relatively little performance variance between single feedback sentence evaluations and multiple feedback sentence evaluations. These findings suggest that, in situations where similar score distributions are observed, it is not necessary to repeatedly generate multiple feedback instances to achieve effective evaluation. This approach can provide significant advantages in terms of cost reduction and efficiency during the generation process.

6. Conclusions

This study examined the consistency and reliability of feedback evaluations in the educational domain using LLMs. We evaluated teacher feedback in the tutoring system between teachers and students across five criteria using various LLMs. The aim was to analyze the strengths and limitations of LLMs when acting as evaluators. The results revealed that for criteria where there was low agreement among human annotators, LLM evaluations also tended to lack consistency. Moreover, instances were observed where LLMs generated inconsistent responses even when provided with identical prompts. However, some models demonstrated a tendency to maintain high consistency under specific conditions.
These findings suggest a cautious approach when using LLMs as educational evaluation tools, as the consistency and reliability of results can vary significantly depending on the feedback criteria and LLM combinations. To improve LLM evaluation performance, exploring various evaluation criteria and model combinations to identify optimal conditions is crucial.
Future research will focus on improving LLMs-as-evaluators by diversifying feedback scores, evaluation criteria, and prompts to enhance response consistency. Tailoring prompts and refining criteria for various educational contexts will help develop LLMs into more reliable evaluation tools.

Author Contributions

Conceptualization, H.S. and S.J.; data curation, H.S.; funding acquisition, S.J.; methodology, H.S.; project administration, H.S., T.H. and J.J.; resources, H.K.; software, H.S.; supervision, S.J., T.H. and J.J.; visualization, H.S. and J.J.; writing—original draft, H.S. and S.J.; writing—review and editing, H.S., T.H., J.J., H.K., H.N. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-00004, Development of semisupervised learning language intelligence technology and Korean tutoring service for foreigners), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1F1A1071047), Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155857, Artificial Intelligence Convergence Innovation Human Resources Development (Chungnam National University)), and research fund of Chungnam National University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MCTest dataset can be found at https://huggingface.co/datasets/sagnikrayc/mctest, accessed on 25 October 2022.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor correction to resolve spelling errors. This change does not affect the scientific content of the article.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Models
COR: Correct
REV: Revealing
GUI: Guidance
DIA: Diagnostic
ENC: Encouragement
w/o criteria: without criteria
w/ criteria: with criteria
MV: Majority Voting

Appendix A. Details of Data Generation

Table A1 and Table A2 display the API costs for closed-source models used in both the feedback generator and feedback evaluator.
Table A1. API costs of feedback generation models.

Generation Prompt Strategy | Model Name | MC160 | MC500
w/o criteria | GPT-4o | $0.64 | $2.87
w/o criteria | Claude-3 | $0.58 | $2.38
w/ criteria | GPT-4o | $0.76 | $2.96
w/ criteria | Claude-3 | $0.67 | $2.80

Table A2. API costs of feedback evaluation models (average cost of five runs).

Generation Prompt Strategy | Evaluation Prompt Strategy | MC160 | MC500
w/o criteria | Grouped | $1.01 | $4.06
w/o criteria | Individual | $3.21 | $13.43
w/ criteria | Grouped | $1.02 | $4.16
w/ criteria | Individual | $3.29 | $13.55

Appendix B. Details of Agreements

This appendix includes the pseudocode for the algorithms used to measure the agreement between LLM responses. Algorithm A1 presents the algorithm for inner agreement, and Algorithm A2 presents the algorithm for inter-agreement.
Algorithm A1 Inner Agreement
Initialize N ← total number of items
Initialize count ← 0
for n = 1 to N do
    responses ← fetch the R responses for item n
    if all elements in responses are equal then
        count ← count + 1
    end if
end for
agreement ← count / N
return agreement
Algorithm A2 Inter-Agreement
Initialize M ← number of LLM evaluators
Initialize N ← total number of items
Initialize votes ← empty list of per-item votes
for n = 1 to N do
    itemVotes ← empty list
    for m = 1 to M do
        responses ← fetch the R responses from evaluator m for item n
        vote ← majority vote over responses
        append vote to itemVotes
    end for
    append itemVotes to votes
end for
Initialize count ← 0
for n = 1 to N do
    if all elements in votes[n] are equal then
        count ← count + 1
    end if
end for
agreement ← count / N
return agreement

Appendix C. Details of Human Evaluations

The participants in the human evaluation were all AI experts, consisting of graduate students. Before conducting the evaluation, all participants received training on the dataset descriptions, feedback criteria, binary-valued labels (0–1 scales), and confidence scores (1–5 scales). The criteria for the confidence scores are as follows:
  • 5: Very confident in the accuracy of the evaluation.
  • 4: Quite confident.
  • 3: Somewhat confident but with some reservations.
  • 2: Not confident.
  • 1: Almost no confidence.
Figure A1 shows the human agreement on the MC500 dataset. Figure A2, Figure A3 and Figure A4 depict the distribution of confidence scores among the human annotators.
Figure A1. Human agreement on the MC500. Five human annotators evaluated the feedback generated by each feedback generator.
Figure A2. Confidence scores for each human annotator under the w/o criteria generation prompt strategy on the MC160.
Figure A3. Confidence scores for each human annotator under the w/ criteria generation prompt strategy on the MC500.
Figure A4. Confidence scores for each human annotator under the w/o criteria generation prompt strategy on the MC500.

Appendix D. Additional Experimental Results

The following are additional experimental results not included in the main text. These results are provided to aid in the overall understanding of the study and include more detailed analyses for each dataset and model. Figure A5, Figure A6, Figure A7, Figure A8 and Figure A9 show the inner agreement results for each feedback generator on the MC160 dataset. Figure A10, Figure A11, Figure A12, Figure A13, Figure A14 and Figure A15 present the inner agreement results on the MC500 dataset.
Figure A16, Figure A17 and Figure A18 represent additional performance results for inter-agreement as discussed in Section 5.3. Figure A19, Figure A20 and Figure A21 show the accuracy between human and LLM evaluations as mentioned in Section 5.4. Figure A22 and Figure A23 display the score distribution results on the MC500 dataset (see Section 5.5). Figure A24 illustrates the score distribution when the feedback generator is GPT-4o with five generated feedback items on the MC160 dataset (as discussed in Section 5.6). Table A3 and Table A4 present examples of feedback generation and evaluation for the MC160 and MC500 datasets, respectively.
Table A5 and Table A6 present the overall inner agreement performance for the MC160 dataset, while Table A7 and Table A8 show the overall inner agreement performance for the MC500 dataset. Similarly, Table A9 and Table A10 illustrate the overall inter-agreement performance for the MC160 and MC500 datasets, respectively.
Table A11 and Table A12 present examples of evaluation results and confidence scores by human annotators on the MC160 dataset, while Table A13 and Table A14 provide examples from the MC500 dataset.
Figure A5. Inner agreement for the generation prompt strategy w/o criteria in the MC160 dataset with GPT-4o as the feedback generator.
Figure A6. Inner agreement for the generation prompt strategy w/ criteria in the MC160 dataset with Claude-3 as the feedback generator.
Figure A7. Inner agreement for the generation prompt strategy w/o criteria in the MC160 dataset with Claude-3 as the feedback generator.
Figure A8. Inner agreement for the generation prompt strategy w/ criteria in the MC160 dataset with Llama-3.1-70B as the feedback generator.
Figure A9. Inner agreement for the generation prompt strategy w/o criteria in the MC160 dataset with Llama-3.1-70B as the feedback generator.
Figure A10. Inner agreement for the generation prompt strategy w/ criteria in the MC500 dataset with GPT-4o as the feedback generator.
Figure A11. Inner agreement for the generation prompt strategy w/o criteria in the MC500 dataset with GPT-4o as the feedback generator.
Figure A12. Inner agreement for the generation prompt strategy w/ criteria in the MC500 dataset with Claude-3 as the feedback generator.
Figure A13. Inner agreement for the generation prompt strategy w/o criteria in the MC500 dataset with Claude-3 as the feedback generator.
Figure A14. Inner agreement for the generation prompt strategy w/ criteria in the MC500 dataset with Llama-3.1-70B as the feedback generator.
Figure A15. Inner agreement for the generation prompt strategy w/o criteria in the MC500 dataset with Llama-3.1-70B as the feedback generator.
Figure A16. Inter-agreement for the generation prompt strategy w/o criteria in the MC160.
Figure A17. Inter-agreement for the generation prompt strategy w/ criteria in the MC500.
Figure A18. Inter-agreement for the generation prompt strategy w/o criteria in the MC500.
Figure A19. Accuracy between human and LLM evaluations for the generation prompt strategy w/o criteria in the MC160. The legend represents the feedback generation models. MV stands for majority voting.
Figure A20. Accuracy between human and LLM evaluations for the generation prompt strategy w/ criteria in the MC500. The legend represents the feedback generation models. MV stands for majority voting.
Figure A21. Accuracy between human and LLM evaluations for the generation prompt strategy w/o criteria in the MC500. The legend represents the feedback generation models. MV stands for majority voting.
Figure A22. Score distribution on the MC500 with the generation prompt strategy using w/ criteria. Each subfigure represents a feedback generation model.
Figure A23. Score distribution on the MC500 with the generation prompt strategy using w/o criteria. Each subfigure represents a feedback generation model.
Figure A24. Score distribution when the generation prompt strategy is w/o criteria and the number of generated feedback sentences is 5, using GPT-4o as the feedback generator on the MC160.
Table A3. Examples of feedback generation and feedback evaluation in the MC160. The values represent majority voting for each evaluation criterion. For human evaluation, the results are based on majority voting among five human annotators. For LLM evaluation, the majority vote is based on five runs for each evaluator.

Feedback Generator: GPT-4o
Feedback: Review the last paragraph carefully to see what Rudy was most excited about in his birthday party.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 0 1 | - - - - -
GPT-4o | 1 1 0 1 0 | 1 1 1 1 0
Llama-3.1-8B | 1 1 0 0 1 | 1 1 0 0 0
Llama-3.1-70B | 1 1 0 0 0 | 0 1 1 0 0
Gemma-9B | 1 1 0 0 1 | 1 1 0 0 0
Gemma-27B | 1 1 1 1 0 | 1 1 1 1 1

Feedback Generator: Claude-3
Feedback: Great attempt, but let’s take another look at the last part of the story to find the gift that made Rudy especially excited.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 1 1 | - - - - -
GPT-4o | 1 1 1 1 1 | 1 1 1 1 1
Llama-3.1-8B | 1 1 0 0 1 | 1 0 0 0 0
Llama-3.1-70B | 1 1 1 1 1 | 1 1 1 1 1
Gemma-9B | 1 1 1 0 1 | 1 1 1 0 1
Gemma-27B | 1 1 1 1 1 | 1 1 1 1 1

Feedback Generator: Llama-3.1-70B
Feedback: Take another look at the story and see what gift made Rudy jump and bark with excitement, as that might be a clue to what he thought was the best gift he had ever been given.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 1 1 | - - - - -
GPT-4o | 1 1 1 1 1 | 1 0 1 1 1
Llama-3.1-8B | 1 1 0 0 0 | 1 0 0 0 0
Llama-3.1-70B | 1 1 1 0 1 | 1 1 1 1 1
Gemma-9B | 1 1 1 0 1 | 1 1 1 1 1
Gemma-27B | 1 1 1 1 1 | 1 1 1 1 1
Table A4. Examples of feedback generation and feedback evaluation in the MC500. The values represent majority voting for each evaluation criterion. For human evaluation, the results are based on majority voting among five human annotators. For LLM evaluation, the majority vote is based on five runs for each evaluator.

Feedback Generator: GPT-4o
Feedback: You identified one sport correctly, but remember that Tom played another sport he wasn’t very good at, which is different from football.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 1 1 | - - - - -
GPT-4o | 1 1 1 1 1 | 1 1 1 1 0
Llama-3.1-8B | 1 1 0 0 1 | 1 0 0 0 0
Llama-3.1-70B | 1 1 1 1 1 | 1 1 1 1 0
Gemma-9B | 1 1 1 0 1 | 1 1 0 0 0
Gemma-27B | 1 1 1 1 0 | 1 1 1 1 1

Feedback Generator: Claude-3
Feedback: Great attempt, but let’s revisit the story and carefully consider which sports were specifically mentioned for Tom.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 1 1 | - - - - -
GPT-4o | 1 1 1 1 1 | 1 1 1 1 1
Llama-3.1-8B | 1 1 0 0 1 | 1 1 0 0 1
Llama-3.1-70B | 1 1 1 1 1 | 1 1 1 1 1
Gemma-9B | 1 1 1 0 1 | 1 1 1 1 1
Gemma-27B | 1 1 1 1 1 | 1 1 1 1 1

Feedback Generator: Llama-3.1-70B
Feedback: Take another look at the story and think about the two sports that Tom is mentioned to be enjoying, one of which he is particularly good at and the other not so much.
Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Human | 1 1 1 1 1 | - - - - -
GPT-4o | 1 1 1 1 0 | 1 1 1 1 0
Llama-3.1-8B | 1 0 0 0 1 | 1 1 0 1 0
Llama-3.1-70B | 1 1 1 0 1 | 1 1 1 1 0
Gemma-9B | 1 1 1 0 1 | 1 1 1 0 1
Gemma-27B | 1 1 1 1 0 | 1 1 1 1 1
Table A5. Inner agreement for the generation prompt strategy w/ criteria in the MC160 dataset.

Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Feedback Generator: GPT-4o
GPT-4o | 0.968 0.989 0.761 0.689 0.593 | 0.882 0.911 0.836 0.707 0.771
Llama-3.1-8B | 0.650 0.546 0.436 0.536 0.318 | 0.614 0.325 0.207 0.450 0.386
Llama-3.1-70B | 0.989 0.943 0.732 0.507 0.554 | 0.768 0.896 0.775 0.500 0.707
Gemma-9B | 0.996 0.989 0.900 0.693 0.729 | 0.993 0.996 0.939 0.904 0.825
Gemma-27B | 0.999 0.999 0.982 0.875 0.893 | 0.989 0.971 0.853 0.896 0.754
Feedback Generator: Claude-3
GPT-4o | 0.986 0.999 0.800 0.629 0.664 | 0.893 0.957 0.868 0.600 0.825
Llama-3.1-8B | 0.814 0.657 0.461 0.550 0.371 | 0.825 0.464 0.125 0.443 0.293
Llama-3.1-70B | 0.993 0.989 0.800 0.379 0.693 | 0.882 0.968 0.761 0.400 0.704
Gemma-9B | 0.996 0.999 0.893 0.732 0.832 | 0.999 0.999 0.946 0.904 0.786
Gemma-27B | 0.999 0.999 0.975 0.868 0.821 | 0.999 0.996 0.861 0.864 0.936
Feedback Generator: Llama-3.1-70B
GPT-4o | 0.982 0.993 0.829 0.604 0.396 | 0.900 0.871 0.904 0.600 0.700
Llama-3.1-8B | 0.850 0.482 0.389 0.364 0.239 | 0.771 0.568 0.221 0.207 0.236
Llama-3.1-70B | 0.999 0.954 0.918 0.393 0.393 | 0.864 0.900 0.854 0.504 0.754
Gemma-9B | 0.996 0.996 0.907 0.796 0.721 | 0.989 0.993 0.946 0.889 0.889
Gemma-27B | 0.999 0.993 0.975 0.839 0.707 | 0.996 0.961 0.932 0.900 0.825
Table A6. Inner agreement for the generation prompt strategy w/o criteria in the MC160 dataset.

Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Feedback Generator: GPT-4o
GPT-4o | 0.875 0.911 0.675 0.446 0.736 | 0.868 0.593 0.718 0.679 0.929
Llama-3.1-8B | 0.636 0.475 0.539 0.421 0.275 | 0.543 0.482 0.207 0.339 0.250
Llama-3.1-70B | 0.925 0.825 0.689 0.539 0.375 | 0.711 0.757 0.657 0.586 0.821
Gemma-9B | 0.996 0.996 0.832 0.671 0.654 | 0.975 0.979 0.904 0.854 0.911
Gemma-27B | 0.986 0.993 0.761 0.804 0.889 | 0.954 0.936 0.768 0.925 0.529
Feedback Generator: Claude-3
GPT-4o | 0.954 0.961 0.639 0.682 0.675 | 0.914 0.632 0.625 0.804 0.914
Llama-3.1-8B | 0.811 0.643 0.575 0.375 0.182 | 0.629 0.582 0.196 0.296 0.239
Llama-3.1-70B | 0.993 0.932 0.721 0.546 0.407 | 0.864 0.857 0.650 0.664 0.868
Gemma-9B | 0.996 0.996 0.793 0.729 0.636 | 0.996 0.999 0.939 0.911 0.886
Gemma-27B | 0.999 0.989 0.639 0.907 0.854 | 0.989 0.971 0.818 0.964 0.693
Feedback Generator: Llama-3.1-70B
GPT-4o | 0.882 0.818 0.721 0.654 0.754 | 0.936 0.789 0.768 0.846 0.932
Llama-3.1-8B | 0.568 0.464 0.496 0.328 0.232 | 0.546 0.443 0.254 0.296 0.254
Llama-3.1-70B | 0.904 0.729 0.746 0.611 0.500 | 0.807 0.829 0.750 0.818 0.889
Gemma-9B | 0.986 0.975 0.850 0.775 0.704 | 0.999 0.961 0.975 0.964 0.914
Gemma-27B | 0.964 0.914 0.843 0.864 0.925 | 0.982 0.861 0.911 0.968 0.675
Table A7. Inner agreement for the generation prompt strategy w/ criteria in the MC500 dataset.

Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Feedback Generator: GPT-4o
GPT-4o | 0.968 0.988 0.703 0.641 0.548 | 0.873 0.883 0.853 0.719 0.843
Llama-3.1-8B | 0.746 0.585 0.488 0.438 0.283 | 0.567 0.458 0.203 0.363 0.297
Llama-3.1-70B | 0.994 0.954 0.780 0.598 0.413 | 0.784 0.932 0.767 0.551 0.784
Gemma-9B | 0.994 0.994 0.888 0.676 0.712 | 0.991 0.997 0.928 0.850 0.907
Gemma-27B | 0.999 0.999 0.963 0.889 0.857 | 0.988 0.971 0.828 0.898 0.699
Feedback Generator: Claude-3
GPT-4o | 0.986 0.997 0.765 0.608 0.598 | 0.902 0.959 0.879 0.543 0.817
Llama-3.1-8B | 0.851 0.648 0.423 0.478 0.355 | 0.741 0.622 0.199 0.307 0.184
Llama-3.1-70B | 0.997 0.988 0.829 0.389 0.628 | 0.894 0.978 0.770 0.453 0.773
Gemma-9B | 0.998 0.998 0.879 0.716 0.798 | 0.991 0.999 0.921 0.907 0.927
Gemma-27B | 0.999 0.999 0.958 0.846 0.779 | 0.996 0.997 0.821 0.858 0.929
Feedback Generator: Llama-3.1-70B
GPT-4o | 0.987 0.985 0.812 0.623 0.423 | 0.933 0.890 0.923 0.653 0.712
Llama-3.1-8B | 0.829 0.540 0.363 0.367 0.244 | 0.750 0.619 0.218 0.202 0.185
Llama-3.1-70B | 0.996 0.955 0.899 0.472 0.403 | 0.863 0.914 0.867 0.492 0.779
Gemma-9B | 0.998 0.998 0.903 0.767 0.718 | 0.992 0.998 0.950 0.888 0.878
Gemma-27B | 0.999 0.999 0.985 0.869 0.720 | 0.996 0.968 0.926 0.921 0.854
Table A8. Inner agreement for the generation prompt strategy w/o criteria in the MC500 dataset.

Feedback Evaluator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Feedback Generator: GPT-4o
GPT-4o | 0.829 0.895 0.715 0.443 0.735 | 0.873 0.668 0.684 0.692 0.933
Llama-3.1-8B | 0.635 0.487 0.574 0.476 0.253 | 0.522 0.433 0.266 0.361 0.275
Llama-3.1-70B | 0.888 0.872 0.716 0.546 0.347 | 0.681 0.747 0.683 0.573 0.828
Gemma-9B | 0.964 0.954 0.866 0.631 0.687 | 0.954 0.978 0.905 0.873 0.895
Gemma-27B | 0.981 0.977 0.751 0.825 0.900 | 0.945 0.929 0.692 0.898 0.532
Feedback Generator: Claude-3
GPT-4o | 0.954 0.948 0.672 0.694 0.735 | 0.920 0.653 0.603 0.832 0.918
Llama-3.1-8B | 0.780 0.633 0.533 0.348 0.203 | 0.650 0.647 0.177 0.243 0.195
Llama-3.1-70B | 0.988 0.948 0.701 0.508 0.371 | 0.841 0.834 0.652 0.706 0.908
Gemma-9B | 0.998 0.993 0.765 0.783 0.684 | 0.992 0.989 0.906 0.913 0.880
Gemma-27B | 0.999 0.995 0.819 0.934 0.885 | 0.973 0.975 0.838 0.958 0.626
Feedback Generator: Llama-3.1-70B
GPT-4o | 0.878 0.835 0.741 0.708 0.770 | 0.928 0.751 0.748 0.883 0.922
Llama-3.1-8B | 0.538 0.464 0.479 0.339 0.228 | 0.565 0.429 0.220 0.277 0.228
Llama-3.1-70B | 0.909 0.738 0.775 0.586 0.467 | 0.828 0.848 0.765 0.835 0.911
Gemma-9B | 0.960 0.933 0.865 0.750 0.736 | 0.985 0.960 0.953 0.944 0.888
Gemma-27B | 0.967 0.908 0.843 0.913 0.913 | 0.965 0.842 0.899 0.969 0.655
Table A9. Inter-agreement for the MC160 dataset.

Feedback Generator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Generation prompt strategy: w/ criteria
GPT-4o | 0.936 0.936 0.075 0.071 0.375 | 0.857 0.657 0.296 0.068 0.114
Claude-3 | 0.986 0.943 0.089 0.057 0.582 | 0.921 0.850 0.404 0.068 0.064
Llama-3.1-70B | 0.989 0.889 0.104 0.043 0.225 | 0.893 0.782 0.604 0.132 0.039
Generation prompt strategy: w/o criteria
GPT-4o | 0.882 0.807 0.179 0.064 0.196 | 0.711 0.543 0.307 0.104 0.286
Claude-3 | 0.946 0.896 0.118 0.132 0.175 | 0.829 0.654 0.282 0.211 0.139
Llama-3.1-70B | 0.782 0.571 0.139 0.104 0.282 | 0.757 0.382 0.382 0.179 0.107
Table A10. Inter-agreement for the MC500 dataset.

Feedback Generator | Grouped: COR. REV. GUI. DIA. ENC. | Individual: COR. REV. GUI. DIA. ENC.
Generation prompt strategy: w/ criteria
GPT-4o | 0.962 0.914 0.059 0.075 0.248 | 0.828 0.780 0.331 0.108 0.149
Claude-3 | 0.981 0.943 0.068 0.054 0.494 | 0.903 0.919 0.453 0.096 0.163
Llama-3.1-70B | 0.980 0.914 0.107 0.046 0.246 | 0.897 0.849 0.571 0.125 0.062
Generation prompt strategy: w/o criteria
GPT-4o | 0.866 0.808 0.238 0.098 0.216 | 0.678 0.536 0.305 0.104 0.308
Claude-3 | 0.958 0.913 0.125 0.100 0.203 | 0.847 0.574 0.253 0.213 0.163
Llama-3.1-70B | 0.753 0.512 0.129 0.123 0.293 | 0.768 0.380 0.343 0.250 0.169
Table A11. Evaluation results and average confidence score examples for generation prompt strategy w/ criteria by human annotators on the MC160 dataset.

Feedback Generator: GPT-4o
Feedback: Great effort, but make sure to carefully read the story again to identify what specifically frightened the child during their zoo visit.
Annotator | COR. REV. GUI. DIA. ENC. | Avg. Confidence
Annotator 1 | 1 1 1 1 1 | 3
Annotator 2 | 1 1 1 1 1 | 4.4
Annotator 3 | 1 1 1 1 0 | 5
Annotator 4 | 1 1 1 1 1 | 5
Annotator 5 | 1 1 1 1 1 | 5
Human Agreement | 1 1 1 1 0 | 4.48

Feedback Generator: Claude-3
Feedback: Recall the part of the story where Lauren brings Chocolate home and consider where she might find a safe, water-friendly place to keep a turtle temporarily.
Annotator | COR. REV. GUI. DIA. ENC. | Avg. Confidence
Annotator 1 | 1 1 1 0 1 | 3.6
Annotator 2 | 1 1 1 1 1 | 3.8
Annotator 3 | 1 1 1 1 0 | 4.8
Annotator 4 | 1 1 1 1 0 | 4.6
Annotator 5 | 1 1 1 1 1 | 5
Human Agreement | 1 1 1 0 0 | 4.38

Feedback Generator: Llama-3.1-70B
Feedback: To determine what Mark wanted, re-read the text to identify the specific reason mentioned for Mark’s motivation to write a story, focusing on the connection to his relationship with Wendy.
Annotator | COR. REV. GUI. DIA. ENC. | Avg. Confidence
Annotator 1 | 1 1 1 0 1 | 5
Annotator 2 | 1 1 1 1 1 | 4.2
Annotator 3 | 1 1 1 1 1 | 5
Annotator 4 | 1 1 1 1 0 | 5
Annotator 5 | 1 1 1 1 0 | 4.6
Human Agreement | 1 1 1 0 0 | 4.76
Table A12. Evaluation results and average confidence score examples for generation prompt strategy w/o criteria by human annotators on the MC160 dataset.
Feedback Generator: GPT-4o
"Actually, the store had various types of food including turkey, chocolate, and radishes."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  0  0  0  0  4.6
Annotator 2  1  0  1  1  1  5
Annotator 3  1  1  1  1  0  4.4
Annotator 4  1  0  0  1  0  4.8
Annotator 5  1  0  1  0  0  4.6
Human Agreement  1  0  0  0  0  4.68
Feedback Generator: Claude-3
"While Martha did check under the benches, the story mentions she looked inside the restrooms first before checking other places."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  0  0  1  1  3.4
Annotator 2  1  0  1  1  1  4.2
Annotator 3  1  0  1  1  0  5
Annotator 4  1  0  0  1  0  4.8
Annotator 5  1  0  1  1  0  5
Human Agreement  1  1  0  1  0  4.48
Feedback Generator: Llama-3.1-70B
"Since it’s mentioned in the passage that Spike, Molly’s dog, plays with a ball, the correct answer should be A: ball."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  0  1  1  0  3.4
Annotator 2  1  0  1  1  1  4.2
Annotator 3  1  0  1  0  0  4.2
Annotator 4  1  0  0  1  0  5
Annotator 5  1  0  0  1  0  4.8
Human Agreement  1  1  0  0  0  4.32
Table A13. Evaluation results and average confidence score examples for generation prompt strategy w/ criteria by human annotators on the MC500 dataset.
Feedback Generator: GPT-4o
"It looks like you misunderstood the story; try to find which item the pig specifically mentioned forgetting, leading to his reluctance to eat."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  0  1  1  1  0  4.4
Annotator 2  1  1  1  1  1  4.4
Annotator 3  1  1  1  0  0  4.4
Annotator 4  1  1  1  1  0  4.6
Annotator 5  1  1  1  1  1  5
Human Agreement  0  1  1  0  0  4.56
Feedback Generator: Claude-3
"Let’s take a closer look at the story and see if we can find any specific mentions of Kim’s food preferences for her birthday dinner."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  1  1  0  0  4.4
Annotator 2  1  1  1  1  1  4.6
Annotator 3  1  1  1  1  1  5
Annotator 4  1  1  1  0  0  4.4
Annotator 5  1  1  1  1  1  5
Human Agreement  1  1  1  0  0  4.68
Feedback Generator: Llama-3.1-70B
"Let’s take a closer look at the first sentence of the story to see if we can find the correct answer to when the magical chipmunks come out to play."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  1  1  0  0  5
Annotator 2  1  1  1  1  1  4
Annotator 3  1  1  1  1  1  5
Annotator 4  1  1  1  1  0  4.6
Annotator 5  1  1  1  1  1  5
Human Agreement  1  1  1  0  0  4.72
Table A14. Evaluation results and average confidence score examples for generation prompt strategy w/o criteria by human annotators on the MC500 dataset.
Feedback Generator: GPT-4o
"Double-check the story’s details; Rose gave her slice to Dave, who took it home for his Bro."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  0  1  0  0  4.6
Annotator 2  1  0  1  1  1  4.6
Annotator 3  1  0  1  1  1  5
Annotator 4  1  0  1  1  0  4.6
Annotator 5  1  0  1  1  0  5
Human Agreement  1  1  1  0  0  4.76
Feedback Generator: Claude-3
"Let’s carefully review the sequence of events in the story, paying close attention to Will’s location when the rain started."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  1  1  0  1  4.2
Annotator 2  1  1  1  1  1  4.4
Annotator 3  1  1  1  1  1  5
Annotator 4  1  1  1  0  1  4.2
Annotator 5  1  1  1  1  1  5
Human Agreement  1  1  1  0  1  4.56
Feedback Generator: Llama-3.1-70B
"Consider re-reading the story to find the item that Timmy’s mom specifically said he couldn’t bring to his grandma’s house due to his grandma’s dislike for it."
COR.  REV.  GUI.  DIA.  ENC.  Avg. Confidence
Annotator 1  1  1  1  1  0  4.4
Annotator 2  1  1  1  1  1  4.4
Annotator 3  1  0  0  1  0  5
Annotator 4  1  1  1  1  0  4.4
Annotator 5  1  1  1  1  1  5
Human Agreement  1  0  0  1  0  4.64

References

1. Alier, M.; Casañ, M.J.; Filvà, D.A. Smart Learning Applications: Leveraging LLMs for Contextualized and Ethical Educational Technology. In Proceedings of the TEEM 2023; Gonçalves, J.A.d.C., Lima, J.L.S.d.M., Coelho, J.P., García-Peñalvo, F.J., García-Holgado, A., Eds.; Springer: Singapore, 2024; pp. 190–199.
2. Laato, S.; Morschheuser, B.; Hamari, J.; Björne, J. AI-Assisted Learning with ChatGPT and Large Language Models: Implications for Higher Education. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 226–230.
3. Azaiz, I.; Kiesler, N.; Strickroth, S. Feedback-Generation for Programming Exercises with GPT-4. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, Milan, Italy, 8–10 July 2024; pp. 31–37.
4. Steiss, J.; Tate, T.; Graham, S.; Cruz, J.; Hebert, M.; Wang, J.; Moon, Y.; Tseng, W.; Warschauer, M.; Olson, C.B. Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 2024, 91, 101894.
5. Baidoo-anu, D.; Owusu Ansah, L. Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning. J. AI 2023, 7, 52–62.
6. Lan, Y.-J.; Chen, N.-S. Teachers’ Agency in the Era of LLM and Generative AI. Educ. Technol. Soc. 2024, 27, I–XVIII.
7. Dhurandhar, A.; Nair, R.; Singh, M.; Daly, E.; Natesan Ramamurthy, K. Ranking Large Language Models without Ground Truth. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2431–2452.
8. Chen, G.H.; Chen, S.; Liu, Z.; Jiang, F.; Wang, B. Humans or LLMs as the Judge? A Study on Judgement Biases. arXiv 2024, arXiv:2402.10669.
9. Gan, W.; Qi, Z.; Wu, J.; Lin, J.C.W. Large Language Models in Education: Vision and Opportunities. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 4776–4785.
10. Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large language models for education: A survey and outlook. arXiv 2024, arXiv:2403.18105.
11. Kortemeyer, G. Performance of the pre-trained large language model GPT-4 on automated short answer grading. Discov. Artif. Intell. 2024, 4, 47.
12. Stahl, M.; Biermann, L.; Nehring, A.; Wachsmuth, H. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 6 June 2024; Kochmar, E., Bexte, M., Burstein, J., Horbach, A., Laarmann-Quante, R., Tack, A., Yaneva, V., Yuan, Z., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 283–298.
13. Mulla, N.; Gharpure, P. Automatic question generation: A review of methodologies, datasets, evaluation metrics, and applications. Prog. Artif. Intell. 2023, 12, 1–32.
14. Feng, W.; Lee, J.; McNichols, H.; Scarlatos, A.; Smith, D.; Woodhead, S.; Ornelas, N.; Lan, A. Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3067–3082.
15. Lee, J.; Smith, D.; Woodhead, S.; Lan, A. Math Multiple Choice Question Generation via Human-Large Language Model Collaboration. arXiv 2024, arXiv:2405.00864.
16. Dan, Y.; Lei, Z.; Gu, Y.; Li, Y.; Yin, J.; Lin, J.; Ye, L.; Tie, Z.; Zhou, Y.; Wang, Y.; et al. EduChat: A large-scale language model-based chatbot system for intelligent education. arXiv 2023, arXiv:2308.02773.
17. Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199.
18. Dai, W.; Lin, J.; Jin, H.; Li, T.; Tsai, Y.S.; Gašević, D.; Chen, G. Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 323–325.
19. Liu, H.; Liu, Z.; Wu, Z.; Tang, J. Personalized Multimodal Feedback Generation in Education. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; Scott, D., Bel, N., Zong, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1826–1840.
20. Koutcheme, C.; Dainese, N.; Hellas, A.; Sarsa, S.; Leinonen, J.; Ashraf, S.; Denny, P. Evaluating Language Models for Generating and Judging Programming Feedback. arXiv 2024, arXiv:2407.04873.
21. Scarlatos, A.; Smith, D.; Woodhead, S.; Lan, A. Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In Artificial Intelligence in Education; Olney, A.M., Chounta, I.A., Liu, Z., Santos, O.C., Bittencourt, I.I., Eds.; Springer: Cham, Switzerland, 2024; pp. 280–294.
22. Guo, S.; Latif, E.; Zhou, Y.; Huang, X.; Zhai, X. Using Generative AI and Multi-Agents to Provide Automatic Feedback. arXiv 2024, arXiv:2411.07407.
23. Lagakis, P.; Demetriadis, S. EvaAI: A Multi-agent Framework Leveraging Large Language Models for Enhanced Automated Grading. In Proceedings of the International Conference on Intelligent Tutoring Systems, Thessaloniki, Greece, 10–13 June 2024; Springer: Cham, Switzerland, 2024; pp. 378–385.
24. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Zhang, S.; Zhu, E.; Li, B.; Jiang, L.; Zhang, X.; Wang, C. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv 2023, arXiv:2308.08155.
25. Zhang, Z.; Zhang-Li, D.; Yu, J.; Gong, L.; Zhou, J.; Liu, Z.; Hou, L.; Li, J. Simulating Classroom Education with LLM-Empowered Agents. arXiv 2024, arXiv:2406.19226.
26. Koutcheme, C.; Dainese, N.; Hellas, A. Using Program Repair as a Proxy for Language Models’ Feedback Ability in Programming Education. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 20 June 2024; Kochmar, E., Bexte, M., Burstein, J., Horbach, A., Laarmann-Quante, R., Tack, A., Yaneva, V., Yuan, Z., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 165–181.
27. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 46595–46623.
28. Chiang, C.H.; Lee, H.y. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 15607–15631.
29. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2511–2522.
30. Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. arXiv 2024, arXiv:2406.18403.
31. Verga, P.; Hofstatter, S.; Althammer, S.; Su, Y.; Piktus, A.; Arkhangorodsky, A.; Xu, M.; White, N.; Lewis, P. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv 2024, arXiv:2404.18796.
32. Thakur, A.S.; Choudhary, K.; Ramayapally, V.S.; Vaidyanathan, S.; Hupkes, D. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv 2024, arXiv:2406.12624.
33. Jung, J.; Brahman, F.; Choi, Y. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv 2024, arXiv:2407.18370.
34. Kenton, Z.; Siegel, N.Y.; Kramár, J.; Brown-Cohen, J.; Albanie, S.; Bulian, J.; Agarwal, R.; Lindner, D.; Tang, Y.; Goodman, N.D.; et al. On scalable oversight with weak LLMs judging strong LLMs. arXiv 2024, arXiv:2407.04622.
35. Richardson, M.; Burges, C.J.; Renshaw, E. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 193–203.
36. Jia, Q.; Young, M.; Xiao, Y.; Cui, J.; Liu, C.; Rashid, P.; Gehringer, E. Insta-Reviewer: A Data-Driven Approach for Generating Instant Feedback on Students’ Project Reports. In Proceedings of the International Educational Data Mining Society, Durham, UK, 24–27 July 2022.
37. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
38. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
39. Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open models based on Gemini research and technology. arXiv 2024, arXiv:2403.08295.
40. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771.
41. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
Figure 1. Overview of teacher feedback generation and evaluation in the educational tutoring system.
Figure 2. Sample from our dataset (chosen randomly from MC160 train set).
Figure 3. Illustration of automatic feedback generation (top) and evaluation (bottom) using LLMs. Our study aims to analyze the consistency of LLMs as evaluators.
Figure 4. Comparison of feedback criteria used in this study and previous studies. Each figure represents the study of (a) [20], (b) [4], and (c) [21].
Figure 5. Prompts for generating a teacher’s feedback.
Figure 6. Prompts for evaluating a teacher’s feedback.
Figure 7. Calculation process for inner and inter-agreements. We conducted the experiment with R and M set to 5. The tick represents cases where the agreement is 1, while the cross represents cases where the agreement is 0.
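To make the computation described in Figure 7 concrete, the sketch below shows one way such 0/1 agreements could be derived from binary evaluation labels. It is an illustrative reading of the figure (R = 5 repeated evaluations per evaluator, M = 5 evaluator models), not the authors' released code, and the function names are ours.

```python
# Illustrative sketch of the agreement computation depicted in Figure 7.
# Assumptions: R = 5 repeated evaluations per evaluator, M = 5 evaluator
# models, and binary labels (criterion satisfied = 1, not satisfied = 0).
import numpy as np

def inner_agreement(repeated_labels):
    """repeated_labels: shape (R,) labels from one evaluator on one sample.
    Returns 1 if all R repetitions give the same label (tick), else 0 (cross)."""
    repeated_labels = np.asarray(repeated_labels)
    return int(np.all(repeated_labels == repeated_labels[0]))

def inter_agreement(model_labels):
    """model_labels: shape (M,) one label per evaluator model on one sample.
    Returns 1 if all M evaluators give the same label, else 0."""
    model_labels = np.asarray(model_labels)
    return int(np.all(model_labels == model_labels[0]))

# Under this reading, the per-criterion scores in Tables A7-A10 would be the
# mean of these 0/1 values over all feedback samples in the dataset.
print(inner_agreement([1, 1, 1, 1, 1]))  # 1
print(inter_agreement([1, 0, 1, 1, 1]))  # 0
```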
Figure 8. Human agreement on the MC160. Five human annotators evaluated the feedback generated by each feedback generator. COR stands for Correct, REV for Revealing, GUI for Guidance, DIA for Diagnostic, and ENC for Encouragement.
Figure 9. Confidence scores for each human annotator under the w/ criteria generation prompt strategy on the MC160.
Figure 10. Inner agreement for the generation prompt strategy w/ criteria in the MC160 dataset with GPT-4o as the feedback generator. The legend represents the feedback evaluation models. The vertical dotted lines within the plot distinguish between large (GPT-4o, Llama-3.1-70B) and small + medium (Llama-3.1-8B, Gemma-9B, Gemma-27B) LLMs based on model size. The red horizontal line represents the performance of GPT-4o on the Correct criterion.
Figure 11. Inter-agreement for the generation prompt strategy w/ criteria in the MC160. The legend represents the feedback generation models.
Figure 12. Accuracy between human and LLM evaluations for the generation prompt strategy w/ criteria in the MC160. The legend represents the feedback generation models. MV stands for majority voting.
Figure 13. Score distribution on the MC160 with the generation prompt strategy using w/ criteria. Each graph represents a feedback generation model.
Figure 14. Score distribution on the MC160 with the generation prompt strategy using w/o criteria. Each subfigure represents a feedback generation model.
Figure 15. Score distribution when the generation prompt strategy is w/ criteria and the number of sentences of generated feedback is 5, using GPT-4o as the feedback generator on the MC160.
Table 1. Dataset statistics.
Corpus  # Data  Avg. Words per Story  Avg. Words per Question
MC160  280  191.84  7.83
MC500  1200  213.60  7.75
Table 2. Feedback generation model names and API versions.
Model Size  Model Name  Version/API Version
Large (closed-source)
GPT-4o  gpt-4o
Claude-3  claude-3-5-sonnet-20240620
Large (open-source)
Llama-3.1-70B  Meta-Llama-3.1-70B-Instruct
Table 3. Feedback evaluation model names and API versions.
Model Size  Model Name  Version/API Version
Large (closed-source)
GPT-4o  gpt-4o
Large (open-source)
Llama-3.1-70B  Meta-Llama-3.1-70B-Instruct
Small + Medium (open-source)
Llama-3.1-8B  Meta-Llama-3.1-8B-Instruct
Gemma-9B  gemma-2-9b-it
Gemma-27B  gemma-2-27b-it
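As a rough illustration of how an open-source evaluator from Table 3 might be queried, a minimal Hugging Face Transformers sketch is shown below. The Hub organization prefix, decoding settings, and prompt wording are our assumptions; the paper's actual inference setup and evaluation prompts are described elsewhere in the text.

```python
# Minimal sketch (not the authors' exact setup) for querying an open-source
# evaluator such as gemma-2-9b-it with the Transformers pipeline API.
# The Hub id prefix ("google/") and generation settings are assumptions.
from transformers import pipeline

evaluator = pipeline(
    "text-generation",
    model="google/gemma-2-9b-it",
    device_map="auto",
)

prompt = (
    "You are evaluating teacher feedback. "
    "Answer 1 if the feedback is Correct, otherwise 0.\n"
    "Feedback: Re-read the first paragraph to find what frightened the child.\n"
    "Answer:"
)
result = evaluator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```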
Table 4. Majority voting counts of LLM-based feedback evaluators on the MC160. A count of 5 indicates that all five evaluators provided the same result.
Feedback Generator  COR.  REV.  GUI.  DIA.  ENC.
GPT-4o  4.96  4.94  3.88  3.48  3.98
Claude-3  5.00  4.96  3.90  3.38  4.40
Llama-3.1-70B  4.96  4.82  3.82  3.38  3.86
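The counts in Table 4 can be read as the average size of the majority among the five LLM evaluators per criterion; a small sketch of that reading is given below, with illustrative data rather than values from the study.

```python
# Hedged sketch of how the Table 4 counts could be computed under our reading:
# for each feedback sample, count how many of the five evaluators returned the
# majority label, then average over samples. The sample data is illustrative.
from collections import Counter

def majority_size(labels):
    """labels: binary judgments from the five evaluators for one sample."""
    return Counter(labels).most_common(1)[0][1]

samples = [
    [1, 1, 1, 1, 1],  # unanimous   -> 5
    [1, 1, 1, 0, 1],  # four agree  -> 4
    [1, 0, 1, 0, 1],  # three agree -> 3
]
print(sum(majority_size(s) for s in samples) / len(samples))  # 4.0
```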
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
