Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
Abstract
1. Introduction
- For criteria with low agreement among human evaluators, the reliability of LLM evaluators also tended to be low.
- Inconsistent responses were observed from LLMs, even when identical prompts were provided.
- Low inter-agreement among different LLMs indicated a lack of consistency in evaluations across models.
- Large-scale models and models from the Llama family tended to generate responses with human-comparable consistency on some evaluation criteria.
2. Related Works
2.1. Large Language Models in Education
2.2. Large Language Models as Evaluators
3. Automatic Feedback Generation and Evaluation
3.1. Student Correct and Incorrect Response Collection
3.2. Automatic Feedback Generation
3.2.1. Feedback Generation Criteria
- Correct (COR.): The teacher’s feedback is expected to contain no incorrect statements and to be relevant to the current question and the student’s response.
- Revealing (REV.): The teacher’s feedback should not directly reveal the correct answer to the student.
- Guidance (GUI.): The teacher’s feedback should provide guidance that helps the student move towards the correct answer.
- Diagnostic (DIA.): The teacher’s feedback should accurately identify and address any misunderstandings or errors in the student’s response.
- Encouragement (ENC.): The teacher’s feedback should maintain a positive or encouraging tone.
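The five criteria above can be encoded as a simple rubric data structure for prompt construction. The sketch below is illustrative only: the dictionary keys mirror the paper's abbreviations, but this is not the authors' implementation.

```python
# Hypothetical encoding of the five feedback criteria as a rubric for prompt
# construction (illustrative; not the paper's actual code).
FEEDBACK_CRITERIA = {
    "COR": "Contains no incorrect statements and is relevant to the question "
           "and the student's response.",
    "REV": "Does not directly reveal the correct answer to the student.",
    "GUI": "Provides guidance that helps the student move towards the correct answer.",
    "DIA": "Accurately identifies and addresses misunderstandings or errors "
           "in the student's response.",
    "ENC": "Maintains a positive or encouraging tone.",
}

def rubric_text() -> str:
    """Render the rubric as numbered lines for inclusion in a prompt."""
    return "\n".join(
        f"{i}. {abbr}: {desc}"
        for i, (abbr, desc) in enumerate(FEEDBACK_CRITERIA.items(), start=1)
    )
```

A structure like this makes it easy to include or omit the criteria block depending on the prompt strategy described in the next subsection.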
3.2.2. Prompt Strategy
- Without (w/o) criteria: This prompt excludes information about the five feedback criteria for generating teachers’ feedback.
- With (w/) criteria: This prompt includes information about the five feedback criteria when generating teachers’ feedback.
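The only difference between the two generation prompts is whether the criteria block is present. A minimal sketch, assuming hypothetical prompt wording (the paper's exact prompts may differ):

```python
# Illustrative sketch of the w/o- vs. w/-criteria generation prompts.
# BASE_INSTRUCTION and CRITERIA_BLOCK are hypothetical wordings, not the
# paper's actual prompt text.
BASE_INSTRUCTION = (
    "You are a teacher. Write feedback for the student's incorrect answer "
    "to the reading-comprehension question below."
)

CRITERIA_BLOCK = (
    "Your feedback should be Correct, not Revealing, provide Guidance, "
    "be Diagnostic, and be Encouraging."
)

def build_generation_prompt(question: str, student_answer: str,
                            with_criteria: bool) -> str:
    """Assemble a w/ or w/o criteria prompt for the feedback generator."""
    parts = [BASE_INSTRUCTION]
    if with_criteria:
        parts.append(CRITERIA_BLOCK)
    parts.append(f"Question: {question}")
    parts.append(f"Student answer: {student_answer}")
    return "\n\n".join(parts)
```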
3.3. Automatic Feedback Evaluation
3.3.1. Feedback Evaluation Criteria
3.3.2. Prompt Strategy
- Grouped: This prompt includes all feedback evaluation criteria within a single prompt.
- Individual: This prompt contains only one feedback evaluation criterion, and the evaluation is performed independently five times in total (once for each of the five criteria).
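The two evaluation strategies differ only in how many criteria each prompt carries, which also determines the number of model calls per feedback item. A hedged sketch, where `ask_llm` is a stand-in for a real model API call:

```python
# Sketch of the Grouped vs. Individual evaluation strategies.
# `ask_llm` is a placeholder for a real LLM API call; the prompt wording
# here is hypothetical.
from typing import Callable, Dict

CRITERIA = ["COR", "REV", "GUI", "DIA", "ENC"]

def evaluate_grouped(ask_llm: Callable[[str], str], feedback: str) -> str:
    """One call: all five criteria are judged in a single prompt."""
    prompt = f"Rate this feedback on {', '.join(CRITERIA)} (0/1 each):\n{feedback}"
    return ask_llm(prompt)

def evaluate_individual(ask_llm: Callable[[str], str], feedback: str) -> Dict[str, str]:
    """Five calls: each prompt carries exactly one criterion."""
    return {
        c: ask_llm(f"Rate this feedback on {c} (0 or 1):\n{feedback}")
        for c in CRITERIA
    }
```

The Individual strategy issues five calls per feedback item instead of one, which is consistent with its markedly higher API costs in the Appendix A tables.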
4. Experimental Setup
4.1. Models
- Feedback Generator
- Feedback Evaluator
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Human Evaluation Setup
5. Experimental Results and Analysis
5.1. Human Agreement
5.2. LLM Inner Agreement
5.3. LLM Inter-Agreement
5.4. Comparison Between Human and LLM Evaluations
- Internal (Macro) majority voting: This method involves performing majority voting within each LLM and then combining the majority results from multiple LLMs to make a final decision.
- Global (Micro) majority voting: This method pools the N responses from each of the M evaluators and applies a single majority vote over all M × N responses at once to make a final decision.
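The two voting schemes can diverge when models differ in internal consistency. A minimal sketch with binary labels (function names are ours, not the paper's):

```python
# Internal (macro) vs. global (micro) majority voting over binary labels.
# votes_per_llm[m] holds the N repeated votes of evaluator m.
from collections import Counter
from typing import List

def internal_majority(votes_per_llm: List[List[int]]) -> int:
    """Macro MV: majority within each LLM first, then across the per-LLM results."""
    per_model = [Counter(v).most_common(1)[0][0] for v in votes_per_llm]
    return Counter(per_model).most_common(1)[0][0]

def global_majority(votes_per_llm: List[List[int]]) -> int:
    """Micro MV: a single majority vote over all M x N individual responses."""
    flat = [v for votes in votes_per_llm for v in votes]
    return Counter(flat).most_common(1)[0][0]
```

For example, with three evaluators voting `[1,1,0]`, `[1,1,0]`, and `[0,0,0]`, internal voting yields 1 (two of three per-model majorities) while global voting yields 0 (five of nine raw votes), illustrating how macro voting weights models equally while micro voting weights every response equally.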
5.5. Analysis of Score Distribution in Human and LLM Evaluations
5.6. Analysis of Score Distribution Based on the Number of Generated Feedback Sentences
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Correction Statement
Abbreviations
Abbreviation | Definition
---|---
LLM | Large Language Model
COR | Correct
REV | Revealing
GUI | Guidance
DIA | Diagnostic
ENC | Encouragement
w/o criteria | without criteria
w/ criteria | with criteria
MV | Majority Voting
Appendix A. Details of Data Generation
Generation Prompt Strategy | Model Name | MC160 | MC500
---|---|---|---
w/o criteria | GPT-4o | $0.64 | $2.87
w/o criteria | Claude-3 | $0.58 | $2.38
w/ criteria | GPT-4o | $0.76 | $2.96
w/ criteria | Claude-3 | $0.67 | $2.80
Generation Prompt Strategy | Evaluation Prompt Strategy | MC160 | MC500
---|---|---|---
w/o criteria | Grouped | $1.01 | $4.06
w/o criteria | Individual | $3.21 | $13.43
w/ criteria | Grouped | $1.02 | $4.16
w/ criteria | Individual | $3.29 | $13.55
Appendix B. Details of Agreements
Algorithm A1 Inner Agreement
Initialize the agreement count to 0
for i = 1 to N do
  if all repeated evaluation results for item i are equal then
    increment the agreement count
  end if
end for
return the agreement count divided by N
Algorithm A2 Inter-Agreement
Initialize the agreement count to 0
for i = 1 to N do
  Initialize an empty collection for item i
  for m = 1 to M do
    Append evaluator m’s results for item i to the collection
  end for
end for
for i = 1 to N do
  for m = 1 to M do
    if all evaluators’ results being compared are equal then
      increment the agreement count
    end if
  end for
end for
return the agreement count divided by N × M
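The two agreement measures can be sketched in a few lines. Because the printed pseudocode lost its variable names, the identifiers and the exact normalization below are our reconstruction of Algorithms A1 and A2, not the authors' code:

```python
# Reconstruction sketch of the inner- and inter-agreement computations.
# Variable names and normalization are our assumptions.
from typing import List, Sequence

def inner_agreement(responses: List[Sequence[int]]) -> float:
    """Fraction of items for which all repeated evaluations by one LLM agree
    (sketch of Algorithm A1). responses[i] holds one model's repeated labels."""
    agree = sum(1 for item in responses if len(set(item)) == 1)
    return agree / len(responses)

def inter_agreement(responses: List[List[Sequence[int]]]) -> float:
    """Fraction of compared positions where all M evaluators give the same
    label (sketch of Algorithm A2). responses[i][m] holds evaluator m's
    labels for item i."""
    total = agree = 0
    for item in responses:            # item: the M evaluators' label sequences
        for labels in zip(*item):     # compare evaluators position-by-position
            total += 1
            if len(set(labels)) == 1:
                agree += 1
    return agree / total
```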
Appendix C. Details of Human Evaluations
- 5: Very confident in the accuracy of the evaluation.
- 4: Quite confident.
- 3: Somewhat confident but with some reservations.
- 2: Not confident.
- 1: Almost no confidence.
Appendix D. Additional Experimental Results
Feedback Generator: GPT-4o | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Review the last paragraph carefully to see what Rudy was most excited about in his birthday party. | |||||||||||
Grouped | Individual | ||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 0 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | |
Llama-3.1-8B | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | |
Llama-3.1-70B | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | |
Gemma-9B | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | |
Gemma-27B | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | |
Feedback Generator: Claude-3 | |||||||||||
Great attempt, but let’s take another look at the last part of the story | |||||||||||
to find the gift that made Rudy especially excited. | |||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 1 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Llama-3.1-8B | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | |
Llama-3.1-70B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Gemma-9B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | |
Gemma-27B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Feedback Generator: Llama-3.1-70B | |||||||||||
Take another look at the story and see what gift made Rudy jump and bark with excitement, | |||||||||||
as that might be a clue to what he thought was the best gift he had ever been given. | |||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 1 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | |
Llama-3.1-8B | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
Llama-3.1-70B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | |
Gemma-9B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | |
Gemma-27B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Feedback Generator: GPT-4o | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
You identified one sport correctly, but remember that | |||||||||||
Tom played another sport he wasn’t very good at, which is different from football. | |||||||||||
Grouped | Individual | ||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 1 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | |
Llama-3.1-8B | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | |
Llama-3.1-70B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | |
Gemma-9B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | |
Gemma-27B | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | |
Feedback Generator: Claude-3 | |||||||||||
Great attempt, but let’s revisit the story | |||||||||||
and carefully consider which sports were specifically mentioned for Tom. | |||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 1 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Llama-3.1-8B | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | |
Llama-3.1-70B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Gemma-9B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | |
Gemma-27B | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
Feedback Generator: Llama-3.1-70B | |||||||||||
Take another look at the story and think about the two sports that Tom is mentioned to be enjoying, | |||||||||||
one of which he is particularly good at and the other not so much. | |||||||||||
Feedback Evaluator | COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | |
Human | 1 | 1 | 1 | 1 | 1 | - | - | - | - | - | |
GPT-4o | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | |
Llama-3.1-8B | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | |
Llama-3.1-70B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | |
Gemma-9B | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | |
Gemma-27B | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Evaluator | Feedback Generator: GPT-4o | ||||||||||
GPT-4o | 0.968 | 0.989 | 0.761 | 0.689 | 0.593 | 0.882 | 0.911 | 0.836 | 0.707 | 0.771 | |
Llama-3.1-8B | 0.650 | 0.546 | 0.436 | 0.536 | 0.318 | 0.614 | 0.325 | 0.207 | 0.450 | 0.386 | |
Llama-3.1-70B | 0.989 | 0.943 | 0.732 | 0.507 | 0.554 | 0.768 | 0.896 | 0.775 | 0.500 | 0.707 | |
Gemma-9B | 0.996 | 0.989 | 0.900 | 0.693 | 0.729 | 0.993 | 0.996 | 0.939 | 0.904 | 0.825 | |
Gemma-27B | 0.999 | 0.999 | 0.982 | 0.875 | 0.893 | 0.989 | 0.971 | 0.853 | 0.896 | 0.754 | |
Feedback Evaluator | Feedback Generator: Claude-3 | ||||||||||
GPT-4o | 0.986 | 0.999 | 0.800 | 0.629 | 0.664 | 0.893 | 0.957 | 0.868 | 0.600 | 0.825 | |
Llama-3.1-8B | 0.814 | 0.657 | 0.461 | 0.550 | 0.371 | 0.825 | 0.464 | 0.125 | 0.443 | 0.293 | |
Llama-3.1-70B | 0.993 | 0.989 | 0.800 | 0.379 | 0.693 | 0.882 | 0.968 | 0.761 | 0.400 | 0.704 | |
Gemma-9B | 0.996 | 0.999 | 0.893 | 0.732 | 0.832 | 0.999 | 0.999 | 0.946 | 0.904 | 0.786 | |
Gemma-27B | 0.999 | 0.999 | 0.975 | 0.868 | 0.821 | 0.999 | 0.996 | 0.861 | 0.864 | 0.936 | |
Feedback Evaluator | Feedback Generator: Llama-3.1-70B | ||||||||||
GPT-4o | 0.982 | 0.993 | 0.829 | 0.604 | 0.396 | 0.900 | 0.871 | 0.904 | 0.600 | 0.700 | |
Llama-3.1-8B | 0.850 | 0.482 | 0.389 | 0.364 | 0.239 | 0.771 | 0.568 | 0.221 | 0.207 | 0.236 | |
Llama-3.1-70B | 0.999 | 0.954 | 0.918 | 0.393 | 0.393 | 0.864 | 0.900 | 0.854 | 0.504 | 0.754 | |
Gemma-9B | 0.996 | 0.996 | 0.907 | 0.796 | 0.721 | 0.989 | 0.993 | 0.946 | 0.889 | 0.889 | |
Gemma-27B | 0.999 | 0.993 | 0.975 | 0.839 | 0.707 | 0.996 | 0.961 | 0.932 | 0.900 | 0.825 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Evaluator | Feedback Generator: GPT-4o | ||||||||||
GPT-4o | 0.875 | 0.911 | 0.675 | 0.446 | 0.736 | 0.868 | 0.593 | 0.718 | 0.679 | 0.929 | |
Llama-3.1-8B | 0.636 | 0.475 | 0.539 | 0.421 | 0.275 | 0.543 | 0.482 | 0.207 | 0.339 | 0.250 | |
Llama-3.1-70B | 0.925 | 0.825 | 0.689 | 0.539 | 0.375 | 0.711 | 0.757 | 0.657 | 0.586 | 0.821 | |
Gemma-9B | 0.996 | 0.996 | 0.832 | 0.671 | 0.654 | 0.975 | 0.979 | 0.904 | 0.854 | 0.911 | |
Gemma-27B | 0.986 | 0.993 | 0.761 | 0.804 | 0.889 | 0.954 | 0.936 | 0.768 | 0.925 | 0.529 | |
Feedback Evaluator | Feedback Generator: Claude-3 | ||||||||||
GPT-4o | 0.954 | 0.961 | 0.639 | 0.682 | 0.675 | 0.914 | 0.632 | 0.625 | 0.804 | 0.914 | |
Llama-3.1-8B | 0.811 | 0.643 | 0.575 | 0.375 | 0.182 | 0.629 | 0.582 | 0.196 | 0.296 | 0.239 | |
Llama-3.1-70B | 0.993 | 0.932 | 0.721 | 0.546 | 0.407 | 0.864 | 0.857 | 0.650 | 0.664 | 0.868 | |
Gemma-9B | 0.996 | 0.996 | 0.793 | 0.729 | 0.636 | 0.996 | 0.999 | 0.939 | 0.911 | 0.886 | |
Gemma-27B | 0.999 | 0.989 | 0.639 | 0.907 | 0.854 | 0.989 | 0.971 | 0.818 | 0.964 | 0.693 | |
Feedback Evaluator | Feedback Generator: Llama-3.1-70B | ||||||||||
GPT-4o | 0.882 | 0.818 | 0.721 | 0.654 | 0.754 | 0.936 | 0.789 | 0.768 | 0.846 | 0.932 | |
Llama-3.1-8B | 0.568 | 0.464 | 0.496 | 0.328 | 0.232 | 0.546 | 0.443 | 0.254 | 0.296 | 0.254 | |
Llama-3.1-70B | 0.904 | 0.729 | 0.746 | 0.611 | 0.500 | 0.807 | 0.829 | 0.750 | 0.818 | 0.889 | |
Gemma-9B | 0.986 | 0.975 | 0.850 | 0.775 | 0.704 | 0.999 | 0.961 | 0.975 | 0.964 | 0.914 | |
Gemma-27B | 0.964 | 0.914 | 0.843 | 0.864 | 0.925 | 0.982 | 0.861 | 0.911 | 0.968 | 0.675 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Evaluator | Feedback Generator: GPT-4o | ||||||||||
GPT-4o | 0.968 | 0.988 | 0.703 | 0.641 | 0.548 | 0.873 | 0.883 | 0.853 | 0.719 | 0.843 | |
Llama-3.1-8B | 0.746 | 0.585 | 0.488 | 0.438 | 0.283 | 0.567 | 0.458 | 0.203 | 0.363 | 0.297 | |
Llama-3.1-70B | 0.994 | 0.954 | 0.780 | 0.598 | 0.413 | 0.784 | 0.932 | 0.767 | 0.551 | 0.784 | |
Gemma-9B | 0.994 | 0.994 | 0.888 | 0.676 | 0.712 | 0.991 | 0.997 | 0.928 | 0.850 | 0.907 | |
Gemma-27B | 0.999 | 0.999 | 0.963 | 0.889 | 0.857 | 0.988 | 0.971 | 0.828 | 0.898 | 0.699 | |
Feedback Evaluator | Feedback Generator: Claude-3 | ||||||||||
GPT-4o | 0.986 | 0.997 | 0.765 | 0.608 | 0.598 | 0.902 | 0.959 | 0.879 | 0.543 | 0.817 | |
Llama-3.1-8B | 0.851 | 0.648 | 0.423 | 0.478 | 0.355 | 0.741 | 0.622 | 0.199 | 0.307 | 0.184 | |
Llama-3.1-70B | 0.997 | 0.988 | 0.829 | 0.389 | 0.628 | 0.894 | 0.978 | 0.770 | 0.453 | 0.773 | |
Gemma-9B | 0.998 | 0.998 | 0.879 | 0.716 | 0.798 | 0.991 | 0.999 | 0.921 | 0.907 | 0.927 | |
Gemma-27B | 0.999 | 0.999 | 0.958 | 0.846 | 0.779 | 0.996 | 0.997 | 0.821 | 0.858 | 0.929 | |
Feedback Evaluator | Feedback Generator: Llama-3.1-70B | ||||||||||
GPT-4o | 0.987 | 0.985 | 0.812 | 0.623 | 0.423 | 0.933 | 0.890 | 0.923 | 0.653 | 0.712 | |
Llama-3.1-8B | 0.829 | 0.540 | 0.363 | 0.367 | 0.244 | 0.750 | 0.619 | 0.218 | 0.202 | 0.185 | |
Llama-3.1-70B | 0.996 | 0.955 | 0.899 | 0.472 | 0.403 | 0.863 | 0.914 | 0.867 | 0.492 | 0.779 | |
Gemma-9B | 0.998 | 0.998 | 0.903 | 0.767 | 0.718 | 0.992 | 0.998 | 0.950 | 0.888 | 0.878 | |
Gemma-27B | 0.999 | 0.999 | 0.985 | 0.869 | 0.720 | 0.996 | 0.968 | 0.926 | 0.921 | 0.854 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Evaluator | Feedback Generator: GPT-4o | ||||||||||
GPT-4o | 0.829 | 0.895 | 0.715 | 0.443 | 0.735 | 0.873 | 0.668 | 0.684 | 0.692 | 0.933 | |
Llama-3.1-8B | 0.635 | 0.487 | 0.574 | 0.476 | 0.253 | 0.522 | 0.433 | 0.266 | 0.361 | 0.275 | |
Llama-3.1-70B | 0.888 | 0.872 | 0.716 | 0.546 | 0.347 | 0.681 | 0.747 | 0.683 | 0.573 | 0.828 | |
Gemma-9B | 0.964 | 0.954 | 0.866 | 0.631 | 0.687 | 0.954 | 0.978 | 0.905 | 0.873 | 0.895 | |
Gemma-27B | 0.981 | 0.977 | 0.751 | 0.825 | 0.900 | 0.945 | 0.929 | 0.692 | 0.898 | 0.532 | |
Feedback Evaluator | Feedback Generator: Claude-3 | ||||||||||
GPT-4o | 0.954 | 0.948 | 0.672 | 0.694 | 0.735 | 0.920 | 0.653 | 0.603 | 0.832 | 0.918 | |
Llama-3.1-8B | 0.780 | 0.633 | 0.533 | 0.348 | 0.203 | 0.650 | 0.647 | 0.177 | 0.243 | 0.195 | |
Llama-3.1-70B | 0.988 | 0.948 | 0.701 | 0.508 | 0.371 | 0.841 | 0.834 | 0.652 | 0.706 | 0.908 | |
Gemma-9B | 0.998 | 0.993 | 0.765 | 0.783 | 0.684 | 0.992 | 0.989 | 0.906 | 0.913 | 0.880 | |
Gemma-27B | 0.999 | 0.995 | 0.819 | 0.934 | 0.885 | 0.973 | 0.975 | 0.838 | 0.958 | 0.626 | |
Feedback Evaluator | Feedback Generator: Llama-3.1-70B | ||||||||||
GPT-4o | 0.878 | 0.835 | 0.741 | 0.708 | 0.770 | 0.928 | 0.751 | 0.748 | 0.883 | 0.922 | |
Llama-3.1-8B | 0.538 | 0.464 | 0.479 | 0.339 | 0.228 | 0.565 | 0.429 | 0.220 | 0.277 | 0.228 | |
Llama-3.1-70B | 0.909 | 0.738 | 0.775 | 0.586 | 0.467 | 0.828 | 0.848 | 0.765 | 0.835 | 0.911 | |
Gemma-9B | 0.960 | 0.933 | 0.865 | 0.750 | 0.736 | 0.985 | 0.960 | 0.953 | 0.944 | 0.888 | |
Gemma-27B | 0.967 | 0.908 | 0.843 | 0.913 | 0.913 | 0.965 | 0.842 | 0.899 | 0.969 | 0.655 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Generator | w/ criteria | ||||||||||
GPT-4o | 0.936 | 0.936 | 0.075 | 0.071 | 0.375 | 0.857 | 0.657 | 0.296 | 0.068 | 0.114 | |
Claude-3 | 0.986 | 0.943 | 0.089 | 0.057 | 0.582 | 0.921 | 0.850 | 0.404 | 0.068 | 0.064 | |
Llama-3.1-70B | 0.989 | 0.889 | 0.104 | 0.043 | 0.225 | 0.893 | 0.782 | 0.604 | 0.132 | 0.039 | |
Feedback Generator | w/o criteria | ||||||||||
GPT-4o | 0.882 | 0.807 | 0.179 | 0.064 | 0.196 | 0.711 | 0.543 | 0.307 | 0.104 | 0.286 | |
Claude-3 | 0.946 | 0.896 | 0.118 | 0.132 | 0.175 | 0.829 | 0.654 | 0.282 | 0.211 | 0.139 | |
Llama-3.1-70B | 0.782 | 0.571 | 0.139 | 0.104 | 0.282 | 0.757 | 0.382 | 0.382 | 0.179 | 0.107 |
Grouped | Individual | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
COR. | REV. | GUI. | DIA. | ENC. | COR. | REV. | GUI. | DIA. | ENC. | ||
Feedback Generator | w/ criteria | ||||||||||
GPT-4o | 0.962 | 0.914 | 0.059 | 0.075 | 0.248 | 0.828 | 0.780 | 0.331 | 0.108 | 0.149 | |
Claude-3 | 0.981 | 0.943 | 0.068 | 0.054 | 0.494 | 0.903 | 0.919 | 0.453 | 0.096 | 0.163 | |
Llama-3.1-70B | 0.980 | 0.914 | 0.107 | 0.046 | 0.246 | 0.897 | 0.849 | 0.571 | 0.125 | 0.062 | |
Feedback Generator | w/o criteria | ||||||||||
GPT-4o | 0.866 | 0.808 | 0.238 | 0.098 | 0.216 | 0.678 | 0.536 | 0.305 | 0.104 | 0.308 | |
Claude-3 | 0.958 | 0.913 | 0.125 | 0.100 | 0.203 | 0.847 | 0.574 | 0.253 | 0.213 | 0.163 | |
Llama-3.1-70B | 0.753 | 0.512 | 0.129 | 0.123 | 0.293 | 0.768 | 0.380 | 0.343 | 0.250 | 0.169 |
Feedback Generator: GPT-4o | ||||||
---|---|---|---|---|---|---|
Great effort, but make sure to carefully read the story again | ||||||
to identify what specifically frightened the child during their zoo visit. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 1 | 1 | 3 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.4 |
Annotator 3 | 1 | 1 | 1 | 1 | 0 | 5 |
Annotator 4 | 1 | 1 | 1 | 1 | 1 | 5 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 1 | 1 | 1 | 0 | 4.48 |
Feedback Generator: Claude-3 | ||||||
Recall the part of the story where Lauren brings Chocolate home and consider | ||||||
where she might find a safe, water-friendly place to keep a turtle temporarily. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 0 | 1 | 3.6 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 3.8 |
Annotator 3 | 1 | 1 | 1 | 1 | 0 | 4.8 |
Annotator 4 | 1 | 1 | 1 | 1 | 0 | 4.6 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 1 | 1 | 0 | 0 | 4.38 |
Feedback Generator: Llama-3.1-70B | ||||||
To determine what Mark wanted, re-read the text to identify the specific reason mentioned | ||||||
for Mark’s motivation to write a story, focusing on the connection to his relationship with Wendy. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 0 | 1 | 5 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.2 |
Annotator 3 | 1 | 1 | 1 | 1 | 1 | 5 |
Annotator 4 | 1 | 1 | 1 | 1 | 0 | 5 |
Annotator 5 | 1 | 1 | 1 | 1 | 0 | 4.6 |
Human Agreement | 1 | 1 | 1 | 0 | 0 | 4.76 |
Feedback Generator: GPT-4o | ||||||
---|---|---|---|---|---|---|
Actually, the store had various types of food including turkey, chocolate, and radishes. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 0 | 0 | 0 | 0 | 4.6 |
Annotator 2 | 1 | 0 | 1 | 1 | 1 | 5 |
Annotator 3 | 1 | 1 | 1 | 1 | 0 | 4.4 |
Annotator 4 | 1 | 0 | 0 | 1 | 0 | 4.8 |
Annotator 5 | 1 | 0 | 1 | 0 | 0 | 4.6 |
Human Agreement | 1 | 0 | 0 | 0 | 0 | 4.68 |
Feedback Generator: Claude-3 | ||||||
While Martha did check under the benches, | ||||||
the story mentions she looked inside the restrooms first before checking other places. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 0 | 0 | 1 | 1 | 3.4 |
Annotator 2 | 1 | 0 | 1 | 1 | 1 | 4.2 |
Annotator 3 | 1 | 0 | 1 | 1 | 0 | 5 |
Annotator 4 | 1 | 0 | 0 | 1 | 0 | 4.8 |
Annotator 5 | 1 | 0 | 1 | 1 | 0 | 5 |
Human Agreement | 1 | 1 | 0 | 1 | 0 | 4.48 |
Feedback Generator: Llama-3.1-70B | ||||||
Since it’s mentioned in the passage that Spike, Molly’s dog, plays with a ball, | ||||||
the correct answer should be A: ball. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 0 | 1 | 1 | 0 | 3.4 |
Annotator 2 | 1 | 0 | 1 | 1 | 1 | 4.2 |
Annotator 3 | 1 | 0 | 1 | 0 | 0 | 4.2 |
Annotator 4 | 1 | 0 | 0 | 1 | 0 | 5 |
Annotator 5 | 1 | 0 | 0 | 1 | 0 | 4.8 |
Human Agreement | 1 | 1 | 0 | 0 | 0 | 4.32 |
Feedback Generator: GPT-4o | ||||||
---|---|---|---|---|---|---|
It looks like you misunderstood the story; try to find | ||||||
which item the pig specifically mentioned forgetting, leading to his reluctance to eat. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 0 | 1 | 1 | 1 | 0 | 4.4 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.4 |
Annotator 3 | 1 | 1 | 1 | 0 | 0 | 4.4 |
Annotator 4 | 1 | 1 | 1 | 1 | 0 | 4.6 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 0 | 1 | 1 | 0 | 0 | 4.56 |
Feedback Generator: Claude-3 | ||||||
Let’s take a closer look at the story and see | ||||||
if we can find any specific mentions of Kim’s food preferences for her birthday dinner. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 0 | 0 | 4.4 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.6 |
Annotator 3 | 1 | 1 | 1 | 1 | 1 | 5 |
Annotator 4 | 1 | 1 | 1 | 0 | 0 | 4.4 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 1 | 1 | 0 | 0 | 4.68 |
Feedback Generator: Llama-3.1-70B | ||||||
Let’s take a closer look at the first sentence of the story to see | ||||||
if we can find the correct answer to when the magical chipmunks come out to play. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 0 | 0 | 5 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4 |
Annotator 3 | 1 | 1 | 1 | 1 | 1 | 5 |
Annotator 4 | 1 | 1 | 1 | 1 | 0 | 4.6 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 1 | 1 | 0 | 0 | 4.72 |
Feedback Generator: GPT-4o | ||||||
---|---|---|---|---|---|---|
Double-check the story’s details; Rose gave her slice to Dave, who took it home for his Bro. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 0 | 1 | 0 | 0 | 4.6 |
Annotator 2 | 1 | 0 | 1 | 1 | 1 | 4.6 |
Annotator 3 | 1 | 0 | 1 | 1 | 1 | 5 |
Annotator 4 | 1 | 0 | 1 | 1 | 0 | 4.6 |
Annotator 5 | 1 | 0 | 1 | 1 | 0 | 5 |
Human Agreement | 1 | 1 | 1 | 0 | 0 | 4.76 |
Feedback Generator: Claude-3 | ||||||
Let’s carefully review the sequence of events in the story, | ||||||
paying close attention to Will’s location when the rain started. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 0 | 1 | 4.2 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.4 |
Annotator 3 | 1 | 1 | 1 | 1 | 1 | 5 |
Annotator 4 | 1 | 1 | 1 | 0 | 1 | 4.2 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 1 | 1 | 0 | 1 | 4.56 |
Feedback Generator: Llama-3.1-70B | ||||||
Consider re-reading the story to find the item that Timmy’s mom specifically said | ||||||
he couldn’t bring to his grandma’s house due to his grandma’s dislike for it. | ||||||
COR. | REV. | GUI. | DIA. | ENC. | Avg. Confidence | |
Annotator 1 | 1 | 1 | 1 | 1 | 0 | 4.4 |
Annotator 2 | 1 | 1 | 1 | 1 | 1 | 4.4 |
Annotator 3 | 1 | 0 | 0 | 1 | 0 | 5 |
Annotator 4 | 1 | 1 | 1 | 1 | 0 | 4.4 |
Annotator 5 | 1 | 1 | 1 | 1 | 1 | 5 |
Human Agreement | 1 | 0 | 0 | 1 | 0 | 4.64 |
References
- Alier, M.; Casañ, M.J.; Filvà, D.A. Smart Learning Applications: Leveraging LLMs for Contextualized and Ethical Educational Technology. In Proceedings of the TEEM 2023; Gonçalves, J.A.d.C., Lima, J.L.S.d.M., Coelho, J.P., García-Peñalvo, F.J., García-Holgado, A., Eds.; Springer: Singapore, 2024; pp. 190–199. [Google Scholar]
- Laato, S.; Morschheuser, B.; Hamari, J.; Björne, J. AI-Assisted Learning with ChatGPT and Large Language Models: Implications for Higher Education. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 226–230. [Google Scholar] [CrossRef]
- Azaiz, I.; Kiesler, N.; Strickroth, S. Feedback-Generation for Programming Exercises with GPT-4. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, Milan, Italy, 8–10 July 2024; pp. 31–37. [Google Scholar]
- Steiss, J.; Tate, T.; Graham, S.; Cruz, J.; Hebert, M.; Wang, J.; Moon, Y.; Tseng, W.; Warschauer, M.; Olson, C.B. Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 2024, 91, 101894. [Google Scholar] [CrossRef]
- Baidoo-anu, D.; Owusu Ansah, L. Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning. J. AI 2023, 7, 52–62. [Google Scholar] [CrossRef]
- Lan, Y.-J.; Chen, N.-S. Teachers’ Agency in the Era of LLM and Generative AI. Educ. Technol. Soc. 2024, 27, I–XVIII. [Google Scholar]
- Dhurandhar, A.; Nair, R.; Singh, M.; Daly, E.; Natesan Ramamurthy, K. Ranking Large Language Models without Ground Truth. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2431–2452. [Google Scholar]
- Chen, G.H.; Chen, S.; Liu, Z.; Jiang, F.; Wang, B. Humans or llms as the judge? A study on judgement biases. arXiv 2024, arXiv:2402.10669. [Google Scholar]
- Gan, W.; Qi, Z.; Wu, J.; Lin, J.C.W. Large Language Models in Education: Vision and Opportunities. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 4776–4785. [Google Scholar] [CrossRef]
- Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large language models for education: A survey and outlook. arXiv 2024, arXiv:2403.18105. [Google Scholar]
- Kortemeyer, G. Performance of the pre-trained large language model GPT-4 on automated short answer grading. Discov. Artif. Intell. 2024, 4, 47. [Google Scholar] [CrossRef]
- Stahl, M.; Biermann, L.; Nehring, A.; Wachsmuth, H. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 6 June 2024; Kochmar, E., Bexte, M., Burstein, J., Horbach, A., Laarmann-Quante, R., Tack, A., Yaneva, V., Yuan, Z., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 283–298. [Google Scholar]
- Mulla, N.; Gharpure, P. Automatic question generation: A review of methodologies, datasets, evaluation metrics, and applications. Prog. Artif. Intell. 2023, 12, 1–32. [Google Scholar] [CrossRef]
- Feng, W.; Lee, J.; McNichols, H.; Scarlatos, A.; Smith, D.; Woodhead, S.; Ornelas, N.; Lan, A. Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3067–3082. [Google Scholar]
- Lee, J.; Smith, D.; Woodhead, S.; Lan, A. Math Multiple Choice Question Generation via Human-Large Language Model Collaboration. arXiv 2024, arXiv:2405.00864. [Google Scholar]
- Dan, Y.; Lei, Z.; Gu, Y.; Li, Y.; Yin, J.; Lin, J.; Ye, L.; Tie, Z.; Zhou, Y.; Wang, Y.; et al. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv 2023, arXiv:2308.02773. [Google Scholar]
- Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199. [Google Scholar] [CrossRef]
- Dai, W.; Lin, J.; Jin, H.; Li, T.; Tsai, Y.S.; Gašević, D.; Chen, G. Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 323–325. [Google Scholar] [CrossRef]
- Liu, H.; Liu, Z.; Wu, Z.; Tang, J. Personalized Multimodal Feedback Generation in Education. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; Scott, D., Bel, N., Zong, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1826–1840. [Google Scholar]
- Koutcheme, C.; Dainese, N.; Hellas, A.; Sarsa, S.; Leinonen, J.; Ashraf, S.; Denny, P. Evaluating Language Models for Generating and Judging Programming Feedback. arXiv 2024, arXiv:2407.04873. [Google Scholar]
- Scarlatos, A.; Smith, D.; Woodhead, S.; Lan, A. Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In Artificial Intelligence in Education; Olney, A.M., Chounta, I.A., Liu, Z., Santos, O.C., Bittencourt, I.I., Eds.; Springer: Cham, Switzerland, 2024; pp. 280–294. [Google Scholar]
- Guo, S.; Latif, E.; Zhou, Y.; Huang, X.; Zhai, X. Using Generative AI and Multi-Agents to Provide Automatic Feedback. arXiv 2024, arXiv:2411.07407. [Google Scholar]
- Lagakis, P.; Demetriadis, S. EvaAI: A Multi-agent Framework Leveraging Large Language Models for Enhanced Automated Grading. In Proceedings of the International Conference on Intelligent Tutoring Systems, Thessaloniki, Greece, 10–13 June 2024; Springer: Cham, Switzerland, 2024; pp. 378–385. [Google Scholar]
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Zhang, S.; Zhu, E.; Li, B.; Jiang, L.; Zhang, X.; Wang, C. Autogen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv 2023, arXiv:2308.08155. [Google Scholar]
- Zhang, Z.; Zhang-Li, D.; Yu, J.; Gong, L.; Zhou, J.; Liu, Z.; Hou, L.; Li, J. Simulating Classroom Education with LLM-Empowered Agents. arXiv 2024, arXiv:2406.19226. [Google Scholar]
- Koutcheme, C.; Dainese, N.; Hellas, A. Using Program Repair as a Proxy for Language Models’ Feedback Ability in Programming Education. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 20 June 2024; Kochmar, E., Bexte, M., Burstein, J., Horbach, A., Laarmann-Quante, R., Tack, A., Yaneva, V., Yuan, Z., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 165–181. [Google Scholar]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 46595–46623. [Google Scholar]
- Chiang, C.H.; Lee, H.y. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 15607–15631. [Google Scholar]
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2511–2522. [Google Scholar]
- Bavaresco, A.; Bernardi, R.; Bertolazzi, L.; Elliott, D.; Fernández, R.; Gatt, A.; Ghaleb, E.; Giulianelli, M.; Hanna, M.; Koller, A.; et al. LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. arXiv 2024, arXiv:2406.18403. [Google Scholar]
- Verga, P.; Hofstatter, S.; Althammer, S.; Su, Y.; Piktus, A.; Arkhangorodsky, A.; Xu, M.; White, N.; Lewis, P. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv 2024, arXiv:2404.18796. [Google Scholar]
- Thakur, A.S.; Choudhary, K.; Ramayapally, V.S.; Vaidyanathan, S.; Hupkes, D. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. arXiv 2024, arXiv:2406.12624. [Google Scholar]
- Jung, J.; Brahman, F.; Choi, Y. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv 2024, arXiv:2407.18370. [Google Scholar]
- Kenton, Z.; Siegel, N.Y.; Kramár, J.; Brown-Cohen, J.; Albanie, S.; Bulian, J.; Agarwal, R.; Lindner, D.; Tang, Y.; Goodman, N.D.; et al. On Scalable Oversight with Weak LLMs Judging Strong LLMs. arXiv 2024, arXiv:2407.04622. [Google Scholar]
- Richardson, M.; Burges, C.J.; Renshaw, E. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 193–203. [Google Scholar]
- Jia, Q.; Young, M.; Xiao, Y.; Cui, J.; Liu, C.; Rashid, P.; Gehringer, E. Insta-Reviewer: A Data-Driven Approach for Generating Instant Feedback on Students’ Project Reports. In Proceedings of the International Educational Data Mining Society, Durham, UK, 24–27 July 2022. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
| Corpus | # Data | Average Words per Story | Average Words per Question |
|---|---|---|---|
| MC160 | 280 | 191.84 | 7.83 |
| MC500 | 1200 | 213.60 | 7.75 |
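As a point of reference, per-corpus statistics like those in the table above can be computed with a simple token count. The sketch below assumes whitespace tokenization and uses a hypothetical two-item corpus in MCTest style; it is illustrative only, not the actual MCTest data.

```python
def average_words(texts):
    """Mean whitespace-token count over a list of strings."""
    return sum(len(t.split()) for t in texts) / len(texts)

# Hypothetical MCTest-style items: each pairs a short story with a question.
corpus = [
    {"story": "Tom had a little dog . The dog liked to run .",
     "question": "What did the dog like to do ?"},
    {"story": "Sara went to the park with her mother on Sunday .",
     "question": "Where did Sara go ?"},
]

stories = [item["story"] for item in corpus]
questions = [item["question"] for item in corpus]
print(average_words(stories))    # → 11.5
print(average_words(questions))  # → 6.5
```

Applying the same computation to the real MC160 and MC500 stories and questions would yield the averages reported in the table.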
| Model Size | Type | Model Name | Version/API Version |
|---|---|---|---|
| Large | Closed-source | GPT-4o | gpt-4o |
| Large | Closed-source | Claude-3 | claude-3-5-sonnet-20240620 |
| Large | Open-source | Llama-3.1-70B | Meta-Llama-3.1-70B-Instruct |
| Model Size | Type | Model Name | Version/API Version |
|---|---|---|---|
| Large | Closed-source | GPT-4o | gpt-4o |
| Large | Open-source | Llama-3.1-70B | Meta-Llama-3.1-70B-Instruct |
| Small + Medium | Open-source | Llama-3.1-8B | Meta-Llama-3.1-8B-Instruct |
| Small + Medium | Open-source | Gemma-9B | gemma-2-9b-it |
| Small + Medium | Open-source | Gemma-27B | gemma-2-27b-it |
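The evaluator models listed above are queried with one of the two prompt strategies described in Section 3.3.2: Grouped (all five criteria in a single prompt) or Individual (one prompt per criterion, run five times). The sketch below illustrates how such prompts can be assembled; the template wording is an assumption, and the criterion descriptions paraphrase the five criteria defined in Section 3.2.1 rather than reproducing the paper's exact prompts.

```python
# Paraphrased versions of the five feedback criteria (COR., REV., GUI., DIA., ENC.).
CRITERIA = {
    "COR.": "The feedback contains no incorrect statements and is relevant to the question and the student's response.",
    "REV.": "The feedback does not directly reveal the correct answer.",
    "GUI.": "The feedback guides the student towards the correct answer.",
    "DIA.": "The feedback identifies and addresses errors in the student's response.",
    "ENC.": "The feedback maintains a positive or encouraging tone.",
}

def grouped_prompt(feedback: str) -> str:
    """Grouped strategy: one prompt covering all five criteria at once."""
    lines = [f"- {name} {desc}" for name, desc in CRITERIA.items()]
    return ("Rate the teacher's feedback on each criterion below (1-5):\n"
            + "\n".join(lines)
            + f"\n\nFeedback: {feedback}")

def individual_prompts(feedback: str) -> list[str]:
    """Individual strategy: five separate prompts, one criterion each."""
    return [
        "Rate the teacher's feedback on the following criterion (1-5):\n"
        f"- {name} {desc}\n\nFeedback: {feedback}"
        for name, desc in CRITERIA.items()
    ]

prompts = individual_prompts("Good try! Re-read the second paragraph.")
print(len(prompts))  # → 5 (one evaluator call per criterion)
```

Under the Individual strategy, each criterion is thus scored in isolation, which removes any cross-criterion interaction within a single evaluator response.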
| Feedback Generator | COR. | REV. | GUI. | DIA. | ENC. |
|---|---|---|---|---|---|
| GPT-4o | 4.96 | 4.94 | 3.88 | 3.48 | 3.98 |
| Claude-3 | 5.00 | 4.96 | 3.90 | 3.38 | 4.40 |
| Llama-3.1-70B | 4.96 | 4.82 | 3.82 | 3.38 | 3.86 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Seo, H.; Hwang, T.; Jung, J.; Kang, H.; Namgoong, H.; Lee, Y.; Jung, S. Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Appl. Sci. 2025, 15, 671. https://doi.org/10.3390/app15020671