Background and objectives: Large language models (LLMs) show promise in automating open-ended evaluation tasks, yet their reliability in rubric-based assessment remains uncertain. Variability in scoring, feedback, and rubric adherence raises concerns about transparency and pedagogical validity in educational contexts. This study introduces CourseEvalAI, a framework designed to enhance consistency and fidelity in rubric-guided evaluation by fine-tuning a general-purpose LLM with authentic university-level instructional content.
Methods: The framework employs supervised fine-tuning with Low-Rank Adaptation (LoRA) on rubric-annotated answers and explanations drawn from undergraduate computer science exams. Responses generated by both the base and fine-tuned models were independently evaluated by two human raters and two LLM judges, applying dual-layer rubrics for answers (technical or argumentative) and explanations. Inter-rater reliability was reported as the intraclass correlation coefficient (ICC(2,1)), Krippendorff's α, and quadratic-weighted Cohen's κ (QWK), and statistical analyses included Welch's t-tests with Holm–Bonferroni correction, Hedges' g with bootstrap confidence intervals, and Levene's tests. All responses, scores, feedback, and metadata were stored in a Neo4j graph database for structured exploration.
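The abstract does not specify the tooling behind these analyses, so the following is only a minimal sketch of how the reported statistics could be computed, assuming per-item rubric scores are available as NumPy arrays and using SciPy, statsmodels, and scikit-learn; the score scale, sample sizes, and placeholder p-values are hypothetical, and ICC(2,1) and Krippendorff's α are omitted because they would typically require additional packages (e.g., pingouin, krippendorff).

```python
# Hedged sketch (not the authors' code): agreement and significance statistics
# of the kind described in the Methods, on hypothetical score data.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Hypothetical per-item scores (0-10 rubric scale) for one rubric dimension.
base_scores = rng.normal(6.0, 1.5, size=60).clip(0, 10)
tuned_scores = rng.normal(7.0, 1.0, size=60).clip(0, 10)

# Welch's t-test (unequal variances) and Levene's test for variance homogeneity.
t_stat, p_welch = stats.ttest_ind(tuned_scores, base_scores, equal_var=False)
lev_stat, p_levene = stats.levene(tuned_scores, base_scores)

def hedges_g(a, b):
    # Bias-corrected standardized mean difference (Hedges' g).
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                        / (n1 + n2 - 2))
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample bias correction
    return correction * (a.mean() - b.mean()) / pooled_sd

# Percentile bootstrap confidence interval for Hedges' g.
boot = [hedges_g(rng.choice(tuned_scores, len(tuned_scores), replace=True),
                 rng.choice(base_scores, len(base_scores), replace=True))
        for _ in range(2000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Holm-Bonferroni correction across several rubric-dimension contrasts
# (the other p-values here are placeholders).
reject, p_adjusted, _, _ = multipletests([p_welch, 0.04, 0.08, 0.20],
                                         alpha=0.05, method="holm")

# Quadratic-weighted Cohen's kappa between two raters' integer rubric scores.
rater_a = rng.integers(0, 5, size=60)
rater_b = np.clip(rater_a + rng.integers(-1, 2, size=60), 0, 4)
qwk = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

print(f"Welch t={t_stat:.2f}, Holm-adjusted p={p_adjusted[0]:.3f}, "
      f"g={hedges_g(tuned_scores, base_scores):.2f} [{ci_low:.2f}, {ci_high:.2f}], "
      f"Levene p={p_levene:.3f}, QWK={qwk:.2f}")
```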
Results: The fine-tuned model consistently outperformed the base version across all rubric dimensions, achieving higher scores for both answers and explanations. After multiple-testing correction, only the Technical Answer contrast judged by the Generative Pre-trained Transformer (GPT-4) remained statistically significant; other contrasts showed positive trends without passing the adjusted threshold, and no additional significance was claimed for explanation-level results. Variance in scoring decreased, inter-model agreement increased, and evaluator feedback on fine-tuned outputs contained fewer vague or critical remarks, indicating stronger rubric alignment and greater pedagogical coherence. Inter-rater reliability analyses indicated moderate human–human agreement and weaker alignment of the LLM judges with the human mean.
Originality: CourseEvalAI integrates rubric-guided fine-tuning, dual-layer evaluation, and graph-based storage into a unified framework. This combination provides a replicable and interpretable methodology that enhances the consistency, transparency, and pedagogical value of LLM-based evaluators in higher education and beyond.