Article

Beyond BLEU: GPT–5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation

1 Faculty of Education, Universitas Islam Riau, Pekanbaru 28284, Indonesia
2 Department of Informatics Engineering, Universitas Islam Riau, Pekanbaru 28284, Indonesia
3 Department of Library Science, Universitas Lancang Kuning, Pekanbaru 28266, Indonesia
4 Faculty of Education and Vocation, Universitas Lancang Kuning, Pekanbaru 28266, Indonesia
5 Department of Computer Engineering, Faculty of Engineering, Izmir Institute of Technology, Izmir 35430, Turkey
6 Faculty of Information Science and Engineering, Ritsumeikan University, Kusatsu 525-8577, Japan
* Authors to whom correspondence should be addressed.
Submission received: 20 October 2025 / Revised: 26 December 2025 / Accepted: 14 January 2026 / Published: 22 January 2026

Abstract

This paper investigates the use of large language models (LLMs) as evaluators in multidimensional machine translation (MT) assessment, focusing on the English–Indonesian language pair. Building on established evaluation frameworks, we adopt an MQM-aligned rubric that assesses translation quality along morphosyntactic, semantic, and pragmatic dimensions. Three LLM-based translation systems (Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)) are evaluated using both expert human judgments and an LLM-based evaluator (GPT–5), allowing for a detailed comparison of alignment, bias, and consistency between human and AI-based assessments. In addition, a classroom calibration study is conducted to examine how rubric-guided evaluation supports alignment among novice evaluators. The results indicate that GPT–5 exhibits strong agreement with human evaluators in terms of relative quality ranking, while systematic differences in absolute scoring highlight calibration challenges. Overall, this study provides insights into the role of LLMs as reference-free evaluators for MT and illustrates how multidimensional rubrics can support both research-oriented evaluation and pedagogical applications in a mid-resource language setting.

1. Introduction

Machine translation (MT)—the automatic translation of text or speech from one language to another—has long been recognized as one of the most challenging tasks in artificial intelligence. Over the past decade, advances in neural network architectures and the availability of large parallel corpora have led to dramatic improvements in MT quality [1]. Recently, the emergence of large language models (LLMs) has further revolutionized the field. General-purpose LLMs trained on massive multilingual data have demonstrated the ability to perform translation without explicit parallel training, raising the question of whether they can replace or augment dedicated MT systems [2]. For example, GPT–5 (a state-of-the-art LLM) can produce highly fluent translations in many languages, suggesting that LLMs are now at a point where they could potentially serve as general MT engines. However, the extent of their advantages over traditional neural MT and their limitations across different languages remain active areas of research.
Early evaluations indicate a mixed picture. On one hand, LLMs like ChatGPT/GPT–5 have achieved translation quality on high-resource language pairs that is competitive with or even surpasses specialized MT models in certain cases. For instance, a recent study [3] in the medical domain found that GPT-4’s translations from English to Spanish and Chinese were over 95% accurate, on par with commercial MT (Google Translate) in those languages. However, GPT-4’s accuracy dropped to 89% for English–Russian, trailing Google Translate’s performance in that case. Another evaluation comparing ChatGPT to Google’s MT on patient instructions found that ChatGPT excelled in Spanish (only 3.8% of sentences mistranslated vs. 18.1% for Google) but underperformed in Vietnamese, with ChatGPT mistranslating 24% of sentences compared to 10.6% for Google [4]. These discrepancies highlight that LLM translation quality varies widely by language, content, and context. In literary translation, for example, one study observed that human translators still significantly outperformed ChatGPT in accuracy (94.5% vs. 77.9% on average), even though ChatGPT’s output was often fluent and grammatically correct [5]. Such findings reinforce that being fluent is not the same as being faithful: LLMs may produce very natural-sounding text but can subtly distort meaning, especially in nuanced or domain-specific content.
A key challenge, therefore, is how to evaluate translation quality in this new era. Traditional automatic metrics like BLEU, ROUGE, or TER—based on n-gram overlap with reference translations—are fast and reproducible, but they have well-documented limitations [6,7]. They often correlate poorly with human judgments, failing to capture semantic adequacy or subtle errors. As a result, researchers have argued for more comprehensive evaluation frameworks. The Multidimensional Quality Metrics (MQM) paradigm, for instance, evaluates translations along multiple error categories (accuracy, fluency, terminology, style, etc.) instead of a single score [8,9], providing a more granular assessment than single-score metrics. Such multi-aspect human evaluation yields deeper insight, revealing cases where a translation might score high on adequacy but still contain grammatical errors or unnatural phrasing. Recent studies have shown that MQM can effectively capture the nuances of translation quality, especially when used with human annotations [9,10]. In recent MT competitions, the WMT24 Metrics Shared Task utilized MQM to benchmark LLM-based translations, confirming the robustness of fine-tuned neural metrics [10]. MT competitions have increasingly relied on MQM-style human assessments, which confirm that even human evaluators often disagree when forced to give a single overall score. This suggests that breaking down quality into linguistic dimensions yields more reliable and informative assessments.
Interestingly, LLMs themselves are now being explored as evaluation tools. If an LLM can understand and generate language, perhaps it can also judge translation quality by comparing a translation against the source for fidelity and fluency. Early studies are promising: some have found that certain LLMs’ judgments correlate surprisingly well with human evaluation. For example, Kocmi and Federmann showed in [11] that ChatGPT (GPT-3.5) can rank translations with correlations comparable to traditional human rankings. Research on LLM-based evaluators (sometimes called GPT evaluators or G-Eval) has demonstrated that prompting GPT-4 with detailed instructions and even chain-of-thought reasoning can yield evaluation scores approaching human agreement [12]. Beyond translation, similar concerns about annotation reliability have been raised in low-resource NLP tasks. A recent comparative study assessed annotation quality across Turkish, Indonesian, and Minangkabau NLP tasks, showing that while LLM-generated annotations can be competitive—particularly for sentiment analysis—human annotations consistently outperform them on more complex tasks, underscoring LLM limitations in handling context and ambiguity [13]. Building on this, Nasution et al. [14] benchmarked 22 open-source LLMs against ChatGPT-4 and human annotators on Indonesian tweet sentiment and emotion classification, finding that state-of-the-art open models can approach closed-source LLMs but with notable gaps on challenging categories (e.g., Neutral sentiment and Fear emotion). These studies highlight broader themes of evaluator calibration, annotation reliability, and model consistency in low-resource settings—issues that directly parallel the challenges of using LLMs for MT evaluation.
In summary, the landscape of MT in 2025 is being reshaped by large language models, bringing both opportunities and open questions. This paper aims to contribute along two fronts: (1) assessing how a specialized smaller-scale LLM for a particular language pair (English–Indonesian) compares to larger general models on translation quality and (2) examining the efficacy of a multidimensional evaluation approach that combines human judgments and GPT–5. We focus on three core dimensions of translation quality—morphosyntactic accuracy (grammar/syntax and word form correctness), semantic accuracy (fidelity to source meaning), and pragmatic effectiveness (appropriate style, tone, and coherence in context). Using these fine-grained criteria, we evaluate translations produced by three different LLMs. Human evaluators and GPT–5 are both used to rate the translations, allowing us to analyze where an LLM evaluator agrees with or diverges from human opinion. By concentrating on a relatively under-studied language pair (English–Bahasa Indonesia) and a detailed evaluation scheme, our study provides new insights into LLM translation capabilities and the feasibility of AI-assisted evaluation. The goal is to inform both the development of more effective translation models and the design of evaluation frameworks that can reliably benchmark translation quality in the era of LLMs.
In this work, we present a comprehensive evaluation of three LLM-based translation systems of different scales on an English–Indonesian translation task. We introduce a multi-aspect evaluation framework with human raters and GPT–5 (throughout this paper, “GPT–5” refers specifically to the GPT–5-chat endpoint available in Azure AI Foundry during October 2025), and we analyze the correlation and differences between GPT-based and human-based assessments. The experimental results demonstrate that a moderately sized, translation-focused LLM can outperform a larger general-purpose model on this task and that GPT–5’s evaluation aligns with human judgment to a high degree (overall correlation ≈0.82) while systematically giving higher scores. We discuss the implications of these findings for developing cost-effective translation solutions and hybrid human–AI evaluation methods. This study contributes a detailed examination of human and GPT–5 evaluation on a real English–Indonesian translation task and aims to provide a focused case study on the use of LLMs as evaluators within multidimensional MT assessment. We next situate our study within recent work on LLM-based MT and multi-aspect evaluation.
To summarize, this paper contributes to MT evaluation and translation pedagogy in four main ways. First, it proposes a multidimensional evaluation framework incorporating morphosyntactic, semantic, and pragmatic aspects. Second, it empirically demonstrates that GPT–5 can approximate human judgments across these dimensions with high reliability. Third, it provides a novel classroom validation study showing that rubric calibration significantly enhances students’ scoring consistency. Finally, it bridges automated evaluation and translation education, offering a unified framework for both model assessment and evaluator training.

2. Related Work

2.1. LLMs in Machine Translation

The advent of models like GPT-3, GPT–5, PaLM, and others has prompted numerous studies re-examining machine translation through the lens of large language models. These works generally find that LLMs now play a significant role in MT, though not a uniformly dominant one [15,16]. A recent comprehensive survey situating MT in the era of LLMs is provided in [1]. The authors note that LLMs fine-tuned on multilingual data can both replace and assist traditional MT systems, especially in high-resource settings. For example, GPT-4 was reported to surpass Facebook’s NLLB-200 (a strong dedicated MT model) in about 40% of translation directions—a remarkable achievement. However, LLMs still fell short of industry MT systems (e.g., Google Translate) in many cases, particularly for low-resource languages [15]. This indicates that while LLMs have strong generalization ability, specialized MT models retain an edge in certain scenarios, likely due to domain and terminology optimization.
Empirical comparisons in real-world domains echo this pattern. Kong et al. [3] evaluated ChatGPT (GPT-4) against Google Translate on patient discharge instructions in Spanish, Chinese, and Russian. They found that GPT-4 was highly accurate in Spanish (97% correct sentences) and Chinese (95%), essentially matching Google’s quality in those languages. This demonstrates that for high-resource languages, an LLM not explicitly trained as a translator can achieve professional-grade results in factual, instructional text. However, for Russian (a somewhat lower-resource language for GPT-4), accuracy fell to 89% for GPT-4 vs. 80% for Google, and GPT’s advantage was less pronounced. Importantly, both tools made only very few clinically harmful errors (under 1% of sentences) across languages. In a related medical translation study, Rao et al. [4] compared ChatGPT-3.5 to Google on patient educational materials in Spanish, Vietnamese, and Russian. They observed that ChatGPT significantly outperformed Google in Spanish, making errors in only 3.8% of sentences vs. 18.1% for Google. This can be attributed to Spanish being well-represented in GPT’s training data and structurally close to English. In contrast, for Vietnamese (a low-resource, linguistically distant language), ChatGPT’s error rate was higher (24.2%) than Google’s (10.6%). For Russian, both struggled (ChatGPT 35.6% errors, Google 41.6%), yielding unacceptably low quality in a medical context. These findings highlight that LLM performance in MT is uneven: excellent for some languages and contexts, but unreliable for others. The variability often correlates with training data richness; languages and domains underrepresented in the LLM’s training corpora see degraded performance. Moreover, as [1] discusses, LLMs tend to be English-centric, sometimes translating via English as a pivot, which can hurt direct non-English pair translations [16,17].
LLMs have also been tested in creative and literary translation. Here, the consensus is that human translators still have a clear edge in preserving nuanced meaning, cultural context, and stylistic elements. For example, a 2025 study examined an Arabic literary text and found that human translation achieved about 94.5% accuracy in meaning transfer, whereas ChatGPT’s translation achieved ∼77.9% [5]. The ChatGPT output was fluent and grammatically well-formed—often indistinguishable from human translation in terms of language naturalness—but it lost or altered important subtleties in the narrative. This supports a general observation: LLMs excel at fluency, sometimes even appearing overly “polished,” but fidelity to the source can be a weakness, especially when translating metaphor, humor, or culturally specific references. Ataman et al. note that LLMs, by virtue of their training, exhibit increased non-literalness and paraphrasing, which improves naturalness but sometimes comes at the cost of precise meaning alignment. They also warn of hallucinations—instances where LLMs insert content not present in the source—which can be catastrophic for high-stakes translations. Hallucination is more common with low-resource languages or when the model is prompted in ways that confuse its implicit knowledge. Our work contributes to this discussion by evaluating how well LLMs handle a mid-resource language (Indonesian) and by quantifying their strengths (e.g., fluency) and weaknesses (semantic accuracy) through separate scores.

2.2. Multidimensional Evaluation and LLM-Based Metrics

As MT systems reach higher levels of fluency, the MT research community has increasingly turned to fine-grained evaluation methods. It is no longer sufficient to report a single BLEU score; doing so can mask critical errors in otherwise fluent output. Multidimensional Quality Metrics (MQM) and similar frameworks have gained traction, especially in recent years’ shared tasks. MQM involves human raters categorizing errors (omission, mistranslation, grammatical error, stylistic issue, etc.) and scoring severity, yielding a more holistic quality profile than a single metric [9,18,19]. Using such approaches, researchers have discovered that translations deemed high-quality by BLEU or other automatic metrics often still contain notable errors. For example, a translation might receive a decent BLEU score by capturing the general meaning with different wording, yet a human rater could identify a severe mistranslation of a named entity or a pronoun error. Ataman et al. [1] emphasize that relying on single-number metrics can be misleading, as even human experts frequently disagree on an overall score for a translation. Instead, a multi-level scoring (like MQM) or separate dimension scoring can provide more reliable insights. Inspired by this, our evaluation methodology assigns separate scores on three dimensions—aiming to capture (i) linguistic well-formedness, (ii) semantic accuracy, and (iii) pragmatic appropriateness—analogous in spirit to MQM’s fluency and accuracy axes, with an added lens on context/pragmatics.
It is important to note that multidimensional MT evaluation itself is well established in the literature. Frameworks such as MQM, BLONDE [20], Direct Assessment [21], and Error Span Annotation [22] have shaped MT evaluation practice and demonstrated the value of separating linguistic dimensions when assessing quality. Our study does not propose a new evaluation framework; rather, it adopts an MQM-aligned perspective and operationalizes it in a form suitable for English–Indonesian evaluation and instructional contexts. This ensures that the dimensions used in our rubric—covering linguistic well-formedness, semantic accuracy, and pragmatic appropriateness—remain consistent with prior work while being tailored to the pedagogical setting examined in this study.
Another novel direction is the use of LLMs as evaluators [11,12]. LLM-based evaluation methods have now been widely explored across MT and NLG, and approaches such as G-Eval [12], LLM-as-judge [23], and reference-free LLM scoring [24] have shown that large models can approximate human judgments with high consistency. Our study does not aim to introduce a new LLM evaluation paradigm; rather, it adapts these emerging approaches to a multidimensional rubric tailored for the English–Indonesian language pair, allowing us to compare GPT–5 and human evaluators along linguistically interpretable dimensions. This framing differentiates our work from prior LLM-as-judge research, which has typically focused on English or high-resource language settings and often on single dimension or holistic scoring. For Indonesian MT specifically, substantial progress has been made through initiatives such as NusaMT [25] and IndoNLG [26], which provide high-quality datasets, benchmarks, and pretrained models for Indonesian and related languages. These efforts highlight that Indonesian MT evaluation is an active area of research. Our study does not claim novelty in developing MT systems for Indonesian; instead, it contributes a detailed comparison of human and GPT–5 evaluators within a multidimensional rubric for this mid-resource language pair, a perspective that complements and extends prior Indonesian MT benchmarks.
The idea of an automatic evaluation metric is not new—BLEU and others have existed for decades—but these metrics are essentially string-comparison heuristics. What if we leverage the full understanding capabilities of an LLM to judge a translation? This question has led to research on LLM-based evaluation metrics for MT and other NLG tasks. One approach is to prompt an LLM (like GPT–5) with the source text and a candidate translation and ask it to provide a rating or verdict (often with justifications). Preliminary studies show that such models can serve as remarkably good evaluators for certain tasks: Liu et al. (2023) introduced G-Eval, which used GPT-4 to score summaries on multiple dimensions, and found that it achieved a higher correlation with human judgments than several traditional metrics [12]. For MT, Kocmi and Federmann [11] reported that ChatGPT-based ranking of translations mirrored human ranking results on a WMT test set to a high degree, even outperforming some learned metrics like COMET. These results suggest LLMs implicitly know a lot about what a “good translation” looks like.
However, caution is warranted. Bias and calibration are challenges for LLM evaluators. As noted in one study, an LLM may have a tendency to overrate text that resembles LLM-generated text, leading to a form of systematic bias. In other words, if the MT output was produced by a model similar to the evaluator, the evaluator might inherently consider the style more favorably—a kind of AI groupthink. Additionally, without careful prompting, LLM evaluators might display high variance or be overly generous. A recent study found that including a chain of thought (i.e., having the LLM explicitly reason step by step about the translation’s quality) improved the LLM’s consistency and made its ratings more aligned with humans [27]. They also experimented with probabilistic scoring (asking the model to output a calibrated probability of adequacy) versus direct scoring, finding some trade-offs in correlation metrics. The general conclusion is that LLM evaluators hold great promise—they could eventually replace many test-set references and human evaluation rounds, saving time and cost—but we must first validate their outputs and understand their failure modes. Our study contributes to this line of inquiry by comparing an LLM evaluator (GPT–5) to human evaluators on the same translations. By analyzing where GPT–5’s scores diverge from human scores, we shed light on how reliably GPT–5 can assess translation quality across different aspects. We also compute correlation coefficients to quantify alignment. Such analysis complements recent work (e.g., a 2025 study in translation education [28]) that explored using ChatGPT for providing feedback on student translations, finding it useful for flagging errors but not fully in agreement with instructor grading. In sum, related work suggests parallel trends: the push for multi-aspect human evaluation (e.g., MQM) and the emergence of AI-based evaluation. 
In this paper, we merge these by having both humans and an AI evaluate multiple aspects, thus positioning our approach at the intersection of these research frontiers.

3. Method

We conducted an experiment where the translations by the three models were assessed by three expert human evaluators and GPT–5 across all evaluation dimensions. Inter-annotator agreement was measured using Krippendorff’s α and weighted κ, and annotator competence was estimated using MACE.
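As a concrete illustration of the agreement computation, the sketch below estimates mean pairwise quadratic-weighted Cohen's κ over 5-point rubric scores using scikit-learn. This is a minimal sketch, not the authors' exact pipeline (which also includes Krippendorff's α and MACE); the rater names and scores are hypothetical.

```python
# Illustrative sketch (not the authors' exact pipeline): mean pairwise
# quadratic-weighted Cohen's kappa over 5-point rubric scores.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_weighted_kappa(ratings):
    """ratings: dict mapping rater name -> list of 1-5 scores for the same items."""
    pairs = list(combinations(ratings, 2))
    kappas = [
        cohen_kappa_score(ratings[a], ratings[b], weights="quadratic")
        for a, b in pairs
    ]
    return sum(kappas) / len(kappas)

# Hypothetical scores from three raters on five translations.
scores = {
    "rater1": [5, 4, 3, 2, 4],
    "rater2": [5, 4, 3, 3, 4],
    "rater3": [4, 4, 3, 2, 5],
}
print(round(mean_pairwise_weighted_kappa(scores), 3))
```

Quadratic weighting penalizes large disagreements (e.g., 1 vs. 5) more heavily than adjacent ones (e.g., 4 vs. 5), which suits ordinal rubric scales.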

3.1. MQM Mapping and Evaluation Rubric

To ensure that the three proposed linguistic metrics are not ad hoc but grounded in a recognized translation quality framework, we mapped them to the Multidimensional Quality Metrics (MQM) standard. MQM provides a taxonomy of translation errors categorized by accuracy, fluency, style, locale conventions, and related subcategories. By aligning our rubric-based metrics with MQM, we ensure methodological validity and comparability with prior MT evaluation studies.
Morphosyntactic corresponds to MQM categories under Linguistic Conventions and Accuracy, specifically targeting grammar, agreement, and word-form issues, as well as errors in textual conventions, transliteration, or hallucinated forms. This dimension captures whether the translation is grammatically well-formed and morphologically consistent.
Semantic aligns with MQM categories such as Accuracy, Design and Markup, and Audience Appropriateness, covering subcategories including mistranslation, omission, and addition. It evaluates the fidelity of the conveyed meaning, ensuring that no critical content is missing or distorted.
Pragmatic relates to the MQM categories Design and Markup, Locale Conventions, and Terminology, with subcategories including style, formality, and audience appropriateness. It addresses the appropriateness of tone, register, and cultural or contextual fit, extending beyond purely linguistic correctness.
Table 1 and Table 2 present a structured mapping of our three linguistic metrics to the corresponding MQM categories and subcategories, together with their optimized descriptions. This mapping provides transparency on how the proposed rubric-based evaluation integrates into the broader MQM framework and clarifies the interpretability of the scores.
Human annotators followed a detailed 5-point rubric for each dimension (morphosyntactic, semantic, and pragmatic). The condensed 5-point evaluation rubric for translation quality is listed in Table 3. The rubric defines scoring guidelines for every level from 1 (very poor) to 5 (excellent), covering grammar and morphology, semantic fidelity, and pragmatic appropriateness. The full rubric is provided in Appendix A.
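The three dimensions and the 5-point scale can be encoded as a small validation layer when collecting scores programmatically. The sketch below paraphrases the rubric described above (see Table 3 and Appendix A for the authors' full wording); the one-line descriptors are illustrative condensations, not the official rubric text.

```python
# Condensed paraphrase of the three-dimension, 5-point rubric; the
# descriptor strings are illustrative, not the authors' exact wording.
RUBRIC = {
    "morphosyntactic": "grammar, agreement, and word-form correctness",
    "semantic": "fidelity to source meaning (no omission, addition, distortion)",
    "pragmatic": "tone, register, and cultural/contextual appropriateness",
}
SCALE = {1: "very poor", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}

def validate_score(dimension, score):
    """Reject scores outside the 5-point scale or unknown dimensions."""
    if dimension not in RUBRIC:
        raise ValueError(f"unknown dimension: {dimension}")
    if score not in SCALE:
        raise ValueError(f"score must be 1-5, got {score}")
    return (dimension, score, SCALE[score])
```

Validating annotations at collection time prevents out-of-range or mislabeled scores from silently entering the agreement analysis.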

3.2. Models and Translation Task

To examine evaluation behavior in English–Indonesian translation rather than benchmark official model releases, we selected three large language models (LLMs) of varying parameter sizes and training characteristics as fixed translation generators. All models used in this study correspond to concrete snapshot builds distributed through the Ollama runtime at the time of experimentation. These snapshot builds follow Ollama’s internal versioning conventions and do not necessarily align with the official release nomenclature of their respective model families. Importantly, all models were used without any fine-tuning or task-specific adaptation, ensuring that differences in evaluation outcomes reflect properties of the generated translations rather than training interventions. Accordingly, any observed differences in translation quality should not be interpreted as claims about the relative capabilities of official model releases or scaling laws.
Crucially, we emphasize that this paper is not a benchmarking study. We do not claim that the Ollama-instantiated models used here are equivalent to, or representative of, the official checkpoints distributed by Alibaba, Meta, or Google. The selected models serve solely as fixed translation generators producing outputs of varying quality, which are then used to analyze evaluation behavior, specifically the alignment, bias, and calibration differences between human evaluators and an LLM-based evaluator (GPT–5). Observations such as a smaller Ollama build receiving higher scores than a larger one are therefore reported strictly as empirical outcomes of the specific snapshot builds tested and are not generalized beyond this context or interpreted as statements about official model scaling laws.
  • Qwen 3 (0.6B) (https://ollama.com/library/qwen3 (accessed on 12 January 2026)): The Qwen 3 family is an officially released and documented series of large language models developed by Alibaba Cloud, including a 0.6B-parameter variant [29]. Official checkpoints and documentation are available via the Hugging Face model hub (https://huggingface.co/Qwen (accessed on 12 January 2026)) and the Qwen3 GitHub repository (https://github.com/QwenLM/Qwen3 (accessed on 12 January 2026)). In this study, however, we used the Ollama-distributed Qwen 3 (0.6B) build rather than the official checkpoints. This build contains approximately 0.6 billion parameters and represents a lightweight, general-purpose language model with some multilingual capability but no specialized training or task-specific optimization for machine translation. We include this model as a small-scale baseline to examine how a compact LLM handles English–Indonesian translation under a fixed zero-shot prompt.
  • LLaMA 3.2 (3B) (https://ollama.com/library/llama3.2 (accessed on 12 January 2026)): The LLaMA family is a widely adopted open-weight model line developed by Meta AI and trained on diverse multilingual corpora, with official documentation (https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2 (accessed on 12 January 2026)) and downloadable checkpoints (https://www.llama.com/llama-downloads (accessed on 12 January 2026)) provided for LLaMA 3.2 [30]. In this study, we used an Ollama-distributed snapshot build labeled LLaMA 3.2 (3B), following Ollama’s internal versioning conventions, rather than the official Meta release. This build contains approximately 3 billion parameters and was used as provided by Ollama, without any fine-tuning or task-specific adaptation. We use this model as a mid-scale general LLM representative for zero-shot translation, rather than as a benchmark of official LLaMA releases.
  • Gemma 3 (1B) (https://ollama.com/library/gemma3 (accessed on 12 January 2026)): Gemma is an open family of efficient language models developed by Google DeepMind, designed for research and local deployment [31]. The Gemma 3 family is officially documented (https://deepmind.google/models/gemma/gemma-3/ (accessed on 12 January 2026)), where Ollama is explicitly referenced as a supported method for local deployment. In line with this official guidance, we instantiated Gemma 3 (1B) using the Ollama-distributed build, without any fine-tuning or customization. Although not explicitly trained for English–Indonesian translation, the model is capable of producing translations under appropriate prompting and serves as a comparative translation generator alongside Qwen and LLaMA.
The translation direction is English to Bahasa Indonesia, chosen because Indonesian is a widely spoken language that nonetheless qualifies as a moderately resourced language (it has a substantial web presence and some translation corpora, but far less than languages like French or Chinese). This allows us to explore how LLMs handle a language that is neither extremely high-resource nor truly low-resource. Indonesian also presents interesting linguistic challenges; it has relatively simple morphology (no gender or plural inflections, but extensive affixation), flexible word order, and contextual pronoun drop, which means that translations require care in preserving meaning that might be implicit in English pronouns or tense markers.

3.3. Reproducibility and Resources

To support transparency and reproducibility, we document all resources and experimental configurations used in this study. Model inference was performed on a system equipped with four NVIDIA RTX A6000 GPUs (48 GB VRAM each), using CUDA 12.6 and NVIDIA driver version 560.35.05. In practice, inference typically used a single GPU per model instance, with memory usage ranging from 2 to 3 GB depending on model size and quantization. Models were deployed via the Ollama runtime, with CPU fallback used only when GPU execution was not supported.
All models were executed under a consistent inference configuration using the default decoding parameters defined by the Ollama runtime: temperature = 0.8, top-p = 0.9, and a maximum output length of 256 tokens. We intentionally retained these default settings to avoid introducing tuning-induced variability and to ensure that the comparison reflects standard, out-of-the-box model behavior that can be replicated easily by other researchers. The three models (Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)) were used exactly as provided in the Ollama library, without any additional fine-tuning or customization. For each model, we employed a simple prompt format, such as “Translate the following English sentence to Indonesian: [sentence]”. The output from each model—the Indonesian translation hypothesis—was then recorded for evaluation.
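The inference configuration described above can be sketched as a request to Ollama's local HTTP generate API. The endpoint and option names follow Ollama's documented API; the prompt string paraphrases the example given above, and the model tags shown are assumptions about how the snapshot builds are named in the Ollama library.

```python
# Sketch of the zero-shot translation call: build a request payload for
# Ollama's /api/generate endpoint with the default decoding parameters
# retained in this study (temperature 0.8, top-p 0.9, 256 output tokens).
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model, sentence):
    return {
        "model": model,      # e.g. "qwen3:0.6b", "llama3.2:3b", "gemma3:1b" (assumed tags)
        "prompt": f"Translate the following English sentence to Indonesian: {sentence}",
        "stream": False,     # return the full translation in one response
        "options": {
            "temperature": 0.8,
            "top_p": 0.9,
            "num_predict": 256,  # maximum output length in tokens
        },
    }

payload = build_request("qwen3:0.6b", "The weather is nice today.")
print(json.dumps(payload, indent=2))
# To run: POST this JSON to OLLAMA_URL against a local Ollama server.
```

Keeping decoding parameters in a single payload builder makes it straightforward to verify that all three models were queried under an identical configuration.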
Following Salhazan et al. [32], we adopt the TED 2018 dataset containing parallel English–Indonesian sentences. We randomly selected 100 sentence pairs, constrained to sentence lengths between 20 and 30 tokens to ensure moderate complexity. In the resulting sample, sentence lengths range from 20 to 29 tokens, with an average length of 24 tokens. Although the sample size is modest, it is sufficient for analyzing evaluator agreement, ranking consistency, and calibration effects, which are the primary focus of this study rather than large-scale system benchmarking. The reference translations were only used to compute BLEU scores. For both the LLM-based and human evaluations, the reference translations were hidden to avoid bias from reference phrasing. Instead, evaluators relied solely on the source and the system output, ensuring that adequacy, fluency, and the three linguistic metrics were judged directly rather than by overlap with a gold standard. In this study, BLEU is included as a traditional reference-based baseline to provide context alongside the reference-free evaluation performed by human raters and GPT–5. While humans and GPT–5 do not receive access to reference translations, BLEU is computed in the standard way using the available references and serves primarily as a comparative anchor rather than a direct competitor to reference-free scoring.
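The length-constrained sampling step can be sketched as below, assuming the parallel corpus is available as a list of (source, reference) string pairs; the helper name and seed are illustrative:

```python
import random

def sample_pairs(pairs, n=100, min_len=20, max_len=30, seed=0):
    """Filter parallel (source, reference) pairs by English token length,
    then draw a random sample of n pairs."""
    eligible = [(src, ref) for src, ref in pairs
                if min_len <= len(src.split()) <= max_len]
    rng = random.Random(seed)
    return rng.sample(eligible, n)

# BLEU, the reference-based baseline, could then be computed with a
# third-party library such as sacrebleu (assumed, not shown running here):
# import sacrebleu
# bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
```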
All GPT–5 evaluation prompts were fixed prior to experimentation and applied consistently across all translation outputs. The full prompt is provided in Appendix B. Human evaluators and GPT–5 were guided by the same evaluation rubric, covering morphosyntactic, semantic, and pragmatic aspects with identical scoring scales.
All evaluation data, human annotations, GPT–5 outputs, and prompt templates, along with the Python scripts used in this study, are publicly available in an online repository (https://github.com/arbihazanst/multidimentional-MT-Eval (accessed on 12 January 2026)). To protect participant privacy, student data and model outputs are anonymized.

3.4. Human Evaluation Procedure

We recruited three bilingual evaluators to assess the translation outputs. All three evaluators are native or fluent speakers of Indonesian with advanced proficiency in English. Two have backgrounds in linguistics and translation studies, and the third is a professional translator. This mix was intended to balance formal knowledge of language with practical translation experience. Before the evaluation, we conducted a brief training session to familiarize the evaluators with the rating criteria and ensure consistency. We provided examples of translations and discussed how to interpret the rating scales for each aspect.
Each evaluator was given the source English sentence and the model’s Indonesian translation and asked to rate it on three aspects:
  • Morphosyntactic Quality: Is the translation well-formed in terms of grammar, word order, and word forms? For this aspect, raters set aside whether the translation is meaningfully accurate and focus entirely on the correctness of the morphology and syntax. They check for issues such as incorrect affixes, plurality, tense markers (where applicable in Indonesian), word order mistakes, agreement errors, or any violation of Indonesian grammatical norms. A score of 5 means the sentence is grammatically perfect and natural; a 3 indicates some awkwardness or minor errors; and a 1 indicates severely broken grammar that impedes understanding.
  • Semantic Accuracy: Does the translation faithfully convey the meaning of the source? This is essentially adequacy: how much of the content and intent of the original is preserved. Raters compare the Indonesian translation against the source English to identify any omissions, additions, or meaning shifts. A score of 5 means the translation is completely accurate with no loss or distortion of meaning; a 3 means some nuances or minor details are lost/mistranslated but the main message is there; and a 1 means it is mostly incorrect or missing significant content from the source.
  • Pragmatic Appropriateness: Is the translation appropriate and coherent in context and style? This covers aspects like tone, register, and overall coherence. Raters judge if the translation would make sense and be appropriate for an Indonesian reader in the intended context of the sentence. For example, does it use the correct level of formality? Does it avoid unnatural or literal phrasing that, while grammatically correct, would sound odd to a native speaker? This category also captures whether the translation is pragmatically effective—e.g., if the source had an idiom, was it translated to an equivalent idiom or explained in a way that an Indonesian reader would understand the intended effect? A score of 5 means the translation not only is correct but also feels native—one could not easily tell it was translated. A 3 might indicate that it is understandable but has some unnatural phrasing or slight tone issues. A 1 would mean that it is pragmatically inappropriate or incoherent (perhaps overly literal or culturally off-base).
Each aspect was rated on an integer 1–5 scale (where 5 = excellent, 1 = very poor). We explicitly instructed that scores should be assigned independently; e.g., a grammatically perfect but meaning-wrong translation could receive morphosyntactic = 5 and semantic = 1. By collecting these separate scores, we obtain a profile of each translation’s strengths and weaknesses.
To ensure consistency, the evaluators first conducted a pilot round of 10 sample translations (covering all three models in random order) and discussed any discrepancies in the scoring. After refining the shared understanding of the criteria, they proceeded to rate the full set. Each model’s 100 translations were mixed and anonymized, so evaluators did not know which model produced a given translation (to prevent any bias). In total, we collected 3 (evaluators) ×   100 (sentences) ×   3 (aspects) = 900 human ratings per model, or 2700 ratings overall.
For analysis, we typically consider the average human score on each aspect for a given translation. We compute per-sentence averages across the three evaluators for the morphosyntactic, semantic, and pragmatic dimensions. This gives a more stable score per item and smooths out individual rater variance. We also compute an overall human score per sentence by averaging the three aspect scores (this overall score is not the primary evaluation measure; it is used for correlation analysis and summary statistics).
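The aggregation just described amounts to simple means; a minimal sketch (the aspect keys are illustrative):

```python
import statistics

def per_item_scores(ratings):
    """ratings: {aspect: [r1, r2, r3]} for one translation, one list entry
    per human evaluator. Returns per-aspect averages and the overall mean."""
    aspect_avg = {a: statistics.mean(v) for a, v in ratings.items()}
    overall = statistics.mean(aspect_avg.values())
    return aspect_avg, overall
```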

3.5. GPT–5 Evaluation Procedure

For the automatic evaluator, we used the GPT–5-chat endpoint provided in Azure AI Foundry. This endpoint is part of the GPT–5 model family released by OpenAI and made available through Microsoft Azure prior to our experiments (October 2025). To avoid ambiguity, all references to “GPT–5” in this paper denote this specific Azure-hosted endpoint. Because no public model card or architectural specification was provided for GPT–5-chat, we include full details of our evaluation settings to support reproducibility: the decoding parameters (temperature = 0.8, top-p = 0.9, and maximum output length = 256 tokens) and the exact prompt templates, which are listed in Appendix B. This clarification ensures that our findings characterize the behavior of the deployed endpoint rather than assume undocumented model properties.
It is important to clarify that GPT–5 was not used to generate any of the translation outputs evaluated in this study. All translations were produced solely by Qwen 3, LLaMA 3.2, and Gemma 3. GPT–5 served only as an evaluation agent, functioning analogously to an additional rater in parallel with the human evaluators. No part of the translation set was produced by GPT–5, and no conclusions in this paper concern GPT–5’s translation ability. Rather, our focus is on GPT–5’s behavior as a rubric-guided evaluator.
For each English source sentence and its machine translation, we crafted a prompt (see Appendix B) for GPT–5 instructing it to act as a translation quality evaluator. We avoided revealing which model produced the translation, and we randomized the order of sentences, so GPT–5 evaluated each translation independently. We verified qualitatively on a few cases that GPT–5’s scoring was reasonable (for example, it gave lower scores to sentences where obvious meaning errors were present). Thereafter, we automated this process for all 100 sentences for each model.
Notably, GPT–5 was not given any “reference” or hint beyond the source and translation; this is effectively a zero-shot evaluation. We did not elicit chain-of-thought reasoning or explanations, both to limit cost and to streamline the output (though we acknowledge that including reasoning might further improve reliability). Still, GPT–5’s strong language understanding suggests that it can make sound judgments even with this straightforward prompting.
Using GPT–5 as an evaluator allows us to compare its ratings directly with the average human ratings for each translation and aspect. We treat GPT–5 as an additional “evaluator 4,” and we can compute metrics like Pearson correlation between GPT–5’s scores and the human average scores on each aspect. High correlation would indicate that GPT–5 agrees with humans on which translations are good or bad; consistent differences in magnitude would indicate bias. We can also identify specific instances where GPT–5 disagrees with all humans to qualitatively analyze potential reasons (though due to brevity we focus mostly on aggregate results in this paper).
All prompts and model outputs were logged for transparency. The combination of human and GPT–5 evaluations provides a rich evaluation dataset for analysis.

3.6. Classroom Evaluation Study

To complement the quantitative analysis, a classroom experiment was conducted to evaluate whether rubric-based calibration activities improve students’ evaluation consistency. A total of 26 undergraduate translation students participated in this study. Students acted as evaluators using the same 5-point rubric (morphosyntactic, semantic, and pragmatic) applied in the main evaluation. The experiment consisted of three phases using Google Forms:
  • Pre-test: Students rated 15 translations (5 English–Indonesian sentences translated by Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)).
  • Post-test: After a short discussion clarifying how to apply the rubric and what constitutes scores 1–5, students re-evaluated the same items.
  • Final test: Students rated five new items (indices 25, 47, 52, 74, and 97), each translated by the same three models.
All ratings were compared against majority scores determined by three human experts and GPT–5. Two evaluation metrics were computed for each phase: mean absolute error (MAE) and Exact Match Rate. MAE captures the average deviation from the majority score, while Exact Match Rate is the proportion of student ratings that exactly match the majority score. The majority score is used strictly as a consensus anchor, not as a gold standard for evaluating GPT–5 itself. Human ratings form the primary basis of the consensus, and GPT–5’s score contributes only as an additional rater to stabilize tie-breaking. GPT–5 is therefore not being validated against a reference it determined; instead, the classroom study evaluates rubric alignment and calibration effects rather than GPT–5’s correctness.
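The two phase-level metrics can be computed directly; a minimal sketch, where the inputs are parallel lists of per-item student scores and majority scores:

```python
def mae(preds, majority):
    """Mean absolute error against the majority (consensus) scores."""
    return sum(abs(p - m) for p, m in zip(preds, majority)) / len(preds)

def exact_match_rate(preds, majority):
    """Proportion of ratings that exactly equal the majority score."""
    return sum(p == m for p, m in zip(preds, majority)) / len(preds)
```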

3.7. Ethical Considerations

This study involved human participation in the form of expert evaluation and a classroom calibration activity, so we adhered to institutional ethical guidelines for human subject research. The study protocol was reviewed and approved by the Institutional Review Board (IRB) at our institution. Expert evaluators participated voluntarily, provided informed consent, and were compensated for their time. They were informed that the materials under evaluation consisted of machine-generated translations and that their ratings would be used for research purposes related to MT evaluation.
The classroom activity was conducted as part of routine instructional practice. Student responses were anonymized prior to analysis, and no personally identifiable or sensitive information was collected. Under institutional guidelines, this component was reviewed internally and deemed exempt from full IRB review as minimal-risk educational research. Across both components, no test sentences contained sensitive or personal content. Evaluator identities and individual scores were kept confidential, and only aggregated results are reported. By ensuring voluntary participation, transparency of purpose, and data anonymization, we aimed to uphold established ethical standards for research involving human judgment.

4. Results

4.1. Quantitative Evaluation Scores

After collecting all ratings, we first examine the average scores obtained by each model on each aspect, as evaluated by humans and by GPT–5. Table 4, Table 5 and Table 6 summarize these results, including mean scores and standard deviations, for the three models. Each score is on a 1–5 scale (higher is better). The human scores are averaged across the three human evaluators per item and then across all items for the model; the GPT scores are averaged across the 100 items for the model.
Gemma 3 achieves the highest human-rated scores on all three aspects. For example, Gemma’s average semantic score is 3.75, compared to 3.33 for LLaMA and 2.53 for Qwen. This indicates that Gemma’s translations most accurately preserve meaning. Similarly, for morphosyntactic quality, Gemma averages 3.87 (approaching “very good” on our scale), higher than LLaMA’s 3.45 and Qwen’s 3.01. Pragmatic quality shows a similar ranking (Gemma 3.83 > LLaMA 3.40 > Qwen 2.59). In terms of overall quality (the average of all aspects), human evaluators rated Gemma around 3.82/5, LLaMA about 3.40/5, and Qwen only 2.71/5. In practical terms, Gemma’s translations were often deemed good on most criteria, LLaMA’s were fair to good with some issues, and Qwen’s ranged from poor to fair, frequently requiring improvement.
The differences between models are statistically significant. We performed paired t-tests on the human scores (pairing by source sentence since each sentence was translated by all models). For each aspect, Gemma’s scores were significantly higher than LLaMA’s ( p < 0.001 in all cases), and LLaMA’s were in turn significantly higher than Qwen’s ( p < 0.001 ). This confirms a clear ordering: Gemma > LLaMA > Qwen in translation quality. Notably, while Gemma (1B) achieved higher human-rated scores than LLaMA (3B) in our experiments, we interpret this result with caution. Because all models were evaluated using snapshot builds distributed through the Ollama runtime, the observed ranking likely reflects the specific pretrained mixtures, parameterization choices, and quantization strategies of these builds rather than any general claim about the intrinsic superiority of smaller models. Accordingly, we frame this as an empirical finding specific to our evaluation setting rather than a broader statement about model scaling.
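The paired comparison pairs scores by source sentence, since each sentence was translated by all models. A sketch of the paired t statistic using only the standard library (with n − 1 degrees of freedom; p-values would come from a t distribution, e.g., via SciPy):

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic for per-sentence score differences (a - b).
    a and b are lists of scores aligned by source sentence."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```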
All models scored lower on semantic accuracy than on the other aspects in the human evaluation. For each model, the semantic scores in Table 5 (3.75, 3.33, and 2.53, respectively) are the lowest among the three aspects. This suggests that meaning preservation is the most challenging aspect for the systems. Many translations that were fluent and grammatical still lost some nuance or made minor errors in meaning. For example, Qwen often dropped subtle content or mistranslated a word, earning it a lower semantic score even if the sentence was well-formed. LLaMA and Gemma performed better but still occasionally missed culturally specific meanings or implied context. By contrast, morphosyntactic scores are the highest for two of the models (LLaMA and Gemma). This indicates that from a purely grammatical standpoint, the translations were quite solid, especially for Gemma, which likely reflects its model design and pretraining/data composition. Qwen’s morphosyntax score (3.01) was a bit higher than its pragmatic score and much higher than its semantic score, reflecting that even when it produces grammatically correct Indonesian, it often fails to convey the full meaning. Pragmatic scores were in between: e.g., LLaMA’s pragmatic 3.40 vs. morph 3.45 (almost equal) and Gemma’s pragmatic 3.83 vs. morph 3.87 (almost equal). This means that aside from pure accuracy, the naturalness and appropriateness of translations track closely with overall fluency for these models. Gemma and LLaMA produce fairly natural-sounding translations, with only occasional hints of unnatural phrasing or wrong tone. Qwen, on the other hand, had a pragmatic score (2.59) closer to its semantic score (2.53), meaning the poorer translations were both inaccurate and awkwardly worded.
These tendencies align with known behaviors of MT systems and LLMs. The fact that accuracy (semantic) lags behind fluency is a common observation: modern neural models often produce fluent output (thanks to strong language modeling) that can mask internal errors in meaning [9]. Our evaluators caught those, hence the lower semantic scores. This underscores the value of evaluating aspects separately—if we had only a single score, we might not realize that meaning fidelity is a pain point. It also corroborates what Ataman et al. noted about LLM translations: they tend to be more paraphrastic and sometimes compromise on strict fidelity [1].
Next, we examine how the GPT–5 (AI) evaluation compares. The GPT columns in Table 4, Table 5 and Table 6 show that GPT–5’s scores were consistently higher than the human scores for all models and aspects. For instance, GPT–5 rated Gemma’s morphosyntactic quality as 4.25 on average, whereas humans gave 3.87. GPT–5 even gave Qwen’s translations a ∼3.09 in morphosyntax (above “acceptable”), whereas humans averaged 3.01 (just borderline acceptable). This suggests GPT–5 was slightly more lenient on grammar. The difference is more pronounced in the pragmatic aspect: GPT–5 gave Gemma 4.35 vs. humans’ 3.83 and LLaMA 4.09 vs. humans’ 3.40. It appears that GPT–5 found the translations more pragmatically acceptable than the human evaluators did. One possible reason is that GPT–5 might not pick up on subtle style or tone issues as strictly as a human native speaker would—it might focus on whether the content makes sense and is coherent, which it usually does, and thus give a high score. Human evaluators, however, might notice if a phrase, while coherent, is not the way a native would phrase it in that context (slight awkwardness). This points to a potential bias of GPT–5 to overestimate naturalness. Another contributing factor could be that GPT–5, lacking cultural intuition, assumes something is fine if grammatically fine, whereas humans use real-world expectations.
That said, it is remarkable that GPT–5’s scores follow the same ranking of models: Gemma > LLaMA > Qwen. For example, GPT–5’s overall scores are ∼4.20 for Gemma, 3.92 for LLaMA, and 3.04 for Qwen, maintaining the relative differences. Even on a per-aspect basis, GPT–5 consistently scored Gemma highest and Qwen lowest. We can visualize these comparisons in the figures below.
From Figure 1 and Figure 2, we clearly see the upward shift in GPT–5’s scoring. For instance, on pragmatic quality (rightmost group of bars), GPT–5 scored even Qwen’s translations around 3.2 on average, whereas human evaluators scored them around 2.6. For LLaMA and Gemma, GPT–5 gave many translations a full 5 for pragmatics (finding them perfectly acceptable in tone), whereas human raters often gave 4, explaining the ∼0.5–0.7 gap in means. Despite this discrepancy in absolute terms, the correlation in trend is strong.
We calculated the Pearson correlation coefficient between GPT–5’s scores and the average human scores across the set of 300 translation instances (100 sentences × 3 models). This was performed separately for each aspect. The results are as follows:
  • Morphosyntactic: r ≈ 0.724;
  • Semantic: r ≈ 0.807;
  • Pragmatic: r ≈ 0.782.
All correlation coefficients are high and statistically significant (p < 0.001). Semantic accuracy had the highest agreement (r ≈ 0.81), which is encouraging: it means GPT–5 often concurred with human evaluators on which translations got the meaning right versus which did not. Morphosyntactic had a slightly lower correlation (r ≈ 0.72), possibly because GPT–5 sometimes misses minor grammatical errors that humans catch (or, conversely, penalizes something humans did not mind). Pragmatic falls in between (r ≈ 0.78). We also computed an overall correlation (pairing each translation’s average of the three human scores with the average of GPT–5’s three scores), which yielded r = 0.822, as shown in Figure 3.
The consistency in model ranking and the high correlations indicate that GPT–5 largely agrees with human evaluators on relative translation quality. GPT–5 correctly identifies that, for example, Gemma’s outputs are better than Qwen’s, despite not being given any information about which model produced which output. This suggests that GPT–5 is able to discriminate translation quality directly from the content in a manner broadly aligned with human evaluators, consistent with recent findings that LLMs can function as effective judges of NLG quality. However, while the rank-order agreement is strong, it does not imply calibration to human scoring scales. Across all dimensions, GPT–5 assigns scores that are approximately 0.3–0.6 points higher on average than those of human evaluators, indicating a systematic generosity. This bias suggests that GPT–5 is best interpreted as a reliable ordinal evaluator rather than an interchangeable substitute for expert judgment when absolute thresholds are required. Practical calibration strategies—such as linear rescaling, subtracting a constant bias term, or emphasizing pairwise comparisons—may improve alignment when GPT–5 is used as an automatic evaluation metric.
To delve deeper, we also analyzed the distribution of scores. Gemma’s translations received a human score of 4 or 5 (good to excellent) in 67% of instances for morphosyntactic, 60% for semantic, and 62% for pragmatic. Qwen, by contrast, received 4 or 5 in only 30% of instances for morphosyntactic and under 20% for semantic and pragmatic, meaning the majority of Qwen’s outputs were mediocre or poor in meaning and style. LLaMA was intermediate, with roughly 50% of its outputs rated good. GPT–5’s scoring, on the other hand, rated a larger fraction as good. For example, GPT–5 gave Gemma a full 5 in pragmatics for 55% of sentences, whereas humans gave a 5 for only 40%. Notably, even the humans rarely awarded a perfect 5, as they used the full range of the scale. GPT–5 tended to cluster its scores at 3–5 and seldom used 1 or 2, except for clear failures. Humans likewise did not use 1 often (outright incoherent translations were few), but they did assign scores of 2 to some Qwen outputs whose meaning was severely compromised.

4.2. Cross-Metric Correlation

Figure 4 visualizes the Pearson correlations among all evaluation metrics in the final experiment (adequacy, fluency, morphosyntactic, semantic, pragmatic, and BLEU). Two strong convergence patterns emerge. First, adequacy and semantic are nearly indistinguishable at the item level (r ≈ 1.00), indicating that the rubric used for adequacy closely operationalizes the same construct as our semantic dimension—faithfulness to source content. Second, fluency and morphosyntactic are also essentially collinear (r ≈ 1.00), suggesting that the sentence-level grammatical well-formedness we captured under morphosyntactic is the primary driver of perceived fluency in our dataset. Cross-dimension correlations remain high (e.g., adequacy↔fluency ≈ 0.81; morphosyntactic↔pragmatic ≈ 0.88; and adequacy↔pragmatic ≈ 0.87), consistent with the trends in Table 4, Table 5 and Table 6 and the ranking agreement observed in Figure 1, Figure 2 and Figure 3.
From a measurement perspective, the matrix suggests a two-factor structure: (i) a meaning factor where adequacy and semantic load near 1.0, and (ii) a form/fluency factor where fluency and morphosyntactic load near 1.0. Pragmatic correlates strongly with both factors (0.87–0.89), acting as a bridge between “being correct” and “sounding right.” This aligns with our qualitative analysis: items can be grammatically flawless yet miss nuance (semantic shortfalls) or semantically faithful yet stylistically off (pragmatic shortfalls). The high but not perfect cross-links (e.g., ∼0.81 between adequacy and fluency) reinforce the value of reporting aspects separately rather than collapsing to a single overall score.
Moreover, the low correlations between BLEU and all rubric-based metrics (about 0.22–0.23 across the board) underscore that reference overlap captures a different signal than the human/GPT–5 rubric scores. Because both human evaluators and GPT–5 assessed translations without access to the reference translations, the divergence between BLEU and the rubric-based dimensions is expected and reflects a methodological asymmetry rather than a deficiency of BLEU itself. BLEU operates strictly as a reference-based metric, whereas the human and LLM evaluators generate reference-free judgments of fidelity, naturalness, and pragmatic appropriateness. In this context, BLEU provides limited explanatory power for item-level variation but remains useful as a traditional baseline that complements the reference-free evaluators. This observation aligns with results in Table 4, Table 5 and Table 6 and with established findings regarding the limitations of reference-based metrics for paraphrastic yet faithful translations. Taken together, the heatmap supports our design choice to foreground semantic and pragmatic judgments alongside form-related dimensions and to treat BLEU as an auxiliary rather than primary signal.
The very high correlations observed between the adequacy–semantic pair and the fluency–morphosyntactic pair indicate that these dimensions behave as tightly coupled constructs in practice. This does not weaken the proposed framework; rather, it shows that the linguistic dimensions align well with the meaning and form components of the MQM hierarchy. Accordingly, in the remainder of the analysis we retain semantic as the representative meaning-related dimension and morphosyntactic as the representative form-related dimension. Pragmatic remains separate, contributing unique variance linked to tone, idiomaticity, and stylistic appropriateness—areas in which GPT–5 exhibited the most leniency. Thus, the multidimensional structure provides clarity without redundancy, while the observed correlations naturally support a streamlined reporting scheme that improves interpretability and pedagogical usefulness. If a composite score is desired, a two-factor aggregation (meaning + form, with pragmatic partially loading on both) is statistically appropriate and consistent with both the MQM hierarchy and the empirical structure of our data.

4.3. Per-Aspect Dispersion and Human–GPT Means

Table 7 restates the per-aspect means and standard deviations for both human experts and GPT–5. Three observations emerge. First, GPT–5’s means are uniformly higher than human means across models and aspects (e.g., for Gemma morphosyntactic: H = 3.870 vs. GPT = 4.25; for LLaMA pragmatic: H = 3.403 vs. GPT = 4.09), quantifying the leniency already visible in Figure 2 relative to Figure 1. Second, GPT–5’s dispersion (SD) is comparable to or slightly higher than human SDs (e.g., Qwen Pragmatic: H = 0.890 vs. GPT = 1.001), suggesting that the LLM uses the full scale but hesitates to assign very low scores, which inflates means without collapsing variance.
Third, comparing across aspects, human SDs tend to be largest for semantic and pragmatic, the dimensions where nuanced disagreements are likeliest (e.g., Qwen semantic SD = 0.834; LLaMA pragmatic SD = 0.816). GPT–5 mirrors this pattern (e.g., LLaMA semantic SD = 1.055; Gemma pragmatic SD = 0.936), which aligns with our item-level examples where style/idiom and subtle meaning shifts drive disagreement. Together, Table 7 complements Figure 1 and Figure 2 by showing that the human–GPT difference is mostly a level shift (bias) rather than a fundamental change in relative spreads, supporting the case for simple calibration of GPT–5’s scale.

4.4. Inter-Annotator Agreement by Model and Aspect

Table 8 reports Krippendorff’s α (ordinal) and averaged quadratic weighted κ across humans + GPT–5, broken down by model and aspect. Agreement is moderate overall, with the highest values for semantic (e.g., Gemma: α = 0.624, κ̄ = 0.621; Qwen: α = 0.518, κ̄ = 0.521) and lower values for morphosyntactic and pragmatic in LLaMA (e.g., pragmatic α = 0.445). Two patterns are noteworthy: (i) higher-quality outputs (Gemma) yield higher agreement, likely because errors are clearer (fewer borderline cases), and (ii) pragmatic shows the most variability across models, consistent with style/tone being more subjective and context-sensitive.
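For reference, quadratic-weighted κ for a pair of raters on the 1–5 scale can be computed as below; this is the standard textbook formulation, not necessarily the exact script behind Table 8:

```python
def quadratic_weighted_kappa(a, b, categories=(1, 2, 3, 4, 5)):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    # Observed joint distribution of the two raters' labels.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1 / n
    # Marginal distributions (expected agreement under independence).
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic disagreement weights, 0 on the diagonal.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1 - num / den
```

Krippendorff’s α (ordinal) additionally handles more than two raters and missing ratings; third-party implementations (e.g., the `krippendorff` package) are commonly used for it.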
The closeness between α and weighted κ within aspects indicates that pairwise rater consistency and multi-rater reliability tell a coherent story here. Importantly, these values are in the “moderate-to-substantial” band for semantic and pragmatic—precisely the dimensions we most care about for user-facing quality. This supports the reliability of the rubric and suggests our guidance was sufficiently specific for consistent judgments without over-constraining raters.
At the same time, the moderate level of inter-annotator agreement (α ≤ 0.62) indicates that human judgment itself contains non-trivial variability. For this reason, we treat the human consensus not as an infallible gold standard, but as a practical reference point that reflects domain expert expectations within the limits of rater reliability. This also establishes a natural upper bound on the degree of alignment any automated evaluator, including GPT–5, can realistically achieve in this setting.

4.5. Annotator Competence (MACE) Across Aspects

Table 9 summarizes MACE competence (probability of correctness) per annotator and aspect. GPT–5 attains the highest or near-highest competence in morphosyntactic (0.651) and semantic (0.634), closely followed by the strongest human rater(s); in pragmatic, Evaluator 2 (0.601) slightly surpasses GPT–5 (0.572), reflecting the human advantage on cultural/register nuances already discussed. The spread among human annotators is modest (e.g., semantic: 0.533–0.611), indicating a reasonably tight human panel.
Taken together, the results in Table 9 position GPT–5 as a competent fourth rater that can reduce variance and cost in large-scale evaluations, while humans retain an edge in pragmatic subtleties. In practice, one could weight annotators by MACE competence in a pooled score or use GPT–5 for triage plus human adjudication on items likely to involve tone/idiom. We note that our Likert data were treated as categorical for MACE; while this is common, ordinal-aware alternatives (e.g., Gwet’s AC2) can be reported in parallel for completeness (we already supply Krippendorff’s α and weighted κ ).
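The competence-weighted pooling suggested above is a simple weighted mean; a sketch with hypothetical competence values (the annotator keys and weights are illustrative, not the values in Table 9):

```python
def pooled_score(ratings, competence):
    """Consensus score for one item: weight each annotator's rating by
    their MACE competence estimate."""
    total = sum(competence.values())
    return sum(r * competence[a] for a, r in ratings.items()) / total
```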

4.6. Examples of Evaluation Differences

To illustrate the evaluation in concrete terms, it is useful to consider a few example sentences and how each model performed:
  • Source: “Archimedes was an ancient Greek thinker, and …” (continuation omitted).
    Qwen’s translation: “Archimedes adalah peneliti ahli Yunani sebelum…” (roughly, “Archimedes was a Greek expert researcher from before…”). This received Morph = 3, Sem = 2, and Prag = 3 from humans, who commented that “peneliti ahli” (“expert researcher”) is an odd choice for “thinker” (a semantic error) and that the sentence trailed off inaccurately.
    LLaMA’s translation: “Archimedes adalah pemikir Yunani kuno…” (similar to Gemma’s below but missing the article or some nuance) received Morph = 5, Sem = 4, and Prag = 4.
    Gemma’s translation: “Archimedes adalah seorang pemikir Yunani Kuno…” (literally “Archimedes was an Ancient Greek thinker…”), which is almost a perfect translation. Humans gave it Morph = 5, Sem = 5, and Prag = 5, a rare unanimous perfect score.
    Here all models were grammatically acceptable (hence Qwen’s Morph = 3 is not too low: just one minor grammar issue with “peneliti ahli”), but semantic accuracy separated them: Qwen changed the meaning, LLaMA preserved the meaning but omitted a slight detail, and Gemma was spot on. GPT–5 gave scores of (3, 2, 3) to Qwen, (5, 4, 4) to LLaMA, and (5, 4, 5) to Gemma, matching the human pattern, though GPT–5 rated Qwen’s grammar a bit higher than humans did (perhaps not catching the issue with “peneliti ahli”).
  • Source: “When the devil bites another devil, it actually…” (colloquial expression). This is tricky pragmatically because it is figurative.
    Qwen’s translation mistranslated the structure outright, yielding something incoherent (Sem = 1, Prag = 1).
    Gemma’s translation: “Ketika iblis menggigit iblis lainnya, sebenarnya…”, which is very literal (“When a devil bites another devil, actually…”). Indonesian evaluators noted that it was grammatically fine but pragmatically odd: “iblis menggigit iblis” is not a known saying. They gave it Morph = 4, Sem = 4, and Prag = 2, citing that the tone did not carry over (the source likely implied conflict between bad people, an idiom that was not localized).
    LLaMA’s translation was similarly literal and likewise received Morph = 4, Sem = 4, and Prag = 2.
    In this case, GPT–5 overrated the pragmatic aspect for Gemma and LLaMA, giving them Prag = 4 where humans gave 2. GPT–5 likely saw a grammatically correct sentence and, not recognizing the idiom, assumed it was acceptable, whereas human translators knew it missed the idiomatic meaning. This example illustrates a limitation: GPT–5 lacked the cultural context to see the pragmatic failure.
  • Source: “The project is going very, very well, and…” (conversational tone).
    Qwen: “Proyek ini berjalan sangat, sangat baik, dan…” (literal translation; in Indonesian doubling “sangat” is a bit unusual but understandable). Humans: Morph = 4 (a slight stylistic issue), Sem = 5, and Prag = 3 (tone a bit off, could use “sangat baik sekali” instead).
    LLaMA: “Proyeknya berjalan dengan sangat baik, dan…” (more natural phrasing), received Morph = 5, Sem = 5, and Prag = 5.
    Gemma: “Proyek ini berjalan dengan sangat baik, dan…” (also excellent). Here all convey the meaning; it is about style. Qwen’s phrasing was less idiomatic (hence pragmatic 3). GPT–5 gave Qwen Prag = 4 (it did not flag the style issue), showing again a slight leniency or lack of that nuance.
Overall, our results show each model’s strengths and weaknesses clearly, and GPT–5’s evaluation, while not identical to humans, is largely in agreement with human assessments on a broad scale. Gemma as a fine-tuned model excelled in both fluency and accuracy; it made the fewest errors, and those it did make were minor (e.g., perhaps overly literal at times, but still correct). LLaMA had generally good translations but occasionally dropped details or used phrasing that was correct yet slightly unnatural. Qwen, being the smallest and not specialized, had numerous issues: mistranslations (hence low semantic scores) and some clunky Indonesian constructions (lower pragmatic scores).
In terms of error types, by cross-referencing low-scoring cases we found that Qwen’s errors were often outright mistranslations or omissions. For example, it might translate “not uncommon” as “common” (flipping meaning)—a semantic mistake leading to a semantic score of 1 or 2. LLaMA’s errors were more subtle—e.g., it might choose a wrong synonym or fail to convey a nuance like a modal particle, resulting in a semantic score of 3 or 4 but rarely 1. Gemma’s “errors” were mostly stylistic choices or slight over-formality. There were almost no cases of Gemma clearly mistranslating content; its semantic scores were mostly 4–5, indicating high adequacy. Its lowest pragmatic scores (a few 2s) happened on idiomatic or colloquial inputs where a more culturally adapted translation existed but Gemma gave a literal rendering. This aligns with the known difficulty of pragmatic equivalence—something even professional translators must handle carefully.
Finally, it is worth emphasizing that none of the models produced dangerously wrong outputs in our test. Unlike some reports of LLMs “hallucinating” in translations, we did not observe cases where entirely unrelated or extraneous information was introduced. This may be due to the relatively simple, self-contained nature of our test sentences. More complex inputs might provoke hallucinations or large errors, which would be important for future work to test (especially for tasks like summarizing and then translating, or translating ambiguous inputs).

4.7. Results of the Classroom Evaluation Study

The descriptive statistics in Table 10 show that rubric calibration improved both the accuracy and consistency of student judgments. The overall MAE decreased steadily (0.97 → 0.83), indicating that students’ ratings moved closer to the majority reference. The largest reduction appeared in the semantic dimension (1.03 → 0.86), reflecting an improved grasp of meaning fidelity after rubric clarification. The morphosyntactic and pragmatic dimensions also exhibited smaller yet steady improvements, implying that students grew more consistent in recognizing structural and stylistic quality. The Exact Match Rate increased from 0.30 to 0.50, meaning half of the students’ scores in the final test were identical to the expert/GPT–5 majority judgments. The absence of “Outside” deviations (beyond ±1) further confirms that calibration aligned students’ interpretations with the rubric scale, minimizing rating variability.
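The three classroom alignment metrics are straightforward to compute from paired ratings. The sketch below shows one possible implementation with illustrative data; the function names and example values are ours, not the study’s scripts.

```python
# Sketch of the classroom alignment metrics, assuming integer Likert
# ratings and a majority-vote reference. Values below are illustrative.
def mae(student, reference):
    """Mean absolute error between student ratings and the reference."""
    return sum(abs(s - r) for s, r in zip(student, reference)) / len(student)

def exact_match_rate(student, reference):
    """Fraction of ratings identical to the reference."""
    return sum(s == r for s, r in zip(student, reference)) / len(student)

def outside_rate(student, reference, tol=1):
    """Fraction of ratings deviating from the reference by more than +/- tol."""
    return sum(abs(s - r) > tol for s, r in zip(student, reference)) / len(student)

student = [3, 4, 5, 2, 4]      # one student's ratings on five items
reference = [4, 4, 4, 2, 5]    # majority of experts plus GPT-5
```

In this toy example the MAE is 0.6 and the Exact Match Rate is 0.4, with no “Outside” deviations, mirroring the kind of per-phase summaries reported in Table 10.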
Figure 5 illustrates how the mean absolute error (MAE) decreased across all tests and aspects, confirming a steady convergence toward the majority reference. Students initially showed greater variation in the semantic aspect (highest MAE in the pre-test), suggesting difficulties in assessing meaning fidelity. After rubric clarification and in-class discussion, the MAE for the semantic aspect dropped sharply in the post-test and continued to improve in the final test. Morphosyntactic and pragmatic aspects also exhibited downward trends, indicating that students became more consistent in judging grammatical accuracy and tone appropriateness. Overall, the decreasing MAE trend demonstrates that rubric calibration and iterative evaluation practice effectively reduced scoring deviation across linguistic dimensions.
Although the MAE values decrease across phases, these differences should be interpreted descriptively rather than inferentially. Because the study reports only aggregated MAE scores rather than paired student-level deviations, formal statistical testing is not appropriate. Moreover, the absence of a control group and the reuse of the same items across phases mean that part of the improvement may reflect familiarity effects or regression to the mean. For these reasons, we interpret the classroom activity as a rubric calibration exercise that illustrates increasing alignment with the consensus reference, rather than as evidence of statistically validated instructional impact.
As shown in Figure 6, the Exact Match Rate increased steadily from the pre-test to the final test, indicating improved alignment between student and majority ratings. The most notable increase occurred in the final test, where half of all scores matched the majority exactly. This trend suggests that students internalized the scoring criteria and applied them consistently, even to new and unseen data. Combined with the MAE results, this pattern indicates that brief rubric calibration sessions, supported by GPT–5 majority feedback, markedly improved students’ evaluative precision.

5. Discussion

5.1. Specialized Smaller LLM vs. General Larger LLM

One of the most striking findings is that Gemma 3 (1B) outperformed LLaMA 3.2 (3B) across all evaluation aspects. Despite having only one-third of the parameters, Gemma 3 delivered higher morphosyntactic, semantic, and pragmatic scores. This indicates that scaling laws alone do not guarantee better translation performance; model architecture and pretraining/data composition can be equally (or more) decisive for a given language pair and domain. In our setting—English → Indonesian—Gemma 3’s design and training mixture appear better aligned than the larger general model, leading to consistently stronger quality under human and GPT–5 evaluation.
LLaMA 3.2, as a downscaled version of a general LLM, presumably knows some Indonesian and can translate via its multilingual pretraining, but it may lack the robustness that Gemma exhibited in this evaluation. Our finding supports recent arguments that for many practical MT applications, fine-tuned NMT systems or fine-tuned LLMs can be more effective than using a large zero-shot model, especially when computational resources or latency is a concern. There is a cost–benefit angle too: Gemma 3 is much smaller and faster to run than a multi-billion-parameter model, making it attractive for deployment, yet we did not sacrifice quality—in fact, we gained quality. This is encouraging for communities working with languages that might not yet have a full GPT–5-level model: a moderately sized model focused on their translation needs could surpass using a giant LLM off the shelf.

5.2. Quality Aspect Analysis—Fluency vs. Accuracy

By separating fluency (morphosyntax) and accuracy (semantic) in our evaluation, we confirmed a commonly observed trend: our models (like most neural MT systems) tend to be more fluent than they are accurate. All three models had their lowest human scores in the semantic dimension. Qualitatively, even Qwen’s outputs rarely devolved into word salad; they usually looked like plausible Indonesian sentences, but that did not mean they conveyed the right meaning. This fluency/accuracy gap is precisely why automatic metrics that rely on surface similarity (BLEU, etc.) can be fooled by fluent outputs that use different wording. Our human evaluators could detect when something was “off” in meaning. For example, Qwen translated “what makes this gift so valuable is…” into a form that literally meant “what makes this gift valuable is…”, dropping the emphasis, a subtle but important nuance. The evaluators penalized the semantic score while still giving decent morphosyntactic scores. In MT evaluation discussions, this touches on the concept of “critical errors”: an otherwise fluent translation might contain one mistranslated term (say, “respirator” vs. “ventilator”) that is a critical error in context (medical) but a lexical choice that automatic metrics might not heavily penalize. Our multi-aspect approach catches this because semantic accuracy would be scored low. The results therefore reinforce the value of multidimensional evaluation: we obtain a clearer diagnostic of systems. For instance, to improve these models, the focus for Qwen and LLaMA should be on semantic fidelity (perhaps through better training or prompting), whereas their fluency is largely adequate. Meanwhile, to improve pragmatic scores, one might incorporate more colloquial or domain-specific parallel data so the model learns more natural phrasing and idioms (Gemma’s slight pragmatic edge likely comes from seeing genuine translations in training).

5.3. GPT–5 as an Evaluator—Potential and Pitfalls

The high correlation between GPT–5 and human scores (r ≈ 0.8) is a positive sign for using LLMs in evaluation. GPT–5 effectively identified the better system and, in many cases, gave similar relative scores to translations as humans did. This corroborates other findings that LLMs can serve as surrogate judges for translation quality. Such an approach could dramatically speed up MT development cycles: instead of running costly human evaluations for every tweak, researchers could use an LLM metric to obtain an estimate of quality differences. Our study adds evidence that GPT–5’s evaluations are fine-grained: it was not giving all outputs a high score or low score arbitrarily; it differentiated quality on a per-sentence level (with ∼0.8 correlation with human variation). For example, when a translation had a specific meaning error, GPT–5 often caught it and lowered the semantic score accordingly.
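The sentence-level agreement reported here can be computed with the plain Pearson formula over paired per-sentence scores. A minimal sketch, with illustrative data rather than the study’s actual ratings:

```python
# Sketch: Pearson correlation between GPT-5 scores and mean human scores
# per sentence. The data below is illustrative, not from the study.
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gpt5_scores = [5, 4, 3, 5, 2]        # hypothetical per-sentence scores
human_means = [4.7, 3.7, 2.7, 4.3, 1.7]
r = pearson_r(gpt5_scores, human_means)
```

Note that a high r captures agreement on relative ordering, not on absolute calibration, which is exactly the distinction drawn in the next paragraph.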
However, our analysis also highlights important caveats. GPT–5 was more lenient, especially on pragmatic aspects. In practical terms, if one used GPT–5 to evaluate two systems, it might overestimate their absolute performance. This is less of an issue if one is only looking at comparative differences (System A vs. System B), since GPT–5 still ranked correctly. But it could be an issue if one tries to use GPT–5 to decide “Is this translation good enough to publish as is?”—GPT–5 might say “yes” (score 5) when a human would still see room for improvement. This mirrors findings from recent G-Eval research that LLM evaluators may have a bias towards outputs that seem fluent. We saw that in our scatter plot: GPT–5 rarely gave low scores unless the translation was clearly bad. It is almost as if GPT–5 was hesitant to be harsh, whereas human evaluators, perhaps following instructions, used the full scale more vigorously.
One reason for GPT–5’s leniency could be that the prompt did not enforce strict scoring guidelines. Our prompt was brief; GPT–5 might naturally cluster towards 3–5. If we had shown it examples of what constitutes a “1” vs. “5” (calibration), maybe it would align better. There is work to be done in prompt engineering for LLM evaluators: giving them reference criteria or anchor examples might reduce bias. Also, using chain of thought (asking GPT–5 to explain why it gave a score) could reveal whether it misses some considerations that humans have.
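One lightweight way to implement the anchoring idea above is to embed rubric anchors directly in the evaluation prompt. The sketch below is hypothetical: the anchor wording and function are ours, not the prompt used in this study.

```python
# Sketch of an anchor-calibrated scoring prompt for the semantic aspect.
# The anchor definitions are hypothetical illustrations, loosely based
# on the error types discussed in this paper.
ANCHORS = {
    1: "Meaning is lost or reversed (e.g., 'not uncommon' rendered as 'common').",
    3: "Meaning mostly preserved, but one nuance or detail is dropped.",
    5: "Full meaning preserved with no omissions or distortions.",
}

def build_semantic_prompt(source, translation):
    """Assemble a rubric-anchored prompt asking for a 1-5 semantic score."""
    anchor_text = "\n".join(f"Score {k}: {v}" for k, v in sorted(ANCHORS.items()))
    return (
        "Rate the SEMANTIC accuracy of this English-Indonesian translation "
        "on a 1-5 scale. Use the full scale. Anchor definitions:\n"
        f"{anchor_text}\n\n"
        f"Source: {source}\nTranslation: {translation}\n"
        "First explain your reasoning, then output only the integer score."
    )
```

Asking for the rationale before the score is one way to combine anchoring with the chain-of-thought idea mentioned above; whether it actually reduces leniency would need to be tested empirically.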
We also observed that GPT–5 may not capture cultural or contextual pragmatics as a human would. The idiom example (“devil bites another devil”) showed that GPT–5 was satisfied with a literal translation that technically conveyed the words, whereas humans knew the phrase likely had a deeper meaning. This suggests that for truly evaluating pragmatic equivalence, human intuition is still essential. An LLM, unless explicitly trained on parallel examples of idioms and their translations, might not “know” that something is an idiom that fell flat in translation. In general, nuance detection is an area where current LLMs might fall short—e.g., detecting when a polite form should have been used but was not or when a translation, though accurate, sounds condescending or sarcastic in an unintended way.
Another interesting point is the self-referential bias: our GPT–5 evaluator was essentially assessing outputs from models that are also LLMs (though smaller). There is a possibility that GPT–5’s own translation preferences influenced its judgment. For example, if GPT–5 tends to phrase something a certain way in Indonesian, and Gemma’s translation happened to match that phrasing, GPT–5 might inherently favor it as “correct.” If Qwen used a different phrasing that is still correct, GPT–5 might (implicitly) think it is less natural since it is not what GPT would do. This is speculative, but it raises the broader issue of evaluators needing diversity—perhaps the best practice in the future is to use a panel of LLM evaluators (from different providers or with different training) to avoid one model’s idiosyncrasies biasing the evaluation.

5.4. Implications for Indonesian MT

Our study sheds some light on the state of English–Indonesian MT. Indonesian is a language with relatively simple morphology but some challenging aspects like reduplication and formality levels (e.g., pronouns and address terms vary by politeness). Our human evaluators noted that all models sometimes struggled with formality and tone. For instance, “you” can be translated as “kamu”, “Anda”, or simply omitted, depending on context and politeness. In a few sentences, models defaulted to a neutral tone that was acceptable but not always contextually ideal. Gemma, probably due to seeing more examples, occasionally picked a more context-aware phrasing. For truly high-quality Indonesian translation, context (such as who is speaking to whom) would be important, something our sentence-level approach did not incorporate. This points to the need for document-level or context-aware translation for handling pragmatics like formality. LLMs, by virtue of handling long contexts, might be advantageous here if they are used on whole documents.
We also observed how each model handled Indonesian morphology, such as prefix/suffix usage. Indonesian uses affixes to adjust meaning (me-, -kan, -nya, etc.). Gemma had very few morphological errors, indicating that it learned those patterns well. LLaMA had a handful of errors, like dropping the me- prefix (common in MT: when the model is not sure, it might leave a verb in base form). Qwen did this more often. These are relatively minor errors (often did not impede comprehension), but they do affect perceived fluency. It is encouraging that the fine-tuned model essentially mastered them—showing that even a small model can capture such specifics with focused training, which an un-fine-tuned LLM might not always do.

5.5. Human Evaluation for Research vs. Real-World

Our human evaluators provided extremely valuable judgments, but such evaluations are time-consuming and costly in real-world settings (100 sentences × 3 people is 300 judgments). The fact that GPT–5 can approximate this suggests a hybrid approach: one could use GPT–5 to pre-score thousands of outputs to identify problematic cases and then have humans focus on those or on a smaller subset for final verification. In a deployment scenario (say a translation service), GPT–5 could serve as a real-time quality estimator, flagging translations that are likely incorrect. However, as our results show, it might not flag everything a human would (especially subtle pragmatic issues), so some caution and possibly a margin of safety (e.g., only trust GPT–5’s “flag” if it is really confident) should be implemented.
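The hybrid triage workflow described above can be sketched as a simple routing rule: auto-accept items that GPT–5 scores confidently high on every aspect, and send everything else to human adjudication. The threshold and score fields below are illustrative assumptions, not parameters from this study.

```python
# Sketch of LLM-based triage: route items to human review unless all
# aspect scores clear a confidence threshold. Fields and threshold are
# hypothetical choices for illustration.
def triage(items, threshold=4):
    """Split scored items into auto-accepted and human-review queues."""
    auto_ok, needs_human = [], []
    for item in items:
        worst = min(item["morph"], item["sem"], item["prag"])
        if worst >= threshold:
            auto_ok.append(item)        # every aspect clearly acceptable
        else:
            needs_human.append(item)    # borderline: human adjudication
    return auto_ok, needs_human

items = [
    {"id": 1, "morph": 5, "sem": 4, "prag": 5},
    {"id": 2, "morph": 4, "sem": 2, "prag": 3},
]
accepted, review_queue = triage(items)
```

Gating on the worst aspect rather than the mean builds in the “margin of safety” suggested above, since a single low pragmatic score is enough to trigger human review.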

5.6. Toward Better LLM Translators

What do our findings imply for improving LLMs in translation tasks? One takeaway is that specialization for specific language pairs remains highly valuable. If resources allow, developing or adapting models explicitly for a given pair could yield better results than relying on a single general-purpose model to handle all directions. On the other hand, massive models like GPT–5 are already capable of producing strong translations if guided properly—so prompt design or few-shot demonstrations may help close the gap. In our evaluation, we tested Qwen, LLaMA, and Gemma in a zero-shot setting with a simple “Translate:” prompt. It is possible that providing a few in-context translation examples could have improved their outputs (especially for LLaMA). Exploring how much prompting strategies can help general LLMs approximate the performance of specialized models is an important direction for future work.
Another aspect is addressing the semantic accuracy gap. We observed that even the best-performing model (Gemma) lost points mainly on semantic nuances. Techniques to improve accuracy might include consistency checks (e.g., back-translation or source–target comparison) or integrating modules for factual verification. Since LLMs are prone to hallucination, one mitigation strategy is to encourage greater faithfulness to the source, for example by penalizing deviations during generation or by adopting constrained decoding and feedback-based training. Such approaches could reduce meaning drift while maintaining the high fluency that current models already achieve.
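The back-translation consistency check mentioned above can be sketched as follows. Here `translate` is a hypothetical stand-in for any MT system, and token overlap is a deliberately crude proxy for semantic similarity; a real pipeline would more plausibly use embedding-based similarity.

```python
# Sketch of a back-translation consistency check. `translate` is a
# hypothetical callable (any MT system); token overlap is a crude
# stand-in for a proper semantic similarity measure.
def token_overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def check_consistency(source, translation, translate, threshold=0.5):
    """Back-translate the output and flag drift from the source."""
    back = translate(translation, src="id", tgt="en")
    score = token_overlap(source, back)
    return score >= threshold, score
```

Outputs that fail the check could then be regenerated or routed to human review, reducing meaning drift without touching the decoder itself.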

5.7. Error Profile and Improvement Targets

Our multi-aspect evaluation confirmed that translation quality is multi-faceted. We saw cases where a translation was grammatically flawless yet semantically flawed or semantically accurate yet stylistically awkward. By capturing these differences, we can better target improvements. For example, all models in our study need improvement in semantic fidelity, as evidenced by semantic scores trailing fluency scores. Research into techniques to reinforce source–target alignment in LLM outputs—through constraints or improved training objectives—would directly address this gap. We also underline pragmatic and cultural nuances: even the best model (Gemma) occasionally failed to convey tone or idiomatic expression appropriately. Future work could augment training data with more colloquial and context-rich examples or use RLHF focusing on adequacy and style.

5.8. LLM Evaluator Agreement and Calibration

A novel component of our work was comparing GPT–5 with human evaluation. We found strong correlations ( r > 0.8 overall), demonstrating the potential of LLM-based evaluation metrics. GPT–5 could rank translations and identify better outputs with high accuracy, effectively mimicking a human evaluator. However, we also observed a systematic bias: GPT–5 tended to give higher absolute scores and was less critical of pragmatic issues. Thus, while LLM evaluators are extremely useful, they need careful calibration before exclusive reliance. In the near term, a hybrid workflow—using LLM evaluators for rapid iteration and humans for final validation or to cover aspects the LLM might miss—is prudent.

5.9. Practical Use and Post-Editing Considerations

For real-world acceptability, the top-performing model (Gemma) achieved an average overall human score of ∼3.8/5, with many sentences rated 4–5. These translations are generally good, but not perfect; human post-editing remains advisable for professional use, particularly to correct meaning nuances or polish style. The predominant errors (minor omissions or stiffness) are typically quick to fix, implying substantial productivity gains from strong MT drafts. By contrast, Qwen often had major errors requiring extensive editing or retranslation; LLaMA’s output was decent and useful where perfect accuracy was not critical or as an editable draft. Matching the MT solution to the use case remains essential (e.g., high-stakes content favors the most accurate systems, potentially combined with human review).

5.10. Implications of the Classroom Evaluation Study

The classroom experiment demonstrates that rubric-based calibration effectively improved student evaluators’ alignment with expert and GPT–5 majority judgments. The strongest improvement occurred in the semantic dimension, suggesting that students developed a better understanding of meaning fidelity after guided discussion. Morphosyntactic and pragmatic dimensions also showed steady gains, implying enhanced awareness of grammatical accuracy and contextual tone.
Importantly, no “Outside” deviations (greater than ±1) occurred in any phase, indicating complete convergence within the rubric’s interpretive range. This confirms that rubric-guided classroom calibration narrows inter-rater variance and fosters evaluative literacy. Such outcomes underscore the educational potential of integrating LLM-supported majority scoring as a feedback reference, enabling translation students to self-correct and internalize consistent evaluation standards.

5.11. Broader Impacts

Our findings have a few broader implications. First, for practitioners in translation and localization, they demonstrate that one does not necessarily need a 175B-parameter model to obtain good results—smaller tailored models like Gemma can suffice and even excel. This could democratize MT technology for languages where building a giant model is impractical: a focused approach with available parallel data can yield high quality. Second, the experiment suggests that human translators and AI evaluators can coexist in an evaluation workflow. By understanding where GPT–5 agrees or disagrees with humans, we can start to trust AI to carry out the heavy lifting of evaluation and involve human experts in edge cases or in refining the evaluation criteria.
For the research community, an interesting point is how to improve LLMs as evaluators further. If one could align GPT–5’s scoring more closely with MQM (perhaps by fine-tuning GPT–5 on a dataset of human evaluations), we might obtain an automatic evaluator that is not only correlated but also calibrated. That would be a game-changer for benchmarking models rapidly. Our results hint that GPT–5 already contains a lot of this capability intrinsically; it is a matter of fine-tuning or prompting to bring it out in a trustworthy way.

5.12. Future Work

Building on this study, we identify several directions for further investigation:
1.
Cross-lingual generalization: Extend evaluation to additional language pairs with diverse typological characteristics, including morphology-rich languages (e.g., Turkish and Finnish) and low-resource settings (e.g., regional Indonesian languages), to test the robustness and generalizability of our findings.
2.
Document-level and discourse phenomena: Move beyond sentence-level evaluation to assess how models handle longer context. This includes measuring discourse-level coherence, consistency of terminology, pronominal and anaphora resolution, and appropriate management of register across paragraphs. Methods could involve evaluating paragraph- and document-level translations, designing rubrics for discourse quality, and exploiting LLMs’ extended context windows for both translation and evaluation.
3.
Evaluator calibration: Refine prompts, explore in-context exemplars, or apply lightweight fine-tuning to GPT–5 (GPT–5-chat) as an evaluator in order to reduce its leniency and align it more tightly with rubric-anchored human severity. This could include calibration against reference human ratings on held-out data.
4.
Explainable disagreement analysis: Collect structured rationales from GPT–5 and annotators to better diagnose systematic blind spots. For example, discrepancies in idiomaticity, politeness, or tone could be analyzed qualitatively and linked back to specific rubric dimensions.
5.
Training-time feedback loops: Explore the use of GPT–5 as a “critic” to provide feedback signals during model training, prioritizing low-scoring phenomena (e.g., semantic fidelity errors or pragmatic mismatches). To mitigate evaluator bias, such loops should incorporate confidence thresholds and human-in-the-loop audits.
6.
Few-shot prompting: Prior work suggests that few-shot calibration using human-labeled examples may further improve alignment between LLM and human judgments. In this study, we deliberately adopted a zero-shot evaluation setting in order to examine GPT–5’s ability to apply the rubric without prior exposure to human ratings. While this choice allows for a cleaner comparison between human and LLM evaluators, future work could explore whether instruction-aligned or example-conditioned prompting yields improved calibration, reduced bias, or greater consistency across evaluation dimensions.

5.13. Limitations

While our study is comprehensive within its scope, several limitations must be acknowledged. First, the test set is relatively small (100 sentences) and may not capture the full diversity of linguistic phenomena. It suffices to reveal clear model differences and common error types, but a larger and more varied set (spanning multiple domains and registers) would be necessary to increase confidence in generalization.
Second, our evaluation was conducted at the sentence level. We did not assess how these models handle document-level translation, which is critical for real deployment scenarios. Longer texts introduce discourse-level phenomena such as pronoun resolution, lexical consistency, topic continuity, and coherence across sentences. Document-level translation is precisely where LLMs, with their large context windows, may demonstrate distinctive strengths or weaknesses. Future work should therefore extend beyond isolated sentences to systematically examine English → Indonesian translation at the paragraph and document level.
Another limitation is that we only tested a single direction (En → Id). It is possible that the dynamics differ in the opposite direction or for other language pairs. Indonesian → English might be easier in some respects (given that English dominates LLM pretraining data) but could pose different challenges, such as explicitly marking politeness or pronominal distinctions that are implicit in Indonesian. Moreover, languages with richer morphology or very different syntax (e.g., English↔Japanese) may produce different error patterns, perhaps making fluency harder to achieve. Thus, while our results align with trends noted in recent surveys, caution is warranted in extrapolating too broadly.

6. Conclusions

In this paper, we presented a comprehensive evaluation of machine translation outputs from three LLM-based models (Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)) on an English–Indonesian translation task, using both human raters and GPT–5 as evaluators. We structured the assessment along three key quality dimensions—morphosyntactic correctness, semantic accuracy, and pragmatic appropriateness—to gain deeper insights than a single metric could provide.
We presented a multidimensional evaluation of English–Indonesian MT that combines standard metrics (adequacy and fluency) with MQM-inspired linguistic dimensions (morphosyntactic, semantic, and pragmatic) and both human and LLM-based judgments. All results consistently ranked the systems as Gemma 3 (1B) > LLaMA 3.2 (3B) > Qwen 3 (0.6B). Inter-annotator agreement (Krippendorff’s α and weighted κ) indicated moderate to substantial reliability, and MACE showed GPT–5 (GPT–5-chat) competence on par with the strongest human rater. GPT–5’s scores correlated strongly with human judgments (r = 0.82), suggesting it is an effective, scalable evaluator, albeit slightly lenient and in need of calibration. Overall, the findings support specialized mid-size models and rubric-guided LLM evaluation as a practical path to dependable MT assessment beyond BLEU.
Beyond numerical validation, this study extends MT evaluation into an educational context through a classroom experiment involving 26 translation students. The classroom results verified that rubric-based calibration can meaningfully transfer the multidimensional metrics to human learners. Students’ alignment with expert and GPT–5 majority judgments improved substantially, especially in the semantic dimension, confirming the framework’s interpretability and usability in training evaluators. The mean absolute error (MAE) decreased from 0.97 to 0.83 and the Exact Match Rate increased from 0.30 to 0.50 across pre-, post-, and final-test phases. This pedagogical validation highlights the framework’s broader impact: beyond research evaluation, it serves as a practical tool for developing evaluative literacy and self-corrective feedback in translation education.
In future work, we plan to extend this multidimensional framework to additional language pairs and domains, explore automated calibration using GPT-based adaptive feedback, and integrate real-time classroom analytics to support scalable evaluator training.

Author Contributions

Conceptualization, S.S., A.H.N., W.M. and T.D.; methodology, S.S., A.H.N. and T.D.; software, A.H.N. and W.M.; validation, S.S., A.H.N., W.M., T.D. and A.O.; formal analysis, S.S., A.H.N., W.M. and T.D.; investigation, A.H.N. and W.M.; resources, S.S., A.H.N., T.D. and Y.M.; data curation, S.S. and T.D.; writing—original draft preparation, A.H.N. and W.M.; writing—review and editing, A.H.N., W.M. and T.D.; visualization, A.H.N. and W.M.; supervision, A.O. and Y.M.; funding acquisition, S.S., A.H.N. and W.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Ministry of Higher Education, Science and Technology under research grant numbers 001/LL17/DT.05.00/PL/2025 and 29/DPPM-UIR/HN-P/2025.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the research involved anonymized educational data collected as part of routine classroom activities, posed no risk to participants, and did not influence students’ grades or academic standing.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data supporting the findings of this study are openly available at https://github.com/arbihazanst/multidimentional-MT-Eval (accessed on 12 January 2026). The repository includes the evaluation rubric, anonymized scoring data, analysis scripts, and instructions required to reproduce the reported results.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Evaluation Rubric

The following rubric was provided to all human evaluators as part of the evaluation instructions. It specifies the criteria for assigning scores from 1 to 5 on each of the three linguistic dimensions.

Appendix A.1. Morphosyntactic (Grammar, Agreement, Sentence Structure, Morphology)

  • 5: No grammatical or syntactic errors. The sentence structure is fluent and consistent with the source. Morphological forms (e.g., tense, agreement, inflections) are fully accurate.
  • 4: Minor grammatical or syntactic issues that do not hinder comprehension. Word forms and sentence structure are mostly correct but slightly unnatural in places.
  • 3: Noticeable morphosyntactic errors that occasionally affect comprehension or fluency. May include agreement mistakes, awkward word order, different sentence structure, or wrong tense.
  • 2: Frequent errors in grammar and structure that hinder comprehension. The sentence structure differs significantly from the source. Unnatural phrasing, tense confusion, or broken sentence patterns are common.
  • 1: Sentence is ungrammatical, fragmented, or unparseable. Morphosyntactic errors make it incomprehensible or entirely ungrammatical. The sentence structure is completely different from the source.

Appendix A.2. Semantic (Meaning Preservation, Omissions, Mistranslations)

  • 5: Full preservation of meaning. No omissions, distortions, or mistranslations. The translation conveys the same message as the source without deleting or adding additional words.
  • 4: Minor meaning shifts or vague expressions that slightly alter nuance but do not mislead. No omissions for verbs, subject, and object of the sentence.
  • 3: Some parts are missing or incorrect, but the main information is still conveyed.
  • 2: Major meaning loss or distortion. Critical elements are missing, incorrect, or misleading. Intended message is mostly lost.
  • 1: Little semantic correspondence to the source. Mostly incorrect, irrelevant, or hallucinated content.

Appendix A.3. Pragmatic (Tone, Register, Politeness, Cultural/Situational Appropriateness)

  • 5: Tone, register, and cultural fit are fully appropriate for the context. The translation sounds natural and is aligned with the speaker’s intent.
  • 4: Mostly appropriate tone and register. Slight mismatches (e.g., slightly too formal/informal), but they do not cause misunderstanding or awkwardness.
  • 3: Inconsistent or ambiguous tone or formality. Some sections may sound awkward or misaligned with the source intent.
  • 2: Inappropriate tone or register for the context. Translation may sound offensive, robotic, or culturally inappropriate.
  • 1: Completely wrong pragmatic use (e.g., overly rude, sarcastic instead of sincere, or entirely mismatched social function).

Appendix B. Prompt Template

This prompt defines the evaluation instructions provided to GPT–5 for assessing translation outputs from Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B).
Listing A1: Prompt template used for GPT–5 evaluation.
Evaluate the Indonesian translation of the English text below based on five criteria:
Adequacy, Fluency, Morphosyntactic, Semantic, and Pragmatic. Use the scoring rubric from
1 to 5 for each criterion, as described in detail below. Assign only a numeric score
for each aspect (no explanations).

Scoring Rubrics:
1. Adequacy: Does the Indonesian translation convey the same meaning as the English source
sentence?
- Scoring Guide:
5: Complete transfer of meaning (accurate and all important information conveyed)
4: Most meaning conveyed with minor omissions or errors
3: Some parts are missing or incorrect, but some information still conveyed
2: Major omissions or incorrect meaning, only small part of the information is conveyed
1: Meaning largely incorrect or missing

2. Fluency: How natural and grammatically correct is the Indonesian translation?
- Scoring Guide:
5: Fluent, natural, and grammatically correct
4: Mostly fluent with minor grammatical issues or awkwardness
3: Understandable, but noticeable grammatical errors
2: Difficult to understand due to poor grammar or awkward phrasing
1: Unintelligible due to major grammatical issues

3. Morphosyntactic (Grammar, agreement, sentence structure, morphology):
- Scoring Guide:
5: No grammatical or syntactic errors. The sentence structure is fluent, and consistent with
the source data. Morphological forms (e.g., tense, agreement, inflections) are fully
accurate.
4: Minor grammatical or syntactic issues that do not hinder comprehension. Word forms and
sentence structure are mostly correct but slightly unnatural in places.
3: Noticeable morphosyntactic errors that occasionally affect comprehension or fluency. May
include agreement mistakes, awkward word order, different sentence structure, or wrong
tense.
2: Frequent errors in grammar and structure that hinder comprehension. The sentence
structure differs significantly from the source data. Unnatural phrasing, tense
confusion, or broken sentence patterns are common.
1: Sentence is ungrammatical, fragmented, or unparseable. Morphosyntactic errors make it
incomprehensible or entirely ungrammatical. The sentence structure is completely
different from the source data.

4. Semantic (Meaning preservation, omissions, mistranslations):
- Scoring Guide:
5: Full preservation of meaning. No omissions, distortions, or mistranslations. The
translation conveys the same message as the source without deleting or adding additional
words.
4: Minor meaning shifts or vague expressions that slightly alter nuance but do not mislead.
No omissions for verbs, subject, and object of the sentence.
3: Partial meaning loss. Some ideas are omitted or inaccurately rendered, but core meaning
is still recoverable with effort.
2: Major meaning loss or distortion. Critical elements are missing, incorrect, or misleading.
Intended message is mostly lost.
1: Little semantic correspondence to the source. Mostly incorrect, irrelevant, or
hallucinated content.

5. Pragmatic (Tone, register, politeness, cultural/situational appropriateness):
- Scoring Guide:
5: Tone, register, and cultural fit are fully appropriate for the context. The translation
sounds natural and is aligned with the speaker's intent.
4: Mostly appropriate tone and register. Slight mismatches (e.g., slightly too formal/
informal), but do not cause misunderstanding or awkwardness.
3: Inconsistent or ambiguous tone or formality. Some sections may sound awkward or
misaligned with the source intent.
2: Inappropriate tone or register for the context. Translation may sound offensive, robotic,
or culturally inappropriate.
1: Completely wrong pragmatic use. For example, overly rude, sarcastic instead of sincere,
or completely mismatched social function.

You will receive a list of examples. For each example, return a JSON object with:
- "row_id": the given ID
- "scores": { "adequacy": int, "fluency": int, "morphosyntactic": int, "semantic": int,
"pragmatic": int }

IMPORTANT OUTPUT FORMAT:
Return a single JSON object with the key "evaluations" whose value is an array of the
example results.
Do not include any extra text, prose, or code fences---only valid JSON.
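Because the prompt requires strict JSON with a fixed schema, the model's output can be validated before scoring is aggregated. The following is a minimal sketch of such a parser; the key names follow the prompt above, but the code itself is our illustration, not the study's pipeline.

```python
import json

CRITERIA = ("adequacy", "fluency", "morphosyntactic", "semantic", "pragmatic")

def parse_evaluations(raw):
    """Parse the evaluator's JSON output into {row_id: {criterion: score}},
    enforcing the schema from the prompt: a top-level "evaluations" array
    and integer scores in the 1-5 range for all five criteria."""
    data = json.loads(raw)
    results = {}
    for item in data["evaluations"]:
        scores = item["scores"]
        for criterion in CRITERIA:
            value = scores[criterion]
            if not (isinstance(value, int) and 1 <= value <= 5):
                raise ValueError(f"invalid score for {criterion!r}: {value}")
        results[item["row_id"]] = {c: scores[c] for c in CRITERIA}
    return results

# Hypothetical single-example response in the required format.
raw = ('{"evaluations": [{"row_id": 1, "scores": {"adequacy": 4, "fluency": 5, '
       '"morphosyntactic": 4, "semantic": 4, "pragmatic": 3}}]}')
print(parse_evaluations(raw)[1]["pragmatic"])  # 3
```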

Figure 1. Human evaluation scores by quality aspect for each model. Each group of bars corresponds to an evaluation aspect. Within a group, the three bars represent the average score given to each model's translations. Error bars show ±1 standard deviation across the 100 test sentences. Gemma's bars are highest in all categories, indicating its superior performance, while Qwen's are lowest.
Figure 2. GPT–5 evaluation scores by quality aspect for each model. The bars show GPT–5’s average scoring of each model’s outputs. GPT–5’s evaluations also rank Gemma highest and Qwen lowest in all aspects. Note that GPT–5’s absolute scores are higher than the human scores (Figure 1), especially for pragmatic quality, suggesting a more lenient or optimistic evaluation compared to human judges.
Figure 3. Correlation between GPT–5 and human overall evaluation scores. Each point represents one translated sentence (from any of the three models). The x-axis is the human evaluators’ average overall score (mean of their three aspect scores) and the y-axis is GPT–5’s overall score (mean of its three aspect scores). The three colors denote the model: blue for Qwen, orange for LLaMA, and green for Gemma. The dashed line is the diagonal y = x . We see a strong positive correlation (Pearson r = 0.82 ). GPT–5’s scores generally track the human scores—points do not scatter wildly. However, most points lie above the diagonal, reflecting that GPT–5 tends to give higher scores than humans for the same items. This is especially noticeable for the orange and green points (LLaMA and Gemma) in the upper right: many of Gemma’s translations that humans rated around 4.0 overall were rated 4.5 or even 5 by GPT–5, hence appearing above the line. Qwen’s points (blue, in the lower left) also show GPT often rating them slightly higher than humans did.
Figure 4. Heatmap of Pearson correlation coefficients between evaluation dimensions. Both the x-axis and y-axis list the evaluation dimensions (adequacy, fluency, morphosyntactic, semantic, pragmatic, and BLEU). Each cell represents the correlation between a pair of dimensions. Cell colors encode the strength of correlation (Pearson’s r): warmer colors (red) indicate higher positive correlation, whereas cooler colors (blue) indicate lower correlation. The figure illustrates the degree of overlap and distinctiveness among the evaluation dimensions. BLEU correlates weakly (≈0.22–0.23) with the rubric-based metrics, reflecting its different sensitivity to reference n-gram overlap rather than direct judgments of meaning and naturalness.
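The agreement reported in Figures 3 and 4 reduces to the standard Pearson correlation between paired score vectors. A self-contained sketch, using hypothetical overall scores rather than the study's data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-sentence overall scores (mean of the three aspect scores).
human = [3.0, 3.7, 4.0, 2.7, 3.3]
gpt5  = [3.4, 4.1, 4.5, 3.0, 3.9]
print(round(pearson_r(human, gpt5), 3))  # 0.986
```

A high r with GPT–5 scores sitting uniformly above the human scores is exactly the pattern in Figure 3: strong rank agreement combined with a systematic leniency offset.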
Figure 5. Mean absolute error (MAE) by test and aspect. Lower MAE indicates closer alignment with majority scores.
Figure 6. Exact Match Rate by test. Agreement improves from the pre-test to the final test.
Table 1. Proposed linguistic metrics.

Metric | Description
Morphosyntactic | Assesses grammar, agreement, sentence structure, and morphology; identifies issues like verb conjugation errors, subject–verb disagreement, tense misuse, or malformed clause structure.
Semantic | Evaluates fidelity of meaning, detecting mistranslations, omissions, incorrect lexical choices, or additions that distort intended meaning.
Pragmatic | Captures tone, register, politeness, and cultural/situational appropriateness; identifies errors in speech level, formality, or contextual fit.
Table 2. Mapping of linguistic metrics to MQM categories and subcategories.

Metric | Matched MQM Category | Matched MQM Subcategories
Morphosyntactic | Linguistic Conventions; Accuracy | Word form; Grammar; Agreement; Textual conventions; Transliteration; MT hallucination
Semantic | Design and Markup; Accuracy; Linguistic Conventions; Audience Appropriateness | Mistranslation; Omission; Addition; Missing markup; Incorrect item; End-user suitability; Missing graphic/table
Pragmatic | Design and Markup; Locale Conventions; Terminology | Style; Formality; Audience appropriateness; Questionable markup; Locale-specific punctuation; Number/measurement format
Table 3. Condensed 5-point evaluation rubric for translation quality. The full rubric is provided in Appendix A.

Aspect | Scale Description (1–5)
Morphosyntactic | 5: No grammatical/syntactic errors; fully fluent and accurate. 4: Minor issues not hindering comprehension. 3: Noticeable errors occasionally affecting comprehension. 2: Frequent errors that hinder comprehension. 1: Ungrammatical or incomprehensible.
Semantic | 5: Full preservation of meaning. 4: Minor meaning shifts not misleading. 3: Partial meaning loss but core message intact. 2: Major meaning loss or distortion. 1: Little correspondence; mostly incorrect/irrelevant.
Pragmatic | 5: Tone, register, and cultural fit fully appropriate. 4: Mostly appropriate with slight mismatches. 3: Inconsistent or awkward tone/formality. 2: Inappropriate tone/register for context. 1: Completely wrong pragmatic use.
Table 4. Average morphosyntactic evaluation scores (with standard deviation) by human evaluators and GPT–5 for each model.

Model | Human | GPT–5 | Overall Human | Overall GPT–5
Qwen 3 (0.6B) | 3.01 ± 0.74 | 3.09 ± 0.96 | 2.71 ± 0.78 | 3.04 ± 0.93
LLaMA 3.2 (3B) | 3.45 ± 0.60 | 3.91 ± 0.87 | 3.40 ± 0.68 | 3.92 ± 0.89
Gemma 3 (1B) | 3.87 ± 0.81 | 4.25 ± 0.95 | 3.82 ± 0.84 | 4.20 ± 0.88
Table 5. Average semantic evaluation scores (with standard deviation) by human evaluators and GPT–5 for each model.

Model | Human | GPT–5 | Overall Human | Overall GPT–5
Qwen 3 (0.6B) | 2.53 ± 0.83 | 2.82 ± 0.93 | 2.71 ± 0.78 | 3.04 ± 0.93
LLaMA 3.2 (3B) | 3.33 ± 0.78 | 3.76 ± 1.05 | 3.40 ± 0.68 | 3.92 ± 0.89
Gemma 3 (1B) | 3.75 ± 0.90 | 4.00 ± 1.03 | 3.82 ± 0.84 | 4.20 ± 0.88
Table 6. Average pragmatic evaluation scores (with standard deviation) by human evaluators and GPT–5 for each model.

Model | Human | GPT–5 | Overall Human | Overall GPT–5
Qwen 3 (0.6B) | 2.59 ± 0.89 | 3.22 ± 1.00 | 2.71 ± 0.78 | 3.04 ± 0.93
LLaMA 3.2 (3B) | 3.40 ± 0.81 | 4.09 ± 0.92 | 3.40 ± 0.68 | 3.92 ± 0.89
Gemma 3 (1B) | 3.83 ± 0.89 | 4.35 ± 0.93 | 3.82 ± 0.84 | 4.20 ± 0.88
Table 7. Average scores (1–5) and SD by model and aspect: human experts vs. GPT–5.

Model | Metric | Human Mean | Human Std | GPT–5 Mean | GPT–5 Std
Qwen 3 (0.6B) | Morphosyntactic | 3.007 | 0.748 | 3.09 | 0.965
Qwen 3 (0.6B) | Semantic | 2.533 | 0.834 | 2.82 | 0.936
Qwen 3 (0.6B) | Pragmatic | 2.593 | 0.890 | 3.22 | 1.001
LLaMA 3.2 (3B) | Morphosyntactic | 3.450 | 0.600 | 3.91 | 0.877
LLaMA 3.2 (3B) | Semantic | 3.333 | 0.789 | 3.76 | 1.055
LLaMA 3.2 (3B) | Pragmatic | 3.403 | 0.816 | 4.09 | 0.922
Gemma 3 (1B) | Morphosyntactic | 3.870 | 0.814 | 4.25 | 0.957
Gemma 3 (1B) | Semantic | 3.750 | 0.909 | 4.00 | 1.035
Gemma 3 (1B) | Pragmatic | 3.830 | 0.892 | 4.35 | 0.936
Table 8. Inter-annotator agreement (humans + GPT–5) by model and metric.

Model | Metric | Krippendorff Alpha | Weighted Kappa Avg
Qwen 3 (0.6B) | Morphosyntactic | 0.399 | 0.408
Qwen 3 (0.6B) | Semantic | 0.518 | 0.521
Qwen 3 (0.6B) | Pragmatic | 0.516 | 0.535
LLaMA 3.2 (3B) | Morphosyntactic | 0.377 | 0.394
LLaMA 3.2 (3B) | Semantic | 0.502 | 0.507
LLaMA 3.2 (3B) | Pragmatic | 0.445 | 0.465
Gemma 3 (1B) | Morphosyntactic | 0.482 | 0.488
Gemma 3 (1B) | Semantic | 0.624 | 0.621
Gemma 3 (1B) | Pragmatic | 0.530 | 0.529
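The weighted-kappa values in Table 8 follow Cohen's formulation with distance-based disagreement weights. A pure-Python sketch for a single rater pair on the 1–5 scale, assuming quadratic weights for illustration (our own code, not the study's implementation):

```python
from collections import Counter

def weighted_kappa(r1, r2, categories=(1, 2, 3, 4, 5)):
    """Cohen's kappa with quadratic disagreement weights for two raters:
    kappa = 1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    n = len(r1)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    observed = Counter(zip(r1, r2))       # joint distribution of rating pairs
    m1, m2 = Counter(r1), Counter(r2)     # marginal distributions per rater
    num = den = 0.0
    for a in categories:
        for b in categories:
            w = (idx[a] - idx[b]) ** 2 / (k - 1) ** 2   # quadratic weight
            num += w * observed[(a, b)] / n             # observed disagreement
            den += w * (m1[a] / n) * (m2[b] / n)        # chance disagreement
    return 1.0 - num / den

# Hypothetical ratings from two annotators on eight sentences.
a = [5, 4, 3, 4, 2, 5, 3, 4]
b = [5, 4, 2, 4, 2, 4, 3, 5]
print(round(weighted_kappa(a, b), 3))
```

The per-metric values in Table 8 would then be averages of such pairwise kappas over all annotator pairs.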
Table 9. MACE annotator competence (probability of correctness) per metric.

Metric | Annotator | Competence
Morphosyntactic | Evaluator 1 | 0.552
Morphosyntactic | Evaluator 2 | 0.600
Morphosyntactic | Evaluator 3 | 0.591
Morphosyntactic | GPT–5 | 0.651
Semantic | Evaluator 1 | 0.576
Semantic | Evaluator 2 | 0.533
Semantic | Evaluator 3 | 0.611
Semantic | GPT–5 | 0.634
Pragmatic | Evaluator 1 | 0.551
Pragmatic | Evaluator 2 | 0.601
Pragmatic | Evaluator 3 | 0.528
Pragmatic | GPT–5 | 0.572
Table 10. Descriptive statistics of student–majority agreement across tests. Mean absolute error (MAE) quantifies the average deviation between student ratings and the expert/GPT–5 majority score (lower is better). Exact Match Rate represents the proportion of identical ratings (higher is better). Results show a consistent improvement from the pre-test to the final test across all linguistic aspects.

Metric/Aspect | Pre | Post | Final
MAE (Overall) | 0.97 | 0.93 | 0.83
Morphosyntactic | 0.98 | 0.96 | 0.91
Semantic | 1.03 | 0.92 | 0.86
Pragmatic | 0.90 | 0.89 | 0.84
Exact Match Rate (Overall) | 0.30 | 0.28 | 0.50
Note: The best value for each metric occurs at the final-test stage.
Share and Cite

MDPI and ACS Style

Shalawati, S.; Nasution, A.H.; Monika, W.; Derin, T.; Onan, A.; Murakami, Y. Beyond BLEU: GPT–5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation. Digital 2026, 6, 8. https://doi.org/10.3390/digital6010008
