Beyond BLEU: GPT–5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation
Abstract
1. Introduction
2. Related Work
2.1. LLMs in Machine Translation
2.2. Multidimensional Evaluation and LLM-Based Metrics
3. Method
3.1. MQM Mapping and Evaluation Rubric
3.2. Models and Translation Task
- Qwen 3 (0.6B) (https://ollama.com/library/qwen3 (accessed on 12 January 2026)): The Qwen 3 family is an officially released and documented series of large language models developed by Alibaba Cloud, including a 0.6B-parameter variant [29]. Official checkpoints and documentation are available via the Hugging Face model hub (https://huggingface.co/Qwen (accessed on 12 January 2026)) and the Qwen3 GitHub repository (https://github.com/QwenLM/Qwen3 (accessed on 12 January 2026)). In this study, however, we used the Ollama-distributed Qwen 3 (0.6B) build rather than loading the official checkpoints directly. This build contains approximately 0.6 billion parameters and is a lightweight, general-purpose model with some multilingual capability but no task-specific optimization for machine translation. We include it as a small-scale baseline to examine how a compact LLM handles English–Indonesian translation under a fixed zero-shot prompt.
- LLaMA 3.2 (3B) (https://ollama.com/library/llama3.2 (accessed on 12 January 2026)): The LLaMA family is a widely adopted open-weight model line developed by Meta AI and trained on diverse multilingual corpora, with official documentation (https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2 (accessed on 12 January 2026)) and downloadable checkpoints (https://www.llama.com/llama-downloads (accessed on 12 January 2026)) provided for LLaMA 3.2 [30]. In this study, we used an Ollama-distributed snapshot build labeled LLaMA 3.2 (3B), following Ollama’s internal versioning conventions, rather than the official Meta release. This build contains approximately 3 billion parameters and was used as provided by Ollama, without any fine-tuning or task-specific adaptation. We treat it as a mid-scale general LLM representative for zero-shot translation rather than as a benchmark of official LLaMA releases.
- Gemma 3 (1B) (https://ollama.com/library/gemma3 (accessed on 12 January 2026)): Gemma is an open family of efficient language models developed by Google DeepMind, designed for research and local deployment [31]. The Gemma 3 family is officially documented (https://deepmind.google/models/gemma/gemma-3/ (accessed on 12 January 2026)), and Ollama is explicitly referenced there as a supported method for local deployment. In line with this guidance, we instantiated Gemma 3 (1B) using the Ollama-distributed build, without any fine-tuning or customization. Although not explicitly trained for English–Indonesian translation, the model produces translations under appropriate prompting and serves as a comparative translation generator alongside Qwen and LLaMA. A minimal sketch of how such Ollama builds can be queried for zero-shot translation is given after this list.
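For illustration, the sketch below shows how a locally served Ollama build can be queried for zero-shot English–Indonesian translation through Ollama’s REST endpoint. This is a sketch under stated assumptions, not the exact script used in this study: the model tags, prompt wording, and default decoding settings are illustrative.

```python
import requests

# Assumption: an Ollama server is running locally on the default port (11434)
# and the three builds have been pulled, e.g. `ollama pull gemma3:1b`.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["qwen3:0.6b", "llama3.2:3b", "gemma3:1b"]  # illustrative tags

def translate(model: str, sentence: str) -> str:
    """Request a zero-shot English-to-Indonesian translation from one model."""
    prompt = (
        "Translate the following English sentence into Indonesian. "
        "Return only the translation.\n\n" + sentence
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

if __name__ == "__main__":
    src = "Archimedes was an ancient Greek thinker."
    for m in MODELS:
        print(m, "->", translate(m, src))
```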
3.3. Reproducibility and Resources
3.4. Human Evaluation Procedure
- Morphosyntactic Quality: Is the translation well-formed in terms of grammar, word order, and word forms? For this aspect, raters do not consider whether the translation is meaningfully accurate; they focus entirely on the accuracy of the morphology and syntax. They check for issues such as incorrect affixes, plurality, tense markers (where applicable in Indonesian), word order mistakes, agreement errors, or any violation of Indonesian grammatical norms. A score of 5 means the sentence is grammatically perfect and natural; a 3 indicates some awkwardness or minor errors; and a 1 indicates severely broken grammar that impedes understanding.
- Semantic Accuracy: Does the translation faithfully convey the meaning of the source? This is essentially adequacy: how much of the content and intent of the original is preserved. Raters compare the Indonesian translation against the source English to identify any omissions, additions, or meaning shifts. A score of 5 means the translation is completely accurate with no loss or distortion of meaning; a 3 means some nuances or minor details are lost/mistranslated but the main message is there; and a 1 means it is mostly incorrect or missing significant content from the source.
- Pragmatic Appropriateness: Is the translation appropriate and coherent in context and style? This covers aspects like tone, register, and overall coherence. Raters judge if the translation would make sense and be appropriate for an Indonesian reader in the intended context of the sentence. For example, does it use the correct level of formality? Does it avoid unnatural or literal phrasing that, while grammatically correct, would sound odd to a native speaker? This category also captures whether the translation is pragmatically effective—e.g., if the source had an idiom, was it translated to an equivalent idiom or explained in a way that an Indonesian reader would understand the intended effect? A score of 5 means the translation not only is correct but also feels native—one could not easily tell it was translated. A 3 might indicate that it is understandable but has some unnatural phrasing or slight tone issues. A 1 would mean that it is pragmatically inappropriate or incoherent (perhaps overly literal or culturally off-base). A minimal data-structure sketch of this three-aspect rubric is shown after this list.
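For illustration, the following is a minimal sketch of how the three rubric aspects and a rater’s score sheet could be represented and validated programmatically. The aspect keys, class names, and validation rules are illustrative assumptions, not the study’s actual annotation tooling.

```python
from dataclasses import dataclass

# The three rubric aspects used by the human raters (integer scores from 1 to 5).
ASPECTS = ("morphosyntactic", "semantic", "pragmatic")

@dataclass
class Rating:
    item_id: int
    model: str
    scores: dict  # e.g. {"morphosyntactic": 4, "semantic": 5, "pragmatic": 3}

    def validate(self) -> None:
        """Raise ValueError if any aspect is missing or outside the 1-5 scale."""
        for aspect in ASPECTS:
            value = self.scores.get(aspect)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"invalid score for {aspect!r}: {value!r}")

# Hypothetical example: one rater's judgment of one translation.
rating = Rating(
    item_id=25,
    model="Gemma 3 (1B)",
    scores={"morphosyntactic": 5, "semantic": 5, "pragmatic": 5},
)
rating.validate()
```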
3.5. GPT–5 Evaluation Procedure
3.6. Classroom Evaluation Study
- Pre-test: Students rated 15 translations (5 English–Indonesian sentences, each translated by Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B)).
- Post-test: After a short discussion clarifying how to apply the rubric and what constitutes scores 1–5, students re-evaluated the same items.
- Final test: Students rated five new items (indices 25, 47, 52, 74, and 97), each translated by the same three models. The sketch after this list illustrates how the mean absolute error and exact-match rate reported in Section 4.7 can be computed from such paired ratings.
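For illustration, the following is a minimal sketch of how mean absolute error (MAE) and exact-match rate between two sets of rubric scores (e.g., student ratings versus a reference set of ratings) could be computed. It is not the exact scoring script used in this study; the array layout (items by aspects) and the example values are illustrative.

```python
import numpy as np

def mae_and_exact_match(student: np.ndarray, reference: np.ndarray):
    """Compute MAE and exact-match rate between two equally shaped score arrays.

    Both arrays hold integer rubric scores (1-5); rows are items and columns
    are aspects (morphosyntactic, semantic, pragmatic).
    """
    student = np.asarray(student, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mae = np.abs(student - reference).mean()
    exact = (student == reference).mean()
    return mae, exact

# Hypothetical example: 5 items x 3 aspects for one evaluation round.
students = np.array([[4, 3, 3], [5, 5, 4], [2, 2, 3], [4, 4, 4], [3, 3, 2]])
reference = np.array([[4, 4, 3], [5, 4, 4], [3, 2, 2], [4, 4, 5], [3, 3, 3]])
print(mae_and_exact_match(students, reference))
```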
3.7. Ethical Considerations
4. Results
4.1. Quantitative Evaluation Scores
- Morphosyntactic: human means of 3.01 ± 0.74 (Qwen 3 0.6B), 3.45 ± 0.60 (LLaMA 3.2 3B), and 3.87 ± 0.81 (Gemma 3 1B), versus GPT–5 means of 3.09 ± 0.96, 3.91 ± 0.87, and 4.25 ± 0.95;
- Semantic: human means of 2.53 ± 0.83, 3.33 ± 0.78, and 3.75 ± 0.90, versus GPT–5 means of 2.82 ± 0.93, 3.76 ± 1.05, and 4.00 ± 1.03;
- Pragmatic: human means of 2.59 ± 0.89, 3.40 ± 0.81, and 3.83 ± 0.89, versus GPT–5 means of 3.22 ± 1.00, 4.09 ± 0.92, and 4.35 ± 0.93. A minimal sketch of how such per-aspect means and standard deviations can be aggregated from raw ratings follows this list.
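As an illustration of the aggregation behind these figures, the sketch below computes per-model, per-aspect means and standard deviations from a long-format ratings table using pandas. The column names and example rows are assumptions, not the study’s actual data schema.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per item, model, aspect, and rater.
ratings = pd.DataFrame(
    {
        "model": ["Qwen 3 (0.6B)", "Qwen 3 (0.6B)", "Gemma 3 (1B)", "Gemma 3 (1B)"],
        "aspect": ["semantic", "semantic", "semantic", "semantic"],
        "rater": ["Evaluator 1", "Evaluator 2", "Evaluator 1", "Evaluator 2"],
        "score": [2, 3, 4, 5],
    }
)

# Mean and standard deviation per model and aspect, pooled over items and raters.
summary = (
    ratings.groupby(["model", "aspect"])["score"]
    .agg(["mean", "std"])
    .round(2)
)
print(summary)
```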
4.2. Cross-Metric Correlation
4.3. Per-Aspect Dispersion and Human–GPT Means
4.4. Inter-Annotator Agreement by Model and Aspect
4.5. Annotator Competence (MACE) Across Aspects
4.6. Examples of Evaluation Differences
- Source: “Archimedes was an ancient Greek thinker, and …” (continuation omitted).
- Qwen’s translation: “Archimedes adalah peneliti ahli Yunani sebelum…” (translates roughly to “Archimedes was a Greek expert researcher from before…”). This received Morph = 3, Sem = 2, and Prag = 3 from humans. They commented that “peneliti ahli” (“expert researcher”) is an odd choice for “thinker” (a semantic error) and the sentence trailed off inaccurately.
- Gemma’s translation: “Archimedes adalah seorang pemikir Yunani Kuno…” (literally “Archimedes was an Ancient Greek thinker…”), which is almost a perfect translation. Humans gave it Morph = 5, Sem = 5, and Prag = 5, a rare unanimous perfect score.
- LLaMA’s translation: “Archimedes adalah pemikir Yunani kuno…” (similar to Gemma’s but missing the article or some nuance) received Morph = 5, Sem = 4, and Prag = 4. Here all models were grammatically okay (hence Qwen Morph = 3 not too low, just one minor grammar issue with “peneliti ahli”), but the semantic accuracy separated them—Qwen changed the meaning, LLaMA preserved meaning but omitted a slight detail, and Gemma was spot on. GPT–5 gave scores of (3, 2, 3) to Qwen, (5, 4, 5) to Gemma, and (5, 4, 4) to LLaMA—matching the human pattern, though GPT–5 rated Qwen’s grammar a bit higher than humans did (perhaps not catching the issue with “peneliti ahli”).
- Source: “When the devil bites another devil, it actually…” (colloquial expression). This is tricky pragmatically because it is figurative.
- Gemma’s translation: “Ketika iblis menggigit iblis lainnya, sebenarnya…”, which is very literal (“When a devil bites another devil, actually…”). Indonesian evaluators noted that this was grammatically fine but pragmatically odd—“iblis menggigit iblis” is not a known saying. They gave it Morph = 4, Sem = 4, and Prag = 2, citing that the tone did not carry over (maybe the source was implying conflict between bad people, an idiom that was not localized).
- LLaMA’s translation was similarly literal and received Morph = 4, Sem = 4, and Prag = 2 as well.
- Qwen’s translation actually mistranslated the structure, yielding something incoherent (Prag = 1, Sem = 1). In this case, GPT–5 somewhat overrated the pragmatic aspect for Gemma and LLaMA, giving them Prag = 4 where humans gave 2. GPT–5 likely saw a grammatically correct sentence and, not recognizing the idiom, assumed it was acceptable, whereas human translators knew it missed the idiomatic meaning. This example illustrates a limitation: GPT–5 lacked cultural context to see the pragmatic failure.
- Source: “The project is going very, very well, and…” (conversational tone).
- Qwen: “Proyek ini berjalan sangat, sangat baik, dan…” (literal translation; in Indonesian doubling “sangat” is a bit unusual but understandable). Humans: Morph = 4 (a slight stylistic issue), Sem = 5, and Prag = 3 (tone a bit off, could use “sangat baik sekali” instead).
- LLaMA: “Proyeknya berjalan dengan sangat baik, dan…” (more natural phrasing), received Morph = 5, Sem = 5, and Prag = 5.
- Gemma: “Proyek ini berjalan dengan sangat baik, dan…” (also excellent). Here all convey the meaning; it is about style. Qwen’s phrasing was less idiomatic (hence pragmatic 3). GPT–5 gave Qwen Prag = 4 (it did not flag the style issue), showing again a slight leniency or lack of that nuance.
4.7. Results of the Classroom Evaluation Study
5. Discussion
5.1. Specialized Smaller LLM vs. General Larger LLM
5.2. Quality Aspect Analysis—Fluency vs. Accuracy
5.3. GPT–5 as an Evaluator—Potential and Pitfalls
5.4. Implications for Indonesian MT
5.5. Human Evaluation for Research vs. Real-World
5.6. Toward Better LLM Translators
5.7. Error Profile and Improvement Targets
5.8. LLM Evaluator Agreement and Calibration
5.9. Practical Use and Post-Editing Considerations
5.10. Implications of the Classroom Evaluation Study
5.11. Broader Impacts
5.12. Future Work
- 1.
- Cross-lingual generalization: Extend evaluation to additional language pairs with diverse typological characteristics, including morphology-rich languages (e.g., Turkish and Finnish) and low-resource settings (e.g., regional Indonesian languages), to test the robustness and generalizability of our findings.
- 2.
- Document-level and discourse phenomena: Move beyond sentence-level evaluation to assess how models handle longer context. This includes measuring discourse-level coherence, consistency of terminology, pronominal and anaphora resolution, and appropriate management of register across paragraphs. Methods could involve evaluating paragraph- and document-level translations, designing rubrics for discourse quality, and exploiting LLMs’ extended context windows for both translation and evaluation.
- 3.
- Evaluator calibration: Refine prompts, explore in-context exemplars, or apply lightweight fine-tuning to GPT–5 (GPT–5-chat) as an evaluator in order to reduce its leniency and align it more tightly with rubric-anchored human severity. This could include calibration against reference human ratings on held-out data.
- 4.
- Explainable disagreement analysis: Collect structured rationales from GPT–5 and annotators to better diagnose systematic blind spots. For example, discrepancies in idiomaticity, politeness, or tone could be analyzed qualitatively and linked back to specific rubric dimensions.
- 5.
- Training-time feedback loops: Explore the use of GPT–5 as a “critic” to provide feedback signals during model training, prioritizing low-scoring phenomena (e.g., semantic fidelity errors or pragmatic mismatches). To mitigate evaluator bias, such loops should incorporate confidence thresholds and human-in-the-loop audits.
- 6.
- Few-shot prompting: Prior work suggests that few-shot calibration using human-labeled examples may further improve alignment between LLM and human judgments. In this study, we deliberately adopted a zero-shot evaluation setting in order to examine GPT–5’s ability to apply the rubric without prior exposure to human ratings. While this choice allows for a cleaner comparison between human and LLM evaluators, future work could explore whether instruction-aligned or example-conditioned prompting yields improved calibration, reduced bias, or greater consistency across evaluation dimensions.
5.13. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Evaluation Rubric
Appendix A.1. Morphosyntactic (Grammar, Agreement, Sentence Structure, Morphology)
- 5: No grammatical or syntactic errors. The sentence structure is fluent and consistent with the source. Morphological forms (e.g., tense, agreement, inflections) are fully accurate.
- 4: Minor grammatical or syntactic issues that do not hinder comprehension. Word forms and sentence structure are mostly correct but slightly unnatural in places.
- 3: Noticeable morphosyntactic errors that occasionally affect comprehension or fluency. May include agreement mistakes, awkward word order, different sentence structure, or wrong tense.
- 2: Frequent errors in grammar and structure that hinder comprehension. The sentence structure differs significantly from the source. Unnatural phrasing, tense confusion, or broken sentence patterns are common.
- 1: Sentence is ungrammatical, fragmented, or unparseable. Morphosyntactic errors make it incomprehensible or entirely ungrammatical. The sentence structure is completely different from the source.
Appendix A.2. Semantic (Meaning Preservation, Omissions, Mistranslations)
- 5: Full preservation of meaning. No omissions, distortions, or mistranslations. The translation conveys the same message as the source without deleting or adding additional words.
- 4: Minor meaning shifts or vague expressions that slightly alter nuance but do not mislead. No omission of the verb, subject, or object of the sentence.
- 3: Some parts are missing or incorrect, but the main information is still conveyed.
- 2: Major meaning loss or distortion. Critical elements are missing, incorrect, or misleading. Intended message is mostly lost.
- 1: Little semantic correspondence to the source. Mostly incorrect, irrelevant, or hallucinated content.
Appendix A.3. Pragmatic (Tone, Register, Politeness, Cultural/Situational Appropriateness)
- 5: Tone, register, and cultural fit are fully appropriate for the context. The translation sounds natural and is aligned with the speaker’s intent.
- 4: Mostly appropriate tone and register. Slight mismatches (e.g., slightly too formal/informal), but they do not cause misunderstanding or awkwardness.
- 3: Inconsistent or ambiguous tone or formality. Some sections may sound awkward or misaligned with the source intent.
- 2: Inappropriate tone or register for the context. Translation may sound offensive, robotic, or culturally inappropriate.
- 1: Completely wrong pragmatic use (e.g., overly rude, sarcastic instead of sincere, or entirely mismatched social function).
Appendix B. Prompt Template
Listing A1: Prompt template used for GPT–5 evaluation.

Evaluate the Indonesian translation of the English text below based on five criteria: Adequacy, Fluency, Morphosyntactic, Semantic, and Pragmatic. Use the scoring rubric from 1 to 5 for each criterion, as described in detail below. Assign only a numeric score for each aspect (no explanations).

Scoring Rubrics:

1. Adequacy: Does the Indonesian translation convey the same meaning as the English source sentence?
- Scoring Guide:
5: Complete transfer of meaning (accurate and all important information conveyed)
4: Most meaning conveyed with minor omissions or errors
3: Some parts are missing or incorrect, but some information still conveyed
2: Major omissions or incorrect meaning, only small part of the information is conveyed
1: Meaning largely incorrect or missing

2. Fluency: How natural and grammatically correct is the Indonesian translation?
- Scoring Guide:
5: Fluent, natural, and grammatically correct
4: Mostly fluent with minor grammatical issues or awkwardness
3: Understandable, but noticeable grammatical errors
2: Difficult to understand due to poor grammar or awkward phrasing
1: Unintelligible due to major grammatical issues

3. Morphosyntactic (Grammar, agreement, sentence structure, morphology):
- Scoring Guide:
5: No grammatical or syntactic errors. The sentence structure is fluent, and consistent with the source data. Morphological forms (e.g., tense, agreement, inflections) are fully accurate.
4: Minor grammatical or syntactic issues that do not hinder comprehension. Word forms and sentence structure are mostly correct but slightly unnatural in places.
3: Noticeable morphosyntactic errors that occasionally affect comprehension or fluency. May include agreement mistakes, awkward word order, different sentence structure, or wrong tense.
2: Frequent errors in grammar and structure that hinder comprehension. The sentence structure differs significantly from the source data. Unnatural phrasing, tense confusion, or broken sentence patterns are common.
1: Sentence is ungrammatical, fragmented, or unparseable. Morphosyntactic errors make it incomprehensible or entirely ungrammatical. The sentence structure is completely different from the source data.

4. Semantic (Meaning preservation, omissions, mistranslations):
- Scoring Guide:
5: Full preservation of meaning. No omissions, distortions, or mistranslations. The translation conveys the same message as the source without deleting or adding additional words.
4: Minor meaning shifts or vague expressions that slightly alter nuance but do not mislead. No omissions for verbs, subject, and object of the sentence.
3: Partial meaning loss. Some ideas are omitted or inaccurately rendered, but core meaning is still recoverable with effort.
2: Major meaning loss or distortion. Critical elements are missing, incorrect, or misleading. Intended message is mostly lost.
1: Little semantic correspondence to the source. Mostly incorrect, irrelevant, or hallucinated content.

5. Pragmatic (Tone, register, politeness, cultural/situational appropriateness):
- Scoring Guide:
5: Tone, register, and cultural fit are fully appropriate for the context. The translation sounds natural and is aligned with the speakers intent.
4: Mostly appropriate tone and register. Slight mismatches (e.g., slightly too formal/informal), but do not cause misunderstanding or awkwardness.
3: Inconsistent or ambiguous tone or formality. Some sections may sound awkward or misaligned with the source intent.
2: Inappropriate tone or register for the context. Translation may sound offensive, robotic, or culturally inappropriate.
1: Completely wrong pragmatic use. For example, overly rude, sarcastic instead of sincere, or completely mismatched social function.

You will receive a list of examples. For each example, return a JSON object with:
- "row_id": the given ID
- "scores": { "adequacy": int, "fluency": int, "morphosyntactic": int, "semantic": int, "pragmatic": int }

IMPORTANT OUTPUT FORMAT: Return a single JSON object with the key "evaluations" whose value is an array of the example results. Do not include any extra text, prose, or code fences---only valid JSON.
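For illustration, the sketch below shows one way the prompt in Listing A1 could be submitted through the openai Python SDK (v1 chat-completions interface) and the returned JSON parsed. The model identifier, the formatting of the example batch, and the use of response_format are assumptions rather than the exact evaluation script used in this study.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# PROMPT_TEMPLATE holds the full instructions from Listing A1 above (truncated here).
PROMPT_TEMPLATE = "Evaluate the Indonesian translation of the English text below ..."

def evaluate_batch(examples, model="gpt-5-chat-latest"):  # model id is an assumption
    """Send a batch of {row_id, source, translation} items and parse the JSON scores."""
    payload = "\n\n".join(
        f'ID: {ex["row_id"]}\nEnglish: {ex["source"]}\nIndonesian: {ex["translation"]}'
        for ex in examples
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT_TEMPLATE},
            {"role": "user", "content": payload},
        ],
        response_format={"type": "json_object"},  # request a single JSON object
    )
    return json.loads(resp.choices[0].message.content)["evaluations"]

scores = evaluate_batch([
    {"row_id": 1,
     "source": "Archimedes was an ancient Greek thinker.",
     "translation": "Archimedes adalah seorang pemikir Yunani Kuno."},
])
print(scores)
```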
References
- Ataman, D.; Birch, A.; Habash, N.; Federico, M.; Koehn, P.; Cho, K. Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems. Information 2025, 16, 723.
- Zhu, W.; Liu, H.; Dong, Q.; Xu, J.; Huang, S.; Kong, L.; Chen, J.; Li, L. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 2765–2781.
- Kong, M.; Fernandez, A.; Bains, J.; Milisavljevic, A.; Brooks, K.C.; Shanmugam, A.; Avilez, L.; Li, J.; Honcharov, V.; Yang, A.; et al. Evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions: A comparative analysis. BMJ Qual. Saf. 2025.
- Rao, P.; McGee, L.M.; Seideman, C.A. A comparative assessment of ChatGPT vs. Google Translate for the translation of patient instructions. J. Med. Artif. Intell. 2024, 7, 1–8.
- Rousan, R.A.; Jaradat, R.S.; Malkawi, M. ChatGPT translation vs. human translation: An examination of a literary text. Cogent Soc. Sci. 2025, 11, 2472916.
- Callison-Burch, C.; Osborne, M.; Koehn, P. Re-evaluating the Role of Bleu in Machine Translation Research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 3–7 April 2006; pp. 249–256.
- Mathur, N.; Baldwin, T.; Cohn, T. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4984–4997.
- Park, D.; Padó, S. Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 11723–11744.
- Freitag, M.; Foster, G.; Grangier, D.; Ratnakar, V.; Tan, Q.; Macherey, W. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Trans. Assoc. Comput. Linguist. 2021, 9, 1460–1474.
- Freitag, M.; Mathur, N.; Deutsch, D.; Lo, C.K.; Avramidis, E.; Rei, R.; Thompson, B.; Blain, F.; Kocmi, T.; Wang, J.; et al. Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 47–81.
- Kocmi, T.; Federmann, C. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; p. 193.
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 2511–2522.
- Nasution, A.H.; Onan, A. ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks. IEEE Access 2024, 12, 71876–71900.
- Nasution, A.H.; Onan, A.; Murakami, Y.; Monika, W.; Hanafiah, A. Benchmarking Open-Source Large Language Models for Sentiment and Emotion Classification in Indonesian Tweets. IEEE Access 2025, 13, 94009–94025.
- NLLB Team; Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672.
- Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond English-Centric Multilingual Machine Translation. J. Mach. Learn. Res. 2021, 22, 1–48.
- Kim, Y.; Petrov, P.; Petrushkov, P.; Khadivi, S.; Ney, H. Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019.
- He, Z.; Liang, T.; Jiao, W.; Zhang, Z.; Yang, Y.; Wang, R.; Tu, Z.; Shi, S.; Wang, X. Exploring Human-Like Translation Strategy with Large Language Models. Trans. Assoc. Comput. Linguist. 2024, 12, 229–246.
- Freitag, M.; Rei, R.; Mathur, N.; Lo, C.K.; Stewart, C.; Avramidis, E.; Kocmi, T.; Foster, G.; Lavie, A.; Martins, A.F. Results of the WMT22 Metrics Shared Task: Stop Using BLEU. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022.
- Jiang, Y.E.; Liu, T.; Ma, S.; Zhang, D.; Yang, J.; Huang, H.; Sennrich, R.; Sachan, M.; Cotterell, R.; Zhou, M. BLONDE: An Automatic Evaluation Metric for Document-level Machine Translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 1550–1565.
- Kocmi, T.; Avramidis, E.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Freitag, M.; Gowda, T.; Grundkiewicz, R.; et al. Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here But Not Quite There Yet. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; pp. 1–42.
- Kocmi, T.; Zouhar, V.; Avramidis, E.; Grundkiewicz, R.; Karpinska, M.; Popovic, M.; Sachan, M.; Shmatova, M. Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1440–1453.
- Kim, A. RUBRIC-MQM: Span-Level LLM-as-judge in Machine Translation For High-End Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025; Volume 6, pp. 147–165.
- Li, R.; Patel, T.; Du, X. PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. Trans. Mach. Learn. Res. 2024, 1–37. Available online: https://openreview.net/forum?id=YVD1QqWRaj (accessed on 12 January 2026).
- Tan, W.; Zhu, K. NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models. arXiv 2024.
- Cahyawijaya, S.; Winata, G.I.; Wilie, B.; Vincentio, K.; Li, X.; Kuncoro, A.; Ruder, S.; Lim, Z.Y.; Bahar, S.; Khodra, M.; et al. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8875–8898.
- Sun, J.; Luo, Y.; Gong, Y.; Lin, C.; Shen, Y.; Guo, J.; Duan, N. Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 4074–4101.
- Cao, S.; Zhou, T. Exploring the Efficacy of ChatGPT-Based Feedback Compared with Teacher Feedback and Self-Feedback: Evidence From Chinese-English Translation. SAGE Open 2025, 15, 1–18.
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025.
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024.
- Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ramé, A.; Rivière, M.; et al. Gemma 3 Technical Report. arXiv 2025.
- Nasution, S.; Ferdiana, R.; Hartanto, R. Towards Two-Step Fine-Tuned Abstractive Summarization for Low-Resource Language Using Transformer T5. Int. J. Adv. Comput. Sci. Appl. 2025, 16, 1220–1230.






| Metric | Description |
|---|---|
| Morphosyntactic | Assesses grammar, agreement, sentence structure, and morphology; identifies issues like verb conjugation errors, subject–verb disagreement, tense misuse, or malformed clause structure. |
| Semantic | Evaluates fidelity of meaning, detecting mistranslations, omissions, incorrect lexical choices, or additions that distort intended meaning. |
| Pragmatic | Captures tone, register, politeness, and cultural/situational appropriateness; identifies errors in speech level, formality, or contextual fit. |
| Metric | Matched MQM Category | Matched MQM Subcategories |
|---|---|---|
| Morphosyntactic | Linguistic Conventions; Accuracy | Word form; Grammar; Agreement; Textual conventions; Transliteration; MT hallucination |
| Semantic | Design and Markup; Accuracy; Linguistic Conventions; Audience Appropriateness | Mistranslation; Omission; Addition; Missing markup; Incorrect item; End-user suitability; Missing graphic/table |
| Pragmatic | Design and Markup; Locale Conventions; Terminology | Style; Formality; Audience appropriateness; Questionable markup; Locale-specific punctuation; Number/measurement format |
| Aspect | Scale Description (1–5) |
|---|---|
| Morphosyntactic | 5: No grammatical/syntactic errors; fully fluent and accurate. 4: Minor issues not hindering comprehension. 3: Noticeable errors occasionally affecting comprehension. 2: Frequent errors that hinder comprehension. 1: Ungrammatical or incomprehensible. |
| Semantic | 5: Full preservation of meaning. 4: Minor meaning shifts not misleading. 3: Partial meaning loss but core message intact. 2: Major meaning loss or distortion. 1: Little correspondence; mostly incorrect/irrelevant. |
| Pragmatic | 5: Tone, register, and cultural fit fully appropriate. 4: Mostly appropriate with slight mismatches. 3: Inconsistent or awkward tone/formality. 2: Inappropriate tone/register for context. 1: Completely wrong pragmatic use. |
| Model | Human (Morphosyntactic) | GPT–5 (Morphosyntactic) | Overall Human | Overall GPT–5 |
|---|---|---|---|---|
| Qwen 3 (0.6B) | 3.01 ± 0.74 | 3.09 ± 0.96 | 2.71 ± 0.78 | 3.04 ± 0.93 |
| LLaMA 3.2 (3B) | 3.45 ± 0.60 | 3.91 ± 0.87 | 3.40 ± 0.68 | 3.92 ± 0.89 |
| Gemma 3 (1B) | 3.87 ± 0.81 | 4.25 ± 0.95 | 3.82 ± 0.84 | 4.20 ± 0.88 |
| Model | Human (Semantic) | GPT–5 (Semantic) | Overall Human | Overall GPT–5 |
|---|---|---|---|---|
| Qwen 3 (0.6B) | 2.53 ± 0.83 | 2.82 ± 0.93 | 2.71 ± 0.78 | 3.04 ± 0.93 |
| LLaMA 3.2 (3B) | 3.33 ± 0.78 | 3.76 ± 1.05 | 3.40 ± 0.68 | 3.92 ± 0.89 |
| Gemma 3 (1B) | 3.75 ± 0.90 | 4.00 ± 1.03 | 3.82 ± 0.84 | 4.20 ± 0.88 |
| Model | Human (Pragmatic) | GPT–5 (Pragmatic) | Overall Human | Overall GPT–5 |
|---|---|---|---|---|
| Qwen 3 (0.6B) | 2.59 ± 0.89 | 3.22 ± 1.00 | 2.71 ± 0.78 | 3.04 ± 0.93 |
| LLaMA 3.2 (3B) | 3.40 ± 0.81 | 4.09 ± 0.92 | 3.40 ± 0.68 | 3.92 ± 0.89 |
| Gemma 3 (1B) | 3.83 ± 0.89 | 4.35 ± 0.93 | 3.82 ± 0.84 | 4.20 ± 0.88 |
| Model | Metric | Human Mean | Human Std | GPT–5 Mean | GPT–5 Std |
|---|---|---|---|---|---|
| Qwen 3 (0.6B) | Morphosyntactic | 3.007 | 0.748 | 3.09 | 0.965 |
| Qwen 3 (0.6B) | Semantic | 2.533 | 0.834 | 2.82 | 0.936 |
| Qwen 3 (0.6B) | Pragmatic | 2.593 | 0.890 | 3.22 | 1.001 |
| LLaMA 3.2 (3B) | Morphosyntactic | 3.450 | 0.600 | 3.91 | 0.877 |
| LLaMA 3.2 (3B) | Semantic | 3.333 | 0.789 | 3.76 | 1.055 |
| LLaMA 3.2 (3B) | Pragmatic | 3.403 | 0.816 | 4.09 | 0.922 |
| Gemma 3 (1B) | Morphosyntactic | 3.870 | 0.814 | 4.25 | 0.957 |
| Gemma 3 (1B) | Semantic | 3.750 | 0.909 | 4.00 | 1.035 |
| Gemma 3 (1B) | Pragmatic | 3.830 | 0.892 | 4.35 | 0.936 |
| Model | Metric | Krippendorff Alpha | Weighted Kappa Avg |
|---|---|---|---|
| Qwen 3 (0.6B) | Morphosyntactic | 0.399 | 0.408 |
| Qwen 3 (0.6B) | Semantic | 0.518 | 0.521 |
| Qwen 3 (0.6B) | Pragmatic | 0.516 | 0.535 |
| LLaMA 3.2 (3B) | Morphosyntactic | 0.377 | 0.394 |
| LLaMA 3.2 (3B) | Semantic | 0.502 | 0.507 |
| LLaMA 3.2 (3B) | Pragmatic | 0.445 | 0.465 |
| Gemma 3 (1B) | Morphosyntactic | 0.482 | 0.488 |
| Gemma 3 (1B) | Semantic | 0.624 | 0.621 |
| Gemma 3 (1B) | Pragmatic | 0.530 | 0.529 |
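For illustration, the sketch below computes agreement statistics of the kind reported in the preceding table: Krippendorff’s alpha at the ordinal level and an average pairwise weighted kappa, assuming ratings are arranged as an annotator-by-item matrix. The quadratic weighting and the krippendorff and scikit-learn packages are illustrative choices, not necessarily those used in this study.

```python
from itertools import combinations

import numpy as np
import krippendorff  # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Hypothetical matrix: one row per annotator, one column per item (scores 1-5).
scores = np.array([
    [4, 3, 5, 2, 4, 3],   # Evaluator 1
    [4, 4, 5, 2, 3, 3],   # Evaluator 2
    [5, 3, 4, 2, 4, 2],   # Evaluator 3
])

# Krippendorff's alpha for ordinal data (missing ratings could be np.nan).
alpha = krippendorff.alpha(reliability_data=scores, level_of_measurement="ordinal")

# Average pairwise weighted kappa over all annotator pairs (quadratic weights assumed).
kappas = [
    cohen_kappa_score(scores[i], scores[j], weights="quadratic")
    for i, j in combinations(range(scores.shape[0]), 2)
]
print(f"alpha = {alpha:.3f}, mean weighted kappa = {np.mean(kappas):.3f}")
```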
| Metric | Annotator | Competence |
|---|---|---|
| Morphosyntactic | Evaluator 1 | 0.552 |
| Morphosyntactic | Evaluator 2 | 0.600 |
| Morphosyntactic | Evaluator 3 | 0.591 |
| Morphosyntactic | GPT–5 | 0.651 |
| Semantic | Evaluator 1 | 0.576 |
| Semantic | Evaluator 2 | 0.533 |
| Semantic | Evaluator 3 | 0.611 |
| Semantic | GPT–5 | 0.634 |
| Pragmatic | Evaluator 1 | 0.551 |
| Pragmatic | Evaluator 2 | 0.601 |
| Pragmatic | Evaluator 3 | 0.528 |
| Pragmatic | GPT–5 | 0.572 |
| Metric/Aspect | Pre | Post | Final |
|---|---|---|---|
| MAE (Overall) | 0.97 | 0.93 | 0.83 |
| Morphosyntactic | 0.98 | 0.96 | 0.91 |
| Semantic | 1.03 | 0.92 | 0.86 |
| Pragmatic | 0.90 | 0.89 | 0.84 |
| Exact Match Rate (Overall) | 0.30 | 0.28 | 0.50 |

