Article
Peer-Review Record

Automated Essay Scoring Using Transformer Models

Psych 2021, 3(4), 897-915; https://doi.org/10.3390/psych3040056
by Sabrina Ludwig 1,*, Christian Mayer 2, Christopher Hansen 3, Kerstin Eilers 4 and Steffen Brandt 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 10 October 2021 / Revised: 4 December 2021 / Accepted: 7 December 2021 / Published: 14 December 2021

Round 1

Reviewer 1 Report

In this manuscript, the authors present the benefits of applying a transformer-based model to AES. Experimental results show that the transformer-based approaches outperformed the regression-based approach on the given AES task. I suggest the authors also conduct experiments on the Automated Student Assessment Prize (ASAP) dataset.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

This paper concerns the use of pre-trained Transformer-based language models for automated essay scoring (AES). The authors conducted experiments to compare a BERT fine-tuning approach against a traditional bag-of-words (BOW) feature-based approach on the task of politeness classification. Better results are reported with the neural approach on a German email dataset.

However, my main concern is that the task of politeness classification should be treated as text classification rather than as an AES task. AES has been cast as (1) a regression task, where the goal is to predict the score of an essay; (2) a classification task, where the goal is to classify an essay as belonging to one of a small number of classes (e.g., low, medium, or high); or (3) a ranking task, where the goal is to rank two or more essays based on their quality (Ke and Ng, 2019). Whenever an AES task has been cast as a classification task, more than two labels were involved.

Section 2 should be improved; please refer to Ke and Ng (2019), Automated Essay Scoring: A Survey of the State of the Art.

  • Traditional feature-based approaches to AES used not only BOW but also a more extended list of features, including embeddings as well as word-category, length-based, lexical, syntactic, and semantic features. I understand that some of these features might not be useful or relevant for politeness classification. Again, I do not understand why the task has been compared to / treated as an AES task. For a politeness classification task, I would expect other types of features to have been used. I would also encourage the authors to contextualise the work with respect to text classification.
  • Section 2.2.1: some of the terms are used by both traditional and neural approaches, e.g. model, feature, label.
  • Table 1: the results (i.e., Kappa scores) from Taghipour & Ng (2016) and Alikaniotis et al. (2016) are not directly comparable. The table lists them together, which is very confusing.
  • Line 247: "The encoder-decoder architecture describes a sequence-to-sequence model as proposed in the original “Attention is All You Need” paper..." The sequence-to-sequence model and the encoder-decoder architecture were proposed and widely used in neural machine translation much earlier than the “Attention is All You Need” paper.

I think most of the content in Section 4 should be in an appendix or in a technical report, rather than in a journal paper.

Section 5, Table 4: it would be better to refer to "German BERT (Jun. 2019)" as "German BERT (small)" and to "German BERT (Oct. 2020)" as "German BERT (big)".

Line 583: "humagiven" - a spelling mistake?
Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Automated essay scoring (AES) is gaining increasing attention in education. This paper focuses on the AES topic and investigates the differences between AES using transformer-based models and previous machine-learning models based on the bag-of-words (BOW) approach. My favorite part of this paper is Section 6 (Discussion), in which the authors discuss and analyze in detail the possible reasons why the transformer-based models perform well while the BOW-based models do not.

However, the answer to the difference between AES using transformer-based models and machine-learning models based on the BOW approach is already known and can be found in many previous works on AES and related topics. Therefore, in my opinion, if the readers of this paper are from the field of education, are not familiar with deep learning, and want to learn about AES technology, this paper is good and feasible. If the paper is intended for readers in the NLP field, it may not be suitable.

Other suggestions:

  1. Some newer papers on AES using BERT could be listed and compared against, such as: (a) Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186. Association for Computational Linguistics (2019); and (b) Liao, D., Xu, J., Li, G., Wang, Y.: Hierarchical coherence modeling for document quality assessment. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13353–13361 (May 2021);
  2. Because QWK (quadratic weighted kappa) is the commonly used metric, the authors could report QWK as a supplementary evaluation metric (see the sketch below).
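For reference, quadratic weighted kappa can be computed directly with scikit-learn's cohen_kappa_score using quadratic weights. This is only a minimal sketch; the label arrays below are illustrative placeholders, not the paper's data.

    # Minimal sketch: quadratic weighted kappa (QWK) via scikit-learn.
    # The label arrays are illustrative placeholders, not the paper's data.
    from sklearn.metrics import cohen_kappa_score

    human_labels = [0, 1, 1, 0, 1, 0, 0, 1]   # e.g., rater-assigned classes
    model_labels = [0, 1, 0, 0, 1, 0, 1, 1]   # e.g., model predictions

    qwk = cohen_kappa_score(human_labels, model_labels, weights="quadratic")
    print(f"QWK = {qwk:.3f}")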

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

The work is interesting and important. However, the description must be improved. My recommendations:  

  1. I have conflicting thoughts about the Related Work analysis. It is very comprehensive, but it contains material not related to the problem being solved (e.g., decoders, encoder-decoder architectures, etc.). The Related Work section is really very strong; unfortunately, I cannot say the same about the other sections.
  2. In lines 299-309 you describe the student analysis. In such evaluations there is a lot of subjectivity. Have you performed any inter-rater agreement evaluations?
  3. Your methodological part (Section 4) contains a lot of programming code (of the kind typically used in similar classification programs). I would recommend providing more theoretical details about the methods used (architecture, hyper-parameters, etc.) and explanations of the selected parameter values, and moving the code itself to a referenced repository (e.g., on GitHub).
  4. You have an imbalanced dataset (92.5% in one class and 7.5% in the other). From Table 4 (German BERT, 2020) you have a split of 627/46, with the majority baseline equal to 92.66%. This means that a method must exceed this value to be considered suitable for your task. From Table 5 I see that this model is the only one suitable for your task.
  5. Besides, since the initial weights in DNNs are randomly initialized, the results of different experiments can differ slightly. It is typically recommended to repeat the same experiment several times, average the results, and calculate confidence intervals; robust conclusions cannot be drawn from a single run (see the sketch after this list).
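A minimal sketch of the suggested multi-run evaluation, assuming a hypothetical train_and_evaluate routine that stands in for the paper's fine-tuning and evaluation loop (it is not the authors' actual code):

    # Minimal sketch: repeat training with different seeds and report the mean
    # accuracy with a 95% confidence interval. train_and_evaluate is a
    # hypothetical stand-in for the paper's fine-tuning loop.
    import numpy as np
    from scipy import stats

    def train_and_evaluate(seed: int) -> float:
        """Placeholder: fine-tune with the given seed and return test accuracy."""
        rng = np.random.default_rng(seed)
        return 0.93 + rng.normal(scale=0.01)   # dummy value for illustration only

    accuracies = np.array([train_and_evaluate(seed) for seed in range(5)])
    mean_acc = accuracies.mean()
    # 95% confidence interval via the t-distribution (appropriate for few runs)
    low, high = stats.t.interval(0.95, df=len(accuracies) - 1,
                                 loc=mean_acc, scale=stats.sem(accuracies))
    print(f"accuracy: {mean_acc:.4f}, 95% CI: [{low:.4f}, {high:.4f}]")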

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

As I mentioned in my previous review, if the target readers are practitioners from the educational sector, this paper is suitable. Most of my questions have been answered in the responses.

In addition, as this work is aimed at readers from the educational sector, I suggest the authors add some of the latest studies on cross-prompt automated essay scoring to the related work of the paper, such as:

(a) C. Jin, B. He, K. Hui, L. Sun, TDNN: A two-stage deep neural network for prompt-independent automated essay scoring, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018.

(b) Li, Xia, Minping Chen, and Jian-Yun Nie. "SEDNN: Shared and enhanced deep neural network model for cross-prompt automated essay scoring." Knowledge-Based Systems 210 (2020): 106491.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

Thank you for considering some of my previous remarks. Unfortunately, the most important remark (see below) has not been addressed: the methodology is not the Python code. My previous remark was:

  1. Your methodological part (Section 4) contains a lot of programming code (of the kind typically used in similar classification programs). I would recommend providing more theoretical details about the methods used (architecture, hyper-parameters, etc.) and explanations of the selected parameter values, and moving the code itself to a referenced repository (e.g., on GitHub).

Author Response

Please see the attachment.

Author Response File: Author Response.docx
