Article
Peer-Review Record

Automated Assessment of Inferences Using Pre-Trained Language Models

Appl. Sci. 2024, 14(9), 3657; https://doi.org/10.3390/app14093657
by Yongseok Yoo
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 25 March 2024 / Revised: 21 April 2024 / Accepted: 23 April 2024 / Published: 25 April 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have identified an appropriate research niche, namely the development of a reliable evaluation method for judging readers’ ability to make inferences based on a reading text.

The introduction provides readers with the necessary background details and shows the research gap that this study aims to fill. The materials and methods section is easy to follow and logical, although there is scope to improve it (see the issues below). The results and the subsequent discussion sections are well written. The claims and comments made appear justified and are in line with the reported findings. This study makes a novel contribution to the extant literature.

Prior to publication, there are a few minor issues that the authors should deal with to increase the quality and readability of their manuscript. These are listed below:

Minor Issues

1. Language

I assume that this experiment was conducted in Korean, but the authors should explicitly state this.

2. Inter-evaluator agreement

It is laudable to use multiple evaluators, but a more detailed reporting of the degree of agreement would increase faith in the reliability of the judgements, e.g. Fleiss' Kappa or Krippendorff's Alpha.
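As an illustration of how such an agreement statistic might be computed, the sketch below uses statsmodels to obtain Fleiss' kappa; the three evaluators and their ratings are placeholder assumptions, not the study's actual data.

```python
# Minimal sketch: Fleiss' kappa for three evaluators labelling the same
# responses as inference (1) or non-inference (0). Ratings are placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are responses, columns are evaluators, values are assigned labels.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

table, _ = aggregate_raters(ratings)  # per-response counts for each category
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
```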

3. Inference types

There are multiple ways to categorize inferences, e.g. by source of additional information (endotextual, exotextual, etc.), by function (causal, predictive, etc.), by task (gap-fill, theme, etc.), and so forth; therefore, the authors should justify their categorization of inference types. Were the categorizations made by one or multiple evaluators?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper explores objective and automated methods for assessing inference in readers' responses using natural language processing.

Is function (1) correct? Given that y is the true label (0 or 1), the loss appears to reduce to Loss = -log(σ(z)) in every case; why was the formula designed this way?
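For context, the standard binary cross-entropy loss for a logit z and a label y ∈ {0, 1} is

    L(y, z) = -[ y·log σ(z) + (1 - y)·log(1 - σ(z)) ],

which equals -log σ(z) only when y = 1 and -log(1 - σ(z)) when y = 0. The paper's function (1) is not reproduced in this record, so this standard form is given only as a point of reference for the reviewer's question.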

I did not find the details of the fine-tuning. Why does fine-tuning based on the 12-layer model perform better than the 24-layer model?
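For illustration only, a typical fine-tuning setup for a pre-trained encoder on sentence-response pairs might look like the sketch below; the checkpoint name, hyperparameters, and example data are assumptions, not the author's actual configuration.

```python
# Illustrative fine-tuning sketch (placeholder checkpoint, hyperparameters,
# and data; not the author's actual setup).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# A 12-layer ("base") encoder; a 24-layer model would use a "large" checkpoint.
model_name = "bert-base-multilingual-cased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical sentence-response pairs labelled inference (1) or non-inference (0).
data = Dataset.from_dict({
    "sentence": ["The ground was wet this morning."],
    "response": ["It probably rained during the night."],
    "label": [1],
})

def tokenize(batch):
    # Encode the source sentence and the reader's response as one paired input.
    return tokenizer(batch["sentence"], batch["response"],
                     truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()
```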

The experiments compare only BERT and its variants. How does the method compare with LLMs? I think this paper lacks innovation.

Comments on the Quality of English Language

The English language is acceptable.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The author proposes a method for classifying inference in sentence-response pairs based on pre-trained language models, which is a highly meaningful topic with significant educational application value. The paper starts from the current state of the problem, elaborates in detail on the challenges of inference evaluation, and verifies the effectiveness of the proposed method through experiments. This research has certain practical value, but the overall innovation of the article is insufficient, and the content is not rich enough. Specifically, the article has the following areas for improvement:

1. The introduction should further supplement some related work, such as previous attempts to use natural language processing techniques to address this problem and the results achieved. This can better highlight the innovation of this paper.

2. Table 2 could include examples of Non-inference.

3. An explanation of why these three specific pre-trained models were chosen for the experiment could be provided. Describe their characteristics, advantages, and the rationale for their selection.

4. Section 2.3 could explain in detail why these four hyperparameters were chosen for grid search, such as their expected impact on model performance.

5. The range and method of parameter selection are given in section two, but the results do not present the final selection outcome, which should be displayed in a table.

6. The results do not show the final performance of the three models on the test set, making it difficult to assess the models' effectiveness in practical applications.

7. The results section reports only a single evaluation metric; more comprehensive metrics, such as recall and precision, should be reported for the model test results (a minimal sketch of how these might be computed follows this list).

8. The discussion section could elaborate on the specific manifestations of overfitting and the particular solutions for this study.
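Regarding point 7, a minimal sketch of how precision, recall, and F1 might be computed on the test set is given below; the label arrays are placeholders, not the paper's results.

```python
# Minimal sketch: additional test-set metrics with scikit-learn.
# Gold labels and predictions below are placeholders, not the paper's data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical gold labels (1 = inference)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"accuracy={accuracy_score(y_true, y_pred):.3f}  "
      f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```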


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has been revised based on the comments.

Author Response

Thank you for your careful review and insightful comments. They were very helpful in improving the manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

The revised manuscript has shown some quality improvements over the previous version, with the author making targeted revisions based on the comments. However, there are still some issues that need to be addressed:

1. The introduction lacks a comprehensive review of related research and should include a complete literature review.

2. Line 206 should refer to Table 4.

3. To more clearly display the research findings, it is recommended to present the test results of the three models using a table or figure.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
