Article
Peer-Review Record

ChatGPT Code Detection: Techniques for Uncovering the Source of Code

AI 2024, 5(3), 1066-1094; https://doi.org/10.3390/ai5030053
by Marc Oedingen, Raphael C. Engelhardt, Robin Denz, Maximilian Hammer and Wolfgang Konen *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 23 May 2024 / Revised: 23 June 2024 / Accepted: 28 June 2024 / Published: 2 July 2024
(This article belongs to the Topic AI Chatbots: Threat or Opportunity?)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study examines techniques for differentiating between code written by humans and code generated by large language models (in this case, ChatGPT-3.5). With the increasing ability of language models to produce code, there is a need to identify the source of code, especially in the context of higher education and software development.

The study used a combination of embeddings and supervised algorithms, such as deep neural networks (DNN), Random Forest, and XGBoost (XGB), achieving a detection accuracy of 98%. White-box features and an interpretable Bayesian classifier were explored to provide transparency and explainability in the detection process.
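To make the embedding-plus-classifier idea concrete, the sketch below shows one possible realization in Python. It is not the authors' pipeline: the snippets and labels are placeholders, and a character n-gram TF-IDF vectorizer merely stands in for the code embedding actually used in the paper, combined with Random Forest as one of the supervised classifiers mentioned above.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy corpus: the study uses many snippet pairs; these four placeholders
# only illustrate the expected input/label format.
snippets = [
    "def add(a,b): return a+b",                                 # human (placeholder)
    "x=[i*i for i in range(10)]",                               # human (placeholder)
    "def add_numbers(num1, num2):\n    return num1 + num2",     # ChatGPT (placeholder)
    "squares = [number ** 2 for number in range(10)]",          # ChatGPT (placeholder)
]
labels = [0, 0, 1, 1]  # 0 = human-written, 1 = LLM-generated

# Character n-gram TF-IDF stands in for the learned code embedding;
# Random Forest is one of the supervised classifiers named above.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(snippets, labels)
print(model.predict(["def multiply(a, b):\n    return a * b"]))
```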

The results have significant implications for the evaluation of code reliability and security in academic and professional environments, highlighting the importance of effective tools for distinguishing between human contributions and those generated by LLMs. 

 

The article is very interesting and the proposal is very good, but I believe some minor revisions are needed before publication.

 

Comment 1: There is excessive use of the first person throughout the article. It would be preferable to use the first person only in the discussions.

 

Comment 2: Consider adding a table in the related work or discussion section summarizing all approaches to this topic and comparing them with the current proposal.

 

Comment 3: I believe a flowchart illustrating Section 4 could help better understand the steps taken for data collection and processing.

 

Comment 4: Is there a plan to expand the study to be used with other LLMs such as GPT-4, Claude, or Gemini?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The adoption of LLM in programming has underscored the importance of differentiating between human-written code and code generated by machines.

 

In this sense, the paper makes a contribution. It is difficult, however, to evaluate how relevant this contribution is among the many possible alternatives flooding the literature in this and related fields, for example: Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis, PLOS ONE, 2023, doi: 10.1371/journal.pone.0288453.

 

 

How to build such a classifier that works properly is itself a task that leaves me with doubts, if the classifier is itself based on learning from data.

 

I think this paper should be presented more as an experimental study in explainability than as a list of results expected to be credible by themselves.

 

One of my main doubts, for instance, concerns the consistency, reliability, and validity of the employed datasets. Did the authors read the many papers on these problems? For example: AA VV, An alternative approach to dimension reduction for Pareto distributed data: a case study, Journal of Big Data, Volume 8, Issue 1, 2021, DOI: 10.1186/s40537-021-00428-8.

 

 

Another doubt concerns the design of the study: Why Python? How were the code sketches chosen? Along this line, we will soon bang our heads against the wall of Turing completeness.

 

I would prefer to tackle this problem from a more "controlled" point of view, as in traditional experimental statistics, for example by analyzing pieces of software (human- vs. machine-generated) that solve the same problem and then trying to understand/differentiate them, or by deciding a priori on characteristics/parameters for human-generated vs. machine-generated code. In essence, I would suggest a more gradual, scientifically controlled approach. Otherwise, the risk is drawing conclusions without an explanation for those conclusions.

 

I also have many doubts about the (non-)use of statistical procedures to make final decisions. Why have typical approaches based on statistical hypothesis testing been disregarded here?

 

For example, a sample of just 20 programmers may be enjoyable, but it does not carry any statistical significance.
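As a hedged illustration of this concern (with purely hypothetical numbers, not taken from the paper), the following Python sketch shows how wide an exact binomial confidence interval remains with n = 20 participants, and how a simple test against chance could be reported.

```python
# Requires scipy >= 1.7 for binomtest.
from scipy.stats import binomtest

# Hypothetical outcome: 16 of 20 snippets classified correctly.
n, k = 20, 16
test = binomtest(k, n, p=0.5)                       # two-sided test vs. chance level
ci = test.proportion_ci(confidence_level=0.95)      # exact (Clopper-Pearson) interval

print(f"observed proportion: {k / n:.2f}")
print(f"p-value vs. chance : {test.pvalue:.4f}")
print(f"95% CI             : [{ci.low:.2f}, {ci.high:.2f}]")  # spans tens of percentage points
```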

 

In the end, the paper needs a rethinking and a consequent restructuring.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Except for one specific reference, none of the issues raised has had a counterpart in the paper; under these conditions I cannot be in favour of the paper, which has remained unchanged with respect to my criticisms.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
