Peer-Review Record

Effectiveness of ChatGPT in Coding: A Comparative Analysis of Popular Large Language Models

Digital 2024, 4(1), 114-125; https://doi.org/10.3390/digital4010005
by Carlos Eduardo Andino Coello, Mohammed Nazeh Alimam and Rand Kouatly *
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 3 October 2023 / Revised: 27 December 2023 / Accepted: 2 January 2024 / Published: 8 January 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. This reads more like a technical report than a research paper. No scientific insights are provided through a thorough analysis of the differences between the LLMs.

2. No professional or standard quantitative metric is used for model evaluation.

3. The test cases are limited. The claim of sufficient testing samples is not sound, since there is no evidence that the small set of test samples can sufficiently represent the full test dataset.

4. Only one dataset was selected, which is too limited to draw conclusions from.

Comments on the Quality of English Language

Some sentences are ineloquent. The editing quality is weak; repetitive phrases, such as "one the hand", appear in inappropriate positions.

 

Author Response

R1

Thank you for your thoughtful feedback on our paper. We understand your concern and would like to clarify that the primary focus of our paper was to evaluate the coding capability of LLM tools. While we acknowledge that the paper may exhibit some characteristics of a technical report, our intention was to explain and analyze the coding capabilities of LLMs in a simple scientific way. From our perspective, it would be very difficult to participate in the training processes of these systems or to change the number of their attributes. Our goal was to compare the most popular LLM systems from the perspective of ordinary users. To date, there has been no study or research showing the accuracy of these systems in terms of coding; everything we found in the literature pertains to the accuracy of LLMs in performing medical diagnoses.

R2

I would like to emphasize that we are not evaluating the LLM models themselves; rather, we are evaluating the programming code they generate. Published source code metrics can be broadly divided into six categories based on what they measure: correctness, size, complexity, coupling, cohesion, and inheritance. The first two can be evaluated quantitatively through automation, while the others require qualitative tests, which will be conducted in our future work.
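By way of illustration only (this sketch is not the evaluation harness used in the paper; the helper names and the sample snippet are hypothetical), the two automatable metrics can be checked in a few lines of Python: correctness by executing the generated code against its test assertions, and size by counting lines of code.

def run_tests(code: str, tests: list[str]) -> bool:
    """Correctness: execute the generated code, then its test assertions."""
    namespace = {}
    try:
        exec(code, namespace)        # define the generated function(s)
        for test in tests:
            exec(test, namespace)    # each test is an assert statement
        return True
    except Exception:
        return False

def count_lines(code: str) -> int:
    """Size: count non-empty, non-comment lines of code."""
    return sum(1 for line in code.splitlines()
               if line.strip() and not line.strip().startswith("#"))

generated = "def add(a, b):\n    return a + b"
print(run_tests(generated, ["assert add(2, 3) == 5"]))  # True
print(count_lines(generated))                           # 2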

R3

The number of test cases was increased to 460, incorporating the entire certified Google dataset, and the results were updated accordingly.
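For readers who wish to reproduce the setup: the certified Google dataset referenced here appears to be MBPP (Mostly Basic Python Problems), which is publicly mirrored on the Hugging Face Hub. The loading sketch below uses the split and field names published on the dataset card; verify them against the current version before relying on them.

from datasets import load_dataset

mbpp = load_dataset("mbpp")       # splits: train / test / validation / prompt
problem = mbpp["test"][0]
print(problem["text"])            # natural-language task description
print(problem["test_list"])       # assert statements used for grading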

 

Reviewer 2 Report

Comments and Suggestions for Authors

The paper compared five LLMs on generating Python code using a benchmark dataset. The LLMs were evaluated on two measures: effectiveness and efficiency. The former was computed from each LLM's ability to generate correct code, while the latter was based on the number of lines of generated code.
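A minimal sketch of the two measures as just described (assuming hypothetical per-problem results given as (passed, lines_of_code) pairs; this is not the authors' actual harness):

def evaluate(results: list[tuple[bool, int]]) -> tuple[float, float]:
    # Effectiveness: fraction of problems whose generated code passed its tests.
    effectiveness = sum(1 for passed, _ in results if passed) / len(results)
    # Efficiency: mean number of lines of generated code, per the description above.
    efficiency = sum(loc for _, loc in results) / len(results)
    return effectiveness, efficiency

print(evaluate([(True, 4), (False, 9), (True, 6)]))  # approx. (0.667, 6.333)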

The authors did not review existing literature that provides similar comparisons in order to identify the research gap this study sets out to fill.

While Generative AI covers both text and image generation models, the authors appear to refer only to the former and fail to make this distinction when describing the history and workings of Generative AI.

In addition, the section titled "How Generative AI Work" should be "What is Generative AI", as it provides only a very high-level account of how generative AI works, even while limiting the discussion to LLMs.

Comments on the Quality of English Language

The English needs to be improved. The following are some instances of incorrect grammar:

Line 127: The first LLM is Google’s Bard was selected

Line 137: it is an AI safety

Line 144: Mostly Basic Python Programming (MPPP)

Line 206: only one hundred of the dataset was tested against 460 python problems

Author Response

Thank you for your thoughtful feedback on our paper; please find below our answers and corrections:

We added several literature reviews concerning similar comparisons, from line 50 to line 69, in addition to what was already included in the study. We would like to emphasize that, due to the novelty of this type of research, we were unable to find similar, already published work in the literature to compare against.

The study focuses solely on the use of Generative AI for generating code. Working with images is beyond the scope of the proposed study, and we will consider this aspect in separate future work.

The section title was also corrected to "What is Generative AI". Thank you.

All the mentioned errors have been corrected and updated in the second uploaded draft.
