Article

Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index

by Nicholas Christakis *,† and Dimitris Drikakis *,†
Institute for Advanced Modeling and Simulation, University of Nicosia, Nicosia 2417, Cyprus
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(7), 3784; https://doi.org/10.3390/app15073784
Submission received: 17 February 2025 / Revised: 19 March 2025 / Accepted: 27 March 2025 / Published: 30 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
This study introduces a new methodology for an Inference Index (InI) called the Inference Index In Testing Model Effectiveness methodology (INFINITE), aiming to evaluate the performance of Large Language Models (LLMs) in code generation tasks. The InI index provides a comprehensive assessment focusing on three key components: efficiency, consistency, and accuracy. This approach encapsulates time-based efficiency, response quality, and the stability of model outputs, offering a thorough understanding of LLM performance beyond traditional accuracy metrics. We apply this methodology to compare OpenAI’s GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in generating Python code for two tasks: a data-cleaning and statistical computation task and a Long Short-Term Memory (LSTM) model generation task for forecasting meteorological variables such as temperature, relative humidity, and wind speed. Our findings demonstrate that GPT outperforms OAI1 and performs comparably to OAI3 regarding accuracy and workflow efficiency. The study reveals that LLM-assisted code generation can produce results similar to expert-designed models with effective prompting and refinement. GPT’s performance advantage highlights the benefits of widespread use and user feedback. These findings contribute to advancing AI-assisted software development, providing a structured approach for evaluating LLMs in coding tasks and setting the groundwork for future studies on broader model comparisons and expanded assessment frameworks.

1. Introduction

The rapid advancement of Large Language Models (LLMs) [1,2] brings significant benefits across various applications. One area increasingly attracting the scientific community’s interest is using LLMs for algorithmic coding. Writing computer code requires considerable time and effort, and leveraging artificial intelligence (AI) tools to expedite this process can help developers and researchers accelerate scientific discoveries in diverse fields and technological applications. Therefore, developing indicators for assessment and ensuring careful AI model validation and experimentation is essential. This approach will guide progress in the field and prevent misleading hype.
The Generative Pre-trained Transformers series [1,2] utilizes the Transformer architecture [3], which relies on the self-attention mechanism. Transformers have found applications in various domains, including language translation [3,4], speech recognition [5,6], language modelling [7], and scientific and engineering fields such as fluid dynamics [8,9]. These models consist of layers of multi-head attention and position-wise feed-forward networks. They undergo a pre-training phase with large-scale text data in an unsupervised manner, followed by fine-tuning for specific tasks. One of the standout features of LLMs is their exceptional performance across a range of Natural Language Processing (NLP) tasks, including machine translation, question-answering, and text generation. This versatility allows them to handle numerous functions without needing specialized modifications.
However, several challenges persist in the development of Transformers and LLMs. These include biases in training data that can lead to biased outputs, as well as complexity, which may hinder interpretability and make modification difficult [10,11,12]. Furthermore, creating specialized code models, such as forecasting models for time series data, requires detailed domain knowledge to design optimal architectures and effectively tune parameters [13,14]. Additionally, scalability can be an issue, as expanding to larger data sets or more complex tasks can present challenges [15]. All these factors may result in unintended outputs, while the black-box nature of LLMs complicates efforts to comprehend the reasoning behind their decisions [16].
A particularly challenging area for LLMs is code generation. Unlike natural language, code necessitates strict adherence to syntax, logic, and context, making it difficult for LLMs to consistently produce correct and functional code [17]. The absence of context awareness in multi-step programming tasks means that LLMs may produce code that functions in isolation but fails when incorporated into larger programs, often resulting in bugs or errors [18]. Moreover, LLMs face challenges with debugging and validation because they cannot reason through errors or test their code in real time, leading to undetected mistakes [19,20]. While techniques like Chain of Thought (CoT) reasoning offer some improvement in transparency [21,22,23], LLMs still face significant hurdles in generating reliable, optimized code, often requiring human oversight to ensure correctness.
In AI literature, the process of running a trained model on new input data to produce an output is referred to as inference [24,25,26,27]. To evaluate the performance and efficiency of inference, several key factors relating to the various stages of the process have to be taken into consideration and quantified. These factors are discussed in the following section. Various inference models have been proposed to assess the performance of AI tools [28,29,30], but there is no standard metric for evaluating the effectiveness of inferences across tasks and applications [31,32]. This study introduces a new Inference Index (InI) methodology for assessing code-writing with LLMs, labeled the Inference Index In Testing Model Effectiveness (INFINITE). This methodology differs from other approaches due to its unique holistic approach to analyzing and quantifying the effects of key factors that impact the model’s performance. This multidimensional assessment provides a richer picture of inference effectiveness and offers actionable insights into technical and operational aspects of deploying LLMs in code-generation tasks.
Utilizing the INFINITE framework, we assess the coding abilities of different LLMs [1,2]. To systematically evaluate their performance, we employ a two-step approach. First, we task the models with a fundamental code generation task: reading a CSV file, cleaning it by removing empty cells and NaN values, saving the cleaned data set in a new file, and computing the mean, median, and standard deviation for each column, storing these statistics in a separate text file. The results of this task are compared against a reference implementation developed by the authors to ensure correctness and reliability. Subsequently, we present the models with a more complex challenge—generating functional code for a Long Short-Term Memory (LSTM) model [33] to forecast meteorological variables. We compare the performance of ChatGPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in executing these tasks, benchmarking their generated LSTM implementations against both observed data and predictions from an LSTM model developed by the authors (LSTM-H). Please note that all three software packages—GPT, OAI1 and OAI3—are developed and maintained by OpenAI, San Francisco, CA, USA. By evaluating both an easy and a complex problem, we ensure that the INFINITE framework is tested under different difficulty levels, providing a more robust assessment of its applicability.
This study specifically focuses on OpenAI models to evaluate their code-generation capabilities and compare the performance differences between their paid and free versions. The choice to limit our comparison to OpenAI models was made to ensure a controlled and focused evaluation, avoiding variability introduced by architectural differences across different LLM providers. While other models such as Claude, Gemini, and LLaMA [34,35,36] may offer additional insights, their inclusion requires a more extensive comparative analysis that extends beyond the scope of this work. Our primary aim is to establish a clear understanding of how OpenAI models perform under different access conditions. However, we acknowledge the value of a broader comparison and plan to incorporate multiple LLM providers in future research to assess their relative performance in code generation.
The paper is organized as follows: Section 2 introduces the INFINITE methodology. Section 3 briefly presents the LSTM model. In Section 4, an overview of the meteorological data set used in this study is provided. Section 5 describes the initial evaluation, where the models perform a fundamental data processing task—cleaning a CSV file, handling missing values, and computing basic statistical metrics—compared against a reference implementation, with results and analysis included. Section 6 then presents the more complex challenge of generating an LSTM model for forecasting meteorological variables, along with a discussion of the models’ performance in this task. Finally, Section 7 reviews the findings and explores potential future directions.

2. Inference and the INFINITE Methodology

To develop the INFINITE methodology for assessing AI-generated codes, we discuss key concepts that facilitate problem understanding and simplify the methodology and solution procedure. We define various questions (attempts) that can guide the LLM towards the desired answer. Whenever a question is posed, whether the same or a different one, it is treated as a single query. Following a query, the response may either be a “Server Busy” (SB) response or an answer, whether desirable or not. For example, if the same question is asked five times, with the first four yielding an SB response and the fifth providing any answer (desirable or otherwise), this counts as one attempt and five queries. If a non-satisfactory answer is received, a new question—considered a new attempt—is posed. To calculate the average response time per query (ARTpQ) of a framework, we take the mean of the response times over all queries, where response time (RT) is defined as the duration between a query and the framework’s response:
ART_{pQ} = \frac{\sum_{i=1}^{N} RT_i}{N} \qquad (1)
where N is the total number of queries.
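As a minimal illustration of Equation (1), the following Python sketch computes ARTpQ from a list of recorded response times; the timings shown are hypothetical:

# Sketch of Equation (1): average response time per query (ARTpQ).
# The response times are hypothetical values in seconds, one entry per query.
response_times = [7.0, 12.5, 9.3, 30.1, 8.4]

def average_response_time_per_query(rt):
    # Mean of the recorded response times over all N queries.
    return sum(rt) / len(rt)

print(f"ARTpQ = {average_response_time_per_query(response_times):.2f} s")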
When evaluating the performance of LLMs, including the OpenAI family frameworks, it is crucial to assess all intermediate steps in the process “from the initial question to the desired outcome”. Therefore, we consider and quantify factors such as the adequacy of computational resources needed for output generation, the time required to process each query and provide a response, the number of queries necessary to obtain the desired answer, and the quality of the responses. As previously mentioned, no standard metric for assessing inference effectiveness is found in the existing literature. Some attempts have shown promise, particularly causal inference, in improving the predictive accuracy, fairness, robustness, and explainability of LLMs [37]. Causal inference evaluates assumptions, study designs, and estimation strategies to draw causal conclusions from data. It originates from three frameworks: potential outcomes, causal graphs, and structural equations. Potential outcomes concentrate on estimating causal effects, whereas graphical models—especially those based on Pearl’s causal graphs—demonstrate causal pathways using directed acyclic graphs (DAGs) to illustrate conditional independence among variables [38,39,40]. However, for code-generating tasks, causality is linked to the accuracy of results, particularly regarding whether a change in a model (for instance, switching from manually written code to AI-generated code) causes a genuine improvement in accuracy [41].
It is widely accepted that there are three major components impacting inference performance: Efficiency, Consistency and Accuracy [42,43,44]. Efficiency refers to how quickly the model processes and returns responses. It is influenced by (a) Response Time (i.e., the time taken for the model to generate a response after receiving a query); (b) Server Busy Rate (SBR, i.e., the proportion of queries that cannot be processed due to system overload or resource constraints). Consistency refers to how “clever” and well trained the model is to be able to produce satisfactory responses with as few attempts as possible. A key indicator is the number of attempts by the user until an acceptable response is obtained. Accuracy refers to the model’s reliability in providing a correct answer or making accurate predictions. Numerical predictions may be quantified through a mean relative error, which expresses how closely predicted values match the actual (ground truth) values.
To determine and quantify the corresponding parameters that allow for the assessment of these components influencing inference, we propose the INFINITE methodology (Inference Index In Testing Model Effectiveness) along with the introduction of an Inference Index (InI). The steps of the methodology are as follows:
  • First, we identify the key parameters involved in submitting a query, allowing us to measure Efficiency, Consistency, and Accuracy.
  • As a second step, we develop indices ranging from 0 to 1 which define the performance of each individual factor separately; 0 represents the poorest performance while 1 indicates excellent performance.
  • Finally, InI is calculated as a weighted average of the individual indices.
For the first step of the methodology, key parameters that can be recorded during the execution of a code-generating task include the number of different questions (attempts) we must ask the model until the correct code is produced, the number of queries that need to be submitted each time (whether an SB response or an answer is returned), response times between a query and the response from the framework, SB responses, and accuracy metrics, such as coefficient of determination R2 or Mean Absolute Percentage Error (MAPE), between generated model predictions and ground truth. The SB responses, total number of queries, and response times are related to Efficiency. The number of questions posed until a satisfactory answer is achieved is tied to Consistency. The accuracy metrics correspond to Accuracy. In Figure 1, we visually present the three major components and their corresponding parameters.
As we focus on code-generating tasks, we believe that metrics such as BLEU [45] or codeBLEU [46] are not well suited for evaluating the accuracy of these tasks; both metrics measure the number of contiguous sequences of words (or tokens) from a generated text that match those in a reference text, rather than assessing the degree of deviation in code correctness. In the context of code generation, n-gram similarity does not necessarily reflect correctness, functionality, or logical consistency. A piece of code may be syntactically similar to a reference implementation yet fail to execute correctly, while a semantically equivalent yet lexically different implementation may receive a lower BLEU score despite being correct. For this reason, our evaluation approach incorporates supervised assessment to ensure that correctness, execution validity, and logical consistency are properly accounted for through the use of MAPE or R2 score. While this introduces some level of human involvement, it ensures that the generated code is not just similar but functionally correct. Future work can explore hybrid approaches that balance automation with correctness verification, allowing for more scalable yet reliable evaluations.
As a second step, we offer suggestions for the indices that should be formulated to calculate InI. All indices range from 1 (excellent performance) to 0 (poorest performance). First, we consider both the SB rate and the ARTpQ for Efficiency. Therefore, two indices are formulated. We propose the SB rate to be represented through index E_SBR:
E_{SBR} = 1 - \frac{\text{Number of SB responses}}{\text{Total queries}} \qquad (2)
The greater the number of SB responses the framework returns, the poorer its performance becomes.
To account for ARTpQ, we define two threshold values of average response time, ART1 and ART2, where ART1 < ART2. Then, we define index E_ART, such that the performance is excellent, average, or poor:
E_{ART} = \begin{cases} 1, & \text{if } ART_{pQ} \le ART_1 \ (\text{Excellent performance}) \\ 0.5, & \text{if } ART_1 < ART_{pQ} < ART_2 \ (\text{Average performance}) \\ 0, & \text{if } ART_{pQ} \ge ART_2 \ (\text{Poor performance}) \end{cases} \qquad (3)
The values for ART1 and ART2 depend on the user and the nature of the problem. We suggest reasonable values of ART1 = 10 s and ART2 = 30 s.
Then, index E for Efficiency may be derived by averaging the two indices E_SBR and E_ART:
E = \frac{E_{SBR} + E_{ART}}{2} \qquad (4)
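A minimal Python sketch of Equations (2)–(4), assuming the threshold values ART1 = 10 s and ART2 = 30 s suggested above:

# Sketch of Equations (2)-(4): Efficiency index E from the SB rate and ARTpQ.
def efficiency_index(sb_responses, total_queries, art_pq, art1=10.0, art2=30.0):
    e_sbr = 1.0 - sb_responses / total_queries      # Equation (2)
    if art_pq <= art1:                              # Equation (3)
        e_art = 1.0
    elif art_pq < art2:
        e_art = 0.5
    else:
        e_art = 0.0
    return (e_sbr + e_art) / 2.0                    # Equation (4)

# Hypothetical example: no SB responses over 2 queries and ARTpQ = 28 s
# give E = (1 + 0.5) / 2 = 0.75.
print(efficiency_index(sb_responses=0, total_queries=2, art_pq=28.0))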
To construct a well-balanced decay function for the Consistency index C, we introduce ln(Q) instead of simply using Q, where Q represents the total number of attempts (i.e., questions asked). The main reason for this choice is that ln(Q) grows significantly more slowly than Q, ensuring that the decay towards zero is gradual rather than abrupt. Given that Q is always a positive integer starting from 1, the optimal scenario occurs when Q = 1, signifying the problem is resolved with a single attempt. As Q increases, the index should approach zero, reflecting the growing effort required to reach a conclusion. However, this decay should not be excessively rapid, as different problems exhibit varying complexity and ambiguity. To account for this variability, we introduce a problem-dependent multiplier m applied to ln(Q), which adjusts the decay rate. The parameter m is user-defined, must be a positive real number, and may take values less than 1 for highly ambiguous problems (e.g., open-ended research questions or real-world dynamic systems). In this manner, index C is formulated as
C = \frac{1}{1 + m \cdot \ln(Q)} \qquad (5)
For the present work, parameter m is set to 1.
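A corresponding sketch of Equation (5), with m = 1 as used in the present work:

import math

# Sketch of Equation (5): Consistency index C, where Q is the number of attempts.
def consistency_index(attempts, m=1.0):
    return 1.0 / (1.0 + m * math.log(attempts))

print(consistency_index(1))   # single attempt -> 1.0
print(consistency_index(4))   # four attempts  -> approximately 0.42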
To formulate an index for Accuracy, we must determine how closely the answer obtained by the framework aligns with the truth. In the case of numerical predictions in code-generating tasks, this can be quantified either through the R2 score (for a series of predictions) or through some relative error, such as MAPE for single-value prediction or series of predictions. The R2 or MAPEs of all predicted variables may be averaged to indicate the average accuracy. When the coefficient of determination is considered, if R2_av represents the averaged R2 from all predictions, index A for Accuracy can be expressed as
A = R^2_{av} \qquad (6)
If R2_av is less than or equal to 0, it is set to 0, and index A is 0 (indicating the poorest performance).
When MAPE is considered, if (MAPE)_av represents the averaged MAPE (in percent) from all predictions, index A for Accuracy can be expressed as
A = 1 - \frac{(\text{MAPE})_{av}}{100} \qquad (7)
If (MAPE)_av exceeds 100%, it is set to 100, and index A is 0 (again indicating the poorest performance). MAPE can become problematic when observed values are close to zero, leading to division by small numbers and resulting in artificially large or infinite errors. In contrast to MAPE, R2 measures how well the model explains the variance in the data without suffering from division issues. If it is known a priori that observed values are not close to 0, then Equation (7) may be used; otherwise, Equation (6) is suggested.
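A sketch of Equations (6) and (7), clipping the inputs as described above:

# Sketch of Equations (6) and (7): Accuracy index A from the averaged R2 score
# or from the averaged MAPE (in percent).
def accuracy_from_r2(r2_av):
    return max(r2_av, 0.0)                        # Equation (6): negative R2 -> 0

def accuracy_from_mape(mape_av):
    return 1.0 - min(mape_av, 100.0) / 100.0      # Equation (7): MAPE > 100% -> 0

print(accuracy_from_r2(0.94))     # 0.94
print(accuracy_from_mape(0.0))    # 1.0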
Next, as we progress to the third step of the INFINITE methodology, index InI can be defined through a weighted average of the three indices E, C, and A as
\text{InI} = w_E \cdot E + w_C \cdot C + w_A \cdot A \qquad (8)
where weights w_E, w_C, and w_A are assigned based on the significance of each of the three components, as determined by the user. For the specific problem we are addressing, we carry out simple averaging and set all weights equal to 1/3.
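The final step of the methodology then reduces to the following sketch of Equation (8); the example index values are hypothetical:

# Sketch of Equation (8): InI as a weighted average of Efficiency, Consistency
# and Accuracy, with equal weights of 1/3 as in the present study.
def inference_index(e, c, a, w_e=1/3, w_c=1/3, w_a=1/3):
    return w_e * e + w_c * c + w_a * a

print(round(inference_index(e=0.75, c=0.59, a=0.94), 2))   # 0.76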
The proposed index is a unique tool that comprehensively evaluates three key components of inference: Efficiency, Consistency, and Accuracy. These components are quantified through distinct indices, incorporating factors such as the number of questions asked, the number of queries required to obtain a valid answer (including SB responses), average response time, and comparing results with the ground truth. By integrating these diverse measures, InI provides a holistic assessment of the LLM framework’s performance, ensuring a balanced evaluation of its capability to answer the posed questions.
While the INFINITE methodology evaluates Efficiency, Consistency, and Accuracy in code generation, we acknowledge additional factors such as robustness [47], generalization [48], interpretability [49], bias [50], complexity [51], scalability [52], adaptability [53], and resource consumption [54]. However, these aspects are implicitly incorporated into the existing metrics for evaluating LLM-assisted code generation. Bias in code generation primarily affects stylistic preferences, library selection, and minor syntactic variations, rather than fundamental correctness [55], and it is harder to define quantitatively. Instead of directly measuring bias, Consistency and Efficiency indirectly expose biased tendencies by assessing variability in responses to equivalent prompts (Consistency) and differences in processing times that could indicate bias toward specific problem types (Efficiency). Similarly, complexity is inherently captured by the need for structured, logically sound code—our Accuracy metric ensures correctness. At the same time, Consistency reflects the model’s ability to handle increasingly intricate tasks without requiring multiple attempts. Scalability is closely tied to Efficiency and Accuracy, as models that struggle to generate more extended or more intricate code snippets tend to exhibit increased response times and inconsistencies in output quality. Robustness and generalization are inherently addressed through Consistency, as models that fail to generalize require multiple attempts to produce valid code. Explainability in code generation tasks is indirectly verified through code execution and the correctness of the produced outputs, making it less of a primary concern compared to decision-making AI systems. Adaptability is indirectly accounted for through Efficiency; if a model requires significantly more time or attempts to produce correct outputs for complex prompts, it indicates difficulty adapting to unfamiliar patterns. Finally, resource consumption is inherently tied to Efficiency, as models that require excessive attempts or longer inference times naturally have higher computational costs.
Thus, while INFINITE does not explicitly quantify these additional factors, it offers a streamlined and practical framework for assessing LLMs in code generation, prioritizing aspects most relevant to developers and end users. Some of these additional properties may not be readily and explicitly quantifiable in the context of LLM-assisted code generation. Metrics such as robustness, interpretability, and adaptability often require subjective or task-specific assessments, making them difficult to integrate into a standardized framework like INFINITE. We do acknowledge, however, their importance in broader AI evaluation, and in future work, we aim to incorporate as many of these factors as possible to extend the INFINITE methodology beyond code generation. We aim to evolve the Inference Index (InI) into a more comprehensive assessment framework applicable to a wide range of LLM tasks, thereby establishing it as a general benchmark for evaluating language models across diverse domains.
Before concluding this discussion, we make a final note on computational resources, another aspect frequently considered in AI evaluation. Given that GPT-based models operate via cloud-based APIs, their exact resource consumption remains unknown; however, we evaluated execution times on local hardware to ensure practical applicability. We assessed the computational efficiency of the generated programs by running them on an Intel Core i5-9300HF, 2.40 GHz, Quad-Core processor with 8 GB RAM. All AI-generated codes and the human-written implementation were executed within a few seconds on this hardware, demonstrating their efficiency. These results suggest that the generated codes do not introduce significant computational demands and can be run on standard consumer-grade personal computers without performance issues.

3. LSTM Model and Forecasting

Standard Recurrent Neural Networks (RNNs) [56] often struggle with long-term dependencies due to issues such as vanishing gradients. LSTM networks, an advanced RNN variant, address these challenges by incorporating memory cells and gating mechanisms that regulate the flow of information over time [33]. Details about the LSTM architecture can be found in the literature [57]. We chose the LSTM model as the coding example to assess OpenAI frameworks on complex coding tasks.
The data preprocessing pipeline involves normalizing the input data to a specific range. A sliding window approach is employed to transform raw time series data into fixed-length sequences suitable for input into the LSTM [58]. The model is trained using the backpropagation through time (BPTT) algorithm, optimizing the mean squared error loss function. The trained model is subsequently evaluated on an unseen test set with performance metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Bias (MB), Mean Absolute Percentage Error (MAPE), Mean Fractional Error (MFE), Mean Fractional Bias (MFB), R2 score, and Pearson correlation coefficient r [59,60,61].
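A minimal sketch of the sliding-window step described above, assuming a NumPy array of already normalized features; the window length of 3 used here for illustration matches the timestep requested later in this study:

import numpy as np

# Sketch of the sliding-window transformation: fixed-length sequences are cut
# from a (num_samples, num_features) array of normalized data, with the target
# taken as the row immediately after each window.
def make_sequences(data, targets, timestep=3):
    X, y = [], []
    for i in range(len(data) - timestep):
        X.append(data[i:i + timestep])
        y.append(targets[i + timestep])
    return np.array(X), np.array(y)

# Hypothetical shapes: 100 samples, 12 input features, 3 target variables.
data = np.random.rand(100, 12)
targets = np.random.rand(100, 3)
X, y = make_sequences(data, targets)
print(X.shape, y.shape)   # (97, 3, 12) (97, 3)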
The authors developed an LSTM model (LSTM-H) and tested it to identify the optimal settings for forecasting meteorological variables. These settings are presented in Table 1. GPT, OAI1, and OAI3 were instructed to construct an LSTM model based on these settings.
It is essential to note that the results presented correspond to the final 10% of the meteorological data set, equating to 4821 ten-minute intervals (approximately 33.3 days). Predictions are denormalized to the original scale for interpretability. For all LSTM models utilized, predictions are visualized alongside observations to evaluate performance.

4. Description of the Data Set

The data set used in this study for the LSTM model testing task was collected by a weather station operated by Earth Networks (Germantown MD, USA) [62], situated at latitude 35.3069° N and longitude 25.0819° E, at an elevation of 324 m above sea level. This station serves as a robust source of high-resolution meteorological data. The data set covers precisely one year, from 6 November 2018 at 18:38 local time to 8 November 2019 at 15:19 local time, with measurements taken every 10 min, providing a sequential and continuous time series ideal for research purposes. This time window was selected as it was the only period during which continuous time series data could be retrieved without missing values. The station records parameters such as the average temperature over 10 min (°C), the lowest temperature during that interval, the highest temperature within the same timeframe (both in °C), dew point temperature (°C), wet bulb temperature (°C), relative humidity (%), atmospheric pressure at mean sea level (hPa), accumulated daily rainfall resetting at the end of each day (mm), wind speed (m/s), and wind direction (degrees). The wind direction was analyzed in sine and cosine values to prevent discontinuities at 0°/360° and to maintain angular relationships for smoother mathematical operations in our machine learning models. A typical wind field for the island of Crete, where the weather station is located, is depicted in Figure 2 for a date included in the data set. The flow field was generated using WRF numerical weather prediction model simulations [63].
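As a small illustration of this encoding (a sketch, not the preprocessing code used in the study), wind direction in degrees is mapped to two continuous components:

import numpy as np

# Sine/cosine encoding of wind direction: 0 and 360 degrees map to the same point.
wind_dir_deg = np.array([0.0, 90.0, 180.0, 359.0])
wind_dir_rad = np.deg2rad(wind_dir_deg)
wind_sin, wind_cos = np.sin(wind_dir_rad), np.cos(wind_dir_rad)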
The employed data set consists of variables that exhibit diverse temporal behavior—some, like temperature, display strong repeating patterns, while others, such as wind speed, exhibit chaotic dynamics. This diversity makes the data set a suitable testbed for evaluating the performance of time series modelling approaches, particularly Long Short-Term Memory (LSTM) networks, which are well-suited for capturing both long-term dependencies and short-term variations in sequential data. While the data set is specific to meteorological applications, the methodological framework employed in this study is broadly applicable to other time series domains. The ability of LSTMs to learn complex temporal patterns makes them effective for analyzing data sets in finance, biomedical sciences, and other fields where time-dependent relationships are critical. However, while our findings highlight the robustness of LSTM-based modelling for capturing different types of temporal dependencies, domain-specific adjustments may be necessary to optimize performance in contexts beyond weather forecasting.

5. Simple Task: Data Cleaning and Statistical Computation

For the initial evaluation, all three models were tasked with writing a Python script to read a CSV file, remove rows containing missing values, save the cleaned data set to a new CSV file, and compute the mean, median, and standard deviation for each numeric column, storing these statistics in a separate text file. The CSV file was generated by the authors and contained two columns, namely X and Y, each consisting of 1000 randomly generated integer values between 1 and 1000, with 10% of the entries in each column randomly removed. The generated scripts were compared against a reference implementation developed by the authors in Python (version 3.12.9, Python Software Foundation, Beaverton OR, USA).
Once cleaned, the CSV file contained a total of 806 rows of data. The metrics computed by the authors for the two columns, X and Y, are presented in Table 2. We observe that the two medians are fractional numbers; this is expected, since each column contains 806 numbers and the median is calculated as the mean of the two middle values in each sorted column.
To evaluate the ability of the three frameworks to perform a simple structured programming task, we provided each with the following prompt:
“Write a code in Python, which reads a CSV file, cleans missing
values, then computes for each column mean, median, standard
deviation, outputs in a new CSV file the cleaned data set, and
outputs in a text file the calculated values of mean, median,
and standard deviation for each column”.
Each framework was given the identical prompt and tested independently. We recorded key performance metrics, including response time, the number of attempts required to obtain a functional script, and the correctness of the generated code. The generated scripts were executed, and their outcome was compared against the values recorded in Table 2. The results of this task provided insight into the frameworks’ capabilities in handling fundamental data processing operations and their ability to generate correct and executable Python scripts.
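For concreteness, a minimal Python sketch of the kind of script this prompt calls for is shown below; it is not the authors’ reference implementation, and the file names are placeholders:

import pandas as pd

# Read the CSV file, drop rows with missing values, and save the cleaned data.
df = pd.read_csv("data.csv")
cleaned = df.dropna()
cleaned.to_csv("cleaned_data.csv", index=False)

# Write mean, median, and standard deviation of each numeric column to a text file.
with open("statistics.txt", "w") as f:
    for col in cleaned.select_dtypes("number").columns:
        f.write(f"{col}: mean={cleaned[col].mean():.3f}, "
                f"median={cleaned[col].median():.3f}, "
                f"std={cleaned[col].std():.3f}\n")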

5.1. GPT

The GPT framework responded to the prompt in 7 s, providing a complete and functional Python script in its first attempt. No additional clarification or re-prompting was required, demonstrating its ability to understand and execute the task correctly on the first try. The response contained a well-structured code snippet. Upon execution, the generated code ran without any errors or modifications. It successfully read the CSV file, removed missing values, and computed the mean, median, and standard deviation for each numeric column. Additionally, the script correctly saved the cleaned data set in a new CSV file and output the computed statistics to a text file in the expected format. The results produced were identical to those recorded by the authors in Table 2, confirming the model’s accuracy. The absence of unnecessary debugging or fine-tuning highlights GPT’s reliability for fundamental data processing tasks.

5.2. OAI1

The OAI1 framework responded to the prompt in 91 s, significantly longer than the other frameworks. Despite the extended response time, the generated code was complete and correctly structured, requiring no additional clarification or re-prompting. The script followed a modular approach, encapsulating the functionality within a function, which could be beneficial for reusability but was unnecessary for a task of this simplicity. When executed, the code ran without errors or the need for further adjustments. It successfully read the CSV file, removed missing values, and computed each numeric column’s mean, median, and standard deviation. The cleaned data set and statistical metrics were also properly saved to their respective output files. The results were identical to those obtained by the authors’ reference implementation, verifying the correctness of the output.

5.3. OAI3

The OAI3 framework responded to the prompt in 7 s, delivering a structured and complete Python script on the first attempt. No additional clarification or re-prompting was needed. The generated code again followed a modular approach, encapsulating the data processing steps within a function. While the function-based approach added structure, it did not provide a clear advantage for a straightforward task like this, making the added modularity redundant for simple, one-time operations. Upon execution, the script ran smoothly without any errors or modifications. It correctly read the CSV file, removed missing values, and computed each numeric column’s mean, median, and standard deviation. The script also saved the cleaned data set in a new CSV file and stored the computed statistics in a separate text file. The results were identical to those obtained by the authors, confirming the generated model’s accuracy.

5.4. InI Calculation

We then applied the INFINITE methodology to compute the InI index and evaluate the performance of the three LLM frameworks: GPT, OAI1, and OAI3. To establish the parameters for assessing each framework’s effectiveness, we tracked several key metrics, including the number of attempts required to obtain a working code snippet, the total number of queries submitted (regardless of whether an SB response or a valid answer was returned), the response times between query submission and model output, and the relative errors between the framework-generated predictions and the ground truth (R2 is not pertinent in this case, as single values are predicted by the generated model). As ground truth, we used the statistical analysis results (mean, median, and standard deviation) obtained by the authors. Since all three frameworks produced identical outputs to the reference implementation, we assigned a MAPE value of 0% across all models. Notably, INFINITE does not incorporate an index for code style, as this aspect is subjective; instead, it strictly evaluates the accuracy of the generated code in fulfilling the specified task. The final results are presented in Table 3. ARTpQ is calculated using Equation (1).
By substituting the values of Table 3 into Equations (2)–(8), we derived Table 4. For better visualization, InI is shown in Figure 3 for the three frameworks.
Applying the INFINITE methodology, GPT and OAI3 both achieved the highest possible score of 1, while OAI1 scored 0.83 due to its significantly longer response time. Despite this difference, all three models demonstrated exceptional performance in the simple coding task, successfully generating correct and executable code on the first attempt without requiring any modifications. Given that all models produced identical results in terms of accuracy and functionality, response time and accessibility become key differentiating factors. From a practical standpoint, GPT emerges as the most accessible option, as it is freely available and does not require the same level of access restrictions as OAI1 or OAI3. This makes it the most convenient and widely usable choice for general users who need reliable coding assistance without additional constraints.
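As a worked check of Equation (8) for OAI1 on this task: with no SB responses, E_SBR = 1; the 91 s response time exceeds ART2 = 30 s, so E_ART = 0 and E = (1 + 0)/2 = 0.5; a single attempt gives C = 1/(1 + ln 1) = 1; and a MAPE of 0% gives A = 1. Hence, InI = (0.5 + 1 + 1)/3 ≈ 0.83. For GPT and OAI3, the 7 s response times fall below ART1 = 10 s, so E = C = A = 1 and InI = 1.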

6. Complex Task: Generating Code for an LSTM Model

The problem and the procedure adopted are described below. The aim is to generate LSTM models using GPT, OAI1, and OAI3, which simultaneously predict short-term fluctuations in temperature, relative humidity, and wind speed. These variables were selected due to their significance and differing levels of variability, ranging from relatively stable (temperature) to highly dynamic (wind speed). Accurate short-term forecasting of these parameters is essential across multiple domains. For instance, the Canadian Fire Weather Index (FWI), a widely used model for assessing wildfire risk, relies on these parameters [64,65]. Public health also benefits from forecasting these variables, as temperature and humidity affect the heat index. At the same time, wind plays a role in air quality by dispersing pollutants and airborne pathogens [66,67].
Initially, the CSV file containing the data was uploaded. The columns labelled “temp”, “hum”, and “windvel” represent the time series of the variables we wish to predict (i.e., temperature, relative humidity, and wind speed, respectively). We began by asking the same question to all frameworks:
“I have uploaded a file with a continuous time series from a
weather station. Retain the first 90% of the data for
training and validation and the last 10% for testing, in
order to train an LSTM model that predicts the variables temp,
hum and wind-vel. Write the code in Python that executes this
task, including the appropriate error metrics, such as mean
squared error, mean absolute error, mean relative error, mean
fractional error, mean fractional bias, mean bias, R2 score
and Pearson correlation coefficient. Include a graph that
compares the last 10% of the data during the testing phase.
The LSTM has one layer with 10 units, a batch size 16, the
activation function is ReLU, and the optimiser is Adam with a
learning rate of 0.001, running for 10 epochs. It also uses a
timestep of 3. If possible, please plot the results of the
comparisons”.
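For illustration, a minimal Keras sketch of an LSTM configured as requested in this prompt is given below (one layer with 10 units, ReLU activation, Adam optimiser with learning rate 0.001, batch size 16, 10 epochs, timestep 3). It is not the code returned by any of the frameworks, and the placeholder arrays and feature counts are assumptions:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

timestep, n_features, n_targets = 3, 12, 3
X_train = np.random.rand(500, timestep, n_features)   # placeholder windowed inputs
y_train = np.random.rand(500, n_targets)              # placeholder targets (temp, hum, windvel)

model = Sequential([
    LSTM(10, activation="relu", input_shape=(timestep, n_features)),
    Dense(n_targets),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
model.fit(X_train, y_train, epochs=10, batch_size=16, validation_split=0.1, verbose=0)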
The description of the responses for each framework follows.

6.1. GPT

The GPT framework initially took 53 s to respond. It generated code that read only the three required columns from the data set and built the training–testing data set from these columns alone, i.e., it used temperature, relative humidity, and wind speed to predict temperature, relative humidity, and wind speed. From the question asked, GPT did not understand that all variables in the file had to be taken as input, so a new question (second attempt) was posed to clarify this:
“Please do as before, but take as input to the model all given
features from the CSV file”.
This time, the model generated Python code within 3 s, considering all variables to predict temperature, relative humidity, and wind speed. The code was executed, and predictions for the three variables were obtained. GPT did not manage to generate code that computed all requested error metrics. It reported MSE (0.5668), MAE (0.4735), MAPE (1.71 × 10^13), R2 Score (0.9405), and Pearson r (0.9997). MAPE is erroneous because of division by observed wind speed values close to 0. As observed, the GPT-generated LSTM model did not manage to evaluate error metrics for each variable separately; instead, it aggregated results across all selected columns and returned an overall mean value. Hence, we cannot comment on the accuracy of the GPT-produced model based on the calculated error metrics. This issue could be remedied by posing an additional question to GPT to compute error metrics for each variable separately. We comment on the model’s accuracy and perform detailed comparisons with ground truth for each variable later in this section. In summary, the GPT framework provided an answer after 2 attempts and 2 queries. No SB responses were recorded.

6.2. OAI1

It took the model 133 s to respond and generate Python code. However, an error occurred during execution because the code overwrote the Pandas DataFrame containing all variables with a DataFrame of the same name that included only the three target variables: temperature, relative humidity, and wind speed. A new question (second attempt) was posed, indicating the error and requesting that the model correct it. The model responded to the second query after 113 s, but the generated code encountered another error during execution, this time caused by non-numeric values in the “low temperature” data. Consequently, a third question (third attempt) was asked, requesting the model to rectify the error. A response was received after 168 s, and the model produced corrected code. However, similar to the case with GPT, the generated code only accessed the three necessary columns from the data set and created a training–testing data set based solely on these three columns. Therefore, a new question (fourth attempt) was posed to clarify that all variables needed to be used as inputs for the generated code. The question was precisely the same as the one asked of GPT:
“Please do as before, but take as input to the model all given
features from the CSV file”.
This time, the framework generated the correct Python code after 103 s, utilizing all inputs to predict the three requested variables. The code was executed and predictions for the three variables were obtained. The model predicted all requested error metrics for each variable separately. The results are compiled in Table 5.
The MAPE and MFE for wind speed are incorrect due to division by observed values very close to zero. By examining the error metrics, we can conclude that the OAI1-generated model predicted the three variables satisfactorily. The OAI1 framework delivered a satisfactory response after four attempts and four queries, with no SB responses recorded. Detailed comparisons with the ground truth for each variable follow.

6.3. OAI3

OAI3 responded in 30 s and behaved exactly like the GPT framework; it generated code that read the three required columns from the data set and created a training–testing data set consisting solely of these three columns. Specifically, it used temperature, relative humidity, and wind speed values to predict temperature, relative humidity, and wind speed. From the question posed, OAI3 did not grasp that all variables in the file needed to be taken as input, prompting a new question (second attempt) for clarification:
“Please do as before, but take as input to the model all given
features from the CSV file”.
This time, the framework responded in 21 s and generated an LSTM model that considered all variables to predict temperature, relative humidity, and wind speed. The code was executed, and predictions for the three variables were obtained. The model predicted all requested error metrics for each variable individually. The results are presented in Table 6.
The MAPE for wind speed is once again incorrect due to division by a very small number close to zero. By examining the error metrics for all three variables, we can conclude that the OAI3-generated model also predicted the three variables with reasonable accuracy. The OAI3 framework provided a satisfactory response after two attempts and two queries, with no SB responses recorded. Detailed comparisons with the ground truth for each variable follow.

6.4. LSTM-H

After running the LSTM-H model and obtaining predictions for the test data set for each of the three variables, we compiled Table 7 containing all error metrics, specifically MSE, MAE, MB, MAPE, MFE, MFB, R2 score, and the Pearson correlation coefficient r. In our code implementation, we avoided division by zero when calculating MAPE and MFE by explicitly excluding data points with near-zero observed values from the calculations.
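A sketch of this safeguard, assuming a small exclusion threshold (the value of 0.1 is illustrative):

import numpy as np

# MAPE with near-zero observations excluded, as described above.
def safe_mape(y_true, y_pred, eps=0.1):
    mask = np.abs(y_true) > eps
    return 100.0 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))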
The comparisons of the predictions from the LSTM-H model and the three frameworks against observations (i.e., ground truth) are presented in Figure 4, Figure 5 and Figure 6 for temperature, relative humidity, and wind speed, respectively. All predictions from LSTM models generated by GPT, OAI1, or OAI3 appear to be remarkably close to those of LSTM-H for the testing data set and align well with observations, particularly in the cases of temperature and relative humidity.
As MAPE and R2 scores are reliable metrics for error calculations (due to their scale independence and interpretability), we independently calculated the MAPE and R2 scores for the three predicted variables from the GPT-generated model. We also calculated MAPE and MFE for wind speed between the ground truth and predictions from the OAI1 and OAI3 frameworks, as not all of their calculations were performed correctly. The results are presented in Table 8 and Table 9.
As observed in Figure 4, Figure 5 and Figure 6 and the corresponding MAPE and R2 tables, the predictions of all models are practically indistinguishable and remarkably close to ground truth. When comparing the three OpenAI frameworks (GPT, OAI1 and OAI3), we observe that the GPT framework provides, if only marginally, better accuracy. It seems that, at present, GPT surpasses the other two frameworks in accuracy for scientific coding tasks. This is further discussed in the following subsection.

6.5. InI Calculation

We then applied the INFINITE methodology to calculate the InI index and assess the performance of the three LLM frameworks: GPT, OAI1, and OAI3. To identify the parameters that can enable us to measure the performance of each framework, we recorded the number of different questions (attempts) we had to pose to the model until the correct LSTM model was generated, the number of queries submitted each time (whether an SB response or an answer was returned), the response times between a query and a response from the framework, and the R2 values between LSTM predictions and ground truth (in this case, R2 is preferred over MAPE, since there exist observed wind speed values close to 0). For each framework, an average R2 was calculated for temperature, relative humidity, and wind speed from the values of Table 9. The main parameters are summarized in Table 10. ARTpQ is calculated using Equation (1).
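As a worked check using the response times and attempt counts reported above: with no SB responses recorded, E_SBR = 1 for all three frameworks. GPT answered two queries in 53 s and 3 s, giving ARTpQ = 28 s, hence E_ART = 0.5, E = 0.75, and two attempts give C = 1/(1 + ln 2) ≈ 0.59. OAI3 answered two queries in 30 s and 21 s, giving ARTpQ = 25.5 s and therefore the same E = 0.75 and C ≈ 0.59. OAI1 required four queries totalling 517 s, giving ARTpQ ≈ 129 s, hence E_ART = 0, E = 0.5, and C = 1/(1 + ln 4) ≈ 0.42. Index A for each framework is the corresponding averaged R2 from Table 9.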
By substituting the values of Table 10 into Equations (2)–(8), we derived Table 11. For better visualization, index InI is shown in Figure 7 for the three frameworks.
As observed, GPT surpasses OAI1 at this stage in Efficiency and Consistency, and the two frameworks are very close regarding Accuracy. GPT’s performance is also slightly better than OAI3’s in terms of Accuracy. Overall, GPT achieved the highest InI among the three frameworks. It only failed to calculate the requested errors for each variable separately; instead, it aggregated them. Hence, GPT appears to be producing meaningful code for complex machine learning tasks in the shortest time possible. Thus, it is currently the most appropriate option for scientific coding tasks. An extensive literature search yielded no references for this, although there have been limited comparisons between GPT and the newer variants [68,69]. We believe this emergent phenomenon occurs because all OpenAI models are regularly fine-tuned based on user interactions, including instances where users correct mistakes or upvote/downvote responses [70]. Since GPT is a public model, it interacts with a more extensive user base. Its fine-tuning benefits from a broader range of feedback, potentially enhancing its robustness in certain areas. On the other hand, pay-walled versions such as OAI1 and OAI3 are fine-tuned based on feedback from their own smaller sets of users. Each model is fine-tuned separately, meaning mistakes corrected in one model do not immediately transfer to another. This indicates that, at least at this stage, GPT performs better than the newer variants of OpenAI in areas where its broader user base has rectified more mistakes. For a comprehensive understanding of the fine-tuning methodologies applicable to LLMs, the interested reader is referred to [71].

7. Conclusions

In this study, we introduce a new methodology and index to evaluate the performance of LLMs in code-generating tasks. The InI index (calculated through the INFINITE methodology) provides a more comprehensive evaluation by focusing on three key components of inference: Efficiency, Consistency, and Accuracy. Unlike traditional approaches, which often emphasize only accuracy (e.g., R2 score, MAPE) or response quality, this methodology also incorporates time-based efficiency (average response time and server load) and consistency (the number of iterations needed to reach a correct answer). The InI index results from the combination of indices derived for each key component. This multidimensional approach allows for a more holistic understanding of an LLM’s performance, capturing both the speed and reliability of the model alongside the correctness of the generated code. By incorporating consistency, we account for the model’s learning curve and ability to converge on a solution, which is often overlooked in existing frameworks that primarily focus on outcome precision without addressing model stability. Moreover, by evaluating server load and response times, we provide insights into the scalability and resource efficiency of the framework, making our approach particularly valuable for real-world applications where performance under variable workloads is crucial. This multidimensional assessment offers a richer picture of inference effectiveness and provides actionable insights into both technical and operational aspects of deploying LLMs in code generation tasks. In the present study, we do not vary query prompts, as our focus is on evaluating the frameworks’ correctness in code generation rather than prompt sensitivity. We assume that users interacting with these models are domain experts who can formulate precise queries, ensuring that variations in phrasing do not significantly impact the evaluation.
Furthermore, we conduct a comparative analysis of OpenAI’s GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) frameworks for both simple and complex tasks; a simple data cleaning and statistical computation task and a complex Python code generating task to implement an LSTM deep learning model for predicting critical meteorological variables, namely temperature, relative humidity, and wind speed. The evaluation focuses on performance aspects such as inference efficiency, computational time, forecasting accuracy, memory utilization, and usability. The results indicate that all frameworks can generate functional Python code for both tasks. However, GPT demonstrates a more efficient workflow and outperforms its competitors, particularly OAI1, which requires more attempts before producing a fully functional LSTM model, indicating less consistency in managing complex programming tasks. OAI3 performs equally well as GPT in terms of efficiency and consistency; however, GPT’s predictions show better accuracy in the complex task, resulting in a higher overall score. All AI-generated LSTM models perform similarly to the manually developed LSTM-H model, with only marginal differences in predictive accuracy. This demonstrates that AI-assisted code generation can yield results comparable to expert-designed implementations, provided that the models are correctly prompted and refined through iterative feedback. Moreover, the analysis shows that execution times of all AI-generated codes remain low, ensuring that generated programmes are practical to use on standard consumer-grade hardware.
In summary, GPT’s slightly better performance reveals that it remains a competitive framework despite the release of newer variants from OpenAI, namely OAI1 and OAI3. This sustained performance may be attributed to its widespread usage, which allows for extensive fine-tuning based on a larger volume of user interactions, feedback, and error corrections. As a result, GPT benefits from a broader range of refinements, potentially offsetting advancements in newer, more specialized models. The findings of this work highlight the evolving role of AI in computational modeling and scientific research, where the ability to generate accurate and efficient code has significant implications for accelerating discovery and innovation. Future research will explore further refinements in AI-assisted coding, assess additional AI frameworks, and investigate the potential for hybrid approaches that utilize multiple models to optimize performance. Moreover, the Inference Index (InI) methodology will be refined to support expanded AI assessment tasks beyond code generation. These developments will further enhance the ability to measure, compare, and improve LLM performance across diverse applications.

Author Contributions

Conceptualization, N.C. and D.D.; Methodology, N.C. and D.D.; Software, N.C.; Formal analysis, N.C. and D.D.; Investigation, N.C. and D.D.; Data curation, N.C.; Writing—original draft, N.C. and D.D.; Writing—review & editing, N.C. and D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data sets and AI-generated codes, along with the comparison graphs they produced, are available and may be downloaded from: https://github.com/nchrkis/infinite (accessed on 15 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMs      Large Language Models
AI        Artificial Intelligence
NLP       Natural Language Processing
LSTM      Long Short-Term Memory
RNNs      Recurrent Neural Networks
INFINITE  Inference Index In Testing Model Effectiveness
GPT       ChatGPT-4o
OAI1      OpenAI-o1 pro
OAI3      OpenAI-o3 mini-high
MSE       Mean Squared Error
MAE       Mean Absolute Error
MB        Mean Bias
MAPE      Mean Absolute Percentage Error
MFE       Mean Fractional Error
MFB       Mean Fractional Bias

References

  1. OpenAI. Optimizing Language Models for Dialogue. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 15 January 2025).
  2. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Volume 30. [Google Scholar] [CrossRef]
  4. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
  5. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: New York, NY, USA; Volume 48, pp. 173–182. [Google Scholar]
  6. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar] [CrossRef]
  7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
  8. Drikakis, D.; Kokkinakis, I.W.; Fung, D.; Spottswood, S.M. Self-supervised transformers for turbulent flow time series. Phys. Fluids 2024, 36, 065113. [Google Scholar] [CrossRef]
  9. Drikakis, D.; Kokkinakis, I.W.; Fung, D.; Spottswood, S.M. Generalizability of transformer-based deep learning for multidimensional turbulent flow data. Phys. Fluids 2024, 36, 026102. [Google Scholar] [CrossRef]
  10. Huang, D.; Bu, Q.; Zhang, J.; Xie, X.; Chen, J.; Cui, H. Bias assessment and mitigation in llm-based code generation. arXiv 2023, arXiv:2309.14345. [Google Scholar] [CrossRef]
  11. Yu, Y.; Zhuang, Y.; Zhang, J.; Meng, Y.; Ratner, A.J.; Krishna, R.; Shen, J.; Zhang, C. Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Volume 36, pp. 55734–55784. [Google Scholar]
  12. Dai, S.; Xu, C.; Xu, S.; Pang, L.; Dong, Z.; Xu, J. Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), Barcelona, Spain, 25–29 August 2024; pp. 6437–6447. [Google Scholar] [CrossRef]
  13. Jin, M.; Wang, S.; Ma, L.; Chu, Z.; Zhang, J.Y.; Shi, X.; Chen, P.Y.; Liang, Y.; Li, Y.F.; Pan, S.; et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv 2023, arXiv:2310.01728. [Google Scholar] [CrossRef]
  14. Yu, X.; Chen, Z.; Ling, Y.; Dong, S.; Liu, Z.; Lu, Y. Temporal Data Meets LLM–Explainable Financial Time Series Forecasting. arXiv 2023, arXiv:2306.11025. [Google Scholar] [CrossRef]
  15. Zhang, B.; Liu, Z.; Cherry, C.; Firat, O. When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv 2024, arXiv:2402.17193. [Google Scholar] [CrossRef]
  16. Ajwani, R.; Javaji, S.R.; Rudzicz, F.; Zhu, Z. LLM-Generated Black-box Explanations Can Be Adversarially Helpful. arXiv 2024, arXiv:2405.06800. [Google Scholar] [CrossRef]
  17. Nam, D.; Macvean, A.; Hellendoorn, V.; Vasilescu, B.; Myers, B. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar] [CrossRef]
  18. Liu, F.; Liu, Y.; Shi, L.; Huang, H.; Wang, R.; Yang, Z.; Zhang, L.; Li, Z.; Ma, Y. Exploring and evaluating hallucinations in llm-powered code generation. arXiv 2024, arXiv:2404.00971. [Google Scholar] [CrossRef]
  19. Tian, H.; Lu, W.; Li, T.O.; Tang, X.; Cheung, S.C.; Klein, J.; Bissyandé, T.F. Is ChatGPT the ultimate programming assistant—How far is it? arXiv 2023, arXiv:2304.11938. [Google Scholar] [CrossRef]
  20. Liu, Y.; Le-Cong, T.; Widyasari, R.; Tantithamthavorn, C.; Li, L.; Le, X.B.D.; Lo, D. Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–26. [Google Scholar] [CrossRef]
  21. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates Inc.: New York, NY, USA, 2022; Volume 35, pp. 22199–22213. [Google Scholar]
  22. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic chain of thought prompting in large language models. arXiv 2022, arXiv:2210.03493. [Google Scholar] [CrossRef]
  23. Jin, M.; Yu, Q.; Shu, D.; Zhao, H.; Hua, W.; Meng, Y.; Zhang, Y.; Du, M. The impact of reasoning step length on large language models. arXiv 2024, arXiv:2401.04925. [Google Scholar] [CrossRef]
  24. Wu, C.J.; Brooks, D.; Chen, K.; Chen, D.; Choudhury, S.; Dukhan, M.; Hazelwood, K.; Isaac, E.; Jia, Y.; Jia, B.; et al. Machine Learning at Facebook: Understanding Inference at the Edge. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 331–344. [Google Scholar] [CrossRef]
  25. Lu, Y.; Chowdhery, A.; Kandula, S.; Chaudhuri, S. Accelerating Machine Learning Inference with Probabilistic Predicates. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18), Houston, TX, USA, 10–15 June 2018; pp. 1493–1508. [Google Scholar] [CrossRef]
  26. Creswell, A.; Shanahan, M.; Higgins, I. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv 2022, arXiv:2205.09712. [Google Scholar] [CrossRef]
  27. Sheng, Y.; Zheng, L.; Yuan, B.; Li, Z.; Ryabinin, M.; Chen, B.; Liang, P.; Re, C.; Stoica, I.; Zhang, C. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: New York, NY, USA, 2023; Volume 202, pp. 31094–31116. [Google Scholar]
  28. Chitty-Venkata, K.T.; Raskar, S.; Kale, B.; Ferdaus, F.; Tanikanti, A.; Raffenetti, K.; Taylor, V.; Emani, M.; Vishwanath, V. LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators. In Proceedings of the IEEE Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis SC24-W, Atlanta, GA, USA, 17–22 November 2024; pp. 1362–1379. [Google Scholar] [CrossRef]
  29. Lukasik, M.; Narasimhan, H.; Menon, A.K.; Yu, F.; Kumar, S. Metric-aware LLM inference. arXiv 2024, arXiv:2403.04182. [Google Scholar] [CrossRef]
  30. He, Y.; Xu, M.; Wu, J.; Zheng, W.; Ye, K.; Xu, C. UELLM: A Unified and Efficient Approach for LLM Inference Serving. arXiv 2024, arXiv:2409.14961. [Google Scholar] [CrossRef]
  31. Marsili, M.; Roudi, Y. Quantifying relevance in learning and inference. Phys. Rep. 2022, 963, 1–43. [Google Scholar] [CrossRef]
  32. Psaros, A.F.; Meng, X.; Zou, Z.; Guo, L.; Karniadakis, G.E. Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons. J. Comput. Phys. 2023, 477, 111902. [Google Scholar] [CrossRef]
  33. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  34. Caruccio, L.; Cirillo, S.; Polese, G.; Solimando, G.; Sundaramurthy, S.; Tortora, G. Claude 2.0 large language model: Tackling a real-world classification problem with a new iterative prompt engineering approach. Intell. Syst. Appl. 2024, 21, 200336. [Google Scholar] [CrossRef]
  35. Islam, R.; Ahmed, I. Gemini-the most powerful LLM: Myth or Truth. In Proceedings of the 2024 5th Information Communication Technologies Conference (ICTC), Nanjing, China, 10–12 May 2024; pp. 303–308. [Google Scholar] [CrossRef]
  36. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  37. Liu, X.; Xu, P.; Wu, J.; Yuan, J.; Yang, Y.; Zhou, Y.; Liu, F.; Guan, T.; Wang, H.; Yu, T.; et al. Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey. arXiv 2024, arXiv:2403.09606. [Google Scholar] [CrossRef]
  38. Pearl, J. Graphs, causality, and structural equation models. Sociol. Methods Res. 1998, 27, 226–284. [Google Scholar] [CrossRef]
  39. Pearl, J. Graphical models for probabilistic and causal reasoning. In Quantified Representation of Uncertainty and Imprecision; Smets, P., Ed.; Springer: Dordrecht, The Netherlands, 1998; pp. 367–389. [Google Scholar] [CrossRef]
  40. Zeng, J.; Wang, R. A survey of causal inference frameworks. arXiv 2022, arXiv:2209.00869. [Google Scholar] [CrossRef]
  41. Ji, Z.; Ma, P.; Li, Z.; Wang, S. Benchmarking and explaining large language model-based code generation: A causality-centric approach. arXiv 2023, arXiv:2310.06680. [Google Scholar] [CrossRef]
  42. Vu, T.; Kang, H.; Yoo, C.D. Scnet: Training inference sample consistency for instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 2701–2709. [Google Scholar] [CrossRef]
  43. Mitchell, E.; Noh, J.J.; Li, S.; Armstrong, W.S.; Agarwal, A.; Liu, P.; Finn, C.; Manning, C.D. Enhancing self-consistency and performance of pre-trained language models through natural language inference. arXiv 2022, arXiv:2211.11875. [Google Scholar] [CrossRef]
  44. Valsamara, I.; Papaioannidis, C.; Pitas, I. Efficient Data Utilization in Deep Neural Networks for Inference Reliability. In Proceedings of the 2024 IEEE International Conference on Image Processing Challenges and Workshops (ICIPCW), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 4142–4147. [Google Scholar] [CrossRef]
  45. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  46. Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Sundaresan, N.; Zhou, M.; Blanco, A.; Ma, S. Codebleu: A method for automatic evaluation of code synthesis. arXiv 2020, arXiv:2009.10297. [Google Scholar] [CrossRef]
  47. Beyer, T.; Schuchardt, J.; Schwinn, L.; Günnemann, S. Fast Proxies for LLM Robustness Evaluation. arXiv 2025, arXiv:2502.10487. [Google Scholar] [CrossRef]
  48. Kang, K.; Setlur, A.; Ghosh, D.; Steinhardt, J.; Tomlin, C.; Levine, S.; Kumar, A. What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? arXiv 2024, arXiv:2411.07681. [Google Scholar] [CrossRef]
  49. Singh, C.; Inala, J.P.; Galley, M.; Caruana, R.; Gao, J. Rethinking interpretability in the era of large language models. arXiv 2024, arXiv:2402.01761. [Google Scholar] [CrossRef]
  50. Lin, L.; Wang, L.; Guo, J.; Wong, K.F. Investigating bias in llm-based bias detection: Disparities between llms and human perception. arXiv 2024, arXiv:2403.14896. [Google Scholar] [CrossRef]
  51. Bae, H.; Deeb, A.; Fleury, A.; Zhu, K. Complexitynet: Increasing llm inference efficiency by learning task complexity. arXiv 2023, arXiv:2312.11511. [Google Scholar] [CrossRef]
  52. Huang, Y.; Wan, L.J.; Ye, H.; Jha, M.; Wang, J.; Li, Y.; Zhang, X.; Chen, D. Invited: New Solutions on LLM Acceleration, Optimization, and Application. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC ’24), San Francisco, CA, USA, 23–27 June 2024. [Google Scholar] [CrossRef]
  53. Li, C.; Tian, Y.; Zerong, Z.; Song, Y.; Xia, F. Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness. In Proceedings of the Findings of the Association for Computational Linguistics (ACL 2024), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Curran Associates Inc.: New York, NY, USA, 2024; pp. 8140–8162. [Google Scholar] [CrossRef]
  54. Stojkovic, J.; Choukse, E.; Zhang, C.; Goiri, I.; Torrellas, J. Towards greener llms: Bringing energy-efficiency to the forefront of llm inference. arXiv 2024, arXiv:2403.20306. [Google Scholar] [CrossRef]
  55. Mouselinos, S.; Malinowski, M.; Michalewski, H. A simple, yet effective approach to finding biases in code generation. arXiv 2022, arXiv:2211.00609. [Google Scholar] [CrossRef]
  56. Yadav, H.; Thakkar, A. NOA-LSTM: An efficient LSTM cell architecture for time series forecasting. Expert Syst. Appl. 2024, 238, 122333. [Google Scholar] [CrossRef]
  57. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  58. Chen, C.; Zhang, Q.; Kashani, M.H.; Jun, C.; Bateni, S.M.; Band, S.S.; Dash, S.S.; Chau, K.W. Forecast of rainfall distribution based on fixed sliding window long short-term memory. Eng. Appl. Comput. Fluid Mech. 2022, 16, 248–261. [Google Scholar] [CrossRef]
  59. Lane, D.M. Online Statistics Education: A Multimedia Course of Study; Project Leader: David M. Lane, Rice University. Available online: http://onlinestatbook.com/ (accessed on 1 February 2025).
  60. Boylan, J.W.; Russell, A.G. PM and light extinction model performance metrics, goals, and criteria for three-dimensional air quality models. Atmos. Environ. 2006, 40, 4946–4959. [Google Scholar] [CrossRef]
  61. Sedgewick, P. Pearson’s Correlation Coefficient. BMJ 2012, 345, e4483. [Google Scholar] [CrossRef]
  62. Earth Networks. Weather Data, Forecasting, and Environmental Monitoring Solutions. Available online: https://www.earthnetworks.com (accessed on 17 January 2025).
  63. Christakis, N.; Katsaounis, T.; Kossioris, G.; Plexousakis, M. On the Performance of the WRF Numerical Model over Complex Terrain on a High Performance Computing Cluster. In Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014 IEEE 6th International Symposium on Cyberspace Safety and Security, 2014 IEEE 11th International Conference on Embedded Software and System (HPCC, CSS, ICESS), Paris, France, 20–22 August 2014; pp. 298–303. [Google Scholar] [CrossRef]
  64. van Wagner, C.E. Development and Structure of the Canadian Forest Fire Weather Index System; Forestry Technical Report 35; Canadian Forestry Service: Ottawa, ON, Canada, 1987. [Google Scholar]
  65. Wotton, B.M. Interpreting and using outputs from the Canadian Forest Fire Danger Rating System in research applications. Environ. Ecol. Stat. 2009, 16, 107–131. [Google Scholar] [CrossRef]
  66. Le Ribault, C.; Vinkovic, I.; Simoëns, S. Large eddy simulation of droplet dispersion and deposition over street canyons. Phys. Fluids 2024, 36, 113313. [Google Scholar] [CrossRef]
  67. Zodo, G.; Konka, H.; Stevanovic, S.; Schluter, J.J.S. Simulation of the transition of respiratory droplets to aerosol states: Implications for pathogen spread. Phys. Fluids 2025, 37, 015188. [Google Scholar] [CrossRef]
  68. Goto, H.; Shiraishi, Y.; Okada, S. Performance Evaluation of GPT-4o and o1-Preview Using the Certification Examination for the Japanese ‘Operations Chief of Radiography With X-rays’. Cureus 2024, 16, e74262. [Google Scholar] [CrossRef]
  69. Hu, H.; Shang, Y.; Xu, G.; He, C.; Zhang, Q. Can GPT-O1 Kill All Bugs? An Evaluation of GPT-Family LLMs on QuixBugs. arXiv 2024, arXiv:2409.10033. [Google Scholar] [CrossRef]
  70. Steiner, A.; Peeters, R.; Bizer, C. Fine-tuning Large Language Models for Entity Matching. arXiv 2024, arXiv:2409.08185. [Google Scholar] [CrossRef]
  71. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv 2020, arXiv:2002.06305. [Google Scholar] [CrossRef]
Figure 1. Major components impacting inference and their corresponding parameters that may be recorded while executing a code-generating task.
Figure 2. The wind speed field over Crete, as predicted by WRF for 00:00 GMT on 27 October 2019, illustrated using both colors (indicating magnitude) and arrows representing wind direction and strength. The shaft of each arrow points in the direction of the wind, while the barbs denote magnitude (short barbs for 5 knots, long barbs for 10 knots) and are positioned on the side of the shaft from which the wind originates.
Figure 3. The InI index for the three frameworks assessed for the data cleaning and statistical computation task.
Figure 4. Temperature predictions from all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph covers the entire testing data set. The two bottom graphs focus on specific time intervals, namely the 100th to 200th ten-minute intervals (bottom left) and the 4100th to 4200th ten-minute intervals (bottom right), to better visualize the differences between the predictions.
Figure 5. Relative humidity predictions of all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph covers the whole testing data set. The two bottom graphs focus on specific time intervals, namely the 100th to 200th ten-minute intervals (bottom left) and the 4100th to 4200th ten-minute intervals (bottom right), to better visualize the differences between the predictions.
Figure 6. Wind speed predictions of all three models and comparisons with ground truth and LSTM-H predictions. LSTM_H is the model developed by the authors, LSTM_GPT is the GPT-generated model, LSTM_OAI1 is the OAI1-generated model, and LSTM_OAI3 is the OAI3-generated model. The top graph covers the whole testing data set. The two bottom graphs focus on specific time intervals, namely the 100th to 200th ten-minute intervals (bottom left) and the 4100th to 4200th ten-minute intervals (bottom right), to better visualize the differences between the predictions.
Figure 7. The InI index for the three frameworks assessed for the LSTM model generation task.
Table 1. LSTM-H model optimal settings for forecasting meteorological variables. (Tr-Te) denotes Training–Testing data split.

Units   Activation   Optimizer   Batch Size   Epochs   Data Scaling    Data Split (Tr-Te)   Time Steps
10      ReLU         Adam        16           10       MinMax Scaler   90-10                3
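
As a point of reference, the following is a minimal Keras sketch wired up with the Table 1 settings (10 units, ReLU activation, Adam optimizer, batch size 16, 10 epochs, MinMax scaling, a 90-10 chronological split, and 3 time steps). The synthetic placeholder series, variable names, and single-feature setup are illustrative assumptions; the actual LSTM-H code is available from the repository listed in the Data Availability Statement.

```python
# Minimal sketch of an LSTM configured with the Table 1 settings.
# Placeholder data stands in for the meteorological series used in the paper.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

TIME_STEPS = 3

def make_windows(series, time_steps=TIME_STEPS):
    """Turn a 1-D series into (samples, time_steps, 1) windows and next-step targets."""
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:i + time_steps])
        y.append(series[i + time_steps])
    return np.array(X)[..., np.newaxis], np.array(y)

# Synthetic placeholder series (e.g., standing in for temperature).
series = np.sin(np.linspace(0, 50, 2000)) + 0.1 * np.random.randn(2000)

# MinMax scaling, as in Table 1.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(series.reshape(-1, 1)).ravel()

X, y = make_windows(scaled)

# 90-10 training-testing split (chronological, no shuffling).
split = int(0.9 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Single LSTM layer with 10 units and ReLU activation, trained with Adam.
model = Sequential([
    LSTM(10, activation="relu", input_shape=(TIME_STEPS, 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=0)

# Predictions are mapped back to physical units before computing error metrics.
preds = scaler.inverse_transform(model.predict(X_test, verbose=0))
```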
Table 2. Metrics for X and Y columns for the cleaned data set as calculated by the authors.

     Mean      Median   Standard Deviation
X    509.059   520.5    282.678
Y    484.526   473.5    286.875
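
For context, the statistics in Table 2 are of the kind produced by a short pandas routine such as the sketch below; the tiny in-line data frame and the cleaning rule (coercing entries to numeric and dropping invalid rows) are illustrative assumptions rather than the exact data set or prompt used in the task.

```python
# Hedged sketch of the data-cleaning and statistics computation summarised in Table 2.
import pandas as pd

# In the actual task the raw data are read from a CSV file (see the linked repository);
# a tiny in-line frame with some invalid entries is used here so the sketch runs on its own.
df = pd.DataFrame({"X": [501.2, "n/a", 520.5, None, 495.0],
                   "Y": [480.1, 473.5, "err", 490.2, 468.7]})

# Coerce entries to numeric and drop rows with invalid or missing values.
cleaned = df.apply(pd.to_numeric, errors="coerce").dropna()

# Mean, median, and standard deviation per column, as reported in Table 2.
stats = cleaned.agg(["mean", "median", "std"]).T
stats.columns = ["Mean", "Median", "Standard Deviation"]
print(stats.round(3))
```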
Table 3. Important parameters collected during the execution of the procedure for generating code for data cleaning and statistical computation using GPT, OAI1, and OAI3.

        Attempts Until Correct Answer   Total Queries   SB Responses   ARTpQ (s)   Average MAPE (%)
GPT     1                               1               0              7           0
OAI1    1                               1               0              91          0
OAI3    1                               1               0              7           0
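
Tables 3 and 10 record workflow parameters such as the total number of queries, the count of syntactically broken (SB) responses, and the average response time per query (ARTpQ). A minimal sketch of how such a session log might be kept is shown below; the send_query() stub and the example prompts are hypothetical placeholders, not the interface actually used in the study.

```python
# Hedged sketch of logging the workflow parameters reported in Tables 3 and 10:
# total queries, syntactically broken (SB) responses, and ARTpQ.
import ast
import time

def send_query(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call; replace with the real interface."""
    time.sleep(0.1)                      # simulate response latency
    return "print('hello world')"        # simulated code response

query_times, sb_responses = [], 0
for prompt in ["initial task description", "first refinement request"]:  # illustrative prompts
    start = time.perf_counter()
    code = send_query(prompt)
    query_times.append(time.perf_counter() - start)
    try:
        ast.parse(code)                  # crude syntax check used to flag an SB response
    except SyntaxError:
        sb_responses += 1

total_queries = len(query_times)
artpq = sum(query_times) / total_queries  # average response time per query, in seconds
print(f"Total queries: {total_queries}, SB responses: {sb_responses}, ARTpQ: {artpq:.2f} s")
```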
Table 4. Efficiency, Consistency, Accuracy, and InIs for GPT, OAI1, and OAI3 performance, as determined through INFINITE methodology for data cleaning and statistical computation task.

        E_SBR   E_ART   E     C   A   InI
GPT     1       1       1     1   1   1
OAI1    1       0       0.5   1   1   0.83
OAI3    1       1       1     1   1   1
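
The component scores and the index in Table 4 are mutually consistent under a simple unweighted combination, with E taken as the mean of E_SBR and E_ART and InI as the mean of E, C, and A. The sketch below reproduces the tabulated values under that assumption; the full INFINITE definition given earlier in the paper should be consulted for the exact weighting, so this serves only as a consistency check.

```python
# Consistency check against Table 4, assuming unweighted averages:
# E = (E_SBR + E_ART) / 2 and InI = (E + C + A) / 3.
# The exact weighting of the INFINITE methodology is defined in the paper;
# this sketch only verifies that simple averaging reproduces the tabulated values.

def efficiency(e_sbr: float, e_art: float) -> float:
    return (e_sbr + e_art) / 2.0

def inference_index(e: float, c: float, a: float) -> float:
    return (e + c + a) / 3.0

rows = {
    "GPT":  {"E_SBR": 1.0, "E_ART": 1.0, "C": 1.0, "A": 1.0},
    "OAI1": {"E_SBR": 1.0, "E_ART": 0.0, "C": 1.0, "A": 1.0},
    "OAI3": {"E_SBR": 1.0, "E_ART": 1.0, "C": 1.0, "A": 1.0},
}

for name, r in rows.items():
    e = efficiency(r["E_SBR"], r["E_ART"])
    ini = inference_index(e, r["C"], r["A"])
    print(f"{name}: E = {e:.2f}, InI = {ini:.2f}")
# Expected output matches Table 4: GPT 1.00/1.00, OAI1 0.50/0.83, OAI3 1.00/1.00.
```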
Table 5. Error metrics for temperature, relative humidity, and wind speed as calculated by the OAI1-generated LSTM model.

            Temperature   Relative Humidity   Wind Speed
MSE         0.1214        1.1871              0.4642
MAE         0.2838        0.7619              0.5100
MB          −0.2321       −0.2048             −0.1410
MAPE (%)    1.3938        1.0432              9.5998 × 10^8
MFE (%)     1.4161        1.0417              2.24889 × 10^5
MFB (%)     −1.1641       −0.2316             −14.1026
R2          0.9675        0.9861              0.8457
Pearson r   0.9910        0.9949              0.9235
Table 6. Error metrics for temperature, relative humidity, and wind speed as calculated by the OAI3-generated LSTM model.

            Temperature   Relative Humidity   Wind Speed
MSE         0.0787        1.1518              0.6924
MAE         0.2120        0.7428              0.6711
MB          −0.1433       −0.1422             0.2954
MAPE (%)    1.0412        1.0238              2.0214 × 10^4
MFE (%)     1.0582        1.0271              34.9757
MFB (%)     −0.7288       −0.1346             29.5465
R2          0.9789        0.9895              0.7698
Pearson r   0.9923        0.9951              0.9230
Table 7. LSTM-H model error metrics for temperature, relative humidity, and wind speed.

            Temperature   Relative Humidity   Wind Speed
MSE         0.0994        1.4854              0.4546
MAE         0.2366        0.8755              0.5088
MB          0.1476        −0.4317             0.0890
MAPE (%)    1.1569        1.1833              36.4229
MFE (%)     1.1562        1.1147              19.3069
MFB (%)     0.7214        −0.5496             3.3800
R2          0.9733        0.9864              0.8488
Pearson r   0.9901        0.9940              0.9227
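
Tables 5-7 report a common set of error metrics (MSE, MAE, MB, MAPE, MFE, MFB, R2, and Pearson r). The sketch below computes these quantities under conventional definitions, with the fractional error and bias following the standard formulations used in air-quality model evaluation; the sign conventions and averaging details used to produce the tables are assumptions here, since the authors' evaluation scripts live in the repository rather than in the text. The very large wind-speed MAPE values in the tables are consistent with this definition, which becomes unstable when observed values approach zero.

```python
# Hedged sketch of the error metrics reported in Tables 5-7, using conventional
# definitions (the authors' exact scripts may differ in sign conventions or details).
import numpy as np
from scipy.stats import pearsonr

def error_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    mb = np.mean(err)                                              # mean bias
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))           # unstable near zero observations
    mfe = 100.0 * np.mean(2.0 * np.abs(err) / (np.abs(y_pred) + np.abs(y_true)))
    mfb = 100.0 * np.mean(2.0 * err / (y_pred + y_true))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    r, _ = pearsonr(y_true, y_pred)
    return {"MSE": mse, "MAE": mae, "MB": mb, "MAPE (%)": mape,
            "MFE (%)": mfe, "MFB (%)": mfb, "R2": r2, "Pearson r": r}

# Example with synthetic values (not the meteorological data used in the paper).
rng = np.random.default_rng(0)
truth = 20.0 + rng.normal(0.0, 2.0, 500)
pred = truth + rng.normal(0.0, 0.3, 500)
print(error_metrics(truth, pred))
```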
Table 8. MAPE between predictions and ground truth for the LSTM-H, GPT, OAI1, and OAI3 models concerning temperature, relative humidity, and wind speed.

          Temperature   Relative Humidity   Wind Speed
LSTM-H    1.1569        1.1833              36.4229
GPT       0.9026        1.0079              33.3757
OAI1      1.3938        1.0432              36.0228
OAI3      1.0412        1.0238              37.4445
Table 9. R2 score comparisons between predictions and ground truth for the LSTM-H, GPT, OAI1, and OAI3 models concerning temperature, relative humidity, and wind speed.

          Temperature   Relative Humidity   Wind Speed
LSTM-H    0.9733        0.9864              0.8488
GPT       0.9827        0.9892              0.85041
OAI1      0.9675        0.9861              0.8457
OAI3      0.9789        0.9895              0.7698
Table 10. Important parameters collected during the execution of the procedure for generating code for an LSTM model using GPT, OAI1, and OAI3.

        Attempts Until Correct Answer   Total Queries   SB Responses   ARTpQ (s)   Average R2
GPT     2                               2               0              28.53       0.94
OAI1    4                               4               0              129.25      0.93
OAI3    2                               2               0              25.50       0.90
Table 11. Efficiency, Consistency, Accuracy, and InI indices for GPT, OAI1, and OAI3 performance, as determined through the INFINITE methodology.

        E_SBR   E_ART   E      C      A      InI
GPT     1       0.5     0.75   0.59   0.88   0.76
OAI1    1       0       0.50   0.42   0.87   0.62
OAI3    1       0.5     0.75   0.59   0.86   0.75
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
