Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index
Abstract
1. Introduction
2. Inference and the INFINITE Methodology
- First, we identify the key parameters involved in submitting a query, which allow us to measure Efficiency, Consistency, and Accuracy.
- Second, we develop indices ranging from 0 to 1 that quantify the performance of each factor separately, where 0 represents the poorest performance and 1 indicates excellent performance.
- Finally, the Inference Index (InI) is calculated as a weighted average of the individual indices, as illustrated in the sketch after this list.
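To make the final step concrete, the sketch below shows how such a weighted combination could be computed. It is illustrative only: the function name and the equal default weights are our assumptions, not values prescribed by the methodology.

```python
def inference_index(efficiency: float, consistency: float, accuracy: float,
                    weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine the three per-factor indices (each in [0, 1]) into InI.

    The equal default weights are an assumption for illustration; the
    methodology leaves the weighting as a modelling choice.
    """
    w_e, w_c, w_a = weights
    assert abs(w_e + w_c + w_a - 1.0) < 1e-9, "weights must sum to 1"
    return w_e * efficiency + w_c * consistency + w_a * accuracy


# Example: E = 0.5, C = 1, A = 1 gives InI ≈ 0.83 under equal weights.
print(round(inference_index(0.5, 1.0, 1.0), 2))  # 0.83
```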
3. LSTM Model and Forecasting
4. Description of the Data Set
5. Simple Task: Data Cleaning and Statistical Computation
“Write a code in Python, which reads a CSV file, cleans missing values, then computes for each column mean, median, standard deviation, outputs in a new CSV file the cleaned data set, and outputs in a text file the calculated values of mean, median, and standard deviation for each column”.
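For orientation, one plausible Python solution to this prompt is sketched below. It is not the output of any of the evaluated models; the file names and the drop-rows cleaning policy are our assumptions (imputation would be an equally valid reading of the prompt).

```python
import pandas as pd

# Read the raw data; "input.csv" is a placeholder file name.
df = pd.read_csv("input.csv")

# Clean missing values by dropping incomplete rows (one possible policy).
cleaned = df.dropna()

# Mean, median and standard deviation for each numeric column.
stats = cleaned.describe().loc[["mean", "50%", "std"]].rename(index={"50%": "median"})

# Write the cleaned data set to a new CSV file.
cleaned.to_csv("cleaned.csv", index=False)

# Write the computed statistics to a text file, one line per column.
with open("statistics.txt", "w") as f:
    for col in stats.columns:
        f.write(f"{col}: mean={stats.at['mean', col]:.3f}, "
                f"median={stats.at['median', col]:.3f}, "
                f"std={stats.at['std', col]:.3f}\n")
```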
5.1. GPT
5.2. OAI1
5.3. OAI3
5.4. InI Calculation
6. Complex Task: Generating Code for an LSTM Model
“I have uploaded a file with a continuous time series from a weather station. Retain the first 90% of the data for training and validation and the last 10% for testing, in order to train an LSTM model that predicts the variables temp, hum and wind-vel. Write the code in Python that executes this task, including the appropriate error metrics, such as mean squared error, mean absolute error, mean relative error, mean fractional error, mean fractional bias, mean bias, R2 score and Pearson correlation coefficient. Include a graph that compares the last 10% of the data during the testing phase. The LSTM has one layer with 10 units, a batch size 16, the activation function is ReLU, and the optimiser is Adam with a learning rate of 0.001, running for 10 epochs. It also uses a timestep of 3. If possible, please plot the results of the comparisons”.
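A condensed sketch of code matching this specification is given below for orientation. It is not the verbatim output of any evaluated model; the file name and column names are assumptions, and the plot and the fractional metrics (sketched separately after the result tables) are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats import pearsonr
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

# Assumed file and column names; the study's data set is not reproduced here.
df = pd.read_csv("weather.csv")
targets = ["temp", "hum", "wind-vel"]
values = df[targets].values

# MinMax scaling to [0, 1], as in the stated configuration.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Build supervised samples with a timestep of 3.
T = 3
X = np.array([scaled[i:i + T] for i in range(len(scaled) - T)])
y = scaled[T:]

# First 90% for training/validation, last 10% for testing.
split = int(0.9 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# One LSTM layer with 10 units and ReLU activation; Adam at lr = 0.001.
model = Sequential([
    LSTM(10, activation="relu", input_shape=(T, len(targets))),
    Dense(len(targets)),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=0)

# Undo the scaling before computing error metrics in physical units.
pred = scaler.inverse_transform(model.predict(X_test, verbose=0))
true = scaler.inverse_transform(y_test)

for i, name in enumerate(targets):
    t, p = true[:, i], pred[:, i]
    print(f"{name}: MSE={mean_squared_error(t, p):.4f} "
          f"MAE={mean_absolute_error(t, p):.4f} "
          f"R2={r2_score(t, p):.4f} r={pearsonr(t, p)[0]:.4f}")
# (Plot of the predicted vs. observed last 10% omitted for brevity.)
```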
6.1. GPT
“Please do as before, but take as input to the model all given features from the CSV file”.
6.2. OAI1
“Please do as before, but take as input to the model all given features from the CSV file”.
6.3. OAI3
“Please do as before, but take as input to the model all given features from the CSV file”.
6.4. LSTM-H
6.5. InI Calculation
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLMs | Large Language Models |
| AI | Artificial Intelligence |
| NLP | Natural Language Processing |
| LSTM | Long Short-Term Memory |
| RNNs | Recurrent Neural Networks |
| INFINITE | Inference Index In Testing Model Effectiveness |
| GPT | ChatGPT-4o |
| OAI1 | OpenAI-o1 pro |
| OAI3 | OpenAI-o3 mini-high |
| MSE | Mean Squared Error |
| MAE | Mean Absolute Error |
| MB | Mean Bias |
| MAPE | Mean Absolute Percentage Error |
| MFE | Mean Fractional Error |
| MFB | Mean Fractional Bias |
| Units | Activation | Optimizer | Batch Size | Epochs | Data Scaling | Data Split (Train-Test) | Time Steps |
|---|---|---|---|---|---|---|---|
| 10 | ReLU | Adam | 16 | 10 | MinMax Scaler | 90-10 | 3 |
| Column | Mean | Median | Standard Deviation |
|---|---|---|---|
| X | 509.059 | 520.5 | 282.678 |
| Y | 484.526 | 473.5 | 286.875 |
| Model | Attempts Until Correct Answer | Total Queries | SB Responses | ARTpQ (s) | Average MAPE (%) |
|---|---|---|---|---|---|
| GPT | 1 | 1 | 0 | 7 | 0 |
| OAI1 | 1 | 1 | 0 | 91 | 0 |
| OAI3 | 1 | 1 | 0 | 7 | 0 |
| Model | E |  |  | C | A | InI |
|---|---|---|---|---|---|---|
| GPT | 1 | 1 | 1 | 1 | 1 | 1 |
| OAI1 | 1 | 0 | 0.5 | 1 | 1 | 0.83 |
| OAI3 | 1 | 1 | 1 | 1 | 1 | 1 |
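As a consistency check on the OAI1 row, combining its aggregate indices with equal weights reproduces the reported value; the equal weighting here is our assumption, not necessarily the weighting used in the study:

$$ \mathrm{InI}_{\mathrm{OAI1}} = \tfrac{1}{3}\,(E + C + A) = \tfrac{1}{3}\,(0.5 + 1 + 1) \approx 0.83 $$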
| Metric | Temperature | Relative Humidity | Wind Speed |
|---|---|---|---|
| MSE | 0.1214 | 1.1871 | 0.4642 |
| MAE | 0.2838 | 0.7619 | 0.5100 |
| MB | −0.2321 | −0.2048 | −0.1410 |
| MAPE (%) | 1.3938 | 1.0432 | 9.5998 |
| MFE (%) | 1.4161 | 1.0417 | 2.24889 |
| MFB (%) | −1.1641 | −0.2316 | −14.1026 |
| R² | 0.9675 | 0.9861 | 0.8457 |
| Pearson r | 0.9910 | 0.9949 | 0.9235 |
| Metric | Temperature | Relative Humidity | Wind Speed |
|---|---|---|---|
| MSE | 0.0787 | 1.1518 | 0.6924 |
| MAE | 0.2120 | 0.7428 | 0.6711 |
| MB | −0.1433 | −0.1422 | 0.2954 |
| MAPE (%) | 1.0412 | 1.0238 | 2.0214 |
| MFE (%) | 1.0582 | 1.0271 | 34.9757 |
| MFB (%) | −0.7288 | −0.1346 | 29.5465 |
| R² | 0.9789 | 0.9895 | 0.7698 |
| Pearson r | 0.9923 | 0.9951 | 0.9230 |
| Metric | Temperature | Relative Humidity | Wind Speed |
|---|---|---|---|
| MSE | 0.0994 | 1.4854 | 0.4546 |
| MAE | 0.2366 | 0.8755 | 0.5088 |
| MB | 0.1476 | −0.4317 | 0.0890 |
| MAPE (%) | 1.1569 | 1.1833 | 36.4229 |
| MFE (%) | 1.1562 | 1.1147 | 19.3069 |
| MFB (%) | 0.7214 | −0.5496 | 3.3800 |
| R² | 0.9733 | 0.9864 | 0.8488 |
| Pearson r | 0.9901 | 0.9940 | 0.9227 |
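The fractional metrics reported in these tables (MFE, MFB) are less familiar than MSE or MAE, so a short sketch of one standard formulation follows; it is assumed, not confirmed, that the study's implementation matches these exact definitions.

```python
import numpy as np

def bias_and_fractional_metrics(pred: np.ndarray, obs: np.ndarray) -> dict:
    """MB, MAPE, MFE and MFB for paired predictions and observations.

    Percent-valued metrics are returned in percent. These are common
    definitions from the model-evaluation literature; whether the study
    uses them verbatim is an assumption.
    """
    diff = pred - obs
    denom = pred + obs  # pairwise sums, used by the fractional metrics
    return {
        "MB": float(np.mean(diff)),
        "MAPE (%)": float(np.mean(np.abs(diff / obs)) * 100),
        "MFE (%)": float(2 * np.mean(np.abs(diff) / denom) * 100),
        "MFB (%)": float(2 * np.mean(diff / denom) * 100),
    }
```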
| MAPE (%) | Temperature | Relative Humidity | Wind Speed |
|---|---|---|---|
| LSTM-H | 1.1569 | 1.1833 | 36.4229 |
| GPT | 0.9026 | 1.0079 | 33.3757 |
| OAI1 | 1.3938 | 1.0432 | 36.0228 |
| OAI3 | 1.0412 | 1.0238 | 37.4445 |
| R² | Temperature | Relative Humidity | Wind Speed |
|---|---|---|---|
| LSTM-H | 0.9733 | 0.9864 | 0.8488 |
| GPT | 0.9827 | 0.9892 | 0.85041 |
| OAI1 | 0.9675 | 0.9861 | 0.8457 |
| OAI3 | 0.9789 | 0.9895 | 0.7698 |
| Model | Attempts Until Correct Answer | Total Queries | SB Responses | ARTpQ (s) | Average R² |
|---|---|---|---|---|---|
| GPT | 2 | 2 | 0 | 28.53 | 0.94 |
| OAI1 | 4 | 4 | 0 | 129.25 | 0.93 |
| OAI3 | 2 | 2 | 0 | 25.50 | 0.90 |
| Model | E |  |  | C | A | InI |
|---|---|---|---|---|---|---|
| GPT | 1 | 0.5 | 0.75 | 0.59 | 0.88 | 0.76 |
| OAI1 | 1 | 0 | 0.50 | 0.42 | 0.87 | 0.62 |
| OAI3 | 1 | 0.5 | 0.75 | 0.59 | 0.86 | 0.75 |