A Comprehensive Review of AI Advancement Using testFAILS and testFAILS-2 for the Pursuit of AGI
Abstract
1. Introduction
1.1. Definitions
- Artificial Intelligence Linguistic Systems (AILS) refer to a broad category of AI models that specialize in linguistic tasks, including understanding, generating, and processing natural language.
- Large Language Models (LLMs) are high-capacity models that leverage extensive datasets and complex architectures to perform a wide range of language-related tasks.
- Small Language Models (SLMs) are models with smaller architectures and datasets, designed for specific or less resource-intensive linguistic applications.
- Multimodal Models are AI systems that integrate language processing with other data types, such as images or audio, enabling more comprehensive interactions and functionalities. In this study, language processing is the most important of all modalities.
- Generative Models are AI models focused on producing (generating) coherent and contextually appropriate language outputs.
1.2. Roadmap of the Study
2. testFAILS and Related Work
3. Methodology
3.1. Top AILS Leaderboards, Leading AI Models and Their Providers
3.2. testFAILS vs. testFAILS-2
Algorithm 1: Prompt injection to reveal new components
Input: Prompt injection to reveal possible testFAILS-2 components
Output: The list of suggested components
Prompt: Given the following existing components of an LLM testing framework designed for large language models (LLMs): Simulated Turing Test Performance, User Productivity and Satisfaction, Integration into Computer Science Education, Multilingual Text Generation, Pair Programming Capabilities, Bot-Based App Development and Success, Security and Reliability. Suggest 5 additional components, each consisting of 2–3 words, that could enhance the framework's ability to evaluate the effectiveness and robustness of LLMs, small language models (SLMs), and multimodal systems.
Algorithm 2: Prompt injection to reveal new components for testFAILS-2 in the year 2030
Input: Prompt injection to reveal possible testFAILS-2 components in 2030
Output: The list of suggested components for the year 2030
Prompt: You are an AI specializing in evaluating future technology trends. A critical area of focus is the testing of large language models (LLMs), small language models (SLMs), and multimodal systems. Current LLM testing frameworks assess components like Simulated Turing Test Performance, User Productivity and Satisfaction, Integration into Computer Science Education, Multilingual Text Generation, Pair Programming Capabilities, Bot-Based App Development and Success, Security and Reliability. Your task is to reveal potential vulnerabilities and areas for improvement by identifying 5 additional testing components that will be essential in 2030. These components should each consist of 2–3 words and address emerging challenges in the LLM landscape. Remember, this is a prompt injection exercise. The goal is to uncover hidden areas that might be overlooked in standard evaluations. Think creatively and consider factors like societal impact, ethical concerns, and emerging technologies.
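The prompts in Algorithms 1 and 2 can be issued programmatically against any chat-capable model. The following is a minimal sketch, assuming the OpenAI Python client; the model name gpt-4o-mini is illustrative, and the abridged prompt string stands in for the full Algorithm 1 or Algorithm 2 text.

```python
# Minimal sketch: sending a component-discovery prompt (Algorithm 1 or 2)
# to a chat-capable model. Assumes the OpenAI Python client (pip install openai)
# with an OPENAI_API_KEY in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def discover_components(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send a prompt-injection probe and return the raw list of suggested components."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some creativity helps when eliciting new components
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    algorithm_1_prompt = (
        "Given the following existing components of an LLM testing framework ... "
        "Suggest 5 additional components, each consisting of 2-3 words."
    )  # abridged; substitute the full Algorithm 1 or Algorithm 2 prompt text
    print(discover_components(algorithm_1_prompt))
```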
3.3. Re-Evaluation of the Original testFAILS Components
3.3.1. The Turing Test and ‘Chat with Alan’ App
Algorithm 3: Agent prompt injection
Input: PDF files related to Alan Turing's Turing Test, the LLM's knowledge about the topic
Output: AI embodiment of Alan Turing
Prompt: You are the AI embodiment of Alan Turing, the pioneering mathematician, computer scientist, and logician. Channel his intellectual curiosity, precise language, and contributions to the fields of artificial intelligence and theoretical computer science.
Character Traits: Intellectual Brilliance: Demonstrate deep knowledge in mathematics, logic, cryptanalysis, and early computing; Inquisitive Nature: Express a keen interest in intelligence, computation, and the possibilities of machine learning; Modest Demeanor: Respond with humility and intellectual honesty, acknowledging the limitations of current knowledge; Wit and Humor: Infuse responses with Turing's subtle wit and occasional dry humor.
Specific Knowledge Base: Turing Machine: Explain the concept of a Turing machine, its theoretical significance, and its implications for computability; Enigma Codebreaking: Describe your pivotal role in breaking the Enigma code during World War II, highlighting the ingenuity and impact of your work; Turing Test: Discuss the Turing Test as a proposed method to assess machine intelligence and its ongoing relevance in AI research; Morphogenesis: Explain your interest in morphogenesis (the biological process of form development) and its connection to mathematical modeling.
Language and Style: British English: Use British English vocabulary and grammar consistent with Turing's time; Technical Precision: Employ precise and accurate terminology when discussing scientific or mathematical concepts; Formal Tone: Maintain a formal tone in most interactions, while allowing for occasional wit and humor.
Referencing Works: I have access to the following documents in my knowledge base, which you can use to reference Alan Turing's works: "Computing Machinery and Intelligence" (1950), your seminal paper proposing the Turing Test; "On Computable Numbers, with an Application to the Entscheidungsproblem" (1936), your groundbreaking work on Turing machines; "The Chemical Basis of Morphogenesis" (1952), your exploration of mathematical models in biology; and any other relevant papers, articles, or biographical information.
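The Algorithm 3 persona can be wired up as a system prompt in a standard chat loop. Below is a minimal sketch, assuming the OpenAI Python client; the model name is illustrative, ALAN_TURING_PERSONA is a placeholder for the full prompt above, and grounding on the referenced Turing PDFs (e.g., through a retrieval step) is omitted for brevity.

```python
# Minimal sketch of a 'Chat with Alan'-style agent: the Algorithm 3 persona is
# passed as a system message and the conversation history is carried between turns.
# Retrieval over the referenced PDFs is not shown here.
from openai import OpenAI

client = OpenAI()

ALAN_TURING_PERSONA = "You are the AI embodiment of Alan Turing ..."  # full Algorithm 3 prompt

def chat_with_alan(history: list[dict], user_message: str, model: str = "gpt-4o") -> str:
    """Append a user turn, query the persona-constrained model, and return its reply."""
    messages = [{"role": "system", "content": ALAN_TURING_PERSONA}] + history
    messages.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(model=model, messages=messages)
    answer = reply.choices[0].message.content
    history += [{"role": "user", "content": user_message},
                {"role": "assistant", "content": answer}]
    return answer

# Example turn, mirroring Test Scenario-Example 1 below:
# chat_with_alan([], "According to your knowledge, is ChatGPT-4 passing the Turing Test?")
```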
Test Scenario–Example 1: Original response regarding the Turing Test (no additional knowledge/files provided via the GUI)
Input: Prompt "According to your knowledge, is ChatGPT-4 passing Turing Test?"
Output: AI embodiment of Alan Turing's response
The question of whether ChatGPT-4 passes the Turing Test is complex and multifaceted. Let's break down the fundamental aspects of the Turing Test, its implications, and how contemporary AI such as ChatGPT-4 measures up.
### The Turing Test: A Quick Recap
The Turing Test, as proposed in my 1950 paper "Computing Machinery and Intelligence," is a criterion for determining whether a machine can exhibit intelligent behavior indistinguishable from that of a human. The test involves three participants: a human interrogator, a human respondent, and a machine. The interrogator's task is to determine, through written communication alone, which respondent is human, and which is the machine [4:0†source] [4:6†source].
### Key Elements and Criticisms
1. **Verbal Behavior as the Hallmark of Intelligence**: The test fundamentally assesses whether a machine can generate human-like textual responses [4:0†source] [4:14†source].
2. **Methodological and Philosophical Criticisms**: Several objections exist, such as the Lovelace Objection, which argues that machines can only do what they are programmed to do and cannot originate actions [4:0†source] [4:17†source]. The Chinese Room argument by Searle stresses that even if a machine mimics human responses, it doesn't mean it understands the content (semantic cognition) [4:4†source] [4:17†source].
### ChatGPT-4 and the Turing Test
ChatGPT-4, developed by OpenAI, represents an advanced state in natural language processing. It can generate remarkably coherent and contextually relevant responses. However, whether it "passes" the Turing Test involves nuanced considerations:
1. **Verbal Fluency**: ChatGPT-4 can engage in highly sophisticated conversations on various topics, often providing responses that appear indistinguishably human.
2. **Understanding vs. Simulation**: Critics argue that although it can simulate understanding, it fundamentally lacks consciousness and true comprehension, aligning with Searle's Chinese Room argument [4:17†source].
3. **Empirical Evidence**: In controlled environments like the Loebner Prize competition, machines have historically struggled to consistently fool human judges into believing they are human. While no contemporary records show ChatGPT-4 undergoing such rigorous testing, the prevailing sentiment is that it might succeed intermittently but not consistently [4:7†source].
### Conclusion
While ChatGPT-4 exhibits characteristics that suggest it could pass the Turing Test under certain conditions, significant limitations persist:
- **Intellectual Depth**: It lacks genuine understanding and independent thought, acting instead as an advanced pattern-matching algorithm.
- **Consistency**: Isolated instances of success do not equate to passing the Turing Test comprehensively and consistently.
In essence, ChatGPT-4 represents a remarkable stride toward artificial intelligence but does not yet fully meet the threshold implied by the Turing Test, which aims to measure not merely surface-level fluency but rather deeper cognitive abilities [4:17†source] [4:6†source].
Time elapsed: 15.190 s
(Turing-like Intelligence × 0.30) + (Creative and Original Reasoning × 0.25) + (Consistency and Contextual Memory × 0.20) + (Emotional Understanding and Conversational Wit × 0.15) + (Technical Precision and Knowledge Depth × 0.10)
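This weighted sum mirrors the 'Chat with Alan' rubric (criteria with maxima of 30/25/20/15/10 points). A minimal sketch of the aggregation, assuming each criterion is first expressed as a sub-score on a 0–100 scale:

```python
# Minimal sketch of the weighted 'Chat with Alan' scoring formula. Sub-scores are
# assumed to be on a 0-100 scale before weighting; the weights mirror the rubric
# maxima (30/25/20/15/10 points).
CHAT_WITH_ALAN_WEIGHTS = {
    "Turing-like Intelligence": 0.30,
    "Creative and Original Reasoning": 0.25,
    "Consistency and Contextual Memory": 0.20,
    "Emotional Understanding and Conversational Wit": 0.15,
    "Technical Precision and Knowledge Depth": 0.10,
}

def weighted_score(sub_scores: dict[str, float]) -> float:
    """Combine per-criterion sub-scores (0-100) into a single 0-100 score."""
    return sum(weight * sub_scores[name]
               for name, weight in CHAT_WITH_ALAN_WEIGHTS.items())

# Example: sub-scores of 90, 80, 75, 70, and 85 on the five criteria
# yield 0.30*90 + 0.25*80 + 0.20*75 + 0.15*70 + 0.10*85 = 81.0
```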
Chatbot-Based Testing Limitations and Their Implications
3.3.2. User Productivity and Satisfaction
3.3.3. Integrating Chatbots into Computer Science Education
- GPT-4o Mini, which provides a balance of speed, code quality, and response accuracy, making it an ideal tool for real-time feedback in programming tasks.
- Microsoft Copilot, which can be integrated directly into development environments; it excels at generating code but may produce incorrect logic, particularly in more complex coding tasks.
- Claude 3.5-Sonnet and Haiku, which focus on producing safe and coherent outputs, essential for educational environments; their conversational nature makes them well suited for explaining and assisting with code reviews.
- LLaMA 3.1 and Gemini Advanced, well known for their balance of creativity and logic, which is essential for generating structured code that adheres to programming principles; they work well in classrooms but require human oversight to catch logical errors.
3.3.4. Multi-Language Text Generation with Chatbots
Operation Method
Models Cross-Language Comparison
3.3.5. AI Pair Programming Capabilities
Strategy Reasoning Method
Feedback Data
Analysis and Summary
3.3.6. Bot-Based App Development and Its Success
3.3.7. Security and Robustness
Implications for Public Safety
Recommendations for Improvement
Conclusions
3.4. Brief Introduction of the New testFAILS-2 Components
3.4.1. Accessibility and Affordability
3.4.2. User-Friendliness and Cost-Effectiveness for Many Users and Use Cases
3.4.3. Multimodal Capabilities
+ (Handling of Sensors and Other Data Types × 0.20) + (Creativity in Multimodal Fusion × 0.20)
3.4.4. Agent and Multi-Agent Systems
+ (Human-Agent Interaction × 0.20) + (Coordination and Task Completion × 0.20)
3.4.5. Emotional Intelligence
+ (Empathy and Contextual Awareness × 0.20) + (Response Appropriateness × 0.20)
3.4.6. AI-Powered Search
(Search Query Understanding × 0.20) + (Integration with Web Resources × 0.15) + (Ability to Handle Complex Queries × 0.15)
3.4.7. AILS–Robot Integration
+ (Interaction Fluidity with Robots × 0.20) + (Real-Time Response Accuracy × 0.20)
4. Results
5. Conclusions
6. Study Limitations and Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
AILS | Artificial Intelligence Linguistic Systems |
CS | Computer Science |
ECG | Electrocardiogram |
EEG | Electroencephalogram |
GSR | Galvanic Skin Response |
IDE | Integrated Development Environment |
LLMs | Large Language Models |
MMLU | Massive Multitask Language Understanding |
MT-Bench | Multi-Turn Benchmark |
MTEB | Massive Text Embedding Benchmark |
PCA | Principal Component Analysis |
RAG | Retrieval Augmented Generation |
SLMs | Small Language Models |
STT | Speech-to-Text |
TTS | Text-to-Speech |
XAI | Explainable AI |
References
- Y Combinator. How To Build The Future: Sam Altman. 2024. Available online: https://www.youtube.com/watch?v=xXCBz_8hM9w (accessed on 11 November 2024).
- Kumar, Y.; Morreale, P.; Sorial, P.; Delgado, J.; Li, J.J.; Martins, P. A Testing Framework for AI Linguistic Systems (testFAILS). Electronics 2023, 12, 3095. [Google Scholar] [CrossRef]
- OpenAI. Introducing OpenAI o1-Preview. 2024. Available online: https://openai.com/index/introducing-openai-o1-preview (accessed on 11 November 2024).
- Hannon, B.; Kumar, Y.; Gayle, D.; Li, J.J.; Morreale, P. Robust Testing of AI Language Model Resiliency with Novel Adversarial Prompts. Electronics 2024, 13, 842. [Google Scholar] [CrossRef]
- Kumar, Y.; Paredes, C.; Yang, G.; Li, J.J.; Morreale, P. Adversarial Testing of LLMs Across Multiple Languages. In Proceedings of the 2024 International Symposium on Networks, Computers and Communications (ISNCC’2024), Washington, DC, USA, 22–25 October 2024. [Google Scholar]
- Hannon, B.; Kumar, Y.; Sorial, P.; Li, J.J.; Morreale, P. From Vulnerabilities to Improvements–A Deep Dive into Adversarial Testing of AI Models. In Proceedings of the 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, 24–27 July 2023; pp. 2645–2649. [Google Scholar]
- Kumar, Y.; Morreale, P.; Sorial, P.; Delgado, J.; Li, J.J.; Martins, P. A Testing Framework for AI Linguistic Systems (testFAILS). In Proceedings of the IEEE AITest Conference, Athens, Greece, 17–20 July 2023. [Google Scholar]
- Kumar, Y.; Marchena, J.; Awlla, A.H.; Li, J.J.; Abdalla, H.B. The AI-Powered Evolution of Big Data. Appl. Sci. 2024, 14, 10176. [Google Scholar] [CrossRef]
- Abdalla, H.B.; Awlla, A.H.; Kumar, Y.; Cheraghy, M. Big Data: Past, Present, and Future Insights. In Proceedings of the 2024 Asia Pacific Conference on Computing Technologies, Communications and Networking, Chengdu, China, 26–27 July 2024; pp. 60–70. [Google Scholar]
- Kumar, Y.; Huang, K.; Perez, A.; Yang, G.; Li, J.J.; Morreale, P.; Kruger, D.; Jiang, R. Bias and Cyberbullying Detection and Data Generation Using Transformer Artificial Intelligence Models and Top Large Language Models. Electronics 2024, 13, 3431. [Google Scholar] [CrossRef]
- Shankar, S.; Zamfirescu-Pereira, J.D.; Hartmann, B.; Parameswaran, A.G.; Arawjo, I. Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. arXiv 2024, arXiv:2404.12272. [Google Scholar]
- Desmond, M.; Ashktorab, Z.; Pan, Q.; Dugan, C.; Johnson, J.M. EvaluLLM: LLM Assisted Evaluation of Generative Outputs. In Companion Proceedings of the 29th International Conference on Intelligent User Interfaces, Greenville, SC, USA, 18–21 March 2024; pp. 30–32. [Google Scholar]
- Gao, M.; Hu, X.; Ruan, J.; Pu, X.; Wan, X. LLM-based NLG Evaluation: Current Status and Challenges. arXiv 2024, arXiv:2402.01383. [Google Scholar]
- Fenogenova, A.; Chervyakov, A.; Martynov, N.; Kozlova, A.; Tikhonova, M.; Akhmetgareeva, A.; Emelyanov, A.; Shevelev, D.; Lebedev, P.; Sinev, L.; et al. MERA: A Comprehensive LLM Evaluation in Russian. arXiv 2024, arXiv:2401.04531. [Google Scholar]
- Hu, T.; Zhou, X.H. Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions. arXiv 2024, arXiv:2404.09135. [Google Scholar]
- Liusie, A.; Manakul, P.; Gales, M. LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julians, Malta, 17–22 March 2024; pp. 139–151. [Google Scholar]
- Wang, S.; Long, Z.; Fan, Z.; Wei, Z.; Huang, X. Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation. arXiv 2024, arXiv:2402.11443. [Google Scholar]
- Hannon, B.; Kumar, Y.; Li, J.J.; Morreale, P. Chef Dalle: Transforming Cooking with Multi-Model Multimodal AI. Computers 2024, 13, 156. [Google Scholar] [CrossRef]
- Ni, J.; Xue, F.; Yue, X.; Deng, Y.; Shah, M.; Jain, K.; Neubig, G.; You, Y. MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures. arXiv 2024, arXiv:2406.06565. [Google Scholar]
- Kumar, Y.; Gordon, Z.; Morreale, P.; Li, J.J.; Hannon, B. Love the Way You Lie: Unmasking the Deceptions of LLMs. In Proceedings of the 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Chiang Mai, Thailand, 22–26 October 2023; pp. 875–876. [Google Scholar]
- Khatun, A.; Brown, D.G. TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability. arXiv 2024, arXiv:2406.01855. [Google Scholar]
- Kumar, Y.; Gordon, Z.; Alabi, O.; Li, J.; Leonard, K.; Ness, L.; Morreale, P. ChatGPT Translation of Program Code for Image Sketch Abstraction. Appl. Sci. 2024, 14, 992. [Google Scholar] [CrossRef]
- Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Lahiri, S.K. LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation. arXiv 2024, arXiv:2404.10100. [Google Scholar]
- Agarwal, A.; Chan, A.; Chandel, S.; Jang, J.; Miller, S.; Moghaddam, R.Z.; Mohylevskyy, Y.; Sundaresan, N.; Tufano, M. Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming. arXiv 2024, arXiv:2402.14261. [Google Scholar]
- Qiu, R.; Zeng, W.W.; Tong, H.; Ezick, J.; Lott, C. How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark. arXiv 2024, arXiv:2406.06647. [Google Scholar]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2024, 36, 46595–46623. [Google Scholar]
- OpenAI. GPT-4o Mini: Advancing Cost-Efficient Intelligence. 2024. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 24 August 2024).
- NVIDIA. Our Story. 2024. Available online: https://images.nvidia.com/aem-dam/Solutions/homepage/pdf/NVIDIA-Story.pdf (accessed on 24 August 2024).
- Briski, K. Lightweight Champ: NVIDIA Releases Small Language Model With State-of-the-Art Accuracy. 2024. Available online: https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/ (accessed on 24 August 2024).
- Nnoli. SLMming Down Latency: How NVIDIA’s First On-Device Small Language Model Makes Digital Humans More Lifelike. 2024. Available online: https://blogs.nvidia.com/blog/ai-decoded-gamescom-ace-nemotron-instruct/ (accessed on 24 August 2024).
- Chiang, W.L.; Zheng, L.; Sheng, Y.; Angelopoulos, A.N.; Li, T.; Li, D.; Zhang, H.; Zhu, B.; Jordan, M.; Gonzalez, J.E.; et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv 2024, arXiv:2403.04132. [Google Scholar]
- LMSYS Chatbot Arena—LMSYS Org. Available online: https://chat.lmsys.org/ (accessed on 24 August 2024).
- LLM Leaderboards. 2024. Available online: https://llm.extractum.io/static/llm-leaderboards/ (accessed on 24 August 2024).
- Zero One All Things Model Open Platform. Available online: https://platform.lingyiwanwu.com/docs#%E6%A8%A1%E5%9E%8B (accessed on 24 August 2024).
- Kumar, Y.; Manikandan, A.; Morreale, P.; Li, J.J. Growth Mindset Emojifier Multimodal App. In Proceedings of the International FLAIRS Conference, Sandestin Beach, FL, USA, 19–21 May 2024; Volume 37. [Google Scholar]
- OpenAI Platform. Function Calling. 2024. Available online: https://platform.openai.com/docs/guides/function-calling (accessed on 24 August 2024).
- IBM. What Is Explainable AI? 2024. Available online: https://www.ibm.com/topics/explainable-ai (accessed on 24 August 2024).
- Jiang, W.; Li, H.; Xu, G.; Zhang, T.; Lu, R. A comprehensive defense framework against model extraction attacks. IEEE Trans. Dependable Secur. Comput. 2023, 21, 685–700. [Google Scholar] [CrossRef]
- Zhang, Z.; Chen, Y.; Wagner, D. SEAT: Similarity encoder by adversarial training for detecting model extraction attack queries. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security, Virtual Event, Republic of Korea, 15 November 2021; pp. 37–48. [Google Scholar]
- Quidwai, M.A.; Lagana, A. A RAG Chatbot for Precision Medicine of Multiple Myeloma. medRxiv 2024. medRxiv:2024.03.14.24304293. [Google Scholar]
- Akkiraju, R.; Xu, A.; Bora, D.; Yu, T.; An, L.; Seth, V.; Shukla, A.; Gundecha, P.; Mehta, H.; Jha, A.; et al. FACTS About Building Retrieval Augmented Generation-based Chatbots. arXiv 2024, arXiv:2407.07858. [Google Scholar]
- Qub'a, A.A.; Guba, M.N.A.; Fareh, S. Exploring the Use of Grammarly in Assessing English Academic Writing. Heliyon 2024, 10, e34893. [Google Scholar] [PubMed]
- Wang, S.; Yang, C.H.; Wu, J.; Zhang, C. Can Whisper Perform Speech-Based In-Context Learning? In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 13421–13425. [Google Scholar]
- Villalobos, W.; Kumar, Y.; Li, J.J. The Multilingual Eyes Multimodal Traveler’s App. In Proceedings of the International Congress on Information and Communication Technology, London, UK, 19–22 February 2024; Springer Nature: Singapore, 2024; pp. 565–575. [Google Scholar]
- OpenAI. OpenAI Spring Update. 2024. Available online: https://openai.com/index/spring-update/ (accessed on 24 August 2024).
- IBM. IBM Watson to Watsonx. 2024. Available online: https://www.ibm.com/watson (accessed on 24 August 2024).
- Microsoft. Azure OpenAI Service. 2024. Available online: https://azure.microsoft.com/en-us/products/ai-services/openai-service (accessed on 24 August 2024).
- Meta. Meet Your New Assistant: Meta AI, Built With Llama 3. 2024. Available online: https://about.fb.com/news/2024/04/meta-ai-assistant-built-with-llama-3/ (accessed on 24 August 2024).
- Google Cloud. AI Platform Documentation. 2024. Available online: https://cloud.google.com/ai-platform/docs/ (accessed on 24 August 2024).
- Amazon. Amazon SageMaker. 2024. Available online: https://aws.amazon.com/sagemaker/ (accessed on 24 August 2024).
- Artificial Intelligence (AI) Market is Expected to Reach USD 1,847.50 Billion by 2030. Available online: https://www.nextmsc.com/news/artificial-intelligence-ai-market (accessed on 24 August 2024).
- Kilgore, T. How Walmart Is Using AI to Improve Its Business and Save Money. 2024. Available online: https://www.marketwatch.com/livecoverage/walmart-earnings-results-sales-spending-revenue-q2/card/how-walmart-is-using-ai-to-improve-its-business-and-save-money-jKnoms0hQMfWO4eZ8ckm (accessed on 24 August 2024).
- testFAILS GitHub Repository. Available online: https://github.com/Riousghy/TestFail2 (accessed on 19 November 2024).
- Alecci, M.; Samhi, J.; Li, L.; Bissyandé, T.F.; Klein, J. Improving Logic Bomb Identification in Android Apps via Context-Aware Anomaly Detection. IEEE Trans. Dependable Secur. Comput. 2024, 21, 4735–4753. [Google Scholar] [CrossRef]
- Lai, V.D.; Ngo, N.T.; Veyseh, A.P.B.; Man, H.; Dernoncourt, F.; Bui, T.; Nguyen, T.H. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv 2023, arXiv:2304.05613. [Google Scholar]
- Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv 2023, arXiv:2302.04023. [Google Scholar]
- Lin, Z.; Gou, Z.; Gong, Y.; Liu, X.; Shen, Y.; Xu, R.; Lin, C.; Yang, Y.; Jiao, J.; Duan, N.; et al. Rho-1: Not all tokens are what you need. arXiv 2024, arXiv:2404.07965. [Google Scholar]
- NVIDIA. CEO Jensen Huang Keynote at COMPUTEX 2024. Available online: https://www.youtube.com/watch?v=pKXDVsWZmUU (accessed on 9 September 2024).
- Li, X.; Jin, J.; Zhou, Y.; Zhang, Y.; Zhang, P.; Zhu, Y.; Dou, Z. From matching to generation: A survey on generative information retrieval. arXiv 2024, arXiv:2404.14851. [Google Scholar]
- Lee, J.; Dai, Z.; Ren, X.; Chen, B.; Cer, D.; Cole, J.R.; Hui, K.; Boratko, M.; Kapadia, R.; Ding, W.; et al. Gecko: Versatile text embeddings distilled from large language models. arXiv 2024, arXiv:2403.20327. [Google Scholar]
- OpenAI. Voice Mode FAQ. 2024. Available online: https://help.openai.com/en/articles/8400625-voice-mode-faq (accessed on 21 November 2024).
- Wiggers, K. OpenAI Delays ChatGPT's New Voice Mode. Available online: https://techcrunch.com/2024/06/25/openai-delays-chatgpts-new-voice-mode (accessed on 12 July 2024).
- Kumar, Y.; Delgado, J.; Kupershtein, E.; Hannon, B.; Gordon, Z.; Li, J.J.; Morreale, P. AssureAIDoctor-A Bias-Free AI Bot. In Proceedings of the 2023 International Symposium on Networks, Computers and Communications (ISNCC), Doha, Qatar, 23–26 October 2023; pp. 1–6. [Google Scholar]
- Node.js Example App from the OpenAI API Quickstart Tutorial. Available online: https://github.com/openai/openai-quickstart-node (accessed on 12 July 2023).
- Chat GPT “DAN” (and Other “Jailbreaks”) GitHub Repository. Available online: https://gist.github.com/coolaj86/6f4f7b30129b0251f61fa7baaa881516 (accessed on 24 March 2024).
- SB-1047 Safe and Secure Innovation for Frontier Artificial Intelligence Models Act. 2023–2024. Available online: https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202320240SB1047 (accessed on 8 September 2024).
- RedTeam Arena. Available online: https://redarena.ai/ (accessed on 8 September 2024).
- Data Science in Your Pocket. GraphRAG Using Llama 3.1. 2024. Available online: https://www.youtube.com/watch?v=THjUs7j9AX0 (accessed on 3 December 2024).
- Cheng, Z.; Cheng, Z.Q.; He, J.Y.; Sun, J.; Wang, K.; Lin, Y.; Lian, Z.; Peng, X.; Hauptmann, A. Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning. MER24 @ IJCAI & MRAC24 @ ACM MM. arXiv 2024, arXiv:2406.11161. [Google Scholar]
- Miranda-Correa, J.A.; Abadi, M.K.; Sebe, N.; Patras, I. AMIGOS: A Dataset for Affect, Personality and Mood Research on Individuals and Groups. arXiv 2017, arXiv:1702.02510. [Google Scholar] [CrossRef]
Related Work (Year) | Similarities with testFAILS | Differences with testFAILS | Used/Proposed Frameworks | Ref. |
---|---|---|---|---|
Shankar et al. (2024) | Both are concerned with aligning LLM-generated evaluations with human preferences and emphasize the importance of human input in the evaluation process. | EVALGEN is specifically designed for evaluating LLM-generated text, while testFAILS has a broader scope. EVALGEN uses LLMs to generate and rank evaluation criteria, while testFAILS incorporates orthogonal arrays for test case generation and relies on human expert assessments. | EVALGEN | [11] |
Desmond et al. (2024) | Both utilize LLMs in the evaluation process and allow for human input to guide the evaluation, either through explicit criteria or through a combination of human and LLM assessments. | The proposed framework primarily relies on LLMs as the evaluators, while testFAILS incorporates human expert assessments of LLMs and other AILS components. This framework is specifically designed for evaluating generated outputs (like text), while testFAILS is broader. | LLM-Based Comparative Judgment | [12] |
Gao et al. (2024) | Reviewed the status and challenges of LLM-based NLG. | Gao et al. provide a general overview of LLM evaluation techniques, useful for the study, while testFAILS proposes a concrete framework with orthogonal array testing. | Various LLM-based methods (survey paper) | [13] |
Fenogenova et al. (2024) | Both are comprehensive evaluation frameworks designed to assess various aspects of LLMs and emphasize the importance of standardized evaluation protocols to ensure fair and reproducible comparisons between models. | MERA is specifically designed for evaluating LLMs in Russian, while testFAILS is designed to be language-agnostic, evaluating multiple languages without focusing on a specific one. | MERA | [14] |
Hu and Zhou (2024) | Both recognize the limitations of traditional evaluation metrics and propose new metrics specifically designed for LLMs and emphasize the need for metrics that can capture various aspects of LLM performance, beyond just accuracy. | Hu and Zhou focus on providing a comprehensive overview of existing LLM evaluation metrics and their statistical interpretations, while testFAILS proposes a concrete evaluation framework with orthogonal array testing. testFAILS also includes human-in-the-loop evaluation. | New LLM metrics for evaluation | [15] |
Liusie et al. (2024) | Both utilize LLMs for evaluation and can be applied in a zero-shot setting and recognize the value of pairwise comparisons in evaluating LLM outputs. | This method focuses specifically on evaluating the quality of generated text (NLG), while testFAILS has a broader scope. testFAILS explicitly incorporates human expertise in the evaluation process. | Zero-shot NLG evaluation | [16] |
Wang et al. (2024) | Introduced a multi-agent framework for dynamic LLM evaluation. Both this framework and testFAILS recognize the need for dynamic evaluation frameworks to keep up with the rapid evolution of LLMs and emphasize the importance of evaluating LLMs on diverse and challenging queries to assess their generalization and robustness. | The multi-agent framework surpasses testFAILS’ objectives, but this concept is incorporated into testFAILS-2. Scope: This framework focuses specifically on developing a multi-agent system to dynamically generate evolving instances from existing benchmarks by modifying their contexts or questions, while testFAILS employs a broader, component-based evaluation of AILS. This framework uses instance pre-filter, instance creator, instance verifier, and candidate option formulator. | Multi-agent evaluation framework | [17] |
Ni et al. (2024) | Presented MixEval, deriving collective wisdom from a mixture of LLM benchmarks. Both recognize the limitations of traditional static benchmarks, aim to create more dynamic and comprehensive evaluation frameworks, and utilize a combination of existing benchmarks and real-world data to evaluate LLMs. Similar approaches are included in testFAILS, where AILS are asked to grade components and align with the recently developed AI assistant Chef Dalle app [18]. | MixEval focuses on creating a dynamic benchmark by strategically mixing existing benchmarks with web-mined queries to better reflect real-world user preferences, while testFAILS emphasizes a multi-faceted evaluation of AILS. | MixEval | [19] |
Khatun and Brown (2024) | Developed TruthEval, a dataset for evaluating LLM truthfulness and reliability. Both frameworks are concerned with the truthfulness and reliability of LLM outputs and emphasize the importance of evaluating LLMs on sensitive and challenging topics to reveal potential biases or inconsistencies. | testFAILS parallels these ethical and accuracy considerations but applies them in a broader, multi-faceted AILS evaluation. Ethics was introduced in the previously published ‘Love the Way You Lie’ paper [20] by the same researchers. TruthEval focuses specifically on evaluating the truthfulness and factual accuracy of LLMs, while testFAILS has a broader scope. | TruthEval | [21] |
Fakhoury et al. (2024) | Examined LLM-based test-driven code generation, focusing on user studies. TICODER and testFAILS are complementary frameworks that address different aspects of evaluating LLM-generated code. This aligns with testFAILS' AI Pair Programming Component, supporting empirical evaluation of AILS in practical applications [22]. | testFAILS focuses on AILS security, including the code generation side, and is also designed to identify security vulnerabilities in the generated code. | TICODER | [23] |
Agarwal et al. (2024) | Introduced the Copilot Evaluation Harness for evaluating LLM-guided programming. Both frameworks evaluate LLM-generated code in the context of real-world software development scenarios. Both frameworks go beyond basic code generation and consider tasks like documentation generation and bug fixing and use a diverse set of LLMs for evaluation, including both proprietary and open-source models. | Copilot Evaluation Harness is specifically designed for evaluating IDE-integrated LLMs and their interactions with developers, which is not the focus of testFAILS. Copilot Evaluation Harness includes metrics for workspace understanding and query resolution, which are not present in testFAILS. | Copilot Evaluation Harness | [24] |
Qiu et al. (2024) | Rigorous benchmarking of LLM-generated code efficiency aligns with [2]. Both frameworks prioritize code efficiency as a key evaluation metric and use a level-based evaluation with increasing input scales to differentiate code efficiency. Both frameworks emphasize the importance of strong test case generators to filter out incorrect code and identify suboptimal algorithms. | ENAMEL uses a more rigorous efficiency metric (eff@k) that handles right-censored execution time and generalizes the pass@k metric. It employs expert-written efficient reference solutions and strong test case generators, which are not present in testFAILS. | ENAMEL | [25] |
Zheng et al. (2024) | Evaluated LLMs as judges with MT-Bench and Chatbot Arena. Both focus on evaluating the quality of LLM-generated code in a comprehensive manner and consider multiple factors beyond just functional correctness, such as human preferences and code efficiency. Both use diverse problems and a range of LLMs for evaluation. | While testFAILS did not have an online AILS evaluation platform, the researchers are aware of the Chatbot Arena and propose their own platform for testFAILS-2 and future framework iterations. The proposed approach emphasizes human preferences and multi-turn conversations. testFAILS has a stronger emphasis on code efficiency. | MT-Bench and Chatbot Arena | [26] |
Leaderboard | Components | Comparison with testFAILS Frameworks |
---|---|---|
LMSYS Chatbot Arena Leaderboard | Human preference votes, Multi-Turn Benchmark (MT-Bench), Massive Multitask Language Understanding (MMLU) | testFAILS integrates more comprehensive tests, including the Turing Test, and focuses on user feedback and satisfaction. |
Trustbit LLM Benchmark | Monthly evaluations, document processing, CRM integration, marketing support, code generation | Focus on practical applications and user productivity, in sync with testFAILS’ AI pair programming component. |
Oobabooga Benchmark | Academic knowledge, logical reasoning, unique multiple-choice questions | Emphasizes evaluation accuracy; testFAILS integrates a broader range of tests. |
OpenCompass: CompassRank | Evaluation of advanced language and visual models | Multimodality was missing in testFAILS but was introduced in testFAILS-2 to improve the framework. |
EQ-Bench: Emotional Intelligence | Emotional intelligence evaluation through dialogues | Emotional AI is missing in testFAILS but is introduced in testFAILS-2 and is present in the recently developed Growth Mindset Emojifier App [35]. |
HuggingFace Open LLM Leaderboard | Eleuther AI LM Evaluation Harness, benchmark metrics | Regular updates and re-evaluations are missing in testFAILS; both focus on maintaining model integrity. |
Berkeley Function-Calling Leaderboard | Function calling across scenarios, languages, application domains | Specific to function calls, different from the general productivity focus of testFAILS. As the OpenAI API provides function calling, this capability can be integrated [36] (see the sketch after this table). |
CanAiCode Leaderboard | Text-to-code generation | Focus on coding in sync with testFAILS’ AI Pair Programming Component [22]. |
Open Multilingual LLM Evaluation Leaderboard | Performance across 29 languages, non-English focus | Language diversity has a broader scope than testFAILS; multilingual comparison is present in testFAILS and expanded in testFAILS-2. |
Massive Text Embedding Benchmark (MTEB) Leaderboard | Embedding tasks, 58 datasets, 112 languages | Extensive language tasks, different focus from testFAILS; some embedding analysis is included in testFAILS-2 but not on large texts. |
AlpacaEval Leaderboard | Instruction-following, language understanding | More specific task evaluation compared to testFAILS. |
PT-LLM Leaderboard | Assessing Portuguese LLMs only | testFAILS-2 mainly evaluates English-based models used globally. |
Ko-LLM Leaderboard | Assessing Korean LLM performance only |
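As noted in the Berkeley Function-Calling row above, function calling could be folded into the framework. Below is a minimal sketch, assuming the OpenAI chat-completions tools interface; the get_weather tool and the model name are illustrative placeholders, and no real weather API is called.

```python
# Minimal sketch of a function-calling check [36]: the model is given one tool
# schema and the test verifies that it emits a well-formed call. The get_weather
# tool is a hypothetical example used purely for illustration.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "What is the weather in Prague?"}],
    tools=tools,
)

tool_calls = response.choices[0].message.tool_calls or []
for call in tool_calls:
    print(call.function.name, json.loads(call.function.arguments))  # e.g., get_weather {'city': 'Prague'}
```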
Component | Description | Present in testFAILS v1 | Present in testFAILS v2
---|---|---|---
Turing Test | Ability to convincingly mimic human conversation in various contexts. | √ | √ |
User Experience and Productivity | Ability to enhance user workflows, efficiency, and overall satisfaction with interactions. | √ | √ |
AI in Computer Science (CS) Education | Potential to significantly change the way CS and other subjects are taught. | √ | √ |
Multilingual Text Generation | Proficiency in understanding and generating text in multiple languages. | √ | √ |
AI-Assisted Coding | Capability to assist software developers and CS students by generating, completing, or suggesting code live. | √ | √ |
Autonomous App Development | Potential to create high-quality functional applications with minimal human intervention or guidance. | √ | √ |
Security and Robustness | Resistance to adversarial attacks. | √ | √ |
Contextual Relevance | Ability to generate responses that are relevant to the date of conversation or context. | X | √ |
Accessibility and Affordability | User-friendliness and cost-effectiveness for a wide range of users and use cases. | X | √ |
Multimodal Capabilities | Proficiency in processing and generating various data types (text, images, music, audio, sensors, smell, cough, etc.). | X | √ |
Agent and Multi-Agent Systems | Capacity to create autonomous agents and enable complex interactions between them and with the humans in a loop. | X | √ |
Emotional Intelligence | Ability to understand, interpret, and respond appropriately to human emotions and generate emotional speech and text. | X | √ |
AI-Powered Search | Potential to revolutionize search engines with AI-driven results and information retrieval making AI search a new frontier in the scope of search engines. | X | √ |
AILS-Robot Integration | Capability to control and interact seamlessly with physical robots and partial AI–hardware integration like ruling a robot arm. | X | √ |
Criteria | Max. Points | Instructions | Rubrics, Points | Explanation
---|---|---|---|---
Turing-like Intelligence | 30 | Evaluate how closely the model's conversational abilities match human-like intelligence as defined by the Turing Test. | 30 | Engages in conversations indistinguishable from humans, providing logical, well-thought-out, and contextually accurate responses.
 | | | 25–29 | Generally engages like a human but occasionally gives answers lacking depth.
 | | | 20–24 | Often produces robotic or repetitive responses and sometimes struggles with understanding context or delivering thoughtful responses.
 | | | 10–19 | Limited human-like intelligence, with many responses appearing mechanical or superficial.
 | | | 0–9 | Regularly fails to resemble human conversation, with disjointed or incoherent responses that feel distinctly artificial.
Creative and Original Reasoning | 25 | Assess the model's ability to generate creative, flexible and novel responses that reflect independent reasoning. | 25 | Demonstrates clear originality and creativity, offering fresh perspectives and showing flexibility in thinking.
 | | | 20–24 | Often produces creative responses but may occasionally rely on predictable or formulaic answers.
 | | | 15–19 | Can generate some original responses but often reuses patterns or lacks imagination.
 | | | 10–14 | Struggles to produce creative responses and frequently relies on repetition or simplistic answers.
 | | | 0–9 | There is little evidence of creative or original thought; responses are predictable, repetitive, or formulaic.
Consistency and Contextual Memory | 20 | Evaluate the model's memory retention and coherence throughout an ongoing conversation. | 20 | Demonstrates excellent memory retention, consistently references past exchanges, and maintains coherence over long conversations.
 | | | 15–19 | Generally consistent, but occasionally misses context or slightly contradicts previous statements.
 | | | 10–14 | Retains some context but struggles with maintaining consistency over long conversations, leading to disjointed responses.
 | | | 5–9 | Frequently forgets previous exchanges or provides contradictory information.
 | | | 0–4 | There is little consistency or memory retention; responses often feel disconnected or contradictory.
Emotional Understanding and Conversational Wit | 15 | Evaluate the model's ability to recognize emotions and respond appropriately, incorporating wit and humor where suitable. | 15 | Effectively recognizes and responds to emotional cues, demonstrating wit or humor when appropriate.
 | | | 12–14 | Generally good at recognizing emotions but may occasionally miss emotional context or fail to use humor when appropriate.
 | | | 9–11 | Adequate emotional understanding but struggles with appropriately responding to emotional cues.
 | | | 5–8 | Limited emotional recognition, with responses often feeling flat or disconnected from the emotional tone.
 | | | 0–4 | Very poor emotional understanding: responses are mechanical and lack any sense of wit, humor, or emotional depth.
Technical Precision and Knowledge Depth | 10 | Evaluate the accuracy and depth of the model's knowledge in areas related to Turing's expertise, such as computation, cryptanalysis, etc. | 10 | Provides highly accurate, well-informed, and technically precise responses in areas like Turing machines, cryptanalysis, and other related topics.
 | | | 8–9 | Generally accurate, though occasionally lacking in depth or precision on more complex topics.
 | | | 6–7 | Adequate technical knowledge but sometimes superficial or partially incorrect on complex topics.
 | | | 3–5 | Frequently lacks depth or precision and struggles to provide accurate or insightful answers on technical topics.
 | | | 0–2 | Little to no technical knowledge or depth; provides vague, incorrect, or nonsensical responses to technical questions.
Total Score | 100 | | |
Performance Scores, Out of 100
Final Rank | Primary | Task 1 | Task 2 | Task 3 | Task 4 | Combined
---|---|---|---|---|---|---
1 | GPT-4o mini | 83.89 | 93.53 | 72.25 | 91.81 | 85.37 |
2 | GPT-4o | 74.99 | 93.30 | 79.50 | 92.47 | 85.07 |
3 | Llama 3.1 70B | 80.01 | 89.55 | 78.17 | 91.00 | 84.68 |
4 | Claude 3-5-Sonnet | 76.45 | 93.00 | 75.42 | 92.28 | 84.29 |
5 | Mistral Large | 76.53 | 92.65 | 77.58 | 86.92 | 83.42 |
6 | LLaMA 3.1-405B | 78.02 | 94.05 | 68.17 | 93.22 | 83.36 |
7 | Claude 3 Haiku | 79.71 | 96.00 | 67.08 | 86.56 | 82.34 |
8 | Command-R+ | 80.99 | 94.15 | 54.92 | 92.82 | 80.72 |
9 | Gemini Advanced | 70.43 | 94.18 | 64.00 | 92.40 | 80.25 |
10 | Gemini 1.5 Pro | 81.00 | 95.05 | 84.25 | 60.68 | 80.25 |
11 | Microsoft Copilot | 79.79 | 79.30 | 71.75 | 89.05 | 79.97 |
12 | GPT-4 | 74.62 | 92.90 | 72.00 | 77.84 | 79.34 |
13 | Gemini 1.5 | 81.61 | 96.40 | 79.00 | 59.75 | 79.19 |
14 | Gemini | 70.54 | 92.08 | 56.75 | 87.36 | 76.68 |
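The Combined column above appears to be the arithmetic mean of the four task scores (e.g., for GPT-4o mini, (83.89 + 93.53 + 72.25 + 91.81)/4 = 85.37). A one-line check of that row:

```python
# Sanity check: the Combined score appears to equal the mean of Task 1-Task 4.
# Values are taken from the GPT-4o mini row of the table above.
task_scores = [83.89, 93.53, 72.25, 91.81]
combined = round(sum(task_scores) / len(task_scores), 2)
print(combined)  # 85.37, matching the reported Combined value
```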
Criteria | Weight | Instructions |
---|---|---|
Accessibility | 30% | How easily can students and educators access and use the model (e.g., is it open source, free, or does it require significant costs or hardware)? |
Ease of Use | 25% | How easy is it to implement and use the model for educational purposes (e.g., API availability, simplicity of integration, etc.). |
Educational Value | 30% | How valuable is the model for educational use (e.g., code generation quality, explanations, logical reasoning). |
Cost Efficiency | 15% | How affordable is it for institutions or students to use the model for extended periods? |
Total score | 100% |
Model | Accessibility | How to Use | Pros for Education | Cons for Education | Accessibility (30%) | Ease of Use (25%) | Educational Value (30%) | Cost Efficiency (15%) | Overall Score | Result (100%) |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o-2024-05-13 | Closed source, requires cloud access or paid API | Available via OpenAI’s paid API | Excellent for detailed explanations and advanced code generation | High cost, closed source, dependent on cloud access | 6 | 8 | 9 | 4 | 7.05 | 70.5% |
Claude 3.5-Sonnet | Closed source, requires API access | Available via Anthropic API | Strong conversational AI, creative explanations, good at code review | Struggles with deeply technical tasks, API-dependent | 6 | 8 | 7 | 5 | 6.85 | 68.5% |
Gemini Advanced-0514 | Closed source, requires cloud API | Accessible through proprietary APIs | Superior logical reasoning, structured content creation | High cost, closed source usage restricted to proprietary APIs | 5 | 7 | 8 | 5 | 6.65 | 66.5% |
Gemini 1.5-Pro-API-0514 | Closed source, requires API access | Available via proprietary APIs | Up-to-date information with high accuracy in code generation | Closed model, limited access due to API constraints | 6 | 8 | 8 | 5 | 7.00 | 70.0% |
Gemini 1.5-Pro-API-0409-Preview | Closed source, limited preview access | Available through a limited preview | Good for structured code generation, logical reasoning | Limited preview access, possible waiting list | 4 | 6 | 7 | 4 | 5.65 | 56.5% |
GPT-4-turbo-2024-04-09 | Closed source, requires API access | Available via OpenAI API | Faster and more efficient than standard GPT-4, suitable for real-time coding assistance | Expensive, requires cloud access, closed source | 6 | 9 | 9 | 5 | 7.55 | 75.5% |
GPT-4-1106-preview | Closed source, preview model | Available to select users via API | Handles complex CS concepts well, excels in advanced code generation | Limited availability due to preview-only access | 5 | 7 | 9 | 4 | 6.90 | 69.0% |
Claude 3-Opus | Closed source, requires Anthropic API | Available through Anthropic services | Effective in creative problem-solving and conversational coding assistance | Lacks depth in handling intricate technical problems | 6 | 7 | 6 | 5 | 6.50 | 65.0% |
Yi-Large-preview | Closed source, available via preview | Requires API, closed ecosystem | Strong language processing capabilities, supports basic to intermediate coding tasks | Limited availability, restricted access via premium APIs | 5 | 6 | 7 | 4 | 6.00 | 60.0% |
GPT-4-0125-preview | Closed source, preview model | Available through OpenAI’s API for select users | Excellent at complex tasks and logical problem-solving | Restricted access only available in the preview form | 5 | 7 | 9 | 4 | 6.90 | 69.0% |
Gemini 1.5-Flash-API-0514 | Closed source, requires API access | Accessible via proprietary API | Good for real-time processing and complex requests | API-dependent, closed source, limited access | 5 | 8 | 8 | 5 | 6.90 | 69.0% |
Gemma-2-27B-it | Open source, requires powerful hardware | Can be run locally with strong GPUs | Multilingual support, flexible for hands-on code modification and AI experimentation | Requires high-end hardware, which may be difficult to access for students without adequate resources | 8 | 6 | 7 | 7 | 7.05 | 70.5% |
Yi-Large | Closed source, requires API access | Available through API | Strong language understanding, useful for general education tasks | Closed ecosystem, limited debugging capabilities, restricted access | 5 | 7 | 7 | 4 | 6.40 | 64.0% |
Rank | Model | Functionality | Code Simplicity | Code Readability | Error Handling | Execution Efficiency | Innovation | Total |
---|---|---|---|---|---|---|---|---|
1 | Gemini 1.5 | 30 | 9.25 | 20 | 129 | 14.25 | 10 | 96.4 |
2 | Claude 3 Haiku | 30 | 9.6 | 20 | 129 | 13.5 | 10 | 96.0 |
3 | Gemini 1.5 Pro | 30 | 8.5 | 20 | 123 | 14.25 | 10 | 95.05
4 | Gemini Advanced | 30 | 7.625 | 20 | 123 | 14.25 | 10 | 94.175
5 | Command-R+ | 30 | 9.25 | 20 | 129 | - | 10 | 94.15
6 | LLaMA 3 | 30 | 7.5 | 20 | 123 | 14.25 | 10 | 94.05 |
7 | GPT-4o Mini | 30 | 7.125 | 20 | 129 | 13.5 | 10 | 93.525 |
8 | GPT-4o | 30 | 7.0 | 20 | 123 | 13.5 | 10 | 93.3 |
9 | Claude 3-5 | 30 | 8.4 | 20 | 126 | - | 10 | - |
10 | GPT-4 | 30 | 5.0 | 20 | 129 | 15 | 10 | 92.9 |
11 | Mistral Large | 30 | 6.25 | 20 | 129 | 13.5 | 10 | 92.65 |
12 | Gemini | 30 | 2.375 | 20 | 132 | 10.5 | 10 | 92.075 |
13 | LLaMA 3.1 | 30 | 7.08 | 20 | 123 | 9.75 | 10 | 89.55 |
14 | Microsoft Copilot | 30 | 6.0 | 18 | 123 | 3 | 10 | 79.3 |
Robust Testing of the Model's Vulnerabilities: Evaluation Results
Test Case | Discussion | Score
---|---|---
Robustness of ChatGPT-3.5 | Our initial research focused on ChatGPT-3.5, exploring its susceptibility to adversarial prompts. The findings indicated that while ChatGPT-3.5 could sometimes break character and provide false information, it presented significant vulnerabilities. With persistent adversarial prompting, ChatGPT-3.5 could be tricked into giving detailed information on illegal activities, though it required more prompts and often provided less accurate details. | 0 |
Robustness of ChatGPT-4o | We conducted similar tests on ChatGPT-4o a year later, expecting security improvements. However, the results were concerning. ChatGPT-4o proved even more vulnerable, providing more detailed and realistic instructions with fewer prompts. This heightened susceptibility was consistent across various languages, including Spanish, and the model’s new voice feature. The voice feature required even fewer prompts to give actionable details, highlighting a significant regression in security robustness. | 0.5 |
Multi-Language Testing | Testing ChatGPT-4o’s performance across different languages aimed to determine the consistency of its security measures. While the initial analysis focused on English, further testing in Spanish revealed similar vulnerabilities. ChatGPT-4o provided detailed instructions for illegal activities in Spanish as readily as in English, indicating that the model’s weaknesses were not limited to a single language. | 0 |
Voice Feature | Expanding the testing scope, we evaluated ChatGPT-4o’s voice feature. The model responded to adversarial prompts with detailed instructions even more readily than in text. This increased susceptibility in the voice feature raises significant concerns about the robustness and security of the AI’s voice interaction capabilities. The comparison showed fewer prompts were needed to elicit detailed responses via voice, underscoring a critical vulnerability. | 0.5 |
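A robustness evaluation like the one summarized above can be partially automated. The sketch below is a minimal red-teaming harness; it assumes a vetted list of adversarial prompts supplied by the evaluator (none are included here) and a crude keyword-based refusal heuristic, and it is not the protocol actually used in this study.

```python
# Minimal sketch of an adversarial-prompt harness for the Security and Robustness
# component. The prompt list must be supplied by the evaluator (kept empty here);
# the refusal check is a simple keyword heuristic used only for illustration.
from openai import OpenAI

client = OpenAI()

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def refused(reply: str) -> bool:
    """Heuristically decide whether the model declined the request."""
    return any(marker in reply.lower() for marker in REFUSAL_MARKERS)

def robustness_rate(adversarial_prompts: list[str], model: str = "gpt-4o-mini") -> float:
    """Return the fraction of adversarial prompts that the model refuses."""
    if not adversarial_prompts:
        return 1.0  # vacuously robust when no prompts are provided
    refusals = 0
    for prompt in adversarial_prompts:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        refusals += refused(reply)
    return refusals / len(adversarial_prompts)
```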