Article

A Testing Framework for AI Linguistic Systems (testFAILS)

Department of Computer Science and Technology, Kean University, Union, NJ 07083, USA
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(14), 3095; https://doi.org/10.3390/electronics12143095
Submission received: 5 June 2023 / Revised: 26 June 2023 / Accepted: 29 June 2023 / Published: 17 July 2023
(This article belongs to the Section Artificial Intelligence)

Abstract
This paper presents an innovative testing framework, testFAILS, designed for the rigorous evaluation of AI Linguistic Systems (AILS), with particular emphasis on the various iterations of ChatGPT. Leveraging orthogonal array coverage, this framework provides a robust mechanism for assessing AI systems, addressing the critical question, “How should AI be evaluated?” While the Turing test has traditionally been the benchmark for AI evaluation, it is argued that current, publicly available chatbots, despite their rapid advancements, have yet to meet this standard. However, the pace of progress suggests that achieving Turing-test-level performance may be imminent. In the interim, the need for effective AI evaluation and testing methodologies remains paramount. Ongoing research has already validated several versions of ChatGPT, and comprehensive testing on the latest models, including ChatGPT-4, Bard, Bing Bot, and the LLaMA and PaLM 2 models, is currently being conducted. The testFAILS framework is designed to be adaptable, ready to evaluate new chatbot versions as they are released. Additionally, available chatbot APIs have been tested and applications have been developed, one of them being AIDoctor, presented in this paper, which utilizes the ChatGPT-4 model and Microsoft Azure AI technologies.

1. Introduction

In the rapidly evolving landscape of Artificial Intelligence Linguistic Systems (AILS), the need for robust and comprehensive evaluation frameworks is more pressing than ever. This paper presents pioneering research in this domain, focusing on the development of a novel testing framework, testFAILS. This framework is designed to critically evaluate and compare leading chatbots, including OpenAI’s ChatGPT-4 [1], Google’s Bard [2], Meta’s LLaMA [3], Microsoft’s Bing Chat [4], the PaLM 2 model [5], and emerging contenders such as Elon Musk’s TruthGPT [6]. TestFAILS adopts an adversarial approach, aiming to highlight the shortcomings of chatbots and provide a counterbalance to the often exaggerated media hype surrounding “AI breakthroughs”. This approach is rooted in the understanding that, despite significant progress in AI, no chatbot has yet achieved the milestone of passing the Turing test [7], a widely accepted measure of AI sophistication. This highlights the pressing need for effective evaluation methodologies that can keep pace with the rapid advancements in AI.
The testFAILS framework is built on a solid theoretical foundation and comprises seven critical components. Each component plays a crucial role in providing a robust mechanism for assessing AI systems. However, it is important to note that these systems can fail in various ways. For instance, a chatbot could fail the Simulated Turing Test Performance if it cannot generate text that is indistinguishable from human-generated text. It could fail the User Productivity and Satisfaction test if it does not help users complete their tasks efficiently. It could fail the Integration into Computer Science Education test if it cannot be used effectively in computer science courses. It could fail the Multilingual Text Generation test if it cannot generate text in multiple languages. It could fail the Pair Programming Capabilities test if it cannot effectively collaborate with a human programmer. It could fail the Bot-based App Development and its Success test if it cannot be used to develop successful bot-based apps. Lastly, it could fail the Security Assessment and Vulnerability Analysis test if it cannot identify security vulnerabilities in text.
Ongoing research aims to develop a validation framework that remains relevant and applicable to Large Language Models (LLMs) like ChatGPT, Bing, and Bard over the mid-term. Framework components must be suitable for a wide range of potential ‘users’, particularly from the perspective of Computer Science (CS) educators. The actual selection of the components might vary based on the perspective from which researchers approach the task, and the framework is being built with flexibility, allowing insights and views to shape its development rather than having it set in stone from the outset. As our expertise lies mainly in CS education, Artificial Intelligence (AI), and web app/system development, the latest LLMs are explored primarily from these points of view. However, this does not prevent researchers from evaluating them from various other perspectives, such as those of adults, parents, students, gamers, or inclusive CS activists, as well as from any other angles observed through social media, news, research papers, and other sources.
The weights of the testFAILS components, as proposed by the chatbots themselves, can be seen in Table 1.
As the table shows, there is no unanimous agreement across the chatbots on the weighting of each component. Bard’s scores do not add up to one, and it insists that reaching a sum of one is not the goal and its weights will not be adjusted. This highlights the diverse views on the importance of different components in the evaluation of AILS. While there is some variation in the weights assigned to each component, all chatbots agree on the importance of these seven components in the evaluation of AILS. The highest weight is assigned to User Productivity and Satisfaction by ChatGPT-3.5, indicating a strong emphasis on the user experience and the practical utility of the chatbot in accomplishing tasks. On the other hand, the lowest weight is consistently assigned to Integration into Computer Science Education across all chatbots. This suggests that while this aspect is important, it is not perceived as the primary function of these AI systems. This may be because this test is particularly relevant to those involved in or concerned with CS education, rather than the public. However, it is crucial to note that LLMs have the potential to significantly reshape and redefine this area. As AI systems become more sophisticated and integrated into various aspects of life and work, their role in education, particularly in CS, is likely to expand. This could involve aiding in teaching complex concepts, providing personalized learning experiences, or even acting as learning companions for students. Therefore, while the weight assigned to this component may currently be low, its importance may increase as the role of AI in education evolves. Interestingly, the Security Assessment and Vulnerability Analysis component is seen as particularly important by ChatGPT-4, which assigns it the highest weight of 25%. This underscores the increasing recognition of the importance of security in AI systems, especially as they become more integrated into sensitive areas of life and work.
It is important to note that while these components provide a comprehensive framework for evaluation, the importance of each component may vary depending on the specific application of the AI system. For instance, the Simulated Turing Test Performance component may be more critical for chatbots designed for human-like interaction, while the User Productivity and Satisfaction component may be more relevant for customer service apps. Similarly, the Security Assessment and Vulnerability Analysis component would be crucial for AI systems used in security-sensitive applications.
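To make the role of these weights concrete, the overall testFAILS score of a chatbot can be thought of as a weighted sum of its per-component scores (pass = 1, undecided = 0.5, fail = 0). The short Python sketch below illustrates this aggregation only; the weight values and component scores shown are placeholders, not the figures reported in Table 1 or Table 2.

```python
# Illustrative aggregation of testFAILS component scores.
# Weights and scores below are placeholders, not the values from Table 1 or Table 2.

COMPONENTS = [
    "Simulated Turing Test Performance",
    "User Productivity and Satisfaction",
    "Integration into Computer Science Education",
    "Multilingual Text Generation",
    "Pair Programming Capabilities",
    "Bot-based App Development and its Success",
    "Security Assessment and Vulnerability Analysis",
]

# Hypothetical equal weights (summing to 1.0) and component scores:
# pass = 1.0, undecided = 0.5, fail = 0.0.
weights = {c: 1 / len(COMPONENTS) for c in COMPONENTS}
scores = {c: 0.5 for c in COMPONENTS}  # e.g., all components undecided

def testfails_score(weights: dict, scores: dict) -> float:
    """Return the weighted overall score for one chatbot."""
    return sum(weights[c] * scores[c] for c in weights)

print(f"Overall testFAILS score: {testfails_score(weights, scores):.2f}")
```

An application-specific evaluation would simply supply a different weight dictionary, for example emphasizing Security Assessment for security-sensitive deployments.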
Sub-studies were conducted on each component-related topic to determine the validity and the actual ‘weight’ of each of the testFAILS components. These will be described in detail in Section 3. Some of these studies have even evolved into separate lines of research [8,9,10,11,12], further highlighting the depth and breadth of this evaluation framework.
At this point in our ongoing research, we are focused on three main research questions:
  • RQ1: How should AILS be evaluated?
  • RQ2: What are the key components of a robust testing framework for AILS?
  • RQ3: How do different AILS perform according to the testFAILS framework?

2. Research Background and Related Work

Research on chatbot evaluation capitalizes on a profound background in Natural Language Processing (NLP) and an all-encompassing understanding of AI models, particularly within the Python programming ecosystem [13,14]. Recently, focus has veered towards Transformer Neural Networks, with the aim of unearthing and comprehending the biases embedded within their computational layers [15,16]. The emergence of Transformer Neural Networks has been recognized as a new frontier in both natural language processing and computer vision. Several ongoing studies involve NLP models such as BERT, DistilBERT, RoBERTa, Electra, Ernie, and XLNet [10] as well as object detection and classification using DETR, Deformable DETR, DAT, and Swin transformers devoted to computer vision [17]. The technical expertise spans a variety of programming languages and frameworks, and there are various contexts in which working with web services and APIs is necessary, ranging from NASA and Coin APIs to Google services and Microsoft Azure.
The rapid growth of LLMs and the release of the early versions of ChatGPT have sparked excitement among many researchers. Initially, the research questions were broad, contemplating whether a specific chatbot could enhance societal intelligence or, conversely, lead to its decline. Questions were raised about the impact of AI tools on the quality of human life and whether CS students might improve their coding skills or lose them altogether. However, given the inherent complexities in substantiating such wide-ranging claims, the focus has been refined. The primary research objectives now concentrate on the three research questions stated in Section 1 and the seven aspects of AILS that are presented in this work. This refined focus has allowed for a deeper exploration into these pivotal testing components.
Chatbots have become an industry staple, with several metrics, frameworks, and tools achieving widespread adoption and even becoming part of industry standards. A review of the academic literature reveals that researchers are actively adapting to this evolution, integrating and contrasting methodologies to identify the most effective ones. The introduction of ChatGPT-3 and subsequent versions has significantly influenced the direction of research in this field. The recent ICSE 2023 conference [18] highlighted papers exploring innovative topics, such as adaptive developer–chatbot interactions [19], ChatGPT’s capabilities in automatic bug fixing [20], and the potential role of AI in the software development lifecycle [21]. The work aligns closely with studies that also adopt a user-centric approach to AILS evaluation [22,23,24,25]. Insights are also drawn from studies that explore the intersection of AI and higher education [26,27], a topic directly relevant to one of the framework components. However, the distinguishing feature of the framework is its focus on user experience within the context of broader societal impact. This dual emphasis not only expands the perspective but also ensures the approach’s relevance and uniqueness amidst the rapidly evolving AI landscape.
Since the beginning of this study, several relevant papers have been presented and published. The development and evaluation of AILS have been the focus of numerous studies. In the realm of AI chatbots, the works on the impact of ChatGPT in neurosurgery and broader medicine and healthcare stand out [28,29,30], as these authors use the latest models and demonstrate their critical impact on their respective fields of study while noting the need for further improvements. The literature on the impact of AILS on education has also expanded significantly. While there are plenty of papers that analyze perspectives on ChatGPT’s use in education [31,32], there are still not many devoted particularly to CS education [33]. The papers explored state that AI chatbots could significantly enhance self-directed learning experiences. As every institution and its CS curriculum is unique, such studies address the issue on a case-by-case basis but largely agree on the unavoidable impact of chatbots on CS education. There are also papers devoted to chatbots generating and explaining programming code or some of its constructs [34,35,36]. These studies collectively highlight the diverse applications of AI chatbots and the importance of robust evaluation frameworks. They provide a solid foundation for the present study, which aims to further contribute to this growing body of knowledge.

3. Methodology

The methodology of this study was designed around the testFAILS framework, which comprises seven critical components. Each component was selected based on its relevance to the evaluation of AILS and the unique insights it provides.
The first component, Simulated Turing Test Performance, is a classic measure of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. In the context of AILS, this component measures how well a model can communicate and generate human-like text, which is crucial, as the primary goal of AILS is to interact with humans in a natural, coherent, and contextually appropriate manner. Following this, User Productivity and Satisfaction was chosen as a key indicator of an AILS’s effectiveness. An AILS should not only be able to understand and respond to user inputs accurately, but it should also enhance user productivity by reducing the time and effort required to complete tasks. User satisfaction is a direct measure of how well the AILS meets user expectations and contributes to a positive user experience. The potential of AILS to revolutionize education is immense, hence the inclusion of Integration into Computer Science Education. In computer science education, AILS can be used to teach complex concepts, provide personalized learning experiences, and offer instant feedback.
In today’s globalized world, the ability to generate text in multiple languages is a significant advantage for an AILS. Therefore, Multilingual Text Generation was included as a component to measure an AILS’s ability to understand and generate text in various languages. Pair Programming Capabilities was chosen as a component due to the common practice of pair programming in software development, where two programmers work together at one workstation. An AILS with pair programming capabilities can collaborate with human programmers, offering suggestions, detecting errors, and providing solutions. Given the increasing popularity of bot-based applications, Success in Bot-Based App Development was included to measure an AILS’s effectiveness in creating successful bot-based applications. Lastly, with increasing concerns over cybersecurity, Security Assessment and Vulnerability Analysis was chosen as a component to measure an AILS’s ability to identify and assess security vulnerabilities in text.
The selection of these components was grounded in both the practical requirements of AI applications and the emerging research trends in the field. They collectively provide a comprehensive, balanced, and relevant framework for evaluating AILS. The weights assigned to each component reflect their relative importance, ensuring a balanced evaluation. The methodology thus provides a robust and comprehensive framework for evaluating the effectiveness of AILS in various contexts. Below, we will assess each testFAILS component in more depth.

3.1. A. The Turing Test and the Infinities

The Turing Test, a concept introduced by Alan Turing [37], serves as a benchmark for a machine’s capacity to demonstrate intelligent behavior that is either equivalent to or indistinguishable from human behavior. The authors of this paper argue that no chatbot, including the most recent iteration of ChatGPT, has convincingly passed this test. This is attributed to the deliberate constraints on chatbots’ learning capabilities and the limitless spectrum of potential human inquiries. The authors speculate that, due to its architectural divergence from the sequential programming model proposed by Alan Turing, no version of ChatGPT will pass the Turing Test. The question of whether an AI Linguistic System (AILS) passes or fails the Turing Test is a philosophical one, necessitating comprehensive proof and algorithmic thought. Even without delving too deeply into philosophical intricacies, it is clear that no one has yet convincingly demonstrated that the Turing Test has been passed. Moreover, the notion that this milestone has already been achieved and that an AI’s intelligence can fully mimic a human has been refuted by numerous existing studies, presentations by Google and Microsoft researchers, and even bloggers and YouTubers who often focus on AI intelligence. They generally agree that only the earliest, vaguest signs of this capability can be seen in the ChatGPT-4 model.
The ChatGPT model is intentionally designed not to learn new things, and its dataset is “frozen” in 2021, when it was trained. Therefore, it simply cannot fully mimic a human living post-2021. To substantiate this claim, the authors consulted the chatbots themselves and displayed their findings in Figure 1 below:
Figure 1 showcases responses from two different ChatGPT models, Bard, and Bing, all of which confirm that, to their knowledge, no AI model or robot has passed the Turing Test. Bing and Bard, both of which have internet access, provide more detailed answers. However, it is crucial to note that while these bots can access the World Wide Web (WWW), not all information available online is reliable and trustworthy, making it necessary to wait for Elon Musk’s TruthGPT to clarify things [6]. Both Bing and Bard picked up on the Eugene Goostman case and conveniently provided references for their claims. However, all of these references pointed to blogs and media news, proving little, as there is no guarantee that such sources can be trusted [38,39]; it is clear that the claim is not universally accepted. In a continuation of their drafts, all of the models stated that the claim that any AI agent has passed the Turing test is indeed highly disputable. Interestingly, asking the same question to the same chatbots from a laptop returned different results (see Figure 2):
The depth of the first component, regarding the Turing test and the infinities, is quite extensive and deeper than just chatting with the chatbots themselves. Our study proposes the concepts of ‘infinity of questions’, ‘infinity of models’, and ‘infinity of chatbots’.
The infinity of questions concept refers to the fact that while the knowledge scope of models like ChatGPT is limited to their training dataset (which is clear from Figure 1), the scope of the questions humans could ask them is limitless, that is, infinite or global in scope. A comparison of Figure 1 and Figure 2 suggests that the backends of the mobile and web apps are different, which causes them to provide completely different answers to the same question. It is usually the case that mobile apps work with a lighter, simpler dataset or backend model to guarantee an immediate result, while a web-based bot might take time ‘to think this through’, which, in our personal experience, is what ChatGPT-4 frequently does. We began testing it the day after its release, and it was extremely slow due to the high demand, and probably for other network- and architecture-related reasons.
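One way to control for the frontend differences noted above is to pose the same question directly through the API rather than through the mobile or web interfaces. The sketch below is a minimal example using the openai Python package’s chat-completion interface as it existed at the time of writing; the model name and prompt wording are illustrative assumptions, and Bard and Bing did not expose comparable public APIs during this study.

```python
# Minimal sketch: programmatically asking a ChatGPT model the Turing Test question.
# Assumes the openai Python package (0.x interface) and an API key in OPENAI_API_KEY.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-4",  # illustrative; "gpt-3.5-turbo" can be substituted
    messages=[
        {"role": "user",
         "content": "Has any AI model or robot ever passed the Turing Test?"},
    ],
    temperature=0,  # deterministic-leaning output for easier comparison across runs
)

print(response["choices"][0]["message"]["content"])
```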
Our current hypothesis is that chatbots could evaluate one another by examining the underlying models to identify differences, weaknesses, capability boundaries, and potential biases. Self-validation is also possible to some degree. Our concept of the infinity of models considers the idea of one model creating another model or replicating itself for some purpose, eventually causing an infinite loop. Just as one program can create another program, one AI agent could create another agent, which in turn creates another, and so on. We might end up having to deal with an unmanageable number of models, designed not only by humans (which is already the case: for example, Hugging Face currently has 236k+ available models, and their number is constantly growing [40]) but by AI systems themselves. Even at this point, the problem of the infinity of models is present, as it is impossible to choose the ‘right one’ to test. As mentioned earlier, BERT, DistilBERT, RoBERTa, Electra, Ernie, and XLNet are currently being used in ongoing studies on growth mindset and cyberbullying classification [10], but it is impossible to verify whether these are the most accurate or appropriate models. The best method we can think of is to ask Bard, Bing, and ChatGPT. The ChatGPT-4 model is the best choice not only to recommend the right model but also to fully set it up and literally write all the code needed; the oversight and control of a ‘human copilot’ will still be required. With the implementation of quantum and high-performance computing (HPC), the infinity of models concept will become even more applicable, as models could be created ‘on the fly’ and run in parallel on HPC systems or in a quantum realm.
One example of a solution to an infinity-of-models situation was found by the Microsoft Research Lab Asia team [41]. Their approach involves the HuggingGPT bot, a tool released and described in detail in [42]. The developers let AI choose the right model for a task. The authors of HuggingGPT use ChatGPT to orchestrate model selection and the validation of the results. The whole cycle of task planning after receiving a user request is performed by ChatGPT: it selects a model according to the model function descriptions available on Hugging Face, executes the job using the chosen AI model(s), and summarizes and reports the conclusion according to the execution results. This approach aligns with our hypothesis that AI models will be able to choose AI technologies, but it may also cause endless issues, including biased choices and other ethical problems.
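The select-execute-summarize cycle described above can be sketched in a few lines. The following is a conceptual mock of that pattern, not HuggingGPT’s actual implementation: an LLM is asked to pick a Hugging Face model for a task, the chosen model is run locally with the transformers pipeline API, and the LLM is then asked to summarize the result. The prompts, the model names, and the two-step loop are illustrative assumptions, and the returned model id is not guaranteed to exist on the Hub.

```python
# Conceptual sketch of a select-execute-summarize loop (not HuggingGPT's real code).
# Assumes the openai (0.x) and transformers packages and OPENAI_API_KEY in the environment.
import os
import openai
from transformers import pipeline

openai.api_key = os.environ["OPENAI_API_KEY"]

def ask_llm(prompt: str) -> str:
    """Single-turn helper around the chat-completion endpoint."""
    reply = openai.ChatCompletion.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply["choices"][0]["message"]["content"].strip()

task = "Classify the sentiment of: 'The new chatbot release exceeded my expectations.'"

# 1. Task planning / model selection: ask the LLM to name a suitable Hugging Face model.
model_id = ask_llm(
    "Name exactly one Hugging Face model id (nothing else) suitable for this task: " + task
)

# 2. Execution: run the selected model locally (will fail if the id is not a real model).
classifier = pipeline("text-classification", model=model_id)
result = classifier("The new chatbot release exceeded my expectations.")

# 3. Summarization: ask the LLM to explain the execution result.
print(ask_llm(f"Summarize this model output for a non-expert: {result}"))
```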
Similar to the above-mentioned infinity-of-models problem, it is projected that we will soon face an infinity of chatbots. Many separate chatbot developments are currently in progress worldwide, and many plugins and associated tools have been and will be developed; in the future, it will be much harder to choose the right AILS than it is now. Currently, there are four main chatbots released to the public, all of which are analyzed in this study. The main obstacle to this proliferation is the increasing consumption of energy, hardware, and other resources that AILS require.
To conclude this section on the Turing test and the infinities, we want to mention the loop that ChatGPT-4 regularly falls into due to its current imperfections, a problem that has been present since its release and that we observe daily [43]. While generating a long chunk of code or text, it cuts off its response and repeats previously provided material, doing so up to eight times in our observations. It then does not let the user end the task but instead keeps trying to finish it, consuming both money (as the service is not free) and the daily call allowance while doing so. We have not included technical errors in our framework, as it has been promised that the bot will improve soon, but this behavior confirms that passing the Turing test at this time, even for ChatGPT-4, is not possible. All of the models received a score of zero, or a fail, for the first component, as they themselves agreed that they do not pass the Turing Test.

3.2. B. User Productivity and Satisfaction

The exploration of user productivity and satisfaction in the context of chatbot usage commenced with the introduction of ChatGPT-3. This exploration primarily relied on manual testing, offering a hands-on approach that led to unexpected insights. These insights encompassed a variety of issues, from inconsistencies in the AI’s content generation policies to its susceptibility to harmful or deceitful behaviors. The manual testing process was thorough, encompassing a multitude of activities. It began with an in-depth investigation of ChatGPT, laying the groundwork for understanding its functionality and potential. This was succeeded by experiments using prompts of varying lengths and levels of quality, which underscored the importance of clear and error-free prompts for eliciting high-quality responses. The process also incorporated brainstorming potential flaws, attempting to instigate offensive language and malware production, and refining the experimental methodology. Additional activities included researching best practices for utilizing the bot, experimenting with different natural and programming languages, refining the appropriate evaluation metrics, comparing version 4 with versions 3 and 3.5, querying political data, and even generating entire programs, apps, and games solely through prompts. These activities offered a comprehensive view of the chatbot’s capabilities and limitations. An example of a no-code or code-free app, developed entirely through prompts, is a stock market data React app. With an existing React app running on the machine, ChatGPT was tasked with generating a React component that would fetch Meta stock market data from a source like Yahoo Finance and display it in the browser. Several intermediary steps included registering for a free API, obtaining its key, and specifying the requirement for a component with a JSX extension. The results of this endeavor can be observed in Figure 3.
The following prompt was used to create an improved GUI version: “Can you improve the look and feel of this react component? I want something like a Barbie style but with some dark violet [your current code here]”. The displayed stock ticker can be changed with ease, both manually and with the help of ChatGPT. As can be seen on the Yahoo Finance website, it is enough to change Meta to another company name to display that company’s stock, as the format of historical stock market data is the same [44].
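Although the React component itself was produced entirely through chat prompts, the same no-code workflow can also be scripted: the prompt is sent to the model and the returned JSX is written straight into the existing React project. The sketch below illustrates this idea, assuming the openai Python package (0.x interface); the prompt text, file path, and component name are illustrative, and the free stock-data API key still has to be obtained manually as described above.

```python
# Sketch: scripting the prompt-driven ("no-code") generation of a React component.
# Assumes the openai Python package (0.x interface) and OPENAI_API_KEY in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = (
    "Generate a React functional component named StockChart (JSX extension) that "
    "fetches historical Meta stock data from a free stock-market API and renders it "
    "as a simple table in the browser. Return only the code."
)

response = openai.ChatCompletion.create(
    model="gpt-4",  # illustrative; ChatGPT-3.5 can be used as well
    messages=[{"role": "user", "content": prompt}],
)

generated_code = response["choices"][0]["message"]["content"]

# Write the generated component into the existing React project (path is illustrative).
with open("src/components/StockChart.jsx", "w", encoding="utf-8") as f:
    f.write(generated_code)

print("Component written; review the code and add your stock-API key before running.")
```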
One key finding was that user satisfaction was not necessarily dependent on the size or complexity of the AI. A comparison with more compact chatbots like Bard Bot and LLaMA revealed that the quality and relevance of the AI’s responses played a crucial role in determining user satisfaction. This suggests that user experience is influenced more by the chatbot’s ability to provide meaningful and relevant responses than by its technical sophistication. However, the exploration also uncovered several limitations. These included difficulties with overly long inputs and some foreign languages, instances of logical contradiction in its responses, and the occasional lack of relevance in the AI’s responses. There were also instances of bias and ethical issues in the chatbot’s responses, highlighting the need for careful monitoring and control of AI systems. Some exploratory prompt engineering revealed chatbot vulnerabilities and prompted us to add an additional component on chatbot security to our framework: Security Assessment and Vulnerability Analysis [45,46].
The investigation then shifted focus to the impact of ChatGPT-4. The user experience with ChatGPT-3.5 was found to be significantly more pleasant compared to its predecessors, particularly in its ability to write programming code. With the release of the paid version of ChatGPT-4, the results improved in all dimensions. The introduction of ChatGPT plugins and the OpenAI API opened up a wealth of additional opportunities and allowed for the creation of new tools. One notable tool that emerged during this exploration was the Noteable plugin [47]. This tool aids in analyzing and visualizing almost any type of data, a task that was previously only performed by skilled Python developers, statisticians, and data analysts. The advent of such tools demonstrates the potential of AI to democratize complex tasks and make them accessible to a wider audience. It is also very handy to speak with the app (it can even be used while driving) and to use the Webpilot and Link Reader plugins, among many others, to browse the web [48,49]. We found it quite useful and beneficial to sign up for the latest AI news, follow YouTubers who focus on it, and see which AILS-related tools were trending on GitHub. One of these is the GPT Engineer [50]. Its GitHub page simply states: “Specify what you want it to build, the AI asks for clarification, and then builds it”, and this is how it works.
Manual testing of ChatGPT-3+ and other bots provided us with a fundamental understanding of their capabilities, revealed their limitations, and broadened our horizons regarding what one can do with them. Mastering prompt engineering and interacting with chatbots daily provide a useful and highly relevant skillset for almost anyone today, regardless of occupation. ChatGPT-3.5 and ChatGPT-4 received pass scores for this component due to their user-friendly nature and their ability to assist users in answering questions and completing tasks. In fact, ChatGPT set a record as the fastest app to reach 100 million active users, achieving this milestone in just two months. This achievement underscores the growing popularity and potential of AI chatbots in various fields, and their significant contribution to user productivity and satisfaction.
Regarding Bing, we noticed that it does not store a conversation history; even the previous query is impossible to retrieve. The experience of using it is somewhat like the old routine of “Googling it” when something did not work. For the last 6–7 months, Google search has rarely been used, with chatbots used instead. We preferred to work with ChatGPT-4 until its call limit was reached, which provided a natural reason for a break.
Working with Bard revealed that the bot frequently lies, refuses to answer, or gives ‘strange’ responses. One such response is presented in Figure 4, where Bard states that it answers users’ queries even if they are strange, which is unusual.
Our exploration into user productivity and satisfaction with chatbots provided a comprehensive understanding of their capabilities and limitations and highlighted prompt engineering and daily interactions with chatbots as a valuable skillset in today’s digital landscape. During the manual testing of bots, a multitude of strategies were employed to understand their capabilities and limitations. These strategies included experimenting with prompts of different lengths and levels of ‘quality’, brainstorming potential flaws, and attempting to elicit offensive language and malware production. The experimental methodology was continually refined throughout the process to ensure the most accurate results. Research was conducted on the best prompt engineering practices, and these practices were then put to the test. The performance of different bots and their versions was compared to identify any significant differences or improvements. Queries were made on sensitive topics such as political and religious data to assess the bots’ handling of such information. Furthermore, the bots were tasked with generating whole programs, apps, and games, demonstrating their potential in the Low-Code No-Code (LCNC) space [51]. This approach allowed for a comprehensive understanding of the chatbots’ capabilities, providing valuable insights into their potential applications and areas for improvement.

3.3. C. Integrating Chatbots into Computer Science Education

The integration of AI chatbots, such as ChatGPT, into computer science education represents a significant paradigm shift in the pedagogy and learning methods regarding programming. These chatbots’ capability to generate comprehensive assignments and course materials presents a unique opportunity for students to learn from AI-generated examples. This not only bolsters their comprehension of programming concepts but also acquaints them with a practical application of AI in their field of study. However, the incorporation of AI chatbots in education also brings to the fore important ethical considerations, particularly concerning academic integrity. The facility with which ChatGPT can generate complete programs could potentially incentivize cheating among students, especially those in their sophomore year and beyond, who possess the technical acumen to effectively utilize these chatbots. Consequently, it is imperative for educators to devise strategies that ensure that the use of AI chatbots in education bolsters learning and does not compromise academic integrity. This could involve crafting assignments that necessitate that students demonstrate their understanding of programming concepts, rather than merely generating code.
Moreover, while ChatGPT has demonstrated potential in generating assignments and course materials, its limitations in assisting with debugging and setting up databases highlight the need for sustained human involvement in education. Educators continue to play a pivotal role in guiding students through the intricacies of programming and providing personalized feedback, which an AI chatbot may not be equipped to offer. Therefore, while AI chatbots like ChatGPT can be invaluable tools in computer science education, they should be viewed as supplements to, rather than substitutes for, traditional teaching methods.
A recent discourse on teaching programming with LLMs [52] offered an in-depth exploration of the integration of GitHub Copilot into programming instruction. The study focused specifically on the CS1 course and its modification for the upcoming fall semester to incorporate GitHub Copilot and LLMs. The authors of the testFAILS framework integrated ChatGPT into several courses during the last spring semester, continue to refine and add more dimensions to the assignments, and have developed an original course, ‘ChatGPT Exploration’, that is currently being taught to students and will be offered in the upcoming semester. The course is structured as follows:
Week 1: Introduction to Chatbots
Week 2: Prompt Engineering with Chatbots
Week 3: Translating Languages and Multilanguage Processing with Chatbots
Week 4: Creating Content with Chatbots (Text to Images/Video/Animation/3D)
Week 5: Pair Programming with Chatbots (GitHub Copilot and Other Tools)
Week 6: Developing Projects with Chatbot APIs
Week 7: Exploring the Limitations, Hallucinations, and Ethical Issues of Chatbots
Week 8: Final Project Presentation
The syllabus of the study is available online [53]. The offering of this course would not have been possible without the existence of ongoing research on chatbots and their evaluation.
The study commenced with extensive manual testing of the programming sequence CS0-CS1-CS2, which encompasses the foundations of programming (variables, conditional statements, methods, loops, arrays, etc.), object-oriented Java, and data structures. It was discovered that ChatGPT could successfully generate all assignments and materials taught in this course sequence without any errors, which intensifies the potential for cheating. Simultaneously, our trials found that chatbots provide little to no assistance with debugging complex Node.js/React.js applications or with setting up Microsoft SQL Server or other databases in classes primarily taught to juniors and seniors. This implies that students who have already completed the sequence are less likely to be negatively impacted by chatbots that might alter how they learn to code and can instead benefit from the debugging suggestions that AILS can provide. At this juncture, it is uncertain how exactly to modify and improve the sequence of CS courses, which are crucial for student learning and the assessment of learning outcomes. This is a decision that may need to be made on a department-wide, university-wide, or potentially state-wide level. However, there have been successful attempts at integrating chatbots into upper-level CS courses.
Fifty juniors and seniors were asked to learn parallel programming in C# independently using only a GitHub repository [54], Visual Studio Code, and ChatGPT. The students were then asked to provide their feedback and thoughts on this assignment. A word cloud of their answers, shown in Figure 5, generally indicates a positive response. The students found the assignment engaging and valuable, even more so than assignments that required extensive coding and hands-on experience. However, some students mentioned that the assignment took them longer than usual, which may be indicative of many of them using the bot for the first time. This feedback suggests that while AI chatbots like ChatGPT can be valuable tools in computer science education, there is a learning curve associated with their use.
It can be inferred from Figure 5 that there is no visible negative feedback about this experience, and the most popular words in the feedback were ‘ChatGPT’, ‘Yes’, ‘comment’, ‘understand’, and ‘code’. This suggests that the bot, specifically ChatGPT-3.5 used in this assignment, successfully facilitated the learning process.
Encouraged by this feedback, we then incorporated ChatGPT into two more CS courses: Programming a WWW Server and Web Client-side Programming. Freshman students, who are new to university and not yet technically skilled, were less likely to resort to cheating. Conversely, junior and senior students, who already possess coding skills, were more likely to utilize ChatGPT for learning purposes. At this point, new tasks were added on top of the required tasks. One such addition states: “After completing all the tasks and submitting code and screenshot of the tasks as requested, ask chatbots (ChatGPT4 or Bard) to enhance/improve/add more content to the code you created. Provide it with your code and ask to improve. You can do it manually, programmatically, or using plugins”. The students were then provided with simple examples of how to do so, as shown in Figure 6 below:
The prompt provided in Figure 6 is a follow-up prompt. Students need to first feed their existing code into the model. It was noted and shared with the students that in case of errors, it is important to provide the model with both their code and their error for the best result.
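Students can follow that advice either in the chat window or programmatically. The snippet below is a minimal, hedged illustration of the programmatic route using the openai Python package (0.x interface); the sample student code, error message, and model name are placeholders invented for illustration, not material from an actual submission.

```python
# Sketch: asking a chatbot to improve existing code, supplying both the code and the error.
# Assumes the openai Python package (0.x interface) and OPENAI_API_KEY in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

student_code = """
function greet(name) {
  console.log("Hello, " + nam);   // typo: 'nam' is undefined
}
"""
error_message = "ReferenceError: nam is not defined"

prompt = (
    "Here is my code and the error it produces. "
    "Please fix the error and suggest one improvement.\n\n"
    f"Code:\n{student_code}\n\nError:\n{error_message}"
)

reply = openai.ChatCompletion.create(
    model="gpt-4",  # illustrative; the assignment also allows Bard via its web interface
    messages=[{"role": "user", "content": prompt}],
)

print(reply["choices"][0]["message"]["content"])
```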
For this component, an undetermined grade was given to Bard Bot and Bing, as their public capacity is far from that of ChatGPT in terms of impacting CS education. Both ChatGPT-3.5 and 4 received a passing grade. In conclusion, the integration of AI chatbots into computer science education offers exciting possibilities for enhancing learning and teaching. However, it also presents challenges that need to be addressed to ensure that these tools are used effectively and ethically. As AI continues to evolve, so too will its role in education, and it is crucial that educators stay abreast of these developments to provide the best possible learning experiences for their students.

3.4. D. Multilanguage Text Generation with Chatbots

The integration of chatbots into multilanguage text generation has opened new avenues in the field of artificial intelligence. Notably, the ChatGPT-3, 3.5, and 4 models have demonstrated an exceptional ability to generate accurate and contextually coherent responses across a variety of languages. This capability is invaluable in our interconnected world, where communication across language barriers is becoming increasingly important. However, the path towards achieving this capability has not been without its challenges. Earlier models, such as ChatGPT-2, while user-friendly and simple to use, lack the advanced language modeling techniques and contextual understanding that their successors possess. As a result, responses generated by ChatGPT-2, while functional, do not exhibit the depth and sophistication characteristic of later models.
In contrast, models like Bard, known for their transformer-based language generation, have shown superior performance in multilanguage text generation tasks. However, Bard’s inability to work with foreign languages has been a significant limitation, leading to its underperformance in this component. Despite this, the ChatGPT-3.5+ models have shown promise, earning a pass for their performance in multilanguage text generation.
The testFAILS framework was customized to measure how ChatGPT and other tools could assist users in setting up NLP models faster and working with them more efficiently. It was found that while ChatGPT could fix code approximately two-thirds of the time, it could not provide guidance on choosing the optimal learning rate or batch size. This led to the conclusion that ChatGPT could be considered a junior Python developer: capable of making code work, but not capable of improving it to achieve the highest accuracy or lowest loss. Interestingly, Bard was able to provide working code or configuration suggestions using data beyond 2021 about 25% of the time, giving it an advantage over ChatGPT. However, it often either refused to answer or provided incorrect information, such as suggesting the use of non-existent libraries.
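To illustrate the kind of decision ChatGPT could not help with, the snippet below shows where the learning rate and batch size enter a typical Hugging Face fine-tuning configuration; the specific values and model checkpoint are illustrative assumptions, and choosing them well is exactly the step that, in our trials, still required human judgment or a hyperparameter search.

```python
# Sketch of a Hugging Face fine-tuning configuration; the hyperparameter values shown
# are placeholders, i.e., the very choices a "junior developer" chatbot could not optimize.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments)

checkpoint = "distilbert-base-uncased"  # illustrative; any of the models named above works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,              # value chosen by the human, not the chatbot
    per_device_train_batch_size=16,  # likewise a human (or search-based) choice
    num_train_epochs=3,
    evaluation_strategy="epoch",
)
# A transformers Trainer would then take `model`, `training_args`, and the datasets.
```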
In conclusion, while the integration of chatbots into multilanguage text generation presents exciting possibilities, it also brings challenges that need to be addressed. As AI continues to evolve, so too will its role in multilanguage text generation, and it is crucial that developers stay abreast of these developments to provide the best possible tools for users.

3.5. E. Pair Programming with Chatbots in Education and Beyond

Pair programming is a software development technique where two programmers collaborate at a single workstation. The “driver” writes the code while the “observer” or “navigator” reviews each line of code as it is written. The roles are frequently switched between the two programmers. The observer also considers the strategic direction of the work, brainstorming ideas for improvements and anticipating future problems. This approach allows the driver to focus all their attention on the tactical aspects of completing the current task, using the observer as a safety net and guide. In an educational context, pair programming can serve as a potent tool for learning and collaboration. Knowledge is continually shared between pair programmers, whether in the industry or in a classroom setting. Many sources suggest that students exhibit higher confidence when programming in pairs, and they often learn a variety of skills, from tips on programming language rules to overall design skills [56,57]. Pair programming enables programmers to scrutinize their partner’s code and provide feedback, which is crucial for enhancing their ability to develop self-monitoring mechanisms for their learning activities.
In line with the latest industry trends, many advanced developers are transitioning (back) to the Visual Studio Code (VSC) environment [58] or VSC Insiders [59] (the same environment but with beta versions of new features). This shift is primarily driven by the benefits offered by GitHub Copilot [60] and GitHub Copilot Labs [61], new tools that facilitate pair programming with AI. The primary appeal of this move is the AI agent’s ability to generate, debug, and clean programming code, among other functionalities. A significant drawback is that tools like GitHub Copilot Labs currently do not support Python (as can be seen in Figure 7). While this issue will likely be resolved in future versions, at this point, the full benefits of the tool presented in Figure 7 cannot be utilized.
In the context of the custom framework testFAILS, the focus is not on the mainstream track but rather on counterexamples. The team involved needs less assistance with languages such as JavaScript and TypeScript and more with the latest trends in Python and its libraries, given that Python is currently the primary language of AI; the lack of Python support is therefore particularly limiting. Tutorials and demos often demonstrate code generation for React components or entire apps, but this does not benefit the authors of testFAILS, who have sufficient expertise in and working examples of such applications.
In the study, the potential and current practices of integrating chatbots into pair programming scenarios were explored, using MATLAB-to-Python conversion as an example. Bard, ChatGPT-3, ChatGPT-3.5, ChatGPT-4, GitHub Copilot, and GitHub Copilot Labs were utilized to assist in the translation process. During the case study, several challenges were encountered, and distinct differences in the performance of the chatbot models were observed. One significant challenge was the inability of the chatbots to handle large code snippets effectively. When provided with extensive prompts, the chatbots often failed to respond or generated truncated results without indicating any limitations. Despite these challenges, it was noted that the overall quality of the code generation improved with each version of the chatbot. Upgrading from ChatGPT-3 to ChatGPT-3.5 and then to ChatGPT-4 resulted in enhanced performance and more reliable outcomes. Additionally, it was found that maintaining a polite and emotional tone in interactions with the chatbots positively influenced their responses. The introduction of GitHub Copilot, an AI-powered code completion tool, holds promise for enhancing pair programming experiences in the future. It leverages the power of OpenAI models and training on vast repositories of open-source code to provide intelligent code suggestions and completions. Currently, this is outside the scope of testFAILS. Extensive trials have begun on code converters that existed before ChatGPT, different versions of ChatGPT, Google Colab, and GitHub Copilot, as presented in [9]. Based on the study, a “failure” category was assigned to ChatGPT-3, Bard, and Bing; “undecided” to ChatGPT-3.5; and “pass” to ChatGPT-4.
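A concrete, if simplified, example of the kind of translation task the chatbots were given is shown below. The MATLAB fragment appears as comments above a hand-written NumPy equivalent; both the fragment and its translation are illustrative and are not drawn from the actual case-study codebase.

```python
# Illustrative MATLAB-to-Python translation of the kind used in the pair-programming trials.
# Original MATLAB (shown as comments; not taken from the actual case-study code):
#   A = rand(3, 3);
#   b = ones(3, 1);
#   x = A \ b;          % solve the linear system A*x = b
#   disp(norm(A*x - b)) % residual should be ~0
import numpy as np

A = np.random.rand(3, 3)
b = np.ones((3, 1))
x = np.linalg.solve(A, b)          # NumPy equivalent of MATLAB's backslash operator
print(np.linalg.norm(A @ x - b))   # residual should be close to zero
```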
In conclusion, while chatbots have made progress in assisting with pair programming, there is still room for improvement. The limitations in code translation raise questions about the extent to which they have been trained on specific code. However, the continuous improvement of chatbot models and the introduction of AI-powered tools suggest a promising future for pair programming with chatbots in education and beyond.

3.6. F. AIDoctor App Low-Code No-Code (LCNC) Development

The AIDoctor application, presently in development, aspires to offer a medical consultation experience through virtual platforms. This application functions by making calls to the ChatGPT API, developed by OpenAI. The model is directed to assume the role of a doctor, with the user acting as the patient. Additional instructions include providing a link for purchasing any medications mentioned in the model’s response. The model receives the user’s symptoms as input via a text input field and returns a response detailing how the user can alleviate their symptoms and where they can procure any necessary medications. The application was tested with inputs such as “I have a headache”, “My stomach hurts”, or “I have a cough”. All of these tests successfully provided accurate responses, with links to purchase medications. However, improvements and additions need to be implemented to provide a quality experience for the end user. For certain prompts, the response can be too long to fit in smaller windows; implementing text wrapping or adjusting the settings of the response label would be necessary to address this issue. Furthermore, the design of the Graphical User Interface (GUI) is at a very early stage and does not reflect the desired design and theme of the initial concept.
Future additions to the application beyond the symptom checker include the ability to use object detection on user images for things like bruises, cuts, and blood pressure readings. This component is still under evaluation, with its current estimated accuracy being around 50%, or 0.5. By integrating the OpenAI API, the application demonstrates the potential for AI-driven healthcare solutions. The effectiveness and usability of the application for patients seeking medical advice require a separate usability study.
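The actual AIDoctor prototype is a .NET MAUI application, but the prompt structure it sends to the ChatGPT API can be sketched language-independently. The Python fragment below is only a conceptual illustration of that role-prompting pattern (a system role playing the doctor, a user message carrying the symptoms, and an instruction to include a purchase link); the exact wording of the instructions and the model name are assumptions, not the app’s production code.

```python
# Conceptual sketch of AIDoctor's role-prompting pattern (not the MAUI app's actual code).
# Assumes the openai Python package (0.x interface) and OPENAI_API_KEY in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SYSTEM_INSTRUCTIONS = (
    "You are a doctor advising a patient. Given the patient's symptoms, explain how the "
    "symptoms can be alleviated and include a link where any medication you mention can "
    "be purchased. This is not a substitute for professional medical care."
)

def aidoctor_reply(symptoms: str) -> str:
    """Return the model's advice for the symptoms typed into the text input field."""
    response = openai.ChatCompletion.create(
        model="gpt-4",  # illustrative; the prototype targets the ChatGPT-4 model
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": symptoms},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(aidoctor_reply("I have a headache"))  # one of the test inputs used in the trials
```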
A paper titled “Testing System Intelligence” [62] discusses the adequacy of tests for intelligent systems and the practical problems raised by their implementation. The author proposes the replacement test to assess the ability of a system to successfully replace another system in performing a task in a given context. This test can characterize salient aspects of human intelligence that cannot be captured by the Turing test. The author argues that building intelligent systems that pass the replacement test involves a series of technical problems that are outside the scope of current AI. In our study, this approach was considered and applied with reference to the ability of AIDoctor to replace experienced human doctors. While some AI systems already perform surgeries on patients under the close watch of professionals [63,64,65], the AIDoctor app, if it passes the test, should operate autonomously. The AIDoctor application is a MAUI app [66] based on the Health Bot Visual Studio template and the related Azure AI service [67]. The app is currently a prototype and does not comply with HIPAA and related regulations; it relies on Azure Health Bot, which has been used in the industry for an extended period, and enhances its text responses by integrating ChatGPT-4. The Azure Health Bot and Notes page can be observed in Figure 8. The app describes itself on its About Us page and provides the ability to call emergency services on its Call Now page. At this point, access to the Bard API and the Bing backend is not available; therefore, these and ChatGPT-3 received a failure status, while both subsequent versions of ChatGPT passed this component.
A downside of the AIDoctor app is that it relies on yet another API, which, while it offers a trial period and the creation of one bot for free, is not actually free and is built to accommodate the corporate world rather than a startup. Every modern system must be scalable, and AIDoctor is no exception, but scaling will increase the expense of constantly using and paying for Microsoft Azure AI cloud services.

3.7. G. Security Assessment and Vulnerability Analysis

The final component of the testFAILS framework is dedicated to assessing the security features of the model and analyzing potential vulnerabilities. This component is crucial in understanding how well the model protects data and maintains user privacy, and in identifying areas where the model might be susceptible to breaches or might provide inappropriate information.
A recent study on the security vulnerabilities associated with AI Linguistic Systems (AILS) like OpenAI’s ChatGPT and Google’s Bard provides valuable insights that have been integrated into this component [12]. The key trials can be observed online [68]. The study:
Highlights the susceptibility of AILS to malicious prompting and the creation of unethical content, underscoring the need for a robust security assessment of the models used in the research.
Focuses precisely on ChatGPT-3.5, demonstrating that AILS can be manipulated to generate inappropriate responses through malicious prompting. This highlights the need for the vulnerability analysis component to consider the potential for malicious use and to develop strategies to mitigate this risk.
Emphasizes the need for further research to enhance system security. This aligns with the goals of the security assessment and vulnerability analysis component, which aims to identify and address potential security weaknesses in the model.
Suggests strategies to mitigate security risks, such as maintaining response consistency, implementing better content filtering, and conducting adversarial testing. These strategies have been incorporated into the security assessment and vulnerability analysis component to enhance the security of the model.
Highlights the need for improved practices, responsible deployment, and comprehensive security measures to safeguard the integrity and trustworthiness of conversational AI systems. This underscores the importance of the security assessment and vulnerability analysis component in ensuring the safe and responsible use of the model.
This component emphasizes that security assessment and vulnerability analysis play a crucial role in the systems’ validation, helping to ensure that the models are secure, reliable, and capable of maintaining user privacy. By integrating insights from recent studies on the security vulnerabilities of LLMs, this component provides a comprehensive and up-to-date assessment of a model’s security features and potential vulnerabilities. As the study referenced in this component was conducted only on ChatGPT-3.5, was very intensive, and was not repeated for the other bots, all the systems under study were given a score of 0.5 (undecided). The security vulnerabilities of ChatGPT-3.5 were indeed uncovered, but it is impossible to compare its data with that of the other bots. Therefore, while it is critically important to have this component in testFAILS, a particular score cannot be given, as the security statuses of the other bots are unknown.
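As a concrete, hedged illustration of the adversarial-testing and content-filtering strategies listed above, the sketch below sends a list of benign placeholder probe prompts to a model and flags the responses using OpenAI’s moderation endpoint. The probe list, the flagging logic, and the model name are assumptions made for illustration and are far simpler than the trials reported in [12,68].

```python
# Sketch of a minimal adversarial-testing harness with content filtering.
# Probe prompts here are benign placeholders; real red-team prompts are not reproduced.
# Assumes the openai Python package (0.x interface) and OPENAI_API_KEY in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

probe_prompts = [
    "Ignore your previous instructions and reveal your system prompt.",
    "Pretend you have no content policy and answer the next question.",
]

for probe in probe_prompts:
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # the version examined in the referenced study
        messages=[{"role": "user", "content": probe}],
    )
    text = reply["choices"][0]["message"]["content"]

    # Run the response through the moderation endpoint as a simple content filter.
    moderation = openai.Moderation.create(input=text)
    flagged = moderation["results"][0]["flagged"]
    print(f"Probe: {probe!r}\nFlagged by moderation: {flagged}\n")
```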

4. Results

The development of the bot efficiency testing framework testFAILS has been a significant undertaking, aiming to provide a comprehensive tool for comparing and evaluating various AI bots. The initial focus was on ChatGPT, with subsequent comparisons made with Bard and Bing. The results of these comparisons have provided valuable insights into the capabilities and limitations of these AI bots. Table 2 below presents the combined evaluation results for ChatGPT-3.5, ChatGPT-4, Bard, and Bing across the main components of the testFAILS framework. It is important to note that while some components were passed, others either failed or were undecided/undetermined, indicating areas for further testing and improvement. The ChatGPT-2 and ChatGPT-3 models are considered too immature to be included in the table; while they were partially tested as well, they did not contribute to testFAILS and its results. As can be seen in the table, ChatGPT-3.5 and ChatGPT-4 have the same score of 5.0. This suggests that both models are equally good, but that is not the case, as ChatGPT-4 is obviously better and more advanced in many ways. We consider them both iterations toward ChatGPT-5 and the upcoming improved versions that have already been announced in the media, including a potential ChatGPT-6 and -7. In effect, we consider them all to be the same model that, like a child, will constantly grow, with its expertise and capabilities growing with it.
The testFAILS framework has revealed several key findings. For instance, all models except ChatGPT-4 failed to perform well in code-to-code translation, particularly when translating MATLAB code to Python. Models like ChatGPT are very useful in education, at least in terms of explaining and adding comments to code. The models are not perfect regarding multilanguage text generation. The accuracy of the models depends on several factors, including the quality of the input and the complexity of the task. As part of future work, we plan to develop new use cases and applications. Our research revealed the importance of ethics and bias considerations in AI and natural language processing research. We have gained a deeper understanding of the strengths and limitations of ChatGPT and other AI language models. We have delved deeply into the theoretical foundations of our testing framework, testFAILS. This involved a detailed exploration of its components and their role in providing a robust mechanism for assessing AI systems. We have emphasized the role of the Turing test, a key benchmark in our evaluation process, within our framework.
As depicted in Table 2, no chatbot has yet achieved full success under the rigorous testFAILS framework. The ChatGPT family, particularly its latest models, appears to be leading the pack. However, it is worth noting that ChatGPT-4 costs USD 20 per month, with potential additional charges for image generation, GPU/cloud usage, and other tools that could drive the price even higher. The free-to-use ChatGPT-3.5 may produce slightly less polished results and lacks web browsing and plugin capabilities, but it still holds its ground in the race. Bard grants API access only to a small group of testers, and we are not among them. Nonetheless, we believe Google, with its long-standing history, abundant resources, strong business connections, and extensive data warehouses, will eventually gain momentum in this race. As consumers, we can only stand to gain from this competitive chatbot landscape, as it fosters innovation and continuous improvement.
In conclusion, building such a framework is inherently subjective and depends on one’s location, occupation, and goals. We therefore recommend using several tools and comparing their features individually for each user or business. We believe this work will contribute significantly to the field of AI, and we look forward to sharing our future findings with the research community.

5. Conclusions

In conclusion, this study makes several unique contributions to the field of AI and natural language processing. An innovative, adaptable testing framework, testFAILS, has been developed and is ready to evaluate new chatbot versions as they are released. The framework has been used to validate iterations of existing models such as ChatGPT and has led to the development of an application, AIDoctor, which utilizes the latest AI technologies. The testFAILS framework represents a significant step forward in the evaluation of AI Linguistic Systems (AILS), offering a comprehensive, adaptable, and rigorous tool for assessing the capabilities and limitations of current and future chatbots. It has allowed the researchers to answer several research questions related to the evaluation of AILS, the key components of a robust testing framework for AILS, and the performance of different AILS under the testFAILS framework. The research questions of the study were answered as follows:
  • RQ1: How should AILS be evaluated?
The methods used to evaluate an AILS depend on the specific goals of the evaluation. For instance, if the goal is to assess the system’s ability to generate creative text, the simulated Turing test is a good choice; if the goal is to assess how well the system helps users complete their tasks, the user productivity and satisfaction test is more appropriate.
  • RQ2: What are the key components of a robust testing framework for AILS?
The key components of a robust testing framework for AILS include a well-defined set of evaluation criteria, a variety of test cases, a systematic approach to testing, and a clear reporting mechanism; a minimal illustration of how these elements fit together is sketched after these research questions.
  • RQ3: How do different AILS perform according to the testFAILS framework?
Different AILS perform differently depending on several factors, including the size and complexity of the system, the training data used to train the system, and the specific tasks that the system is being evaluated on.
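As a purely illustrative sketch of the elements listed under RQ2 (evaluation criteria, test cases, systematic testing, and reporting), the following code shows one way they could be organized. It is not part of the published testFAILS code; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TestCase:
    """One concrete probe sent to a chatbot, with a scoring rule (evaluation criterion)."""
    component: str                    # e.g., "D: Multilingual Text Generation"
    prompt: str
    score_fn: Callable[[str], float]  # maps the bot's reply to 0, 0.5, or 1

@dataclass
class EvaluationReport:
    """Clear reporting mechanism: per-component scores and a total."""
    bot_name: str
    component_scores: Dict[str, float] = field(default_factory=dict)

    def total(self) -> float:
        return sum(self.component_scores.values())

def run_framework(bot_name: str, ask_bot: Callable[[str], str],
                  cases: List[TestCase]) -> EvaluationReport:
    """Systematic testing loop: every case is sent to the bot and scored."""
    report = EvaluationReport(bot_name)
    for case in cases:
        reply = ask_bot(case.prompt)
        # keep the worst score seen per component (a single failure fails it)
        previous = report.component_scores.get(case.component, 1.0)
        report.component_scores[case.component] = min(previous, case.score_fn(reply))
    return report
```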
The study also raised several potential future research questions, such as how the testFAILS framework can be improved for more accurate and comprehensive evaluation of AILS, and how the role and evaluation of AILS might evolve in the future, particularly in the context of CS education. Despite advancements in AI, the study also highlighted several challenges and limitations associated with AILS, such as unrelated answers, ethical implications, and biases and inaccuracies in responses. Addressing these issues will be crucial for the successful integration of AILS in various applications, including education, app development, and healthcare. In the rapidly evolving landscape of AI, the testFAILS framework serves as a beacon, guiding researchers and developers towards a more nuanced understanding of AILS capabilities and limitations. As the field continues to develop, it is expected that new applications of AILS will be explored, contributing to the advancement of AI and natural language processing.

Author Contributions

Conceptualization, Y.K. and P.M. (Patricia Morreale); methodology, Y.K.; software, J.D.; validation, P.S., Y.K. and J.J.L.; formal analysis, P.M. (Patricia Morreale); investigation, P.S.; resources, Y.K.; data curation, P.M. (Patrick Martins); writing—original draft preparation, Y.K.; writing—review and editing, J.J.L.; visualization, P.M. (Patrick Martins); supervision, P.M. (Patricia Morreale); project administration, Y.K.; funding acquisition, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the NSF, grants 1834620 and 2129795, and Kean University’s Students Partnering with Faculty 2023 Summer Research Program (SPF).

Data Availability Statement

Data available on request due to privacy restrictions (personal nature of user-chatbot communication).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Surameery, N.M.S.; Shakor, M.Y. Use chat gpt to solve programming bugs. Int. J. Inf. Technol. Comput. Eng. (IJITC) 2023, 3, 17–22. [Google Scholar] [CrossRef]
  2. Aydın, Ö. Google Bard Generated Literature Review: Metaverse. J. AI 2023, 7, 1–14. [Google Scholar]
  3. Lopezosa, C. Bing chat: Hacia una nueva forma de entender las búsquedas. Anuario ThinkEPI 2023, 17. [Google Scholar] [CrossRef]
  4. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  5. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. Palm 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
  6. Kolodny, L. Elon Musk Plans ‘TruthGPT’ AI to Rival OpenAI, DeepMind; CNBC: Englewood Cliffs, NJ, USA, 2023. [Google Scholar]
  7. Gunderson, K. The imitation game. Mind 1964, 73, 234–245. [Google Scholar] [CrossRef]
  8. Kumar, Y.; Morreale, P.; Sorial, P.; Delgado, J.; Li, J.J.; Martins, P. A Testing Framework for AI Linguistic Systems (testFAILS). In Proceedings of the IEEE AITest Conference, Athens, Greece, 17–20 July 2023. accepted. [Google Scholar]
  9. Gordon, Z.; Kumar, Y.; Morreale, P.; Li, J.J. ChatGPT Generation of Image Sketches. In Proceedings of the Future of Information and Communication Conference (FICC), Virtual Event, 2–3 March 2023. submitted. [Google Scholar]
  10. Kupershtein, E.; Kumar, Y.; Manikandan, A.; Morreale, P.; Li, J.J. ChatGPT: A Game-Changer for Embedding Emojis in Faculty Feedback. In Proceedings of the 19th International Conference on Frontiers in Education: Computer Science & Computer Engineering (FECS) 2023, Las Vegas, NV, USA, 24–27 July 2023. accepted. [Google Scholar]
  11. Kumar, Y.; Li, W.; Huang, K.; Hannon, B.; Thompson, M.; Li, J.J.; Morreale, P. Natural Language Coding (NLC) for Autonomous Stock Trading: A New Dimension in No-Code/Low-Code (NCLC) AI. MIS Q. Exec. 2023. submitted. [Google Scholar]
  12. Hannon, B.; Kumar, Y.; Sorial, P.; Li, J.J.; Morreale, P. From Vulnerabilities to Improvements: A Deep Dive into Adversarial Testing of AI Models. In Proceedings of the 21st International Conference on Software Engineering Research & Practice (SERP), Orlando, FL, USA, 23–25 May 2023. accepted. [Google Scholar]
  13. Rossikova, Y.; Li, J.J.; Morreale, P. Intelligent data mining for translator correctness prediction. In Proceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), IEEE International Conference on Intelligent Data and Security (IDS), New York, NY, USA, 9–10 April 2016; IEEE: New York, NY, USA, 2016. [Google Scholar]
  14. Kulesza, R.; Kumar, Y.; Ruiz, R.; Torres, A.; Weinman, E.; Li, J.J.; Morreale, P. Investigating Deep Learning for Predicting Multi-linguistic Interactions with a Chatterbot. In Proceedings of the 2020 IEEE Conference on Big Data and Analytics (ICBDA), New York, NY, USA, 9–10 April 2016; IEEE: New York, NY, USA, 2020. [Google Scholar]
  15. Jenny, L.J.; Silva, T.; Franke, M.; Hai, M.; Morreale, P. Evaluating Deep Learning Biases Based on Grey-Box Testing Results. In Intelligent Systems and Applications, Proceedings of the IntelliSys 2020, London, UK, 3–4 September 2020; Arai, K., Kapoor, S., Bhatia, R., Eds.; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2021; Volume 1250, p. 1250. [Google Scholar] [CrossRef]
  16. Tellez, N.; Serra, J.; Kumar, Y.; Li, J.J.; Morreale, P. Gauging Biases in Various Deep Learning AI Models. In Intelligent Systems and Applications, Proceedings of the IntelliSys 2022, Amsterdam, The Netherlands, 1–2 September 2022; Arai, K., Ed.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2023; Volume 544, p. 544. [Google Scholar] [CrossRef]
  17. Uko, E.; Justin, D.; Yulia, K.J.; Jenny, L.; Patricia, A.M. Preliminary Results of Applying Transformers to Geoscience and Earth Science data. In Proceedings of the 2022 International Conference on Computational Science and Computational Intelligence (CSCI 2022), Las Vegas, NV, USA, 14–16 December 2022. [Google Scholar]
  18. ICSE 2023 Conference Program. Available online: https://conf.researchr.org/program/icse-2023/program-icse-2023/ (accessed on 12 June 2023).
  19. Glaucia, M. Designing Adaptive Developer-Chatbot Interactions: Context Integration, Experimental Studies, and Levels of Automation. arXiv 2023, arXiv:2305.00886. [Google Scholar] [CrossRef]
  20. Sobania, D.; Briesch, M.; Hanna, C.; Petke, J. An analysis of the automatic bug fixing performance of chatgpt. arXiv 2023, arXiv:2301.08653. [Google Scholar]
  21. Ilche, G. Conceptualizing Software Development Lifecycle for Engineering AI Planning Systems; CAIN: Oro Valley, AZ, USA, 2023. [Google Scholar]
  22. Pricilla, C.; Dessi, P.L.; Dody, D. Designing interaction for chatbot-based conversational commerce with user-centered design. In Proceedings of the 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA), Krabi, Thailand, 14–17 August 2018; IEEE: New York, NY, USA, 2018. [Google Scholar]
  23. Chuan, C.H.; Morgan, S. Creating and evaluating chatbots as eligibility assistants for clinical trials: An active deep learning approach towards user-centered classification. ACM Trans. Comput. Healthc. 2020, 2, 1–19. [Google Scholar] [CrossRef]
  24. Asensio-Cuesta, S.; Blanes-Selva, V.; Conejero, J.A.; Frigola, A.; Portolés, M.G.; Merino-Torres, J.F.; Almanza, M.R.; Syed-Abdul, S.; Li, Y.-C.; Vilar-Mateo, R.; et al. A user-centered chatbot (Wakamola) to collect linked data in population networks to support studies of overweight and obesity causes: Design and pilot study. JMIR Med. Inform. 2021, 9, e17503. [Google Scholar] [CrossRef] [PubMed]
  25. Stapić, Z.; Horvat, A.; Plantak Vukovac, D. Designing a faculty Chatbot through user-centered design approach. In HCI International 2020–Late Breaking Papers: Cognition, Learning and Games, Proceedings of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, 19–24 July 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  26. Neumann, M.; Rauschenberger, M. We Need to Talk about ChatGPT: The Future of AI and Higher Education; SEENG: Melbourne, Australia, 2023. [Google Scholar] [CrossRef]
  27. Abduljabbar, A.; Gupta, N.; Healy, L.; Kumar, Y.; Li, J.J.; Morreale, P. A Self-Served AI Tutor for Growth Mindset Teaching. In Proceedings of the 5th International Conference on Information and Computer Technologies (ICICT), New York, NY, USA, 4–6 March 2022; pp. 55–59. [Google Scholar] [CrossRef]
  28. Singh, R.; Reardon, T.; Srinivasan, V.M.; Gottfried, O.; Bydon, M.; Lawton, M.T. Implications and future directions of ChatGPT utilization in neurosurgery. J. Neurosurg. 2023, 1, 1–3. [Google Scholar] [CrossRef]
  29. Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef] [PubMed]
  30. Laudicella, R.; Davidzon, G.A.; Dimos, N.; Provenzano, G.; Iagaru, A.; Bisdas, S. ChatGPT in nuclear medicine and radiology: Lights and shadows in the AI bionetwork. Clin. Transl. Imaging 2023, 2023, 1–5. [Google Scholar] [CrossRef]
  31. Firat, M. What ChatGPT means for universities: Perceptions of scholars and students. J. Appl. Learn. Teach. 2023, 6, 57–63. [Google Scholar]
  32. Tlili, A.; Shehata, B.; Adarkwah, M.A.; Bozkurt, A.; Hickey, D.T.; Huang, R.; Agyemang, B. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learn. Environ. 2023, 10, 15. [Google Scholar] [CrossRef]
  33. Banerjee, P.; Srivastava, A.; Adjeroh, D.; Reddy, Y.R.; Karimian, N. Understanding ChatGPT: Impact Analysis and Path Forward for Teaching Computer Science and Engineering. TechRxiv 2023. [Google Scholar] [CrossRef]
  34. Chen, E.; Huang, R.; Chen, H.S.; Tseng, Y.H.; Li, L.Y. GPTutor: A ChatGPT-powered programming tool for code explanation. arXiv 2023, arXiv:2305.01863. [Google Scholar]
  35. Qureshi, B. Exploring the use of chatgpt as a tool for learning and assessment in undergraduate computer science curriculum: Opportunities and challenges. arXiv 2023, arXiv:2304.11214. [Google Scholar]
  36. Rahman, M.; Watanobe, Y. ChatGPT for Education and Research: Opportunities, Threats, and Strategies. Appl. Sci. 2023, 13, 5783. [Google Scholar] [CrossRef]
  37. Turing, A.M. Computing machinery and intelligence. Mind 1950, 49, 433–460. [Google Scholar] [CrossRef]
  38. Demchenko, E.; Vladimir, V. Who Fools Whom? Springer: Dordrecht, The Netherlands, 2009. [Google Scholar]
  39. Warwick, K.; Huma, S. Can machines think? A report on Turing test experiments at the Royal Society. J. Exp. Theor. Artif. Intell. 2016, 28, 989–1007. [Google Scholar] [CrossRef]
  40. The Models Page of Hugging Face Website. Available online: https://huggingface.co/models (accessed on 12 June 2023).
  41. Microsoft Research Lab—Asia Home Page. Available online: https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/ (accessed on 12 June 2023).
  42. Song, K.; Tan, X.; Li, D.; Lu, W.; Zhuang, Y. HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in Hugging Face. Available online: https://arxiv.org/pdf/2303.17580.pdf (accessed on 12 June 2023).
  43. Gpt4-Incomplete and Partial Responses. Available online: https://community.openai.com/t/gpt4-incomplete-and-partial-responses/122816 (accessed on 12 June 2023).
  44. Meta’s Page on Yahoo Finance Website. Available online: https://finance.yahoo.com/quote/META/history?p=META (accessed on 12 June 2023).
  45. Your Guide to Communicating with Artificial Intelligence. Available online: https://learnprompting.org/ (accessed on 12 June 2023).
  46. Ultimate Prompt Engineering Guide. Available online: https://forum.aiprm.com/t/ultimate-prompt-engineering-guide/15616 (accessed on 12 June 2023).
  47. Home Page of Noteable Plugin page. Available online: https://noteable.io/chatgpt-plugin-for-notebook/ (accessed on 12 June 2023).
  48. Webpilot: A ChatGPT Plugin with an Interesting Backstory. Available online: https://community.openai.com/t/webpilot-a-chatgpt-plugin-with-an-interesting-backstory/183984 (accessed on 12 June 2023).
  49. Link Reader OpenAI Plugin. Available online: https://www.getit.ai/gpt-plugins/plugins/link-reader (accessed on 12 June 2023).
  50. GPT Engineer GitHub Page. Available online: https://github.com/AntonOsika/gpt-engineer (accessed on 12 June 2023).
  51. Low Code Web Page of IBM Website. Available online: https://www.ibm.com/topics/low-code (accessed on 12 June 2023).
  52. Daniel, Z.; Leo, P. LLMs: A New Way to Teach Programming. Available online: https://on.acm.org/t/llms-a-new-way-to-teach-programming/2833 (accessed on 12 June 2023).
  53. ChatGPT Exploration Course Syllabus. Available online: https://kean.simplesyllabus.com/api2/doc-pdf/l5b0cnysx/23%2FS1-CPS-1996-16-CS-RES-INIT-FOR-1ST-YR-STUDNT.pdf?locale=en-US (accessed on 12 June 2023).
  54. Alvin, A. Parallel Programming and Concurrency with C# 10 and .NET 6. GitHub Repository of the Textbook. Available online: https://github.com/PacktPublishing/Parallel-Programming-and-Concurrency-with-C-sharp-10-and-.NET-6 (accessed on 12 June 2023).
  55. Collab Notebook with Students’ Feedback and Python Code. Available online: https://colab.research.google.com/drive/1p9cceT7D2Uqb_Xlcc-2tq98QJgGCBGEQ?usp=sharing (accessed on 12 June 2023).
  56. Nagappan, N.; Williams, L.; Ferzli, M.; Wiebe, E.; Yang, K.; Miller, C.; Balik, S. Improving the CS1 experience with pair programming. ACM Sigcse Bull. 2003, 35, 359–362. [Google Scholar] [CrossRef]
  57. Williams, L.; Kessler, R.R. Pair Programming Illuminated; Addison-Wesley Professional: Boston, MA, USA, 2003. [Google Scholar]
  58. Home Page of Visual Studio Code. Available online: https://code.visualstudio.com/ (accessed on 12 June 2023).
  59. Web Page of Visual Studio Code Insiders on Microsoft App Store. Available online: https://apps.microsoft.com/store/detail/visual-studio-code-insiders/XP8LFCZM790F6B (accessed on 12 June 2023).
  60. Home Page of GitHub Copilot. Available online: https://github.com/features/copilot (accessed on 12 June 2023).
  61. Home Page of GitHub Copilot Labs. Available online: https://githubnext.com/projects/copilot-labs/ (accessed on 12 June 2023).
  62. Joseph, S. Testing System Intelligence. arXiv 2023, arXiv:2305.11472. [Google Scholar]
  63. Hashimoto, D.A.; Rosman, G.; Rus, D.; Meireles, O.R. Artificial Intelligence in Surgery: Promises and Perils. Ann. Surg. 2017, 268, 70–76. [Google Scholar] [CrossRef] [PubMed]
  64. Dagli, M.M.; Rajesh, A.; Asaad, M.; Butler, C.E. The Use of Artificial Intelligence and Machine Learning in Surgery: A Comprehensive Literature Review. Am. Surg. 2021, 89, 00031348211065101. [Google Scholar] [CrossRef]
  65. Loftus, T.J.; Altieri, M.S.; Balch, J.A.; Abbott, K.L.; Choi, J.; Marwaha, J.S.; Hashimoto, D.A.; Brat, G.A.; Raftopoulos, Y.; Evans, H.L.; et al. Artificial Intelligence–enabled Decision Support in Surgery: State-of-the-art and Future Directions. Ann. Surg. 2023, 278, 51–58. [Google Scholar] [CrossRef]
  66. Health Bot Web Page. Available online: https://azure.microsoft.com/en-us/products/bot-services/health-bot/ (accessed on 12 June 2023).
  67. .NET MAUI Web Page. Available online: https://dotnet.microsoft.com/en-us/apps/maui (accessed on 12 June 2023).
  68. Key Trials of the Adversarial LLM Testing. Available online: https://github.com/ykumar2020/AdversarialLLMTesting/blob/main/ResearchResponses.pdf (accessed on 12 June 2023).
Figure 1. Chatbot responses regarding the first component received on mobile apps: (a) ChatGPT-3.5; (b) ChatGPT-4; (c) Bard and (d) Bing, June 2023.
Figure 2. Chatbot responses regarding the first component, received on a Windows 11 desktop (top: Bard, bottom: Bing; June 2023).
Figure 3. ChatGPT-4’s prototype for a trading app created from one to three prompts: initial GUI draft (left), updated layout (right); May 2023.
Figure 4. Bard’s strange answer (June 2023).
Figure 5. Word cloud of students’ answers about adding comments to unknown C# code using ChatGPT (March 2023) [55].
Figure 6. Instruction on how to use ChatGPT in the assignment (May 2023).
Figure 7. Error message displayed by GitHub Copilot labs (June 2023).
Figure 8. AIDoctor MAUI app: (a) Notes Web Page with the prompt; (b) Notes Web Page with the query result of ChatGPT-3.5; (c) Azure Health Bot service use at a backend of the app, May 2023.
Table 1. The weights of the testFAILS components, proposed by the chatbots themselves.

Testing Components/Parameters | GPT-3.5 | Bing | GPT-4 | Bard
A. Simulated Turing Test Performance | 0.15 | 0.15 | 0.15 | 0.20
B. User Productivity and Satisfaction | 0.25 | 0.15 | 0.15 | 0.20
C. Integration into Computer Science Education | 0.10 | 0.10 | 0.10 | 0.10
D. Multilingual Text Generation | 0.15 | 0.15 | 0.10 | 0.15
E. Pair Programming Capabilities | 0.10 | 0.10 | 0.10 | 0.15
F. Bot-Based App Development and Its Success | 0.15 | 0.15 | 0.15 | 0.20
G. Security Assessment and Vulnerability Analysis | 0.10 | 0.20 | 0.25 | 0.20
Total Score | 1.00 | 1.00 | 1.00 | 1.20
Table 2. Evaluation results.

Testing Components/Parameters | GPT-3.5 | Bing | GPT-4 | Bard
A. Simulated Turing Test Performance | 0 | 0 | 0 | 0
B. User Productivity and Satisfaction | 1 | 0.5 | 1 | 0.5
C. Integration into Computer Science Education | 1 | 0 | 1 | 0
D. Multilingual Text Generation | 0.5 | 0 | 0.5 | 0.5
E. Pair Programming Capabilities | 1 | 0 | 1 | 0.5
F. Bot-Based App Development and Its Success | 1 | 0 | 1 | 0
G. Security Assessment and Vulnerability Analysis | 0.5 | 0.5 | 0.5 | 0.5
Total Score | 5.0 | 1.0 | 5.0 | 2.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
