Article

Toward the Adoption of Explainable Pre-Trained Large Language Models for Classifying Human-Written and AI-Generated Sentences

by
Luca Petrillo
1,2,*,
Fabio Martinelli
1,
Antonella Santone
3 and
Francesco Mercaldo
1,3,*
1
IIT-CNR (Institute of Informatics and Telematics), 56124 Pisa, Italy
2
IMT School for Advanced Studies Lucca, 55100 Lucca, Italy
3
Department of Medicine and Health Sciences “Vincenzo Tiberio”, University of Molise, 86100 Campobasso, Italy
*
Authors to whom correspondence should be addressed.
Electronics 2024, 13(20), 4057; https://doi.org/10.3390/electronics13204057
Submission received: 9 August 2024 / Revised: 8 October 2024 / Accepted: 10 October 2024 / Published: 15 October 2024

Abstract
Pre-trained large language models have demonstrated impressive text generation capabilities, including understanding, writing, and performing many tasks in natural language. Moreover, with time and improvements in training and text generation techniques, these models are proving efficient at generating increasingly human-like content. However, they can also be modified to generate persuasive, contextual content weaponized for malicious purposes, including disinformation and novel social engineering attacks. In this paper, we present a study on identifying human- and AI-generated content using different models. Specifically, we fine-tune several models belonging to the BERT family, an open-source version of the GPT model, ELECTRA, and XLNet, and then perform a text classification task using two labeled datasets: the first consists of 25,000 sentences generated by both AI and humans, while the second comprises 22,929 abstracts that are either ChatGPT-generated or written by humans. Furthermore, we perform an additional phase in which we submit 20 sentences generated by ChatGPT and 20 sentences randomly extracted from Wikipedia to our fine-tuned models to verify their efficiency and robustness. To understand the models' predictions, we performed an explainability phase using two sentences: one generated by the AI and one written by a human. We leveraged the integrated gradients and token importance techniques, analyzing the words and subwords of the two sentences. In the first experiment, we achieved an average accuracy of 99%, precision of 98%, recall of 99%, and F1-score of 99%. In the second experiment, we reached an average accuracy of 51%, precision of 50%, recall of 52%, and F1-score of 51%.

1. Introduction

Artificial intelligence (AI) is no longer a futuristic concept of science fiction, and it is changing the way we write. With the advent of ChatGPT (short for “Chat Generative Pre-trained Transformer”) [1], a large language model (LLM) chatbot, and many other models, the way content is created has completely changed. These models are exceptionally adept at understanding natural language and generating text that seems more accurate and human-like than that of their predecessors. These tools come in all shapes and sizes, each designed to meet specific authoring needs. In addition to text generation, many big tech companies have introduced models that can generate images from text prompts. Models such as [2,3,4] have spearheaded recent advances in AI-based image generation, making it incredibly easy to produce high-quality images in various styles using just a few words. The latest frontier in content generation is text-to-video models such as Lumiere [5], Sora, VideoPoet [6], and Emu Video [7], which use various techniques to understand the meaning and context of the input text and then generate a sequence of images that are both spatially and temporally consistent with it.
Regarding text generation, models such as GPT-4 [1], GPT-3 [8], PaLM [9], Claude [10], and many others can produce content based on user needs. Thanks to them, content can be created or reformulated to avoid plagiarism [11] or to create multiple versions of a text, and they play a key role in ensuring that the final product is polished and error-free. Overall, while LLMs have demonstrated impressive text generation capabilities, it is crucial to be aware of and address their potential downsides to ensure their responsible and ethical development and deployment. Because these models have been developed from data collected indiscriminately from the Web, they have learned biased and inaccurate representations of real-world knowledge. Indeed, they can get things wrong and present false statements as facts (a flaw known as “AI hallucination”), they can be biased, and they are often gullible when answering leading questions. They can also be tricked into creating poisonous content and are susceptible to “prompt injection attacks” [12], or they can be corrupted by manipulating the data used to train the model (a technique known as “data poisoning”).
As studies conducted after the widespread adoption of this technology have shown [13], these models can be modified to create compelling, context-aware content that can be weaponized for malicious purposes; through such use of these advanced models, malicious actors can effectively automate and scale disinformation. This adds to the risks already present in our society, aggravated by search engines that are unable to distinguish between information and disinformation sites, especially now that these models are themselves used as search engines, among many other uses. In addition, as has emerged [14], LLMs give attackers a new angle on social engineering in addition to the ability to generate content from training data. For example, psychological manipulation, targeted phishing, and crises of authenticity are among the strategies used by LLM-driven social engineers. Another important aspect is what is defined in [15] as “Scientific Misconduct”: the irresponsible use of LLMs, which can result in coherent and seemingly original generated content, including complete papers built from unreliable sources. For these reasons, it becomes very important to recognize, or at least to be warned about, content generated by an AI. Additionally, some AI detection solutions, which are freely available for mass-market use, have not performed consistently well at flagging AI-produced writing. Research [16] shows that, especially in a school setting where a student can write with the help of an AI writing tool, such detectors tend to yield variable outcomes. AI-generated content may not be of the same quality as content produced by a human. Given the differences between the two sources, distinguishing them is likely to help a user understand whether the information is credible and what type it is, which may be necessary in sensitive areas such as medicine, law, or finance. Therefore, we present a method that uses the above models to identify and detect this type of content. Specifically, we fine-tuned a set of pre-trained LLMs using two different datasets: the first is a labeled dataset containing sentences generated by both AI and humans; the second is a labeled dataset containing papers’ abstracts written by humans and generated by AI.
The paper is organized as follows: related work and the current state of the literature are presented in Section 2; Section 3 presents the methodology used in this work, along with the datasets and models used and their respective results; Section 4 presents the results on the test set and the results of the inference phase. Finally, Section 5 describes the conclusions of the work, its limitations, and the planned future research.

2. Related Work

In this section, we summarize the research in the field of NLP, text classification using pre-trained large language models, and AI and human text recognition.

2.1. NLP

The work by Malik et al. [17] is a survey that aims to assist academics, practitioners, and policymakers in navigating the complexities of online customer review analysis, with subsequent sections detailing methodology and taxonomy. The survey employed a comprehensive search strategy, including popular online review platforms, resulting in the initial identification of 1256 articles from databases like Scopus and IEEE Xplore. After rigorous screening, 154 articles were selected based on specific inclusion criteria, focusing on their contributions to NLP and online customer reviews. The taxonomy categorizes various NLP applications, including sentiment analysis, review management, and recommendation systems, which are essential for understanding their roles in online customer reviews; this classification aids in comprehending the diverse applications of NLP in analyzing customer feedback. Regarding opinion mining, which focuses on extracting subjective information from reviews, the authors analyzed studies employing deep learning and traditional algorithms to enhance sentiment extraction across various domains. They also considered advanced methods, such as aspect-based sentiment analysis and cognitive computing, that have been developed to improve the accuracy of sentiment categorization in online reviews. In the context of review analysis and management, automated systems have been introduced to detect counterfeit reviews, enhancing trust and efficiency in online marketplaces. The research highlights the importance of review length and argumentation in influencing purchasing decisions, emphasizing the need for effective review management strategies. Studies that utilize aspect-based techniques have successfully identified customer preferences in tourism and retail, demonstrating the significance of sentiment analysis in enhancing customer satisfaction. Innovative approaches, such as artificial personal shoppers and loyalty models, have been proposed to improve user engagement and satisfaction in e-commerce. Concerning user profiling and recommendation systems, many techniques have been developed to enhance personalized service recommendations by leveraging fine-grained value features and addressing data sparsity. Integrating human–computer interaction psychology into recommendation systems has been shown to improve user engagement and satisfaction. Finally, novel algorithms combining opinion mining and topic modeling have been proposed to improve seller rankings and consumer trust in e-commerce platforms. Research emphasizes the impact of verified purchase badges and sentiment analysis on product ratings and consumer perceptions, highlighting the importance of managing brand reputation. The increasing number of articles indicates a trend toward more sophisticated sentiment analysis techniques and methodologies, although the survey also outlines critical challenges in NLP for online customer reviews, including fake review detection and multimodal data integration.
This research [18] aims to explore machine learning techniques for detecting cyberbullying across various social media platforms, analyzing the datasets used and the challenges in model design. The key research questions include the overlap between definitions of traditional bullying and cyberbullying, the factors motivating online bullying, and effective feature extraction techniques for detection. The challenge of accurately labeling cyberbullying content complicates the development of practical detection algorithms. Many studies rely on biased datasets that may overlook critical features, such as semantic and demographic factors, which can limit the effectiveness of cyberbullying detection. In addition, effective preprocessing techniques, such as custom tokenization and lemmatization, are crucial for improving the quality of text data used in machine learning models. Regarding feature selection, most studies focus on content-based features, potentially neglecting significant user profile attributes that could enhance detection accuracy. Also, the dynamic nature of the language used in cyberbullying, including slang and abbreviations, necessitates a broader feature set for effective detection. The study highlights the limitations of outdated datasets in cyberbullying detection and emphasizes the need for modern data collection methods. It summarizes effective feature extraction techniques and identifies promising machine learning and deep learning methods for real-time cyberbullying detection.

2.2. LLMs Text Classification

In recent years, much work has exploited the strong language understanding capabilities of pre-trained large language models. The work of González-Carvajal et al. [19] discusses the evolution of natural language processing (NLP) methodologies, highlighting the shift from traditional linguistic approaches to machine learning and deep learning techniques such as BERT. They highlight the advantages of BERT, in particular its ability to handle large textual datasets and to adapt dynamically to different NLP tasks without the need for extensive linguistic resources. The study involved classifying tweets about real-world disasters, with BERT achieving a score of 83% and outperforming the best traditional model.
In this work [20], the authors propose the FakeBERT model, which combines BERT with deep convolutional neural networks (CNNs) to automatically learn features and improve fake news detection through enhanced semantic understanding. Experiments show that FakeBERT achieves an accuracy of 98.90%, outperforming existing benchmarks by 4%, demonstrating its effectiveness in capturing long-range dependencies in text. The study compares various deep learning models, including CNNs and LSTMs, and highlights the superior performance of the proposed FakeBERT model in classifying fake news.
The study [21] focuses on deep language representations, specifically ELMo and DistilBERT, for news text classification and sentiment analysis. It aims to evaluate the robustness of these models in cross-context scenarios, especially in socio-political news. It examines the performance of ELMo and DistilBERT in classifying socio-political news from different countries in order to assess their robustness and scalability. The study compares the performance of DistilBERT with ELMo, noting its efficiency with 66 million parameters compared to ELMo’s 93.6 million.
The paper by Xu [22] proposes the use of RoBERTa-wwm-ext, which offers performance improvements over the original BERT, particularly for Chinese text classification using whole-word masking. Key contributions include the development of four models for fine tuning RoBERTa-wwm-ext, optimized training methods, and hyperparameter adjustments to improve performance. The dataset consists of 6755 samples with sentences labeled as either containing illegal behavior (positive) or not containing illegal behavior (negative). The study successfully demonstrates the effectiveness of RoBERTa-wwm-ext in different neural network architectures for Chinese text classification.
In [23], the authors propose a method to address the challenges associated with predicting the helpfulness of online customer reviews by evaluating the performance of BERT-based classifiers against bag-of-words approaches. The study utilizes a dataset composed of 48,442 reviews from the Yelp Open dataset, categorizing reviews as helpful or unhelpful based on the number of helpful votes received. The authors fine-tuned BERT with various sequence lengths to determine their impact on model performance, focusing on optimizing the input for the BERT architecture. They also generated textual features using TF-IDF for traditional classifiers like k-NN, Naïve Bayes, and SVM, highlighting the contrast between these methods and the BERT approach. The performance evaluation results indicate that BERT classifiers generally outperform traditional methods, showcasing the advantages of using advanced deep learning techniques for this task. This study demonstrates the effectiveness of fine-tuned BERT classifiers in predicting review helpfulness, outperforming traditional bag-of-words approaches. It also contributes to the understanding of how sequence length impacts predictive performance.
All these works demonstrate the effectiveness of these models in text classification tasks. These models are designed to understand the context of words in a sentence, which allows them to capture nuances of meaning that traditional models may miss. In addition, they are better able to handle ambiguous language thanks to their ability to consider surrounding words and phrases, leading to more accurate classifications.

2.3. AI and Human Text Recognition

In Chaka’s paper [24], the detection of AI-generated content from responses produced by ChatGPT, YouChat, and Chatsonic is assessed for accuracy using five different AI content detection tools: GPTZero, OpenAI Text Classifier, Writer.com’s AI Content Detector, Copyleaks AI Content Detector, and Giant Language Model Test Room. The investigation comprised two parts: the initial synthesis of AI responses and the subsequent input of these responses into the detection tools. The author also draws attention to the shortcomings and inconsistencies of the available AI detection technologies, emphasizing how they cannot accurately and completely identify content created by AI, particularly when the content is translated. The primary shortcoming identified is that some AI-generated text was mistakenly identified as human-generated due to the tendency of the AI content detectors to generate false negatives. This tendency, referred to as the ‘misattribution’ of AI-generated text to humans, highlights the unreliability of the tools. Nevertheless, that work only tests the accuracy of free AI detectors, whereas in this paper, we propose a methodology to address the problem.
Following on from the work just described, ref. [25] shows that existing AI text detectors are not reliable in practical scenarios. The authors have constructed a suitable attack that can break watermarking and retrieval-based detectors with minimal text quality degradation. The study also discusses the potential for spoofing attacks against AI text detectors, where adversaries could infer hidden AI text signatures without having white-box access to the detection techniques. However, this work only focuses on testing existing detectors without fully discussing potential real-world implications or applications, whereas we propose an LLM-based solution that also provides validation in a real-world case.
The study by Chakraborty et al. [26] focuses on the separation of LLM-generated content from human text. It also provides evidence that this separation can be consistently achieved except in situations where machine and human text distributions are completely indistinguishable. In addition, the authors argue that as the quality of the text generated by the models increases, and thus the similarity to human text increases, the sample size analyzed increases. Also, this study focuses primarily on recognition accuracy rather than exploring the implications of misclassification or false identifications in real-world applications.
To address the importance of identifying AI-generated contexts, the study [27] introduces the idea of origin tracing in the context of LLMs. The authors offer a novel algorithm that exploits the contrastive properties between LLMs and extracts model-wise attributes to trace text origins under both white-box and black-box settings. In addition, they propose an efficient approach, called Sniffer, for tracing and detecting AI-generated contexts, which can help prevent AI misuse and model theft. However, the main challenge lies in the fact that the difficulty of tracing texts generated by powerful LLMs differs depending on the type of instructions provided; rephrased phrases are harder to trace than texts generated from summaries with varying degrees of detection difficulty. The research by Wang et al. [28] aims to address concerns about the misuse of LLMs that can produce text similar to human writing by introducing a new method, called Sequence XGPT, for detecting AI-generated text (AIGT) at the sentence level. They propose the development of a dataset containing both human-written sentences and sentences modified by LLMs, thus laying the groundwork for a sentence-level AIGT detection challenge. Overall, the contributions of the paper lie in introducing a novel approach to the sentence-level detection of AI-generated text, highlighting the limitations of existing methods in addressing this problem, and demonstrating the effectiveness of Sequence XGPT in outperforming previous techniques and demonstrating strong adaptability.
The paper [29] outlines the need for detection methods that accurately identify machine-generated text while minimizing false positives, which could unjustly label human-written content as machine-generated. This work categorized various detection methods based on their reliance on language models, including logistic regression classifiers and feature-based support vector machines, each with distinct advantages and limitations. The findings suggest that different sampling methods produce distinct flaws in generated text, indicating the need for specialized classifiers for each method. Unlike that work, which uses classical machine learning models, we use pre-trained LLMs, especially those based on transformer architectures, which can capture long-range dependencies and contextual information in text; this allows them to understand the meaning of words based on their context, which is often crucial in language tasks. While the paper emphasizes the importance of accuracy and the area under the curve (AUC) as evaluation metrics, it acknowledges that accuracy alone can be misleading, especially in cases of class skew. We instead evaluate a broader set of metrics and demonstrate the efficacy of our approach.
Miao et al. [30] introduce a novel approach to improve the efficiency of detecting LLM-generated texts by optimizing the query process used in existing methods like DetectGPT, which is computationally expensive. The proposed method significantly enhances detection performance while reducing the number of queries needed by leveraging a Gaussian process model for sample selection and scoring. The proposed method can detect content generated from models like LLaMA, Vicuna, and GPT-2. The findings contribute significantly to the ongoing efforts to improve the detection of machine-generated texts in various contexts. However, it is fair to say that evaluating a text sample’s accuracy relies heavily on the surrogate model’s predictive score accuracy. The downside of such a case is that inaccurate surrogate models would result in inaccurate metrics of what is typical and affect the overall detection performance. Instead, we fine-tune our models using a supervised approach with a labeled dataset, ensuring they learn from high-quality examples.
The paper [31] explores the transformative role of Generative AI, particularly large language models (LLMs), in creating human-like content across various media, raising concerns about misinformation and copyright issues. The study aims to develop a method for recognizing AI-generated sentences using LLMs, leveraging a labeled dataset of both human and AI-generated sentences. The study fine-tunes several LLMs, including BERT, RoBERTa, DistilBERT, and ALBERT, to enhance their performance in classifying human versus AI-generated text. The paper employs PEFT techniques, specifically LoRA, to reduce the number of trainable parameters and computational requirements during fine tuning. This approach yielded high accuracy and precision rates, demonstrating the effectiveness of PEFT in low-data scenarios. As a result, the authors successfully fine-tuned multiple LLMs to distinguish between human and AI-generated text, achieving high accuracy and precision metrics.

3. Method

As mentioned above, this work aims to leverage multiple pre-trained LLMs to perform a text classification task. Figure 1 and Figure 2 show a simplified workflow of our approach. Specifically, we use a labeled dataset to fine-tune these models to identify whether an AI or a human generated a given sentence. In addition, we used another dataset composed of papers’ abstracts written by humans and generated by AI to diversify the data used for fine tuning the models. Also, to test the effectiveness of our approach, we performed a further phase in which we submitted 20 sentences generated by ChatGPT and 20 sentences randomly extracted from Wikipedia (https://www.wikipedia.org/, accessed on 8 August 2024) to the models fine-tuned on the dataset of AI-generated sentences. Finally, as shown in Figure 3, we performed a preliminary explainability phase using the model that achieved the best results in identifying AI-generated sentences.

3.1. Models Involved

In this section, we present the models used to perform our experiments. As mentioned above, in this work, we rely on several pre-trained LLMs to achieve the given goal. In particular, we fine-tuned several models belonging to the BERT family, one belonging to the GPT family, ELECTRA, and XLNet.
Regarding the BERT family, we fine-tuned a BERT model [32], which is a pre-trained language model developed by Google. It is a transformer-based neural network architecture designed for various natural language processing (NLP) tasks. This model has been trained on a massive corpus of text data and uses masked language modeling to understand the context and word associations in sentences. Thus, it can detect subtle linguistic phenomena and improve downstream NLP tasks without requiring task-specific training datasets.
A smaller, faster, and more efficient version of BERT is called DistilBERT [33]. It is a pre-trained language representation model that applies knowledge distillation at the pre-training stage to reduce the size of the model while retaining much of its ability to understand language. DistilBERT is designed to be computationally cheaper than the original model, allowing it to be deployed on edge devices or in environments with limited computing power. It has been shown to be 60% faster, and despite its smaller size, DistilBERT retains approximately 97% of BERT’s language understanding capabilities, ensuring that it can still perform well on a wide range of NLP tasks.
By modifying key hyperparameters and training techniques, RoBERTa (Robustly Optimized BERT Pre-training Approach) [34] is a pre-trained language model that extends BERT. In particular, the authors found that the training was highly sensitive to the Adam epsilon term. In certain cases, they could increase stability or performance by fine tuning it. The improved language understanding is also a result of training on a larger corpus of text data, including books and online pages. In addition, the authors use dynamic masking, which changes the masked tokens during training, rather than using the same tokens for each epoch. Finally, RoBERTa has achieved state-of-the-art results on several benchmarks, including SQuAD, RACE, and GLUE.
A pre-trained large language model called GPT-Neo [35] is built on the GPT (Generative Pre-trained Transformer) architecture. It is an open-source alternative to OpenAI’s GPT-3 model, developed by EleutherAI, which also built a large curated dataset called the Pile [36] specifically to train it. The model was trained as an autoregressive language model using cross-entropy loss. GPT-Neo creates accurate and adaptable natural language processing (NLP) models by pre-training a transformer-based neural network architecture on large amounts of data. When tuned with the right hyperparameters, it has been shown to achieve competitive accuracy on several commonsense reasoning benchmark tasks.
The ELECTRA [37] (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) model is a pre-trained language model developed to improve the efficiency and accuracy of natural language processing (NLP) tasks. To improve computational efficiency and to allow for useful additional pre-training, it replaces the conventional masked language modeling approach used in BERT with a replaced token detection technique: a small generator network substitutes some input tokens with plausible alternatives, and the main model (the discriminator) is trained to recognize which tokens have been replaced. Because of this replaced token detection mechanism, ELECTRA’s training process is more sample-efficient than BERT’s, making it more suitable for large-scale pre-training and fine tuning. With an accuracy of 93% on the IMDB dataset, ELECTRA outperformed competing transformer models such as BERT, XLNet, and RoBERTa in sentiment analysis, which is one of the state-of-the-art results it has achieved in a variety of NLP tasks.
Another model used in this work is XLNet [38], which is a pre-trained language model that aims to overcome the drawbacks of BERT. To bridge the gap between pre-training and fine tuning, it employs a generalized autoregressive pre-training strategy that facilitates learning bidirectional contexts. XLNet maximizes the expected likelihood over all permutations of the factorization order, which is a different pre-training strategy than BERT. This avoids the drawbacks of BERT and allows XLNet to learn bidirectional contexts. XLNet has been shown to significantly outperform BERT on various NLP tasks, including question answering, sentiment analysis, natural language inference, and document ranking.
The final model involved in this work is DistilRoBERTa. It is a smaller, faster, and lighter version of the RoBERTa model, which is a variant of the BERT architecture. DistilRoBERTa is designed to retain much of RoBERTa’s performance while being more efficient in terms of computational resources and speed. The latter is created using a process called knowledge distillation, where a smaller model is trained to replicate the behavior of a larger model. This allows DistilRoBERTa to learn from the more complex RoBERTa model while being more lightweight. It maintains the transformer architecture of its parent model, which includes self-attention mechanisms and feed-forward neural networks, allowing it to understand context and relationships in text effectively. This model typically has fewer parameters than RoBERTa, which results in faster inference times and reduced memory usage, making it more accessible for deployment in resource-constrained environments.

3.2. Human and AI-Generated Sentences Dataset

As mentioned above, we used a dataset consisting of both human-generated and AI-generated sentences. This dataset (https://huggingface.co/datasets/andythetechnerd03/AI-human-text, accessed on 8 August 2024) is mainly composed of English sentences collected from different sources, merged, and deduplicated. The labeling is based on whether an AI model or a human created the content. AI-generated text is produced using large language models like GPT, while human-authored texts come from various sources, such as online articles, essays, and written compositions. These texts come from verifiable sources, ensuring the correctness of the annotations. In addition, the dataset contains a large and diverse set of examples, which helps prevent bias and ensures the model can generalize effectively across different types of texts. Specifically, the dataset consists of 25,000 sentences, which we divided into training, test, and validation sets using an 80/10/10 split. Table 1 shows the number of elements obtained after this division: 20,000 sentences for the training set and 2500 each for the test and validation sets. Table 2 provides further information about the dataset in terms of the maximum, minimum, and average number of words; the average sentence length is 379 words. Table 3 shows an example of a sentence generated by a human and one generated by an AI.
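For reproducibility, the sketch below shows how such an 80/10/10 split can be obtained with the Hugging Face datasets library; the file name and column names are illustrative assumptions rather than the exact identifiers used in our pipeline.
from datasets import load_dataset

# Hypothetical CSV export of the dataset with "text" and "label" columns.
raw = load_dataset("csv", data_files="ai_human_sentences.csv")["train"]

# First hold out 20% of the data, then split that half into validation and test.
split = raw.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = {
    "train": split["train"],          # ~20,000 sentences (80%)
    "validation": held_out["train"],  # ~2500 sentences (10%)
    "test": held_out["test"],         # ~2500 sentences (10%)
}
print({name: len(ds) for name, ds in dataset.items()})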

3.3. AI-Generated Abstracts Dataset

In this work, we also used a labeled dataset (https://github.com/panagiotisanagnostou/ai-ga, accessed on 7 September 2024) from the study of Panagiotis et al. [39], consisting of abstracts generated with the GPT-3 model, specifically the Davinci variant, which is known for producing high-quality text. This dataset was designed to include abstracts that mimic human-written scientific papers. The final dataset comprises 28,662 entries: 14,331 human-written abstracts and 14,331 AI-generated abstracts. Each AI-generated abstract corresponds to a title from the human-written abstracts, ensuring a direct comparison between the two types of texts. The authors created a specific prompt type to make the generated text more creative and novel. They also set the model to avoid using words or phrases already present in the prompt or in previously generated text in order to obtain more unique responses. The human-written abstracts were sourced from the CORD-19 dataset, which contains a vast collection of academic papers related to COVID-19. The authors randomly selected a subset of 14,331 English-language papers from this dataset, focusing on titles and abstracts. As shown in Table 4, we split it using an 80/10/10 split, resulting in 22,929 abstracts for the training set, 2866 for the validation set, and 2867 for the test set. In addition, Table 5 reports the general statistics of the dataset: unlike the dataset in Section 3.2, where the maximum sentence length reaches 1422 words, here, since we are dealing with abstracts, the maximum length reaches 18,000 words, making the fine tuning of the models more complex.
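The length statistics reported in Tables 2 and 5 (maximum, minimum, and average number of words) can be computed with a small helper such as the following; texts is assumed to be the list of sentences or abstracts of either dataset.
def word_length_stats(texts):
    # Word counts obtained with a simple whitespace split.
    lengths = [len(t.split()) for t in texts]
    return {
        "max_words": max(lengths),
        "min_words": min(lengths),
        "avg_words": sum(lengths) / len(lengths),
    }

# Example (assuming the split from Section 3.2): word_length_stats(dataset["train"]["text"])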

3.4. Models

As explained above, our work is based on fine tuning a set of pre-trained LLMs, specifically several models belonging to the BERT family, an open-source version of the GPT model, ELECTRA, and XLNet. This technique was preferred because it allows the models to perform well on given tasks. In fact, models can perform much better when trained on a small, task-specific dataset, as they are able to capture the details and features relevant to that specific task domain. In addition, it is often more efficient to fine-tune an existing model than to train one from scratch: many language-related constructions and structures are already embedded in the pre-trained models, making the fine-tuning process less data- and computationally intensive. Fine tuning can also help mitigate overfitting by leveraging the general knowledge encoded in the pre-trained model while still allowing it to adapt to the specifics of the new dataset. All models were trained for five epochs with an initial learning rate of 0.00002 and a weight decay of 0.01. The maximum sequence length during training was set to 512 tokens for all models, meaning that the total number of input tokens should not exceed 512 for the text to be processed effectively; GPT-Neo, however, can accept inputs of up to 2048 tokens.
Tokenization. All of these models require the input text to be converted into a specific format before training can begin. The first step is tokenization: models such as BERT, DistilBERT, and ELECTRA use WordPiece [40] tokenization, a form of subword tokenization that breaks words down into subwords; if a word is not in the vocabulary, it is decomposed into the smallest subword units that are. XLNet, on the other hand, uses SentencePiece [41] tokenization, another form of subword tokenization, which relies on a unigram language model to learn the most likely subword units. RoBERTa, GPT-Neo, and DistilRoBERTa use Byte-Pair Encoding [42] (BPE) tokenization, a form of subword tokenization that operates at the byte level and iteratively merges the most frequent pairs of symbols into new vocabulary entries. The next step is to add special tokens to the tokenized input: tokens such as ‘[CLS]’ (classification token), ‘[SEP]’ (separator token), and ‘[MASK]’ (mask token) mark the beginning and end of sentences and support masked language modeling during pre-training.
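The following sketch illustrates the different tokenization schemes by running one checkpoint per family through the Hugging Face AutoTokenizer; the example sentence is arbitrary, and the special tokens printed differ per model (e.g., [CLS]/[SEP] for BERT, <s>/</s> for RoBERTa).
from transformers import AutoTokenizer

sentence = "Renewable energy sources have a minimal environmental impact."
for checkpoint in ["bert-base-uncased", "xlnet-base-cased", "roberta-base"]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    # Subword pieces produced by WordPiece, SentencePiece, and byte-level BPE, respectively.
    print(checkpoint, tok.tokenize(sentence))
    # Full encoding, including the model-specific special tokens.
    print(tok.convert_ids_to_tokens(tok(sentence)["input_ids"]))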
Fine tuning. After pre-processing the data to better fit the models, we fine-tuned them. LLMs are pre-trained on vast amounts of data and can then be improved for specific tasks through fine tuning. To enhance performance on a given task, fine tuning adjusts the parameters of the model on a smaller dataset relevant to the target task (text classification in this case study). In this work, we adopted bert-base-uncased, distilbert-base-uncased, roberta-base, gpt-neo-125m, electra-base-generator, xlnet-base-cased, and distilroberta-base.
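A minimal sketch of this fine-tuning setup is given below, assuming the hyperparameters reported above (5 epochs, learning rate of 2 × 10⁻⁵, weight decay of 0.01, 512-token inputs) and the dataset split from Section 3.2; dataset objects, label column names, and output paths are illustrative, and the exact training code we used may differ.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoints = ["bert-base-uncased", "distilbert-base-uncased", "roberta-base",
               "EleutherAI/gpt-neo-125m", "google/electra-base-generator",
               "xlnet-base-cased", "distilroberta-base"]

def fine_tune(checkpoint, train_ds, val_ds):
    # train_ds/val_ds are assumed to expose "text" and "label" columns.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    if tokenizer.pad_token is None:  # GPT-Neo has no padding token by default
        tokenizer.pad_token = tokenizer.eos_token

    def encode(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_ds = train_ds.map(encode, batched=True)
    val_ds = val_ds.map(encode, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    model.config.pad_token_id = tokenizer.pad_token_id

    args = TrainingArguments(
        output_dir=f"./{checkpoint.split('/')[-1]}-ai-detector",
        num_train_epochs=5,
        learning_rate=2e-5,
        weight_decay=0.01,
        evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=val_ds, tokenizer=tokenizer)
    trainer.train()
    return trainer
Padding is handled by the default data collator once the tokenizer is passed to the Trainer.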

3.4.1. Human and AI-Generated Sentences Detection

Table 6 provides a summary of the hyperparameters used to fine-tune the models. As can be seen, the model that took the most time is “gpt-neo” (about 198 h), one of the most challenging to tune due to its 125 million parameters, followed by “xlnet” with about 110 million parameters. As noted above, these models were fine-tuned for 5 epochs, with an evaluation on the validation set after each epoch. Figure 4, Figure 5, Figure 6 and Figure 7 and Table 7, Table 8, Table 9 and Table 10 show the results of the various metrics computed for each model. In particular, Figure 4 and Figure 6 show that models such as bert, distilbert, and xlnet exhibit fluctuations in accuracy and precision between epochs, which might suggest possible overfitting. One possible cause is that the models oscillated between different local optima, so the optimization algorithm did not converge to a single optimal solution. Nevertheless, all models maintain a consistent recall across epochs and are able to correctly identify positive instances of the dataset, which is a desirable property for a classification model. In addition, the balanced dataset should provide further assurance that the models have not been overfitted.
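The per-epoch metrics reported in Tables 7–10 (accuracy, precision, recall, and F1-score) can be computed with a function such as the one below, passed to the Trainer through its compute_metrics argument; this is a sketch of the standard scikit-learn pattern rather than our exact evaluation code.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Binary setting: class 1 is assumed to be "AI-generated".
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}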

3.4.2. AI-Generated Abstracts Detection

In Table 11, we only report the execution time needed to fine-tune the models on the AI-generated abstracts dataset described in Section 3.3, since we used the same hyperparameters described in Section 3.4.1. As can be seen, the fine-tuning time scaled with the size of the training set: for example, the electra-base-generator model took about 4 h for the fine-tuning process on the 20,000 samples of the first training set, while the second experiment took about 7 h on about 23,000 samples. In Figure 8, Figure 9, Figure 10 and Figure 11 and Table 12, Table 13, Table 14 and Table 15, it is possible to consult the various metrics computed during the 5 epochs of fine tuning. As can be seen, almost all models show a stagnant trend; this stagnation in performance is often due to the model becoming stuck in a saddle point or a local minimum. This occurs when the loss function does not improve further, particularly if the learning rate is too low, making it difficult for the model to escape these problematic areas.

3.5. Explainability

With large language models (LLMs), explainability in text classification concerns the tools and strategies that explain a model’s predictions and classifications. Given that LLMs such as GPT-3 or BERT are often categorized as “black boxes”, interpreting their decisions is problematic, as it is hard to comprehend how they arrived at a given output. Explainability can help, for example, identify which words or phrases in the input text most significantly influenced the model’s decision, or provide a visual representation of each word’s contribution to the final classification, showing which elements were most impactful. Integrated gradients are a technique used to attribute the output of a model to its input features, providing insights into which parts of the input contribute most to the model’s predictions. In this case, integrated gradients can help explain how specific words or phrases in a text influence the classification outcome: by looking at which words or tokens receive the highest attributions, it is possible to see how the model interprets the input text. This method can help identify whether the model relies on appropriate or inappropriate patterns (e.g., it could over-rely on specific tokens due to biases in the training data) and, consequently, whether the models generalize well or instead show signs of overfitting or dataset bias. Based on this type of analysis, a targeted feature engineering phase can also be performed, augmenting the dataset with additional examples that do not include these high-impact tokens to ensure that the model generalizes beyond those features. Token importance, meanwhile, refers to the significance or contribution of individual tokens (words or subwords) in a sentence to the model’s final classification decision. Understanding token importance is crucial for interpreting model predictions, improving model performance, and ensuring fairness and transparency in AI systems. Token importance quantifies how much each token in a sentence influences the model’s output; in a sentence classification task, this could mean determining which words are most responsible for classifying the sentence into a particular category. Through token importance, it is possible to assess model sensitivity, for example, revealing whether a small number of words disproportionately impacts predictions; this might prompt retraining the model with balanced datasets or using data augmentation strategies to minimize reliance on these tokens. Another aspect is the robustness of the models: token importance can help to understand whether they are sensitive to small changes and, based on that, whether to retrain them with adversarial examples or data perturbations so that they generalize better.
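As an illustration, the sketch below computes integrated-gradients attributions for the fine-tuned RoBERTa classifier with the Captum library; the checkpoint path is a placeholder, and target=1 is assumed to be the “AI-generated” class.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "./roberta-base-ai-detector"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).eval()

def forward_fn(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("Renewable energy sources have a minimal environmental impact.",
                return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

lig = LayerIntegratedGradients(forward_fn, model.roberta.embeddings)
attributions, delta = lig.attribute(inputs=enc["input_ids"], baselines=baseline,
                                    additional_forward_args=(enc["attention_mask"],),
                                    target=1, return_convergence_delta=True)

# Collapse the embedding dimension to obtain one importance score per token.
scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for token, score in sorted(zip(tokens, scores.tolist()), key=lambda p: -abs(p[1]))[:10]:
    print(f"{token:>12s} {score:+.4f}")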

4. Results

This section presents the results obtained after fine tuning the models described above, on the two datasets presented in Section 3.2 and Section 3.3.

4.1. Human and AI-Generated Sentences Detection Results

Regarding the identification of sentences generated by humans and AI, Table 16 shows the metrics computed on the test set after the fine-tuning process. As can be seen, these results fully confirm those obtained during the five training epochs, which is further confirmation that the models do not overfit but are able to correctly identify both human- and AI-generated content. The models that show the best performance are “roberta” and “gpt-neo”, the larger models with about 123 and 125 million parameters, respectively. Figure 12, Figure 13, Figure 14 and Figure 15 show the confusion matrices, also computed on the test set, and indicate that overall, all models perform effectively. For example, “electra” (see Figure 14a) produced 11 false positives (sentences identified as human-written but generated by AI) and only 9 false negatives. A notable result concerns the “bert” and “distilroberta” models (Figure 12a and Figure 15), which produce only 3 false negatives (sentences predicted as AI-generated but written by a human).
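The confusion matrices in Figures 12–15 can be reproduced with a sketch along the following lines, where trainer and the tokenized test set test_ds are assumed to come from the fine-tuning sketch in Section 3.4.
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

predictions = trainer.predict(test_ds)
preds = np.argmax(predictions.predictions, axis=-1)
# Label order is an assumption: 0 = human-written, 1 = AI-generated.
ConfusionMatrixDisplay.from_predictions(predictions.label_ids, preds,
                                        display_labels=["human", "AI"])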

4.2. AI-Generated Abstracts Detection Results

In the context of identifying AI-generated abstracts, we report the results obtained on the test set after the five epochs of fine tuning. Table 17 shows the details of this process; as can be seen, these results are not as promising as the previous ones. For completeness and to confirm this, we also report the confusion matrices in Figure 16, Figure 17, Figure 18 and Figure 19. One possible explanation for these results is that the task is more challenging than expected, and the models struggle to capture the nuances between human-written and AI-generated abstracts. The fact that these abstracts have an average length of about 191 words, with the longest abstract being 18,000 words, could contribute to the models’ poor performance: the models may struggle to effectively process and learn from such long sequences of text, leading to suboptimal results. In addition, the learning rate, weight decay, and number of epochs used may not be optimal for this specific task.

4.3. Model Inference

As mentioned above, after the fine-tuning and evaluation steps, we conducted an additional phase to validate the effectiveness and robustness of the presented models. We extracted 20 random essays from Wikipedia as human-written sentences and generated 20 random essays with ChatGPT as AI sentences. Listing 1 shows an example of these essays. We then submitted these sentences to the models fine-tuned on the dataset presented in Section 3.2; the results are shown in Table 18. This further analysis confirms the results obtained on the validation and test sets; i.e., the models with the best performance in those two phases are also the best here (“roberta” and “gpt-neo”). As can be seen, all the models are able to identify the AI-generated sentences perfectly. In contrast, the models show a decrease in performance on the human-written sentences, with “roberta” and “gpt-neo” identifying 15 and 14 sentences, respectively, out of the 20 submitted. It is also interesting to note that the overall performance of the models decreases slightly when sentences shorter than the average sentence length of the dataset (about 380 words) are submitted. However, we decided to keep this type of sentence in order to guarantee as heterogeneous a dataset as possible, without eliminating possible outliers such as sentences that are too short or too long. In practice, sentences that are too short turn out to be too complicated to identify from a model perspective, although the models remain able to fully recognize those generated by AI.
Listing 1. Example of an essay generated by ChatGPT and used to test the models and an example of a random essay extrapolated from Wikipedia and submitted to the models.
{
"ChatGPT": "the role of ethics in technology development is an increasingly important topic as technological advancements continue to accelerate. Ethics in technology involves considering the moral implications of new technologies and ensuring that their development and use align with societal values and principles. One critical area of ethical concern is privacy..."
------------------------------------------------------------------
"Wikipedia": "the following events occurred in june, wednesday, texas began the little school program pioneered by felix tijerina in schools statewide. The program designed to teach spanishspeaking preschoolers essential english words for a head start in the first grade enrolled children at its start television was introduced to new zealand as broadcasts started in auckland on aktv channel at pm and continued until the night. The first program was an episode of the adventures of robin hood..."
}
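A sketch of this inference step is shown below, using the Hugging Face text-classification pipeline; the checkpoint path, the predicted label names, and the chatgpt_essays/wikipedia_essays lists (20 texts each) are illustrative assumptions.
from transformers import pipeline

detector = pipeline("text-classification", model="./roberta-base-ai-detector")  # hypothetical path

essays = {"ChatGPT": chatgpt_essays, "Wikipedia": wikipedia_essays}  # 20 essays per source
for source, texts in essays.items():
    # Long essays are truncated to the 512-token limit used during fine tuning.
    for pred in detector(texts, truncation=True, max_length=512):
        print(source, pred["label"], round(pred["score"], 3))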

4.4. Existing Tools Comparison

Based on the inference phase described above, we conducted an additional comparison, testing five random AI-generated sentences extracted from the dataset on publicly available AI detectors. Specifically, we tested GPTZero (https://gptzero.me/, accessed on 19 September 2024), Crossplag (https://crossplag.com/, accessed on 19 September 2024), Copyleaks (https://copyleaks.com/ai-content-detector, accessed on 19 September 2024), and ZeroGPT (https://www.zerogpt.com/, accessed on 19 September 2024). Table 19 reports the results of this phase. As can be seen, Crossplag is the only detector able to recognize all the submitted sentences, while Copyleaks was only able to recognize the last sentence that we submitted. ZeroGPT was not able to recognize any of the sentences because the tool requires longer inputs. Regarding Crossplag, we also found that one of its heuristics for distinguishing human and AI content is checking for grammatical errors; in fact, adding extra spaces between the words of a sentence lowers the tool’s prediction performance. Figure 20 and Figure 21 show screenshots of the results obtained by submitting the sentences in Table 20 to GPTZero and ZeroGPT.

4.5. Explainability Results

As mentioned in Section 3.5, this work leverages explainability to understand the results obtained. We used two techniques: integrated gradients and token importance. To conduct this phase, we employed the version of RoBERTa fine-tuned on the dataset of AI-generated sentences, since this model achieved the best performance on both the test set and the inference phase presented in Section 4.1 and Section 4.3. Figure 22 shows the bar chart of the integrated gradients for the AI sentence shown in Listing 2 (for the sake of brevity, we include only part of the sentence). We plotted the ten tokens with the highest values; in this case, the token “sign” mainly influenced the model prediction, followed by the token “fossil”. Figure 23 shows the results of the token importance phase. As explained in the tokenization paragraph of Section 3.4, this model uses Byte-Pair Encoding (BPE) for tokenization; this subword tokenization technique helps handle out-of-vocabulary words by breaking them down into smaller units. For this reason, the figure shows the “othermal” token, a subword of “geothermal”, which is one of the subwords that most influences the model’s prediction. In conjunction with these analyses, we also examined the explainability of a sentence created by a human, again relying on the same RoBERTa model. Figure 24 shows the results of the integrated gradients calculated on the sentence shown in Listing 3 (again, we report only part of the sentence) with the highest-scoring tokens. Here, the token “think” has the highest gradient value (1.4901 × 10⁻⁸), indicating a relatively stronger influence on the model’s output than the other tokens. The other tokens, such as “help”, “dad”, and “opinion”, have lower gradient values, suggesting they contribute less to the model’s decision. Regarding token importance, where subwords are also analyzed, Figure 25 shows that the subword “our” has a score of 0.4028, indicating a solid connection to the collective experience of seeking advice from friends or family. From Listing 3, it is possible to notice that the token “dour” appears several times within the sentence, influencing the model’s decision.
Listing 2. Example of a sentence generated by ChatGPT, submitted to the model, and used for the explainability process.
"The significance of renewable energy in addressing climate change and promoting sustainability cannot be overstated. As the world faces the urgent need to reduce greenhouse gas emissions, renewable energy sources offer a viable and sustainable solution. Renewable energy sources, such as solar, wind, hydro, and geothermal, are derived from natural processes that are replenished constantly. Unlike fossil fuels, which are finite and produce significant carbon emissions, renewable energy sources have a minimal environmental impact. One of the primary benefits of renewable energy is its potential to reduce carbon emissions. Solar and wind power, for example, produce electricity without burning fossil fuels, reducing the amount of carbon dioxide and other greenhouse gases released into the atmosphere. This is crucial for mitigating climate change and limiting global temperature rise."
Listing 3. Example of a sentence created by a human, submitted to the model, and used for the explainability process.
"Advice is important to some people and most of the time the ask more than one person there are 3 different reasons to who people ask for advice first some people ask what to do in a situation and pick the best answer the second reason is that the might need dour opinion on something the last reason is that some people ask for it because it might help get through hard times to make life easier when people are in a situation the tend to ask other people for advice so the can fix a problem an example of this is you ask three of dour friends md tv is broken what should i do the first friends ads try to fix it then you ask the second friend and heshe says call somebody to fix it for you then the last friends ads i could help you i fixed a tv before now that you have some choices you can pick the best one if the one you picked that you think is best fails then you can try 2 other choices or ask for more advice opinions can be..."
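The bar charts in Figures 22–25 can be reproduced with a short matplotlib sketch such as the following, assuming the tokens and scores variables produced by the integrated-gradients sketch in Section 3.5.
import matplotlib.pyplot as plt

# Ten tokens with the largest absolute attribution scores.
top = sorted(zip(tokens, scores.tolist()), key=lambda p: abs(p[1]), reverse=True)[:10]
labels, values = zip(*top)
plt.barh(labels, values)
plt.xlabel("Attribution score")
plt.title("Top-10 token attributions (integrated gradients)")
plt.tight_layout()
plt.show()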

5. Conclusions and Future Work

The rapid advancement of generative artificial intelligence (GAI) has led to the proliferation of content not generated by human beings, spreading across various domains, including journalism, marketing, and creative writing. As AI systems become increasingly sophisticated, distinguishing between AI-generated and human-generated sentences has significant implications for society, ethics, and communication. The main concern is the inability to discern original from synthetic content and, consequently, the erosion of trust in information sources. For this reason, users need to develop a critical awareness of the content they are consuming and producing. Promoting transparency in AI-generated content, for example, through the work presented here, establishing clear guidelines for attribution, and fostering discussions about the ethical use of AI are crucial steps in navigating this complex landscape.
In this paper, we present a study on identifying human- and AI-generated sentences using different LLMs. In particular, using two datasets, we fine-tuned several pre-trained LLMs for this task. The first dataset is composed of 25,000 sentences generated by humans and AI, while the second comprises 28,662 abstracts written by humans or generated by AI. We trained all the presented models for 5 epochs and obtained very strong results in terms of accuracy, precision, recall, and F1-score on (for simplicity) the AI-sentence dataset. Achieving these results shows that the presented models successfully learned the patterns in the data and performed the given task optimally. On (for simplicity) the AI-abstract dataset, by contrast, the models achieved a maximum accuracy of 49%, which is comparable to random guessing. Despite this result, this work can have a practical impact on real life. These models can be integrated into existing tools as an indicator of the veracity of a sentence, which a human user must validate in any case. Alternatively, they can be integrated into social media platforms and used to alert users when specific content is AI-generated, thereby helping to counter the problem of misinformation.
The results obtained from the experiments on the second dataset suggest that the models struggle to distinguish these two categories effectively. Some possible causes are that the abstracts are generally longer than the sentences, which can introduce noise and make it harder for the models to learn distinguishing features, and that abstracts can closely mimic human writing styles, making identification more difficult. Regarding the models themselves, the fine-tuning configuration (e.g., number of epochs, learning rate, and weight decay) may not have been optimal; some models may require more epochs or different values for the other hyperparameters.
For these reasons, as future work, we plan to conduct a frequency analysis to identify common patterns and outliers. We will analyze the distribution of text length to identify unusual patterns and calculate the frequency distribution of words. We will also use word clouds to visualize the frequency of words in the data and a heat map to visualize the correlation between different metrics. In addition to the analysis above, we aim to improve the approach used with these models. Specifically, we plan to perform a hyperparameter tuning phase; this will allow us to control the learning process of the models through settings such as the learning rate, the number of training epochs, and the batch size. Another important hyperparameter to tune is the weight decay, which helps stabilize the optimization dynamics and acts as an implicit regularizer favoring models with smaller weights, thus preventing overfitting. A further hyperparameter to consider is the dropout rate, i.e., the probability of neurons randomly dropping out during training; this technique could help us avoid overfitting by reducing the complexity of the models presented in this work. Another aspect to consider for an advanced approach is preprocessing the abstracts to remove irrelevant information or noise that might confuse the models. Lastly, one possible improvement could be adopting models that can handle long documents, making them more suitable for this dataset, or using an ensemble of different models to leverage their strengths and combine their predictions, improving overall performance. Another contribution we want to add in future work is the explainability of the models. In particular, we intend to improve the techniques used by exploring other interpretability methods, such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations), to gain different perspectives on feature importance. Additionally, we can improve the presented bar chart by adding error bars to represent variability or uncertainty in the importance scores, and we could add color coding or annotations to highlight significant tokens or patterns in the data. Other visualization techniques, such as heatmaps or word clouds, could represent token importance more intuitively. Finally, we could add attention visualization, a technique for understanding how the models process the input text: it involves visualizing the attention weights assigned to each word or subword in the input sequence to identify the features that most influence the output.
Regarding the sentence-identification dataset, even though we demonstrated promising results with the described models, several aspects still deserve deeper evaluation. Since research in generative AI continues to evolve and to produce increasingly human-like content, the models must be updated regularly with new content to keep pace with advances in this field. Another extension that would expand the potential of this work, and that we plan to pursue, is to determine not only whether a given sentence is written by a human or generated by AI, but also whether original human content has been rephrased by AI, thus turning the problem into a multi-class classification task (a minimal sketch of this setup follows). This challenge is a new opportunity to develop tools that enhance plagiarism detection, preserve authenticity and user trust, and thereby protect creators’ intellectual property.
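To illustrate how the planned multi-class formulation could be wired up, the following is a minimal sketch that extends the binary fine-tuning setup to three labels (human-written, AI-generated, AI-rephrased). The label names and the roberta-base checkpoint are assumptions used only to show the required change; the rest of the fine-tuning pipeline (tokenization, Trainer, metrics) would stay the same, provided the dataset labels cover the third class.

```python
# Minimal sketch: switching the classification head from two to three classes
# (human-written, AI-generated, AI-rephrased). Label names are assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

id2label = {0: "human", 1: "ai_generated", 2: "ai_rephrased"}
label2id = {name: idx for idx, name in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,        # the classification head now has three outputs
    id2label=id2label,
    label2id=label2id,
)

print(model.config.num_labels, model.config.id2label)
```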

Author Contributions

Conceptualization, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Methodology, L.P., F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Software, L.P. and F.M. (Francesco Mercaldo); Validation, L.P., A.S. and F.M. (Francesco Mercaldo); Formal analysis, F.M. (Fabio Martinelli); Investigation, L.P. and F.M. (Francesco Mercaldo); Resources, F.M. (Fabio Martinelli) and F.M. (Francesco Mercaldo); Data curation, L.P.; Writing—original draft, L.P. and F.M. (Francesco Mercaldo); Writing—review & editing, F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Supervision, F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Project administration, F.M. (Fabio Martinelli), A.S. and F.M. (Francesco Mercaldo); Funding acquisition, F.M. (Fabio Martinelli) and A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by EU DUCA, EU CyberSecPro, SYNAPSE, PTR 22-24 P2.01 (Cybersecurity) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the EU—NextGenerationEU projects, by MUR–REASONING: foRmal mEthods for computAtional analySis for diagnOsis and progNosis in imagING—PRIN, e-DAI (Digital ecosystem for integrated analysis of heterogeneous health data related to high-impact diseases: innovative model of care and research), Health Operational Plan, FSC 2014-2020, PRIN-MUR-Ministry of Health, the National Plan for NRRP Complementary Investments D^3 4 Health: Digital Driven Diagnostics, prognostics and therapeutics for sustainable Health care, Progetto MolisCTe, Ministero delle Imprese e del Made in Italy, Italy, CUP: D33B22000060001 and FORESEEN: FORmal mEthodS for attack dEtEction in autonomous driviNg systems CUP N.P2022WYAEW.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Workflow of the method, illustrating the process of fine-tuning the various pre-trained LLMs using a labeled dataset, followed by a validation phase to assess model performance and generalization.
Figure 2. In the inference phase, we generated 20 sentences using ChatGPT and extracted 20 random sentences from Wikipedia to evaluate our fine-tuned models for classifying AI-generated or human-written sentences.
Figure 3. Overview of the explainability process in fine-tuned models, illustrating the submission of a sentence and the subsequent generation of an explanation detailing the rationale behind the model’s output.
Figure 4. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for bert-base-uncased and distilbert-base-uncased.
Figure 5. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for roberta-base and gpt-neo-125m. The accompanying table presents the detailed metrics for each epoch, highlighting the performance over time.
Figure 6. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for electra-base-generator and xlnet-base-cased. The accompanying table presents the detailed metrics for each epoch, highlighting the performance over time.
Figure 7. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for distilroberta-base.
Figure 8. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for bert-base-uncased and distilbert-base-uncased.
Figure 9. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for roberta-base and gpt-neo-125m. The accompanying table presents the detailed metrics for each epoch, highlighting the performance over time.
Figure 10. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for electra-base-generator and xlnet-base-cased. The accompanying table presents the detailed metrics for each epoch, highlighting the performance over time.
Figure 11. Training progress across 5 epochs. The plot illustrates the evolution of accuracy, precision, recall, and F1-score during the training process for distilroberta-base.
Figure 12. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 13. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 14. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 15. Confusion matrix obtained for distilroberta-base on the test set, providing insights into the model’s classification accuracy and areas for improvement.
Figure 16. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 17. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 18. Confusion matrix for the models evaluated on the test set, providing insights into the models’ classification accuracy and areas for improvement.
Figure 19. Confusion matrix obtained for distilroberta-base on the test set, providing insights into the model’s classification accuracy and areas for improvement.
Figure 20. Output of GPTZero analysis displaying the evaluation results for an AI-generated sentence randomly extracted from the dataset.
Figure 21. Output of ZeroGPT analysis displaying the evaluation results for an AI-generated sentence randomly extracted from the dataset.
Figure 22. Bar plot illustrating integrated gradients results of an AI-generated sentence for model explainability.
Figure 23. Bar plot illustrating token importance results of an AI-generated sentence for model explainability.
Figure 24. Bar plot illustrating integrated gradients results of a sentence created by a human for model explainability.
Figure 25. Bar plot illustrating token importance results of a sentence created by a human for model explainability.
Table 1. Composition of the human and AI-generated sentence dataset after the train, test and validation split.
Type | N. of AI Sentences | N. of Human Sentences | Total
Train | 10,006 | 9994 | 20,000
Test | 1232 | 1268 | 2500
Validation | 1262 | 1238 | 2500
Table 2. Table summarizing key statistics of the dataset: showcasing the total number of words, unique words, and insights into sentence lengths, including the maximum and minimum sentence lengths along with the average sentence length.
Statistic | Value
Number of words | 9,488,252
Number of unique words | 112,799
Sentence with maximum length | 1422
Sentence with minimum length | 1
Average sentence length | 379.5
Table 3. Comparison of sentence examples from the dataset featuring a human-written sentence alongside an AI-generated sentence.
Sentence | Label
in this world you decide who do went to be either you get along with it or someone tench you although some people think that our character formed by influences beyond our control nevertheless we choose our character traits people on your environment… | Human
as citizens of a busy city we often find ourselves stuck in traffic surrounded by pollution and feeling stressed out by the constant noise and congestion but what if there was a way to reduce all of these negative effects and make our city a more… | AI
Table 4. Composition of the human and AI-generated abstracts dataset after the train, test and validation split.
Type | N. of AI Abstracts | N. of Human Abstracts | Total
Train | 11,433 | 11,496 | 22,929
Test | 1421 | 1446 | 2867
Validation | 1477 | 1389 | 2866
Table 5. Key statistics of the dataset: showcasing the total number of words, unique words, and insights into abstract lengths, including the maximum and minimum abstract lengths along with the average abstract length.
Statistic | Value
Number of words | 5,482,241
Number of unique words | 257,527
Abstract with maximum length | 18,000
Abstract with minimum length | 3
Average abstract length | 191.2
Table 6. Overview of models and parameters utilized during the fine-tuning phase: detailing the model type, learning rate, batch size, weight decay, maximum input length, number of training epochs, and total execution time in hours.
Model | Learning Rate | Batch Size | Weight Decay | Max Input Length | Num. of Epochs | Execution Time (In Hours)
bert-base-uncased | 0.00002 | 8 | 0.01 | 512 | 5 | 35:35
distilbert-base-uncased | 0.00002 | 8 | 0.01 | 512 | 5 | 17:46
roberta-base | 0.00002 | 8 | 0.01 | 512 | 5 | 37:08
gpt-neo-125m | 0.00002 | 8 | 0.01 | 2048 | 5 | 198:90
electra-base-generator | 0.00002 | 8 | 0.01 | 512 | 5 | 04:46
xlnet-base-cased | 0.00002 | 8 | 0.01 | 512 | 5 | 84:75
distilroberta-base | 0.00002 | 8 | 0.01 | 512 | 5 | 18:36
Table 7. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for bert-base-uncased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.98 | 0.96 | 1.0 | 0.98
2 | 0.99 | 0.99 | 1.0 | 0.99
3 | 0.99 | 0.99 | 1.0 | 0.99
4 | 0.97 | 0.94 | 1.0 | 0.97
5 | 0.99 | 0.98 | 1.0 | 0.99
(b) Training progress for distilbert-base-uncased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.97 | 0.94 | 1.0 | 0.97
2 | 0.99 | 0.99 | 1.0 | 0.99
3 | 0.98 | 0.97 | 1.0 | 0.99
4 | 0.99 | 0.98 | 1.0 | 0.99
5 | 0.99 | 0.99 | 1.0 | 0.99
Table 8. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for roberta-base.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.99 | 0.98 | 1.0 | 0.99
2 | 1.0 | 1.0 | 1.0 | 1.0
3 | 1.0 | 0.99 | 1.0 | 1.0
4 | 0.99 | 0.99 | 1.0 | 0.99
5 | 0.99 | 0.99 | 1.0 | 0.99
(b) Training progress for gpt-neo-125m.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.98 | 0.97 | 1.0 | 0.98
2 | 0.99 | 1.0 | 0.99 | 0.99
3 | 0.99 | 0.99 | 1.0 | 0.99
4 | 1.0 | 1.0 | 1.0 | 1.0
5 | 0.99 | 1.0 | 0.99 | 0.99
Table 9. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for electra-base-generator.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.95 | 0.91 | 1.0 | 0.95
2 | 0.99 | 0.99 | 1.0 | 0.99
3 | 1.0 | 1.0 | 1.0 | 1.0
4 | 1.0 | 0.99 | 1.0 | 1.0
5 | 1.0 | 0.99 | 1.0 | 1.0
(b) Training progress for xlnet-base-cased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.98 | 0.96 | 1.0 | 0.98
2 | 0.99 | 0.99 | 1.0 | 0.99
3 | 0.99 | 0.99 | 1.0 | 0.99
4 | 0.97 | 0.94 | 1.0 | 0.97
5 | 0.99 | 0.98 | 1.0 | 0.99
Table 10. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression. Training progress for distilroberta-base.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.97 | 0.95 | 1.0 | 0.97
2 | 0.99 | 0.99 | 1.0 | 0.99
3 | 0.96 | 0.92 | 1.0 | 0.96
4 | 1.0 | 0.99 | 1.0 | 1.0
5 | 1.0 | 0.99 | 1.0 | 1.0
Table 11. Overview of models and total execution time in hours.
Model | Execution Time (In Hours)
bert-base-uncased | 28:38
distilbert-base-uncased | 12:41
roberta-base | 25:29
gpt-neo-125m | 240:41
electra-base-generator | 07:41
xlnet-base-cased | 102:46
distilroberta-base | 20:04
Table 12. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for bert-base-uncased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.5
2 | 0.49 | 0.5 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
(b) Training progress for distilbert-base-uncased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.5
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
Table 13. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for roberta-base.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.5 | 0.5 | 0.5
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.5 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
(b) Training progress for gpt-neo-125m.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.5
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
Table 14. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression.
(a) Training progress for electra-base-generator.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.51
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
(b) Training progress for xlnet-base-cased.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.5
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
Table 15. Performance metrics including accuracy, precision, recall, and F1-score across five training epochs, illustrating the models’ learning progression. Training progress for distilroberta-base.
Epoch | Accuracy | Precision | Recall | F1-Score
1 | 0.49 | 0.51 | 0.5 | 0.5
2 | 0.49 | 0.51 | 0.5 | 0.5
3 | 0.49 | 0.51 | 0.5 | 0.5
4 | 0.49 | 0.51 | 0.5 | 0.5
5 | 0.49 | 0.51 | 0.5 | 0.5
Table 16. Performance metrics of the fine-tuned models on the test set. This table summarizes the model type along with the corresponding accuracy, precision, recall, and F1-score achieved after 5 epochs of fine tuning, providing a comprehensive overview of each model’s effectiveness in the evaluation phase.
Model | Accuracy | Precision | Recall | F1-Score
bert-base-uncased | 0.99 | 0.98 | 1.0 | 0.99
distilbert-base-uncased | 0.99 | 0.98 | 1.0 | 0.99
roberta-base | 1.0 | 1.0 | 1.0 | 0.99
gpt-neo-125m | 1.0 | 1.0 | 0.99 | 1.0
electra-base-generator | 0.99 | 0.99 | 0.99 | 0.99
xlnet-base-cased | 0.99 | 0.98 | 1.0 | 0.99
distilroberta-base | 0.99 | 0.99 | 1.0 | 0.99
Table 17. Performance metrics of the fine-tuned models on the test set. This table summarizes the model type along with the corresponding accuracy, precision, recall, and F1-score achieved after 5 epochs of fine tuning, providing a comprehensive overview of each model’s effectiveness in the evaluation phase.
Model | Accuracy | Precision | Recall | F1-Score
bert-base-uncased | 0.51 | 0.51 | 0.52 | 0.51
distilbert-base-uncased | 0.51 | 0.5 | 0.52 | 0.51
roberta-base | 0.51 | 0.5 | 0.52 | 0.51
gpt-neo-125m | 0.51 | 0.5 | 0.52 | 0.51
electra-base-generator | 0.51 | 0.5 | 0.52 | 0.51
xlnet-base-cased | 0.51 | 0.51 | 0.52 | 0.51
distilroberta-base | 0.51 | 0.51 | 0.52 | 0.51
Table 18. Performance comparison of the fine-tuned models in distinguishing between AI-generated sentences from ChatGPT and sentences randomly extracted from Wikipedia.
Model | Human Detected/Total | AI Detected/Total
bert-base-uncased | 10/20 | 20/20
distilbert-base-uncased | 11/20 | 20/20
roberta-base | 15/20 | 20/20
gpt-neo-125m | 14/20 | 20/20
electra-base-generator | 7/20 | 20/20
xlnet-base-cased | 10/20 | 20/20
distilroberta-base | 12/20 | 20/20
Table 19. Results of AI detector evaluations on randomly selected sentences from the dataset. This table summarizes the performance of commercial AI detectors in identifying AI-generated sentences, based on a sample of five randomly selected sentences.
Tool | Correct | Incorrect
GPTZero | 0 | 5
Crossplag | 5 | 0
Copyleaks | 1 | 4
ZeroGPT | 0 | 5
Table 20. Examples of AI-generated sentences used to test the online detection tools.
Sentence
the use of facial recognition technology like the facial action coding system to read students emotional expressions in the classroom could have both benefits and disadvantages on one hand this technology may help teachers gain insights into how their students are feeling during lessons if the computer detects that many students look bored or confused the teacher would know to adjust their teaching strategy or explain a concept again differently this could improve students understanding and engagement the technology could also flag when a student appears upset or distressed so the teacher can check in on their wellbeing however constant computer monitoring of students facial expressions may invade their privacy and make them feel uncomfortable students should feel free to naturally react to lessons without always worrying if a computer is analyzing their every expression they may start to feel selfconscious and unable to fully concentrate on learning there are also concerns about how the personal data collected about students emotions would be stored and shared overall using this technology sparingly and judiciously could provide some educational benefits by helping teachers adapt their lessons by constant facial scanning risks having negative impacts on students privacy stress levels and ability to freely react and learn a balanced approach that only occasionally analyzes student expressions and with strict privacy protections may maximize the benefits of this technology while minimizing the disadvantages for students more research would also help understand how different applications of this technology affect learning environmentsin conclusion while facial recognition could offer valuable insights to teachers the potential downsides to student wellbeing and privacy myst be carefully considered and addressed for its use in classrooms to be justified a nuanced approach is needed.
Sentence
the development of driverless cars while driverless cars present many exciting opportunities their widespread adoption also raises valid safety and privacy concerns that must be addressed according to the article driverless cars are coming autonomous vehicles could substantially reduce traffic accidents by removing human error from the roads however the technology is still new and will need extensive testing before the public can keel fully confident in surrendering control a key benefit cited is that 90 of traffic accidents are currently attributed to human error without distractions like drunk or distracted driving impairing judgment driverless cars use sensors and software to avoid collisions this suggests autonomous vehicles could save thousands of lives each year by driving more carefully than humans proponents also argue that the elderly and disabled who can no longer safely operate a vehicle would regain mobility through this innovation being able to transport more of the population can have socioeconomic benefitshowever the technology is still progressing while researchers have driven millions of miles in simulation and on test tracks real world conditions present challenges the software has yet to encounter glitches or software bugs could potentially put lives at risk until the technology proves itself as reliable as human drivers in all traffic situations additionally many people will race an anxiety of losing control that decades of driving has conditioned public trust and acceptance and crucial for adoption and may take time to develop as autonomous vehicles interact with human drivers on streets privacy is another concern as the detailed sensors that allow computer vision also create data privacy risks without regulations and accountability information like driving patterns locations visited and passenger details collected could potentially be misused this potentially opens the door to privacy violations however proper legal frameworks could help ensure autonomous vehicles do not undermine individual privacy for the sake of functionality in conclusion while driverless cars present opportunities to revolutionize transportation and significantly improve safety their development also involves risks that must be addressed through further technological progress and new regulations and standards with adequate testing and safeguards to build public confidence and protect individual privacy autonomous vehicles could vastly improve lives but these issues deserve careful consideration and management as the innovation advances the potential of this technology is exciting but its real world integration will take time and coordination between researchers policymakers and the public.
Sentence
i am writing to express my support for the electoral college and advocate for its continuation in the election of the president of the united states while some argue for changing to a system based on the popular vote i believe that the electoral college provides several essential benefits that should be consideredfirstly the electoral college ensures certainty of outcome in the presidential election as stated by judge richard a poster the winning candidates share of the electoral vote consistently exceeds their share of the popular vote this means that the electoral college system minimizes the chances of a dispute over the outcome of the election as was seen in the 2000 presidential election with the winnertakeall system in each state even a slight plurality in a state leads to a landslide electoral vote victory although a tie in the nationwide electoral vote is possible it is highly unlikelymoreover the electoral college encourages presidential candidates to have transregional appeal no single region in the country has enough electoral votes to elect a president promoting candidates to seek support across different regions this prevents a candidate with only regional appeal from becoming president and ensures that the interests of all regions are represented it is important that the president be viewed as everyones president which the electoral college system helps to achieveadditionally the winnertakeall method of awarding electoral votes in the electoral college leads to candidates focusing their campaign efforts on swing states this is beneficial as it encourages voters in these states to pay close attention to the campaign and to the competing candidates swing state voters are likely to be more thoughtful and informed voters due to the increased attention they receive from candidates these voters should have a significant influence on deciding the electionfurthermore the electoral college restores some balance in the political power of large states compared to smaller states as judge poster highlights the electoral college compensates for the malapportionment of the senate and ensures that large states receive more attention from presidential candidates during campaigns this helps to ensure that presidential candidates do not solely focus on the needs and concerns of smaller states to the detriment of larger stateslastly the electoral college system avoids the complexities of runoff elections in cases where no candidate receives a majority of the popular votes the electoral college produces a clear winner this eliminates the need for additional elections and simplifies the presidential election process it allows the nation to come together and support the elected president without further delays or uncertaintiesin conclusion the electoral college system provides certainty of outcome ensures that the president is everyones president encourages focus on swing states balances the power of large and small states and avoids the complications of runoff elections based on these benefits i believe that the electoral college should be maintained in the election of the president of the united states thank you for considering my perspective on this important matter i trust that you will carefully weigh the advantages of the electoral college in your decisionmaking processsincerelyyour name.
Sentence
as generic name learned having a good attitude even in the toughest of times can make all the difference he had fallen on hard times with his home in foreclosure and his health failing but instead of wallowing in his misfortune genericname chose to stay positive he kept his focus on what he could do to turn his situation around and worked hard to make it happen thanks to his attitude genericname was able to stay in his home and eventually get back on his feetof course having a good attitude doesnt just help in difficult times it can also make people successful positive thinking and a good attitude can give someone the strength and determination to take on challenges leading to greater accomplishments people with good attitudes are also better able to handle stress and enjoy life more fully which can lead to amazing experiences and memoriesvy looking at the story of genericname we can see that a good attitude can help people in all aspects of life it can help them stay strong and resilient in hard times and foster success and amazing experiences in good times it is an important quality to possess and with it you can positively impact your life.
Sentence
the debate over the adoption of a curfew for teenagers continues to be a contentious issue among city councils the proposed curfew would require teenagers to be home by 10 pm on weekdays and midnight on weekends with those found on the streets after those hours being in violation of the law while some argue that curfews keep teenagers out of trouble others believe that they unfairly interfere in young peoples livescurfews can certainly have their benefits in keeping teenagers safe for example if they are out late at night they may be at risk of getting kidnapped or being in the wrong place at the wrong time additionally if a police officer stops or fulls them over they may be asked questions about their whereabouts which could potentially fut them in troublehowever it is important to consider the potential negative consequences of a curfew for instance some parents may worry about their childrens safety if they are not home while it may be tempting to meet of with friends at night it may not always be worth the risk additionally curfews can be seen as a lack of trust in young people which can have a negative impact on their self esteem and relationships with their parentsit is also important to consider the potential impact of a curfew on a teenagers social life while it may be tempting to send time with friends at night it may be more beneficial to hang out during the day or have a sleepover it is important for teenagers to have a healthy balance between their social lives and their responsibilitiesultimately the decision to implement a curfew for teenagers should be made with careful consideration of the potential benefits and drawbacks while curfews can certainly have their benefits it is important to ensure that they do not unfairly interfere with young peoples lives instead it may be more beneficial to focus on building trust and communication between parents and teenagers as well as providing them with alternative activities and opportunities to socialize in a safe and responsible manner.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
