Article

Identification of Scientific Texts Generated by Large Language Models Using Machine Learning

by David Soto-Osorio 1,†, Grigori Sidorov 1,*,†, Liliana Chanona-Hernández 2,† and Blanca Cecilia López-Ramírez 3,†
1 Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico
2 Escuela Superior de Ingeniería Mecánica y Eléctrica, Unidad Zacatenco, Instituto Politécnico Nacional, Av. Luis Enrique Erro S/N, Ciudad de México 07700, Mexico
3 Tecnológico Nacional de México/I.T. de Roque, Celaya 38110, Mexico
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Computers 2024, 13(12), 346; https://doi.org/10.3390/computers13120346
Submission received: 7 November 2024 / Revised: 10 December 2024 / Accepted: 16 December 2024 / Published: 19 December 2024

Abstract

Large language models (LLMs) are tools that assist us in a variety of activities, from producing well-structured texts to quickly looking up information. However, because these technologies are so easily accessible, many people use them for their own benefit without properly crediting the original author. The academic sector is particularly exposed: students may opt for a quick answer instead of understanding a topic in depth, considerably weakening their basic writing, editing, and reading comprehension skills. We therefore propose a model to identify texts produced by LLMs. To do so, we use natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models; to this end, several experiments were carried out in search of the best results, evaluated with the F1, accuracy, recall, and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text.

1. Introduction

The rapid advance of technology has brought with it new tools, such as large language models, which facilitate various tasks in daily life, but it has also generated multiple challenges that we must face. These models have completely changed the way we interact with information, as well as allowing us to improve different processes. However, despite the fact that their progress requires a large amount of computational, natural and financial resources, the ease with which these technologies can be obtained has generated significant difficulties, especially in the academic and professional environment.
The advancement of LLMs can have a negative impact in several areas, especially in academia. When students have ready access to these tools, their learning process can be compromised because they may opt for quick and easy solutions instead of gaining a deep understanding of the topics. This could result in basic skills such as writing, spelling, reading comprehension and research techniques being eroded. In addition, one of the main problems is the inappropriate use of texts created by LLMs for personal gain without acknowledging the original author, which increases the risk of plagiarism.
In the near future, more parametrized versions of LLMs are expected to generate texts that more accurately mimic the grammar and writing style of specific authors. As the differences between AI-produced texts and those written by humans become almost imperceptible, it will become increasingly difficult to identify LLM-generated texts. This development will complicate the detection of plagiarism and misuse of these tools, especially in the academic and professional sectors, where originality and authenticity are paramount.
In the long run, it is likely that all the information produced by LLMs will be published on the Internet. Thus, if these models exhibit a high degree of hallucination, it is possible that upcoming models trained with such incorrect information will suffer a reduction in performance. This phenomenon could cause a dilution of information, leading to a decrease in accuracy and depth of attention to detail, negatively impacting the quality of the final answers.
As a precedent, several pieces of research have been carried out linked to the identification of texts produced by artificial intelligence (AI). However, methods such as tagging have not shown encouraging results in the face of challenges such as recursive paraphrasing or machine translation. Therefore, the application of deep-learning algorithms and Transformer-based architectures has been established as one of the most widely used tactics to address this problem.
In addition, since this is a relatively recent field of study, existing datasets from different sources have not been created specifically for this problem. For example, sets such as PAN and HC3 have considerable restrictions: the structure of PAN is not appropriate for this purpose, and the HC3 set only considers a single GPT model to generate answers to questions, limiting the comparison to texts written by humans and texts produced by GPT.
In this context, it is essential to develop effective solutions to
  • identify LLM-generated texts with high accuracy;
  • detect covert plagiarism practices using advanced techniques;
  • provide accessible tools for academic and professional institutions.
In this paper, we propose a detection model based on NLP and machine learning. Our approach focuses on
  • the creation of a meticulously designed dataset validated through comprehensive experiments;
  • the implementation of models ranging from classical techniques to Transformer and LLM architectures.
The remainder of this paper is organized as follows. Section 2 presents the theoretical framework, Section 3 reviews work related to the detection of texts and plagiarism produced by LLMs, Section 4 explains the methodology, and Section 5 covers the experiments and analysis of results. Section 6 presents the constraints encountered during the development of the research, Section 7 discusses the impact and applicability of the proposed approach, and Section 8 presents the conclusions and future work.

2. Theoretical Framework

2.1. Preprocessing Techniques

To fully understand the advances in our research, it is essential to master a variety of concepts related to large language models, natural language processing, and evaluation metrics, such as accuracy, recall, F1 score, and precision. Additionally, it is important to be familiar with tools such as confusion matrices, t-distributed stochastic neighbor embedding (t-SNE), receiver operating characteristic/area under the curve (ROC/AUC) and principal component analysis (PCA) plots.
Because they prepare the data for more effective analysis or modeling, preprocessing techniques are a crucial stage of text analysis. Several methods are used in this process to convert raw text into a more appropriate form using natural language processing techniques to improve the performance of our machine-learning models.
Tokenization [1], which divides the text into smaller units called tokens (words, phrases or sentences), is a fundamental technique within NLP. After tokenization, we remove all symbols that are not necessary for our analysis; this includes removing punctuation marks and other unnecessary symbols and converting the text to lowercase. This reduces variability in the text and ensures a uniform treatment of all words, lowering the risk of giving weight to symbols that contribute nothing to the context of the sentence.
Another set of important techniques comprises stopword removal, which ignores words that carry little semantic value and may introduce noise into the analysis; lemmatization, which reduces a word to its lemma or lexical root; and stemming, which cuts the word down to its base form, although this does not always guarantee that the result is the appropriate lexical root.
These techniques are essential to properly debug a textual dataset, thus minimizing errors when using machine-learning models. A diagram illustrating the key steps in performing text preprocessing is shown in Figure 1.
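The following minimal sketch illustrates these preprocessing steps in Python using the NLTK library; the library choice and the example sentence are illustrative, since any equivalent NLP toolkit could be used.
```python
# Minimal preprocessing sketch (illustrative): lowercasing, symbol removal,
# tokenization, stopword removal and lemmatization with NLTK.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Resources required on first run (newer NLTK versions may also need "punkt_tab").
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

def preprocess(text: str) -> list:
    # Lowercase and drop punctuation and other non-alphabetic symbols
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Split the cleaned text into word tokens
    tokens = word_tokenize(text)
    # Remove stopwords and reduce each remaining token to its lemma
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens if token not in stop]

print(preprocess("The models were generating well-structured scientific texts."))
```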

2.2. Overview of Text Vectorization Methods

Text vectorization is an important step within NLP, since it allows us to build vector representations of our texts and, in this way, provide our machine-learning models with the semantic relationships and context necessary to perform a specific task. Several methods have been developed to identify the connections between words; below we briefly describe some of the most popular ones (a short code sketch follows the list).
  • One-hot encoding consists of adding a binary vector to each word, where only one element is 1 (representing the word) and the others are 0. It does not identify the semantic relationships between words [2].
  • Bag of words represents a document as a list of words; it does not take into account the order of these words, and the resulting vector indicates the frequency at which each word appears in the text [3].
  • N-grams extend the bag of words by considering sequences of consecutive tokens. In this way, it is possible to capture information such as the order of the words, which also carries context. These can be uni-grams, bi-grams or tri-grams and are not limited to words; they can also be built from groups of characters [4].
  • TF-IDF compares the frequency of a word in a document with its frequency across a collection of documents. Words that appear in many documents receive less weight, while less common words are considered more informative and receive a higher weight [5].
  • Word2Vec generates dense, low-dimensional vectors for each word according to its context. It uses models such as Skip-Gram or CBOW to train the embeddings. Since it produces one vector per word, to obtain the representation of a sentence you sum the vectors of its words and divide by the number of words, obtaining an averaged vector that carries the semantic information and context of the sentence [6].
  • GloVe is an embedding technique that relies on word matches within a large text corpus. It captures semantic patterns using a global matrix of co-occurrences, unlike models such as Word2Vec that train words in close context [7].
  • BERT is a language model based on the Transformer architecture that differs in that it is bidirectional, meaning that it takes into account both the preceding and following context of a word within a sentence. Compared to unidirectional models that only process text from left to right, being bidirectional allows it to generate much more accurate contextual embeddings. Also, the word masking task, in which some words in the text are hidden and the model tries to predict them, helps BERT learn deep semantic relationships [8].
  • RoBERTa is an improved version of BERT, created to overcome some limitations of the original model. It was trained with a larger amount of data and employs key adjustments to optimize its performance in various NLP tasks; it drops the next-sentence prediction objective, as the researchers found that it did not provide significant improvements. RoBERTa keeps only the masked language modeling objective and focuses on training at a larger scale: more data, longer sequences and larger minibatches [9].
  • The use of large language models, such as GPT or LLaMA, to create embeddings relies on their ability to understand the full context of a text stream. Their Transformer architecture allows these models to process both individual words and their relationship to the rest of the sentence or document. As a result, they produce highly contextualized embeddings, where the meaning of a word depends on the environment in which it is found. This allows the embeddings to capture complex semantic relationships, representing both the individual meaning of words and the overall context of entire sentences, making them well suited for advanced language processing tasks such as text classification or natural language generation [10].
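The sketch below illustrates two of the vectorization options described above: sparse TF-IDF vectors with scikit-learn and dense contextual embeddings obtained by mean-pooling the output of a pre-trained BERT model through the Hugging Face transformers library. The checkpoint name and the pooling strategy are illustrative choices, not a prescription.
```python
# Illustrative sketch: sparse TF-IDF vectors versus dense BERT embeddings.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

docs = ["Text written by a human author.", "Text produced by a language model."]

# 1) TF-IDF: one high-dimensional sparse vector per document
X_tfidf = TfidfVectorizer().fit_transform(docs)      # shape (2, vocabulary size)

# 2) BERT: mean-pooled token embeddings as a 768-dimensional document vector
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state         # (2, seq_len, 768)
mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding positions
X_bert = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(X_tfidf.shape, X_bert.shape)
```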

2.3. Classical Classification Algorithms

  • Logistic regression is a linear classification model mainly used in binary problems to predict the probability of belonging to a class. Unlike linear regression, which predicts continuous values, it applies a sigmoid function so that the output is transformed into values between 0 and 1; for decision-making, a typical threshold of 0.5 is normally used: if the probability is greater than or equal to this value, the model assigns the positive class, otherwise the negative one. The algorithm is efficient when the relationships are approximately linear, but it can be limited when the relationships are more complex. This model can also be applied to multiclass classification through approaches such as “one vs. rest” or “softmax regression”, in which the model predicts the probability that an instance belongs to each of the available classes. Its simplicity and its ability to handle both binary and multiclass problems make it widely used in different tasks [11].
  • Random forest is a machine-learning algorithm that is based on the creation of multiple decision trees; each of the trees is trained with a random subset of training data, which produces a diversity among the trees. At the time of classifying a new piece of data, each tree generates a prediction and at the end the final model makes the decision by a majority voting system for classification cases, or the other method is an averaging for regression cases. This approach is not very susceptible to overfitting because individual trees are likely to overfit, but this is mitigated by combining many trees. This model is very effective when dealing with data that do not have linear interactions, as it can work with large datasets of many variables and is able to detect or capture complex relationships between features [12].
  • SVMs are a class of powerful classification algorithms that focus mainly on finding an optimal hyperplane that separates the classes in a high-dimensional feature space. The main idea is to maximize the distance between the hyperplane and the points closest to it, known as support vectors; a larger margin can lead to greater confidence in the classification. For nonlinear problems, SVMs use the kernel trick, which maps the data to a higher-dimensional feature space where the classes can be linearly separable; several kernels exist, among them linear, polynomial and radial basis function (RBF). This algorithm performs well in high-dimensional spaces but can be computationally expensive, especially when working with large datasets [13].
  • The KNN algorithm is a supervised classification model based mainly on the similarity of instances; in order to classify new data, the model looks for the closest neighbors to that data point within the feature space and assigns the most common class among them. To determine the distance between points, the Euclidean distance is mainly used, although other metrics can be chosen depending on the nature of the data. It is a simple and effective model where the decision boundaries are complex and nonlinear. Its main disadvantage is being sensitive to the scale of the features, so a normalization step is necessary before its application. Its performance also degrades on large datasets; however, KNN is useful when a quick solution is required and no parametric model is available [14]. A short scikit-learn sketch combining these classical algorithms is shown after this list.
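The following sketch trains the four classical algorithms described above on TF-IDF features using scikit-learn pipelines and cross-validation; the toy corpus and hyperparameters are placeholders, not the dataset or settings used in this work.
```python
# Illustrative scikit-learn sketch: the four classical algorithms on TF-IDF
# features, evaluated with 4-fold cross-validation on a placeholder corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["human written abstract ...", "llm generated abstract ..."] * 50
labels = [0, 1] * 50   # 0 = human, 1 = LLM (placeholder labels)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "svm_rbf": SVC(kernel="rbf"),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=4, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```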

2.4. Deep-Learning Models

Some of the neural network architectures that were implemented in the development of the project are described below.
  • Fully connected neural networks are the most basic type of neural network; each neuron of a layer is connected to every neuron of the next layer, and the information is propagated in a unidirectional way, from the input to the output, without any kind of feedback. They can be used to solve classification or regression problems. One of their main limitations is that they do not capture the spatial or temporal relationships in the data, which makes it hard to obtain good performance on complex problems such as the analysis of sequences or large images. Despite their simple architecture, they can still be powerful for tasks where the relationships between inputs are relatively simple [15].
  • An RNN is a type of neural network able to process data sequences such as text or time series; unlike fully connected networks, the RNN has cyclic connections, which allow it to maintain a memory of previous inputs. For this reason, RNNs are useful for modeling temporal dependencies: at each time step, the RNN receives an input and updates its hidden state based on that input and the current hidden state. This type of network often suffers from vanishing or exploding gradients, which can make learning difficult on long data sequences, although RNNs remain useful for tasks such as sequence analysis or machine translation [16].
  • The LSTM is a variant of the RNN designed primarily to mitigate the vanishing gradient problem on long sequences. This type of network uses a special memory architecture composed of cells that can remember and forget information over time, which allows it to capture long-term dependencies more effectively than traditional RNNs. LSTMs are widely used in sequential tasks such as language modeling, text generation, sentiment analysis and time-series prediction and, despite being computationally more expensive and more complex, have proven to be significantly more effective in most sequential problems [16]. A minimal LSTM classifier sketch is shown after this list.
  • Transformer architecture is an innovative solution presented in 2017 to overcome the limitations that RNNs and LSTMs have, especially in natural language processing and sequence processing tasks. Its main improvements are the attention mechanisms; these allow each part of the input to influence every other part, regardless of the position of the sequence. With that, the need to process data sequentially can be eliminated. This allows for much greater parallelism in training and data processing. This new architecture has proven to have far superior performance than previous architectures for machine translation, language modeling and text generation. Some models that have revolutionized the NLP field, such as BERT or GPT, have their operating principles in the Transformer architecture [17].
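As a reference, the sketch below defines a minimal LSTM classifier in PyTorch of the kind described above; the vocabulary size, embedding and hidden dimensions, dropout rate and number of classes are illustrative assumptions.
```python
# Illustrative PyTorch sketch of an LSTM text classifier with dropout.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=64, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)                 # regularization against overfitting
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)           # final hidden state of the sequence
        return self.fc(self.dropout(hidden[-1]))       # class logits

model = LSTMClassifier()
dummy_batch = torch.randint(1, 20000, (8, 200))        # 8 sequences of 200 token ids
print(model(dummy_batch).shape)                        # torch.Size([8, 5])
```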

2.5. Evaluation Metrics and Visualization Techniques

In this section, we describe some evaluation metrics that are important to verify the adequate training of our models, along with some visualization techniques used in the analysis of classification models [18]; a short code sketch computing these metrics follows the list.
  • Precision is the proportion of instances correctly classified as positive among all instances that were classified as positive. It is a useful metric when the cost of false positives is high. The formula to calculate it is:
    Precision = TP / (TP + FP),
    where TP are the true positives and FP are the false positives.
  • Recall measures the ability of the model to correctly identify positive instances among all true positive instances. It is particularly important when the cost of false negatives is high. The formula to calculate it is:
    Recall = TP / (TP + FN),
    where TP are the true positives and FN are the false negatives.
  • F1-Score is the harmonic mean between precision and recall and is useful when there is a balance between false positives and false negatives. The F1-Score provides a single metric that balances these two aspects:
    F1-Score = 2 × (Precision × Recall) / (Precision + Recall),
  • Accuracy measures the proportion of correct predictions among all predictions made. It is useful in balanced datasets but can be misleading in unbalanced datasets:
    Accuracy = (TP + TN) / (TP + TN + FP + FN),
    where TN are the true negatives.
  • The confusion matrix is a table that shows the predictions of the model against the original labels; these are broken down into true positives, true negatives, false positives and false negatives. This matrix helps to analyze the performance of our model with each of the classes and better understand the types of errors they are making [19].
  • PCA (Principal Component Analysis) is a dimensionality reduction technique that can transform the data into a new space with fewer dimensions while preserving as much variance as possible. It is mainly implemented to visualize high-dimensional data so that the main features are highlighted in a two-dimensional or three-dimensional plane [20].
  • t-SNE (t-distributed stochastic neighbor embedding) is another dimensionality reduction technique used for visualization, especially effective for high-dimensional data. It focuses primarily on preserving the local relationships between instances, which makes it particularly useful for visualizing data clusters or embeddings in tight spaces [21].
  • The ROC (receiver operating characteristic) curve shows the relationship between the true positive rate (TPR) and the false positive rate (FPR) for different decision thresholds. An area under the curve (AUC) of 1 indicates a perfect model, while an AUC of 0.5 indicates a random model [22].
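The sketch below computes the metrics listed above with scikit-learn and projects a set of embeddings to two dimensions with PCA and t-SNE for visualization; the predictions, scores and embeddings are synthetic placeholders.
```python
# Illustrative sketch: classification metrics, confusion matrix, ROC/AUC,
# and PCA / t-SNE projections with scikit-learn on synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.2, 0.7, 0.9, 0.8, 0.4, 0.1, 0.6, 0.3])  # scores for class 1

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_prob))
print(confusion_matrix(y_true, y_pred))

# Project synthetic 768-dimensional embeddings to 2-D for plotting
embeddings = np.random.rand(100, 768)
points_pca = PCA(n_components=2).fit_transform(embeddings)
points_tsne = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
print(points_pca.shape, points_tsne.shape)
```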

2.6. LLM Implementation Methods

There are several techniques for training and applying large language models. Four key methods are described below, along with their main advantages and disadvantages.
  • Prompt Engineering consists of designing prompts in a precise way to guide the language model to generate the most appropriate responses. When using pre-trained LLMs without the need to change their parameters, this technique is particularly useful. Advantages include the fact that it does not require additional training or large computational capacity and that it is fast and efficient for specific tasks. However, its customization for more complex or specific tasks may be limited, and its effectiveness depends on the capability of the LLM [23].
  • Fine tuning involves taking a previously trained model and retraining it with a specific dataset. It allows the weights of the model to be adjusted to improve its performance on particular problems. Advantages include the ability to create models that are highly tailored to specific tasks, improving accuracy and performance and being flexible for a wide range of applications. However, disadvantages include the fact that it requires a high quality dataset and significant computational resources, and it can be costly in terms of time and processing [24].
  • RAG (Retrieval-augmented generation) combines information retrieval techniques with text generation. First, relevant information is retrieved from a database or search engine, and then the LLM generates text based on that information. Advantages include increasing the accuracy of the LLM by relying on up-to-date and relevant information, improving answers to specific queries and reducing the dependency on model size. However, the disadvantages are that it requires additional systems for information retrieval, which complicates the architecture, and can increase latency in the generation process [25].
  • The creation of LLM from scratch implements the initial training of a language model using a large amount of unstructured data without prior training. The design of the model architecture, the selection of training data and the configuration of hyperparameters are all components of this process. Advantages include full control over model design and training, allowing for innovative or extremely customized models for specific needs. However, disadvantages include being very expensive and requiring large amounts of computational resources, storage and time, as well as being complex and requiring a great deal of expertise in language modeling and optimization.
In order to train or adjust a large language model on a local computer, it is necessary to apply techniques that reduce the size of the network weights and allow us to modify only a small part of the entire network; below we explain each of these techniques, followed by a short configuration sketch.
  • Quantization is a technique used to reduce the size of deep-learning models and accelerate their inference. Instead of representing the weights of a model with floating point numbers, a data type that consumes more memory and requires more computation time, quantization converts the weights into a lower-precision type, for example, 8-bit values.
  • LoRA is a technique used to adapt already pre-trained models without adjusting all of their parameters; instead, LoRA introduces low-rank matrices that are trained while the original model weights are kept fixed, so that only a small number of parameters are updated instead of the whole model. This considerably reduces training time and the amount of computational resources needed, which is a great advantage when large language models need to be retrained.
  • QLoRA combines both quantization and LoRA; that is, this technique applies quantization to the low-rank matrices that are introduced during the fitting process. This allows one to further reduce the size and complexity of the models. QLoRA is a very useful technique when limited computational resources are available.
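The following configuration sketch combines 4-bit quantization with a LoRA adapter (a QLoRA-style setup) using the Hugging Face transformers, bitsandbytes and peft libraries; the base checkpoint and the adapter hyperparameters are illustrative assumptions rather than the settings used in this work.
```python
# Illustrative QLoRA-style sketch: load a base model with 4-bit quantization
# and attach a LoRA adapter; checkpoint and hyperparameters are placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,  # low-rank adapter hyperparameters
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```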

3. Related Work

In the work developed by Sadasivaan et al. [26], a critical problem related to Type I and Type II errors is highlighted. Type I errors occur when LLM-generated texts are misclassified as human-written, while Type II errors occur when human-written texts are mislabeled as LLM-generated. The authors argue that improving detector robustness against Type I errors often leads to an increase in Type II errors, revealing an inverse relationship between the two types of error. Furthermore, a low Type I error can have serious consequences, such as falsely accusing a human of plagiarism, which can damage their professional or academic reputation.
In addition, the paper demonstrates that current detectors are vulnerable to adversarial attacks such as recursive paraphrasing, despite the use of technologies such as watermarking, deep learning and zero-shot methods. Although human studies show that this type of paraphrasing only slightly reduces text quality, these attacks can confuse detectors and increase Type I errors. The authors conclude that, to avoid misuse of these models, an ideal detector should be able to accurately identify AI-created texts. However, they caution that the high cost associated with misidentification makes the practical application of these detectors unreliable and may even render it infeasible.
Wu et al. [27] conducted a thorough investigation into the current state of LLM detectors, examining the drawbacks of existing detectors and proposing several research directions for future work. Initially, they mention that current LLM detectors face two major problems: The first is the model augmented degradation (MAD) phenomenon, which mainly involves the risk of models being trained with erroneous knowledge published online, leading to repeated use of texts and reduced quality in generated texts. The second problem is that the models may provide false information, as they determine only the probability of the subsequent word without understanding the correctness of the information.
Wu et al. point out that there are currently three very active research areas related to LLM detection: the implementation of watermarking techniques, deep-learning methods and the use of LLMs as detectors. Some of the future work they propose includes creating detectors trained with more robust datasets and developing detectors suitable for resource-limited environments.
Research by Kumar et al. [28] presents an innovative detector based on the DistilBERT Transformer architecture. DistilBERT is a smaller and more efficient version of the bidirectional encoder representations from Transformers (BERT) model, chosen due to limited computational resources. The authors note that LLMs demonstrate remarkable text generation abilities, producing grammatically correct information with a coherent writing style; however, they cannot ensure the accuracy of the information provided, a phenomenon known as hallucination.
The authors conducted experiments on two datasets: “DAIGT-V3”, which includes twenty thousand essays written by humans and twenty thousand created by large language models, and “LLM – Detect AI Generated Text”, which contains student essays and texts created by various LLM models. It is crucial to note that neither of these datasets is protected against adversarial attacks.
The binary classification model they used demonstrated 100 percent accuracy in detecting texts created by LLMs and 90 percent accuracy in detecting texts written by humans, with recall rates of 84 percent and 90 percent, respectively. While the model performs well overall, it has a tendency to misclassify texts that were written by humans. Kumar concludes that while DistilBERT is highly capable of identifying texts created by LLMs, there is still room for improvement in the way human-written texts are classified. According to the study, DistilBERT could be useful in ensuring the quality of datasets used in a variety of applications.
According to Capobianco et al. [29], large language model detectors are crucial for maintaining academic integrity and benefiting society. The paper examines various models, including BERT and RoBERTa, for the binary classification of LLM and human texts. Experiments were conducted on the HC3 corpus, which contains 24,322 questions and corresponding answers from both human and LLM sources.
The authors trained different models of the BERT architecture. These models were separated into two sets: in the first, the parameters were frozen, while in the second they were left unfrozen. The reported results ranged from 88 to 100%. The RoBERTa model had a slightly lower accuracy than the BERT model, but it performs better when a larger amount of data is available. The study concludes that LLMs have a positive impact on society; however, it is crucial to ensure that these tools are used responsibly for the benefit of society as a whole.
Table 1 compares several studies on the detection of text generated by large language models (LLMs), focusing on the datasets, models used and main approaches. It includes work by Sadasivaan et al., Wu et al., Kumar et al. and Capobianco et al., along with our proposal. Highlights include datasets such as HC3 and newly created ones, the use of Transformer models (BERT and RoBERTa), classical algorithms and watermarking techniques. The table also describes the distinctive approach of each study, such as the analysis of Type I and Type II errors, the classification of GPT-generated texts, and the identification of texts from multiple LLMs, showing the originality of each study.

4. Methodology

The methodology we implemented for our research is divided into different stages, consisting mainly of the creation of our datasets, experimentation with a variety of models and, finally, the analysis of the results.

4.1. Formation and Preprocessing of Linguistic Corpus

The current datasets used to train models for detecting text created by LLMs are limited in diversity and focus mainly on computer science, which makes the models less effective for other disciplines. To address this problem, we developed code that connects to the arXiv API to collect scientific articles from a variety of fields, such as physics, medicine, electronics, and communications, and extracts the abstracts using the Nougat model [30], which organizes the content of PDF files. Dataset creation was carried out in two stages, with the aim of prioritizing the thematic and stylistic diversity of the texts.
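As an illustration of this collection step, the sketch below queries the public arXiv API for papers in a given category and keeps their title, publication date and entry link; the category, result count and field names are illustrative, and this is not the exact collection code used in this work.
```python
# Illustrative sketch: query the public arXiv API for recent papers in one
# category and keep the title, publication date and entry link.
import urllib.request
import xml.etree.ElementTree as ET

def fetch_arxiv(category: str, max_results: int = 5):
    url = ("http://export.arxiv.org/api/query?"
           f"search_query=cat:{category}&start=0&max_results={max_results}")
    with urllib.request.urlopen(url) as response:
        feed = response.read()
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    papers = []
    for entry in ET.fromstring(feed).findall("atom:entry", ns):
        papers.append({
            "title": entry.findtext("atom:title", namespaces=ns).strip(),
            "published": entry.findtext("atom:published", namespaces=ns),
            "link": entry.findtext("atom:id", namespaces=ns),
        })
    return papers

for paper in fetch_arxiv("physics.med-ph"):        # example category
    print(paper["published"][:4], paper["title"][:70])
```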
Figure 2 presents an outline of the procedure for constructing the dataset. Papers from different fields such as medicine, computer science, astronomy, physics and mathematics were incorporated. The database ultimately consisted of 1550 abstracts equally distributed among the different categories of paper. After processing the texts with LLMs, the final result consisted of 7750 texts, evenly distributed across the five classes (human-written texts and texts generated by each of the four LLMs). This approach was chosen because, when constructing a dataset, it is crucial to maintain a balance between classes to prevent overfitting to a given class and to achieve better generalization of the problem.
The first stage consists of extracting the abstracts from the PDF files; here we focused mainly on obtaining a diverse representation of texts, ensuring that the dataset covers a variety of topics and writing styles. The second stage consists mainly of discarding all articles published after 2017, since in that year the Transformer architecture began to be adopted and the first models of this class, such as BERT or RoBERTa, were presented. In this way, we can guarantee that our detection models are not trained on texts that may have been generated by models with the Transformer architecture, which gives us a more solid and objective basis. Table 2 shows the structure of our datasets.
Once all human-written texts have been obtained, we proceed to download and install Ollama. This software enables the installation of various large language models. We installed Llama3 [31] and LLaMA2 [32], both with 7 billion parameters, and gemini [33] and LLaVA [34], also with 7 billion parameters. With the models installed, we proceed to create a prompt that includes an instruction and a summary. We enter it into the LLM and with this we generate a new paraphrased text labeled with the name of the model. The following is an example of the instruction that is entered into the large language model for the generation of the new texts.
Instruction:
Summarize the article. Do not generate any additional text, just provide the summary.
Summary:
The article discusses how artificial intelligence (AI) is transforming the educational sector. It focuses on the use of AI-based tools to personalize learning, improve teaching through automation and provide more efficient access to educational content. Additionally, it addresses the ethical and social challenges that may arise from the integration of these technologies, such as the potential gap between students with access to AI and those without.
In this study, for the creation of instructions and the application of constraints, the LangChain library was used in conjunction with Ollama. LangChain enabled the unification and coordination of several elements of the system, which favored the creation of complex work processes. Specifically, for instructions and constraints, a template was used within LangChain, which facilitated the organized and adaptable definition of the requirements of each task. This method allowed customizing the requests to Ollama, ensuring that the responses produced were accurate and adapted to the constraints set, such as avoiding the creation of additional text and providing only the summary required.
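A minimal sketch of this prompt-template setup is shown below, using the LangChain Ollama integration; the model name and template wording are illustrative, and it assumes an Ollama server running locally.
```python
# Illustrative sketch of the prompt-template setup with LangChain and Ollama;
# it assumes a local Ollama server and uses a placeholder model name.
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

template = PromptTemplate.from_template(
    "Summarize the article. Do not generate any additional text, "
    "just provide the summary.\n\nSummary:\n{abstract}"
)
llm = Ollama(model="llama3")

def paraphrase_abstract(abstract: str) -> str:
    # Fill the template with the human-written abstract and query the LLM
    return llm.invoke(template.format(abstract=abstract))

generated = paraphrase_abstract("The article discusses how AI is transforming education ...")
print(generated)
```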
Once we have all our new texts generated by the different LLMs, we perform the dataset cleaning stage, eliminating unnecessary symbols and applying one-hot encoding for the labels, as well as tokenization, stopword removal, lemmatization and spell checking of the texts. The process we implemented for our dataset is presented in Figure 3.
At the end of the LLM text generation, we add these texts to our dataset, so the dataset looks like that in Table 3.
For our first dataset we have 7750 texts proportionally distributed to ensure a balanced representation of each class, i.e., 1550 are human texts and 1550 correspond to each of the LLMs we used for text generation. At this point we also generate our second dataset, which includes various attacks, such as recursive paraphrasing and translation from one language to another. The columns mentioned in Table 2 are also present in this dataset, but we add recursive paraphrasing and translation, so the dataset becomes much larger and therefore more time consuming to process and to use for training the machine-learning models. Figure 4 presents the development process for the creation of the second dataset.

4.2. Embeddings Generation

With the new datasets we developed, we applied several vectorization and embedding generation techniques in order to train our machine-learning models; these range from the most basic, such as TF-IDF vectorization, to more modern embedding generation methods, such as LLM embeddings. Each of these embeddings varies in length depending on the model used; traditional techniques such as TF-IDF generate very large vectors but are limited in capturing relationships between words, while Word2Vec, GloVe, BERT [35], RoBERTa [36] and other LLM-based models produce more compact but contextually richer embeddings. The first step was to store all embeddings in a single dataset; however, computational limitations led to the decision to separate them by model in order to optimize the loading of the sets. Figure 5 presents the whole process carried out for the generation of our feature vectors, and Table 4 presents the embedding sizes for each of the LLMs implemented and for the models with Transformer architecture, such as BERT or RoBERTa.
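As an example of this stage, the sketch below extracts one embedding vector per text from a locally served LLM through the ollama Python client and saves the resulting feature matrix per model; the model name and output path are illustrative assumptions.
```python
# Illustrative sketch: one embedding vector per text from a local LLM served
# by Ollama, stored per model as a NumPy feature matrix with its labels.
import numpy as np
import ollama

texts = ["abstract written by a human ...", "abstract paraphrased by an LLM ..."]
labels = [0, 1]                                  # placeholder labels

vectors = []
for text in texts:
    response = ollama.embeddings(model="llama3", prompt=text)
    vectors.append(response["embedding"])        # one dense vector per text

X = np.array(vectors)                            # (n_texts, embedding_dim)
y = np.array(labels)
print(X.shape)

# Persist the features separately for each embedding model, as described above
np.savez("embeddings_llama3.npz", X=X, y=y)
```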

4.3. Implementation of Classification Algorithms

Once we completed the initial dataset, we trained different machine-learning classification models, starting with classical algorithms such as logistic regression, support vector machines, decision trees and k-nearest neighbors. For each experiment, we considered both a simple hold-out validation with different training and testing splits (mainly 70%/30%, 80%/20% and 90%/10%) and cross-validation with different numbers of folds (mainly 4, 6, 8 and 10), the main goal being to identify the configuration that would best optimize our metrics. After training each model with our dataset, we generated the confusion matrix and PCA plots to visualize the distribution of the data and their classification. These experiments were repeated using the different embeddings, for a total of 9 experiments per classification model.
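The sketch below reproduces this validation protocol in scikit-learn, hold-out splits of 70/30, 80/20 and 90/10 plus cross-validation with 4, 6, 8 and 10 folds, over a placeholder feature matrix standing in for any of the precomputed embeddings.
```python
# Illustrative sketch of the validation protocol: hold-out splits of 70/30,
# 80/20 and 90/10 plus 4-, 6-, 8- and 10-fold cross-validation; X and y are
# placeholders standing in for any of the precomputed embedding matrices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X = np.random.rand(200, 768)                     # placeholder feature matrix
y = np.random.randint(0, 5, size=200)            # placeholder labels (5 classes)

for test_size in (0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
    print(f"hold-out {1 - test_size:.0%}/{test_size:.0%}: macro F1 = {macro_f1:.3f}")

for folds in (4, 6, 8, 10):
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=folds, scoring="f1_macro")
    print(f"{folds}-fold CV: mean macro F1 = {scores.mean():.3f}")
```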
Once we trained the basic models, we implemented deep-learning algorithms, starting with fully connected neural networks and LSTM networks. We experimented with different layer configurations, learning rates, epochs, activation functions and validation percentages, in addition to using dropout layers in each experiment to avoid overfitting to the training-set data. In the same way as the classical models, classification metrics were extracted, along with their respective confusion matrices and PCA and t-SNE diagrams. Figure 6 and Figure 7 present the training and validation process of basic machine-learning algorithms and deep-learning networks.
After training the basic and deep-learning models, we continued with the fine tuning of the BERT and RoBERTa models using reduced versions such as DistilBERT and DistilRoBERTa. The model parameters and the number of epochs were adjusted in each experiment; as they are pre-trained models, a large number of epochs is not necessary to obtain good results. For fine tuning and testing with these models, the clean text is given as input and the model is responsible for producing the embeddings and performing the classification. To evaluate the results, we extracted the values of the penultimate layer in order to apply PCA, visualize the distribution of our data and calculate the evaluation metrics, including the confusion matrix.
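A condensed sketch of this fine-tuning step with the Hugging Face Trainer API is given below; the toy dataset, checkpoint and hyperparameters are placeholders and not the exact configuration used in our experiments.
```python
# Illustrative sketch: fine-tune DistilBERT for sequence classification with
# the Hugging Face Trainer API on a toy, placeholder dataset.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = Dataset.from_dict({
    "text": ["human written abstract ...", "llm generated abstract ..."] * 50,
    "label": [0, 1] * 50,
})
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=256)

data = data.map(tokenize, batched=True).train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
print(trainer.evaluate())
```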
Once we completed the fine tuning of DistilBERT and DistilRoBERTa, we proceeded to the implementation of large language models (LLMs) to perform the classification. Initially, we used prompt engineering, a technique that consists of designing a detailed prompt that guides the LLM to perform the classification. The retrieval-augmented generation (RAG) technique was also implemented, and the fine tuning of LLaMA2 and LLaMA3 models was carried out, employing LoRA optimization to reduce the number of trainable parameters so that the models can be adjusted on systems with limited resources. Each of the trained models was saved along with its metrics, evaluation graphs and confusion matrices for future testing. The steps followed in these implementations are shown in Figure 8.
After completing the first stage of training, it became clear that models with good metrics are affected by various attacks, such as paraphrasing with tools like QuillBot, and that their performance decreases. A larger dataset was therefore used that includes attacks such as recursive paraphrasing and language translation. For this new set of experiments, in addition to the classical evaluation metrics and the confusion matrix, we used ROC/AUC curves: if a model does not correctly recognize a class, its area under the curve will be less than 0.5 or very close to 0.5. In the following section, we present the most outstanding results and analyze them.

5. Experiments and Analysis of Results

In this section, we present the most salient data from all the experiments and analyze the results obtained, determining which model is best suited to this classification task. Table 5 and Table 6 show the key results (if you wish to consult all the experiments of this first stage, they are available at https://drive.google.com/drive/folders/1LytZlFoOQ7JauQ3IrPUUypEhQvzx1AJf?usp=sharing, accessed on 22 November 2024).
The results obtained show considerable variability in the performance of the classification models depending on the vectorization and embedding techniques employed. Although the TF-IDF vectorization method did not produce results high enough for it to be considered a good classifier, its simplicity still makes it an option to consider, even if its performance was modest.
The LSTM network trained with Word2Vec embeddings for 1000 epochs and a learning rate of 0.0001 showed low performance. Although its results were the best obtained with the Word2Vec dataset, and despite its robustness and ease in handling text sequences, this model fails to correctly capture the complexity present in the dataset. From this result, we can conclude that the choice of both the embeddings and the model architecture was not optimal, since the embeddings failed to effectively capture the context and the relationships between the words in the texts.
The random forest model trained with GloVe embeddings, like the LSTM model with Word2Vec embeddings, also failed to perform optimally. These experiments show that the combination of these techniques is not the most effective, even though these approaches and architectures provided some of the best results among the classification models evaluated with GloVe embeddings.
In contrast, models using more complex embeddings, such as BERT and RoBERTa, demonstrated much better performance. The results were exceptional when BERT embeddings were combined with logistic regression. This indicates that, for effective classification, it is essential to use embeddings that more accurately show the semantic and syntactic relationships of the texts. Furthermore, the outstanding results of combining RoBERTa with a support vector machine (SVM) and 9-fold cross-validation demonstrated the ability of Transformer-based architectures to capture semantic and syntactic relationships in texts, giving it a high quality classification.
In addition, the LLM-created embeddings were exceptional. Combining basic machine-learning classification algorithms with LLM-created embeddings without fine tuning gave positive results. When fine-tuned, the DistilBERT and DistilRoBERTa models also showed good results.
Figure 9, Figure 10 and Figure 11 show the confusion matrix, the PCA plot and the t-SNE plot of the model with the lowest classification performance. It is clear that the creation of adequate embeddings is essential for classes to be linearly separable, indicating that when classes are more differentiated, basic classification models become more useful. In contrast, Figure 12, Figure 13 and Figure 14 show the confusion matrix, the PCA plot and the t-SNE plot of the SVM model using embeddings from the LLaVA LLM, while Figure 15, Figure 16 and Figure 17 show the confusion matrix, PCA plot and t-SNE plot of fine-tuned DistilRoBERTa.

6. Limitations

Although this study focused primarily on identifying texts produced by large language models, there are several constraints that need to be identified. First, our dataset, although varied in terms of topics, writing styles, and fields of knowledge, was constrained by existing computing resources. These restrictions impacted the magnitude of the data processed and the complexity of the models we were able to build. Consequently, the results of this research may not fully reflect the wider range of possible text generation scenarios with different LLMs or the wider scenario of possible applications.
Another significant restriction is that this research focuses only on the identification of texts produced by LLMs in English, without taking into account texts in other languages. Language variety is a crucial element in text identification, and models trained only in English may not be as effective in identifying texts produced in other languages due to variations in grammar, syntax and style.
Given these constraints, future work will focus on creating a larger and more varied dataset that includes texts in other languages, considers adversarial attacks, and uses more sophisticated computational resources. This will facilitate the development of more accurate and scalable models for the identification of texts produced by LLMs in real contexts.

7. Impact and Applicability

The incorporation of models that identify text created by LLMs into plagiarism detection systems can have a significant impact on education and academic research. Currently, there is an increase in the misuse of automatic text generation technologies, which raises serious ethical concerns. Conventional plagiarism detection systems, which are mainly based on exact text matching, fail to detect texts created by large language models, enabling students or creators to present artificially created content as if it were their own. The application of models capable of recognizing these texts can address this gap, thus ensuring greater integrity in academia.
By integrating these sophisticated models into plagiarism detection systems, not only plagiarized texts could be detected but also those created by artificial intelligence, which would allow academic institutions to establish a more precise differentiation between human work and that produced by artificial intelligence. This could be particularly beneficial in virtual education platforms and in scientific studies where the use of large language models is constantly expanding. In addition, optimized detection systems could be merged with current content review tools, simplifying the work of teachers and academics to ensure that the work displayed is unique and ethical.
The social impact of this implementation is significant. Not only does it help combat plagiarism but it also fosters a more ethical and transparent learning environment. As language models become more sophisticated, the ability to differentiate them from human-written texts becomes a crucial element in maintaining trust in education and research. Furthermore, the responsible use of these technologies can be a starting point for new academic policies that promote integrity in the use of AI-based tools, striking a balance between innovation and ethics in academia.

8. Conclusions and Future Work

A key conclusion of this study is that it is not essential to fine-tune complex and resource-intensive models to effectively detect texts generated by large language models (LLMs). The results show that, by choosing appropriate embeddings, simpler and computationally efficient classification models can be employed while maintaining a high level of detection accuracy.
This implies that, by leveraging advanced embeddings that accurately reflect the semantic and syntactic features of texts, the ability of simpler models to differentiate between machine-generated and human-written texts can be greatly enhanced. This approach allows for a significant decrease in the computational costs associated with fine-tuning large models, without sacrificing classification accuracy.
This approach not only facilitates the implementation of models under resource constraints but also broadens access to advanced detection tools in the field of natural language processing. As a result, it opens the possibility of adopting more sustainable and cost-effective solutions, promoting an optimal balance between performance and computational requirements. We believe that our work represents a significant contribution towards the efficient classification of texts generated by large language models.
In future research, we plan to implement a more robust dataset, performing training and embedding generation in a manner similar to that presented in this paper. Currently, we are working on this phase, but due to the larger size and complexity of the new dataset the training and validation time of the models has increased significantly, which also requires a greater use of computational resources. In the end, the models described in this study will be evaluated to determine which one offers the best performance, also considering its ability to handle adversarial attacks.

Author Contributions

Conceptualization, D.S.-O. and G.S.; methodology, D.S.-O.; software, D.S.-O.; validation, D.S.-O., G.S. and L.C.-H.; formal analysis, D.S.-O.; investigation, D.S.-O.; resources, G.S.; data curation, D.S.-O.; writing—original draft preparation, D.S.-O.; writing—review and editing, D.S.-O., G.S., L.C.-H. and B.C.L.-R.; visualization, D.S.-O.; supervision, G.S., L.C.-H. and B.C.L.-R.; project administration, D.S.-O.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Council of Science and Technology (CONACYT), Mexico, through grant A1-S-47854. Additional funding was provided by grants 20241816, 20241819, and 20240951 from the Secretariat of Research and Postgraduate Studies of the Instituto Politécnico Nacional (IPN), Mexico. The APC was funded by CONACYT.

Data Availability Statement

The dataset used in this study is not publicly available. However, the results of the experiments conducted during this study can be accessed at the following link: https://drive.google.com/drive/folders/1LytZlFoOQ7JauQ3IrPUUypEhQvzx1AJf?usp=sharing (accessed on 25 February 2024).

Acknowledgments

The authors of this article would like to thank the National Council of Science and Technology (CONAHCYT) and the Computing Research Center of the Instituto Politécnico Nacional for their support in carrying out this work. The work was done with partial support from the Mexican Government through the grant A1-S-47854 of CONACYT, Mexico, grants 20241816, 20241819 and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank the CONACYT for the computing resources brought to them through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Hugging Face. Preprocessing Data with Transformers. 2024. Available online: https://huggingface.co/docs/transformers/preprocessing (accessed on 10 April 2024).
  2. Interactive Chaos. Machine Learning Tutorial: One Hot Encoding. 2024. Available online: https://interactivechaos.com/es/manual/tutorial-de-machine-learning/one-hot-encoding (accessed on 22 April 2024).
  3. IBM. Bag of Words. 2024. Available online: https://www.ibm.com/topics/bag-of-words (accessed on 24 April 2024).
  4. Towards Data Science. Understanding Word N-Grams and N-Gram Probability in Natural Language Processing. 2024. Available online: https://towardsdatascience.com/understanding-word-n-grams-and-n-gram-probability-in-natural-language-processing-9d9eef0fa058 (accessed on 24 April 2024).
  5. Jain, A. TF-IDF in NLP: Term Frequency-Inverse Document Frequency. 2024. Available online: https://medium.com/@abhishekjainindore24/tf-idf-in-nlp-term-frequency-inverse-document-frequency-e05b65932f1d (accessed on 25 April 2024).
  6. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  7. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. Available online: https://aclanthology.org/D14-1162/ (accessed on 25 April 2024).
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2 June–7 June 2019; Version 2, Last Revised 24 May 2019. Available online: https://arxiv.org/abs/1810.04805 (accessed on 25 April 2024).
  9. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. Available online: https://arxiv.org/abs/1907.11692 (accessed on 25 April 2024).
  10. LlamaIndex. Ollama Embedding Example. 2024. Available online: https://llamaindex.ai (accessed on 10 April 2024).
  11. Scikit-Learn. Logistic Regression. 2024. Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression (accessed on 23 April 2024).
  12. Scikit-Learn. Random Forest. 2024. Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/modules/ensemble.html#random-forests (accessed on 17 April 2024).
  13. Scikit-Learn. Support Vector Machines. 2024. Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/modules/svm.html (accessed on 15 April 2024).
  14. Scikit-Learn. K-Nearest Neighbors. 2024. Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/modules/neighbors.html (accessed on 23 April 2024).
  15. BuiltIn. What Is a Fully Connected Layer in Machine Learning? BuiltIn Machine Learning Topics. 2024. Available online: https://builtin.com/machine-learning/fully-connected-layer (accessed on 23 April 2024).
  16. ScienceDirect. Long Short-Term Memory Networks. 2024. ScienceDirect Topics in Computer Science. Available online: https://www.sciencedirect.com/topics/computer-science/long-short-term-memory-networks (accessed on 23 April 2024).
  17. Hugging Face. Chapter 1.4–NLP Tasks. 2024. Hugging Face NLP Course. Available online: https://huggingface.co/course/chapter1/4 (accessed on 23 April 2024).
  18. Analytics Vidhya. Metrics to Evaluate Your Classification Model to Take the Right Decisions. 2021. Analytics Vidhya Blog. Available online: https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classification-model-to-take-the-right-decisions/ (accessed on 23 April 2024).
  19. IBM. Confusion Matrix: What It Is and How to Use It. 2024. IBM Topics. Available online: https://www.ibm.com/mx-es/topics/confusion-matrix (accessed on 23 April 2024).
  20. IBM. Principal Component Analysis: What It Is and How to Use It. 2024. IBM Topics. Available online: https://www.ibm.com/think/topics/principal-component-analysis (accessed on 25 April 2024).
  21. IBM. Creating t-SNE Charts in SPSS Statistics. 2024. IBM Documentation. Available online: https://www.ibm.com/docs/es/spss-statistics/beta?topic=sslvmb-subs-statistics-mainhelp-ddita-spss-base-chart-creation-tsne-html (accessed on 26 April 2024).
  22. Google Developers. ROC and AUC—Machine Learning Crash Course. Google Machine Learning Crash Course. Available online: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=es-419 (accessed on 28 April 2024).
  23. Pinecone Learning Hub. LangChain Prompt Templates. Pinecone.io Documentation. Available online: https://python.langchain.com/docs/integrations/vectorstores/pinecone/ (accessed on 23 July 2024).
  24. FreeCodeCamp. Fine-Tuning LLM Models—FreeCodeCamp. FreeCodeCamp News. Available online: https://www.freecodecamp.org/news/fine-tuning-llm-models-course/ (accessed on 30 July 2024).
  25. Amazon Web Services. What Is Retrieval-Augmented Generation (RAG)? AWS Documentation. Available online: https://aws.amazon.com/what-is/retrieval-augmented-generation/ (accessed on 1 August 2024).
  26. Sadasivaan, V.S.; Kumar, A.; Balasubramanian, S.; Wang, W.; Feizi, S. Can AI-Generated Text be Reliably Detected? arXiv 2023, arXiv:2303.11156v3. [Google Scholar]
  27. Wu, J.; Yang, S.; Zhan, R.; Yuan, Y.; Wong, D.F.; Chao, L.S. A Survey on LLM-generated Text Detection: Necessity, Methods, and Future Directions. arXiv 2023, arXiv:2310.14724. [Google Scholar]
  28. Bv, P.; Ahmed, S.; Sadanandam, M. DistilBERT: A Novel Approach to Detect Text Generated by Large Language Models (LLM). arXiv 2024, arXiv:3909387v1. [Google Scholar]
  29. Major, A.; Capobianco, M.; Reynolds, M.; Phelan, C.; Shah-Nathwani, K.; Luong, D.; Lee, K.; Kumaravel, M. Supervised Machine Generated Text Detection Using LLM Encoders in Various Data Resource Scenarios. Doctoral Dissertation, Worcester Polytechnic Institute, Worcester, MA, USA, 2023. Available online: https://www.semanticscholar.org/paper/Supervised-Machine-Generated-Text-Detection-Using-Major-Capobianco/a79561bad0a5a3f5b0cb3ba9750ad7851369ff2a (accessed on 25 February 2024).
  30. Blecher, L.; Cucurull, G.; Scialom, T.; Stojnic, R. Nougat: Neural Optical Understanding for Academic Documents. arXiv 2023, arXiv:2308.13418. Available online: https://arxiv.org/abs/2308.13418 (accessed on 25 February 2024).
  31. Ollama-Llama3. LLaMA3. Ollama Library. Available online: https://ollama.com/library/llama3 (accessed on 25 February 2024).
  32. Ollama-Llama2. Llama2. Ollama Library. Available online: https://ollama.com/library/llama2 (accessed on 25 February 2024).
  33. Ollama-Gemma. Gemma. Ollama Library. Available online: https://ollama.com/library/gemma (accessed on 25 February 2024).
  34. Ollama-Llava. LLaVA. Ollama Library. Available online: https://ollama.com/library/llava (accessed on 25 February 2024).
  35. Hugging Face-Distilbert. DistilBERT Model Documentation. Hugging Face Transformers. Available online: https://huggingface.co/distilbert (accessed on 20 April 2024).
  36. Hugging Face. DistilRoBERTa-Base-Distilroberta. Hugging Face Transformers. Available online: https://huggingface.co/distilroberta-base (accessed on 25 April 2024).
Figure 1. Diagram illustrating the stages involved in text preprocessing, from the elimination of non-relevant elements to obtaining vector representations, allowing their use in machine-learning models.
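
For illustration, a minimal sketch of this kind of preprocessing stage is shown below. It assumes NLTK and simple placeholder text; the exact libraries, resources and steps used in this work may differ.

```python
# Illustrative preprocessing sketch (library choices and function names are
# assumptions): cleaning, tokenization, stop-word removal and lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text: str) -> list[str]:
    # Cleaning: lowercase and drop non-alphabetic characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Stop-word removal
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Large language models generate fluent scientific abstracts."))
```
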
Figure 2. Diagram illustrating the process of extracting text in Markdown format from PDF documents hosted on the arXiv platform, using an API query and the implementation of a natural language processing model (Nougat).
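
The query step of this pipeline can be sketched as follows, using the public arXiv export API; the search query and result count are illustrative, and the Nougat extraction step is not shown.

```python
# Sketch of the arXiv query step in Figure 2, using the public arXiv export API
# (search query and number of results are illustrative).
import urllib.request
import feedparser

url = ("http://export.arxiv.org/api/query?"
       "search_query=cat:cs.CL&start=0&max_results=5")
with urllib.request.urlopen(url) as response:
    feed = feedparser.parse(response.read())

for entry in feed.entries:
    print(entry.title)
    print(entry.summary[:120], "...")
```
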
Figure 3. Diagram showing the text preprocessing applied before the texts are supplied to the large language models (LLMs). It starts with an input text that is preprocessed through several stages, including cleaning, tokenization, stop-word removal, spell checking and lemmatization, before being fed to the different LLMs (Gemini, LLaMA2, LLaMA3, LLaVA).
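
A minimal sketch of prompting one of the locally served models through Ollama's REST API is given below; the model name and prompt wording are assumptions for illustration only.

```python
# Sketch of generating text with a locally served model through Ollama's REST API.
import requests

payload = {
    "model": "llama3",
    "prompt": "Write a short scientific abstract about text classification.",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate",
                     json=payload, timeout=300)
print(resp.json()["response"])
```
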
Figure 4. Diagram representing the text transformation process involving recursive paraphrasing, translation and assignment of new labels. The original text goes through several stages of transformation before a new dataset is generated.
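
The recursive paraphrasing step can be sketched as a simple loop in which the same text is paraphrased several times in a row; the prompt wording, model name and number of rounds are assumptions.

```python
# Sketch of recursive paraphrasing: the output of each round becomes the input
# of the next one (prompt and model are illustrative).
import requests

def paraphrase(text: str, rounds: int = 3, model: str = "llama3") -> str:
    for _ in range(rounds):
        payload = {
            "model": model,
            "prompt": f"Paraphrase the following text, keeping its meaning:\n{text}",
            "stream": False,
        }
        resp = requests.post("http://localhost:11434/api/generate",
                             json=payload, timeout=300)
        text = resp.json()["response"]
    return text

print(paraphrase("Large language models can produce convincing scientific prose."))
```
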
Figure 5. Diagram showing the process of generating embeddings from a linguistic corpus. The text is preprocessed and then different techniques (TF-IDF, Word2Vec, GloVe, BERT, RoBERTa and the four LLMs) are used to create vector representations of the words (embeddings), thus generating a new dataset for each technique.
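
Two of these vectorization routes are sketched below: sparse TF-IDF vectors with scikit-learn and dense BERT embeddings with Hugging Face Transformers. The model names, pooling strategy and example texts are assumptions, not necessarily the exact configuration used in this work.

```python
# Sketch of two vectorization routes: TF-IDF and mean-pooled BERT embeddings.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

texts = ["human-written abstract ...", "llm-generated abstract ..."]

# Sparse TF-IDF representation
tfidf_matrix = TfidfVectorizer(max_features=5000).fit_transform(texts)

# Dense BERT embeddings: mean-pooled last hidden states (768 dimensions per text)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    embeddings = bert(**batch).last_hidden_state.mean(dim=1)

print(tfidf_matrix.shape, embeddings.shape)
```
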
Figure 6. This diagram shows the general process of training and evaluation of basic models, using different validation techniques and evaluation metrics.
Figure 7. This diagram represents a basic neural network training and evaluation process. Embeddings and labels are used to train the model, and then its performance is evaluated through various metrics and visualizations.
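
As an illustration of this training step, the sketch below builds a small LSTM classifier of the kind evaluated with Word2Vec features; the input files, layer sizes and number of epochs are assumptions.

```python
# Sketch of an LSTM classifier over precomputed sequence features
# (file names, layer sizes and epochs are illustrative).
import numpy as np
from tensorflow.keras import layers, models

X = np.load("word2vec_sequences.npy")   # assumed shape: (samples, timesteps, features)
y = np.load("labels.npy")               # assumed integer class labels

model = models.Sequential([
    layers.Input(shape=X.shape[1:]),
    layers.LSTM(128),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),   # five classes: human + four LLMs
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, validation_split=0.1)
```
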
Figure 8. This diagram compares two main approaches to working with large language models (LLMs): prompt engineering and fine tuning. Both methods seek to optimize model performance, but use different strategies.
Figure 9. The figure shows a confusion matrix that evaluates the performance of a classification model (LSTM with Word2Vec) on a dataset. Each cell of the matrix represents the number of instances that were classified in a certain class but actually belong to another class.
Figure 10. The PCA diagram presents a two-dimensional visualization of the training data, reducing the original dimensionality to only two principal components. The different colored dots represent the different classes that the model attempts to classify. The scatter and clustering of the dots gives us an intuitive idea of how well the model separates the classes. If points in the same class are compactly grouped and separated from points in other classes, it indicates good model performance. In this case, it appears that some classes are better separated than others.
Figure 11. The t-SNE plot shows the distribution of the data in a low-dimensional space, revealing the ability of the LSTM+Word2Vec model to separate the different classes. The considerable overlap between groups indicates that the model has difficulty distinguishing between certain categories. This suggests that improvements can be made to the model to achieve greater classification accuracy.
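
The two-dimensional projections behind these plots can be sketched as follows; the input files, perplexity value and plotting details are assumptions.

```python
# Sketch of the 2-D PCA and t-SNE projections used for the class-separation plots.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.load("embeddings.npy")   # assumed embedding matrix
y = np.load("labels.npy")       # assumed integer class labels

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=y, s=5)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=y, s=5)
axes[1].set_title("t-SNE")
plt.show()
```
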
Figure 12. The confusion matrix shows the performance of the SVM model trained with LLaVA embeddings, evaluated by 10-fold cross-validation. The main diagonal reveals a high level of hits, indicating that the model correctly classifies most of the samples. However, some classification errors are observed, suggesting that the model could benefit from additional adjustments.
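
A minimal sketch of this evaluation, an SVM over precomputed embeddings scored with 10-fold cross-validation, is shown below; the file names and kernel choice are assumptions.

```python
# Sketch of 10-fold cross-validated SVM evaluation on precomputed LLM embeddings.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X = np.load("llava_embeddings.npy")   # assumed embedding matrix
y = np.load("labels.npy")             # assumed integer class labels

pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=10)
print(classification_report(y, pred, digits=4))
print(confusion_matrix(y, pred))
```
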
Figure 13. Principal component analysis (PCA) reveals the ability of the SVM model trained with LLaVA embeddings to separate the different classes. The formation of distinct groups indicates good classification performance, especially for the LLaVA and Gemini classes. However, some overlap is observed between human, LLaMA2 and LLaMA3, suggesting that the model could benefit from additional tuning.
Figure 14. The t-SNE analysis reveals the ability of the SVM model trained with LLaVA embeddings to separate the different classes. The formation of distinct groups indicates good classification performance, especially for the LLaVA and Gemini classes. However, some overlap is observed between human, LLaMA2 and LLaMA3, suggesting that the model could benefit from additional tuning.
Figure 15. The confusion matrix shows the performance of the DistilRoBERTa model fitted over 10 epochs. The main diagonal reveals a high level of hits, indicating that the model correctly classifies most of the samples. However, some misclassification errors are observed, especially in the LLaMA2 and LLaMA3 classes, suggesting that the model could benefit from additional adjustments.
Figure 16. Principal component analysis (PCA) reveals the ability of the fitted DistilRoBERTa model to separate the different classes. The formation of distinct groups indicates good classification performance, especially for the human, LLaVA and Gemini classes. However, some overlap is observed between the LLaMA2 and LLaMA3 classes, suggesting that the model could benefit from further adjustments.
Figure 17. The t-SNE analysis reveals the ability of the fitted DistilRoBERTa model to separate the different classes. The formation of distinct groups indicates good classification performance, especially for the human, LLaVA and Gemini classes. However, some overlap is observed between LLaMA2 and LLaMA3, suggesting that the model could benefit from additional adjustments.
Table 1. Comparative table of studies on the detection of texts generated by LLMs.
Details | Sadasivaan et al. [26] | Wu et al. [27] | Kumar et al. [28] | Capobianco et al. [29] | My Proposal
Dataset
DAIGT-V3, LLM-DetectAI
HC3 corpus
New dataset
Models Used
Deep Learning
Classical algorithms
Watermarking
Transformers BERT and RoBERTa
Main Approach
Detection of type 1 and 2 errors
Classification of texts generated by GPT
Detection of texts generated by LLMs
Detection of texts generated by various LLMs
Note: The checkmark (✔) indicates that the corresponding datasets and classification models were used to achieve a specific objective listed in the “Main Approach” column.
Table 2. Dataset description.
Column Name | Column Function
Title | Represents the title of the article.
Abstract | Provides a concise summary of the article’s content.
Category | Indicates the subject area or category assigned to the article in Arxiv.
Label | Specifies the label used for classification tasks.
Table 3. Description of the columns in the dataset.
Column Name | Description
Title | Article title
Abstract | Article abstract
Category | Arxiv category
Preprocessed texts | Preprocessed texts
Label | Text label
One-Hot | One-hot encoded labels
Table 4. Size of embeddings generated by different LLMs.
Model | Embedding Size
LLaMA3 | 4096
LLaMA2 | 4096
Gemini | 2048
LLaVA | 2048
BERT | 768
RoBERTa | 1024
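
The LLM embeddings in this table can be obtained from locally served models; the sketch below shows one way to do so with Ollama's embeddings endpoint, where the vector length corresponds to the sizes reported above. The model name and input text are illustrative.

```python
# Sketch of requesting an embedding from a model served by Ollama.
import requests

resp = requests.post("http://localhost:11434/api/embeddings",
                     json={"model": "llama3", "prompt": "Sample abstract text."})
vector = resp.json()["embedding"]
print(len(vector))   # e.g., 4096 for LLaMA3
```
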
Table 5. Results of basic classification models with different embeddings.
Embeddings | Model | k-Fold | Epochs | Accuracy | Precision | Recall | F1
TF-IDF | Logistic Regression | 10 | - | 0.6981 | 0.7040 | 0.6981 | 0.6938
N-grams | Logistic Regression | 10 | - | 0.7165 | 0.7250 | 0.7132 | 0.7153
Word2Vec | LSTM | - | 1000 | 0.4761 | 0.4549 | 0.4648 | 0.4523
GloVe | Random Forest | 10 | - | 0.5161 | 0.5153 | 0.5161 | 0.5081
BERT | Logistic Regression | 10 | - | 0.7812 | 0.7825 | 0.7812 | 0.7808
RoBERTa | SVM | 9 | - | 0.9233 | 0.9235 | 0.9233 | 0.9234
LLM Gemini | LSTM | - | 500 | 0.7225 | 0.7288 | 0.7120 | 0.7119
LLM LLaMA2 | SVM | 10 | - | 0.9860 | 0.9860 | 0.9860 | 0.9860
LLM LLaMA3 | Logistic Regression | 10 | - | 0.9861 | 0.9862 | 0.9861 | 0.9861
LLM LLaVA | SVM | 10 | - | 0.9899 | 0.9899 | 0.9899 | 0.9899
Table 6. Results of the Transformer architecture models and large language models (LLMs) implementing fine tuning.
Model | Epochs | Accuracy | Precision | Recall | F1
DistilBERT | 10 | 0.8077 | 0.8312 | 0.8077 | 0.8097
DistilBERT | 100 | 0.8400 | 0.8517 | 0.8400 | 0.8429
DistilRoBERTa | 10 | 0.9974 | 0.9974 | 0.9974 | 0.9974
DistilRoBERTa | 100 | 0.9954 | 0.9955 | 0.9954 | 0.9954
LLaMA2 | 5 | 0.9932 | 0.9953 | 0.9931 | 0.9935
LLaMA3 | 5 | 0.9952 | 0.9943 | 0.9966 | 0.9963
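
For reference, a minimal fine-tuning sketch for DistilRoBERTa on a five-class task of this kind is given below; the placeholder data, hyperparameters and output path are assumptions and do not reproduce the exact training setup reported in Table 6.

```python
# Minimal fine-tuning sketch for DistilRoBERTa with Hugging Face Transformers
# (placeholder data and illustrative hyperparameters).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["example human abstract ...", "example LLM-generated abstract ..."]
labels = [0, 1]   # placeholder labels; the real task has five classes

tok = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=5)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, padding="max_length"),
            batched=True)

args = TrainingArguments(output_dir="distilroberta-detector",
                         num_train_epochs=10,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ds).train()
```
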
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
