The primary objective of this paper is to compare methods of numerical text representation using various classifiers in the detection of content generated by large language models. This section presents the data used in the comparison process and the insights from their exploration, which significantly influenced the selection of appropriate data processing and cleaning methods. Selected results of the numerical text representation methods and classifiers used for comparison are also introduced.
5.2. Data Preprocessing
To transform the raw essay content into a numerical representation, it must first be processed and cleaned appropriately. The text must be in a form suitable for analysis, free of digits, punctuation marks, or errors that could hinder the NLP process. The first step involved thoroughly verifying the contents of the essays to find any unusual characters that could not be fully eliminated using standard text-cleaning libraries and functions. The main challenge in the cleaning process was the non-standard links present in the text as parts of bibliographies. Most links were complex and long, and nearly all contained randomly placed spaces, so constructing a regular expression that eliminated them without removing the text that followed proved difficult. Ultimately, most links were removed with a single, carefully constructed regular expression. The remaining links were analyzed, and a custom list of tokens to be removed from the essay content was created, including [“www”, “org”, “html”, “http”, “https”, “edgarsnyder”, “pittsburgcriminalattorney”, “thezebra”, “fivethirtyeight”, …].
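The exact regular expression used in the study is not given; the sketch below, with a hypothetical URL_PATTERN and the token list quoted above, only illustrates the two-stage approach (one broad pattern first, then a custom token filter for the leftovers).

```python
import re

# Illustrative pattern (not the exact expression used in the study): it targets
# URL-like fragments while tolerating stray spaces inside the link.
URL_PATTERN = re.compile(
    r"(?:https?\s*:\s*/\s*/|www\s*\.)\S+(?:\s*\.\s*\S+)*",
    flags=re.IGNORECASE,
)

# Custom tokens collected from the remaining, hard-to-match links.
CUSTOM_TOKENS = {"www", "org", "html", "http", "https",
                 "edgarsnyder", "pittsburgcriminalattorney",
                 "thezebra", "fivethirtyeight"}

def remove_links(text: str) -> str:
    """Drop URL-like fragments, then filter out leftover link tokens."""
    text = URL_PATTERN.sub(" ", text)
    return " ".join(w for w in text.split() if w.lower() not in CUSTOM_TOKENS)
```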
It was also noted that the texts contained many newline characters, such as “\n”, “\r”, and their combinations. Additionally, non-standard punctuation marks were found in the text, such as quotation marks appearing in three different typographic forms instead of the standard one, which were undetectable by the punctuation-removal functions in NLP libraries. Another characteristic of essays generated by large language models was placeholders that students were supposed to replace, such as “STUDENT_NAME”, “TEACHER_NAME”, or “Mrs/Mr.{insert principal’s name here}”. Automatically removing punctuation marks with ready-made functions would produce words like “STUDENTNAME” and similar combinations, which are unlikely to be found in dictionaries. Therefore, characters in these characteristic forms (such as newline characters and non-standard quotation marks) were replaced with spaces using regular expressions.
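A minimal sketch of this replacement step follows; the specific set of non-standard quotation marks and placeholder delimiters listed here is an assumption, since the paper does not enumerate them.

```python
import re

# Replace problematic characters with spaces rather than deleting them, so that
# e.g. "STUDENT_NAME" does not collapse into a single dictionary-less token.
NEWLINES = re.compile(r"[\r\n]+")
FANCY_QUOTES = re.compile(r"[\u201c\u201d\u201e\u2018\u2019«»]")  # assumed set
PLACEHOLDER_DELIMS = re.compile(r"[_{}\[\]]")

def normalize_characters(text: str) -> str:
    text = NEWLINES.sub(" ", text)            # "\n", "\r" and their combinations
    text = FANCY_QUOTES.sub(" ", text)        # non-standard quotation marks
    text = PLACEHOLDER_DELIMS.sub(" ", text)  # placeholders such as "STUDENT_NAME"
    return text
```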
Subsequent steps included converting the text to lowercase and removing the remaining punctuation marks, multiple spaces, and digits using the gensim library. Additionally, words consisting of two or fewer letters were eliminated to reduce data sparsity without risking the removal of words that could affect the meaning and context of the entire sentence. The final step was eliminating essays that were responses generated incorrectly by the large language models. One example of such a record was an essay identified as code for the prompt “Facial action coding system”. Other incorrectly generated essays were identified through character count analysis. The dataset contained six records with fewer than 400 characters, all of which were AI-generated essays with truncated or unfinished content or forms atypical for essays, possibly due to misinterpretation of the prompt by the model.
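These cleaning steps map naturally onto gensim’s preprocessing filters; the pipeline below is a sketch under the assumption that these (or equivalent) functions were used, with df and the “text”/“clean” column names as hypothetical placeholders.

```python
from gensim.parsing.preprocessing import (
    preprocess_string, strip_punctuation, strip_multiple_whitespaces,
    strip_numeric, strip_short,
)

# Filters applied in order; stop-word removal and stemming are deliberately
# omitted (stop words are kept, and lemmatization is performed later with spaCy).
FILTERS = [
    lambda s: s.lower(),                   # lowercase
    strip_punctuation,                     # remaining punctuation marks
    strip_numeric,                         # digits
    strip_multiple_whitespaces,            # multiple spaces
    lambda s: strip_short(s, minsize=3),   # words of two or fewer letters
]

def clean_essay(text: str) -> str:
    return " ".join(preprocess_string(text, FILTERS))

# Hypothetical usage; records with fewer than 400 characters (truncated or
# malformed AI outputs) are dropped afterwards.
# df["clean"] = df["text"].apply(clean_essay)
# df = df[df["text"].str.len() >= 400]
```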
Table 2 presents examples of the described records, which were removed from the dataset.
The cleaned text was placed in a new column. A fragment of the dataset after preprocessing is presented in Table 3. With the cleaned dataset, the word count distribution for AI-generated essays and student-written essays was verified using the column with cleaned content. The results are presented in Figure 2 and Figure 3.
After filtering and cleaning, the dataset contained only valid essays.
Figure 2 presents the distribution of the number of words in AI-generated essays. The word count distribution is similar to a normal distribution, which indicates that while most AI-generated essays center around a particular word count, there is still considerable variation, with some essays having slightly fewer or more words. Shorter essays are also much more common here than among student-written texts.
Figure 3 shows the distribution of the number of words in student-written essays. Here, the word count distribution resembles a Poisson distribution, with a higher occurrence of shorter essays. However, despite the higher frequency of shorter essays, student-written essays have a higher minimum word count compared to AI-generated ones, which suggests that students likely perceive an essay as a formal task with a required or expected minimum length. Additionally, the distributions suggest that students tend to write essays with a wider range of lengths, possibly influenced by individual writing styles, interpretations of prompts, or external constraints such as time limits.
Table 4 presents statistical results regarding the number of words in the series. Student-written essays generally have a higher word count, averaging 404, compared to AI-generated essays, with an average of 320 words. The minimum word count for students is 136, and the highest is 1611, more than double the maximum word count in AI-generated essays, where the minimum word count is 62, and the maximum is 764.
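The word-count analysis can be reproduced along the following lines; the file name and the “clean”/“generated” column names are illustrative assumptions about the cleaned dataset described above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed cleaned dataset with a "clean" text column and a "generated" label column.
df = pd.read_csv("essays_clean.csv")
df["word_count"] = df["clean"].str.split().str.len()

# Per-class summary corresponding to Table 4.
print(df.groupby("generated")["word_count"].agg(["mean", "min", "max"]))

# Word-count histograms per class (cf. Figures 2 and 3).
for label, name in [(1, "AI-generated"), (0, "student-written")]:
    df.loc[df["generated"] == label, "word_count"].plot(
        kind="hist", bins=50, alpha=0.6, label=name
    )
plt.xlabel("Number of words")
plt.legend()
plt.show()
```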
The next step in the experiments was to analyze different stop word lists. It was decided not to remove stop words before further analysis due to the presence of words on the lists whose elimination could disrupt the context of entire sentences, such as “against”, “nobody”, or “not”, whose occurrence in a sentence can completely change its meaning. The stop word lists also included many words whose removal would not significantly affect the sentence context, such as “eg”, “be”, “kg”, “ie”, “a”, or “km”. Such expressions and abbreviations were filtered out in earlier stages by removing words with fewer than three letters.
The final steps of processing involved tokenization and lemmatization, i.e., dividing essays into tokens and then reducing them to their base forms. Space characters were used as the criterion for tokenization. Before lemmatization, the dataset was split into training and test sets. Splitting first prevents accidental information leakage between the sets: if lemmatization were applied to the entire dataset before the split, characteristics of the test data could influence how the training data are processed, compromising the integrity of the evaluation. Processing each set independently maintains a clear separation between training and test data, ensures that models are trained and evaluated only on their designated data, and mimics real-world model usage. A key step in splitting the dataset was balancing the number of essays for each label. The number of essays from the majority class (student-written essays, label 0) was reduced by randomly selecting as many essays as there were in the minority class (AI-generated essays, label 1). This reduced the number of records from 44,860 to 34,978. The balanced dataset was then split into training and test sets with stratification by label, ensuring an even label distribution in both sets; the test set comprised 25% of the data. The final record counts and label distributions are presented in Table 5.
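A sketch of the balancing and stratified split described above, assuming a pandas DataFrame with a “generated” label column and scikit-learn’s train_test_split; the random seed is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Balance classes by down-sampling the majority class (label 0, student-written)
# to the size of the minority class (label 1, AI-generated).
n_minority = (df["generated"] == 1).sum()
majority = df[df["generated"] == 0].sample(n=n_minority, random_state=42)
balanced = pd.concat([majority, df[df["generated"] == 1]])

# Stratified 75/25 split performed before lemmatization to avoid leakage.
train_df, test_df = train_test_split(
    balanced, test_size=0.25, stratify=balanced["generated"], random_state=42
)
```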
It was also necessary to verify that all the prompts used to generate essays were present in both sets. The analysis showed no omissions in either the training or the test set, indicating that each prompt was adequately represented in the data used for training and testing the models. In addition, a frequency analysis was performed to ensure that each prompt appeared in both sets in sufficiently diverse quantities. The results are presented in Table 6.
The en_core_web_sm language model, a pre-trained English model offered by spaCy, was used to transform tokens into their base forms. This model contains information not only about base forms but also about part-of-speech tagging and syntactic dependencies, thus considering the structure of the analyzed texts. The process was conducted separately for training and test data.
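A minimal lemmatization sketch using spaCy’s en_core_web_sm model; the column names and the commented-out per-set application are assumptions consistent with the procedure described above.

```python
import spacy

# The model must be installed first, e.g. via: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> str:
    """Reduce each token to its base form using the pre-trained English model."""
    return " ".join(token.lemma_ for token in nlp(text))

# Applied separately to the training and test sets (hypothetical column names):
# train_df["lemmas"] = train_df["clean"].apply(lemmatize)
# test_df["lemmas"] = test_df["clean"].apply(lemmatize)
```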
5.3. Results of Classification
To evaluate the effectiveness of the text representations, we used various classifiers, including the following (a minimal configuration sketch is given after the list):
Logistic Regression: a baseline model effective for linearly separable data;
Random Forest: an ensemble method that combines multiple decision trees to improve robustness and accuracy;
XGBoost and LightGBM: advanced boosting algorithms that iteratively improve model performance by focusing on misclassified instances.
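The sketch below assumes the scikit-learn, xgboost, and lightgbm Python APIs; the hyperparameters shown are library defaults or simple illustrative choices, not the exact settings used in the study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Illustrative configurations for the four classifiers compared in the study.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
}

# X_train / X_test hold one of the numerical text representations (e.g. TF-IDF),
# y_train / y_test the labels (1 = AI-generated, 0 = student-written).
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     print(name, clf.score(X_test, y_test))
```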
Experiments were conducted on a dataset comprising essays labeled as either AI-generated or student-written. The dataset was split into training and test sets to objectively evaluate model performance. The division ensured that 75% of the data was used for training and 25% for testing, providing a reliable assessment of each model’s performance. Various metrics, such as accuracy, precision, recall, and F1-score, were used to assess the classifiers.
The comprehensive methodology employed in this study aimed to provide a robust comparison of various text representation techniques, highlighting their strengths and weaknesses in the context of text classification. This approach contributes to a better understanding of how different methods can be leveraged to improve natural language processing tasks.
Table 7 shows the accuracy comparison for each classifier, highlighting the highest results for each.
For logistic regression, the highest accuracy was observed with TF-IDF weights, achieving 99.82%. Similarly high values were obtained with Bag of Words (99.05%) and parameterized TF-IDF weights (98.98%). For random forests, the best performance was with word embeddings created using the fastText model (Skip-gram) trained on the essay dataset, achieving an accuracy of 98.03%. LightGBM also performed well with these embeddings, reaching an accuracy of 98.83%. For XGBoost, the highest accuracy was noted for Bag of Words (98.96%), with closely matching accuracies for TF-IDF weights, parameterized TF-IDF weights, and fastText word embeddings (Skip-gram), at 98.78%, 98.78%, and 98.91%, respectively.
To provide a more detailed comparison, additional metrics were calculated from the confusion matrix. Besides accuracy, precision, recall, specificity, F1-score, and MCC (Matthews correlation coefficient) were considered. The values of these indicators were calculated assuming class 1 (essays generated by large language models) as the positive class. This allows for a precise evaluation of how well the embeddings and classifiers perform in detecting this class. The results are presented in Table 8, with the best results for each numerical representation highlighted.
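These confusion-matrix-based metrics can be computed as follows, with class 1 (AI-generated) as the positive class; y_test and y_pred are assumed to come from one representation–classifier pair trained as sketched above.

```python
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, matthews_corrcoef,
)

# Confusion matrix with class 1 (AI-generated essays) as the positive class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, pos_label=1),
    "recall": recall_score(y_test, y_pred, pos_label=1),
    "specificity": tn / (tn + fp),   # not provided directly by scikit-learn
    "f1": f1_score(y_test, y_pred, pos_label=1),
    "mcc": matthews_corrcoef(y_test, y_pred),
}
```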
For Bag of Words, TF-IDF weights, parameterized TF-IDF weights, and fastText word embeddings (Skip-gram), accuracies for all classifiers were equal to or greater than 97.72%. For fastText embeddings and Continuous Bag of Words, accuracies were equal to or greater than 96.59%. The lowest performance was observed for embeddings created using the pre-trained fastText model with logistic regression—this was the only instance where accuracy fell below 90% (89.88%). For the same embeddings and other classifiers, accuracies did not exceed 97.42%. For MiniLM sentence embeddings, the highest accuracy was 96.47%, and the lowest was 95.06%. For distilBERT sentence embeddings, the lowest and highest accuracies were 93.78% and 96.63%, respectively.
TF-IDF weights and parameterized TF-IDF weights achieved the highest accuracy with logistic regression. However, for Bag of Words, the fastText model (both Skip-gram and Continuous Bag of Words), and MiniLM embeddings, the best performance was obtained with the XGBoost classifier. For distilBERT sentence embeddings, the highest accuracy was likewise achieved with logistic regression, but precision and specificity were higher for the XGBoost classifier. This indicates that, despite its slightly lower accuracy, XGBoost produced more reliable positive predictions and made fewer errors on negative cases in the test set.