Article

A Black-Box Analysis of the Capacity of ChatGPT to Generate Datasets of Human-like Comments

by Alejandro Rosete 1,2, Guillermo Sosa-Gómez 3,* and Omar Rojas 3
1 Facultad de Ingeniería Informática, Universidad Tecnológica de La Habana José Antonio Echeverría (Cujae), Marianao, La Habana 19390, Cuba
2 Avangenio S.R.L., 5ta B. esq. 6, Miramar, Playa, La Habana 11300, Cuba
3 Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan 45010, Jalisco, Mexico
* Author to whom correspondence should be addressed.
Computers 2025, 14(5), 162; https://doi.org/10.3390/computers14050162
Submission received: 7 March 2025 / Revised: 23 April 2025 / Accepted: 23 April 2025 / Published: 27 April 2025

Abstract: This paper examines the ability of ChatGPT to generate synthetic comment datasets that mimic those produced by humans. To this end, a collection of datasets containing human comments, freely available in the Kaggle repository, was compared to comments generated via ChatGPT. The latter were based on prompts designed to provide the necessary context for approximating the human results. It was hypothesized that the responses obtained from ChatGPT would demonstrate a high degree of similarity with the human-generated datasets with regard to vocabulary usage. Two categories of prompts were analyzed, depending on whether they specified the desired length of the generated comments. The evaluation of the results primarily focused on the vocabulary used in each comment dataset, employing several analytical measures. This analysis yielded noteworthy observations, which reflect the current capabilities of ChatGPT in this particular task domain. It was observed that ChatGPT typically employs fewer words than human respondents and tends to provide repetitive answers. Furthermore, the responses of ChatGPT vary considerably when the length is specified. It is noteworthy that ChatGPT employs a smaller vocabulary, which does not always align with human language. Furthermore, the proportion of non-stop words in ChatGPT’s output is higher than that found in human communication. Finally, the vocabulary of ChatGPT is more closely aligned with human language than the two ChatGPT configurations are with each other. This alignment is particularly evident in the use of stop words. While it does not fully achieve the intended purpose, the generated vocabulary serves as a reasonable approximation, enabling specific applications such as the creation of word clouds.

1. Introduction

In recent years, there has been a marked increase in the influence of artificial intelligence (AI) across a variety of disciplines. A notable milestone in this trend has been the widespread adoption of large language models (LLMs), particularly ChatGPT [1,2,3,4]. Generative AI is increasingly being applied to a wide range of problems and now affects many aspects of daily life. ChatGPT and other text-generation tools have had a considerable impact on fields such as medicine, management, science, and education. Examples of these application areas include code generation [5,6], social work [7], cybersecurity [8], genomics [9], the evaluation of research quality [10,11,12], quality assurance [13], mental health [14], art [15], education [16,17,18,19], sentiment analysis [20], text writing [21], writing quality assessment [22], text synthesis [23], health information retrieval [24], and medicine [25,26]. It is worth noting that, in most of these uses of ChatGPT, the goal is to produce a single comment or response for a particular situation: each prompt addresses a different situation, but the model only needs to generate one response for it.
Despite the pervasive utilization of LLMs, numerous concerns have been articulated from diverse vantage points [6,27,28]. One important concern is the ethical aspects discussed in [16,20,21,29], e.g., the gender bias [30] and the detection of LLM-assisted writing in scientific papers without acknowledgment [31]. Another important concern is related to the precision of the results, as discussed in several papers [5,14,25,32,33,34]. In this context, the tendency for models to collapse when the data generated via an LLM are used recursively to train new models is also a critical and timely issue [35].
In this paper, we investigate the potential of ChatGPT to generate synthetic datasets of comments that emulate those produced by humans concerning specific services. These datasets bear a resemblance to those employed for sentiment analysis, as evidenced by their availability on platforms such as Kaggle (e.g., [36,37,38,39]) and other sites. In contrast to the previously cited applications of ChatGPT, which require a response for each individual situation, the task of generating a dataset of comments necessitates the creation of multiple high-quality comments. These comments must exhibit diversity and precision in relation to the desired focus. Furthermore, the length of the comments is an important factor to consider in some cases.
The utilization of AI-generated comments as a substitute for human-generated datasets is imperative in scenarios where such datasets are not available. This scenario frequently arises in the training of models to classify human comments during the nascent stages of a service, where comments are not yet available for model development. Moreover, these synthetic comments can serve as a substitute for data augmentation [33,40] during the training of machine learning algorithms, thereby enhancing or supplementing datasets when their quality and/or quantity is inadequate. Additionally, the availability of a diverse range of comment datasets, potentially generated using AI, is crucial for conducting experimental studies that evaluate new algorithms. This is particularly beneficial when tailored datasets are required, such as those customized in terms of theme, comment volume, and comment length, to investigate the impact of various factors on algorithm performance. Consequently, if ChatGPT can generate such datasets, it will serve as a valuable alternative—or complement—to the labor-intensive process of gathering human opinions across various contexts for training machine learning models.
The objective of this paper is to examine ChatGPT’s capacity to generate synthetic datasets of comments that closely resemble those produced by humans. To this end, an experimental comparison was conducted of four human-generated datasets obtained from Kaggle with ChatGPT-generated datasets corresponding to each context of the respective Kaggle datasets. The focus of this study is on the lexical aspects of the task, specifically the vocabulary used.
The remainder of the paper is organized as follows. Section 2 outlines the motivation, focus, and limitations of the study; these limitations also point to some interesting lines for future research. Section 2 also provides a detailed introduction to the experimental protocol, including access to the analyzed data and code to facilitate replication, and explains how the various datasets (generated by humans and via ChatGPT) were processed for comparison based on several metrics. Section 3 presents the main findings and summarizes the key insights derived from the analysis. The main results of the paper are outlined in the conclusions, together with some lines for future research, for example, possible prompt strategies.

2. Experimental Framework

2.1. Considerations on the Motivation, Focus, and Limitations of the Experimental Study

A number of studies have been conducted in this field, and it is essential to elucidate several key points to clarify the motivation behind the current proposal, its focal point, and its inherent limitations. These limitations can also be regarded as potential avenues for future research, as they extend beyond the scope of the present study.
Despite the lack of explicit design of LLMs to generate textual datasets, particularly comment datasets, there is a demonstrated capacity for success in other contexts involving similar variations in generated text. These include the generation of email, the rephrasing of ideas, the creation of multiple poems or songs based on a title, variations in output texts from similar prompts, and the solving of academic problems, among others [6,41]. Indeed, the capabilities of LLMs in these tasks have raised several concerns in modern society. Such concerns include issues related to plagiarism and educational integrity. These issues pose various risks that need to be addressed [6,28]. The similarity of these tasks, each corresponding to the category of question-answer according to [42], motivates an investigation of ChatGPT’s potential in this under-explored area.
For tasks involving the generation of comment datasets, there are several reasons to believe that the underlying model of ChatGPT may be capable of accomplishing this task. First, its training set encompasses a vast diversity of textual information, containing a wide array of words (expressed in terms of tokens) that can be utilized to vary the generated text, even when employing the same writing pattern (e.g., variation through the use of synonyms). Secondly, the training set incorporates a wide array of writing patterns, which can be leveraged to generate variability in the comments, despite the “Register leveling” phenomenon, a tendency to combine or overlap styles or genres, resulting in homogenized output with reduced differentiation [43]. Thirdly, the stochastic nature of the LLMs, when combined with the aforementioned factors, engenders the generation of a vast number of response combinations. This is particularly salient in the present context, where disparities in polarity and sentiment can be introduced as an ancillary source of variability.
The decision to prioritize ChatGPT stemmed from its status as a prominent large LLM that plays a fundamental role in the current rise of artificial intelligence, particularly generative artificial intelligence. Extensive research has been conducted on ChatGPT in various fields, including education [6,13,16,18,19,44], medicine [14,24,25,30], ethics [26], art [15], biology [9], cybersecurity [8], social work [7], marketing [4], bibliometrics [10,11], and text processing [20,21,22,32,34,45], amongst others. It was hypothesized that a study of this nature would be required in the future for other LLMs and their various versions. Furthermore, we posit that this undertaking (i.e., the generation of synthetic comment datasets) would constitute a compelling and stimulating challenge at conferences and seminars related to LLMs.
Due to the unique nature of the task examined in this paper, the focus is directed towards an analysis of the lexical aspects of the generated comments. However, this focus introduces a clear limitation, as the grammatical structure and semantic analysis of the text are also crucial components. Nevertheless, we posit that examining the lexical aspect is a foundational step in this endeavor, enabling us to conduct common textual analyses such as word frequency and the creation of word clouds. A plethora of techniques have been developed for the purpose of regulating or refining the results of LLMs. The following are a few examples of such techniques: [42,46,47,48]. However, the employment of such sophisticated methodologies is imperative when the LLM is incapable of executing the task in its most elementary form, such as by responding to a direct prompt. In the future, the exploration of additional approaches may be undertaken to enhance the outcomes observed in this study. These include the utilization of initial comments generated via an LLM as exemplars for subsequent comments, the generation of comments in multiple steps with distinct prompts for each instance (varying the focus, polarity, and grammatical structure of the provided examples), and the integration of results obtained from one LLM to inform another. Additionally, direct control of the API or the use of complex structures such as the GAL model [48] or the HELM model [42] are interesting lines to be explored in the future.
In accordance with the preceding arguments, this paper describes an experiment based on direct prompts. The approach employed in this experiment is consistent with that delineated in several preceding publications. In [43], a prompt was provided with the “headlines plus the first three words of the lead paragraph” being used “as prompts for LLMs to generate news”. Similarly, in [45] the “exact thesis titles were used to generate abstracts and introductions using GPT-3.5”. As in those studies, the fundamental objective was to evaluate the capacity of the LLM to generate extended texts from the minimal synthesis that characterizes the human-generated reference materials; the same methodology was employed in the present study. The design of the prompts constitutes a significant area of research in itself. To initiate this investigation, we opted for a focused approach, concentrating on the examination of ChatGPT’s lexical capabilities in generating comment datasets in response to direct prompts.
In their paper [45], the authors assumed that the text generated via GPT would be semantically aligned with the title presented in the prompt. This approach is analogous to the method employed in [43] for message generation. Both studies employed word counts as a metric to assess the disparities between human-written and LLM-generated texts. Additionally, cosine similarity was applied in [43]. Specifically, in [45], the authors also compared the specific word sets produced using each method. The findings of both studies indicated that LLMs have a propensity to yield shorter sentences. In this context, our focus is directed towards a more detailed comparison of the specific words generated for each scenario, as discussed in [43,45]. In some research, the focus has been on augmenting textual datasets [41,46,47,49]. This may represent an alternative approach to consider in future research. However, it is crucial to first assess the direct feasibility of accomplishing this task. Furthermore, the implementation of certain approaches in this context presents significant challenges. For instance, the chain-of-thought approach [41] implicitly assumes that there is an expected correct answer, which is difficult to define in the context of comment generation, where each comment can vary significantly in polarity, style, and topics addressed yet still be considered correct (“The customer is always right”). The datasets examined in this study are available for download. This accessibility enables the augmentation of the present study to encompass additional linguistic dimensions, such as grammar, style, semantics, and other pertinent measures.

2.2. Selection of Cases

In order to explore the lexical capacity of ChatGPT to generate a dataset of comments, four datasets from Kaggle related to the quality of different services were obtained. The objective was to establish a reference for the outcomes expected from ChatGPT. Section 2.3.1, Section 2.3.2, Section 2.3.3 and Section 2.3.4 describe these datasets in more detail. These datasets pertain to Amazon services [36], Indian airlines [37], McDonald’s experiences [38], and a women’s clothing store [39]. The datasets under consideration exhibit a certain degree of internal coherence with regard to content. It was expected that this coherence would be articulated through the use of a common vocabulary related to services. However, given the unique characteristics inherent to each service, it is reasonable to expect variations in the vocabulary utilized, with specific terms exhibiting particular suitability for distinct contexts. Each dataset is accompanied by a detailed description, including the number of comments, the intended application of the dataset, the included columns, and an explanation of their meanings. This comprehensive information may facilitate the formulation of pertinent prompts to query ChatGPT based on these descriptions.
The subsequent subsections delineate the methodological framework employed to conduct the comparative analysis. This methodological framework enables the extension of the experimental framework in future directions, encompassing the following variations: additional datasets from Kaggle or alternative sources; alternative LLMs or variants of ChatGPT; alternative metrics for dataset evaluation; and alternative prompt strategies.
For each of the four datasets that were selected for analysis, two datasets were generated using ChatGPT. One of these datasets defined the desired length of the comment (in terms of characters), and the other did not. It was hypothesized that specifying the desired length is a crucial factor for exploration, as it enables ChatGPT to generate a similar number of characters and words as those employed in human comments. Previous studies have observed a tendency for LLMs to produce shorter responses, as evidenced by the works of [43,45]. In light of these observations, this paper undertakes a comparative analysis of three generators (methods of generating comments): human respondents, ChatGPT without a defined comment length, and ChatGPT with a specified length. The investigation encompasses twelve cases, each derived from a distinct combination of the four original datasets (Amazon, Indian Airlines, McDonald’s, and Women’s Clothing) and each generator.
The present study utilizes a lexical-based approach to evaluate the appropriateness of ChatGPT as a generator of human-like comment datasets. Subsequent research endeavors may encompass the incorporation of more intricate grammatical elements and word co-occurrences, thereby facilitating a more profound exploration of this domain.

2.3. Pre-Processing of the Cases

A uniform sequence was applied to each dataset during the preparation phase: Amazon (identified with an “A” in the case name), Indian Airlines (identified with an “I”), McDonald’s (identified with an “M”), and Women’s Clothing (identified with a “W”). Each dataset was subjected to the series of steps outlined below (an illustrative code sketch of the cleaning and deduplication steps follows the list):
  • Download the dataset from Kaggle. This file is the basis for the “original” variant of each dataset (denoted by the subindex o).
  • Analyze the dataset description to create a comprehensive prompt for ChatGPT, incorporating as much relevant information as possible. We aimed to obtain a dataset similar to those available on Kaggle, based on the premise that they share comparable descriptions.
  • Compute the total number of records and the average character count of the comments to establish the necessary conditions in the corresponding prompt.
  • Ask ChatGPT using the designed prompt without specifying the length of the comments. The response obtained served as the basis for the “ChatGPT” variant of each dataset (denoted as the subindex c).
  • To generate comments of a specified length (number of characters), ask ChatGPT using a designed prompt that outlines this requirement. The responses obtained serve as the foundation for the “ChatGPT with length” variant of each dataset (denoted as the subindex l).
  • For the three cases related to the dataset, the comments were refined to remove characters not associated with words, such as emojis, various punctuation marks, line breaks, and so on.
  • Avoid the repetition of comments. If multiple identical comments are present, only one will be considered for the remainder of the analysis.
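For illustration, the following Python sketch reproduces the cleaning and deduplication steps described above. The file path, column name, and the exact character filter are assumptions made for this example; the original processing followed the steps above with the authors’ own scripts.

```python
import re
import pandas as pd

def clean_comment(text: str) -> str:
    """Remove characters not associated with words (emojis, punctuation, line breaks)."""
    text = str(text).replace("\n", " ").lower()
    text = re.sub(r"[^a-z0-9 ]+", "", text)      # e.g., "doesn't" becomes "doesnt"
    return re.sub(r"\s+", " ", text).strip()

def prepare_case(csv_path: str, comment_column: str) -> pd.DataFrame:
    """Load a dataset, report its size, clean the comments, and keep only distinct ones."""
    df = pd.read_csv(csv_path)
    avg_chars = df[comment_column].astype(str).str.len().mean()
    print(f"{csv_path}: {len(df)} comments, average length {avg_chars:.0f} characters")
    df["comment"] = df[comment_column].map(clean_comment)
    distinct = df.drop_duplicates(subset="comment")   # repeated comments are counted only once
    print(f"{len(distinct)} truly distinct comments after cleaning")
    return distinct[["comment"]]

# Hypothetical usage for the original Amazon case (A with subindex o):
# amazon_o = prepare_case("amazon_reviews.csv", "Review Text")
```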
The prompts were executed using the online version of ChatGPT on 12 November 2024. The experiments employed the ChatGPT model GPT-4-turbo, Pro version, with a context size of 128,000 tokens (approximately 300 text pages), trained with information available until June 2024. A comprehensive overview of the cases is provided in Table 1. The table includes the identifier for each case, its source, the total number of comments originally obtained (the Comments column), the average character count of the comments (column L.Ch), and the number of truly distinct comments (“the Distinct” column) for each case. The set of files utilized in the experiments, along with all the necessary code to execute them, is available upon request to facilitate replication of the results.

2.3.1. Amazon Reviews Dataset

The original variant of this dataset will be referred to as case A o . All details regarding this dataset can be found in [36], where the dataset is described as “Amazon Reviews Dataset. A Comprehensive Review Dataset for E-Commerce Analysis”. The description states that “This dataset comprises customer reviews for Amazon, an online retail giant, featuring insights into customer experiences, including ratings, review titles, texts, and metadata. It is valuable for analyzing customer satisfaction, sentiment, and trends”. Examples of the included columns are as follows: Reviewer Name, Profile Link, Country, Review Count, Review Date, Rating, Review Title, Review Text, and Date of Experience. It is described as useful for “Prospective applications”, such as sentiment analysis, customer satisfaction tracking, product improvement, market segmentation, competitor analysis, recommendation systems, and trend analysis.
Based on the aforementioned information, the following prompt was used to ask ChatGPT to generate the case A c : “Generate a csv file with 20,000 customer reviews for an online retail giant such as Amazon. I will use it for featuring insights into customer experiences, including ratings, review titles, texts, and metadata. It is valuable for analyzing customer satisfaction, sentiment, and trends. The file most include key features such as reviewer name, country, rating, review title, review text. These data points may be used for various purposes, including customer satisfaction analysis, sentiment analysis, topic modeling, predictive modeling, and competitive analysis. The specific features and level of detail should vary”.
To obtain the case A l , a similar prompt was used with the addition of this sentence at the end of the prompt: “The average length of the comments should be 460”.

2.3.2. Indian Airlines Dataset

The original variant of this dataset will be referred to as case I o . All details of this dataset are available in [37], where the dataset is described as “Indian Airlines Customer Reviews. Exploring Trends and Patterns in Indian Airlines Customer Ratings”. It is explained that “This dataset contains customer reviews for Indian Airlines, providing valuable insights into customer satisfaction, loyalty, and areas for improvement”. Several columns are included, such as the following: review text, ratings, dates, feedback, and recommendations. Some of the proposed uses are customer satisfaction analysis, sentiment analysis, topic modeling, predictive modeling, and competitive analysis.
Based on the aforementioned information, the following prompt was used to ask ChatGPT to generate the case I c : “Generate a csv file with 2000 customer reviews for Indian Airlines. I will use it for exploring trends and patterns in Indian airlines customer ratings in order to provide valuable insights into customer satisfaction, loyalty, and areas for improvement. The file most include key features such as review text, ratings, dates, feedback, and recommendations. These data points may be used for various purposes, including customer satisfaction analysis, sentiment analysis, topic modeling, predictive modeling, and competitive analysis. The specific features and level of detail should vary”.
To obtain the case I l , a similar prompt was used with the addition of this sentence at the end of the prompt: “The average length of the comments should be 630”.

2.3.3. McDonald’s Dataset

The original variant of this dataset will be referred to as case M o . All details of this dataset are available in [38], where the dataset is described as “McDonald’s Store Reviews Exploring Customer Sentiments in McDonald’s US Store Reviews”. It is explained that “This dataset contains over 33,000 anonymized reviews of McDonald’s stores in the United States, scraped from Google reviews”. Several columns are included, such as: store names, categories, addresses, geographic coordinates, review ratings, review texts, and timestamps. The dataset is described as providing “valuable insights into customer experiences and opinions about various McDonald’s locations across the country”, with potential uses in sentiment analysis, location-based analysis, category analysis, and time-based analysis.
Based on the aforementioned information, the following prompt was used to ask ChatGPT to obtain the case M c : “Generate a csv file with 33,000 US customer reviews for a global fast food chain that serves food and beverages, such as those reviews about McDonald that appear in Google reviews. I will use it to provide valuable insights into customer experiences and opinions across the country (US). The file most include information such as store names, categories, review ratings, and review texts. The specific features and level of detail should vary”.
To obtain the case M l , a similar prompt was used with the addition of this sentence at the end of the prompt: “The average length of the comments should be 130”.

2.3.4. Women’s Clothing Dataset

The original variant of this dataset will be referred to as case W o . All details of this dataset are available in [39], where the dataset is described as “Women’s E-Commerce Clothing Reviews. 23,000 Customer Reviews and Ratings”. It is explained that “This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized…” Several columns are included such as: Clothing ID, Title, Review Text, Rating, Recommended IND, Positive Feedback Count, Division Name, Department Name, Class Name. Some of the proposed uses are “quality NLP, … feature engineering, and multivariate analysis”.
Based on the aforementioned information, the following prompt was used to ask ChatGPT to obtain the case W c : “Generate a csv file with 20,000 customer reviews for a women’s clothing e-commerce site. I will use it for feature engineering, and multivariate analysis. The file most include key features such as age, title, review text, rating, recommended, division name (product high level division), department name, product class name. The specific features and level of detail should vary”.
To obtain the case W l , a similar prompt was used with the addition of this sentence at the end of the prompt: “The average length of the comments should be 300”.

2.4. Post-Processing of the Cases

2.4.1. Stop Words

In order to compare the vocabulary used in each case, it was necessary to classify the words into stop words [50] and non-stop words, which we expected to be more relevant for differentiating each context; we refer to these as relevant words. There is no standard list of stop words, as it may vary, depending on the context or individual preferences. In our case, we utilize the set of English stop words available in [51], which consists of 635 frequently used words. This effort is part of a broader initiative to compile stop words in 29 languages. During our pre-processing, we removed certain punctuation marks. We applied the same approach to the stop words in order to effectively treat them as tokens. For instance, the stop word “doesn’t” was replaced with the word “doesnt”. In some cases, this transformation resulted in two stop words becoming identical, such as the cases of “were” and “we’re” or “cant” and “can’t”. Finally, a set of 627 distinct stop words was compiled and placed in the corresponding table to facilitate the subsequent steps of post-processing.
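A minimal Python sketch of this stop word normalization is shown below; it assumes the list from [51] is stored as a plain text file with one word per line (the file name is hypothetical).

```python
import re

def normalize_token(word: str) -> str:
    """Apply the same punctuation removal used for the comments, e.g., "doesn't" -> "doesnt"."""
    return re.sub(r"[^a-z0-9]+", "", word.lower())

def load_stop_words(path: str = "english_stop_words.txt") -> set:
    """Load the stop word list and normalize it; pairs such as "cant"/"can't" collapse into one token."""
    with open(path, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    return {normalize_token(word) for word in words}

# With the 635-word list referenced in the text, this normalization
# leaves 627 distinct stop words.
```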
In the remainder of this paper, words not included in the set of stop words will be classified as “relevant” because they are more likely to differ significantly or hold greater relevance in each context. However, it is important to note that the relevance of a word may vary depending on the context. For example, some words categorized as stop words due to their frequent usage, such as “against” and “hopefully”, may actually be relevant for sentiment analysis. Conversely, a typographical error in human writing may cause “aginst” (missing an “a”) to be classified as “relevant” (although we anticipate that this variant is unlikely to become a frequently used word when processing the comments). In spite of these concerns, we consider that distinguishing stop words from other relevant words is a sensible approach. As we compute the number of uses of each word, we pay special attention to the most frequently used words. The set of frequent words provides a straightforward and effective means of capturing the general content of a text [43,45], for example, by using word clouds.

2.4.2. Preparing the Cases for Analysis

After the pre-processing was completed, and taking into account the list of stop words previously discussed, each case underwent the following sequence of steps:
  • The truly distinct comments for each case were loaded into the column “comment” in the table of comments corresponding to each case.
  • Based on this list, the 500 most frequently used words were placed in the table of frequently used words of each case, which contains two columns: word and frequency (how many times the word is used in the comments).
  • Based on the frequently used words for each case, a column “words” was added to each table of comments, containing the set of frequent words used in each comment.
  • Using the table of stop words, each frequently used word was classified accordingly.
Based on these steps, Table 2 in Section 3 provides a general description of each case. For each case, several numerical indicators were derived based on the characteristics of the comments, the most frequently used words, and the similarity between the vectors that describe the words used in each case. All of this processing was conducted using standard SQL to facilitate easy replication.
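The authors note that this post-processing was carried out in standard SQL; purely as an illustration, the following Python sketch implements equivalent logic for building the frequent-word table and annotating each comment. It builds on the prepare_case and load_stop_words sketches above, whose names are assumptions of this example.

```python
from collections import Counter
import pandas as pd

def frequent_word_table(comments: pd.DataFrame, stop_words: set, top_n: int = 500) -> pd.DataFrame:
    """Table of the top_n most frequently used words, each classified as stop or relevant."""
    counts = Counter()
    for comment in comments["comment"]:
        counts.update(comment.split())
    table = pd.DataFrame(counts.most_common(top_n), columns=["word", "frequency"])
    table["is_stop"] = table["word"].isin(stop_words)
    return table

def annotate_comments(comments: pd.DataFrame, frequent: pd.DataFrame) -> pd.DataFrame:
    """Add a 'words' column holding the set of frequent words used in each comment."""
    frequent_set = set(frequent["word"])
    out = comments.copy()
    out["words"] = out["comment"].map(lambda c: set(c.split()) & frequent_set)
    return out

# Hypothetical usage for one case:
# stop_words = load_stop_words()
# freq_table = frequent_word_table(amazon_o, stop_words)
# amazon_o = annotate_comments(amazon_o, freq_table)
```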

3. Results

3.1. General Description of the Cases

Table 2 provides a comprehensive overview of each case, detailing the number of distinct comments (the Count column), the length (in terms of the number of characters in the Ch columns and in terms of the number of words in the L columns) of the comments (the average in the L.Ave column, the minimum number in the L.Min column, and the maximum in the L.Max column), the number of frequently used words in each comment (the average in the F.Ave column, the minimum number in the F.Min column, and the maximum in the F.Max column), and the average number of frequent relevant words (the Fr.Ave column) and frequent stop words (the Fs.Ave column) in each comment. Each row represents a case identified with a name that combines the identifier of each dataset (A, M, I, or W) with the type of generator used (o: human comments sourced from the original dataset on Kaggle; c: comments generated via ChatGPT without a specified length; l: comments generated via ChatGPT for which the length of the comment was defined). Three additional rows T o , T c , and T l were included to account for the complete set of words used with each generator (o, c, and l).
Some interesting results can be observed in Table 2. The most significant findings regarding ChatGPT’s performance are as follows: it uses fewer characters and words than humans, it frequently repeats the same answer, its responses vary considerably based on the defined length, and its proportion of stop words is very similar to that of human comments.
In cases A c , I c , M c , and W c , in which the desired length was not specified to ChatGPT in the prompt, the size of each comment was significantly smaller. This suggests that the default size of ChatGPT comments is smaller than the size used in these authentic human comments. It is remarkable how few truly distinct comments there are. This suggests that ChatGPT is fulfilling the requested number of comments by repeating the same ones multiple times. This may indicate a tendency to rely on specific words without exploring synonyms or varying the language used in the comments.
In cases A l , I l , M l , and W l it is important to note that the number of truly distinct comments generated via ChatGPT, given a defined length, was accurate. However, it is essential to examine this aspect in greater detail. The size of each comment varied considerably. In the case I l , the desired length was exactly respected (630 characters). In the case M l , the length of the comments generated was close to the desired length, just slightly smaller (an average of 91 characters with respect to 130). In the other cases, the generated comments were larger than the desired length. In the case A l , the comments generated were longer (an average of more than 700 characters instead of 460), and in the case W l they were substantially longer (more than 2000 characters instead of 300). Perhaps, in this case, ChatGPT infers that the desired size (300) refers to words, as opposed to the approximate size in terms of characters correctly inferred (and almost achieved) for the other cases. This interpretation seems unlikely, however, because the desired length of 300 is not extreme: other desired lengths were larger (630) or smaller (130), and ChatGPT interpreted them as characters.
The size of each comment varied significantly. In the case of I l , the desired length was precisely met at 630 characters; however, upon examining the details of the comments, we can observe that several of them end abruptly in order to meet the desired length. Some examples of these incomplete final sentences include the following: “Friend couple a”, “Can boy room value film”, “Once born with”, and “Same wa”. Additional examples can be found in Section 3.2.
Some of the generated responses do not seem comparable to human comments, as several words are presented in a sequence that does not form coherent sentences. An example is as follows: “color color comfortable fashion fit love size fit fashion size stylish fashion size comfortable”. See other examples in Section 3.2.
The average use of frequent words in human comments (column F.Ave, row T o : 19) was equal to the corresponding value for ChatGPT with a defined length (19, row T l ). It is interesting to note this similarity despite the differences in average length (column L.Ave: 329 for T o compared to 82 for T l ). The F.Ave value for ChatGPT when the length was not defined is considerably smaller (6, row T c ) than in the defined-length case. It is also noteworthy that the proportion of relevant words to stop words is consistent across all cases. Generally, human comments contain 2.2 relevant words for every stop word (row T o : Fr.Ave 13 compared to Fs.Ave 6). This ratio is 2 (4 in relation to 2) for T c and 2.2 (13 in relation to 6) for T l .

3.2. Some Examples of Comments Generated via ChatGPT for Each Case

The comments can be examined in detail in the available information provided. However, we believe it is useful to provide some examples of the generated comments. For each case, we provide two examples of comments generated via ChatGPT when the desired length was not specified in the prompt. For A c , “Excellent service: It didnt meet my standards, especially given the price point. Not recommended”. “Great product!: Exceeded my expectations, the features are great and easy to use”. For I c , “Excellent service! Highly recommend this airline”. “Flight was delayed, but the staff was polite”. For M c , “Best place for a quick meal with family. Affordable and tasty!” “Disappointed with the customer service, but the food was good”. For W c , “Amazing fit. Unfortunately, it didnt fit as expected. A bit too small”. “Love this! Not as described, returned it immediately”. In spite of their relatively small size, these sentences convey syntactically coherent ideas; however, some may lack semantic clarity due to inherent contradictions. Additionally, it is evident that they align with the expected context of each case.
Now, for each case, we provide two examples of comments generated via ChatGPT for which the expected size of the comment is defined in the prompt. For the case A l ,
  • “Last key give: Amazing product! The quality is top-notch, and I am really pleased with my purchase. Quality exceeded my expectations, making it worth every penny. Highly recommended! The product arrived on time and was as described, though it could have been packed better. Product met my needs but has some limitations that may be improved in future models. Excellent quality and build; Im very satisfied with my purchase and would buy it again. I love the features, but the design could use some improvement. Overall, its decent for the price. The item didnt work as expected, but customer service was helpful in resolving my issue. Quality exceeded my expectations, making it worth every penny. Highly recommended!”
  • “Offer run eight: Shipping took longer than expected, but the product itself is satisfactory. I love the features, but the design could use some improvement. Overall, its decent for the price. The performance was underwhelming, and I am quite disappointed with this purchase. This product has exceeded my expectations and provides great value for the money. Quality exceeded my expectations, making it worth every penny. Highly recommended! Shipping took longer than expected, but the product itself is satisfactory. Quality exceeded my expectations, making it worth every penny. Highly recommended! I had a terrible experience with the quality of this product, and I wouldnt recommend it”.
For the case I l ,
  • “Somebody nation arm their he wish. Grow story alone forward wait. Difference become push ball better value fly. Physical identify garden piece sign large. Game including all TV imagine common level. Dream ok war actually official significant name. Thing bed concern worker drive. Wrong provide blue manage player. Through realize season. Care its someone point daughter nice resource. Where Mrs above husband. Southern size run reason act left hundred. Should keep role. Magazine cup create air. Suggest state huge his participant nation. Computer trial necessary treat model. Visit indeed school because. Since loss rock. Kid man”
  • “Continue final song well growth onto pick meet. Necessary some state argue. Message identify list reflect talk fast approach. Art within human issue music must Democrat. Pretty through tend Congress. Treat whom impact choose technology season recognize. Good home win send social north. Trouble I property also. Just do together local agency quickly fish. Consider boy magazine. Structure environmental direction likely approach dog success. American sense contain five. Course gun Mr argue eat. Former reduce decide carry political effect. Travel recently relationship billion ago. Industry money fear traditional. Nation find ar”
For the case M l ,
  • “A agency important another food. Behavior budget everybody old store open. Add the newspaper watch speak”.
  • “A ahead under rather sometimes to. Tree vote wall shoulder”.
For the case W l ,
  • “Amazing! color color comfortable fashion fit love size fit fashion size stylish fashion size comfortable love color stylish color fashion perfect color disappointed size comfortable fashion size fit fashion fashion love quality size love quality color perfect size love fit perfect stylish stylish stylish comfortable stylish love stylish fashion disappointed stylish stylish perfect fit quality love quality stylish love stylish fit perfect disappointed love comfortable color comfortable fit disappointed perfect perfect fit disappointed size stylish fashion comfortable love love quality quality disappointed disappointed fashion fit color love stylish disappointed quality fit stylish fashion fashion quality color comfortable fashion fashion fashion size fit love fit fit comfortable fashion stylish love fit quality color size disappointed disappointed quality love comfortable stylish love love fashion color color color stylish stylish perfect fashion size love love comfortable love fit fashion fit love love fit fashion size color quality size disappointed fashion stylish size color fit fit fit quality color disappointed perfect love love stylish disappointed perfect love stylish perfect love disappointed fashion disappointed fit color quality fit stylish size disappointed size size size stylish quality color disappointed disappointed love fit perfect fit color fashion perfect fashion perfect size comfortable fashion fit love quality color fit quality size fit fit size love disappointed fashion stylish love disappointed love comfortable fit love quality disappointed size perfect fit fashion disappointed disappointed quality fashion comfortable comfortable color perfect fit comfortable fit fashion perfect perfect fit quality size comfortable love fashion love love fashion comfortable stylish disappointed size color love disappointed love fashion disappointed comfortable perfect fit color quality quality quality love disappointed comfortable perfect stylish disappointed fit love fit quality perfect quality perfect stylish disappointed perfect quality disappointed stylish comfortable fashion love quality comfortable fashion perfect quality fit quality fit fit quality stylish disappointed stylish quality comfortable perfect fashion”
  • “Disappointing! color color color color perfect color comfortable fashion quality perfect fashion comfortable fit color perfect comfortable fashion size stylish perfect fashion fit love disappointed love fit love comfortable disappointed stylish disappointed fit perfect fit love color love color love comfortable fit quality size comfortable fit stylish quality disappointed perfect stylish quality comfortable color size stylish stylish quality comfortable fashion quality fit stylish fashion comfortable stylish perfect color fashion perfect love comfortable color disappointed comfortable size comfortable perfect fit perfect love stylish love stylish love stylish color fashion quality color love stylish comfortable fit quality color disappointed stylish fit color quality fit perfect fit quality disappointed fit fashion fit comfortable size fit size fit quality color disappointed disappointed comfortable comfortable fashion disappointed love size quality fashion love love color disappointed fashion stylish size size perfect comfortable disappointed fashion fashion color fashion comfortable disappointed perfect stylish perfect fashion color love color color perfect size perfect stylish comfortable love disappointed perfect stylish size fashion size quality comfortable perfect color stylish perfect quality color size comfortable fit love fit disappointed love love fashion comfortable perfect color love disappointed color size stylish comfortable stylish color color disappointed love fashion disappointed love stylish perfect disappointed disappointed color quality fashion color color color color comfortable fit stylish stylish love comfortable stylish color quality love comfortable stylish quality fashion fit comfortable size fit size comfortable fashion love comfortable color disappointed love fashion color love fit stylish love stylish perfect comfortable color size fashion perfect color quality color disappointed stylish color fit love fashion size love color disappointed disappointed stylish comfortable stylish comfortable size love quality stylish quality stylish stylish love perfect fit fit size perfect disappointed comfortable stylish disappointed perfect color fashion fashion perfect love disappointed perfect stylish size fashion comfortable stylish size quality perfect color size color”
As evidenced by these examples, many of the words possess inherent significance; nevertheless, several inconsistencies emerge. This is particularly evident in the final case, W l : despite the appropriateness of many of the words to the context, the comments lack grammatical coherence. The propensity to reiterate words within specific contexts has been observed in other experiments [35]. In summary, it can be concluded that ChatGPT is capable of producing a limited number of meaningful comments that are focused on the specific context outlined in the prompt. However, it continues to grapple with adhering to a predefined length when generating a substantial dataset of comments. In contrast, the study by [10] reports that ChatGPT performs better when the prompt is concise.

3.3. Analysis of Frequent Words

In light of the previous results, it is intriguing to explore whether the comments generated via ChatGPT can be used in certain forms of text summarization, such as analyzing frequent words (or word clouds) that disregard the grammatical coherence of the comments. Table 3 presents a general overview of the frequently used words in each case. For each case, including the integration of all cases in the last three rows — T o , T c , and T l — we present the number of frequent words, F, the number of relevant frequent words, Fr, and stop words, Fs, and the number of words that are common (the total in column C, the number that are relevant in column Cr, and stop words in column Cs) to the three generators in the same group of cases (i.e., A, I, M, W, and T). A word is counted in columns C, Cr, and Cs if it appears among the frequent words of all three generators. We believe that these values are useful for examining the convergence of all generators within each specific context.
A total of two words were identified as appearing in the set of frequently used words across all cases: “for” and “not”, which are considered stop words. Consequently, the values 2, 0, and 2 should be placed in columns C, Cr, and Cs, corresponding to the rows T o , T c , and T l . However, it is more meaningful to populate these cells with the frequently used words employed for each generator in the four corresponding cases. These values are highlighted in bold to emphasize this distinction. Additionally, for each case, the number of words used with any of the generators for each group of cases is presented (total in column U, number of relevant words in Ur, and stop words in Us). Finally, the last three columns present the number of words that are exclusively used with each generator in each group of cases (total in O, relevant words in Or, and stop words in Os).
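As an illustration, the set arithmetic behind columns C, U, and O (and their relevant/stop splits) can be sketched as follows in Python, assuming the frequent words of the three generators in one group are available as the hypothetical tables built above.

```python
def group_word_counts(freq_o, freq_c, freq_l, stop_words):
    """Common (C), union (U), and exclusive (O) frequent words for one group of cases,
    each split into total, relevant, and stop word counts."""
    words_o = set(freq_o["word"])
    words_c = set(freq_c["word"])
    words_l = set(freq_l["word"])

    common = words_o & words_c & words_l          # column C: frequent for all three generators
    union = words_o | words_c | words_l           # column U: frequent for any generator
    exclusive = {                                  # column O: frequent for only one generator
        "o": words_o - words_c - words_l,
        "c": words_c - words_o - words_l,
        "l": words_l - words_o - words_c,
    }

    def split(words):
        stops = words & stop_words
        return {"total": len(words), "relevant": len(words - stops), "stop": len(stops)}

    return {"C": split(common), "U": split(union),
            "O": {generator: split(words) for generator, words in exclusive.items()}}
```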
A number of noteworthy observations can be made regarding the information presented in Table 3. The most significant findings are as follows: It is evident that ChatGPT utilizes a restricted lexicon, exhibiting minimal overlap with frequently used words across all specific contexts when compared to human language. However, ChatGPT employs a unique lexicon, incorporating specialized and infrequent terms that are not commonly used by humans. Furthermore, humans exhibit a tendency to employ the same word in different contexts more frequently than ChatGPT does. Due to the limited number of comments in cases where ChatGPT is not prompted with a desired length, the number of frequent words is fewer than 100, falling short of the maximum available number of frequent words for each case (500).
It is remarkable how few coincidences exist among the frequently used words. The proportion of common words (column C) is notably small—less than 7% in all cases: Case A: 5.2% (44 out of 849), Case I: 1.4% (13 out of 872), Case M: 2.8% (23 out of 830), Case W: 2.5% (13 out of 517), and Case T: 6.1% (97 out of 1586). This suggests that ChatGPT is not converging on a set of frequently used words that resembles the vocabulary employed by humans. Therefore, a different underlying vocabulary is inferred for a similar context.
In spite of the reduced number of frequently used words for ChatGPT (without a defined length), it is interesting to note that the proportion of common words is significant only for A c , at 53% (44 out of 83), and M c , at 40.35% (23 out of 57). In the other cases, this proportion is less than 27%: I c at 18.8% (13 out of 69), W c at 17.3% (13 out of 75), and T c at 26.6% (51 out of 192). This indicates a distinct difference in the vocabulary employed via ChatGPT compared to that of humans in each context, with only a small overlap of frequently used words. Consequently, it is not only the case that ChatGPT uses a more limited vocabulary than humans, but it also incorporates several words in each context that are not used by humans.
In addition, it is noteworthy that the proportion of frequently used words for ChatGPT (without a defined length) that were not included in the frequent words of other generators is as follows: A c with 13.2% (11 out of 83), I c 24.6% (17 out of 69), M c 14% (8 out of 57), W c 20% (15 out of 75), and T c 13.5% (26 out of 192). This indicates that between 13% and 25% of the vocabulary used via ChatGPT is absent from the most frequently used words by humans (and even for ChatGPT with a defined length) in a similar context. In spite of the fact that the proportion of relevant words is quite similar across all cases for the union of all frequent words in column U (ranging from 60.5% for case A to 69.4% for case I), this proportion varies significantly for the common words in column C (between 30.8% for case I and 76.9% for case W). The proportion of relevant words among the frequent words that are exclusive to a generator in columns Or and O varies significantly: approximately 60% for most human-generated cases ( A o , M o , W o ), compared to over 94% for two ChatGPT cases without a defined length ( I c , M c ) and also for W l . This observation suggests that the uniqueness of ChatGPT’s vocabulary is more related to relevant words than to stop words. ChatGPT, when not subject to length restrictions, was able to generate 26 relevant words that were not produced through other methods. Additionally, the more than 400 words generated solely via ChatGPT under length restrictions suggest that its vocabulary exhibits a certain divergence from human commentary.
It is interesting to examine the singularity of each generator by analyzing the proportion of frequently used words generated solely through each (column O) in relation to the total number of frequent words utilized (column F). This proportion ranges from 56% to 87% for human-generated content ( A o 65.4%, I o 63.4%, M o 59.2%, W o 87%, T o 56.6%), indicating that more than half of the frequent words in human comments were not employed via ChatGPT. In cases involving ChatGPT without length constraints, the proportion of unique words is significantly smaller, at less than 25% ( A c 13.3%, I c 24.6%, M c 14%, W c 20%, and T c 13.5%). This may be interpreted as a tendency for ChatGPT to conform to a common human language, exhibiting limited diversity. However, the more than 13% of frequently used singular words in ChatGPT’s comments is noteworthy, especially considering that the number of ChatGPT’s comments is considerably smaller (only 175 distinct comments compared to 68,121 distinct comments generated by humans, which is less than 0.3%). Additionally, the number of distinct words used is also limited (only 192 distinct frequent words compared to 1076 distinct frequent words used by humans, or 17.8%). The proportions of uniqueness in ChatGPT’s output are slightly higher when only relevant words are considered ( A c 18.8%, I c 34.8%, M c 20%, W c 24.4%, and T c 19.7%).
The cases of ChatGPT that involve a length condition are quite specific. Generally, the proportions for these cases are similar to those observed in humans, ranging between 51% and 71% ( A l 64.2%, I l 70.8%, M l 64.4%, and T l 51.8%). However, the exception is W l , which has a proportion of only 10%. This may be interpreted as a tendency for ChatGPT to diverge in the choice of words when prompted to achieve a specific length. The divergence of frequently used words varies depending on the context, as observed by examining the proportion of frequent words utilized by each generator across all cases (comparing column C to column F for rows T o , T c , and T l ). This proportion is 17% for T o , while it is just 4% for T c and 0.4% for T l . It appears that humans tend to use the same words in different contexts more frequently than ChatGPT.

3.4. Total Uses of the Words

The frequency of each type of word (F columns: total; Fr: relevant words; Fs: stop words) in each case is presented in Table 4, which counts the total occurrences of each frequent word in the comments for each case. The last column, % T u , indicates the proportion of word usage in each case relative to the total across all cases. The most significant conclusion is that the proportion of relevant word usage in ChatGPT is higher than that of humans, even when the desired length is not specified.
As previously noted regarding the length of the comments, the word count within each case of the same group varies significantly. In instances where the desired comment length was not specified (cases A c , I c , M c , and W c ), ChatGPT produced a limited number of words, accounting for less than 0.02% of the total word usage. Contrary to expectations, when the desired length of the comments was defined (cases A l , I l , M l , and W l ), the number of words generated via ChatGPT differs significantly from the human comments (cases A o , I o , M o , and W o ). The case W l contains around 50% of the total word usage T u ; in fact, it is more than five times the proportion represented by W o . Conversely, the proportions for cases M l and I l are about half of those for cases M o and I o , respectively. In contrast, the proportions of A o and A l are the most similar, at around 15%. The percentage representing the relevant words (column % F r ) is never more than 45% in human comments (cases A o , I o , M o , and W o ). In contrast, the comments generated via ChatGPT exceed 50% in all cases, with particularly high percentages for cases W l at over 99%, and I l and M l at approximately 70%. In general, the value of % F r in ChatGPT cases is significantly higher when the length is specified, with the exception of A l with respect to A c , which are quite similar. This can be interpreted as a tendency for ChatGPT to produce a greater proportion of relevant words compared to human comments, particularly when the desired length is defined.
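Unlike Table 3, which counts distinct frequent words, Table 4 weights each frequent word by its number of occurrences. A short Python sketch of this aggregation, reusing the hypothetical frequency tables assumed above, is given below.

```python
def total_word_usage(freq_table, stop_words):
    """Total occurrences of the frequent words in one case, split into relevant and stop uses."""
    is_stop = freq_table["word"].isin(stop_words)
    total_uses = int(freq_table["frequency"].sum())                  # column F
    stop_uses = int(freq_table.loc[is_stop, "frequency"].sum())      # column Fs
    relevant_uses = total_uses - stop_uses                           # column Fr
    pct_relevant = 100.0 * relevant_uses / total_uses if total_uses else 0.0   # column %Fr
    return {"F": total_uses, "Fr": relevant_uses, "Fs": stop_uses, "%Fr": round(pct_relevant, 1)}
```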

3.5. Most Used Words

In the preceding sections, an investigation was conducted into the general trends in word usage. In this section, we present examples of words that are more distinctive for each generator and context. These are, approximately, the ten most frequently used relevant words in each case. In instances where multiple words are tied for the tenth position, all such words are included in the following lists, with the exception of the W c category, where only nine words are shown because more than thirty words are tied for the tenth position. Some words commonly found in human comments appear to be typographical or encoding errors (such as “✅” or “½ï”) or are used as specific quotation marks (such as “|”). Since these words lack meaningful content, they are not included.
  • A o : a, amazon, customer, delivery, i, item, order, prime, service, time.
  • A c : customer, expectations, expected, experience, great, highly, i, money, product, purchase, quality, recommended, satisfied, service.
  • A l : quality, exceeded, expectations, expected, find, i, money, product, purchase, purpose
  • I o : a, experience, flight, i, service, time, trip, verified.
  • I c : a, average, comfortable, customer, delayed, experience, flight, good, seats, service, staff.
  • I l : attorney, building, decade, fall, group, realize, space, time, walk, war, western, worry.
  • M o : a, drive, food, good, i, mcdonalds, order, place, service.
  • M c : a, bit, clean, food, great, i, place, quick, service, tasty.
  • M l : age, bed, boy, close, draw, hotel, i, public, read, rich, rise, stop, true.
  • W o : a, dress, fabric, fit, great, i, love, size, top, wear.
  • W c : amazing, bad, buy, comfortable, expected, fit, love, perfect, price.
  • W l : color, comfortable, disappointed, fashion, fit, love, perfect, quality, size, stylish.
Most words seem to be meaningful within their respective contexts (e.g., flight for Indian Airlines, food for McDonald’s) or in general (e.g., quality, purchase). However, some usages appear unusual when the desired length is specified, such as the use of “attorney” and “building” for Indian Airlines, and “hotel” in comments about McDonald’s.
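A rough sketch of how such per-case lists can be extracted is given below. It tokenizes a set of comments, discards stop words, and keeps the ten most frequent remaining words, including any ties at the tenth position; the tokenization rule and the small stop-word set are simplifying assumptions rather than the exact preprocessing used in the experiments.

import re
from collections import Counter

STOP_WORDS = {"the", "and", "was", "it", "for", "not", "of"}  # illustrative subset

def top_relevant_words(comments, k=10):
    """Return the k most frequent non-stop words, including ties at rank k."""
    tokens = []
    for text in comments:
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    if not counts:
        return []
    ranked = counts.most_common()
    if len(ranked) <= k:
        return ranked
    cutoff = ranked[k - 1][1]            # frequency at the k-th position
    return [(w, c) for w, c in ranked if c >= cutoff]

# Toy comments, not taken from the actual datasets:
print(top_relevant_words(["Great quality dress, perfect fit", "Love the fit and the quality"]))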

3.6. Common Words in Each Group of Cases

Since the number of words common to all three cases of each group is small, we include all of them in the following lists. These are the relevant words included in the three cases of each group:
  • Group of cases A: arrived, buy, customer, disappointed, easy, excellent, experience, fast, good, great, i, issues, money, point, price, product, purchase, quality, recommend, service, shipping, terrible, worth.
  • Group of cases I: a, time, bit, staff.
  • Group of cases M: a, bit, customer, food, great, hot, i, staff, time, wait.
  • Group of cases W: amazing, comfortable, disappointed, expected, fit, i, love, perfect, quality, stylish.
There are some convergences among the three generators, such as the use of “price” and “shipping” for Amazon, and “staff” and “time” for Indian Airlines and McDonald’s. Notably, the groups of cases corresponding to Amazon and Women’s Clothing include a number of additional shared words. This may reflect an imbalance in the training set of ChatGPT. It is worth noting that the words “i” and “a” could be classified as stop words; however, we prefer to adhere to the conditions established in the initial stages of the experiments.
These are stop words included in the three cases of each group:
  • Group of cases A: again, and, anyone, as, be, every, for, had, is, it, my, not, of, the, this, use, very, was, will, with, would.
  • Group of cases I: but, for, it, my, no, not, the, this, with.
  • Group of cases M: again, and, as, best, for, just, like, not, of, the, them, will, with.
  • Group of cases W: for, it, not.
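Operationally, the group-wise lists above are the intersection of the three per-case frequent-word sets, split according to the stop-word list. A minimal sketch, assuming the frequent-word sets are already available (the sets shown are illustrative placeholders, not the real data):

# Sketch: words shared by the human (o), free ChatGPT (c) and fixed-length (l)
# cases of one group, separated into relevant words and stop words.
STOP_WORDS = {"for", "it", "not", "the", "and"}   # illustrative subset

frequent = {
    "W_o": {"fit", "love", "size", "for", "it", "not", "dress"},
    "W_c": {"fit", "love", "perfect", "for", "it", "not"},
    "W_l": {"fit", "love", "stylish", "for", "it", "not"},
}

common = set.intersection(*frequent.values())
common_relevant = sorted(common - STOP_WORDS)
common_stop = sorted(common & STOP_WORDS)

print("relevant:", common_relevant)   # e.g. ['fit', 'love']
print("stop:", common_stop)           # e.g. ['for', 'it', 'not']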

3.7. Words Used with Only One Generator

There are 26 words frequently used by ChatGPT without a defined length that appear neither among the frequently used words of the human comments nor among those of ChatGPT with a fixed length. All of them are relevant words: unhelpful, everyday, standards, beginning, subpar, stitching, undone, vibrant, immediately, occasion, average, tasty, plenty, outstanding, punctual, assistance, exceptional, cramped, crispy, ambiance, cleanliness, enjoyed, managed, reasonable, affordable, improvements.
On the other hand, there are 609 words that were frequently used by humans that were not used with ChatGPT. Here are some examples of these words (all of them with more than 4000 uses): amazon, an, order, dress, there, ordered, delivery, dont, prime, refund, items, said, long, bought, mcdonalds, cute, told, made, days, through, delivered, flattering.
When ChatGPT was constrained to a defined length, it generated 466 frequently used words that were not included in the lists of frequently used words from the other generators. These are some examples (all with more than 12,000 uses): fashion, purpose, build, future, lacking, areas, intended, itself, performance, satisfactory, resolving, limitations, models, improved, setup, encountered, decent, improvement, packed.
There were 510 frequently used words generated via one of the ChatGPT models that were not commonly found in human comments. Here are some examples for each context (group of cases).
  • A: disappointing, meet, unhelpful, everyday, standards, beginning, subpar.
  • I: improve, minor, unhelpful, average, outstanding, punctual, assistance, exceptional, cramped, cleanliness, enjoyed, managed, reasonable, improvements.
  • M: expected, underwhelming, loved, tasty, plenty, crispy, ambiance, affordable, true, past, read, above, name, hotel, rich, road, draw, public, key, age, boy, rise, bed, form.
  • W: product, value, expectations, exceeded, again, others, described, started, coming, stitching, undone, vibrant, immediately, occasion, fashion, disappointing.
Now, we present some words frequently used in the human comments that were not used with ChatGPT.
  • A (relevant words with more than 5000 uses): amazon, order, delivery, prime, account, refund.
  • A (stop words): me, so, no, if, an, them, when, one, their, now, never.
  • I (relevant words with more than 1000 uses): verified.
  • I (stop words): is, as, they, me, had, are, at.
  • M (relevant words with more than 2000 uses): order, mcdonalds, fast.
  • M (stop words): my, is, have, had, are, no, get, there.
  • W (relevant words with more than 3000 uses): top, dress, ordered, nice, bought, cute, beautiful, flattering, large, shirt.
  • W (stop words): but, in, was, of, on, that, have, they, am, you, im, its, just, or.
Some words that are clearly oriented towards specific contexts and were used by humans were not included among the frequently used words of ChatGPT. Here are some of the most notable examples.
  • A: amazon, delivery, prime, refund.
  • I: verified.
  • M: mcdonalds, fast.
  • W: dress, nice, cute, beautiful, shirt.
Conversely, there are certain words used with ChatGPT that are significant but were not included in the most frequently used words in human comments, as they are clearly tailored to the specific context of each case.
  • A: disappointing, standards, subpar.
  • I: punctual, assistance.
  • M: tasty, crispy, ambiance.
  • W: product, occasion, fashion.
Perhaps this reflects a bias in the training data of ChatGPT that differs from the human comments included in the datasets.

3.8. Global Comparison in Terms of Cosine Similarity

Table 5, Table 6 and Table 7 present a comparison of the normalized frequency vectors representing the set of frequent words for each case in terms of cosine similarity [52], a widely used measure in information retrieval for comparing vectors that represent texts. First, all frequency vectors were normalized by the total number of uses, so that each component represents the proportion of occurrences of each word in the case relative to the overall frequency of the most common words. Although cosine similarity can be negative for vectors pointing in opposite directions, this cannot occur here because, after normalization, all components are non-negative (between 0 and 1). Therefore, in this context, the cosine similarity ranges from 0 (orthogonal vectors) to 1 (perfectly aligned vectors). In Table 5, all words were taken into account, resulting in each vector having 1586 components, one for each possible word (0 if the word was not used in the particular case). In Table 6, the similarity is computed by considering only the relevant words; therefore, each vector has 1233 components. Finally, in Table 7, the same measure is computed by taking into account only the relative frequency of the stop words, resulting in each vector having 353 components.
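The following sketch reproduces this computation on illustrative data: each case is represented by a frequency vector over a shared vocabulary, normalized by its total usage, and pairs of vectors are compared with cosine similarity. The vocabulary and counts are placeholders, not the actual frequent-word data.

import numpy as np

def normalized_vector(counts, vocabulary):
    """Frequency vector over a fixed vocabulary, normalized by total usage."""
    v = np.array([counts.get(w, 0) for w in vocabulary], dtype=float)
    total = v.sum()
    return v / total if total > 0 else v

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (here always in [0, 1])."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative counts for two cases over a toy shared vocabulary.
vocabulary = ["fit", "love", "quality", "flight", "service"]
case_a = {"fit": 120, "love": 80, "quality": 60, "service": 40}
case_b = {"fit": 90, "love": 70, "quality": 75, "flight": 5, "service": 30}

u = normalized_vector(case_a, vocabulary)
v = normalized_vector(case_b, vocabulary)
print(round(cosine_similarity(u, v), 3))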
Several observations can be made based on the results presented in Table 5, Table 6 and Table 7, which reflect the similarities among each generator as expressed with the corresponding vectors of normalized usage. The most significant findings are as follows: ChatGPT (without a defined length) is more similar to human vocabulary than ChatGPT with a defined length, and this similarity is particularly pronounced in terms of stop words. In general terms (see Table 5), it can be observed that columns I, M, and W indicate that the similarity between human comments and those generated via ChatGPT without a defined length (first row) is significantly greater than the similarities observed in the other rows (second and third rows). The only exception is column A, where the other similarities are higher. Overall, ChatGPT demonstrates stability in its similarity to the set of human comments, ranging between 0.45 and 0.6. However, when the length is defined, the results may diverge.
A similar situation occurs in Table 6 and Table 7, with the exception of column A. This suggests that, when the length is not defined, ChatGPT’s responses were closer to human comments in columns I (Indian Airlines), M (McDonald’s), and W (Women’s Clothing). In column A (Amazon), this similarity is comparable to that of the other contexts; however, the other similarities ( A o vs. A l and A c vs. A l ) were significantly greater. When the comparison is based on the average vectors of the generators across all cases (the Generators column), the similarity between the comments generated by humans and those produced via ChatGPT without a defined length (O vs. C) is again the highest. It is noteworthy that the similarity between these average vectors is greater than the values observed in each individual case. This may indicate a general trend in vocabulary that is not specifically tailored to each particular context. This greater similarity of the averages may be attributed to the smaller similarities concerning the relevant words. It can be observed that, in nearly all cases, the similarity between the 21 pairs of vectors is greater when stop words are considered (Table 7) than when relevant words are considered (Table 6), with the exception of W c vs. W l , for which the values are very close.
With regard to the general vocabulary associated with each group of cases (column Case in the three tables), the most analogous contexts were A and M, followed by A and I, particularly with respect to the utilization of stop words. It appears that ChatGPT exhibits a greater degree of customization in its application of stop words than in its selection of relevant words.

3.9. Global Comparison in Terms of Pearson Correlation Coefficient

Table 8, Table 9 and Table 10 present a comparison of the normalized frequency vectors representing the set of frequent words for each case, evaluated using the Pearson correlation coefficient [53]. This coefficient ranges from −1, indicating an inverse linear relationship (higher values in one series correspond to lower values in the other), to 1, which signifies that both series increase together. In this context, despite the normalization of the vectors, negative values are still possible because the coefficient is computed on mean-centered values. In Table 8, all words were taken into account, resulting in each vector having 1586 components. In Table 9, the correlation is computed by considering only the 1233 relevant words. Finally, in Table 10, the same measure is calculated by focusing solely on the relative frequency of usage of the 353 stop words.
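A minimal sketch of this comparison, using NumPy on the same kind of normalized frequency vectors as before; the two vectors are illustrative placeholders. Unlike cosine similarity on non-negative vectors, the result can be negative.

import numpy as np

def pearson(u, v):
    """Pearson correlation coefficient between two frequency vectors."""
    return float(np.corrcoef(u, v)[0, 1])

# These placeholder vectors are anti-aligned around their means,
# so the coefficient comes out negative.
u = np.array([0.40, 0.30, 0.20, 0.00, 0.10])
v = np.array([0.05, 0.10, 0.25, 0.40, 0.20])
print(round(pearson(u, v), 3))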
In general terms (see Table 8), it can be observed that columns I, M, and W indicate that the correlations between human comments and those generated via ChatGPT without a defined length (first row) are generally higher than the other similarities (second and third rows). This suggests that ChatGPT uses words in a more similar manner to human comments. However, exceptions are noted in column A, where the other correlations are greater. This finding aligns with similar results presented in the previous section. Overall, ChatGPT demonstrates a stable correlation with human comments, ranging from 0.44 to 0.58. However, when the length is defined, the results may diverge. Notably, the correlations between ChatGPT and the defined length comments in columns I and W are almost negligible.
A similar situation occurs in Table 9 and Table 10, again with the exception of column A. This may indicate that, when the length is not defined, ChatGPT’s responses were closer to human comments in column I, M, and W. In column A, this correlation is similar to that in the other contexts; however, other similarities ( A o vs. A l and A c vs. A l ) were generally greater.
When the comparison is based on the average vectors of the generators across all cases (the Generators column), the correlation between the comments generated by humans and those produced via ChatGPT without a defined length (O vs. C) is again the highest, and the correlation between the two ChatGPT configurations (C vs. L) is also appreciable. It is noteworthy that the correlation between these average vectors is greater than the values observed in each individual case, which may reflect a general trend in vocabulary that is not specifically tailored to each particular context. These stronger correlations of the averages can be attributed to the weaker correlations concerning the relevant words. A close examination reveals that, in nearly all cases, the similarity between the 21 pairs of vectors is greater when considering stop words (Table 10) than when focusing on relevant words (Table 9), with the exception of W c vs. W l , where the values are very similar. Additionally, the general vocabulary of each group of cases (the Cases column in the three tables) exhibits a greater similarity between groups A and M, followed by groups A and I, particularly with regard to the use of stop words. Despite the minor discrepancies between the conclusions drawn from the correlations and those drawn from cosine similarity, ChatGPT tends to align more closely with human comments in terms of stop-word usage than in the use of relevant words.

3.10. Global Comparison of Cumulative Word Frequency Usage

Figure 1 shows the cumulative usage frequency for each generator across the different cases, for the top 100 most frequently used words. The x-axis of each graph corresponds to the number of words included, ranging from 1 to 100, while the y-axis indicates the cumulative frequency of usage for this set of most frequently used words. Note that the most frequently used words may differ for each case and generator. Each figure presents the cumulative values for the first 100 words, although, in certain cases, the number of frequent words is less than 100, while in others up to 500 frequent words are obtained. As a reference, all figures also display the cumulative normalized frequency predicted by Zipf’s Law [54], which states that the frequency of use of a word, f(r), is inversely proportional to the rank, r, of the word in a list arranged in decreasing order of use, i.e.,
f(r) ∝ 1/r.
The Zipf-estimated frequencies for the first 500 words were computed and normalized. In the graph, this line can be used as a reference for the expected performance of a natural language [54]. In the figures, a consistent performance can be observed between human comments (cases o) and those generated freely via ChatGPT (cases c). In all cases, the shape of the cumulative frequency for these two generators approached Zipf’s Law, with a closer alignment observed for human comments. ChatGPT (case c) converges to 100% before reaching 100 words, meaning that fewer than 100 distinct words were generated.
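The curves in Figure 1 and Figure 2 can be approximated with a computation along the following lines: sort the word frequencies of a case in decreasing order, accumulate them, and normalize; the Zipf reference line is obtained by normalizing 1/r over the first 500 ranks, as described above. The word counts below are illustrative placeholders.

import numpy as np

def cumulative_share(counts, n=100):
    """Cumulative normalized usage of the n most frequent words."""
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1][:n]
    return np.cumsum(freqs) / np.sum(counts)

def zipf_reference(n=100, vocab_size=500):
    """Cumulative curve predicted by Zipf's Law, f(r) proportional to 1/r."""
    f = 1.0 / np.arange(1, vocab_size + 1)
    f /= f.sum()                      # normalize over the first vocab_size ranks
    return np.cumsum(f)[:n]

# Illustrative comparison for the 10 most frequent words.
observed = [500, 300, 200, 150, 120, 100, 90, 80, 70, 60, 50, 40, 30]
print(np.round(cumulative_share(observed, n=10), 2))
print(np.round(zipf_reference(n=10), 2))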
When ChatGPT is constrained to a defined length (cases l), the results exhibit significant variability. In case A l , the performance is quite similar to A c ; in case W l , the frequencies accumulate more rapidly than in W c (only 20 distinct frequent words). Meanwhile, in cases M l and I l , the frequencies accumulate very slowly due to the similar frequency of use across all words. These scenarios appear to deviate considerably from natural performance. Figure 2 shows the average performance of each generator across the four cases, highlighting the behavior described previously.

3.11. Global Comparison in Terms of the Usage of Some Representative Words

To conclude the comparison, it is useful to examine the usage patterns of selected relevant words that offer a representative picture of each case. These words are likely to appear in word clouds. We selected terms that rank among the top 20 most frequently used relevant words: comfortable, fit, size, stylish, color, fashion, experience, flight, expected, and time. These words are particularly suitable for specific contexts (e.g., flight for the cases involving Indian Airlines, stylish for the cases related to women’s clothing).
We chose to illustrate the use of these terms rather than others that are more general (common) across the four contexts analyzed in this paper, such as the following frequent relevant words: quality, disappointed, service, product, customer, good, love, perfect, great, bit. This more general set of words would also be present in the associated word clouds, but it is less specific. Figure 3 illustrates the relative frequency of usage for the selected set of words. Each bar corresponds to a specific case, while the colored sections within each bar indicate the relative frequency of each word in relation to the other words included in the analysis.
As illustrated in Figure 3, the words “fit”, “stylish”, “color”, and “fashion” appear more frequently in the cases W o , W c , and W l , whereas the word “flight” is more frequent in the cases I o and I c . Additionally, some unexpected results emerged in instances where ChatGPT was constrained to a specific length. For example, the occurrence of the word “color” in I l and M l is noteworthy. For reference, the last bar in the figure represents the average relative frequency for each context (A, I, M, and W) or generator (O, C, and L).
Figure 4 shows a similar analysis, focusing on the most frequently used stop words. It can be observed that the relative frequency of use is more consistent across all cases, with only minor differences among the generators. A noteworthy pattern is ChatGPT’s tendency to overuse the word “was”. Furthermore, the results from ChatGPT with defined lengths, particularly I l and W l , deviate from the expected pattern. Nevertheless, the similarity among the generators, as illustrated in the final three bars, is striking; this stands in clear contrast to the pattern observed in the final bars of Figure 3. This observation suggests that ChatGPT may align more closely with human language in terms of the frequency of stop-word usage than in its use of relevant words.
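Since the frequent relevant words are close enough to human vocabulary to support word-cloud generation, a minimal sketch is given below. It assumes the third-party wordcloud package is installed; the frequency dictionary and output file name are illustrative placeholders.

# Sketch: build a word cloud from a dictionary of relevant-word frequencies.
from wordcloud import WordCloud

frequencies = {
    "fit": 320, "stylish": 180, "color": 150, "fashion": 140,
    "comfortable": 130, "size": 120, "love": 110, "quality": 100,
}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(frequencies)
wc.to_file("women_clothing_wordcloud.png")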

4. Conclusions

In this paper, we examined the present capacity of ChatGPT to generate datasets of comments analogous to those obtained when humans are surveyed about services. Preliminary findings indicate that, at the moment, ChatGPT lacks the capacity to generate a substantial number of comments that effectively emulate human-generated datasets. In the absence of any indication regarding the desired length (number of characters) of the comments, the comments generated via ChatGPT tend to be brief (less than 100 characters) and frequently repetitive. However, when the desired length is specified in the prompt, ChatGPT’s performance can be inconsistent: in some cases, comments are abruptly truncated to meet the specified length; in others, the generated text exhibits grammatical incoherence or fails to adhere to the specified length. In this sense, to avoid any ambiguity in the unit used to express the length of the comments, it may be convenient to use the phrase “The number of characters in the comments should be…” in the prompt instead of “The average length of the comments should be…”.
The present study focused on the lexical aspect of language. With respect to vocabulary, ChatGPT demonstrated a high degree of compatibility with the various contexts examined in the experimental sessions. This compatibility enabled the generation of word clouds analogous to those derived from the human comment datasets. However, notable discrepancies in the use of specific vocabulary across diverse contexts were observed. A notable observation is that ChatGPT’s use of stop words exhibits a stronger alignment with human comments than its use of relevant words. In instances where the desired length of the comments is not specified, ChatGPT’s vocabulary tends to align more closely with that observed in human comments. An interesting line for future research is to introduce semantic and grammatical analysis into the comparison of generated comments.
The experimental plan delineated in this paper can be readily expanded to encompass additional datasets of human comments, diverse prompt-creation methods, and other large language models (LLMs). Some of the prompt strategies to be explored may be in line with the chain-of-thought approach. For instance, an interesting prompt scheme might be the explicit declaration of ideas that may help ChatGPT improve the diversity of the generated comments. Instead of the phrase “The specific features and level of detail should vary”, possible clues that could be included as phrases in the message are as follows: “use the same writing pattern for several comments just varying some words by their synonyms or antonyms” or “create several variations of a comment by changing the polarity”. These are some of the lines of research related to prompt engineering that may be explored to improve the performance of ChatGPT or other LLMs for this task. Finally, it is worth emphasizing the value of an LLM capable of generating high-quality, human-like comment datasets tailored to specific contexts for various machine learning tasks.

Author Contributions

A.R.: conceptualization, methodology, software, validation, and writing. G.S.-G.: data curation, prompt execution, validation, and writing—review. O.R.: supervision, validation, and writing—review. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data underlying this article are available at https://github.com/alejandrorosetesuarez/CommentsGen.

Conflicts of Interest

Author Alejandro Rosete was employed by the company Avangenio S.R.L. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. OpenAI. GPT-4 Technical Report. 2023. Available online: https://cdn.openai.com/papers/gpt-4.pdf (accessed on 12 April 2024).
  2. Gupta, B.; Mufti, T.; Sohail, S.S.; Øivind Madsen, D. ChatGPT: A brief narrative review. Cogent Bus. Manag. 2023, 10, 2275851.
  3. Salloum, S.; Almarzouqi, A.; Gupta, B.; Aburayya, A.; Al Saidat, M.; Alfaisal, R. The Coming ChatGPT. Stud. Big Data 2024, 144, 3–9.
  4. Najafov, E. Understanding ChatGPT. In ChatGPT for Marketing; Apress: Berkeley, CA, USA, 2024.
  5. Stöckl, A. Information visualization with ChatGPT. In Artificial Intelligence and Visualization: Advancing Visual Knowledge Discovery; Studies in Computational Intelligence; Springer: Cham, Switzerland, 2024; Volume 1126.
  6. Naznin, K.; Mahmud, A.A.; Nguyen, M.T.; Chua, C. ChatGPT Integration in Higher Education for Personalized Learning, Academic Writing, and Coding Tasks: A Systematic Review. Computers 2025, 14, 53.
  7. Segal, M. Confronting and managing ethical dilemmas in social work using ChatGPT. Eur. J. Soc. Work 2024, 28, 155–167.
  8. Gupta, M.; Akiri, C.; Aryal, K.; Parker, E.; Praharaj, L. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 2023, 11, 80218–80245.
  9. Chen, Y.; Zou, J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat. Biomed. Eng. 2024, 9, 483–493.
  10. Thelwall, M. Evaluating research quality with Large Language Models: An analysis of ChatGPT’s effectiveness with different settings and inputs. J. Data Inf. Sci. 2024, 10, 7–25.
  11. Thelwall, M.; Kousha, K. Journal Quality Factors from ChatGPT: More meaningful than Impact Factors? J. Data Inf. Sci. 2024.
  12. Thelwall, M. Is Google Gemini better than ChatGPT at evaluating research quality? J. Data Inf. Sci. 2025.
  13. Fuller, K.A.; Morbitzer, K.A.; Zeeman, J.M.; Persky, A.M.; Savage, A.C.; McLaughlin, J.E. Exploring the use of ChatGPT to analyze student course evaluation comments. BMC Med. Educ. 2024, 24, 423.
  14. Naher, J. Can ChatGPT provide a better support: A comparative analysis of ChatGPT and dataset responses in mental health dialogues. Curr. Psychol. 2024, 43, 23837–23845.
  15. Gil-Martín, M.; Luna-Jiménez, C.; Esteban-Romero, S.; Estecha-Garitagoitia, M.; Fernández-Martínez, F.; D’Haro, L.F. A dataset of synthetic art dialogues with ChatGPT. Sci. Data 2024, 11, 825.
  16. Stefanovič, P.; Pliuskuvienė, B.; Radvilaitė, U.; Ramanauskaitė, S. Machine learning model for chatGPT usage detection in students’ answers to open-ended questions: Case of Lithuanian language. Educ. Inf. Technol. 2024, 29, 18403–18425.
  17. Kauchak, D.; Song, V.; Mishra, P.; Leroy, G.; Harber, P.; Rains, S.; Hamre, J.; Morgenstein, N. Automatic Generation of a large multiple-choice question-answer corpus. In Intelligent Systems and Applications; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 1066.
  18. Franke, S.; Pott, C.; Rutinowski, J.; Pauly, M.; Reining, C.; Kirchheim, A. Can ChatGPT Solve Undergraduate Exams from Warehousing Studies? An Investigation. Computers 2025, 14, 52.
  19. Montenegro-Rueda, M.; Fernández-Cerero, J.; Fernández-Batanero, J.M.; López-Meneses, E. Impact of the Implementation of ChatGPT in Education: A Systematic Review. Computers 2023, 12, 153.
  20. Iio, J. Analysis of critical comments on ChatGPT. In Advances in Network-Based Information Systems; Lecture Notes on Data Engineering and Communications Technologies; Springer: Cham, Switzerland, 2023; Volume 183.
  21. Cabezas-Clavijo, A.; Magadan-Diaz, M.; Rivas-García, J.I.; Sidorenko-Bautista, P. This Book is Written by ChatGPT: A Quantitative Analysis of ChatGPT Authorships Through Amazon.com. Publ. Res. Q. 2024, 40, 147–163.
  22. Bucol, J.L.; Sangkawong, N. Exploring ChatGPT as a Writing Assessment Tool. Innov. Educ. Teach. Int. 2024, 1–16.
  23. Lew, R. ChatGPT as a COBUILD lexicographer. Humanit. Soc. Sci. Commun. 2023, 10, 704.
  24. Shen, S.A.; Perez-Heydrich, C.A.; Xie, D.X.; Nellis, J.C. ChatGPT vs. web search for patient questions: What does ChatGPT do better? Eur. Arch. Otorhinolaryngol. 2024, 281, 3219–3225.
  25. Horiuchi, D.; Tatekawa, H.; Oura, T.; Oue, S.; Walston, S.L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Shimono, T.; Miki, Y.; et al. Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin. Neuroradiol. 2024, 34, 779–787.
  26. Samaan, J.; Yeo, Y.; Rajeev, N.; Hawley, L.; Abel, S.; Ng, W.H.; Srinivasan, N.; Park, J.; Burch, M.; Watson, R.; et al. Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery. Obes. Surg. 2023, 33, 1790–1796.
  27. Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874.
  28. Kumar, S.; Balachandran, V.; Njoo, L.; Anastasopoulos, A.; Tsvetkov, Y. Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3299–3321.
  29. Sable, R.; Baviskar, V.; Gupta, S.; Pagare, D.; Kasliwal, E.; Bhosale, D.; Jade, P. AI Content Detection. In Advanced Computing; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2023; Volume 2053.
  30. Wu, J.; Song, Y.; Wu, D. Does ChatGPT show gender bias in behavior detection? Humanit. Soc. Sci. Commun. 2024, 11, 1706.
  31. Lazebnik, T.; Rosenfeld, A. Detecting LLM-assisted writing in scientific communication: Are we there yet? J. Data Inf. Sci. 2024, 9, 4–13.
  32. Rawashdeh, A.; Rawashdeh, O.; Rawashdeh, M. ChatGPT and ChatGPT API: An Experiment with Evaluating ChatGPT Answers. In Proceedings of the Future Technologies Conference (FTC), London, UK, 14–15 November 2024; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; pp. 514–533.
  33. Pieper, T.; Ballout, M.; Krumnack, U.; Heidemann, G.; Kühnberger, K. Enhancing small language models via ChatGPT and dataset augmentation. In Natural Language Processing and Information Systems; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 14763.
  34. Vinora, A.; Bojiah, J.; Alfiras, M. Sentiment analysis of reviews on AI interface ChatGPT: An interpretative study. In Business Sustainability with Artificial Intelligence (AI): Challenges and Opportunities; Studies in Systems, Decision and Control; Springer: Cham, Switzerland, 2025; Volume 566.
  35. Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R.; Gal, Y. AI models collapse when trained on recursively generated data. Nature 2024, 631, 755–759.
  36. Laxman, D. Kaggle: Amazon Reviews Dataset. 2024. Available online: https://www.kaggle.com/datasets/dongrelaxman/amazon-reviews-dataset (accessed on 4 November 2024).
  37. Jagathratchakan, J. Kaggle: Indian Airlines Customer Reviews. 2024. Available online: https://www.kaggle.com/datasets/jagathratchakan/indian-airlines-customer-reviews (accessed on 4 November 2024).
  38. Elgiriyewithana, N. Kaggle: McDonald’s Store Reviews. 2024. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/mcdonalds-store-reviews (accessed on 4 November 2024).
  39. Nicapotato, N. Kaggle: Women’s E-Commerce Clothing Reviews. 2024. Available online: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews (accessed on 4 November 2024).
  40. Volkova, S. An overview on data augmentation for machine learning. In Digital and Information Technologies in Economics and Management; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2023; Volume 942.
  41. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022.
  42. Bommasani, R.; Liang, P.; Lee, T. Holistic Evaluation of Language Models. Ann. N. Y. Acad. Sci. 2023, 1525, 140–146.
  43. Muñoz-Ortiz, A.; Gómez-Rodríguez, C.; Vilares, D. Contrasting Linguistic Patterns in Human and LLM-Generated News Text. Artif. Intell. Rev. 2024, 57, 265.
  44. Botana, F.; Recio, T.; Vélez, M.P. On Using GeoGebra and ChatGPT for Geometric Discovery. Computers 2024, 13, 187.
  45. Selvioğlu, A.; Adanova, V.; Atagoziev, M. Feature Extraction and Analysis for GPT-Generated Text. arXiv 2025, arXiv:2503.13687v1.
  46. Kumar, V.; Choudhary, A.; Cho, E. Data Augmentation Using Pre-Trained Transformer Models. In Proceedings of the 2nd Workshop on Life-Long Learning for Spoken Language Systems, Suzhou, China, 4–7 December 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 18–26.
  47. Lajcinova, B.; Valabek, P.; Spisiak, M. Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data. 2025. Available online: https://www.eurocc-access.eu/success-stories/named-entity-recognition-for-address-extraction-in-speech-to-text-transcriptions-using-synthetic-data/ (accessed on 29 March 2025).
  48. He, X.; Nassar, I.; Kiros, J.; Haffari, G.; Norouzi, M. Generate, annotate, and learn: NLP with synthetic text. Trans. Assoc. Comput. Linguist. 2022, 10, 826–842.
  49. Li, B.; Hou, Y.; Che, W. Data augmentation approaches in natural language processing: A survey. AI Open 2022, 3, 71–90.
  50. Choy, M. Effective Listings of Function Stop words for Twitter. Int. J. Adv. Comput. Sci. Appl. 2012, 3, 8–11.
  51. Google Code Archive. Available online: https://code.google.com/archive/p/stop-words/downloads (accessed on 30 November 2024).
  52. Novotný, V. Implementation Notes for the Soft Cosine Measure. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1639–1642.
  53. Buda, A.; Jarynowski, A. Life Time of Correlations and Its Applications; Wydawnictwo Niezależne: Warsaw, Poland, 2010; pp. 5–21.
  54. Montemurro, M.A. Beyond the Zipf–Mandelbrot law in quantitative linguistics. Phys. A Stat. Mech. Its Appl. 2001, 300, 567–578.
Figure 1. (a) Use of words in the Amazon case, (b) use of words in the Indian Airlines case, (c) use of words in the McDonald’s case and the regular case, and (d) use of words in the Women’s Clothing case.
Figure 2. Average use of words in four cases.
Figure 3. Relative frequency of use of ten relevant words.
Figure 4. Relative frequency of use of ten stop words.
Table 1. General description of the preparation of the cases.
Case | Source | Comments | L.Ch | Distinct
A o | Kaggle: Amazon Reviews Dataset | 21,314 | 460 | 21,022
A c | ChatGPT (prompt in Section 2.3.1) | 20,000 | 94 | 100
A l | ChatGPT (prompt in Section 2.3.1) | 20,000 | 705 | 20,000
I o | Kaggle: Indian Airlines Customer Reviews | 2210 | 630 | 2208
I c | ChatGPT (prompt in Section 2.3.2) | 2000 | 54 | 15
I l | ChatGPT (prompt in Section 2.3.2) | 2000 | 630 | 2000
M o | Kaggle: McDonald’s Dataset | 33,396 | 125 | 22,253
M c | ChatGPT (prompt in Section 2.3.3) | 33,000 | 55 | 10
M l | ChatGPT (prompt in Section 2.3.3) | 33,000 | 91 | 33,000
W o | Kaggle: Women’s Clothing Dataset | 23,486 | 300 | 22,638
W c | ChatGPT (prompt in Section 2.3.4) | 20,000 | 69 | 50
W l | ChatGPT (prompt in Section 2.3.4) | 20,000 | 2324 | 20,000
Table 2. General description of the cases.
Case | Count | Ch.Ave [Min, Max] | L.Ave [Min, Max] | F.Ave [Min, Max] | Fr.Ave | Fs.Ave
A o | 21,022 | 460 [4, 9987] | 90 [1, 1802] | 22 [1, 174] | 14 | 8
A c | 100 | 94 [73, 118] | 15 [10, 19] | 7 [5, 11] | 5 | 2
A l | 20,000 | 705 [591, 790] | 116 [94, 135] | 45 [1, 66] | 35 | 10
I o | 2208 | 630 [134, 3522] | 114 [23, 699] | 28 [5, 107] | 20 | 8
I c | 15 | 54 [41, 66] | 8 [6, 11] | 3 [2, 6] | 2 | 1
I l | 2000 | 630 [630, 630] | 94 [84, 104] | 42 [26, 59] | 32 | 10
M o | 22,253 | 125 [1, 3087] | 26 [1, 584] | 9 [1, 98] | 5 | 4
M c | 10 | 55 [44, 62] | 10 [8, 13] | 4 [3, 6] | 3 | 1
M l | 33,000 | 91 [13, 129] | 14 [3, 25] | 7 [1, 20] | 5 | 7
W o | 22,638 | 300 [0, 553] | 63 [2, 121] | 25 [1, 55] | 20 | 5
W c | 50 | 69 [49, 93] | 12 [9, 15] | 5 [3, 7] | 3 | 2
W l | 20,000 | 2324 [2137, 2505] | 303 [302, 305] | 10 [10, 11] | 10 | 0
T o | 68,121 | 329 [0, 9987] | 61 [1, 1802] | 19 [1, 174] | 13 | 6
T c | 175 | 82 [41, 118] | 13 [6, 19] | 6 [2, 11] | 4 | 2
T l | 75,000 | 865 [13, 2505] | 120 [3, 305] | 19 [1, 66] | 13 | 6
Table 3. General description of the frequent words in the cases.
Case | F | Fr | Fs | C | Cr | Cs | U | Ur | Us | O | Or | Os
A o | 500 | 263 | 237 | 44 | 23 | 21 | 849 | 563 | 286 | 327 | 195 | 132
A c | 83 | 48 | 35 | 44 | 23 | 21 | 849 | 563 | 286 | 11 | 9 | 2
A l | 500 | 356 | 144 | 44 | 23 | 21 | 849 | 563 | 286 | 321 | 278 | 43
I o | 500 | 288 | 212 | 13 | 4 | 9 | 872 | 605 | 267 | 317 | 212 | 105
I c | 69 | 46 | 23 | 13 | 4 | 9 | 872 | 605 | 267 | 17 | 16 | 1
I l | 500 | 352 | 148 | 13 | 4 | 9 | 872 | 605 | 267 | 354 | 300 | 54
M o | 500 | 272 | 228 | 23 | 10 | 13 | 830 | 557 | 273 | 296 | 179 | 117
M c | 57 | 40 | 17 | 23 | 10 | 13 | 830 | 557 | 273 | 8 | 8 | 0
M l | 500 | 348 | 152 | 23 | 10 | 13 | 830 | 557 | 273 | 322 | 277 | 45
W o | 500 | 301 | 199 | 13 | 10 | 3 | 517 | 313 | 204 | 435 | 267 | 168
W c | 75 | 41 | 34 | 13 | 10 | 3 | 517 | 313 | 204 | 15 | 10 | 5
W l | 20 | 15 | 5 | 13 | 10 | 3 | 517 | 313 | 204 | 2 | 2 | 0
T o | 1076 | 784 | 292 | 182 | 33 | 149 | 1586 | 1233 | 353 | 609 | 517 | 92
T c | 192 | 132 | 60 | 7 | 1 | 6 | 1586 | 1233 | 353 | 26 | 26 | 0
T l | 900 | 650 | 250 | 4 | 1 | 3 | 1586 | 1233 | 353 | 466 | 408 | 58
Table 4. Number of occurrences of each type of word in each case.
Case | F | Fr | Fs | %Fr | %Fs | %Tu
A o | 1,483,581 | 503,791 | 979,790 | 33.96 | 66.04 | 12.276
A c | 1420 | 710 | 710 | 50 | 50 | 0.012
A l | 2,282,307 | 1,080,457 | 1,201,850 | 47.34 | 52.66 | 18.886
I o | 199,153 | 76,849 | 122,304 | 38.59 | 61.41 | 1.648
I c | 121 | 69 | 52 | 57.02 | 42.98 | 0.001
I l | 101,779 | 71,569 | 30,210 | 70.32 | 29.68 | 0.842
M o | 523,830 | 222,292 | 301,538 | 42.44 | 57.56 | 4.335
M c | 100 | 52 | 48 | 52 | 48 | 0.001
M l | 245,070 | 170,099 | 74,971 | 69.41 | 30.59 | 2.028
W o | 1,203,528 | 465,751 | 737,777 | 38.7 | 61.3 | 9.959
W c | 565 | 270 | 295 | 47.79 | 52.21 | 0.005
W l | 6,043,299 | 6,023,202 | 20,097 | 99.67 | 0.33 | 50.008
T u | 12,084,753 | 8,615,111 | 3,469,642 | 71.29 | 28.71 | 100
Table 5. Cosine similarity among vectors of frequent words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.56 | I o vs. I c: 0.57 | M o vs. M c: 0.45 | W o vs. W c: 0.6 | O vs. C: 0.71 | A vs. I: 0.69 | I vs. W: 0.35
A o vs. A l: 0.64 | I o vs. I l: 0.18 | M o vs. M l: 0.15 | W o vs. W l: 0.09 | O vs. L: 0.32 | A vs. M: 0.7 | I vs. W: 0.35
A c vs. A l: 0.75 | I c vs. I l: 0.07 | M c vs. M l: 0.11 | W c vs. W l: 0.31 | C vs. L: 0.41 | A vs. W: 0.48 | M vs. W: 0.35
Table 6. Cosine similarity among vectors of frequent relevant words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.24 | I o vs. I c: 0.48 | M o vs. M c: 0.22 | W o vs. W c: 0.31 | O vs. C: 0.4 | A vs. I: 0.36 | I vs. W: 0.14
A o vs. A l: 0.48 | I o vs. I l: 0.12 | M o vs. M l: 0.06 | W o vs. W l: 0.17 | O vs. L: 0.17 | A vs. M: 0.35 | I vs. W: 0.14
A c vs. A l: 0.71 | I c vs. I l: 0.03 | M c vs. M l: 0.1 | W c vs. W l: 0.5 | C vs. L: 0.38 | A vs. W: 0.26 | M vs. W: 0.13
Table 7. Cosine similarity among vectors of frequent stop words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.74 | I o vs. I c: 0.64 | M o vs. M c: 0.72 | W o vs. W c: 0.75 | O vs. C: 0.81 | A vs. I: 0.84 | I vs. W: 0.74
A o vs. A l: 0.71 | I o vs. I l: 0.29 | M o vs. M l: 0.35 | W o vs. W l: 0.24 | O vs. L: 0.79 | A vs. M: 0.85 | I vs. W: 0.74
A c vs. A l: 0.78 | I c vs. I l: 0.13 | M c vs. M l: 0.16 | W c vs. W l: 0.47 | C vs. L: 0.8 | A vs. W: 0.91 | M vs. W: 0.76
Table 8. Pearson correlation coefficient among vectors of frequent words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.54 | I o vs. I c: 0.55 | M o vs. M c: 0.44 | W o vs. W c: 0.58 | O vs. C: 0.69 | A vs. I: 0.67 | I vs. W: 0.32
A o vs. A l: 0.62 | I o vs. I l: 0.07 | M o vs. M l: 0.07 | W o vs. W l: 0.08 | O vs. L: 0.27 | A vs. M: 0.68 | I vs. W: 0.32
A c vs. A l: 0.75 | I c vs. I l: −0.02 | M c vs. M l: 0.05 | W c vs. W l: 0.3 | C vs. L: 0.38 | A vs. W: 0.46 | M vs. W: 0.32
Table 9. Pearson correlation coefficient among vectors of frequent relevant words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.22 | I o vs. I c: 0.46 | M o vs. M c: 0.2 | W o vs. W c: 0.3 | O vs. C: 0.37 | A vs. I: 0.32 | I vs. W: 0.11
A o vs. A l: 0.46 | I o vs. I l: 0 | M o vs. M l: 0 | W o vs. W l: 0.16 | O vs. L: 0.13 | A vs. M: 0.31 | I vs. W: 0.11
A c vs. A l: 0.7 | I c vs. I l: −0.06 | M c vs. M l: 0.01 | W c vs. W l: 0.5 | C vs. L: 0.34 | A vs. W: 0.24 | M vs. W: 0.1
Table 10. Pearson correlation coefficient among vectors of frequent stop words.
A | I | M | W | Generators | Cases
A o vs. A c: 0.72 | I o vs. I c: 0.62 | M o vs. M c: 0.72 | W o vs. W c: 0.72 | O vs. C: 0.8 | A vs. I: 0.82 | I vs. W: 0.71
A o vs. A l: 0.69 | I o vs. I l: 0.1 | M o vs. M l: 0.17 | W o vs. W l: 0.22 | O vs. L: 0.76 | A vs. M: 0.84 | I vs. W: 0.71
A c vs. A l: 0.76 | I c vs. I l: 0 | M c vs. M l: 0.08 | W c vs. W l: 0.46 | C vs. L: 0.8 | A vs. W: 0.9 | M vs. W: 0.74
