1. Introduction
In recent years, there has been a marked increase in the influence of artificial intelligence (AI) across a variety of disciplines. A notable milestone in this trend has been the widespread adoption of large language models (LLMs), particularly ChatGPT [1,2,3,4]. Generative AI is increasingly being applied to a wide range of problems and is reshaping many areas of everyday life. ChatGPT and other text-generation tools have had a substantial impact on fields such as medicine, management, science, and education. Examples of these application areas include code generation [5,6], social work [7], cybersecurity [8], genomics [9], the evaluation of research quality [10,11,12], quality assurance [13], mental health [14], art [15], education [16,17,18,19], sentiment analysis [20], text writing [21], writing quality assessment [22], text synthesis [23], health information retrieval [24], and medicine [25,26]. It is worth noting that, in these uses of ChatGPT, the main goal is to produce a single comment or response for a particular situation: each prompt addresses a different scenario, and the model only needs to generate one response for it.
Despite the pervasive utilization of LLMs, numerous concerns have been articulated from diverse vantage points [6,27,28]. One important concern involves ethical aspects, discussed in [16,20,21,29], e.g., gender bias [30] and the detection of unacknowledged LLM-assisted writing in scientific papers [31]. Another important concern relates to the precision of the results, as discussed in several papers [5,14,25,32,33,34]. In this context, the tendency of models to collapse when data generated via an LLM are used recursively to train new models is also a critical and timely issue [35].
In this paper, we investigate the potential of ChatGPT to generate synthetic datasets of comments that emulate those produced by humans concerning specific services. These datasets resemble those employed for sentiment analysis, such as the ones available on platforms like Kaggle (e.g., [36,37,38,39]) and other sites. In contrast to the previously cited applications of ChatGPT, which require a response for each individual situation, generating a dataset of comments requires the creation of multiple high-quality comments. These comments must exhibit diversity and precision with respect to the desired focus. Furthermore, in some cases the length of the comments is also an important factor.
The utilization of AI-generated comments as a substitute for human-generated datasets is imperative in scenarios where such datasets are not available. This scenario frequently arises when training models to classify human comments during the nascent stages of a service, when comments are not yet available for model development. Moreover, these synthetic comments can serve as a form of data augmentation [33,40] during the training of machine learning algorithms, thereby enhancing or supplementing datasets when their quality and/or quantity is inadequate. Additionally, the availability of a diverse range of comment datasets, potentially generated using AI, is crucial for conducting experimental studies that evaluate new algorithms. This is particularly beneficial when tailored datasets are required, such as those customized in terms of theme, comment volume, and comment length, to investigate the impact of various factors on algorithm performance. Consequently, if ChatGPT can generate such datasets, it will serve as a valuable alternative (or complement) to the labor-intensive process of gathering human opinions across various contexts for training machine learning models.
The objective of this paper is to examine ChatGPT’s capacity to generate synthetic datasets of comments that closely resemble those produced by humans. To this end, an experimental comparison was conducted of four human-generated datasets obtained from Kaggle with ChatGPT-generated datasets corresponding to each context of the respective Kaggle datasets. The focus of this study is on the lexical aspects of the task, specifically the vocabulary used.
The rest of the paper is organized into two sections. Section 2 outlines the motivation, focus, and limitations of the study; these limitations also suggest some interesting lines for future research. Section 2 also provides a detailed introduction to the experimental protocol, including access to the analyzed data and code to facilitate replication, and explains how the various datasets (generated by humans and via ChatGPT) are processed for comparison based on several metrics. Section 3 presents the main findings and summarizes the key insights derived from the analysis. The main results of the paper are outlined in the conclusions, together with some lines for future research, for example, possible prompt strategies.
3. Results
3.1. General Description of the Cases
Table 2 provides a comprehensive overview of each case, detailing the number of distinct comments (the Count column), the length (in terms of the number of characters in the Ch columns and in terms of the number of words in the L columns) of the comments (the average in the L.Ave column, the minimum number in the L.Min column, and the maximum in the L.Max column), the number of frequently used words in each comment (the average in the F.Ave column, the minimum number in the F.Min column, and the maximum in the F.Max column), and the average number of frequent relevant words (the Fr.Ave column) and frequent stop words (the Fs.Ave column) in each comment. Each row represents a case identified with a name that combines the identifier of each dataset (A, M, I, or W) with the type of generator used (o: human comments sourced from the original dataset on Kaggle; c: comments generated via ChatGPT without a specified length; l: comments generated via ChatGPT for which the length of the comment was defined). Three additional rows
were included to account for the complete set of words used with each generator (o, c, and l).
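As an illustration of how metrics of this kind can be derived, the following Python sketch computes, for one case, the number of distinct comments, the length statistics, and the per-comment counts of frequent words. This is a simplified reconstruction rather than the code used in the study; the tokenization, the stop-word list, and the example data are illustrative assumptions.

STOP_WORDS = {"the", "a", "and", "for", "not", "it", "is", "was", "with", "i"}  # simplified example list

def case_metrics(comments, frequent_words):
    """Distinct-comment count, length statistics, and frequent-word counts per comment."""
    distinct = set(comments)                              # Count column
    char_lengths = [len(c) for c in comments]             # Ch columns (characters)
    word_lists = [c.lower().split() for c in comments]    # naive tokenization
    word_lengths = [len(ws) for ws in word_lists]         # L columns (words)
    freq = [sum(1 for w in set(ws) if w in frequent_words) for ws in word_lists]   # F columns
    rel = [sum(1 for w in set(ws) if w in frequent_words and w not in STOP_WORDS)
           for ws in word_lists]                          # Fr.Ave (frequent relevant words)
    stop = [f - r for f, r in zip(freq, rel)]             # Fs.Ave (frequent stop words)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {"Count": len(distinct),
            "Ch.Ave": avg(char_lengths),
            "L.Ave": avg(word_lengths), "L.Min": min(word_lengths), "L.Max": max(word_lengths),
            "F.Ave": avg(freq), "Fr.Ave": avg(rel), "Fs.Ave": avg(stop)}

example = ["Great product! Exceeded my expectations.",
           "Great product! Exceeded my expectations."]    # repeated comment -> Count == 1
frequent = {"great", "exceeded", "expectations.", "the", "i", "my"}
print(case_metrics(example, frequent))

In this toy run, the repeated comment is counted only once, mirroring how the Count column captures truly distinct comments.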
Some interesting results can be observed in
Table 2. The most significant findings regarding ChatGPT’s performance are as follows: it uses fewer characters and words than humans, it frequently repeats the same answer, its responses vary considerably based on the defined length, and its proportion of stop words is very similar to that of human comments.
In the cases in which the desired length was not specified to ChatGPT in the prompt, the size of each comment was significantly smaller than in the corresponding human comments. This suggests that the default size of ChatGPT comments is smaller than that of these authentic human comments. It is remarkable how few truly distinct comments there are. This suggests that ChatGPT is fulfilling the requested number of comments by repeating the same ones multiple times. This may indicate a tendency to rely on specific words without exploring synonyms or varying the language used in the comments.
In the cases in which a length was defined, it is important to note that the number of truly distinct comments generated via ChatGPT matched the requested number. However, it is essential to examine this aspect in greater detail, because the size of each comment varied considerably. In one case, the desired length was exactly respected (630 characters). In another, the generated comments were close to the desired length, just slightly shorter (an average of 91 characters with respect to 130). In the remaining cases, the generated comments were longer than requested: in one, the average length exceeded 700 characters instead of 460, and in the other it was significantly larger, at more than 2000 characters instead of 300. Perhaps, in the latter case, ChatGPT infers that the desired size (300) refers to words, as opposed to the approximate size in characters that it correctly inferred (and almost achieved) for the other cases. However, this interpretation seems odd, because the desired length of 300 is not extreme: other desired lengths are larger (630) or smaller (125), and those were interpreted as characters.
In the case in which the desired length was precisely met at 630 characters, examining the details of the comments shows that several of them end abruptly in order to meet that length. Some examples of these incomplete final sentences include the following:
“Friend couple a”,
“Can boy room value film”,
“Once born with”, and
“Same wa”. Additional examples can be found in
Section 3.2.
Some of the generated responses do not seem comparable to human comments, as several words are presented in a sequence that does not form coherent sentences. An example is as follows:
“color color comfortable fashion fit love size fit fashion size stylish fashion size comfortable”. See other examples in
Section 3.2.
The average number of frequent words in human comments (column F.Ave, 19) was equal to the corresponding value for ChatGPT with a defined length (19). It is interesting to note this similarity despite the differences in average length (column L.Ave: 329 compared to 82). The F.Ave value for ChatGPT when the length was not defined is considerably smaller (6), in contrast to the case with a defined length. It is also noteworthy that the proportion of relevant words to stop words is consistent across all cases. Generally, human comments contain 2.2 relevant words for every stop word (column Fr.Ave, 13, compared to Fs.Ave, 6). This ratio is 2 (4 in relation to 2) and 2.2 (13 in relation to 6) for the two ChatGPT generators.
3.2. Some Examples of Comments Generated via ChatGPT for Each Case
The comments can be examined in detail in the available information provided. However, we believe it is useful to provide some examples of the generated comments. For each case, we provide two examples of comments generated via ChatGPT when the desired length was not specified in the prompt. For , “Excellent service: It didn’t meet my standards, especially given the price point. Not recommended”. “Great product!: Exceeded my expectations, the features are great and easy to use”. For , “Excellent service! Highly recommend this airline”. “Flight was delayed, but the staff was polite”. For , “Best place for a quick meal with family. Affordable and tasty!” “Disappointed with the customer service, but the food was good”. For , “Amazing fit. Unfortunately, it didn’t fit as expected. A bit too small”. “Love this! Not as described, returned it immediately”. In spite of their relatively small size, these sentences convey syntactically coherent ideas; however, some may lack semantic clarity due to inherent contradictions. Additionally, it is evident that they align with the expected context of each case.
Now, for each case, we provide two examples of comments generated via ChatGPT for which the expected size of the comment is defined in the prompt. For the case ,
“Last key give: Amazing product! The quality is top-notch, and I am really pleased with my purchase. Quality exceeded my expectations, making it worth every penny. Highly recommended! The product arrived on time and was as described, though it could have been packed better. Product met my needs but has some limitations that may be improved in future models. Excellent quality and build; I’m very satisfied with my purchase and would buy it again. I love the features, but the design could use some improvement. Overall, it’s decent for the price. The item didn’t work as expected, but customer service was helpful in resolving my issue. Quality exceeded my expectations, making it worth every penny. Highly recommended!”
“Offer run eight: Shipping took longer than expected, but the product itself is satisfactory. I love the features, but the design could use some improvement. Overall, it’s decent for the price. The performance was underwhelming, and I am quite disappointed with this purchase. This product has exceeded my expectations and provides great value for the money. Quality exceeded my expectations, making it worth every penny. Highly recommended! Shipping took longer than expected, but the product itself is satisfactory. Quality exceeded my expectations, making it worth every penny. Highly recommended! I had a terrible experience with the quality of this product, and I wouldn’t recommend it”.
For the case ,
“Somebody nation arm their he wish. Grow story alone forward wait. Difference become push ball better value fly. Physical identify garden piece sign large. Game including all TV imagine common level. Dream ok war actually official significant name. Thing bed concern worker drive. Wrong provide blue manage player. Through realize season. Care its someone point daughter nice resource. Where Mrs above husband. Southern size run reason act left hundred. Should keep role. Magazine cup create air. Suggest state huge his participant nation. Computer trial necessary treat model. Visit indeed school because. Since loss rock. Kid man”
“Continue final song well growth onto pick meet. Necessary some state argue. Message identify list reflect talk fast approach. Art within human issue music must Democrat. Pretty through tend Congress. Treat whom impact choose technology season recognize. Good home win send social north. Trouble I property also. Just do together local agency quickly fish. Consider boy magazine. Structure environmental direction likely approach dog success. American sense contain five. Course gun Mr argue eat. Former reduce decide carry political effect. Travel recently relationship billion ago. Industry money fear traditional. Nation find ar”
For the case ,
For the case ,
“Amazing! color color comfortable fashion fit love size fit fashion size stylish fashion size comfortable love color stylish color fashion perfect color disappointed size comfortable fashion size fit fashion fashion love quality size love quality color perfect size love fit perfect stylish stylish stylish comfortable stylish love stylish fashion disappointed stylish stylish perfect fit quality love quality stylish love stylish fit perfect disappointed love comfortable color comfortable fit disappointed perfect perfect fit disappointed size stylish fashion comfortable love love quality quality disappointed disappointed fashion fit color love stylish disappointed quality fit stylish fashion fashion quality color comfortable fashion fashion fashion size fit love fit fit comfortable fashion stylish love fit quality color size disappointed disappointed quality love comfortable stylish love love fashion color color color stylish stylish perfect fashion size love love comfortable love fit fashion fit love love fit fashion size color quality size disappointed fashion stylish size color fit fit fit quality color disappointed perfect love love stylish disappointed perfect love stylish perfect love disappointed fashion disappointed fit color quality fit stylish size disappointed size size size stylish quality color disappointed disappointed love fit perfect fit color fashion perfect fashion perfect size comfortable fashion fit love quality color fit quality size fit fit size love disappointed fashion stylish love disappointed love comfortable fit love quality disappointed size perfect fit fashion disappointed disappointed quality fashion comfortable comfortable color perfect fit comfortable fit fashion perfect perfect fit quality size comfortable love fashion love love fashion comfortable stylish disappointed size color love disappointed love fashion disappointed comfortable perfect fit color quality quality quality love disappointed comfortable perfect stylish disappointed fit love fit quality perfect quality perfect stylish disappointed perfect quality disappointed stylish comfortable fashion love quality comfortable fashion perfect quality fit quality fit fit quality stylish disappointed stylish quality comfortable perfect fashion”
“Disappointing! color color color color perfect color comfortable fashion quality perfect fashion comfortable fit color perfect comfortable fashion size stylish perfect fashion fit love disappointed love fit love comfortable disappointed stylish disappointed fit perfect fit love color love color love comfortable fit quality size comfortable fit stylish quality disappointed perfect stylish quality comfortable color size stylish stylish quality comfortable fashion quality fit stylish fashion comfortable stylish perfect color fashion perfect love comfortable color disappointed comfortable size comfortable perfect fit perfect love stylish love stylish love stylish color fashion quality color love stylish comfortable fit quality color disappointed stylish fit color quality fit perfect fit quality disappointed fit fashion fit comfortable size fit size fit quality color disappointed disappointed comfortable comfortable fashion disappointed love size quality fashion love love color disappointed fashion stylish size size perfect comfortable disappointed fashion fashion color fashion comfortable disappointed perfect stylish perfect fashion color love color color perfect size perfect stylish comfortable love disappointed perfect stylish size fashion size quality comfortable perfect color stylish perfect quality color size comfortable fit love fit disappointed love love fashion comfortable perfect color love disappointed color size stylish comfortable stylish color color disappointed love fashion disappointed love stylish perfect disappointed disappointed color quality fashion color color color color comfortable fit stylish stylish love comfortable stylish color quality love comfortable stylish quality fashion fit comfortable size fit size comfortable fashion love comfortable color disappointed love fashion color love fit stylish love stylish perfect comfortable color size fashion perfect color quality color disappointed stylish color fit love fashion size love color disappointed disappointed stylish comfortable stylish comfortable size love quality stylish quality stylish stylish love perfect fit fit size perfect disappointed comfortable stylish disappointed perfect color fashion fashion perfect love disappointed perfect stylish size fashion comfortable stylish size quality perfect color size color”
As evidenced by these examples, numerous words are meaningful in themselves; nevertheless, several inconsistencies emerge. This is particularly evident in the final case: although many of the words are appropriate to the context, they are not arranged into coherent sentences. The propensity to reiterate words within specific contexts has been observed in other experiments [
35]. In summary, it can be concluded that ChatGPT is capable of producing a limited number of meaningful comments that are focused on the specific context outlined in the prompt. However, it continues to grapple with adhering to a predefined length when generating a substantial dataset of comments. In contrast, the study by [
10] reports that ChatGPT performs better when the prompt is concise.
3.3. Analysis of Frequent Words
In light of the previous results, it is intriguing to explore whether the comments generated via ChatGPT can be used in certain forms of text summarization, such as analyzing frequent words (or word clouds) that disregard the grammatical coherence of the comments.
Table 3 presents a general overview of the frequently used words in each case. For each case, including the integration of all cases in the last three rows, we present the number of frequent words, F; the number of relevant frequent words, Fr, and of stop words, Fs; and the number of words that are common to the three generators in the same group of cases (i.e., A, I, M, W, and T), with the total in column C, the relevant words in column Cr, and the stop words in column Cs. A word is counted in columns C, Cr, and Cs if it appears among the frequent words of all three generators. We believe that these values are useful for examining the convergence of all generators within each specific context.
A total of two words were identified as appearing in the set of frequently used words across all cases: “for” and “not”, which are considered stop words. Consequently, the values 2, 0, and 2 should be placed in columns C, Cr, and Cs, corresponding to the rows , , and . However, it is more meaningful to populate these cells with the frequently used words employed for each generator in the four corresponding cases. These values are highlighted in bold to emphasize this distinction. Additionally, for each case, the number of words used with any of the generators for each group of cases is presented (total in column U, number of relevant words in Ur, and stop words in Us). Finally, the last three columns present the number of words that are exclusively used with each generator in each group of cases (total in O, relevant words in Or, and stop words in Os).
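To make the overlap columns concrete, the following sketch shows one way the common (C), union (U), and exclusive (O) counts could be computed for a group of cases, each split into relevant and stop words. It is an assumed reconstruction of the logic, not the authors' implementation; the stop-word list and the example word sets are invented for illustration.

STOP_WORDS = {"for", "not", "the", "it", "and", "a", "i", "with"}   # simplified example list

def overlap_counts(freq_o, freq_c, freq_l):
    """freq_o/freq_c/freq_l: frequent words of the human, free-length, and fixed-length cases."""
    sets = {"o": set(freq_o), "c": set(freq_c), "l": set(freq_l)}
    common = sets["o"] & sets["c"] & sets["l"]                       # column C
    union = sets["o"] | sets["c"] | sets["l"]                        # column U
    exclusive = {g: s - set().union(*(t for h, t in sets.items() if h != g))
                 for g, s in sets.items()}                           # column O, per generator
    split = lambda ws: (len(ws), len(ws - STOP_WORDS), len(ws & STOP_WORDS))  # (total, relevant, stop)
    return {"C/Cr/Cs": split(common),
            "U/Ur/Us": split(union),
            "O/Or/Os": {g: split(ws) for g, ws in exclusive.items()}}

print(overlap_counts(["price", "shipping", "not", "staff"],
                     ["price", "not", "tasty"],
                     ["price", "not", "fashion"]))

The same set operations, applied to the full frequent-word lists, would yield counts of the kind reported in Table 3.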
A number of noteworthy observations can be made regarding the information presented in
Table 3. The most significant findings are as follows: It is evident that ChatGPT utilizes a restricted lexicon, exhibiting minimal overlap with frequently used words across all specific contexts when compared to human language. However, ChatGPT employs a unique lexicon, incorporating specialized and infrequent terms that are not commonly used by humans. Furthermore, humans exhibit a tendency to employ the same word in different contexts more frequently than ChatGPT does. Due to the limited number of comments in cases where ChatGPT is not prompted with a desired length, the number of frequent words is fewer than 100, falling short of the maximum available number of frequent words for each case (500).
It is remarkable how few coincidences exist among the frequently used words. The proportion of common words (column C) is notably small—less than 7% in all cases: Case A: 5.2% (44 out of 849), Case I: 1.4% (13 out of 872), Case M: 2.8% (23 out of 830), Case W: 2.5% (13 out of 517), and Case T: 6.1% (97 out of 1586). This suggests that ChatGPT is not converging on a set of frequently used words that resembles the vocabulary employed by humans. Therefore, a different underlying vocabulary is inferred for a similar context.
In spite of the reduced number of frequently used words for ChatGPT (without a defined length), it is interesting to note that the proportion of common words is significant only in two cases, with 53% (44 out of 83) and 40.35% (23 out of 57). In the other cases, this proportion is less than 27%: 18.8% (13 out of 69), 17.3% (13 out of 75), and 26.6% (51 out of 192). This indicates a distinct difference in the vocabulary employed via ChatGPT compared to that of humans in each context, with only a small overlap of frequently used words. Consequently, it is not only the case that ChatGPT uses a more limited vocabulary than humans, but it also incorporates several words in each context that are not used by humans.
In addition, it is noteworthy that the proportion of frequently used words for ChatGPT (without a defined length) that were not included in the frequent words of the other generators is as follows: 13.2% (11 out of 83), 24.6% (17 out of 69), 14% (8 out of 57), 20% (15 out of 75), and 13.5% (26 out of 192). This indicates that between 13% and 25% of the vocabulary used via ChatGPT is absent from the words most frequently used by humans (and even by ChatGPT with a defined length) in a similar context. Although the proportion of relevant words is quite similar across all cases for the union of all frequent words in column U (ranging from 60.5% for case A to 69.4% for case I), this proportion varies significantly for the common words in column C (between 30.8% for case I and 76.9% for case W). The proportion of relevant words among the frequent words that are exclusive to a generator (columns Or and O) also varies significantly: approximately 60% for most human-generated cases, compared to over 94% for two of the ChatGPT cases without a defined length and one further case. This observation suggests that the uniqueness of ChatGPT’s vocabulary is more related to relevant words than to stop words. ChatGPT, when not subject to length restrictions, was able to generate 26 relevant words that were not produced through other methods. Additionally, the more than 400 words generated solely via ChatGPT under length restrictions suggest that its vocabulary diverges to some extent from human commentary.
It is interesting to examine the singularity of each generator by analyzing the proportion of frequently used words generated solely through each (column O) relative to the total number of frequent words used (column F). This proportion ranges from 56% to 87% for human-generated content (65.4%, 63.4%, 59.2%, 87%, and 56.6%), indicating that more than half of the frequent words in human comments were not employed via ChatGPT. In the cases involving ChatGPT without length constraints, the proportion of unique words is significantly smaller, at less than 25% (13.3%, 24.6%, 14%, 20%, and 13.5%). This may be interpreted as a tendency for ChatGPT to conform to a common human language, exhibiting limited diversity. However, the more than 13% of frequently used singular words in ChatGPT’s comments is noteworthy, especially considering that the number of distinct ChatGPT comments is considerably smaller (only 175 distinct comments compared to 68,121 distinct comments generated by humans, i.e., less than 0.3%). Additionally, the number of distinct words used is also limited (only 192 distinct frequent words compared to 1076 distinct frequent words used by humans, or 17.8%). The proportions of uniqueness in ChatGPT’s output are slightly higher when only relevant words are considered (18.8%, 34.8%, 20%, 24.4%, and 19.7%).
The cases of ChatGPT that involve a length condition are quite specific. Generally, the proportions for these cases are similar to those observed in humans, ranging between 51% and 71% (64.2%, 70.8%, 64.4%, and 51.8%), with the exception of one case, which has a proportion of only 10%. This may be interpreted as a tendency for ChatGPT to diverge in its choice of words when prompted to achieve a specific length. The divergence of frequently used words also varies depending on the context, as observed by comparing column C to column F for the last three rows, which give the proportion of frequent words utilized by each generator across all cases. This proportion is 17% for the human comments, while it is just 4% and 0.4% for the two ChatGPT generators. It appears that humans tend to use the same words in different contexts more frequently than ChatGPT does.
3.4. Total Uses of the Words
The frequency of each type of word (F columns: total; Fr: relevant words; Fs: stop words) in each case is presented in
Table 4, which counts the total occurrences of each frequent word in the comments for each case. The last column indicates the proportion of word usage in each case relative to the total across all cases. The most significant conclusion is that the proportion of relevant word usage in ChatGPT is higher than that of humans, even when the desired length is not specified.
As previously noted regarding the length of the comments, the word count within each case of the same group varies significantly. In the cases where the desired comment length was not specified, ChatGPT produced a limited number of words, accounting for less than 0.02% of the total word usage. Contrary to expectations, when the desired length of the comments was defined, the number of words generated via ChatGPT also differs significantly from the human comments. One of these cases contains around 50% of the total word usage; in fact, more than five times the proportion of the corresponding human case. Conversely, in two other cases the proportions are about half of those of the corresponding human cases, while in the remaining pair the proportions are the most similar, at around 15%. The percentage representing the relevant words is never more than 45% in the human comments. In contrast, the comments generated via ChatGPT exceed 50% in all cases, with particularly high percentages of over 99% in one case and approximately 70% in two others. In general, the proportion of relevant words in the ChatGPT cases is significantly higher when the length is specified, with one exception in which the two values are quite similar. This can be interpreted as a tendency for ChatGPT to produce a greater proportion of relevant words compared to human comments, particularly when the desired length is defined.
3.5. Most Used Words
The preceding sections investigated general trends in word usage. In this section, we present examples of words that are more distinctive for each generator and context: approximately the ten most frequently used relevant words for each case. Where multiple words are tied for the tenth position, all such words are included in the following lists, with the exception of the Wc case, where only nine words are shown because more than thirty words are tied for the tenth position. Some words commonly found in human comments appear to be typographical errors (such as “✅” or “½ï”) or are used as specific quotation marks (such as “|”); since these words lack meaningful content, they are not included.
: a, amazon, customer, delivery, i, item, order, prime, service, time.
: customer, expectations, expected, experience, great, highly, i, money, product, purchase, quality, recommended, satisfied, service.
: quality, exceeded, expectations, expected, find, i, money, product, purchase, purpose
: a, experience, flight, i, service, time, trip, verified.
: a, average, comfortable, customer, delayed, experience, flight, good, seats, service, staff.
: attorney, building, decade, fall, group, realize, space, time, walk, war, western, worry.
: a, drive, food, good, i, mcdonalds, order, place, service.
: a, bit, clean, food, great, i, place, quick, service, tasty.
: age, bed, boy, close, draw, hotel, i, public, read, rich, rise, stop, true.
: a, dress, fabric, fit, great, i, love, size, top, wear.
: amazing, bad, buy, comfortable, expected, fit, love, perfect, price.
: color, comfortable, disappointed, fashion, fit, love, perfect, quality, size, stylish.
Most words seem to be meaningful within their respective contexts (e.g., flight for Indian Airlines, food for McDonald’s) or in general (e.g., quality, purchase). However, some usages appear unusual when the desired length is specified, such as the use of “attorney” and “building” for Indian Airlines, and “hotel” in comments about McDonald’s.
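For reference, a top-10 list such as those above can be obtained along the following lines. The sketch below uses the same simplified tokenization and stop-word list assumed earlier (not the exact procedure of the study) and keeps ties at the n-th position, as done for the lists in this subsection.

from collections import Counter

STOP_WORDS = {"the", "a", "and", "for", "not", "was", "it", "with", "but"}  # simplified example list

def top_relevant_words(comments, n=10):
    counts = Counter()
    for comment in comments:
        for word in comment.lower().split():        # naive tokenization
            word = word.strip(".,!?:;\"'")           # strip basic punctuation
            if word and word not in STOP_WORDS:
                counts[word] += 1
    if not counts:
        return []
    ranked = counts.most_common()
    threshold = ranked[min(n, len(ranked)) - 1][1]   # frequency of the n-th word
    return sorted(word for word, c in ranked if c >= threshold)

print(top_relevant_words(["Great flight, friendly staff",
                          "Flight was delayed, but the staff was polite"], n=3))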
3.6. Common Words in Each Group of Cases
Since the number of common words for each group of cases is small, we prefer to include all of them in the following lists. These are the relevant words included in the three cases of each group:
Group of cases A: arrived, buy, customer, disappointed, easy, excellent, experience, fast, good, great, i, issues, money, point, price, product, purchase, quality, recommend, service, shipping, terrible, worth.
Group of cases I: a, time, bit, staff.
Group of cases M: a, bit, customer, food, great, hot, i, staff, time, wait.
Group of cases W: amazing, comfortable, disappointed, expected, fit, i, love, perfect, quality, stylish.
There are some convergences among the three generators, such as the use of “price” and “shipping” for Amazon, and “staff” and “time” for Indian Airlines and McDonald’s. Notably, in the context of the cases involving Amazon and Women’s Clothing, there are additional words. This may reflect an imbalance in the training set of ChatGPT. It is worth noting that the words “i” and “a” may be classified as stop words; however, we prefer to adhere to the conditions established in the initial stages of the experiments.
These are stop words included in the three cases of each group:
Group of cases A: again, and, anyone, as, be, every, for, had, is, it, my, not, of, the, this, use, very, was, will, with, would.
Group of cases I: but, for, it, my, no, not, the, this, with.
Group of cases M: again, and, as, best, for, just, like, not, of, the, them, will, with.
Group of cases W: for, it, not.
3.7. Words Used with Only One Generator
There are 26 words used with ChatGPT that were not included in the frequently used words of humans, nor in those employed via ChatGPT with a fixed length. All of them are relevant words: unhelpful, everyday, standards, beginning, subpar, stitching, undone, vibrant, immediately, occasion, average, tasty, plenty, outstanding, punctual, assistance, exceptional, cramped, crispy, ambiance, cleanliness, enjoyed, managed, reasonable, affordable, improvements.
On the other hand, there are 609 words that were frequently used by humans that were not used with ChatGPT. Here are some examples of these words (all of them with more than 4000 uses): amazon, an, order, dress, there, ordered, delivery, dont, prime, refund, items, said, long, bought, mcdonalds, cute, told, made, days, through, delivered, flattering.
When ChatGPT was constrained to a defined length, it generated 466 frequently used words that were not included in the lists of frequently used words from the other generators. These are some examples (all with more than 12,000 uses): fashion, purpose, build, future, lacking, areas, intended, itself, performance, satisfactory, resolving, limitations, models, improved, setup, encountered, decent, improvement, packed.
There were 510 frequently used words generated via one of the ChatGPT models that were not commonly found in human comments. Here are some examples for each context (group of cases).
A: disappointing, meet, unhelpful, everyday, standards, beginning, subpar.
I: improve, minor, improve, unhelpful, average, outstanding, punctual, assistance, exceptional, cramped, cleanliness, enjoyed, managed, reasonable, improvements.
M: expected, underwhelming, loved, tasty, plenty, crispy, ambiance, affordable, true, past, read, above, name, hotel, rich, road, draw, public, key, age, boy, rise, bed, form.
W: product, value, expectations, exceeded, again, others, described, started, coming, stitching, undone, vibrant, immediately, occasion, fashion, disappointing.
Now, we present some words frequently used in the human comments that were not used with ChatGPT.
A (relevant words with more than 5000 uses): amazon, order, delivery, prime, account, refund
A (stop words): me, so, no, if, an, them, when, one, their, now, never.
I (relevant words with more than 1000 uses): verified.
I (stop words): is, as, they, me, had, are, at.
M (relevant words with more than 2000 uses): order, mcdonalds, fast.
M (stop word): my, is, have, had, are, no, get, there.
W (relevant words with more than 3000 uses): top, dress, ordered, nice, bought, cute, beautiful, flattering, large, shirt.
W (stop words): but, in, was, of, on, that, have, they, am, you, im, its, just, or.
Some words that are clearly oriented towards specific contexts and were used by humans were not included among the frequently used words of ChatGPT. Here are some of the most notable examples.
A: amazon, delivery, prime, refund.
I: verified.
M: mcdonalds, fast.
W: dress, nice, cute, beautiful, shirt.
Conversely, there are certain words used with ChatGPT that are significant but were not included in the most frequently used words in human comments, as they are clearly tailored to the specific context of each case.
A: disappointing, standards, subpar.
I: punctual, assistance.
M: tasty, crispy, ambiance.
W: product, occasion, fashion.
Perhaps this reflects a certain bias in the training set of ChatGPT that differs from the human comments included in the datasets.
3.8. Global Comparison in Terms of Cosine Similarity
Table 5,
Table 6 and
Table 7 present a comparison of the normalized frequency vectors representing the set of frequent words for each case in terms of cosine similarity [
52], which is a widely used measure in information retrieval for comparing vectors that represent texts. First, all frequency vectors were normalized based on the total number of uses, so each component now represents the proportion of occurrences of each word in a case relative to the overall usage of the most common words. Although cosine similarity can be negative when vectors point in opposite directions, this cannot occur here because all components are non-negative (they range between 0 and 1 after normalization). Therefore, in this context, the cosine similarity ranges from 1 (indicating alignment) to 0 (indicating orthogonal vectors). In
Table 5, all words were taken into account, resulting in each vector having 1586 components, one for each possible word (0 if it was not used in the particular case). In
Table 6, the similarity is computed by considering only the relevant words; therefore, each vector has 1233 components. Finally, in
Table 7, the same measure is computed by taking into account the relative frequency of the stop words, resulting in each vector having 353 components.
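The comparison can be summarized with a short sketch: normalized frequency vectors are built over a shared vocabulary and compared with cosine similarity; restricting the vocabulary to relevant words or to stop words yields the Table 6 and Table 7 variants. The vocabulary, counts, and function names below are illustrative assumptions, not the paper's data or code.

import math

def normalized_vector(word_counts, vocabulary):
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    return [word_counts.get(w, 0) / total if total else 0.0 for w in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vocabulary and counts for illustration only (Table 5 uses 1586 words).
vocab = ["service", "flight", "staff", "for", "not"]
human_counts = {"service": 120, "flight": 300, "staff": 80, "for": 200, "not": 150}
chatgpt_counts = {"service": 40, "flight": 90, "staff": 60, "for": 70, "not": 30}
print(cosine_similarity(normalized_vector(human_counts, vocab),
                        normalized_vector(chatgpt_counts, vocab)))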
Several observations can be made based on the results presented in
Table 5,
Table 6 and
Table 7, which reflect the similarities among each generator as expressed with the corresponding vectors of normalized usage. The most significant findings are as follows: ChatGPT (without a defined length) is more similar to human vocabulary than ChatGPT with a defined length, and this similarity is particularly pronounced in terms of stop words. In general terms (see
Table 5), it can be observed that columns I, M, and W indicate that the similarity between human comments and those generated via ChatGPT without a defined length (first row) is significantly greater than the similarities observed in the other rows (second and third rows). The only exception is column A, where the other similarities are higher. Overall, ChatGPT demonstrates stability in its similarity to the set of human comments, ranging between 0.45 and 0.6. However, when the length is defined, the results may diverge.
A similar situation occurs in
Table 6 and
Table 7, with the exception of column A. This suggests that, when the length is not defined, ChatGPT’s responses were closer to human comments in columns I (“indian airlines”), column M (“mcdonalds”) and column W (“women’s clothing”). In column A (“amazon”) this similarity is similar to the other contexts; however, the other similarities (
vs.
and
vs.
) were significantly greater. When the comparison is based on the average vectors of the generators across all cases (the Generators column) again, the similarities between the comments generated by humans and those produced through ChatGPT, without a defined length (O vs. C) are the highest. It is noteworthy that the similarity between these average vectors is greater than the values observed in each individual case. This may indicate a general trend in vocabulary that is not specifically tailored to each particular context. This greater similarity in terms of the averages may be attributed to the smaller similarities concerning the relevant words. It can be observed that, in nearly all cases, the similarity between the 21 pairs of vectors is greater when stop words are considered (
Table 7) than when relevant words are considered (
Table 6), with the exception of
vs.
, for which the values are very close.
With regard to the general vocabulary associated with each group of cases (column Case in the three tables), the most analogous contexts were A and M, followed by A and I, particularly with respect to the utilization of stop words. It appears that ChatGPT exhibits a greater degree of customization in its application of stop words than in its selection of relevant words.
3.9. Global Comparison in Terms of Pearson Correlation Coefficient
Table 8,
Table 9 and
Table 10 present a comparison of the normalized frequency vectors representing the set of frequent words for each case, evaluated using the Pearson Correlation Coefficient [
53]. This coefficient ranges from −1, indicating an inverse linear relationship (in which higher values in one series correspond to lower values in the other), to 1, which signifies that both series increase in alignment. In this context, despite the normalization of the vectors, it is still possible to obtain negative values. In
Table 8, all words were taken into account, resulting in each vector having 1586 components. In
Table 9, the correlation is computed by considering only the 1233 relevant words. Finally, in
Table 10, the same measure is calculated by focusing solely on the relative frequency of usage of the 353 stop words.
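As with the cosine comparison, the Pearson coefficient can be computed directly on the normalized frequency vectors. The following sketch uses invented values purely for illustration; unlike cosine similarity on non-negative vectors, the result can be negative.

import math

def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std = math.sqrt(sum((a - mean_u) ** 2 for a in u)) * math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / std if std else 0.0

human_vec = [0.35, 0.25, 0.20, 0.15, 0.05]     # e.g., normalized human frequencies
chatgpt_vec = [0.30, 0.30, 0.10, 0.20, 0.10]   # e.g., normalized ChatGPT frequencies
print(pearson(human_vec, chatgpt_vec))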
In general terms (see
Table 8), it can be observed that columns I, M, and W indicate that the correlations between human comments and those generated via ChatGPT without a defined length (first row) are generally higher than the other similarities (second and third rows). This suggests that ChatGPT uses words in a more similar manner to human comments. However, exceptions are noted in column A, where the other correlations are greater. This finding aligns with similar results presented in the previous section. Overall, ChatGPT demonstrates a stable correlation with human comments, ranging from 0.44 to 0.58. However, when the length is defined, the results may diverge. Notably, the correlations between ChatGPT and the defined length comments in columns I and W are almost negligible.
A similar situation occurs in
Table 9 and
Table 10, again with the exception of column A. This may indicate that, when the length is not defined, ChatGPT’s responses were closer to human comments in column I, M, and W. In column A, this correlation is similar to that in the other contexts; however, other similarities (
vs.
and
vs.
) were generally greater.
When the comparison is based on the average vectors of the generators across all cases (the Generators column), a significant correlation is again observed between the comments generated by humans and those produced via ChatGPT without a defined length (O vs. C). Furthermore, the correlation between the two ChatGPT outputs is also substantial. It is noteworthy that the correlation between these average vectors is greater than the values observed in each individual case, which may reflect a general trend in vocabulary that is not specifically tailored to each particular context. These stronger correlations of the averages can be attributed to the weaker correlations concerning the relevant words. A close examination reveals that, in nearly all cases, the similarity between the 21 pairs of vectors is greater when considering stop words (
Table 10) than when focusing on relevant words (
Table 9), with the exception of
vs.
, where the values are very similar. Additionally, the general vocabulary of each group of cases (column “Case” in the three tables) exhibits a greater similarity between groups A and M, followed by groups A and I, particularly with regard to the use of stop words. Despite the minor discrepancies in conclusions related to correlations compared to cosine similarity, ChatGPT tends to align more closely with human comments in terms of stop-word usage than in the use of relevant words.
3.10. Global Comparison of Cumulative Word Frequency Usage
Figure 1 shows the cumulative usage frequency of the top 100 most frequently used words for each generator across the different cases. The x-axis of each graph corresponds to the number of words included, ranging from 1 to 100, while the y-axis indicates the cumulative frequency of usage for this set of most frequently used words. Note that the most frequently used words may vary for each case and generator. Each figure presents the cumulative values for the first 100 words, although, in certain cases, the number of frequent words is less than 100, and in other cases, 500 frequent words are obtained. As a reference, all figures also display the cumulative normalized frequency predicted through Zipf’s Law [54], which states that the frequency of use of a word is inversely proportional to the position, r, of that word in a list arranged in decreasing order of use, i.e., f(r) ∝ 1/r, where f(r) denotes the frequency of the word of rank r.
The Zipf-estimated frequencies for the first 500 words were computed and normalized. In the graph, this line can be used as a reference for the expected performance of a natural language [
54]. In the figures, a consistent performance can be observed between human comments (cases o) and those generated freely via ChatGPT (cases c). In all cases, the shape of the cumulative frequency for these two generators approached Zipf’s Law, with a closer alignment observed for human comments. ChatGPT (case c) converges to 100% before reaching 100 words, meaning that fewer than 100 distinct words were generated.
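The two curves compared in Figures 1 and 2 can be reproduced along the following lines: the Zipf reference normalizes and accumulates frequencies proportional to 1/r over the first 500 ranks, while the observed curve accumulates the normalized counts of a case's frequent words. The example counts are invented for illustration.

def zipf_cumulative(n_words=500):
    raw = [1.0 / r for r in range(1, n_words + 1)]        # f(r) proportional to 1/r
    total = sum(raw)
    curve, running = [], 0.0
    for f in raw:
        running += f / total                              # normalize and accumulate
        curve.append(running)
    return curve

def observed_cumulative(word_counts):
    counts = sorted(word_counts.values(), reverse=True)   # decreasing order of use
    total = sum(counts)
    curve, running = [], 0.0
    for c in counts:
        running += c / total
        curve.append(running)
    return curve

print(zipf_cumulative(100)[:5])
print(observed_cumulative({"good": 50, "service": 30, "flight": 20, "staff": 10}))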
When ChatGPT is constrained to a defined length (cases l), the results exhibit significant variability. In case
the performance is quite similar to
; in case
, the frequencies accumulate more rapidly than
(only 20 distinct words). Meanwhile, in cases
and
, the frequencies accumulate very slowly due to the similar frequency of use across all words. These scenarios appear to deviate considerably from natural performance.
Figure 2 shows the average performance of each generator across the four cases, highlighting the performance previously described.
3.11. Global Comparison in Terms of the Usage of Some Representative Words
It is useful to examine the usage patterns of selected relevant words that are representative of each case; such words are likely to appear in word clouds. We selected terms that rank among the top 20 most frequently used relevant words, including comfortable, fit, size, stylish, color, fashion, experience, flight, expected, and time. These words are particularly tied to specific contexts (e.g., flight for the cases involving Indian Airlines, stylish for the cases related to women’s clothing).
We chose to illustrate the use of these terms rather than others that are more general (common) across the four contexts analyzed in this paper. A similar pattern was also observed with other frequent relevant words, such as the following: quality, disappointed, service, product, customer, good, love, perfect, great, bit. This set of words would also be present in the associated word clouds, but they are less specific.
Figure 3 illustrates the relative frequency of usage for this selected set of words. Each bar corresponds to a specific case, while the various colored sections within each bar indicate the relative frequency of each word in relation to the other ten words included in the analysis.
As illustrated in
Figure 3, the words “fit”, “stylish”, “color”, and “fashion” appear more frequently in the cases of the women’s clothing group, whereas the word “flight” is more frequent in the Indian Airlines cases. Additionally, some unexpected results emerged in instances where ChatGPT was constrained to a specific length; for example, the occurrence of the word “color” in two of these cases is noteworthy. For reference, the last bar in the figure represents the average relative frequency for each context (A, I, M, and W) or generator (O, C, and L).
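A chart of this kind can be produced as in the rough sketch below. The case labels and relative frequencies are placeholders chosen only to show the stacked-bar construction, not the values measured in the study.

import matplotlib.pyplot as plt

words = ["fit", "stylish", "color", "flight", "time"]
cases = ["Wo", "Wc", "Wl", "Io"]                 # hypothetical case identifiers
rel_freq = {                                     # word -> relative frequency per case
    "fit":     [0.35, 0.30, 0.25, 0.00],
    "stylish": [0.20, 0.25, 0.30, 0.00],
    "color":   [0.25, 0.25, 0.40, 0.05],
    "flight":  [0.00, 0.00, 0.00, 0.70],
    "time":    [0.20, 0.20, 0.05, 0.25],
}
bottom = [0.0] * len(cases)
for word in words:
    plt.bar(cases, rel_freq[word], bottom=bottom, label=word)   # stack each word's share
    bottom = [b + f for b, f in zip(bottom, rel_freq[word])]
plt.ylabel("Relative frequency")
plt.legend()
plt.savefig("figure3_sketch.png")   # or plt.show()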
Figure 4 shows a similar analysis, focusing on the most frequently used stop words. The relative frequency of use is more consistent across all cases, with only minor differences among the generators. A noteworthy pattern is ChatGPT’s tendency to overuse the word “was”. Furthermore, the results from ChatGPT with defined lengths, particularly
and
, appear to deviate from expected norms. Nevertheless, the striking similarity among the generators, as illustrated in the final three bars, is noteworthy. This phenomenon stands in stark contrast to the patterns observed in the final bars of
Figure 3. This observation suggests that ChatGPT may align more closely with human language in terms of the frequency of stop-word usage than in its application of relevant words.