Evaluating Creativity: Can LLMs Be Good Evaluators in Creative Writing Tasks?
Abstract
1. Introduction
2. Related Works
2.1. Advancements of Large Language Models (LLMs)
2.2. Large Language Model (LLM) Evaluation
2.3. Types of Creativity Assessment
3. Experimental Design for Dataset
3.1. Configuration of Writing Task Data
3.1.1. Description of Prompts
- WP (Writing Prompt): Prompts that provide a scenario, idea, or situation intended to inspire writers to create a story or piece of writing based on it. There are no specific restrictions, allowing for a wide range of creative responses.
- SP (Simple Prompt): Basic prompts whose titles are limited to 100 characters, encouraging concise and focused creative responses.
- EU (Established Universe): Prompts that encourage writers to expand upon an existing fictional universe, creating new stories, characters, or events within that established setting.
- CW (Constrained Writing): Prompts that impose specific constraints on the writing process, such as a strict word limit or adherence to a particular style.
- TT (Theme Thursday): Prompts that revolve around a specific theme, encouraging writers to explore that theme in their writing. A new theme is featured each Thursday.
- PM (Prompt Me): Prompts in which writers request others to write about a specific subject, helping to introduce new outcomes and perspectives.
- MP (Media Prompt): Prompts that use linked audio or visual media to inspire a piece of writing, encouraging writers to draw inspiration from non-textual sources.
- IP (Image Prompt): Prompts that use an image as the inspiration for a story. Writers create a narrative based on the scene, characters, or mood depicted in the image.
- PI (Prompt Inspired): Standalone responses to prompts that are at least three days old. Each response must link to the prompt that inspired it and contain the story within the text area of the post itself.
3.1.2. Human Writing Interface
3.1.3. AI Writing Interface
“A total of 200 prompts are provided for 200 topics. You are required to write a creative story on each topic. There are several types of writing prompts. If the prompt includes [WP], it means there are no restrictions to provide a creative idea that the writer can use to develop the story. For [CW], there are constraints such as using specific words or writing in a particular style. [EU] requires expanding an existing fictional world to create new stories, characters, and events. [TT] involves writing about a specific topic or focuses on certain styles of writing. [PM] focuses on creative writing with new outcomes, or perspectives. For [IP], the story should be based on an image. For prompts not included in these categories, generally, a creative text should be written. The story should be within 400 words, and the text outputs should end sentences with a period (.) and not with a comma (,). Read the instructions carefully and complete the task”.
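As a concrete illustration of how such an instruction could be issued programmatically, the sketch below sends a single writing prompt to a chat-completion endpoint and collects the generated story. It is a minimal sketch rather than the authors' actual collection pipeline: the model name, temperature, abridged system instruction, and helper function `generate_story` are assumptions for illustration only.

```python
# Minimal sketch of an AI writing interface (assumed setup, not the authors' exact pipeline).
# Assumes the openai Python SDK and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Abridged version of the instruction quoted in Section 3.1.3; the full text would go here.
SYSTEM_INSTRUCTION = (
    "You are required to write a creative story on each topic. "
    "The story should be within 400 words and end sentences with a period."
)

def generate_story(prompt: str, model: str = "gpt-4", temperature: float = 1.0) -> str:
    """Send one writing prompt and return the generated story text."""
    response = client.chat.completions.create(
        model=model,              # model choice is an assumption
        temperature=temperature,  # sampling temperature is an assumption
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Example usage with one prompt type from Section 3.1.1.
story = generate_story("[WP] Rewrite your favorite song as a story.")
print(story[:200])
```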
3.2. Evaluation Method
3.2.1. Creativity Evaluation Criteria
- Fluency: Novice writers consider a single idea (1 point), developing writers consider several ideas (2 points), and expert writers explore many ideas (3 points).
- Flexibility: Novice writers consider one type of idea (1 point), developing writers consider several types (2 points), and expert writers incorporate many types of ideas (3 points).
- Originality: Novice writers develop common or replicated ideas (1 point), developing writers create interesting but minimally innovative ideas (2 points), and expert writers generate unique ideas or significantly enhance existing ones (3 points).
- Elaboration: Novice writers add minimal details and improvements (1 point), developing writers add a few (2 points), and expert writers contribute many significant details and enhancements (3 points).
- Usefulness: Novice writers suggest ideas that might meet user needs under certain conditions (1 point), developing writers propose ideas that would meet user needs (2 points), and expert writers offer ideas that would significantly enhance the user’s life (3 points).
- Specific creativity strategies: Novice writers randomly select and implement a strategy without effectively leveraging it (1 point), developing writers select and implement a strategy while explaining its support for their creativity (2 points), and expert writers deliberately choose and thoroughly explain how a creative thinking strategy bolsters their creative output (3 points).
3.2.2. Human Evaluation Interface
3.2.3. AI Evaluation Interface
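The sketch below illustrates one way an LLM evaluator could be queried with the six-criterion rubric from Section 3.2.1 and its 1–3 scores parsed from a JSON reply. It is an assumption-laden illustration, not the exact evaluation prompt or interface used in the study; the criterion keys, JSON format, and `evaluate_story` helper are ours.

```python
# Illustrative sketch of an LLM-based evaluation call (assumed format, not the study's exact prompt).
import json
from openai import OpenAI

client = OpenAI()

CRITERIA = ["fluency", "flexibility", "originality",
            "elaboration", "usefulness", "specific_creativity_strategies"]

RUBRIC_PROMPT = (
    "Score the following story on each criterion from 1 (novice) to 3 (expert): "
    + ", ".join(CRITERIA)
    + ". Reply with only a JSON object mapping each criterion to an integer score."
)

def evaluate_story(story: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM evaluator for per-criterion scores and parse them."""
    response = client.chat.completions.create(
        model=model,  # evaluator model is an assumption
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": story},
        ],
    )
    # Assumes the model returns a bare JSON object; a robust pipeline would validate this.
    scores = json.loads(response.choices[0].message.content)
    # Keep only expected criteria and clamp to the 1-3 rubric range.
    return {c: min(3, max(1, int(scores.get(c, 1)))) for c in CRITERIA}
```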
4. Experiments
4.1. Metrics
4.1.1. Mean
4.1.2. Standard Deviation
4.1.3. Inter-Annotator Agreement
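To make these statistics concrete, the snippet below computes the mean, standard deviation, and a simple inter-annotator agreement for a small set of illustrative rubric scores (not study data). A pairwise exact-agreement percentage is shown as a rough stand-in; a chance-corrected coefficient such as Krippendorff's alpha, as cited in the paper, would typically be computed with a dedicated library.

```python
# Sketch of the descriptive statistics in Section 4.1, using illustrative scores (not study data).
from itertools import combinations
from statistics import mean, stdev

# Hypothetical rubric scores (1-3) from three annotators for five stories.
annotators = {
    "A1": [2, 3, 2, 1, 2],
    "A2": [2, 3, 3, 1, 2],
    "A3": [2, 2, 3, 2, 2],
}

all_scores = [s for scores in annotators.values() for s in scores]
print(f"mean = {mean(all_scores):.2f}, sd = {stdev(all_scores):.2f}")

# Pairwise exact-agreement percentage: how often two annotators give the same score.
def exact_agreement(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

pairs = combinations(annotators.values(), 2)
agreement = mean(exact_agreement(a, b) for a, b in pairs)
print(f"pairwise exact agreement = {agreement:.2%}")
```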
4.2. Main Results
4.2.1. Human Evaluation
4.2.2. AI Evaluation
4.3. Case Study
4.3.1. Strengths of LLMs in Evaluation
4.3.2. Weaknesses of LLMs in Evaluation
5. Conclusions and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. ANOVA and Post Hoc Analysis for Evaluator Type and Evaluation Criteria
Category | Sum Sq | F | p
---|---|---|---
Human Writings | | |
Evaluator type | 1.040 | 5.106 | <0.01
Evaluation criteria | 1.953 | 7.671 | <0.001
GPT-4 Writings | | |
Evaluator type | 0.475 | 1.845 | 0.160
Evaluation criteria | 2.688 | 8.353 | <0.001
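
A two-way ANOVA of this form, with a Tukey HSD post hoc comparison, could be reproduced roughly as follows. The long-format data layout, column names, and toy scores are assumptions for illustration, not the authors' analysis script.

```python
# Rough sketch of the two-way ANOVA in Appendix A (assumed data layout, not the authors' script).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long format: one row per (evaluator type, evaluation criterion, score) observation.
df = pd.DataFrame({
    "evaluator": ["Human", "Human", "GPT-4o", "GPT-4o", "GPT-3.5", "GPT-3.5"] * 2,
    "criterion": ["fluency"] * 6 + ["originality"] * 6,
    "score":     [2, 3, 2, 2, 3, 3, 2, 2, 2, 3, 3, 2],
})

model = smf.ols("score ~ C(evaluator) + C(criterion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))                   # Sum Sq, F, and p per factor
print(pairwise_tukeyhsd(df["score"], df["evaluator"]))   # post hoc pairwise comparisons
```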
Appendix B. Chi-Square Test Results for Evaluation Criteria
Category | χ² | p
---|---|---
Human writings | 18,227.18 | <0.001
AI writings | 15,992.43 | <0.001
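
A chi-square test over the score distributions could be run along the following lines; the contingency-table shape (criteria × score levels) and the counts shown are assumptions, not study data.

```python
# Sketch of a chi-square test over rubric score counts (illustrative counts, not study data).
import numpy as np
from scipy.stats import chi2_contingency

# Rows: evaluation criteria; columns: counts of scores 1, 2, and 3.
counts = np.array([
    [10, 120, 70],   # fluency
    [15, 130, 55],   # flexibility
    [20, 110, 70],   # originality
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```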
Appendix C. Score Distribution Analysis
References
Human Writings | Fluency Mean/SD | Fluency IAA (%) | Flexibility Mean/SD | Flexibility IAA (%) | Originality Mean/SD | Originality IAA (%)
---|---|---|---|---|---|---
Human | 2.14/0.66 | 0.13 (1.00%) | 2.11/0.73 | 0.20 (0.00%) | 2.11/0.74 | 0.19 (0.00%)
GPT-4o (T) | 2.15/0.36 | 0.79 (63.50%) | 2.06/0.30 | 0.80 (66.00%) | 2.07/0.33 | 0.80 (64.50%)
GPT-4o (P) | 2.14/0.35 | 0.79 (84.50%) | 2.08/0.28 | 0.80 (88.50%) | 2.08/0.31 | 0.80 (86.00%)
GPT-3.5 (T) | 2.87/0.33 | 0.79 (75.00%) | 2.37/0.48 | 0.75 (43.00%) | 2.70/0.46 | 0.76 (59.50%)
GPT-3.5 (P) | 2.92/0.28 | 0.80 (90.50%) | 2.31/0.46 | 0.76 (71.00%) | 2.74/0.44 | 0.76 (80.50%)

Human Writings | Elaboration Mean/SD | Elaboration IAA (%) | Usefulness Mean/SD | Usefulness IAA (%) | Specific Creativity Mean/SD | Specific Creativity IAA (%)
---|---|---|---|---|---|---
Human | 2.14/0.72 | 0.21 (0.50%) | 2.08/0.68 | 0.13 (0.00%) | 1.92/0.64 | 0.08 (0.50%)
GPT-4o (T) | 2.89/0.31 | 0.79 (73.50%) | 1.97/0.24 | 0.81 (76.50%) | 2.00/0.33 | 0.80 (68.00%)
GPT-4o (P) | 2.94/0.23 | 0.80 (84.00%) | 1.99/0.18 | 0.82 (76.50%) | 2.05/0.34 | 0.80 (79.77%)
GPT-3.5 (T) | 2.91/0.28 | 0.80 (77.50%) | 1.82/0.53 | 0.76 (64.50%) | 2.47/0.50 | 0.75 (42.50%)
GPT-3.5 (P) | 2.93/0.25 | 0.80 (90.50%) | 1.86/0.54 | 0.76 (71.50%) | 2.46/0.50 | 0.75 (63.50%)

GPT-4 Writings | Fluency Mean/SD | Fluency IAA (%) | Flexibility Mean/SD | Flexibility IAA (%) | Originality Mean/SD | Originality IAA (%)
---|---|---|---|---|---|---
Human | 2.01/0.66 | 0.12 (1.50%) | 2.27/0.72 | 0.22 (0.00%) | 2.24/0.70 | 0.18 (0.00%)
GPT-4o (T) | 2.41/0.49 | 0.75 (33.00%) | 2.28/0.45 | 0.76 (46.50%) | 2.40/0.50 | 0.75 (52.00%)
GPT-4o (P) | 2.47/0.50 | 0.75 (56.50%) | 2.32/0.47 | 0.76 (76.50%) | 2.47/0.50 | 0.75 (69.00%)
GPT-3.5 (T) | 2.92/0.27 | 0.80 (82.50%) | 2.47/0.52 | 0.75 (36.00%) | 2.78/0.43 | 0.77 (71.50%)
GPT-3.5 (P) | 2.92/0.37 | 0.81 (96.00%) | 2.37/0.56 | 0.75 (66.00%) | 2.83/0.46 | 0.78 (84.50%)

GPT-4 Writings | Elaboration Mean/SD | Elaboration IAA (%) | Usefulness Mean/SD | Usefulness IAA (%) | Specific Creativity Mean/SD | Specific Creativity IAA (%)
---|---|---|---|---|---|---
Human | 2.31/0.70 | 0.20 (0.50%) | 2.23/0.69 | 0.17 (0.00%) | 2.02/0.66 | 0.09 (0.50%)
GPT-4o (T) | 2.99/0.06 | 0.82 (97.50%) | 2.00/0.19 | 0.82 (85.00%) | 2.29/0.45 | 0.76 (50.50%)
GPT-4o (P) | 3.00/0.00 | 1.00 (100.00%) | 2.01/0.15 | 0.83 (95.50%) | 2.36/0.48 | 0.75 (70.50%)
GPT-3.5 (T) | 2.95/0.23 | 0.81 (83.00%) | 1.52/0.55 | 0.76 (51.50%) | 2.41/0.50 | 0.75 (49.00%)
GPT-3.5 (P) | 2.94/0.35 | 0.81 (95.00%) | 1.57/0.59 | 0.76 (66.50%) | 2.40/0.55 | 0.75 (60.00%)

Prompt | [WP] Leonardo DiCaprio in a fit of rage begins to torpedo his own career by deliberately acting poorly and taking on bad films. He finally wins an Oscar for starring in Paul Blart: Mall Cop 3.
---|---
GPT-4 | In a shocking turn of events, Hollywood’s golden boy, Leonardo DiCaprio, has embarked on a self-destructive journey to torpedo his own illustrious career. The actor, known for his impeccable performances and choice of critically acclaimed films, has suddenly started delivering subpar performances and choosing questionable scripts. It’s as if he’s deliberately trying to tarnish his own reputation. The first sign of this bizarre behavior was when he signed on for the third installment of the widely panned ‘Paul Blart: Mall Cop’ series.
Human | A famous actor, Leonardo DiCaprio intentionally damages his career while trying to win an Oscar. Interestingly, he chooses to star in terrible films and fails in them. The film “Paul Blart: Mall Cop 3”, which no one thought would help his reputation, was his choice to star in. However, his role as a mall cop going through a difficult time in this film manages to touch the hearts of all viewers. Award season is approaching, and rumors exist that Leonardo DiCaprio might take home an award for this performance.

Human Writings | Fluency | Flexibility | Originality | Elaboration | Usefulness | Strategy |
---|---|---|---|---|---|---|
Human | 1.38 | 1.50 | 1.38 | 1.50 | 1.75 | 1.63 |
GPT-3.5 (T) | 3.00 | 2.88 | 3.00 | 3.00 | 1.75 | 2.88 |
GPT-3.5 (P) | 3.00 | 3.00 | 3.00 | 3.00 | 2.00 | 3.00 |
GPT-4o (T) | 2.00 | 2.00 | 2.00 | 2.88 | 2.00 | 2.00 |
GPT-4o (P) | 2.00 | 2.00 | 2.00 | 3.00 | 2.00 | 2.00 |

GPT-4 Writings | Fluency | Flexibility | Originality | Elaboration | Usefulness | Strategy
---|---|---|---|---|---|---
Human | 1.25 | 1.88 | 2.13 | 2.50 | 2.38 | 2.00 |
GPT-3.5 (T) | 2.75 | 2.50 | 3.00 | 2.63 | 1.00 | 2.00 |
GPT-3.5 (P) | 2.63 | 2.38 | 3.00 | 2.75 | 1.00 | 2.00 |
GPT-4o (T) | 2.00 | 2.25 | 3.00 | 2.88 | 1.75 | 2.13 |
GPT-4o (P) | 2.13 | 2.13 | 2.88 | 3.00 | 2.00 | 2.13 |

Prompt | [WP] Rewrite your favorite song as a story.
---|---
GPT-4 | Once upon a time, in a small town nestled between the mountains and the sea, lived a young man named John. He was a dreamer, always lost in his thoughts, yearning for something more than the mundane life he was living…. The star had not only guided him to his dreams but also taught him the value of perseverance and belief. And so, John’s story became a legend, inspiring others to follow their dreams and