Knowledge-Driven Generative Design of Role-Playing Game Scenarios

Owczarek, Wojciech; Wróbel, Julia; Pęszor, Damian

doi:10.3390/app16062966

Open Access Article

by Wojciech Owczarek, Julia Wróbel and Damian Pęszor *

Department of Computer Graphics, Vision and Digital Systems, Faculty of Automatics, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(6), 2966; https://doi.org/10.3390/app16062966

Submission received: 16 February 2026 / Revised: 7 March 2026 / Accepted: 17 March 2026 / Published: 19 March 2026

(This article belongs to the Special Issue Applications of Deep Learning and Generative AI Models: Challenges and Opportunities)


Abstract

The paper addresses the problem of the generative creation of role-playing game scenarios based on a knowledge compendium. The purpose of this exploratory research was to determine the impact of elements of the generation process on the quality of the scenario. The research was conducted in a generative pipeline using large language models and a compendium that describes the world of the game. The scope of the study includes the creation of a modular system that enables ablation studies and the analysis of the influence of individual factors on the quality of the results. The experiments involved comparisons between language models, variants of knowledge compendia, and the number of user prompt steps. In addition, an ablation study, a self-bias study and a small-scale study with human respondents were conducted. The main purpose of these additional studies was to examine the methods used and identify potential problems with them. The ablation study supported the significance of creating a scenario skeleton in a non-random way. No indisputable self-bias was found. The human-based study showed that the LLM evaluators are, on average, less critical than the human ones, but share some similar scoring patterns. The study demonstrated statistically significant differences resulting from the choice of language model in the Relevance, Coherence, Informativeness, Interactivity and Structure criteria, as well as the influence of the size of the compendium and the number of user prompt steps on the quality of the results. It was discovered that in the process of generating role-playing game scenarios, it might be beneficial to use short, non-randomly filled structures as the basis for the output scenario generation. It was found that large language models tend to score the generated scenarios higher than human respondents. There is, however, an overlap in preferences regarding the generation model between the human and the machine evaluators.

Keywords: role-playing games; text generation; large language models; knowledge compendium; knowledge graphs

1. Introduction

In the latter half of the 20th century, Role-Playing Games (RPGs) emerged as a form of interactive entertainment centered on collaborative storytelling. Participants enact fictional characters within imagined or real-world settings, typically under the guidance of a Game Master (GM), who is responsible for describing the game world and preparing narrative frameworks for play. One common mode of play is the campaign format, understood as a continuous narrative unfolding over multiple sessions within a coherent fictional world. Research in clinical psychology [1,2], developmental psychology [3,4,5], sociology [6], and social psychology [7] indicates that RPG participation may have a positive impact on social skills, self-esteem, literacy, and collaborative competencies. These benefits highlight the practical relevance of tools that support the preparation and facilitation of RPG content. Despite this potential, the preparation of RPG scenarios remains time-consuming and conceptually complex for GMs, which motivates the development of tools that can support this process.

Recent advances in Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) have substantially expanded the scope of automatic text generation, enabling the production of long-form, coherent narratives that maintain semantic consistency and contextual grounding. Beyond literary applications, LLMs are increasingly used to generate functional narrative artifacts—texts intended to be interpreted, adapted, or acted upon by human users. Examples include educational scenarios, interactive fiction, simulation narratives, and RPG scenarios. In such contexts, textual quality cannot be evaluated solely in terms of fluency or stylistic creativity, but must also account for structure, relevance, and usability. This aligns with current research trends in the applications of deep learning to creative industries, where the challenge lies in balancing autonomy with user-centric control.

RPG scenarios constitute a distinct class of narrative artifacts. Rather than presenting a closed story, a scenario functions as a structured framework that supports interactive storytelling, player agency, and improvisation. A typical scenario combines multiple layers of information, including plot hooks, locations, non-player characters (NPCs), motivations, conflicts, and potential player interactions. These elements are commonly organized in a semi-structured form that balances explicit guidance with narrative openness. Consequently, the usefulness of a scenario depends not only on linguistic quality but also on its structural clarity, thematic coherence, and suitability for facilitating gameplay. From the perspective of automated text generation, RPG scenarios pose specific challenges. Generated content must remain coherent across multiple narrative components, respect established world constraints, and provide sufficient informational density without overwhelming the user. Excessive narrative complexity may hinder practical usability, while overly simplistic structures may limit engagement. Generating high-quality RPG scenarios, therefore, requires careful control over narrative structure, content selection, and the level of detail provided.

Maintaining narrative consistency and contextual relevance in automatically generated scenarios typically requires grounding in external knowledge. In practice, this grounding can be provided through a knowledge compendium that formalizes the fictional world, including its locations, characters, factions, and narrative constraints. Such a compendium constrains generation, reduces the risk of contradictions, and enables the reuse of world-specific entities across narrative components. At the same time, the size and structure of the compendium may influence the balance between creative freedom and informational overload, potentially affecting scenario quality.

Despite the growing availability of tools for narrative content creation, the automatic generation of RPG scenarios remains largely unsupported by systematic, domain-aware evaluation. Existing tools commonly focus on generating isolated elements using random tables or procedural rules [8,9], or on producing locations [10], items [11], characters [12], or entire worlds [13]. While effective for inspiration, such tools provide limited support for maintaining narrative coherence and contextual consistency across scenarios or campaigns.

Similarly, recent applications of LLMs to RPG content generation are often driven by ad hoc prompt engineering, offering limited transparency and reproducibility. To the best of the authors’ knowledge, no prior study has systematically examined how design choices—such as prompt structure, knowledge grounding, and model selection—interact to influence the quality and usability of generated scenarios within a modular pipeline. This gap limits the practical adoption of generative AI in RPG campaign design workflows, where consistency, relevance, and structural clarity are critical.

Therefore, the objective of this study is to investigate how selected design choices in LLM-based RPG scenario generation influence the quality and practical usability of the resulting narrative artifacts from a game master–oriented perspective. Rather than proposing a new generative model or a fully automated storytelling system, the paper focuses on analyzing how prompt structure, external knowledge integration, and model choice interact within a modular generative pipeline intended to support human-led scenario preparation. To this end, a dedicated scenario generation pipeline was designed to enable controlled experimentation and ablation studies, supporting the systematic evaluation of narrative quality across criteria relevant to RPG campaign design and gameplay.

The main contributions of this work are as follows:

1. The design of a modular LLM-based pipeline for RPG scenario generation that enables systematic analysis of generation factors.
2. An empirical evaluation of the influence of prompt step count, knowledge compendium variants, and LLM choice on multiple qualitative dimensions of RPG scenarios.
3. Empirical evidence that compact, non-randomly structured narrative skeletons can lead to more coherent and practically usable generated scenarios.
4. An ablation study assessing the impact of algorithmic selection of narrative elements on scenario quality.

The experiments led to the conclusion that shorter prompt structures outperformed longer ones across most quality parameters and that model choice can have a statistically significant impact on scenario quality. Furthermore, it was found that LLM-based evaluators tend to assign higher scores than human judges.

2. Related Work

2.1. From Procedural Content Generation to Generative AI in Games: Evolution of Narrative Systems

Research on automated content generation in games has transformed from rule-based procedural systems [14] to large language model-driven approaches [15]. This evolution resulted in a fundamental shift in how narrative artefacts are created and deployed in applied settings [16]. This is particularly relevant for role-playing game (RPG) scenario design, where generated content must support collaborative interpretation rather than algorithmic execution. In this section, role-playing game (RPG) denotes its original form as a social, narrative practice in which participants collaboratively interpret a fictional world through description, rules, and shared imagination. Such play is typically mediated by a game master. Computer role-playing games (cRPGs) are digital games inspired by selected elements of this practice, where narrative progression and mechanics are executed by a computational system rather than interpreted collaboratively. In cRPGs, content serves as an executable script, whereas in RPGs, it functions as an interpretive artefact. This distinction is critical for research on content generation, as methods developed for cRPGs might not fully translate to the open-ended, human-mediated nature of RPGs.

Early and influential work in this area focuses on procedural content generation (PCG) as a means of automating the creation of game worlds, quests, levels, and narrative structures. Early procedural approaches framed scenario generation as action planning, where sequences of actions with defined preconditions and effects are determined through backward reasoning from a goal state [17]. Stochastic goal and action selection introduced emergent diversity, while formal entity representations enabled natural language realization. Hendrikx et al. [14] provide a comprehensive survey of PCG techniques in games, identifying types of generated game content, including storyboards and stories, and proposing a taxonomy of techniques used for these content types, such as constraint satisfaction, planning, and artificial neural networks. Although primarily oriented toward digital games, this taxonomy formalized key trade-offs between authorial control, variability, and computational tractability. Hafis et al. [18] review PCG techniques used in video games between 2014 and 2018, observing that only a small fraction address cRPGs, and even fewer focus explicitly on story or narrative generation, which resisted clean categorization within the proposed taxonomy. This suggests that the field remained methodologically underdeveloped prior to the LLM era.
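To make the action-planning framing concrete, the following is a minimal sketch of backward reasoning from a goal state over actions with preconditions and effects. It is an illustration under stated assumptions, not the system described in [17]: the `Action` type, the `plan_backward` function, the fact names, and the three example actions are all hypothetical, effects are purely additive (no facts are ever removed), and action selection is deterministic rather than stochastic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical planning action; facts are plain strings."""
    name: str
    preconditions: frozenset
    effects: frozenset


def plan_backward(goal, state, actions, depth=8):
    """Return an ordered list of action names reaching `goal` from `state`, or None.

    Works backward: pick an unmet goal fact, find an action that produces it,
    then recursively plan for that action's preconditions.
    """
    missing = goal - state
    if not missing:
        return []
    if depth == 0:
        return None
    by_name = {a.name: a for a in actions}
    target = min(missing)  # deterministic choice of the next fact to achieve
    for action in actions:
        if target not in action.effects:
            continue
        prefix = plan_backward(action.preconditions, state, actions, depth - 1)
        if prefix is None:
            continue
        # Simulate the world after the prefix and the chosen action (additive effects only).
        reached = state
        for name in prefix:
            reached = reached | by_name[name].effects
        reached = reached | action.effects
        rest = plan_backward(goal, reached, actions, depth - 1)
        if rest is not None:
            return prefix + [action.name] + rest
    return None


# Illustrative, invented scenario fragment.
ACTIONS = [
    Action("find_map", frozenset(), frozenset({"has_map"})),
    Action("travel_to_ruins", frozenset({"has_map"}), frozenset({"at_ruins"})),
    Action("recover_relic", frozenset({"at_ruins"}), frozenset({"has_relic"})),
]

plan = plan_backward(frozenset({"has_relic"}), frozenset(), ACTIONS)
print(plan)
```

Replacing the deterministic `min(missing)` and the fixed iteration order with random choices would be one way to approximate the stochastic goal and action selection mentioned above.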

More recent surveys document a fundamental methodological shift driven by the emergence of LLMs, particularly the Generative Pre-trained Transformers (GPT) family. Yang et al. [19] present a scoping review of 55 academic publications examining the use of GPT models in games research up to 2023. They identify five dominant application areas: procedural content generation, mixed-initiative game design, mixed-initiative gameplay, autonomous game playing, and game user research. Within PCG, the review shows emphasis on text-centric content, such as dialogue, quests, and narrative continuations, reflecting GPT’s strengths in natural language generation (NLG). However, applications beyond story generation—such as level, scene, or character generation—remain comparatively limited and technically challenging, often requiring structured representations, planning capabilities, or integration with non-textual game systems. An updated and substantially expanded review by Yang et al. [20] extends this analysis to 177 papers, including 122 published in 2024 alone. Despite rapid growth, the same five application categories remain dominant, suggesting conceptual consolidation rather than diversification. The review emphasizes persistent implementation challenges related to long-term planning, consistency, hallucination, and evaluation of player experience. Importantly, both reviews stress that much of the existing work remains exploratory, with limited integration into complete game systems and sparse empirical validation involving players or designers.

Taken together, these surveys trace a shift from early algorithmic PCG approaches toward data-driven, language-based generation methods, while underscoring persistent challenges related to long-term coherence, authorial control, and meaningful human–AI collaboration—challenges that are particularly salient in role-playing games. While the majority of reviewed studies focus on digital games, the identified methods and limitations directly inform RPG scenario generation research. The persistent issue of hallucinations and the lack of robust grounding mechanisms identified in these surveys underscore a critical need for structured approaches. In particular, the gap between exploratory demonstrations and production-ready systems, persistent issues with long-horizon narrative coherence, limited empirical analysis of design parameters, and the absence of systematic evaluation practices motivate the present study’s focus on controlled experimentation in knowledge-grounded scenario generation, enabling systematic analysis of factors affecting practical usability.

2.2. LLM-Based Narrative Systems: Capabilities and Deployment Barriers

The application of large language models to narrative generation has been explored across multiple domains, revealing both capabilities and persistent limitations relevant for RPG scenario creation. Empirical studies demonstrate that while LLMs can store relational knowledge within their parameters [21], their capacity to manipulate this information precisely in knowledge-intensive tasks remains limited [22]. Yu et al. [23] survey knowledge-enhanced text generation approaches, noting that reliance on parametric memory alone often leads to factual inconsistencies and discourse incoherence in long-form generation. Guan et al. [24] evaluate long-text generation capabilities, identifying challenges in maintaining long-range commonsense and discourse relations. These capabilities are essential for RPG scenarios, where narratives must adhere to the constraints of a specific, often fictional, world-state.

Attempts to formalize narrative generation mechanisms inspired by RPG practices further illustrate these constraints. Ono and Ogata [25] propose an integrated narrative generation system modelled on scenario preparation in RPGs, with particular emphasis on world settings as narrative constraints. Although computational in nature, the authors explicitly note the difficulty of automating scene development driven by player decisions. This limitation reflects a core property of RPGs: Narrative progression depends on human interpretation and improvisation, which cannot be fully specified in advance or reliably automated.

One approach to managing narrative complexity is story sifting [26], which identifies meaningful subsets of events within computer-simulated emergent narratives. The principal advantage of this method lies in its capacity to systematically extract narratively significant patterns from extensive data, effectively refining raw event logs into coherent story candidates. While such simulations enable the formal description of resulting event sequences [26], the generated structures often lack readability for human users. To address this limitation, hybrid architectures have been proposed that separate structural narrative generation (via story sifting) from surface realization performed by an LLM [27]. This separation of structure and realization is echoed in research on emotional arcs, which draws on narratological theory and treats them as organizing principles for narrative progression. Wen et al. [28] propose a framework in which emotional trajectories, such as rise and fall, guide the generation of branching story structures. While their implementation targets a computer game, the underlying idea of structuring content around emotional pacing resonates with long-standing practices in RPG design, where tension, escalation, and release are central to sustaining engagement across sessions. At the same time, the reliance on automated control of narrative progression and difficulty highlights a limitation from the perspective of RPGs: Emotional coherence cannot be enforced procedurally but must be negotiated and enacted by human participants during play.
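The core idea of story sifting, matching a narrative pattern against a simulation's event log, can be sketched as follows. This is a simplified illustration and not the method of [26]: the `Event` fields, the event kinds, and the "help, then harm" betrayal pattern are assumptions chosen for brevity, and the sample log is invented.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """One entry of a hypothetical simulation event log."""
    time: int
    kind: str
    actor: str
    target: str


def sift_betrayals(log):
    """Return (earlier, later) event pairs in which an actor first helps
    and later harms the same target."""
    ordered = sorted(log, key=lambda e: e.time)
    matches = []
    for first, second in combinations(ordered, 2):
        if (first.kind == "help" and second.kind == "harm"
                and first.actor == second.actor
                and first.target == second.target
                and first.time < second.time):
            matches.append((first, second))
    return matches


# Invented log for illustration only.
LOG = [
    Event(1, "help", "Mira", "Tobin"),
    Event(2, "trade", "Tobin", "Odo"),
    Event(3, "harm", "Odo", "Mira"),
    Event(4, "harm", "Mira", "Tobin"),
]

betrayals = sift_betrayals(LOG)
for earlier, later in betrayals:
    print(f"{earlier.actor} helped {earlier.target} at t={earlier.time}, "
          f"then harmed them at t={later.time}")
```

In a hybrid architecture of the kind described above, the matched pairs would be the structural output, and turning them into readable prose would be left to a separate realization step.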

Beyond academic research, several commercial and open-source applications demonstrate practical deployment of LLM-based narrative generation. AI Dungeon [29,30], developed since 2019, enables users to engage in text-based role-playing experiences in which the system acts as a game master, describing the fictional world and responding to player actions. The application relies on LLMs for response generation, using both role-playing-specific models such as Hermes-3 [31] and general-purpose models from the Mistral family [32]. Notably, AI Dungeon generates conversational play transcripts rather than structured scenarios designed for traditional tabletop RPGs, illustrating the distinction between real-time interactive generation and preparatory content creation. This distinction is critical: Whereas such systems generate live conversational output, RPG scenario preparation requires stable, inspectable artefacts that support planning, modification, and reuse by the game master across sessions. Other tools focus more narrowly on supporting scenario preparation. The AI-Powered Game Master Tools suite [33] generates individual scenario elements such as location descriptions, items, and adversaries based on user prompts. Particularly noteworthy is the adversary generator, which produces parameters grounded in specific rule systems [34], demonstrating system-aware content generation. Similarly, Sudowrite [35], while not designed specifically for RPGs, supports creative fiction writing through narrative suggestion and may be adapted for scenario authoring.

2.3. Quest and Scenario Generation: cRPG Methods and Transfer Limits to RPG Preparation

Methods developed for cRPGs, such as procedural quest generation, narrative planning, and constraint-based content synthesis, provide an important methodological foundation and point of comparison for the present study. So far, however, as Lopez wrote in [36]:

The emerging narrative that appears by system in the ARPGs [RPGs] sessions becomes a unique narrative artifact, which stimulates the imagination of the players within a collaborative context and creative freedom. The DRPGs [cRPGs] have failed, despite all the technological advances, to emulate this gaming experience.

Several studies specifically address quest and scenario generation in cRPG-like systems. Works by Lima et al. [37,38] explore automatic quest generation using planning, genetic algorithms and graph-based representations, emphasizing logical coherence and goal satisfaction. Similar approaches are discussed by Prins et al. [39] and Breault et al. [40], who model quests as structured sequences of objectives represented by desires or goals, constrained by world state and player actions. These systems demonstrate the primary advantage of procedural quest generation and narrative planning: Algorithmic control over narrative structure can significantly improve logical consistency and reduce nonsensical outcomes compared to purely random generators.

Research by Balint and Bidarra [41] extends this line of work by integrating semantic constraints and spatiotemporal location descriptions into narrative generation pipelines. Buongiorno et al. [42] focus on NPC personality modelling and domain constraints in dialogue-based interactions between the player and an NPC. Griffith [43] focuses on NPC traits representing personality and behaviour to create emotionally engaging NPCs. These approaches highlight the main advantage of constraint-based content synthesis: the ability to balance variability with directed narrative progression by relying on explicit state representations and predefined action schemas. From a methodological perspective, they illustrate how narrative coherence can be operationalized through formal constraints and controlled generation processes.

Recent advancements in Large Language Models have prompted a paradigm shift in Procedural Content Generation (PCG), bridging the gap between rigid rule-based systems and highly flexible, open-ended narrative generation. A comprehensive survey in [16] highlights that while traditional PCG methods ensure mechanical stability, integrating LLMs offers unprecedented dynamic storytelling capabilities. However, governing these capabilities remains a significant challenge. To address structural invalidity in purely generative outputs, recent studies, such as the schema-governed pipeline proposed in [44], demonstrate that constraining LLMs with explicitly structured knowledge schemas significantly improves narrative reliability and structural consistency in RPG environments. Furthermore, research specifically targeting tabletop RPG dynamics has explored how models balance authored scenarios with player autonomy. For instance, the evaluation of the ChatRPG system by [45] emphasizes the use of redirection strategies to maintain narrative adherence without diminishing player agency. Similarly, studies testing LLM agents as Dungeon Masters in Dungeons & Dragons [46] highlight the intricate relationship between prompt engineering, logical reasoning, and the model’s capacity to maintain a coherent, open-ended game state. These recent developments align with the premise that while LLMs offer vast generative potential, their successful application in RPG scenario preparation requires careful structural grounding.

Emergent narrative is widely recognized as a defining characteristic of role-playing games. Mumper [47] examines this phenomenon through the design and playtesting of an adventure for MÖRK BORG, applying concepts from narratology and ludology. The study demonstrates how procedural elements such as random encounter tables, travel mechanics, and environmental constraints can support narrative emergence without prescribing a fixed plot. A key finding of this study is the importance of deliberately leaving parts of the fictional world underdetermined to preserve interpretive flexibility and player agency. Excessive detail or rigid scripting risks constraining interpretation and reducing the role of the game master to that of a narrative executor rather than a participant in play.

From a literary and media studies perspective, Gryka-Zawadzka [48] analyzes RPG rulebooks and scenarios as texts situated between usability and literariness. The study emphasizes that such texts are not narratives in themselves, but tools designed to facilitate narrative events that occur during play. The narrative proper emerges only through the interaction between players and the game master, forming a triadic relationship between rulebook, scenario, and session. For content generation, this implies a fundamental limitation: Generated materials must prioritize clarity, flexibility, and interpretive affordances over narrative completeness or authorial control.

The distribution of narrative agency between players and the game master further conditions how generated content functions in practice. Svan and Wuolo [49] compare Dungeons & Dragons and Blades in the Dark, showing how differences in rules and procedures shape narrative dynamics. Blades in the Dark structurally positions the game master as reacting to player-declared intentions, while Dungeons & Dragons allows for both this kind of storytelling as well as players reacting primarily to situations framed by the game master, as in the case presented by Svan and Wuolo. These findings indicate a significant limitation for generic content generation approaches: Material that functions effectively in one RPG system may be ill-suited to another, as the same content will be interpreted and operationalized differently depending on how authority and agency are distributed.

Despite their technical sophistication, a fundamental disadvantage of applying procedural quest generation, narrative planning, and constraint-based content synthesis to tabletop RPGs is their underlying assumption of automated processing. cRPG-oriented generation systems are designed to enforce constraints automatically, which conflicts with the nature of RPG scenarios that function as human-mediated interpretive artifacts. Consequently, techniques such as exhaustive state tracking, fine-grained action schemas, or fully specified narrative graphs do not transfer directly to RPG practice. In this context, overly complex or rigidly specified outputs may reduce usability, overwhelm the game master, or constrain player agency rather than support it. Nevertheless, these works remain highly relevant for RPG scenario generation for two reasons. First, they articulate formal solutions to problems of coherence, narrative structure, and constraint satisfaction that RPG designers routinely face in an implicit or ad hoc manner. Second, they expose the limitations of highly detailed, system-driven generation. In this sense, research on cRPG content generation provides both a methodological foundation and a critical contrast, motivating approaches that favor compact structural representations and selective grounding, with constraints that support emergent narrative while preserving the interpretive roles of players and the game master. The present study builds on these insights while explicitly targeting scenario generation as a preparatory, human-facing artifact rather than an executable narrative system.

2.4. Knowledge Grounding for Consistent Scenario Generation

The deployment of LLMs in RPGs faces the challenge of hallucinations and a lack of long-range semantic consistency, which is further amplified in long-text modeling [24]. As noted in Section 2.2, LLMs can store relational knowledge within their parameters [21], but their capacity to precisely manipulate this factual information in knowledge-intensive tasks remains limited [22]. These limitations motivate the use of explicit grounding mechanisms that externalize world knowledge and narrative constraints, shifting the burden of consistency from parametric memory to structured context representations.

To address these limitations, recent research has shifted towards knowledge-enhanced text generation, which incorporates external knowledge sources—such as knowledge bases or knowledge graphs—into the generative process [23]. The integration of Retrieval-Augmented Generation (RAG) frameworks has proven effective in increasing the fact-adherence of model outputs, ensuring that generated claims can be referenced against authoritative sources [22,50]. The roadmap for unifying LLMs and Knowledge Graphs suggests that such structured models can enhance LLMs by providing explicit external knowledge for inference and interpretability [51]. This is particularly critical for functional narrative artifacts, where text usability depends on its structural clarity and thematic adherence to a predefined setting.

The development of believable agents within these narratives requires a sophisticated architecture for memory and reflection. This architectural shift is exemplified by generative agents that utilize an architecture of observation, planning, and reflection to maintain long-term behavioral consistency. Park et al. [52] demonstrated that extending an LLM with a mechanism to store, retrieve, and reflect upon a complete record of an agent’s experiences enables the simulation of complex, emergent social behaviors. Such agent-based motivations are essential for RPG scenarios, where Non-Player Characters (NPCs) must act consistently within the narrative framework.

Role-playing game worlds are often highly elaborated, as evidenced by the size and complexity of many published sourcebooks, some of which exceed several hundred pages [53]. Typically, information contained in such rulebooks is written in natural language and thus primarily suited for human readers. However, to enable computational processing for automated scenario generation, this knowledge must be formalized into structured representations such as databases or graphs. This formalization requirement represents a key implementation challenge: converting narrative-rich, often ambiguous world descriptions into machine-readable formats while preserving the semantic richness necessary for meaningful scenario generation. By treating the RPG world-state as a structured framework, generative systems can condition their predictions on the details of the local environment, including objects, characters and their past actions [54]. Such a structured framework can be thought of as a knowledge compendium. The compendium acts as a structured context that enforces thematic consistency and captures complex relational knowledge that LLMs cannot manipulate through weights alone [21,51]. It also helps avoid the deviations from the world-state that reliance on purely parametric memory often produces [23].

Graph structures offer an effective representation of complex RPG worlds. Knowledge graphs have been used to define NPC desires, resources, and locations, enabling the generation of mission dialogues for cRPGs [55]. Annotated relations linking specific elements complement the ontology of the represented world, defining, among other things, the mutual positions of entities and the desires of individual NPCs. Alternatively, event-based graphs focus on temporal relations between occurrences [56]. However, for RPG scenario generation, temporal coherence of historical events is often less important than thematic and situational consistency, necessitating a compendium structure that prioritizes entity relationships and motivations. This domain-specific requirement distinguishes RPG content generation from general-purpose knowledge-grounded text generation and motivates the design choices explored in the present study.
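As a rough illustration of an entity-centred compendium, the sketch below stores annotated relations as triples and retrieves a bounded number of facts around a seed entity by breadth-first traversal. It is a sketch under assumptions, not the compendium format used in this study or in [55]: the entities, relation names, and the `build_graph`, `retrieve`, and `to_context` helpers are hypothetical, and the `limit` parameter simply caps how many elements would be passed to the model.

```python
from collections import deque

# Invented (subject, relation, object) triples for illustration.
TRIPLES = [
    ("Captain Veyra", "desires", "Sunken Crown"),
    ("Captain Veyra", "located_in", "Port Halloway"),
    ("Sunken Crown", "located_in", "Drowned Chapel"),
    ("Drowned Chapel", "near", "Port Halloway"),
    ("Brother Osk", "guards", "Drowned Chapel"),
]

INVERSE = "inverse:"


def build_graph(triples):
    """Index triples by both endpoints so traversal can follow either direction."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, []).append((INVERSE + rel, subj))
    return graph


def retrieve(graph, seed, limit):
    """Collect up to `limit` distinct triples, nearest to `seed` first (BFS order)."""
    seen = {seed}
    queue = deque([seed])
    facts = []
    while queue and len(facts) < limit:
        node = queue.popleft()
        for rel, other in graph.get(node, []):
            if rel.startswith(INVERSE):
                fact = (other, rel[len(INVERSE):], node)
            else:
                fact = (node, rel, other)
            if fact not in facts:
                facts.append(fact)
                if len(facts) == limit:
                    break
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return facts


def to_context(facts):
    """Render retrieved triples as plain lines suitable for a prompt."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in facts)


graph = build_graph(TRIPLES)
nearby = retrieve(graph, "Captain Veyra", limit=3)
print(to_context(nearby))
```

Breadth-first order is only one possible selection policy; it favours facts directly attached to the seed entity, which matches the emphasis on entity relationships and motivations rather than event chronology.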

Despite these advancements, there is a lack of systematic analysis regarding how the scale and structure of the grounding information—specifically, the size of the knowledge compendium, understood as the total number of elements it includes, and the number of elements retrieved during generation—influence the quality of the generated scenario. While it is established that calling external APIs or consulting external knowledge improves groundedness [50], the trade-off between creative freedom and informational overload remains under-explored in the domain of automated narrative artifact generation. This study addresses this gap by investigating how selected design choices in knowledge integration and prompt structuring interact to influence the practical usability of the resulting narrative artifacts.

2.5. Evaluation of Generated Narrative Artifacts

Evaluating the quality of generated narrative content remains one of the most challenging problems in NLG, particularly in creative and open-ended domains such as RPG scenarios. Unlike tasks with a well-defined objective function or a limited space of valid outputs, narrative generation admits a wide range of plausible, stylistically diverse, and semantically distinct realizations for the same prompt. This fundamental characteristic necessitates a comparative analysis of evaluation approaches, as methods developed for constrained generation tasks such as machine translation or summarization are often poorly aligned with human judgments of narrative quality, coherence, or usefulness in interactive settings. The evaluation of creative, unformalized text generation involves selecting appropriate metrics and quality criteria used to compare different generation pipelines or algorithms. The perception of creative text can vary among readers depending on individual preferences. However, it is possible to model the subjective assessment of text by larger groups of people across selected categories, such as coherence, engagement, and fluency. Using such metrics enables comparison of the quality of human-written versus machine-generated text, as well as comparison among different generated texts. Solutions requiring surveying extensive groups of readers are, however, described in the context of such studies as time-consuming and costly [57]. In light of the development of LLMs, research is being conducted on their capabilities for evaluating texts generated by other models [58,59,60,61]. This enables automation and acceleration of research, which can be significant in the case of large datasets.

Early automatic evaluation metrics, including BLEU [62], ROUGE [63], and METEOR [64], rely on surface-level overlap between generated text and one or more reference texts. While these metrics have proven useful in constrained generation tasks, their applicability to narrative generation is severely limited by the one-to-many nature of storytelling, where high-quality outputs may share little lexical or syntactic similarity with any reference. Numerous studies have demonstrated that such reference-based metrics correlate weakly with human judgments in creative generation tasks, particularly for longer texts and story-like structures [65,66].

In response to these limitations, a growing body of work has explored semantic and learned evaluation metrics that move beyond exact token matching. Metrics based on contextual embeddings, such as BERTScore [67] and BARTScore [68], measure similarity in a learned semantic space and have been shown to achieve higher correlation with human judgments across several NLG tasks. However, these approaches still fundamentally depend on reference texts and therefore inherit many of the same structural limitations when applied to open-ended narrative generation, where references are sparse, noisy, or inherently incomplete.

To address the reference-dependence problem more directly, several works propose reference-free or weakly supervised evaluation methods tailored to open-ended text. UNION [69], for example, trains a discriminator to distinguish human-written stories from systematically corrupted variants, capturing narrative errors such as repetition, logical inconsistency, and long-range incoherence. While such approaches show improved robustness and correlation with human judgments in story generation tasks, they require task-specific training data and assumptions about the types of errors likely to occur, which may not generalize across domains or generation paradigms. In the present work, this method was not directly employed, as certain evaluation categories used therein were deemed inadequate for RPG scenarios, while other essential categories—specifically Structure and Interactivity—were absent. Nonetheless, given the promising results described in [69], the evaluation presented in this article was based on the approach proposed therein.

More recently, LLMs have been explored not only as generators but also as evaluators of generated content. Several studies demonstrate that LLMs, when prompted appropriately, can produce quality judgments that correlate well with expert human evaluation in tasks including open-ended story generation [59]. Fu et al. propose a solution that leverages the emergent properties of GPT for automatic text evaluation [58]. The authors posited that, given the complexity of GPT models, they would be capable of evaluating texts across a broad spectrum of categories without requiring prior adaptation for such evaluations. Evaluation aspects included coherence, comprehensibility, capacity to engage the reader, depth, and textual interest. Studies showed that, based on Kendall’s absolute correlation between scores generated by different automatic methods and those assigned by humans, LLM responses are on average weakly correlated with human judgments in the context of evaluating individual stories. However, for evaluating entire systems, the average result indicates a strong correlation. Moreover, the impact of prompt structure on the degree of evaluation correlation was also tested. It was observed that simple prompts—without requests for elaboration, examples, or detailed evaluation criteria—provide the most stable results.

Frameworks such as G-EVAL [70] and GPTScore [58] leverage instruction following, chain-of-thought reasoning, and form-based scoring to perform multi-dimensional, customizable evaluations without task-specific training. These methods offer substantial practical advantages, including flexibility, scalability, and reduced cost compared to human evaluation, making them particularly attractive for iterative development and large-scale experimentation. Chiang and Lee [59] further demonstrate that LLM-based evaluation can serve as a viable alternative to human evaluation in certain contexts, achieving reasonable agreement with expert judgments while offering improved reproducibility and cost efficiency. At the same time, the use of LLMs as evaluators introduces new challenges. Several works caution that model-based evaluators may exhibit biases toward texts generated by similar models, may be sensitive to prompt formulation, and may fail to reliably assess aspects such as factual correctness, emotional impact, or long-term narrative consistency [59,66]. These concerns are especially salient in creative domains, where evaluation criteria are subjective, culturally situated, and difficult to formalize.

Beyond pointwise quality assessment, recent research emphasizes the importance of evaluating diversity alongside quality in creative generation systems. Bradley et al. [61] propose a quality–diversity framework in which language models provide feedback not only on the overall quality of generated texts but also on their diversity along specified dimensions. While such approaches demonstrate promising alignment between AI-based and human judgments, they also highlight risks of reward hacking and underspecification of evaluation axes, reinforcing the need for careful experimental design and human oversight.

Human evaluation, therefore, remains a central component of narrative generation research. Best practice guidelines emphasize the importance of clearly defined evaluation criteria, appropriate study design, and transparency in reporting [57]. Although costly and time-consuming, human judgments are currently indispensable for validating automatic metrics, uncovering qualitative failure modes, and assessing the practical usefulness of generated narratives in real-world applications. Surveys of evaluation practices consistently recommend combining multiple evaluation methods—automatic metrics, model-based evaluators, and human studies—to obtain a more reliable and nuanced picture of system performance [65,66]. Deriu et al. [71] provide a comprehensive review of evaluation methods for dialogue systems, highlighting similar challenges in assessing interactive and conversational AI, where context-dependence and user experience are critical.

In summary, evaluation of generated narrative content remains an open research problem, particularly in creative and interactive domains such as RPGs. Existing approaches exhibit trade-offs between scalability, interpretability, and alignment with human judgment. Consequently, recent work increasingly adopts mixed evaluation strategies that balance automatic measures with model-based and human evaluation, reflecting a broader shift toward methodological pluralism in narrative generation research.

Across the reviewed literature, a recurring limitation is the lack of systematic analysis of how concrete design parameters in generative pipelines affect the usability of narrative artifacts for human practitioners. While prior work explores planning depth, world modeling, knowledge integration, and narrative control largely in isolation, there is limited empirical evidence on how these factors interact in the context of scenario preparation for RPGs. In particular, the effects of prompt structure complexity, the scale and density of grounding information, and the use of compact versus elaborated narrative skeletons remain underexplored. The present study directly addresses this gap by experimentally evaluating how variations in prompt step count, knowledge compendium size, and structural design influence scenario quality across multiple user-relevant dimensions.

3. Materials and Methods

3.1. Game World Compendium Structure

The world compendia used in these experiments were modelled as a directed graph of entities and relations representing a single game world. This representation is based on the approach proposed in [55]. It is, however, extended by introducing entity descriptions to provide more context for prompt generation. Moreover, the distinction between Non-Player Characters and Creatures, introduced in [55], was eliminated, as even zoomorphic entities were assumed to possess goals and agency. Furthermore, the Resources category was generalized to the Objects category.

The compendium acts as the initial input of the pipeline and contains two primary data types:

Entity: component of type: character, object, location, or event;
Relation: directed connection between entities $(v_{i}, v_{j})$, where $v_{i}$ denotes the source entity and $v_{j}$ the target entity.

The entities of type character serve as the foundation for defining motivations and strategies, which, in turn, shape the skeleton of the generated scenario. These strategies determine the narrative structure and the sequence of actions undertaken. The object entity type provides NPCs with the ability to perform actions on objects as well as requiring player characters to do so as part of the narrative. The location entity type provides essential spatial and contextual anchors for other world components such as objects, events, and interactions. Entities of type event are not used directly to populate the scenario skeleton. Instead, they are included only when related to other entities within the scenario. Consequently, events constitute the smallest entity group across all compendium variants.

Formally, the compendium is defined as a labelled directed graph $c = (V, E)$, where each vertex $v \in V$ is described as $v = (n_{v}, d_{v}, t_{v})$, with $n_{v}$ denoting the name of the entity, $d_{v}$ the description of the entity, and $t_{v}$ its type. Each edge $e \in E$ is defined as $e = (v_{i}, v_{j}, n_{e}, d_{e})$, where $n_{e}$ is the name of the relation and $d_{e}$ its description.
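The formal definition above can be mirrored in a small data structure. The following is a minimal sketch, not the authors' implementation: the class names, the `of_type` and `neighbours` helpers, and the choice to treat a relation as connecting both of its endpoints are assumptions made for illustration.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = {"character", "object", "location", "event"}


@dataclass(frozen=True)
class Entity:
    name: str         # n_v
    description: str  # d_v
    type: str         # t_v

    def __post_init__(self):
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


@dataclass(frozen=True)
class Relation:
    source: Entity    # v_i
    target: Entity    # v_j
    name: str         # n_e
    description: str  # d_e


@dataclass
class Compendium:
    entities: list[Entity] = field(default_factory=list)     # V
    relations: list[Relation] = field(default_factory=list)  # E

    def of_type(self, entity_type):
        """Return all entities of the given type."""
        return [v for v in self.entities if v.type == entity_type]

    def neighbours(self, entity):
        """Return (other entity, relation) pairs for every relation touching
        `entity`. Ignoring edge direction here is an assumption."""
        pairs = []
        for e in self.relations:
            if e.source == entity:
                pairs.append((e.target, e))
            elif e.target == entity:
                pairs.append((e.source, e))
        return pairs
```

A compendium variant would then be a single `Compendium` instance whose `entities` and `relations` lists hold the graph's vertices and edges.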

3.2. Generation Pipeline Design

The scenario generation is executed through a processing pipeline composed of:

Skeletal scenario generation
Skeletal scenario completion
Scenario text generation

The pipeline was designed with a modular architecture to facilitate experimental flexibility, allowing the assessment of the contribution of its elements to the overall quality and coherence.

3.2.1. Skeletal Scenario Generation

The skeletal scenario generation is based on NPC motivations and strategies derived from an analysis of quests in cRPGs by Doran and Parberry [72], which are assumed to be transferable to RPG settings. The set of possible motivations $M$ consists of the motivations $m$ introduced in [72], as summarized in Table 1. To operationalize these motivations, they are represented in the framework as textual descriptions of characters’ desires. For each $m \in M$, multiple strategies $s \in S_{m}$ exist that allow the fulfilment of this motivation. Each strategy $s = (a_{1}, a_{2}, \dots, a_{n})$ is an ordered list of actions $a_{i} \in A$. The list of possible actions and templates for their textual descriptions are presented in Table 2. Due to the extensive size of the lists of strategies proposed by Doran and Parberry, we refer the reader to [72].

Each scenario originates from a quest giver character $g$, who is selected uniformly at random from the set of characters in the compendium. Due to the wide variety of possible entities in RPGs, it is necessary to estimate the probability of specific motivations and corresponding strategies for a given quest giver. The quest giver’s natural-language description $d_{g}$ is used to select the motivation $m$ and the strategy $s$. First, the operationalized motivation descriptions (Table 1) are compared against $d_{g}$. Once the motivation has been selected, the descriptions of the strategies (see [72]) that correspond to this particular motivation are compared against $d_{g}$. The comparison is based on semantic similarity within a textual embedding space, where linguistic units are represented as numerical vectors preserving semantic relationships. The embeddings were created with the nomic-embed-text v1.5 model [73], and their proximity was measured using cosine similarity. While widely used in text similarity research [74,75,76], this metric is susceptible to the hubness phenomenon [77], where certain vectors appear artificially close to many others, and to stylistic bias [78], which can obscure semantic diversity [61]. To mitigate these issues, the values were first standardized using the Z-score transformation. This step highlights the motivations and strategies that are relatively stronger for a given NPC, reducing the influence of common or stylistically uniform patterns. Finally, the scores were normalized using min–max scaling and clipped from below at a threshold $\varepsilon = 0.01$ to prevent zero weights, as formulated in Equation (1):

$$\hat{z}_{i, g}^{(t)} = \max \left( \frac{z_{i, g}^{(t)} - \min_{j} z_{j, g}^{(t)}}{\max_{j} z_{j, g}^{(t)} - \min_{j} z_{j, g}^{(t)}}, \varepsilon \right), \quad t \in \{t_{m}, t_{s}\}, \quad i = 1, \dots, n_{t}$$

(1)

where the superscript $t \in \{t_{m}, t_{s}\}$ acts as a categorical indicator for the motivation and strategy selection stages, respectively. The index $i$ represents a candidate within the corresponding set, and $g$ represents the quest giver. The number of candidates $n_{t}$ is determined by the stage: $n_{t_{m}} = |M|$ is the total number of motivations, while $n_{t_{s}} = |S_{m}|$ is the number of strategies available for the selected motivation $m$. The normalized scores $\hat{z}_{i, g}^{(t)}$ were then used as weights to compute the probability of selecting a specific motivation or strategy, as shown in Equation (2):

$$P_{g}^{(m)}(i) = \frac{\hat{z}_{i, g}^{(m)}}{\sum_{j = 1}^{|M|} \hat{z}_{j, g}^{(m)}}, \qquad P_{g, m}^{(s)}(k) = \frac{\hat{z}_{k, g}^{(s)}}{\sum_{j = 1}^{|S_{m}|} \hat{z}_{j, g}^{(s)}}$$

(2)

where $S_{m}$ represents the set of available strategies associated with motivation $m$. Once the probabilities are calculated, the quest giver’s motivation $m_{g}$ and strategy $s_{g}$ are randomly selected according to those probabilities. The ordered list of actions associated with $s_{g}$ is added to the list of scenario actions $A_{g}$. If the number of plot points required by the prompt variant (see Table 3) is not reached, the next strategy is drawn from the set excluding the strategies already selected for the given $m_{g}$, and its actions are appended to those of the previous strategies. If all strategies in $S_{m}$ have already been used, the set $S_{m}$ is reset to its full state, allowing strategies to be reused. This is repeated until the desired length is reached, with excess actions truncated.
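The weighting in Equations (1) and (2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the input is a list of precomputed cosine similarities between $d_{g}$ and each candidate description, and the Z-scores are computed over the candidates of a single quest giver, since the text does not state the population over which standardization is performed.

```python
import math
import random

EPSILON = 0.01  # clipping threshold from Equation (1)


def selection_probabilities(similarities, epsilon=EPSILON):
    """Map raw cosine similarities to selection probabilities.

    Steps: Z-score standardization, min-max scaling, clipping from below
    at epsilon (Equation (1)), then normalization to sum to 1 (Equation (2)).
    """
    n = len(similarities)
    mean = sum(similarities) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in similarities) / n)
    if std == 0:
        # All candidates are equally similar: fall back to a uniform choice
        # (an assumption; the source does not discuss this case).
        return [1.0 / n] * n
    z = [(s - mean) / std for s in similarities]
    lo, hi = min(z), max(z)
    weights = [max((zi - lo) / (hi - lo), epsilon) for zi in z]
    total = sum(weights)
    return [w / total for w in weights]


def draw(candidates, similarities, rng):
    """Draw one candidate (a motivation or a strategy) by its probability."""
    probs = selection_probabilities(similarities)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

For example, `draw(["m1", "m2", "m3"], [0.2, 0.5, 0.8], random.Random(1))` picks one of three hypothetical motivations, with the least similar one retaining a small non-zero chance because of the clipping. Note that when standardization and scaling are applied to the same list of candidates, as in this sketch, the min-max result equals that of min-max scaling applied to the raw similarities; the Z-score step changes the weights only if it is computed over a different population, for instance across quest givers.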

The output of this stage of the pipeline is a skeletal scenario, consisting of a quest giver $g$, their motivation $m_{g}$, strategies, and a derivative series of actions $A_{g}$ with generic subjects.
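The procedure for assembling the action list $A_{g}$ (drawing strategies without replacement, resetting the pool once it is exhausted, and truncating the result) can be sketched as below. The function name, the dictionary-based inputs, and the placeholder strategy and action names in the usage note are hypothetical; the actual strategies are those listed in [72].

```python
import random


def build_action_list(strategies, probabilities, n_points, rng):
    """Concatenate the actions of drawn strategies until n_points is reached.

    strategies:    dict mapping a strategy name to its ordered list of actions
    probabilities: dict mapping a strategy name to its selection probability
    n_points:      number of plot points required by the prompt variant
    """
    if not any(strategies.values()):
        raise ValueError("at least one strategy must contain an action")
    actions = []
    available = list(strategies)
    while len(actions) < n_points:
        if not available:
            # Every strategy has been used: reset the pool to allow reuse.
            available = list(strategies)
        weights = [probabilities[s] for s in available]
        chosen = rng.choices(available, weights=weights, k=1)[0]
        available.remove(chosen)
        actions.extend(strategies[chosen])
    # Excess actions beyond the required length are truncated.
    return actions[:n_points]
```

With two placeholder strategies, `build_action_list({"s1": ["a1", "a2"], "s2": ["a3"]}, {"s1": 0.7, "s2": 0.3}, 5, random.Random(1))` returns five actions: the first three cover both strategies once, after which the pool is reset and strategies are reused.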

3.2.2. Skeletal Scenario Completion

Following motivation and strategy assignment, the quest giver $g$ is associated with an action sequence $A_{g} = (a_{1}, a_{2}, \dots, a_{n})$, where each action $a_{i}$ corresponds to an abstract operation listed in Table 2. The action set, derived from [72], was augmented with natural language templates and effect descriptors specifying the semantic impact of each action on its target entity, enabling grammatically valid scenario text generation. To instantiate the scenario skeleton, each action template must be populated with specific entities of the required type $t \in \{\text{character}, \text{object}, \text{location}\}$. Entity selection proceeds in two stages.

First, priority candidates are identified from entities directly connected to the quest giver g via any relation e in the compendium c.

For each candidate entity $v$, if the cosine similarity between its connecting relation description $d_{e}$ and the action’s effect descriptor reaches the threshold $\tau_{c}$, the entity is retained. The average value of this score depends on the compendium used. In this research, a constant value in the middle of the positive range was chosen, so as to avoid tuning it towards any specific compendium. In practice, it is advisable to select the threshold experimentally, as it controls the fraction of the descriptions treated with priority and therefore the sensitivity of this pipeline stage.

If no candidates meet this threshold, all entities of the required type t are considered, regardless of their relation to the quest giver g.

Final entity selection maximizes cosine similarity between two text embeddings:

Candidate representation: the candidate entity description $d_{v}$, optionally prepended with its relation description $d_{e}$ if connected to $g$;
Context representation: the concatenation of the quest giver’s description $d_{g}$ and the action’s textual template.

To ensure narrative diversity, previously selected entities are excluded from subsequent selections unless no alternatives remain. The result is a structured scenario skeleton with actions populated by entity names $n_{v}$, ready for natural language generation.
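The two-stage entity selection described above can be sketched as follows. The sketch operates on precomputed embedding vectors and makes several assumptions that the text leaves open: the `Candidate` class and `select_entity` function are hypothetical, previously used entities are filtered out before the priority stage, and the threshold is passed in as a parameter rather than fixed.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


@dataclass
class Candidate:
    name: str                                       # n_v
    repr_vec: Sequence[float]                       # embedding of d_v (with d_e prepended if connected to g)
    relation_vec: Optional[Sequence[float]] = None  # embedding of d_e, None if not connected to g


def select_entity(candidates, effect_vec, context_vec, threshold, used=frozenset()):
    """Pick the entity of the required type that fills one action template.

    candidates:  entities of the required type t
    effect_vec:  embedding of the action's effect descriptor
    context_vec: embedding of d_g concatenated with the action template
    threshold:   the cosine similarity threshold tau_c
    used:        names of entities already placed in the skeleton
    """
    # Exclude previously selected entities unless no alternatives remain.
    fresh = [c for c in candidates if c.name not in used] or list(candidates)
    # Stage 1: prioritize entities related to the quest giver whose relation
    # description is similar enough to the action's effect descriptor.
    priority = [
        c for c in fresh
        if c.relation_vec is not None and cosine(c.relation_vec, effect_vec) >= threshold
    ]
    # Fall back to all entities of the required type if none qualifies.
    pool = priority or fresh
    # Stage 2: maximize similarity between candidate and context representations.
    return max(pool, key=lambda c: cosine(c.repr_vec, context_vec))
```

In this arrangement a related entity that passes the threshold is preferred even when an unrelated entity matches the context more closely, which is one reading of the priority stage described above.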

3.2.3. Scenario Text Generation

Following the generation and completion of the skeletal scenario, two prompts are generated that allow its natural language representation to be obtained from an LLM. The system prompt defines the role and knowledge state of the model. The user prompt specifies the task the model is expected to perform. The generator combines the scenario skeleton $N$, the entity set $E$, and the relation set $R$. The entity and relation sets are serialized into the textual forms $E_{t}$ and $R_{t}$ and concatenated with the system prompt, while the skeletal scenario is concatenated with the user prompt.

System prompt
You are an RPG gamemaster creating scenarios for tabletop RPG games. Base your work strictly on the provided context. Only use the listed entities and connections, and preserve their integrity and logic.
User prompt
Write an RPG scenario, realising the following structure.
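The assembly of the two prompts can be sketched as below. Only the two prompt texts are taken from the article; the serialization layout (bullet lines, the "Entities:" and "Connections:" headings, numbered skeleton steps) and the function name `build_prompts` are assumptions made for illustration, as the exact textual forms $E_{t}$ and $R_{t}$ are not specified here.

```python
SYSTEM_PROMPT = (
    "You are an RPG gamemaster creating scenarios for tabletop RPG games. "
    "Base your work strictly on the provided context. Only use the listed "
    "entities and connections, and preserve their integrity and logic."
)
USER_PROMPT = "Write an RPG scenario, realising the following structure."


def build_prompts(entities, relations, skeleton_steps):
    """Return (system_prompt, user_prompt) for one scenario.

    entities:       iterable of (name, type, description) tuples
    relations:      iterable of (source, target, name, description) tuples
    skeleton_steps: ordered list of completed action sentences
    """
    # Hypothetical serialization of the entity set (E_t).
    entity_text = "\n".join(
        f"- {name} ({etype}): {desc}" for name, etype, desc in entities
    )
    # Hypothetical serialization of the relation set (R_t).
    relation_text = "\n".join(
        f"- {src} -[{name}]-> {dst}: {desc}" for src, dst, name, desc in relations
    )
    system = (
        f"{SYSTEM_PROMPT}\n\nEntities:\n{entity_text}"
        f"\n\nConnections:\n{relation_text}"
    )
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(skeleton_steps, 1))
    user = f"{USER_PROMPT}\n{steps}"
    return system, user
```

Keeping the world knowledge in the system prompt and the skeleton in the user prompt follows the split described above; any other serialization that preserves names, types, and descriptions would serve the same purpose.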

3.3. Experimental Setup

This study aims to assess the quality of the generation variants with a research pipeline consisting of scenario generation under different compendium sizes and entity distribution, number of narrative points, and the language model used. The evaluation of the scenarios is performed under the set of qualitative criteria listed in Table 4.

The set of compendia $C = \{c_{s}, c_{m}, c_{l}\}$ used in the experiments was composed of three compendium variants of increasing scale and complexity: small $c_{s}$, medium $c_{m}$, and large $c_{l}$, with entity and relation counts and distributions shown in Table 3. In all versions, characters are the largest group of entities, reflecting their central role in the generative pipeline.

Scenarios with different numbers of narrative steps were created. Table 3 presents the grouping of user prompts by the number of narrative points, denoted as $p \in P = \{p_{s}, p_{m}, p_{l}\}$.

3.3.1. Scenario Generation

The models presented in Table 5 were selected as generators and evaluators for the experiment based on [58,60]. All selected models employed a decoder-only transformer architecture and fell within a similar parameter range from 4B to 8B parameters to control the size-related effects [58,60]. LLaMA models trained on datasets including literary fiction have demonstrated capability in generating short-form narratives (500–1000 words) with quality approaching human-written texts, with larger variants achieving scores comparable to human authors [60]. Stable Beluga exhibits system-level correlations with human judges approaching inter-human agreement [60], while maintaining high auto-consistency as measured by intraclass correlation coefficients [58]. These models support evaluation of multiple quality dimensions through natural language prompts without additional fine-tuning [58]. The inclusion of Gemma, trained with different instruction-tuning paradigms, was decided upon to increase the robustness of the evaluation. To ensure reproducibility of the experiments, the generation seed was set to correspond with the ordinal number of each scenario within its category.

3.3.2. Scenario Evaluation

The full dataset used in [60] for narrative text generation comprised 384 generated stories. Since the evaluation framework in this study was inspired by that work, maintaining a comparable dataset size was considered beneficial. Therefore, 40 scenarios were generated for each of the $4 \times 3 \times 3 = 36$ unique experimental configurations. Consequently, each prompt-length variant and each compendium-size variant included 480 scenarios, while each model variant included 360.

During the evaluation stage, the generated scenarios were assessed by an automated procedure, relying mostly on LLMs. An additional small survey with human reviewers was conducted to compare the scores (Section 4.2). This approach is inspired by research on overall [61,69] and category-specific [58,60] text quality assessment. The evaluation categories chosen in this research are summarised in Table 4.

Every generated scenario was evaluated twice by each verification model. For each category, a separate prompt was created, requesting an assessment of the scenario on a 0–5 scale. The scale is inspired by the Likert scale [79], expanded with a 0 to indicate an error state. Additionally, each prompt included the scenario content and the respective user prompt.

Evaluation prompt
You are an expert RPG scenario reviewer. Evaluate the following scenario based on the category:
category with descriptions
Provide a score from 0.0 to 5.0 and a one-sentence explanation. Respond ONLY in the following format:
Score: number
Note: brief explanation
The following scenario was generated in response to this user prompt:
Prompt for scenario generation:
Write an RPG scenario, using the following structure:
scenario_prompt
Scenario: scenario
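Since the evaluation prompt asks for a fixed "Score / Note" format, the evaluator's reply can be parsed mechanically. The following is a minimal sketch of such a parser; the function name and the decision to return `None` for missing or out-of-range scores are assumptions, and how malformed replies were actually handled is not stated in the text.

```python
import re

SCORE_RE = re.compile(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)
NOTE_RE = re.compile(r"Note:\s*(.+)", re.IGNORECASE)


def parse_evaluation(response):
    """Extract (score, note) from an evaluator reply.

    Returns None when no score is found or when the score lies outside the
    0-5 scale, leaving the handling of such replies to the caller.
    """
    score_match = SCORE_RE.search(response)
    if score_match is None:
        return None
    score = float(score_match.group(1))
    if not 0.0 <= score <= 5.0:
        return None
    note_match = NOTE_RE.search(response)
    note = note_match.group(1).strip() if note_match else ""
    return score, note
```

For instance, `parse_evaluation("Score: 4.5\nNote: Clear hooks.")` yields the score 4.5 together with the one-sentence note.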

Most of the relevant criteria were selected among those proposed in [58,60]. In the approach proposed in these sources, it is possible to evaluate generated texts across custom quality dimensions, thus allowing the evaluation framework to be extended with domain-specific criteria. Two categories, Interactivity and Structure, specific to the RPG scenarios, were added, as shown in Table 4. The Interactivity category reflects the centrality of player agency in tabletop RPGs. Svan and Wuolo [49] identify player agency as a central factor for the quality of an RPG session, while [25] claims that RPG scenarios form through exchanges of interactions. The Structure category was chosen because RPGs are not only literary but also functional texts. Scenarios and other supporting materials for RPGs contain structured information on the history, geography and culture of the game world, often making them structurally similar to manuals or atlases [48]. Therefore, structural ease of use and understanding of the contained materials could benefit scenario utility, a defining characteristic of RPG systems, shaping how generated content must be structured to remain usable across different playstyles. Ono and Ogata [25] similarly identify player-driven scene development as a core property of RPG scenarios that resists full automation, precisely because meaningful choices cannot be specified in advance. Interactivity therefore operationalizes playability in the sense of GM usability: It assesses whether the generated scenario provides sufficient decision points and narrative affordances to support improvisation and player agency during actual play, rather than presenting a fixed, unidirectional plot.

Additionally, Table 4 lists all of the operational definitions, structured as questions and passed to the evaluation prompt. The purpose of adding these questions was to expand the set of category names and provide more context for the evaluation models, similarly to [58]. The resulting scores represent the degree to which the evaluator’s answer to the question is positive on the modified Likert scale.

Determining the judgment anchors was left to the LLM interpretation because, according to [60], explicitly providing the scoring anchors in the prompt results in worse correlation with human responses.

To assess the quality of the generated scenarios in comparison with other creative texts, three widely used evaluation metrics were applied: Bilingual Evaluation Understudy (BLEU) [62], Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [63], and Bidirectional Encoder Representations from Transformers (BERTScore) [67].

While BLEU and ROUGE were included for completeness, their applicability in this study is limited. These metrics were originally designed for evaluating machine translation and summarization, respectively, rather than creatively generated narratives based on structural templates, and their correlation with human judgments of scenario quality is weak [58]. In contrast, the BERT-based metric provides a more reliable measure in this context, as it evaluates semantic similarity between texts rather than surface-level overlap.

4. Results

The set of generated scenarios, denoted by $\Sigma$, was created from the set of parameters $\Xi$. The set was defined as the Cartesian product of the compendium variants $C$, $|C| = 3$, prompt-length variants $P$, $|P| = 3$, LLM variants $L$, $|L| = 4$, and Random Number Generator (RNG) seed variants $Z$, $|Z| = 40$. Each seed in $Z$ was equal to the index of the corresponding scenario, from 1 to 40. Different RNG values were used to simulate the individual preferences of human evaluators. As calculated in Equation (3), 1440 scenarios were generated. Note that the parameter set $\Xi$ was also used in the ablation study.

$$|\Xi| = |C \times P \times L \times Z| = 1440$$

(3)

Let $R$ be the set of criteria, $Z' = \{1, 2\}$ the set of the RNG seeds used for scenario evaluation, and $\Xi'$, given by Equation (4), the set of single evaluations. Multiple values of $Z$ and $Z'$ were used to ensure diversification of the results.

$$\Xi' = \Xi \times R \times L \times Z'$$

(4)

Let the evaluation outcomes be defined by Equation (5)

$$o_{\xi'} = o(\xi')$$

(5)

where $\xi' \in \Xi'$, and let the average score for a given criterion $r \in R$ be defined by Equation (6).

$$\bar{o}_{\xi, r} = \frac{1}{|Z \times L \times Z'|} \sum_{\xi' \in \Xi' \,\mid\, \xi'_{\xi} = \xi,\ \xi'_{r} = r} o_{\xi'}$$

(6)
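The averaging in Equation (6) amounts to grouping single evaluations by variant and criterion and taking the mean of each group. A minimal sketch is given below; the function name and the flat `(variant, criterion, score)` record layout are assumptions, with `variant` standing for whatever key the scores are grouped by (for example a compendium, prompt, and generator triple).

```python
from collections import defaultdict


def mean_scores(evaluations):
    """Average single evaluation outcomes per (variant, criterion) pair.

    evaluations: iterable of (variant, criterion, score) records, one per
                 single evaluation.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for variant, criterion, score in evaluations:
        sums[(variant, criterion)] += score
        counts[(variant, criterion)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

When every group contains the same number of evaluations, as in a full factorial design, dividing by the group size matches the fixed normalization factor in Equation (6).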

As each scenario was evaluated twice by each verifier model, a total of 11,520 evaluation sets were generated for the non-ablation and ablation variants of the experimental pipeline.
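The counts reported above follow directly from the factorial design and can be checked by enumerating it. In the sketch below the variable names are illustrative; the four model names are those mentioned in the text, and the same four models are assumed to act as evaluators.

```python
from itertools import product

compendia = ["c_s", "c_m", "c_l"]
prompts = ["p_s", "p_m", "p_l"]
models = ["Mistral", "LLaMA", "Gemma", "Stable Beluga"]
generation_seeds = range(1, 41)  # Z, one seed per scenario index
evaluation_seeds = [1, 2]        # Z'

# Every generated scenario is one element of C x P x L x Z (Equation (3)).
scenarios = list(product(compendia, prompts, models, generation_seeds))
n_configurations = len(compendia) * len(prompts) * len(models)

# Each scenario is evaluated twice by each of the four verifier models.
n_evaluation_sets = len(scenarios) * len(models) * len(evaluation_seeds)

per_prompt = sum(1 for s in scenarios if s[1] == "p_s")
per_model = sum(1 for s in scenarios if s[2] == "Mistral")

print(n_configurations, len(scenarios), n_evaluation_sets, per_prompt, per_model)
```

The enumeration gives 36 configurations, 1440 scenarios, and 11,520 evaluation sets, with 480 scenarios per prompt-length or compendium variant and 360 per model.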

Moreover, an additional survey on a small group of eight human respondents was conducted. The main goal of the survey was to determine whether the patterns in verifier model preferences are similar to human preferences. It is acknowledged that this sample size was not large enough to wholly represent the population of GMs; therefore, this human-based study should be treated as preliminary in nature. The age of the respondents ranged from 24 to 39 years. Among the respondents, the average experience of actively running RPGs as a GM was 11.5 years, with a standard deviation of 6.37 years and a range from 4 to 22 years. In total, they evaluated 44 scenarios, covering all of the compendium, prompt and model variants. Each respondent was asked to evaluate sets consisting of 4 scenarios generated using the same prompt, compendium and seed. The order of the scenarios presented within each set was randomized. The seed was equal to either 1 or 2 and was constant within each of the evaluated sets. Therefore, the only variable in each set was the generative model used for each of the scenarios. The human judges were instructed to reflect in their evaluations the degree to which they would answer positively to the respective questions from Table 4 on a modified Likert scale from 0 to 5 in six categories. They were instructed that 0 was reserved for potential erroneous generations. In practice, none of the respondents gave a score lower than 1 to any scenario in any category. The results of the survey are further discussed in Section 4.2.

The values of $\bar{o}_{\xi, r}$ are presented in Figure 1. Prompt $p_{s}$ achieved the highest averages in five of the six categories, the exception being Complexity. The overall means were equal to 3.864 for $p_{s}$, 3.717 for $p_{m}$ and 3.682 for $p_{l}$. The quality drop between the $p_{s}$ and $p_{l}$ variants is 0.182 points (3.64%). Categories most affected by the prompt lengthening were the Coherence (−0.15), (−0.065) and Complexity (−0.009) were relatively resistant to this negative impact. Among the chosen compendia, $c_{m}$ achieved the best results in Coherence, $c_{s}$ in Relevance and $c_{l}$ in the other categories.

Model-wise, Mistral achieved the highest overall quality, scoring highest in Relevance, Coherence, and Informativeness. Out of all the models, it is the least sensitive to parameter changes. The overall mean of LLaMA was close to Mistral’s. It was observed that LLaMA specializes in Interactivity, with an advantage of 0.119–0.342 points over the other models. The lowest-scoring model in all categories was Stable Beluga.

It was observed that variants with high Coherence often also had a high Structure score. A similar pattern was discovered between Relevance and Informativeness and, to some degree, between Interactivity and Complexity. Sensitivity to compendium size changes tends to increase with prompt length. Among the chosen variants, Stable Beluga proved to be the least resistant to prompt extension, and Gemma the most.

4.1. Ablation

To verify the validity of the algorithmic selection of elements, an ablation study was conducted. Instead of embedding-based selection, entities, motivations, and strategies were randomly chosen with equal probability.
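The contrast between the two selection modes can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the toy three-dimensional vectors stand in for real text embeddings, and the names `cosine`, `select_by_embedding` and `select_randomly` are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_by_embedding(query_vec, candidates):
    """Baseline mode: pick the candidate whose vector is closest to the query."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))

def select_randomly(candidates, rng):
    """Ablation mode: pick any candidate with equal probability."""
    return rng.choice(sorted(candidates))

# Toy vectors, assumed for illustration only; a real pipeline would use an embedding model.
motivations = {
    "Serenity": [0.9, 0.1, 0.0],
    "Knowledge": [0.1, 0.9, 0.2],
    "Wealth": [0.0, 0.2, 0.9],
}
herbalist = [0.8, 0.3, 0.1]  # hypothetical embedding of a "quiet herbalist" entity

baseline_choice = select_by_embedding(herbalist, motivations)
ablation_choice = select_randomly(motivations, random.Random(1))
print(baseline_choice, ablation_choice)
```

In the baseline mode the choice is fully determined by the vectors, whereas in the ablation mode it depends only on the random seed, which is the difference the ablation is meant to isolate.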

The ablation results are presented in Figure 2. For each evaluation category, a decrease in the mean score, averaged across all assessed variants, was observed relative to the baseline pipeline. Notably, Relevance and Coherence benefited most from algorithmic element selection, with mean improvements of 0.092 and 0.103 points, respectively. The smallest effects were recorded for Complexity (0.023) and Interactivity (0.043). Moderate differences were observed for Structure (0.065) and Informativeness (0.068).

The mean scores for individual variants are, in most cases, higher for the algorithmic approach. For Relevance, Informativeness, and Interactivity, the baseline pipeline achieved better results in 86.11% of variants; for Coherence, in 88.89%; for Complexity, in 66.67%; and for Structure, in 83.33%. Among the variants that achieved higher scores for the ablation than the baseline, 43.24% were variants using the Mistral model, 35.14% used Gemma, 13.51% used Stable Beluga, and 8.11% used LLaMA.
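The reported percentages are consistent with a grid of 36 variants, assuming three prompt lengths, three compendium sizes and four models. That count, and the per-model counts mentioned after the code, are inferred from the percentages rather than stated in this section. The short check below reconstructs the figures under that assumption.

```python
from fractions import Fraction

VARIANTS = 3 * 3 * 4  # assumed grid: prompts x compendia x models

# Share of variants (in %) in which the baseline beat the ablation, as reported in the text.
baseline_share = {
    "Relevance": 86.11, "Informativeness": 86.11, "Interactivity": 86.11,
    "Coherence": 88.89, "Complexity": 66.67, "Structure": 83.33,
}

# Convert each percentage back to a whole number of baseline wins.
baseline_wins = {c: round(p / 100 * VARIANTS) for c, p in baseline_share.items()}
ablation_wins = {c: VARIANTS - w for c, w in baseline_wins.items()}
total_ablation_wins = sum(ablation_wins.values())

def pct(count, total):
    """Percentage rounded to two decimals, as in the text."""
    return round(float(Fraction(count, total)) * 100, 2)

print(baseline_wins)
print(total_ablation_wins)
```

Under this assumption the ablation scored higher in 37 category-variant pairs, and the per-model shares of 43.24%, 35.14%, 13.51% and 8.11% correspond to 16, 13, 5 and 3 of those pairs.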

In Relevance, the largest increase related to algorithmic selection was observed for $(p_{l}, c_{m}, \text{Mistral})$, amounting to 0.287 points. In Coherence, the largest increase was achieved by the variant $(p_{l}, c_{m}, \text{Stable Beluga})$, with 0.296 points. In Complexity, Informativeness, and Interactivity, $(p_{l}, c_{m}, \text{LLaMA})$ achieved the largest increases: 0.15, 0.165, and 0.159, respectively. In Structure, the $(p_{l}, c_{m})$ configuration also obtained the greatest benefit from algorithmic selection, although with the Mistral model.

Based on the ablation results, it can be concluded that employing the standard research pipeline is usually justified. The deterministic method exerts a particularly strong effect on Relevance and Coherence. Variants with a long prompt and a medium-sized compendium consistently show the greatest benefits. From this evidence, it can be inferred that the integrity of a scenario depends on the design of its skeleton. The magnitude of the method’s positive effect depends on the specific language model used.

4.2. Self-Bias and Human Evaluation

The scoring of the generator LLMs by the verifier LLMs was analysed to detect any scoring patterns indicating the presence of self-bias. In Table 6, the highest average score for each category and verifier is shown in bold and the lowest in italics. Several trends were observed. The Gemma model scored its own generations the lowest in all of the categories, despite receiving relatively high scores from most of the other LLMs. The LLaMA model rated Gemma’s scenarios especially favourably, giving them the highest average score in all categories. The Mistral model showed a preference for its own scenarios in Relevance, Coherence, Complexity and Informativeness, rating them highest on average. These scores might indicate some level of self-bias; however, the Mistral model was also, on average, rated the highest by Gemma in all categories. Moreover, Stable Beluga gave Mistral the highest average score in Structure and the second highest in Relevance. While no indisputable evidence of self-bias was found, it is apparent that some scoring preferences exist among the various models, likely resulting from differences in data sets and architectures.
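A check of this kind can be sketched as a verifier-by-generator table of mean scores, in which self-bias would show up as a verifier ranking its own generations first. The numbers below are invented placeholders, not values from Table 6, and `self_bias_flags` is a hypothetical helper.

```python
# Rows: verifier model; columns: generator model.
# Placeholder scores for illustration only, NOT taken from Table 6.
mean_scores = {
    "Gemma":   {"Gemma": 3.6, "LLaMA": 3.8, "Mistral": 4.1},
    "LLaMA":   {"Gemma": 4.2, "LLaMA": 4.0, "Mistral": 4.1},
    "Mistral": {"Gemma": 3.9, "LLaMA": 3.8, "Mistral": 4.3},
}

def self_bias_flags(table):
    """For each verifier, report whether it rates its own generations highest
    and how far its self-score lies from the mean score it gives to others."""
    flags = {}
    for verifier, row in table.items():
        others = [score for gen, score in row.items() if gen != verifier]
        flags[verifier] = {
            "prefers_self": row[verifier] > max(others),
            "self_minus_others": round(row[verifier] - sum(others) / len(others), 3),
        }
    return flags

flags = self_bias_flags(mean_scores)
for verifier, info in flags.items():
    print(verifier, info)
```

A positive flag alone is not conclusive: a model that is also rated highest by other verifiers may simply produce better output, so each row should be read together with the corresponding column.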

In the human-based evaluation, the scores granted by humans were generally lower than the LLM-based ones, by 0.95 points on average. The LLM-based scores were most overstated in Relevance and Informativeness, by 1.32 and 1.21 points, respectively. The closest agreement was found in Coherence and Complexity, with average differences of 0.68 and 0.42 points. Despite the disparity, human judgment shares some similarities with the LLM-based one. In most categories, the highest-scored models were Gemma and Mistral, with average scores of 3.08 and 2.96 across all categories, while LLaMA and Stable Beluga scored 2.52 and 2.65. Mistral was among the highest-rated models in both the human-based and LLM-based evaluations, while Stable Beluga was rated relatively low.

4.3. Statistical Significance

The statistical analysis in Table 7, Table 8 and Table 9 shows that significant effects were present in all criteria except Complexity, indicating that the generator model (l) and instruction length (p) affect scenario quality. Compendium size (c) did not produce significant effects in any criterion. Across Relevance, Coherence, Informativeness, Interactivity, and Structure, the dominant pattern was the inferior performance of Stable Beluga relative to the other three models. The largest between-model contrasts reached moderate effect sizes, for example, in Structure and Interactivity, where some differences exceeded 0.3 standard deviations. In criteria with weaker effects, such as Relevance and Informativeness, model comparisons still showed significant differences but with smaller magnitudes. Instruction length exerted a similarly strong influence across five of the six criteria. Short prompts $p_{s}$ outperformed medium and long ones, with effect sizes reaching $d = 0.34$ in Structure and Interactivity and $d = 0.44$ in Coherence. Across all criteria where effects were present, longer prompts reduced clarity, organization, and interactivity, suggesting that more concise instructions enable models to generate more coherent scenarios. The only criterion that did not show meaningful differences was Complexity. Taken together, the results indicate that model choice and prompt length are the primary determinants of scenario quality, with Stable Beluga consistently underperforming and short prompts yielding the best outcomes across nearly all evaluated dimensions.
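Effect sizes expressed in standard deviations are commonly computed as Cohen's $d$ with a pooled standard deviation. The sketch below assumes that definition, since the exact estimator used in the study is not stated in this section, and it uses invented scores rather than data from the experiments.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Invented example scores on the 0-5 scale, NOT data from the study.
short_prompt = [4, 4, 3, 5, 4, 3]
long_prompt = [3, 4, 3, 4, 3, 3]

d = cohens_d(short_prompt, long_prompt)
print(round(d, 2))
```

A positive value means the first group scored higher on average; swapping the arguments flips the sign but not the magnitude.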

4.4. Popular Metrics Analysis

To enhance comparability, the generated scenarios were also evaluated using popular text evaluation metrics. The selected metrics include the Bilingual Evaluation Understudy (BLEU) [62], the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [63], and the Bidirectional Encoder Representations from Transformers score (BERTScore) [67]. BLEU and ROUGE measure the degree of overlap of word sequences between the generated output and a reference text, while BERTScore measures the semantic overlap relative to the reference text. In the context of this work, the reference is the skeletal form of the scenario. For BLEU, the minimum possible value is 0 and the maximum is 100, where 100 indicates that the generated output and the reference are identical. The resulting scores for ROUGE and BERTScore range from 0 to 1, with higher values meaning more word overlap for ROUGE and stronger semantic similarity for BERTScore.
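To make the lexical-overlap idea concrete, the sketch below computes two simplified quantities using only the standard library: clipped unigram precision, which is the building block of BLEU, and ROUGE-1 recall. It is not a full BLEU or ROUGE implementation (there are no higher-order n-grams, brevity penalty, or stemming), and the example strings are invented.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()

def clipped_overlap(candidate, reference):
    """Number of candidate unigrams also found in the reference, with counts clipped."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    return sum(min(count, ref[token]) for token, count in cand.items())

def unigram_precision(candidate, reference):
    """BLEU-style modified unigram precision: overlap / candidate length."""
    total = len(tokenize(candidate))
    return clipped_overlap(candidate, reference) / total if total else 0.0

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: overlap / reference length."""
    total = len(tokenize(reference))
    return clipped_overlap(candidate, reference) / total if total else 0.0

# Invented strings: the skeleton plays the role of the reference text.
skeleton = "elandra asks the players to find maric in bavronne"
scenario = "the players travel to bavronne and find maric the bard"

print(unigram_precision(scenario, skeleton), rouge1_recall(scenario, skeleton))
```

A scenario that paraphrases its skeleton heavily scores low on both quantities even when the meaning is preserved, which is why an embedding-based metric such as BERTScore is reported alongside them.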

In the context of the present study, BLEU and ROUGE are not particularly reliable. These metrics were primarily designed for the evaluation of machine translation (BLEU) and automatic summarization (ROUGE) rather than creative text generation. Their correlation with human evaluation of creative text is weak [58]. Nevertheless, due to their popularity, an analysis of the generated scenarios using these metrics was conducted.

A more suitable metric for the present task is BERTScore, as its ability to assess the semantic meaning of words allows for the determination of semantic similarity between two texts.

Table 10 presents the evaluation metrics for the different generation variants. BLEU scores are generally low, with the highest obtained by Stable Beluga and Mistral. ROUGE metrics follow a similar trend: among the models, Mistral achieves the best results on all ROUGE measures. BERTScore indicates that Mistral has the strongest semantic similarity, while LLaMA has the weakest. BLEU scores among the compendia are generally low, although higher for $c_{l}$, while the ROUGE scores do not show significant distinctions in terms of lexical coverage. BERTScore indicates only a minor influence of the compendium on semantic similarity. Variants with $p_{l}$ improve lexical coverage, achieving the highest BLEU and ROUGE scores, while shorter prompts show slightly higher semantic alignment, as indicated by BERTScore precision.

5. Discussion

The results indicate that the quality of the generated RPG scenarios depends on the number of steps in the prompt and the choice of LLM. The size of the knowledge compendium is less influential. The best outcomes were achieved for variants with shorter prompts, suggesting that more compact narrative structures yield more coherent and relevant scenarios. Variants obtained the highest scores in Relevance and the lowest in Complexity, which may reflect the limitations of contemporary LLMs in producing complicated narratives. The finding that shorter prompts ($p_{s}$) yield higher-quality results than longer, more detailed prompts ($p_{l}$) might initially seem counter-intuitive, especially when compared to complex analytical tasks where techniques like Chain-of-Thought reasoning thrive on expanded context. However, the root of this phenomenon in narrative generation lies not in context window noise or a suppression of creative capacity, but rather in the model’s struggle to maintain logical consequence and its lack of awareness regarding the relationships between consecutively joined elements. When a prompt strictly dictates a dense sequence of specific narrative constraints or beats ($p_{l}$), the LLM attempts to satisfy all conditions but sometimes fails to establish coherent causal links and transitions between them. Conversely, shorter prompts ($p_{s}$) provide a manageable structural skeleton, allowing the model to leverage its statistical language capabilities to naturally bridge the gaps and maintain a cohesive narrative flow.

A qualitative comparison illustrates this clearly. As shown in Appendix B, a failure case generated with a long prompt ($p_{l}$) is characterized by reduced narrative coherence: the model forces location changes without explicit transitions and assumes connections that were never established, simply to fulfill the prompt’s extensive requirements. In contrast, a success case generated with a short prompt ($p_{s}$), presented in Appendix A, demonstrates a noticeable, logical structure and maintains a clear, player-centric action.

Although there were evident preferences and trends in the ways the individual models judged the variants, they did not manifest any indisputable self-bias tendencies. It was also found that the human respondents tended to give the generated scenarios lower scores than their LLM counterparts. An exceptional disparity between the LLM and human judges’ scores was noticed in Relevance and Informativeness. Perhaps the selected LLMs struggle to determine the purpose of an RPG scenario as a functional text. Understanding the precise source of this problem through research could improve the quality of the generated scenarios.

Low BLEU and ROUGE scores might not imply poor scenario quality but rather a desirable departure from the schematic structure of the input. A high BERTScore demonstrates that the semantic content of the skeleton scenario is preserved despite the use of different word sequences.

5.1. Limitations of LLM-Based Evaluation

The selected models were chosen with the understanding that other size variants might yield different results. Although examining additional modern models such as Gemini 2.5 Flash [80] or GPT-4 [81] could be beneficial, the large number of available architectures necessitated restricting the analysis to a curated subset.

It is acknowledged that LLM-based evaluation does not perfectly reflect human preferences. However, to an extent, it might capture the most noticeable trends in human judgment, such as the preference for a particular generator model. Further research on similarities between human-based and LLM-based judgments of creative texts could help to improve this technique and its precision. Moreover, it could be beneficial to gather more human-based evaluations to achieve a better understanding of the similarities and differences between the human-based evaluation and the LLM-based evaluation.

The authors recognize that an additional study using other game world compendia could be advantageous, but it would significantly broaden the scope of the research. In particular, studying worlds set in genres other than fantasy might be beneficial in further research.

5.2. Design Guidelines and Future Directions

As research on this topic progresses, a set of actionable design guidelines for developers of LLM-based narrative generation tools could be formulated:

G1: Minimise prompt complexity. Concise, structured prompts should be preferred over elaborate, open-ended instructions, as excessive complexity was found to negatively affect coherence, structure, and interactivity.

G2: Externalise world knowledge into a structured compendium. Generation should be grounded in an explicit, machine-readable representation of the game world rather than relying solely on the parametric memory of a language model. Even a compact knowledge structure was found sufficient to improve narrative consistency. For larger compendia, retrieval-based or staged injection strategies should be considered to avoid knowledge overload.

G3: Apply semantic matching when selecting narrative elements. Embedding-based selection of entities and strategies should be preferred over random sampling, as it was found to consistently produce more coherent and relevant outputs.

G4: Align model selection with target quality criteria. Different language models exhibit distinct strengths across quality dimensions; model selection should therefore reflect the quality criteria most relevant to the intended use case. In particular, human-in-the-loop validation is recommended when high interactivity is a priority.

G5: Complement automated evaluation with human judgment. Automated metrics were found to systematically overestimate scenario quality relative to human judges and should be treated as indicative rather than definitive.
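As an illustration of G2, the sketch below shows one possible shape for a machine-readable compendium together with a simple keyword-based retrieval step that injects only the entries mentioned in a scenario skeleton. The `Entry` structure, the `retrieve` and `build_prompt` helpers, and the prompt layout are hypothetical and do not describe the pipeline used in the study; the entry names and descriptions are borrowed from the appendix examples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    name: str
    kind: str          # e.g. "character" or "location"
    description: str

# Hypothetical compendium fragment based on the appendix examples.
COMPENDIUM = [
    Entry("Elandra", "character", "A quiet herbalist."),
    Entry("Maric", "character", "A ragged bard who sings of ancient tragedies."),
    Entry("Torren", "character", "A reclusive carpenter."),
    Entry("Bavronne", "location", "A village cloaked in mist."),
]

def retrieve(skeleton, compendium):
    """Keep only entries whose name is mentioned in the scenario skeleton."""
    text = skeleton.lower()
    return [entry for entry in compendium if entry.name.lower() in text]

def build_prompt(skeleton, compendium):
    """Inject just the retrieved entries ahead of the skeleton."""
    facts = "\n".join(
        f"- {e.name} ({e.kind}): {e.description}" for e in retrieve(skeleton, compendium)
    )
    return f"World facts:\n{facts}\n\nSkeleton:\n{skeleton}"

prompt = build_prompt("Elandra sends the players to Bavronne.", COMPENDIUM)
print(prompt)
```

Name matching is the simplest possible retrieval rule; an embedding-based lookup, in line with G3, would be a natural replacement when entries are referred to indirectly.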

Several directions for future research follow naturally from these guidelines. The generalisability of G1 and G4 could be assessed by extending the pipeline to incorporate larger models such as GPT-4 or Gemini 2.5 Flash. The scope of G2 could be broadened by expanding the compendium to cover alternative genres such as science fiction, horror, or historical settings. The validity of G5 would benefit from a larger and more diverse respondent pool, providing a more robust basis for validating automated evaluation metrics in narrative domains. Additionally, the construction of a GM-oriented usability metric capturing functional dimensions absent from current automated measures, such as the preparation effort required for a scenario, remains an important direction for future work.

6. Conclusions

This paper aimed to investigate the potential for LLM-based RPG scenario generation and factors affecting the quality of the results. A dedicated generative pipeline was designed, implemented and used to create a dataset.

The study found that both the number of steps in the prompt and the choice of an LLM significantly affect the quality of generated scenarios. More steps in the instruction prompt generally lead to worse results across criteria. This suggests that concise prompts produce a more effective basis for scenario generation than elaborate, open-ended instructions. LLMs differ inherently, with some of them excelling in particular scoring categories. Metrics based on lexical similarity provided low values, while semantic metrics indicated a high degree of meaning similarity between the scenario and the prompt, confirming that generated scenarios preserve the semantic content of the input skeleton despite surface-level differences. Furthermore, the results indicate that knowledge-grounded generation improves scenario consistency, and that even a compact compendium proved sufficient to achieve this effect, as compendium size did not produce statistically significant differences across evaluation criteria. This suggests that the overhead of maintaining a formalized world compendium need not be prohibitive for practical adoption. Ablation analysis confirmed the validity of using the baseline pipeline in comparison to a version with random choices instead of semantic similarity-based choices.

It was found that human judges generally rate the generated scenarios lower than the LLM judges. This finding highlights a methodological risk in relying solely on automated evaluation for creative, functional texts, and underscores the importance of incorporating human judgment in the assessment of generative tools for narrative domains. Nevertheless, LLM judges displayed similar preferences to human judges regarding the generative model, suggesting that automated evaluation may capture certain qualitative trends even when absolute scores diverge. Further research into the similarities and differences between human and machine ratings would be beneficial. The modular architecture of the proposed pipeline further enables its potential adaptation to scenario generation tasks beyond fantasy RPGs, including educational simulations, training scenarios, and interactive storytelling applications.

Author Contributions

Conceptualization, W.O. and D.P.; methodology, W.O. and D.P.; software, W.O.; validation, W.O.; formal analysis, W.O.; investigation, W.O. and D.P.; resources, W.O. and D.P.; data curation, W.O. and D.P.; writing—original draft preparation, D.P., W.O. and J.W.; writing—review and editing, D.P., W.O. and J.W.; visualization, W.O.; supervision, D.P.; project administration, D.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki. This study was exempt from ethical review and approval due to its non-interventional nature and the complete anonymity of the human participants, in accordance with the internal regulations of the Silesian University of Technology (Rector’s Regulation No. 179/2025). According to § 3(2) of the Regulation, ethical approval is required only for studies involving vulnerable populations, deceptive procedures, sensitive topics, or interventions affecting participants’ behavior or well-being. The present study, based on an anonymous survey conducted with independent adult participants and involving no intervention or deception, does not fall within these categories.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The source code used in this paper is openly available at: https://github.com/Wojowc/Knowledge-Driven-Generative-Design-of-Role-Playing-Game-Scenarios.git (accessed on 15 February 2026).

Acknowledgments

This publication was supported by the Department of Computer Graphics, Vision, and Digital Systems, Silesian University of Technology (Gliwice, Poland) under the statutory research projects (Rau6, 2026). The authors acknowledge the Minotaur Fantasy Society, registered in Poland under REGON 54366942 (national statistical identification number), for its voluntary expert evaluation of the generated scenarios.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RPG	Role-Playing Game
cRPG	Computer Role-Playing Game
NPC	Non-Player Character
LLM	Large Language Model
PCG	Procedural Content Generation
GM	Game Master
GenAI	Generative Artificial Intelligence
GPT	Generative Pre-trained Transformers
NLG	Natural Language Generation
AI	Artificial Intelligence

Appendix A

Fragment of a scenario generated with the Gemma model, $p_{s}$, $c_{s}$ and seed 1. It demonstrates a clear structure that differs from, for example, that of a novel, and keeps the action player-centric.

Scenario Title: Whispers of Bavronne

Setting: The scenario takes place in a region with a small, isolated mountain range dominated by the village of Bavronne.

Quest Giver: Elandra—A quiet herbalist, driven by a need for Serenity. She believes understanding the echoes of the past holds the key to finding inner peace.

Initial Situation: Elandra approaches the players, her face etched with a quiet urgency. “I’ve been experiencing…disturbances. A restlessness. I believe the answers I seek lie within the stories of the past, particularly those held by those who dwell on the edges of this land. I’ve heard whispers of a bard named Maric, who sings of ancient tragedies and forgotten kings. He resides in Bavronne, a village shrouded in mist and suspicion. I need you to investigate his current state—is he still singing? Is his melancholy a sign of something deeper?” She hands you a small, dried sprig of lavender, a gesture of calming intent. “This is for your journey. May it bring you some measure of Serenity.”

Phase 1: To Bavronne

Travel: The journey to Bavronne is arduous, involving a trek through dense forest. The players will need to make Survival checks to navigate the terrain and avoid getting lost.

Arrival: The players arrive at Bavronne. The village is immediately striking—the air is damp, and a perpetual mist clings to the buildings. The village is noticeably quiet, with a sense of guardedness.

Phase 2: Listening to Maric

Finding Maric: The players locate Maric, who is sitting on a weathered stone near the village square. He’s playing a mournful tune on his flute, his eyes fixed on the ground. (…)

Appendix B

Fragment of a scenario generated with the Mistral model using the longer prompt configuration $p_{l}$, large compendium $c_{l}$, and seed 1. It illustrates a failure case characterized by reduced narrative coherence: changes of location occur without explicit transitions, some narrative points assume connections that are not explicitly established, and certain actions appear unnecessary due to earlier narrative developments.

Title: Whispers from the Woods: The Missing Chisel

Setting: The remote village of Bavronne, nestled in the heart of the Whispering Woods.

Quest Giver: Elandra, a quiet herbalist known for her knowledge of rare plants and ancient lore. She is motivated by Serenity, seeking to maintain harmony within the village and uncover any potential threats or mysteries that may lurk in the woods.

Objective 1: Elandra requests the assistance of the adventurers. She explains that Torren, the reclusive carpenter, has lost his prized chisel, a tool passed down through generations and essential for his craft. The loss of the chisel could disrupt the harmony of Bavronne if it falls into the wrong hands.

Objective 2: The adventurers must travel to Bavronne, a village cloaked in mist and suspicion of outsiders, located on the eastern side of the Market square. There they will find Torren’s humble abode on the northern outskirts of the village.

Quest Giver: Maric, a ragged bard who sings of ancient tragedies and forgotten kings, is motivated by Serenity, as he seeks to preserve the stories and traditions of Bavronne for future generations.

Objective 3: Once in Bavronne, the adventurers must seek out Maric, who can often be found at the Market during the day, sharing tales and playing his flute for coin. If they manage to speak with him, he will ask them to retrieve Torren’s missing chisel.

Objective 4: The adventurers must then return to Torren’s home in Bavronne and search for the lost chisel. Upon finding it, they should bring it back to Maric at the Market.

Quest Giver: Maric (again) is now motivated by Knowledge, as he seeks to learn more about the circumstances surrounding Torren’s loss of his chisel.

Objective 5: After giving Torren’s Chisel to Maric, the adventurers should listen to any additional information or insights he may share regarding the chisel’s disappearance.

Objective 6: The adventurers must then return to the Market in Bavronne and seek out Elandra, who frequents the market at dawn to trade her herbs and tinctures. If they manage to speak with her, she may share valuable information about recent events in the Whispering Woods or provide clues as to who might have taken Torren’s chisel.

Quest Giver: Elandra (again) is now motivated by Knowledge, as she seeks to uncover any potential threats or mysteries that may be affecting her beloved woods.

Objective 7: The adventurers must report their findings and any new information they’ve gathered back to Elandra. Depending on their actions throughout the scenario, different outcomes may occur, ranging from resolving the mystery of Torren’s missing chisel to uncovering a larger plot affecting Bavronne and the Whispering Woods.

References

Arenas, D.L.; Viduani, A.; Araujo, R.B. Therapeutic use of role-playing game (RPG) in mental health: A scoping review. Simul. Gaming 2022, 53, 285–311.
Yuliawati, L.; Wardhani, P.A.P.; Ng, J.H. A scoping review of tabletop role-playing game (TTRPG) as psychological intervention: Potential benefits and future directions. Psychol. Res. Behav. Manag. 2024, 17, 2885–2903.
Hammer, J.; To, A.; Schrier, K.; Bowman, S.L.; Kaufman, G. Learning and role-playing games. In Role-Playing Game Studies; Routledge: New York, NY, USA, 2018; pp. 283–299.
Katō, K. Employing tabletop role-playing games (TRPGs) in social communication support measures for children and youth with autism spectrum disorder (ASD) in Japan: A hands-on report on the use of leisure activities. Jpn. J. Analog. Role-Play. Game Stud. 2019, 23–28.
Merrick, A.; Li, W.W.; Miller, D.J. A study on the efficacy of the tabletop roleplaying game Dungeons & Dragons for improving mental health and self-concepts in a community sample. Games Health J. 2024, 13, 128–133.
Williams, J.P.; Kirschner, D.; Deterding, S. Sociology and role-playing games. In The Routledge Handbook of Role-Playing Game Studies; Routledge: New York, NY, USA, 2024; pp. 243–260.
Henning, G.; de Oliveira, R.R.; de Andrade, M.T.P.; Gallo, R.V.; Benevides, R.R.; Gomes, R.A.F.; Fukue, L.E.K.; Lima, A.V.; de Oliveira, M.B.B.Z.; de Oliveira, D.A.M.; et al. Social skills training with a tabletop role-playing game, before and during the pandemic of 2020: In-person and online group sessions. Front. Psychiatry 2024, 14, 1276757.
Finch, M. Tome of Adventure Design (Revised); Mythmere Games: Katy, TX, USA, 2022.
Sholtis, J. The Dungeon Dozen; Hydra Collective (brand name: Hydra Cooperative): Austin, TX, USA, 2014.
Robbins, B. Donjon Random Inn Generator. 2009. Available online: https://donjon.bin.sh/fantasy/inn/ (accessed on 15 February 2026).
Cros.land. D&D 5e Magic Item Generator. 2018. Available online: https://cros.land/dnd-5e-magic-item-generator/ (accessed on 15 February 2026).
Kassoon. D&D NPC Generator. 2016. Available online: https://www.kassoon.com/dnd/npc-generator/ (accessed on 15 February 2026).
Azgaar. Fantasy Map Generator. 2017. Available online: https://azgaar.github.io/Fantasy-Map-Generator/ (accessed on 15 February 2026).
Hendrikx, M.; Meijer, S.; Van Der Velden, J.; Iosup, A. Procedural content generation for games: A survey. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2013, 9, 1–22.
Gallotta, R.; Todd, G.; Zammit, M.; Earle, S.; Liapis, A.; Togelius, J.; Yannakakis, G.N. Large language models and games: A survey and roadmap. IEEE Trans. Games 2024. early access.
Maleki, M.F.; Zhao, R. Procedural content generation in games: A survey with insights on emerging llm integration. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Lexington, KY, USA, 18–22 November 2024; Volume 20, pp. 167–178.
Young, R.M.; Ware, S.G.; Cassell, B.A.; Robertson, J. Plans and planning in narrative generation: A review of plan-based approaches to the generation of story, discourse and interactivity in narratives. Sprache Und Datenverarbeitung Spec. Issue Form. Comput. Model. Narrat. 2013, 37, 41–64.
Hafis, M.; Tolle, H.; Supianto, A.A. A literature review of empirical evidence on procedural content generation in game-related implementation. J. Inf. Technol. Comput. Sci. 2019, 4, 308–328.
Yang, D.; Kleinman, E.; Harteveld, C. GPT for games: A scoping review (2020-2023). In Proceedings of the 2024 IEEE Conference on Games (CoG); IEEE: New York, NY, USA, 2024; pp. 1–8.
Yang, D.; Kleinman, E.; Harteveld, C. GPT for Games: An Updated Scoping Review (2020–2024). IEEE Trans. Games 2025. early access.
Petroni, F.; Rocktäschel, T.; Riedel, S.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2463–2473.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
Yu, W.; Zhu, C.; Li, Z.; Hu, Z.; Wang, Q.; Ji, H.; Jiang, M. A survey of knowledge-enhanced text generation. ACM Comput. Surv. 2022, 54, 1–38.
Guan, J.; Feng, Z.; Chen, Y.; He, R.; Mao, X.; Fan, C.; Huang, M. LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation. Trans. Assoc. Comput. Linguist. 2022, 10, 434–451.
Ono, J.; Ogata, T. A design plan of a game system including an automatic narrative generation mechanism: The entire structure and the world settings. J. Robot. Netw. Artif. Life 2016, 2, 243–246.
Ryan, J. Curating Simulated Storyworlds; University of California: Santa Cruz, CA, USA, 2018.
Gervás, P.; Méndez, G. Distributing Creative Responsibility Between a Knowledge-Based Content Determiner and a Neural Text Realizer. In Proceedings of the EPIA Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2024; pp. 41–53.
Wen, Y.; Huang, C.; Zhou, H.; Zeng, Z.; Po, C.M.L.; Togelius, J.; Merino, T.; Earle, S. All stories are one story: Emotional arc guided procedural game level generation. arXiv 2025, arXiv:2508.02132.
Latitude. AI Dungeon. 2025. Available online: https://aidungeon.com/ (accessed on 15 February 2026).
Hua, M.; Raley, R. Playing With Unicorns: AI Dungeon and Citizen NLP. DHQ Digit. Humanit. Q. 2020, 14, 4.
Store, A.A. Hermes-3. 2025. Available online: https://aiagentstore.ai/ai-agent/hermes-3 (accessed on 15 February 2026).
AI, M. Mistral. 2025. Available online: https://mistral.ai/ (accessed on 15 February 2026).
Crosland, K. AI Powered Game Master Tools. 2023. Available online: https://cros.land/2023/04/ai-powered-game-master-tools/ (accessed on 15 February 2026).
Cros.land. AI Powered DnD 5e Monster Statblock Generator. 2023. Available online: https://cros.land/ai-powered-dnd-5e-monster-statblock-generator/ (accessed on 3 September 2025).
Sudowrite. Sudowrite. 2025. Available online: https://sudowrite.com/ (accessed on 15 February 2026).
López, J.J.P. Procedural and Emergent Narrative: From Analog RPG to Digital RPG. In Proceedings of the Abstract Proceedings of DiGRA 2023 Conference: Limits and Margins of Games; Digital Games Research Association: Tampere, Finland, 2023.
de Lima, E.S.; Feijó, B.; Furtado, A.L. Procedural Generation of Quests for Games Using Genetic Algorithms and Automated Planning. In Proceedings of the SBGames, Rio de Janeiro, Brazil, 28–31 October 2019; pp. 144–153.
de Lima, E.S.; Feijó, B.; Furtado, A.L. Procedural generation of branching quests for games. Entertain. Comput. 2022, 43, 100491.
Prins, V.L.; Prins, J.; Preuss, M.; Gómez-Maureira, M.A. Storyworld: Procedural quest generation rooted in variety & believability. In Proceedings of the 18th International Conference on the Foundations of Digital Games; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1–4.
Breault, V.; Ouellet, S.; Davies, J. Let CONAN tell you a story: Procedural quest generation. Entertain. Comput. 2021, 38, 100422.
Balint, J.T.; Bidarra, R. Procedural generation of narrative worlds. IEEE Trans. Games 2022, 15, 262–272.
Buongiorno, S.; Klinkert, L.; Zhuang, Z.; Chawla, T.; Clark, C. PANGeA: Procedural artificial narrative using generative AI for turn-based, role-playing video games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Lexington, KY, USA, 18–22 November 2024; Volume 20, pp. 156–166.
Griffith, I. Procedural Narrative Generation Through Emotionally Interesting Non-Player Characters. Master’s Thesis, Linnaeus University, Växjö, Sweden, 2018.
Rahman, A.; Yu, A.; Cho, K. Game Knowledge Management System: Schema-Governed LLM Pipeline for Executable Narrative Generation in RPGs. Systems 2026, 14, 175.
Jørgensen, N.H.; Tharmabalan, S. Narrative Adherence in LLM-Driven Games. Master’s Thesis, Aalborg University, Aalborg, Denmark, 2025.
Delafuente, P.; Honraopatil, A.; Martin, L.J. Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment. arXiv 2025, arXiv:2510.18112.
Mumper, P. Emergent Narrative in Tabletop Role-Playing Games: An Application of Concepts; Honors Projects, 938; Bowling Green State University: Bowling Green, OH, USA, 2024. [Google Scholar]
Gryka-Zawadzka, D. Podręczniki do TRPG: Między użytkowością a literackością. Białostockie Stud. Lit. 2024, 24, 175–187. [Google Scholar] [CrossRef]
Svan, O.; Wuolo, A. Emergent Player-Driven Narrative in Blades in the Dark and Dungeons & Dragons: A Comparative Study. Bachelor’s Thesis, Uppsala University, Uppsala, Sweden, June 2021. [Google Scholar]
Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239. [Google Scholar] [CrossRef]
Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599. [Google Scholar] [CrossRef]
Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22. [Google Scholar] [CrossRef]
Wizards of the Coast. Forgotten Realms Campaign Setting; Wizards of the Coast: Renton, WA, USA, 2001. [Google Scholar]
Urbanek, J.; Fan, A.; Karamcheti, S.; Jain, S.; Humeau, S.; Dinan, E.; Rocktäschel, T.; Kiela, D.; Szlam, A.; Weston, J. Learning to Speak and Act in a Fantasy Text Adventure Game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 673–683. [Google Scholar] [CrossRef]
Ashby, T.; Webb, B.K.; Knapp, G.; Searle, J.; Fulda, N. Personalized quest and dialogue generation in role-playing games: A knowledge graph-and language model-based approach. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; pp. 1–20. [Google Scholar] [CrossRef]
Gottschalk, S.; Demidova, E. EventKG–the hub of event knowledge on the web–and biographical timeline generation. Semant. Web 2019, 10, 1039–1070. [Google Scholar] [CrossRef]
Van der Lee, C.; Gatt, A.; Van Miltenburg, E.; Krahmer, E. Human evaluation of automatically generated text: Current trends and best practice guidelines. Comput. Speech Lang. 2021, 67, 101151. [Google Scholar] [CrossRef]
Fu, J.; Ng, S.K.; Jiang, Z.; Liu, P. Gptscore: Evaluate as you desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 6556–6576. [Google Scholar] [CrossRef]
Chiang, C.H.; Lee, H.Y. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 15607–15631. [Google Scholar] [CrossRef]
Chhun, C.; Suchanek, F.M.; Clavel, C. Do language models enjoy their own stories? prompting large language models for automatic story evaluation. Trans. Assoc. Comput. Linguist. 2024, 12, 1122–1142. [Google Scholar] [CrossRef]
Bradley, H.; Dai, A.; Teufel, H.B.; Zhang, J.; Oostermeijer, K.; Bellagente, M.; Clune, J.; Stanley, K.; Schott, G.; Lehman, J. Quality-Diversity through AI Feedback. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
Sai, A.B.; Mohankumar, A.K.; Khapra, M.M. A survey of evaluation metrics used for nlg systems. ACM Comput. Surv. (CSUR) 2022, 55, 1–39. [Google Scholar] [CrossRef]
Gehrmann, S.; Clark, E.; Sellam, T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. J. Artif. Intell. Res. 2023, 77, 103–166. [Google Scholar] [CrossRef]
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Yuan, W.; Neubig, G.; Liu, P. Bartscore: Evaluating generated text as text generation. Adv. Neural Inf. Process. Syst. 2021, 34, 27263–27277. [Google Scholar]
Guan, J.; Huang, M. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Vienna, Austria, 2020; pp. 9157–9166. [Google Scholar] [CrossRef]
Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 2511–2522. [Google Scholar] [CrossRef]
Deriu, J.; Rodrigo, A.; Otegi, A.; Echegoyen, G.; Rosset, S.; Agirre, E.; Cieliebak, M. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev. 2021, 54, 755–810. [Google Scholar] [CrossRef]
Doran, J.; Parberry, I. A prototype quest generator based on a structural analysis of quests from four MMORPGs. In Proceedings of the 2nd International Workshop on Procedural Content Generation in Games; Association for Computing Machinery: New York, NY, USA, 2011; pp. 1–8. [Google Scholar] [CrossRef]
Nussbaum, Z.; Morris, J.X.; Duderstadt, B.; Mulyar, A. Nomic embed: Training a reproducible long context text embedder. arXiv 2024, arXiv:2402.01613. [Google Scholar] [CrossRef]
Wang, B.; Wang, A.; Chen, F.; Wang, Y.; Kuo, C.C.J. Evaluating word embedding models: Methods and experimental results. APSIPA Trans. Signal Inf. Process. 2019, 8, e19. [Google Scholar] [CrossRef]
Levy, O.; Goldberg, Y.; Dagan, I. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 2015, 3, 211–225. [Google Scholar] [CrossRef]
Schnabel, T.; Labutov, I.; Mimno, D.; Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 298–307. [Google Scholar] [CrossRef]
Faruqui, M.; Tsvetkov, Y.; Rastogi, P.; Dyer, C. Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, 12 August 2016; pp. 30–35. [Google Scholar] [CrossRef]
Icard, B.; Zve, E.; Sainero, L.; Breton, A.; Ganascia, J.G. Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models. In Proceedings of the 31st International Conference on Computational Linguistics (COLING); Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2025. [Google Scholar]
Likert, R. A technique for the measurement of attitudes. Arch. Psychol. 1932, 22, 140. [Google Scholar]
Google DeepMind. Gemini 2.5 Flash. 2025. Available online: https://deepmind.google/models/gemini/flash/ (accessed on 15 February 2026).
OpenAI. GPT-4 Research. 2023. Available online: https://openai.com/pl-PL/index/gpt-4-research/ (accessed on 15 February 2026).

Figure 1. Arithmetic mean scores for all scenarios in the baseline pipeline.

Figure 2. Arithmetic mean scores for all scenarios in the ablation study.

Table 1. NPC motivations as defined by Doran and Parberry [72] along with operationalized descriptions.

Motivation	Description	Operationalized Description
Knowledge	Information known to a character	Wants to gather information
Comfort	Physical comfort	Wants to have physical comfort
Reputation	How others perceive a character	Wants to be perceived well by others
Serenity	Peace of mind	Wants peace of mind
Protection	Security against threats	Wants security against threats
Conquest	Desire to prevail over enemies	Wants to prevail over enemies
Wealth	Economic power	Wants economic power
Ability	Character skills	Wants to use their skills
Equipment	Usable assets	Wants physical assets

Table 2. Actions used in strategy definitions along with associated textual template and effect polarity. Note that the list of actions is derived from Doran and Parberry’s [72] definition of strategies and not the atomic action list.

Action	Template	Effect
capture	Capture {character}	Negative
damage	Damage {object}	Negative
defend	Defend {character}	Positive
escort	Escort {character} to {location}	Positive
exchange	Exchange {object} with {character}	Neutral
experiment	Experiment with {object}	Neutral
get	Obtain {object}	Neutral
give	Give {object} to {character}	Positive
goto	Go to {location}	Neutral
kill	Kill {character}	Negative
listen	Listen to {character}	Neutral
repair	Repair {object}	Positive
report	Report to {character}	Positive
spy	Spy on {character}	Negative
steal	Steal {object} from {character}	Negative
take	Take {object} from {character}	Neutral
use	Use {object}	Neutral
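
The templates in Table 2 can be read as format strings whose slots are filled with compendium entities. The following is a minimal sketch of that reading, not the authors' implementation: the dictionary copies three rows of Table 2, while `render_step` and the entity names in the usage line are hypothetical.

```python
# Three rows copied from Table 2: action -> (template, effect polarity).
# render_step and the example entities are hypothetical illustrations.
ACTION_TEMPLATES = {
    "escort": ("Escort {character} to {location}", "Positive"),
    "steal": ("Steal {object} from {character}", "Negative"),
    "goto": ("Go to {location}", "Neutral"),
}


def render_step(action, **slots):
    """Fill the template of an action and return (text, polarity)."""
    template, polarity = ACTION_TEMPLATES[action]
    # str.format raises KeyError if a required slot is missing.
    return template.format(**slots), polarity


text, polarity = render_step(
    "steal", object="the signet ring", character="the castellan"
)
print(text, "|", polarity)
```

Keeping the polarity next to the template lets a later step reason about whether a narrative point helps or harms the target entity without parsing the rendered text.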

Table 3. Overview of prompt variants and compendium characteristics.

Prompt Variants
Variant	Narrative Points	Description
$p_{s}$	4	Short
$p_{m}$	8	Medium
$p_{l}$	12	Long
Compendium Variants
Compendium	Characters	Events	Locations	Objects	Relations
$c_{s}$	5	1	3	1	10
$c_{m}$	6	4	6	4	20
$c_{l}$	18	9	12	11	50
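
To make the size of the experimental grid concrete, the variants in Table 3 can be crossed with the four generator models listed in Table 5. The sketch below assumes a full factorial design, which the tables suggest but this section does not state; `configurations` and `entity_count` are hypothetical helper names.

```python
from itertools import product

# Values copied from Tables 3 and 5. The full-factorial crossing is an
# assumption made for illustration, not a confirmed detail of the study.
PROMPT_POINTS = {"p_s": 4, "p_m": 8, "p_l": 12}
COMPENDIA = {
    "c_s": {"characters": 5, "events": 1, "locations": 3, "objects": 1, "relations": 10},
    "c_m": {"characters": 6, "events": 4, "locations": 6, "objects": 4, "relations": 20},
    "c_l": {"characters": 18, "events": 9, "locations": 12, "objects": 11, "relations": 50},
}
MODELS = ["Mistral: 7B", "LLaMA3.1: 8B", "StableBeluga: 7B", "Gemma3: 4B"]


def entity_count(compendium):
    """Number of entities in a compendium, excluding relations."""
    return sum(v for k, v in COMPENDIA[compendium].items() if k != "relations")


def configurations():
    """Yield every (model, prompt, compendium) combination."""
    for model, prompt, compendium in product(MODELS, PROMPT_POINTS, COMPENDIA):
        yield {"model": model, "prompt": prompt, "compendium": compendium}


configs = list(configurations())
print(len(configs))  # 4 models x 3 prompts x 3 compendia = 36
print({name: entity_count(name) for name in COMPENDIA})
```

Summing the entity columns gives 10, 20 and 50 entities for $c_{s}$, $c_{m}$ and $c_{l}$, which matches the relation counts in the last column of Table 3.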

Table 4. Evaluation categories. The symbol ✓ indicates the presence of a given category in the respective publication or that it was added by the authors.

Category	[60]	[58]	Author	Description
Relevance	✓	✓		Does the scenario address the given prompt and appropriately incorporate the provided context?
Coherence	✓	✓		Are the events and character actions logical, causally consistent, and internally coherent?
Complexity	✓			Does the scenario feature a complex narrative structure, including multi-layered plots and interdependent subplots?
Informativeness		✓		Does the scenario provide sufficiently detailed and comprehensive information about the narrative and game world?
Interactivity			✓	Does the scenario offer players meaningful choices and opportunities for interaction that influence the course of events?
Structure			✓	Does the scenario follow an organizational structure appropriate for RPG scenarios, including clearly defined sections (e.g., locations, characters, items, and plot points)?

Table 5. Sizes and release dates of the selected models.

Language Models
Model	Release Date	Size
Mistral: 7B	27 September 2023	4.1 GB
LLaMA3.1: 8B	23 July 2024	4.9 GB
StableBeluga: 7B	21 July 2023	4.1 GB
Gemma3: 4B	12 March 2025	3.3 GB

Table 6. Average scores across the generator models by the verifier models and human judges.

Generator \ Verifier	Gemma	LLaMA	Mistral	Stable Beluga	Human
Relevance
Gemma	4.575	$4.592$	$4.755$	3.070	$3.227$
LLaMA	$4.794$	$4.508$	$4.769$	$3.207$	2.500
Mistral	$4.935$	$4.473$	$4.789$	$3.197$	$3.364$
Stable Beluga	$4.779$	4.019	4.450	$3.169$	$2.682$
Coherence
Gemma	3.213	$4.475$	$4.520$	$3.115$	$3.364$
LLaMA	$3.603$	$4.394$	$4.470$	$3.156$	2.818
Mistral	$3.870$	$4.250$	$4.536$	3.046	$3.136$
Stable Beluga	$3.551$	3.917	4.317	$3.049$	$3.364$
Complexity
Gemma	2.179	$3.392$	$3.774$	$3.109$	$3.091$
LLaMA	$2.215$	$3.377$	$3.725$	$3.190$	2.500
Mistral	$2.435$	$3.284$	$3.807$	$3.059$	2.500
Stable Beluga	$2.274$	3.163	3.724	3.050	$2.682$
Informativeness
Gemma	3.445	$4.509$	$4.332$	$3.138$	$3.045$
LLaMA	$3.697$	$4.502$	$4.335$	$3.192$	2.182
Mistral	$3.976$	$4.499$	$4.455$	3.039	$3.045$
Stable Beluga	$3.654$	4.258	4.257	$3.063$	$2.500$
Interactivity
Gemma	2.782	$3.804$	$4.170$	$3.262$	$2.727$
LLaMA	$3.106$	$4.056$	$4.154$	$3.243$	$2.318$
Mistral	$3.137$	$3.480$	$4.079$	3.075	$2.545$
Stable Beluga	$2.881$	3.312	3.873	$3.079$	2.273
Structure
Gemma	3.871	$4.467$	$4.627$	3.054	$3.000$
LLaMA	$4.092$	$4.270$	$4.565$	$3.136$	$2.773$
Mistral	$4.299$	$3.967$	$4.584$	$3.156$	$3.182$
Stable Beluga	$4.092$	3.310	4.197	$3.130$	2.409

Bold values indicate the highest scores in each evaluation category. Italic values indicate the lowest scores in each evaluation category.
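
One simple way to inspect Table 6 for self-bias is to compare the score a verifier gives to scenarios from its own generator with the mean score it gives to the other generators. The sketch below does this for the Relevance rows only; it is an illustrative descriptive check, not the procedure used in the study, and `self_preference_gap` is a hypothetical name.

```python
# Relevance block of Table 6 (LLM verifiers only): keys are generators,
# list positions follow the verifier order in MODELS.
MODELS = ["Gemma", "LLaMA", "Mistral", "Stable Beluga"]
RELEVANCE = {
    "Gemma": [4.575, 4.592, 4.755, 3.070],
    "LLaMA": [4.794, 4.508, 4.769, 3.207],
    "Mistral": [4.935, 4.473, 4.789, 3.197],
    "Stable Beluga": [4.779, 4.019, 4.450, 3.169],
}


def self_preference_gap(scores, models):
    """Own-generator score minus mean score of other generators, per verifier."""
    gaps = {}
    for col, verifier in enumerate(models):
        own = scores[verifier][col]
        others = [scores[gen][col] for gen in models if gen != verifier]
        gaps[verifier] = own - sum(others) / len(others)
    return gaps


gaps = self_preference_gap(RELEVANCE, MODELS)
for verifier, gap in gaps.items():
    print(f"{verifier}: {gap:+.3f}")
```

For Relevance the gaps are about −0.261 (Gemma), +0.147 (LLaMA), +0.131 (Mistral) and +0.011 (Stable Beluga), so the sign is not uniform across verifiers. A descriptive gap of this kind carries no significance statement on its own.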

Table 7. Results of Levene’s test and ANOVA. p-values < 0.005 are shown in bold.

Statistic	Source	Relevance	Coherence	Complex.	Inform.	Inter.	Structure
Levene’s p	l	$0.0 \times 10^{- 05}$	$0.0 \times 10^{- 05}$	$4.3 \times 10^{- 05}$	$1.9 \times 10^{- 03}$	$0.0 \times 10^{- 05}$	$4.3 \times 10^{- 01}$
Levene’s p	p	$4.3 \times 10^{- 01}$	$0.0 \times 10^{- 05}$	$3.7 \times 10^{- 02}$	$0.1 \times 10^{- 05}$	$0.0 \times 10^{- 05}$	$3.3 \times 10^{- 01}$
Levene’s p	c	$0.86761$	$0.76119$	$0.27223$	$0.57087$	$0.91147$	$0.60279$
ANOVA p	l	$1.3 \times 10^{- 18}$	$1.0 \times 10^{- 16}$	$1.7 \times 10^{- 02}$	$3.1 \times 10^{- 13}$	$2.6 \times 10^{- 43}$	$1.6 \times 10^{- 54}$
ANOVA p	p	$9.7 \times 10^{- 09}$	$7.8 \times 10^{- 45}$	$7.3 \times 10^{- 01}$	$3.8 \times 10^{- 20}$	$1.9 \times 10^{- 25}$	$2.1 \times 10^{- 27}$
ANOVA p	c	$0.49358$	$0.50784$	$0.12325$	$0.70186$	$0.59909$	$0.81185$
ANOVA $η_{p}^{2}$	l	$0.01539$	$0.01344$	$0.00184$	$0.01086$	$0.03498$	$0.04375$
ANOVA $η_{p}^{2}$	p	$0.00655$	$0.03560$	$0.00011$	$0.01587$	$0.01965$	$0.02101$
ANOVA $η_{p}^{2}$	c	$0.00025$	$0.00023$	$0.00076$	$0.00012$	$0.00018$	$0.00007$
ANOVA $π$	l	$1.00000$	$1.00000$	$0.77062$	$1.00000$	$1.00000$	$1.00000$
ANOVA $π$	p	$0.99992$	$1.00000$	$0.09941$	$1.00000$	$1.00000$	$1.00000$
ANOVA $π$	c	$0.17013$	$0.16487$	$0.43239$	$0.10713$	$0.13487$	$0.08271$
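
For readers unfamiliar with the effect size reported in Table 7, the standard definition of partial eta squared is given below. This is the textbook formula; this section does not state which software or exact variant was used to obtain the tabulated values, and the symbols $SS$, $F$ and $df$ are introduced here for illustration only.

```latex
\eta_{p}^{2}
  = \frac{SS_{\mathrm{effect}}}{SS_{\mathrm{effect}} + SS_{\mathrm{error}}}
  = \frac{F \cdot df_{\mathrm{effect}}}{F \cdot df_{\mathrm{effect}} + df_{\mathrm{error}}}
```

Here $SS_{\mathrm{effect}}$ is the sum of squares attributed to one source, $SS_{\mathrm{error}}$ is the residual sum of squares, and the second form follows from $F = (SS_{\mathrm{effect}} / df_{\mathrm{effect}}) / (SS_{\mathrm{error}} / df_{\mathrm{error}})$.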

Table 8. Tukey’s test results. p-values < 0.005 are shown in bold. Abbreviations: Gem: Gemma; Lla: LLaMA; Mis: Mistral; Bel: Stable Beluga. Symbols: Δ: mean difference; p: p-value; d: Cohen’s d.

Criterion	Value	Gem:Lla	Gem:Mis	Gem:Bel	Lla:Mis	Lla:Bel	Mis:Bel
Relevance	$Δ$	$0.054$	$0.084$	$- 0.163$	$0.031$	$- 0.216$	$- 0.247$
	p	$0.255$	$0.018$	$0.000$	$0.708$	$0.000$	$0.000$
	d	$- 0.073$	$- 0.113$	$0.208$	$- 0.041$	$0.276$	$0.313$
Coherence	$Δ$	$0.047$	$0.075$	$- 0.138$	$0.028$	$- 0.184$	$- 0.212$
	p	$0.303$	$0.026$	$0.000$	$0.722$	$0.000$	$0.000$
	d	$- 0.064$	$- 0.104$	$0.193$	$- 0.040$	$0.266$	$0.316$
Complex.	$Δ$	$0.001$	$0.023$	$- 0.060$	$0.022$	$- 0.061$	$- 0.083$
	p	$1.000$	$0.830$	$0.131$	$0.848$	$0.118$	$0.012$
	d	$- 0.001$	$- 0.033$	$0.082$	$- 0.031$	$0.083$	$0.119$
Inform.	$Δ$	$0.049$	$0.118$	$- 0.064$	$0.069$	$- 0.113$	$- 0.182$
	p	$0.196$	$0.000$	$0.047$	$0.023$	$0.000$	$0.000$
	d	$- 0.074$	$- 0.181$	$0.100$	$- 0.107$	$0.177$	$0.287$
Inter.	$Δ$	$0.119$	$- 0.073$	$- 0.222$	$- 0.192$	$- 0.341$	$- 0.149$
	p	$0.000$	$0.015$	$0.000$	$0.000$	$0.000$	$0.000$
	d	$- 0.172$	$0.113$	$0.340$	$0.299$	$0.525$	$0.248$
Structure	$Δ$	$- 0.012$	$- 0.022$	$- 0.339$	$- 0.010$	$- 0.328$	$- 0.318$
	p	$0.967$	$0.824$	$0.000$	$0.979$	$0.000$	$0.000$
	d	$0.018$	$0.032$	$0.517$	$0.015$	$0.496$	$0.477$
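
The effect size d in Tables 8 and 9 is Cohen’s d. A minimal sketch of the common pooled-standard-deviation form is shown below; the score lists are made-up illustrative data, and the exact variant and subtraction order used for the tables are not stated in this section (Δ and d are reported there with opposite signs, so the order used for d evidently differs from that used for Δ).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical Likert-style scores for two generators (not data from the study).
scores_a = [4, 5, 4, 5, 3]
scores_b = [3, 4, 3, 4, 2]
print(round(cohens_d(scores_a, scores_b), 3))
```

Because d divides the mean difference by the pooled standard deviation, a small Δ on a five-point scale can still correspond to a noticeable d when the scores vary little.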

Table 9. Tukey’s test results. p-values < 0.005 are shown in bold.

Criterion	Value	$p_{s} : p_{m}$	$p_{s} : p_{l}$	$p_{m} : p_{l}$
Relevance	$Δ$	$- 0.126$	$- 0.135$	$- 0.008$
	p	$0.000$	$0.000$	$0.942$
	d	$0.165$	$0.176$	$0.011$
Coherence	$Δ$	$- 0.262$	$- 0.298$	$- 0.036$
	p	$0.000$	$0.000$	$0.266$
	d	$0.381$	$0.437$	$0.050$
Complex.	$Δ$	$- 0.010$	$- 0.018$	$- 0.008$
	p	$0.899$	$0.728$	$0.945$
	d	$0.015$	$0.025$	$0.011$
Inform.	$Δ$	$- 0.144$	$- 0.190$	$- 0.046$
	p	$0.000$	$0.000$	$0.078$
	d	$0.226$	$0.298$	$0.070$
Inter.	$Δ$	$- 0.162$	$- 0.217$	$- 0.055$
	p	$0.000$	$0.000$	$0.030$
	d	$0.253$	$0.339$	$0.081$
Structure	$Δ$	$- 0.180$	$- 0.226$	$- 0.046$
	p	$0.000$	$0.000$	$0.094$
	d	$0.270$	$0.335$	$0.070$

Table 10. Average scores in popular metrics.

Parameter	BLEU	ROUGE-1 F	ROUGE-2 F	ROUGE-L F	BERTScore P	BERTScore R	BERTScore F1
$c_{l}$	3.067	0.229	$0.086$	$0.137$	$0.807$	$0.811$	$0.809$
$c_{m}$	$2.791$	$0.220$	$0.082$	$0.133$	$0.807$	$0.807$	$0.807$
$c_{s}$	$2.741$	$0.217$	$0.083$	$0.135$	$0.807$	$0.807$	$0.807$
Stable Beluga	$3.201$	$0.206$	$0.088$	$0.143$	$0.803$	$0.794$	$0.798$
Gemma	$1.189$	$0.159$	$0.061$	$0.111$	$0.793$	$0.814$	$0.803$
LLaMA	$1.706$	$0.180$	$0.062$	$0.122$	$0.798$	$0.782$	$0.790$
Mistral	$2.998$	$0.223$	$0.090$	$0.156$	$0.810$	$0.832$	$0.820$
$p_{l}$	$3.368$	$0.252$	$0.098$	$0.155$	$0.810$	$0.798$	$0.804$
$p_{m}$	$2.972$	$0.232$	$0.089$	$0.141$	$0.808$	$0.803$	$0.805$
$p_{s}$	$1.964$	$0.181$	$0.063$	$0.109$	$0.802$	$0.824$	$0.812$
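
As a reminder of what the ROUGE-1 F column in Table 10 measures, the sketch below computes unigram-overlap precision, recall and their harmonic mean. It is a simplified illustration that assumes lowercased whitespace tokenization without stemming; published ROUGE packages differ in preprocessing, and the two sentences are invented examples rather than texts from the study.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Unigram-overlap F-score between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


candidate = "the heroes escort the merchant"
reference = "the heroes defend the merchant from bandits"
print(round(rouge1_f(candidate, reference), 3))
```

In this example four of the five candidate tokens appear in the seven-token reference, so precision is 0.8, recall is 4/7 and the F-score is 2/3.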

