Article

A Multi-Modal Story Generation Framework with AI-Driven Storyline Guidance

1 Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
2 Department of Electronic Engineering, Sogang University, Seoul 04107, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(6), 1289; https://doi.org/10.3390/electronics12061289
Submission received: 1 February 2023 / Revised: 1 March 2023 / Accepted: 6 March 2023 / Published: 8 March 2023
(This article belongs to the Special Issue Real-Time Control of Embedded Systems)

Abstract: An automatic story generation system continuously generates stories with a natural plot. The major challenge of automatic story generation is to maintain coherence between consecutively generated stories without the need for human intervention. To address this, we propose a novel multi-modal story generation framework that includes automated storyline decision-making capabilities. Our framework consists of three independent models: a transformer encoder-based storyline guidance model, which predicts a storyline by solving a multiple-choice question-answering problem; a transformer decoder-based story generation model, which creates a story that describes the storyline determined by the guidance model; and a diffusion-based story visualization model, which generates a representative image visually describing a scene to help readers better understand the story flow. Our proposed framework was extensively evaluated through both automatic and human evaluations, which demonstrate that our model outperforms the previous approach and suggest the effectiveness of our storyline guidance model in making proper plans.

1. Introduction

Story generation is one of the most creative tasks, guiding people from being readers to becoming writers. With the emergence of models such as GPT-2 [1], BART [2], and more advanced natural language generation models [3,4], techniques for completing short stories have made enormous progress. However, maintaining coherence remains a key challenge of automatic story generation. The best way to maintain coherence is to plan before writing each paragraph, just as a human would do when writing a novel. In practice, storytellers determine the elements that make up the story, or plan the plot with characters, motifs, and background, to maintain the coherence of the story. Since a story involves several scenes, each composed of a number of paragraphs, paragraph-level coherence is the most important aspect of automatic story generation. Just as humans plan a storyline before writing a text, the system needs to accomplish two goals: (1) planning a storyline and (2) generating a paragraph conditioned on that storyline.
When the system tries to generate the next paragraph directly from the current paragraph without any planning, it may struggle to produce a story with a natural flow. Several studies have introduced planning approaches to maintain the coherence of a story. They utilize diverse strategies, such as using character personalities [5,6] or matching scene-level contexts with events [7,8,9,10]. Other attempts leverage commonsense knowledge [11,12] or extract keywords through global planning [13,14]. However, these methods do not provide continuous supervision of the system; instead, they plan all paragraphs at once, non-sequentially. For the system to generate a story that flows well, it requires a controller that can provide appropriate directions whenever a paragraph is generated. This task has traditionally been performed by human experts. Previous research [15] proposed a machine-in-the-loop system that allows users to generate paragraphs by directly feeding a detailed storyline, made up of various entity combinations, into the model. Although this approach considerably improves automatic story generation performance, it involves human intervention, which undermines the goal of fully automatic story generation. Motivated by the process people follow when writing novels, we introduce a method in which the system plans and generates on its own. As shown in Figure 1, the system automatically and continuously predicts the storyline before generating stories. Without any human intervention, coherence can be achieved by continuously guiding the generation process based on the predicted storyline in real time. By doing so, the system can generate full stories from an initial source with both coherence and engagement.
We propose an integrated framework that enables both planning and generation. First, we propose a model that can predict the storyline (multiple entities) while maintaining paragraph-level coherence. We introduce a storyline guidance model that predicts three types of entities (characters, events, and places) by leveraging the multiple-choice question answering (MCQA) approach. Our storyline guidance model predicts the source material for composing the next paragraph. As shown in Figure 2, the predicted entities can be ordered anew while maintaining coherence, even when a story is based on the same character, event, and place. Second, we propose a GPT-2-based story generation model that generates a paragraph based on the predicted storyline. The two models iteratively and automatically predict and generate what the other needs.
Additionally, to further arouse the readers’ interest, our system includes images representing each paragraph. We recognize that a story often revolves around a single visual concept, which is why we introduce story visualization [16,17,18,19,20] in a multi-modal setting [21]. Story visualization aims to generate visual representations that correspond to the themes depicted in the story. Our approach generates a paragraph-representative image that captures the background information of the story produced by our framework. Previous story visualization models relied mainly on GAN-based models. Moreover, they were trained on limited datasets [18] containing simple captions and corresponding images. While they succeed in generating images from captions, our goal is to tackle the greater complexity of full-length stories in terms of both length and content. To address this, we leverage a diffusion-based text-to-image generation model, which is more advanced than GAN-based models, for our story visualization approach. As shown in Figure 1, our story visualization model explicitly extracts background information from the generated story and provides a visually descriptive image. The resulting image enhances readers’ imagination and improves their engagement with the story.
In summary, we present a novel multi-modal story generation framework that incorporates all of these models. The framework generates a sequence of stories paragraph by paragraph, with the storyline guidance and story generation models employed sequentially for each paragraph. Once all paragraphs constituting one scene are generated, the story visualization model creates an image representing the scene. To plan the next paragraph, the BERT-based storyline guidance model predicts the storyline entities: character, event, and place. For story generation, we introduce a GPT-2-based story generation model. Finally, the story visualization model is a diffusion-based text-to-image generation model that creates an image to aid readers in understanding the story. We evaluate each model independently using appropriate evaluation metrics and report the results. Our experiments demonstrate that the generated stories are coherent both logically and visually.
The remainder of this paper is organized as follows. Section 2 provides a brief review of the relevant literature. Section 3 defines the task, presents our framework, and introduces each of our models. Notably, Section 3.2 provides an in-depth discussion of our storyline guidance model. Section 4 describes our experiments and discusses the findings. Lastly, Section 5 presents our conclusions and future work.

2. Related Work

This section is divided into three parts. First, we provide a brief introduction to neural story generation using language generation models. Second, we review controllable story generation, which is closely related to our approach. Finally, we introduce story visualization models, which generate images to accompany the generated stories.

2.1. Neural Story Generation

Prior to the development of deep learning, language generation models performed sentence-level rather than paragraph-level generation. As attention mechanisms became widely used following the emergence of Seq2Seq models [22], pre-trained end-to-end neural models such as GPT-2 [1] and BART [2] became established as the main models in language generation [23,24,25,26]. In story generation, several studies have modeled the dependency between the current sentence and the next generated sentence using entities [27,28], personas [6], or events [10]. However, as sentences grow into paragraphs, these models have difficulty maintaining paragraph-level coherence. To address this, several attempts have been made to decompose story generation into a multi-stage framework [7,29,30,31,32], since real-world settings must be taken into account [33]. These models use a hierarchical strategy that creates a coherent story by planning before writing the entire story. However, automated storyline planning is not as reliable as human experts, since the planning is limited to a single entity.

2.2. Controllable Story Generation

Controllable text generation means steering a language generation model to decode a sentence with a specified semantic meaning. The authors of [34] characterized this by introducing two types of control: soft control and hard control. Soft control aims to generate text on a general topic, whereas hard control aims to generate a specific word directly in the decoding stage (such as beam search). For story generation, a story is created in a constrained way using plot-based or planning-based control [23,29,35] or persona-based control [6,36]; these methods fall under soft control. Unlike PersonaChat [36], which is used for dialogue models, a story generation model needs to generate long passages. Story datasets do not explicitly indicate their storylines, so it is extremely challenging to control the story generation process across long passages. To address this, Ref. [37] attempts to control the emotional trajectory of the generation process by applying reinforcement learning. Outline-conditioned generation compensates for this problem in that it provides more flexibility than plan-based, persona-based, or event-based methods. A previous study [15] addresses it by clarifying each entity with fine-grained annotations in its dataset. Our purpose in controlling the story is for the model to provide storyline guidance to the story generation model. We expect that multiple entities will control our system properly throughout long passages.

2.3. Story Visualization

A story visualization task with an image generation model was first proposed by [18]. While GAN-based image generation models have been actively studied in the computer vision field [38], story visualization models have also relied heavily on GANs. Several attempts have been made to generate high-quality images in GAN-based text-to-image synthesis [16,39,40,41]. StoryGAN [18], which includes a text encoder, text and image discriminators, and an image generator, was the first model proposed for this task. DuCo-StoryGAN [17] proposed dual learning and copy-transform to improve the semantic alignment of generated images, and it introduced a character-based evaluation metric. VLC-StoryGAN [42] focused on structured input text and guided image generation using a parse-tree structure, commonsense knowledge, dense captioning, and foreground-background information. To preserve the global consistency of characters and scenes, a character-preserving coherent story visualization method was introduced by [43]. More recently, StoryDALL-E [19] leveraged DALL-E [44], resulting in greatly improved performance.

3. Methods

This section is divided into four parts. In the first part, we provide an overview of our proposed framework and task definition. The following three parts focus on the three models that make up the framework and describe each of their functions.

3.1. Task Definition and Our Framework Overview

The goal of our task is to minimize human intervention in neural story generation. To accomplish this, we propose a multi-modal story generation framework composed of three stages. First, our storyline guidance model predicts the next storyline entities. Second, our story generation model generates a paragraph based on the entities predicted by the guidance model. Finally, a story visualization model generates representative images of the scenes to engage the readers’ interest in the story. Previous studies have shown that language generation models rely on storylines selected by human beings to generate stories. In our framework, by contrast, the storyline guidance model predicts multiple entities automatically. These entities have semantic relationships with each other, which allows us to leverage them to generate a new story. Once the storyline is predicted, our story generation model generates a story paragraph using the predicted storyline. By using an image generation model, we ensure scene-level coherence that supports the readers’ understanding. In our framework, all models progress sequentially, generating a new story with a new sequence. The framework is illustrated in Figure 3.

3.2. Storyline Guidance Model

The most crucial elements in crafting a story plot are the characters, events, and places that we anticipate. The protagonist is the central character in the paragraph, the event refers to what the protagonist does, and the place is the backdrop against which the protagonist operates. To predict these entities, we employ BERT [45], a language representation model that excels at extracting deep, pre-trained, bidirectional representations. We fine-tune BERT to predict all three entities by training it on an MCQA problem, in which we need to identify a context, a single specified answer, and candidates. We set the first and current paragraphs of a story as the context, while the specified characters, events, and places that will feature in the next scene are set as the answer. During training, BERT takes the context, question, and answer separated by [SEP] tokens, which serve as separation tokens connecting the different components. The model then calculates the categorical cross-entropy loss over the candidates and performs backpropagation to learn. Finally, the model outputs the candidate with the highest probability as the answer to the question “What is the entity in the next paragraph?”. Our model learns the context of the current story during the training stage and learns to answer the question above. Using our storyline guidance model, we select an appropriate answer from the five candidates we created. Figure 4 illustrates the details of our model. Our model can independently predict the next storyline entities based on the current paragraph, allowing the story generation process to continue seamlessly. With this model, users can also control specific entity types and guide the automated story generation process in real time according to their preferences.
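A minimal sketch of this MCQA formulation is given below, using the Hugging Face transformers library. The context, question, and five-candidate setup follow the description above, but the model checkpoint, the example strings, and the exact preprocessing are illustrative assumptions rather than the exact implementation used in this work.

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")  # fine-tuned weights in practice
model.eval()

context = "First paragraph of the story ... current paragraph ..."
question = "What is the entity in the next paragraph?"
candidates = ["tavern", "forest", "castle", "harbor", "library"]  # five place candidates

# Pair the shared (context + question) with each candidate, as in multiple-choice QA.
first_segments = [f"{context} [SEP] {question}"] * len(candidates)
encoding = tokenizer(first_segments, candidates, return_tensors="pt",
                     padding=True, truncation=True, max_length=512)
# BertForMultipleChoice expects tensors of shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, 5)
predicted = candidates[logits.argmax(dim=-1).item()]
print("Predicted next-place entity:", predicted)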

3.3. Story Generation Model

The language generation model takes sentences and generates the next tokens using an auto-regressive process. To obtain the best sequence of subsequent tokens, the model predicts a probability distribution over the next token given the previous tokens. The probability of a sequence y can be obtained using an iterative process known as the chain rule:
$P_\theta(y) = \prod_{t=1}^{|y|} P_\theta(y_t \mid y_{<t})$
We use GPT2-medium [1], a general-purpose model consisting of multiple transformer blocks with multi-head self-attention modules, and the objective is to minimize the following negative log-likelihood. Storium-GPT2 [15] can be defined as follows: given an input V = (v_1, v_2, ..., v_M) with maximum length M, the model generates a coherent story Y = {y_1, y_2, ..., y_{|Y|}}.
$\mathcal{L}_{GPT} = -\sum_{t=1}^{|Y|} \log p(y_t \mid y_{<t}, V)$
$p(y_t \mid y_{<t}, V) = \mathrm{softmax}(H_t W + b)$
$H_t = \mathrm{Decoder}(y_{<t}, E_t)$
$E_t = p_t + v_t + \sum_{i=1}^{n} s_t^{i}$
The final embedding E_t at position t is computed by summing the positional embedding p_t and the token embedding v_t with a set of n segment token embeddings {s^1, s^2, ..., s^n}, following the method proposed in [46]. The probability distribution of y is learned by gradient descent on the loss function, where H_t is the decoder’s hidden state at the t-th position computed from the context (the story), and W and b are trainable parameters. During training, the summed embedding vector contains the information introduced in the Storium dataset [15].
The most critical challenge for generation models used in story generation [1,2] is that previous stories are too long to be fed recurrently as inputs to the model. Therefore, when an embedding is generated, the maximum sequence length of each field must be limited so that as much information as possible can be included in the input. Ref. [15] uses the Cassowary solver [47] to address this and ensures that every field of the input receives at least a minimum number of tokens. We adopt this method, but we fine-tune the model for greater consistency with the last sentences of the previous scene by trimming unnecessary tokens. It is difficult to fit all the contexts used in [15] within the maximum sequence length, so a larger model would be required to adopt the existing method unchanged. To help the attention mechanism work better, we remove the character description from the input, add the last two complete sentences of the first entry (establishment) to the input, and then pad it to the maximum sequence length. The entire embedding is created by applying the method in [15], stacking it in two stacks and adding each segment so that it does not exceed the maximum sequence length of the GPT2-medium model.
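The following is a minimal sketch of this generation step, assuming a GPT-2-medium checkpoint fine-tuned on Storium-style inputs. The trimming of the previous entry to its last two complete sentences mirrors the input construction described above, while the storyline string, prompt format, and decoding length are illustrative assumptions; the temperature and repetition penalty follow Section 4.2.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from nltk import sent_tokenize  # nltk.download("punkt") may be required once

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # stand-in for the fine-tuned model
model.eval()

previous_entry = "..."  # full text of the previous scene entry
storyline = "character: Mara | event: The Hidden Door | place: old library"  # predicted entities

# Keep only the last two complete sentences of the previous entry, then prepend
# the predicted storyline so the decoder is conditioned on both.
tail = " ".join(sent_tokenize(previous_entry)[-2:])
prompt = f"{storyline}\n{tail}"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.9,            # decoding settings from Section 4.2
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )
paragraph = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(paragraph)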

3.4. Story Visualization Model

Our story visualization model is based on the Latent Diffusion Model (LDM) [48]. Before describing LDM, we briefly explain the Diffusion Model (DM) [49]. A diffusion model is an image generation model defined as a fixed Markov chain: a forward diffusion process gradually adds Gaussian noise, and a reverse process removes it. The forward process (also called the diffusion process) samples from the real data $x_0 \sim q(x_0)$ over T steps. This process approximates the posterior $q(x_{1:T} \mid x_0)$ at each time-step t, which is formulated as:
$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$
in which $\beta_t$ denotes the variance schedule at each time-step up to the final time-step T and is gradually increased, i.e., $\beta_{t-1} < \beta_t$.
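The following is a minimal sketch of the stepwise forward process in the equation above: at each step, $x_t$ is sampled from $\mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ under an increasing variance schedule. The schedule endpoints and tensor shape are illustrative choices, not values taken from the paper.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 < beta_2 < ... < beta_T

def forward_diffusion(x0: torch.Tensor) -> torch.Tensor:
    """Run the Markov chain q(x_{1:T} | x_0) and return the final noised sample x_T."""
    x = x0
    for t in range(T):
        noise = torch.randn_like(x)
        # x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
        x = torch.sqrt(1.0 - betas[t]) * x + torch.sqrt(betas[t]) * noise
    return x

x0 = torch.randn(1, 3, 64, 64)                  # a dummy input in latent space
xT = forward_diffusion(x0)                      # approximately standard Gaussian noise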
The reverse process is represented as the joint distribution $p_\theta(x_{0:T})$, and it is also defined as a Markov chain starting from $p(x_T) = \mathcal{N}(0, I)$. It can be calculated by:
$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$
The objective of the diffusion model is to approximate the mean $\mu_\theta(x_t, t)$ of the noise distribution in the reverse process, computed as follows:
$\mathcal{L}_{DM} := \mathbb{E}_{x_0, t, \epsilon}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \,\big], \quad \text{where } \epsilon \sim \mathcal{N}(0, I)$
The LDM [48] was introduced to use fewer computational resources while providing equal or better performance. It contains a pre-trained autoencoder that produces a latent vector $z = \mathcal{E}(x)$ from the input space $x \in D_x$. The latent vector z becomes the new input to the diffusion model. LDM creates an embedding for the caption of a given image using a text encoder and uses it to augment the DM’s U-Net backbone. The optimization is defined as:
$v^* := \operatorname*{arg\,min}_{v}\ \mathbb{E}_{x_0, t, \epsilon}\big[\, \lVert \epsilon - \epsilon_\theta(x_t, t, c_\theta(y)) \rVert_2^2 \,\big]$
where $v^*$ is a specialized token that the user introduces into the model, and $c_\theta(y)$ is a conditioning vector mapped from a conditioning input y.
LDM outperforms other models in the field of text-to-image synthesis. However, different decodings are produced depending on the context (e.g., the background source), which makes it difficult to maintain the concept the user wants. To solve this problem, fine-tuning on a specific concept using few-shot images has been proposed [50,51]. We utilize Textual Inversion [50] for story visualization and, through this, create an image that maintains one concept for the scene. Textual Inversion leverages CLIP ImageNet templates [52] and runs a diffusion process with random sampling from several template-driven texts to fine-tune the model for a one-shot image provided by the user. Our reference dataset [15] does not include images for each character, so instead of creating a picture depicting a character, we focus on a background that can stimulate the writer’s imagination. Through this generated image, the coherence of the entire story is enhanced for newly appearing background information, and the story maintains one concept across successive scene entries. When using Textual Inversion, the guidance scale is determined by the complexity and level of detail of the textual descriptions being processed. If the textual descriptions are highly detailed and complex, the guidance scale may need to be relatively small in order to capture all of the relevant information and generate high-quality images. On the other hand, if the textual descriptions are relatively simple or abstract, the guidance scale may be larger. We therefore test several guidance scales from 1 to 10 to set a proper value.
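The sketch below illustrates such a guidance-scale sweep, assuming a diffusers-style Stable Diffusion pipeline into which a Textual Inversion embedding for the learned place concept has already been loaded. The checkpoint identifier, embedding file name, placeholder token, and prompt are illustrative assumptions and not the exact models or files used in this work.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Hypothetical embedding file produced by Textual Inversion fine-tuning on 3 place images.
pipe.load_textual_inversion("learned_place_concept.bin", token="<scene-place>")

prompt = "a photo of <scene-place>"             # CLIP ImageNet-style template
for scale in range(1, 11):                      # test guidance scales 1..10
    image = pipe(prompt, guidance_scale=scale, height=512, width=512).images[0]
    image.save(f"scene_scale_{scale}.png")      # inspect which scale preserves the concept best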

4. Experiments

This section is divided into six parts. First, we introduce the dataset used in this work. Second, we describe the experimental settings. Third, we define the evaluation metrics. In the remaining three parts, we present the experimental results for each model in our framework.

4.1. Dataset

In this study, we conduct experiments using the Storium dataset [15]. The dataset consists of 5743 stories with a large corpus of 25,092 scenes and 448,264 scene entries, which we use to train both our story generation model and our storyline guidance model. Compared to conventional benchmarks such as ROCStories [53], which is composed of short sentences, this dataset contains rich information about each story. The original Storium dataset includes information on event entities, character personalities, and places. To use the dataset in a multiple-choice question-answering (MCQA) setting, we reorganized it. Specifically, we concatenate the initial paragraph and the current paragraph of each event to form a context. We provide an event with its corresponding description as the answer. To create the multiple-choice setting, we randomly select four entities and form the candidate set together with the correct answer. The answers and candidates for characters and places are constructed in the same way. We use the Storium dataset to incorporate more realistic stories, as it contains numerous entities and long paragraphs. We also exclude stories with fewer than five characters or events to ensure that the candidates are sufficient for a meaningful answer. An example from our modified dataset is illustrated in Figure 5.
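The following is a minimal sketch of how a scene entry could be reorganized into this MCQA format: the first and current paragraphs form the context, the true next-entry entity is the answer, and four randomly drawn entities from the same story serve as distractors. The field names and data layout are illustrative assumptions, not the Storium schema itself.

import random

def build_mcqa_example(story, entry_idx, entity_type="event", num_distractors=4):
    # Context: the initial paragraph concatenated with the current paragraph.
    context = story["entries"][0]["text"] + " " + story["entries"][entry_idx]["text"]
    # Answer: the ground-truth entity attached to the next entry.
    answer = story["entries"][entry_idx + 1][entity_type]
    # Distractors: other entities of the same type from the same story.
    pool = [e[entity_type] for e in story["entries"] if e[entity_type] != answer]
    distractors = random.sample(pool, num_distractors)
    candidates = distractors + [answer]
    random.shuffle(candidates)
    return {
        "context": context,
        "question": f"What is the {entity_type} in the next paragraph?",
        "candidates": candidates,
        "label": candidates.index(answer),
    }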

4.2. Experimental Settings

All experiments are conducted on two NVIDIA RTX 3090 cards with an Intel i7-6700 CPU (3.40 GHz) and 32 GB RAM. Our story generation model is initialized with the weights of GPT-2 medium (355 M parameters). We fine-tuned GPT-2 for 45,000 training steps with a batch size of 4 over 10 h, using the same configurations as the Storium paper except for the input texts and the output size. We use a temperature of 0.9, a repetition penalty of 1.2, a learning rate of 1 × 10^{-5}, and 5000 warm-up steps. Our storyline guidance model is initialized with the weights of BERT-base (110 M parameters) and takes approximately 8 h to train. We exploit a pre-trained Latent Diffusion Model (1.4 B parameters) trained on the LAION-400M dataset [54] and follow the same procedure as Textual Inversion. We set the model’s hyperparameters to an image resolution of 512 × 512, a batch size of 4, gradient accumulation over 4 steps, 2000 training steps, and a learning rate of 1 × 10^{-4}. During inference, we set the guidance scale to 5, which we determined empirically. Fine-tuning takes less than 10 min with the original settings.

4.3. Metrics

4.3.1. Automatic Evaluation

We adopt the following automatic metrics. Recall@k, originally used in retrieval tasks, measures the proportion of relevant items retrieved among the top-k items. We use it to check whether our storyline guidance model predicts appropriate entities compared to the counterparts in the original dataset, and we set k = 1, 2. Perplexity (PPL) measures the uncertainty of the tokens predicted by the natural language generation model. BLEU-n (B-n) [55], the Bilingual Evaluation Understudy, is a commonly used metric for evaluating the quality of generated text. It measures the similarity between the generated text and the reference text by comparing their n-grams (contiguous sequences of n words); we set n = 2, 3, 4. Lexical Repetition (LR-n) [56] computes the percentage of generated stories that repeat a 4-gram at least n times; we set n = 5. These measurements cannot capture the semantic similarity between the generated text and the reference text, since they are computed directly from surface words or tokens. Therefore, we introduce BERTScore-reference (BS-r), a pre-trained model-based measurement of the semantic similarity between the generated text and the reference text. To compute it, we split sentences into tokens using the NLTK module and use BERTScore-recall [57] with RoBERTa-large as the backbone model; this process does not require fine-tuning. We also introduce BERTScore-coherence (BS-c) to evaluate the coherence between two consecutive paragraphs. This metric computes the BERTScore-recall between the n-th generated paragraph and the (n+1)-th generated paragraph and is particularly suited to evaluating the effectiveness of our storyline guidance strategy. Our approach emphasizes maintaining consistency between successive paragraphs to create a cohesive and engaging narrative, and BS-c is designed to assess exactly this degree of coherence.
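A minimal sketch of the BS-c computation is shown below: BERTScore recall between each generated paragraph and the one that follows it, using the bert-score package with a RoBERTa-large backbone. The averaging over pairs and the omission of the NLTK tokenization step are simplifying assumptions for illustration.

from bert_score import score

def bs_coherence(paragraphs):
    """Average BERTScore recall between consecutive generated paragraphs."""
    cands = paragraphs[1:]      # the (n+1)-th paragraphs
    refs = paragraphs[:-1]      # the n-th paragraphs
    precision, recall, f1 = score(cands, refs, model_type="roberta-large")
    return recall.mean().item()

# Example usage on a generated story split into paragraphs.
story_paragraphs = ["First generated paragraph ...", "Second generated paragraph ...",
                    "Third generated paragraph ..."]
print("BS-c:", bs_coherence(story_paragraphs))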

4.3.2. Human Evaluation

We conducted a human evaluation to compare our framework with a baseline using four metrics. Fluency evaluates the quality of individual sentences from a linguistic perspective, taking into account factors such as grammatical correctness and the accuracy of semantic meaning; this evaluation is performed sentence by sentence, with each sentence considered in isolation. Coherence measures the logical relatedness between two consecutive paragraphs, assessing how well the generated text flows and whether it makes sense as a cohesive whole. Relevance measures the contextual relevance between the stories and the storyline, assessing how well the generated text aligns with the given topic and whether it is relevant to the overall storyline. Likability measures the degree of positive sentiment or engagement that the generated text elicits from human annotators, assessing how appealing and engaging the text is to the target audience. To evaluate the annotators’ agreement on these properties, we use Fleiss’ κ coefficient, which measures the inter-rater reliability of the annotators’ judgments.

4.4. Storyline Guidance Model

To verify the effectiveness of our approach, we applied our modified dataset, which includes rich story entities and long paragraphs, to evaluate the performance of our model in terms of Recall@1 and Recall@2. Specifically, we focused on the three storyline entities (character, event, and place) that can lead to the next paragraph in the narrative. We recognize that achieving semantic coherence between successive paragraphs in the Storium dataset can be challenging, as character descriptions often involve personal information such as occupation and age rather than a well-defined persona. Nonetheless, our evaluation results demonstrate that our model is able to generate a coherent sequence for the entire story. We observed that the Recall for events is higher than for characters, indicating that the coherence between the current and next paragraphs depends semantically on the event entity. Overall, our results, summarized in Table 1, demonstrate the effectiveness of our approach in generating coherent and engaging narratives, even with a challenging dataset. Since each paragraph of our dataset consists of a large number of sentences, we show an example based on event descriptions in the predicted order and compare them with the existing data.
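As a reference, the Recall@k values in Table 1 reduce to a simple check of whether the gold entity is among the model’s top-k scored candidates, since each question has a single correct answer. The sketch below illustrates this computation; the random inputs are placeholders for the guidance model’s candidate logits and gold labels.

import torch

def recall_at_k(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """logits: (num_questions, num_candidates); labels: (num_questions,) gold indices."""
    topk = logits.topk(k, dim=-1).indices                 # top-k candidate indices
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)     # is the gold index in the top-k?
    return hits.float().mean().item()

# Example: 5 candidates per question, k = 1 and k = 2 as in Table 1.
logits = torch.randn(100, 5)
labels = torch.randint(0, 5, (100,))
print(recall_at_k(logits, labels, 1), recall_at_k(logits, labels, 2))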

4.5. Story Generation Model

We analyze the performance both quantitatively and qualitatively. Table 2 presents the automatic evaluation results of our story generation model compared to the Storium-GPT2 baseline. The results demonstrate that our model outperforms the baseline in perplexity, indicating its superior ability to model the text in the test set. In addition, our model generates more word overlap with the reference texts, as evidenced by higher BLEU-3 and BLEU-4 scores. Although our model’s BLEU-2 score does not surpass that of the baseline, we attribute this to the dataset, which contains numerous paragraphs with long sentences. Nonetheless, our model’s results on BLEU-3 and BLEU-4 are more appropriate to our dataset and demonstrate its ability to generate high-quality stories. Moreover, our model reduces lexical repetition.
As shown in Table 3, the BERTScore-reference of our model is lower than that of the baseline, likely because we reduced the input context to enhance coherence in the generated stories. In contrast, our metric, BERTScore-coherence, outperforms the baseline by a significant margin. This result suggests that our storyline guidance model effectively plans and guides the generation of coherent and engaging narratives. We attribute this improvement to our approach’s ability to maintain consistency between successive paragraphs.
To further investigate the performance of our framework, we conducted a human evaluation and compared it with the baseline model. Since many studies do not rely solely on automatic evaluation metrics, owing to their limitations in evaluating story-writing performance, we included human evaluation to ensure the quality of our system. Specifically, we conducted the human evaluation on the decoded stories, which are the final output texts generated by our story generation model. We randomly sampled 100 stories from the test set of the Storium dataset, since the complete stories were too voluminous to include in the evaluation. We hired three experts to rate the generated paragraphs on a scale of 1–5. Our framework incorporates a storyline guidance model that automatically selects the input storyline, while the baseline model relies on human guidance; for the baseline, we generated storylines for the sampled stories using the experts’ intuition. As shown in Table 4, our model outperforms the baseline in terms of fluency, coherence, relevance, and likability. Our model demonstrates grammatical generation ability comparable to the baseline in terms of fluency, while the annotators’ interest is significantly enhanced in terms of likability. Moreover, our storyline guidance model significantly improves coherence and relevance, indicating its high reliability in predicting the storyline. Therefore, the generated paragraphs have natural plots, and the predicted storyline is well aligned with the human annotators’ expectations.

4.6. Story Visualization Model

Our image generation model relies on a prompt constructed by applying CLIP ImageNet-style templates [52] to the place predicted by our storyline guidance model. Following the original paper’s recommendation of using 3–5 image inputs, we empirically selected three images as inputs and applied them to various backgrounds in the story, ensuring that the underlying concept was preserved. Our visualization model generates images that contain appropriate background information and maintain a cohesive visual concept. We train the model using three-shot learning, which tunes it to support the readability of the story. Sample outputs from our story visualization model are shown in Figure 6.

5. Conclusions and Future Work

In this paper, we propose a novel multi-modal story generation framework with automated storyline decision-making capabilities, which can replace the human role and allow the system itself to maintain story coherence. The proposed framework consists of three independent models. The first is a BERT-based storyline guidance model, which predicts a storyline by solving a multiple-choice question-answering problem; for each entity of the next storyline, the model selects the candidate most relevant to the current paragraph among five candidates. The second is a GPT2-medium-based story generation model that creates a story describing the storyline determined by the guidance model. Lastly, to support readability, we also propose a diffusion-based story visualization model that visualizes a representative image of the current scene place predicted by our storyline guidance model. We evaluated the performance of our framework, and the corresponding generated stories, both quantitatively and qualitatively using a large-scale dataset. We analyzed the performance of our storyline guidance with Recall@1 and Recall@2, which supports storyline planning before generating a paragraph. We evaluated the quality of the generated stories with human evaluation, which suggests that the coherence of the generated stories is improved. We also provided meaningful story visualization results based on our generated stories.
Our proposed storyline guidance model is designed to select one of five candidate entities. This assumption differs from real-world scenarios, in which a storyline entity can proceed in far more diverse ways than five. Therefore, in future work, we will explore improved methodologies that can overcome this limitation. Additionally, we plan to investigate new metrics for evaluating the quality of multi-modal story generation, specifically measuring the similarity between the generated stories and images.

Author Contributions

Conceptualization, J.K., Y.H., H.Y. and J.N.; methodology, J.K., Y.H. and H.Y.; software, J.K.; validation, J.K.; formal analysis, J.K.; investigation, J.K. and Y.H.; resources, J.K.; data curation, J.K.; writing—original draft preparation, J.K.; writing—review and editing, Y.H.; visualization, J.K.; supervision, Y.H. and J.N.; project administration, Y.H. and J.N.; funding acquisition, J.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2022-0-00956, Development of Self-detection Technology for Online Grooming in Social Networks).

Data Availability Statement

Restrictions apply to the availability of these data. Data was obtained from Ref. [15] and are available at https://storium.cs.umass.edu/ (accessed on 31 January 2023) with the permission of the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  2. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics(ACL), Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  3. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Amodei, D. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
  4. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 140. [Google Scholar]
  5. Xu, F.; Wang, X.; Ma, Y.; Tresp, V.; Wang, Y.; Zhou, S.; Du, H. Controllable Multi-Character Psychology-Oriented Story Generation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM), Online, 19–23 October 2020; pp. 1675–1684. [Google Scholar]
  6. Zhang, Z.; Wen, J.; Guan, J.; Huang, M. Persona-Guided Planning for Controlling the Protagonist’s Persona in Story Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies(NAACL), Seattle, WA, USA, 10–15 July 2022; pp. 3346–3361. [Google Scholar]
  7. Yao, L.; Peng, N.; Weischedel, R.; Knight, K.; Zhao, D.; Yan, R. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7378–7385. [Google Scholar]
  8. Alhussain, A.I.; Azmi, A.M. Automatic story generation: A survey of approaches. ACM Comput. Surv. 2022, 54, 1–38. [Google Scholar] [CrossRef]
  9. Tang, C.; Guerin, F.; Li, Y.; Lin, C. Recent Advances in Neural Text Generation: A Task-Agnostic Survey. arXiv 2022, arXiv:2203.03047. [Google Scholar]
  10. Tang, C.; Zhang, Z.; Loakman, T.; Lin, C.; Guerin, F. NGEP: A graph-based event planning framework for story generation. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 20–23 November 2022; pp. 186–193. [Google Scholar]
  11. Xu, P.; Patwary, M.; Shoeybi, M.; Puri, R.; Fung, P.; Anandkumar, A.; Catanzaro, B. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2831–2845. [Google Scholar]
  12. Ji, H.; Ke, P.; Huang, S.; Wei, F.; Zhu, X.; Huang, M. Language generation with multi-hop reasoning on commonsense knowledge graph. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 725–736. [Google Scholar]
  13. Hua, X.; Wang, L. PAIR: Planning and iterative refinement in pre-trained transformers for long text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 781–793. [Google Scholar]
  14. Tan, B.; Yang, Z.; AI-Shedivat, M.; Xing, E.P.; Hu, Z. Progressive generation of long text with pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Online, 6–11 June 2021; pp. 4313–4324. [Google Scholar]
  15. Akoury, N.; Wang, S.; Whiting, J.; Hood, S.; Peng, N.; Iyyer, M. STORIUM: A Dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6470–6484. [Google Scholar]
  16. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5907–5915. [Google Scholar]
  17. Maharana, A.; Hannan, D.; Bansal, M. Improving generation and evaluation of visual stories via semantic consistency. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies(NAACL), Online, 6–11 June 2021; pp. 2427–2442. [Google Scholar]
  18. Li, Y.; Gan, Z.; Shen, Y.; Liu, J.; Cheng, Y.; Wu, Y.; Carin, L.; Carlson, D.; Gao, J. Storygan: A sequential conditional gan for story visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 6329–6338. [Google Scholar]
  19. Maharana, A.; Hannan, D.; Bansal, M. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In Proceedings of the European Conference on Computer Vision(ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 70–87. [Google Scholar]
  20. Zeng, G.; Li, Z.; Zhang, Y. PororoGAN: An improved story visualization model on Pororo-SV dataset. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, Normal, IL, USA, 6–8 December 2019; pp. 155–159. [Google Scholar]
  21. Wu, Y.; Ma, Y.; Wan, S. Multi-scale relation reasoning for multi-modal Visual Question Answering. Signal Process. Image Commun. 2021, 96, 116319. [Google Scholar] [CrossRef]
  22. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning(ICML), Sydney, Australia, 6–11 August 2017; pp. 1243–1252. [Google Scholar]
  23. Rashkin, H.; Celikyilmaz, A.; Choi, Y.; Gao, J. Plotmachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing(EMNLP), Online, 16–20 November 2020; pp. 4274–4295. [Google Scholar]
  24. Guan, J.; Huang, F.; Zhao, Z.; Zhu, X.; Huang, M. A knowledge-enhanced pretraining model for commonsense story generation. Trans. Assoc. Comput. Linguist. 2020, 8, 93–108. [Google Scholar] [CrossRef]
  25. Goldfarb-Tarrant, S.; Chakrabarty, T.; Weischedel, R.; Peng, N. Content planning for neural story generation with aristotelian rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4319–4338. [Google Scholar]
  26. Clark, E.; Smith, N.A. Choose your own adventure: Paired suggestions in collaborative writing for evaluating story generation models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies(NAACL), Online, 6–11 June 2021; pp. 3566–3575. [Google Scholar]
  27. Ji, Y.; Tan, C.; Martschat, S.; Choi, Y.; Smith, N.A. Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7–11 September 2017; pp. 1830–1839. [Google Scholar]
  28. Clark, E.; Ji, Y.; Smith, N.A. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), New Orleans, LA, USA, 1–6 June 2018; pp. 2250–2260. [Google Scholar]
  29. Fan, A.; Lewis, M.; Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 889–898. [Google Scholar]
  30. Guan, J.; Mao, X.; Fan, C.; Liu, Z.; Ding, W.; Huang, M. Long text generation by modeling sentence-level and discourse-level coherence. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), online, 2–5 August 2021; pp. 6379–6393. [Google Scholar]
  31. Martin, L.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; Riedl, M. Event representations for automated story generation with deep neural nets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [CrossRef]. [Google Scholar]
  32. Fan, A.; Lewis, M.; Dauphin, Y. Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics(ACL), Florence, Italy, 28 July–2 August 2019; pp. 2650–2660. [Google Scholar]
  33. Wu, Y.; Cao, H.; Yang, G.; Lu, T.; Wan, S. Digital twin of intelligent small surface defect detection with cyber-manufacturing systems. ACM Trans. Internet Technol. 2022; accepted. [Google Scholar]
  34. Pascual, D.; Egressy, B.; Meister, C.; Cotterell, R.; Wattenhofer, R. A plug-and-play method for controlled text generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3973–3997. [Google Scholar]
  35. Kybartas, B.; Bidarra, R. A survey on story generation techniques for authoring computational narratives. IEEE Trans. Comput. Intell. AI Games 2016, 9, 239–253. [Google Scholar] [CrossRef] [Green Version]
  36. Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; Weston, J. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; pp. 2204–2213. [Google Scholar]
  37. Brahman, F.; Petrusca, A.; Chaturvedi, S. Cue me in: Content-inducing approaches to interactive story generation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL), Suzhou, China, 4–7 December 2020; pp. 588–597. [Google Scholar]
  38. Wu, Y.; Zhang, L.; Berretti, S.; Wan, S. Medical image encryption by content-aware dna computing for secure healthcare. IEEE Trans. Ind. Inform. 2022, 19, 2089–2098. [Google Scholar] [CrossRef]
  39. Ullah, U.; Lee, J.S.; An, C.H.; Lee, H.; Park, S.Y.; Baek, R.H.; Choi, H.C. A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors 2022, 22, 6816. [Google Scholar] [CrossRef] [PubMed]
  40. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1316–1324. [Google Scholar]
  41. Qiao, T.; Zhang, J.; Xu, D.; Tao, D. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1505–1514. [Google Scholar]
  42. Maharana, A.; Bansal, M. Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6772–6786. [Google Scholar]
  43. Song, Y.Z.; Rui Tam, Z.; Chen, H.J.; Lu, H.H.; Shuai, H.H. Character-preserving coherent story visualization. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 18–33. [Google Scholar]
  44. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  46. Wolf, T.; Sanh, V.; Chaumond, J.; Delangue, C. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv 2019, arXiv:1901.08149. [Google Scholar]
  47. Badros, G.J.; Borning, A.; Stuckey, P.J. The Cassowary linear arithmetic constraint solving algorithm. ACM Trans.-Comput.-Hum. Interact. (TOCHI) 2001, 8, 267–306. [Google Scholar] [CrossRef] [Green Version]
  48. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–23 June 2022; pp. 10684–10695. [Google Scholar]
  49. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  50. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  51. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv 2022, arXiv:2208.12242. [Google Scholar]
  52. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  53. Chen, J.; Chen, J.; Yu, Z. Incorporating structured commonsense knowledge in story completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6244–6251. [Google Scholar]
  54. Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar]
  55. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics(ACL), Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  56. Shao, Z.; Huang, M.; Wen, J.; Xu, W.; Zhu, X. Long and diverse text generation with planning-based hierarchical variational model. arXiv 2019, arXiv:1908.06605. [Google Scholar]
  57. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Figure 1. An overview of our framework.
Figure 2. An example of storyline entities predicted by the storyline guidance model.
Figure 3. Our proposed multi-modal story generation framework. The framework consists of three parts: (a) Our GPT-2-based story generation model creates the current story paragraph (par). A series of these paragraphs becomes an entire scene. (b) Our BERT-based storyline guidance model generates the storyline, which is composed of character, event, and place. (c) Our diffusion-based story visualization model generates a representative image. A single paragraph is fed to the diffusion model, and the final output is the corresponding image.
Figure 4. Our proposed storyline guidance model with an example of entity selection and ordering.
Figure 5. A dataset example for training the storyline guidance model with the MCQA problem.
Figure 6. Example output of the story visualization model.
Table 1. Automatic evaluation results of the storyline guidance model.

Field | Recall@1 | Recall@2
Character | 0.32 | 0.54
Event | 0.67 | 0.88
Place | 0.54 | 0.82
Table 2. Automatic evaluation results of story generation models.

Model | PPL↓ | B-2↑ | B-3↑ | B-4↑ | LR-5↓
Storium-GPT2 [15] | 0.224 | 0.176 | 0.106 | 0.041 | 0.549
Ours | 0.198 | 0.154 | 0.111 | 0.064 | 0.407
Table 3. BERTScore comparison.

Model | BS-r↑ | BS-c↑
Storium-GPT2 [15] | 15.7 | 15.2
Ours | 15.1 | 22.2
Table 4. Human evaluation results for the generated stories, with ratings scored between 1 and 5. Fleiss’ Kappa coefficient (κ) is used to measure annotators’ agreement, with all results indicating moderate agreement among the annotators.

Model | Fluency (Rating / κ) | Coherence (Rating / κ) | Relevance (Rating / κ) | Likability (Rating / κ)
Storium-GPT2 [15] | 3.21 / 0.34 | 2.57 / 0.53 | 3.11 / 0.51 | 2.94 / 0.42
Ours | 3.32 / 0.31 | 3.48 / 0.62 | 3.55 / 0.54 | 3.06 / 0.46
