Article

A Doodle-Based Control for Characters in Story Visualization

Hyemin Yang, Heekyung Yang and Kyungha Min
1 Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
2 Department of Software, Sangmyung University, Chonan 31066, Republic of Korea
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2024, 13(23), 4628; https://doi.org/10.3390/electronics13234628
Submission received: 8 October 2024 / Revised: 15 November 2024 / Accepted: 21 November 2024 / Published: 23 November 2024
(This article belongs to the Special Issue Feature Papers in "Computer Science & Engineering", 2nd Edition)

Abstract: We propose a story visualization technique that allows users to control the arrangement, poses, and styles of characters in a scene based on user-input doodle sketches. Our method utilizes a text encoder to process scene prompts and an image encoder to handle doodle sketches, generating inputs for a predefined scene generation model. Furthermore, we achieve efficient model training by fine-tuning the backbone network on a small dataset with a LoRA-based fine-tuning technique. We demonstrate that our method can generate characters with various poses and styles from doodle sketches, and we validate the advantages of our approach by comparing it with the results of other story visualization studies.

1. Introduction

Artists in various fields, including comics, cartoons, webtoons, and animation, require techniques that visualize a story as a series of scenes corresponding to the artists’ intention. However, story visualization typically demands significant time and effort to produce visually convincing results. Recently, story visualization models built on rapidly advancing generative models have proven highly effective in meeting these requirements. These models generate a series of scene images based on the story and characters provided by the user. While early models were developed primarily on generative adversarial networks (GANs) [1,2,3,4,5], recent trends in generative modeling have led to the emergence of diffusion- and transformer-based models [6,7,8,9,10]. Despite these advancements, most models are limited to generating a fixed number of scenes. Furthermore, they rely solely on text input, which constrains their ability to fully capture the user’s intent regarding scene composition, character actions, and other details.
We propose a story visualization model that overcomes these limitations by enabling control over the poses of characters and the composition of the scene. The goal of our model is to generate scenes that follow a doodle sketch while preserving the identity of the characters. A doodle sketch, which represents the pose of a character with a few strokes, is provided by the user. Unlike other models, our approach accepts doodle-level sketches as input, allowing users to control the pose of characters and the overall composition of the scene (see Figure 1). At the same time, the model preserves the characters’ identity throughout the generated scenes.
In order to generate high-resolution images while preserving character identity, we utilize Stable Diffusion XL (SDXL) [11] as our backbone network. This model is fine-tuned using LoRA [12] to accurately reflect the desired character identity. Additionally, in order to align the generated images with the user’s input in terms of sketch dynamics and composition, we employ a separate adapter [13] that controls the model using the provided doodle sketch.
We demonstrate the effectiveness of our model by quantitatively evaluating the quality and aesthetics of the generated scenes. First, we assess the quality of the generated scenes using LIQE [40] and evaluate the aesthetic aspects of the images with NIMA [15]. Finally, Q-align [16] is employed for a comprehensive evaluation that incorporates both criteria. Additionally, we present the results of a user study to further validate the superiority of our model’s outputs.
The key benefit of our framework is an intuitive scheme for controlling the characters generated by a story visualization model, which is a central challenge in this field. By providing a doodle that specifies a character’s pose with a simple set of lines and curves, users can control the poses of characters in the scenes that make up the visualized story.
This paper is organized as follows. We review related studies in Section 2. In Section 3, we explain our framework. We present our results in Section 4 and analyze them through various tests in Section 5. Finally, we draw conclusions and suggest future work in Section 6.

2. Related Work

2.1. Controlled Image Generation

The progress of generative models such as GANs [17] has greatly accelerated research on image generation. Various models have employed a GAN as a backbone to generate images under user-specified conditions [18,19,20,21,22]. Conditional GAN [18] generates images from predefined conditions, but the conditions are limited to numeric labels. Pix2pix [19] and ACGAN [20] allow richer conditions such as images; however, Pix2pix requires a paired dataset for image generation, and ACGAN requires segmenting the dataset according to the classes of the target images.
StyleGAN [21] and StyleGAN2 [22] allow for the generation of high-resolution images under more diverse conditions. Many models have employed the StyleGANs as a backbone to generate high-resolution images and have embedded conditions to control the generated output. pSp [23], one of these models, attempts to achieve fine-grained control by dividing condition details into three levels: coarse, medium, and fine. However, these models still have limitations, particularly in maintaining consistency between characters in the generated images and in effectively controlling character-specific attributes.
Recently, diffusion-based schemes have emerged for generating high-quality images [11,24,25,26,27]. A distinguishing feature of these models is that they generate images from textual inputs. However, due to their reliance on text-based control, fully reflecting the user’s intent remains challenging. Furthermore, maintaining identity consistency across multiple images remains problematic. To address these limitations, Zhang et al. [28] modified existing diffusion-based models, and other researchers [12,13,29] added adapters to obtain better control over conditions.

2.2. Diffusion-Based Image Generation

To generate user-desired images, several models [30,31,32,33] have been developed for editing existing images. Models such as Prompt-to-Prompt [30], Custom-Diffusion [31], InstructPix2Pix [34], and Mokady et al.’s model [32] allow users to input prompts specifying the desired modification and to generate edited output images. However, since these models rely solely on textual inputs to control the modifications, fully applying the user’s intent to the generated images is still challenging. Paint-by-Example [33] allows users to directly mask the areas they wish to modify and then performs inpainting to apply the changes. However, this approach has limitations in generating convincing results for characters or illustrations, as opposed to real photographs.

2.3. Story Visualization

Story visualization generates a series of images that reflect an input story. It must preserve the identities of characters while maintaining consistency across the generated images. Research in this field began with StoryGAN [1], which pioneered the study of story visualization. Subsequent works have developed various approaches using GANs as their backbone [2,3,4,5]. Although GAN-based models maintain scene consistency, they often fail to preserve character identity and suffer from low resolution. Moreover, these models require large amounts of data and extensive training time for each story.

3. Our Method

3.1. The Overview of Our Method

Our model generates scene images from both scene prompts and user-provided doodle sketches. The scene prompt is embedded with the CLIP text encoder $E_T$, and the doodle sketch with the CLIP image encoder $E_I$. Instead of training the entire model, we utilize LoRA [12] to train only a small number of parameters, allowing the model to better reflect the identity of the characters. Additionally, we employ a separate adapter to guide the image generation process in accordance with the input doodle sketches. The overview of our method is illustrated in Figure 2.
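As a concrete illustration, the following minimal sketch shows how such a doodle-conditioned pipeline could be assembled with the Hugging Face diffusers library, assuming an SDXL backbone, a sketch-type T2I-adapter checkpoint, and a separately trained character LoRA; the checkpoint identifiers, file paths, prompt, and conditioning scale are illustrative assumptions rather than the exact configuration used in this work.

```python
# A minimal sketch of a doodle-conditioned SDXL pipeline using Hugging Face
# diffusers. The checkpoint names, LoRA path, prompt, and conditioning scale
# are illustrative assumptions, not the authors' exact configuration.
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from PIL import Image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-sketch-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/character_lora")  # hypothetical character LoRA

doodle = Image.open("doodle.png").convert("RGB")  # user-drawn pose sketch
scene = pipe(
    prompt="Fred waves in the living room",       # scene prompt y
    image=doodle,                                 # doodle sketch I_ds
    adapter_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
scene.save("scene.png")
```

In such a setup, the adapter conditioning scale trades off how strictly the generated pose follows the doodle against the freedom of the diffusion prior.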

3.2. Fine-Tuning

To generate high-resolution scene images, we utilize a diffusion-based pretrained model $D_M$ as the backbone network. Each network in the model follows an attention structure, with each attention layer $F$ defined as in Equation (1).
$F = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right) \cdot V.$  (1)
In Equation (1), $Q$, $K$, and $V$, which denote the query, key, and value, respectively, are defined in Equations (2)–(4). To incorporate the identity of a character into the model, we fine-tune the model using the character image $I$. Instead of fine-tuning the entire model, we employ LoRA [12] to facilitate efficient learning. The process of fine-tuning is illustrated in Figure 3. Specifically, to train the character with LoRA, we learn two low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ in place of the original weight matrix $W$.
To apply the character image $I$ to each layer, we first embed it using the CLIP image encoder $E_I$ [35]. The formulas for incorporating $I$ into the model using LoRA are as follows:
$Q = W_Q^{i} \cdot \varphi_i(z_t).$  (2)
$K = W_K^{i} \cdot E_I(I).$  (3)
$V = W_V^{0} \cdot E_I(I) + BA \cdot E_I(I).$  (4)
Equations (2)–(4) define the query, key, and value of our framework, computed with the attention projection matrices $W_Q^{i}$, $W_K^{i}$, and $W_V^{0}$. The term $\varphi_i(z_t)$ represents the flattened version of $z_t$. Our model applies the LoRA structure exclusively to $V$ in order to learn the character image $I$. Since we use a pretrained model, all weight matrices $W$ are kept fixed, and only $A$ and $B$ are trained.
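For clarity, the following PyTorch sketch mirrors Equations (1)–(4): the pretrained projection weights are frozen and only the low-rank factors $A$ and $B$ on the value path are trainable. The class and argument names are our own, and the block is a simplified single-head illustration under these assumptions, not the actual SDXL attention implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAValueAttention(nn.Module):
    """Attention block sketching Equations (1)-(4): the frozen projections
    W_Q, W_K, W_V are kept fixed, and only the low-rank factors A (d x r)
    and B (r x k) on the value path are trained."""

    def __init__(self, dim_q: int, dim_kv: int, dim: int, rank: int = 4):
        super().__init__()
        self.w_q = nn.Linear(dim_q, dim, bias=False)   # W_Q^i, frozen
        self.w_k = nn.Linear(dim_kv, dim, bias=False)  # W_K^i, frozen
        self.w_v = nn.Linear(dim_kv, dim, bias=False)  # W_V^0, frozen
        for proj in (self.w_q, self.w_k, self.w_v):
            proj.weight.requires_grad_(False)
        # Trainable LoRA factors; B is zero-initialized so training starts
        # from the pretrained behaviour.
        self.lora_a = nn.Linear(dim_kv, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)
        self.scale = dim ** -0.5

    def forward(self, z_t: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        q = self.w_q(z_t)                                               # Eq. (2)
        k = self.w_k(image_emb)                                         # Eq. (3)
        v = self.w_v(image_emb) + self.lora_b(self.lora_a(image_emb))   # Eq. (4)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # Eq. (1)
        return attn @ v
```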

3.3. Doodle Sketch to Scene Image

Our model requires a user-specified doodle sketch $I_{ds}$ in order to control the overall layout of a scene according to the user’s intent. To incorporate this doodle sketch into the model, we treat it as a conditioning input and apply it using an adapter $\mathcal{A}$. We utilize the T2I-adapter proposed by Mou et al. [13] as our adapter $\mathcal{A}$. The adapter $\mathcal{A}$ is composed of a scale block, which consists of one convolutional network and two residual blocks, followed by four iterations of a downsampling pair. As a result, the doodle sketch $I_{ds}$ processed through $\mathcal{A}$ is embedded into four sketch vectors $F_{ds} = \{F_{ds}^{1}, F_{ds}^{2}, F_{ds}^{3}, F_{ds}^{4}\}$. These sketch vectors $F_{ds}$ are then added to the corresponding layers of the fine-tuned encoder $D_M$, represented as $F_{enc} = \{F_{enc}^{1}, F_{enc}^{2}, F_{enc}^{3}, F_{enc}^{4}\}$. The above process can be expressed mathematically as follows:
$F_{ds} = \mathcal{A}(E_I(I_{ds})), \qquad \hat{F}_{enc}^{\,i} = F_{enc}^{\,i} + F_{ds}^{\,i}, \quad i \in \{1, 2, 3, 4\}.$  (5)
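Equation (5) amounts to an element-wise addition of the adapter’s multi-scale sketch features to the encoder features at the matching resolutions. A minimal sketch is given below; the function and argument names are hypothetical.

```python
from typing import List

import torch


def inject_sketch_features(
    enc_feats: List[torch.Tensor],     # F_enc^1..F_enc^4 from the fine-tuned encoder D_M
    sketch_feats: List[torch.Tensor],  # F_ds^1..F_ds^4 produced by the adapter A
) -> List[torch.Tensor]:
    """Element-wise addition of Equation (5): each encoder feature map is
    shifted by the doodle-sketch feature of the same spatial resolution."""
    assert len(enc_feats) == len(sketch_feats) == 4
    return [f_enc + f_ds for f_enc, f_ds in zip(enc_feats, sketch_feats)]
```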
We also reflect the input scene text $y$ as a prompt for the generated scene image. The input prompt is embedded using the CLIP text encoder $E_T$ [35]. The loss function of our model, incorporating all the aforementioned conditions, is defined as follows:
$\mathcal{L} = \mathbb{E}_{x,\, z \sim \mathcal{N}(0,1),\, t,\, E_T,\, y,\, F_{ds}}\!\left[\,\lVert \epsilon - \epsilon_\theta(z_t, t, E_T(y), F_{ds}) \rVert_2^2\,\right].$  (6)
In Equation (6), $z_t$ denotes the noised latent at time step $t$, $E_I$ and $E_T$ are the CLIP image encoder and the CLIP text encoder, respectively, $y$ is the prompt, and $F_{ds}$ is the sketch vector.
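A single training step implementing this loss could look like the sketch below, written in a diffusers-style convention: sample a timestep, add noise to the latents, and regress the predicted noise against the true noise while the text embedding and sketch features condition the U-Net. The function name and the keyword used to pass the adapter residuals are assumptions.

```python
import torch
import torch.nn.functional as F


def doodle_guided_loss(unet, scheduler, latents, text_emb, sketch_feats):
    """One training step sketching Equation (6). `unet` and `scheduler` are
    assumed to follow diffusers-style interfaces; the keyword used to feed
    the adapter residuals into the U-Net is an assumption."""
    noise = torch.randn_like(latents)                        # epsilon
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)   # z_t
    noise_pred = unet(
        noisy_latents, t,
        encoder_hidden_states=text_emb,                      # E_T(y)
        down_block_additional_residuals=list(sketch_feats),  # F_ds
    ).sample
    return F.mse_loss(noise_pred, noise)                     # Equation (6)
```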

4. Implementation and Results

4.1. Implementation

We implemented our model on a cloud platform equipped with an Intel Xeon Platinum 8480 CPU and an NVIDIA H100 80 GB GPU. We employed SDXL [11] as the backbone network and CLIP ViT-B/32 [35] as the embedding encoder.

4.2. Results

We fine-tune existing pretrained story visualization models to incorporate our method, which takes doodle sketches as inputs. Our model visualizes three stories: Flintstone, Sailormoon, and Suzume. The pretrained models for these stories are collected as follows: we select the Flintstone model [36] from Hugging Face and the Suzume [37] and Sailormoon [38] models from CIVITAI as the pretrained story visualization models. These models are capable of generating scenes using the characters that appear in the selected story contents. However, they do not provide control over the poses of the characters as intended by the user. We demonstrate that, by allowing users to input specific poses using doodles for each story prompt, we can generate scene images that reflect both the poses specified by the doodles and the stories suggested in the prompts. Figure 1 presents the results for the Suzume story, Figure 4 shows the results for the Flintstone story, and Figure 5 displays the results for the Sailormoon story. In each figure, the images are generated from the same prompts; however, the results from our model are generated from both doodles and prompts.

5. Evaluation

5.1. Comparison

We compare our results with those of two existing studies in the field of story visualization: ARLDM [39] and StoryDallE [6]. These studies were selected because they provide visualization results for the Flintstone story, making them appropriate for comparison with our approach. We prepared four prompts and generated the corresponding scene images. Additionally, in our approach, we provided doodles to specify the characters. The comparison among our results, ARLDM, and StoryDallE is shown in Figure 6.

5.2. Quantitative Evaluation

We perform a quantitative evaluation using the metrics applied in previous story visualization studies, including LIQE [40], NIMA [15], and Q-align [16]. Q-align can assess both the quality and the aesthetic aspects of the generated scene images.
LIQE [40] is a blind image quality assessment metric that evaluates target images without reference images. It estimates image quality based on vision–language correspondence using transformer-based models. NIMA [15] estimates the aesthetic quality of an image at the semantic level using a deep CNN. It predicts a score on a ten-point scale using the likelihood estimated from its training dataset. Q-align [16] estimates image quality and aesthetics using large multi-modal models, which perform vision–language tasks such as image captioning and visual question answering.
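A minimal scoring sketch with the pyiqa toolbox is shown below, assuming that the package exposes these metrics under the identifiers used here; the image path is a placeholder.

```python
# Score a generated scene with no-reference metrics via pyiqa.
# The metric identifiers ("liqe", "nima", "qalign") and the image path
# are assumptions for illustration.
import torch
import pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("liqe", "nima", "qalign")}

scores = {name: float(metric("generated_scene.png"))
          for name, metric in metrics.items()}
print(scores)  # higher is better for all three metrics
```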
The results of these evaluations are presented in Table 1, showing that our model achieves higher scores compared to the results from previous studies.

5.3. Qualitative Evaluation

We prepared three questions regarding the images presented in Figure 6. The first question assesses the correspondence between the input prompt and the generated images. The second question evaluates the coherence of character identity across the generated images for the same content. The third question examines the quality of the generated images, specifically the absence of artifacts and overall visual excellence.
We recruited 20 participants for the qualitative evaluation of our study. Among the participants, 15 were in their twenties, 5 were in their thirties, 11 were male, and 9 were female. They were instructed to rate each image on a five-point scale, and the average score for images generated by each method was calculated. The results of this evaluation are presented in Table 2, where it is confirmed that the images generated by our method received the highest scores.

5.4. Ablation Study

We demonstrate the effectiveness of our model by showing that it can control the generated scene images through doodles with subtle variations. Figure 7 presents a pair of scene images generated from the same prompt. The only difference between the doodles used for scene generation is the variation in the characters’ hairstyles. The doodles in the upper row have a single ponytail, while the doodles in the lower row have two ponytails. The characters in the generated scene images accurately reflect this difference, confirming that our model can effectively incorporate doodle inputs to control the generated scenes.
We present another ablation study for LoRA. With LoRA, which fine-tunes the low-rank parameter matrices, our model preserves the identity of the characters. Without LoRA, the generated characters are more likely to lose their identity. Figure 8 illustrates the result of this ablation study.

5.5. Limitation

Although we achieve character consistency in terms of pose and facial expression, several limitations in preserving consistency remain. While the doodle sketches used in our model can effectively control a character’s position, movement, and hairstyle, we cannot preserve the consistency of the character’s clothing. As illustrated in Figure 9, the clothes of the characters or a ribbon on a piece of clothing do not match across a sequence of generated images.
Furthermore, background elements such as furniture, chauffeurs, or wallpaper may not match. As illustrated in Figure 9, the chauffeurs in the scenes are not consistent.

6. Conclusions and Future Work

In this study, we proposed a story visualization model that allows users to control the placement, pose, and style of characters using simple doodle sketches. By fine-tuning an existing story visualization model with a small dataset, our approach enables users to effectively control characters in the story visualization process. We demonstrated character control across various contents using doodle sketches. Moreover, we validated the advantages of our method by comparing and evaluating our results against those of other studies.
As future work, we will focus on developing methods to precisely control elements such as clothing, which are not fully controllable in this study. Additionally, we aim to explore techniques for generating continuous and consistent poses across consecutive scenes to enable animation.

Author Contributions

Conceptualization, H.Y. (Hyemin Yang); Methodology, H.Y. (Hyemin Yang); Validation, H.Y. (Heekyung Yang); Writing—original draft, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Sangmyung University in 2022.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y.; Gan, Z.; Shen, Y.; Liu, J.; Cheng, Y.; Wu, Y.; Carin, L.; Carlson, D.; Gao, J. Storygan: A sequential conditional gan for story visualization. In Proceedings of the CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 6329–6338. [Google Scholar]
  2. Wang, W.; Zhao, C.; Chen, H.; Chen, Z.; Zheng, K.; Shen, C. AutoStory: Generating diverse storytelling images with minimal human effort. arXiv 2023, arXiv:2311.11243. [Google Scholar]
  3. Song, Y.Z.; Rui Tam, Z.; Chen, H.J.; Lu, H.H.; Shuai, H.H. Character-preserving coherent story visualization. In Proceedings of the ECCV 2020, Virtual, 23–28 August 2020; pp. 18–33. [Google Scholar]
  4. Maharana, A.; Hannan, D.; Bansal, M. Improving generation and evaluation of visual stories via semantic consistency. In Proceedings of the NAACL 2021, Online, 6–11 June 2021; pp. 2427–2442. [Google Scholar]
  5. Qiao, T.; Zhang, J.; Xu, D.; Tao, D. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 1505–1514. [Google Scholar]
  6. Maharana, A.; Hannan, D.; Bansal, M. Storydall-e: Adapting pretrained text-to-image transformers for story continuation. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 70–87. [Google Scholar]
  7. Ravi, H.; Kafle, K.; Cohen, S.; Brandt, J.; Kapadia, M. Aesop: Abstract encoding of stories, objects, and pictures. In Proceedings of the ICCV 2021, Virtual, 11–17 October 2021; pp. 2052–2063. [Google Scholar]
  8. Chen, H.; Han, R.; Wu, T.L.; Nakayama, H.; Peng, N. Character-centric story visualization via visual planning and token alignment. In Proceedings of the EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8259–8272. [Google Scholar]
  9. Li, B.; Lukasiewicz, T. Learning to model multimodal semantic alignment for story visualization. In Proceedings of the EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 4712–4718. [Google Scholar]
  10. Li, B.; Lukasiewicz, T. Word-level fine-grained story visualization. In Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 347–362. [Google Scholar]
  11. Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In Proceedings of the ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  12. Hu, E.J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the ICLR 2022, Virtual, 25–29 April 2022. [Google Scholar]
  13. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI 2024, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4296–4304. [Google Scholar]
  14. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the ICCV 2023, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  15. Talebi, H.; Milanfar, P. NIMA: Neural image assessment. IEEE Trans. Image Process. 2018, 27, 3998–4011. [Google Scholar] [CrossRef] [PubMed]
  16. Wu, H.; Zhang, Z.; Zhang, W.; Chen, C.; Liao, L.; Li, C.; Gao, Y.; Wang, A.; Zhang, E.; Sun, W.; et al. Q-Align: Teaching lmms for visual scoring via discrete text-defined levels. In Proceedings of the ICML 2024, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  17. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 6190–6200. [Google Scholar]
  18. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  19. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  20. Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the ICML 2017, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
  21. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410. [Google Scholar]
  22. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the CVPR 2020, Virtual, 14–19 June 2020; pp. 8110–8119. [Google Scholar]
  23. Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; Cohen-Or, D. Encoding in style: A stylegan encoder for image-to-image translation. In Proceedings of the CVPR 2021, Virtual, 19–25 June 2021; pp. 2287–2296. [Google Scholar]
  24. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 10684–10695. [Google Scholar]
  25. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  26. Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorez, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the ICML 2024, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  27. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510. [Google Scholar]
  28. Zhang, Y.; Dong, W.; Tang, F.; Huang, N.; Huang, H.; Ma, C.; Lee, T.; Deussen, O.; Xu, C. Prospect: Prompt spectrum for attribute-aware personalization of diffusion models. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
  29. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721. [Google Scholar]
  30. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y. Prompt-to-prompt image editing with cross attention control. In Proceedings of the ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.Y. Custom diffusion: Multi-concept customization of text-to-image diffusion. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 1931–1941. [Google Scholar]
  32. Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 6038–6047. [Google Scholar]
  33. Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 18381–18391. [Google Scholar]
  34. Brooks, T.; Holynski, A.; Efros, A.A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 18392–18402. [Google Scholar]
  35. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the ICML 2021, Virtual, 18–24 July 2021; pp. 8748–8764. [Google Scholar]
  36. Available online: https://huggingface.co/juliajoanna/sd-flintstones-model-lora-sdxl (accessed on 20 August 2024).
  37. Available online: https://civitai.com/models/191195/lora-sdxl-10-sd-15-suzume-iwato-makoto-shinkai (accessed on 15 August 2024).
  38. Available online: https://civitai.com/models/330590/lorasdxlponysailormoonmizunoami (accessed on 10 August 2024).
  39. Pan, X.; Qin, P.; Li, Y.; Xue, H.; Chen, W. Synthesizing coherent story with auto-regressive latent diffusion models. In Proceedings of the WACV 2024, Waikoloa, HI, USA, 4–8 January 2024; pp. 2924–2930. [Google Scholar]
  40. Zhang, W.; Zhai, G.; Wei, Y.; Yang, X.; Ma, K. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the CVPR 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 14071–14081. [Google Scholar]
Figure 1. Teaser image. We generate scenes from user-specified prompts with doodle sketches. For each prompt, our model allows users to input doodle sketches that guide the pose of the characters in the scene. The characters in the generated scenes reflect the doodle sketches for their poses.
Figure 2. The overview of our method: the user-created prompt is processed through the CLIP text encoder $E_T$, and the doodle sketch is processed through the CLIP image encoder $E_I$. The output from $E_I$ is processed through the adapter $\mathcal{A}$. The outputs from $E_T$ and $\mathcal{A}$ are processed through our fine-tuned pretrained model $D_M$ to generate a scene image that visualizes the input prompt and the doodle sketch.
Figure 3. The process of fine-tuning in our model. The dataset for fine-tuning is the Flintstone dataset: (A) the overall structure of our pipeline for fine-tuning, (B) the fine-tuning process, (C) the result from fine-tuning.
Figure 4. Our results together with the results for Flintstone content from the existing model [36]. Our model and the existing model use the same prompts shown at the top of the images. The upper row is from the existing model and the lower row is from ours. The doodle used in the scene generation is embedded in the result images.
Figure 5. Our results together with the results for Sailormoon content from the existing model [38]. Our model and the existing model use the same prompts shown at the top of the images. The upper row is from the existing model and the lower row is from ours. The doodle used in the scene generation is embedded in the result images.
Figure 6. A comparison with existing story visualization models for the Flintstone story.
Figure 7. A comparison of scene images generated from two doodles with a subtle difference. The characters in the upper row have a single ponytail, while the characters in the lower row have two ponytails.
Figure 8. The result of the ablation study. The images in the upper row are generated with LoRA, while the images in the lower row are generated without LoRA. The images generated without LoRA lose the identities of the characters.
Figure 9. Limitation of our study: (a) The clothing and chauffeurs do not match. (b) The ribbons and clothing do not match.
Table 1. The comparison of our results with the existing models.

                      StoryDallE    ARLDM      Ours
LIQE                  1.94897       1.04155    2.32057
NIMA                  3.33332       3.44587    4.58561
Q-align (quality)     1.881325      2.03805    3.60447
Q-align (aesthetic)   1.3723        1.6733     3.60447
Table 2. The results of the user study.

                  StoryDallE   ARLDM   Ours
Correspondence    3.67         4.06    4.25
Coherence         3.81         4.27    4.39
Quality           2.65         3.94    4.84