Artfusion: A Diffusion Model-Based Style Synthesis Framework for Portraits
Abstract
1. Introduction
- We present an artistic portrait generation framework that can be trained on a small dataset: whereas existing generative models require large training corpora, ours learns the specific style of an artist or an art movement from only tens of sample artwork images.
- We present a diffusion model-based framework that applies pervasive styles, including the style of an individual artist and the style of an art movement, to a portrait photograph while preserving the identity of the portrait (see the sketch after this list).
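To make the pipeline concrete before the detailed exposition in Section 4, the following is a minimal sketch of the general recipe the framework builds on: a latent diffusion backbone [25] with ControlNet conditioning [32] (cf. Sections 4.2.2 and 4.3). It is not the authors' implementation; the model IDs, the fine-tuned style checkpoint path, the placeholder style token, and all parameters below are illustrative assumptions.

```python
# Hedged sketch of style synthesis with identity-preserving conditioning,
# assuming the Hugging Face diffusers library. NOT the authors' exact code:
# model IDs, checkpoint path, token, and parameters are illustrative.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# An edge map of the input portrait conditions the UNet, which is one common
# way to keep the sitter's identity while the style changes.
portrait = Image.open("portrait.jpg").convert("RGB").resize((512, 512))
edges = cv2.Canny(np.array(portrait), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "./artfusion-renoir",  # hypothetical checkpoint fine-tuned on tens of style images
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# "<style>" stands in for whatever token the fine-tuning stage bound to the style.
result = pipe(
    "a portrait in <style> style",
    image=control_image,
    num_inference_steps=30,
).images[0]
result.save("stylized_portrait.png")
```

In this kind of setup the structural condition is what lets the denoising UNet keep the portrait's identity while the fine-tuned weights supply the style; Section 4.2 describes the authors' actual conditioning scheme.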
2. Related Work
2.1. Early Style Transfer Methods
2.1.1. Universal Style Transfer
2.1.2. Arbitrary Style Transfer
2.1.3. Contrastive Learning Style Transfer
2.2. GAN-Based Style Synthesis
2.3. StyleGAN-Based Style Synthesis
2.4. Diffusion Model-Based Style Synthesis
3. Overview
4. Our Framework
4.1. Extraction of Style
4.2. Maintenance of Identity
4.2.1. Multi-Layered UNet
4.2.2. Adding Condition into UNet Using ControlNet
4.3. Application of Style
5. Implementation and Results
5.1. Implementation
5.2. Dataset
5.3. Hyperparameters
5.4. Results
6. Evaluation
6.1. Quantitative Evaluation
6.2. Qualitative Evaluation
- (#01∼#10) (original-artist): questions on the style of an artist, where the target is an original artwork image.
- (#11∼#20) (generated-artist): questions on the style of an artist, where the target is a generated artwork image.
- (#21∼#30) (original-art movement): questions on the style of an art movement, where the target is an original artwork image.
- (#31∼#40) (generated-art movement): questions on the style of an art movement, where the target is a generated artwork image.
6.3. Ablation Study
6.4. Limitation
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2414–2423.
- Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510.
- An, J.; Huang, S.; Song, Y.; Dou, D.; Liu, W.; Luo, J. ArtFlow: Unbiased image style transfer via reversible neural flows. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 862–871.
- Deng, Y.; Tang, F.; Dong, W.; Sun, W.; Huang, F.; Luo, J. Arbitrary style transfer via multi-adaptation network. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2719–2727.
- Chen, H.; Wang, Z.; Zhang, H.; Zuo, Z.; Li, A.; Xing, W.; Lu, D. Artistic style transfer with internal-external learning and contrastive learning. In Proceedings of the Conference on Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 26561–26573.
- Zhang, Y.; Tang, F.; Dong, W.; Huang, H.; Ma, C.; Lee, T.Y.; Xu, C. Domain enhanced arbitrary style transfer via contrastive learning. In Proceedings of ACM SIGGRAPH, Vancouver, BC, Canada, 7–11 August 2022.
- Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6924–6932.
- Dumoulin, V.; Shlens, J.; Kudlur, M. A learned representation for artistic style. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Park, D.Y.; Lee, K.H. Arbitrary style transfer with style-attentional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5880–5888.
- Huo, J.; Jin, S.; Li, W.; Wu, J.; Lai, Y.K.; Shi, Y.; Gao, Y. Manifold alignment for semantically aligned style transfer. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 14861–14869.
- Liu, S.; Lin, T.; He, D.; Li, F.; Wang, M.; Li, X.; Sun, Z.; Li, Q.; Ding, E. AdaAttN: Revisit attention mechanism in arbitrary neural style transfer. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6649–6658.
- Li, X.; Liu, S.; Kautz, J.; Yang, M.H. Learning linear transformations for fast image and video style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3809–3817.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–11 December 2014; Volume 27.
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 8110–8119.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
- Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 172–189.
- Abdal, R.; Qin, Y.; Wonka, P. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4435–4441.
- Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; Cohen-Or, D. Encoding in style: A StyleGAN encoder for image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2287–2296.
- Abdal, R.; Zhu, P.; Mitra, N.J.; Wonka, P. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Trans. Graph. 2021, 40, 21.
- Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training generative adversarial networks with limited data. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 12104–12114.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 6840–6851.
- Zhang, Y.; Huang, N.; Tang, F.; Huang, H.; Ma, C.; Dong, W.; Xu, C. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 10146–10156.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
- Wright, A.; Ommer, B. ArtFID: Quantitative evaluation of neural style transfer. In DAGM German Conference on Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2022; pp. 560–576.
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
- Song, G.; Luo, L.; Liu, J.; Ma, W.C.; Lai, C.; Zheng, C.; Cham, T.J. AgileGAN: Stylizing portraits by inversion-consistent transfer learning. ACM Trans. Graph. 2021, 40, 117.
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8162–8171.
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. arXiv 2023, arXiv:2302.05543.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Wu, X.; Hu, Z.; Sheng, L.; Xu, D. Styleformer: Real-time arbitrary style transfer via parametric style composition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 10684–10695.
- Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. StyTr2: Image style transfer with transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11326–11336.
- Zhang, W.; Zhai, G.; Wei, Y.; Yang, X.; Ma, K. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14071–14081.
| | Ours | NST [1] | CAST [6] | InST [24] | Styleformer [35] | StyTr2 [36] | IEST [5] | CycleGAN [16] |
|---|---|---|---|---|---|---|---|---|
| Renoir | 137.9 | 141.9 | 190.7 | 224.5 | 255.4 | 193.7 | 140.9 | 235.1 |
| Mucha | 259.0 | 269.8 | 254.8 | 294.2 | 288.5 | 270.1 | 213.1 | 405.3 |
| Tiziano | 155.6 | 164.8 | 180.7 | 228.5 | 222.6 | 204.8 | 177.0 | 361.1 |
| Baroque | 175.8 | 208.2 | 173.2 | 264.9 | 220.3 | 231.0 | 184.8 | 239.9 |
| Rococo | 140.4 | 127.7 | 161.3 | 239.7 | 228.7 | 196.2 | 196.8 | 210.5 |
| Italian | 140.5 | 225.9 | 180.7 | 293.0 | 237.3 | 219.3 | 214.1 | 221.4 |
| min | 137.9 | 127.7 | 161.3 | 224.5 | 220.3 | 193.7 | 140.9 | 210.5 |
| max | 259.0 | 269.8 | 254.8 | 294.2 | 288.5 | 270.1 | 214.1 | 405.3 |
| average | 168.2 | 189.7 | 190.2 | 257.5 | 242.1 | 223.8 | 187.8 | 278.9 |
| | Ours | NST [1] | CAST [6] | InST [24] | Styleformer [35] | StyTr2 [36] | IEST [5] | CycleGAN [16] |
|---|---|---|---|---|---|---|---|---|
| Renoir | 0.62 | 0.74 | 0.80 | 0.65 | 0.63 | 0.78 | 0.81 | 0.71 |
| Mucha | 0.53 | 0.62 | 0.77 | 0.51 | 0.54 | 0.63 | 0.68 | 0.63 |
| Tiziano | 0.59 | 0.68 | 0.75 | 0.63 | 0.65 | 0.72 | 0.76 | 0.72 |
| Baroque | 0.62 | 0.68 | 0.78 | 0.67 | 0.61 | 0.72 | 0.75 | 0.68 |
| Rococo | 0.61 | 0.67 | 0.82 | 0.63 | 0.70 | 0.74 | 0.75 | 0.75 |
| Italian | 0.61 | 0.64 | 0.79 | 0.62 | 0.65 | 0.73 | 0.74 | 0.74 |
| min | 0.53 | 0.62 | 0.75 | 0.51 | 0.54 | 0.63 | 0.68 | 0.63 |
| max | 0.62 | 0.74 | 0.82 | 0.67 | 0.70 | 0.78 | 0.81 | 0.75 |
| average | 0.60 | 0.67 | 0.78 | 0.62 | 0.63 | 0.72 | 0.75 | 0.70 |
| | Ours | NST [1] | CAST [6] | InST [24] | Styleformer [35] | StyTr2 [36] | IEST [5] | CycleGAN [16] |
|---|---|---|---|---|---|---|---|---|
| Renoir | 3.78 | 3.27 | 3.27 | 2.89 | 2.24 | 2.24 | 3.42 | 1.44 |
| Mucha | 4.08 | 2.23 | 2.23 | 2.96 | 2.69 | 2.82 | 4.67 | 2.44 |
| Tiziano | 4.23 | 1.54 | 3.46 | 3.40 | 2.55 | 2.55 | 3.94 | 1.91 |
| Baroque | 4.69 | 3.66 | 3.59 | 3.80 | 2.65 | 2.47 | 4.43 | 2.26 |
| Rococo | 4.52 | 1.62 | 1.62 | 3.44 | 2.33 | 2.33 | 3.03 | 1.89 |
| Italian | 4.64 | 1.95 | 4.07 | 4.42 | 2.69 | 2.69 | 4.54 | 2.00 |
| min | 3.78 | 1.54 | 1.62 | 2.89 | 2.24 | 2.24 | 3.03 | 1.44 |
| max | 4.69 | 3.66 | 4.07 | 4.42 | 2.69 | 2.82 | 4.67 | 2.44 |
| average | 4.32 | 2.38 | 3.04 | 3.48 | 2.53 | 2.52 | 4.00 | 1.99 |
| | Original-Artist | Generated-Artist | Original-Art Movement | Generated-Art Movement |
|---|---|---|---|---|
| average | 7.97 | 7.53 | 7.10 | 6.80 |
| standard deviation | 0.98 | 1.02 | 1.14 | 1.35 |
| p value (original vs. generated) | 0.1056 | | 0.3639 | |
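The p values compare the original against the generated scores within each category; values above 0.05 indicate no statistically significant difference, i.e., participants rated the generated artworks comparably to the originals. Below is a minimal sketch of how such a comparison could be computed. Assumptions not stated in the section: the test is Welch's two-sample t-test, and the score arrays are hypothetical stand-ins for the real survey responses.

```python
# Hedged sketch: testing whether "original" and "generated" user-study scores
# differ. Welch's two-sample t-test and the raw score arrays below are
# illustrative assumptions, not the paper's actual data or stated method.
import numpy as np
from scipy import stats

# Hypothetical per-question mean scores on a 10-point scale.
original_artist = np.array([8.1, 7.4, 9.2, 7.0, 8.5, 7.9, 6.8, 8.3, 8.8, 7.7])
generated_artist = np.array([7.6, 6.9, 8.4, 6.5, 8.0, 7.8, 6.2, 7.9, 8.1, 7.9])

t_stat, p_value = stats.ttest_ind(original_artist, generated_artist, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# p > 0.05 would mean the generated artworks are rated indistinguishably
# from the originals at the 5% significance level.
```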