1. Introduction
Symmetry and asymmetry are fundamental concepts in computer vision and artificial intelligence (AI), and they strongly influence the visual effect of a synthesized image or video: symmetric style proportions and asymmetric style distributions produce distinct style transfer effects. Style transfer, an important branch of computer vision, covers a diverse set of application scenarios and development opportunities, and it can be approached from a variety of perspectives, including content, layout, and design. The style transfer task introduces style images to add new style elements to an original image or video while preserving its content features. Demand for changes in visual style has grown in recent years as short-video technology has become more widely used. At the same time, for images whose semantics and layout are not obvious, changing the style can highlight them.
Current mainstream style transfer technology can be divided into two categories: deep learning-based (conventional) style transfer and diffusion model-based style transfer. Conventional image and video style transfer derives from the CNN-based image transfer method proposed by Gatys et al. [1], which triggered significant technological innovation. The method uses the VGG convolutional neural network to extract style features from image content and adjusts the generated image via gradient descent; however, the computational cost is enormous. To solve this problem, Johnson et al. [2] proposed a feed-forward method that optimizes the model and runs in real time. Building on the continuous development of global style transfer research [3,4], work on applying multiple styles to a single content image [5,6,7,8,9,10,11] and on applying different styles to different areas of a single content image [12,13] is growing. However, these deep learning-based style transfer methods still need improvement in generation quality and visual effect.
In recent years, with the rapid development of artificial intelligence generation models such as ChatGPT [14] and Stable Diffusion [15], computer vision research and applications have improved in unprecedented ways, as has the generation quality of style transfer. Diffusion models, as representative generative models in artificial intelligence, are widely used in a variety of challenging image synthesis tasks and achieve state-of-the-art results on image data and beyond. To synthesize high-quality stylized images, diffusion model-based style transfer has therefore become a research hotspot in computer vision, and the quality and stylistic innovation of the resulting images and videos constantly refresh our expectations. Any-to-any style transfer [16] realizes free style transfer from any area of the style image to any area of the content image; however, because each area of the output image carries a different style, the output lacks overall coordination and artistry. InST [17] transforms artistic styles into a learnable textual description of a painting so as to more faithfully capture and transfer the painting's artistic style; however, its generalization ability is poor. ProSpect [18] disentangles visual attributes such as texture material, layout, and artistic style from a single image, realizes style transfer of individual visual attributes, and enhances the controllability and flexibility of the style transfer effect.
Although diffusion models achieve an ideal fusion of style and content, they often require a large investment of computational resources and a complex optimization process. For this reason, Style Injection [19] focuses on the self-attention mechanism of the U-Net in the latent diffusion model and realizes style transfer by replacing the K and V vectors in the stylized image generation process with those of the style image, but it can only perform single-style transfer. Departing from traditional style transfer, DEADiff [20] proposes a dual decoupled representation extraction mechanism that identifies the semantics and style of the style image separately, alleviating the semantic conflict between content and style images. However, most diffusion model-based style transfer methods require text to guide content generation in order to achieve impressive visual effects. Text guidance makes the generated content unstable and prevents directed style transfer.
Employing a large number of words to describe the semantics, layout, and style of stylized images introduces considerable uncertainty, and the resulting unpredictability of content and layout frequently fails to meet users' needs for fixed-content style transfer. Furthermore, in multi-style transfer, the artistic styles of multiple style images interfere with one another, making it difficult to transfer and express multiple styles effectively; this has long been a problem in style transfer research. At the same time, users increasingly value the efficient style transfer and artistic transmission offered by training-free approaches.
Based on the above problems and challenges, this paper proposes multi-source training-free controllable style transfer via diffusion models, as shown in Figure 1. The proposed method can obtain different style effects by setting different style control weights and does not require text guidance. We choose the latent diffusion model, a representative diffusion model, as the main framework to ensure state-of-the-art generation quality. Meanwhile, the training-free design reduces the consumption of computing resources, and text-free guidance eliminates dependence on extensive textual descriptions. This novel style transfer method can satisfy personalized style transfer requirements and broaden the applicability of style transfer technology. The proposed method consists of inversion noise fusion (INF) and a controllable weighted value method, and it is applicable to both images and videos, with styles coming from multiple sources, as shown in Figure 2. The main contributions of this paper are as follows:
We synthesize the initial noise of the stylized image diffusion sampling process using inversion noise fusion (INF), which enables control of the color, brightness, and clarity of the synthesized stylized image or video.
To further improve the color, texture, and semantics of stylized images, we strengthen the sources of content and style features via a simple manipulation of the features in self-attention, which we refer to as the controllable weighted value method of the U-Net self-attention mechanism. This method can control the tendency toward each of multiple artistic styles for the purpose of multi-style transfer.
The proposed method can perform single-style or multi-style transfer for image or video content, handles multi-source data, and has high generalization ability. Extensive experiments on the style transfer dataset validate that the proposed method significantly outperforms previous methods and achieves state-of-the-art performance.
Figure 1. Results of multi-source training-free controllable style transfer. Different style effects can be obtained by setting different style control weights.
Figure 2. The proposed method can effectively realize the style transfer of images or videos.
3. Method
Given a content image or video and one or more style images, our goal is to generate style-controlled stylized images or videos. The stylized results conform to the single or multiple guide styles while preserving the structure and semantic layout of the content image or video. The latent diffusion model serves as the primary framework. Specifically, the proposed method consists of two primary parts. The first is inversion noise fusion, in which the initial noise of the sampling process is the fusion of the inversion noise of the content image and that of the single or multiple style images. The second is the controllable weighted value of the U-Net self-attention mechanism, which weights the Q, K, and V vectors during diffusion sampling, followed by the generation of a stylized image or video.
Figure 3 shows the overall network structure of the multi-source training-free controllable style transfer method proposed in this paper. Video style transfer follows the same network structure as in Figure 3, with the proposed method applied to each frame.
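Both components operate on DDIM inversion noise obtained from the content image, from each style image, and from every video frame. As background, the following is a minimal sketch of a deterministic DDIM inversion loop, assuming a toy noise predictor and schedule that stand in for the Stable Diffusion U-Net and its scheduler; the function name and arguments are illustrative, not the paper's implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alphas_cumprod: torch.Tensor,
                timesteps: list[int]) -> torch.Tensor:
    """Deterministic DDIM inversion: map a clean latent x0 to its latent noise
    by running the DDIM update in reverse along an increasing timestep sequence."""
    x = x0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)                                    # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # move one step noisier
    return x

# Toy usage: a random "latent" and a dummy noise predictor standing in for the U-Net.
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
eps_model = lambda x, t: torch.zeros_like(x)   # placeholder; SD would predict real noise
x0 = torch.randn(1, 4, 64, 64)
x_T = ddim_invert(x0, eps_model, alphas_cumprod, timesteps=list(range(0, 1000, 20)))
print(x_T.shape)  # torch.Size([1, 4, 64, 64])
```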
3.1. Inversion Noise Fusion (INF)
This paper is inspired by DDIM inversion [27], which reconstructs a real image by recovering its latent noise. In style transfer, a fused latent noise that carries the semantic features of the content image and the style features of the style images can likewise be obtained, enabling stylized image reconstruction. As shown in Figure 3, we propose the inversion noise fusion (INF) method, which fuses the inversion noise of one content image and multiple style images to construct the initial latent noise. The specific fusion formula is as follows:
$$x^{cs} = \lambda\left(x - \mathrm{mean}(x)\right) + \omega_{c}\,\mathrm{mean}(x) + \sum_{i=1}^{N}\omega_{i}\,\mathrm{mean}(x_{i}),$$

where $x$ represents the inversion noise of the content image, $x_{i}$ represents the inversion noise of the $i$-th style image, $i$ is the index of the style images, $N$ is the total number of style images, $\mathrm{mean}(\cdot)$ denotes taking the mean, $\lambda$ is the weight of the difference between the inversion noise of the content image and its noise mean, $\omega_{c}$ is the weight of the noise mean of $x$, and $\omega_{i}$ is the weight of the noise mean of the $i$-th style image. The two sets of weights satisfy the constraint $\omega_{c} + \sum_{i=1}^{N}\omega_{i} = 1$.
In the fusion formula, $\lambda\left(x - \mathrm{mean}(x)\right)$ mainly describes the semantics and layout of the content image, $\omega_{c}\,\mathrm{mean}(x)$ controls the brightness and color of the content image, and $\omega_{i}\,\mathrm{mean}(x_{i})$ incorporates the style of each style image into the fused noise according to its weight, producing the effect of content and style fusion. By adjusting the three weights $\lambda$, $\omega_{c}$, and $\omega_{i}$, the proportion of the content image's semantics and the style images' styles in the generated stylized image can be adjusted, yielding an effective balance of content and style. In multi-style transfer experiments, the weights of the style images can be balanced so that the various styles are presented uniformly; depending on the experimental goal, the weight of a particular style can also be raised to achieve personalized style integration.
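As a concrete illustration, the following is a minimal sketch of the INF fusion step as plain tensor arithmetic, assuming the DDIM inversion noises have already been computed; the function name inf_fuse and the argument names are illustrative assumptions, not the paper's implementation.

```python
import torch

def inf_fuse(x_content: torch.Tensor,
             x_styles: list[torch.Tensor],
             lam: float,
             w_c: float,
             w_s: list[float]) -> torch.Tensor:
    """Inversion noise fusion (INF), sketched from the fusion formula above.

    x_content : DDIM inversion noise of the content image, shape (C, H, W).
    x_styles  : DDIM inversion noises of the N style images, same shape.
    lam       : weight of the content-noise deviation (controls clarity).
    w_c, w_s  : mean weights of content / style noises, with w_c + sum(w_s) == 1.
    """
    assert len(x_styles) == len(w_s)
    mean_c = x_content.mean()                      # noise mean of the content image
    fused = lam * (x_content - mean_c) + w_c * mean_c
    for w_i, x_i in zip(w_s, x_styles):
        fused = fused + w_i * x_i.mean()           # inject each style's noise mean
    return fused

# Toy usage: one content noise and two style noises with evenly balanced style weights.
x_c = torch.randn(4, 64, 64)
x_s = [torch.randn(4, 64, 64), torch.randn(4, 64, 64)]
x_cs = inf_fuse(x_c, x_s, lam=1.0, w_c=0.5, w_s=[0.25, 0.25])
print(x_cs.shape)  # torch.Size([4, 64, 64])
```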
3.2. Controllable Weighted Value Method of U-Net Self-Attention Mechanism
After INF, a stylized image with a certain style effect can already be generated; however, during noise fusion, the style of the content image is also fused into the stylized image, which weakens the style of the style image. To further distinguish the sources of features and generate a better stylization effect, this paper proposes a controllable weighted value method based on observations and inspiration from previous research [19,22,25,30] on the attention mechanism of diffusion models. The style transfer process in this study is guided by both content and style images rather than by text. Therefore, given the role of self-attention in propagating pixel information, we choose the self-attention mechanism as the focus for extracting and transferring style elements such as texture and color, and finally propose a controllable weighted value method for the U-Net self-attention mechanism.
The fused noise $x^{cs}$ of the content image $I^{c}$ and the single or multiple style images $I^{s_{i}}$, $i = 1,\dots,N$, is obtained via the INF method and employed as the initial latent noise for stylized image sampling. To distinguish and strengthen the content and style features, the query vector $Q^{cs}$, key vector $K^{cs}$, and value vector $V^{cs}$ of the U-Net self-attention mechanism are re-valued during the denoising sampling of $x^{cs}$. To keep the content and layout of the generated stylized image consistent with the content image $I^{c}$, the weighted query vector $Q^{c}$ corresponding to $I^{c}$ is injected into $Q^{cs}$. To ensure the transfer of semantic textures in the style images, the query vectors $Q^{s_{i}}$ corresponding to the style images $I^{s_{i}}$, $i = 1,\dots,N$, are also weighted and injected into $Q^{cs}$. To align the style of the stylized image with the style image $I^{s_{i}}$, the key vector $K^{s_{i}}$ and value vector $V^{s_{i}}$ of the style image are weighted and injected into $K^{cs}$ and $V^{cs}$, respectively. To maintain the fidelity of the content image's texture features beyond semantics and layout, the key vector $K^{c}$ and value vector $V^{c}$ corresponding to the content image $I^{c}$ are also weighted and injected into $K^{cs}$ and $V^{cs}$, respectively. The query, key, and value vectors of the content and style images are obtained from the DDIM inversion process. After weighting, the query, key, and value vectors of the U-Net self-attention mechanism in the stylized image sampling process are expressed as follows:
$$Q^{cs} = \alpha_{Q}^{c}\,Q^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\,Q^{s_{i}}\right),\quad K^{cs} = \alpha_{K}^{c}\,K^{c} + \operatorname{mean}_{i}\!\left(\alpha_{K}^{s_{i}}\,K^{s_{i}}\right),\quad V^{cs} = \alpha_{V}^{c}\,V^{c} + \operatorname{mean}_{i}\!\left(\alpha_{V}^{s_{i}}\,V^{s_{i}}\right),$$

where $\operatorname{mean}_{i}(\cdot)$ denotes taking the mean over the style images, $i \in \{1,\dots,N\}$ is the index of the style images, and $N$ is the total number of style images. $\alpha_{Q}^{c}$, $\alpha_{K}^{c}$, and $\alpha_{V}^{c}$, respectively, represent the value weights of the query, key, and value vectors of the content image; $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$, respectively, represent the value weights of the query, key, and value vectors of the $i$-th style image. The weight constraints are as follows:

$$\alpha_{Q}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\right) = 1,\quad \alpha_{K}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{K}^{s_{i}}\right) = 1,\quad \alpha_{V}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{V}^{s_{i}}\right) = 1.$$
$\alpha_{Q}^{c}$ and $\alpha_{Q}^{s_{i}}$ are used to jointly control the semantics and layout of the stylized image. The two groups of weights $\alpha_{K}^{c}$, $\alpha_{V}^{c}$ and $\alpha_{K}^{s_{i}}$, $\alpha_{V}^{s_{i}}$ are used to control the color, brightness, texture, and clarity of the style effect, and jointly strengthen the style effect. For multi-style transfer tasks, the weight values of $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$ can be adjusted to control the degree of style inclination; when the experimental target is an even presentation of each style, $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$ can each be set to the same value across all styles. By constructing this weighted value of the self-attention mechanism for stylized images, the style transfer from a single set of content features and single or multiple style features is realized as follows:

$$\operatorname{Attn}\!\left(Q^{cs},K^{cs},V^{cs}\right) = \operatorname{softmax}\!\left(\frac{Q^{cs}\left(K^{cs}\right)^{\top}}{\sqrt{d}}\right)V^{cs},$$

where $d$ is the dimension of the key vectors.
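To make the weighting concrete, below is a minimal sketch, assuming the Q/K/V tensors of the content image and the style images have already been collected from the corresponding DDIM inversion passes; the function names weighted_qkv and self_attention and the weight argument names are illustrative assumptions, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def weighted_qkv(q_c, k_c, v_c, qkv_styles, a_q_c, a_k_c, a_v_c, a_q_s, a_k_s, a_v_s):
    """Controllable weighted value of the self-attention Q/K/V (sketch).

    q_c, k_c, v_c : content-image vectors, shape (heads, tokens, dim).
    qkv_styles    : list of (q_s, k_s, v_s) tuples, one per style image.
    a_*_c, a_*_s  : scalar weights for the content image and per-style weights.
    """
    q = a_q_c * q_c + torch.stack([w * qs for w, (qs, _, _) in zip(a_q_s, qkv_styles)]).mean(0)
    k = a_k_c * k_c + torch.stack([w * ks for w, (_, ks, _) in zip(a_k_s, qkv_styles)]).mean(0)
    v = a_v_c * v_c + torch.stack([w * vs for w, (_, _, vs) in zip(a_v_s, qkv_styles)]).mean(0)
    return q, k, v

def self_attention(q, k, v):
    """Standard scaled dot-product self-attention applied to the fused Q/K/V."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return attn @ v

# Toy usage with one content image and two style images.
shape = (8, 256, 64)                                   # (heads, tokens, dim)
q_c, k_c, v_c = (torch.randn(shape) for _ in range(3))
styles = [tuple(torch.randn(shape) for _ in range(3)) for _ in range(2)]
q, k, v = weighted_qkv(q_c, k_c, v_c, styles,
                       a_q_c=0.6, a_k_c=0.2, a_v_c=0.2,
                       a_q_s=[0.4, 0.4], a_k_s=[0.8, 0.8], a_v_s=[0.8, 0.8])
out = self_attention(q, k, v)
print(out.shape)  # torch.Size([8, 256, 64])
```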
4. Experiments
This section analyzes the proposed method (ours) in depth for image single-style and multi-style transfer, as well as video single-style and multi-style transfer. We compare our method with 14 state-of-the-art style transfer methods. Qualitative and quantitative comparisons further verify the influence of the proposed method on the color, texture, and fidelity of the stylized results.
Implementation details. The proposed method does not require any training, and all content and style images used in the experiments are taken from the baselines' experimental data. We conduct all experiments with Stable Diffusion 1.4 and adopt DDIM sampling [27] with a total of 50 timesteps. The default hyperparameter settings are described in the specific comparison experiments. All experiments are conducted on an NVIDIA 4090 GPU.
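For reproducibility, a minimal sketch of this sampling configuration using the Hugging Face diffusers library is shown below; the library choice and model identifier are our assumptions, since the paper does not name a specific codebase.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Stable Diffusion 1.4 with a deterministic DDIM scheduler and 50 sampling steps,
# matching the experimental setup described above (a sketch, not the authors' code).
model_id = "CompVis/stable-diffusion-v1-4"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler.set_timesteps(50)

# The components used by the method are then available as pipe.unet, pipe.vae and
# pipe.scheduler for DDIM inversion and stylized sampling.
```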
Evaluation metrics. Following previous work [19,31], this paper selects ArtFID [32] to evaluate the stylization effect of image style transfer, and LDC [33] + SSIM [34] to measure the similarity between stylized images and content images. The ArtFID metric, which evaluates overall style transfer performance through a comprehensive analysis of content and style fidelity, is considered to be highly consistent with human judgment. ArtFID is calculated as $\mathrm{ArtFID} = (1 + \mathrm{LPIPS})\cdot(1 + \mathrm{FID})$, where LPIPS [35] measures the content fidelity between the stylized image and the corresponding content image, and FID [29] evaluates the style fidelity between the stylized image and the style image. LDC [33] outputs fine image edges that are used for SSIM [34] content similarity detection, which helps to avoid interference from the style effect and enables fine-grained detection. For video style transfer, following previous work [19,36,37], optical flow error [38] and ArtFID [32] are selected as evaluation metrics for video temporal consistency and stylization effect, and the mean value over all frames is used as the metric for the video.
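The metric combination is simple enough to state as a one-line helper; the sketch below assumes the LPIPS and FID scores have already been computed with their respective reference implementations, and the example values are illustrative only.

```python
def artfid(lpips_score: float, fid_score: float) -> float:
    """ArtFID = (1 + LPIPS) * (1 + FID): lower is better for both terms,
    so a lower ArtFID indicates better joint content and style fidelity."""
    return (1.0 + lpips_score) * (1.0 + fid_score)

# Example: LPIPS = 0.45 (content fidelity), FID = 18.2 (style fidelity)
print(artfid(0.45, 18.2))  # 27.84
```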
4.1. Qualitative Analysis
4.1.1. Image Single-Style Transfer
Style effect contrast. The baselines selected for the image single-style transfer experiment include conventional methods, such as AdaConv [39], EFDM [40], Deep Preset [41], CAST [42], and StyTr$^2$ [43], as well as diffusion model-based methods, including InST [17] and Style Injection [19]. The weight coefficients of the proposed method are set separately for (1) the inversion noise fusion process ($\lambda$, $\omega_{c}$, $\omega_{i}$) and (2) the controllable weighted value process ($\alpha_{Q}$, $\alpha_{K}$, and $\alpha_{V}$ for the content and style images).
The comparison of single-style transfer results is shown in Figure 4. EFDM and CAST increase their models' understanding of artistic style; however, excessive artistry leads to structural deformation and content ambiguity in the generated stylized images, as shown in the yellow boxes. The AdaConv stylization does not align with the style image's style (Column 4). Deep Preset and InST do not achieve significant stylization (Columns 6 and 9). Ours, StyTr$^2$, and Style Injection achieve relatively good style transfer; however, there are still obvious differences in color, brightness, clarity, semantic alignment, and other aspects of the stylized images, which are compared further in the subsequent analysis.
Color effect contrast. The color effect comparison mainly examines whether the colors and tones of each stylized image are aligned with the style image. AdaIN [28], CAST [42], Deep Preset [41], StyTr$^2$ [43], InST [17], and Style Injection [19] are selected as baselines for the color style transfer experiment. To highlight the effect of color transfer, the weight coefficients of (1) the inversion noise fusion process and (2) the controllable weighted value process are adjusted accordingly for this experiment.
Figure 5 shows that Deep Preset and InST do not display evident color effects of the style image. The proposed method (ours), AdaIN, CAST, StyTr$^2$, and Style Injection are all aligned with the color of the style image. To further verify the fidelity of the stylized image, the leaves in the lower right corner of the content image are enlarged to show the Style4 transfer effect for each method. The results show that the stylized images synthesized by AdaIN, CAST, StyTr$^2$, and Style Injection are fuzzy or cartoonish, with insufficient semantic content fidelity. In contrast, the proposed method achieves color style transfer while the synthesized stylized image remains aligned with the content image with high fidelity.
Photorealistic effect comparison. Producing stylized images with high-definition picture quality is another crucial aspect of expanding the application range of image style transfer. In this part, conventional methods, such as Deep Preset [41], EFDM [40], and AdaConv [39], as well as diffusion model-based methods, including ArtFusion [44] and CAP-VSTNet [45], all of which may achieve a photorealistic style transfer effect, are chosen for comparison to evaluate the degree of photorealism achieved by our method. The weight settings of the proposed method are the same as in the style transfer experiment of Figure 4.
As shown in Figure 6, the proposed method aligns the color style of the foreground and background of the stylized image with the style image, and aligns the semantics and layout with the content image, resulting in a clearer visual effect. Although the results of ArtFusion are consistent with the color and texture of the style image, their semantics and layout deviate significantly from the content image, and the structure of the butterfly is largely blurred. CAP-VSTNet and Deep Preset show no obvious style transfer effect. The style transfer effect of EFDM is good; however, the butterfly's antennae blend into the background, making them difficult to identify. AdaConv's results match the content image semantics, but the style effect is excessively grainy. Figure 7 shows additional HD photorealistic style transfer results of the proposed method.
4.1.2. Image Multi-Style Transfer
Figure 8 shows the performance of the proposed method on image multi-style transfer. The results achieve a good transfer of color and texture features and allow a more targeted choice of style preferences. Depending on the weights of the various styles, the multi-style transfer process generates a rich and diverse style superposition effect.
The result of multi-style transfer between two different styles is shown in Figure 9: one style focuses on color rendering and the other on texture and brushwork expressiveness. The multi-style transfer results not only effectively integrate the two color styles but also highlight the style texture features and align with the content image semantics. Details such as the seawater ripples and the texture of the sailboat in the first row, and the light in the third row, are transferred with high fidelity.
4.1.3. Video Single-Style Transfer
Video single-style transfer applies a single style across the whole sequence of video frames. In addition to ensuring that the style of each video frame is aligned with the style image, the semantic content of each frame must remain aligned with the content video and exhibit inter-frame consistency so as to generate smooth video results. This study chooses state-of-the-art methods, such as AdaAttN [46], Linear [47], TBOS [36], and CCPL [37], as comparative baselines for video single-style transfer. The weight coefficients of the proposed method are again set for (1) the inversion noise fusion process and (2) the controllable weighted value process.
Figure 10 shows the video single-style transfer results. Although Linear and CCPL achieve style transfer, there is noticeable flicker between frames, which degrades the visual effect; moreover, in Linear, the color of the leaves in Frame 235 is inconsistent. AdaAttN further cartoonizes the result, reducing the authenticity of the original content video. TBOS generates insufficient color alignment with the style image, resulting in an unsatisfactory stylistic effect; in addition, as shown by the red box in Row 5, the fidelity of the generated leaves is insufficient, resulting in blurring. In comparison, the proposed method achieves a high degree of style alignment on the basis of semantic alignment with each frame of the content video, with obvious inter-frame consistency and high fidelity.
4.1.4. Video Multi-Style Transfer
Video multi-style transfer builds on video single-style transfer and combines multiple style effects. Figure 11 shows the video multi-style transfer results achieved by the proposed method. Each frame of the video is aligned with the semantics of the content video and presents single-style or multi-style color and texture styles according to the weights.
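Because the method is applied frame by frame, a video pipeline reduces to a simple loop. The sketch below reads and writes frames with OpenCV and calls a hypothetical stylize_frame() placeholder that stands in for the inversion, fusion, and sampling steps of Section 3; the file paths are assumptions.

```python
import cv2
import numpy as np

def stylize_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the per-frame pipeline (DDIM inversion -> INF -> weighted
    self-attention sampling); here it simply returns the frame unchanged."""
    return frame

reader = cv2.VideoCapture("content_video.mp4")           # hypothetical input path
fps = reader.get(cv2.CAP_PROP_FPS)
w = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("stylized_video.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = reader.read()
    if not ok:
        break
    writer.write(stylize_frame(frame))                    # same weights for every frame

reader.release()
writer.release()
```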
4.2. Quantitative Analysis
4.2.1. Quantitative Analysis of Image Style Transfer
Based on the qualitative analysis, the baselines Deep Preset [41], InST [17], CAP-VSTNet [45], and AdaIN [28] are excluded, since their style transfer generalization capacity is inadequate or they focus on a specific style domain. The final quantitative comparison baselines include conventional methods, such as AdaConv [39], EFDM [40], CAST [42], and StyTr$^2$ [43], as well as diffusion model-based methods, including Style Injection [19] and ArtFusion [44].
Figure 12 displays the predicted LDC edges used for content similarity calculation. Based on the LDC predictions, the SSIM similarity between the content and stylized images can be obtained, as shown in Table 1.
The quantitative comparison of image style transfer effects is shown in Table 1, where each baseline's values are averaged over 20 groups of style transfer experiments. As demonstrated in Table 1, the proposed method (ours) has the highest SSIM value while also having the best stylization metric value. Thus, the proposed method can effectively maintain semantic alignment while generating an outstanding stylization effect.
4.2.2. Quantitative Analysis of Video Style Transfer
AdaAttN [46], Linear [47], TBOS [36], and CCPL [37] serve as baselines for the quantitative comparison of video style transfer. The optical flow error computes the motion vector of each pixel between successive frames in order to measure a video's smoothness, and ArtFID is used to evaluate video stylization. In the quantitative comparison, we compute each frame's metric and take the average as the metric for that group of experiments. Table 2 shows the optical flow error and ArtFID results for three styles across the five compared methods. The proposed method has the smallest optical flow error, indicating the highest smoothness, followed by Linear and TBOS. Among the ArtFID results, the proposed method (ours) achieves the best stylization effect, while TBOS and Linear perform poorly. These comparisons demonstrate the capability and excellent performance of the proposed method in video style transfer.
4.3. Ablation Study
To validate the effectiveness of the proposed components, we conduct qualitative ablation studies. As shown in Figure 13, after inversion noise fusion (INF), a stylized image with a certain color style effect can be generated; however, the texture style effect is poor. After applying the controllable weighted value method, both the color and texture style effects of the stylized image are improved.
4.3.1. Analyzing the Hyperparameters of Inversion Noise Fusion
Figure 14 illustrates how each component of the inversion noise fusion method affects the style transfer effect. As shown in the first row, varying the weight $\lambda$ alone affects the sharpness of the generated stylized image. As shown in the second row, under the constraint $\omega_{c} + \sum_{i=1}^{N}\omega_{i} = 1$, the weight $\omega_{c}$ adjusts the color tendency of the stylized image: a larger $\omega_{c}$ shifts the color toward the content image, whereas a smaller $\omega_{c}$ shifts it toward the style image. By increasing the weight $\omega_{i}$, the color of the stylized image moves closer to that of the style image. The inversion noise fusion method therefore has a clear effect on the color, brightness, and sharpness of the stylized image.
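A sweep of this kind can be scripted directly on top of the illustrative inf_fuse() sketch from Section 3.1; the loop below, with assumed weight values, simply regenerates the fused noise for each setting of $\omega_{c}$ while keeping the constraint satisfied.

```python
import torch

# Reuses the illustrative inf_fuse() sketch from Section 3.1.
x_c = torch.randn(4, 64, 64)
x_s = [torch.randn(4, 64, 64)]

for w_c in (0.2, 0.4, 0.6, 0.8):
    w_s = [1.0 - w_c]                      # keep the constraint w_c + sum(w_s) = 1
    fused = inf_fuse(x_c, x_s, lam=1.0, w_c=w_c, w_s=w_s)
    # Each 'fused' tensor would be passed to the sampler to render one ablation column:
    # a larger w_c pulls the color toward the content image, a smaller one toward the style.
    print(w_c, float(fused.mean()))
```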
4.3.2. Analyzing the Hyperparameters of Controllable Weighted Value
This paper proposes a controllable weighted value method for the U-Net self-attention mechanism. It injects weighted self-attention vectors of the content and style image features into the Q, K, and V vectors of the U-Net self-attention mechanism during the stylized image denoising sampling process, which enhances the influence of the content image semantics and the style image styles on the sampling results.
Figure 15 analyzes the influence of the weights in the self-attention query vector $Q$ formula, assuming normal values of $K$ and $V$. Under the constraint $\alpha_{Q}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\right) = 1$, the larger $\alpha_{Q}^{c}$ is, the more significant the semantic description of the content image. Conversely, when $\alpha_{Q}^{c}$ is smaller, i.e., when $\alpha_{Q}^{s_{i}}$ is larger, the semantics of the style images become more visible. As a result, the semantics of the stylized image can be further controlled by balancing the two weight values.
As shown in Figure 16, when controlling the value of the self-attention key vector of the stylized image with a normal value of $Q$, a larger weight $\alpha_{K}^{s_{i}}$ on the style image key vector makes the style effect more visible and the image clearer. The value vector follows the same trend as the key vector, although its color effect is less pronounced; omitting the value vector, however, inevitably weakens or blurs the style effect. Setting these hyperparameters reasonably is therefore beneficial for enhancing the style effect.
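Analogously to the INF sweep, the attention-side ablation can be scripted with the illustrative weighted_qkv() and self_attention() sketches from Section 3.2; the weight values below are assumptions for illustration only.

```python
import torch

# Reuses the illustrative weighted_qkv() and self_attention() sketches from Section 3.2.
shape = (8, 256, 64)
q_c, k_c, v_c = (torch.randn(shape) for _ in range(3))
styles = [tuple(torch.randn(shape) for _ in range(3))]

for a_s in (0.2, 0.5, 0.8):
    q, k, v = weighted_qkv(q_c, k_c, v_c, styles,
                           a_q_c=0.6, a_k_c=1.0 - a_s, a_v_c=1.0 - a_s,
                           a_q_s=[0.4], a_k_s=[a_s], a_v_s=[a_s])
    out = self_attention(q, k, v)
    # A larger style key/value weight strengthens the transferred style in the sample.
    print(a_s, out.shape)
```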
6. Conclusions
This paper proposes multi-source training-free controllable style transfer via diffusion models, which can effectively integrate multiple styles. The proposed method achieves style alignment with the style image and semantic alignment with the content image, and the generated stylized videos exhibit temporal consistency. The style transfer results have high fidelity. For multi-style transfer of images or videos, the method realizes a superposition of multiple styles according to their weights, demonstrating its applicability. Qualitative and quantitative comparisons further verify the performance of the method in multi-source style transfer of images and videos. Our method is based on diffusion models and adopts text-free guidance, which not only addresses the challenge of high-quality style transfer but also removes the reliance on extensive textual descriptions required by many current methods, making style transfer techniques accessible to a wider range of applications. In addition, the multi-source, controllable, high-fidelity, and flexible nature of the method can provide more accurate and friendlier service to users and can be widely used in image and video editing, art, film and television production, and other fields.