1. Introduction
Symmetry and asymmetry are fundamental concepts in computer vision and artificial intelligence (AI), and they strongly influence the visual effect of a synthesized image or video: symmetric style proportions and asymmetric style distributions produce distinct style transfer effects. Style transfer, an important branch of computer vision, covers a diverse set of application scenarios and development opportunities, and it can be approached from a variety of perspectives, including content, layout, and design. The style transfer task introduces style images to add new style elements to an original image or video while preserving its content features. Demand for changes in visual style has grown in recent years as short-video technology has become more widely used. At the same time, for images whose semantics and layout are not obvious, changing the style can highlight them.
Current mainstream style transfer technology can be divided into two categories: deep learning-based (conventional) style transfer and diffusion model-based style transfer. Conventional image and video style transfer derives from the CNN-based image transfer method proposed by Gatys et al. [1], which triggered significant technological innovation. The method uses the VGG convolutional neural network to extract style features from image content and adjusts the generated image via gradient descent; however, the computational cost is enormous. To solve this problem, Johnson et al. [2] proposed a feed-forward method that optimizes the model and runs in real time. Building on the continuous development of global style transfer research [3,4], work on applying multiple styles to a single content image [5,6,7,8,9,10,11] and on applying different styles to different areas of a single content image [12,13] is growing. However, these deep learning-based style transfer methods still need improvement in generation quality and visual effect.
In recent years, with the rapid development of artificial intelligence generation models such as ChatGPT [14] and Stable Diffusion [15], computer vision research and applications have improved in unprecedented ways, as has the generation quality of style transfer. Diffusion models, as representative generative models in artificial intelligence, are widely used in a variety of challenging image synthesis tasks and achieve state-of-the-art results on image data and beyond. To synthesize high-quality stylized images, diffusion model-based style transfer has therefore become a research hotspot in computer vision, and the quality and stylistic innovation of the resulting images and videos constantly refresh our expectations. Any-to-any style transfer [16] realizes free style transfer from any area of the style image to any area of the content image; however, because each area of the output image carries a different style, the output lacks overall coordination and artistry. InST [17] transforms artistic styles into a learnable textual description of a painting so as to more faithfully capture and transfer the painting's artistic style; however, its generalization ability is poor. ProSpect [18] disentangles visual attributes such as texture material, layout, and artistic style from a single image, realizes style transfer of individual visual attributes, and enhances the controllability and flexibility of the style transfer effect.
Although diffusion models achieve an ideal fusion of style and content, they often require a large investment of computational resources and a complex optimization process. For this reason, Style Injection [19] focuses on the self-attention mechanism of the U-Net in the latent diffusion model and realizes style transfer by replacing the K and V vectors in the stylized image generation process with those of the style image, but it can only perform single-style transfer. Departing from traditional style transfer, DEADiff [20] proposes a dual decoupled representation extraction mechanism that identifies the semantics and style of the style image separately, alleviating the semantic conflict between content and style images. However, most diffusion model-based style transfer methods require text to guide content generation in order to achieve impressive visual effects. Text guidance makes the generated content unstable and prevents directed style transfer.
Employing a large number of words to describe the semantics, layout, and style of stylized images introduces considerable uncertainty, and the resulting unpredictability of content and layout frequently fails to meet users' needs for fixed-content style transfer. Furthermore, in multi-style transfer, the artistic styles of multiple style images interfere with one another, making it difficult to transfer and express multiple styles effectively; this has long been a problem in style transfer research. At the same time, users increasingly value the efficient style transfer and artistic transmission offered by training-free approaches.
Based on the above problems and challenges, this paper proposes multi-source training-free controllable style transfer via diffusion models, as shown in Figure 1. The proposed method can obtain different style effects by setting different style control weights and does not require text guidance. We choose the latent diffusion model, a representative diffusion model, as the main framework to ensure state-of-the-art generation quality. Meanwhile, the training-free design reduces the consumption of computing resources, and text-free guidance eliminates dependence on extensive textual descriptions. This novel style transfer method can satisfy personalized style transfer requirements and broaden the applicability of style transfer technology. The proposed method consists of inversion noise fusion (INF) and a controllable weighted value method, and it is applicable to both images and videos, with styles coming from multiple sources, as shown in Figure 2. The main contributions of this paper are as follows:
We synthesize the initial noise of the stylized image diffusion sampling process using inversion noise fusion (INF), which enables control of the color, brightness, and clarity of the synthesized stylized image or video.
To further improve the color, texture, and semantics of stylized images, we strengthen the sources of content and style features via a simple manipulation of the features in self-attention, which we refer to as the controllable weighted value method of the U-Net self-attention mechanism. This method can control the tendency toward each of multiple artistic styles for the purpose of multi-style transfer.
The proposed method can perform single-style or multi-style transfer for image or video content, handles multi-source data, and has high generalization ability. Extensive experiments on the style transfer dataset validate that the proposed method significantly outperforms previous methods and achieves state-of-the-art performance.
Figure 1. Results of multi-source training-free controllable style transfer. Different style effects can be obtained by setting different style control weights.
Figure 2. The proposed method can effectively realize the style transfer of images or videos.
3. Method
Given a content image or video and one or more style images, our goal is to generate style-controlled stylized images or videos. The stylized results conform to the single or multiple guide styles while preserving the structure and semantic layout of the content image or video. The latent diffusion model serves as the primary framework. Specifically, the proposed method consists of two primary parts. The first is inversion noise fusion, in which the initial noise of the sampling process is the fusion of the inversion noise of the content image and that of the single or multiple style images. The second is the controllable weighted value of the U-Net self-attention mechanism, which weights the Q, K, and V vectors during diffusion sampling, followed by the generation of a stylized image or video.
Figure 3 shows the overall network structure of the multi-source training-free controllable style transfer method proposed in this paper. Video style transfer follows the same network structure as in Figure 3, with the proposed method applied to each frame.
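Both components operate on DDIM inversion noise obtained from the content image, from each style image, and from every video frame. As background, the following is a minimal sketch of a deterministic DDIM inversion loop, assuming a toy noise predictor and schedule that stand in for the Stable Diffusion U-Net and its scheduler; the function name and arguments are illustrative, not the paper's implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alphas_cumprod: torch.Tensor,
                timesteps: list[int]) -> torch.Tensor:
    """Deterministic DDIM inversion: map a clean latent x0 to its latent noise
    by running the DDIM update in reverse along an increasing timestep sequence."""
    x = x0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)                                    # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # move one step noisier
    return x

# Toy usage: a random "latent" and a dummy noise predictor standing in for the U-Net.
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
eps_model = lambda x, t: torch.zeros_like(x)   # placeholder; SD would predict real noise
x0 = torch.randn(1, 4, 64, 64)
x_T = ddim_invert(x0, eps_model, alphas_cumprod, timesteps=list(range(0, 1000, 20)))
print(x_T.shape)  # torch.Size([1, 4, 64, 64])
```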
3.1. Inversion Noise Fusion (INF)
This paper is inspired by DDIM inversion [27], which reconstructs a real image by recovering its latent noise. In style transfer, a fused latent noise that carries the semantic features of the content image and the style features of the style images can likewise be obtained, enabling stylized image reconstruction. As shown in Figure 3, we propose the inversion noise fusion (INF) method, which fuses the inversion noise of one content image and multiple style images to construct the initial latent noise. The specific fusion formula is as follows:
$$x^{cs} = \lambda\left(x - \mathrm{mean}(x)\right) + \omega_{c}\,\mathrm{mean}(x) + \sum_{i=1}^{N}\omega_{i}\,\mathrm{mean}(x_{i}),$$

where $x$ represents the inversion noise of the content image, $x_{i}$ represents the inversion noise of the $i$-th style image, $i$ is the index of the style images, $N$ is the total number of style images, $\mathrm{mean}(\cdot)$ denotes taking the mean, $\lambda$ is the weight of the difference between the inversion noise of the content image and its noise mean, $\omega_{c}$ is the weight of the noise mean of $x$, and $\omega_{i}$ is the weight of the noise mean of the $i$-th style image. The two sets of weights satisfy the constraint $\omega_{c} + \sum_{i=1}^{N}\omega_{i} = 1$.
In the fusion formula, $\lambda\left(x - \mathrm{mean}(x)\right)$ mainly describes the semantics and layout of the content image, $\omega_{c}\,\mathrm{mean}(x)$ controls the brightness and color of the content image, and $\omega_{i}\,\mathrm{mean}(x_{i})$ incorporates the style of each style image into the fused noise according to its weight, producing the effect of content and style fusion. By adjusting the three weights $\lambda$, $\omega_{c}$, and $\omega_{i}$, the proportion of the content image's semantics and the style images' styles in the generated stylized image can be adjusted, yielding an effective balance of content and style. In multi-style transfer experiments, the weights of the style images can be balanced so that the various styles are presented uniformly; depending on the experimental goal, the weight of a particular style can also be raised to achieve personalized style integration.
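As a concrete illustration, the following is a minimal sketch of the INF fusion step as plain tensor arithmetic, assuming the DDIM inversion noises have already been computed; the function name inf_fuse and the argument names are illustrative assumptions, not the paper's implementation.

```python
import torch

def inf_fuse(x_content: torch.Tensor,
             x_styles: list[torch.Tensor],
             lam: float,
             w_c: float,
             w_s: list[float]) -> torch.Tensor:
    """Inversion noise fusion (INF), sketched from the fusion formula above.

    x_content : DDIM inversion noise of the content image, shape (C, H, W).
    x_styles  : DDIM inversion noises of the N style images, same shape.
    lam       : weight of the content-noise deviation (controls clarity).
    w_c, w_s  : mean weights of content / style noises, with w_c + sum(w_s) == 1.
    """
    assert len(x_styles) == len(w_s)
    mean_c = x_content.mean()                      # noise mean of the content image
    fused = lam * (x_content - mean_c) + w_c * mean_c
    for w_i, x_i in zip(w_s, x_styles):
        fused = fused + w_i * x_i.mean()           # inject each style's noise mean
    return fused

# Toy usage: one content noise and two style noises with evenly balanced style weights.
x_c = torch.randn(4, 64, 64)
x_s = [torch.randn(4, 64, 64), torch.randn(4, 64, 64)]
x_cs = inf_fuse(x_c, x_s, lam=1.0, w_c=0.5, w_s=[0.25, 0.25])
print(x_cs.shape)  # torch.Size([4, 64, 64])
```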
3.2. Controllable Weighted Value Method of U-Net Self-Attention Mechanism
After INF, a stylized image with a certain style effect can already be generated; however, during noise fusion, the style of the content image is also fused into the stylized image, which weakens the style of the style image. To further distinguish the sources of features and generate a better stylization effect, this paper proposes a controllable weighted value method based on observations and inspiration from previous research [19,22,25,30] on the attention mechanism of diffusion models. The style transfer process in this study is guided by both content and style images rather than by text. Therefore, given the role of self-attention in propagating pixel information, we choose the self-attention mechanism as the focus for extracting and transferring style elements such as texture and color, and finally propose a controllable weighted value method for the U-Net self-attention mechanism.
The fused noise $x^{cs}$ of the content image $I^{c}$ and the single or multiple style images $I^{s_{i}}$, $i = 1,\dots,N$, is obtained via the INF method and employed as the initial latent noise for stylized image sampling. To distinguish and strengthen the content and style features, the query vector $Q^{cs}$, key vector $K^{cs}$, and value vector $V^{cs}$ of the U-Net self-attention mechanism are re-valued during the denoising sampling of $x^{cs}$. To keep the content and layout of the generated stylized image consistent with the content image $I^{c}$, the weighted query vector $Q^{c}$ corresponding to $I^{c}$ is injected into $Q^{cs}$. To ensure the transfer of semantic textures in the style images, the query vectors $Q^{s_{i}}$ corresponding to the style images $I^{s_{i}}$, $i = 1,\dots,N$, are also weighted and injected into $Q^{cs}$. To align the style of the stylized image with the style image $I^{s_{i}}$, the key vector $K^{s_{i}}$ and value vector $V^{s_{i}}$ of the style image are weighted and injected into $K^{cs}$ and $V^{cs}$, respectively. To maintain the fidelity of the content image's texture features beyond semantics and layout, the key vector $K^{c}$ and value vector $V^{c}$ corresponding to the content image $I^{c}$ are also weighted and injected into $K^{cs}$ and $V^{cs}$, respectively. The query, key, and value vectors of the content and style images are obtained from the DDIM inversion process. After weighting, the query, key, and value vectors of the U-Net self-attention mechanism in the stylized image sampling process are expressed as follows:
$$Q^{cs} = \alpha_{Q}^{c}\,Q^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\,Q^{s_{i}}\right),\quad K^{cs} = \alpha_{K}^{c}\,K^{c} + \operatorname{mean}_{i}\!\left(\alpha_{K}^{s_{i}}\,K^{s_{i}}\right),\quad V^{cs} = \alpha_{V}^{c}\,V^{c} + \operatorname{mean}_{i}\!\left(\alpha_{V}^{s_{i}}\,V^{s_{i}}\right),$$

where $\operatorname{mean}_{i}(\cdot)$ denotes taking the mean over the style images, $i \in \{1,\dots,N\}$ is the index of the style images, and $N$ is the total number of style images. $\alpha_{Q}^{c}$, $\alpha_{K}^{c}$, and $\alpha_{V}^{c}$, respectively, represent the value weights of the query, key, and value vectors of the content image; $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$, respectively, represent the value weights of the query, key, and value vectors of the $i$-th style image. The weight constraints are as follows:

$$\alpha_{Q}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\right) = 1,\quad \alpha_{K}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{K}^{s_{i}}\right) = 1,\quad \alpha_{V}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{V}^{s_{i}}\right) = 1.$$
$\alpha_{Q}^{c}$ and $\alpha_{Q}^{s_{i}}$ are used to jointly control the semantics and layout of the stylized image. The two groups of weights $\alpha_{K}^{c}$, $\alpha_{V}^{c}$ and $\alpha_{K}^{s_{i}}$, $\alpha_{V}^{s_{i}}$ are used to control the color, brightness, texture, and clarity of the style effect, and jointly strengthen the style effect. For multi-style transfer tasks, the weight values of $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$ can be adjusted to control the degree of style inclination; when the experimental target is an even presentation of each style, $\alpha_{Q}^{s_{i}}$, $\alpha_{K}^{s_{i}}$, and $\alpha_{V}^{s_{i}}$ can each be set to the same value across all styles. By constructing this weighted value of the self-attention mechanism for stylized images, the style transfer from a single set of content features and single or multiple style features is realized as follows:

$$\operatorname{Attn}\!\left(Q^{cs},K^{cs},V^{cs}\right) = \operatorname{softmax}\!\left(\frac{Q^{cs}\left(K^{cs}\right)^{\top}}{\sqrt{d}}\right)V^{cs},$$

where $d$ is the dimension of the key vectors.
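To make the weighting concrete, below is a minimal sketch, assuming the Q/K/V tensors of the content image and the style images have already been collected from the corresponding DDIM inversion passes; the function names weighted_qkv and self_attention and the weight argument names are illustrative assumptions, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def weighted_qkv(q_c, k_c, v_c, qkv_styles, a_q_c, a_k_c, a_v_c, a_q_s, a_k_s, a_v_s):
    """Controllable weighted value of the self-attention Q/K/V (sketch).

    q_c, k_c, v_c : content-image vectors, shape (heads, tokens, dim).
    qkv_styles    : list of (q_s, k_s, v_s) tuples, one per style image.
    a_*_c, a_*_s  : scalar weights for the content image and per-style weights.
    """
    q = a_q_c * q_c + torch.stack([w * qs for w, (qs, _, _) in zip(a_q_s, qkv_styles)]).mean(0)
    k = a_k_c * k_c + torch.stack([w * ks for w, (_, ks, _) in zip(a_k_s, qkv_styles)]).mean(0)
    v = a_v_c * v_c + torch.stack([w * vs for w, (_, _, vs) in zip(a_v_s, qkv_styles)]).mean(0)
    return q, k, v

def self_attention(q, k, v):
    """Standard scaled dot-product self-attention applied to the fused Q/K/V."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return attn @ v

# Toy usage with one content image and two style images.
shape = (8, 256, 64)                                   # (heads, tokens, dim)
q_c, k_c, v_c = (torch.randn(shape) for _ in range(3))
styles = [tuple(torch.randn(shape) for _ in range(3)) for _ in range(2)]
q, k, v = weighted_qkv(q_c, k_c, v_c, styles,
                       a_q_c=0.6, a_k_c=0.2, a_v_c=0.2,
                       a_q_s=[0.4, 0.4], a_k_s=[0.8, 0.8], a_v_s=[0.8, 0.8])
out = self_attention(q, k, v)
print(out.shape)  # torch.Size([8, 256, 64])
```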
4. Experiments
This section analyzes the proposed method (ours) in depth for image single-style and multi-style transfer, as well as video single-style and multi-style transfer. We compare our method with 14 state-of-the-art style transfer methods. Qualitative and quantitative comparisons further verify the influence of the proposed method on the color, texture, and fidelity of the stylized results.
Implementation details. The proposed method does not require any training, and all content and style images used in the experiments are taken from the baselines' experimental data. We conduct all experiments with Stable Diffusion 1.4 and adopt DDIM sampling [27] with a total of 50 timesteps. The default hyperparameter settings are described in the specific comparison experiments. All experiments are conducted on an NVIDIA 4090 GPU.
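For reproducibility, a minimal sketch of this sampling configuration using the Hugging Face diffusers library is shown below; the library choice and model identifier are our assumptions, since the paper does not name a specific codebase.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Stable Diffusion 1.4 with a deterministic DDIM scheduler and 50 sampling steps,
# matching the experimental setup described above (a sketch, not the authors' code).
model_id = "CompVis/stable-diffusion-v1-4"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler.set_timesteps(50)

# The components used by the method are then available as pipe.unet, pipe.vae and
# pipe.scheduler for DDIM inversion and stylized sampling.
```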
Evaluation metrics. Following previous work [19,31], this paper selects ArtFID [32] to evaluate the stylization effect of image style transfer, and LDC [33] + SSIM [34] to measure the similarity between stylized images and content images. The ArtFID metric, which evaluates overall style transfer performance through a comprehensive analysis of content and style fidelity, is considered to be highly consistent with human judgment. ArtFID is calculated as $\mathrm{ArtFID} = (1 + \mathrm{LPIPS})\cdot(1 + \mathrm{FID})$, where LPIPS [35] measures the content fidelity between the stylized image and the corresponding content image, and FID [29] evaluates the style fidelity between the stylized image and the style image. LDC [33] outputs fine image edges that are used for SSIM [34] content similarity detection, which helps to avoid interference from the style effect and enables fine-grained detection. For video style transfer, following previous work [19,36,37], optical flow error [38] and ArtFID [32] are selected as evaluation metrics for video temporal consistency and stylization effect, and the mean value over all frames is used as the metric for the video.
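The metric combination is simple enough to state as a one-line helper; the sketch below assumes the LPIPS and FID scores have already been computed with their respective reference implementations, and the example values are illustrative only.

```python
def artfid(lpips_score: float, fid_score: float) -> float:
    """ArtFID = (1 + LPIPS) * (1 + FID): lower is better for both terms,
    so a lower ArtFID indicates better joint content and style fidelity."""
    return (1.0 + lpips_score) * (1.0 + fid_score)

# Example: LPIPS = 0.45 (content fidelity), FID = 18.2 (style fidelity)
print(artfid(0.45, 18.2))  # 27.84
```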
4.1. Qualitative Analysis
4.1.1. Image Single-Style Transfer
Style effect contrast. The baselines selected for the image single-style transfer experiment include conventional methods, such as AdaConv [39], EFDM [40], Deep Preset [41], CAST [42], and StyTr$^2$ [43], as well as diffusion model-based methods, including InST [17] and Style Injection [19]. The weight coefficients of the proposed method are set separately for (1) the inversion noise fusion process ($\lambda$, $\omega_{c}$, $\omega_{i}$) and (2) the controllable weighted value process ($\alpha_{Q}$, $\alpha_{K}$, and $\alpha_{V}$ for the content and style images).
The comparison of single-style transfer results is shown in Figure 4. EFDM and CAST increase their models' understanding of artistic style; however, excessive artistry leads to structural deformation and content ambiguity in the generated stylized images, as shown in the yellow boxes. The AdaConv stylization does not align with the style image's style (Column 4). Deep Preset and InST do not achieve significant stylization (Columns 6 and 9). Ours, StyTr$^2$, and Style Injection achieve relatively good style transfer; however, there are still obvious differences in color, brightness, clarity, semantic alignment, and other aspects of the stylized images, which are compared further in the subsequent analysis.
Color effect contrast. The color effect comparison mainly examines whether the colors and tones of each stylized image are aligned with the style image. AdaIN [28], CAST [42], Deep Preset [41], StyTr$^2$ [43], InST [17], and Style Injection [19] are selected as baselines for the color style transfer experiment. To highlight the effect of color transfer, the weight coefficients of (1) the inversion noise fusion process and (2) the controllable weighted value process are adjusted accordingly for this experiment.
Figure 5 shows that Deep Preset and InST do not display evident color effects of the style image. The proposed method (ours), AdaIN, CAST, StyTr$^2$, and Style Injection are all aligned with the color of the style image. To further verify the fidelity of the stylized image, the leaves in the lower right corner of the content image are enlarged to show the Style4 transfer effect for each method. The results show that the stylized images synthesized by AdaIN, CAST, StyTr$^2$, and Style Injection are fuzzy or cartoonish, with insufficient semantic content fidelity. In contrast, the proposed method achieves color style transfer while the synthesized stylized image remains aligned with the content image with high fidelity.
Photorealistic effect comparison. Producing stylized images with high-definition picture quality is another crucial aspect of expanding the application range of image style transfer. In this part, conventional methods, such as Deep Preset [41], EFDM [40], and AdaConv [39], as well as diffusion model-based methods, including ArtFusion [44] and CAP-VSTNet [45], all of which may achieve a photorealistic style transfer effect, are chosen for comparison to evaluate the degree of photorealism achieved by our method. The weight settings of the proposed method are the same as in the style transfer experiment of Figure 4.
As shown in Figure 6, the proposed method aligns the color style of the foreground and background of the stylized image with the style image, and aligns the semantics and layout with the content image, resulting in a clearer visual effect. Although the results of ArtFusion are consistent with the color and texture of the style image, their semantics and layout deviate significantly from the content image, and the structure of the butterfly is largely blurred. CAP-VSTNet and Deep Preset show no obvious style transfer effect. The style transfer effect of EFDM is good; however, the butterfly's antennae blend into the background, making them difficult to identify. AdaConv's results match the content image semantics, but the style effect is excessively grainy. Figure 7 shows additional HD photorealistic style transfer results of the proposed method.
4.1.2. Image Multi-Style Transfer
Figure 8 shows the performance of the proposed method on image multi-style transfer. The results achieve a good transfer of color and texture features and allow a more targeted choice of style preferences. Depending on the weights of the various styles, the multi-style transfer process generates a rich and diverse style superposition effect.
The result of multi-style transfer between two different styles is shown in Figure 9: one style focuses on color rendering and the other on texture and brushwork expressiveness. The multi-style transfer results not only effectively integrate the two color styles but also highlight the style texture features and align with the content image semantics. Details such as the seawater ripples and the texture of the sailboat in the first row, and the light in the third row, are transferred with high fidelity.
4.1.3. Video Single-Style Transfer
Video single-style transfer applies a single style across the whole sequence of video frames. In addition to ensuring that the style of each video frame is aligned with the style image, the semantic content of each frame must remain aligned with the content video and exhibit inter-frame consistency so as to generate smooth video results. This study chooses state-of-the-art methods, such as AdaAttN [46], Linear [47], TBOS [36], and CCPL [37], as comparative baselines for video single-style transfer. The weight coefficients of the proposed method are again set for (1) the inversion noise fusion process and (2) the controllable weighted value process.
Figure 10 shows the video single-style transfer results. Although Linear and CCPL achieve style transfer, there is noticeable flicker between frames, which degrades the visual effect; moreover, in Linear, the color of the leaves in Frame 235 is inconsistent. AdaAttN further cartoonizes the result, reducing the authenticity of the original content video. TBOS generates insufficient color alignment with the style image, resulting in an unsatisfactory stylistic effect; in addition, as shown by the red box in Row 5, the fidelity of the generated leaves is insufficient, resulting in blurring. In comparison, the proposed method achieves a high degree of style alignment on the basis of semantic alignment with each frame of the content video, with obvious inter-frame consistency and high fidelity.
4.1.4. Video Multi-Style Transfer
Video multi-style transfer builds on video single-style transfer and combines multiple style effects. Figure 11 shows the video multi-style transfer results achieved by the proposed method. Each frame of the video is aligned with the semantics of the content video and presents single-style or multi-style color and texture styles according to the weights.
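Because the method is applied frame by frame, a video pipeline reduces to a simple loop. The sketch below reads and writes frames with OpenCV and calls a hypothetical stylize_frame() placeholder that stands in for the inversion, fusion, and sampling steps of Section 3; the file paths are assumptions.

```python
import cv2
import numpy as np

def stylize_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the per-frame pipeline (DDIM inversion -> INF -> weighted
    self-attention sampling); here it simply returns the frame unchanged."""
    return frame

reader = cv2.VideoCapture("content_video.mp4")           # hypothetical input path
fps = reader.get(cv2.CAP_PROP_FPS)
w = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("stylized_video.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = reader.read()
    if not ok:
        break
    writer.write(stylize_frame(frame))                    # same weights for every frame

reader.release()
writer.release()
```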
4.2. Quantitative Analysis
4.2.1. Quantitative Analysis of Image Style Transfer
Based on the qualitative analysis, the baselines Deep Preset [41], InST [17], CAP-VSTNet [45], and AdaIN [28] are excluded, since their style transfer generalization capacity is inadequate or they focus on a specific style domain. The final quantitative comparison baselines include conventional methods, such as AdaConv [39], EFDM [40], CAST [42], and StyTr$^2$ [43], as well as diffusion model-based methods, including Style Injection [19] and ArtFusion [44].
Figure 12 displays the predicted LDC edges used for content similarity calculation. Based on the LDC predictions, the SSIM similarity between the content and stylized images can be obtained, as shown in Table 1.
The quantitative comparison of image style transfer effects is shown in Table 1, where each baseline's values are averaged over 20 groups of style transfer experiments. As demonstrated in Table 1, the proposed method (ours) has the highest SSIM value while also having the best stylization metric value. Thus, the proposed method can effectively maintain semantic alignment while generating an outstanding stylization effect.
4.2.2. Quantitative Analysis of Video Style Transfer
AdaAttN [46], Linear [47], TBOS [36], and CCPL [37] serve as baselines for the quantitative comparison of video style transfer. The optical flow error computes the motion vector of each pixel between successive frames in order to measure a video's smoothness, and ArtFID is used to evaluate video stylization. In the quantitative comparison, we compute each frame's metric and take the average as the metric for that group of experiments. Table 2 shows the optical flow error and ArtFID results for three styles across the five compared methods. The proposed method has the smallest optical flow error, indicating the highest smoothness, followed by Linear and TBOS. Among the ArtFID results, the proposed method (ours) achieves the best stylization effect, while TBOS and Linear perform poorly. These comparisons demonstrate the capability and excellent performance of the proposed method in video style transfer.
4.3. Ablation Study
To validate the effectiveness of the proposed components, we conduct qualitative ablation studies. As shown in Figure 13, after inversion noise fusion (INF), a stylized image with a certain color style effect can be generated; however, the texture style effect is poor. After applying the controllable weighted value method, both the color and texture style effects of the stylized image are improved.
4.3.1. Analyzing the Hyperparameters of Inversion Noise Fusion
Figure 14 illustrates how each component of the inversion noise fusion method affects the style transfer effect. As shown in the first row, varying the weight $\lambda$ alone affects the sharpness of the generated stylized image. As shown in the second row, under the constraint $\omega_{c} + \sum_{i=1}^{N}\omega_{i} = 1$, the weight $\omega_{c}$ adjusts the color tendency of the stylized image: a larger $\omega_{c}$ shifts the color toward the content image, whereas a smaller $\omega_{c}$ shifts it toward the style image. By increasing the weight $\omega_{i}$, the color of the stylized image moves closer to that of the style image. The inversion noise fusion method therefore has a clear effect on the color, brightness, and sharpness of the stylized image.
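A sweep of this kind can be scripted directly on top of the illustrative inf_fuse() sketch from Section 3.1; the loop below, with assumed weight values, simply regenerates the fused noise for each setting of $\omega_{c}$ while keeping the constraint satisfied.

```python
import torch

# Reuses the illustrative inf_fuse() sketch from Section 3.1.
x_c = torch.randn(4, 64, 64)
x_s = [torch.randn(4, 64, 64)]

for w_c in (0.2, 0.4, 0.6, 0.8):
    w_s = [1.0 - w_c]                      # keep the constraint w_c + sum(w_s) = 1
    fused = inf_fuse(x_c, x_s, lam=1.0, w_c=w_c, w_s=w_s)
    # Each 'fused' tensor would be passed to the sampler to render one ablation column:
    # a larger w_c pulls the color toward the content image, a smaller one toward the style.
    print(w_c, float(fused.mean()))
```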
4.3.2. Analyzing the Hyperparameters of Controllable Weighted Value
This paper proposes a controllable weighted value method for the U-Net self-attention mechanism. It injects weighted self-attention vectors of the content and style image features into the Q, K, and V vectors of the U-Net self-attention mechanism during the stylized image denoising sampling process, which enhances the influence of the content image semantics and the style image styles on the sampling results.
Figure 15 analyzes the influence of the weights in the self-attention query vector $Q$ formula, assuming normal values of $K$ and $V$. Under the constraint $\alpha_{Q}^{c} + \operatorname{mean}_{i}\!\left(\alpha_{Q}^{s_{i}}\right) = 1$, the larger $\alpha_{Q}^{c}$ is, the more significant the semantic description of the content image. Conversely, when $\alpha_{Q}^{c}$ is smaller, i.e., when $\alpha_{Q}^{s_{i}}$ is larger, the semantics of the style images become more visible. As a result, the semantics of the stylized image can be further controlled by balancing the two weight values.
As shown in Figure 16, when controlling the value of the self-attention key vector of the stylized image with a normal value of $Q$, a larger weight $\alpha_{K}^{s_{i}}$ on the style image key vector makes the style effect more visible and the image clearer. The value vector follows the same trend as the key vector, although its color effect is less pronounced; omitting the value vector, however, inevitably weakens or blurs the style effect. Setting these hyperparameters reasonably is therefore beneficial for enhancing the style effect.
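Analogously to the INF sweep, the attention-side ablation can be scripted with the illustrative weighted_qkv() and self_attention() sketches from Section 3.2; the weight values below are assumptions for illustration only.

```python
import torch

# Reuses the illustrative weighted_qkv() and self_attention() sketches from Section 3.2.
shape = (8, 256, 64)
q_c, k_c, v_c = (torch.randn(shape) for _ in range(3))
styles = [tuple(torch.randn(shape) for _ in range(3))]

for a_s in (0.2, 0.5, 0.8):
    q, k, v = weighted_qkv(q_c, k_c, v_c, styles,
                           a_q_c=0.6, a_k_c=1.0 - a_s, a_v_c=1.0 - a_s,
                           a_q_s=[0.4], a_k_s=[a_s], a_v_s=[a_s])
    out = self_attention(q, k, v)
    # A larger style key/value weight strengthens the transferred style in the sample.
    print(a_s, out.shape)
```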
6. Conclusions
This paper proposes multi-source training-free controllable style transfer via diffusion models, which can effectively integrate multiple styles. The proposed method achieves style alignment with the style image and semantic alignment with the content image, and the generated stylized videos exhibit temporal consistency. The style transfer results have high fidelity. For multi-style transfer of images or videos, the method realizes a superposition of multiple styles according to their weights, demonstrating its applicability. Qualitative and quantitative comparisons further verify the performance of the method in multi-source style transfer of images and videos. Our method is based on diffusion models and adopts text-free guidance, which not only addresses the challenge of high-quality style transfer but also removes the reliance on extensive textual descriptions required by many current methods, making style transfer techniques accessible to a wider range of applications. In addition, the multi-source, controllable, high-fidelity, and flexible nature of the method can provide more accurate and friendlier service to users and can be widely used in image and video editing, art, film and television production, and other fields.