Article

FreeMix: Personalized Structure and Appearance Control Without Finetuning

Mingyu Kang and Yong Suk Choi
1 Department of Artificial Intelligence, Hanyang University, Seoul 04763, Republic of Korea
2 Department of Computer Science and Engineering, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 9889; https://doi.org/10.3390/app15189889
Submission received: 1 August 2025 / Revised: 8 September 2025 / Accepted: 8 September 2025 / Published: 9 September 2025

Abstract

Personalized image generation has gained significant attention with the advancement of text-to-image diffusion models. However, existing methods face challenges in effectively mixing multiple visual attributes, such as structure and appearance, from separate reference images. Finetuning-based methods are time-consuming and prone to overfitting, while finetuning-free approaches often suffer from feature entanglement, leading to distortions. To address these challenges, we propose FreeMix, a finetuning-free approach for multi-concept mixing in personalized image generation. Given separate references for structure and appearance, FreeMix generates a new image that integrates both. This is achieved through Disentangle-Mixing Self-Attention (DMSA). DMSA first disentangles the two concepts by applying spatial normalization to remove residual appearance from structure features, and then selectively injects appearance details via self-attention, guided by a cross-attention-derived mask to prevent background leakage. This mechanism ensures precise structural preservation and faithful appearance transfer. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior structural consistency and appearance transfer compared to existing approaches. In addition to personalization, FreeMix can be adapted to exemplar-based image editing.

1. Introduction

Recent advancements in text-to-image diffusion models, such as Stable Diffusion [1] and DALL·E 2 [2], have enabled the generation of both realistic and imaginative visual content. Building upon these successes, personalized image generation has emerged as an active area of research, where the goal is to incorporate user-specific objects or styles into generated images. In this domain, some methods [3,4,5,6] achieve personalization by finetuning certain parameters within diffusion models when given a collection of images of a single user-specific subject. Recently, finetuning-free approaches [7,8] have emerged, which manipulate attention mechanisms to enable the customization of multiple objects or transfer of style concepts.
Personalized image generation holds significant value across a wide range of creative and industrial applications. In digital art and game design, creators often need to apply new styles to characters while preserving their underlying structure. In e-commerce, a single product may need to be visualized in multiple appearances without reconstructing its geometry. Similarly, in virtual reality and social media, users may wish to preserve the shape of their avatars while experimenting with diverse appearances. Addressing these needs requires precise disentanglement of structural and appearance features, which is crucial for expanding the practical utility and accessibility of personalized image generation.
Despite the growing progress in this field, existing methods still face critical limitations. Finetuning-based approaches are computationally expensive and prone to overfitting, often resulting in the dominance of a single identity. Text embedding-based techniques, such as Extended Textual Inversion (XTI) [6], attempt to achieve style mixing but suffer from cross-attention misalignment, leading to inconsistencies between the generated image and the reference. Furthermore, current approaches struggle to independently manipulate structural and appearance attributes, since attention misalignment often entangles object features with background or other visual elements during disentanglement. These challenges highlight the need for a more effective method that enables fine-grained control over structure–appearance mixing.
To address these challenges, we propose FreeMix, a finetuning-free method for multi-concept mixing in personalized image generation, as illustrated in Figure 1. Unlike prior finetuning-free approaches, FreeMix achieves fine-grained control over structure and appearance by extracting detailed structural and appearance features directly through attention. We introduce Disentangle-Mixing Self-Attention (DMSA), which integrates diffusion features across different layers and time steps. DMSA consists of two core processes: structure generation and appearance transfer. Structure generation extracts the shape and spatial composition from the reference image, ensuring that the generated image retains the structural integrity of the reference object. To facilitate effective feature mixing, we apply spatial normalization to remove residual appearance information from the structure features. Appearance transfer leverages self-attention to construct an appearance attention map, which is then incorporated into the structure features.
To prevent confusion between the structure and background, we utilize a mask extracted from cross-attention, ensuring that appearance transfer is applied only to the object. Unlike XTI, which relies on text embeddings, our method provides a novel mechanism to explicitly disentangle and recombine different semantic components within attention. This enables more precise control over spatial attributes and incorporates fine details that are difficult to describe with text prompts. FreeMix achieves robust disentanglement of structure and appearance, enabling more detailed and precise personalization than prior methods. Representative results comparing existing personalization methods with our approach are shown in Figure 2.
We perform comprehensive experiments to evaluate the performance of our proposed framework. Experimental results demonstrate that our method consistently outperforms existing approaches both quantitatively and qualitatively.
Our contributions can be summarized as follows:
  • We propose a method that generates a new image by combining the structure of a single object image and the appearance of an exemplar without requiring additional model training.
  • We introduce DMSA, a self-attention mechanism that effectively disentangles and integrates structure and appearance features, enabling precise concept mixing.
  • Our experimental results demonstrate that our method achieves superior performance in multi-concept mixing and exemplar-based image editing tasks compared to existing methods.

2. Related Work

2.1. Text-to-Image Generation

Text-to-image generation has advanced significantly, particularly with the emergence of diffusion models that outperform traditional GAN-based approaches [9,10]. Diffusion models [2,11,12,13] have demonstrated superior performance by effectively capturing fine-grained details and scaling to larger datasets. Latent Diffusion Models (LDMs) [1] enhance computational efficiency by operating in a compressed latent space while maintaining high-quality outputs. Building on LDM, notable models such as Stable Diffusion and SDXL [14] further improve the quality and flexibility of image generation, making them widely adopted in various applications. Moreover, ControlNet [15] extends text-to-image diffusion models by integrating additional conditions, enabling more precise control over generated images. Our work builds upon the pre-trained Stable Diffusion.

2.2. Personalized Image Generation

Personalized text-to-image diffusion models aim to render user-specific subjects from only a few images. These approaches can be broadly categorized into finetuning-based and finetuning-free approaches.
Finetuning-based methods [3,4,5,6] finetune a pre-trained diffusion model to capture specific visual concepts from reference images. This often involves modifying certain network parameters to bind a textual identifier with the subject of interest. Textual Inversion [3] encodes the subject into a learnable text embedding that represents its unique features in the latent space. DreamBooth [4] finetunes the entire diffusion model to associate a unique identifier with a given subject. Custom Diffusion [5] modifies only the cross-attention layers of the diffusion model, reducing computational costs compared to full model finetuning. XTI [6] extends textual inversion by introducing layer-wise learnable embeddings for improved representation. NeTI [16] incorporates implicit time-aware representations to enhance subject fidelity in generated images. However, these methods require extensive finetuning, which is time-consuming and computationally costly.
Finetuning-free approaches [7,17,18] eliminate the need for test-time adaptation. IP-Adapter [17] and ELITE [18] introduce auxiliary networks trained on large-scale datasets to extract reference image features. These features are then incorporated into the diffusion model, thus eliminating the need for test-time finetuning. FreeCustom [7] and StyleAligned [8] employ attention control mechanisms to achieve personalization without additional networks. FreeCustom enables multi-object composition through multi-reference self-attention, allowing better preservation of reference concepts. StyleAligned focuses on maintaining style consistency across generated images through minimal attention sharing during the diffusion process. This technique can be integrated with DreamBooth to generate images that blend different content and styles. While finetuning-based methods achieve strong personalization at the cost of efficiency and overfitting, finetuning-free approaches improve efficiency but suffer from feature entanglement, leading to unwanted leakage between structure and appearance. Moreover, prior works mainly focus on multi-object composition or style consistency, without explicitly addressing the challenge of disentangling and integrating structure and appearance from different references. This limitation motivates the need for a mechanism that can selectively and independently control these attributes in a finetuning-free manner.

2.3. Image Editing

Image editing [19,20,21,22,23] with text-to-image diffusion models has made remarkable progress through techniques that manipulate latent representations and attention mechanisms. Prompt-to-Prompt [19] edits images by modifying cross-attention maps, enabling targeted changes while preserving structure. MasaCtrl [20] improves consistency by converting self-attention into mutual self-attention, enhancing localized edits. Direct Inversion [23] disentangles source and target diffusion branches, ensuring both content preservation and edit fidelity with minimal computation. InstructPix2Pix [22] trains a diffusion model on instruction-based edits, allowing a wide range of modifications based on textual instructions. ZeST [24] proposes a novel method for transferring materials to objects in real images without requiring any additional training. This approach extracts material representations from an exemplar image using IP-Adapter. It then applies them to the target object using ControlNet for structure-aware transfer, incorporating an inpainting model. We achieve effective appearance editing without relying on adapters or finetuning by leveraging self-attention.

3. Preliminaries

3.1. Diffusion Models

Diffusion models [11,25,26] are a class of generative models that generate images by gradually removing noise from an initial Gaussian sample $x_T \sim \mathcal{N}(0, I)$. The process consists of two phases: a forward process that progressively adds noise to an image and a reverse process that gradually denoises it to reconstruct the original image. The forward process is as follows:
$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$   (1)
where $x_t$ is the noisy image at time step $t$, $x_0$ is the original image, and $\alpha_t$ controls the noise level. During inference, the reverse process employs a neural network $\epsilon_\theta(x_t, t)$ to iteratively predict and remove the added noise. Text-to-image diffusion models extend this framework by conditioning the denoising process on a text prompt. Latent Diffusion Models (LDMs), such as Stable Diffusion, further optimize this process by performing denoising in a lower-dimensional latent space rather than pixel space.
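For concreteness, the forward process in Equation (1) can be written as a short function. This is a minimal sketch assuming a precomputed cumulative noise schedule; the tensor name `alphas_cumprod` is illustrative:

```python
import torch

def forward_diffusion(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Noise a clean latent x0 to time step t, following Equation (1)."""
    alpha_t = alphas_cumprod[t]                      # scalar noise level at step t
    eps = torch.randn_like(x0)                       # epsilon ~ N(0, I)
    return alpha_t.sqrt() * x0 + (1.0 - alpha_t).sqrt() * eps
```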

3.2. Attention in Stable Diffusion

Stable Diffusion employs a U-Net architecture with attention mechanisms to guide image generation. Two key attention modules play a role: self-attention and cross-attention.

3.2.1. Self-Attention

Self-attention captures long-range dependencies within an image, allowing spatial features to interact across the image. Self-attention is formulated as:
$h = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$
where $Q$, $K$, and $V$ are the query, key, and value features derived from image features, and $d_k$ is the dimension of the key vectors. In Stable Diffusion, self-attention refines the spatial structure of the image, controlling layout and details across diffusion steps and layers.

3.2.2. Cross-Attention

Cross-attention enables text conditioning by aligning image features with the text embedding derived from the input prompt. Given a text prompt embedding $c$, cross-attention is computed by:
$A_{\mathrm{cross}} = \mathrm{Softmax}\!\left(\frac{QK_c^{T}}{\sqrt{d_k}}\right),$
$h_{\mathrm{cross}} = A_{\mathrm{cross}} V_c,$
where $K_c$ and $V_c$ are the key and value features derived from the text embedding $c$, and $A_{\mathrm{cross}}$ is the cross-attention map that determines how much influence each text token has on different spatial locations in the image. The cross-attention mechanism ensures that the generated image aligns with the semantics of the given prompt by modulating spatial features accordingly.
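Both attention variants share the same scaled dot-product form; the only difference is where the keys and values come from. A minimal sketch, with tensor shapes assumed for illustration:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Generic attention: q is (B, N_q, d); k and v are (B, N_kv, d).

    Self-attention: q, k, v all come from image features.
    Cross-attention: k, v come from the text embedding c, so the returned
    attention map corresponds to A_cross in the equations above.
    """
    d_k = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-1, -2) / d_k ** 0.5, dim=-1)
    return attn @ v, attn
```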
Our method modifies self-attention to refine structure and appearance separately, enabling effective mixing without requiring additional networks or finetuning.

4. Methodology

Our goal is to combine structural and appearance concepts in a zero-shot manner, without requiring finetuning or adapters. Given a structure image $x^s$ and an appearance image $x^a$, FreeMix replaces the original self-attention module of Stable Diffusion with Disentangle-Mixing Self-Attention (DMSA) to generate an output image $x^g$ that incorporates the structural and appearance features from reference images. Our framework is illustrated in Figure 3.

4.1. Disentangle-Mixing Self-Attention

DMSA is a mechanism to selectively transfer structural and appearance features from different reference images. We first use the Segment Anything Model (SAM) [27] to extract masks $M^s$ and $M^a$ of the concepts from each reference image while filtering out irrelevant background elements. Starting with Gaussian noise $x_T^g \sim \mathcal{N}(0, I)$, we iteratively denoise it to generate the final output $x^g$. At each time step $t$, we obtain the noisy representations $x_t^s$ and $x_t^a$ of the structure image $x^s$ and the appearance image $x^a$ through the forward process in Equation (1). Then, DMSA extracts the key and value of the self-attention layer from each reference image and injects them into the attention mechanism of the generated image. Since the diffusion features control distinct attributes, we apply structure generation and appearance transfer at appropriate layers and time steps. Additionally, we employ a masking strategy to ensure that only targeted attributes are transferred.
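The per-step procedure can be summarized as follows. This is only a high-level sketch: the helper functions (`forward_diffusion`, `collect_self_attention_kv`, `denoise_with_dmsa`) are hypothetical wrappers around the Stable Diffusion U-Net and its attention hooks, not part of any existing library.

```python
def dmsa_denoising_step(x_t_g, x_s, x_a, t, alphas_cumprod, masks):
    # Noise the two reference images to the current time step (Equation (1)).
    x_t_s = forward_diffusion(x_s, t, alphas_cumprod)
    x_t_a = forward_diffusion(x_a, t, alphas_cumprod)

    # Run the U-Net on each reference only to cache per-layer self-attention
    # keys/values; these passes produce no output image. (Hypothetical helper.)
    kv_struct = collect_self_attention_kv(x_t_s, t)
    kv_appear = collect_self_attention_kv(x_t_a, t)

    # Denoise the generated latent while DMSA injects the cached features:
    # structure-generation layers read kv_struct, appearance-transfer layers
    # read kv_appear, gated by the masks M^s, M^a, and M^g. (Hypothetical helper.)
    return denoise_with_dmsa(x_t_g, t, kv_struct, kv_appear, masks)
```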

4.2. Structure Generation

We adopt self-attention injection and a weighted mask. Specifically, at time step $t$ and U-Net layer $l$, let the query, key, and value of $x_t^g$ be $Q_{l,t}^g$, $K_{l,t}^g$, $V_{l,t}^g$, and those of the structure image $x_t^s$ be $Q_{l,t}^s$, $K_{l,t}^s$, $V_{l,t}^s$. We compute the dot products $Q_{l,t}^g (K_{l,t}^g)^T$ and $Q_{l,t}^g (K_{l,t}^s)^T$, and then concatenate them, resulting in $S = [\,Q_{l,t}^g (K_{l,t}^g)^T,\; Q_{l,t}^g (K_{l,t}^s)^T\,]$. Similarly, we concatenate the values as $V = [\,V_{l,t}^g,\; V_{l,t}^s\,]$. To integrate structural features, we apply a weighted structure mask $M^s$ extracted from SAM, which filters out irrelevant background regions. The final output is computed as:
$h_{l,t} = \mathrm{Softmax}\!\left(\frac{M \odot S}{\sqrt{d_k}}\right)V,$
where $M = [\,\mathbf{1},\; w \cdot M^s\,]$ is the weighted mask, $w$ is the scaling factor, and $\odot$ denotes the Hadamard product.
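A sketch of this masked, concatenated attention in PyTorch; variable names and shapes are illustrative, with q_g, k_g, v_g belonging to the generated image and k_s, v_s to the structure reference at the same layer and step:

```python
import torch

def structure_generation_attention(q_g, k_g, v_g, k_s, v_s, mask_s, w=3.0):
    """Masked concatenated self-attention for structure generation.

    q_g, k_g, v_g: (B, N, d) features of the generated image.
    k_s, v_s:      (B, N, d) features of the structure reference.
    mask_s:        (N,) SAM mask of the structure concept, flattened to the
                   spatial token length and valued in {0, 1}.
    """
    d_k = q_g.shape[-1]
    s = torch.cat([q_g @ k_g.transpose(-1, -2),
                   q_g @ k_s.transpose(-1, -2)], dim=-1)          # S = [Q K_g^T, Q K_s^T]
    v = torch.cat([v_g, v_s], dim=-2)                             # V = [V_g, V_s]
    m = torch.cat([torch.ones_like(mask_s), w * mask_s]).view(1, 1, -1)  # M = [1, w*M^s]
    attn = torch.softmax((m * s) / d_k ** 0.5, dim=-1)            # element-wise M ⊙ S
    return attn @ v
```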
This process integrates structural features from the reference image into the generated output. However, as shown in Figure 4, this method suffers from appearance leakage, in which appearance features of the structure image are unintentionally transferred into the generated results.
To address this, we apply spatial normalization. Inspired by prior works that define appearance as feature statistics [8,28], we consider appearance transfer to be a stylization task. In contrast to style transfer methods, which align feature statistics to inject appearance, we instead remove these statistics. Specifically, we apply spatial normalization to $Q_{l,t}^g$ and $K_{l,t}^s$ for time steps $t \geq \tau$:
$S = \begin{cases} [\,Q_{l,t}^g (K_{l,t}^g)^T,\; Q_{l,t}^g (K_{l,t}^s)^T\,] & t < \tau \\ [\,Q_{l,t}^g (K_{l,t}^g)^T,\; \mathrm{norm}(Q_{l,t}^g)\,\mathrm{norm}(K_{l,t}^s)^T\,] & t \geq \tau \end{cases}$
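One plausible reading of norm(·), consistent with the feature-statistics view of appearance, is a standardization across spatial tokens (as in instance normalization). The sketch below makes this assumption explicit; it is not the only possible implementation:

```python
import torch

def spatial_norm(x, eps=1e-6):
    """Remove per-channel statistics across the spatial token dimension.

    x: (B, N, d). Subtracting the mean and dividing by the standard deviation
    over the N tokens is one assumed form of norm(.) in the equation above.
    """
    mu = x.mean(dim=-2, keepdim=True)
    std = x.std(dim=-2, keepdim=True)
    return (x - mu) / (std + eps)

def structure_logits(q_g, k_g, k_s, t, tau):
    """Build S with the time-step switch: normalize Q^g and K^s once t >= tau."""
    if t >= tau:
        qs = spatial_norm(q_g) @ spatial_norm(k_s).transpose(-1, -2)
    else:
        qs = q_g @ k_s.transpose(-1, -2)
    return torch.cat([q_g @ k_g.transpose(-1, -2), qs], dim=-1)
```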
As shown in Figure 5, we visualize the self-attention maps of the U-Net middle blocks across different time steps after normalization and observe that structural information is preserved.
Furthermore, we perform this process starting from the earliest time steps and at the middle layers of the U-Net to focus solely on shape transfer.

4.3. Appearance Transfer

At time step $t$ and U-Net layer $l$, let the query, key, and value of $x_t^a$ be $Q_{l,t}^a$, $K_{l,t}^a$, $V_{l,t}^a$. We first extract the appearance-related components from $K_{l,t}^a$ and $V_{l,t}^a$ using an appearance mask $M^a$. This mask isolates appearance features and prevents irrelevant elements from interfering with the generated output. A naive approach is to directly replace $K_{l,t}^g$ and $V_{l,t}^g$ with $K_{l,t}^a$ and $V_{l,t}^a$, but this results in artifacts where background and foreground features become entangled. To overcome this problem, we utilize cross-attention maps to create a foreground object mask $M^g$ from $x_t^g$. Specifically, at time step $t$, we obtain the cross-attention maps of $x_t^g$ at a spatial resolution of $16 \times 16$. These maps capture the structure of $x_t^s$ during structure generation. We then extract the cross-attention map corresponding to the foreground object token and use it as $M^g$. Finally, we perform feature mixing of the key and value as follows:
$\hat{K}_{l,t}^g = K_{l,t}^g \cdot (1 - M^g) + K_{l,t}^a \cdot M^g,$
$\hat{V}_{l,t}^g = V_{l,t}^g \cdot (1 - M^g) + V_{l,t}^a \cdot M^g.$
$\hat{K}_{l,t}^g$ and $\hat{V}_{l,t}^g$ are used in the self-attention operation with $Q_{l,t}^g$, ensuring that appearance is transferred only to the appropriate spatial regions. Unlike the appearance mask $M^a$ obtained from SAM, the cross-attention mask $M^g$ is used to localize the foreground regions in the generated image, ensuring that appearance transfer is applied only to the target object while preserving the background. The final output is obtained as:
$h_{l,t} = \mathrm{Softmax}\!\left(\frac{Q_{l,t}^g (\hat{K}_{l,t}^{g})^{T}}{\sqrt{d_k}}\right)\hat{V}_{l,t}^g.$
This attention control selectively transfers appearance features to the foreground while preserving the structural integrity of the generated image. We perform this process at later time steps $t > t_{\mathrm{appear}}$ and deeper layers in the U-Net, ensuring that appearance features are effectively injected into the generated image.
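The appearance-transfer step thus reduces to a masked mixing of keys and values followed by standard attention. A minimal sketch, where mask_g is the cross-attention-derived foreground mask broadcast over the feature dimension:

```python
import torch

def appearance_transfer_attention(q_g, k_g, v_g, k_a, v_a, mask_g):
    """Mix keys/values with the foreground mask, then attend (equations above).

    q_g, k_g, v_g: (B, N, d) features of the generated image.
    k_a, v_a:      (B, N, d) features of the appearance reference.
    mask_g:        (B, N, 1) foreground mask M^g derived from cross-attention.
    """
    d_k = q_g.shape[-1]
    k_hat = k_g * (1 - mask_g) + k_a * mask_g        # \hat{K}^g
    v_hat = v_g * (1 - mask_g) + v_a * mask_g        # \hat{V}^g
    attn = torch.softmax(q_g @ k_hat.transpose(-1, -2) / d_k ** 0.5, dim=-1)
    return attn @ v_hat
```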

4.4. Application to Image Editing

Our method can be easily extended to exemplar-based image editing. In this paper, exemplar-based editing involves transferring the appearance of a reference object onto a specific object in the structure image, while preserving the overall structure and background. As shown in Figure 6, similarly to previous image editing approaches, we use DDIM inversion [26] to obtain the latent noise of the structure image, resulting in $x_T^g$ as the initial point for editing. Unlike personalization, which starts from randomly sampled Gaussian noise, we use $x_T^g$ and denoise it while applying DMSA to control structural and appearance attributes. In Section 4.3, appearance transfer utilizes the foreground object mask derived from the cross-attention map. However, in image editing, the structure of the object remains fixed in the result image. Therefore, we directly utilize the structure mask $M^s$ as the foreground object mask $M^g$. In contrast to [24,29], which utilize an inpainting model to retain background details, we adopt a masked blending strategy. At each time step $t$, after denoising $x_t^s$ and $x_t^g$ to $x_{t-1}^s$ and $x_{t-1}^g$, we blend these representations using $M^s$, ensuring that the background remains unchanged:
$x_{t-1}^* = x_{t-1}^g \cdot M^s + x_{t-1}^s \cdot (1 - M^s).$
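This blending is a single operation per step. A minimal sketch, assuming the SAM mask has already been resized to the latent resolution:

```python
import torch

def blend_background(x_prev_g: torch.Tensor, x_prev_s: torch.Tensor, mask_s: torch.Tensor) -> torch.Tensor:
    """Keep the generated foreground, copy the structure image's background.

    mask_s: (1, 1, H, W) binary mask of the edited object at latent resolution.
    """
    return x_prev_g * mask_s + x_prev_s * (1 - mask_s)
```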
Our method enables precise and controllable exemplar-based image editing without finetuning or auxiliary networks.

5. Experiments

5.1. Experiment Setting

5.1.1. Dataset

To evaluate our approach, we construct a dataset consisting of 534 diverse structure–appearance pairs and their corresponding target text prompts. The images in the dataset are collected from previous studies [4,5,7] on personalization. In addition, to enrich evaluation and avoid ambiguity, each structure–appearance pair is combined with three diverse prompts, resulting in a total of 1602 generated images. This design provides both diversity and textual variation, ensuring a more comprehensive and reliable evaluation. More details are provided in Appendix A.

5.1.2. Evaluation Metrics

Following previous studies, we evaluate the performance of our model on personalization and exemplar-based image editing. For personalization, we measure image fidelity and text fidelity. Image fidelity is assessed using CLIP-I [30] and DINOv2 [31]. Specifically, DINOv2 measures structural similarity between the structure image and the generated image. CLIP-I measures the consistency of appearance details by comparing the generated images with the appearance reference. To assess text fidelity, we employ CLIP-T [30], which evaluates how well the generated images align with the given text prompts. In addition to these metrics, we also adopt the prior preservation metric (PRES) and the diversity metric (DIV), inspired by prior work [4]. PRES is computed as the average pairwise DINO similarity between generated images of random subjects of the prior class and real images of our specific subject. A higher PRES score indicates that random prior-class subjects are more similar to the specific subject, reflecting a collapse of the prior. DIV is computed as the average LPIPS cosine similarity between generated images of the same subject under the same prompt, where a higher score indicates less overfitting and more varied outputs. These additional metrics provide a more comprehensive evaluation of generation quality by assessing both prior preservation and sample diversity. For exemplar-based image editing, we utilize three key metrics: structure distance (StruD) [32], LPIPS [33], and CLIP-I. StruD quantifies the structural consistency between the original image and the edited image. A smaller StruD indicates better structure preservation. LPIPS measures background preservation by computing perceptual similarity in non-edited areas, similar to [23]. Finally, CLIP-I evaluates the appearance similarity between the appearance exemplar and the edited result in the CLIP space.
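As an example of how the image-fidelity scores are obtained, CLIP-I can be computed as the cosine similarity between CLIP image embeddings. The sketch below uses the Hugging Face transformers library; the specific checkpoint name is an illustrative choice rather than necessarily the one used in our experiments:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(image_a, image_b) -> float:
    """Cosine similarity of CLIP image embeddings (CLIP-I) for two PIL images."""
    inputs = processor(images=[image_a, image_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()
```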

5.1.3. Compared Methods

We compare our approach with several methods for multi-concept mixing, including XTI [6], DreamBooth [4], Custom Diffusion [5], FreeCustom [7], and StyleAligned [8]. We include StyleAligned as a baseline because there are few existing methods for mixed structure and appearance personalization, and style transfer is closely related to our task of integrating visual attributes. For mixing, StyleAligned finetunes on the structure image using DreamBooth-LoRA. Additionally, we compare our method with MasaCtrl [20], Direct Inversion (DirectINV) [23], and ZeST [24] for exemplar-based image editing. MasaCtrl and DirectINV utilize DreamBooth-based finetuning of the appearance exemplar to enable reference-guided editing.

5.1.4. Implementation Settings

We use Stable Diffusion V1.5 as our base model and perform sampling with 50 denoising steps. For structure generation, we set the mask weight $w$ to 3, $\tau$ to 35, and $t_{\mathrm{appear}}$ to 20. Stable Diffusion consists of 16 self-attention layers (indexed from 0 to 15). We replace layers 8–15 with DMSA, where layers 8–12 are used for structure generation and layers 13–15 are used for appearance transfer. All experiments are conducted on a single A5000 GPU.
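For reference, the hyperparameters above can be gathered into a single configuration; the dictionary keys and the model identifier are illustrative rather than taken from a released implementation:

```python
FREEMIX_CONFIG = {
    "base_model": "runwayml/stable-diffusion-v1-5",   # Stable Diffusion V1.5
    "num_inference_steps": 50,
    "mask_weight_w": 3.0,
    "tau": 35,                                        # spatial-normalization threshold
    "t_appear": 20,                                   # appearance-transfer threshold
    "structure_layers": list(range(8, 13)),           # self-attention layers 8-12
    "appearance_layers": list(range(13, 16)),         # self-attention layers 13-15
}
```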

5.2. Multi-Concept Mixing

5.2.1. Quantitative Comparisons

As shown in Table 1, our method consistently outperforms existing methods in multi-concept mixing across most evaluation metrics. In terms of image fidelity, our approach achieves the highest CLIP-I and a competitive DINO, demonstrating its ability to preserve the structure and appearance of reference images effectively. For text fidelity, FreeMix achieves a higher CLIP-T compared to prior works. These results highlight the effectiveness of our method in achieving multi-concept mixing without finetuning. In addition, FreeMix obtains a lower PRES score, indicating that our method preserves the ability to generate diverse images of the prior class. At the same time, FreeMix achieves a higher DIV score, showing that it produces more diverse outputs without overfitting to specific reference contexts.

5.2.2. Qualitative Comparisons

We present visual comparisons of multi-concept mixing in Figure 7. XTI, DreamBooth, and Custom Diffusion struggle to effectively integrate the structure and appearance because they rely on text embeddings or finetuning. This approach limits precise control over individual features, making it difficult to properly fuse structure and appearance. FreeCustom and StyleAligned achieve partial success in mixing, but suffer from issues where the appearance is applied not only to the object but also to the background, leading to unintended artifacts. Additionally, these methods exhibit lower prompt fidelity. In contrast, our proposed method successfully achieves a balanced mixture of structure and appearance.

5.3. Exemplar-Based Image Editing

5.3.1. Quantitative Comparisons

Table 2 presents the quantitative results of image editing methods. Our method achieves significant improvements across all metrics, without finetuning. In particular, the notable improvement in CLIP-I demonstrates its superior performance in appearance transfer.

5.3.2. Qualitative Comparisons

As illustrated in Figure 8, we present a qualitative comparison of our method with baseline approaches for exemplar-based image editing. MasaCtrl struggles with appearance transfer and often introduces undesired artifacts in the background. DirectINV improves background preservation compared to MasaCtrl but still fails to effectively edit the appearance. ZeST effectively transfers the appearance of the exemplar, but structural details tend to be lost, such as the head of the robot (first row). Furthermore, ZeST relies on IP-Adapter, which lacks the capability to extract specific regions from an image. As a result, foreground and background elements become entangled during transfer, such as projecting fire onto the cake (second row). In contrast, our approach achieves superior performance in exemplar-based appearance editing without requiring additional modules or finetuning.

5.4. Ablation Study

We replace the self-attention layers 8–15 of Stable Diffusion with DMSA, where layers 8–12 are responsible for structure generation and layers 13–15 handle appearance transfer. The appearance transfer operation is applied starting from $t_{\mathrm{appear}} = 20$. To evaluate the impact of different configurations, we conduct experiments by varying the layer replacement and $t_{\mathrm{appear}}$. As shown in Figure 9, when structure generation is applied only to layers 8–10 while appearance transfer is applied to the remaining layers 11–15, the generated images exhibit low fidelity to the original structure. In contrast, when structure generation extends to layers 8–14, the images retain high structural fidelity. However, in this case, appearance transfer becomes ineffective. When $t_{\mathrm{appear}}$ increases beyond 20, the structure fidelity starts to degrade, leading to artifacts and distortions. This indicates that applying appearance transfer too late in the denoising process disrupts spatial coherence, leading to unintended structural inconsistencies. These results show that an optimal balance between structure fidelity and effective appearance transfer is achieved by applying structure generation in layers 8–12, appearance transfer in layers 13–15, and setting $t_{\mathrm{appear}} = 20$.
To further evaluate the practical advantages of our finetuning-free approach, we also report inference time comparisons in Table 3. As shown, FreeMix requires 34 s per image, which is significantly faster than traditional finetuning-based methods such as DreamBooth or Custom Diffusion, and slightly faster than FreeCustom despite providing more precise structure–appearance mixing. This demonstrates that our method not only improves the quality of multi-concept mixing but also maintains competitive generation speed, highlighting the efficiency benefits of a finetuning-free design.

5.5. Additional Results

5.5.1. Customization

By using structure generation without appearance transfer and normalization, our method enables single-image customization. As shown in Figure 10, our approach synthesizes images in various contexts while preserving the structural integrity of the input image without finetuning.

5.5.2. Appearance Control

By using only appearance transfer without structure generation, our method extracts the appearance from the input image and transfers it to a new object. As shown in Figure 11, our approach generates images that remain consistent with the appearance of the input concept.

5.5.3. Results with ControlNet

Moreover, we can integrate our method with ControlNet to achieve more stable image generation. As illustrated in Figure 12, ControlNet enables additional conditions as input, allowing for more precise control over the target layout. This integration enables the generation of images where the appearance is consistently transferred to the target layout.

6. Conclusions and Future Work

In this paper, we introduced FreeMix, a finetuning-free method for multi-concept mixing in personalized image generation. We extend self-attention in diffusion models to effectively extract and integrate structural and appearance features from reference images. By selectively combining these attributes, our method generates new concepts. Furthermore, by incorporating DDIM inversion, our method can perform image editing to transfer features from the reference image. Through extensive qualitative and quantitative evaluations, we demonstrate that FreeMix achieves superior performance compared to existing approaches. We plan to explore more advanced disentanglement strategies to enhance the control over individual visual attributes. In addition, we aim to extend FreeMix to handle more than two reference inputs, enabling richer multi-concept compositions. Another promising direction is adapting the method for video generation, where maintaining temporal consistency while mixing structure and appearance remains a challenging task.

7. Limitations

Despite the effectiveness of FreeMix, several limitations remain. First, our approach relies on segmentation masks to disentangle structural and appearance features. While SAM generally produces high-quality masks, its performance may degrade in complex scenarios such as transparent objects, fine structures, or heavy occlusions. Inaccurate masks in these cases can cause unintended leakage between structure and appearance components, thereby reducing generation quality. Second, when the structure and appearance references are highly inconsistent—for instance, an object with intricate geometry combined with a highly textured or stylized appearance—the mixing process may introduce artifacts or result in incomplete transfer. Finally, cross-attention–based mask extraction used during appearance transfer is not always perfectly precise, occasionally leading to minor background interference. We consider these limitations as opportunities for future work, including refining mask precision, developing adaptive attention-based disentanglement strategies, and extending the framework to more extreme structure–appearance discrepancies.

Author Contributions

Conceptualization, M.K.; methodology, M.K. and Y.S.C.; software, M.K.; validation, M.K. and Y.S.C.; formal analysis, M.K. and Y.S.C.; investigation, M.K.; resources, M.K. and Y.S.C.; data curation, M.K.; writing—original draft preparation, M.K. and Y.S.C.; writing—review and editing, M.K. and Y.S.C.; visualization, M.K.; supervision, Y.S.C.; project administration, Y.S.C.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and communications Technology Planning and evaluation (IITP) grant (No. RS-2025-25422680, No. RS-2020-II201373), and the National Research Foundation of Korea (NRF) grant (No. RS-2025-00520618) funded by the Korean Government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DreamBooth dataset can be downloaded from https://github.com/google/dreambooth (accessed on 2 January 2025). The CustomConcept101 dataset can be downloaded from https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip (accessed on 2 March 2025). The FreeCustom dataset can be downloaded from https://github.com/aim-uofa/FreeCustom/tree/main/dataset/freecustom (accessed on 2 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

We construct a dataset consisting of separate sets of structure images and appearance images. Structure images include a diverse collection of characters, objects, and animals, ensuring variation in shape and spatial composition. Appearance images consist of materials such as gold and glass, as well as various textures and patterns, serving as reference exemplars for guiding the synthesis process. For the prompts, we follow the experimental setup of Custom Diffusion [5]. Specifically, prompts are structured in the form of "A photo of a <new1> in the style of <new2> in the background" or "A photo of a <new1> in the style of <new2> swimming underwater". Since finetuning-free approaches such as [7] do not require token identifiers like <new1>, identifiers are not used for these methods.

References

  1. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  2. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  3. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  4. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  5. Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.Y. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1931–1941. [Google Scholar]
  6. Voynov, A.; Chu, Q.; Cohen-Or, D.; Aberman, K. p+: Extended textual conditioning in text-to-image generation. arXiv 2023, arXiv:2303.09522. [Google Scholar]
  7. Ding, G.; Zhao, C.; Wang, W.; Yang, Z.; Liu, Z.; Chen, H.; Shen, C. FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9089–9098. [Google Scholar]
  8. Hertz, A.; Voynov, A.; Fruchter, S.; Cohen-Or, D. Style aligned image generation via shared attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4775–4785. [Google Scholar]
  9. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5907–5915. [Google Scholar]
  10. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324. [Google Scholar]
  11. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  12. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  13. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  14. Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv 2023, arXiv:2307.01952. [Google Scholar] [CrossRef]
  15. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 3836–3847. [Google Scholar]
  16. Alaluf, Y.; Richardson, E.; Metzer, G.; Cohen-Or, D. A neural space-time representation for text-to-image personalization. ACM Trans. Graph. (TOG) 2023, 42, 1–10. [Google Scholar] [CrossRef]
  17. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721. [Google Scholar]
  18. Wei, Y.; Zhang, Y.; Ji, Z.; Bai, J.; Zhang, L.; Zuo, W. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 15943–15953. [Google Scholar]
  19. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-prompt image editing with cross attention control. arXiv 2022, arXiv:2208.01626. [Google Scholar]
  20. Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; Zheng, Y. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22560–22570. [Google Scholar]
  21. Tumanyan, N.; Geyer, M.; Bagon, S.; Dekel, T. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1921–1930. [Google Scholar]
  22. Brooks, T.; Holynski, A.; Efros, A.A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18392–18402. [Google Scholar]
  23. Ju, X.; Zeng, A.; Bian, Y.; Liu, S.; Xu, Q. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv 2023, arXiv:2310.01506. [Google Scholar] [CrossRef]
  24. Cheng, T.Y.; Sharma, P.; Markham, A.; Trigoni, N.; Jampani, V. Zest: Zero-shot material transfer from a single image. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 370–386. [Google Scholar]
  25. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  26. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  27. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  28. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  29. Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18381–18391. [Google Scholar]
  30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  31. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  32. Tumanyan, N.; Bar-Tal, O.; Bagon, S.; Dekel, T. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10748–10757. [Google Scholar]
  33. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
Figure 1. Results of multi-concept mixing and exemplar-based image editing. (Left) Given a structure reference and an appearance reference, our framework generates an image of a new concept that integrates both attributes. (Right) Furthermore, our method can easily be extended to edit an input image by transferring the appearance from an exemplar image. Our method achieves these tasks without requiring training or additional networks.
Figure 2. Results of multi-concept composition and multi-concept mixing. Given two reference images, multi-concept composition generates a single image where both objects coexist. Multi-concept mixing generates a new concept by preserving the structure of one reference while transferring the appearance from the other.
Figure 3. Overview of FreeMix. (a) Given a structure image $x^s$ and an appearance image $x^a$, we first extract masks of the concepts using a segmentation model and obtain $x_t^s$ and $x_t^a$ through the diffusion forward process at each time step $t$. These are then passed through the diffusion U-Net to extract self-attention features. For structure generation, self-attention features from $x_t^s$ are injected into $x_t^g$, while for appearance transfer, self-attention features from $x_t^a$ are injected into $x_t^g$. (b) To transfer appearance features while preserving structure, we extract the foreground mask $M^g$ using cross-attention maps and mix the self-attention features of $x_t^a$ and $x_t^g$ based on $M^g$.
Figure 4. Comparison of generated results with and without spatial normalization. Without normalization, unwanted appearance details from the structure image are injected into the generated output. Applying normalization effectively suppresses appearance leakage, ensuring that only structural features are transferred.
Figure 5. Self-attention maps across different time steps. We visualize PCA of the spatial features from the intermediate blocks of Stable Diffusion.
Figure 6. Overview of our exemplar-based image editing pipeline. We first obtain the latent noise $x_T$ of the structure image using DDIM inversion. During the denoising process, DMSA transfers the appearance of the reference object while preserving the structure and background.
Figure 7. Qualitative comparison of multi-concept mixing methods. Baselines struggle with effective mixing, often leading to artifacts and background distortions. In contrast, our method successfully integrates the structure and appearance of different concepts into a coherent and visually consistent image.
Figure 8. Qualitative comparison of exemplar-based image editing methods.
Figure 9. Ablation results on the effect of different layer configurations and $t_{\mathrm{appear}}$ in structure generation and appearance transfer. The layers indicated above each image represent the self-attention layers (among layers 8–15) used for structure generation, while the remaining layers are assigned to appearance transfer.
Figure 10. Results of customization using our method. The generated images effectively preserve the object of the input image while adapting to various contexts.
Figure 11. Results of appearance transfer using our method. The generated images effectively preserve the appearance of the reference image while adapting to the new object.
Figure 12. Mixing results of our method integrated into ControlNet. The generated images remain consistent with the target layout and the appearance of the reference image.
Table 1. Quantitative comparison of multi-concept mixing methods. Best and second best metrics are highlighted in bold and underline, respectively.
Method | Base Model | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ | PRES ↓ | DIV ↑
XTI | SDv1.5 | 0.5425 | 0.6827 | 24.57 | 0.6745 | 0.3394
DreamBooth | SDv1.5 | 0.6110 | 0.6393 | 23.58 | 0.5721 | 0.3734
Custom Diffusion | SDv1.5 | 0.6133 | 0.7595 | 25.65 | 0.5583 | 0.3876
FreeCustom | SDv1.5 | 0.6220 | 0.7691 | 27.32 | 0.4936 | 0.3947
StyleAligned | SDv1.5 | 0.6034 | 0.7109 | 21.48 | 0.5636 | 0.3681
StyleAligned | SDXL | 0.6260 | 0.7245 | 22.01 | 0.4803 | 0.3846
Ours | SDv1.5 | 0.6377 | 0.7605 | 27.42 | 0.4869 | 0.4038
Table 2. Quantitative comparison of exemplar-based image editing methods. Best and second best metrics are highlighted in bold and underline, respectively.
Method | Base Model | StruD ↓ | LPIPS ↓ | CLIP-I ↑
MasaCtrl * | SDv1.5 | 0.0350 | 0.1435 | 0.7119
DirectINV * | SDv1.5 | 0.0335 | 0.1273 | 0.7353
ZeST | SDXL | 0.0390 | 0.0734 | 0.7519
Ours | SDv1.5 | 0.0245 | 0.0712 | 0.8101
* use DreamBooth-based finetuning for the appearance reference.
Table 3. Inference time comparison of different multi-concept image generation methods.
Method | XTI | DreamBooth | Custom Diffusion | FreeCustom | StyleAligned | Ours
Generation Time (s) | 1183 | 675 | 342 | 37 | 680 | 34
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
