Article

Advancing Interior Design with AI: Controllable Stable Diffusion for Panoramic Image Generation

School of New Media, Beijing Institute of Graphic Communication, Beijing 102600, China
*
Author to whom correspondence should be addressed.
Buildings 2025, 15(8), 1391; https://doi.org/10.3390/buildings15081391
Submission received: 18 March 2025 / Revised: 17 April 2025 / Accepted: 19 April 2025 / Published: 21 April 2025
(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)

Abstract

AI-driven technologies have significantly advanced panoramic image generation in interior design; however, existing methods often lack controllability and consistency in rendering high-quality, coherent panoramas. To address these limitations, the study proposes CSD-Pano, a controllable and stable diffusion framework tailored for panoramic interior design generation. The study also introduces PSD-4, a curated dataset of panoramic scenes covering diverse interior decoration styles to support training and evaluation. CSD-Pano enables fine-grained control over aesthetic attributes, layout coherence, and stylistic consistency. Furthermore, the study designs a panoramic loss function that enhances spatial coherence, geometric alignment, and perceptual fidelity. Extensive qualitative and quantitative experiments demonstrate that CSD-Pano achieves superior performance compared to existing baselines, with significant improvements in SSIM and LPIPS metrics. These results validate the effectiveness of our approach in advancing automated panoramic interior design.

1. Introduction

1.1. Background and Motivation

Excellent architecture serves to inspire and enhance the human experience by seamlessly integrating functionality, aesthetics, and context, thereby elevating both the environment and the quality of life for its occupants [1,2,3,4,5,6,7,8,9]. Moreover, the increasing demand for immersive and interactive interior experiences has driven designers to pursue panoramic visualizations that go beyond static renderings [10,11,12]. However, as illustrated in Figure 1, traditional interior design methods still rely heavily on manual processes—including drawing detailed 2D plans, developing comprehensive 3D models, and executing high-quality renderings—which results in a labor-intensive workflow and limits the ability to rapidly iterate on stylistic and functional alternatives [13].
To overcome these challenges, artificial intelligence (AI) has emerged as a transformative force, automating repetitive tasks and enabling the generation of immersive panoramic renderings directly from scene structures via controllable stable diffusion models [11,12,14,15,16,17,18,19]. This shift paves the way for more intuitive, interactive, and efficient interior design processes, ultimately enhancing both creative flexibility and workflow efficiency.

1.2. Problem Statement and Objectives

Traditional interior design workflows are often time-consuming, with limited flexibility for rapid iterations or exploration of multiple aesthetic directions [11,12]. While recent generative AI models (e.g., Midjourney [20], Stable Diffusion [14], DALL-E 3 [21], and Flux [22]) have demonstrated promising visual capabilities, they typically lack the ability to generate coherent panoramic scenes that preserve spatial structure, stylistic consistency, and semantic alignment—key requirements in professional design practice [21,23]. Recent studies emphasize the growing demand for AI systems that support user-centered control, preserve spatial logic, and enhance creative efficiency—needs that CSD-Pano is designed to fulfill [24,25].
To bridge these gaps, our study introduces CSD-Pano, a controllable and stable diffusion framework tailored to panoramic interior design. Our method improves upon existing approaches by supporting structure-aware conditioning, stable generation over panoramic projections, and fine-grained aesthetic control. These contributions are supported by empirical evaluation and qualitative feedback from design practitioners. The primary objective is to develop an AI-driven framework, CSD-Pano, that facilitates controllable panoramic image generation with enhanced structural coherence and stylistic fidelity. Our goal is to overcome limitations in existing AI-based tools and empower designers with more effective and customizable generative capabilities.

2. Literature Review

2.1. Architectural Design

Designers’ professionalism and creative ideas are the core driving forces of architectural design. Practical interior design is key to optimizing the functional aesthetics of space and enhancing the user experience. However, this field has long faced the challenges of efficiency bottlenecks and high labor costs, which often restrict creative expression and productivity improvement [7,11,12,26].
Traditional interior design workflows generally involve sequential steps such as 2D floor plan drafting, 3D modeling, material and texture assignment, lighting setup, and photorealistic rendering [27,28]. These processes are time-consuming, labor-intensive, and require specialized expertise, often resulting in extended design cycles and limited flexibility for iterative refinement. Moreover, feedback from clients typically occurs only after rendering is complete, increasing the likelihood of mismatches between expectations and outcomes, which leads to repeated revisions and inefficiencies in both time and cost [29,30].
In addition, the long training cycle and high skill acquisition threshold of professional designers also restrict the overall quality of architectural results. The long process of accumulating experience makes it challenging to produce design solutions that are both innovative and practical [8,9]. As designers develop their professional capabilities, they must continuously absorb new methodologies and explore diverse styles [31,32], which is particularly important in accelerating technological iteration.
Traditional panoramic rendering for interior design is cumbersome, as designers must repeat the entire process to make any design modification. In contrast, AI-driven panoramic rendering is simpler: it requires no comprehensive modeling and allows panoramic images to be generated rapidly and efficiently. AI-based interior design technology built on diffusion models transforms traditional practice through process automation: constructing a professional dataset to train AI models can significantly improve the efficiency of generating differentiated design styles [12], and LoRA fine-tuning allows the model to adapt to multiple decorative styles and optimize the interior panoramic rendering process [33].
This innovative approach constructs an end-to-end workflow covering design ideation, model building, material selection, lighting configuration, rendering, and post-processing, unleashing the creative potential of designers and improving decision-making efficiency [34,35]. Studies have shown that AI-driven methods can generate diverse and immersive design results in batches, and their scalability has injected strong momentum into the future development of interior design [14,36].

2.2. Stable Diffusion Model

In recent years, the stable diffusion model (SD) [37,38,39] has quickly become a mainstream tool in image generation, and has shown great potential for improving efficiency and creativity in creative fields such as interior design [23,40,41]. This technology allows designers to obtain design images quickly, significantly improving the efficiency and quality of architectural design [12,42].
Traditional diffusion models involve two core processes: forward noise addition and reverse denoising. In the forward process, noise is gradually added to the input image until it is transformed into a disordered noise distribution, while the reverse process aims to reconstruct the original image from the noise [37]. With a deep learning denoising mechanism, diffusion models can generate realistic images [21,43]. When generating images with specific design elements, designers can embed text prompts into the denoising process to achieve controllable image generation and refine the results [44,45]. The core advantage of text-guided diffusion models is their ease of use, which allows highly customized design proposals to be generated rapidly [12,33].
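For readers less familiar with this mechanism, the following minimal sketch illustrates the closed-form forward noising step used in standard DDPM-style diffusion; the linear beta schedule, step count, and latent shapes are illustrative assumptions, not the specific configuration used in this study.

```python
import torch

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Add Gaussian noise to a clean sample x0 at timestep t (DDPM closed form):
    q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)            # cumulative product of (1 - beta)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

# Example: a linear beta schedule with 1000 steps (assumed values)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(2, 4, 64, 64)                              # e.g., latents from the SD VAE
t = torch.randint(0, 1000, (2,))
x_t, noise = forward_diffuse(x0, t, alphas_cumprod)
```

During training, a denoising network would be optimized to predict `noise` from `x_t` and `t`; the reverse process then iteratively removes that predicted noise to recover an image.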
Despite the excellent performance of diffusion models in many fields, there is still significant room for improvement in their application to architectural design [46,47]. The main bottleneck is the reliance on large-scale datasets collected from the web, which often lack high-quality annotations containing professional architectural terminology, making it difficult for models to effectively correlate architectural design principles with the corresponding architectural language during the training stage [48,49]. To address this challenge, there is an urgent need to construct a high-quality architectural design image library supplemented by fine-grained professional annotations, and to achieve precise adaptation of architectural tasks through model fine-tuning [50,51].
This study proposes an innovative method to break through the bottleneck of generating diverse and immersive interior design images by constructing a panoramic interior decoration style dataset and applying LoRA fine-tuning technology [14,52]. This technical system aims to empower designers to improve efficiency and creative expression and promote transforming traditional interior design practices towards intelligent automation. Experimental verification shows that the method based on the stable diffusion model can effectively generate design schemes with spatial coherence and aesthetic consistency, providing key technical support for the intelligent design process.

2.3. LoRA Fine-Tuning

In artificial intelligence, especially image generation tasks, model adaptation to new scenarios is crucial for performance improvement. Given the high cost of retraining a complete model involving large-scale image datasets and time-consuming training processes, fine-tuning techniques have become a more practical solution [11,12].
Low-Rank Adaptation (LoRA) is currently a cutting-edge technology for model fine-tuning. This method constructs a low-rank decomposition of the model weight matrix and only needs to adjust a few parameters to achieve efficient adaptation. Focusing on the attention cross-layer, LoRA can effectively incorporate new knowledge while retaining the prior knowledge of the pre-trained model, ultimately generating lightweight models of only a few hundred megabytes, which can output high-quality images in a moderate training time [53].
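As a concrete illustration of this low-rank update, the sketch below wraps a frozen linear projection with trainable down- and up-projection matrices; the rank, scaling convention, and layer choice are illustrative assumptions rather than the exact LoRA implementation used in this work.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 64, alpha: float = 64.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)              # update starts at zero, so behavior is unchanged
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: adapt a cross-attention projection of hidden size 768 (assumed dimension)
proj = LoRALinear(nn.Linear(768, 768), rank=64)
out = proj(torch.randn(1, 77, 768))
```

Only the two small matrices are trained, which is why the resulting adapters are only a few hundred megabytes and can be swapped per decorative style.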
The advantages of LoRA are particularly evident in professional scenarios such as interior design image generation. By fine-tuning the interior decoration style dataset, designers can generate customized solutions that meet specific needs. This adaptability stimulates creative potential and optimizes the entire interior design process [14,52].
Compared to other fine-tuning methods such as Textual Inversion [54] and Hypernetwork [55], which are faster but tend to sacrifice image quality, LoRA strikes an optimal balance between efficiency and generation quality [11,45]. This method can incorporate new styles and design concepts without compromising the quality of the output, making it particularly suitable for professional design scenarios [53,56].
Studies have shown that fine-tuning the LoRA model is an optimal strategy for adapting the diffusion model to interior design needs. By generating diverse and high-quality design images, this technology not only empowers designers’ creativity but also promotes the intelligent upgrading of the design process, laying the foundation for transforming traditional design practices into intelligent automation [14].

3. Methodology

This study proposes an innovative approach to improving the creativity and efficiency of interior design through the use of AI diffusion models. The methodology first constructs a panoramic interior decoration style dataset to train AI models capable of generating differentiated design styles. Then, it uses LoRA fine-tuning technology to adapt the models to various decorative styles, effectively replacing the traditional method of creating panoramic interior renderings. The proposed controllable and stable diffusion pano (CSD-Pano) framework consists of four core stages: data collection, design reference element recognition, simple 3D model construction, and AI-assisted panorama generation. This end-to-end process enables the generation of design images from the simple structural outlines of interior scenes. This innovative design method can mass-produce a variety of panoramic design schemes, significantly stimulating designers' creativity while improving design efficiency and decision-making quality. Experimental results show that this method can effectively generate design results with a sense of difference and immersion, provide a flexible and extensible workflow, and demonstrate the great potential of transforming traditional interior design practices through intelligent automation.

3.1. Research Framework

The proposed framework in this study utilizes AI diffusion models to enhance the creativity and efficiency of interior design. As shown in Figure 2, the framework consists of the following core components: Structural Controller (SC), Style Controller (STC), and Panoramic LoRA Controller (PLC).
Structural Controller (SC). The SC uses the ControlNet architecture [48] to ensure that the generated interior design complies with the principle of structural integrity. This module focuses on the design’s layout and spatial arrangement, considering room proportions, furniture placement, and circulation flow. While supporting style flexibility, the SC module ensures that the generated design has a logical and functional structure. This is crucial to ensuring that the output solution is aesthetically pleasing and practical.
Style Controller (STC). STC adopts the IP-Adapter architecture [57] to control the aesthetic direction of the design (e.g., modern, minimalist, classical, etc.) through predefined style parameters, adjusting elements such as color schemes, materials, textures, and furniture options. This module interacts with SC to ensure that the selected style does not affect the structural integrity and to achieve a harmonious unity of design elements.
Panoramic LoRA Controller (PLC). PLC is based on low-rank adaptive technology (LoRA) [53], which generates high-resolution, immersive panoramic images by adjusting visual parameters to accurately represent the design in a 360-degree format. This module ensures that all design elements blend seamlessly into the scenic view, providing users with an interactive and complete spatial experience.

3.2. Collect Building Datasets

The PSD-4 dataset was curated from images produced by professional designers and published on multiple interior design websites. The images included in the PSD-4 dataset were chosen based on several key criteria: Design Variety, Image Quality, and Spatial Diversity. The dataset consists of images across a wide range of design styles, including American, European, Modern, Neoclassical, and New Chinese, ensuring diverse input for the AI model. Only high-resolution images with clear and distinct visual details were selected to ensure the generated outputs could replicate real-world design aesthetics. The dataset incorporates interior images representing different room types (e.g., living rooms, kitchens, bedrooms, and bathrooms) and layouts to ensure the AI model can handle various spatial configurations.
The annotation process was carried out in a two-stage procedure to ensure accuracy and consistency, as shown in Figure 3:
Automatic Tagging with WD14 Tagger. In the first stage, images were processed using the WD14 tagger, a state-of-the-art AI tool designed to automatically label interior design images with relevant tags based on recognized visual patterns and objects. This tool tags key features such as design styles (e.g., American style, New Chinese style), spatial components (e.g., kitchen, living room), and specific interior elements (e.g., cabinets, tables, windows, plants).
Expert Review and Refinement. After the automatic tagging process, each image was reviewed by a team of three expert interior designers. The experts verified and refined the tags generated by the WD14 tagger, ensuring that the labels were accurate, comprehensive, and aligned with interior design conventions. Any discrepancies or errors identified by the experts were corrected, and additional labels were added when necessary to capture subtle design elements or spatial details not covered by the initial tags.
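A minimal sketch of this two-stage annotation pipeline is given below; `wd14_tag` and `expert_review` are hypothetical placeholders standing in for WD14 tagger inference and the designers' manual review, and the file layout is assumed.

```python
from pathlib import Path
import json

def wd14_tag(image_path: Path) -> list[str]:
    """Hypothetical stand-in for stage 1: automatic tagging with a WD14-style tagger."""
    return ["living_room", "new_chinese_style", "wooden_floor"]  # dummy tags for illustration

def expert_review(image_path: Path, tags: list[str]) -> list[str]:
    """Hypothetical stand-in for stage 2: designers verify, correct, and extend the tags."""
    return sorted(set(tags))

def annotate_dataset(image_dir: str, out_file: str) -> None:
    """Run both stages over a folder of images and write one JSON record per image."""
    records = []
    for img in sorted(Path(image_dir).glob("*.jpg")):
        tags = wd14_tag(img)                # stage 1: automatic tagging
        tags = expert_review(img, tags)     # stage 2: expert refinement
        records.append({"file": img.name, "tags": tags})
    Path(out_file).write_text(json.dumps(records, indent=2, ensure_ascii=False))

# annotate_dataset("PSD-4/images", "PSD-4/annotations.json")  # hypothetical paths
```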
This study constructs a comprehensive and panoramic interior decoration style dataset to improve the training effect of AI interior design models. The dataset is constructed by selecting high-quality images from multiple interior design websites and contains 294 images that showcase diverse interior styles. Each image is carefully annotated with metadata, covering specific design styles (such as neoclassicism and industrial) and characteristic elements (such as materials and color schemes) to support the actual training of AI models. This approach ensures that the model can accurately generate images that reflect the subtle differences between different interior design styles. An overview of the PSD-4 dataset is shown in Table 1, which categorizes the images by style and highlights the diversity of designs through a systematically organized collection of images.

3.3. Structural Controller

To guide the SD model in generating images with the Canny edge detection method, the structure controller alters the neural network architecture of the diffusion model by introducing additional constraints. This adaptation ensures precise spatial control when integrated with SD, thereby addressing the challenge of spatial consistency. The generation of images thus mirrors the actual painting process, progressing from the outline to the final coloring stages.
The study employs the structure controller to introduce spatial control into a pre-trained diffusion model, allowing it to function beyond the limitations of basic text prompts. This controller connects the UNet framework from the SD model to a modified UNet replica that incorporates zero convolution layers within the encoder and middle blocks. The overall operation of the structure controller is represented as follows:
$$y_s = G(z, \varphi) + Y\big(G\big(z + Y(d, \varphi_{y1}), \varphi_s\big), \varphi_{y2}\big).$$
The structure controller distinguishes itself from the original SD model by handling the residual term. In this equation, $G$ represents the UNet architecture, with $z$ being the latent variable. The fixed parameters of the pre-trained model are denoted by $\varphi$. The zero convolutions are expressed as $Y$, with the respective weights $\varphi_{y1}$ and $\varphi_{y2}$, while $\varphi_s$ indicates the trainable parameters specific to the structure controller. Essentially, this controller encodes spatial condition information—such as data from Canny edge detection—by integrating residuals into the UNet block and incorporating this spatial information into the original model.
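The schematic sketch below mirrors this residual formulation: a frozen block, a trainable replica, and two zero-initialized 1×1 convolutions so that the control branch contributes nothing at the start of training. The module granularity and channel handling are simplified assumptions, not the exact ControlNet-style integration used in CSD-Pano.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so the control branch starts as a no-op."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class StructuralControlBlock(nn.Module):
    """Schematic residual control: y_s = G(z) + Y(G_s(z + Y(d)))."""

    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen_block = frozen_block        # G(., phi): frozen pre-trained UNet block
        self.trainable_copy = trainable_copy    # G(., phi_s): trainable replica
        self.zero_in = zero_conv(channels)      # Y(., phi_y1): injects the spatial condition d
        self.zero_out = zero_conv(channels)     # Y(., phi_y2): injects the residual

    def forward(self, z: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        residual = self.zero_out(self.trainable_copy(z + self.zero_in(d)))
        return self.frozen_block(z) + residual

# Toy usage with identity "blocks" just to show the data flow
block = StructuralControlBlock(nn.Identity(), nn.Identity(), channels=4)
y = block(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```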

3.4. Style Controller

In the original SD model, text embeddings are incorporated into the UNet model via the input-to-cross-attention mechanism. A simple method to include image features combines them with text features and inputs the resulting combination into the cross-attention layer. However, this approach is inefficient. Instead, the style controller separates the text and image features in the cross-attention layer, streamlining the model to improve efficiency and reduce computational demands. This modification increases the scalability and versatility of the model. During training, the style controller autonomously learns to generate corresponding images based on text descriptions while effectively utilizing image features. As a result, the style controller can produce more accurate and realistic images that fully integrate the semantic content of the text.
The cross-attention operation on texts can be formally represented as follows:
$$T_{new} = \mathrm{Attention}(Q, K_t, V_t) + \mu \cdot \mathrm{Attention}(Q, K_i, V_i).$$
In this formula, $Q$, $K_t$, and $V_t$ represent the query, key, and value matrices of the text cross-attention operation, while $K_i$ and $V_i$ are the key and value matrices of the image cross-attention operation. Given the query feature $Y$ and image feature $d_i$, the corresponding projections are $Q = Y W_q$, $K_i = d_i W_{ki}$, and $V_i = d_i W_{vi}$. It should be noted that only the key and value weight matrices $W_{ki}$ and $W_{vi}$ in the image modality are trainable parameters, while the remaining parameters (such as $W_q$ and the weights associated with $K_t$ and $V_t$ on the text side) remain fixed.
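A minimal sketch of this decoupled cross-attention is shown below; head splitting, normalization, and the exact frozen/trainable partition are simplified, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Schematic decoupled attention: T_new = Attn(Q, K_t, V_t) + mu * Attn(Q, K_i, V_i)."""

    def __init__(self, dim: int, mu: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)        # W_q (kept frozen in the paper's setting)
        self.to_k_text = nn.Linear(dim, dim, bias=False)   # frozen text-side projections
        self.to_v_text = nn.Linear(dim, dim, bias=False)
        self.to_k_img = nn.Linear(dim, dim, bias=False)    # W_ki, trainable
        self.to_v_img = nn.Linear(dim, dim, bias=False)    # W_vi, trainable
        self.mu = mu

    def forward(self, y: torch.Tensor, text_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        q = self.to_q(y)
        attn_text = F.scaled_dot_product_attention(q, self.to_k_text(text_feat), self.to_v_text(text_feat))
        attn_img = F.scaled_dot_product_attention(q, self.to_k_img(img_feat), self.to_v_img(img_feat))
        return attn_text + self.mu * attn_img

# Example: query tokens attend separately to text and style-image embeddings (assumed sizes)
layer = DecoupledCrossAttention(dim=768)
out = layer(torch.randn(1, 4096, 768), torch.randn(1, 77, 768), torch.randn(1, 16, 768))
```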

3.5. Panoramic Loss Function

This section proposes a comprehensive loss function that enhances the ability to understand the architectural style of interior design images. This loss function is based on the mean square error (MSE) loss function [58]. It integrates multi-dimensional constraint components that ensure that the model effectively captures the content and style features of the image. The mathematical expression of the basic loss function as an optimization benchmark can be expressed as:
$$\mathcal{L}_{base} = \lVert \hat{Y} - Y \rVert_F^2$$
Here, $\mathcal{L}_{base}$ represents the loss computed only from the pixel-by-pixel difference between the predicted image $\hat{Y}$ and the target image $Y$. Although this formula can effectively ensure that the generated image matches the target at the pixel level, it ignores the subtle differences in architectural styles and the importance of smooth transitions at image boundaries.
In view of the above limitations, this study proposes an improved panoramic loss function, the mathematical expression of which is defined as:
$$\mathcal{L} = \mathcal{L}_{base} + \lambda_1 \mathcal{L}_{boundary} + \lambda_2 \mathcal{L}_{style}$$
In this formula, $\mathcal{L}$ comprises the basic loss term and two additional constraints: the boundary loss $\mathcal{L}_{boundary}$ and the style loss $\mathcal{L}_{style}$. The hyperparameters $\lambda_1$ and $\lambda_2$ control the contribution weights of the boundary loss and style loss in the total loss function. Adjusting these two parameters can create a dynamic balance between content accuracy and style fidelity. This flexibility is particularly important in architectural design, as different projects may place different emphasis on dimensions such as the sharpness of the floor plan layout or the expressiveness of the artistic style.
The boundary loss term specifically penalizes differences in the edge areas of the image. This loss term is calculated using the following formula:
$$\mathcal{L}_{boundary} = \mathcal{L}_{color} + \mathcal{L}_{gradient}$$
In this equation, $\mathcal{L}_{color}$ ensures smooth and visually coherent color transitions by quantifying the color similarity between the predicted image and the target image in the boundary region. This component effectively suppresses color loss artifacts that may occur during image generation. The $\mathcal{L}_{gradient}$ term evaluates the gradient similarity at the boundary, which is crucial for maintaining edge sharpness and transition clarity. By penalizing gradient differences, this component ensures that the structure and form of architectural elements are accurately represented in the generated image. The style loss term aims to capture the stylistic characteristics of the image so that the model can maintain the desired architectural aesthetic qualities. By analyzing the correlation between different feature maps, this loss function ensures that the generated image displays appropriate texture and stylistic elements. Its mathematical definition is as follows:
$$\mathcal{L}_{style} = \lVert \mathrm{Gram}(\hat{Y}) - \mathrm{Gram}(Y_{style}) \rVert^2$$
In this equation, $\mathrm{Gram}(\cdot)$ calculates the Gram matrix of the corresponding image. This matrix effectively encodes style information by capturing the correlation between different feature maps of an image. By comparing the Gram matrices of the generated image $\hat{Y}$ and the reference style image $Y_{style}$, this loss function ensures that the generated result retains the target style features. This approach allows the model to focus on matching texture features and style elements rather than just pixel-level differences, thereby enhancing the architectural aesthetic expressiveness of the output.

By incorporating these additional components into the loss function, the model's ability to learn about image content and architectural style is significantly enhanced. This comprehensive approach produces interior design images of higher quality, capable of generating rich and coherent output that conforms to both human aesthetic preferences and architectural principles. The difference between the basic loss function and the optimized version highlights the importance of considering boundary constraints and stylistic elements when generating high-quality architectural images.
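To make the composition of these terms concrete, the sketch below combines a pixel-wise base term, a seam-focused boundary term, and a Gram-matrix style term. The boundary definition (left/right panorama seams), the source of the style features, and the weights $\lambda_1$, $\lambda_2$ are illustrative assumptions rather than the exact formulation trained in this study.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (B, C, H, W) feature map, encoding channel-wise correlations."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def panoramic_loss(pred, target, style_feat_pred, style_feat_ref,
                   boundary: int = 16, lam1: float = 0.1, lam2: float = 0.05):
    """Sketch of L = L_base + lam1 * L_boundary + lam2 * L_style (weights are illustrative)."""
    # Base term: pixel-wise reconstruction (MSE / squared Frobenius norm)
    l_base = F.mse_loss(pred, target)

    # Boundary term: color + gradient agreement near the left/right seams of the panorama
    l_color = F.mse_loss(pred[..., :boundary], target[..., :boundary]) + \
              F.mse_loss(pred[..., -boundary:], target[..., -boundary:])
    grad_p = pred[..., 1:] - pred[..., :-1]
    grad_t = target[..., 1:] - target[..., :-1]
    l_gradient = F.mse_loss(grad_p[..., :boundary], grad_t[..., :boundary]) + \
                 F.mse_loss(grad_p[..., -boundary:], grad_t[..., -boundary:])
    l_boundary = l_color + l_gradient

    # Style term: Gram-matrix distance between output and style-reference feature maps
    l_style = F.mse_loss(gram_matrix(style_feat_pred), gram_matrix(style_feat_ref))

    return l_base + lam1 * l_boundary + lam2 * l_style

# Example with random tensors standing in for images and VGG-style feature maps
pred, target = torch.rand(1, 3, 256, 512), torch.rand(1, 3, 256, 512)
feat_p, feat_r = torch.rand(1, 64, 64, 128), torch.rand(1, 64, 64, 128)
loss = panoramic_loss(pred, target, feat_p, feat_r)
```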

4. Experiments and Results

4.1. Experimental Settings

The training was conducted on an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM) and an Intel i9-14900K CPU (32 cores), supported by 64 GB DDR5 RAM and a 1.8 TB NVMe SSD. This configuration demonstrated robust performance in processing high-resolution imagery and large-scale datasets. The computational environment utilized Ubuntu 22.04.3 LTS with CUDA 11.8, cuDNN 8.7.0, and PyTorch 2.1.2+cu118 for accelerated computing, integrated with Stable Diffusion XL 1.0 and CLIP ViT-H/14 architectures. Critical hyperparameters included rank-64 low-rank adaptation in the attention layers, AdamW optimization ($lr = 1 \times 10^{-4}$), a batch size of 2, and fp16 mixed-precision training enhanced by gradient checkpointing, achieving optimal memory efficiency while maintaining numerical stability through dynamic loss scaling.
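The self-contained snippet below illustrates these optimization settings (AdamW with learning rate 1e-4, batch size 2, fp16 autocast with dynamic loss scaling); the tiny stand-in model and random data are placeholders, not the SDXL UNet with LoRA modules or the PSD-4 data loader.

```python
import torch
import torch.nn as nn

# Placeholders: a tiny model and random batches stand in for the LoRA-injected UNet and real data.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(64, 64).to(device)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")   # dynamic loss scaling

for step in range(10):
    x = torch.randn(2, 64, device=device)                           # batch_size = 2
    with torch.autocast(device_type=device.type,
                        dtype=torch.float16 if device.type == "cuda" else torch.bfloat16,
                        enabled=device.type == "cuda"):              # mixed-precision forward pass
        loss = (model(x) - x).pow(2).mean()                          # placeholder loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```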

4.2. PSD-4

To enhance the performance of our AI models in generating specific interior design styles, the study employed LoRA fine-tuning techniques on the panoramic interior decoration style dataset PSD-4, as shown in Figure 4. Specifically, the dataset, which includes a diverse range of high-quality images annotated with various design styles, was utilized to fine-tune the diffusion model. This process involved integrating a specialized loss function incorporating style-specific components, allowing the model to learn and adapt effectively to different interior design aesthetics. By training the model with this composite loss function, it can associate the learned design styles with predefined prompts, enabling designers to generate targeted interior design outputs simply by inputting the desired style parameters [14,53]. This innovative approach not only improves the quality of the generated designs but also enhances the creative capabilities of designers by providing them with a robust tool for producing diverse and immersive design representations.

4.3. Subjective Assessment

Evaluating interior design drawings of buildings presents a significant challenge, as shown in Figure 5. Automated assessments of design renderings primarily focus on aesthetics and composition, often overlooking a comprehensive evaluation of the design content [59,60]. To address this gap, it is crucial to implement a scientific assessment approach that incorporates design evaluation metrics and expert ratings. In this study, the researchers collaborated with professional interior designers to develop a tailored set of evaluation criteria for interior design. This framework includes metrics such as "Overall Impression", "Architectural Details", "Architectural Integrity", "Lighting Relationship", "Composition", "Consistent Architectural Style", and "Space Function". The evaluation process involved professional designers selecting the best images from numerous model-generated options. This method ensures a comprehensive assessment of interior design quality.
The proposed evaluation index system includes several critical metrics that assess the quality of interior design drawings. Each index plays a vital role in the overall evaluation process. "Overall Impression" and "Consistent Architectural Style" are the most crucial indicators. The former reflects the designer's general perception of the generated image; it is typically deemed feasible if it is aesthetically pleasing and free of obvious errors. The latter assesses whether the input text prompts lead to a building that aligns with the intended design style, as consistency in architectural style is essential for generative design. Additionally, "Architectural Details" evaluates the clarity and reasonableness of the design elements within the image, while "Architectural Integrity" pertains to the overall completeness of the building's design. The "Lighting Relationship" examines the accuracy of lights, shadows, and colors in the generated images, contributing to the overall realism. "Composition" provides insights into the rationality of the building's placement, while "Space Function" evaluates how well the design meets its intended purpose. Finally, "Design Details" assesses the intricacy and thoughtfulness of the individual design elements. Together, these indices create a comprehensive framework for evaluating interior design quality.
As shown in Figure 6, the experimental results show that CSD-Pano performs excellently across all design categories, particularly in "Overall", "Details", and "Integrity", with scores ranging from 68.18% to 74.24%, significantly outperforming other methods. Flux shows weaker performance, with overall scores ranging from 4.55% to 12.12%, falling well behind CSD-Pano. Reference Only contributes the least, with scores not exceeding 3%. DALL-E 3 performs relatively well in "Details" and "Integrity", with scores of 24.24% and 19.70%, but shows lower results in "Style" and "Function". Midjourney also performs moderately, with scores mostly between 1.52% and 3.03% and a slight improvement in "Function" at 10.61%. Overall, CSD-Pano leads in all design metrics, demonstrating its advantage in design quality.

4.4. Objective Assessment

For the objective evaluation of the generated images, the study employed three key metrics: Structural Similarity Index (SSIM) [61], Learned Perceptual Image Patch Similarity (LPIPS) [62], and CLIP-Text Similarity (CLIP-T) [63]. These metrics comprehensively assess the generated images regarding structural consistency, perceptual similarity, and semantic alignment with the input textual descriptions, providing a multi-faceted evaluation of image quality.
Structural Similarity Index (SSIM). SSIM [61] is a widely used metric for assessing the structural similarity between two images by comparing their luminance, contrast, and structural patterns. Since our evaluation does not rely on direct reference images, the study uses Canny edge maps as structural priors to measure the preservation of essential image structures. The SSIM is computed as follows:
$$\mathrm{SSIM}(I, E) = L(I, E) \cdot C(I, E) \cdot S(I, E)$$
where $I$ represents the generated image, and $E$ is the Canny edge map derived from $I$. The functions $L(I, E)$, $C(I, E)$, and $S(I, E)$ measure luminance, contrast, and structural similarity, respectively. This adaptation of SSIM allows us to evaluate whether the generated image maintains coherent and distinguishable structural features. A higher SSIM value indicates better structural preservation, essential for generating visually coherent outputs.
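Under this reading of the metric, a minimal sketch using OpenCV and scikit-image is shown below; the Canny thresholds are assumptions, as the exact edge-extraction settings are not specified here.

```python
import cv2
from skimage.metrics import structural_similarity

def edge_based_ssim(image_path: str, low: int = 100, high: int = 200) -> float:
    """SSIM between a generated image and its own Canny edge map, used as a structural prior.
    The thresholds (low, high) are illustrative assumptions."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)   # I: generated image (grayscale)
    edges = cv2.Canny(img, low, high)                    # E: Canny edge map derived from I
    return structural_similarity(img, edges, data_range=255)
```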
Learned Perceptual Image Patch Similarity (LPIPS). LPIPS [62] is a perceptual similarity metric that measures how similar two images appear based on deep feature representations rather than simple pixel-wise comparisons. Unlike SSIM, which focuses on low-level structural alignment, LPIPS captures higher-level perceptual features that are more aligned with human visual perception. The LPIPS score is computed as follows:
$$\mathrm{LPIPS}(I, R) = \sum_{l} w_l \lVert \phi_l(I) - \phi_l(R) \rVert_2^2$$
where $I$ and $R$ denote the generated and reference images, respectively, and $\phi_l$ represents deep feature embeddings extracted from the $l$-th layer of a pre-trained network (typically VGG [64] or AlexNet [65]). The weight $w_l$ determines the relative contribution of each layer. A lower LPIPS score indicates greater perceptual similarity between the generated and reference images, suggesting that the generated content successfully captures style and texture characteristics.
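A minimal sketch using the public lpips package (AlexNet backbone) is given below; the backbone choice and input preprocessing are assumptions and may differ from the evaluation pipeline used in this study.

```python
import lpips
import torch

# LPIPS with an AlexNet backbone; inputs are RGB tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')

def lpips_distance(img_generated: torch.Tensor, img_reference: torch.Tensor) -> float:
    """Lower values mean the two images are perceptually closer."""
    with torch.no_grad():
        return loss_fn(img_generated, img_reference).item()

# Example with random (N, 3, H, W) tensors standing in for real images
score = lpips_distance(torch.rand(1, 3, 256, 256) * 2 - 1,
                       torch.rand(1, 3, 256, 256) * 2 - 1)
```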
CLIP-Text Similarity (CLIP-T). CLIP-T [63] measures the semantic consistency between generated images and their corresponding textual descriptions using the CLIP model, which is trained to align images and text in a shared feature space. The similarity score is computed as follows:
$$\mathrm{CLIP\text{-}T}(I, T) = \cos\big(F(I), F(T)\big)$$
where $F(I)$ and $F(T)$ represent the CLIP-encoded feature vectors of the generated image and the input text prompt, respectively. The cosine similarity measures how closely the generated image aligns with the text description. A higher CLIP-T score indicates that the generated image is more semantically relevant to the prompt, which is crucial for assessing the faithfulness of text-to-image generation models.
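The sketch below computes this cosine similarity with the Hugging Face CLIP implementation; the checkpoint shown (ViT-B/32) is a smaller public stand-in and may differ from the CLIP model used for evaluation here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better alignment)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_feat, txt_feat).item()
```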
By incorporating SSIM, LPIPS, and CLIP-T, our evaluation framework effectively captures structural integrity, perceptual resemblance, and semantic fidelity, comprehensively assessing the generated images without relying on direct one-to-one reference comparisons. This combination of metrics allows for a more nuanced understanding of the quality of generated images across different aspects of image generation.
As shown in Table 2, CSD-Pano (Ours) outperforms all other methods in terms of structural similarity (SSIM) and perceptual image quality (LPIPS). It achieves the highest SSIM score of 0.910 and the lowest LPIPS score of 0.658, indicating superior image quality compared to other methods. Regarding CLIP-T, a higher score indicates better alignment with the text prompt. Midjourney achieves the highest CLIP-T score of 0.310, meaning it shows the best alignment with the text prompt among the methods tested. Midjourney, trained on large-scale, diverse image-text pairs, excels in semantic-text alignment, which benefits CLIP-T scores. In contrast, our CSD-Pano prioritizes spatial controllability and structural consistency, which may lead to trade-offs in text alignment.
While CSD-Pano performs well in SSIM and LPIPS, its CLIP-T score of 0.274 is the lowest, indicating that its alignment with the text prompt is slightly worse than Midjourney's. Meanwhile, Reference Only performs well with an SSIM of 0.895 and LPIPS of 0.696, but its CLIP-T score of 0.294 shows that its alignment with the text prompt is weaker than Midjourney's, although still relatively good. Flux and DALL-E 3 show lower performance across all metrics: Flux has an SSIM of 0.862 and LPIPS of 0.674, and DALL-E 3 records similar results, indicating weaker image quality and text alignment. In conclusion, while CSD-Pano (Ours) achieves the best image quality (SSIM and LPIPS), Midjourney has the strongest alignment with the text prompt (highest CLIP-T score). Nevertheless, CSD-Pano still performs well in generating high-quality, text-accurate interior design panoramas.
Midjourney and DALL-E 3 struggle with maintaining structural consistency despite their strengths in creative image generation. They cannot control design elements sufficiently, leading to unsatisfactory style alignment and spatial arrangements. On the other hand, Flux and Reference Only offer some level of control over style and structure. Their results often fall short of expectations, with limitations in tuning design elements and achieving the desired spatial coherence. These shortcomings highlight the need for a more robust approach, like CSD-Pano, which provides better control over structure and style, ensuring improved design consistency and coherence. CSD-Pano addresses the limitations of mainstream methods through its SC and STC components. The SC allows designers to control spatial arrangements and layout precisely, ensuring that the generated interiors align with the desired structural configurations. Meanwhile, the STC ensures consistent aesthetics across multiple iterations, which is crucial for maintaining stylistic fidelity in professional interior design projects. These features enable CSD-Pano to provide a more reliable and controlled design process compared to other methods.

4.5. Ablation Study on CSD-Pano Framework

To assess the impact of each individual module on the overall performance of the proposed CSD-Pano framework, the study conducted an ablation study by systematically removing key components from the full model as shown in Figure 7. Specifically, the study evaluated a baseline configuration and variants by removing PLC, SC, and STC. The evaluation used three objective metrics: SSIM, LPIPS, and CLIP-T. The following table summarizes the quantitative results of our ablation experiments, highlighting the contributions of each module to the framework’s overall performance.
Table 3 presents the quantitative ablation results of the CSD-Pano framework, where each control module is individually removed to evaluate its contribution. The baseline model, which lacks all three control modules—SC, STC, and PLC—and relies solely on text-based prompts, achieves the highest CLIP-T score of 0.295 but suffers from the lowest SSIM of 0.840 and the highest LPIPS of 0.734. This indicates that while textual prompts ensure strong semantic alignment, they alone are insufficient for maintaining high-quality structural and stylistic consistency. SC plays a critical role in maintaining structural fidelity. Removing SC leads to a notable drop in SSIM from 0.910 to 0.879 and an increase in LPIPS from 0.658 to 0.707, indicating that the structural consistency of generated panoramic views deteriorates without this module. STC ensures stylistic coherence in the generated images. When STC is removed, SSIM decreases to 0.896, and LPIPS increases to 0.689, suggesting a degradation in style adherence, though its impact is less severe than removing SC. PLC is responsible for maintaining a consistent panoramic perspective. The absence of PLC results in an SSIM drop to 0.901, and LPIPS increases slightly to 0.660 compared to the full model’s 0.658. This suggests that while PLC benefits structural coherence, its absence has a minor effect on perceptual similarity. The full CSD-Pano model achieves the best balance, with the highest SSIM of 0.910 and the lowest LPIPS of 0.658. The ablation study confirms that each module contributes uniquely to structural, stylistic, and panoramic consistency, collectively ensuring high-fidelity panoramic image synthesis.

4.6. The Diversity of Generated Designs

To evaluate the diversity of generated designs, the study conducted experiments where the structural constraints were kept constant using an identical Canny edge map across all samples. The style was varied by employing different style reference images, corresponding textual prompts, and altering the random seed. This setup allowed us to isolate and assess the impact of style guidance on the outputs, independent of structural variations.
As shown in Figure 8, this experiment maintains structural constraints by using the same Canny image in all samples. It ensures style consistency by using a unified style reference image and text prompts. The experiment shows that simply changing the random seed can lead to significant differences in the details of the generated images, which fully verifies the model’s ability to create diverse design schemes while maintaining a fixed structure and style framework. The results prove that the model can achieve a diverse expression of generated results through random seed adjustment, effectively improving the diversity and creativity of design schemes.
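This protocol can be summarized by the hedged sketch below: the structural condition, style reference, and prompt stay fixed while only the random seed changes. Here, `pipe`, `canny_image`, and `style_image` are placeholders for the CSD-Pano generation pipeline and its inputs, and the keyword names are illustrative rather than an exact API.

```python
import torch

# Placeholders: `pipe` stands in for the CSD-Pano generation pipeline; `canny_image` and
# `style_image` are the fixed structural constraint and style reference, respectively.
prompt = "new Chinese style living room, 360-degree panorama"
outputs = []
for seed in [0, 1, 2, 3]:
    generator = torch.Generator(device="cuda").manual_seed(seed)   # only the seed varies
    image = pipe(prompt=prompt,
                 image=canny_image,                                # fixed structural constraint
                 ip_adapter_image=style_image,                     # fixed style reference
                 generator=generator).images[0]
    outputs.append(image)
```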

4.7. Generate Design Details

As shown in Figure 9, the CSD-Pano framework generates a rendering showcasing a high-quality New Chinese-style panoramic space. The scene highlights the two core capabilities of the framework: precise control of structure through Canny edge conditioning and overall style unity ensured with the help of style reference images. The design solution blends traditional Chinese aesthetics with modern elements to create an elegant and harmonious interior space. The warm-toned wooden floor complements the exquisitely crafted furniture, and the essence of the New Chinese style is reflected in the refined decorative details. The fabric sofa with bright orange cushions adds warmth and cultural depth. Large floor-to-ceiling windows let in plenty of natural light, enhancing the sense of space and accentuating the delicate texture of the furnishings.
The lighting design is carefully executed to enhance depth and realism. Soft ambient lighting creates a welcoming atmosphere, while strategic highlights emphasize key furniture pieces. The interplay of natural and artificial lighting contributes to a well-balanced composition, reinforcing the scene’s authenticity. However, minor improvements could be made to the vertical alignment of certain elements, such as artwork placement and furniture arrangement, to achieve a more refined presentation. Optimizing shadow consistency and material textures would further enhance the overall visual quality.
The CSD-Pano framework successfully generates New Chinese-style panoramic renderings with high realism and precise aesthetic control. Its ability to automate diverse interior design styles from structured 3D models significantly improves design efficiency, reducing the manual workload required in traditional methods. By enabling fine-grained customization of interior aesthetics, CSD-Pano is a powerful tool for interior designers, offering a seamless workflow for generating high-quality panoramic visualizations.

5. Discussion

CSD-Pano’s framework demonstrates superior performance over widely used AI tools such as Midjourney and DALL-E 3. Specifically, CSD-Pano achieves higher structural consistency (SSIM = 0.910) and better perceptual quality (LPIPS = 0.658), highlighting its advantage in maintaining spatial coherence and stylistic fidelity—key requirements in professional interior design workflows. CSD-Pano empowers interior designers with rapid generation capabilities while preserving layout control and aesthetic consistency, thereby improving design efficiency and creative expression. The system supports a more iterative and user-driven design process by enabling fine-grained manipulation of scene structure and visual style, fostering a tighter integration between creative vision and technical execution. This study contributes to the expanding field of human-centered generative design by demonstrating how diffusion models can be adapted for structured and controllable generation tasks. It also opens new avenues for future research on multimodal conditioning and panoramic image synthesis. As AI tools increasingly support professional design work, regulatory considerations around authorship, copyright, and responsible AI usage must be addressed. CSD-Pano highlights the importance of transparency, user agency, and ethical standards in the deployment of generative technologies.

6. Conclusions

Our study successfully presents the CSD-Pano framework, a novel AI-driven approach for generating high-quality panoramic interior designs. The framework demonstrates superior structural control (with an SSIM of 0.910) and stylistic consistency (with an LPIPS of 0.658), outperforming existing methods such as Midjourney, DALL-E 3, and Flux. The proposed CSD-Pano framework introduces a controllable diffusion-based method tailored for panoramic interior image generation. It addresses key technical challenges such as spatial coherence, structural controllability, and style consistency—limitations in many existing models.
This research contributes significantly to the field of interior design by advancing AI capabilities in generating panoramic designs with high aesthetic and functional quality. The framework’s potential to streamline the design process makes it a valuable tool for designers looking to enhance efficiency and creativity, ultimately supporting the industry’s transition toward more automated and intelligent design processes. In real-world design workflows, CSD-Pano enables fast prototyping and design iteration, significantly reducing the manual effort required for layout planning and stylistic adjustments. This is particularly beneficial for designers working under tight deadlines or managing large-scale projects. The system supports efficient, high-quality interior design by automating complex tasks, enhancing design precision, and lowering production costs—making it a valuable assistive tool in professional environments.
Currently, the CSD-Pano framework is specifically tailored to interior design, with a focus on styles such as American, European, Modern, Neoclassical, and New Chinese styles. As such, its applicability to other architectural design processes, such as exterior building designs or other architectural styles, is limited. The computational costs of training and running the model can be high, particularly when dealing with large datasets or generating high-resolution images. Additionally, the current framework relies on a fixed dataset, which may limit its ability to handle more diverse design styles or rapidly evolving design trends. Future work will focus on expanding the dataset, improving the system’s scalability, and optimizing computational efficiency.
AI models trained on large datasets may inadvertently incorporate copyrighted content, raising questions about authorship and originality in AI-generated designs. There is a risk that AI tools may favor dominant design aesthetics, unintentionally marginalizing underrepresented cultural or stylistic expressions. While AI enhances efficiency, over-reliance may dilute human creativity. Maintaining a balanced collaboration between designers and AI is essential to preserve artistic intent.
Future work will focus on expanding the dataset to encompass a broader range of architectural styles and design processes, enhancing the framework’s adaptability to various building types and workflows. Additionally, optimizing the network architecture to reduce computational demands will improve efficiency, making the system more accessible to designers with varying hardware capabilities.

Author Contributions

W.Y., Conceptualization, Supervision, Methodology, Project administration, Writing—original draft, Writing—review and editing, and Validation. Y.Z., Conceptualization, Writing—original draft, Writing—review and editing, and Validation. C.W., Conceptualization, Writing—original draft, Writing—review and editing, Supervision, and Investigation. L.L., Conceptualization, Writing—original draft, Writing—review and editing, Software, Formal analysis, and Data curation. S.D., Writing—review and editing, Investigation, Software, and Validation. C.W., Writing—review and editing, Data curation, and Formal analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of the results from the Beijing Municipal High-Level Faculty Development Support Program on “Research on the Digital Preservation and Innovative Development of Beijing’s Central Axis Cultural Heritage” (Project No. BPHR202203072), the Beijing Institute of Graphic Communication Project on “Application research on creating immersive restaurants based on 3D holographic projection technology” (Project No. Ec202214), and the Youth Support Program of Beijing Institute of Graphic Communication on ”A Study on the Digital Representation of Chinese Cultural Visual Symbols” (Project No. Ea202320).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, S.; Zhou, P.; Xiong, Y.; Ma, C.; Wu, D.; Lu, W. Strategies for Driving the Future of Educational Building Design in Terms of Indoor Thermal Environments: A Comprehensive Review of Methods and Optimization. Buildings 2025, 15, 816. [Google Scholar] [CrossRef]
  2. Emad, S.; Aboulnaga, M.; Wanas, A.; Abouaiana, A. The Role of Artificial Intelligence in Developing the Tall Buildings of Tomorrow. Buildings 2025, 15, 749. [Google Scholar] [CrossRef]
  3. Thampanichwat, C.; Wongvorachan, T.; Sirisakdi, L.; Somngam, P.; Petlai, T.; Singkham, S.; Bhutdhakomut, B.; Jinjantarawong, N. The Architectural Language of Biophilic Design After Architects Use Text-to-Image AI. Buildings 2025, 15, 662. [Google Scholar] [CrossRef]
  4. Song, Y.; Wang, S. A Survey and Research on the Use of Artificial Intelligence by Chinese Design-College Students. Buildings 2024, 14, 2957. [Google Scholar] [CrossRef]
  5. Gür, M.; Çorakbaş, F.K.; Atar, İ.S.; Çelik, M.G.; Maşat, İ.; Şahin, C. Communicating AI for Architectural and Interior Design: Reinterpreting Traditional Iznik Tile Compositions through AI Software for Contemporary Spaces. Buildings 2024, 14, 2916. [Google Scholar] [CrossRef]
  6. Adewale, B.A.; Ene, V.O.; Ogunbayo, B.F.; Aigbavboa, C.O. A Systematic Review of the Applications of AI in a Sustainable Building’s Lifecycle. Buildings 2024, 14, 2137. [Google Scholar] [CrossRef]
  7. Wang, Y.T.; Liang, C.; Huai, N.; Chen, J.; Zhang, C.J. A Survey of Personalized Interior Design. Comput. Graph. Forum 2023, 42, e14844. [Google Scholar] [CrossRef]
  8. Delgado, J.M.D.; Oyedele, L.; Ajayi, A.; Akanbi, L.; Akinade, O.; Bilal, M.; Owolabi, H. Robotics and automated systems in construction: Understanding industry-specific challenges for adoption. J. Build. Eng. 2019, 26, 100868. [Google Scholar] [CrossRef]
  9. Wang, D.; Li, J.; Ge, Z.; Han, J. A computational approach to generate design with specific style. In Proceedings of the Design Society: 23rd International Conference on Engineering Design (ICED21), Gothenburg, Sweden, 16–20 August 2021; Volume 1, pp. 21–30. [Google Scholar] [CrossRef]
  10. Zhou, K.; Wang, T. Personalized Interiors at Scale: Leveraging AI for Efficient and Customizable Design Solutions. arXiv 2024, arXiv:2405.19188. [Google Scholar]
  11. Sinha, M.; Fukey, L.N. Sustainable Interior Designing in the 21st Century—A Review. ECS Trans. 2022, 107, 6801–6823. [Google Scholar] [CrossRef]
  12. Chen, L.; Wang, P.; Dong, H.; Shi, F.; Han, J.; Guo, Y.; Childs, P.R.N.; Xiao, J.; Wu, C. An artificial intelligence based data-driven approach for design ideation. J. Vis. Commun. Image Represent. 2019, 61, 10–22. [Google Scholar] [CrossRef]
  13. Le, M.-H.; Chu, C.-B.; Le, K.-D.; Nguyen, T.V.; Tran, M.-T.; Le, T.-N. VIDES: Virtual Interior Design via Natural Language and Visual Guidance. In Proceedings of the 2023 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Brisbane, Australia, 17–21 October 2023; IEEE: Brisbane, Australia, 2023; pp. 689–694. [Google Scholar]
  14. Chen, J.; Shao, Z.; Hu, B. Generating Interior Design from Text: A New Diffusion Model-Based Method for Efficient Creative Design. Buildings 2023, 13, 1861. [Google Scholar] [CrossRef]
  15. Liu, S.; Li, Z.; Teng, Y.; Dai, L. A Dynamic Simulation Study on the Sustainability of Prefabricated Buildings. Sustain. Cities Soc. 2022, 77, 103551. [Google Scholar] [CrossRef]
  16. Luo, L.; Mao, C.; Shen, L.; Li, Z. Risk Factors Affecting Practitioners’ Attitudes Toward the Implementation of an Industrialized Building System: A Case Study from China. Eng. Constr. Archit. Manag. 2015, 22, 622–643. [Google Scholar] [CrossRef]
  17. Gao, H.; Koch, C.; Wu, Y. Building information modelling based building energy modelling: A review. Appl. Energy 2019, 238, 320–343. [Google Scholar] [CrossRef]
  18. Zikirov, M.C.; Qosimova, S.F.; Qosimov, L.M. Direction of Modern Design Activities. Asian J. Multidimens. Res. (AJMR) 2021, 10, 11–18. [Google Scholar] [CrossRef]
  19. Idi, D.B.; Khaidzir, K.A.B.M. Concept of Creativity and Innovation in Architectural Design Process. Int. J. Innov. Manag. Technol. 2015, 6, 16. Available online: https://www.ijimt.org/vol6/566-A10041.pdf (accessed on 16 March 2025). [CrossRef]
  20. Borji, A. Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney, and DALL-E 2. arXiv 2022, arXiv:2210.00586. [Google Scholar] [CrossRef]
  21. Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S.; Kim, J. Text-to-image Diffusion Models in Generative AI: A Survey. arXiv 2023, arXiv:2303.07909. [Google Scholar]
  22. Chang, L.W.; Bao, W.; Hou, Q.; Jiang, C.; Zheng, N.; Zhong, Y.; Zhang, X.; Song, Z.; Yao, C.; Jiang, Z.; et al. FLUX: Fast Software-Based Communication Overlap on GPUs Through Kernel Fusion. arXiv 2024, arXiv:2406.06858. [Google Scholar]
  23. Cheng, S.I.; Chen, Y.J.; Chiu, W.C.; Tseng, H.Y.; Lee, H.Y. Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4054–4062. Available online: https://openaccess.thecvf.com/content/WACV2023/html/Cheng_Adaptively-Realistic_Image_Generation_From_Stroke_and_Sketch_With_Diffusion_Model_WACV_2023_paper.html (accessed on 16 March 2025).
  24. Shao, Z.; Chen, J.; Zeng, H.; Hu, W.; Xu, Q.; Zhang, Y. A new approach to interior design: Generating creative interior design videos of various design styles from indoor texture-free 3D models. Buildings 2024, 14, 1528. [Google Scholar] [CrossRef]
  25. Li, C.; Zhang, T.; Du, X.; Zhang, Y.; Xie, H. Generative AI for Architectural Design: A Literature Review. arXiv 2024, arXiv:2404.01335. [Google Scholar]
  26. Ashour, M.; Mahdiyar, A.; Haron, S.H. A comprehensive review of deterrents to the practice of sustainable interior architecture and design. Sustainability 2021, 13, 10403. [Google Scholar] [CrossRef]
  27. Colenberg, S.; Jylhä, T. Identifying interior design strategies for healthy workplaces—A literature review. J. Corp. Real Estate 2021, 24, 173–189. [Google Scholar] [CrossRef]
  28. Ibadullaev, I.R.; Atoshov, S.B. The Effects of Colors on the Human Mind in Interior Design. Indones. J. Innov. Stud. 2019, 7, 27. [Google Scholar] [CrossRef]
  29. Bettaieb, D.M.; Alsabban, R. Emerging living styles post-COVID-19: Housing flexibility as a fundamental requirement for apartments in Jeddah. Archnet-IJAR Int. J. Archit. Res. 2021, 15, 28–50. [Google Scholar] [CrossRef]
  30. Park, B.H.; Hyun, K.H. Analysis of pairings of colors and materials of furnishings in interior design with a data-driven framework. J. Comput. Des. Eng. 2022, 9, 2419–2438. [Google Scholar] [CrossRef]
  31. Chen, J.; Shao, Z.; Cen, C.; Li, J. HyNet: A novel hybrid deep learning approach for efficient interior design texture retrieval. Multimed. Tools Appl. 2024, 83, 28125–28145. [Google Scholar] [CrossRef]
  32. Bao, Z.; Laovisutthichai, V.; Tan, T.; Wang, Q.; Lu, W. Design for manufacture and assembly (DfMA) enablers for offsite interior design and construction. Build. Res. Inf. 2021, 50, 325–338. [Google Scholar] [CrossRef]
  33. Chen, J.; Wang, D.; Shao, Z.; Zhang, X.; Ruan, M.; Li, H.; Li, J. Using Artificial Intelligence to Generate Master-Quality Architectural Designs from Text Descriptions. Buildings 2023, 13, 2285. [Google Scholar] [CrossRef]
  34. Chen, J.; Shao, Z.; Zhu, H.; Chen, Y.; Li, Y.; Zeng, Z.; Yang, Y.; Wu, J.; Hu, B. Sustainable interior design: A new approach to intelligent design and automated manufacturing based on Grasshopper. Comput. Ind. Eng. 2023, 183, 109509. [Google Scholar] [CrossRef]
  35. Abd Hamid, B.A.; Taib, M.Z.; Razak, A.A.; Embi, M.R. Building Information Modelling: Challenges and Barriers in Implement of BIM for Interior Design Industry in Malaysia. IOP Conf. Ser. Earth Environ. Sci. 2018, 140, 012002. [Google Scholar] [CrossRef]
  36. Shen, F.; Wang, C.; Gao, J.; Guo, Q.; Dang, J.; Tang, J.; Chua, T.-S. Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model. arXiv 2025, arXiv:2502.09533. [Google Scholar]
  37. Shen, F.; Tang, J. IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; Available online: https://proceedings.neurips.cc/paper_files/paper/2024/file/0bd32794b26cfc99214b89313764da8e-Paper-Conference.pdf (accessed on 16 March 2025).
  38. Shen, F.; Jiang, X.; He, X.; Ye, H.; Wang, C.; Du, X.; Li, Z.; Tang, J. IMAGDressing-v1: Customizable Virtual Dressing. arXiv 2024, arXiv:2407.12705. [Google Scholar] [CrossRef]
  39. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. Available online: https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.html (accessed on 16 March 2025).
  40. Brisco, R.; Hay, L.; Dhami, S. Exploring the Role of Text-to-Image AI in Concept Generation. Proc. Des. Soc. 2023, 3, 1835–1844. [Google Scholar] [CrossRef]
  41. Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by Example: Exemplar-Based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18381–18391. Available online: https://openaccess.thecvf.com/content/CVPR2023/html/Yang_Paint_by_Example_Exemplar-Based_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.html (accessed on 16 March 2025).
  42. Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models. arXiv 2023, arXiv:2310.06313. [Google Scholar]
  43. Vartiainen, H.; Tedre, M. Using artificial intelligence in craft education: Crafting with text-to-image generative models. Digit. Creat. 2023, 34, 1–21. [Google Scholar] [CrossRef]
  44. Shamsian, A.; Navon, A.; Fetaya, E.; Chechik, G. Personalized Federated Learning Using Hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 9489–9502. Available online: https://proceedings.mlr.press/v139/shamsian21a.html (accessed on 16 March 2025).
  45. Lee, J.; Cho, K.; Kiela, D. Countering Language Drift via Visual Grounding. arXiv 2019, arXiv:1909.04499. [Google Scholar]
  46. Voynov, A.; Aberman, K.; Cohen-Or, D. Sketch-Guided Text-to-Image Diffusion Models. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; Association for Computing Machinery: New York, NY, USA, 2023; Volume 139, pp. 8162–8171. [Google Scholar] [CrossRef]
  47. Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. GLIGEN: Open-Set Grounded Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22511–22521. Available online: https://openaccess.thecvf.com/content/CVPR2023/html/Li_GLIGEN_Open-Set_Grounded_Text-to-Image_Generation_CVPR_2023_paper.html (accessed on 16 March 2025).
  48. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3836–3847. Available online: https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf (accessed on 16 March 2025).
  49. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-Based Real Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 6007–6017. Available online: https://openaccess.thecvf.com/content/CVPR2023/papers/Kawar_Imagic_Text-Based_Real_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.pdf (accessed on 16 March 2025).
  50. Otani, M.; Togashi, R.; Sawai, Y.; Ishigami, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14277–14286. Available online: https://openaccess.thecvf.com/content/CVPR2023/html/Otani_Toward_Verifiable_and_Reproducible_Human_Evaluation_for_Text-to-Image_Generation_CVPR_2023_paper.html (accessed on 16 March 2025).
  51. Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models. arXiv 2024, arXiv:2407.02482. [Google Scholar] [CrossRef]
  52. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  53. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022; pp. 1–15. [Google Scholar]
  54. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  55. Gao, B.; Ren, J.; Shen, F.; Wei, M.; Huang, Z. Exploring Warping-Guided Features via Adaptive Latent Diffusion Model for Virtual Try-On. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  56. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv 2022, arXiv:2208.12242. [Google Scholar]
  57. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv 2023, arXiv:2308.06721. [Google Scholar]
  58. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009; Available online: https://web.stanford.edu/~hastie/ElemStatLearn/ (accessed on 16 March 2025).
  59. Wang, W.; Wang, X.; Xue, C. Aesthetics Evaluation Method of Chinese Characters Based on Region Segmentation and Pixel Calculation. In Intelligent Human Systems Integration (IHSI 2023): Integrating People and Intelligent Systems; Ahram, T., Karwowski, W., Di Bucchianico, P., Taiar, R., Casarotto, L., Costa, P., Eds.; AHFE International: Orlando, FL, USA, 2023; Volume 69, pp. 1234–1243. [Google Scholar] [CrossRef]
  60. Wang, L.; Xue, C. A Simple and Automatic Typesetting Method Based on BM Value of Interface Aesthetics and Genetic Algorithm. In Advances in Usability, User Experience, Wearable and Assistive Technology: Proceedings of the AHFE 2021 Virtual Conferences on Usability and User Experience, Human Factors and Wearable Technologies, Human Factors in Virtual Environments and Game Design, and Human Factors and Assistive Technology; Ahram, T., Karwowski, W., Di Bucchianico, P., Taiar, R., Casarotto, L., Costa, P., Eds.; Springer: Cham, Switzerland, 2021; Volume 69, pp. 931–938. [Google Scholar] [CrossRef]
  61. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  62. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar] [CrossRef]
  63. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 16 March 2025).
  64. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  65. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. Available online: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 16 March 2025). [CrossRef]
Figure 1. Comparison between traditional and AI-driven panoramic rendering in interior design.
Figure 2. CSD-Pano research framework. CSD-Pano consists of three main modules: the structural controller (SC), the style controller (STC), and the panoramic LoRA controller (PLC), which work together to ensure precise structure control, stylistic consistency, and spatial coherence in panoramic renderings.
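The three controllers are conceptually close to existing conditioning mechanisms for Stable Diffusion: structural conditioning in the ControlNet family [48], image-prompt style guidance in the IP-Adapter family [57], and low-rank fine-tuning via LoRA [53]. The sketch below is a hypothetical illustration, not the authors' released implementation, of how such a pipeline could be wired with the Hugging Face diffusers library; the checkpoint names, LoRA path, and weighting values are placeholders.

```python
# Hypothetical CSD-Pano-style inference sketch using Hugging Face diffusers.
# Checkpoint names, LoRA file, and scales are illustrative placeholders.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Structural controller (SC): a Canny-edge ControlNet preserves the room layout.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Panoramic LoRA controller (PLC): low-rank weights fine-tuned on panoramic interiors.
pipe.load_lora_weights("path/to/panoramic_lora.safetensors")  # placeholder path

# Style controller (STC): an IP-Adapter image prompt carries the decoration style.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # balance style fidelity against prompt adherence

canny_image = load_image("layout_canny.png")     # edge map of the room structure
style_image = load_image("style_reference.png")  # decoration-style reference image

prompt = ("Interior design panoramic view, living room, New Chinese style, "
          "masterpiece, daytime, high-definition picture quality")
image = pipe(
    prompt=prompt,
    image=canny_image,
    ip_adapter_image=style_image,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("panorama.png")
```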
Figure 3. PSD-4 experimental dataset processing workflow. The workflow begins with data collection, selection, classification, automatic tagging, expert review, and refinement. Finally, matching pairs of text and images are created for model training.
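Each panorama is ultimately paired with a descriptive caption for training. As an illustration only (the released PSD-4 format is not specified here, and the field names and file paths below are assumptions), such pairs are commonly stored as one JSON record per line, a layout most diffusion fine-tuning scripts can consume directly:

```python
# Illustrative text-image pair manifest for fine-tuning; field names and paths are assumptions.
import json

records = [
    {"image": "living_room/new_chinese_0001.jpg",
     "text": "Interior design panoramic view, living room, New Chinese style, daytime"},
    {"image": "bedroom/modern_0042.jpg",
     "text": "Interior design panoramic view, bedroom, Modern style, daytime"},
]

with open("psd4_train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```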
Figure 4. Training data samples of PSD-4. The dataset covers five interior design styles: American, European, Modern, Neoclassical, and New Chinese. The scenes cover four main areas: (a) Kitchen, (b) Bedroom, (c) Living Room, and (d) Bathroom.
Figure 5. Comparison between the proposed method and mainstream methods in generative architectural design. Each column shows a different interior design style: (a) American, (b) European, (c) Modern, (d) Neoclassical, and (e) New Chinese. Each method generated architectural designs in the five styles, comprising 25 images (prompt: “Interior design panoramic view, living room, (well-known architect’s name), masterpiece, daytime, high-definition picture quality”).
Figure 6. Comparison of quantitative evaluation results of interior design panoramas generated by different methods.
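The quantitative comparison in Figure 6 and Tables 2 and 3 relies on SSIM [61], LPIPS [62], and a CLIP-based text–image score (CLIP-T) [63]. The snippet below is a generic sketch of how these three metrics are commonly computed with scikit-image, the lpips package, and a CLIP model from transformers; it is not the authors’ evaluation script, and preprocessing details such as resolution and file paths are assumptions.

```python
# Generic sketch of SSIM, LPIPS, and CLIP-T computation; not the paper's exact pipeline.
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

gen = np.array(Image.open("generated.png").convert("RGB"))
ref = np.array(Image.open("reference.png").convert("RGB").resize(gen.shape[1::-1]))

# SSIM: structural similarity on RGB images in [0, 255].
ssim_score = ssim(ref, gen, channel_axis=-1, data_range=255)

# LPIPS: perceptual distance on tensors normalized to [-1, 1], shape (N, 3, H, W).
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
lpips_fn = lpips.LPIPS(net="vgg")
lpips_score = lpips_fn(to_tensor(ref), to_tensor(gen)).item()

# CLIP-T: cosine similarity between the text prompt and the generated image.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompt = "Interior design panoramic view, living room, Modern style"
inputs = proc(text=[prompt], images=Image.open("generated.png"),
              return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
clip_t = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

print(f"SSIM={ssim_score:.3f}  LPIPS={lpips_score:.3f}  CLIP-T={clip_t:.3f}")
```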
Figure 7. Visualization of the ablation study. Each component is removed individually to verify its contribution. Each column shows a different interior design style: (a) American, (b) European, (c) Modern, (d) Neoclassical, and (e) New Chinese. CSD-Pano (full) denotes the results obtained with the complete CSD-Pano framework (prompt: “Interior design panoramic view, living room, [architectural style], masterpiece, daytime, high-definition picture quality”).
Figure 8. Generating diverse images with the same model, Canny edges, style reference, and varying seeds (prompt: Interior design panoramic view, living room, [architectural style], masterpiece, daytime, high-definition picture quality).
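Seed control in diffusion pipelines is typically handled through an explicit random generator. A minimal sketch of producing the variations shown in Figure 8, assuming the hypothetical diffusers-style pipeline configured in the earlier Figure 2 example (reusing its pipe, prompt, canny_image, and style_image), might look like this:

```python
# Fixed prompt, edge map, and style reference; only the random seed varies.
# Assumes `pipe`, `prompt`, `canny_image`, and `style_image` from the earlier sketch.
import torch

for seed in (0, 1, 2, 3):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(
        prompt=prompt,
        image=canny_image,
        ip_adapter_image=style_image,
        generator=generator,
        num_inference_steps=30,
    ).images[0]
    image.save(f"panorama_seed{seed}.png")
```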
Figure 9. Details of images generated by the CSD-Pano framework (prompt: Interior design panoramic view, living room, [architectural style], masterpiece, daytime, high-definition picture quality).
Table 1. Interior design type distribution of different architects in PSD-4. This table presents the distribution of interior design styles for various room types. The styles include American, European, Modern, Neoclassical, and New Chinese.
| Room Type | American Style | European Style | Modern Style | Neoclassical Style | New Chinese Style |
|---|---|---|---|---|---|
| Kitchen | 7 | 6 | 17 | 12 | 10 |
| Bedroom | 25 | 17 | 25 | 6 | 26 |
| Living room | 22 | 18 | 20 | 20 | 17 |
| Bathroom | 6 | 5 | 24 | 6 | 5 |
Table 2. Comparison of quantitative evaluation results of interior design panoramas generated by different methods. The best results of training-free methods are highlighted in bold (prompt: “Interior design panoramic view, living room, (well-known architect’s name), masterpiece, daytime, high-definition picture quality”). ↓ indicates lower is better, and ↑ indicates higher is better.
| Method | SSIM ↑ | LPIPS ↓ | CLIP-T ↑ |
|---|---|---|---|
| Midjourney | — | — | 0.310 |
| DALL-E 3 | — | — | 0.301 |
| Reference Only | 0.895 | 0.696 | 0.294 |
| Flux | 0.862 | 0.674 | 0.277 |
| CSD-Pano (Ours) | 0.910 | 0.658 | 0.274 |
Table 3. Quantitative ablation result of the CSD-Pano framework. Each component is individually removed to evaluate its necessity and contribution to the overall performance (prompt: “Interior design panoramic view, living room, [architectural style], masterpiece, daytime, high-definition picture quality”). ↓ indicates lower is better, and ↑ indicates higher is better.
| Method | SSIM ↑ | LPIPS ↓ | CLIP-T ↑ |
|---|---|---|---|
| Baseline | 0.840 | 0.734 | 0.295 |
| w/o PLC | 0.901 | 0.660 | 0.275 |
| w/o SC | 0.879 | 0.707 | 0.280 |
| w/o STC | 0.896 | 0.689 | 0.288 |
| CSD-Pano (full) | 0.910 | 0.658 | 0.274 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
