Article

A New Approach to Interior Design: Generating Creative Interior Design Videos of Various Design Styles from Indoor Texture-Free 3D Models

Zichun Shao, Junming Chen, Hui Zeng, Wenjie Hu, Qiuyi Xu and Yu Zhang

1 Cultural Creativity and Media, Hangzhou Normal University, Hangzhou 310000, China
2 Faculty of Humanities and Arts, Macau University of Science and Technology, Macao 999078, China
3 School of Design, Jiangnan University, Wuxi 214122, China
4 Detroit Green Technology Institute, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Buildings 2024, 14(6), 1528; https://doi.org/10.3390/buildings14061528
Submission received: 25 April 2024 / Revised: 12 May 2024 / Accepted: 23 May 2024 / Published: 24 May 2024
(This article belongs to the Special Issue Advanced Technologies for Urban and Architectural Design)

Abstract

Interior design demands considerable designer creativity and labor. Meanwhile, Artificial Intelligence (AI) is crucial for enhancing the creativity and efficiency of interior design. Therefore, this study proposes an innovative method for generating multistyle interior designs and videos with AI. First, a new indoor dataset was created to train an AI model that can generate designs in a specified style. Subsequently, video generation and super-resolution modules were integrated to establish an end-to-end workflow that generates interior design videos from texture-free 3D models. The proposed method uses AI to produce diverse interior design videos directly, thus replacing the tedious tasks of texture selection, lighting arrangement, and video rendering in traditional design processes. The results indicate that the proposed method can effectively provide diverse interior design videos, thereby enriching design presentation and improving design efficiency. Additionally, the proposed workflow is versatile and scalable and thus holds significant reference value for the transformation of traditional design toward intelligence.

1. Introduction

1.1. Background and Motivation

With improved living standards, more residents aspire to personalized and exquisitely designed residences [1,2,3,4,5]. However, designers face challenges regarding insufficient creativity and cumbersome design processes when adopting traditional interior design methods [6,7,8,9,10,11]. The lack of creativity is due to the random emergence of design inspiration [6,12,13], while complex workflows decrease the design efficiency [6,14,15]. Due to these factors, traditional design methods often prove inadequate to meet the continuously growing interior design demands [7,10,16]. Artificial intelligence (AI), particularly big data-based AI, possesses the capability to glean design rules and knowledge from extensive datasets, thereby proficiently generating corresponding interior designs. This reduces the reliance on manual design during the design process and enables designers to select from computer-generated designs. This paradigm shift in design methodology can enhance design efficiency and foster creativity [14,17].
The motivation of this study is to propose a novel AI-based interior design method and a corresponding workflow to assist designers in meeting customer design requirements. Overall, this study aims to enhance design creativity and efficiency by directly generating various interior design videos using AI, thereby driving the transformation of interior design toward intelligence.

1.2. Problem Statement and Objectives

Using generative AI models to produce images is a recent research hotspot [14,18,19,20]. Their ability to generate images based on textual descriptions makes it possible to produce designs directly from text [21,22,23,24]. However, two significant challenges emerge when applying generative models to interior design. First, conventional generative models are trained on extensive datasets, yet these datasets lack annotated professional knowledge of design style and spatial functionality. Consequently, the models' learning ability diminishes, hindering their capacity to generate appropriate interior designs [14,18]. Furthermore, generative models must produce design videos to provide clients with immersive previews, where the technical difficulty lies in ensuring texture consistency between different frames [25,26,27]. Therefore, promoting the application of generative models in interior design requires injecting domain-specific knowledge into these models to enhance their ability to generate designs with specific styles, as well as addressing the texture-consistency issue.
This study aims to develop a novel end-to-end method for artificial intelligence to generate interior design videos and establish a fresh design workflow. The objective is to equip designers with efficient and innovative design tools to enhance their efficacy in completing design tasks. This end-to-end workflow was named “From 3D Model to Interior Design Video (F3MTIDV)”. The new design tool allows designers to quickly generate interior design videos in different styles, thus eliminating the tedious material selection and video rendering work in traditional design processes. Figure 1 compares the interior design videos generated by a conventional diffusion model and F3MTIDV, where the conventional diffusion model faces challenges in specifying design styles [14,28] and lacks texture consistency between the rendered video frames [25,26,27]. In contrast, F3MTIDV can determine the design style in the generated video and maintain texture consistency. Moreover, our improved method can generate high-quality design videos, thus improving design efficiency and customer experience.

1.3. Methodology Overview

The study is mainly divided into several steps. First, a dataset for indoor design style and spatial functionality control (i.e., IDSSFCD-24) was created. Then, a new loss function incorporating design styles and spatial functionalities was proposed, and the diffusion model was fine-tuned on IDSSFCD-24 using this loss function. The trained diffusion model can generate indoor designs with specified styles and spatial functionalities. Finally, crossframe attention and super-resolution modules were introduced, with the former maintaining content consistency between frames and the latter enhancing the output video to high-definition quality. This end-to-end workflow is named From 3D Model to Interior Design Video (i.e., F3MTIDV).
With F3MTIDV, designers only need to export the texture-free 3D model as a video and input it and corresponding design requirements into the model to generate indoor design videos in the specified design style. Figure 2 illustrates the research framework and the related video generation workflow.
The proposed method alters the entire interior design process. With F3MTIDV, designers can directly generate interior design videos from texture-free 3D models, thus significantly enhancing the efficiency and creativity in interior design. Figure 3 demonstrates that F3MTIDV can generate interior design videos with various styles and spatial functions. In addition, the proposed method is scalable and applicable to other design tasks by retraining on different datasets.

2. Related Work

2.1. General Interior Design Process

Designers face challenges of low design efficiency and low creativity in interior design [10,14,17]. One of the reasons for the low design efficiency is the cumbersome and manual drawing-dependent traditional interior design process. Specifically, designers must first plan the design by drawing floor plans and creating corresponding interior 3D models. Then, they must add various texture maps and furniture to present the design style. Finally, they need to configure appropriate lighting for the 3D model and render the effect image. Designers use the completed renderings to communicate with clients and determine the final design solution. Meanwhile, interior design is an iterative process, where designers must manually select multiple texture maps and rerender effect images to provide clients with various design choices. Each design modification requires repeating nearly the entire design process, thus significantly increasing designer workloads [4,10,15,29]. Figure 4 illustrates the conventional design process.
Another challenge faced by interior designers is the demand for creative designs. Designers must continuously refine their designs to achieve innovative outcomes [13], but the cumbersome traditional interior design workflow often leaves designers little time to improve their creativity [10]. Thus, designers are often compelled to adopt fixed design methods, thereby suppressing creative design production. In the meantime, transforming creativity into visual expression requires significant labor. Therefore, there is an urgent need to leverage advanced technology to assist interior design by enhancing design efficiency and innovation capabilities.

2.2. Fine-Tuned Diffusion Model

Employing AI to generate images has recently become a research hotspot [30,31,32]. With almost the best performance, diffusion models have been widely applied in many cutting-edge fields [18,33,34]. Diffusion models typically generate images rapidly based on input text descriptions [18,21,35,36]. Applying diffusion models to interior design could help designers quickly produce images, thus improving design efficiency and creativity.
A diffusion model consists of the diffusion module and the denoising module. The diffusion module processes the original image by gradually introducing noise, thus transforming it into an image filled with noise. Correspondingly, the denoising module restores the image with noise to the original one. The denoising module achieves the image generation goal by learning the ability to restore the noisy image to the original one [23,35,37]. Additionally, diffusion models provide a flexible way to control the generated results. During image generation, diffusion models can be guided to produce images consistent with the added textual description [35,36].
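To make the two modules concrete, the following is a minimal PyTorch-style sketch of one training step of the denoising module: noise is added to an image at a random timestep, and the network is trained to predict that noise from the noisy image and the text guidance. The `eps_model` network and the `alphas_cumprod` noise schedule are assumed placeholders, not components of this study.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, text_emb, alphas_cumprod):
    """One denoising training step: add noise at a random timestep, then
    train the network to predict that noise from the noisy image and text."""
    b = x0.shape[0]
    alphas_cumprod = alphas_cumprod.to(x0.device)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)

    # Diffusion module: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise

    # Denoising module: predict the injected noise, guided by the text embedding
    noise_pred = eps_model(x_t, t, text_emb)
    return F.mse_loss(noise_pred, noise)
```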
Although diffusion models perform well in general domains, there is still room for improvement in fields with higher demands for expertise, such as interior design [6,17,18]. Existing diffusion models are trained based on large datasets but lack knowledge in specific professional domains. This limitation prevents users from effectively controlling the content and quality of the generated images through professional cues [14,28]. Therefore, integrating domain-specific knowledge into AI is necessary to enhance image generation quality in specific domains.
There are four standard methods for injecting professional knowledge into AI to improve the quality of the generated images. The first is Textual Inversion [38], which constructs better representation vectors in the embedding space of TextEncoder without changing the original weights of the diffusion model. This method has the shortest training time and the minimum generated model parameters but the weakest control over the generated images. The second is Hypernetwork [39], which inserts a new neural network into the crossattention module as an intermediate layer to influence the generation results. However, the images generated have lower quality. The third method is LoRA [40], which acquires new knowledge by inserting an intermediate layer into the U-net structure of the diffusion model. The advantage of LoRA lies in not requiring replicating the entire model’s weights. The model size is moderate after training, and the quality of the generated images is superior to the first two methods. Finally, there is Dreambooth [28], which adjusts the weights of the entire diffusion model through training. This method addresses language drift issues by setting new cue words and involving them in model training [41]. Additionally, it introduces a prior preservation loss to prevent model overfitting [28]. Dreambooth typically produces the best generation results among these four methods through model fine-tuning.
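To make the LoRA mechanism concrete, the sketch below wraps a frozen linear projection (such as an attention projection inside the U-net) with a trainable low-rank update; the rank and scaling values are illustrative assumptions rather than settings used in this study.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update:
    y = W x + scale * B(A(x)), where A and B are small matrices."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original diffusion-model weights stay fixed
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)   # start as an identity-preserving update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```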

2.3. Controllable Video Generation

Fine-tuning diffusion models can enhance their capabilities in generating domain-specific images [28,38,39,40]. However, textual descriptions alone are insufficient for precise output control [14,18,36], which is a problem that becomes increasingly pronounced in fields like interior design with extremely high control requirements. Therefore, it is necessary to improve the controllability of the generated results.
Scholars have proposed to enhance the image generation controllability of diffusion models by introducing additional neural networks [42,43,44]. For example, Voynov et al. [42] proposed the Latent Edge Predictor (LGP), which is capable of predicting image edges and comparing them with the actual image edges for loss calculation. The LGP can guide diffusion models to generate results aligned with real edge sketches at the pixel level by learning to minimize the loss. Li et al. [43] suggested training with bounding boxes or human pose maps as control conditions to expand the available types of control networks. Meanwhile, Zhang et al. [44] proposed the general framework, ControlNet, to support using single or multiple control models to govern image generation, thereby broadening the applicability of control conditions. These methods have enriched the means of controlling image generation and improved the controllability of diffusion model-generated images.
Although adding a control network has improved the practicality of text-to-image diffusion models [31,42,44], conveying information through videos is much more efficient than through images. Therefore, researchers have focused more on generating videos from texts [45,46,47]. For example, [26] proposed a novel video synthesis method, ControlVideo, which improved the control network by introducing crossframe attention, a crossframe smoother, and a hierarchical sampler, thus successfully alleviating issues such as high training costs, the inconsistent appearance of generated videos, and video flickering. Chen et al. [27] presented a spatiotemporal self-attention mechanism and introduced residual noise initialization to ensure video appearance consistency. Guo et al. [25] embedded a motion modeling module into a basic text-to-image diffusion model and retrained it on a large-scale dataset, thus enabling the model to generate videos with continuous motion. The above studies indicate that combining diffusion models and control network improvements can produce high-quality videos.
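To illustrate the crossframe idea behind these video methods, the sketch below shows a simplified crossframe attention step in which every frame attends to the keys and values of a single anchor frame (here the first frame), which is one way such methods keep appearance consistent; tensor shapes and names are assumptions for illustration only.

```python
import torch

def cross_frame_attention(q, k, v, anchor: int = 0):
    """Self-attention variant in which every frame attends to the keys/values
    of one anchor frame, helping textures stay consistent across frames.
    q, k, v: tensors of shape (frames, tokens, dim)."""
    frames = q.shape[0]
    k_anchor = k[anchor].unsqueeze(0).expand(frames, -1, -1)
    v_anchor = v[anchor].unsqueeze(0).expand(frames, -1, -1)
    scale = q.shape[-1] ** 0.5
    attn = torch.softmax(q @ k_anchor.transpose(-2, -1) / scale, dim=-1)
    return attn @ v_anchor
```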

3. Material and Methods

Although diffusion models have been successfully applied in various fields [14,18,33], the research on their application in interior design is relatively limited, especially the direct utilization of AI to generate interior design videos. This study proposes an improved diffusion model and a corresponding workflow for generating interior design videos (F3MTIDV). The primary advantage of F3MTIDV lies in its ability to rapidly generate interior design videos embodying a designated style, thus surpassing traditional interior design methodologies. This eliminates the need for conventional creative design and video rendering tasks, thus enabling designers to promptly prepare design proposals for user consideration and expediting the decision-making process.
The F3MTIDV workflow is implemented in five steps: building the new dataset IDSSFCD-24, designing the new composite loss function incorporating design style and spatial function losses, fine-tuning the basic diffusion model using IDSSFCD-24 and the loss function, introducing the crossframe attention module and super-resolution module to construct a complete interior design video generation workflow, and using this workflow to generate design videos and make modifications.
During dataset construction, this study enlisted the assistance of professional interior designers to collect over 20,000 high-quality, freely downloadable interior design images from well-known interior design websites and annotate each image with design styles and spatial functions. As a result, IDSSFCD-24 was successfully created to address the lack of high-quality interior design datasets.
An innovative integrated loss function was proposed to effectively acquire knowledge of new design styles and spatial functions. This loss function introduces design style, spatial function, and prior style losses based on the traditional diffusion model loss function, as expressed in Equation (1), thus forming a completely new integrated loss function, as expressed in Equation (2). The model training process aims to gradually reduce design style and spatial function losses, thereby gaining the ability to generate designs with specified design styles and spatial functions.
The basic diffusion model is expressed as follows:
$$L = \mathbb{E}_{X, h, \epsilon, t}\left[\, w_t \left\| \hat{X}_\theta(\alpha_t X + \sigma_t \epsilon,\, h) - X \right\|_2^2 \,\right] \tag{1}$$
where L represents the average loss that training aims to minimize to achieve better generation quality, X̂_θ is the trainable diffusion model, which repeatedly receives a noised image α_t X + σ_t ε and text guidance h and predicts a reconstruction of the original image, X̂_θ(α_t X + σ_t ε, h) − X represents the difference between this prediction and the ground-truth image X at that time step, and w_t is a time-step-dependent weighting term. During training, the diffusion model adjusts its parameters to reduce the difference between the generated and real images, thus ultimately minimizing L.
The proposed composite loss function is as follows:
$$L = \mathbb{E}_{X, h, \epsilon, \epsilon', t}\left[\, w_t \left\| \hat{X}_\theta(\alpha_t X + \sigma_t \epsilon,\, h) - X \right\|_2^2 + \lambda\, w_{t'} \left\| \hat{X}_\theta(\alpha_{t'} X_{pr} + \sigma_{t'} \epsilon',\, h_{pr}) - X_{pr} \right\|_2^2 \,\right] \tag{2}$$
The improved loss function in Equation (2) addresses the limitations of traditional diffusion models in generating interior designs with specified design styles. Equation (2) consists of two parts. The first part measures the difference between the image generated by the fine-tuned diffusion model and the actual image. Here, X̂_θ denotes the new diffusion model, which treats design styles and spatial functions as part of the loss: the difference between the model's prediction X̂_θ(α_t X + σ_t ε, h) and the ground-truth image X is quantified as the loss. The second part is the prior-preservation loss, obtained by comparing the new model's prediction with the prior image X_pr generated by the pretrained diffusion model under the prior prompt h_pr. A small difference indicates that, while retaining the general knowledge of the original diffusion model, the new model has acquired additional knowledge of design styles and spatial functions. Here, λ is a weighting coefficient that balances the two loss terms so that the diffusion model generates better results. The fine-tuned diffusion model is trained with this improved loss function, preserving the general foundational knowledge of the original pretrained model while also acquiring knowledge of design styles and spatial functions. Therefore, the fine-tuned diffusion model can generate interior designs with specific design styles.
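The sketch below renders Equation (2) in PyTorch-like pseudocode under the common noise-prediction parameterization; `model`, `add_noise`, and the prompt tensors are assumed placeholders, so this illustrates the two-term structure of the loss rather than the study's exact implementation.

```python
import torch.nn.functional as F

def composite_loss(model, x, h, x_prior, h_prior, add_noise, lam: float = 1.0):
    """Sketch of Equation (2) with a noise-prediction parameterization.

    x, h:             annotated interior images and their style/function prompts
    x_prior, h_prior: images produced by the frozen pretrained model and the prior prompt
    add_noise:        assumed helper returning (noised sample, injected noise, timestep)
    """
    # Instance term: learn the new design-style / spatial-function knowledge
    x_t, eps, t = add_noise(x)
    instance_loss = F.mse_loss(model(x_t, t, h), eps)

    # Prior-preservation term: stay close to what the pretrained model already generates
    xp_t, eps_p, tp = add_noise(x_prior)
    prior_loss = F.mse_loss(model(xp_t, tp, h_prior), eps_p)

    return instance_loss + lam * prior_loss
```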
This study fine-tuned the original diffusion model using IDSSFCD-24. Specifically, the design style and spatial function knowledge in IDSSFCD-24 was learned by minimizing the new composite loss function. By reducing this loss, the fine-tuned model can generate indoor designs with specified design styles and spatial functions. The model fine-tuning process is shown in Figure 5.
To construct the interior design video generation workflow, the ability to create interior designs of specified styles was acquired by fine-tuning the diffusion model. Subsequently, a crossframe attention module was employed to ensure consistency in content and texture between the generated video frames [26]. Also, a super-resolution module, BasicVSR++ [48], was utilized, which is an open-source super-resolution algorithm designed to enhance visual details and improve the resolution of the generated videos. Thus, the F3MTIDV workflow was completed, which enables the generation of interior design videos with consistent design styles, coherent content between consecutive frames, and high resolution.
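At a high level, the resulting workflow composes the fine-tuned diffusion model (with crossframe attention) and the super-resolution module; the sketch below illustrates this composition with assumed callable interfaces rather than a specific library API.

```python
from typing import Any, Callable, List

def f3mtidv_pipeline(
    guidance_frames: List[Any],      # frames exported from the texture-free 3D model
    prompt: str,                     # design style and spatial function description
    diffusion_video: Callable,       # fine-tuned diffusion model with crossframe attention
    super_resolve: Callable,         # BasicVSR++-style super-resolution module
) -> List[Any]:
    """High-level sketch of the F3MTIDV stages; all callables are assumed interfaces."""
    styled_frames = diffusion_video(guidance_frames, prompt)  # style-consistent frames
    return super_resolve(styled_frames)                       # high-definition output
```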

4. Experiment and Results

4.1. Experimental Settings

The diffusion model was fine-tuned on a Windows 10 PC with 64 GB of RAM and an NVIDIA RTX A6000 graphics card with 48 GB of VRAM. Training was conducted in PyTorch, an open-source machine learning library widely praised for its flexibility and rich tooling, with 100 iterations per image. During image preprocessing, an automatic scaling method adjusted each image to one of 13 fixed ratios according to its aspect ratio, ensuring that neither its width nor its height exceeded 1024 pixels after scaling. This strategy keeps aspect-ratio distortion within an acceptable range while preserving image integrity. Data augmentation techniques, such as horizontal flipping, were also applied. The model was trained with a learning rate of 0.000001 and a batch size of 24. Computation was accelerated with xFormers, a library of optimized attention components that speeds up training and inference on large-scale data, and with FP16 arithmetic, which represents values as 16-bit floating-point numbers and is faster to compute than 32-bit floating-point arithmetic. With these optimizations, the total training time for fine-tuning the diffusion model was 17 h.
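The aspect-ratio bucketing and flip augmentation described above might be implemented roughly as follows; the 13 width/height pairs are illustrative assumptions, since the exact ratios are not enumerated in the paper.

```python
import random
from PIL import Image

# Hypothetical bucket list: the paper uses 13 fixed ratios with a maximum side of 1024,
# but does not enumerate them, so these width/height pairs are assumptions.
BUCKETS = [(1024, 1024), (1024, 960), (960, 1024), (1024, 896), (896, 1024),
           (1024, 832), (832, 1024), (1024, 768), (768, 1024), (1024, 640),
           (640, 1024), (1024, 512), (512, 1024)]

def preprocess(image: Image.Image) -> Image.Image:
    """Scale to the bucket whose aspect ratio is closest to the image's,
    then optionally flip horizontally for data augmentation."""
    ratio = image.width / image.height
    w, h = min(BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - ratio))
    image = image.resize((w, h), Image.LANCZOS)
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
    return image
```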

4.2. Video Generation Efficiency

With F3MTIDV, designers only need to input design style descriptions and texture-free videos to promptly obtain high-quality interior design videos aligned with the specified style. Supported by an NVIDIA RTX A6000 graphics card, F3MTIDV generates a video with a resolution of 832 × 512 and a duration of 6 s in only 10 min. Interior design videos covering all spaces within a house can be generated in less than one hour. This streamlined operational process significantly enhances the method’s practicality, thus offering interior designers an entirely novel design approach.
By altering the prompts, designers can use F3MTIDV to generate indoor design videos across a spectrum of styles in bulk, and exploring diverse prompts to create videos is becoming an essential skill for designers. Supported by robust computational capabilities, clients can promptly review the generated designs and communicate with designers in real time, thus enhancing decision-making efficiency and design quality.

4.3. IDSSFCD-24

This study aims to employ AI to assist designers in rapidly and efficiently generating interior design videos with specific design styles. Considering the lack of datasets for interior design styles, we enlisted professional designers to collect over 30,000 high-quality, freely available images from well-known interior design websites. The designers then meticulously reviewed these images, examining design style and quality and excluding content not aligned with the specified standards; over 20,000 images met our criteria after filtering. Finally, multiple designers annotated the design style and spatial function of each image, with at least five different designers annotating every image. When annotations were inconsistent, the most frequently assigned category was designated as the final label. The resulting dataset is IDSSFCD-24.
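The majority-vote rule used to resolve annotation disagreements can be expressed compactly; the snippet below is a generic illustration rather than the project's actual annotation tooling.

```python
from collections import Counter
from typing import List

def final_label(annotations: List[str]) -> str:
    """Resolve annotator disagreement by majority vote: the most frequently
    assigned category becomes the final label for the image."""
    return Counter(annotations).most_common(1)[0][0]

# Example: five designers label the same image
print(final_label(["Nordic style", "Nordic style", "Contemporary style",
                   "Nordic style", "Nordic style"]))   # -> "Nordic style"
```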
The IDSSFCD-24 covers the classification of design styles and spatial functionalities. The labels for design styles include six types: “contemporary style”, “Chinese style”, “European style”, “Nordic style”, “American style”, and “Japanese style”. Meanwhile, the annotations for interior functionalities encompass four types: “bedroom”, “dining room”, “living room”, and “study room”. Table 1 provides a detailed display of the quantity of images for different categories.
The IDSSFCD-24 contains 21,060 indoor design images, with 4721 in the contemporary style (most numerous), 2675 in the Japanese style (least numerous), 6074 depicting living rooms (most numerous), and 4130 depicting study rooms (least numerous). Figure 6 displays some training data samples of IDSSFCD-24.

4.4. Subjective Assessment

Previous studies indicate that objective assessment metrics cannot fully reflect human perception [49,50,51]. Therefore, instead of relying solely on objective metrics, we also conducted subjective assessments to enhance the credibility of the conclusions [49]. Specifically, both subjective and objective assessment methods were adopted to evaluate the quality of the generated interior design videos. This comprehensive evaluation approach allows for a more thorough assessment of the quality and practical value of the generated videos, thus aiding designers in better understanding the technology and applying it in design practice.
In terms of subjective evaluation, Otani et al. [49] showed that the direct scoring method outperformed the ranking scoring method in generated content assessment and proposed the subjective evaluation metrics of Fidelity and Alignment. The Fidelity metric evaluates how closely the generated images resemble the actual images. The Alignment metric assesses the consistency between the generated images and the prompt text. Otani et al. [49] also experimentally demonstrated that providing detailed explanations for each level option in the evaluation improves score consistency between different annotators, thus outperforming traditional Likert scales in score consistency. Therefore, this study adopts the direct scoring method and adds detailed annotations to the scoring options.
Considering the significance of rich design details in generating designs, we introduced a “Design Details” metric. Since the consistency between consecutive video frames is crucial for video tasks, we also added a “Visual Consistency” metric. Finally, a “Usability” metric was introduced to comprehensively assess the overall quality of the generated videos. The specific evaluation metrics and corresponding descriptions are presented in Table 2. In this study, we mainly focus on the Alignment and Usability metrics among these five. A high score in the Alignment metric indicates that the generated images align with the textual prompts, while a high Usability score indicates that the video is directly usable.
The differences among the interior design videos generated by Stable Diffusion 1.5 (SD), the fine-tuned diffusion model (FTSD), SD + Control Video, and the proposed method were compared. SD is the basic diffusion model, and FTSD is a fine-tuned diffusion model; both generate videos by producing images frame by frame. SD + Control Video combines the basic diffusion model with a video generation module capable of directly producing coherent videos. Our method (F3MTIDV) combines a fine-tuned diffusion model, a crossframe attention module, and a super-resolution module, thus enabling the generation of coherent, high-resolution videos.
The four methods above were used for subjective evaluation, with each method sequentially generating interior design videos for 24 categories (four spatial functions × six design styles). Consecutive sets of 20 frames were extracted from the video generated for each category, yielding a total of 1920 images for evaluation. The assessments were conducted by 30 undergraduate students majoring in interior design and 15 professional interior designers, and the average scores for "Fidelity", "Alignment", "Design Details", "Visual Consistency", and "Usability" were obtained, as presented in Table 3.
Table 3 shows that F3MTIDV achieved optimal results in all five evaluation metrics. In terms of Fidelity, F3MTIDV slightly outperformed other methods. However, all methods exhibited relatively low Fidelity scores, thus indicating room for improvement in the authenticity of AI-generated videos. Regarding Alignment, the improved scores with FTSD or the added video control module suggest that these methods enhance the alignment between text descriptions and content. Regarding Design Details, SD and SD + Control Video showed lower scores, while FTSD and F3MTIDV achieved higher scores, thus indicating that fine-tuning the model enhances the ability to generate image design details. In terms of Visual Consistency, the method with the added video control module significantly outperformed other methods, thus demonstrating the effectiveness of incorporating this module. In terms of Usability, only F3MTIDV scored above three points. Our approach exhibits a clear advantage in this comprehensive evaluation metric compared to other methods. Overall, subjective assessment scores validate the effectiveness of F3MTIDV.

4.5. Objective Assessment

For the objective evaluation, we utilized the Structural Similarity Index (SSIM) [53], Fréchet Inception Distance (FID) [54], and CLIP Score [55] to assess the consistency, quality, and textual–visual alignment of the generated videos. Specifically, SSIM [53] serves as an objective metric for measuring image quality, thus aiming to quantify the structural similarity between two images. SSIM considers the brightness, contrast, structure, and how human eyes perceive these factors. A value closer to one indicates higher structural similarity between the generated and original images. SSIM is calculated as follows [53]:
$$\mathrm{SSIM}(x, y) = L(x, y) \cdot C(x, y) \cdot S(x, y) \tag{3}$$
where x and y are the two images being compared, L(x, y) measures the luminance similarity between x and y, C(x, y) measures the contrast similarity based on the standard deviations of the images, and S(x, y) measures the structural similarity based on the covariance between the two images. The SSIM value is the product of these three components.
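In practice, per-frame SSIM can be computed with scikit-image, as in the hedged sketch below (assuming H × W × 3 uint8 arrays and a scikit-image version that accepts the channel_axis argument).

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_ssim(generated: np.ndarray, reference: np.ndarray) -> float:
    """SSIM between a generated frame and a reference frame.
    Arrays are H x W x 3 uint8 images; values closer to 1 mean higher similarity."""
    return structural_similarity(generated, reference, channel_axis=-1)
```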
FID [54] is an indicator assessing the quality of generated images. It measures the quality of image generation models by calculating the difference between the distributions of authentic and generated images. FID utilizes the Inception network to transform sets of generated and authentic images into feature vectors and then computes the Fréchet distance between the distributions of actual and generated images. The Fréchet distance quantifies the similarity between two distributions, with a smaller FID indicating higher similarity between authentic and generated images, thus suggesting better image quality. The FID is calculated as follows:
$$\mathrm{FID} = \left\| \mu_{real} - \mu_{gen} \right\|_2^2 + \mathrm{Tr}\left( \Sigma_{real} + \Sigma_{gen} - 2\left( \Sigma_{real}\Sigma_{gen} \right)^{1/2} \right) \tag{4}$$
where μ_real and μ_gen denote the means of the feature vectors of the real and generated images, Σ_real and Σ_gen denote the corresponding covariance matrices, ‖·‖₂ denotes the L2 norm, and Tr(·) denotes the trace of a matrix. This formula measures the similarity between the feature distributions of real and generated images through the Fréchet distance, thereby assessing the quality of the generated images.
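Assuming Inception feature vectors have already been extracted for the real and generated image sets, the FID computation itself reduces to a few lines, as sketched below.

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet Inception Distance between two sets of Inception feature vectors
    (each array is N x D). Lower values indicate more similar distributions."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts from numerics
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```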
The CLIP Score [55] measures the visual–textual consistency between descriptive content and images. It transforms natural language and images into feature vectors and then calculates their cosine similarity. A CLIP Score close to one indicates a higher correlation between the image and the corresponding text. The CLIP Score is calculated as follows:
$$\mathrm{CLIP\ Score}(c, v) = w \times \max\left( \cos(c, v),\, 0 \right) \tag{5}$$
where c and v are the feature vectors output by the CLIP encoders for the textual description and the image, cos(c, v) is the cosine similarity between c and v, w is a weight that adjusts the impact of the similarity, and max(·, 0) ensures that the score does not fall below zero.
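Given precomputed CLIP text and image embeddings, the CLIP Score reduces to a clipped, scaled cosine similarity; the original CLIPScore formulation [55] uses w = 2.5, and the sketch below simply treats w as a parameter.

```python
import torch
import torch.nn.functional as F

def clip_score(text_emb: torch.Tensor, image_emb: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """CLIP Score from precomputed CLIP embeddings: w * max(cos(c, v), 0).
    Embeddings are 1-D feature vectors from the CLIP text and image encoders."""
    cos = F.cosine_similarity(text_emb.unsqueeze(0), image_emb.unsqueeze(0)).squeeze(0)
    return w * torch.clamp(cos, min=0.0)
```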
The 1920 generated images were quantitatively evaluated using the SSIM [53], FID [54], and CLIP Score [55]. The evaluation results are shown in Table 4.
The results in Table 4 indicate that F3MTIDV performed the best in the SSIM, FID, and the CLIP Score. The research results show that fine-tuning the model and adding the control video module can significantly improve structural similarity, image quality, and the alignment between textual descriptions and content. The significant improvements were mainly in the SSIM and FID, which is consistent with the conclusions from subjective assessments.

4.6. The Diversity of F3MTIDV-Generated Videos

Figure 7 illustrates the diversity of videos generated by F3MTIDV. We selected two sets of distinct prompt words, with each yielding three videos, thus totaling six videos for demonstration purposes. As depicted in the figure, a consistent design style is maintained across all video frames, which is crucial in practical design applications. Moreover, the visuals indicate that varying prompt words can produce designs with differing stylistic elements. Thus, F3MTIDV can provide designers with a spectrum of interior design options, thereby augmenting both design efficiency and quality.

4.7. Details in the Generated Videos

One frame was selected from the Chinese-style living room video generated by F3MTIDV to showcase the details of the generated content (Figure 8). Figure 8 presents F3MTIDV's ability to create appropriate textures for textureless input. For instance, F3MTIDV generated a white fabric texture for the sofa outline and produced red pillows; for the bookshelf, it generated a dark wooden texture; and for the flooring, it created light-colored wooden floors and carpets. All the generated textures and colors adhere to the Chinese design style. Additionally, F3MTIDV's lighting design is reasonable, generating linear lighting for each bookshelf and spotlights at the top of the two bookshelves. F3MTIDV also created auxiliary lighting on the TV background wall to reduce the brightness difference between the TV screen and the wall, which helps protect the eyes. Furthermore, the design generated by F3MTIDV exhibits rich details, such as the black stitched edges at the seams of the Chinese fabric sofa, which align with real sofa manufacturing processes. These results collectively indicate that F3MTIDV has acquired strong generative design capabilities through training on the dataset and effectively conveys design intentions, making the content more realistic.
Nevertheless, the design videos generated by F3MTIDV still have room for improvement. Firstly, the structural integrity of the objects can still be enhanced, such as the insufficient verticality of the lines in the structures of the generated bookshelves and cabinets. Secondly, inaccurate lighting and shadow relationships still exist in the generated videos. Finally, F3MTIDV needs to create videos with higher contrast to enhance realism. Overall, generating videos with different design styles from textureless 3D models has proven feasible with F3MTIDV, which will help designers quickly produce spatial designs in different styles, thus enhancing design efficiency and decision-making effectiveness.

5. Discussion

The subjective and objective evaluations fully demonstrated the effectiveness of F3MTIDV. Visual comparisons with other methods during subjective assessment showed that F3MTIDV can facilitate end-to-end video generation with consistent design styles and no flickering, which is unattainable by other methods. Furthermore, the questionnaire assessments with five subjective evaluation metrics suggest that F3MTIDV has achieved optimal results in all indicators. The exceptionally high Alignment and Usability scores indicate that the videos generated by F3MTIDV have good visual–textual consistency and usability. During objective evaluation, F3MTIDV yielded the highest scores in metrics such as the SSIM [53], FID [54], and CLIP Score [55], thus further confirming its effectiveness.
By directly utilizing AI to generate diverse interior design videos, F3MTIDV replaces the tedious tasks of creative ideation, material selection, lighting arrangement, and rendering in traditional design practice. Compared to conventional methods, F3MTIDV excels in efficiency and creative generation. Regarding design efficiency, completing a design and the corresponding modifications with traditional methods typically takes about half a month, whereas F3MTIDV (on a PC with a 48 GB VRAM graphics card) can generate design videos covering various styles and spatial functions within one hour. As computing power continues to improve, the speed of interior design video generation with F3MTIDV can be increased further. In terms of creative design, F3MTIDV can generate multiple interior design styles for users to choose from, thus reducing the complexity of creative design and accelerating the design decision-making process. Overall, F3MTIDV demonstrates the feasibility of an innovative approach to interior design. In addition, F3MTIDV is scalable: by replacing the basic diffusion model, it can be adapted to video generation in other design tasks.
AI-generated content will profoundly impact current design approaches. In terms of design efficiency, AI will increasingly take over tasks emphasizing logical and rational descriptions, thus eventually forming an AI design chain. Simple design tasks will be completed by AI, thus allowing designers more time to contemplate design creativity and enhance design quality. Regarding role positioning, designers are no longer merely traditional design creators but are transforming into design facilitators collaborating with AI. For example, the work of designers in this study goes beyond simply drawing images: they collect and organize data through their professional knowledge and transfer knowledge to AI models. This new human–machine collaboration approach may become the norm in future interior design, thus driving the design process toward automation and intelligence.

6. Conclusions

Traditional interior design methods require designers to possess high creativity and undertake heavy labor when creating interior design videos, thus leading to a lack of creativity and low design efficiency. To address these issues, we propose F3MTIDV to automate interior design video generation. Experimental results demonstrate that F3MTIDV can replace the laborious creative design and drawing work in traditional design processes, thereby changing the formal design process and significantly improving design efficiency and creativity.
Nonetheless, this study is not free from limitations. Firstly, controlling the entire video generation process through a fixed set of prompts is challenging and can be further improved by using dynamic vectors to automatically adapt to the content of each frame. Secondly, texts cannot specify texture appearance in specific areas of the image when generating textures, thus requiring enhanced generation process controllability. Furthermore, our understanding of design styles is categorized manually, and a more extensive and automated annotation classification method may better represent real-world design classifications. Finally, the Fidelity metric of the videos generated by the proposed method can still be improved, and enhancing the authenticity of the videos can increase their usability.
Artificial intelligence (AI) can be applied to future interior design stages. For example, during the planning phase, AI can provide designers with an array of design schemes for selection. Subsequently, image generation techniques based on generative adversarial networks or diffusion models can yield numerous conceptual design renderings. In the modeling phase, automated AI modeling methods reduce the traditional manual modeling workload. Finally, automating material texturing and video rendering enhances efficiency from design to presentation, thus culminating in a more comprehensive AI design workflow.

Author Contributions

Conceptualization, J.C. and Z.S.; Methodology, J.C. and Z.S.; Software, J.C. and Z.S.; Validation, J.C. and Z.S.; Formal analysis, J.C. and Z.S.; Investigation, J.C. and Z.S.; Resources, Y.Z.; Writing—original draft, J.C., Z.S., H.Z., W.H., Q.X. and Y.Z.; Writing—review and editing, J.C. and Z.S.; Supervision, Y.Z.; Project administration, Y.Z.; Funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Colenberg, S.; Jylhä, T. Identifying interior design strategies for healthy workplaces—A literature review. J. Corp. Real Estate 2021, 24, 173–189. [Google Scholar] [CrossRef]
  2. Ibadullaev, I.; Atoshov, S. The Effects of Colors on the Human Mind in the Interior Design. Indones. J. Innov. Stud. 2019, 7, 1–9. [Google Scholar] [CrossRef]
  3. Bettaieb, D.M.; Alsabban, R. Emerging living styles post-COVID-19: Housing flexibility as a fundamental requirement for apartments in Jeddah. Archnet-IJAR Int. J. Archit. Res. 2021, 15, 28–50. [Google Scholar] [CrossRef]
  4. Wang, Y.; Liang, C.; Huai, N.; Chen, J.; Zhang, C. A Survey of Personalized Interior Design. Comput. Graph. Forum 2023, 42, e14844. [Google Scholar] [CrossRef]
  5. Park, B.H.; Hyun, K.H. Analysis of pairings of colors and materials of furnishings in interior design with a data-driven framework. J. Comput. Des. Eng. 2022, 9, 2419–2438. [Google Scholar] [CrossRef]
  6. Ashour, M.; Mahdiyar, A.; Haron, S.H. A Comprehensive Review of Deterrents to the Practice of Sustainable Interior Architecture and Design. Sustainability 2021, 13, 10403. [Google Scholar] [CrossRef]
  7. Delgado, J.M.D.; Oyedele, L.; Ajayi, A.; Akanbi, L.; Akinade, O.; Bilal, M.; Owolabi, H. Robotics and automated systems in construction: Understanding industry-specific challenges for adoption. J. Build. Eng. 2019, 26, 100868. [Google Scholar] [CrossRef]
  8. Wang, D.; Li, J.; Ge, Z.; Han, J. A Computational Approach to Generate Design with Specific Style. Proc. Des. Soc. 2021, 1, 21–30. [Google Scholar] [CrossRef]
  9. Chen, J.; Shao, Z.; Cen, C.; Li, J. HyNet: A novel hybrid deep learning approach for efficient interior design texture retrieval. Multimed. Tools Appl. 2023, 83, 28125–28145. [Google Scholar] [CrossRef]
  10. Bao, Z.; Laovisutthichai, V.; Tan, T.; Wang, Q.; Lu, W. Design for manufacture and assembly (DfMA) enablers for offsite interior design and construction. Build. Res. Inf. 2022, 50, 325–338. [Google Scholar] [CrossRef]
  11. Sinha, M.; Fukey, L.N. Sustainable Interior Designing in the 21st Century—A Review. ECS Trans. 2022, 107, 6801. [Google Scholar] [CrossRef]
  12. Chen, L.; Wang, P.; Dong, H.; Shi, F.; Han, J.; Guo, Y.; Childs, P.R.; Xiao, J.; Wu, C. An artificial intelligence based data-driven approach for design ideation. J. Vis. Commun. Image Represent. 2019, 61, 10–22. [Google Scholar] [CrossRef]
  13. Yilmaz, S.; Seifert, C.M. Creativity through design heuristics: A case study of expert product design. Des. Stud. 2011, 32, 384–415. [Google Scholar] [CrossRef]
  14. Chen, J.; Wang, D.; Shao, Z.; Zhang, X.; Ruan, M.; Li, H.; Li, J. Using Artificial Intelligence to Generate Master-Quality Architectural Designs from Text Descriptions. Buildings 2023, 13, 2285. [Google Scholar] [CrossRef]
  15. Chen, J.; Shao, Z.; Zhu, H.; Chen, Y.; Li, Y.; Zeng, Z.; Yang, Y.; Wu, J.; Hu, B. Sustainable interior design: A new approach to intelligent design and automated manufacturing based on Grasshopper. Comput. Ind. Eng. 2023, 183, 109509. [Google Scholar] [CrossRef]
  16. Abd Hamid, A.B.; Taib, M.M.; Razak, A.A.; Embi, M.R. Building information modelling: Challenges and barriers in implement of BIM for interior design industry in Malaysia. In Proceedings of the 4th International Conference on Civil and Environmental Engineering for Sustainability (IConCEES 2017), Langkawi, Malaysia, 4–5 December 2017; Volume 140, p. 012002. [Google Scholar] [CrossRef]
  17. Karan, E.; Asgari, S.; Rashidi, A. A markov decision process workflow for automating interior design. KSCE J. Civ. Eng. 2021, 25, 3199–3212. [Google Scholar] [CrossRef]
  18. Chen, J.; Shao, Z.; Hu, B. Generating Interior Design from Text: A New Diffusion Model-Based Method for Efficient Creative Design. Buildings 2023, 13, 1861. [Google Scholar] [CrossRef]
  19. Cheng, S.I.; Chen, Y.J.; Chiu, W.C.; Tseng, H.Y.; Lee, H.Y. Adaptively-realistic image generation from stroke and sketch with diffusion model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4054–4062. [Google Scholar] [CrossRef]
  20. Yang, B.; Gu, S.; Zhang, B.; Zhang, T.; Chen, X.; Sun, X.; Chen, D.; Wen, F. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18381–18391. [Google Scholar] [CrossRef]
  21. Brisco, R.; Hay, L.; Dhami, S. Exploring the Role of Text-to-Image AI in Concept Generation. Proc. Des. Soc. 2023, 3, 1835–1844. [Google Scholar] [CrossRef]
  22. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef]
  23. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar] [CrossRef]
  24. Vartiainen, H.; Tedre, M. Using artificial intelligence in craft education: Crafting with text-to-image generative models. Digit. Creat. 2023, 34, 1–21. [Google Scholar] [CrossRef]
  25. Guo, Y.; Yang, C.; Rao, A.; Wang, Y.; Qiao, Y.; Lin, D.; Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv 2023, arXiv:2307.04725. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Wei, Y.; Jiang, D.; Zhang, X.; Zuo, W.; Tian, Q. ControlVideo: Training-Free Controllable Text-to-Video Generation. arXiv 2023, arXiv:2305.13077. [Google Scholar] [CrossRef]
  27. Chen, W.; Wu, J.; Xie, P.; Wu, H.; Li, J.; Xia, X.; Xiao, X.; Lin, L. Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. arXiv 2023, arXiv:2305.13840. [Google Scholar] [CrossRef]
  28. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar] [CrossRef]
  29. Salvagioni, D.A.J.; Melanda, F.N.; Mesas, A.E.; González, A.D.; Gabani, F.L.; Andrade, S.M.d. Physical, psychological and occupational consequences of job burnout: A systematic review of prospective studies. PLoS ONE 2017, 12, e0185781. [Google Scholar] [CrossRef] [PubMed]
  30. Yang, C.; Liu, F.; Ye, J. A product form design method integrating Kansei engineering and diffusion model. Adv. Eng. Inform. 2023, 57, 102058. [Google Scholar] [CrossRef]
  31. Zhao, S.; Chen, D.; Chen, Y.C.; Bao, J.; Hao, S.; Yuan, L.; Wong, K.Y.K. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. arXiv 2023, arXiv:2305.16322. [Google Scholar] [CrossRef]
  32. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar] [CrossRef]
  33. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar] [CrossRef]
  34. Lyu, Y.; Wang, X.; Lin, R.; Wu, J. Communication in Human—AI Co-Creation: Perceptual Analysis of Paintings Generated by Text-to-Image System. Appl. Sci. 2022, 12, 11312. [Google Scholar] [CrossRef]
  35. Zhang, C.; Zhang, C.; Zhang, M.; Kweon, I.S. Text-to-image diffusion model in generative ai: A survey. arXiv 2023, arXiv:2303.07909. [Google Scholar] [CrossRef]
  36. Liu, B.; Lin, W.; Duan, Z.; Wang, C.; Ziheng, W.; Zipeng, Z.; Jia, K.; Jin, L.; Chen, C.; Huang, J. Rapid diffusion: Building domain-specific text-to-image synthesizers with fast inference speed. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 295–304. [Google Scholar] [CrossRef]
  37. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  38. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar] [CrossRef]
  39. Shamsian, A.; Navon, A.; Fetaya, E.; Chechik, G. Personalized federated learning using hypernetworks. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 9489–9502. [Google Scholar] [CrossRef]
  40. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  41. Lee, J.; Cho, K.; Kiela, D. Countering Language Drift via Visual Grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 4 November 2019; pp. 4385–4395. [Google Scholar] [CrossRef]
  42. Voynov, A.; Aberman, K.; Cohen-Or, D. Sketch-guided text-to-image diffusion models. In Proceedings of the SIGGRAPH ’23: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–11. [Google Scholar] [CrossRef]
  43. Li, Y.; Liu, H.; Wu, Q.; Mu, F.; Yang, J.; Gao, J.; Li, C.; Lee, Y.J. Gligen: Open-set grounded text-to-image generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22511–22521. [Google Scholar] [CrossRef]
  44. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar] [CrossRef]
  45. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6007–6017. [Google Scholar] [CrossRef]
  46. Chu, E.; Lin, S.Y.; Chen, J.C. Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models. arXiv 2023, arXiv:2305.19193. [Google Scholar] [CrossRef]
  47. Hu, Z.; Xu, D. Videocontrolnet: A motion-guided video-to-video translation framework by using diffusion model with controlnet. arXiv 2023, arXiv:2307.14073. [Google Scholar] [CrossRef]
  48. Chan, K.C.; Zhou, S.; Xu, X.; Loy, C.C. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5972–5981. [Google Scholar] [CrossRef]
  49. Otani, M.; Togashi, R.; Sawai, Y.; Ishigami, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14277–14286. [Google Scholar] [CrossRef]
  50. Guo, J.; Du, C.; Wang, J.; Huang, H.; Wan, P.; Huang, G. Assessing a Single Image in Reference-Guided Image Synthesis. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 753–761. [Google Scholar] [CrossRef]
  51. Seshadrinathan, K.; Soundararajan, R.; Bovik, A.C.; Cormack, L.K. Study of subjective and objective quality assessment of video. IEEE Trans. Image Process. 2010, 19, 1427–1441. [Google Scholar] [CrossRef] [PubMed]
  52. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  53. Bakurov, I.; Buzzelli, M.; Schettini, R.; Castelli, M.; Vanneschi, L. Structural similarity index (SSIM) revisited: A data-driven approach. Expert Syst. Appl. 2022, 189, 116087. [Google Scholar] [CrossRef]
  54. Obukhov, A.; Krasnyanskiy, M. Quality assessment method for GAN based on modified metrics inception score and Fréchet inception distance. In Software Engineering Perspectives in Intelligent Systems: Proceedings of 4th Computational Methods in Systems and Software 2020; Springer: Cham, Switzerland, 2020; Volume 1294, pp. 102–114. [Google Scholar] [CrossRef]
  55. Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar] [CrossRef]
Figure 1. Comparison between interior design videos generated using a conventional diffusion model and our method (F3MTIDV). Through generative AI, F3MTIDV can produce videos with a specific design style and maintain texture consistency between frames, which is a capability that the conventional diffusion model lacks.
Figure 2. Research framework and workflow. This study first constructed the IDSSFCD-24 dataset and established a new composite loss function for fine-tuning the diffusion model, thus enabling it to generate interior designs with specified styles and spatial functions. Subsequently, this study introduced the crossframe attention and super-resolution modules to generate high-definition and temporally consistent interior design videos, thus forming the F3MTIDV workflow. Users only need to input the texture-free interior design video and design requirements into F3MTIDV to obtain creative interior design videos with the specified style.
Figure 3. The interior design video generated by F3MTIDV presents diverse design styles and spatial functions. F3MTIDV can generate videos of six specific design styles and four functional spaces, thus totaling 24 unique types of interior designs.
Figure 4. The conventional interior design process. The process is quite cumbersome, and designers must repeat the entire process for design modifications.
Figure 5. Schematic of the fine-tuned diffusion model.
Figure 6. Training data samples of IDSSFCD-24.
Figure 7. F3MTIDV generates diverse design videos. It can produce designs in various styles and generate differentiated designs within the same design style. This new design approach provides designers with different design videos for communication with users, thus accelerating the design decision-making process.
Figure 8. Details in the generated interior design videos. (Prompt word: Chinese-style living room.)
Table 1. Distribution of images of different design styles and spatial functions in IDSSFCD-24.
| | Contemporary Style | Chinese Style | European Style | Nordic Style | American Style | Japanese Style |
|---|---|---|---|---|---|---|
| Bedroom | 1073 | 898 | 831 | 995 | 697 | 812 |
| Dining room | 1491 | 761 | 964 | 993 | 789 | 552 |
| Living room | 1289 | 889 | 990 | 1486 | 689 | 731 |
| Study room | 868 | 662 | 632 | 752 | 636 | 580 |
| Total | 4721 | 3210 | 3417 | 4226 | 2811 | 2675 |
Table 2. Subjective assessment questionnaire questions.
1. Fidelity: Does the image look like an AI-generated photo or a real photo?
  • AI-generated photo.
  • Probably an AI-generated photo, but photorealistic.
  • Neutral.
  • Probably a real photo, but with irregular textures and shapes.
  • Real photo.
2. Alignment: The image matches the text description.
  • Does not match at all.
  • Has significant discrepancies.
  • Has several minor discrepancies.
  • Has a few minor discrepancies.
  • Matches exactly.
3. Design Details: Objects in the image have detail.
  • Minimal details: Almost all objects lack design details, appearing incomplete or blurry.
  • Some details: Only a few objects have certain details.
  • Moderate details: Nearly half of the objects have design details.
  • Good details: Most objects have design details.
  • High details: Almost all objects exhibit design details.
4. Visual Consistency: The video frames are consistent with one another.
  • Little consistency: Flickering and material changes between frames.
  • Some consistency: Most frames remain inconsistent.
  • Medium consistency: Nearly half of the frames are consistent.
  • Higher consistency: Most frames show consistency.
  • Close to complete consistency: Almost all frames are consistent.
5. Usability: Videos showcase design ideas and facilitate communication.
  • Not usable: Unrealistic video with irrelevant content, lack of details, screen flickering, and inconsistency between frames.
  • Limited usability: Some improvements, but the overall results are still poor.
  • Partially usable: Some images meet standards, but most remain unusable.
  • Mostly usable: The majority of images meet acceptable standards.
  • Fully usable: The entire video is error-free and ready for use.
Table 3. Comparison of subjective evaluation results of interior design videos generated by different methods.
| Model | Fidelity ↑ | Alignment ↑ | Design Details ↑ | Visual Consistency ↑ | Usability ↑ |
|---|---|---|---|---|---|
| Stable Diffusion (SD) [52] | 2.22 | 2.72 | 2.63 | 1.73 | 1.67 |
| Fine-tuned SD (FTSD) [28] | 2.46 | 3.05 | 2.97 | 2.08 | 1.67 |
| SD [52] + Control Video [26] | 2.35 | 3.14 | 2.64 | 2.83 | 2.48 |
| F3MTIDV | 2.61 | 3.31 | 2.98 | 3.18 | 3.08 |
Table 4. Quantitative evaluation results of images generated by different methods.
| Model | SSIM [53] ↑ | FID [54] ↓ | CLIP Score [55] ↑ |
|---|---|---|---|
| Stable Diffusion (SD) [52] | 0.4364 | 86.07 | 27.50 |
| Fine-tuned SD (FTSD) [28] | 0.4880 | 84.18 | 28.10 |
| SD [52] + Control Video [26] | 0.6268 | 79.02 | 29.05 |
| F3MTIDV | 0.6691 | 76.64 | 29.36 |