Article

Generative Architectural Design from Textual Prompts: Enhancing High-Rise Building Concepts for Assisting Architects

by Feng Yang 1,2 and Wenliang Qian 1,2,*

1 Key Lab of Smart Prevention and Mitigation of Civil Engineering Disaster of the Ministry of Industry and Information Technology, Harbin 150090, China
2 School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3000; https://doi.org/10.3390/app15063000
Submission received: 11 February 2025 / Revised: 4 March 2025 / Accepted: 7 March 2025 / Published: 10 March 2025

Abstract

In the early stages of architectural design, architects convert initial ideas into concrete design schemes, which heavily rely on their creativity and consume considerable time. Therefore, generative design methods based on artificial intelligence are promising for such tasks. However, effectively communicating design concepts to machines is challenging. To address this challenge, this paper proposes a novel cross-model approach for architectural design concepts using textual descriptions to assist architects, comprising a design concept extraction module and an architectural appearance generation module. The design concept extraction module adopts a contrastive learning framework to yield a text encoder with semantic extraction. Subsequently, the architectural appearance generation module proposes a novel deep sparse and text fusion generative adversarial network to convert the extracted design concept semantics into conceptual sketches, utilizing the unique sparsity of sketches. Additionally, it employs the pre-trained latent stable diffusion model to generate realistic and diverse high-rise building renderings, simulating the recreation processes of architects. The generated designs are evaluated qualitatively and quantitatively and further compared with existing real-life buildings to demonstrate the effectiveness of the proposed method. Furthermore, we demonstrate the feasibility of applying the proposed methodology in the early stages of architectural design by modeling a generated design.

1. Introduction

Architecture represents a synthesis of artistic expression and scientific principles that demands a delicate balance between creativity and precision. Whether manifested in towering skyscrapers or intimate dwellings, architectural design is a creative process for humans with novel perspectives on aesthetics and culture. Traditional architectural design relies on conceptual sketches to convey the design concepts of architects and recreates them to generate realistic architectural concepts. Conceptual sketches are central to the emergence and reinterpretation of this process [1]. However, this approach encounters challenges such as creative constraints and significant time investments. Moreover, the seemingly abstract nature of this process is difficult to depict mathematically. Fortunately, generative artificial intelligence (AI), such as a generative adversarial network (GAN) [2], offers a new path for providing cutting-edge technology to assist architects in conceptual design, which can be performed by learning mathematical mappings between probability distributions.
Various GANs have been widely employed in the civil engineering [3,4,5,6] and architectural fields. In particular, the outstanding creative generative skills of GANs have yielded noteworthy results for diverse design tasks. The first is a layout design that uses various GANs. For example, a floor plan layout is a critical aspect of design. House-GAN [7] encoded the planar space constraints among rooms into an adjacency graph structure and employed graph-conditioned GANs for automated floor plan generation. Furthermore, House-GAN++ [8] was extended to generate floor plans for non-rectangular room shapes with doors by combining a functional graph instead of an adjacency graph. However, both methods are limited to single-layer generation and ignore the layout relationships among different layers. To address this limitation, Building-GAN [9] introduced a cross-modal graph neural network and a new voxel graph presentation for automated 3D floor plan generation while considering the associativity of room assignments between different floors. Additionally, a GAN-based layout design was applied to the modular construction. Ghannad et al. [10] combined the CoGAN framework [11] and a vectorizing algorithm to generate modular housing designs. In addition to layout design, GANs have been leveraged to learn structural information depicted in images and generate structural designs that satisfy engineering requirements. For example, Liao et al. [12] proposed StructGAN to achieve automated layout designs of shear walls in residential buildings by learning existing shear wall design documents and assigning different colors. FrameGAN [13] adopted a two-stage method for generating automated component layout designs of steel frame-brace structures. In contrast to functional layouts and structural designs, the generation of aesthetic architectural forms based on AI presents a more significant challenge. Sun et al. [14] utilized the CycleGAN framework [15] to automatically generate stylized facades via image-to-image learning. As a pioneering study in the intelligent generative design of building appearances, Self-Sparse GAN [16] was proposed to creatively generate numerous diverse sketches [17]. To integrate user preferences, Qian et al. [18] proposed an innovative AI-based architectural design approach that incorporates user preferences, encompassing geometric shapes and architectural preferences, to generate customized high-rise sketches. Subsequently, images were used as user preference descriptions. However, the capacity to comprehensively convey design concepts through images is limited, underscoring the necessity for alternative approaches, especially in the context of generating aesthetic designs for architectural forms.
Therefore, textual descriptions, which are more convenient than images for describing user preferences, have emerged as a more effective means for generating architectural appearances. Recently, text-to-image transformations have been successfully implemented in the field of computer science using various conditional GANs (cGANs) [19]. GAN-INT-CLS [20] first utilized cGANs to generate 64 × 64 images based on textual descriptions. Furthermore, for enhanced image resolution and the efficient utilization of computing resources, StackGAN [21] and StackGAN++ [22] proposed a multi-stage approach to generate 256 × 256 images by combining multiple generator and discriminator pairs. However, the problem of inconsistency in the details of the generated images remains. To address this limitation, Qiao et al. [23] further enhanced the multi-stage generation process by incorporating a text-to-image-to-text framework to align the images with the corresponding descriptions. However, the complexity and entanglements introduced by the multi-stage framework render effective training challenging. To overcome these problems, Tao et al. [24] proposed DF-GAN, which adopts a single-stage framework (One-Way Output) for direct high-resolution image generation, enabling end-to-end training. Moreover, within the end-to-end training framework, various fusion methods have been introduced to facilitate the generation of refined visual content. Liao et al. [25] proposed semantic condition batch normalization by adapting a text fusion method to align individual image regions with words in the description. In contrast, DiverGAN [26] employs the attention mechanism to fuse text and image features for semantically consistent image generation. However, these approaches are not directly applicable to generating conceptual sketches from text semantics due to the inherent sparsity of sketches compared to color images. Consequently, generating high-quality architectural sketches from text semantics is difficult owing to the sparsity of sketches.
As mentioned earlier, existing AI-assisted architectural conceptual designs require a more refined method for conveying design concepts or user preferences, and text-to-image transformation is a promising solution to this limitation. It is crucial to develop an effective text fusion method to enable the generation of high-quality sketches.
Therefore, two interconnected sub-modules, the design concept extraction (DCE) and architectural appearance generation (AAG) modules, are proposed in this paper for the intelligent design of high-rise building concepts from textual descriptions. The DCE module employs a contrastive learning framework to extract abstract design concepts from textual descriptions, thus facilitating the matching of corresponding image–text pairs. Thereafter, the AAG module translates the extracted text semantics into a spectrum of realistic and diverse high-rise building concepts. To generate semantically coherent and visually realistic high-rise conceptual sketches, the AAG module employs a novel deep sparse and text fusion GAN (DSTF-GAN) that integrates the inherent sparsity of sketches and the extracted semantic information with image features across both channel and spatial dimensions. Additionally, the AAG module incorporates a pre-trained latent stable diffusion model [27] into a pre-trained rendering (PR) module to render the generated sketches into finalized images.
The novelty of this study is summarized as follows: (1) A pioneering AI-assisted approach for autonomously designing high-rise building concepts from text descriptions is proposed; (2) a generic model is introduced to obtain a pre-trained text encoder through contrastive learning, facilitating the effective matching of sketches with their descriptions; and (3) a novel generative adversarial network that integrates sketch sparsity and text semantics to generate high-rise conceptual sketches is proposed.
The remainder of this article is organized as follows. Section 2 elaborates on the proposed framework for intelligent design based on textual descriptions. Section 3 presents the implementation details and provides insights into the design process outcomes using qualitative and quantitative evaluations, as well as the comparisons between the generated results and existing real-life buildings. Section 4 presents a comprehensive discussion that covers the comparison of the proposed fusion method and alternative fusion techniques, along with an in-depth analysis of the components of the proposed fusion method. Section 5 demonstrates the feasibility of the proposed methodology in the architectural conceptual design phase by modeling a generated design. Finally, Section 6 concludes this paper.

2. Methodology

The proposed method comprises a DCE module and an AAG module, as shown in Figure 1. The DCE module uses a contrastive learning framework to obtain a pre-trained text encoder that enables semantic extraction from textual descriptions. Subsequently, the AAG module transforms the extracted design semantics into high-rise building concepts.

2.1. Design Concept Extraction Module

For text-to-image tasks, an effective text encoder is crucial for accurately extracting semantic information from textual descriptions. Contrastive learning is particularly effective for obtaining a pre-trained text encoder as it strengthens the encoder’s ability to extract salient features from text prompts by maximizing the alignment of semantically relevant pairs while minimizing the similarity of unrelated ones. According to the principles of Contrastive Language–Image Pre-training (CLIP) [28], the DCE module uses a contrastive text–image matching model (CTIMM) to obtain an effective text encoder, as shown in Figure 2. This model comprises text and image encoder networks dedicated to extracting text and image features, respectively, and maps them to a shared semantic space. By utilizing contrastive learning, the model establishes meaningful associations between conceptual sketches and their corresponding textual descriptions. The objective of CTIMM optimization is to maximize the similarity between the features of each sketch and its corresponding text features within the semantic space.
The text encoder incorporates a bidirectional Long Short-Term Memory (BiLSTM) architecture [29] to extract multi-level semantic information and a multi-head attention layer [30] that re-extracts low-level semantic information to augment the text semantics and map them to the shared semantic space. Within the BiLSTM, each word corresponds to two hidden states, which are concatenated to obtain the low-level semantic information $h_i \in \mathbb{R}^{512}$ $(i = 1, \ldots, n)$. The last hidden state $h_n \in \mathbb{R}^{512}$ is selected as the text semantic vector. Furthermore, the multi-head attention layer re-extracts the low-level semantics $h_i \in \mathbb{R}^{512}$ $(i = 1, \ldots, n-1)$, yielding supplementary semantic information $h \in \mathbb{R}^{512}$. Finally, by combining $h$ with $h_n$, the final global text semantic feature $T \in \mathbb{R}^{512}$ is obtained in the shared semantic space. The text encoder can be formulated as
$$ w_1, \ldots, w_n = M_{300 \times n_w}(word_1, \ldots, word_n) $$
$$ h_1, \ldots, h_n = F_{BiLSTM}(w_1, \ldots, w_n) $$
$$ h = F_{Attn}(h_1, \ldots, h_{n-1}) $$
$$ T = h_n + h $$
where $word_i$ $(i = 1, \ldots, n)$ denotes the one-hot encoding of the $i$-th word in the description based on the word pool consisting of the words in the dataset, $M_{300 \times n_w} \in \mathbb{R}^{300 \times n_w}$ is the semantic embedding matrix that yields the semantic embedding vector of each word in the description, $n$ is the number of words in the description, $n_w$ denotes the total number of words in the dataset, and $F_{BiLSTM}$ and $F_{Attn}$ denote the BiLSTM network and the multi-head attention layer, respectively. The multi-head attention layer is formulated as follows [30]:
$$ F_{Attn}(H) = MultiHead(H)[0] $$
$$ MultiHead(H) = Concat(head_1, \ldots, head_h)\,W^{O}, \quad head_i = Attention(H W_i^{Q}, H W_i^{K}, H W_i^{V}) $$
$$ Attention(Q, K, V) = softmax\!\left(\frac{Q K^{\mathrm{T}}}{\sqrt{d_k}}\right) V $$
where $H \in \mathbb{R}^{d_m \times d_n}$ is the input matrix of the multi-head attention layer, $h$ is the number of heads in the multi-head attention layer, $W_i^{Q} \in \mathbb{R}^{d_n \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_n \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_n \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_n}$ are the parameter matrices, $Concat$ denotes splicing multiple matrices along the columns, and $F_{Attn}$ denotes the extraction of the first column from the output of $MultiHead(H)$.
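As a rough illustration, the text encoder described above can be sketched in PyTorch as follows. This is a minimal sketch rather than the authors' implementation: the vocabulary size, the number of attention heads, and the choice of the first attention output position are assumptions, while the 300-dimensional embeddings and 512-dimensional BiLSTM states follow the stated dimensions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal sketch of the BiLSTM + multi-head attention text encoder."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_heads=16):
        super().__init__()
        # Semantic embedding matrix M_{300 x n_w}
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM: forward and backward 256-d states concatenated into 512-d h_i
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Multi-head attention layer that re-extracts the low-level semantics h_1, ..., h_{n-1}
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, word_ids):                   # word_ids: [batch, n] token indices
        w = self.embedding(word_ids)               # [batch, n, 300]
        h, _ = self.bilstm(w)                      # [batch, n, 512] low-level semantics h_i
        h_n = h[:, -1, :]                          # last hidden state as the sentence vector
        low = h[:, :-1, :]                         # h_1, ..., h_{n-1}
        attn_out, _ = self.attn(low, low, low)     # re-extracted supplementary semantics
        h_sup = attn_out[:, 0, :]                  # first output position, cf. MultiHead(H)[0]
        return h_n + h_sup                         # global text feature T in R^512
```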
The image encoder has a ResNet-18 [31] architecture followed by an additional multi-head attention layer that maps the extracted image features to the shared semantic space. The ResNet-18 architecture processes the sketch to obtain the local features $v \in \mathbb{R}^{512 \times 7 \times 7}$, which are then reshaped into a local feature matrix $i \in \mathbb{R}^{512 \times 50}$, with the last column representing the average of all the local feature vectors. Subsequently, the final global feature of the sketch in the shared semantic space, denoted as $I \in \mathbb{R}^{512}$, is obtained through the additional multi-head attention layer. The image encoder can be formulated as
$$ v = F_{ResNet}(x) $$
$$ I = F_{Attn}(f_{re}(v)) $$
where $x$ is the input sketch, and $f_{re}$, $F_{ResNet}$, and $F_{Attn}$ denote the average-and-reshape operator, ResNet-18, and the additional multi-head attention layer, respectively.
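A corresponding sketch of the image encoder is given below, again only as an indication: the torchvision ResNet-18 backbone, a 224 × 224 three-channel input (which yields the stated 512 × 7 × 7 local features), and the use of the final attention output position as the global feature are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ImageEncoder(nn.Module):
    """Minimal sketch of the ResNet-18 + multi-head attention image encoder."""
    def __init__(self, num_heads=32):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the average pooling and classification head: output is [batch, 512, 7, 7]
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attn = nn.MultiheadAttention(512, num_heads, batch_first=True)

    def forward(self, x):                           # x: [batch, 3, 224, 224] sketch image
        v = self.features(x)                        # local features v in R^{512 x 7 x 7}
        local = v.flatten(2).transpose(1, 2)        # [batch, 49, 512] local feature vectors
        avg = local.mean(dim=1, keepdim=True)       # appended average vector (50th column)
        tokens = torch.cat([local, avg], dim=1)     # [batch, 50, 512]
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return attn_out[:, -1, :]                   # global sketch feature I in R^512
```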
The loss function of the CTIMM is designed to maximize the cosine similarity between the conceptual image and text features for matching pairs while minimizing the cosine similarity between text features for mismatching pairs in the shared semantic space. The objective loss function is defined as follows:
$$ L_{CTIMM} = \sum_{\substack{i=1,\,j=1 \\ i \neq j}}^{n} \langle I_i, T_j \rangle - \sum_{\substack{i=1,\,j=1 \\ i = j}}^{n} \langle I_i, T_j \rangle $$
where $\langle \cdot , \cdot \rangle$ denotes the dot product operation between two vectors, and $I_i \in \mathbb{R}^{512}$ and $T_j \in \mathbb{R}^{512}$ are the extracted feature vectors of the $i$-th sketch and $j$-th description for the same batch, respectively. The CTIMM is trained using Algorithm 1.
Algorithm 1. Pseudocode for the training of the CTIMM.
Contrastive text–image matching model: AdamW optimizer with (β1, β2) = (0.9, 0.99), eps = 1 × 10^-8, weight_decay = 0.25, lr = 0.0001; CosineAnnealingWarmRestarts lr_scheduler with T_0 = 400, T_mult = 1, eta_min = 0, last_epoch = -1.
1: Extract the features of the sketch and the textual description
   I_f = image_encoder(Image)            # [batch_size, 512]
   T_f = text_encoder(Text)              # [batch_size, 512]
2: Normalize with the L2 norm
   I_e = L2_normalize(I_f, axis=1)       # [batch_size, 512]
   T_e = L2_normalize(T_f, axis=1)       # [batch_size, 512]
3: Calculate the cosine similarities
   logits = np.dot(I_e, T_e.T) * np.exp(t)   # [batch_size, batch_size]
4: Calculate the loss
   labels = np.arange(batch_size)
   loss_i = cross_entropy_loss(logits, labels, axis=0)
   loss_t = cross_entropy_loss(logits, labels, axis=1)
   loss = (loss_i + loss_t) / 2
# t is a learnable temperature parameter
# CosineAnnealingWarmRestarts refers to Stochastic Gradient Descent with Warm Restarts.
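For readers who prefer executable code, the symmetric contrastive objective of Algorithm 1 can be written in PyTorch roughly as follows; the learnable temperature and the encoder calls are placeholders standing in for the CTIMM components.

```python
import torch
import torch.nn.functional as F

def ctimm_loss(image_features, text_features, log_temperature):
    """Symmetric cross-entropy over cosine similarities, mirroring Algorithm 1."""
    I_e = F.normalize(image_features, dim=1)          # [batch, 512], L2-normalized
    T_e = F.normalize(text_features, dim=1)           # [batch, 512]
    logits = I_e @ T_e.t() * log_temperature.exp()    # [batch, batch] scaled cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)          # sketch -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)      # text -> sketch direction
    return (loss_i + loss_t) / 2

# Usage sketch:
# t = torch.nn.Parameter(torch.zeros(()))             # learnable temperature parameter
# loss = ctimm_loss(image_encoder(images), text_encoder(texts), t)
```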

2.2. Architectural Appearance Generation Module

Once the DCE module has translated the architectural design textual descriptions into text semantic features, the AAG module performs the new architectural design, as shown in Figure 3. The DSTF-GAN of the AAG module translates the extracted design text semantics into high-rise conceptual sketches. Subsequently, the PR module performs recreation and generates realistic high-rise buildings.
The DSTF-GAN is depicted in Figure 4, and the detailed properties of layers and operators are listed in Table A1 and Table A2 in Appendix A. As noted by Qian et al. [17], a simple GAN architecture is sufficient for sketch synthesis. Building on this, DSTF-GAN is designed to effectively integrate the inherent sparsity of sketches [17] and the extracted semantics of textual descriptions into the image features to generate high-quality sketches. As shown in Figure 4a, the generator of the DSTF-GAN introduces a novel block called the sparse text fusion (STF) block to seamlessly integrate both the sparsity of the sketches and text features into the generated sketches. The STF block consists of a sparse fusion (SF) block (Figure 5a) to incorporate the sparsity of sketches and a text fusion (TF) block (Figure 5b) to augment the feature maps with the conditioned text semantic features.
The SF block employs a two-headed neural network to extract the sparsity of sketches, transforming the latent vector $z \in \mathbb{R}^{200}$ into a channel sparsity coefficient $\alpha_i \in \mathbb{R}^{1 \times 1 \times C_i}$ and a position sparsity coefficient $\beta_i \in \mathbb{R}^{H_i \times W_i \times 1}$. Subsequently, the feature map $h_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ is multiplied by each of the two coefficients, and the results are summed to yield the output $F_{z,i} \in \mathbb{R}^{H_i \times W_i \times C_i}$. Sparsity fusion in the $i$-th STF block can be expressed as follows:
$$ h_i = ReLU(f_{BN}(f_{DeConv}(F_{i-1}))) $$
$$ \alpha_i = ReLU(g_{i,1}^{z}(f_i^{z}(z))) $$
$$ \beta_i = ReLU(g_{i,2}^{z}(f_{reshape}(f_i^{z}(z)))) $$
$$ F_{z,i}^{j,k,l} = \alpha_i^{l} \times h_i^{j,k,l} + \beta_i^{j,k} \times h_i^{j,k,l} $$
where $F_{i-1} \in \mathbb{R}^{H_{i-1} \times W_{i-1} \times C_{i-1}}$ is the input feature map of the $i$-th STF block; $f_{DeConv}$, $f_{BN}$, and $ReLU$ denote the deconvolutional layer, the batch normalization layer, and the rectified linear unit activation function, respectively; $f_i^{z}$, $g_{i,1}^{z}$, and $g_{i,2}^{z}$ denote the underlying shared layer (a multi-layer perceptron, MLP) and the two exclusive heads of the two-headed neural network, respectively; and $f_{reshape}$ denotes the reshape operation, as shown in Figure 5a.
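A minimal PyTorch sketch of the SF block is given below for concreteness. The layer widths loosely follow Table A2 in Appendix A, but the fixed 16 × 16 spatial resizing via upsampling and the default channel count are simplifying assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SparseFusionBlock(nn.Module):
    """Sketch of the SF block: channel (alpha) and position (beta) sparsity from the latent z."""
    def __init__(self, channels=128, height=16, width=16, z_dim=200):
        super().__init__()
        # Shared trunk f_i^z of the two-headed network
        self.shared = nn.Sequential(nn.Linear(z_dim, 1024), nn.ReLU(),
                                    nn.Linear(1024, 512), nn.ReLU(),
                                    nn.Linear(512, 256), nn.ReLU())
        # Head g_{i,1}^z: channel sparsity coefficient alpha_i (one value per channel)
        self.to_alpha = nn.Linear(256, channels)
        # Head g_{i,2}^z: position sparsity coefficient beta_i (one value per spatial location),
        # obtained by reshaping the 256-d trunk output to 16 x 16 and resizing it
        self.to_beta = nn.Sequential(nn.Upsample(size=(height, width)),
                                     nn.Conv2d(1, 1, kernel_size=3, padding=1))

    def forward(self, h_i, z):                     # h_i: [batch, C, H, W], z: [batch, 200]
        s = self.shared(z)                         # [batch, 256]
        alpha = torch.relu(self.to_alpha(s)).view(-1, h_i.size(1), 1, 1)
        beta = torch.relu(self.to_beta(s.view(-1, 1, 16, 16)))   # [batch, 1, H, W]
        return alpha * h_i + beta * h_i            # F_{z,i}
```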
The TF block employs two MLPs and a convolutional layer to predict a channel-wise coefficient $w_i \in \mathbb{R}^{1 \times 1 \times C_i}$ and a position-wise coefficient $b_i \in \mathbb{R}^{H_i \times W_i \times 1}$ based on the text feature $T \in \mathbb{R}^{512}$. The feature map $F_{z,i}$ is then multiplied by $w_i$ and $b_i$, respectively, and the two results are added together to yield the output $F_{T,i} \in \mathbb{R}^{H_i \times W_i \times C_i}$. Finally, the resulting feature map $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ of the $i$-th STF block is obtained after applying a ReLU activation function. This block is formulated as follows:
$$ w_i = f_{i,1}^{T}(T) $$
$$ b_i = LeakyReLU(g_i^{T}(f_{reshape}(f_{i,2}^{T}(T)))) $$
$$ F_{T,i}^{j,k,l} = w_i^{l} \times F_{z,i}^{j,k,l} + b_i^{j,k} \times F_{z,i}^{j,k,l} $$
$$ F_i = ReLU(F_{T,i}) $$
where $LeakyReLU$ denotes the LeakyReLU activation function; $f_{i,1}^{T}$ and $f_{i,2}^{T}$ both denote a single MLP; $g_i^{T}$ is a convolutional layer; and $f_{reshape}$ denotes the reshaping operation, as shown in Figure 5b.
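A matching sketch of the TF block follows; as above, the layer widths loosely follow Table A2, and the 16 × 16 reshape plus upsampling stand in for the conv/deconv resizing described in the appendix.

```python
import torch
import torch.nn as nn

class TextFusionBlock(nn.Module):
    """Sketch of the TF block: channel (w) and position (b) modulation from the text feature T."""
    def __init__(self, channels=128, height=16, width=16, text_dim=512):
        super().__init__()
        # f_{i,1}^T: channel-wise coefficient w_i
        self.to_w = nn.Sequential(nn.Linear(text_dim, 128), nn.LeakyReLU(0.2),
                                  nn.Linear(128, channels), nn.LeakyReLU(0.2))
        # f_{i,2}^T followed by reshape and g_i^T: position-wise coefficient b_i
        self.to_b_mlp = nn.Sequential(nn.Linear(text_dim, 256), nn.LeakyReLU(0.2))
        self.to_b_conv = nn.Sequential(nn.Upsample(size=(height, width)),
                                       nn.Conv2d(1, 1, kernel_size=3, padding=1),
                                       nn.LeakyReLU(0.2))

    def forward(self, F_z, T):                      # F_z: [batch, C, H, W], T: [batch, 512]
        w = self.to_w(T).view(-1, F_z.size(1), 1, 1)               # channel modulation
        b = self.to_b_conv(self.to_b_mlp(T).view(-1, 1, 16, 16))   # spatial modulation
        return torch.relu(w * F_z + b * F_z)        # F_i = ReLU(F_{T,i})
```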
The discriminator is inspired by the Target-Aware discriminator in the DF-GAN [24], demonstrating its ability to ascertain the consistency between input image semantics and text. To ensure a balanced and symmetrical structure with the generator, as well as semantic consistency between the conceptual sketches and input text, a simplified version of the target-aware discriminator is adopted, as shown in Figure 4b.
The discriminator converts a sketch into feature maps, denoted as $F_6 \in \mathbb{R}^{4 \times 4 \times 128}$, using a convolutional layer and six DownBlock modules. Subsequently, the feature map $F_6$ is concatenated along the channel dimension with the reshaped text feature $T \in \mathbb{R}^{4 \times 4 \times 512}$. The output of the discriminator determines whether the sketch is semantically consistent with the textual description.
To ensure the stability of the adversarial training process, the hinge loss [32] is used as the objective function of the DSTF-GAN. Additionally, the Matching-Aware zero-centered Gradient Penalty (MA-GP) of the DF-GAN [24] is incorporated into the objective function of the discriminator to reinforce the semantic consistency between text and image. The complete loss of the DSTF-GAN is expressed as follows:
$$ L_D = -\mathbb{E}_{x \sim P_r}\left[\min\left(0, -1 + D(x, T)\right)\right] - \frac{1}{2}\,\mathbb{E}_{z \sim P_z}\left[\min\left(0, -1 - D(G(z, T), T)\right)\right] - \frac{1}{2}\,\mathbb{E}_{x \sim P_{mis}}\left[\min\left(0, -1 - D(x, T)\right)\right] + k\,\mathbb{E}_{x \sim P_r}\left[\left(\left\|\nabla_{x} D(x, T)\right\| + \left\|\nabla_{T} D(x, T)\right\|\right)^{p}\right] $$
$$ L_G = -\mathbb{E}_{z \sim P_z}\left[D(G(z, T), T)\right] $$
where $z$ is the noise vector; $T$ is the text feature; $P_r$, $P_z$, and $P_{mis}$ denote the real data, Gaussian, and mismatched data distributions, respectively; and $k$ and $p$ are two hyperparameters that balance the effectiveness of the gradient penalty. The DSTF-GAN is trained using Algorithm 2.
Algorithm 2. Pseudocode for the training of the DSTF-GAN.
Deep sparse and text fusion generative adversarial network: Adam optimizer with (β1, β2) = (0.5, 0.999), lr_G = 0.0001, lr_D = 0.0003.
1: Initialize the discriminator parameters θ_D and the generator parameters θ_G
2: for number of training epochs do
3:   sample real data pairs {(x_1, Text_1), ..., (x_m, Text_m)} and latent variables {z_1, ..., z_m}
4:   extract semantics: T_i = text_encoder(Text_i)
5:   calculate the discriminator loss: L_D^i = (1/2) D_{θ_D}(G_{θ_G}(z_i, T_i), T_i) + (1/2) D_{θ_D}(x_i, T_j) − D_{θ_D}(x_i, T_i) + k (‖∇_{x_i} D(x_i, T_i)‖ + ‖∇_{T_i} D(x_i, T_i)‖)^p
6:   update the discriminator parameters: θ_D ← Adam(∇_{θ_D} (1/m) Σ_{i=1}^{m} L_D^i, θ_D, β1, β2, lr_D)
7:   sample latent variables {z_1, ..., z_m}
8:   calculate the generator loss: L_G^i = −D_{θ_D}(G_{θ_G}(z_i, T_i), T_i)
9:   update the generator parameters: θ_G ← Adam(∇_{θ_G} (1/m) Σ_{i=1}^{m} L_G^i, θ_G, β1, β2, lr_G)
10: end for
# text_encoder is pre-trained from the contrastive text–image matching model
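The discriminator and generator updates of Algorithm 2 can be expressed in PyTorch roughly as below. The generator G, discriminator D, and the values of k and p are placeholders; the hinge terms and the MA-GP follow the loss definitions above, and this is only a sketch of one training iteration.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(G, D, x_real, T, T_mis, k=2.0, p=6.0):
    """Hinge loss on real/generated/mismatched pairs plus the MA-GP term (a sketch)."""
    z = torch.randn(x_real.size(0), 200, device=x_real.device)
    x_fake = G(z, T).detach()
    loss = (F.relu(1.0 - D(x_real, T)).mean()
            + 0.5 * F.relu(1.0 + D(x_fake, T)).mean()
            + 0.5 * F.relu(1.0 + D(x_real, T_mis)).mean())
    # Matching-Aware zero-centered Gradient Penalty on real, matched pairs
    x_r = x_real.clone().requires_grad_(True)
    T_r = T.clone().requires_grad_(True)
    grads = torch.autograd.grad(D(x_r, T_r).sum(), (x_r, T_r), create_graph=True)
    grad_norm = grads[0].flatten(1).norm(2, dim=1) + grads[1].flatten(1).norm(2, dim=1)
    return loss + k * grad_norm.pow(p).mean()

def generator_loss(G, D, T):
    """Generator objective: raise the discriminator score of generated, matched pairs."""
    z = torch.randn(T.size(0), 200, device=T.device)
    return -D(G(z, T), T).mean()
```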
The PR module leverages the image-to-image capabilities of the pre-trained latent stable diffusion model [27] to re-create the generated conceptual sketches as follows:
$$ \tilde{x} = F_{LDM}(x) $$
where $x$ and $\tilde{x}$ denote the input conceptual sketch and the generated high-rise building image, respectively, and $F_{LDM}$ is the pre-trained latent stable diffusion model.
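One possible way to reproduce this rendering step is through the image-to-image pipeline of the Hugging Face diffusers library; the library and checkpoint name are assumptions on our part, and only the prompt, strength, scale, and step settings are taken from the configuration reported later in Section 3.2.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load stable-diffusion-v1-5 weights (checkpoint name assumed)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

sketch = Image.open("generated_sketch.png").convert("RGB").resize((512, 512))

# Re-create the conceptual sketch as a realistic high-rise rendering
rendering = pipe(prompt="A high-rise building, 3D",
                 image=sketch,
                 strength=0.7,              # 0.6-0.8 as reported in Section 3.2
                 guidance_scale=30,         # corresponds to the 'scale' parameter
                 num_inference_steps=100,   # corresponds to 'ddim_steps'
                 ).images[0]
rendering.save("rendering.png")
```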
Additionally, by integrating the pre-trained text encoder, DSTF-GAN, and the stable-diffusion-v1-5 model, we have developed a user-friendly work interface within the Python framework. This interface facilitates the seamless transition from input text prompts to high-rise building renderings. Detailed descriptions of the interface and workflow are provided in Appendix A, as illustrated in Figure A1.

3. Experiments and Results

This section provides an overview of the datasets used in the experiments. Subsequently, the training details of the DCE and AAG modules are presented. To assess the efficacy of the proposed intelligent design approach based on textual descriptions, we systematically conducted a sequence of experiments. Specifically, for the DCE module, ablation studies showed that data augmentation and the semantic re-extraction layer enhanced the accuracy of architectural design semantic extraction, resulting in a more precise textual encoder for semantic extraction. In the case of the AAG module, qualitative experiments confirmed that our method effectively generates architectural sketches consistent with the provided textual semantics. By leveraging a pre-trained diffusion model to render the generated sketches, we produced realistic high-rise building renderings that better capture the underlying semantic nuances compared to direct diffusion model generation. Furthermore, comparisons with existing real-life buildings were conducted to demonstrate the effectiveness of the proposed method.

3.1. Sketch–Text Dataset

To address the challenge of generating the appearance of high-rise buildings, we required a dataset containing paired high-rise building sketches and the corresponding textual descriptions. A total of 856 images were meticulously gathered from online sources, with a distinct emphasis on the top 1000 tall buildings worldwide. The images of high-rise buildings represent real-world structures from various countries and regions, including a variety of architectural styles (such as postmodernism, modernism, and structuralism) and functional types (including commercial and cultural buildings). This diversity ensures that the model can generalize the architectural characteristics of high-rise buildings in different styles and cultural contexts. The initial preprocessing step involved resizing the images and removing their backgrounds. Subsequently, conceptual sketches were generated using the eXtended difference-of-Gaussians operator (XDoG), as shown in Figure 6. The underlying procedure of the XDoG operation can be outlined as follows [33]:
$$ g(\mathbf{x}) = S_{\sigma, k, p}(\mathbf{x}) * I(\mathbf{x}) $$
$$ S_{\sigma, k, p}(\mathbf{x}) = (1 + p)\,G_{\sigma}(\mathbf{x}) - p\,G_{k\sigma}(\mathbf{x}) $$
$$ G_{\sigma}(\mathbf{x}) = \frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{\|\mathbf{x}\|^{2}}{2\sigma^{2}}\right) $$
where $I(\mathbf{x})$, $\mathbf{x}$, and $g(\mathbf{x})$ denote the original image, a two-dimensional coordinate, and the filtered sketch generated by the XDoG operator, respectively; $*$ denotes convolution; $S_{\sigma, k, p}(\mathbf{x})$ is the reparameterization of the XDoG filter for simplified control; and $\sigma$, $k$, and $p$ are the control parameters.
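For reference, this filtering step can be implemented with Gaussian filters by exploiting the linearity of convolution; the parameter values shown below are illustrative, not the ones used to build the dataset.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def xdog_sketch(image, sigma=1.0, k=1.6, p=20.0):
    """Filtered sketch g = S_{sigma,k,p} * I = (1 + p) (G_sigma * I) - p (G_{k sigma} * I)."""
    image = image.astype(np.float32)
    g_small = gaussian_filter(image, sigma)        # G_sigma convolved with I
    g_large = gaussian_filter(image, k * sigma)    # G_{k sigma} convolved with I
    return (1.0 + p) * g_small - p * g_large
```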
Textual descriptions included architectural style (e.g., modernism, futurism, or postmodernism), high-rise type (e.g., high-rise or tower), architectural design manner (e.g., symmetrical, misaligned, or segmented), architectural structure (e.g., podium or revolving floor), and other common design concepts to denote the abstract design concepts. To mitigate potential overfitting, we leveraged the pre-trained latent stable diffusion model and ChatGPT 3.5 from OpenAI to augment each sketch–text pair, as shown in Figure 7. For image augmentation using the pre-trained diffusion model, we leveraged the image-to-image functionality of stable-diffusion-v1-5, adjusting the output correlation coefficient to preserve the overall structure while modifying only the finer details, resulting in 32 variations per sketch. For text augmentation with ChatGPT, we rephrased each textual description into 10 semantically equivalent but differently phrased versions. During this process, we ensured that the rephrased sentences maintained the original semantics by applying established constraints, such as using synonyms, reordering words, and adjusting the phrasing, with the aim of generating diverse expressions with the same semantic content. Subsequently, we conducted manual evaluations to ensure that each of the 10 augmented text prompts for a sketch preserved the original meaning.

3.2. Training Details

DCE module. The image encoder utilized 32 attention heads, whereas the text encoder employed 16 attention heads. Training was performed with a batch size of eight for a total of 800 epochs. In the training of the CTIMM, the dataset was partitioned into a training set constituting 90% of the data and a validation set representing the remaining 10%. This division was randomly conducted multiple times for robust assessment. The optimization was conducted using the AdamW [34] optimizer with parameters β1 = 0.9, β2 = 0.999, and weight_decay = 0.25. The learning rate was set to 0.0001, and the learning rate schedule followed CosineAnnealingWarmRestarts [35] with T_0 = 400, T_mult = 1, eta_min = 0, and last_epoch = -1. The experiments were conducted on a workstation equipped with a 16-core AMD Ryzen 9 3950X processor, an NVIDIA GeForce RTX 2080 Ti GPU, and 64 GB of DDR4 memory. The CTIMM was implemented using the open-source machine learning framework PyTorch Lightning 2.0.0.
AAG module. Training was performed with a batch size of eight for a total of 600 epochs. The division of the dataset was the same as that used by the DCE module to obtain the pre-trained text encoder. In the training process of the DSTF-GAN, the Adam [36] optimizer was utilized with β1 = 0.5 and β2 = 0.999. The learning rates for the generator and discriminator were set to lr_G = 0.0001 and lr_D = 0.0003, respectively. The experiments were conducted on a workstation equipped with a 16-core AMD Ryzen 9 3950X processor, an NVIDIA GeForce RTX 2080 Ti GPU, and 64 GB of DDR4 memory. The DSTF-GAN was implemented using PyTorch Lightning. The pre-trained latent stable diffusion model was employed to render the generated conceptual sketches, with strength = 0.6–0.8, prompt = 'A high-rise building, 3D', scale = 30, and ddim_steps = 100.
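Expressed in PyTorch, these optimizer settings correspond approximately to the following; ctimm, generator, and discriminator are placeholders for the modules defined in Section 2.

```python
import torch

# CTIMM (DCE module): AdamW with cosine-annealing warm restarts
ctimm_opt = torch.optim.AdamW(ctimm.parameters(), lr=0.0001,
                              betas=(0.9, 0.999), weight_decay=0.25)
ctimm_sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    ctimm_opt, T_0=400, T_mult=1, eta_min=0, last_epoch=-1)

# DSTF-GAN (AAG module): separate Adam optimizers for the generator and discriminator
gen_opt = torch.optim.Adam(generator.parameters(), lr=0.0001, betas=(0.5, 0.999))
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=0.0003, betas=(0.5, 0.999))
```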

3.3. Experimental Results

Training the text encoder. To evaluate and acquire an efficient pre-trained text encoder, the metric should signify the highest cosine similarity values between the corresponding image–text and text–image pairs within the validation set. Accuracy was computed using the following formula:
$$ Acc = \frac{\sum_{i=1}^{n} \sum_{j=1}^{b} g\left(logits_{j,:}^{i},\, logits_{:,j}^{i},\, j\right)}{n \times b} $$
$$ g(x, y, j) = \begin{cases} 1, & \text{if } \arg\max(x) = j \text{ and } \arg\max(y) = j \\ 0, & \text{otherwise} \end{cases} $$
where $logits^{i}$ is the cosine similarity matrix of the $i$-th batch; $logits_{j,:}^{i}$ is the $j$-th row of $logits^{i}$, representing the similarity of the $j$-th text to every sketch in the batch; $logits_{:,j}^{i}$ is the $j$-th column of $logits^{i}$, representing the similarity of the $j$-th sketch to every text in the batch; $n$ and $b$ are the total number of batches and the batch size, respectively; and $x$ and $y$ denote two vectors. A higher $Acc$ indicates a stronger alignment between the sketch and its corresponding text description, reflecting more accurate semantic extraction.
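In code, this batch-wise matching accuracy reduces to checking the row-wise and column-wise argmax of the similarity matrix; a minimal sketch:

```python
import torch

def matching_accuracy(logits):
    """Fraction of pairs in a batch whose text->sketch and sketch->text argmax both match."""
    b = logits.size(0)                        # logits: [b, b] cosine similarity matrix
    text_to_sketch = logits.argmax(dim=1)     # row-wise argmax: best sketch for each text
    sketch_to_text = logits.argmax(dim=0)     # column-wise argmax: best text for each sketch
    targets = torch.arange(b, device=logits.device)
    correct = (text_to_sketch == targets) & (sketch_to_text == targets)
    return correct.float().mean().item()
```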
The dataset was randomly split into training and validation sets on five occasions, and the model’s matching accuracy was evaluated for each partition. During training, we randomly selected a sketch–text pair from a set of 32 generated sketches and the corresponding 10 distinct textual descriptions for each epoch. Figure 8 shows the $Acc$ comparisons of the CTIMM under various dataset divisions, incorporating a semantic re-extraction layer and different data augmentation techniques, including random image scaling and cropping (DA) during the training process, the pre-trained latent stable diffusion model for image augmentation (SD), and ChatGPT for text augmentation (GPT). The experimental results on the validation set demonstrate that these techniques consistently improved the matching accuracy between sketches and text descriptions, regardless of the dataset partitioning. Figure 9 shows the matching outcomes of one batch in Division 1 of the dataset. The results demonstrate that the pre-trained text encoder effectively extracted design concepts from textual descriptions and provided the necessary conditions for the subsequent generation of high-rise building images.
Generation of high-rise building images. During the training of the DSTF-GAN, we selected the data partition that achieved the highest $Acc$ along with the corresponding text encoder, as this partition is believed to better capture the features of high-rise buildings described in the textual descriptions. The performance of the trained DSTF-GAN was then qualitatively assessed on the validation set to evaluate its generative capabilities. Figure 10 shows the ability of the DSTF-GAN within the AAG module to produce a varied array of semantically consistent conceptual sketches of high-rise buildings using the same design concepts. However, the implicit biases present in the dataset can affect the generated sketches. For example, since the tower-type buildings in the dataset are primarily cultural landmarks, inputting the term “tower” tends to generate sketches that closely resemble these cultural structures. Figure 11 compares the generated conceptual sketches with the real sketches in the training dataset, demonstrating that the DSTF-GAN does not memorize the samples in the dataset but rather understands the semantics of textual descriptions and the features of the high-rise sketches. Notably, the generated sketches exhibited uniqueness, indicating that the DSTF-GAN can learn the attributes of the design concepts in the training dataset and does not lose the creative requirements of the architectural design process. Subsequently, as shown in Figure 12, the PR module effectively transformed the generated conceptual sketches into lifelike high-rise buildings using the pre-trained stable-diffusion-v1-5 model, with parameters set to strength = 0.6–0.8, prompt = 'A high-rise building, 3D', scale = 30, and ddim_steps = 100. The experimental outcomes confirmed the practical viability of the proposed approach for generating a diverse and authentic collection of high-rise building concepts from textual descriptions. Figure 13 provides a direct comparison of the high-rise building images generated by the pre-trained latent diffusion model and the proposed method. The pre-trained model employs the text-to-image functionality of the stable-diffusion-v1-5 model, with the input consisting of the text description corresponding to the sketch, using parameters scale = 5 and ddim_steps = 100. The results demonstrate that our approach generates high-rise building renderings with improved semantic consistency compared to those produced directly from the text-to-image functionality of the diffusion model, thus highlighting the superior performance of the proposed method. The limitations of the pre-trained latent diffusion model, such as lacking a comprehensive understanding of architecture and precise correspondence between architectural design concepts and visual features, were evident. In contrast, the proposed method, which utilizes conceptual sketches as an intermediary representation, proved to be a more effective approach.
Authenticity assessment of generated results. To further validate the authenticity of the generated results, an investigation is conducted to assess the applicability of the generated architectural sketches and renderings by comparing them with real high-rise buildings not included in the dataset. Figure 14 demonstrates the utility of the generated sketches and renderings, showing similarities with existing buildings. This indicates that the proposed methodology captures essential characteristics found in real buildings while maintaining innovation and diversity in conceptual design. For instance, Figure 14a highlights similarities between a generated rendering sample and Metropol Tower Istanbul in the Ataşehir district of Istanbul, Turkey, showcasing a cutting architectural design approach at the top and similarly decorated surfaces. These findings validate the feasibility of applying this method.

4. Discussion

4.1. Comparison of Different Text Fusion Methods

To demonstrate the efficacy of the proposed approach of fusing the textual semantics and latent visual features of sketches in the context of the sketch generation task, we conducted a comparative analysis with two other cutting-edge single-stage text fusion methodologies, namely DF-GAN [24] (denoted as “DF”) and DiverGAN [26] (denoted as “Diver”), within the framework of the DSTF-GAN. DF utilizes two MLPs to predict language-conditioned channel-wise scaling parameters γ and shifting parameters θ , which are generated from the concatenation of semantic vectors T and noise vectors z , to operate the feature maps. Diver utilizes attention mechanisms to derive reshape operators at both the channel and pixel levels, thereby facilitating the re-weighting of the visual feature maps. We replaced the text fusion module in the DSTF-GAN with the “DF” and “Diver” fusion techniques, while keeping other components unchanged, and evaluated the effectiveness of the proposed text semantic fusion method using quantitative metrics, including FID and the newly introduced CM precision.
The Fréchet Inception Distance (FID) [37] was employed to evaluate the different fusion methods. The FID measures the quality of the generated images by computing the Fréchet distance between the distribution of generated samples and the distribution of actual data. A lower FID value indicates that the synthetic images closely resemble the real images, reflecting their visual quality. The FID is calculated as
$$ FID(r, f) = \left\| \mu_r - \mu_f \right\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_f - 2\left( \Sigma_r \Sigma_f \right)^{1/2} \right) $$
where $\Sigma$ and $\mu$ represent the covariance and feature-wise mean of the samples, respectively, and the subscripts $r$ and $f$ denote the real and synthetic images, respectively.
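The FID can be computed from Inception features of the real and generated sketches in the usual way; the sketch below assumes the feature matrices have already been extracted (one row per image).

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception features, following the equation above."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)   # (Sigma_r Sigma_f)^(1/2)
    covmean = np.real(covmean)                                 # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```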
While FID evaluates the visual quality of generated sketches, it does not assess the semantic alignment between generated sketches and their corresponding textual descriptions. To address this limitation, we introduced contrastive matching precision (CM precision) as a complementary metric for evaluating text-to-conceptual sketch synthesis. To calculate the CM precision, we used the pre-trained CTIMM to measure the percentage of correct matches between the generated sketches and their corresponding input textual descriptions. For the new dataset consisting of the generated high-rise conceptual sketches and corresponding texts, the CM precision was calculated as follows:
$$ CM\ precision = \frac{\sum_{i=1}^{n_g} \sum_{j=1}^{b_g} g\left((logits_g^{i})_{j,:},\, (logits_g^{i})_{:,j},\, j\right)}{n_g \times b_g} $$
where $logits_g^{i}$ is the cosine similarity matrix of the $i$-th batch on the newly generated dataset; $(logits_g^{i})_{j,:}$ is the $j$-th row of $logits_g^{i}$, representing the similarity of the $j$-th text to every generated sketch in the batch; $(logits_g^{i})_{:,j}$ is the $j$-th column of $logits_g^{i}$, representing the similarity of the $j$-th generated sketch to every text in the batch; $n_g$ and $b_g$ are the total number of batches and the batch size, respectively; and $g$ is expressed in Equation (26). A higher CM precision indicates a better match between the generated sketches and the text semantics.
To calculate the FID, we generated five distinct sketches for each text description in the validation set by altering the noise vector z , and computed the FID by comparing the generated sketches with the corresponding real sketches. To calculate the CM precision, we grouped the generated sketch–text pairs into batches of eight, calculated the CM precision for each batch, and then averaged the results across all batches to obtain the final CM precision. Table 1 presents the experimental results, revealing that the proposed DSTF-GAN outperformed the other methods in terms of the FID and CM precision metrics. These quantitative comparisons provide evidence for the efficacy of the fusion approach employed in the DSTF-GAN for text-to-sketch synthesis. The results underscore that incorporating the sparsity of sketches into the generative process significantly enhances quality and semantic fusion. The conceptual sketches generated from various fusion methods using the same descriptions are shown in Figure 15. The results show that the Diver fusion method only captured certain architectural features and failed to generate complete high-rise sketches, whereas the DF fusion method could only generate realistic but not well-aligned high-rise building sketches. For instance, as depicted in Figure 15b, both DF and the DSTF-GAN produced high-rise building sketches with a broader base and a pole. However, only the DSTF-GAN incorporated a dome structure at the top, demonstrating the superior accuracy of the proposed DSTF fusion method in the synthesis of sketches.

4.2. Component Analysis

Section 4.1 demonstrated the efficacy of the DSTF-GAN fusion method compared with two other leading single-stage text fusion approaches. To gain deeper insight into the conceptual sketch generation process, we conducted ablation studies to assess the individual effectiveness of each component within the DSTF-GAN, namely the SF block, the TF block, and the order of these two blocks. Three variations of DSTF were created: “DSTF w/o SF block”, “DSTF w/o TF block”, and “DTSF”. These variations correspond to the removal of the SF block, the removal of the TF block, and the swapping of the order of the two blocks, respectively.
To comprehensively analyze each component, we employed an additional metric, LPIPS [38], in addition to the FID and CM precision. LPIPS leverages pre-trained networks to assess the diversity of the generated sketches. This is accomplished by quantifying the average feature distance between the feature maps of the synthetic images. A higher LPIPS value signifies a greater dissimilarity between pairs of synthetic sketches, indicating higher diversity. The calculation methods for the FID and CM precision follow the same procedure as described in Section 4.1. The comparative results of the ablation studies are listed in Table 2.
Effect of the SF block. Compared with DSTF w/o SF block, DSTF enhanced the LPIPS value from 0.0588 to 0.2200, indicating that the SF block effectively increased the diversity of sketches generated from the same description by introducing sparsity to the feature maps. Furthermore, the incorporation of sparsity during the fusion process improved the CM precision from 30.83 % to 36.08 % . This demonstrated that integrating sparsity into the fusion process enhances the semantic consistency between the text and the generated sketches.
Effect of the TF block. The TF block improved the CM precision from 4.31% to 36.08% compared with DSTF w/o TF block. Additionally, the TF block contributed to a decrease in the FID from 32.43 to 25.73, indicating an improvement in the overall quality of the generated sketches. The TF block achieved this by incorporating channel- and position-level adjustments based on text semantics, effectively enhancing the fusion process and generating more accurate and visually appealing sketches.
Swapping the order of the two blocks. A comparison between DSTF and DTSF revealed that the selected order of the two blocks had a notable impact on the performance of the model. DSTF achieved an improved CM precision of 36.08% compared with 32.01% for DTSF. Additionally, DSTF exhibited a decreased FID of 25.73, in contrast to 29.43 for DTSF. These results indicate that the order of combining sparsity and fusing text semantics in the sketch generation process plays a crucial role in generating high-rise building sketches that are more semantically consistent.

5. Application

To assess the potential utility of the proposed pipeline in assisting architects during the conceptual design phase of the professional high-rise building design process, architects were invited to develop a 3D conceptualization of a rendered building appearance as an office building using the SketchUp Pro 2021 software. Figure 16 illustrates the elevation perspective and aerial views of the created office building model. The architects perceived this approach as significantly enhancing the efficiency of the early conceptual design phase for buildings while providing them with more design inspiration.

6. Conclusions

This paper proposes a novel method for assisting architects in the conceptual design stage by intelligently generating high-rise building concepts from textual descriptions. The model comprises a design concept extraction (DCE) module and an architectural appearance generation (AAG) module. The DCE module adopts contrastive learning to train a contrastive text–image matching model (CTIMM) to obtain a pre-trained text encoder that can accurately extract design concepts. Furthermore, the AAG module, which consists of a deep sparse and text fusion generative adversarial network (DSTF-GAN) and a pre-trained rendering (PR) module, transforms the extracted semantics into high-rise building sketches and renders them as realistic high-rise appearances. The effectiveness of the proposed method is validated through qualitative and quantitative experiments, as well as the comparison with existing buildings. Furthermore, the feasibility of applying the method to assist architects in the conceptual design phase of a building is demonstrated through 3D modeling of a generated high-rise rendering. The proposed method demonstrates the potential of leveraging deep learning techniques to facilitate the generation of high-rise building concepts from textual descriptions, offering valuable support to architects during the initial design process.
Limitations. The dataset used in this study consists of only 856 real-world high-rise sketches paired with corresponding textual descriptions. Expanding the dataset could incorporate a wider range of architectural styles and features, leading to more diverse and robust high-rise sketch generation. Furthermore, the rendering process using the pre-trained diffusion model introduces some uncertainty, which may result in variations in architectural style and features, causing a few discrepancies between the generated renderings and the original textual descriptions. Although the visual plausibility of the generated sketches has been validated through 3D modeling and comparisons with existing buildings, incorporating qualitative or quantitative evaluations from experienced architects could further enhance the validation of the method’s effectiveness.
Future work. In future work, we will collaborate with professional architects to enhance the dataset and evaluate the generated designs, employing methods similar to focus groups [39]. By incorporating additional architectural elements and inspirations, we aim to improve the model’s ability to generate high-rise buildings with diverse styles, cultural features, and structural variations while also improving its practical applicability. Additionally, we will refine the diffusion model’s rendering process for more controlled outputs and enable interactive editing. We also plan to investigate the generation of 3D architectural massing by decomposing the overall volume into sub-blocks and representing it through a graph-based representation. During the generation process, spatial and functional constraints will be incorporated to ensure design feasibility. Additionally, by integrating the generated 2D high-rise renderings, we aim to achieve a cohesive design that synthesizes both massing and form, contributing to advancements in 3D architectural design.

Author Contributions

Conceptualization, W.Q. and F.Y.; methodology, W.Q. and F.Y.; validation, F.Y.; formal analysis, W.Q. and F.Y.; investigation, W.Q.; resources, W.Q.; data curation, F.Y.; writing—original draft preparation, F.Y.; writing—review and editing, W.Q.; visualization, F.Y.; supervision, W.Q.; project administration, W.Q.; funding acquisition, W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program of China (Grant No. 2023YFC3805700), National Natural Science Foundation of China (Grant No. 52408171), and China Postdoctoral Science Foundation (Grant No. 2024M764196).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available for confidentiality reasons.

Acknowledgments

The authors acknowledge the assistance of OpenAI’s ChatGPT-3.5 in enhancing text data augmentation for the dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The architecture of the deep sparse and text fusion generative adversarial network for the generation of high-rise conceptual sketches based on textual descriptions.
Generator, z ∈ R^200 ~ N(0, 1) | Discriminator, Image ∈ R^(256 × 256), T ∈ R^512
Linear: z → 128 × 4 × 4 | Conv. 3 × 3, stride = 1, padding = 1
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
DeConv. 4 × 4, stride = 2, padding = 1, BN-128, ReLU, Sparse Fusion Block, Text Fusion Block, ReLU | Conv. 4 × 4, stride = 2, padding = 1, BN-128, LeakyReLU
BN-128, ReLU, Conv. 3 × 3, stride = 1, padding = 1, Tanh | Conv. 3 × 3, stride = 1, padding = 1, LeakyReLU, Conv. 4 × 4, stride = 1, padding = 0
Abbreviations: DeConv, deconvolutional layer; Conv, convolutional layer; ReLU, rectified linear unit; LeakyReLU, LeakyReLU activation function; BN, batch normalization.
Table A2. The architecture of the sparse fusion and text fusion blocks for integrating the sparsity of sketches and the text features into the generated sketches, respectively. H_i × W_i is the size of the output feature map of the deconvolution layer in the i-th sparse text fusion block. In g_{i,2}^z and g_i^T, if H_i × W_i is greater than 16 × 16, deconvolution is adopted; otherwise, convolution is used.
Sparse Fusion Block, z ∈ R^200:
  f_i^z: Linear z → 1024, ReLU; Linear 1024 → 512, ReLU; Linear 512 → 256, ReLU
  g_{i,1}^z: Linear 512 → 128, ReLU
  g_{i,2}^z: Conv or DeConv 16 × 16 → H_i × W_i, ReLU
Text Fusion Block, T ∈ R^512:
  f_{i,1}^T: Linear T → 128, LeakyReLU; Linear 128 → 128, LeakyReLU
  f_{i,2}^T: Linear T → 256, LeakyReLU
  g_i^T: Conv or DeConv 16 × 16 → H_i × W_i, LeakyReLU
Abbreviations: DeConv, deconvolutional layer; Conv, convolutional layer; ReLU, rectified linear unit; LeakyReLU, LeakyReLU activation function; BN, batch normalization.
Figure A1 presents the work interface we developed, which integrates the pre-trained text encoder from the DCE module, the DSTF-GAN, and the pre-trained stable-diffusion-v1-5 model within a Python framework. This interface enables an integrated workflow from input text prompts to high-rise design sketches and renderings. Users input textual descriptions in the left-side input box and click the “Generate Sketch” button to generate three high-rise concept sketches. Regarding the characteristics of the prompts, the input text prompts primarily include architectural styles (e.g., modernism, futurism, and postmodernism), high-rise types (e.g., high-rise and tower), and design manners (e.g., symmetrical, misaligned, and segmented). The format of the text prompt typically follows the structure: “The ** style high-rise/tower adopts ** design method, with ** structure”. There are no strict requirements regarding the order, format, or necessary content of the descriptions. If the results are unsatisfactory, users can click the “Generate Sketch” button again to produce three additional sketches. Once a preferred sketch is selected, users can click the “Render” button to generate three rendered images of the chosen sketch. If the rendering results are not satisfactory, the “Render” button can be clicked again for three additional renderings. The resolution of the generated outputs varies across stages. The sketches, typically rendered at a lower resolution, still effectively convey the underlying design concepts. In contrast, the renderings produced by the diffusion model are of higher resolution and exhibit a high degree of realism, capturing architectural details with greater precision. Additionally, each generated sketch and rendering includes a “Save” button for users to save the respective image.
Figure A1. Work interface for generating high-rise sketches and renderings from input text prompts within a Python framework. The interface integrates the pre-trained text encoder from the DCE module for semantic text extraction, the trained DSTF-GAN for sketch generation, and the pre-trained stable-diffusion-v1-5 model for rendering.

References

  1. Menezes, A.; Lawson, B. How designers perceive sketches. Des. Stud. 2006, 27, 571–585. [Google Scholar] [CrossRef]
  2. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  3. Song, Z.; Zhang, C.; Lu, Y. The methodology for evaluating the fire resistance performance of concrete-filled steel tube columns by integrating conditional tabular generative adversarial networks and random oversampling. J. Build. Eng. 2024, 97, 110824. [Google Scholar] [CrossRef]
  4. Guo, X.; Zhang, J.; Zong, S.; Zhu, S. A fast-response-generation method for single-layer reticulated shells based on implicit parameter model of generative adversarial networks. J. Build. Eng. 2023, 72, 106563. [Google Scholar] [CrossRef]
  5. Fu, B.; Wang, W.; Gao, Y. Physical rule-guided generative adversarial network for automated structural layout design of steel frame-brace structures. J. Build. Eng. 2024, 86, 108943. [Google Scholar] [CrossRef]
  6. Qi, Y.; Yuan, C.; Li, P.; Kong, Q. Damage analysis and quantification of RC beams assisted by Damage-T Generative Adversarial Network. Eng. Appl. Artif. Intell. 2023, 117, 105536. [Google Scholar] [CrossRef]
  7. Nauata, N.; Chang, K.-H.; Cheng, C.-Y.; Mori, G.; Furukawa, Y. House-GAN: Relational Generative Adversarial Networks for Graph-Constrained House Layout Generation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 162–177. [Google Scholar]
  8. Nauata, N.; Hosseini, S.; Chang, K.-H.; Chu, H.; Cheng, C.-Y.; Furukawa, Y. House-GAN++: Generative adversarial layout refinement network towards intelligent computational agent for professional architects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13632–13641. [Google Scholar]
  9. Chang, K.-H.; Cheng, C.-Y.; Luo, J.; Murata, S.; Nourbakhsh, M.; Tsuji, Y. Building-GAN: Graph-conditioned architectural volumetric design generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 11956–11965. [Google Scholar]
  10. Ghannad, P.; Lee, Y.-C. Automated modular housing design using a module configuration algorithm and a coupled generative adversarial network (CoGAN). Autom. Constr. 2022, 139, 104234. [Google Scholar] [CrossRef]
  11. Liu, M.-Y.; Tuzel, O. Coupled Generative Adversarial Networks. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
  12. Liao, W.; Lu, X.; Huang, Y.; Zheng, Z.; Lin, Y. Automated structural design of shear wall residential buildings using generative adversarial networks. Autom. Constr. 2021, 132, 103931. [Google Scholar] [CrossRef]
  13. Fu, B.; Gao, Y.; Wang, W. Dual generative adversarial networks for automated component layout design of steel frame-brace structures. Autom. Constr. 2023, 146, 104661. [Google Scholar] [CrossRef]
  14. Sun, C.; Zhou, Y.; Han, Y. Automatic generation of architecture facade for historical urban renovation using generative adversarial network. Build. Environ. 2022, 212, 108781. [Google Scholar] [CrossRef]
  15. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  16. Qian, W.; Xu, Y.; Zuo, W.; Li, H. Self sparse generative adversarial networks. arXiv 2021, arXiv:2101.10556. [Google Scholar] [CrossRef]
  17. Qian, W.; Xu, Y.; Li, H. A self-sparse generative adversarial network for autonomous early-stage design of architectural sketches. Comput. Aided Civ. Infrastruct. Eng. 2022, 37, 612–628. [Google Scholar] [CrossRef]
  18. Qian, W.; Yang, F.; Mei, H.; Li, H. Artificial intelligence-designer for high-rise building sketches with user preferences. Eng. Struct. 2023, 275, 115171. [Google Scholar] [CrossRef]
  19. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  20. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 1060–1069. [Google Scholar]
  21. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5907–5915. [Google Scholar]
  22. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1947–1962. [Google Scholar] [CrossRef]
  23. Qiao, T.; Zhang, J.; Xu, D.; Tao, D. MirrorGAN: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1505–1514. [Google Scholar]
  24. Tao, M.; Tang, H.; Wu, F.; Jing, X.-Y.; Bao, B.-K.; Xu, C. DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 16515–16525. [Google Scholar]
  25. Liao, W.; Hu, K.; Yang, M.Y.; Rosenhahn, B. Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 18166–18175. [Google Scholar] [CrossRef]
  26. Zhang, Z.; Schomaker, L. DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation. Neurocomputing 2022, 473, 182–198. [Google Scholar] [CrossRef]
  27. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar]
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  29. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 17 June–2 July 2016; pp. 770–778. [Google Scholar]
  32. Lim, J.H.; Ye, J.C. Geometric GAN. arXiv 2017, arXiv:1705.02894. [Google Scholar]
  33. Winnemöller, H.; Kyprianidis, J.E.; Olsen, S.C. XDoG: An eXtended difference-of-Gaussians compendium including advanced image stylization. Comput. Graph. 2012, 36, 740–753. [Google Scholar] [CrossRef]
  34. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  35. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  37. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  38. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  39. Fusaro, G.; Kang, J. Participatory approach to draw ergonomic criteria for window design. Int. J. Ind. Ergon. 2021, 82, 103098. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed method. DCE and AAG refer to the design concept extraction and architectural appearance generation modules, respectively. The AAG module contains a deep sparse and text fusion generative adversarial network (DSTF-GAN) and a pre-trained rendering (PR) module.
Figure 2. Framework of the contrastive text–image matching model.
Figure 3. Overall architecture of the proposed AAG module.
Figure 4. Proposed DSTF-GAN architecture for text-to-conceptual-sketch generation. z is a 200-dimensional latent code sampled from a normal distribution, and T is a 512-dimensional text semantic vector of the textual description. FC, Conv, DeConv, BatchNorm, InstanceNorm, ReLU, LeakyReLU, and Tanh denote the fully connected layer, convolutional layer, deconvolutional layer, batch normalization layer, instance normalization layer, rectified linear unit activation function, LeakyReLU activation function, and hyperbolic tangent activation function, respectively. Additionally, STF refers to the sparse and text fusion blocks, which fuse the sparsity and text features into the sketches. The detailed network architecture is shown in Table A1 in Appendix A.
Figure 5. Architectures of the sparse and text fusion blocks. MLPs, Conv, DeConv, ReLU, and LeakyReLU denote the multi-layer perceptrons, convolutional layer, deconvolutional layer, rectified linear unit activation function, and LeakyReLU activation function, respectively. The detailed network architecture is shown in Table A2 in Appendix A.
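As an illustration of the layer vocabulary used in Figures 4 and 5, the following PyTorch skeleton shows one possible arrangement of an FC projection of [z, T], DeConv–BatchNorm–ReLU upsampling stages, text-conditioned fusion blocks, and a Tanh output. The channel widths, the number of stages, and the internal form of the STF block are assumptions made for illustration only; the actual layouts are given in Tables A1 and A2 of Appendix A.

```python
# Hedged skeleton inspired by the DSTF-GAN layer description; not the released network.
import torch
import torch.nn as nn

class STFBlock(nn.Module):
    """Placeholder fusion block: modulates feature maps with the 512-d text
    vector via MLP-predicted scale/shift terms (illustrative assumption)."""
    def __init__(self, channels, text_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, channels)
        self.to_shift = nn.Linear(text_dim, channels)

    def forward(self, x, t):
        scale = self.to_scale(t).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(t).unsqueeze(-1).unsqueeze(-1)
        return x * (1 + scale) + shift

class Generator(nn.Module):
    def __init__(self, z_dim=200, text_dim=512):
        super().__init__()
        self.fc = nn.Linear(z_dim + text_dim, 512 * 4 * 4)     # FC projection of [z, T]
        self.up1 = nn.Sequential(nn.ConvTranspose2d(512, 256, 4, 2, 1),
                                 nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.stf1 = STFBlock(256)
        self.up2 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1),
                                 nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.stf2 = STFBlock(128)
        self.to_sketch = nn.Sequential(nn.Conv2d(128, 1, 3, 1, 1), nn.Tanh())

    def forward(self, z, t):
        x = self.fc(torch.cat([z, t], dim=1)).view(-1, 512, 4, 4)
        x = self.stf1(self.up1(x), t)
        x = self.stf2(self.up2(x), t)
        return self.to_sketch(x)                 # grayscale conceptual sketch in [-1, 1]

sketch = Generator()(torch.randn(1, 200), torch.randn(1, 512))
print(sketch.shape)  # (1, 1, 16, 16) with the two illustrative upsampling stages shown here
```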
Figure 6. Reshaping, background removal, and sketch conversion of high-rise building images using the XDoG operator.
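The XDoG operator referenced in Figure 6 [33] is a sharpened difference of Gaussians followed by a soft threshold. The snippet below is a compact OpenCV/NumPy rendition of that formulation; the parameter values (σ, k, p, ε, φ) are typical defaults rather than the exact settings used to construct the training sketches.

```python
# Hedged XDoG-style sketch conversion following Winnemoeller et al. [33].
import cv2
import numpy as np

def xdog_sketch(gray, sigma=0.8, k=1.6, p=20.0, eps=0.01, phi=15.0):
    """gray: float32 image in [0, 1]; returns a line-drawing-like image in [0, 1]."""
    g1 = cv2.GaussianBlur(gray, (0, 0), sigma)
    g2 = cv2.GaussianBlur(gray, (0, 0), sigma * k)
    d = (1 + p) * g1 - p * g2                                      # sharpened difference of Gaussians
    out = np.where(d >= eps, 1.0, 1.0 + np.tanh(phi * (d - eps)))  # soft threshold
    return np.clip(out, 0.0, 1.0)

img = cv2.imread("high_rise.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
cv2.imwrite("high_rise_sketch.png", (xdog_sketch(img) * 255).astype(np.uint8))
```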
Figure 7. Data augmentation using the pre-trained latent stable diffusion model and ChatGPT.
Figure 8. Matching accuracy of experiments on different divisions of the dataset with different data augmentation techniques. “Original” denotes training the CTIMM without any data augmentation. “DA” denotes training the model with scaling and cropping operations for data augmentation. “SD” and “GPT” denote that the pre-trained latent stable diffusion model and ChatGPT are adopted for data augmentation, respectively.
Figure 9. One batch of matching results of “DA and SD and GPT” on the validation set of Division 1.
Figure 10. Comparison of semantic consistency between generated high-rise conceptual sketches and textual descriptions. The semantic correspondences highlighted in the text are indicated by colored bounding boxes in the sketches.
Figure 11. Creative comparison of generated high-rise conceptual sketches with real sketches.
Figure 12. Rendering high-rise building appearances from generated conceptual sketches using the pre-trained stable-diffusion-v1-5 model.
Figure 13. Comparisons of generated high-rise building appearances directly from the pre-trained latent stable diffusion model and the proposed two-stage method. (a) Comparison of high-rise type building generation results. (b) Comparison of tower type building generation results.
Figure 14. Representative examples comparing generated sketches and renderings with existing buildings. Neither the generated sketches nor the existing buildings belong to the training dataset, and the corresponding renderings resemble the main features of some existing buildings.
Figure 15. Examples of high-rise sketches synthesized by different text fusion methods.
Figure 16. Three-dimensional conceptualization of a rendered high-rise building appearance in an office building type.
Table 1. FID and CM precision results of three fusion methods.

Text Fusion Method            FID ↓     CM Precision (%) ↑
Diver                         116.99    6.81
DF                            25.88     27.64
DSTF (the proposed method)    25.73     36.08
Table 2. FID, CM precision, and LPIPS results of the component analysis.

Architecture                  FID ↓     CM Precision (%) ↑    LPIPS
DSTF (the proposed method)    25.73     36.08                 0.2200
DSTF w/o SF block             26.51     30.83                 0.0588
DSTF w/o TF block             32.43     4.31                  0.3095
DTSF                          29.43     32.01                 0.2235
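For context, the FID [37] and LPIPS [38] scores of the kind reported in Tables 1 and 2 can be computed with publicly available implementations, as in the small-scale sketch below. The random stand-in tensors, sample counts, and preprocessing are placeholders and do not reproduce the paper's evaluation protocol, and the CM precision metric is not implemented here; torchmetrics (with torch-fidelity) and the lpips package are assumed to be installed.

```python
# Hedged illustration of FID and LPIPS computation with public libraries.
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-in tensors in place of real and generated sketches (uint8, NCHW).
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

# FID over Inception features (feature=64 keeps this toy example stable with few samples).
fid = FrechetInceptionDistance(feature=64)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# LPIPS as a pairwise perceptual distance between generated samples scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")
a = fake[:8].float() / 127.5 - 1.0
b = fake[8:16].float() / 127.5 - 1.0
print("LPIPS:", loss_fn(a, b).mean().item())
```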