Review

Survey of Architectural Floor Plan Retrieval Technology Based on 3ST Features

1 Business and Information College, Shanghai 200235, China
2 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 20 January 2025 / Revised: 12 March 2025 / Accepted: 17 March 2025 / Published: 26 March 2025

Abstract

Feature retrieval technology for building floor plans has garnered significant attention in recent years due to its critical role in the efficient management and execution of construction projects. This paper presents a comprehensive exploration of four primary features essential for the retrieval of building floor plans: semantic features, spatial features, shape features, and texture features (collectively referred to as 3ST features). The extraction algorithms and underlying principles associated with these features are thoroughly analyzed, with a focus on advanced methods such as wavelet transforms and Fourier shape descriptors. Furthermore, the performance of various retrieval algorithms is evaluated through rigorous experimental analysis, offering valuable insights into optimizing the retrieval of building floor plans. Finally, this study outlines prospective directions for the advancement of feature retrieval technology in the context of floor plans.

1. Introduction

In the field of architectural design, architectural blueprints and floor plans are two fundamental concepts; however, their meanings and functions are frequently conflated. Architectural blueprints encompass the comprehensive design documentation of a construction project, typically comprising floor plans, elevations, sections, and other detailed drawings that illustrate the building’s structural, electrical, and plumbing systems. In contrast, floor plans constitute a specific component of architectural blueprints, primarily depicting the horizontal layout of a building. These plans include the distribution of rooms, the positioning of walls, and the placement of doors and windows, serving as essential tools for construction and spatial planning. A floor plan serves as a foundational representation of a building’s layout from an aerial perspective, encompassing a horizontal projection and a corresponding legend. It is a fundamental example of a construction drawing, illustrating the shape, size, and arrangement of the structure while detailing the dimensions and materials of walls and columns, as well as the types and placements of windows and doors. This blueprint is a crucial reference for delineating lines, erecting walls, installing doors and windows, executing interior and exterior finishes, and formulating budgets during the construction phase. Figure 1 shows an example of such a floor plan, containing bathrooms, doors, windows, and other elements.
A floor plan is a two-dimensional (2D) horizontal projection of a building’s floors, effectively conveying the layout of its spatial components, such as regions, doors, and walls. The automatic retrieval of floor plans has been actively studied over the past few decades [1]. A fundamental task in this area is segmenting a floor plan into regions (e.g., bedroom, living room) with accurate labels. However, the heterogeneous information present in floor plans complicates semantic segmentation, including tasks like line detection and region-growing segmentation [2]. With the emergence of deep learning neural networks (DNNs), DNN-based planar graph retrieval has gained significant popularity [3]. In particular, the integration of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) has proven to be a powerful approach. Notable DNN architectures include fully convolutional networks (FCNs) [4], U-Net [5], and DeepLab [6]. These architectures have applications in various domains, including digitizing residential structures [7], analyzing and identifying elements in 2D building plans, constructing 3D models of buildings [8,9,10,11], evaluating the appeal of a building’s layout to users [12,13], and simulating immersive virtual reality (VR) indoor architectural environments [14]. Our analysis of numerous existing studies revealed that AI holds significant potential in the architectural design process, encompassing idea development, data analysis (e.g., project predictive analysis), construction supervision, and ongoing facility maintenance. To evaluate the role of AI in the design process, we compared the latest large-scale models—DALL-E, Midjourney, and Stable Diffusion—across various design stages [15]. The study demonstrated that DALL-E and Stable Diffusion outperform Midjourney in generating ideas, sketches, and architectural style variants, significantly accelerating concept development. Stable Diffusion excelled in accurately representing construction plans and interior/exterior designs, with DALL-E closely following in performance. DALL-E also effectively addresses diverse ideation needs and offers robust editing capabilities. While Midjourney lacks features like in/out-painting and image combination, it remains valuable for basic sketching. Stable Diffusion achieves a balance, excelling in both generative design and detailed construction planning [16].
The complexity and size of building plans necessitate large datasets for training deep learning models; however, issues such as blurriness, overlap, and distortion can adversely affect the accuracy of retrieval models. In contrast, prompt and accurate retrieval of relevant plans can significantly enhance the efficiency of designers, engineers, and architects. A robust retrieval system enables designers to quickly identify similar design cases, fostering innovation and supporting informed design decisions. Effective management of spatial layouts, structural designs, equipment configurations, and other resources can optimize the building design and construction process, ultimately minimizing resource waste. With the advancements in building information modeling (BIM) and intelligent building technologies, the need for effective retrieval and analysis of building plan features has become increasingly vital. The advanced integration of BIM and VR in the construction environment, exemplified by immersive virtual reality-based construction management schedules, demonstrates their transformative potential. BIM offers significant advantages in controlling life cycle costs and addressing environmental challenges, enhancing construction efficiency, improving design quality, and facilitating decision-making. Studies have demonstrated that VR technology enhances stakeholder participation by 62% and spatial awareness by 48%, fostering greater community engagement and inclusiveness in the development process. The integration of BIM and VR not only optimizes construction workflows but also enhances environmental and socio-economic outcomes, including a 20% reduction in greenhouse gas emissions [16]. This paper aims to comprehensively investigate key aspects of building plan feature retrieval technology, including semantic, spatial, texture, and shape characteristics, while emphasizing their significance and potential applications.
The main contributions of this paper are as follows:
  • Through the systematic classification of various features in building floor plans, this paper provides a comprehensive framework to help researchers and practitioners understand and apply these features more effectively. This classification not only supports theoretical research but also offers valuable guidance for practical implementations.
  • Through a detailed analysis of the four features, this paper introduces innovative tools and methodologies that assist architectural designers and planners in selecting and optimizing design schemes. It illustrates how these tools work together to extract and analyze features of building floor plans, contributing to improved design efficiency and effectiveness.

2. Floor Plan Retrieval Overview

2.1. Overview of Floor Plan Feature Extraction

The floor plan retrieval method based on deep learning primarily employs models such as CNNs, RC-Net, Faster R-CNN, Transformers, and Graph Matching Networks (GMNs) to extract features from building floor plans. By identifying unique high-level semantic features and employing feature vectors to calculate image similarity, this approach enables effective retrieval of building floor plans. Prior to 2013, the forward extraction method was predominantly utilized, directly extracting relevant information or features from images, such as identifying and isolating structural components like walls, doors, and windows [17,18]. However, the efficiency of direct recognition is limited due to the lack of distinct features for these elements in the floor plan, whereas functional components, such as beds and tables, are more clearly defined [19]. To address this challenge, the reverse extraction method was introduced in 2019, which derives or generates new information from previously extracted data and utilizes algorithmic modeling to identify and remove functional components, thereby enhancing extraction efficiency [20].
In 2022, Shehzadi et al. [21] utilized a CNN to identify connection points in building floor plans, such as wall corners and door endpoints. Their method established links between these intersections and implemented an integer programming algorithm to detect doors on walls. In 2021, Fan et al. [22] proposed a novel approach that integrated graph convolutional networks (GCNs) with CNNs in an end-to-end architecture. This architecture comprises a CNN backbone, a graph convolution header, and a detection header, collectively forming a baseline network for panoramic symbol recognition tasks. The CNN-GCN approach achieved state-of-the-art (SOTA) performance in semantic symbol recognition. In 2023, Wang et al. [23] employed the RC-Net framework to extract fundamental features from floor plan images using a VGG encoder. Their feature branch emphasized room boundaries and types, optimizing the learning of essential features necessary for prediction. The final output is a room mask that integrates text features from the text branch with room features to enhance accuracy. These studies indicate that the advancement of Graph Neural Networks (GNNs) has led to floor plan retrieval based on graph structures, demonstrating that GNNs can outperform Faster R-CNN in terms of performance and accuracy, making them well suited for handling large-scale architectural floor plans [24]. In recent years, scholars [7] have begun exploring techniques for converting 2D floor plans into 3D models by first segmenting structural components, such as walls, and recognizing associated symbols, including windows and doors. They subsequently extract wall details from the predicted segmentation mask to generate semantic elements, which can then be utilized to construct 3D models in accordance with the Industrial Foundation Class (IFC) standard.
Regarding the conversion of 3D models into building floor plans, a spatial context module has been proposed to enhance the accuracy of room type prediction by transferring room boundary features from the top-layer decoder to the bottom-layer decoder [25]. Additionally, deep segmentation and detection neural networks have been employed to extract structural information from rooms, determine room dimensions through keypoint detection and clustering analysis, vectorize the room data via an iterative optimization method, and generate vectorized 3D reconstruction outputs [26]. Furthermore, three augmentation techniques—Gaussian noise, Gaussian blurring, and random rotation—have been utilized to enhance the input floor plans [27]. Multiple rounds of random erosion operations are performed on the target wall images to decrease the model’s sensitivity to pixel variations at the wall edges, thereby improving its ability to effectively learn the features of wall edges. In summary, floor plan retrieval relies on image processing, feature extraction, image matching, and machine learning, involving a wide range of applications and complex algorithms. As data continue to grow, new techniques are constantly being developed and refined.

2.2. Overview of Floor Plan Retrieval Architecture

2.2.1. Network Feedforward Solutions

In the field of architecture, extracting deep features from building floor plans is essential for tasks such as building identification, design, and analysis. The emergence of deep learning techniques has significantly advanced computer vision and image processing, leading to the development of innovative feedforward neural network architectures specifically tailored for extracting complex features from building floor plans. In ref. [25], a novel approach is introduced for recognizing elements within building floor plans through the construction of a hierarchical model of floor plan components. This method employs the VGG deep convolutional neural network architecture in conjunction with a room boundary-guided attention mechanism to enhance floor plan recognition. By predicting both room boundary elements and room types, the model organizes floor plan components into a hierarchical structure, classifying them based on their placement within the interior or exterior of the building. Interior elements are further categorized into room boundary components (including walls, doors, and windows) and room type components (such as living rooms, bathrooms, and bedrooms), as shown in Figure 2; the VGG deep convolutional network can then be used to obtain the features of the various elements and boundary components illustrated in the figure.
Ouahbi et al. [28] introduced a U-Net-based feature extraction framework. This framework consists of a contraction path with multiple convolutional and pooling layers, as well as a symmetric expansion path that includes an upsampling layer (transposed convolution) and regular convolution operations. The U-Net model effectively extracts features by merging shallow and deep information, capturing both local and global context within building floor plans. The upsampling layer facilitates feature fusion and spatial recovery, while the contraction path progressively consolidates low-level features into more abstract representations with larger receptive fields. The expansion path then progressively restores the spatial resolution of these features, ensuring dense segmentation at the original resolution.
To enhance planar segmentation and vectorization processes, researchers have proposed the DeepLabv3+ network as the foundational model. By integrating dilated convolution and spatial pyramid pooling techniques, this framework excels at extracting multi-scale features, including structural elements, textual annotations, and symbolic representations from pixel maps. This advanced approach accurately identifies room configurations, dimensions, and types, effectively capturing the intricate structure and detailed nuances of buildings across various scales. The system significantly improves accuracy and generalization capabilities, particularly excelling in planar segmentation and vectorization tasks, including the detection of slant walls. The degree of match between the optimized polygons and the room contours is represented by $\iota_{boundary}$, while the match between the polygonal regions and the internal areas of the room is denoted by $\iota_{IOU}$.
$\iota_{boundary} = \sum_{b \in B} \min_{p_i \in P} D(p_i p_{i+1}, b)$
$\iota_{IOU} = \mathrm{IOU}\left(\mathrm{Rasterized}(P),\, c\right)$
In Equation (1), $D(p_i p_{i+1}, b)$ denotes the shortest distance from $b$ to the line segment $p_i p_{i+1}$, and $\mathrm{Rasterized}(P)$ in Equation (2) denotes the region covered by rasterizing the polygon $P$.
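As a concrete illustration, the sketch below shows how such a boundary term and IoU term could be evaluated with NumPy; the point-to-segment helper, the sampling of boundary pixels, and the mask-based IoU are assumptions made for illustration, not the implementation used in the cited work.

```python
import numpy as np

def point_segment_distance(b, p1, p2):
    """Shortest distance from point b to the segment p1-p2."""
    p1, p2, b = map(np.asarray, (p1, p2, b))
    seg = p2 - p1
    denom = np.dot(seg, seg)
    t = 0.0 if denom == 0 else np.clip(np.dot(b - p1, seg) / denom, 0.0, 1.0)
    return np.linalg.norm(b - (p1 + t * seg))

def boundary_term(boundary_pixels, polygon):
    """Eq. (1): sum over boundary pixels b of the minimum distance to any polygon edge."""
    total = 0.0
    for b in boundary_pixels:
        dists = [point_segment_distance(b, polygon[i], polygon[(i + 1) % len(polygon)])
                 for i in range(len(polygon))]
        total += min(dists)
    return total

def iou_term(polygon_mask, room_mask):
    """Eq. (2): IoU between the rasterized polygon and the room's interior mask."""
    inter = np.logical_and(polygon_mask, room_mask).sum()
    union = np.logical_or(polygon_mask, room_mask).sum()
    return inter / union if union > 0 else 0.0
```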
A groundbreaking study referenced in [29] introduces a novel neural network architecture aimed at understanding both layout types and room types within architectural blueprints. This sophisticated architecture incorporates an n-dimensional fully connected layer placed behind the fc7 layer of the widely used VGG-16 model [30], which has been pre-trained on the extensive ImageNet dataset. In addition, a 2 m-dimensional fully connected (FC) layer is specifically designed to classify various floor types, while a parallel set of 2 m-dimensional FC layers collaborates to identify distinct room types. Here, n represents the different layout types under investigation, while m denotes the room types awaiting classification.
Our analysis of the parameter optimization process in VGG-16 revealed a significant limitation: the process is governed by a feedback loop between layout-type classification errors and room existence classification errors. While parameter optimization in VGG-16 allows for the fine-tuning of the network’s parameters, it neglects the critical role of edges within the complex network of building information. Consequently, the network fails to capture the intricate details embedded in the interconnections between rooms, which limits its capacity to comprehend the spatial relationships inherent in the building layout.

2.2.2. Feature Extraction of Floor Plan Structural Elements

In building plan analysis, the selection of deep features emphasizes extracting relevant insights from raw blueprints to enhance operational efficiency. When extracting wall features, the inherent variability and irregularity of their shapes and sizes necessitate sophisticated approaches. For instance, the ResNet-based Feature Pyramid Network (FPN) architecture is integrated with semantic segmentation techniques. In contrast, for features such as doors and windows, which are characterized by discrete and repetitive symbols, the bounding box detection neural network is seamlessly combined with the Faster R-CNN model, yielding effective results.
Since room-type text can provide valuable insights into the functionality of a specific space, Wang et al. [23] employed a text branching technique to extract textual features and integrated them with spatial features through a merging module. This approach was designed to enhance semantic features for subsequent predictive analysis. Furthermore, Gao et al. [27] conducted a comprehensive statistical analysis to assess the significance of features during the deep feature selection process. Their methodology involved segmenting the drawing box, extracting wall features by isolating the wall centerline, and feeding the processed data into a space-merging module for boundary completion and integration. The resulting data were meticulously filtered to eliminate errors, ensuring the accuracy and reliability of the final output. In these studies, segmentation networks were primarily utilized to extract key structural elements, such as walls, rooms, windows, and doors. Vectorization algorithms were subsequently applied to these elements, and detection models were employed to identify symbols and text within the extracted data. This information, combined with the length of measured lines, facilitated the calculation of the floor plan’s scale.

2.2.3. Similarity Measure

YAMASAKI et al. [18] introduced a novel approach for planar graph retrieval based on a fully convolutional network (FCN) and a graph model structure. This method applies a semantic segmentation algorithm to planar graphs using the FCN to categorize each component. Vertices are established when the segmented object exceeds a threshold size of 1000, and edges are added if the distance between objects is less than 30, transforming the planar graph into a graph model structure. Additionally, a new algorithm is devised and optimized to extract the maximum common subgraph (MCS) [31] and isolated points, with similarity calculations on the MCS used for retrieving images with similar structures.
$s_v = \mathrm{sim}(s_g, s_h) = \frac{\min(s_g, s_h)}{\max(s_g, s_h)}$
$w_e = 2\, e^{s_{v_{e1}}} e^{s_{v_{e2}}}$
In Equation (3), $s_v$ denotes the similarity of vertices $s_g$ and $s_h$ in graphs $G$ and $H$; it is obtained by comparing the sizes of the vertex sets of the two graphs. In Equation (4), $w_e$ is the edge weight between the two vertices $v_{e1}$ and $v_{e2}$. The similarity between graphs $G$ and $H$ is ultimately computed by summation:
$\mathrm{sim}(G, H) = \sum_{e \in M} w_e + \sum_{\mathrm{isolated}\; v \in M} s_v$
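A minimal sketch of this similarity computation is given below; it assumes the MCS and its matched edges and isolated vertices have already been extracted, and it takes precomputed edge weights rather than committing to a particular form of Equation (4).

```python
def vertex_similarity(size_g, size_h):
    """Eq. (3): ratio of the smaller to the larger vertex (segment) size."""
    return min(size_g, size_h) / max(size_g, size_h)

def graph_similarity(edge_weights, isolated_vertex_sizes):
    """Eq. (5): sum of edge weights over the MCS plus vertex similarities
    of matched isolated vertices.

    edge_weights: iterable of w_e values for edges of the MCS.
    isolated_vertex_sizes: iterable of (size_g, size_h) pairs for matched
    isolated vertices.
    """
    edge_term = sum(edge_weights)
    vertex_term = sum(vertex_similarity(a, b) for a, b in isolated_vertex_sizes)
    return edge_term + vertex_term
```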
For multimodal feature extraction, YUKI et al. [29] proposed a method to transform planar images into graph structures. The authors contend that the variability in drawing styles across floor plans adversely affects the performance of conventional neural networks. To address this challenge, they introduced a framework that jointly optimizes room layout and room style classifiers. Features were extracted using the VGG network, which transforms the structural characteristics of rooms into graph-based representations, thereby enabling graph-based retrieval of similar attributes in floor plans. This framework achieved an accuracy of 0.49 at p = 0.5 . DIVYA et al. [32] proposed a unified framework that employs a graph embedding method to represent graphs extracted from layouts. A two-stage matching and retrieval method, consisting of Room Layout Matching (RLM) and Room Decoration Matching (RDM), was proposed for feature matching and retrieval. The evaluation primarily focuses on ranking retrieval results to assess the method’s capability to identify layouts similar to the query within the database. However, this approach exhibits several limitations: the two-stage matching process lacks uniformity; critical features such as room dimensions, which are essential for buyers, are overlooked; and the dataset is constrained by limited sample variation. To address these challenges, DIVYA et al. [33] introduced an end-to-end framework for extracting high-level semantic features, such as room sizes, adjacencies, and furnishings, within a fine-grained retrieval approach. This method utilizes feature fusion to aggregate high-level semantic features, enabling the retrieval of similar building floor plans. The total matching score M is computed by weighting the four extracted feature scores.
$M(i, j) = \frac{\rho^{+}(i, j) + \psi^{+}(i, j) + \phi^{+}(i, j) + \theta^{+}(i, j)}{4}$
In Equation (6), $\rho^{+}$ denotes the RAS (Room Adjacency String) score, $\psi^{+}$ denotes the CAR (Ratio of Carpet Area) score, $\phi^{+}$ denotes the furniture-count score, and $\theta^{+}$ denotes the furniture-type score.
In addition to the above similarity calculations, there are also the $L_1$ and $L_2$ distances, the Mahalanobis distance, and the histogram intersection. If the components of the image features are orthogonal and independent, and each dimension is equally important, the distance between two feature vectors $A$ and $B$ can be measured by the $L_1$ distance or the $L_2$ distance (also known as the Euclidean distance). The $L_1$ distance can be expressed as
$D_1 = \sum_{i=1}^{N} \left| A_i - B_i \right|$,
where $N$ is the dimension of the feature vector. Similarly, the $L_2$ distance can be expressed as
$D_2 = \sqrt{ \sum_{i=1}^{N} \left( A_i - B_i \right)^2 }$.
If the components of the feature vector are correlated or have different weights, the Mahalanobis distance can be used to calculate the similarity between the features. The mathematical expression of the Mahalanobis distance is
$D_{\mathrm{mahal}} = (A - B)^{T} C^{-1} (A - B)$,
where C represents the covariance matrix of the feature vector. This distance criterion is frequently employed to assess the similarity of Synthetic Aperture Radar (SAR) features.
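The following sketch illustrates these three distance measures in NumPy; it is a direct transcription of Equations (7)–(9) for illustration, not code from any of the surveyed systems.

```python
import numpy as np

def l1_distance(a, b):
    """Eq. (7): sum of absolute component-wise differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Eq. (8): Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def mahalanobis_distance(a, b, cov):
    """Eq. (9): quadratic form weighted by the inverse covariance matrix of the features."""
    diff = a - b
    return float(diff @ np.linalg.inv(cov) @ diff)

# Example: compare a query feature vector against one database vector
query = np.array([0.2, 0.5, 0.1])
candidate = np.array([0.3, 0.4, 0.2])
print(l1_distance(query, candidate), l2_distance(query, candidate))
```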

2.3. Overview of 3ST Feature Analysis

Feature extraction is a fundamental concept in computer vision and image processing. It refers to the process of using computers to extract information from images and determine whether each image point corresponds to a specific image feature. The outcome of feature extraction is the division of image points into distinct subsets, which typically comprise isolated points, continuous curves, or continuous regions. As of now, there is no universally accepted and precise definition of features. Various image features and similarity algorithms are employed in image retrieval. For a specific image library, it is essential to select one or more of the most effective image features and similarity algorithms. This necessitates a comprehensive evaluation of the retrieval effectiveness under varying conditions, comparing the advantages and disadvantages of diverse methods to identify the most optimal approach.
Semantic feature extraction is a fundamental operation in image processing, typically performed as the initial step in analyzing an image. It involves examining each pixel to identify whether it corresponds to a feature. As feature extraction serves as a critical computational step in numerous image processing algorithms, a wide range of feature extraction methods have been developed, each differing in the types of features extracted, computational complexity, and repeatability. An edge, for instance, is a pixel that delineates the boundary between two distinct image regions. The shape of an edge can vary arbitrarily and may include intersections. In practice, edges are defined as subsets of image points exhibiting significant gradients. Common algorithms often connect points with high gradients to construct a more comprehensive representation of the edge. These algorithms may impose constraints on the edge, which is locally a one-dimensional structure. Corners are point-like features in an image, characterized by a two-dimensional structure locally. Early algorithms initially performed edge detection and subsequently analyzed edge directions to identify abrupt changes, corresponding to corners. Later algorithms eliminate the need for edge detection and instead directly identify regions of high curvature in the image gradient.
The performance of texture features, ranked from best to worst, is as follows: the multiresolution simultaneous autoregressive (MRSAR) texture feature, the Gabor feature, the tree wavelet transform (TWT), the pyramid wavelet transform (PWT), the improved Tamura feature, the roughness histogram, the directional histogram, and the traditional Tamura feature. Additionally, the performance of the traditional Tamura feature exceeds that of both the roughness and directional histograms, indicating that it is more suitable for describing image sets with relatively uniform textures, such as Brodatz.
Geometric shape features and color aggregation vectors are more effective for images with uniform colors or textures, offering no advantages for general images. Furthermore, the geometric method is a texture feature analysis approach based on the theory of texture primitives (basic texture elements). The texture primitive theory posits that complex textures can be composed of several simple texture primitives arranged in a regular, repetitive pattern.
The use of spatial relationship features enhances the ability to describe and distinguish image content; however, these features are often sensitive to transformations such as rotation, inversion, and scale changes in images or targets. Moreover, in practical applications, relying solely on spatial information is often insufficient, as it cannot effectively or accurately represent scene information. To enable retrieval, other features, in addition to spatial relationship features, are also required. For details, see the following section.

3. Semantic Feature Retrieval

3.1. Semantic Feature Analysis

Feature extraction from building plans is challenging due to interfering elements like thin axes and walls of specific thicknesses. To mitigate the impact of these interfering lines and highlight wall features, recent research [27] employed linear downsampling on input images, resizing them to 512 × 512 pixels to approximate the receptive field size of ResNet50 (483 × 483 pixels). Additionally, two iterations of erosion with a 3 × 3 filtering kernel were applied to replace hollow walls with solid ones. To address data scarcity, three data augmentation techniques were applied to the input plan views: Gaussian noise, Gaussian blurring, and random rotation, all aimed at mitigating overfitting. Furthermore, zero to two rounds of random erosion were applied to target wall images to reduce sensitivity to pixel variations along wall edges, improving the model’s ability to learn wall edge features.
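A minimal OpenCV sketch of such a preprocessing and augmentation pipeline is shown below; the 512 × 512 resizing and the two 3 × 3 erosions follow the description above, while the noise level, blur kernel, and rotation range are illustrative assumptions rather than the values used in [27].

```python
import cv2
import numpy as np

def preprocess_plan(img):
    """Downsample to 512x512, then erode twice with a 3x3 kernel so that
    hollow (double-line) walls become solid dark regions."""
    img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR)
    kernel = np.ones((3, 3), np.uint8)
    return cv2.erode(img, kernel, iterations=2)

def augment_plan(img, rng=np.random.default_rng()):
    """Apply Gaussian noise, Gaussian blur, and a random rotation (illustrative parameters)."""
    noisy = np.clip(img.astype(np.float32) + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
    blurred = cv2.GaussianBlur(noisy, (5, 5), sigmaX=1.0)
    h, w = blurred.shape[:2]
    angle = rng.uniform(-10, 10)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(blurred, M, (w, h), borderValue=255)
```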
To enhance the semantic information of room type features, a study [34] employed combinatorial mapping and its dual approach. The method used two VGG-based feature extraction branches to predict boundaries and room types, integrating orientation-aware kernels and boundary features to enrich semantic information and generate a compact representation of both geometric and semantic details. By leveraging machine learning algorithms, this approach showed strong potential for floor plan analysis across diverse room styles, highlighting its significant generalization capabilities.

3.2. Rule Feature Extraction Retrieval

Prior to the advent of deep learning methodologies, traditional approaches to building floor plan retrieval relied on manually crafted features (e.g., match points, histograms, and eigenvalues) [35], as well as rule-based techniques. In building floor plan retrieval, fundamental graphical components (such as rooms, doors, and windows) must first be identified within the architectural layout [36]. Conventional methods typically separate textual and image data, employing handcrafted rules to identify elements within building floor plans. However, reliance on manual feature design and rule-based formulations presents several limitations. First, manual creation of features and rules requires specialized expertise but often fails to address the complexities and variations in diverse building layouts, resulting in error-prone outcomes. Second, this approach lacks universality and fails to adapt to different architectural styles, requiring extensive customization for each scenario.
As early as 1996, a study [37] introduced a prototype system for the automated recognition and interpretation of hand-drawn architectural floor plans. The system employed pattern recognition techniques to analyze processed images, extract architectural components (e.g., rooms, walls, doors, and windows), and organize them into a structured format. In 2011, another study [38] used Speeded Up Robust Features (SURFs), including edge extraction and boundary detection, to extract localized information from architectural floor plans. The extracted structural data were then used for analysis and retrieval tasks. In content-based image retrieval (CBIR), symbolic localization enabled document retrieval based on a query image, allowing for the identification of the query image’s specific location. Additionally, a graph-based retrieval method, known as the “room connectivity graph”, was proposed in [39]. This method extracted room connectivity graphs from the polygonal representations of buildings, capturing their topological relationships. The similarity between the query graph and the building model in the database was determined by evaluating the subgraph isomorphism between the query graph and the room connectivity graph. This approach, effective for smaller graphs, showed its viability for specific retrieval challenges. Another study [40] introduced a retrieval technique based on visual cues, quantifying spatial and line features using a Runlength Histogram (RH). The similarity between two building floor plans was computed using the $\chi^2$ distance, with the retrieved similarity verified through subjective observer evaluation.
$\chi^{2}(P, Q) = \frac{1}{2} \sum_{i} \frac{(P_i - Q_i)^2}{P_i + Q_i}$
In Equation (10), $P$ and $Q$ represent the signatures of distinct planar graphs. With the rise and advancement of machine learning algorithms, the inherent constraints imposed by conventional heuristics reliant on specific styles are effectively bypassed. A study published in [41] introduces a methodology grounded in statistical plane segmentation and structural pattern recognition for the analysis and interpretation of floor plans. This approach embraces a bottom–up two-step recognition procedure, wherein textual content is segregated from images through text–image segmentation during data preprocessing, before the recognition phase.
The initial step of the recognition process involves identifying fundamental building region blocks, such as walls, doors, and windows, at the pixel level using statistical plane-based segmentation. Subsequently, during the transformation of the pixel image into vector space, the wall entities are isolated and amalgamated with the doors and windows, and the rooms are identified by locating enclosed areas in the solid plane. Figure 3 illustrates this recognition process on a detailed floor plan.
According to the China Civil Building Design Terminology Standard [42] and the Industrial Foundation Class (IFC) standard, building plans are generally categorized into two primary components: structural components and functional components. Structural components mainly include load-bearing walls, non-load-bearing walls, columns, doors, windows, railings, and stairs. In contrast, functional components, which do not directly affect the load-bearing structure but contribute to spatial partitioning and aesthetic design, include furniture elements such as tables, beds, cabinets, and other furnishings [43]. During searches for building floor plans, these component types can be analyzed based on their characteristics and functions to improve search efficiency and accuracy.
In the field of building structural components, a study [44] introduced a lightweight and fully automated processing technique for analyzing building floor plans. Initially, the method applies regional segmentation using the mean integral projection function (IPF) to identify wall-containing regions. Subsequently, critical information, such as wall locations, is extracted using a sparse point pixel vectorization algorithm based on non-detailed data. Finally, a linear discriminative analysis algorithm, utilizing QR decomposition and generalized singular value decomposition, is employed to identify building components such as doors, windows, and wall openings. Another study [38] proposed an algorithm specifically designed to recognize and extract structural components, including walls, doors, and windows. The approach eliminates the exterior wall through successive erosion and expansion operations, extracts wall contours from connected components, applies polygon approximation to each contour to determine wall edges, and uses symbol recognition techniques such as SURF to detect doors and windows.
Structural components such as walls, doors, and windows exhibit non-distinct features and varied shapes in plan views, leading to strong abstraction and weak regularity, which can reduce direct recognition efficiency. Therefore, initial recognition of functional components in building floor plans, such as tables, beds, and chairs, becomes essential. A study [20] introduces the Faster R-CNN model, which leverages three deep learning networks to recognize, locate, and remove functional components from building floor plans. Another study [25] proposes a deep multi-neural network incorporating spatial context modules and room boundary-guided attention mechanisms to improve the recognition performance of various floor plan elements, including walls, rooms, doors, and windows.
These methodologies represent deep learning-based approaches for recognizing architectural floor plan elements. After recognizing elements, retrieving building floor plan components requires matching based on geometric features such as shape, location, and others. A study [45] proposes a sketch-based system (a.Scatch system) that extracts semantic structures from past projects. Initially, information such as walls, symbols, and text is segmented, yielding thick, medium, and thin line images through erosion and expansion operations. Thick lines delineate building boundaries, medium lines represent internal structures, and thin lines indicate architectural elements such as doors, windows, and furniture. Next, structural information is extracted and analyzed semantically using Speeded Up Robust Features (SURF). The extracted structures are then compared using graph-matching techniques to retrieve the most similar results. Another approach [46] revolves around PU learning, efficiently analyzing planar graphs to recognize various structural element styles with minimal user interaction. This method involves extracting regions of interest (RoIs) from the image, filtering RoIs similar to the query based on IoU thresholds, extracting features using Haar-based kernels on the remaining RoIs, and ultimately retrieving similar RoIs through PU learning.

4. Texture Feature Retrieval

4.1. Gabor Wavelet Transform

Texture features in an image are inherently linked to the surface structure of the depicted object. Image texture reflects local structural characteristics, primarily manifested as variations in pixel grayscale or color within a defined neighborhood. In the early 1970s, Haralick et al. introduced the co-occurrence matrix representation for characterizing texture features [47]. A co-occurrence matrix is first constructed based on pixel direction and distance, and relevant statistics are then extracted to serve as a texture representation. However, the texture properties derived from the co-occurrence matrix often fail to align with visual similarity. As a result, there is a need to explore feature extraction methods that better align with human visual perception.
In the early 1990s, following the introduction and theoretical foundation of the wavelet transform, numerous researchers began investigating its application in texture representation. Empirical evidence has shown the effectiveness of these studies. Manjunath and Ma [48] showed that using the Gabor wavelet transform for feature extraction in content-based image retrieval is more effective than employing the pyramid wavelet transform (PWT) or the tree wavelet transform (TWT). The Gabor filter consists of a collection of wavelets, each capturing energy at a specific frequency and orientation. According to this framework, the signal is expanded to provide a local frequency description, effectively capturing its local characteristics and energy. Texture features can then be extracted from this collection of energy distributions. The variability in scale (frequency) and direction of the Gabor filter makes it particularly useful for texture analysis.
For an image defined with size P × Q , its high-frequency Gabor wavelet transform is represented as
$G_{mn}(x, y) = \sum_{s} \sum_{t} I(x - s, y - t)\, \Psi_{mn}^{*}(s, t)$
where $s$ and $t$ are the variables of the wavelet filter size, and $\Psi_{mn}^{*}$ is the complex conjugate of $\Psi_{mn}$. $\Psi_{mn}$ is derived from the wavelet function in Equation (12), which generates a set of self-similar functions, defined as
$\Psi(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left[ -\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right] \cdot \exp(j 2\pi W x)$
Here, $W$ is referred to as the modulation frequency. The family of Gabor wavelets is then derived from this function via Equation (13):
$\Psi_{mn}(x, y) = a^{-m}\, \Psi(\tilde{x}, \tilde{y})$
Here, $m$ and $n$ represent the scale and orientation of the wavelets, respectively, where $m = 0, 1, \ldots, M-1$ and $n = 0, 1, \ldots, N-1$:
$\tilde{x} = a^{-m} (x \cos\theta + y \sin\theta)$
$\tilde{y} = a^{-m} (-x \sin\theta + y \cos\theta)$
In Equations (14) and (15), $a > 1$ and $\theta = n\pi / N$. The variables in Equations (12)–(15) are defined as follows:
$a = \left( U_h / U_l \right)^{\frac{1}{M-1}}, \quad W_{m,n} = a^{m} U_l, \quad \sigma_{y,m,n} = \frac{1}{2\pi \tan\left( \frac{\pi}{2N} \right) \sqrt{ \frac{U_h^2}{2 \ln 2} - \left( \frac{1}{2\pi \sigma_{x,m,n}} \right)^2 }}$

4.2. Texture Spectrum

After performing multi-scale and multi-directional filtering on an image, a multidimensional array is obtained:
$E(m, n) = \sum_{x} \sum_{y} \left| G_{mn}(x, y) \right|, \quad m = 0, 1, \ldots, M-1, \; n = 0, 1, \ldots, N-1$
These quantities represent the energy of the image at different scales and orientations. The main purpose of texture-based image retrieval is to find images, or regions within images, with similar textures. Assuming that an image or a region has a homogeneous texture, the mean value $\mu_{mn}$ and standard deviation $\sigma_{mn}$ of the transform coefficients can be used to represent the texture characteristics of the region:
$\mu_{mn} = \frac{E(m, n)}{P \times Q}, \qquad \sigma_{mn} = \sqrt{ \frac{ \sum_{x} \sum_{y} \left( \left| G_{mn}(x, y) \right| - \mu_{mn} \right)^2 }{P \times Q} }$
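The sketch below computes such a Gabor texture vector with scikit-image; the number of scales and orientations and the frequency range ($U_l$, $U_h$) are illustrative choices, not values prescribed by the surveyed methods.

```python
import numpy as np
from skimage.filters import gabor

def gabor_texture_features(image, n_scales=4, n_orientations=6, freq_low=0.05, freq_high=0.4):
    """Compute the (mean, std) texture vector of Eq. (18) from a Gabor filter bank.

    Scales are spaced between freq_low and freq_high (playing the roles of U_l and U_h);
    the counts and frequency range here are illustrative assumptions.
    """
    freqs = np.geomspace(freq_low, freq_high, n_scales)   # self-similar scale spacing
    features = []
    for f in freqs:
        for n in range(n_orientations):
            theta = n * np.pi / n_orientations
            real, imag = gabor(image, frequency=f, theta=theta)
            magnitude = np.hypot(real, imag)               # |G_mn(x, y)|
            features.extend([magnitude.mean(), magnitude.std()])  # mu_mn, sigma_mn
    return np.asarray(features)
```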
He et al. proposed the concept of the texture spectrum in the early 1990s [49]. Compared with earlier methods, the texture spectrum is conceptually clearer and requires less computation, and it has received increasing attention in recent years. The following describes the basic texture spectrum model and some improved texture spectrum models in detail. For a 3 × 3 neighborhood in an image, as shown in Table 1, the nine pixels in the neighborhood are recorded as $V = \{V_0, V_1, V_2, V_3, V_4, V_5, V_6, V_7, V_8\}$, where $V_0$ represents the pixel value at the center of the neighborhood. At the same time, a texture unit $TU$ containing eight elements is defined, $TU = \{E_1, E_2, E_3, E_4, E_5, E_6, E_7, E_8\}$, where the value of $E_i$ is
$E_i = \begin{cases} 0, & V_i \le V_0 - C \\ 1, & V_0 - C < V_i < V_0 + C \\ 2, & V_i \ge V_0 + C \end{cases}$
where $i = 1, 2, \ldots, 8$; $C$ represents a positive constant; and $E_i$ corresponds to pixel $i$. Each element of $TU$ can take three possible values, so the eight elements yield $3^8 = 6561$ possible combinations. The texture unit number can then be written as follows:
$N_{TU} = \sum_{i=1}^{8} E_i \times 3^{\,i-1}, \qquad N_{TU} \in \{ 0, 1, 2, \ldots, 6560 \}$
An image texture unit characterizes the textural features of a given pixel, specifically the relative grayscale relationship between the central pixel and its neighboring pixels. The occurrence frequency of all texture units in the image is quantified, and this frequency function delineates the image’s texture information. This function, called the texture spectrum, encapsulates the distribution of all texture units. A texture spectrum histogram can be created, with the horizontal axis representing the number of texture units (NTUs) and the vertical axis indicating their occurrence frequency. Typically, an image consists of two components: texture primitives and random noise or background. The greater the proportion of texture components relative to the background, the more easily the texture features can be perceived by the human eye. With respect to the texture spectrum, a higher percentage of texture components in an image results in a distinct peak distribution. Additionally, different textures are characterized by unique texture units and corresponding texture spectra. Thus, the texture information of an image is effectively represented by the texture spectrum.
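A direct, unoptimized sketch of the texture spectrum computation is given below; the neighbor ordering and the default tolerance $C$ are arbitrary conventions assumed for illustration.

```python
import numpy as np

def texture_spectrum(gray, C=1):
    """Compute the 6561-bin texture spectrum of Eqs. (19)-(20) for a grayscale image.

    For every 3x3 neighborhood, each of the 8 neighbors is mapped to 0/1/2 by
    comparison with the center pixel (tolerance C), and the base-3 code N_TU is
    accumulated into a histogram.
    """
    gray = gray.astype(np.int32)
    h, w = gray.shape
    spectrum = np.zeros(3 ** 8, dtype=np.int64)
    # Offsets of the 8 neighbors (the ordering is a fixed convention)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    powers = 3 ** np.arange(8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v0 = gray[y, x]
            ntu = 0
            for i, (dy, dx) in enumerate(offsets):
                vi = gray[y + dy, x + dx]
                if vi <= v0 - C:
                    e = 0            # darker than the center (Eq. 19, case 0)
                elif vi >= v0 + C:
                    e = 2            # brighter than the center (case 2)
                else:
                    e = 1            # within the tolerance band (case 1)
                ntu += e * powers[i]
            spectrum[ntu] += 1
    return spectrum
```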

5. Spatial Feature Retrieval

5.1. Topological and Geometric Feature

5.1.1. Topological Feature Retrieval

Conventional methods for locating room layouts primarily focus on matching room functions and adjacency, often overlooking the internal shapes of rooms. Inspired by shape grammar principles, Lee et al. introduced a novel approach that transforms building floor plans into a tree-like structure to address retrieval challenges [50]. This method integrates internal layout considerations into similarity metrics and represents room layout structures through a hierarchical tree framework. Floor plan retrieval is then executed using the tree edit distance as a metric for query similarity. Tree structure matching is computationally efficient, preserving high-level feature information through parent–child relationships. This retrieval method, based on matching the shapes of structural hierarchies, eliminates the need to account for variations in room functions, achieving an experimental efficacy of 53%. The tree edit distance [27] serves as the key similarity metric between the query and the candidate parse tree. To compute the tree edit distance effectively, costs must be allocated to the editing operations (relabeling, adding, deleting), and the tree must be ordered to achieve polynomial complexity [31,51]. Initially, tree nodes are sorted by the number of child nodes and, in cases of equal child counts, by room size. The matching metric $C$ is computed as follows:
$C(A, B) = \left\| \mathrm{area}_A(v_A) - \mathrm{area}_B(v_B) \right\|_1$
In Equation (21), $\mathrm{area}_A$ denotes the area covered by the parameterized shape at the node, while $\| \cdot \|_1$ represents the mean absolute difference.
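The sketch below illustrates the node ordering and the per-node relabeling cost of Equation (21); the tree representation is an assumption, and the resulting cost function would be plugged into an ordered tree-edit-distance routine (e.g., a Zhang–Shasha-style algorithm), which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoomNode:
    """One node of the hierarchical room layout tree (assumed representation)."""
    label: str
    area: float
    children: List["RoomNode"] = field(default_factory=list)

def order_tree(node):
    """Sort children by child count, breaking ties by room area, so the tree is ordered."""
    node.children.sort(key=lambda c: (len(c.children), c.area))
    for child in node.children:
        order_tree(child)
    return node

def node_cost(a, b):
    """Eq. (21): absolute difference of the areas covered at the two nodes,
    used as the relabeling cost in the tree edit distance."""
    return abs(a.area - b.area)
```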

5.1.2. Geometric Feature Retrieval

The shape-based retrieval method offers several advantages over feature extraction methods. It effectively handles transformations like translation, rotation, and scaling, improving retrieval performance for complex building plan shapes. However, this method has limitations: it can mismatch building plans with similar shapes but different scales, and it has difficulty extracting shape features from non-geometric building plans. To address these issues, Rasika et al. proposed a multi-directional building plan retrieval approach using FR-CNN and geometric features [52]. This method accounts for rotation during scanning and is invariant to scale changes. The process involves detecting door endpoints on walls. After removing the door from the image, a blank area remains on the outer wall of the floor plan image $I_F$. Next, the endpoints of the door are connected with a straight line to ensure continuity. Finally, the shape of $I_F$ is obtained by tracing the outer wall. Geometric features such as area, corners, axes, distances, slopes, and angles are computed. These geometric features are interdependent, and polygonal approximation is used to represent the extracted profile.
The matching phase includes matching based on appearance features ($F_1$) and matching based on internal object features ($F_2$). For matching based on profile features, the query image ($F_q$) and the dataset image ($F_i$) are matched using Equation (22), and the average matching cost is calculated using Equation (23), where $DS$ denotes the size of the dataset, $j \in \{1, 2\}$, $F_j$ denotes the $j$th feature, and $n_j$ denotes its corresponding count. Let $F_{jk}$ denote the $k$th component of the $j$th feature. For the matching cost based on the shape features $\varphi$, when $F_j = F_1$, we have $n_j = n_1$. If $j = 1$, then $F_{jk}$ means that the feature is profile-based, and $k$ denotes the index of the feature component.
$\psi\left(F_{jk}^{i}\right) = \left| F_{jk}^{i} - F_{jk}^{q} \right|, \quad k = 1, \ldots, n_j, \; i = 1, \ldots, DS$
$\mathrm{Avg}\left(F_j^i\right) = \frac{1}{n_j} \sum_{k=1}^{n_j} \psi\left(F_{jk}^i\right), \quad i = 1, \ldots, DS$
The average feature matching cost (Avg) is calculated using Equation (23) and sorted in ascending order using Equation (24) to prioritize maximum similarity.
$\varphi_j^i = \mathrm{sort}_{asc}\left( \mathrm{Avg}\left(F_j^i\right) \right), \quad i = 1, \ldots, DS$
For internal object-based feature matching, the sorted results of external shape feature matching up to the last image associated with the query are considered when calculating the matching cost, rather than the entire dataset. In this case, when $F_j = F_2$, we have $n_j = n_2$, and the matching cost is again calculated using Equation (22). Further, the average matching cost for internal object features is calculated using Equation (23). Finally, the retrieval results are sorted in ascending order using Equation (24) to obtain a final ranked list for the query image, which includes the images up to the last relevant image of the query image.
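A compact NumPy sketch of Equations (22)–(24) is given below; the feature layout (one row of components per dataset image) is an assumption made for illustration.

```python
import numpy as np

def feature_costs(query_feat, dataset_feats):
    """Eqs. (22)-(23): per-image average matching cost for one feature type.

    query_feat: 1-D array of the query's n_j feature components.
    dataset_feats: (DS, n_j) array with the same components for every dataset image.
    """
    costs = np.abs(dataset_feats - query_feat)      # psi(F_jk^i), Eq. (22)
    return costs.mean(axis=1)                       # Avg(F_j^i), Eq. (23)

def rank_by_similarity(avg_costs):
    """Eq. (24): ascending sort so the most similar floor plans come first."""
    order = np.argsort(avg_costs)
    return order, avg_costs[order]
```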

5.2. Multidimensional and Shape Feature Retrieval

5.2.1. Multidimensional Feature Retrieval

To expand content-based image retrieval technology to accommodate a large-scale image library, the adoption of effective multidimensional indexing techniques is essential. The challenges associated with this expansion can be categorized into two main aspects:
  • High dimensionality: Typically, the dimension of the image feature vector is on the order of $10^2$.
  • Non-Euclidean similarity measurement: The Euclidean distance metric often fails to adequately represent all human perceptions of visual content; consequently, alternative similarity measurement methods, such as histogram intersection, cosine similarity, and correlation, must be employed.
To address these challenges, a viable approach involves initially applying dimensionality reduction techniques to minimize the feature vector’s dimensionality, followed by the utilization of suitable multidimensional indexing methods.
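One concrete instance of this reduce-then-index strategy is sketched below, using PCA for dimensionality reduction and a k-d tree as the multidimensional index; the library choices and parameter values are illustrative assumptions, not a prescription from the surveyed literature.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

def build_index(features, n_components=32):
    """Reduce high-dimensional image features with PCA, then index them with a k-d tree."""
    pca = PCA(n_components=n_components)   # n_components is an illustrative choice
    reduced = pca.fit_transform(features)
    return pca, cKDTree(reduced)

def query_index(pca, tree, query_feature, k=10):
    """Return the indices and distances of the k nearest neighbors of a query vector."""
    q = pca.transform(query_feature.reshape(1, -1))
    dists, idx = tree.query(q, k=k)
    return idx[0], dists[0]
```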
Three primary research domains have significantly advanced the development of multidimensional indexing technology: computational geometry, database management systems, and pattern recognition. Prominent multidimensional indexing technologies currently in use include the Bucketing grouping algorithm, k-d trees, priority k-d trees, quadtrees, K-D-B trees, HB trees, R-trees, and their variants, such as the R+ tree and R* tree. Additionally, clustering and neural network techniques, which are widely employed in pattern recognition, also offer viable indexing solutions. The origins of multidimensional indexing technology can be traced to the mid-1970s, when foundational techniques such as the Cell algorithm, quadtrees, and k-d trees were first introduced, albeit with limited initial effectiveness. The growing demand for spatial indexing in Geographic Information Systems (GISs) and Computer-Aided Design (CAD) systems culminated in Guttman’s proposal of the R-tree index structure in 1984 [53]. This innovation spurred the development of several R-tree variants, including Sellis et al.’s R+ tree [54] and Greene’s R-tree variant [55]. In 1990, Beckman and Kriegel introduced the R* tree, which is widely recognized as the most efficient dynamic R-tree variant [56]. Nevertheless, even the R* tree faces challenges in effectively managing dimensionalities exceeding 20.
A comprehensive review and comparison of various indexing algorithms are provided in references [57]. Among these, White and Jain focus on the development of general and domain-specific indexing algorithms. Inspired by k-d trees and R-trees, they proposed VAM k-d trees and VAMSplit R-trees. Experimental results demonstrate that VAMSplit R-trees exhibit superior algorithmic efficiency, albeit at the cost of sacrificing the dynamic characteristics inherent to R-trees. In contrast, Ng and Sedighian [58] introduced a three-stage retrieval technique for image retrieval, comprising dimensionality reduction, evaluation and selection of existing indexing algorithms, and optimization of the chosen indexing algorithm. Since nearly all tree-structured indexing techniques are primarily designed for traditional database queries (e.g., point queries and range queries) rather than image retrieval, there is an urgent need to explore novel indexing structures tailored to the specific requirements of image retrieval. Tagare addressed this challenge in [59] by proposing a method for tree structure adjustment, which improves tree efficiency by eliminating nodes that hinder the effectiveness of similarity queries.

5.2.2. Fourier Shape Descriptor

Shape features can generally be represented in two ways: as contour features or regional features. Contour features focus solely on the outer boundary of an object, while regional features encompass the entire area of the shape. The most common methods for these two types of shape features are Fourier descriptors and shape-independent moments.
The fundamental concept of Fourier shape descriptors is to utilize the Fourier transform of the object’s boundary for shape representation. Consider the contour of a two-dimensional object composed of a series of pixels with coordinates $(x_s, y_s)$, where $0 \le s \le N-1$ and $N$ represents the total number of pixels on the contour. From the coordinates of these boundary points, three shape representations can be derived: the curvature function, the centroid distance, and the complex coordinate function. The curvature of a point on a profile is defined as the rate of change in the profile tangent angle relative to the arc length. The curvature function $K(s)$ can be expressed as $K(s) = \frac{d}{ds}\theta(s)$, where $\theta(s)$ is the tangent angle of the contour line, defined as
$\theta(s) = \tan^{-1}\left( \frac{y'_s}{x'_s} \right), \qquad y'_s = \frac{d y_s}{d s}, \quad x'_s = \frac{d x_s}{d s}$
The centroid distance is defined as the distance from the object’s boundary point to the object’s center $(x_c, y_c)$, as follows:
$R(s) = \sqrt{ \left( x_s - x_c \right)^2 + \left( y_s - y_c \right)^2 }$
The complex coordinate function is the pixel coordinate expressed by complex numbers:
$Z(s) = \left( x_s - x_c \right) + j \left( y_s - y_c \right)$
The Fourier transform of the complex coordinate function generates a series of complex coefficients. These coefficients characterize the object’s shape in the frequency domain, with low-frequency components indicating macroscopic properties and high-frequency components detailing fine structural characteristics. A shape descriptor can be derived from these transformation parameters. To maintain rotation independence, only the magnitude information of the parameters is retained, while the phase information is omitted. Scaling independence is achieved by normalizing the magnitude of the parameter to the magnitude of the DC component (or the first non-zero parameter).
For the curvature function and the centroid distance function, we focus solely on the positive frequency axis because the Fourier transform of these real-valued functions is conjugate-symmetric; specifically, $F_{-i} = F_i^{*}$, so $|F_{-i}| = |F_i|$. The shape descriptor derived from the curvature function is formulated as follows:
$f_K = \left[ \left| F_1 \right|, \left| F_2 \right|, \ldots, \left| F_{M/2} \right| \right]$
where $F_i$ represents the $i$-th component of the Fourier transform parameters. Similarly, the shape descriptor derived from the centroid distance function is
$f_R = \left[ \frac{\left| F_1 \right|}{\left| F_0 \right|}, \frac{\left| F_2 \right|}{\left| F_0 \right|}, \ldots, \frac{\left| F_{M/2} \right|}{\left| F_0 \right|} \right]$
In the complex coordinate function, both positive- and negative-frequency components are utilized. The DC component is omitted as it pertains to the position of the shape. Consequently, the first non-zero frequency component is employed to normalize the remaining transformation parameters. The shape descriptor derived from the complex coordinate function is formulated as follows:
$f_Z = \left[ \frac{\left| F_{-(M/2-1)} \right|}{\left| F_1 \right|}, \ldots, \frac{\left| F_{-1} \right|}{\left| F_1 \right|}, \frac{\left| F_2 \right|}{\left| F_1 \right|}, \ldots, \frac{\left| F_{M/2} \right|}{\left| F_1 \right|} \right]$
To ensure uniformity in the shape features of all objects in the database, the number of boundary points must be standardized to $M$ before implementing the Fourier transform. For instance, $M$ can be set to a power of two, such as $2^6 = 64$, to leverage the fast Fourier transform, thereby enhancing the algorithm’s efficiency.
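The sketch below computes the centroid distance descriptor $f_R$ of Equation (29) for a closed contour; the resampling strategy (uniform in boundary index) is a simplifying assumption.

```python
import numpy as np

def centroid_distance_descriptor(contour, M=64):
    """Fourier shape descriptor f_R (Eq. (29)) from a closed contour.

    contour: (N, 2) array of boundary points (x_s, y_s).
    The signature is resampled to M points so every object yields a descriptor
    of the same length; magnitudes are kept (rotation invariance) and normalized
    by |F_0| (scale invariance).
    """
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)
    r = np.linalg.norm(contour - centroid, axis=1)          # R(s), Eq. (26)
    # Resample the signature to M evenly spaced points along the boundary index
    s_old = np.linspace(0, 1, len(r), endpoint=False)
    s_new = np.linspace(0, 1, M, endpoint=False)
    r_resampled = np.interp(s_new, s_old, r)
    F = np.fft.fft(r_resampled)
    mags = np.abs(F)
    return mags[1:M // 2 + 1] / mags[0]                     # f_R, Eq. (29)
```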

5.2.3. Shape-Independent Moments

Moment invariants are a region-based technique for representing the shapes of objects. Assuming $R$ is an object represented by a binary image, the $(p+q)$-th-order central moment of its shape is
$\mu_{p,q} = \sum_{(x, y) \in R} \left( x - x_c \right)^p \left( y - y_c \right)^q$
where $(x_c, y_c)$ is the center of the object. To achieve scale independence, the central moment can be normalized:
$\eta_{p,q} = \frac{\mu_{p,q}}{\mu_{0,0}^{\gamma}}, \qquad \gamma = \frac{p + q + 2}{2}$
Based on these moments, Hu [60] proposed a series of seven moments that are independent of translation, rotation, and scaling:
$\phi_1 = \mu_{2,0} + \mu_{0,2}$
$\phi_2 = \left( \mu_{2,0} - \mu_{0,2} \right)^2 + 4 \mu_{1,1}^2$
$\phi_3 = \left( \mu_{3,0} - 3\mu_{1,2} \right)^2 + \left( \mu_{0,3} - 3\mu_{2,1} \right)^2$
$\phi_4 = \left( \mu_{3,0} + \mu_{1,2} \right)^2 + \left( \mu_{0,3} + \mu_{2,1} \right)^2$
$\phi_5 = \left( \mu_{3,0} - 3\mu_{1,2} \right)\left( \mu_{3,0} + \mu_{1,2} \right)\left[ \left( \mu_{3,0} + \mu_{1,2} \right)^2 - 3\left( \mu_{0,3} + \mu_{2,1} \right)^2 \right] + \left( \mu_{0,3} - 3\mu_{2,1} \right)\left( \mu_{0,3} + \mu_{2,1} \right)\left[ \left( \mu_{0,3} + \mu_{2,1} \right)^2 - 3\left( \mu_{3,0} + \mu_{1,2} \right)^2 \right]$
$\phi_6 = \left( \mu_{2,0} - \mu_{0,2} \right)\left[ \left( \mu_{3,0} + \mu_{1,2} \right)^2 - \left( \mu_{0,3} + \mu_{2,1} \right)^2 \right] + 4\mu_{1,1}\left( \mu_{3,0} + \mu_{1,2} \right)\left( \mu_{0,3} + \mu_{2,1} \right)$
$\phi_7 = \left( 3\mu_{2,1} - \mu_{0,3} \right)\left( \mu_{3,0} + \mu_{1,2} \right)\left[ \left( \mu_{3,0} + \mu_{1,2} \right)^2 - 3\left( \mu_{0,3} + \mu_{2,1} \right)^2 \right] + \left( \mu_{3,0} - 3\mu_{1,2} \right)\left( \mu_{0,3} + \mu_{2,1} \right)\left[ \left( \mu_{0,3} + \mu_{2,1} \right)^2 - 3\left( \mu_{3,0} + \mu_{1,2} \right)^2 \right]$
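A minimal sketch of this computation is shown below; it derives the normalized central moments of Equations (31)–(32) from a binary region mask and evaluates the first two invariants of Equation (33) (the remaining five follow the same pattern). OpenCV's cv2.HuMoments, applied to the output of cv2.moments, provides an equivalent off-the-shelf computation.

```python
import numpy as np

def central_moment(mask, p, q):
    """Eq. (31): (p+q)-th-order central moment of a binary region."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()
    return np.sum((xs - xc) ** p * (ys - yc) ** q)

def normalized_moment(mask, p, q):
    """Eq. (32): scale-normalized central moment."""
    gamma = (p + q + 2) / 2
    return central_moment(mask, p, q) / central_moment(mask, 0, 0) ** gamma

def hu_first_two(mask):
    """First two invariants of Eq. (33); the other five are built the same way."""
    m20 = normalized_moment(mask, 2, 0)
    m02 = normalized_moment(mask, 0, 2)
    m11 = normalized_moment(mask, 1, 1)
    phi1 = m20 + m02
    phi2 = (m20 - m02) ** 2 + 4 * m11 ** 2
    return phi1, phi2
```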
In addition to the seven invariant moments mentioned above, various methods exist for calculating shape-independent moments. In reference [61], Yang and Albregtsen introduced a method for efficiently calculating moments in binary images using Green’s theorem. Since many effective invariants are derived from repeated experiments, Kapur et al. developed algorithms to systematically identify specific geometric invariants [62]. Gross and Latecki also devised a method that preserves the qualitative differential geometry of object edges during image digitization [62]. Furthermore, reference [63] discusses a framework of algebraic curves and invariants for representing complex objects in mixed scenes. This framework employs polynomial fitting to capture local geometric information and geometric invariants for object matching and recognition.

5.2.4. Shape Features Based on Inner Angles

In reference [64], a method for expressing shape features based on internal angles was proposed. As with the Fourier descriptor, the object is first approximated as a polygon. The internal angles of a polygon are crucial for shape representation and recognition and can be expressed as $\mathrm{Intra\text{-}angle} = \{ \alpha_1, \alpha_2, \ldots, \alpha_n \}$.
Obviously, the shape description based on internal angles is independent of the shape’s position, rotation, and size, making it well suited for image retrieval systems. Below is a series of definitions for shape features derived from internal angles:
Number of vertices: The more vertices a polygon has, the more complex its shape. It is reasonable to consider two shapes with different numbers of vertices as distinctly different shapes.
Internal Angle Mean: The average value of all internal angles of a polygon reflects its shape attributes to some extent. For example, the internal angle mean of a triangle is 60 degrees, which is notably different from the internal angle mean of a rectangle, which is 90 degrees.
Internal Angle Standard Deviation: The standard deviation of the internal angles of a polygon provides insight into the variability of the angles relative to the mean.
$\delta = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^2}$
where $\bar{a}$ is the mean of the interior angles. The standard deviation $\delta$ serves as a coarse descriptor of the polygon: the more regular the polygon, the smaller $\delta$ is, so it can be used to distinguish regular polygons from irregular ones.
Intra-angle Histogram: The angle range of 0 ° to 360 ° is first divided into k intervals, which serve as the k bins of the histogram. Then, the number of interior angles within each interval is counted. The resulting intra-angle histogram reflects the overall distribution of the interior angles.
We take the interior angle $\theta = \angle abc$ in Figure 4 as an example to illustrate how interior angles are computed; Figure 4 depicts the interior angles of a planar figure. Let $p$ be the centroid of the three points $a$, $b$, and $c$; then
$\overrightarrow{op} = \dfrac{\overrightarrow{oa} + \overrightarrow{ob} + \overrightarrow{oc}}{3}$
where $o$ is the origin. If $p$ lies inside the polygon, then $\theta$ is less than $180°$; otherwise, $\theta$ is greater than $180°$. When $\theta \le 180°$,
$\theta = \arccos\dfrac{|ab|^2 + |bc|^2 - |ac|^2}{2\,|ab|\,|bc|}$
When θ > 180 ° ,
$\theta = 360° - \arccos\dfrac{|ab|^2 + |bc|^2 - |ac|^2}{2\,|ab|\,|bc|}$
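A compact sketch of this procedure is given below (assumptions: the polygon vertices are supplied in order as a 2D NumPy array, and matplotlib's Path is borrowed as a convenient point-in-polygon test; neither choice comes from reference [64]). It computes each interior angle with the law of cosines, applies the centroid test above to decide whether an angle is reflex, and bins the results into a $k$-bin intra-angle histogram.

```python
import numpy as np
from matplotlib.path import Path  # used only for a point-in-polygon test

def interior_angles(polygon):
    """Interior angles (degrees) of a simple polygon given as an (n, 2) array of ordered vertices."""
    polygon = np.asarray(polygon, dtype=float)
    n = len(polygon)
    poly_path = Path(polygon)
    angles = []
    for i in range(n):
        a, b, c = polygon[i - 1], polygon[i], polygon[(i + 1) % n]
        ab, bc, ac = np.linalg.norm(a - b), np.linalg.norm(b - c), np.linalg.norm(a - c)
        # Law of cosines for the angle at vertex b.
        cos_theta = np.clip((ab**2 + bc**2 - ac**2) / (2 * ab * bc), -1.0, 1.0)
        theta = np.degrees(np.arccos(cos_theta))
        # Centroid test: if the centroid p of a, b, c lies outside the polygon,
        # the interior angle at b is reflex (> 180 degrees).
        p = (a + b + c) / 3.0
        if not poly_path.contains_point(p):
            theta = 360.0 - theta
        angles.append(theta)
    return np.array(angles)

def intra_angle_histogram(angles, k=8):
    """k-bin histogram of interior angles over the range [0, 360) degrees."""
    hist, _ = np.histogram(angles, bins=k, range=(0.0, 360.0))
    return hist

# Hypothetical usage: an L-shaped room outline with one reflex corner.
l_shape = np.array([[0, 0], [4, 0], [4, 2], [2, 2], [2, 4], [0, 4]])
angles = interior_angles(l_shape)
print(angles, intra_angle_histogram(angles, k=8))
```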

6. Experimental and Performance Comparison of Various Algorithms

6.1. Datasets

The commonly used datasets in the field of floor plan retrieval include SESYD, FPLAN-POLY, and HOME; they are outlined as follows. Delalandre et al. introduced the public SESYD (Systems Evaluation Synthetic Documents) dataset [65], a comprehensive document database synthesized with the 3gT (generation of graphical ground truth) system. It is primarily employed in the GREC (Graphics Recognition) symbol recognition and spotting competitions and comprises 10 classes of floor plans with 100 samples per class, totaling 28,065 symbols. The CVC-FP (Computer Vision Center Floor Plan) dataset, introduced in reference [66], encompasses four subsets: the Black, Textured1, Textured2, and Parallel datasets. It consists of 122 scanned documents and captures the relationship between architectural symbols and structural elements; the detailed labeling of these symbols allows floor plan analysis systems to extract structural arrangements more effectively, thereby improving interpretive performance.
The FPLAN-POLY dataset [67] comprises 42 vector images featuring 38 symbol models with 344 instances and is used to assess the presence of 8 structural configurations in each drawing. The publicly available ROBIN dataset, introduced in reference [68] (see also [26]), is specifically tailored for building floor plan retrieval tasks. ROBIN encompasses three primary layout categories, further subdivided into 10 subcategories based on the global layout shape. Designed with potential home buyers in mind, ROBIN covers diverse room types and quantities.
FloorPlanCAD, a large-scale vector dataset proposed by Fan et al. [22], features 10,094 floor plans derived from real-world sources, including homes, shopping malls, schools, and other structures; it is divided into 6382 training plans and 3712 test plans. Similarly, the BRIDGE dataset, proposed by Goyal et al. [69], consists of over 13,000 building floor plan images accompanied by task-specific annotations such as symbols, region descriptions, and paragraph details.
The comprehensive SFPI (Synthetic Floor Plan Images) dataset, introduced by Mishra et al. [74] and also used in [21], consists of 10,000 images covering 16 furniture categories and a total of 300,000 furniture symbols. To enhance the diversity of the dataset, data augmentation techniques, including random-angle rotations, were applied. Detailed information about these datasets is provided in Table 2.
The datasets utilized in this study originate from various building types, including residential, commercial, and industrial sectors, which exhibit significant differences in both design and function. For instance, residential floor plans are typically simplistic, whereas commercial buildings may incorporate complex spatial layouts and functional zoning. Furthermore, these datasets encompass architectural designs from different historical periods, reflecting the evolution of architectural styles and technologies. For example, modern buildings tend to emphasize open spaces and sustainability, while traditional structures often prioritize symmetry and ornamentation. Datasets with a limited number of samples may not adequately represent all possible building types and design styles, thereby constraining the generalizability of the research findings. Additionally, the quality of annotations in the datasets, such as information regarding room types, sizes, and functions in the floor plans, significantly influences their representativeness. High-quality annotations enhance the practical applicability of the datasets, whereas low-quality annotations may adversely affect model training outcomes. In practical applications, it is essential to integrate datasets from diverse sources, types, and regions. Employing data augmentation techniques, such as rotation, scaling, and noise addition, can enhance the diversity of the datasets and improve the robustness of the model.
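As a hedged illustration of the augmentation strategies mentioned above (random rotation, scaling, and noise addition), the following sketch uses Pillow and NumPy; the file names, parameter ranges, and function name are placeholders rather than settings reported in the surveyed works.

```python
import numpy as np
from PIL import Image

def augment_floor_plan(img, rng):
    """Apply one random rotation, rescale, and additive Gaussian noise to a floor plan image."""
    # Random rotation (degrees); expand=True keeps the whole plan visible,
    # and the white fill matches a typical drawing background.
    angle = rng.uniform(-15, 15)
    img = img.rotate(angle, expand=True, fillcolor=255)

    # Random isotropic rescaling.
    scale = rng.uniform(0.8, 1.2)
    w, h = img.size
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # Additive Gaussian noise, clipped back to the valid pixel range.
    arr = np.asarray(img, dtype=np.float32)
    arr = arr + rng.normal(0.0, 8.0, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# Hypothetical usage with a placeholder input file.
rng = np.random.default_rng(0)
plan = Image.open("floor_plan.png").convert("L")
augmented = augment_floor_plan(plan, rng)
augmented.save("floor_plan_aug.png")
```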

6.2. Evaluation

In the domain of building plan retrieval, evaluation metrics are critical for assessing the performance of retrieval systems. Commonly used evaluation metrics include accuracy, F1 score, mean average precision (mAP), and the Matthews correlation coefficient (MCC) [75]. The MCC exhibits superior performance in addressing category imbalance issues in datasets compared to metrics such as accuracy, precision, and recall. To comprehensively assess the effectiveness and utility of retrieval methods, additional evaluation techniques have been employed in some studies. In this context, a retrieval is deemed successful if one or more of the top five retrieved images exhibit structural similarity to the query image. Retrieval performance is quantified by calculating the proportion of correctly matched images within the retrieval results. For example, if three out of the first five retrieved images are structurally similar to the query image, the precision value is 0.6, indicating that 60% of the retrieved images are correct matches.
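To make the top-five criterion concrete, the sketch below (assuming a ranked list of retrieved image IDs and a set of ground-truth matches per query; the IDs are invented for illustration) computes top-k precision and a simple per-query average precision; mAP is then the mean of the average precision over all queries.

```python
def top_k_precision(ranked_ids, relevant_ids, k=5):
    """Fraction of the first k retrieved images that are correct matches."""
    hits = sum(1 for img_id in ranked_ids[:k] if img_id in relevant_ids)
    return hits / k

def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: precision averaged at each relevant hit."""
    hits, precisions = 0, []
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical example: 3 of the top 5 results are structurally similar to the query.
ranked = ["p7", "p2", "p9", "p4", "p1", "p3"]
relevant = {"p7", "p9", "p1"}
print(top_k_precision(ranked, relevant, k=5))   # 0.6, matching the example in the text
print(average_precision(ranked, relevant))
```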
Different types of building floor plans, such as those for residential, office, and shopping mall purposes, exhibit varying degrees of structural complexity. For instance, residential floor plans are typically simpler, whereas shopping malls and office buildings tend to encompass a greater number of rooms, corridors, and intricate layouts. If the algorithm demonstrates strong performance on simpler floor plans but underperforms on more complex layouts, this indicates insufficient generalization capabilities. Complex floor plans often present additional challenges such as increased occlusions, overlapping areas, and nonlinear structures, all of which can complicate feature extraction. Therefore, this paper recommends that experiments incorporate a diverse range of building floor plan types, with a thorough analysis of the performance for each category. Furthermore, for complex floor plans, the introduction of advanced feature extraction techniques, such as neural networks, or the implementation of data augmentation strategies, may enhance performance.
The resolution of building plans can vary significantly depending on the source. Low-resolution images may result in the loss of important details, while high-resolution images can increase computational complexity. This discrepancy may hinder the accurate recognition of key features, such as walls, doors, and windows, ultimately diminishing the performance of the algorithm. Therefore, this paper recommends evaluating the algorithm’s performance across various resolutions during the experiments and analyzing the relationship between resolution and performance. Additionally, it may be beneficial to employ a multi-scale feature extraction method to effectively process both low-resolution and high-resolution images.
With respect to noise, building plans may contain various forms of interference, such as blurred images, incorrect annotations, and extraneous text or symbols. Such noise can disrupt the algorithm's feature extraction process, leading to false or missed detections. If the algorithm is sensitive to noise, its performance in practical applications may be significantly compromised. Therefore, this paper recommends incorporating a denoising step in the data preprocessing stage, utilizing image filtering techniques or deep learning denoising models.
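A minimal preprocessing sketch along these lines is shown below, assuming a grayscale floor plan image processed with SciPy; the filter sizes, the synthetic test image, and the function name are illustrative assumptions rather than a prescribed pipeline.

```python
import numpy as np
from scipy import ndimage

def denoise_floor_plan(gray):
    """Light denoising for a grayscale floor plan image (pixel values in [0, 255])."""
    gray = np.asarray(gray, dtype=np.float32)
    # A median filter removes salt-and-pepper scanning artifacts while preserving wall edges.
    cleaned = ndimage.median_filter(gray, size=3)
    # Mild Gaussian smoothing suppresses residual high-frequency noise.
    cleaned = ndimage.gaussian_filter(cleaned, sigma=0.8)
    return np.clip(cleaned, 0, 255).astype(np.uint8)

# Hypothetical usage with a noisy synthetic image.
noisy = np.full((64, 64), 255, dtype=np.uint8)
noisy[20:22, :] = 0                                            # a "wall" line
noisy[np.random.default_rng(0).random((64, 64)) < 0.02] = 0    # speckle noise
clean = denoise_floor_plan(noisy)
```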

6.3. Various Algorithms’ Performance Comparison

While precision, recall, and mAP are established performance metrics in the existing literature, this study examines system performance by evaluating feature extraction at each hierarchical level. The features extracted from the normalized Layer 1 perform strongly, attaining an mAP of 0.63. In contrast, the basic deep learning framework performs poorly when features are extracted directly from CNNs, yielding an average accuracy of only 0.43. Furthermore, methods such as the Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), run-length histograms, and OASIS yield suboptimal results in capturing the abstraction and sparsity inherent in sketches [76,77]. The comparative retrieval performance of various methods is presented in Table 3. Notably, the GCN-based method improves retrieval performance significantly over previous techniques, attaining an mAP of 0.85. Furthermore, the latest YOLOv8-L achieves an mAP of 0.90 on the public COCO dataset.
In addition, we applied several commonly used methods and compared their F1 scores and mAP on the semantic features of different categories of floor plans [22]. The experimental results are shown in Table 4.
From Table 4, it is evident that the F1 scores of the GCN-based and DeepLabv3+ methods differ significantly for some categories, and the corresponding mAP values of the detection methods are relatively small. Moreover, as the number of categories increases, both the F1 and mAP values tend to decrease. This suggests that the characteristics of the categories significantly influence retrieval performance.

7. Conclusions

In the field of architectural floor plan feature retrieval, the integration of semantic, spatial, texture, and shape features has significantly improved the accuracy and efficiency of retrieval systems. Semantic features enable a deeper understanding of the functional aspects of floor plan elements, whereas spatial features clarify the relationships between different building layouts. Texture features improve the visual consistency of floor plans, while shape features aid in the accurate identification of various architectural styles. This paper thoroughly examines these feature retrieval methods and introduces advanced fusion techniques to support future research, with the goal of achieving greater automation and intelligence in retrieval systems. Ultimately, the findings are expected to drive advancements in architectural floor plan data processing technology and promote its practical application.
We believe that the following are potential directions for floor plan retrieval in the future:
  • Diverse graphic retrieval and cross-application, encompassing images, text, audio, video, and other modalities, are becoming increasingly relevant.
    Future building design tools are expected to prioritize user experience by integrating various data sources for federated search, thereby providing more comprehensive and accurate results. By incorporating personalized needs and collaborative design into the process, user satisfaction and efficiency can be significantly improved. Future research should explore multimodal data representation, processing, and retrieval, combining floor plans with other modalities such as text, 3D models, and satellite images to achieve a holistic understanding of building design. Additionally, integrating data from building design software (e.g., Revit, AutoCAD) with parametric design tools (e.g., Grasshopper) will create a unified data platform that enables real-time updates and sharing of design information. Constructing a homogeneous graph network model of building structures, where nodes represent components (such as beams, columns, and walls) and edges illustrate the relationships between them, will enhance structural analysis. In the domains of smart city planning, architectural design, and virtual reality (VR), BIM and VR-enhanced visualization and interaction capabilities enable improved design coordination, reducing problem-solving time by 15% and enhancing stakeholder engagement and satisfaction by 20%. Rule-based model checking and generative design methods are employed to ensure compliance with urban development and sustainability standards, delivering environmentally sustainable and cost-effective design solutions. Multidimensional feature retrieval of floor plans will become increasingly important to achieve accurate design, simulation, and optimization. Creating immersive models will enable stakeholders to intuitively understand spatial and functional layouts during the design phase. This study showed that the use of BIM and VR significantly enhanced communication, collaboration, and decision-making among project stakeholders. A detailed taxonomy of AR interactions was combined with a frequency analysis, where navigation interactions were the most common, accounting for 62% of all interactions, followed by preparation interactions at 22%, annotation interactions at 10%, and recording interactions at 6%. Transitions between different design artifacts are critical to solving design problems, with 47% of transitions occurring in the transition from 2D digital information (PDF) to other artifacts, 30% in 3D digital information, and 23% in physical drawings [80]. These findings indicate that effective management of digital transformation is critical to optimizing design workflows.
  • Efficient feature extraction, indexing, and personalized adaptive retrieval are critical components.
    As data volumes grow, the efficient indexing and retrieval of multidimensional features have become indispensable. Future research could prioritize the development of efficient and lightweight deep learning models capable of real-time multidimensional feature extraction on mobile and embedded systems. By leveraging user interaction data, future floor plan retrieval systems can adaptively adjust feature weights and optimization strategies to deliver personalized retrieval outcomes. Even with minimal user feedback, the retrieval model can be iteratively refined, significantly improving system performance.
  • Interpretability of multidimensional data and data privacy security are critical considerations.
    As deep learning models grow in complexity, ensuring interpretability has become increasingly important. Future research could focus on developing interpretable models that elucidate the relationship between multidimensional features and retrieval results, thereby enabling users to comprehend the system’s decision-making process. Moreover, with growing concerns over data privacy, it is imperative to ensure efficient floor plan retrieval while safeguarding user privacy. Technologies such as differential privacy and federated learning must be incorporated into retrieval systems. Furthermore, protecting sensitive architectural and design data from unauthorized access will become a critical research focus, especially in the context of military and government building designs.

Author Contributions

H.L. is responsible for the writing and conception of the entire article, G.L. is responsible for the verification and checking of the entire article, N.Z. is responsible for the guidance and review of the article, and X.J. is responsible for the experiments and collection of data sets. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Technology Planning Project of Shanghai, grant number 23010501800.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

For details, please refer to the files in the URL https://github.com/BYFCJX/download-floorplan-image-datasets (accessed on 16 March 2025).

Acknowledgments

This work was supported by the Technology Planning Project of Shanghai.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Kemble Stokes, H. An examination of the productivity decline in the construction industry. Rev. Econ. Stat. 1981, 63, 495. [Google Scholar]
  2. Zhengda, L.; Wang, T.; Guo, J.; Meng, W.; Xiao, J.; Zhang, W.; Zhang, X. Data-driven floor plan understanding in rural residential buildings via deep recognition. Inf. Sci. 2021, 567, 58–74. [Google Scholar] [CrossRef]
  3. Pizarro, P.N.; Hitschfeld, N.; Sipiran, I.; Saavedra, J.M. Automatic floor plan analysis and recognition. Autom. Constr. 2022, 140, 104348. [Google Scholar] [CrossRef]
  4. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Proceedings, Part III; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  6. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Proceedings, Part VII; Springer International Publishing: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar] [CrossRef]
  7. Barreiro, A.C.; Trzeciakiewicz, M.; Hilsmann, A.; Eisert, P. Automatic reconstruction of semantic 3D models from 2D floor plans. In Proceedings of the 2023 18th International Conference on Machine Vision and Applications, Hamamatsu, Japan, 23–25 July 2023; pp. 1–5. [Google Scholar]
  8. Park, S.; Hyeoncheol, K. 3DPlanNet: Generating 3d models from 2d floor plan images using ensemble methods. Electronics 2021, 10, 2729. [Google Scholar] [CrossRef]
  9. Zhu, J.; Zhang, H.; Wen, Y.M. A new reconstruction method for 3D buildings from 2D vector floor plan. Comput.-Aided Des. Appl. 2014, 11, 704–714. [Google Scholar] [CrossRef]
  10. Wang, L.Y.; Gunho, S. An integrated framework for reconstructing full 3d building models. In Advances in 3D Geo-Information Sciences; Springer: Berlin/Heidelberg, Germany, 2011; pp. 261–274. [Google Scholar]
11. Chen, L.; Wu, J.Y.; Yasutaka, F. Floornet: A unified framework for floorplan reconstruction from 3d scans. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 201–217. [Google Scholar]
  12. Taro, N.; Toshihiko, Y. A preliminary study on attractiveness analysis of real estate floor plans. In Proceedings of the 2019 IEEE 8th Global Conference on Consumer Electronics, Osaka, Japan, 15–18 October 2019; pp. 445–446. [Google Scholar]
13. Kirill, S.; Nicolas, P. Integrating floor plans into hedonic models for rent price appraisal. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021. [Google Scholar]
14. Gerstweiler, G.; Furlan, L.; Timofeev, M.; Kaufmann, H. Extraction of structural and semantic data from 2D floor plans for interactive and immersive VR real estate exploration. Technologies 2018, 6, 101. [Google Scholar] [CrossRef]
  15. Derevyanko, N.; Zalevska, O. Comparative analysis of neural networks Midjourney, Stable Diffusion, and DALL-E and ways of their implementation in the educational process of students of design specialities. Sci. Bull. Mukachevo State Univ. Ser. Pedagogy Psychol. 2023, 9, 36–44. [Google Scholar] [CrossRef]
  16. Li, Y.; Chen, H.; Yu, P.; Yang, L. A review of artificial intelligence in enhancing architectural design efficiency. Appl. Sci. 2025, 15, 1476. [Google Scholar] [CrossRef]
  17. Sharma, D.; Gupta, N.; Chattopadhyay, C.; Mehta, S. A novel feature transform framework using deep neural network for multimodal floor plan retrieval. Int. J. Doc. Anal. Recognit. 2019, 22, 417–429. [Google Scholar] [CrossRef]
  18. Yamasaki, T.; Zhang, J.; Takada, Y. Apartment structure estimation using fully convolutional networks and graph model. In Proceedings of the 2018 ACM Workshop on Multimedia for Real Estate Tech, Yokohama, Japan, 11 June 2018; pp. 1–6. [Google Scholar]
19. Kim, H.; Seongyong, K.; Kiyun, Y. Automatic extraction of indoor spatial information from floor plan image: A patch-based deep learning methodology application on large-scale complex buildings. ISPRS Int. J. Geo-Inf. 2021, 10, 828. [Google Scholar] [CrossRef]
20. Ma, K.; Cheng, Y.; Ge, W.; Zhao, Y.; Qi, Z. Method of automatic recognition of functional parts in architectural layout plan using Faster R-CNN. J. Surv. Plan. Sci. Technol. 2019, 36, 311–317. [Google Scholar]
  21. Shehzadi, T.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Mask-Aware semi-supervised object detection in floor plans. Appl. Sci. 2022, 12, 9398. [Google Scholar] [CrossRef]
  22. Fan, Z.; Zhu, L.; Li, H.; Chen, X.; Zhu, S.; Tan, P. Floorplancad: A large-scale cad drawing dataset for panoptic symbol spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10128–10137. [Google Scholar]
  23. Wang, T.; Meng, W.L.; Lu, Z.D.; Guo, J.W.; Xiao, J.; Zhang, X.P. RC-Net: Row and column network with text feature for parsing floor plan images. J. Comput. Sci. Technol. 2023, 38, 526–539. [Google Scholar] [CrossRef]
  24. Simonsen, C.P.; Thiesson, F.M.; Philipsen, M.P.; Moeslund, T.B. Generalizing floor plans using graph neural networks. In Proceedings of the 2021 IEEE International Conference on Image Processing, Anchorage, AK, USA, 19–22 September 2021; pp. 654–658. [Google Scholar]
25. Zeng, Z.; Li, X.; Yu, Y.K.; Fu, C.W. Deep floor plan recognition using a multi-task network with room-boundary-guided attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9096–9104. [Google Scholar]
  26. Lv, X.; Zhao, S.; Yu, X.; Zhao, B. Residential floor plan recognition and reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16717–16726. [Google Scholar]
  27. Gao, M.; Zhang, H.-h.; Zhang, T.-r.; Zhang, X.-m. Deep learning based pixel-level public architectural floor plan space recognition. J. Graph. 2022, 43, 189. [Google Scholar]
  28. Ouahbi, M.I. A Hybrid UNet-GNN Architecture for Enhanced Medical Image Segmentation. Ph.D. Thesis, Kasdi Merbah University Ouargla Algeria, Ouargla, Algeria, 2024. [Google Scholar]
  29. Takada, Y.; Inoue, N.; Yamasaki, T.; Aizawa, K. Similar floor plan retrieval featuring multi-task learning of layout type classification and room presence prediction. In Proceedings of the 2018 IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, 12–14 January 2018; pp. 1–6. [Google Scholar]
  30. Karen, S.; Andrew, Z. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  31. Cao, Y.Q.; Jiang, T.; Thomas, G. A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics 2008, 24, i366–i374. [Google Scholar] [CrossRef]
  32. Sharma, D.; Chattopadhyay, C.; Harit, G. A unified framework for semantic matching of architectural floorplans. In Proceedings of the 2016 23rd International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016. [Google Scholar]
  33. Divya, S.; Chiranjoy, C. High-level feature aggregation for fine-grained architectural floor plan retrieval. IET Comput. Vis. 2018, 12, 702–709. [Google Scholar]
  34. Yang, L.P.; Michael, W. Generation of navigation graphs for indoor space. Int. J. Geogr. Inf. Sci. 2015, 29, 1737–1756. [Google Scholar] [CrossRef]
35. Sardey, M.P.; Gajanan, K. A Comparative Analysis of Retrieval Techniques in Content-Based Image Retrieval. arXiv 2015, arXiv:1508.06728. [Google Scholar]
36. Liu, C.; Wu, J.; Kohli, P.; Furukawa, Y. Raster-to-Vector: Revisiting floorplan transformation. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2214–2222. [Google Scholar]
  37. Aoki, Y.; Shio, A.; Arai, H.; Odaka, K. A prototype system for interpreting hand-sketched floor plans. In Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, 25–29 August 1996; pp. 747–751. [Google Scholar]
38. Ahmed, S.; Liwicki, M.; Weber, M.; Dengel, A. Improved automatic analysis of architectural floor plans. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 864–869. [Google Scholar]
  39. Wessel, R.; Ina, B.; Klein, R. The Room Connectivity Graph: Shape Retrieval in the Architectural Domain; Václav Skala-UNION Agency: Plzen, Czech Republic, 2008; Volume 2. [Google Scholar]
40. de las Heras, L.P.; Fernández, D.; Fornés, A.; Valveny, E.; Sánchez, G.; Lladós, J. Run-length histogram image signature for perceptual retrieval of architectural floor plans. In IAPR International Workshop on Graphics Recognition; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  41. De Las Heras, L.P.; Ahmed, S.; Liwicki, M.; Valveny, E.; Sánchez, G. Statistical segmentation and structural recognition for floor plan interpretation. Int. J. Doc. Anal. Recognit. 2014, 17, 221–237. [Google Scholar] [CrossRef]
  42. GB 50352-2005; Ministry of Housing and Urban-Rural Development of the People’s Republic of China. General Principles for Civil Engineering Design. China Publishing Group: Beijing, China, 2005; pp. 101–108.
  43. Yang, L. Research and Implementation of Building Image Retrieval Based on Deep Learning; Xi’an University of Architecture and Technology: Xi’an, China, 2022. [Google Scholar]
  44. Zhang, H.X.; Li, Y.S.; Song, C. Block vectorization of interior layout plans and high-efficiency 3D building modeling. Comput. Sci. Explor. 2013, 7, 63–73. [Google Scholar]
45. Markus, W.; Marcus, L.; Andreas, D.A. SCAtch—A sketch-based retrieval for architectural floor plans. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, Kolkata, India, 16–18 November 2010; pp. 289–294. [Google Scholar]
  46. Iordanis, E.; Michalis, S.; Georgios, P. PU learning-based recognition of structural elements in architectural floor plans. Multimed. Tools Appl. 2021, 80, 13235–13252. [Google Scholar]
  47. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man, Cybern. 1973, SMC-3, 610–621. [Google Scholar] [CrossRef]
  48. Manjunath, B.S.; Ma, W.Y. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 1996, 18, 837–842. [Google Scholar] [CrossRef]
  49. He, D.C.; Wang, L. Texture features based on texture spectrum. Pattern Recognit. 1991, 24, 391–399. [Google Scholar] [CrossRef]
  50. Lee, P.K.; Bjorn, S. Shape-Based floorplan retrieval using parse tree matching. In Proceedings of the 2021 17th International Conference on Machine Vision and Applications, Aichi, Japan, 25–27 July 2021; pp. 1–5. [Google Scholar]
  51. Mura, C.; Pajarola, R.; Schindler, K.; Mitra, N. Walk2Map: Extracting floor plans from indoor walk trajectories. Comput. Graph. Forum 2021, 40, 375–388. [Google Scholar] [CrossRef]
  52. Khade, R.; Jariwala, K.; Chattopadhyay, C.; Pal, U. A rotation and scale invariant approach for multi-oriented floor plan image retrieval. Pattern Recognit. Lett. 2021, 145, 1–7. [Google Scholar] [CrossRef]
53. Guttman, A. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM-SIGMOD, Boston, MA, USA, 18–21 June 1984; pp. 547–557. [Google Scholar]
  54. Sellis, T.; Roussopoulos, N.; Faloutsos, C. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases, Brighton, UK, 1–4 September 1987; Volume 12, pp. 507–518. [Google Scholar]
55. Greene, D. An implementation and performance analysis of spatial data access methods. In Proceedings of the Fifth International Conference on Data Engineering, Los Angeles, CA, USA, 6–10 February 1989; IEEE Computer Society: Washington, DC, USA, 1989. [Google Scholar]
  56. Beckmann, N.; Kriegel, H.P.; Schneider, R.; Seeger, B. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, USA, 23–26 May 1990; pp. 322–331. [Google Scholar]
  57. White, D.A.; Jain, R.C. Similarity indexing: Algorithms and performance. In Storage and Retrieval for Still Image and Video Databases IV; SPIE: Bellingham, WA, USA, 1996; Volume 2670, pp. 62–73. [Google Scholar]
  58. Ng, R.T.; Sedighian, A. Evaluating multidimensional indexing structures for images transformed by principal component analysis. In Storage and Retrieval for Still Image and Video Databases IV; SPIE: Bellingham, WA, USA, 1996; Volume 2670, pp. 50–61. [Google Scholar]
59. Tagare, H.D. Increasing retrieval efficiency by index tree adaptation. In Proceedings of the 1997 IEEE Workshop on Content-Based Access of Image and Video Libraries, St. Thomas, VI, USA, 20 June 1997; pp. 28–35. [Google Scholar]
  60. Hu, M.K. Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 1962, 8, 179–187. [Google Scholar]
  61. Yang, L.; Albregtsen, F. Fast computation of invariant geometric moments: A new method giving correct results. In Proceedings of the 12th International Conference on Pattern Recognition, Jerusalem, Israel, 9–13 October 1994; Volume 1, pp. 201–204. [Google Scholar]
  62. Kapur, D.; Lakshman, Y.N.; Saxena, T. Computing invariants using elimination methods. In Proceedings of the International Symposium on Computer Vision-ISCV, Coral Gables, FL, USA, 21–23 November 1995; pp. 97–102. [Google Scholar]
  63. Cooper, D.B.; Lei, Z. On representation and invariant recognition of complex objects based on patches and parts. In International Workshop on Object Representation in Computer Vision; Springer: Berlin/Heidelberg, Germany, 1994; pp. 139–153. [Google Scholar]
  64. Zhuang, Y. Intelligent Multimedia Information Analysis and Retrieval with Application to Visual Design. Ph.D. Thesis, Zhejiang University, Hangzhou, China, 1998. [Google Scholar]
  65. Delalandre, M.; Valveny, E.; Pridmore, T.; Karatzas, D. Generation of synthetic documents for performance evaluation of symbol recognition and spotting systems. Int. J. Doc. Anal. Recognit. (IJDAR) 2010, 13, 187–207. [Google Scholar] [CrossRef]
  66. de las Heras, L.P.; Terrades, O.R.; Robles, S.; Sánchez, G. CVC-FP and SGT: A new database for structural floor plan analysis and its groundtruthing tool. Int. J. Doc. Anal. Recognit. 2015, 18, 15–30. [Google Scholar] [CrossRef]
  67. Rusiñol, M.; Borràs, A.; Lladós, J. Relational indexing of vectorial primitives for symbol spotting in line-drawing images. Pattern Recognit. Lett. 2010, 31, 188–201. [Google Scholar] [CrossRef]
  68. Sharma, D.; Gupta, N.; Chattopadhyay, C.; Mehta, S. DANIEL: A deep architecture for automatic analysis and retrieval of building floor plans. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, 9–15 November 2017. [Google Scholar]
69. Goyal, S.; Mistry, V.; Chattopadhyay, C.; Bhatnagar, G. BRIDGE: Building plan repository for image description generation, and evaluation. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 1071–1076. [Google Scholar]
  70. Kiyota, Y. Frontiers of computer vision technologies on real estate property photographs and floorplans. Front. Real Estate Sci. Jpn. 2021, 29, 325–337. [Google Scholar]
  71. Dodge, S.; Xu, J.; Stenger, B. Parsing floor plan images. In Proceedings of the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp. 358–361. [Google Scholar]
72. Pizarro Riffo, P.N. Wall Polygon Retrieval from Architectural Floor Plan Images Using Vectorization and Deep Learning Methods. 2023. Available online: https://repositorio.uchile.cl/handle/2250/196842 (accessed on 16 March 2025).
  73. Ebert, F.; Yang, Y.; Schmeckpeper, K.; Bucher, B.; Georgakis, G.; Daniilidis, K.; Levine, S. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv 2021, arXiv:2109.13396. [Google Scholar]
  74. Mishra, S.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Towards robust object detection in floor plan images: A data augmentation approach. Appl. Sci. 2021, 11, 11174. [Google Scholar] [CrossRef]
75. Kato, N.; Yamasaki, T.; Aizawa, K.; Ohama, T. Users' preference prediction of real estate properties based on floor plan analysis. IEICE Trans. Inf. Syst. 2020, 103, 398–405. [Google Scholar] [CrossRef]
  76. Sharma, D.; Gupta, N.; Chattopadhyay, C.; Mehta, S. REXplore: A sketch based interactive explorer for real estates using building floor plan images. In Proceedings of the 2018 IEEE International Symposium on Multimedia, Taichung, Taiwan, 10–12 December 2018; pp. 61–64. [Google Scholar]
  77. Kalsekar, A.; Khade, R.; Jariwala, K.; Chattopadhyay, C. RISC-Net: Rotation invariant siamese convolution network for floor plan image retrieval. Multimed. Tools Appl. 2022, 81, 41199–41223. [Google Scholar] [CrossRef]
78. Chechik, G.; Shalit, U.; Sharma, V.; Bengio, S. An online algorithm for large scale image similarity learning. Neural Inf. Process. Syst. 2009, 22, 306–314. [Google Scholar]
79. Divya, S.; Chiranjoy, C.; Gaurav, H. Retrieval of architectural floor plans based on layout semantics. In Proceedings of the IEEE 2016 Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  80. Shehadeh, A.; Alshboul, O.; Taamneh, M.M.; Jaradat, A.Q.; Alomari, A.H.; Arar, M. Advanced Integration of BIM and VR in the Built Environment: Enhancing Sustainability and Resilience in Urban Development. Heliyon 2025, 11, e42558. [Google Scholar] [CrossRef]
Figure 1. Building floor plan.
Figure 2. Building identification hierarchy diagram.
Figure 3. Building plan inspection process.
Figure 4. Interior angle histogram. (a) $\theta > 180°$; (b) $\theta \le 180°$.
Table 1. Texture unit.

| V1 | V2 | V3 |
| V4 | V0 | V5 |
| V6 | V7 | V8 |
Table 2. Summary of datasets mentioned in this article.

| Dataset Name | Number of Pictures | Usage | Public | Year |
|---|---|---|---|---|
| SESYD [65] | 1000 | Retrieval, symbol localization | yes | 2010 |
| LIFULL HOME'S Dataset [70] | 83 million | Retrieval, deep learning, text mining | yes | 2015 |
| CVC-FP [71] | 122 | Semantic segmentation | yes | 2015 |
| FPLAN-POLY [72] | 42 | Symbol localization | yes | 2010 |
| ROBIN [68] | 510 | Retrieval, symbol localization | yes | 2017 |
| FloorPlanCAD [22] | 10,094 | Panoptic symbol spotting | yes | 2021 |
| BRIDGE [69] | 13,000 | Symbol recognition, scene graph composition, retrieval, building plan analysis | yes | 2019 |
| SFPI [74] | 10,000 | Symbol localization, building plan analysis | yes | 2022 |
Table 3. Building plan retrieval performance comparison.

| Method | Dataset | Performance | Year |
|---|---|---|---|
| RLH + Chechik et al. [78] | ROBIN | mAP = 0.10 | 2009 |
| RLH + Chechik et al. [78] | SESYD | mAP = 1.0 | 2009 |
| BOW + Chechik et al. [78] | ROBIN | mAP = 0.19 | 2009 |
| BOW + Chechik et al. [78] | SESYD | mAP = 1.0 | 2009 |
| HOG + Chechik et al. [78] | ROBIN | mAP = 0.31 | 2009 |
| DANIEL [68] | ROBIN | mAP = 0.56 | 2017 |
| Sharma et al. [32] | ROBIN | mAP = 0.25 | 2016 |
| CVPR [79] | ROBIN | mAP = 0.29 | 2016 |
| MCS [79] | HOME | – | 2018 |
| CNNs (updated) [43] | HOME | Accuracy = 0.49 | 2018 |
| Sharma and Chattopadhyay [33] | ROBIN | mAP = 0.31 | 2018 |
| Sharma and Chattopadhyay [33] | SESYD | mAP = 1.0 | 2018 |
| FCNs [18] | HOME | mAP = 0.39 | 2018 |
| REXplore [76] | ROBIN | mAP = 0.63 | 2018 |
| Khade et al. [52] | ROBIN | mAP = 0.74 | 2021 |
| RISC-Net [77] | ROBIN | mAP = 0.79 | 2022 |
| GCNs | ROBIN | mAP = 0.85 | – |
| YOLOv8-L | COCO | mAP = 0.90 | 2024 |
Table 4. Comparison of commonly used methods on different floor plan symbol categories: weighted F1 for semantic symbol spotting and mAP for instance symbol spotting.

| Class | GCN-Based (Weighted F1) | DeepLabv3+ (Weighted F1) | Faster R-CNN (mAP) | FCOS (mAP) | YOLOv3 (mAP) |
|---|---|---|---|---|---|
| single door | 0.885 | 0.827 | 0.843 | 0.859 | 0.829 |
| double door | 0.796 | 0.831 | 0.779 | 0.771 | 0.743 |
| sliding door | 0.874 | 0.876 | 0.556 | 0.494 | 0.481 |
| window | 0.691 | 0.603 | 0.518 | 0.465 | 0.379 |
| bay window | 0.050 | 0.163 | 0.068 | 0.169 | 0.062 |
| blind window | 0.833 | 0.856 | 0.614 | 0.520 | 0.322 |
| opening symbol | 0.451 | 0.721 | 0.496 | 0.542 | 0.168 |
| stairs | 0.857 | 0.853 | 0.464 | 0.487 | 0.370 |
| gas stove | 0.789 | 0.847 | 0.503 | 0.715 | 0.601 |
| refrigerator | 0.705 | 0.730 | 0.767 | 0.774 | 0.723 |
| washing machine | 0.784 | 0.569 | 0.379 | 0.261 | 0.374 |
| sofa | 0.606 | 0.674 | 0.160 | 0.133 | 0.435 |
| bed | 0.893 | 0.908 | 0.713 | 0.738 | 0.664 |
| chair | 0.524 | 0.543 | 0.112 | 0.087 | 0.132 |
| table | 0.354 | 0.496 | 0.175 | 0.109 | 0.173 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
