Article

Raster Image-Based House-Type Recognition and Three-Dimensional Reconstruction Technology

1 School of Computer Science and Information Technology, Daqing Normal University, Daqing 163712, China
2 College of Mechanical and Electrical Engineering, Northeast Forestry University, Harbin 150040, China
3 College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
* Authors to whom correspondence should be addressed.
Buildings 2025, 15(7), 1178; https://doi.org/10.3390/buildings15071178
Submission received: 17 February 2025 / Revised: 24 March 2025 / Accepted: 26 March 2025 / Published: 3 April 2025
(This article belongs to the Special Issue Information Technology in Building Construction Management)

Abstract

The automatic identification and three-dimensional reconstruction of house plans has emerged as a significant research direction in intelligent building and smart city applications. Three-dimensional models reconstructed from two-dimensional floor plans provide more intuitive visualization for building safety assessments and spatial suitability evaluations. To address the limitations of existing public datasets—including low quality, inaccurate annotations, and poor alignment with residential architecture characteristics—this study constructs a high-quality vector dataset of raster house plans. We collected and meticulously annotated over 5000 high-quality floor plans representative of urban housing typologies, covering the majority of common residential layouts in the region. For architectural element recognition, we propose a key point-based detection approach for walls, doors, windows, and scale indicators. To improve wall localization accuracy, we introduce CPN-Floor, a method that achieves precise key point detection of house plan primitives. By generating and filtering candidate primitives through axial alignment rules and geometric constraints, followed by post-processing to refine the positions of walls, doors, and windows, our approach achieves over 87% precision and 88% recall, with positional errors within 1% of the floor plan’s dimensions. Scale recognition combines YOLOv8 with Shi–Tomasi corner detection to identify measurement endpoints, while leveraging the pre-trained multimodal OFA-OCR model for digital character recognition. This integrated solution achieves scale calculation accuracy exceeding 95%. Finally, we design and implement a house model recognition and 3D reconstruction system based on the WebGL framework, using the front-end MVC design pattern to manage the data and views of the house model. The system supports rendering of the reconstructed walls, doors, and windows; user interaction with the reconstructed house model; and a history of house model operations, such as forward and backward (undo/redo) functions.

1. Introduction

House-type recognition and 3D reconstruction technology has important research and application value in architectural design, interior design, and other related fields. With the progress of science and technology and the development of computer-vision technology, house-type recognition and 3D reconstruction based on raster images have become a popular research direction [1]. Traditional methods of house pattern recognition and 3D reconstruction usually require manual participation and complex measurement work, sometimes even laser scanning [2]. The reconstruction work is not only time-consuming and labor-intensive but also relies, in many cases, on existing buildings, which cannot be reconstructed in advance, and it is prone to human error and low accuracy [3]. A raster image is a two-dimensional plane image acquired by equipment such as cameras or laser scanners that contains a significant amount of information about the structure and layout of the house [4]. Since house-type raster images are widely used in daily life due to their low price, easy dissemination, and vivid image quality, raster image-based recognition and reconstruction technology provides a more convenient and automated route to house-type recognition and 3D reconstruction, which is in line with people’s expectations [5]. By analyzing the lines, corners, and textures in the raster image, house-type information such as the location, size, and connection relationships of the rooms can be deduced. At the same time, by utilizing the change of view angle across multiple raster images, 3D reconstruction of the house can be realized and a house model with a geometric structure can be generated [6].
Although building design tools such as AutoCAD and Revit [7] can provide a personalized design and development experience, this software is intended for professional designers. Because a large amount of manual processing is required, the human–computer interaction is complex, the learning curve is steep, and getting started is time-consuming and laborious, which makes it difficult for ordinary users to benefit from these tools [8]. In addition, design tool technology is largely monopolized by foreign technology companies, the cost of using the software is high, and the software is mostly oriented to professional design units. Architectural floor plans play a crucial role in designing, understanding, and remodeling interior spaces. Designers can quickly recognize the extent of a room, the position of a door, or the arrangement of objects (geometric shapes) with the naked eye, and they can easily identify the type of room, door, or object through text or icon styles (semantics) [9]. Therefore, accurately recovering vectorized information from pixel images has become an urgent problem [10]. While deep-learning technology was still in its infancy, image processing was the more common approach. Morphological operations, Hough transforms, or image vectorization techniques were often used to extract lines, normalize line widths, or group lines according to predetermined widths [11]. Detected lines were matched to walls according to a priori rules that required various image heuristics, such as convex hull approximation, polygonal approximation, edge linking to overcome gaps, or color analysis along the lines [12], and doors and windows located within walls could be detected from the geometric features of their symbols [13]. Yamasaki et al. [14] parsed the floor plan image into many connected segments and recognized walls through a visual feature extraction approach based on bilinear positional orientation and distance rules applied to contours extracted from the input image. Door and window recognition relies on matching the geometric visual features of doors and windows on the floor plan according to pre-set rules; such rules cannot prevent interference from other furniture components, and recognition accuracy is low when the house plan is complex [15]. Further disadvantages are that tilted walls cannot be recognized and the algorithms are complex and time-consuming [16]. Ahmed applied multiple erosion and dilation operations to the image to segment its coarse and fine lines, classifying the lines according to predefined line-thickness rules; he used image overlay to recognize the textual information of the image through the idea of text–graphics separation [17] and used Speeded Up Robust Features (SURFs) to recognize symbols on the house plan, such as doors and windows [18,19,20,21]. Heras [22] proposed a generalized method for floor plan analysis and interpretation that applies two recognition steps in a bottom-up fashion. First, basic building blocks, i.e., walls, doors, and windows, are detected using a statistical patch-based segmentation method. Second, graphs are generated and structural pattern recognition techniques are applied to further localize the main entities, i.e., the rooms of the building. The proposed method is able to analyze any type of floor plan and recognize features with high accuracy on different datasets.
Huang [23] proposed prior-knowledge-based wall detection, manually designing local geometric and color features of walls and exploiting the self-similarity of walls for wall detection and alignment, and then detected door and window symbols in the floor plan using a two-stage deep-learning target detection model that locates the doors and windows before classifying the target region. This manual feature-design approach based on image processing requires high-quality house plans and performs poorly on plans with less clarity and more noise. Ma Bo [24] proposed house-type element recognition based on multi-attribute analysis, using template matching and edge features to detect the scale and walls in house plans, and shape features for column recognition. Such purely rule-based recognition generalizes poorly; it may work for some datasets but underperforms on house plans with inconsistent edge information and large differences in style. Shen [25] proposed a structure extraction method that filters out interference lines, extracts wall structures, and divides the space area through morphological processing of building floor plan image features. They realized scale recognition of house plans based on a target detection algorithm [26] and then completed the calculation of the space area by means of coordinate transformation, showing that scale recognition and calculation based on target detection is accurate [27].
Building on the above background, this paper recognizes and analyzes house plans from raster images in order to realize house-type reconstruction. The significance of this research can be summarized as follows:
(1) Based on existing deep-learning technology, we study the application of deep learning to house plan recognition and improve upon existing methods, which suffer from low recognition accuracy and poor generalization. With reference to human key point detection models, we use key point detection to locate the key points of the house plan and thereby determine its structure, and we improve the feature extraction network and the feature fusion network to adapt them to this task. Through target detection, corner detection, and OCR for scale calculation, we achieve accurate identification and vectorization of walls, doors, windows, scales, and other elements in the house plan.
(2) Based on existing Web development technology, we reconstruct the vectorized data of the house model in the browser and provide interaction to support house model design. To address the low efficiency and poor performance of existing house model design tools, we develop a fast, modern, real-time reconstruction and design tool for house models, use the front-end MVC pattern to control the data flow at fine granularity, and optimize the rendering process. The tool is intended to satisfy the growing demand for personalized and customized housing in China, meet the visualization needs of businesses and individual designers, and assist in the design and creation of house models, thereby promoting the stable and harmonious development of the upstream and downstream real estate industry chain and the healthy development of the national economy.
(3) We combine deep-learning technology with the civil engineering disciplines, exploring its application in the civil engineering and decoration industries, examining the development path and practical application of digital twin technology and smart cities, and empowering traditional industries with digital intelligent technology. On the basis of a large number of raster house plans accumulated from the relevant industries, we study recognition and vectorization algorithms for raster house plans. Summarizing previous research on house-type recognition and vectorization, and aiming to solve its remaining problems and pain points, this paper puts forward a method for recognizing and vectorizing house-type elements that shows excellent performance. Against the background of the industry's lack of good visualization tools for identified house plan elements, this paper also develops a Web-based house model reconstruction tool with excellent performance, which makes the whole house model reconstruction workflow more complete, bridges upstream and downstream tasks, and lays a solid foundation for downstream tasks such as the automatic layout of house models and the automatic generation of home furnishings.

2. Household Element Edge Detection

The standard raster house plan carries certain prior knowledge, and many house plan datasets follow regular patterns; for example, the inner area of a wall usually has a uniform color, and the two sides of the wall form its edges. In early work on wall, door, and window recognition in house plans, detecting wall targets with prior-knowledge-guided edge detection was common practice, and edge detection is still widely used for detecting house plan elements. Edge detection is a very important image feature extraction method in computer vision and digital image processing, and it is usually the basis for other types of image feature extraction.
Image edge detection [28] is usually implemented in one of two ways: traditional image-processing-based methods, and the deep-learning-based edge detection methods that appeared after 2015. Traditional image processing methods are more widely used in edge detection, and the methods discussed in this section are of this type. Image edge detection is realized by calculating the gradient of the image: an operator matrix is convolved with the image to obtain the image gradient, so the core of an edge detection algorithm lies in its edge detection operator. Edge detection is essentially a filtering process, with different operators extracting different features. Commonly used edge detection operators include the Roberts operator, Prewitt operator, Sobel operator, Laplacian operator, and Canny operator. In practical applications, it is often necessary to try and compare different operators to obtain the best edge detection results. The Roberts operator performs well when image edges are close to plus or minus 45°, but its shortcoming is inaccurate edge localization. The Prewitt operator performs well on horizontal and vertical edges, but noise in the gray values degrades its results considerably. The Laplacian operator is a rotationally invariant, isotropic second-order differential operator that captures overall edge information in an image and responds well to certain image structures; however, it is sensitive to noise, and image noise must be removed by low-pass filtering beforehand.
The Canny operator is a comprehensive first-order differential edge detection algorithm whose goal is to find an optimal edge contour. Among filtering-based edge detection algorithms, the Canny operator performs best: a good edge result should localize edges precisely, capture them completely, and be strongly resistant to noise. In practical engineering, the Canny algorithm is the most common; its main calculation process is as follows:
(1) Gaussian filtering: Gaussian filtering is a commonly used image smoothing filter, mainly used to remove image noise. A two-dimensional filtering kernel, i.e., the Gaussian kernel, is generated according to the Gaussian formula; then the gray values of each pixel and its neighboring pixels are convolved with the filtering operator to compute a weighted average of pixel values. A 5 × 5 or 3 × 3 Gaussian kernel is usually used. (2) Calculating the gradient image and angle image: Edges are characterized by drastic changes in gray value on either side, and this change can be regarded as the derivative of the gray value. Since image pixels are not continuous, the derivative is approximated by differences. Four operators, horizontal, vertical, diagonal, and anti-diagonal, are used to detect the horizontal, vertical, and diagonal edges in the image. After the convolution operation, the maximum gradient value of each pixel and its direction are determined, yielding the luminance gradient image and gradient direction of every pixel. (3) Non-maximum suppression: When processing gradient images, problems such as uneven edge widths, blurring, and misrecognition are common. To solve these problems, pixels that are not edges need to be eliminated; the method is to select extreme-value points and suppress the non-extreme points around them. Each pixel is compared with its neighboring pixels along the gradient direction, retaining local maxima along that direction while suppressing the surrounding non-maximum points. (4) Edge detection and connection with the dual-threshold algorithm: Two thresholds (a high and a low threshold) are set manually to categorize pixels as strong edges, weak edges, or non-edges. Strong edges are pixels above the high threshold, weak edges are pixels between the high and low thresholds (the high threshold is usually set to twice the low threshold), and non-edges are pixels below the low threshold. With the hysteresis thresholding method, a weak edge point is converted to a strong edge point when a strong edge point is found in its eight-neighborhood; from there, new edge points continue to be detected and connected until a complete contour is formed.
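As an illustration of this pipeline, a minimal OpenCV sketch follows; the image path, Gaussian kernel size, and threshold values are illustrative assumptions rather than the parameters used in this paper.

```python
import cv2

# Minimal Canny edge detection sketch (illustrative parameters, not the paper's settings).
img = cv2.imread("floor_plan.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input file

# (1) Gaussian filtering to suppress noise (5 x 5 kernel).
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.4)

# (2)-(4) Gradient computation, non-maximum suppression, and dual-threshold
# hysteresis are all performed inside cv2.Canny; the high threshold is set to
# roughly twice the low threshold, as described above.
edges = cv2.Canny(blurred, threshold1=50, threshold2=100)

cv2.imwrite("floor_plan_edges.png", edges)
```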

2.1. Scale Endpoint Corner Point Detection

The endpoints at both ends of the scale have significant corner properties. Corner detection is a computer-vision technique used to recognize corner points in an image, i.e., locations where the image changes abruptly and where two or more edges intersect. Unlike edge detection, corner detection focuses on local maxima in the image, i.e., locations that change suddenly. Commonly used corner detection algorithms are Harris corner detection and Shi–Tomasi corner detection. Harris corner detection is based on local image gradients: the eigenvalues of the gradient matrix within a local window are used to determine the corner points in the image. Shi–Tomasi corner detection is an optimization of Harris corner detection; it retains the gradient eigenvalue analysis of Harris corner detection but improves the scoring function. The Harris corner detection algorithm slides a window over the image and computes grayscale change values to identify corner points. Its key steps are as follows: (1) Image grayscaling: the image is converted to grayscale to eliminate the effect of color information. (2) Difference calculation: the grayscale differences between neighboring pixels are calculated to enhance edge information. (3) Filter smoothing: the image is smoothed with a Gaussian filter to reduce the effect of noise. (4) Local extrema: extreme-value points in local regions of the image are computed, and corner point candidates are screened. (5) Corner confirmation: corner points are finally confirmed according to the value of the corner response function and local features. (6) If the window computation shows that the grayscale changes strongly in all directions, the region is considered a corner region. The window can be an ordinary rectangular window or a Gaussian window with different weights for each pixel.
The window is shifted in each direction $(u, v)$, and the change in the gray value of the image is expanded using a bivariate first-order Taylor approximation. Written in matrix form, this gives Equation (1):

$$E(u, v) = \begin{bmatrix} u & v \end{bmatrix} \left( \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix}$$
The matrix in Equation (1) is denoted $M$, as shown in Equation (3), and its determinant equals the product of its eigenvalues $\lambda_1$ and $\lambda_2$, as shown in Equation (4). The Harris corner response is scored with Equation (2), and a point is recognized as a corner if the score exceeds a specified threshold; the Shi–Tomasi method is similar to Harris but replaces this scoring function with the smaller eigenvalue, $R = \min(\lambda_1, \lambda_2)$:

$$R = \det(M) - k \left( \mathrm{trace}(M) \right)^2$$

$$M = \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

$$\det(M) = \lambda_1 \lambda_2$$
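A minimal sketch of Shi–Tomasi corner detection with OpenCV is shown below; the file name and detector parameters (maximum corners, quality level, minimum distance) are illustrative assumptions.

```python
import cv2

# Shi-Tomasi corner detection sketch (illustrative parameters).
img = cv2.imread("scale_region.png")            # hypothetical scale-region crop
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# goodFeaturesToTrack implements the Shi-Tomasi score R = min(lambda1, lambda2).
corners = cv2.goodFeaturesToTrack(gray, maxCorners=50, qualityLevel=0.05, minDistance=10)

if corners is not None:
    for x, y in corners.reshape(-1, 2):
        cv2.circle(img, (int(x), int(y)), 3, (0, 0, 255), -1)  # mark detected corners

cv2.imwrite("scale_region_corners.png", img)
```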

2.2. HSL Color Space Model

HSL [29] (Hue, Saturation, Lightness) has three components: H represents the hue, S the saturation, and L the lightness. The HSL color space can be represented as a cylinder, as shown in Figure 1. In the L (lightness) component, a value of 100 means white and a value of 0 means black. The hue is expressed by the polar angle, the saturation by the polar radius, and the lightness by the height along the central axis of the cylinder. The elements in house plans carry rich a priori knowledge of color and lightness and are often used for wall segmentation; when lightness-related properties need to be detected, the HSL color space gives more accurate results.
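As a small illustration of using lightness priors, the sketch below converts a plan image to OpenCV's HLS representation (OpenCV orders the channels Hue, Lightness, Saturation) and thresholds the lightness channel; the file name and threshold are assumptions for demonstration, not the segmentation used in this paper.

```python
import cv2

# Convert a house plan to the HLS color space and threshold on lightness.
img = cv2.imread("floor_plan.png")                 # hypothetical input
hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)         # channels: H, L, S
lightness = hls[:, :, 1]

# Dark, uniformly colored regions (e.g., wall fills) tend to have low lightness;
# the threshold 100 is an illustrative value, not a tuned parameter.
_, wall_mask = cv2.threshold(lightness, 100, 255, cv2.THRESH_BINARY_INV)

cv2.imwrite("wall_mask.png", wall_mask)
```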

2.3. Neural Network Activation Function

The activation function is one of the components of the neurons in a deep-learning model [30] and controls whether a neuron is activated. Activation functions are introduced to add nonlinearity to the neural network model, enhancing its learning and expressive ability so that it can better fit the objective function. In modern neural network models, activation functions are used widely and mainly play the following roles. Solving nonlinear classification problems: many real-world datasets and problems are not linearly separable, and nonlinear activation functions help the network learn nonlinear relationships and better solve classification problems. Activation sparsity: the activation function can limit the output range of neurons and thus control their active state, which helps improve the sparsity of the network, reduces computational complexity, and thereby optimizes the network. Mitigating gradient vanishing and gradient explosion: when the network is designed improperly or the training parameters are initialized badly, training produces vanishing or exploding gradients, which make the network difficult to train. Constraining the network output range: ensuring that neuron outputs lie in a specific interval, or performing normalization. Activation functions have kept evolving along with neural network models; this paper introduces the very widely used activation functions Sigmoid and ReLU and, because of their shortcomings, also introduces three modern activation functions: LeakyReLU, GELU, and Swish.
The Sigmoid function, which was commonly used in the early days of neural networks, is a common S-shaped function that maps the input between 0 and 1 and is calculated as shown in Equation (5):
$$f(x) = \frac{1}{1 + e^{-x}}$$
(1) Sigmoid function: The Sigmoid function is shown in Figure 2, where the horizontal axis represents the input of the function and the vertical axis represents its output. The Sigmoid function has an output range of 0 to 1, which keeps the neuron's output in a controllable range, and it is usually used as a neural network classifier to output probabilities. The Sigmoid function is also smooth and differentiable, avoiding jumps in the output value. Sigmoid was used as the neuron activation function in the early days, but it has certain problems: it requires an exponential operation, which is computationally unfriendly; its output is not centered on 0, which slows convergence; and during back propagation the gradient easily vanishes, making it difficult to update the network weights. Therefore, in modern neural networks, Sigmoid usually serves as a classifier and is not recommended as an activation function at this stage. (2) ReLU function and LeakyReLU function: The ReLU function, $f(x) = \max(0, x)$, is a widely used and popular activation function; it solves the problems of earlier activation functions, effectively alleviates the vanishing-gradient problem during training, and, because it involves only a linear relationship, is much faster to compute. The function image is shown in Figure 2, where the horizontal axis represents the input and the vertical axis the output of the function.
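A minimal NumPy sketch of the activation functions discussed here (Sigmoid, ReLU, LeakyReLU, GELU in its tanh approximation, and Swish) is given below; the LeakyReLU slope is an illustrative default.

```python
import numpy as np

def sigmoid(x):
    # Maps inputs to (0, 1); commonly used as a classifier output.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Keeps a small gradient for negative inputs to avoid "dead" neurons.
    return np.where(x > 0, x, negative_slope * x)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    # Swish (also called SiLU): x * sigmoid(x).
    return x * sigmoid(x)

if __name__ == "__main__":
    xs = np.linspace(-3, 3, 7)
    print(relu(xs), gelu(xs))
```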

3. Analysis of the Network Structure for Wall, Door, and Window Detection and Scale Recognition of House Plans

3.1. Key Point Detection Network of Walls, Doors, and Windows of House Plans

The key point detection network for house plans is derived from human key point detection networks, which mainly follow two mainstream approaches: the regression method and the heat map method. The regression method uses a simple model, the whole process is differentiable, and dataset construction is very convenient, but training is more difficult, prone to overfitting, and lacks spatial generalization ability. The heat map method is more commonly applied to key point detection at this stage and can make full use of the information in the regions surrounding a key point, so its accuracy is much higher than that of directly regressing coordinates. However, dataset construction is more complicated, because the ground truth must be provided as heat maps. Most heat map methods extract low-resolution feature maps from the input image through a series of high-to-low-resolution convolutional networks and then use up-sampling to restore the resolution. The whole network is fully convolutional and contains three main parts: a feature extraction backbone network, a resolution recovery network, and a regressor for estimating heat maps. The feature extraction network is usually a backbone sub-network similar to a classification network, which gradually decreases the resolution and increases the number of channels in order to extract high-level image features. The resolution recovery network produces a representation with the same resolution as the expected output, usually using a U-shaped high-to-low then low-to-high framework, together with multi-scale feature fusion and intermediate supervision to enhance the information; both processes can be repeated multiple times to improve performance. Representative network designs include the symmetric hourglass-like Hourglass [31,32,33]; designs with more computation in the high-to-low process and less in the low-to-high process, which use lighter bilinear-interpolation up-sampling or transposed convolution; and designs combined with dilated convolution, where dilated convolution replaces the down-sampling in the last two stages of a ResNet or VGGNet to mitigate the loss of spatial resolution, followed by further resolution enhancement using up-sampling. The most common regressor for heat maps is the Sigmoid function. In tasks such as key point detection, the Sigmoid function generates a probability value between 0 and 1 for each pixel of the heat map as a confidence level, indicating the possibility that the point contains a key point of the current category. The Argmax function is used to select candidate key points with high confidence in the heat map, and the output value is thresholded to determine whether a key point is detected at that location. If multiple key points need to be predicted in a single heat map, a maximum number of key points is usually specified; the heat map is searched repeatedly for the location of its current confidence maximum, and non-maximum suppression is then performed centered on that location.
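The following sketch illustrates, under assumed tensor shapes, how candidate key points can be read out of a sigmoid-normalized heat map with a confidence threshold and a simple non-maximum suppression loop, as described above; the threshold, maximum number of key points, and suppression radius are illustrative values.

```python
import numpy as np

def extract_keypoints(heatmap, threshold=0.5, max_points=20, radius=5):
    """Extract up to max_points key points from one sigmoid heat map (H x W).

    Repeatedly takes the global confidence maximum, keeps it if it exceeds the
    threshold, and suppresses a (2*radius+1)^2 neighborhood around it.
    """
    hm = heatmap.copy()
    points = []
    h, w = hm.shape
    for _ in range(max_points):
        idx = np.argmax(hm)
        y, x = np.unravel_index(idx, hm.shape)
        score = hm[y, x]
        if score < threshold:
            break
        points.append((x, y, float(score)))
        # Non-maximum suppression: zero out the neighborhood of the accepted point.
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        hm[y0:y1, x0:x1] = 0.0
    return points

if __name__ == "__main__":
    demo = np.random.rand(64, 64)          # stand-in for one key point category channel
    print(extract_keypoints(demo, threshold=0.9))
```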
Figure 3 shows the Hourglass network structure, which detects key points through a symmetric hourglass-like structure that moves from high to low resolution and back from low to high resolution; it also shows the network structure of the Cascaded Pyramid Network (CPN), in which the asterisks distinguish the loss functions of the GlobalNet and RefineNet sub-networks but otherwise have the same meaning. The CPN adopts a commonly used ImageNet classification backbone, such as ResNet or VGG, for the high-to-low process, while the low-to-high process is a relatively lightweight bilinear-interpolation up-sampling path. Common key point detection network structures are shown in Figure 4, which reflects the structures currently adopted for key point detection, including the SimpleBaseline network with two transposed convolutions in the low-to-high structure; Figure 4d shows the DeeperCut [34] network with dilated convolution. The high-to-low processes of the networks shown in Figure 4b–d are part of a classification network (e.g., ResNet). The horizontal lines within the same layer in Figure 4a,b indicate skip connections between features of the same resolution layer, which are fused with one another.

3.2. Scale Recognition Network Model Structure Analysis

The scale consists of two parts: scale boundary markers and the length text within the scale region. Because the house-type region of the plan strongly interferes with the scale markers, the scale region should be located first, with the interference removed and only the scale region retained for subsequent accurate identification. A target detection network can be used to box the house-type region, which can then be removed to obtain the complete scale region; alternatively, image processing with a contour-finding function can locate the maximum outer bounding box of the house-type region in order to remove it. Scale line localization is a target detection task; after the boundary regions of the scale line are detected and aligned, the pixel length of the scale can be obtained.
Target detection networks fall into two-stage and single-stage methods: two-stage detectors are more accurate but slower, while single-stage detectors are slightly less accurate but faster. The two-stage approach is represented by Faster-RCNN, in which the first stage proposes bounding boxes of candidate objects, and the second stage uses ROIPool [35,36] to extract features from each candidate box proposed in the first stage and performs classification and bounding box regression. The single-stage approach is represented by the Yolo family of algorithms, of which Yolov3, Yolov5, and Yolov8 are representative versions; all three incorporate techniques studied by previous researchers to improve performance. Among them, Yolov8 is the newer target detection algorithm of the Yolo series and further improves flexibility. The whole network can be divided into three parts: the backbone network, the feature fusion neck, and the detection head. The backbone network consists of C2F and SPPF modules; the C2F module obtains a richer gradient flow through channel splitting and bottleneck layers, and the SPPF module uses different pooling kernel sizes to obtain rich feature information, which improves the recognition accuracy of the network. The purpose of the neck is to achieve multi-scale feature fusion, obtaining rich feature information at different resolutions and fully fusing deep, high-level semantic information with shallow features; it follows the FPN and PANet structure of Yolov4 [37,38], fusing features by adding feature maps from multiple up-sampling and down-sampling paths through lateral connections to produce multi-scale feature maps. Although FPN already fuses the shallow features once, the shallow features are still not fused adequately for instance segmentation and key point localization tasks, and for the FPN and PANet feature fusion networks of the Yolo series, small target detection remains difficult. The Yolov8 network structure is shown in Figure 5 (* denotes the dot operation); the input image passes through a number of stages to produce feature maps of different resolutions. These feature maps pass through several rounds of multi-path fusion in the FPN and PANet structures, ultimately producing feature maps at three sizes for the detection head. This multi-scale fusion takes into account both the rich positional information of the high-resolution feature maps and the semantic information of the higher-level feature maps, making target detection more accurate.
The loss function consists of two parts: the category classification loss and the bounding box regression loss. The overall loss calculation is shown in Equation (6).
$$Loss = \lambda_{l} \times Loss_{location} + \lambda_{cls} \times Loss_{classification}$$
where $\lambda_{l}$ and $\lambda_{cls}$ are the weights of the bounding box loss and the category classification loss, respectively. The classification loss characterizes whether the predicted category of the target is consistent with the real category and is calculated using BCE Loss, while the bounding box loss is calculated using CIOU Loss and DFL Loss; the CIOU Loss is calculated using Equations (7)–(10):
$$Loss_{CIOU} = 1 - CIOU = 1 - IOU + \frac{\rho^2}{c^2} + \alpha v$$

$$v = \frac{4}{\pi^2} \left( \arctan\frac{w_l}{h_l} - \arctan\frac{w_p}{h_p} \right)^2$$

$$\alpha = \frac{v}{1 - IOU + v}$$

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$
where IOU is the intersection-over-union ratio of the predicted box and the real bounding box, A and B are the regions enclosed by the two bounding boxes, $\rho$ is the distance between the center points of the predicted box and the real box, c is the diagonal length of the smallest enclosing rectangle of the predicted and real boxes, v measures the similarity of the width-to-height ratios of the predicted and real boxes, and w and h are the width and height of the corresponding boxes.
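For illustration, a small NumPy sketch of the CIoU computation for axis-aligned boxes in (x1, y1, x2, y2) format is given below; it follows Equations (7)–(10) and is not the training implementation used in this paper.

```python
import numpy as np

def ciou_loss(box_p, box_l, eps=1e-7):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection and union (Equation (10)).
    ix1, iy1 = max(box_p[0], box_l[0]), max(box_p[1], box_l[1])
    ix2, iy2 = min(box_p[2], box_l[2]), min(box_p[3], box_l[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_l = (box_l[2] - box_l[0]) * (box_l[3] - box_l[1])
    iou = inter / (area_p + area_l - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the smallest enclosing box.
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxl, cyl = (box_l[0] + box_l[2]) / 2, (box_l[1] + box_l[3]) / 2
    rho2 = (cxp - cxl) ** 2 + (cyp - cyl) ** 2
    cw = max(box_p[2], box_l[2]) - min(box_p[0], box_l[0])
    ch = max(box_p[3], box_l[3]) - min(box_p[1], box_l[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v (Equation (8)) and trade-off weight alpha (Equation (9)).
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wl, hl = box_l[2] - box_l[0], box_l[3] - box_l[1]
    v = (4 / np.pi ** 2) * (np.arctan(wl / (hl + eps)) - np.arctan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v   # Equation (7)

if __name__ == "__main__":
    print(ciou_loss((10, 10, 50, 60), (12, 8, 48, 66)))
```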

3.3. House Plan Key Point Detection Network CPN-Floor

Treating the identification and vectorization of house-type wall, door, and window elements as a key point detection task, the key point detection network first generates key points with directional categories, then generates wall, door, and window primitive candidates by applying axial alignment rules, and then imposes geometric and semantic constraints to filter the primitives and obtain the vectorization results. The process of vectorizing the house-type map is shown in Figure 5. The key point detection network chosen for this task is inspired by the CPN network structure; the CPN network itself has a small number of parameters and balances computational cost and detection accuracy. This paper proposes a convolutional neural network, CPN-Floor, for key point detection of house plan elements on the basis of the CPN network structure, regressing the key point locations of the walls, doors, and windows in the house plan. The network structure is shown in Figure 6, in which the bottleneck structure is derived from the deep bottleneck structure of ResNet: a special residual block consisting of a 1 × 1 convolution layer for decreasing the number of channels, a 3 × 3 convolution layer with the same number of channels, and a 1 × 1 convolution layer for increasing the number of channels. This hourglass-shaped bottleneck block reduces the computation of the network and can learn feature representations effectively. This section describes the design of the CPN-Floor network and the directions of improvement.
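As a reference for the block described above, a minimal PyTorch sketch of a ResNet-style bottleneck residual block is shown below; the channel sizes are illustrative assumptions and this is not the exact CPN-Floor configuration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, with a residual connection."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels))
        # Projection on the skip path when the channel counts differ.
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.conv3x3(self.reduce(x)))
        return self.relu(out + self.shortcut(x))

# Example: a block that maps 256 channels down to 64 and back up to 256.
block = Bottleneck(256, 64, 256)
y = block(torch.randn(1, 256, 64, 64))
```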

3.4. Definition of Key Points for Raster House Plan Elements

In the house plan, the main elements of the house structure are walls, doors, and windows. These main elements are first encoded as a set of connection points with categories. In the house plan recognition task, recognizing the wall structure is the most common and important task. The wall structure is defined as a set of connection points where walls intersect, and there are four types of wall connections: I-type, L-type, T-type, and cross-type; a planar right-angle coordinate system is constructed with the wall connection point as the origin. The I-type takes the four possible rotations into account, corresponding to four categories: the 30-degree to 60-degree direction, the 120-degree to 150-degree direction, the 210-degree to 240-degree direction, and the 300-degree to 330-degree direction. The L-type and T-type also consider rotation, with each category rotated 90 degrees in the positive direction from the previous one, giving four categories each; the cross type has only one category, for a total of 13 categories of wall joints. The definition of the wall categories is shown in Figure 6. Doors and windows are represented as a line in the floor plan, with eight directions in 45-degree steps, i.e., eight key point categories for each of the door and window elements. Since common doors are double sliding doors and single doors, both of which have typical widths, e.g., the width of a single door is between 0.8 m and 1.2 m and the length of a double sliding door is usually between 1.8 m and 3 m, the approximate category of a door can be determined from its category plus length information in the house plan. Windows are likewise considered in this task only in the category of common windows. In this paper, we focus on the location information of doors and windows, and there are eight categories of door joints and eight categories of window joints.
After the key points of each category are obtained from the key point detection network, the connection points are encoded as geometric primitives via alignment rules, with walls, doors, and windows each represented as a line; valid primitives should be formed by connecting key points along the directions indicated by their categories. Also, owing to some a priori knowledge of walls, wall primitives must form closed one-dimensional loops, and doors and windows must be located on walls; a planar vector representation with high-level structure can thus be obtained through a simple heuristic post-processing method to vectorize the main elements of the house type.

3.5. Feature Extraction Network ConvNeXt

Feature extraction networks, also known as backbone networks, have the primary role of extracting features at multiple scales, with the backbone network outputting dense features for target detection and semantic segmentation, or feature vectors for image classification and image retrieval. Neural-network-based computer-vision systems are usually built on pre-trained feature extractors; considering downstream task migration, they are usually not trained from scratch on task-specific data but pre-trained on a large-scale benchmark dataset and then transferred to a downstream dataset. ConvNeXt is organized into stages composed of several network blocks, with a block ratio of 1:1:9:1 across the stages; the smallest stages contain three blocks each, giving a final block configuration of 3:3:27:3. In terms of the activation function, ConvNeXt adopts the GELU activation function, which effectively avoids the potential problems of the ReLU activation function. The structure of the ConvNeXt network block is illustrated in Figure 7.
In ConvNeXt, the use of Layer Norm obtains superior performance to Batch Norm; the overall structure of the ConvNeXt-B network is shown in Table 1.
In Table 1, c represents the number of input channels, s represents the convolutional stride, and Conv represents a convolutional block consisting of a convolutional layer, a normalization layer, and a ReLU layer. In FPN feature fusion, feature maps of smaller sizes are often up-sampled and then summed pixel-wise with laterally connected feature maps; however, in BiFPN, different input features have different resolutions and contain geometric or semantic information of different importance, so the input feature maps do not contribute equally to the output features. Therefore, each feature map learns a weight at the time of feature fusion, and the pixel values are weighted before summation. Traditional unbounded weight parameters lead to unstable training, while computing the weights with softmax normalization increases the computational cost and slows training, so this paper adopts fast normalized fusion; the fast normalized fusion weight is calculated as shown in Equation (11):
$$\alpha_i = \frac{\omega_i}{\varepsilon + \sum_j \omega_j}$$
The depthwise-separable convolution block consists of a channel-wise (depthwise) convolution with a kernel size of 3 and an ordinary convolution with a kernel size of 1. As an example, the M3 feature map in the feature fusion structure is calculated as shown in Equations (12) and (13):
$$M_3 = dsconv\left( \frac{\omega_1 \cdot C_3^{in} + \omega_2 \cdot upsample(C_4^{in})}{\omega_1 + \omega_2 + \varepsilon} \right)$$

$$M_s(F) = sigmoid\left( conv_{7 \times 7}\left( concat\left( AvgPool(F), MaxPool(F) \right) \right) \right)$$
where F is the feature map on which spatial attention is computed, $M_s$ is the spatial attention map, which has only one channel and the same width and height as F, $conv_{7 \times 7}$ is a convolutional layer with a kernel size of 7, concat denotes concatenation along the channel dimension, AvgPool is average pooling along the channel direction, and MaxPool is maximum pooling along the channel direction.
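The following PyTorch sketch combines the two pieces above: a fast-normalized weighted fusion of two feature maps (Equations (11) and (12)) followed by a CBAM-style spatial attention map (Equation (13)); the channel count and layer choices are illustrative assumptions, not the exact modules used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastFusionWithSpatialAttention(nn.Module):
    """Weighted fusion of two feature maps followed by spatial attention."""

    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))     # learnable fusion weights w1, w2
        self.eps = eps
        # Depthwise-separable convolution applied after fusion (Equation (12)).
        self.dsconv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # 7x7 convolution over channel-wise avg/max pooling (Equation (13)).
        self.attn_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, c3, c4):
        w = F.relu(self.weights)                        # keep weights non-negative
        w = w / (w.sum() + self.eps)                    # fast normalized fusion (Equation (11))
        up = F.interpolate(c4, size=c3.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.dsconv(w[0] * c3 + w[1] * up)
        avg = fused.mean(dim=1, keepdim=True)           # channel-wise average pooling
        mx, _ = fused.max(dim=1, keepdim=True)          # channel-wise max pooling
        attn = torch.sigmoid(self.attn_conv(torch.cat([avg, mx], dim=1)))
        return fused * attn

# Example with assumed shapes: C3 at 64x64 and C4 at 32x32, both with 128 channels.
m = FastFusionWithSpatialAttention(128)
out = m(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 32, 32))
```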

3.6. Loss Function

The loss function measures the difference between the predictions of the network model and the true target. Directly regressing key point coordinates is a difficult, highly nonlinear problem, because it discards spatial location information and predicts only a single correct value. Therefore, the common alternative is to encode the true location coordinates as a heat map and to label confidence values within a certain region. In human body key point detection, the L2 loss is usually used, and each heat map contains only one key point: the confidence labeling of the key point is a Gaussian distribution with a certain standard deviation, so the L2 loss can regress the heat map of the key point fairly accurately, after which the coordinates of the key point are obtained through the Argmax function. For this task, it is more appropriate to replace the L2 loss with a binary cross-entropy loss, mapping the final output heat map of the network to values between 0 and 1 with Sigmoid normalization. The L2 loss is calculated by Equation (14):
$$L2\,Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where N is the total number of samples, $y_i$ is the true label, and $\hat{y}_i$ is the model prediction for the corresponding sample.
The key point detection network computes heat maps of key points: each channel contains a heat map of the locations of one category of house plan key points, and the output is a set of detected heat maps covering the N categories of house plan key points. The goal is to obtain a probability distribution of key point locations that is consistent with the true values. The label at each location is a binary classification problem indicating whether or not it contains a key point, and for the confidence output by the Sigmoid activation, the binary cross-entropy loss is calculated as in Equation (15):
$$L_{BCE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{W} \sum_{j=1}^{H} \left[ GT_{ij}^{n} \log\left(p_{ij}^{n}\right) + \left(1 - GT_{ij}^{n}\right) \log\left(1 - p_{ij}^{n}\right) \right]$$
where $L_{BCE}$ is the binary cross-entropy objective function, $GT_{ij}^{n}$ is the true value of the nth category at pixel position (i, j), $p_{ij}^{n}$ is the predicted confidence at the same position, N is the number of key point categories (heat map channels), and W and H are the width and height of the heat map. During inference, a maximum number of key points is specified, candidate key points are selected from the heat map, and non-maximum suppression is applied to them.
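A short PyTorch sketch of this heat map BCE objective is given below, averaging over categories and pixels; the tensor shapes are assumptions for illustration.

```python
import torch

def heatmap_bce_loss(pred_logits, gt_heatmaps, eps=1e-7):
    """Binary cross-entropy over predicted key point heat maps.

    pred_logits, gt_heatmaps: tensors of shape (N, W, H), where N is the number
    of key point categories and gt values lie in [0, 1].
    """
    p = torch.sigmoid(pred_logits).clamp(eps, 1 - eps)   # per-pixel confidence
    loss = -(gt_heatmaps * torch.log(p) + (1 - gt_heatmaps) * torch.log(1 - p))
    return loss.mean()                                    # average over N, W, H

# Example with assumed shapes: 13 wall key point categories on 96x96 heat maps.
logits = torch.randn(13, 96, 96)
target = torch.zeros(13, 96, 96)
print(heatmap_bce_loss(logits, target).item())
```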

3.7. Division of Scale Area

Splitting the scale area from the house plan is the first step of scale calculation. Observation of house plans shows that the peripheral walls, doors, and windows of the house plan area usually form a closed region, while the scale markers are usually not completely closed, so a simple and efficient method of scale area splitting is proposed: first find the closed area of the house plan region, then remove it, leaving the surrounding scale area. Taking the four regions above, below, to the left, and to the right, the scale region images can then be obtained for subsequent accurate calculation. The segmentation method proposed in this paper is simple, fast, and accurate, and it is applicable to plans containing only one house plan region. In this paper, Opencv-python is used as the basic image processing library, and the steps of the whole algorithm are as follows: (1) Binarization threshold segmentation: color house plans are first converted to grayscale images, and the average grayscale of the image is then used as the threshold to binarize the image. (2) Contour detection: the maximum outer contour, i.e., the contour of the house plan area, must be found. We use the contour finding function findContours provided by opencv to find all the contours in the floor plan and select the one with the largest perimeter as the contour of the floor plan area. (3) Finding the largest enclosing rectangle: after the contour of the floor plan area is found, the boundingRect function of opencv is used to obtain its largest enclosing rectangle, and the region inside this rectangle is cleared from the original picture. (4) Through the above algorithm, we obtain a picture that contains only the scale area, from which the scale portions in the top, bottom, left, and right regions can easily be obtained, as shown in Figure 8.
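A compact OpenCV sketch of this four-step procedure is shown below; the file name is an assumption, and the perimeter-based contour selection mirrors the description above rather than reproducing the exact implementation.

```python
import cv2
import numpy as np

# Scale-area segmentation sketch: remove the closed house plan region,
# keeping only the surrounding scale annotations.
img = cv2.imread("floor_plan.png")                       # hypothetical input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# (1) Binarize using the mean gray value as the threshold.
_, binary = cv2.threshold(gray, float(gray.mean()), 255, cv2.THRESH_BINARY_INV)

# (2) Find all contours and keep the one with the largest perimeter.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
plan_contour = max(contours, key=lambda c: cv2.arcLength(c, True))

# (3) Clear the largest enclosing rectangle of the house plan region.
x, y, w, h = cv2.boundingRect(plan_contour)
scale_only = img.copy()
scale_only[y:y + h, x:x + w] = 255                       # white out the plan region

# (4) What remains (top, bottom, left, right strips) contains only the scale area.
cv2.imwrite("scale_area.png", scale_only)
```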

3.8. Scale Endpoint Area Detection and Scale Corner Positioning

The two endpoints of the scale marker determine the pixel length of the scale in this section; the scale is calculated with Equation (16), where Lp represents the pixel length and Lgt represents the real length. Owing to the diversity of scale marker styles, performing corner detection directly on the scale area causes very large errors. The method therefore first uses the scale endpoint area detector to locate the corner regions at the two ends of the scale, then carries out detailed corner detection within the detected corner regions to reduce the interference of non-endpoint areas, and finally optimizes the corner detection results to determine the final scale endpoint positions. A statistical analysis is then applied to the previously calculated scale values, among which erroneous values may exist: the standard deviation and mean of all calculated scale values are computed first. The distribution of the calculated scale values can be approximated as a normal distribution, so values within plus or minus one standard deviation are retained; the scale values that pass this standard-deviation test can be taken as the scale of the house plan and are then averaged to obtain the final scale value. See Equations (17) and (18) for the mean and standard deviation calculation formulas:

3.9. Experimental Test and Result Analysis

As shown in Figure 9, standard CAD-exported construction drawings and design drawings are also included. It is difficult to achieve high-quality recognition and reconstruction of complex and diverse types of architectural drawings with only one deep-learning model, so in current methods the model is usually designed and trained according to the category of the architectural images to be recognized and reconstructed.
The annotation information is converted to a connection-layer representation as follows: (1) For doors and windows, key point objects directly read the two endpoint positions of the corresponding door or window category in the annotation and use the direction information of the two endpoints to determine the key point category at each door or window endpoint. (2) For walls, key points are usually calculated from multiple wall primitives, so each wall intersection point is determined by calculating the local connectivity of the wall, i.e., checking whether other walls exist above, below, to the left, or to the right of the wall, and de-duplicating the result in order to accurately calculate the category and location of each type of wall key point. The results are shown in Figure 11, where Figure 11a shows the wall key points and Figure 11b shows the door key points; each key point expresses not only the location information but also the attached semantic category information.
The house plan and its labeling data are shown in Figure 13 and Figure 14. Each row represents a wall, door, or window primitive; the first four columns give the positions of the two endpoints of the primitive in the picture dataset, the fifth column gives the type of the primitive (WALL for wall, OPENING for window, and DOOR for door), and the sixth and seventh columns are reserved for categories: the sixth column is for the broad categories of walls, doors, and windows (for walls, internal and external walls are reserved, and for doors and windows, subcategories such as single or double doors are reserved), and the seventh column is the door orientation information. The dataset of scale line endpoint areas is labeled in Yolo format; the endpoints of the scale area of each house plan are labeled manually using the makesense labeling tool, and there is only one category, scale endpoints. The labeling process is shown in Figure 13: the bounding box method is used to select the scale endpoint areas of the house plan, with fine adjustments and corrections. See Figure 13 for the scale marker endpoint annotation data file.
In Figure 14, the first column is the target category, and there is only one category in this task, i.e., the endpoint area of the scale bar; the next four columns are the x coordinate of the center point, the y coordinate of the center point, the width of the detection box, and the height of the detection box, with the upper-left corner of the image as the origin. The coordinates, length, and width are normalized by the original image size and mapped to between 0 and 1. The dataset for this experiment consists of the self-constructed household raster image dataset and the household scale line endpoint detection dataset described above. This paper implements the method proposed in the R2V paper; because the original R2V method cannot detect tilted walls, the output layer of the network proposed in the original paper is changed to an output dimension suitable for this task's dataset, and the inference test is carried out after 300 epochs of training on this paper's dataset. Based on the CPN network structure, this paper also conducts comparison experiments with two classical feature extraction networks: ResNet-50, which is used in the original CPN, and DRN-d-54. This paper also chooses the SimpleBaseline network, which performs better than CPN in human body key point detection, for transfer learning and testing on the dataset proposed in this paper. The results of the comparison experiments are shown in Table 1 and Figure 14.
From the results of the ablation experiments in Figure 15, it is concluded that the selection of an excellent backbone network can obtain better feature extraction results. If the features provided by feature extraction are poor, the selection of a more advanced FPN structure will bring limited improvement to the network, and the excellent FPN structure cannot play its proper part, dragging down the overall network performance. For the critical point detection task, both a good feature extraction network and an efficient feature fusion module are necessary. From the ablation experiments of the spatial attention module, it can be seen that the addition of the spatial attention module to the BiFPN in this paper improves the accuracy of the wall key points and the window primitives, while the wall key points are more obviously improved. In terms of recall, the wall key points and the door primitives have a significant increase in the recall, while the window primitives have a limited increase in the recall.
The key point detection results are shown in Figure 16, Figure 17, and Figure 18: Figure 16 shows the wall key point detection results, Figure 17 shows the window key point detection results, and the door key point detection results are shown as well. The center of each highlighted region is a detected key point, and to make the positional relationships between the key points easier to observe, the heat maps of all key point detection results are also shown in Figure 17 and Figure 18. To distinguish different types of key points, the key points of doors and windows are rendered slightly darker, and each key point carries a category.
The vectorized reconstruction results for house plans containing slanted walls, doors, and windows are shown in Figure 18, where the white marks are the detected key points. The figure compares the vectorization results of the R2V method with those of the CPN-Floor method proposed in this paper on plans containing slanted walls, slanted doors, and slanted windows. Our method outperforms R2V on such plans: R2V fails to identify some hard-to-detect wall joints, which causes the constraints on subsequent joints to fail, whereas CPN-Floor maintains good vectorization results even for house plans with more complex styles and topologies.
$$\text{scale} = \frac{L_{p}}{L_{gt}}$$

$$\mu = \frac{1}{N}\sum_{i=1}^{N} l_{i}$$

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(l_{i} - \mu\right)^{2}}$$
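For clarity, the sketch below shows how these quantities might be computed, assuming L_p is the measured pixel length of a scale segment, L_gt is the real length read from the scale digits, and l_i are the detected segment lengths; the 2σ outlier filter at the end is purely an illustrative assumption, not a step confirmed by the original method.

```typescript
// Worked sketch of the scale statistics above; names are illustrative.
function mean(ls: number[]): number {
  return ls.reduce((s, v) => s + v, 0) / ls.length;
}

function std(ls: number[]): number {
  const mu = mean(ls);
  return Math.sqrt(ls.reduce((s, v) => s + (v - mu) ** 2, 0) / ls.length);
}

// scale = L_p / L_gt : pixels per unit of real-world length for one scale segment.
function segmentScale(pixelLength: number, realLength: number): number {
  return pixelLength / realLength;
}

// Hypothetical example: keep only segments whose pixel length lies within 2σ of the mean.
const lengths = [412, 408, 870];               // assumed pixel lengths of detected segments
const mu = mean(lengths), sigma = std(lengths);
const kept = lengths.filter(l => Math.abs(l - mu) <= 2 * sigma);
```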

4. Vectorized Reconstruction Methods for Household Plans

4.1. 2D Reconstruction Method for House Plans

The wall and door/window graphics in the 2D house plan are drawn in real time through the API provided by PixiJS, according to the endpoint positions and thickness of each wall, door, and window, with connection point optimization applied at wall joints. If the endpoint of a wall does not intersect any other wall, its edge is drawn as a right angle; if two walls do intersect, the wall thickness is taken into account to compute the two side intersection points, and the line between these two points is used as the shared edge along which the two walls are respectively drawn, as shown in Figure 18. The system provides four kinds of door and window drawings, which can be joined to the reconstructed house plan, as shown in Figure 19.
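A minimal sketch of this drawing step is given below, assuming a wall described only by its two endpoints and thickness; the joint optimization at shared endpoints described above is omitted, and the PixiJS 6 Graphics calls are used only to show how one wall quad could be filled. Identifier names such as wallPolygon are assumptions for illustration.

```typescript
import * as PIXI from "pixi.js";

interface WallSeg { x1: number; y1: number; x2: number; y2: number; thickness: number; }

// Compute the four corners of a wall quad by offsetting the segment by half the thickness.
function wallPolygon(w: WallSeg): number[] {
  const dx = w.x2 - w.x1, dy = w.y2 - w.y1;
  const len = Math.hypot(dx, dy) || 1;
  const nx = (-dy / len) * (w.thickness / 2);   // unit normal scaled to half thickness
  const ny = (dx / len) * (w.thickness / 2);
  return [
    w.x1 + nx, w.y1 + ny,
    w.x2 + nx, w.y2 + ny,
    w.x2 - nx, w.y2 - ny,
    w.x1 - nx, w.y1 - ny,
  ];
}

const app = new PIXI.Application({ width: 800, height: 600, backgroundColor: 0xffffff });
document.body.appendChild(app.view);

const g = new PIXI.Graphics();
g.beginFill(0x4d4d4d);
g.drawPolygon(wallPolygon({ x1: 100, y1: 100, x2: 400, y2: 100, thickness: 12 }));
g.endFill();
app.stage.addChild(g);
```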
The 2D reconstruction of the house plan is designed and implemented with PixiJS 6. PixiJS organizes rendering through containers: a container is the object that creates the scene graph and gathers a set of child view objects—sprites, graphics, text, and so on—into a tree-shaped rendering structure, as shown in Figure 20. The PixiJS rendering engine continuously redraws: after the view objects are updated, PixiJS renders to the screen and repeats the rendering cycle. PixiJS provides Graphics objects for 2D drawing and an event-based interaction system for managing display object interactions. Mouse-click events, for example, can be used to determine where on the canvas a wall should be drawn, after which the wall view is drawn through the Graphics API. When a wall needs to be updated, its view object is looked up in the view object pool and its container is re-triggered to redraw, which realizes incremental updates of each element and gives finer control over the data flow and the timing of re-rendering. With this design, it is no longer necessary to determine which view objects need updating through a virtual-view diffing algorithm when data objects change. Each view object of a wall, door, window, etc. (Figure 21) is instantiated from its corresponding data object, so a Map from each data object to its view object can be constructed. Once a data object changes, the triggered event lets its corresponding view object obtain the change and update the view according to the event category.
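The following TypeScript sketch illustrates this data-object-to-view-object mapping and event-driven incremental update. The class and method names (WallData, WallView, FloorPlanStore) are illustrative assumptions rather than the actual classes of the system.

```typescript
type Listener = (id: string) => void;

class WallData {
  constructor(public id: string, public x1 = 0, public y1 = 0,
              public x2 = 0, public y2 = 0, public thickness = 12) {}
}

class WallView {
  // in the real system this would wrap a PIXI.Graphics object and redraw it
  redraw(data: WallData): void {
    console.log(`redraw wall ${data.id}: (${data.x1},${data.y1}) -> (${data.x2},${data.y2})`);
  }
}

class FloorPlanStore {
  private data = new Map<string, WallData>();
  private views = new Map<string, WallView>();
  private listeners: Listener[] = [];

  addWall(w: WallData): void {
    this.data.set(w.id, w);
    this.views.set(w.id, new WallView());
    this.emit(w.id);
  }

  updateWall(id: string, patch: Partial<WallData>): void {
    const w = this.data.get(id);
    if (!w) return;
    Object.assign(w, patch);
    this.emit(id);                                       // notify only for the changed object
  }

  private emit(id: string): void {
    this.listeners.forEach(l => l(id));
    this.views.get(id)?.redraw(this.data.get(id)!);      // incremental update, no diffing
  }
}
```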
In addition to the wall, door, and window views, length marker objects are attached to each wall, door, and window view object for visual convenience; they are rendered in separate containers and display the computed length and geometry of each wall, door, and window.

4.2. 3D Modeling Rendering Techniques for House Plans

Three-dimensional modeling and rendering of house plans can be achieved with a variety of tools and techniques; in engineering practice, AutoCAD 2020 and Revit 2019 are commonly used. Game engines such as Unity and Unreal Engine are another common approach, as they provide models, materials, physics, lighting, cameras, and other common elements of a graphics scene. In this paper, the system is designed and implemented on WebGL; for the unity and integrity of the system, Babylon.js is chosen as the Web-side 3D modeling and rendering framework. Babylon.js is an open-source 3D game engine and graphics rendering library based on WebGL, designed to let developers create high-performance interactive 3D scenes and applications in the browser. It provides a rich set of features and tools for efficiently building complex 3D games, virtual and augmented reality applications, product demonstrations, and more. Babylon.js exposes encapsulated upper-layer interfaces that allow developers to program graphics directly in JavaScript or TypeScript without dealing with complex shader languages or the underlying WebGL interactions. Several basic concepts in Babylon.js are important here. (1) Scene: a container holding all the visual objects in a viewport; the scene is responsible for all rendering, lighting, camera, physics, and interaction operations. (2) Mesh: the basic object used to represent the geometry of a 3D object, composed of many connected triangular faces. A mesh can be a basic primitive such as a cube, sphere, or cylinder; an external 3D model loaded through the model loader (e.g., .obj or .gltf/.glb files); or a custom shape described by vertex data, triangulated from lists of vertices and indices. Meshes can carry properties such as materials and textures and can undergo geometric transformations. (3) Camera: the viewpoint used to observe the 3D scene, defining the field of view and viewing direction. Babylon.js provides several preset camera types; the most commonly used are the general-perspective (first-person) camera and the arc-rotate camera. The general-perspective camera behaves like a human first-person view, while the arc-rotate camera behaves like a satellite in orbit, always pointing toward a specified target position. In this task, the arc-rotate camera is used to display the 3D scene of the house model, and the rendered result can be rotated for viewing. (4) Lighting: lighting determines the light and shade of objects in the scene; the position, direction, color, and other attributes of a light directly affect how objects are lit. Babylon.js provides preset light types such as point lights, parallel (directional) lights, spot lights, and hemispheric lights; in this task, the default light source is a parallel light. (5) Material: the material determines the appearance and texture of a mesh surface. Babylon.js provides commonly used materials, including the standard material, PBR materials, and texture materials.
(6) Texture: a texture is an image used to cover the surface of a mesh; it can add color, patterns, mapping, and other effects. All position information for walls, doors, and windows is derived from the vectorized planar results of the 2D raster floor plan. Since only 2D information is known, the height dimension is preset in this paper: each floor is 3 m high, external walls are 24 cm thick, internal walls are 12 cm thick, and walls share the 3-m floor height. Doors default to a height of 2.1 m and are placed on the ground; windows are 1 m high and sit 1.5 m above the ground (Figure 22). With this preset height information, model objects can be created and placed in the scene: each wall is abstracted as a rectangular or trapezoidal prism, and the wall information is described in terms of its surfaces. When walls are connected, the intersecting vertex positions of the wall edges at both ends of the connection point must be computed to support the subsequent mesh vertex calculation. For each wall, the 3D geometric data of every face are first computed from the wall data object, including 3D surface vertices, contour lines, bounding boxes (Figure 23), and holes. Each geometric face is then converted to triangular mesh data—the triangulation step—and a wall mesh object is created from these geometric attributes. The generated triangle data are used to set the vertices, indices, normals, UVs, materials, and so on, which are assigned to the Babylon mesh object and added to the Babylon 3D scene, as sketched below. For door and window meshes, pre-built .glb model files are loaded with the model loader provided by Babylon.js. Since doors and windows must be located on walls, their geometric properties are computed first, holes are opened in the wall mesh, the door or window model is placed inside, the affected wall mesh is recalculated and rebuilt, and the 3D view is rendered.
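A minimal Babylon.js sketch of the ideas above follows: it sets up a scene with an arc-rotate camera and a parallel light, then places one vectorized wall segment as a 3-m-high box using the preset dimensions. The real system builds custom triangulated meshes with door and window holes, so the plain box and all identifier names here are simplifying assumptions for illustration.

```typescript
import { Engine, Scene, ArcRotateCamera, Vector3, DirectionalLight, MeshBuilder } from "@babylonjs/core";

const canvas = document.getElementById("view3d") as HTMLCanvasElement;
const engine = new Engine(canvas, true);
const scene = new Scene(engine);

// Arc-rotate (orbital) camera aimed at the model centre; the user can rotate and zoom it.
const camera = new ArcRotateCamera("camera", Math.PI / 4, Math.PI / 3, 20, Vector3.Zero(), scene);
camera.attachControl(canvas, true);

// Parallel (directional) light, matching the default light configuration.
new DirectionalLight("sun", new Vector3(-1, -2, -1), scene);

const FLOOR_HEIGHT = 3.0;                              // metres
const THICKNESS = { external: 0.24, internal: 0.12 };  // metres

interface WallSeg2D { x1: number; y1: number; x2: number; y2: number; external: boolean; }

function buildWallMesh(w: WallSeg2D, metresPerPixel: number) {
  // plan-view pixel coordinates are mapped onto the XZ ground plane in metres
  const ax = w.x1 * metresPerPixel, az = w.y1 * metresPerPixel;
  const bx = w.x2 * metresPerPixel, bz = w.y2 * metresPerPixel;
  const length = Math.hypot(bx - ax, bz - az);
  const thickness = w.external ? THICKNESS.external : THICKNESS.internal;

  const mesh = MeshBuilder.CreateBox("wall", { width: length, height: FLOOR_HEIGHT, depth: thickness }, scene);
  // centre on the segment midpoint, stand on the ground, align with the segment direction
  mesh.position = new Vector3((ax + bx) / 2, FLOOR_HEIGHT / 2, (az + bz) / 2);
  mesh.rotation.y = -Math.atan2(bz - az, bx - ax);
  return mesh;
}

buildWallMesh({ x1: 100, y1: 100, x2: 500, y2: 100, external: true }, 0.01);
engine.runRenderLoop(() => scene.render());
```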

4.3. System Architecture Design

In developing large-scale front-end graphical applications, data-driven views allow the flow of data to be controlled at a finer granularity, and each view needs to be backed by a corresponding data model: when the user operates on a view, the change is applied to its data model, which in turn triggers re-rendering of the view. Two mainstream design patterns realize this separation of view and data: MVC and MVVM. In the MVC pattern, the controller receives user inputs and requests and dispatches them to the corresponding Model for processing, which then triggers updates of the corresponding View objects; this usually requires identifying which view object corresponds to a given data object. MVVM decouples user-defined data objects from the real views through virtual view objects and obtains the minimal set of view updates through a diffing algorithm over the virtual view objects; it is typically used when it is difficult to determine which view objects need updating after a data change. Both design patterns fully support functional modularity: each module is maintained independently, modules do not affect one another, and there is high cohesion within modules and low coupling between them.
The main project is developed with Vue and Element Plus in TypeScript. The graphics rendering part is packaged as a dependency of the front-end display project, which installs and imports it after building. Following the front-end MVC design pattern, the graphics rendering module consists of the following packages: the VisualModule package, Schema package, Interaction package, History package, and RenderApp package. VisualModule package: this encapsulates the 2D and 3D house scenes, divided into 2D view models and 3D view models, and contains the following core classes. Visual2dModule integrates the application, containers, and resource loading provided by PixiJS and receives user interactions on the 2D area and event notifications from the scene. Visual3dModule integrates the Babylon.js scene, camera, lights, materials, textures, mapping, and resource loading and receives events from the data objects in the current scene. VisualWall2d and VisualWall3d encapsulate the 2D and 3D view models of walls and compute and render wall geometry in the 2D and 3D scenes; the corresponding door and window view models compute the geometric attributes of doors and windows and render their views; VisualRoom2d and VisualRoom3d render the room views. Schema package: this describes the structure of the user-defined data models in the current scene; data objects are registered in this package, which performs unified serialization and deserialization of the custom wall, door, window, room, and other data objects and is the collection of all data objects describing the current house model. It contains an important class, SceneSchema, which describes the collection of data objects in the current scene and the relationships among them and is the container for all wall, door, and window data models. Unlike a Babylon.js Scene, SceneSchema describes the relationships among the data object elements of the user-defined house model rather than the concrete rendered objects. EntityWall and EntityWallAttachment are the data models of walls and of doors and windows, encapsulating intrinsic attributes such as length, thickness, start position, end position, and height above the ground. Interaction package: this provides the interaction and operations of the 2D scene views, receiving and processing the user's interactions in the scene and modifying the corresponding wall, door, and window data objects. Its core classes include InteractionDrawWall for wall drawing, InteractionEdit for editing the attributes of walls, doors, and windows, InteractionTransport for dragging walls, doors, and windows, and InteractionScale for specifying the scale of the current house plan background. History package: this manages the historical states of the data in the current scene and provides undo and redo functions. Driven by data changes in the scene, it caches the historical states of the data objects and relies on the Schema package; its core class HistoryManager caches the data objects of the current scene tree and compares the old and new scene trees, thereby realizing forward and backward navigation through the history of the current scene (see the sketch below). RenderApp package: this integrates the contents of the above packages into a single class that serves as the entry point of the rendering module for external projects; the core class RenderApp integrates the above packages and exposes them uniformly for external use. BFF: the back-end-for-front-end service is developed with Nest.js and handles the communication between the main back-end services and the front-end; a front-end project usually needs only one such back-end, and all internal service-to-service communication is handled by this back-end server. Its core classes include FileService, used for floor plan uploads, and FloorPlanService, used for persisting saved floor plan designs. The house-type recognition module is developed and deployed separately and exposes its services through FastAPI as an independent back-end service, called by other modules through network requests, which decouples the modules from one another.
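The sketch below illustrates how the Schema and History packages could interact, with EntityWall as a plain serializable record and HistoryManager keeping undo and redo stacks of whole-scene snapshots. The actual system compares old and new scene trees, so this snapshot-based variant and its class shapes are assumptions for illustration only.

```typescript
interface EntityWall {
  id: string;
  start: { x: number; y: number };
  end: { x: number; y: number };
  thickness: number;
  height: number;
}

interface SceneSchema {
  walls: EntityWall[];
  // doors, windows, rooms omitted for brevity
}

class HistoryManager {
  private undoStack: string[] = [];
  private redoStack: string[] = [];

  // called after every confirmed user edit
  snapshot(scene: SceneSchema): void {
    this.undoStack.push(JSON.stringify(scene));
    this.redoStack.length = 0;                 // a new edit invalidates the redo chain
  }

  undo(current: SceneSchema): SceneSchema | null {
    if (this.undoStack.length === 0) return null;
    this.redoStack.push(JSON.stringify(current));
    return JSON.parse(this.undoStack.pop()!);
  }

  redo(current: SceneSchema): SceneSchema | null {
    if (this.redoStack.length === 0) return null;
    this.undoStack.push(JSON.stringify(current));
    return JSON.parse(this.redoStack.pop()!);
  }
}
```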

Household Identification and Reconstruction Module Design

The main function of this module is to recognize the raster house plan images uploaded by users, draw and render the recognition results in the front-end, and, from the reconstructed 2D vectorized house plan, synchronously construct and render the 3D house model so that it can be displayed in real time in the 3D scene, thereby reconstructing a 3D house model from a 2D raster house plan image.
The flowchart of house model recognition and reconstruction is shown in Figure 24. The user first selects and uploads the floor plan to be recognized; the image is packaged as a FormData object and sent to the BFF back-end for scale recognition and for reconstruction and vectorization of the wall, door, and window primitives. The BFF back-end receives the upload request, saves the image to the OSS, forwards the image address together with the TaskId of the current task to the message queue, and returns the TaskId and image address to the front-end. The front-end then queries the BFF service with the TaskId every 5 s to check whether the task has been completed. Meanwhile, the house plan background image is loaded into the 2D canvas using the AssetsLoader provided by PixiJS, centered in the window, and scaled according to its width and height so that the uploaded plan sits in the middle of the user's viewing area. On the server side, the algorithms described in Section 3 perform the vectorized reconstruction and scale calculation for the walls, doors, and windows, and the recognized scale and house-type primitive information is encapsulated in JSON format. When the back-end AI recognition finishes, the BFF service returns the recognition result to the front-end at the next polling request, and the recognized scale line position and scale number size are rendered. The front-end first pops up the position and value of the recognized scale bar for the user to confirm, allowing fine-tuning of the scale region; after confirmation, the graphics rendering module creates the data objects and view objects of the walls, doors, and windows from the vectorization results and displays them on the canvas. If the uploaded floor plan contains no scale bar, a default scale ruler is shown at the center of the 2D canvas, the user specifies the scale manually, and the real sizes of the primitives are then computed. If a scale is recognized correctly in the current plan, the scale marker area appears at the recognized scale position, as shown in Figure 24.
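A condensed sketch of this upload-and-poll flow is shown below; the endpoint paths, response fields, and result shape are assumptions made to illustrate the described interaction, not the system's actual API.

```typescript
// Upload a raster floor plan as FormData, then poll the BFF every 5 s until the
// recognition task identified by TaskId is finished.
async function recognizeFloorPlan(file: File): Promise<unknown> {
  const form = new FormData();
  form.append("file", file);

  // 1. upload the image; the BFF stores it on the OSS and queues the recognition task
  const upload = await fetch("/api/floorplan/upload", { method: "POST", body: form });
  const { taskId } = await upload.json();

  // 2. poll the BFF with the TaskId every 5 seconds until the result is available
  while (true) {
    await new Promise(resolve => setTimeout(resolve, 5000));
    const res = await fetch(`/api/floorplan/result/${taskId}`);
    const body = await res.json();
    if (body.status === "done") {
      // body.result would carry the scale and the vectorized wall/door/window primitives
      return body.result;
    }
  }
}
```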
If the current floor plan has no recognizable scale, the scale ruler appears at the center of the plan; the user can drag the scale bar and click the circular buttons at its two ends to adjust its length. After confirming its position, the user presses Enter, and the reconstructed 2D floor plan is drawn in the 2D canvas area, as shown in Figure 25, while the 3D reconstruction of the floor plan is displayed in the upper right corner. Clicking the Switch View button in the upper left corner swaps the display positions of the 2D and 3D views, as shown in Figure 26. After recognition is completed, a panel appears on the right for adjusting the transparency of the floor plan background and the display scale; the background transparency can be lowered to reduce interference from the background plan, or the background scale can be removed, so that the user can focus on the floor plan design.

5. Conclusions

The structure of deep-learning networks related to house plan recognition and understanding is first analyzed and introduced in some detail. Details are then presented on migrating the human key point detection network to the detection of the connection points of walls, doors, and windows in house plans: the connection points are encoded as typed key points, an overall network structure based on CPN—CPN-Floor—is proposed, and improvements from the choice of feature extraction network to the feature fusion network (FPN) are described to better adapt it to this task. The loss function and non-maximum suppression method are also introduced, which enable correct key points to be obtained for the subsequent post-processing that constructs candidate wall and door–window primitives. This paper further describes a house plan scale identification and computation method involving scale region splitting, scale marker endpoint region detection, scale endpoint corner detection, and scale digit recognition, which together compute the scale used to restore the true size of the house plan elements. The dataset and data preprocessing details used in this paper are described, and the effectiveness of the proposed algorithm is verified with comparative and ablation experiments. After post-processing, reliable vectorization of house plan walls, doors, windows, and other primitives can be achieved, even for conventional house plans.

Author Contributions

Conceptualization, J.C. and H.P.; methodology, H.P.; software, H.P.; validation, J.C., H.P. and Y.L. (Yunlei Lv); formal analysis, J.C.; investigation, J.W.; resources, J.C.; data curation, Y.L. (Yunlei Lv); writing—original draft preparation, H.P.; writing—review and editing, Y.L. (Yunlei Lv); visualization, H.P.; supervision, Y.L. (Yunlei Lv); project administration, J.W.; funding acquisition, Y.L. (Yaqiu Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Fundamental Research Funds for the Central Universities (No. 2572023CT15-03), the National Natural Science Foundation of China (No. 31370565), and the Cultivating Excellent Doctoral Dissertation of Forestry Engineering program (No. LYGCYB202008).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ahmed, S.; Liwicki, M.; Weber, M.; Dengel, A. Improved automatic analysis of architectural floor plans. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; Volume 10, pp. 864–869. [Google Scholar]
  2. Chen, G.; Kua, J.; Shum, S.; Naikal, N.; Carlberg, M.; Zakhor, A. Indoor localization algorithms for a human-operated backpack system. In 3D Data Processing Visualization, and Transmission; Citeseer: University Park, PA, USA, 2010; Volume 3. [Google Scholar]
  3. Macé, S.; Locteau, H.; Valveny, E.; Tabbone, S. A system to detect rooms in architectural floor plan images. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, Boston, MA, USA, 9–11 June 2010; pp. 167–174. [Google Scholar]
  4. Or, S.; Wong, K.H.; Yu, Y.; Chang, M.M.Y.; Kong, H. Highly automatic approach to architectural floorplan image understanding & model generation. Pattern Recognit. 2005, 25–32, 1–9. [Google Scholar]
  5. Moloo, R.K.; Dawood, M.A.S.; Auleear, A.S. 3-Phase Recognition Approach to Pseudo 3D Building Generation from 2D Floor Plan. Int. J. Comput. Graph. Animat. 2011, 1, 3. [Google Scholar]
  6. Ahmed, S.; Liwicki, M.; Weber, M.; Dengel, A. Automatic room detection and room labeling from architectural floor plans. In Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems, Gold Coast, Australia, 27–29 March 2012; pp. 339–343. [Google Scholar]
  7. Ahmed, S.; Weber, M.; Liwicki, M.; Dengel, A. Text/graphics segmentation in architectural floor plans. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 734–738. [Google Scholar]
  8. Lu, T.; Tai, C.L.; Su, F.; Cai, S. A new recognition model for electronic architectural drawings. Comput.-Aided Des. 2005, 37, 1053–1069. [Google Scholar]
  9. Lu, T.; Yang, H.; Yang, R.; Cai, S. Automatic analysis and integration of architectural drawings. Int. J. Doc. Anal. Recognit. (IJDAR) 2007, 9, 31–47. [Google Scholar] [CrossRef]
  10. Lu, T.; Tai, C.L.; Bao, L.; Su, F.; Cai, S. 3D reconstruction of detailed buildings from architectural drawings. Comput.-Aided Des. Appl. 2011, 2, 734–738. [Google Scholar] [CrossRef]
  11. De Las Heras, L.P.; Ahmed, S.; Liwicki, M.; Valveny, E.; Sánchez, G. Statistical segmentation and structural recognition for floor plan interpretation: Notation invariant structural element recognition. Int. J. Doc. Anal. Recognit. (IJDAR) 2014, 17, 221–237. [Google Scholar] [CrossRef]
  12. Shen, J.; Wang, Y.; Liang, T.; Gong, X. Automatic analysis of architectural floor plans based on deep learning and morphology. In Proceedings of the 4th International Conference on Information Science, Electrical, and Automation Engineering (ISEAE 2022), Hangzhou, China, 25–27 March 2022; SPIE: San Francisco, CA, USA, 2022; Volume 12257, pp. 245–251. [Google Scholar]
  13. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. Comput. Vis. Pattern Recognit. 2018, 1804, 1–6. [Google Scholar]
  14. Yamasaki, T.; Zhang, J.; Takada, Y. Apartment structure estimation using fully convolutional networks and graph model. In Proceedings of the 2018 ACM Workshop on Multimedia for Real Estate Tech, Yokohama, Japan, 11 June 2018; pp. 1–6. [Google Scholar]
  15. Lv, X.; Zhao, S.; Yu, X.; Zhao, B. Residential floor plan recognition and reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16717–16726. [Google Scholar]
  16. Zeng, Z.; Li, X.; Yu, Y.K.; Fu, C.W. Deep floor plan recognition using a multi-task network with room-boundary-guided attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9096–9104. [Google Scholar]
  17. Zhang, Y.; He, Y.; Zhu, S. The Direction-Aware, Learnable, Additive Kernels and the Adversarial Network for Deep Floor Plan Recognition. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 358–361. [Google Scholar]
  18. Dodge, S.; Xu, J.; Stenger, B. Parsing floor plan images. In Proceedings of the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp. 734–738. [Google Scholar]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [PubMed]
  20. Zhao, Y.; Deng, X.; Lai, H. Reconstructing BIM from 2D structural drawings for existing buildings. Autom. Constr. 2021, 128, 103750. [Google Scholar] [CrossRef]
  21. de las Heras, L.P.; Terrades, O.R.; Robles, S.; Sánchez, G. CVC-FP and SGT: A new database for structural floor plan analysis and its groundtruthing tool. Int. J. Doc. Anal. Recognit. (IJDAR) 2015, 18, 15–30. [Google Scholar] [CrossRef]
  22. Liu, C.; Schwing, A.G.; Kundu, K.; Urtasun, R.; Fidler, S. Rent3d: Floor-plan priors for monocular layout estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3413–3421. [Google Scholar]
  23. Kalervo, A.; Ylioinas, J.; Häikiö, M.; Karhu, A.; Kannala, J. Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis. In Proceedings of the Image Analysis: 21st Scandinavian Conference, SCIA, Norrköping, Sweden, 11–13 June 2019; pp. 28–40. [Google Scholar]
  24. Liu, C.; Wu, J.; Kohli, P.; Furukawa, Y. Raster-to-vector: Revisiting floorplan transformation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2195–2203. [Google Scholar]
  25. Pizarro, P.N.; Hitschfeld, N.; Sipiran, I. Large-scale multi-unit floor plan dataset for architectural plan analysis and recognition. Autom. Constr. 2023, 156, 105132. [Google Scholar]
  26. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  27. Zeng, P.; Gao, W.; Yin, J.; Xu, P.; Lu, S. Residential floor plans: Multi-conditional automatic generation using diffusion models. Autom. Constr. 2024, 162, 105374. [Google Scholar]
  28. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
  29. Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 56–72. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  31. Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Wei, F. Trocr: Transformer-based optical character recognition with pre-trained models. Proc. AAAI Conf. Artif. Intell. 2023, 37, 13094–13102. [Google Scholar] [CrossRef]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. Int. Conf. Learn. Represent. 2021. [Google Scholar]
  33. Lin, J.; Ren, X.; Zhang, Y.; Liu, G.; Wang, P.; Yang, A.; Zhou, C. Transferring General Multimodal Pretrained Models to Text Recognition. Find. Assoc. Comput. Linguist. 2023, 588–597. [Google Scholar]
  34. Kenton, J.; Toutanova, L. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. NAACL-HLT. 2019, 4171–4186. [Google Scholar]
  35. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  36. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  37. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations Computational and Biological Learning Society, San Diego, CA, USA, 7–9 May 2015; pp. 85–97. [Google Scholar]
  38. Tompson, J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
Figure 1. HSL color space model.
Figure 2. ReLU function images.
Figure 3. CPN network structure.
Figure 4. CPN network structure.
Figure 5. Yolov8 network structure diagram.
Figure 6. CPN-Floor network architecture diagram.
Figure 7. Three-way decision model for the determination of MDF continuous flattening deviation grade grain structure.
Figure 8. Definition of critical point categories for walls.
Figure 9. Example of house type diagram for dataset.
Figure 10. Maximum Enclosure Box for Household Area.
Figure 11. Household type categorization.
Figure 12. Key points of (a) walls and (b) doors and (c) windows.
Figure 13. Example of house type diagram for dataset.
Figure 14. Household key point detection network ablation experiment.
Figure 15. Key point detection results for all connection points.
Figure 16. Example of Household Primitive Labeling Data.
Figure 17. Household type key point detection results.
Figure 18. Key point detection results for all connection points.
Figure 19. Vectorization results of window type primitives with sloping walls.
Figure 20. House Plan Reconstruction of Windows.
Figure 21. Household 2D rendering structure.
Figure 22. System architecture diagram household plan recognition.
Figure 23. System functional module diagram.
Figure 24. Flowchart for household identification and reconstruction.
Figure 25. Household reconstruction results in 2D view.
Figure 26. Household reconstruction result in 3D viewpoint.
Table 1. Structure of ConvNeXt-Base network.

Stage  | Downsampling Times | Network Structure
stem   | 4                  | 4 × 4 Conv, c = 128, s = 4
stage2 | 4                  | c = 128, ConvNeXt Block × 3
down2  | 8                  | 2 × 2 Conv, c = 256, s = 2
stage3 | 8                  | c = 256, ConvNeXt Block × 3
down3  | 16                 | 2 × 2 Conv, c = 512, s = 2
stage4 | 16                 | c = 512, ConvNeXt Block × 27
down4  | 32                 | 2 × 2 Conv, c = 1024, s = 2
stage5 | 32                 | c = 1024, ConvNeXt Block × 3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
