1. Introduction
This work is a continuation of previous studies by one of the authors [1, 2, 3] on the completeness of data contained in the Polish Price and Value Register (PVR). PVR constitutes an element of the Land and Building Register [4], and is an important source of data on real estate [5, 6, 7] used by real estate appraisers to estimate property values. PVR also plays a significant role in real estate management, spatial policy, sustainable development policy, and the tax system [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. PVR data have become the subject of abundant research indicating their incompleteness or low quality [5, 19, 20, 21, 22]. Real estate appraisal largely consists of estimating a property’s value based on the transaction prices of similar buildings. A lack of data is the main reason for which a certified real estate appraiser must reject a specific transaction, and, given the small number of transactions on a local market, real estate appraisal becomes problematic.
One of the authors analyzed 829 transactions recorded in PVR for communes of the Koszalin and Kołobrzeg districts in the years 2010–2017, and found that data incompleteness is especially common for parcels with residential buildings. Data on transaction date, real estate location, plot size, and ownership type were fully available. Information on the construction material of building walls and on the number of stories was present in PVR in 63% of cases, while the construction year was available in around 40%. Data on the usable area were even less available: PVR provided them in only around 30% of transactions. Therefore, in terms of data completeness, the studied register is of limited use for real estate valuation, mainly because it lacks information about the building’s usable area, which is the main feature taken into account when assessing whether a building is similar to the one being appraised.
The aim of this paper is to propose methods of estimating the usable area of residential buildings using Light Detection and Ranging (LiDAR) data as well as the Database of Topographic Objects (BDOT10k). A successful method may overcome limitations of PVR, and increase the number of available similar buildings needed for real estate appraisal in the given area. This method is not intended to replace the standard interior measurements of usable area, but to give an accurate estimation when only limited topographic data on the building are available. The accuracy of usable area estimation by different methods should be checked by applying them to the existing single-family houses with known usable areas. This step, however, is complicated by the fact that Polish law lacks consistent rules on how to calculate this property of buildings and premises.
Usable area definitions differ between currently binding Polish acts [23] and depend on the purpose of the calculation. Various definitions can be found in the Act on Local Taxes and Charges [24], the Act on Tax on Inheritance and Donations [25], the Tenants’ Rights, Municipal Housing Stock and the Civil Code Amendment Act [26], the Regulation of the Minister of Justice on the Establishment and Maintenance of Land and Mortgage Registers in an IT System [27], Polish Standards (PSs), and international standards. The data on usable area contained in PVR, concerning both buildings and premises, should be consistent [22] with the definition from the act on tenants’ rights, which is the following: “area of all the spaces in a building, in particular, the rooms, kitchens, pantries, lobbies, alcoves, halls, corridors, bathrooms and other rooms used for residential and housekeeping needs of the tenant, whatever their actual purpose or way of use; the usable area does not include the area of balconies, terraces, loggias, entresols, wardrobes, recessed wall cubbies, laundry rooms, drying rooms, baby carriage rooms, attics, cellars, and fuel storage rooms” [26].
The regulations concerning real estate appraisal, where they give detailed rules on how to calculate the usable area, mainly refer back to PSs, which were written by the Polish Committee for Standardization and introduced by the Normalization Act of 3 April 1993 [28]. According to this act, application of a PS was voluntary; however, it could be made obligatory by a ministerial regulation or if the PS was explicitly mentioned in any act. Using a PS became voluntary without exception with the Normalization Act of 12 September 2002 [29]. From 1971, the standard PN-B-02365:1970 [30] was in force and was commonly used to calculate the usable area [31]. In 1998, it was replaced by PN-ISO 9836:1997 [32], which was introduced as identical to the international standard ISO 9836:1992 [33].
Both standards indicate that, when calculating the usable area, construction and partition walls and structural columns should not be included, and that measurement precision should be within 0.01 m. However, the differences between them, presented in Table 1, result in discrepancies of a few percent [23, 31].
Starting from 29 April 2012, the use of PN-ISO 9836:1997 became mandatory when calculating the usable area of single-family houses and premises, under the Regulation of the Minister of Transport, Construction and Maritime Economy of 25 April 2012 on Detailed Scope and Form of a Construction Project [34], with two additional rules, “(1) a premises is a self-contained housing unit composed of a room or rooms separated with permanent walls from the rest of the building, allocated for a people’s continuous stay, with which auxiliary rooms serve their housing needs, (2) rooms or their parts with a height equal to, or greater than, 2.20 m shall be included in the calculation of usable area in 100%, with a height from 1.40 m to 2.20 m, in 50%, while with a height of less than 1.40 m shall be omitted” [34]. An important consequence of the first rule is that partition walls are ignored, unlike in the previous standards. As such, until 1999, usable areas were calculated using PN-B-02365:1970; from 1999 to 2012, both standards were applicable; and, finally, in 2012, PN-ISO 9836:1997 with the two additional rules became obligatory for newly built single-family houses and premises. To account for these changes, this study takes into account both standards as well as the 1997 standard with the two rules (written as PN-ISO 9836:1997 (+2012) from now on). Both standards are currently ‘withdrawn’ by the Polish Committee for Standardization, which has been recommending PN-ISO 9836:2015-12 [35] since 2015, but, because the law has not been amended, the newest standard remains unused.
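The height rule quoted above from the 2012 regulation can be illustrated with a minimal sketch. The function names and the example room are hypothetical and serve only to show how the 100%, 50%, and 0% weights would apply to parts of a room with different clear heights.

```python
def height_weight(height_m: float) -> float:
    """Share of floor area counted toward usable area for a given clear height."""
    if height_m >= 2.20:
        return 1.0   # counted in 100%
    if height_m >= 1.40:
        return 0.5   # counted in 50%
    return 0.0       # omitted


def usable_area(parts):
    """Sum floor areas (m^2) weighted by clear height (m).

    `parts` is an iterable of (floor_area_m2, clear_height_m) pairs.
    """
    return sum(area * height_weight(height) for area, height in parts)


# Hypothetical room: 12 m^2 at full height, 6 m^2 under a lowered ceiling,
# and 3 m^2 too low to count.
room = [(12.0, 2.50), (6.0, 1.80), (3.0, 1.20)]
print(usable_area(room))  # 12 * 1.0 + 6 * 0.5 + 3 * 0.0 = 15.0
```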
The only approach to estimating the usable area of single-family houses that has already been developed and is known to the authors is the method of Benduch and Hanus, based on geometric and descriptive data of buildings contained in PVR [22]. In three variants, differing in level of detail, Benduch and Hanus used existing geometric data of a building, the number of overground and underground stories, information on the material used for the construction of external walls, and the total number of chambers. The accuracy of the most detailed variant was extremely high; however, the study was conducted for only two residential buildings. Moreover, its main limitation is the necessity of trusting data contained in PVR, which have already been shown to be both incomplete and occasionally unreliable [5, 20, 21].
In this study, we harnessed well-known methods developed by the machine learning (ML) community to estimate the usable area of single-family houses using data provided by LiDAR and BDOT10k. As such, we entered the booming area of research that combines ML methods with LiDAR-based information [36] and that has already tackled problems such as the detection of buildings [37] and archaeological objects [38] as well as tree species classification [39]. We began with a detailed analysis of data on project buildings obtained mostly from the design offices Lipińscy [40] and Archon [41], available online. In order to find outliers and understand dependencies in the data, we implemented a simple formula that estimates the usable area in three different standards, using detailed information on the analyzed buildings and architectural assumptions concerning, e.g., wall thickness and room height. Then, we trained linear regression and neural network models on the described data with usable areas in PN-ISO 9836:1997, using a minimal amount of information on every building, and we tested their performance. Finally, we applied the chosen trained model to single-family houses in Koszalin, described with data provided by LiDAR and BDOT10k, and we checked its performance by comparing its outputs to the usable areas contained in PVR, taking into account that these may have been calculated in a different standard than PN-ISO 9836:1997.
3. Results
3.1. Formula Based on Architectural Assumptions for Model Houses from the Design Offices
In order to better understand the dataset and find outliers, a simple mathematical formula was designed, described in detail in Section 2.3. We used it to estimate usable areas of 96 residential single-family buildings from the design offices, according to the PN-ISO 9836:1997 standard. The comparison of the results and true usable areas is presented in Figure 1.
The formula provided very accurate estimations for buildings without garages, with a mean error of 3.48%, a median error of 2.46%, and R² = 98.28%. We identified two features that made buildings outliers in the dataset, understood here as buildings whose usable areas were estimated with the largest errors. The main one was an extremely small or large covered area, which caused our assumptions on wall thickness to fail. The usable area estimated with the largest error, 16.5%, was that of a holiday house with a covered area of 54.93 m², the smallest in the dataset, with external walls 30 cm thick and no internal construction walls. The second feature that worsened the estimation was an unusually small number of partition walls.
The accuracy of the formula was significantly worse for buildings with garages, with a mean error of 8.83%, a median error of 7.18%, and R² = 94.77%. This change was due to the variance in garage size. For one-spot garages, areas range from 15.74 to 24.9 m², while for two-spot garages they range from 29.07 to 44.19 m². We expected that the neural network models would describe this dependency more accurately.
The formula calculates the usable area according to any of the three standards: PN-B-02365:1970, PN-ISO 9836:1997, and PN-ISO 9836:1997 (+2012). In the analyzed dataset, the usable area according to PN-ISO 9836:1997 was larger than according to PN-ISO 9836:1997 (+2012) by 4.8 m² on average. In half of the buildings, the usable area following PN-B-02365:1970 was larger than that following PN-ISO 9836:1997 by 3.8–9.4 m², while in 41 houses it was smaller by 7.4–13 m².
In total, the formula exhibited a high accuracy in estimating the usable area of the design offices’ buildings, with a mean error of 6.10%, a median error of 4.41%, and R² = 95.37%. Finally, we noticed no differences in estimation error between the 68 original buildings and the 28 added ones that we created to expand the dataset.
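The statistics reported in this subsection (mean error, median error, and the coefficient of determination) can be reproduced for any set of predictions with a short sketch like the one below. It assumes that the error of a single building is the absolute relative difference between the estimated and true usable area, and that the coefficient of determination is expressed in percent; the function names and the three example values are hypothetical.

```python
from statistics import mean, median


def percentage_errors(true_areas, predicted_areas):
    """Absolute relative error of each prediction, in percent (assumed definition)."""
    return [abs(p - t) / t * 100.0 for t, p in zip(true_areas, predicted_areas)]


def r_squared_percent(true_areas, predicted_areas):
    """Coefficient of determination, expressed in percent."""
    avg = mean(true_areas)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_areas, predicted_areas))
    ss_tot = sum((t - avg) ** 2 for t in true_areas)
    return (1.0 - ss_res / ss_tot) * 100.0


# Hypothetical usable areas in m^2, for illustration only.
true_areas = [100.0, 150.0, 200.0]
predicted = [104.0, 147.0, 210.0]

errors = percentage_errors(true_areas, predicted)
print(f"mean error:   {mean(errors):.2f}%")
print(f"median error: {median(errors):.2f}%")
print(f"R^2:          {r_squared_percent(true_areas, predicted):.2f}%")
```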
3.2. The Design Offices’ Buildings: Without Garages and Extensions
In this section, we present the predictions of a linear regression model and a neural network model trained with SGD with momentum on the dataset containing only buildings without garages and extensions. This dataset contains 21 original buildings and 27 artificial ones, added as described in Section 2.2. It was isolated for two purposes. The first was to check, when training the models on this dataset, whether the artificial data introduced a data mismatch, as described in Section 2.4. The second was to check the intuition that buildings without garages and extensions are simpler and, as such, should be described by the models with a higher accuracy.
This dataset was divided into test, validation, and bridge sets, each containing eight elements. The test and validation sets contained only original buildings, while the bridge set contained only added ones. We tested how the accuracy of the models’ predictions depends on the number of building features fed to the model. The predictions of the best models found are presented in Table 5 and Figure 2. “The best” here means the highest accuracy achieved with the simplest possible architecture.
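The split described above can be sketched as follows. The test and validation sets are drawn from original buildings only and the bridge set from artificial ones, so that a gap between the bridge and validation errors would point to a data mismatch. The function name, the identifiers, and the choice to place all remaining buildings in the training set are assumptions made for illustration.

```python
import random


def split_with_bridge(original, artificial, n_eval=8, seed=0):
    """Split building records into test, validation, bridge, and training sets."""
    rng = random.Random(seed)
    original = list(original)
    artificial = list(artificial)
    rng.shuffle(original)
    rng.shuffle(artificial)
    return {
        "test": original[:n_eval],                  # original buildings only
        "validation": original[n_eval:2 * n_eval],  # original buildings only
        "bridge": artificial[:n_eval],              # artificial buildings only
        # Assumption: everything left over is used for training.
        "train": original[2 * n_eval:] + artificial[n_eval:],
    }


# 21 original and 27 artificial buildings, as in this subsection.
original_ids = [f"orig_{i}" for i in range(21)]
artificial_ids = [f"art_{i}" for i in range(27)]
splits = split_with_bridge(original_ids, artificial_ids)
print({name: len(items) for name, items in splits.items()})
```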
First of all, the comparison of the models’ performance on the bridge and validation data showed that there was no significant data mismatch resulting from the artificial extension of the dataset. Secondly, as seen in Table 5, even a model as simple as linear regression can capture, with acceptable accuracy, the relationship between the buildings’ geometric data and their usable areas. For every set of input features, however, the NN performed significantly better than the linear regression. What is also interesting is the simplicity of the NN architectures that predicted usable areas with the best accuracy. In all cases, the NNs consisted of only one hidden layer, with the number of units ranging from 8 to 32.
The largest errors, starting from 20%, concerned the estimation of usable areas of buildings with more than one story. Apparently, the linear regression model did not accurately account for the number of stories, which can additionally be seen in Figure 2. This is understandable, as the model can only find the best weights of the features and add a bias, with no possibility of extracting more complex relationships between them. The NNs, however, overcame this limitation and successfully learned the dependency of the usable area on the number of stories, reducing the maximum error to the order of 7%. However, they achieved poorer results when the height of the building, H, was provided instead of the number of stories, which is disappointing, as LiDAR data are in general much more reliable than PVR.
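The limitation of linear regression noted above can be illustrated on synthetic data. The sketch below assumes, purely for illustration, that the usable area equals 0.8 times the covered area times the number of stories; neither this relation nor the 0.8 ratio comes from the study. A model that only weights the two features and adds a bias cannot represent the product, while the same least-squares fit with an added product feature reproduces it.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations; a bias column is appended."""
    x = [list(r) + [1.0] for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    return sum(w * v for w, v in zip(weights, list(row) + [1.0]))


def max_percent_error(weights, rows, targets):
    return max(abs(predict(weights, r) - t) / t * 100.0 for r, t in zip(rows, targets))


# Synthetic buildings: covered area (m^2) and number of stories.
areas = [60.0, 80.0, 100.0, 120.0, 140.0, 160.0]
rows = [(a, n) for a in areas for n in (1.0, 2.0)]
targets = [0.8 * a * n for a, n in rows]  # assumed multiplicative relation

# Weights and bias only: no way to express area times stories.
additive = fit_least_squares(rows, targets)
additive_error = max_percent_error(additive, rows, targets)

# Same fit with an explicit product feature.
rows_with_product = [(a, n, a * n) for a, n in rows]
interaction = fit_least_squares(rows_with_product, targets)
interaction_error = max_percent_error(interaction, rows_with_product, targets)

print(f"max error, additive model:       {additive_error:.2f}%")
print(f"max error, with product feature: {interaction_error:.6f}%")
```

A neural network with a hidden layer can approximate such an interaction from the raw features, which is one plausible reading of why the NNs handled multi-story buildings better.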
What is surprising is the models’ great performance with only two input features: the covered area and the number of stories. This set-up was actually the most successful one in the case of the NN. The same was true for the linear regression if we ignore its inability to correctly account for more than one story. The mean errors of 3.34% for the NN and 2.96% (on one-story buildings only) for the linear regression reflect the variability of wall density between buildings, which cannot be extracted from the provided input data.
3.3. The Design Offices’ Buildings: Full Dataset
Having confirmed in the previous subsection that there is no data mismatch between the artificially added data and the originals from the design offices, we divided the full dataset into a 15-element test set, a 15-element validation set, and a 66-element training set. As the data on the buildings’ perimeter, width, etc. did not improve the models’ predictions, we first used only the covered area and the number of stories. Then, we observed the accuracy increase with the introduction of data on the garages and buildings’ extensions. As we showed in the previous subsection, the linear regression model was outperformed by NNs in every set-up; thus, from this point on, we focused solely on these more complex models. The predictions of the best models found are presented in Table 6.
The results showed that the information on garages and extensions of the buildings had to be provided to the model in order to reproduce the NN’s accuracy from Section 3.2. These two features greatly impact the resulting usable area of the building, and the model cannot guess them based only on the covered area and the number of stories. In these set-ups, more complex NN architectures were also needed to capture the dependencies between features.
Unsurprisingly, the best results were achieved when the garage area was given explicitly to the model. In this case, the mean error of 2.3% comes mostly from the variance of partition wall density between houses, which is impossible to derive from topographical data of the building. The increase in error between the fifth set-up, with explicitly given garage areas, and the third set-up, with only the number of garage spots given, comes entirely from the diversity in garage sizes. However, judging by its performance on the third set-up, where it reached R² of 97.74% and a maximum error as low as 10.71%, the NN evidently learned a more complex relationship between the garage size and the other building features than just an average area corresponding to each number of garage spots. This is a promising result, as topographic data usually cannot provide the exact garage area. As in the previous subsection, the use of height instead of the number of stories resulted in an accuracy decrease, with a mean error of 8.41% and R² = 87.55%.
3.4. Koszalin Buildings
In this subsection, we present inference results of the best model found, namely NN (4-64-8-1), on the set of 29 Koszalin single-family buildings. The data were described in detail in Section 2.1. Results are presented in Figure 3.
The mean error for the whole dataset amounted to 13.05%. The median error was 10.44%, and the maximum and minimum errors were equal to 37.31% and 1.47%, respectively. The model’s inference resulted in the coefficient of determination, %. This significant decrease in accuracy has a number of causes. First of all, none of the 96 buildings from the design offices, on which the NN was trained, has a balcony that lies within the covered area and is not on top of an extension. At the same time, out of the 29 Koszalin buildings, only 10 have no balcony meeting these criteria. Out of the 19 houses that have such balconies, seven are characterized by large balcony areas, reaching even 30 m². The usable area of each of these buildings was strongly overestimated by the NN.
Removing these seven buildings resulted in much better prediction statistics. The mean error reached 8.6% and the median error 6.86%, while the maximum and minimum errors amounted to 18.97% and 1.47%, respectively, with R² equal to 84.23%. Such errors were expected for a few reasons. First of all, we do not know the standard according to which the usable areas contained in PVR were calculated. The NN was trained on the PN-ISO 9836:1997 standard and gave predictions following it. Secondly, owners reporting the usable area to PVR could have reported it as calculated for tax purposes, for which the calculation is done in a very different way. Last but not least, the analyzed buildings from the Rokosowo precinct belong to older architecture, having been built at least 30 years ago, whereas the NN was trained on the design offices’ data, which may follow a more modern architectural approach. Nonetheless, the test of the NN on Koszalin buildings proved useful: firstly, it revealed the model’s weakness regarding balconies; secondly, the NN’s accuracy still turned out to be acceptable.
4. Discussion
In this work, we focused on residential single-family buildings with flat roofs. Light Detection and Ranging data in Koszalin exhibit the first level of detail, in which buildings are represented as blocks. To properly estimate the usable areas of houses with more complicated roofs, a higher level of detail data is needed.
Within this study, we prepared two datasets of flat-roof single-family houses. The first was built out of data on buildings from the design offices’ projects available online, and contains 96 examples. The second one consists of data on 29 houses located within the Rokosowo precinct in Koszalin, Poland, provided by the Database of Topographic Objects, Light Detection and Ranging, Price and Value Register, and Google Street View.
On the dataset gathered from the design offices, we trained and tested different models to predict usable areas of houses based on their three-dimensional models. Firstly, we analyzed the performance of the mathematical formula based on architectural assumptions. It exhibited a high accuracy, with a mean error of 6.10%, a median error of 4.41%, and R² = 95.37%; however, it required highly detailed information on the building. To minimize the amount of data needed, we moved to machine learning methods, and we found that a model as simple as linear regression can estimate with great accuracy the usable area of one-story buildings without garages and extensions, having as input only the covered area of the building. The mean error of its predictions was as low as 2.96%. To correctly account for more than one story, garages, and extensions, a neural network model with two hidden layers of 64 and 8 units, respectively, was needed. Its mean error amounted to 3.37%, with R² as high as 97.74%. Finally, we tested this neural network, trained on the first dataset, on 29 Koszalin houses. After excluding seven buildings with large balconies, the mean error was below 9%, with R² equal to 84.23%. Its performance can therefore be evaluated as satisfactory, especially taking into account that we cannot fully trust the data contained in the Price and Value Register and that the recorded usable areas were calculated both in an unknown standard and for an unknown purpose.
While assessing the results as accurate and very promising in terms of possible applications, we acknowledge weaknesses of the designed and trained model. First of all, none of the buildings on which the model was trained has a balcony within the covered area. Among the Koszalin buildings, the largest balconies have areas of the order of 30 m², and this is the error that the model is bound to make. Secondly, no method estimating a building’s usable area from its three-dimensional model is able to guess architectural solutions that significantly impact the usable area but are invisible from the outside, like entresols. There is also a possibility that the training data, namely the design offices’ model buildings, do not exhibit the same diversity of architectural solutions that exists in reality. Nonetheless, they offer a concise source of data with minimized human error, calculated in a known standard. Lastly, the model suffers from an accuracy decrease if the building’s height is provided instead of the number of stories. There are three solutions to this problem. The first is to use Google Street View to determine the number of stories. The second is to apply the mathematical formula we designed to estimate the number of stories from the height. Finally, one can provide the height as a feature and use the model with slightly worse accuracy. It is also important to note that, even if the presented model were significantly improved and provided excellent results, its legal adoption and implementation in the Polish Price and Value Register may prove very challenging.
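To make the second of the three solutions above concrete, the sketch below shows one simple way a number of stories could be derived from a building height for flat-roof houses. It is not the formula designed in this study, which is not reproduced in this section; the function name, the rounding rule, and the assumed story height of 3.0 m are all hypothetical.

```python
import math


def estimate_stories(height_m: float, story_height_m: float = 3.0) -> int:
    """Round the building height to the nearest whole number of stories (at least one).

    story_height_m is an assumed floor-to-floor height, not a value from the study.
    """
    if height_m <= 0 or story_height_m <= 0:
        raise ValueError("heights must be positive")
    return max(1, math.floor(height_m / story_height_m + 0.5))


# Hypothetical building heights in meters.
for height in (3.1, 6.2, 9.4):
    print(height, estimate_stories(height))
```

Any such rule is sensitive to the assumed story height, so its output would need to be checked against a sample of buildings with a known number of stories.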
A possible extension of this work is to apply neural networks (or other machine learning models) to estimate usable areas from topographic data of houses with more complex roofs, like gable or hip ones. To achieve this, topographic data of at least the second level of detail are needed, which enable recognizing the roof structure as well as secondary construction elements. Moreover, data of the third level of detail should account for balconies and therefore present the full picture needed to calculate the usable area. While topographic data of the second level of detail are available in the eastern part of Poland, the third level is still not available to the general public. Such detailed three-dimensional models of houses could then be processed by a chosen model, e.g., a convolutional neural network, which would provide an estimation of the usable area.