#### *4.1. Studies Description*

Of the 46 selected studies, five were conference papers and 41 were journal articles. By far the largest number of publications corresponded to the *Journal of Construction Engineering and Management*, with 11 studies (24% of the total). The studies dated from 1974 to 2022, but only two of them were published before 2000: Elhag and Boussabaine [48] and Karshenas [49]. These papers contain seminal material in the area of cost modelling of building projects. As can be seen in Figure 3, the number of publications in the research area increased from 2000 until 2014–2015, with a spike in 2004–2005. From 2014–2015 until 2018–2019 research activity decreased, and in the last period, 2020–2022, publications increased again. The earlier reduction in publications suggests that the research area may have reached a level of maturity at which a next stage of research may be appropriate to explore. The graph in the same figure presents Korea as the most prolific country, followed by the United States, with 17 and five studies, respectively. The Korean presence in the research area can be explained by the dedication of researchers, such as Gwang-Hee Kim and Sae-Hyun Ji, who together are authors of 13 of the 17 studies.

**Figure 3.** Statistical properties of the publications: (**a**) biannual distribution of publications of the review; and (**b**) distribution of publications per country.

The top 10 most cited documents in Google Scholar are shown in Table 1. Kim et al. [50] presents the highest number of citations (617) and was the first publication to compare the most promising techniques for cost estimation: Multiple Regression Analysis (MRA), Artificial Neural Networks (ANN), and Case-Based Reasoning (CBR). In that study, the high accuracy achieved by the three techniques and, particularly, the transparency of CBR in explaining its results suggest that predictive analytics techniques can be a feasible alternative to traditional cost estimation in the early stages of projects. Kim et al. [50] and the rest of the top 10 publications, each with over 100 citations, have become references in the research area of cost modelling, not only for building projects but for construction projects in general.

#### *4.2. Models Input Parameters*

Even though the performance of cost models relies heavily on the appropriate identification of the cost drivers, the available data is the fundamental input for elaborating the models. This section starts by presenting the relevant features of the data used in the studies, such as data source, type of buildings, and quantity of data. Next, the two approaches used to identify and select the parameters from the data are presented. Then, the most predominant parameters used in the studies are shown in the form of an aggregated ranking.


**Table 1.** Most cited papers.

#### 4.2.1. Data Utilised in the Studies

In predictive analytics, the data used for modelling should, ideally, be extracted from a population of similar characteristics to achieve more accurate predictions (Shmueli and Koppius [12]). In this sense, prediction accuracy is strongly linked to the data characteristics. The general type of building identified in the systematic literature review was multistorey, with subclassifications according to use, e.g., residential, schools, offices, or mixed use. In addition, seven studies specified the structure type of the buildings used. The source of data was also not uniform. Twenty-three studies stated that their data originated from general contractors, public databases, theses, and other public and private organisations, with general contractors and databases being the most commonly used sources; the remaining 22 did not provide details about the source of their data. Transparency in this regard is an issue to improve in the research domain, because the reliability of the input data is crucial to achieving reliable results [10].

#### 4.2.2. Qualitative Identification/Selection Approach

Selecting the initial parameters is a fundamental step in the modelling process. Shmueli and Koppius [12] and Elmousalami [15] identify the first of two phases as a qualitative process in which combining domain knowledge, theory, and exploratory analysis is fundamental to justify the inclusion of inputs. The methods used to identify the potential parameters, and the number of related studies, are shown in Table 2: 23 studies identified potential parameters from literature reviews and/or expert knowledge, six used the researchers' own criteria, two selected the parameters from the available data, and the rest did not specify the selection process. Notably, publications from journals provided the initial parameters for several studies [53,54,60–64]. Expert knowledge was compiled through interviews and questionnaire surveys. Elaborate techniques to acquire information, such as Likert scales, the Delphi method, and the Analytic Hierarchy Process, are standard according to Elmousalami [15], but only five studies implemented them.


**Table 2.** Methods used to identify the parameters and number of related studies.

The process followed in the studies to identify potential parameters can be improved in two respects. First, both expert knowledge and previous literature should be used, in order to increase the credibility of the outcomes and to improve the models' performance. Predictive analytics is a relatively new area of research that has evolved with developments in informatics; its guidelines are therefore still being tested, but robustness in research needs to be a priority regardless of innovations in technology. Second, experts in the area of cost estimation and architects were surveyed, but developers' knowledge was considered only in Stoy et al. [65], even though developers are the individuals making crucial decisions regarding investment options in the early stages of projects.

#### 4.2.3. Quantitative Identification/Selection Approach

Dimension reduction is a method within exploratory data analysis used to reduce the number of parameters and to increase predictive accuracy [12,15]. Of the 46 studies, 27 utilised exploratory methods, which were also used to weight the parameters in the CBR models [59,66–69]. Table 3 shows the parameter optimisation methods reviewed and the number of related studies. Nine of the studies implemented stepwise regression analysis. Methods such as PCA, Correlation Analysis, and Factor Analysis are commonly used to analyse cause–effect relationships, but they also reduce the number of parameters, leading to more accurate models. Although the main objective of predictive analytics is to produce models that forecast costs, the techniques used in the studies can also determine the strength of the relationship between parameters and the relative strength of their effect on the output. This information can guide decision-makers in the subsequent stages to optimise the building features in the design stage.
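As a minimal illustration of dimension reduction of this kind, the sketch below applies PCA (computed via SVD) to a synthetic parameter matrix. The data, the number of candidate parameters, and the 95% variance threshold are hypothetical choices for the example, not values taken from the reviewed studies.

```python
import numpy as np

# Synthetic parameter matrix: rows = projects, columns = candidate cost drivers.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                     # 60 projects, 8 candidate parameters
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=60)   # one redundant, highly correlated driver

# PCA via SVD on the centred data.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                  # fraction of variance per component

# Keep just enough components to cover ~95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_reduced = Xc @ Vt[:k].T                        # projects now described by k < 8 components
print(X_reduced.shape)
```

Because one of the eight columns is nearly a copy of another, fewer than eight components suffice to retain almost all of the variance, which is exactly the redundancy that PCA-style methods exploit.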

**Table 3.** Methods used to optimise the parameters.


#### 4.2.4. Parameters Used

The size of the dataset has significant effects on the accuracy of the model: the more extensive the database, the less sample variance and model bias are obtained. In addition, testing the modelling process requires additional data. Shmueli and Koppius [12] stated that guidelines for setting the minimum data size are difficult to define, although a commonly used rule of thumb of 10 data points per parameter is considered reasonable in computer experiments [86]. Following this criterion, 19 of the 46 studies had fewer than 10 data points per parameter, 24 had 10 or more, and three did not mention the total number of data points. Although a meta-analysis was not performed in this review, the average MAPE of the studies using 10 or more data points per parameter was 7.6%, whereas the studies using fewer than 10 data points per parameter achieved an average MAPE of 10.7%. This suggests that more extensive data relative to the number of parameters may produce better results.
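The rule of thumb above amounts to a simple check of the ratio between data points and parameters. The function name and the example figures below are illustrative, not drawn from the reviewed studies.

```python
def meets_rule_of_thumb(n_datapoints: int, n_parameters: int, ratio: int = 10) -> bool:
    """Check the rule of thumb that a dataset should contain
    at least `ratio` data points per model parameter (10 by default)."""
    return n_datapoints >= ratio * n_parameters

# Hypothetical examples: a model with 8 parameters would need at least 80 projects.
print(meets_rule_of_thumb(100, 8))   # True: enough data
print(meets_rule_of_thumb(50, 8))    # False: under-sampled
```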

The studies considered different parameters for their models, classifying them as quantitative or qualitative. Twenty-seven of the 46 studies (59%) provided the parameters used in the models in the form of ranked lists. The authors developed these lists using the methods from the quantitative approach and mean-sensitivity ANN analyses of the results of the modelling processes. The Borda–Kendall technique was used to synthesise the individual rankings into one aggregated ranking list, providing a generic view of the relative importance of the parameters across the studies.

For the calculation of the ranking of parameters, the Borda rule is represented as the vector of weights:

$$w = (n, n-1, \dots, 2, 1),\tag{1}$$

which applies to a set of complete or partial ranked lists of *n* alternatives, where $w_i$ is the weight attached to an alternative located at the *i*th rank in any given list. Then, the cumulative score $Cs_i$ for the *i*th alternative is given by:

$$Cs_i = \sum_j w_{i|j},\tag{2}$$

which is the weighted sum over all the lists *j*, where $w_{i|j}$ corresponds to the rank of the *i*th alternative in list *j* [87].

In this study, the total number of alternative parameters *n* across the 27 lists was 78, so a parameter in first place on a list received a score of 78, one in second place a score of 77, and so forth. The sum of scores for each parameter then yielded the aggregated rank.
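The Borda scoring described above can be sketched as follows. The parameter lists and the value of *n* are hypothetical, shown only to illustrate the aggregation; they are not the review's actual 27 lists of 78 parameters.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists, n):
    """Aggregate (possibly partial) ranked lists using Borda weights
    w = (n, n-1, ..., 1): an item at rank i (1-based) scores n - i + 1."""
    scores = defaultdict(int)
    for lst in ranked_lists:
        for i, item in enumerate(lst, start=1):
            scores[item] += n - i + 1
    # Highest cumulative score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Three hypothetical partial rankings over n = 5 distinct parameters.
lists = [
    ["GFA", "floors", "structure type"],
    ["GFA", "location", "floors"],
    ["floors", "GFA", "roof type"],
]
ranking = borda_aggregate(lists, n=5)
print(ranking[0])  # ('GFA', 14): scores 5 + 5 + 4 across the three lists
```

Parameters missing from a partial list simply receive no score from it, which matches the treatment of partial ranked lists in the rule above.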

Note that the ranking aggregates data from different locations, and further examination would be required before it could be considered a representative ranking of general buildings across locations.

The rank aggregation produced a ranking of 78 parameters; the 10 with the highest scores are shown in Table 4. Gross Floor Area (GFA) and the number of floors are the two most important parameters, with significantly higher scores than the rest. The remaining parameters may not be the principal sources of cost, but their consideration in the elaboration of cost models may increase the models' predictive power. Notably, the parameters of foundation type, roof type, structure type, and location are measured on categorical scales; therefore, the ability of predictive analytics to deal with categorical scales enhances its usability for cost estimation.


**Table 4.** Ranked parameters.

#### *4.3. Predictive Power*

Predictive accuracy, also known as predictive power, is the model's ability to produce accurate predictions of new observations [12]. Two criteria need to be met for an adequate test of predictive performance: assessment of the model's accuracy using adequate predictive measures, and determination of the appropriate validation method [12]. Root Mean Square Error (RMSE), Mean Square Error (MSE), and MAPE were the most commonly used generic predictive measures, but the first two are scale-dependent and should not be used when comparing across datasets with different scales [88]. MAPE, being scale-independent, was an appropriate measure for analysing the studies' models under a standard accuracy measurement. For the second criterion, the review synthesised the method of validation, which defines how the data is partitioned and tested for accuracy. The following subsection introduces the accuracy measurements in the studies, followed by the validation methods.
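A small numerical illustration of why RMSE is scale-dependent while MAPE is not (the cost figures below are made up): rescaling the same data, e.g., by expressing costs in a different currency or unit, rescales RMSE by the same factor but leaves MAPE untouched.

```python
import numpy as np

def rmse(actual, pred):
    # Root Mean Square Error: carries the units of the data.
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mape(actual, pred):
    # Mean Absolute Percentage Error: unitless, expressed in %.
    return float(np.mean(np.abs((actual - pred) / actual)) * 100)

# The same three projects expressed in two unit systems
# (the second is simply the first rescaled by 1000).
actual = np.array([2.0, 4.0, 5.0])
pred = np.array([2.2, 3.8, 5.5])

print(rmse(actual, pred), rmse(actual * 1000, pred * 1000))  # RMSE grows with the units
print(mape(actual, pred), mape(actual * 1000, pred * 1000))  # MAPE is unchanged
```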

#### 4.3.1. Accuracy

The most critical feature of a predictive model is its accuracy. It is fundamental, especially for decision-makers assessing investment opportunities with rather limited information. The average accuracy error of all the models included was under 10%, with a standard deviation of 5%, as shown in Figure 4. The use of ANN resulted in a slightly more dispersed distribution of the second and third quartiles compared to MRA and CBR, but its overall dispersion is smaller than that of MRA. CBR presented the narrowest overall and second–third quartile distributions of MAPE; additionally, the position of its two middle quartiles and its mean are lower than those of ANN and MRA. Although additional studies would deliver more substantial grounds to advocate for a particular technique, the collected data suggest that the CBR technique tends to provide higher accuracy than the others. The MAPE of the models overall ranged between 2% and 21%, with the second and third quartiles between 5% and 13%. Considering that the accuracy error in traditional cost estimation ranges from −15% to +25%, i.e., a 35% span in absolute terms, the three techniques can perform significantly better, presenting errors under 21%. This absolute limit of 21% can thus serve as a baseline acceptance range of error for building projects' cost estimations in the early stages.

#### 4.3.2. Validation

The method of validation in the studies was collected to assess the satisfaction of the second criterion stated by [12]. As part of the modelling process described earlier, models need an appropriate assessment of their accuracy using an independent data set. Forty-five of the studies considered out-of-sample data for testing, and only Chan and Park [58] did not specify whether a subset was set aside. Hold-out cross-validation, k-fold cross-validation, and Leave-One-Out Cross-Validation (LOOCV) were used in 33, eight, and four studies, respectively. Two considerations were weighed to assess the suitability of the method used. First, for small samples, k-fold cross-validation is pertinent because it should provide better estimates of accuracy according to [35]. The second consideration was extracted from Shmueli and Koppius [12], where a sample size of 213 data points was considered small in the modelling process and cross-validation was preferred to a simple hold-out; therefore, in this research, the hold-out method is considered appropriate only for samples of more than 213 data points. Accordingly, only 20 of the studies in this review conducted appropriate validation, utilising cross-validation or hold-out for data samples bigger than 213 data points; 22 studies did not implement the best validation method; and four studies did not indicate the type of validation or the sample size. These results agree with Elfaki et al. [17] in evidencing an urgent need for standard validation methods to determine the level of accuracy of models and ease the implementation of predictive analytics.
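The two validation schemes can be sketched as follows, using ordinary least squares as a stand-in for any cost model. All figures (sample size, split sizes, cost function, noise level) are hypothetical and chosen only to make the mechanics concrete.

```python
import numpy as np

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares -- a stand-in for any cost model."""
    beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(X_tr)), X_tr], y_tr, rcond=None)
    return np.c_[np.ones(len(X_te)), X_te] @ beta

rng = np.random.default_rng(1)
X = rng.uniform(500, 5000, size=(40, 1))           # hypothetical GFA values
y = 1200 * X[:, 0] + rng.normal(0, 5e4, size=40)   # hypothetical project costs

# Hold-out: one fixed train/test split (here 30/10).
holdout_mape = mape(y[30:], fit_predict(X[:30], y[:30], X[30:]))

# k-fold cross-validation (k = 5): every observation is tested exactly once.
folds = np.array_split(np.arange(40), 5)
errors = []
for f in folds:
    train = np.setdiff1d(np.arange(40), f)
    errors.append(mape(y[f], fit_predict(X[train], y[train], X[f])))
kfold_mape = float(np.mean(errors))

print(holdout_mape, kfold_mape)
```

The hold-out estimate rests on a single 10-point test set, whereas the k-fold estimate averages over five rotations of the test role, which is why it tends to be the more stable choice for small samples.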

**Figure 4.** Box and whiskers chart of the average MAPE by technique.

#### *4.4. Modelling Techniques*

The five main techniques applied in the studies for the estimation of building construction costs at the early stages were ANN, CBR, MRA, BRT, and SVM.

ANN, CBR, and MRA were the predominant techniques used to elaborate the cost-prediction models. ANNs were used in 48% of the studies, while MRA and CBR were used in 22% and 26%, respectively. The other two techniques, BRT and SVM, represented only 4% each. The reviewed papers followed three approaches to evaluate the techniques. The first approach used a single technique to develop a model; for example, Chan and Park [58] proposed a technique based on Principal Component Analysis to identify the most significant parameters and develop a linear function to model the costs of buildings. In the second approach, the studies compared different alternatives to improve a single technique. For example, Kim et al. [57] incorporated genetic algorithms to optimise the architecture of the artificial neural network model, and Doğan et al. [59] used genetic algorithms in a case-based model to determine the optimal weights of the case attributes. The third approach compared different techniques; e.g., Kim et al. [50] based their research methodology on comparing ANN, CBR, and MRA in cost modelling of buildings. Overall, 24% of the studies developed models without performing comparisons, 50% evaluated alternatives enhancing a single technique, and 26% compared different techniques. The studies comparing variations of one technique provided valuable outcomes regarding which components of a technique have the potential to increase the accuracy of the models. The areas to improve and the methods successfully used are shown in the following subsections.
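The third approach, comparing techniques on a common dataset, can be sketched as follows. Ordinary least squares stands in for MRA and a simple nearest-neighbour retrieval stands in for CBR; the data and cost function are synthetic, so this is an illustration of the comparison workflow, not any study's actual method.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(500, 5000, size=(50, 2))   # two hypothetical cost drivers per project
y = 900 * X[:, 0] + 2e4 * X[:, 1] + rng.normal(0, 1e5, size=50)

# Shared train/test split so both techniques face identical data.
X_tr, X_te, y_tr, y_te = X[:40], X[40:], y[:40], y[40:]

# MRA stand-in: ordinary least squares regression.
beta, *_ = np.linalg.lstsq(np.c_[np.ones(40), X_tr], y_tr, rcond=None)
pred_mra = np.c_[np.ones(10), X_te] @ beta

# CBR stand-in: retrieve the cost of the single most similar past case (1-NN).
dists = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
pred_cbr = y_tr[dists.argmin(axis=1)]

def mape(a, p):
    return float(np.mean(np.abs((a - p) / a)) * 100)

print(f"MRA MAPE: {mape(y_te, pred_mra):.1f}%")
print(f"CBR MAPE: {mape(y_te, pred_cbr):.1f}%")
```

Evaluating both techniques with the same scale-independent measure on the same held-out cases is what makes the comparison in this third approach meaningful.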
