*2.2. Predictive Analytics*

Predictive analytics can be understood as the systematic analysis of data, using computational techniques, to build models for prediction. Predictive analytics has been in use since the 1950s [28]. Shmueli [29] described predictive modelling as the process of applying data-mining algorithms or statistical models to data in order to predict future observations. Predictive analytics techniques have been applied successfully in areas such as marketing and finance [30], in the prevention of bank fraud, as reported by Boyacioglu [31], and in medicine, for example in the prediction of diseases such as diabetes [32]. Increasing data-transmission capacity, the growing volume of data stored by organisations, and greater processing power have boosted the use of predictive analytics in industry [33]. Despite these advances, uptake in the construction industry lags behind other industries, such as financial services, transportation and logistics, and energy and resources [10,34].

A complete process for constructing predictive models consists of the steps shown in Figure 1. The first step is to identify the model's main objective from a predictive perspective, followed by data collection and study design. Large datasets of an observational nature, drawn from the same population, are considered optimal for achieving higher accuracy. The data-preparation step raises two main issues. The first is missing data: missing values can be retained if they are informative of the output, but otherwise they need to be handled by removing observations or parameters, introducing dummy variables, or developing separate models according to the distribution of the missing data [29]. The second issue is data partitioning for testing purposes: the dataset should be randomly partitioned into two parts, one for training the model and the other for evaluating the predictive performance of the final model. Both issues are illustrated in the sketch below.
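
To make these two issues concrete, the following minimal sketch uses a hypothetical cost dataset with illustrative column names, showing a missing-value indicator (dummy variable) combined with imputation, followed by a random hold-out partition; the 80/20 split ratio and the median imputation are assumptions made for the example, not taken from the cited sources.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cost dataset; column names and values are illustrative only.
data = pd.DataFrame({
    "floor_area": [1200, 950, 1800, 700, 1500, 1100],
    "storeys":    [2, 1, np.nan, 1, 2, 2],          # one missing value
    "final_cost": [2.4, 1.9, 3.8, 1.3, 3.1, 2.2],
})

# Issue 1: missing data. A dummy indicator preserves the "missingness"
# signal while the gap itself is filled (median imputation is an assumption).
data["storeys_missing"] = data["storeys"].isna().astype(int)
data["storeys"] = data["storeys"].fillna(data["storeys"].median())

X = data[["floor_area", "storeys", "storeys_missing"]]  # input parameters
y = data["final_cost"]                                  # output to predict

# Issue 2: partitioning. Randomly split into a training set and a test set
# reserved for evaluating the final model (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```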

**Figure 1.** Empirical model-building steps schematic. Adapted from Shmueli and Koppius [12].

Exploratory Data Analysis (EDA) follows the data-preparation step and is used informally in predictive analytics to summarise the data graphically and numerically, capturing unknown or previously unformulated relationships [12]. EDA is also used to reduce the dimensionality of the data, lowering the number of parameters and the sample variance. Methods such as Principal Component Analysis (PCA) and Factor Analysis can be used to assess relations between the parameters of potential models (see the sketch below). Input variables, or parameters, are chosen by considering the relation between input and output, the quality of the data, and the availability of the parameters at the moment of prediction. Although accuracy is the main driver of model choice, higher-accuracy techniques tend to sacrifice interpretability and objectivity. The many techniques used in predictive analytics can be classified into linear and nonlinear models. Linear and logistic regressions are the most common techniques for data modelling, while techniques such as Decision Trees, Artificial Neural Networks, Support Vector Machines (SVM), and Fuzzy Logic Systems (FLS) can model nonlinear relationships, although with a higher risk of overfitting [30]. Case-Based Reasoning (CBR) is another technique commonly studied for building predictive models.
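
As an illustration of the dimensionality reduction mentioned above, the sketch below applies PCA to a set of correlated input parameters; the synthetic data and the 95% explained-variance threshold are assumptions made for demonstration purposes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic matrix of five correlated parameters (values are illustrative).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(5)])

# Standardise first so no single parameter dominates PCA through its scale.
X_scaled = StandardScaler().fit_transform(X)

# Retain enough components to explain 95% of the variance (assumed threshold).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance explained per component
print(X_reduced.shape)                # fewer parameters than the original five
```

In this constructed example, the five highly correlated inputs collapse to a much smaller set of components, illustrating how PCA reduces the parameter count before model fitting.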

Evaluation and validation are the main criteria for assessing the predictive power of a model [12]. Model selection aims to identify the appropriate level of complexity, balancing bias and variance to achieve higher accuracy. Model evaluation is conducted by assessing the accuracy of the models on out-of-sample data. Statistical significance measures such as R-squared play only a minor role, while generic predictive measures on observational data, such as the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE), are the more typical metrics of accuracy. The selection of out-of-sample data depends on the validation method used for the model's evaluation. Two methods, hold-out cross-validation and k-fold cross-validation, are standard for model validation [35]. Hold-out cross-validation is the most straightforward approach and involves splitting the data into a training dataset and a testing dataset. In k-fold cross-validation, the same data is used to train and test several models; the data selected for training and testing differ in each training session, and the average of the test results should provide a better estimate than any individual test result [35]. The extreme case, where the number of subsets equals the total number of data points, is called Leave-One-Out Cross-Validation (LOOCV). Validation methods also help to address the challenge of overfitting, which occurs when a model fits the training data so closely that it cannot predict new data [12]. A sketch of k-fold cross-validation with these metrics follows below. The model use and reporting stage relates closely to the predictions and the performance measures, where results need to be translated into new knowledge in line with the initial objectives.
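
The sketch below illustrates k-fold cross-validation with RMSE and MAPE as the accuracy metrics discussed above; the synthetic regression data, the linear model, and the choice of k = 5 are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data standing in for an observational dataset (an assumption).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
y = y + 1000  # offset so targets stay well away from zero for MAPE

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 is an assumption
rmse_scores, mape_scores = [], []

for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    mape_scores.append(mean_absolute_percentage_error(y[test_idx], pred))

# Averaging across folds yields a more stable estimate than a single
# hold-out split; LeaveOneOut() would be the extreme case (LOOCV).
print(f"RMSE: {np.mean(rmse_scores):.2f}, MAPE: {np.mean(mape_scores):.3f}")
```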

The following section describes the research method followed in this paper to investigate how predictive analytics can enhance the practice of cost estimation.
