*2.2. Variables of the Model and the Concept of Model Development*

Before the actual regression analysis, data reflecting the values of the model variables were collected and analyzed. The collected data included information about road bridges, rail bridges, and animal bridges (wildlife crossings) built in Poland between 2005 and 2018. The real-life total construction costs were updated with the price indices of construction and assembly production published by the General Statistical Office in Poland, so that they are comparable regardless of the date of project completion. Later in the paper, the updated costs of bridges, given in millions of PLN (e.g., PLN 10.53 m), are referred to as *y*. For easier interpretation, the costs are also given in millions of EUR (e.g., EUR 2.45 m). The conversion was based on the official exchange rate of the Polish National Bank for the PLN/EUR currency pair published for 31.12.2018. The values of *y* varied between PLN 2.46 m (EUR 0.57 m) and PLN 23.48 m (EUR 5.46 m).
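The cost update and the currency conversion described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the rate of 4.30 PLN/EUR is an assumption inferred from the value pairs quoted in the text (it is not stated explicitly in the paper).

```python
# Assumed PLN/EUR rate, inferred from the value pairs quoted in the text
# (10.53 -> 2.45, 2.46 -> 0.57, 23.48 -> 5.46); not stated in the paper.
PLN_PER_EUR = 4.30


def update_cost(cost_mln_pln: float, index_at_completion: float,
                index_at_reference: float) -> float:
    """Rescale a historical cost to the reference date with price indices."""
    return cost_mln_pln * index_at_reference / index_at_completion


def to_eur(cost_mln_pln: float, rate: float = PLN_PER_EUR) -> float:
    """Convert millions of PLN to millions of EUR, rounded to two decimals."""
    return round(cost_mln_pln / rate, 2)


# The three PLN values quoted in the text
for y in (10.53, 2.46, 23.48):
    print(y, "->", to_eur(y))
```

With the assumed rate, the three conversions reproduce the EUR figures given in the text (2.45, 0.57, and 5.46).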

The cost predictors, as the independent variables, brought to the model information about the type of bridge, type of project, structural and material solutions, types of supports and their foundations, and load class. All of this information was initially recorded as nominal data. Moreover, basic size measures, namely the decks' total length and width as well as the number of spans, were taken into account. The independent variables of the model are presented in Table 1. As the table shows, the characteristics of bridges recorded initially as nominal data were ultimately coded as binary values (0 or 1), and information recorded as numerical data was scaled to the range [0, 1]. In the case of structural solution, type of intermediate supports, and load class, the variables *x*<sub>14</sub>, *x*<sub>22</sub>, and *x*<sub>27</sub> each represent more than one nominal value: ARCHED/BOX, COLUMNS/PILES, and k/C/D/E, respectively (see also the footnotes under Table 1). This was done because some nominal values were not numerous enough in the dataset to be represented by a separate binary variable. It is important to note that for each of the characteristics listed in Table 1 only one nominal value was allowed, so only one of the binary variables belonging to this characteristic could take the value 1. For example, for a structure type whose nominal value was VIADUCT, the variables *x*<sub>1</sub> to *x*<sub>3</sub> took the values *x*<sub>1</sub> = 0, *x*<sub>2</sub> = 1, *x*<sub>3</sub> = 0.
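The coding scheme described above (binary indicators for nominal data, linear scaling for numerical data) can be illustrated with a minimal sketch. The category names `TYPE_A` and `TYPE_C` and the sample lengths are hypothetical placeholders; only the position of VIADUCT as the second indicator is taken from the text.

```python
def one_hot(value: str, categories: list[str]) -> list[int]:
    """Return binary indicators; exactly one of them equals 1."""
    if value not in categories:
        raise ValueError(f"unknown nominal value: {value!r}")
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values: list[float]) -> list[float]:
    """Scale numerical data linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical category order; only VIADUCT in the second position
# (x1 = 0, x2 = 1, x3 = 0) follows the example in the text.
STRUCTURE_TYPES = ["TYPE_A", "VIADUCT", "TYPE_C"]

print(one_hot("VIADUCT", STRUCTURE_TYPES))   # [0, 1, 0]
print(min_max_scale([12.0, 30.0, 48.0]))     # [0.0, 0.5, 1.0]
```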


**Table 1.** Input data for regression model—independent variables.

<sup>1</sup> k for rail bridges or C, D, E for other bridges; \* according to standards applied in Poland.

Table 2 presents a random sample of the coded variables *x* and *y* as used for model development; *p* stands for the pattern number.

The selection of the cost predictors was based on the availability of information in the early stages of bridge construction projects. The characteristics and their values that became independent variables of the model (as presented in Table 1) can be easily identified at the beginning of the design process.

Overall, 167 patterns were available for machine learning and model testing. The data were collected from the public clients responsible for bridge construction projects in Poland and divided into two subsets: the first (later denoted as *L*) was used for machine learning, and the second (later denoted as *T*) was used for testing the models. Both subsets were selected so as to be equivalent and representative in terms of the features of the investigated bridges and the range of construction costs. The cardinality of subset *L* equaled 131, whereas the cardinality of subset *T* equaled 36, so subset *T* accounted for more than 20% (about 21.6%) of all collected data patterns.
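The sizes of the two subsets can be checked with a short sketch. The paper does not describe how the representativeness of *L* and *T* was ensured, so the seeded random split below is only a stand-in assumption, and `split_patterns` is a hypothetical helper.

```python
import random

N_TOTAL, N_TEST = 167, 36


def split_patterns(n_total: int, n_test: int, seed: int = 0):
    """Randomly split pattern indices into learning (L) and testing (T) subsets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


L_idx, T_idx = split_patterns(N_TOTAL, N_TEST)
print(len(L_idx), len(T_idx))                  # 131 36
print(f"{100 * len(T_idx) / N_TOTAL:.1f}%")    # 21.6%
```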


**Table 2.** Random sample of the model's variable values.

<sup>1</sup> training and testing of the model was done with the use of costs given in millions of PLN.

The research included an investigation of a number of SVM-based regression models. A schematic diagram of the investigated models is presented in Figure 1.

**Figure 1.** Schematic diagram of the investigated support vector machine (SVM)-based regression models.

The performance of the SVM-based models relies on the assumed kernel function and its parameters, as well as on the meta-parameters *C* and ε.

For the purposes of transformation Φ, the use of the three aforementioned kernel functions (10)–(12) was investigated; however, the best results were obtained for the radial basis function (11). Thus, in the two following sections, the author focuses on the presentation and discussion of the models in which this particular type of function was applied.

The methods of selecting the parameters *C* and ε can be summarized after [17,18,33–35] as follows:


The choice of the two parameters for the models proposed herein was a compromise between the above-mentioned approaches, namely the determination of the parameters on the basis of the training data and grid search.

Each of the models was analyzed, and its predictive performance was assessed in terms of the correlation between the real-life values of the bridges' total construction costs *y* and the predicted values *ŷ*, the prediction errors, and the analysis of residuals. The following equations were used to compute Pearson's correlation coefficient (*R*), root mean squared error (*RMSE*), mean absolute percentage error (*MAPE*), and absolute percentage error for the *p*-th case (*APE<sup>p</sup>*):

$$R = \frac{\text{cov}(y; \hat{y})}{\sigma_{y} \sigma_{\hat{y}}},\tag{14}$$

$$RMSE = \left(\frac{1}{n} \sum (y - \hat{y})^2\right)^{0.5},\tag{15}$$

$$MAPE = \frac{100\%}{n} \sum \left|\frac{y - \hat{y}}{y}\right|,\tag{16}$$

$$APE^p = 100\% \cdot \frac{|y^p - \hat{y}^p|}{y^p},\tag{17}$$

where *cov*(*y*; *ŷ*) is the covariance of the real values of the bridges' total construction costs and the values predicted by a model; σ<sub>*y*</sub> and σ<sub>*ŷ*</sub> are the standard deviations of the real values of the bridges' total construction costs and of the values predicted by a model, respectively; *n* is the cardinality of either subset *L* or subset *T*; *y* − *ŷ* are the prediction errors, computed after completion of the machine learning process for either subset *L* or subset *T*; and *p* is the pattern index. The SVM machine learning process was carried out with the use of the STATISTICA™ software suite.
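A minimal sketch of the four measures (14)–(17) in plain Python is given below. It assumes the usual definition of *MAPE* as the mean of the absolute percentage errors; the function names and the sample values are illustrative and do not come from the paper's dataset.

```python
import math


def pearson_r(y, y_hat):
    """Pearson's correlation coefficient R, Equation (14)."""
    n = len(y)
    mean_y, mean_hat = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat)) / n
    sigma_y = math.sqrt(sum((a - mean_y) ** 2 for a in y) / n)
    sigma_hat = math.sqrt(sum((b - mean_hat) ** 2 for b in y_hat) / n)
    return cov / (sigma_y * sigma_hat)


def rmse(y, y_hat):
    """Root mean squared error, Equation (15)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def ape(y_p, y_hat_p):
    """Absolute percentage error for a single pattern, Equation (17)."""
    return 100.0 * abs(y_p - y_hat_p) / y_p


def mape(y, y_hat):
    """Mean absolute percentage error, Equation (16)."""
    return sum(ape(a, b) for a, b in zip(y, y_hat)) / len(y)


# Made-up costs in millions of PLN, for illustration only
y = [4.0, 8.0, 10.0]
y_hat = [5.0, 6.0, 10.0]
print([ape(a, b) for a, b in zip(y, y_hat)])   # [25.0, 25.0, 0.0]
print(round(rmse(y, y_hat), 3))                # 1.291
print(round(mape(y, y_hat), 2))                # 16.67
print(round(pearson_r(y, y_hat), 3))
```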

According to the literature [36–38] and remarks about the expected accuracy of cost estimates provided at the early stages of construction projects (also called conceptual estimates), the error of estimates should fall within the range from −30%/−25% to +25%/+30% when compared to the actual, final construction costs. If the proposed models' predictions and *APE<sup>p</sup>* are considered, the above rule can be reformulated as the expectation that *APE<sup>p</sup>* falls between 0% and +25%/+30%. Naturally, the models are still required to predict the bridges' total construction costs with errors as small as possible. However, the rule can be used for the comparison and assessment of the models' performance.
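The rule can be applied as a simple share computation, sketched below. The helper name `share_within` and the sample *APE<sup>p</sup>* values are hypothetical and serve only to show the calculation.

```python
def share_within(ape_values, limit):
    """Percentage of predictions whose APE does not exceed the limit (in %)."""
    return 100.0 * sum(a <= limit for a in ape_values) / len(ape_values)


# Made-up APE values (in %), for illustration only
example_ape = [5.0, 12.0, 26.0, 31.0]
print(share_within(example_ape, 25.0))   # 50.0
print(share_within(example_ape, 30.0))   # 75.0
```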

### **3. Results**

For the investigated SVM-based regression models, the parameter γ (for the radial basis kernel function) was set to the inverse of the number of inputs, thus γ = 1/27 ≈ 0.037. The value of γ can be interpreted as the inverse of the radius of influence of the samples selected in the course of machine learning as support vectors.
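The role of γ can be illustrated with a short sketch. It assumes the common form of the radial basis kernel, K(x, z) = exp(−γ‖x − z‖²); the exact form of the paper's Equation (11) is not reproduced in this section, so this form is an assumption.

```python
import math

N_INPUTS = 27
GAMMA = 1 / N_INPUTS   # inverse of the number of inputs


def rbf_kernel(x, z, gamma=GAMMA):
    """Radial basis kernel, assuming K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


print(round(GAMMA, 3))                       # 0.037
print(rbf_kernel([0, 1, 0], [0, 1, 0]))      # 1.0 for identical patterns
print(rbf_kernel([0, 1, 0], [1, 0, 0]))      # below 1.0 for differing patterns
```

A smaller γ makes the kernel value decay more slowly with distance, so each support vector influences a wider neighborhood of patterns.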

The regularization meta-parameter *C* was initially assessed following the rule [35]:

$$C = \max\{|E(y) + 3\sigma_y|;\ |E(y) - 3\sigma_y|\},\tag{18}$$

where *E*(*y*) = 6.61 and σ<sub>*y*</sub> = 4.22, computed for *y<sup>p</sup>* belonging to subset *L*, resulted in *C* = 19.27. Consequently, 20 was assumed as the upper boundary of *C*. The values of both *C* and ε (the threshold of the loss function) were sought with the use of grid search. The considered ranges of *C* and ε, as well as the grid search details, are given in Table 3.
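The arithmetic behind the initial value of *C* can be verified with a few lines. Rounding up to obtain the upper boundary of 20 is an interpretation of the text, and `initial_c` is a hypothetical helper name.

```python
import math


def initial_c(mean_y: float, sigma_y: float) -> float:
    """Rule (18): C = max(|E(y) + 3*sigma_y|, |E(y) - 3*sigma_y|)."""
    return max(abs(mean_y + 3 * sigma_y), abs(mean_y - 3 * sigma_y))


c_init = initial_c(6.61, 4.22)   # E(y) and sigma_y computed for subset L
c_upper = math.ceil(c_init)      # assumed rounding up to the grid boundary
print(round(c_init, 2), c_upper)   # 19.27 20
```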


**Table 3.** Considered ranges of length axis (*C*) and depth axes (ε) parameters.

The machine learning process for each of the models was carried out with the use of 10-fold cross-validation. After the process was finished, the performances of the models were compared. *RMSE* values were computed for both subsets *L* and *T* for all of the obtained models. The *RMSE* values obtained for the subset used in the course of machine learning (subset *L*) are presented in Figure 2, and Figure 3 depicts the *RMSE* values computed for the testing subset *T*. The error values (height axes in Figures 2 and 3) are presented as 3D surfaces with regard to *C* (length axes) and ε (depth axes). For subset *L*, the *RMSE* values decrease as *C* increases and ε decreases. For subset *T*, the tendency is similar with regard to ε but opposite with regard to *C*.
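The partition of subset *L* into ten folds can be sketched as follows. How the software actually assigns patterns to folds is not described in the paper; the contiguous, unshuffled folds below and the helper `k_fold_indices` are assumptions made for illustration.

```python
def k_fold_indices(n: int, k: int = 10):
    """Yield (learning, validation) index lists for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        learning = [i for i in range(n) if i < start or i >= start + size]
        yield learning, validation
        start += size


# Subset L contains 131 patterns
folds = list(k_fold_indices(131, k=10))
print([len(validation) for _, validation in folds])
# [14, 13, 13, 13, 13, 13, 13, 13, 13, 13]
```

In each of the ten rounds, a model is fitted on the learning indices and evaluated on the validation indices, and the cross-validation error is the average over the rounds.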

**Figure 2.** *RMSE* errors obtained for subset *L*.

**Figure 3.** *RMSE* errors computed for subset *T*.

When the values of *RMSE* for both subsets *L* and *T* are considered together, one can find grid points, each corresponding to a certain model, at which the *RMSE* values for testing reach minima while the *RMSE* values for machine learning are close to those for testing.

The analysis of *RMSE* values allowed for the selection of five models that were further investigated. The five bridge construction cost prediction models based on support vector regression (later referred to as BCCPMSVR) are introduced in Table 4. The characteristics of the models include the values of the meta-parameters *C* and ε, the number of support vectors (*sv*), the number of bounded support vectors, and the values of the constants *w*<sub>0</sub>. The support vectors are the data patterns belonging to subset *L* that determine the position of the regression hyperplane for a certain model. Furthermore, the errors of 10-fold cross-validation are also presented. The general error and performance measures *RMSE*, *R*, and *MAPE* for the five BCCPMSVR models, computed for subsets *L* and *T*, are collected in Table 5.


The values of *RMSE* and *R* (in Table 5) for the five selected models are relatively close. Thus, in light of the *RMSE* and *R* values, the performance of the models can be assessed as comparable. In terms of *MAPE* values, the differences are slightly more evident. The final choice of the model, however, was based on the comparison of the distribution of *APE<sup>p</sup>* errors and on the rule (presented in Section 2.2) that refers to the desired range of *APE<sup>p</sup>* values for early cost estimates of bridge construction.


**Table 5.** Measures of errors and performance obtained for the five selected models.

Table 6 presents the distributions of *APE<sup>p</sup>* errors of the predictions of total bridge construction costs for both subsets *L* and *T* under the conditions *APE<sup>p</sup>* ≤ 25% and *APE<sup>p</sup>* ≤ 30%. In light of the values in Table 6, model BCCPMSVR2 performed better than the others: it reached the highest shares of *APE<sup>p</sup>* ≤ 25% for subsets *L* and *T*, and the same shares of *APE<sup>p</sup>* ≤ 30% for subsets *L* and *T* as BCCPMSVR1.

**Table 6.** Comparison of absolute percentage errors for the *p*-th case (*APE<sup>p</sup>*) for the five selected models.


For the finally selected model, BCCPMSVR2, the scatter plots of the values of *y* (actual bridge construction costs, presented on the horizontal axes) and *ŷ* (bridge construction cost predictions by model BCCPMSVR2, presented on the vertical axes) are depicted in Figures 4 and 5. The former shows the scatter plot of *y* and *ŷ* values for subset *L*, the latter for subset *T*. The charts also include the cones of errors ±25% and ±30%.

**Figure 4.** Scatter plot of *y* and *ŷ* predicted by BCCPMSVR2 for subset *L*.

**Figure 5.** Scatter plot of *y* and *ŷ* predicted by BCCPMSVR2 for subset *T*.

Table 7 presents the percentage shares of *APE<sup>p</sup>* errors of the bridge construction cost predictions provided by model BCCPMSVR2 (for both subsets *L* and *T*), divided into intervals of a width equal to 5%. Additionally, the distributions (cumulated shares) of *APE<sup>p</sup>* errors are given in the table.
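The shares and cumulated shares per 5% interval can be computed as in the sketch below. The half-open intervals [0, 5), [5, 10), ... with an open-ended last interval are an assumption, since the exact interval boundaries are not given in the text; the sample *APE<sup>p</sup>* values are made up.

```python
def ape_distribution(ape_values, width=5.0, n_bins=7):
    """Return (label, share in %, cumulated share in %) for each APE interval.

    Intervals are [0, 5), [5, 10), ...; the last one is open-ended.
    """
    counts = [0] * n_bins
    for ape in ape_values:
        counts[min(int(ape // width), n_bins - 1)] += 1
    rows, cumulated = [], 0.0
    for i, count in enumerate(counts):
        share = 100.0 * count / len(ape_values)
        cumulated += share
        lower = i * width
        if i == n_bins - 1:
            label = f">= {lower:g}%"
        else:
            label = f"{lower:g}-{lower + width:g}%"
        rows.append((label, share, cumulated))
    return rows


# Made-up APE values (in %), for illustration only
example_ape = [2.0, 7.5, 12.0, 3.0, 28.0, 40.0, 9.0, 4.0]
for label, share, cumulated in ape_distribution(example_ape):
    print(f"{label:>8}  {share:5.1f}  {cumulated:5.1f}")
```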


**Table 7.** Shares and distribution of *APE<sup>p</sup>* values for BCCPMSVR2.

The points (*y<sup>p</sup>*; *ŷ<sup>p</sup>*) in the scatter plots (Figures 4 and 5) are distributed evenly along the line of a perfect fit. Moreover, for both subsets *L* and *T*, the vast majority of the bridge construction cost predictions are located within the ±25% cone of errors, and almost all of the predictions are located within the ±30% cone of errors.

The values of *APE<sup>p</sup>* (in Table 6), as complementary information, confirm that most of the bridge construction cost predictions made by BCCPMSVR2 meet the condition of early cost estimates.

The general conclusion from the results presented above is that the proposed model provides cost predictions for bridge construction projects with satisfactory accuracy with regard to the expectations for estimates at the early stages of projects.

### **4. Discussion**

Compared with the models proposed by other authors, the model introduced herein differs in several significant respects. Previous works that aimed at modeling the costs of bridges in the early stages of projects focused on cost estimates of either parts of bridge structures [2–4] or specific types of bridges [5–8]. The model introduced herein offers cost predictions for a bridge as a whole (the substructure and superstructure together). Moreover, the predictions are made for different types of bridges with regard to their structure, purpose, and structural and material solutions.

Furthermore, most of the previously proposed models are based either on regression analysis [2–4] or on ANN [5]. The former requires a priori assumptions about the functional relationship binding the bridge construction cost, as a dependent variable, with the cost predictors as independent variables. The latter is at risk of the so-called local minima problem. Both of these drawbacks are overcome by the use of the SVM-based regression method for the development of a model for predicting the costs of bridges.

The results of the research confirmed the assumptions made for the application of the SVM method to bridge construction cost prediction. Several SVM-based regression models were investigated with the use of data collected for a number of bridge construction projects completed in Poland. After the machine learning and testing processes were finished, five of the models, with satisfactory generalization ability and comparable performance, were preselected. It is important to note that when the machine learning process was repeated under the given constraints, the results obtained for each of the investigated models were exactly the same every time. The application of the SVM method to early estimates of bridge construction costs thus eliminates the risk of the local minima problem.

The final selection of the best model was based on the comparison and analysis of the models' ability to predict the bridge construction costs with accuracy appropriate for the early stage of the projects.

The general performance of the selected model, namely BCCPMSVR2, and its measures are presented in Section 3. The predictions of the bridge construction costs provided by the model can also be analyzed in a way that focuses on selected characteristics and features of bridges as the model's input.

Tables 8–11 present the relative percentage shares of *APE<sup>p</sup>*, computed for the machine learning subset, belonging to certain intervals (compare Table 7) with regard to variables of a nominal type (coded as binary values for machine learning). The relative percentage shares of *APE<sup>p</sup>* for variables of a nominal type were computed as follows:

• For each of the variables *x<sub>j</sub>*, for *j* = 1–8 or *j* = 12–27, the number of predictions whose corresponding *APE<sup>p</sup>* fell into a certain interval was counted and divided by the number of occurrences of *x<sub>j</sub>* = 1.
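The computation described in the item above can be sketched as follows. The interval edges, the helper name `relative_shares`, and the sample data are assumptions chosen for illustration; restricting the count to the patterns with *x<sub>j</sub>* = 1 is how the description is read here.

```python
import math


def relative_shares(x_j, ape, edges):
    """Shares (%) of APE values per interval among patterns with x_j == 1.

    Interval i is [edges[i], edges[i + 1]).
    """
    selected = [a for flag, a in zip(x_j, ape) if flag == 1]
    if not selected:
        raise ValueError("x_j never equals 1")
    shares = []
    for lower, upper in zip(edges, edges[1:]):
        count = sum(lower <= a < upper for a in selected)
        shares.append(100.0 * count / len(selected))
    return shares


# Assumed 5% intervals with an open-ended last interval
EDGES = [0, 5, 10, 15, 20, 25, 30, math.inf]
# Made-up binary variable and APE values (in %), for illustration only
x_j = [1, 0, 1, 1, 0, 1]
ape = [3.0, 8.0, 12.0, 4.0, 40.0, 27.0]
print(relative_shares(x_j, ape, EDGES))
# [50.0, 0.0, 25.0, 0.0, 0.0, 25.0, 0.0]
```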


**Table 8.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the type of bridge, its structure, and the type of project.


**Table 9.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the structural and material solutions.

**Table 10.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the types of bridgehead and intermediate supports and the supports' foundations.


**Table 11.** *APE<sup>p</sup>* predictions' errors for machine learning with regard to the load class.


<sup>1</sup> (compare with Table 1).

Tables 8–11 show how the accuracy of the predictions depends on the selected characteristics of the bridges described by nominal values.

Tables 12–14 present the relative percentage shares of *APE<sup>p</sup>*, computed for the machine learning subset, belonging to certain intervals (compare Table 6) with regard to variables of a numerical type.

The relative percentage shares of *APE<sup>p</sup>* for these variables were computed as follows: for each of the variables *x<sub>j</sub>*, for *j* = 9–11:


Tables 12–14 show how the accuracy of the predictions depends on the selected characteristics of the bridges described by the structure's length, width, or number of spans.

A limitation of the model that should be mentioned here is that, for the data used both in the machine learning and in the testing processes, the real-life bridge construction costs were updated to a certain moment in time. Thus, for now, the developed model does not provide dynamic predictions. The reason for this limitation is the number of collected data patterns, which does not currently allow for dynamic predictions that follow the changes of costs over time.

Future research plans cover the expansion of the database through further collection of training data and the development of models capable of dynamic predictions. One possible future research direction, which also relies on the database expansion, is the decomposition of the problem: the development of separate models for certain types of bridges and combining the models in a so-called *committee machine*.


**Table 12.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the total length of bridge (*x*<sub>9</sub>).

**Table 13.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the width of bridge (*x*<sub>10</sub>).


**Table 14.** *APE<sup>p</sup>* prediction errors for machine learning with regard to the number of spans (*x*<sub>11</sub>).


### **5. Conclusions**

As a result of the research, an original model capable of supporting early estimates of bridge construction costs, based on machine learning and the SVM method, was developed and introduced. The input variables bring to the model information that is available in the early stage of a bridge construction project and that represents the features of bridges.

According to the presented results and discussion, as well as the accuracy expectations applicable to conceptual estimates, the model offers good performance. The applied kernel function is of the radial basis type, and the meta-parameters of the model are *C* = 8 and ε = 0.050. The values of the general measures of the model's performance, for machine learning and testing respectively, are:


The model provides cost predictions with satisfactory accuracy, within the range of errors appropriate for early estimates (conceptual estimates), that is, ±25%/±30%.

The proposed approach is promising for early cost estimates (conceptual cost estimates) in bridge construction projects. The study contributes to the body of knowledge through the application of machine learning methods to cost analyses in construction.

**Funding:** This research was funded by statutory activities of Cracow University of Technology.

**Acknowledgments:** Computations for SVM machine learning were done with the use of the STATISTICA™ software suite.

**Conflicts of Interest:** The author declares no conflicts of interest.
