#### *2.3. ML for Construction Cost Prediction*

ML uses historical evidence to offer a reliable solution that facilitates informed decision making. The literature on ML applications utilizing different types of datasets is growing in various fields [43–56]. Different ML approaches, such as artificial neural network (ANN), deep neural network (DNN), and support vector machine (SVM), are employed due to their ability to capture the complicated, non-linear patterns of real-world datasets. In this regard, the two ML approaches used for the cost estimation of construction projects are ANN [57,58] and SVM [55,59]. Even though other ML approaches, such as k-nearest neighbors (KNN) and decision trees (DT), share similarities with the ANN and SVM algorithms, they have yet to be investigated in the construction management literature [48]. Overall, studies applying more advanced ML approaches to construction cost estimation are scarce.

The literature on construction quality has mostly focused on quality assurance and quality control, using visual defect detection methodologies for a variety of tasks, including crack identification [60,61], damage localization on wooden building elements [62], and evaluation of pavement conditions [63]. ML approaches have also been used to identify rework or defective construction items. To this end, Fan [64] recently constructed a hybrid ML model using association rule mining (ARM) and a Bayesian network (BN) approach to identify quality determinants and to evaluate defect risk and its occurrence more effectively. In a related study, Kim et al. [65] utilized SVM, random forest (RF), and logistic regression (LR) along with three natural language processing (NLP) methods on 310,000 defect cases from South Korea to assign defect items to the appropriate repair task. Shoar et al. [16] used RF to estimate the COR of engineering services in construction. Their study found RF to be an efficient cost estimator for screening and prioritizing construction projects from the standpoint of cost overrun, and one that can be used to devise related contingency plans.

The study most relevant to the present one is that conducted by Doğan [66] to predict the cost impact of construction nonconformities using case-based reasoning (CBR). His results indicated that the ability of CBR to predict the cost impact of quality problems is higher in construction NCRs. A review of the construction management literature suggests that the development of ML-based cost estimators is still at an early stage. Advanced ML approaches, such as ensemble learning methods, are lacking. Although studies have established the usefulness of these ML methods, they have not elaborated on the robustness of the developed estimators, that is, on whether the developed systems can be used with other datasets. Thus, there is a research gap in the implementation of advanced ML-based techniques for predicting the COR associated with different construction activities.

#### *2.4. Ensemble Learning*

Single ML classifiers, such as SVM, KNN, NB, and DT, are trained with labeled datasets through various approaches to predict an output label class. Ensemble classifiers, however, combine the best predictions of these single ML approaches to improve the final prediction accuracy with improved stability and robustness [67]. The ensemble methods vary according to how they combine the results of single ML classifiers, while their performance depends on the number of individual members along with their prediction accuracy [67]. There are three popular ensemble techniques: (i) stacked, (ii) voting, and (iii) tree-based. Kansara et al. [68] applied the stacked ensemble (XGBoost regression) and tree-based ensemble (RF) approaches to improve the price prediction accuracy for real estate datasets. However, stacked ensemble approaches have the disadvantages of additional complexity and high computational time. Thus, they are feasible only when other ensemble approaches are not applicable.

Overall, due to their improved accuracy [51,68,69], studies have adopted ensemble learning methods for different prediction activities within different fields. Therefore, ensemble predictors are expected to provide more accurate cost predictions. In addition, the mechanism of the ensemble classifier benefits from both strong and weak predictors, where the latter are used to improve the prediction of the underrepresented classes. The literature on construction quality has still not matured with respect to cost estimation using either single ML predictors or ensemble learning predictors. Furthermore, the COR for different construction activities is not addressed in the construction quality literature. Therefore, because of the superior performance of ensemble learning over single ML models [51,68,69], this study adopts two such techniques, referred to as *soft* and *hard voting classifiers*. It compares them with three conventional tree-based ensemble classifiers (RF, gradient boosting (GB), and AdaBoost (AB)) and with four single ML classifiers (DT, naïve Bayes (NB), logistic regression (LR), and SVM). Accordingly, Figure 3 presents a simplified form of the procedure for the hard and soft voting classifiers adopted in this study.

**Figure 3.** Simplified hard and soft voting COR classifiers.

As shown in Figure 3, each single classifier (ML 1–3) is referred to as a member that predicts an output class label, referred to as a vote. The hard voting classifier selects the label voted for by the majority of the members, whereas the soft voting classifier uses the average of the predicted probabilities of all the members. For example, in Figure 3, ML 1 and ML 2 both classify the impact of COR as two, while ML 3 classifies it as three, so the hard voting model predicts the COR impact as two. Soft voting is less straightforward, since it takes the probability that each member assigns to each of the five classes, averages these probabilities across all members within each class, and selects the class with the highest average probability as the final label. Voting classifiers can take both single and ensemble classifiers as members.

Tree-based ensemble approaches, such as GB and AB, have been utilized to predict the rental price of apartments, showing better prediction accuracy than single ML approaches [69]. In addition to voting classifiers, bagging and boosting tree-based ensemble approaches are evaluated in this study. Figure 4 outlines the bagging and boosting mechanisms within the tree-based ensemble approaches.
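The difference between the hard and soft voting rules described above can be illustrated with a minimal sketch in plain Python. The three members reproduce the labels of the Figure 3 example (two votes for class two, one for class three), but the probability values, the function names `hard_vote` and `soft_vote`, and the tie-breaking rule are assumptions made only for illustration; they are not taken from the study.

```python
from collections import Counter

CLASSES = [1, 2, 3, 4, 5]  # cost impact classes: VL, L, M, H, VH

def hard_vote(labels):
    """Return the label chosen by most members (ties go to the lowest label, an assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

def soft_vote(probabilities):
    """Average the per-class probabilities over all members and pick the largest average."""
    n_members = len(probabilities)
    averaged = [sum(member[k] for member in probabilities) / n_members
                for k in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=averaged.__getitem__)
    return CLASSES[best], averaged

# Hypothetical class probabilities for ML 1, ML 2, and ML 3 (each row sums to 1).
member_probs = [
    [0.05, 0.40, 0.35, 0.15, 0.05],  # ML 1: most likely class is 2
    [0.10, 0.40, 0.30, 0.15, 0.05],  # ML 2: most likely class is 2
    [0.05, 0.10, 0.70, 0.10, 0.05],  # ML 3: most likely class is 3
]

# Each member's vote is its most probable class.
member_votes = [CLASSES[max(range(len(CLASSES)), key=p.__getitem__)] for p in member_probs]

hard_label = hard_vote(member_votes)
soft_label, avg_probs = soft_vote(member_probs)

print("member votes:", member_votes)
print("hard voting label:", hard_label)
print("soft voting label:", soft_label)
```

With these illustrative numbers the two rules disagree: the majority votes for class two, but ML 3 is confident enough about class three to raise its average probability above that of class two. In scikit-learn, the same two rules are selected through the `voting="hard"` and `voting="soft"` options of `VotingClassifier`.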

**Figure 4.** Simplified RF and boosting (AB and GB) ensemble mechanisms.

The bagging (i.e., RF) and boosting (i.e., AB and GB) mechanisms are the main ensemble approaches used within tree-based ensemble models, taking advantage of the best predictions of single DTs.
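As a rough illustration of these two mechanisms, the sketch below fits one bagging-style model (RF) and two boosting models (AB and GB) with scikit-learn. RF grows its trees independently on bootstrap samples and aggregates their votes, whereas AB and GB add trees sequentially so that each new tree concentrates on the errors of the previous ones. The synthetic dataset and all hyperparameter values are assumptions chosen to keep the example small; they do not reproduce the configuration of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for a real dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Bagging: independent trees on bootstrap samples, predictions aggregated.
    "RF (bagging)": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting: trees added one after another, each correcting earlier errors.
    "AB (boosting)": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GB (boosting)": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit every model and record its accuracy on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, accuracy in scores.items():
    print(f"{name}: accuracy = {accuracy:.3f}")
```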

#### **3. Data Description from Construction Nonconformance Report**

This study uses the nonconformance items from diverse construction projects undertaken by international construction companies, collected in a study by Doğan [66] in 2021. The dataset comprises 2527 nonconformance items recorded by inspecting the different activities throughout the construction phase. A histogram associated with the construction activities and the frequency of recorded nonconformity is given in Figure 5, with activities having fewer than 20 occurrences aggregated under the 'other project activities' group. The collected nonconformance items were assigned to the different causation attributes through interviews.

**Figure 5.** Details of construction activities registered in the NCRs.

Since the dataset was collected during the construction phases, attributes related to the pre-construction, design, and tendering phases, such as those related to clients and subcontractors, are omitted. The obtained NCR dataset is described using the stacked histogram, which details different construction project types. The NCRs include details of the causation of each recorded item, divided into material, design, operation, and construction causation. In addition, the cost impact of COR (*y*) is assigned as an output feature column. Each recorded item is thereby given a cost impact of between one and five, corresponding to very low (VL), low (L), medium (M), high (H), and very high (VH).

In addition, the output cost impact is recategorized into three, four and five cost impact classes to evaluate the class prediction accuracy of the adopted ensemble approaches (Figure 6).
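The recategorization step can be sketched as a simple label mapping. The text does not state how the five original classes are merged into four or three classes, so the groupings `TO_FOUR` and `TO_THREE` below are purely hypothetical, as are the helper name `recategorize` and the sample labels.

```python
# Original five-level scale described in the text.
FIVE_CLASS_NAMES = {1: "VL", 2: "L", 3: "M", 4: "H", 5: "VH"}

# Hypothetical groupings (assumptions, not the study's actual scheme):
# four classes merge VL and L; three classes also merge H and VH.
TO_FOUR = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}
TO_THREE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def recategorize(labels, mapping):
    """Map each five-class cost impact label onto a coarser scale."""
    return [mapping[label] for label in labels]

# A few made-up cost impact labels on the original scale.
y_five = [1, 2, 2, 3, 5, 4, 1, 3]
y_four = recategorize(y_five, TO_FOUR)
y_three = recategorize(y_five, TO_THREE)

print("five classes: ", y_five)
print("four classes: ", y_four)
print("three classes:", y_three)
```

Merging neighboring classes in this way gives the sparsely populated high-impact classes more records per label, which is one reason to compare prediction accuracy across the three label schemes.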

As Figure 6 shows, critical nonconformance items in the high cost impact classes are underrepresented, with few records compared with the lower cost impact classes. This class imbalance among the different cost impact groups reduces the ability of ML models to learn the underrepresented cost impact classes. Figure 7 shows the frequency of material-related nonconformance attributes used in this study.

**Figure 7.** Material-related nonconformance attributes.

Furthermore, as shown in Figure 7, the collected NCRs also include the stage at which the nonconformance issue was initiated. Thus, each nonconformance attribute is linked to installation, documentation, material, or process damage. For example, 10 nonconformance issues with a cost impact of three are recorded as caused by damaged material usage (M-4), where the damage is initiated at the installation stage. Likewise, Figure 8 shows the nonconformity attributes related to design and operation. The design-related attributes recorded during the construction phase are limited because the collected NCRs were only gathered from the construction site.

**Figure 8.** Design- and operation-related nonconformance attributes.

As Figure 8 shows, the design-related nonconformance attributes mostly occurred during the processing stage, while the operation-related issues were mostly associated with a lack of supervision (O-4) during the installation phase. As this study uses NCRs from construction sites, the frequency of nonconformances within the installation stage significantly increases in construction-related attributes (Figure 9).

**Figure 9.** Construction-related nonconformance attributes.

#### **4. Methodology**

This study adopts ensemble ML classifiers to determine the impact of COR on overall project cost. The methodology is outlined in Figure 10.


**Figure 10.** COR classifier methodology.

The methodology described was implemented within the Python programming environment. Most of the data preprocessing, analysis, and associated ML configuration were performed with widely used libraries, such as the Pandas [70], NumPy [70], and Scikit-Learn [71,72] packages. The nonconformance dataset was used to train ensemble classifiers that predict the impact of COR on overall construction cost, and the results were compared with those of the single ML predictors.
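A minimal sketch of such a workflow with Pandas, NumPy, and Scikit-Learn is given below. The actual NCR dataset is not reproduced here: the table is synthetic, the attribute columns other than M-4 and O-4 are placeholder names, the rule that generates the cost impact label is invented, and the choice of DT, NB, and LR as voting members is an assumption, so the printed accuracies carry no meaning beyond showing how the comparison could be set up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_items = 600

# Synthetic binary causation attributes (column names are placeholders).
columns = ["M-1", "M-4", "D-1", "O-4", "C-1", "C-2"]
ncr = pd.DataFrame(rng.integers(0, 2, size=(n_items, len(columns))), columns=columns)

# Invented rule: more flagged attributes tend to mean a higher cost impact (1 to 5).
noise = rng.integers(0, 2, size=n_items)
ncr["cost_impact"] = np.clip(ncr[columns].sum(axis=1) // 2 + 1 + noise, 1, 5)

X = ncr[columns]
y = ncr["cost_impact"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def make_members():
    """Return fresh, unfitted single classifiers used as voting members."""
    return [
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("nb", BernoulliNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ]

# Single classifiers plus the two voting ensembles built from the same members.
candidates = dict(make_members())
candidates["hard_voting"] = VotingClassifier(estimators=make_members(), voting="hard")
candidates["soft_voting"] = VotingClassifier(estimators=make_members(), voting="soft")

# Fit each candidate and collect its held-out accuracy for comparison.
results = pd.Series(
    {name: clf.fit(X_train, y_train).score(X_test, y_test)
     for name, clf in candidates.items()},
    name="accuracy",
)
print(results.round(3))
```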
