#### *4.2. Configuring Voting Classifiers*

Voting classifiers can draw on a wide range of single and ensemble ML algorithms for a given prediction task. This study explored different single and ensemble ML methods used in the literature to identify the best combination for the present study (Figure 11).

**Figure 11.** Configuration of soft and hard voting classifiers.

This study aimed to develop a COR impact predictor that achieves good prediction accuracy with a simple implementation. Therefore, dimension reduction, class-imbalance handling, and optimization techniques were not used to ascertain the best hyperparameters. Instead, different ML approaches were layered within a voting classifier and, based on the observed performance, a final voting classifier was configured with fewer ML members to accelerate the training procedure. The compulsory parameters, including the number of estimators (for RF, AB, and GB) and the number of neighbors (for KNN), were set roughly close to those of the benchmark models, while the other options were left at their defaults. Although feature engineering and optimization techniques would have improved the performance of the ensemble predictor, their implementation was beyond the scope of the present study.

The radial basis function (RBF) was used as the SVM kernel, with the balanced class weight option enabled. KNN was trained with 23 neighbors. LR was configured with the LIBLINEAR solver and L1 regularization. To ensure the creation of a weak learner, NB was used in its default form. The number of estimators for AB and RF was set to 300 trees, while GB used 100 estimators with a learning rate of 0.1. Afterwards, to reduce computational cost, the number of voting classifier members was reduced to three. In this respect, two strong learners were combined with a weak learner to simultaneously ensure accuracy and the elimination of bias. Combining strong and weak ML learners enhances the prediction accuracy of models for different rework cost impact classes while boosting the model's overall performance in terms of generalization ability and computational cost. Therefore, LR and KNN were used as strong learners, while NB was used as the weak learner for both the soft and hard voting classifiers.
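The configuration described above can be sketched in scikit-learn as follows. The hyperparameter values are those reported in the text; the variable names and any parameters not stated in the text (left at their defaults) are illustrative assumptions.

```python
# Sketch of the voting-classifier configuration; hyperparameters follow the
# values stated in the text, everything else is left at scikit-learn defaults.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)

# Member learners explored in the initial, larger ensemble.
svm = SVC(kernel="rbf", class_weight="balanced", probability=True)
knn = KNeighborsClassifier(n_neighbors=23)
lr = LogisticRegression(solver="liblinear", penalty="l1")
nb = GaussianNB()                                   # default form: weak learner
ab = AdaBoostClassifier(n_estimators=300)
rf = RandomForestClassifier(n_estimators=300)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

# Final reduced ensemble: two strong learners (LR, KNN) plus one weak (NB).
members = [("lr", lr), ("knn", knn), ("nb", nb)]
soft_vote = VotingClassifier(estimators=members, voting="soft")
hard_vote = VotingClassifier(estimators=members, voting="hard")
```

Soft voting averages the members' predicted class probabilities, while hard voting takes the majority of their predicted labels, which is why the soft variant requires probability estimates from every member.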

#### *4.3. Configuring Benchmark Classifiers*

Unlike the voting classifiers, which were configured without any particular attention to the fine-tuning of their hyperparameters, each of the benchmark ML approaches was specifically fine-tuned to ensure a fair comparison between the ensemble voting classifiers and single ML predictions (Table 1).


**Table 1.** Hyperparameters of benchmark COR classifiers.
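The benchmark fine-tuning can be sketched with a standard grid search. The search grid and scoring choice below are assumed examples for illustration only; the actual tuned values used in the study are those listed in Table 1.

```python
# Illustrative hyperparameter search for one benchmark classifier (DT);
# the grid values here are assumptions, not the study's exact search space.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [3, 5, 10, None],        # assumed candidate values
              "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring="f1_weighted", cv=5)
# After search.fit(X_train, y_train), search.best_params_ holds the tuned
# hyperparameters and search.best_estimator_ the refitted benchmark model.
```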

#### *4.4. Evaluation Metrics*

The prediction performance of the COR impact predictors was evaluated using conventional accuracy, precision, recall, and F1 scores. For this, the number of correct predictions of each cost class (true positives (TP)) and correct assignments of samples to the rest of the subclasses (true negatives (TN)) were obtained. Likewise, the incorrect predictions within each subclass (false positives (FP) and false negatives (FN)) were also recorded. Accuracy, F1 scores, precision, and recall [73] were obtained using Equations (2)–(5).

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$F1\ \text{score} = \frac{2 \times (Precision \times Recall)}{Precision + Recall} \tag{3}$$

$$Precision_{Multiclass} = \frac{\sum_{i=1}^{classes} TP_i}{\sum_{i=1}^{classes} (TP_i + FP_i)} \tag{4}$$

$$Recall_{Multiclass} = \frac{\sum_{i=1}^{classes} TP_i}{\sum_{i=1}^{classes} (TP_i + FN_i)} \tag{5}$$
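Equations (2)–(5) can be computed directly from a confusion matrix. The helper below is a minimal sketch (not the authors' code): it derives the per-class TP, FP, and FN counts and applies the micro-averaged forms of Equations (4) and (5).

```python
# Minimal sketch of Eqs. (2)-(5) from a multiclass confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def multiclass_metrics(y_true, y_pred):
    """Return accuracy, micro-precision, micro-recall, and F1 score."""
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)                # TP_i: correct predictions per class
    fp = cm.sum(axis=0) - tp        # FP_i: column totals minus the diagonal
    fn = cm.sum(axis=1) - tp        # FN_i: row totals minus the diagonal
    accuracy = tp.sum() / cm.sum()                        # Eq. (2)
    precision = tp.sum() / (tp.sum() + fp.sum())          # Eq. (4)
    recall = tp.sum() / (tp.sum() + fn.sum())             # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (3)
    return accuracy, precision, recall, f1
```

Note that for single-label multiclass prediction the micro-averaged precision and recall coincide with accuracy, since every false positive for one class is a false negative for another; this is precisely why the per-class F1 scores examined later are more informative on imbalanced data.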

Accuracy alone did not provide a satisfactory evaluation for the imbalanced dataset, as a classifier can achieve a high accuracy by favoring the majority class while misclassifying minority classes. Thus, the F1 score was preferred for the imbalanced dataset, as it combines precision and recall. However, for a reliable COR predictor, the accuracy of the individual subclasses also needed to be evaluated.

#### **5. Results and Discussion**

The prediction performance of the soft and hard voting classifiers, along with that of the benchmark ML approaches, is shown in Table 2.


**Table 2.** COR prediction performance results.

As Table 2 shows, LR outperformed all predictors in terms of F1 score for the five-level cost impact prediction. Additionally, DT and GB displayed better F1 scores for the three- and four-level COR impact predictions. However, the accuracy and F1 scores can be misleading when working with an imbalanced dataset, so to investigate the practicality of the predictors, their ability to predict each cost impact class needs to be examined. This is best achieved by measuring the F1 score of each of the cost impact classes. Figure 12 presents the prediction performance of the best-performing benchmark classifiers alongside the hard and soft voting classifiers.
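The per-class F1 scores plotted in Figure 12 can be obtained with scikit-learn's per-class averaging. The labels below are illustrative placeholders, not the study's data; classes 4 and 5 are deliberately rare to mimic the imbalance discussed above.

```python
# Per-class F1 scores for an imbalanced five-level label set (illustrative
# placeholder data, not the study's dataset).
from sklearn.metrics import f1_score, classification_report

y_true = [1, 1, 1, 2, 2, 3, 3, 3, 4, 5]   # classes 4 and 5 underrepresented
y_pred = [1, 1, 2, 2, 2, 3, 3, 1, 4, 4]
per_class_f1 = f1_score(y_true, y_pred, average=None,
                        labels=[1, 2, 3, 4, 5], zero_division=0)
print(classification_report(y_true, y_pred, zero_division=0))
```

With `average=None`, `f1_score` returns one value per class instead of a single aggregate, exposing exactly the minority-class failures that a global accuracy or micro-averaged F1 score hides.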

As Figure 12 shows, both the soft and the hard voting classifiers were able to detect the high-cost impact items with only a 4% occurrence (support = 23). Among the COR classifiers, however, only the soft voting classifier detected rework with a cost impact of five (VH), which had a low occurrence of 1% (support = 5).

The high accuracy of DT was associated only with nonconformance items with a very low cost impact; it performed poorly for the other underrepresented but more important cost impact classes (Figure 13). Likewise, the voting classifiers exhibited better performance for medium- and high-impact cost estimation compared with GB, which completely failed to predict COR with a medium impact (Figure 14). Despite its poor class performance in the four-level classification, RF achieved the best prediction for high-impact COR items without sacrificing overall accuracy. On the other hand, the soft voting classifier proved consistent in its prediction accuracy across the different classification levels.

**Figure 14.** Prediction performance in 3-level classifiers.

To achieve a robust COR predictor, it is important to evaluate the ability of ML to solve the problem. Most single and tree-based ensemble ML predictors failed to estimate COR with high (four) or very high (five) impacts. The superior accuracy of the benchmark ML approaches stems from their success in predicting low-cost impact rework items; they remain incapable of predicting high-impact cost items. The prediction of low-cost impact rework items cannot reduce the deviation between the as-planned and as-built costs. On the other hand, the voting classifiers were more successful in predicting high-impact COR items. The soft voting classifier was the most robust for COR, performing notably well in detecting the underrepresented cost impact classes.

To better illustrate the practical implementation of the model, a trained soft voting classifier was used to predict an unseen user input from the test dataset. For example, the user may want to evaluate the material-related nonconformities of a high-rise building project using the available construction team status report and site conditions. The user defines a scenario in which negligence in the initial material inspection due to a lack of site supervision (O-4) results in the receipt of defective material (M-2) from the supplier. The defective material, accompanied by an insufficient review of the design documents (C-15), causes a deviation from the design (C-3). Thus, for the scenario specified by the user, the system can predict the cost impacts of different construction activities. Once the user specifies an activity, such as facade works (ceramic, coating, insulation, etc.), the system uses the trained soft voting classifier to evaluate its impact on the overall construction cost. In this example, the soft voting classifier correctly predicts a cost impact of three (medium) on the overall construction cost.
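The scenario-to-prediction workflow above can be sketched as follows. The feature encoding is entirely an assumption for illustration: the study does not describe here how the cause codes (O-4, M-2, C-15, C-3) and the activity are turned into model inputs, so the one-hot scheme, the `CAUSE_CODES` ordering, and the `ACTIVITIES` mapping below are hypothetical.

```python
# Hypothetical encoding of a user-defined rework scenario into a model input
# row; the feature scheme is an assumption, not the study's exact design.
import numpy as np

CAUSE_CODES = ["O-4", "M-2", "C-15", "C-3"]     # assumed feature ordering
ACTIVITIES = {"facade": 0, "structural": 1, "mep": 2}  # assumed activity codes

def encode_scenario(causes, activity):
    """Binary indicators for active cause codes plus an activity code."""
    row = [1.0 if code in causes else 0.0 for code in CAUSE_CODES]
    row.append(float(ACTIVITIES[activity]))
    return np.array([row])                      # shape (1, n_features)

x = encode_scenario({"O-4", "M-2", "C-15", "C-3"}, "facade")
# impact = soft_vote.predict(x)  # trained soft voting classifier from Sec. 4.2
```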
