*3.2. Modeling and Algorithm Implementation*

#### 3.2.1. Combinative Strategy Encoding and Data Improvement

In order to reach an optimal model, a combinative strategy containing five subprocesses is proposed: feature selection, the synthetic minority over-sampling technique (SMOTE), one-hot encoding, standard scaler, and classifiers. Feature selection is a process used to reduce the number of input variables when developing a classification model. This study simply divides the behaviors into Yes (high risk) and No (low risk), and this approach may result in an imbalanced distribution for each behavior. SMOTE is a suitable method to address this imbalance [45]. The dataset contains nominal-categorical and ordinal-categorical features. One-hot encoding creates a new binary feature for each category of a categorical feature [46]. Moreover, the features in the obtained dataset lie on different scales. By means of the standard scaler, all features are transformed to a distribution with a mean of 0 and a standard deviation of 1, which helps limit sample differences [46]. As a supervised-learning concept, classification is the process of categorizing a set of data points into classes, and a classifier is an algorithm that performs this categorization. This study used four classifiers, i.e., logistic regression (LR), support vector machines (SVM), random forest (RF), and categorical boosting (CatBoost).
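The core idea of SMOTE, synthesizing new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours, can be illustrated with a minimal pure-Python sketch (this is an illustration of the principle, not the library implementation; the function name and toy data are assumptions):

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling sketch: each synthetic point is a random
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours (illustration only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a, within the minority class only
        neighbours = sorted(
            (p for p in minority if p != a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

# Toy minority class; two synthetic samples are generated between neighbours.
minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
synthetic = smote_like_oversample(minority, n_new=2)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies.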

This study tried 64 models, which are coded by the rules shown in Figure 3. The first four bits take binary values, with 1 indicating that a subprocess is *used* and 0 that it is *unused*: the first bit refers to feature selection, the second to SMOTE, the third to one-hot encoding, and the fourth to standard scaler. The last character is the first letter of the classifier's name. For example, the model code "0101R" denotes a model that uses SMOTE, standard scaler, and RF, but not feature selection or one-hot encoding.
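The encoding rules can be expressed as a short decoding helper (a hypothetical sketch; the function and variable names are illustrative, following Figure 3):

```python
# Bit positions follow Figure 3: 1 = subprocess used, 0 = unused.
STEPS = ["feature selection", "SMOTE", "one-hot encoding", "standard scaler"]
CLASSIFIERS = {"L": "LR", "S": "SVM", "R": "RF", "C": "CatBoost"}

def decode_model(code):
    """Decode a 5-character model code such as '0101R' into the list of
    preprocessing subprocesses used and the classifier name."""
    used = [step for bit, step in zip(code[:4], STEPS) if bit == "1"]
    return used, CLASSIFIERS[code[4]]

used, clf = decode_model("0101R")
# used == ["SMOTE", "standard scaler"], clf == "RF"
```

With four binary preprocessing bits and four classifiers, there are 2^4 × 4 = 64 possible codes, matching the 64 models tried.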

**Figure 3.** The process of encoding models.

#### 3.2.2. Classification by Four Machine-Learning Classifiers

In terms of classification, there are many classic machine-learning algorithms, such as LR and SVM. Recently, emerging algorithms such as RF and CatBoost have been increasingly used. To select the most suitable classifier, this study uses four: LR, SVM, RF, and CatBoost.

Based on the natural logarithm, LR follows a logistic S-curve, and classification is determined by the probability of an outcome. SVM comprises a set of related supervised-learning methods for classification and regression; its learning algorithms are underpinned by statistical learning theory and structural risk minimization. According to Antwi-Afari et al. [47], SVM shows comparable or even better results than other machine-learning methods. RF is an ensemble of decision trees that employs a bagging method to achieve classification; each node is split using the best predictor from a subset of predictors chosen randomly at that node. As it generalizes more robustly than individual decision trees, RF plays an important role in machine learning, as in the works of Niu et al. and Poh et al. [45,48]. Recently, decision trees have been extended into the family of gradient-boosting algorithms, such as eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost). In particular, CatBoost is a framework based on oblivious trees. It has few parameters, supports categorical variables, and handles categorical features efficiently and sensibly. Furthermore, it modifies the gradient computation to avoid prediction shift and thereby improve model accuracy. A comparison of the three algorithms shows that CatBoost achieves the best results [49], despite the small differences among them.
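The LR decision rule described above, mapping a linear score through the logistic S-curve and classifying by the resulting probability, can be sketched in a few lines (a toy illustration with assumed weights, not a fitted model):

```python
import math

def logistic(z):
    """Logistic S-curve: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(weights, bias, x, threshold=0.5):
    """Classify by the probability of the outcome: label 1 (high risk)
    if the logistic probability reaches the threshold, else 0 (low risk).
    Weights here are assumed for illustration, not learned."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = logistic(z)
    return (1 if p >= threshold else 0), p

label, p = lr_predict([2.0, -1.0], 0.5, [1.0, 0.2])
# z = 2.0*1.0 - 1.0*0.2 + 0.5 = 2.3, so p is well above 0.5 and label == 1
```

In the actual study the weights would be fitted to the survey data; the point here is only the S-curve thresholding step.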

#### 3.2.3. Model Tuning and Hyperparameter Optimization by MOSMA and LOO

In some cases, over-fitting is an issue during the machine-learning process, resulting in poor generalizability. One widely accepted remedy is to tune the models and optimize their parameters. This study uses the slime mould algorithm (SMA) to tune the classifiers automatically. SMA is inspired by the behavior of slime mould and has been applied in graph theory and path networks [50,51]. Since five behaviors are modeled in this study, a multi-objective SMA (MOSMA) is used to search for the maximum average scores across these five behaviors. According to Houssein et al. [52], MOSMA consumes significantly less training time than traditional optimization methods such as grid search. Moreover, leave-one-out (LOO) cross-validation is well suited to cases with a small sample size. For *n* samples, the number of training samples is *n* − 1, while only one sample is left out for validation. This train-validation process is repeated *n* times and fully utilizes the training dataset. Since there is no random sampling, bias is eliminated by LOO cross-validation [45]. Therefore, it is reasonable to combine LOO and MOSMA to find optimal settings that maximize the generalizability of the model.
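The LOO scheme, *n* folds where each fold holds out exactly one sample, can be sketched with a simple 1-nearest-neighbour stand-in classifier (the classifier, data, and function names are illustrative; the MOSMA tuning loop is omitted):

```python
def loo_splits(n):
    """Leave-one-out cross-validation: n folds, each with n-1 training
    samples and exactly one held-out validation sample."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, i

def nn_classify(X, y, train_idx, test_idx):
    """1-nearest-neighbour stand-in classifier (illustration only)."""
    xi = X[test_idx]
    nearest = min(train_idx,
                  key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], xi)))
    return y[nearest]

# Toy dataset: every sample is validated exactly once, so the whole
# dataset is used and no random sampling is involved.
X = [(0.0,), (0.1,), (1.0,), (1.1,)]
y = [0, 0, 1, 1]
correct = sum(nn_classify(X, y, tr, te) == y[te] for tr, te in loo_splits(len(X)))
accuracy = correct / len(X)
```

Since the folds are fully determined by *n*, repeated runs give identical scores, which is the "no random sampling" property the text refers to.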

#### 3.2.4. Three Methods for Feature Selection

This study employs a combination of three traditional feature selection methods, i.e., feature importance (FI), Chi-square test (CT) and Boruta selection (BS).

When the variables in the dataset have varying degrees of influence on the five (un)safety behaviors, focusing on the most important features for each behavior is critical for gaining a better understanding of them. To some extent, FI represents the diverse effects of the various features. However, it does not entirely capture the association between the features and the safety behaviors, nor does it determine whether a feature has a positive or negative impact. In this regard, CT and the odds ratio (OR) can make up for this deficiency, as they can not only calculate the correlation between features and safety behaviors, but also reveal the nature of the impact (i.e., positive or negative). BS is a novel feature-selection algorithm for finding all relevant variables [53]. According to Poh et al. [45], BS has a critical advantage over ordinary feature-selection techniques in that it can pick input variables in a robust and unbiased manner by using bagging schemes and incorporating statistical confidence tests into its selection process.

A feature is preserved in an iteration if more than half of the votes are in its favor; otherwise, it is returned to the prediction part of the modeling process until the maximum score is achieved. For instance, Table 2 explains how selection decisions are made for three input indicators, i.e., NatClit, DeptRsp, and TMX1.
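The majority-vote rule can be sketched as follows (the indicator names follow Table 2, but the vote values below are assumed purely for illustration and do not reproduce the study's actual votes):

```python
# Hypothetical vote table: one entry per behavior, 1 = the feature was
# flagged as important for that behavior (values are assumed, not from
# the study's Table 2).
votes = {
    "NatClit": [1, 1, 1, 0, 1],
    "DeptRsp": [1, 1, 0, 0, 0],
    "TMX1":    [0, 1, 0, 0, 1],
}

def majority_keep(votes):
    """Preserve a feature only if more than half of its votes are in favor."""
    return [f for f, v in votes.items() if sum(v) > len(v) / 2]

kept = majority_keep(votes)
# With the assumed votes, only "NatClit" (4 of 5 votes) clears the threshold.
```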


\* The variable obtains one vote if it is shown as an important feature for one behavior.

#### *3.3. Optimal Model Acquisition*

There are many indicators for evaluating the final trained model's performance. For simplicity and efficiency, this study employs common indicators, including the area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and F1-score [48]. Accuracy, precision, recall, and F1-score are partial performance indicators, whereas AUC is a comprehensive indicator. They are defined by the following functions, which are based on the confusion matrix.

$$AUC = \frac{\sum I\left(P_{\text{positive}}, P_{\text{negative}}\right)}{M \times N} \tag{1}$$

where $I\left(P_{\text{positive}}, P_{\text{negative}}\right) = \begin{cases} 1, & P_{\text{positive}} > P_{\text{negative}} \\ 0.5, & P_{\text{positive}} = P_{\text{negative}} \\ 0, & P_{\text{positive}} < P_{\text{negative}} \end{cases}$, and $M$ and $N$ are the numbers of positive and negative samples in the dataset, respectively.

$$Accuracy\,\,=\frac{TP+TN}{TP+TN+FP+FN}\tag{2}$$

$$Precision = \frac{TP}{TP + FP} \tag{3}$$

$$Recall = \frac{TP}{TP + FN} \tag{4}$$

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{5}$$
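Equations (1)–(5) can be computed directly from predicted scores and the confusion-matrix counts; a minimal sketch (the scores and counts below are toy numbers, not the study's results):

```python
def rank_auc(pos_scores, neg_scores):
    """AUC via Eq. (1): compare every positive score against every
    negative score (1 if higher, 0.5 if tied, 0 if lower) and divide
    by the M x N number of pairs."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos_scores) * len(neg_scores))

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from the confusion
    matrix, following Eqs. (2)-(5)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy example: M = 3 positive and N = 2 negative predicted scores.
auc = rank_auc([0.9, 0.8, 0.4], [0.7, 0.3])          # 5 of 6 pairs ranked correctly
acc, prec, rec, f1 = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```

The pairwise form of Eq. (1) makes explicit why AUC is a comprehensive indicator: it measures ranking quality over all positive-negative pairs rather than correctness at a single decision threshold.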

#### **4. Results**

#### *4.1. Necessity of Tuning Models and Optimizing Parameters by MOSMA*

Since the sample was randomly divided into training and test sets, it is necessary to limit model error by tuning the models and optimizing their parameters. This study used the MOSMA method, which is rarely employed in the construction safety domain. Using the average of the outcomes of 10 random divisions as the final performance score, this study compared the performance of MOSMA against the traditional grid-search method. Figure 4 shows the average AUC scores of the four classifiers for the five behaviors, and Figure 5 shows the average accuracy and F1-scores. From these two figures, it can be concluded that CatBoost–MOSMA achieves the best classification performance, and hence it is used for the feature-importance analysis later on.


**Figure 4.** AUC of classifiers with(out) MOSMA.

**Figure 5.** Accuracy and F1-scores of classifiers with(out) MOSMA. (**a**) Accuracy without MOSMA. (**b**) Accuracy with MOSMA. (**c**) F1-scores without MOSMA. (**d**) F1-scores with MOSMA.

#### *4.2. Performance of Different Models*

As mentioned above, this study tried 64 models. Figure 6 depicts their performance in terms of AUC, accuracy, and F1-score. As can be seen from Figure 6, the models coded "1111C" (No. 64) and "1010C" (No. 40) perform satisfactorily. The former employs four methods (i.e., feature selection, SMOTE, one-hot encoding, and standard scaler) together with the CatBoost classifier; the latter uses two methods (i.e., feature selection and one-hot encoding) and CatBoost. The former yields the maximum AUC of 0.9175, an accuracy of 0.8075, and an F1-score of 0.6497; although this F1-score is not the highest, it ranks in the upper-middle range among all models. The latter yields an AUC of 0.8970, an accuracy of 0.8583, and the maximum F1-score of 0.7725. Since model No. 64 garners the maximum AUC, which is a comprehensive performance indicator, the following sections report its results.

**Figure 6.** Performance of 64 models.
