*4.1. Data Preprocessing*

The dataset (*D*) was obtained from the construction NCRs and comprises 2527 rows and 39 columns: 38 input feature columns and the column $\mathbf{y}$ (Equation (1)):

$$D = \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{31}, \mathbf{x}_{32}, \dots, \mathbf{x}_{36}, \mathbf{x}_{37}, \mathbf{x}_{38}, \mathbf{y} \} \tag{1}$$

The material-, design-, operation-, and construction-related nonconformity items were used as 31 binary input feature columns ($\mathbf{x}_1$ to $\mathbf{x}_{31}$). The project types were represented by five binary columns ($\mathbf{x}_{32}$ to $\mathbf{x}_{36}$), corresponding to industrial, hospital, high-rise, housing, and other building construction types. The NCR type column, which indicates the area in which the recorded nonconformity item was initiated, was used as another input feature ($\mathbf{x}_{37}$) with four categories: installation, documentation, material inspection, and processes. To make this categorical feature usable by the ML models, dummy encoding was applied, converting the NCR type into four binary columns, one per category (that is, a one-hot representation in which no category is dropped). For example, {1, 0, 0, 0} denotes the installation NCR type, while {0, 0, 1, 0} denotes the material inspection NCR type. In addition, the construction activity associated with each nonconformance item was used as a categorical input column ($\mathbf{x}_{38}$) with 20 categories, as depicted in Figure 5. Dummy encoding was again applied, this time producing 20 binary feature columns.

The encoded dataset therefore has 61 columns (31 + 5 + 4 + 20 = 60 input columns plus $\mathbf{y}$) and 2527 nonconformance rows. Finally, 70% of the dataset was used for training and the remaining 30% was held out for performance evaluation.
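The encoding and split described above can be illustrated with a minimal Python sketch. It uses only the standard library and is not the authors' implementation. The function names, the random shuffle before splitting, the seed, and the rounding of the 70% cut-off are all assumptions made for illustration, since the text does not state how the rows were assigned to the two subsets.

```python
import random

# The four NCR type categories, in the order implied by the examples in the text.
NCR_TYPES = ["installation", "documentation", "material inspection", "processes"]


def one_hot(value, categories):
    """Return one 0/1 indicator per category, with a single 1 at the match."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows, then cut it into training and test subsets.

    Shuffling, the seed, and rounding down are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Column bookkeeping: 31 nonconformity items, 5 project types,
# 4 NCR types, 20 construction activities, plus the column y.
n_columns = 31 + 5 + len(NCR_TYPES) + 20 + 1

installation = one_hot("installation", NCR_TYPES)
material = one_hot("material inspection", NCR_TYPES)

# Row indices stand in for the 2527 encoded nonconformance rows.
train_rows, test_rows = split_rows(range(2527))

print(installation)  # [1, 0, 0, 0]
print(material)      # [0, 0, 1, 0]
print(n_columns)     # 61
print(len(train_rows), len(test_rows))
```

With rounding down, the 70% cut-off on 2527 rows falls at 1768 training rows, leaving 759 for evaluation. A different rounding convention would move one row between the subsets.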
