Research on the Prediction of Sustainable Safety Production in Building Construction Based on Text Data

Fan, Jifei; Wang, Daopeng; Liu, Ping; Xu, Jiaming

doi:10.3390/su16125081


by Jifei Fan, Daopeng Wang ^*, Ping Liu and Jiaming Xu

School of Civil Engineering, Lanzhou University of Technology, Lanzhou 730050, China

^* Author to whom correspondence should be addressed.

Sustainability 2024, 16(12), 5081; https://doi.org/10.3390/su16125081

Submission received: 15 May 2024 / Revised: 10 June 2024 / Accepted: 12 June 2024 / Published: 14 June 2024

(This article belongs to the Special Issue Engineering Safety Prevention and Sustainable Risk Management)


Abstract:

Given the complexity and variability of modern construction projects, safety risk management has become increasingly challenging, while traditional methods exhibit deficiencies in handling complex dynamic environments, particularly those involving unstructured text data. Consequently, this study proposes a text data-based risk prediction method for building construction safety. Initially, heuristic Chinese automatic word segmentation, which incorporates mutual information, information entropy statistics, and the TF-IDF algorithm, preprocesses text data to extract risk factor keywords and construct accident attribute variables. At the same time, the Spearman correlation coefficient is utilized to eliminate the multicollinearity between feature variables. Next, the XGBoost algorithm is employed to develop a model for predicting the risks associated with safe production. Its performance is optimized through three experimental scenarios. The results indicate that the model achieves satisfactory overall performance after hyperparameter tuning, with the prediction accuracy and F1 score reaching approximately 86%. Finally, the SHAP model interpretation technique identifies critical factors influencing the safety production risk in building construction, highlighting project managers’ attention to safety, government regulation, safety design, and emergency response as critical determinants of accident severity. The main objective of this study is to minimize human intervention in risk assessment and to construct a text data-based risk prediction model for building construction safety production using the rich empirical knowledge embedded in unstructured accident text, with the aim of reducing safety production accidents and promoting the sustainable development of construction safety in the industry. 
This model not only enables a paradigm shift toward intelligent risk control in safety production but also provides theoretical and practical insights into decision-making and technical support in safety production.

Keywords:

production safety risk; text mining methods; TF-IDF algorithm; XGBoost algorithm; sustainability

1. Introduction

In the current era of rapid digitalization and informatization, enterprise operations and production management are experiencing unprecedented changes, particularly in terms of extracting valuable information from vast unstructured data to enhance competitiveness [1]. In the construction industry, big data and cognitive technologies are fundamental pillars of the global economy and offer significant potential to yield new industry insights. With the rapid development of the construction industry, new types of buildings, such as supertall, green, and intelligent buildings, are driving the expansion of the construction scale and the diversification of design structures, thereby increasing both the volume of work and construction complexity, while imposing greater demands on safety management [2,3,4]. However, factors unique to the construction industry—such as inconsistent labor quality, high resource demands, long project cycles, and complex construction conditions—position it as one of the industries with the highest safety risks [5,6,7]. The construction safety production sector currently confronts a series of deep-rooted challenges and issues involving various facets of safety production risk management [8,9,10,11]. First, the insufficient capacity to analyze textual data prevents full exploration and utilization of the deep information embedded within the vast amount of data. Second, the correlation between risk factors has not been thoroughly explored, hindering a comprehensive understanding of the potential risks. Furthermore, an unclear understanding of critical factors complicates risk prediction and control. In addition, existing risk evaluation methods remain highly subjective and lack objective scientific assessment standards. The application of emerging intelligent technologies in building construction safety production remains limited, while research into interpretable artificial intelligence has not yet been thoroughly explored [12,13,14].

With the deep integration of science and technology into the core aspects of safety production, a series of novel risk factor identification methods grounded in accident causation theories have emerged from analyzing workplace hazard information and expediting the achievement of the risk management goal of proactive accident prevention [15]. Currently, the most popular hazard identification methods involve initial hazard analysis through risk feature extraction using artificial intelligence technology [16,17,18,19]. For instance, Hu [20] developed a safety risk identification method using a two-layer convolutional neural network architecture based on natural gas drilling data, which effectively identified safety risk features in the production process and promoted the early detection of hazards, thereby helping to mitigate and resolve safety risks during natural gas drilling. Macedo [21] identified and analyzed refinery operation hazards using text mining and BERT linguistic representation models. The text mining extracted information from refinery injury reports, while the fine-tuned BERT pretraining identified the risk features in refinery subsystems, ultimately supporting safety managers in effectively identifying and assessing accidental injuries linked to refinery operations.

Additionally, applying software tools for risk identification based on scenario mapping and spatial imaging has proven effective. For example, Zhang [22] developed a method that integrates remote sensing and workspace modeling techniques within construction safety planning to visualize workspaces in building information models, effectively identifying potential spatial safety hazards involving crews, materials, and lifting equipment. Conversely, Aziz [23] used an ontological approach to construct a generic platform for analyzing various system attributes in a chemical processing plant. The most likely hazard scenarios were constructed and analyzed using a dynamic process model, achieving scenario-based dynamic risk identification and rapid risk assessment. Hazard identification techniques widely employed in the workplace typically involve the following steps: first, define the specific scope of the assessment task; second, refine the critical processes involved; third, comprehensively summarize the various risk factors that may arise in these processes; and finally, systematically construct a list of those risk factors. This process allows for the precise identification of potential hazards in various tasks at a given site and generates a comprehensive list of risk factors that personnel, equipment, materials, operating methods, and the environment may encounter during their life cycle [24,25,26,27,28].

Risk evaluation and prediction, crucial components of risk prevention, aim to assess the probability of occupational safety and health injuries and their potential consequences based on all the risk factors identified and analyzed using safety system engineering principles and methods. They also offer theoretical guidance for developing preventive measures and emergency decision-making. Risk prediction methods have gradually expanded to include petroleum, metallurgy, aviation, coal, and emerging construction projects [29,30,31]. Modern safety risk prediction methods can be classified into objective evaluation methods, probabilistic risk analysis methods, machine learning methods, subjective evaluation methods, and software simulation methods based on the type of evaluation [32,33,34,35,36]. For instance, Chen [37] employed an improved entropy TOPSIS-RSR method to predict the risks associated with road traffic safety based on the comprehensive Road Safety Risk Index (RSRI). He developed a system of indicators for assessing road safety risks, which included five dimensions, and confirmed the model’s effectiveness through practical application. Karasan [38] proposed a method that integrated safety and critical effects analysis (SCEA) with a Pythagorean fuzzy set risk prediction technique to enhance the comprehensiveness and accuracy of the risk assessment. Koulinas [39] employed the fuzzy extended hierarchy process (FEAHP) to construct a safety risk prediction model and quantitatively analyze workplace risk priorities. Conversely, Zavari [40] utilized building information modeling (BIM) and the geospatial information system (GIS) to establish a framework for dynamically optimizing layout planning at building construction sites to enhance workplace safety. 
Luo [35] established a mapping relationship between feature attributes and accident severity, utilizing text data from building collapse incidents and random forest (RF) machine learning algorithms to predict the severity of occupational injury accidents.

Although the existing research fully recognizes the significant theoretical and practical implications of investigating the complex consequences of safety injuries or accidents, developing and utilizing risk prediction models remains constrained by challenges and issues in practical applications, such as an incomplete understanding of accident scenarios, poor data quality, and unstable algorithms. Moreover, numerous injury prediction studies have primarily used statistical analysis tools; however, when handling complex injury data and patterns, the prediction accuracy of statistical models fluctuates and exhibits poor generalization performance. Thus, current research must incorporate the rapid changes in data volume, data dimensions, and computational techniques; the gradual trend toward data-driven, machine learning, and artificial intelligence modeling approaches; and the continuous improvement and refinement of occupational accident analytics methods to develop more sophisticated risk prediction models that can address the ever-changing safety risks and challenges. In order to further explore the safety production risk prediction model, this study constructs a framework for a building construction safety production risk prediction model based on the research of Luo and Kang [35,36] and utilizes the XGBoost algorithm combined with the improved SHAP model interpretation technique to enhance the understanding of the critical features influencing building safety production risk. Simultaneously, based on empirical cases from building construction enterprises, the proposed model’s applicability and reliability are assessed, providing guidance for the comprehensive and scientific implementation of safety production practices. 
The main research objective of this study is to develop a comprehensive safety production risk prediction model using quantitative and intelligent scientific and technological methods, minimizing human intervention and leveraging unstructured accident texts that contain rich historical information. This will facilitate the rapid integration of artificial intelligence with production safety, transform hidden danger investigation and risk control into a safer and smarter mode, and provide a theoretical framework and practical technical support for mitigating production safety risks and preventing serious accidents.

The structure of the paper is outlined as follows: Section 2 presents the framework of the proposed predictive model and describes the setup of the experimental framework. Section 3 provides experimental studies that demonstrate the validity of the proposed methodology. Section 4 examines the theoretical and practical implications of this study. Finally, Section 5 summarizes and emphasizes the primary contributions of this paper.

2. Materials and Methods

This section outlines a framework for a model that predicts risks in safe production related to building construction accidents, as shown in Figure 1. The framework is structured in three stages:

Phase 1: Preparation of high-quality data.

Phase 2: Construction of the risk prediction model based on the XGBoost algorithm.

Phase 3: Interpretability analysis of the modeling results.

This study employed correlation coefficient measurements to address data multicollinearity and enhanced the text preprocessing steps to obtain high-quality sample data. During the risk prediction model construction and optimization stage, the XGBoost algorithm was used to simulate three distinct experimental scenarios to obtain prediction model performance indices, thereby verifying the model’s validity and reliability. During the interpretability analysis stage of the model results, two interpretability methods, SHAP and an improved SHAP method based on information gain rate weights, were employed to analyze the critical attribute factors in building construction accidents. The basic process is shown in Figure 1, with Phase 1 described in detail in Section 2.1, Phase 2 in Section 2.2, and Phase 3 in Section 2.3.

2.1. Preparation of High-Quality Data

Drawing lessons from historical accidents is a crucial means of preventing their recurrence. The accident text reports in this work primarily came from two sources. First, Python was used to scrape the investigation results of building construction accidents. Second, because the accident data were scattered and some cases were incomplete, we manually collected and supplemented the texts of safety production accident cases published online and categorized them by type. In accordance with China’s 2021 Regulation on Reporting, Investigating, and Handling Production Safety Accidents, this study classified the collected accident cases into four categories based on the number of fatalities, the number of serious injuries, and the extent of economic losses: general accidents, significant accidents, major accidents, and extraordinary major accidents. Processing the accident text data involves three main steps: text segmentation, keyword extraction using TF-IDF, and multicollinearity analysis among accident attribute variables. A detailed description of these three components follows.

2.1.1. Text Segmentation Process

To reduce the feature extraction workload and minimize the model’s runtime and cost, the collected case texts were first restricted to two modules, accident occurrence and accident cause, which served as the corpus for feature extraction. Second, text segmentation was performed using the Jieba package. The steps included the following:

The Trie tree segmentation model was established. The Trie tree data structure consists of root and multilayered leaf nodes. The root node stores no data, while the leaf nodes each store one character. The retrieval process starts from the root node and sequentially moves through the leaf nodes, with the characters connected to form a complete word. The number inside the leaf node indicates how frequently the word appears in the corpus.
A directed acyclic graph (DAG) of the corpus was established. Through the rapid scanning process of the Trie tree, all possible combinations of each word in the corpus are traversed, the potential combinations are obtained, and a DAG is formed from the nodes and multiple links.
The maximum-probability segmentation path based on word frequency was obtained. Finally, unregistered words are segmented using the HMM model. By treating sentences as observation sequences, the state transition process is described through a stochastic analysis of hidden states and observations. The set of state values corresponding to a sentence is denoted as (B, M, E, S), representing the beginning position (Begin), middle position (Middle), end position (End), and independent word formation (Single), respectively. Sentence annotation is performed using these four states. After extensive corpus training, we can determine the dependency relationships between word states and observations and model the sequence prediction problem using three HMM model parameters: transition probability, emission probability, and initial hidden state probability. The Viterbi decoding algorithm, with model parameters and observation states as inputs, finds the most likely hidden state sequence, thereby implementing an HMM-based segmentation algorithm to process text containing unregistered words. As an example, consider the statement “Inadequate execution of duties by supervisory personnel”. The corresponding word label sequence is “BMMEBEBEBESBE”, where the characters of the sentence constitute the observation sequence and the word position labels constitute the state sequence. The optimal segmentation result is obtained by combining the trained HMM model with the Viterbi algorithm. Additionally, when using the Jieba package for Chinese text segmentation without constructing and loading a complete custom dictionary, the segmentation results are mostly binary word strings. 
Two statistical measures, mutual information and information entropy, were introduced to overcome the limitations of dictionary-based segmentation in specific domains and accurately recognize words with complete meanings. Mutual information assesses the likelihood of word formation and evaluates the independence and semantic integrity of multi-word expressions by counting how often adjacent word combinations occur together. Information entropy measures the degree of freedom of candidate words: the higher the information entropy on both sides of a candidate word, the more likely it is to form an independent word. The algorithm can be described as follows: given ordered neighboring words A and B, the mutual information between A and B is expressed as the value I(A,B), reflecting their degree of association. If I(A,B) is greater than the set threshold, A and B are considered to co-occur. After multiple calculations to find the combined word string AB, the frequencies of the left and right neighboring words of AB are counted, and the left and right neighboring word information entropy values are calculated according to the information entropy formula. By simultaneously eliminating boundary uncertainty, we obtained feature words containing rich semantic information.
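The two statistics described above can be sketched in a few lines. This is only an illustration on a toy English token list, not the segmentation pipeline used in the study: it assumes the pointwise form of mutual information, log2(P(AB) / (P(A)P(B))), and Shannon entropy over the left and right neighbors of the candidate string AB. The function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def pmi(tokens, a, b):
    """Pointwise mutual information of the adjacent pair (a, b) (assumed form of I(A,B))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_ab = bigrams[(a, b)] / (len(tokens) - 1)
    if p_ab == 0:
        return float("-inf")  # the pair never occurs together
    p_a = unigrams[a] / len(tokens)
    p_b = unigrams[b] / len(tokens)
    return math.log2(p_ab / (p_a * p_b))

def entropy(counter):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def neighbor_entropy(tokens, a, b):
    """Left and right neighbor entropy of the candidate string 'a b'."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == a and tokens[i + 1] == b:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1
    return entropy(left), entropy(right)

# Toy corpus (hypothetical); real input would be segmented accident text.
tokens = ["safety", "management", "is", "weak",
          "safety", "management", "was", "absent",
          "site", "safety", "management", "failed"]

score = pmi(tokens, "safety", "management")
left_h, right_h = neighbor_entropy(tokens, "safety", "management")
```

A candidate such as "safety management" would be kept as a single feature word when both its mutual information and its neighbor entropies exceed thresholds chosen by the analyst; the source does not report the threshold values.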

2.1.2. Keyword Extraction Based on TF-IDF

To account for both word frequency and document discrimination in keyword extraction, and thus reflect how critical and characteristic each word is, this study used the TF-IDF algorithm to weight the feature items. The TF term measures how frequently a word appears in an accident text, and the IDF term is the logarithm of the ratio of the total number of documents in the accident dataset to the number of documents that contain the word. The calculation formula is shown in Equation (1).

In Equation (1), $W_{i}$ is the feature weight of word $i$ after the accident text is segmented into words, $N_{i j}$ is the number of times word $i$ appears in accident text $j$, $\sum_{k} N_{k j}$ is the total number of occurrences of all words in accident text $j$, $k$ indexes the $k$th word, $M$ is the total number of documents in the accident dataset, and $M_{i}$ is the number of documents that contain word $i$.

W_{i} = \frac{N_{i j}}{\sum_{k} N_{k j}} \times \lg (\frac{M}{M_{i} + 1})

(1)
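Equation (1) can be evaluated directly on tokenized documents. The following is a minimal sketch, assuming that each accident report is already a list of tokens and that $\lg$ denotes the base-10 logarithm; the function name and the toy documents are hypothetical.

```python
import math

def tf_idf(docs):
    """Weights per Equation (1): (N_ij / sum_k N_kj) * lg(M / (M_i + 1)).

    docs: list of token lists, one per document.
    Returns one {word: weight} dict per document.
    """
    m = len(docs)
    # M_i: number of documents that contain word i
    doc_freq = {}
    for doc in docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    weights = []
    for doc in docs:
        counts = {}
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
        total = len(doc)  # sum_k N_kj
        weights.append({
            word: (n / total) * math.log10(m / (doc_freq[word] + 1))
            for word, n in counts.items()
        })
    return weights

# Toy documents (hypothetical tokens standing in for segmented accident text).
docs = [
    ["scaffold", "collapse", "supervision"],
    ["formwork", "collapse", "training"],
    ["collapse", "supervision", "supervision"],
    ["crane", "collapse", "inspection"],
]
weights = tf_idf(docs)
```

Note that with the `+ 1` in the denominator, a word that occurs in every document (here "collapse") receives a slightly negative weight, so it sorts below every word that distinguishes between reports.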

The obtained keywords are contextualized to form a list of textual safety risk factor terms, effectively identifying valuable information from highly unstructured data. Since the identified safety risk factors and accident severity levels are qualitative data that cannot be analyzed numerically in their raw form, these categorical variables are processed using the “one-hot” coding method, which represents a variable with n categories as n binary indicator variables (or n-1 when one category is dropped as the reference level). The accident attribute factors are unordered categorical variables, and their causal information is expressed as “0” or “1” using the “one-hot” coding method. Likewise, the four accident severity categories are unordered categorical variables, denoted as “0”, “1”, “2”, and “3” for general, significant, major, and extraordinary major accidents, respectively.
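The encoding described above can be illustrated as follows. This is a sketch under assumptions: each accident record is taken to be the set of risk factor codes identified in its text, and the factor codes shown are hypothetical placeholders rather than entries from the study's risk factor list.

```python
def encode_factors(records, factors):
    """Encode each record as a 0/1 vector: 1 if the factor was identified in the report."""
    return [[1 if factor in record else 0 for factor in factors] for record in records]

# Severity labels as described in the text.
SEVERITY = {
    "general": 0,
    "significant": 1,
    "major": 2,
    "extraordinary major": 3,
}

factors = ["HF1", "MF2", "EF1"]               # hypothetical factor codes
records = [{"HF1", "MF2"}, {"EF1"}, set()]    # hypothetical accident records
labels = ["general", "major", "significant"]  # hypothetical severity labels

X = encode_factors(records, factors)
y = [SEVERITY[label] for label in labels]
```

Each row of `X` then serves as the feature vector of one accident, and `y` holds the corresponding severity class.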

2.1.3. Multicollinearity Analysis of Accident Attribute Variables

When collinearity is present in the selected attribute features, the information from the independent variables overlaps, necessitating the removal of unimportant variables to reduce model estimation bias due to multicollinearity. This study employed the Spearman correlation coefficient to measure the multicollinearity among model feature variables, calculated the rank-order correlation, and conducted hierarchical clustering. The Spearman correlation coefficient, also called the rank-order correlation coefficient, disregards the distribution of the original variables, focusing instead on the linear correlation between the rank values of two variables to indicate the direction and strength of their association. Calculating the Spearman correlation coefficient and setting an appropriate threshold can determine whether multicollinearity exists between variables: if the correlation coefficient exceeds the threshold, a strong correlation exists between the variables. In that case, a single feature attribute should be retained from each cluster to eliminate duplicate information and reduce the issues caused by collinearity. Spearman’s correlation coefficient is given by Equation (2).

In Equation (2), $d_{i}$ is the difference between the rank values of the $i$th data pair, and $n$ is the total number of data points.

ρ = 1 - \frac{6 \sum d_{i}^{2}}{n (n^{2} - 1)}

(2)
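As a worked illustration of Equation (2), the sketch below ranks two toy variables and applies the formula. It assumes there are no tied values, which is the case in which Equation (2) is exact; with ties (common for 0/1 features) the coefficient is usually computed as the Pearson correlation of average ranks instead. The function names and data are hypothetical.

```python
def ranks(values):
    """1-based ranks of the values, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman coefficient per Equation (2): 1 - 6 * sum(d_i^2) / (n (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)  # rank differences 0, -1, 1, -1, 1 give sum(d^2) = 4
```

Comparing `abs(rho)` for every pair of features against a chosen threshold then flags the pairs that would be treated as collinear.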

2.2. Risk Prediction Model Construction Based on the XGBoost Algorithm

2.2.1. XGBoost Algorithm Construction Process

XGBoost is an efficient ensemble learning method that constructs and combines multiple weak base learners to create a more robust model. It enhances efficiency, flexibility, and portability by improving the objective function, introducing a regularization term, and incorporating strategies such as automatic handling of missing values. In the XGBoost algorithm, each step involves fitting a weak learner to the current model’s prediction error, producing a new weak learner that minimizes the objective function. A robust model is achieved by combining the weak learners from all steps. The XGBoost algorithm obtains the weak learner $f_{t} (x)$ through the following steps.

Design the objective function. A regularization term is added to the prediction error to form the objective function (3), where $i$ indexes the $i$th sample, $l ({\hat{y}}_{i}, y_{i})$ is the prediction error of the $i$th sample, and $Ω (f_{k})$ is the complexity function of the $k$th tree, which serves as the regularization term of the objective function.
Apply a second-order Taylor expansion to evaluate the error. The objective function at iteration $t$ can be expressed as Equation (4). Its second-order Taylor approximation uses the first-order and second-order derivatives of the loss, $g_{i}$ and $h_{i}$, to evaluate the error and learn $f_{t} (x)$ .

L^{(t)} (ϕ) = \sum_{i} l ({\hat{y}}_{i}, y_{i}) + \sum_{k} Ω (f_{k})

(3)

L^{(t)} = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t})

(4)

Find the optimal splitting point. The complexity function of the tree (i.e., the regularization term of the objective function) depends on the number of leaf nodes $T$ and the leaf node values $w$, as expressed in (5). Here, $γ$ is the penalty coefficient on the number of leaves, which limits the growth of branches in the regression tree and has a pruning effect, and $λ$ is the penalty coefficient on the leaf node weights, which reduces overfitting. All the samples $x_{i}$ assigned to the $j$th leaf node form the sample set $I_{j} = \{i |q (x_{i}) = j\}$ . For a fixed structure $q (x)$ , minimizing the resulting quadratic objective gives the optimal weight $w_{j}^{*}$ of leaf node $j$ in (6).

Ω (f_{t}) = γ T + \frac{1}{2} λ {‖w‖}^{2}

(5)

w_{j}^{*} = - \frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i} + λ}

(6)
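For readers who want the intermediate steps between Equations (4) and (6), the standard derivation for gradient-boosted trees can be written out as below. This is a sketch of the usual textbook argument rather than a reproduction of the authors' working; the shorthand $G_{j}$, $H_{j}$, and $\tilde{L}^{(t)}$ is introduced here for illustration only.

```latex
% Second-order Taylor expansion of Equation (4) around the previous prediction
L^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
          + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^{2}}

% Drop the constant term, substitute Equation (5), and group samples by leaf,
% using f_t(x_i) = w_j for every i in I_j
\tilde{L}^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \tfrac{1}{2} (H_j + \lambda)\, w_j^{2} \Big] + \gamma T,
\qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i

% Setting the derivative with respect to w_j to zero recovers Equation (6)
\frac{\partial \tilde{L}^{(t)}}{\partial w_j} = G_j + (H_j + \lambda)\, w_j = 0
\;\Longrightarrow\;
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% Substituting w_j^* back gives the score of a fixed tree structure q
\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

The last line is the structure score whose change before and after a split yields the gain in Equation (7).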

The scoring function can be used as an evaluation index to measure the quality of the tree structure $q$: the smaller its value, the better the tree structure. Assuming that $I_{L}$ and $I_{R}$ are the sets of instances of the left and right nodes after splitting, with $I = I_{L} \cup I_{R}$, the difference in the scoring function before and after splitting is measured, and the split point with the largest difference is taken as the optimal splitting point (7). The node is then split at the optimal splitting point. The above steps are repeated until the gain from splitting $L_{s p l i t}$ is less than the set threshold or the tree reaches its maximum depth, at which point the training of a tree is complete.

L_{s p l i t} = \frac{1}{2} [\frac{(\sum_{i \in I_{L}} g_{i})^{2}}{\sum_{i \in I_{L}} h_{i} + λ} + \frac{(\sum_{i \in I_{R}} g_{i})^{2}}{\sum_{i \in I_{R}} h_{i} + λ} - \frac{(\sum_{i \in I} g_{i})^{2}}{\sum_{i \in I} h_{i} + λ}] - γ

(7)
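Equations (6) and (7) reduce to simple sums over the gradients in each node. The sketch below evaluates them for one candidate split; it assumes a squared-error loss $l = \frac{1}{2}(y - \hat{y})^{2}$ purely for the toy numbers, so that $g_{i} = \hat{y}_{i} - y_{i}$ and $h_{i} = 1$. The function names and values are hypothetical.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf weight per Equation (6)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of a candidate split per Equation (7)."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Toy node: targets [1, 1, -1, -1], current prediction 0 for every sample.
# Under the assumed squared-error loss, g = prediction - target and h = 1.
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]

gain = split_gain(g[:2], h[:2], g[2:], h[2:], lam=1.0, gamma=0.0)
w_left = leaf_weight(g[:2], h[:2], lam=1.0)
w_right = leaf_weight(g[2:], h[2:], lam=1.0)
```

In this toy case the split separates the positive from the negative targets, giving a gain of 4/3 and leaf weights of +2/3 and -2/3; raising `gamma` above the gain would make the split not worth taking.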

2.2.2. Model Optimization and Performance Evaluation Metrics

Hyperparameter tuning

In this study, the grid search method was used to optimize the relevant parameters of the XGBoost algorithm and improve the model. Grid search systematically steps through candidate values within a specified parameter range, evaluates each combination of booster parameters by cross-validation, identifies the combination with the highest accuracy on the validation set among all candidates, and yields a training model with the optimal parameter combination.
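The search procedure itself is independent of the model being tuned. The sketch below shows the loop structure only: every parameter combination is scored by k-fold cross-validation and the best mean score wins. The scoring function is a deliberately trivial stand-in (a fixed threshold rule) so that the example is self-contained; in the study's setting it would train an XGBoost model on the training folds and return its validation accuracy. All names and the parameter grid are hypothetical.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous validation folds."""
    return [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]

def grid_search(param_grid, data, fit_and_score, k=5):
    """Return the parameter combination with the best mean cross-validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for valid_idx in k_fold_indices(len(data), k):
            held_out = set(valid_idx)
            train = [row for i, row in enumerate(data) if i not in held_out]
            valid = [data[i] for i in valid_idx]
            scores.append(fit_and_score(params, train, valid))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def threshold_accuracy(params, train, valid):
    """Stand-in for 'fit on train, score on valid': accuracy of a fixed threshold rule."""
    t = params["threshold"]
    return sum((x > t) == bool(label) for x, label in valid) / len(valid)

# Toy data: label is 1 when the single feature exceeds 0.5.
data = [(i / 10, int(i / 10 > 0.5)) for i in range(10)]
best_params, best_score = grid_search(
    {"threshold": [0.2, 0.5, 0.8]}, data, threshold_accuracy, k=5
)
```

The cost grows with the product of the grid sizes, which is why grid search is usually restricted to a small number of parameters with coarse steps.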

Performance Evaluation

This study used three metrics, precision, recall, and F1-score, to assess the model’s predictive accuracy. These were complemented with two standard generalization performance assessment techniques: the confusion matrix and the ROC curve. This combination aids in evaluating and summarizing the model’s performance across subtypes, distinguishing between true-positive and false-positive rates. The formulas for precision ($P$), recall ($R$), and $F1$-score are given in Equations (8) to (10), where $TP$ (true positive) is the number of positive samples correctly predicted as positive, $FP$ (false positive) is the number of negative samples incorrectly predicted as positive, and $FN$ (false negative) is the number of positive samples incorrectly predicted as negative.

P = \frac{T P}{(T P + F P)}

(8)

R = \frac{T P}{(T P + F N)}

(9)

F 1 = \frac{2 \times P \times R}{P + R}

(10)

The confusion matrix typically displays the predicted categories of a classification model against a set of test data with known valid categories, either in tabular or graphical form. This matrix visually demonstrates the classifier’s accuracy. The diagonal elements of the confusion matrix signify the number of correct predictions by the classification model, while the off-diagonal elements indicate the number of incorrect predictions. Comparing the predicted values with the valid values in the matrix elucidates the distribution of incorrect predictions across different categories.
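For a multi-class problem such as accident severity, Equations (8) to (10) are typically applied per class, treating one class as positive and the rest as negative. The sketch below builds a confusion matrix and derives the three metrics for a chosen class; the labels are toy values, and the row/column convention (rows are true classes, columns are predicted classes) is an assumption of this example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def class_metrics(cm, c):
    """Precision, recall, and F1 for class c, per Equations (8) to (10)."""
    tp = cm[c][c]
    fp = sum(row[c] for row in cm) - tp  # other classes predicted as c
    fn = sum(cm[c]) - tp                 # class c predicted as something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels for three severity classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
p1, r1, f1_1 = class_metrics(cm, 1)
```

The diagonal of `cm` holds the correct predictions for each class, and the off-diagonal entries show which classes are confused with one another.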

The ROC curve is a significant metric tool in predictive analytics. After calculating the true-positive rate (TPR) and false-positive rate (FPR), a characteristic curve is formed, with the FPR as the horizontal axis and the TPR as the vertical axis. This method considers the classifier’s ability to distinguish between positive and negative cases, thereby negating the influence of sample category imbalance. As a result, the classification accuracy is readily and intuitively observable from the graph. Additionally, the AUC (area under the curve) index is employed if the ROC curve does not distinctly depict the classification effect. This index represents the area under the ROC curve and the coordinate axis, with a larger enclosed area indicating superior model performance.
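The construction of the ROC curve and the AUC can be sketched as follows for the binary case. With three severity classes, one curve per class (one class versus the rest) is a common choice, but that is an assumption here, not something stated in the text. The scores and labels are toy values and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold over the distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy binary example: 1 = positive class, scores = predicted probability of class 1.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
```

Here the area equals 8/9, which matches the fraction of positive and negative pairs in which the positive sample receives the higher score.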

2.3. Interpretability Analysis of the Modeling Results

SHAP value analysis is a novel method of feature importance analysis that calculates how much each feature in the model contributes to the predicted results. The degree to which each feature influences the model output can be derived by weighting each sample’s feature values and the direction each feature has on the label. Visual representations indicate whether each variable contributes positively or negatively to the predicted value, allowing for a more straightforward explanation of the model’s behavior.

However, since building construction accidents frequently involve multiple factors acting together, the Shapley values that the SHAP method assigns to jointly acting attribute factors may not differ clearly, which makes it challenging to identify the key factors that primarily influence accident occurrence. This work introduced information gain rate weights into the SHAP method to weight attribute factors according to their classification contribution, thereby resolving the issue of jointly acting attribute factors not being apparent. This approach narrows the Shapley value errors, increases the variability of the Shapley values for each attribute factor, and clearly distinguishes the importance of attribute factors. The SHAP method, enhanced with information gain rate weights, is presented in Equation (11), where $x_{i}$ is the $i$th sample of the sample set, $N$ is the complete set of all attribute factors of sample $x_{i}$, $S$ is a subset formed by any number of attribute factors in sample $x_{i}$, $ν (S)$ is the value generated by the attribute factors included in subset $S$ working together, and $λ_{i}$ is the information gain rate weight.

f (x_{j}) = \sum_{i = 1}^{M} \frac{1}{|N|!} \sum_{S \subseteq N \setminus \{j\}} λ_{i} |S|! (|N| - |S| - 1)! [ν (S \cup \{j\}) - ν (S)]

(11)
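The inner sum of Equation (11) is the classical Shapley value. The sketch below computes it exactly for a toy value function by enumerating all subsets; it implements the standard, unweighted Shapley value and does not include the information gain rate weights $λ_{i}$ or the sum over samples. The factor names and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_value(players, value, j):
    """Exact (unweighted) Shapley value of player j for the set function `value`."""
    others = [p for p in players if p != j]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # |S|! (|N| - |S| - 1)! / |N|!
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {j}) - value(s))
    return total

# Toy value function: two factors contribute on their own and also interact;
# the third contributes nothing.
def value(s):
    v = 0.0
    if "supervision" in s:
        v += 1.0
    if "training" in s:
        v += 2.0
    if "supervision" in s and "training" in s:
        v += 2.0  # joint effect when both factors are present
    return v

players = ["supervision", "training", "weather"]
phi = {p: shapley_value(players, value, p) for p in players}
```

The joint effect of 2.0 is split equally between the two interacting factors, so the values are 2.0, 3.0, and 0.0, and they sum to the value of the full set. Exact enumeration is exponential in the number of factors, which is why SHAP implementations rely on model-specific or sampling-based approximations.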

3. Case Study

3.1. Data Collection and Processing

In this study, construction accident investigation reports in China from 2016 to 2023 were utilized as sample data for the case study and model validation. After collecting accident investigation results using Python and manually compiling and categorizing accident case texts, a text database of 973 construction safety accident cases containing five main accident types was finally obtained. Figure 2 illustrates the number of different accident types. Collapse accidents pose extreme hazards in the construction industry and are highly likely to result in mass deaths and injuries. Furthermore, the severity of collapse accidents in the collected data is relatively uniformly distributed, making it suitable as an experimental case to explore the underlying relationship between attribute factors and injury severity. A total of 176 cases of building construction collapse data were collected, of which the number of extraordinary major accidents was too small (only one case). Thus, extraordinary major accidents were excluded from the risk prediction of accident severity to ensure the objectivity and standardization of the selected text data. Eventually, the risk prediction model was constructed from data collected from 175 reported construction collapse accidents, including 58 (32.50%) general accidents, 48 (27.81%) significant accidents, and 69 (39.68%) major accidents. The accident text reports mainly included information covering five aspects: the overview of the accident unit, the description of the accident process, the cause and nature of the accident, the determination of accident responsibility, and the preventive and corrective measures.

3.1.1. Text Segmentation Process

In this study, a heuristic Chinese automatic word segmentation method based on mutual information and information entropy statistics was used to address the challenges of segmentation ambiguity and unregistered word identification in word segmentation. The optimized segmentation results are more complete, semantically clearer, and richer in information. A comparison of some of the segmentation results obtained via Jieba and the statistical method is shown in Table 1.
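
The two statistics behind this kind of segmentation can be sketched in a few lines of Python. The example below is a simplified illustration rather than the authors' implementation: it scores a candidate string by its minimum pointwise mutual information over all two-way splits (internal cohesion) and by the entropy of the characters adjacent to it (boundary freedom). The four-phrase corpus and the function names are hypothetical, and n-gram probabilities are approximated by dividing counts by the total number of characters.

```python
import math
from collections import Counter


def ngram_counts(texts, max_len):
    """Count every character n-gram of length 1..max_len."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total_chars):
    """Minimum pointwise mutual information over all two-way splits."""
    p_word = counts[word] / total_chars
    best = math.inf
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / total_chars
        p_right = counts[word[k:]] / total_chars
        best = min(best, math.log2(p_word / (p_left * p_right)))
    return best


def boundary_entropy(word, texts, side):
    """Entropy of the characters found immediately left or right of `word`."""
    neighbours = Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(word)
            if 0 <= idx < len(text):
                neighbours[text[idx]] += 1
            start = text.find(word, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


# Hypothetical mini-corpus: scaffold collapse, scaffold erection,
# formwork collapse, scaffold dismantling.
corpus = ["脚手架坍塌", "脚手架搭设", "模板坍塌", "脚手架拆除"]
counts = ngram_counts(corpus, max_len=3)
total_chars = sum(len(t) for t in corpus)

cohesion_word = cohesion("脚手架", counts, total_chars)    # "scaffold", a real word
cohesion_nonword = cohesion("架坍", counts, total_chars)   # spans a word boundary
right_h = boundary_entropy("脚手架", corpus, "right")
left_h = boundary_entropy("脚手架", corpus, "left")
print(cohesion_word, cohesion_nonword, right_h, left_h)
```

A string with high cohesion and varied neighbours, such as the word for "scaffold" here, is a plausible word candidate, whereas a string that straddles a word boundary scores lower on cohesion. A practical system would apply thresholds to both statistics and handle unseen sub-strings.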

3.1.2. Keyword Extraction

To account for both word frequency and document differentiation in keyword extraction, this study employed the TF-IDF algorithm to weight the feature items: the TF-IDF values of 2732 feature items were calculated and ranked. Words with high weights but no practical significance for identifying safety risk factors were manually deleted. Subsequently, 68 keywords that expressed safety risk factors were listed based on their TF-IDF values, as shown in Table 2.
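
As a rough illustration of the weighting step, the following Python sketch computes TF-IDF for already-segmented documents. It assumes one common variant (term frequency normalized by document length, inverse document frequency as the natural logarithm of N/df); the exact variant used in the study may differ, and the three token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical segmented accident reports
docs = [
    ["scaffold", "collapse", "scaffold", "supervision"],
    ["formwork", "collapse", "supervision"],
    ["scaffold", "overload", "collapse"],
]
weights = tf_idf(docs)

# Aggregate per-term weights over the corpus and rank them
corpus_score = Counter()
for w in weights:
    corpus_score.update(w)
ranking = [term for term, _ in corpus_score.most_common()]
print(ranking)
```

The term "collapse" occurs in every document and therefore receives a weight of zero, which shows why frequent but undiscriminating words drop out of the ranking before the manual screening step.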

The optimized text segmentation process was used to segment and extract key feature words from construction collapse case reports. The identified feature words, combined with domain knowledge contextualization, enabled a comprehensive classification of the construction collapse risk factors. The risk factors were mapped into four categories based on their specific manifestations: human factors (HF), equipment factors (FF), environmental factors (EF), and management factors (MF). This process yielded a list of 43 subcategories of risk factors, as shown in Table 3.

3.1.3. Multicollinearity Analysis of Accident Attribute Variables

To check for multicollinearity between the attribute factors in the dataset, hierarchical cluster analysis of the Spearman rank-order correlations of the features was conducted using the “cor()” function in R. The maximum Spearman correlation coefficient was 0.4138, which did not exceed the threshold, indicating that the correlations between the features were insignificant. The heatmap of the feature correlation coefficients is shown in Figure 3. The scale on the right side specifies the color mapping, where darker shades correspond to higher correlation coefficients and lighter shades to lower ones, so the magnitude of the correlation between variables can be read directly from the color shading. Because the association between the attribute factors was weak, the slight multicollinearity did not affect the model performance, and all the attribute factors were retained.
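
The check described above can be sketched without any external library. The study itself used R; the Python version below is only an illustration that computes Spearman coefficients as Pearson correlations of average ranks and reports the largest absolute off-diagonal value. The binary feature vectors and the cut-off `ASSUMED_THRESHOLD` are hypothetical, since the threshold value is not stated in the text.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def max_abs_correlation(features):
    """Largest absolute Spearman coefficient over all feature pairs."""
    names = list(features)
    best_value, best_pair = 0.0, None
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            rho = abs(spearman(features[names[a]], features[names[b]]))
            if rho > best_value:
                best_value, best_pair = rho, (names[a], names[b])
    return best_value, best_pair


# Hypothetical presence/absence indicators for three attribute factors
features = {
    "MF3": [1, 0, 1, 1, 0, 0, 1, 0],
    "MF2": [1, 0, 0, 1, 0, 1, 1, 0],
    "HF10": [0, 1, 0, 1, 0, 1, 0, 1],
}
ASSUMED_THRESHOLD = 0.8  # assumption; the paper does not state its threshold
max_rho, pair = max_abs_correlation(features)
keep_all_features = max_rho < ASSUMED_THRESHOLD
print(max_rho, pair, keep_all_features)
```

If the largest coefficient stays below the chosen cut-off, no feature needs to be dropped, which mirrors the decision reported for the collapse dataset.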

3.2. Predictive Model Construction Based on XGBoost

The performance evaluation and result optimization of the safety prediction model were based on three experimental scenarios. To prevent model overfitting and ensure good generalizability, 80% of the data were randomly assigned to the training set and 20% to the testing set. (a) The model was trained and tested using all attribute features, with the gbtree booster selected to run the data. “silent” was set to 0 (its default value), and “nthread” was set to −1 to use all CPU cores. The parameters were n_estimators = 200, eta = 0.01, min_child_weight = 3, max_depth = 6, gamma = 0.6, subsample = 0.3, colsample_bytree = 0.6, lambda = 0.6. (b) The model was trained and tested using only critical attribute features, restricting selection to attributes in the top 80% and applying three-fold cross-validation to avoid overfitting. The parameters were n_estimators = 160, eta = 0.01, min_child_weight = 2, max_depth = 4, gamma = 0.4, subsample = 0.3, colsample_bytree = 0.4, lambda = 0.4. (c) Hyperparameter tuning of the developed XGBoost model was used to further improve the accuracy. The GridSearchCV class of the sklearn module in Python was utilized for grid search training with five-fold cross-validation, and the average AUC was used to evaluate the performance. The tuned parameters were n_estimators = 100, eta = 0.1, min_child_weight = 1, max_depth = 2, gamma = 0.3, subsample = 0.1, colsample_bytree = 0.2, and lambda = 0.3. After conducting the experiments sequentially, the final performance assessment metrics of the prediction models were obtained for the three experimental scenarios (Table 4). The evaluation metrics show that the model’s prediction accuracy when using only key attribute features (0.77) was lower than that when using all attribute features (0.81), whereas the accuracy after hyperparameter optimization was the highest (0.86).
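
For readers who wish to reproduce the setup, the three reported configurations can be written as plain Python dictionaries. The sketch below uses the keyword names of the scikit-learn wrapper of XGBoost (`learning_rate` for eta, `reg_lambda` for lambda) and shows how a grid search enumerates candidate settings. The search grid `HYPOTHETICAL_GRID` is an assumption made for illustration; the grid actually searched by the authors is not reported in this section.

```python
from itertools import product

# Reported settings of scenarios (a), (b), and (c)
SCENARIOS = {
    "a_all_features": dict(
        booster="gbtree", n_estimators=200, learning_rate=0.01,
        min_child_weight=3, max_depth=6, gamma=0.6,
        subsample=0.3, colsample_bytree=0.6, reg_lambda=0.6),
    "b_key_features": dict(
        booster="gbtree", n_estimators=160, learning_rate=0.01,
        min_child_weight=2, max_depth=4, gamma=0.4,
        subsample=0.3, colsample_bytree=0.4, reg_lambda=0.4),
    "c_tuned": dict(
        booster="gbtree", n_estimators=100, learning_rate=0.1,
        min_child_weight=1, max_depth=2, gamma=0.3,
        subsample=0.1, colsample_bytree=0.2, reg_lambda=0.3),
}


def expand_grid(grid):
    """Enumerate every combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, combo))
            for combo in product(*(grid[k] for k in keys))]


# Assumed grid, for illustration only
HYPOTHETICAL_GRID = {
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 160, 200],
}
candidates = expand_grid(HYPOTHETICAL_GRID)

# The tuned setting of scenario (c), restricted to the searched parameters
tuned_subset = {k: SCENARIOS["c_tuned"][k] for k in HYPOTHETICAL_GRID}
print(len(candidates), tuned_subset in candidates)
```

With scikit-learn and xgboost installed, a grid of this form could be passed as `param_grid` to `sklearn.model_selection.GridSearchCV` with `cv=5` around an `xgboost.XGBClassifier`, which performs the same enumeration and the cross-validation internally.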

The prediction results of the three experimental scenarios were evaluated using confusion matrices and ROC curves, as shown in Figure 4 and Figure 5. The confusion matrices showed that the XGBoost algorithm performed well in predicting major safety accidents, with a low misclassification rate for major accidents across the three scenarios. However, general accidents were often misclassified as more severe accidents. When all attribute factors were utilized, misclassifications of more severe accidents occurred only for general accidents. Conversely, severe accidents were incorrectly predicted as significant in the experimental scenarios characterized by critical attributes. The overall layout of the confusion matrices indicated that the prediction results were concentrated on the diagonal, further validating the rationality and effectiveness of the risk prediction model for building construction collapse accidents. Subsequently, a comparison of the area under the ROC curves (AUC) in the three experimental scenarios further demonstrated that the risk prediction model optimized with hyperparameters achieved the highest recognition accuracy (curve 3). Moreover, the prediction model using all attribute factors exhibited average recognition accuracy (curve 1), while the model using key attribute factors had the lowest recognition accuracy (curve 2).
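
The link between a confusion matrix and the reported accuracy and F1 metrics can be made concrete with a short Python sketch. The 3 × 3 matrix below is hypothetical and does not reproduce Figure 4; it merely mimics a 35-sample test set (20% of 175 cases) in which general accidents are sometimes predicted as more severe. The macro-averaged F1 is used here as an assumption, since the averaging scheme is not specified in the text.

```python
def accuracy(matrix):
    """Share of samples on the diagonal (rows = true, columns = predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_f1(matrix):
    """F1 score of each class, computed one-vs-rest."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                    # true class i, predicted other
        fp = sum(row[i] for row in matrix) - tp     # predicted class i, true other
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


# Hypothetical confusion matrix; class order: general, significant, severe
confusion = [
    [8, 3, 1],
    [1, 8, 1],
    [0, 1, 12],
]
acc = accuracy(confusion)
f1_scores = per_class_f1(confusion)
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(acc, 3), [round(s, 3) for s in f1_scores], round(macro_f1, 3))
```

Entries above the diagonal in the first row correspond to general accidents predicted as more severe, the error pattern noted in the text.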

3.3. Interpretability Analysis of Modeling Results

To better understand the decision-making process, SHAP values and an improved SHAP method incorporating information gain rate weights were used to interpret the XGBoost-based model for predicting risks in safe production. Figure 6 and Figure 7 show the feature importance distributions for the top 20 features ranked by the SHAP values and the improved SHAP values. The factors with the most significant influence on the severity of building construction collapse accidents were the neglect of safety issues by project management (MF3), the failure to implement geographic safety supervision responsibilities (MF2), and operators on duty without certification (HF10). Both the SHAP and improved SHAP methods ranked these leading features consistently. The results indicate that staff skill level and regulatory attention are the main influences on accidents in the construction industry. Staff members with extensive work experience and excellent educational backgrounds are better equipped to handle construction risks. In addition, the rankings of the remaining variables changed, except for incomplete safety technical instructions (MF17), improper design of the construction process (HF4), and contradiction of technical safety standards (HF2). All variables were included in the set of improved SHAP-value parameter-based ranking features.
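
A global ranking of this kind is commonly obtained by averaging the absolute per-sample SHAP values of each feature. The Python sketch below illustrates that aggregation together with the share of samples whose SHAP value is positive, which is one way to read the direction of influence. All numbers are hypothetical and are not taken from Figure 6 or Figure 7, and the information gain rate weighting of the improved method is not modeled.

```python
def mean_abs(values):
    """Mean absolute SHAP value, a common global importance score."""
    return sum(abs(v) for v in values) / len(values)


def positive_share(values):
    """Fraction of samples in which the feature raises the prediction."""
    return sum(1 for v in values if v > 0) / len(values)


# Hypothetical per-sample SHAP values for three attribute factors
shap_values = {
    "MF3": [0.40, -0.35, 0.30, -0.25],
    "MF2": [0.20, 0.25, -0.05, 0.15],
    "HF10": [-0.10, 0.12, 0.08, -0.06],
}

importance = {name: mean_abs(vals) for name, vals in shap_values.items()}
ranking = sorted(importance, key=importance.get, reverse=True)
direction = {name: positive_share(vals) for name, vals in shap_values.items()}
print(ranking, direction)
```

A feature can rank first by magnitude while its values are split between positive and negative, which is why importance rankings and direction plots are read together.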

Figure 8 illustrates the magnitude and direction of the influence of each feature on the model output, while Figure 9 depicts the relationship between certain features and the predicted values. Based on the results of the interpretability analysis of the predictive model, the SHAP values corresponding to the MF3 data samples exhibited clustering in positive and negative intervals. This indicates that although the safety attitude and culture of top management significantly impact organizational safety performance, the relationship between safety performance and injury severity is influenced by multiple factors. Moreover, neglecting safety in management does not directly increase the severity of injuries. Furthermore, the HF10 and MF12 feature variables first showed a negative correlation and then a positive correlation. This suggests that the professional skill quality of construction personnel and the supervisor’s knowledge, concern, and safety capabilities play a decisive role in on-site safety production. Additionally, governmental supervision of safety production in construction projects is crucial for improving safety production levels. Meanwhile, most of the MF2 data samples corresponded to SHAP values in the positive region, indicating that a higher degree of unimplemented responsibility for safety supervision corresponds to a higher risk of on-site production safety. In addition, most of the SHAP values of HF8 data samples were clustered in the positive zone, reflecting that the degree of hidden safety hazards and the lack of safety protection measures directly increase the severity of injuries. This feature is also a key concern for enhancing safety production levels.

4. Discussion

In the construction industry, safety is not only a critical topic but also the cornerstone of the industry’s continued prosperity. Previous studies have examined the crucial role of managers in site safety, particularly the impact of their safety focus and competence [41,42,43]. Similarly, the government is crucial in regulating work safety in construction projects. These views are strongly supported by field surveys of construction firms, which show that staff work experience and educational background significantly impact construction risk assessment [44].

This study used SHAP for model interpretive analysis, confirming that employee skill level and supervisory attention are the main factors influencing construction industry accidents. Management’s attitude toward safety correlates with accident occurrence. In terms of the research content, the analysis of attribute factor rankings reveals that building construction collapses can be explained by four dimensions: structural and technical, management and supervision, personnel training and emergency response, and environmental and other external factors. In the structural and technical dimension, factors such as improper design of the construction process (HF4), illegal subcontracting and sub-subcontracting behavior (MF14), and incomplete safety technical instructions (MF17) indicate that the safety of construction projects is contingent upon management decisions and the on-site implementation of specifications. Strengthening compliance with technical specifications, strictly monitoring the legality of subcontracting and sub-subcontracting activities, and enhancing the quality of safety instructions are key measures to ensure structural stability and personnel safety. In the management and supervision dimension, neglect of safety issues by project management (MF3), failure to implement geographic safety supervision responsibilities (MF2), and inadequate execution of duties by supervisory personnel (MF12) are identified as common root causes in many construction accidents [45]. To effectively improve the safety of building construction, it is imperative to first enhance management’s focus on safety. Concurrently, strengthening supervision and enhancing the accountability and competence of supervisory personnel are crucial to ensuring that the construction process adheres to safety standards.

In the dimension of environmental and other external factors, complex geological conditions (EF2) and the failure to formulate a special safety construction plan (MF18) highlight the importance of adaptive and prospective environmental risk management. Developing a specialized safety construction plan tailored to specific environmental conditions and project characteristics is essential for effectively managing environmental risks. Furthermore, managing environmental factors involves not only responding to current conditions but also anticipating and preparing for potential future changes, ensuring long-term construction safety and sustainability. It is important to note that in building construction, Design for Safety (DfS) is recognized as a crucial risk prevention strategy to reduce or eliminate sudden injuries through proactive design measures during the initial phase of the project life cycle [46,47,48,49,50]. This strategy not only concentrates on the physical and technological environment of the workers but also encompasses a broad spectrum of safety management aspects, including structural and technical, managerial and regulatory, and personnel training and emergency preparedness aspects. For example, complex geological conditions necessitate special consideration in the design program to ensure that the construction techniques and materials are appropriately selected for the construction conditions; the attitude and commitment of project managers to DfS directly influence the implementation of safety culture and practices throughout the project; furthermore, safety design entails providing adequate training for construction personnel to ensure a quick and effective response during emergencies. The core of safety design resides in proactively identifying and controlling potential risks through thoughtful design rather than merely reacting to them during the construction phase [51,52,53,54]. The Chinese construction industry should place greater emphasis on implementing DfS to ensure ongoing improvements in construction safety.

In terms of research methodology, Kang [36] utilized a random forest (RF) model to predict occupational accidents based on Korean construction accident and weather data, deriving key risk factors for various types of accidents. Luo [35] employed an RF model to predict the severity of occupational accidents in construction collapse scenarios, utilizing the results from feature importance ranking to identify critical attributes. In this study, following Kang and Luo et al., we used XGBoost, LightGBM, GBDT, and RF to build prediction models on the training set and validated them on the test set. The prediction results are shown in Table 5, and the specific parameters of each model in experimental scenario (c) are shown in Table 6. RF is an ensemble learning method based on decision trees that usually provides robust performance in dealing with noisy data and outliers. GBDT is also a decision tree-based ensemble method; however, it improves performance by training trees iteratively, with each new tree fitted to the errors of the previous ones. LightGBM and XGBoost are both gradient-boosting-based decision tree models known for their efficient performance, fast speed, and low memory consumption. XGBoost uses more efficient algorithms (e.g., greedy algorithm), supports parallel processing, and employs regularization to prevent overfitting. Additionally, XGBoost provides extensive parameter settings to control the complexity and performance of the model. This flexibility allows it to better adapt to the needs of different data and tasks. Based on the experimental results, the prediction model with hyperparameter tuning using XGBoost outperformed the other three models. Compared with Luo et al.’s study, this study’s case database is updated, and the results of the XGBoost model with hyperparameter tuning are significantly better than those of the RF algorithm under the same conditions, increasing the prediction model’s accuracy to 86%.

Considering that machine learning models are generally “black-box models” with poor interpretability, which can only verify the validity of a certain indicator system and cannot provide specific explanations for individual indicators, this study applied the SHAP model to explain the safety production risk prediction model. The SHAP model can measure and compare the impacts of different features on construction safety production and identify the risk factors that determine the safety production risk of construction. This greatly enhanced the interpretability of the prediction results, which has been neglected by previous researchers.

5. Conclusions

5.1. Contributions

The significant contributions of this study are as follows:

In terms of theoretical significance, this study first proposed an innovative text preprocessing method combining statistical metrics and contextual processing, abandoning the traditional mechanical word segmentation approach that relies on customized domain dictionaries. This method improves the understanding and processing accuracy of accident reports through context-sensitive strategies. Second, multicollinearity analysis was introduced in the data preparation phase to identify and process high correlations between data features, avoiding the model instability and interpretation difficulties caused by collinear features. Third, this study utilized the XGBoost algorithm to construct a building construction safety production risk prediction model, verifying the potential of machine learning technology in safety production risk prediction. The results reveal the limitations of dimensionality reduction and feature selection: selecting only key attribute factors may not always improve the model accuracy, so a balance must be struck between simplifying the model and maintaining key features. Subsequent research should focus on reducing the reliance on dimensionality reduction and feature selection. At the same time, appropriate parameter optimization techniques have great potential to enhance the model prediction performance, underscoring the key role of optimization techniques in machine learning model construction. Finally, key factors affecting safety accidents were identified through feature importance analysis, enhancing the explanatory power and credibility of the model and providing a reliable basis for decision-making.

In terms of practical significance, this study adopted systematic data preprocessing and the XGBoost algorithm to construct a predictive model of safety production risk in construction, combined with the SHAP method for explanatory analysis of the model, and proposed risk prevention and control measures in four dimensions: structure and technology, management and supervision, personnel training and emergency response, and environmental and other external factors. These measures aim to provide decision support for decision-makers conducting research on active prevention strategies for project risks. Specifically, the construction unit should strictly comply with technical specifications and introduce a third-party review mechanism to ensure the rationality and safety of the design. Additionally, the quality of safety briefings should be improved through regular safety meetings and detailed written guidelines while establishing a transparent subcontracting management system to monitor the subcontracting process in real-time. Simultaneously, the introduction of a safety performance appraisal system can link safety performance to management’s performance evaluations and rewards, increasing their attention to safety. This approach strengthens supervision and enhances the supervisory staff’s sense of responsibility and competence, ensuring compliance with safety standards. Furthermore, it is necessary to formulate detailed emergency plans and improve emergency response capabilities through regular drills. Regular safety education and publicity activities should be conducted to enhance workers’ safety awareness and skills. The emergency plan can act as a barrier to reduce accidental losses and effectively promote the improvement of risk management capabilities in building construction projects.

5.2. Limitations

This study collected 973 textual accident reports exclusively from the building construction field. However, this dataset had a limited breadth of application examples for analysis, with several notable limitations. First, the amount of data was relatively insufficient, which may limit the generalizability of the results. Second, the reports did not adequately represent certain key attribute factors, such as employees’ physical and mental health status, which are crucial to the workplace safety climate. Therefore, when facing more detailed task requirements, these attribute factors must be adjusted or added to assess their impact on workplace safety more comprehensively.

Additionally, although the proposed risk prediction model based on the XGBoost algorithm underwent calibration for its performance on reports of building construction collapses, to verify the model’s generalizability, it is essential to compare and analyze its results with accident data from other industries or to extend its application to workplace safety risk assessments in different sectors, thereby confirming the model’s applicability and validity. Finally, given that different data sources may contain valuable information, future research should consider combining data from multiple sources to supplement the list of accident attributes.

Author Contributions

Conceptualization, J.F. and D.W.; methodology, J.F.; software, J.X.; resources, P.L.; data curation, J.X.; writing—original draft preparation, J.F.; writing—review and editing, D.W.; supervision, D.W.; project administration, D.W.; funding acquisition, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, (Grant Number: 72061023); the Natural Science Foundation of Gansu Province, China (Grant Number: 20JR10RA173); and the Hongliu Outstanding Young Talents Support Program of Lanzhou University of Technology (0320038).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

1. Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144.
2. Nawaz, A.; Waqar, A.; Shah, S.A.R.; Sajid, M.; Khalid, M.I. An innovative framework for risk management in construction projects in developing countries: Evidence from Pakistan. Risks 2019, 7, 24.
3. Hwang, B.-G.; Shan, M.; Phuah, S.L. Safety in green building construction projects in Singapore: Performance, critical issues, and improvement solutions. KSCE J. Civ. Eng. 2018, 22, 447–458.
4. Tonetto, M.S.; Saurin, T.A. Choosing fall protection systems in construction sites: Coping with complex rather than complicated systems. Saf. Sci. 2021, 143, 105412.
5. Wang, J.; Zou, P.X.; Li, P.P. Critical factors and paths influencing construction workers’ safety risk tolerances. Accid. Anal. Prev. 2016, 93, 267–279.
6. Kumar, K.S.; Narayanan, R. Review on construction risk and development of risk management procedural index–A case study from Chennai construction sector. Mater. Today Proc. 2021, 43, 1141–1146.
7. Tian, Z.; Chen, Q.; Zhang, T. A method for assessing the crossed risk of construction safety. Saf. Sci. 2022, 146, 105531.
8. Alkaissy, M.; Arashpour, M.; Ashuri, B.; Bai, Y.; Hosseini, R. Safety management in construction: 20 years of risk modeling. Saf. Sci. 2020, 129, 104805.
9. Afzal, M.; Shafiq, M.T.; Al Jassmi, H. Improving construction safety with virtual-design construction technologies—A review. J. Inf. Technol. Constr. 2021, 26, 319–340.
10. Yap, J.B.H.; Lee, W.K. Analysing the underlying factors affecting safety performance in building construction. Prod. Plan. Control 2020, 31, 1061–1076.
11. Kukoyi, P.O.; Adebowale, O.J. Impediments to construction safety improvement. J. Eng. Proj. Prod. Manag. 2021, 11, 207–214.
12. Sanni-Anibire, M.O.; Mahmoud, A.S.; Hassanain, M.A.; Salami, B.A. A risk assessment approach for enhancing construction safety performance. Saf. Sci. 2020, 121, 15–29.
13. Darko, A.; Chan, A.P.; Yang, Y.; Tetteh, M.O. Building information modeling (BIM)-based modular integrated construction risk management–Critical survey and future needs. Comput. Ind. 2020, 123, 103327.
14. Zhou, Z.; Wei, L.; Yuan, J.; Cui, J.; Zhang, Z.; Zhuo, W.; Lin, D. Construction safety management in the data-rich era: A hybrid review based upon three perspectives of nature of dataset, machine learning approach, and research topic. Adv. Eng. Inform. 2023, 58, 102144.
15. Pagell, M.; Johnston, D.; Veltri, A.; Klassen, R.; Biehl, M. Is safe production an oxymoron? Prod. Oper. Manag. 2014, 23, 1161–1175.
16. Ghousi, R.; Khanzadi, M.; Atashgah, K.M. A flexible method of building construction safety risk assessment and investigating financial aspects of safety program. Int. J. Optim. Civ. Eng. 2018, 8, 433–452.
17. Manzoor, B.; Othman, I.; Manzoor, M. Evaluating the critical safety factors causing accidents in high-rise building projects. Ain Shams Eng. J. 2021, 12, 2485–2492.
18. Chen, H.; Mao, Y.; Xu, Y.; Wang, R. The impact of wearable devices on the construction safety of building workers: A systematic review. Sustainability 2023, 15, 11165.
19. Sofwan, N.M.; Zaini, A.A.; Mahayuddin, S.A. Preliminary study on the identification of safety risks factors in the high rise building construction. J. Teknol. 2016, 78, 8505.
20. Hu, W.; Xia, W.; Li, Y.; Jiang, J.; Li, G.; Chen, Y. An intelligent identification method of safety risk while drilling in gas drilling. Pet. Explor. Dev. 2022, 49, 428–437.
21. Macêdo, J.B.; Moura, M.d.C.; Aichele, D.; Lins, I.D. Identification of risk features using text mining and BERT-based models: Application to an oil refinery. Process Saf. Environ. Prot. 2022, 158, 382–399.
22. Zhang, S.; Teizer, J.; Pradhananga, N.; Eastman, C.M. Workforce location tracking to model, visualize and analyze workspace requirements in building information models for construction safety planning. Autom. Constr. 2015, 60, 74–86.
23. Aziz, A.; Ahmed, S.; Khan, F.I. An ontology-based methodology for hazard identification and causation analysis. Process Saf. Environ. Prot. 2019, 123, 87–98.
24. Chartres, N.; Bero, L.A.; Norris, S.L. A review of methods used for hazard identification and risk assessment of environmental hazards. Environ. Int. 2019, 123, 231–239.
25. Martin, P.; Bladier, C.; Meek, B.; Bruyere, O.; Feinblatt, E.; Touvier, M.; Watier, L.; Makowski, D. Weight of evidence for hazard identification: A critical review of the literature. Environ. Health Perspect. 2018, 126, 076001.
26. Mihić, M. Classification of construction hazards for a universal hazard identification methodology. J. Civ. Eng. Manag. 2020, 26, 147–159.
27. Martin, P.; Bladier, C.; Meek, B.; Bruyere, O.; Feinblatt, E.; Touvier, M.; Watier, L.; Makowski, D. Hazard identification, risk assessment and risk control (HIRARC) accidents at power plant. In Proceedings of the MATEC Web of Conferences, Amsterdam, The Netherlands, 23–25 March 2016; EDP Sciences: Les Ulis, France, 2016.
28. Mihić, M.; Cerić, A.; Završki, I. Developing construction hazard database for automated hazard identification process. Teh. Vjesn. 2018, 25, 1761–1769.
29. Aven, T. Risk assessment and risk management: Review of recent advances on their foundation. Eur. J. Oper. Res. 2016, 253, 1–13.
30. Gambrill, E.; Shlonsky, A. Risk assessment in context. Child. Youth Serv. Rev. 2000, 225, 813–837.
31. Jannadi, O.A.; Almishari, S. Risk assessment in construction. J. Constr. Eng. Manag. 2003, 129, 492–500.
32. Hegde, J.; Rokseth, B. Applications of machine learning methods for engineering risk assessment—A review. Saf. Sci. 2020, 122, 104492.
33. Mohandes, S.R.; Zhang, X. Developing a Holistic Occupational Health and Safety risk assessment model: An application to a case of sustainable construction project. J. Clean. Prod. 2021, 291, 125934.
34. Wuni, I.Y.; Shen, G.Q.; Osei-Kyei, R.; Agyeman-Yeboah, S. Modelling the critical risk factors for modular integrated construction projects. Int. J. Constr. Manag. 2022, 22, 2013–2026.
35. Luo, X.; Li, X.; Goh, Y.M.; Song; Liu, Q. Application of machine learning technology for occupational accident severity prediction in the case of construction collapse accidents. Saf. Sci. 2023, 163, 106138.
36. Kang, K.; Ryu, H. Predicting types of occupational accidents at construction sites in Korea using random forest model. Saf. Sci. 2019, 120, 226–236.
37. Chen, F.; Wang, J.; Deng, Y. Road safety risk evaluation by means of improved entropy TOPSIS–RSR. Saf. Sci. 2015, 79, 39–54.
38. Karasan, A.; Ilbahar, E.; Cebi, S.; Kahraman, C. A new risk assessment approach: Safety and Critical Effect Analysis (SCEA) and its extension with Pythagorean fuzzy sets. Saf. Sci. 2018, 108, 173–187.
39. Koulinas, G.; Marhavilas, P.; Demesouka, O.; Vavatsikos, A.; Koulouriotis, D. Risk analysis and assessment in the worksites using the fuzzy-analytical hierarchy process and a quantitative technique—A case study for the Greek construction sector. Saf. Sci. 2019, 112, 96–104.
40. Zavari, M.; Shahhosseini, V.; Ardeshir, A.; Sebt, M.H. Multi-objective optimization of dynamic construction site layout using BIM and GIS. J. Build. Eng. 2022, 52, 104518.
41. Kouabenan, D.R.; Ngueutsa, R.; Mbaye, S. Safety climate, perceived risk, and involvement in safety management. Saf. Sci. 2015, 77, 72–79.
42. Gunderson, D.E.; Gloeckner, G.W. Superintendent competencies and attributes required for success: A national study comparing construction professionals’ opinions. Int. J. Constr. Educ. Res. 2011, 7, 294–311.
43. Post, C.; Latu, I.M.; Belkin, L.Y. A female leadership trust advantage in times of crisis: Under what conditions? Psychol. Women Q. 2019, 43, 215–231.
44. Moshood, T.; Adeleke, A.; Nawanir, G.; Mahmud, F. Ranking of human factors affecting contractors’ risk attitudes in the Malaysian construction industry. Soc. Sci. Humanit. Open 2020, 2, 100064.
45. Zhou, Z.; Goh, Y.M.; Shi, Q.; Qi, H.; Liu, S. Data-driven determination of collapse accident patterns for the mitigation of safety risks at metro construction sites. Tunn. Undergr. Space Technol. 2022, 127, 104616.
46. Toh, Y.Z.; Goh, Y.M.; Guo, B.H. Knowledge, attitude, and practice of design for safety: Multiple stakeholders in the Singapore construction industry. J. Constr. Eng. Manag. 2017, 143, 04016131.
47. Hossain, M.A.; Abbott, E.L.S.; Chua, D.K.H.; Qui, N.T.; Goh, Y.M. Design-for-safety knowledge library for BIM-integrated safety risk reviews. Autom. Constr. 2018, 94, 290–302.
48. Zhou, W.; Whyte, J.; Sacks, R. Construction safety and digital design: A review. Autom. Constr. 2012, 22, 102–111.
49. Che Ibrahim, C.K.I.; Belayutham, S.; Mohammad, M.Z. Prevention through design (PtD) education for future civil engineers in Malaysia: Current state, challenges, and way forward. J. Civ. Eng. Educ. 2021, 147, 05020007.
50. Poghosyan, A.; Manu, P.; Mahdjoubi, L.; Gibb, A.G.F.; Behm, M.; Mahamadu, A.-M. Design for safety implementation factors: A literature review. J. Eng. Des. Technol. 2018, 16, 783–797.
51. Goh, Y.M.; Chua, S. Knowledge, attitude and practices for design for safety: A study on civil & structural engineers. Accid. Anal. Prev. 2016, 93, 260–266.
52. Ibrahim, C.K.I.C.; Manu, P.; Belayutham, S.; Mahamadu, A.-M.; Antwi-Afari, M.F. Design for safety (DfS) practice in construction engineering and management research: A review of current trends and future directions. J. Build. Eng. 2022, 52, 104352.
53. Teo, A.L.E.; Ofori, G.; Tjandra, I.K.; Kim, H. Design for safety: Theoretical framework of the safety aspect of BIM system to determine the safety index. Constr. Econ. Build. 2016, 16, 1–18.
54. Tymvios, N.; Gambatese, J.A. Perceptions about design for construction worker safety: Viewpoints from contractors, designers, and university facility owners. J. Constr. Eng. Manag. 2016, 142, 04015078.

Figure 1. Process for predicting risks in safe production.

Figure 2. Distribution of construction safety accidents.

Figure 3. Heatmap displaying the Spearman correlation coefficients for the attribute features.

Figure 4. Confusion matrix for the risk evaluation models in three experimental contexts.

Figure 5. The ROC curves of the risk evaluation models in three experimental contexts.

Figure 6. Feature importance distribution of the parameters.

Figure 7. Feature importance distribution based on improved SHAP value parameters.

Figure 8. Summary plot of the SHAP features.

Figure 9. SHAP partial dependence plot.

Table 1. Comparison of the text classification results.

Jieba segmentation results	operation, construction, accident, production, cause, site, indirect cause, supervision, management, construction site, installation, training, inspection, dismantling, violation of regulations
Segmentation results after statistics-based optimization	operators, construction site, operating procedures, timely detection, poor safety awareness, work at height, unauthorized operation, construction program, safety management personnel, safety production, chaotic safety management, operation site, safety management system

Table 2. The 68 keywords describing safety risk factors.

Serial Number	Keywords	TF-IDF	Serial Number	Keywords	TF-IDF
1	Operator	4.579	35	Construction permit	1.156
2	Design plan	3.581	36	Warning signs	1.147
3	Construction site	2.898	37	Safety management system	1.139
4	Management aspects	2.753	38	Inadequate staffing	1.127
5	Operating procedures	2.591	39	Unauthorized command	1.116
6	Site investigation	2.439	40	Repair and maintenance	0.178
7	Timely discovery	1.965	41	Unauthorized operation	0.169
8	Operating area	1.941	42	Illegal contracting	0.154
9	Limited company	1.911	43	Failure to wear a safety helmet	0.091
10	Unqualified	1.899	44	Light safety	0.090
11	Slow safety awareness	1.852	45	Disorderly safety management	0.089
12	Emergency response plan	1.785	46	Engineering design	0.085
13	Safety education and training	1.763	47	Supervision and administration	0.076
14	Measures taken	1.751	48	Supervision contracts	0.072
15	Site management chaos	1.726	49	Operator’s certificates	0.071
16	Fatal accidents	1.692	50	Violation of labor discipline	0.069
17	Unauthorized operation	1.683	51	Safety supervision	0.068
18	Safety confirmation	1.669	52	Supervisory work for safety	0.065
19	Crane equipment	1.651	53	Unreasonable organization	0.064
20	Lack of supervision	1.648	54	Technical programs	0.062
21	Safety inspections	1.592	55	On-site management personnel	0.059
22	Poor safety awareness	1.572	56	Personal protective equipment	0.058
23	Hidden dangers investigation	1.568	57	Safety briefing	0.044
24	Safety rules and regulations	1.534	58	Illegal subcontracting	0.041
25	Safety protection	1.531	59	Conscientious implementation	0.039
26	Organization and coordination	1.519	60	Operation qualification certificate	0.037
27	Management system	1.486	61	Safety protective equipment	0.036
28	Strict supervision	1.451	62	Training and assessment	0.036
29	Responsible accidents	1.392	63	Operation management	0.034
30	Licensed work	1.363	64	Subcontracting units	0.033
31	Unlicensed work	1.357	65	Emergency disposal	0.033
32	Management procedures	1.296	66	Special equipment safety	0.032
33	Laws and regulations	1.271	67	Blind command	0.032
34	Program design	1.188	68	Wearing helmets	0.030
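
Table 2 ranks keywords by TF-IDF score. As a minimal sketch of how such a ranking can be produced, the snippet below scores terms over a toy corpus of already-segmented documents using the common formulation tf × log(N / df), summed over documents. The toy corpus, the function name, and the exact weighting variant are assumptions for illustration; the scores in Table 2 come from the study's own corpus and settings.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Sum tf * log(N / df) over all documents for every term.

    docs: list of non-empty token lists (documents that are already segmented).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # rarer terms weigh more
            scores[term] += tf * idf
    return scores

# Hypothetical, already-segmented accident descriptions (not study data).
corpus = [
    ["operator", "unauthorized operation", "construction site"],
    ["operator", "crane equipment", "construction site"],
    ["operator", "safety inspections", "unauthorized operation"],
]
scores = tfidf_scores(corpus)
for term, score in scores.most_common():
    print(f"{term}\t{score:.3f}")
```

With this unsmoothed variant, a term that appears in every document scores zero; library implementations often add smoothing to the inverse document frequency to avoid that.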

Table 3. Complete safety risk factors for construction.

Risk Factor Category	Risk Factor
Human Factors	Violation of operating regulations (HF1)
	Contravention of technical safety standards (HF2)
	Formwork erection and dismantling not conforming to specifications (HF3)
	Improper design of the construction process (HF4)
	Failure to strictly comply with safety regulations (HF5)
	Failure to follow the construction design program (HF6)
	Incomplete inspection of potential safety hazards (HF7)
	Failure to deal with hidden safety hazards promptly (HF8)
	Inadequate site investigation and monitoring (HF9)
	Operators on duty without certification (HF10)
	Inadequate safety awareness (HF11)
	Inadequate personal protection for operators (HF12)
	Unauthorized change in the construction organization design (HF13)
Facility Factors	Substandard quality of the construction materials (FF1)
	Inadequate inspection of the equipment (FF2)
	Improper maintenance management of the special equipment (FF3)
	Inadequate implementation of the site safety protection measures (FF4)
	Inadequate provision of safety protection equipment (FF5)
Environmental Factors	Harsh natural climate (EF1)
	Complex geological conditions (EF2)
	Poor geographic location (EF3)
	Poor operating space environment (EF4)
Management Factors	Failure to implement a system of responsibility for work safety (MF1)
	Failure to implement geographic safety supervision responsibilities (MF2)
	Neglect of safety issues by project management (MF3)
	Incomplete safety management regulatory system (MF4)
	Inadequate safety management organizational structure (MF5)
	Inadequate execution of duties by safety management personnel (MF6)
	Noncompliance of safety management personnel (MF7)
	Inadequate on-site safety supervision (MF8)
	Lack of safety training and education (MF9)
	Failure to supervise the rectification and review of hidden accident hazards (MF10)
	Lack of supervisory organization or personnel (MF11)
	Inadequate execution of duties by supervisory personnel (MF12)
	Inadequate safety inspection (MF13)
	Illegal subcontracting and sub-subcontracting behavior (MF14)
	Incomplete emergency response mechanism (MF15)
	Failure to formulate an emergency response plan for work safety (MF16)
	Incomplete safety technical instructions (MF17)
	Failure to formulate a particular safety construction plan (MF18)
	Particular program without expert assessment (MF19)
	Chaotic and disordered construction organization and management (MF20)
	Inadequate investment of safety funds (MF21)

Table 4. Evaluation results for the three experimental contexts.

Experiments	Accuracy	Precision	Recall	F1-Score	Severity
a	0.81	0.71	0.84	0.77	0
		0.73	0.67	0.71	1
		0.97	0.96	0.96	2
		0.80	0.82	0.81	Mean
b	0.77	0.72	0.71	0.73	0
		0.73	0.69	0.69	1
		0.91	0.89	0.91	2
		0.79	0.76	0.78	Mean
c	0.86	0.81	0.83	0.86	0
		0.79	0.80	0.78	1
		0.98	0.97	0.95	2
		0.86	0.87	0.86	Mean
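
The "Mean" rows in Table 4 appear to be unweighted (macro) averages of the three per-severity scores. A minimal sketch that recomputes them from the table values, assuming simple arithmetic means rounded to two decimals (the variable and function names are illustrative and not taken from the study's code):

```python
# Per-class (precision, recall, F1) values copied from Table 4,
# ordered by severity level 0, 1, 2 for each experimental context.
TABLE4 = {
    "a": [(0.71, 0.84, 0.77), (0.73, 0.67, 0.71), (0.97, 0.96, 0.96)],
    "b": [(0.72, 0.71, 0.73), (0.73, 0.69, 0.69), (0.91, 0.89, 0.91)],
    "c": [(0.81, 0.83, 0.86), (0.79, 0.80, 0.78), (0.98, 0.97, 0.95)],
}

def macro_average(per_class):
    """Return the unweighted mean of each metric across classes, rounded to 2 dp."""
    n = len(per_class)
    return tuple(round(sum(row[i] for row in per_class) / n, 2) for i in range(3))

macro = {name: macro_average(rows) for name, rows in TABLE4.items()}
for name, (p, r, f1) in macro.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this assumption, the recomputed values agree with the Mean rows reported for contexts a, b, and c.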

Table 5. Comparison of the models’ performance.

Model	Test Scenario	Accuracy	Precision	Recall	F1-Score
XGBoost	a	0.81	0.71	0.84	0.77
	b	0.77	0.73	0.67	0.71
	c	0.86	0.97	0.96	0.96
LightGBM	a	0.74	0.72	0.72	0.73
	b	0.77	0.73	0.77	0.71
	c	0.83	0.81	0.83	0.84
GBDT	a	0.69	0.62	0.69	0.67
	b	0.72	0.66	0.71	0.69
	c	0.74	0.67	0.73	0.77
Random forest	a	0.73	0.70	0.76	0.76
	b	0.76	0.73	0.71	0.75
	c	0.82	0.78	0.80	0.80

Table 6. Comparison of the models’ parameters.

Model (Test Scenario c)	Model Parameters
XGBoost	n_estimators = 100, eta = 0.1, min_child_weight = 1, max_depth = 2, gamma = 0.3, subsample = 0.1, colsample_bytree = 0.2, lambda = 0.3
LightGBM	num_leaves = 18, min_data_in_leaf = 1, max_depth = 2, min_split_gain = 0.3, subsample = 0.1, colsample_bytree = 0.2, reg_lambda = 0.3
GBDT	n_estimators = 100, min_samples_split = 2, max_depth = 2, min_samples_leaf = 1, subsample = 0.1, max_features = 0.2
Random forest	n_estimators = 100, max_depth = 2, min_samples_split = 2, min_samples_leaf = 1, max_features = 0.2
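
For readers who want to reuse the XGBoost settings from Table 6, the sketch below collects them in a dictionary and renames the two native XGBoost parameter names `eta` and `lambda` to their scikit-learn-wrapper aliases, `learning_rate` and `reg_lambda`. The helper name and the choice of the scikit-learn wrapper are assumptions; the study does not state which interface was used.

```python
# XGBoost settings for test scenario (c), copied from Table 6.
# "n_estimators" is the spelling used by the scikit-learn wrapper.
XGB_TABLE6 = {
    "n_estimators": 100,
    "eta": 0.1,
    "min_child_weight": 1,
    "max_depth": 2,
    "gamma": 0.3,
    "subsample": 0.1,
    "colsample_bytree": 0.2,
    "lambda": 0.3,
}

# Native XGBoost names and their documented scikit-learn-wrapper aliases.
ALIASES = {"eta": "learning_rate", "lambda": "reg_lambda"}

def to_sklearn_kwargs(params):
    """Rename aliased keys so the dict can be passed as keyword arguments."""
    return {ALIASES.get(key, key): value for key, value in params.items()}

kwargs = to_sklearn_kwargs(XGB_TABLE6)
print(kwargs)

# Assumed usage, only if the xgboost package is installed:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**kwargs)
```

The renaming matters mainly for `lambda`, which is a reserved word in Python and cannot be written directly as a keyword argument.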

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
