To attain the research objectives, the research team first acquired reliable national data on occupational incidents involving electrical contractors from OSHA’s online database of catastrophic accidents. The authors then conducted a thorough content analysis to ensure consistency among variables, reduce ambiguity in reported values, and prepare the data for statistical analysis. As the most severe accidents end in a fatality, this study uses fatality rates to describe accident severity. Thus, to investigate and explain the relationship between the factors contributing to accidents and the degree of accident injuries, this study developed a multivariate logistic regression model that estimates the fatality rates of different accident scenarios occurring among electrical contractors. The ability to consider several factors in one model and to interpret the final coefficients as adjusted odds ratios (i.e., controlled for the other factors) makes multivariate logistic modeling a suitable approach for severity analysis. The rest of this section explains each of these steps.
3.3. Logistic Regression
While chi-square tests can show the relationship between a single factor and the degree of injury, they are unable to determine the effects of these variables in the presence of other factors. Logistic regression is an appropriate method for testing the association between potential accident risk factors and a dependent variable [
27]. As mentioned in the background section, many studies have used various types of logistic regression models to investigate potential associations in the field of accident analysis. While more advanced machine learning (ML) methods have emerged and been applied to accident studies in the past decade, logistic regression has several advantages over ML methods that justify its application in this study. First, the results of a regression model are easier to interpret. The model can be represented in one formula using only the independent variables and their coefficients. The coefficients of the model directly identify the important variables, along with the magnitude and direction of the association between each independent variable (i.e., risk factor) and the dependent variable. This property is very important in studies where finding the relationship between risk factors and the dependent variable is as significant as the accuracy of the model’s predictions, which may be the main reason for the popularity of regression modeling in traffic accident analysis studies. Therefore, while some machine learning methods, such as decision trees (and their variants, such as random forests and gradient-boosted trees) and support vector machines, might provide better prediction accuracy, their lack of interpretability can be a disadvantage. Second, once the risk factors are identified, developing a logistic regression model is straightforward and, unlike machine learning methods, does not require tuning various hyperparameters. This quality makes logistic regression the first choice in many predictive studies and a valid baseline for more complicated classifiers. Third, while methods such as association rules can be useful for finding latent patterns in large data sets, these methods are inherently different from classification methods such as logistic regression. As Freitas [
48] has outlined, classification methods use past data to predict the future; prediction is a non-deterministic task, which is why two different classification methods (e.g., logistic regression and decision trees) can generate different predictions for the same set of values. Association rule mining, on the other hand, is deterministic: every algorithm produces the same set of rules, though some run faster than others. One of the objectives behind modeling construction accidents in this paper is to use a reliable and well-defined model to predict fatality rates in common accident scenarios; association rules cannot be used for such predictions.
A logistic regression model can isolate effects and indicate which variables can explain the variability among accidents more accurately. Logistic regression models have been adopted widely in areas ranging from medicine [
49,
50] to the social sciences [
51,
52]. This section details several steps in developing and evaluating a multivariate logistic regression model.
Traditionally, building statistical models starts with selecting variables that result in a parsimonious model (i.e., one with as few variables as possible) that explains the data, is stable, and generalizes to unseen situations [
53]. To develop such models, this study adopted the purposeful selection of variables procedure proposed by Hosmer et al. [
54]. One main advantage of this approach is that it considers both the significance and change-in-estimate criteria when selecting final variables [
55]. Each step will be explained here. One should note that to maintain the language of statistical modeling, accident factors are called “covariates” in this section.
Three steps were taken to conduct this regression analysis (
Figure 1):
Step 1: Select variables and calculate odds ratios
Step 2: Develop and adjust model
Step 3: Assess and validate model
A purposeful model-building process starts with building univariate (i.e., containing only one covariate such as project end-use or source of injury) models for each covariate and assessing their performance. The performance of each univariate model is calculated as the difference between its deviance and the deviance of a model with only the constant parameter (i.e., no covariate). The glm function from the stats package in R [
15] was used to determine the deviances of the models. The significance of this difference (i.e., denoted G) is determined through the p-value of a chi-square test (i.e., the pchisq() function in R) with degrees of freedom equal to (d − 1), where ‘d’ is the number of categories of the covariate. Within this framework, a significant result recommends the inclusion of the variable in the final model. As this is the first step, Hosmer et al. [
54] recommended less conservative significance levels (i.e., 0.25 instead of 0.05) to include more variables in the model. In other words, this step allows less significant factors to remain in the model so that their effects on the more significant factors can be analyzed in later steps. Factors that are neither significant at the 0.05 level nor affect the other factors are eventually excluded from the model.
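This screening step can be sketched as follows; the deviance values are hypothetical placeholders (the study obtained them from glm() fits in R), and only the resulting G, p-value, and keep/drop decision mirror the procedure described above:

```python
# Univariate screening: compare a one-covariate logistic model against the
# intercept-only model via the likelihood ratio statistic G.
# The deviance values below are hypothetical placeholders, not study data.
from scipy.stats import chi2

def screen_covariate(dev_null, dev_model, n_categories, alpha=0.25):
    """Return (G, p-value, keep?) for a univariate logistic model.

    G is the drop in deviance relative to the constant-only model, compared
    to a chi-square distribution with (d - 1) degrees of freedom, where d is
    the number of categories of the covariate.
    """
    G = dev_null - dev_model
    p = chi2.sf(G, n_categories - 1)
    return G, p, p < alpha  # keep the covariate if significant at 0.25

# Hypothetical example: a 6-category covariate whose univariate model
# reduces the deviance from 150.0 to 140.0.
G, p, keep = screen_covariate(150.0, 140.0, 6)
print(round(G, 2), round(p, 3), keep)
```

A covariate that fails this lenient 0.25 threshold is set aside, while borderline covariates survive to be re-examined in the multivariate context.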
Table 1 presents the log-likelihood ratio statistics (i.e., G) for five univariate models. Low values of G indicate that the difference between fatality rates among the categories of a factor is negligible and that the variable would therefore not be very predictive in the model. The significance of G also depends on the degrees of freedom (i.e., d.f.), which ultimately determine the
p-values (i.e., the evidence against a null hypothesis: the smaller the p-value, the stronger the evidence to reject the null hypothesis). One can simply check the
p-values and conclude whether the effect of a variable on fatality rates is significant. As the significance level in this step is 0.25, any p-value below that threshold is considered significant. Based on the p-values of the chi-square tests, “end use” and “project cost” do not have significant effects on the probability of a fatal accident (even at the significance level of 0.25); in other words, the univariate models with these variables do not differ significantly from a model that has no covariates. The results indicate that the other three covariates are significant and should be considered in the multivariate model.
After identifying the more important covariates (i.e., “source of injury,” “cause of injury,” and “project type”), one can build an additive (i.e., no-interaction) model and test the importance of the individual covariates in this multivariate context using traditional significance levels (i.e., 0.05 in this study). The non-significant covariates are temporarily removed from the model; in the case of categorical covariates, all levels are removed even if only one of the categories is not significant. Next, the deviance of the new/reduced model is compared to the deviance of the original multivariate model (i.e., a likelihood ratio test). A large difference means that the removed variables, though not significant on their own, have a considerable effect on adjusting the significant variables and hence should be added back to the model. This process can be repeated several times to make sure that all necessary variables are included in the model. The last step in building the multivariate model is to include interaction effects. As with the single covariates, the interaction terms are added to the model one by one, and their effects are measured through the amount of deviance they reduce. Thus, using this approach, significant interactions also remain in the final model. The modeling step is accomplished mainly using the generalized linear modeling [i.e., glm()] function with the ‘binomial’ family in R [
15].
As there are only eighteen possible models of interest based on the different combinations of these variables and their interactions, all of them are shown in
Table 2. Three subscripts were used to reflect the structure of the data: let π_ijk be the probability of a fatal accident in the (i, j, k)-th group, where i = 1, 2, 3, 4, 5, 6 indexes “source of injury,” j = 1, 2, 3, 4, 5 indexes the levels of “project type,” and k = 1, 2, 3, 4, 5, 6 indexes the categories of “cause of injury.” These variables can produce 180 accident patterns (i.e., 6 × 5 × 6). However, 72 of these patterns have zero cases in the data and were therefore excluded from the analysis, leaving 108 covariate patterns for modeling. As all predictors are categorical, the authors decided to focus on fatality rates among these patterns instead of looking at each accident individually.
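The grouping of raw accident records into covariate patterns, and the fatality rate of each pattern, can be sketched as follows; the records here are invented for illustration only:

```python
# Group accident records into covariate patterns defined by the three
# categorical predictors. The records below are invented for illustration.
from collections import Counter
from itertools import product

N_SOI, N_PT, N_COI = 6, 5, 6  # category counts for the three predictors
all_patterns = list(product(range(1, N_SOI + 1),
                            range(1, N_PT + 1),
                            range(1, N_COI + 1)))
print(len(all_patterns))  # 6 x 5 x 6 = 180 potential patterns

# Each record: (source-of-injury, project-type, cause-of-injury, fatal?)
records = [(1, 2, 3, 1), (1, 2, 3, 0), (4, 1, 6, 1), (4, 1, 6, 1), (2, 5, 2, 0)]
counts = Counter((s, p, c) for s, p, c, _ in records)

# Only patterns observed at least once enter the modeling data set; the
# fatality rate of a pattern is the proportion of fatal accidents within it.
rates = {pat: sum(f for s, p, c, f in records if (s, p, c) == pat) / n
         for pat, n in counts.items()}
print(len(counts), rates[(1, 2, 3)])
```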
Table 2 shows the models in abbreviated notation, along with the formulas for the linear predictor, the deviance, and the degrees of freedom. A unique ID number has also been assigned to each model for future reference.
Using the deviance and degree-of-freedom from
Table 2, one should start with the additive model with three covariates (i.e., model 11). Next, each covariate is excluded from the model in turn to check its effect on the other two covariates. The results of these tests are presented in
Table 3 (i.e., a1, a2, and a3). The results reveal that the additive model with three variables (i.e., SoI + PT + CoI) represents a significant improvement over all the additive models with two factors (i.e., models 5, 6, and 7). In other words, while some levels of the covariates are not significant (results not shown), they have a significant effect on adjusting the other variables and therefore should remain in the model.
The next step is to investigate the interaction effects, which includes models with one, two, or three interaction terms. Three models (i.e., 12, 13, 14) in
Table 2 have one interaction term. For example, model 12 includes the main effects of source of injury (SoI), project type (PT), and cause of injury (CoI), plus the interaction between SoI and PT. The log likelihood tests in
Table 3 (i.e., b1, b2, and b3) show that none of these models with one interaction term is better than the three-factor additive model (i.e., model 11). One can also consider models with two two-way interaction terms, of which there are three (i.e., models 15, 16, and 17). The results show that only one model (i.e., model 16) is marginally (
p-value: 0.071) better than the additive model. Model 18, which includes all pairwise interactions between the variables, does not improve on the additive model and hence is not selected as a good model. Considering the p-values in
Table 3, one can conclude that model 11 (SoI + PT + CoI) is the best model.
The last task in model development is to reconsider the non-significant variables from step 1 in the context of the new multivariate model and check whether they can improve its performance.
Table 4 shows the results of these comparisons and indicates that adding the non-significant variables, one by one or together, does not lead to better results. For instance, adding end use to the model would cost four degrees of freedom while reducing the deviance by only 0.73, which is not even close to a significant improvement (
p-value: 0.949). Therefore, the research team concludes that the additive model with three variables is the best multivariate model among the possible options.
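As a quick arithmetic check, this p-value can be recovered from the chi-square survival function, using the deviance reduction of 0.73 and four degrees of freedom (the third-decimal difference from the reported 0.949 comes from rounding of the deviance):

```python
# Likelihood ratio test for adding "end use" to the additive model:
# a deviance drop of 0.73 against 4 extra degrees of freedom.
from scipy.stats import chi2

p_value = chi2.sf(0.73, 4)
print(round(p_value, 3))  # ~0.947, matching the reported 0.949 up to rounding
```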
After building a model by selecting covariates and tuning it in a purposeful manner, one should compare the probabilities produced by the model against the true values in the data. To do so, the last step in every statistical modeling process is model assessment. This section details the three main procedures used to assess the final model.
I. Goodness of fit
After building models and comparing deviances to determine the covariates and arrive at a parsimonious model, one needs to examine how well the final model fits the data. Lack of fit means that the estimated coefficients are biased, the odds ratios could be misleading, and future predictions will not be accurate. To check fit, one can compare the predicted values derived from the model to the observed values to confirm that the fitted model is correct [
56]. The Hosmer-Lemeshow (HL) statistic is a popular test for goodness-of-fit and has been used in several clinical studies [
49,
50,
57]. The idea is to partition the observations, based on their estimated probabilities, into g groups (usually 10, representing deciles of risk) containing approximately the same number of observations; the first group contains the lowest estimated probabilities of fatality, the next group larger probabilities, and so on, up to the last group, which contains the highest probabilities [
58,
59]. The HL statistic is then calculated by comparing the sum of the estimated probabilities to the number of observed events in each group. A chi-square test on this value with g − 2 degrees of freedom determines whether there is enough evidence to reject the hypothesis that the model fits the data. The ‘hoslem.test()’ function from the ‘ResourceSelection’ package in R [
15] provides the test statistic and p-value of the HL test.
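The grouping-and-comparison procedure can be sketched with a minimal re-implementation (an illustrative version, not the ResourceSelection source); the toy data below are constructed to be perfectly calibrated, so the statistic is essentially zero:

```python
# Hosmer-Lemeshow goodness of fit: sort observations by predicted
# probability, split into g near-equal groups, and compare observed and
# expected event counts in each group.
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    pairs = sorted(zip(p, y))                 # order observations by risk
    n = len(pairs)
    bounds = [round(i * n / g) for i in range(g + 1)]
    stat = 0.0
    for a, b in zip(bounds, bounds[1:]):
        group = pairs[a:b]
        n_k = len(group)
        exp1 = sum(pi for pi, _ in group)     # expected events in the group
        obs1 = sum(yi for _, yi in group)     # observed events in the group
        stat += ((obs1 - exp1) ** 2 / exp1
                 + (obs1 - exp1) ** 2 / (n_k - exp1))
    return stat, chi2.sf(stat, g - 2)         # g - 2 degrees of freedom

# Toy perfectly calibrated data: ten risk levels of 20 observations each,
# with exactly as many events as the probabilities imply.
y, p = [], []
for k in range(1, 11):
    y += [1] * k + [0] * (20 - k)
    p += [k / 20] * 20

stat, pval = hosmer_lemeshow(y, p, g=10)
print(round(stat, 6), round(pval, 3))  # ~0.0 and 1.0 for perfect calibration
```

For reference, the paper’s reported statistic of 5.85 on 8 degrees of freedom corresponds to `chi2.sf(5.85, 8)`, which is about 0.664.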
Table 5 shows the results of the Hosmer-Lemeshow test when the accident patterns are divided into 10 groups. The Hosmer-Lemeshow goodness-of-fit statistic computed for the frequencies in
Table 5 is 5.85 with 8 degrees of freedom, with a corresponding
p-value of 0.664. The large p-value indicates that the null hypothesis that the model fits the data cannot be rejected, suggesting that the model fits the data well. A comparison of the observed and expected frequencies in the 20 cells of
Table 5 also shows close agreement in every decile of risk. For instance, within the highest-risk decile (i.e., decile 10), the difference between the observed and expected values is within one point.
Figure 2 compares the average of observed and predicted values in each decile, which again shows a close agreement in most deciles.
II. Diagnostics
Even a strong fit can be very sensitive to outlying and extreme-leverage points in the data [
60]. So, while summary statistics such as the HL test can indicate the overall fit of a model with a single number, one still needs to check whether the model fits across all covariate patterns. Pregibon [
61] introduced a range of diagnostic measures for logistic models with binary outcomes. Two types of measures, based on the leverage values of the covariate patterns, are adopted in this study: (1) those that determine the fit in each pattern (i.e., the change in the value of the Pearson chi-square, Δχ², and the change in the deviance, ΔD); and (2) those that determine the amount of influence a pattern has on the estimated coefficients (i.e., the change in the estimated coefficients, Δβ).
For each covariate pattern, the values of Δχ² (and ΔD) are calculated as the difference between the Pearson chi-square (and deviance) values of the original model and those of the model fitted after excluding the observations in that pattern. As mentioned by Peng et al. [
62], at the significance level of 0.05, and based on the critical value of the chi-square distribution with one degree of freedom (i.e., 3.84), changes greater than four are considered large and demonstrate that the pattern in question contributes significantly to the disagreement between the observed and predicted values. Large values of Δβ also indicate that the estimates are not stable. Large values of Δχ² or ΔD accompanied by large changes in the coefficients can signal that a covariate pattern is an outlier and should be investigated in more detail.
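These leverage-based measures have standard closed-form approximations in terms of each pattern’s Pearson residual r_j, deviance residual d_j, and leverage h_j; the sketch below uses those textbook formulas with hypothetical input values, not the study’s Table 6 numbers:

```python
# Leverage-based diagnostics for one covariate pattern, using the standard
# closed-form approximations (r: Pearson residual, d: deviance residual,
# h: leverage). Input values below are hypothetical, for illustration only.

def pattern_diagnostics(r, d, h):
    delta_chi2 = r ** 2 / (1 - h)                # change in Pearson chi-square
    delta_dev = d ** 2 + (r ** 2 * h) / (1 - h)  # change in deviance
    delta_beta = (r ** 2 * h) / (1 - h) ** 2     # influence on coefficients
    return delta_chi2, delta_dev, delta_beta

# A pattern with a large Pearson residual and moderate leverage:
dchi, ddev, dbeta = pattern_diagnostics(r=2.5, d=2.2, h=0.2)
# Changes above four (cf. the chi-square critical value 3.84) flag a poorly
# fitting, potentially influential pattern.
print(dchi > 4, round(dchi, 2), round(ddev, 2), round(dbeta, 2))
```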
Figure 3 and
Table 6 show the diagnostic measures for the covariate patterns in the final model. Regarding the poorest fit, covariate patterns 97 (i.e., SoI: “parts and material,” PT: “alteration or rehabilitation,” CoI: “other”) and 55 (i.e., SoI: “machinery,” PT: “other,” CoI: “other”) induced large changes in the Pearson chi-square values. Covariate pattern 97 also created significant changes in the deviance values. In terms of the effect of a covariate pattern on the estimated coefficients, pattern 97 again has the largest value, followed by pattern 100 (i.e., SoI: “parts and material,” PT: “alteration or rehabilitation,” CoI: “installing plumbing and lighting fixtures”).
Figure 3d combines
Figure 3a with
Figure 3c, where the size of the circles represents the value of Δβ. Based on these results, three covariate patterns (i.e., 97, 55, and 100) were selected for further investigation (
Table 7).
To further validate the model, one can delete each of the three questionable patterns in turn and examine the model statistics to see whether the modified model still performs well. Columns three, four, and five of
Table 7 present the deviance, the sum of the Pearson chi-square residuals, and the Hosmer-Lemeshow goodness-of-fit statistic for the three models excluding covariate patterns 97, 55, and 100, respectively. These numbers indicate that the selected additive model (i.e., SoI + PT + CoI) still performs well and fits the data in each case. The results of the three other scenarios in
Table 7 (i.e., removing the two patterns with the poorest fit, removing the two patterns with the largest influence, and removing all three patterns) also show that the additive model with three variables fits the data very well in each scenario. These findings indicate that the additive model performs well with the remaining 105 covariate patterns (representing 91% of all accidents). Based on these results, the three questionable covariate patterns were removed from the rest of the analysis.
III. Validation
As mentioned by Iezzoni [
63], calibration and discrimination are the two main methods for assessing the performance of a logistic regression model on unseen data. While discrimination measures the ability of a model to distinguish between the two classes of the dependent variable, calibration determines the model’s capability of producing estimates that are, on average, close to the observed classes [
64]. This study is focused on fatality rates among common accident scenarios. Due to the categorical nature of the predictors, the fatality rate (i.e., the dependent variable) of an accident scenario is calculated as the proportion of fatal accidents among all accidents in that scenario. For instance, among the eight accidents that occurred in alteration/rehabilitation projects with tools and instruments as the source of injury and interior plumbing/ducting/electrical work as the cause of injury, only one resulted in a fatality. Therefore, the observed fatality rate in this scenario is 1/8, or about 13%. As the objective is to study these rates and compare the model’s estimates against them, the authors concluded that calibration measures would better serve this purpose.
The same measure (i.e., the HL statistic) can be used for calibration/validation purposes. The only difference is that the data are divided into training and testing sets; the model is developed on the training set and tested on the unseen testing set. Large p-values indicate that there is no evidence to believe the model does not fit the test data. For more information on the test statistic for the testing data set, see [
54] (p. 155).
To validate the model on unseen data, a series of training sets was created using 70% of the source data, with the corresponding testing sets formed from the remaining 30%. To increase the reliability of the results, this study applied a stratified sampling method to generate the training/testing data sets. The same measure (i.e., the Hosmer-Lemeshow statistic) was then calculated to ensure that the model, trained only on the training set, fits the testing data well. The chi-square statistic of 6.97 and the p-value of 0.540 showed that the proposed model fits unseen data as well.
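A stratified 70/30 split of this kind can be sketched as follows; the records and the 30% fatality share are invented for illustration, and the study repeated the sampling over a series of splits:

```python
# Stratified 70/30 train/test split: sample within each outcome stratum so
# that the fatality proportion is preserved in both sets. Toy data and a
# fixed seed are used here for illustration only.
import random

random.seed(42)
records = [{"id": i, "fatal": 1 if i < 30 else 0} for i in range(100)]  # 30% fatal

train, test = [], []
for outcome in (0, 1):
    stratum = [r for r in records if r["fatal"] == outcome]
    random.shuffle(stratum)
    cut = int(round(0.7 * len(stratum)))   # 70% of this stratum to training
    train += stratum[:cut]
    test += stratum[cut:]

print(len(train), len(test))                        # 70 / 30 split
print(sum(r["fatal"] for r in train) / len(train))  # 0.3 in both sets
print(sum(r["fatal"] for r in test) / len(test))
```

Because sampling is done within each outcome stratum, both sets retain the overall fatality proportion, which stabilizes the Hosmer-Lemeshow comparison on the held-out data.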