An Interpretable Machine Learning-Based Hurdle Model for Zero-Inflated Road Crash Frequency Data Analysis: Real-World Assessment and Validation
Abstract
1. Introduction
2. Literature Review
2.1. Traditional Models
2.2. Machine Learning Approaches
2.3. Addressing Limitations in Crash Analysis Models
3. Proposed Methodology
3.1. Generalized Machine Learning Hurdle Model
- Data Cleaning: Removing redundant, erroneous, or inconsistent entries that could bias the analysis. For example, duplicate records or missing values are identified and handled to ensure data quality.
- Feature Engineering: Creating or transforming variables to better capture underlying patterns in the data. This may involve modifying road segment attributes or combining features to enhance the dataset’s predictive power.
- Variable Transformation: Converting categorical variables into numerical representations if needed, and applying transformations to ensure compatibility with machine learning algorithms.
- Dataset Splitting: Dividing the dataset into training and testing sets, typically with an 80:20 split, to enable unbiased model evaluation on unseen data.
- Binary Data Creation: For the classification stage, a binary dataset is generated, where the target variable indicates the presence or absence of crashes. This binary outcome facilitates training the classifier to distinguish between crash-prone and non-crash segments.
- Non-Zero Data Creation: For the regression stage, a subset of the data containing only non-zero crash counts is created. Training on this subset lets the regression model focus solely on segments where crashes were observed, so that it models crash frequency conditional on at least one crash.
- Dataset Splitting: Dividing the data into training and testing sets to evaluate the model’s performance on unseen data.
- Model Selection: Choosing a classification algorithm suited for high-dimensional and categorical data.
- Handling Class Imbalance: Applying methods to balance the class distribution, increasing sensitivity to the minority class (crash cases).
- Classification Model Evaluation: Assessing the classification model’s effectiveness in identifying crash-prone segments using metrics like accuracy, precision, recall, and F1 score.
- Dataset Splitting: Similar to the classification stage, dividing data into training and testing sets for unbiased evaluation.
- Regression Model Application: Selecting a suitable regression model with a loss function aligned with count-based data.
- Feature Selection: Emphasizing key features influencing crash frequency, such as road geometry and traffic volume, to improve accuracy.
- Regression Model Evaluation: Using metrics like Root-Mean-Square Error (RMSE) and Mean Absolute Error (MAE) to assess prediction accuracy.
- Final Output Calculation: The expected crash frequency is computed, integrating the predictions from the classifier and the regressor.
- Full Model Evaluation: The combined model’s performance is evaluated to ensure that it provides a reliable analysis of crash likelihood and frequency. This overall evaluation includes both the classification and regression results to confirm the model’s robustness.
- Interpretation for Classification: Interpretability techniques for the classification model indicate the impact of each feature on the likelihood of a crash occurring, helping to clarify which factors most strongly influence the outcome.
- Interpretation for Regression: In the regression model, interpretability methods show how specific features affect the crash frequency predictions. For instance, traffic volume might be associated with increased crash frequency, while other factors could reduce it.
- Insights for Road Safety Planning: By making the model predictions interpretable, stakeholders gain actionable insights into the factors driving crash risks. This information supports targeted interventions, such as adjustments to road design or safety measures, to enhance traffic safety.
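The data-preparation and final-output steps listed above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the record layout, the helper names (`split_train_test`, `make_binary_target`, `nonzero_subset`, `expected_crashes`), the toy data, and the product rule used to combine the two stages are all assumptions made for the example.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split them (80:20 by default)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def make_binary_target(rows):
    """Classification stage: 1 if the segment had any crash, else 0."""
    return [1 if row["crash_count"] > 0 else 0 for row in rows]

def nonzero_subset(rows):
    """Regression stage: keep only segments with observed crashes."""
    return [row for row in rows if row["crash_count"] > 0]

def expected_crashes(p_crash, count_if_crash):
    """Assumed combination rule: P(crash) * E[count | crash]."""
    return p_crash * count_if_crash

# Hypothetical toy data: mostly zero counts, as in zero-inflated crash data.
segments = [{"segment_id": i, "crash_count": c}
            for i, c in enumerate([0, 0, 3, 0, 1, 0, 0, 2, 0, 0])]

train, test = split_train_test(segments)
y_binary = make_binary_target(train)      # target for the classifier
train_nonzero = nonzero_subset(train)     # training rows for the regressor

print(len(train), len(test))              # 8 2
print(expected_crashes(0.25, 2.0))        # 0.5
```

In a full pipeline, `p_crash` would come from the trained classifier and `count_if_crash` from the regressor fitted on the non-zero subset; the product shown here is one common way to combine hurdle stages and is stated as an assumption rather than as the paper's exact formula.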
3.2. CatBoost Hurdle Model
3.3. Customization of Loss Functions
3.3.1. Classification Stage
3.3.2. Regression Stage
3.4. Model Interpretability
4. Empirical Assessment Using Real-World Data
4.1. Data
4.2. Experimental Design
4.2.1. Classification Stage
- Bootstrap Sampling: A bootstrap sample from the overall data is used to grow each tree, promoting variability amongst the trees.
- Random Predictor Selection: At each split within a tree, only a random subset of predictors is considered, which reduces the correlation between trees and strengthens the ensemble’s predictive power.
- Out-Of-Bag Error Estimation: For each tree, the out-of-bag (OOB) observations, i.e., those not included in its bootstrap sample, serve as a validation set for estimating prediction error. This provides a nearly unbiased estimate of model performance without requiring a separate validation set.
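The three random forest mechanisms above can be illustrated with a short sketch. It uses only the Python standard library and does not train any trees; the helper names and the predictor names are hypothetical and chosen for illustration.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; the unused rows are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in in_bag_set]
    return in_bag, out_of_bag

def random_predictor_subset(predictors, k, rng):
    """Pick k distinct candidate predictors for a single split."""
    return rng.sample(predictors, k)

rng = random.Random(42)
# Hypothetical predictor names, loosely based on typical road attributes.
predictors = ["aadt", "slope", "curve_radius", "road_width",
              "median_width", "shoulder_width"]

in_bag, out_of_bag = bootstrap_split(1000, rng)
candidates = random_predictor_subset(predictors, 2, rng)

# For large samples, roughly 1/e (about 36.8%) of rows are out-of-bag per tree.
print(len(out_of_bag) / 1000)
print(candidates)
```

Each tree would be grown on its `in_bag` rows and scored on its `out_of_bag` rows; averaging those scores over all trees gives the OOB error estimate.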
- True positives (TP): Positive observations correctly predicted as positive.
- False positives (FP): Negative observations incorrectly predicted as positive (type I error).
- True negatives (TN): Negative observations correctly predicted as negative.
- False negatives (FN): Positive observations incorrectly predicted as negative (type II error).
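The four counts defined above are the inputs to the usual classification metrics (accuracy, precision, recall, and F1 score). A minimal sketch follows; the function name and the example counts are illustrative and are not taken from the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or observed.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for an imbalanced problem (60 positives, 140 negatives).
metrics = classification_metrics(tp=40, fp=10, tn=130, fn=20)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.85, 'precision': 0.8, 'recall': 0.667, 'f1': 0.727}
```

With imbalanced data such as crash records, accuracy can stay high even when recall on the crash class is poor, which is why precision, recall, and F1 are reported alongside it.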
4.2.2. Regression Stage
4.2.3. Full Model Assessment and Interpretability
4.3. Results
4.3.1. Classification Stage
4.3.2. Regression Stage
4.3.3. Full Model Assessment
4.3.4. Model Interpretability
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Variable | Definition (Unit) | Mean | Standard Deviation | Minimum | Maximum |
---|---|---|---|---|---|
Crash Count | Crash frequency count | 0.37 | 1.21 | 0 | 24 |
Section Length | Homogeneous section length (meter) | 129.90 | 131.44 | 50 | 3100 |
SLOPE | Vertical slope absolute value (%) | 1.94 | 1.76 | 0 | 12.05 |
Curve Radius | Horizontal curve radius (meter) | 272.21 | 437.62 | 0 | 2000 |
Road Width | Lane width (meter) | 3.79 | 0.68 | 2.5 | 5 |
Median Width | Median width (meter) | 0.91 | 0.79 | 0 | 4 |
Shoulder Width | Shoulder width (meter) | 0.59 | 0.69 | 0 | 11.3 |
AADT | Annual average daily traffic (vehicles/day) | 9306.97 | 9253.74 | 321 | 54307 |

| Variable | Definition | Category | Frequency | Proportion |
|---|---|---|---|---|
| Lane Count | Indicator variable for number of lanes | 1 | 3393 | 39.29% |
| | | 2 | 5147 | 59.60% |
| | | 3 | 80 | 0.93% |
| | | 4 | 16 | 0.19% |
| Climb | Indicator variable for climbing lanes | 1 | 180 | 2.08% |
| | | 0 | 8456 | 97.92% |
| Median | Indicator variable for median barriers | 1 | 4711 | 54.55% |
| | | 0 | 3925 | 45.45% |
| Guardrail | Indicator variable for guardrails | 1 | 6510 | 75.38% |
| | | 0 | 2126 | 24.62% |
| Street Light | Indicator variable for street lights | 1 | 535 | 6.19% |
| | | 0 | 8101 | 93.81% |
| City Division | Indicator variable for road category | 1 (Rural) | 7214 | 83.53% |
| | | 0 (Suburban) | 1422 | 16.47% |


| | Actual: Positive | Actual: Negative |
|---|---|---|
| Predicted: Positive | True positives (TP) | False positives (FP) |
| Predicted: Negative | False negatives (FN) | True negatives (TN) |

Classifier | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Logistic Regression | 0.82 | 0.56 | 0.15 | 0.23 |
Decision Tree | 0.74 | 0.31 | 0.32 | 0.31 |
Random Forest | 0.81 | 0.50 | 0.25 | 0.33 |
XGBoost | 0.82 | 0.51 | 0.28 | 0.36 |
CatBoost (Default Loss Function) | 0.83 | 0.58 | 0.26 | 0.35 |
CatBoost (Custom Loss Function) | 0.81 | 0.81 | 0.77 | 0.78 |
Model | RMSE | MAE |
---|---|---|
Zero-Truncated Poisson | 4.873 | 1.920 |
Zero-Truncated Negative Binomial | 2.321 | 1.282 |
CatBoost (Default Loss Function) | 1.997 | 1.101 |
CatBoost (Poisson Loss Function) | 1.855 | 0.916 |
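
The "Poisson Loss Function" row in the table above refers to training the regressor with a count-oriented objective. As a rough sketch of what such an objective measures (this is not CatBoost's internal code), the Poisson negative log-likelihood can be written for a raw model output f whose predicted mean is exp(f). The log link and the function name `poisson_nll` are assumptions made for illustration.

```python
import math

def poisson_nll(y_true, raw_pred):
    """Mean Poisson negative log-likelihood under a log link.

    raw_pred holds f = log(mu), so the predicted mean is exp(f).
    The log(y!) term is computed with lgamma(y + 1).
    """
    total = 0.0
    for y, f in zip(y_true, raw_pred):
        total += math.exp(f) - y * f + math.lgamma(y + 1)
    return total / len(y_true)

counts = [1, 2, 5]                          # hypothetical non-zero crash counts
good = [math.log(c) for c in counts]        # predicted mean equals each count
poor = [math.log(3.0)] * len(counts)        # one constant mean for every row

print(poisson_nll(counts, good) < poisson_nll(counts, poor))  # True
```

Unlike squared error, this loss keeps predicted means positive and scales the penalty with the size of the predicted count, which is why it is a natural candidate for count data.
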
Model | RMSE | MAE |
---|---|---|
Poisson Hurdle Model | 2.097 | 0.527 |
Negative Binomial Hurdle Model | 1.336 | 0.511 |
CatBoost (Single Stage) | 1.197 | 0.421 |
Proposed Model | 0.978 | 0.346 |
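
The RMSE and MAE values reported in the tables above follow the standard definitions of these metrics. A minimal sketch of both is given here; the observed and predicted values are made up for illustration and do not come from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

observed = [0, 0, 2, 1]             # hypothetical crash counts
predicted = [0.0, 1.0, 2.0, 3.0]    # hypothetical model outputs

print(round(rmse(observed, predicted), 3))  # 1.118
print(mae(observed, predicted))             # 0.75
```

Because RMSE squares each error before averaging, a few large misses raise it much more than they raise MAE, so a model can rank differently on the two metrics.
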
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ben Khedher, M.B.; Yun, D. An Interpretable Machine Learning-Based Hurdle Model for Zero-Inflated Road Crash Frequency Data Analysis: Real-World Assessment and Validation. Appl. Sci. 2024, 14, 10790. https://doi.org/10.3390/app142310790