1. Introduction
Traffic safety is a major global concern. Traffic accidents cause approximately 1.35 million deaths and over 50 million injuries worldwide every year [1]. In addition, low- and middle-income countries are estimated to account for approximately 90% of global road traffic fatalities [2]. As the world’s largest developing country, China has the highest population, car ownership, and total road mileage in the world, yet its traffic safety still faces many challenges [3]. It is estimated that about one-fifth of traffic accidents in China are fatal [4,5]. Therefore, the active investigation and management of traffic safety risks has become an urgent problem in building a green and sustainable highway traffic system.
Established research has shown that the main factors affecting traffic safety are people (drivers), vehicles, road conditions, and the environment. Drivers are considered to have the greatest impact on traffic safety, while road conditions are often overlooked [6,7]. Road conditions affect the performance of moving vehicles, drivers’ psychological activities, and driving performance. Approximately 40% of traffic accidents are caused by the direct or indirect influence of road conditions [8]. Babukov et al. studied the influence of highway horizontal and longitudinal cross-section alignment on traffic safety and proposed recommendations for ensuring traffic safety in the stages of road design, operation, and maintenance [9]. Ma et al. analyzed the relationship between the design elements of road alignment (horizontal, longitudinal cross-section, and intersections) and accident risk [10]. With the continuous development of the social economy, existing road designs may become increasingly unable to meet current traffic demand [11]. Therefore, it is necessary to analyze the factors contributing to traffic accidents under current road infrastructure conditions [12,13].
Many researchers have used traditional linear regression (LR) to fit the relationship between accident frequency and factors such as road alignment and traffic volume. However, some studies found that the assumption of linear regression regarding the Chi-square distribution of the variance is often violated in practical applications, and that unconstrained linear regression is not applicable to non-negative count data [14]. Subsequently, some scholars used Poisson regression to model the frequency of traffic accidents. The results showed that it fits better than linear regression, but accident data were found to be over-dispersed, so the model could not satisfy its assumption of equal mean and variance [15,16]. Miaou investigated the relationship between truck accidents and highway geometric features based on Poisson regression, zero-inflated Poisson regression (ZIP), and negative binomial regression (NB), and showed that ZIP and NB can handle the over-dispersion of accident data well [17]. Milton et al. used NB to study the relationship between accident frequency and both road geometry and traffic characteristics, and found that NB is a powerful tool for accident analysis [18]. Later, scholars successively applied generalized estimating equations, Bayesian methods, random-effects models, random-parameters models, and other approaches to the analysis of accident frequency [19,20].
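The equal mean and variance assumption mentioned above can be checked with a few lines of code. The following is a minimal sketch, not taken from the cited studies: the accident counts are hypothetical, and the function names `dispersion_index` and `nb_alpha_moments` are introduced here for illustration only. It assumes the common NB2 variance form, Var = mu + alpha * mu^2.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: near 1 for Poisson-like data, above 1 when over-dispersed."""
    return variance(counts) / mean(counts)


def nb_alpha_moments(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2 (NB2 form, assumed)."""
    mu = mean(counts)
    return (variance(counts) - mu) / mu ** 2


# Hypothetical accident counts per road section (illustrative only).
counts = [0, 0, 1, 0, 2, 0, 0, 7, 1, 0, 3, 0, 12, 0, 1]

print(f"dispersion index: {dispersion_index(counts):.2f}")
print(f"moment estimate of alpha: {nb_alpha_moments(counts):.2f}")
```

A dispersion index well above 1, together with a positive alpha, is the pattern that motivates replacing Poisson regression with a negative binomial model.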
Most traditional statistical models are parametric and rest on specific distributional assumptions. These models may offer relatively robust predictive performance and good interpretability of the internal influence mechanism, but they can be sensitive to data size, have poor real-time processing capability for big data, and cannot capture complex nonlinear relationships between features and the dependent variable [21,22,23].
Recently, machine learning has been widely applied in traffic safety research. Compared with traditional statistical models, machine learning models can approximate complex nonlinear relationships among multi-dimensional variables and have distinct advantages in processing big data in real time and in analyzing samples with complex structures. They can therefore respond flexibly to complex application scenarios and achieve high prediction accuracy [24]. Thakali et al. developed a parametric negative binomial model and a nonparametric kernel regression model based on a large traffic accident data set and found that kernel regression outperformed the negative binomial model in terms of sensitivity to data set size [25]. Li et al. used an LSTM–CNN neural network model to predict real-time accident risk on arterial roads, and the results showed that its prediction accuracy was better than that of other machine learning models such as XGBoost, LSTM, and CNN [26]. Lee et al. compared the accuracy of different machine learning algorithms (random forest, MLP, and decision tree) for predicting traffic accidents under rainfall [27]. Wang modeled accident risk in highway work zones using CNN and binary logistic regression models and found that the CNN model was more accurate [28].
At present, most road design documents in China are scattered across different design units and government agencies, which makes data acquisition difficult and hinders large-scale research. In this paper, we propose a method for back-calculating road geometric alignment from satellite maps using a clustering algorithm, which offers researchers an alternative data source. In addition, considering the characteristics of the data set, we choose the multilayer perceptron (MLP) model, which is easier to implement, to study the accident risk associated with highway geometric alignment. We compare its results with those of the negative binomial model and analyze the advantages and disadvantages of the two models in fitting nonlinear relationships. The internal mechanism of the MLP model is also visualized and analyzed with the help of SHAP theory, so as to compensate for the limited interpretability of machine learning algorithms.
In this study, the geometric alignment data of the target highway were first calculated with the road geometric alignment back-calculation method. Then, based on the traffic accident information of the target road, accident risk association models based on the MLP and on negative binomial regression were established. Finally, the internal mechanism of the MLP model was visualized with the help of SHAP theory.
3. Results
3.1. Traffic Accident Data
In this study, data on 36,439 traffic accidents from 2010 to 2016 were collected from the traffic management department of Chongqing, China, for four highways with a total statistical mileage of 1194.8 km. The accident information mainly includes the time of the incident, driver information, location of the incident, casualty information, weather conditions, accident patterns, topography at the accident site, road conditions, traffic control mode, road type, road alignment, lighting conditions, accident category, etc. The statistical information for each highway is tabulated in Table 1.
3.2. Verification of Road Geometric Alignment Backcalculation
The reliability of the back-calculation method was verified by using a part of the Nanfu highway in Chongqing, China, which had a total mileage equal to 55 km (see Figure 5). The longitude, latitude, and elevation information of each target point along the route were extracted at intervals of 10 m and 40 m, respectively. According to the calculation process (see Equations (4)–(9)), microelement calculation was conducted to obtain the final back-calculation results of the highway plane and longitudinal section alignment indices. The validation results are listed in Table 2.
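The kind of microelement calculation described above can be sketched as follows. This is not the paper's Equations (4)–(9), which are not reproduced in this section; it is a generic illustration under stated assumptions: distances between sampled points use the haversine formula on a spherical Earth, the longitudinal gradient is the elevation change over horizontal distance, and the local curve radius is the circumradius of three consecutive points in a planar projection. All function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def gradient_percent(elev1_m, elev2_m, horizontal_m):
    """Longitudinal gradient in percent between two sampled points."""
    return 100.0 * (elev2_m - elev1_m) / horizontal_m


def circumradius_m(p1, p2, p3):
    """Radius of the circle through three planar points (metres); inf if collinear."""
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p1, p3)
    # The cross product magnitude equals twice the triangle area.
    twice_area = abs((p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0]))
    if twice_area < 1e-9:
        return math.inf  # straight section
    return a * b * c / (2.0 * twice_area)


# Three points on a circle of radius 500 m, and a 3 m rise over 100 m.
radius = circumradius_m((500.0, 0.0), (0.0, 500.0), (-500.0, 0.0))
grade = gradient_percent(100.0, 103.0, 100.0)
```

Radii computed from closely spaced points are sensitive to small position errors on nearly straight sections, which is consistent with the need for the outlier screening discussed for the flat curve radius.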
From Table 2, the correlation coefficients of the flat curve length, flat curve deflection angle, longitudinal slope length, and longitudinal slope gradient are all above 0.85, which shows that the calculated values are consistent with the design values. However, the correlation coefficient between the design and calculated values of the flat curve radius is only 0.3, and the statistical results show that the difference between the calculated and design maximum values of the flat curve radius is significant (see Table 3; the error is 64.94%). The K-means clustering algorithm was therefore used to further cluster the flat curve radius values. It identified three outliers among the back-calculated values, 8.26 km, 8.80 km, and 11.29 km, which are much larger than the design values for the corresponding road sections. After removing these three points, the correlation coefficient between the design and calculated values of the flat curve radius rises to 0.72, with a corresponding p-value below 0.0001, indicating that they are significantly correlated. Therefore, the road geometric alignment back-calculation algorithm can be used for subsequent traffic safety prediction analysis.
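The outlier screening described above can be illustrated with a small one-dimensional K-means. This is a hedged sketch rather than the authors' implementation: the radius data are hypothetical (only the three large values echo those quoted in the text), two clusters are assumed, the cluster with the larger centre is treated as the outlier group, and `kmeans_1d` and `pearson` are names introduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kmeans_1d(values, k=2, iters=50):
    """Lloyd iterations in one dimension (k >= 2); returns (labels, centres)."""
    s = sorted(values)
    # Spread the initial centres evenly over the sorted values.
    centres = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Hypothetical design and back-calculated flat curve radii in km.
design = [0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0, 0.7, 0.9, 1.1]
calculated = [0.45, 0.55, 0.85, 1.1, 1.15, 1.6, 1.9, 8.26, 8.80, 11.29]

labels, centres = kmeans_1d(calculated, k=2)
outlier_label = max(range(len(centres)), key=lambda j: centres[j])
kept = [(d, c) for d, c, lab in zip(design, calculated, labels) if lab != outlier_label]

r_before = pearson(design, calculated)
r_after = pearson([d for d, _ in kept], [c for _, c in kept])
print(f"r before: {r_before:.2f}, r after: {r_after:.2f}, kept {len(kept)} points")
```

Because a few very large radii dominate the variance, removing them can change the correlation substantially, which is the effect reported for the validation section.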
3.3. Model Fitting Results
According to the calculated geometric alignment results, the homogeneous method was then used to divide the road into section units. A total of 2063 road section units were obtained as the basic data for the study. The section unit length, straight section length, flat curve radius, flat curve deflection angle, longitudinal slope gradient, longitudinal slope length, and slope difference were selected as the input variables of the model. The section unit accident frequency was used as the dependent variable of the model. To prevent variables with very large values from producing very small intervals, the unit of each input variable was readjusted (Table 4). Additionally, the flat curve radius and flat curve deflection angle of straight section units were set to 100 km (representing infinity) and 0°, respectively. For flat curve section units, the straight section length was set to 0 km.
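The encoding conventions for straight and curved units can be written down compactly. The sketch below is illustrative only: the function name, argument names, and units are hypothetical (the paper's actual units are those of its Table 4), and only the 100 km, 0°, and 0 km conventions come from the text.

```python
STRAIGHT_RADIUS_KM = 100.0  # finite stand-in for an infinite radius, as stated in the text


def encode_section_unit(unit_length_km, slope_gradient_pct, slope_length_km,
                        slope_difference_pct, radius_km=None, deflection_deg=None,
                        straight_length_km=0.0):
    """Return the seven model inputs for one section unit.

    Pass radius_km=None for a straight unit. Names and units are assumptions
    made for this example.
    """
    if radius_km is None:
        # Straight unit: radius stands in for infinity, deflection angle is zero.
        radius_km = STRAIGHT_RADIUS_KM
        deflection_deg = 0.0
    else:
        # Flat curve unit: the straight section length is zero.
        straight_length_km = 0.0
    return [unit_length_km, straight_length_km, radius_km, deflection_deg,
            slope_gradient_pct, slope_length_km, slope_difference_pct]


straight = encode_section_unit(0.8, 2.5, 0.8, 1.0, straight_length_km=0.8)
curve = encode_section_unit(0.5, -3.0, 0.5, 0.5, radius_km=0.6, deflection_deg=35.0)
```

One consequence of this convention is that all straight units share the same radius value, so they appear as a single cluster at 100 km in any plot against the flat curve radius.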
A randomly selected part (80%) of the data constituted the training set for the training and parameter calibration of the MLP model. The remaining 20% of the data was used to validate the generalization performance of the model. In addition, a negative binomial regression model was used as a control method to analyze the differences between the traditional statistical method and the machine learning method. The prediction accuracy of the negative binomial model and the MLP model was evaluated based on the overall relative error (ORE), mean absolute deviation (MAD), cumulative residuals (CSR), and correlation coefficient (ρY,Y′). The implementation process of MLP model hyperparameter optimization is as follows:
- (1) Determine the input and output dimensions of the MLP model and set the number of neurons in the input and output layers.
- (2) Normalize the data set, compressing it into a finite value range, to speed up the convergence of the MLP model.
- (3) Import the normalized accident data into the MLP model, determine the cost function of the neural network model, and calibrate the hyperparameters of the MLP until the model reaches the target accuracy.
- (4) Apply inverse normalization to the fitted prediction results of the MLP to restore the original magnitude of the data.
The optimized hyperparameters of the MLP model are shown in Table 5, and a comparison between the negative binomial model and the MLP model is outlined in Table 6.
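The data handling around the MLP (the 80/20 split, normalization, and restoring the original magnitude) can be sketched as follows. This is a minimal illustration under assumptions: min-max scaling is assumed because the text only says the data are compressed into a finite range, the helper names are hypothetical, and the MLP itself is omitted.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices reproducibly and split them into training and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(rows)))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def fit_min_max(column):
    """Return the (min, max) pair used for scaling; fit on training data only."""
    return min(column), max(column)


def normalize(column, lo, hi):
    """Map values into [0, 1] using a previously fitted range."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def denormalize(column, lo, hi):
    """Inverse of normalize: restore the original magnitude."""
    return [v * (hi - lo) + lo for v in column]


# 2063 section units, as in the text; here the rows are just placeholders.
units = list(range(2063))
train_units, test_units = train_test_split(units)

# Hypothetical accident frequencies, scaled and restored.
y = [0.0, 3.0, 1.0, 6.0, 2.0]
lo, hi = fit_min_max(y)
y_scaled = normalize(y, lo, hi)
y_restored = denormalize(y_scaled, lo, hi)
```

Fitting the scaling range on the training set and reusing it for the test set keeps information from the held-out data out of the training step.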
For the negative binomial model, the magnitude of the absolute value of the regression parameters indirectly indicates the degree of influence of the explanatory variables on the dependent variable. The RI value in the MLP model results is the relative importance of the feature calculated by SHAP theory; the larger the value, the higher the contribution of the feature to the dependent variable. The results show that the negative binomial model and the MLP model agree on the feature with the greatest influence on the frequency of roadway accidents, namely the roadway unit length, whose RI value in the MLP is 0.151. However, the importance ranking of subsequent features in the MLP model is slightly different from that of the negative binomial model, where the subsequent ranking in the negative binomial model is . The ranking in the MLP model calculation results is . In addition, the explanatory variables in the negative binomial model that are significantly related to the dependent variable also have the highest RI values in the MLP model.
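To make the idea of a SHAP-based relative importance concrete, the sketch below computes exact Shapley values for a toy model by enumerating feature orderings. It is an illustration under assumptions, not the paper's procedure: the toy model, the all-zero baseline, and the definition of RI as the normalized mean absolute Shapley value are all introduced here, and the brute-force enumeration is only practical for a handful of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x; absent features take their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]           # switch feature j on
            value = model(current)
            phi[j] += value - prev      # marginal contribution in this ordering
            prev = value
    return [p / factorial(n) for p in phi]


def relative_importance(model, samples, baseline):
    """Mean absolute Shapley value per feature, normalized to sum to 1 (assumed RI form)."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for j, p in enumerate(shapley_values(model, x, baseline)):
            totals[j] += abs(p)
    s = sum(totals)
    return [t / s for t in totals]


def toy_model(v):
    """Hypothetical stand-in for a fitted model: one additive term, one interaction."""
    return 2.0 * v[0] + v[1] * v[2]


baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, [1.0, 2.0, 3.0], baseline)
ri = relative_importance(toy_model, [[1.0, 2.0, 3.0], [2.0, 1.0, 1.0]], baseline)
```

The Shapley values sum to the difference between the model output at the sample and at the baseline, and an interaction term is shared equally between the features involved.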
The overall relative errors of the negative binomial model and the MLP model are approximately the same (both are about 3%). However, the mean absolute deviation, cumulative residuals, and correlation coefficient of the MLP model are better than those of the negative binomial regression model, indicating that the MLP model predicts the accident frequency of mountainous highway sections better than the negative binomial model.
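Two models can share a similar overall relative error and still differ in mean absolute deviation, because errors of opposite sign cancel in the total. The sketch below shows this with hypothetical numbers. The formulas are assumptions, since the metric definitions are not given in this section: ORE is taken as the relative difference between total predicted and total observed accidents, and MAD as the mean absolute difference per section unit. CSR is omitted because its definition is not stated here.

```python
def overall_relative_error(observed, predicted):
    """Assumed ORE: relative difference between total predicted and total observed counts."""
    return abs(sum(predicted) - sum(observed)) / sum(observed)


def mean_absolute_deviation(observed, predicted):
    """Assumed MAD: mean absolute difference per section unit."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical observed and predicted accident frequencies for five units.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 1.0, 1.5, 3.5, 3.5]

ore = overall_relative_error(observed, predicted)
mad = mean_absolute_deviation(observed, predicted)
print(f"ORE = {ore:.3f}, MAD = {mad:.3f}")
```

In this example the totals match exactly while individual units are off by up to 0.5, which is why unit-level measures are needed alongside an aggregate error.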
3.4. Variable Correlation Analysis
As a parametric statistical model, the negative binomial model has good explanatory properties, and the effect of each explanatory variable on the dependent variable can be analyzed from the sign and size of its regression coefficient. From Table 6, the regression coefficients of the road section unit length, straight section length, and longitudinal slope length are all greater than zero, indicating a positive correlation between these three variables and the accident frequency of the selected road sections. Thus, the accident frequency increases as these three variables increase. By contrast, the flat curve radius and flat curve deflection angle are negatively correlated with the accident frequency of road section units.
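The link between coefficient sign and direction of effect can be made explicit. The expression below assumes the usual log-link form of negative binomial regression with dispersion parameter \alpha; the paper's exact specification is not reproduced in this section, so the notation is an assumption.

```latex
\mu_i = \exp\Bigl(\beta_0 + \sum_{j} \beta_j x_{ij}\Bigr), \qquad
\frac{\partial \mu_i}{\partial x_{ij}} = \beta_j \, \mu_i, \qquad
\operatorname{Var}(y_i) = \mu_i + \alpha \mu_i^{2}
```

Since \mu_i is always positive, the expected accident frequency increases with x_{ij} when \beta_j > 0 and decreases when \beta_j < 0, and a one-unit increase in x_{ij} multiplies \mu_i by \exp(\beta_j).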
This paper further explored, through SHAP theory, the nonlinear relationships between the dependent variable and the six independent variables ranked highest in relative importance in the MLP model. These independent variables are the road section unit length, flat curve radius, flat curve deflection angle, straight-line section length, longitudinal slope length, and longitudinal slope gradient.
Figure 6a shows a positive correlation between the length of the road section unit and the accident frequency within it: the longer the road section, the higher the accident risk. In addition, the accident risk increases approximately logarithmically with the road section unit length, so the growth rate gradually decreases. The SHAP dependence plots of the flat curve radius and the accident frequency of roadway units are shown in Figure 6b,c. The flat curve radius of straight-line sections was set to 100 km to ensure the operation of the model; thus, there is a cluster of sample points at 100 km in Figure 6b, all of which are straight-line section units.
Figure 6c shows that the relationship between the flat curve radius of a road section unit and the accident frequency within the unit is not simply linear. Initially, as the flat curve radius increases, the accident frequency within the unit gradually decreases; once the flat curve radius exceeds about 3.9 km, however, the accident risk within the unit increases slightly. This conclusion is not entirely consistent with established studies, in which the radius of a flat curve is negatively correlated with the accident frequency on flat curve sections [37]. This may be because long, straight sections can affect driving safety. When the radius of the flat curve is very large, its curvature is small and the driving difficulty is low; drivers therefore tend to increase speed subconsciously, which increases the risk of accidents.
There are two inflection points in the data in Figure 6d. The results indicate a higher-order polynomial relationship between the flat curve deflection angle and the accident frequency of roadway units. When the flat curve deflection angle of the road section is less than 19° or more than 60°, the accident frequency is negatively correlated with the deflection angle, and the accident risk gradually decreases as the deflection angle increases within these ranges. When the flat curve deflection angle lies between the two inflection points, the accident risk of the road section increases with the deflection angle. This conclusion is consistent with established studies in terms of the pattern of change, although the threshold values differ.
Figure 6e shows that the accident frequency of the road section initially decreases as the straight-line length increases, and that there is a quadratic polynomial relationship between the straight-line length and the accident frequency. When the straight-line length exceeds 1.15 km, it is positively correlated with the accident risk. That is, a straight-line length within the section that is either too short or too long is not conducive to highway traffic safety.
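As a worked illustration, suppose the dependence in Figure 6e is approximated by a quadratic in the straight-line length L. The coefficients a, b, and c are hypothetical symbols, not values reported by the paper.

```latex
f(L) = aL^{2} + bL + c, \quad a > 0
\;\Longrightarrow\;
f'(L) = 2aL + b = 0 \ \text{at} \ L^{*} = -\frac{b}{2a}
```

Under this reading, the reported threshold of about 1.15 km corresponds to L^{*}: the fitted risk falls for L < L^{*} and rises for L > L^{*}.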
Figure 6f shows that the accident frequency within the road section decreases as the slope length increases. However, when the slope length exceeds 2.8 km, the accident risk in the road section increases slightly with the slope length. A likely reason is that vehicles brake frequently on long downhill sections, which leads to overheating of the brake system and a decrease in braking performance, and this is not conducive to driving safety. On long uphill sections, the travel speed of large vehicles is reduced considerably, which increases the speed difference between large and small vehicles; this in turn increases the frequency of overtaking and thus the risk of traffic accidents.
Figure 6g shows the SHAP dependence of the accident frequency on the longitudinal slope gradient, from which it can be observed that the two are related by a quadratic polynomial function. The accident risk of road section units gradually increases with the absolute value of the longitudinal slope, and the rate of increase on uphill slopes is slightly larger than that on downhill slopes. This conclusion differs from the studies [37] in which the number of accidents on downhill sections is relatively higher than that on uphill sections for the same absolute value of slope.