1. Introduction
Around the world, many urban regions have serious difficulties with traffic accidents and safety issues. One of the most important aspects of transportation for society is safety [
1]. According to the World Health Organization, 1.35 million people worldwide die every year as a result of traffic accidents. Additionally, 20 to 50 million people suffer non-fatal injuries that may lead to short-term or long-term illness or disability. Traffic accidents also result in significant economic losses and negatively affect society [
2]. Up to 3% of the GDP of most nations is estimated to be spent on lost productivity and medical expenses [
3]. The cost of traffic-related incidents in the United States of America, such as accidents and traffic congestion, can reach up to USD 160 billion annually, and by the end of 2020 the cost might rise to USD 192 billion [
4]. If current trends continue, road traffic accidents will rank as the second or third leading cause of mortality for people between the ages of 5 and 44. Furthermore, traffic accidents are predicted to become the third biggest cause of fatalities [
5]. Crashes are accidents that involve the interaction of different components, such as the road, driver, vehicle, and environment [
6]. Numerous studies in the area of traffic accident analysis recognize the following components as having a direct impact on traffic accidents: human factors, vehicle design, physical condition, traffic conditions, geometric features of the road, and weather conditions [
7]. The vehicle, the driver, and the road environment all have a reasonable relationship with each other in terms of self-driving behavior [
8]. Approximately 95% of accidents are caused by human error [
9]. Changing the attitudes of travelers reduces the negative impact of mobility [
10]. According to data from the World Health Organization (WHO), using cell phones and other electronic devices while driving is the primary source of distraction. It was found that usage of mobile devices has increased by up to 11% over the last 5 to 10 years. This study noted a four-fold increase in the likelihood of a traffic collision when these devices are used. The authors stated that around 71% of road accidents are related to activities that drivers engaged in that were not connected with driving [
7]. The high cost of vehicle accidents has emerged as a major concern for the development of driverless technology. According to estimates, the Google Car can reduce this cost by 90%. Furthermore, the Google Car has the potential to prevent over 2 million injuries and 30,000 fatalities, while saving nearly USD 270 billion in the United States each year [
11]. Additionally, Tesla Motors was founded to create a project to accelerate the global shift to sustainable transportation, which includes self-driving capabilities [
12].
Generally speaking, speed is seen as a primary factor in vehicle accidents. The characteristics of traffic vary according to the change of the speed distribution on a specific roadway in a specified period of time [
13]. Researchers agree that speed-related indicators are a necessary variable for the severity dimension. However, the severity dimension has received less attention than the modeling of crash frequency [
14]. The benefit of high-speed traffic flow is reduced travel time. However, this advantage may come with an increase in the number of accidents, and injuries are likely to be more severe when accidents happen at higher speeds [
15]. Public authorities address this issue by reducing speed limits; according to a study conducted in Angers (France), for example, the 30 km/h speed limit was extended throughout the urban area. That study is based on the theory of planned behavior (TPB) and on the prediction of young drivers’ intentions to comply with speed limits. By applying a decision tree method to the results, the authors were able to identify the most influential variables for predicting intentions. The advantage of a decision tree is that it makes it possible to compare self-reported intentions and expected outcomes [
16]. The behavior of the frequency of risky and risk-free clusters was evaluated using the Gaussian function, in which the relationship between differences in posted speed limits, the operating speed, and the accident frequency rate was tested. It was found that the accident frequency rate is reduced by an average of 0.99 per km/h increase in the difference between the posted speed limit and the operating speed, thereby decreasing the number of accidents per length. It was concluded that drivers in safe clusters do not exceed the speed limits, while in risky and unsafe clusters, drivers exceed the speed limits [
17].
Road safety and fatal accidents are significantly influenced by a number of important elements. One of the most important aspects is believed to be the motorization level [
18]. Researchers studied numerous significant aspects that might contribute to high-risk traffic situations. A considerably stronger association was found between high-risk traffic accidents and road geometry and traffic conditions. The probability of a road traffic accident is always higher when there are poor traffic conditions and improper road infrastructure. Statistical data confirmed that 20–25% of accidents occur due to the poor condition of the roads. The majority of studies on road safety focus on external causes such as the probability of crashes, type of driver, existing conditions of the pavement surface, or the search for patterns that can explain accident causes [
7]. The mobile LiDAR system (MLS), supported by an inductive reasoning process, was taken as a means to assess road safety. The study was performed based on a decision tree, which provides a potential risk assessment based on geometric parameters exclusively. It was highlighted that in future research, an extension of the categorization evaluation of road risk by DT would be desirable [
19]. The performance of LiDAR sensors has been investigated under adverse weather conditions. The results demonstrate that LiDAR must overcome the challenges posed by inclement weather to ensure safety by obtaining information about dynamic objects such as pedestrians, traffic lights, and surrounding vehicles, and improve the driving safety of automated vehicles [
20]. The study by [
21] investigated the rate of injury and deaths of vehicle traffic accidents associated with road types based on their main function. The road type definition included functional roads, administrative roads, urban expressways, and urban general roads. The study shows that the death rate from road traffic accidents on administrative roads is the highest, followed by that on functional roads among all different road types. Moreover, the incidence of traffic accidents is 11.6 times higher on urban general roads than on urban expressways [
21]. A research analysis conducted using the decision tree method shows that the three most important factors in fatal injury are: the driver’s seat belt usage, the light condition of the roadway, and the driver’s alcohol usage [
2]. The findings of another study, conducted in Belgium, indicate significant differences between countries when it comes to road safety performance. National culture plays a huge role in these discrepancies; it is strongly related to differences in wealth and prosperity in different regions. In Europe, the overall percentage of individuals supporting the measures is over 70%. Thus, the social standard of people’s perspectives regarding road safety is substantially more important. This indicates that there is a general willingness to accept policy measures that help to improve road safety [
22]. To create a composite safety indicator for various countries, the safety target was investigated using principal component analysis and weighting from common factor analysis. A national safety program that is in line with the European policy of a 50% decline in fatalities by 2010 is marked as having an “ambitious” target. This mark (value “a”), which is the highest value, was given to many countries. However, some nations (such as Italy) reported having no national targets and were given the value “c” [
23].
A decision tree (DT) can be used in different subjects to find out which variables affect the outcome and to build a model that adjusts accordingly. Ordinary trees consist of a root, branches, nodes, and leaves [
16]. The first node is the root. Two or more branches may grow from it. The last node of the chain is a leaf, and no branches grow from it. Each node represents a variable, and branches give a set of values, which can be predicted from observation of individuals, social groups, or specific characteristics. Accidents are relatively unpredictable and infrequent. Behavioral factors in road accidents are difficult to study by traditional research methods for a number of reasons [
24]. In a study in the UK, two hundred police case files on right-turning accidents were randomly selected from the records held at police headquarters for Nottinghamshire; in 100 of them the right turn was made off the main road, and in the other 100 it was made onto the main road. A machine-learning method was used to create decision trees distinguishing the characteristics of accidents that resulted in injury from those that resulted in damage only. It was found that middle-aged drivers are generally safer than either young or old drivers [
24].
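The root, node, branch, and leaf terminology described above can be illustrated with a minimal sketch. The class `Node`, the helper functions, the feature names (`lanes`, `driver_age`), and the thresholds are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a binary decision tree (illustrative structure only)."""
    feature: Optional[str] = None      # variable tested at this node (None for a leaf)
    threshold: Optional[float] = None  # split value for that variable
    left: Optional["Node"] = None      # branch followed when value <= threshold
    right: Optional["Node"] = None     # branch followed when value > threshold
    label: Optional[str] = None        # class stored in a leaf

    def is_leaf(self) -> bool:
        # A leaf is a node from which no branches grow.
        return self.left is None and self.right is None


def predict(node: Node, sample: Dict[str, float]) -> str:
    """Follow the branches from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label


def count_leaves(node: Node) -> int:
    """Count the leaves below (and including) the given node."""
    if node.is_leaf():
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


# Hypothetical tree: the root tests the number of lanes, one internal node tests age.
root = Node(
    feature="lanes",
    threshold=2,
    left=Node(label="non-severe"),
    right=Node(
        feature="driver_age",
        threshold=50,
        left=Node(label="non-severe"),
        right=Node(label="severe"),
    ),
)

print(predict(root, {"lanes": 4, "driver_age": 60}))
print(count_leaves(root))
```

Each prediction is a single path from the root to one leaf, which is why the fitted tree can be read directly as a set of decision rules.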
In a research project by [
25], for each type of traffic sign, drivers from various socioeconomic backgrounds filled out a paper survey that served as the basis for a decision tree that was used to determine the most important variables affecting drivers’ comprehension. By splitting the data, the algorithm aimed to identify groups that are homogeneous with regard to the dependent variable.
The DT algorithm iterates until the stopping criterion is met. In classification, the splitting criteria are based on entropy and the Gini index [
26]. Entropy is a measure of the inhomogeneity of the input data for the classification. Decision tree construction has three objectives: reduce entropy (the randomness of the target variable), be consistent with the data set, and have the lowest number of nodes. The Gini index, which was developed by Corrado Gini in 1912, measures the degree of data heterogeneity and can therefore be used to measure node impurity: when the index is zero, the node is pure, and the closer it comes to one, the more impure the node is [
26]. The decision tree method allows classification based on crash severity and provides an alternative to parametric models, as it can identify patterns in the data without the need to establish a functional relationship between variables. Unlike ordinary statistical modeling techniques such as regression models, it does not require a functional form to be specified. One of the most important advantages of the DT is that the outcomes of the analysis are easy to understand and interpret owing to the graphical nature of its results, and the important variables of the model are easy to identify [
27]. Network-level optimization was a model aim in a study by [
28]. The traffic demand was managed through binary integer modeling, considering a fully autonomous vehicle transport system. The developed model investigated the transport processes at the vehicle level. The study’s intentions were to ensure traffic safety, as well as capacity management. According to a study by [
29], a precise algorithm for predicting congestion is critical for reducing casualties. In that study, decision trees, logistic regression, and neural networks were compared as traffic congestion prediction systems. For data processing, model training, and testing, TensorFlow and “Clementine Machine Learning” were used. The confusion matrix shows that the decision tree has better prediction performance and leads the other two methods in accuracy (97%) [
29].
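The two impurity criteria discussed above, entropy and the Gini index, can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions (base-2 Shannon entropy and one minus the sum of squared class proportions); the function names are hypothetical, and the class counts are made-up examples rather than data from any cited study.

```python
import math
from typing import List, Sequence


def class_proportions(counts: Sequence[int]) -> List[float]:
    """Convert class counts in a node to proportions, skipping empty classes."""
    total = sum(counts)
    if total == 0:
        raise ValueError("a node must contain at least one observation")
    return [c / total for c in counts if c > 0]


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2) of the class distribution in a node."""
    return sum(-p * math.log2(p) for p in class_proportions(counts))


def gini(counts: Sequence[int]) -> float:
    """Gini impurity: one minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(counts))


# Made-up two-class nodes: pure, mildly mixed, and perfectly mixed.
for counts in ([20, 0], [15, 5], [10, 10]):
    print(counts, round(entropy(counts), 3), round(gini(counts), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one. With only two classes the Gini impurity peaks at 0.5; values closer to one require many classes, since the maximum for K classes is 1 − 1/K.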
To sum up, the literature documents the negative impact of car accidents on society, the main causes of traffic accidents, and DT-based methods that incorporate machine-learning approaches used in the field of transportation. However, drivers’ attitudes and socioeconomic characteristics, in connection with other elements related to car accidents, have not been comprehensively examined in the literature across different areas. The objective of this paper is to identify the most probable causes of both severe and non-severe accidents that are related to drivers’ personal attributes and behavioral factors. As humans are the principal component, their behavioral changes need to be the focus [
30].
2. Materials and Methods
The data were gathered in the city through simple random sample interviews with citizens. The total number of participants from the public was 1172. In all regions of the city, the questionnaire forms were distributed to various groups of individuals while taking into account their age, gender, education level, and other factors. The percentages of participants by age and gender are shown in
Figure 1a,b, below. Statistics from 2018 indicated that there were about 450,000 people living in the city of Duhok. It was chosen as a study area to investigate traffic accidents since it is one of the cities that has seen a significant number of vehicle crashes over the years, including a depressing number of fatalities and injuries.
Table 1 shows the number of accidents, from a prior study, together with the related number of fatalities and injuries that were reported in the city over the course of ten years [
31].
Python 3.7 was used for data analysis, that is, to manipulate and transform the collected data and to create charts and graphs that summarize it. Data mining techniques are still rarely used in transportation, but their use is increasing [
32]. The decision tree algorithm was used in this research to analyze the dataset. In general, DT is a supervised classification method that proceeds as follows: a model is learned from a dataset of observations called the training set and is then evaluated on a different set of observations called the test set. The variable to be predicted (classified) is called the class variable, and the rest of the variables in the dataset are called predictive attributes or features [
33]. The questions were designed to ask for general information, specific opinions, driving experiences, and driving behaviors. The questionnaire items asked detailed questions about general driving behavior and about individual traffic accidents that occurred throughout the city. Crash severity was one of the topics covered: an accident was classified as non-severe if it only caused property damage, or severe if it resulted in human injuries or deaths. All respondents’ answers were converted to numeric codes and entered into an Excel sheet, which was then saved as a CSV file so that the DT algorithm could be run in Python. The main reason DT was chosen for this study is that the target variable under investigation is binary, i.e., the level of severity (“Severe Crashes” and “Non-Severe”), which is appropriate for this method. The accuracy of the analysis obtained by the DT using the “score()” method of the Python “sklearn” library is 79.34%. This value is within the range of values obtained in other studies in which classification methods have similar objectives [
27]. The “score()” method reports the proportion of samples that the classifier assigns correctly; it takes the input and target variables as arguments. The score value for this study indicates that classifications made by the model should be correct approximately 80% of the time.
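The workflow described above (numerically coded answers, a training and a test set, and accuracy from `score()`) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins generated with NumPy rather than the survey responses, and the number of features, the 75/25 split, the `max_depth` value, and the labeling rule are all hypothetical choices that the paper does not specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the numerically coded questionnaire answers.
# In the study these values would come from the CSV file instead.
rng = np.random.default_rng(0)
n_respondents = 400
X = rng.integers(low=0, high=4, size=(n_respondents, 5))  # five coded answers each

# Hypothetical labeling rule plus 10% label noise: 0 = non-severe, 1 = severe.
y = ((X[:, 0] >= 2) & (X[:, 1] >= 2)).astype(int)
flip = rng.random(n_respondents) < 0.1
y = np.where(flip, 1 - y, y)

# Hold out a test set (assumed 25%) that the tree never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Entropy is used as the split criterion, matching the node statistics in the text.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_train, y_train)

# score() returns the mean accuracy: the fraction of correctly classified samples.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2%}")
```

Evaluating on held-out data rather than on the training set gives a less optimistic estimate of how often new classifications would be correct.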
4. Discussion
In the current paper, the effect of drivers’ behavior on traffic accident types is studied. The following variables, listed in
Table 2, are part of the dataset. A decision tree was built from the numerically coded data to evaluate the variables that influence the severity of traffic accidents.
Figure 6 presents the results of the DT analysis. The variable “No_of_Lane”, which is the tree’s root, appears at the top of the tree, indicating that it is the most important element in classifying the observations. The branches to the left are for accidents on roads with fewer lanes. Each root and intermediate node contains the decision factor, the entropy, and the number of respondents who fit the criterion at that point in the tree. For example, the root node indicates that 368 observations make up the learning data set. These are drivers who have been in traffic accidents, of which 272 are non-severe and 96 are severe. At the next level is “Support_SpeedLimit_Radars”; it indicates that the majority of the 180 people who are less supportive had been in accidents on roads with more than two lanes, such as highways and major arterials. On the left, it can be observed that 188 drivers were involved in accidents that happened on two-lane road types. On the third level, at the far right, drivers from the higher age group are in the node that was created from the right arrow (false direction) of the upper node. This suggests that older drivers are more supportive of speed limits. However, of its 156 samples, 68 are marked as severe crashes. Furthermore, out of 20 samples of the oldest driver group, 12 exceed the speed limit, and all accidents are classified as severe with 0 entropy. Additionally, the tree shows that the older groups have the majority of their accidents on weekends, while the younger groups have accidents on weekdays. Finally, the leaf nodes of the intermediate nodes indicate that middle-aged drivers have accidents on multiple roadway types in daytime hours, all of which are non-severe accidents.
On the other side, to the far left of the tree, it shows that younger drivers have accidents on two-lane roadways in the afternoon and evening and at night, with a majority of non-severe accidents. It is also observed that female drivers are more likely to be involved in car accidents on two-lane roads at night.
Finally, the elements in the value array show the severity level. The first value is the number of non-severe crashes, and the second is the number of severe crashes for each criterion. Out of the gathered data, the root node reveals that 272 people experienced non-severe accidents and 96 had severe accidents.
Entropy is the measure of noise in the decision. Noise can be viewed as uncertainty. For example, in nodes where the severity value array contains equal counts, the entropy is at its highest value, which is 1.0. This means that the model is unable to make a definitive classification decision based on the input variables. For very low entropy values, the decision is much more clear-cut, and the difference between the numbers of severe and non-severe accidents is much larger.
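As a worked illustration of these statements, the entropy of a node can be computed from its value array. The sketch below assumes the standard base-2 Shannon entropy, which is consistent with the stated maximum of 1.0 for a two-class node, and applies it to a perfectly balanced node and to the root node counts reported above (272 non-severe, 96 severe). The root value is computed here and is not quoted from the figure.

```latex
\begin{align*}
H &= -\sum_{k} p_k \log_2 p_k \\
H_{\text{balanced}} &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \\
H_{\text{root}} &= -\left(\tfrac{272}{368}\log_2\tfrac{272}{368} + \tfrac{96}{368}\log_2\tfrac{96}{368}\right) \approx 0.83
\end{align*}
```

A node containing only one class has a single proportion equal to 1, so its entropy is 0, which matches the pure severe node described in the discussion of the tree.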
Like many other studies, this one has a number of limitations. The data collection phase was the most challenging: respondents were reluctant to accept the questionnaire or to give truthful answers about their driving habits and attitudes. A further limitation relates to severe accidents. As previously stated in this paper, a severe accident is one that results in loss of life or injury, and only those who survived or were injured could respond to the questions. The injured person might be either the driver or a passenger, and they would be the only ones able to report the number of fatalities or injuries.