1. Introduction
According to “The global status report on road safety 2023” by the WHO, the number of annual global road traffic deaths has reached 1.19 million.
The number of injured people, around 50 million, is even more alarming [
1]. Driver faults are reported to account for 90% of the causes of road accidents [
2]. This study aims to analyse traffic accidents recorded in Turkey and to develop a model that predicts the casualty class of drivers from the available data on drivers’ faults. Through this classification, we aim to obtain analytical information about the driver attributes that lead to deaths or injuries, which can help in taking measures to prevent these incidents. This research analyses accident severity based only on the drivers’ conditions, excluding passengers and pedestrians. The main reason is that only 24 percent of the dataset includes records of injured or dead passengers, and there is no information on whether the remaining 76 percent represents uninjured or uninvolved passengers. Because of this uncertainty over such a large proportion of the police accident records, only the drivers’ conditions were taken into consideration in the analysis. A further reason for focusing on drivers is that accident reports in Turkey show that drivers’ faults represent 90% of the causes of accidents. This highlights that drivers’ characteristics and faults have to be examined to properly determine the impact of the related driver attributes.
In this context, single-, two-, and multi-vehicle car accidents are examined to identify the key factors that influence accident severity for drivers. The focus is on automobile-related accidents because they make up most of the dataset. Furthermore, accidents involving heavy vehicles dominate accident severity and conceal the effects of the primary parameters. In addition, as the police in Turkey record only accidents involving fatalities or injuries, not those involving property damage alone, the dataset in this research includes only “fatal” and “injury” cases. However, because the dataset is heavily imbalanced, with fatal and injury accidents accounting for 0.7% and 99.3%, respectively, it was challenging to predict the accident outcome by classifying it as fatal or injury. Therefore, the evaluation of accident severity was based on the condition of the casualties as the outcome of the accident. Since police reports classify the conditions of drivers, passengers, and pedestrians into three categories, three classes describe the status of individuals involved in the accidents: “dead”, “injured”, and “uninjured”.
Many studies have been conducted to develop a prediction model to reveal the significant factors affecting traffic accident severity [
3,
4,
5,
6,
7]. Among them, some statistical models, such as Logistic and Poisson regression models, rely on assumptions that can reduce the accuracy of the results. Therefore, non-parametric data mining classification methods including Logistic Regression, CART (Classification and Regression Tree), Bayes Networks, Random Forest, Decision Jungle, Support Vector Machine, CHAID (Chi-squared Automatic Interaction Detection), Structural Equation Models, and Decision Trees were preferred in some studies to model the severity of traffic accidents. Decision trees were chosen as the method of analysis because they are recognised in the literature as successful and productive in making classification predictions when categorical data are included [
4].
2. Literature Review
In Taipei, Taiwan, the CART model was employed to determine the relationships between injury severity and road–environment/driver–vehicle characteristics, along with accident attributes. The study revealed that the strongest correlation was between vehicle type and injury severity. Another finding was that pedestrians, motorcyclists, and cyclists had a higher risk of injury than other vehicle drivers [
3].
The CART method was used in other research to determine the factors affecting the accident severity of back-seat motorcycle passengers in Iran. It has been concluded that the residential or rural nature of the accident location, land use, and injured part of the body (head, neck, etc.) are the most influential factors affecting the fatality of motorcycle passengers [
4].
Classification Trees (CT) analysis and Association Rules Mining (ARM) were employed in a study analysing powered two-wheeler (PTW) accidents in Italy. These accidents were found to be strongly sensitive to the various combinations of road, environment, and drivers’ attributes. While one of the outcomes of the research was that accidents in rural areas exhibited a higher accident severity, the other important conclusion stated the critical importance of the consistency of the grip of the tyres on the road surface for the stability of the PTWs. This is particularly important due to the fact that these types of vehicle have only two points of road surface contact compared to four-wheel vehicles’ four contact points [
5,
8].
In a study conducted in Maryland, USA, multinomial logit (MNL) and random forest models were established to identify factors affecting the level of severity of the accidents. The major contributing factors obtained were collision type, age of the occupants, and the speed limits [
6].
The machine learning methods of Decision Jungle (DJ), Random Forest (RF), and Decision Trees (DT), which were used to predict the injury severity in three-wheel rickshaw accidents in Rawalpindi, Pakistan, identified accident type, weather condition, accident cause, posted speed, lighting conditions, and age of the drivers as the leading attributes in predicting accident severity. Among them, the Decision Jungle method performed as the best prediction method when overall accuracy values were compared [
7].
In a study carried out in Alabama, USA, the aim was to estimate the severity of accidents involving young and old drivers in lane-changing-related accidents. According to the results of the Random Logit (RL) model conducted in the study, it was found that young male drivers were involved in more severe injuries than older male drivers. It has been observed that while young male drivers were inclined to get involved in major injury-level accidents rather than minor or no injury ones in daylight conditions, older male drivers tended to have the same injuries in dark/unlit driving and road conditions. It has also been found that driver errors, such as not wearing seat belts, impaired driving (alcohol or drug use), and driving without a valid license are linked to severe injuries [
9].
Estimation models in a study in Cyprus were established using machine learning methods such as Naïve Bayes (NB), Random Forest (RF), Logistic Regression (LR), and Artificial Neural Networks (ANN) by classifying the accident severity in two classes as serious and non-serious. The effects of attributes on accident severity were listed with Random Forest feature importance. The outcomes of this study state that the vehicle engine power, vehicle age, driver age, the road being a first-degree major road, the presence of an intersection, and the available speed limit have an effect on accident severity, respectively [
10].
In a study conducted in the United Arab Emirates, machine learning methods such as Gradient Boosting (GB), Support Vector Machines (SVM), and Random Forest (RF) were used to determine accident severity as mild, moderate, severe, or fatal. The attributes of the age of the injured, age of the driver, vehicle model, vehicle class (truck, SUV, car, etc.), cause of the accident, road type, speed limit on the road, type of accident, lighting, intersection, using a belt, car brand, gender of the injured, and gender of the driver were all found to have an effect on accident severity [
11].
In a separate study conducted in Iran, fatal accidents were investigated through a Logic Regression (LR) model with an applied cross validation method to avoid overfitting. The study identified road lighting, speed limit, and compliance with traffic rules as the important factors in accident severity [
12].
In a study carried out in Seoul, the machine learning methods XGBoost (XGB), Logistic Regression (LR), and DBSCAN were used to estimate the variables affecting accident severity. The study points out that weather, season, and day of the week do not significantly affect accident severity. Since pedestrian-related features were the most prominent, the study emphasizes that measures for pedestrians should be increased [
13].
The accident severity level was modelled in another study conducted in China through Random Forest by using a dataset created from video recordings. Key features identified as having a significant effect on accident severity included accident type (such as pedestrian–vehicle or motorcycle–truck collision, etc.), engine capacity, collision speed, speed limit, road information, accident location, vehicle manoeuvre, and vehicle age [
14].
CHAID decision tree and Bayesian Network Analysis (BN) data mining techniques were used to reveal the attributes that are effective at predicting the severity of bicycle accidents in Italy. The best predictive factors revealed by CHAID analysis are, in order of their proportional effects, road type, accident type, age of cyclist, road signage, gender of cyclist, type of opposing vehicle, month of the year, and type of road segment. In the Bayesian network analysis, the factors that best predicted the severity of the accident in bicycle accidents were accident, road, and collision vehicle type [
15].
CHAID and Bayesian Network techniques, together with path analysis, were applied to determine different combinations of accident factors in roadside accidents. Vehicle speed, curve radius, vehicle type, adhesion coefficient, hard shoulder width, and longitudinal slope were found to be the most important factors, respectively [
16].
A Multinomial Logistic Regression (MLR) model, Artificial Neural Network Multilayer Perceptron (ANN-MLP), Chi-square Automatic Interaction Detector (CHAID), and C5.0 were used to analyse crash severity in a study conducted with data from the California HSIS database for all state highways comprising crash information for the years 2012–2014. The study revealed that the cause of the crash, the number of vehicles involved, the weather, and the driver’s age factors were important [
17].
Aiash and Robusté used Binary Logit (BL) and CHAID methods to reveal the relationships between injury severity and various factors in crashes. Their study found that drivers and pedestrians were more likely to be seriously injured or killed than passengers. Additionally, crashes occurring on weekends, at noon, and at night were found to increase the probability of the crash being fatal or severe [
18].
Yuan et al. employed C5.0, CHAID, and CART algorithms to determine the factors affecting the severity of the side right-angle collision type of accident. Drunk driving, weather conditions, excessive speed, the effective speed of head-on vehicles, and traffic light equipment were obtained as significant factors. The study also found that the occurrence of serious accidents increased with the rising effective speed of head-on vehicles and decreased when road isolation facilities were installed. In addition, it was revealed that there was a high probability of serious accidents in rainy/snowy conditions. Turning was also obtained as a factor increasing the possibility of serious accidents in windy/foggy conditions [
19].
In a study conducted in Palestine, CHAID and CART decision trees were used to determine the severity of pedestrian accidents. The study found that pavement type (paved, unpaved), land use (residential, commercial, or educational), light conditions (daylight or nighttime), and driver gender did not have much effect on pedestrian accident severity. On the other hand, pedestrian gender, area classification (urban, rural), and pedestrian age category (>65, <15) were obtained as the most important factors [
20].
Tinella et al. conducted a study by analysing screening data for 345 drivers in terms of their cognitive and personality measures, driving behaviours, and attitudes along with fitness levels to drive. The study put forward the related sociodemographic and psychological profiles of the drivers. The researchers established models to predict the probability of being included in motor vehicle crashes (MVC) with regard to important factors such as the personality trait of disconstraint and motor skills by using the Classification and Regression Tree method (CART) [
21].
A study carried out in Norway through principal component analysis with iteration and varimax rotation investigated the effects of risk behaviours and the influence of attitude on accidents. The study stated that engaging in risky behaviours and attitudes in traffic were related to age and gender, violating rules, and excessive speed, which had an effect on traffic accidents and near-accidents [
22].
Table 1 summarizes the factors affecting accident severity and the methods used in the literature.
As explained, several studies in the literature address the wide range of factors affecting accident severity. This study, in contrast, contributes to the literature by developing prediction models for automobile-related accidents that analyse only the characteristics of the drivers, as given in
Table 2 and
Table 3, and their effects on the severity level of the accidents.
3. Material and Method
Turkey, with a population of 86,907,000, has 259,072 km of highways. The number of vehicles registered in traffic is 28,183,745, of which 14,967,044 are automobiles. While the total distance travelled annually is 329 × 10^9 vehicle-km, 179 × 10^9 km of it is travelled by automobiles [
23]. A total of 92.7% and 89.4% of all passenger and freight transportation in the country is carried out on highways, respectively [
24]. The numbers of automobiles owned per 1000 people in Turkey and EU countries are 167 and 560, respectively. While the number of deaths per 1 million automobiles in Turkey is 366 people, the related figure for EU countries is 76, reflecting the need to make significant progress in road safety engineering and applications. Approximately 45% of those killed in accidents are drivers, 32% are passengers, and 23% are pedestrians [
2].
3.1. Data Description
The dataset used in this study consists of data recorded in accident reports by traffic police for all the roads under their jurisdiction in Turkey. Since road safety on some roads in Turkey is the responsibility of the Gendarmerie, accidents on these roads are not recorded by the police, and around 30,000 accidents each year are therefore not included in the accident reports. Furthermore, since property-damage-only accidents are not reported by traffic police in Turkey, the dataset used in this research covers only the fatal and injury-related accidents that occurred during the period of 2015–2021. The number of traffic accidents and the automobile ownership rate per 1000 people in Turkey for the year 2021 are given in
Figure 1 and
Figure 2, respectively.
The accident data were obtained from the Turkish General Directorate of Security in three files: the first contains information about accidents, the second about drivers, and the third about passengers and pedestrians. These files were combined for data analysis by matching the identification numbers of the vehicles and the related accidents. In the combined data, each line contains the characteristics of the related accident, the drivers, and the passengers and/or pedestrians, if any. The dependent attribute, the severity level of the driver as a result of the accident, is imbalanced across its three classes: “fatal”, “injured”, and “non-injured”. While the “injured” and “non-injured” classes account for 46.5% and 53.1% of the total number of accidents, respectively, the “fatal” class accounts for only 0.4%. Although the fatal accident rate is very low and reduces model performance, it was kept as a separate class in the analysis. This is because fatal accidents represent the most severe outcome, and classifying fatal casualties according to their causes is of significant importance. Since the severity levels of injury were not stated in the accident records, analyses requiring these data could not be carried out.
Excluding the COVID-19 period (the last months of 2019 and the whole year of 2020), an average of about 151,000 accidents involving death or injury occurred annually between 2015 and 2021, about 1641 of which resulted in death and 149,784 in injuries [
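The combination of the three files can be pictured as a join on shared identifiers. The following is a minimal sketch rather than the procedure actually used in the study; the field names (`accident_id`, `vehicle_id`, `driver_severity`) and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical miniature versions of the three source files.
accidents = [
    {"accident_id": 1, "weather": "clear"},
    {"accident_id": 2, "weather": "rainy"},
]
drivers = [
    {"accident_id": 1, "vehicle_id": "A", "driver_severity": "injured"},
    {"accident_id": 1, "vehicle_id": "B", "driver_severity": "non-injured"},
    {"accident_id": 2, "vehicle_id": "C", "driver_severity": "non-injured"},
]
others = [  # passengers and pedestrians
    {"accident_id": 1, "vehicle_id": "A", "role": "passenger", "severity": "injured"},
]


def combine(accidents, drivers, others):
    """Return one row per driver with accident and passenger/pedestrian data attached."""
    accident_by_id = {a["accident_id"]: a for a in accidents}
    others_by_key = defaultdict(list)
    for o in others:
        others_by_key[(o["accident_id"], o["vehicle_id"])].append(o)

    rows = []
    for d in drivers:
        row = dict(accident_by_id[d["accident_id"]])  # accident attributes
        row.update(d)                                  # driver attributes
        row["others"] = others_by_key.get((d["accident_id"], d["vehicle_id"]), [])
        rows.append(row)
    return rows


combined = combine(accidents, drivers, others)
```

Each row of `combined` then corresponds to one driver, which matches the unit of analysis used for the dependent attribute.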
2].
In this study, two-stage data processing was performed to examine driver casualties in the accidents they were involved in, based primarily on the drivers’ personal characteristics/attributes. In the first stage, data pre-processing was applied; in the second stage, dominant attributes were removed from the dataset to obtain more accurate information concerning the driver attributes. The raw dataset contains a total of 1,059,981 accidents with 1,757,341 records related to the drivers involved in those accidents. Excluding the major factors that mask the driver’s influence made it possible to see the effects of the drivers’ faults and characteristics/attributes on the casualties, along with other accident-related attributes. For example, when a car and a truck collide, the decisive factor in the severity of the accident is the size of the vehicle, regardless of the driver’s characteristics. Therefore, examining accidents involving vehicles of roughly the same size (car-to-car collisions) eliminates the effect of vehicle size on accident severity and helps to isolate the driver characteristics affecting the severity.
Because automobiles dominate both the traffic composition and the recorded accident dataset, analysing only automobile-involved accidents is expected to be particularly informative. Therefore, only automobile-involved accidents in which at least one of the drivers was at fault were analysed, and all other types of accident were eliminated from the dataset. For this analysis, only fatal or injury accidents involving automobiles that occurred in Turkey between 2015 and 2021 were used. After all the eliminations explained above, records of 106,794 accidents remained for the analysis out of the total dataset of 1,059,981 accidents.
Figure 3 illustrates the general steps and stages of the structure followed for the analysis of the data used in this research.
3.2. Pre-Processing-1
The attributes that need to be cleaned, transformed, and combined are determined at this stage.
The reason for applying this stage is to organize the dataset without overlooking any information in the accident records. Thus, where accident severity was predicted, the input attributes were arranged in the best possible way to be included in the analysis.
The research carried out states that the type of colliding vehicle is one of the most important factors that determine the severity of the accidents involving two or more vehicles [
26,
27,
28,
29]. On the other hand, to determine only the effects of the driver’s contribution to the casualties, accidents involving only automobiles were investigated. This removes from the analysis both the effect of vehicle size, which has been shown to have a profound impact on accident severity, and the vulnerability of other road users. Consequently, single- or multi-vehicle accidents in which one of the involved vehicles was not an automobile were removed from the dataset. Moreover, since drivers are known to be uninjured in 94% of the accidents with vulnerable road users such as pedestrians, cyclists, and motorcyclists, these accidents were also excluded from the dataset [
30,
31].
These stated concerns led to an approach that accidents involving only automobiles were left in the dataset before starting the pre-processing of the raw data, as it is known that factors such as vehicle size and road user vulnerability affect the accident severity significantly. Accordingly, accidents involving passengers–pedestrians and vehicles other than automobiles have been removed from the dataset. All the analysis, in this sense, is exclusively based on the determination and prediction of the casualties of automobile drivers.
3.2.1. Cleaning-1
To find effective and meaningful classification rules, attributes such as accident and vehicle ID, hours and coordinates of the accidents, and numbers of the road sections were all excluded since these attributes have individual values resulting in extremely small percentages in the general dataset to create a group of attributes. If there was not any clear statement regarding the fault or faults of each or both drivers, the data related to those accident records were removed from the dataset.
In some accidents, at least one passenger was injured or killed while the drivers were reported as unharmed, meaning that the severity of the accident was high. However, as this study focuses on estimating accident severity solely from the condition of the drivers with regard to their attributes, accidents involving passengers with any level of casualty were excluded from the data.
Accidents involving pedestrians were also removed from the dataset, because automobile drivers are generally not affected in pedestrian-involved accidents, which makes these accidents unsuitable for a casualty analysis of the drivers.
As the proportions of some driver fault classes, such as not obeying the red-light rule (3.5%), hitting a legally parked vehicle (2%), and entering a signposted no-vehicle area (1.6%), were negligible from a data mining perspective, these classes were all removed from the dataset.
The data associated with “driving without a license” (0.6%) were verified with the data in the driver’s license attribute (“holding an official driving licence or not”) and, hence, excluded from the dataset.
The accident records contain a driver fault class called “drunk driving” as well as an attribute defined as “alcohol usage”, which records the amount of alcohol. However, since only 11% of the accidents involved a drunk driver and the alcohol level was given in seven classes, each containing very few data, it was not possible to analyse the alcohol level in this form when setting up decision trees. Furthermore, the “drunk driving” class in the driver faults was filtered and taken as “yes” or “no”, because the alcohol level of 15% of those recorded as drunk was not known and they were therefore excluded from this class.
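The exclusion rules of this cleaning step can be sketched as a simple record filter. This is an illustrative sketch only: the field names and fault labels below are hypothetical and do not reflect the coding used in the actual police records.

```python
# Hypothetical labels for the fault classes described as negligible in the text.
RARE_FAULTS = {"red_light_violation", "hit_parked_vehicle", "entered_no_vehicle_area"}


def keep_record(rec):
    """Return True if a driver record survives the cleaning rules."""
    if rec["driver_fault"] is None:          # no clear statement of fault
        return False
    if rec["driver_fault"] in RARE_FAULTS:   # negligible fault classes
        return False
    if rec["passenger_casualty"]:            # any injured or killed passenger
        return False
    if rec["pedestrian_involved"]:           # pedestrian-involved accident
        return False
    return True


# Hypothetical sample records.
raw = [
    {"driver_fault": "speeding", "passenger_casualty": False, "pedestrian_involved": False},
    {"driver_fault": "red_light_violation", "passenger_casualty": False, "pedestrian_involved": False},
    {"driver_fault": "speeding", "passenger_casualty": True, "pedestrian_involved": False},
    {"driver_fault": None, "passenger_casualty": False, "pedestrian_involved": False},
    {"driver_fault": "improper_turn", "passenger_casualty": False, "pedestrian_involved": True},
]
cleaned = [r for r in raw if keep_record(r)]
```

Expressing each rule as a separate condition keeps the filter easy to audit against the prose description.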
3.2.2. Transforming
After information regarding the accident’s day, month, and season was obtained from the accident date data, a new attribute was created by dividing the time data into time intervals.
The pavement surface data categorised as “icy”, “snowy”, “puddly”, “wet”, and “slippery” were all combined under the “wet/slippery” class to create two classes: “dry” and “wet/slippery”.
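The transformations above can be sketched as follows. The six-hour intervals follow the time periods listed in Table 2; the month-to-season mapping (meteorological seasons of the northern hemisphere) and all function names are assumptions made for illustration, since the text does not state them.

```python
from datetime import datetime

# Raw surface categories that the text merges into "wet/slippery".
WET_SLIPPERY = {"icy", "snowy", "puddly", "wet", "slippery"}


def time_period(hour):
    """Map an hour (0-23) to one of four six-hour intervals."""
    start = (hour // 6) * 6
    return f"{start:02d}:00-{start + 6:02d}:00"


def season(month):
    """Assumed mapping: Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov autumn."""
    return ("winter", "spring", "summer", "autumn")[(month % 12) // 3]


def day_type(moment):
    """Saturday and Sunday count as weekend."""
    return "weekend" if moment.weekday() >= 5 else "weekdays"


def surface_class(raw):
    """Collapse the raw surface categories into two classes."""
    if raw == "dry":
        return "dry"
    if raw in WET_SLIPPERY:
        return "wet/slippery"
    raise ValueError(f"unexpected surface category: {raw}")


def transform(accident_time, surface):
    return {
        "season": season(accident_time.month),
        "day_type": day_type(accident_time),
        "time_period": time_period(accident_time.hour),
        "surface": surface_class(surface),
    }


example = transform(datetime(2021, 1, 16, 14, 30), "icy")
```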
3.2.3. Combining
Since there is very little information regarding drunk drivers in the dataset, it is not possible to study different levels of alcohol usage separately. As a result of this, alcohol-related profile information at different levels has been combined under the “alcoholic” class to illustrate only the effect of alcohol presence or absence on the severity of the accident. Adanu also included alcohol or drug usage as “yes” or “no” in his study [
9].
An attribute generated as “road section” was obtained by using the “intersection” and “road type” attributes.
The attribute named as “accident light condition” was determined to describe the brightness of the environment at the time of the accident through “daylight” and “illumination” attributes.
Driver ages are grouped as young (18–25), middle-aged1 (26–40), middle-aged2 (41–64), and elderly (65 and over).
As far as the vehicle usage attribute is concerned, the official, military, public, and agricultural usage classes, which together contain a small amount of data, were combined under the “other” class, so that this attribute has three classes: “private”, “commercial”, and “other”.
The accident type categories of a falling human/animal from the vehicle, chain collision, multiple collision, and animal collision, represented by very scarce data in the whole dataset, were similarly combined as the “other” class.
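The grouping rules of this step amount to a few category mappings, sketched below. The age boundaries follow the text; the raw category labels and function names are hypothetical.

```python
# Hypothetical raw labels for the sparse accident types merged into "other".
SPARSE_ACCIDENT_TYPES = {
    "fall_from_vehicle", "chain_collision", "multiple_collision", "animal_collision",
}


def age_group(age):
    """Group driver age into the four classes used in the study."""
    if age <= 25:
        return "young"
    if age <= 40:
        return "middle-aged1"
    if age <= 64:
        return "middle-aged2"
    return "elderly"


def usage_class(raw):
    """Keep 'private' and 'commercial'; merge every other usage into 'other'."""
    return raw if raw in ("private", "commercial") else "other"


def accident_type_class(raw):
    """Merge sparse accident types into 'other'."""
    return "other" if raw in SPARSE_ACCIDENT_TYPES else raw


def alcohol_class(level_class):
    """Collapse any recorded alcohol level into a yes/no style attribute.

    Assumption: None means that no alcohol was recorded for the driver.
    """
    return "normal" if level_class is None else "alcoholic"
```

For example, `age_group(40)` falls in `"middle-aged1"` and `usage_class("military")` maps to `"other"`.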
Table 2 presents the descriptive statistics of the dataset after data pre-processing. The relevant analyses were then carried out on the remaining 106,794 driver-related accident records, in which only automobile accidents were taken into consideration.
Table 2.
Descriptive statistics of analysed dataset.
Attributes | F | % | Attributes | F | % | Attributes | F | % |
---|---|---|---|---|---|---|---|---|
Dependent attribute | | | Independent attributes | | | Independent attributes | | |
Driver severity | | | Road Section | | | Vertical road geometry | | |
Injured | 67,478 | 0.63 | Divided | 42,344 | 0.40 | No-slope | 82,166 | 0.77 |
No injured | 38,762 | 0.36 | Two-way | 14,693 | 0.14 | Sloping | 24,628 | 0.23 |
Fatal | 554 | 0.01 | One-way | 3354 | 0.03 | Time-period | | |
Independent attributes | | | Four-legs/roundabout | 29,087 | 0.27 | 00:00–06:00 | 15,198 | 0.14 |
Manner of accident | | | Intersection (Y) | 2205 | 0.02 | 06:00–12:00 | 24,048 | 0.22 |
Rear-end | 15,226 | 0.14 | Intersection (T) | 9860 | 0.09 | 12:00–18:00 | 36,277 | 0.34 |
Sc | 36,807 | 0.34 | Interchange | 652 | 0.01 | 18:00–24:00 | 31,271 | 0.29 |
Sideswipe | 1265 | 0.01 | Other int. | 4599 | 0.04 | Light cond. | | |
Rsr | 30,905 | 0.29 | Location type | | | Daylight | 65,674 | 0.62 |
Hc | 4863 | 0.05 | Urban | 81,709 | 0.77 | Gsl | 28,927 | 0.27 |
Bo | 15,606 | 0.15 | Rural | 25,085 | 0.24 | Bsl | 12,193 | 0.11 |
Bv | 2122 | 0.02 | Road alignment | | | Weather | | |
Driver fault | | | Straight | 88,945 | 0.83 | Clear | 90,924 | 0.85 |
Speeding | 54,948 | 0.52 | Curve | 15,359 | 0.14 | Rainy | 11,416 | 0.11 |
Ftc | 12,050 | 0.11 | Sharp Curve | 2490 | 0.02 | Snowy | 1294 | 0.01 |
Fty | 18,366 | 0.17 | Road Type | | | Other | 3160 | 0.03 |
It | 6072 | 0.06 | Streets | 69,177 | 0.65 | Driver license | | |
Ilc | 15,358 | 0.14 | State highways | 28,750 | 0.27 | Yes | 102,035 | 0.95 |
Weekend | | | Provincial roads | 1979 | 0.02 | No | 4759 | 0.05 |
Weekdays | 73,621 | 0.69 | Motorways | 3449 | 0.03 | Unknown | | |
Weekend | 33,173 | 0.31 | Other roads | 3439 | 0.03 | Season | | |
Number of vehicles | | | Surface condition | | | Spring | 24,752 | 0.23 |
One-vehicle | 44,995 | 0.42 | Dry | 87,190 | 0.82 | Summer | 30,787 | 0.29 |
Two-vehicles | 53,774 | 0.50 | Wet | 19,604 | 0.18 | Autumn | 27,822 | 0.26 |
Multi-vehicle | 8025 | 0.08 | Driver education | | | Winter | 23,433 | 0.22 |
Driver age | | | Primary School | 18,633 | 0.17 | Year | | |
Young (18–25) | 25,828 | 0.24 | Middle School | 5591 | 0.05 | 2015 | 20,634 | 0.19 |
Ma 1 (26–40) | 46,253 | 0.43 | High School | 53,761 | 0.50 | 2016 | 16,117 | 0.15 |
Ma 2 (41–64) | 29,834 | 0.28 | Higher Edu. | 28,809 | 0.27 | 2017 | 15,779 | 0.15 |
Elderly (65+) | 4878 | 0.05 | Alcohol usage | | | 2018 | 15,769 | 0.15 |
Driver gender | | | Normal | 95,077 | 0.89 | 2019 | 14,420 | 0.14 |
Male | 92,271 | 0.86 | Drunk driving | 11,717 | 0.11 | 2020 | 11,688 | 0.11 |
Female | 14,523 | 0.14 | Traffic signal | | | 2021 | 12,387 | 0.12 |
Vehicle usage | | | Yes | 12,539 | 0.12 | | | |
Private | 102,859 | 0.96 | No | 94,255 | 0.88 | | | |
Commercial | 2855 | 0.03 | | | | | | |
Others | 1080 | 0.01 | | | | | | |
3.3. Methodology
Decision Trees are powerful and widely used methods for classification and prediction. The appeal of tree-based methods is largely that, unlike neural networks, Decision Trees produce rules. A Decision Tree represents a series of questions and answers and is drawn with the root at the top and the leaves at the bottom. The attribute at the root is the attribute that has the strongest relationship with the output (target) variable. A record in the dataset enters the tree at the root node, where a test is applied to determine which sub-node it will move to next. Different algorithms exist for this test, and the main consideration in choosing among them is which one gives the best classification. The test and the move to a sub-node are repeated until the record reaches a terminal (leaf) node [
32].
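The routing of a record from the root to a leaf can be illustrated with a small sketch. The tree below is entirely hypothetical and is not a result of this study; it only shows how a test at each node selects the next sub-node and how the visited tests form a rule.

```python
# A node is either a leaf {"predict": label} or an internal node
# {"attribute": name, "children": {category: subtree}}.
# Hypothetical tree for illustration only.
tree = {
    "attribute": "number_of_vehicles",
    "children": {
        "one": {
            "attribute": "driver_fault",
            "children": {
                "speeding": {"predict": "injured"},
                "other": {"predict": "non-injured"},
            },
        },
        "two": {"predict": "non-injured"},
        "multi": {"predict": "non-injured"},
    },
}


def classify(node, record):
    """Walk from the root to a leaf; return the prediction and the rule followed."""
    path = []
    while "predict" not in node:
        attribute = node["attribute"]
        value = record[attribute]
        path.append(f"{attribute} = {value}")
        node = node["children"][value]  # move to the matching sub-node
    return node["predict"], path


label, rule = classify(tree, {"number_of_vehicles": "one", "driver_fault": "speeding"})
```

The list `rule` collects the conditions met on the way down, which is what makes the output of a Decision Tree readable as an explicit rule.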
Decision trees are known to be well designed for making classification predictions in categorical data. Since they do not require assumptions, such as normality, the classification methods are easy to apply for multidimensional data processing to analyse the categorical data for the severity estimation of traffic accidents [
7]. As stated earlier, all dependent and independent attributes in this research are of the categorical data type (non-categorical ones were also converted to categorical type). Chi-Squared Automatic Interaction Detection (CHAID), one of the decision trees, was chosen as the analysis algorithm. This method was proposed and improved by Kass [
33] to detect statistical relationships through the chi-squared test based on the inspiration from the Automatic Interaction Detection method (AID) which is a popular technique suggested by Sonquist and Morgan [
34,
35] and used in many fields [
36]. It is used as a classification tool because it creates a tree while detecting statistical relationships.
CHAID is preferred in this study over other decision trees because it stops the growth of the tree before overfitting occurs. The CHAID tree also differs from other trees in that it can split a node into more than two branches. In this way, some classes are prevented from being aggregated, which results in smaller and clearer trees.
The relatively high credibility and accuracy of the CHAID algorithm, along with its sound mathematical basis for branch calculation, stem from its use of the chi-square test from statistics. The chi-square test is used to select the independent attributes that most affect the dependent attribute, following the principle of local optimization. The CHAID algorithm then generates as many nodes as the selected independent attribute has categories, since independent attributes may have many different categories. This makes the CHAID algorithm produce a multi-fork tree.
It should also be mentioned that the CHAID algorithm employs the “pre-pruning” method, in which pruning decisions are made while the tree is being grown rather than after it has been fully generated. This reduces not only the training and testing time of the CHAID decision tree but also the risk of overfitting. The risk of underfitting, on the other hand, may also be reduced. Additionally, accuracy may improve as long as the amount of pruning is kept within a suitable range and a sufficient amount of data with mostly categorical attributes is available.
All these explained strong sides, along with the fact that each class appears in a different node, in addition to the possibility of obtaining clearer rules with CHAID, justify why this tree algorithm was used in this study.
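The split selection and pre-pruning described in this subsection can be sketched as follows. This is a simplified illustration and not the implementation used in the study: the full CHAID algorithm also merges similar categories and adjusts the significance test, both of which are omitted here. The function names, the toy records, and the attribute names `fault` and `weekend` are hypothetical; the critical values are the 5% values listed in Table 3.

```python
from collections import Counter, defaultdict

# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom
# (the values listed in Table 3).
CRITICAL_05 = {2: 5.99, 4: 9.49, 6: 12.59, 8: 15.51, 12: 21.03, 14: 23.68}


def chi_square(rows, attr, target):
    """Return (Pearson chi-square statistic, degrees of freedom) for attr vs target."""
    observed = Counter((r[attr], r[target]) for r in rows)
    attr_totals = Counter(r[attr] for r in rows)
    target_totals = Counter(r[target] for r in rows)
    n = len(rows)
    stat = 0.0
    for a, a_count in attr_totals.items():
        for t, t_count in target_totals.items():
            expected = a_count * t_count / n
            stat += (observed[(a, t)] - expected) ** 2 / expected
    df = (len(attr_totals) - 1) * (len(target_totals) - 1)
    return stat, df


def best_split(rows, attrs, target):
    """Pick the attribute with the largest significant chi-square value.

    Returns (attribute, statistic), or None when no attribute passes the
    significance check, in which case the node is not split (pre-pruning).
    """
    best = None
    for attr in attrs:
        stat, df = chi_square(rows, attr, target)
        critical = CRITICAL_05.get(df)
        if critical is None or stat <= critical:
            continue  # not testable with this lookup table, or not significant
        if best is None or stat > best[1]:
            best = (attr, stat)
    return best


def multiway_split(rows, attr):
    """Create one child node per category of the chosen attribute."""
    children = defaultdict(list)
    for r in rows:
        children[r[attr]].append(r)
    return dict(children)


def toy_rows():
    """Hypothetical records, invented purely for illustration."""
    cells = {
        ("speeding", "dead"): 10, ("speeding", "injured"): 60,
        ("speeding", "uninjured"): 30, ("give_way", "dead"): 2,
        ("give_way", "injured"): 36, ("give_way", "uninjured"): 62,
    }
    rows = []
    for (fault, casualty), count in cells.items():
        for i in range(count):
            weekend = "yes" if i % 2 == 0 else "no"  # unrelated to casualty
            rows.append({"fault": fault, "weekend": weekend, "casualty": casualty})
    return rows
```

With these toy records, `best_split(toy_rows(), ["fault", "weekend"], "casualty")` selects `fault`, while `weekend` is rejected by the significance check, so a node offered only `weekend` would remain a leaf.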
3.4. Classification Model
The criterion for dividing the tree into branches in the CHAID method is based on the chi-square statistic value. As a test of statistical significance, the chi-square test was invented by Karl Pearson in 1900. In this test, frequencies are regarded as more important than variances, means, or standard deviations [
32]. In this study, a chi-square test of independence was carried out to examine whether there is a relationship between the dependent attribute and the selected independent attributes. In this test, the null hypothesis states that there is no association between the attributes in question, while the alternative hypothesis states that there is one. The level of significance, denoted by “α”, is the probability of rejecting the null hypothesis when it is actually true. In this study, it was chosen as 5%, and the null hypothesis was rejected whenever the chi-square statistic exceeded the chi-square critical value at that level. The degree of freedom, the maximum number of logically independent values, and the chi-square values are given in
Table 3. These values express the fact that there is a statistically significant relationship between all designated independent attributes and the dependent attribute.
The chi-square statistic value is evaluated to rank the degree of the relationship between the dependent attribute and the independent attributes. The highest chi-square value reveals the highest connection between the two attributes in question [
37]. The attribute that is most related to the dependent attribute, i.e., the one with the largest chi-square value, is located at the root node of the tree to be established [
32]. The chi-square equation used is given as [
37]:
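In its standard Pearson form, the statistic for a contingency table with r rows and c columns can be written as below. The symbols $O_{ij}$ (observed count), $E_{ij}$ (count expected under independence), $R_i$ and $C_j$ (row and column totals), and $N$ (total number of records) are notation assumed here and may differ from the notation used in [37].

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
df = (r - 1)(c - 1)
```

Under this form, the degrees of freedom equal the product of the two "number of categories minus one" terms, which is how the df column of Table 3 relates to its ao−1 and ai−1 columns.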
The chi-square test of independence was applied to see if there was a statistically significant relationship between the dependent attribute, the accident casualty of the drivers, and the other attributes. The distribution of chi-square values is given in
Table 3. The “manner of accident” attribute has the highest chi-square value, which indicates that its statistical relationship with the dependent attribute, the casualty level of the driver, is stronger than that of any other independent attribute. Another conclusion to be drawn from
Table 3 is that there is also a strong association between the number of vehicles involved in the accident and the dependent attribute. These findings coincide with numerous studies in the literature carried out in different parts of the world [
3,
5,
6,
7,
15].
The chi-square statistics of these attributes were compared with the corresponding critical values, and a statistically significant relationship was found between the dependent attribute and each of these independent attributes. The other independent attributes are not listed in
Table 3 as there was not a strong relationship between them and the dependent attribute.
In
Table 3, the magnitude of the relationship between the independent attributes and the dependent attribute is inferred from the magnitude of the chi-square statistic values. However, to determine whether these relationships are statistically significant, each chi-square statistic was compared with the corresponding chi-square critical value to decide whether the null hypothesis could be rejected. The critical values were obtained from the chi-square distribution table using the degrees of freedom. All values obtained are given in
Table 3 [
38].
Table 3.
The results of Chi-Square test analysis.
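Attribute (Output) | ao−1 | Attribute (Input) | ai−1 | df | χ² statistic | χ² critical |
---|---|---|---|---|---|---|
Casualty level of the driver | 2 | Manner of accident | 6 | 12 | 35,306.91 | 21.03 |
Casualty level of the driver | 2 | Number of vehicles | 2 | 4 | 33,961.62 | 9.49 |
Casualty level of the driver | 2 | Driver fault | 4 | 8 | 16,683.24 | 15.51 |
Casualty level of the driver | 2 | Road section | 7 | 14 | 8536.08 | 23.68 |
Casualty level of the driver | 2 | Road type | 4 | 8 | 3820.14 | 15.51 |
Casualty level of the driver | 2 | Road alignment | 2 | 4 | 3780.20 | 9.49 |
Casualty level of the driver | 2 | Location type | 1 | 2 | 3174.94 | 5.99 |
Casualty level of the driver | 2 | Time period | 3 | 6 | 1921.44 | 12.59 |
Casualty level of the driver | 2 | Light conditions | 2 | 4 | 1161.80 | 9.49 |
Casualty level of the driver | 2 | Traffic signal | 1 | 2 | 792.08 | 5.99 |
Casualty level of the driver | 2 | Driver gender | 1 | 2 | 495.01 | 5.99 |
Casualty level of the driver | 2 | Alcohol usage | 1 | 2 | 453.98 | 5.99 |
Casualty level of the driver | 2 | Driver age | 3 | 6 | 423.17 | 12.59 |
Casualty level of the driver | 2 | Vertical road geometry | 1 | 2 | 402.82 | 5.99 |
Casualty level of the driver | 2 | Driver license | 1 | 2 | 398.39 | 5.99 |
Casualty level of the driver | 2 | Surface condition | 1 | 2 | 233.76 | 5.99 |
Casualty level of the driver | 2 | Weather | 3 | 6 | 139.33 | 12.59 |
Casualty level of the driver | 2 | Season | 3 | 6 | 78.38 | 12.59 |
Casualty level of the driver | 2 | Weekend | 1 | 2 | 61.14 | 5.99 |
Casualty level of the driver | 2 | Driver education | 3 | 6 | 54.57 | 12.59 |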
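The comparison between statistics and critical values summarised in Table 3 can be re-checked with a few lines of code. The sketch below only re-uses the numbers printed in the table; the variable and function names are hypothetical, and the check assumes that the degrees of freedom equal (ao−1) × (ai−1), with ao−1 = 2 for the three casualty classes.

```python
AO_MINUS_1 = 2  # three casualty classes: dead, injured, uninjured

# (input attribute, ai - 1, df, chi-square statistic, chi-square critical value)
TABLE_3 = [
    ("Manner of accident", 6, 12, 35306.91, 21.03),
    ("Number of vehicles", 2, 4, 33961.62, 9.49),
    ("Driver fault", 4, 8, 16683.24, 15.51),
    ("Road section", 7, 14, 8536.08, 23.68),
    ("Road type", 4, 8, 3820.14, 15.51),
    ("Road alignment", 2, 4, 3780.20, 9.49),
    ("Location type", 1, 2, 3174.94, 5.99),
    ("Time period", 3, 6, 1921.44, 12.59),
    ("Light conditions", 2, 4, 1161.80, 9.49),
    ("Traffic signal", 1, 2, 792.08, 5.99),
    ("Driver gender", 1, 2, 495.01, 5.99),
    ("Alcohol usage", 1, 2, 453.98, 5.99),
    ("Driver age", 3, 6, 423.17, 12.59),
    ("Vertical road geometry", 1, 2, 402.82, 5.99),
    ("Driver license", 1, 2, 398.39, 5.99),
    ("Surface condition", 1, 2, 233.76, 5.99),
    ("Weather", 3, 6, 139.33, 12.59),
    ("Season", 3, 6, 78.38, 12.59),
    ("Weekend", 1, 2, 61.14, 5.99),
    ("Driver education", 3, 6, 54.57, 12.59),
]


def check_row(name, ai_minus_1, df, statistic, critical):
    """Check one row: df consistency and whether the null hypothesis is rejected."""
    return {
        "attribute": name,
        "df_consistent": df == AO_MINUS_1 * ai_minus_1,
        "reject_h0": statistic > critical,
    }


results = [check_row(*row) for row in TABLE_3]

# Rank the input attributes by the strength of their association.
ranked = sorted(TABLE_3, key=lambda row: row[3], reverse=True)
```

Every row passes both checks, and the ranking places the manner of accident and the number of vehicles first, followed by driver fault.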
After all these independent attributes with a statistically significant association with the dependent attribute were determined, the CHAID decision tree model was created to estimate and evaluate the casualty of the drivers involved in the accidents.
It was seen in the decision tree model established with all variables that dominant variables such as manner of accident and number of vehicles were included in the tree. On the other hand, variables such as “driver age” and “driver education”, which depend on the driver, and variables such as “road surface”, which directly affects driving and whose effect on the accident result is statistically significant but relatively small, were not included in the tree. Hence, no comments could be made regarding their effects on the accident result, and the influence of these factors remained hidden. A second data pre-processing stage was therefore applied to remove the dominant variables from the dataset and to bring out these hidden effects.
3.5. Preprocessing-2: Selecting Attributes
As stated above, the dominant attributes that hide the driver characteristics need to be removed from the dataset. Hence, a second data pre-processing stage is carried out to develop a model that predicts the casualty of the drivers based on the drivers’ characteristics.
Cleaning-2
Although this study reveals that the types/manners of the accidents, each being distinctive from the other, and the number of vehicles involved, were the two main independent attributes affecting the casualty level of the drivers, they were excluded from the analysis process. The reason behind this is related to the primary aim of this study, setting up and estimating the relationships between the attributes of the drivers and the casualty of the accidents. Thus, those dominant attributes concealing the drivers’ related attributes and faults were all eliminated to prevent obtaining misleading results, even if they have great importance on the casualty level of the drivers in general.
In the same way, although they affect the severity level of the accident, the location type (residential or rural), road alignment, vertical road geometry, road type, time period, light conditions, and traffic signal attributes were all removed from the dataset for the purposes of the analysis.
As the weather and road surface condition attributes are highly correlated, it would be pointless to include both in the model simultaneously. Because the surface condition attribute has the higher chi-square value with respect to the dependent attribute, the weather attribute was removed from the dataset and from the model.
Although the season and weekend attributes have higher chi-square values than the driver education attribute, they were also removed from the model because they conceal the effect of the driver’s education status on the outcome of interest in this research, namely the effect of driver characteristics on driver accident severity.
3.6. Comparison and Validation of the Classification Model with Selected Attributes
The developed model which estimates the accident casualties of the drivers produces similar proportional results with regard to three categorical outcomes for each year not fluctuating much from year to year, as illustrated by
Figure 4. It should be mentioned that, while decision trees were obtained through training datasets, model performances were evaluated by using test datasets.
A 10-fold cross-validation was applied to ensure the reliability of the model. The results of Random Forest and Naïve Bayes are also presented in
Figure 5 to show the success of the model compared to other available ones. To create the decision trees, the dataset was divided into two parts, with 70% of the data used for training and 30% for testing. The division was made chronologically: as the accidents in the dataset are recorded by the date on which they occurred, the first five years of the available data were taken as training data and the last two years as test data. As the accidents and their results occur at similar rates each year, the prediction model is expected to produce accurate results for future periods and to support reliable and reasonable evaluations.
At this stage, the validity of the CHAID model was tested by establishing models with Random Forest and Naïve Bayes algorithms.
Random Forest, a combined classifier algorithm proposed by Breiman, can perform classification, association, clustering, prediction, and sequential pattern mining operations, and create multiple decision trees [
14,
39]. In this sense, a Random Forest algorithm is employed to predict the severity of traffic accidents by producing multiple decision trees.
Bayesian Classification, on the other hand, is a process that estimates the probability that a new observation belongs to a predefined category by using a probability model defined according to Bayesian theory [
40]. This technique evaluates the prior probability of each category based on a large training dataset defined by a set of variables. It also assumes that the classification can be estimated by calculating the conditional probability density function and the posterior probability [
41,
42].
Given an observation with k attributes, where $x_i$ (i = 1, 2, …, k) is a conditioning factor and $y_j$ is the output class, the Naïve Bayes classifier calculates the probability $P(y_j \mid x_i)$ for all possible output classes. The prediction is based on selecting the class with the highest posterior probability as:
The prior probability $P(y_j)$ can be estimated by determining the fraction of observations in the training dataset that belong to the output class $y_j$. The conditional probability is computed by using:
where μ is the mean and δ is the standard deviation of $x_i$ [
43].
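In the standard Gaussian Naïve Bayes formulation, which is consistent with the symbols defined above, the decision rule and the conditional probability can be written as below. The class-specific subscripts on μ and δ, and the product over attributes that follows from the conditional-independence assumption, are spelled out here as assumptions; the notation in [43] may differ.

```latex
\hat{y} = \arg\max_{y_j} \; P(y_j) \prod_{i=1}^{k} P(x_i \mid y_j)

P(x_i \mid y_j) = \frac{1}{\sqrt{2\pi}\,\delta_{ij}}
\exp\!\left( -\frac{(x_i - \mu_{ij})^2}{2\,\delta_{ij}^{2}} \right)
```

Here $\mu_{ij}$ and $\delta_{ij}$ denote the mean and standard deviation of attribute $x_i$ among the training observations belonging to class $y_j$.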
Once 70% and 30% of the dataset had been separated as training and test data, respectively, the numbers of correctly and incorrectly predicted records were determined for the CHAID, Random Forest, and Naïve Bayes methods by comparing the predicted and observed results of all accident scenarios in the test data. The classification accuracies were then calculated by employing Equation (5). In the cross-validation process, on the other hand, 10 subgroups were created randomly, each containing 10% of the data. Each subgroup was used in turn as test data, with the remaining 90% used as training data, and the accuracy values were calculated for the methods stated.
Figure 5 illustrates these values. As the accuracy values obtained for each method were close to or higher than the accuracy values given in the literature, the validity of the model is supported [
17,
44,
45].
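The two evaluation schemes described above, a chronological 70/30 split and a random 10-fold cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the function names are hypothetical, the records are assumed to be already sorted by accident date, and the fixed random seed is chosen only to make the folds reproducible.

```python
import random


def chronological_split(records, train_fraction=0.7):
    """Split date-ordered records: the earliest part for training, the rest for testing."""
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k randomly created folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def accuracy(observed, predicted):
    """Share of records whose predicted class equals the observed class."""
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)
```

For 100 records, `chronological_split` returns 70 training and 30 test records, and each of the ten folds produced by `k_fold_indices` holds out 10 records while training on the other 90.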
Figure 5 shows that, among the three models, the CHAID algorithm produces the best accuracy value in the training/test evaluation. Therefore, the decision trees are established by employing the CHAID algorithm, and related findings are presented below.
4. Results
After the decision tree was created with all attributes and the performance values of the model were obtained, the model with selected attributes was set up following the second data pre-processing phase.
Figure 6 below illustrates the established decision tree model, which has thirty leaf (terminal) nodes, with selected attributes. As mentioned in the previous section, the dataset in the decision tree is divided into sub-groups according to the relationship between the input attributes and the output attribute based on chi-square values. Since the relationship between the “driver fault” attribute and the casualties of the drivers is the strongest, this attribute is located at the root node of the tree. Eight splitting attributes are involved in the tree structure: driver fault, road section, driver age, education level of driver, gender of driver, holding a driving license, alcohol usage, and surface condition.
4.1. Driver Fault: Speeding
The records in which the driver failed to adjust the vehicle speed to the prevailing road and weather conditions constitute 52% of the training dataset. While 61% of the drivers who do not adapt the vehicle speed to the road and weather conditions are killed on divided roads, 17% are killed on single-carriageway roads. A total of 41% of all drivers who died in the accidents of the dataset examined were speeding on divided roads. This indicates that fatal accidents might be significantly reduced if proper measures are taken on these roads. The tree obtained also illustrates that, when drivers fail to drive at a suitable speed for the road and weather conditions at four-arm junctions and roundabouts, the gender of the driver, alcohol usage, driver age, surface conditions, and education level of the driver all affect the casualty level of the drivers involved in the accidents.
4.2. Driver Fault: Failure to Give Way
The set of drivers who do not comply with the right-of-way priority at intersections constitutes 17% of the training dataset. Remarkably, 62% of these drivers remain uninjured, which means that they cause the death or injury of another person or other people in the accident, because each accident in the dataset involves at least one injured or dead person. In other words, drivers who cause accidents by not obeying the right-of-way rules tend to escape harm themselves while causing third parties to be injured or killed. If the driver is young and does not hold a driving license, the probability of the driver being injured is 52%, and the probability of a third person being injured or killed is 48%. It was found that 64% of the drivers in the middle-aged1 and middle-aged2 groups were uninjured.
4.3. Driver Fault: Improper Left or Right Turning
Drivers who do not obey the rules of changing the vehicle direction constitute 6% of the dataset. If these drivers are women or elderly men, they are more prone to be injured as a result of this kind of accident.
4.4. Driver Fault: Following Too Close
The accidents due to following too closely represent 11% of the dataset. In addition, 65% of the drivers involved in this type of accident caused the death or injury of someone other than themselves.
4.5. Driver Fault: Improper Lane Changing
Accidents resulting from improper lane changing have a share of 14% in the dataset. In the training dataset, 27% of the drivers who were reported dead died as a result of this fault, which suggests that preventing improper lane changing behaviour would reduce the number of deceased drivers considerably.
4.6. Performance Measures
The “Accuracy” value of performance measure in the model is the ratio obtained by dividing the number of correctly predicted data in all classes by the total amount of data. The “Precision”, on the other hand, is the ratio obtained by dividing the number of correctly estimated data in a class by the whole amount of data estimated for that specific class. Another performance criterion, “Recall”, also known as “Sensitivity”, is the ratio obtained by dividing the number of data estimated correctly in a class by the total observed amount of data in that class [
46]. The following equations separately express the mathematical structure of these performance measures.
DTP: Desired case, correctly predicted
UTP: Undesired case, correctly predicted
DFP: Desired case, incorrectly predicted-FT1: False Type-1
UFP: Undesired case, incorrectly predicted-FT2: False Type-2
UUP: Undesired case, predicted as another undesired case-FT3: False Type-3
The confusion matrix plays an important role in depicting the classifier’s effectiveness thoroughly. This matrix provides a deeper understanding of the internal workings of the classifier than the Precision or Recall metrics alone. Hence, valuable insights that can guide further research and enhancements in model performance can be obtained through this process. Furthermore, the inherent structure of the data themselves can be revealed through the relationships unveiled by the confusion matrix between various data features and objects [
47].
A confusion matrix is created by comparing the predicted class with the actual class in classification problems with more than two classes. While each matrix column represents the actual observed class, each row represents the predicted one. In this confusion matrix, the values in the cells on the diagonal give the number of correctly classified records, whereas the remaining cells indicate misclassifications.
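The verbal definitions of accuracy, precision, and recall can be made concrete with a small sketch that follows the layout described above, with rows holding the predicted class and columns the actual class. The function name and the counts in `example` are hypothetical and chosen only for illustration; they are not the values reported in Table 4.

```python
CLASSES = ["dead", "injured", "uninjured"]


def confusion_metrics(matrix):
    """Return (accuracy, precision per class, recall per class).

    matrix[p][a] is the number of records predicted as class p whose
    actual class is a (rows: predicted, columns: actual).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precision, recall = [], []
    for i in range(n):
        predicted_total = sum(matrix[i])                    # row total
        actual_total = sum(matrix[r][i] for r in range(n))  # column total
        precision.append(matrix[i][i] / predicted_total if predicted_total else 0.0)
        recall.append(matrix[i][i] / actual_total if actual_total else 0.0)
    return correct / total, precision, recall


# Hypothetical counts, for illustration only.
example = [
    [2, 1, 0],     # predicted dead
    [8, 80, 20],   # predicted injured
    [0, 19, 70],   # predicted uninjured
]

accuracy, precision, recall = confusion_metrics(example)
```

In this invented example, most drivers who actually died are predicted as injured, so the recall of the dead class is low even though the overall accuracy is 0.76.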
Table 4 below illustrates all the obtained values as the performance measures of the model.
Table 4 provides the “confusion matrix” of the models set up through all and the selected attributes of the drivers, along with the “prediction accuracy”, “class precision”, and “class recall” values reflecting the performance criteria of the model.
The weaknesses of the classifier can be uncovered by delving into the confusion matrix providing valuable insights to enhance the model performance. Additionally, the confusion matrix may shed light on the relationships between various data features and objects, unveiling the inherent structure of the data themselves [
47]. In other words, valuable indications can be obtained about the connections between classes and the labels signifying semantic meanings and concepts assigned to data instances.
The success of the classification of a confusion matrix of size n×n can be evaluated by the recall and precision values for each class individually [
48].
As shown in
Table 4, the predicted recall values for injured ones are given as 69.74 and 78.27 percent for all and selected attributes, respectively, indicating the success of the model developed. Since the number of data related to uninjured cases is limited compared to the injured cases, the same high-rate success could not be obtained although a reasonable level is achieved.
In the study conducted by Wang and Kim (2019), property damage alone, injury, and fatality were selected as the casualty classes to predict accident severity by determining the related factors. Wang and Kim used lighting, intersection, collision, and road division type, fixed objects, speed limits, gender and type of occupant, alcohol usage, movement, and type and age of the vehicle as attributes to characterise the accidents. Following the prediction models established through Multinomial Logit (MNL) and Random Forest algorithms, performances of the models were obtained by calculating precision and recall, along with F1 score values. In the MNL model, precision values for property damage only, injury, and fatal classes were 64%, 57%, and 75%, respectively. Furthermore, while recall values were found to be 89%, 24%, and 2%, F1 score values were calculated as 75%, 34%, and 3%, respectively. On the other hand, the precision values of the injured class obtained in this study as 94.66% and 80.17% for the models, where all and selected attributes were analysed, respectively, are significantly higher than the value of 57% given by Wang and Kim [
6]. These higher values clarify the fact that the accuracy of class prediction of injury-related accidents in this study is remarkably improved.
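The F1 scores quoted above for the MNL model follow from the quoted precision and recall values through the usual harmonic-mean definition, F1 = 2PR / (P + R). The short sketch below recomputes them; the function and variable names are hypothetical, and the inputs are the rounded percentages given in the text, so small deviations from the quoted F1 values are to be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, quoted F1) for the MNL model, as given in the text.
quoted = {
    "property damage only": (0.64, 0.89, 0.75),
    "injury": (0.57, 0.24, 0.34),
    "fatal": (0.75, 0.02, 0.03),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in quoted.items()}
```

The recomputed values agree with the quoted ones to within about one percentage point.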
It can also be seen from
Table 4 that there is a difference of approximately five percentage points between the “accuracy values”. This relatively small difference suggests that establishing the proposed model with the attributes selected within the scope of this study involves no serious drawback. The attributes remaining after all the elimination and cleaning processes produced the results aimed for in this research.
Another important issue to be stated is that, as the characteristics of the accidents in which the driver is killed are quite similar to those of the accidents in which drivers are injured, the model tends to predict dead drivers as injured.
5. Discussion
In this study, as a non-parametric tree-based model, CHAID was employed to model the casualty levels of the drivers in traffic accidents involving only automobiles.
The outcomes of this research obtained from the CHAID-tree algorithm indicate that alcohol usage, driver’s age, education level of the drivers, and wet/slippery road surface profoundly affect the resulting accident severity. The results of the study conducted by Kadilar, which estimated the casualty level of the accidents by logistic regression, state that snowy road surfaces increase the accident severity of the drivers compared to other types of road surface. In other studies, it has been found that drivers’ casualty levels get worse with a rise in alcohol usage [
49,
50]. The outcomes of this study in this sense comply with the findings from the literature.
When the entire tree is thoroughly examined, male drivers are found to be involved in more accidents than female drivers, which is probably a natural result of the fact that they travel more kilometres than women. On the other hand, 71% of female drivers injure themselves rather than other people in the accidents they cause. It can also be seen from the tree developed that 65% of the female drivers who do not adapt the vehicle speed to the road and weather conditions injure themselves at four-arm intersections and roundabouts. It should be kept in mind that the four-arm intersections and roundabouts were combined in the same class and were evaluated together in the study.
When every leaf (terminal) node involving gender is examined, it can be seen that the mortality rates of men in their own groups are higher than the mortality rates of women in theirs. The greater likelihood of male drivers becoming involved in fatal or injury accidents was also stated in Das’ research [
50].
Drivers’ ages have been observed to have an effect on the casualty level of the accidents occurring at four-arm intersections, including roundabouts, as a result of excessive speed and illegal turns. In this regard, the finding of this research does not comply with the outcome obtained in Zhang’s study, in which it was found that the driver’s age had no effect on traffic violation and accident severity [
51]. Similarly, Batouli et al. stated in their study that the driver’s age did not affect the injury severity level with the explanation that older and more experienced drivers’ relatively careful and slow driving characteristics might tolerate and compensate the younger drivers’ mistakes [
52]. The findings indicate that young drivers, especially those in the early years of holding a driving licence, must be monitored and controlled to prevent the rule violations that cause severe accidents in Turkey.
The findings of this research revealed the fact that a slippery or wet road surface is an important factor affecting the severity of the accidents, even in the case of four-wheeled vehicles. A similar outcome was also obtained by Montella for two-wheeled vehicles [
5].
Accident reports kept by the police in Turkey do not include information on years of driving experience. In order to deepen the analysis with regard to the effect of the level of experience of the drivers on the severity level, information regarding the years that drivers have held a driving licence is important and, hence, must be recorded in the police reports. Even this information may not be sufficient because the duration and frequency of driving after obtaining a driving license are the factors that will affect the experience. As it may not be possible to obtain this information accurately from the drivers, it is a challenging issue to develop an accurate conclusion to evaluate the effect of driver experience on casualty levels.
Similarly, a more comprehensive analysis would have been possible if, for example, the classification of injury level had been provided. Information, such as the model of the vehicles involved in the accidents, is also important and should be included in the dataset. This is simply because studies in the literature identified that driver behaviour and accident severity vary depending on the vehicle model [
6,
9,
14]. Bédard et al. found that late-model vehicles were associated with a 5% increased risk of death for every five years [
26]. Contrary to Bédard et al., many studies have found a different correlation between vehicle age and accident severity. Levine et al. stated that new-model vehicles were safer [
53]. Wang and Kim found that the age of a vehicle serves as an indicator that newer automobiles are safer compared to older ones, as older vehicles are more likely to be involved in accidents resulting in injuries or fatalities [
6]. Weast et al. (2021) investigated the characteristics of the vehicles used by young and adult drivers in fatal accidents. According to their study, young drivers who died tended to drive older vehicles. Furthermore, young drivers also tended to drive vehicles with less advanced safety features, such as side airbags and ESC (Electronic Stability Control) equipment [
54]. Kuyumcu et al. specified that drivers of older model vehicles were more likely to get involved in accidents in which their vehicles were severely damaged [
31]. This study also represents a valid finding for both driver groups that, as the age of the automobile increases, more people die as a result of accidents.
Although some individual specific factors and their effect on the severity level have been discussed, and related numerical results presented, the combined categorization and their impact strength must be studied separately and expressed accordingly. The decision trees displayed basically state the importance of the factors categorically as far as the expected results of the accidents are concerned. This study, hence, focused on highlighting these factors in terms of their degree of importance related to the severity level of the accidents, rather than an evaluation of the categorical correlation of the attributes of the casualty level.
Obviously, the dataset obtained is related to the Turkish roads and drivers. Therefore, it would be wise to state that the model results reflect the characteristics of Turkish roads and drivers only. On the other hand, the methodology used can be widely used for analysis and investigations in other parts of the world. A comparison of the results for car-only accidents may reveal if there might be any correlation of the similar parameters and characteristics of the drivers and the severity level of the accidents.
6. Conclusions
The model proposed expresses the major and most important interconnected attributes causing the accidents and the related level of severity of the drivers. A rigorous evaluation of the findings produced the following conclusions.
Driver error, located at the root node of the decision tree, was the attribute that most strongly affected the accident severity. Other variables had different degrees of impact on accident severity, depending on the driver error. For example, in cases of speeding, the road section variable is the first one to be considered as far as the classification of accident results is concerned. As another example, in cases where the driver error was a failure to give way, the age of the driver was seen to be the most important factor. Moreover, for young drivers, whether or not they hold a driving license seemed to play an important role in the accident result.
The first and chief issue to be stated is that the education and training of the candidates must be paid the utmost attention, and related regulations are to be applied in the quickest way as the factors such as following too close, improper lane changing and turnings, and speeding along with a failure to give way play a significant role in the occurrence of fatal or injury-causing accidents. Furthermore, periodic practices for renewing drivers’ licenses seem essential.
Some well-known attributes, such as the manner of accident and the number of vehicles involved, were removed from the analysis, and a new model was suggested based on the variables having an effect but remaining hidden, especially those related to the drivers. In this way, the central attributes solely related to the characteristics of the drivers and their combinations with other important factors were put forward. Drivers in the middle-aged1 and middle-aged2 groups seem to be the groups to focus on, as they account for significant proportions of the fatal or injury-related accidents. Young drivers must also be paid attention, as their involvement in injury-related accidents is quite high. Effective measures must be taken to deter young drivers from driving cars without a driving licence. Strong enforcement must be applied in this regard, simply because young drivers without a licence are involved in many injury-related accidents, as can be seen from the decision tree.
Middle-aged1 (26–40) and Middle-aged2 (41–64) drivers should be provided with training on the priority of passing at intersections and adapting vehicle speed to road geometry to increase their awareness. As the article of the Highway Traffic Law of the Republic of Turkey clearly enforces a reduction in speed when approaching intersections, pedestrian crossings, tunnels, narrow bridges, culverts and hilltops, entering curves, proceeding on windy roads, level crossings, and entering construction and repair areas, the inspections must be intensified at these specific locations. Since these age groups are active in working life, they can be provided with traffic training at their workplaces and subjected to periodic practical exams.
Speeding is one of the most important factors related to driver faults, especially on divided and two-way roads, being responsible for 19,650 injuries and 207 deaths in total. Professionally organised campaigns must be run on all communication channels (TV, radio, social media, billboards) to deal with this issue. The necessary legislative measures must be applied to counter the perception among drivers that it is safe to go slightly above the speed limit. Moreover, drivers should be made aware that the speed limit is legally binding and that violations will not be tolerated. Measures such as increasing electronic and police-operated speed controls, reducing legal speed limits, and proper traffic calming should be implemented, especially on such roads.
The decision tree produced reveals that improper lane changing is another point to be considered carefully as it has been responsible for 7681 injuries and 104 deaths. Improper lane changing can be expressed as occurring due to many reasons, including careless driving, lack of sleep, and the road being designed in such a way that the driver cannot understand the physical conditions well or having deficiencies that cause drivers to make mistakes due to a lack of signs or horizontal markings. The manner of drivers also needs to be mentioned, such as a lack of signal usage. To handle this problem, various operative campaigns should be organized and the designs and markings related to the physical parts of the roads must be checked where most accidents occur due to the driver’s fault.
Attributes that affect accident severity but were absent from the tree because of the dominant effects of other factors were also analysed through a second pre-processing step. Although the performance of the model decreased from 78% to 73% after the removal of the dominant variables, the resulting performance shows that the model still performs well.
The limitations of this study should also be mentioned. As the dataset contained no information on the degree of injury severity, the speed limits of the roads, or visibility, the proposed model did not cover these factors. Passengers were excluded from the study because too much passenger data was missing. The information on road geometry available in the dataset was very limited, which is why factors directly related to road geometry and their effects on accident severity for drivers were excluded from this study. Once this information is included in future records, analyses can also be carried out in this respect. Accidents primarily involving motorcycles will be considered in future studies. Furthermore, accidents involving vehicle types other than automobiles, pedestrians, and passengers grouped according to their seat in the vehicle will be analysed to develop effective prediction models for accident severity.
The dataset in this study did not include sociodemographic and psychological attributes of the drivers. Future analysis is intended to take these internal factors into consideration, as discussed by Tinella et al. [21]. Furthermore, risky behaviours (reckless driving, drinking and driving, over observation of children, not using a seat belt, etc.), as analysed by Iversen [22], are among the main topics to be addressed in order to compare drivers’ attitudes in Turkey with findings from other parts of the world.
A categorical comparison of all variables in terms of their weighted and relative effects on accident severity, covering all vehicle types and the individual characteristics of the drivers and using screening data, will be an important step towards deepening the analysis and developing the model.