1. Introduction
Road crashes are one of the main threats to public health. More specifically, crashes involving drivers cause severe losses in terms of human life and goods. According to the World Health Organisation, about 1.2 million people die in road accidents each year, and between twenty and fifty million suffer non-fatal injuries [
1,
2]. The consequences of these events range from human suffering and loss to costs for individuals and national economies, which amount to approximately 3% of annual gross domestic product [
1]. Furthermore, a large share (more than 90%) of accidents occur in low- and middle-income countries, in societies and economies already affected by critical issues in terms of quality of life [
3]. Several studies [
4,
5,
6] demonstrated that the trauma related to motor vehicle crash (MVC) claims can lead to elevated psychological distress, and that this condition can persist for years after the injury [
7]. The economic and social consequences of this phenomenon hinder the community’s ability to develop and implement improved policies and strategies to mitigate the impact of road crashes. In Europe, the policy implemented by the European Commission is outlined in the “EU road safety policy framework 2021–2030” [
8], which aims to achieve zero fatalities on European roads by 2050. However, despite measurable positive results in road crash reduction (in the EU-27, a −10% change in absolute numbers in 2023 compared to 2019), in 2024, 20,400 people died in road crashes, corresponding to 46 deaths per million inhabitants [
8].
Improving road safety requires addressing a complex array of contributing factors through targeted policies and interventions [
9].
A review of the scientific literature reveals that one of the obstacles to developing such countermeasures is the difficulty of identifying the relationship between the severity of accidents and the multiple factors involved [
9]. Broadly, we can distinguish between “vehicle factors”, “human factors”, “environmental factors”, and “crash type” [
2].
As for the environmental factors that trigger driving accidents, previous research has mainly studied the role of road type and characteristics [
10,
11,
12] and climatic conditions [
11,
13,
14], as well as specific conditions such as sunrise, sunset, dusty weather, oily road surfaces, and winding uphill/downhill roads [
13]. In recent years, large datasets released in an open-access format, such as OpenStreetMap layers [
15], have made it possible to acquire relevant information on the road network with good precision [
16].
In the past, the analysis of these dynamics to identify contributing factors has been carried out using statistical models [
2]. The main limitations of these approaches concern the need to make preliminary assumptions about the distribution of the data (no anomalous values must be present) and about the relationships between the dependent and independent variables [
2,
17,
18].
Recently, the development of data analysis methods based on machine learning (ML) has made it possible to overcome some of these problems, providing a new approach to the discipline. These methods require no prior assumptions about the relationships between variables, handle outliers and missing values better [
19], and could reveal that factors generally regarded as influential are non-contributory or peculiar to individual study areas [
20]. The identification of these factors would also provide a reasonable basis for selecting input to develop further predictive models [
9].
Unfortunately, researchers currently face severe limitations in terms of the accuracy and availability of historical road crash databases, which are essential for studying the phenomenon. At an international level, the DRIVER initiative (Data for Road Incident Visualization Evaluation and Reporting), carried out by the Global Road Safety Facility (GRSF), aims to collect and harmonize the data using a cooperative, open-access system, but this initiative is currently limited to a few countries and cities [
21].
On a national scale, many countries lack a comprehensive road crash database suitable for detailed spatial analysis. For instance, the Italian statistics office only provides an inventory covering a limited time span, with very low geolocation precision that simply identifies the road where the accident happened [
22]. Additionally, several Italian cities, like the Municipality of Cagliari, publish their own databases, but these are often partial and, much like the national data, lack clear geolocation [
23].
Against this background, the primary objective of this work is to investigate how different environmental factors contribute to accidents in Switzerland, as highlighted by freely accessible federal accident databases. Leveraging a set of environmental and social variables, we apply a pool of machine learning algorithms (MLAs) and evaluate their performance [
24] (
Figure 1). More specifically, we aim to: (i) clarify the role of environmental factors in accident occurrence; (ii) assess the prediction accuracy achieved by different MLAs; and (iii) produce refined spatial maps indicating crash probabilities.
Using a combination of MLAs, GIS, and open data can improve knowledge of the area, leading to more precise analyses in terms of accuracy, efficiency, and computational cost. This results in more effective planning and allocation of resources, as well as the development of tools for evaluating the effectiveness of interventions [
25].
3. Results and Discussion
Descriptive statistics for all covariates were calculated and are reported in
Appendix A. Furthermore, to check for multicollinearity, a Pearson correlation matrix was computed, and the results are reported in
Appendix B. These preliminary operations contribute to the management of outliers and the assessment of model performance.
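As an illustration of the multicollinearity check, the following minimal Python sketch computes a Pearson correlation matrix and flags highly correlated covariate pairs. It is not the workflow used in this study: the covariate names, the toy values, and the 0.8 threshold are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of covariates."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def collinear_pairs(columns, threshold=0.8):
    """List covariate pairs whose absolute correlation reaches the threshold.

    The 0.8 threshold is an assumption for this sketch, not a value from the study.
    """
    names = list(columns)
    matrix = correlation_matrix(columns)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]

# Hypothetical toy values, used only to exercise the functions.
covariates = {
    "slope": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pop_density": [2.1, 3.9, 6.2, 8.1, 9.8],
    "rainfall": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = collinear_pairs(covariates)
for first, second, r in flagged:
    print(f"{first} vs {second}: r = {r:.3f}")
```

Pairs flagged in this way are candidates for removal or merging before model training.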
3.1. Covariates Importance
To assess the importance of the covariates in the training phase, different metrics and statistics are used, depending on the characteristics of each implemented model. Ranger supports a native method for evaluating the influence of covariates on the model’s quality. In RStudio, it is coded as “impurity” (
Figure 3) and yields a ranking of the independent variables in the classification. Each covariate is assigned an importance value based on the decrease in impurity observed when the model applies the splitting criterion in the classification tree [
66,
67].
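To make the notion of impurity decrease concrete, the sketch below computes the Gini impurity of a binary node and the decrease obtained from one candidate split. It assumes the Gini index as the impurity measure and uses hypothetical labels; it illustrates the principle only and is not the Ranger implementation.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_one = sum(labels) / n
    return 1.0 - p_one ** 2 - (1.0 - p_one) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 1 = accident cell, 0 = non-accident cell.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # cells on one side of a candidate split
right = [1, 0, 0, 0]   # cells on the other side
decrease = impurity_decrease(parent, left, right)
print(decrease)
```

Summing such decreases over all splits that use a given covariate, across all trees, gives an impurity-based importance score of the kind ranked in the figure.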
In the Ranger model, the covariates belonging to the infrastructure group have more influence than those in the environmental or socio-economic groups. Road Class 2 has the greatest influence in terms of the decrease in impurity, with a considerable gap compared to the other covariates; together with the distance from health buildings and Road Class 4, it is among the covariates of greatest importance in the construction of the model. Road Class 2 includes the most heavily trafficked roads, which probably explains its influence on the model’s ability to recognize the occurrence of an accident. The same may hold for the distance from health buildings: near this type of building there may be a concentration of emergency vehicles, which could influence whether accidents occur. For Road Class 4, the opposite argument could be made with respect to Road Class 2, because the road types it includes are characterized by limited use, owing to restrictions on their use (bus way) or their function (service). In other work with a similar field of application and approach, the road category also has a high influence on the model prediction [
9]. In the same study, the Random Forest model assigns significant importance to the variable that distinguishes between urban and rural road accidents.
Land Class 4 and population density (Pop./km²) are the only covariates of the socio-economic group that achieved a high rank in impurity reduction. The other covariates of the socio-economic group achieved the lowest ranks. The slope and climate variables achieve intermediate values, although the slope probably influences the model more than the other two variables. The morphology of the terrain, and therefore also the slope, has a significant impact on the morphology of the road, its structure, and its safety.
For the Logistic Regression Model, the importance of covariates is determined by the z-value or the Wald Statistic (
Figure 4). In this case, importance relates to the test of the hypothesis that the coefficient of a variable is 0, as opposed to being different from 0, for each variable used in the classification [
63]. In this model too, Road Class 2 is dominant. Road Class 1 emerges as particularly significant, even more so than in the Ranger model. Similarly, high-hierarchy roads strongly influence the model’s composition, while Road Class 3 maintains importance. Conversely, Road Class 4 is not a relevant predictor in this setting. Some variables in the infrastructure group are less important than in the previous model, but this generally remains the most influential group. The socio-economic variables are less important in the Logistic Regression Model. All Land Classes have very low values, and Class 5 is excluded from the model; this class is also the least important in the Ranger model (ranger version 0.17.0). Rainfall has less influence in this model than in Ranger, while temperature appears to have a medium level of importance. In a territory such as the one examined, it is not unusual for temperature to affect road conditions and vehicle performance, given the presence of cells with minimum values of −10 degrees Celsius.
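The Wald statistic can be sketched in a few lines of Python: the z-value is the estimated coefficient divided by its standard error, and covariates can be ranked by its absolute value. The coefficient names and numbers below are hypothetical and are not taken from the fitted model.

```python
import math

def wald_z(coefficient, standard_error):
    """Wald z-value for the null hypothesis that the coefficient equals 0."""
    return coefficient / standard_error

def two_sided_p_value(z):
    """Two-sided p-value of a z-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical (coefficient, standard error) pairs, for illustration only.
fitted = {
    "road_class_2": (1.20, 0.10),
    "temperature": (-0.30, 0.12),
    "land_class_3": (0.02, 0.15),
}

# Rank covariates by the absolute value of their z-value.
ranking = sorted(fitted, key=lambda name: abs(wald_z(*fitted[name])), reverse=True)
for name in ranking:
    z = wald_z(*fitted[name])
    print(f"{name}: z = {z:.2f}, p = {two_sided_p_value(z):.4f}")
```

A large absolute z-value indicates that the data provide strong evidence against a zero coefficient, which is why it can serve as an importance ranking in this setting.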
3.2. Accuracy Assessment
The model that obtained the best performance is Ranger, not only in terms of Accuracy, where it achieved 0.98 compared with 0.93 for LRM and Keras (
Table 4), but also in the model’s capacity to classify TP. Indeed, for both Specificity and Negative Predictive Value, Ranger achieves the best values, 0.88 and 0.96 respectively, indicating that the model tends to classify both classes well. This trend is also visible in the confusion matrix (
Table 5).
The TN in the Ranger matrix are fewer in number than in the LRM and Keras matrices. These models were unable to classify the N class correctly. They obtain a good Accuracy value only because the data are unbalanced. The large number of P cases raises the accuracy value; therefore, it is essential to use different metrics to compare the performance of the models. Keras performs better than LRM in terms of Specificity (0.40 compared with 0.34). The two models classify the P class in the same way, but for the N class Keras classifies about 1000 more TN than LRM.
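The effect of class imbalance on Accuracy can be illustrated with a short Python sketch that derives the metrics discussed here from the four cells of a confusion matrix. The counts are hypothetical and do not reproduce the values of Table 5.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Accuracy, Sensitivity, Specificity, and NPV from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical unbalanced sample: 1000 positive cells and only 100 negative cells.
metrics = classification_metrics(tp=950, fp=60, tn=40, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case Accuracy reaches 0.90 even though only 40% of the negative cells are recognized, which is why Specificity and Negative Predictive Value are needed alongside Accuracy.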
The AUC value reflects the same characteristics observed in the other metrics. Also, in this case, Ranger achieved the best value, 0.99, followed by Keras, 0.93, and LRM, 0.92.
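For reference, the AUC can be read as the probability that a randomly chosen positive cell receives a higher score than a randomly chosen negative cell. The minimal Python sketch below computes it by direct pairwise comparison, with hypothetical labels and scores; it is an illustration of the metric, not the routine used to produce the values above.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
value = auc(labels, scores)
print(round(value, 3))
```

Because it depends only on the ranking of the scores, the AUC is less affected by the classification threshold than Accuracy.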
The models were relatively fast. Logistic regression differs from the other models in that it takes less time to train, completing the process in under one minute. Keras takes longer to complete the training phase, at around 15 min, whereas Ranger takes only 10 min. The longer training time of Keras could be because, in RStudio, it plots the selected training metrics in real time, which requires more resources.
The ROC plots show no significant differences: the curves are very similar, and only the LRM curve differs slightly from the others (
Figure 5), appearing more flattened.
3.3. Prediction Maps
Figure 6 shows the predicted maps. These maps are produced using the probability with which each model assigns a cell to one of the two categories. This value can be used to evaluate the model’s ability to predict the susceptibility of an area to road accidents. Cells with values closer to 1 are mainly located in urban and peri-urban areas, a pattern that is consistent across all maps. The models identify Zurich, Bern, Basel, Lausanne, and Geneva, as well as their surrounding areas, as being closer to class 1. The model performed very well in identifying areas with a high probability of road accidents, while also showing greater precision in highlighting low-probability zones. In southern Switzerland, cells with values close to 1 were less frequent, likely due to the predominance of lower-density road networks and the limited extent of urban land use in this area. This configuration is likely to reduce exposure and, consequently, accident occurrence. Conversely, all models consistently highlighted the northern region as more vulnerable, where higher population density and a more complex road infrastructure increase the probability of road accidents. Both Ranger and Keras identify a concentration of values close to 1 in the Zurich canton area. LRM tends to classify more cells in class 0 in this area and its neighbouring areas, particularly those surrounding the main urban centres.
This trend is more easily observed in the canton of Graubünden, where each model obtains values close to 0. However, in the LRM map, the infrastructure network overlapped by the grid is almost invisible; in the Keras map it is slightly visible, whereas in the Ranger map it is clearly visible.
This pattern is more evident on the map showing the mean probability by municipality (
Figure 7). There are 24 municipalities with a probability exceeding 0.5, which, according to the model’s logic, would be classified as class 1. These include Zurich, Basel, and Geneva: Geneva achieved a probability of 0.78, Basel 0.66, and Zurich 0.51. However, the model also highlights trends that are not immediately obvious, since some of the municipalities identified have relatively small populations. This could be due to the scale of the road infrastructure relative to the municipality’s surface area and population. Despite a population of around 2500, Muralto achieved the highest probability (0.86). The small city of Muralto is, however, a crucial crossroads for vehicular traffic: it is home to the main road connecting the important city of Locarno with the surrounding municipalities, as well as an important rail station [
68]. It was followed by the municipalities of Chêne-Bourg (0.74), Vevey (0.73), and Lancy (0.67). Interestingly, as these three cases show, municipalities with relatively small populations tend towards higher average values. However, these small towns are located in a high-density traffic area around Lake Geneva. The Lake Geneva region registers the highest number of motor vehicles in Switzerland: 1,730,000 in 2023 [
69]. The city of Chêne-Bourg is located on the border of the Geneva metropolitan area, close to one of the main access highways to the city, the “Route blanche”. Vevey is a small city with almost 20,000 inhabitants. It is historically important as a staging point on the ancient ‘Francigena Way’ [
70]; therefore, it has long been an important traffic center. Today, the N9 Highway passes through the northern part of the town, with 67,000 vehicles passing through Vevey every day [
69]. Finally, Lancy is a small town located in the southern part of the metropolitan area of Geneva. It is an important public transport hub, home to the impressive Gare de Genève-La Praille train station, which caters for both passenger and freight services (46°10′56″ N, 6°07′34″ E). In contrast, the relatively low value for Zurich could be attributed to the municipality’s relatively small surface area and limited road infrastructure. Some municipalities bordering the main urban area or along the main infrastructure line achieved a mean value greater than 0.5. This can be attributed to the presence of numerous cells in the municipal area, particularly along the main road (Road Class 1), despite the low population and limited road infrastructure. This indicates that even small towns and cities can be classified by the model as close to 1, as can be seen in areas surrounding the Zurich municipal area and along some of the main infrastructure lines branching off from it to connect with other major urban centers. This trend can also be seen along the shores of Lake Geneva and in the neighboring area of Geneva.
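The aggregation behind the municipal map can be sketched as follows: cell-level probabilities are averaged per municipality, and municipalities whose mean exceeds 0.5 are treated as class 1. This is a minimal Python illustration; the municipality labels and probabilities are hypothetical placeholders, not values from the study.

```python
from collections import defaultdict

def mean_probability_by_municipality(cells):
    """Average the predicted probability of all grid cells in each municipality."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for municipality, probability in cells:
        totals[municipality] += probability
        counts[municipality] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Hypothetical (municipality, cell probability) pairs.
cells = [
    ("A", 0.9), ("A", 0.7),
    ("B", 0.2), ("B", 0.4), ("B", 0.3),
    ("C", 0.6),
]
means = mean_probability_by_municipality(cells)

# Municipalities whose mean probability exceeds 0.5 are treated as class 1.
high_probability = sorted(name for name, p in means.items() if p > 0.5)
print(high_probability)
```

Because the mean is taken over the cells inside each boundary, a small municipality crossed by a major road can obtain a high value even with few inhabitants, consistent with the cases discussed above.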
3.4. Limitations of the Research
This research aimed to use only open-source data and tools, applying MLAs to analyze the phenomenon of road accidents. All the applied models achieved a high level of performance, with Ranger achieving the highest. The results are similar to, and sometimes better than, those of other studies in the same field [
71,
72,
73]. Some of the works compared focus on accident severity prediction, such as [
71,
72]. However, these works use the same models and approach (classification problems), as well as some of the same variables and the same metrics to evaluate model performance. For this reason, we consider it possible to compare our results with theirs.
As in other fields, the availability of data is not unlimited. Switzerland should be commended for the quantity and quality of its data, although this appears to be an isolated case rather than a common approach to handling this type of data. Such data need to be open and of the highest quality. However, if additional information on accident dynamics had been available, within the limits permitted by current international privacy laws, the issue could have been addressed more comprehensively.
It is necessary to take similar action regarding the data that, in this work, represent independent variables. These are selected and used in order to explain the natural, social, or economic phenomenon that is to be studied and analyzed. In road accident research, data describing the dynamics of the accident are of fundamental importance. This includes characteristics of the vehicles involved, their speed, and the condition of those involved. There are already projects underway aimed at improving the availability of road accident data (e.g., [
21]). However, they are still in the data collection phase and do not yet cover a sufficiently wide area. Nevertheless, we believe that initiatives of this kind represent a crucial step toward improving the accuracy of machine learning applications and their contribution to spatial knowledge.
The condition of the road infrastructure is also important, including artificial lighting, the number of lanes, and horizontal and vertical road traffic signs.