Assessing the Impact of Infrastructure and Social Environment Predictors on Road Accidents in Switzerland Using Machine Learning Algorithms and Open Large-Scale Dataset

Auzzas, Alessandro; Capra, Gian Franco; Ganga, Antonio

doi:10.3390/urbansci9090343


by Alessandro Auzzas, Gian Franco Capra and Antonio Ganga *

Dipartimento di Architettura, Design e Urbanistica, Università di Sassari, Via Piandanna 4, 07100 Sassari, Italy

* Author to whom correspondence should be addressed.

Urban Sci. 2025, 9(9), 343; https://doi.org/10.3390/urbansci9090343

Submission received: 9 July 2025 / Revised: 20 August 2025 / Accepted: 22 August 2025 / Published: 29 August 2025


Abstract

The significant impact of road traffic accidents on public health requires clear and effective policies to combat them. However, public action can only be truly effective when supported by robust monitoring tools. This project aims to evaluate the effectiveness of a set of machine learning algorithms in predicting road accidents in Switzerland, using the open-access road crash databases of the Swiss Confederation combined with environmental and socio-economic factors. Three different algorithms are tested: Logistic Regression Model (LRM), Random Forest with Ranger (RF), and Artificial Neural Network (ANN) with Keras. Among the predictive factors, road types are shown to be of high importance in all models. Regarding model performance, all the applied algorithms achieve an accuracy above 90%. The Random Forest algorithm, implemented with Ranger, exhibited the best performance, particularly in terms of specificity (0.88 compared to 0.34 and 0.40 for LRM and Keras, respectively) and negative predictive value (0.96 compared to 0.65 for LRM and 0.68 for Keras). These results suggest that this approach could support public policy for traffic management, provided that data collection and sharing are carried out continuously.

Keywords:

drive crashes; logistic regression; Keras; Ranger; health policies

1. Introduction

Road crashes are one of the main threats to public health, causing severe losses of human life and property. According to the World Health Organisation, about 1.2 million people die in road accidents each year and between twenty and fifty million suffer non-fatal injuries [1,2]. The consequences of these events range from human suffering and loss to costs for individuals and national economies, amounting to approximately 3% of annual gross domestic product [1]. Furthermore, more than 90% of accidents occur in low- and middle-income countries, in societies and economies already facing critical issues in terms of quality of life [3]. Several studies [4,5,6] have demonstrated that the trauma related to motor vehicle crash (MVC) claims can lead to elevated psychological distress, and that this condition can persist for years after the injury [7]. The economic and social consequences of this phenomenon hinder the community’s ability to develop and implement improved policies and strategies to mitigate the impact of road crashes. In Europe, the policy implemented by the European Commission is outlined in the “EU road safety policy framework 2021–2030” [8], which aims to achieve zero fatalities on European roads by 2050. However, despite clear positive results in road crash reduction (in the EU-27, a −10% change in absolute numbers in 2023 compared to 2019), in 2024, 20,400 people died in road crashes, corresponding to 46 deaths per million inhabitants [8].

Improving road safety requires addressing a complex array of contributing factors through targeted policies and interventions [9].

A review of the scientific literature reveals that one of the obstacles to developing such countermeasures is the difficulty of identifying the relationship between the severity of accidents and the multiple factors involved [9]. Broadly, we can distinguish between “vehicle factors”, “human factors”, “environmental factors”, and “crash type” [2].

As for the environmental factors that trigger driving accidents, previous research has mainly studied the role of road type and characteristics [10,11,12] and of climatic and site conditions [11,13,14], such as sunrise, sunset, dusty weather, oily road surfaces, and winding uphill/downhill roads [13]. In recent years, large open-access datasets, such as the OpenStreetMap layers [15], have made it possible to acquire relevant information on the road network with good precision [16].

The analysis of these dynamics for identifying contributing factors has been carried out in the past using statistical models [2]. The significant limitations resulting from these experiments concern the need to make preliminary assumptions about the distribution of the data (no anomalous values must be present) and the relationships between the dependent and independent variables [2,17,18].

Recently, the development of data analysis methods based on machine learning (ML) has made it possible to overcome some of these problems and provides a new approach to the discipline. These methods require no prior assumptions about the relationships between variables, handle outliers and missing values better [19], and could reveal that factors generally regarded as influential are non-contributory or specific to individual study areas [20]. The identification of these factors would also provide a reasonable basis for selecting inputs to develop further predictive models [9].

Unfortunately, researchers currently face severe limitations in the accuracy and availability of historical road crash databases, which are essential for studying the phenomenon. At an international level, the DRIVER initiative (Data for Road Incident Visualization Evaluation and Reporting), carried out by the Global Road Safety Facility (GRSF), aims to collect and harmonize the data using a cooperative, open-access system, but this initiative is still limited to a few countries and cities [21].

On a national scale, many countries lack a comprehensive road crash database suitable for detailed spatial analysis. For instance, the Italian statistics office only provides a limited-time inventory with very low geolocalization precision, simply identifying the road where the accident happened [22]. Additionally, several Italian cities, like the Municipality of Cagliari, publish their own databases, but these are often partial and also lack clear geolocalization, much like the national data [23].

Against this background, the primary objective of this work is to investigate how different environmental factors contribute to accidents in Switzerland, as highlighted by freely accessible federal accident databases [24]. Using a set of environmental and social factors, a pool of Machine Learning Algorithms (MLAs) was applied, and their performance was evaluated (Figure 1). More specifically, we aim to: (i) clarify the role of environmental factors in accident occurrence; (ii) assess the prediction accuracy achieved by different MLAs; and (iii) produce refined spatial maps indicating crash probabilities.

Using a combination of MLA, GIS and open data can improve knowledge of the area, leading to more precise analyses in terms of accuracy, efficiency and computational cost. This results in more effective planning and allocation of resources, as well as the development of tools for evaluating the effectiveness of interventions [25].

2. Materials and Methods

2.1. Study Area

Switzerland is a federal state located in central Europe (long. 8.251°, lat. 46.791°, Figure 2). It has a population of 8.6 million inhabitants and covers a surface area of 41,291 km², of which 35% is dedicated to agricultural use, 32% is covered by forest vegetation, 25% consists of unproductive areas (water and glaciers), and only the remaining 8% is occupied by settlements (mainly residential areas) [26,27].

2.2. Data Collection

All the data used in this work are open source. To simplify the presentation of the resources used, they were classified according to the characteristics they describe. The first class is infrastructure, which contains all data relating to major infrastructure, particularly roads. The social and economic class encompasses aspects of the economic and social sphere of the territory that can influence the phenomenon of road accidents. Finally, the last class incorporates environmental aspects relating to the climate of the study area (Table 1).

2.2.1. Prediction Variable

The prediction variable represents the occurrence of road crashes. This paper uses open-access data on accidents with personal injury [24] from the Swiss Federal Roads Office. The data are provided as a point vector layer. The crashes that occurred between 2011 and 2023 are considered in the research. The attributes associated with each location include the kind of accident, which describes the crash dynamics in a few words, and the severity, divided into three classes: accidents with light injuries, accidents with severe injuries, and fatalities. Further attributes are recorded for each incident, but they are not detailed enough regarding speed, the kind and number of vehicles involved, weather and road conditions at the moment of the crash, the psychophysical condition of the drivers, and other data that would serve other types of investigation and analysis of this phenomenon. For this reason, a grid has been created that follows the Swiss road infrastructure. Each cell has a resolution of 150 × 150 metres. The number of accidents that occurred on the road or roads contained within each cell is associated with that cell. The prediction variable is populated using this count: every cell containing one or more accidents is classified as 1, while those without accidents are classified as 0.
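The cell-labelling rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the workflow used in the study, where the grid is built along the road network in a GIS. The coordinates are invented, and `label_cells`, `cell_size`, and the column names are hypothetical.

```r
# Minimal sketch: assign crash points to 150 m x 150 m cells and derive a
# binary target. Coordinates are invented projected coordinates in metres.
cell_size <- 150

label_cells <- function(cells, crashes, size = cell_size) {
  # Key of the lattice cell that contains a point (column index, row index)
  key <- function(x, y) paste(floor(x / size), floor(y / size), sep = "_")
  cell_keys <- key(cells$x, cells$y)
  # Position of the matching cell for every crash; NA if no cell matches
  idx <- match(key(crashes$x, crashes$y), cell_keys)
  idx <- idx[!is.na(idx)]
  # Number of crashes per cell (cells without crashes get 0)
  cells$n_crashes <- tabulate(idx, nbins = length(cell_keys))
  # Target: 1 if the cell contains one or more crashes, otherwise 0
  cells$crash <- as.integer(cells$n_crashes > 0)
  cells
}

# Hypothetical cell centres and crash locations
cells <- data.frame(x = c(75, 225, 375), y = c(75, 75, 75))
crashes <- data.frame(x = c(10, 140, 300), y = c(20, 149, 60))
labelled <- label_cells(cells, crashes)
```

In this toy example, two crashes fall in the first cell, none in the second, and one in the third, so only the second cell is labelled 0.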

2.2.2. Infrastructures

This class is the most numerous of those taken for the research. This data was all created from the OpenStreetMap (OSM) database [15]. Using the QGIS tool “QuickOSM” [36], it was possible to extract the entire road network of Switzerland and load it as a vector layer within the QGIS software, version 3.40.1 [37]. From this vector data, most of the variables in the infrastructure class were extracted. The distance classes (Distance to Airports, Health and Education Buildings) were constructed based on building data made available by OSM.

The distance-related predictors were computed using QGIS vector processing (‘Shortest distance line between elements’). This made it possible to determine the shortest segment between the centre of each grid cell and the centre of the building under consideration. The road length is the sum of the lengths of the road segments within the respective cell.

The variable Road Overlay represents the presence of road overlays and/or intersections. This variable was realised by extracting the intersections of the road segments. The number of intersections within the cell is then divided by the surface area of the cell (1).

Road Overlay = \frac{Number of overlays/corners}{Surface of single cell ({km}^{2})}

(1)
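As a worked instance of Equation (1), consider a single 150 × 150 m cell (the grid resolution stated in Section 2.2.1). The count of three intersections is an invented value used only for illustration.

```latex
\text{Surface of single cell} = 0.15\,\mathrm{km} \times 0.15\,\mathrm{km} = 0.0225\,\mathrm{km}^{2}
\qquad
\text{Road Overlay} = \frac{3}{0.0225\,\mathrm{km}^{2}} \approx 133.3\,\mathrm{km}^{-2}
```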

The Road Class variables represent the percentages of roads within a single cell divided into four hierarchical classes. The classes follow the OSM classification standard, adapted for the context of the Swiss road network (Table 2).

The maximum speed limit was extracted directly from the attribute table of the road network. A considerable proportion of the road sections examined (70%) lack this value. The available values are relatively evenly distributed across the OSM road classes. Therefore, it was decided to fill the missing values with the average of the respective class. Once a complete dataset was obtained, the average value of the roads within each cell was used.
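The two-step procedure above (class-mean imputation followed by a per-cell average) might look as follows in base R. The table `roads`, its column names, and all values are invented. The unweighted per-cell mean is an assumption, since the text does not state whether segment length is used as a weight.

```r
# Minimal sketch: fill missing speed limits with the mean of the same road
# class, then average per grid cell. All values and column names are invented.
roads <- data.frame(
  cell_id    = c(1, 1, 2, 2, 3),
  road_class = c("class_1", "class_3", "class_3", "class_3", "class_1"),
  maxspeed   = c(120, NA, 50, 30, NA)
)

# Mean of the observed limits within each road class, repeated per row
class_mean <- ave(roads$maxspeed, roads$road_class,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Replace only the missing entries
roads$maxspeed_filled <- ifelse(is.na(roads$maxspeed), class_mean, roads$maxspeed)

# Covariate: unweighted average speed limit of the segments in each cell
# (assumption: no weighting by segment length)
cell_speed <- aggregate(maxspeed_filled ~ cell_id, data = roads, FUN = mean)
```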

Sinuosity represents the ratio between the straight-line distance and the effective driving distance of each road network segment (2). To create the sinuosity covariate, the average value is calculated for each cell.

Sinuosity = \frac{Straight line distance}{Driving distance}

(2)
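Equation (2) can be illustrated for a single road segment stored as a polyline. The sketch below assumes planar coordinates in metres. The function name `sinuosity` and the two example segments are hypothetical.

```r
# Minimal sketch of Equation (2) for one road segment given as a polyline.
# Coordinates are invented and assumed to be planar, in metres.
sinuosity <- function(x, y) {
  # Driving distance: sum of the lengths between consecutive vertices
  driving <- sum(sqrt(diff(x)^2 + diff(y)^2))
  # Straight-line distance between the first and the last vertex
  n <- length(x)
  straight <- sqrt((x[n] - x[1])^2 + (y[n] - y[1])^2)
  straight / driving
}

straight_road <- sinuosity(c(0, 50, 100), c(0, 0, 0))
bent_road     <- sinuosity(c(0, 30, 60), c(0, 40, 0))
```

With this definition, a perfectly straight segment yields 1, and the value decreases towards 0 as the segment becomes more winding.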

2.2.3. Social/Economic Factors

This class of data represents some of the dynamics that can have a social and economic influence on the road accident phenomenon. The population per km² influences the presence of users of the road network in its different modes (pedestrians, private/public vehicles, cyclists, and wildlife). The land-use class indicates the possible presence of special vehicles, such as agricultural vehicles, and gives a concrete idea of population concentration, highlighting the densest urban fabric and differentiating it from rural areas.

The first variable was derived from the cantonal data made available by the Federal Office of Topography swisstopo [30]. Each cell was assigned the value corresponding to the canton where it is located.

Land use was extracted from data made available by the Copernicus Corine Land Cover (CLC) project [31]. The variable was limited to the first level of the CLC classification. Five variables were constructed for each cell, one for each of the first-level CLC categories. Each of these variables represents the proportion of the cell covered by the corresponding land-use class.

2.2.4. Environmental Factors

The environmental dataset belongs to the climate domain. The data were produced within the WorldClim2 project [35]. Both precipitation and temperature averages were generated over a 61-year time series, collected monthly from January 1960 to December 2021. Weather phenomena can directly or indirectly influence road safety conditions. In line with the reviewed literature [38,39], these variables were included to measure their effectiveness in the prediction activities. Precipitation influences both vehicle traction and visibility; in parallel, temperature variations can affect pavement conditions and general road safety.

2.3. Machine Learning Algorithms

2.3.1. Ranger

Ranger, an efficient implementation of the Random Forest algorithm, is commonly adopted for large datasets, constructing ensembles of decision trees to deliver highly accurate predictions [40,41]. At each node, the algorithm (Algorithm 1) selects a random subset of the predictors as candidates for the split, making the model more accurate and further reducing the instability of the trees [42,43,44]. In the script below, we choose the kind of importance used to evaluate the influence of the predictors. In this case, we use “impurity”, which corresponds to the Gini index [40]. The model is implemented with the ranger package [45] in the RStudio environment (R version 4.5.1, 13 June 2025) [46].

Algorithm 1. Application of the Ranger model in the RStudio environment

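library(ranger)

# Random Forest on the training set; columns 3 to 23 hold the predictors
ranger(x = traindata[3:23],
       y = traindata$crash,
       importance = "impurity",  # Gini impurity decrease
       num.trees = 500,
       write.forest = TRUE,
       min.node.size = 3,
       classification = TRUE,
       seed = 198)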

2.3.2. Logistic Regression

Logistic Regression (LRM), implemented in this work using the glm() function from the stats package in R [47], is a modelling technique typically used to analyse the relationship between multiple independent variables and a categorical dependent variable, as shown in Algorithm 2. It works by estimating the probability of an event occurring by fitting the data to a logistic curve [48]. The logistic function is described by the following formula [49]:

P = \frac{1}{1 + e^{-Z}}

(3)

where P is the probability of crash occurrence and Z is a linear combination of the causal factors x_i:

Z = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{3} + \dots + β_{n} x_{n}

(4)

In this case, the occurrence, based on a set of predictors [50], is obtained by:

Algorithm 2. Application of the Logistic Regression Model (LRM) in the RStudio environment

glm(crash_accidents ~ Distance_from_Airport + Distance_from_Health_Building +
      Distance_from_Educational_Building + Street_Length +
      Road_Overlay + Road_Class_1 + Road_Class_2 + Road_Class_3 +
      Road_Class_4 + Tortuosity_Index + Slope + Rainfall + Temperature +
      Land_Class_1 + Land_Class_2 + Land_Class_3 +
      Land_Class_4 + Land_Class_5 + Pop_per_km2 + Max_Speed_Limit,
    data = traindata,
    family = binomial)
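To make the link between Equations (3) and (4) and the fitted model concrete, the following sketch fits glm() on simulated data and reproduces the predicted probabilities by hand. The data, the two predictor names, and the coefficients used for simulation are invented and unrelated to the study dataset.

```r
# Minimal sketch: fit a logistic regression on simulated data and reproduce
# the predicted probability by hand from Equations (3) and (4).
set.seed(1)
n <- 500
sim <- data.frame(road_class_1 = runif(n), pop_per_km2 = runif(n))
z_true <- -1 + 2 * sim$road_class_1 + 1 * sim$pop_per_km2
sim$crash <- rbinom(n, size = 1, prob = 1 / (1 + exp(-z_true)))

fit <- glm(crash ~ road_class_1 + pop_per_km2, data = sim, family = binomial)

# Equation (4): linear combination of the predictors
beta <- coef(fit)
z <- beta[1] + beta[2] * sim$road_class_1 + beta[3] * sim$pop_per_km2

# Equation (3): logistic transformation of Z
p_manual <- 1 / (1 + exp(-z))

# Same quantity as returned by the fitted model
p_glm <- predict(fit, type = "response")
```

The two probability vectors coincide, which shows that predict() with type = "response" applies exactly the transformation in Equation (3) to the linear predictor.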

2.3.3. Keras

Deep learning is a subset of machine learning, and the Artificial Neural Network (ANN) is one of its methods. In this work, Keras was used. It is an API developed in Python (version 3.10) for building ANNs, based on TensorFlow [51]. Its widespread adoption is due to the ease with which it allows users to define, train, and analyze neural networks [52]. In fact, layers in the library connect to one another like Lego blocks, resulting in a tidy and easy-to-understand model [53]. The great advantage is that it enables the quick creation and training of models, and model testing is straightforward. Only a few characteristics need to be set: model details, the number of training epochs, and metrics to track [53,54]. Another important aspect is the efficiency gained by reducing the time spent on technical implementation, which frees up time for more important tasks, including the development of improved deep learning algorithms [52,55]. Another pivotal aspect is the large number of neural network components supported by Keras, including dense layers, convolutional layers, recurrent layers, and dropout layers [53]. In this work, Keras was used within RStudio (Algorithm 3) via the keras package, version 2.16 [56].

As reported in Algorithm 3, a Keras model is developed in three main steps in the RStudio environment [56]:

(1): Data preparation and model selection, the part of the code where the model is designed and the number and type of layers of neurons are determined;
(2): Model parameter settings, where the type of problem to be solved is specified and the metrics for monitoring the training phase are selected;
(3): Training and evaluation of the model, which represents the actual training phase.

Algorithm 3. Application of the Keras model in the RStudio environment

library(keras)

# (1) Model definition: two hidden dense layers and a sigmoid output
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu",
              input_shape = ncol(x_train_scaled)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# (2) Compilation: loss, optimizer, and metrics tracked during training
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c("accuracy", metric_precision(), metric_recall()))

# (3) Training, with 20% of the training data held out for validation
model %>% fit(
  x_train_scaled,
  y_train_bin,
  epochs = 50,
  batch_size = 320,
  validation_split = 0.2)
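To show what the 32-16-1 architecture of Algorithm 3 computes, the forward pass can be written in a few lines of base R. This sketch uses random, untrained weights and does not call Keras. The input size p = 20 matches the number of predictors listed in Algorithm 2 but is otherwise an assumption, and the helper names (`init_layer`, `dense`, `forward`) are hypothetical.

```r
# Minimal sketch: forward pass of a 32-16-1 dense network in base R.
# Weights are random (untrained); p = 20 inputs is an assumption.
set.seed(42)
p <- 20

relu <- function(a) pmax(a, 0)
sigmoid <- function(a) 1 / (1 + exp(-a))

# One dense layer: weight matrix (inputs x units) plus one bias per unit
init_layer <- function(n_in, n_out) {
  list(W = matrix(rnorm(n_in * n_out, sd = 0.1), n_in, n_out),
       b = rep(0, n_out))
}

# Affine transformation followed by the activation function
dense <- function(x, layer, activation) {
  activation(sweep(x %*% layer$W, 2, layer$b, "+"))
}

layers <- list(init_layer(p, 32), init_layer(32, 16), init_layer(16, 1))

forward <- function(x) {
  h1 <- dense(x, layers[[1]], relu)
  h2 <- dense(h1, layers[[2]], relu)
  dense(h2, layers[[3]], sigmoid)  # one value in (0, 1) per input row
}

x_batch <- matrix(rnorm(5 * p), nrow = 5)  # five hypothetical scaled cells
prob <- forward(x_batch)

# Trainable parameters: weights plus biases of every layer
n_params <- sum(sapply(layers, function(l) length(l$W) + length(l$b)))
```

Under the p = 20 assumption, the network has 1217 trainable parameters, which is small compared with the number of grid cells and helps explain the moderate training time.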

2.4. Validation and Assessment Models

To validate the model, the dataset was randomly divided into two parts in an 80/20 proportion. The larger part is used to train the model. During this phase, the model makes more accurate predictions than in the test phase, because the model’s parameters are continually adjusted during training to improve the metrics and the quality of the classification. The remaining part of the data is used for the testing phase, i.e., the phase in which the model predicts on unseen data [57]. Different metrics are used to assess classification performance. Most of these metrics are derived from the confusion matrix (Table 3) [58,59]:

Several metrics were selected to reveal differences in model performance on unbalanced data. Commonly applied metrics in classification problems include accuracy (5) and precision, or positive predictive value (6) [57,58]. Further metrics are used to expose differences in performance quality: Specificity, also called true negative rate or inverse recall (7), and Sensitivity, also called true positive rate or recall (8) [57,60]. The Negative Predictive Value (NPV) is used to emphasize the model’s ability to classify the cells where an accident occurs (9).

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(5)

Precision = \frac{TP}{FP + TP}

(6)

Specificity = \frac{TN}{FP + TN}

(7)

Sensitivity = \frac{TP}{TP + FN}

(8)

NPV = \frac{TN}{FN + TN}

(9)
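The metrics in Equations (5) to (9) can be computed directly from the four counts of a confusion matrix, as in the sketch below. The counts are invented and are not those of Table 5. Sensitivity follows the standard definition TP / (TP + FN), and the function name `classification_metrics` is hypothetical.

```r
# Minimal sketch: Equations (5)-(9) from the four confusion-matrix counts.
# The counts below are invented and are not the values of Table 5.
classification_metrics <- function(tp, tn, fp, fn) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    precision   = tp / (fp + tp),
    specificity = tn / (fp + tn),
    sensitivity = tp / (tp + fn),  # standard definition: TP / (TP + FN)
    npv         = tn / (fn + tn))
}

# Unbalanced example: 900 cells of class P and 100 cells of class N
m <- classification_metrics(tp = 890, tn = 40, fp = 60, fn = 10)
```

In this unbalanced example the accuracy is 0.93 while the specificity is only 0.40, which illustrates why accuracy alone is not sufficient to compare models.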

Other metrics and plots that are not directly derived from the confusion matrix are also used to assess the performance of classification models. In this research, we use the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) in the assessment phase. Both are widely used in classification problems [61,62,63]. The ROC plot describes the quality of a classification model in terms of error or performance. The plot can take the form of a smooth curve fitted to the data or of an empirical ROC that matches the data exactly. The AUC represents the model’s capacity to distinguish between the P and N classes, where the Recall Rate (8) and the False Positive Rate (FPR) (10) are the vertical and horizontal axes, respectively, of the ROC plot [62,64,65].

FPR = \frac{FP}{FP + TN}

(10)
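An empirical ROC curve and its AUC can be obtained in base R by sweeping a threshold over the predicted scores. The sketch below is one possible construction, using the trapezoidal rule for the area. The labels, scores, and function names are invented, and the paper does not state which implementation was used.

```r
# Minimal sketch: empirical ROC points and AUC in base R.
# Labels (1 = class P, 0 = class N) and scores are invented.
roc_points <- function(label, score) {
  # Thresholds from strictest (nothing predicted P) to loosest
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  t(sapply(thresholds, function(th) {
    pred <- score >= th
    c(fpr = sum(pred & label == 0) / sum(label == 0),  # Equation (10)
      tpr = sum(pred & label == 1) / sum(label == 1))  # recall rate
  }))
}

# Area under the empirical ROC curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  fpr <- roc[, "fpr"]
  tpr <- roc[, "tpr"]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

label <- c(1, 1, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)
roc <- roc_points(label, score)
auc <- auc_trapezoid(roc)
```

For these six observations, eight of the nine possible (P, N) pairs are ranked correctly, so the AUC equals 8/9.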

3. Results and Discussion

Descriptive statistics were calculated for all covariates, as reported in Appendix A. Furthermore, to check for multicollinearity, a Pearson’s correlation matrix was computed, and the results are reported in Appendix B. These preliminary operations contribute to the management of outliers and to the assessment of model performance.

3.1. Covariates Importance

Because the implemented models differ in their characteristics, different metrics and statistics are used to assess the importance of the covariates in the training phase. Ranger supports a native method for evaluating the influence of covariates on the model’s quality. In RStudio, it is coded as “impurity” (Figure 3) and ranks the independent variables used in the classification. Each covariate is assigned an importance value, based on the decrease in impurity obtained when the model applies a splitting criterion in the classification trees [66,67].
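The quantity behind the “impurity” importance can be illustrated for a single split. The sketch below computes the Gini impurity decrease of one candidate split on invented data. As far as we understand, impurity-based importance accumulates such decreases over all splits that use a given covariate, so this shows the building block rather than the full procedure implemented in ranger.

```r
# Minimal sketch: Gini impurity decrease for one candidate split.
# The node content and the covariate values are invented.
gini <- function(y) {
  p <- mean(y == 1)
  1 - p^2 - (1 - p)^2
}

impurity_decrease <- function(y, go_left) {
  w_left <- sum(go_left) / length(y)
  # Parent impurity minus the size-weighted impurity of the two children
  gini(y) - w_left * gini(y[go_left]) - (1 - w_left) * gini(y[!go_left])
}

# Node with eight cells: target and one covariate
crash        <- c(1, 1, 1, 1, 0, 0, 1, 0)
road_class_2 <- c(0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.5, 0.0)

decrease <- impurity_decrease(crash, go_left = road_class_2 >= 0.5)
```

Here the split separates the two classes perfectly, so the decrease equals the impurity of the parent node.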

In the Ranger model, the covariates belonging to the infrastructure group have more influence than those of the environmental or socio-economic groups. Road Class 2 has the greatest influence in terms of impurity decrease, with a considerable gap compared to the others; together with the distance from health buildings and Road Class 4, it is among the covariates of greatest importance in the construction of the model. Road Class 2 includes the most heavily trafficked roads, which is probably why it influences the model’s ability to recognize the occurrence of an accident. The same may hold for the distance from health buildings: near this type of building there may be a concentration of emergency vehicles, which could influence whether accidents occur. For Road Class 4, the opposite argument could be made with respect to Road Class 2, because the roads it includes see limited use, owing to restrictions on their use (bus ways) or to their function (service roads). In other work with a similar field of application and approach, the road category also has a high influence on the model prediction [9]. In the same study, the Random Forest model also assigns significant importance to the variable that distinguishes between urban and rural road accidents.

Land Class 4, together with population density per square kilometre (Pop./km²), are the only covariates of the socio-economic group that achieved a high rank in impurity decrease. The other covariates of this group achieved the lowest ranks. The slope and climate variables achieve values that are neither very low nor very high, but the slope probably influences the model more than the two climate variables. The morphology of the terrain, and therefore also the slope, has a significant impact on the morphology of the road, its structure, and its safety.

For the Logistic Regression Model, the importance of covariates is determined by the z-value, or Wald statistic (Figure 4). For each variable used in the classification, the z-value tests the hypothesis that the coefficient of the variable is 0 against the alternative that it differs from 0 [63]. In this model too, Road Class 2 is dominant. Road Class 1 emerges as particularly significant, even more so than in the Ranger model. Similarly, high-hierarchy roads strongly influence the model’s composition, while Road Class 3 maintains importance. Conversely, Road Class 4 is not a relevant predictor in this setting. Some variables that belong to the infrastructure group have less importance than in the previous model, but infrastructure generally remains the best-performing group. The socio-economic variables are less important in the Logistic Regression Model. All Land Classes have very low values, and Land Class 5 is excluded from the model. This class is also the least important in the Ranger model (ranger version 0.17.0). Rainfall has a lesser influence on this model than on Ranger, while temperature appears to have a medium level of importance. In a territory such as the one examined, it is not unusual for temperature to affect road conditions and vehicle performance, considering the consequences of exposure to low temperatures and the presence of cells with minimum values of −10 degrees Celsius.
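The Wald z-value used for this ranking is the coefficient estimate divided by its standard error. The sketch below reproduces it from a glm() fit on simulated data. The two predictors and their true effects are invented, and ranking by the absolute z-value is an assumption about how Figure 4 is ordered.

```r
# Minimal sketch: Wald z-values of a fitted logistic regression.
# Data are simulated; predictor names and effects are invented.
set.seed(2)
n <- 400
sim <- data.frame(x_strong = rnorm(n), x_weak = rnorm(n))
sim$crash <- rbinom(n, 1, plogis(0.5 + 1.5 * sim$x_strong + 0.1 * sim$x_weak))

fit <- glm(crash ~ x_strong + x_weak, data = sim, family = binomial)
coefs <- summary(fit)$coefficients

# Wald statistic: estimate divided by its standard error
z_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Rank the predictors by absolute z-value (intercept dropped);
# assumption: larger |z| is read as higher importance
ranking <- sort(abs(z_manual[-1]), decreasing = TRUE)
```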

3.2. Accuracy Assessment

The model that obtained the best performance is Ranger, both in terms of Accuracy, where it achieved 0.98 whereas LRM and Keras achieved 0.93 (Table 4), and in the model’s capacity to classify TP. Indeed, in both Specificity and Negative Predictive Value, Ranger achieves the best values, 0.88 and 0.96 respectively, indicating a promising ability of the model to classify both classes. We can also see this trend in the confusion matrix (Table 5).

The TN in the Ranger matrix are fewer in number than in those of LRM and Keras. These models were unable to classify the N class correctly. They obtain a good Accuracy value only because the data are unbalanced. The ability to classify P and the large number of P cases increase the accuracy value; therefore, it is essential to use different metrics to compare the performance of models. Keras performs better than LRM in terms of Specificity, achieving 0.40 compared to 0.34. Both classify P in the same way, but in N, Keras classifies about 1000 TN more than LRM.

The AUC value reflects the same characteristics observed in the other metrics. Also, in this case, Ranger achieved the best value, 0.99, followed by Keras, 0.93, and LRM, 0.92.

The models were relatively fast. Logistic regression differs from the other models in that it takes less time to train, completing the process in under one minute. Keras takes the longest to complete the training phase, at around 15 min, whereas Ranger takes only 10 min. The longer training time of Keras could be because, in RStudio, it plots the selected training metrics in real time, which requires more resources.

In this case, no significant difference is observable from the ROC plots. The plots are very similar, and the differences are minimal. The LRM curve differs slightly from the others (Figure 5), appearing more flattened.

3.3. Prediction Maps

Figure 6 shows the prediction maps. These maps are produced using the probability, estimated by each model, of a cell belonging to class 1. This value can be used to evaluate the model’s ability to predict the susceptibility of an area to road accidents. Cells with values closer to 1 are mainly located in urban and peri-urban areas. This pattern is consistent across all maps. The models identify Zurich, Bern, Basel, Lausanne, and Geneva, as well as their surrounding areas, as being closer to class 1. The models performed very well in identifying areas with a high probability of road accidents, while also showing greater precision in highlighting low-probability zones. In southern Switzerland, cells with values close to 1 were less frequent, likely due to the predominance of lower-density road networks and the limited extent of urban land use in this area. This configuration is likely to reduce exposure and, consequently, accident occurrence. Conversely, all models consistently highlighted the northern region as more vulnerable, where higher population density and a more complex road infrastructure increase the probability of road accidents. Both Ranger and Keras identify a concentration of values close to 1 in the Zurich canton area. LRM tends to classify more cells in class 0 in this area and its neighbouring areas, particularly those surrounding the main urban centres.

This trend is more easily observed in the canton of Graubünden, where each model obtains values close to 0. However, in the LRM map, the infrastructure network traced by the grid is almost invisible. In the Keras map, it is slightly visible, whereas in the Ranger map, it is perfectly visible.

This pattern is more evident on the map showing the mean probability by municipality (Figure 7). There are 24 municipalities with a mean probability exceeding 0.5; following the model’s logic, these municipalities would be classified as class 1. They include Zurich, Basel, and Geneva: Geneva reached a probability of 0.78, Basel 0.66, and Zurich 0.51. However, the model also highlights trends that are not immediately obvious. Some of the municipalities identified have relatively small populations, which could be due to the scale of the road infrastructure relative to the municipality’s surface area and population. Despite a population of around 2500, Muralto reached the highest probability (0.86). This small town is a crucial crossroads for vehicular traffic: it hosts the main road connecting the important city of Locarno with the surrounding municipalities, as well as an important rail station [68]. It is followed by the municipalities of Chêne-Bourg (0.74), Vevey (0.73), and Lancy (0.67). Interestingly, these last three municipalities point to a tendency towards higher average values in municipalities with relatively small populations. However, these small towns are located in a high-density traffic area around Lake Geneva. The Lake Geneva Region registers the highest number of motor vehicles in Switzerland: 1,730,000 in 2023 [69]. Chêne-Bourg is located at the edge of the Geneva metropolitan area, close to one of the main access highways to the city, the “Route blanche”. Vevey is a small city with almost 20,000 inhabitants. It is historically important as a staging point on the ancient ‘Francigena Way’ [70] and has therefore long been an important traffic center. Today, the N9 highway passes through the northern part of the town, with 67,000 vehicles passing through Vevey every day [69].
Finally, Lancy is a small town located in the southern part of the Geneva metropolitan area. It is an important public transport hub, home to the Gare de Genève-La Praille train station, which serves both passenger and freight traffic (46°10′56″ N, 6°07′34″ E). In contrast, the relatively low value for Zurich could be attributed to the municipality’s relatively small surface area and limited road infrastructure. Some municipalities bordering the main urban area or lying along the main infrastructure lines achieved a mean value greater than 0.5. This can be attributed to the presence of numerous cells in the municipal area, particularly along the main road (Road Class 1), despite the low population and limited road infrastructure. This indicates that even small towns and cities can be classified by the model as close to 1, as can be seen in the areas surrounding the Zurich municipal area and along some of the main infrastructure lines branching off from it towards other major urban centers. The same trend can be seen along the shores of Lake Geneva and in the area neighboring Geneva.
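
The aggregation behind the municipality map can be illustrated with a minimal Python sketch. It assumes that each grid cell carries a predicted class 1 probability and a municipality label, that the municipal value is the simple mean of its cells, and that a municipality is assigned to class 1 when this mean exceeds 0.5. The cell values, the labels A, B and C, and the function names are hypothetical; only the seven reported means are taken from the text.

```python
from collections import defaultdict

# Hypothetical per-cell predictions: (municipality label, probability of class 1).
# These values are illustrative only and are not taken from the study's data.
cells = [
    ("A", 0.90), ("A", 0.70), ("A", 0.50),
    ("B", 0.20), ("B", 0.40),
    ("C", 0.60), ("C", 0.45),
]


def mean_probability_by_municipality(cells):
    """Average the cell probabilities within each municipality."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, probability in cells:
        sums[name] += probability
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}


def classify(mean_probabilities, threshold=0.5):
    """Assign class 1 when the mean probability exceeds the threshold (assumed rule)."""
    return {name: int(p > threshold) for name, p in mean_probabilities.items()}


means = mean_probability_by_municipality(cells)
labels = classify(means)

# Mean probabilities reported in the text for selected municipalities.
reported = {
    "Muralto": 0.86, "Geneva": 0.78, "Chêne-Bourg": 0.74, "Vevey": 0.73,
    "Lancy": 0.67, "Basel": 0.66, "Zurich": 0.51,
}
reported_labels = classify(reported)
```

Under this rule, all seven reported municipalities fall into class 1, with Zurich only just above the threshold.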

3.4. Limitations of the Research

This research aimed to use only open-source data and tools, applying MLA to analyze the phenomenon of road accidents. All the applied models achieved a high level of performance, with Ranger performing best. The results are similar to, and sometimes better than, those of other studies in the same field [71,72,73]. Some of these works, such as [71,72], aim to predict the severity of the accident rather than its occurrence. However, they use the same models and approach (a classification problem), some of the same variables and, of course, the same metrics to evaluate model performance. For this reason, we consider our results comparable with theirs.

As in other fields, data availability is limited. The Swiss authorities should be commended for the quantity and quality of their data; however, this appears to be an isolated case rather than a common approach to handling this type of data. Such data need to be open and of the highest quality. Moreover, if additional information on accident dynamics had been available, within the limits permitted by current international privacy laws, the issue could have been addressed more comprehensively.

Similar action is needed for the data that serve as independent variables in this work. Such variables are selected and used to explain the natural, social, or economic phenomenon under study. In road accident research, data describing the dynamics of the accident are of fundamental importance, including the characteristics of the vehicles involved, their speed, and the condition of the people involved. Projects aimed at improving the availability of road accident data are already underway (e.g., [21]). However, they are still in the data collection phase and do not yet cover a sufficiently wide area. Nevertheless, we believe that initiatives of this kind represent a crucial step toward improving the accuracy of machine learning applications and their contribution to spatial knowledge.

The condition of the road infrastructure is also important, including artificial lighting, the number of lanes, and horizontal (road markings) and vertical traffic signs.

4. Conclusions

This research aimed to assess the impact of the environment on road accidents using open-source data and tools. It represents a small step towards using machine learning techniques to assess hazardous phenomena that pose a threat to human life. Improving health policies related to traffic management is crucial for public administrations and highly relevant to the communities they serve. Employing these techniques could lead to the development of new tools and improvements in targeted policies, as well as the activation of new time-monitoring tools. However, improving these models requires data that are more accessible, of higher quality, more precise, and more detailed. To address this issue, it is necessary to act on three levels: (i) technical, by importing data collection systems and practices from other research fields, such as environmental monitoring and territorial management; (ii) methodological, by introducing new ways of representing data and information within administrative processes; (iii) regulatory, by ensuring that policies take into account the accessibility and availability of such data at both national and international levels. Populating open databases with comprehensive information on car accidents will therefore enable more accurate predictions only if acquiring and storing these data is prioritized in the administrative agenda. The insights gained here can inform more nuanced transport policies and foster the operational use of probability mapping in public health and traffic safety.

Author Contributions

Conceptualization, A.A. and A.G.; methodology, A.A. and A.G.; software, A.A.; validation, A.A., A.G. and G.F.C.; formal analysis, A.A., A.G.; investigation, A.A.; resources, A.A., A.G. and G.F.C.; data curation, A.A.; writing—original draft preparation, A.A., A.G. and G.F.C.; writing—review and editing, A.A., A.G. and G.F.C.; visualization, A.A.; supervision, A.G.; project administration, A.G. and G.F.C.; funding acquisition, A.G. and G.F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Predictors	Min	Max	Average	Var	Outlier
Distance from Airport	2.39	41,222.49	9437.84	31,875,370	28,279
Distance from Health Building	0.414	25,581.83	3655.67	8,495,665	42,463
Distance from Educational Building	1.63	36,919.46	3880.78	13,272,945	56,890
Street Length	0.000006	0.1369	0.01689	83.8	44,318
Road Overlay	0.00000	0.006489	0.00005333	0.00111	155,634
Road Class 1 (%)	0.0000	100.00	0.8929	5,276,828	18,540
Road Class 2 (%)	0.000	100.00	7.527	4,257,821	157,519
Road Class 3 (%)	0.00	100.00	16.70	9,440,875	144,187
Road Class 4 (%)	0.00	100.00	74.88	1,315,316	0
Max. Speed Limit	19.77	100.00	41.22	1,473,987	395,910
Sinuosity	0.079	31,368.99	1.640	2987.42	159,027
Pop./km²	0.0	11,140.0	308.2	419,421	123,239
Land Class 1 (%)	0.00	100.00	48.21	2,193,478	0
Land Class 2 (%)	0.0000	100.00	0.5022	3,215,553	13,285
Land Class 3 (%)	0.000	100.000	39.529	2022.72	0
Land Class 4 (%)	0.00	100.00	11.63	8,879,126	179,750
Land Class 5 (%)	0.0000	100.0000	0.1343	107.277	2637
Slope	0.000	1379.593	29.801	9,489,682	21,692
Rainfall	14.40	174.77	102.15	3,465,761	20,994
Temperature	−10.728	7.820	2.827	7,525,209	54,805

Appendix B

References

1. World Health Organization (WHO). Road Traffic Injuries. Available online: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries (accessed on 14 April 2025).
2. Santos, K.; Dias, J.P.; Amado, C. A Literature Review of Machine Learning Algorithms for Crash Injury Severity Prediction. J. Saf. Res. 2022, 80, 254–269.
3. Stewart, K.-A.; Groen, R.S.; Kamara, T.B.; Farahzard, M.; Samai, M.; Yambasu, S.E.; Cassidy, L.D.; Kushner, A.L.; Wren, S.M. Traumatic Injuries in Developing Countries: Report from a Nationwide Cross-Sectional Survey of Sierra Leone. JAMA Surg. 2013, 148, 463–469.
4. Bryant, R.A.; Harvey, A.G. Psychological Impairment Following Motor Vehicle Accidents. Aust. J. Public Health 1995, 19, 185–188.
5. Chan, A.O.M.; Medicine, M.; Air, T.M.; McFarlane, A.C. Posttraumatic Stress Disorder and Its Impact on the Economic and Health Costs of Motor Vehicle Accidents in South Australia. J. Clin. Psychiatry 2003, 64, 175–181.
6. Heron-Delaney, M.; Kenardy, J.; Charlton, E.; Matsuoka, Y. A Systematic Review of Predictors of Posttraumatic Stress Disorder (PTSD) for Adult Road Traffic Crash Survivors. Injury 2013, 44, 1413–1422.
7. Craig, A.; Tran, Y.; Guest, R.; Gopinath, B.; Jagnoor, J.; Bryant, R.A.; Collie, A.; Tate, R.; Kenardy, J.; Middleton, J.W.; et al. Psychological Impact of Injuries Sustained in Motor Vehicle Crashes: Systematic Review and Meta-Analysis. BMJ Open 2016, 6, e011993.
8. European Commission. 20,400 Lives Lost in EU Road Crashes Last Year. Available online: https://transport.ec.europa.eu/news-events/news/20400-lives-lost-eu-road-crashes-last-year-2024-10-10_en (accessed on 14 April 2025).
9. Ahmed, S.; Hossain, M.A.; Ray, S.K.; Bhuiyan, M.M.I.; Sabuj, S.R. A Study on Road Accident Prediction and Contributing Factors Using Explainable Machine Learning Models: Analysis and Performance. Transp. Res. Interdiscip. Perspect. 2023, 19, 100814.
10. Wang, Y.; Zhang, W. Analysis of Roadway and Environmental Factors Affecting Traffic Crash Severities. Transp. Res. Procedia 2017, 25, 2119–2125.
11. Thompson, J.P.; Baldock, M.R.J.; Mathias, J.L.; Wundersitz, L.N. An Examination of the Environmental, Driver and Vehicle Factors Associated with the Serious and Fatal Crashes of Older Rural Drivers. Accid. Anal. Prev. 2013, 50, 768–775.
12. Zainuddin, N.I.; Arshad, A.K.; Hamidun, R.; Haron, S.; Hashim, W. Influence of Road and Environmental Factors towards Heavy-Goods Vehicle Fatal Crashes. Phys. Chem. Earth Parts A/B/C 2023, 129, 103342.
13. Lankarani, K.B.; Heydari, S.T.; Aghabeigi, M.R.; Moafian, G.; Hoseinzadeh, A.; Vossoughi, M. The Impact of Environmental Factors on Traffic Accidents in Iran. J. Inj. Violence Res. 2014, 6, 64–71.
14. Wang, K.; Zhang, W.; Jin, L.; Feng, Z.; Zhu, D.; Cong, H.; Yu, H. Diagnostic Analysis of Environmental Factors Affecting the Severity of Traffic Crashes: From the Perspective of Pedestrian–Vehicle and Vehicle–Vehicle Collisions. Traffic Inj. Prev. 2022, 23, 17–22.
15. Coast, S. © OpenStreetMap Contributors. Available online: https://osmfoundation.org/wiki/Main_Page (accessed on 23 February 2025).
16. Accuracy - OpenStreetMap Wiki. Available online: https://wiki.openstreetmap.org/wiki/Accuracy (accessed on 14 April 2025).
17. Rezapour, M.; Mehrara Molan, A.; Ksaibati, K. Analyzing Injury Severity of Motorcycle At-Fault Crashes Using Machine Learning Techniques, Decision Tree and Logistic Regression Models. Int. J. Transp. Sci. Technol. 2020, 9, 89–99.
18. Prati, G.; Pietrantoni, L.; Fraboni, F. Using Data Mining Techniques to Predict the Severity of Bicycle Crashes. Accid. Anal. Prev. 2017, 101, 44–54.
19. Wahab, L.; Jiang, H. A Comparative Study on Machine Learning Based Algorithms for Prediction of Motorcycle Crash Severity. PLoS ONE 2019, 14, e0214966.
20. Wang, S.; Gao, K.; Zhang, L.; Yu, B.; Easa, S.M. Geographically Weighted Machine Learning for Modeling Spatial Heterogeneity in Traffic Crash Frequency and Determinants in US. Accid. Anal. Prev. 2024, 199, 107528.
21. Global Road Safety Facility, World Bank. DRIVER - Data for Road Incident Visualization, Evaluation, and Reporting. 2024. Available online: https://www.globalroadsafetyfacility.org/driver (accessed on 17 August 2025).
22. Ministero delle Infrastrutture e dei Trasporti. Dataset - Open Data. Available online: https://dati.mit.gov.it/catalog/dataset/?tags=incidenti (accessed on 17 August 2025).
23. Comune di Cagliari. Open Data. Available online: https://www.comune.cagliari.it/portale/page/it/open_data_it_1?contentId=SRV13140 (accessed on 17 August 2025).
24. Incidenti Stradali con Danni Personali - Opendata.Swiss. Available online: https://opendata.swiss/it/dataset/strassenverkehrsunfalle-mit-personenschaden (accessed on 14 April 2025).
25. Chaturvedi, V.; de Vries, W.T. Machine Learning Algorithms for Urban Land Use Planning: A Review. Urban Sci. 2021, 5, 68.
26. Utilizzazione e Copertura del Suolo. Available online: https://www.bfs.admin.ch/content/bfs/it/home/statistiche/territorio-ambiente/utilizzazione-copertura-suolo.html (accessed on 23 February 2025).
27. Crescita ed Effettivi della Popolazione - 1900–2023 | Diagramma. Available online: https://www.bfs.admin.ch/asset/it/32229771 (accessed on 23 February 2025).
28. IT | Dashboard Strassenverkehrsunfälle. Available online: https://experience.arcgis.com/experience/ab23f413abb04339b536b5ebe2fc7499/page/IT (accessed on 19 May 2025).
29. Road Traffic Accidents with Injury to Persons. Available online: https://data.geo.admin.ch/browser/index.html#/collections/ch.astra.unfaelle-personenschaeden_alle (accessed on 19 May 2025).
30. Swisstopo - Knowing Where. Available online: https://www.swisstopo.admin.ch/en?utm_source=chatgpt.com (accessed on 23 February 2025).
31. CORINE Land Cover. Available online: https://land.copernicus.eu/en/products/corine-land-cover (accessed on 11 May 2024).
32. EU-DEM | Copernicus. Available online: https://www.copernicus.eu/it/node/3234?utm_source=chatgpt.com (accessed on 23 February 2025).
33. European Digital Elevation Model (EU-DEM). Available online: https://www.eea.europa.eu/en/datahub/datahubitem-view/d08852bc-7b5f-4835-a776-08362e2fbf4b?utm_source=chatgpt.com (accessed on 23 February 2025).
34. Geoland.At. Available online: https://www.geoland.at/ (accessed on 23 February 2025).
35. Fick, S.E.; Hijmans, R.J. WorldClim 2: New 1-Km Spatial Resolution Climate Surfaces for Global Land Areas. Int. J. Climatol. 2017, 37, 4302–4315.
36. Trimaille, E. QuickOSM 2025. Available online: https://plugins.qgis.org/plugins/QuickOSM/ (accessed on 24 April 2025).
37. QGIS Development Team. QGIS 2023. Available online: https://qgis.org/download/ (accessed on 30 November 2024).
38. Theofilatos, A.; Yannis, G. A Review of the Effect of Traffic and Weather Characteristics on Road Safety. Accid. Anal. Prev. 2014, 72, 244–256.
39. Bergel-Hayat, R.; Debbarh, M.; Antoniou, C.; Yannis, G. Explaining the Road Accident Risk: Weather Effects. Accid. Anal. Prev. 2013, 60, 456–465.
40. Wright, M.N.; Ziegler, A. Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 2017, 77, 1–17.
41. Demir, S.; Şahin, E.K. Random Forest Importance-Based Feature Ranking and Subset Selection for Slope Stability Assessment Using the Ranger Implementation. Eur. J. Sci. Technol. 2023, 2, 23–28.
42. Forkuor, G.; Hounkpatin, O.K.L.; Welp, G.; Thiel, M. High Resolution Mapping of Soil Properties Using Remote Sensing Variables in South-Western Burkina Faso: A Comparison of Machine Learning and Multiple Linear Regression Models. PLoS ONE 2017, 12, e0170478.
43. Van Der Westhuizen, S.; Heuvelink, G.B.M.; Hofmeyr, D.P.; Poggio, L. Measurement Error-Filtered Machine Learning in Digital Soil Mapping. Spat. Stat. 2022, 47, 100572.
44. Dreiseitl, S.; Ohno-Machado, L. Logistic Regression and Artificial Neural Network Classification Models: A Methodology Review. J. Biomed. Inform. 2002, 35, 352–359.
45. Ranger: Ranger in Ranger: A Fast Implementation of Random Forests. Available online: https://rdrr.io/cran/ranger/man/ranger.html (accessed on 8 June 2023).
46. RStudio Team. RStudio: Integrated Development for R 2011. Available online: https://posit.co/download/rstudio-desktop/ (accessed on 8 July 2025).
47. R Core Team. Stats: The R Stats Package 2024. Available online: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html (accessed on 25 February 2025).
48. Park, H.-A. An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain. J. Korean Acad. Nurs. 2013, 43, 154.
49. Rasyid, A.R.; Bhandary, N.P.; Yatabe, R. Performance of Frequency Ratio and Logistic Regression Model in Creating GIS Based Landslides Susceptibility Map at Lompobattang Mountain, Indonesia. Geoenviron. Disasters 2016, 3, 19.
50. Subasi, A.; Erçelebi, E. Classification of EEG Signals Using Neural Network and Logistic Regression. Comput. Methods Programs Biomed. 2005, 78, 87–99.
51. Dillon, J.V.; Langmore, I.; Tran, D.; Brevdo, E.; Vasudevan, S.; Moore, D.; Patton, B.; Alemi, A.; Hoffman, M.; Saurous, R.A. TensorFlow Distributions. arXiv 2017, arXiv:1711.10604.
52. Lee, H.; Song, J. Introduction to Convolutional Neural Network Using Keras; an Understanding from a Statistician. CSAM 2019, 26, 591–610.
53. Chicho, B.T.; Sallow, A.B. A Comprehensive Survey of Deep Learning Models Based on Keras Framework. J. Soft Comput. Data Min. 2021, 2, 49–62.
54. Petra, V.; Neruda, R. Evolving KERAS Architectures for Sensor Data Analysis. In Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, Prague, Czech Republic, 3–6 September 2017; p. 112.
55. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
56. Kalinowski, T.; Falbel, D.; Allaire, J.J.; Chollet, F.; RStudio; Google; Tang, Y.; Bijl, W.V.D.; Studer, M.; Keydana, S. Keras: R Interface to “Keras”, Version 2.16.0; RStudio: Boston, MA, USA, 2024.
57. Tharwat, A. Classification Assessment Methods. ACI 2021, 17, 168–192.
58. Božić, D.; Runje, B.; Lisjak, D.; Kolar, D. Metrics Related to Confusion Matrix as Tools for Conformity Assessment Decisions. Appl. Sci. 2023, 13, 8187.
59. Townsend, J.T. Theoretical Analysis of an Alphabetic Confusion Matrix. Percept. Psychophys. 1971, 9, 40–50.
60. Swift, A.; Heale, R.; Twycross, A. What Are Sensitivity and Specificity? Evid. Based Nurs. 2020, 23, 2–4.
61. Bradley, A.P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30, 1145–1159.
62. Vanderlooy, S.; Hüllermeier, E. A Critical Analysis of Variants of the AUC. Mach. Learn. 2008, 72, 247–262.
63. Hand, D.J.; Till, R.J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 2001, 45, 171–186.
64. Tao, H.; Niu, X.; Xu, L.; Fu, L.; Cao, Q.; Chen, H.; Shang, S.; Xian, Y. A Comparative Study of Software Defect Binomial Classification Prediction Models Based on Machine Learning. Softw. Qual. J. 2024, 32, 1203–1237.
65. Carrington, A.M.; Fieguth, P.W.; Qazi, H.; Holzinger, A.; Chen, H.H.; Mayr, F.; Manuel, D.G. A New Concordant Partial AUC and Partial c Statistic for Imbalanced Data in the Evaluation of Machine Learning Algorithms. BMC Med. Inform. Decis. Mak. 2020, 20, 4.
66. Xu, R.; Nettleton, D.; Nordman, D.J. Case-Specific Random Forests. J. Comput. Graph. Stat. 2016, 25, 49–65.
67. Nembrini, S.; König, I.R.; Wright, M.N. The Revival of the Gini Importance? Bioinformatics 2018, 34, 3711–3718.
68. Nodo Intermodale della Stazione di Locarno-Muralto: Un Intervento a Favore di una Viabilità Migliore! Available online: https://www.ticinonews.ch/ospiti/nodo-intermodale-della-stazione-di-locarno-muralto-un-intervento-a-favore-di-una-viabilita-migliore-413352 (accessed on 3 July 2025).
69. Federal Roads Office (FEDRO). FEDRO Annual Report - Roads and Traffic 2023/2024; Federal Roads Office (FEDRO): Ittigen, Switzerland, 2025.
70. Conti, E.V. Il Pellegrinaggio Contemporaneo: Percezione del Tempo e Spiritualità Lungo la via Francigena. Contemporary Pilgrimage: Perception of Time and Spirituality Along the via Francigena. Master’s Thesis, Università Degli Studi di Genova, Genova, Italy, 2024.
71. Yan, M.; Shen, Y. Traffic Accident Severity Prediction Based on Random Forest. Sustainability 2022, 14, 1729.
72. Yassin, S.S.; Pooja. Road Accident Prediction and Model Interpretation Using a Hybrid K-Means and Random Forest Algorithm Approach. SN Appl. Sci. 2020, 2, 1576.
73. Hickman, L.; Akdere, M. Developing Intercultural Competencies through Virtual Reality: Internet of Things Applications in Education and Learning. In Proceedings of the 2018 15th Learning and Technology Conference (L&T), Jeddah, Saudi Arabia, 25–26 February 2018; IEEE: New York, NY, USA; pp. 24–28.

Figure 1. Workflow describing the methodological approach.

Figure 2. Study Area Map.

Figure 3. Covariates Importance Plot. Ranger.

Figure 4. Covariates Importance. Logistic Regression.

Figure 5. ROC curves. Ranger is shown in green, LRM in red, and Keras in blue.

Figure 6. Prediction maps.

Figure 7. Map of the population distribution (left) compared with the map of average car crashes by municipality (right).

Table 1. Variables source and classification table.

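Variables	Source	Theme	Resolution
Crashes	Swiss Federal Roads Office (FEDRO) [28,29]	Prediction Variable	150 m × 150 m
Distance from Airport	© Open Street Map (OSM) [15]	Infrastructure	-
Distance from Health Building	© Open Street Map (OSM) [15]	Infrastructure	-
Distance from Educational Building	© Open Street Map (OSM) [15]	Infrastructure	-
Street Length	© Open Street Map (OSM) [15]	Infrastructure	-
Road Overlay	© Open Street Map (OSM) [15]	Infrastructure	-
Road Class 1 (%)	© Open Street Map (OSM) [15]	Infrastructure	-
Road Class 2 (%)	© Open Street Map (OSM) [15]	Infrastructure	-
Road Class 3 (%)	© Open Street Map (OSM) [15]	Infrastructure	-
Road Class 4 (%)	© Open Street Map (OSM) [15]	Infrastructure	-
Max. Speed Limit	© Open Street Map (OSM) [15]	Infrastructure	-
Sinuosity	© Open Street Map (OSM) [15]	Infrastructure	-
Pop./km²	Federal Office of Topography swisstopo [30]	Social/Economic	-
Land Class 1 (%)	Corine Land Cover [31]		100 m × 100 m
Land Class 2 (%)	Corine Land Cover [31]		100 m × 100 m
Land Class 3 (%)	Corine Land Cover [31]		100 m × 100 m
Land Class 4 (%)	Corine Land Cover [31]		100 m × 100 m
Land Class 5 (%)	Corine Land Cover [31]		100 m × 100 m
Slope	Federal Office of Topography swisstopo [30,32,33,34]	Environmental	10 m × 10 m
Rainfall	WorldClim2 [35]		5 min
Temperature	WorldClim2 [35]		5 min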

Table 2. Road Class variables according to original classes present in the Swiss road network, based on OSM.

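OSM Class	Road Class
Motor Way	Road Class 1 (%)
Motor Way Link	Road Class 1 (%)
Trunk	Road Class 1 (%)
Trunk Link	Road Class 1 (%)
Primary	Road Class 2 (%)
Primary Link	Road Class 2 (%)
Secondary	Road Class 2 (%)
Secondary Link	Road Class 2 (%)
Tertiary	Road Class 2 (%)
Tertiary Link	Road Class 2 (%)
Residential	Road Class 3 (%)
Unclassified	Road Class 3 (%)
Bus Guideway	Road Class 4 (%)
Bus way	Road Class 4 (%)
Path	Road Class 4 (%)
Road	Road Class 4 (%)
Track	Road Class 4 (%)
Service	Road Class 4 (%)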
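
The grouping in Table 2 can be written as a lookup table, as in the minimal Python sketch below. The lowercase tag spellings (for example `motorway_link`, `busway`, `track`) are our assumption of the OSM `highway` values behind the table's labels. The function `road_class_shares` is hypothetical: it assumes that each "Road Class (%)" variable is the share of total road length per class within a grid cell, which may differ from the procedure actually used.

```python
# Assumed OSM `highway` tag values mapped to the four road classes of Table 2.
OSM_TO_ROAD_CLASS = {
    "motorway": 1, "motorway_link": 1, "trunk": 1, "trunk_link": 1,
    "primary": 2, "primary_link": 2, "secondary": 2, "secondary_link": 2,
    "tertiary": 2, "tertiary_link": 2,
    "residential": 3, "unclassified": 3,
    "bus_guideway": 4, "busway": 4, "path": 4, "road": 4, "track": 4, "service": 4,
}


def road_class_shares(segments):
    """Return the percentage of road length per class for one grid cell.

    `segments` is an iterable of (osm_highway_tag, length) pairs.
    Length-weighted shares are an assumption made for this sketch.
    """
    totals = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
    for tag, length in segments:
        road_class = OSM_TO_ROAD_CLASS.get(tag)
        if road_class is None:
            continue  # tags not listed in Table 2 are ignored here
        totals[road_class] += length
    total_length = sum(totals.values())
    if total_length == 0:
        return {road_class: 0.0 for road_class in totals}
    return {road_class: 100.0 * value / total_length for road_class, value in totals.items()}


# Hypothetical cell containing three road segments (lengths in metres).
example_cell = [("motorway", 50.0), ("residential", 100.0), ("service", 50.0)]
shares = road_class_shares(example_cell)
```

For the hypothetical cell above, a quarter of the road length falls in Road Class 1, half in Road Class 3, and a quarter in Road Class 4.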

Table 3. Confusion Matrix Example.

	Positive (P)	Negative (N)
True (T)	True Positive (TP)	False Positive (FP)
False (F)	False Negative (FN)	True Negative (TN)
	P = TP + FN	N = FP + TN

Table 4. Result.

Model	Accuracy	Sensitivity	Specificity	Pos. Pred. Value	Neg. Pred. Value	AUC	Training Time
Ranger	0.9878	0.9970	0.8819	0.9898	0.9629	0.9926	10.97 min
LRM	0.933	0.9840	0.3460	0.9554	0.6530	0.9229	7.14 s
Keras	0.9377	0.9839	0.4071	0.9502	0.6867	0.9368	15 min

Table 5. Confusion Matrix Model.

Ranger	(P)	(N)
(T)	194,350	2002
(F)	576	14,948
LRM	(P)	(N)
(T)	191,960	11,042
(F)	3029	5887
Keras	(P)	(N)
(T)	191,778	10,049
(F)	3148	6901
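
The metrics in Table 4 can be recomputed from the counts in Table 5, as in the following minimal Python sketch. It reads each matrix with the layout of Table 3, so that row (T) holds TP and FP and row (F) holds FN and TN. The function and variable names are ours and are not part of the original analysis, which was carried out in R.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics derived from the four cells of a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts copied from Table 5, read with the layout of Table 3.
TABLE_5 = {
    "Ranger": dict(tp=194_350, fp=2_002, fn=576, tn=14_948),
    "LRM": dict(tp=191_960, fp=11_042, fn=3_029, tn=5_887),
    "Keras": dict(tp=191_778, fp=10_049, fn=3_148, tn=6_901),
}

results = {model: confusion_metrics(**counts) for model, counts in TABLE_5.items()}

for model, metrics in results.items():
    print(model, {name: round(value, 4) for name, value in metrics.items()})
```

With these counts, the Ranger and Keras values agree with Table 4 to four decimal places, while the LRM values agree to within about 0.01.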

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
