Article

Predicting Number of Vehicles Involved in Rural Crashes Using Learning Vector Quantization Algorithm

by Sina Shaffiee Haghshenas *, Giuseppe Guido, Sami Shaffiee Haghshenas and Vittorio Astarita
Department of Civil Engineering, University of Calabria, Via Bucci, 87036 Rende, Italy
* Author to whom correspondence should be addressed.
AI 2024, 5(3), 1095-1110; https://doi.org/10.3390/ai5030054
Submission received: 5 June 2024 / Revised: 3 July 2024 / Accepted: 4 July 2024 / Published: 8 July 2024
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Roads represent very important infrastructure and play a significant role in economic, cultural, and social growth. Therefore, there is a critical need to model crash injury severity in order to study how safe roads are. When measuring the cost of crashes, the severity of the crash is a critical criterion, and it is classified into various categories. The number of vehicles involved in the crash (NVIC) is a crucial factor in all of these categories. For this purpose, this research examines road safety and provides a prediction model for the number of vehicles involved in a crash. Specifically, learning vector quantization (LVQ 2.1), one of the sub-branches of artificial neural networks (ANNs), is used to build a classification model. The novelty of this study lies in demonstrating LVQ 2.1’s efficacy in categorizing accident data and its ability to improve road safety strategies. The LVQ 2.1 algorithm is particularly suitable for classification tasks and works by adjusting prototype vectors to improve the classification performance. The research emphasizes how urgently better prediction algorithms are needed to handle issues related to road safety. In this study, a dataset of 564 crash records from rural roads in Calabria, a region in southern Italy, between 2017 and 2018 was utilized. The study analyzed several key parameters, including daylight, the crash type, day of the week, location, speed limit, average speed, and annual average daily traffic, as input variables to predict the number of vehicles involved in rural crashes. The findings revealed that the “crash type” parameter had the most significant impact, whereas “location” had the least significant impact on the occurrence of rural crashes in the investigated areas.

1. Introduction

The importance of roadways as critical infrastructure for sustainable development cannot be overstated. Reducing road crashes and improving road safety are therefore primary objectives for transport engineers and researchers, in line with sustainable mobility. Consequently, there is an urgent need for models that evaluate crash injury severity and thereby help researchers examine road safety [1,2]. As the global population grows, both advanced and emerging societies face increasing vehicle volumes, which intensify travel and traffic on roads and raise the likelihood of crashes [3,4].
In addition to the more traditional approaches to interpreting accident data (e.g., descriptive or inferential statistics), since the end of the last century, many researchers have resorted to the simulation of transport networks to evaluate road safety and estimate its effects on people and the environment [5,6].
The multifarious nature of road safety has been the focus of invaluable research efforts aimed at bolstering our understanding of the subject. In certain instances, crashes may be impacted by a combination of various risk factors [7,8]. Contributing factors that have been identified encompass, but are not limited to, daylight [9], weather conditions [10], the age of the driver and vehicle [11,12], the speed limit and average speed [13,14], and the annual average daily traffic (AADT) [15].
In general, traffic collisions involve only one or two vehicles. Depending on the number of vehicles involved, crashes are categorized as either single-vehicle crashes (SVCs) or multiple-vehicle crashes (MVCs) [16]. Assessing crash-related costs requires considering crash severity, a vital parameter that is classified into several levels. The number of vehicles involved in the crash is a critical variable across all of these levels.
Wang [17] incorporated both environmental and safety considerations to capture the multifaceted aspects of sustainable transport. To do so, he developed a unified performance measure using data envelopment analysis (DEA), a nonparametric approach to benchmarking entities with multiple inputs and outputs. This measure was then applied to jointly assess the environmental impacts and safety concerns of road transport for a set of OECD (Organization for Economic Co-operation and Development) countries between 2000 and 2014. Finally, he demonstrated that the unified measures derived from this joint assessment can differ significantly from those obtained by evaluating the environmental impacts and safety separately. McLeod and Carey [18] conducted a literature review on traffic safety using the established hazard control hierarchy. Their research identified and categorized potential approaches to integrating Vision Zero with broader sustainable accessibility policy objectives. The authors synthesized the literature within the context of the hazard control hierarchy, offering a framework for better coordination of the professional practices that affect urban safety and sustainability. They also supplied recommendations for integrating Vision Zero and sustainable accessibility policies, with the hazard control hierarchy serving as an organizing principle. Ziakopoulos and George [19] conducted a comprehensive review of the literature on the spatial methodologies that researchers use to analyze the spatial dimension in their investigations. The authors also evaluated studies that concentrated on the spatial analysis of vulnerable road users and discussed the practical implementation, benefits, and drawbacks of the techniques used in spatial modeling. Drawing upon their critical review, they identified current obstacles and future avenues for research in this field.
Afghari et al. [20] utilized a joint model of crash count and crash severity to identify road segments that pose a high risk of fatal and serious injury crashes. The study employed data from state-controlled roads in Queensland, Australia, and a novel risk score was developed by predicting crash counts by severity and weighting them using the cost ratio of severity levels. The weighted risk score was then employed to pinpoint road segments with a heightened risk of fatal and injury crashes. Their results revealed that the joint model of crash count and crash severity substantially enhanced the prediction accuracy when compared with traditional count models. In another study, Tamakloe and Park [21] utilized fatal crash data from Korea to identify hotspots with increasing (critical) and decreasing (diminishing) temporal trends using a spatio-temporal hotspot analysis tool in a geographic information system (GIS). Additionally, they employed a machine learning technique to investigate the factors that influence the number of vehicles and casualties involved in fatal crashes at intersections and midblocks in each hotspot type identified. Based on their findings, they identified groups of factors that could be collectively addressed to enhance road safety and recommended countermeasures to mitigate fatal crashes on the roads. Hossain et al. [22] employed a partial proportional odds model to predict the injury severity of the most severely injured driver in a multi-vehicle crash, using demographic information on all drivers involved. The authors then compared models that incorporated the demographic information and vehicle characteristics of all drivers and vehicles involved in a crash with models that considered only information about the most severely injured driver, evaluating the significance of factors and the prediction accuracy. The results of their study suggested that although young drivers were generally found to have lower levels of injury severity compared to working-age drivers, the severity of injuries increased when the proportion of young drivers involved in a multi-vehicle crash was higher.
Based on a review of the existing literature, it has been established that roads are an essential element of infrastructure and play a critical role in the advancement of society, the economy, and culture. Generally, professionals in the field of transportation engineering and research on road safety issues prioritize two major goals: reducing the occurrence of road crashes and improving overall road safety. These objectives are closely related, making it necessary to develop a model that can accurately predict the severity of injuries resulting from crashes. Such a predictive model is crucial for researchers to assess road safety effectively. Understanding the number of vehicles involved in a crash (NVIC) is one of the most important factors that can play a role in planning and reducing the severity of road crashes. Hence, the main objective of this research is to examine road safety and create a predictive model that can estimate the NVIC. This model is developed using a technique known as learning vector quantization (LVQ 2.1), which is a subset of artificial neural networks (ANNs) used for classification. The study analyzes the records of 564 crashes that took place on rural roads in southern Italy to construct the models. It is worth mentioning that, based on a study of the literature reviews, the proposed LVQ 2.1 model has some performance benefits, while other predictive models have some major limitations. Particularly useful for predicting the variables affecting the NVIC, the LVQ 2.1 algorithm is renowned for its capacity to efficiently manage nonlinear connections in complicated datasets. Furthermore, the strong learning ability of the algorithm enables it to create effective prediction models even in the face of noisy or missing data, a typical difficulty in accident reporting. 
On the other hand, traditional statistical models such as linear or Poisson regression often struggle to represent the nonlinear and diverse character of NVIC data faithfully and therefore produce a less-than-ideal prediction performance. Many current machine learning techniques, such as decision trees, may also be less effective in handling the complex, high-dimensional interactions seen in NVIC data, which can lead to overfitting or poor generalization. The important contributions of this study are summarized as follows:
- Development of a Classification Model Using LVQ 2.1: The main output of this work is a predictive classification model developed using Learning Vector Quantization 2.1 (LVQ 2.1), a particular kind of ANN. By estimating the number of vehicles involved in rural crashes, this work presents an application of LVQ 2.1 in traffic accident research. The selection of LVQ 2.1 emphasizes its possible value over other modeling approaches in capturing the nonlinear interactions between the input variables and multi-vehicle crash events, which may affect future model choices in related fields.
- Utilization of a Comprehensive Dataset from Rural Roads in Calabria: The data consist of 564 crash reports from rural roads between 2017 and 2018. They cover important factors such as the day of the week, location, speed limit, average speed, annual average daily traffic, lighting conditions, and type of crash. This study’s focus on rural crashes addresses a gap in crash modeling research, which typically focuses on urban environments. The study of this dataset might provide important information on the dynamics of crashes in rural locations, which are typically influenced by features not seen in urban environments, such as different speed limits and traffic compositions.
- Analysis of Factors Influencing Rural Crashes: This work contributes to the body of knowledge by quantifying the impact of several factors on rural crashes, thereby informing policy and preventive action. The differing influence of these factors offers detailed information on the causes of rural crashes and supports targeted actions and resource allocation to achieve the maximum effect in reducing accidents.
- Predictive Performance of the Model: A major contribution comes from the creation and testing of the model itself. Establishing a baseline for a comparative analysis with other prediction models helps to guide further studies.
The rest of this work is delineated as follows: Section 2 outlines the LVQ methodology employed in this research. Section 3 offers a concise summary of the case study’s features. In Section 4, the developed models are constructed, and the factors that contribute to the number of vehicles involved in a crash are analyzed. Lastly, Section 5 provides concluding remarks and suggestions for future research.

2. Learning Vector Quantization (LVQ)

The contemporary scientific arena has observed noteworthy advancements in various branches of artificial intelligence (AI), which have led to the development of innovative technologies. Hence, the implementation of artificial intelligence techniques to tackle complex issues across various scientific fields is an unavoidable trajectory [23,24,25,26,27,28]. The Learning Vector Quantization (LVQ) network is a kind of neural network that uses a supervised learning methodology. Pattern recognition and classification issues are where it finds its most frequent use [29,30,31]. LVQ is highly comparable to self-organizing maps (SOM), and it also has many parallels to the k-Nearest Neighbor (kNN) technique of classification. To acquire prototypes (also called codebook vectors) to represent unique class areas, learning vector quantization (LVQ) is a kind of method used in statistical pattern classification. The hyperplanes that separate the prototypes define the boundaries of these class areas, creating Voronoi partitions. The LVQ network stands out from other types of ANNs in its own unique way [32]. To assist network training and data categorization, the LVQ network uses the “winner-takes-all” method, which is based on either the “Hebbian Learning” or “Associate Learning” principles. Kohonen is the originator of LVQ, which has been subject to numerous adaptations and refinements over time, resulting in the emergence of several LVQ variants [33,34,35]. Figure 1 illustrates an LVQ network exemplar.
Figure 1 shows the weight vector W, which represents the connections between each neuron in the input layer and the neurons in the output layer; X represents the input and Y the output. To classify an input into a category, the LVQ computes the Euclidean distance between the input vector and the weight vectors, and the input is assigned to the class of the weight vector at the smallest distance [36].
One of the most widely recognized early variants introduced by Kohonen is LVQ 2.1, which is described in detail in Kohonen, 1990, and Kohonen, 1997. LVQ 2.1 was therefore used in this study to classify the dataset. It differs from its preceding versions in that it updates two prototypes (neurons) concurrently, which considerably improves the efficiency of the algorithm and results in faster performance. The first step of the LVQ 2.1 method is to choose, for every data point (x, y) in the training set $S = \{(x_i, y_i)\}_{i=1}^{N}$, the two closest prototypes based on the Euclidean distance, namely $\theta_l$ and $\theta_m$. If the prototypes’ labels $c_l$ and $c_m$ are distinct, and one of them corresponds to the label y of the data point, then the two closest prototypes are modified based on Equations (1) and (2) [37,38,39], where $\alpha(t)$ is the learning rate at step t.
$$\theta_l(t+1) = \theta_l(t) + \alpha(t)\,\bigl(x - \theta_l(t)\bigr), \quad c_l = y \tag{1}$$
$$\theta_m(t+1) = \theta_m(t) - \alpha(t)\,\bigl(x - \theta_m(t)\bigr), \quad c_m \neq y \tag{2}$$
In the event that the labels cl and cm are identical or both labels differ from the label y of the data point, no parameter update is executed. The modeling method and the performance indicators used in the modeling are explained in the next sections.
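The update rule in Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `lvq21_step`, the array layout, and the fixed learning rate `alpha` are assumptions, and the window condition found in some formulations of LVQ 2.1 is omitted because the description above does not mention it.

```python
import numpy as np

def lvq21_step(prototypes, proto_labels, x, y, alpha):
    """Apply one LVQ 2.1 update for a single sample (x, y), in place.

    prototypes   : float array of shape (n_prototypes, n_features)
    proto_labels : array of shape (n_prototypes,) with class labels
    Returns True if the two nearest prototypes were updated.
    """
    # Euclidean distance from x to every prototype
    dists = np.linalg.norm(prototypes - x, axis=1)
    l, m = np.argsort(dists)[:2]  # indices of the two closest prototypes
    cl, cm = proto_labels[l], proto_labels[m]

    # No update if the labels are identical or neither matches y
    if cl == cm or (cl != y and cm != y):
        return False

    # Let l index the prototype with the correct label, m the other one
    if cm == y:
        l, m = m, l

    prototypes[l] += alpha * (x - prototypes[l])  # Eq. (1): attract
    prototypes[m] -= alpha * (x - prototypes[m])  # Eq. (2): repel
    return True

# Hypothetical two-prototype example (values chosen for illustration only)
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
proto_labels = np.array([1, 2])
lvq21_step(prototypes, proto_labels, np.array([0.2, 0.2]), y=1, alpha=0.1)
```

In this example the prototype labeled 1 moves toward the sample and the prototype labeled 2 moves away from it; a sample whose two nearest prototypes share a label leaves both unchanged.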

3. Case Study

In order to test the proposed methodology, the records of 564 accidents that occurred between 2017 and 2018 on rural roads in Cosenza province (Calabria, Italy) were used (Figure 2). The road accident sample was acquired from the ACI-ISTAT database (Automobile Club Italia—National Institute of Statistics), which collects and analyzes data on road accidents in Italy [40].
The information contained in the dataset provides details on the date and place of the accident, the type of road, the pavement conditions, the weather conditions, the type of accident, the type of vehicle involved, the causes of the accident and the consequences for the people involved (injuries or deaths). However, this dataset does not contain information on Property Damage Only events, because ISTAT, in Italy, identifies and classifies accidents if they generate at least one injury.
The above information was integrated with other data characterizing the context of the study to enable a more detailed analysis and to implement the proposed method. In particular, the speed limits, the average speed and the average annual daily traffic (AADT) were acquired to characterize the road elements in which accidents occurred. The speed limits were acquired from the dataset of the national autonomous road company (ANAS). The average speed was obtained by gathering the available data of the historical traffic statistics of TomTom (TomTom Move) and Octo Telematics (Octo IoT Cloud), referring to the road sections with the observed accidents. The average annual daily traffic was obtained from the PANAMA system, a traffic monitoring platform provided by ANAS [40].
As better illustrated in Section Classification Modelling, the above-mentioned data were classified into seven independent variables (i.e., the factors affecting the number of vehicles involved in the crash), including four qualitative variables, namely daylight (DL), the type of crash (TC), day of the week (W), and location (LO), and three quantitative variables, namely the speed limit (SL), average speed (AS), and annual average daily traffic (AADT).

4. Modelling

The main objective of the current research is to explore the variables that influence the level of road safety in rural regions through the implementation of binary classification modeling techniques. To accomplish this aim, the study utilized a developed classification model, and the NVIC was assessed using the LVQ 2.1 approach, as previously stated.
In binary classification modeling, the confusion matrix provides the most useful accuracy and error measurements for evaluating performance [40]. As shown graphically in Figure 3 and mathematically in Equations (3) and (4), the confusion matrix is used to facilitate model comparison. Data normalization is essential in data-driven system modeling methodologies because the investigated parameters have different ranges and measurement scales. Data that have not been normalized may produce inaccuracies in the calculation because variables with larger scales dominate. Therefore, in this study, all data were normalized using the min-max technique before being included in a model to eliminate such scale effects [40].
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$
$$\mathrm{Error} = \frac{FP + FN}{TP + FP + TN + FN} = 1 - \mathrm{Accuracy} \tag{4}$$
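As a rough illustration of the two preprocessing and evaluation steps described above, the sketch below applies min-max scaling column by column and evaluates Equations (3) and (4) from confusion-matrix counts. The function names and all numeric values are hypothetical and are not taken from the study's data.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range

def accuracy_and_error(tp, fp, tn, fn):
    """Return (accuracy, error) as defined in Equations (3) and (4)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return accuracy, 1.0 - accuracy

# Hypothetical speed limit (km/h) and AADT columns on very different scales
X = [[50, 1000], [90, 3000], [70, 2000]]
X_scaled = min_max_normalize(X)

# Hypothetical confusion-matrix counts
acc, err = accuracy_and_error(tp=40, fp=10, tn=45, fn=5)
```

After scaling, both columns span the same [0, 1] range, so neither dominates a Euclidean distance computation. When data are split into training and testing sets, a common practice is to take the minimum and maximum from the training portion only; whether the study did so is not stated.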

Classification Modelling

To commence the modeling process, the initial step involved preparing the dataset. Following a thorough examination of the available data, the seven known parameters were categorized into four distinct data groups. The values and characteristics associated with each collision, which influenced the NVIC, were identified as inputs for modeling (independent variables). These encompassed four qualitative variables, namely daylight (DL), the type of crash (TC), day of the week (W), and location (LO), as well as three quantitative variables, including the speed limit (SL), average speed (AS), and annual average daily traffic (AADT). The aforementioned variables were classified and are presented in Table 1. For the NVIC, crashes in which only one vehicle was involved were assigned to the first class (labeled “1”), and crashes involving multiple vehicles were assigned to the second class (labeled “2”). This categorization was based on the underlying assumption that the minimum NVIC is the most critical factor in determining differences between the classes.
The next phase, after the collection and preparation of the dataset, was to set the algorithm’s governing parameters. The effectiveness of the algorithm and the rate of convergence may be greatly improved by adjusting these parameters. In most cases, there are no standard methods for establishing such limits. Instead, experts rely on their knowledge, experience, and data type to estimate a parameter range [41]. Models with varying degrees of accuracy and error rates are created using these factors. The strategic fusion of data-driven techniques and expert judgment may result in more trustworthy and effective models.
The modeling process involved creating a mapping between the input and output data, which was then utilized to design and construct an optimal classification model that could accurately identify the appropriate classes. The primary objective of the model was to achieve the highest possible accuracy. The governing parameters and their candidate values were the number of epochs (5, 10, 20, 30, or 50) and the number of neurons in the hidden layer (NNHL; 10, 20, 30, or 40). Furthermore, of the 564 records, 70% (395) were allocated for model training, 10% (56) were used for validation, and the remaining 20% (113) were used for testing the model. These proportions were chosen based on prior research on neural network prediction [42]. Table 2 displays the outcomes of a total of 20 models that were constructed and evaluated. After the models were constructed and their training and testing accuracies were determined, a straightforward technique recommended by Zorlu et al. [43] was employed to rank all of the models. The resulting rankings are presented in Table 3.
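The arithmetic behind the 20 configurations and the 395/56/113 split can be checked with a short sketch. The epoch-major enumeration order is an assumption; it is not stated in the text, although it agrees with the model numbers quoted in the discussion (for example, Model 15 with 30 epochs and 30 NNHL). The helper name `split_sizes` is hypothetical.

```python
from itertools import product

EPOCHS = [5, 10, 20, 30, 50]
NNHL = [10, 20, 30, 40]

# Assumed numbering: epochs vary slowest, NNHL fastest, starting at Model 1
configs = {i: cfg for i, cfg in enumerate(product(EPOCHS, NNHL), start=1)}

def split_sizes(n, train_frac=0.70, val_frac=0.10):
    """Return (train, validation, test) counts; the test set takes the remainder."""
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return n_train, n_val, n - n_train - n_val

n_train, n_val, n_test = split_sizes(564)
```

Five epoch settings times four NNHL settings give the 20 models, and rounding 70% and 10% of 564 and assigning the remainder to testing reproduces the stated sizes.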
Table 2 shows that the configuration of an LVQ 2.1 model greatly affects its performance, especially with regard to the number of epochs and the NNHL. The training accuracies fall between 64.3% and 82.5%, and the testing accuracies fall between 61.9% and 82.3%. These variations show how strongly the efficacy of a model depends on its setup. Model 15 (30 epochs, 30 NNHL) ranks highest, with a training accuracy of 82.5% and a testing accuracy of 82.3%. Its high accuracy on both the training and testing data points to a well-tuned model that strikes a balance between generalization and complexity. Model 11 (20 epochs, 30 NNHL), with accuracy values of 82.3% for training and 81% for testing, also performs very well and shows strong generalization capabilities.
Models 3, 4, 11, 15, and 16 show quite high and consistent accuracy for both training and testing, indicating that these configurations are less prone to overfitting and generalize well. Models with a high training accuracy but greatly reduced testing accuracy—such as Model 6 (71.9% training against 61.9% testing)—may be overfitting the training data. Achieving 80.8% training and 75% testing accuracy, Model 10 (20 epochs, 20 NNHL) is among the better-performing models. It slightly underperforms compared to the top models but remains effective. Models with higher NNHL values tend to achieve better accuracy, but they also require careful tuning to avoid overfitting, as seen in models with significant accuracy drops between training and testing.
Table 3 shows the rankings of the twenty models based on their training and testing accuracies, which help determine which models perform best overall. A higher ranking value denotes better performance. The top-ranked models are Model 15 and Model 11, which have high accuracy and excellent generalization. Model 15 (30 epochs, 30 NNHL) performs well, with high accuracy in both the training and testing stages. Its design enables it to learn and generalize from the data efficiently. This model was able to correctly classify 81.4% of all data. A strong contender for dependable predictions, Model 11 (20 epochs, 30 NNHL) consistently rates well in both training and testing. Model 1 (5 epochs, 10 NNHL) exhibits poor performance in both training and testing. Similarly low-ranked, Model 2 (5 epochs, 20 NNHL) performs badly in each phase, suggesting that either more training or a different setup is required.
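A rank-sum scheme of this kind can be sketched as follows. This is a generic illustration, not necessarily the exact scoring of Zorlu et al. [43] or the values in Table 3: each model receives a rank value for training accuracy and one for testing accuracy (larger is better), and the two are added. Only the four models whose accuracies are quoted in the text are included, so the rank values are relative to this subset.

```python
# (training %, testing %) for the models whose accuracies are quoted in the text
accuracies = {
    15: (82.5, 82.3),
    11: (82.3, 81.0),
    10: (80.8, 75.0),
    6: (71.9, 61.9),
}

def rank_values(scores):
    """Map model id -> rank value; the lowest score gets 1, ties share a value."""
    return {
        model: 1 + sum(other < score for other in scores.values())
        for model, score in scores.items()
    }

def total_rank(acc):
    """Sum the training and testing rank values of every model."""
    train = rank_values({m: a[0] for m, a in acc.items()})
    test = rank_values({m: a[1] for m, a in acc.items()})
    return {m: train[m] + test[m] for m in acc}

totals = total_rank(accuracies)
best_model = max(totals, key=totals.get)
```

Within this subset, Model 15 obtains the largest total, in line with its position at the top of the ranking described above.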
This study emphasizes the need to balance model complexity against performance in both the training and testing phases to ensure optimal model selection and implementation.
Additionally, the confusion matrices for the training, validation, testing, and total datasets can be found in Figure 4a–d.
In classification problems, the receiver operating characteristic (ROC) curve is an essential tool for analyzing the outcomes because of its probability-based nature. The performance of the developed binary classification model is also assessed through the area under the curve (AUC), which ranges from 0 to 1. An AUC of 0.5 or less indicates inadequate performance, whereas the values observed for the training, testing, and total ROC curves are all greater than 0.5, indicating acceptable model performance. The ROC curve was therefore used to assess the outcomes of the 15th model, and the results for training, testing, and all data are presented in Figure 5a–d. A threshold of 0.5 was used, which is a commonly accepted value in this scenario. Consistent with the 15th model outperforming the other developed models, its AUC is notably greater than the AUC values of the other models.
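For readers unfamiliar with the AUC, the sketch below computes it directly from its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores are hypothetical, and how a continuous score is derived from the LVQ 2.1 output is not described in the text, so that step is left out.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of class scores.

    labels : iterable of 0/1 class labels (1 = positive class)
    scores : iterable of real-valued scores, higher meaning "more positive"
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    # A positive outranking a negative counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores for illustration only
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A model that scores every sample identically obtains 0.5 under this definition, which is why values at or below 0.5 are read as inadequate.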

5. Validation and Discussion

The effects of the input factors on the NVIC were analyzed using a sensitivity study. The best LVQ model was used to predict the output, and the degree of correlation between the input data and the predicted result was assessed using the cosine amplitude approach (Equation (5)). Here, n is the total number of data points, $r_{ij}$ is the strength of the relation between input parameter i and output j, $x_{ik}$ are the values of the input parameters, and $y_{jk}$ are the predicted values.
$$r_{ij} = \frac{\sum_{k=1}^{n} \left( x_{ik} \times y_{jk} \right)}{\sqrt{\sum_{k=1}^{n} x_{ik}^{2} \, \sum_{k=1}^{n} y_{jk}^{2}}} \tag{5}$$
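Equation (5) is the cosine of the angle between an input series and the predicted output series, which a short sketch makes concrete. The function name and the sample vectors are hypothetical.

```python
import numpy as np

def cosine_amplitude(x, y):
    """Strength of relation r between an input series x and an output series y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

# Hypothetical series: proportional vectors give r = 1, orthogonal ones give r = 0
r_proportional = cosine_amplitude([1, 2, 3], [2, 4, 6])
r_orthogonal = cosine_amplitude([1, 0], [0, 1])
```

For non-negative data such as min-max normalized inputs, r lies between 0 and 1, and values closer to 1 indicate a stronger relation between the input parameter and the predicted NVIC.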
Based on Equation (5) and the results obtained from the best LVQ 2.1 model (the 15th model), a sensitivity analysis was performed, and its results were compared with those of a previous study in order to validate the LVQ 2.1 model. The prior investigation used two machine learning techniques, namely GMDH and GOA-SVM, which are briefly described here. The performance of a GMDH model depends strongly on its design, so the exact determination of its control parameters is a fundamental problem. GOA-SVM combines GOA and SVM into a prediction model in which several SVM parameters are optimized using the GOA technique to ensure the best performance of the SVM model. After the modeling process, the best GMDH model had an MNL, MNNL, and SP equal to 20, 50, and 0.5, respectively. The best GOA-SVM model had a grasshopper population of 40, a k-fold value of 3, and a Gamma ( γ ) of the RBF kernel equal to 6.17. For more information, see the study of Guido et al. [41]. The results of this comparison are shown in Figure 6, which shows that all models lead to the same conclusions. Although the degrees of correlation differed between models, the resulting order of influence was the same. TC (type of crash) and AS (average speed) had the greatest and second-greatest impact, respectively, on the number of vehicles involved in a crash, and LO (location) showed the least impact on the NVIC in all three models. The agreement of multiple independent models points to a strong fundamental link between these variables and the NVIC and supports the conclusion concerning LO’s small impact on NVIC prediction.
Although the models agree on the factor rankings, their degrees of correlation differ. The LVQ 2.1 model, for instance, gives a correlation coefficient of 0.93 for TC, whereas GMDH gives 0.85 and GOA-SVM gives 0.87. Though small, these variations reflect slight differences in the sensitivity that each model captures. The y-axis values in Figure 6, which show the degree of correlation, identify the most and least important factors in predicting the number of vehicles involved in crashes. This knowledge is required to validate the model and to guide practical efforts to improve road safety.
In a further comparison, the performance of the LVQ 2.1 model was compared with that of the previous research models in terms of accuracy on the training and testing data [41]. The results are shown in Figure 7. The performance of the LVQ 2.1 model is acceptable, and its accuracy differs little from that of the GMDH and GOA-SVM models. One of the most important strengths of this study is that, although the accuracy of the LVQ 2.1 model was not greatly different from that of the other models in the literature, the modeling process was easier, and the LVQ 2.1 model has fewer parameters to adjust than the other models, which enables users to develop the model more easily.
The LVQ 2.1 model’s high training accuracy indicates that it efficiently captures trends in the training data. For example, if the LVQ 2.1 model has a training accuracy of 82.5%, it means that for 82.5% of the training data, the model correctly forecasts the NVIC.
Additionally, the GMDH and GOA-SVM models show acceptable training accuracies of 83.2% and 84.6%, respectively. These comparable accuracies suggest that all three models can efficiently learn from the data. The testing accuracy is a major indicator of a model’s generalizability to new, unseen data. The LVQ 2.1 model has a testing accuracy of 82.3%, which is close to its training accuracy and indicates a high generalization capacity. The GMDH and GOA-SVM models show similar testing accuracies of 81.6% and 83.4%, respectively, indicating that these models, too, generalize well to new data.
The small differences between the training and testing accuracies of every model show that they do not overfit the training data. Overfitting is a common problem in which a model performs well on training data but poorly on testing data. The consistency of the accuracy levels indicates that all three models largely avoid this issue.
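As an illustration of this comparison, the following minimal sketch computes the gap between training and testing accuracy for each model, using only the accuracies quoted in the text. The function name `generalization_gap` and the dictionary layout are illustrative assumptions and are not part of the original study.

```python
# Accuracies (%) as reported in the text for the three compared models.
accuracies = {
    "LVQ 2.1": {"train": 82.5, "test": 82.3},
    "GMDH": {"train": 83.2, "test": 81.6},
    "GOA-SVM": {"train": 84.6, "test": 83.4},
}


def generalization_gap(train_acc, test_acc):
    """Return training minus testing accuracy, in percentage points."""
    return round(train_acc - test_acc, 1)


# Hypothetical helper structure: one gap value per model.
gaps = {
    name: generalization_gap(acc["train"], acc["test"])
    for name, acc in accuracies.items()
}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap} percentage points")
```

Under these figures, every gap is below two percentage points, which is consistent with the statement that none of the three models overfits noticeably.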
As mentioned before, the evaluation of crash severity is a crucial part of the road safety process in transportation engineering. An increase in the number of vehicles involved in a crash tends to increase crash severity, which is an undesirable effect. Therefore, an accurate prediction of the NVIC can be useful in minimizing the level of crash severity in road transportation. Based on the results, it can be inferred that TC exerts a significant influence on the NVIC. Various factors, such as inadequate traffic signage and suboptimal road conditions, may contribute to specific types of crashes. Head-on collisions are often caused by driver inattention to road signs or by insufficient lighting that results in poor visibility. Likewise, following too closely, driving while distracted, or sudden deceleration caused by unfavorable road conditions may lead to rear-end collisions. The severity of the crash is also related to the number of vehicles involved, with more severe accidents involving a greater number of vehicles. For instance, accidents involving trucks or buses can have a severe impact due to their size and weight, causing damage to multiple vehicles [44,45,46]. It is also worth mentioning that TC belongs to the crash characteristic category.
After TC, AS and AADT were, in that order, the next most influential parameters affecting the NVIC. Both of these factors belong to the traffic flow characteristics category. In summary, it can be inferred that road accidents in the rural area of Cosenza are attributable to a combination of factors, including human behavior, vehicle attributes, and road infrastructure. Reducing road accidents on rural routes in Cosenza therefore requires a multifaceted approach that encompasses improvements to road infrastructure, greater public awareness of safe driving practices, and the rigorous enforcement of traffic regulations. Implementing these measures can enhance road safety and reduce the number of crashes in the rural region of Cosenza.
In the framework of this particular research, the fact that the impact of LO (location) is lower than that of the other input parameters shows that geographical location has less influence on the number of vehicles involved in crashes. This might be the result of the particular circumstances on Calabrian rural roads. Understanding the role of LO might help refine the models and enable them to concentrate on the most important factors for increasing road safety in future plans for the development of the southern Italian road network.
It is imperative to acknowledge that the LVQ 2.1 algorithm, while possessing the potential for utilization in classification analysis and providing a dependable method for forecasting NVIC, is not without certain constraints. One of the most significant among these is the inability of the algorithm to process incomplete datasets. Furthermore, it is essential to recognize that the specific model developed through the application of LVQ 2.1 in this study is not directly transferable to alternative case studies due to the distinct nature of the structures involved. Therefore, it is suggested that this classification framework is used in future research in other regions, and that the input parameters are changed based on the data available from other regions, with their results compared with the results of this research.
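Because the algorithm cannot process incomplete records, one simple option is to keep only complete cases before training. The sketch below shows that filtering step under stated assumptions: the field names and the sample records are hypothetical, and the study does not state how missing values were actually handled.

```python
# Hypothetical field names for the seven input parameters.
REQUIRED_FIELDS = (
    "crash_type", "daylight", "weekday", "location",
    "speed_limit", "avg_speed", "aadt",
)


def complete_cases(records, fields=REQUIRED_FIELDS):
    """Keep only records in which every required field is present and not None."""
    return [
        record for record in records
        if all(record.get(field) is not None for field in fields)
    ]


# Made-up sample records, for illustration only.
sample = [
    {"crash_type": 1, "daylight": 0, "weekday": 1, "location": 0,
     "speed_limit": 3, "avg_speed": 88.0, "aadt": 2},
    {"crash_type": 3, "daylight": 1, "weekday": 0, "location": 1,
     "speed_limit": 2, "avg_speed": None, "aadt": 1},  # missing average speed
]

usable = complete_cases(sample)
print(len(usable), "of", len(sample), "records are complete")
```

Complete-case filtering discards information, so imputation may be preferable when many records are incomplete; that choice depends on the data available in each region.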

6. Conclusions

Road safety, defined as the absence of crashes that result in injuries or property damage, is an essential part of transportation engineering. The costs and risks of neglecting road safety may be high and long-lasting. Therefore, a solid understanding of road safety is crucial. One of the major parameters for evaluating the severity of a crash is the NVIC. To predict the number of vehicles involved in a crash, this study used a classification-based approach. From the available data, seven parameters from four data categories were used as inputs. The predictive model was built with the LVQ 2.1 algorithm using 564 crash records from rural roads in Calabria. The developed model achieved accuracies of about 82.5% and 82.3% on the training and testing data, respectively, which shows that it can be considered a classification model with acceptable accuracy for road safety problems. A sensitivity analysis was also performed on the predicted results and compared with the previous literature. This analysis confirmed the findings of prior research by showing that TC and LO had the greatest and least influence, respectively, on the number of vehicles involved in crashes. This study’s results also supported previous studies in this field by highlighting the importance of human behavior in crash causation. Therefore, it is recommended that groups concerned with road safety not only work to improve rural road conditions but also create a comprehensive strategy for raising awareness and assessing drivers’ abilities.
Subsequent investigations may benefit from utilizing deep learning algorithms, which possess significant capabilities regarding the construction of models and the analysis of complex datasets. Furthermore, it is recommended that a comprehensive dataset comprising diverse variables that could potentially impact the frequency of vehicular accidents be considered in future studies.

Author Contributions

Conceptualization, S.S.H. (Sina Shaffiee Haghshenas), G.G. and S.S.H. (Sami Shaffiee Haghshenas); methodology, S.S.H. (Sina Shaffiee Haghshenas), G.G. and S.S.H. (Sami Shaffiee Haghshenas); software, S.S.H. (Sina Shaffiee Haghshenas) and S.S.H. (Sami Shaffiee Haghshenas); validation, S.S.H. (Sina Shaffiee Haghshenas), G.G. and S.S.H. (Sami Shaffiee Haghshenas); formal analysis, S.S.H. (Sina Shaffiee Haghshenas) and S.S.H. (Sami Shaffiee Haghshenas); investigation, S.S.H. (Sina Shaffiee Haghshenas), G.G. and S.S.H. (Sami Shaffiee Haghshenas); resources, G.G. and V.A.; writing—original draft preparation, S.S.H. (Sina Shaffiee Haghshenas), G.G., S.S.H. (Sami Shaffiee Haghshenas) and V.A.; writing—review and editing, S.S.H. (Sina Shaffiee Haghshenas), G.G., S.S.H. (Sami Shaffiee Haghshenas) and V.A.; supervision, G.G. and V.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors do not have permission to share data.

Acknowledgments

The authors would like to acknowledge Mehdi Ghaem for the helpful guidance he provided during the course of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wallius, E.; Tomé Klock, A.C.; Hamari, J. Playing It Safe: A Literature Review and Research Agenda on Motivational Technologies in Transportation Safety. Reliab. Eng. Syst. Saf. 2022, 223, 108514.
  2. Damjanović, M.; Stevic, Z.; Stanimirovic, D.; Tanackov, I.; Marinković, D. Impact of the Number of Vehicles on Traffic Safety: Multiphase Modeling. Facta Univ. Ser. Mech. Eng. 2022, 20, 177–197.
  3. Hamim, O.F.; Ukkusuri, S.V. Towards safer streets: A framework for unveiling pedestrians’ perceived road safety using street view imagery. Accid. Anal. Prev. 2024, 195, 107400.
  4. Abedi, M.M.; Sacchi, E. A machine learning tool for collecting and analyzing subjective road safety data from Twitter. Expert Syst. Appl. 2024, 240, 122582.
  5. Kušić, K.; Schumann, R.; Ivanjko, E. A digital twin in transportation: Real-time synergy of traffic data streams and simulation for virtualizing motorway dynamics. Adv. Eng. Inform. 2023, 55, 101858.
  6. Sohail, A.; Cheema, M.A.; Ali, M.E.; Toosi, A.N.; Rakha, H.A. Data-driven approaches for road safety: A comprehensive systematic literature review. Saf. Sci. 2023, 158, 105949.
  7. Bonera, M.; Barabino, B.; Yannis, G.; Maternini, G. Network-wide road crash risk screening: A new framework. Accid. Anal. Prev. 2024, 199, 107502.
  8. Mohammadpour, S.I.; Khedmati, M.; Zada, M.J.H. Classification of truck-involved crash severity: Dealing with missing, imbalanced, and high dimensional safety data. PLoS ONE 2023, 18, e0281901.
  9. Saljoqi, M.; Behnood, H.R.; Mirbaha, B. Developing the Crash Modification Model for Urban Street Lighting. Innov. Infrastruct. Solut. 2021, 6, 59.
  10. Ivajnšič, D.; Horvat, N.; Žiberna, I.; Konečnik Kotnik, E.; Davidović, D. Revealing the Spatial Pattern of Weather-related Road Traffic Crashes in Slovenia. Appl. Sci. 2021, 11, 6506.
  11. Török, Á. A Novel Approach in Evaluating the Impact of Vehicle Age on Road Safety. Promet-Traffic Transp. 2020, 32, 789–796.
  12. Lyon, C.; Mayhew, D.; Granié, M.-A.; Robertson, R.D.; Vanlaar, W.G.M.; Woods-Fry, H.; Thevenet, C.; Furian, G.; Soteropoulos, A. Age and Road Safety Performance: Focusing on Elderly and Young Drivers. IATSS Res. 2020, 44, 212–219.
  13. Llopis-Castelló, D.; Bella, F.; Camacho-Torregrosa, F.J.; García, A. New Consistency Model Based on Inertial Operating Speed Profiles for Road Safety Evaluation. J. Transp. Eng. Part A Syst. 2018, 144, 04018006.
  14. Elvik, R.; Vadeby, A.; Hels, T.; van Schagen, I. Updated Estimates of the Relationship Between Speed and Road Safety at the Aggregate and Individual Levels. Accid. Anal. Prev. 2019, 123, 114–122.
  15. Zarei, M.; Hellinga, B. Method for Estimating the Monetary Benefit of Improving Annual Average Daily Traffic Accuracy in the Context of Road Safety Network Screening. Transp. Res. Rec. 2022, 2677, 445–457.
  16. Amiri, A.M.; Sadri, A.; Nadimi, N.; Shams, M. A Comparison Between Artificial Neural Network and Hybrid Intelligent Genetic Algorithm in Predicting the Severity of Fixed Object Crashes Among Elderly Drivers. Accid. Anal. Prev. 2020, 138, 105468.
  17. Wang, D.D.; Wang, D.D. Assessing Road Transport Sustainability by Combining Environmental Impacts and Safety Concerns. Transp. Res. Part D Transp. Environ. 2019, 77, 212–223.
  18. McLeod, S.; Curtis, C. Integrating Urban Road Safety and Sustainable Transportation Policy Through the Hierarchy of Hazard Controls. Int. J. Sustain. Transp. 2022, 16, 166–180.
  19. Ziakopoulos, A.; Yannis, G. A Review of Spatial Approaches in Road Safety. Accid. Anal. Prev. 2020, 135, 105323.
  20. Afghari, A.P.; Haque, M.; Washington, S. Applying a Joint Model of Crash Count and Crash Severity to Identify Road Segments with High Risk of Fatal and Serious Injury Crashes. Accid. Anal. Prev. 2020, 144, 105615.
  21. Tamakloe, R.; Park, D.-W. Factors Influencing Fatal Vehicle-involved Crash Consequence Metrics at Spatio-temporal Hotspots in South Korea: Application of GIS and Machine Learning Techniques. Int. J. Urban Sci. 2022, 27, 483–517.
  22. Hossain, J.; Ivan, J.N.; Zhao, S.; Wang, K.; Sharmin, S.; Ravishanker, N.; Jackson, E. Considering Demographics of Other Involved Drivers in Predicting the Highest Driver Injury Severity in Multi-vehicle Crashes on Rural Two-lane Roads in California. J. Transp. Saf. Secur. 2022, 15, 43–58.
  23. Silva, P.B.; Andrade, M.; Ferreira, S. Machine Learning Applied to Road Safety Modeling: A Systematic Literature Review. J. Traffic Transp. Eng. (Engl. Ed.) 2020, 7, 775–790.
  24. Santos, K.; Dias, J.P.; Amado, C. A literature review of machine learning algorithms for crash injury severity prediction. J. Saf. Res. 2022, 80, 254–269.
  25. Ziakopoulos, A.; Vlahogianni, E.I.; Antoniou, C.; Yannis, G. Spatial Predictions of Harsh Driving Events Using Statistical and Machine Learning Methods. Saf. Sci. 2022, 150, 105722.
  26. Muksimova, S.; Umirzakova, S.; Mardieva, S.; Cho, Y.I. Enhancing medical image denoising with innovative teacher–student model-based approaches for precision diagnostics. Sensors 2023, 23, 9502.
  27. Panda, C.; Mishra, A.K.; Dash, A.K.; Nawab, H. Predicting and explaining severity of road accident using artificial intelligence techniques, SHAP and feature analysis. Int. J. Crashworthiness 2023, 28, 186–201.
  28. Tselentis, D.I.; Papadimitriou, E.; van Gelder, P. The usefulness of artificial intelligence for safety assessment of different transport modes. Accid. Anal. Prev. 2023, 186, 107034.
  29. Lin, X.; Zhang, G.; Wei, S. Velocity prediction using Markov Chain combined with driving pattern recognition and applied to Dual-Motor Electric Vehicle energy consumption evaluation. Appl. Soft Comput. 2021, 101, 106998.
  30. Liu, R.; Wang, C.; Tang, A.; Zhang, Y.; Yu, Q. A twin delayed deep deterministic policy gradient-based energy management strategy for a battery-ultracapacitor electric vehicle considering driving condition recognition with learning vector quantization neural network. J. Energy Storage 2023, 71, 108147.
  31. Fu, H.; Yang, D.; Wang, S.; Wang, L.; Wang, D. A novel online energy management strategy for fuel cell vehicles based on improved random forest regression in multi road modes. Energy Convers. Manag. 2024, 305, 118261.
  32. Nova, D.; Estevez, P.A. A Review of Learning Vector Quantization Classifiers. Neural Comput. Appl. 2014, 25, 511–524.
  33. Kohonen, T. An introduction to neural computing. Neural Netw. 1988, 1, 3–16.
  34. Kohonen, T. Improved versions of learning vector quantization. In Proceedings of the 1990 IJCNN International Joint Conference on Neural Networks, San Diego, CA, USA, 17–21 June 1990; IEEE: Piscataway, NJ, USA, 1990; pp. 545–550.
  35. Kohonen, T. Self-Organizing Maps; Springer, Inc.: Secaucus, NJ, USA, 1997.
  36. Setyorini, P.F.D.; Mahmudah, H.; Puspitorini, O.; Siswandari, N.A.; Wijayanti, A. Accuracy Improvement on Learning Vector Quantization (LVQ) Using Exponential Smoothing for Driving Activity Classification. In Proceedings of the 2020 8th International Conference on Information and Communication Technology (ICoICT), Yogyakarta, Indonesia, 24–26 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
  37. Seo, S.; Bode, M.; Obermayer, K. Soft nearest prototype classification. IEEE Trans. Neural Netw. 2003, 14, 390–398.
  38. King, K.N. 2006 IEEE International Conference on Granular Computing. IEEE Comput. Intell. Mag. 2006, 1, 53–55.
  39. Arul, E.; Punidha, A. Supervised Deep Learning Vector Quantization to Detect Memcached DDOS Malware Attack on Cloud. SN Comput. Sci. 2021, 2, 85.
  40. Guido, G.; Shaffiee Haghshenas, S.; Shaffiee Haghshenas, S.; Vitale, A.; Astarita, V.; Park, Y.; Geem, Z.W. Evaluation of contributing factors affecting number of vehicles involved in crashes using machine learning techniques in rural roads of Cosenza, Italy. Safety 2022, 8, 28.
  41. Astarita, V.; Haghshenas, S.S.; Guido, G.; Vitale, A. Developing New Hybrid Grey Wolf Optimization-based Artificial Neural Network for Predicting Road Crash Severity. Transp. Eng. 2023, 12, 100164.
  42. Guido, G.; Haghshenas, S.S.; Haghshenas, S.S.; Vitale, A.; Astarita, V. Application of Feature Selection Approaches for Prioritizing and Evaluating the Potential Factors for Safety Management in Transportation Systems. Computers 2022, 11, 145.
  43. Zorlu, K.; Gokceoglu, C.; Ocakoğlu, F.; Nefeslioglu, H.A.; Acikalin, S. Prediction of Uniaxial Compressive Strength of Sandstones Using Petrography-based Models. Eng. Geol. 2008, 96, 141–158.
  44. Tamakloe, R.; Hong, J.; Park, D. A Copula-based Approach for Jointly Modeling Crash Severity and Number of Vehicles Involved in Express Bus Crashes on Expressways Considering Temporal Stability of Data. Accid. Anal. Prev. 2020, 146, 105736.
  45. Casado-Sanz, N.; Guirao, B.; Attard, M. Analysis of the Risk Factors Affecting the Severity of Traffic Accidents on Spanish Crosstown Roads: The Driver’s Perspective. Sustainability 2020, 12, 2237.
  46. Azimi, G.; Rahimi, A.; Asgari, H.; Jin, X. Injury Severity Analysis for Large Truck-involved Crashes: Accounting for Heterogeneity. Transp. Res. Rec. 2022, 2676, 15–29.
Figure 1. An overview of the LVQ network.
Figure 2. Rural road accident map in the province of Cosenza (Italy) for the years 2017 and 2018.
Figure 3. The simplest possible form of a confusion matrix.
Figure 4. The confusion matrix’s results for the 15th developed model regarding training (a), validation (b), testing (c), and the total dataset (d).
Figure 5. The ROC curve’s results for the 15th developed model regarding training (a), validation (b), testing (c), and the total dataset (d).
Figure 6. A comparison of the LVQ model’s sensitivity analysis results with previous studies.
Figure 7. Comparison between the accuracy results of LVQ 2.1 model and prior research.
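Table 1. Quantitative and qualitative factors serving as independent variables.

| Data Categories | Variable | Code/Unit | Explanation |
|---|---|---|---|
| Crash characteristic | Type of Crash | 1 | Collision with vehicle |
| Crash characteristic | Type of Crash | 2 | Collision with pedestrian |
| Crash characteristic | Type of Crash | 3 | Collision with obstacle |
| Crash characteristic | Type of Crash | 4 | Other |
| Environment characteristics | Day Light | 0 | Day light |
| Environment characteristics | Day Light | 1 | Night-time |
| Environment characteristics | Weekday | 0 | Weekend or Holiday |
| Environment characteristics | Weekday | 1 | Weekday |
| Road environment | Location | 0 | Non intersection |
| Road environment | Location | 1 | Intersection |
| Road environment | Speed Limit (km/h) | 1 | 50 |
| Road environment | Speed Limit (km/h) | 2 | 70 |
| Road environment | Speed Limit (km/h) | 3 | 90 |
| Road environment | Speed Limit (km/h) | 4 | 110 |
| Road environment | Speed Limit (km/h) | 5 | 130 |
| Traffic flow characteristics | Avg Speed (km/h) | Not coded | Min 28; Max 122; Avg 91.43 |
| Traffic flow characteristics | AADT (Veh/day) | 1 | <5000 |
| Traffic flow characteristics | AADT (Veh/day) | 2 | 5000–9999 |
| Traffic flow characteristics | AADT (Veh/day) | 3 | 10,000–14,999 |
| Traffic flow characteristics | AADT (Veh/day) | 4 | >14,999 |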
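The coding scheme in Table 1 can be sketched as a small encoder that turns one crash record into the seven model inputs. The function names, the string keys for crash types, and the ordering of the output vector are illustrative assumptions; only the codes and bin boundaries are taken from the table.

```python
# Code tables taken from Table 1; the string keys are hypothetical labels.
CRASH_TYPE_CODES = {"vehicle": 1, "pedestrian": 2, "obstacle": 3, "other": 4}
SPEED_LIMIT_CODES = {50: 1, 70: 2, 90: 3, 110: 4, 130: 5}


def aadt_code(aadt):
    """Map annual average daily traffic (veh/day) to its Table 1 class."""
    if aadt < 5000:
        return 1
    if aadt < 10000:
        return 2
    if aadt < 15000:
        return 3
    return 4


def encode_record(crash_type, is_night, is_weekday, at_intersection,
                  speed_limit_kmh, avg_speed_kmh, aadt):
    """Return the seven inputs in Table 1 order (an assumed ordering)."""
    return [
        CRASH_TYPE_CODES[crash_type],
        1 if is_night else 0,          # Day Light: 0 = day light, 1 = night-time
        1 if is_weekday else 0,        # Weekday: 0 = weekend or holiday
        1 if at_intersection else 0,   # Location: 0 = non intersection
        SPEED_LIMIT_CODES[speed_limit_kmh],
        avg_speed_kmh,                 # average speed is not coded
        aadt_code(aadt),
    ]


# Made-up record for illustration only.
example = encode_record("vehicle", False, True, False, 90, 91.43, 7200)
print(example)  # [1, 0, 1, 0, 3, 91.43, 2]
```

Because average speed stays on its original scale while the other inputs are small integer codes, a distance-based classifier would usually need the inputs rescaled first; whether and how the study normalized them is not stated in this section.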
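Table 2. The models’ accuracy in training and testing with different control parameters.

| Model No. | Epoch | NNHL | Training Accuracy (%) | Testing Accuracy (%) |
|---|---|---|---|---|
| 1 | 5 | 10 | 66.1 | 61.9 |
| 2 | 5 | 20 | 64.6 | 63.5 |
| 3 | 5 | 30 | 80.5 | 71.4 |
| 4 | 5 | 40 | 80 | 79.6 |
| 5 | 10 | 10 | 64.8 | 66.4 |
| 6 | 10 | 20 | 71.9 | 61.9 |
| 7 | 10 | 30 | 68.4 | 75.2 |
| 8 | 10 | 40 | 64.3 | 69 |
| 9 | 20 | 10 | 69.6 | 69 |
| 10 | 20 | 20 | 80.8 | 75 |
| 11 | 20 | 30 | 82.3 | 81 |
| 12 | 20 | 40 | 76.5 | 67.3 |
| 13 | 30 | 10 | 81.5 | 79.6 |
| 14 | 30 | 20 | 81.3 | 80.5 |
| 15 | 30 | 30 | 82.5 | 82.3 |
| 16 | 30 | 40 | 80 | 80.5 |
| 17 | 50 | 10 | 80.5 | 80.5 |
| 18 | 50 | 20 | 82 | 78.8 |
| 19 | 50 | 30 | 81.5 | 77 |
| 20 | 50 | 40 | 79.2 | 79.6 |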
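Table 3. Ranking of models based on their accuracy in training and testing.

| Model No. | Epoch | NNHL | Rank Based on Training Accuracy (%) | Rank Based on Testing Accuracy (%) | Total Rank |
|---|---|---|---|---|---|
| 1 | 5 | 10 | 4 | 1 | 5 |
| 2 | 5 | 20 | 2 | 3 | 5 |
| 3 | 5 | 30 | 12 | 8 | 20 |
| 4 | 5 | 40 | 10 | 13 | 23 |
| 5 | 10 | 10 | 3 | 4 | 7 |
| 6 | 10 | 20 | 7 | 1 | 8 |
| 7 | 10 | 30 | 5 | 10 | 15 |
| 8 | 10 | 40 | 1 | 6 | 7 |
| 9 | 20 | 10 | 6 | 6 | 12 |
| 10 | 20 | 20 | 14 | 9 | 23 |
| 11 | 20 | 30 | 19 | 19 | 38 |
| 12 | 20 | 40 | 8 | 5 | 13 |
| 13 | 30 | 10 | 16 | 13 | 29 |
| 14 | 30 | 20 | 15 | 16 | 31 |
| 15 | 30 | 30 | 20 | 20 | 40 |
| 16 | 30 | 40 | 10 | 16 | 26 |
| 17 | 50 | 10 | 12 | 16 | 28 |
| 18 | 50 | 20 | 18 | 12 | 30 |
| 19 | 50 | 30 | 16 | 11 | 27 |
| 20 | 50 | 40 | 9 | 13 | 22 |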
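The ranks in Table 3 can be reproduced from the accuracies in Table 2 with a short sketch. The ranking convention used here (rank 1 for the lowest accuracy, tied models sharing the lowest rank of their group, and the total rank being the sum of both ranks) is inferred from the tabulated values rather than stated in this section, and the function name `ascending_ranks` is hypothetical.

```python
# (training accuracy %, testing accuracy %) for models 1..20, from Table 2.
ACCURACIES = [
    (66.1, 61.9), (64.6, 63.5), (80.5, 71.4), (80.0, 79.6), (64.8, 66.4),
    (71.9, 61.9), (68.4, 75.2), (64.3, 69.0), (69.6, 69.0), (80.8, 75.0),
    (82.3, 81.0), (76.5, 67.3), (81.5, 79.6), (81.3, 80.5), (82.5, 82.3),
    (80.0, 80.5), (80.5, 80.5), (82.0, 78.8), (81.5, 77.0), (79.2, 79.6),
]


def ascending_ranks(values):
    """Rank 1 = lowest value; tied values share the lowest rank of their group."""
    ordered = sorted(values)
    # list.index returns the first position, which gives ties the lowest rank.
    return [ordered.index(value) + 1 for value in values]


train_ranks = ascending_ranks([train for train, _ in ACCURACIES])
test_ranks = ascending_ranks([test for _, test in ACCURACIES])
total_ranks = [a + b for a, b in zip(train_ranks, test_ranks)]

# Model numbers are 1-based, so add 1 to the list position.
best_model = total_ranks.index(max(total_ranks)) + 1
print(best_model, max(total_ranks))  # 15 40
```

Under this convention, model 15 (30 epochs, NNHL of 30) obtains the highest total rank of 40, followed by model 11 with 38, which matches Table 3.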
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shaffiee Haghshenas, S.; Guido, G.; Shaffiee Haghshenas, S.; Astarita, V. Predicting Number of Vehicles Involved in Rural Crashes Using Learning Vector Quantization Algorithm. AI 2024, 5, 1095-1110. https://doi.org/10.3390/ai5030054
