1. Introduction
Road traffic congestions (RTCs) are a significant issue globally; they negatively affect economic production and quality of life in cities. RTCs in transportation services occur when demand exceeds the design capacity. According to [1], road traffic congestion is gradually increasing and costs economies billions of Rands (ZAR). Bengaluru (India) leads globally, followed by Manila (Philippines), Bogota (Colombia), Mumbai (India), and Pune (India); these are the top-five ranked congested cities globally with populations of over 800,000. In Africa, the top-five ranked congested cities for overall daily congestion are Cairo (Egypt), followed by Cape Town (South Africa), Johannesburg (South Africa), Pretoria (South Africa), and East London (South Africa). In [2], it was also reported that RTCs reduce road throughput and increase vehicle emissions and accidents, costing road users time and money. Commuters residing in large metropolitan areas are the most affected, as RTCs disrupt their day-to-day activities daily.
Increasing traffic congestion is, directly and indirectly, the cause of a significant share of the road traffic collisions that result in a growing number of injuries and fatalities on roads globally. The World Health Organisation (WHO) [2] has also reported that RTCs contribute to health complications, with about 3.7 million lives lost globally, particularly in developing cities, alongside high monetary losses, delays, fuel waste, road collisions, and emissions. Congested traffic also produces nitrogen oxides (NOx), carbon oxides (CO), sulphur oxides (SOx), and particulate matter. According to [3], the transport sector accounts for three-quarters of global CO2 emissions, which originate from trucks, cars, planes, and trains. Furthermore, the report in [2] states that the main sources of air pollution are road traffic congestion-related issues, which cause diseases such as cardiovascular and respiratory failure. Traffic congestion is harmful to drivers and passengers sitting in traffic and to people living near the affected highways. According to [4], adults living next to busy roadways may be at risk of diseases such as dementia, while children may be at risk of developing long-term chronic diseases. In addition, other research studies [5,6,7,8,9,10] have investigated the negative impacts of RTCs in detail.
Road traffic congestion prediction, as presented in this paper, is defined as predicting the traffic state at a specific time [11]. This study focuses on traffic flow and road traffic congestion, and considers the traffic volume (density), average speed (speed flow/velocity), and travel time parameters of the fundamental diagram. The fundamental diagram graphs from [12] consist of freeflow, bound flow, and congested (traffic density) vectors. In this study they are interpreted as follows: freeflow refers to stable traffic with vehicles moving freely; bound flow is the bistable transition away from the stable traffic state; and the congested state is the unstable traffic state, in which vehicle velocity is reduced and the volume of vehicles gradually increases relative to the freeway capacity [12,13,14]. For this study, the vectors (classes) were defined as freeflow, moderate, and congested, where the bound flow vector is referred to as moderate. These three classes were assigned based on the obtained data.
This study considered supervised ensemble methods to design predictive models for road traffic congestion. Ensemble methods are used in machine learning to combine several classifiers and improve their overall performance. Their main aim is to outperform the traditional machine learning methods, called base learners, on which they are built [15]. Ensemble methods have recently shown superior performance in the road traffic congestion domain. In a recent study [16], ensemble methods outperformed regression models and a multi-layer perceptron. This study aimed to develop an ensemble learning model using a real-life road traffic flow dataset to evaluate the negative impacts of road traffic congestion, which, in turn, might lead to high numbers of road traffic collisions on the highways and to delays, among other effects. The three main objectives of this study, which deals with a real-life road traffic flow dataset, are as follows:
To handle missing values using the listwise deletion method;
To compare the performance of three traditional machine-learning methods and three ensemble methods;
To assess each model’s performance using key evaluation metrics and a cost model incorporating cost to commuters, businesses, and the economy.
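The first objective, listwise deletion, amounts to discarding every record that has at least one missing value. The following is a minimal Python sketch of that idea, not the study's actual preprocessing code; the field names, the helper `listwise_delete`, and the example records are hypothetical.

```python
# Hypothetical field names, loosely following the attributes used in the study.
FIELDS = ["date", "travel_time", "average_speed", "traffic_volume"]


def listwise_delete(records, fields):
    """Keep only the records in which every listed field has a value."""
    return [
        record
        for record in records
        if all(record.get(field) is not None for field in fields)
    ]


# Hypothetical records; None marks a missing value.
records = [
    {"date": "day-1", "travel_time": 12.0, "average_speed": 90, "traffic_volume": 4000},
    {"date": "day-2", "travel_time": None, "average_speed": 95, "traffic_volume": 2100},
    {"date": "day-3", "travel_time": 9.8, "average_speed": 105, "traffic_volume": None},
    {"date": "day-4", "travel_time": 9.5, "average_speed": 110, "traffic_volume": 300},
]

complete = listwise_delete(records, FIELDS)
print(len(records), "records before,", len(complete), "after listwise deletion")
```

Listwise deletion is simple and leaves only fully observed records, at the price of shrinking the dataset whenever missing values are frequent.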
The study’s contribution is an RTC framework that handles missing values, performs a comprehensive analysis using ensemble and machine learning methods, computes the classification cost, and evaluates the performance of the models. The rest of the paper is organised as follows: Section 2 contains the RTC-related literature review; Section 3 describes the study methodology and methods used; Section 4 describes the experimental settings for the study; and Section 5 provides the experimental results and a discussion of the findings. Finally, a conclusion is presented in Section 6.
5. Experimental Results and Discussion
This section presents the results obtained during the experiments, discusses them, and computes the misclassification cost. Models for predicting the status of vehicle traffic on the freeway were constructed using traditional ML methods and ensemble methods (EMs). The data contained attributes such as travel time, average speed, traffic volume, and date.
5.1. Comparison of RTC Prediction Results
Table 5 and Figure 4 show the overall results obtained with the traditional methods, the bagging method (using RF, DT, and SVM), the AdaBoost method (using RF, DT, and SVM), and the stacking method, which combined RF, DT, and SVM with the default logistic regression as the final estimator. Among the traditional methods, the DT model obtained more promising results than RF and SVM, performing well on all evaluation metrics. A likely reason is that decision trees can handle datasets with outliers.
Further results were generated using ensemble methods. The bagging model with DT achieved the best results in terms of the accuracy and precision metrics. The AdaBoost model with DT also achieved better results than AdaBoost combined with RF or SVM, which means that combining AdaBoost with DT significantly improved the model’s performance. Additionally, the stacking model, with RF, DT, and SVM as base estimators and LR as the final estimator, showed a more significant improvement. Considering accuracy, precision, recall, and f1-score, this model performed best compared with the other models. The results revealed that DT could improve the performance of different ensemble model combinations when used individually and as a weak learner.
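The four metrics named above can all be derived from a confusion matrix. The following is a minimal Python sketch, assuming rows hold the actual classes and columns hold the predicted classes; the function name `per_class_metrics` and the example counts are hypothetical and are not taken from Table 5.

```python
def per_class_metrics(confusion):
    """Return overall accuracy and a (precision, recall, f1) tuple per class.

    confusion[k][j] is the number of instances of actual class k
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total

    metrics = []
    for i in range(n):
        true_positive = confusion[i][i]
        predicted_as_i = sum(confusion[k][i] for k in range(n))  # column sum
        actually_i = sum(confusion[i])                           # row sum
        precision = true_positive / predicted_as_i if predicted_as_i else 0.0
        recall = true_positive / actually_i if actually_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Hypothetical 3 x 3 confusion matrix for three traffic classes.
EXAMPLE_CONFUSION = [
    [50, 3, 1],
    [2, 80, 4],
    [0, 5, 60],
]

accuracy, metrics = per_class_metrics(EXAMPLE_CONFUSION)
print(f"accuracy = {accuracy:.3f}")
for index, (precision, recall, f1) in enumerate(metrics):
    print(f"class {index}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting the metrics per class, rather than accuracy alone, shows whether a model does well on every traffic state or only on the most frequent one.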
A typical example of the analysis of the traffic state involving the three vectors/classes is as follows. In 2016, a volume of 4987 vehicles was recorded on the freeway on a Monday from 06:00, travelling at 87 km/h, resulting in an unstable traffic state. The unstable traffic state is due to factors such as commuters travelling to work and road incidents, among others. In 2018, a volume of 942 vehicles travelling at 108 km/h was recorded on a Friday from 23:00, which is classified as a bistable traffic state. Furthermore, in 2016, a volume of 326 vehicles travelling at 110 km/h was recorded on a Thursday from 05:00, classified as stable traffic. During peak hours, traffic turns unstable since most commuters need to get to their workplaces: on working weekdays, traffic turns unstable in the morning from 06:00 to 08:00 and in the afternoon from 15:00 to 18:00. The analysis identifies peak and off-peak hours so that commuters can plan future travel.
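The peak periods described above can be expressed as a simple time check. The following Python sketch is illustrative only: the function name `is_peak_period` is hypothetical, "working weekdays" is assumed to mean Monday to Friday, and treating the end of each window as exclusive is also an assumption.

```python
from datetime import datetime


def is_peak_period(when):
    """Return True if `when` falls in a weekday peak window.

    Windows follow the text: 06:00 to 08:00 and 15:00 to 18:00.
    The end of each window is treated as exclusive (an assumption).
    """
    if when.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    morning_peak = 6 <= when.hour < 8
    afternoon_peak = 15 <= when.hour < 18
    return morning_peak or afternoon_peak


# 7 March 2016 was a Monday; 12 March 2016 was a Saturday.
print(is_peak_period(datetime(2016, 3, 7, 6, 0)))    # Monday morning
print(is_peak_period(datetime(2016, 3, 7, 23, 0)))   # Monday late evening
print(is_peak_period(datetime(2016, 3, 12, 7, 0)))   # Saturday morning
```

Such a flag could serve as an extra input feature or as a sanity check on predicted classes, although the text does not state that the study used one.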
5.2. Models’ Misclassification Cost
Misclassification cost refers to all the penalties associated with errors made during the classification process. The study aimed to determine the model that did not penalise commuters heavily in travel time and that provided sufficient information regarding traffic flow. All models were evaluated using the loss matrix shown in Table 6. The penalty was assigned based on the misclassification error for each model [55]. A penalty of 3 was assigned to a cell that predicted Freeflow (c) when the actual traffic status was Moderate or Bound flow (b) or Congested (a). A penalty of 1 was assigned to a cell that predicted Congested when the actual traffic status was Moderate. A penalty of 0 was assigned to entries on the leading diagonal of the loss matrix, where the actual and the predicted traffic status were the same. Since the three classes (Freeflow, Moderate or Bound flow, and Congested) give a 3 × 3 confusion matrix, a 3 × 3 loss matrix was designed.
The input in cells $L_{kj}$ of Table 7 and Table 8 specifies the penalty associated with the prediction of class $C_j$ when, in fact, it is $C_k$ [55]. For all instances $x$ which belong to $C_k$, the expected loss is given by Equation (5), where $L_{kj}$ is the penalty associated with misclassification for a predicted model.
Equation (6) gives the risk to be minimised at each point x, that is, when the decision regions are chosen [55,58]. Values of the loss matrix (the cells in Table 6) were chosen by hand, based on views of the knowledge expected in MTM. The misclassification cost of prediction was computed using Equation (7).
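As a hedged sketch of the form Equations (5) to (7) usually take, the block below assumes the standard decision-theoretic loss-matrix formulation. The notation is an assumption: $L_{kj}$ is the penalty for predicting class $C_j$ when the actual class is $C_k$, $\mathcal{R}_j$ is the region of inputs assigned to class $C_j$, and $n_{kj}$ is the confusion-matrix count of instances of actual class $C_k$ predicted as $C_j$.

```latex
% Expected loss over all decision regions (assumed form of Equation (5))
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(x, C_k)\, \mathrm{d}x

% Pointwise rule: assign x to the class with the smallest expected penalty
% (assumed form of Equation (6))
j^{*}(x) = \arg\min_{j} \sum_{k} L_{kj}\, p(C_k \mid x)

% Empirical misclassification cost from a confusion matrix
% (assumed form of Equation (7))
\mathrm{Cost} = \sum_{k} \sum_{j} L_{kj}\, n_{kj}
```

Under this reading, the diagonal entries $L_{kk} = 0$ contribute nothing, so the cost depends only on how the errors are distributed across the off-diagonal cells.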
The results were obtained by multiplying the values of the loss matrix in Table 6 by the corresponding cell values of the confusion matrices in Table 7 and Table 8. The costs of the other eight prediction models in Table 5 were computed using their corresponding confusion matrices. Table 7 shows that 929, 3554, and 2387 instances were classified correctly, with 19 incorrect classifications. Table 8 shows 42 and 3557 instances correctly classified, with 3290 incorrectly classified. The cost computation showed that the AdaBoost (DT) ensemble model obtained the lowest misclassification cost, 24, whereas the AdaBoost (SVM) ensemble model obtained the highest, 3290, as shown in Table 9 and Figure 5. This was computed using Equations (8) and (9) above.
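The cell-by-cell multiplication described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's code: the penalties of 3, the penalty of 1 for predicting Congested when the status is Moderate, and the zero diagonal follow the text, while the remaining off-diagonal penalties of 1 and all confusion-matrix counts are hypothetical.

```python
# Class order for both matrices: rows are actual, columns are predicted.
CLASSES = ["congested", "moderate", "freeflow"]

# LOSS[k][j]: penalty for predicting class j when the actual class is k.
LOSS = [
    [0, 1, 3],  # actual congested (the 1 is an assumption, the 3 is from the text)
    [1, 0, 3],  # actual moderate  (both penalties are from the text)
    [1, 1, 0],  # actual freeflow  (both 1s are assumptions)
]

# Hypothetical confusion matrix; not the counts of Table 7 or Table 8.
CONFUSION = [
    [50, 3, 1],
    [2, 80, 4],
    [0, 5, 60],
]


def misclassification_cost(loss, confusion):
    """Sum the element-wise products of the loss and confusion matrices."""
    return sum(
        loss[k][j] * confusion[k][j]
        for k in range(len(loss))
        for j in range(len(loss[k]))
    )


cost = misclassification_cost(LOSS, CONFUSION)
print("misclassification cost:", cost)
```

Because predicting freeflow for a non-freeflow state carries the heaviest penalty, two models with the same number of errors can still end up with different costs.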
A road traffic congestion prediction model was developed using ensemble methods. The results showed that the AdaBoost (DT) ensemble prediction model achieved an accuracy of 99.7% and a misclassification cost of 24, which is lower than that of the other models. The model also obtained the best results in terms of the precision, recall, and f1-score metrics. It is hoped that the constructed model, together with other interventions already in place, will help reduce the high number of road traffic collisions and traffic congestion. AdaBoost handled the dataset well, and its performance was as good as expected, which confirms its suitability for this kind of dataset and problem domain. However, the dataset used did not include weather and road collision data, which could be why the other models performed poorly compared to AdaBoost (DT).
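The accuracy figure can be checked directly from the counts reported for Table 7 and Table 8. The short Python sketch below does only that arithmetic; the variable and function names are illustrative, and the Table 8 percentage is derived here from the reported counts.

```python
# Counts as reported in the text for Table 7 and Table 8.
table7_correct = [929, 3554, 2387]
table7_incorrect = 19
table8_correct = [42, 3557]
table8_incorrect = 3290


def accuracy_from_counts(correct, incorrect):
    """Accuracy = correctly classified instances / all instances."""
    total = sum(correct) + incorrect
    return sum(correct) / total, total


acc7, total7 = accuracy_from_counts(table7_correct, table7_incorrect)
acc8, total8 = accuracy_from_counts(table8_correct, table8_incorrect)

print(f"Table 7: {acc7:.1%} of {total7} instances")
print(f"Table 8: {acc8:.1%} of {total8} instances")
```

Both tables sum to the same number of instances, and the Table 7 counts reproduce the reported 99.7% accuracy.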
6. Conclusions
This paper addressed the prediction of vehicle traffic flow well in advance using a dataset from a highway in the Gauteng Province, South Africa. Variables such as travel time, traffic volume, and average speed described the road traffic state well and were useful for predicting the status of vehicle traffic flow. The results also suggested that the AdaBoost (DT) ensemble model performed better than RF, DT, SVM, bagging, the other AdaBoost combination models, and stacking ensemble learning.
This model will complement the interventions already put in place by transport authorities, including e-toll gates, cycle tracks, bus rapid transit, road expansion, and corridors of freedom (well-planned transport routes), among others. Adopting this model will benefit commuters and businesses in the province and potentially make the province an attractive destination for investors. This method is better than the methods that have been used to date for addressing these challenges in the Gauteng Province. Although the results here are specific to the dataset used and cannot be generalised, they validate the presented framework, which may be applied to different road traffic datasets. The predictive model, when implemented well, could directly and indirectly decrease the number of road traffic collisions and the risks of poor health in different communities. Furthermore, commuters wishing to travel on the highway will receive helpful information on the traffic flow state well in advance, which will enable them to plan their travel time properly. The model can also help the authorities plan the distribution of resources. Businesses will also see improvements due to the timely delivery of goods and staff reporting to work on time, positively impacting the economy of different cities.
For future work, the authors plan to explore deep learning and various missing-data imputation methods, which they see as promising emerging approaches.