2. Materials and Methods
The challenge of forecasting traffic flows and using these to develop solutions for managing transportation systems is considered one of the most relevant topics in the modern world. Many scientists conduct theoretical and practical research in this area, leading to the development of unique algorithms, methods, and models. In this work, we will also analyze articles related to our research, including works [1,2].
In articles [1,2,3], research was conducted on short-term traffic forecasting using extreme gradient boosting (XGBoost). The authors of [3] note that fast and accurate short-term traffic flow forecasting is an important condition for traffic analysis and management. In [4], the authors investigated road safety in intelligent transportation systems (ITS), focusing on intersections. They propose using nonparametric, nonlinear ensemble models of decision trees to forecast traffic volume. It is also noted that intersections are the most complex part of the road network and that most accidents are related to intersections.
Article [5] is dedicated to forecasting traffic flow in work zones on roads. It notes that most existing traffic flow forecasting models do not consider the peculiarities of work zones, which create conditions different from both normal operating conditions and incident conditions. The study developed four models for forecasting traffic flow in planned work zones: random forests, regression trees, multilayer feedforward neural networks, and nonparametric regression. The authors investigated both long-term and short-term traffic flow forecasting. Long-term forecasting involves predicting 24 h ahead using historical traffic data, while short-term forecasting involves predicting 1 h, 45 min, 30 min, and 15 min ahead using real-time traffic data.
The models were evaluated using data from work zones on two types of roads: a highway and an arterial road in St. Louis, Missouri, USA. The research revealed that the random forest model provided the most accurate forecasts for both long-term and short-term traffic flow in work zones.
In addition, articles [6,7] propose a new hybrid model (CEEMDAN-XGBoost) for forecasting traffic flow at the lane level, based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and extreme gradient boosting (XGBoost). The CEEMDAN method decomposes the original traffic flow data into several intrinsic mode function components and one residual component. XGBoost models are then trained on, and make predictions for, each of the decomposed components. The final forecast is obtained by integrating the forecasting results of the individual XGBoost models. For illustrative purposes, lane-level traffic flow data obtained from remote microwave traffic sensors installed on the 3rd Ring Road in Beijing are used to evaluate the effectiveness of the CEEMDAN-XGBoost model.
In some of the literature, the issue of forecasting traffic flows when data are insufficient is studied. In [8,9,10,11], the problem of forecasting traffic flow in intelligent transportation systems (ITS) applications is considered. It is noted that using an autoregressive integrated moving average (ARIMA) or seasonal ARIMA (SARIMA) model for traffic flow forecasting requires a large amount of data to develop the model, which may not be available in data-scarce settings.
To address this problem, the authors proposed and evaluated a forecasting scheme based on the Kalman filter (KF) method, which requires only a limited amount of input data.
In this study on traffic flow forecasting, a variety of methods and algorithms were considered, ranging from linear regression to ensemble models. However, it is known that many linear models do not have sufficient accuracy for forecasting. For this reason, nonlinear methods and models were extensively studied, collectively referred to as additive models. Additive models, including various types of neural networks and decision trees, are able to capture nonlinear dependencies between variables and provide more accurate results.
In the context of regression, the generalized additive model (GAM) is formulated as:

E(Y | X_1, …, X_p) = a + f_1(X_1) + f_2(X_2) + … + f_p(X_p),  (1)

where E(Y | X_1, …, X_p) is the conditional expectation of the dependent variable Y given the values of the predictors X_1, …, X_p; a is a constant (intercept) representing the baseline level of the dependent variable Y when all predictors are zero; X_1, …, X_p are the predictors; Y is the response; and f_1, …, f_p are unspecified smooth (nonparametric) functions.
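To make the additive structure concrete, the following sketch evaluates a GAM-style prediction in Python. The component functions f1 and f2, the intercept, and the input values are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

# Hypothetical smooth component functions (illustrative choices only):
# the GAM prediction is their sum plus an intercept.
def f1(x1):
    return np.sin(x1)          # nonlinear effect of the first predictor

def f2(x2):
    return 0.5 * x2 ** 2       # nonlinear effect of the second predictor

a = 10.0                       # intercept: baseline level of the response

def gam_predict(x1, x2):
    # E(Y | X1, X2) = a + f1(X1) + f2(X2)
    return a + f1(x1) + f2(x2)

x1 = np.array([0.0, 1.0])
x2 = np.array([0.0, 2.0])
y_hat = gam_predict(x1, x2)
```

In a fitted GAM the component functions would be estimated from data (e.g., by backfitting with smoothers) rather than specified by hand as here.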
In our work, we have focused on the nonlinearity of the problem and explored state-of-the-art machine learning methods for predicting nonlinear dependencies. We delved into the study of decision trees, random forests, and ensemble methods, as they are effective in solving such tasks. Particularly noteworthy are decision trees, which find application in many areas due to their simplicity and the ease of interpreting their results. Like any other method, decision trees have their drawbacks, such as limited applicability and a tendency to overfit [4,12]. The results of several studies [4,12,13,14,15] indicate significant improvements in addressing the issues of limited performance and reliability that are inherent in decision trees. Based on these studies, ensemble methods have been developed, which combine multiple decision trees with various voting methods to select the optimal target label. These enhancements increase the accuracy and stability of predictions, making the model more efficient and reliable.
A detailed examination of decision tree methods is given in [4,13,16,17]. In [17,18], four of the most widely used algorithms are compared: CART (classification and regression trees), C4.5, CHAID (chi-squared automatic interaction detection), and QUEST (quick, unbiased, efficient, statistical tree). Reference [16] compares the algorithms ID3 and C4.5, C4.5 and C5.0, and C5.0 and CART. The research results indicate that the best-performing algorithm is C4.5. However, this algorithm is mainly used in classification tasks. In our case, we will use the CART algorithm, since our task pertains to regression.
Having studied the mentioned sources, we can move on to constructing a decision tree algorithm for regression tasks. To do this, we need data divided into training and test sets. Let us assume that we have a set of input data X and corresponding output vectors Y, such that we observe pairs (x_i, y_i) for i = 1, …, N, where x_i ∈ R^p with p features. Then, we split the data on some feature f to divide our data into two parts, R_left (left) and R_right (right), thus creating the tree structure represented in Figure 1.
To select the most informative features for splitting nodes, we need to define the objective function that we will optimize using the tree learning algorithm. In this case, our objective function is to maximize the information gain at each split, which is defined as follows (2):

IG(R_p, f) = I(R_p) − (N_left / N_p) · I(R_left) − (N_right / N_p) · I(R_right),  (2)

where IG(R_p, f) is the information gain when partitioning set R_p by feature f; I(R_p) is the impurity of the original set R_p before the split; N_left and N_right are the numbers of instances in subsets R_left and R_right after the split, respectively; N_p is the total number of instances in the original set R_p; and I(R_left) and I(R_right) are the impurities of subsets R_left and R_right after the split, respectively.
In regression tasks, the mean squared error (MSE) is usually chosen as the impurity measure:

I(R) = (1/N_R) · Σ_{i∈R} (y_i − ŷ_R)²,  (3)

where I(R) is the impurity of node R, N_R is the number of instances in node R, y_i is the actual value of the target variable for instance i, and ŷ_R is the predicted value of the target variable for instance i (the mean of the target values in node R).
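The split criterion can be sketched in a few lines of Python. The following minimal example, on toy data rather than the paper's traffic records, implements the MSE impurity and the information gain (2), and searches a single feature for the threshold that maximizes the gain:

```python
import numpy as np

def mse_impurity(y):
    """MSE impurity of a node: mean squared deviation of the
    targets from the node's mean prediction."""
    if len(y) == 0:
        return 0.0
    return np.mean((y - np.mean(y)) ** 2)

def information_gain(y_parent, y_left, y_right):
    """Information gain of a split (2): parent impurity minus the
    size-weighted impurities of the left and right child nodes."""
    n = len(y_parent)
    return (mse_impurity(y_parent)
            - len(y_left) / n * mse_impurity(y_left)
            - len(y_right) / n * mse_impurity(y_right))

def best_split(x, y):
    """Exhaustive CART-style search over thresholds of one feature:
    choose the split that maximizes the information gain."""
    best_gain, best_thr = -np.inf, None
    for thr in np.unique(x)[:-1]:           # candidate thresholds
        mask = x <= thr
        gain = information_gain(y, y[mask], y[~mask])
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# Toy data with an obvious breakpoint between x = 2 and x = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 10.0, 50.0, 50.0])
thr, gain = best_split(x, y)
```

A full CART implementation repeats this search over all p features at every node and recurses into the resulting child nodes until a stopping criterion (such as maximum depth) is met.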
As mentioned above, ensemble methods have been developed to address the drawbacks of decision trees; they have been studied in articles [4,6,11,12,14,19,20,21,22,23,24,25].
Bagging is an ensemble machine learning method used to improve the stability and generalization of machine learning models. The idea behind bagging is to use not just one model but several models f_1, …, f_n trained on different subsets of the data, to improve the quality of the model's prediction. Subsequently, the results of all models are combined, for example, by averaging (4), to obtain a more stable and accurate prediction [22,26]:

ŷ(x) = (1/n) · Σ_{i=1..n} f_i(x),  (4)

where n is the number of base models in bagging and f_i(x) is the predicted value of the target variable for object x using the i-th base model.
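A minimal bagging sketch, assuming synthetic hourly traffic-like data and scikit-learn decision trees as the base models (the dataset here is illustrative, not the paper's intersection records):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic nonlinear data standing in for hourly traffic counts
X = rng.uniform(0, 24, size=(300, 1))
y = 1000 + 500 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 50, 300)

n_models = 25                      # number of base models n in (4)
models = []
for _ in range(n_models):
    # Bootstrap subsample: draw with replacement from the training set
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    models.append(tree)

# Equation (4): average the base-model predictions
X_new = np.array([[6.0], [18.0]])
y_hat = np.mean([m.predict(X_new) for m in models], axis=0)
```

The same procedure is available out of the box as sklearn.ensemble.BaggingRegressor; the manual loop is shown only to expose the bootstrap-and-average structure.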
Random forest is also an ensemble machine learning method used for both classification and regression tasks. It is based on the idea of combining multiple decision trees into one model [20,24,27,28].
To build a random forest, several random subsamples with replacement are first formed from the original dataset R. Then, a separate decision tree is built for each subsample. When building each node of the tree, a random subset of features is selected. After all the trees are built, their results are combined, for example, by majority voting for classification tasks or averaging for regression tasks, to obtain the final prediction result of the random forest model.
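This procedure maps directly onto scikit-learn's RandomForestRegressor. The sketch below uses synthetic stand-in data (hour of day and day of week), not the actual intersection records, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the traffic data: hour of day and day of week
hour = rng.integers(0, 24, 500)
dow = rng.integers(0, 7, 500)
X = np.column_stack([hour, dow])
y = 800 + 40 * hour - (dow >= 5) * 300 + rng.normal(0, 30, 500)

# Bootstrap subsampling and per-node feature subsets are handled
# internally: bootstrap controls the resampling with replacement,
# max_features the random feature subset considered at each node.
forest = RandomForestRegressor(n_estimators=100, max_depth=10,
                               max_features="sqrt", bootstrap=True,
                               random_state=0).fit(X, y)

# Final regression prediction = average over the individual trees
pred = forest.predict(np.array([[8, 1]]))   # 8 a.m. on a weekday
```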
Boosting is a machine learning method used to improve the performance of models by combining several weak models into one strong model. The main idea of boosting is to sequentially train models on the data, focusing on examples where previous models made mistakes. Each new model attempts to correct the errors of the previous one, and, as a result, the combination of all the models yields a more accurate and generalizable model [3,23,25,26].
Boosting can be used with various base models, but decision trees are most commonly used (5):

F(x) = Σ_{i=1..n} α_i · b_i(x),  (5)

where α_i is the weight coefficient for the i-th model, and b_i(x) is the prediction of the i-th base model for object x.
In boosting, the main goal is to minimize the error functional (quality criterion) by sequentially adding new models that correct the errors of the previous ones (6):

min over (α, b): Σ_{j=1..N} L(y_j, F(x_j)),  (6)

where L is the loss function, N is the total number of observations in the dataset, and L(y_j, F(x_j)) is the loss computed from the true value y_j and the predicted value F(x_j). The parameters α and b are the values adjusted to minimize this loss function.
Another method is the gradient boosting algorithm, which is widely used in classification and regression tasks. The difference between gradient boosting and classical boosting is that gradient boosting uses gradient descent to tune the model parameters, while classical boosting uses the method of reweighting objects.
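For squared loss, the negative gradient of the loss with respect to the current prediction equals the residual, so gradient boosting can be sketched as repeatedly fitting small trees to residuals and taking a gradient-descent step in function space. This is a simplified illustration on synthetic data, not the authors' implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(200, 1))
y = 1000 + 500 * np.sin(X[:, 0] / 24 * 2 * np.pi)

learning_rate = 0.1
F = np.full(len(y), y.mean())       # start from the constant model
trees = []
for _ in range(100):
    residuals = y - F               # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    F += learning_rate * tree.predict(X)   # descent step in function space
    trees.append(tree)

train_mse = np.mean((y - F) ** 2)
```

With a different loss function, only the "residuals" line changes: the tree is fitted to the negative gradient of that loss, which is what distinguishes gradient boosting from reweighting-based classical boosting.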
3. Problem Formulation
In this study, we consider an intersection located on Bogishamol Street in Tashkent, Uzbekistan, as the object of research. To manage this intersection using an intelligent system, we need the traffic flow parameters in each direction. These parameters include traffic intensity, traffic density, and average speed.
The information necessary for forecasting traffic flows can be obtained from cameras installed along this road. The data on vehicles moving from west to east, collected from video cameras, are presented in Table 1 below.
The data were collected from all four directions at the intersection, with a total of 17,373 records. However, we analyze information obtained from only one direction when developing the forecasting algorithm. The average and median values of daily traffic intensity and the average traffic volume on weekdays and weekends are presented in Figure 2 below.
The analysis of the graph in Figure 2a, which represents the mean and median values of traffic flow by hour of the day, indicates that the minimum road congestion occurs at night, approximately from 0 to 5 a.m., when the number of vehicles is around 500–900. From 6 a.m., there is a significant increase in traffic flow, reaching a peak between 8 and 10 a.m. with around 1500–1660 vehicles. After this, there is a slight decrease. However, starting from 11 a.m., a new rise begins, reaching its maximum around 5–6 p.m. (approximately 1600–1700 vehicles). In the evening, after 6 p.m., the number of vehicles gradually decreases until the end of the day.
The second graph, in Figure 2b, demonstrates the differences in traffic flow by day of the week for each hour. On weekdays (Monday to Friday), there are two distinct peaks: the morning peak around 8 a.m. and the evening peak around 5 p.m., corresponding to the start and end of the workday. On Saturday and Sunday, the number of vehicles is significantly lower compared to weekdays. On weekends, traffic flow remains relatively stable throughout the day, with a slight increase around noon, especially on Sunday.
Thus, the data show that the highest road congestion occurs during the morning and evening hours of weekdays, which is associated with rush hour traffic. On weekends, the traffic flow is significantly lower and more evenly distributed throughout the day, indicating reduced activity on these days.
To achieve a more precise data analysis, it is essential to utilize a correlation matrix. Correlation is a crucial tool in data analytics as it allows us to determine the degree of association between changes in different variables. In the context of machine learning, correlation analysis helps identify significant relationships between variables, which aids in simplifying the model, removing irrelevant data, and accelerating the training process. For instance, if the number of cars varies significantly depending on the day of the week and the time of day, as shown in the correlation matrix (Figure 3), these variables will be critical for predicting traffic flow. Correlation analysis also plays a vital role in preventing multicollinearity, thereby enhancing the stability and interpretability of the model.
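In practice, the correlation matrix is one pandas call. The sketch below uses synthetic columns with a hypothetical hour/vehicle-count relationship (not the collected dataset) to show how a strongly associated predictor stands out from an irrelevant one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic stand-in for the collected records: the vehicle count
# depends on the hour of day; sensor_id is deliberately unrelated.
hours = rng.integers(0, 24, 1000)
df = pd.DataFrame({
    "hour": hours,
    "vehicles": 900 + 30 * hours + rng.normal(0, 50, 1000),
    "sensor_id": rng.integers(0, 4, 1000),
})

corr = df.corr()    # Pearson correlation matrix of all numeric columns
# 'hour' correlates strongly with 'vehicles'; 'sensor_id' does not,
# so it is a candidate for removal before training.
```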
After identifying significant relationships, the data is divided into training and test sets in a ratio of 80/20 or 70/30. This ensures a balance between having sufficient data for training and for evaluating the model’s performance. Model evaluation can be performed using metrics such as mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination (R2). These metrics provide a quantitative assessment of the model’s accuracy and predictive capabilities, enabling researchers to determine the effectiveness of the model in capturing the underlying data patterns.
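This split-and-evaluate workflow can be sketched with scikit-learn. The data below are synthetic stand-ins for the traffic records, so the metric values are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 24, size=(1000, 1))
y = 1000 + 500 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 50, 1000)

# 80/20 split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
y_pred = model.predict(X_test)

# The three evaluation metrics used throughout this study
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```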
In this study, we used machine learning algorithms such as decision trees, random forests, and ensemble methods for forecasting. This choice was made because these models have proven to be simple to use in practice and have demonstrated good results in many areas.
4. Results
In this section, an assessment and comparison of the results are conducted. To evaluate our models, we use the aforementioned metrics. For building traffic flow forecasting models, we use the Python programming language version 3.11.5. This work is carried out in the Anaconda environment, which provides a convenient package and virtual environment management. In the development process, we make use of several libraries, including scikit-learn (sklearn), numpy, pandas, and matplotlib.
Initially, we forecast the traffic flow using a decision tree with a maximum branching depth (max_depth) of 5, 10, 100, and 200 (Figure 4). We evaluate the forecast quality on a sample split using the train_test_split method, as well as on a separate sample that contains data for 24 h and was not used in the model training process.
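The depth sweep can be reproduced in outline as follows. The data here are synthetic stand-ins, so the metrics and timings will not match Table 2:

```python
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.uniform(0, 24, size=(2000, 1))
y = 1000 + 500 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 60, 2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for depth in (5, 10, 100, 200):
    start = time.perf_counter()
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    elapsed = time.perf_counter() - start        # training time in seconds
    r2 = r2_score(y_test, tree.predict(X_test))
    results[depth] = (round(r2, 3), elapsed)
```

In the real experiment, the same loop would additionally score the held-out 24 h validation sample to detect the overfitting discussed below.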
Figure 4 presents the results of the decision tree model with depths of 5, 50, and 100 for a visual comparison. Graphs a, c, and e show the prediction results based on the test dataset, where the black lines represent actual values, and the dashed lines indicate predicted values. Graphs b, d, and f illustrate the model’s performance on a validation dataset collected on 10 June 2023, which was not used in either the training or testing datasets.
To understand the impact of tree depth on model performance, we evaluated a decision tree regressor with varying maximum depths using two datasets: a test sample and a validation dataset from 10 June 2023. The performance metrics include the coefficient of determination (R2), mean squared error (MSE), mean absolute error (MAE), and training time in seconds. The results are summarized in Table 2 below.
For the test sample, the following trends were observed. As the tree depth increased, the R2 value improved from 0.89 at depth 5 to 0.94 at depths 50, 100, and 200, indicating a better fit of the model to the data. MSE and MAE values decreased significantly with increasing depth, from 18,110.99 and 89.87 at depth 5 to approximately 10,233.33 and 65.00 at depths 50, 100, and 200, respectively. This shows that deeper trees provide more accurate predictions. As expected, the training time increased with tree depth, rising from 0.0034 s at depth 5 to 0.0389 s at depth 200. Although the increase in training time is noticeable, it is relatively small compared to the improvements in the performance metrics.
For the control data, the results were somewhat different. The R2 value remained relatively stable around 0.83 at depths 5, 50, and 100, but slightly decreased to 0.80 at depth 200. This indicates that while deeper trees fit the training data better, they may not generalize as well to new data. MSE values showed minor changes: 7656.67 at depth 5 and a slight decrease to 7500.00 at depth 100, but MSE increased to 8733.33 at depth 200, indicating potential overfitting. MAE values remained stable around 70 at depths 5, 50, and 100, but increased to 81.66 at depth 200, further suggesting a decline in performance on the new data.
From this analysis, we can conclude that increasing the maximum tree depth generally improves model performance on the training and test samples, as evidenced by higher R2 values and lower MSE and MAE values. However, the control data results indicate that excessively deep trees (depth 200) may lead to overfitting, reducing the model’s ability to generalize to new data. The optimal tree depth should balance between training accuracy and generalization ability, with depths around 50–100 providing the best overall performance for our datasets.
Next, we consider ensemble methods, starting with random forest. For the random forest model, we tune two main hyperparameters: the maximum depth of the trees in the forest (max_depth) and the number of trees in the forest (n_estimators). To evaluate the forecasting quality, we use the metrics mentioned earlier, including R2 (coefficient of determination), MSE, and MAE. The research results are presented in Figure 5 and in Table 3. The figure does not show every stage of the tuning, only those steps where changing the hyperparameter produced a visible difference.
Figure 5 presents 8 images depicting the predictions of the random forest model with a maximum tree depth of 50 and the number of trees varying from 10 to 1000. Graphs a, c, e, and g show predictions on the test sample, while graphs b, d, f, and h display predictions on the validation sample, similar to the decision tree models. A more detailed analysis was conducted based on the data presented in Table 3.
Analyzing the results, it can be observed that increasing the number of trees in the forest improves the prediction metrics. For example, with 10 trees and a tree depth of 50, the coefficient of determination R2 on the test sample is 0.95, the mean squared error (MSE) is 8891.83, and the mean absolute error (MAE) is 54.08. The training time for this configuration is 0.1154 s. When increasing the number of trees to 100, R2 remains at 0.95, but MSE decreases to 7849.14, and MAE to 49.04, with a training time of 0.9674 s. Similar improvements are observed with further increases in the number of trees to 200 and 1000, although increasing the number of trees beyond 100 does not significantly improve the metrics but substantially increases the training time.
The validation sample also shows improved prediction accuracy with an increasing number of trees. For instance, with 10 trees, R2 is 0.85, MSE is 6682.83, and MAE is 65.08. When increasing the number of trees to 100, R2 rises to 0.88, MSE decreases to 5298.39, and MAE to 59.35, with a training time of 0.2513 s. With further increases to 1000 trees, the metrics on the validation sample show consistently high results, confirming the overall trend of improved model accuracy with more trees, although the training time also increases, which must be considered when choosing model parameters.
Therefore, based on the obtained results, it can be concluded that the optimal parameters for the random forest model include a maximum tree depth of 50 and a number of trees of 100. These parameters provide high prediction accuracy on both test and validation samples while maintaining acceptable training time. However, it should be noted that further increasing the number of trees to 1000 does not significantly improve the metrics, making it more rational to use 100 trees to avoid excessive training time without substantial improvement in prediction quality.
The last model, and the one most commonly used for nonlinear regression tasks, is gradient boosting. In gradient boosting, we need to tune three parameters: the maximum depth of the trees, the number of trees in the ensemble, and the learning rate. The results of the gradient boosting research are presented in Figure 6 and Table 4.
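A sketch of the gradient boosting configuration with the three tuned hyperparameters, again on synthetic stand-in data, so the scores will differ from Table 4:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(9)
X = rng.uniform(0, 24, size=(1000, 1))
y = 1000 + 500 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 60, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The three hyperparameters tuned in this study:
# max_depth, n_estimators, and learning_rate
gbr = GradientBoostingRegressor(max_depth=5, n_estimators=100,
                                learning_rate=0.1,
                                random_state=0).fit(X_train, y_train)
r2 = r2_score(y_test, gbr.predict(X_test))
```

Sweeping each of the three parameters in such a loop, while scoring both the test and validation samples, yields a table of the kind analyzed below.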
Analysis of Table 4 shows that with a maximum tree depth of 5 and n_estimators set to 10, the model achieves an R2 value of 0.77 on the test sample and 0.68 on the check data. The MSE and MAE values are 40,839.59 and 152.09, respectively, for the test sample, indicating moderate accuracy with relatively high errors. As the tree depth increases to 50, the R2 value improves to 0.83 for the test sample but falls to 0.58 for the check data, with MSE decreasing to 29,548.97 and MAE to 133.62 for the test sample, showing improved performance on the test sample but still relatively high errors for the check data.
When the maximum tree depth is set to 5 and the number of trees increases to 100, the model’s performance significantly improves. The R2 value reaches 0.95 for the test sample and 0.93 for the check data, with MSE dropping to 8164.95 and MAE to 58.61 for the test sample. This demonstrates the effectiveness of increasing the number of trees to improve model accuracy and reduce errors. Further increasing the number of trees to 500 and 1000 shows only marginal improvement in R2 values, with the test sample reaching 0.96 and the validation dataset 0.92, indicating diminishing returns in performance gains.
Similarly, with max_depth set to 50, the results show stable performance with R2 values around 0.94–0.95 for the test sample and 0.81–0.84 for the check data, with MSE and MAE remaining relatively stable across different tree quantities. This suggests that increasing tree depth can improve model performance up to a certain point, after which gains are minimal.
Therefore, the optimal performance of the gradient boosting model is observed with a maximum tree depth of 5 and n_estimators set to 100, achieving high accuracy with an R2 value of 0.95 for the test sample and 0.93 for the check data, as well as low errors with MSE 8164.95 and MAE 58.61. Further increasing the number of trees beyond this point yields only slight performance improvements, indicating the model fully utilizes its capabilities with this configuration.
Analyzing all three models, we came to the following conclusions: decision tree provides a simple and interpretable model, suitable for tasks where the explainability of decisions is important. However, it is prone to overfitting and less robust to data changes. Random forest improves accuracy and robustness by using an ensemble of trees but requires more computational resources and is less interpretable. Gradient boosting offers the highest accuracy and the ability to handle complex data but at the cost of high computational expense and the need for careful tuning. Therefore, the choice of model depends on the specific requirements of the task: decision tree is suitable for quick and interpretable solutions, random forest for balancing accuracy and robustness, and gradient boosting for maximum accuracy on complex data.