1. Introduction
In recent years, as the economy has grown and people's quality of life has improved, the demand for transportation has increased, and vehicles have progressively become the preferred mode of transportation. However, this has increased traffic congestion and intensified the contradiction between the supply and demand of road traffic. As a result, comprehensive technologies and methods are urgently needed to properly control and monitor traffic flow, as well as to alleviate traffic congestion and related issues.
Traffic-flow prediction is fundamental in traffic management and congestion relief, and its accuracy is critical in resolving traffic-congestion issues. A vast number of experts have done extensive research on this in recent years, primarily utilizing linear or nonlinear models, as follows:
- (1)
Linear model
The historical average forecasting methods, the time series forecasting methods, and the Kalman filter forecasting methods were all used in the early days of traffic-flow research. Some scholars use simple linear models to predict traffic flow, such as the autoregressive integrated moving average (ARIMA) model, which is suitable for predicting data with temporal regularities; however, traffic flow has a strong non-linear trend, so this model's prediction accuracy for traffic flow is not high and has limitations [
1,
2,
3]. D. Cvetek et al. used the collected data to compare some common time series methods such as ARIMA and SARIMA, showing that the ARIMA model provides better performance in predicting traffic demand [
4].
The Kalman filter is also used as a linear theory-based prediction method by many scholars. Okutani first applied the Kalman filter to traffic-flow forecasting [
5]. To address the inherent shortcomings of the Kalman filter's fixed variance, Guo et al. proposed an adaptive Kalman filter that updates the variance adaptively, which improved the prediction performance of the original model [
6]. Israr Ullah et al. developed an artificial neural network (ANN)-based learning module to improve the accuracy of the Kalman filter algorithm [
7]. Additionally, in an experiment on indoor environment prediction in a greenhouse, good prediction results were obtained. Therefore, the Kalman filter model can effectively reduce the uncertainty and noise in flow changes during the prediction process, but it is difficult for it to predict the nonlinear trend of traffic flow.
- (2)
Non-linear model
With the recent development of technology, powerful computational and mathematical models have been widely applied to this field [
8]. Among them, the wavelet neural network, as a representative nonlinear theoretical model, has a better traffic-flow prediction effect. Gao et al. used this network model to predict short-term traffic flow and achieved good results [
9]. Although the wavelet neural network converges faster and the prediction accuracy is higher, the existence of the wavelet basis function increases the complexity of the model.
Machine learning models have become research hotspots and have been widely used in many fields; they are also well suited to the field of traffic-flow prediction. Qin et al. proposed a new state-of-charge (SoC) estimation method that accounts for the impact of temperature on SoC estimation and uses limited data to rapidly adjust the estimation model to new temperatures, which not only reduces the prediction error at a fixed temperature but also improves the prediction accuracy at a new temperature [
10]. Xiong Ting et al. used the random forest model to predict the traffic flow and achieved high prediction accuracy, based on the combination of spatio-temporal features [
11]. Lu et al. used the XGBoost model to predict the traffic flow at public intersections in Victoria and achieved high prediction accuracy [
12]. Alajali et al. used the GBDT model to analyze the lane-level traffic flow data on the Third Ring Road in Beijing on the basis of feature processing and proved that the model has a good prediction effect and is suitable for the traffic prediction of different lanes [
13]. On the basis of extracting features, Yu et al. used the KNN model to complete the prediction of the traffic node and route traffic flow, which achieved good prediction results [
14].
Therefore, it can be concluded that integrated models based on decision trees are widely used and have high prediction accuracy, while the KNN model can eliminate the sensitivity to abnormal traffic flow in the prediction. Qin et al. proposed a slow-varying dynamics-assisted temporal CapsNet (SD-TemCapsNet) that introduces a long short-term memory (LSTM) mechanism to simultaneously learn slow-varying dynamics and temporal dynamics from measurements, achieving accurate remaining-useful-life (RUL) estimation [
15]. Although LSTM has been used by many scholars as a network model with high accuracy in time-series prediction, the complexity of the network itself is difficult to avoid. The gated recurrent unit (GRU) model can effectively solve this problem: it can predict traffic with fewer parameters while meeting a certain prediction accuracy. Dai et al. used the GRU model to predict traffic flow while making full use of the features and verified the effectiveness of the model through a comparative analysis with a convolutional neural network [
16].
Although machine learning models perform well in traffic-flow prediction, the prediction performance of the single model is limited. Therefore, a model combining multiple single models has gradually become a trend [
17]. Pengfei Zhu et al. integrated the GRU and BP to predict the frequency shift of unknown monitoring points, which effectively improved the prediction accuracy of a single model [
18]. Although the above combined models can improve the accuracy to a certain extent, they are limited by the number of single models. The integrated model that mixes multiple models is gradually becoming favored by scholars and has been applied to various fields [
19]. Shuai Wang et al. proposed a probabilistic approach using stacked ensemble learning that integrates random forests, long short-term memory networks, linear regression, and Gaussian process regression, for predicting cloud resources required for CSS applications [
20]. Common ensemble models include bagging [
21], boosting [
22], and stacking [
23]. Compared with other ensemble models, the stacking model has a high degree of flexibility, which can effectively integrate the changing characteristics of heterogeneous models to make the prediction results better.
In summary, the single prediction model has limitations, and combined forecasting models have gradually become a trend. Common methods for integrating single models include the entropy combination method, the inverse-error combination method, the ensemble learning method, and other combination methods [
24,
25]. Among them, the ensemble learning method is the most practical. The bagging and boosting ensemble models, generally used for homogeneous single models, are limited to a single model type, while the stacking ensemble model is more commonly used for the fusion of heterogeneous models. Therefore, this paper first uses the bagging model to optimize the base learners and then optimizes the stacking model, to improve the overall performance of the model.
2. Establishment of the DW-Ba-Stacking Model
In this section, the DW-Ba-Stacking model is put forward in detail. The DW-Ba-Stacking model consists of three parts: the stacking model (Stacking), the bagging model (Ba), and the dynamic weighting adjustment (DW).
2.1. Stacking Model
Traffic flow trends are complex, and there are various models used in this field, among which machine learning models are widely used in traffic-flow prediction due to their good non-linear fitting. In order to obtain a stacking model with high accuracy, machine learning models with different merits and good applications in this field are selected for fusion: the random forest model, which is less prone to overfitting; the KNN model, which is insensitive to outliers; the decision-tree model; the XGBoost and GBDT models; the GRU model, which can effectively use temporal features; and the K-fold cross validation to prevent overfitting.
2.1.1. Principle of the Stacking Model
The stacking model obtains the final prediction by linear or non-linear processing of the sub-learners' outputs. The main principle is that the original data are first predicted by the base learners, and then the predictions are passed to the meta-learner to obtain the final result. To prevent overfitting, the data are usually trained with K-fold cross-validation, as follows.
Let the original data set be D = {(x_i, y_i)}, i = 1, …, n, where x_i are the feature variables of the sample, y_i are the predictor variables of the sample, and the number of base learners is N. The data set is divided into K folds; in turn, one fold D_k is used as the validation set and the remaining data D − D_k as the training set. The divided data are fed into each base learner for training, and the predictions on D_k are obtained. The predictions from the base learners and the actual values y_i are then used as the training set for the meta-learner, which trains the model and makes predictions.
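The K-fold stacking procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact configuration: the base learners, meta-learner, and synthetic data here are arbitrary choices.

```python
# Minimal sketch of the stacking principle: base learners produce
# out-of-fold predictions via K-fold cross-validation, and those
# predictions train the meta-learner.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                   # feature variables x_i
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

base_learners = [RandomForestRegressor(n_estimators=50, random_state=0),
                 KNeighborsRegressor(n_neighbors=5)]

# Each sample is predicted by a model trained on the folds that
# exclude it, which prevents the meta-learner from overfitting.
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_learners])

# The meta-learner is trained on the base-learner predictions.
meta = Ridge(alpha=1.0).fit(Z, y)
print(meta.score(Z, y))
```

Because the meta-learner only ever sees held-out predictions during training, the stacked model generalizes better than refitting the meta-learner on in-sample base-learner outputs.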
2.1.2. Machine Learning Models
Random Forest and KNN Models
The Random Forest model is a modified bagging algorithm. When the model is used for regression, the single model being integrated is the CART regression tree. First, m samples are drawn by bootstrap sampling with replacement; then, regression trees are modelled on the m different samples to form the forest; finally, the average of the predictions from the different regression trees is taken as the final prediction. The samples and features of the regression trees in the model are chosen randomly. Each regression tree built through bootstrap sampling is independent and uncorrelated. This increases the variation between trees and enhances the generalization ability of the model, while the random nature of feature selection reduces the correlation between trees. As the number of regression trees increases, the model error gradually converges, which reduces the occurrence of overfitting. This is why the model was selected as one of the base learners.
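The bootstrap-and-average mechanism described above can be written out directly. This sketch builds the forest by hand to expose the mechanism; the tree depth and ensemble size are illustrative, and a real application would simply use a library random-forest implementation.

```python
# Illustrative sketch of the random-forest idea: bootstrap-sample the
# data with replacement, fit an independent CART regression tree on
# each sample, and average the tree predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=300)

trees = []
for _ in range(30):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample with replacement
    trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

# Final prediction = average over the independent trees.
pred = np.mean([t.predict(X) for t in trees], axis=0)
print(np.mean((pred - y) ** 2))                  # ensemble mean squared error
```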
When the KNN model is used for classification, it determines the category of a sample by searching for the K most similar samples in the historical data. The principle can be expressed as follows: let the training set be S = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_i is the feature vector and y_i is the category of the example sample. The Euclidean distance is used to express the similarity between the sample to be classified, X, and the feature samples in S. The Euclidean distance between the observed sample and each feature sample is calculated; based on the calculated distances, the K closest points to X are found in S, and the category of X is determined. The principle is shown in
Figure 1. There are n samples belonging to m different categories. By computing the Euclidean distance between sample X and the n training samples, the M samples closest to X are obtained; if most of these M samples belong to a certain category, then sample X is also assigned to that category. The model can be applied to both discrete and continuous features and is insensitive to outliers, so it is used as a base learner.
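The KNN rule above reduces to a few lines of numpy: compute Euclidean distances, take the K nearest labels, and vote. The toy data set here is purely illustrative.

```python
# Minimal sketch of the KNN rule: classify X by the majority category
# among its K nearest training samples under Euclidean distance.
import numpy as np
from collections import Counter

def knn_predict(S_X, S_y, x, k=3):
    d = np.linalg.norm(S_X - x, axis=1)      # Euclidean distances to all samples in S
    nearest = S_y[np.argsort(d)[:k]]         # categories of the K closest points
    return Counter(nearest).most_common(1)[0][0]

S_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
S_y = np.array([0, 0, 1, 1])
print(knn_predict(S_X, S_y, np.array([0.2, 0.1]), k=3))   # → 0 (majority of 3 nearest)
```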
Decision Trees, and the GBDT and XGBoost Models
A decision tree is a model consisting of nodes and directed edges that makes predictions through the correspondence between attributes and objects. The internal nodes are the features of the object, and the leaf nodes are the classes of the object. The model has a wide range of applications, is efficient, and is suitable for high-dimensional feature processing, which is why it has been chosen as one of the traffic-flow prediction models. It aims to summarize rules from the training dataset and eventually reach the correct result; the essence is to find the optimal decision tree. The three main steps in this search are attribute selection, decision-tree generation, and decision-tree pruning, and the key to generation is the division by the optimal attribute. Purity is the measure used for attribute division, and the evaluation metrics for measuring purity include information gain, gain rate, and the Gini index. The principle is shown in
Figure 2.
Both GBDT and XGBoost are algorithms that evolved from boosting. GBDT is formed by continuously fitting the residual error, updating the learners along the gradient. When the residual error reaches a certain limit, the model stops iterating and forms the final learner. The model fits non-linear data very well. However, its computational complexity increases when the dimensionality is high; since traffic flow has relatively few feature dimensions, the model is suitable for prediction in this area. The GBDT model is a linearly weighted combination of different weak regressors:

F_M(x) = Σ_{m=1}^{M} f_m(x)

where f_m(x) is a weak regressor. The loss function of the weak regressor is

L_m = Σ_{i=1}^{n} l(y_i, F_{m−1}(x_i) + f_m(x_i))

where l is the loss function.
XGBoost and GBDT share the same principle and integration scheme, with a process of continuously fitting the residuals and gradually reducing them. During the fitting process, the learner is updated with first-order and second-order derivatives. Specifically, the second-order Taylor expansion of the loss function plus a regularization term is used as the objective function in each round of iterations, and the parameters are updated by minimizing this objective. The regularization term in the objective function controls the complexity of the model, reduces the variance of the model, and makes the learning process easier, so this model is chosen as a base learner. The loss function L is

L = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k),  Ω(f) = γT + (1/2) λ ‖w‖²

In the formula, the first half is the error between the predicted and actual values; the second half is the regularization term; γ and λ are the penalty coefficients of the model.
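The residual-fitting loop at the heart of GBDT can be demonstrated directly. This is a hand-rolled sketch under squared-error loss (where the negative gradient equals the residual), not the paper's configuration; the learning rate, tree depth, and synthetic data are illustrative choices.

```python
# Sketch of the boosting idea behind GBDT: each new weak regressor is
# fitted to the residuals of the current ensemble F_{m-1}, and the
# ensemble is updated as F_m = F_{m-1} + lr * f_m.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.05, size=200)

pred = np.full(len(y), y.mean())    # F_0: constant initial model
lr, trees = 0.1, []
for _ in range(100):
    residual = y - pred                               # negative gradient for L2 loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    pred += lr * t.predict(X)                         # linearly weighted update
print(np.mean((pred - y) ** 2))                       # training error shrinks per round
```

XGBoost refines this loop by using a second-order Taylor expansion of the loss plus the Ω regularization term at each round, rather than the plain residual.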
GRU Model
Deep learning is a branch of machine learning. It can adapt well to the changing characteristics of data when the amount of data is appropriate, and it has gradually been applied to various fields with good results. Zheng Jianhu et al. relied on deep learning (DL) to predict traffic flow through a time-series analysis and carried out long-term traffic-flow prediction experiments based on an LSTM-based traffic-flow prediction model, the ARIMA model, and the BPNN model [
26]. It can be seen that time-series methods have won the favor of many scholars and that the GRU is one of the more mature networks for processing time series in recent years. The earliest network proposed to deal with time series is the RNN, but it is prone to vanishing gradients, which degrades network performance. Zhao et al. used long short-term memory (LSTM) to predict traffic flow while considering spatial factors in the actual prediction process and achieved high prediction accuracy [
27], but the network model also has the disadvantage of poor robustness. To solve this problem, Li Yuelong et al. optimized the prediction performance of the network through a spatial feature fusion unit in the network [
28]. It can be seen that although LSTM is used by many scholars as a network model with high time series prediction accuracy, the complexity of the network itself is difficult to avoid. The GRU model, on the other hand, can effectively reduce the network parameters while ensuring the performance of the model itself. Its structure is shown in
Figure 3.
The GRU cell is governed by the following gate equations:

z_t = σ(W_z · [h_{t−1}, x_t] + b_z)
r_t = σ(W_r · [h_{t−1}, x_t] + b_r)
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t] + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where ⊙ is the product of the corresponding positions of two matrices, σ is the activation function, W_z, W_r, and W_h are the weight parameters of the network, b_z, b_r, and b_h are the bias parameters of the network, and h_t is the state value of the hidden layer at each moment. The reset gate r_t determines the input ratio of the previous state information h_{t−1} to the current network cell; the update gate z_t determines the deletion ratio of the previous state information. The entire network cell is filtered by the two gates to determine the valid information of the cell. Compared with the LSTM model, the GRU model has one fewer gate unit, setting only the reset gate and update gate to control the input and output information of the network unit, which reduces the complexity of the network and improves the network training speed.
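A single GRU cell step can be written out in plain numpy following the standard gate equations (the convention h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t is assumed here). The dimensions and random weights are illustrative only.

```python
# One GRU cell step in plain numpy: update gate z, reset gate r,
# candidate state h_tilde, and the gated combination for h_t.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ v + bz)                     # update gate
    r = sigmoid(Wr @ v + br)                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1 - z) * h_prev + z * h_tilde        # new hidden state h_t

rng = np.random.default_rng(3)
H, D = 4, 2                                      # hidden size, input size
Wz, Wr, Wh = (rng.normal(size=(H, H + D)) for _ in range(3))
bz = br = bh = np.zeros(H)
h = gru_step(rng.normal(size=D), np.zeros(H), Wz, Wr, Wh, bz, br, bh)
print(h.shape)                                   # → (4,)
```

Note that the cell needs only the three weight matrices W_z, W_r, W_h, one fewer gate than an LSTM cell, which is the parameter saving the text refers to.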
2.2. Bagging Model
The overall architecture of the Ba-Stacking model includes a bagging processing stage and a stacking processing stage. Because bagging is embedded only as part of the stacking model, the stacking architecture plays the dominant role. Its two main processing phases are the base-learner phase and the meta-learner phase. The base-learner phase requires different base learners to produce prediction results, so the choice of base learners plays an important role. The meta-learner phase matters because it consumes a large amount of raw information from the base learners, so how effectively that information is used affects the final prediction results. However, the output information of different base learners is partly duplicated, and the variability of the data is not strong enough to extract the effective information from the output. Therefore, to address the problem that the base learners' output information cannot be fully utilized, it is necessary to consider how to use this information effectively and reflect its importance and variability.
To further improve the stacking model, this paper considers the use of the bagging algorithm to further optimize the base learner and reduce the base learner variance, as two ways to improve the potential performance of the meta-learner model in the stacking model.
Considering that the prediction effect of the base learner directly affects the final effect of the integrated model, the prediction effect of the base learner of the stacking-integrated model is optimized by the bagging algorithm. To better extract the base learner features, a ridge regression with linearity is used as the meta-learner, and the overall construction principle is shown in
Figure 4.
The process of this model is to optimize the data features of the stacking base learner based on its output information through the bagging algorithm and then further input this optimized data into the meta-learner in the stacking-integrated model for traffic prediction. The process consists of three parts: the first part builds the stacking base learner model by comparing and analyzing different features to obtain the optimal base learner model; the second part builds the stacking model and obtains the optimal stacking model by comparing and analyzing different base learner models and meta-learner models; finally, the bagging model is combined into the stacking model to build the Ba-Stacking model.
2.3. DW Model
The entropy value expresses the uncertainty of each value. The traditional entropy weighting method assigns fixed weight coefficients to each model, but the degree of certainty of the base learner output varies across positions, and it can be deduced from the certainty at each specific position in each model.
Let the single-model prediction ŷ_i be the base learner prediction and y_i be the actual value; the absolute error is e_i = |ŷ_i − y_i|. The entropy value is

H = −Σ_i p_i ln(p_i + 0.5), with p_i = e_i / Σ_j e_j

The addition of 0.5 inside the ln function in Equation (10) accommodates zeros in the original series. H is the entropy value derived from the error values e_i, where e_i is the absolute error indicator value. Because the features of the meta-learner in the stacking-integrated model are the information features output by the base learners, and the uncertainty of a base learner at different positions can be known from its entropy values, introducing weights enhances the variability of the base learners' output information, which in turn improves the overall performance of the model. The degree of uncertainty of different models is determined by introducing the entropy value after the MSE is calculated, which is used when the dynamic parameters are computed.
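The entropy calculation above can be sketched as follows. This is a hedged reading of the description: the absolute errors are normalized into proportions p_i, and ln(p_i + 0.5) keeps the logarithm defined when an error is zero. The exact normalization used in the paper may differ.

```python
# Sketch of the entropy-of-errors measure: larger, more evenly spread
# errors yield a different entropy than concentrated ones, which is
# used to gauge the uncertainty of a base learner.
import numpy as np

def entropy_of_errors(pred, actual):
    e = np.abs(pred - actual)            # absolute error e_i at each position
    p = e / e.sum()                      # error proportion p_i per position
    return -np.sum(p * np.log(p + 0.5))  # 0.5 offset keeps ln defined at p = 0

pred = np.array([1.0, 2.1, 2.9, 4.2])
actual = np.array([1.0, 2.0, 3.0, 4.0])
print(entropy_of_errors(pred, actual))
```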
2.4. Model Construction
2.4.1. Dynamic Weighting Adjustment Model Process
In the stacking model, the degree of data deviation at different locations in the base learner output information varies, and fixed weighting cannot capture its dynamic change pattern, so dynamic weighting coefficients are designed in the model.
The coefficient is designed outside the meta-learners, and the dynamic weight coefficients are first solved according to the degree of deviation at different positions, and then the dynamic weight coefficients are weighted to adjust the base learner output information to achieve the extraction of dynamic change patterns. The weighting coefficients here include error weighting and entropy weighting.
Let ŷ_{ij} be the predicted value of base learner j at position i, y_i the actual value, n the number of elements, N the number of base learners, and ȳ_j the predicted mean value of base learner j. The adjustment of the output information of the base learner is

ŷ′_{ij} = w_{ij} · ŷ_{ij}

In the process of adjustment, the key lies in the solution of the dynamic weight coefficients w_{ij}. The solution process is as follows:
- (1)
Calculate the absolute error of each element, i.e., the degree of deviation of each element: the absolute value of the difference between the predicted value and the actual value of the base learner;
- (2)
Calculate the deviation rate and the average deviation rate of each element, i.e., the normalized value of the absolute error and the normalized mean value of the absolute error of each column, respectively;
- (3)
Calculate the contribution rate and the average contribution rate of each element, i.e., 1 minus the deviation rate and 1 minus the average deviation rate, respectively.
The contribution rate calculated in Equation (14) is the dynamic weight coefficient. The adjusted output information reduces the influence of errors and deviations on the prediction results, making the information features more representative. The coefficient matrices are used to adjust the training set and test set. The specific process is as follows:
Adjust the change rule of the predicted value of the base learner: use the product of the predicted value of different positions and the dynamic weight coefficient as the new data. The specific process is shown in
Figure 5.
Adjust the overall change law of the predicted value of the training set of the base learner: use the product of the predicted value of different positions and the average dynamic weight coefficient in the training set as the new data. The specific process is shown in
Figure 6.
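The dynamic weighting steps above can be sketched in numpy under this reading: the deviation rate is the column-normalized absolute error, and the contribution rate (1 minus the deviation rate) is the dynamic weight applied element-wise to each base learner's predictions. The variable names and toy matrix are illustrative.

```python
# Sketch of the dynamic weight adjustment: positions where a base
# learner deviates more from the actual value receive smaller weights.
import numpy as np

pred = np.array([[10.0, 12.0],
                 [20.0, 19.0],
                 [30.0, 33.0]])               # 3 positions x 2 base learners
actual = np.array([10.0, 21.0, 31.0])

err = np.abs(pred - actual[:, None])          # absolute error per element
dev_rate = err / err.sum(axis=0)              # deviation rate, normalized per column
weight = 1.0 - dev_rate                       # contribution rate = dynamic weight
adjusted = weight * pred                      # adjusted base-learner output
print(adjusted.shape)                         # → (3, 2)
```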
2.4.2. Ba-Stacking Model Optimization Process
The principle of the improved stacking ensemble model is shown in
Figure 7. Assume that the traffic-flow data sequence has n records, m is the number of characteristic variables, the original data set is D = {(x_i, y_i)}, y_i is the predictor variable, and x_i is the characteristic variable. The specific steps of the model are as follows:
- (1)
Divide the original data into the training set and test set;
- (2)
Construct the corresponding prediction models, including random forest, XGBoost, the GBDT, and the decision-tree model;
- (3)
Use the training set and test set to obtain the corresponding predicted values of the different models through the bagging algorithm;
- (4)
Using these predicted values, obtain the weight coefficients through the different adjustment methods, followed by the adjusted flow data of the base-learner models;
- (5)
Using the adjusted data, build the meta-learner ridge-regression model to obtain the final traffic prediction values of the improved stacking-integration model;
- (6)
Train the model with the training set. Once trained, the model will be tested using the test set.
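Steps (1)-(6) can be sketched end to end. This is a hedged illustration, not the paper's implementation: the base learners, hyperparameters, and synthetic data are arbitrary, and the weight-adjustment step is taken as the identity for brevity.

```python
# End-to-end sketch of the Ba-Stacking flow: split the data, build
# bagging-style base learners, collect their out-of-fold predictions,
# and train a ridge-regression meta-learner on them.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Step (1): divide the original data into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps (2)-(3): bagging-optimized base learners.
bases = [BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                          n_estimators=20, random_state=0),
         RandomForestRegressor(n_estimators=50, random_state=0),
         GradientBoostingRegressor(random_state=0)]

# Step (4) is the weight adjustment (identity here); step (5): meta-learner.
Z_tr = np.column_stack([cross_val_predict(b, X_tr, y_tr, cv=5) for b in bases])
meta = Ridge().fit(Z_tr, y_tr)

# Step (6): evaluate on the held-out test set.
Z_te = np.column_stack([b.fit(X_tr, y_tr).predict(X_te) for b in bases])
print(meta.score(Z_te, y_te))
```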
5. Conclusions
With socio-economic development, traffic congestion occurs more and more frequently. Traffic-flow prediction supports the effective management and monitoring of traffic flow, and its prediction accuracy plays a crucial role in solving traffic-congestion problems. Machine learning algorithms have long been applied to the field of traffic-flow prediction, but individual models are greatly limited in their predictive power. Therefore, this paper applies the stacking-integrated learning model, which has been widely used in various fields in recent years, to traffic-flow prediction and provides a new idea for such prediction. A series of improvement measures is carried out to address the shortcomings of the traditional stacking-integrated learning model. The main objectives of this paper are as follows:
- (1)
To overcome the shortcomings of traffic-prediction models that use a single feature, temporal features such as holidays and historical features such as speed are constructed. Traffic flow is recorded by detectors, so the timestamp of each recorded parameter is well defined. In this paper, different time-feature information is extracted according to the specific time of each record: holiday information, weekend information, and peak-hour information; historical speed and occupancy features related to traffic flow are constructed from the original data features, and the rationality of the introduced features is verified through comparative analysis of different features. Thus, the best effect is obtained.
- (2)
The stacking integration model with the highest accuracy is obtained by filtering and optimizing the learners. First, we build machine learning models with different merits; then, we analyze the correlation coefficient between each model and the actual information by using the Pearson correlation coefficient; next, we select the stacking-integrated model with the highest prediction accuracy based on the weight of each model; and, finally, we embed the bagging model in this model to further improve the prediction accuracy of the model.
- (3)
According to the shortcomings of the stacking-integrated model, the second layer of the stacking model is used as the object of improvement. With the goal of enhancing the variability between models and the correlation between predicted and actual information, the weights of the different base-learner models are adjusted so that the prediction accuracy is higher.
The main innovative work of this paper is to achieve the following:
- (1)
Realize the effective combination of the stacking model and bagging model, i.e., the construction of Ba-Stacking. The bagging model is used to optimize the output information features of the base learner in the stacking model, and the construction of the Ba-Stacking model is completed.
- (2)
Based on the Ba-Stacking model, the DW-Ba-Stacking model is constructed by weighting coefficients. The Ba-Stacking model with the meta-learners as ridge regression optimizes the base learner feature information by error coefficient.
In summary, this paper not only introduces the stacking-integrated model, which can effectively improve the accuracy of traffic-flow prediction, but also proposes an improved DW-Ba-Stacking model, which further improves prediction accuracy while adjusting the internal structure, and provides a reference for the development of traffic-management strategies and implementation plans. However, in the process of improving the stacking ensemble model, this paper focuses only on prediction accuracy and does not consider time efficiency, so the improvement has some limitations. In the future, the improved method can be applied to other fields with practical significance.