## **4. Study Area**

This study was conducted in the city of Beijing, China, which covers an area of 16,410 km² and hosts 21.7 million people. Road transportation is an integral part of the city's routine business, linking most households to workplaces and schools. There are 21,885 km of paved public roads in Beijing (as of June 2016), 982 km of which are classified as highways [68]. According to the Beijing census, the number of private cars was close to 5.4 million, in addition to 5.3 million other vehicles in various categories, including 330,100 trucks. The Second Ring Road encloses six percent of Beijing's urban space, with clusters of major companies, businesses, and administrative institutions, yet generates 30% of the daily traffic volume [68,69]. Integrated urban planning is therefore becoming difficult, all the more so because 60% of the city's historical sites lie along the Second Ring Road. Since traffic hotspots are concentrated mainly in the center of Beijing, we chose the study area at this location [68,70]. The Second Ring Road is approximately 33 km long and includes 37 on-ramps and 53 off-ramps. Figure 4 shows the study area on the Second Ring Road along with the other ring roads. In this study, a basic freeway segment of the Second Ring Road (L = 478.5 m) was selected.

#### *Data Collection and Parameter Settings*

The first step in preparing the experiment was to develop a microscopic model using VISSIM (a microscopic traffic simulation package) to capture all the essential data for the Second Ring Road. When simulating field conditions, it is essential to calibrate the driving behavior parameters of the traffic simulator, and this was accomplished following the standard procedures reported in existing work [71]. To this end, several simulation runs were performed, each with a different random seed, to ensure that the model reflects real-world conditions. The proposed methodology for the present study is presented in Figure 5.

**Figure 4.** The Second Ring Road (from Google Maps). Note: The Chinese words on the map indicate names of surrounding infrastructure.

**Figure 5.** Proposed methodology for the study.

In this study, macroscopic traffic parameters (volume, speed, density) were obtained from the VISSIM simulation analysis. Traffic volume, or flow rate, is the number of vehicles that pass a point on a highway or lane in a given time, usually expressed in vehicles per hour per lane (veh/h/ln), while density refers to the number of vehicles occupying a unit length of roadway, expressed in vehicles per km (or mile) per lane (veh/km/ln). Occupancy is sometimes used synonymously with density; however, it denotes the percentage of time that a road segment is occupied by vehicles. Traffic speed, another important state parameter, is the distance traversed per unit of time and is typically expressed in km/h or miles/h. These parameters were calculated using the link evaluation in VISSIM. Once the factual freeway architecture was achieved, the key macroscopic characteristics were identified in order to adjust the entire microscopic simulator (e.g., demand flow and split ratio). Demand flow is defined as the traffic volume utilizing the facility, while the split ratio is the directional hourly volume (DHV) in the peak direction, which varies over time between peak and off-peak periods. Additionally, the real traffic state of the Second Ring Road in this study was obtained from the Beijing Collaborative Innovation Center for Metropolitan Transportation. The road network model for the Second Ring Road was then constructed in VISSIM. It has three lanes, each with an average width of 3.75 m, as shown in Figure 6. Simulations in VISSIM were carried out for 6 h, from 6:00 am to 12:00 pm; a congested regime prevailed for 1.5 to 2 h (between 7:30 am and 9:30 am), with near free-flow conditions during the remaining hours. The transition states from D to F therefore yielded few labels. Meanwhile, data were collected using different prediction horizons of 5, 10, and 15 min.
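As a quick illustration of how these macroscopic parameters relate, flow, density, and speed are linked by the fundamental relation q = k · v. The sketch below (plain Python, with hypothetical values rather than the study's data) derives density from a flow and speed reading:

```python
def density_from_flow_speed(flow_vph_ln, speed_kmh):
    """Density k (veh/km/ln) via the fundamental relation q = k * v."""
    return flow_vph_ln / speed_kmh

# Hypothetical detector reading: 1800 veh/h/ln moving at 60 km/h
density = density_from_flow_speed(1800.0, 60.0)  # 30 veh/km/ln
```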

**Figure 6.** Basic freeway segment of the Second Ring Road (from Google Earth). Note: The Chinese words on the map indicate names of surrounding infrastructure.

To assess the freeway operations, level-of-service (LOS), a commonly used performance indicator, was used for qualitative evaluation. The data collected from the VISSIM simulation were further divided into six levels [72], where the LOS defines the traffic state of each level. The traffic state is usually characterized by the traffic density on a given link and is directly related to the number of vehicles occupying the roadway segment. It also represents the transient boundary conditions between two LOS levels. Moreover, to test efficacy, classification models were built with Python scripting in Orange and in Azure Machine Learning to extract the traffic parameters and the level-of-service corresponding to the Highway Capacity Manual (HCM) [43,73]. The data points in Figure 7 represent different points in time distributed spatially, which together define the LOS on the road segment. In that figure, the different colors show the states for 15 min, i.e., the LOS divided into six sub-levels based on density along the highway segment. We termed these levels different states (from A to F) and further evaluated them for 5, 10, and 15 min intervals. Since stratified K-fold cross-validation was adopted to address the issue of imbalanced data, the method aimed to choose proportionate frequencies for each LOS class. Thus, label D, or any other label, is likely to be associated with its true representative class. The actual density–flow captured on a segment of the Second Ring Road was simulated in VISSIM for a prediction horizon of 15 min and is shown in Figure 7.
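Mapping a measured density to a LOS state A–F is a simple threshold lookup. The sketch below uses HCM-style density breakpoints for basic freeway segments (in passenger cars per mile per lane); the exact breakpoints used in the study are not stated, so these values are illustrative:

```python
# HCM-style LOS density breakpoints for basic freeway segments
# (pc/mi/ln); the study's exact thresholds may differ.
LOS_BREAKS = [(11, "A"), (18, "B"), (26, "C"), (35, "D"), (45, "E")]

def level_of_service(density_pc_mi_ln):
    """Return the LOS label (A-F) for a given density."""
    for upper, label in LOS_BREAKS:
        if density_pc_mi_ln <= upper:
            return label
    return "F"  # densities above the last breakpoint are LOS F

labels = [level_of_service(d) for d in (8, 22, 50)]  # -> ["A", "C", "F"]
```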

**Figure 7.** The actual density–flow via VISSIM.

## **5. Results and Discussion**

#### *5.1. K-Fold Cross-Validation*

We selected the K-fold cross-validation method (with k = 10), which provides a better-fitted model and appropriate parameter settings. The original instances were randomly split into k equal parts. One part was used for validation, and the remaining k − 1 parts were used as the training set to develop the model. This procedure was repeated k times, each time with a distinct validation set, and the model's final accuracy was taken as the average of the accuracies achieved in each iteration. This technique has an advantage over repeated random sub-sampling in that all samples are used for both training and validation, with each sample used exactly once for validation. To avoid the problems of data imbalance and enhance the prediction accuracy of the proposed methods, several strategies have been suggested in previous studies. In this study, K-fold cross-validation was used to overcome the issues and bias associated with imbalanced and small datasets, since it is more efficient and robust than other conventional techniques and preserves the percentage of samples for each group or class. The parameters were selected via hyperparameter tuning to obtain the best accuracy.
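The stratified variant described above preserves per-class proportions in every fold. A minimal sketch in plain Python (production code would typically use scikit-learn's `StratifiedKFold`; the toy labels below are hypothetical):

```python
from collections import defaultdict

def stratified_k_fold(labels, k=10):
    """Yield (train_idx, val_idx) pairs that preserve class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Deal each class's indices round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        val = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, val

labels = ["D"] * 6 + ["F"] * 3          # imbalanced toy LOS labels
splits = list(stratified_k_fold(labels, k=3))
```

Every validation fold above contains two "D" samples and one "F" sample, matching the 2:1 class ratio of the full dataset.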

#### *5.2. Model Evaluation*

In this study, we selected the most common evaluation metrics to assess the performances of the models: the F score and accuracy. The F score, also known as the F-1 score or F measure, is a measure of a test's accuracy and is defined as the weighted (harmonic) average of recall and precision. To measure the overall performance of the models, the F-1 score was derived as follows:

$$F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{15}$$

Accuracy is one of the classification performance measures, defined as the ratio of correctly classified samples to the total number of samples, as follows [74]:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

where *TP* and *TN* denote the numbers of true positives and true negatives, and *FP* and *FN* denote the numbers of false positives and false negatives, respectively.
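Equations (15) and (16) translate directly into code. The sketch below computes both metrics from hypothetical confusion-matrix counts for one LOS class treated as the "positive" class:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-1 per Equation (15)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(tp, tn, fp, fn):
    """Accuracy per Equation (16)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Toy counts for one class treated as "positive" (illustration only)
p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
acc = accuracy(tp=80, tn=85, fp=10, fn=20)
```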

#### *5.3. Local Deep Kernel Learning SVM (LD-SVM)*

The tuning parameters include the LD-SVM tree depth, Lambda W, Lambda theta, Lambda theta prime, the number of iterations, and the sigmoid sharpness (sigma). Figure 8a shows the impact of the LD-SVM tree depth on accuracy; 92.00% accuracy was achieved at a tree depth of 3. The impact of the other parameters can be seen in Figure 8c. The best hyperparameter-tuned values for Lambda W, Lambda theta, Lambda theta prime, the number of iterations, and sigma were 0.00052, 0.34587, 0.1025, 49,247, and 0.0068, respectively; they are circled in the figure and were obtained using 10-fold cross-validation. Figure 8b shows the predicted state for the next 15 min.
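LD-SVM is an Azure ML module rather than a standard open-source estimator, so purely as an illustration of how such hyperparameters can be tuned, the sketch below runs a random search over ranges that bracket the tuned values reported above. The search space and the stand-in objective are assumptions; a real run would score each candidate by 10-fold cross-validated accuracy:

```python
import random

random.seed(42)

# Illustrative search ranges bracketing the reported tuned values;
# not the study's actual search space.
SPACE = {
    "lambda_w":           (1e-4, 1e-2),
    "lambda_theta":       (0.01, 1.0),
    "lambda_theta_prime": (0.01, 1.0),
    "sigma":              (1e-3, 1e-1),
}

def sample_candidate():
    """Draw one random hyperparameter combination from SPACE."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def cv_score(params):
    """Stand-in objective; replace with 10-fold cross-validated accuracy."""
    return -sum(abs(v) for v in params.values())  # placeholder only

best = max((sample_candidate() for _ in range(50)), key=cv_score)
```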

**Figure 8.** The LD-SVM model. (**a**) The impact of tree depth on accuracy. (**b**) Predicted state for the next 15 min. (**c**) Impact of Lambda theta, Lambda theta prime, Lambda W, and sigmoid function on accuracy.

#### *5.4. Decision Jungle*

The tuning parameters in the decision jungle model were the maximum depth of the decision DAGs, the number of decision DAGs, the number of optimization steps per decision DAG layer, and the maximum width of the decision DAGs. Figure 9a shows the impact of the maximum depth of the decision DAGs on the accuracy of the model; an accuracy of 92% was achieved at a maximum depth of 77. The best-tuned values for the other parameters are depicted in Figure 9c (the number of decision DAGs, the number of optimization steps per decision DAG layer, and the maximum width of the decision DAGs were 22, 5786, and 19, respectively) and were obtained using 10-fold cross-validation. Since our study considered 15 min prediction horizons, the structure of the DAGs is illustrated in Figure 9d, which shows 22 DAGs with a maximum depth of 77 levels. The predicted state for the 15 min horizon can be seen in Figure 9b.
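The reported configuration (depth 77, maximum width 19) is only feasible because a decision jungle caps each layer's width by merging nodes, whereas a full binary tree's layer width grows exponentially with depth. A hedged back-of-the-envelope illustration:

```python
def tree_layer_width(depth):
    """Nodes a full binary decision tree needs at a given depth."""
    return 2 ** depth

def jungle_layer_width(depth, max_width):
    """A decision jungle merges nodes, capping each layer at max_width."""
    return min(2 ** depth, max_width)

tree_nodes = tree_layer_width(10)           # already 1024 nodes at depth 10
jungle_nodes = jungle_layer_width(10, 19)   # capped at the reported width 19
```

At the paper's depth of 77, a full tree would require 2^77 nodes in its deepest layer; the width cap is what makes such depths practical.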


**Figure 9.** Decision jungle model. (**a**) The impact of maximum depth on accuracy. (**b**) Predicted state for the next 15 min. (**c**) Impact of the maximum width of the decision DAGs and the number of decision DAGs on accuracy. (**d**) Number, width, and depth of the DAGs for the 15-min prediction horizons.

#### *5.5. CN2 Rule Induction*

CN2 utilizes a statistical significance test to ensure that each fresh rule represents a real correlation between features and classes. It is in fact a pre-pruning technique, rejecting particular rule specializations as they are constructed. Moreover, it follows a sequential covering approach at the upper level (also called separate-and-conquer or cover-and-remove), first used by the algorithm quasi-optimal (AQ) algorithm. A CN2 rule returns a class distribution in terms of the number of covered examples distributed over the classes. In the distributions in Table 1 and the tables in the Appendix, each number corresponds to the number of examples belonging to class LOS = i, where i = {A, B, C, D, E, F}; that is, they give the observed frequency distribution of examples across the different classes. In other words, each number represents the count of examples with the relevant class membership. The derived probabilities shown in Table 1 can be used to check the accuracy and efficiency of a particular rule. We adopted exclusive coverage at the upper level, i.e., unordered CN2 [62], whereas Laplace estimation was used for rule evaluation at the lower level. Pre-pruning of rules was performed using two methods: (i) likelihood ratio statistic (LRS) tests, and (ii) a minimum threshold for rule coverage. The LRS test comprises two checks: the first enforces a rule's minimum level of significance α1, while the second compares a rule to its parent rule, checking whether the last specialization has a sufficient level of significance α2. The values for the LRS tests and the rules for the different prediction horizons were obtained using 10-fold cross-validation. Figure 10 shows the predicted state for 15 min intervals. The values of α1 and α2 are listed in Table 2. The rules for the next 5 min and 10 min horizons are given in Appendix A, whilst the rules for the next 15 min horizon are given in Table 1.
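The two rule-quality measures named above are easy to compute from a rule's covered class distribution. The sketch below implements the Laplace accuracy estimate and the likelihood ratio statistic on a hypothetical covered distribution (the counts are made up for illustration):

```python
import math

def laplace_accuracy(class_counts, target):
    """Laplace estimate: (n_target + 1) / (n_covered + k) for k classes."""
    n = sum(class_counts.values())
    k = len(class_counts)
    return (class_counts.get(target, 0) + 1) / (n + k)

def likelihood_ratio_statistic(covered, prior):
    """LRS = 2 * sum f_i * ln(f_i / e_i), where e_i is the count expected
    under the training-set class priors for the same coverage."""
    n = sum(covered.values())
    total = sum(prior.values())
    lrs = 0.0
    for cls, f in covered.items():
        if f == 0:
            continue  # a zero count contributes nothing to the statistic
        e = n * prior[cls] / total
        lrs += f * math.log(f / e)
    return 2 * lrs

covered = {"D": 18, "E": 2}        # toy distribution covered by one rule
prior = {"D": 50, "E": 50}         # toy class priors in the training data
q = laplace_accuracy(covered, "D")            # (18 + 1) / (20 + 2)
lrs = likelihood_ratio_statistic(covered, prior)
```

A rule is kept only if its LRS exceeds the chi-squared critical value implied by the chosen significance level (α1 or α2, for the rule and its parent comparison respectively).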

**Table 1.** Selected rules (for 15-min prediction horizon) with rule quality.


**Figure 10.** Predicted state for the next 15 min horizon.


**Table 2.** CN2 rule setting parameter values.

#### *5.6. Multi-Layer Perceptron (MLP)*

In neural networks, learning involves adjusting the connection weights between neurons and each functional neuron's threshold. We considered one input layer and one hidden layer with 35 neurons. The input layer had four nodes: speed, density, flow, and time duration (interval). Table 3 compares the accuracy achieved using 10-fold cross-validation for the different prediction horizons against the learning rate, momentum, activation function, and number of epochs. Figure 11a shows the predicted state for the next 15 min horizon. The input layer, hidden layer with its neurons, and output layer of the MLP network are depicted in Figure 11b.
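To make the 4-input, 35-hidden-neuron architecture concrete, the sketch below runs one forward pass through such a network in plain Python. The weights are random stand-ins for trained parameters, and the ReLU activation and input values are assumptions for illustration only:

```python
import random

random.seed(0)

N_IN, N_HIDDEN, N_OUT = 4, 35, 6  # speed, density, flow, interval -> LOS A-F

# Random weights stand in for trained parameters (illustration only).
W1 = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
W2 = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT

def relu(x):
    return x if x > 0 else 0.0

def forward(x):
    """One forward pass: input -> 35 hidden neurons -> 6 output scores."""
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]

scores = forward([60.0, 25.0, 1500.0, 15.0])  # speed, density, flow, interval
predicted = "ABCDEF"[scores.index(max(scores))]
```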

**Table 3.** Configuration of the parameters for the multi-layer perceptron (MLP).


**Figure 11.** MLP model. (**a**) Predicted state for the next 15 min horizon. (**b**) MLP network with one hidden layer.

## **6. Model Comparison**

The weighted average F-1 score and accuracy were evaluated to assess the performances of the different models. The results suggest that decision jungles outperformed the LD-SVM, CN2, and MLP, as shown in Figure 12. Additionally, the decision jungles and LD-SVM achieved the highest weighted average F-1 scores. In particular, the decision jungle improved on the LD-SVM, CN2, and MLP, obtaining high F-1 scores of 0.9777, 0.952, and 0.915 for time horizons of 15, 10, and 5 min, respectively. Similarly, the LD-SVM was slightly better than the MLP and CN2, with F-1 scores of 0.904, 0.926, and 0.946 for the 15, 10, and 5 min prediction horizons. The CN2 rule induction also performed well, though it did not match the decision jungles, while the other models failed to achieve a higher F-1 score for the same prediction horizons. On the other hand, Figure 13a,b shows that the decision jungles and LD-SVM also achieved higher accuracy than the remaining models, CN2 rule induction and MLP. It can be noted that as the prediction horizon increases, the F-1 score and accuracy decrease. This indicates that the decision jungles were stable across the 15, 10, and 5 min time horizons. Unlike the LD-SVM, the MLP and CN2 were less effective at maintaining stable accuracy across the different time horizons. However, the CN2 rule induction (Figure 13c,d) performed well and provided stable results only for the 10 and 15 min prediction horizons.
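The weighted average F-1 used for this comparison weights each class's F-1 score by its support (sample count), so frequent LOS classes dominate the aggregate. A minimal sketch with made-up per-class scores and supports:

```python
def weighted_f1(per_class_f1, support):
    """Weighted-average F-1: each class's score weighted by its support."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

# Hypothetical per-class F-1 scores and supports for three LOS classes
f1 = weighted_f1({"A": 0.95, "D": 0.90, "F": 0.80},
                 {"A": 100, "D": 50, "F": 10})
```

With these toy numbers the rare class F barely moves the aggregate, which is why stratified folds matter for the minority LOS states.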

**Figure 12.** Model comparison: weighted average F-1 scores for the decision jungles, LD-SVM, MLP, and CN2 rule induction.

The experimental results are summarized in Tables 4 and 5, where the models' performances are reported as the F-1 score and average accuracy for the different prediction horizons, respectively. It can be clearly seen that the decision jungles achieved a higher F-1 score and higher accuracy than the other models across the different prediction horizons, with an average performance of 95%, outperforming the remaining models. The LD-SVM, in turn, performed better than the MLP and CN2 rule induction.


**Table 4.** F-1 score for the different model comparisons.

**Table 5.** Accuracy for the different model comparisons.


**Figure 13.** Model Comparison. (**a**) Accuracy for the decision jungles. (**b**) Accuracy for the LD-SVM. (**c**) Accuracy for the MLP. (**d**) Accuracy for the CN2 rule induction.
