**3. Preliminaries**

Machine learning provides a number of supervised learning techniques for classification and prediction. The objective of a classification problem is to learn a model that can predict the value of the target variable (class label) based on multiple input variables (predictors, attributes). This model is a function that maps an input attribute vector *X* to an output class label (i.e., *Y* ∈ {C1, C2, C3, ... , Cn}). The labeled training set is represented as follows:

$$(X, Y) = \{(\mathbf{x}\_0, \mathbf{x}\_1, \mathbf{x}\_2, \mathbf{x}\_3, \dots, \mathbf{x}\_n), Y\} \tag{1}$$

where *Y* is the target class label (dependent variable) and vector *X* is composed of *x*0, *x*1, *x*2, *x*3, ... , *xn*. The macroscopic flow, density, and speed obtained from traffic simulation serve as the input variables fed to the machine learning models for short-term traffic prediction. The model learns from these input variables aggregated over different time intervals (i.e., 5, 10, and 15 min). The level of service (LOS) of the next time interval is taken as the class label (target variable). The predicted class label for the first time interval (Time Duration = 1) is therefore given in the following form:

$$(Density\_1, Speed\_1, Flow\_1, Time\ Duration\_1, LOS\_2) \tag{2}$$
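For illustration, the following pandas sketch shows how such a labeled training set could be assembled from simulated macroscopic traffic data; the column names and numerical values are hypothetical placeholders, not the actual simulation output used in this study.

```python
import pandas as pd

# Hypothetical per-interval aggregates from a traffic simulation.
sim = pd.DataFrame({
    "interval": [1, 2, 3, 4],
    "density":  [22.5, 31.0, 44.8, 52.3],   # veh/km/lane
    "speed":    [88.0, 74.5, 61.2, 49.7],   # km/h
    "flow":     [1980, 2310, 2742, 2600],   # veh/h/lane
    "duration": [5, 5, 5, 5],               # aggregation interval (min)
    "LOS":      ["C", "D", "E", "E"],
})

# Predictors X come from interval t; the target Y is the LOS of interval t+1,
# matching the form of Equation (2).
X = sim[["density", "speed", "flow", "duration"]].iloc[:-1]
Y = sim["LOS"].shift(-1).iloc[:-1].rename("LOS_next")
print(pd.concat([X, Y], axis=1))
```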

The current study utilized four different machine learning methods for short-term TSP: LD-SVM, decision jungles, CN2 rule induction, and MLP. The detailed methodology for each technique is presented below.

#### *3.1. Local Deep Support Vector Machine (LD-SVM)*

SVM is based on statistical learning theory, as proposed by Vapnik in 1995 for classification and regression [66]. Local deep kernel learning SVM (LD-SVM) is a scheme for efficient non-linear SVM prediction that keeps classification accuracy above an acceptable limit. Using a local kernel function allows the model to learn arbitrary local feature embeddings, including sparse, high-dimensional, and computationally deep features that introduce non-linearity into the model. The model employs efficient routines to optimize over the space of tree-structured local embeddings, which allows it to scale to large training sets of more than half a million points; LD-SVM training is exponentially quicker than training traditional SVM models [57]. LD-SVM can be used for both linear and non-linear classification tasks. A standard SVM can be regarded as a special type of linear classifier (like logistic regression (LR)); however, linear classifiers are unable to perform well on complicated, non-linear tasks. In addition, LD-SVM model learning is significantly faster and computationally more efficient than traditional SVM model training. The local deep kernel formulation learns a non-linear kernel $K(\mathbf{x}_i, \mathbf{x}_j) = K_L(\mathbf{x}_i, \mathbf{x}_j)\,K_G(\mathbf{x}_i, \mathbf{x}_j)$, where $K_L$ and $K_G$ are the local and global kernels, respectively. Expressing the local kernel as $K_L = \phi_L^t \phi_L$ and the global kernel as $K_G = \phi_G^t \phi_G$ leads to the prediction function:

$$y(\mathbf{x}) = \text{sign}\{\phi\_L^t(\mathbf{x})\mathcal{W}^t\phi\_G(\mathbf{x})\}\tag{3}$$

$$y(\mathbf{x}) = \text{sign}\left(\sum\_{k}\sum\_{i} \alpha\_i \, y\_i \, \phi\_{G}^t(\mathbf{x}\_i)\,\phi\_{G}(\mathbf{x})\,\phi\_{L\_k}(\mathbf{x}\_i)\,\phi\_{L\_k}(\mathbf{x})\right) \tag{4}$$

$$y(\mathbf{x}) = \operatorname{sign}\Big(\mathcal{W}^t(\phi\_G(\mathbf{x}) \otimes \phi\_L(\mathbf{x})) \Big) \tag{5}$$

$$y(\mathbf{x}) = \operatorname{sign}\Big(\phi\_L^t(\mathbf{x})\mathcal{W}^t\phi\_G(\mathbf{x})\Big) \tag{6}$$

$$y(\mathbf{x}) = \text{sign}\left\{\mathcal{W}^t(\mathbf{x})\,\phi\_G(\mathbf{x})\right\}\tag{7}$$

where $\mathcal{W}_k = \sum_i \alpha_i y_i \phi_{L_k}(\mathbf{x}_i)\,\phi_G(\mathbf{x}_i)$; $\phi_{L_k}$ denotes dimension $k$ of $\phi_L \in \mathbb{R}^M$; $\mathcal{W} = [\mathbf{w}_1, \dots, \mathbf{w}_M]$; $\mathcal{W}(\mathbf{x}) = \mathcal{W}\phi_L(\mathbf{x})$; and $\otimes$ is the Kronecker product. $\phi_L$ is the local feature space and $\phi_G$ is the global feature space.
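As a sanity check on this factorization, the following NumPy sketch verifies numerically that the Kronecker form of Equation (5), the matrix form of Equation (6), and the local linear form of Equation (7) yield the same decision value; the feature vectors and weight matrix are random placeholders rather than learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 6, 4                   # dims of the global (phi_G) and local (phi_L) feature spaces
phi_G = rng.normal(size=D)    # global features of a sample x (in LDKL, phi_G(x) = x)
phi_L = rng.normal(size=M)    # local, tree-structured features of the same sample
W = rng.normal(size=(D, M))   # W = [w_1 ... w_M], one global weight vector per local dimension

score_matrix = phi_L @ W.T @ phi_G                # Eq. (6): phi_L^t(x) W^t phi_G(x)
score_kron = W.flatten() @ np.kron(phi_G, phi_L)  # Eq. (5): vec(W)^t (phi_G ⊗ phi_L)
score_local = (W @ phi_L) @ phi_G                 # Eq. (7): W(x)^t phi_G(x), with W(x) = W phi_L(x)

assert np.allclose(score_matrix, score_kron) and np.allclose(score_matrix, score_local)
print("prediction:", np.sign(score_matrix))
```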

$$\phi\_{L\_k}(\mathbf{x}) = \tanh\left(\sigma\, \theta\_k^{\prime t}\, \mathbf{x}\right) I\_k(\mathbf{x}) \tag{8}$$
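The sketch below illustrates, under simplifying assumptions (a complete binary tree traversed with hard left/right decisions and randomly drawn parameters), how the local embedding of Equation (8) can be computed: only the nodes on the path taken by a sample receive a non-zero tanh activation.

```python
import numpy as np

def local_features(x, theta, theta_p, sigma=1.0):
    """Local embedding phi_L(x) for a complete binary tree (heap-indexed nodes 1..M)."""
    M = theta_p.shape[1]              # one local dimension per tree node
    phi_L = np.zeros(M)
    node, path = 1, [1]
    while 2 * node <= M:              # descend until a leaf is reached
        node = 2 * node if theta[:, node - 1] @ x <= 0 else 2 * node + 1
        path.append(node)
    for k in path:                    # I_k(x) = 1 only for nodes on the traversed path
        phi_L[k - 1] = np.tanh(sigma * theta_p[:, k - 1] @ x)
    return phi_L

rng = np.random.default_rng(1)
D, M = 5, 7                           # 5 input dims, depth-2 tree with 7 nodes
x = rng.normal(size=D)
print(local_features(x, rng.normal(size=(D, M)), rng.normal(size=(D, M))))
```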

The training of the LD-SVM and the smoothing of the tree are illustrated in Figure 1, and the prediction function can be further written as:

$$y(\mathbf{x}) = \text{sign}\left[\tanh\left(\sigma v\_1^t \mathbf{x}\right) w\_1^t \mathbf{x} + \tanh\left(\sigma v\_2^t \mathbf{x}\right) w\_2^t \mathbf{x} + \tanh\left(\sigma v\_4^t \mathbf{x}\right) w\_4^t \mathbf{x}\right] \tag{9}$$

where $I_k(\mathbf{x})$ is the indicator function for node $k$ of the tree; $\theta$ determines whether to go left or right; $v$ introduces the non-linearity along the traversed path; and $\sigma$ is the sigmoid sharpness parameter used for scaling, which can be set by validation. Higher values of $\sigma$ imply that the tanh in the local kernel is saturated, while lower values correspond to a more linear range of operation. The full optimization formulation is given in Equation (10). The local deep kernel learning (LDKL) primal for jointly learning $\theta$ and $\mathcal{W}$ from the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ can be described as:

$$\min\_{\mathcal{W},\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}} P(\mathcal{W},\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}) = \frac{\lambda\_{\mathcal{W}}}{2} \mathrm{Tr}\big(\mathcal{W}^{t}\mathcal{W}\big) + \frac{\lambda\_{\theta}}{2} \mathrm{Tr}\big(\boldsymbol{\theta}^{t}\boldsymbol{\theta}\big) + \frac{\lambda\_{\theta^{\prime}}}{2} \mathrm{Tr}\big(\boldsymbol{\theta}^{\prime t}\boldsymbol{\theta}^{\prime}\big) + \sum\_{i=1}^{N} L\big(y\_{i}, \phi\_{L}^{t}(\mathbf{x}\_{i})\mathcal{W}^{t}\mathbf{x}\_{i}\big) \tag{10}$$

where $L = \max\big(0, 1 - y_i\, \phi_L^t(\mathbf{x}_i)\mathcal{W}^t\mathbf{x}_i\big)$ is the hinge loss; $\lambda_{\mathcal{W}}$ is the weight of the regularization term on $\mathcal{W}$; $\lambda_{\theta}$ specifies the amount of space to be left between the region boundary and the nearest data point; and $\lambda_{\theta'}$ controls the amount of curvature allowed in the model's decision boundaries.
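A minimal sketch of evaluating this primal objective is given below, assuming the local features $\phi_L(\mathbf{x}_i)$ have already been computed and that the global features are the raw inputs, as in Equation (10); the array shapes, parameter names, and regularization weights are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np

def ldkl_primal_objective(W, theta, theta_p, phi_L, X, y,
                          lam_W=1.0, lam_theta=1.0, lam_theta_p=1.0):
    """Evaluate the LDKL primal of Equation (10) for a labeled training set.

    W       : (D, M) matrix of global weight vectors [w_1 ... w_M]
    theta   : (D, K) split-direction parameters of the tree nodes
    theta_p : (D, M) parameters generating the local features phi_L
    phi_L   : (N, M) precomputed local features phi_L(x_i)
    X, y    : (N, D) inputs and (N,) labels in {-1, +1}
    """
    reg = (lam_W / 2) * np.trace(W.T @ W) \
        + (lam_theta / 2) * np.trace(theta.T @ theta) \
        + (lam_theta_p / 2) * np.trace(theta_p.T @ theta_p)
    # Hinge loss L(y_i, phi_L^t(x_i) W^t x_i) summed over all N samples.
    margins = y * np.einsum('im,dm,id->i', phi_L, W, X)
    return reg + np.maximum(0.0, 1.0 - margins).sum()
```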

**Figure 1.** Schematic diagram of the local deep support vector machine (LD-SVM).

#### *3.2. Decision Jungles*

Decision jungles are a recent addition to the family of decision forests. They are composed of an ensemble of rooted decision directed acyclic graphs (DAGs). Unlike standard decision trees, the DAGs in a decision jungle allow multiple paths from the root to each leaf. A decision DAG has a reduced memory footprint and provides superior efficiency compared with a decision tree. Decision jungles are non-parametric models that provide integrated feature selection and classification and are robust in the presence of noisy features. DAGs have the same structure as decision trees, except that nodes may have multiple parents. DAGs can limit memory consumption by specifying a width at each layer and can potentially help to reduce overfitting [54].

Consider the sets of nodes at two consecutive levels of a DAG. As shown in Figure 2, these consist of a set of child nodes $N_c$ and a set of parent nodes $N_p$. Let $\theta_i$ denote the parameters of the split function $f$ for parent node $i \in N_p$, and let $S_i$ denote the set of labeled training samples $(\mathbf{x}, y)$ that reach node $i$, from which the sets of samples travelling through its left and right branches can be computed. Given $\theta_i$ and $S_i$, these sets are $S_i^L(\theta_i) = \{(\mathbf{x}, y) \in S_i \mid f(\theta_i, \mathbf{x}) \le 0\}$ and $S_i^R(\theta_i) = S_i \setminus S_i^L(\theta_i)$, respectively. $l_i \in N_c$ indicates the child node reached through the left outward edge of parent node $i \in N_p$, and $r_i \in N_c$ denotes the child reached through the right outward edge. Hence, the set of samples reaching any child node $j \in N_c$ is given as:

$$S\_j(\{\theta\_i\}, \{l\_i\}, \{r\_i\}) = \left[\bigcup\_{i \in N\_p \,\mathrm{s.t.}\, l\_i = j} S\_i^L(\theta\_i)\right] \cup \left[\bigcup\_{i \in N\_p \,\mathrm{s.t.}\, r\_i = j} S\_i^R(\theta\_i)\right] \tag{11}$$
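The following sketch illustrates Equation (11) under simple assumptions: a linear split function and hand-specified child assignments $l_i$ and $r_i$ (which a real decision jungle would learn jointly). Samples at each parent are split left or right, and the pieces arriving at a shared child node are merged.

```python
import numpy as np

def f(theta_i, X):
    """Linear split: samples with non-positive scores go left, the rest go right."""
    return X @ theta_i

def route_layer(S, parents):
    """Route samples from one DAG layer to the next, as in Equation (11).

    S       : dict {parent node i: (n_i, D) array of samples reaching i}
    parents : dict {i: (theta_i, l_i, r_i)} giving each parent's split parameters
              and its left/right child indices (children may be shared).
    """
    S_child = {}
    for i, (theta_i, l_i, r_i) in parents.items():
        X_i = S[i]
        scores = f(theta_i, X_i)
        left, right = X_i[scores <= 0], X_i[scores > 0]      # S_i^L and S_i^R
        empty = np.empty((0, X_i.shape[1]))
        S_child[l_i] = np.vstack([S_child.get(l_i, empty), left])
        S_child[r_i] = np.vstack([S_child.get(r_i, empty), right])
    return S_child

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
layer = {0: (rng.normal(size=3), 0, 1), 1: (rng.normal(size=3), 1, 2)}  # parents 0 and 1 share child 1
print({j: len(s) for j, s in route_layer({0: X[:5], 1: X[5:]}, layer).items()})
```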

**Figure 2.** Decision jungles (DAGs).

#### *3.3. CN2 Rule Induction*

In this study, rule learning models were also explored for TSP. These models are usually used for classification and prediction tasks. The CN2 algorithm is a classification method designed to efficiently induce simple rules of the form "if *condition* then predict *class*," even in domains where noise may be present. Inspired by Iterative Dichotomiser 3 (ID3), the original CN2 uses entropy as the rule evaluation function; Laplace estimation may be used as an alternative measure of rule quality that corrects the undesirable downward bias of entropy, and it is described as follows [67]:

$$\text{Laplace Estimation}(R) = \frac{p+1}{p+n+k} \tag{12}$$

where '*p*' represents the number of positive examples in the training set covered by rule '*R*'; '*n*' represents the number of negative examples covered by *R*; and '*k*' is the number of classes in the training set.
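Equation (12) is straightforward to compute; the short sketch below shows the calculation, with the example counts chosen purely for illustration.

```python
def laplace_estimate(p, n, k):
    """Laplace rule-quality estimate of Equation (12).

    p: positive examples covered by rule R
    n: negative examples covered by rule R
    k: number of classes in the training set
    """
    return (p + 1) / (p + n + k)

# e.g., a rule covering 30 positives and 5 negatives in a 6-class LOS problem
print(laplace_estimate(30, 5, 6))   # ≈ 0.756
```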
