**1. Introduction**

Interest in real-time traffic state estimation has grown with the introduction of advanced technologies such as connected vehicles (CVs). CVs aim to improve road safety by potentially reducing human errors, to mitigate traffic congestion by offering alternative routes, and to reduce on-road emissions and fuel consumption [1]. Conducting research with limited probe vehicle data (e.g., CVs) remains a challenge, especially when no additional data sources are available. Hence, past research has used probe data in conjunction with existing detection systems to enhance proposed traffic models, even though data from fixed detectors (e.g., loop detectors) always contain some noise [2–4].

A probe vehicle is defined as a vehicle that provides real-time information, such as its instantaneous position and speed. Probe vehicle data offer several recognized benefits: the data are of higher quality than those from existing sources (e.g., cameras and loop detectors), and they can be collected at any location inside the network, thus offering a clear picture of traffic behavior at any time. Therefore, transportation agencies are putting effort into facilitating the use of probe vehicle data.

Few studies have relied solely on probe vehicle data (e.g., Global Positioning Systems [GPSs]) to estimate the state of on-road traditional vehicles [5], such as traffic travel time, traffic density, traffic speed, and traffic volume. The real-time estimation of traffic density is important to achieving better traffic operations management in urban areas. This paper aims to estimate the total number of vehicles on signalized link approaches using only probe vehicle data. The resulting estimates can be provided to traffic signal controllers to optimally determine the allocation of green time for each traffic signal phase [6,7], leading to better intersection performance measures such as intersection delays and vehicle crashes [8,9]. One concern with using probe vehicles is measuring their level of market penetration (LMP). The LMP is defined as the ratio of the total number of probe vehicles to the total number of vehicles. Providing accurate LMP estimates improves the estimation accuracy of the vehicle counts [5]. Therefore, in this paper, a machine-learning technique is developed to provide reliable LMP estimates.

## **2. Related Work**

Different statistical tools have been used to estimate the total number of vehicles on arterial roads and freeways, such as the Kalman filter (KF) [10], Bayesian statistics [11], and Particle filter [12] approaches. The literature shows the benefits of using the KF technique in addressing different aspects of the traffic estimation problem. The KF has been used to estimate the traffic travel time [13,14], traffic speed [15,16], and traffic density [5,17]. Different detection techniques have been employed to estimate the number of vehicles, such as loop detectors, camera systems, and probe data. Two loop detectors, one at the entrance and the other at the exit of the link, are utilized to measure the total number of arrivals and departures, then the number of vehicles are simply obtained by applying the flow continuity equation [18]. A robust KF model with at least three loop detectors on the tested link was employed to estimate the number of vehicles on the link in [17]. The study derived the KF state equation from the flow continuity equation, while the measurement equation was derived from the relationship of the detector time occupancy and space occupancy; however, the cost of implementing such an algorithm in the field is high given the number of sensors needed. Another study employed the KF to estimate the number of vehicles on multi-section freeways. The state equation was derived from the flow continuity equation, while the measurement equation was derived from the hydrodynamic relationship between traffic speed and density [19]. Loop detectors were used in addition to speed sensors in the middle of the tested section. However, the proposed algorithm is hard to employ in the field due to the high cost of implementation. A video record, another detection technique, was used to estimate the traffic density for signalized links [20]. 
In that study, the authors used the space-mean speed rather than the traffic flow in the state equation because of the high errors that accompany sensor failures. Their argument is that the space-mean speed is an average quantity, while the traffic flow is a cumulative quantity. They also demonstrated the importance of having knowledge about the system noise characteristics to improve the performance of the KF model. Consequently, the authors of this paper applied an adaptive Kalman filter (AKF) to enable real-time estimates of the statistical parameters of the system noise rather than using predefined values for the entire simulation (as assumed in the traditional KF model).

As illustrated in the literature, stationary sensors, such as loop detectors and camera systems, suffer from poor detection accuracy and have high installation and maintenance costs. Advanced detection techniques such as GPS data have proven to be more accurate without the need to install additional hardware. Consequently, recent studies have developed several traffic estimation models using fusion data (a combination of two different data sources) to estimate the number of vehicles, with the aim of achieving better accuracy than is possible with a single data source. In many of the works using fusion data, the KF technique was employed for estimating traffic density. One study achieved accurate traffic density estimates using the traffic flow values measured from a video detection system and the travel time obtained from vehicles equipped with GPS devices [2]. The AKF model proposed in this paper differs from that approach in two significant ways: it uses only probe vehicle data with a variable time interval rather than a fixed value (the updating time interval was 1 min in [2]), and it uses the AKF to allow for real-time estimates of the statistical parameters of the state and measurement noise.

The literature shows that the KF model can address estimation problems for different traffic applications. However, it is hard to implement in real-world applications because the statistical characteristics of the system noise (mean and variance) are difficult to estimate. Consequently, researchers have developed the AKF to solve this issue and make field implementation possible. Chu et al. proposed an AKF model to estimate freeway travel time using both loop detectors and probe data [21]. They used the estimation method for noise statistic parameters that was proposed in [22]. This estimation method is known for its simplicity in handling errors and its fast processing time. Hence, in this study, the estimation of the statistical parameters uses the same estimation procedure as in Chu et al.'s study. It should be noted that the main difference between the proposed estimation approach and Chu et al.'s approach is that our model uses only probe vehicle data.

In a recent study, the KF model was proposed to estimate the number of vehicles on signalized link approaches using only probe vehicle data [5]. The KF state equation was based on the traffic flow continuity equation, and thus one value of the probe vehicle LMP (*ρ*), for the entire link, is used to scale up the probe measurements to reflect the total flow in the second term of the flow continuity equation, as presented in Equation (1). It was found that using two LMP values (at the entrance and the exit of the link) produces more accurate vehicle count estimates, especially when dealing with low LMPs, as described later in Section 4.3. In Equation (1), *N*(*t*) is the number of vehicles traversing the link at time (*t*), Δ*t* is the variable duration of the updating time interval, *N*(*t* − Δ*t*) is the number of vehicles traversing the link in the previous interval, *qin* and *qout* are the probe flows entering and exiting the link between (*t* − Δ*t*) and (*t*), respectively, and *ρ* is the LMP of probe vehicles.

$$N(t) = N(t - \Delta t) + \frac{\Delta t}{\rho} [q^{in}(t) - q^{out}(t)] \tag{1}$$
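As a rough illustration of Equation (1), the following Python sketch scales the net probe flow by the LMP and adds it to the previous count. The function name `update_vehicle_count` and the numerical values are hypothetical and are not taken from the cited study; flows are assumed to be expressed in probe vehicles per second.

```python
def update_vehicle_count(n_prev, q_in, q_out, dt, rho):
    """Flow continuity update of Equation (1).

    n_prev : vehicle count at the previous step, N(t - dt)
    q_in   : probe flow entering the link (probe veh/s)
    q_out  : probe flow exiting the link (probe veh/s)
    dt     : duration of the updating interval (s)
    rho    : level of market penetration, 0 < rho <= 1
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    # Dividing by rho scales the probe flows up to total flows.
    return n_prev + dt / rho * (q_in - q_out)


# Illustrative values only: 3 probes entered and 2 exited in 60 s at a 10% LMP.
n_now = update_vehicle_count(n_prev=10.0, q_in=3 / 60, q_out=2 / 60, dt=60.0, rho=0.1)
print(round(n_now, 6))  # about 20 vehicles
```

With a 10% LMP, one net probe arrival is interpreted as roughly ten net arrivals in total, which is why errors in *ρ* propagate strongly into the count.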

Machine learning has proven its ability to provide accurate estimates for different traffic characteristics [23–28]. Traffic speed and density have been estimated using an artificial neural network (ANN) model [23]. Video and Bluetooth data were used to build the ANN model. The traffic flow data were manually extracted from the video records, while the speed data were constructed from the collected Bluetooth travel time data. The neural network (NN) model is able to address the research problem if a good quantity of training data is accessible. Another study applied several machine learning techniques, such as k-means clustering, k-nearest neighbor classification, and locally weighted regression, to estimate traffic speed [24] using archived data of speeds, counts, and densities. The authors found that machine learning models can improve the accuracy of speed estimation. Khan et al. [25] used artificial intelligence to classify the level of service in a freeway segment based on traffic density values. They used loop detectors and CV data to develop support vector machine and k-nearest neighbor classifiers. Results indicated higher accuracy from the support vector machine algorithm than from the k-nearest neighbor classification algorithm. Estimating hourly traffic volumes between sensors was addressed using an NN model in the Maryland highway network [27], deploying both probe vehicles and automatic traffic recording station data to construct the NN model. A comparison was also made between linear regression, k-nearest neighbor, support vector machine with linear kernel, random forest, and NN models, concluding that the NN model performed the best. The proposed approach produced 24% more accurate estimates than current volume profiles.

In this research study, an AKF technique is applied to estimate real-time vehicle counts along signalized link approaches using only probe vehicle data. The study then follows the recommendation of Aljamal et al.'s study [5] by using two LMP values, one at the entrance and one at the exit of the tested link. To achieve this task, an NN model is developed to provide real-time estimates of the LMP values and thereby improve the accuracy of the proposed AKF model. The paper then develops the new AKFNN approach by combining the AKF with the developed NN model. The proposed study extends the state-of-the-art in vehicle count estimates by making four major contributions:


This paper is organized as follows. The first section describes the development of the simulation data. The second section describes the estimation models and the problem formulation for the KF, AKF, and AKFNN models. The third section discusses the results of the new proposed models. The fourth section provides the conclusions of the study and recommended future work.

#### **3. Development of Simulation Data**

This paper relies on the INTEGRATION traffic simulation model [29] to validate and test the accuracy of the proposed models. The INTEGRATION software has been extensively validated and demonstrated to replicate empirical observations [30–35]. Specifically, INTEGRATION was used to create synthetic data for conditions not observed in the field to quantify the sensitivity of the proposed method to the link length and traffic demand level. The selected tested link is located in downtown Blacksburg, Virginia, has an approximate length of 102 m based on ArcGIS software, and connects two signalized intersections. The link characteristics were calibrated to local conditions using typical values, which included a free-flow speed of 40 km/h, a speed-at-capacity of 32 km/h, a jam density of 160 veh/km/lane, and a base saturation flow rate of 2100 veh/h/lane, which resulted in a roadway capacity of 700 veh/h given the cycle length and green times of the traffic signal. The traffic signal cycle length is 75 s and it has four phases with the following displayed green times: 5, 25, 5, and 28 s. The tested link is assigned a displayed green time of 25 s. These values were consistent with what was coded in the field.
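The stated capacity can be checked with a short calculation. The sketch below assumes a single lane and treats the displayed green time as the effective green time; both are simplifying assumptions made here for illustration, and the function name `approach_capacity` is hypothetical.

```python
def approach_capacity(saturation_flow_vph, green_s, cycle_s):
    """Capacity (veh/h) = saturation flow rate x green-to-cycle ratio."""
    if cycle_s <= 0 or not 0 <= green_s <= cycle_s:
        raise ValueError("need 0 <= green <= cycle and cycle > 0")
    return saturation_flow_vph * green_s / cycle_s


# Values quoted in the text: 2100 veh/h/lane, 25 s of green, 75 s cycle.
print(approach_capacity(2100.0, 25.0, 75.0))  # 700.0
```

The result matches the 700 veh/h reported for the tested link.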

The INTEGRATION simulation model was used to ease the generation of probe vehicle data as real probe data are not easy to access. For each LMP, a total of 50 scenarios were generated with different random seeds as conducted in [25]. Forty-nine scenarios were used to train and validate the proposed NN model, and scenario number 50 was considered the testing data set. The INTEGRATION model generates a "time-space" file which provides some information about the probe vehicles during their trips for every second. The time-space file records the instantaneous position, speed, and spacing for each probe vehicle. In addition to that, a loop detector is installed at the entrance of the tested link to create a detector output file which provides some data about the simulation behavior such as speed, traffic volume, and occupancy at the detection location.

#### **4. Estimation Models**

This section first summarizes some crucial points regarding vehicle count estimation as discussed in the authors' previous research study [5]. It then describes the proposed AKF estimation model for estimating the vehicle count along signalized link approaches, and explains the difference between the state-of-the-art KF model in [5] and the newly proposed AKF model. Finally, an NN model is developed to provide estimates of the probe vehicle LMPs to be used in the proposed AKF model equations to attain higher accuracy. Two vehicle count estimation models are described in this section: (1) the AKF model, which uses only probe vehicle data; and (2) the AKFNN model, which fuses probe and single-loop detector data. The single-loop detector data were mainly used to develop the NN model.

#### *4.1. Summary of the Developed KF Model*

In a previous study [5], the authors developed a KF model to produce reliable vehicle count estimates using only probe vehicle data. In that study, the authors introduced a novel variable estimation time interval as opposed to the traditional fixed time interval. The estimation time interval was defined as the time when exactly *n* probe vehicles traversed the tested link. The variable time interval was shown to improve estimation accuracy compared to a fixed time interval (e.g., 20 s). The following example illustrates the benefit of the variable time interval. If the approach's LMP is 10%, the number of probe vehicles will be low. If we treat the problem using a fixed estimation interval, then the probability of observing zero probe vehicles within an interval will be high for short estimation time intervals, making the estimation inefficient and inaccurate. Accordingly, low LMPs require long intervals (e.g., 300 s) to ensure that at least one probe vehicle is on the approach. In contrast, approaches with high LMPs can use short estimation intervals (e.g., 20 s). Consequently, treating the estimation time interval as a variable produces an efficient and convenient way of determining the duration of the estimation period. For more details, readers may refer to [5].
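The variable interval can be sketched in a few lines of Python. The function `variable_intervals` and the sample timestamps are hypothetical; the sketch assumes the observation starts at *t* = 0 and that an interval closes at the moment the *n*-th probe vehicle since the last boundary finishes traversing the link.

```python
def variable_intervals(traversal_times, n):
    """Return (start, end) pairs; each interval ends when n more probes
    have traversed the link. Leftover probes wait for the next interval."""
    if n < 1:
        raise ValueError("n must be at least 1")
    times = sorted(traversal_times)
    intervals = []
    start = 0.0
    # Every n-th traversal time closes one estimation interval.
    for i in range(n - 1, len(times), n):
        end = times[i]
        intervals.append((start, end))
        start = end
    return intervals


# Illustrative traversal completion times (s) for five probe vehicles.
print(variable_intervals([12.0, 40.0, 95.0, 130.0, 310.0], n=2))
# [(0.0, 40.0), (40.0, 130.0)]
```

The interval lengths adapt automatically: sparse probe arrivals (low LMP) stretch the interval, while dense arrivals shorten it, so no interval is ever empty of probe observations.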

One concern about the KF model is the use of predefined fixed values of the statistical parameters, mean and variance, of the KF state and measurement errors. Applying the KF model to real-world problems is limited since the statistical parameters are assumed to be known [21]. In practice, the mean and variance vary over time rather than remaining fixed. To produce a flexible model, this study employs the AKF model to provide real-time estimates of the statistical parameters of the KF state and measurement errors, as described in the following section.

#### *4.2. Adaptive Kalman Filter (AKF)*

The traditional KF technique is utilized with predefined error values of the state and measurement noise; these error values remain constant for the entire simulation. However, these values are hard to obtain in the field and they are always changing with time. Hence, an AKF is developed to overcome this issue and to dynamically estimate the error values in the state and measurement estimates. The AKF is comprised of two equations: (a) state equation and (b) measurement equation. The state equation is derived from the traffic flow continuity equation as defined in Equation (2). The state equation computes the number of vehicles by continuously adding the difference in the number of vehicles entering and exiting the section to the previously computed cumulative number of vehicles traveling along the section. This integral results in an accumulation error which requires fixing, and thus the measurement equation is needed. In Equation (2), the *ρ* value can be observed from historical data.

$$N(t) = N(t - \Delta t) + \frac{\Delta t}{\rho} [q^{in}(t) - q^{out}(t)] \tag{2}$$

The state equation produces accurate results if the scaled traffic flows (*qin*/*ρin* and *qout*/*ρout*) are accurate [5], as shown in Section 4.3. The total counts can be extracted from traditional loop detectors or video detection systems. We should note here that the *ρ* value in Equation (2) plays a major role in delivering accurate outcomes. *ρ* is defined as the ratio of the number of probe vehicles (*Nprobe*) to the total number of vehicles (*Ntotal*), as shown in Equation (3). For instance, if *ρ* is equal to 0.1 and the number of probe vehicles is 5, then the expected total number of vehicles is 50.

$$
\rho = N_{\text{probe}} / N_{\text{total}} \tag{3}
$$

Equation (4) describes the hydrodynamic relationship between the macroscopic traffic stream parameters (flow, density, and space-mean speed),

$$
q = k u_s \tag{4}
$$

where *q* is the traffic flow (vehicles per unit time), *k* is the traffic stream density (vehicles per unit distance), and *us* is the space-mean speed (distance per unit time). The *us* can be represented as shown in Equation (5),

$$
u_s = D/TT \tag{5}
$$

where *D* is the link length and *TT* is the average vehicle travel time. Since probe vehicles can share their instantaneous locations every Δ*t*, the travel time of each probe vehicle can be computed for any road section. Thus, the probe vehicle travel time is used in the measurement equation, using Equations (4) and (5). The measurement equation can be written as shown in Equation (8):

$$TT(t) = D \times \frac{k(t)}{\overline{q}(t)}\tag{6}$$

$$TT(t) = \frac{1}{\overline{q}(t)} [k(t) \times D] = \frac{1}{\overline{q}(t)} N(t) \tag{7}$$

$$TT(t) = H\left(t\right) \times N\left(t\right) \tag{8}$$

where *q̄* is the average traffic flow entering and exiting the link, and *H*(*t*) is a transition vector that converts the vehicle counts to travel times, and is the inverse of the average flow (i.e., the first term of Equation (7)), as shown in Equation (9).

$$H(t) = \frac{1}{\overline{q}(t)} = \frac{2 \times \rho}{q^{in}(t) + q^{out}(t)}\tag{9}$$
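To make the measurement relationship concrete, the sketch below evaluates *H*(*t*) from the probe flows and the LMP, then maps a vehicle count to an expected travel time as in Equation (8). The function name `measurement_vector` and the numbers are hypothetical, and flows are assumed to be in probe vehicles per second so that *H* has units of seconds per vehicle.

```python
def measurement_vector(q_in, q_out, rho):
    """H(t) = 2 * rho / (q_in + q_out), the inverse of the average total flow."""
    total_probe_flow = q_in + q_out
    if total_probe_flow <= 0:
        # No probe vehicle entered or left: H is undefined for this interval.
        raise ValueError("probe flows must sum to a positive value")
    return 2.0 * rho / total_probe_flow


# Illustrative values only: 0.05 and 0.03 probe veh/s at a 10% LMP.
h = measurement_vector(q_in=0.05, q_out=0.03, rho=0.1)
expected_tt = h * 8.0  # travel time implied by 8 vehicles on the link
print(round(h, 6), round(expected_tt, 6))  # about 2.5 s/veh and 20 s
```

The guard against a zero flow sum reflects a practical point that the variable time interval also addresses: each interval must contain probe observations for the measurement equation to be usable.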

The system state and measurement equations can be written as in Equations (10) and (11), considering the errors (noise). The term *u*(*t*) is the given input to the system. The vector *H*(*t*) is used to convert the vehicle counts to travel times. The state noise *w*(*t* − Δ*t*) is assumed to be Gaussian with a mean of *m*(*t*) and a variance of *M*(*t*). The measurement noise *v*(*t*) is assumed to be Gaussian with a mean of *r*(*t*) and a variance of *R*(*t*).

$$\text{State Equation :} \quad \quad N(t) = N(t - \Delta t) + u(t) + w(t - \Delta t) \tag{10}$$

$$u(t) = \frac{\Delta t}{\rho} [q^{in}(t) - q^{out}(t)],$$

$$\text{Measurement Equation :} \quad \quad TT(t) = H(t) \times N(t) + v(t) \tag{11}$$

$$H(t) = \frac{1}{\overline{q}(t)} = \frac{2 \times \rho}{q^{in}(t) + q^{out}(t)}$$

The proposed AKF estimation model can be solved using the following equations:

$$
\hat{N}^{-}(t) = \hat{N}^{+}(t - \Delta t) + u(t) + m(t - \Delta t) \tag{12}
$$

$$
\hat{P}^{-}(t) = \hat{P}^{+}(t - \Delta t) + M(t - \Delta t) \tag{13}
$$

$$
G(t) = \hat{P}^{-}(t) H(t)^T \left[ H(t) \hat{P}^{-}(t) H(t)^T + R(t) \right]^{-1} \tag{14}
$$

$$
\hat{N}^{+}(t) = \hat{N}^{-}(t) + G(t) \left[ TT(t) - H(t)\hat{N}^{-}(t) - r(t) \right] \tag{15}
$$

$$
\hat{P}^{+}(t) = \hat{P}^{-}(t) \times \left[ 1 - H(t) G(t) \right] \tag{16}
$$

where *N̂*<sup>−</sup> is the a priori estimate of the vehicle count, calculated using the measurements prior to instant *t*, and *P̂*<sup>−</sup> is the a priori estimate of the error covariance at instant *t*. The Kalman gain (*G*) is given by Equation (14). The posterior state estimate (*N̂*<sup>+</sup>) and the posterior error covariance estimate (*P̂*<sup>+</sup>) are updated as shown in Equations (15) and (16), considering the probe vehicle travel time measurements. In the next section, the estimation steps of the noise statistical parameters (*m*, *M*, *r*, *R*) are described.
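One filter iteration can be sketched as follows. Because the state (the vehicle count) is a scalar, the transposes and the matrix inverse in Equation (14) reduce to ordinary multiplication and division. The function name `akf_step`, its argument names, and the sample numbers are hypothetical; the noise statistics (*m*, *M*, *r*, *R*) are passed in as if already estimated.

```python
def akf_step(n_post_prev, p_post_prev, u, tt, h, m, M, r, R):
    """One scalar predict/update cycle following Equations (12)-(16)."""
    # Prediction: propagate the count and its error covariance.
    n_prior = n_post_prev + u + m            # Eq. (12)
    p_prior = p_post_prev + M                # Eq. (13)
    # Kalman gain for a scalar state and scalar measurement.
    g = p_prior * h / (h * p_prior * h + R)  # Eq. (14)
    # Correction using the measured probe travel time tt.
    n_post = n_prior + g * (tt - h * n_prior - r)  # Eq. (15)
    p_post = p_prior * (1.0 - h * g)               # Eq. (16)
    return n_prior, p_prior, n_post, p_post


# Illustrative values only.
n_prior, p_prior, n_post, p_post = akf_step(
    n_post_prev=10.0, p_post_prev=4.0, u=2.0, tt=30.0, h=2.5,
    m=0.0, M=1.0, r=0.0, R=4.0,
)
print(n_prior, p_prior, round(n_post, 6), round(p_post, 6))
```

In this example the measured travel time (30 s) equals the travel time implied by the predicted count (2.5 × 12), so the innovation is zero and the posterior count stays at the prior, while the error covariance still shrinks.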

#### 4.2.1. Online Estimation of Noise Statistics

An online estimate is conducted to optimally find the errors in the state and the measurement variables, to make the KF more efficient and applicable in real-world applications. As pointed out in the literature, the traditional KF assumes predefined errors in the system, which is not the case in real applications. A set of unknown noise statistical parameters, (*m*, *M*, *r*, *R*), needs to be estimated at every estimation step. The online estimate procedure follows the same procedure presented in [21].

The mean (*m*) and variance (*M*) of the state noise are shown in Equations (17) and (18), respectively.

$$m = \frac{1}{n} \sum_{t=1}^{n} m(t), \quad \text{where} \quad m(t) = \hat{N}^+(t) - \hat{N}^+(t - \Delta t) - u(t) \tag{17}$$

$$M = \frac{1}{n-1} \sum_{t=1}^{n} \left[ (m(t) - m)(m(t) - m)^T - \left(\frac{n-1}{n}\right) \left( \hat{P}^+(t - \Delta t) - \hat{P}^+(t) \right) \right] \tag{18}$$

where *m*(*t*) is the state noise at time *t*, the first term of Equation (18) is the covariance of *w* at time *t*, and *n* is the number of state noise samples.

The mean (*r*) and variance (*R*) of the measurement noise are shown in Equations (19) and (20), respectively.

$$r = \frac{1}{n} \sum_{t=1}^{n} r(t), \quad \text{where} \quad r(t) = TT(t) - H(t) \hat{N}^-(t) \tag{19}$$

$$R = \frac{1}{n-1} \sum_{t=1}^{n} \left[ (r(t) - r)(r(t) - r)^T - \left(\frac{n-1}{n}\right) H(t) \hat{P}^-(t) H^T(t) \right] \tag{20}$$

where *r*(*t*) is the observation noise at time *t*. The first term of Equation (20) is the covariance of *v* at time *t*, and *n* is the number of measurement noise samples. In summary, the KF and AKF models use the same equations, except that the AKF model estimates the statistical parameters of the noise at every estimation step using Equations (17) to (20).
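Equations (17) to (20) share one structure: a sample mean, and a sample variance reduced by a covariance correction term. The sketch below exploits that for the scalar case. The function name `noise_statistics` is hypothetical, and the state-noise correction is read here as *P̂*<sup>+</sup>(*t* − Δ*t*) − *P̂*<sup>+</sup>(*t*), which is an interpretation of Equation (18) rather than something the text spells out.

```python
def noise_statistics(samples, corrections):
    """Sample mean and corrected variance of a scalar noise sequence.

    samples     : m(t) values for Eqs. (17)-(18), or r(t) values for (19)-(20)
    corrections : P+(t - dt) - P+(t) for state noise (assumed reading),
                  or H(t) * P-(t) * H(t) for measurement noise
    """
    n = len(samples)
    if n < 2 or n != len(corrections):
        raise ValueError("need at least two samples and matching corrections")
    mean = sum(samples) / n
    weight = (n - 1) / n
    variance = sum(
        (s - mean) ** 2 - weight * c for s, c in zip(samples, corrections)
    ) / (n - 1)
    return mean, variance


# Illustrative residuals only.
residuals = [0.5, -0.5, 1.0, -1.0]
print(noise_statistics(residuals, [0.0] * 4))
print(noise_statistics(residuals, [0.1] * 4))
```

Note that the corrected variance can become negative when few samples are available or the correction terms are large; the text does not say how this case is handled, so a field implementation would need its own safeguard.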

As found in our previous study [5], providing the system equations with real-time estimates of *ρin* and *ρout* should improve the estimation accuracy. In this study, a single-loop detector was installed at the entrance of the tested link to produce real-time estimates of *ρin*. In contrast, in the next section, an NN model is developed to obtain real-time estimates for the *ρout* values.

#### *4.3. Neural Network*

NN is a machine learning technique that aims to recognize relationships between vast amounts of data by employing a certain number of neurons in every single hidden layer to achieve better accuracy [36]. The network consists of three main layers: the input layer, the hidden layer, and the output layer. This section takes into account the recommendation of using two market penetration rates (at the entrance and exit of the link) rather than one market penetration rate along the tested link in the KF equations [5]. Accordingly, the state equation and the H vector in the measurement equation are revised as presented in Equations (21) and (22). *ρin* and *ρout* are the probe LMP at the entrance and the exit of the link, respectively.

$$N(t) = N(t - \Delta t) + \Delta t \left[ \frac{q^{in}(t)}{\rho_{in}(t)} - \frac{q^{out}(t)}{\rho_{out}(t)} \right] \tag{21}$$

$$H(t) = \frac{1}{\overline{q}(t)} = \frac{2}{\frac{q^{in}(t)}{\rho_{in}(t)} + \frac{q^{out}(t)}{\rho_{out}(t)}}\tag{22}$$
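The effect of using separate LMPs can be seen in a small sketch of Equations (21) and (22). The function names `state_input_two_lmp` and `h_two_lmp` and the sample values are hypothetical. When both LMPs are equal, the expressions reduce to the single-*ρ* forms.

```python
def state_input_two_lmp(q_in, q_out, rho_in, rho_out, dt):
    """Net change in vehicle count over dt, the bracketed term of Eq. (21)."""
    return dt * (q_in / rho_in - q_out / rho_out)


def h_two_lmp(q_in, q_out, rho_in, rho_out):
    """H(t) of Eq. (22): inverse of the average scaled flow."""
    scaled_sum = q_in / rho_in + q_out / rho_out
    if scaled_sum <= 0:
        raise ValueError("scaled flows must sum to a positive value")
    return 2.0 / scaled_sum


# Illustrative values: probe flows differ, but so do the LMPs at each end.
u = state_input_two_lmp(q_in=0.05, q_out=0.03, rho_in=0.10, rho_out=0.06, dt=60.0)
h = h_two_lmp(q_in=0.05, q_out=0.03, rho_in=0.10, rho_out=0.06)
print(round(u, 6), round(h, 6))
```

In this example the scaled inflow and outflow are both 0.5 veh/s, so the count does not change, whereas a single *ρ* of 0.1 applied to both ends would have reported a net gain of 12 vehicles over the same 60 s.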

A single-loop detector was installed at the entrance of the link to measure *ρin* and also to use as an input to the NN model. Accordingly, this study develops an NN model to estimate *ρout*. The tested link is shown in Figure 1. The next section describes the selected inputs (features) and the output variables of the NN model.

**Figure 1.** Tested link.

Characteristics of the NN: Input and Output Variables

Previous research has used different features to build machine learning models [23–26]. One study fused video and Bluetooth data to estimate traffic density and speed. The traffic flow was manually extracted from the video records, while the speed data were constructed from the collected Bluetooth travel time data [23]. Another study relied on archived data of traffic speeds, counts, and density to estimate traffic speed [24]. Distance headway, number of stops, and speed data were identified as useful features to achieve accurate density estimates in a study that employed loop detectors and CV data [25]. In a recent study, Sekula et al. used probe and automatic traffic recording station data to extract the features of the NN model [27]. The selected features were the (1) speed of probe vehicles, (2) weather data such as temperature, visibility, precipitation, and weather status, (3) infrastructure data (speed limits, number of lanes, class of the road, and type of the road), (4) temporal data such as the day of the week, and (5) volume profiles based on historical data. The literature showed that the traffic speed is always used as a model feature, especially when probe vehicle data are used. In contrast, the traffic flow is always used when stationary sensors (e.g., loop detectors) are used.

In this paper, a fusion of probe and single-loop detector data is utilized to produce the model features. The single-loop detector was installed at the entrance of the link and thus *ρin* can be computed directly using Equation (3). The *ρout* variable is calculated from the NN (the NN output). Seven possible inputs (features) were considered in the NN model, as defined in Table 1. A feature selection technique was then applied to validate the importance of each feature for the NN model, which reduced the number of model features to five. It should be noted that the selected model inputs can be easily extracted when probe vehicles are on the link. *ρout* can be expressed as a function of the selected inputs, as presented in Equation (23).

$$\rho_{out} = f(A_t, A_p, u_s, S_1, S_2) \tag{23}$$

The *ρout* values vary between 0 and 1: a value of 0 means that no probe vehicles were observed at the exit of the link, while a value of 1 means that the *Dp* value is the same as *Dt*. The selected inputs must be relevant to the model output *ρout* to allow the NN model to build a strong relationship between the model inputs and outputs, and therefore produce high estimation accuracy. In our case, the *ρout* value decreases as *At* and *Ap* increase; a high value of *At* means that the link is more congested and thus the number of departures (*Dt*) is expected to be high. The *ρout* value also decreases with increasing speed (*S*1, *S*2, and *us*). The speed is an indicator of the congestion level of the link; for instance, if the speed is low, then more vehicles are expected to be on the link, leading to higher values of *Dt*.


**Table 1.** Definition of the NN model inputs.

A single hidden layer with one neuron and a hyperbolic tangent sigmoid transfer function was used to build the NN model, as shown in Figure 2. Levenberg–Marquardt (LM) optimization has been shown in the literature to outperform the gradient descent and conjugate gradient methods for medium-sized problems [37]. Furthermore, LM is considered the fastest back-propagation algorithm and thus was implemented in the proposed approach. The weights and biases of the developed NN model are given below. *w*1 denotes the weights between the input layer and the hidden layer, while *w*2 represents the weight between the hidden layer and the output layer. *b*1 and *b*2 represent the biases at the hidden and output layers, respectively. Figure 2 describes the proposed AKFNN approach, combining the AKF model with the NN model.

*w*1 = [0.43, 0.19, −47.28, 0.36, −0.43], *w*2 = [1.70], *b*1 = [−46.62], *b*2 = [0.95]
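With one hidden neuron, the forward pass of the reported network is short enough to write out. The sketch below uses the weights and biases listed above and makes three assumptions that the text does not confirm: the weights are ordered as the inputs in Equation (23) (*At*, *Ap*, *us*, *S*1, *S*2), the output layer is linear, and the inputs have already been scaled in the same way as during training. Any input or output scaling applied by the training software is not described here, so the raw output may not be directly interpretable as *ρout*. The function name `nn_forward` is hypothetical.

```python
import math

# Weights and biases as reported in the text.
W1 = [0.43, 0.19, -47.28, 0.36, -0.43]  # input -> hidden (assumed input order)
W2 = 1.70                               # hidden -> output
B1 = -46.62                             # hidden bias
B2 = 0.95                               # output bias


def nn_forward(features):
    """Raw network output for five (already scaled) input features."""
    if len(features) != len(W1):
        raise ValueError("expected five input features")
    # Hidden neuron: hyperbolic tangent of the weighted sum plus bias.
    hidden = math.tanh(sum(w * x for w, x in zip(W1, features)) + B1)
    # Output layer: assumed linear (not stated in the text).
    return W2 * hidden + B2


# Illustrative scaled inputs only.
print(round(nn_forward([0.0, 0.0, -1.0, 0.0, 0.0]), 4))
```

Because the hidden neuron is a hyperbolic tangent, the raw output is always bounded between *b*2 − |*w*2| and *b*2 + |*w*2|, and the large weight on the third input means that feature dominates where the neuron leaves saturation.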

**Figure 2.** Flowchart for adaptive Kalman filter with a neural network (AKFNN) approach.
