Every time a Wi-Fi device has to deliver a message, it needs to know which AP to access. This information is established through association, which is a necessary, but not sufficient, operation to support the connectivity of the devices. Indeed, before a device is allowed to send a data message via an AP, it must first become associated with the AP. To this end, the device sends Probe Request frames, i.e., messages broadcast periodically from any active Wi-Fi interface to detect nearby APs.
In the following, the de-randomization process and the counting algorithm are described in detail.
3.1. De-Randomization Algorithm
The randomization of the MAC address introduced by operating system providers makes it possible to hide the real MAC address of the network card in the Probe Request frames sent by a device. In the Probe Request frames, the real MAC addresses are replaced by random MAC addresses that are changed several times over a limited period of time. The change does not take place at regular intervals or according to predefined timing, but depends on the use of the device. Therefore, the MAC address contained in the Probe Request frames is no longer sufficient to count the devices, as was previously the case.
In this Section we introduce the de-randomization algorithm, whose purpose is to determine which MAC addresses are most likely to belong to the same device. Indeed, some parameters included in Probe Request frames can be exploited to estimate with sufficient reliability which frames containing different random MAC addresses may have been sent by the same device.
In particular, some fields of Probe Request frames, called tagged parameters [44], remain constant even with randomized MAC addresses, as highlighted by the red square in Figure 3.
This information is the same in all the Probe Request frames sent by the same device, even if the MAC address is randomized. However, it identifies a particular family of devices rather than a single device. For two frames containing different random MAC addresses to be traceable to the same device, the equality of this information must therefore be considered as a first condition, necessary but not sufficient.
The de-randomization algorithm therefore needs to exploit other parameters, whose changes provide important information. Let us consider two MAC addresses received by the sniffer, namely i and j, with i received before j. For the two MAC addresses to be traced back to the same source device, the instant of time at which i was received, namely its timestamp, has to be lower than the timestamp of j. Therefore, considering the tagged parameters of i and j, the last timestamp associated with i and the first timestamp associated with j (both expressed in seconds), the de-randomization algorithm starts if, and only if, both of the following conditions are met:
where the first condition of Equation (1) selects only devices that have the same transmitting characteristics, i.e., the same tagged parameters, whilst the second condition of Equation (1) ensures that the last frame with i was sent before the first frame with j.
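Equation (1) itself is not reproduced in this text. A form consistent with the description above could read as follows; the notation is an assumption introduced here for illustration, with $TP_i$ and $TP_j$ standing for the tagged parameters of i and j, $t_i^{\mathrm{last}}$ for the last timestamp associated with i, and $t_j^{\mathrm{first}}$ for the first timestamp associated with j.

```latex
\begin{equation*}
\begin{cases}
TP_i = TP_j \\
t_i^{\mathrm{last}} < t_j^{\mathrm{first}}
\end{cases}
\end{equation*}
```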
As stated above, the de-randomization algorithm assesses the probability that two random MAC addresses correspond to the same device. To estimate this probability, we define a score for each pair of random MAC addresses identified. The score is calculated using the timestamp and another relevant parameter included in the Probe Request: the frame sequence number. The sequence number is a 12-bit counter that increases with each frame and is contained in the Sequence Control field (Seq-ctl in Figure 2). Its value ranges from 0 to 4095, and once this maximum value is reached the numbering starts again from 0. Successive frames have increasing sequence numbers, even if the random MAC address changes.
The score is inversely proportional to the difference in time and to the difference in sequence numbers between received frames with different MAC addresses. The difference in time is expressed as:
which expresses the continuity in the arrival of the frames, even for randomized addresses, and thus guarantees that not too much time has passed since the change of address. The difference in the sequence numbers has a similar goal, i.e., to check the continuity of the received frames even if the MAC address has changed due to the randomization process. However, we need to take into account that the sequence number takes values between 0 and 4095 and then wraps around. Considering the sequence number of the last frame corresponding to i and the sequence number of the first frame corresponding to j, the resulting formula is the following:
The score used in the algorithm is then defined as:
The score assumes values higher than 0 only if Equation (1) holds, i.e., if j and i have the same tagged parameters and they are received one after the other. For values higher than 0, the greater the value, the greater the probability that two MAC addresses are attributable to the same source device.
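The ingredients of the score can be sketched in code. The wrap-around of the 12-bit sequence number and the gating by Equation (1) follow the text; the final combination `1 / (dt * dsn)` is only an assumption chosen to be inversely proportional to both differences, since the exact score formula is not reproduced here. All function and parameter names are hypothetical.

```python
SEQ_MODULO = 4096  # 12-bit sequence number: values 0..4095


def seq_gap(sn_last_i: int, sn_first_j: int) -> int:
    """Forward distance from the last sequence number of i to the first of j,
    accounting for the wrap-around from 4095 back to 0."""
    return (sn_first_j - sn_last_i) % SEQ_MODULO


def score(tp_i, tp_j, t_last_i, t_first_j, sn_last_i, sn_first_j) -> float:
    """Illustrative score for the ordered pair (i, j); 0 means 'not the same device'."""
    # Equation (1): same tagged parameters, and the last frame of i
    # precedes the first frame of j.
    if tp_i != tp_j or not (t_last_i < t_first_j):
        return 0.0
    dt = t_first_j - t_last_i              # strictly positive here
    dsn = seq_gap(sn_last_i, sn_first_j)
    # ASSUMPTION: the combination below is not taken from the source.
    # max(dsn, 1) only avoids a division by zero when the gap is 0.
    return 1.0 / (dt * max(dsn, 1))
```

For example, a last sequence number of 4090 followed by a first sequence number of 5 gives a gap of 11 frames rather than a negative difference.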
Once the score has been calculated for all the pairs of random MAC addresses identified, the algorithm creates lists of MAC addresses traceable to the same source device. The process to create the lists, depicted in the flowchart of Figure 4, is now explained. Whenever a frame with a new random MAC address is received, the score is computed between the new address and all the other random MAC addresses that are already in a list. If no lists have been created yet, no scores are computed; if all the scores are equal to 0, the new address certainly belongs to a new device. In both these cases, a new list with the new address as its only element is created. Otherwise, the address m with the highest score with the new address, i.e., with the highest probability of belonging to the same device, is identified. Considering the list where m is located, if m is the last element of the list, the new address is appended to the list. If there is another MAC address that follows m in the list, the score between the new address and m has to be compared to the score between that following address and m: if the first is lower than the latter, it is more likely that the following address and m belong to the same device than that the new address and m do. Therefore, the process to find the right list for the new address starts again, ignoring m. If, on the other hand, the score between the new address and m is higher than the score between the following address and m, the probability that the new address and m belong to the same device is higher than the probability that the following address and m do. Accordingly, the new address is put in the list right after m. Furthermore, the MAC addresses that followed m in the list are no longer certain to belong to the same device. Hence, the process is repeated for the displaced address and for all the MAC addresses that followed it in the list.
The procedure therefore takes a recursive form. At the end, each list groups the MAC addresses that are most likely traceable to the same device. Accordingly, all the frames belonging to the same list are tagged with the same MAC address.
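The list-construction procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `place`, the `score(earlier, later)` callable, and the tie-breaking rule (on equal scores the address already in the list keeps its place) are all hypothetical choices.

```python
def place(addr, lists, score, ignored=frozenset()):
    """Insert `addr` into the list of its most likely device, or open a new list.

    `lists` is a list of lists of MAC addresses, modified in place.
    `score(earlier, later)` returns a non-negative float (0 = cannot match).
    """
    # Score the new address against every address already in a list.
    candidates = [(score(m, addr), li, pos)
                  for li, lst in enumerate(lists)
                  for pos, m in enumerate(lst)
                  if m not in ignored]
    best = max(candidates, key=lambda c: c[0], default=None)
    if best is None or best[0] <= 0:
        lists.append([addr])            # no list yet, or all scores are 0
        return
    s, li, pos = best
    lst = lists[li]
    m = lst[pos]
    if pos == len(lst) - 1:             # m is the last element: append
        lst.append(addr)
        return
    successor = lst[pos + 1]
    if s <= score(m, successor):        # the current successor is the better match
        place(addr, lists, score, ignored | {m})
        return
    # The new address wins: it takes the place right after m, and the
    # displaced addresses are placed again one by one.
    displaced = lst[pos + 1:]
    del lst[pos + 1:]
    lst.append(addr)
    for d in displaced:
        place(d, lists, score)
```

With scores 0.5 for (a, b), 0.9 for (a, c) and 0.1 for (b, c), receiving a, b and c in this order first yields the list [a, b]; c then displaces b, giving the lists [a, c] and [b].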
3.2. Passenger Counting Algorithm
Once the de-randomization process is completed, we need to analyze the resulting data, which represent the unique MAC addresses of the sensed devices, in order to count the number of people on the bus.
Compared with a static situation, e.g., counting people in a room, several issues need to be considered in order to accurately count only the people on the bus:
the distance between two consecutive bus stops can be highly variable, from hundreds of meters to one or two kilometers;
the Probe Requests are not sent regularly;
the time the bus spends at each stop is variable, and there may be stops where the bus does not stop;
the sniffer can sense devices that are not on the bus, e.g., those of people walking on the footpath, waiting at the bus stop or travelling in a car near the bus.
Let us consider the scenario depicted in Figure 5; our goal is to count the number of people on the bus and to know how many people board and alight from the bus at the last stop. To this end, we define the set of bus stops, so that for each stop z we can define the time of arrival at the bus stop and the time of departure from it; then, we call time spent the time interval the bus spends at each stop, which can be identified as follows:
We can also identify the running time as the time the bus needs to run from one stop to the next one, as follows:
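The two defining equations are not reproduced in this text. Under assumed notation (introduced here only for illustration), with $t^{\mathrm{arr}}_{z}$ and $t^{\mathrm{dep}}_{z}$ denoting the arrival and departure times at stop z, the two definitions above would take a form such as the following; indexing the running time by the stop of arrival is also an assumption.

```latex
\begin{align*}
T^{\mathrm{spent}}_{z} &= t^{\mathrm{dep}}_{z} - t^{\mathrm{arr}}_{z} \\
T^{\mathrm{run}}_{z} &= t^{\mathrm{arr}}_{z} - t^{\mathrm{dep}}_{z-1}
\end{align*}
```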
In order to overcome the issues mentioned earlier, we need to filter all the entries in our database. To address the variability in the reception frequency of the Probe Requests, for every stop z we examine a temporal window, so that the counting algorithm considers only the Probe Requests whose reception time falls within the interval:
where the watch time is a fixed parameter.
After the temporal filtering, we need to understand whether the remaining Probe Requests were transmitted by devices on the bus. To this end, we apply a series of checks to all the entries in the filtered window. A first check counts the number of frames for each MAC address, in order to identify devices encountered only for brief moments during the movement of the bus, such as the smartphone of a pedestrian or of a driver: if the number of frames captured is below a given threshold x, the requests of these devices have probably been captured only a few times by the sniffer and should be discarded.
Another check evaluates the received power of the Probe Requests: the received power decreases with the distance between the transmitting device and the sniffer, so it is possible to discard all the Probe Requests from devices that are too far away and are therefore likely to be outside the bus, such as a car moving in front of the bus, as shown in Figure 5. In particular, we require that the average power of the Probe Requests received by the sniffer from a given MAC address be higher than a certain threshold; in this way, it is possible to discard even the case of a car in a traffic jam that repeatedly moves away from the bus (lower power) and approaches it again (higher power).
Finally, because Probe Requests are sent in trains of closely spaced requests, we consider another parameter, namely the permanence of a device on the bus, calculated as the time difference between the last and the first frame received in the temporal window, to assess whether the requests belong to different trains of requests.
Summarizing, the entries of a generic MAC address inside a temporal window, as defined by Equation (7), are considered for the counting algorithm if they satisfy the following conditions:
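The three checks described above can be sketched as a single filter applied to the frames of one MAC address within a temporal window. This is an illustration under assumptions: the `Frame` fields, the parameter names and the threshold values are hypothetical, and requiring the permanence to be at least a minimum duration is an assumed reading of the third check, since the conditions themselves are not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Frame:
    mac: str
    timestamp: float   # reception time, in seconds
    rssi_dbm: float    # received power, in dBm


def passes_filters(frames: List[Frame], min_frames: int,
                   min_avg_power_dbm: float, min_permanence_s: float) -> bool:
    """Return True if the frames of one MAC address survive the three checks."""
    # Check 1: enough frames (threshold x in the text).
    if not frames or len(frames) < min_frames:
        return False
    # Check 2: average received power strictly above the threshold.
    if mean(f.rssi_dbm for f in frames) <= min_avg_power_dbm:
        return False
    # Check 3 (ASSUMED direction): permanence, i.e., last minus first timestamp,
    # must span at least `min_permanence_s`, so that the frames do not all
    # come from a single train of requests.
    times = [f.timestamp for f in frames]
    return max(times) - min(times) >= min_permanence_s
```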
The number of unique MAC addresses left after the filtering steps gives a rough estimate of the number of devices on the bus. However, it is still important to assess at which stop each device boards and alights. This information is useful to construct origin–destination matrices, but also to verify that the first or the last occurrence of a MAC address is not too far from any stop, which would indicate another vehicle travelling close to the bus at a similar speed rather than a passenger boarding or alighting from the bus.
To this end, we consider that a device boarded or alighted from the bus only if the first or the last Probe Request, respectively, was received within a time frame around the bus stop. This is necessary because it is not possible to know in advance when the Probe Requests will be transmitted, so we must guarantee a guard time, both before arriving at the bus stop and after leaving it, for those unlucky cases in which the first or the last Probe Request is sent just before reaching the bus stop or shortly after leaving it.
For every bus stop, the time frame available to receive the first or last Probe Request from a device is therefore the time spent at the stop extended by the guard times before and after it. In general, the guard time can be considered constant for all the stops, both before and after each stop; however, if the running time between two stops is too small, the guard time must be modified accordingly.
We finally need to understand, for each temporal window and for each device, whether the device was actually on board and whether it boarded or alighted during the considered time frame. As shown in Figure 5, when analyzing a single time frame, we can identify different sectors:
Sector 1, between the start of the temporal window and the start of the guard time for the previous stop, z − 1.
Sector 2, which covers stop z − 1, considering both the guard times, before and after the stop, and the time spent.
Sector 3, which covers the time between the two stops.
Sector 4, which covers the equivalent interval for stop z.
Sector 5, between the end of the guard time for stop z and the end of the temporal window.
For each device, the counting algorithm first checks whether the device was already considered on board. If it is the first time the sniffer has received requests from the device, the algorithm has to understand at which stop the device boarded: as stated earlier, a device can only board if its first Probe Request has been received when the bus is near a bus stop. The device will be counted as boarding at the previous stop, z − 1, if the first Probe Request is received in Sector 2, or at stop z if it is received in Sector 4. In all the other cases, i.e., when the Probe Request is received inside Sector 1, 3 or 5, all the entries for the device are considered spurious and are discarded.
A device on board is considered as such for the whole temporal window; in this situation, the condition on the number of frames for the next temporal window, described in Equation (7), is updated as follows:
to take into account that the device is already on board. However, if the device does not pass the filtering step in the next temporal window, i.e., the one for the following stop, then the algorithm goes back to the previous temporal window to check what happened by analyzing the last Probe Request received. Again, a device will be counted as alighting from the bus at stop z − 1 or z only if the last Probe Request is received in Sector 2 or 4, respectively; otherwise, all the entries for the device are discarded.
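The sector logic used for both boarding (first Probe Request) and alighting (last Probe Request) can be sketched as follows. The function and parameter names are hypothetical, the timestamp is assumed to lie inside the temporal window, and the rule in `effective_guard` (capping the guard time at half the running time) is only an assumption about how the guard time might be shortened for close stops.

```python
from typing import Optional


def effective_guard(guard: float, running_time: float) -> float:
    # ASSUMPTION: cap the guard time at half the running time so that the
    # guard intervals of two consecutive stops cannot overlap.
    return min(guard, running_time / 2.0)


def sector(t: float, prev_arrival: float, prev_departure: float,
           arrival: float, departure: float, guard: float) -> int:
    """Return the sector (1..5) of timestamp t within one temporal window."""
    if t < prev_arrival - guard:
        return 1                      # before the guard time of stop z - 1
    if t <= prev_departure + guard:
        return 2                      # stop z - 1, guard times included
    if t < arrival - guard:
        return 3                      # running between the two stops
    if t <= departure + guard:
        return 4                      # stop z, guard times included
    return 5                          # after the guard time of stop z


def stop_for_event(t: float, z: int, prev_arrival: float, prev_departure: float,
                   arrival: float, departure: float, guard: float) -> Optional[int]:
    """Stop at which a boarding or alighting event is counted, or None if spurious."""
    s = sector(t, prev_arrival, prev_departure, arrival, departure, guard)
    if s == 2:
        return z - 1
    if s == 4:
        return z
    return None
```

Calling `stop_for_event` with the timestamp of the first Probe Request gives the boarding stop, and calling it with the timestamp of the last Probe Request gives the alighting stop; a result of `None` corresponds to discarding the entries of the device.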
Notably, every time the bus reaches a stop z, the algorithm can count the number of passengers related to the previous stop by checking the next temporal window.
Figure 4 illustrates all the steps of the overall process. It starts with the sniffing of the Probe Requests and ends when all the bus stops have been analyzed to count the passengers.