### 3.1.1. Data Collection

The basic data needed to execute the message reverse framework are the in-vehicle CAN bus traces, *X*, and the raw physical measurements, *Yr*, where *Yr* is the original sensor data for a particular behavior of the vehicle and *X* is the CAN trace obtained while the vehicle performs that behavior. This phase requires *X* and *Yr* to be acquired simultaneously to reduce errors in the linear regression modeling. Therefore, the data acquisition devices shown in Figure 6 are used in this phase, with a shared timestamp for synchronization.

The CAN trace acquisition device is shown in Figure 6a. It is a combined cable consisting of an OBD-II to DB9 diagnostic cable and a PCAN-USB FD adapter. The cable connects the OBD-II port of the vehicle to a USB port of the computer, allowing CAN traffic to be collected in real time.

The behavioral measurements of the vehicle are collected using the sensor device shown in Figure 6b. The device consists of a global positioning system (GPS) antenna, a universal serial bus (USB) interface, and a gyroscope angle sensor with a 0–200 Hz sampling frequency. Although the device costs only \$78.56 [43], it has a speed sampling accuracy of 0.001 km/h and an angle sampling accuracy of 0.1°. To reduce the sensor sampling error, the device should be installed so that the direction in which the measured quantity changes coincides with one of the sensor's axes. For example, the Y-axis of the sensor is aligned with the heading direction of the vehicle when collecting vehicle speed, and the X-axis of the sensor is aligned with the direction of angle change when collecting angle data. To represent the behavior and condition of the vehicle as completely as possible, the sensor deployment locations and the collected data are listed in Table 2. Operating the two devices synchronously provides the raw data for the reverse framework.

**Figure 6.** Data acquisition equipment: (**a**) OBD-II data collection equipment; (**b**) Vehicle behavior sensor.


### 3.1.2. Data Processing and Resampling

Since the raw data collected by the sensors are limited and do not fully capture the various vehicle states, the collected *Yr* must be processed to reveal more vehicle-related state information. Integration, differentiation, and discretization are applied to the obtained *Yr* to get more information. For the vehicle behavior in each *Yr*, the rate of change is obtained by differentiation, the total amount of change is obtained by integration, and the discrete behavioral states are obtained by applying a threshold. Taking speed as an example, the acceleration of the vehicle is obtained by differentiating speed with respect to time, and the mileage is obtained by integrating speed over time. Based on the vehicle speed and a threshold of 1 km/h, the vehicle can be classified into two discrete states, stationary and driving. The data processing methods and results are shown in Table 3. After this extension, there are 13 types of vehicle behaviors. The output of data processing is *Ys*, which contains more detailed vehicle states.
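The three operations described above can be illustrated for the speed example with a minimal sketch. It assumes timestamps in seconds and speed in km/h, uses a backward difference and the trapezoidal rule as stand-ins for the derivative and integral, and treats a speed equal to the threshold as driving; the function name `derive_states` and these numerical choices are illustrative assumptions, not the procedure used in the study.

```python
def derive_states(t, speed_kmh, threshold_kmh=1.0):
    """Derive acceleration, mileage, and a discrete state from a speed series.

    t          -- timestamps in seconds (strictly increasing)
    speed_kmh  -- speed samples in km/h, same length as t
    Returns (accel in km/h per second, mileage in km, state list with
    0 = stationary and 1 = driving).
    """
    n = len(t)
    accel = [0.0] * n      # backward difference; first sample has no predecessor
    mileage = [0.0] * n    # cumulative trapezoidal integral of speed
    for k in range(1, n):
        dt_s = t[k] - t[k - 1]
        accel[k] = (speed_kmh[k] - speed_kmh[k - 1]) / dt_s
        mean_speed = 0.5 * (speed_kmh[k] + speed_kmh[k - 1])
        mileage[k] = mileage[k - 1] + mean_speed * dt_s / 3600.0
    # Assumption: a speed exactly at the threshold counts as driving.
    state = [1 if v >= threshold_kmh else 0 for v in speed_kmh]
    return accel, mileage, state


accel, mileage, state = derive_states([0, 1, 2, 3], [0.0, 0.5, 36.0, 36.0])
```

Each derived series keeps the length of the input, so it can be resampled later in the same way as the raw measurement.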

When processing the raw CAN data collected through the OBD-II port, the framework groups the raw CAN messages by ID and removes the messages whose data field is constant. Since the ID identifies the type of the CAN message, *Xi* is first obtained by grouping the messages by ID, which facilitates the subsequent modeling of the messages for each ID. Because the framework proposed in this study reverses CAN messages based on vehicle behavior, CAN messages that remain constant while the sensor records the vehicle behavior do not describe any vehicle behavior and are therefore treated as noise. READ and LibreCAN refer to such noise as constant data, that is, CAN messages with constant data fields. Removing these noisy messages reduces the amount of resampling and subsequent modeling, and thus the overall time required.
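A minimal sketch of this grouping and noise removal follows. It assumes each captured frame is available as a `(timestamp, can_id, data)` tuple; that layout and the function name `group_and_filter` are hypothetical and do not correspond to any particular capture tool.

```python
from collections import defaultdict


def group_and_filter(frames):
    """Group CAN frames by ID and drop IDs whose data field never changes.

    frames -- iterable of (timestamp, can_id, data) with data as bytes
    Returns a dict mapping can_id to a list of (timestamp, data) tuples.
    """
    groups = defaultdict(list)
    for ts, can_id, data in frames:
        groups[can_id].append((ts, bytes(data)))
    # An ID with a single distinct payload carries no behavior information.
    return {
        can_id: msgs
        for can_id, msgs in groups.items()
        if len({payload for _, payload in msgs}) > 1
    }


frames = [
    (0.0, 0x100, b"\x00\x01"),
    (0.1, 0x200, b"\xff"),
    (0.2, 0x100, b"\x00\x02"),
    (0.3, 0x200, b"\xff"),
]
x_by_id = group_and_filter(frames)  # only ID 0x100 remains
```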


**Table 3.** Methods and results of raw data processing.

The next step of data processing is to synchronize the CAN messages with the vehicle behavior. In this study, only the CAN messages in *Xi* whose timestamps fall between the beginning and the end of the vehicle behavior described by *Ys* are selected. Synchronizing the data ensures that the CAN messages in *Xi* and the behavior described by *Ys* correspond to the same vehicle behavior and state during this time interval.
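The interval selection can be sketched as follows, assuming the messages of one ID are stored as `(timestamp, data)` tuples sorted by timestamp and that both interval endpoints are inclusive. The helper name `select_window` and the inclusive bounds are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def select_window(messages, t_start, t_end):
    """Return the messages with t_start <= timestamp <= t_end.

    messages -- list of (timestamp, data) sorted by timestamp
    """
    stamps = [ts for ts, _ in messages]
    lo = bisect_left(stamps, t_start)   # first index with ts >= t_start
    hi = bisect_right(stamps, t_end)    # first index with ts > t_end
    return messages[lo:hi]


msgs = [(0.0, b"a"), (1.0, b"b"), (2.0, b"c"), (3.0, b"d")]
window = select_window(msgs, 1.0, 2.5)
```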

Finally, the multiple linear regression described in Section 2.2 models the dependent and explanatory variables in the same dimension. However, since the messages for each ID appear at a different frequency than the sampling rate of the sensor device, *Ys* must be resampled based on the frequency of *Xi* to ensure that the two have the same dimensionality [44]. In the data resampling process, this study uses the time series resampling method in Python to resample each vehicle state *Ys* according to the frequency of each *Xi* to facilitate subsequent modeling. The resampled data, *Ysi*, have the same dimensions as *Xi*. In this step, a separate resampling must be performed for each *Ys* based on the frequency of each *Xi*, which yields 13 × *n* resampled series *Ysi*.
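The text does not name the exact resampling routine, so the following sketch substitutes plain linear interpolation: the sensor series is evaluated at the timestamp of every message in *Xi*, which gives one *Ysi* value per message. The function name `resample_to` and the choice to hold the first or last sensor value outside the sensor time range are assumptions.

```python
from bisect import bisect_right


def resample_to(msg_times, sensor_times, sensor_values):
    """Linearly interpolate a sensor series at the CAN message timestamps.

    msg_times     -- timestamps of the messages of one ID
    sensor_times  -- strictly increasing sensor timestamps
    sensor_values -- sensor samples, same length as sensor_times
    Returns one value per message timestamp.
    """
    last = len(sensor_times) - 1
    out = []
    for t in msg_times:
        if t <= sensor_times[0]:
            out.append(sensor_values[0])        # hold first value
        elif t >= sensor_times[last]:
            out.append(sensor_values[last])     # hold last value
        else:
            j = bisect_right(sensor_times, t)   # sensor_times[j-1] <= t < sensor_times[j]
            t0, t1 = sensor_times[j - 1], sensor_times[j]
            v0, v1 = sensor_values[j - 1], sensor_values[j]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


y_si = resample_to([0.0, 0.5, 1.5, 2.0, 3.0], [0.0, 1.0, 2.0], [0.0, 10.0, 20.0])
```

Discrete states such as stationary or driving would more naturally use a nearest or previous-sample rule instead of interpolation; that variant is omitted here.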

### *3.2. Related Messages Filter*

Based on the results of data processing and resampling, the purpose of this stage is to build a linear regression model with *Ysi* as the dependent variable and each bit of the data field in *Xi* as an independent variable. Based on the *R*<sup>2</sup> of the model, the messages that are most relevant to the dependent variable are selected.

To obtain the relationship between each bit of the data field and the vehicle behavior, this step starts by expanding the data field in *Xi* into bit form, which yields an *l* × 64 matrix, where *l* is the number of messages with ID *i*. The dependent variable *Ysi*, an *l* × 1 matrix, represents the vehicle state data resampled according to the message dimension, where *s* denotes the vehicle state, *s* ∈ (*s*1, *s*2, ..., *s*13). A threshold Δ*s* is defined to filter out the best model. The outputs of this stage are the messages and linear regression models that are highly correlated with the individual vehicle behavior data. The flow of this phase is shown in Figure 7. The detailed process is shown below.


• **Step 5:** Execute step 1 to step 4 for all *s* to obtain the candidate messages and the corresponding models for each vehicle behavior.
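As an illustration of the modeling that underlies this stage, the sketch below expands payloads into an *l* × 64 bit matrix, fits an ordinary least squares model with an intercept, and reports *R*<sup>2</sup>. It uses NumPy; the function names, the most-significant-bit-first ordering, and the zero padding of payloads shorter than 8 bytes are assumptions, and the sketch does not reproduce the individual steps of the procedure or the way Δ*s* is applied.

```python
import numpy as np


def payload_to_bits(payloads):
    """Expand l payloads (bytes, at most 8 long) into an l x 64 matrix of 0/1.

    Assumption: byte 0 comes first and each byte is unpacked MSB first;
    shorter payloads are padded with zero bytes.
    """
    rows = [list(p.ljust(8, b"\x00")) for p in payloads]
    return np.unpackbits(np.array(rows, dtype=np.uint8), axis=1).astype(float)


def fit_r2(bits, y):
    """Least squares fit of y on the bit columns plus an intercept.

    Returns (beta, r2) where beta[0] is the intercept and beta[j] belongs
    to bit column j - 1. Constant bit columns make the system rank
    deficient; lstsq then returns the minimum-norm solution.
    """
    design = np.hstack([np.ones((bits.shape[0], 1)), bits])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return beta, r2


# Synthetic example: the low nibble of byte 0 encodes the signal.
payloads = [bytes([v, 0, 0, 0, 0, 0, 0, 0]) for v in range(16)]
y = np.array([0.5 * v + 1.0 for v in range(16)])
beta, r2 = fit_r2(payload_to_bits(payloads), y)
```

In this synthetic case the fit is exact, and the coefficients of the four active bits halve from one bit to the next.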

**Figure 7.** Message selection based on *β*.

### *3.3. Bit-Level Message Reverse*

After the related message filtering phase, the most relevant candidate messages for the particular vehicle behavior and the corresponding linear regression models are determined. The linear regression models of *Ysi* and *Xi* are shown in Equation (7). This result clearly shows the relationship between the vehicle behavior and the data fields of *mi*, where *β* = (*β*0, *β*1,..., *β*64) represents the linear relationship between this vehicle behavior data and each bit of the message.

$$Y_{si} = \beta_0 + \beta_1 \mathbf{x}_{i1} + \beta_2 \mathbf{x}_{i2} + \dots + \beta_{64} \mathbf{x}_{i64} \tag{7}$$

In this stage, the specific details of how the data fields of candidate CAN messages describe the behavior of the vehicle are determined by analyzing the regression coefficient *β*. As shown in Figure 8, the flow of the bit-level reverse for the candidate messages proceeds as follows.

• Iterate through each *βx* in *β* = (*β*0, *β*1,..., *β*64), keeping only those *βx* that are not less than the threshold value. If the value of *βx* is less than the threshold, it means that the *x*th bit of the data field is not related to the specific vehicle behavior. Otherwise, this bit may represent how the behavior of the vehicle is recorded in the CAN messages. The result after threshold filtering is *β*.


$$
\beta_i = 2 \times \beta_{i+1} = 4 \times \beta_{i+2} = \dots = 2^n \times \beta_{i+n} \tag{8}
$$

$$
\beta_i = \frac{1}{2} \times \beta_{i+1} = \frac{1}{4} \times \beta_{i+2} = \dots = \frac{1}{2^n} \times \beta_{i+n} \tag{9}
$$
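The threshold filtering and the coefficient patterns of Equations (8) and (9) can be explored with a short sketch. It assumes that the intercept *β*0 is excluded from the bit analysis, that coefficients below the threshold are set to zero, and that a run of adjacent bits is of interest when each coefficient is roughly twice (Equation (8)) or half (Equation (9)) its successor. The function names, the relative tolerance, and this reading of the two equations are assumptions, since the steps that apply them are not reproduced here.

```python
def filter_coefficients(beta, threshold):
    """Drop the intercept beta[0] and zero every bit coefficient below threshold.

    Returns a list in which index k holds the coefficient of bit k + 1.
    """
    return [b if b >= threshold else 0.0 for b in beta[1:]]


def find_scaled_runs(coeffs, rel_tol=0.05):
    """Find maximal runs of adjacent coefficients following Eq. (8) or Eq. (9).

    Returns a list of (first_index, last_index, label), indices inclusive
    and zero-based within coeffs, label either "eq8" or "eq9".
    """
    def pattern(a, b):
        if a == 0 or b == 0:
            return None
        ratio = a / b
        if abs(ratio - 2.0) <= 2.0 * rel_tol:
            return "eq8"   # coefficient is twice its successor
        if abs(ratio - 0.5) <= 0.5 * rel_tol:
            return "eq9"   # coefficient is half its successor
        return None

    runs = []
    start, current = 0, None
    for k in range(len(coeffs) - 1):
        label = pattern(coeffs[k], coeffs[k + 1])
        if label != current:
            if current is not None:
                runs.append((start, k, current))
            start, current = k, label
    if current is not None:
        runs.append((start, len(coeffs) - 1, current))
    return runs


beta = [1.0, 0.0, 4.0, 2.0, 1.0, 0.5, 0.001, 1.0, 2.0, 4.0]
kept = filter_coefficients(beta, threshold=0.01)
runs = find_scaled_runs(kept)
```

With this synthetic `beta`, the sketch reports one run for each pattern, separated by the coefficient that the threshold removed.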

**Figure 8.** Diagram of bit-level reverse.
