**1. Introduction**

The increasingly diverse features in today's vehicles offer drivers and passengers a more relaxed driving experience and greater convenience. Vehicle connectivity provides real-time information and a variety of entertainment options. In addition, vehicle support features such as advanced driver assistance systems (ADAS) reduce driving stress and make driving safer. These capabilities have multiplied because of the growing number of electronic control units (ECUs) and their higher computing power. Current vehicles are equipped with up to 150 ECUs [1], which must communicate over a unified network that provides sophisticated real-time performance, sufficient data transmission volume, and adequate reliability. The Controller Area Network (CAN), a technology that meets these requirements, became the international standard for intra-vehicle network communication in 1993 [2]. However, because CAN uses broadcast communication and lacks security mechanisms such as encryption and authentication, it increases the probability that the vehicle will be attacked [3–6].

**Citation:** Bi, Z.; Xu, G.; Xu, G.; Wang, C.; Zhang, S. Bit-Level Automotive Controller Area Network Message Reverse Framework Based on Linear Regression. *Sensors* **2022**, *22*, 981. https://doi.org/10.3390/s22030981

Academic Editors: Leandros Maglaras, Helge Janicke and Mohamed Amine Ferrag

Received: 9 December 2021 Accepted: 24 January 2022 Published: 27 January 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Many attacks on vehicles have confirmed that it is possible to compromise a vehicle and take malicious control of it [7–9]. The most typical case is the attack by Miller et al. on a Jeep Cherokee driving on the highway, which exploited a CAN bus-connected entertainment system and ECU firmware and resulted in acceleration and brake failures [10]. More recently, Keen Labs in China exploited vulnerabilities in Tesla's assisted driving system to steer the vehicle into the oncoming lane and even remotely control the vehicle's steering with a gamepad [11]. Regardless of the type of vulnerability, the common denominator of these attacks is the need to inject messages into the CAN bus to cause the vehicle to behave dangerously [12]. To protect the CAN bus from targeted attacks, original equipment manufacturers (OEMs) keep the database CAN (DBC) file private. The DBC file defines the structure, content, and meaning of each message in the CAN network [13,14], and it differs even between models of the same brand. An attacker must therefore spend considerable time on reverse engineering before implementing a CAN bus attack. For security researchers, however, private DBC files are also a massive obstacle to CAN security research. The most affected area is the automotive intrusion detection system (IDS), a crucial research topic in automotive security. CAN intrusion detection systems have been proposed to detect anomalies by analyzing CAN traffic [15–23], but these studies rely on message transmission characteristics that are practically unrelated to the behavior and status of the vehicle, which limits the effectiveness of existing IDSs for CAN. Another hindered area is fuzz testing on the CAN bus, which is often used to automatically test and discover unknown vulnerabilities in ECUs [24–28]. Because the DBC files are hidden, the fuzzer has to construct test data blindly, and brute-force and random data make the test inefficient. In addition, the lack of DBC files with detailed descriptions of CAN messages hinders automotive aftermarket development. Without effective access to vehicle status, automotive driver assistance systems and status display tools become meaningless.

The detailed specification of CAN messages is crucial for CAN network intrusion detection, fuzz testing, and automotive aftermarket products. To obtain the CAN message descriptions contained in the DBC file, the security research community has proposed CAN bus reverse-engineering methods such as CAN message tokenization algorithms, machine learning-based reversal methods, and onboard diagnostics II (OBD-II) diagnostic information matching. The earliest CAN message tokenization algorithm was the FBCA algorithm proposed by Markovitz et al. in 2017 [29], followed by the READ algorithm proposed by Marchetti and Stabili in 2018 [30]. The automatic CAN message translator LibreCAN was proposed by Pesé et al. in 2019 [31]. More recently, the ReCAN dataset [32] was published by Zago et al. in 2020 using an approach similar to READ. However, these methods are limited to classifying and subdividing data by how they change, for example into constants, multi-value fields, counters, and sensor values. They cannot recover specific information such as the meaning and alignment of each tagged field, so they are of minimal help for IDS research and the aftermarket. Among the machine learning-based CAN message reversal methods, the most typical is that of Jaynes et al., who proposed a method for efficient identification of sending ECUs that identifies CAN frames by analyzing a similarity construction model describing uniform vehicle state information [33]. A data-driven CAN bus reversal method proposed by Buscemi et al. used already available open-source DBC files to train a machine learning model to identify unknown CAN message contents [34], a scheme similar to the unsupervised machine learning-based scheme proposed by Ezeobi et al. [35]. The accuracy of this type of solution depends entirely on the coverage of the training set. Since each vehicle is configured with a unique DBC file, it is almost impossible for the training set of such algorithms to cover all vehicle models. These approaches have been validated only on simulated data and are infeasible in practice.
Methods based on matching OBD-II diagnostic information locate the vehicle status in CAN messages by comparing and matching them with OBD-II responses. Song and Kim first proposed creating windows before and after the OBD-II response to find candidate messages that exactly duplicate the response data, and repeating this several times to determine the message describing the response [36]. Blaauwendraad proposed a matching method using correlation coefficients based on Song's method [37]. While these methods can yield some reversal results, they can only identify specific vehicle behaviors in CAN messages. The small number of vehicle behaviors supported by per-vehicle diagnostics limits the application of this scheme. Additionally, CANHUNTER [38], proposed by Wen et al. in 2020, reverses CAN messages by disassembling the control APP that interacts with the car. Although this is a novel idea, the method can only obtain what is specified in the APP, and the scheme becomes completely invalid once the APP commands are escaped on the server side. In addition, since such APPs are only valid for the specified car model, this scheme is also limited by the car model. In summary, existing CAN message reversal techniques are limited in their implementation by the number of available DBC files and vehicle models, and their results are unsatisfactory. Solutions that are not limited by vehicle model and can achieve results close to the DBC file are urgently needed.

CAN frame data alone do not reveal any valuable information; DBC files are needed to decode them. However, the DBC files are hidden and usually differ for each model. Reverse engineering solutions for CAN messages that are not constrained by the vehicle model and can recover the critical information in the DBC files are therefore urgently needed. To achieve CAN message reversal close to the DBC file, this study proposes a multiple linear regression model based on an in-depth analysis of the way the DBC file specifies vehicle behavior. The model is built using each bit of the CAN message data field as an independent variable and the vehicle behavior data as the dependent variable. Because the framework requires sensor data as an additional input, it must deliver correspondingly more useful results. First, the framework uses the *R*<sup>2</sup> of the model to filter the candidate messages related to vehicle behavior, which filters related messages more effectively than existing schemes. In addition, the framework outperforms existing systems in data boundary delineation: it locates the bits describing the vehicle behavior and, based on the β values of each model, obtains the field functions, starting bits, field lengths, and alignment formats defined in the DBC file. Finally, since commercially available vehicles must be equipped with a standard CAN data interface and the vehicle behavior can be captured by commonly used sensors, the reverse framework proposed in this study is independent of the vehicle model and brand.
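The idea of regressing a sensor reading on individual data-field bits can be sketched in a few lines of Python. This is only an illustration on synthetic data, not the framework itself: the two-byte payload, the linear mapping `0.5 * raw - 10`, the sample count, and all variable names are assumptions made for the example. NumPy's `lstsq` is used to solve the least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical capture: 500 frames of one CAN ID with 2 data bytes each.
# Byte 0 carries an 8-bit signal; byte 1 is unrelated random content.
raw = rng.integers(0, 256, size=500, dtype=np.uint8)
noise = rng.integers(0, 256, size=500, dtype=np.uint8)
payload = np.stack([raw, noise], axis=1)

# Hypothetical sensor reading recorded alongside each frame.
sensor = 0.5 * raw.astype(float) - 10.0

# One column per bit; column 8*b + j is bit j (LSB = 0) of byte b.
bits = np.unpackbits(payload, axis=1, bitorder="little").astype(float)

# Prepend the intercept column and solve the least-squares problem.
design = np.column_stack([np.ones(len(bits)), bits])
beta, *_ = np.linalg.lstsq(design, sensor, rcond=None)

# Coefficient of determination of the fitted model.
fitted = design @ beta
r2 = 1.0 - np.sum((sensor - fitted) ** 2) / np.sum((sensor - sensor.mean()) ** 2)

# Bits with a non-negligible coefficient are the ones describing the signal.
signal_bits = np.flatnonzero(np.abs(beta[1:]) > 1e-6)
```

In this toy setting the coefficients of the eight signal bits double from one bit to the next (0.5, 1, 2, ...), which exposes the bit order and resolution, while the coefficients of the unrelated bits stay near zero. Real captures are noisy and may contain constant bits, so the clean separation seen here should not be expected in practice.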

The structure of this study is as follows. Section 2 introduces the CAN bus, the DBC file, and the multiple linear regression model, and discusses the feasibility of the proposed idea. Section 3 describes the design and implementation of the framework. Section 4 evaluates the CAN reverse framework on actual vehicles in terms of reverse accuracy, time required, advantages over existing solutions, and applicability. Section 5 concludes the study.

### **2. Background and Feasibility**

### *2.1. CAN Bus Overview*

The CAN bus is a serial communication bus originally developed by Bosch [39]. The International Organization for Standardization (ISO) later issued the international standard ISO 11898 for CAN in 1993 [40]. CAN has become one of the most widely used fieldbuses globally because of its high transmission rate and strong real-time characteristics.

The standard format of a CAN message is shown in Figure 1. It begins with the start of frame (SOF), followed by an 11-bit identifier (ID) and a remote transmission request (RTR) bit. The ID defines the meaning and type of the message and is also used by receiving nodes to filter out irrelevant messages. The ID is further used for arbitration when multiple nodes send data simultaneously: the smaller the ID, the higher the priority. The RTR bit distinguishes data frames from remote frames. A six-bit control field follows: the identifier extension (IDE) bit distinguishes the standard frame format from the extended one, r0 is a reserved bit, and the four-bit data length code (DLC) specifies the number of bytes in the data field. The data field is the core of the CAN message and is up to 64 bits (8 bytes) long. It contains the vehicle control commands, the status data, and any other data to be transmitted (e.g., counters and checksum values). It is followed by the cyclic redundancy check (CRC) field, the acknowledgement (ACK) field, and the end of frame (EOF).

**Figure 1.** Standard CAN message frame.
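The parts of the frame that matter for reverse work (ID, RTR, DLC, and data field) and the arbitration rule can be modeled with a small Python sketch. The class name `CanFrame` and the helper `arbitration_winner` are hypothetical names chosen for this illustration; bit stuffing, CRC, ACK, and EOF are deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CanFrame:
    """Simplified standard-format CAN data frame (illustrative only)."""
    can_id: int   # 11-bit identifier
    rtr: bool     # True for a remote frame, False for a data frame
    data: bytes   # data field, 0 to 8 bytes

    def __post_init__(self) -> None:
        if not 0 <= self.can_id < 2 ** 11:
            raise ValueError("standard CAN ID must fit in 11 bits")
        if len(self.data) > 8:
            raise ValueError("data field is limited to 8 bytes")

    @property
    def dlc(self) -> int:
        # The data length code gives the number of data bytes.
        return len(self.data)


def arbitration_winner(frames: Iterable[CanFrame]) -> CanFrame:
    # When nodes transmit simultaneously, the smallest ID has priority.
    return min(frames, key=lambda frame: frame.can_id)


pending = [CanFrame(0x198, False, bytes(8)), CanFrame(0x0A0, False, b"\x01\x02")]
winner = arbitration_winner(pending)
```

Here `winner` is the frame with ID 0x0A0, because it has the smaller identifier.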

For CAN message reversal work, the main targets of the reversal are the identifier (ID) and the data fields. When reversing CAN messages, the relevant message ID is usually locked first, and then the data fields are analyzed to obtain specific bit fields that characterize the vehicle behavior.

### *2.2. DBC File*

The form and content of each type of CAN message are defined in the DBC file, so each OEM keeps it private to avoid leaking this information at the source and to prevent malicious control and modification of the car. However, CAN messages can only be fully translated by using the DBC file as a lookup table, which is what makes it the natural target of CAN reverse work. The contents defined in the DBC file are listed in Table 1. The Name, ID, Cycle Time, and Length describe the entire message. The Function specifies one or more vehicle behaviors in the message data fields. Byte Order, Start Byte, Start Bit, Bit Length, Units, Precision, and Offset specify how the message describes the specific behavior. Typically, the data field of a message contains multiple functions.

**Table 1.** DBC file content definition


The message with ID 0x198 is used to explain the correspondence between the DBC file and the CAN message content. As shown in Figure 2a, the DBC file defines the name of the message as angle, the sending period as 10 ms, and the message length as 64 bits, and the message contains three vehicle behaviors: steering angle, brake pedal angle, and gas pedal angle. The steering angle is arranged in Motorola (LSB) form from bit 0 to bit 15 with a resolution of 0.01. Similarly, the gas pedal and brake pedal angles are arranged in bits 16–23 and 48–55 of the data field, with Intel (MSB) and Motorola alignment, respectively. Any captured message with ID 0x198 can be decoded according to these definitions. The message shown in Figure 2b therefore describes the angle information of the vehicle at that moment: the brake pedal angle is (2<sup>2</sup> + 2<sup>4</sup> + 2<sup>5</sup> + 2<sup>7</sup>) × 0.1 = 18.0°, the steering angle is (2<sup>0</sup> + 2<sup>4</sup> + 2<sup>5</sup> + 2<sup>7</sup>) × 0.01 = 1.77°, and the gas pedal angle is 0.


**Figure 2.** Correspondence diagram between DBC file and CAN messages: (**a**) 0x198 Message definition in DBC; (**b**) Message data decoded according to DBC.
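Decoding a signal from the data field with its start bit, bit length, resolution, and offset can be sketched as follows. The sketch assumes a plain little-endian layout in which bit 0 is the least significant bit of byte 0; this numbering, the function name `decode_signal`, and the example payload are assumptions for illustration and do not reproduce the byte-order conventions of any particular DBC file.

```python
def decode_signal(data: bytes, start_bit: int, bit_length: int,
                  scale: float = 1.0, offset: float = 0.0) -> float:
    """Decode an unsigned signal from a CAN data field.

    Assumes little-endian bit numbering: bit 0 is the LSB of byte 0.
    """
    as_int = int.from_bytes(data, byteorder="little")
    raw = (as_int >> start_bit) & ((1 << bit_length) - 1)
    return raw * scale + offset


# Hypothetical 8-byte payload: bits 0, 4, 5 and 7 are set, all others clear.
frame_data = bytes([0b10110001, 0, 0, 0, 0, 0, 0, 0])

# Steering angle: bits 0-15, resolution 0.01 -> raw value 177.
steering_deg = decode_signal(frame_data, start_bit=0, bit_length=16, scale=0.01)

# Gas pedal angle: bits 16-23 (the scale is left at 1.0 as a placeholder).
gas_deg = decode_signal(frame_data, start_bit=16, bit_length=8)
```

With this payload the raw steering value is 2<sup>0</sup> + 2<sup>4</sup> + 2<sup>5</sup> + 2<sup>7</sup> = 177, so the decoded angle is about 1.77°, and the gas pedal field decodes to 0.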

In summary, the DBC file is vital for studying CAN messages in depth, which makes it a realistic target for reverse work.

### *2.3. Linear Regression Preliminary*

In statistics, the multiple linear regression model describes the linear relationship [41,42] between a scalar dependent variable $y$ and several explanatory variables $X = (x_1, x_2, \dots, x_k)$. The model function is shown in Equation (1), where $\beta = (\beta_0, \dots, \beta_k)$ is the vector of unknown model parameters, which can be estimated from a sample set of $y$ and $X$. The ordinary least squares method is the most commonly used method for parameter estimation. For a given sample set $y_e$ (see Equation (2)) and $X_e$ (see Equation (3)), the ordinary least squares method first creates a new matrix $\Omega$, as shown in Equation (4).

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \tag{1}$$

$$y_e = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} \tag{2}$$

$$X_e = \begin{pmatrix} x_{11} & \dots & x_{1k} \\ x_{21} & \dots & x_{2k} \\ \vdots & \ddots & \vdots \\ x_{m1} & \dots & x_{mk} \end{pmatrix} \tag{3}$$

$$\Omega = \begin{pmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \dots & x_{mk} \end{pmatrix} \tag{4}$$

The estimate $\hat{\beta}$ can be obtained from Equation (5), where $\Omega^T$ is the transpose of $\Omega$. The determination coefficient $R^2$ indicates how well the samples fit the linear model created with $\hat{\beta}$ and is calculated by Equation (6), where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_k x_{ik}$ is the value of $y_i$ estimated with the linear model and $\bar{y}$ is the mean of $y_e$. The value of $R^2$ is in the range [0, 1], and 1.0 is the best fit.

$$\hat{\beta} = \left(\Omega^T \Omega\right)^{-1} \Omega^T y_e \tag{5}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2} \tag{6}$$
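Equations (4) to (6) translate almost directly into NumPy. The following is a minimal sketch, assuming the samples are small enough to fit in memory and that the columns of Ω are linearly independent; the function name `ols_fit` and the demo data are illustrative and not part of the original study.

```python
import numpy as np


def ols_fit(X_e, y_e):
    """Return (beta_hat, r_squared) for ordinary least squares."""
    X_e = np.asarray(X_e, dtype=float)
    y_e = np.asarray(y_e, dtype=float)

    # Equation (4): prepend a column of ones for the intercept beta_0.
    omega = np.column_stack([np.ones(X_e.shape[0]), X_e])

    # Equation (5): solve (Omega^T Omega) beta = Omega^T y_e.
    # Solving the system avoids forming the inverse explicitly.
    beta_hat = np.linalg.solve(omega.T @ omega, omega.T @ y_e)

    # Equation (6): coefficient of determination.
    y_hat = omega @ beta_hat
    ss_res = np.sum((y_e - y_hat) ** 2)
    ss_tot = np.sum((y_e - y_e.mean()) ** 2)
    return beta_hat, 1.0 - ss_res / ss_tot


# Demo data generated from y = 1 + 2*x1 - x2 without noise.
X_demo = [[0, 1], [1, 0], [1, 1], [2, 1], [3, 5]]
y_demo = [1 + 2 * a - b for a, b in X_demo]
beta_hat, r_squared = ols_fit(X_demo, y_demo)
```

For this noise-free data the estimate recovers the generating coefficients (1, 2, -1) up to floating-point error and $R^2$ is 1. When some columns are constant or collinear, $\Omega^T \Omega$ becomes singular and `np.linalg.solve` fails; a routine such as `np.linalg.lstsq` is the more robust choice in that case.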
