1. Introduction
Transportation has become an essential aspect of modern life: more than 278 million vehicles were registered in the US as of 2022, and 72% of individuals rely on personal vehicles as their primary mode of transport [1,2]. This widespread reliance on vehicles has driven advancements in automotive technology, transforming cars from mere luxuries into indispensable tools for daily living.
A critical innovation in this domain is the development of in-vehicle communication protocols, which facilitate seamless data transfer among a vehicle’s components. Modern vehicles, equipped with 50 to 100 Electronic Control Units (ECUs), rely on these protocols to manage key systems such as the engine, brakes, and infotainment. These ECUs work together via robust communication networks to ensure efficiency and safety in vehicle operation [3].
The Controller Area Network (CAN) bus, developed by Robert Bosch in 1986 [4], is a widely adopted in-vehicle communication protocol. Its scalability, robustness, and fault detection capabilities have made it a leading serial bus system, reducing wiring complexity while ensuring efficiency [5]. Beyond vehicles, CAN bus systems are also integral to aircraft, drones, farm machinery, and space satellites [5,6,7,8,9].
The CAN network enables communication among multiple ECUs in a vehicle through two primary wires: CAN High and CAN Low [10]. Key ECUs, such as the Engine Control Unit (ECU), Antilock Braking System (ABS), Airbag Control Module, and On-Board Diagnostic Unit (OBD), collaborate to manage and monitor critical systems, ensuring safety and performance.
Figure 1 illustrates their connectivity within a CAN-enabled vehicle.
The security of modern vehicles has become a growing concern in recent years, particularly with the widespread adoption of the CAN protocol [11]. Although CAN makes it easy to add functionality to a vehicle, the resulting complexity also creates opportunities for attacks against these systems. With the advent of electronic and autonomous cars, reports of malicious attackers stealing or sabotaging vehicles have risen. For instance, in 2022, a sophisticated group of malicious actors stole Mr. Tabor’s new Toyota SUV model by targeting the CAN bus protocol [12]. They used a PIC18F chip concealed within the Bluetooth component to execute a CAN injection attack, spoofing the system by masquerading as a smart key ECU and transmitting messages to unlock the car doors and start the engine.
Aside from the Bluetooth component being used as an access point for the attacks, as in Mr. Tabor’s case, other vehicle components have also been exploited as attack entry points. Examples include the telematics control unit [13], which provides connectivity for various vehicle services and can be targeted to access the vehicle’s network; infotainment units [14], such as Bluetooth and WiFi modules; and the On-Board Diagnostics (OBD-II) port [15]. Malicious actors often target these components to attack the CAN bus, mainly due to the broadcast nature of CAN messages. Although the CAN bus offers Cyclic Redundancy Checks (CRC) to check for inconsistencies in messages sent by nodes/ECUs on the bus, there is no inherent security measure within the bus to defend against attacks. Our previous research demonstrated that CAN-enabled systems are vulnerable to spoofing, injection attacks, and denial of service (DoS) attacks [16]. In addition, other researchers have shown that the CAN bus is susceptible to replay attacks, impersonation attacks, fuzzy attacks [17], man-in-the-middle attacks, and bus-off attacks, among others [18]. This highlights the need for a system to detect these attacks and alert drivers or vehicle owners, ensuring immediate action is taken.
In this paper, we propose a security system that can immediately detect attacks, thereby allowing the driver to take action instantly. We take this a step further by ensuring the system can also determine the part of the car that is under attack. To develop our attack detection system, we generated a dataset using a CAN-enabled testbed with four nodes, each acting as ten arbitrary nodes. In this dataset, we created scenarios that simulate injection attacks, leading to DoS attacks within the CAN network. Specifically, we propose CANGuard, a CAN Intrusion Detection System (IDS) facilitated by machine learning algorithms. An IDS monitors systems for anomalies or attacks that could lead to security breaches. We conduct experiments to compare the performance of different machine learning models in detecting these anomalies or attacks. The machine learning algorithms used to develop this IDS are logistic regression, random forest classifier, gradient boosting classifier, and multilayer perceptron.
The structure of this paper is as follows:
Section 2 provides a review of existing works on the development of IDS systems for CAN-enabled networks.
Section 3 details the general architecture of the CAN network and the testbed used in this research.
Section 4 describes the CAN bus threat model, common attacks the network is susceptible to, and the specific attack model adopted for this study.
Section 5 outlines the methodology used to develop the proposed IDS, CANGuard.
Section 6 presents the results of the evaluated models.
Section 7 discusses the findings, including the limitations of the research.
Section 8 concludes the paper and suggests future directions.
2. Related Works
As previously stated, CAN is vulnerable to several attacks. Attacks on the CAN bus can have devastating effects when the car is in operation, including the potential for loss of human life or severe injury to drivers and passengers [19]. To address this issue, researchers have explored various approaches to anomaly detection in CAN-enabled vehicles to identify and mitigate potential attacks in real time. A promising approach is the development of IDS systems for CAN systems, particularly because an IDS does not alter the operation of the CAN network, making it easy to adopt. Hence, in this section, we discuss existing literature that focuses on developing IDS systems for CAN-enabled vehicles with varying detection modes.
When an IDS system is being developed, an important consideration is the method the system adopts to detect attacks [11,20,21]. An IDS system developed for the CAN network monitors the system and extracts the dynamic behavior of the CAN network, using it as a reference to detect deviations from normal operation. These detection mechanisms generally fall into one of two categories: signature-based or anomaly-based detection.
Anomaly-based IDS systems identify intrusions by detecting deviations from normal behavior or activity patterns within the CAN network. They monitor the network for unusual patterns that deviate from established baselines, signaling potential security threats. Lampe et al. [22] developed an anomaly-based IDS system that operates through an Android application connected to the car’s Bluetooth component, which is plugged into the diagnostic port on the CAN bus. Their system evaluation was promising, showing little to no false positives when tested with real cars and publicly available attack datasets.
In contrast, signature-based IDS systems detect intrusions by matching patterns or sequences corresponding to known attack signatures stored in the IDS database. For example, Jin et al. [23] developed a promising signature-based IDS system for CAN that can be applied directly to each ECU within the vehicle. Using real-world scenarios, they extracted signatures of various attacks to which the CAN bus is vulnerable and used them to detect attacks within the CAN network traffic. Similarly, Song et al. [24] and Bi et al. [25] proposed signature-based IDS for the CAN bus, relying heavily on the time interval of messages sent on the bus. In their experimentation, they recorded that when a CAN ID sends a packet, the time interval between packets should not be less than 0.2 milliseconds (ms); otherwise, the IDS records it as an attack. Likewise, Halder et al. [26] developed COIDS, a signature-based IDS system that detects intrusions in CAN-enabled systems by monitoring changes in clock offset. COIDS creates a baseline of normal clock behavior using clock offset measurements from ECUs and identifies deviations from this baseline to detect potential intrusions.
It is important to note that the implementation of IDS mechanisms in vehicles can vary. Depending on the specific application and the data type being monitored, IDS systems might be applied in various ways, such as machine learning-based, frequency-based, statistical-based [17], or specification-based approaches. Each method offers different advantages and focuses on different aspects of the data or behavior to detect anomalies and intrusions.
The use of machine learning models in the development of IDS systems is highly common due to their ability to detect complex patterns and anomalies in large datasets and to make the IDS easily adaptable to new attacks. For instance, Seo et al. [27] adopted Generative Adversarial Nets (GAN) to develop their anomaly-based IDS system, GIDS. Their system performed in the 99th percentile in detecting DoS, fuzzy, RPM, and gear attacks. Sun et al. [28] also used machine learning, developing an ensemble model using CNN and LSTM with attention mechanisms. Their model showed great results with an error rate of 2%. Other machine learning models that have been adopted in the development of IDS systems include CNN [29], Bayesian networks [21,30], Gradient Boosting Decision Tree (GBDT) [31], KNN, RF [32], and gated recurrent unit (GRU) [33], among many others.
Other IDS methods, whether signature-based or anomaly-based, rely on statistical or timing analysis. Khan et al. [17] developed a CAN network IDS that analyzes the relationship between the attack ratio, average, and standard deviation of CAN bus data, without considering time intervals. Lee et al. [34] use the offset ratio and time interval of ECUs to detect attacks within the bus. Additionally, More et al. [35] used a statistical approach to develop their IDS system, which relies heavily on identifying messages and standardizing message transfer times within the CAN bus. The vulnerabilities discovered during testing were reported directly to the vendor. Song et al. [24] developed an IDS system based on the time intervals of the CAN messages by capturing CAN data and simulating three types of message injection attacks. Other works that developed IDS systems using time interval analysis and arbitrary identifier assignments include [35,36].
In this paper, we propose the development of CANGuard, an IDS that will recognize anomalies within a CAN-enabled system. Our approach for developing this IDS system uses a hybrid of anomaly and signature detection mechanisms, thereby ensuring the enhanced detection of attacks. CANGuard is developed using attack datasets we generated to simulate real-life scenarios of DoS attacks, thus ensuring it performs effectively in securing the vehicle. Our IDS leverages an enhanced methodology employing various machine learning models, which we rigorously evaluate in terms of performance during training and testing on the dataset. This comparative analysis aims to identify the most effective model for anomaly detection, ultimately strengthening CAN bus security in vehicles.
3. System Architecture
In this section, we discuss the general architecture of the CAN network, highlighting its communication methods, physical components, and packet transfer mechanism.
The CAN network facilitates communication between various vehicle components, such as the engine control unit (ECU), transmission, airbags, and ABS, without complex wiring [10]. It operates within a robust framework of international standards, particularly ISO 11898 [4], which defines the rules for real-time and reliable communication between microcontrollers and devices, all without requiring a central host computer.
3.1. Physical Components of an ECU in the CAN Network
A typical ECU in a CAN-enabled vehicle consists of three key components: the microcontroller, the transceiver, and the controller. These components work together to ensure the ECU performs its tasks efficiently, enabling sensors and other parameters to function properly. This, in turn, ensures the optimal performance of the vehicle. The ISO 11898 standard underpins the communication protocols these components use, ensuring a reliable exchange of information between vehicle subsystems to maintain overall system performance. The functions of these components are described in Figure 2.
3.2. CAN Bus Communication Mechanisms
One of the key advantages of the CAN bus is that it allows many of a vehicle’s components to be connected while reducing the complexity of the wiring system. Instead of using multiple wires, the CAN bus operates with just two: CAN High (CANH) and CAN Low (CANL). These two wires work together as a differential pair to transfer data within the bus: each bit is represented by the voltage difference between CANH and CANL. Electromagnetic interference is minimized because the wires are twisted together and terminated with a 120-ohm resistor. All nodes or ECUs (Electronic Control Units) on the bus are connected in parallel, and each must be connected to both the CANH and CANL wires.
Modern cars often have multiple CAN networks connected via a gateway to handle the various nodes (or ECUs) within them. The high-speed CAN system (ISO 11898-2) is responsible for critical systems like ABS, airbag modules, and powertrain control units, with transfer rates ranging from 1 kbit/s to 1 Mbit/s. On the other hand, the low-speed CAN bus system (ISO 11898-3) is typically used for less critical functions, such as infotainment systems (radio, Bluetooth, GPS) and door signals. Its transfer rate is 125 kbit/s [14]. This design is flexible because nodes can easily be added to or removed from the bus.
When the CAN bus is idle (in a recessive state), the CANH and CANL lines are at 2.5 V. However, once communication or data transfer begins (in a dominant state), the voltage on these wires changes: CANH typically rises to 3.75 V, and CANL drops to 1.25 V. This creates a 2.5 V difference between the two lines, allowing data transmission to occur. Table 1 highlights the specific voltage levels of each CAN bus wire in different states. Likewise, Figure 3 gives a visual representation of the changes in the state of the bus when data is being transmitted.
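As a small illustration of the recessive and dominant states described above, the following Python sketch classifies the bus state from a pair of CANH and CANL voltages. The function names and the 0.9 V decision threshold are assumptions introduced for this example (the threshold reflects a commonly used receiver level for high-speed CAN and is not taken from our testbed); the voltage values in the usage lines are the ones quoted in the text.

```python
def bus_state(can_h: float, can_l: float, threshold: float = 0.9) -> str:
    """Classify the bus state from the differential voltage CANH - CANL.

    The 0.9 V threshold is an assumed receiver decision level,
    not a value measured on the testbed.
    """
    differential = can_h - can_l
    return "dominant" if differential >= threshold else "recessive"


def to_bit(can_h: float, can_l: float) -> int:
    """Map the bus state to a logical bit: dominant is 0, recessive is 1."""
    return 0 if bus_state(can_h, can_l) == "dominant" else 1


# Idle bus: both lines at 2.5 V, so the differential is 0 V.
print(bus_state(2.5, 2.5), to_bit(2.5, 2.5))
# Active bus: CANH at 3.75 V and CANL at 1.25 V, a 2.5 V differential.
print(bus_state(3.75, 1.25), to_bit(3.75, 1.25))
```

Working with the difference between the two lines, rather than either absolute level, is what makes the signal tolerant of noise that affects both wires equally.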
3.3. CAN Bus Data Transfer Mechanism
When communication happens on the CAN bus, it is broadcast to all connected nodes. This means that all other nodes on the bus receive any message sent by one node. Each node then decides whether to accept or ignore the message based on whether it is relevant to its function or needs. The CAN bus ensures the integrity of messages or data sent on the bus by encapsulating them into packets called frames. Hence, we can refer to messages sent on the bus as frames. It is important to note that only one node can send a frame/packet to the bus at a time, so the CAN network is called a serial network.
Different kinds of frames can be sent on the bus, and the frame type reflects why the node is sending it. These frames are:
Data Frames: These frames carry data sent by a node to be received by other nodes or ECUs on the bus.
Remote Frames: These are initiated when a node intends to request data from other nodes.
Error Frames: When an error occurs on the bus, these frames are used to report it.
Overload Frames: When the bus is overloaded, these frames report the congested state of the bus.
Irrespective of the frame type, each frame is composed of several fields, each of which serves a crucial role in ensuring reliable and synchronized communication on the CAN network. These fields are summarized in Table 2.
3.4. CAN Bus Frame Priority (Arbitration)
The CAN bus is a serial network, meaning only one frame can be transmitted at a time. When two or more nodes attempt to send packets simultaneously, the network uses a priority system to determine which node’s data will be transmitted. This process is known as CAN arbitration, where the message with the highest priority (lowest identifier value) is given access to the bus, while the other nodes wait until the bus is free again. This ensures that critical data is sent first. Recall the ID field encapsulated within the frame for each data transfer; this field determines the priority of the data sent on the bus and indicates how critical the data is. For example, data from the ABS will have a higher priority than data from the door signals. Since lower ID values indicate higher priority, the ABS message will have a lower numerical ID than the door signal, ensuring that it is transmitted first if both attempt to send data simultaneously.
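The arbitration rule can be illustrated with a short Python sketch. It compares the contending identifiers bit by bit, most significant bit first, and treats a 0 bit as dominant, so any node sending a 1 while the bus carries a 0 backs off. The function name and the example identifiers are hypothetical and chosen only for illustration; they are not the IDs used on our testbed.

```python
def arbitrate(ids, width=11):
    """Return the identifier that wins bitwise arbitration.

    ids   -- identifiers of the nodes that start transmitting together
    width -- identifier length in bits (11 for the standard frame format)
    """
    contenders = set(ids)
    if not contenders:
        raise ValueError("at least one identifier is required")
    if any(not 0 <= i < (1 << width) for i in contenders):
        raise ValueError("identifier does not fit in the given width")

    for bit in range(width - 1, -1, -1):  # most significant bit first
        # The bus carries a dominant 0 if any contender sends a 0.
        bus_bit = min((i >> bit) & 1 for i in contenders)
        # Nodes that sent a recessive 1 but read a 0 stop transmitting.
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_bit}
        if len(contenders) == 1:
            break
    return contenders.pop()


# Hypothetical IDs: a brake-related message and a door-signal message.
winner = arbitrate([0x3F0, 0x0A5])
print(hex(winner))
```

The result always equals the numerically smallest identifier, which is why lower IDs are assigned to more critical messages.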
3.5. The Proposed CAN Bus Testbed
To carry out this research, we developed a CAN bus testbed as the foundation for data collection and for the simulation of the various attack scenarios used to develop the CAN IDS system. Our testbed consists of four physical nodes, each simulating ten ECUs. We achieved this by assigning ten random IDs unique to each node once the system was operational. As a result, the CAN network processes each message as if it came from an individual ECU, effectively simulating a system with over forty ECUs.
We tested the system using a CAN data logger, which captured traffic during each run, and an oscilloscope to monitor the frames’ components transmitted within the network. Figure 4 shows the hardware components of our CAN bus testbed, and Table 3 highlights the critical components essential for its successful implementation.
3.5.1. Configuration of the ECU
Recall that a CAN ECU comprises the CAN microcontroller, the CAN controller, and the CAN transceiver. Once these components are integrated into the system, configuring and testing the ECU in the network becomes straightforward.
The microcontroller was programmed in C using libraries compatible with the Mbed platform. Each ECU was initialized with the CAN Rx (PB_8) and CAN Tx (PB_9) pins to enable the transmission and reception of messages over the CAN bus. The message is 1 byte in size and contains an ID and a counter value. When the message is successfully transmitted, an LED lights up as a signal, and the counter is incremented for the next message. We use the CANStandard format, which specifies that the CAN frames use the standard 11-bit identifier rather than the extended 29-bit format.
Each ECU is also programmed to receive and store incoming messages. When a message is received, the ECU prints the content of the message, and another LED is turned on to indicate that a message has been successfully received. The ECU pauses for 5 milliseconds (ms) before continuing this loop, allowing for controlled timing between operations.
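The message format described above can be mimicked off-target with a brief Python sketch. This is not the firmware running on the nodes; it is a simulation under stated assumptions: the `CanFrame` class and `make_frames` helper are hypothetical names, and we assume the counter occupies the single data byte and therefore wraps around after 255.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    """A simplified standard-format CAN data frame (hypothetical helper)."""
    can_id: int   # 11-bit identifier
    data: bytes   # payload, at most 8 bytes in classical CAN

    def __post_init__(self):
        if not 0 <= self.can_id <= 0x7FF:
            raise ValueError("standard identifiers must fit in 11 bits")
        if len(self.data) > 8:
            raise ValueError("a classical CAN payload is at most 8 bytes")


def make_frames(can_id: int, count: int, start: int = 0):
    """Build `count` frames whose 1-byte payload is an incrementing counter.

    Assumption: the counter is stored in one byte, so it wraps modulo 256.
    """
    return [CanFrame(can_id, bytes([(start + k) % 256])) for k in range(count)]


# Three consecutive messages from one simulated ECU, crossing the wrap point.
for frame in make_frames(0x123, 3, start=254):
    print(hex(frame.can_id), frame.data[0])
```

Validating the identifier range in the constructor mirrors the constraint imposed by the standard 11-bit format.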
3.5.2. System Configuration of the CAN Data Logger
The CL2000 is the device we use to log data on our CAN network, functioning as a sniffer in our setup. It came preconfigured by the manufacturer, so we easily integrated it into the network by connecting pins 2, 3, 7, and 9 to our CAN transceiver on the breadboard.
Table 4 highlights the connection of the CL2000 to the board.
Figure 5 shows the connection of each pin to the CAN transceiver.
To capture the traffic on the CAN bus, the connected CL2000 device records all network traffic while the CAN system is running. After each capture session, we turn off the CAN system and disconnect the CL2000 from the network. Then, we connect the CL2000 to our host computer via USB to transfer the recorded data for further analysis. After connecting the CAN logger to the host machine, we use the SavvyCAN software (version V199) to analyze the captured traffic and export the data from the CL2000 into CSV format. An example of the frames captured by the CL2000 and viewed using the SavvyCAN tool is shown in Figure 6. Finally, we used the PicoScope 2000 series to monitor and visualize the signals transmitted on the CAN bus.
5. Methodology of CANGuard: Our Proposed IDS
In this section, we outline the steps in developing CANGuard, our proposed IDS system designed to effectively and efficiently detect attacks within a CAN bus system. By leveraging advanced machine learning models, CANGuard enhances its ability to identify both known vulnerabilities and emerging threats in bus traffic, providing robust system protection. The following subsections cover the raw dataset (Section 5.2), the data transformation steps used to align the data with real-life scenarios, the machine learning models selected for training and testing, and finally, the deployment of the best-performing model based on a comparison of each model’s performance.
Figure 9 gives a high-level overview of the proposed methodology adopted in the development of the IDS system.
5.1. Experimental Setup
For the experimental setup to develop the IDS system, we utilized Google Colab Pro as our computational platform, leveraging its robust resources for machine learning experiments. The hardware configuration included access to a Tesla T4 GPU, which is highly effective for handling computationally intensive tasks, particularly those involving deep learning models. The system was equipped with 12.7 GB of RAM and 10 GB of GPU memory, providing sufficient resources to train and evaluate our models efficiently. Additionally, the allocated disk space of 107.7 GB allowed us to store the CAN bus dataset, feature-engineered data, and model checkpoints throughout the experimentation process. All experiments were implemented in Python, relying primarily on the scikit-learn library for our machine learning models and Jupyter Notebook as the development environment.
5.2. Datasets
As shown in our methodology diagram, the first phase involves collecting the raw dataset. Our raw dataset was divided into two categories: normal and attack traffic. The normal dataset represents non-compromised CAN bus activity, while the attack dataset includes data from two simulated attack scenarios (Scenario 1 and Scenario 2). All datasets were collected using the CAN data logger, ensuring a consistent structure. The total size of all data collected was 7 MB.
Table 8 details the dataset columns used for analysis.
The timestamp and ID columns were the key segments selected as raw data. The ID column was especially relevant in the attack scenario 1 (randomized ID) dataset, while the timestamp column was crucial in the attack scenario 2 (randomized injection rate) dataset.
5.3. Data Preprocessing
Data preprocessing involved feature extraction and engineering to develop a robust IDS for detecting DoS attacks. The goal was to enable the models to identify both the presence of an attack and the responsible node. Meaningful features were derived from the raw dataset to optimize model performance during training and testing.
5.3.1. Feature Extraction
We used a sliding window approach to extract features from raw CAN bus data, capturing temporal patterns and distinguishing normal from attack communications. This method tracked changes in message rates, timing, and node activity within 5 ms intervals, providing essential inputs for accurate classification. Extracted features included unique ID counts, time-based metrics, message counts, error counts, and distance features.
Unique ID counts: One of the primary features generated is the count of messages for each unique CAN ID within a 5 ms time window, labeled as ID_XX_Count, where XX represents the specific CAN ID. For each unique ID observed in the dataset, a feature tracks the frequency of its occurrences within this time frame. This feature is crucial for detecting message bursts or unusual traffic patterns, which are often indicators of abnormal behavior such as that caused by a DoS attack. Monitoring message volume by specific IDs helps to identify overuse or underuse patterns, which can suggest malicious activity when they deviate from typical communication patterns.
A particularly critical feature in detecting DoS attacks is the count of messages with ID 0x00, noted as ID_00_Count. In our DoS attack simulations, the attacking node is assigned the ID of 0x00, making this feature a direct indicator of attack activity. A significant rise in messages with this ID within the time window strongly suggests that the attacking node is attempting to flood the network, overwhelming the system and disrupting normal operations.
Time-based Feature: The timing of each message or frame is a critical factor for the IDS in identifying anomalous activity within the CAN bus network, as unusual timing patterns often signal malicious interference. To capture this, we extracted the timestamp feature, which records the precise time each message is sent. Additionally, we discuss the statistical transformations of the time-based feature in the feature engineering phase to ensure that patterns such as irregular intervals, burst frequencies, and unexpected delays can be effectively analyzed by the model.
Message Count Feature: Another important feature is the total message count within each window. This feature is particularly useful in detecting sudden surges in traffic, which are characteristic of DoS attacks. In a normal CAN bus network, the message rate remains relatively stable, whereas a DoS attack results in a sharp increase in the number of messages transmitted in a short time. Furthermore, the system also tracks the number of messages labeled as ERROR, which serves as an indicator of potential communication failures or transmission issues within the network. An increase in error messages could signify that the network is under attack, as DoS attacks often lead to message collisions, transmission errors, and other disruptions in the communication protocol.
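To make the count-based features concrete, the sketch below groups frames into time windows and derives the per-ID counts and the total message count for each window. It is a simplified illustration rather than our exact pipeline: it uses fixed, non-overlapping windows instead of a sliding window, the input is assumed to be a list of `(timestamp_ms, can_id)` pairs, and the names `window_features` and `Message_Count` are hypothetical (only the `ID_XX_Count` naming comes from the text). The window length is left as a parameter.

```python
from collections import Counter


def window_features(frames, window_ms=5.0):
    """Compute per-window ID counts and total message counts.

    frames    -- iterable of (timestamp_ms, can_id) pairs
    window_ms -- window length in milliseconds (fixed, non-overlapping)
    """
    per_window = {}
    for timestamp, can_id in frames:
        index = int(timestamp // window_ms)  # which window the frame falls in
        per_window.setdefault(index, Counter())[can_id] += 1

    rows = []
    for index in sorted(per_window):
        counts = per_window[index]
        row = {
            "window_start_ms": index * window_ms,
            "Message_Count": sum(counts.values()),
        }
        for can_id, count in counts.items():
            row[f"ID_{can_id:02X}_Count"] = count
        rows.append(row)
    return rows


# Toy capture: two frames from ID 0x00 and one from ID 0x1A in the first
# window, then a single frame from ID 0x00 in the second window.
capture = [(0.5, 0x00), (1.0, 0x00), (2.0, 0x1A), (6.0, 0x00)]
for row in window_features(capture):
    print(row)
```

A sudden jump in `ID_00_Count` relative to the other per-ID counts is the kind of pattern the classifier is expected to pick up for the attack described above.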
5.3.2. Feature Engineering
Feature engineering involved analyzing timestamp data and adding noise to simulate real-world network fluctuations. These engineered features enhanced the IDS’s ability to detect both subtle and overt disruptions, distinguishing normal behavior from DoS attacks while identifying their source within the CAN network. We also labeled the dataset in this phase.
Statistical analysis of the time-based feature: We transform the raw data by computing statistics over the extracted time-based feature to capture the temporal dynamics of the CAN bus messages. For a window containing $N$ messages with timestamps $t_1, \dots, t_N$, let $\Delta t_i = t_{i+1} - t_i$ denote the interval between consecutive messages. The statistics computed are:
- The mean timestamp difference is calculated as the average time interval between consecutive messages within each window. This feature helps detect anomalies in the message rate, as DoS attacks tend to flood the network with a high frequency of messages, reducing the time interval between them. It is defined as
$$\mu_{\Delta t} = \frac{1}{N-1} \sum_{i=1}^{N-1} \Delta t_i$$
- Similarly, the standard deviation of the timestamp difference provides insights into the variability of message timing within the window. A stable network typically exhibits consistent message timing, whereas a DoS attack introduces irregularities. The standard deviation of the timestamp differences (STD) is calculated as:
$$\sigma_{\Delta t} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N-1} \left(\Delta t_i - \mu_{\Delta t}\right)^2}$$
- Additionally, the maximum and minimum timestamp differences were calculated. These capture extreme intervals, which can also be indicative of abnormal activity. The maximum and minimum differences between consecutive timestamps are given by:
$$\Delta t_{\max} = \max_{1 \le i \le N-1} \Delta t_i, \qquad \Delta t_{\min} = \min_{1 \le i \le N-1} \Delta t_i$$
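The four timing statistics can be computed for a single window with a few lines of Python. This is a minimal sketch using only the standard library; the function name `timestamp_diff_stats` is hypothetical, and the use of the population standard deviation (dividing by the number of intervals) is an assumption, since a sample-based estimate would be equally reasonable.

```python
import statistics


def timestamp_diff_stats(timestamps):
    """Return mean, std, max and min of consecutive timestamp differences.

    timestamps -- message times within one window, in ascending order
    Returns None when the window holds fewer than two messages.
    """
    diffs = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not diffs:
        return None
    return {
        "mean": statistics.fmean(diffs),
        "std": statistics.pstdev(diffs),  # population form (assumption)
        "max": max(diffs),
        "min": min(diffs),
    }


# Four messages whose gaps are 1, 2 and 3 time units.
print(timestamp_diff_stats([0.0, 1.0, 3.0, 6.0]))
```

Under a flooding attack the gaps shrink, so the mean and minimum differences drop sharply compared with normal traffic.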
Noise Addition: Noise addition is crucial for training robust machine learning models capable of detecting DoS attacks on CAN bus networks. By simulating real-world disturbances, such as timestamp jitter, latency, packet loss, and transmission errors, we can create a more challenging and realistic training environment. Timestamp jitter introduces small, random variations in message timestamps, mimicking minor network fluctuations. Latency simulates delays in message transmission, accounting for network congestion and physical distances. Packet loss models the random loss of messages due to network errors or attacks. Transmission errors introduce corrupted messages into the dataset, mimicking the effects of hardware failures or malicious interference. By exposing the model to these diverse types of noise, we enhance its ability to distinguish between normal network behavior and malicious activity. The model learns to identify patterns that indicate a DoS attack, even in the presence of various disturbances. Ultimately, this approach leads to a more resilient and effective machine learning model for detecting and mitigating DoS attacks on CAN bus networks.
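A simple way to picture the noise types listed above is the following Python sketch, which perturbs a list of `(timestamp_ms, can_id)` pairs. It is illustrative only: the function name `add_noise`, the parameter names, their default magnitudes, and the choice of Gaussian jitter are all assumptions made for this example and do not describe the exact noise model used in our experiments.

```python
import random


def add_noise(frames, jitter_ms=0.05, latency_ms=0.2,
              loss_prob=0.01, error_prob=0.01, seed=0):
    """Apply jitter, latency, packet loss and transmission errors.

    frames -- iterable of (timestamp_ms, can_id) pairs
    Returns a time-sorted list of (timestamp_ms, can_id, is_error) tuples.
    All default magnitudes are placeholder values.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    noisy = []
    for timestamp, can_id in frames:
        if rng.random() < loss_prob:
            continue  # packet loss: the frame never reaches the logger
        # Latency shifts every frame; jitter adds a small random offset.
        shifted = timestamp + latency_ms + rng.gauss(0.0, jitter_ms)
        is_error = rng.random() < error_prob  # mark corrupted frames
        noisy.append((shifted, can_id, is_error))
    noisy.sort(key=lambda frame: frame[0])  # jitter can reorder close frames
    return noisy


clean = [(0.0, 0x10), (5.0, 0x20), (10.0, 0x10)]
print(add_noise(clean, seed=1))
```

Seeding the generator keeps the perturbed dataset reproducible, which matters when several models are compared on the same noisy data.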
Message Label: The message label feature was created to support binary classification, which determines whether a message is normal or anomalous. Normal messages, indicating no attack, are labeled 0, while attack or anomalous messages are labeled 1.
Node Label: The node label is essential for identifying the specific node responsible for the DoS attack. During the preprocessing phase, each node in the CAN bus network is assigned a label based on its physical proximity to the central controller. This feature enables the model to determine which node is likely causing the attack. Including this feature adds a valuable layer of multi-class classification, as it not only detects the presence of an attack but also helps pinpoint its source within the network. In this case, we had four labels, where all frames from ECU1, ECU2, ECU3, and ECU4 are assigned labels 1, 2, 3, and 4 respectively.
The data preprocessing phase centers on extracting key features through a window-based approach and generating meaningful features via feature engineering techniques, resulting in a robust input dataset. For the experimental analysis, the dataset consisted of CAN bus frames, each representing a single message exchanged on the network. These frames were grouped into windows based on a sliding window approach with a size of 5000 milliseconds (5 s). Within each window, statistical features were computed, such as message counts, timestamp differences, and error counts, providing a summarized representation of the network activity during that period. The total number of windows generated depended on the number of frames in the dataset and the chosen window size. For training and testing, the windows were split into an 80-20 ratio, ensuring a balanced distribution of attack and non-attack scenarios across the datasets. By preprocessing the data in this way, the system can effectively detect anomalies in real-time and identify the specific node responsible for the attack, leveraging both temporal and spatial features.
Table 9 highlights the features in our input dataset.
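The 80-20 split of windows can be sketched as follows. To keep attack and non-attack windows represented in both partitions, this example splits each label group separately (a stratified split); treating the balance requirement this way is our assumption for the illustration, and the function name `stratified_split` and the seed value are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split window indices into train and test sets, label by label.

    labels -- one label per window (for example 0 = normal, 1 = attack)
    Returns (train_indices, test_indices), each sorted in ascending order.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy example: 80 normal windows and 20 attack windows.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In practice, a library routine with a stratification option would typically replace this hand-written helper; the sketch only makes the idea explicit.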
5.4. Machine Learning and Deep Learning Models Selected for the Development of Our IDS
To ensure the selection of the best-performing model, we evaluated four different machine learning models and one deep learning model, each chosen for its potential effectiveness in identifying anomalies within the CAN bus system. These models were carefully compared based on key performance metrics to determine which would provide the most accurate and reliable detection of DoS attacks. The specific characteristics and advantages of each model are discussed below, highlighting how they contribute to the overall robustness of the IDS.
5.4.1. Logistic Regression
A significant advantage of the logistic regression model [
37] lies in its suitability for binary classification problems. In our IDS system, the first step involves detecting the presence of DoS attacks before identifying the specific part of the vehicle under attack. Logistic regression was deemed a promising choice for addressing this initial problem: determining whether a DoS attack is occurring within the vehicle. The objective is to produce a binary outcome, predicting either “Yes” (label 1) or “No” (label 0). Its simplicity and efficiency make it a practical choice for anomaly detection in our IDS system, ensuring seamless integration into vehicle systems.
Logistic regression operates through two key steps: the linear model and the sigmoid function.
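The linear model calculates a weighted sum of the input features:
$$z = \mathbf{w}^{\top}\mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b$$
where $\mathbf{x}$ is the vector of $d$ input features, $\mathbf{w}$ is the weight vector, and $b$ is the bias term.
The sigmoid function then maps this result to a probability:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
This allows the model to output probabilities between 0 and 1, to which a decision threshold (commonly 0.5) is applied to produce a binary prediction.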
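As a small worked illustration of the two steps, the sketch below computes the weighted sum, applies the sigmoid, and thresholds the result. The weights, bias, feature values, and the 0.5 threshold are hypothetical values chosen for the example, not parameters from the trained model.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Linear model (weighted sum plus bias) followed by the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (attack) if the probability reaches the threshold, else 0."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical weights and a two-feature window, for illustration only.
weights, bias = [0.8, -0.3], -1.0
p = predict_proba([3.0, 1.0], weights, bias)    # z = 2.4 - 0.3 - 1.0 = 1.1
label = predict_label([3.0, 1.0], weights, bias)
```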
5.4.2. Random Forest Classifier
Another supervised machine learning model chosen for this research is the Random Forest Classifier [38]. Unlike Logistic Regression, which is a single linear model, Random Forest is an ensemble learning method that combines the outputs of multiple decision trees. It uses voting for classification and averaging for regression, which improves predictive performance while reducing the risk of overfitting relative to a single tree. These benefits made it a strong choice for the development of our IDS system.
Random Forest operates in several stages. First, multiple decision trees are built using random subsets of the input data to ensure diversity and robustness. For a given input sample $x$, each tree $T_k$ in the forest predicts an output $y_k = T_k(x)$. In the case of classification tasks, the final output $y$ is determined by majority voting:
$$y = \operatorname{mode}\{y_1, y_2, \ldots, y_n\}$$
where $y_1, y_2, \ldots, y_n$ are the individual predictions from the $n$ trees.
For regression tasks, the final output is computed by averaging the predictions from all trees:
$$y = \frac{1}{n} \sum_{k=1}^{n} y_k$$
where $y_k$ is the prediction from the $k$-th tree, and $n$ is the total number of trees in the forest.
Aggregating many trees in this way typically gives Random Forest higher accuracy and robustness than an individual decision tree. Its ability to handle both classification and regression tasks efficiently made it an ideal selection in the development of the proposed IDS system.
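The two aggregation rules can be illustrated in a few lines. The per-tree outputs below are made up, and the helper names are hypothetical; the sketch covers only the final combination step, not tree construction.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Return the mean of the per-tree regression outputs."""
    return mean(tree_predictions)


# Hypothetical per-tree outputs for a single input sample.
node_votes = [2, 2, 3, 2, 1]               # five trees voting on a node label
forest_class = majority_vote(node_votes)   # label 2 receives three of five votes
forest_value = average([0.2, 0.4, 0.9])    # regression-style aggregation
```

Library implementations may differ in detail: scikit-learn's `RandomForestClassifier`, for instance, averages the trees' predicted class probabilities rather than counting hard votes.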
5.4.3. Gradient Boosting
Similar to the Random Forest Classifier, Gradient Boosting [
39] is another powerful supervised machine learning model. It is an ensemble learning method capable of performing both classification and regression tasks. Unlike Random Forest, Gradient Boosting builds models sequentially, where each model learns from and corrects the errors of its predecessor. By combining the predictions of weak learners, it creates a strong model capable of making accurate predictions.
For the development of our IDS system, Gradient Boosting offers significant advantages, including high predictive accuracy, customizability, and resistance to overfitting when properly regularized. These characteristics make it a promising choice for enhancing the system’s performance and reliability.
Gradient Boosting combines multiple weak learners (e.g., decision trees) into a strong learner by minimizing a loss function. The process involves iteratively adding a new model to reduce the residual errors of the previous model.
The general prediction for Gradient Boosting is represented as:
$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$
where:
- $F_m(x)$: The prediction of the ensemble model at iteration $m$.
- $F_{m-1}(x)$: The prediction from the previous iteration.
- $\eta$: The learning rate, which controls the contribution of $h_m(x)$.
- $h_m(x)$: The new weak learner added at iteration $m$.
At each iteration, $h_m(x)$ is trained to minimize a loss function $L(y, F(x))$, which measures the error between the true labels $y$ and the predictions $F(x)$. The objective is to solve:
$$h_m = \arg\min_{h} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big)$$
where $n$ is the number of samples in the dataset.
This sequential optimization process ensures that each new learner focuses on correcting the errors made by its predecessors, resulting in a highly accurate final model.
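The sequential update can be sketched with a toy example. The following assumes squared-error loss, one-dimensional inputs, and one-split regression stumps as weak learners; these are simplifications chosen for brevity and are not the configuration used for the IDS. The function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps under squared loss and return the final predictor."""
    base = sum(ys) / len(ys)                 # initial constant model
    learners = []

    def predict(x):
        return base + learning_rate * sum(h(x) for h in learners)

    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return predict


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks as learners are added; the learning rate scales how much of each correction is applied.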
5.4.4. Multilayer Perceptron (MLP)
The only deep learning model we selected was the Multilayer Perceptron (MLP) [
40], a type of artificial neural network designed for supervised learning tasks. MLP is particularly suited for handling problems with diverse inputs, making it an ideal choice for developing a robust IDS. It excels at predictive analysis by processing input data through multiple layers, where each layer applies transformations based on weights and biases to model complex relationships. Furthermore, MLP adjusts these weights and biases through backpropagation, which minimizes prediction errors and improves accuracy over time.
The computation in an MLP is carried out layer by layer, using weighted sums followed by activation functions. This process, known as the forward pass, is described as follows for each layer $l$:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$
where:
- $z^{(l)}$ is the input to the activation function at layer $l$,
- $W^{(l)}$ represents the weight matrix for layer $l$,
- $a^{(l-1)}$ is the output from the previous layer, and
- $b^{(l)}$ denotes the bias vector at layer $l$.
Next, the activation function $f$ is applied to compute the output of layer $l$:
$$a^{(l)} = f\left(z^{(l)}\right)$$
The process continues until the final layer generates the output. The accuracy of the MLP improves iteratively through the backpropagation process, where the weights and biases are updated by minimizing the error using a gradient descent algorithm. This adjustment ensures the model learns from the data effectively, enhancing its predictive capabilities.
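A forward pass through a small network can be sketched as follows. The 2-3-2 layout, the hand-picked weights, and the choice of ReLU and softmax activations are assumptions made for illustration; the text does not specify the activation functions used, and backpropagation is omitted.

```python
import math


def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert scores to probabilities that sum to 1."""
    peak = max(values)                       # subtract max for stability
    shifted = [math.exp(v - peak) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]


def dense(weights, bias, inputs):
    """Weighted sum plus bias for one layer; one row of `weights` per unit."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def forward(layers, inputs):
    """Apply each (weights, bias, activation) layer in turn."""
    activation = inputs
    for weights, bias, activate in layers:
        activation = activate(dense(weights, bias, activation))
    return activation


# Hypothetical 2-3-2 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0], softmax),
]
probs = forward(layers, [2.0, 1.0])   # hidden layer gives [1.0, 1.5, 0.0]
```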
For all four models selected, the hyperparameters used in their development are summarized in
Table 10. These hyperparameters represent the final configurations that were fine-tuned to optimize the architectures of our models, ensuring the best possible results.
5.5. Classification Process of the IDS System
The modeling and classification process for detecting DoS attacks and identifying the responsible node involves two main tasks: detecting the occurrence of the DoS attack and classifying which node is responsible for the attack. These two tasks are addressed using different classification strategies, including binary classification for DoS detection and multi-class classification for identifying the node based on its distance from the network controller.
5.5.1. Binary Classification to Detect the Occurrence of DoS Attacks
The first task, DoS attack detection, is framed as a binary classification problem. In this task, the model is trained to predict whether or not a DoS attack is occurring within a given time window. The feature that plays a critical role in determining the presence of a DoS attack is the count of messages with the CAN ID 0. During a DoS attack, the attacking node sends an excessive number of messages with ID 0. Because the lowest identifier has the highest priority in CAN bus arbitration, these frames win arbitration against all other traffic and effectively flood the network. By learning a threshold on the number of such messages in each time window, the model differentiates between normal and attack behaviors. The binary classification task provides the first layer of defense in identifying potential DoS attacks in real time.
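A simplified sketch of why ID 0 traffic dominates the bus and how a per-window count exposes it is given below. The slot-based bus model, the function names, and the threshold of 2 are illustrative assumptions; real CAN arbitration is bitwise, and frames that lose arbitration are retransmitted rather than dropped.

```python
def bus_schedule(pending_by_slot):
    """For each slot, transmit the pending frame with the lowest CAN ID.

    Simplified model: one frame per slot, losers are discarded, and empty
    slots transmit nothing.
    """
    return [min(pending) for pending in pending_by_slot if pending]


def is_dos_window(transmitted_ids, threshold):
    """Flag a window whose count of ID 0 frames exceeds the threshold."""
    return transmitted_ids.count(0x000) > threshold


# Hypothetical pending frames per slot during normal operation.
normal_slots = [[0x100], [0x200], [0x100, 0x300], []]
# Same traffic, with an attacker injecting an ID 0 frame in every slot.
attack_slots = [slot + [0x000] for slot in normal_slots]

normal = bus_schedule(normal_slots)   # legitimate IDs reach the bus
attack = bus_schedule(attack_slots)   # only ID 0 frames are transmitted

normal_flag = is_dos_window(normal, threshold=2)
attack_flag = is_dos_window(attack, threshold=2)
```

In the learned models, the threshold is not fixed by hand as it is here but emerges from training on labeled windows.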
5.5.2. Multi-Class Classification to Classify the Node Responsible for the Attack
The second task, node identification, is formulated as a multi-class classification problem. Once the DoS attack is detected, the next step is to classify which node is responsible for causing the attack. Each node in the network is assigned a unique distance from the central controller, which serves as an identifier for the node. The goal of the model is to predict the node responsible for the attack based on this distance. In the multi-class classification task, the distance values (or node identifiers) are treated as distinct classes, and the model is trained to predict the class corresponding to the node that initiated the attack. Since multiple nodes can potentially initiate an attack, the model learns to recognize the specific behavior of each node under attack conditions, based on features such as message timing, error counts, and message frequencies. This task is more complex than binary classification, as it requires the model to distinguish between multiple potential sources of the attack, each represented by a different distance value.
5.5.3. Multi-Output Classifier to Predict Both the DoS Attack and the Node Responsible for the Attack
A key aspect of this pipeline is using multi-output classification to jointly predict both the occurrence of a DoS attack and the node responsible for it. Multi-output classification allows the model to predict multiple target variables simultaneously. In this case, the model is tasked with predicting two outputs: the binary label indicating the presence of a DoS attack, and the multi-class label identifying the node responsible for the attack based on its distance. This is implemented using the MultiOutputClassifier wrapper from scikit-learn (sklearn.multioutput), which extends traditional single-output machine learning models to multi-output tasks. The wrapper fits a separate model for each target, so that both outputs are trained and predicted through a single interface. By handling DoS detection and node identification in one preprocessing and inference pipeline, the multi-output classifier reduces overhead and improves the efficiency of the detection process.
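A minimal usage sketch of `MultiOutputClassifier` on synthetic data follows. The two features, the way the synthetic labels are generated, and the choice of a Random Forest base estimator with 50 trees are assumptions made for the example; they are not the dataset or the hyperparameters of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic windows with hypothetical features [id0_count, mean_gap_ms].
n = 200
node = rng.integers(1, 5, size=n)             # node label in 1..4
attack = rng.integers(0, 2, size=n)           # 1 = DoS window, 0 = normal
id0_count = attack * 500 + rng.integers(0, 10, size=n)
mean_gap_ms = node * 2.0 + rng.normal(0.0, 0.1, size=n)

X = np.column_stack([id0_count, mean_gap_ms])
Y = np.column_stack([attack, node])           # two targets per window

base = RandomForestClassifier(n_estimators=50, random_state=0)
clf = MultiOutputClassifier(base)
clf.fit(X, Y)                                 # fits one estimator per target

pred = clf.predict(X[:5])                     # columns: [attack, node]
```

After fitting, `clf.estimators_` holds one fitted model per target column, which reflects the per-target training described above.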
The advantage of using a multi-output approach is that both targets are predicted from the same engineered features. For example, features that are important for detecting a DoS attack (such as the count of the ID 0 messages or timestamp variability) may also provide valuable insights into which node is responsible for the attack. Because MultiOutputClassifier fits its per-target estimators independently, this sharing takes place at the level of the input features rather than through jointly learned parameters.
Consequently, the modeling and classification process for DoS attack detection and node identification is designed to handle both binary and multi-class classification tasks. By incorporating multi-output classification, the system can simultaneously predict the presence of a DoS attack and identify the responsible node. This approach enhances the model’s efficiency and overall accuracy, making it well-suited for real-time detection scenarios.
7. Discussion
Our results demonstrate the effectiveness of the proposed CANGuard system in detecting anomalies within CAN-enabled vehicles. The system’s performance in binary classification and multi-class node detection tasks provides valuable insights into the strengths and limitations of various machine learning models for intrusion detection in vehicular networks.
7.1. Binary Classification Insights
The results for binary classification, where the task was to detect whether a DoS attack was present, were exceptional. All models achieved perfect scores (accuracy, precision, recall, and F1-score of 1.00). These outcomes highlight the robustness of the feature engineering process, which extracted critical temporal and statistical patterns from the raw CAN bus data. The sliding window approach played a pivotal role in capturing fine-grained details about message rates and node activity, ensuring that the models could accurately distinguish between normal and attack states.
This performance indicates that the engineered features are highly effective for attack detection, making the system a promising tool for real-world vehicular IDS deployment. Additionally, the absence of false positives and negatives, as illustrated by the confusion matrices, emphasizes the reliability of the system in detecting attacks without triggering unnecessary alerts or missing actual threats.
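For reference, the four reported metrics follow directly from the confusion-matrix counts. The sketch below uses made-up labels containing one false positive and one false negative, purely to show how the values are derived; it does not reproduce the experimental results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Hypothetical window labels (1 = attack) and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

A score of 1.00 on all four metrics corresponds to zero false positives and zero false negatives.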
7.2. Node Detection Challenges
In contrast, the task of node identification (multi-class classification) proved more challenging. While Gradient Boosting emerged as the top-performing model, achieving high accuracy (0.99), the performance varied across other models. Random Forest and Logistic Regression also demonstrated strong capabilities, with accuracies of 0.98 and 0.97, respectively. However, the MLP model struggled significantly, with an accuracy of 0.44, and showed pronounced misclassification errors in the confusion matrix, particularly for nodes at distances 1 and 3.
The disparity in performance suggests that tree-based models, such as Gradient Boosting and Random Forest, are better suited for handling the complexities of node-level detection in this dataset. Their ability to manage feature interactions and capture hierarchical patterns likely contributed to their superior results. On the other hand, the MLP model may require additional tuning, more training data, or architecture adjustments to perform at a comparable level. The results also underscore the importance of dataset balance and feature relevance in multi-class classification tasks.
The binary classification results suggest that CANGuard can serve as a highly reliable first layer of defense in detecting attacks on CAN networks. Its ability to provide accurate attack alerts with no false positives ensures operational safety while minimizing disruptions. For node detection, while Gradient Boosting demonstrates promise, further refinements are needed to ensure consistent and robust performance across all nodes, especially in scenarios with diverse attack patterns and varying data volumes.
The challenges faced by the MLP model point to potential limitations in applying deep learning approaches directly to resource-constrained environments like vehicular systems. This highlights the need for lightweight yet effective models capable of maintaining high performance without excessive computational overhead.