1. Introduction
The efforts of industry and academia to find solutions to urban traffic problems (such as traffic congestion, cargo theft, and optimized public transportation, among others) have allowed vehicles to become more than just transportation machines [1]. Modern vehicles are equipped with novel communication devices (e.g., wireless antennas and cellular technology) that make it possible to communicate with surrounding vehicles, send and receive messages, and access remote applications through an Internet connection. Regarding cellular technology, vehicles commonly use 4G/LTE (Long Term Evolution) or 5G (fifth-generation mobile network). Some studies have already considered the forthcoming 6G [2] in the vehicular context.
The vehicular network, initially referenced as Vehicular Ad hoc Network (VANET), has evolved into the Internet of Vehicles (IoV) [
3] as a result of its integration with other technologies, namely the Internet of Things (IoT) [
4]. Furthermore, IoV supports different types of communication: Vehicle to Vehicle (V2V), Vehicle to Infrastructure (V2I), Vehicle to Sensor (V2S), Vehicle to Roadside Unit (V2R), Vehicle to Pedestrian (V2P), and Vehicle to Everything (V2X) [5]. These communication types have drawn attention to security requirements, since each one can require different layers of security mechanisms.
Figure 1 shows an overview of the different kinds of vehicular communications, where other technologies can also be integrated, namely, cloud and edge computing [
6].
Although vehicles have fewer computing resources (e.g., processing power or storage) than traditional computers, they have caught the attention of malicious users who can adapt the methodology of computer-based attacks (e.g., Denial-of-Service (DoS) [
7], Sybil [
8], Jamming [
9], Fuzzy [
10], Spoofing [
11], and Eavesdropping [
12], among others) to vehicular networks. In addition, vehicular networks have the potential to generate valuable data about their users (e.g., vehicles’ routes, Global Positioning System (GPS) coordinates, vehicles’ identity, or most visited places) that can also be valuable for malicious users. On the other hand, providing or adapting network security tools for IoV is not a trivial task due to characteristics that cannot be ignored, such as rapid network topology change, high node mobility, and short connection duration.
Machine Learning (ML) algorithms have been explored to maximize the potential of identifying malicious users and network security breaches. However, in the IoV context, the use of ML-based solutions faces a crucial challenge: finding publicly available network datasets. In an ideal model-building scenario, the ML model would be trained with data extracted from a real vehicular testbed. However, generating datasets with real data is challenging because public sources are not widely available. Nonetheless, well-known, publicly available vehicular network simulators, such as Network Simulator 3 (NS-3) [13] and Veins [14], can be used to create private and public datasets.
Network security tools use different strategies for identifying malicious activities, where ML algorithms can help them to increase the detection rate. An Intrusion Detection System (IDS) is a good example of a security tool that merges its functionalities with ML algorithms. For this purpose, the right choice of vehicular network datasets represents an important step in correctly labelling malicious behaviors in such a vehicular scenario.
Availability is one of the three pillars of data and information security; the other two are confidentiality and integrity. As mentioned above, some attacks can cause network disruption, especially the Flooding attack. By performing this attack, malicious vehicles can stop legitimate messages from reaching their destination on the network. Furthermore, this attack can also lengthen the time for receiving useful messages, such as those sent by vehicular safety applications requiring low latency. To address this problem, we developed an IDS that uses ML algorithms to detect the Flooding attack in 5G-enabled vehicular networks.
Our contributions are the following:
We propose four new labelled datasets of 5G-enabled vehicular networks with 16 features, which have Flooding attack characteristics.
We build a decision tree model that outperforms, on metrics such as accuracy, precision, recall, and F1 score, some works that use more complex ML algorithms.
The remainder of the work is organized as follows.
Section 2 presents the background and related work. In
Section 3, we present the vehicular scenarios used to generate our datasets.
Section 4 presents our experimental setup and
Section 5 reports the obtained results. Finally,
Section 6 concludes the paper.
2. Background and Related Work
Cyber-attacks on vehicular networks can compromise the entire communication structure between vehicles, either by preventing vehicles from receiving safety messages or by consuming network resources such as bandwidth, hence putting human lives at risk [
15]. The lack of security mechanisms for vehicles can cause chaos in a city, where stopping 20% of the vehicles during heavy traffic would be enough for this disaster to occur [
16]. Different studies have been conducted by the scientific community bearing in mind the seriousness of this threat [
17,
18]. The dynamic nature of these networks presents characteristics that cannot be ignored, such as high mobility, the number of vehicles in a given area, and connection time [
19].
Attacks on in-vehicle communications, such as espionage, injection, bus-off, and DoS attacks, aim to cause Electronic Control Unit (ECU) malfunctions [20]. ECUs provide different services for passengers, such as entertainment, system information, and import of multimedia content. For instance, an espionage attack occurs when an attacker gains access to the vehicle’s messages and identifies patterns in the legitimate messages exchanged over the Controller Area Network (CAN). Since CAN messages are not authenticated, the injection attack enables attackers to access the vehicle through On-Board Diagnostic II (OBD-II), ECU ports, or entertainment services, allowing the injection of malicious messages into the network or devices. The bus-off attack aims to turn off an ECU by continuously sending bits that cause the ECU’s error counter to increase.
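The bus-off mechanism can be illustrated with a minimal sketch. It assumes the standard CAN fault-confinement rules (the transmit error counter rises by 8 on each transmit error, falls by 1 on each successful transmission, and the node enters bus-off once the counter exceeds 255). The function name and the event encoding are hypothetical and are not taken from the cited works.

```python
def frames_until_bus_off(events):
    """Return the 1-based index of the event that pushes a node into bus-off.

    `events` is an iterable of booleans: True means a transmit error was
    forced on the victim's frame, False means the frame was sent successfully.
    Returns None if the node never reaches bus-off.
    """
    tec = 0  # transmit error counter of the victim node
    for index, is_error in enumerate(events, start=1):
        if is_error:
            tec += 8                 # assumed CAN rule: +8 per transmit error
        else:
            tec = max(0, tec - 1)    # assumed CAN rule: -1 per success
        if tec > 255:                # bus-off threshold
            return index
    return None


# An attacker that corrupts every frame of the victim.
victim_off_at = frames_until_bus_off([True] * 40)
```

Under these assumptions, an attacker who corrupts every frame needs 32 consecutive errors to silence the victim, whereas occasional errors are absorbed by the decrement on successful transmissions.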
There are three types of IDS: network-based (Network-based Intrusion Detection System, NIDS), host-based (Host-based Intrusion Detection System, HIDS), and hybrid [
21,
22]. NIDS aims to monitor the network to which the devices are connected. HIDS seeks to detect anomalies that may occur in the device on which the IDS is configured. The hybrid approach combines the characteristics of the other two. Regardless of its type, an IDS that applies ML techniques uses datasets generated with data from real or simulated networks to train the anomaly classifier [
23].
Although vehicles can use different communication technologies to share information (e.g., Wi-Fi), they mainly use the IEEE 802.11p communication standard. However, as vehicular applications become more robust, there is a need for new technologies that enable low delay and high throughput, such as 5G technology. As highlighted in [
24], applying 5G technology in vehicular scenarios can expand the integration of systems that use 3G, 4G, Wi-Fi, ZigBee, and Bluetooth. In addition, vehicular safety applications demand messages with low latency. For example, a collision avoidance application can avoid an accident by receiving timely messages before the driver reacts to the behavior of an adversary vehicle.
Deep Learning (DL) has revolutionized how ML optimizes information processing, enabling it to be used in different areas of knowledge. Tangade et al. [
25] applied DL in vehicular networks, highlighting the possibility of increasing reliability, reducing latency, and detecting security problems.
The particularities of an inter-vehicle network can directly affect the accuracy of building an ML model. For example, each vehicular environment has its own heterogeneous characteristics (e.g., number of nodes, network topology, and available resources) that can influence how the ML model will react to the behavior of the entire network.
Seeking to provide public datasets, Gonçalves et al. [
26] generated different datasets for IoV, where they performed DoS and Fabrication attacks (i.e., false acceleration, speed, and direction data). Aiming to validate the generated datasets, they proposed a hierarchical IDS that uses ML algorithms [
27] to identify malicious behaviors in the network. Each generated dataset has a total of 18 columns/features, including the attack class label [
26].
In the context of Smart cities and electric vehicles, Aloqaily et al. [
28] proposed the identification of Probing, User to Root (U2R), Remote to User (R2U), and DoS attacks in Connected Vehicular Network (CVN) using an IDS. The strategy used consisted of grouping vehicles into clusters [
29], for which the algorithm selects a cluster head (CH) that is responsible for communicating with the trusted third parties (TTP) that are not available in the cluster. They use deep belief network (DBN) and decision tree (DT) algorithms for identifying and classifying anomalies. In the proposed IDS, the authors use a hybrid dataset (network data from NS-3 and NSL-KDD dataset) as input. For the classification of anomalous or normal behavior, the network data packets are processed by the DBN algorithm, which aims to reduce unnecessary network data packets. Finally, the DT algorithm classifies network packets into anomalies or legitimate packets. Additionally, it is pointed out that the NS-3 network data are only used to add normal traffic to the dataset. Apart from the work done, both datasets have the same format. It is important to highlight that NSL-KDD does not use vehicular network data. As already mentioned, vehicular networks have their own characteristics that should not be ignored. Finally, it is not mentioned which features the hybrid dataset has and how important each feature is after DT classification.
Privacy issues in vehicular networks should be addressed at different levels of the vehicular network architecture, since the attacker can harm users in different ways, such as spreading false information, receiving and collecting/processing unauthorized data, and so forth. For example, the Sybil attack can create different identities, each of which can simulate a vehicle on the road. In this case, a legitimate vehicle may not receive an important message about the road conditions. Liang et al. [
30] proposed an IDS for identifying False Information and Sybil attacks. The proposed tool was used in two scenarios for data collection (conducting training) and testing. The first scenario did not contain anomalies, and the second one did, to perform the training of the anomaly detection algorithm. The detection algorithm used is called growing hierarchical self-organizing map (GHSOM), which is a neural network.
Garip et al. [
31] presented the first adaptive botnet detection mechanism, called SHIELDNET. To evaluate the proposed solution, they simulate different scenarios in the Veins tool, which integrates the Simulation of Urban MObility (SUMO) and OMNeT++ simulators, and use ML algorithms to identify botnets on the network.
Adhikary et al. [
32] proposed a hybrid algorithm to detect distributed DoS (DDoS) attacks in VANETs, where their solution combines support vector machine (SVM) kernels, namely, AnovaDot and RBFDot. Their simulation has a total of 5 RSUs and 1000 vehicles, where the vehicles are displaced every 100 to 500 ms. To evaluate their solution, they also generated a dataset with two classes, 0 (normal behavior) and 1 (victim or DDoS attacker). First, they evaluated the accuracy for each RSU considering only AnovaDot, RBFDot and the Hybrid algorithm. Second, they also considered Gini coefficient, Kolmogorov–Smirnov (which measures the empirical distance between two sample datasets), Hand Measure (which is an alternative performance measure for Area Under Curve—AUC), and Minimum Error rate.
As a solution to the black-hole attack in self-driving vehicles, Alheeti et al. [33] developed an IDS that uses neural networks and fuzzified data to identify and correct the problem. To simulate the message exchange between different vehicles and between vehicles and RSUs, the NS2 simulator was used, taking as input the data generated by the SUMO and MObility model generator for VEhicular networks (MOVE) [34] simulators. A statistical approach called Proportional Overlapping Scores (POS) was also used to extract relevant information from the trace files generated by NS2.
Kosmanos et al. [
35] developed an IDS to identify spoofing attacks in electric vehicles. In addition to using ML, they also employ Position Verification using Relative Speed (PVRS) to optimize the results obtained. An attacker performs some actions on the vehicle or network through the spoofing attack, such as data theft, sending false information, and sending false GPS information (i.e., GPS spoofing).
Polat et al. [
36] proposed an IDS solution to detect DDoS attacks on Software-defined network (SDN)-based VANETs, where SDN is regarded as a key enabler of 5G. To detect the DDoS attack, they used a deep network model that combines a stacked sparse autoencoder (SSAE) with a Softmax classifier.
In addition, Otoum et al. [
37] developed a transfer-learning-driven intrusion detection for IoV, where they used deep neural networks and Convolutional Neural Network (CNN) in two datasets, namely, CICIDS2017 and CSE-CIC-IDS2018. Their solution aims to classify DoS, DDoS, Botnet, Brute-force, Infiltration, Web Attacks, and Port Scan attacks.
Table 1 summarizes the related work and our proposed IDS, where we emphasize that ours is the only approach that uses 5G technology.
The related work described above lacks discussion on non-trivial issues in ML, such as data distribution and how the data are balanced among classes. These are important themes, since poorly distributed and/or unbalanced datasets can pose serious difficulties to proper model training and consequent performance. Furthermore, most of the related work also seems to completely disregard the usefulness of the simplest and most interpretable ML models, such as decision trees, and how proper parameter settings can improve the quality of the models when regarded through different metrics. As the reader will see in
Section 4 and
Section 5, in our work we explore the parameterization of simple ML algorithms, combine different datasets in order to improve data distribution, and evaluate the results using different metrics that are robust to unbalanced data.
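To make the point about unbalanced data concrete, the following minimal sketch compares accuracy and F1 score on a toy label vector. The 95/5 class split and the two hypothetical detectors are invented for illustration and are not results from our datasets.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with class 1 as the attack class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 95 benign packets (0) followed by 5 attack packets (1).
y_true = [0] * 95 + [1] * 5

# Hypothetical detector A: labels every packet as benign.
always_benign = [0] * 100
acc_naive = accuracy(y_true, always_benign)
f1_naive = f1(y_true, always_benign)

# Hypothetical detector B: 2 false alarms, 4 of the 5 attacks caught.
detector = [1, 1] + [0] * 93 + [1, 1, 1, 1, 0]
acc_detector = accuracy(y_true, detector)
f1_detector = f1(y_true, detector)
```

Detector A reaches 95% accuracy while detecting no attack at all, and its F1 score is 0. Detector B is only slightly more accurate (97%), yet its F1 score of 8/11 reflects that it actually finds attacks.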
3. Simulated Scenarios
The proposed scenarios are simulated in a virtual machine running Ubuntu 20.04.5 LTS with Intel (R) Core (TM) i5-8300H, four cores at 2.3 GHz, and 8 GB RAM. The simulation parameters are listed in
Table 2. We use the open-source NS-3 network simulator with the 5G-LENA module, a GPLv2 New Radio (NR) network simulator [38] called nr, which allows simulating 4G and 5G networks as well as V2X-based 5G communication. The simulator models several network actors, such as remote hosts, which connect through a link to the Packet Gateway and Service Gateway and send traffic to the gNodeB, and user equipment (i.e., vehicles). Additionally, the nr module is described as a “hard fork” of the millimeter-wave (mmWave) simulator [38], which enables simulating the physical (PHY) and medium access control (MAC) layers, the mmWave channel [39], propagation, beamforming [40], and antenna models.
Furthermore, seeking to generate heterogeneous data, we used SUMO, as it permits the modeling of intermodal traffic systems, to generate four different maps. Each map can have a different number of nodes and a different coverage area. The four maps simulate some regions of Lisbon, Portugal (see
Figure 2).
The simulations are designed as follows:
All vehicles are equipped with 5G technology, where SUMO is used to generate mobility.
There are two distinct groups of vehicles: senders and receivers.
Message exchange between vehicles is made via the multicast address (i.e., 225.0.0.0).
As previously stated, vehicles are separated into two groups (i.e., senders and receivers) and we generated four maps:
the first map has a total of 45 vehicles, where 10 are senders (from this total, two vehicles are attackers) and 35 receivers;
the second map also has 45 vehicles, where 10 are senders (from this total, four vehicles are attackers) and 35 receivers;
the third map has a total of 70 vehicles, where 15 are senders (from this total, seven are attackers) and 55 receivers;
finally, the fourth map has 100 vehicles, where 19 are senders (from this total, nine vehicles are attackers) and 81 receivers.
In addition, each simulation lasts a total of 230 s. However, to give the vehicles time to move across the map, they only start exchanging packets at second 170.
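The four scenarios can be summarized as a small configuration structure, sketched below. The class, field, and constant names are hypothetical. The numbers come from the description above, and the sketch assumes that packet exchange starts at second 170 and continues until the end of the simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    senders: int
    attackers: int   # attackers are a subset of the senders
    receivers: int

    @property
    def vehicles(self) -> int:
        return self.senders + self.receivers

    @property
    def attacker_share(self) -> float:
        """Fraction of sender vehicles that are attackers."""
        return self.attackers / self.senders


SCENARIOS = [
    Scenario("map1", senders=10, attackers=2, receivers=35),
    Scenario("map2", senders=10, attackers=4, receivers=35),
    Scenario("map3", senders=15, attackers=7, receivers=55),
    Scenario("map4", senders=19, attackers=9, receivers=81),
]

SIM_DURATION_S = 230          # total simulated time
TRAFFIC_START_S = 170         # assumed start of packet exchange
MULTICAST_ADDR = "225.0.0.0"  # destination address used by the senders

traffic_window_s = SIM_DURATION_S - TRAFFIC_START_S
```

Written this way, the totals can be checked directly (45, 45, 70, and 100 vehicles), and the share of attacking senders grows from 20% on the first map to roughly 47% on the last two.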
Additionally, all datasets have the following features:
timeSec—this feature indicates the simulation time at which a packet is sent or received. In our dataset, we are considering only metrics of received packets;
txRx—a tag to indicate whether a packet was sent (tx) or received (rx);
nodeId—refers to the receiver node ID;
imsi—is the International Mobile Subscriber Identity, which is an identifier assigned with the SIM (Subscriber Identity Module) card;
srcIp—the IP address of a sender node;
dstIp—the IP address of a receiver node;
packetSizeBytes—it refers to the packet size in bytes. Each sender node uses a different size to increase randomness;
srcPort—refers to the port where the sender nodes are sending the packets;
dstPort—refers to the port where the receiver nodes are receiving the packets;
pktSeqNum—refers to the sequence of transmitted packets;
delay—the difference between the reception time of a packet and its sending time;
jitter—it uses the RFC 1889 [
41] format;
coord_x—is the “x” coordinate on the map generated in SUMO;
coord_y—is the “y” coordinate on the map generated in SUMO;
speed—is the speed of the vehicle in meters per second;
isAttack—is the class of benign (class 0) packet or malign (class 1) packet.
Table 3 shows the total number of rows and the class distribution of each dataset.
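As an illustration of the delay and jitter features, the sketch below applies the interarrival jitter estimator defined in RFC 1889, J = J + (|D(i-1, i)| - J)/16, where D is the difference between the delays of two consecutive packets. The function name and the sample timestamps are hypothetical, and whether the simulator traces compute jitter per flow in exactly this way is an assumption.

```python
def delay_and_jitter(packets):
    """Compute per-packet delay and running RFC 1889 jitter.

    `packets` is a list of (send_time, receive_time) pairs in seconds,
    ordered by arrival. Returns a list of (delay, jitter) pairs.
    """
    results = []
    jitter = 0.0
    previous_delay = None
    for send_time, receive_time in packets:
        delay = receive_time - send_time
        if previous_delay is not None:
            # D(i-1, i): change in transit time between consecutive packets.
            difference = abs(delay - previous_delay)
            jitter += (difference - jitter) / 16.0
        results.append((delay, jitter))
        previous_delay = delay
    return results


# Hypothetical timestamps: the third packet is delayed by an extra 0.25 s.
sample = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.75)]
features = delay_and_jitter(sample)
```

Because of the 1/16 gain, a single late packet moves the estimate only slightly, whereas a sustained burst of delayed packets, as might be expected under flooding, keeps pushing it upward.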
4. Experimental Setup
Our experiments are divided into two parts: first, we use one of the simplest learning methods available, namely decision trees, to explore different combinations of our datasets while obtaining preliminary baseline results; then, we explore other, more complex learning methods, namely the random forest ensemble method and the multilayer perceptron neural network. We use scikit-learn (version 1.1.1) [42] for all our experiments.
We name our four datasets according to the number of attackers simulated in each, specifically 2, 4, 7, and 9. In the first experiment, we train a classifier on each dataset and then test it on the remaining three. We measure the F1 score separately for each test set and also for a larger test set that joins the three. We prefer the F1 score to accuracy since some datasets are unbalanced (see Table 3). To choose the decision tree depth, we perform a grid search with 10-fold cross-validation over the depths {2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55}. This means that the training set is split into ten parts and, for each tree depth, ten different models are trained, each using nine parts for training and one part for validation; the best tree depth is then chosen based on the F1 scores obtained. From all the features included in the datasets (see Section 3), we use: nodeId, imsi, packetSizeBytes, dstPort, delay, jitter, coord_x, coord_y, and speed. The remaining ones were not used because they caused overfitting. In the second experiment, we train on a mix of two datasets and then test on the remaining two, both separately and joined. The third experiment uses a mix of three datasets for training and the remaining one for testing. In these experiments, the tree depth is not chosen with regular k-fold cross-validation, but rather with what scikit-learn calls GroupKFold cross-validation, in which each group is the set of samples from one dataset and the same group cannot appear in both the training and validation parts.
Table 4 presents the tested parameters in all algorithms.
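The third experiment (training on three datasets with GroupKFold and testing on the fourth) can be sketched as follows. The data here are synthetic stand-ins generated with NumPy, not our datasets. The two features, the attack ratios, and the helper name `make_fake_dataset` are assumptions made only so that the example is runnable.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_fake_dataset(n_rows, attack_ratio):
    """Synthetic stand-in for one simulated dataset (NOT real data)."""
    y = (rng.random(n_rows) < attack_ratio).astype(int)
    # Two hypothetical features; attack rows are shifted upward.
    X = rng.normal(size=(n_rows, 2)) + 2.0 * y[:, None]
    return X, y


# Datasets are named after their number of attackers, as in the text.
datasets = {
    name: make_fake_dataset(300, ratio)
    for name, ratio in [("2", 0.2), ("4", 0.3), ("7", 0.4), ("9", 0.5)]
}

train_names, test_name = ["2", "4", "7"], "9"
X_train = np.vstack([datasets[n][0] for n in train_names])
y_train = np.concatenate([datasets[n][1] for n in train_names])
# One group id per source dataset, so a dataset never appears in both
# the training and the validation part of a fold.
groups = np.concatenate(
    [np.full(len(datasets[n][1]), i) for i, n in enumerate(train_names)]
)

depths = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    scoring="f1",
    cv=GroupKFold(n_splits=len(train_names)),
)
search.fit(X_train, y_train, groups=groups)

X_test, y_test = datasets[test_name]
test_f1 = f1_score(y_test, search.predict(X_test))
```

With three training datasets, GroupKFold yields three folds, each validating on one whole dataset. This mirrors the final test condition, in which the model is applied to a scenario it has never seen.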
In the second part of our experimental setup, we use random forests and multilayer perceptron. The training data is always composed of three datasets, with the remaining one used for testing. The same GroupKFold cross-validation strategy is used for a grid search. With random forests, the search includes tree depths within the values {2, 3, 4, 5, 6, 7, 8, 9, 10} and the number of trees within the values {10, 20, 30, 40, 50}. Regarding the multilayer perceptron, the search includes batch size within {32, 64}, hidden layers and neurons within {(10,2), (20,2)}, optimizer between stochastic gradient-based optimizer (“adam”) and stochastic gradient descent (“sgd”), and the activation function between the hyperbolic tangent function (“tanh”) and the rectified linear unit function (“relu”).
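The two search spaces described above can be written as scikit-learn parameter grids, as in the sketch below. Mapping the "hidden layers and neurons" pairs directly onto the `hidden_layer_sizes` parameter of `MLPClassifier` is an assumption, since scikit-learn reads a tuple such as (10, 2) as two hidden layers with 10 and 2 neurons. The remaining keys are the corresponding `RandomForestClassifier` and `MLPClassifier` parameter names.

```python
from sklearn.model_selection import ParameterGrid

# Random forest: tree depth and number of trees.
rf_grid = {
    "max_depth": list(range(2, 11)),
    "n_estimators": [10, 20, 30, 40, 50],
}

# Multilayer perceptron: batch size, architecture, optimizer, activation.
mlp_grid = {
    "batch_size": [32, 64],
    "hidden_layer_sizes": [(10, 2), (20, 2)],  # assumed mapping, see above
    "solver": ["adam", "sgd"],
    "activation": ["tanh", "relu"],
}

n_rf_candidates = len(ParameterGrid(rf_grid))    # 9 depths x 5 sizes = 45
n_mlp_candidates = len(ParameterGrid(mlp_grid))  # 2 x 2 x 2 x 2 = 16
```

Either dictionary can be passed as `param_grid` to `GridSearchCV` together with a `GroupKFold` splitter. Each candidate is then fitted once per fold, so the random forest search is the more expensive of the two in number of fits.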