1. Introduction
Intelligent transport systems (ITSs) make a great contribution to modern life. They offer new services that control adverse events such as road accidents and improve traffic management [1,2]. Rapid progress in wireless communication technologies helps create such systems, because vehicles equipped with these wireless technologies can efficiently establish links with other vehicles as well as with roadside units (RSUs). A vehicular ad hoc network (VANET) provides these wireless connections between network nodes (i.e., vehicles or roadside infrastructure). Recently, this network has attracted the attention of researchers because of its potential role in ITSs. VANETs have many applications, such as improving passenger safety, optimizing traffic efficiency, autonomous driving, accessing the Internet of vehicles (IoV), collecting real-time data for traffic control and road protection systems, paying road tolls automatically, and entertainment [3,4]. VANETs also have specific features, such as frequent disconnections, dynamic topology, and moving nodes.
Table 1 summarizes these features for vehicular ad hoc networks.
Designing an efficient routing approach is a serious issue in VANETs [5,6]. Routing protocols are responsible for determining paths between source-destination pairs [7]. They are also responsible for forming an alternative route when a discovered path breaks. In such a case, if the routing path is not selected properly, network performance suffers. Path efficiency is measured based on the participation of nodes in the data transmission process. Due to the highly dynamic topology and high-speed vehicles, designing efficient routing solutions is a serious challenge in these networks. Therefore, many researchers have attempted to modify the existing routing schemes in VANETs. Despite many efforts in this regard, routing protocols are still vulnerable and incomplete.
Machine learning (ML) is a field originating from artificial intelligence (AI) that includes efficient and powerful techniques [8,9]. These techniques can be applied to integrate autonomous decision-making into vehicular ad hoc networks and to solve their various challenges, such as routing. ML can produce more intelligent machines trained on past experience without human interference, meaning that they do not need explicit programming. Machine learning involves three branches: supervised, unsupervised, and reinforcement learning. The first class (i.e., supervised learning) works with an input dataset and corresponding outputs (i.e., labels). Techniques in this class seek to form a learning model that explores the relationship between data samples and labels and produces a function mapping the data to the labels. This model is then used for predicting unlabeled data. In unsupervised learning, there is no output related to the inputs, meaning that the data is unlabeled; unsupervised learning must find the existing patterns and relationships between data samples. In reinforcement learning (RL), an agent and a dynamic environment interact with each other, and these interactions determine the ideal behavior of the agent with regard to the rewards and penalties produced by the environment [10]. In VANETs, ML techniques, especially RL, enable vehicles to make autonomous decisions for networking operations such as routing [11,12]. The agent must obtain knowledge of the environment dynamics from the collected data to find the most suitable action and achieve a certain purpose, like discovering routes with minimum delay. RL can be used to optimize various issues in VANETs, such as predicting traffic conditions, estimating network traffic, controlling network congestion, discovering routes, enhancing network security, and allocating resources.
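To make the agent-environment loop concrete, the following is a minimal tabular Q-learning sketch in Python. The environment interface (`reset`, `step`, `actions`) and all hyperparameters are illustrative assumptions, not taken from any surveyed protocol.

```python
# A minimal sketch of tabular Q-learning; the environment is assumed to
# expose reset() -> state, actions(state) -> list, and
# step(action) -> (next_state, reward, done).
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon,
            # otherwise exploit the current Q estimates.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Temporal-difference update toward the Bellman target.
            best_next = max((Q[(next_state, a)] for a in env.actions(next_state)),
                            default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

In a routing context, the state could be the current forwarding node, the actions its candidate next hops, and the reward a function of delay or link stability; the surveyed protocols differ mainly in how they instantiate these three elements.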
Reinforcement learning algorithms are attractive solutions for improving routing methods in VANETs. However, this area still needs more research, because machine learning, and especially reinforcement learning, is a significant research subject in VANETs. Note that most review papers related to machine learning and VANETs do not focus on RL applications in designing routing protocols. For example, in [13], the authors reviewed various RL-based routing methods in VANETs. Ref. [14] studied various applications of RL and DRL in vehicular network management but did not consider their applications for improving routing approaches. In [15], the authors investigated the importance of artificial intelligence techniques in different areas of VANETs, especially routing; however, they do not clearly explain how reinforcement learning techniques can be used to improve vehicular communication. In [16], the authors studied different RL and DRL applications in various Internet of things (IoT) systems. In [17], the authors examined how to use multi-agent reinforcement learning techniques in different VANET applications such as resource allocation, caching, and data offloading. Overall, our studies show that few review papers have been published on RL applications for designing routing schemes in VANETs. Thus, this important issue requires further research to better identify future research directions and their challenges. We believe that our survey can help researchers understand how to create RL-based routing protocols in VANETs. In this review, we propose a categorization of RL-based routing schemes with regard to learning framework (single-agent or multi-agent), learning model (model-based or model-free), learning algorithm (RL or DRL), learning process (centralized or distributed), and routing algorithm (position-based, cluster-based, or topology-based (proactive, reactive, and hybrid)). Then, we present the latest routing approaches according to the proposed classification.
The organization of this paper is as follows: Section 2 describes several review papers in this area. Section 3 briefly reviews reinforcement learning and the Markov decision process. In Section 4, VANETs and their applications are introduced briefly, with a focus on the routing operation and its issues. Section 5 proposes a categorization for RL-based routing schemes in VANETs. In Section 6, several RL-based schemes in VANETs are investigated. Section 7 presents a discussion of the RL-based routing methods. Section 8 demonstrates the major challenges and open issues in this area. Finally, Section 9 concludes the paper.
2. Related Works
Today, researchers study machine learning, and especially reinforcement learning, because it is a significant research subject in VANETs. Table 2 summarizes some review papers in this field. Note that most review papers related to machine learning and VANETs do not focus on RL applications in designing routing protocols.
In [13], the authors reviewed different RL-based routing protocols in VANETs and claim that their survey is the first review paper to analyze RL-based routing algorithms in VANETs. It is a comprehensive review and very suitable for researchers in this field. In [13], routing methods are divided into seven categories: hybrid, zone-based, geographical, topology-based, hierarchical, secure, and DTN. However, this categorization is very limited and does not evaluate the routing algorithms in terms of learning structure or RL algorithm.
In [14], the authors studied RL and DRL applications in vehicular network management. Firstly, they introduced vehicular ad hoc networks. Then, they reviewed the RL and DRL concepts. Finally, they carefully studied the newest applications of these learning techniques in two different areas: vehicular resource management and vehicular infrastructure management. Note that this paper emphasizes vehicular network management using RL approaches and does not investigate the use of these approaches for improving routing schemes.
In [15], the authors examined the importance of artificial intelligence (AI) techniques in various fields of VANETs. They briefly explained three AI techniques, namely machine learning methods (especially RL), deep learning (especially DRL), and swarm intelligence. Then, they studied how various AI techniques solve different challenges in VANETs, carefully examining six areas: application, routing, security, resource and access technologies, mobility management, and architecture. However, the authors do not clearly explain how reinforcement learning techniques can be used to improve vehicular communication.
In [16], the authors reviewed reinforcement learning and deep reinforcement learning techniques in various IoT systems, including wireless sensor networks (WSNs), wireless body area networks (WBANs), underwater wireless sensor networks (UWSNs), the Internet of vehicles (IoV), and the Industrial Internet of things (IIoT). Then, they divided them into seven different categories: routing, scheduling, resource allocation, dynamic spectrum access, energy, mobility, and caching. However, RL-based and DRL-based routing methods are investigated only for wireless sensor networks.
In [17], the authors examined multi-agent reinforcement learning (MARL) techniques for solving various problems in VANETs. The surveyed works focus on resource allocation, caching, and data offloading in VANETs. The authors also explained how to use MARL techniques in streaming applications and mission-critical applications. Finally, they presented the challenges related to these systems in VANETs. However, this paper does not focus on MARL applications for designing routing protocols in VANETs.
In [18], the authors investigated how to apply reinforcement learning (RL) to build routing approaches in flying ad hoc networks (FANETs). For this purpose, they explained these networks, their constraints, main components (especially drones), and applications in different fields, and specified the routing challenges in these networks in detail. Finally, a classification of routing approaches was presented, covering three main fields, namely learning algorithm, routing algorithm, and data dissemination process. According to this classification, the latest RL-based routing approaches in FANETs were reviewed.
Overall, our studies show that few review papers have been published on RL applications for designing routing schemes in VANETs. Thus, we focus on RL-based routing protocols in VANETs and review their learning structure. Additionally, we propose a categorization of RL-based routing schemes with regard to learning framework (single-agent or multi-agent), learning model (model-based or model-free), learning algorithm (RL or DRL), learning process (centralized or distributed), and routing algorithm (position-based, cluster-based, or topology-based (proactive, reactive, and hybrid)).
7. Discussion
According to the routing approaches studied in Section 6, it can be deduced that the most common RL technique applied in routing protocols is Q-learning. For example, based on Table 8, IV2XQ, QAGR, CEPF, PFQ-AODV, QGrid, QTAR, Q-LBR, ECTS, GLS, RSAR, Wu et al., RHR, IRQ, and QFHR have utilized the Q-learning algorithm to design their routing models. The reason is the simplicity and low computational complexity of Q-learning. The performance of this RL technique depends on the size of the state and action sets. If these sets are small, Q-learning is an appropriate option for modeling the routing process because it can obtain an optimal response in a short time. However, in a complicated learning environment with large state and action sets, the learning capability of the agent drops sharply; the agent may take a long time to find the most suitable response or may never reach an optimal one. This means that the Q-learning algorithm has a low convergence speed under these conditions. Researchers have suggested various ideas to solve this problem.
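For reference, these protocols instantiate the standard one-step Q-learning update

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],
$$

where $\alpha$ is the learning rate and $\gamma$ the discount factor. Because the Q-table stores one entry per state-action pair, its size, and hence the amount of exploration needed for convergence, grows with $|S| \times |A|$, which is why protocols with small state and action sets converge quickly while large ones converge slowly or not at all.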
Table 8 summarizes the most important parameters of the reviewed methods. According to this table, it can be deduced that IV2XQ, QGrid, ECTS, GLS, IRQ, and QFHR use an intersection-based routing technique. In this idea, the intersections in the network environment are considered as the state space, and a central controller is responsible for obtaining the most suitable path between intersections (a minimal sketch follows this paragraph). This yields a smaller state set and improves the convergence speed of RL-based routing approaches. However, in these schemes, the central controller calculates the global path to the destination and usually uses historical information to discover routes. As a result, these methods may be incompatible with real-time events such as accidents or gridlock on the roads. Another idea, adopted in routing methods such as RLRC, Wu et al., and SeScR, is to apply clustering techniques to the network. In this case, CH nodes perform the routing operation, so the state and action sets of the routing algorithms are limited and their convergence speed is improved. Moreover, CEPF presents a novel idea of selecting edge nodes and sending data through these nodes, which improves the speed of route formation and reduces the number of transmitting nodes in the network. Another solution used by some researchers is to utilize UAVs to improve the routing operation in VANETs. For example, QAGR uses UAVs to calculate a routing direction, which filters the neighboring ground nodes and yields a smaller state space, helping QAGR improve its convergence speed. Furthermore, in QTAR, the researchers use two Q-learning-based routing techniques: one finds global paths between RSUs at intersections, and one explores local paths between vehicles on each road segment.
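The following sketch illustrates the intersection-based idea: the state set contains only intersection IDs and the actions are outgoing road segments. The road-graph representation, reward shaping, and hyperparameters are assumptions for illustration, not the published design of IV2XQ, QGrid, or any other surveyed method.

```python
# A minimal sketch of an intersection-based Q-table, assuming states are
# intersection IDs and actions are adjacent intersections (road segments).
from collections import defaultdict

class IntersectionRouter:
    def __init__(self, road_graph, alpha=0.1, gamma=0.9):
        # road_graph: intersection -> list of adjacent intersections
        self.graph = road_graph
        self.alpha, self.gamma = alpha, gamma
        self.Q = defaultdict(float)  # Q[(intersection, next_intersection)]

    def update(self, here, nxt, reward):
        # One-step Q-update; the state set is only the intersections,
        # which keeps the table small and speeds up convergence.
        best_next = max((self.Q[(nxt, n)] for n in self.graph[nxt]), default=0.0)
        self.Q[(here, nxt)] += self.alpha * (
            reward + self.gamma * best_next - self.Q[(here, nxt)])

    def best_segment(self, here):
        # Greedy choice of the next road segment from this intersection.
        return max(self.graph[here], key=lambda n: self.Q[(here, n)])
```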
Also, according to Table 9, it can be deduced that the studied methods support various communication types in the network and are suitable for specific applications. For example, IV2XQ, QTAR, ECTS, Wu et al., SeScR, IRQ, and QFHR support V2V and V2I communications and are more suitable for urban environments. QAGR and Q-LBR use two communication types, V2V and vehicle-to-UAV (V2U), and require an aerial network to support the routing process; therefore, these protocols can be deployed only when this aerial infrastructure is available. Building such an aerial network is costly and not suitable for every application. It is also difficult to manage this network and the communication between different nodes, which makes the routing process very complex. In addition, some routing methods, such as CEPF, PFQ-AODV, QGrid, RRPV, RLRC, GLS, RSAR, and RHR, support only V2V communications. These algorithms are suitable for routing on highways and in urban areas.
Table 10 presents a comparison between the routing approaches with regard to different equipment, like positioning systems and digital maps. This equipment is used to calculate the location of vehicles. However, it is difficult to obtain the position of vehicles in some areas, such as tunnels. Also, the performance of a routing approach depends heavily on the positioning device: if this device cannot accurately calculate the location of vehicles, the performance of the routing method will not be desirable. Another point is that positioning systems are highly costly in terms of bandwidth consumption. Therefore, some routing methods, such as PFQ-AODV and Q-LBR, have attempted to design the routing process independently of this equipment. This idea is attractive and can be considered for designing routing methods in the future.
Additionally, Table 11 compares the routing methods based on their need for control messages. These control messages are exchanged between nodes to evaluate communication links and to share the position and velocity information of vehicles with other nodes in the network. However, control messages increase communication overhead, the consumption of bandwidth and other resources, network congestion, and delay in the routing process. Therefore, many researchers try to reduce the control messages exchanged in the network and to manage these broadcasts. For example, the IV2XQ and QGrid methods do not use any dedicated control messages in their Q-learning-based routing models; these schemes only broadcast beacon messages between vehicles to find the best route in each road segment. As a result, these methods have suitable bandwidth consumption, low delay, and low communication overhead. Moreover, RHR presents an adaptive broadcast technique to manage bandwidth consumption: the broadcast interval of the beacon message is dynamically determined using a broadcast control strategy.
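A hedged sketch of adaptive beacon scheduling in the spirit of RHR's broadcast control strategy is shown below; the linear back-off rule and the thresholds are assumptions for illustration, not the published algorithm.

```python
# Hypothetical adaptive beacon scheduler: lengthen the beacon interval
# as the channel gets busier, trading positional freshness for overhead.
def beacon_interval(channel_busy_ratio, base=0.1, max_interval=1.0):
    """Return the beacon interval in seconds for a given channel load
    (0.0 = idle, 1.0 = saturated), using a linear back-off."""
    ratio = min(max(channel_busy_ratio, 0.0), 1.0)
    return base + ratio * (max_interval - base)

# Example: at 60% channel load, beaconing slows from 10 Hz to about 1.6 Hz.
print(beacon_interval(0.6))  # -> 0.64 seconds between beacons
```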
Table 12 compares the routing methods in terms of learning framework. According to this table, it can be found that routing approaches such as IV2XQ, QAGR, CEPF, PFQ-AODV, QGrid, QTAR, ECTS, RLRC, GLS, RSAR, Wu et al., RHR, SeScR, IRQ, and QFHR apply single-agent reinforcement learning in their routing processes. These routing schemes are simpler than multi-agent approaches and have less computational complexity. However, the learning capability and convergence rate of the agent in single-agent routing schemes are lower than in multi-agent approaches. According to Table 12, it can be deduced that only RRPV uses a multi-agent technique. In this protocol, several agents (vehicles) interact with each other and share their experiences. This increases the learning capability of the routing scheme, especially for large-scale networks. However, RRPV has higher computational complexity than the single-agent approaches.
Moreover, Table 13 compares the routing approaches in terms of learning model. According to this table, it can be found that most routing protocols, such as IV2XQ, QAGR, CEPF, PFQ-AODV, QGrid, QTAR, ECTS, RLRC, GLS, RSAR, Wu et al., RHR, SeScR, IRQ, and QFHR, use a model-free reinforcement learning framework. In these methods, the agent estimates the value function according to the obtained experience and does not build any model of the learning environment. These routing schemes have less computational complexity than model-based routing methods; however, they need more experience than model-based techniques and are not flexible. Our studies show that RRPV is the only model-based routing method. It uses fuzzy logic to build the network model. Because it creates a network model, RRPV can estimate the value function with less interaction with the environment, and it is more flexible against sudden changes in the environment. However, model-based techniques have higher computational complexity, require a lot of computational resources, and perform poorly in large-scale networks.
Table 14 compares the routing approaches with regard to learning algorithms. According to this table, IV2XQ, QAGR, CEPF, PFQ-AODV, QGrid, QTAR, Q-LBR, RRPV, ECTS, RLRC, GLS, RSAR, Wu et al., RHR, IRQ, and QFHR use traditional RL techniques. These methods perform well and find the optimal response with acceptable convergence speed when the state and action sets are small. However, large state and action sets lead to slow convergence and more time to find an optimal response. SeScR utilizes a deep reinforcement learning technique to improve the learning rate (a minimal sketch of such a network follows this paragraph). This method can handle complex computational operations in complex (large-scale) environments, such as VANETs, and speeds up the agent's ability to optimize its policy. Such routing protocols perform well in large networks, and their learning speed is good. There are few routing methods that use deep reinforcement learning techniques to improve the routing process. Due to the rapid progress of technology and the emergence of new networks such as the Internet of Vehicles (IoV), it is essential to design routing techniques based on deep reinforcement learning, because these methods are successful in terms of convergence speed in larger state and action spaces.
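The sketch below shows the general DQN-style idea of replacing the Q-table with a neural network that scores candidate next hops; the layer sizes, the feature vector, and the fixed neighbor budget are illustrative assumptions, not the architecture of SeScR or any other surveyed protocol.

```python
# Minimal DQN-style next-hop scorer (PyTorch). A small MLP maps local
# state features (e.g., neighbor positions, speeds, link lifetimes) to
# one Q-value per candidate next hop, avoiding an explicit Q-table.
import torch
import torch.nn as nn

class NextHopDQN(nn.Module):
    def __init__(self, state_dim: int, max_neighbors: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, max_neighbors),  # one Q-value per neighbor slot
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Usage: score up to 8 candidate neighbors from a 32-dimensional state.
q_net = NextHopDQN(state_dim=32, max_neighbors=8)
q_values = q_net(torch.randn(1, 32))
best_neighbor = int(q_values.argmax(dim=1))
```

Because the network generalizes across similar states, it can cope with state spaces far too large for a table, which is the convergence advantage the discussion above attributes to DRL-based methods.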
In addition, Table 15 compares the routing approaches with regard to the learning process. According to this table, in IV2XQ, QGrid, Q-LBR, ECTS, GLS, and IRQ, the reinforcement learning algorithm is executed by a central agent (for example, a central server or a UAV) in the network. The agent then sends the results of this learning process to the network nodes, which begin the data transfer operation. This can reduce communication and computational overhead in the network because these methods do not need to exchange control messages between nodes. However, they may suffer from the single-point-of-failure problem, meaning that if the central agent cannot properly execute the RL-based routing algorithm, the network will be disrupted. Also, the central agent cannot adapt itself to topology changes in real time; therefore, when events like accidents occur, these algorithms cannot predict real-time traffic conditions or adapt to sudden changes in the network. Also, based on Table 15, it can be found that in some methods, like QAGR, CEPF, PFQ-AODV, RRPV, QTAR, RLRC, RSAR, Wu et al., RHR, SeScR, and QFHR, RL-based routing algorithms are executed locally in the network. In this case, computational cost and communication overhead are higher than in centralized routing approaches, because the network topology information is obtained locally by exchanging beacon messages between vehicles. On the other hand, these methods are scalable and more consistent with the dynamic network environment.
Finally, Table 16 categorizes the RL-based routing schemes with regard to routing techniques. From this table, it can be found that IV2XQ, QGrid, GLS, QAGR, CEPF, RRPV, QTAR, RSAR, IRQ, and QFHR are position-based routing approaches. These routing methods do not require information about the entire network and use local information to send data packets. As a result, they have low communication overhead and efficiently consume bandwidth and energy resources. Within this type of routing technique, RRPV, QTAR, and RSAR are known as DTN routing methods, meaning that they use the store-carry-forward technique to transfer data packets to the target node. This technique keeps routing overhead low but increases delay in the data transfer operation. On the other hand, IV2XQ, GLS, QAGR, CEPF, IRQ, and QFHR are known as non-DTN routing protocols, meaning that they use a greedy forwarding technique for data transfer (a minimal sketch follows this paragraph). These methods perform well in dense networks. Moreover, they are scalable, have low routing overhead, and consume little memory and bandwidth. The most important challenge in these routing protocols is to accurately obtain the location information of nodes, because if the position of the nodes is unavailable or inaccurately calculated, the performance of these protocols degrades. Note that QGrid uses two routing techniques, namely the greedy strategy (a non-DTN routing scheme) and the Markov prediction method (a DTN routing scheme), to discover routes in the road segments. On the other hand, RLRC, Wu et al., and SeScR use clustering techniques in their routing processes, so that the CH node manages the cluster and inter-cluster communication. These routing methods can greatly reduce the routing messages exchanged in the network and prevent network congestion. However, the challenges of this type of routing protocol are CH selection and cluster management, especially in dynamic networks such as VANETs. Furthermore, Q-LBR, ECTS, PFQ-AODV, and RHR are known as topology-based routing protocols, of which Q-LBR, ECTS, and PFQ-AODV are reactive routing methods. These methods are successful in terms of memory consumption, bandwidth, and routing overhead; however, they face challenges such as high delay in the route discovery process, flooding of control messages, and network congestion. On the other hand, RHR integrates proactive and reactive schemes and utilizes their benefits, such as controlling routing overhead and lowering delay in the routing operation; it is appropriate for large-scale networks. Note that some routing protocols integrate position-based and topology-based routing. For example, the RL-based routing processes in some geographic routing protocols, like IV2XQ, QGrid, GLS, and IRQ, are executed in a proactive manner. Additionally, the routing processes in QAGR, CEPF, RSAR, and QFHR integrate geographic and reactive routing methods. Finally, RLRC is a cluster-based reactive routing scheme, meaning that the routing process between CH nodes is performed in a reactive manner.
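For completeness, the following is a minimal sketch of the greedy forwarding strategy used by the non-DTN position-based protocols above: each hop relays the packet to the neighbor geographically closest to the destination. The node records and the Euclidean distance metric are illustrative assumptions.

```python
# Minimal greedy forwarding: pick the neighbor closest to the destination,
# or return None at a local maximum (where recovery or store-carry-forward
# strategies would take over in a real protocol).
import math

def greedy_next_hop(current_pos, dest_pos, neighbors):
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    best = min(neighbors, key=lambda n: dist(n["pos"], dest_pos), default=None)
    # Only forward if the best neighbor actually makes progress.
    if best is None or dist(best["pos"], dest_pos) >= dist(current_pos, dest_pos):
        return None
    return best

# Example: of two neighbors, the one nearer the destination is chosen.
hops = [{"id": "v1", "pos": (50, 10)}, {"id": "v2", "pos": (80, 40)}]
print(greedy_next_hop((40, 0), (100, 60), hops))  # -> the record for "v2"
```

As the discussion notes, this strategy stands or falls with the accuracy of the position information: a stale or wrong neighbor position directly corrupts the next-hop choice.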