1. Introduction
The acceleration of urbanization has exacerbated urban transportation problems, particularly at complex road intersections. Decision making and planning for autonomous driving at intersections constitute a complex problem [1]. Numerous researchers have devoted their efforts to developing decision-making models for autonomous driving at intersections [2].
Intersections without traffic signals are more complex and prone to accidents: without signal management, lanes contain more potential conflict zones, and changes in vehicle driving behavior are more likely to cause confusion. Dense traffic from various directions can block the intersection, causing collisions, congestion, and safety incidents [3,4]. According to the U.S. National Highway Traffic Safety Administration's fatality analysis report, more than a quarter of all fatal crashes in the United States occur at or in connection with intersections, with approximately 50% of those occurring at uncontrolled intersections [5].
Due to the lack of traffic signals or signs, drivers need to decide on their own whether, when, and how to enter and pass through an intersection. With the emergence of autonomous vehicles, this task is transferred to machine learning and artificial intelligence algorithms. Autonomous vehicles can obtain road and vehicle status information through high-resolution cameras, LiDAR, and mmWave radar sensors. He et al. [6] proposed the DAMO-StreamNet framework, Li et al. [7] proposed the LongShortNet network, and Lv et al. [8] developed a fusion architecture. These advancements aim to improve real-time perception, providing more accurate results to support more reliable autonomous driving decision-making algorithms, which is crucial for navigating complex intersection scenarios. In this context, considering the interactions between vehicles is particularly important: driving that is too conservative may lead to a deadlock, where vehicles become stuck and never pass through the intersection, while driving that is too aggressive can result in collisions [9]. The continuous improvement of autonomous driving technology can improve traffic safety and increase traffic flow, and it can also provide economic benefits, environmental protection, and social inclusiveness [10].
Interactions between vehicles at unsignalized intersections are highly complex. Current research on autonomous driving strategies at intersections mainly concentrates on algorithms based on motion prediction, threat estimation, and cooperative decision making [11]. These studies generally assume that all vehicles on the road are autonomous [12], but human behavior often exhibits subjective uncertainty. Autonomous vehicles (AVs) need to interact with human-driven vehicles (HDVs), which exhibit unpredictable behavior. Therefore, decision-making algorithms for autonomous vehicles at unsignalized intersections must cope with dynamically changing conditions and unpredictable actions, which has long been a challenge in the field of autonomous driving.
Our research aims to enhance existing multi-agent reinforcement learning algorithms to better address these complex decision-making challenges. We improve the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [13] by enhancing the replay buffer and introducing a variable-noise mechanism. These enhancements increase the stability and robustness of decision making in dynamic environments with unpredictable surrounding vehicles. The proposed VN-MADDPG algorithm makes decision making at complex intersections more efficient and reliable.
The contributions of our paper are as follows:
The autonomous driving decision-making problem at unsignalized intersections is defined as a collaborative multi-agent reinforcement learning (MARL) problem. The training scenarios feature highly uncertain vehicle behavior; training in such scenarios enhances the ability of autonomous vehicles to handle emergent situations at intersections, improving the success rate and efficiency of decision making.
A variable-noise mechanism is integrated into the noise module of the VN-MADDPG algorithm. It gradually reduces the noise based on the proportion of remaining training episodes, encouraging the agent to rely more on the strategies it has already learned, which enhances the robustness and stability of the final decision model.
The experience replay buffer is designed with an importance sampling module. It focuses more on samples that significantly impact the learning process. VN-MADDPG can utilize experiences in the replay buffer more effectively to improve learning efficiency. This helps the algorithm quickly converge to superior policies, further enhancing the robustness of the model.
The rest of this paper is organized as follows. Section 2 provides an overview of recent related works. Section 3 introduces the VN-MADDPG model. The simulation environment and the experimental setup are described in Section 4. In Section 5, we analyze the experimental results. Finally, Section 6 concludes our work.
3. Methods
To address these issues, we propose the VN-MADDPG algorithm. In this section, the framework of our VN-MADDPG model is first outlined. Then, we provide a detailed description of the variable-noise mechanism and the importance sampling module, which are essential for improving the learning efficiency of the algorithm and enhancing the robustness of the decision-making model.
We formulate the decision-making problem of multi-vehicle autonomous driving in a mixed-traffic environment as a Markov decision process based on a collaborative multi-agent reinforcement learning algorithm. The Markov decision process is used by agents to learn strategies and supports agents in coordinating conflicting decisions [37].
We use MADDPG as the baseline algorithm due to its suitability for handling mixed cooperative-competitive environments and continuous action spaces, which are crucial in autonomous vehicle decision-making systems. MADDPG [13] excels in these settings by allowing each agent to implement a DDPG algorithm, making it effective for complex multi-agent scenarios. However, dynamic changes in the environment can diminish the relevance of experience samples in the experience pool, causing models to converge slowly. This slow convergence hampers the rapid acquisition of effective strategies. Therefore, enhancing the stability and robustness of MADDPG is essential to better adapt to dynamic environments and the unpredictable behaviors of surrounding vehicles.
3.1. VN-MADDPG
To enhance the stability and robustness of MADDPG and help the model learn better policies more quickly, we propose the VN-MADDPG algorithm. The structure and data flow of VN-MADDPG are shown in Figure 1.
In VN-MADDPG, the critic network of each agent is updated based on the policies of all agents, similar to MADDPG, while each agent’s actor network focuses only on its own observations. This can provide superior policy learning performance and promote cooperation among multiple agents effectively, thus enabling them to pass through intersections smoothly.
VN-MADDPG also utilizes an experience replay buffer module. Each experience sample is a tuple $(s, a, r, s', d)$, where $s$ denotes the observation vectors of all agents for the current environment state, $s'$ denotes the observation vectors of all agents at the next time step, $a$ is the action set of the agents, $r$ is the set of rewards of the agents, and $d$ is a Boolean value indicating whether all trained agents in this round arrived at the destination without collision.
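As a minimal sketch (class and field names are illustrative, not from the paper), the experience tuple and a fixed-capacity buffer could look like this:

```python
from collections import deque, namedtuple

# Illustrative field names for the (s, a, r, s', d) tuple described above.
Experience = namedtuple("Experience", ["s", "a", "r", "s_next", "d"])

class ReplayBuffer:
    """Minimal fixed-capacity experience replay buffer."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples drop out first

    def push(self, s, a, r, s_next, d):
        self.buffer.append(Experience(s, a, r, s_next, d))

    def __len__(self):
        return len(self.buffer)
```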
3.1.1. Variable-Noise Mechanism
In reinforcement learning, the exploration-exploitation dilemma is a well-known challenge. It refers to the trade-off between exploring new actions to gather more information about the environment and exploiting the current best-known action to maximize rewards. Striking the right balance is crucial, especially in complex multi-agent scenarios where the dynamics are more unstable.
Effective exploration is critical for discovering optimal policies, especially in complex multi-agent environments. MADDPG utilizes fixed noise to encourage agents to explore the environment, increasing the randomness of their behavior. However, this static approach often leads to suboptimal performance because it does not adapt to the evolving needs of the learning process. The early stages of training require higher exploration to map out the environment, while the later stages benefit from reduced exploration to fine-tune the learned policy. A variable-noise adjustment mechanism that adapts the noise level based on the training progress is theoretically more sound.
In VN-MADDPG, we make the noise dynamic. Before the start of each training episode, the agent retrieves the current episode number $n$ and the total number of training episodes $N$ to calculate the noise value for the current round, as shown in Figure 2. As the number of training episodes increases, the noise decreases accordingly. The noise is calculated based on the initial noise, the final noise, and the remaining training episodes, as shown in the following equation:

$$\sigma_n = \sigma_{\text{final}} + (\sigma_{\text{init}} - \sigma_{\text{final}}) \cdot p, \qquad p = \frac{N - n}{N}$$

where $p$ is the percentage of remaining training episodes, $N$ is the total number of training episodes, and $n$ is the number of the current episode. The noise $\sigma_n$ of the current episode is thus equal to the final noise plus the difference between the initial noise and the final noise, multiplied by $p$.
As the number of training episodes increases, the noise gradually diminishes to zero. This mechanism encourages the agent to explore the environment extensively at the start. In the early stages, larger noise allows the agent to try various actions, accumulating diverse experiences.
As training progresses, the agent continuously learns and optimizes its decision-making process through the policy network, gradually forming an effective strategy. In the later stages of training, with a more mature decision strategy in place, the noise decreases. This shift allows the agent to rely more on its learned policy rather than random exploratory actions.
The reduction in noise helps the agent perform tasks more reliably and with less unnecessary randomness, enhancing decision-making stability. Initially, the agent explores broadly to gather comprehensive environmental information. Later, it effectively uses the learned strategy to optimize decisions. This approach improves training efficiency and enhances the agent’s performance in complex unsignalized intersections.
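The schedule above can be sketched in a few lines of Python; the default initial and final noise magnitudes here are illustrative assumptions, not the paper's settings:

```python
def variable_noise(episode, total_episodes, initial_noise=0.3, final_noise=0.0):
    """Noise for the current episode: the final noise plus the
    remaining-episode fraction of the gap between initial and final noise."""
    remaining = (total_episodes - episode) / total_episodes
    return final_noise + (initial_noise - final_noise) * remaining
```

Calling this once before each episode reproduces the decay from the initial noise toward the final noise as the remaining-episode fraction shrinks.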
3.1.2. Importance Sampling Module
We also designed an experience importance sampling module to enhance the learning efficiency and convergence speed of the decision-making model. This module focuses on selecting experiences based on their sampling probability, which is determined by their priority.
In our approach, experiences that have a significant impact on agent behavior or represent near-optimal solutions are given higher priority. The priority of an experience is determined by the Temporal Difference (TD) error or reward prediction error. Experiences with larger prediction errors are assigned higher priority because they indicate that the model’s predictions for these experiences are less accurate, thus necessitating further learning.
The prediction error, which quantifies the difference between the predicted and actual values, is calculated using the following formula:

$$\delta = r + \gamma\, Q'(s', a') - Q(s, a)$$

where $r$ is the current reward obtained, $\gamma$ is the discount factor, $Q'(s', a')$ is the value of the next state calculated using the target network, and $Q(s, a)$ is the value of performing the action in the current state. Here, $a$ equals $\mu(s)$, the action selected in the current state, where $\mu$ is the policy that the current agent has learned and $s$ is the current state.
In addition, $a' = \mu'(s')$, where $Q'$ is the target Q network, $\mu'$ is the target policy network, and $s'$ is the next state. The priority weight $p_i$ for experience $i$ is defined as follows:

$$p_i = (|\delta_i| + \epsilon)^{\alpha}$$

where $\epsilon$ is a very small constant that ensures the priority is not zero, and $\alpha$ is a hyperparameter that controls the priority weights; it prevents high-priority experiences from being oversampled, which could result in training bias.

The probability of each experience being selected is calculated based on its priority weight $p_i$, defined as follows:

$$P(i) = \frac{p_i}{\sum_k p_k}$$

This helps the decision model focus more on impactful samples during training, effectively utilizing the experiences in the replay buffer.
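A compact sketch of the prioritization pipeline described above, using the TD-error, priority-weight, and sampling-probability definitions from this subsection (the $\epsilon$ and $\alpha$ defaults are illustrative assumptions):

```python
import numpy as np

def td_error(reward, gamma, q_next, q_current):
    # delta = r + gamma * Q'(s', mu'(s')) - Q(s, mu(s))
    return reward + gamma * q_next - q_current

def priority_weights(td_errors, eps=1e-6, alpha=0.6):
    # p_i = (|delta_i| + eps)^alpha: eps keeps every priority nonzero,
    # alpha tempers the spread so high-priority samples are not oversampled.
    return (np.abs(td_errors) + eps) ** alpha

def sampling_probs(p):
    # P(i) = p_i / sum_k p_k
    return p / p.sum()
```

Sampling minibatch indices with `np.random.choice(len(p), size=batch, p=sampling_probs(p))` then draws experiences biased toward those with larger prediction errors.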
Additionally, the experience replay buffer continuously updates the experience samples in the buffer based on their priority weights, as shown in
Figure 3. This adaptive update mechanism prioritizes important experiences for learning, thereby enhancing overall learning efficiency, which helps VN-MADDPG achieve better strategies more quickly.
The detailed steps of VN-MADDPG are summarized in Algorithm 1.
Algorithm 1 VN-MADDPG

Initialize all agents' actor networks and their parameters
Initialize all agents' critic networks and their parameters
Initialize all agents' target actor networks and their parameters
Initialize all agents' target critic networks and their parameters
Initialize the experience replay buffer $D$
for episode = 1 to $N$ do
    Calculate the variable noise $\sigma_n$ for the current episode
    Initialize the environment and obtain the initial observations
    for step = 1 to $S$ do
        (1) Each agent obtains its action based on its observation and the noise
        (2) Execute the actions; obtain the rewards, next observations, and done flag
        (3) Store the tuple $(s, a, r, s', d)$ in the replay buffer $D$
        (4) Update the priority weights $p_i$ of the experiences
        if it is time to update then
            Sample a minibatch from $D$ according to the sampling probabilities $P(i)$
            Calculate the target Q-value for each agent
            Calculate the loss $L$ and update the critic network
            Update the actor network by maximizing the expected return
            Soft-update the target networks
        end if
    end for
end for
3.2. State Space
In an unsignalized intersection, multiple vehicles need to pass through the intersection at the same time. The state space for the vehicles in the scene needs to include the state data of all perceived vehicles. Taking any agent vehicle $i$ in the scene as an example, its state space is defined as follows:

$$s_i = [v_i, v_j, d_{ij}, d_{i,\text{goal}}]$$

This includes the speed of the self-vehicle $v_i$, the speeds of the other vehicles $v_j$, the relative distances $d_{ij}$ of the other vehicles from the self-vehicle, and the distance $d_{i,\text{goal}}$ of the destination from the self-vehicle.
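For illustration, the observation above might be flattened into a single vector as follows (the ordering is an assumption, not specified by the paper):

```python
import numpy as np

def build_state(v_ego, v_others, rel_dists, dist_to_goal):
    """Flatten the per-agent observation into one state vector:
    [ego speed, other speeds..., relative distances..., distance to goal]."""
    return np.concatenate(([v_ego], v_others, rel_dists, [dist_to_goal]))
```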
3.3. Action Space
In reinforcement learning, the action space includes all possible actions that an agent can take. The agent’s behavior is defined by the action space, and an accurate definition of the action space can facilitate the learning process. The agent can explore and exploit experiences more effectively to achieve its goals. The definition of the action space becomes more critical in multi-agent scenarios.
MADDPG is more suitable for solving continuous action-space problems in autonomous driving decision making, and its action space design is continuous. Taking any agent vehicle $i$ in the scene as an example, its action space is defined as follows:

$$a_i = [\text{throttle}_i, \text{brake}_i, \text{steer}_i]$$

The action space for each agent vehicle in the scenario contains three continuous control signals: throttle, brake, and steer. The agent performs acceleration, deceleration, and steering maneuvers based on these continuous control signals to cross the intersection and reach the destination smoothly, completing the task.
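Before being applied, a raw actor output must be clipped into valid control ranges; CARLA's VehicleControl expects throttle and brake in [0, 1] and steer in [-1, 1]. A minimal sketch:

```python
def clip_action(throttle, brake, steer):
    """Clip raw policy outputs into valid CARLA control ranges."""
    clamp = lambda x, lo, hi: max(lo, min(hi, x))
    return (clamp(throttle, 0.0, 1.0),   # throttle in [0, 1]
            clamp(brake, 0.0, 1.0),      # brake in [0, 1]
            clamp(steer, -1.0, 1.0))     # steer in [-1, 1]
```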
3.4. Reward Function
Based on the multi-agent task scenario, we define the following rewards.
Local reward: The task objective is for agent vehicles to successfully reach the destination from the starting point while passing through intersections safely and quickly. We incorporate vehicle speed and time spent as evaluation criteria. Agent vehicles receive a positive reward based on their proximity to the target point. The vehicle’s proximity to the target point is used to determine whether it has reached the destination. If it does this successfully, it receives a reward for task completion. However, it incurs a penalty for conflicts with other vehicles.
Global reward: If all vehicles avoid collisions and safely reach their expected destination, each agent vehicle receives a positive reward. This promotes task cooperation among agent vehicles.
After multiple experimental adjustments, the final reward function is defined as follows:

$$R = r_{\text{speed}} + r_{\text{time}} + r_{\text{dist}} + r_{\text{coll}} + r_{\text{succ}} + r_{\text{global}}$$

where the speed reward $r_{\text{speed}}$ is a positive reward based on the difference between the agent's speed and the target speed; the time reward $r_{\text{time}}$ is a positive reward based on the time taken by the agent to arrive at the destination safely; $r_{\text{dist}}$ rewards or penalizes based on the distance between the agent vehicle and the target point (if the distance is greater than 1 m, a penalty is given, and if it is less than 1 m, a positive reward is given, encouraging the agent vehicles to continue moving toward the goal after crossing the intersection); $r_{\text{coll}}$ is a penalty incurred if the vehicle collides; $r_{\text{succ}}$ is a positive reward given if the vehicle successfully reaches the target and completes the task; and $r_{\text{global}}$ is a positive reward given if all agent vehicles successfully complete the task and arrive at their targets.
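As a sketch of how these terms might combine (all magnitudes are illustrative assumptions; only the 1 m distance cutoff is from the paper):

```python
def distance_reward(dist_to_goal, threshold=1.0, bonus=0.5, penalty=-0.1):
    # Positive reward within 1 m of the target point, penalty otherwise.
    return bonus if dist_to_goal < threshold else penalty

def total_reward(r_speed, r_time, r_dist, r_coll, r_succ, r_global):
    # Hypothetical additive combination of the reward terms described above.
    return r_speed + r_time + r_dist + r_coll + r_succ + r_global
```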
4. Experiments
In this section, we introduce the simulation environment and describe its rules in detail. Finally, we present our evaluation metrics.
4.1. Scenario Design
We evaluated the differences between various algorithms in a typical unsignalized two-way single-lane intersection with four entrances and four exits, as shown in
Figure 4a.
The scenario includes three agent vehicles equipped with the algorithms. These vehicles can interact with each other. The decision-making algorithms’ task is to adjust the longitudinal and lateral control of each vehicle along its driving path to ensure that each vehicle’s actions at each step are aligned with the global optimal solution. The goal is for all vehicles to avoid collisions, pass through the intersection quickly, and reach their respective destinations.
To validate the effectiveness and superiority of VN-MADDPG, we utilized Python as the development language and constructed the algorithm network structure based on PyTorch. We chose the Town04 map in the CARLA simulation [
38] and deployed different decision-making algorithms in an unsignalized two-way single-lane intersection. The compared decision-making algorithms include the DDPG algorithm, the MADDPG algorithm, and our proposed VN-MADDPG algorithm.
We chose CARLA for its highly realistic urban traffic simulation, which includes diverse road types, intersections, and sensors, enabling comprehensive testing of our algorithm. Its compatibility with deep learning frameworks also facilitates effective data collection and model training. The effects of the simulation experiment scene are shown in
Figure 4b.
The settings for the environment in the simulation were as follows:
To ensure experimental efficacy, the destination of each agent vehicle was fixed. One agent vehicle makes a left turn through the intersection, while the other two continue straight. The agent vehicles need to pass through the intersection and reach their specified destinations while controlling both lateral and longitudinal movements to avoid collisions. The experimental setup aimed to create as many conflict points in the vehicle trajectories as possible, covering the various conflict scenarios that can occur at intersections.
The initial lane of each agent vehicle was predetermined. At the beginning of each training episode, all vehicles were randomly generated within 5 m of the intersection on their respective lanes. This setup made the intersection more random and uncertain. The initial speed of all vehicles was around 3 m/s, making their speed and time when entering the intersection uncertain. The target speed for vehicles was set at 5 m/s, with a maximum speed limit of 8 m/s.
Each agent vehicle in the environment was equipped with visual sensors and collision detection sensors, providing all the perception data required for the experiment. To ensure stable experimental results, only three agent vehicles were set up in the environment. This setup simplified the training scenario and ensured the model could learn more effectively.
The main goal of the experiment was to compare the performance of various decision-making algorithms in identical experimental settings. The comparison focused on the collision and success rates of vehicles at intersections when deploying the different decision-making algorithms.
4.2. Evaluation Indicator
We utilized several metrics to assess the performance of autonomous vehicle decision-making algorithms in this scenario, including the collision rate, success rate, pass time, and cumulative reward.
The collision rate is defined as the proportion of episodes per 100 training episodes in which a collision occurs. Similarly, the success rate is defined as the proportion of episodes per 100 training episodes in which all agents complete the task. Task completion is defined as all vehicles successfully passing through the intersection and reaching their destinations without any collisions.
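These window-based rates can be computed directly from per-episode outcome labels, as in this minimal helper (it assumes each episode is labeled 'success', 'collision', or another failure label):

```python
def window_rates(outcomes):
    """Collision and success rates over a window of episodes
    (the paper evaluates over windows of 100 training episodes)."""
    n = len(outcomes)
    collision_rate = outcomes.count("collision") / n
    success_rate = outcomes.count("success") / n
    return collision_rate, success_rate
```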
5. Experimental Results and Analysis
In this section, we analyze the results of the experiments. We first introduce the types of recorded data and the training parameters of the models. Then, we compare the training processes of the models in terms of the cumulative reward, collision rate, and success rate. Finally, we compare the decision-making effectiveness of the trained models.
During the training process, we recorded the number of collisions leading to task failures, the number of successful task completions after all vehicles passed through the intersection, and the time taken to complete the task. We also recorded the cumulative rewards for each agent per round, as well as the total cumulative rewards for all vehicles.
The DDPG, MADDPG, and VN-MADDPG algorithms were trained, and relevant data were recorded in the same simulation environment settings. The network training parameters are shown in
Table 1.
The parameters of reinforcement learning algorithms significantly impact a model’s performance. The parameters mentioned in
Table 1 were determined after numerous experiments involving continuous trials and adjustments. Insufficient training episodes may lead to inadequate learning and poor performance, while excessive episodes might cause overfitting. An appropriate update frequency ensures more stable learning. A low initial noise value might result in insufficient exploration, leading to suboptimal learning, whereas a high final noise value could impede convergence.
The total number of training episodes was 10,000. We saved the cumulative rewards of each agent vehicle and calculated the total reward value every 20 rounds. The success rate and collision rate were saved every 100 rounds.
5.1. Cumulative Reward
A comparison of the global cumulative rewards obtained by the different algorithms during the training process is shown in
Figure 5. The horizontal axis represents the training episode, and the vertical axis represents the total cumulative reward. The figure shows that the DDPG and MADDPG algorithms exhibited similar growth trends in terms of the total cumulative rewards. Both algorithms eventually achieved relatively high total reward values but required longer training times.
In the same scenario and environmental settings, VN-MADDPG, which incorporates an importance sampling module and a variable-noise mechanism, exhibited higher efficiency in environmental exploration. The convergence efficiency of the decision-making model was considerably improved, and its overall performance surpassed that of DDPG and MADDPG.
The VN-MADDPG algorithm learned robust and stable decision-making strategies more quickly. The agent vehicles utilized environmental features more quickly and effectively after actively exploring the environment because of the importance sampling module and dynamically varying noise mechanism. After achieving a high total cumulative reward in training, the model remained very stable with minimal fluctuations in the reward values. The agent vehicles achieved higher reward values by collaborating with each other.
The average cumulative reward statistics obtained from the various decision-making algorithm models are shown in
Table 2.
Vehicles using the DDPG algorithm lacked cooperation, resulting in relatively poor decision-making performance. The learned policy of the agents tended to be self-centered, leading to a lower total cumulative reward. The decision-making performance of agent vehicles using the MADDPG algorithm was also not satisfactory. Although it promoted cooperation among vehicles, the resulting decision-making policy was not robust enough, and the training process was relatively slow.
5.2. Collision Rate
The VN-MADDPG algorithm demonstrated superior performance in reducing collision rates. As shown in
Figure 6, its final collision rate was reduced to around 3%.
The cooperation between vehicles and the importance sampling of the replay buffer enabled efficient strategy iteration. Vehicles in the scenario avoided collisions with each other and reached their destinations quickly.
Compared to the DDPG and MADDPG algorithms, our algorithm substantially reduced the occurrence of collisions between vehicles in the early stages of training. It explored more robust decision-making strategies more quickly.
The ANOVA test revealed a significant difference in collision rates among the three algorithms. The F value of 36.41646 with a p-value of less than 0.0001 suggests that the improvements in the collision rates are statistically significant.
5.3. Success Rate
As shown in
Figure 7, the trend of the changes in the success rate is similar to that in the cumulative reward. The DDPG and MADDPG algorithms exhibited a relatively slow convergence rate toward optimal policies. Their success rates in decision making fluctuated considerably, indicating instability. Additionally, the final policies derived from these algorithms were not particularly impressive.
The VN-MADDPG algorithm focused more on experiences crucial for learning strategies. As the noise dynamically decreased, the VN-MADDPG algorithm relied more on its policy network to make final decisions. Our algorithm enabled agent vehicles in the scenario to achieve a higher pass rate more quickly, maintaining a high success rate of around 97%.
The ANOVA test for success rates also demonstrated a significant difference among the algorithms. The F value of 37.6365 with a p-value of less than 0.0001 confirms that the improvements in success rates are statistically significant.
5.4. Validation of Trained Models
The trained models of the three decision-making algorithms were redeployed and tested in the same unsignalized intersection scenario.
As shown in
Table 3, our algorithm shows improvements in evaluation metrics such as the success rate, collision rate, and pass time.
The VN-MADDPG algorithm significantly enhances the utilization of experience samples, strengthening the ability of multi-agent deep reinforcement learning algorithms to cope with dynamically changing scenarios. It improves the learning speed and convergence efficiency of decision-making models for autonomous vehicles, and the decision-making strategies learned by the agent vehicles are more robust.
6. Conclusions
We propose the VN-MADDPG algorithm to address the challenges of instability and suboptimal decision making for autonomous vehicles at unsignalized intersections. Based on the MADDPG framework, VN-MADDPG includes a variable-noise mechanism and an importance sampling module to enhance stability and robustness. The variable-noise mechanism dynamically adjusts the level of exploration based on the training progress, promoting extensive exploration in the early stages and gradually shifting to reliance on learned strategies as training advances. The importance sampling module prioritizes impactful experiences, thereby improving learning efficiency and accelerating convergence. These enhancements collectively contribute to more reliable and effective decision making in complex and dynamic intersections.
We deployed VN-MADDPG in the CARLA simulation platform. We verified the effectiveness and superiority of our method by comparing it with the DDPG and MADDPG algorithms. Experimental results demonstrate that the VN-MADDPG algorithm effectively enhances the decision-making ability of autonomous vehicles at unsignalized intersections. The decision-making algorithm enables better adaptation to dynamic environments and unexpected situations. It improves the success rate and efficiency of autonomous vehicles passing through intersections.
In the future, we may consider factors such as reduced wheel–road adhesion in the simulation platform to more realistically simulate real-world conditions [39]. Exploring improvements in braking and turning stability under these conditions could further enhance the effectiveness of decision-making algorithms and is a promising research direction.