Article

Research on Behavioral Decision at an Unsignalized Roundabout for Automatic Driving Based on Proximal Policy Optimization Algorithm

1 College of Urban Rail Transit and Logistics, Beijing Union University, Beijing 100101, China
2 College of Robotics, Beijing Union University, Beijing 100101, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2889; https://doi.org/10.3390/app14072889
Submission received: 22 January 2024 / Revised: 16 March 2024 / Accepted: 19 March 2024 / Published: 29 March 2024

Abstract:
Unsignalized roundabouts have a significant impact on traffic flow and vehicle safety. To address the challenge of autonomous vehicles passing through roundabouts at low penetration rates, improve their efficiency, and ensure safety and stability, we propose an enhanced proximal policy optimization (PPO) algorithm for decision-making behavior. We develop an optimization-based behavioral choice model for autonomous vehicles that incorporates gap acceptance theory and deep reinforcement learning using the PPO algorithm. Additionally, we employ a CoordConv network to establish an aerial view for gathering spatial perception information. Furthermore, a dynamic multi-objective reward mechanism is introduced to maximize the PPO algorithm's reward function while quantifying environmental factors. Through simulation experiments, we demonstrate that our optimized PPO algorithm significantly improves training efficiency, increasing the reward value function by 2.85%, 7.17%, and 19.58% in scenarios with 20, 100, and 200 social vehicles, respectively, compared to the PPO+CCMR algorithm. The effectiveness of simulation training also increases by 11.1%, 13.8%, and 7.4%, and crossing time is reduced by 2.37%, 2.62%, and 13.96%. With the optimized PPO algorithm, path selection during simulation training improves, as autonomous vehicles increasingly tend to drive in the inner ring over time; however, the influence of social vehicles on path selection diminishes as their quantity increases. The safety of autonomous vehicles remains largely unaffected by our optimized PPO algorithm.

1. Introduction

With advances in information technology, networking, digitalization, and artificial intelligence (AI), research on autonomous vehicles has gained significant public attention, and it is anticipated that autonomous vehicles will gradually replace conventional vehicles. In addition to performing traditional vehicle functions, such as acceleration, deceleration, braking, and motion control, autonomous vehicles also exhibit AI-driven behaviors including environment perception, task planning, path planning, vehicle control, and intelligent obstacle avoidance [1,2,3,4]. The exploration and implementation of autonomous vehicles have the potential to alleviate road traffic congestion while reducing fuel consumption and carbon dioxide emissions, thereby enhancing overall road efficiency. Although autonomous vehicles offer numerous advantages across the various scenarios encountered during a journey, roundabouts still pose a substantial challenge [5].
Several studies have investigated the decision-making strategy of a target vehicle operating among social (human-driven) vehicles [6,7]. Shadi Samizadeh et al. focused on lane waiting time to address the issue of optimal lane selection and studied a method for selecting the best lane in an unsignalized three-lane roundabout [8]. Rasool Mohebifard et al. examined the impact of different autonomous vehicle penetration rates on roundabout operation and improved traffic operations under low-saturation and half-saturation flow [9]. Mehdi Naderi et al. proposed a comprehensive general method for controlling the lane-free paths of autonomous vehicles on large lane-free roundabouts [10]. Zhang introduced a layered control scheme based on an intelligent networked environment to enhance traffic efficiency at unsignalized roundabouts [11].
In summary, previous research on the roundabout scenario has primarily focused on assessing its overall efficiency and simulating autonomous vehicles in an ideal state, assuming a fully connected intelligent network [12]. However, there has been limited investigation into the decision-making problems faced by an individual autonomous vehicle at a roundabout. This study addresses this gap by considering an individual autonomous vehicle operating at a penetration rate approaching zero, meaning that all other vehicles are non-autonomous, while its safety and stability remain unaffected. We explore the optimization of decision-making for autonomous vehicles at unsignalized roundabouts and the selection of routes to enhance the single-vehicle passage efficiency through such intersections [13].
Autonomous driving decision-making encompasses both rule-based and data-driven strategies [14]. While rule-based approaches face challenges in addressing problems encountered by autonomous vehicles due to the complexity and dynamic nature of road conditions, data-driven strategies can optimize vehicle decision-making through the interaction between target action behaviors and the environment.
The proximal policy optimization (PPO) algorithm is a data-driven decision algorithm that plays an important role in decision-making research [15]. Qian et al. proposed a proximal policy optimization algorithm based on dual observation and matched rewards to study autonomous air-combat decision-making, effectively addressing the high information redundancy and slow convergence of traditional reinforcement-learning air-combat decision algorithms [16]. Building on the traditional PPO algorithm, She et al. introduced a long short-term memory (LSTM) network structure and used the resulting PPO-LSTM algorithm to solve online trajectory planning in the ascent phase of high-speed aircraft [17]. Chen et al. proposed an intersection signal phase and timing optimization method based on a hybrid proximal policy optimization algorithm to solve the signal control problem [18]. Zhao et al. proposed a multi-agent dynamic spectrum resource allocation scheme based on the proximal policy optimization algorithm, which improved the channel transmission rate and the success rate of vehicle information transmission [19]. The PPO algorithm is well suited to decision-making for vehicle control in continuous action spaces, and its training stability, robustness, and easily tuned parameters make it suitable for simulation research [20].
The present study proposes an approach to address the situation in which social vehicles dominate urban roads during the initial stage of autonomous vehicle deployment. Specifically, we introduce a coordinate convolution dynamic multi-objective reward mechanism into the proximal policy optimization (PPO) algorithm, which we select as the base algorithm for optimization. Our research focuses on enhancing the interaction between autonomous vehicles and their surrounding environment at intersections by incorporating gap acceptance theory [21]. To achieve this, we integrate a CoordConv (coordinate convolution) network into our algorithm model to capture and store low-dimensional environmental information. By optimizing the dynamic multi-objective reward mechanism model, we aim to enhance the safety, stability, and driving efficiency of autonomous vehicles. In an unsignalized roundabout scenario, the proposed approach enables effective interaction between autonomous vehicles and the other vehicles and traffic signs in the scene. We define a reward mechanism function for training, with the objective of establishing a decision model across various scenes and obtaining an optimal decision strategy for autonomous vehicles.

2. PPO Algorithm Model

2.1. PPO Optimization Algorithm

Deep reinforcement learning (DRL), an artificial intelligence method that combines deep learning and reinforcement learning, exhibits a cognitive resemblance to human decision-making processes. Deep learning, as a multi-layer neural network, possesses exceptional perceptual capabilities in machine learning.
Reinforcement learning involves the interaction between an agent and its environment, where the agent learns from feedback information in the form of reward values to optimize its strategy for maximizing rewards. By integrating deep learning’s ability for end-to-end nonlinear fitting with reinforcement learning’s problem-solving capacity, deep reinforcement learning algorithms are formed. The actor–critic algorithm framework is widely employed in these algorithms due to its combination of value function estimation and strategy search methods, effectively addressing both high bias and high variance issues associated with traditional approaches. This framework enables the design of agents capable of generating strategies while simultaneously evaluating their effectiveness through value functions. The utilization of deep networks within deep reinforcement learning facilitates the accurate fitting of output value functions and strategies over iterative updates towards optimal performance (Figure 1).
PPO is a policy optimization algorithm for reinforcement learning, proposed and published by OpenAI researchers, and is widely used in various reinforcement learning tasks. The PPO algorithm utilizes policy-gradient-based optimization to maximize the expected cumulative return of the policy function. In comparison with other policy optimization algorithms, PPO ensures improved stability and convergence speed through constrained optimization, thereby controlling the update amplitude within each iteration.
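For reference, the clipped surrogate objective that bounds PPO's update amplitude can be written out as a short function (a minimal sketch of the standard formulation, not the implementation used in this paper; the clipping range of 0.2 is the commonly used default rather than a value reported here):

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP of PPO.

    ratio:     probability ratio pi_theta(a|s) / pi_theta_old(a|s), one entry per sample
    advantage: estimated advantage A_t, one entry per sample
    clip_eps:  clipping range epsilon that limits how far the policy may move
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # The element-wise minimum keeps each policy update step bounded.
    return np.mean(np.minimum(unclipped, clipped))

# Example: ratios near 1 with mixed advantage estimates.
print(ppo_clipped_objective([0.9, 1.3, 1.05], [0.5, 2.0, -1.0]))
```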
The PPO algorithm is enhanced by incorporating the CoordConv network and a dynamic multi-objective reward mechanism. The CoordConv network establishes an aerial perspective of the vehicle to acquire spatial perception information, capturing low-dimensional details such as road conditions and surrounding vehicles. In autonomous driving simulations, data processing plays a crucial role; using an aerial view simplifies the data processing procedure, facilitating analysis and enhancing simulation efficiency and accuracy. The global overview provided by the aerial view enables the self-driving system to better comprehend the surrounding environment and obstacles while extracting features from lower-dimensional data. These feature data are stored, and the CoordConv network is employed to extract relevant information from the aerial view, such as the vehicle's own speed, the distance to preceding vehicles, and their speeds, all of which serve as inputs for decision-making in the algorithm [22,23]. This approach effectively reduces the dimensionality of the state feature representation within the algorithm.
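The idea of a coordinate convolution can be illustrated with a short layer definition (a minimal sketch using PyTorch; the channel counts and input size below are placeholders, not the network configuration used by the authors):

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Convolution preceded by concatenation of normalized x/y coordinate channels,
    so the filters can exploit absolute position within the aerial view."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # Two extra input channels hold the row and column coordinates.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, ys, xs], dim=1))

# Example: a 3-channel 128x128 bird's-eye-view image (sizes are illustrative).
bev = torch.zeros(1, 3, 128, 128)
features = CoordConv2d(3, 16, kernel_size=3, padding=1)(bev)
```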
The optimal reward function model is established based on a multi-objective reward mechanism to optimize the decision-making strategy of autonomous vehicles in unsignalized roundabout scenarios, which differ from normal roads. The multi-objective reward mechanism encompasses lane offset, lateral speed, longitudinal speed, vehicle collision, brake and throttle rewards, an inter-vehicle spacing reward, and a critical clearance reward to enhance the driving efficiency, safety, and stability of autonomous vehicles. Figure 2 illustrates the structural diagram of the PPO algorithm.
According to gap acceptance theory, an approaching vehicle can enter the roundabout when the available headway in the circulating flow exceeds a critical gap, and it must wait if the headway is less than this threshold. By calculating the maximum number of vehicles that can enter per unit of time, capacity calculation models based on gap acceptance theory accurately reflect the traffic flow characteristics of roundabouts.
The critical gap is a crucial research topic in autonomous vehicle navigation at roundabouts, as it involves determining the appropriate gap for the vehicle to enter. It refers to the minimum gap required for an entering vehicle to safely merge into the circulating traffic flow and serves as a decision criterion for vehicles approaching the roundabout entrance. Once an acceptable gap is identified, the vehicle accelerates to its desired speed and merges with the existing traffic while maintaining a safe following distance.
The time required for a vehicle to enter the roundabout is mainly divided into three parts: the first is the driver's reaction time t1; the second is the time t2 from moving the foot to pressing the accelerator pedal until the vehicle starts to move; and the third is the time t3 required for the vehicle to enter the circulating lane after it starts to move. When the vehicle enters the roundabout, it must also maintain a certain distance from the vehicle in front, so the inter-vehicle clearance t4 with respect to the preceding vehicle must be included; tc is the critical gap. The critical gap is therefore calculated as shown in Equation (1):
$$t_c = t_1 + t_2 + t_3 + t_4$$
The entry function is established according to gap acceptance theory and represents the number of vehicles allowed to pass for a given time headway. The entry function reflects the overall traffic efficiency of the unsignalized roundabout. The entry function g(t) is given by:
$$g(t) = \begin{cases} \dfrac{t - t_0}{t_f}, & t > t_0 \\ 0, & t \le t_0 \end{cases}$$
$$t_0 = t_c + \dfrac{t_f}{2}$$
Here, t0 is the minimum acceptable gap, tf is the follow-up vehicle spacing, and tc is the critical gap.
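For concreteness, the critical gap and entry function translate directly into code (a literal transcription of the formulas as reconstructed above, with illustrative timing values):

```python
def critical_gap(t1, t2, t3, t4):
    """Critical gap t_c: reaction time + pedal time + entry time + clearance to the leader."""
    return t1 + t2 + t3 + t4

def entry_function(t, t_c, t_f):
    """Entry function g(t): vehicles admitted for a circulating headway of t seconds.

    t_c is the critical gap, t_f the follow-up vehicle spacing, and t0 the
    minimum acceptable gap defined above.
    """
    t0 = t_c + t_f / 2.0
    return (t - t0) / t_f if t > t0 else 0.0

# Example with assumed timings (seconds): a 7 s gap admits one vehicle here.
tc = critical_gap(1.0, 0.5, 1.5, 1.0)
print(entry_function(7.0, tc, t_f=2.0))
```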
In the process of building and training deep reinforcement learning models, the decision model is continuously optimized. However, during the initial stage of training, the reward value remains close to zero, and repeatedly reusing poor historical strategies hinders improvements in training efficiency. Therefore, the experience pool of the PPO algorithm needs to be optimized. The standard PPO algorithm discards previously collected experiences after each policy update, as they become irrelevant to the new policy. One advantage of this is that it ensures consistency between the policy and the value function, enabling accurate estimation of current policy performance. To store and reuse the historical experiences needed for correction, the PPO algorithm can utilize an experience replay buffer.
In the research experiments conducted here, prioritized experience replay is employed, which assigns priority based on the value or strength of the data, specifically favoring data from completed experiments and high reward values. This approach allows valuable or advantageous data to be sampled more frequently, thereby enhancing learning efficiency while reducing collisions involving the autonomous vehicle.
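A simplified prioritized replay buffer of the kind described here might look as follows (an illustrative sketch; the paper does not give its buffer implementation, so the capacity, priority exponent, and sampling scheme are assumptions):

```python
import random

class PrioritizedReplayBuffer:
    """Prioritized experience replay: transitions from completed, high-reward
    episodes receive larger priorities and are sampled more often."""

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, transition, priority):
        if len(self.buffer) >= self.capacity:   # drop the oldest entry when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(max(priority, 1e-6) ** self.alpha)

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        return random.choices(self.buffer, weights=probs, k=batch_size)
```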

2.2. State Model

The simulation model of the unsignalized roundabout is utilized to establish a research experiment that encompasses the autonomous vehicle’s traversal through the unsignalized roundabout under various traffic flow conditions. The driving process of an autonomous vehicle involves both state space and action space.
The state space, acquired through the CoordConv network, incorporates essential information such as vehicle route details, driving direction, relative position, speed, and surrounding vehicle data. This provided state space is defined as follows:
$$A = \left( \tau, l, \varphi, v, d \right)$$
τ is the feature vector; l is the distance between the vehicle center point and the lane line; φ is the angle between the vehicle and the lane line; v is the current speed of the vehicle; and d is the distance to the vehicle in front. A is input as a vector to the policy and evaluation networks.
The action space includes lateral control and longitudinal control. Lateral control is governed by the steering value steer, which lies in [−1, 1], with right steering being positive. Longitudinal control is governed by the vehicle throttle and brake, combined into a single value acc, which lies in [−1, 1], with acceleration being positive. The action state is defined as:
$$B = \left( steer, acc \right)$$
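As an illustration of how the state and action spaces above might be represented in code (the field names and types are ours, not identifiers from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    """State vector A = (tau, l, phi, v, d) fed to the policy and evaluation networks."""
    tau: List[float]  # feature vector extracted by the CoordConv network
    l: float          # distance from the vehicle center to the lane line (m)
    phi: float        # angle between the vehicle and the lane line (rad)
    v: float          # current vehicle speed (m/s)
    d: float          # distance to the preceding vehicle (m)

@dataclass
class Action:
    """Action B = (steer, acc); both components lie in [-1, 1]."""
    steer: float      # positive values steer right
    acc: float        # positive values apply throttle, negative values brake

# Example: a gentle right turn while coasting.
a = Action(steer=0.1, acc=0.0)
```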

2.3. The Reward and Punishment Function of an Unsignalized Roundabout

Previous studies have improved the reward mechanism model of the PPO algorithm in deep reinforcement learning to enhance training efficiency and reduce meaningless simulations [24,25,26]. In our setup, an experiment is terminated when the vehicle has been stopped for 2 min due to congestion or an impassable road, after which the next training session starts. Additionally, when an obstacle causes the vehicle to stop for 30 s, the current run is ended and the next training session is initiated, increasing repetition and improving overall efficiency [27,28,29].
The reward function design is optimized, and the gap acceptance theory is considered to enhance the success rate of autonomous vehicles entering the roundabout and improve their efficiency. Furthermore, variables and characteristics of autonomous vehicles in terms of safety, stability, and driving efficiency are taken into account (see Table 1).
The PPO algorithm of deep reinforcement learning optimizes the training model of the autonomous vehicle and measures the quality of vehicle decisions in relation to the environment, which plays an important role in vehicle reinforcement learning training. The reward function model considers the evaluation indicators and the following vehicle state and environment variables: lane offset (rline, rout), vehicle lateral speed (rspeed_lat), longitudinal speed (rspeed_lon, rfast), vehicle collision (rcollision), brake (rbrake), throttle (rthrottle), inter-vehicle spacing (Sspacing), and roundabout critical clearance (ts). The constant c is set to −0.1. The reward function design requires the autonomous vehicle to drive through the roundabout smoothly and steadily along the lane to ensure the safety of the vehicle.
The initial configuration of the training model for autonomous vehicles, such as the maximum speed and acceleration range, is relatively broad, leading to unreasonable decisions regarding vehicle speed, throttle control, and braking during early-stage simulation training. As a result, the autonomous vehicle is prone to collisions or lane deviations, ultimately terminating the experiment and resulting in low efficiency in early-stage training. To enhance overall training efficiency, we propose a dynamic reward function model that considers the varying weights of longitudinal speed, throttle control, brake control, and distance from other vehicles.
Px denotes a dynamic coefficient whose value changes according to the value of the reward function. P1 is the dynamic coefficient of the longitudinal velocity term. When the reward value is near 0, the weight of the longitudinal velocity term is at its minimum; as the reward value increases, the weight of the longitudinal-speed reward increases. The initial value is 1 to ensure that the vehicle can start normally. P1 is a piecewise function:
$$P_1 = \begin{cases} \dfrac{2}{\pi} \arctan\dfrac{R}{10}, & \text{if } R > 0 \\ 1, & \text{if } R \le 0 \end{cases}$$
The initial value of R is 0; thereafter, R is the reward value obtained in the final result of the previous experiment.
P2 is a decreasing function that gives the dynamic coefficients of the brake, throttle, and inter-vehicle spacing terms. When the reward value is 0, the brake weight is at its largest; as the reward value increases, the weight decreases and approaches a final value of 1. This decreasing dynamic coefficient allows the parameters that matter most in the early stages of training to play as large a role as possible, speeding up training and improving the training success rate. P2 is given by:
$$P_2 = \dfrac{1}{e^{R/10}} + 1$$
After several rounds of experimental tests, the following reward mechanism is obtained:
$$R = k_1 r_{line} + k_2 r_{out} + k_3 r_{speed\_lat} + k_4 r_{speed\_lon} + P_1 k_5 r_{fast} + k_6 r_{collision} + P_2 \left( k_7 r_{brake} + k_8 r_{throttle} + k_9 S_{spacing} \right) + k_{10} t_s + c$$
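Putting the dynamic coefficients and the composite reward together, a sketch in code could read as follows (the coefficient values k1 to k10 and c follow the text; how P1 and P2 group the individual terms is our reading of the formula above, and the dictionary keys are illustrative):

```python
import math

def p1(R):
    """Dynamic weight of the longitudinal-speed term (piecewise arctan form above)."""
    return (2.0 / math.pi) * math.atan(R / 10.0) if R > 0 else 1.0

def p2(R):
    """Decreasing dynamic weight shared by the brake, throttle, and spacing terms."""
    return 1.0 / math.exp(R / 10.0) + 1.0

# Coefficients k1..k10 as stated in the text, plus the constant c = -0.1.
K = dict(line=1, out=5, speed_lat=0.2, speed_lon=1, fast=10,
         collision=200, brake=5, throttle=5, spacing=5, gap=30)
C = -0.1

def total_reward(r, R_prev):
    """Composite reward R; r maps each term name to its current value,
    and R_prev is the reward obtained in the previous experiment."""
    return (K["line"] * r["line"] + K["out"] * r["out"]
            + K["speed_lat"] * r["speed_lat"] + K["speed_lon"] * r["speed_lon"]
            + p1(R_prev) * K["fast"] * r["fast"]
            + K["collision"] * r["collision"]
            + p2(R_prev) * (K["brake"] * r["brake"] + K["throttle"] * r["throttle"]
                            + K["spacing"] * r["spacing"])
            + K["gap"] * r["gap"] + C)
```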
The research requires the vehicle to drive along the center line of the road: the offset at the lane center is 0, offsets to the left of the direction of travel are negative, and offsets to the right are positive, with the magnitude growing as the vehicle moves further from the center. The function $e^{-x^2} + 1$ is used as the reward function; its graph is axially symmetric about x = 0, takes its maximum value at x = 0, and decreases as the magnitude of the offset grows, so it penalizes lane offset. According to the importance of lane offset to the overall reward mechanism, rline is the corresponding penalty term, and its coefficient k1 is set to 1.
$$r_{line} = e^{-x^2} + 1$$
The lane offset has a maximum threshold lthr; when the offset exceeds lthr = 2 m, the vehicle has deviated too far and left the lane, which directly ends the training run. The choice of threshold matters: if it is too large, the vehicle is prone to collisions, which is not conducive to training the model; if it is too small, there is no room for fault tolerance, the learning cost is high, and it is difficult to obtain an effective training model. rout is the lane-departure penalty: when the vehicle deviates from the lane and exceeds the threshold lthr, the reward value is −1, and in other cases it is 0. The coefficient k2 is 5, which strongly penalizes lane departures.
$$r_{out} = \begin{cases} -1, & \text{if } l > l_{thr} \\ 0, & \text{otherwise} \end{cases}$$
To prevent the vehicle from driving unstably due to excessive lateral speed during the test, a penalty rspeed_lat is defined based on the steering value steer and the longitudinal speed vspeed_lon, with the coefficient k3 set to 0.2.
$$r_{speed\_lat} = -\left( steer \times v_{speed\_lon} \right)^2$$
The longitudinal speed reward is rspeed_lon: the higher the longitudinal speed vspeed_lon, the greater the reward value. The coefficient k4 is 1; setting this reward too high causes the vehicle to crash into the vehicle ahead, while setting it too low prevents the vehicle from moving forward.
$$r_{speed\_lon} = v_{speed\_lon}$$
The vehicle speed is given a threshold according to the road speed limit, and a penalty rfast is applied if the threshold is exceeded. Here, the threshold is set as vthr = 8 m/s: when the vehicle speed is greater than the threshold, rfast = −1; in other cases it is 0. The coefficient k5 is 10.
$$r_{fast} = \begin{cases} -1, & \text{if } v > v_{thr} \\ 0, & \text{otherwise} \end{cases}$$
When the self-driving vehicle is in motion, factors such as excessive speed, slow braking, and sudden lane changes by other vehicles can lead to collisions. During the training process, if a collision occurs, the current training run is terminated immediately and the next simulation is started. rcollision is the collision penalty, with a reward value of −1 if the vehicle collides and 0 in other cases. Collisions directly affect the safety of the autonomous vehicle; since the experiment requires that the vehicle does not collide, the coefficient k6 is given a large weight and is set to 200.
$$r_{collision} = \begin{cases} -1, & \text{if a collision occurs} \\ 0, & \text{otherwise} \end{cases}$$
When there is an obstacle ahead or the vehicle speed is greater than the threshold, the autonomous vehicle brakes and slows down. In this paper, the simulated environment perceives obstacles ahead within a distance threshold of dthr = 15 m. The braking reward value is rbrake: when there is an obstacle ahead or the speed of the vehicle is greater than the threshold, the braking behavior of the vehicle is rewarded. The coefficient k7 is 5.
$$r_{brake} = \begin{cases} brake, & \text{if } d < d_{thr} \text{ or } v > v_{thr} \\ 0, & \text{otherwise} \end{cases}$$
In this scenario, the maximum speed is 8 m/s (approximately 30 km/h). When the vehicle speed is lower than 8 m/s, acceleration is rewarded; when the vehicle speed is higher than 8 m/s, acceleration is penalized, as is accelerating after the vehicle has deviated from the lane. The throttle reward value is rthrottle, and the coefficient k8 is 5.
$$r_{throttle} = \begin{cases} -throttle, & \text{if } v > v_{thr} \text{ or accelerating after crossing the line} \\ 0, & \text{otherwise} \end{cases}$$
In this scenario, the maximum road speed is 30 km/h, and the minimum vehicle spacing is Sthr = 3 m. The research requires the self-driving vehicle to stay as close as possible to the minimum spacing from the vehicle in front, and an exponential function is used as the reward, as shown below. Its value is largest when the deviation is 0 and decreases as the deviation grows, so it rewards the self-driving vehicle for maintaining a relatively close distance to the vehicle in front. Sspacing is the spacing term, and its coefficient k9 is set to 5.
$$S_{spacing} = e^{-x^2} + 1$$
According to gap acceptance theory, the critical gap tc can be obtained. When the available gap at the roundabout is sufficient, i.e., when the entry function g(t) ≥ 1, the autonomous vehicle is allowed to enter the roundabout. When the time gap between two vehicles circulating in the roundabout is less than the critical gap, i.e., when the entry function g(t) < 1, the autonomous vehicle brakes. ts is the corresponding critical-gap term, and its coefficient k10 is set to 30. Gap acceptance is the most critical evaluation index for roundabout optimization: if the coefficient is set too large, the influence of other factors on vehicles entering the roundabout is not apparent; if it is set too small, the influence of gap acceptance theory on vehicles entering the roundabout cannot be reflected.
$$t_s = \begin{cases} brake, & \text{if } t < t_c \\ 0, & \text{otherwise} \end{cases}$$
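The individual reward terms above can be sketched as small helper functions (the thresholds follow the text, while the exponential forms and signs follow our reading of the published formulas, so treat them as illustrative):

```python
import math

def r_line(offset):
    """Lane-offset term: largest at zero offset, falling off symmetrically."""
    return math.exp(-offset ** 2) + 1.0

def r_out(offset, l_thr=2.0):
    """Lane-departure penalty: -1 once the offset exceeds the 2 m threshold."""
    return -1.0 if offset > l_thr else 0.0

def r_fast(v, v_thr=8.0):
    """Speeding penalty: -1 above the 8 m/s limit."""
    return -1.0 if v > v_thr else 0.0

def r_brake(brake, d, v, d_thr=15.0, v_thr=8.0):
    """Reward braking when an obstacle is within 15 m or the speed limit is exceeded."""
    return brake if (d < d_thr or v > v_thr) else 0.0

def t_s(brake, gap, t_c):
    """Critical-gap term: reward braking when the available gap is below t_c."""
    return brake if gap < t_c else 0.0
```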

3. Results and Discussion

3.1. Experimental Environment

In this study, the proximal policy optimization (PPO) algorithm is employed to optimize the decision-making behavior of autonomous vehicles at unsignalized roundabouts. The simulation environment consists of a 64-bit Ubuntu 20.04 operating system, Carla 0.9.6 simulation platform, and Roadrunner R2020b software for creating autonomous driving scenarios.
The simulated training scenarios encompass urban roads, including unsignalized roundabouts and signalized intersections. Specifically, the decision scenario involves a straight road segment of 100 meters in both front and rear directions, as well as an unsignalized roundabout with a diameter of 20 meters. The simulated scene includes dual lanes for the entrance lane, the inner lane of the roundabout, and the exit lane. At the entrance of the unsignalized roundabout, appropriate measures such as speed reduction signs, speed limits, and traffic signs indicating a roundabout should be implemented; moreover, sensors on autonomous vehicles should respond accordingly upon detecting these relevant traffic signs [30]. The designed scene aims to evaluate various vehicle action decisions such as turning movements, avoidance maneuvers, and acceleration/deceleration behaviors during following or overtaking other vehicles.
The experimental groups utilized the proximal policy optimization (PPO) algorithm, the coordinate convolution multi-reward function under PPO (PPO+CCMR), and the optimized version of the PPO algorithm proposed here. It is important to note that the deep reinforcement learning PPO algorithm requires a relatively long time to establish the automatic driving model. Therefore, to ensure a fair comparison, all other factors were kept constant throughout the experiment. Furthermore, because optimizing the training efficiency of deep reinforcement learning is crucial, each of the three experimental groups was given equal training time for an accurate comparison.
The independent variables in our experiments included traffic flow at unsignalized roundabouts with four different conditions as follows: (1) a scenario with 20 social vehicles where there are minimal other vehicles present; (2) a scenario with 100 social vehicles causing interference at unsignalized roundabouts; (3) a scenario with 200 social vehicles resulting in increased interference at unsignalized roundabouts without signals; and finally, (4) a scenario with heavy traffic congestion at unsignalized roundabouts due to the presence of 300 social vehicles.
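As an illustration of how one of these traffic-flow conditions might be set up on the CARLA platform used here (a hedged sketch against the CARLA 0.9.x Python API; the map, blueprint choices, and spawning strategy are assumptions rather than the authors' actual scenario script):

```python
import random
import carla

def spawn_social_vehicles(world, n):
    """Spawn n autopilot-driven social vehicles at free spawn points."""
    blueprints = world.get_blueprint_library().filter("vehicle.*")
    spawn_points = world.get_map().get_spawn_points()
    random.shuffle(spawn_points)
    vehicles = []
    for transform in spawn_points[:n]:
        actor = world.try_spawn_actor(random.choice(blueprints), transform)
        if actor is not None:            # a spawn point may already be occupied
            actor.set_autopilot(True)    # hand control to the simulator's autopilot
            vehicles.append(actor)
    return vehicles

# Example (assumes a running CARLA server on the default port):
# client = carla.Client("localhost", 2000)
# social_vehicles = spawn_social_vehicles(client.get_world(), 100)
```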
Figure 3 illustrates the program execution process.
The stability evaluation criteria for autonomous vehicles encompass attitude stability, lateral and longitudinal stability, driving safety, and obstacle avoidance capability. In the simulation results, the one-step reward value function is employed as the evaluation metric to compare the one-step reward function curves in identical environments. A higher value indicates better stability.
During the simulation experiment, the driving efficiency of autonomous vehicles is evaluated based on their travel time. The simulation scenarios include linear roads and unsignalized roundabouts where travel time is recorded to calculate time reduction efficiency that directly reflects traffic efficiency.
The safety evaluation of autonomous vehicles encompasses vehicle control accuracy, sensor performance, road adaptability, emergency handling capability, and compliance with traffic regulations. In the simulation experiment, apart from the decision-making algorithm of the autonomous vehicle, all other vehicles exhibit identical performance characteristics. Additionally, the perception system comprising cameras and lidar is configured uniformly. The simulation environment is trained using urban roads and unsignalized roundabouts. The safety gap in autonomous vehicles primarily manifests in the control accuracy of their decision-making algorithms and their ability to handle emergency situations effectively. Success rate serves as a benchmark for comparing algorithms in simulation experiments.

3.2. Simulation and Analysis

The simulation scenario includes a total of 20 social vehicles, and the automatic driving vehicle obtained simulation data after undergoing 600,000 steps of training in an unsignalized roundabout.
The results are presented in Figure 4, where it can be observed that the reward value function of the optimized PPO algorithm stabilizes around 500,000 steps, while the reward value function of the PPO+CCMR algorithm stabilizes at approximately 540,000 steps. Notably, compared to the PPO+CCMR algorithm, there is a significant increase in reward value of approximately 2.8%, indicating an improvement in simulation training efficiency by about 11.1% for the optimized PPO algorithm.
As depicted in Figure 5, the average vehicle speed under the optimized PPO algorithm increases gradually with the simulation step size, surpassing both the PPO+CCMR curve and the standard PPO algorithm. The initial average training speed of the optimized PPO algorithm is relatively low but improves progressively during simulation training, demonstrating smoother and more stable performance compared to the PPO+CCMR algorithm.
The simulation scenario includes 100 social vehicles, and the automatic driving vehicle obtained simulation data after 600,000 training steps in an unsignalized roundabout.
As depicted in Figure 6, the reward value function of the optimized PPO algorithm tends to stabilize at around 500,000 steps, while the reward value function of the PPO+CCMR algorithm stabilizes at approximately 560,000 steps. Compared to the PPO+CCMR algorithm, there is a significant increase of 7.17% in the reward value achieved by utilizing only the optimized PPO algorithm. Moreover, this improvement results in a remarkable enhancement of training efficiency by approximately 13.8%.
The average vehicle speeds of the optimized PPO algorithm and the PPO+CCMR algorithm are close to each other, yet both are significantly higher than that of the conventional PPO algorithm, as depicted in Figure 7. Moreover, the average speed growth curve of the optimized PPO algorithm exhibits enhanced smoothness and stability compared to that of the PPO+CCMR algorithm.
The simulation scenario involves 200 social vehicles, and the automatic driving vehicle obtained simulation data after undergoing 600,000 steps of training in an unsignalized roundabout.
As depicted in Figure 8, within this environment of 200 social vehicles, autonomous vehicles are influenced by their own driving status as well as surrounding vehicles. The average reward value achieved by autonomous vehicles utilizing the optimized PPO algorithm surpasses that of the PPO+CCMR algorithm in an unsignalized roundabout. Moreover, decision-making performed by the optimized PPO algorithm is more rational and yields higher rewards. The reward value function of the optimized PPO algorithm tends to stabilize at approximately 540,000 steps, while for the PPO+CCMR algorithm it stabilizes around 560,000 steps. Compared to the PPO+CCMR algorithm, there is a significant increase of 19.58% in the reward value when employing the optimized PPO algorithm; additionally, simulation training efficiency improves by approximately 7.4%.
The initial average speed of the optimized PPO algorithm is observed to be lower than that of the PPO+CCMR algorithm, as depicted in Figure 9. However, with the progressive increment of the speed weight during training, a steady rise in average speed is witnessed, leading to continuous enhancement in autonomous vehicles’ efficiency while navigating unsignalized roundabouts. Upon reaching 600,000 steps, it becomes evident that the average speed surpasses that achieved by training under the PPO+CCMR algorithm. Consequently, self-driving vehicles exhibit a more significant increase in their average speeds.
To compare the passing efficiency of autonomous vehicles, our focus lies on two specific periods: when the total step length of the training model reaches 300,000 and 600,000. We then proceed to analyze and compare the average passing time in various simulation scenarios, as presented in Table 2.
The plain PPO algorithm is excluded from the average passing time comparison because it exhibits poor training performance at unsignalized roundabouts, completes few successful simulation runs, and therefore provides no reliable reference for average passing time. As shown in Table 2, with 300,000 training steps, the optimized PPO algorithm reduces the passing time by 2.32%, 2.61%, and 12.67% compared to the PPO+CCMR algorithm as the number of social vehicles in the scene increases. With 600,000 training steps, the optimized PPO algorithm reduces the passing time by 2.37%, 2.62%, and 13.96% compared to the PPO+CCMR algorithm under the same conditions.
The traffic efficiency gain achieved by the optimized PPO algorithm is positively correlated with both the simulation step length and the vehicle density in the scenario. At the unsignalized roundabout, autonomous vehicles may navigate through the inner loop lane or the outer loop lane, and the resulting route selection is summarized in Table 3.
As presented in Table 3, when employing an optimized PPO algorithm for the simulation training of autonomous vehicles, the average passing time of inner ring vehicles is observed to be lower than that of outer ring vehicles. Moreover, as the training duration increases, autonomous vehicles tend to prefer traveling within the inner ring. The path selection of autonomous driving vehicles in unsignalized roundabout scenarios is influenced by the number of social vehicles present, with this influence becoming less pronounced as the number of social vehicles increases.
To assess the safety performance of autonomous vehicles, particular attention is given to two specific periods during which the total step size reaches 300,000 and 600,000 in the training model. Around these periods, data from 20 experimental groups are collected to determine whether or not successful simulations were achieved.
The primary failure modes observed in the simulation experiments were vehicle collisions and lane departures; in the 20-vehicle scenario failures mainly involved lane departures, while in the 200-vehicle scenario they mainly involved vehicle collisions. Table 4 presents the success rates of these simulation experiments. For both the PPO+CCMR algorithm and the optimized PPO algorithm, the success rates ranged between 35% and 45% after 300,000 training steps and between 65% and 80% after 600,000 training steps. For both of these algorithms, the success rate of the simulation experiments decreased as the number of vehicles in the scene increased, and this decrease was significantly larger than that observed with the plain PPO algorithm. In scenarios with the same number of vehicles, increasing the number of training steps improved the simulation success rate and thus enhanced autonomous vehicle safety.
The simulation experiment demonstrates that the optimized PPO algorithm effectively enhances the efficiency of simulation training. However, in the case of 20 and 100 vehicles, the optimization algorithm exhibits marginal improvements in terms of driving efficiency, stability, and safety for autonomous vehicles at unsignalized roundabouts. Conversely, when confronted with a scenario involving 200 vehicles, the optimization algorithm significantly enhances both stability and driving efficiency for autonomous vehicles entering unsignalized roundabouts, albeit with limited safety enhancements.

4. Conclusions

This paper investigates the decision-making process of autonomous driving in unsignalized roundabouts and proposes a proximal policy optimization (PPO) learning method based on the PPO algorithm, which is trained using the Carla simulation environment. The proposed algorithmic framework utilizes an aerial view and an automatic driving perception system to acquire road environment information, while the CoordConv network extracts and integrates feature data into the decision-making network. A dynamic multi-objective reward mechanism and dynamic behavioral decision-making are employed to adjust the weight of autonomous vehicle states during training, enabling dynamic adjustment of important indices at different stages.
The results demonstrate that, when there are fewer vehicles present, the algorithm trained with this approach exhibits no significant improvement over the PPO+CCMR algorithm, although it still outperforms the plain PPO algorithm. As the number of other vehicles on the road increases, both the dynamic multi-objective reward mechanism and the unsignalized roundabout decision-making optimization module of the improved algorithm exhibit noticeable effects, leading to better path-selection decisions and effectively enhancing training efficiency as well as vehicle stability and safety. In scenarios involving 20, 100, and 200 social vehicles, respectively, compared with the PPO+CCMR algorithm, our optimized PPO algorithm improves the reward value function by 2.85%, 7.17%, and 19.58%; simulation training efficiency improves by 11.1%, 13.8%, and 7.4%; and traversal time is reduced by 2.37%, 2.62%, and 13.96%.
While the proposed algorithm has achieved promising results in unsignalized roundabout experiments, future research could combine it with intelligent transportation systems and consider actual road conditions and increasingly complex road environments.

Author Contributions

Conceptualization, J.G.; methodology, J.G.; software, J.G.; validation, J.G. and J.Z.; formal analysis, J.G.; investigation, J.G.; resources, J.Z. and Y.L.; data curation, J.G.; writing—original draft preparation, J.G.; writing—review and editing, J.G., J.Z. and Y.L.; visualization, J.G.; supervision, J.Z. and Y.L.; project administration, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Key R&D Program under Grant 2021YFC3001300, in part by the National Natural Science Foundation of China Grant 62371013, in part by the National Natural Science Foundation of China Key Project Collaboration Grant 61931012, and in part by the Academic Research Projects of Beijing Union University under Grant ZK10202208.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Samizadeh, S.; Nikoofard, A.; Yektamoghadam, H. Decision Making for Autonomous Vehicles’ Strategy in Triple-Lane Roundabout Intersections. In Proceedings of the 2022 8th International Conference on Control, Instrumentation and Automation (ICCIA), Tehran, Iran, 2–3 March 2022; pp. 1–6. [Google Scholar]
  2. Mohebifard, R.; Hajbabaie, A. Effects of Automated Vehicles on Traffic Operations at Roundabouts. In Proceedings of the IEEE International Conference on Intelligent Transportation Systems, Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
  3. Naderi, M.; Papageorgiou, M.; Karafyllis, I.; Papamichail, I. Automated vehicle driving on large lane-free roundabouts. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 1528–1535. [Google Scholar]
  4. Zhang, Y.; Zhang, J.; Dong, B. An optimal management scheme for connected vehicles merging at a roundabout. In Proceedings of the 2022 6th CAA International Conference on Vehicular Control and Intelligence (CVCI), Nanjing, China, 28–30 October 2022; pp. 1–6. [Google Scholar]
  5. Qian, D.; Qi, H.; Liu, Z.; Zhou, Z.; Yi, J. Research on Autonomous Decision-Making in Air-Combat Based on Improved Proximal Policy Optimization. J. Syst. Simul. 2023, 1–11. [Google Scholar] [CrossRef]
  6. Yu, Z.; Zhu, T.; Liu, W. Rapid Trajectory Programming for Hypersonic Unmanned Aerial Vehicle in Ascent Phase Based on Proximal Policy Optimization. J. Jilin Univ. (Eng. Technol. Ed.) 2023, 53, 863–870. [Google Scholar] [CrossRef]
  7. Chen, X.; Zhu, Y.; Lv, C. Signal Phase and Timing Optimization Method for Intersection Based on Hybrid Proximal Policy Optimization. J. Transp. Syst. Eng. Inf. Technol. 2023, 23, 106–113. [Google Scholar] [CrossRef]
  8. Zhao, J.; Hu, X.; Du, X. Spectrum Resource Allocation of Vehicle Edge Network Based on Proximal Policy Optimization Algorithm. Front. Data Comput. 2022, 4, 142–155. [Google Scholar]
  9. Jia, H.; Li, B. Calculation of Traffic Capacity at Signalized Roundabouts Based on Gap Acceptance Theory. J. Transp. Inf. Saf. 2018, 36, 64–71. [Google Scholar]
  10. Liu, C.; Liu, Y.; Luo, X. Trajectory Optimization of Connected Vehicles at Isolated Intersection in Mixed Traffic Environment. J. Transp. Syst. Eng. Inf. Technol. 2022, 22, 154–162. [Google Scholar] [CrossRef]
  11. Zhang, J.; Hu, S.; Jin, H. Modeling of Traffic Flow Velocity Control Strategy for Human-machine Mixed Driving at Signalized Intersections. J. Syst. Simul. 2022, 34, 1697–1709. [Google Scholar] [CrossRef]
  12. Liu, Q.; Pan, M. Research on Intersection Capacity Considering the Stability of Autonomous Vehicles. Highway 2021, 66, 240–247. [Google Scholar]
  13. Wang, S.; Wan, Q. Right-turn Driving Decisions of Autonomous Vehicles at Signal-free Intersections. Appl. Res. Comput. 2022, 1–6. [Google Scholar] [CrossRef]
  14. Chen, Z.; Luo, L. Speed Trajectory Optimization of Connected Autonomous Vehicles at Signalized Intersections. J. Transp. Inf. Saf. 2021, 39, 92–98+156. [Google Scholar]
  15. Wu, W.; Liu, Y.; Liu, W.; Wu, G.; Ma, W. A Novel Autonomous Vehicle Trajectory Planning and Control Model for Connected-and-Autonomous Intersections. Acta Autom. Sin. 2020, 46, 1971–1985. [Google Scholar] [CrossRef]
  16. Lu, Y.; Xu, X.; Ding, C.; Lu, G. Connected Autonomous Vehicle Speed Control at Successive Signalized Intersections. J. Beijing Univ. Aeronaut. Astronaut. 2018, 44, 2257–2266. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Gao, B.; Guo, L.; Guo, H.; Chen, H. Adaptive decision-making for automated vehicles under roundabout scenarios using optimization embedded reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 5526–5538. [Google Scholar] [CrossRef] [PubMed]
  18. Hang, P.; Huang, C.; Hu, Z.; Xing, Y.; Lv, C. Decision making of connected automated vehicles at an unsignalized roundabout considering personalized driving behaviours. IEEE Trans. Veh. Technol. 2021, 70, 4051–4064. [Google Scholar] [CrossRef]
  19. García Cuenca, L.; Puertas, E.; Fernandez Andrés, J.; Aliane, N. Autonomous driving in roundabout maneuvers using reinforcement learning with Q-learning. Electronics 2019, 8, 1536. [Google Scholar] [CrossRef]
  20. Zheng, R.; Liu, C.; Guo, Q. A decision–making method for autonomous vehicles based on simulation and reinforcement learning. In Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, Tianjin, China, 14–17 July 2013. [Google Scholar]
  21. Gao, Z.; Sun, T.; Xiao, H. Decision–making method for vehicle longitudinal automatic driving based on reinforcement Q–learning. Int. J. Adv. Robot. Syst. 2019, 16, 141–172. [Google Scholar] [CrossRef]
  22. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human–level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  23. Chae, H.; Kang, C.M.; Kim, B.; Kim, J.; Chung, C.C.; Choi, J.W. Autonomous braking system via deep reinforcement learning. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017. [Google Scholar]
  24. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  25. Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. arXiv 2017, arXiv:1704.02532. [Google Scholar] [CrossRef]
  26. Kachroo, P.; Li, Z. Vehicle merging control design for an automated highway system. In Proceedings of the IEEE Proceedings of Conference on Intelligent Transportation Systems, Boston, MA, USA, 12 November 1997; pp. 224–229.
  27. Awal, T.; Kulik, L.; Ramamohanrao, K. Optimal traffic merging strategy for communication-and sensor-enabled vehicle. In Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems, Hague, The Netherlands, 6–9 October 2013; pp. 1468–1474. [Google Scholar]
  28. Uno, A.; Sakaguchi, T.; Tsugawa, S. A merging control algorithm based on inter-vehicle communication. In Proceedings of the 1999 IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, Tokyo, Japan, 5–8 October 1999; pp. 783–787. [Google Scholar]
  29. Waddell, E. Evolution of Roundabout Technology: A History-Based Literature Review. In Proceedings of the Institute of Transportation Engineers 67th Annual Meeting Compendium of Technical Papers, Boston, MA, USA, 3–7 August 1997; pp. 89–97. [Google Scholar]
  30. Thai Van, M.J.; Balmefrezol, P. The Design of Roundabout in France: Historical context and State of the Art. Transp. Res. Rec. 2000, 1737, 92–97. [Google Scholar] [CrossRef]
Figure 1. Actor–critic algorithm framework.
Figure 2. Optimization algorithm structure diagram.
Figure 3. Simulator running diagram.
Figure 4. The average reward value of a single step in a 20 vehicle scene.
Figure 5. The average speed value graph in a 20 vehicle scene.
Figure 6. The average reward value of a single step in a 100 vehicle scene.
Figure 7. The average speed value graph in a 100 vehicle scene.
Figure 8. The average reward value of a single step in a 200 vehicle scene.
Figure 9. The average speed value graph in a 200 vehicle scene.
Table 1. Driving state evaluation index.

Category | Variable | Characteristics
Security | Lateral offset | Left–right offset, offset range
Security | Lane change time | Average time
Security | Vehicle collision | Collision or not
Security | Vehicle spacing | Minimum spacing
Comfort level | Lateral acceleration | Maximum value, average value
Comfort level | Longitudinal acceleration | Maximum value, average value
Comfort level | Vehicle deviation angle | Offset angle and offset range
Driving efficiency | Speed | Maximum value, average value
Driving efficiency | Vehicle acceleration | Maximum value, average value
Driving efficiency | Transit time | Average value
Table 2. Simulation average pass time.

Training step size | 300,000 | 300,000 | 300,000 | 600,000 | 600,000 | 600,000
Scene (social vehicles) | 20 | 100 | 200 | 20 | 100 | 200
Average transit time (s), PPO+CCMR algorithm | 34.42 | 34.86 | 41.81 | 31.61 | 31.97 | 39.61
Average transit time (s), optimized PPO algorithm | 33.62 | 33.95 | 36.51 | 30.86 | 31.13 | 34.08
Transit time reduction | 2.32% | 2.61% | 12.67% | 2.37% | 2.62% | 13.96%
Table 3. The inner and outer loop traffic schedule.

Training step size | 300,000 | 300,000 | 300,000 | 600,000 | 600,000 | 600,000
Scene (social vehicles) | 20 | 100 | 200 | 20 | 100 | 200
Traffic ratio (inner:outer) | 5:5 | 5:5 | 6:4 | 9:1 | 9:1 | 8:2
Average time (s), inner | 31.30 | 31.55 | 35.15 | 30.48 | 30.74 | 33.52
Average time (s), outer | 35.94 | 36.35 | 38.55 | 34.26 | 34.62 | 36.31
Table 4. Simulation success rate table.

Total training step size | 300,000 | 300,000 | 300,000 | 600,000 | 600,000 | 600,000
Scene (social vehicles) | 20 | 100 | 200 | 20 | 100 | 200
PPO algorithm, successes | 2 | 2 | 1 | 4 | 4 | 3
PPO algorithm, failures | 18 | 18 | 19 | 16 | 16 | 17
PPO algorithm, success rate | 10% | 10% | 5% | 20% | 20% | 15%
PPO+CCMR algorithm, successes | 9 | 9 | 7 | 16 | 15 | 13
PPO+CCMR algorithm, failures | 11 | 11 | 13 | 4 | 5 | 7
PPO+CCMR algorithm, success rate | 45% | 45% | 35% | 80% | 75% | 65%
Optimized PPO algorithm, successes | 9 | 9 | 8 | 16 | 15 | 14
Optimized PPO algorithm, failures | 11 | 11 | 12 | 4 | 5 | 6
Optimized PPO algorithm, success rate | 45% | 45% | 40% | 80% | 75% | 70%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
