Abstract
Most of the current studies on autonomous vehicle decision-making and control based on reinforcement learning are conducted in simulated environments. The training and testing of these studies are carried out under the condition of rule-based microscopic traffic flow, with little consideration regarding migrating them to real or near-real environments. This may lead to performance degradation when the trained model is tested in more realistic traffic scenes. In this study, we propose a method to randomize the driving behavior of surrounding vehicles by randomizing certain parameters of the car-following and lane-changing models of rule-based microscopic traffic flow. We trained policies with deep reinforcement learning algorithms under the domain-randomized rule-based microscopic traffic flow in freeway and merging scenes and then tested them separately in rule-based and high-fidelity microscopic traffic flows. The results indicate that the policies trained under domain-randomized traffic flow have significantly better success rates and episodic rewards compared to those trained under non-randomized traffic flow.
1. Introduction
In recent years, autonomous vehicles have received increasing attention as they have the potential to free drivers from the fatigue of driving and facilitate efficient road traffic [1]. With the development of machine learning, autonomous vehicle technology has progressed rapidly. In particular, reinforcement learning enables vehicles to learn driving tasks through trial and error and to continuously improve the learned policies. Compared to supervised learning, reinforcement learning does not require the manual labeling or supervision of sample data [2,3,4,5]. However, reinforcement learning models require tens of thousands of trial-and-error iterations for policy learning, and real vehicles on the road can hardly withstand so many trials. Therefore, current mainstream research on autonomous driving with reinforcement learning focuses on training in virtual driving simulators.
Lin et al. [6] utilized deep reinforcement learning within the driving simulator Simulation of Urban Mobility (SUMO) to train autonomous vehicles, enabling them to merge safely and smoothly at on-ramps. Peng et al. [7] also employed deep reinforcement learning algorithms within SUMO to train a model for lane changing and car following. They tested the model by reconstructing scenes from NGSIM data, and the results indicate that the reinforcement learning-based models are more effective than rule-based approaches. Mirchevska et al. [8] used fitted Q-learning for high-level decision-making on a busy simulated highway. However, the microscopic traffic flows in these studies are generated by rule-based models, such as the Intelligent Driver Model (IDM) [9,10,11] and the Minimizing Overall Braking Induced by Lane changes (MOBIL) model. These are mathematical models based on traffic flow theory [12]; they tend to simplify vehicle motion behavior and do not consider the interaction of multiple vehicles. Autonomous vehicles trained with reinforcement learning in such microscopic traffic flows may perform exceptionally well when tested in the same environments. However, when the trained models are applied to more realistic or real-world traffic flows, their performance may deteriorate significantly, and they could even cause traffic accidents due to the discrepancies between simulated and real-world traffic flows.
For research on sim-to-real transfer, numerous methods have been proposed to date. For instance, robust reinforcement learning has been explored to develop strategies that account for the mismatch between simulated and real-world scenes [13]. Meta-learning is another approach that seeks to learn adaptability to potential test tasks from multiple training tasks [14]. Additionally, the domain randomization method used in this article is acknowledged as one of the most extensively used techniques to improve the adaptability to real-world scenes [15]. Domain randomization relies on randomized parameters aimed at encompassing the true distribution of real-world data. Sheckells et al. [16] applied domain randomization to vehicle dynamics, using stochastic dynamic models to optimize the control strategies for vehicles maneuvering on elliptical tracks. Real-world experiments indicated that the strategy was able to maintain performance levels similar to those achieved in simulations. However, few studies have applied domain randomization to microscopic traffic flows and investigated its efficacy.
In recent years, many driving simulators have been moving towards more realistic scenes. One type includes data-based driving simulators (InterSim [17] and TrafficGen [18]), which train neural network models by extracting vehicle motion characteristics from real-world traffic datasets, resulting in interactive microscopic traffic flows. However, the simulation time is much longer than for most rule-based driving simulators due to the complexity of the models. The other kind includes theory-based interactive traffic simulators, which can generate long-term interactive high-fidelity traffic flows by combining multiple modules (LimSim [19]). The traffic flow generated by LimSim closely resembles an actual dataset with a normal distribution, sharing similar means and standard deviations [20].
This paper proposes a domain randomization method for rule-based microscopic traffic flows for reinforcement learning-based decision and control. The parameters of the car-following and lane-changing models are randomized with Gaussian distributions, making the microscopic traffic flows more random and behaviorally uncertain, thus exposing the agent to a more complex and variable driving environment during training. To investigate the impact of domain randomization, this paper will train and test agents using microscopic traffic flow without randomization, high-fidelity microscopic traffic flow, and domain-randomized traffic flow for freeway and merging scenes.
The rest of this paper is structured as follows: Section 2 introduces the relevant microscopic traffic flows. Section 3 describes the proposed domain randomization method. Section 4 presents the simulation experiments and the analysis of the results for the freeway and merging scenes. Finally, the conclusions are drawn in Section 5.
2. Microscopic Traffic Flow
Microscopic traffic flow models take individual vehicles as the research subject and mathematically describe the driving behaviors of the vehicles, such as acceleration, overtaking, and lane changing.
2.1. Rule-Based Microscopic Traffic Flow
This paper utilizes IDM and SL2015 as the default car-following and lane-changing models, respectively. The following is a detailed introduction to them.
2.1.1. IDM Car-Following Model
IDM was originally proposed by Treiber et al. [9] and can describe various traffic states, from free flow to complete congestion, with a unified formulation. The model takes the preceding vehicle’s speed, the ego vehicle’s speed, and the distance to the preceding vehicle as inputs and outputs a safe acceleration for the ego vehicle. The acceleration of the ego vehicle at each timestep is
$$ \dot{v} = a \left[ 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{s^*(v, \Delta v)}{s} \right)^{2} \right] $$
where $a$ represents the maximum acceleration of the ego vehicle, $v$ is the current speed of the ego vehicle, $v_0$ is the desired speed of the ego vehicle, $\delta$ is the acceleration exponent, $\Delta v$ is the speed difference between the ego vehicle and the preceding vehicle, $s$ is the current distance between the ego vehicle and the preceding vehicle, and $s^*$ is the desired following distance. The desired distance is defined as follows:
$$ s^*(v, \Delta v) = s_0 + vT + \frac{v \, \Delta v}{2\sqrt{ab}} $$
where $s_0$ is the minimum gap, $T$ is the bumper-to-bumper time gap, and $b$ represents the maximum deceleration.
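For reference, the following minimal Python sketch implements the IDM acceleration update as described above; the default parameter values (desired speed, time gap, etc.) are illustrative placeholders rather than the values used in the SUMO configuration.

```python
import math

def idm_acceleration(v, v_lead, s, a_max=2.0, b=3.0, v0=33.3,
                     s0=2.0, T=1.5, delta=4.0):
    """IDM acceleration of the ego vehicle.

    v      : current ego speed [m/s]
    v_lead : speed of the preceding vehicle [m/s]
    s      : bumper-to-bumper gap to the preceding vehicle [m]
    The default parameter values are illustrative only.
    """
    dv = v - v_lead                                             # approaching rate
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))   # desired gap s*
    return a_max * (1 - (v / v0) ** delta - (s_star / s) ** 2)

# Example: ego at 25 m/s, leader at 20 m/s, 30 m gap
print(idm_acceleration(25.0, 20.0, 30.0))
```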
2.1.2. SL2015 Lane-Changing Model
The safety distance required for the lane-changing process is calculated as follows:
where denotes the safety distance required for lane changing, represents the velocity of the vehicle at time t, is the length of the vehicle, and are safety factors, and the threshold speed differentiates between urban roads and highways.
The profit at time t for changing lanes is calculated as follows:
where is the velocity of the vehicle in the target lane at the next timestep, is the safe velocity in the current lane, and is the maximum velocity allowed in the current lane. The goal here is to maximize the velocity difference, thereby increasing the benefit of changing lanes.
If the profit for the current timestep is greater than zero, then this profit will be added to the cumulative profit. Conversely, if the profit for the current timestep is less than zero, the cumulative profit will be halved to moderate the desire to change to the target lane. If the cumulative profit is larger than a threshold, lane change can be initiated.
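As an illustration of the accumulation rule described above (not SUMO's actual SL2015 implementation), the following sketch shows how a per-step lane-change profit could be folded into a cumulative value and compared against a threshold; the threshold value is a hypothetical placeholder.

```python
def update_lane_change_decision(cumulative_profit, step_profit, threshold=1.0):
    """Accumulate per-step lane-change profit and decide whether to change lanes.

    Positive step profits are added to the cumulative profit; a negative step
    profit halves the accumulated value, damping the desire to change to the
    target lane. A lane change is initiated once the threshold is exceeded.
    """
    if step_profit > 0:
        cumulative_profit += step_profit
    else:
        cumulative_profit *= 0.5
    change_lane = cumulative_profit > threshold
    return cumulative_profit, change_lane
```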
2.2. LimSim High-Fidelity Microscopic Traffic Flow
This study employs the high-fidelity microscopic traffic flow of the LimSim driving simulation platform, which is based on optimal trajectory planning in the Frenet frame [21]. Within a circular area around the ego vehicle, the microscopic traffic flow is updated according to the optimal trajectory of each vehicle.
2.2.1. Trajectory Generation
In the Frenet coordinate system, the motion state of a vehicle can be described by the tuple $(s, \dot{s}, \ddot{s}, d, \dot{d}, \ddot{d})$, where $s$ represents the longitudinal displacement, $\dot{s}$ the longitudinal velocity, $\ddot{s}$ the longitudinal acceleration, $d$ the lateral displacement, $\dot{d}$ the lateral velocity, and $\ddot{d}$ the lateral acceleration.
Lateral Trajectory Generation
The lateral trajectory curve can be expressed by the following fifth-order polynomial:
$$ d(t) = c_0 + c_1 t + c_2 t^2 + c_3 t^3 + c_4 t^4 + c_5 t^5 $$
The trajectory start point $(d_0, \dot{d}_0, \ddot{d}_0)$ is known, and a complete polynomial trajectory can be determined once the end point $(d_1, \dot{d}_1, \ddot{d}_1)$ is specified. As vehicles travel on the road, they use the road centerline as the reference line for navigation, and the optimal end state is to move parallel to the centerline, which means the end point would be $(d_1, 0, 0)$. Equidistant sampling points are selected between the start point and the end point, and the multiple polynomial segments are connected to form many complete candidate lateral trajectories.
Longitudinal Trajectory Generation
The longitudinal trajectory curve can be expressed by a fourth-degree polynomial:
$$ s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 $$
The start point is $(s_0, \dot{s}_0, \ddot{s}_0)$, and the end point is specified by the target speed and acceleration $(\dot{s}_1, \ddot{s}_1)$, leaving the end position unconstrained. Equidistant sampling points are selected between the start point and the end point, and the multiple polynomial segments are connected to form many complete candidate longitudinal trajectories.
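A minimal sketch of the boundary-value fitting used for the lateral (quintic) and longitudinal (quartic) polynomials is given below, following the standard Frenet-frame formulation [21]; the sampled end conditions and the 5 s horizon in the example are illustrative assumptions.

```python
import numpy as np

def quintic_coeffs(d0, d0_dot, d0_ddot, d1, d1_dot, d1_ddot, T):
    """Coefficients of d(t) = c0 + c1 t + ... + c5 t^5 matching the lateral
    start state (d0, d0_dot, d0_ddot) and the end state at time T."""
    c0, c1, c2 = d0, d0_dot, d0_ddot / 2.0
    A = np.array([[T**3,   T**4,    T**5],
                  [3*T**2, 4*T**3,  5*T**4],
                  [6*T,    12*T**2, 20*T**3]])
    b = np.array([d1      - (c0 + c1*T + c2*T**2),
                  d1_dot  - (c1 + 2*c2*T),
                  d1_ddot - 2*c2])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([c0, c1, c2, c3, c4, c5])

def quartic_coeffs(s0, s0_dot, s0_ddot, s1_dot, s1_ddot, T):
    """Coefficients of s(t) = a0 + ... + a4 t^4; only the target speed and
    acceleration are imposed at time T, the end position is left free."""
    a0, a1, a2 = s0, s0_dot, s0_ddot / 2.0
    A = np.array([[3*T**2, 4*T**3],
                  [6*T,    12*T**2]])
    b = np.array([s1_dot  - (a1 + 2*a2*T),
                  s1_ddot - 2*a2])
    a3, a4 = np.linalg.solve(A, b)
    return np.array([a0, a1, a2, a3, a4])

# Example: return to the lane centerline (d = 0) in 5 s, and reach 15 m/s
lat = quintic_coeffs(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0)
lon = quartic_coeffs(0.0, 12.0, 0.0, 15.0, 0.0, 5.0)
```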
2.2.2. Optimal Trajectory Selection
The trajectory selection process evaluates a cost function composed of the following key components:
- trajectory smoothness, determined by the heading and curvature differences between the actual and reference trajectories;
- vehicle stability, indicated by the differences in acceleration and jerk between the actual and reference trajectories;
- collision risk, assessed by the risk level of collision with surrounding vehicles;
- speed deviation, gauged by the velocity difference between the actual trajectory and the reference speed;
- lateral trajectory deviation, measured by the lateral distance difference between the actual trajectory and the reference trajectory.
The total cost function is utilized to evaluate the set of candidate trajectories in Section 2.2.1, followed by an assessment of their compliance with vehicle dynamics constraints, such as turning radius and speed/acceleration limits. The trajectory that not only satisfies the vehicle dynamics constraints but also incurs the minimum cost is selected as the final valid trajectory.
Vehicles within a 50 m perception range of the ego vehicle will be subject to the Frenet optimal trajectory control, with a trajectory being planned every 0.5 s and having a duration of 5 s.
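The selection step can be pictured with the sketch below, which scores a set of candidate trajectories with a weighted sum of the cost terms listed above and discards candidates that violate the dynamics limits; the weight values, limits, and candidate attribute names are illustrative assumptions, not LimSim's actual interface.

```python
def select_trajectory(candidates, weights=None, a_max=3.0, v_max=33.3):
    """Pick the feasible candidate trajectory with the minimum weighted cost.

    Each candidate is assumed to expose pre-computed cost terms (smoothness,
    stability, collision risk, speed deviation, lateral deviation) and its
    maximum speed/acceleration; this structure is hypothetical.
    """
    if weights is None:
        weights = {"smooth": 1.0, "stable": 1.0, "risk": 5.0,
                   "speed": 0.5, "lateral": 0.5}
    best, best_cost = None, float("inf")
    for traj in candidates:
        # dynamics feasibility check (turning radius checks would go here too)
        if traj.max_acceleration > a_max or traj.max_speed > v_max:
            continue
        cost = (weights["smooth"] * traj.smoothness_cost
                + weights["stable"] * traj.stability_cost
                + weights["risk"] * traj.collision_risk
                + weights["speed"] * traj.speed_deviation
                + weights["lateral"] * traj.lateral_deviation)
        if cost < best_cost:
            best, best_cost = traj, cost
    return best
```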
3. Domain Randomization for Rule-Based Microscopic Traffic Flow
The domain randomization method is based on randomizing the model parameters in the IDM car-following model and the SL2015 lane-changing model. The randomized parameters are shown in Table 1 and are described below.
Table 1.
Domain randomization parameters.
There are five randomized parameters in the IDM model. “δ” is the acceleration exponent, and “T” is the bumper-to-bumper time gap. “accel”, “decel”, and “maxSpeed” are the upper limit of vehicle acceleration, the upper limit of vehicle deceleration, and the upper limit of vehicle speed, respectively.
There are two randomized parameters in the SL2015 model. “lcSpeedGain” indicates the degree to which a vehicle is eager to change lanes to gain speed; the larger the value, the more inclined the vehicle is to change lanes. “lcAssertive” is another parameter that significantly influences the driver’s lane-changing model [22]; a higher “lcAssertive” value makes the vehicle more inclined to accept smaller lane-changing gaps, leading to more aggressive lane-changing behavior.
Ref. [23] found that the IDM parameters are close to Gaussian distributions. Consequently, we adopt Gaussian distributions for all the domain-randomized parameters. Each randomized parameter follows a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ within the interval $[r_{\min}, r_{\max}]$, where $r_{\min}$ and $r_{\max}$ are the lower and upper bounds of the randomization interval. $\mu$ is set to $(r_{\min} + r_{\max})/2$, and $\sigma$ is set to $(r_{\max} - r_{\min})/6$. Thus, when a vehicle is generated, the probability that its randomized parameter value falls within $[r_{\min}, r_{\max}]$ is 99.73%.
When each vehicle is initialized on the road for each episode, these randomized parameters are generated and assigned to it.
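A minimal sketch of this randomization step is shown below, assuming parameters are drawn from clipped Gaussians with the mean and standard deviation defined above and assigned to each newly inserted vehicle through TraCI; the interval bounds, the "carFollowModel.*"/"laneChangeModel.*" parameter keys, and the helper names are assumptions for illustration, not the paper's exact values.

```python
import numpy as np
import traci

# Illustrative randomization intervals [lower, upper]; not the paper's Table 1 values.
RANGES = {
    "delta":       (2.0, 6.0),
    "tau":         (0.8, 2.0),    # IDM time gap T
    "accel":       (1.5, 3.0),
    "decel":       (2.0, 4.5),
    "maxSpeed":    (25.0, 36.0),
    "lcSpeedGain": (0.5, 2.0),
    "lcAssertive": (0.5, 2.0),
}

def sample(lo, hi):
    """Gaussian with mean at the interval center and sigma = range/6,
    clipped so the drawn value always stays inside [lo, hi]."""
    mu, sigma = (lo + hi) / 2.0, (hi - lo) / 6.0
    return float(np.clip(np.random.normal(mu, sigma), lo, hi))

def randomize_vehicle(veh_id):
    """Assign freshly sampled car-following and lane-changing parameters."""
    p = {name: sample(*bounds) for name, bounds in RANGES.items()}
    traci.vehicle.setTau(veh_id, p["tau"])
    traci.vehicle.setAccel(veh_id, p["accel"])
    traci.vehicle.setDecel(veh_id, p["decel"])
    traci.vehicle.setMaxSpeed(veh_id, p["maxSpeed"])
    traci.vehicle.setParameter(veh_id, "carFollowModel.delta", str(p["delta"]))
    traci.vehicle.setParameter(veh_id, "laneChangeModel.lcSpeedGain", str(p["lcSpeedGain"]))
    traci.vehicle.setParameter(veh_id, "laneChangeModel.lcAssertive", str(p["lcAssertive"]))

# Called once for every vehicle that enters the network during an episode:
# for veh_id in traci.simulation.getDepartedIDList():
#     randomize_vehicle(veh_id)
```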
4. Simulation Experiment
In this section, we create freeway and merging environments in the open-source SUMO driving simulator [24] and establish communication between SUMO and the reinforcement learning algorithm via TraCI [25]. The timestep for the agent to select actions and observe the environment state is set to 0.1 s. We create the non-randomized rule-based microscopic traffic flow, the high-fidelity microscopic traffic flow of LimSim, and the domain-randomized microscopic traffic flow, and we train the reinforcement learning-based autonomous vehicles under these different microscopic traffic flows in the freeway and merging scenes, respectively.
4.1. Merging
4.1.1. Merging Environment
We establish the merging environment inspired by Lin et al. [6]. A control zone for the merging vehicle is established, spanning 100 m to the rear of the on-ramp’s merging point and 100 m to the front of the merging point, as depicted in Figure 1. The red vehicle, operating under reinforcement learning control, is tasked with executing smooth and safe merging within the designated control area.
Figure 1.
Merging in SUMO.
State
In defining the state of the reinforcement learning environment, the merging vehicle is projected onto the main road to produce a projected vehicle, and a total of five vehicles are then considered: the two vehicles ahead of the projected vehicle, the two vehicles behind it, and the projected vehicle itself. To make reasonable use of the observable information, the distances of these five vehicles to the merging point, as well as their velocities, are included in the state representation. These parameters form a state representation with eleven variables.
Action
The action space we have defined is a continuous variable: acceleration within m/s2. This range is consistent with the normal acceleration range of surrounding vehicles.
Reward
We aim for the merging vehicle to maintain a safe distance from the preceding and following vehicles, ensure comfort, and avoid coming to a stop or making the following vehicle brake sharply. Therefore, the reward function is expressed as follows:
After merging, the merging vehicle is safer when its position is in the middle between the preceding and following vehicles. The corresponding penalizing reward is defined as
where represents the weight factor, and is the maximum allowable speed difference. The variable w is defined to measure the distance gap among the merging vehicle, its first preceding vehicle, and its first following vehicle. The details are as follows:
where and represent the lengths of the first preceding vehicle and the merging vehicle, both measuring at 5 m. When the first following vehicle performs braking in the control zone, a penalizing reward is defined as
where is the weight and is acceleration of the first following vehicle. In order to improve the comfort level of the merging vehicle, we define a penalizing reward for jerk:
where is the weight, is maximum allowed jerk, and is jerk of the merging vehicle.
In addition, if the merging vehicle comes to a stop, a penalty of is imposed. When the merging vehicle collides with any vehicle, a penalty of is applied. Conversely, if the merging vehicle successfully reaches its destination, a reward of is granted. Table 2 shows the values of the above-mentioned parameters of the merging vehicle.
Table 2.
Parameter values for the merging vehicle.
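To make the structure of this reward concrete, the sketch below combines the terms described in this subsection into a single scalar. Every weight, threshold, and terminal value here is a hypothetical placeholder rather than the values listed in Table 2, and the centering measure w is assumed to be normalized so that 0.5 means the merging vehicle sits midway in the gap.

```python
def merging_reward(w, a_follow, jerk, stopped, collided, reached_goal,
                   k_pos=1.0, k_brake=0.5, k_jerk=0.1, jerk_max=3.0,
                   r_stop=-1.0, r_collision=-10.0, r_success=10.0):
    """Composite merging reward following the structure described above.

    w        : normalized position of the merging vehicle between its first
               preceding and first following vehicle (0.5 = centered; assumed)
    a_follow : acceleration of the first following vehicle [m/s^2]
    jerk     : jerk of the merging vehicle [m/s^3]
    All weights and terminal values are hypothetical placeholders.
    """
    reward = 0.0
    reward -= k_pos * abs(w - 0.5)                      # stay centered in the gap
    reward -= k_brake * max(0.0, -a_follow)             # penalize follower braking
    reward -= k_jerk * min(abs(jerk) / jerk_max, 1.0)   # penalize harsh jerk
    if stopped:
        reward += r_stop
    if collided:
        reward += r_collision
    if reached_goal:
        reward += r_success
    return reward
```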
4.1.2. Soft Actor–Critic
SAC is the reinforcement learning algorithm used for training in the merging scenes [26]. The SAC algorithm adopts the classical actor–critic framework of reinforcement learning, which optimizes the value function and the policy at the same time, and it consists of a parameterized soft Q-function $Q_\theta(s_t, a_t)$ and a tractable policy $\pi_\phi(a_t \mid s_t)$, with network parameters $\theta$ and $\phi$. The approach considers a more general maximum entropy objective that not only seeks to maximize rewards but also maintains a degree of randomness in action selection, as follows:
$$ J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right] $$
where $\rho_\pi$ denotes the state–action distribution under the policy $\pi$, while $\mathcal{H}(\pi(\cdot \mid s_t))$ signifies the entropy of the policy at state $s_t$, thereby enhancing the unpredictability of the chosen actions. The temperature parameter $\alpha$ calibrates the balance between entropy and reward within the objective function and thus influences the optimal policy. The hyperparameters of SAC are the same as in Ref. [6].
4.1.3. Results under Different Microscopic Traffic Flows
Training
In the merging environment, we trained for 200,000 timesteps under each of the three different microscopic traffic flows. The training was carried out on an NVIDIA RTX 3060 graphics card paired with an Intel i7-12700F processor. It required approximately 1 h to complete the training under both SUMO's default non-randomized and the domain-randomized traffic flows, whereas training under the high-fidelity traffic flow took 3.5 h. The vehicle generation probability was 0.56, and the traffic density on the main road was approximately 16 vehicles per kilometer.
Testing
Each trained policy was tested for 1000 episodes in the merging environment. We evaluated the trained policies based on the merging vehicle's success rate, defined as the completion of an episode without any collision, and the average reward over the entire testing period.
Comparison and Analysis
The training curves depicted in Figure 2 suggest that there is minimal visible difference in the rate of convergence and the rewards achieved by strategies trained under different microscopic traffic flows.
Figure 2.
Undiscounted episode reward during training under three traffic flows.
Table 3 shows that the policies trained under the non-randomized rule-based traffic flow and the high-fidelity microscopic traffic flow yield poor results when adapted to the domain-randomized rule-based traffic flow. Conversely, the policy trained under the domain-randomized rule-based traffic flow consistently achieves success rates above when tested across all three traffic flows.
Table 3.
The results of testing the trained policies regarding merging.
4.1.4. Generalization Results for Increased Traffic Densities
High-fidelity microscopic traffic flows closely resemble actual traffic scenes, so we used them as the test traffic flow with increased traffic densities. The impact of changes in traffic density is shown in Table 4. It can be observed that the policy trained under rule-based traffic flow without randomization experiences a gradual decline in success rates and rewards as traffic density increases. In contrast, the policy trained under domain-randomized rule-based traffic flow consistently maintains a higher success rate.
Table 4.
The impact of traffic densities on three trained policies with high-fidelity traffic flow.
4.1.5. Ablation Study
To better understand the role of each domain-randomized parameter in the model's performance, we analyzed its impact on the training outcomes through an ablation study, in which the randomization of each parameter was removed separately. Policies were then trained individually under the traffic flows with the corresponding parameter ablated. Finally, the trained policies were tested under both the domain-randomized (all parameters randomized) and high-fidelity traffic flows. The results of the ablation study are shown in Table 5.
Table 5.
The results of the ablation study.
It can be observed that the performance of the policies trained under the traffic flows with ablations declines when tested under the high-fidelity traffic flow. Moreover, the ablation of significantly affects performance.
4.2. Freeway
4.2.1. Freeway Environment
We used a straight two-lane freeway measuring 1000 m in length, inspired by Lin et al. [27]. Figure 3 depicts a standard lane-changing scenario in SUMO, where the ego vehicle is indicated by the red car and the surrounding vehicles are represented by the green cars.
Figure 3.
The ego vehicle overtakes along the arrow trajectory in the freeway.
State
The state of the environment is centered on the ego vehicle and four nearby vehicles: the vehicle directly ahead and the vehicle directly behind in the same lane, and the two similarly positioned vehicles in the adjacent lane. At time t, the state comprises the longitudinal distances of these four vehicles from the ego vehicle, their respective speeds, and the speed and acceleration of the ego vehicle, forming a ten-variable state representation.
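For concreteness, a hedged sketch of how this ten-dimensional observation could be assembled is given below; the variable ordering and the dictionary structure for the ego and neighboring vehicles are illustrative assumptions.

```python
import numpy as np

def build_freeway_state(ego, neighbors):
    """Stack the freeway observation described above into a 10-d vector.

    `ego` carries the ego speed and acceleration; `neighbors` lists the
    leader/follower in the current lane and the leader/follower in the
    adjacent lane, each with its longitudinal distance to the ego vehicle
    and its speed. Ordering is an illustrative assumption.
    """
    features = []
    for veh in neighbors:                 # 4 vehicles -> 8 features
        features.append(veh["distance"])
        features.append(veh["speed"])
    features.append(ego["speed"])         # ego speed
    features.append(ego["acceleration"])  # ego acceleration
    return np.asarray(features, dtype=np.float32)
```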
Action
The action space is defined as follows:
where the continuous action indicates the acceleration of the ego vehicle, while the discrete actions ‘0’ and ‘1’ dictate the lane-changing behavior: ‘0’ keeps the current lane, and ‘1’ performs an instantaneous lane change to the other lane.
Reward
We have formulated a reward function aligned with practical driving objectives, incentivizing behaviors such as avoiding collisions, obeying speed limits, preserving comfortable driving conditions, and maintaining a safe following distance. The total reward is expressed as follows:
In order to penalize frequent lane changes, the penalty is defined as follows:
where is ego vehicle’s lateral position and < . If the vehicle changes lanes within the safety distance , it incurs a penalty of . Alternatively, changing lanes outside results in a penalty of .
It is essential to ensure that the ego vehicle maintains a safe following distance from the preceding vehicle, and the corresponding penalizing reward is defined as
The objective of is to ensure driving comfort. It is defined as
where and denote the acceleration at the current and previous moments, respectively.
In order to promote the ego vehicle speed that enables overtaking, the penalty is defined as follows:
When there is no opportunity to overtake the vehicle ahead, the ego vehicle should travel at a steady speed similar to that of the preceding vehicle. Consequently, we introduce a threshold . As long as , the ego vehicle will not incur penalties of or .
In the equations presented above, denotes the corresponding weights. The key parameters for the freeway scene are presented in Table 6.
Table 6.
Parameters for freeway simulation.
4.2.2. Parameterized Soft Actor–Critic
The SAC algorithm in Section 4.1.2 can only handle continuous action spaces. To deal with the hybrid continuous-discrete action space of the freeway lane-change task, we adopt the Parameterized SAC (PASAC) algorithm, inspired by Lin et al. [27].
PASAC is based on SAC. The actor network produces continuous outputs, which include both the continuous actions and the weights for the discrete actions. An argmax function is then used to select the discrete action associated with the maximum weight.
Since the freeway environment has a hybrid continuous-discrete action space, the agent is trained with the PASAC algorithm. The hyperparameters of PASAC are the same as those of SAC.
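This hybrid action handling can be illustrated with the following sketch, in which the actor outputs one continuous acceleration plus one weight per discrete lane-change option and the executed discrete action is the argmax of those weights; the output layout and shapes are assumptions rather than the exact PASAC architecture of [27].

```python
import numpy as np

def split_hybrid_action(actor_output):
    """Map the actor's continuous output vector to a hybrid action.

    actor_output: array of shape (3,) = [acceleration, w_keep, w_change],
    where the first entry is the continuous acceleration and the remaining
    entries are weights for the discrete actions {0: keep lane, 1: change lane}.
    The ordering is an illustrative assumption.
    """
    acceleration = float(actor_output[0])
    discrete_weights = actor_output[1:]
    lane_change = int(np.argmax(discrete_weights))  # 0 = keep, 1 = change
    return acceleration, lane_change

# Example
acc, lc = split_hybrid_action(np.array([0.8, 0.2, 0.7]))
```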
4.2.3. Results under Different Microscopic Traffic Flows
Training
In the freeway environment, we trained for 400,000 timesteps under each of the three different microscopic traffic flows. It required 1.5 h to complete the training under both the rule-based traffic flows without randomization and with domain randomization, whereas training under the high-fidelity traffic flow took 5 h. The vehicle generation rate was 0.14 vehicles per second, and the traffic density on the main road was approximately 11 vehicles per kilometer on each straightaway.
Testing
Each trained policy was tested for 1000 episodes in the freeway environment. We evaluated the trained policies based on the ego vehicle's success rate, defined as the completion of an episode without any collision, and the average reward over the entire testing period.
Comparison and Analysis
In Figure 4, it can be observed that the policies all tend to converge around 200 episodes. Throughout the training process, aside from the initially lower reward of the domain-randomized traffic flows, the convergence rates and final rewards of the three curves are closely aligned.
Figure 4.
Undiscounted episode reward during training under three traffic flows.
The results of testing are shown in Table 7. It can be observed that the policy trained under the domain-randomized rule-based traffic flow achieves the highest success rates when tested under the different microscopic traffic flows, whereas the policies trained under the non-randomized rule-based and high-fidelity traffic flows cannot adapt to the domain-randomized rule-based traffic flow.
Table 7.
The results of testing the trained policies regarding freeway condition.
4.2.4. Generalization Results for Increased Traffic Densities
The impact of changes in high-fidelity traffic density is shown in Table 8. It can be observed that the success rate and reward of the policy trained under the microscopic traffic flow without randomization decrease significantly as traffic density increases. In contrast, the policy trained under the domain-randomized traffic flow maintains a success rate near %.
Table 8.
The impact of changes in traffic density on three trained policies under high-fidelity traffic flow.
4.2.5. Ablation Study
In the freeway environment, we also conducted an ablation study to enhance our understanding of the role that an individual domain-randomized parameter plays in the model’s performance. The results of the ablation study are shown in Table 9.
Table 9.
The results of the ablation study.
In the freeway environment, the collision rates of the different policies are close to zero, so we mainly compare their average rewards. The conclusions are similar to those for merging: the performance of the policies trained under traffic flows with domain-randomized parameter ablation declines to varying degrees. The ablation of has a large impact on the performance.
5. Conclusions
In this study, we introduce a method for randomizing lane-changing and car-following model parameters to generate randomized microscopic traffic flows, and we evaluate and compare the policies trained by reinforcement learning algorithms in freeway and merging environments. The results show that
- The policy trained under the condition of domain-randomized rule-based microscopic traffic flow is able to maintain high rewards and success rates when tested with different microscopic traffic flows. However, the policies trained under the condition of microscopic traffic flow without randomization or high-fidelity microscopic traffic flow perform significantly worse when tested under microscopic traffic flows that differ from those used in training. This indicates that domain randomization enables reinforcement learning agents to adapt to different types of traffic flow.
- The policy trained under the condition of domain-randomized rule-based microscopic traffic flow performs well when tested under high-fidelity microscopic traffic flow with different traffic densities. The performance of the policy trained under microscopic traffic flow without randomization decreases significantly with increasing traffic density. This indicates that the policy trained under domain-randomized traffic flow generalizes well to changes in traffic density.
- Although high-fidelity microscopic traffic flow is close to real microscopic traffic flows, the results show that it not only considerably increases simulation time but also yields policies that do not generalize well to different microscopic traffic flows. Therefore, high-fidelity microscopic traffic flow is more suitable for testing than for training.
In summary, the policies trained under domain-randomized rule-based microscopic traffic flow demonstrate robust performance when transferred to environments that closely resemble real-world traffic conditions. The future work includes testing a policy trained under the condition of domain-randomized rule-based microscopic traffic flow on a real autonomous vehicle.
Author Contributions
Conceptualization, Y.L.; methodology, A.X., Y.L. and X.L.; formal analysis, A.X. and Y.L.; investigation, A.X. and Y.L.; data curation, A.X.; writing—original draft preparation, A.X.; writing—review and editing, A.X.; supervision, Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by Guangzhou Basic and Applied Basic Research Program under Grant 2023A04J1688, and in part by South China University of Technology faculty start-up fund.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data can be obtained upon reasonable request from the corresponding author.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Le Vine, S.; Zolfaghari, A.; Polak, J. Autonomous cars: The tension between occupant experience and intersection capacity. Transp. Res. Part C Emerg. Technol. 2015, 52, 1–14. [Google Scholar] [CrossRef]
- Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. arXiv 2017, arXiv:1704.02532. [Google Scholar] [CrossRef]
- Hoel, C.J.; Wolff, K.; Laine, L. Automated speed and lane change decision making using deep reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2148–2155. [Google Scholar]
- Ye, Y.; Zhang, X.; Sun, J. Automated vehicle’s behavior decision making using deep reinforcement learning and high-fidelity simulation environment. Transp. Res. Part C Emerg. Technol. 2019, 107, 155–170. [Google Scholar] [CrossRef]
- Ye, F.; Cheng, X.; Wang, P.; Chan, C.Y.; Zhang, J. Automated lane change strategy using proximal policy optimization-based deep reinforcement learning. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1746–1752. [Google Scholar]
- Lin, Y.; McPhee, J.; Azad, N.L. Anti-jerk on-ramp merging using deep reinforcement learning. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 7–14. [Google Scholar]
- Peng, J.; Zhang, S.; Zhou, Y.; Li, Z. An Integrated Model for Autonomous Speed and Lane Change Decision-Making Based on Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21848–21860. [Google Scholar] [CrossRef]
- Mirchevska, B.; Blum, M.; Louis, L.; Boedecker, J.; Werling, M. Reinforcement learning for autonomous maneuvering in highway scenarios. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium, Los Angeles, CA, USA, 11–14 June 2017. [Google Scholar]
- Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805. [Google Scholar] [CrossRef] [PubMed]
- Liu, J.; Zeng, W.; Urtasun, R.; Yumer, E. Deep structured reactive planning. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 4897–4904. [Google Scholar]
- Huang, X.; Rosman, G.; Jasour, A.; McGill, S.G.; Leonard, J.J.; Williams, B.C. TIP: Task-informed motion prediction for intelligent vehicles. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 11432–11439. [Google Scholar]
- Punzo, V.; Simonelli, F. Analysis and comparison of microscopic traffic flow models with real traffic microscopic data. Transp. Res. Rec. 2005, 1934, 53–63. [Google Scholar] [CrossRef]
- Tessler, C.; Efroni, Y.; Mannor, S. Action robust reinforcement learning and applications in continuous control. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6215–6224. [Google Scholar]
- Wang, J.X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J.Z.; Munos, R.; Blundell, C.; Kumaran, D.; Botvinick, M. Learning to reinforcement learn. arXiv 2016, arXiv:1611.05763. [Google Scholar]
- Andrychowicz, O.M.; Baker, B.; Chociej, M.; Jozefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 2020, 39, 3–20. [Google Scholar] [CrossRef]
- Sheckells, M.; Garimella, G.; Mishra, S.; Kobilarov, M. Using data-driven domain randomization to transfer robust control policies to mobile robots. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3224–3230. [Google Scholar]
- Sun, Q.; Huang, X.; Williams, B.C.; Zhao, H. InterSim: Interactive traffic simulation via explicit relation modeling. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 11416–11423. [Google Scholar]
- Feng, L.; Li, Q.; Peng, Z.; Tan, S.; Zhou, B. Trafficgen: Learning to generate diverse and realistic traffic scenarios. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 3567–3575. [Google Scholar]
- Wen, L.; Fu, D.; Mao, S.; Cai, P.; Dou, M.; Li, Y.; Qiao, Y. LimSim: A long-term interactive multi-scenario traffic simulator. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 1255–1262. [Google Scholar]
- Zheng, O.; Abdel-Aty, M.; Yue, L.; Abdelraouf, A.; Wang, Z.; Mahmoud, N. CitySim: A drone-based vehicle trajectory dataset for safety oriented research and digital twins. arXiv 2022, arXiv:2208.11036. [Google Scholar] [CrossRef]
- Werling, M.; Ziegler, J.; Kammel, S.; Thrun, S. Optimal trajectory generation for dynamic street scenarios in a Frenet frame. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–8 May 2010; pp. 987–993. [Google Scholar]
- Berrazouane, M.; Tong, K.; Solmaz, S.; Kiers, M.; Erhart, J. Analysis and initial observations on varying penetration rates of automated vehicles in mixed traffic flow utilizing sumo. In Proceedings of the 2019 IEEE International Conference on Connected Vehicles and Expo (ICCVE), Graz, Austria, 4–8 November 2019; pp. 1–7. [Google Scholar]
- Kusari, A.; Li, P.; Yang, H.; Punshi, N.; Rasulis, M.; Bogard, S.; LeBlanc, D.J. Enhancing SUMO simulator for simulation based testing and validation of autonomous vehicles. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 5–9 June 2022; pp. 829–835. [Google Scholar]
- Krajzewicz, D.; Erdmann, J.; Behrisch, M.; Bieker, L. Recent development and applications of SUMO-Simulation of Urban MObility. Int. J. Adv. Syst. Meas. 2012, 5, 128–138. [Google Scholar]
- Wegener, A.; Piórkowski, M.; Raya, M.; Hellbrück, H.; Fischer, S.; Hubaux, J.P. TraCI: An interface for coupling road traffic and network simulators. In Proceedings of the 11th Communications and Networking Simulation Symposium, Ottawa, ON, Canada, 14–17 April 2008; pp. 155–163. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
- Lin, Y.; Liu, X.; Zheng, Z.; Wang, L. Discretionary Lane-Change Decision and Control via Parameterized Soft Actor-Critic for Hybrid Action Space. arXiv 2024, arXiv:2402.15790. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).