Article

An Autonomous Vehicle Behavior Decision Method Based on Deep Reinforcement Learning with Hybrid State Space and Driving Risk

School of Mechanical Engineering, Southwest Jiaotong University, Chengdu 610036, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(3), 774; https://doi.org/10.3390/s25030774
Submission received: 24 December 2024 / Revised: 14 January 2025 / Accepted: 24 January 2025 / Published: 27 January 2025
(This article belongs to the Section Vehicular Sensing)

Abstract

Behavioral decision-making is a key component of the high-level intelligent driving system of intelligent vehicles, and efficient, safe behavioral decision-making is essential to the deployment of intelligent transportation systems, making it a hot topic of current research. This paper proposes a deep reinforcement learning (DRL) method based on a hybrid state space and driving risk for autonomous vehicle behavior decision-making, which enables autonomous vehicles to make behavioral decisions with minimal instantaneous risk through DRL training. Firstly, based on the various behaviors that an autonomous vehicle may take during high-speed driving, a calculation method for autonomous vehicle driving risk is proposed. Then, deep reinforcement learning is used to improve the safety and efficiency of behavioral decision-making through the interaction between the vehicle and the driving environment. Finally, the effectiveness of the proposed method is demonstrated by training and validation in different simulation scenarios, and the results show that the proposed method enables autonomous vehicles to make safe and efficient behavior decisions in complex driving environments. Compared with an advanced algorithm, the proposed method improves the driving distance of the autonomous vehicle by 3.3%, safety by 2.1%, and computation time by 43% in the experiments.

1. Introduction

Advanced Driver Assistance Systems (ADAS) play a critical role in modern intelligent transportation systems, enhancing both the intelligence and safety of vehicle operation [1,2,3]. Central to ADAS is the behavior decision-making function, which relies on the perception of various traffic participants within the driving environment [4,5,6]. With the rapid advancements in artificial intelligence, AI technologies have become integral to the assisted driving functions of autonomous vehicles, greatly enhancing the accuracy and reliability of these systems. In this context, the objective of this paper is to investigate the behavioral decision-making processes of autonomous vehicles through the application of deep reinforcement learning techniques.
When a vehicle operates on an open road, the dynamic changes in the state of surrounding traffic participants and the road environment, as captured in the scene information, can significantly influence the vehicle’s behavior. Consequently, lane change decisions must take into account the interactions within the surrounding driving context. Extensive research has been conducted on vehicle lane change decision-making, which can be broadly categorized into two primary approaches: rule-based decision-making methods and learning-based decision-making methods.
The rule-based decision-making approach assumes that a driver’s behavior is entirely rational, with lane change decisions based on factors such as safety, feasibility, obstacle locations, and the relative speed advantages of the current and target lanes [7]. The formulation of these rules is primarily based on considerations such as speed dissatisfaction, driving safety, and interactions with surrounding traffic participants [8,9,10]. Speed dissatisfaction is calculated by comparing the expected speed of the vehicle in front with the actual speed, which in turn influences the decision to change lanes [11]. Driving safety is concerned with the vehicle’s ability to avoid potential conflicts with surrounding entities, guiding decisions regarding lane changes or adjustments in speed (acceleration or deceleration) [12,13]. Interaction with surrounding traffic participants is typically modeled using game theory, which assumes a conflict of interest between the lane-changing vehicle and other surrounding vehicles. In this framework, decision-making rules for lane changes, acceleration, and deceleration are designed by optimizing for both safety and efficiency, thereby establishing a decision-making model for autonomous vehicle behavior [14,15,16].
On the one hand, the rule-based lane change decision-making approach may not fully capture the driving behaviors of human drivers. On the other hand, it can be challenging to implement, as it requires simultaneously considering and artificially representing the effects of multiple factors within a single calculation formula. In contrast, learning algorithms can autonomously learn vehicle driving parameters and represent more complex decision models, making them a key focus in the field of autonomous vehicle behavior decision-making. Currently, two primary types of learning-based decision-making methods are prevalent: deep learning and reinforcement learning.
The deep learning approach is characterized by its distributed processing and self-learning capabilities, which enable it to be trained using multiple sets of parameters that capture salient features. This training process leads to the accurate simulation of driving behavior and the optimization of model parameters, ensuring that the model’s output closely aligns with human behavior. For instance, Chen [17] developed a deep neural network (DNN) within the TORCS simulator environment, training vehicles to follow and overtake at high speeds. Xu [18] proposed a branch network structure of FCN-LSTM, integrating semantic segmentation techniques to better understand driving scene characteristics and predict vehicle movements in both horizontal and vertical directions probabilistically. Müller [19] introduced a vision-based convolutional neural network (CNN) for semantic segmentation of the road, enabling the prediction of driving strategies, including road identification and tracking via a PID controller. The training and testing for this model were conducted in the Carla simulator, albeit without traffic flow.
The effectiveness of reinforcement learning (RL) algorithms primarily hinges on the design of the reward function, as the manner in which rewards are assigned directly impacts model performance. Reward functions are typically categorized into two types: those that reward movement towards the goal and those that impose penalties for undesirable actions, such as collisions or deviations from the intended path. Kendall [20] demonstrated the application of RL in a real-world environment, where the vehicle was trained to follow lane markings. The training cycle ended when a safety officer determined that the vehicle had deviated from the road, with the reward function being defined as the maximum distance the vehicle traveled before being taken over. However, this approach, which uses a single scalar value as the reward, is limited in its ability to capture the complexity of real-world driving scenarios. A common improvement to reinforcement learning is its integration with deep learning techniques, forming what is known as deep reinforcement learning (DRL) [21,22,23,24,25]. One of the main advantages of reinforcement learning is that it does not require large amounts of human driving data; instead, it is trained based on maximizing the reward associated with specific behaviors. Moreover, since reinforcement learning is an online training process, it can simultaneously explore the environment and update the model. However, this approach suffers from relatively lower training efficiency [26]. To address this, Liang [27] proposed a method where a deep model is first trained using labeled data and then further optimized online using reinforcement learning strategies. This method shortens the training time compared to pure RL and leverages RL’s adaptability to improve model performance. Rhinehart [28] combined lidar data with imitation learning and model-based reinforcement learning to predict vehicle trajectories by mimicking expert behavior, with the training and testing conducted in the Carla simulator environment without dynamic traffic participants.
In contrast to the limitations of rule-based methods, such as restricted scene coverage, rigid procedural applications, and insufficient adaptability, learning-based decision-making approaches can leverage comprehensive vehicle driving scene information to acquire a broader range of experience [29]. Consequently, learning-based methods hold significant promise for wider applicability and offer greater flexibility in addressing dynamic driving environments. As the downstream of decision-making, planning can design specific driving trajectories based on the decision-making results, including lane change lateral planning and longitudinal speed planning [30,31].
This paper addresses the behavioral decision-making problem of autonomous vehicles on highways by employing a deep reinforcement learning approach, incorporating behavioral risk as a key factor. Behavioral risk enables autonomous vehicles to assess the level of danger associated with each decision, thereby providing a basis for the reward function in reinforcement learning. Deep neural networks are utilized to extract temporal features from driving scenarios represented by hybrid state spaces. The reward function in this framework takes into account both driving risk and efficiency, guiding the autonomous vehicle to adopt safe and optimal behaviors. Initially, a method for calculating the driving risk of autonomous vehicles during high-speed driving is proposed, considering the various behaviors the vehicle may exhibit. Subsequently, deep reinforcement learning is applied to enhance both the safety and efficiency of the decision-making process, focusing on the interaction between the vehicle and the driving environment. Finally, the effectiveness of the proposed method is validated through training and evaluation across multiple simulation scenarios. Compared with an advanced algorithm, the proposed method improves the driving distance of the autonomous vehicle by 3.3%, safety by 2.1%, and computation time by 43% in the experiments.
The structure of this paper is organized as follows: Section 2 outlines the methodology for calculating the risk associated with autonomous vehicle driving behaviors. Section 3 presents the design of a deep reinforcement learning approach for autonomous vehicle behavior decision-making. Section 4 involves training and simulating the proposed method, demonstrating its effectiveness and rationality. Finally, Section 5 concludes the paper with a summary of the findings and contributions.

2. Risk Analysis of Autonomous Vehicle Behavior

2.1. Behavior Model Construction of Autonomous Vehicle

In this paper, the quintic polynomial [32] is employed to model the lateral lane change behavior, while the Intelligent Driver Model (IDM) [33] is utilized to describe the longitudinal acceleration and deceleration dynamics of the vehicle.
The following assumptions are made for the quintic polynomial lateral lane change model: (1) the lateral position of the vehicle upon completion of the lane change is aligned with the center line of the target lane, (2) the vehicle’s longitudinal speed remains constant throughout the lane change, and (3) the lateral trajectory of the lane change is determined by a quintic polynomial. The lane change process governed by the quintic polynomial adheres to the following relationship.
$$y = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$
where y represents the lateral offset during the lane change, and a_0, a_1, ..., a_5 are the coefficients of the quintic polynomial that define the trajectory of the lane change. The variable t denotes the time elapsed since the initiation of the lane change. From the lateral displacement function, the lateral speed v_y and lateral acceleration a_y can be derived as the first and second derivatives, respectively. These quantities are essential for evaluating the dynamics of the lane change maneuver.
$$v_y = \dot{y}(t) = 5a_5 t^4 + 4a_4 t^3 + 3a_3 t^2 + 2a_2 t + a_1, \qquad a_y = \ddot{y}(t) = 20a_5 t^3 + 12a_4 t^2 + 6a_3 t + 2a_2$$
The specific shape of the polynomial trajectory is determined by the coefficient vector A = [a_0, a_1, a_2, a_3, a_4, a_5]^T. At the initial time t = t_0 of the lane change, the following boundary conditions are typically applied to ensure a smooth transition:
$$Y(t_0) = Y_0, \quad v_y(t_0) = v_0, \quad a_y(t_0) = a_0$$
At the ending time t = t_{end} of the lane change,
$$Y(t_{end}) = Y_{end}, \quad v_y(t_{end}) = v_{end}, \quad a_y(t_{end}) = a_{end}$$
To determine the polynomial coefficients, six equations can be constructed based on the vehicle’s state at the start and end moments of the lane change:
$$\begin{cases}
a_0 + a_1 t_0 + a_2 t_0^2 + a_3 t_0^3 + a_4 t_0^4 + a_5 t_0^5 = Y_0 \\
a_0 + a_1 t_{end} + a_2 t_{end}^2 + a_3 t_{end}^3 + a_4 t_{end}^4 + a_5 t_{end}^5 = Y_{end} \\
a_1 + 2a_2 t_0 + 3a_3 t_0^2 + 4a_4 t_0^3 + 5a_5 t_0^4 = v_0 \\
a_1 + 2a_2 t_{end} + 3a_3 t_{end}^2 + 4a_4 t_{end}^3 + 5a_5 t_{end}^4 = v_{end} \\
2a_2 + 6a_3 t_0 + 12a_4 t_0^2 + 20a_5 t_0^3 = a_0 \\
2a_2 + 6a_3 t_{end} + 12a_4 t_{end}^2 + 20a_5 t_{end}^3 = a_{end}
\end{cases}$$
Substituting the initial time t_0 = 0 and the end time t = t_{end} into these equations simplifies the system, and the coefficients are obtained as
$$A = B^{-1} C$$
$$A = \begin{bmatrix} a_0 & a_1 & a_2 & a_3 & a_4 & a_5 \end{bmatrix}^T$$
$$B = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 2 & 0 & 0 & 0 \\
1 & t_{end} & t_{end}^2 & t_{end}^3 & t_{end}^4 & t_{end}^5 \\
0 & 1 & 2t_{end} & 3t_{end}^2 & 4t_{end}^3 & 5t_{end}^4 \\
0 & 0 & 2 & 6t_{end} & 12t_{end}^2 & 20t_{end}^3
\end{bmatrix}$$
$$C = \begin{bmatrix} Y_0 & v_0 & a_0 & Y_{end} & v_{end} & a_{end} \end{bmatrix}^T$$
In the context of lane change behavior, the state of the vehicle at the moment the lane change is initiated uniquely determines the entire lane change trajectory. Consequently, the quintic polynomial that describes the lane change is solely dependent on the lane change time.
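For illustration, the boundary-value solution above can be coded in a few lines. The following NumPy sketch (not the authors' implementation; the 3.5 m lane width, 5 s duration, and zero boundary derivatives are assumed example values) solves A = B⁻¹C with t_0 = 0 and samples the resulting lateral trajectory.

```python
import numpy as np

def quintic_lane_change_coeffs(y0, vy0, ay0, y_end, vy_end, ay_end, t_end):
    """Solve A = B^(-1) C for the quintic lateral trajectory, taking t0 = 0."""
    B = np.array([
        [1.0, 0.0,    0.0,       0.0,         0.0,          0.0],
        [0.0, 1.0,    0.0,       0.0,         0.0,          0.0],
        [0.0, 0.0,    2.0,       0.0,         0.0,          0.0],
        [1.0, t_end,  t_end**2,  t_end**3,    t_end**4,     t_end**5],
        [0.0, 1.0,    2*t_end,   3*t_end**2,  4*t_end**3,   5*t_end**4],
        [0.0, 0.0,    2.0,       6*t_end,     12*t_end**2,  20*t_end**3],
    ])
    C = np.array([y0, vy0, ay0, y_end, vy_end, ay_end])
    return np.linalg.solve(B, C)                 # coefficients a0 ... a5

# Example: move one 3.5 m lane to the left in 5 s, starting and ending laterally at rest.
a = quintic_lane_change_coeffs(0.0, 0.0, 0.0, 3.5, 0.0, 0.0, t_end=5.0)
t = np.linspace(0.0, 5.0, 51)
y = sum(a[k] * t**k for k in range(6))           # lateral offset along the maneuver
```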
The longitudinal IDM of the vehicle is represented as
$$a(t) = \begin{cases}
\max\!\left(a_{max}\!\left[1 - \left(\dfrac{v(t)}{v_0}\right)^{\delta}\right],\ a_{min}\right), & \text{free driving} \\[2ex]
\max\!\left(a_{max}\!\left[1 - \left(\dfrac{v(t)}{v_0}\right)^{\delta} - \left(\dfrac{s^*(t)}{s(t)}\right)^{2}\right],\ a_{min}\right), & \text{car following}
\end{cases}$$
$$s^*(t) = s_0 + \max\!\left(0,\ v(t)T + \frac{v(t)\,\dot{s}(t)}{2\sqrt{a_{max}\, a_{IDM}}}\right)$$
where a_max = 2 m/s² is the maximum acceleration that the vehicle can take, a_min = −4 m/s² is the maximum deceleration that the vehicle can take, v_0 = 30 m/s is the expected speed, v(t) is the speed of the vehicle at time t, s(t) and ṡ(t) are the relative distance to and speed difference with the vehicle in front at time t, s_0 = 5 m is the minimum distance between the two cars when they are stationary, T = 1.5 s is the set time headway to the vehicle in front, and δ = 4 and a_IDM = 4 m/s² are constants.
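A minimal Python sketch of the bounded IDM described above is given below; the parameter defaults follow the values stated in the text, while the helper name and the sign convention for the speed difference (ego speed minus leader speed) are assumptions.

```python
import math

def idm_acceleration(v, gap=None, dv=None,
                     a_max=2.0, a_min=-4.0, v0=30.0, delta=4,
                     s0=5.0, T=1.5, a_idm=4.0):
    """Bounded IDM acceleration.
    v   : ego speed (m/s)
    gap : distance s(t) to the preceding vehicle (m); None means free driving
    dv  : speed difference (ego speed minus leader speed, m/s)"""
    free_term = 1.0 - (v / v0) ** delta
    if gap is None:                               # free driving branch
        a = a_max * free_term
    else:                                         # car-following branch
        s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * a_idm)))
        a = a_max * (free_term - (s_star / gap) ** 2)
    return max(a, a_min)                          # never exceed the deceleration limit

print(idm_acceleration(25.0))                         # free road
print(idm_acceleration(25.0, gap=30.0, dv=5.0))       # closing in on a slower leader
```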

2.2. Classification Discussion and Risk Analysis of Driving Behavior

In scenarios where an autonomous vehicle interacts with an environmental vehicle on a structured road, two primary forms of behavior are observed: longitudinal follow (LF) and lateral lane change (LC). The former can be further categorized into three specific types, while the latter encompasses one distinct form. Potential collision risks associated with each of these behaviors are illustrated in Figure 1.
In Scenario 1, the autonomous vehicle is positioned in the same lane as the environmental vehicle and follows it, with the autonomous vehicle's speed denoted as v_0, the environmental vehicle's speed as v_1, the distance between them as r_1, and the relative speed as v_f = v_1 − v_0. In Scenario 2, the autonomous vehicle is in the same lane but positioned ahead of the environmental vehicle, where the autonomous vehicle's speed is v_0, the environmental vehicle's speed is v_1, the distance between them is r_1, and the relative speed is v_f = v_0 − v_1. In Scenario 3, the environmental vehicle is in an adjacent lane and is about to merge into the lane of the autonomous vehicle. The autonomous vehicle's speed is v_0, the environmental vehicle's speed is v_2, the longitudinal distance between them is r_2, and the relative speed is v_c = v_2 − v_0. The acceleration of the autonomous vehicle in the longitudinal direction is represented by a_0, and the behavioral variable for the autonomous vehicle in Scenarios 1, 2, and 3 is U = a_0.
The longitudinal behavior risk of an autonomous vehicle is defined as follows. For a given scenario, the initial state S and the action of the environmental vehicle (with lane-following behavior characterized by constant speed and lane-change behavior defined by the lane change time t) together form a four-dimensional variable. The action a_0 of the autonomous vehicle is generated according to its behavior model. The actions of both the environmental vehicle and the autonomous vehicle are simulated, and if no collision occurs between the two vehicles, the action risk for the environmental vehicle in this scenario is considered zero. If a collision occurs, the action risk is calculated as the reciprocal of the time to collision (TTC) between the two vehicles, which represents the risk of the environmental vehicle's action (for lane-following, the action is acceleration, and for lane change, the action is lane change time). The shorter the collision time, the higher the risk associated with the action. Assuming a vehicle size of 5 m × 2 m for both the autonomous vehicle and the environmental vehicle, the four-dimensional variable combinations of lane-following and lane-changing actions can be traversed within their respective value ranges, enabling the calculation of the risk for various actions of the environmental vehicle in different scenarios. The longitudinal behavior risk for the autonomous vehicle traveling at a speed of 30 m/s is shown in Figure 2.
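The risk rule described above can be sketched as follows; `sim_step` is a hypothetical callback standing in for the forward simulation of the two behavior models, and the 8 s horizon and 0.05 s step size are assumed values rather than the authors' settings.

```python
def action_risk(sim_step, horizon=8.0, dt=0.05):
    """Risk of one environmental-vehicle action: 0 if no collision occurs in the
    forward simulation, otherwise the reciprocal of the time to collision.

    sim_step(t) is a hypothetical callback that advances both vehicles to time t
    (the ego follows its behavior model, the environmental vehicle executes the
    action under evaluation) and returns True once the 5 m x 2 m bounding boxes overlap."""
    t = dt
    while t <= horizon:
        if sim_step(t):              # first instant at which the two footprints intersect
            return 1.0 / t           # shorter time to collision -> higher risk
        t += dt
    return 0.0                       # no collision within the horizon: zero risk
```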
Scenario 4 represents a unique situation where the autonomous vehicle actively changes lanes. The primary focus of analyzing these four specific interaction forms is to assess the action risk of the vehicle, with the autonomous vehicle’s action being determined by its own intelligent driving algorithm. In this scenario, a classification-based approach is adopted to evaluate the lane change of the autonomous vehicle. Specifically, from the perspective of the environmental vehicle, the vehicle’s front is considered to be in the original lane before crossing the lane line. As the vehicle crosses the lane line, its front is considered to occupy both lanes until the rear of the vehicle completes the lane change, at which point the vehicle is considered to be entirely in the new lane.
Thus, two potential collision forms exist in Scenario 4: one involves a collision with the autonomous vehicle caused by a vehicle in the environment approaching from behind in the new lane during the lane change, and the other involves a collision with a vehicle in the environment in front of the new lane during the lane change. In these cases, the autonomous vehicle’s behavior corresponds to the conditions in Scenario 1 and Scenario 2, respectively, in the longitudinal direction. The associated risks can therefore be evaluated using the same longitudinal driving mode applied in these scenarios.

3. Design of Deep Reinforcement Model

3.1. Problem Description of Autonomous Vehicle Behavior Decision by DRL

Autonomous vehicle behavior decision-making can be modeled as a Markov Decision Process (MDP), which can be addressed using deep reinforcement learning algorithms. In this framework, the autonomous vehicle is considered the decision-making entity, known as the agent. The vehicle interacts with the driving environment, where its actions influence the state of both the agent and the surrounding environment. As the agent makes decisions, the environment responds, producing numerical rewards. The objective of the autonomous vehicle, as an agent, is to maximize the cumulative benefits derived from its actions.
In the MDP framework, the set of possible states of the agent and the environment is denoted as S, the set of actions that the agent can take is represented by A, and the rewards generated from those actions are denoted by R. At time t, the autonomous vehicle observes a specific state s_t ∈ S of the environment and chooses an action a_t ∈ A. At the next time step t + 1, the vehicle receives a reward r_{t+1} ∈ R and the state of the environment is updated to s_{t+1}. Thus, the agent–environment interaction can be represented as a sequence of state–action–reward transitions:
$$s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots, s_T, a_T, r_{T+1}$$
where the subscript denotes the discrete time step, and T is the termination time. At each time step t, the state–action pair (s_t, a_t) leads to a new state and reward pair (s_{t+1}, r_t), with the corresponding probability distribution p that defines the dynamic characteristics of the MDP, i.e.,
$$p(s_{t+1}, r_t \mid s_t, a_t) \doteq P\left(S_t = s_{t+1}, R_t = r_t \mid S_{t-1} = s_t, A_{t-1} = a_t\right)$$
The objective of the MDP is to maximize the cumulative reward, represented by the expected value of the return G_t, i.e.,
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where 0 ≤ γ ≤ 1 is the discount rate. The MDP framework aims to determine the optimal control policy by maximizing the expected cumulative rewards. The policy, denoted by π, defines the probability of selecting a specific action a_t given a state s_t. The value function associated with a policy quantifies the expected return starting from a given state, under the guidance of the policy. Specifically, the state-value function v_π(s_t) represents the expected cumulative reward from state s_t onward, following policy π. This function is mathematically expressed as
$$v_\pi(s_t) = E_\pi\left[G_t \mid S = s_t\right] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S = s_t\right]$$
Given the initial state–action pair s = s_t, a = a_t, the action value function of the corresponding policy π is the expectation of the total return, denoted as q_π(s_t, a_t), quantifying how beneficial or detrimental the action is within the context of the policy, i.e.,
$$q_\pi(s_t, a_t) = \sum_{s_{t+1}, r_t} p(s_{t+1}, r_t \mid s_t, a_t)\left[r_t + \gamma v_\pi(s_{t+1})\right]$$
The optimal action-value function q*(s_t, a_t) corresponds to the maximum expected return achievable from a given state–action pair, and it is defined as
$$q^*(s_t, a_t) = \max_\pi q_\pi(s_t, a_t)$$
If the optimal action value function q* can be obtained, the optimal action can be determined as
$$a^* = \arg\max_a q^*(s, a)$$
By quantifying the driving behavior risk of autonomous vehicles, the numerical value τ of the behavior risk can be used as the reward of the MDP. Therefore, the optimal policy function for autonomous vehicle behavior decisions can be written as
$$\pi^*(s) = \arg\min_\pi E_\pi\left[\sum_{i=0}^{\infty} \gamma^i \tau_{t+i} \,\middle|\, s_t = s\right]$$
Equivalently,
$$\pi^*(s) = \arg\max_\pi E_\pi\left[\sum_{i=0}^{\infty} \gamma^i \left(1 - \tau_{t+i}\right) \,\middle|\, s_t = s\right]$$
Therefore, the optimal action value function is written as
$$q^*(s, a) = \max_\pi E_\pi\left[\sum_{i=0}^{\infty} \gamma^i \left(1 - \tau_{t+i}\right) \,\middle|\, s_t = s, a_t = a\right]$$
A deep reinforcement learning method for solving this MDP can therefore be used to realize the trajectory planning of autonomous vehicles.
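As a small illustration of the objective above, the following sketch accumulates the discounted risk-based return for a recorded sequence of instantaneous risk values τ (the values shown are arbitrary examples, not experimental data).

```python
def discounted_risk_return(taus, gamma=0.99):
    """Discounted return sum_i gamma^i * (1 - tau_{t+i}) for a recorded risk sequence."""
    return sum((gamma ** i) * (1.0 - tau) for i, tau in enumerate(taus))

# A step with high instantaneous risk (tau close to 1) contributes almost no reward.
print(discounted_risk_return([0.0, 0.1, 0.8, 0.0]))
```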

3.2. Deep Reinforcement Learning Method

To determine the optimal action value function q*, a neural network with parameters w is utilized and denoted as q(s, a; w). By training this neural network, an approximation of the optimal action value function can be achieved. The training process aims to minimize the loss, thereby obtaining an action value function with a high degree of approximation. This approach, known as the Deep Q-Network (DQN), is implemented using Temporal Difference (TD) learning. The DQN framework consists of several key components, including the training network, the target network, and the experience replay pool. The experience replay mechanism stores the experiences generated by the agent through interactions with the environment. After selecting and executing an action using the ε-greedy strategy, the resulting reward and next state are stored as training samples. Both the training and target networks initially share the same set of parameters. The training network processes the current state and action from each sample to predict the Q-value of the selected action, while the target network processes the next state to predict the maximum Q-value over all possible actions. The training network adjusts its parameters after each action, while the target network updates its parameters after a fixed number of steps, ensuring more stable training. The DQN framework is shown in Figure 3.
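A minimal PyTorch-style sketch of these components (training network, target network, experience replay pool, and ε-greedy selection) is given below; the layer sizes are illustrative, and the state dimension of 14 and the 103 discrete actions are assumptions consistent with the state and action spaces described in Sections 3.2.1 and 3.2.2, not the authors' exact architecture.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network; layer sizes are illustrative only."""
    def __init__(self, state_dim=14, n_actions=103):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s):
        return self.net(s)

train_net = QNet()
target_net = QNet()
target_net.load_state_dict(train_net.state_dict())   # target starts as a copy of the training net

replay_buffer = deque(maxlen=5000)                    # experience replay pool (size from Table 1)

def select_action(state, epsilon=0.1, n_actions=103):
    """Epsilon-greedy selection over the discrete action set."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(train_net(state.unsqueeze(0)).argmax(dim=1))
```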

3.2.1. Hybrid State Space

The state space consists of two parts: the autonomous vehicle state and the dynamic environment vehicle state.
The input state of the autonomous vehicle is represented as
$$S_t^{ego} = \left[\frac{x_{ego} - x_0}{x_{end} - x_0},\ \frac{y_{ego}}{\sum_{i=1}^{m} LW}\right]$$
where x_ego and y_ego represent the longitudinal and lateral coordinates of the autonomous vehicle's position, x_0 and x_end denote the longitudinal coordinates of the starting and ending points of the road within the driving area, m is the number of lanes from the road's starting point to the side of the driving direction of the autonomous vehicle, and LW represents the width of each lane.
The dynamic environment vehicle feature extraction takes into account the nearest vehicles positioned both in front and behind, as well as in adjacent lanes, within the autonomous vehicle’s perception range. A total of six vehicles are considered. The coordinates and speeds of these vehicles (with non-existent vehicles represented by a zero vector) are then converted into relative values with respect to the autonomous vehicle. This processed information forms the input state for the dynamic environment vehicle model, denoted as
$$S_t^{n} = \left[\frac{x_n - x_{ego}}{r_d},\ \frac{y_n - y_{ego}}{2 LW}\right]$$
where the subscript n represents the index of the environmental vehicle and r_d denotes the perception range used for normalization. If no environmental vehicle is present at a specific location, its corresponding input state is set to a zero vector.
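A sketch of how the hybrid state vector can be assembled from these two equations is shown below; the 3.5 m lane width, the 100 m perception range r_d, and the example coordinates are assumptions for illustration only.

```python
import numpy as np

LW = 3.5      # lane width (m), assumed
R_D = 100.0   # perception range r_d (m), assumed

def ego_state(x_ego, y_ego, x0, x_end, m):
    """Normalized ego state: longitudinal progress and lateral position."""
    return np.array([(x_ego - x0) / (x_end - x0), y_ego / (m * LW)])

def env_state(neighbors, x_ego, y_ego):
    """Six relative-state slots (front/rear in the ego lane and both adjacent lanes);
    missing vehicles remain zero vectors."""
    s = np.zeros((6, 2))
    for k, veh in enumerate(neighbors[:6]):
        if veh is not None:
            x_n, y_n = veh
            s[k] = [(x_n - x_ego) / R_D, (y_n - y_ego) / (2 * LW)]
    return s.flatten()

# Hypothetical two-lane example with three detected neighbors.
state = np.concatenate([
    ego_state(120.0, 5.25, 0.0, 200.0, 2),
    env_state([(150.0, 5.25), None, (140.0, 1.75), None, None, (90.0, 5.25)], 120.0, 5.25),
])
```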
The state sequence of the autonomous vehicle’s state features and dynamic environmental entity features is extracted using a 1D convolutional layer, as shown in Figure 4. The feature extraction network backbone is then constructed through fully connected layers, with the extracted features concatenated and input as the state input for the DQN training network.
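The backbone described above can be sketched in PyTorch as follows; the channel counts, kernel sizes, and output dimension are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class HybridStateEncoder(nn.Module):
    """1D-conv encoder for the environment-vehicle sequence plus an MLP for the ego
    state; the two feature vectors are concatenated as the DQN state input.
    All layer sizes are illustrative assumptions."""
    def __init__(self, ego_dim=2, n_env=6, env_feat=2, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(                  # convolve over the 6-vehicle sequence
            nn.Conv1d(env_feat, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),                           # -> 32 * n_env features
        )
        self.ego_fc = nn.Sequential(nn.Linear(ego_dim, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(32 * n_env + 32, out_dim), nn.ReLU())

    def forward(self, ego, env):
        # ego: (B, 2); env: (B, 6, 2) -> Conv1d expects (B, channels, sequence length)
        f_env = self.conv(env.transpose(1, 2))
        f_ego = self.ego_fc(ego)
        return self.head(torch.cat([f_env, f_ego], dim=1))

enc = HybridStateEncoder()
features = enc(torch.zeros(4, 2), torch.zeros(4, 6, 2))   # (4, 128) feature batch
```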

3.2.2. Action Space

Action space A is designed for left-lane change, right-lane change, and longitudinal driving acceleration.
$$A = \{t_L, a, t_R\}$$
where the sets t_L and t_R contain the optional left and right lane change times, with values ranging from 3 s to 9 s at intervals of 0.2 s, and the set a contains the optional longitudinal accelerations, with values ranging from −2 m/s² to 2 m/s² at intervals of 0.1 m/s².
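Enumerating this discretization gives 103 discrete actions (31 left lane change durations, 41 accelerations, and 31 right lane change durations), as the following sketch shows; the action labels are illustrative.

```python
import numpy as np

# Lane-change durations 3-9 s in 0.2 s steps (left and right) and longitudinal
# accelerations -2 to 2 m/s^2 in 0.1 m/s^2 steps.
lc_times = np.round(np.arange(3.0, 9.0 + 1e-9, 0.2), 1)      # 31 options per direction
accels   = np.round(np.arange(-2.0, 2.0 + 1e-9, 0.1), 1)     # 41 options

actions = ([("lane_change_left", t) for t in lc_times]
           + [("keep_lane", a) for a in accels]
           + [("lane_change_right", t) for t in lc_times])
print(len(actions))   # 103 discrete actions under this discretization
```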

3.2.3. Reward Function Design

The objective of this paper is to identify the driving behavior that minimizes vehicle risk, necessitating the incorporation of risk assessment results into the reward function of the DRL approach. To achieve this, a weighted combination of vehicle behavioral risk, driving efficiency, and driving comfort rewards is employed. These three types of rewards can be expressed as follows:
$$r_r = \sum_{t=1}^{T} \left(1 - \tau_t\right), \qquad r_e = \frac{1}{T}\sum_{t=1}^{T} \frac{v_t}{v^*}, \qquad r_c = -\left|a_{t+1} - a_t\right|$$
where r_r is the behavior risk reward, r_e is the vehicle driving efficiency reward, r_c is the driving comfort reward, τ_t is the vehicle behavior risk value at time t, T is the total time for the vehicle to complete the driving task, v_t is the longitudinal speed of the vehicle at time t, v* is the set cruising speed, and a_t is the longitudinal acceleration at time t. The weighted combination of the three rewards is
$$r_a = \omega_r r_r + \omega_e r_e + \omega_c r_c$$
where ω_r, ω_e, and ω_c are the weighting coefficients of the three rewards, respectively. Additionally, the total reward function incorporates the vehicle's progress in completing the desired trajectory during the training process. Specifically, if the vehicle successfully completes the specified distance without incident, a completion reward is awarded; otherwise, a penalty for non-completion is imposed. The reward function is set as
$$r = \begin{cases} 1 + r_a, & \text{complete} \\ -20, & \text{otherwise} \end{cases}$$
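A sketch of this reward computation is given below; the weights follow Table 1, while the cruising speed v* = 15 m/s, the comfort sign convention, and the −20 penalty reflect the reconstruction above and should be read as assumptions.

```python
def episode_reward(taus, speeds, accels, v_star=15.0,
                   w_r=0.5, w_e=1.0, w_c=0.1, completed=True):
    """Weighted sum of risk, efficiency and comfort rewards plus the completion term.
    Weights follow Table 1; v_star, the comfort sign convention and the -20 penalty
    are assumptions consistent with the reconstruction above."""
    T = len(speeds)
    r_risk = sum(1.0 - tau for tau in taus)                     # low risk -> high reward
    r_eff = sum(v / v_star for v in speeds) / T                 # cruising-speed tracking
    r_comf = -sum(abs(accels[t + 1] - accels[t])                # penalize abrupt acceleration changes
                  for t in range(len(accels) - 1))
    r_a = w_r * r_risk + w_e * r_eff + w_c * r_comf
    return 1.0 + r_a if completed else -20.0

print(episode_reward([0.0, 0.1, 0.0], [14.0, 15.0, 15.5], [0.5, 0.2, 0.0]))
```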

3.2.4. Implement Step and Training Parameter

The steps of using the DQN algorithm to train vehicle behavior decisions are described in Algorithm 1.
Algorithm 1. DQN implementation process
Input: Replay buffer size D, network update interval N, discount factor γ , learning rate α , reward function, state space, action space.
Output: Parameters of training network and target network.
  • Initialize the training network parameters w and the target network parameters w′ = w
  • For i = 1, …, D do
    a) Obtain the environment and vehicle state information as the input of the state space, s_t
    b) For t = 1, …, T do
      • Pick a random action a_t with a small probability ε; otherwise a_t = argmax_a q(s_t, a; w)
      • Perform behavior a_t, obtain the reward r_t and the new environment state s_{t+1}
      • Store (s_t, a_t, r_t, s_{t+1}) in the replay buffer
      • Randomly select a batch (s_j, a_j, r_j, s_{j+1}) from the replay buffer and calculate the TD target y_j = r_j + γ max_a q(s_{j+1}, a; w′)
      • Use gradient descent to update the parameters: w ← w − α (q(s_j, a_j; w) − y_j) ∂q(s_j, a_j; w)/∂w
      • Copy the network parameters w to w′ every N steps
    c) End for
  • End for
Each iteration in the algorithm updates the neural network parameters to learn the optimal action value, thus enabling the approximation of the action value function. Table 1 outlines the hyperparameter settings used during the algorithm’s training process.
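One update step of Algorithm 1 can be sketched in PyTorch as follows; it reuses the train_net, target_net, and replay_buffer objects from the sketch in Section 3.2, assumes each buffer entry is a (state tensor, action index, reward, next-state tensor) tuple, and omits terminal-state handling for brevity.

```python
import random
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(train_net.parameters(), lr=1e-3)   # learning rate alpha (Table 1)
GAMMA, BATCH, N_UPDATE = 0.99, 128, 5

def dqn_update(step):
    """One Algorithm 1 update on a random minibatch from the replay buffer."""
    batch = random.sample(replay_buffer, BATCH)
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.long)
    r = torch.tensor(r, dtype=torch.float32)
    q_sa = train_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # q(s_j, a_j; w)
    with torch.no_grad():                                       # TD target from the target network w'
        y = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()                                             # gradient descent on w
    optimizer.step()
    if step % N_UPDATE == 0:                                    # copy w to w' every N steps
        target_net.load_state_dict(train_net.state_dict())
```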

4. Experiment and Discussion

The intelligent driving simulation platform CARLA provides standardized road networks, various vehicle control models, and precise sensor data, making it an ideal environment provider for the deep reinforcement learning framework employed in this paper. The proposed autonomous vehicle behavior decision-making algorithm is trained within the CARLA simulator. The environmental vehicle control model utilizes the built-in algorithm of the simulator, and the initial positions of both the autonomous vehicle and surrounding traffic participants are randomly assigned within a predefined area for each training batch. The vehicle begins from its starting position, and the driving process ends either with a collision or after a distance of 200 m has been traveled. The reward during the training process is recorded in Figure 5, and analysis of the reward curve indicates that the proposed algorithm converges rapidly.
The training results are evaluated across various traffic scenarios, which include different road configurations and environmental vehicle driving behaviors. The autonomous vehicle successfully completes driving tasks in these diverse driving conditions. In Scenario 1, the road is a one-way, two-lane layout, where three randomly generated environmental vehicles are all traveling at a constant speed of 10 m/s, with no vehicles driving side by side. The autonomous vehicle starts at the center of the right lane, positioned behind one of the environmental vehicles, with an initial speed of 15 m/s. Its target cruising speed is also set to 15 m/s. In Scenario 2, the road is a one-way, four-lane configuration, with multiple randomly generated environmental vehicles driven by CARLA's built-in autonomous driving model to create a traffic flow. The positions and velocities of the environmental vehicles are generated randomly: the initial velocities are randomly selected between 10 and 20 m/s, the initial longitudinal positions are randomly selected within 50 m in front of the autonomous vehicle, and the lateral positions are set to the center of a randomly chosen lane. The autonomous vehicle starts at the centerline of a random lane at the rear of the generated traffic flow. Some typical cases from both scenarios are shown in Figure 6, Figure 7, Figure 8 and Figure 9.
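For reference, environment vehicles of this kind can be spawned and handed to CARLA's built-in driving model roughly as follows. This is an illustrative sketch based on the public CARLA Python API (the host, port, and number of spawned vehicles are assumptions), not the scenario-generation code used in the experiments; the randomized initial speeds described above are not assigned here and are left to the autopilot.

```python
import random
import carla

# Connect to a locally running CARLA server (host/port assumed).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

blueprints = world.get_blueprint_library().filter("vehicle.*")
spawn_points = world.get_map().get_spawn_points()
random.shuffle(spawn_points)

env_vehicles = []
for sp in spawn_points[:6]:                       # a handful of environment vehicles
    bp = random.choice(blueprints)
    vehicle = world.try_spawn_actor(bp, sp)       # returns None if the spot is occupied
    if vehicle is not None:
        vehicle.set_autopilot(True)               # hand control to CARLA's built-in driver
        env_vehicles.append(vehicle)
```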
In the two experimental scenarios, the results demonstrate that the behavior decision-making method based on driving risk, as proposed in this paper, allows autonomous vehicles to make effective decisions in complex driving environments. To evaluate the impact of the behavior risk value on improving the behavior strategy of autonomous vehicles, a comparative experiment was conducted, with the results presented in Table 2. The comparison includes the baseline method (where vehicle behavior is entirely randomly generated), numerical-optimization-based EM-Planner method, deep-learning-based LSTM method, DRL-based DQN method with the behavior risk input removed, and DRL-based DDPG method. The evaluation metrics for the behavior decision-making algorithms are the average driving distance x , the standard deviation of the driving distance σ , the number of collisions with other vehicles or road edges NoC, and average computation time ACT (unit: ms), all computed over multiple random experimental scenarios.
The comparison between the baseline and the other methods highlights that numerical optimization and learning-based methods significantly enhance the safety and precision of autonomous vehicle behavior decision-making. The numerical optimization method shows excellent reliability, but its disadvantage is that searching the feasible solution space consumes a large amount of computation time, so its real-time performance is unsatisfactory. Compared with reinforcement learning methods, deep learning methods rely heavily on the input dataset, and their computation time is strongly correlated with the complexity of the network design, so their generalization is poor. The advantages of the DRL method based on driving risk are evident in the comparison. Firstly, in terms of computation time, the DRL method is significantly better than the numerical optimization and deep learning methods. Secondly, comparing the proposed method with the DQN method without vehicle behavior risk and with the DDPG method, which has a more complex network structure, shows that incorporating behavior risk further improves decision-making performance. The integration of behavior risk allows the autonomous vehicle to assess potentially hazardous actions that could lead to collisions at each decision point, thus promoting safer behavior patterns across various driving conditions. This, in turn, enhances the overall safety of the autonomous vehicle. While the reward function in traditional DRL methods primarily penalizes collisions with a large negative reward, the DRL method based on vehicle behavior risk presented in this paper applies negative rewards to dangerous actions from the outset of each decision-making cycle. This results in improved overall driving performance throughout the vehicle's driving process. Compared with the advanced DDPG algorithm, the DQN method with driving risk proposed in this paper improves the driving distance of the autonomous vehicle by 3.3%, safety by 2.1%, and computation time by 43% in the experiments.

5. Conclusions

This paper presents a deep reinforcement learning method based on a hybrid state space and driving risk for autonomous vehicle behavior decision-making. The goal is to enable autonomous vehicles to make decisions that minimize instantaneous risk based on real-time traffic conditions. The proposed numerical calculation method for behavior risk allows the autonomous vehicle to assess the danger level of its actions at each decision-making point, which is integrated into the reward function of the DRL model. Experimental results demonstrate that the proposed DRL approach effectively guides autonomous vehicles to adopt both efficient and safe driving behaviors. The key advantages of the proposed method are as follows: (1) the reward function accounts for both driving risk and efficiency, (2) the autonomous vehicle gains awareness of potentially dangerous behaviors that may lead to collisions, allowing it to adopt safer driving strategies under varying conditions, and (3) the DRL approach directly applies negative rewards to dangerous actions at each decision point, thereby improving the overall driving performance of the autonomous vehicle. Future work will focus on comparing different methods in terms of efficiency, computing resource usage, and practical performance in real traffic environments.

Author Contributions

Conceptualization, B.Q.; methodology, X.W.; software, B.Q.; validation, B.Q.; formal analysis, J.Z.; investigation, J.Z.; resources, J.Z.; data curation, J.Z.; writing—original draft preparation, X.W.; writing—review and editing, X.W.; visualization, B.Q.; supervision, W.L.; project administration, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the NSFC (National Natural Science Foundation of China) under grants No. 52375131 and No. 52275574.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, J.W.; Zhang, X.Q.; Park, S.H.; Guo, K. The Alleviation of Perceptual Blindness During Driving in Urban Areas Guided by Saccades Recommendation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16386–16396. [Google Scholar] [CrossRef]
  2. Bathla, G.; Bhadane, K.; Singh, R.K.; Kumar, R.; Aluvalu, R.; Krishnamurthi, R.; Kumar, A.; Thakur, R.N.; Basheer, S. Autonomous Vehicles and Intelligent Automation: Applications, Challenges, and Opportunities. Mob. Inf. Syst. 2022, 2022, 7632892. [Google Scholar] [CrossRef]
  3. Chan, T.K.; Chin, C.S. Review of Autonomous Intelligent Vehicles for Urban Driving and Parking. Electronics 2021, 10, 1021. [Google Scholar] [CrossRef]
  4. Negash, N.M.; Yang, J.M. Driver Behavior Modeling Toward Autonomous Vehicles: Comprehensive Review. IEEE Access 2023, 11, 22788–22821. [Google Scholar] [CrossRef]
  5. Rong, S.S.; Meng, R.F.; Guo, J.H.; Cui, P.F.; Qiao, Z. Multi-Vehicle Collaborative Planning Technology under Automatic Driving. Sustainability 2024, 16, 4578. [Google Scholar] [CrossRef]
  6. Ignatious, H.A.; El-Sayed, H.; Khan, M.A.; Mokhtar, B.M. Analyzing Factors Influencing Situation Awareness in Autonomous Vehicles-A Survey. Sensors 2023, 23, 4075. [Google Scholar] [CrossRef]
  7. Bagdatli, M.E.C.; Dokuz, A.S. Modeling discretionary lane-changing decisions using an improved fuzzy cognitive map with association rule mining. Transp. Lett. 2021, 13, 623–633. [Google Scholar] [CrossRef]
  8. Karle, P.; Geisslinger, M.; Betz, J.; Lienkamp, M. Scenario Understanding and Motion Prediction for Autonomous Vehicles-Review and Comparison. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16962–16982. [Google Scholar] [CrossRef]
  9. Abdallaoui, S.; Ikaouassen, H.; Kribèche, A.; Chaibet, A.; Aglzim, E. Advancing autonomous vehicle control systems: An in-depth overview of decision-making and manoeuvre execution state of the art. J. Eng.-JOE 2023, 2023, e12333. [Google Scholar] [CrossRef]
  10. Khelfa, B.; Ba, I.; Tordeux, A. Predicting highway lane-changing maneuvers: A benchmark analysis of machine and ensemble learning algorithms. Phys. A 2023, 612, 16. [Google Scholar] [CrossRef]
  11. Yu, Y.W.; Luo, X.; Su, Q.M.; Peng, W.K. A dynamic lane-changing decision and trajectory planning model of autonomous vehicles under mixed autonomous vehicle and human-driven vehicle environment. Phys. A 2023, 609, 22. [Google Scholar] [CrossRef]
  12. Long, X.Q.; Zhang, L.C.; Liu, S.S.; Wang, J.J. Research on Decision-Making Behavior of Discretionary Lane-Changing Based on Cumulative Prospect Theory. J. Adv. Transp. 2020, 2020, 1291342. [Google Scholar] [CrossRef]
  13. Wang, C.; Sun, Q.Y.; Li, Z.; Zhang, H.J. Human-Like Lane Change Decision Model for Autonomous Vehicles that Considers the Risk Perception of Drivers in Mixed Traffic. Sensors 2020, 20, 2259. [Google Scholar] [CrossRef] [PubMed]
  14. Jain, G.; Kumar, A.; Bhat, S.A. Recent Developments of Game Theory and Reinforcement Learning Approaches: A Systematic Review. IEEE Access 2024, 12, 9999–10011. [Google Scholar] [CrossRef]
  15. Wang, J.W.; Chu, L.; Zhang, Y.; Mao, Y.B.; Guo, C. Intelligent Vehicle Decision-Making and Trajectory Planning Method Based on Deep Reinforcement Learning in the Frenet Space. Sensors 2023, 23, 9819. [Google Scholar] [CrossRef] [PubMed]
  16. Ahmad, F.; Shah, Z.; Al-Fagih, L. Applications of evolutionary game theory in urban road transport network: A state of the art review. Sustain. Cities Soc. 2023, 98, 104791. [Google Scholar] [CrossRef]
  17. Chen, C.Y.; Seff, A.; Kornhauser, A.; Xiao, J.X. DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2722–2730. [Google Scholar]
  18. Xu, H.Z.; Gao, Y.; Yu, F.; Darrell, T. End-to-end Learning of Driving Models from Large-scale Video Datasets. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3530–3538. [Google Scholar]
  19. Müller, M.; Dosovitskiy, A.; Ghanem, B.; Koltun, V. Driving policy transfer via modularity and abstraction. arXiv 2018, arXiv:1804.09364. [Google Scholar]
  20. Kendall, A.; Hawke, J.; Janz, D.; Mazur, P.; Reda, D.; Allen, J.M.; Shah, A. Learning to drive in a day. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8248–8254. [Google Scholar]
  21. Hu, H.Y.; Lu, Z.Y.; Wang, Q.; Zheng, C.Y. End-to-End Automated Lane-Change Maneuvering Considering Driving Style Using a Deep Deterministic Policy Gradient Algorithm. Sensors 2020, 20, 443. [Google Scholar] [CrossRef]
  22. Gao, Z.H.; Yan, X.T.; Gao, F.; He, L. Driver-like decision-making method for vehicle longitudinal autonomous driving based on deep reinforcement learning. Proc. Inst. Mech. Eng. Part D-J. Automob. Eng. 2022, 236, 3060–3070. [Google Scholar] [CrossRef]
  23. Cao, J.Q.; Wang, X.L.; Wang, Y.S.; Tian, Y.X. An improved Dueling Deep Q-network with optimizing reward functions for driving decision method. Proc. Inst. Mech. Eng. Part D-J. Automob. Eng. 2023, 237, 2295–2309. [Google Scholar] [CrossRef]
  24. Liao, J.D.; Liu, T.; Tang, X.L.; Mu, X.Y.; Huang, B.; Cao, D.P. Decision-Making Strategy on Highway for Autonomous Vehicles Using Deep Reinforcement Learning. IEEE Access 2020, 8, 177804–177814. [Google Scholar] [CrossRef]
  25. Deng, H.F.; Zhao, Y.Q.; Wang, Q.W.; Nguyen, A.T. Deep Reinforcement Learning Based Decision-Making Strategy of Autonomous Vehicle in Highway Uncertain Driving Environments. Automot. Innov. 2023, 6, 438–452. [Google Scholar] [CrossRef]
  26. Tampuu, A.; Matiisen, T.; Semikin, M.; Fishman, D.; Muhammad, N. A Survey of End-to-End Driving: Architectures and Training Methods. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1364–1384. [Google Scholar] [CrossRef] [PubMed]
  27. Liang, X.D.; Wang, T.R.; Yang, L.N.; Xing, E.R. CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-driving. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 604–620. [Google Scholar]
  28. Liu, Q.; Li, X.Y.; Yuan, S.H.; Li, Z.R. Decision-Making Technology for Autonomous Vehicles: Learning-Based Methods, Applications and Future Outlook. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 30–37. [Google Scholar]
  29. Liu, X.C.; Hong, L.; Lin, Y.R. Vehicle Lane Change Models-A Historical Review. Appl. Sci. 2023, 13, 12366. [Google Scholar] [CrossRef]
  30. Teng, S.; Hu, X.; Deng, P.; Li, B.; Li, Y.; Ai, Y.; Yang, D.; Li, L.; Xuanyuan, Z.; Zhu, F.; et al. Motion Planning for Autonomous Driving: The State of the Art and Future Perspectives. IEEE Trans. Intell. Veh. 2023, 8, 3692–3711. [Google Scholar] [CrossRef]
  31. Wang, X.; Li, B.; Su, X.; Peng, H.; Wang, L.; Lu, C. Autonomous dispatch trajectory planning on flight deck: A search-resampling-optimization framework. Eng. Appl. Artif. Intell. 2023, 119, 105792. [Google Scholar] [CrossRef]
  32. Zhang, D.X.; Jiao, X.H.; Zhang, T. Lane-changing and overtaking trajectory planning for autonomous vehicles with multi-performance optimization considering static and dynamic obstacles. Robot. Auton. Syst. 2024, 182, 104797. [Google Scholar] [CrossRef]
  33. Mu, Z.Y.; Jahedinia, F.; Park, B.B. Does the Intelligent Driver Model Adequately Represent Human Drivers? In Proceedings of the 9th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS), Prague, Czech Republic, 26–28 April 2023; pp. 113–121. [Google Scholar]
  34. Zhang, Y.; Sun, H.; Zhou, J.; Pan, J.; Hu, J.; Miao, J. Optimal Vehicle Path Planning Using Quadratic Optimization for Baidu Apollo Open Platform. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 978–984. [Google Scholar]
  35. Chu, L.; Wang, J.; Cao, Z.; Zhang, Y.; Guo, C. A Human-Like Free-Lane-Change Trajectory Planning and Control Method With Data-Based Behavior Decision. IEEE Access 2023, 11, 121052–121063. [Google Scholar] [CrossRef]
  36. Liu, W.; Xiang, Z.; Fang, H.; Huo, K.; Wang, Z. A Multi-Task Fusion Strategy-Based Decision-Making and Planning Method for Autonomous Driving Vehicles. Sensors 2023, 23, 7021. [Google Scholar] [CrossRef]
Figure 1. Potential collision under different behavior modes of an autonomous vehicle.
Figure 2. Risk of different actions of an autonomous vehicle in the case of lane following and lane change when the speed is 30 m/s: (a) action risk of car following; (b) action risk of lane change.
Figure 3. DQN framework.
Figure 4. Feature extraction network backbone.
Figure 5. Training process reward curve.
Figure 6. Bird's-eye view of two-lane scenario road test (typical case for scenario 1).
Figure 7. Kinematic curve of two-lane scenario road test (typical case for scenario 1).
Figure 8. Bird's-eye view of four-lane scenario road test (typical case for scenario 2).
Figure 9. Kinematic curve of four-lane scenario road test (typical case for scenario 2).
Table 1. Algorithm hyperparameter setting.

Hyperparameter                        Symbol   Value
discount factor                       γ        0.99
learning rate                         α        0.001
replay buffer size                    D        5000
network update interval               N        5
batch size                            –        128
weighting coefficient of risk         ω_r      0.5
weighting coefficient of efficiency   ω_e      1
weighting coefficient of comfort      ω_c      0.1
Table 2. Performance evaluation indicators of different behavioral decision-making methods.

Indicator           Scenario 1—Two Lanes                 Scenario 2—Four Lanes
                    x (m)   σ (m)   NoC   ACT (ms)       x (m)   σ (m)   NoC   ACT (ms)
Baseline            15.3    8.2     100   –              19.5    9.2     100   –
EM-Planner [34]     170.7   4.2     5     109.3          206.9   8.5     7     124.7
LSTM [35]           142.8   11.2    18    52.9           168.4   11.4    16    53.5
DQN-no risk         62.5    14.7    57    2.84           73.2    11.8    44    2.88
DDPG [36]           174.8   3.9     4     4.96           210.5   8.1     5     5.13
DQN (this paper)    179.5   3.3     2     2.85           218.7   7.8     3     2.93
