Article

Generalized Single-Vehicle-Based Graph Reinforcement Learning for Decision-Making in Autonomous Driving

Fan Yang, Xueyuan Li, Qi Liu, Zirui Li and Xin Gao
1 School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 Department of Transport and Planning, Faculty of Civil Engineering and Geosciences, Delft University of Technology, Stevinweg 1, 2628 CN Delft, The Netherlands
* Author to whom correspondence should be addressed.
Sensors 2022, 22(13), 4935; https://doi.org/10.3390/s22134935
Submission received: 24 May 2022 / Revised: 28 June 2022 / Accepted: 28 June 2022 / Published: 29 June 2022

Abstract

In the autonomous driving process, the decision-making system is mainly used to provide macro-control instructions based on the information captured by the sensing system. Learning-based algorithms have apparent advantages in information processing and understanding for an increasingly complex driving environment. To incorporate the interactive information between agents in the environment into the decision-making process, this paper proposes a generalized single-vehicle-based graph neural network reinforcement learning algorithm (SGRL algorithm). The SGRL algorithm introduces graph convolution into the traditional deep Q network (DQN) algorithm, adopts a single-agent training method, designs a more explicit incentive reward function, and significantly expands the dimension of the action space. The SGRL algorithm is compared with the traditional DQN algorithm (NGRL) and the multi-agent training algorithm (MGRL) in a highway ramp scenario. Results show that the SGRL algorithm has outstanding advantages in network convergence, decision-making effect, and training efficiency.

1. Introduction

In autonomous driving, the decision-making system mainly produces high-level vehicle actions such as lane changing, acceleration, and braking. Tactical decision-making for autonomous driving is challenging due to the diversity of environments, the uncertainty in sensor information, and the complex interaction with other road users [1,2]. Traditional vehicle trajectory modeling and tracking control methods, such as genetic algorithms, neural networks, and their optimized variants, have played a positive role in decision-making research [3].
The operational space of an autonomous vehicle (AV) is diverse and can vary significantly, so formulating a rule-based decision-maker for selecting driving maneuvers may not be ideal [4]. With the development of deep learning, reinforcement learning (RL) has become a robust learning framework capable of learning complex policies in high-dimensional environments [5]. Therefore, using reinforcement learning to solve decision-making problems has gradually become the mainstream of research. Carl-Johan Hoel et al. introduced a method based on deep reinforcement learning for automatically generating a general-purpose decision-making function [6]; they trained an RL agent to handle the speed and lane change decisions of a truck–trailer combination in a simulated environment. Hongbo Gao et al. solved sequential decision optimization problems with an inverse reinforcement learning algorithm, and the proposed method was verified in terms of efficiency [7]. Some algorithms for planning and decision-making are based on analyzing driver behavior [8,9].
Decision-making relies on understanding and analyzing high-dimensional environmental information. However, traditional reinforcement learning algorithms only make good decisions from low-dimensional features [10] and struggle to understand more complex scenario information. The deep neural network (DNN) has a strong ability to learn representations and to generalize patterns from high-dimensional data. Therefore, deep reinforcement learning (DRL) algorithms are effective in tasks requiring feature representation and policy learning, e.g., autonomous driving decision-making [11]. Using the function approximation ability of a DNN, intelligent controllers integrating artificial intelligence technologies such as deep learning (DL) and reinforcement learning (RL) have been designed for lane keeping and obstacle avoidance [12,13,14], decision-making [15,16,17,18,19,20,21], longitudinal control [22], merging maneuvers [23], human-like driving strategies [24,25,26], and other large-scale autonomous driving control tasks. Yingjun Ye et al. put forward a framework for decision-making training and learning that consists of a deep reinforcement learning (DRL) training program and a high-fidelity virtual simulation environment [27]. Compared with the value-based DQN algorithm, the deep deterministic policy gradient (DDPG) algorithm, which is based on an action policy, can handle continuous action spaces. Haifei Zhang et al. used the DDPG algorithm to solve the automatic driving control problem based on a reasonable reward function, a deep convolutional network, and an exploration policy [28].
Interaction between vehicles is necessary and pervasive in ordinary public traffic scenarios. However, there are relatively few studies on how autonomous vehicles interact in public environments with reinforcement learning. Realizing coordination between vehicles in a shared environment is challenging due to the unique features of vehicular mobility, which make it infeasible to apply existing reinforcement learning methods directly. Chao Yu et al. proposed a dynamic coordination graph to model the continuously changing topology during vehicles' interactions and developed two basic learning approaches to coordinate the driving maneuvers of a group of vehicles [29].
Cooperative vehicle–infrastructure systems and automatic driving technology are developing rapidly [30]. The graph neural network (GNN) has gained increasing popularity in various domains, including social network analysis [31,32]. GNNs can extract relational data representations and generate useful node embeddings from a node's own features and the features of its neighboring nodes. The interactions between the ego vehicle and other surrounding vehicles can also be represented by a dynamic potential field (DPF) and embedded in a gap acceptance model to ensure safety and personalization during driving [33]. Jiqian Dong et al. proposed a novel deep reinforcement learning (DRL)-based approach combining a graph convolutional network and a deep Q network (DQN), namely the graph convolution Q network (GCQ), as the information fusion module and decision processor [34]. The proposed model can aggregate the information obtained by collaborative perception and output collaborative lane change decisions for multiple vehicles. Even in highly dynamic and partially observed mixed traffic, the driving intention can be satisfied.
However, the above multi-agent training-based GNN reinforcement learning (MGRL) exhibits the following problems in actual verification (a highway ramp-exiting scenario):
1. Training multiple agents simultaneously increases the computational complexity of the network, resulting in a higher overall training time cost. Each parameter modification therefore requires a longer verification time, which hinders the development and tuning of the algorithm.
2. Rewards and punishments offset each other when all agents are trained jointly, resulting in poor network convergence. Because the agents' reward values influence one another, the current state cannot be evaluated accurately, and the loss curve fluctuates severely.
3. In tests of the final trained model, the task success rate of the GCQ algorithm remains around 50%, which cannot meet the basic driving needs of vehicles.
4. The GCQ decision-making model only handles lateral lane changes and cannot control the vehicle's longitudinal behavior at the same time. This incomplete driving control model leads to a low success rate, serious collisions, and low traffic efficiency.
Given the above problems, this paper proposes an improved single-agent GNN-based RL algorithm (the SGRL algorithm). The contributions of this paper are as follows:
1. While retaining GNN-based interactive feature extraction, the training object is changed from multiple agents to a single agent. Through internal processing in the network model, the single-agent training results can be applied to multi-agent application scenarios. This training method significantly reduces the time cost of model training and eliminates the mutual offsetting of rewards and punishments between agents, improving the training effect.
2. An inductive reward function is designed to improve the convergence speed of the model. The reward function incorporates driving intention, collision, lane change frequency, and vehicle speed into the calculation. The trained model performs well simultaneously in terms of task success rate, safety, driving stability, and traffic efficiency.
3. The dimension of the action space is increased by changing the network structure. Adding longitudinal velocity control gives the model more robust control over task success rate, safety, and traffic efficiency.
4. In the training process, real-time screening of stored data improves the training speed. The random generation of the number, type, and position of vehicles in the training scenario improves the model's generalization ability and avoids overfitting.
The paper is organized as follows. Section 2 introduces the proposed SGRL algorithm in detail. Section 3 shows the model training and testing of the SGRL algorithm and comparison algorithms (MGRL and NGRL). Section 4 shows the results of training and testing and analyzes the comparison. Finally, Section 5 derives the conclusions and proposes future improvement directions.

2. Method

The proposed SGRL method models decision-making as a Markov Decision Process (MDP). The agents explore the environment by observing states, taking actions, and receiving rewards, as shown in Figure 1.
The core of the SGRL implementation lies in the preprocessing of the input data, the structure of the deep neural network, and the design of the output.

2.1. Network Input

At each time step in the training process, the vehicles in the scenario are divided into human-driven vehicles (HVs) and autonomous vehicles (AVs). At each step $t$, the input of the training model consists of two parts: the feature matrix $X_t$ and the correlation matrix $C_t$, with the state $s_t = (X_t, C_t)$. The feature matrix $X_t$ contains the necessary basic information for highway scenarios. The total number of human-driven vehicles (HVs) and autonomous vehicles (AVs) is set to $N = N_{HV} + N_{AV}$, so the matrix has the specification $N \times 8$. The eight characteristic parameters of each vehicle describe its speed, longitudinal position, lateral lane position, and destination intention, denoted as $x_i = (v_i, p_i, l_i, I_i)$. To normalize the data to the same scale, the parameters are set as follows:
  • $v_i = v_i^{current} / v_{\max}$ is the relative speed, where $v_{\max} = \max(v_{\max}^{vehicle}, v_{\max}^{highway})$ is the maximum speed for the current vehicle.
  • $p_i = x_i / l_{highway}$ is the relative longitudinal position normalized by the entire length of the highway. (The algorithm belongs to local planning. Its application scenarios are specific ramps and expressway sections within short distances, so normalizing the location information is more favorable for the network computation.)
  • $l_i$ is the lateral lane position of vehicle $i$, represented by three-bit binary coding. For example, vehicles in the rightmost lane are denoted as $[1, 0, 0]$, the middle lane $[0, 1, 0]$, and the leftmost lane $[0, 0, 1]$.
  • $I_i$ is the destination intention feature of vehicle $i$; its data type is the same as $l_i$. The three destinations are expressed as $[1, 0, 0]$ (exit from the first ramp), $[0, 1, 0]$ (exit from the second ramp), and $[0, 0, 1]$ (go straight along the highway).
Based on the GNN, a correlation matrix $C_t$ is introduced to calculate the relationship between vehicle nodes. Considering that the sensors installed on autonomous vehicles have fixed sensing ranges, only vehicles within the sensing range are recorded in the correlation matrix. $C_t$ is a square matrix with specification $N \times N$. Row $i$ of the matrix represents, in binary form, the relationship between vehicle $i$ and the other vehicles: the value $C_{ij}$ is set to 1 if vehicle $j$ is within the set distance of the target vehicle $i$, as shown in Figure 2.
Because the number of vehicles in the scene is dynamic, this situation is accounted for when setting up the feature matrix and the correlation matrix. The vehicles in the environment are divided into AVs and HVs, which occupy the upper and lower parts of the feature matrix, respectively. When the number of vehicles is less than $N$, the remaining positions in the matrix are filled with zeros.
To ensure the authenticity of the simulation process, vehicles appear in and disappear from the scenario successively, so the feature and correlation information obtained at step $t$ may not fill the predetermined matrix size. To keep the GNN computation standard, the matrices must be padded, and matrix segmentation prevents the data confusion and incorrect correspondences that random packing would cause, as shown in Figure 3. A minimal sketch of this input construction is given below.
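As a concrete illustration, the following minimal NumPy sketch builds the feature matrix $X_t$ and correlation matrix $C_t$ from simple per-vehicle records. The slot counts, the sensing range, the dictionary keys, and the helper names are illustrative assumptions, not the authors' implementation; the sensing check also uses only the longitudinal gap for simplicity.

```python
import numpy as np

N = 20                 # assumed total number of matrix slots
N_AV_SLOTS = 10        # assumed size of the AV block (upper rows); HVs fill the lower rows
SENSE_RANGE = 100.0    # assumed sensor range in meters

LANE_CODE = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]}     # rightmost / middle / leftmost lane
INTENT_CODE = {"ramp_0": [1, 0, 0], "ramp_1": [0, 1, 0], "straight": [0, 0, 1]}

def build_inputs(avs, hvs, v_max, l_highway):
    """avs/hvs: lists of dicts with keys 'speed', 'x' (longitudinal position in m),
    'lane' and 'intent'. Returns the N x 8 feature matrix and the N x N correlation matrix."""
    X = np.zeros((N, 8))                  # empty slots stay zero-padded
    pos = np.full(N, np.nan)              # absolute longitudinal positions, NaN for empty slots
    slots = list(enumerate(avs)) + list(enumerate(hvs, start=N_AV_SLOTS))
    for slot, veh in slots:
        X[slot, 0] = veh["speed"] / v_max          # relative speed v_i
        X[slot, 1] = veh["x"] / l_highway          # relative longitudinal position p_i
        X[slot, 2:5] = LANE_CODE[veh["lane"]]      # three-bit lane code l_i
        X[slot, 5:8] = INTENT_CODE[veh["intent"]]  # three-bit destination intention I_i
        pos[slot] = veh["x"]

    C = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if not (np.isnan(pos[i]) or np.isnan(pos[j])):
                C[i, j] = 1.0 if abs(pos[i] - pos[j]) <= SENSE_RANGE else 0.0
    return X, C
```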

2.2. Network Structure

In the overall network structure, the function of the graph convolution is to obtain the interaction information between vehicles, and the function of the fully connected layers is to parse the matrix information. Each newly obtained matrix in the network is fed into a fully connected layer for recording and information analysis. In addition, the number of nodes in each layer plays a decisive role in the model's performance; the optimal number of nodes is selected through comparative experiments.
Data records for comparative tests are shown in Table 1.
At each step $t$, the feature matrix $X_t$ is first fed to a fully connected network (FCN) encoder $\psi$, which yields the matrix $H_t \in \mathbb{R}^{N \times 128}$ (Equation (1)). The original eight characteristic values are recoded into 128 values, and the FCN is executed twice consecutively.
$H_t = \psi(X_t) \in \mathbb{R}^{N \times 128} \quad (1)$
Next is the graph convolution. The inputs of the convolution function are the newly obtained feature matrix $H_t$ and the correlation matrix $C_t$. The output matrix $K_t$ has the same specification as $H_t$ (Equation (2)).
$K_t = \chi(H_t, C_t) \in \mathbb{R}^{N \times 128} \quad (2)$
The original feature information and the new information obtained by graph convolution are fused by matrix concatenation and serve as the total input of the subsequent neural network $\eta$.
The feature dimensions at each stage change as follows:
  • Fully Connected Network (FCN): Dense(8) → Dense(128) → Dense(128);
  • Graph Neural Network (GNN): Dense(128) → Dense(128) → Dense(128);
  • Q Network: Dense(256) → Dense(128) → Dense(128);
  • Output: Dense(128) → Dense(33).
The overall structure of the SGRL algorithm is shown in Figure 4.
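A compact PyTorch sketch of this structure, following the layer densities listed above, is shown below. The row-normalization of the correlation matrix, the ReLU activations, and the class and argument names are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SGRLNet(nn.Module):
    """Sketch of the SGRL network: FCN encoder -> graph convolution -> Q head."""
    def __init__(self, n_features=8, hidden=128, n_actions=33):
        super().__init__()
        # FCN encoder psi: Dense(8) -> Dense(128) -> Dense(128)
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Graph convolution chi followed by an FCN layer: Dense(128) -> Dense(128) -> Dense(128)
        self.gnn = nn.Linear(hidden, hidden)
        self.gnn_fc = nn.Linear(hidden, hidden)
        # Q network eta: Dense(256) -> Dense(128) -> Dense(128) -> Dense(33)
        self.q_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, X, C):
        # X: (N, 8) feature matrix, C: (N, N) correlation matrix
        H = self.encoder(X)                                     # H_t, Equation (1)
        C_hat = C / C.sum(dim=1, keepdim=True).clamp(min=1.0)   # assumed row normalization
        K = torch.relu(self.gnn(C_hat @ H))                     # K_t, Equation (2)
        K = torch.relu(self.gnn_fc(K))                          # extra FCN after the GNN
        F = torch.cat([H, K], dim=1)                            # feature stitching, (N, 256)
        return self.q_head(F)                                   # Q values, (N, 33)

# Usage example: q = SGRLNet()(torch.zeros(20, 8), torch.eye(20))
```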

2.3. Network Output

Since the vehicle longitudinal control model built into the simulation environment is still rule-based, it cannot incorporate the interaction between vehicles into the control. The SGRL algorithm fuses longitudinal and lateral control into the same model by increasing the output dimension, which greatly reduces the complexity of the control process and resolves the coupling between the two directions.
The SGRL algorithm is trained based on DQN, so the output action space can only be discrete. For AVs, longitudinal control is mainly reflected in the acceleration, and lateral control is mainly reflected in the lane change direction. Increasing the resolution of the output actions improves control sensitivity, but it also increases the computational complexity and reduces the control frequency. This paper therefore selects a compromise: the longitudinal control is set to 11 discrete values in the interval $[-5, 5]$, and the lateral control is set to three actions: keeping, turning left, and turning right. Thus, the output matrix of the model is $A_t \in \mathbb{R}^{N \times 33}$, as shown in Figure 5.
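Because each of the 33 discrete actions jointly encodes one of the 11 acceleration levels and one of the 3 lateral maneuvers, decoding an action index is a simple integer division. The acceleration-major index layout below is an assumption; only the 11 × 3 factorization comes from the text.

```python
import numpy as np

ACCELERATIONS = np.linspace(-5.0, 5.0, 11)   # 11 discrete values in [-5, 5]
LATERAL = ["keep", "turn left", "turn right"]

def decode_action(index: int):
    """Map a flat action index in [0, 32] to (acceleration, lateral command),
    assuming acceleration-major ordering: index = accel_idx * 3 + lateral_idx."""
    accel_idx, lateral_idx = divmod(index, 3)
    return float(ACCELERATIONS[accel_idx]), LATERAL[lateral_idx]

# Example: greedy action for one vehicle from its 33 Q values
q_values = np.random.rand(33)                # placeholder Q values
accel, lane_cmd = decode_action(int(np.argmax(q_values)))
```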

2.4. Reward Function

The reward function needs to provide clear guidance and correlate strongly with the training objectives. In the SGRL algorithm, the task success rate, safety, and traffic efficiency should be considered simultaneously, so the corresponding reward terms are an intention reward, a crash reward, and a speed reward.

2.4.1. Intention Reward

To guide autonomous vehicles to complete their driving tasks by exiting the highway from the corresponding ramps, the SGRL algorithm constructs reward gradients for different lanes, as shown in Figure 6. Merge_0 and merge_1, respectively, refer to the autonomous vehicles that need to leave the highway via ramp_0 and ramp_1 according to the task requirements. In the simulation experiment, the driving route of every autonomous vehicle is entirely determined by the reinforcement learning controller; therefore, a vehicle belonging to the merge_0 category will not necessarily exit from ramp_0, that is, it may fail to complete its driving task.
To provide practical guidance at all locations, the reward value $R_I^t$ for each fixed area is set to a constant. To ensure that the various rewards interact properly and are passed on to the training process, the numerical scale must be kept consistent when setting $R_I^t$. Moreover, $R_I^t$ needs to distinguish between positive and negative values, which is more conducive to training. In addition, if the autonomous vehicle completes the task, it obtains a larger positive $R_I^t$; if the task fails, it receives a larger negative $R_I^t$.

2.4.2. Crash Reward

The safety of autonomous driving is the basis of all other characteristics. The simulation platform SUMO can detect collisions at step $t$ and output the number $N_{collision}^t$ of vehicles involved in collisions. The collision reward is calculated from $N_{collision}^t$ (Equation (3)).
$R_C^t = -N_{collision}^t / 2 \quad (3)$

2.4.3. Speed Reward

To keep the reward scale consistent, the speed reward $R_S^t$ needs to be normalized. $v_{\max} = \max(v_{\max}^{vehicle}, v_{\max}^{highway})$ is the maximum speed of the AV. To ensure that $R_S^t$ can take both positive and negative values, it is calculated as shown in Equation (4).
$R_S^t = \dfrac{v_i^t}{v_{\max}} - 0.3 \quad (4)$

2.4.4. Total Reward

The general method of calculating the total reward is a weighted summation of the components. However, direct addition causes the rewards to mask one another, resulting in poor training effects. The SGRL algorithm therefore uses a new aggregation method, as shown in Equation (5).
$R^t = \begin{cases} \omega_I \times R_I^t \times R_S^t + \omega_C \times R_C^t & (R_I^t > 0) \\ \omega_I \times R_I^t + \omega_C \times R_C^t & (R_I^t \le 0) \end{cases} \quad (5)$
This calculation takes the task success rate and safety as the primary considerations while also accounting for traffic efficiency. It completely avoids the problem of vehicles stopping on the road merely to collect intention rewards. Moreover, the relative weight of the rewards can be adjusted through the parameters $\omega_I$ and $\omega_C$; in this paper, $\omega_I = 1$ and $\omega_C = 2$.
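The three components and the aggregation in Equation (5) can be written as a short function. The sketch below assumes that the per-area intention reward $R_I^t$ is supplied by the lane/section lookup of Figure 6 and that the crash term is the penalty of Equation (3).

```python
W_I, W_C = 1.0, 2.0   # reward weights used in this paper

def step_reward(r_intention, n_collisions, v, v_max):
    """Aggregate intention, crash, and speed rewards as in Equation (5)."""
    r_crash = -n_collisions / 2.0          # Equation (3): collision penalty
    r_speed = v / v_max - 0.3              # Equation (4): normalized speed reward
    if r_intention > 0:
        # A positive intention reward is scaled by the speed reward, so a vehicle
        # that stops on the road cannot keep accumulating intention reward.
        return W_I * r_intention * r_speed + W_C * r_crash
    return W_I * r_intention + W_C * r_crash

# Example: a vehicle in a rewarded lane, no collision, driving at 15 of 20 m/s
print(step_reward(r_intention=1.0, n_collisions=0, v=15.0, v_max=20.0))  # 0.45
```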

2.5. Model Training and Testing

The most prominent feature of the SGRL algorithm is that it is trained for a single agent, yet the obtained model can be applied in a multi-agent environment. From the perspective of the trained agent, all vehicles around it can be considered members of the same category: surrounding vehicles. Therefore, the work done by the GNN-based reinforcement learning model can be interpreted as planning and decision-making based on the characteristics of, and relationships between, the target agent and its surrounding vehicles. The model obtained by single-agent training can thus be directly transplanted to the other agents in the same scenario, as shown in Figure 7.
To prevent the overfitting of the reinforcement learning process, we randomize the distribution of training vehicles and the task of autonomous vehicles in each episode based on fixed rules, as shown in Figure 8.
Algorithm 1 shows the detailed steps of training.
Algorithm 1 SGRL Q-Learning Steps.
Initialize the replay memory $R$ to capacity $N$
Initialize the weights of the SGRL Net $(\psi, \chi, \eta)$
Initialize the current network $Q_\theta$ and the target network $Q_{target} = Q_\theta$
## Warming up ##
For step $t = 1$ to $T_0$ (warm-up steps) do
 Choose a random action for each agent $i$: $a_t^r$ = np.random.choice(np.arange(3), N)
 Get the transition $(s_t, a_t, r_t, s_{t+1})$
 Store it in the buffer
## Training step ##
For step $t = T_0 + 1$ to $T$ (total steps) do
 With probability $\epsilon$ choose a random action for agent $i$: $a_t^r$
 With probability $1 - \epsilon$ do:
  State decoding: $(X_t, C_t) = s_t$
  Double FCN: $H_t^1 = \psi_0(X_t) \in \mathbb{R}^{N \times 128}$; $H_t^2 = \psi_1(H_t^1) \in \mathbb{R}^{N \times 128}$
  GNN + FCN: $K_t = \chi(H_t^2, C_t) \in \mathbb{R}^{N \times 128}$; $K_t^1 = \psi_2(K_t) \in \mathbb{R}^{N \times 128}$
  Feature stitching: $F_t = (H_t^2, K_t^1) \in \mathbb{R}^{N \times 256}$
  Double FCN: $F_t^1 = \psi_3(F_t) \in \mathbb{R}^{N \times 128}$; $F_t^2 = \psi_4(F_t^1) \in \mathbb{R}^{N \times 128}$
  Compute Q values: $Q_\theta(s_t) = \psi_5(F_t^2) \in \mathbb{R}^{N \times 33}$
  Select $a_t^* = \arg\max Q_\theta(s_t)$
 Execute $a_t^*$ and get the new state $s_{t+1}$
 Store the transition in the buffer
 Set $s_t = s_{t+1}$
## Training the model at each training step ##
Sample a batch from the buffer and calculate the Q target:
$Q_{target} = \begin{cases} r_t + \gamma \max_a Q_\theta(s_{t+1}, a) & done = 0 \\ r_t & done = 1 \end{cases}$
Compute the average $Loss$
Update the parameters $\theta$ of the model $Q_\theta$
## Updating the target net every n steps ##
$Q_{target} = Q_\theta$
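For reference, the Q-target computation and parameter update at the core of Algorithm 1 look roughly as follows in PyTorch. The replay-buffer batch layout, the discount factor value, and the mean-squared-error loss are assumptions; the network call signature matches the sketch in Section 2.2.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor

def dqn_update(q_net, target_net, optimizer, batch):
    """One gradient step of the DQN update in Algorithm 1.
    batch: dict of tensors with keys X, C, action, reward, X_next, C_next, done."""
    # Q value of the action actually taken for each controlled vehicle
    q = q_net(batch["X"], batch["C"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Bootstrap from the target network unless the episode has ended (done = 1)
        q_next = target_net(batch["X_next"], batch["C_next"]).max(dim=1).values
        q_target = batch["reward"] + GAMMA * q_next * (1.0 - batch["done"])

    loss = F.mse_loss(q, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```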
Algorithm 2 shows the detailed steps of testing.
Algorithm 2 SGRL Testing Steps.
Initialize the simulation environment
## Testing step ##
For step $t = 1$ to $T$ (total test steps) do
 State decoding: $(X_t, C_t) = s_t$
 For AV $i = 1$ to $N_{AV}$ do
  Move the other AVs' features behind the HV features: $X_t^0 = \lambda(X_t) \in \mathbb{R}^{N \times 8}$
  Calculate Q with the trained model: $Q_\theta(s_t) = \pi_{SGRL}(X_t^0, C_t) \in \mathbb{R}^{N \times 33}$
  Select $a_t^{i*} = \arg\max Q_\theta(s_t)$
 Get the action matrix $a_t^* = (a_1^*, a_2^*, \ldots, a_{N_{AV}}^*)$
 Execute $a_t^*$ and get the new state $s_{t+1}$
 Set $s_t = s_{t+1}$
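The $\lambda$ operation in Algorithm 2, which lets the single-agent model control each AV in turn, simply keeps the current ego AV in the AV block and moves the remaining AVs behind the HV rows before the forward pass, so the trained model treats them as ordinary surrounding vehicles. A NumPy sketch with an assumed slot layout follows; the correlation matrix is permuted with the same ordering.

```python
import numpy as np

def reorder_for_ego(X, C, av_rows, ego_row):
    """Sketch of the lambda operation from Algorithm 2 (assumed slot layout):
    place the ego AV first, then the HV rows, then the other AVs, so the
    single-agent model sees only one controlled vehicle."""
    other_avs = [r for r in av_rows if r != ego_row]
    hv_rows = [r for r in range(X.shape[0]) if r not in av_rows]
    order = [ego_row] + hv_rows + other_avs
    return X[order], C[np.ix_(order, order)]   # consistently permuted inputs

# Example with 20 slots, rows 0-4 holding AVs and the ego vehicle in row 2
X = np.random.rand(20, 8)
C = np.eye(20)
X_ego, C_ego = reorder_for_ego(X, C, av_rows=list(range(5)), ego_row=2)
```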

3. Simulation

3.1. Baseline Models

Two baseline models are introduced for comparative analysis in the simulation: the traditional DQN algorithm (NGRL) and the multi-agent training algorithm (MGRL). In the NGRL algorithm, the GNN part is removed, and the correlation matrix is concatenated with the feature matrix to form the input.
The model of the MGRL algorithm is consistent with the SGRL proposed in this paper in terms of network structure, but its training process is based on multi-agent environment interaction.
By comparing the three models, the specific effects of the GNN structure and single-agent training of SGRL can be effectively analyzed. The simulation scenario is shown in Figure 9.

3.2. Simulator Parameters

The simulation scenario is a long highway with three lanes, with exit ramps at one-third and two-thirds of the total length, respectively. The speed limit for the whole road is set to 20 m/s (72 km/h) for all vehicles. AVs and HVs enter the scenario from the left side of the road with probabilities of 0.1 and 0.4, respectively, and their initial speeds and lane positions are random.
There are six types of vehicles in the scenario, including HVs that travel straight through the highway, vehicles (HVs and AVs) that want to exit from ramp_0 and vehicles (HVs and AVs) that want to exit from ramp_1. The simulation environment controls the driving of HVs, and the AVs are completely controlled by the reinforcement learning model SGRL in real time.
The number of vehicles in the experiment is set as shown in Table 2.
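The random vehicle insertion described above can be reproduced with a per-step Bernoulli draw for each vehicle class. The sketch below uses the stated insertion probabilities (0.1 for AVs and 0.4 for HVs) and the 20 m/s speed limit; the initial-speed range and the uniform sampling of lane and destination are assumptions.

```python
import numpy as np

rng = np.random.default_rng()
P_AV, P_HV = 0.1, 0.4        # insertion probabilities from the scenario description
V_LIMIT = 20.0               # speed limit in m/s
INTENTS = ["ramp_0", "ramp_1", "straight"]

def maybe_spawn_vehicles():
    """Return the vehicles to insert at the left edge of the road at this step."""
    spawned = []
    for kind, prob in (("AV", P_AV), ("HV", P_HV)):
        if rng.random() < prob:
            spawned.append({
                "kind": kind,
                "lane": int(rng.integers(0, 3)),                  # one of three lanes
                "speed": float(rng.uniform(0.5, 1.0) * V_LIMIT),  # assumed initial speed range
                "intent": INTENTS[int(rng.integers(0, len(INTENTS)))],
            })
    return spawned
```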

4. Results and Discussion

4.1. Training Results

All three models were trained for 1000 episodes. The characteristic indicators of the training process are the reward, the average Q value, and the loss value. The average reward is obtained by averaging the reward over the steps of each episode. The specific changes are shown in Figure 10, Figure 11, Figure 12 and Figure 13.
Under the same reward calculation, the reward and average reward of the SGRL algorithm converge faster during training and reach a better final value. The trends of the Q value and loss value of the SGRL algorithm during training are consistent with the learning process, and the loss results show that SGRL converges faster.
To compare the task success rate, safety, and traffic efficiency, the average velocity, number of collisions, success rate, and average steps per episode must be collected. The results of the above training data are shown in Figure 14, Figure 15, Figure 16 and Figure 17.
The SGRL algorithm has obvious advantages in task success rate and average vehicle speed. In terms of the number of collisions and the average steps per episode, SGRL also meets driving safety requirements.
In addition, SGRL has a clear advantage in training efficiency for the same number of training episodes. The specific hardware parameters and training times are given in Table 3, and the training times are compared in Table 4.

4.2. Testing Results

The trained models need to be verified through a testing process. To fully verify the algorithm's effectiveness, we adjust the total length of the road while keeping the traffic volume constant (20 vehicles per episode). The three algorithms are tested on 1000 m, 750 m, and 500 m roads to simulate different traffic densities and congestion levels.
Each test process includes 1000 episodes. In the test process, the reward value can still be used as an essential evaluation of model performance. The simulation results of reward value and average reward value are shown in Figure 18 and Figure 19.
For the most complex and congested 500 m highway scenario, the spatial distribution of the longitudinal actions of the three algorithms over the whole test cycle is shown in Figure 20.
The longitudinal control of autonomous vehicles becomes more complex as road congestion increases, so the test results of the 500 m scenario are the most informative reference. They show that the action output of the SGRL algorithm is concentrated mainly around 0, and the probability of large accelerations remains acceptably low.
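The distribution in Figure 20 can be reproduced from the logged action indices with a simple frequency count; the sketch below assumes the acceleration-major action indexing used in the earlier decoding example.

```python
import numpy as np

def longitudinal_action_distribution(action_indices, n_accel=11, n_lateral=3):
    """Probability of each of the 11 acceleration levels over a test run."""
    accel_idx = np.asarray(action_indices) // n_lateral   # recover the acceleration index
    counts = np.bincount(accel_idx, minlength=n_accel)
    return counts / counts.sum()

# Example: distribution over a log of 10,000 action indices in [0, 32]
log = np.random.randint(0, 33, size=10_000)
print(longitudinal_action_distribution(log))
```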
To compare the task success rate, safety, and traffic efficiency, the average velocity, number of collisions, success rate, and average steps per episode must be collected.
The data of different methods are listed in Table 5, and the mean of the above data is shown in Figure 21, Figure 22, Figure 23 and Figure 24.
It can be seen from the figures and table that SGRL has clear advantages in task success rate and average velocity. In terms of the number of collisions and the number of test steps, the SGRL values, although not the best, are comparable to the best values.

4.3. Results Discussion

It can be seen that the SGRL algorithm has outstanding advantages over the MGRL and NGRL algorithms. The SGRL algorithm converges faster during training and achieves better data performance, and in testing it performs outstandingly in terms of task success rate, average vehicle speed, and safety.
The MGRL algorithm directly sums the reward values of all autonomous vehicles, so well-performing and poorly performing vehicle behaviors offset each other. This hinders adequate parameter updates, slows convergence, and degrades the final convergence quality.
The NGRL algorithm does not explicitly account for the interactions between individual vehicles. Its understanding of the environment is therefore relatively shallow, it cannot obtain sufficient relational information to support its decisions, and its performance is poorer.
The SGRL algorithm considers and solves the above problems and optimizes the network structure. The data comparison verifies that the improvements of SGRL are clearly effective.

5. Conclusions

This paper proposes a generalized single-vehicle-based graph neural network reinforcement learning algorithm (the SGRL algorithm). The algorithm combines a GNN with deep reinforcement learning to solve the vehicle planning problem in the scenario of exiting a highway via a ramp. The SGRL model is trained for a single agent and can be tested in multi-agent scenarios. At the same time, the algorithm uses an improved reward function that provides a clear direction for training.
Comparing the three algorithms, the conclusions are as follows:
  • Firstly, a training mode in which single-agent training is extended to multi-agent scenarios is proposed and verified in terms of training effectiveness and performance. The algorithm improves the analytical ability of the DRL network by increasing the number of network nodes, thereby extending the control dimensions of vehicle decision-making to both the longitudinal and lateral directions.
  • Secondly, the proposed SGRL algorithm simplifies the training mode and improves the training efficiency without affecting the training effect. This helps to adjust complex parameters and reduce time costs.
  • Thirdly, SGRL is trained more sufficiently and achieves a better convergence effect, with the smallest fluctuation after the training curves stabilize. This shows that the proposed SGRL algorithm has outstanding training ability and is more suitable for reinforcement-learning-based decision-making in multi-agent scenarios.
  • Finally, the newly designed reward function effectively solves the problem of mutual influence between longitudinal and lateral control. SGRL can achieve higher task success rates and average velocity in the training and testing process. This shows that the new reward function, the training method for a single agent, and the incorporation of GNN effectively improve the decision performance of the model.
In future research, a continuous action space can be added to the algorithm, which will effectively improve the driving smoothness of the vehicles. In addition, the relationship between multiple agents in the scenario should not be limited to physical characteristics: decision-making, driving intention, and task priority can also be incorporated into the calculation process.

Author Contributions

Conceptualization, F.Y. and Q.L.; Data curation, F.Y.; Methodology, F.Y. and Z.L.; Project administration, X.L.; Resources, X.L.; Software, F.Y., Q.L. and X.G.; Supervision, X.L.; Validation, F.Y.; Visualization, F.Y.; Writing—original draft, F.Y.; Writing—review & editing, X.L. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hoel, C.J.; Driggs-Campbell, K.; Wolff, K.; Laine, L.; Kochenderfer, M.J. Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving. IEEE Trans. Intell. Veh. 2020, 5, 294–305.
  2. Liu, Q.; Li, Z.; Yuan, S.; Zhu, Y.; Li, X. Review on Vehicle Detection Technology for Unmanned Ground Vehicles. Sensors 2021, 21, 1354.
  3. Peng, T.; Su, L.; Zhang, R.; Guan, Z.; Zhao, H.; Qiu, Z.; Zong, C.; Xu, H. A new safe lane-change trajectory model and collision avoidance control method for automatic driving vehicles. Expert Syst. Appl. 2020, 141, 112953.
  4. Nageshrao, S.; Tseng, H.E.; Filev, D. Autonomous highway driving using deep reinforcement learning. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 2326–2331.
  5. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Perez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4909–4926.
  6. Hoel, C.J.; Wolff, K.; Laine, L. Automated speed and lane change decision making using deep reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2148–2155.
  7. Gao, H.; Shi, G.; Xie, G.; Cheng, B. Car-following method based on inverse reinforcement learning for autonomous vehicle decision-making. Int. J. Adv. Robot. Syst. 2018, 15.
  8. Li, Z.; Gong, J.; Lu, C.; Li, J. Personalized Driver Braking Behavior Modeling in the Car-Following Scenario: An Importance-Weight-Based Transfer Learning Approach. IEEE Trans. Ind. Electron. 2022, 69, 10704–10714.
  9. Lu, C.; Hu, F.; Cao, D.; Gong, J.; Xing, Y.; Li, Z. Transfer learning for driver model adaptation in lane-changing scenarios using manifold alignment. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3281–3293.
  10. Zhao, D.B.; Shao, K.; Zhu, Y.H.; Li, D.; Wang, C.H. Review of deep reinforcement learning and discussions on the development of computer Go. Control Theory Appl. 2016, 33, 701–717.
  11. Wang, J.; Zhang, Q.; Zhao, D.; Chen, Y. Lane Change Decision-making through Deep Reinforcement Learning with Rule-based Constraints. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–6.
  12. Li, Y.; Chen, S.; Ha, P.; Dong, J.; Steinfeld, A.; Labi, S. Leveraging Vehicle Connectivity and Autonomy to Stabilize Flow in Mixed Traffic Conditions: Accounting for Human-driven Vehicle Driver Behavioral Heterogeneity and Perception-reaction Time Delay. arXiv 2020, arXiv:2008.04351.
  13. Gong, C.; Li, Z.; Lu, C.; Gong, J.; Hu, F. A comparative study on transferable driver behavior learning methods in the lane-changing scenario. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 3999–4005.
  14. Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep Reinforcement Learning framework for Autonomous Driving. Electron. Imaging 2017, 2017, 70–76.
  15. Noh, S. Decision-Making Framework for Autonomous Driving at Road Intersections: Safeguarding Against Collision, Overly Conservative Behavior, and Violation Vehicles. IEEE Trans. Ind. Electron. 2019, 66, 3275–3286.
  16. Liu, Q.; Li, X.; Yuan, S.; Li, Z. Decision-Making Technology for Autonomous Vehicles: Learning-Based Methods, Applications and Future Outlook. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 30–37.
  17. Schwarting, W.; Alonso-Mora, J.; Rus, D. Planning and Decision-Making for Autonomous Vehicles. Annu. Rev. Control. Robot. Auton. Syst. 2018, 1, 187–210.
  18. Li, L.; Ota, K.; Dong, M. Humanlike Driving: Empirical Decision-Making System for Autonomous Vehicles. IEEE Trans. Veh. Technol. 2018, 67, 6814–6823.
  19. Xu, X.; Zuo, L.; Li, X.; Qian, L.; Ren, J.; Sun, Z. A Reinforcement Learning Approach to Autonomous Decision Making of Intelligent Vehicles on Highways. IEEE Trans. Syst. Man Cybern. Syst. 2019, 50, 3884–3897.
  20. Zhang, Z.; Jiang, Q.; Wang, R.; Song, L.; Zhang, Z.; Wei, Y.; Mei, T.; Yu, B. Research on Management System of Automatic Driver Decision-Making Knowledge Base for Unmanned Vehicle. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1959013.
  21. Duan, J.; Li, S.E.; Guan, Y.; Sun, Q.; Cheng, B. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IET Intell. Transp. Syst. 2020, 14, 297–305.
  22. Cheng, X.; Jiang, R.; Chen, R. Simulation of decision-making method for vehicle longitudinal automatic driving based on deep Q neural network. In Proceedings of the 2020 7th International Conference on Automation and Logistics (ICAL), Beijing, China, 22–24 July 2020; pp. 12–17.
  23. Wang, P.; Chan, C.; de La Fortelle, A. A Reinforcement Learning Based Approach for Automated Lane Change Maneuvers. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1379–1384.
  24. Forster, Y.; Hergeth, S.; Naujoks, F.; Beggiato, M.; Krems, J.F.; Keinath, A. Learning to use automation: Behavioral changes in interaction with automated driving systems. Transp. Res. Part F Traffic Psychol. Behav. 2019, 62, 599–614.
  25. Biondi, F.; Alvarez, I.; Jeong, K.A. Human–Vehicle Cooperation in Automated Driving: A Multidisciplinary Review and Appraisal. Int. J. Hum. Comput. Interact. 2019, 35, 932–946.
  26. Li, Z.; Gong, C.; Lu, C.; Gong, J.; Lu, J.; Xu, Y.; Hu, F. Transferable driver behavior learning via distribution adaption in the lane change scenario. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 193–200.
  27. Ye, Y.; Zhang, X.; Sun, J. Automated vehicle's behavior decision making using deep reinforcement learning and high-fidelity simulation environment. Transp. Res. Part C Emerg. Technol. 2019, 107, 155–170.
  28. Zhang, H.; Xu, J.; Qiu, J.; Bashir, A.K. An Automatic Driving Control Method Based on Deep Deterministic Policy Gradient. Wireless Commun. Mob. Comput. 2022, 2022, 7739440.
  29. Yu, C.; Wang, X.; Xu, X.; Zhang, M.; Ge, H.; Ren, J.; Sun, L.; Chen, B.; Tan, G. Distributed Multiagent Coordinated Learning for Autonomous Driving in Highways Based on Dynamic Coordination Graphs. IEEE Trans. Intell. Transp. Syst. 2020, 21, 735–748.
  30. Yuan, S.; Zhao, P.; Zhang, Q. Research on automatic driving technology architecture based on cooperative vehicle-infrastructure system. In Proceedings of the International Conference on Artificial Intelligence, Virtual Reality, and Visualization (AIVRV 2021), Sanya, China, 19–21 November 2021; Volume 12153, pp. 111–117.
  31. Li, Z.; Gong, J.; Lu, C.; Yi, Y. Interactive Behavior Prediction for Heterogeneous Traffic Participants in the Urban Road: A Graph-Neural-Network-Based Multitask Learning Framework. IEEE/ASME Trans. Mechatron. 2021, 26, 1339–1349.
  32. Li, Z.; Lu, C.; Yi, Y.; Gong, J. A hierarchical framework for interactive behaviour prediction of heterogeneous traffic participants based on graph neural network. IEEE Trans. Intell. Transp. Syst. 2021, 1–13.
  33. Huang, C.; Lv, C.; Hang, P.; Xing, Y. Toward Safe and Personalized Autonomous Driving: Decision-Making and Motion Control With DPF and CDT Techniques. IEEE/ASME Trans. Mechatron. 2021, 26, 611–620.
  34. Dong, J.; Chen, S.; Ha, P.; Li, Y.; Labi, S. A DRL-based Multiagent Cooperative Control Framework for CAV Networks: A Graphic Convolution Q Network. arXiv 2020, arXiv:2010.05437.
Figure 1. Markov Decision Process.
Figure 2. Schematic diagram of the correlation matrix setting. Each row in the matrix represents the correlation between one vehicle and all other vehicles as logical values 0 and 1, where 0 indicates that the actual distance is greater than the set distance and 1 indicates that it is smaller.
Figure 3. Matrix segmentation diagram. Each row in the matrix represents the characteristic information of one vehicle, and the matrix is divided into upper and lower parts to separate AVs and HVs. The green part indicates that the vehicle is not currently in the scenario.
Figure 4. Overall structure diagram of the model. FCN represents the fully connected layers, and GNN represents the graph neural network.
Figure 5. Model action space diagram. The matrix row direction divides the longitudinal acceleration into discrete values, and the matrix column direction represents the lateral lane change of the vehicle.
Figure 6. Intention reward gradient diagram. The task completion of autonomous vehicles is used to judge the quality of the current reinforcement learning model, reflected in the reward value a vehicle obtains in each driving section and lane. Each strip area represents the reward type that the corresponding vehicles obtain from that area: green represents reward ($1 \times R_I^{Base}$), blue represents punishment ($-1 \times R_I^{Base}$), and orange represents serious punishment ($-2 \times R_I^{Base}$). $R_I^{Base}$ is set to 1 for normalization.
Figure 7. Reinforcement learning model transplant diagram.
Figure 8. Scenario random setting diagram.
Figure 9. Simulation scenario diagram.
Figure 10. Diagram of average reward. This value is the average of the single-step reward values.
Figure 11. Diagram of reward. This value is the accumulation of the single-step reward values.
Figure 12. Diagram of average Q. This value is a training indicator in reinforcement learning.
Figure 13. Diagram of loss. This value represents the difference between the current network and the ideal network.
Figure 14. Diagram of success rate. This value is the ratio of the number of vehicles completing the task (entering the corresponding ramp) to the total number of vehicles.
Figure 15. Diagram of collisions. This value is the number of collisions between vehicles detected in real time in the simulation scenario.
Figure 16. Diagram of average velocity. This value is the average velocity of all AVs in the scenario.
Figure 17. Diagram of average steps. This value is the number of steps at the end of each episode.
Figure 18. Diagram of testing reward. This value is the average of the total reward of each episode in the test.
Figure 19. Diagram of testing average reward. This value is the average of the per-episode average reward in the test.
Figure 20. Spatial distribution diagram of longitudinal movement. Based on frequency statistics of each action output during testing, the probability distribution of the longitudinal actions is obtained. The data in the figure are averaged over ten repeated experiments.
Figure 21. Diagram of testing average steps. This value is the number of steps at the end of each episode.
Figure 22. Diagram of testing success rate. This value is the ratio of the number of vehicles completing the task (entering the corresponding ramp) to the total number of vehicles.
Figure 23. Diagram of testing collisions. This value is the number of collisions between vehicles detected in real time in the simulation scenario.
Figure 24. Diagram of testing average velocity. This value is the average velocity of all AVs in the scenario.
Table 1. Training effect for different numbers of nodes.
N of Nodes | Training Time for 1000 Episodes | Convergence Effect
32 | 1.5179 h | Poor (large fluctuation)
64 | 2.6438 h | Acceptable (occasional large fluctuations)
128 | 3.4756 h | Good (small fluctuation)
256 | 6.0987 h | Good (small fluctuation)
512 | 10.9542 h | Good (small fluctuation)
Table 2. Number setting of experimental vehicles.
Process | Algorithm | AVs (Merge_0) | AVs (Merge_1) | HVs
Training Process | SGRL | 1 | 1 | 9
Training Process | MGRL | 5 | 5 | 10
Training Process | NGRL | 5 | 5 | 10
Testing Process | SGRL | 5 | 5 | 10
Testing Process | MGRL | 5 | 5 | 10
Testing Process | NGRL | 5 | 5 | 10
Table 3. Computer hardware information.
Item | Type
CPU | Intel i9-10980XE
GPU | NVIDIA RTX 3090 (24 GB)
RAM | Crucial DDR4 3200 MHz 32 GB × 4
SSD | SAMSUNG 970 EVO Plus 1 TB × 2
OS | Ubuntu 20.04
Table 4. Training time statistics (ten experiments for each algorithm; time unit is hours).
Experiment | SGRL | MGRL | NGRL
1 | 3.335665 | 66.40787 | 3.251174
2 | 3.687935 | 78.9423 | 4.812508
3 | 3.782653 | 71.37449 | 3.737191
4 | 3.763133 | 82.16632 | 4.360045
5 | 3.38374 | 66.463 | 3.259155
6 | 3.286891 | 72.25948 | 3.890604
7 | 3.874574 | 67.38896 | 3.840993
8 | 3.043649 | 67.68016 | 3.668797
9 | 3.368947 | 65.51202 | 3.017189
10 | 3.832845 | 69.72951 | 4.072374
Mean | 3.536003 | 70.79242 | 3.791003
Table 5. Performance comparison for different models.
Model | Road Length | Average_V | N_Collisions | Success_Rate | Average_Steps
SGRL | 1000 m | 9.55588 | 1.66888 | 0.94068 | 601.1014
SGRL | 750 m | 8.45047 | 1.73714 | 0.92385 | 494.9505
SGRL | 500 m | 7.50548 | 2.06911 | 0.91493 | 307.8586
MGRL | 1000 m | 3.60231 | 2.90205 | 0.54129 | 2478.409
MGRL | 750 m | 3.33545 | 2.96804 | 0.53864 | 1836.932
MGRL | 500 m | 3.20753 | 3.02719 | 0.47095 | 1334.704
NGRL | 1000 m | 8.92665 | 1.23373 | 0.53529 | 204.3344
NGRL | 750 m | 7.16045 | 1.83906 | 0.52708 | 166.5145
NGRL | 500 m | 6.70472 | 2.31418 | 0.4839 | 111.9219
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

