**1. Introduction**

In recent years, autonomous driving technology has developed rapidly due to its significant economic potential and advantages in improving traffic efficiency and driving safety. Various methods have been proposed to solve the decision-making problem of autonomous vehicles in highway driving tasks. Most studies have considered decision making as a control problem. As an unavoidable part of the autonomous driving system, trajectory planning is of grea<sup>t</sup> significance to the study of the autonomous vehicle. Avoiding the surrounding obstacles accurately and driving safely and efficiently based on the upper perception and prediction results are the basic requirements for automobile driving. Therefore, most autonomous driving researchers are now focusing on more intelligent, safe and efficient trajectory-planning methods.

The existing trajectory-planning methods are generally divided into four categories: potential field methods [1], sample-based methods [2], search-based methods [3], and optimization-based methods [4]. A potential field method simulates the movement of a controlled object in space into a forced movement of a particle in a virtual force field and plans the future trajectory of a vehicle by calculating the combined force field to which the vehicle is subjected. However, this method relies on accurate modeling of the environment, which will put the training into the dilemma of the local optimal solution and increase the computational cost. The sampling-based methods are mainly divided into fast random search tree (RRT) and probability path map (PRM) methods. The probability map path method is based on the graph structure, converts the continuous space into a discrete space, and uses the search algorithms such as A\* to find paths on the route map

**Citation:** Zhu, Z.; Teng, C.; Cai, Y.; Chen, L.; Lian, Y.; Wang, H. Vehicle Safety Planning Control Method Based on Variable Gauss Safety Field. *World Electr. Veh. J.* **2022**, *13*, 203. https://doi.org/10.3390/ wevj13110203

Academic Editors: Yong Li, Xing Xu, Lin Zhang, Yechen Qin and Yang Lu

Received: 12 October 2022 Accepted: 26 October 2022 Published: 31 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

to improve search efficiency. However, this method needs to solve the boundary value problem and does not focus on generating paths in the process of building the graph. The search-based planning algorithms mainly refer to map search methods, including A\*, D\*, and the corresponding variants. This kind of algorithm is widely used in the field of robot motion planning, but its planned path does not consider the geometric constraints of the road and has poor smoothness. Qi Xuanxuan et al. [5] introduced simulated annealing to optimize the expansion of nodes and heuristic functions, and guided the algorithm to search for the target point, which improved the inefficiency of the traditional A\* algorithm but still fell into the dilemma of a suboptimal solution. To improve sampling efficiency and avoid suboptimal dilemmas for agents, Claussmann et al. [6] classified the spatial configuration for route planning into three main categories: sampling [7], connection unit [8], and raster representation (Lattice) [9]. The raster representation can be used to predict and plan based on the moving obstacles around the vehicle while considering the kinematic constraints. However, the raster method is difficult to sample completely and can only sample better driving tracks. It is also difficult for the complete search method to consider the dynamic constraints of the automobile. The trajectory planning based on the optimization method has higher computational power requirements for the vehicle computer, and the optimization delay between each frame is large. In summary, most of the existing traditional trajectory-planning methods have relatively stable security performance and excellent computational efficiency. However, they focus only on the generation of the optimal path and can fall into the suboptimal dilemma.

In recent years, deep reinforcement learning (DRL) has shown satisfactory performance in both trajectory planning and trajectory tracking control. Feher et al. [10] trained deep deterministic policy gradient (DDPG) agents to generate waypoints for vehicle tracking and achieved good results. However, the algorithm only focused on the lateral trajectory and provided a suboptimal solution. Several studies have used original sensor measurements to generate turn angles and throttle values [11–16] in an end-to-end manner. The deep deterministic actor-critic (DDAC) algorithm [11,12] can keep the vehicle as far as possible on the center line of the lane and has achieved satisfactory results. However, this algorithm only considers the lateral control, not the longitudinal vehicle following. Lingli Yu et al. [15,16] proposed to use the DDPG algorithm to reduce the dependence on sample data. Their method had more continuous corner control and less lateral error when a vehicle was traveling. Although better results have been shown in the simulation environment, the agen<sup>t</sup> is still affected by turn and throttle fluctuations and does not consider safety issues when interacting with other vehicles in highway conditions resulting in poor stability and safety.

To solve the above-mentioned problems, a vehicle safety planning and control method based on the variable Gauss safety field is designed in this paper. A planning model is constructed using a time series bird's-eye view as a state quantity and policy gradient algorithm. The timeliness and security of the planning model are verified by experiments. The reinforcement learning method of multi-task partitioning is used to partition and train the whole automatic driving trajectory tracking control task. Compared with the general end-to-end reinforcement learning auto-driving method, the multi-task partitioned training method reduces the training duration by dividing the entire auto-driving tracking control task into several sub-tasks and improves the noise input method in the longitudinal control module to further improve the training efficiency and provide a smoother driving experience. Meanwhile, protecting traffic participants is the most important topic in driving theory. Wang et al. [17,18] proposed the driving safety field theory modeling method and developed a collision warning algorithm, field experiments were conducted to verify the proposed algorithm. However, the whole framework contains several factors of driver, vehicle, and road, which bring grea<sup>t</sup> difficulties to practical application. To improve the practicability of safety field theory, a variable Gaussian safety field model is proposed to reveal the dynamic field characteristics of vertices. We use the variable Gaussian safety field model as the reward function of the planning module and combined with the constraint

and evaluation index of the control module. The model combines a Gaussian field in both directions to form an envelope and varies with the vehicle speed angle. While ensuring reasonable trajectory generation, the interaction of the ego vehicle with the surrounding vehicles is utilized to actively avoid the surrounding vehicles when they enter the Gaussian field, which improves the safety performance of the vehicle in high-speed scenarios such as highways. The simulation results in CARLA show that the vehicle safety planning control method based on the variable Gauss safety field has good planning efficiency and better safety compared with the traditional algorithms.

The main contributions of this paper are as follows:


#### **2. Route Planning Algorithm**

The goal of trajectory planning for autonomous driving is to find the optimal trajectory in advance for a vehicle. On the one hand, it is necessary to ensure the safety of the vehicle; On the other hand, getting to the destination through obstacles as soon as possible, reducing traffic pressure and improving driving efficiency are also important criteria to measure the effectiveness of the planned trajectory. Figure 1 shows that the trajectory planning module plays a key role in the overall auto-driving system.

**Figure 1.** Autopilot system flow chart.

#### *2.1. Time Series Bird's-Eye View and Strategic Network*

The agents of reinforcement learning obtain the state input through interaction with the surrounding complex traffic environment to conduct effective learning training. One of the difficulties of the existing reinforcement learning algorithm is obtaining effective state features from complex environments. Overly redundant states will increase the learning difficulty of the agent. It is particularly important to make it easier for an agen<sup>t</sup> to extract valid features. Therefore, this paper designs a policy network and corresponding time series bird's-eye view as the state quantity of the reinforcement learning, enabling the network to extract better environmental features.

#### 2.1.1. Policy Network State Quantity

For an effective policy network for reinforcement learning, it is essential to obtain the perceptual information including lane lines, pedestrians, vehicles, and obstacles from the surrounding environment as well as the predictive tracks for the next few moments including dynamic obstacles.

The sequential bird's-eye view significantly improves the learning efficiency of the policy network. Figure 2 shows the time series bird's-eye view matrix diagram.

**Figure 2.** A time series bird's-eye view matrix diagram.

The bird's-eye view is a three-dimensional matrix composed of lateral displacement, vertical displacement and time. The specific elements in the matrix diagram shown in Figure 2 include (a) the current position status of the ego vehicle, (b) the ego vehicle, (c) obstacles, (d) the non-driving area and (e) the exercisable area, (f) the reference line, (g) the planned trajectory.

The generation of the time series bird's-eye view includes the following two steps: (1) According to the perception module of the autonomous vehicle, obtain the surrounding environmental information, including dynamic and static obstacles and lane lines. The prediction module is used to obtain the position information of dynamic obstacles in the future time of 0 ∼ *tend*. (2) The information obtained from the perception module and the information is used to generate a bird's-eye view of features in three dimensions: horizontal, vertical and time.

The size of the three-dimensional the time series bird's-eye view matrix is (40, 400, 80). The first dimension 40 represents the horizontal range of 10 m on the left and right of the reference line, with the horizontal displacement interval of 0.5 m; The second dimension 400 represents the longitudinal 200 m forward range with the ego vehicle as the origin, the longitudinal displacement interval is 0.5 m, and the third dimension 80 represents the time range within the next 8 s, the time interval is 1 s. The (c) obstacles and (d) the non-driving area are represented by −1 in the time series bird's-eye view matrix; (e) the exercisable area is represented by 0 in the time series bird's-eye view matrix; (f) the reference line is represented by 1 in the time series bird's-eye view. In the matrix, the reference line represents higher priority than (c) obstacles, (d) the non-driving area and (e) the exercisable area. At the same time, (a) the current position status of the ego vehicle, (b) the ego vehicle, and (g) the planned trajectory are not specifically represented in the time series bird's-eye view matrix.

Figure 3 shows the vertical view of a time series bird's-eye view with a green rectangle representing the vehicles on the highway and a dashed grey line representing the driveway sidelines.

The generation of a time series bird's-eye view includes the following two steps: (1) Obtain the surrounding environment information, including dynamic and static obstacles, and lane lines, according to the perception module of the automobile. Obtain dynamic obstacles using prediction module in the future 0 ∼ *tend* location information within the end. (2) Generate cross-sectional, vertical, and temporal feature bird's-eye views using the information obtained from the perception and prediction modules. Then, train using the bird's-eye view as the state input.

#### 2.1.2. Strategic Network Structure

Figure 4 shows the structure of the policy network *πθ* (*z*, *<sup>a</sup>*). The network includes a convolution feature extraction network consisting of one convolution layer and a fully connected network consisting of three fully connected layers. Where *z* is the input state quantity of the policy network, including the time series bird's-eye view matrix and the history track of the vehicle, *θ* denote the weights and offset parameters for the network and *a* is the output of the policy network, that is, the final state of the planning trajectory *a* = *s*, .*s*, ..*s*, *l*, .*l*, ..*l*, *t*, where *s*, .*s* and ..*s* are the final longitudinal position, the end-of-longitudinal speed, and the acceleration of the longitudinal end state of the vehicle, respectively, while *l*, . *l* and .. *l* are the lateral end state position, the lateral end-state speed and the acceleration of the lateral end state of the vehicle, respectively. The input of the convolution feature extraction network is the time series aerial view matrix and the output is the final extracted environmental feature information. The input of the fully connected network is the convolution feature. The environmental feature information and the historical track information of the vehicle are extracted from the network output.

**Figure 4.** Strategic network structure diagram.

#### *2.2. Variable Gauss Safety Field Theory*

Since reinforcement learning explores policies and rewards by making agents constantly try and error, the security of reinforcement learning is lower than the other methods. Improving the security of reinforcement learning remains the focus of research. The variable Gauss security field model based on risk center transfer further improves the security of trajectory planning and control methods and serves as the reward function of the trajectory planning part and the constraint boundary of the control part.

Figure 5 shows that a static vehicle is abstracted as a rectangle with a length of *lv*, a width of *wv*, and the risk center *<sup>O</sup>*(*<sup>x</sup>*0, *y*0) is its geometric center. The static security field of the vehicle is described by a two-dimensional Gaussian function as:

$$S\_{sta} = C\_a \cdot \exp(-\frac{\left(\chi - x\_0\right)^2}{a\_x^2} - \frac{\left(y - y\_0\right)^2}{b\_y^2})\tag{1}$$

where *Ca* is the field strength factor, *ax* and *by* represent the function of vehicle shape. The main control parameter for the shape of a static safety field is anisotropy:

$$\varepsilon = \frac{a\_x^2 - b\_y^2}{a\_x^2 + b\_y^2} = \frac{\phi^2 - 1}{\phi^2 + 1} \tag{2}$$

**Figure 5.** Static safety field overhead projection.

Parameter *ε* equivalently expressed in aspect ratio ∅ = *ax*/*by* = *lv*/*wv*.

The direction of the safety field is a vector from the risk center whose isoelectric line is projected upward into a series of ellipses. In Figure 5, the red rectangle represents the vehicle, the area in the solid red rectangle is called the core domain, the area between the red and the yellow ellipses is called the restriction domain, the area between the yellow and the blue ellipses is called the expansion domain, and each area represents a different risk state. The sizes of these different domains are related to the shape and motion of the vehicle and can be determined based on the parameters *ax*, *by* of the Gaussian function (1). The Gauss security field is variable. The aspect ratio of the virtual vehicle will change with the change of the vehicle motion state and will significantly change the core, restriction and extension domains of the Gauss security field.

Figure 6 shows the overhead projection of the dynamic safety field. It can be seen that when the vehicle is in motion, the risk center will transfer following the vector *kv*<sup>→</sup>*v* , the new risk center becomes *<sup>O</sup>*(*<sup>x</sup>*0, *y*0) and there are:

$$\begin{cases} \begin{aligned} x'\_0 &= x\_0 + k\_v \left| \stackrel{\rightarrow}{\vec{v}} \right| \cos \beta\\ y'\_0 &= y\_0 + k\_v \left| \stackrel{\rightarrow}{\vec{v}} \right| \sin \beta \end{aligned} \tag{3}$$

where →*v* is the velocity vector of the vehicle motion, *kv* is the regulator and 0 < *kv* < 1 or −1 < *kv* < 0, the sign corresponds to the front and back directions of the movement. *β* is the transferred angle between the vector and the x-axis.

**Figure 6.** Overhead projection of dynamic safety field.

A virtual vehicle is formed with a length of *lv* and width of *<sup>w</sup>v* under the transfer of the risk center, whose geometric center is (*x*0, *<sup>y</sup>*0), which establishes its dynamic security field as:

$$S\_{dyn} = \mathbb{C}\_{\mathfrak{a}} \cdot \exp\left(-\frac{\left(\mathfrak{x} - \mathfrak{x}\_{0}^{\prime}\right)^{2}}{\left(a\_{\mathfrak{x}}^{\prime}\right)^{2}} - \frac{\left(\mathfrak{y} - \mathfrak{y}\_{0}^{\prime}\right)^{2}}{\left(b\_{\mathfrak{y}}^{\prime}\right)^{2}}\right) \tag{4}$$

where *ax* and *by* are parameters related to vehicle shape and motion state. The new aspect ratio is expressed as ∅ = *ax*/*by* = *lv*/*wv*.

#### *2.3. Improved Lattice Programming Algorithm Based on Strategic Gradient Algorithm*

The traditional Lattice programming algorithm achieves trajectory planning by sampling the target vertically and horizontally. This method will lead to the dilemma of a suboptimal solution for the sample-fitting trajectory, and it would be difficult to obtain the optimal trajectory. However, too many sampling points will lead to complex and inefficient calculations.

The Lattice algorithm is improved by using the policy gradient algorithm to directly obtain the optimal final state sample points as shown in Figure 7. This improved method abandons sampling with high time complexity and cost function evaluation for each alternate trajectory, which considerably improves the timeliness of the algorithm. Although the training process of reinforcement learning has better universality than the general rule-based planning algorithm, the design of the reward function based on the final control effect will make it more suitable for complex traffic scenes and complex vehicle dynamic features.

#### 2.3.1. Track Planning Agent Design

The trajectory output by general dynamic programming, Monte Carlo sampling and time series difference methods will have a complete state action sequence < *s*0, *a*0,*s*1, *a*1 ··· *send*−1, *aend*−1,*send* > and a trajectory consists of several state–action pairs as shown in Figure 8. Different actions *a* in each step will inevitably lead to changes in the overall trajectory. This will necessarily result in an exponential increase in the complexity of the solution as the length of the trajectory will increase. The simplified trajectory *τ* is composed of the start state *s*0, action *a* and end state *send*. In the start state *s*0, executing action *a* produces a unique trajectory *τ*, reaching the end state *send*.

**Figure 7.** Lattice sampling process improvement.

**Figure 8.** Diagrams of single-step and multi-step dynamic planning trajectory outputs.

In practice, the policy gradient algorithm is used instead of the last state sampling process in the Lattice algorithm. The end state of the track is used as the action space *A*:

$$A = \left\{ \mathbf{s}\_{end}, \dot{\mathbf{s}}\_{end}, \ddot{\mathbf{s}}\_{end}, \mathbf{\dot{l}}\_{end\prime} \mathbf{l}\_{end\prime} \dot{\mathbf{l}}\_{end\prime} \mathbf{\dot{l}}\_{end\prime} \mathbf{\dot{l}}\_{end\prime} \mathbf{\dot{l}}\_{end\prime} \right\} \tag{5}$$

Policy network *πθ* (*z*, *a*) maximizes the expected return of the output trajectory as an optimization objective:

$$J(\pi) = \sum\_{\tau} p(\tau, \theta) \cdot r(\tau) \tag{6}$$

where *z* denotes the state features of the surrounding traffic environment, a is the network output action, *θ* is a network parameter, *p* = (*<sup>τ</sup>*, *θ*) is the probability of executing action a and outputting track *τ* under parameter *θ* and state *z*, and *r*(*τ*) is the reward function of trajectory *τ*.

The gradient rise method is used to optimize *πθ* (*z*, *a*) from Equation (6):

$$
\theta = \theta + \mathfrak{a} \cdot \nabla\_{\theta} I(\pi) \tag{7}
$$

To calculate the derivative of the optimization objective with respect to network parameter *θ*, the strategy gradient is derived as:

$$\begin{array}{l} \nabla\_{\theta} f(\theta) = \nabla\_{\theta} \sum\_{\tau} p(\tau, \theta) \cdot r(\tau) \\ = \sum\_{\tau} \nabla\_{\theta} p(\tau, \theta) \cdot r(\tau) \\ = \sum\_{\tau} \frac{p(\tau, \theta)}{p(\tau, \theta)} \nabla\_{\theta} p(\tau, \theta) \cdot r(\tau) \\ = \sum\_{\tau} p(\tau, \theta) \nabla\_{\theta} \log p(\tau, \theta) \cdot r(\tau) \end{array} \tag{8}$$

To improve the efficiency of training, during the training process, the agen<sup>t</sup> continuously stores the experience data < *z*, *a*, *τ*,*r* > from the interaction with the environment in real-time into the experience pool (Memory). The Monte Carlo method is also used to randomly extract the mini-batch-sized empirical data from the experience pool for training:

$$\nabla\_{\theta} J(\theta) \approx \frac{1}{n} \sum\_{i=1}^{n} \nabla\_{\theta} \log p(\mathbf{r}, \theta) \cdot r(\mathbf{r}) \tag{9}$$

From Formula (9), the update direction of the final policy parameters *θ* is:

$$\theta = \theta + \alpha \cdot \frac{1}{n} \sum\_{i=1}^{n} \nabla\_{\theta} \log p(\mathbf{r}, \theta) \cdot r(\mathbf{r}) \tag{10}$$

To enhance the agent's exploring ability in unfamiliar state space and avoid the agen<sup>t</sup> falling into local optimal space during training, the output of the policy network *πθ* (*z*, *a*) will conform to normal distribution. It consists of two parts: mean *μ*(*<sup>z</sup>*, *a*) and variance *<sup>σ</sup>*(*<sup>z</sup>*, *a*):

$$\pi\_{\theta}(z,a) = \frac{1}{\sqrt{2\pi} \cdot \sigma(z,\theta)} \exp(-\frac{(z-\mu(z,\theta))^2}{2\sigma^2(z,\theta)})\tag{11}$$

During the learning process of the policy network *πθ* (*z*, *<sup>a</sup>*), the mean *μ*(*<sup>z</sup>*, *a*) and the variance *<sup>σ</sup>*(*<sup>z</sup>*, *a*) of the output keep approaching *argmaxQ*(*<sup>z</sup>*, *a*) and 0, respectively, and the probability of the agen<sup>t</sup> taking random behavior exploration keeps decreasing. During training, the agen<sup>t</sup> selects action *a* = *s*, .*s*, ..*s*, *l*, .*l*, ..*l*, *t* from this normal distribution as the training output and executes it.

#### 2.3.2. Reward Function Design

Reinforcement learning obtains the amount of state by interacting with the environment and evaluates the training agen<sup>t</sup> by a reward function. The agents obtain higher returns by continuously optimizing their network of policies. Therefore, the design of the reward function is critical to the convergence of the agent, which affects the final decision-making results of the overall model. Moreover, a reasonable reward function design can also make the agen<sup>t</sup> obtain more incentives from the environment and accelerate the convergence speed of the agent.

The reward function design for the trajectory planning section includes the following sections:

$$reward = k\_1 \cdot r\_{speed} + k\_2 \cdot r\_{acc} + k\_3 \cdot r\_{lateral} + k\_4 \cdot r\_{comfort} + k\_5 \cdot r\_{additional} + k\_6 \cdot r\_{safe} \tag{12}$$

In the formula, *rspeed* = − ∑*<sup>t</sup>*<*ttotal <sup>t</sup>*·*<sup>ν</sup>target* − *<sup>ν</sup>t*<sup>2</sup> is the speed reward, its goal is to keep the speed at the target speed; *racc* = − ∑*<sup>t</sup>*<*ttotal* ..*s*2*t* and *rcomf ort* = − ∑*<sup>t</sup>*<*ttotal* ..*lt*2 are the longitudinal and lateral comfort rewards, respectively, their goals are to maintain low longitudinal acceleration and low lateral acceleration, respectively; *rlateral* = − ∑*<sup>t</sup>*<*ttotal l*2*t* is the lateral deviation reward, its goal is to maintain a small lateral deviation from the reference line; *radditoanal* = − <sup>∑</sup>*<sup>t</sup>*<*totalst* − *sactual t* 2 + *lt* − *lactual t* 2 is the additional coupling reward, the objective is to maintain the coupling force between the planned trajectory and the controller and vehicle dynamics, and to maintain a better horizontal and vertical tracking accuracy of the vehicle during actual tracking; and *rsaf e* is the safety reward. *k*1 ∼ *k*6 is the proportion weight of each reward function. Where, *k*1 = 1.0, *k*2 = 0.2, *k*3 = 1.0, *k*4 = 0.2, *k*5 = 0.5, and *k*6 = 1.0. The value of *k*1 ∼ *k*6 is obtained through debugging, and the specific value comparison is shown in Figure 9 below.


**Figure 9.** Comparison chart of proportional weights.

> The design of *rsaf e* is constrained by the variable Gaussian safety field, as shown below: When the vehicle is stationary:

$$\begin{cases} \;r\_{safe,lat} = \begin{cases} -100 & \text{if } d\_{lat} < l\_v\\ 0 & \text{if } d\_{lat} > l\_v \end{cases} \\\;r\_{safe,lon} = \begin{cases} -100 & \text{if } d\_{lon} < \frac{3l\_v}{\sqrt{9 - w\_v^2}}\\ 0 & \text{if } d\_{lon} > \frac{3l\_v}{\sqrt{9 - w\_v^2}} \end{cases} \end{cases} \tag{13}$$

When the vehicle is moving:

$$\begin{cases} \begin{array}{l} r\_{safe,lat} = \begin{cases} -100 & \text{if } d\_{lat} < l\_v' \\ 0 & \text{if } d\_{lat} > l\_v' \end{cases} \\\ r\_{safe,lon} = \begin{cases} -100 & \text{if } d\_{lon} < \frac{4l\_v'}{\sqrt{16 - (w\_v')^2}} \\\ 0 & \text{if } d\_{lon} > \frac{4l\_v'}{\sqrt{16 - (w\_v')^2}} \end{cases} \end{cases} \tag{14}$$

where

 ⎧⎨⎩*wv* = *wv* + <sup>2</sup>·*kv*·<sup>→</sup>*<sup>v</sup>* · sin *β lv* = *lv* + <sup>2</sup>·*kv*·<sup>→</sup>*<sup>v</sup>* · cos *β* , *lv* and *wv* are the length and the width of the agent, →

respectively, *v* is the speed vector of vehicle motion, *kv* is the adjustment factor, and *β* is the angle between the transfer vector and the x-axis. After the actual vehicle test, *kv* = 0.35.

#### **3. Controller Design**

The traditional trajectory planning module and the control module are simple upper and lower-level relationships. The trajectory planning module outputs the optimal trajectory and the controller tracks the control. Although this mode is simple and easy to operate, it cannot meet the real-time requirements in complex traffic environments. Figure 10 shows the relationship diagram of the proposed feedback design model. It can be seen from the

figure that the trajectory planning agen<sup>t</sup> based on the policy gradient algorithm, the trajectory tracking controller and the environment form a planning control environment closed loop. The proposed loop feedback design model will enable the agents to continuously learn to adapt to the environment and adapt to the trajectory tracking controller. This method effectively links the traffic environment, the planner and the controller, so that the output trajectory of the planner can effectively adapt to the dynamic features of the vehicle and the controller. To enable the agen<sup>t</sup> to stably, efficiently and safely track the optimal trajectory output by the planner, and improve the efficiency, the training of the control part is divided into horizontal control and vertical control.

**Figure 10.** Planner/Controller/Environment relationship diagram.

#### *3.1. Horizontal Trajectory Tracking Control Model Training*

The goal of the traditional horizontal trajectory tracking task [19,20] is to enable vehicles to drive stably on the lane line without deviating, regardless of the state relationship with other vehicles. However, when the vehicle tracks and controls the track, the first consideration is the safety of the track, that is, it will not collide with other vehicles.Therefore, the variable Gaussian safety field is introduced as the evaluation index, and the state quantity and reward function are adjusted. The variables including the distance *di* from other vehicles, the lateral relative coordinate *xi*−*v*, the coordinate (*xi*, *yi*) of the navigation point in the current vehicle coordinate system, the heading deviation *ϕ* and the speed *v* and acceleration .*v* of the control vehicle are added as the state variables:

$$s^{\text{lamc}-\text{keep}} =  \tag{15}$$

The output action is only the steering wheel angle *asteer* ∈ [−1, 1]. For the design of the reward function for lane keeping, the lateral error *x*0 between the current vehicle coordinate and the lane centerline, the deviation *ϕ* of the heading angle and the relative distance *di* from other vehicles are considered as the evaluation index reward functions:

$$\begin{cases} r\_{safe,lat} = -\log(\left|\frac{1}{\sqrt{2}}w\_v' - d\right| + 1, 1.2) \\\ r\_{safe,lon} = -\log(\left|\frac{1}{\sqrt{2}}l\_v' - d\right| + 1, 1.2) \\\\ r^{lane-kexp} = -k\_1 abs(\mathbf{x}\_0) - k\_2 \sin \varrho \end{cases} \tag{16}$$

where ⎧⎨⎩*wv* = *wv* + <sup>2</sup>·*kv*·<sup>→</sup>*<sup>v</sup>* · sin *β lv* = *lv* + <sup>2</sup>·*kv*·<sup>→</sup>*<sup>v</sup>* · cos *β* , *lv* and *wv* are the length and the width of the agent,

respectively, →*v* is the speed vector of vehicle motion, *kv* is the adjustment factor, and *β* is the angle between the transfer vector and the x-axis. After the actual vehicle test, *kv* = 0.35.

If the lateral deviation of the current position of the autonomous vehicle is greater than the set maximum lateral deviation threshold value *x*0*max* during the training, the

current round of iterative training will be ended for the next round of training. Through the cumulative reward mechanism, agents that enhance learning continuously obtain higher reward reports. Hence, they can take more potential threats into account. However, the dynamic features of the vehicle will be hidden in the state quantity of the past few moments. Thus, it would be difficult to fully understand the current state of the intelligent vehicle only through the current state quantity. To enable the agen<sup>t</sup> to better understand the dynamic features of the intelligent vehicle at the current time and output more reasonable trajectory tracking actions, the state quantities at the current time and at the past four times are stacked together as network inputs.

#### *3.2. Training of Longitudinal Trajectory Tracking Control Model*

To maintain an ideal distance between the ego vehicle and the vehicle in front without any collision with the vehicle in front, the ego vehicle is expected to cruise at a constant speed when there is no vehicle in front. When there are other vehicles in front of the ego vehicle, the road information is not considered, instead only the information of the current vehicle and the vehicle ahead is considered as the state quantity. Figure 11 describes the cruise mission status. The longitudinal trajectory tracking control task considers the speed *v* and acceleration . *v* of the current vehicle, speed *vl* and acceleration . *vl* of the vehicle in front, the distance *d* from the vehicle in front and the expected speed *vdes* of the current vehicle as the state variables:

$$s^{\rm act} =  \,\,\,\tag{17}$$

**Figure 11.** Description of cruise mission status.

Output action *aacc* ∈ [−1, 1] of the agent, including accelerator action *athrottle* and brake action *abrake*: -

$$\begin{array}{cccc} a\_{\text{throttle}} = a\_{\text{acc}}, a\_{\text{brakz}} = 0 & \text{if } a\_{\text{acc}} \ge 0\\ a\_{\text{throttle}} = 0, a\_{\text{brakz}} = a\_{\text{acc}} & \text{if } a\_{\text{acc}} < 0 \end{array} \tag{18}$$

For vertical control tasks, the reward function is designed as:

$$r^{\rm acc} = \begin{cases} -k\_5 abs(v - v\_{des}) - k\_6 abs(d - d\_{des}) & \text{if } d > d\_{safe} \\ -100 & \text{if } d \le d\_{safe} \end{cases} \tag{19}$$

where *ddes* and *dsaf e* are the expected and safe distances from the vehicle in front, respectively. When the distance between the intelligent vehicle and the vehicle in front is less than the safe distance, the reward is −100 and the current interaction is stopped to start the next round of interaction. During longitudinal training, the speed *vl* of the vehicle in front

and the expected speed *vdes* of the current vehicle are randomly given each round, so that the training model can be generalized to more complex situations.

The traditional training mostly uses Gaussian noise or Ornstein Uhlenbeck (OU) noise to promote agents to actively explore the environment at the beginning of training. However, unnecessary exploration will prolong the training time of agents. Therefore, in this paper, a Multi-Head Actor network structure is designed for the tasks with convex solution space in longitudinal control tasks. The main function of the proposed structure is to make the output action noisy. Action noise reflects the uncertainty measure of the optimal solution of the current policy. The Multi-head Actor network structure is used to construct this uncertainty measurement method.

The output of the Online Actor network is connected to multiple Head networks. To reflect the difference of each Head network, the initialization and training sampling experience pool of each Head network are independent and the way to converge to the optimal solution space is also different. Therefore, the variance of the Head network output action is used to estimate the uncertainty measure of the output action of the Actor network as:

$$\begin{cases} \mathcal{N}\_t = \left\{ k \cdot \text{var}(\mu\left(s\_l \middle| \: \theta\_{\mu\text{unline}}\right)) \right. & \text{if } k \cdot \text{var}\left(\mu\_{\theta\_{\mu\_{\text{unline}}}}\left(s\_l \middle| \: \theta\_{\mu\text{unline}}\right)\right) < N\_{\text{thresholder}}\\ & \mathcal{N}\_{\text{threshold}} \end{cases} \tag{20}$$

where *Nt* and *Nthreshold* are the real-time action noise and the threshold noise, respectively, *θ* is the adopted policy, *μ st*|*θμonline* is the deterministic action of the network output, and *k* is the weight parameter.

Similar to the horizontal control part, the vertical control part also selects the current state quantity of the agen<sup>t</sup> and the state quantity of the past four times as the network input, making the network easier to converge and having high training efficiency.

#### **4. Experiment and Analysis**

The simulation experiment is based on the open-source autopilot simulator CARLA, which supports the development, training and validation of autopilot systems. In addition to open-source code and API protocol, CARLA also provides open mathematical assets (urban layout, buildings and vehicles) that can be freely invoked. CARLA works through the client mode. It has a specific python API interface that can realize simulation environment configuration, environment interaction and vehicle control through interface code. CARLA is suitable as a training platform for automatic driving reinforcement learning. The simulation training was completed under the environment of TOWN06 and TOWN04 in CARLA 0.9.9. Figure 12 shows the specific CARLA simulation scenario.

**Figure 12.** CARLA simulation diagram.

#### *4.1. Trajectory Planning Experiment Based on PG Algorithm*

When training the trajectory planning module, other obstacle vehicles were randomly generated for each round of training to enable the trained agents to target complex traffic conditions. In a random environment, the average reward of each round was used to evaluate the training effect of the agent. When the agen<sup>t</sup> reached the specified number of steps or encounters a collision, it directly started the next round of training. To avoid

randomness, the final training results were obtained by averaging the five training results. The training results are shown in Figure 13. The red curve is the average reward, and the red-shaded part is the sliding average of the five training rewards. Due to the strong randomness of the training environment, the rewards show a strong jitter with the change of the round. The rewards show an overall upward trend with the change of rounds, indicating that the agents are increasingly adapting to the changing traffic environment to obtain higher rewards during the training process. After 100 rounds, the variance of rewards tends to decrease, and the training results of agents become more stable.

**Figure 13.** The change in average reward of the track planning module with the number of training wheels.

As shown in Figure 14, the red curve represents the reward curve of the planning method based on the time series bird's-eye view and the policy gradient algorithm proposed in this paper, and the blue curve represents the reward curve of the planning method using the DDPG algorithm. Because of the strong randomness of the training environment, the reward fluctuates greatly with the change of the round. In the comparison of average rewards, both curves are almost the same. However, it is obvious that the DDPG algorithm represented by the blue curve has convergence effect only after 100 rounds, while the planning method proposed in this paper starts to converge gradually after 70 rounds. Therefore, the proposed planning method has higher convergence efficiency and stability.

**Figure 14.** Comparison curve of average reward of the track planning module changing with the number of training wheels.

#### *4.2. Safety Control Module Experiment*

In the control module, due to the randomness of the steps that the autonomous vehicle can take during the training process, it is not suitable to use a single reward or a cumulative reward as the evaluation standard of the training effect of the agen<sup>t</sup> at the current moment. Therefore, it is reasonable to take the average reward of each step of the current round as the evaluation standard of the training effect of the current round. The abscissa is the number of training rounds, and the ordinate is the average reward obtained in each round. Figure 15 shows the change in the training curve of the horizontal trajectory tracking task.

**Figure 15.** The change in average reward of the horizontal trajectory tracking task with the number of training wheels.

It can be seen from Figure 15 that in the first 15 rounds of the lateral trajectory tracking control task, the agen<sup>t</sup> is still in the free exploration stage, and the reward curve fluctuates and does not converge. With the progress of training, the agen<sup>t</sup> continuously optimizes its strategic network, makes more reasonable behavior, obtains higher rewards and optimizes its network again according to the rewards obtained from feedback, forming a virtuous circle. After 50 rounds, the reward curve begins to converge and achieves good training results.

In this paper, the variable Gaussian safety field is used as the constraint and evaluation index of the control part. Figure 16 shows the reward curve of the variable Gaussian safety field. The red curve represents the reward curve of the lateral tracking control considering the relationship with other vehicle state quantities under the variable Gaussian safety field. The blue curve represents the reward curve of the traditional lateral tracking control under the variable Gaussian safety field. In both cases, the average value of the five experiments is taken. Figure 16 clearly shows that the reward curve of the safety lateral tracking control method proposed in this paper is superior to the traditional lateral tracking control, with higher safety performance and greater response space to emergency conditions. At the beginning of several training rounds, since the agen<sup>t</sup> did not interact with other vehicles in the opening exploration phase, the average reward was 0, as shown in Figure 16. From the sixth round, the agen<sup>t</sup> interacts with other vehicles in the environment, the variable Gaussian safety field acts, and the reward curve changes.

**Figure 16.** Reward curve of variable Gauss safety field.

Figure 17 shows the average reward of the longitudinal trajectory tracking control task over time. It can be seen that the average reward changes with the training times. The blue and red curves represent the average reward change curves of the agents with Gaussian noise and adaptive noise exploration, respectively, and the shaded part is the standard deviation of five experiments. Figure 17 shows that both types of agents have achieved good training results in the longitudinal trajectory tracking control task. Due to the randomness of the ego vehicle's speed and the state of the vehicle ahead in each training round, the average reward of the lateral trajectory tracking control task fluctuates to some extent. However, similar to the lateral trajectory tracking control task, the training effect of the adaptive noise detection method is better than that of the common noise attenuation method.

**Figure 17.** The change in the average reward of the longitudinal trajectory tracking task with the number of training wheels.

### **5. Conclusions**

In this paper, a vehicle safety planning control method based on the variable Gaussian safety field is designed. The policy gradient algorithm is used to improve the driving safety

of autonomous vehicles and make the driving trajectory of autonomous vehicles more intelligent. The spatiotemporal bird's-eye view proposed in combination with the policy gradient algorithm as a state variable can enhance the ability of feature extraction of the policy network and make the network convergence easier. The variable Gaussian safety field is added as the reward function of the trajectory planning module and the evaluation index of the control module to improve the safety and rationality of the output trajectory and tracking control, respectively. In the longitudinal control module, Gaussian noise input is improved to avoid repeated invalid exploration of agents and enhance training efficiency. Compared with the traditional planning control algorithm, the proposed method has the following advantages: (1) the spatiotemporal bird's-eye view is used as the input state of the policy network enabling the trajectory planning policy network to effectively extract the features of the surrounding traffic environment. The planning trajectory of autonomous vehicles is generated through reinforcement learning, which improves the trajectory planning ability of autonomous vehicles in complex scenes. The efficiency of the lattice sampling method for trajectory planning algorithm avoids invalid sampling in complex traffic scenes; (2) the variable Gaussian safety field is added as a reward function to improve the safety of trajectory and control effect; (3) the traditional noise input is improved and the multi-head actor network structure is designed to add noise in the output action and improve the training efficiency. The experimental results demonstrate and validate that the proposed framework is superior to the traditional methods.

At the same time, this paper does not consider the scenarios other than an expressway, and how to change lanes in an emergency. In the future, we will test and improve the algorithm in more complex environments, such as ramps and urban roads. From another point of view, the single vehicle will be extended to the fleet, and the driving efficiency and safety of the fleet on the expressway will be considered.

**Author Contributions:** Conceptualization, Z.Z. and Y.C.; methodology, Z.Z.; software, C.T.; validation, Z.Z., C.T. and Y.C.; formal analysis, Y.L.; investigation, H.W.; resources, L.C.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z.; visualization, Y.C.; supervision, Y.L.; project administration, H.W.; funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by [National Natural Science Foundation of China] gran<sup>t</sup> number [U20A20333, 51875255, U20A20331, 52072160] and [Six talent peaks project in Jiangsu Province] gran<sup>t</sup> number [2018-TD-GDZB-022] and [Key R&D projects in Jiangsu Province] gran<sup>t</sup> number [BE2020083-3, BE2019010-2].

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. And Yubo Lian is employee of BYD Company Limited. The paper reflects the views of the scientists, and not the company.
