Article

Deep Deterministic Policy Gradient Algorithm Based on Convolutional Block Attention for Autonomous Driving

Yanliang Jin, Qianhong Liu, Liquan Shen and Leiji Zhu
1 Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai University, Shanghai 200444, China
2 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(6), 1061; https://doi.org/10.3390/sym13061061
Submission received: 28 April 2021 / Revised: 7 June 2021 / Accepted: 9 June 2021 / Published: 12 June 2021

Abstract

Autonomous driving based on deep reinforcement learning algorithms is a research hotspot. Traditional autonomous driving requires human involvement, and autonomous driving algorithms based on supervised learning must be trained in advance using human experience. To deal with autonomous driving problems, this paper proposes an improved end-to-end deep deterministic policy gradient (DDPG) algorithm based on the convolutional block attention mechanism, called the multi-input attention prioritized deep deterministic policy gradient algorithm (MAPDDPG). Both the actor network and the critic network of the model have the same structure with symmetry. Meanwhile, the attention mechanism is introduced to help the vehicle focus on useful environmental information. The experiments are conducted in the open racing car simulator (TORCS), and the results of five experiment runs on the test tracks are averaged to obtain the final result. Compared with the state-of-the-art algorithm, the maximum reward increases from 62,207 to 116,347, the maximum speed increases from 135 km/h to 193 km/h, and the number of successful episodes in which the vehicle completes a full lap increases from 96 to 147. Also, the variance of the distance from the vehicle to the center of the road is compared: the variance of the DDPG is 0.6 m, while that of the MAPDDPG is only 0.2 m. These results indicate that the proposed MAPDDPG achieves excellent performance.

1. Introduction

With the rapid development of artificial intelligence technology, the era of autonomous driving has arrived. However, traditional autonomous driving systems require human involvement in designing rules. Also, to make decisions, structural information about the scenario [1] needs to be built, such as subsystems for lanes, markings, pedestrians, vehicles, and beacons [2]. Currently, there are two main types of unmanned driving algorithms: modular unmanned driving and end-to-end unmanned driving. The traditional modular unmanned driving algorithm has the following shortcomings.
  • The process of rule-based policy-making is complex and costly [3].
  • The vehicle observes the surrounding information through the perception module without filtering [4,5,6,7,8], making the convergence slow and training inefficient.
  • The predefined input and output of each subsystem are not necessarily optimal, which makes it difficult to adapt to complex and changeable environments [9].
Deep reinforcement learning combines the perceptual capabilities of deep learning with the decision-making abilities of reinforcement learning to solve the above problems and enables end-to-end learning. It has a wide range of applications in autonomous driving, such as vehicle power administration [10,11,12,13], control [14], and policy-making [15,16]. So far, deep reinforcement learning has been classified into two main approaches: the value-based approach and the policy-based approach. Value-based approaches focus on estimating the value of a given state and are more commonly used in discrete action problems; examples include Deep Q-learning (DQN) [17], Double DQN [18], and Dueling DQN [19]. However, this kind of approach cannot address continuous action problems. Policy-based approaches concentrate on finding the optimal policy that maximizes the anticipated gains and can be used to address continuous action problems; examples include REINFORCE [20], Truncated Natural Policy Gradient (TNPG) [21], Proximal Policy Optimization (PPO) [22], and Deep Deterministic Policy Gradient (DDPG) [23]. Among them, DDPG works very well on Atari [15] and is widely used in autonomous driving. However, some problems of this approach need to be solved, such as low training efficiency, slow convergence, and no awareness of the environment.
Attention mechanisms can ignore irrelevant information and focus on key information. Thus, attention can be applied to autonomous driving to recognize environmental information [24]. Considering the shortcomings of previous autonomous driving research and the characteristics of the attention mechanism, this paper proposes MAPDDPG, which incorporates the convolutional block attention module [25] into the deep deterministic policy gradient algorithm [23] to solve the problem of complex and costly manual decision-making processes. During training in the open racing car simulator (TORCS), the MAPDDPG model automatically pays attention to the input information that may affect driving. This makes the learned driving strategies more in line with human driving behavior and contributes to higher safety. The main contributions of this work are as follows.
  • Channel and spatial attention mechanisms are applied to the actor network of the MAPDDPG to weigh the important regions of the input images so that the model can focus on the key region information. Based on a middle feature matrix, an attention feature map in two independent dimensions (channel and space) is deduced by the module. Then, the two features are multiplied to improve the characteristics of the input image.
  • In the actor network of the MAPDDPG algorithm, the GRU layer is used by the model as a temporal attention mechanism to weigh the past few frames according to their importance to determine the current driving policy. The important frames in the past can provide further optimization for behavioral decisions. By processing the data in parallel, the entire network obtains the data features symmetrically from multiple dimensions.
  • A new reward function is proposed and used as the criterion for evaluation. Three orbits in the TORCS simulation environment are exploited to evaluate the performance of the MAPDDPG module.

2. Related Work

Deep learning and deep reinforcement learning have been extensively studied and applied to autopilot systems. The application of deep learning in autopilot dates back as far as 1986, when a three-layer back-propagation network was used to train on data obtained from cameras and laser rangefinders and output vehicle actions [26]. LeCun et al. [27] designed a vision-based obstacle avoidance system that maps raw input images to steering angles; the system was trained in supervised mode to predict steering angles. Karol Zieba et al. trained a convolutional neural network with a small amount of data to map raw camera pixels to action commands [4,5]. Besides, to solve the problem of an unstable learning process, researchers have devised several network structures that can be used in complex environments. For example, Mehta et al. [28] proposed a multi-task learning from demonstration (MTLfD) framework that predicts visual affordances and action primitives and guides predictive driving commands through direct supervision. Sauer et al. [29] presented a direct perception method that maps video inputs to intermediate representations and is adapted to autonomous guidance in sophisticated urban surroundings to reduce traffic accidents. All these methods require significant human involvement, which introduces uncertainty into the training of end-to-end self-driving systems. If the data obtained are not independently distributed or the noise is not homogeneous, the results of end-to-end autonomous driving training can deviate significantly from the real results.
Deep reinforcement learning learns through exploration and trial and error and does not require prior knowledge. In the study of deep reinforcement learning, Riedmiller et al. [30] proposed a neural fitted Q iteration (NFQ) network that is purely data-driven and guides a real robot car based on data collected directly from experiments. Jung et al. [31] extended an approach based on deep inverse reinforcement learning; the extension exploited a new type of neural network to derive contextual relationships from sensory data and blend them with the output. By using expert demonstration in Q-learning, Xia et al. [32] improved stability by 32% and reduced convergence time by 72%. Chae et al. [33] exploited Deep Q-learning [17] to train the driving behavior of the agent in urban environments to decrease the occurrence of mishaps. Wang et al. [34] treated both the state space and the action space as continuous and designed a Q-function approximator with a closed-form greedy policy to train the vehicle to learn automatic lane-changing behavior.
However, DQN can only handle discrete action spaces, so subsequent researchers used the Actor-Critic algorithm and the deep deterministic policy gradient (DDPG) to deal with continuous action problems. Jaritz et al. [35] mapped the RGB images from the front camera to output actions and trained the agent with the Asynchronous Advantage Actor-Critic [36] algorithm to achieve fast convergence and stable driving. Wang et al. [37] exploited DDPG to train the lane-changing behavior of the agent. In [16], deep reinforcement learning was applied to an actual full-size self-driving vehicle for the first time, where the DDPG network takes the image information observed by the vehicle as input and is trained with a sparse reward. Wang et al. [38] proposed to set the learning objectives by collecting and analyzing driving data from different drivers; the DDPG algorithm was then exploited to design a driving decision system. Although the DDPG algorithm works well in autonomous driving research, it suffers from great deficiencies in stability and data processing.
Based on the previous research, this paper proposes the MAPDDPG model that can selectively pay attention to the input information, thus enhancing the safety of autonomous driving and obtaining a better reward.

3. Methods

This section outlines the MAPDDPG model, and the overall structure of the model is shown in Figure 1. A convolutional block attention mechanism is introduced in this paper to extract channel and spatial features of images to make the model focus on the information of important regions. Meanwhile, a GRU layer is added to make the important frames in the past provide further optimization to behavioral decisions. Unlike previous autopilot models, MAPDDPG takes sensor and image information as input and a new reward function is designed to accelerate the training process. The MAPDDPG model mainly consists of five elements:
  • Convolutional Neural Network (CNN): The model adopts a five-layer CNN to extract features from the image and obtain the middle feature matrix.
  • Channel attention and spatial attention layer: The model adds the channel attention and spatial attention layer behind the CNN. First, the model ignores the spatial dimensionality of the input features to obtain the channel attention feature map. Then, the model produces a spatial attention graph by using the spatial correlation between the features.
  • Gated Recurrent Unit layer (GRU): The GRU layer is placed behind the convolutional block attention mechanism. GRU uses gating mechanisms to make the proposed model not only remember past information but also selectively forget unimportant information.
  • Priority experience replay: The MAPDDPG differs from the previous models in that it exploits priority experience replay to improve sample utilization.
  • Reward function: A new reward function is designed that consists of three components: speed, distance to the center of the lane, and whether to run out of the track.

3.1. Deep Deterministic Policy Gradient

This paper builds the MAPDDPG model on DDPG [23], which has two neural network structures with symmetry. The parameters of the target network are copied from the online network after C iterations. Symmetries may be found in Markov decision processes (MDPs); for example, CartPole has symmetry along the longitudinal axis. Our model introduces an MDP with symmetry, which involves a set of transformations of the state-action space that keep the reward function and the transition operator unchanged. A state transformation and a state-dependent action transformation are denoted as S_s and A_a, respectively.
The actor network takes the state as input and outputs an action; it is in charge of producing actions and interacting with the environment. The critic network is responsible for evaluating the performance of the action and determining the action for the next state of the actor network so as to obtain the maximum Q-value. The loss function L is defined as follows.
L = \frac{1}{N}\sum_i \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2
y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)
where y_i and r_i are the target Q-value and the reward, respectively, and γ is the discount factor. s_i and s_{i+1} are the current state and the next state. θ^Q and θ^μ denote the network parameters of the critic network and the actor network, while θ^{Q'} and θ^{μ'} are the parameters of the corresponding target networks. The gradient update of the actor network is:
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}
where \nabla_{\theta^{\mu}} J is the gradient of the performance objective, and J(\mu) measures the performance of the actor network \mu.
The DDPG algorithm adopts an experience pool and a dual-network architecture with symmetry to break the dependency between data samples. The target networks provide the target value Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}). The weights of the target networks are adjusted slowly with a rate \tau that is usually much less than 1. The weights are updated as follows:
\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\,\theta^{Q'}
\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\,\theta^{\mu'}
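The following minimal Python sketch illustrates the critic loss and the soft target update described above. It is an illustrative reconstruction using the tf.keras API under stated assumptions (the network constructors, batch layout, and variable names are ours), not the authors' implementation.

```python
import tensorflow as tf

GAMMA, TAU = 0.99, 0.001  # discount factor and smooth factor from Table 1

def critic_loss(critic, target_critic, target_actor, batch):
    """batch = (states, actions, rewards, next_states) as tensors (assumed layout)."""
    s, a, r, s_next = batch
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q')
    y = r + GAMMA * target_critic([s_next, target_actor(s_next)])
    # L = (1/N) * sum_i (y_i - Q(s_i, a_i | theta_Q))^2
    return tf.reduce_mean(tf.square(y - critic([s, a])))

def soft_update(target_net, online_net, tau=TAU):
    # theta' <- tau * theta + (1 - tau) * theta'
    for t_var, o_var in zip(target_net.weights, online_net.weights):
        t_var.assign(tau * o_var + (1.0 - tau) * t_var)
```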

3.2. Attention-Based Actor-Critic Structure

Drivers are likely to consider several historical states to take action, and they evaluate the significance based on the time and place of the state. To gain this ability, the attention mechanism shown in Figure 2 is proposed in this paper.
Attention enables the neural network to focus on the useful information of the input data that is relevant to the current output, thus improving the quality of the output. In this paper, attentional mechanisms are exploited to make the model concentrate on the significant traits. Specifically, the picture information is extracted by the convolutional neural network to form an intermediate matrix, and the features of the important regions in the picture are extracted by the channel and spatial attention mechanism. Then, the feature matrix is fed into the GRU layer to filter out the important frames in the past time, and the final output action after three fully connected layers is obtained.
In this subsection, two attention mechanisms (channel attention and spatial attention) that are incorporated into the actor network to provide better options for action are described.

3.2.1. Channel Attention

The MAPDDPG generates a channel attention map by exploiting the channel-to-channel correlations. Since each channel of the feature map is regarded as a detector of traits [39], it makes sense for channel attention to focus on "what" is in the input image. Based on this, the commonly used average pooling and max pooling are adopted to aggregate the spatial information. The architecture of channel attention is illustrated in Figure 3.
The channel attention of the MAPDDPG model is as follows:
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big)
Firstly, a feature map F ∈ ℝ^{H×W×C} is input to the module. The spatial information of the feature map is aggregated by applying max pooling and average pooling to create two distinct spatial context descriptors, F^{c}_{max} and F^{c}_{avg}, which represent the max-pooled and average-pooled features, respectively. Then, the two descriptors are fed into a shared network consisting of a multilayer perceptron (MLP) with one hidden layer to create a channel attention map M_c ∈ ℝ^{C×1×1}. The parameter overhead is minimized by setting the size of the hidden activation to ℝ^{(C/r)×1×1}, where r is the channel reduction ratio. Finally, the two resulting feature vectors are summed and fed into a sigmoid activation function σ to obtain the weights M_c. W_0 ∈ ℝ^{(C/r)×C} and W_1 ∈ ℝ^{C×(C/r)} are the weights of the shared network.
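The channel sub-module can be sketched in a few lines of tf.keras code. This is a hedged illustration of the formula above rather than the paper's code; the reduction ratio r = 8 and the layer choices are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(feature_map, r=8):
    """Return M_c(F) * F for a 4-D tensor F of shape (batch, H, W, C)."""
    channels = feature_map.shape[-1]
    # Shared two-layer MLP: W_0 reduces C to C/r (with ReLU), W_1 restores C
    w0 = layers.Dense(channels // r, activation="relu")
    w1 = layers.Dense(channels)
    avg_pool = layers.GlobalAveragePooling2D()(feature_map)   # F_avg^c
    max_pool = layers.GlobalMaxPooling2D()(feature_map)       # F_max^c
    # Sum the two descriptors after the shared MLP, then apply the sigmoid
    m_c = tf.nn.sigmoid(w1(w0(avg_pool)) + w1(w0(max_pool)))  # (batch, C)
    m_c = layers.Reshape((1, 1, channels))(m_c)               # broadcast over H and W
    return feature_map * m_c
```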

3.2.2. Spatial Attention

The MAPDDPG adds the spatial attention behind the channel attention to focus on "where" the features are meaningful. The spatial sub-module applies two pooling operations similar to those in the channel attention and aggregates features along the channel axis to produce a valid feature descriptor. The architecture of spatial attention is shown in Figure 4.
The calculation of spatial attention is shown in Equation (7).
M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times 7}([F^{s}_{avg}; F^{s}_{max}])\big)
Similar to channel attention, a feature map F of size H × W × C (as refined by the channel attention) is given. Two 2D feature maps, F^{s}_{avg} ∈ ℝ^{1×H×W} and F^{s}_{max} ∈ ℝ^{1×H×W}, are created by pooling along the channel axis in the same way as in the channel attention. Then, they are concatenated and fed into a 7 × 7 convolutional layer f^{7×7} to obtain a spatial attention map M_s(F) ∈ ℝ^{H×W}. The activation function σ is the sigmoid.
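A corresponding sketch of the spatial sub-module is given below; again, this is an assumption-laden illustration (tf.keras API, "same" padding) rather than the exact network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(feature_map):
    """Return M_s(F) * F for a 4-D tensor F of shape (batch, H, W, C)."""
    # Pool along the channel axis to obtain F_avg^s and F_max^s (each H x W x 1)
    avg_pool = tf.reduce_mean(feature_map, axis=-1, keepdims=True)
    max_pool = tf.reduce_max(feature_map, axis=-1, keepdims=True)
    concat = tf.concat([avg_pool, max_pool], axis=-1)           # (batch, H, W, 2)
    # A 7x7 convolution followed by a sigmoid yields the spatial attention map
    m_s = layers.Conv2D(1, kernel_size=7, padding="same",
                        activation="sigmoid")(concat)            # (batch, H, W, 1)
    return feature_map * m_s
```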

3.2.3. Convolutional Block Attention Mechanism

Channel attention and spatial attention can be combined in series or in parallel, but it is found that combining them in series with channel attention first leads to better results. Therefore, channel attention is placed before spatial attention in our model, as shown in Figure 5. The attention module of the MAPDDPG model is calculated as follows.
F' = M_c(F) \otimes F
F'' = M_s(F') \otimes F'
where ⊗ indicates element-wise multiplication, and F'' is the feature map reconstructed by the channel and spatial attention mechanisms.
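Using the two hypothetical helper functions sketched above, the serial combination reads as follows; the input shape is an illustrative assumption for the intermediate feature map produced by the five-layer CNN.

```python
from tensorflow.keras import Input

F = Input(shape=(60, 80, 64))   # intermediate feature map (assumed shape)
F1 = channel_attention(F)       # F'  = M_c(F)  ⊗ F
F2 = spatial_attention(F1)      # F'' = M_s(F') ⊗ F'
```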

3.3. Gated Recurrent Unit

CNN networks only take historical observations as input and do not consider the time information. By contrast, RNN considers longer sequences of historical information through the time link, which contributes to the generation of more sophisticated driving strategies. As shown in Figure 6, the recurrent neural network chosen in this paper is Gated Recurrent Unit [40].
After the channel and spatial attention, the reconstructed feature map F'' and the hidden vector h_{t−1} are each passed through a linear transformation to form the two gates, i.e., the update gate z_t = σ(W^{(z)} F'' + U^{(z)} h_{t−1}) and the reset gate r_t = σ(W^{(r)} F'' + U^{(r)} h_{t−1}). The update gate assists our model in determining how much previous information to pass on to the future, while the reset gate primarily decides how much previous information should be discarded. The formulas are as follows:
\tilde{h}_t = \tanh\big(W F'' + U (r_t \odot h_{t-1})\big)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
where W and U are linear transformation matrices, and ⊙ denotes the element-wise (Hadamard) product. The gated hidden vector h'_{t−1} = r_t ⊙ h_{t−1} is combined with the input features F'' to obtain the candidate hidden vector \tilde{h}_t, which retains the information of the current cell. Subsequently, the context vector, i.e., the weighted sum of the GRU layer outputs, is calculated as follows:
w_{T+1-t} = \mathrm{softmax}\big(F''_{T+1-t}\, h_{T+1-t}\big)
C_T = \sum_{t=1}^{T} w_{T+1-t}\, h_{T+1-t}
As shown in Figure 2, the context vector C_T is fed into the fully connected layers before the actions are output. The weights w_{T+1−t} learned by the network can be interpreted as the significance of the GRU output at a given time step. The significance of using the GRU layer is that it exploits the output characteristics of the past T frames to obtain the action output policy.
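A possible tf.keras sketch of this temporal weighting is given below. The GRU width, the projection of the input features before the dot-product score, and the tensor shapes are assumptions made for illustration; only the softmax weighting and the weighted sum of GRU outputs follow the formulas above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_context(feature_seq, units=64):
    """feature_seq: (batch, T, feat_dim) sequence of attention-refined features F''."""
    gru_out = layers.GRU(units, return_sequences=True)(feature_seq)  # h_{T+1-t}
    proj = layers.Dense(units)(feature_seq)                          # project F''_{T+1-t} (assumption)
    scores = tf.reduce_sum(proj * gru_out, axis=-1)                  # dot-product score per time step
    weights = tf.nn.softmax(scores, axis=-1)                         # w_{T+1-t}
    # C_T = sum_t w_{T+1-t} * h_{T+1-t}
    return tf.reduce_sum(weights[..., None] * gru_out, axis=1)       # (batch, units)
```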

3.4. Network and Priority

As shown in Figure 7, in the actor network and the target actor network, the feature maps and the 29-dimensional sensor information obtained from TORCS are taken as the input of the model, and actions are output. The two networks have the same structure with symmetry.
As shown in Figure 8, the critic network and target critic network differ from the actor network in two aspects. One is that the action information obtained from the actor network is input to the connected network in the model. The other is that only the Q value of the output actions needs to be calculated to evaluate the performance of the actions in the critic network. In this case, there is no need to use the attention mechanism in the critic network, which reduces the computational complexity of the network and speeds up the training. The critic network and target critic network have the same structure with symmetry.
In this paper, the MAPDDPG is optimized with priority sampling instead of uniform sampling. The prioritized experience replay [41] approach prioritizes all experiences by their TD-error and preferentially chooses the ones with higher priority. This is achieved with a SumTree, a binary tree structure. As shown in Figure 9, each leaf node of the SumTree stores the priority of one experience sample, and each internal node stores the sum of the values of its child nodes. For instance, suppose the total priority is 42 and a random number in [0, 42], say 24, is drawn. The search starts from the root and compares the value with the left child: if the value is not larger than the left child's sum, the search moves to the left child; otherwise, the left child's sum is subtracted from the value and the search moves to the right child. The search continues from top to bottom until a leaf is reached, for example a sample with a priority of 12.
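The retrieval rule can be made concrete with a compact, generic SumTree sketch (not the authors' code): internal nodes hold the sum of their children, and a sampled value descends left when it does not exceed the left child's sum, otherwise it is reduced by that sum and descends right.

```python
import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)    # internal nodes followed by leaves
        self.write = 0

    def add(self, priority):
        leaf = self.write + self.capacity - 1
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                           # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        idx = 0
        while 2 * idx + 1 < len(self.tree):        # descend until a leaf is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx]                 # leaf index and its priority
```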

3.5. Reward Function

To assess the superiority of the MAPDDPG, a new reward function consisting of three components is designed. Firstly, the velocity is restricted to a steady range, neither too large nor too small. This component of the reward function is denoted R_speed:
R_{speed} = \begin{cases} v, & v \le 50\ \text{m/s} \\ 100 - v, & v > 50\ \text{m/s} \end{cases}
where v is the speed. Secondly, it is desired that the vehicle stay on the centerline; thus, if the vehicle deviates from the center, a penalty is given. This component of the reward function is denoted R_center. Thirdly, to penalize the situation where the vehicle runs out of the lane, the component R_out is used.
R_{center} = V_x \cos(\theta) - V_x \sin(\theta) - |trackPos| - V_x\,|trackPos|
R_{out} = -50
where V_x cos(θ) indicates the speed along the lane, which should be encouraged; V_x sin(θ) denotes the velocity perpendicular to the track and should be penalized; |trackPos| and V_x |trackPos| are penalty terms. The farther the vehicle moves away from the centerline, the lower the reward.
The overall reward function is a linear combination of the three components with weights α, β, and γ. Finally, the reward function is normalized, and the MAPDDPG is trained to determine the most appropriate combination of the weights.
R = \alpha R_{speed} + \beta R_{center} + \gamma R_{out}
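A hedged Python sketch of the composite reward is given below; the 50 m/s threshold and the three components follow the text, while the variable names, the default weights, and the reconstructed signs of the penalty terms are assumptions.

```python
import math

def reward(v_x, angle, track_pos, out_of_track,
           alpha=1.0, beta=1.0, gamma_w=1.0):   # default weights are placeholders
    # R_speed: encourage speed up to 50 m/s, penalize beyond it
    r_speed = v_x if v_x <= 50.0 else 100.0 - v_x
    # R_center: reward longitudinal speed, penalize lateral speed and offset
    # (signs follow the reconstructed equation above)
    r_center = (v_x * math.cos(angle) - v_x * math.sin(angle)
                - abs(track_pos) - v_x * abs(track_pos))
    # R_out: fixed penalty when the vehicle leaves the track
    r_out = -50.0 if out_of_track else 0.0
    return alpha * r_speed + beta * r_center + gamma_w * r_out
```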

4. Experiment

To confirm the validity of the MAPDDPG model, it is run on TORCS, an open-source simulation tool for autopilot. Other software used in the experiment includes Anaconda 3, Keras 0.1.1, and Tensorflow-gpu 0.12. All our experiments are conducted on a machine running Ubuntu 16.04, and the machine is equipped with a 16-core CPU, 64 GB memory, and GTX-1060 GPU. The parameters set for the network are listed in Table 1.

4.1. Experiment Settings

In many autonomous driving studies based on deep reinforcement learning, the inputs are divided into two domains, i.e., image and sensor information. In human driving scenarios, the data transferred by sensors and vision needs to be considered. So, the input of MAPDDPG has two aspects: image and sensors. Specifically, the images from the vehicle’s front camera are selected as input, and the sensor input is listed in Table 2.
To ensure that the proposed method is not restricted to a specific route, the three roads shown in Figure 10 are chosen for training and validation. Aalborg is selected as the training track, and Track-2 and Track-3 as the validation tracks. All models are trained in 500 episodes.

4.2. Experiment Results

The vehicle is trained for about 500 episodes (close to 500,000 steps) on the Aalborg map. When the vehicle dashes off the track or turns in the opposite direction, the episode ends. Thus, the length of each episode varies greatly; a good model can, in principle, make an episode infinitely long. Meanwhile, the maximum length of each episode is set to 100,000 steps. Images with a size of 320 × 240 are gathered at a frequency of 5 Hz, and the training is performed on the Aalborg map for hours; in this case, the vehicle collects approximately 100,000 frames for each track. For training autonomous vehicles with deep reinforcement learning, the cumulative return per episode is a significant assessment criterion: the better the model, the higher the return.
During the training, it is found that the training reaches convergence at 200 episodes. Thus, three criteria are adopted to test the model performance: the reward values for 200 episodes, the average velocity along the X-axis, and the distance from the centerline. In this paper, we compare the MAPDDPG model with the A3C, PPO, P-DDPG, and DDPG. All models are trained under the same parameters as listed in Table 1, and the experimental results of five runs are averaged to make comparisons. The comparison results are shown in Figure 11, Figure 12 and Figure 13.
It can be seen from Figure 11 that the reward obtained by MAPDDPG is higher than that of the DDPG, P-DDPG (DDPG with experience priority only), A3C, and PPO algorithms. The reward value of the proposed algorithm increases significantly after 25 episodes, reaching 72,995 after about 50 episodes and becoming stable after about 110 episodes of training. At this point, the reward values of the DDPG, P-DDPG, A3C, and PPO algorithms are 7000, 42,995, 12,772, and 39,995, respectively. This suggests that by adding an attention mechanism, the proposed model can analyze the current state and focus on the important information, which allows the MAPDDPG model to obtain accurate action strategies rapidly. By contrast, the DDPG, P-DDPG, A3C, and PPO algorithms gradually obtain stable training results after about 100, 80, 145, and 110 episodes, respectively.
As shown in Figure 12, the comparison of the average speed over 200 episodes indicates that the MAPDDPG model performs more stably and better than the other algorithms. The speed of the MAPDDPG model reaches 158.2 km/h at the 50th episode, while the speeds of the DDPG, P-DDPG, A3C, and PPO algorithms are only 96.2 km/h, 50.1 km/h, 49.5 km/h, and 82.6 km/h, respectively.
This experiment also compares the deviation of the vehicle from the centerline for each episode. As shown in Figure 13, the MAPDDPG can steadily travel on the centerline at episode 28, while the DDPG, P-DDPG, A3C, and PPO algorithms tend to stabilize after 75, 62, 146, and 52 episodes, respectively.
Besides, the convergence time and the variance of the distance from the vehicle to the center of the lane are compared. As listed in Table 3, the MAPDDPG converges more than twice as fast as DDPG, and its variance is reduced by nearly a factor of three.
Meanwhile, the number of episodes in which each algorithm drives one full lap smoothly in TORCS is evaluated. As shown in Table 4, compared with DDPG, P-DDPG, A3C, and PPO, our algorithm performs the best in all aspects. The MAPDDPG model drives a complete lap in 147 of 200 episodes, collides in 36 episodes, and runs out of the boundary in 17 episodes. The failures are thus mainly collisions and boundary departures: collisions are caused by the vehicle not learning to slow down, while running out of the boundary is caused by not controlling speed during a sharp turn.
As shown in Figure 14, the gated recurrent units (GRU), spatial attention (Spat), channel attention (Chan), and joint models are integrated into the DDPG respectively to study their effects. All models are trained under the same parameters and all the experimental results presented in the graph are the average of five runs.
It can be noticed in Figure 14 that the integration of the GRU accelerates convergence and contributes to a higher reward and maximum speed. The integration of the attention mechanisms leads to greater utilization of image and sensor information, as well as more secure and robust autonomous driving behavior. For the model combining all components (comb), the maximum speed increases from 135 km/h to 193 km/h, and the number of successful episodes increases from 96 to 147 in comparison with DDPG. Finally, this paper summarizes all the salient results in Table 5.
As shown in Table 5, we compare our model with previous autopilot models that are trained by feeding all input information into the network without assigning weights to it. MAPDDPG adds an attention mechanism that allows the model to assign different weights to different regions of the input images, thus achieving the ability to recognize the environment. It is clear from all the comparisons that the MAPDDPG model achieves the best results.

5. Conclusions

In this paper, a deep reinforcement learning algorithm based on convolutional block attention is proposed to learn self-driving behavior. Using sensor and image information as input, an attention layer is first designed to make the model focus on the key regions of the image. Then, a GRU module is designed to optimize the output strategy by using important frames from past time steps, making the model "memorable". The weights of the attention and GRU layers are fused into the actor-critic network in a hybrid manner. Next, a prioritized experience replay buffer is added to improve sample utilization, and a new reward function is designed to speed up the training process. The MAPDDPG model processes data in parallel and captures features from multiple aspects based on the symmetry of the network model. Finally, it is demonstrated in the TORCS simulation that the channel and spatial attention mechanisms can improve the performance of deep reinforcement learning algorithms for autopilot. Compared with the current state-of-the-art autopilot algorithms, including A3C, PPO, DDPG, and P-DDPG, the MAPDDPG model reaches a maximum speed of 193 km/h and a maximum reward of 116,347, and the variance of the distance from the vehicle to the center of the lane is only 0.2 m, indicating that the proposed model achieves excellent performance.
In this paper, the study of autonomous driving strategies is based on individual vehicles. In the future, we will consider multiple vehicles for the research of autonomous driving and the effect of the remaining vehicles on a single car.

Author Contributions

Methodology, Q.L.; Supervision, Y.J.; Writing—review & editing, L.S. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liang, X.D.; Wang, T.; Yang, L.N.; Xing, E. CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving. 2018. Available online: https://arxiv.org/abs/1807.03776.pdf (accessed on 10 July 2018).
  2. Huang, Z.Q.; Zhang, J.; Tian, R.; Zhang, Y.X. End-to-End Autonomous Driving Decision Based on Deep Reinforcement Learning. In Proceedings of the 2019 5th International Conference on Control, Automation and Robotics (ICCAR), Beijing, China, 19–22 April 2019; pp. 658–662. [Google Scholar] [CrossRef]
  3. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A Survey of Autonomous Driving: Common Practices and Emerging Technologies 2019. Available online: https://arxiv.org/abs/1906.05113.pdf (accessed on 12 June 2019).
  4. Bojarski, M.; Testa, D.D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to End Learning for Self-Driving Cars. 2016. Available online: https://arxiv.org/pdf/1604.07316.pdf (accessed on 25 April 2016).
  5. Bojarski, M.; Yeres, P.; Choromanska, A.; Choromanski, K.; Firner, B.; Jackel, L.D.; Muller, U. Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car. 2017. Available online: https://arxiv.org/pdf/1704.07911.pdf (accessed on 25 April 2017).
  6. Xu, H.; Gao, Y.; Yu, F.; Darrell, T. End-to-End Learning of Driving Models from Large-Scale Video Datasets. 2016. Available online: https://arxiv.org/pdf/1612.01079.pdf (accessed on 4 December 2016).
  7. Chi, L.; Mu, Y. Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues. 2017. Available online: https://arxiv.org/pdf/1708.03798.pdf (accessed on 12 August 2017).
  8. Loiacono, D.; Prete, A.; Lanzi, P.L.; Cardamone, L. Learning to overtake in torcs using simple reinforcement learning. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar] [CrossRef]
  9. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 9. [Google Scholar]
  10. Hu, X.; Liu, T.; Qi, X.; Barth, M. Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects. IEEE Ind. Electron. Mag. 2019, 13, 16–25. [Google Scholar] [CrossRef] [Green Version]
  11. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement learning of adaptive energy management with transition probability for a hybrid electric tracked vehicle. IEEE Trans. Ind. Electron. 2015, 62, 7837–7846. [Google Scholar] [CrossRef]
  12. Zhou, Q.; Li, J.; Shuai, B.; Williams, H.; He, Y.; Li, Z.; Yan, F. Multi-step reinforcement learning for model-free predictive energy management of an electrified off-highway vehicle. Appl. Energy 2019, 255, 113755. [Google Scholar] [CrossRef]
  13. Han, X.; He, H.; Wu, J.; Peng, J.; Li, Y. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle. Appl. Energy 2019, 254, 113708. [Google Scholar] [CrossRef]
  14. Zhu, L.; Yu, F.R.; He, Y.; Ning, B.; Tang, T.; Zhao, N. Communication based train control system performance optimization using deep reinforcement learning. IEEE Trans. Veh. Technol. 2017, 66, 10705–10717. [Google Scholar] [CrossRef]
  15. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  16. Kendall, A.; Hawke, J.; Janz, D.; Mazur, P.; Reda, D.; Allen, J.M.; Shah, A. Learning to drive in a day. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8248–8254. [Google Scholar]
  17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  18. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. 2015. Available online: https://arxiv.org/pdf/1509.06461.pdf (accessed on 22 September 2015).
  19. Wang, Z.; de Freitas, N.; Lanctot, M. Dueling Network Architectures for Deep Reinforcement Learning. 2015. Available online: https://arxiv.org/pdf/1511.06581.pdf (accessed on 20 November 2015).
  20. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef] [Green Version]
  21. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
  22. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. 2017. Available online: https://arxiv.org/pdf/1707.06347.pdf (accessed on 20 July 2017).
  23. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y. Continuous Control with Deep Reinforcement Learning. 2015. Available online: https://arxiv.org/pdf/1509.02971.pdf (accessed on 9 September 2015).
  24. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2015; pp. 2048–2057. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Lecture Notes in Computer Science, Springer, Computer Vision—ECCV 2018; Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018. [Google Scholar]
  26. Pomerleau, D.A. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems 1; Touretzky, D.S., Ed.; Morgan-Kaufmann: San Francisco, CA, USA, 1989; pp. 305–313. [Google Scholar]
  27. LeCun, Y.; Muller, U.; Ben, J.; Cosatto, E.; Flepp, B. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2005; pp. 739–746. [Google Scholar]
  28. Mehta, A.; Subramanian, A. Learning end-to-end autonomous driving using guided auxiliary supervision. arXiv 2018, arXiv:1808.10393. [Google Scholar]
  29. Sauer, A.; Savinov, N.; Geiger, A. Conditional affordance learning for driving in urban environments. In Conference on Robot Learning; PMLR: New York, NY, USA, 2018; pp. 237–252. [Google Scholar]
  30. Riedmiller, M.; Montemerlo, M.; Dahlkamp, H. Learning to drive a real car in 20 minutes. In Proceedings of the 2007 Frontiers in the Convergence of Bioscience and Information Technologies, Jeju, Korea, 11–13 October 2007; pp. 645–650. [Google Scholar]
  31. Jung, C.; Shim, H. Incorporating Multi-Context into the Traversability Map for Urban Autonomous Driving Using Deep Inverse Reinforcement Learning. IEEE Robot. Autom. Lett. 2021, 6, 1662–1669. [Google Scholar] [CrossRef]
  32. Xia, W.; Li, H.; Li, B. A control strategy of autonomous vehicles based on deep reinforcement learning. In Proceedings of the 2016 9th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 10–11 December 2016; Volume 2, pp. 198–201. [Google Scholar]
  33. Chae, H.; Kang, M.; Kim, B.; Kim, J.; Choo, C.C.; Choi, J. Autonomous Braking System via Deep Reinforcement Learning. 2017. Available online: https://arxiv.org/pdf/1702.02302.pdf (accessed on 8 February 2017).
  34. Wang, P.; Chan, Y.; de la Fortelle, A. A reinforcement learning based approach for automated lane change maneuvers. In Proceedings of the IEEE Intelligent Vehicles Symposium, Changshu, China, 26–30 June 2018; pp. 1379–1384. [Google Scholar]
  35. Jaritz, M.; Charette, R.; Toromanoff, M.; Perot, E.; Nashashibi, F. End-to-end race driving with deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 2070–2075. [Google Scholar] [CrossRef] [Green Version]
  36. Mnih, V.; Badia, P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2016; pp. 1928–1937. [Google Scholar]
  37. Wang, P.; Li, H.; Chan, C. Continuous Control for Automated Lane Change Behavior Based on Deep Deterministic Policy Gradient Algorithm. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 1454–1460. [Google Scholar] [CrossRef] [Green Version]
  38. Wang, X.; Wu, C.; Xue, J.; Chen, Z. A Method of Personalized Driving Decision for Smart Car Based on Deep Reinforcement Learning. Information 2020, 11, 295. [Google Scholar] [CrossRef]
  39. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Available online: https://arxiv.org/pdf/1311.2901.pdf (accessed on 12 November 2013).
  40. Cho, K.; Merrienboer, V.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. 2014. Available online: https://arxiv.org/pdf/1406.1078.pdf (accessed on 3 June 2014).
  41. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. 2016. Available online: https://arxiv.org/pdf/1511.05952.pdf (accessed on 18 November 2015).
  42. GitHub. Available online: https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-andprioritized-experience-replay/ (accessed on 7 November 2016).
Figure 1. The architecture of our model.
Figure 2. Actor network framework in MAPDDPG.
Figure 3. The architecture of channel attention.
Figure 4. The architecture of spatial attention.
Figure 5. Convolutional block attention in our model.
Figure 6. The data transfer process at the GRU layer.
Figure 7. The architecture of the actor network.
Figure 8. The architecture of the critic network.
Figure 9. The tree structure of SumTree [42].
Figure 10. Training and validation tracks.
Figure 11. Reward comparison of 200 episodes of training.
Figure 12. Average speed of 200 episodes of training.
Figure 13. Distance from centerline to vehicle in 200 episodes.
Figure 14. DDPG performance with the addition of different modules.
Table 1. Simulation experiment setting parameters.

Parameter | Value
Experience buffer | 10,000
Smooth factor τ | 0.001
Discount factor γ | 0.99
Batch size | 64
Actor network learning rate α_A | 0.0001
Critic network learning rate α_c | 0.001
Table 2. Sensor input of the TORCS simulation environment.

Parameter | Range | Significance
ob.angle | [−π, π] | Angle of the vehicle to the centerline
ob.track | (0, 200) | Distance from the edge of the track to the car
ob.trackPos | (−∞, ∞) | Distance of the car from the centerline of the track
ob.speedX | (−∞, ∞) | Velocity along the X-axis
ob.speedY | (−∞, ∞) | Velocity along the Y-axis
ob.speedZ | (−∞, ∞) | Velocity along the Z-axis
Table 3. Convergence time and variance of the distance from the centerline to the vehicle.

Algorithm | DDPG | P-DDPG | A3C | PPO | MAPDDPG
Variance | 0.674 m | 0.536 m | 0.80 m | 0.463 m | 0.228 m
Convergence time | 75 h | 54 h | 108 h | 47 h | 34 h
Table 4. Number of successful laps in 200 episodes.

Algorithm | DDPG | P-DDPG | A3C | PPO | MAPDDPG
Collision | 72 | 47 | 58 | 43 | 36
Success episodes | 96 | 117 | 93 | 132 | 147
Out of boundary | 32 | 36 | 49 | 25 | 17
Table 5. Overall comparison of the results of our model with other algorithms.

Algorithm | DDPG | P-DDPG | A3C | PPO | MAPDDPG
Max speed | 135 km/h | 147 km/h | 124 km/h | 142 km/h | 193 km/h
Max reward | 62,207 | 99,667 | 84,416 | 89,641 | 116,347
Variance | 0.674 m | 0.536 m | 0.80 m | 0.463 m | 0.228 m
Success episodes | 96 | 117 | 93 | 132 | 147
Convergence time | 75 h | 54 h | 108 h | 47 h | 34 h
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

