Article

A DDQN Path Planning Algorithm Based on Experience Classification and Multi Steps for Mobile Robots

Xin Zhang, Xiaoxu Shi, Zuqiong Zhang, Zhengzhong Wang and Lieping Zhang
1 College of Mechanical and Control Engineering, Guilin University of Technology, Guilin 541006, China
2 Network and Information Center, Guilin University of Technology, Guilin 541006, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(14), 2120; https://doi.org/10.3390/electronics11142120
Submission received: 10 May 2022 / Revised: 3 July 2022 / Accepted: 4 July 2022 / Published: 6 July 2022
(This article belongs to the Section Systems & Control Engineering)

Abstract

Constrained by the sizes of the action space and state space, Q-learning cannot be applied to continuous state spaces. Targeting this problem, the double deep Q network (DDQN) algorithm and corresponding improvement methods were explored. First, to improve the accuracy of the DDQN algorithm in estimating the target Q value during training, a multi-step guided strategy was introduced into the traditional DDQN algorithm, in which the single-step reward was replaced with the reward obtained over continuous multi-step interactions of the mobile robot. Furthermore, an experience classification training method was introduced into the traditional DDQN algorithm, in which the state transitions generated by the robot–environment interaction were divided into two different experience pools; batches sampled from both pools were trained by the Q network, and the sampling proportions of the two experience pools were updated according to the training loss. Afterward, the advantages of the multi-step guided DDQN (MS-DDQN) algorithm and the experience classification DDQN (EC-DDQN) algorithm were combined to develop a novel experience classification multi-step DDQN (ECMS-DDQN) algorithm. Finally, the path planning of these four algorithms, namely DDQN, MS-DDQN, EC-DDQN, and ECMS-DDQN, was simulated on the OpenAI Gym platform. The simulation results revealed that the ECMS-DDQN algorithm outperforms the other three in total return value and generalization in path planning.

1. Introduction

Path planning for mobile robots refers to planning an optimal or near-optimal collision-free path from the start point to the endpoint according to the given obstacle environment, which is a complex problem of decision optimization with little prior information [1,2]. The common path planning algorithms include Dijkstra’s algorithm [3], A* algorithm [4], D* algorithm and genetic algorithm [5], particle swarm optimization [6], and neural network algorithm [7]. However, as the work environment of mobile robots changes, it becomes an urgent demand for robots to adapt to the complex and ever-changing environment, so certain disadvantages of these algorithms are exposed in this situation [8,9].
As one of the methodologies of machine learning, reinforcement learning is used to describe and solve the problem of how agents learn and optimize strategies during their interaction with the environment. Its strength lies in solving the path planning problem of mobile robots with limited a priori information [10]. However, traditional reinforcement learning algorithms, such as Q-learning, which are limited by action space and sample space, cannot be applied to practical scenes with a large state space and continuous action space [11]. Deep learning is a major branch of machine learning in artificial intelligence. After training, it is able to identify the regular information of sample data and help explain the internal correlation of data. Deep reinforcement learning obtained by combining deep learning with reinforcement learning can solve the problems faced by Q-learning and other algorithms in path planning [12]. In the literature [13], a Deep Q Network (DQN) algorithm was proposed for the improvement of learning strategies based on different requirements for the depth and breadth of experience in different learning stages of intelligent path planning. However, the modeling of its reward function faces certain limitations when it is used in a complex environment. To overcome this problem, the initialization of robots in each round of training was redefined and an improved double DQN (DDQN) algorithm was developed to solve the problem of insufficient experience [14]. In terms of dynamic and complex environments, the DDQN algorithm was used for path planning of unmanned aerial vehicles (UAVs), reducing the energy consumed during UAV flight and computing operations [15]. Given the survival probability of UAVs in an enemy-threatened dynamic environment, a fast evaluation model was built and a deep reinforcement learning method for UAV path planning based on global situation information was put forward by plotting the situation map corresponding to the overall threat value of each enemy to the UAV [16]. Aiming at the instability of the training phase and the sparsity of the environment state space, a specific training method and its reward function were designed, effectively solving the non-convergence problem due to the reward sparsity in the state space [17]. For higher accuracy of the target Q value during the training process, heuristic knowledge was introduced into reinforcement learning, thereby improving the efficiency of path planning and obstacle avoidance of the DQN algorithm [18]. Through the network parameter training based on the dynamic fusion of the a priori knowledge of DDQN and average DQN, a dynamic fusion DDQN algorithm, dynamic target DDQN (DT-DDQN), was designed, narrowing the gap between the Q value output by the network and the real one and reducing the influences of the estimation on the action choices of robots [19]. The evaluation function of DQN was improved by the correction function to increase the evaluation accuracy of the value function of the algorithm [20]. Deep reinforcement learning was applied to path planning of mobile robots in unknown dynamic environments [21], where targeting the problem of mutual collision triggered by abnormal rewards due to the relative motion of obstacles and robots, two reward thresholds were set to modify the abnormal rewards, thus realizing optimal decision learning. 
The experience replay mechanism was combined with Q-learning [22] and the soft actor–critic (SAC) algorithm [23], respectively, improving the sample utilization rate, learning efficiency, and convergence speed in three-dimensional (3D) path planning of UAVs and path planning of multi-arm robots. As for the problem of uniform sampling with equal probability during experience replay, an improved SAC algorithm, the prioritized experience replay (PER)-SAC algorithm, was designed, increasing the convergence speed of the algorithm [24]. The DQN algorithm was improved by DDQN with PER, raising the degree of intelligent ship driving to accelerate the learning process [25].
According to the above analysis, the reinforcement learning algorithms in the aforementioned literature were improved mainly on the basis of the temporal-difference error of state transitions in experience or of the network approximating the value function. However, the influence of experience characteristics on algorithm training was ignored. In this paper, algorithm improvement and fusion were carried out from the two perspectives of value function error and experience, generating an experience classification multi-step DDQN (ECMS-DDQN) path planning algorithm that combines multi-step guidance and experience classification to improve the adaptivity of mobile robot path planning.
Firstly, based on the DDQN algorithm, the single-step reward was replaced with the reward obtained by the continuous multi-step interaction of mobile robots to enable these robots to obtain a more accurate target state during training. Secondly, based on the DDQN algorithm, the state transition generated by the interaction between mobile robots and the environment was divided into two different types of experience pools, which were sampled and stitched in different proportions. The stitched state was trained by the Q network, and the sampling proportions of the two experience pools were updated through the training loss to improve the training speed and experience utilization of mobile robots. Then, the above two improvements were combined to develop a novel ECMS-DDQN path planning algorithm. Finally, the effectiveness of the proposed algorithm was validated through the path planning experiments of mobile robots designed in two different environments.
The rest of the paper is structured as follows: The theoretical basis of DDQN is illustrated in Section 2. A detailed description of the MS-DDQN and EC-DDQN algorithms, as well as the ECMS-DDQN algorithm formulated by combining the two, is given in Section 3. Section 4 focuses on the different experimental environments built to validate and analyze the performance of the proposed algorithm. Finally, the paper is concluded in Section 5.

2. DDQN Algorithm

2.1. DQN Algorithm

In the DQN algorithm, the input of the Q network is the state feature ϕ(S) corresponding to the state information S observed by the agent. The output is the estimated action value Q(S, a) for every available action a ∈ A in that state. Generally, a large amount of data is needed for the neural network to converge, i.e., for the Q network to approximate the optimal action value function. Therefore, the DQN algorithm cannot be updated at every step in the same way as the Q-learning algorithm. Concerning this problem, the experience replay technique was introduced into the DQN algorithm to realize the convergence of the value function [26], which is also one of the core characteristics of the DQN algorithm.
Two networks with the same structure, namely, the critic network Q and the target network Q̂, are used in the DQN algorithm. The critic network Q is used to select actions and update parameters, while the target network Q̂ is used to calculate the target Q value. The parameters of the critic network Q are updated through iterations, while the parameters of the target network Q̂ are updated by copying the parameters w of the critic network Q to the parameters w′ of the target network Q̂ at fixed intervals. For a state transition (ϕ(S), A, R, ϕ(S′), is_done), the target value Q_target(ϕ(S), A) equals the instant reward R plus the product of the attenuation parameter γ and the maximum action value of the post-state S′ estimated by the target network (when S′ is not a terminal state). The target value used in parameter updating is shown in Equation (1):
Q_{target}(\phi(S), A) = \begin{cases} R, & is\_done \text{ is true} \\ R + \gamma \max_{a'} \hat{Q}(\phi(S'), a', w'), & is\_done \text{ is false} \end{cases}
In Equation (1), a′ ∈ A refers to the action that maximizes the post-state value Q̂(ϕ(S′), a′, w′), and is_done indicates whether the current state is a terminal state.
The mean-square error cost function is usually used in the DQN algorithm to calculate the network error. The expression of the cost function is shown in Equation (2):
J(\omega) = \frac{1}{2M} \sum_{t=1}^{M} \left[ r_t - y_t \right]^2
In Equation (2), r_t is the target value and y_t is the network output value. Substituting Equation (1) into Equation (2), the network error between the target value and the network output can be calculated, as shown in Equation (3):
J(\omega) = \frac{1}{2M} \sum_{t=1}^{M} \left[ R_t + \gamma \max_{a'} \hat{Q}(S'_t, a', w') - Q(S_t, A_t, w) \right]^2
Through gradient back-propagation, the parameter w of the neural network was updated until the Q network finally converged. The parameter w was updated as shown in Equation (4):
w_{t+1} = w_t + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a, w_t) - Q(S_t, A_t, w_t) \right] \nabla_{w_t} Q(S_t, A_t, w_t)
The pseudocode of the DQN algorithm is shown in Algorithm 1.
Algorithm 1: Pseudocode of the DQN algorithm
  Input: ε init , ε min , T , α , γ , m , N , n , A , Q ( w ) , Q target ( w ) , C
  for t = 1 to T do
    Initialization: Initialize the first state S in the present state sequence and acquire its eigenvector ϕ ( S )
    Update the basic exploration factor ε t based on the number of iteration episodes t
    for each initialized first state S do
      (a) Select an action A = argmax_a Q(ϕ(S), a, w) with probability 1 − ε_t; otherwise, randomly select an action A
      (b) Execute the action A under the state S to acquire an instant reward R, and obtain a new state S′ and its eigenvector ϕ(S′). is_done indicates whether the new state is terminal
      (c) Store the quintuple {ϕ(S), A, R, ϕ(S′), is_done} in the experience container D
      (d) S = S′
      (e) If the total sample size in the experience container D is greater than N, randomly collect m state transition samples from D and calculate the present target Q_target (y_j):
y_j = \begin{cases} R_j, & is\_done_j \text{ is true} \\ R_j + \gamma \max_{a'} \hat{Q}(\phi(S'_j), a', w'), & is\_done_j \text{ is false} \end{cases}
      (f) Use the mean square error loss function \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(\phi(S_j), A_j, w)\right)^2 to update the network weight w via the mini-batch gradient descent method
      Update the target network parameter Q target ( w ) every C steps
      Until S ends in an is_done state
      end for
     Until reaching the maximum number of iteration episodes T
  end for
  Output: Q target ( w )
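To make Equations (1)–(4) and the replay-based training loop concrete, the following minimal PyTorch sketch performs one DQN training step on a sampled mini-batch. The tensor names, the networks q_net and q_target, the optimizer, and the batch layout are illustrative assumptions rather than the implementation used in this paper.

import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, optimizer, batch, gamma=0.9):
    # batch: tensors of states (m, 10), actions (m,), rewards (m,),
    # next_states (m, 10), and dones (m,) sampled from the experience container D
    states, actions, rewards, next_states, dones = batch

    # Q(phi(S), A, w): value of the action actually taken (critic network)
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Equation (1): y = R if is_done, else R + gamma * max_a' Q_hat(phi(S'), a', w')
    with torch.no_grad():
        max_next_q = q_target(next_states).max(dim=1).values
        y = rewards + gamma * max_next_q * (1.0 - dones)

    # Equation (2): mean-square error between target value and network output
    loss = F.mse_loss(q_sa, y)

    # Equation (4): gradient back-propagation and mini-batch gradient descent on w
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()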

2.2. DDQN Algorithm

Like the Q-learning algorithm, the DQN algorithm calculates the target value directly through the greedy strategy. Although the Q value can quickly approach the possible optimization target in this way, the target Q value is overestimated and the deviation of the algorithm model becomes large. This problem is solved by the DDQN algorithm [27].
In the DDQN algorithm, overestimation is eliminated by decoupling the action selection from the calculation of the target Q value. The key is the full use of the double network structure. When updating the parameters of the target Q network in the DDQN algorithm, the critic Q network is first used to select the action a_max with the optimal value, as shown in Equation (5):
a_{\max}(S', w) = \arg\max_{a'} Q(\phi(S'), a', w)
This action is used to calculate the target Q value in target network Q ^ , as shown in Equation (6):
Q_{target}(\phi(S), A) = R + \gamma \hat{Q}(\phi(S'), a_{\max}(S', w), w') \quad (is\_done \text{ is false})
For the state transition (ϕ(S), A, R, ϕ(S′), is_done), the target value used by the DDQN algorithm in parameter updating is shown in Equation (7):
Q_{target}(\phi(S), A) = \begin{cases} R, & is\_done \text{ is true} \\ R + \gamma \hat{Q}\!\left(\phi(S'), \arg\max_{a'} Q(\phi(S'), a', w), w'\right), & is\_done \text{ is false} \end{cases}
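As a minimal sketch of the decoupling in Equations (5)–(7), the code below selects the action with the critic (online) network and evaluates it with the target network; the variable names and batch format are assumptions rather than the authors' implementation.

import torch

def ddqn_target(q_net, q_target, rewards, next_states, dones, gamma=0.9):
    with torch.no_grad():
        # Equation (5): a_max = argmax_a' Q(phi(S'), a', w), chosen by the critic network
        a_max = q_net(next_states).argmax(dim=1, keepdim=True)
        # Equation (6): the chosen action is evaluated by the target network Q_hat(., w')
        q_next = q_target(next_states).gather(1, a_max).squeeze(1)
        # Equation (7): y = R when is_done is true, otherwise R + gamma * Q_hat(...)
        return rewards + gamma * q_next * (1.0 - dones)

Replacing the max-based target of Equation (1) with this quantity is, in this sketch, the only change needed to turn the DQN update into the DDQN update.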
The pseudocode of the DDQN algorithm is shown in Algorithm 2.
Algorithm 2: Pseudocode of the DDQN algorithm
  Input: ε init , ε min , T , α , γ , m , N , n , A , Q ( w ) , Q target ( w ) , C
  for t = 1 to T do
    Initialization: Initialize the first state S in the present state sequence and acquire its eigenvector ϕ ( S )
    Update the basic exploration factor ε t based on the number of iteration episodes t
    for each initialized first state S do
      (a) Select an action A = argmax_a Q(ϕ(S), a, w) with probability 1 − ε_t; otherwise, randomly select an action A
      (b) Execute the action A under the state S to acquire an instant reward R, and obtain a new state S′ and its eigenvector ϕ(S′). is_done indicates whether the new state is terminal
      (c) Store the quintuple {ϕ(S), A, R, ϕ(S′), is_done} in the experience container D
      (d) S = S′
      (e) If the total sample size in the experience container D is greater than N, randomly collect m state transition samples from D and calculate the present target Q_target (y_j):
y_j = \begin{cases} R_j, & is\_done_j \text{ is true} \\ R_j + \gamma \hat{Q}\!\left(\phi(S'_j), \arg\max_{a'} Q(\phi(S'_j), a', w), w'\right), & is\_done_j \text{ is false} \end{cases}
      (f) Use the mean square error loss function \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(\phi(S_j), A_j, w)\right)^2 to update the network weight w via the mini-batch gradient descent method
      Update the target network parameter Q target ( w ) every C steps
      Until S ends in an i s _ d o n e state
      end for
    Until reaching the maximum number of iteration episodes T
  end for
  Output: Q target ( w )

3. DDQN-ECMS Algorithm

3.1. Multi Step DDQN Algorithm

In the DDQN algorithm, the instant reward R obtained by the current action and the estimated value at the next moment are used to calculate the target Q value. In the early stage, when the deviation of the network parameters is large, the target Q value also shows a large deviation, so the training speed is slow. The idea of multi-step guidance is introduced in this paper to improve the performance of the DDQN algorithm. Specifically, the reward R^(n) obtained in n successive steps is used to substitute the instant reward R in the DDQN algorithm (n is a step-size parameter) to obtain a more accurate target Q value during training. This algorithm is named the multi-step DDQN (MS-DDQN) algorithm, and R^(n) is defined as shown in Equation (8):
R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^{i} R_{t+i+1}
For the n-step sequence of state transitions (ϕ(S_i), A_i, R_i, ϕ(S′_i), is_done_i) (i = t, t + 1, …, t + n), the target value used by the algorithm in parameter updating is shown in Equation (9):
Q_{target}(\phi(S_t), A_t) = \begin{cases} R_t^{(n)}, & is\_done_{t+n} \text{ is true} \\ R_t^{(n)} + \gamma^{n} \hat{Q}\!\left(\phi(S_{t+n}), \arg\max_{a} Q(\phi(S_{t+n}), a, w), w'\right), & is\_done_{t+n} \text{ is false} \end{cases}
In Equation (9), the value of the step-size parameter n should be set according to the actual situation; a suitable value can effectively improve the training speed of the algorithm.
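A minimal sketch of the multi-step guidance of Equations (8) and (9) follows, assuming each 1-step transition is stored as the quintuple (phi_s, a, r, phi_s_next, done): the entries of a length-n queue are collapsed into one n-step transition whose reward is the discounted sum of Equation (8) and whose bootstrap discount is the accumulated power of γ.

from collections import deque

def make_nstep_transition(queue, gamma):
    # state and action come from the oldest transition in the queue
    phi_s, action = queue[0][0], queue[0][1]
    r_n, discount = 0.0, 1.0
    for (_, _, r, phi_s_next, done) in queue:
        r_n += discount * r          # Equation (8): R^(n) = sum_i gamma^i * R_{t+i+1}
        discount *= gamma
        if done:                     # stop accumulating if the episode ends inside the window
            break
    # 'discount' (= gamma^k) multiplies the bootstrapped target term of Equation (9)
    return phi_s, action, r_n, phi_s_next, done, discount

# usage sketch: a deque of maxlen n_s plays the role of the multi-step guidance queue D
queue = deque(maxlen=4)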
The pseudocode of the MS-DDQN algorithm is shown in Algorithm 3.
Algorithm 3: Pseudocode of the MS-DDQN algorithm
  Input: ε init , ε min , T , α , γ , m , N , n , n s , A , Q ( w ) , Q target ( w ) , C
  for t = 1 to T do
    Initialization: Initialize the first state S in the present state sequence and acquire its eigenvector ϕ ( S )
    Update the basic exploration factor ε t based on the number of iteration episodes t
    for each initialized first state S do
      (a) Select an action A = argmax_a Q(ϕ(S), a, w) with probability 1 − ε_t; otherwise, randomly select an action A
      (b) Execute the action A under the state S to acquire an instant reward R, and obtain a new state S′ and its eigenvector ϕ(S′). is_done indicates whether the new state is terminal
      (c) Store the quintuple {ϕ(S), A, R, ϕ(S′), is_done} in the multi-step guidance queue D (when len(D) = n_s, the oldest quintuple is popped from the front of the queue)
      (d) When len(D) = n_s, store the quintuple {ϕ(S), A, R^(n), ϕ(S′), is_done} in the experience container M, where ϕ(S), A, and is_done are the ϕ(S), A, and is_done of the first quintuple in the queue D, ϕ(S′) refers to the last non-terminal state reached within the subsequent steps, and R^(n) is the discounted cumulative sum of the subsequent instant rewards up to an is_done state
      (e) S = S′
      (f) When the total sample size in the container is greater than N, randomly select m state transition samples from the experience container M and calculate the present target Q_target (y_j):
y_j = \begin{cases} R_j, & is\_done_j \text{ is true} \\ R_j + \gamma \hat{Q}\!\left(\phi(S'_j), \arg\max_{a'} Q(\phi(S'_j), a', w), w'\right), & is\_done_j \text{ is false} \end{cases}
      (g) Use the mean square error loss function \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(\phi(S_j), A_j, w)\right)^2 to update the network weight w via the mini-batch gradient descent method
      Update the target network parameter Q target ( w ) every C steps
      Until S ends in an i s _ d o n e state
      end for
    Until reaching the maximum number of iteration episodes T
  end for
  Output: Q target ( w )

3.2. Experience Classification DDQN Algorithm

3.2.1. Experience Classification Training Method

In the current experience replay technique of the DDQN algorithm, state transitions are simply stored in order, and the differences and similarities between different state transitions are not considered. As a result, although a large number of collisions occur in the early stage, the state transitions corresponding to collisions may not be selected, so the convergence speed becomes slow. In this paper, a training method, experience classification (EC), was proposed to improve the utilization of the experience in which the agent collides with an obstacle or stays close to an obstacle in the early stage.
In the EC method, state transitions can be stored in several different experience replay buffers, and the storage conditions and sampling formulas of the experience replay buffers can be designed according to the actual demands. In the application to path planning, with the algorithm complexity and actual demands taken into consideration, the experience replay buffers are divided into Memory0 and Memory1 according to whether a state transition occurs within a certain range of obstacles. A sampling batch size is kept for every experience replay buffer, and the two experience replay buffers are sampled according to the sizes of their sampling batches. Then, a new batch obtained by combining the two is sent to the Q network for parameter training and updating. After the training, the average losses l_0 and l_1 of the two experience replay buffers reflect the value of the different types of experience for the current training stage. Therefore, the sampling proportion, that is, the sizes of the sampling batches in the two experience replay buffers, is updated according to the demands of the different experience replay buffers in various training stages and the returned losses of the two experience replay buffers. A sketch map of the EC training method is shown in Figure 1, where batch0 and batch1 are samples drawn according to the sampling proportions of the different experience replay buffers, and loss0 and loss1 are the average training losses of the different experience replay buffers.
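The following sketch illustrates the batch stitching of Figure 1: the two replay buffers are sampled according to the current proportion, the sub-batches are concatenated into one training batch, and the average losses of the two sub-batches are later used to update the proportion. The buffer and variable names are illustrative assumptions, not the authors' code.

import random

def sample_stitched_batch(memory0, memory1, batch_size, p0):
    # batch0: auxiliary experience within a certain range of obstacles (Memory0)
    n0 = min(int(round(batch_size * p0)), len(memory0))
    batch0 = random.sample(memory0, n0)
    # batch1: ordinary experience (Memory1) fills the rest of the batch
    n1 = min(batch_size - n0, len(memory1))
    batch1 = random.sample(memory1, n1)
    # the stitched batch is sent to the Q network; loss0 and loss1 are then
    # computed separately on batch0 and batch1 to adjust the sampling proportion
    return batch0, batch1, batch0 + batch1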

3.2.2. EC-DDQN Algorithm

For the DDQN algorithm, all network parameters are random values in the initial stage. All samples should have the chance to be trained to help the agent identify the probable task demands as soon as possible. Therefore, in this paper, the EC method was used for auxiliary training: within certain episodes in the initial stage, experience within a certain range of obstacles is sampled more often, but the training trend of the total samples is not changed. This algorithm is called the experience classification DDQN (EC-DDQN) algorithm.
To ensure the algorithm's stability in the initial stage of training, state transitions are stored in both experience replay buffers as long as their number is less than the minimum exploration number, while state transitions beyond the minimum exploration number are selectively stored according to their features. In this paper, Memory0 was designed for auxiliary training, so state transitions within a certain range of obstacles are stored in Memory0, and all state transitions are stored in Memory1. The sampling proportions of Memory0 and Memory1 are denoted by P_0 and P_1.
In the EC-DDQN algorithm, three parameters, the auxiliary training coefficient β, the initial sampling probability p_0, and the dynamic sampling probability p_1, are maintained to realize this design. Meanwhile, the sampling proportion of the auxiliary experience replay buffer is composed of two parts. In the first part, the initial sampling probability p_0 decreases as training proceeds; p_0 cannot be set too large, otherwise it will affect the stability of the algorithm. In the second part, the dynamic sampling probability p_1 is dynamically distributed to the two experience replay buffers according to their training losses l_0 and l_1. The updating formula of the sampling proportion P_0 is given in Equation (10):
P_0 = \begin{cases} p_0 \times \varepsilon_t + p_1 \times \dfrac{l_0}{l_0 + l_1}, & t \le \beta T \\ 0, & \text{otherwise} \end{cases}
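A direct transcription of Equation (10) as a small Python function is given below; the small constant in the denominator, added to avoid division by zero before any loss has been recorded, and the assumption that Memory1 receives the complementary proportion are implementation assumptions.

def auxiliary_sampling_proportion(t, T, eps_t, p0, p1, loss0, loss1, beta):
    # auxiliary training is only active during the first beta * T episodes
    if t <= beta * T:
        return p0 * eps_t + p1 * loss0 / (loss0 + loss1 + 1e-8)
    return 0.0   # afterwards, samples are drawn from Memory1 only

# assumed here: Memory1 receives the complementary proportion, P1 = 1 - P0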
The pseudocode of the EC-DDQN algorithm is shown in Algorithm 4.
Algorithm 4: Pseudocode of the EC-DDQN algorithm
  Input: ε init , ε min , T , α , γ , m , p 0 , p 1 , Φ , N , n , A , Q ( w ) , Q target ( w ) , C
  for t =1 to T do
    Initialization: Initialize the first state S in the present state sequence and acquire its eigenvector ϕ ( S )
    Update the basic exploration factor ε t based on the number of iteration episodes t
    for each initialized first state S do
      (a) Select an action A = argmax_a Q(ϕ(S), a, w) with probability 1 − ε_t; otherwise, randomly select an action A
      (b) Execute the action A under the state S to acquire an instant reward R, and obtain a new state S′ and its eigenvector ϕ(S′). is_done indicates whether the new state is terminal
      (c) Store the quintuple {ϕ(S), A, R, ϕ(S′), is_done}: store it in both M0 and M1 when len(M1) < N; when len(M1) ≥ N, store it in M1, and also store it in M0 if ϕ(S′) ∈ Φ
      (d) S = S′
      (e) Update the sampling proportion of the auxiliary experience pool:
P_0 = \begin{cases} p_0 \times \varepsilon_t + p_1 \times \dfrac{l_0}{l_0 + l_1}, & t \le \beta T \\ 0, & \text{otherwise} \end{cases}
      (f) When the total sample size in the containers is greater than N, randomly select m state transition samples from the experience containers M0 and M1 at sampling proportions of P_0 and P_1, respectively, and calculate the present target Q_target (y_j):
y_j = \begin{cases} R_j, & is\_done_j \text{ is true} \\ R_j + \gamma \hat{Q}\!\left(\phi(S'_j), \arg\max_{a'} Q(\phi(S'_j), a', w), w'\right), & is\_done_j \text{ is false} \end{cases}
      (g) Use the mean square error loss function \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(\phi(S_j), A_j, w)\right)^2 to update the network weight w via the mini-batch gradient descent method, and return the average losses l_0 and l_1 of the different experience pools
      Update the target network parameter Q target ( w ) every C steps
      Until S ends in an i s _ d o n e state
      end for
    Until reaching the maximum number of iteration episodes T
  end for
  Output: Q target ( w )

3.3. DDQN Algorithm Based on Experience Classification and Multi Steps

The advantages of multi-step guidance (MS) and experience classification (EC) are combined in the DDQN algorithm to obtain a new DDQN-ECMS algorithm. Its pseudocode is shown in Algorithm 5.
Algorithm 5: Pseudocode of the DDQN-ECMS algorithm
  Input: ε init , ε min , T , α , γ , m , p 0 , p 1 , Φ , N , n , n s , A , Q(w), C
  for t = 1 to T do
    Initialization: Initialize the first state S in the present state sequence and acquire its eigenvector ϕ(S)
    Update the basic exploration factor ε_t based on the number of iteration episodes t
    for each initialized first state S do
      (a) Select an action A = argmax_a Q(ϕ(S), a, w) with probability 1 − ε_t; otherwise, randomly select an action A
      (b) Execute the action A under the state S to acquire an instant reward R, and obtain a new state S′ and its eigenvector ϕ(S′). is_done indicates whether the new state is terminal
      (c) Store the quintuple {ϕ(S), A, R, ϕ(S′), is_done} in the multi-step queue D (when len(D) = n_s, the oldest quintuple is popped from the front of the queue)
      (d) When len(D) = n_s, store the quintuple {ϕ(S), A, R^(n), ϕ(S′), is_done} in the experience containers: store it in both M0 and M1 if len(M1) < N; when len(M1) ≥ N, store it in M1, and also store it in M0 if ϕ(S′) ∈ Φ. Here ϕ(S), A, and is_done are the ϕ(S), A, and is_done of the first quintuple in the queue D, ϕ(S′) refers to the last non-terminal state reached within the subsequent steps, and R^(n) is the discounted cumulative sum of the subsequent instant rewards up to an is_done state
      (e) S = S′
      (f) Update the sampling proportion of the auxiliary experience replay buffer:
P_0 = \begin{cases} p_0 \times \varepsilon_t + p_1 \times \dfrac{l_0}{l_0 + l_1}, & t \le \beta T \\ 0, & \text{otherwise} \end{cases}
      (g) If the total number of samples in the containers is greater than N, randomly select m state transition samples from the experience containers M0 and M1 at sampling proportions of P_0 and P_1, respectively, to calculate the current target Q_target (y_j):
y_j = \begin{cases} R_j, & is\_done_j \text{ is true} \\ R_j + \gamma \hat{Q}\!\left(\phi(S'_j), \arg\max_{a'} Q(\phi(S'_j), a', w), w'\right), & is\_done_j \text{ is false} \end{cases}
      (h) Use the mean square error loss function \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(\phi(S_j), A_j, w)\right)^2 to update the network weight w via the mini-batch gradient descent method, and return the average losses l_0 and l_1 of the different experience replay buffers
      Update the target network parameter Q_target(w′) every C steps
      Until S ends in an is_done state
      end for
    Until reaching the maximum number of iteration episodes T
  end for
  Output: Q target ( w )
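The sketch below covers steps (c) and (d) of Algorithm 5, i.e., how the multi-step queue feeds the two classified experience pools. make_nstep_transition is the helper sketched in Section 3.1; near_obstacle() stands for the membership test ϕ(S′) ∈ Φ, and all container names are assumptions for illustration.

def store_ecms_transition(queue, memory0, memory1, transition,
                          n_explore, near_obstacle, gamma):
    # step (c): a deque with maxlen = n_s drops its oldest quintuple automatically
    queue.append(transition)
    if len(queue) < queue.maxlen:
        return
    # collapse the window into one multi-step transition (see the sketch in Section 3.1)
    ms_transition = make_nstep_transition(queue, gamma)
    phi_s_next = ms_transition[3]
    # step (d): classified storage of the EC method
    if len(memory1) < n_explore:
        memory0.append(ms_transition)     # before the minimum exploration number,
        memory1.append(ms_transition)     # both pools are filled to keep training stable
    else:
        memory1.append(ms_transition)     # Memory1 always stores the transition
        if near_obstacle(phi_s_next):     # transitions near obstacles also enter Memory0
            memory0.append(ms_transition)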

4. Test and Analysis

4.1. Setting of the Experimental Simulation Environment

1. Design of reward function
The directional control of the agent was discretized into four actions, namely a 30° left turn, a 15° left turn, a 15° right turn, and a 30° right turn. More detailed reward feedback was adopted in the design of the reward function to improve the efficiency of the interaction between the agent and the environment and to let the agent better use rewards to adjust its action plan. A reward of −0.01 was given when the direction changed. A reward of −0.05 × (number of turns) was given in the event of continuous turning. A reward of −0.1 was given when the agent was within a certain range of an obstacle. A reward of −1 was given in the case of a collision. When the agent reached the goal position, a final reward of 10 was given. A base reward of 10 × Δdis was given for each step, where Δdis is the normalized change of the distance between two time steps, with the diagonal of the scene as the unit length. The reward function is specified in Equation (11):
r = \begin{cases} 10 \times \Delta dis, & \text{base\_reward} \\ -0.01, & \text{direction\_changed} \\ -0.05 \times \text{vibration\_times}, & \text{continus\_direction\_changed} \\ -0.1, & \text{near\_the\_obstacles} \\ -1, & \text{collision} \\ 10, & \text{reach\_the\_goal} \end{cases}
In Equation (11), "base_reward" denotes the base reward obtained at each step, determined by the normalized distance change Δdis; "direction_changed" means that the agent's direction is changed; "continus_direction_changed" indicates that the agent is penalized by −0.05 × (number of turns) if it changes direction for three consecutive times; "near_the_obstacles" means that the agent is within a certain range of an obstacle; "collision" means that the agent collides with an obstacle; and "reach_the_goal" means that the agent reaches the goal position (an illustrative sketch of this reward function is given after this list).
2. Design of value fitting network Q
The designed value network was a four-layer deep neural network: an input layer of 10 neurons fully connected to two hidden layers of 64 neurons each with ReLU activation, which were in turn fully connected to an output layer of five neurons (a sketch of this network is also given after this list).
3. Parameter setting
The initial exploration factor ε_init and the minimum exploration factor ε_min help the agent select between the optimal-value action and a random action. To be specific, the agent selects the optimal-value action with a probability of ε_init − ε_min and randomly selects an action with a probability of ε_min, thus ensuring that it explores different states with a certain probability and preventing the failure to acquire new experiences caused by constantly selecting the optimal-value action. Given that the agent initially knew nothing about the environment, ε_init and ε_min were set as 1 and 0.01, respectively, to better explore the environmental information. When the learning rate α, a parameter used to update the DDQN weights, is too low, the convergence process is rather slow; at an excessively high α, however, the gradient may fluctuate back and forth near the minimum value, and convergence may even fail. With reference to the literature [26], α was set as 0.001 and the step interval for network parameter updating as C = 100 in this study. The single-episode maximum number of steps Step_max is the maximum number of actions that the agent can take in a single loop, and the maximum number of iteration episodes T_max_episodes denotes the total amount of DDQN network parameter training. In this study, Step_max and T_max_episodes were set as 1000 and 100, respectively, based on the designed 2D path planning simulation experiments. The reward damping factor γ, which is used to balance the current reward and the future reward, was set as 0.9 according to the literature [28]. In addition, the maximum capacity M_capacity of the experience replay container, which decides the number of state transition sequences that can be stored, was set as 20,000. As for the deep neural network (DNN) training parameters, the sampling batch size N_batchSize was determined as 128 through repeated experiments. In the experience partitioning method, the minimum number of explorations N_explore is a threshold for judging whether to conduct experience partitioning. Moreover, the sampling proportions of the two experience pools are co-decided by the initial exploration probability p_0, the dynamic sampling probability p_1, and the auxiliary exploration coefficient β. To keep the algorithms stable in the initial training phase, N_explore, p_0, p_1, and β were set as 500, 0.2, 0.2, and 0.4, respectively. An exponential damping factor of 0.95 was adopted for the exploration factor ε of all algorithms, in expectation of improving the utilization rate of experience [12]. The related parameters are listed in Table 1.
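For concreteness, a sketch of the reward function of Equation (11) follows. The equation's cases are treated as mutually exclusive here; whether some terms are combined in the actual environment is not fully specified, so the flag arguments and this ordering are assumptions.

def compute_reward(delta_dis, direction_changed, vibration_times,
                   near_obstacle, collision, reached_goal):
    if reached_goal:
        return 10.0                        # reach_the_goal
    if collision:
        return -1.0                        # collision
    if near_obstacle:
        return -0.1                        # near_the_obstacles
    if vibration_times >= 3:
        return -0.05 * vibration_times     # continus_direction_changed (3 consecutive turns)
    if direction_changed:
        return -0.01                       # direction_changed
    return 10.0 * delta_dis                # base_reward: normalized distance change per step

The value-fitting network of item 2 maps the 10-dimensional state feature to five action values; a minimal PyTorch definition consistent with that description (but not necessarily identical to the authors' code) is:

import torch.nn as nn

class QNetwork(nn.Module):
    # four-layer fully connected network: 10 -> 64 -> 64 -> 5, ReLU activations
    def __init__(self, state_dim=10, hidden_dim=64, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, x):
        return self.net(x)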

4.2. Experiments and Result Analysis

Two experiments were designed to verify the algorithms proposed in this chapter. One experiment was performed in an environment with the fixed initial position of the agent, goal position, and obstacle positions, and the other was done in an environment with the random initial position of the agent, goal position, and obstacle positions. The performances of related algorithms in both experimental environments were compared on the OpenAI Gym platform.

4.2.1. Experiment in the Environment with Fixed Positions and Result Analysis

With a size of 7 × 7 m², the experimental scene was visualized at a resolution of 700 × 700 pixels. The visual scene is shown in Figure 2. The dark grey polygons of different sizes in the scene represent different obstacles; there were five obstacle positions, and all the obstacles were fixed. The x-coordinates, y-coordinates, and radii of the obstacles were set, from bottom to top, as [1.75, 1.75, 0.21], [1.75, 5.25, 0.28], [3.5, 3.5, 0.28], [5.25, 1.75, 0.21], and [5.25, 5.25, 0.35], respectively. The blue car was the agent in the experiment, and the goal position range was represented by a circular area with a red flag in the middle. The initial direction and initial position of the agent can be set arbitrarily, and the motion mode is uniform-speed motion with variable direction. In the experiment in the environment with fixed positions, the initial direction of the agent was fixed to the front, and the goal position was fixed to the upper right. The task of the agent was to reach the goal position through direction control without colliding with the obstacles. For the agent, it was assumed that the direction and relative distance of the goal position were known; in addition, the agent could detect the distance of an object within a 180° fan-shaped range in front of it, and the detection radius d could be set arbitrarily.
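A sketch of the fixed-scene configuration described above follows: the obstacle centers and radii are those listed in the text, while the near-obstacle margin and the helper name are assumptions, since the exact range is not quantified in the paper.

import math

# obstacles of the fixed scene: (x, y, radius) in meters
OBSTACLES = [(1.75, 1.75, 0.21), (1.75, 5.25, 0.28), (3.50, 3.50, 0.28),
             (5.25, 1.75, 0.21), (5.25, 5.25, 0.35)]

def obstacle_status(x, y, margin=0.3):
    # returns (collision, near_obstacle) flags for an agent at position (x, y)
    collision, near = False, False
    for ox, oy, r in OBSTACLES:
        dist = math.hypot(x - ox, y - oy)
        collision = collision or dist <= r      # agent overlaps an obstacle
        near = near or dist <= r + margin       # within a certain range of an obstacle
    return collision, near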
Four different algorithms were used for path planning in the scene of the experiment with fixed positions. In theory, the larger the multi-step value, the larger the computational load of the algorithm and the more accurate the estimation; in practical applications, however, a larger value is not necessarily better. Therefore, to determine a reasonable multi-step value, this value was first set as 3 in this paper, and the multi-step value n_s = 4 was finally selected for the MS-DDQN and DDQN-ECMS algorithms through experiments. In the experiments, the agent could successfully reach the goal position without any collision in the later stage of training (around the last 10 episodes) with all four algorithms. Through training with each of the four algorithms, the return curve shown in Figure 3 was obtained. In this figure, the abscissa represents the current number of iteration episodes, and the ordinate represents the total return of each episode. The numbers of steps consumed by the four algorithms are shown in Figure 4. In this figure, the abscissa represents the current number of iteration episodes, and the ordinate represents the total number of steps consumed in each episode.
Figure 3 shows that all four algorithms converge to the optimal reward value in the later stage. At the initial stage of training, the total return values of the four algorithms all rose rapidly, and there were certain fluctuations throughout the middle stage. For the EC-DDQN algorithm, the reward rose the fastest at the initial stage but was the most unstable in the middle and later stages. In general, the DDQN-ECMS algorithm was more stable. Figure 4 shows that there were some fluctuations in the total steps of the four algorithms, and they were all within 200 steps in the 20 episodes after gradual stabilization; the difference among the four algorithms was not large. The path given by the final episode of a single training run with each of the four algorithms is shown in Figure 5. In the figure, the black, red, green, and blue circles represent the trajectories of the DDQN, MS-DDQN, EC-DDQN, and DDQN-ECMS algorithms, respectively.
Figure 5 shows that the path given by the DDQN algorithm was the smoothest, but the overall distance to the three obstacles was relatively close. In the path given by the MS-DDQN algorithm, the agent was close to the obstacles twice, and the steering adjustment was not timely. In the path given by the EC-DDQN algorithm, the direction was adjusted twice due to obstacles, and the direction change angle was relatively large in the first direction adjustment. The path given by the DDQN-ECMS algorithm was relatively far away from the obstacles, and the steering adjustment was relatively smooth when it was close to the obstacles, so it is safer overall.
However, under the single training, the quality of the performances of the algorithms cannot be determined because of the randomness. Therefore, the four algorithms were run 100 times each, and the performances of the four algorithms were counted. The counted items included the average time spent by the algorithm, the average return of the last 10 episodes of each training, the average number of steps, and the average reward of each step of the last 10 episodes of each training. The reason why the average value of the last 10 episodes was calculated is that the exploration rate of the last 10 episodes was stably below 0.01 under the exponential attenuation of 0.95. At this time, it was close to complete convergence for the algorithm, and some random interference can also be eliminated, which can better reflect the performance of the algorithm. The performance of each algorithm in the environment with fixed positions is shown in Table 2.
According to Table 2, compared to the time spent by the DDQN algorithm, the average time spent by the MS-DDQN and EC-DDQN algorithms was about 7 s less, and the average time spent by the DDQN-ECMS algorithm was about 5 s less. The average return values of both the MS-DDQN and DDQN-ECMS algorithms were significantly larger than those of the DDQN and EC-DDQN algorithms, and there was little difference between the latter two algorithms. The average number of steps of the DDQN-ECMS algorithm was the smallest, while that of the MS-DDQN algorithm was the largest. In terms of the average reward per step, the score of the DDQN-ECMS algorithm was the highest, followed by the MS-DDQN algorithm, while the DDQN and EC-DDQN algorithms share the lowest score. With the score of the DDQN algorithm as the reference, there was no increase for the EC-DDQN algorithm, while the score of the MS-DDQN algorithm increased by 5.77% and that of the DDQN-ECMS algorithm by 9.62%, with obvious advantages.

4.2.2. Experiment in the Environment with Random Positions and Result Analysis

The experimental scene in the environment with random positions had the same size as that in the experiment with fixed positions; the number of obstacles was also five, the blue car was again the agent, and the goal position range was again represented by a circular area with a red flag in the middle. However, the goal position, the positions and sizes of the obstacles within a certain range, and the position and direction of the agent were randomly set in each episode. That is to say, all the initial conditions in each episode were random in the training, with the maximum number of iteration episodes set to 100. An example of a random scene is shown in Figure 6.
Four different algorithms were also applied for comparison of path planning in the environment with random positions. In addition, for the setting of MS value in this environment, the MS values n s = 3 of the MS-DDQN and DDQN-ECMS algorithms were determined through experiments. Through training with each of four algorithms, the return curve was obtained, as shown in Figure 7, and the number of steps of single training was also obtained, as shown in Figure 8. As aforementioned, the abscissa represents the number of iteration episodes, and the ordinate represents the total return value and the total number of steps.
Figure 7 shows that the total return values of the four algorithms fluctuated to some extent during the whole training process. Generally, as the number of iteration episodes increased, the fluctuations became smaller and closer to convergence. The MS-DDQN algorithm fluctuated largely in the medium stage, while the other three algorithms were obviously more stable than the former. Figure 8 shows that the number of steps in all four algorithms fluctuated largely. In terms of stability, the curve of the MS-DDQN algorithm fluctuated the most, while the difference of the other three algorithms was not obvious.
The paths were planned in the environment with random positions by using the network parameters of the four algorithms completed in this training in order to verify the generalization performances of the algorithms. The path planning results of the four algorithms under four environments with random positions are shown in Figure 9. As aforementioned, the black, red, green, and blue circles represent the trajectories of the DDQN, MS-DDQN, EC-DDQN, and ECMS-DDQN algorithms, respectively.
Figure 9 shows that, in this training, the agent more often approached the goal from the left in the MS-DDQN algorithm, whereas it more often approached the goal from the right in the DDQN, EC-DDQN, and DDQN-ECMS algorithms; the deflection angles were smaller for the EC-DDQN and DDQN-ECMS algorithms. The agent could reach the goal position with all four algorithms, but the paths planned by the EC-DDQN and DDQN-ECMS algorithms were smoother, with fewer direction changes.
The agent was trained 100 times in the environment with random positions with each of the four algorithms, and the average performances of the four algorithms were counted, as shown in Table 3.
Table 3 shows that the time consumed by the EC-DDQN algorithm was the shortest, while the difference among the other three algorithms was small, about 1 s. The average return value of the DDQN-ECMS algorithm was the highest, while that of MS-DDQN was the lowest. The average number of steps of the EC-DDQN algorithm was the smallest, followed by that of the DDQN-ECMS algorithm. In terms of the average reward per step, which best reflects the performance of an algorithm, with the DDQN algorithm as the reference, the performance of MS-DDQN decreased by 4.03%, while those of EC-DDQN and DDQN-ECMS increased by 2.42% and 3.26%, respectively; the DDQN-ECMS algorithm thus showed the best comprehensive performance.

5. Conclusions and Future Directions

In this study, a mobile robot path planning algorithm referred to as ECMS-DDQN was proposed to improve the estimation accuracy of DDQN algorithm for the target Q value during the training process and the experience utilization efficiency during the experience playback process. Next, the path planning simulation verification was performed on an OpenAI Gym platform. Then, the superiority of the proposed algorithm was verified by comparing it with the DDQN algorithm, the MS-DDQN algorithm, and the EC-DDQN algorithm. Finally, the following conclusions were drawn through the theoretical study and the simulation experiment:
Based on the DDQN algorithm, on the one hand, the multi-step guidance method was introduced to replace the instant reward at a single moment with the instant reward of continuous multi-step interaction to improve the estimation accuracy of DDQN for the target Q value during training. On the other hand, the experience was classified according to state transition features and stored in different experience pools according to the experience type. Then, the sampling proportion was dynamically adjusted on the basis of the average training loss of different experience pools. Finally, the two improvements were fused to form a novel ECMS-DDQN algorithm, whose superiority was verified through the path planning simulation experiment of mobile robots.
Despite certain advantages in the total return value and the generalization performance, the proposed ECMS-DDQN algorithm shows no evident advantages in the average time consumption, the average return value, or the average number of steps over the DDQN algorithm, the MS-DDQN algorithm, and the EC-DDQN algorithm. Therefore, how to further improve the superiority of this algorithm will be key research content in the future.
Compared with the actual environment, the random environment set up in this study is too simple despite a certain complexity. Algorithms under environments with dynamic obstacles and dynamic target positions should be investigated in the follow-up research. Meanwhile, only simulation verification was performed in this study, and thus such verification should be done by establishing a corresponding actual experimental environment. In addition, the parameters of related algorithms given in this study remain to be further optimized, which will be key research content in the future.

Author Contributions

Conceptualization, L.Z.; data curation, X.Z., X.S., Z.Z. and Z.W.; simulation, Z.W. and X.Z.; Methodology, X.Z. and Z.Z.; writing—original draft preparation, X.Z., Z.Z. and X.S.; writing—review and editing, X.Z. and X.S.; supervision, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61741303) and the Key Laboratory of Spatial Information and Geomatics (Guilin University of Technology) (No. 19-185-10-08).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors express their gratitude to the reviewers, whose comments helped improve this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, Y.Y.; Fan, X.W. Application of improved ant colony algorithm in robot path planning. Comput. Eng. Appl. 2021, 57, 276–282. [Google Scholar]
  2. Jiang, M.; Wang, F.; Ge, Y.; Sun, L.L. Research on path planning of mobile robot based on improved ant colony algorithm. Chin. J. Sci. Instrum. 2019, 40, 113–121. [Google Scholar]
  3. Fadzli, S.A.; Abdulkadir, S.I.; Makhtar, M. Robotic Indoor Path Planning Using Dijkstra’s Algorithm with Multi-Layer Dictionaries. In Proceedings of the 2015 2nd International Conference on Information Science and Security (ICISS), Seoul, Korea, 14–16 December 2015; pp. 1–4. [Google Scholar]
  4. Ahlam, A.; Mohd, A.S.; Khatija, R. An optimized hybrid approach for path finding. Int. J. Found. Comput. Sci. Technol. 2015, 5, 47–58. [Google Scholar]
  5. Song, B.Y.; Wang, Z.D.; Sheng, L. A new genetic algorithm approach to smooth path planning for mobile robots. Assem. Autom. 2016, 36, 138–145. [Google Scholar] [CrossRef] [Green Version]
  6. Li, G.; Chou, W. Path planning for mobile robot using self-adaptive learning particle swarm optimization. Sci. China Inf. Sci. 2018, 61, 052204. [Google Scholar] [CrossRef] [Green Version]
  7. Juang, C.; Yeh, Y. Multiobjective Evolution of Biped Robot Gaits Using Advanced Continuous Ant-Colony Optimized Recurrent Neural Networks. IEEE Trans. Cybern. 2018, 48, 1910–1922. [Google Scholar] [CrossRef] [PubMed]
  8. Zhang, H.; He, L.; Yuan, L.; Ran, T. Mobile robot path planning based on improved two-layer ant colony algorithm. Control. Decis. 2022, 37, 303–313. [Google Scholar]
  9. Wang, K.Y.; Shi, Z.; Yang, Z.C.; Wang, S.S. Improved reinforcement learning algorithm applied to mobile robot path planning. Comput. Eng. Appl. 2021, 57, 270–274. [Google Scholar]
  10. Zhou, X.M.; Bai, T.; Ga, Y.B.; Han, Y.T. Vision-Based Robot Navigation through Combining Unsupervised Learning and Hierarchical Reinforcement Learning. Sensors 2019, 19, 1576. [Google Scholar] [CrossRef] [Green Version]
  11. Liu, Z.R.; Jiang, S.H. Research review of mobile robot path planning based on reinforcement learning. Manuf. Autom. 2017, 41, 90–93. [Google Scholar]
  12. Dong, Y.; Ge, Y.Y.; Guo, H.Y.; Dong, Y.F.; Yang, C. Mobile robot path planning based on deep reinforcement learning. Comput. Eng. Appl. 2019, 55, 15–19. [Google Scholar]
  13. Lv, L.H.; Zhang, S.J.; Ding, D.R.; Wang, Y.X. Path Planning via an Improved DQN-Based Learning Policy. IEEE Access 2019, 7, 67319–67330. [Google Scholar] [CrossRef]
  14. Zhang, F.; Gu, C.; Yang, F. An Improved Algorithm of Robot Path Planning in Complex Environment Based on Double DQN. In Advances in Guidance, Navigation and Control. Lecture Notes in Electrical Engineering; Yan, L., Duan, H., Yu, X., Eds.; Springer: Singapore, 2021; Volume 644. [Google Scholar]
  15. Peng, Y.S.; Liu, Y.; Zhang, H. Deep Reinforcement Learning based Path Planning for UAV-assisted Edge Computing Networks. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021; pp. 1–6. [Google Scholar]
  16. Yan, C.; Xiang, X.; Wang, C. Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments. J. Intell. Robot. Syst. 2020, 98, 297–309. [Google Scholar] [CrossRef]
  17. Lei, X.; Zhang, Z.; Dong, P. Dynamic path planning of unknown environment based on deep reinforcement learning. J. Robot. 2018, 2018, 5781591. [Google Scholar] [CrossRef]
  18. Jiang, L.; Huang, H.Y.; Ding, Z.H. Path Planning for Intelligent Robots Based on Deep Q-learning With Experience Replay and Heuristic Knowledge. IEEE-CAA J. Autom. Sin. 2020, 7, 1179–1189. [Google Scholar] [CrossRef]
  19. Dong, Y.F.; Yan, C.; Dong, Y.; Qin, C.X.; Xiao, H.X.; Wang, Z.Q. Path planning based on improved DQN robot. Comput. Eng. Des. 2021, 42, 552–558. [Google Scholar]
  20. Feng, S.; Shu, H.; Xie, B.O. Three-dimensional environment path planning based on improved deep reinforcement learning. Comput. Appl. Softw. 2021, 38, 250–255. [Google Scholar]
  21. Huang, R.N.; Qin, C.X.; Li, J.L.; Lan, X.J. Path planning of mobile robot in unknown dynamic continuous environment using reward-modified deep Q-network. Optim. Control. Appl. Methods, 2021; early view. [Google Scholar] [CrossRef]
  22. Xie, R.; Meng, Z.; Zhou, Y.; Ma, Y.; Wu, Z. Heuristic Q-learning based on experience replay for three-dimensional path planning of the unmanned aerial vehicle. Sci. Prog. 2020, 103, 0036850419879024. [Google Scholar] [CrossRef]
  23. Prianto, E.; Kim, M.; Park, J.H.; Bae, J.H.; Kin, J.S. Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor–Critic with Hindsight Experience Replay. Sensors 2020, 20, 5911. [Google Scholar] [CrossRef]
  24. Liu, Q.Q.; Liu, P.Y. Soft Actor Critic Reinforcement Learning with Prioritized Experience Replay. J. Jilin Univ. (Inf. Sci. Ed.) 2021, 39, 192–199. [Google Scholar]
  25. Zhai, P.; Zhang, Y.; Shaobo, W. Intelligent Ship Collision Avoidance Algorithm Based on DDQN with Prioritized Experience Replay under COLREGs. J. Mar. Sci. Eng. 2022, 10, 585. [Google Scholar] [CrossRef]
  26. Li, H. Research on Mobile Robot Path Planning Method Based on Deep Reinforcement Learning. Master’s Thesis, Tianjin Vocational and Technical Normal University, Tianjin, China, 2020. [Google Scholar]
  27. Hasselt, H.V.; Guze, A.; Silver, D. Deep Reinforcement Learning with Double Q-learning. Comput. Sci. 2015, 47, 253–279. [Google Scholar] [CrossRef]
  28. Devo, A.; Costante, G.; Valigi, P. Deep Reinforcement Learning for Instruction Following Visual Navigation in 3D Maze-Like Environments. IEEE Robot. Autom. Lett. 2020, 5, 1175–1182. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the experience classification training method.
Figure 2. Example of a fixed location scene.
Figure 3. Return curve of four different algorithms in a fixed location scene.
Figure 4. The number of steps consumed by the four algorithms in a fixed location scene.
Figure 5. The path given by four algorithms in a fixed position scene.
Figure 6. An example of a random scene.
Figure 7. Return curve of four algorithms in a random environment.
Figure 8. The number of steps consumed by the four algorithms in a random location scene.
Figure 9. Planning path of four algorithms in four random scenes. (a) Path planning results of the first random environment. (b) Path planning results of the second random environment. (c) Path planning results of the third random environment. (d) Path planning results of the fourth random environment.
Table 1. Experimental training parameters' settings.
Parameter                                                          Value
initial exploration factor (ε_init)                                1
minimum exploration factor (ε_min)                                 0.01
learning rate (α)                                                  0.001
network parameter update step interval (C)                         100
maximum steps of a single episode (Step_max)                       1000
maximum number of iteration episodes (T_max_episodes)              100
reward attenuation factor (γ)                                      0.9
maximum capacity of the experience replay container (M_capacity)   20,000
minimum exploration number (N_explore)                             500
sampling batch size (N_batchSize)                                  128
initial exploration probability (p_0)                              0.2
dynamic sampling probability (p_1)                                 0.2
auxiliary exploration coefficient (β)                              0.4
Table 2. Performance comparison of four different algorithms in a fixed position scene.
Algorithm                       DDQN       MS-DDQN    EC-DDQN    DDQN-ECMS
Average time spent (s)          129.732    122.244    122.587    124.709
Average return value            8.203      8.825      8.275      8.930
Average number of steps         157.815    160.267    158.132    157.330
Average reward per step         0.052      0.055      0.052      0.057
Table 3. Performance comparison of four different algorithms in a random position scene.
Algorithm                       DDQN       MS-DDQN    EC-DDQN    DDQN-ECMS
Average time spent (s)          73.292     74.436     67.913     74.013
Average return value            8.886      8.644      8.902      9.121
Average number of steps         71.684     72.767     69.888     71.102
Average reward per step         0.124      0.119      0.127      0.128
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

