Article

Deep-Reinforcement-Learning-Based Motion Planning for a Wide Range of Robotic Structures

Institute of Automation and Computer Science, Brno University of Technology, 61600 Brno, Czech Republic
* Author to whom correspondence should be addressed.
Computation 2024, 12(6), 116; https://doi.org/10.3390/computation12060116
Submission received: 25 March 2024 / Revised: 30 April 2024 / Accepted: 23 May 2024 / Published: 5 June 2024
(This article belongs to the Special Issue 10th Anniversary of Computation—Computational Engineering)

Abstract

The use of robot manipulators in engineering applications and scientific research has significantly increased in recent years. This can be attributed to the rise of technologies such as autonomous robotics and physics-based simulation, along with the utilization of artificial intelligence techniques. However, the use of these technologies is often limited by a focus on a specific type of robotic manipulator and a particular task, which can hinder modularity and reproducibility in future expansions. This paper presents a method for planning motion across a wide range of robotic structures using deep reinforcement learning (DRL) algorithms to solve the problem of reaching a static or random target within a pre-defined configuration space. The paper addresses the challenge of motion planning under a variety of conditions, including environments with and without the presence of collision objects. It highlights the versatility and potential for future expansion through the integration of OpenAI Gym and the PyBullet physics-based simulator.

1. Introduction

The significance of robot manipulators in engineering applications and scientific research has increased substantially in recent years. This surge can be attributed to the more frequent application of technologies such as autonomous robotics, physics-based simulation, etc. [1,2], within Industry 4.0 [3], along with the utilization of artificial intelligence techniques. As technology continues to advance, the convergence of robotics and artificial intelligence within the concept of Industry 4.0 holds immense promise. This requires a comprehensive exploration and understanding of the transformative potential inherent in these advancements.
The motion planning problem, shown schematically in Figure 1, is a fundamental and challenging research topic in industrial robotics. It involves generating a collision-free path or trajectory for various robotic structures to perform specific tasks while adhering to constraints and avoiding obstacles. The main objective is to determine an efficient path for the robotic manipulator $R$ to move from its initial configuration $\theta_s$ to the desired final configuration $\theta_f$ and generate reference inputs for the motion control system. Traditionally, formulations of the motion planning problem for robotic manipulators rely on the concept of configuration space ($\mathcal{C}$ or C-space) [4,5], representing all possible transformations based on the robot’s kinematics model.
The traditional task of planning trajectories for robots with a high number of degrees of freedom (DoF), such as industrial robots (typically four to seven DoFs), has undergone a significant revolution in motion planning with the development of sampling-based algorithms [6,7,8]. Representative examples of these algorithms include Rapidly Exploring Random Trees (RRTs) and Probabilistic Roadmaps (PRMs). Another approach to solving the path planning problem, which was tested in complex scenarios such as a social environment with a human as a dynamic obstacle [9], was developed based on fuzzy logic. The motion planning problem has also been addressed using evolutionary computation techniques, as discussed in the comprehensive review [10]. Such problems have even been used to create benchmarking instances for evolutionary computation methods [11,12,13]. Another group of approaches to solving the motion planning problem involves using reinforcement learning (RL) algorithms [14] to complete partial tasks within a specific area. In contrast to traditional methods, which often rely on pre-defined maps and algorithms, RL has emerged as a groundbreaking paradigm, revolutionizing path planning by allowing autonomous agents to learn and adapt their strategies through experience.
In recent years, the combination of deep neural networks (DNN) [15] and reinforcement learning, termed deep reinforcement learning (DRL) [16], has become a popular choice for end-to-end control in robotics research, where sequential actions are learned directly from raw input observations. This statement is supported by several review articles that address the research problem [17,18,19]. These articles highlight various research gaps that merit further investigation. One of the identified gaps is the lack of versatility of robotic structures used to teach specific skills. In addition, issues related to reliability and accuracy are frequently discussed in the articles. Future research directions include the exploration of safe navigation for robots, where they must autonomously navigate their environment without causing harm to themselves or encountering obstacles within the configuration space.
One of the fundamental areas of research in motion planning that utilizes RL algorithms involves navigating in structured environments with the objective of reaching predetermined [20,21] or randomly positioned targets [22,23,24,25,26]. However, these environments lack the presence of collision objects. Among the most commonly used algorithms for the presented problem are actor–critic model-free algorithms such as the Deep Deterministic Policy Gradient (DDPG) [27], the Twin Delayed Deep Deterministic Policy Gradient (TD3) [28], and the Soft Actor–Critic (SAC) [29], often used with an experience replay buffer [30]. In addition to fixed robotic arms, mobile platforms extended with robotic arms are used to solve problems in this area. The research described in [31] presents the generation of robot base motions for mobile platform systems. However, collision detection within the environment is not considered in the approach presented in the article. Another article describes the use of RL algorithms, specifically DRL, to solve inverse kinematics problems in 7-DoF robotic manipulators [32]. The method employs Products of Exponentials as a forward kinematics computation tool and the Deep Q-Network as an inverse kinematics solver. The article identifies the lack of self-collision detection as a limitation of the proposed method. Another approach integrates DRL to explore the length-optimal path in Cartesian space and to derive the energy-optimal solution to inverse kinematics [33]. This solution combines a so-called traditional path planner, which includes collision detection, with DRL to find the optimal solution for a redundant robotic manipulator. Yet another approach uses DRL to avoid collisions with humans in the working space of the robotic arm: the method described in [34] proposes a safety shield for scenarios with a high probability of collision.
DRL algorithms are also used in the field of robotics to solve complex tasks, such as grasping, pushing, and sliding objects within the environment. The approach described in [35] presents on-policy and off-policy RL algorithms for object manipulation tasks. The comparison indicates that when obtaining new experience is costly and computation power is not a concern, the off-policy method SAC may be preferable. Conversely, if data collection is less expensive, such as in simulations, the on-policy method Proximal Policy Optimization (PPO) can achieve better results. In another article, as described in [36], an effective approach is presented to use RL for robotic manipulation based on simulated locomotion demonstrations. The proposed approach has been evaluated on 13 tasks using the 7-DoF Fetch robotic arm. One limitation of the method is its inability to avoid potential collisions with objects. On the other hand, the approach described in [37] presents a new automated method for complex object manipulation in environments with obstacles. The method is based on hindsight goal generation and its extensions. Valuable hindsight goals are selected by balancing graph-based diversity and proximity metrics. The results are demonstrated using the Kuka collaborative robotic manipulator with 7-DoF. DRL algorithms are also used as tools in robotics research to enable closed-loop dynamic control for soft robotic manipulators [38]. This adoption of DRL algorithms reflects a wider trend in robotics research, where various mechanical structures, including conventional “standard” robotic manipulators, low-budget robots, or even self-made robots, are increasingly being used for control tasks. This trend demonstrates the increasing recognition of the effectiveness of DRL in managing complex control issues across diverse robotic platforms, promoting innovation, and expanding the availability of advanced control methods within the field.
The primary motivation of this paper is to improve the versatility of DRL methods in robotic applications to a wide range of robotic structures, which was identified as one of the major research gaps in contemporary reviews [17,18,19]. Although the use of DRL methods in the field of industrial robotics is growing in popularity, the implementation of these methods on various robotic structures, as well as reaching targets in environments with the presence of collision objects, represents a relatively unexplored area of research. The novelty of the proposed solution lies in the reliability and accuracy of the solution for the safe navigation of a wide range of robotic structures in a predefined environment.
To demonstrate the versatility of the proposed method in this paper, robotic structures that are part of a robotic laboratory Industry 4.0 Cell [39] were used. The laboratory contains a variety of robotic structures, as shown in Figure 2, which represents a set of the most common geometric representations used for experiments in robotic research, that is, an industrial robot ABB IRB 120 [40] with six degrees of freedom, the same robot extended by a linear axis providing seven DoFs, a SCARA robot Epson LS3-B401S [41] with four DoFs, a dual-arm collaborative robot ABB IRB14000 [42] with seven DoFs on each arm, and finally, a collaborative robot Universal Robots UR3 [43] with six DoFs.
The structure of this paper is as follows. Section 2 delves into reinforcement learning methods, providing an overview of the fundamental concept of reinforcement learning. Additionally, it discusses the integration of deep learning techniques into traditional reinforcement learning methods, known as deep reinforcement learning. Section 3 is dedicated to Deep-Reinforcement-Learning-Based planning algorithms, specifically designed for tasks aimed at achieving randomly generated targets within a predefined configuration space. Finally, Section 4 summarizes the main findings and insights presented throughout this paper. It also offers reflections on the implications of the study and suggests potential avenues for future research.

2. Fundamental Concept of Reinforcement Learning

Reinforcement learning [14] is an area of machine learning in which an intelligent agent learns how to achieve a specific behavior by maximizing rewards from the environment through the actions it chooses to execute, much like a human being learning by trial and error. A higher reward encourages the agent to choose the selected action in the future, while a lower reward, also known as a penalty, has the opposite effect. The idea is that the agent will learn how to achieve its behavior within the specific environment in an effective manner that converges on an optimal solution.
The problem of reinforcement learning is formally defined in terms of optimal control of a Markov Decision Process (MDP). Figure 3 illustrates its basic structure in an iteration loop. A characteristic property of MDPs is that each state is only dependent on the previous state, i.e., a memoryless property where each state contains all the information necessary to predict the next state.
A Markov Decision Process is a discrete-time stochastic control process and can be represented as the tuple $(S, A, R, p)$, where $S$ is the continuous multidimensional state space and $A$ is the continuous multidimensional action space. The state transition probabilities are defined by the function $p : S \times S \times A \to [0, 1]$, which represents the probability density of the next state $s' \in S$ given the current state $s \in S$ and action $a \in A$.
$$p(s' \mid s, a) = \Pr\big[\,S_{t+1} = s' \mid S_t = s,\; A_t = a\,\big] \tag{1}$$
The reward function, defined by $R : S \times S \times A \to \mathbb{R}$, represents the immediate reward that the agent receives after the transition from $s \in S$ to $s' \in S$ using an action $a \in A$.
$$R(s, a, s') = \mathbb{E}\big[\,R_t \mid S_t = s,\; A_t = a,\; S_{t+1} = s'\,\big] \tag{2}$$
The objective of the process is for the agent to learn optimal behavior, also known as a policy denoted as $\pi = \pi(a \mid s)$, which maximizes its discounted expected return. A discount factor $\gamma$ in the range $(0, 1]$ is usually introduced when calculating the long-term reward, resulting in the following expression for the continuous ongoing process:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k}, \tag{3}$$
where $\gamma$ determines the priority of long-term future rewards.
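As a brief numerical illustration of Equation (3), the following Python snippet (a minimal sketch added here for clarity, not part of the original paper) computes the discounted return for a finite sequence of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+1+k} for a finite reward sequence."""
    g = 0.0
    # Iterating backwards applies one additional factor of gamma per step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```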
In the field of reinforcement learning, the value function is a crucial concept that guides decision-making processes. It serves as a quantitative measure of the expected cumulative reward that an agent can anticipate while occupying a specific state or following a particular action. This metric helps the agent assess the desirability of states and actions, facilitating the learning of an optimal policy.
More formally, the state value function $V^{\pi}(s)$ can be defined as the expected return when following a policy $\pi$ from a particular state $s$.
$$V^{\pi}(s) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s\,\big] \tag{4}$$
The action value function, which represents the expected discounted return when starting in state s and initially taking action a but then following policy π , can be defined as follows:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\,G_t \mid S_t = s,\; A_t = a\,\big]. \tag{5}$$
The primary goal of the agent in reinforcement learning is to find the optimal policy $\pi^*$, which is better than or equal to all other policies. This can be achieved by estimating the corresponding optimal action value function, defined as follows:
$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a). \tag{6}$$
The optimal action value function satisfies the Bellman equation, a fundamental concept in reinforcement learning. The Bellman optimality equation states that the value of a state–action pair is equal to the expected immediate reward plus the discounted value of the best action $a'$ taken in the next state. The equation for $Q^*$ is defined as follows:
$$Q^*(s, a) = \mathbb{E}\Big[\,R_{t+1} + \gamma \max_{a'} Q^*(s', a') \;\Big|\; S_t = s,\; A_t = a\,\Big]. \tag{7}$$
The Bellman optimality equation is a fundamental formula in reinforcement learning theory. After finding the optimal action value function $Q^*(s, a)$, the optimal policy $\pi^*$ can be followed by selecting the optimal action $a^*$ in each state $s$, defined as
$$a^*(s) = \underset{a}{\operatorname{argmax}}\, Q^*(s, a). \tag{8}$$
It is important to note that the optimization of the value function is often performed off-policy and can, therefore, utilize experience replay.
The well-known and straightforward algorithm for this type of problem is Q-learning, as described in [44]. The Q-learning algorithm belongs to the category of Temporal Difference (TD) learning [45]. TD learning is a form of a value-based approach in which the value function is optimized by minimizing the TD error, denoted as δ .
$$\delta_t = R_t + \gamma \max_{a'} Q^{\pi}(s', a') - Q^{\pi}(s_t, a_t) \tag{9}$$
The Q-learning update rule for the estimation of $Q^*(s_t, a_t)$ becomes an optimization problem described as follows:
$$Q^*(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \delta_t = Q^{\pi}(s_t, a_t) + \alpha\Big[R_t + \gamma \max_{a'} Q^{\pi}(s', a') - Q^{\pi}(s_t, a_t)\Big], \tag{10}$$
where $\alpha \in [0, 1)$ is the learning rate.
It is important to note that the standard form of Q-learning is inefficient for large environments because it must consider every possible state–action pair to determine the optimal $Q^*(s, a)$, for example, by using a tabular approach. To address this inefficiency, a general approximator based on deep neural networks can be used instead of large tables.
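For concreteness, the following sketch shows the tabular Q-learning update from Equations (9) and (10); the discrete state/action counts, the epsilon-greedy exploration, and the Gym-style environment interface are illustrative assumptions added here, not part of the paper's method.

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))          # tabular action value function
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def q_learning_step(env, s):
    """One tabular Q-learning update for a hypothetical Gym-style discrete env."""
    # Epsilon-greedy action selection.
    a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
    s_next, r, done, _ = env.step(a)
    # TD error (Eq. (9)) and update rule (Eq. (10)).
    td_error = r + gamma * (0.0 if done else np.max(Q[s_next])) - Q[s, a]
    Q[s, a] += alpha * td_error
    return s_next, done
```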

2.1. Deep Reinforcement Learning

Deep reinforcement learning [16] is an extension of reinforcement learning (RL) that uses deep neural networks [15] as function approximators for value functions or policies. RL, at its core, involves training agents to make sequential decisions in environments, and DRL enhances this process by employing deep neural networks to handle high-dimensional data.
DRL has significant applications in robotics, especially in addressing challenges related to path planning. In this context, several DRL algorithms, also known as actor–critic model-free RL algorithms, such as the Deep Deterministic Policy Gradient [27], the Twin Delayed Deep Deterministic Policy Gradient [28], and the Soft Actor–Critic [29], often used with an experience replay buffer [30], have proven effective. These algorithms optimize both the policy and the value functions with the overarching goal of maximizing cumulative rewards, enabling them to converge on optimal results. To enhance the learning process, an experience replay buffer is commonly employed to store transitions during interactions with the environment. Figure 4 depicts the actor–critic architecture.
The integration of Deep Reinforcement Learning into robotics and path planning demonstrates its capability to handle complex and high-dimensional data, facilitating the effective learning of various tasks [22,23,24]. In the following sections, the details of the DDPG, TD3, and SAC algorithms are briefly described.

2.1.1. Deep Deterministic Policy Gradient

The Deep Deterministic Policy Gradient (DDPG) [27] is an off-policy reinforcement learning algorithm designed for continuous action spaces that uses a deterministic policy. It utilizes two neural networks: the actor and the critic. The actor network predicts the optimal action to take in a given state, while the critic network evaluates the quality of the actor’s chosen action.
The DDPG algorithm learns the optimal policy and Q-function simultaneously. Initially, it uses the Bellman equation and off-policy memory to learn the Q-function. Subsequently, the algorithm utilizes the acquired Q-function to learn the policy. The primary objective is to determine the optimal action, denoted as $a^*$, for each state in a continuous control environment, similar to the Q-learning method (see Equation (8)).
The Bellman equation is utilized to estimate the optimal Q-function through a set of artificial neural networks, denoted as $Q_{\theta}(s, a)$, where $\theta$ represents the network parameters. The principle of the DDPG algorithm is to minimize the Mean Squared Bellman Error (MSBE) function defined for the set $D$ of transitions $(s, a, R, s', d)$. The equation for the MSBE function can be described as follows:
$$L(\theta, D) = \mathbb{E}_{D}\Big[\big(Q_{\theta}(s, a) - \big(R_t + \gamma (1 - d) \max_{a'} Q_{\theta}(s', a')\big)\big)^2\Big], \tag{11}$$
where $d$ indicates whether the terminal condition is satisfied or not, and the term $R_t + \gamma (1 - d) \max_{a'} Q_{\theta}(s', a')$ is called the target.
The problem is that both $Q_{\theta}$ and the target depend on the same parameters $\theta$. Due to this dependency, the minimization of the MSBE function becomes unstable. Therefore, another neural network, $Q_{\theta_{\mathrm{targ}}}$, referred to as the target network, is used to compute the error function. Since the action space is infinite, it is possible to use another neural network, $\mu_{\mathrm{targ}}$, known as the target policy network, to approximate the function $\max_{a'} Q_{\theta}(s', a')$. The target networks are then updated using the parameters of the main networks via the Polyak averaging formula [46].
Based on the above information, Equation (11) can be rewritten as follows:
$$L(\theta, D) = \mathbb{E}_{D}\Big[\big(Q_{\theta}(s, a) - \big(R_t + \gamma (1 - d)\, Q_{\theta_{\mathrm{targ}}}(s', \mu_{\mathrm{targ}}(s'))\big)\big)^2\Big]. \tag{12}$$
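A minimal PyTorch-style sketch of the critic update implied by Equation (12) and the Polyak averaging of the target networks is given below; the network objects (critic, critic_targ, actor_targ) and the batch layout are assumptions introduced here for illustration, not the paper's implementation.

```python
import torch

def ddpg_critic_loss(critic, critic_targ, actor_targ, batch, gamma=0.99):
    """MSBE loss from Eq. (12); batch = (s, a, r, s_next, d) tensors sampled from the replay buffer D."""
    s, a, r, s_next, d = batch
    with torch.no_grad():
        a_next = actor_targ(s_next)                 # mu_targ(s')
        q_targ = critic_targ(s_next, a_next)        # Q_theta_targ(s', mu_targ(s'))
        y = r + gamma * (1.0 - d) * q_targ          # Bellman target
    return ((critic(s, a) - y) ** 2).mean()

def polyak_update(net, net_targ, rho=0.95):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta (Polyak averaging [46])."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), net_targ.parameters()):
            p_targ.mul_(rho).add_(p, alpha=1.0 - rho)
```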

2.1.2. Twin Delayed Deep Deterministic Policy Gradient

The Twin Delayed Deep Deterministic Policy Gradient (TD3) [28] is an off-policy reinforcement learning algorithm designed for a continuous action space. It is an extension of the DDPG algorithm that aims to overcome some of its limitations by implementing several significant improvements.
Firstly, the action is selected using so-called target policy smoothing, as follows:
$$a'(s') = \operatorname{clip}\big(\mu_{\mathrm{targ}}(s') + \operatorname{clip}(\epsilon, -c, c),\; a_{\mathrm{Low}},\; a_{\mathrm{High}}\big), \tag{13}$$
where ϵ is a random variable with a normal distribution and c is a hyperparameter.
Secondly, the algorithm uses clipped double Q-learning [47], which involves learning two Q-functions concurrently instead of one and selecting the smaller one to calculate the target term in the Bellman equation. This approach enables TD3 to more effectively capture the uncertainty in the environment and reduce the overestimation of value estimates that may occur in DDPG.
The TD3 algorithm includes two primary networks, denoted as $Q_{\theta_0}$ and $Q_{\theta_1}$, and two target networks, denoted as $Q_{\theta_0,\mathrm{targ}}$ and $Q_{\theta_1,\mathrm{targ}}$. The general form of the MSBE function for the TD3 algorithm is described in Equation (14).
$$L(\theta_i, D) = \mathbb{E}_{D}\Big[\big(Q_{\theta_i}(s, a) - \big(R_t + \gamma (1 - d) \min_{i \in \{0,1\}} Q_{\theta_i,\mathrm{targ}}(s', \mu_{\mathrm{targ}}(s'))\big)\big)^2\Big] \tag{14}$$
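The following sketch illustrates how the TD3 target in Equations (13) and (14) can be computed, combining target policy smoothing with the clipped double-Q minimum; the network objects, noise scale, and action bounds are illustrative assumptions.

```python
import torch

def td3_target(critic_targ_0, critic_targ_1, actor_targ, s_next, r, d,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Clipped double-Q Bellman target with target policy smoothing (Eqs. (13) and (14))."""
    with torch.no_grad():
        a_det = actor_targ(s_next)                                   # deterministic target action
        noise = torch.clamp(sigma * torch.randn_like(a_det), -c, c)  # clipped Gaussian noise
        a_next = torch.clamp(a_det + noise, a_low, a_high)           # Eq. (13)
        q_min = torch.min(critic_targ_0(s_next, a_next),
                          critic_targ_1(s_next, a_next))             # minimum over the two target critics
        return r + gamma * (1.0 - d) * q_min
```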

2.1.3. Soft Actor–Critic

The Soft Actor–Critic (SAC) [29] algorithm is an off-policy reinforcement learning algorithm designed for a continuous action space. It differs from DDPG and TD3 in its approach, although it incorporates some techniques from these algorithms, such as the clipped double-Q trick. The most significant aspect of the SAC algorithm is the entropy regularization feature. The algorithm maximizes both the expected return and the entropy during policy training, which helps prevent premature convergence to suboptimal policies.
The concept of entropy, which evaluates the randomness of a variable x using its density function P, can be defined as follows:
$$H(P) = \mathbb{E}_{x \sim P}\big[-\log P(x)\big]. \tag{15}$$
Taking into account the effect of entropy, where the agent receives a bonus reward proportional to the entropy of the policy at the time step, the optimal policy can be defined as follows:
$$\pi^* = \underset{\pi}{\operatorname{argmax}}\; \mathbb{E}_{\pi}\bigg[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t))\big)\bigg], \tag{16}$$
where α > 0 is a hyperparameter called the trade-off coefficient.
In addition to the optimal policy, two Q-functions, $Q_{\theta_0}$ and $Q_{\theta_1}$, are optimized by the SAC algorithm. It uses the clipped double Q-learning trick and updates the target networks using Polyak averaging. The MSBE function for the SAC algorithm is described in Equation (17).
$$L(\theta_i, D) = \mathbb{E}_{D}\Big[\big(Q_{\theta_i}(s, a) - y(r, s', d)\big)^2\Big], \tag{17}$$
where the target term, denoted $y(r, s', d)$, is expressed as
$$y(r, s', d) = R_t + \gamma (1 - d)\Big(\min_{i \in \{0,1\}} Q_{\theta_i,\mathrm{targ}}(s', a') - \alpha \log \pi(a' \mid s')\Big), \tag{18}$$
where $\pi$ is the current learned policy and $a' \sim \pi(\cdot \mid s')$. To optimize the policy, the SAC algorithm uses a reparameterization trick to determine the action as follows:
$$a = f_{\theta}(\epsilon, s), \tag{19}$$
where $f_{\theta}$ is a deterministic function and $\epsilon$ is random noise drawn from a normal distribution.
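A short sketch of the entropy-regularized SAC target from Equation (18) follows; the stochastic policy object, which is assumed to return a reparameterized action together with its log-probability (Equation (19)), is an illustrative placeholder rather than the paper's code.

```python
import torch

def sac_target(critic_targ_0, critic_targ_1, policy, s_next, r, d,
               gamma=0.99, alpha=0.2):
    """Entropy-regularized Bellman target y(r, s', d) from Eq. (18)."""
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)    # a' ~ pi(.|s') and log pi(a'|s'); assumed interface
        q_min = torch.min(critic_targ_0(s_next, a_next),
                          critic_targ_1(s_next, a_next))
        return r + gamma * (1.0 - d) * (q_min - alpha * logp_next)
```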

3. Deep-Reinforcement-Learning-Based Motion Planning

The combination of deep neural networks and reinforcement learning, known as deep reinforcement learning, represents a relatively unexplored area of research in robotics and motion planning. The complex approach of learning sequential actions directly from raw input observations, especially in the problem of reaching a static [20,21] or random target [22,23,24,25,26] within the robot configuration space, shows potential in this area. Among the most commonly used algorithms for the presented problem, as described in the previous part, are actor–critic model-free algorithms such as DDPG, TD3, and SAC, often used with experience replay.
In consideration of the research objectives, which prioritize the implementation of a modular approach to DRL-based motion planning solutions across a variety of robotic structures, each algorithm was compared in a specific experiment. The methods highlighted in the above-mentioned literature face challenges related to accuracy, independence from the type of robotic structure, the potential occurrence of self-collision, or collision avoidance in general. The research aims to address these challenges, eliminate shortcomings, and achieve greater versatility.

3.1. Comparison of Actor–Critic Algorithms

The comparison of DRL algorithms was focused mainly on the specific robotic arm Universal Robots UR3, considering that it was part of previous research [20]. The solved problem was divided into two parts (see Figure 5), with both parts focusing on reaching the target in a pre-defined configuration space. The first part, defined by the environment $E_1$, was focused on reaching the target within the configuration space without any external collision. The second part, defined by the environment $E_2$, was focused on the same problem, but with the presence of a statically positioned external collision object.
In both cases, the task was to move the robotic arm from the initial position to the desired final position within the pre-defined configuration space without collisions, including self-collisions. An effective approach to avoiding collisions was to approximate the individual parts of the robot’s structure, as well as potential external collision objects, using so-called bounding boxes [48,49]. The static objects that form the base of the robot and potential external objects were approximated with Axis-Aligned Bounding Boxes (AABBs), while dynamic objects such as joints were approximated with Oriented Bounding Boxes (OBBs). The movement was executed by generating the trajectory using linear segments with parabolic blends (LSPB). The Levenberg–Marquardt method [50] was used to define the motion using a numerical inverse kinematics approach. The learning process was performed through the integration of OpenAI Gym [51] and PyBullet [52] physics-based simulator.
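To illustrate the bounding-box approximation used for collision checking, the snippet below shows a basic axis-aligned bounding box (AABB) overlap test of the kind described in [48,49]; the min/max-corner representation is an assumption made here for illustration and does not reflect the paper's exact implementation.

```python
import numpy as np

def aabb_overlap(min_a, max_a, min_b, max_b):
    """Return True if two axis-aligned bounding boxes, given by their min/max corners, intersect."""
    min_a, max_a = np.asarray(min_a), np.asarray(max_a)
    min_b, max_b = np.asarray(min_b), np.asarray(max_b)
    # Boxes overlap only if their intervals overlap on every axis.
    return bool(np.all(max_a >= min_b) and np.all(max_b >= min_a))

# Two unit cubes shifted by 0.5 m along x overlap; shifted by 2.0 m they do not.
print(aabb_overlap([0, 0, 0], [1, 1, 1], [0.5, 0, 0], [1.5, 1, 1]))  # True
print(aabb_overlap([0, 0, 0], [1, 1, 1], [2.0, 0, 0], [3.0, 1, 1]))  # False
```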
The observation space consists of the position of the end effector in the current iteration and the desired end effector position. The desired position, known as the target, was randomly generated in each episode to enhance the learning model. The action space consists of the end effector movement command in three-dimensional Euclidean space. The reward function, as a critical part of the learning process, was defined for the $E_1$ environment as follows:
$$R_i = -\big\lVert p_d - p_i \big\rVert, \tag{20}$$
and for the $E_2$ environment as
$$R_i = -\bigg(\big\lVert p_d - p_i \big\rVert + \frac{\gamma_O}{1 + \big\lVert p_O - p_i \big\rVert}\bigg). \tag{21}$$
In Equation (20), the reward for the $E_1$ environment is determined by the simple Euclidean distance between the current position of the end effector, denoted as $p_i$, and the desired position of the end effector, denoted as $p_d$. Equation (21) for the $E_2$ environment extends Equation (20) with a penalty term related to the collision object. This penalty is defined based on the position of the collision object, denoted as $p_O$; the current position of the end effector, denoted as $p_i$; and the threshold parameter $\gamma_O$ (empirically set to the value 0.01). This threshold parameter affects the significance of the penalty within the learning process. It is important to note that while the reward function for the $E_1$ environment adheres to conventional methodologies for the given task, the reward calculation function for the $E_2$ environment introduces a novel approach.
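The two reward functions translate directly into code; the following sketch follows the reconstructed form of Equations (20) and (21) above, with gamma_o = 0.01 as reported in the text (the function names are illustrative):

```python
import numpy as np

def reward_e1(p_i, p_d):
    """E1 reward: negative Euclidean distance between the current and desired end-effector positions (Eq. (20))."""
    return -np.linalg.norm(np.asarray(p_d) - np.asarray(p_i))

def reward_e2(p_i, p_d, p_o, gamma_o=0.01):
    """E2 reward: Eq. (20) extended with an obstacle penalty that grows as the end effector nears p_o (Eq. (21))."""
    distance_term = np.linalg.norm(np.asarray(p_d) - np.asarray(p_i))
    obstacle_term = gamma_o / (1.0 + np.linalg.norm(np.asarray(p_o) - np.asarray(p_i)))
    return -(distance_term + obstacle_term)
```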
The learning process is interrupted when the robot encounters an external object or self-collision. Conditions for process interruption include the singularity check, along with evaluating the feasibility of the inverse kinematics solution. Successful completion of the learning episode is achieved when the current end-effector position of the robotic arm aligns closely with the randomly generated target, defined by the desired end-effector position.
The training process, with the parametrization shown in Appendix A, using the DDPG, SAC, and TD3 algorithms, both with and without Hindsight Experience Replay (HER), is illustrated in Figure 6 for environment $E_1$ and Figure 7 for environment $E_2$. The comprehensive results of the evaluation process are described in Table 1 for environment $E_1$ and Table 2 for environment $E_2$. The results underscore the importance of selecting appropriate algorithms based on the characteristics of the environment. Specifically, the TD3 algorithm proved to be effective in environments without external collisions, while the DDPG algorithm was well suited for scenarios featuring statically positioned external collision objects. This strategic approach improved the adaptability of the chosen methods for a diverse range of robotic arms in future applications.
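As an illustration of how such a training run can be launched, the sketch below uses the Stable-Baselines3 API with the hyperparameters from Appendix A; the paper itself only states the OpenAI Gym and PyBullet integration, so the choice of library and the environment id "ReachEnv-E1" are assumptions made here for illustration.

```python
import gym
from stable_baselines3 import TD3

# Hypothetical Gym registration of the E1 reaching task (environment id assumed for illustration).
env = gym.make("ReachEnv-E1")

model = TD3(
    "MlpPolicy", env,
    learning_rate=1e-3,        # Appendix A: learning rate 0.001
    buffer_size=1_000_000,     # Appendix A: replay buffer of 10^6 transitions
    batch_size=256,            # Appendix A: batch size 256
    verbose=1,
)
model.learn(total_timesteps=100_000)   # Appendix A: 10^5 total timesteps
model.save("td3_reach_e1")
```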

3.2. Application of Selected Algorithms to a Wide Range of Robotic Structures

The algorithms found suitable for each case in the comprehensive experiments described in the previous section were applied to a wide range of robotic arms in both the $E_1$ environment, as described in Figure 8, and the $E_2$ environment, as described in Figure 9. Specifically, the TD3 algorithm was employed in environments without external collisions, while the DDPG algorithm was utilized for scenarios featuring statically positioned external collision objects.
The results of the training process, which was evaluated for specific robot arms, namely ABB IRB 120 with and without an extended linear axis, Epson LS3-B401S, dual-arm collaborative robot ABB IRB 14000, and Universal Robots UR3, are presented in Table 3 for environment $E_1$ and Table 4 for environment $E_2$.
To underscore the universality of the proposed method, an additional prediction test was conducted for a portfolio of robotic arms. The results are presented in Table 5 for environment $E_1$ and Table 6 for environment $E_2$. The results clearly demonstrate the high efficiency and universality of the algorithms in each environment across a diverse range of robotic structures.

4. Conclusions

In this paper, the development of a method for planning motion across a diverse range of robotic structures based on innovative algorithms was presented. The combination of deep neural networks and reinforcement learning, known as deep reinforcement learning, remains a relatively unexplored area of research in robotics and motion planning, especially for the problem of reaching a static or random target within the robot configuration space. The methods highlighted in the literature review faced challenges related to accuracy, independence from the type of robotic structure, potential occurrences of self-collision, or collision avoidance in general. The proposed method effectively addresses the aforementioned shortcomings and showcases the high efficiency and universality of the algorithms across a diverse range of robotic structures. To substantiate this claim, a comprehensive comparison of actor–critic model-free algorithms was conducted, including the Deep Deterministic Policy Gradient, the Twin Delayed Deep Deterministic Policy Gradient, and the Soft Actor–Critic, both with and without Hindsight Experience Replay. The problem was divided into two parts, with both parts focusing on reaching the target in a pre-defined configuration space. The first part, defined by the environment $E_1$, focused on reaching the target within the configuration space without any external collision. The second part, defined by the environment $E_2$, focused on the same problem, but with the presence of a statically positioned external collision object. The movement was executed by generating the trajectory using linear segments with parabolic blends. The results underscored the importance of selecting appropriate algorithms based on the characteristics of the environment. Specifically, the TD3 algorithm proved to be effective in environments without external collisions, while the DDPG algorithm was well suited for scenarios featuring statically positioned external collision objects. This strategic approach improved the adaptability of the chosen methods, as demonstrated on the presented portfolio of robotic arms.
It is important to note that in practical scenarios involving real robots, the execution of movements may not always be realized through the LSPB method as demonstrated in the simulation experiments. In practical scenarios, for an individualized approach to robot control, it is necessary to use special methods that enable the robot to read the trajectory profile in real-time or near real-time, instead of the simple target position. These special methods can be, for example, EGM (Externally Guided Motion) for the ABB robot and UR-RTDE (Real-Time Data Exchange) for the UR robot. Otherwise, it is necessary to rely on the trajectory generation methods defined by the manufacturer of the respective robot.

Outlook for Future Research

Future investigations into Deep-Reinforcement-Learning-Based motion planning have the potential to address more complex challenges, including the avoidance of collisions with multiple static or dynamic objects, multi-robot motion planning, human–machine collaboration, planning with energy consumption or path efficiency optimization, intelligent grasping of diverse objects, and in general multi-objective environments for both simulation and real-world applications. One avenue for tackling these challenges involves enhancing DRL algorithms through the integration of multi-agent structures, such as the Multi-Agent Deep Deterministic Policy Gradient [53] or Multi-Agent TD3 [54]. These frameworks offer the potential to optimize collaborative decision-making among multiple agents, thereby improving the efficiency and robustness of motion planning tasks. Another potential research objective is to use a more sophisticated physics-based simulation tool, such as Unity3D or NVIDIA Isaac Sim, instead of the PyBullet tool that was utilized.

Author Contributions

Conceptualization, R.P., J.K. and R.M.; methodology, R.P. and J.K.; investigation, R.P. and J.K.; resources, R.P. and J.K.; writing—original draft preparation, R.P., J.K., R.M. and M.J.; writing—review and editing, R.P., J.K., R.M. and M.J.; software, R.P. and M.J.; visualization, R.P. and M.J.; supervision, R.M.; project administration, R.P. and J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the project IGA BUT No. FSI-S-23-8394 “Artificial intelligence methods in engineering tasks”.

Data Availability Statement

The code used in this work is available on the GitHub repository at the following URL: https://github.com/rparak/PyBullet_Industrial_Robotics_Gym (accessed on 25 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hyperparameters

The appendix presents the hyperparameter structure that was used for the experiments in both the $E_1$ and $E_2$ environments. The selection of these hyperparameters was inspired by a comprehensive review detailed in [22].
Table A1. Hyperparameter structure used for experiments in both types of environments.
Hyperparameters | Description
Network type | Multi-layer Perceptron (MLP)
Network size | 3 layers with 256 units each and ReLU non-linearities
Optimizer | Adam optimizer [55] with a learning rate of $1 \cdot 10^{-3}$ to train both the actor and the critic
Learning rate | 0.001
Polyak-averaging coefficient [46] | 0.95
Action L2 norm coefficient | 1.0
Observation clipping | $[-1, 1]$
Action clipping | $[-1, 1]$
Probability of random actions | 0.3
Scale of additive Gaussian noise | 0.2
Replay buffer size | $10^6$ transitions
Batch size | 256
Episode length | 100
Total timesteps | $10^5$

References

  1. Uygun, Y. The Fourth Industrial Revolution-Industry 4.0. 2021. Available online: https://ssrn.com/abstract=3909340 (accessed on 5 January 2024).
  2. Erboz, G. How to define industry 4.0: Main pillars of industry 4.0. Manag. Trends Dev. Enterp. Glob. Era 2017, 761, 761–767. [Google Scholar]
  3. Palka, D.; Ciukaj, J. Prospects for development movement in the industry concept 4.0. Multidiscip. Asp. Prod. Eng. 2019, 2, 315–326. [Google Scholar] [CrossRef]
  4. Siciliano, B.; Khatib, O. Springer Handbook of Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 11–33. [Google Scholar]
  5. Siciliano, B.; Sciavicco, L.; Villani, L.; Oriolo, G. Robotics: Modelling, Planning and Control, 1st ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  6. Liu, S.; Liu, P. Benchmarking and optimization of robot motion planning with motion planning pipeline. Int. J. Adv. Manuf. Technol. 2022, 118, 949–961. [Google Scholar] [CrossRef]
  7. Xanthidis, M.P.; Esposito, J.M.; Rekleitis, I.; O’Kane, J.M. Analysis of motion planning by sampling in subspaces of progressively increasing dimension. arXiv 2018, arXiv:1802.00328. [Google Scholar]
  8. Wang, X.; Li, X.; Guan, Y.; Song, J.; Wang, R. Bidirectional potential guided RRT* for motion planning. IEEE Access 2019, 7, 95046–95057. [Google Scholar]
  9. Tanha, S.D.N.; Dehkordi, S.F.; Korayem, A.H. Control a mobile robot in Social environments by considering human as a moving obstacle. In Proceedings of the 2018 6th RSI International Conference on Robotics and Mechatronics (IcRoM), Tehran, Iran, 23–25 October 2018; pp. 256–260. [Google Scholar] [CrossRef]
  10. Juříček, M.; Parák, R.; Kůdela, J. Evolutionary Computation Techniques for Path Planning Problems in Industrial Robotics: A State-of-the-Art Review. Computation 2023, 11, 245. [Google Scholar] [CrossRef]
  11. Kudela, J.; Juříček, M.; Parák, R. A collection of robotics problems for benchmarking evolutionary computation methods. In Applications of Evolutionary Computation, Proceedings of the International Conference on the Applications of Evolutionary Computation (Part of EvoStar), Brno, Czech Republic, 12–14 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 364–379. [Google Scholar]
  12. Kudela, J. A critical problem in benchmarking and analysis of evolutionary computation methods. Nat. Mach. Intell. 2022, 4, 1238–1245. [Google Scholar] [CrossRef]
  13. Stripinis, L.; Kudela, J.; Paulavicius, R. Benchmarking Derivative-Free Global Optimization Algorithms Under Limited Dimensions and Large Evaluation Budgets. IEEE Trans. Evol. Comput. 2024. early access. [Google Scholar] [CrossRef]
  14. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  15. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  16. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An introduction to deep reinforcement learning. Found. Trends® Mach. Learn. 2018, 11, 219–354. [Google Scholar] [CrossRef]
  17. Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Dresp-Langley, B. Deep reinforcement learning for the control of robotic manipulation: A focussed mini-review. Robotics 2021, 10, 22. [Google Scholar] [CrossRef]
  18. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A survey on deep reinforcement learning algorithms for robotic manipulation. Sensors 2023, 23, 3762. [Google Scholar] [CrossRef]
  19. Elguea-Aguinaco, Í.; Serrano-Muñoz, A.; Chrysostomou, D.; Inziarte-Hidalgo, I.; Bøgh, S.; Arana-Arexolaleiba, N. A review on reinforcement learning for contact-rich robotic manipulation tasks. Robot.-Comput.-Integr. Manuf. 2023, 81, 102517. [Google Scholar] [CrossRef]
  20. Parák, R.; Matoušek, R. Comparison of Multiple Reinforcement Learning and Deep Reinforcement Learning Methods for the Task Aimed at Achieving the Goal. Mendel J. Ser. 2021, 27, 1–8. [Google Scholar] [CrossRef]
  21. Kristensen, C.B.; Sørensen, F.A.; Nielsen, H.B.; Andersen, M.S.; Bendtsen, S.P.; Bøgh, S. Towards a robot simulation framework for e-waste disassembly using reinforcement learning. Procedia Manuf. 2019, 38, 225–232. [Google Scholar] [CrossRef]
  22. Plappert, M.; Andrychowicz, M.; Ray, A.; McGrew, B.; Baker, B.; Powell, G.; Schneider, J.; Tobin, J.; Chociej, M.; Welinder, P.; et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv 2018, arXiv:1802.09464. [Google Scholar]
  23. Gallouédec, Q.; Cazin, N.; Dellandréa, E.; Chen, L. panda-gym: Open-source goal-conditioned environments for robotic learning. arXiv 2021, arXiv:2106.13687. [Google Scholar]
  24. Rzayev, A.; Aghaei, V.T. Off-Policy Deep Reinforcement Learning Algorithms for Handling Various Robotic Manipulator Tasks. arXiv 2022, arXiv:2212.05572. [Google Scholar]
  25. Mahmood, A.R.; Korenkevych, D.; Komer, B.J.; Bergstra, J. Setting up a reinforcement learning task with a real-world robot. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4635–4640. [Google Scholar]
  26. Franceschetti, A.; Tosello, E.; Castaman, N.; Ghidoni, S. Robotic arm control and task training through deep reinforcement learning. In Intelligent Autonomous Systems 16, Proceedings of the International Conference on Intelligent Autonomous Systems, Singapore, 22–25 June 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 532–550. [Google Scholar]
  27. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  28. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor–critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: London, UK, 2018; pp. 1587–1596. [Google Scholar]
  29. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor–Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: London, UK, 2018; pp. 1861–1870. [Google Scholar]
  30. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Pieter Abbeel, O.; Zaremba, W. Hindsight experience replay. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  31. Honerkamp, D.; Welschehold, T.; Valada, A. Learning kinematic feasibility for mobile manipulation through deep reinforcement learning. IEEE Robot. Autom. Lett. 2021, 6, 6289–6296. [Google Scholar] [CrossRef]
  32. Malik, A.; Lischuk, Y.; Henderson, T.; Prazenica, R. A deep reinforcement-learning approach for inverse kinematics solution of a high degree of freedom robotic manipulator. Robotics 2022, 11, 44. [Google Scholar] [CrossRef]
  33. Li, X.; Liu, H.; Dong, M. A general framework of motion planning for redundant robot manipulator based on deep reinforcement learning. IEEE Trans. Ind. Inform. 2021, 18, 5253–5263. [Google Scholar] [CrossRef]
  34. Thumm, J.; Althoff, M. Provably safe deep reinforcement learning for robotic manipulation in human environments. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 6344–6350. [Google Scholar]
  35. Shahid, A.A.; Piga, D.; Braghin, F.; Roveda, L. Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning. Auton. Robot. 2022, 46, 483–498. [Google Scholar] [CrossRef]
  36. Kilinc, O.; Montana, G. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations. Mach. Learn. 2022, 111, 465–486. [Google Scholar] [CrossRef]
  37. Bing, Z.; Zhou, H.; Li, R.; Su, X.; Morin, F.O.; Huang, K.; Knoll, A. Solving robotic manipulation with sparse reward reinforcement learning via graph-based diversity and proximity. IEEE Trans. Ind. Electron. 2022, 70, 2759–2769. [Google Scholar] [CrossRef]
  38. Centurelli, A.; Arleo, L.; Rizzo, A.; Tolu, S.; Laschi, C.; Falotico, E. Closed-loop dynamic control of a soft manipulator using deep reinforcement learning. IEEE Robot. Autom. Lett. 2022, 7, 4741–4748. [Google Scholar] [CrossRef]
  39. Parák, R.; Matoušek, R.; Lacko, B. I4C—Robotic cell according to the Industry 4.0 concept. Automa 2021, 27, 10–12. [Google Scholar]
  40. ABB Ltd. ABB IRB 120 Product Manual; ABB Ltd.: Zurich, Switzerland, 2022. [Google Scholar]
  41. Seiko Epson Corporation. Industrial Robot: SCARA ROBOT LS-B Series MANUAL; Seiko Epson Corporation: Suwa, Japan, 2024. [Google Scholar]
  42. ABB Ltd. ABB IRB 14000 Product Manual; ABB Ltd.: Zurich, Switzerland, 2022. [Google Scholar]
  43. Universal Robots A/S. User Manual UR3e; Universal Robots A/S: Odense, Denmark, 2024. [Google Scholar]
  44. Dayan, P.; Watkins, C. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  45. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 1995, 38, 58–68. [Google Scholar] [CrossRef]
  46. Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 1992, 30, 838–855. [Google Scholar] [CrossRef]
  47. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  48. Ericson, C. Real-Time Collision Detection; CRC Press: Boca Raton, FL, USA, 2004. [Google Scholar]
  49. Van Den Bergen, G. Collision Detection in Interactive 3D Environments; CRC Press: Boca Raton, FL, USA, 2003. [Google Scholar]
  50. Sugihara, T. Solvability-unconcerned inverse kinematics by the Levenberg–Marquardt method. IEEE Trans. Robot. 2011, 27, 984–991. [Google Scholar] [CrossRef]
  51. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  52. Coumans, E.; Bai, Y. PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning. 2016. Available online: https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA (accessed on 5 January 2024).
  53. Li, S.; Wu, Y.; Cui, X.; Dong, H.; Fang, F.; Russell, S. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4213–4220. [Google Scholar]
  54. Zhang, F.; Li, J.; Li, Z. A TD3-based multi-agent deep reinforcement learning method in mixed cooperation-competition environment. Neurocomputing 2020, 411, 206–215. [Google Scholar] [CrossRef]
  55. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; Volume 5, p. 6. [Google Scholar]
Figure 1. A schematic representation of a motion planning problem in two-dimensional space, illustrating an n-link robotic manipulator R within a collision-free configuration space C free with obstacles denoted as O .
Figure 2. An illustration of the robotic structures that are part of the Industry 4.0 Cell (I4C).
Figure 3. The interaction between an agent and the environment in a Markov Decision Process (MDP), adapted from [14].
Figure 4. An overview of the actor–critic architecture. The Temporal Difference (TD) error δ t is utilized to adjust both the critic’s action value function Q ( s , a ) and the actor’s policy π ( a s , θ ) , which is parameterized by θ [14,16].
Figure 5. An illustration of both types of environments, E 1 (left) and E 2 (right), that were used in an experiment focused on reaching the target in a pre-defined configuration space C free . The yellow wireframe determines a pre-defined configuration space C free , while the green wireframe delineates the area where the target was randomly generated. The red sphere within the environment E 2 represents an obstacle approximated with Axis-Aligned Bounding Box (AABB), denoted as O A A B B .
Figure 6. The training process shows success rates within the environment type E 1 for the DDPG, SAC, and TD3 algorithms (left) and an extension with HER (right).
Figure 7. The training process shows success rates within the environment type E 2 for the DDPG, SAC, and TD3 algorithms (left) and an extension with HER (right).
Figure 8. An illustration of a wide range of robotic structures within the E 1 environment, simulated using a PyBullet physics-based simulator. These structures were used in an experiment that focused on reaching the target in a pre-defined configuration space.
Figure 9. An illustration of a wide range of robotic structures within the E 2 environment, simulated using PyBullet physics-based simulator. These structures were used in an experiment that focused on reaching the target in a pre-defined configuration space.
Table 1. The table presents experimental results comparing various DRL algorithms within the E 1 environment. The required minimum success rate to meet the specified criteria was set at 0.98.
Algorithm | Success Rate | Percentage of Successful Targets | Mean Reward per Episode | Mean Episode Length
DDPG | 0.98–1.0 | 95.86% | −0.388 | 5.299
DDPG + HER | 0.98–1.0 | 89.83% | −0.387 | 5.431
SAC | 0.98–1.0 | 96.52% | −0.408 | 5.682
SAC + HER | 0.98–1.0 | 94.85% | −0.407 | 5.701
TD3 | 0.98–1.0 | 96.61% | −0.386 | 5.298
TD3 + HER | 0.98–1.0 | 78.73% | −0.395 | 5.720
Table 2. The experimental results comparing various DRL algorithms within the E 2 environment. The required minimum success rate to meet the specified criteria was set at 0.8.
Algorithm | Success Rate | Percentage of Successful Targets | Mean Reward per Episode | Mean Episode Length
DDPG | 0.8–0.97 | 95.01% | −0.710 | 5.866
DDPG + HER | 0.8–0.96 | 86.72% | −0.711 | 5.944
SAC | 0.8–0.85 | 5.048% | −0.709 | 5.684
SAC + HER | 0.8–0.88 | 19.07% | −0.713 | 5.844
TD3 | 0.8–0.96 | 89.59% | −0.711 | 5.901
TD3 + HER | 0.8–0.94 | 81.99% | −0.718 | 6.146
Table 3. The table presents experimental results comparing a wide range of robotic structures using the TD3 algorithm within the E 1 environment. The minimum success rate required to meet the specifications was set at 0.98.
Robot Type | Success Rate | Percentage of Successful Targets | Mean Reward per Episode | Mean Episode Length
Universal Robots UR3 | 0.98–1.0 | 96.61% | −0.386 | 5.298
ABB IRB 120 | 0.98–1.0 | 84.32% | −0.602 | 6.073
ABB IRB 120 Ext. | 0.98–1.0 | 87.65% | −0.637 | 6.647
ABB IRB 14000 (L) | 0.98–1.0 | 94.51% | −0.256 | 4.593
ABB IRB 14000 (R) | 0.98–1.0 | 91.69% | −0.256 | 4.608
Epson LS3-B401S | 0.98–1.0 | 94.39% | −0.064 | 2.474
Table 4. The table presents experimental results comparing a wide range of robotic structures using the DDPG algorithm within the E 2 environment. The minimum success rate required to meet the specifications was set at 0.8.
Robot Type | Success Rate | Percentage of Successful Targets | Mean Reward per Episode | Mean Episode Length
Universal Robots UR3 | 0.8–0.97 | 95.01% | −0.710 | 5.866
ABB IRB 120 | 0.8–0.98 | 95.26% | −0.874 | 6.609
ABB IRB 120 Ext. | 0.8–0.98 | 96.83% | −1.031 | 8.041
ABB IRB 14000 (L) | 0.8–0.97 | 94.85% | −0.503 | 5.356
ABB IRB 14000 (R) | 0.8–0.96 | 92.14% | −0.499 | 5.227
Epson LS3-B401S | 0.8–0.99 | 98.19% | −0.396 | 5.040
Table 5. The table presents experimental results comparing a wide range of robotic structures using the TD3 algorithm within the E 1 environment for one hundred randomly generated targets. The position error is given in meters.
Robot Type | Success Rate | Mean Reward per Episode | Mean Episode Length | Mean Absolute Position Error
Universal Robots UR3 | 1.0 | −0.365 | 4.94 | 0.0024
ABB IRB 120 | 1.0 | −0.501 | 5.26 | 0.0055
ABB IRB 120 Ext. | 1.0 | −0.502 | 5.16 | 0.0061
ABB IRB 14000 (L) | 1.0 | −0.233 | 4.21 | 0.0024
ABB IRB 14000 (R) | 1.0 | −0.228 | 4.13 | 0.0028
Epson LS3-B401S | 1.0 | −0.059 | 2.32 | 0.0030
Table 6. The table presents experimental results comparing a wide range of robotic structures using the DDPG algorithm within the E 2 environment for one hundred randomly generated targets. The position error is given in meters.
Robot Type | Success Rate | Mean Reward per Episode | Mean Episode Length | Mean Absolute Position Error
Universal Robots UR3 | 1.0 | −0.711 | 6.04 | 0.0058
ABB IRB 120 | 1.0 | −0.859 | 6.45 | 0.0023
ABB IRB 120 Ext. | 1.0 | −0.945 | 6.98 | 0.0065
ABB IRB 14000 (L) | 1.0 | −0.471 | 5.15 | 0.0040
ABB IRB 14000 (R) | 1.0 | −0.470 | 5.12 | 0.0032
Epson LS3-B401S | 1.0 | −0.374 | 4.79 | 0.0027
