Next Article in Journal
Underwater Acoustic Signal LOFAR Spectrogram Denoising Based on Enhanced Simulation
Next Article in Special Issue
Computer-Integrated Surface Image Processing of Hydrogen-Saturated Steel Wear Products
Previous Article in Journal
Study on the Minimum Operation Width of Human-Powered Bicycles for Safe and Comfortable Cycling
Previous Article in Special Issue
Kinematic and Workspace Analysis of RRU-3RSS: A Novel 2T2R Parallel Manipulator
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator

Department of Mechatronics and Automation, Faculty of Engineering, University of Szeged, 6720 Szeged, Hungary
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(23), 10929; https://doi.org/10.3390/app142310929
Submission received: 13 October 2024 / Revised: 13 November 2024 / Accepted: 21 November 2024 / Published: 25 November 2024

Abstract

:
This paper introduces an innovative algorithm aimed at enhancing robot learning using dynamic trajectory modeling and time-dependent state analysis. By integrating reinforcement learning (RL) and trajectory planning, the proposed approach enhances the robot’s adaptability in diverse environments and tasks. The framework begins with a comprehensive analysis of the robot’s operational space, focusing on Cartesian coordinates and configuration systems. By modeling trajectories and states within these systems, the robot achieves sequential tracking of arbitrary states, facilitating efficient task execution in various scenarios. Experimental results demonstrate the algorithm’s efficacy in manipulation tasks and path planning in dynamic environments. By integrating dynamic trajectory modeling and time-dependent state analysis, the robot’s adaptability and performance improve significantly, enabling precise task execution in complex environments. This research contributes to advancing robot learning methodologies, particularly in human–robot interaction scenarios, promising applications in manufacturing, healthcare, and logistics.

1. Introduction

Robotic arms have long served industries by automating tasks characterized by repetition and complexity that demand precision and speed. Conventional control methods for these robotic arms have presented challenges, necessitating individualized programming for each task, leaving little room for error. This has traditionally required significant effort from robotics engineers for task programming or skilled robot operators engaging in tedious online manipulation with the control pendant. However, recent advancements in robotic arm control have seen the integration of deep learning techniques, such as deep Q-network (DQN) frameworks. These advancements hold promise in revolutionizing robotic arm control by enabling learning-based approaches that adapt and optimize arm movements based on environmental feedback, ultimately enhancing efficiency and versatility in industrial applications. In addressing the challenges associated with implementing robotics in factory environments, reinforcement learning (RL) is emerging as a favored alternative to task-specific programming. The objective of RL in robotics is to empower the agent (the robotic arm) to execute fundamental operations autonomously, devoid of direct command programming, and to adeptly adapt to task variations without necessitating reprogramming. The logistical task of pick-and-place exemplifies a prime scenario wherein a robotic agent should proficiently execute tasks using RL.
The aim of the research is the development of a model for robot learning by demonstration through human–robot interaction. The model is intended to be directed towards generating robot motion plans aligned with the given task. Based on a limited number of examples, it is necessary to ensure the generalization of knowledge that will enable the resolution of similar problems. Two research hypotheses have been formulated based on the aim:
  • To develop a robust DRL-based control framework that allows a robot to adapt dynamically to different environments.
  • To validate the proposed framework’s performance through comparisons with existing control techniques in simulated scenarios.
Expected contributions include the following:
  • A DRL-based control framework for executing robotic tasks through a limited number of examples presented by human–robot interaction, based on the synthesis of motion shaping methods using primitives and inverse reinforcement learning.
  • An algorithm for determining and evaluating specific task-describing features from the assumed problem domain, based on statistical analysis and multi-criteria optimization in the time-dependent problem space, aimed at generalizing learned knowledge.
In robotics, the field of motion planning encompasses problems such as path planning, trajectory planning, and task planning. Path planning deals with the problem of finding a path from some initial position to a goal (“from point A to point B”), while avoiding obstacles. The novelty of this work lies in the combination of dynamic trajectory modeling with time-dependent state analysis, which collectively enhance the robot’s adaptability and performance. Experimental results demonstrate the efficacy of the proposed algorithm in robotic manipulation tasks and path planning within dynamic environments. Furthermore, the integration of these techniques allows the robot to efficiently navigate complex environments and execute tasks with precision.
This research makes significant contributions to the field of robot learning methodologies by presenting a comprehensive framework that improves robot learning through human–robot interaction. The proposed model holds promise for practical applications in manufacturing, healthcare, and logistics, where robots are increasingly utilized for automation and assistance.

Related Work

Liu, R. et al. outline the recent notable advancements in deep reinforcement learning algorithms aimed at addressing challenges encountered in the domain of robotic manipulation control, including issues related to sample efficiency and generalization [1]. Mohammed, M.Q. et al. introduce a robotic grasp-to-place system designed to grasp objects in environments characterized by sparsity and clutter. A notable aspect of the system is its ability to execute both fundamental actions of object picking and placing within an explicit framework, leveraging raw RGB-D images [2]. Rubagotti, M. et al. introduces a teleoperation approach known as semiautonomous teleoperation or shared control, and its aim is to reduce the workload of the human operator during the performance of a difficult task that involves controlling a robotic system [3]. Salvato, E. et al. clarifies the differences between the three main most popular deep learning approaches by highlighting the relative pros and cons [4]. Bujok, P. et al. give a comparison of 11 swarm-intelligence-based (SI) and bio-inspired (BI) algorithms [5]. Hellwig, M. et al. discuss fundamental principles essential to consider when evaluating benchmarking problems for constrained optimization. They evaluate existing benchmark environments for testing evolutionary algorithms based on these principles [6]. Wang, T. et al. propose a new algorithm and verify its efficiency in achieving the optimal trajectory planning of the grinding robot. In Ref. [7], the authors proposed four different objective functions related to possible requirements in a factory environment, which led to constrained multi-objective optimization problems. In Refs. [8,9], a general method for computing minimum cost trajectory planning for industrial robot manipulators is presented. Meng, X. proposes an algorithm that provides a feasible scheme for the efficient planning of manipulators in equipment manufacturing workshops [10].

2. Materials and Methods

The simulation setup consists of an industrial robot arm equipped with a suite of sensors and controllers to facilitate dynamic interaction with the environment. The robot’s operational space is represented using both Cartesian coordinates and configuration systems, enabling precise control and analysis of its movements. The robot is programmed to perform a variety of tasks, including object manipulation, assembly, and navigation in dynamic environments. The proposed deep reinforcement learning (DRL) framework integrates dynamic trajectory modeling and time-dependent state analysis. The implementation details how the dynamic trajectory modeling component uses polynomial functions to generate smooth and continuous paths for the robot. This method is particularly effective for connecting segments with varying velocities and accelerations. The trajectory planning algorithm calculates the optimal path based on the current state of the robot and the desired end state, ensuring smooth transitions and minimizing abrupt changes in motion. Trajectory planning extends the problem of path planning by introducing parameters such as time, velocity, acceleration, kinematics, continuity, manipulability, etc. Geometric approaches to trajectory modeling are prevalent here, serving as the basis for motion due to their continuity. Additionally, trajectory optimization with respect to various parameters is traditionally an active area of research. Learning from demonstrations and motion planning are highly compatible approaches, where demonstrations can be used to find criteria against which trajectory optimization is required. This hypothesis has been used in several works [11]. Depending on the perspective from which the problem is viewed, these criteria can be aimed at optimizing the path to improve kinematic/dynamic performance or optimizing for task execution. The tendency of newer approaches is toward combining both goals for more effective and autonomous task execution. The adaptability of the plan is emphasized in approaches based on motion planning compared to learning from demonstrations. This is highly significant in circumstances where there is dynamic environmental change during trajectory execution, where these approaches often allow for real-time replanning. The generalization capabilities of motion planning methods are highly advanced in the domain of motion, as they must enable movement based on arbitrary initial and final positions [12]. On the other hand, the drawback of these approaches is often the inability to implement various criteria crucial for task execution, where methods utilizing learning from demonstrations have an advantage.
Figure 1 aims to illustrate the comprehensive simulation environment for the robot manipulator using Simulink. The authors intend to simulate the dynamic behavior and control of the robot manipulator to evaluate the proposed learning algorithm’s effectiveness. This section will provide a detailed introduction and explanation of the components and the objectives of the block diagram. The primary aim of the block diagram is to simulate the robot manipulator’s movements and interactions within a controlled virtual environment. By doing so, the authors can test and refine their algorithm for dynamic trajectory modeling and time-dependent state analysis, ensuring that the robot can effectively perform various tasks such as object manipulation, assembly, and navigation. The first three elements of the multibody block diagram define the simulation environment itself as can be seen in Figure 1. The first element is the world frame, which is defined as the global reference framework. In fact, it represents the same fixed coordinate point that was specified as the zero point of the structure in inventor. In the mechanism configuration block, the mechanical and simulation parameters are set, such as the gravitational acceleration vector of my model [13]. The third element is the solver configuration, which determines the settings of the Simulink solver.
Table 1 shows the DH parameters of the used model. The model needs these parameters before starting the simulation. In Simulink, the so-called “solver” calculates the model’s state parameters during the simulation. Basically, there are two types: fixed-step and variable-step solver.
As we progress through the block diagram, the components are connected to each other in sequence, following the hierarchy defined in the URDF file, which was generated from the CAD assembly file. The first element is the base plate element, which is connected to the “BASE”, and the “BASE” is connected to the “SHOULDER” as its child. The “SHOULDER” is connected to the first arm, and so on. The connection of the components is done with joints or constraints. Joints determine the maximum degree of freedom of the two connected elements, which can be rotational—revolute joint—or linear—prismatic joint—motion [14]. For example, a component block such as the “BASE” seen in Figure 2 is a subsystem, which is itself a smaller block diagram composed of different elements.
The modeling phase is divided into three parts. In the first part, information about the examined system, i.e., the physical, geometric, and other parameters of the robot, is gathered. Based on this information, the CAD model of the robot can be created. Afterward, it needs to be imported into the Simulink (R2024a) software, and the simulation environment needs to be built around it. Figure 3 depicts the multibody simulation environment.

2.1. Reinforcement Learning in Continuous Environments

In reinforcement learning approaches such as learning from demonstrations, all learning data must be pre-prepared and available before learning. This is highly restrictive when a robot needs to learn complex tasks because the teacher needs to generate a sufficient number of demonstrations. On the other hand, RL (reinforcement learning) is based on autonomous data collection by the agent (robot) through its own exploration and evaluations of behavior by an expert system, which can be a human or in the form of an artificial evaluation system [15]. The best behavior for a given task is provided in the form of an optimal strategy. Its significant advantage over learning from demonstrations is that the robot performs all behaviors independently and directly through its control system, eliminating the problem of correspondence during learning. Additionally, human behavior does not always have to be optimal, so learning from demonstrations can result in sub-optimal solutions. In supported learning, this problem does not exist; instead, the quality of the final behavior is usually determined solely by the outcome (Figure 4).
The classical scenario of supported learning is described by an agent located in a current state s at time t, acting in its environment with its action a. Based on its action, the agent ends up in a new state st + 1, and its behavior is evaluated by a reward rt. The implemented DQN algorithm is shown in Algorithm 1.
Algorithm 1 DQN training algorithm
//main iteration
Initiate   main   network   values   θ main
  Copy   main   to   target   values   θ target θ main
Initiate   experience   replay   buffer   B
Initiate   exploration   rate   ϵ 1
Initiate   changeover   0
Initiate   episode   0
  while   changeover < N   do
    episode     episode + 1
    Refresh   collector   network   with   main   network   values   θ collector θ main
    Get   initial   state   s 1
    if   changeover <   buffer   start   size   then
    ϵ 1 1 N ϵ ϵ
    end   if
    for   t = 1 , T   do
      Get   Q - values   Q θ collector
      for   j = 1 , J   do
      Sample   x u n i   f ( 0 ,   1 )
      if   x < ϵ   then
        Select   a   joint   action   greedily   a t , j = a r g m a x a j Q θ collector s t , a j
      else
        Select   a   random   joint   action   a t , j
      end   if
end   for
Execute   robot   manipulator   action   a t   and   get   the   new   state   s t + 1   and   reward   r t   Store   the   changeover   s t , a t , r t , s t + 1   in   the   buffer   B
changeover     changeover + 1
end   for
  end   while
/ / trainer   iteration
  Load   main   and   target   network   values   θ main
  Wait   for   the   experience   replay   buffer   B   to   reach   its   start   size
  while   changeover < N   do
    last   changeover     changeover
    refresh   0
    while   last   changeover =   changeover   do
    if   refresh < N u   then
        Sample   a   mini - batch   of   n   changeover   from   B
        Get   target   Q   values   from   target   network :
        Q target , i r t + γ m a x a t + 1 , i   Q θ target s t + 1 , i , a t + 1 , i
        Refresh   the   main   network   values   θ main   to   minimize
        L ( θ ) = 1 n i = 1 n   Q target , i Q θ main s t , i , a t , i 2
        Smoothly   refresh   the   target   network   values   θ target τ θ main + ( 1 τ ) θ target
        refresh     refresh + 1
      end   if
    end   while
 end while
The algorithm is based on a machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties based on its actions, with the goal of maximizing cumulative reward over time.

2.2. Dynamic Movement Primitives (DMP)

Since the previously presented model for learning from demonstrations is based on the analysis of coordinate systems in the operational (Cartesian) system of the robot, this section will introduce a trajectory model used to approximate Cartesian states, which will be further enhanced by a modification that couples this model with the proposed analysis of coordinate system activities [16]. The basis of this trajectory model is the primitive motion model—DMP. Specifically, a special variation of the DMP model capable of approximating states defined in the Cartesian coordinate system is utilized.
The forcing term is replaced by establishing interdependence between states and time. This is achieved by recalculating the current goal of the DMP dynamic system at each time step according to a time-dependent probabilistic function of predefined states. This approach was originally presented in Ref. [1] to encode stiffness information into the trajectory for robots controlled by a given torque. The stiffness matrix mentioned is not used for this approach. The system’s acceleration is calculated as the sum of N proportional differential systems, where N denotes the number of states, y and y ˙ represent the current position and velocity of the system, and a y and b y are gain parameters.
τ 2 y ˆ ˆ = i = 1 K   h i ( t ) a y b y μ i y τ y ˙ ,
The aforementioned time dependence is achieved by using an exponentially decaying function t = α x t as a function of time, where the variable t is initialized with t = 1 at the beginning of each motion and gradually converges to zero at each time step. The rate of decay, i.e., convergence of this function to zero, is defined by the gain parameter α x [17]. This temporal function is linked to the system dynamics through a set of N normal distributions N t ; μ i t , Σ i with uniformly distributed mean values of distributions from zero to one, where each individual distribution represents one state from the state vector μ . The time-dependent probability of the system being in a certain state is given in Equation (2).
h i ( t ) = N t ; μ i t , Σ i k = 1 K     N t ; μ k t , Σ k ,
The implemented DMP based trajectory planning algorithm is shown in Algorithm 2. The algorithmic task consists of converting the initial trajectory into a DMP record. Using iterative search, an attempt is made to find a trajectory that passes through the given spatial position X, which was not previously part of the initial trajectory.
Algorithm 2 DMP-based trajectory planning
Input: y demo   / / demonstrated   path
Initiate   n d m p = 2 , n b f s = 10 , d t = 0.02 , τ = 1 , α z , β z , α x , c , σ
for   each   time   interval   t = 1 : τ / d t
x = x ( t 1 ) α x x t d t
ψ i ( x ) = e x p 1 2 σ i 2 x c i 2
end
y 0 = y 0 demo   g = g demo
for   each   n _ d m p / / learning   weight
f target   = y ¨ demo   α z β z g y demo   y ˙ demo
W i = S T Γ 1 f target   S Γ 1 S , S = x t 0 g y 0 x t 1 g y 0 x t N g y 0 , r 1 = Ψ 1 t 0 0 0 0 ψ 1 t 1 0 0 0 ψ 1 t N / weight _ init  
end
for   iteration   w i
for   time   interval   x i x
for   each   i = 1 : n d m p
f ( x ) = i = 1 N   ψ i ( x ) w i i = 1 N   ψ i ( x ) x g y 0
y ¨ 2 = α z β z g y i z + z
y ˙ 2 = y ˙ + y ¨ d t
y i = y + y ˙ d t
end
end
D m i n = m i n D x y //“target function”
end
Using this model, sequential tracking of arbitrary states defined either in the operational (Cartesian) or configuration system of the robot can be achieved. Furthermore, it can be observed how the shape of the exponentially decaying time function determines how much time is “allocated” to a specific state within the state vector [18]. Using the standard form of this function, it is evident that the function has a higher rate of decay at the beginning of the function (after initialization with a value of 1), indicating that states appearing in this time period are less important than states appearing at the end of the function, where the rate of time decay is much smaller as the function converges to a value of zero.

2.3. Neural Network Architecture

The input layer receives a 25-dimensional vector representing the state of the environment. This could include variables such as joint angles, joint velocities, distances to obstacles, or other environmental factors that are relevant for the task. The number of nodes in the first hidden layer is 64. Using the ReLU (rectified linear unit) activation function, the network introduces non-linearity, allowing the model to learn complex patterns in the input data. The number of nodes in the second hidden layer is 64, and the activation function is also ReLU. The number of nodes in the third hidden layer is 32 (reduced size to allow gradual compression of features), and the activation function is also ReLU. The output layer has 7 nodes. Each node in the output layer represents an action or Q-value. If this is a Q-network for a discrete action space, each output corresponds to the Q-value of an action, indicating the expected cumulative reward of each action given the current state as can be seen in Table 2.
This architecture provides a balance between capacity (through multiple hidden layers) and efficiency, making it suitable for the RL tasks that require capturing complex patterns in state-action spaces. RL revolves around four key components: the policy, reward function, value function, and environmental model. The fundamental concept is that an intelligent agent must learn to perform a series of actions to maximize cumulative rewards through its interactions with the environment. The reward function is designed to combine various reward variables into a single output value, which serves as feedback for the agent to understand the incentivized behaviors. In this case, the robot is expected to follow the goal for as long as possible while avoiding dynamic obstacles. Consequently, the reward function is defined as a weighted sum of two components: the distance between the end-effector and the goal as well as the minimum distance between the robot manipulator and nearby obstacles. Additionally, a negative reward is applied more heavily than a positive one to encourage the robot to avoid penalties, facilitating faster learning.

3. Results

The preliminary test result shown in the image depicts a MATLAB (R2024a) written program that describes a circle of radius r. It can be observed that as the given trajectory is traversed, the joints assume their respective positions as depicted in Figure 5. The path points were recorded in variables in matrix form. They were then attached to the input port of the block. The path planner iterates the path points based on the given time parameters and generates an appropriate path. The result was obtained in the form of a position vector. These types of path planners not only calculate the path but also the velocity and acceleration.
Polynomial trajectories are useful in cases where zero or non-zero velocity and acceleration segments need to be connected, as the acceleration profiles are much smoother. The polynomial trajectory block was used to define polynomial trajectories as can be seen in Figure 6.
To demonstrate the adaptability of the algorithm, the high fidelity simulation was conducted with the robot in various environmental scenarios. These include the following:
  • Dynamic Obstacles: With moving obstacles that force the robot to dynamically adapt its path.
  • Unstructured Environments: Test the robot in environments with unpredictable object placements or configurations.
  • Task Variations: Modifying the pick-and-place task (e.g., different object sizes, weights, or grasping positions) and evaluate how well the robot adapts.
Table 3 presents the results of the simulation the robot’s performance in terms of task completion time, success rate, and path deviation. This DQN-based robot control simulation system framework provides a foundation for building intelligent, adaptive robotic systems. Using reinforcement learning, the robot can learn to perform tasks autonomously and adapt to dynamic environments without the need for manually predefined control strategies.
The same path is specified as by the trapezoidal trajectory planner. It was noticed during application that it is not as reliable as the trapezoidal trajectory, as oscillations were sometimes observed between points. According to the tests, this is because the boundary conditions cannot always be maintained. Two trajectories were displayed. The first one represents the path wanted to be achieved, i.e., the specified trajectory, while the other path is described by the effector according to trajectory planner calculations. This way, the two trajectories can easily be compared. The difference between the two trajectories is shown in Figure 7. The given path points are connected by a spline curve in red. This would be an ideal trajectory. In blue, the calculated trajectory is displayed.
Previously, only the change in the position of the end-effector was discussed. However, it is not enough to simply adopt the new position; it must also be oriented. Orientation interpolation is somewhat more complex than positional interpolation because the joint angles constantly change. In some cases, such as Euler angles, multiple angles can correspond to a single configuration. This can be mitigated by performing interpolation using so-called “quaternions”. This technique is called spherical linear interpolation (SLERP). It finds the shortest path between two orientations at a constant angular velocity around an axis. The commanded orientation parameters were transformed using coordinate transformation, which is capable of converting Euler angles to quaternions.
The success rate is the proportion of episodes in which the DQN agent achieves the desired goal or performs the task successfully as depicted in Figure 8. A well-designed reward function is crucial for guiding the agent toward the desired behavior. The complexity and capacity of the neural network play a key role. Deeper networks can capture more complex patterns but require more training data and time. The relationship between the success rate and the training rate in DQN can be complex and is influenced by the interplay of the factors mentioned above. Generally, the success rate improves with increased training time, as the agent has more opportunities to learn and refine its policy. However, this relationship is not linear and can exhibit diminishing returns, plateaus, or even degradation in performance due to overfitting or catastrophic forgetting. The success rate against the training rate in DQN highlights the iterative nature of reinforcement learning, where the agent gradually improves its policy through experience. While the initial training phases may show slow progress, consistent optimization of training parameters and environment design can significantly enhance both the speed and quality of learning. Understanding this relationship is crucial for developing more efficient and robust reinforcement learning applications, particularly in complex domains such as autonomous robotic systems.

4. Discussion

The advent of Industry 4.0 has driven the need for more intelligent, autonomous, and flexible robotic systems. Traditional methods of programming industrial robots, which often involve manual coding and teach pendant programming, are becoming increasingly inadequate due to their time-consuming nature and lack of adaptability. Deep reinforcement learning (DRL) offers a promising alternative by enabling robots to learn optimal behaviors through interactions with their environment. This discussion explores the integration of DRL into the teaching strategies for industrial robot arms, highlighting its potential advantages, challenges, and future directions [19]. Traditional teaching methods for industrial robots primarily involve two approaches: manual programming and teach pendant programming. The manual programming method requires writing explicit code for each task, specifying precise movements and operations. It is highly detailed and accurate but lacks flexibility and is not easily scalable to different tasks or environments. Teach pendant programming involves using a handheld device to guide the robot through a series of movements, which are then recorded and replayed. While more intuitive than manual programming, it remains time-consuming and is limited by the operator’s skill and experience. Both methods require significant human intervention and expertise, making them less suitable for complex, dynamic, and unstructured environments.
In a software setup where deep reinforcement learning (DRL) is connected to a robot model, several key components must be implemented to enable the robot to learn from and interact with its environment. A simulation platform (Simulink) models the robot and its environment. This environment simulates physical dynamics, sensor feedback, and constraints, allowing the DRL algorithm to train safely without the risk of damaging real hardware. The robot model in the simulation environment represents its kinematics, dynamics, sensors, and actuators. The simulated robot responds to control commands and updates its state based on simulated physics. The simulation continuously provides feedback on the robot’s state (e.g., joint angles, velocities, end-effector position) and environment (e.g., distances to obstacles, positions of target objects). These state data are sent to the DRL agent as an observation vector or a set of images if a vision-based model is used. In a reinforcement learning setup, this information constitutes the state st, which will be used as input for the agent’s neural network to decide the next action. The DRL agent receives the current state st and, based on its learned policy, selects an action at. This action is represented as control signals, which could be joint torques, velocities, or positional changes in the robot model. The selected action is then converted to control commands that are sent to the robot model in the simulation environment, updating its state accordingly. After executing an action, the environment calculates a reward rt based on task-specific criteria, such as the robot’s proximity to a goal, success in grasping an object, or avoiding obstacles. This integration connects a DRL agent to a robot model by simulating real-world physics and feedback loops. The agent continuously learns by observing states, selecting actions, and receiving rewards, improving its policy to accomplish tasks autonomously.
Deep reinforcement learning (DRL) is a subset of machine learning that combines reinforcement learning (RL) with deep neural networks. In RL, an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward [20]. The agent receives feedback from the environment in the form of rewards or penalties and updates its policy to improve future performance. Deep Q-learning (DQN) is one of the earliest DRL algorithms, which combines Q-learning with deep neural networks to handle high-dimensional state spaces. The integration of DRL into industrial robot arm teaching involves several key steps:
  • Environment Simulation: A realistic simulation environment is essential for training DRL agents. This environment must accurately model the physical characteristics and constraints of the robot and its operating space.
  • Reward Function Design: The reward function guides the learning process by defining what constitutes successful behavior. Designing an appropriate reward function is critical and often involves balancing multiple objectives such as speed, accuracy, and safety.
  • State and Action Space Definition: The state space represents the observable variables that describe the environment, such as joint angles and end-effector positions. The action space defines the set of possible actions the robot can take, such as joint movements or gripper operations.
  • Training and Optimization: Training a DRL agent involves running numerous simulations where the agent explores the environment and updates its policy based on the received rewards. This process requires significant computational resources and time.
The use of DRL in teaching industrial robot arms offers several significant advantages. This approach enables robots to learn from their environment and adapt to new tasks and conditions autonomously, reducing the need for human intervention. It also enables them to handle complex tasks that are difficult to program manually, such as tasks requiring fine manipulation or dynamic adjustment to varying conditions [21]. Once trained, DRL-based systems can perform tasks more efficiently and with higher precision than manually programmed robots, and the same DRL framework can be applied to different robots and tasks, making it highly scalable and versatile. Deep reinforcement learning holds significant promise for revolutionizing the teaching strategies of industrial robot arms. By enabling autonomous learning and adaptation, DRL can address many limitations of traditional programming methods, offering greater flexibility, efficiency, and scalability. However, realizing this potential requires addressing challenges related to data requirements, computational complexity, safety, and sim-to-real transfer. Through continued research and development, DRL can become a cornerstone of intelligent, autonomous robotic systems in industrial applications.
Generally, DRL algorithms offer greater flexibility and capability in handling a wide range of tasks, especially in environments with complex dynamics and high-dimensional state-action spaces as shown in Table 4. They often require extensive training data, computational resources, and careful hyperparameter tuning to achieve stable and effective performance. DMP algorithms provide a structured and interpretable way to encode and reproduce complex movements, making them particularly suitable for tasks requiring smooth and precise control. Their main limitations lie in the need for manual tuning and potential difficulty in handling highly variable or discontinuous tasks without significant modifications [22,23,24]. The choice between DRL and DMP algorithms, or a combination thereof, depends on the specific requirements of the robotic task at hand, the complexity of the environment, and the desired balance between learning flexibility and control precision.

Performance Evaluation

To provide a comprehensive evaluation, it is crucial to compare the proposed DRL-based robot control simulation with existing techniques. This comparison should demonstrate the strengths of the new approach in terms of adaptability, efficiency, and task performance metrics. Benchmark techniques for comparison are noted as follows:
  • Traditional Control Methods
  • Classical Motion Planning Algorithms
  • Other Reinforcement Learning Algorithms
  • Learning by Demonstration
To objectively compare techniques, the following key metrics were established:
  • Task Completion Rate
  • Adaptability to Dynamic Environments
  • Path Accuracy (Deviation)
  • Task Completion Time
  • Energy Efficiency
  • Learning Efficiency
Each algorithm is initialized and trained within the simulation environment. The robot performs a series of tasks (pick-and-place and path following) under each control method. These tasks are repeated multiple times (3000 trials) to collect statistically significant data. For each trial, data on task completion rate, path deviation, completion time, and energy consumption are collected. The results are presented in a Table 5 to illustrate each technique’s performance.
The proposed DRL-based approach outperforms PID and A* in task completion rate, particularly in dynamic environments, due to its ability to adapt actions in real time. DRL-based control methods, particularly with DMP integration, achieve lower path deviation and shorter completion times than traditional control methods (PID) and classical motion planning (A*), which struggle to make real-time adjustments. The DRL-based approach achieves higher energy efficiency, making it suitable for tasks requiring repetitive motion with minimal power use. The comparison confirms that the DRL-based approach with DMP integration offers significant improvements over existing techniques, especially in dynamic and unstructured environments (Figure 9).
A performance comparison diagram between DQN- and DMP-based algorithms illustrates their effectiveness in a robotic or control-based task over a set number of training episodes or trials. DMP may reach a steady performance level earlier due to its predefined movement primitives, making it suitable for tasks with known, repeatable trajectories. DQN, on the other hand, may take longer but has greater flexibility to adapt to complex or changing environments, given enough episodes. DQN may potentially outperform DMP in highly dynamic tasks due to its capacity for ongoing learning and adaptability. However, DMP could be more efficient for tasks with predictable patterns.

5. Conclusions

The research presented in this paper addresses the issue of task-oriented robot learning. The first part of the study focuses on analyzing demonstrations and modeling trajectories based on the importance of individual trajectory segments. Here, a new method for analyzing demonstrations based on a small number of robot trajectory demonstrations in Cartesian space is proposed. The method involves weighting trajectory segments based on the analysis of activities within each segment relative to the demonstrated trajectories. This achieves what is known as learning the importance of each specific segment of the trajectory, which is later used in trajectory generation to approximate demonstrated trajectory segments more closely or distantly. A modified version of the DMP trajectory is used as the trajectory model. The result of this part of the research is the ability to generate trajectories for new situations that effectively describe the essential parts of the demonstrations while disregarding less important parts.
The second part of the research extends this methodology by applying the previously presented demonstration analysis method to the analysis of a real robotic task. An experimental setup incorporating a vision system for marker tracking and a collaborative robotic arm is implemented for this purpose. A new trajectory model based on an optimization process that allows state approximation and obstacle avoidance is proposed. The trajectory model is tested together with the demonstration analysis method to generate trajectories in various newly introduced scenarios, including a pushing task on a real robot.
The results of the research conducted in the first two parts of the paper confirm the first hypothesis, which states that robot learning to perform tasks in the assumed problem domain is achievable through a limited number of examples presented by human–robot interactions. During trajectory reproduction for real robotic tasks, solely based on demonstration analysis, the possibility of insufficiently accurate task execution due to various unmodeled influences in demonstration learning is observed. Therefore, the third part of the research addresses the issue of autonomous robot learning to enable self-adjustment according to specific characteristics of the working environment in which the robot operates. A learning model based on an iterative parameter search strategy is developed. The DMP trajectory model, which is “learned” from the available trajectory, is chosen as the parameter strategy for describing the trajectory. As iterative robot learning in a real environment can potentially be very dangerous and time-consuming, a simulation environment with a physics emulator in the MATLAB/Simulink environment is implemented, allowing the robot to test trajectories safely during the learning process. The DQN optimization algorithm is used to adjust DMP strategy parameters during learning, optimizing trajectory parameters with respect to the task. The used form of the cost function is designed to evaluate only the final outcome of robot behavior, thus maintaining implementation simplicity while optimizing robot behavior for the real task. The iterative learning system is tested on two different stochastic robotic tasks: insertion and pushing tasks. Two criteria for evaluating the quality of the learning process are proposed based on the mean value of currently seen solutions and the best currently seen solution. Convergence to behavior successfully performing the task is demonstrated in both cases.
The results obtained in the third and fourth parts of the research suggest that knowledge generalization of robots in terms of task execution for different configurations can be achieved through statistical processing and multi-criteria optimization of specific task-describing features. This confirms the second hypothesis of the paper. As original scientific contributions of this paper, the following are highlighted:
  • Trajectory-level learning from a demonstration method based on spatial coordinate system classification.
  • A trajectory model for motion planning in Cartesian coordinate systems based on an optimization process with the ability to approximate coordinate systems and avoid obstacles.
  • A methodology for task-oriented iterative trajectory learning based on initialization through demonstration learning.
Future research will focus on finding more efficient strategy search algorithms and exploring the possibility of applying other parameter strategies for motion modeling. Additionally, significant progress in the ease of applying strategy search algorithms can be achieved through research in the areas of automatic learning parameter tuning and automatic extraction of reward/goal functions from a small number of human demonstrations. This would significantly increase and simplify the applicability of such solutions in real-world applications. Another future development for the proposed approaches in this paper can be the transition from the existing designed simulated environment to the real-life robot manipulator in a physically designed industrial environment.

Author Contributions

Conceptualization, J.S. (János Simon) and L.G.; methodology, J.S. (József Sárosi) and J.S. (János Simon); software, J.S. (János Simon); validation, L.G.; resources, J.S. (József Sárosi); writing—original draft preparation, J.S. (János Simon); writing—review and editing, J.S. (János Simon) and L.G.; supervision, J.S. (József Sárosi); project administration, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DHDenavit–Hartenberg
DMPDynamic Movement Primitives
DOFDegrees of Freedom
DQNDeep Q-Network
DRLDeep Reinforcement Learning
LbDLearning by Demonstration
NNNeural Network
ReLURectified Linear Unit
RLReinforcement Learning

References

  1. Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Dresp-Langley, B. Deep Reinforcement Learning for the Control of Robotic Manipulation: A Focussed Mini-Review. Robotics 2021, 10, 22. [Google Scholar] [CrossRef]
  2. Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Pick and Place Objects in a Cluttered Scene Using Deep Reinforcement Learning. Int. J. Mech. Mechatron. Eng. 2020, 20, 50–57. [Google Scholar]
  3. Rubagotti, M.; Sangiovanni, B.; Nurbayeva, A.; Incremona, G.P.; Ferrara, A.; Shintemirov, A. Shared Control of Robot Manipulators with Obstacle Avoidance: A Deep Reinforcement Learning Approach. IEEE Control Syst. Mag. 2023, 43, 44–63. [Google Scholar] [CrossRef]
  4. Salvato, E.; Fenu, G.; Medvet, E.; Pellegrino, F.A. Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access 2021, 9, 153171–153187. [Google Scholar] [CrossRef]
  5. Bujok, P.; Tvrdík, J.; Poláková, R. Comparison of nature-inspired population-based algorithms on continuous optimization problems. Swarm Evol. Comput. 2019, 50, 100490. [Google Scholar] [CrossRef]
  6. Hellwig, M.; Beyer, H.G. Benchmarking evolutionary algorithms for single objective real-valued constrained optimization–a critical review. Swarm Evol. Comput. 2019, 44, 927–944. [Google Scholar] [CrossRef]
  7. Wang, T.; Xin, Z.; Miao, H.; Zhang, H.; Chen, Z.; Du, Y. Optimal Trajectory Planning of Grinding Robot Based on Improved Whale Optimization Algorithm. Math. Probl. Eng. 2020, 2020, 3424313. [Google Scholar] [CrossRef]
  8. Szczepanski, R.; Erwinski, K.; Tejer, M.; Bereit, A.; Tarczewski, T. Optimal scheduling for palletizing task using robotic arm and artificial bee colony algorithm. Eng. Appl. Artif. Intell. 2022, 113, 104976. [Google Scholar] [CrossRef]
  9. Saravanan, R.; Ramabalan, S. Evolutionary Minimum Cost Trajectory Planning for Industrial Robots. J. Intell. Robot. Syst. 2008, 52, 45–77. [Google Scholar] [CrossRef]
  10. Meng, X.; Zhu, X. Autonomous Obstacle Avoidance Path Planning for Grasping Manipulator Based on Elite Smoothing Ant Colony Algorithm. Symmetry 2022, 14, 1843. [Google Scholar] [CrossRef]
  11. Ying, K.C.; Pourhejazy, P.; Cheng, C.Y.; Cai, Z.Y. Deep learning-based optimization for motion planning of dual-arm assembly robots. Comput. Ind. Eng. 2021, 160, 107603. [Google Scholar] [CrossRef]
  12. Lobbezoo, A.; Qian, Y.; Kwon, H.J. Reinforcement Learning for Pick and Place Operations in Robotics: A Survey. Robotics 2021, 10, 105. [Google Scholar] [CrossRef]
  13. Jin, L.; Li, S.; Yu, J.; He, J. Robot manipulator control using neural networks: A survey. Neurocomputing 2018, 285, 23–34. [Google Scholar] [CrossRef]
  14. Zhang, S.; Xia, Q.; Chen, M.; Cheng, S. Multi-Objective Optimal Trajectory Planning for Robotic Arms Using Deep Reinforcement Learning. Sensors 2023, 23, 5974. [Google Scholar] [CrossRef] [PubMed]
  15. Chen, X.; Zhang, Q.; Sun, Y. Evolutionary Robot Calibration and Nonlinear Compensation Methodology Based on GA-DNN and an Extra Compliance Error Model. Math. Probl. Eng. 2020, 2020, 3981081. [Google Scholar] [CrossRef]
  16. Silva, F.; Duarte, M.; Correia, L.; Oliveira, S.M.; Christensen, A.L. Open issues in evolutionary robotics. Evol. Comput. 2016, 24, 205–236. [Google Scholar] [CrossRef]
  17. Han, S.; Shan, X.; Fu, J.; Xu, W.; Mi, H. Industrial robot trajectory planning based on improved pso algorithm. J. Phys. Conf. Ser. 2021, 1820, 012185. [Google Scholar] [CrossRef]
  18. Ekrem, Ö.; Aksoy, B. Trajectory planning for a 6-axis robotic arm with particle swarm optimization algorithm. Eng. Appl. Artif. Intell. 2023, 122, 106099. [Google Scholar] [CrossRef]
  19. Gao, R.; Zhou, Q.; Cao, S.; Jiang, Q. Apple-Picking Robot Picking Path Planning Algorithm Based on Improved PSO. Electronics 2023, 12, 1832. [Google Scholar] [CrossRef]
  20. Stefanoni, M.; Pesti, R.; Odry, Á.; Sarcevic, P. Measurement System for the Simulation of Indoor Magnetic Disturbances using a Robotic Arm. In Proceedings of the 2023 IEEE 21st World Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 19–21 January 2023; pp. 000273–000278. [Google Scholar] [CrossRef]
  21. Zhang, J.; Zhang, J.; Zhang, Q.; Wei, X. Obstacle Avoidance Path Planning of Space Robot Based on Improved Particle Swarm Optimization. Symmetry 2022, 14, 938. [Google Scholar] [CrossRef]
  22. Mystkowski, A.; Wolniakowski, A.; Kadri, N.; Sewiolo, M.; Scalera, L. Neural Network Learning Algorithms for High-Precision Position Control and Drift Attenuation in Robotic Manipulators. Appl. Sci. 2023, 13, 10854. [Google Scholar] [CrossRef]
  23. Liu, D.; Wang, Z.; Lu, B.; Cong, M.; Yu, H.; Zou, Q. A Reinforcement Learning-Based Framework for Robot Manipulation Skill Acquisition. IEEE Access 2020, 8, 108429–108437. [Google Scholar] [CrossRef]
  24. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation. Sensors 2023, 23, 3762. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Robot manipulator Simulink block diagram.
Figure 1. Robot manipulator Simulink block diagram.
Applsci 14 10929 g001
Figure 2. Robot manipulator Simulink subsystem.
Figure 2. Robot manipulator Simulink subsystem.
Applsci 14 10929 g002
Figure 3. Multibody simulation environment with preliminary test results.
Figure 3. Multibody simulation environment with preliminary test results.
Applsci 14 10929 g003
Figure 4. A representation of the reinforcement learning model.
Figure 4. A representation of the reinforcement learning model.
Applsci 14 10929 g004
Figure 5. First simulation result.
Figure 5. First simulation result.
Applsci 14 10929 g005
Figure 6. The path, velocity, and acceleration diagrams of the polynomial trajectory.
Figure 6. The path, velocity, and acceleration diagrams of the polynomial trajectory.
Applsci 14 10929 g006
Figure 7. (a) DQN-based path planning. (b) DMP-based path planning.
Figure 7. (a) DQN-based path planning. (b) DMP-based path planning.
Applsci 14 10929 g007
Figure 8. Success rate against training rate.
Figure 8. Success rate against training rate.
Applsci 14 10929 g008
Figure 9. Performance comparison of DQN- and DMP-based algorithms.
Figure 9. Performance comparison of DQN- and DMP-based algorithms.
Applsci 14 10929 g009
Table 1. Denavit–Hartenberg (DH) parameters.
Table 1. Denavit–Hartenberg (DH) parameters.
SegmentLi [mm]θi [°]αi [°]ai [mm]di [mm]
1L1 = 72θ1900L1
2L2 = 129θ20L20
3L3 = 129θ30L30
4L4 = 61θ4 + 90°9000
5L5 = 61θ5 − 90°−900L6
6L6 = 73θ6 + 90°900L7
7L7 = 133θ700L8
Table 2. NN Layer Structure.
Table 2. NN Layer Structure.
LayerNumber of NodesActivation Function
Input Layer25
Hidden Layer 164ReLU
Hidden Layer 264ReLU
Hidden Layer 332ReLU
Output Layer7Linear
Table 3. A comparison between static and dynamic environments.
Table 3. A comparison between static and dynamic environments.
EnvironmentSuccess Rate (%)Path Deviation (cm)Task Completion Time (s)
Static (no obstacles)950.58.5
Dynamic (with obstacles)880.810.2
Unstructured851.011.0
Table 4. Strengths and limitations of different deep reinforcement learning (DRL) and dynamic movement primitives (DMP) algorithms.
Table 4. Strengths and limitations of different deep reinforcement learning (DRL) and dynamic movement primitives (DMP) algorithms.
StrengthsLimitations
Deep Q-Networks (DQN)Simplicity,
Experience Replay,
Target Network
Sample Inefficiency,
Action Space Limitation,
Stability Issues
Double DQNReduced Overestimation Bias,
Improved Stability
Increased Complexity,
Still Sample Inefficient
Basic DMPTrajectory Representation,
Generalization,
Smooth Trajectories
Manual Tuning,
Limited Expressiveness
Table 5. Analysis of results compared with existing techniques.
Table 5. Analysis of results compared with existing techniques.
MetricProposed DRL-BasedPID ControlA*
Algorithm
DDPGPPOLbD Only
Task Completion Rate96%69%80%91%92%85%
Adaptability ScoreHighLowMediumMediumHighLow
Path Accuracy (cm)0.42.01.50.70.61.0
Task Completion Time (s)7.511.29.89.18.89.5
Energy Efficiency (J)N/AN/AN/AN/AN/AN/A
Learning Efficiency3000 episodesN/AN/A1000700300
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Simon, J.; Gogolák, L.; Sárosi, J. Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator. Appl. Sci. 2024, 14, 10929. https://doi.org/10.3390/app142310929

AMA Style

Simon J, Gogolák L, Sárosi J. Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator. Applied Sciences. 2024; 14(23):10929. https://doi.org/10.3390/app142310929

Chicago/Turabian Style

Simon, János, László Gogolák, and József Sárosi. 2024. "Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator" Applied Sciences 14, no. 23: 10929. https://doi.org/10.3390/app142310929

APA Style

Simon, J., Gogolák, L., & Sárosi, J. (2024). Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator. Applied Sciences, 14(23), 10929. https://doi.org/10.3390/app142310929

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop