1. Introduction
Robotic arms have long served industries by automating tasks characterized by repetition and complexity that demand precision and speed. Conventional control methods for these robotic arms have presented challenges, necessitating individualized programming for each task, leaving little room for error. This has traditionally required significant effort from robotics engineers for task programming or skilled robot operators engaging in tedious online manipulation with the control pendant. However, recent advancements in robotic arm control have seen the integration of deep learning techniques, such as deep Q-network (DQN) frameworks. These advancements hold promise in revolutionizing robotic arm control by enabling learning-based approaches that adapt and optimize arm movements based on environmental feedback, ultimately enhancing efficiency and versatility in industrial applications. In addressing the challenges associated with implementing robotics in factory environments, reinforcement learning (RL) is emerging as a favored alternative to task-specific programming. The objective of RL in robotics is to empower the agent (the robotic arm) to execute fundamental operations autonomously, devoid of direct command programming, and to adeptly adapt to task variations without necessitating reprogramming. The logistical task of pick-and-place exemplifies a prime scenario wherein a robotic agent should proficiently execute tasks using RL.
The aim of the research is the development of a model for robot learning by demonstration through human–robot interaction. The model is intended to be directed towards generating robot motion plans aligned with the given task. Based on a limited number of examples, it is necessary to ensure the generalization of knowledge that will enable the resolution of similar problems. Two research hypotheses have been formulated based on the aim:
To develop a robust DRL-based control framework that allows a robot to adapt dynamically to different environments.
To validate the proposed framework’s performance through comparisons with existing control techniques in simulated scenarios.
Expected contributions include the following:
A DRL-based control framework for executing robotic tasks through a limited number of examples presented by human–robot interaction, based on the synthesis of motion shaping methods using primitives and inverse reinforcement learning.
An algorithm for determining and evaluating specific task-describing features from the assumed problem domain, based on statistical analysis and multi-criteria optimization in the time-dependent problem space, aimed at generalizing learned knowledge.
In robotics, the field of motion planning encompasses problems such as path planning, trajectory planning, and task planning. Path planning deals with the problem of finding a path from some initial position to a goal (“from point A to point B”), while avoiding obstacles. The novelty of this work lies in the combination of dynamic trajectory modeling with time-dependent state analysis, which collectively enhance the robot’s adaptability and performance. Experimental results demonstrate the efficacy of the proposed algorithm in robotic manipulation tasks and path planning within dynamic environments. Furthermore, the integration of these techniques allows the robot to efficiently navigate complex environments and execute tasks with precision.
This research makes significant contributions to the field of robot learning methodologies by presenting a comprehensive framework that improves robot learning through human–robot interaction. The proposed model holds promise for practical applications in manufacturing, healthcare, and logistics, where robots are increasingly utilized for automation and assistance.
2. Materials and Methods
The simulation setup consists of an industrial robot arm equipped with a suite of sensors and controllers to facilitate dynamic interaction with the environment. The robot’s operational space is represented using both Cartesian coordinates and configuration systems, enabling precise control and analysis of its movements. The robot is programmed to perform a variety of tasks, including object manipulation, assembly, and navigation in dynamic environments. The proposed deep reinforcement learning (DRL) framework integrates dynamic trajectory modeling and time-dependent state analysis. The implementation details how the dynamic trajectory modeling component uses polynomial functions to generate smooth and continuous paths for the robot. This method is particularly effective for connecting segments with varying velocities and accelerations. The trajectory planning algorithm calculates the optimal path based on the current state of the robot and the desired end state, ensuring smooth transitions and minimizing abrupt changes in motion. Trajectory planning extends the problem of path planning by introducing parameters such as time, velocity, acceleration, kinematics, continuity, manipulability, etc. Geometric approaches to trajectory modeling are prevalent here, serving as the basis for motion due to their continuity. Additionally, trajectory optimization with respect to various parameters is traditionally an active area of research. Learning from demonstrations and motion planning are highly compatible approaches, where demonstrations can be used to find criteria against which trajectory optimization is required. This hypothesis has been used in several works [
11]. Depending on the perspective from which the problem is viewed, these criteria can be aimed at optimizing the path to improve kinematic/dynamic performance or optimizing for task execution. The tendency of newer approaches is toward combining both goals for more effective and autonomous task execution. The adaptability of the plan is emphasized in approaches based on motion planning compared to learning from demonstrations. This is highly significant in circumstances where there is dynamic environmental change during trajectory execution, where these approaches often allow for real-time replanning. The generalization capabilities of motion planning methods are highly advanced in the domain of motion, as they must enable movement based on arbitrary initial and final positions [
12]. On the other hand, the drawback of these approaches is often the inability to implement various criteria crucial for task execution, where methods utilizing learning from demonstrations have an advantage.
Figure 1 aims to illustrate the comprehensive simulation environment for the robot manipulator using Simulink. The authors intend to simulate the dynamic behavior and control of the robot manipulator to evaluate the proposed learning algorithm’s effectiveness. This section will provide a detailed introduction and explanation of the components and the objectives of the block diagram. The primary aim of the block diagram is to simulate the robot manipulator’s movements and interactions within a controlled virtual environment. By doing so, the authors can test and refine their algorithm for dynamic trajectory modeling and time-dependent state analysis, ensuring that the robot can effectively perform various tasks such as object manipulation, assembly, and navigation. The first three elements of the multibody block diagram define the simulation environment itself as can be seen in
Figure 1. The first element is the world frame, which is defined as the global reference framework. In fact, it represents the same fixed coordinate point that was specified as the zero point of the structure in inventor. In the mechanism configuration block, the mechanical and simulation parameters are set, such as the gravitational acceleration vector of my model [
13]. The third element is the solver configuration, which determines the settings of the Simulink solver.
Table 1 shows the DH parameters of the used model. The model needs these parameters before starting the simulation. In Simulink, the so-called “solver” calculates the model’s state parameters during the simulation. Basically, there are two types: fixed-step and variable-step solver.
As we progress through the block diagram, the components are connected to each other in sequence, following the hierarchy defined in the URDF file, which was generated from the CAD assembly file. The first element is the base plate element, which is connected to the “BASE”, and the “BASE” is connected to the “SHOULDER” as its child. The “SHOULDER” is connected to the first arm, and so on. The connection of the components is done with joints or constraints. Joints determine the maximum degree of freedom of the two connected elements, which can be rotational—revolute joint—or linear—prismatic joint—motion [
14]. For example, a component block such as the “BASE” seen in
Figure 2 is a subsystem, which is itself a smaller block diagram composed of different elements.
The modeling phase is divided into three parts. In the first part, information about the examined system, i.e., the physical, geometric, and other parameters of the robot, is gathered. Based on this information, the CAD model of the robot can be created. Afterward, it needs to be imported into the Simulink (R2024a) software, and the simulation environment needs to be built around it.
Figure 3 depicts the multibody simulation environment.
2.1. Reinforcement Learning in Continuous Environments
In reinforcement learning approaches such as learning from demonstrations, all learning data must be pre-prepared and available before learning. This is highly restrictive when a robot needs to learn complex tasks because the teacher needs to generate a sufficient number of demonstrations. On the other hand, RL (reinforcement learning) is based on autonomous data collection by the agent (robot) through its own exploration and evaluations of behavior by an expert system, which can be a human or in the form of an artificial evaluation system [
15]. The best behavior for a given task is provided in the form of an optimal strategy. Its significant advantage over learning from demonstrations is that the robot performs all behaviors independently and directly through its control system, eliminating the problem of correspondence during learning. Additionally, human behavior does not always have to be optimal, so learning from demonstrations can result in sub-optimal solutions. In supported learning, this problem does not exist; instead, the quality of the final behavior is usually determined solely by the outcome (
Figure 4).
The classical scenario of supported learning is described by an agent located in a current state
s at time
t, acting in its environment with its action
a. Based on its action, the agent ends up in a new state
st + 1, and its behavior is evaluated by a reward
rt. The implemented DQN algorithm is shown in Algorithm 1.
Algorithm 1 DQN training algorithm |
//main iteration
:
end while |
The algorithm is based on a machine learning paradigm where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties based on its actions, with the goal of maximizing cumulative reward over time.
2.2. Dynamic Movement Primitives (DMP)
Since the previously presented model for learning from demonstrations is based on the analysis of coordinate systems in the operational (Cartesian) system of the robot, this section will introduce a trajectory model used to approximate Cartesian states, which will be further enhanced by a modification that couples this model with the proposed analysis of coordinate system activities [
16]. The basis of this trajectory model is the primitive motion model—DMP. Specifically, a special variation of the DMP model capable of approximating states defined in the Cartesian coordinate system is utilized.
The forcing term is replaced by establishing interdependence between states and time. This is achieved by recalculating the current goal of the DMP dynamic system at each time step according to a time-dependent probabilistic function of predefined states. This approach was originally presented in Ref. [
1] to encode stiffness information into the trajectory for robots controlled by a given torque. The stiffness matrix mentioned is not used for this approach. The system’s acceleration is calculated as the sum of
N proportional differential systems, where
N denotes the number of states,
and
represent the current position and velocity of the system, and
and
are gain parameters.
The aforementioned time dependence is achieved by using an exponentially decaying function
as a function of time, where the variable
is initialized with
at the beginning of each motion and gradually converges to zero at each time step. The rate of decay, i.e., convergence of this function to zero, is defined by the gain parameter
[
17]. This temporal function is linked to the system dynamics through a set of
normal distributions
with uniformly distributed mean values of distributions from zero to one, where each individual distribution represents one state from the state vector
. The time-dependent probability of the system being in a certain state is given in Equation (2).
The implemented DMP based trajectory planning algorithm is shown in Algorithm 2. The algorithmic task consists of converting the initial trajectory into a DMP record. Using iterative search, an attempt is made to find a trajectory that passes through the given spatial position X, which was not previously part of the initial trajectory.
Algorithm 2 DMP-based trajectory planning |
Input:
end
//“target function” end |
Using this model, sequential tracking of arbitrary states defined either in the operational (Cartesian) or configuration system of the robot can be achieved. Furthermore, it can be observed how the shape of the exponentially decaying time function determines how much time is “allocated” to a specific state within the state vector [
18]. Using the standard form of this function, it is evident that the function has a higher rate of decay at the beginning of the function (after initialization with a value of 1), indicating that states appearing in this time period are less important than states appearing at the end of the function, where the rate of time decay is much smaller as the function converges to a value of zero.
2.3. Neural Network Architecture
The input layer receives a 25-dimensional vector representing the state of the environment. This could include variables such as joint angles, joint velocities, distances to obstacles, or other environmental factors that are relevant for the task. The number of nodes in the first hidden layer is 64. Using the ReLU (rectified linear unit) activation function, the network introduces non-linearity, allowing the model to learn complex patterns in the input data. The number of nodes in the second hidden layer is 64, and the activation function is also ReLU. The number of nodes in the third hidden layer is 32 (reduced size to allow gradual compression of features), and the activation function is also ReLU. The output layer has 7 nodes. Each node in the output layer represents an action or Q-value. If this is a Q-network for a discrete action space, each output corresponds to the Q-value of an action, indicating the expected cumulative reward of each action given the current state as can be seen in
Table 2.
This architecture provides a balance between capacity (through multiple hidden layers) and efficiency, making it suitable for the RL tasks that require capturing complex patterns in state-action spaces. RL revolves around four key components: the policy, reward function, value function, and environmental model. The fundamental concept is that an intelligent agent must learn to perform a series of actions to maximize cumulative rewards through its interactions with the environment. The reward function is designed to combine various reward variables into a single output value, which serves as feedback for the agent to understand the incentivized behaviors. In this case, the robot is expected to follow the goal for as long as possible while avoiding dynamic obstacles. Consequently, the reward function is defined as a weighted sum of two components: the distance between the end-effector and the goal as well as the minimum distance between the robot manipulator and nearby obstacles. Additionally, a negative reward is applied more heavily than a positive one to encourage the robot to avoid penalties, facilitating faster learning.
3. Results
The preliminary test result shown in the image depicts a MATLAB (R2024a) written program that describes a circle of radius
r. It can be observed that as the given trajectory is traversed, the joints assume their respective positions as depicted in
Figure 5. The path points were recorded in variables in matrix form. They were then attached to the input port of the block. The path planner iterates the path points based on the given time parameters and generates an appropriate path. The result was obtained in the form of a position vector. These types of path planners not only calculate the path but also the velocity and acceleration.
Polynomial trajectories are useful in cases where zero or non-zero velocity and acceleration segments need to be connected, as the acceleration profiles are much smoother. The polynomial trajectory block was used to define polynomial trajectories as can be seen in
Figure 6.
To demonstrate the adaptability of the algorithm, the high fidelity simulation was conducted with the robot in various environmental scenarios. These include the following:
Dynamic Obstacles: With moving obstacles that force the robot to dynamically adapt its path.
Unstructured Environments: Test the robot in environments with unpredictable object placements or configurations.
Task Variations: Modifying the pick-and-place task (e.g., different object sizes, weights, or grasping positions) and evaluate how well the robot adapts.
Table 3 presents the results of the simulation the robot’s performance in terms of task completion time, success rate, and path deviation. This DQN-based robot control simulation system framework provides a foundation for building intelligent, adaptive robotic systems. Using reinforcement learning, the robot can learn to perform tasks autonomously and adapt to dynamic environments without the need for manually predefined control strategies.
The same path is specified as by the trapezoidal trajectory planner. It was noticed during application that it is not as reliable as the trapezoidal trajectory, as oscillations were sometimes observed between points. According to the tests, this is because the boundary conditions cannot always be maintained. Two trajectories were displayed. The first one represents the path wanted to be achieved, i.e., the specified trajectory, while the other path is described by the effector according to trajectory planner calculations. This way, the two trajectories can easily be compared. The difference between the two trajectories is shown in
Figure 7. The given path points are connected by a spline curve in red. This would be an ideal trajectory. In blue, the calculated trajectory is displayed.
Previously, only the change in the position of the end-effector was discussed. However, it is not enough to simply adopt the new position; it must also be oriented. Orientation interpolation is somewhat more complex than positional interpolation because the joint angles constantly change. In some cases, such as Euler angles, multiple angles can correspond to a single configuration. This can be mitigated by performing interpolation using so-called “quaternions”. This technique is called spherical linear interpolation (SLERP). It finds the shortest path between two orientations at a constant angular velocity around an axis. The commanded orientation parameters were transformed using coordinate transformation, which is capable of converting Euler angles to quaternions.
The success rate is the proportion of episodes in which the DQN agent achieves the desired goal or performs the task successfully as depicted in
Figure 8. A well-designed reward function is crucial for guiding the agent toward the desired behavior. The complexity and capacity of the neural network play a key role. Deeper networks can capture more complex patterns but require more training data and time. The relationship between the success rate and the training rate in DQN can be complex and is influenced by the interplay of the factors mentioned above. Generally, the success rate improves with increased training time, as the agent has more opportunities to learn and refine its policy. However, this relationship is not linear and can exhibit diminishing returns, plateaus, or even degradation in performance due to overfitting or catastrophic forgetting. The success rate against the training rate in DQN highlights the iterative nature of reinforcement learning, where the agent gradually improves its policy through experience. While the initial training phases may show slow progress, consistent optimization of training parameters and environment design can significantly enhance both the speed and quality of learning. Understanding this relationship is crucial for developing more efficient and robust reinforcement learning applications, particularly in complex domains such as autonomous robotic systems.
4. Discussion
The advent of Industry 4.0 has driven the need for more intelligent, autonomous, and flexible robotic systems. Traditional methods of programming industrial robots, which often involve manual coding and teach pendant programming, are becoming increasingly inadequate due to their time-consuming nature and lack of adaptability. Deep reinforcement learning (DRL) offers a promising alternative by enabling robots to learn optimal behaviors through interactions with their environment. This discussion explores the integration of DRL into the teaching strategies for industrial robot arms, highlighting its potential advantages, challenges, and future directions [
19]. Traditional teaching methods for industrial robots primarily involve two approaches: manual programming and teach pendant programming. The manual programming method requires writing explicit code for each task, specifying precise movements and operations. It is highly detailed and accurate but lacks flexibility and is not easily scalable to different tasks or environments. Teach pendant programming involves using a handheld device to guide the robot through a series of movements, which are then recorded and replayed. While more intuitive than manual programming, it remains time-consuming and is limited by the operator’s skill and experience. Both methods require significant human intervention and expertise, making them less suitable for complex, dynamic, and unstructured environments.
In a software setup where deep reinforcement learning (DRL) is connected to a robot model, several key components must be implemented to enable the robot to learn from and interact with its environment. A simulation platform (Simulink) models the robot and its environment. This environment simulates physical dynamics, sensor feedback, and constraints, allowing the DRL algorithm to train safely without the risk of damaging real hardware. The robot model in the simulation environment represents its kinematics, dynamics, sensors, and actuators. The simulated robot responds to control commands and updates its state based on simulated physics. The simulation continuously provides feedback on the robot’s state (e.g., joint angles, velocities, end-effector position) and environment (e.g., distances to obstacles, positions of target objects). These state data are sent to the DRL agent as an observation vector or a set of images if a vision-based model is used. In a reinforcement learning setup, this information constitutes the state st, which will be used as input for the agent’s neural network to decide the next action. The DRL agent receives the current state st and, based on its learned policy, selects an action at. This action is represented as control signals, which could be joint torques, velocities, or positional changes in the robot model. The selected action is then converted to control commands that are sent to the robot model in the simulation environment, updating its state accordingly. After executing an action, the environment calculates a reward rt based on task-specific criteria, such as the robot’s proximity to a goal, success in grasping an object, or avoiding obstacles. This integration connects a DRL agent to a robot model by simulating real-world physics and feedback loops. The agent continuously learns by observing states, selecting actions, and receiving rewards, improving its policy to accomplish tasks autonomously.
Deep reinforcement learning (DRL) is a subset of machine learning that combines reinforcement learning (RL) with deep neural networks. In RL, an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward [
20]. The agent receives feedback from the environment in the form of rewards or penalties and updates its policy to improve future performance. Deep Q-learning (DQN) is one of the earliest DRL algorithms, which combines Q-learning with deep neural networks to handle high-dimensional state spaces. The integration of DRL into industrial robot arm teaching involves several key steps:
Environment Simulation: A realistic simulation environment is essential for training DRL agents. This environment must accurately model the physical characteristics and constraints of the robot and its operating space.
Reward Function Design: The reward function guides the learning process by defining what constitutes successful behavior. Designing an appropriate reward function is critical and often involves balancing multiple objectives such as speed, accuracy, and safety.
State and Action Space Definition: The state space represents the observable variables that describe the environment, such as joint angles and end-effector positions. The action space defines the set of possible actions the robot can take, such as joint movements or gripper operations.
Training and Optimization: Training a DRL agent involves running numerous simulations where the agent explores the environment and updates its policy based on the received rewards. This process requires significant computational resources and time.
The use of DRL in teaching industrial robot arms offers several significant advantages. This approach enables robots to learn from their environment and adapt to new tasks and conditions autonomously, reducing the need for human intervention. It also enables them to handle complex tasks that are difficult to program manually, such as tasks requiring fine manipulation or dynamic adjustment to varying conditions [
21]. Once trained, DRL-based systems can perform tasks more efficiently and with higher precision than manually programmed robots, and the same DRL framework can be applied to different robots and tasks, making it highly scalable and versatile. Deep reinforcement learning holds significant promise for revolutionizing the teaching strategies of industrial robot arms. By enabling autonomous learning and adaptation, DRL can address many limitations of traditional programming methods, offering greater flexibility, efficiency, and scalability. However, realizing this potential requires addressing challenges related to data requirements, computational complexity, safety, and sim-to-real transfer. Through continued research and development, DRL can become a cornerstone of intelligent, autonomous robotic systems in industrial applications.
Generally, DRL algorithms offer greater flexibility and capability in handling a wide range of tasks, especially in environments with complex dynamics and high-dimensional state-action spaces as shown in
Table 4. They often require extensive training data, computational resources, and careful hyperparameter tuning to achieve stable and effective performance. DMP algorithms provide a structured and interpretable way to encode and reproduce complex movements, making them particularly suitable for tasks requiring smooth and precise control. Their main limitations lie in the need for manual tuning and potential difficulty in handling highly variable or discontinuous tasks without significant modifications [
22,
23,
24]. The choice between DRL and DMP algorithms, or a combination thereof, depends on the specific requirements of the robotic task at hand, the complexity of the environment, and the desired balance between learning flexibility and control precision.
5. Conclusions
The research presented in this paper addresses the issue of task-oriented robot learning. The first part of the study focuses on analyzing demonstrations and modeling trajectories based on the importance of individual trajectory segments. Here, a new method for analyzing demonstrations based on a small number of robot trajectory demonstrations in Cartesian space is proposed. The method involves weighting trajectory segments based on the analysis of activities within each segment relative to the demonstrated trajectories. This achieves what is known as learning the importance of each specific segment of the trajectory, which is later used in trajectory generation to approximate demonstrated trajectory segments more closely or distantly. A modified version of the DMP trajectory is used as the trajectory model. The result of this part of the research is the ability to generate trajectories for new situations that effectively describe the essential parts of the demonstrations while disregarding less important parts.
The second part of the research extends this methodology by applying the previously presented demonstration analysis method to the analysis of a real robotic task. An experimental setup incorporating a vision system for marker tracking and a collaborative robotic arm is implemented for this purpose. A new trajectory model based on an optimization process that allows state approximation and obstacle avoidance is proposed. The trajectory model is tested together with the demonstration analysis method to generate trajectories in various newly introduced scenarios, including a pushing task on a real robot.
The results of the research conducted in the first two parts of the paper confirm the first hypothesis, which states that robot learning to perform tasks in the assumed problem domain is achievable through a limited number of examples presented by human–robot interactions. During trajectory reproduction for real robotic tasks, solely based on demonstration analysis, the possibility of insufficiently accurate task execution due to various unmodeled influences in demonstration learning is observed. Therefore, the third part of the research addresses the issue of autonomous robot learning to enable self-adjustment according to specific characteristics of the working environment in which the robot operates. A learning model based on an iterative parameter search strategy is developed. The DMP trajectory model, which is “learned” from the available trajectory, is chosen as the parameter strategy for describing the trajectory. As iterative robot learning in a real environment can potentially be very dangerous and time-consuming, a simulation environment with a physics emulator in the MATLAB/Simulink environment is implemented, allowing the robot to test trajectories safely during the learning process. The DQN optimization algorithm is used to adjust DMP strategy parameters during learning, optimizing trajectory parameters with respect to the task. The used form of the cost function is designed to evaluate only the final outcome of robot behavior, thus maintaining implementation simplicity while optimizing robot behavior for the real task. The iterative learning system is tested on two different stochastic robotic tasks: insertion and pushing tasks. Two criteria for evaluating the quality of the learning process are proposed based on the mean value of currently seen solutions and the best currently seen solution. Convergence to behavior successfully performing the task is demonstrated in both cases.
The results obtained in the third and fourth parts of the research suggest that knowledge generalization of robots in terms of task execution for different configurations can be achieved through statistical processing and multi-criteria optimization of specific task-describing features. This confirms the second hypothesis of the paper. As original scientific contributions of this paper, the following are highlighted:
Trajectory-level learning from a demonstration method based on spatial coordinate system classification.
A trajectory model for motion planning in Cartesian coordinate systems based on an optimization process with the ability to approximate coordinate systems and avoid obstacles.
A methodology for task-oriented iterative trajectory learning based on initialization through demonstration learning.
Future research will focus on finding more efficient strategy search algorithms and exploring the possibility of applying other parameter strategies for motion modeling. Additionally, significant progress in the ease of applying strategy search algorithms can be achieved through research in the areas of automatic learning parameter tuning and automatic extraction of reward/goal functions from a small number of human demonstrations. This would significantly increase and simplify the applicability of such solutions in real-world applications. Another future development for the proposed approaches in this paper can be the transition from the existing designed simulated environment to the real-life robot manipulator in a physically designed industrial environment.