1. Introduction
Collaborative robots are being widely applied, and their tasks and environments are becoming more complex. Robots thus require motion skills, flexibility, and adaptability similar to those of human arms [1,2].
There is a large volume of research on how robots acquire motion skills. The initial motion skills of a robot are acquired through programming with a teaching pendant [3]. For complex motion trajectories, this method is usually difficult to implement, and it places high technical demands on the worker. Related research has produced traditional robot motion planning algorithms, mainly random sampling, artificial potential field, and graph search algorithms [4,5,6]. These algorithms usually require precise modeling of the working environment, and a complex working environment greatly reduces their efficiency. With advances in artificial intelligence technology, learning from demonstration (LfD) has gradually become the main way for robots to acquire motion skills [7,8]. Ude et al. recorded human teaching trajectories with optical tracking equipment, fitted the teaching trajectory with a B-spline wavelet, and changed the shape of the trajectory by changing the coefficients of the B-spline curve to meet the needs of different tasks [9]. Ye et al. interpolated manipulator trajectories with quintic B-spline curves and achieved multi-objective trajectory optimization that simultaneously optimizes traveling time, energy consumption, and mean jerk [10]. Ijspeert et al. proposed robot motion planning based on the dynamic movement primitive (DMP) algorithm [11]. The DMP algorithm is a demonstration learning algorithm that comprises a second-order system and a nonlinear term. The second-order system ensures that the robot moves to the target point, and by changing the weights of the nonlinear term, the trajectory gains good generalizability. A DMP algorithm based on Gaussian mixture regression has been proposed for a robot to learn from multiple teaching trajectories [12]. A task-parameterized DMP motion planning algorithm has also been proposed, in which the DMP model is adjusted according to the robot's task parameters to enable human–machine collaborative handling and assembly [13]. Cohen et al. proposed a methodology for learning the manifold of task and DMP parameters, which facilitates runtime adaptation to changes in task requirements while ensuring predictable and robust performance [14]. Paraschos et al. proposed probabilistic movement primitives (ProMPs) to model the trajectory distribution learned from stochastic movements [15], and Carvalho et al. proposed combining ProMPs with the residual reinforcement learning (RRL) framework to account for corrections in both position and orientation during task execution [16]. Wang et al. proposed an improved ProMP algorithm and applied it to the gait of a lower-extremity exoskeleton, adopting black-box optimization of the ProMP so that the exoskeleton adapts to different wearers, improving the adaptive ability of the system [17]. Zhang et al. proposed a new trajectory learning scheme for a limb exoskeleton robot based on dynamic movement primitives (DMPs) combined with reinforcement learning (RL) [18]. A trajectory learning and modification method based on improved DMPs, called FDC-DMP, introduces a force-controlled dynamic coupling term (FDCT) that uses virtual force as the coupling force, enabling precise and flexible shape modifications within the target trajectory range [19]. Another line of work generates compliant trajectories for control using movement primitives to allow physical human–robot interaction with parallel robots (PRs) [20]. Khansari-Zadeh et al. proposed the stable estimator of dynamical systems algorithm as a demonstration learning algorithm: according to the demonstration trajectory, the probability relationship between the robot's position and velocity is established, and Lyapunov stability theory is adopted to ensure that the robot moves to the target; the algorithm adapts well to disturbances in time and space [21]. Zhang et al. proposed a stable estimator of dynamical systems based on neural network optimization, which improves the accuracy of the motion trajectory and handles high-dimensional data [22].
There is also a large volume of research on how to improve robot compliance, most of it based on admittance control [23]. As research has deepened, it has been found that the stiffness and damping parameters of the human arm change during a working process [24,25]. Therefore, how to change the parameters of the admittance model according to the work task has become a research hotspot. Yang et al. proposed using a neural network to optimize the parameters of the impedance model, which improves the adaptability of the robot to uncertain environments and ensures the stability of the control system [26]. Zeng et al. proposed an extended teaching-by-demonstration (TbD) system that can also learn stiffness regulation strategies from humans [27], and Franklin et al. proposed indirectly estimating the endpoint stiffness of the human arm using electromyography (EMG) during reaching movements [28]. By collecting surface electromyography signals from the human body during a demonstration task, the change in the stiffness of the human arm during the task can be obtained. Applying these stiffness parameters to the admittance control model gives the robot the same flexibility as the human arm. Yu et al. adopted demonstration learning to collect the movement trajectory and arm stiffness information of the human arm while operating a drinking fountain, and modeled the trajectory and stiffness so that a robot could also press the drinking fountain lever to complete the task of fetching water [29]. Peternel et al. completed a sawing task through human–robot cooperation [30]. Stiffness information of the arm is obtained by detecting the electromyographic signal of the human arm; the stiffness of the robot decreases as the stiffness of the human arm increases, and as the stiffness of the human arm decreases, the stiffness of the robot increases to realize the sawing action.
In this paper, we propose the hybrid primitive framework (HPF), which enables robots to have motion skills and flexibility similar to those of human arms, and we optimize the parameters of the HPF using the policy improvement with path integrals (PI2) algorithm to cope with different tasks. Firstly, the end-effector of the robot is dynamically modeled using an admittance control model to give the robot flexibility. Secondly, the robot's motion trajectory is modeled using dynamic movement primitives (DMPs), and stiffness primitives (SPs) and damping primitives (DPs) are proposed to model the stiffness and damping parameters in the admittance model. Together, the dynamic movement primitives, stiffness primitives, and damping primitives are referred to as the hybrid primitive framework. According to the task, the PI2 algorithm iteratively learns the parameters of the HPF to adapt to task requirements, giving the robot adaptability similar to that of human arms. Finally, simulation experiments are designed under external force disturbance and under constant-force tracking in variable stiffness conditions to validate the effectiveness of the algorithm. Compared with previous research, this article innovatively models the stiffness and damping parameters of the admittance control model using stiffness primitives and damping primitives; by coupling these with dynamic movement primitives, the hybrid primitive framework is obtained, enabling the robot to possess both motion skills and compliance skills. The entire framework of this article is shown in Figure 1.
The remainder of the paper is organized as follows. Section 2 presents the admittance control model, Section 3 introduces the HPF, Section 4 describes the optimization of the HPF parameters through the PI2 algorithm, Section 5 reports on simulation experiments based on the HPF, and Section 6 draws conclusions from the results of the study.
3. Hybrid Primitive Framework
In this paper, we use DMPs to model the motion trajectory. Compared with traditional programming, DMPs are more efficient and stable and have good generalization ability. The DMP framework mainly comprises a stable, convergent second-order system and a nonlinear forcing term, which ensure that the robot converges to the target point while moving along a specific trajectory. The DMP model is defined as follows [33,34]:

$$\tau^{2}\ddot{y}=\alpha\left(\beta\left(g-y\right)-\tau\dot{y}\right)+f(s)$$

where $\alpha$ and $\beta$ are predefined constants and $g$ is the target point. In addition, for the second-order system to respond rapidly and reach the target point, the system needs to satisfy $\beta=\alpha/4$, which puts the second-order system in a critically damped state. $y$, $\dot{y}$, and $\ddot{y}$ are, respectively, the position, velocity, and acceleration of the system. $f(s)$ is a nonlinear forcing term. Here, $s$ is the phase, which monotonically changes from 1 to 0 during the movement and satisfies

$$\tau\dot{s}=-\alpha_{s}s$$

This system is known as the canonical system. $\alpha_{s}$ is a predefined constant that satisfies $\alpha_{s}>0$. $\tau$ is a temporal scaling factor. The nonlinear forcing term is defined as follows:

$$f(s)=\frac{\sum_{i=1}^{N}\psi_{i}(s)\,w_{i}}{\sum_{i=1}^{N}\psi_{i}(s)}\,s\left(g-y_{0}\right)$$
where $w_{i}$ denotes the weights of the Gaussian basis functions; $\psi_{i}(s)=\exp\left(-h_{i}\left(s-c_{i}\right)^{2}\right)$ denotes the Gaussian basis functions, and each basis function $\psi_{i}(s)$ is weighted by the parameter $w_{i}$; $c_{i}$ denotes the centers of the Gaussian basis functions; $h_{i}$ denotes the variances of the Gaussian basis functions; $y_{0}$ denotes the starting position; and $N$ is the number of Gaussian basis functions. The trajectory shape is changed by changing the weights of the Gaussian basis functions.
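The DMP rollout described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the basis-center placement, the heuristic widths, the gains ($\alpha=25$, $\alpha_s=4$), and the explicit Euler integrator are all assumptions made for the example.

```python
import numpy as np

def dmp_rollout(w, y0, g, tau=1.0, alpha=25.0, alpha_s=4.0, T=1.0, n=1000):
    """Euler integration of a 1-DOF DMP (illustrative sketch).

    w : weights of the N Gaussian basis functions of the forcing term.
    """
    N = len(w)
    c = np.linspace(1.0, 0.01, N)        # basis centers spread over the phase
    h = N / c                            # heuristic widths (assumption)
    dt = T / n
    y, yd, s = y0, 0.0, 1.0
    traj = [y]
    for _ in range(n):
        psi = np.exp(-h * (s - c) ** 2)                      # Gaussian bases
        f = s * (g - y0) * (psi @ w) / (psi.sum() + 1e-10)   # forcing term
        # Critically damped second-order system: beta = alpha / 4.
        ydd = (alpha * ((alpha / 4.0) * (g - y) - tau * yd) + f) / tau ** 2
        yd += ydd * dt
        y += yd * dt
        s += (-alpha_s * s / tau) * dt                       # canonical system
        traj.append(y)
    return np.array(traj)

# With zero weights the forcing term vanishes and the system converges to g.
path = dmp_rollout(np.zeros(10), y0=0.0, g=1.0)
```

Nonzero weights reshape the transient while the second-order system still guarantees convergence to the goal, which is the generalization property exploited throughout this paper.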
In accordance with the definition of the nonlinear forcing term in the DMP model, this paper proposes stiffness primitives (SPs) and damping primitives (DPs) to represent the variation of the stiffness and damping parameters in the admittance model. The SP model is defined as follows:

$$k(s)=\frac{\sum_{j=1}^{M}\psi_{j}(s)\,w_{j}^{k}}{\sum_{j=1}^{M}\psi_{j}(s)}$$

where $k(s)$ consists of $M$ Gaussian basis functions $\psi_{j}(s)=\exp\left(-h_{j}\left(s-c_{j}\right)^{2}\right)$; $c_{j}$ denotes the center of each Gaussian basis function; $h_{j}$ denotes the variance of each Gaussian basis function; and $w_{j}^{k}$ denotes the weights of the Gaussian basis functions and determines the shape of the stiffness profile. DPs have the same form. The DP model is defined as follows:

$$d(s)=\frac{\sum_{j=1}^{M}\psi_{j}(s)\,w_{j}^{d}}{\sum_{j=1}^{M}\psi_{j}(s)}$$
This article refers collectively to the DMPs, SPs, and DPs as the hybrid primitive framework (HPF); each component is driven by the shared system phase, so that together they control the motion and compliance of the robot.
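A minimal sketch of evaluating such a phase-driven primitive is shown below. The basis placement, the heuristic widths, and the clipping bounds (borrowed from the stiffness limits used in the Section 5 experiments) are assumptions for illustration, not part of the original model.

```python
import numpy as np

def primitive_profile(weights, s_path, lo=30.0, hi=400.0):
    """Evaluate a stiffness (or damping) primitive along a phase path.

    The profile is a normalized weighted sum of M Gaussian basis functions
    driven by the shared phase s; the clipping bounds are illustrative.
    """
    M = len(weights)
    c = np.linspace(1.0, 0.01, M)        # basis centers in phase space
    h = M / c                            # heuristic widths (assumption)
    out = []
    for s in s_path:
        psi = np.exp(-h * (s - c) ** 2)
        out.append(psi @ weights / (psi.sum() + 1e-10))
    return np.clip(np.array(out), lo, hi)

# Phase decaying from 1 toward 0, as produced by the canonical system.
s_path = np.exp(-4.0 * np.linspace(0.0, 1.0, 100))
# Equal weights give a constant profile; unequal weights shape it over time.
k = primitive_profile(100.0 * np.ones(8), s_path)
```

Because SPs and DPs share the DMP's phase, the stiffness and damping profiles stay synchronized with the motion even when the temporal scaling factor changes.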
4. HPF Parameter Optimization Based on the PI2 Algorithm
The weight parameters in the HPF are usually determined through demonstration learning, by collecting the motion trajectories, arm stiffness information, etc., of the demonstrator; the weight values are then solved using the locally weighted regression (LWR) algorithm. This method is fast and simple. However, for motion tasks with complex trajectory and stiffness variations, the demonstrator cannot demonstrate suitable trajectories and stiffness, and the weight parameters of the models in the HPF cannot be determined through demonstration learning. It is therefore important to enable robots to move beyond reliance on demonstration learning and to autonomously determine appropriate weight parameters according to the task. This section proposes a weight parameter learning method for the HPF based on policy improvement with path integrals (PI2) [35,36].
Many studies have analyzed the optimization of the DMP model based on PI2 and proposed various improvements [37]. In this paper, this idea is applied to learning the HPF weight parameters, i.e., the weight parameters of the DMP, SP, and DP components of the HPF. The PI2 algorithm implements policy improvement in the form of path integrals; it requires neither matrix inversion nor gradient descent for parameter optimization, thus avoiding numerical instability during iteration, and it requires no free parameters other than the exploration noise. PI2 is a reinforcement learning algorithm: the cost function is determined by the task requirements, and after multiple learning iterations the cost value gradually converges, yielding HPF weight parameters that meet the task requirements.
The DMP model is rewritten as follows:

$$\tau^{2}\ddot{y}=\alpha\left(\beta\left(g-y\right)-\tau\dot{y}\right)+\mathbf{g}(s)^{\mathrm{T}}\left(\mathbf{w}+\boldsymbol{\epsilon}_{t}\right)$$

where $\mathbf{g}(s)$ is the vector of normalized basis functions, $\mathbf{w}$ represents the weight parameters in the DMP model, which determine the robot's motion trajectory, and $\boldsymbol{\epsilon}_{t}$ represents the exploration noise vector of the weight parameters in the DMP model in the PI2 algorithm.
The SP model is rewritten as follows:

$$k(s)=\mathbf{g}(s)^{\mathrm{T}}\left(\mathbf{w}^{k}+\boldsymbol{\epsilon}_{t}^{k}\right)$$

where $\mathbf{w}^{k}$ represents the weight parameters in the SP model, determining the variation of the stiffness parameters in the admittance control model, and $\boldsymbol{\epsilon}_{t}^{k}$ represents the exploration noise vector of the weight parameters in the SP model in the PI2 algorithm.
The DP model is rewritten as follows:

$$d(s)=\mathbf{g}(s)^{\mathrm{T}}\left(\mathbf{w}^{d}+\boldsymbol{\epsilon}_{t}^{d}\right)$$

where $\mathbf{w}^{d}$ represents the weight parameters in the DP model, determining the variation of the damping parameters in the admittance control model, and $\boldsymbol{\epsilon}_{t}^{d}$ represents the exploration noise vector of the weight parameters in the DP model in the PI2 algorithm.
In the learning process, the exploration noise vector is added to the HPF weight parameters, and different noise variances can be specified, so that different levels of exploration can be applied when learning the trajectory, the stiffness, and the damping. The process of learning the HPF weight parameters is as follows:
Firstly, at each moment, $K$ noise vectors are generated, corresponding to $K$ trajectory curves, stiffness curves, and damping curves. The cost value $S(\tau_{i,k})$ of the $k$-th trajectory at time step $i$ is composed of the final cost $\phi_{n,k}$ of this movement and the immediate costs $r_{j,k}$ of all subsequent time steps:

$$S(\tau_{i,k})=\phi_{n,k}+\sum_{j=i}^{n-1}r_{j,k}$$

where $n$ is the total number of steps in the motion. The final cost $\phi_{n,k}$ is determined by the task requirements. For example, the final cost can be defined based on the distance between the robot's final position and the target position: a smaller distance gives a lower cost value, indicating that the robot better accomplishes the task. The immediate cost $r_{j,k}$ is determined by the characteristics of the robot's motion parameters. For example, acceleration can be chosen as a component of the immediate cost, representing the requirement for the smoothness of the robot's motion trajectory.
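The cost-to-go composition (final cost plus all subsequent immediate costs) can be sketched as follows; the three immediate-cost values are purely illustrative.

```python
import numpy as np

def cost_to_go(immediate_costs, terminal_cost):
    """Cost-to-go for every start index of one rollout.

    S(tau_i) = terminal cost + sum of immediate costs from step i onward.
    immediate_costs : array of r_0 .. r_{n-1} for one rollout.
    """
    # Reversed cumulative sum gives sum_{j=i}^{n-1} r_j for each i in one pass.
    tail = np.cumsum(immediate_costs[::-1])[::-1]
    return terminal_cost + tail

S = cost_to_go(np.array([1.0, 2.0, 3.0]), terminal_cost=10.0)
# S[0] covers all steps; S[-1] is the last immediate cost plus the terminal cost.
```

Computing the whole vector at once is convenient because the PI2 update needs the cost-to-go at every time step, not just at the start of the rollout.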
The probability $P(\tau_{i,k})$ of the $k$-th trajectory at time step $i$ is obtained through an exponential function and normalization of its corresponding cost value, defined as follows:

$$P(\tau_{i,k})=\frac{e^{-\frac{1}{\lambda}S(\tau_{i,k})}}{\sum_{k=1}^{K}e^{-\frac{1}{\lambda}S(\tau_{i,k})}}$$

where $\lambda$ is a positive constant that regulates the sensitivity of the probability to the cost. A higher trajectory cost corresponds to a lower probability. The probability $P(\tau_{i,k})$ determines the weight of the exploration noise of the $k$-th trajectory at the $i$-th time step, and the $K$ noise values are then weighted to obtain the update amount of the weight parameters in the HPF at the $i$-th time step:

$$\delta\mathbf{w}_{i}=\sum_{k=1}^{K}P(\tau_{i,k})\,\boldsymbol{\epsilon}_{i,k}$$

After the update amounts have been calculated for all time steps, the final update amount for the weight of the $j$-th Gaussian basis function in the HPF during this learning iteration is determined according to the following formula:

$$\delta w_{j}=\frac{\sum_{i=0}^{n-1}(n-i)\,\psi_{j}(s_{i})\,\left[\delta\mathbf{w}_{i}\right]_{j}}{\sum_{i=0}^{n-1}(n-i)\,\psi_{j}(s_{i})}$$
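A simplified sketch of the probability weighting and the parameter update for a single weight is given below. It is not the authors' implementation: the temperature λ, the shift by the per-step minimum cost (for numerical stability), and the omission of the basis-activation factor in the time-weighted average are assumptions and simplifications made for the example.

```python
import numpy as np

def pi2_update(S, eps, lam=1.0):
    """One PI2 update for a single weight parameter from K rollouts.

    S   : (K, n) cost-to-go values, S[k, i] for rollout k at step i.
    eps : (K, n) exploration noise applied to this weight in each rollout.
    """
    K, n = S.shape
    # Exponentiation and normalization: lower cost -> higher probability.
    expS = np.exp(-(S - S.min(axis=0)) / lam)
    P = expS / expS.sum(axis=0)
    # Probability-weighted noise gives the per-step update amount.
    dw_i = (P * eps).sum(axis=0)                 # shape (n,)
    # Time-weighted average over the rollout (earlier steps weighted more).
    t_w = np.arange(n, 0, -1, dtype=float)
    return (t_w * dw_i).sum() / t_w.sum()

# A rollout with much lower cost dominates the update, so the returned
# update amount approaches that rollout's noise value.
S = np.array([[0.0, 0.0, 0.0], [100.0, 100.0, 100.0]])
eps = np.array([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]])
dw = pi2_update(S, eps)
```

Note that no gradient or matrix inversion appears anywhere in the update, which is the numerical-stability property of PI2 emphasized above.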
The learning process of the HPF weight parameters based on the PI2 algorithm is shown in Table 1.
5. Experiment
The robot learning method for human-like skills in this paper is applicable to robots with any number of degrees of freedom. This algorithm is used for controlling the end-effector of the robot, and the forward and inverse kinematics algorithms of the robot are not the focus of this study. In the simulation experiments of this section, the structure of the robot is not specifically introduced. The term “robot motion trajectory” refers to the “end-effector motion trajectory,” which can be equivalently treated as a point. The effectiveness of the proposed algorithm is verified by observing the motion trajectory of this point. In this section, we complete simulation experiments based on common robot tasks and analyze the experimental data.
A. Task 1: Point-to-point motion experiment under external disturbance forces
In the simulation experiment, the robot needs to move from the starting position to the target position. During the movement, the robot is subjected to external disturbance forces and must pass through task points at specified times. Based on the algorithm proposed in this paper, the robot can exhibit human-like compliance while resisting external disturbances, passing through task points, and reaching the target point. The cost function during the learning process can be determined by the following five rules:
Rule 1: The robot reaches the goal point. The speed when reaching the goal point is as low as possible.
Rule 2: The robot passes through the task point.
Rule 3: The robot runs with a low acceleration to ensure smoothness of the trajectory.
Rule 4: The robot has low stiffness while completing the task, such that it has better compliance.
Rule 5: The robot has low damping characteristics that reduce the energy consumption of the system.
According to the above rules, the final cost $\phi_{n}$ of one movement can be determined as follows:

$$\phi_{n}=Q_{1}\left\|y_{n}-g\right\|^{2}+Q_{2}\left\|\dot{y}_{n}\right\|^{2}$$

where $Q_{1}$ is the cost weight of the distance between the position and the goal point at the last moment; $y_{n}$ is the position at the last moment; $g$ is the goal point; $Q_{2}$ is the cost weight of the speed at the last moment; and $\dot{y}_{n}$ is the velocity at the last moment.
The immediate cost $r_{t}$ is

$$r_{t}=Q_{3}\,\delta\left(t-t_{p}\right)\left\|y_{t}-p\right\|^{2}+Q_{4}\left\|\ddot{y}_{t}\right\|^{2}+Q_{5}k_{t}+Q_{6}d_{t}$$

where $Q_{3}$ is the cost weight for passing through the task point; $\delta\left(t-t_{p}\right)\left\|y_{t}-p\right\|^{2}$ is the cost of passing the task point, incurred only at the task point time; $p$ is the task point; $t_{p}$ is the task point time; $y_{t}$ is the position of the robot at time $t$; $Q_{4}$ is the cost weight of acceleration during motion; $\ddot{y}_{t}$ is the acceleration at time $t$; $Q_{5}$ is the cost weight of stiffness; $k_{t}$ is the stiffness at time $t$; $Q_{6}$ is the cost weight of damping; and $d_{t}$ is the damping at time $t$.
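As an illustrative sketch of how such a per-step cost could be evaluated, consider the function below. The weight values and the function signature are hypothetical, chosen only to show the structure of the cost (a task-point term active at one instant, plus acceleration, stiffness, and damping terms at every step); they are not the values used in the paper.

```python
import numpy as np

def immediate_cost(y, ydd, k, d, t, task_point, t_p, dt,
                   Q3=1e3, Q4=1e-4, Q5=1e-2, Q6=1e-2):
    """Per-step immediate cost for Task 1 (illustrative; Q3..Q6 assumed).

    Penalizes distance to the task point at the task-point time only,
    plus acceleration, stiffness, and damping at every step.
    """
    r = Q4 * np.dot(ydd, ydd) + Q5 * np.sum(k) + Q6 * np.sum(d)
    if abs(t - t_p) < dt / 2:                    # task-point time reached
        r += Q3 * np.sum((y - task_point) ** 2)
    return r

# One step evaluated exactly at the task-point time t_p = 2.5 s.
r = immediate_cost(y=np.array([0.55, 0.75]), ydd=np.array([0.1, 0.1]),
                   k=np.array([100.0, 100.0]), d=np.array([50.0, 50.0]),
                   t=2.5, task_point=np.array([0.6, 0.8]), t_p=2.5, dt=0.01)
```

Summing these values over a rollout and adding the final cost yields the cost-to-go used by the PI2 update.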
In the simulation experiment, the robot's starting point is (0 m, 0 m) and the target point is (1 m, 1 m), with a motion duration of 5 s. The robot is required to pass through the task point (0.6 m, 0.8 m) at $t_{p}=2.5$ s. To simulate the complex disturbance forces in an actual working environment, sinusoidal disturbance forces are applied in both the x and y directions during the experiment. The disturbance force curve is shown in Figure 2.
The entire learning process iterated 600 times. Since the learning objective is divided into five parts, the learning cost mainly includes the point-passing cost, target point cost, acceleration cost, stiffness cost, damping cost, and end velocity cost. As shown in Figure 3, as the number of iterations increases, the total learning cost gradually decreases and converges.
Figure 4 shows the motion trajectories of the robot during the learning process.
Figure 4a displays the initial motion trajectory of the robot with the task points. It can be observed that although the robot’s motion trajectory can reach the target point (1 m, 1 m), it deviates from the task point (0.6 m, 0.8 m).
Figure 4b displays the motion trajectory of the robot after 100 learning iterations, showing a tendency toward the target point.
Figure 4c displays the motion trajectory of the robot after 200 learning iterations; the robot gradually approaches the task point.
Figure 4d displays the motion trajectory of the robot after 600 learning iterations, showing that the motion trajectory passes through the task point (0.6 m, 0.8 m) and reaches the target point (1 m, 1 m).
Figure 5 shows the stiffness variation curves in the x and y directions. During the learning process, the maximum stiffness k_max is set to 400 N/m, and the minimum stiffness k_min is set to 30 N/m. The blue curve represents the initial stiffness k_init, set to 100 N/m, while the red curve represents the stiffness curve after learning.
Figure 5a illustrates the variation in stiffness in the x-direction. Initially, the stiffness of the robot in the x-direction is low, enhancing the compliance of the robot. This allows the robot to move along disturbance forces smoothly. As the robot gradually approaches the task point, the direction of motion remains aligned with the direction of disturbance forces. To prevent deviation from the task point, stiffness is increased to counteract the disturbance forces and ensure passage through the task point. After passing the task point, the direction of disturbance forces becomes opposite to the direction of robot motion. Therefore, increasing stiffness ensures that the robot can reach the target point.
Figure 5b illustrates the variation in stiffness in the y-direction. In the first half of the time period, the stiffness of the robot in the y-direction is low. This is because the task point in the y-direction is farther from the starting point compared to the x-direction, resulting in consistently lower stiffness in the y-direction to maintain higher compliance. This enables the robot to move along disturbance forces smoothly, ensuring passage through the task point. After passing the task point, the direction of disturbance forces becomes opposite to the direction of robot motion. Therefore, stiffness in the y-direction also needs to be increased to ensure that the robot can reach the target point.
Figure 6 shows the damping variation curves in the x and y directions. During the learning process, the maximum damping d_max is set to 100 Ns/m, and the minimum damping d_min is set to 10 Ns/m. The blue curve represents the initial damping d_init of 50 Ns/m, while the red curve represents the damping curve after learning.
As shown in Figure 6, in the first half of the time period, the damping of the robot in the x and y directions is relatively low. This is because the disturbance force is aligned with the direction of motion, and lower damping allows the robot to better follow the external force. In the second half of the time period, the damping increases significantly because the disturbance force is opposite to the direction of motion. By increasing the damping, the disturbance force is dissipated, preventing the robot from deviating from the target point.
This analysis shows that, under the hybrid primitive framework, the robot can have motion capabilities, variable stiffness, and variable damping similar to those of a human arm, allowing it to complete point-to-point tasks in environments with disturbances.
B. Task 2: Trajectory tracking experiment conducted in a variable stiffness environment
The simulation experiment is set up as follows: the robot moves closely along the surface of a cantilever beam while maintaining a constant contact force, simulating an actual robotic grinding task.
Since the structure and material properties of the cantilever beam are fixed, its deflection curve under a constant force is also fixed. However, the stiffness of the cantilever beam varies with position: the closer to the fixed end, the greater the stiffness. The deviation between the robot's motion trajectory and the deflection curve of the cantilever beam is observed; a smaller deviation indicates that the robot possesses human-arm-like skills and can complete trajectory-tracking tasks in a variable stiffness environment. The cost function during the learning process can be determined by the following five rules:
Rule 1: The robot reaches the target point. The speed when reaching the target point is as low as possible.
Rule 2: The robot’s motion trajectory is as consistent as possible with the deflection curve of the cantilever beam.
Rule 3: The robot runs with low acceleration to ensure the smoothness of the trajectory.
Rule 4: The robot has low stiffness and thus good compliance while completing the task.
Rule 5: The robot has low damping characteristics that reduce the energy consumption of the system.
According to the above rules, the final cost $\phi_{n}$ of one movement can be determined as follows:

$$\phi_{n}=Q_{1}\left\|y_{n}-g\right\|^{2}+Q_{2}\left\|\dot{y}_{n}\right\|^{2}$$

where $Q_{1}$ is the cost weight of the distance between the position and the goal point at the last moment; $y_{n}$ is the position at the last moment; $g$ is the goal point; $Q_{2}$ is the cost weight of the speed at the last moment; and $\dot{y}_{n}$ is the velocity at the last moment.
The immediate cost $r_{t}$ is

$$r_{t}=Q_{3}\left\|y_{t}-y_{c}(t)\right\|^{2}+Q_{4}\left\|\ddot{y}_{t}\right\|^{2}+Q_{5}k_{t}+Q_{6}d_{t}$$

where $Q_{3}$ denotes the tracking cost weight for the cantilever beam deflection curve; $y_{t}$ is the position of the robot at time $t$; $y_{c}(t)$ is the deflection of the cantilever beam at time $t$; $Q_{4}$ is the cost weight of acceleration during motion; $\ddot{y}_{t}$ is the acceleration at time $t$; $Q_{5}$ is the cost weight of stiffness; $k_{t}$ is the stiffness at time $t$; $Q_{6}$ is the cost weight of damping; and $d_{t}$ is the damping at time $t$.
In the simulation experiment, the cantilever beam is made of carbon steel with a thickness of 4 mm, a width of 50 mm, and a length of 800 mm. The modulus of elasticity $E$ is set to 200 GPa. According to the deflection curve Equation (25) of the cantilever beam, when subjected to a constant force of 10 N in the y-direction, the deflection curve of the cantilever beam can be obtained, as shown in Figure 7. Therefore, the target position for the robot's motion is the deflection at the free end of the cantilever beam, which is y = 32 mm.

$$y(x)=\frac{Fx^{2}\left(3L-x\right)}{6EI},\qquad I=\frac{bh^{3}}{12}$$

where $F$ is the force acting on the surface of the cantilever beam, $L$ is the length of the cantilever beam, $E$ is the modulus of elasticity of the cantilever beam material, and $I$ is the moment of inertia of the cantilever beam section, which is determined by the width $b$ and height $h$ of the cross section.
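The beam parameters quoted above can be checked numerically. The sketch below assumes the standard tip-loaded deflection formula with the rectangular cross-section from the text; with F = 10 N, L = 0.8 m, E = 200 GPa, b = 50 mm, and h = 4 mm, it reproduces the 32 mm free-end deflection.

```python
def cantilever_deflection(x, F=10.0, L=0.8, E=200e9, b=0.05, h=0.004):
    """Deflection y(x) of a cantilever under a force F at the free end.

    y(x) = F x^2 (3L - x) / (6 E I), with I = b h^3 / 12 for the
    rectangular cross-section used in the simulation.
    """
    I = b * h ** 3 / 12.0                # second moment of area of the section
    return F * x ** 2 * (3.0 * L - x) / (6.0 * E * I)

# Free-end deflection for the parameters in the text: 0.032 m, i.e. 32 mm.
tip = cantilever_deflection(0.8)
```

Sampling this curve along x also provides the reference trajectory y_c(t) that the tracking term of the immediate cost compares against.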
The entire learning process iterated 600 times. Since the learning objective is divided into five parts, the learning cost mainly includes the tracking cost, acceleration cost, endpoint cost, velocity cost, damping cost, and stiffness cost. As shown in Figure 8, as the number of iterations increases, the total learning cost gradually decreases and converges.
Figure 9 and Figure 10 show the stiffness and damping variation curves of the robot after learning. Figure 11 displays the motion trajectory of the robot after learning along with the deflection curve of the cantilever beam; the two curves almost overlap. The error between the two trajectories over time is shown in Figure 12. The error is large at the beginning of the motion and then gradually decreases, because as the robot moves, the stiffness of the cantilever beam decreases, allowing the robot to better track its deflection curve. Throughout the entire motion, the deviation remains within ±0.15 mm, indicating that the robot can achieve trajectory tracking under variable stiffness conditions like a human arm, validating the effectiveness of the algorithm proposed in this paper.