Article

Integrated Intelligent Control of Redundant Degrees-of-Freedom Manipulators via the Fusion of Deep Reinforcement Learning and Forward Kinematics Models

Yushuo Chen, Shijie Su, Kai Ni and Cunjun Li

1 College of Mechanical Engineering, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 Zhoushan Institute of Calibration and Testing for Quality and Technology Supervision, Zhoushan 316021, China
* Authors to whom correspondence should be addressed.
Machines 2024, 12(10), 667; https://doi.org/10.3390/machines12100667
Submission received: 4 August 2024 / Revised: 7 September 2024 / Accepted: 19 September 2024 / Published: 24 September 2024
(This article belongs to the Section Automation and Control Systems)

Abstract

Redundant degree-of-freedom (DOF) manipulators offer increased flexibility and are better suited for obstacle avoidance, yet precise control of these systems remains a significant challenge. This paper addresses the issues of slow training convergence and suboptimal stability that plague current deep reinforcement learning (DRL)-based control strategies for redundant DOF manipulators. We propose a novel DRL-based intelligent control strategy, FK-DRL, which integrates the manipulator’s forward kinematics (FK) model into the control framework. Initially, we conceptualize the control task as a Markov decision process (MDP) and construct the FK model for the manipulator. Subsequently, we expound on the integration principles and training procedures for amalgamating the FK model with existing DRL algorithms. Our experimental analysis, applied to 7-DOF and 4-DOF manipulators in simulated and real-world environments, evaluates the FK-DRL strategy’s performance. The results indicate that compared to classical DRL algorithms, the FK-DDPG, FK-TD3, and FK-SAC algorithms improved the success rates of intelligent control tasks for the 7-DOF manipulator by 21%, 87%, and 64%, respectively, and the training convergence speeds increased by 21%, 18%, and 68%, respectively. These outcomes validate the proposed algorithm’s effectiveness and advantages in redundant manipulator control using DRL and FK models.

1. Introduction

Redundant degree-of-freedom (DOF) manipulators have additional degrees of freedom, endowing the system with enhanced obstacle avoidance capabilities, superior singularity avoidance, and higher fault tolerance [1]. However, while solving the forward kinematics (FK) of such manipulators is relatively straightforward, their inverse kinematics (IK) present innumerable possible solutions. Therefore, solving the IK of redundant DOF manipulators poses significant complexity and challenges [2]. Numerical methods are the prevailing approach for solving the IK of manipulators, but they typically yield only locally optimal solutions [3]. Furthermore, these methods are hampered by slow computational speeds and high costs [4], which can adversely affect the effectiveness of subsequent motion control for the manipulator.
The analytical method [5] establishes explicit expressions for the joint angles of manipulators, thus offering high computational efficiency and precision. Crane [6] proposed a technique that fixes one joint of a 7-DOF manipulator and obtains the analytical solution for the remaining 6-DOF subchain, which facilitates the derivation of the kinematic inverse solution for the manipulator. Xavier et al. [7] introduced the application of Grobner basis theory to solve the IK of 7-DOF manipulators and validated the accuracy of this method. However, the analytical approach requires the IK equations to be rederived for the specific structural configuration of each manipulator [8], which limits its generalizability.
With the development of artificial intelligence, scholars have increasingly utilized genetic algorithms [9] and neural networks [10] to solve the IK of manipulators with redundant DOF. However, genetic algorithms are often criticized for their slow iterative solution process, while neural networks require extensive training data. Additionally, both methods are prone to converging on local optima, which can compromise the precision of the solutions.
Reinforcement learning (RL) involves intelligent agents learning optimal strategies through interactions with the environment to maximize cumulative rewards [11]. This approach offers several advantages for intelligent control of manipulators, including the elimination of the need to establish kinematic models, robust handling of complex problems, and self-learning capabilities. Consequently, RL has emerged as a promising method for intelligent manipulator control. Perrusquia et al. [12] proposed a multi-agent RL method to solve the IK problem of redundant DOF manipulators. Lee et al. [13] developed a 7-DOF manipulator posture control algorithm based on RL and neural networks. Ramirez et al. [14] introduced an RL method combined with human demonstrations and applied it to the control of redundant DOF manipulators. However, RL algorithms exhibit inefficiencies in exploring high-dimensional spaces and are prone to getting trapped in local optima in complex tasks.
To improve the performance of RL algorithms, researchers have begun integrating deep learning with reinforcement learning, leading to the development of deep reinforcement learning (DRL) [15] methods. Li et al. [16] proposed a general framework that integrates DRL into the motion planning of redundant DOF manipulators, optimizing path planning and deriving the energy-optimal solution to the IK. Calderón-Cordova et al. [17] introduced a DRL framework for controlling manipulators in simulated environments. Zheng et al. [18], focusing on a 6-DOF manipulator, proposed a trajectory-planning method based on DRL, which improves convergence by introducing dynamic action selection strategies and combinatorial reward functions.
Although DRL has partially addressed the shortcomings of RL, it still faces issues such as slow training convergence, poor stability, and limited scalability when dealing with intelligent control tasks for high-DOF or redundant DOF manipulators. To address these challenges, this paper proposes an efficiently converging FK-DRL algorithm for the intelligent control of redundant DOF and 4-DOF manipulators. The algorithm integrates the easily established FK model of the manipulator into the DRL algorithm's training framework. The FK model effectively guides the agent in exploring optimal control strategies, thereby significantly improving the training convergence speed and stability of the algorithm. Furthermore, the algorithm only requires modification of the manipulator's FK model to be applicable to the intelligent control of other manipulators, demonstrating excellent scalability.
The remainder of this article is organized as follows: the proposed FK-DRL algorithm is introduced in Section 2; comparative simulations illustrating the performance of the proposed algorithm are presented in Section 3; Section 4 further validates the performance of the proposed algorithm through physical experiments with a 4-DOF manipulator; and the conclusions and future work are summarized in Section 5.

2. Design of the FK-DRL Control Algorithm

This section introduces the design of the FK-DRL algorithm. We provide a detailed explanation of the process of modeling the manipulator’s control task using the Markov decision process (MDP) model [19] and the construction of the manipulator’s FK model. Additionally, we elucidate how to integrate the FK model into the DRL algorithm to form the FK-DRL algorithm.

2.1. Control Problem and MDP Modelling

The task of manipulator control can be characterized as follows: A target block is randomly placed within the workspace of the manipulator. The agent rotates the various joints of the manipulator to ensure that the end-effector [20] is positioned at the desired location and orientation.
The control task of a manipulator can be modeled as an MDP, with the objective of identifying an optimal policy that enables the manipulator’s end-effector to reach the target pose accurately with the minimum number of steps.
Define the state space $S$:
$$s_t \in S, \quad s_t = [j, p_e, p_g]$$
where $j$ denotes the $n$ joint angles $(\theta_1, \theta_2, \dots, \theta_n)$ of the manipulator, $p_e$ is the current position $(x_e, y_e, z_e, \alpha_e, \beta_e, \gamma_e)$ of the end-effector, and $p_g$ is the target position $(x_g, y_g, z_g, \alpha_g, \beta_g, \gamma_g)$.
Define the action space $A$:
$$a_t \in A, \quad a_t = [\theta_1, \theta_2, \dots, \theta_n]$$
where the action $a_t$ is the vector of target angles of the $n$ joints of the manipulator.
Define the reward function $r_t$:
$$r_a = -d = -\sqrt{(x_e - x_g)^2 + (y_e - y_g)^2 + (z_e - z_g)^2 + (\alpha_e - \alpha_g)^2 + (\beta_e - \beta_g)^2 + (\gamma_e - \gamma_g)^2}$$
$$r_b = \begin{cases} 1, & d < d' \\ 0, & \text{otherwise} \end{cases} \qquad r_c = \begin{cases} \eta, & d < \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad r_t = r_a + r_b + r_c$$
where $r_a$ is a reward term negatively correlated with the distance between $p_e$ and $p_g$; $r_b$ is a directional reward term, with $d$ denoting the distance between $p_e$ and $p_g$ at the current moment and $d'$ the distance at the previous moment, so that a reward of 1 is received when $p_e$ has moved closer to $p_g$ than at the previous moment and 0 otherwise; and $r_c$ is 0 while the distance between $p_e$ and $p_g$ is greater than the threshold $\varepsilon$, and equals $\eta$ once the distance falls below $\varepsilon$, at which point the task is considered completed. The total reward $r_t$ is the sum of $r_a$, $r_b$, and $r_c$. The episode ends when the time step equals $n$ unless the task is completed within $n$ time steps.
The sum of episode reward values R t is defined as:
$$R_t = \sum_{t=1}^{n} r_t$$
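For illustration, the per-step reward described above can be sketched in Python as follows; the threshold $\varepsilon$ and bonus $\eta$ values are placeholders, and writing $r_a = -d$ is our reading of the distance-related term.

```python
import numpy as np

def compute_reward(p_e, p_g, d_prev, eps=0.01, eta=10.0):
    """One-step reward: distance term r_a, direction term r_b, completion term r_c.

    p_e, p_g : 6-D end-effector and target poses (x, y, z, alpha, beta, gamma).
    d_prev   : distance between p_e and p_g at the previous time step (d').
    eps, eta : completion threshold and completion bonus (placeholder values).
    """
    d = np.linalg.norm(np.asarray(p_e) - np.asarray(p_g))  # distance at the current step
    r_a = -d                                               # negatively correlated with d
    r_b = 1.0 if d < d_prev else 0.0                       # 1 if closer than the previous step
    r_c = eta if d < eps else 0.0                          # completion bonus within threshold
    done = d < eps                                         # task considered completed
    return r_a + r_b + r_c, d, done
```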
In manipulator control tasks, the target pose is randomly initialized within the manipulator’s workspace at the beginning of each episode. The agent identifies the optimal strategy based on the current environmental state, which yields the joint angles necessary for the manipulator’s end-effector to reach the target pose as swiftly as possible.

2.2. Modelling of the FK of the Manipulator

The FK model of the $n$-DOF manipulator is established according to the D-H (Denavit–Hartenberg) [21] parametric method, and the relative position between joint $\{i-1\}$ and joint $\{i\}$ in its joint coordinate system is shown in Figure 1.
According to the D-H parameters, the transformation matrix ${}^{i-1}T_i$ between joints $\{i-1\}$ and $\{i\}$ is obtained as:
$${}^{i-1}T_i = \begin{bmatrix} \cos\theta_i & -\sin\theta_i & 0 & a_{i-1} \\ \sin\theta_i\cos\alpha_{i-1} & \cos\theta_i\cos\alpha_{i-1} & -\sin\alpha_{i-1} & -\sin\alpha_{i-1}\, d_i \\ \sin\theta_i\sin\alpha_{i-1} & \cos\theta_i\sin\alpha_{i-1} & \cos\alpha_{i-1} & \cos\alpha_{i-1}\, d_i \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
where $\alpha_{i-1}$, $a_{i-1}$, $d_i$, and $\theta_i$ are the D-H parameters of the manipulator.
The transformation matrix ${}^{0}T_n$ from the base joint to the end-effector joint coordinate system is:
$${}^{0}T_n = {}^{0}T_1 \, {}^{1}T_2 \cdots {}^{n-1}T_n = \begin{bmatrix} n_x & o_x & a_x & p_x \\ n_y & o_y & a_y & p_y \\ n_z & o_z & a_z & p_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
where the $3 \times 3$ block $[\, n_i \;\, o_i \;\, a_i \,]$ $(i = x, y, z)$ denotes the rotation matrix, whose columns are the attitude vector coordinates of the manipulator end-effector, and the fourth column $[\, p_x \;\, p_y \;\, p_z \,]^{T}$ denotes the position vector of the manipulator end-effector.
The manipulator end-effector pose $p$ is:
$$p = (p_x, p_y, p_z, \alpha, \beta, \gamma)$$
where $\alpha$, $\beta$, and $\gamma$ denote the heading, pitch, and roll angles of the manipulator end-effector in Euler angles, respectively.
Denoting the FK model of the $n$-DOF manipulator as $M$, and the mapping from the joint angles $(\theta_1, \theta_2, \dots, \theta_n)$ to the manipulator end-effector pose $p$ as $f$, then:
$$M: \; f(\theta_1, \theta_2, \dots, \theta_n) = {}^{0}T_n$$
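As a reference implementation of the model $M$, the NumPy sketch below builds the per-joint transform in the form given above and chains the transforms to obtain ${}^{0}T_n$ and the pose $p$; the Z-Y-X Euler-angle extraction is one common convention and is our assumption rather than necessarily the convention used here.

```python
import numpy as np

def dh_transform(alpha_prev, a_prev, d, theta):
    """Transform {i-1}T{i} built from the D-H parameters (angles in radians)."""
    ca, sa = np.cos(alpha_prev), np.sin(alpha_prev)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct,      -st,      0.0,  a_prev],
        [st * ca,  ct * ca, -sa,  -sa * d],
        [st * sa,  ct * sa,  ca,   ca * d],
        [0.0,      0.0,      0.0,  1.0],
    ])

def forward_kinematics(dh_table, joint_angles):
    """Chain the per-joint transforms to get 0T_n and the end-effector pose p."""
    T = np.eye(4)
    for (alpha_prev, a_prev, d, theta_offset), q in zip(dh_table, joint_angles):
        T = T @ dh_transform(alpha_prev, a_prev, d, theta_offset + q)
    px, py, pz = T[:3, 3]                                   # position vector
    # Z-Y-X Euler angles (heading, pitch, roll) extracted from the rotation block
    alpha = np.arctan2(T[1, 0], T[0, 0])
    beta = np.arctan2(-T[2, 0], np.hypot(T[0, 0], T[1, 0]))
    gamma = np.arctan2(T[2, 1], T[2, 2])
    return np.array([px, py, pz, alpha, beta, gamma])
```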

2.3. FK-DRL Algorithm

DRL algorithms can be classified into value-based and policy-based algorithms. Value-based algorithms select the action with the highest value as the optimal policy each time [22]:
$$\pi^*(a \mid s) = \arg\max_{a} q^*(s, a)$$
where $\pi^*(a \mid s)$ represents the optimal policy, and $q^*(s, a)$ represents the optimal action-value function.
Policy-based RL algorithms, on the other hand, directly parameterize the policy, denoted as $\pi(a \mid s, \theta)$, and represent it as a function whose objective is to find the optimal parameter $\theta^*$ that maximizes the expected cumulative reward [23]. Policy-based RL algorithms need to find the gradient $\nabla_{\theta^{\mu}} L_A(\theta^{\mu})$ of the policy's objective function $L_A(\theta^{\mu})$, defined as
$$\nabla_{\theta^{\mu}} L_A(\theta^{\mu}) = \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{a = \pi(s_i)} \, \nabla_{\theta^{\mu}} \pi(s \mid \theta^{\mu})$$
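In deterministic actor–critic implementations, this gradient is usually realized by backpropagating the Critic's value estimate through the Actor rather than assembling it explicitly; the PyTorch fragment below is a minimal sketch of such an update, assuming `actor` and `critic` are `nn.Module` instances with the obvious call signatures.

```python
import torch

def actor_update(actor, critic, actor_optimizer, states):
    """One actor step: ascend Q(s, pi(s)) by minimizing its negative batch mean."""
    actor_optimizer.zero_grad()
    actions = actor(states)                       # a = pi(s | theta_mu)
    actor_loss = -critic(states, actions).mean()  # -(1/N) * sum_i Q(s_i, pi(s_i) | theta_Q)
    actor_loss.backward()                         # autograd applies the chain rule of the gradient above
    actor_optimizer.step()
    return actor_loss.item()
```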
Current mainstream actor–critic algorithms such as deep deterministic policy gradient (DDPG) [24], twin delayed deep deterministic policy gradient (TD3) [25], and soft actor–critic (SAC) [26] can estimate value functions with relative precision and guide policy learning. However, these methods necessitate persistent interaction between the agent and the environment to enrich the data within the experience replay buffer [27]. This process is both time-consuming and resource-intensive, leading to slow convergence rates and suboptimal control precision during training.
To address the aforementioned issues, this paper proposes the FK-DRL algorithm for controlling manipulators with redundant degrees of freedom. The specific workflow is as follows:
Step 1: Define the FK model M, the maximum number of training episodes E, the maximum number of interactions with the environment per episode T1, the number of dynamic-planning rounds per episode P, and the maximum number of interactions with the FK model M in one dynamic-planning round T2.
Step 2: In each training episode, the agent updates the Actor and Critic networks after each interaction with the environment. After T1 interactions, dynamic planning is initiated:
(1) Take one unfinished transition $(s_n, a_n, r_n, s_{n+1})$ from the experience pool.
(2) For the sake of differentiation, the state $s_n$ is denoted as $s_k^M$ and used as the initial state of a planning round. At the same time, OU (Ornstein–Uhlenbeck) noise is added to the action $a_n$ [28], which is then denoted as $a_k^M = [\theta_1^k, \theta_2^k, \dots, \theta_n^k]$.
(3) Inputting the action $a_k^M$ into the model M, the pose of the end-effector $p_e^k = (x_e^k, y_e^k, z_e^k, \alpha_e^k, \beta_e^k, \gamma_e^k)$ is obtained. The next state $s_{k+1}^M = [j^{k+1}, p_e^{k+1}, p_g^{k+1}]$ is obtained from the current state $s_k^M = [j^{k}, p_e^{k}, p_g^{k}]$ after applying the action $a_k^M$. The reward value $r_k^M$ is then calculated to form the transition $(s_k^M, a_k^M, r_k^M, s_{k+1}^M)$, which is stored in the experience pool.
(4) Input $s_{k+1}^M$ into the Actor network, which outputs an action. Add OU noise to this action to obtain $a_{k+1}^M$ and repeat step (3) to obtain the transition $(s_{k+1}^M, a_{k+1}^M, r_{k+1}^M, s_{k+2}^M)$, which is likewise stored in the experience pool. Extract experiences from the pool to update the Actor and Critic networks.
(5) Continue interacting with the model M, updating the Actor and Critic networks, until the number of interactions reaches T2, which marks the end of this dynamic-planning round.
(6) Repeat steps (1)–(5) P times.
Step 3: Repeat Step 2 until the maximum number of training episodes E is reached, yielding the optimal policy $\pi(a \mid s, \theta)$.
The pseudocode for the FK-DRL algorithm is shown in Algorithm 1.
Algorithm 1. FK-DRL
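A Python-style sketch of the dynamic-planning phase described in Step 2 is given below; the names `agent`, `replay_buffer`, `fk_step`, and `ou_noise` are placeholders for a generic off-policy implementation and are not the authors' code.

```python
def dynamic_planning(agent, replay_buffer, fk_step, ou_noise, P, T2):
    """Run P dynamic-planning rounds against the FK model M instead of the simulator.

    fk_step(s, a) -> (s_next, r, done): evaluates the FK model for action a in state s
    and returns the resulting transition using the reward of Section 2.1.
    """
    for _ in range(P):
        # (1)-(2) take an unfinished transition and perturb its action with OU noise
        s_k, a_k, _, _ = replay_buffer.sample_unfinished()
        a_k = a_k + ou_noise()
        for _ in range(T2):
            # (3) query the FK model instead of the simulation environment
            s_next, r_k, done = fk_step(s_k, a_k)
            replay_buffer.store(s_k, a_k, r_k, s_next)
            agent.update(replay_buffer)   # update the Actor and Critic networks
            if done:
                break
            # (4) next action from the current Actor, again with OU exploration noise
            s_k, a_k = s_next, agent.select_action(s_next) + ou_noise()
```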
Given that the network update processes of algorithms such as TD3, SAC, and DDPG are essentially the same, this paper uses FK-DDPG as an example to demonstrate the overall framework of the FK-DRL algorithm, as shown in Figure 2.

3. Simulation

To evaluate the efficacy of the proposed algorithms, control tasks involving a 7-DOF manipulator and a 4-DOF manipulator were conducted in a simulation environment. The agents were trained using the proposed FK-DDPG, FK-TD3, and FK-SAC algorithms as well as the conventional DDPG, TD3, and SAC algorithms. Subsequently, the performance of the manipulators was evaluated post-training.

3.1. Simulation Environment

In this study, we utilized CoppeliaSim (formerly known as V-REP) [29] as the simulation software. Neural network training was conducted using PyTorch 1.9.0 on a single GPU (NVIDIA GeForce RTX 3090) with CUDA version 11.2.
The manipulator model, designed in SolidWorks, was imported into CoppeliaSim. Blocks with randomly initialized positions were added to serve as the target poses for the manipulator’s end-effector in each simulation episode. Figure 3 shows the 3D simulation model of the 7-DOF manipulator and its workspace for the end-effector. The workspace of the end-effector can be represented as a rectangular region defined as the coordinate range from (0.3, −0.1) to (0.5, 0.1). Within CoppeliaSim, we can acquire data on the joint angles, the pose of the end-effector, and the coordinates of the blocks.
As shown in Figure 3, Joint 1 rotates around the Z-axis of the world coordinate system, while Joints 2 through 7 are all rotational joints. The end-effector of the manipulator is considered a fixed Joint 8. The D-H parameters of this manipulator are presented in Table 1.
The FK model M of the manipulator is:
$$M: \; f(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6, \theta_7) = {}^{0}T_8$$
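As a usage sketch, the FK model $M$ of this manipulator can be instantiated from the D-H values in Table 1 with the `dh_transform`/`forward_kinematics` helpers sketched in Section 2.2; the joint-angle vector below is arbitrary, and the eighth row corresponds to the fixed end-effector joint.

```python
import numpy as np

# D-H rows (alpha_{i-1} [rad], a_{i-1} [m], d_i [m], theta offset [rad]) taken from Table 1.
DH_7DOF = [
    (np.radians(0),   0.0, 0.2039, 0.0),
    (np.radians(90),  0.0, 0.0,    0.0),
    (np.radians(-90), 0.0, 0.2912, 0.0),
    (np.radians(90),  0.0, 0.0,    0.0),
    (np.radians(-90), 0.0, 0.3236, 0.0),
    (np.radians(90),  0.0, 0.0,    0.0),
    (np.radians(-90), 0.0, 0.0606, 0.0),
    (np.radians(0),   0.0, 0.1006, 0.0),   # fixed end-effector joint (theta_8 = 0)
]

# End-effector pose for an arbitrary joint configuration (eighth entry is the fixed joint).
pose = forward_kinematics(DH_7DOF, np.radians([0, 30, 0, -45, 0, 60, 0, 0]))
```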
Figure 4 illustrates the 3D simulation environment of the 4-DOF manipulator and its end-effector workspace. The workspace of the end-effector can be represented as a rectangular region defined as the coordinate range from (0.2, −0.15) to (0.5, 0.15). The D-H parameters of the 4-DOF manipulator are shown in Table 2.
The FK model M of this manipulator is:
$$M: \; f(\theta_1, \theta_2, \theta_3, \theta_4) = {}^{0}T_5$$
The network architectures of the DDPG, TD3, FK-DDPG, and FK-TD3 algorithms are identical. The network model for a 7-DOF manipulator control task is illustrated in Figure 5.
The SAC and FK-SAC algorithms employ stochastic policy approaches. Their Actor networks consist of four fully connected neural layers, each with 256 neurons, and a ReLU activation function is applied after each fully connected layer. After the first two fully connected layers, one output head produces the mean μ of a normal distribution, while a second output head produces the exponent (logarithm) of the standard deviation σ of that distribution. The Critic network structure is consistent with that of DDPG.
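One possible PyTorch reading of this Actor architecture is sketched below: a shared trunk of two 256-unit layers followed by a mean head and a log-standard-deviation head. The clamping bounds and the omission of the tanh log-probability correction are simplifications, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SACActor(nn.Module):
    """Stochastic SAC-style actor: shared 256-unit trunk, then mean and log-std heads."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, action_dim)       # mean of the Gaussian policy
        self.log_std_head = nn.Linear(hidden, action_dim)  # exponent of the std deviation

    def forward(self, state):
        h = self.trunk(state)
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(-20, 2)       # common numerical-stability bounds
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw_action = dist.rsample()                        # reparameterized sample
        action = torch.tanh(raw_action)                    # squash to the normalized joint range
        return action, dist.log_prob(raw_action).sum(-1)
```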
The network hyperparameters [30] of each algorithm are shown in Table 3.

3.2. Analysis of Simulation Results of 7-DOF Manipulator

This study evaluates the convergence curves of the FK-DDPG, FK-TD3, and FK-SAC algorithms, along with the classical DDPG, TD3, and SAC algorithms, by averaging the results of 10 random training runs. Figure 6, Figure 7 and Figure 8 depict the training outcomes. In each figure, the central line represents the average reward per episode, calculated over the 10 random training runs, while the shaded area indicates the standard deviation. The horizontal axis denotes the number of training episodes, while the vertical axis represents the cumulative reward up to the current episode divided by the number of episodes completed.
It is evident that integrating the FK model of a manipulator with any of the DDPG, TD3, or SAC algorithms resulted in superior training performance compared to employing the DDPG, TD3, or SAC algorithms independently. Notably, the algorithms that combine FK—FK-DDPG, FK-TD3, and FK-SAC—demonstrated smaller shaded areas in their training curves, indicating that the incorporation of FK with DDPG, TD3, or SAC led to a more stable training process.
The FK-DDPG algorithm converged after 80,000 episodes, marking a 27% reduction in the number of episodes required for convergence compared to the standard DDPG algorithm, which converged after 110,000 episodes. Additionally, the average reward value at the completion of training for FK-DDPG exceeded that of DDPG by approximately 5 units.
The FK-TD3 algorithm achieved convergence at 85,000 episodes, whereas the standard TD3 algorithm failed to converge within the same number of episodes. Similarly, the FK-SAC algorithm converged after 90,000 episodes, demonstrating a 25% reduction in episodes compared to the SAC algorithm, which only began to show a convergence trend after 120,000 episodes. The average reward for FK-SAC upon training completion was about 8 units higher than that of SAC. The slower convergence rate of the SAC algorithm can be attributed to its stochastic policy approach based on the maximum entropy principle. During training, the SAC algorithm maintains a policy distribution and seeks to maximize the entropy of the policy to enhance exploration and improve the extent of exploration across different actions. This increased exploratory behavior results in a slower convergence rate during training.
The FK-DDPG, FK-TD3, and FK-SAC algorithms all exhibited shorter convergence times and higher control precision. This is because these algorithms incorporate FK models during dynamic planning, enabling the agent to leverage exploration and learning further. Such integration facilitates efficient data augmentation, generating a greater variety of grasping scenarios and samples to enrich the training data. Consequently, agents can rapidly acquire an understanding of actions, which better guides their exploration and decision-making processes, allowing for precise control of the manipulator to reach target poses in fewer episodes.
Training time is a more accurate measure of the time expended during the process of training agents. In this paper, it specifically refers to the time elapsed from the start of training until convergence is achieved or the maximum number of episodes is reached.
As illustrated in Figure 9, the training time required for the FK-DDPG, FK-TD3, and FK-SAC algorithms was significantly less than that of their respective classical DRL counterparts. For instance, the FK-DDPG algorithm reduced training time by 21%, while the FK-TD3 and FK-SAC algorithms achieved reductions of 18% and 68%, respectively. The incorporation of the FK model not only decreased the number of episodes required for convergence but also substantially shortened the total convergence time.
To further assess the effectiveness of the algorithms, we selected average reward (total reward obtained during testing divided by the number of tests) and success rate (number of tasks completed divided by the number of tests) as metrics. The average rewards and success rates of the FK-DDPG, FK-TD3, and FK-SAC algorithms compared with the DDPG, TD3, and SAC algorithms over 100 simulation trials are depicted in Figure 10 and Figure 11.
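These two test metrics can be computed as follows; `run_episode` is a hypothetical helper that executes one test episode and returns its total reward and a success flag.

```python
def evaluate(agent, env, run_episode, n_trials=100):
    """Average reward and success rate over n_trials test episodes."""
    total_reward, successes = 0.0, 0
    for _ in range(n_trials):
        episode_reward, success = run_episode(agent, env)
        total_reward += episode_reward
        successes += int(success)
    return total_reward / n_trials, successes / n_trials
```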
Figure 10 demonstrates that during the testing phase, the average rewards for the FK-DDPG, FK-TD3, and FK-SAC algorithms were consistently higher, and their reward variances were substantially lower than those of the corresponding classical DRL algorithms. This is in line with observations made during the training phase. For example, the FK-DDPG algorithm achieved an average reward of 9.96, in contrast to the DDPG algorithm’s average reward of 4.45.
As illustrated in Figure 11, the success rates of the FK-DDPG, FK-TD3, and FK-SAC algorithms were significantly higher than those of the classical DRL algorithms. For instance, the FK-TD3 algorithm achieved a testing success rate of 99%, in stark contrast to the mere 12% success rate of the classical TD3 algorithm. This discrepancy may be ascribed to the reliance of classical DRL algorithms on a pure trial-and-error methodology for acquiring control strategies for manipulators, without the benefit of pre-existing kinematic model knowledge to inform the learning process. This absence of guidance makes the learning process more challenging and unstable, consequently reducing the success rate of the task.

3.3. Analysis of Simulation Results of 4-DOF Manipulator

Based on the existing experimental conditions, in Section 4, we utilize the trained FK-DRL algorithm and the classic DRL algorithm to control a real-world 4-DOF manipulator for intelligent control task testing. To compare with the experimental results in the real world and thereby comprehensively evaluate the performance of the FK-DRL and DRL algorithms, this study also conducted simulation training and testing on a 4-DOF manipulator intelligent control task. We set an early stop condition during the simulation process. The training ended when the reward curve of the algorithm converged and remained stable. The training processes for the DDPG, FK-DDPG, TD3, FK-TD3, SAC, and FK-SAC algorithms are depicted in Figure 12, Figure 13 and Figure 14, respectively.
Figure 15, Figure 16 and Figure 17 present the training duration, average reward, and task success rates for the FK-DDPG, FK-TD3, and FK-SAC algorithms versus the DDPG, TD3, and SAC algorithms. The results indicate that when controlling a 4-DOF manipulator, the FK-DDPG, FK-TD3, and FK-SAC algorithms maintain a significant advantage over the DDPG, TD3, and SAC algorithms in terms of convergence speed and control precision. This demonstrates the applicability of the proposed algorithms for intelligent control of manipulators with varying degrees of freedom.
However, the performance improvement of the FK-DDPG, FK-TD3, and FK-SAC algorithms was not as pronounced when controlling a 4-DOF manipulator as it was with a 7-DOF manipulator. This is attributed to the exponential growth in the state and action spaces as the number of degrees of freedom of the manipulator increases. The FK-DRL algorithm possesses an advantage in guiding policy exploration; therefore, the algorithm achieves greater improvements in convergence speed and control precision with higher DOF manipulators.

4. Experiments and Analysis

After validating the effectiveness of the proposed algorithm through simulations, this section deploys the trained FK-DRL algorithm onto a real-world experimental platform and compares it with classical DRL algorithms.

4.1. Experimental Platform

To evaluate the practical effectiveness of the algorithm presented in this paper, a 4-DOF manipulator identical to the one used in the simulation environment (see Figure 18) was employed in a physical setting. The gripper of the manipulator was configured as a fixed joint, and each of the four joints was driven by a UART serial bus servo rated at a torque of 4.5 N·m.
Prior to conducting the experiment, camera calibration was performed using Zhang’s calibration method [31,32] to establish a coordinate transformation matrix between the pixel coordinate system and the manipulator’s spatial coordinate system. In this study, the camera and the manipulator are not integrated, meaning the camera operates in an “eye-to-hand” configuration, and its pose and position remain constant throughout the experiment. During the experiment, the joint coordinates of the manipulator were obtained through the servo control program, the pose of the end effector was determined using Equation (3), and the target pose was calculated from the pixel coordinates of the blocks captured in the photographs taken by the camera through the established coordinate transformation matrix.
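For reference, the calibration and pixel-to-robot mapping described above can be sketched with OpenCV along the following lines; the checkerboard size, square size, image file names, and reference point pairs are illustrative placeholders, and the planar homography used for the final mapping is one possible realization of the coordinate transformation matrix, not necessarily the exact procedure used in the experiment.

```python
import cv2
import numpy as np

# Zhang's method: camera intrinsics from several checkerboard images (sizes are assumptions).
pattern = (9, 6)                    # inner-corner layout of the checkerboard
square = 0.02                       # square size in metres
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in ["calib_01.png", "calib_02.png", "calib_03.png"]:   # hypothetical file names
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
ret, K, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Fixed eye-to-hand camera: map pixels of the planar workspace to the robot base frame
# through a homography fitted on a few measured reference pairs (values are made up).
pixel_pts = np.array([[320, 240], [400, 240], [320, 320], [400, 320]], np.float32)
robot_pts = np.array([[0.30, 0.00], [0.35, 0.00], [0.30, 0.05], [0.35, 0.05]], np.float32)
H, _ = cv2.findHomography(pixel_pts, robot_pts)

def pixel_to_robot(u, v):
    """Map a pixel (u, v) of the target block to (x, y) in the manipulator base frame."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```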

4.2. Analysis of Experimental Results

The trained model was transferred to a real-world environment featuring a 4-DOF manipulator, and the process of the manipulator reaching the target pose was observed. Taking the FK-DDPG algorithm as an example, the sequence of the manipulator's movement towards the target pose is illustrated in Figure 19a–f, with the entire process taking 9 s.
To evaluate the performance of the proposed FK-DRL algorithm in a physical setting, we also conducted 100 trials and used the average reward and success rate as metrics to assess the experimental outcomes. The results are presented in Figure 20 and Figure 21.
Similar to the simulation results, the FK-DRL algorithm achieved higher average rewards and success rates compared to the classical DRL algorithms. Specifically, the task completion success rates for FK-DDPG, FK-TD3, and FK-SAC were 96%, 98%, and 94%, respectively, while the success rates for the conventional DDPG, TD3, and SAC algorithms were 89%, 44%, and 72%, respectively. However, there was a noticeable decrease in both average rewards and success rates in the physical environment compared to the simulation environment. This discrepancy can be attributed to real-world factors such as joint friction and manufacturing errors, which create inconsistencies between the simulation and the physical environments. This issue can typically be mitigated by further training the agent in the physical environment for a number of episodes, as suggested in the literature [33].
By comparing the experimental results of a 4-DOF manipulator in the real world with those in a simulated environment, we found that, compared to classical DRL algorithms, the task success rates of FK-DDPG, FK-TD3, and FK-SAC in the simulation increased by 6%, 48%, and 20%, respectively. In the real world, the success rates increased by 7%, 54%, and 22%, respectively. The results indicate that the proposed FK-DRL algorithms exhibited more significant stability improvements in the real world. This is attributed to the presence of various interferences in the real world, where classical DRL algorithms demonstrate poor adaptability in such dynamic environments. This further validates the stability and superiority of the proposed FK-DRL algorithms.

5. Conclusions

To address the issues of slow training convergence, low exploration efficiency, and poor stability in classic DRL algorithms when controlling redundant DOF manipulators, an efficient FK-DRL intelligent control algorithm is proposed for the intelligent control tasks of redundant DOF and 4-DOF manipulators. The control problem of the manipulator is transformed into an MDP problem. To improve the convergence and stability of DRL-based controllers, the easily established FK model of the manipulator is integrated into the training framework of the baseline DRL algorithm.
During the training process, the agent initially interacts with a 3-D simulation environment, continuously optimizing its decision-making strategy. After a period of pre-training, failed samples are extracted from the experience pool. The agent then inputs actions with added disturbance noise into the FK model for rapid trial-and-error learning, quickly optimizing the decision-making strategy. The FK-DRL algorithm effectively reduces the number of interactions between the agent and the 3-D simulation environment while obtaining a large number of successful samples during interactions with the FK model, significantly improving the training convergence speed and stability of the intelligent control algorithm.
For simulation verification, a joint simulation framework using CoppeliaSim and Python was constructed, and extensive simulation training and testing were conducted on 7-DOF and 4-DOF manipulators. To compare with the latest technological solutions, we selected the widely recognized DDPG, TD3, and SAC algorithms in various fields and compared them with the proposed FK-DDPG, FK-TD3, and FK-SAC algorithms. Compared to classic DRL algorithms, the FK-DDPG, FK-TD3, and FK-SAC algorithms significantly shortened training time and greatly improved task completion rates. Specifically, in the simulation experiments of the 7-DOF manipulator, the training times of FK-DDPG, FK-TD3, and FK-SAC algorithms were reduced by 21%, 18%, and 68%, respectively, and the task completion rates were increased by 21%, 87%, and 64%, respectively. In the simulation experiments of the 4-DOF manipulator, these algorithms reduced training time by 20%, 40%, and 65%, respectively, and increased task completion rates by 6%, 48%, and 20%, respectively.
Furthermore, we conducted experimental validation on a real 4-DOF manipulator and performed multiple experimental comparisons of the aforementioned algorithms. The results show that the task completion rates of the FK-DDPG, FK-TD3, and FK-SAC algorithms were improved by 7%, 54%, and 22%, respectively, further demonstrating the effectiveness and stability of the FK-DRL algorithm. Simulations and experiments indicate that integrating the FK model into the DRL algorithm’s training framework can not only significantly improve training convergence speed and exploration efficiency but also enhance the adaptability and stability of the algorithm in intelligent control tasks. Additionally, the FK-DRL algorithm avoids the complex processes of dynamic modeling and inverse kinematics solving, exhibits strong scalability, and achieves excellent control performance across manipulators with different DOFs by merely adjusting the easily established FK model of the manipulator, laying a theoretical foundation for more efficient and intelligent control strategies for redundant DOF manipulators.
However, due to the limitations of experimental conditions, this study only conducted control tests on a 4-DOF manipulator in the real world. In the future, we plan to expand the scope of experiments to include a 7-DOF physical manipulator and other physical robotic systems with different DOFs and structures to further validate the performance of the FK-DRL algorithm in real-world scenarios. This will help in more comprehensively assessing the scalability and stability of the FK-DRL algorithm.

Author Contributions

Conceptualization, S.S.; methodology, S.S. and Y.C.; software, Y.C.; validation, Y.C., S.S. and K.N.; funding acquisition, C.L.; resource, S.S.; writing—original draft preparation, Y.C.; writing—review and editing, C.L. and S.S.; project administration, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research Projects on Basic Sciences (Natural Sciences) in Higher Education Institutions of Jiangsu Province of China, grant number 23KJA460005.

Data Availability Statement

The study data can be obtained by email request to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tan, N.; Zhong, Z.; Yu, P.; Li, Z.; Ni, F. A Discrete Model-Free Scheme for Fault-Tolerant Tracking Control of Redundant Manipulators. IEEE Trans. Ind. Inform. 2022, 18, 8595–8606. [Google Scholar] [CrossRef]
  2. Tong, Y.; Liu, J.; Liu, Y.; Yuan, Y. Analytical inverse kinematic computation for 7-DOF redundant sliding manipulators. Mech. Mach. Theory 2021, 155, 104006. [Google Scholar] [CrossRef]
  3. Quan, Y.; Zhao, C.; Lv, C.; Wang, K.; Zhou, Y. The Dexterity Capability Map for a Seven-Degree-of-Freedom Manipulator. Machines 2022, 10, 1038–1059. [Google Scholar] [CrossRef]
  4. Ning, Y.; Li, T.; Du, W.; Yao, C.; Zhang, Y.; Shao, J. Inverse kinematics and planning/control co-design method of redundant manipulator for precision operation: Design and experiments. Robot. Comput.-Integr. Manuf. 2023, 80, 102457. [Google Scholar] [CrossRef]
  5. Sahbani, A.; El-Khoury, S.; Bidaud, P. An overview of 3D object grasp synthesis algorithms. Robot. Auton. Syst. 2012, 60, 326–336. [Google Scholar] [CrossRef]
  6. Crane, C.; Duffy, J.; Carnahan, T. A kinematic analysis of the space station remote manipulator system. J. Robot. Syst. 1991, 8, 637–658. [Google Scholar] [CrossRef]
  7. Xavier da Silva, S.; Schnitman, L.; Cesca, V. A Solution of the Inverse Kinematics Problem for a 7-Degrees-of-Freedom Serial Redundant Manipulator Using Grobner Bases Theory. Math. Probl. Eng. 2021, 2021, 6680687. [Google Scholar] [CrossRef]
  8. Gong, M.; Li, X.; Zhang, L. Analytical Inverse Kinematics and Self-Motion Application for 7-DOF Redundant Manipulator. IEEE Access 2019, 7, 18662–18674. [Google Scholar] [CrossRef]
  9. Marcos, M.; Machado, J.; Azevedo-Perdicoulis, T. Trajectory planning of redundant manipulators using genetic algorithms. Commun. Nonlinear Sci. Numer. Simul. 2009, 14, 2858–2869. [Google Scholar] [CrossRef]
  10. Xie, Z.; Jin, L. Hybrid Control of Orientation and Position for Redundant Manipulators Using Neural Network. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 2737–2747. [Google Scholar] [CrossRef]
  11. Yang, Q.; Jagannathan, S. Reinforcement Learning Controller Design for Affine Nonlinear Discrete-Time Systems using Online Approximators. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2012, 42, 377–390. [Google Scholar] [CrossRef] [PubMed]
  12. Perrusquia, A.; Yu, W.; Li, X. Multi-agent reinforcement learning for redundant robot control in task-space. Int. J. Mach. Learn. Cyber. 2021, 12, 231–241. [Google Scholar] [CrossRef]
  13. Lee, C.; An, D. AI-Based Posture Control Algorithm for a 7-DOF Robot Manipulator. Machines 2022, 10, 651. [Google Scholar] [CrossRef]
  14. Ramirez, J.; Yu, W. Reinforcement learning from expert demonstrations with application to redundant robot control. Eng. Appl. Artif. Intell. 2023, 119, 105753. [Google Scholar] [CrossRef]
  15. Xu, W.; Sen, W.; Xing, L. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5064–5078. [Google Scholar]
  16. Li, X.; Liu, H.; Dong, M. A General Framework of Motion Planning for Redundant Robot Manipulator Based on Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2022, 18, 5253–5263. [Google Scholar] [CrossRef]
  17. Calderón-Cordova, C.; Sarango, R.; Castillo, D.; Lakshminarayanan, V. A Deep Reinforcement Learning Framework for Control of Robotic Manipulators in Simulated Environments. IEEE Access 2024, 12, 103133–103161. [Google Scholar] [CrossRef]
  18. Zheng, L.; Wang, Y.; Yang, R.; Wu, S.; Guo, R.; Dong, E. An Efficiently Convergent Deep Reinforcement Learning-Based Trajectory Planning Method for Manipulators in Dynamic Environments. J. Intell. Robot. Syst. 2023, 107, 50. [Google Scholar] [CrossRef]
  19. Feng, Z.; Hou, Q.; Zheng, Y.; Ren, W.; Ge, J.Y.; Li, T.; Cheng, C.; Lu, W.; Cao, S.; Zhang, J.; et al. Method of artificial intelligence algorithm to improve the automation level of Rietveld refinement. Comput. Mater. Sci. 2019, 156, 310–314. [Google Scholar] [CrossRef]
  20. Cammarata, A.; Maddio, P.D.; Sinatra, R.; Belfiore, N.P. Direct Kinetostatic Analysis of a Gripper with Curved Flexures. Micromachines 2022, 13, 2172. [Google Scholar] [CrossRef]
  21. Corke, P. A simple and systematic approach to assigning Denavit-Hartenberg parameters. IEEE Trans. Robot. 2007, 23, 590–594. [Google Scholar] [CrossRef]
  22. Chen, P.; Lu, W. Deep reinforcement learning based moving object grasping. Inf. Sci. 2021, 565, 62–76. [Google Scholar] [CrossRef]
  23. Sadeghzadeh, M.; Calvert, D.; Abdullah, H. Autonomous visual servoing of a robot manipulator using reinforcement learning. Int. J. Robot. Autom. 2016, 31, 26–38. [Google Scholar] [CrossRef]
  24. Liu, Y.; Huang, C. DDPG-Based Adaptive Robust Tracking Control for Aerial Manipulators with Decoupling Approach. IEEE Trans. Cybern. 2022, 52, 8258–8271. [Google Scholar] [CrossRef] [PubMed]
  25. Kim, M.; Han, D.; Park, J.; Kim, J. Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay. Appl. Sci. 2020, 10, 575–589. [Google Scholar] [CrossRef]
  26. Chen, P.; Pei, J.; Lu, W.; Li, M. A deep reinforcement learning based method for real-time path planning and dynamic obstacle avoidance. Neurocomputing 2022, 497, 64–75. [Google Scholar] [CrossRef]
  27. Hassanpour, H.; Wang, X. A practically implementable reinforcement learning-based process controller design. Comput. Chem. Eng. 2024, 70, 108511. [Google Scholar] [CrossRef]
  28. Wang, C.; Wang, Y.; Shi, P.; Wang, F. Scalable-MADDPG-Based Cooperative Target Invasion for a Multi-USV System. IEEE Trans. Neural Netw. Learn. Syst. 2023, 2023, 3309689. [Google Scholar] [CrossRef]
  29. Bogaerts, B.; Sels, S.; Vanlanduit, S.; Penne, R. Connecting the CoppeliaSim robotics simulator to virtual reality. SoftwareX 2020, 11, 100426. [Google Scholar] [CrossRef]
  30. Su, S.; Chen, Y.; Li, C.; Ni, K.; Zhang, J. Intelligent Control Strategy for Robotic Manta Via CPG and Deep Reinforcement Learning. Drones 2024, 8, 323. [Google Scholar] [CrossRef]
  31. Rohan, A. Enhanced Camera Calibration for Machine Vision using OpenCV. IAES Int. J. Artif. Intell. (IJ-AI) 2014, 3, 136. [Google Scholar]
  32. Huang, B.; Zou, S. A New Camera Calibration Technique for Serious Distortion. Processes 2022, 10, 488. [Google Scholar] [CrossRef]
  33. Ju, H.; Juan, R.; Gomez, R.; Nakamura, K.; Li, G. Transferring policy of deep reinforcement learning from simulation to reality for robotics. Nat. Mach. Intell. 2022, 4, 1077–1087. [Google Scholar] [CrossRef]
Figure 1. The relative position between joint $\{i-1\}$ and joint $\{i\}$ in the joint coordinate system.
Figure 2. The main structure of the FK-DDPG algorithm.
Figure 3. 7-DOF manipulator simulation environment.
Figure 4. 4-DOF manipulator simulation environment.
Figure 5. Network model of the DDPG, FK-DDPG, TD3, and FK-TD3 algorithms. (a) Network model of Actor; (b) network model of Critic.
Figure 6. Training process of the FK-DDPG and DDPG algorithms for the 7-DOF manipulator.
Figure 7. Training process of the FK-TD3 and TD3 algorithms for the 7-DOF manipulator.
Figure 8. Training process of the FK-SAC and SAC algorithms for the 7-DOF manipulator.
Figure 9. Training time of various algorithms in the simulation environment for the 7-DOF manipulator.
Figure 10. Average rewards of various algorithms in the simulation environment for the 7-DOF manipulator.
Figure 11. Success rates of various algorithms in the simulation environment for the 7-DOF manipulator.
Figure 12. Training process of the FK-DDPG and DDPG algorithms for the 4-DOF manipulator.
Figure 13. Training process of the FK-TD3 and TD3 algorithms for the 4-DOF manipulator.
Figure 14. Training process of the FK-SAC and SAC algorithms for the 4-DOF manipulator.
Figure 15. Training time of various algorithms in the simulation environment for the 4-DOF manipulator.
Figure 16. Average rewards of various algorithms in the simulation environment for the 4-DOF manipulator.
Figure 17. Success rates of various algorithms in the simulation environment for the 4-DOF manipulator.
Figure 18. The 4-DOF manipulator in the real world.
Figure 19. Sequence of reaching the target pose of the 4-DOF manipulator.
Figure 20. Average rewards of various algorithms in the real-world environment for the 4-DOF manipulator.
Figure 21. Success rates of various algorithms in the real-world environment for the 4-DOF manipulator.
Table 1. D-H parameters of the 7-DOF manipulator.

i | $\alpha_{i-1}$ (°) | $a_{i-1}$ (m) | $d_i$ (m) | $\theta_i$ (°)
1 | 0 | 0 | 0.2039 | $\theta_1$
2 | 90 | 0 | 0 | $\theta_2$
3 | −90 | 0 | 0.2912 | $\theta_3$
4 | 90 | 0 | 0 | $\theta_4$
5 | −90 | 0 | 0.3236 | $\theta_5$
6 | 90 | 0 | 0 | $\theta_6$
7 | −90 | 0 | 0.0606 | $\theta_7$
8 | 0 | 0 | 0.1006 | 0
Table 2. D-H parameters of the 4-DOF manipulator.

i | $\alpha_{i-1}$ (°) | $a_{i-1}$ (m) | $d_i$ (m) | $\theta_i$ (°)
1 | 0 | 0 | 0.0445 | $\theta_1$
2 | 90 | 0.0025 | 0 | $\theta_2$
3 | 0 | 0.081 | 0 | $\theta_3$
4 | 0 | 0.0775 | 0 | $\theta_4$
5 | 0 | 0.126 | 0 | 0
Table 3. Training hyperparameter settings.

Parameter | Value
Learning rate of Actor network $\alpha_A$ | 1 × 10⁻⁴
Learning rate of Critic network $\alpha_C$ | 5 × 10⁻⁴
Discount rate $\gamma$ | 0.9
Replay buffer capacity | 10,000
Target network soft update factor $\tau$ | 0.005
Batch size | 32
Maximum number of training episodes E | 120,000
Interactions with the environment per episode $T_1$ | 16
Interactions with the FK model M per dynamic-planning round $T_2$ | 16
Dynamic-planning rounds per episode P | 5
