Article

DDPG-Based Adaptive Sliding Mode Control with Extended State Observer for Multibody Robot Systems

1 School of Mechanical Engineering, Pusan National University, Busan 46241, Republic of Korea
2 Department of Mechatronics Engineering, University of Engineering and Technology, Peshawar 25000, Pakistan
3 Department of Mechanical Engineering, Institute of Space Technology, Islamabad 44000, Pakistan
4 Department of Mechanical Engineering & Technology, Government College University, Faisalabad 37000, Pakistan
5 Department of Mechanical Engineering and Artificial Intelligence Research Center, College of Engineering and Information Technology, Ajman University, Ajman P.O. Box 346, United Arab Emirates
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Robotics 2023, 12(6), 161; https://doi.org/10.3390/robotics12060161
Submission received: 23 October 2023 / Revised: 21 November 2023 / Accepted: 22 November 2023 / Published: 26 November 2023
(This article belongs to the Special Issue Kinematics and Robot Design VI, KaRD2023)

Abstract

This research introduces a robust control design for multibody robot systems, incorporating sliding mode control (SMC) for robustness against uncertainties and disturbances. SMC achieves this by directing the system states toward a predefined sliding surface for finite-time stability. However, the challenge arises in selecting controller parameters, specifically the switching gain, as it depends on the upper bounds of the perturbations, including nonlinearities, uncertainties, and disturbances, impacting the system. Consequently, gain selection becomes challenging when the system dynamics are unknown. To address this issue, an extended state observer (ESO) is integrated with SMC, resulting in SMCESO, which treats the system dynamics and disturbances as perturbations and estimates them to compensate for their effects on the system response, ensuring robust performance. To further enhance system performance, deep deterministic policy gradient (DDPG) is employed to fine-tune SMCESO, utilizing both actual and estimated states as input states for the DDPG agent and for reward selection. This training process enhances both tracking and estimation performance. Furthermore, the proposed method is compared with optimal PID, SMC, and H∞ control in the presence of external disturbances and parameter variations. MATLAB/Simulink simulations confirm that, overall, SMCESO provides robust performance, especially with parameter variations, where the other controllers struggle to converge the tracking error to zero.

1. Introduction

The expanding capabilities of multibody robot systems in autonomous operation and their versatility in performing a wide range of tasks have garnered significant attention from both researchers and industry, emphasizing the persistent need for precision and reliability in their operations. As a result, multibody robot systems require robust control algorithms. However, controlling multibody robot dynamics can be a challenging task, especially when the robot dynamics are unknown. To this end, different robust control algorithms have been proposed, among which sliding mode control (SMC) has been of great interest due to its outstanding robustness against parametric uncertainties and external disturbances [1,2]. Subsequent developments resulted in different types of SMC, including integral SMC (ISMC) [3], super twisting SMC (STSMC) [4], terminal SMC (TSMC) [5], SMC with a nonlinear disturbance observer known as sliding perturbation observer (SMCSPO) [6], and SMC with extended state observer (SMCESO) [7].
This research addresses the robust control of multibody industrial robot systems with the aim of enhancing trajectory tracking performance. Therefore, we consider the nonlinear SMC with ESO (SMCESO) for the robot control. The ESO treats the system dynamics and external disturbances as a perturbation to the system. Therefore, with the ESO, the system is only affected by the perturbation estimation error because of the compensation provided by the ESO. Another advantage of the ESO is that it requires no system dynamics information and only uses partial state feedback (position) for estimating the states and the perturbation. Subsequently, the robustness of SMCESO depends on the quality of the ESO's estimation, which in turn depends on the selection of the control parameters. However, tuning the parameters manually is a challenging task. Therefore, optimal parameter selection can be achieved by adapting the parameters to different sliding conditions.
Various methods for adaptive SMC have been explored, including model-free adaptation, intelligent adaptation, and observer-based adaptation. A. J. Humaidi et al. introduced particle swarm optimization-based adaptive STSMC [8], in which the adaptation is carried out based on Lyapunov theory to guarantee global stability. Y. Wang and H. Wang introduced model-free adaptive SMC, initially estimating the unknown dynamics through the time delay estimation method [9,10]. Nevertheless, this approach exhibited undesirable chattering in the control input during experiments, which is deemed unacceptable in the present research. On the other hand, R.-D. Xi et al. presented adaptive SMC with a disturbance observer for robust robot manipulator control [11]. Observer-based adaptive SMC stands out for its ability to ensure robustness by minimizing the impact of lumped disturbances, a feature similarly emphasized by C. Jing et al. [12], who conclude that implementing a disturbance observer can lead to finite-time stability and a specified tracking performance quality. Furthermore, H. Zhao et al. introduced fuzzy SMC for robot manipulator trajectory tracking [13]. H. Khan et al. proposed extremum seeking (ES)-based adaptive SMCSPO for industrial robots [14], using a unique cost function consisting of the estimation error and error dynamics to guarantee accurate state and perturbation estimation. H. Razmi et al. proposed neural network-based adaptive SMC [15], and Z. Chen et al. presented radial basis function neural network (NN)-based adaptive SMC [16], both demonstrating commendable performance. However, it is worth noting that the systems under consideration in these studies were relatively smaller than the industrial robot in our current research. Furthermore, a model-free reinforcement learning algorithm known as deep deterministic policy gradient (DDPG) has been observed to provide optimal SMC parameters, enhancing performance through learning and adapting to different sliding patterns [17,18,19].
Considering the diverse literature, the model-free extremum seeking algorithm was initially a candidate. However, in the current study, the need to tune multiple (four different) parameters simultaneously led to the exploration of learning-based algorithms such as NN and DDPG for adapting the controller parameters. Notably, NN is well suited for simpler systems, while DDPG is preferred for complex, high-dimensional systems with unknown dynamics. DDPG is a model-free, online, off-policy reinforcement learning algorithm. It employs an actor–critic architecture, where the actor focuses on learning the optimal policy, while the critic approximates the Q-function [20]. The Q-function estimates the expected cumulative long-term reward for state–action pairs. The critic achieves this by minimizing the temporal difference error, which represents the disparity between the predicted Q-value and the target Q-value derived from environmental feedback. This process equips the critic to evaluate action quality in various states, guiding the actor in selecting actions that maximize the expected reward. Ensuring the convergence of the temporal difference error is a pivotal aspect of effective DDPG agent training.
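As a concrete illustration of the critic update described above, the short Python sketch below computes the temporal-difference (TD) target y = r + γQ′(s′, μ′(s′)) and the resulting mean squared TD error over a minibatch. The callables `critic`, `target_critic`, and `target_actor`, the discount factor value, and the toy transitions are illustrative stand-ins, not the networks or data used in this study.

```python
# Hedged sketch of the critic's temporal-difference target and loss.
# All callables and numbers below are illustrative stand-ins.
import numpy as np

GAMMA = 0.99  # assumed discount factor for this illustration


def td_target(reward, next_state, target_actor, target_critic):
    """y = r + gamma * Q'(s', mu'(s')): the value the critic is regressed toward."""
    next_action = target_actor(next_state)
    return reward + GAMMA * target_critic(next_state, next_action)


def critic_loss(batch, critic, target_actor, target_critic):
    """Mean squared TD error over a minibatch of (s, a, r, s') transitions."""
    errors = [td_target(r, s_next, target_actor, target_critic) - critic(s, a)
              for (s, a, r, s_next) in batch]
    return float(np.mean(np.square(errors)))


# Toy usage with scalar states/actions and linear stand-in "networks":
target_actor = lambda s: -0.5 * s            # mu'(s)
critic = lambda s, a: 1.0 * s + 0.5 * a      # Q(s, a)
target_critic = lambda s, a: 1.0 * s + 0.5 * a
batch = [(0.10, -0.05, 1.0, 0.08), (0.30, -0.15, -5.0, 0.25)]
print(critic_loss(batch, critic, target_actor, target_critic))
```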
The primary contribution of this study is the optimal tuning of SMCESO using the DDPG algorithm for a heavy-duty industrial robot manipulator with six degrees of freedom (DOF). Robust performance can be achieved by minimizing the estimation errors, ensuring accurate perturbation estimation and compensation. To accomplish this, the DDPG input states incorporate the tracking error, the estimation error, the current joint angle, and the estimated joint angle. A reward has been designed, integrating an overall error tolerance of 0.01 rad for both the tracking and estimation errors, yielding positive rewards if the errors are below this threshold. Conversely, if the errors exceed this threshold, negative rewards are assigned. Through this approach, the DDPG agent learns a control pattern based on the actual and estimated results, ultimately achieving optimal estimation and robust control performance. The proposed algorithm was implemented and compared with optimally tuned proportional-integral-derivative (PID) control, SMC, and H∞ control in an extensive MATLAB/Simulink simulation environment. The results demonstrate that SMCESO outperforms all three controllers, particularly in the presence of variable system parameters, as it effectively reduces the effect of the actual perturbations on the system performance.
The remainder of the manuscript is organized as follows: Section 2 describes the general multibody dynamics and formulates the SMC. Section 3 presents the ESO and the DDPG algorithm. Section 4 then presents the simulation environment and the results of the proposed algorithm, whereas Section 5 provides the conclusions.

2. Preliminaries

2.1. Multibody Dynamics Description

Consider the second-order multibody dynamics [14] as follows:
$$\ddot{x}_j = f_j(\mathbf{x}) + \Delta f_j(\mathbf{x}) + \sum_{i=1}^{n} \left( b_{ji}(\mathbf{x}) + \Delta b_{ji}(\mathbf{x}) \right) u_i + d_j(t), \qquad j = 1, \dots, n \tag{1}$$
where $\mathbf{x} = [x_1 \; \dots \; x_n]^T$ is the state vector representing the positions, and $f_j(\mathbf{x})$ and $\Delta f_j(\mathbf{x})$ are the linear dynamics and the dynamic uncertainties, respectively. Similarly, the control gain matrix and its uncertainties are represented by $b_{ji}(\mathbf{x})$ and $\Delta b_{ji}(\mathbf{x})$, respectively. $u_i$ and $d_j(t)$ are the control input and the external disturbance, respectively. Combining the system nonlinearities, dynamic uncertainties, and disturbances into the perturbation ($\psi$), it can be written as
$$\psi_j(\mathbf{x}, t) = \Delta f_j(\mathbf{x}) + \sum_{i=1}^{n} \Delta b_{ji}(\mathbf{x})\, u_i + d_j(t) \tag{2}$$
whereas it is assumed that the perturbation is bounded by an unknown continuous function $\Gamma > 0$, i.e., $|\psi_j(\mathbf{x},t)| \le \Gamma$, and, in addition, that it is smooth with the bounded derivative $|\dot{\psi}_j(\mathbf{x},t)| \le \bar{\Gamma}$.

2.2. Sliding Mode Control

The main concept of SMC is to design a sliding surface $\sigma$ in the state space (position $x_1$ and velocity $x_2$) [21], which is given as
$$\sigma = \dot{e} + c\,e \tag{3}$$
where $e = x - x_d$ is the tracking error, and $c > 0$ is a positive constant. Now, in order to drive the system dynamics, the state variables should converge to zero, i.e., $\lim_{t \to \infty} (\dot{e}, e) = 0$ asymptotically in the presence of perturbation. Therefore, SMC tends to bring the system states onto the sliding surface by means of the control force $u$. Subsequently, SMC has two phases: The first is the reaching phase, during which the system states are not on the sliding surface and require a switching control $u_{sw}$ to reach it. The second is the sliding phase, in which the system states have reached the sliding surface and now require a continuous control, generally known as the equivalent control $u_{eq}$, to remain on it; the overall control input is $u = u_{eq} + u_{sw}$. To compute the control input, the derivative of the sliding surface is defined as follows:
$$\dot{\sigma} = \ddot{e} + c\dot{e} = -K_{smc}\cdot \mathrm{sat}(\sigma) \tag{4}$$
where $K_{smc}$ represents the switching control gain, and $\mathrm{sat}$ is the saturation function with boundary layer thickness $\varepsilon_c$, given as
$$\mathrm{sat}(\sigma) = \begin{cases} \dfrac{\sigma}{|\sigma|}, & |\sigma| > \varepsilon_c \\[4pt] \dfrac{\sigma}{\varepsilon_c}, & |\sigma| \le \varepsilon_c \end{cases} \tag{5}$$
Assuming unknown system dynamics, $\ddot{x} = u$ is presumed. Substituting this condition, together with the error dynamics $\ddot{e} = \ddot{x} - \ddot{x}_d$, into (4) results in the following control input.
$$u = -K_{smc}\cdot \mathrm{sat}(\sigma) + \ddot{x}_d - c\dot{e} \tag{6}$$
Here, $-K_{smc}\cdot \mathrm{sat}(\sigma)$ denotes the switching control ($u_{sw}$), where the negative sign follows from the error convention. The remaining terms constitute the equivalent control ($u_{eq}$). Subsequently, taking the derivative of the sliding surface for the system disturbed by the perturbation (such as (10) in the subsequent section) yields (7):
$$\dot{\sigma} = \ddot{x} - \ddot{x}_d + c\dot{e} = u + \psi(\mathbf{x},t) - \ddot{x}_d + c\dot{e} \tag{7}$$
Substituting the control law from (6) and solving results in (8):
$$\dot{\sigma} = -K_{smc}\cdot \mathrm{sat}(\sigma) + \psi(\mathbf{x},t) \tag{8}$$
Equation (8) shows that in SMC, the sliding surface is affected by the perturbation. Once the system states have reached the sliding phase, the relationship between the sliding surface and the perturbation is given as the following transfer function [6].
$$\frac{\sigma}{\psi(\mathbf{x},t)} = \frac{1}{p + K_{smc}/\varepsilon_c} \tag{9}$$
where $p$ is the s-domain (Laplace) variable. Increasing the boundary layer thickness decreases the break frequency $K_{smc}/\varepsilon_c$, making the system less sensitive to higher-frequency perturbations. However, as $\sigma \to 0$, a larger boundary layer thickness reduces controller performance, leading to a higher tracking error, whereas a very small boundary layer (a tightly bounded sliding surface) causes chattering.
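To make the control law concrete, the following Python sketch evaluates the boundary-layer SMC input of (3), (5), and (6) for a single joint, using the $e = x - x_d$ convention of this section. The gain values follow the SMC row of Table 2, and the example call at the end is purely illustrative.

```python
# Hedged sketch of the boundary-layer SMC law in (3), (5), and (6).
import math


def sat(sigma, eps_c):
    """Saturation function (5): sign-like outside the boundary layer, linear inside."""
    return sigma / abs(sigma) if abs(sigma) > eps_c else sigma / eps_c


def smc_control(x, x_dot, x_d, x_d_dot, x_d_ddot, K_smc=300.0, c=35.0, eps_c=0.5):
    """u = -K_smc*sat(sigma) + x_d_ddot - c*e_dot, with sigma = e_dot + c*e and e = x - x_d."""
    e = x - x_d
    e_dot = x_dot - x_d_dot
    sigma = e_dot + c * e
    return -K_smc * sat(sigma, eps_c) + x_d_ddot - c * e_dot


# Example: one control evaluation while tracking x_d = sin(t) at t = 0.1 s
t = 0.1
print(smc_control(x=0.05, x_dot=0.8,
                  x_d=math.sin(t), x_d_dot=math.cos(t), x_d_ddot=-math.sin(t)))
```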

2.3. Problem Formulation

Calculating the dynamics of a multibody robot system is a challenging task, further compounded by inaccurate dynamic parameters, which introduce uncertainties. Therefore, for the subsequent analysis, considering the complete dynamic model as the perturbation and taking $b = 1$, the resulting dynamics are as follows.
$$\ddot{x} = \psi(\mathbf{x},t) + u \tag{10}$$
Subsequently, to ensure the sliding condition outside the boundary layer, the sliding dynamics can be written as
$$\dot{\sigma} = c\dot{e} + \psi(\mathbf{x},t) + u, \qquad \sigma(0) = \sigma_o \tag{11}$$
For the asymptotic stability of (11) about the equilibrium point, the condition $\dot{V} < 0$ for $\sigma \neq 0$ must be satisfied [22], where $V = \tfrac{1}{2}\sigma^2$ is the Lyapunov candidate. The derivative of $V$ is computed as
$$\dot{V} = \sigma\dot{\sigma} = \sigma\left(c\dot{e} + \psi(\mathbf{x},t) + u\right) \tag{12}$$
Taking $\eta = c\dot{e} + u$ in (12) will result in
$$\dot{V} = \sigma\dot{\sigma} = \sigma\left(\psi(\mathbf{x},t) + \eta\right) = \sigma\cdot\psi(\mathbf{x},t) + \sigma\cdot\eta \tag{13}$$
$$\dot{V} \le |\sigma|\,\Gamma + \sigma\cdot\eta \tag{14}$$
Selecting $\eta = -K_{smc}\cdot\mathrm{sat}(\sigma)$, with $K_{smc} > 0$, (14) becomes
$$\dot{V} \le |\sigma|\,\Gamma - |\sigma|\cdot K_{smc} = -|\sigma|\left(K_{smc} - \Gamma\right) \tag{15}$$
Consequently, the overall control input becomes
$$u = -\left(K_{smc}\cdot\mathrm{sat}(\sigma) + c\dot{e}\right) \tag{16}$$
Equation (15) further emphasizes that, for stability, $K_{smc} > \Gamma$ must hold to satisfy the Lyapunov condition. However, obtaining information about $\Gamma$ can be a complex and tedious task.

3. Proposed Algorithm

There are two concerns. First, based on (9), the perturbation affects the sliding dynamics, while the correct dynamics are unknown. Therefore, a perturbation observer is used to estimate and compensate for the actual perturbation effects. For this purpose, an extended state observer (ESO) is implemented, which offers the advantage of not requiring system dynamics information. Secondly, the control gains of the SMC and ESO are optimally tuned to stabilize the system in finite time, ensuring that both the tracking and estimation errors converge to zero. For this purpose, the deep deterministic policy gradient (DDPG) algorithm is employed for control gain tuning.

3.1. Extended State Observer

The ESO provides real-time estimates of the unmeasured system states and the perturbation, which is the combination of modelled and unmodelled dynamics and external disturbances, enhancing control system performance and robustness. This means the ESO treats the system's linear and nonlinear dynamics as part of the perturbation and estimates them [23]. Consequently, only the control input $u$ in (1) is assumed to be known. Furthermore, the ESO does not require system dynamics information and uses only partial state feedback (position) for estimation. In addition to the system states (position $x_1$ and velocity $x_2$), an extended state $x_3$ is introduced as
$$x_3 = \psi(\mathbf{x},t) = f(\mathbf{x}) + \Delta f(\mathbf{x}) + \sum_{i=1}^{n} \Delta b_i(\mathbf{x})\,u + d(t), \qquad |x_3| \le \Gamma \tag{17}$$
Subsequently, the system dynamics in (1) can be simplified as
$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = x_3 + u, \qquad \dot{x}_3 = \dot{\psi}(\mathbf{x},t) \tag{18}$$
With the new system information, the mathematical model of nonlinear ESO [7] is then written as
$$\dot{\hat{x}}_1 = \hat{x}_2 + l_1\cdot\rho(\tilde{x}_1), \qquad \dot{\hat{x}}_2 = \hat{x}_3 + u + l_2\cdot\rho(\tilde{x}_1), \qquad \dot{\hat{x}}_3 = l_3\cdot\rho(\tilde{x}_1) \tag{19}$$
where the components with $\hat{\ }$ and $\tilde{\ }$ represent the estimated states and the errors between the actual and estimated values, respectively, e.g., $\tilde{x}_1 = x_1 - \hat{x}_1$. $\rho$ is the saturation function, which is selected as
$$\rho(\tilde{x}_1) = \begin{cases} \dfrac{\tilde{x}_1}{|\tilde{x}_1|}, & |\tilde{x}_1| > \varepsilon_o \\[4pt] \dfrac{\tilde{x}_1}{\varepsilon_o}, & |\tilde{x}_1| \le \varepsilon_o \end{cases} \tag{20}$$
$\varepsilon_o$ is the boundary layer of the ESO, chosen such that the estimation error satisfies $|\tilde{x}_1| \le \varepsilon_o$. The estimation error dynamics are calculated as
$$\dot{\tilde{x}}_1 = \tilde{x}_2 - l_1\cdot\rho(\tilde{x}_1), \qquad \dot{\tilde{x}}_2 = \tilde{x}_3 - l_2\cdot\rho(\tilde{x}_1), \qquad \dot{\tilde{x}}_3 = \dot{\psi}(\mathbf{x},t) - l_3\cdot\rho(\tilde{x}_1) \tag{21}$$
As the estimation error should remain bounded within the boundary layer, (21) can be rewritten as follows.
$$\dot{\tilde{x}}_1 = \tilde{x}_2 - l_1\cdot\frac{\tilde{x}_1}{\varepsilon_o}, \qquad \dot{\tilde{x}}_2 = \tilde{x}_3 - l_2\cdot\frac{\tilde{x}_1}{\varepsilon_o}, \qquad \dot{\tilde{x}}_3 = \dot{\psi}(\mathbf{x},t) - l_3\cdot\frac{\tilde{x}_1}{\varepsilon_o} \tag{22}$$
Subsequently, the state space of the error dynamics can be written as
$$\dot{\tilde{\mathbf{x}}} = A\tilde{\mathbf{x}} + E\,\dot{\psi}(\mathbf{x},t) \tag{23}$$
where
$$A = \begin{bmatrix} -l_1/\varepsilon_o & 1 & 0 \\ -l_2/\varepsilon_o & 0 & 1 \\ -l_3/\varepsilon_o & 0 & 0 \end{bmatrix}, \qquad E = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{24}$$
The characteristic equation of A can be calculated as follows:
$$|sI - A| = \begin{vmatrix} s + l_1/\varepsilon_o & -1 & 0 \\ l_2/\varepsilon_o & s & -1 \\ l_3/\varepsilon_o & 0 & s \end{vmatrix} = s^3 + \frac{l_1}{\varepsilon_o} s^2 + \frac{l_2}{\varepsilon_o} s + \frac{l_3}{\varepsilon_o} \tag{25}$$
The error dynamics are stable if the gains $l_1$, $l_2$, and $l_3$ are positive. Therefore, these gains are selected using the pole placement method, placing all three observer poles at $-\lambda$:
$$(s + \lambda)^3 = s^3 + 3\lambda s^2 + 3\lambda^2 s + \lambda^3 \tag{26}$$
Comparing the coefficients of (25) and (26) results in the following selection of gains:
$$l_1 = 3\lambda\,\varepsilon_o, \qquad l_2 = 3\lambda^2\,\varepsilon_o, \qquad l_3 = \lambda^3\,\varepsilon_o \tag{27}$$
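The observer (19)-(20), with its gains chosen by the pole-placement rule (27), can be sketched as a simple discrete-time update driven only by the measured position and the applied input. The Euler integration, the 5 ms step, and the class structure below are illustrative assumptions; λ = 137.31 and ε_o = 1 follow the SMCESO row of Table 2.

```python
# Hedged discrete-time sketch of the nonlinear ESO in (19)-(20) with gains from (27).
import numpy as np


class ExtendedStateObserver:
    def __init__(self, lam=137.31, eps_o=1.0, dt=0.005):
        # l1 = 3*lam*eps_o, l2 = 3*lam^2*eps_o, l3 = lam^3*eps_o  (Equation (27))
        self.l = np.array([3 * lam * eps_o, 3 * lam**2 * eps_o, lam**3 * eps_o])
        self.eps_o = eps_o
        self.dt = dt
        self.x_hat = np.zeros(3)  # estimates of [position, velocity, perturbation]

    def rho(self, x1_tilde):
        """Saturation function (20) applied to the position estimation error."""
        if abs(x1_tilde) > self.eps_o:
            return x1_tilde / abs(x1_tilde)
        return x1_tilde / self.eps_o

    def update(self, x1_measured, u):
        """One Euler step of (19); only the measured position and the input u are needed."""
        rho = self.rho(x1_measured - self.x_hat[0])
        dx_hat = np.array([
            self.x_hat[1] + self.l[0] * rho,      # x1_hat_dot
            self.x_hat[2] + u + self.l[1] * rho,  # x2_hat_dot
            self.l[2] * rho,                      # x3_hat_dot (perturbation estimate)
        ])
        self.x_hat = self.x_hat + self.dt * dx_hat
        return self.x_hat  # [x1_hat, x2_hat, psi_hat]


eso = ExtendedStateObserver()
print(eso.update(x1_measured=0.01, u=0.0))
```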

3.2. Extended State Observer-Based Sliding Mode Control (SMCESO)

For enhanced system performance, the final control input $u_o$, combining the estimated perturbation $\hat{\psi}(\mathbf{x},t)$ from the ESO with the switching control from the SMC, can be written as
$$u_o = u - \hat{x}_3 = u - \hat{\psi}(\mathbf{x},t) \tag{28}$$
where u is from (16). Consequently, the system dynamics from (10) can be rewritten as follows.
$$\ddot{x} = u_o + \psi(\mathbf{x},t) = u - \hat{\psi}(\mathbf{x},t) + \psi(\mathbf{x},t) = u + \tilde{\psi}(\mathbf{x},t) \tag{29}$$
where $\tilde{\psi}(\mathbf{x},t) = \psi(\mathbf{x},t) - \hat{\psi}(\mathbf{x},t)$ is the perturbation estimation error. Now, it is evident that, with the ESO, the system is affected only by the perturbation estimation error instead of the actual perturbation. Since $|\tilde{\psi}(\mathbf{x},t)| \ll |\psi(\mathbf{x},t)|$, the ESO-based SMC is more stable than SMC alone. Subsequently, the Lyapunov bound in (15) becomes
$$\dot{V} \le -|\sigma|\left(K_{smc} - |\tilde{\psi}(\mathbf{x},t)|\right) \tag{30}$$
The stability of SMCESO under the Lyapunov condition $\sigma\dot{\sigma} \le 0$ can be verified as follows:
$$\sigma\dot{\sigma} = \sigma\left(\ddot{e} + c\dot{e}\right) = \sigma\left(\ddot{x} - \ddot{x}_d + c\dot{e}\right) \le 0 \tag{31}$$
With the system dynamics and combined control input from (6) and (28), according to (7), this will result in the following condition:
$$\sigma\dot{\sigma} = \sigma\left(-K_{smc}\cdot\mathrm{sat}(\sigma) + \ddot{x}_d - c\dot{e} - \hat{\psi}(\mathbf{x},t) + \psi(\mathbf{x},t) - \ddot{x}_d + c\dot{e}\right) \le 0 \tag{32}$$
Simplifying (32) yields (33):
$$\sigma\dot{\sigma} = \sigma\left(-K_{smc}\cdot\mathrm{sat}(\sigma) + \tilde{\psi}(\mathbf{x},t)\right) \le 0 \tag{33}$$
Subsequently, to keep the system stable, the control gain must satisfy the following condition.
$$K_{smc} > |\tilde{\psi}(\mathbf{x},t)| \tag{34}$$
The switching gain required to satisfy (34) is small in comparison with the conventional gain required by (15), since $|\tilde{\psi}(\mathbf{x},t)| \ll \Gamma$. The reduced gain results in smoother switching control, eliminating chattering and improving performance. Furthermore, the control parameters $K_{smc}$, $c$, $\varepsilon_c$, and $\lambda$ are optimally tuned using DDPG to reduce the manual tuning effort.
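A minimal sketch of the resulting SMCESO input, combining the SMC law of (6) with the perturbation compensation of (28), is given below. The gain values follow the SMCESO row of Table 2; `psi_hat` stands for the ESO estimate $\hat{x}_3$, and the call at the end is purely illustrative.

```python
# Hedged sketch of the combined SMCESO input u_o = u - psi_hat (Equations (6) and (28)).
def sat(sigma, eps_c):
    """Boundary-layer saturation function (5)."""
    return sigma / abs(sigma) if abs(sigma) > eps_c else sigma / eps_c


def smceso_control(e, e_dot, x_d_ddot, psi_hat, K_smc=50.0, c=30.0, eps_c=0.5):
    """u_o = -K_smc*sat(sigma) + x_d_ddot - c*e_dot - psi_hat, with sigma = e_dot + c*e."""
    sigma = e_dot + c * e
    return -K_smc * sat(sigma, eps_c) + x_d_ddot - c * e_dot - psi_hat


print(smceso_control(e=0.02, e_dot=-0.1, x_d_ddot=0.0, psi_hat=1.5))
```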

3.3. DDPG-Based SMCESO

Deep deterministic policy gradient (DDPG) is a reinforcement learning algorithm designed for solving continuous action space problems. It combines elements of deep neural networks and the deterministic policy gradient theorem to achieve remarkable performance in control tasks. DDPG employs an actor–critic architecture, with the actor network modeling the policy and the critic network estimating the state–action value function. A key innovation in DDPG is the use of target networks to stabilize training, with periodic updates to slowly track the learned networks. This approach, coupled with experience replay, enables stable and efficient learning, making DDPG a prominent choice for complex, high-dimensional control problems.
Similar to other reinforcement learning algorithms, the DDPG algorithm operates within the framework of a Markov decision process (MDP) [24], denoted by $(S, A, P, R)$, where $S$ and $A$ represent the environment's state space and the agent's action space, respectively, and $P$ signifies the probability of state transitions. During agent training, the reward function $R$ serves as the training target. In essence, while training the agent, the system's state $s \in S$ is observed, and the associated reward $r \in R$ is acquired. Subsequently, the optimal policy $\pi(a|s)$ is determined through maximizing the state–action value function.
$$Q(s,a) = \mathbb{E}\left[R_c \mid S_t = s,\ A_t = a\right] \tag{35}$$
where $R_c$ represents the cumulative reward, $R_c = \sum_{k=0}^{\infty} \gamma^k r_{k+1}$, and $0 \le \gamma \le 1$ is the discount factor that reflects the importance of reward values at future moments. To enhance the controller performance, DDPG has to learn the regulation strategy $\mu$ (actor network) and calculate the probability of each action. Consequently, the controller parameters are updated in real time to maximize the total reward [25,26].
$$\max_{\mu} \sum_{k=0}^{\infty} \gamma^k r_k\left(x_1(k), \hat{x}_1(k)\right) \qquad \text{s.t.}\quad \theta_k = \mu(K_{smc}, c, \varepsilon_c, \lambda), \quad \theta_{min} \le \theta_k \le \theta_{max} \tag{36}$$
$\theta_k$ is the set of action parameters, with minimum limit $\theta_{min}$ and maximum limit $\theta_{max}$. The structure of DDPG is presented in Figure 1. The selection of a suitable state space is crucial for ensuring the convergence of reinforcement learning. In the context of the present problem, the chosen state space should inherently relate to the robot's position and its estimated dynamics. As a result, for computational efficiency and enhanced learning, the state space is defined as $S = \{x(k), \hat{x}(k)\}$, and the state vector is defined as $s_k = [x_1, \hat{x}_1, e, \tilde{x}_1]$.
The actor–critic value networks for the robot system are established as a double-layer structure comprising a target network and a main network. The replay buffer stores data in the form $(s_k, a_k, r_k, s_{k+1})$, which is used for network training. The main and target networks share the same structure but differ in their parameters. The actor network is denoted by $a_k = \mu(s_k | \theta^\mu)$, with $\theta^\mu$ as the network parameters. The critic network is denoted as $Q(s_k, a_k | \theta^Q)$, with network parameters $\theta^Q$. During training, small batches of samples $(s_i, a_i, r_i, s_{i+1})$ are randomly selected from the replay buffer for learning. In brief, the training process involves the four networks to ensure that actions generated by the actor network can be used as input to the critic network to maximize the state–action value function in (35). The training process is given in Algorithm 1.
Algorithm 1: Training the DDPG Agent
    Initialize the main networks μ(s_k | θ^μ) and Q(s_k, a_k | θ^Q) randomly.
    Initialize the target networks μ′ and Q′ with weights θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q.
    Initialize the replay buffer R.
    while ep ≤ ep_max
        Randomly initialize the exploration noise process N.
        Receive the initial state s_k.
        while k < k_max
            a_k = μ(s_k | θ^μ) + N.
            Execute the environment to obtain the reward r_k and the next state s_{k+1}.
            Store (s_k, a_k, r_k, s_{k+1}) in the replay buffer R.
            Sample a random minibatch of m transitions (s_i, a_i, r_i, s_{i+1}) from R.
            Set the target y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}).
            Update the critic by minimizing the loss J = (1/m)·Σ_i (y_i − Q(s_i, a_i | θ^Q))².
            Update the actor using the sampled policy gradient ∇_{θ^μ}J ≈ (1/m)·Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}.
            Update the target networks with the soft update θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′}, θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′} (see the soft-update sketch after this listing).
            if isdone == 1
                Reset the environment.
            end if
        end while
        if r_average ≥ r_stopping
            Stop training.
        end if
    end while
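As a small illustration of the soft target-network update in Algorithm 1, the sketch below applies θ′ ← τθ + (1 − τ)θ′ to parameters stored as plain NumPy arrays. The dictionary representation and numeric values are illustrative assumptions; τ = 1 × 10⁻³ matches the target smoothing factor in Table 1.

```python
# Hedged sketch of the soft target-network update from Algorithm 1.
import numpy as np


def soft_update(target_params, main_params, tau=1e-3):
    """theta' <- tau*theta + (1 - tau)*theta' for every parameter tensor."""
    for name, theta in main_params.items():
        target_params[name] = tau * theta + (1.0 - tau) * target_params[name]


main = {"w": np.array([0.4, -0.2]), "b": np.array([0.1])}
target = {"w": np.zeros(2), "b": np.zeros(1)}
soft_update(target, main)
print(target["w"], target["b"])   # slowly tracks the main network
```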
The block diagram of the DDPG-based SMCESO is presented in Figure 2. For robust performance, the tracking error should be eliminated and the estimation should be accurate, i.e., $\tilde{x} \to 0$, so that the true perturbation is estimated and well compensated. Therefore, the reward function for the current study is designed as follows.
$$r = R_1 + R_2 + R_3, \qquad
R_1 = \begin{cases} 1, & |e| \le e_{tol} \\ -5, & |e| > e_{tol} \end{cases} \qquad
R_2 = \begin{cases} 1, & |\tilde{x}_1| \le \tilde{x}_{1,tol} \\ -5, & |\tilde{x}_1| > \tilde{x}_{1,tol} \end{cases} \qquad
R_3 = \begin{cases} -100, & |x_1| \ge x_{1,stop} \\ 0, & |x_1| < x_{1,stop} \end{cases} \tag{37}$$
where $e_{tol}$ is the error tolerance for accepting good tracking performance. Similarly, $R_2$ rewards good ESO performance, with $\tilde{x}_{1,tol}$ as the estimation error tolerance. $R_3$ implements the stopping condition ($isdone$ in Algorithm 1), penalizing the case in which the robot becomes unstable and exceeds the movement limit $x_{1,stop}$.
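The reward (37) translates directly into a small Python function, sketched below. The 0.01 rad tolerance follows the text; the estimation tolerance and the stop limit values are illustrative assumptions.

```python
# Hedged transcription of the reward function (37); threshold values other than
# e_tol = 0.01 rad are illustrative assumptions.
def reward(e, x1_tilde, x1, e_tol=0.01, x1_tilde_tol=0.01, x1_stop=3.14):
    """r = R1 + R2 + R3 from Equation (37)."""
    r1 = 1.0 if abs(e) <= e_tol else -5.0                 # tracking-error term
    r2 = 1.0 if abs(x1_tilde) <= x1_tilde_tol else -5.0   # estimation-error term
    r3 = -100.0 if abs(x1) >= x1_stop else 0.0            # stop condition (isdone)
    return r1 + r2 + r3


print(reward(e=0.005, x1_tilde=0.002, x1=0.8))   # -> 2.0: both errors within tolerance
```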

4. Simulations and Discussion

This section provides details about the simulation system and the environment. It also includes the presentation of results and their subsequent discussion.

4.1. System and Environment Description

For the DDPG implementation, a simulation environment is created in MATLAB/Simulink, featuring an object pick-and-place task using the Simscape Multibody model of the KUKA KR 16 S industrial robot arm, as presented in Figure 2. The KR 16 is a six-degrees-of-freedom (DOF), high-speed, heavy-duty industrial robot arm with a substantial payload capacity. Demonstrating robust performance with such a robot validates the efficiency of the proposed method. Consequently, the robot must exhibit robust performance and a minimal tracking error in the presence of nonlinear dynamics. The sampling time for the DDPG algorithm is set to 0.5 s, while the control algorithm operates with a sampling time of 5 ms. The computations are carried out on a computer equipped with an Intel i7 processor and an RTX 3090 Ti GPU.

4.2. Simulations

Simulations are conducted in two phases. The first is the implementation of the proposed algorithm on a simple linear system to explain the basic workings of the ESO. The second is the implementation on the multibody dynamics of the robot arm, with a sine wave as the desired position. For the simulations, the DDPG hyperparameters are presented in Table 1.

4.2.1. Simple System Implementation

For a simple linear system, consider the following second-order dynamics.
$$\ddot{x} = u_o - b\dot{x} - kx + d(t), \qquad d(t) = a\cdot\sin(t) \tag{38}$$
$$\psi(x,t) = -10\dot{x} - 50x + d(t) \tag{39}$$
where $a$ is the magnitude of the disturbance $d(t)$, $b = 10$ is the damping coefficient, and $k = 50$ is the stiffness. The performance of the DDPG-based SMCESO has been compared with SMC, proportional-integral-derivative (PID) control optimally tuned using the Control System Tuner toolbox in Simulink, and H∞ control [27]. The control gains are provided in Table 2, and the trajectory tracking error is shown in Figure 3.
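Before discussing the results, the following self-contained Python sketch reproduces the structure of this test: the plant (38)-(39) under the disturbance d(t) = 10 sin(t), a step reference, the ESO (19) with the pole-placement gains (27), and the SMCESO input (28). The Euler integration, the 5 ms control period, the horizon, and the initial conditions are assumptions made for illustration; the gains follow the SMCESO row of Table 2.

```python
# Hedged closed-loop sketch of SMCESO on the simple system (38)-(39) with a step reference.
import numpy as np

# Plant and disturbance parameters from (38)-(39)
b, k, a = 10.0, 50.0, 10.0
# SMCESO gains (Table 2) and ESO gains from the pole-placement rule (27)
K_smc, c, eps_c, lam, eps_o = 50.0, 30.0, 0.5, 137.31, 1.0
l1, l2, l3 = 3 * lam * eps_o, 3 * lam**2 * eps_o, lam**3 * eps_o

dt, T = 0.005, 5.0                 # 5 ms control period, 5 s horizon (assumed)
x, x_dot = 0.0, 0.0                # plant states
x_hat = np.zeros(3)                # ESO states [x1_hat, x2_hat, psi_hat]
x_d = 1.0                          # step reference (x_d_dot = x_d_ddot = 0)
u_o = 0.0

sat = lambda s, eps: s / abs(s) if abs(s) > eps else s / eps

for step in range(int(T / dt)):
    t = step * dt
    # ESO update (19)-(20), driven by the measured position and the applied input
    rho = sat(x - x_hat[0], eps_o)
    x_hat = x_hat + dt * np.array([x_hat[1] + l1 * rho,
                                   x_hat[2] + u_o + l2 * rho,
                                   l3 * rho])
    # SMCESO control (6) and (28), using the estimated velocity (partial state feedback)
    e, e_dot = x - x_d, x_hat[1]
    sigma = e_dot + c * e
    u_o = -K_smc * sat(sigma, eps_c) - c * e_dot - x_hat[2]
    # Plant integration (38)
    psi = -b * x_dot - k * x + a * np.sin(t)
    x, x_dot = x + dt * x_dot, x_dot + dt * (u_o + psi)

print(f"final tracking error: {x - x_d:+.5f}")
print(f"final perturbation estimation error: {psi - x_hat[2]:+.5f}")
```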
The error results of the step response in Figure 3a reveal that, when a disturbance (a = 10) is present, the three controllers other than SMCESO demonstrate good performance with high control gains but fail to fully converge the error to zero. In contrast, SMCESO effectively estimates and compensates for the perturbation, as depicted in Figure 4, leading to error convergence toward zero. Moreover, as anticipated in Section 3.2, the SMCESO switching gain $K_{smc}$, tuned using the DDPG algorithm, is notably smaller than the conventional SMC gain (Table 2). Additionally, the algorithms underwent testing with parameter changes, where the stiffness was varied as $k = 50 \pm 8$. These variations were introduced using the Simulink random number block with a variance of 20. The tracking errors for variable stiffness are presented in Figure 3b, illustrating that PID exhibits the maximum deviation, while H∞ outperforms PID. However, SMC now surpasses H∞ due to the model mismatch between the actual system and the dynamics used for controller synthesis. Finally, SMCESO outperforms all three controllers by maintaining the error very close to zero. This validates that SMCESO effectively estimates system uncertainties and compensates for their effects on the system response, resulting in robust performance.

4.2.2. Adaptive SMCESO with Multibody Robot

With the multibody robot system, the DDPG agent has been trained to fine-tune the controller parameters. For the controller evaluation, Joint 2 ($q_2$) of the robot manipulator has been considered, as it holds the maximum weight of the robot against gravity. Therefore, the robot arm is fully extended, and only $q_2$ is moving. The desired trajectory is defined as $q_{2,d} = \sin(w\cdot t)$, with initial frequency $w_o = 1$, which resets after every episode as $w = 1 + \mathrm{rand}[-0.5, 0.5]$. Furthermore, the total simulation time is 10 s, with an ideal reward of $r_{max} = 210$. The training stops when the average reward reaches $r_c \ge 199$, considering the average reward window length. The DDPG agent took 343 episodes to train. The episode and cumulative rewards are presented in Figure 5, the tuned parameters are shown in Figure 6, and the trajectory tracking error and joint torques are shown in Figure 7.
The joints were equipped with electromechanical motor dynamics, with the motor parameters given in Table 3. Consequently, both control algorithms (SMC and SMCESO) can achieve joint tracking errors within the range of ±1 degree. However, it is evident from the control input that SMC has sudden spikes throughout the simulation. Reducing the gains can eliminate these spikes but degrades the control performance, resulting in larger errors. Similarly, to reduce the error of SMC, higher gains (more than double those of SMCESO) are required, which in turn increases the spikes and occasionally introduces chattering in the response. In contrast, SMCESO shows very smooth performance and keeps the error within the range of ±0.1 degree. This validates the robustness of SMC integrated with the ESO, which overcomes the perturbation effects on joint 2 of the system with a total mass of m > 55 kg. Overall, the initial jump in the control input is primarily attributed to motor dynamics such as friction, which stabilize once the robot starts moving. Moreover, for a deeper understanding of how the robust performance is achieved, the estimated states are shown in Figure 8.
The position and velocity results show that the state observer performs very well, with nearly zero estimation error. This indicates highly effective perturbation estimation and compensation, which enhances the tracking performance. Moreover, the Simscape Multibody toolbox allows the dynamic components of the robot system to be obtained, including the mass matrix $M(q)$, the velocity-product torque $C(q,\dot{q})\cdot\dot{q}$ (with $C(q,\dot{q})$ containing the Coriolis terms), and the gravitational torque $G(q)$. This is achieved by first creating the rigid body tree and then utilizing the Manipulator Algorithm library from the Robotics System Toolbox. Subsequently, similar to (10), the expected perturbation is presumed as
$$\psi(\mathbf{x},t) = C(q,\dot{q})\cdot\dot{q} + G(q) \tag{40}$$
The assumed and estimated perturbations are presented in Figure 9, below.
The estimated perturbation closely aligns with the assumed perturbation. With the desired trajectory being a sine wave, the velocity changes continuously, leading to some perturbation estimation error, which is expected because the motor dynamics are not factored into the perturbation calculation. However, this error can be compensated by the SMC term in Equation (16), further validating the claim in Equation (29) that, with the ESO, the system dynamics are primarily influenced by the perturbation estimation error. From a magnitude perspective, it is evident that the perturbation estimation error is considerably smaller than the actual perturbation, enabling the system to achieve robust performance. Furthermore, when the robot comes to a stop, the estimated perturbation converges to the assumed perturbation, confirming the correct operation of the ESO.

5. Conclusions

In this study, an approach to control and stabilize multibody robotic systems with inherent dynamics and uncertainties is presented. The approach combines an extended state observer (ESO) with sliding mode control (SMC), forming SMCESO, together with the optimization capabilities of the deep deterministic policy gradient (DDPG) algorithm. One of the advantages of the ESO is that it requires only partial state feedback (position) to estimate the perturbation, which includes the system dynamics and external disturbances. Initially, the proposed algorithm was implemented on a simple second-order system with an introduced sinusoidal disturbance. Subsequently, the control parameters were fine-tuned using a DDPG agent, which was trained based on the system tracking error, joint angle, estimated joint angle, and estimation error. This training allowed the DDPG-based SMCESO to outperform the optimally tuned PID control (via the Control System Tuner toolbox), conventional SMC (tuned through DDPG), and H∞ control in terms of robustness, significantly enhancing system stability and performance. Even in the presence of disturbances, SMCESO consistently converges to zero error due to its perturbation rejection capabilities. It was also demonstrated that, with the ESO, the system dynamics are primarily affected by the perturbation estimation error, which was validated through simulations showing close alignment between the estimated and actual perturbations, leaving only minor estimation errors to be handled by the SMC control input. As a result, the overall performance of the multibody robot system is highly robust.

Author Contributions

Conceptualization, H.K. and M.C.L.; Data curation, S.A.K.; Formal analysis, F.G.; Funding acquisition, U.H.S.; Investigation, S.A.K. and F.G.; Methodology, H.K.; Project administration, M.C.L.; Resources, M.C.L.; Software, H.K.; Validation, U.G. and U.H.S.; Writing—original draft, H.K.; Writing—review and editing, U.G. and U.H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This work was supported by the Deanship of Graduate Studies and Research (DGSR) Program, Ajman University, United Arab Emirates.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shtessel, Y.; Edwards, C.; Fridman, L.; Levant, A. Introduction: Intuitive Theory of Sliding Mode Control. In Sliding Mode Control and Observation; Control Engineering; Birkhäuser: New York, NY, USA, 2014; pp. 1–42.
2. Afifa, R.; Ali, S.; Pervaiz, M.; Iqbal, J. Adaptive Backstepping Integral Sliding Mode Control of a MIMO Separately Excited DC Motor. Robotics 2023, 12, 105.
3. Khan, H.; Abbasi, S.J.; Lee, M.C. DPSO and Inverse Jacobian-based Real-time Inverse Kinematics with Trajectory Tracking using Integral SMC for Teleoperation. IEEE Access 2020, 8, 159622–159638.
4. Hollweg, G.V.; de Oliveira Evald, P.J.; Milbradt, D.M.; Tambara, R.V.; Gründling, H.A. Design of continuous-time model reference adaptive and super-twisting sliding mode controller. Math. Comput. Simul. 2022, 201, 215–238.
5. Mobayen, S.; Bayat, F.; ud Din, S.; Vu, M.T. Barrier function-based adaptive nonsingular terminal sliding mode control technique for a class of disturbed nonlinear systems. ISA Trans. 2023, 134, 481–496.
6. Khan, H.; Abbasi, S.J.; Lee, M.C. Robust Position Control of Assistive Robot for Paraplegics. Int. J. Control Autom. Syst. 2021, 19, 3741–3752.
7. Abbasi, S.J.; Khan, H.; Lee, J.W.; Salman, M.; Lee, M.C. Robust Control Design for Accurate Trajectory Tracking of Multi-Degree-of-Freedom Robot Manipulator in Virtual Simulator. IEEE Access 2022, 10, 17155–17168.
8. Humaidi, A.J.; Hasan, A.F. Particle Swarm Optimization-Based Adaptive Super-Twisting Sliding Mode Control Design for 2-Degree-of-Freedom Helicopter. Meas. Control 2019, 52, 1403–1419.
9. Wang, Y.; Zhu, K.; Yan, F.; Chen, B. Adaptive Super-Twisting Nonsingular Fast Terminal Sliding Mode Control for Cable-Driven Manipulators using Time-Delay Estimation. Adv. Eng. Softw. 2019, 128, 113–124.
10. Wang, H.; Fang, L.; Song, T.; Xu, J.; Shen, H. Model-free Adaptive Sliding Mode Control with Adjustable Funnel Boundary for Robot Manipulators with Uncertainties. Rev. Sci. Instrum. 2021, 92, 065101.
11. Xi, R.-D.; Xiao, X.; Ma, T.-N.; Yang, Z.-X. Adaptive Sliding Mode Disturbance Observer-Based Robust Control for Robot Manipulators Towards Assembly Assistance. IEEE Robot. Autom. Lett. 2022, 7, 6139–6146.
12. Jing, C.; Xu, H.; Niu, X. Adaptive Sliding Mode Disturbance Rejection Control with Prescribed Performance for Robotic Manipulators. ISA Trans. 2019, 91, 41–51.
13. Zhao, H.; Tao, B.; Ma, R.; Chen, B. Manipulator trajectory tracking based on adaptive fuzzy sliding mode control. Concurr. Comput. Pract. Exp. 2023, 35, e7620.
14. Khan, H.; Lee, M.C. Extremum Seeking-Based Adaptive Sliding Mode Control with Sliding Perturbation Observer for Robot Manipulators. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 5284–5290.
15. Razmi, H.; Afshinfar, S. Neural Network-Based Adaptive Sliding Mode Control Design for Position and Attitude Control of a Quadrotor UAV. Aerosp. Sci. Technol. 2019, 91, 12–27.
16. Chen, Z.; Huang, F.; Chen, W.; Zhang, W.; Sun, W.; Chen, J.; Zhu, S.; Gu, J. RBFNN-Based Adaptive Sliding Mode Control Design for Delayed Nonlinear Multilateral Telerobotic System with Cooperative Manipulation. IEEE Trans. Ind. Inform. 2020, 16, 1236–1247.
17. Wang, D.; Shen, Y.; Sha, Q.; Li, G.; Kong, X.; Chen, G.; He, B. Adaptive DDPG Design-Based Sliding-Mode Control for Autonomous Underwater Vehicles at Different Speeds. In Proceedings of the IEEE Underwater Technology (UT), Kaohsiung, Taiwan, 16–19 April 2019; pp. 1–5.
18. Mosharafian, S.; Afzali, S.; Bao, Y.; Velni, J.M. A Deep Reinforcement Learning-Based Sliding Mode Control Design for Partially Known Nonlinear Systems. In Proceedings of the European Control Conference (ECC), London, UK, 12–15 July 2022; pp. 2241–2246.
19. Lei, C.; Zhu, Q. U-Model-Based Adaptive Sliding Mode Control using a Deep Deterministic Policy Gradient. Math. Probl. Eng. 2022, 2022, 8980664.
20. Pantoja-Garcia, L.; Parra-Vega, V.; Garcia-Rodriguez, R.; Vázquez-García, C.E. A Novel Actor–Critic Motor Reinforcement Learning for Continuum Soft Robots. Robotics 2023, 12, 141.
21. Abbasi, S.J.; Lee, S. Enhanced Trajectory Tracking via Disturbance-Observer-Based Modified Sliding Mode Control. Appl. Sci. 2023, 13, 8027.
22. Raoufi, M.; Habibi, H.; Yazdani, A.; Wang, H. Robust Prescribed Trajectory Tracking Control of a Robot Manipulator Using Adaptive Finite-Time Sliding Mode and Extreme Learning Machine Method. Robotics 2022, 11, 111.
23. Saleki, A.; Fateh, M.M. Model-free control of electrically driven robot manipulators using an extended state observer. Comput. Electr. Eng. 2020, 87, 106768.
24. Zheng, Y.; Tao, J.; Sun, Q.; Zeng, X.; Sun, H.; Sun, M.; Chen, Z. DDPG-Based Active Disturbance Rejection 3D Path-Following Control for Powered Parafoil Under Wind Disturbances. Nonlinear Dyn. 2023, 111, 1–17.
25. Sun, M.; Zhang, W.; Zhang, Y.; Luan, T.; Yuan, X.; Li, X. An Anti-Rolling Control Method of Rudder Fin System Based on ADRC Decoupling and DDPG Parameter Adjustment. Ocean Eng. 2023, 278, 114306.
26. Yang, J.; Peng, W.; Sun, C. A Learning Control Method of Automated Vehicle Platoon at Straight Path with DDPG-Based PID. Electronics 2021, 10, 2580.
27. Dey, N.; Mondal, U.; Mondal, D. Design of a H-Infinity Robust Controller for a DC Servo Motor System. In Proceedings of the 2016 International Conference on Intelligent Control Power and Instrumentation (ICICPI), Kolkata, India, 21–23 October 2016; pp. 27–31.
Figure 1. Structure of DDPG.
Figure 2. Block diagram of DDPG-based SMCESO.
Figure 3. Controller performance evaluation: (a) with disturbance; (b) with parameter variation.
Figure 4. Actual and estimated perturbation comparison.
Figure 5. SMCESO training reward.
Figure 6. SMCESO fine-tuned gains.
Figure 7. Fine-tuned controller tracking performance.
Figure 8. Actual and estimated states of the system.
Figure 9. Perturbation results.
Table 1. DDPG parameters.

Group    | Parameter                    | Value
Critic   | Learning rate                | 1 × 10^−3
Critic   | Gradient threshold           | 1
Actor    | Learning rate                | 1 × 10^−4
Actor    | Gradient threshold           | 1
Agent    | Sample time                  | 0.5 s
Agent    | Target smoothing factor      | 1 × 10^−3
Agent    | Discount factor              | 1
Agent    | Minibatch size               | 64
Agent    | Experience buffer length     | 1 × 10^6
Agent    | Noise variance               | 0.3
Agent    | Noise variance decay rate    | 1 × 10^−5
Training | Maximum episodes             | 2000
Training | Maximum steps per episode    | 20
Training | Average reward window length | 10
Table 2. Control gains.

Control Algorithm | Gains
PID               | K_p = 200, K_i = 1000, K_d = 20
SMC               | K_smc = 300, c = 35, ε_c = 0.5
SMCESO            | K_smc = 50, c = 30, ε_c = 0.5, λ = 137.31, ε_o = 1
H∞                | Sensitivity function W_s = (s + 40)/(4s + 0.36)
Table 3. Motor dynamics parameters.

Parameter            | Value
Inductance, L        | 0.573 × 10^−3 H
Resistance, R        | 0.978 Ω
Torque constant, k_t | 33.5 × 10^−3 N·m/A
Voltage constant, k_e| 33.5 × 10^−3 V·s/rad
Gear ratio           | 100