Article

Deep Reinforcement Learning Based Active Disturbance Rejection Control for ROV Position and Attitude Control

by
Gaosheng Luo
1,
Dong Zhang
1,
Wei Feng
2,
Zhe Jiang
1,3,* and
Xingchen Liu
1
1
Shanghai Engineering Research Center of Hadal Science and Technology, College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China
2
Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, China
3
Lanqi Robot Co., Ltd., Wuxi 214000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4443; https://doi.org/10.3390/app15084443
Submission received: 11 March 2025 / Revised: 13 April 2025 / Accepted: 15 April 2025 / Published: 17 April 2025

Abstract

Remotely operated vehicles (ROVs) struggle to achieve optimal trajectory tracking performance during underwater movement because of external disturbances and parameter uncertainties. To address this issue, this paper proposes a position and attitude control strategy for underwater robots based on a reinforcement-learning-tuned active disturbance rejection controller. The linear active disturbance rejection controller (LADRC) has achieved satisfactory results in underwater robot control; however, a fixed-parameter controller cannot deliver optimal control performance for the controlled object. Therefore, the adaptive capability of the control parameters of the LADRC was explored further. The deep deterministic policy gradient (DDPG) algorithm was used to optimize the linear extended state observer (LESO). The strategy employs deep neural networks to adjust the LESO parameters online based on the measured states, allowing model uncertainties and environmental disturbances to be estimated more accurately and the total disturbance to be compensated in the control input online, which yields better disturbance estimation and control performance. Simulation results show that, compared with PID, fixed-parameter LADRC, and a double closed-loop sliding mode control method based on a nonlinear extended state observer (NESO-DSMC), the proposed control scheme significantly improves the disturbance estimation accuracy of the LADRC, leading to higher control precision and stronger robustness, thus demonstrating the effectiveness of the proposed control strategy.

1. Introduction

Remotely operated vehicles (ROVs) play an important role in underwater inspection, marine salvage, and deep-sea mining. Currently, the requirements for the control accuracy and robustness of underwater robots are also increasing. However, due to the nonlinear dynamics, external disturbances, and parameter uncertainties present in the underwater movement of ROVs, designing a reliable tracking controller is challenging. In recent decades, an increasing number of scholars have conducted extensive research on the control stability of ROVs. The control methods they used are elaborated upon below.
Proportional-integral-derivative (PID) control is the most widely used approach in industrial control. Guerrero et al. proposed a saturation function-based nonlinear PID controller, effectively addressing the control instability issues in underwater vehicles caused by actuator saturation and complex environmental disturbances [1]. Sarhadi et al. proposed a model reference adaptive PID control structure with an anti-saturation compensator to address the issue of model uncertainty in autonomous underwater vehicle systems [2].
Fuzzy control is a control method similar to expert systems. Han et al. proposed a fuzzy logic system to address the issue of unknown inertia matrices in AUV systems [3]. Li et al. proposed a fuzzy adaptive controller that considers the dynamics of ROV thrusters to improve the trajectory tracking performance of work-class ROVs, using a fuzzy adaptive control algorithm to compensate for changes in system parameters and disturbances [4]. Yang et al. proposed a fuzzy logic system (FLS) to replace the discontinuous switching terms in CSMC to reduce chattering phenomena [5].
Sliding Mode Control (SMC) is often used for trajectory tracking of underwater robots due to its resistance to external disturbances and parameter variations. Chen et al. proposed a method that combines a finite-time observer with adaptive sliding mode control to achieve high-precision robust tracking of underwater vehicles [6]. Long et al. proposed an Adaptive Sliding Mode Control (ASMC) scheme to construct a dynamic controller that calculates the optimal force and torque based on the output virtual speed; this approach is robust to parameter uncertainties and mitigates chattering [7]. Luo et al. proposed an improved sliding surface to address the finite selection problem of exponential parameters, resulting in a controller with better robustness [8]. Huang et al. introduced a double closed-loop sliding mode controller for trajectory tracking control of working-class ROVs, which uses the arctangent function as the switching function of the controller, effectively reducing chattering phenomena [9].
Neural network control (NNC) has emerged as a potent tool for crafting controllers for nonlinear and uncertain systems. Wen et al. proposed a predefined time control strategy using Radial Basis Function Neural Networks (RBFNNs), which effectively approximates external disturbances, thereby enhancing the robustness of the system [10]. Chu et al. proposed an adaptive control scheme based on radial basis function (RBF) neural networks for ROV trajectory tracking. To ensure system stability under actuator saturation, a first-order auxiliary state system was constructed [11]. Shojaei et al. proposed a neural network-enhanced feedback linearization control framework, which effectively addresses the performance guarantee issues of underactuated AUVs under model uncertainties and disturbances [12].
Each of the methodologies mentioned above has specific limitations. As the complexity of the Remotely Operated Vehicle (ROV) model increases, the effectiveness of proportional-integral-derivative (PID) control decreases significantly. Fuzzy control heavily relies on a fuzzy rule base constructed from expert experience, while SMC (Sliding Mode Control) has a very high dependency on the model and is prone to high-frequency chattering issues [13,14]. Neural network control (NNC) is particularly influenced by the number of nodes in the neural network; while increasing the nodes can improve control accuracy, it also leads to a significant rise in computational complexity, posing challenges for practical engineering applications [15]. Furthermore, these approaches do not adequately address the constraints related to the ROV’s state, potentially compromising tracking precision and risking damage to the thrusters.
To overcome the limitations associated with the ROV's state described above, this research employs Linear Active Disturbance Rejection Control (LADRC), a practical and robust control approach. LADRC preserves the essential characteristics of the proportional-integral-derivative (PID) algorithm without requiring an accurate model of the controlled system [16]. Instead, it treats the unmodeled dynamics and external disturbances as a "total disturbance," which is then estimated and compensated for. This methodology offers significant advantages in engineering applications due to its ease of use and robust resistance to interference [17,18]. Zhao et al. introduced a trajectory tracking control method for a dual-joint robotic arm system, integrating an extended state observer to estimate both the disturbances and states of the system. Additionally, they applied a state error feedback controller, and experimental findings indicate that the proposed control approach effectively meets control requirements under various conditions, including low-frequency, high-frequency, load, and disturbance scenarios [19,20]. Despite the successful implementation of LADRC in nonlinear systems, its control effectiveness is limited by its fixed structure and parameters.
The adaptive adjustment of parameters for control methods has been a prominent subject of interest, with various optimization algorithms being utilized to improve the robustness and control efficacy of these methods [21,22]. Different operational contexts necessitate varying control parameters, posing challenges for controllers with fixed optimization parameters to achieve optimal control performance. Drawing inspiration from artificial intelligence technologies, reinforcement learning (RL) algorithms have been amalgamated with control theory to devise novel control strategies that augment the adaptability of control systems and uphold optimal control performance in real time. Chen introduced a Q-learning-based adaptive tuning technique for LADRC parameters [23], which identifies optimal control parameters through iterative updates of the Q-value table and applies it to the heading control of ships. Nevertheless, the Q-learning algorithm mandates manual partitioning of the states of the controlled object and the specification of discrete actions, rendering it arduous to train and learn efficiently as the number of states and specified actions escalates. Furthermore, due to the discrete actions specified, the controller parameters can only assume predetermined values rather than varying continuously, thereby constraining the controller’s flexibility. To solve this problem [24], this study employs the Deep Deterministic Policy Gradient (DDPG) RL algorithm to dynamically generate optimal control gains online for the designed LADRC within the Linear Extended State Observer (LESO), thereby determining the optimal parameters of the extended state observer under diverse unknown disturbance conditions. This methodology circumvents the issue of inaccurate disturbance estimation stemming from fixed parameters, and the efficacy of the algorithm is ultimately corroborated through simulations.
The main contributions of this article include:
  • This article presents a nonlinear model for underwater robots that considers parameter uncertainty in the dynamic model. It also proposes a linear active disturbance rejection controller for controlling the position and attitude of the underwater robot based on this model. The convergence of the extended state observer in the active disturbance rejection controller and the stability of the closed-loop control system are demonstrated using the Lyapunov method.
  • A novel control method, named DDPG-LADRC, has been introduced to address disturbances in linear systems by integrating the Deep Deterministic Policy Gradient (DDPG) algorithm with an active disturbance rejection control approach. This method focuses on optimizing the extended state observer through the DDPG algorithm, enabling the observer to sustain optimal performance under varying external disturbances during both position and attitude control of Remotely Operated Vehicles (ROVs). Through real-time adjustments of control parameters, the performance of the extended state observer (LESO) is enhanced, thereby improving the system’s resilience to disturbances and enhancing control accuracy in intricate underwater settings.
  • Based on a nonlinear underwater robot model, numerical simulations have confirmed the efficacy of the approach. The simulation results first compared three control algorithms: PID, fixed-parameter LADRC, and DDPG-LADRC, and finally included NESO-DSMC for comparison. Through analysis, the proposed method has been verified to have significant advantages in terms of transient performance, control accuracy, and robustness.
The subsequent sections of this article are structured as follows: Section 2 introduces a nonlinear model for the underwater robot, Section 3 outlines the design of the LADRC controller for the robot, and Section 4 introduces the DDPG-LADRC algorithm and analyzes the convergence of the extended state observer and the stability of the closed-loop control. Section 5 presents the results and analysis of numerical simulations, and lastly, a conclusion is offered.

2. Dynamics Model of a Robot

This section investigates the dynamic model of an ROV for offshore underwater structure marine growth cleaning and structural inspection independently developed by the Hadal Science and Technology Center of Shanghai Ocean University. The robot is equipped with eight thrusters, allowing it to execute three-dimensional spatial maneuvers. Figure 1 shows the coordinate system of the robot and defines the inertial coordinate system $(x_0, y_0, z_0)$ and the motion coordinate system $(x, y, z)$. The state variables relative to the motion coordinate system are represented as $V = [u, v, w, p, q, r]^T$, where $u, v, w$ represent linear velocity and $p, q, r$ represent angular velocity. The state variables relative to the inertial coordinate system are represented as $[x, y, z, \phi, \theta, \psi]^T$, where $x, y, z$ indicate the position of the ROV and $\phi, \theta, \psi$ represent the attitude of the ROV. The kinematic equations of the ROV can be expressed as $\dot{\eta} = J(\eta)\nu$ [25]. The roll and pitch are passively stabilized by the buoyancy system, requiring no active control, so $\phi = \theta = 0$. Therefore, the six-degree-of-freedom motion of the ROV can be simplified to four degrees of freedom.
The dynamic equations in the Remotely Operated Vehicle (ROV) motion coordinate system are presented below [25]:
$$M\dot{\nu} + C(\nu)\nu + D(\nu)\nu + g(\eta) = \tau + f$$
In the dynamics Equation (1) of the ROV, $M$ represents the inertia matrix of the ROV, $\nu$ represents the linear and angular velocities of the ROV, $\dot{\nu}$ represents the linear and angular accelerations of the ROV, $C(\nu)$ represents the Coriolis-centrifugal matrix, $D(\nu)$ represents the hydrodynamic damping matrix, $g(\eta)$ represents the restoring force vector, $\tau$ is the control input provided by the main thrust and torque from the ROV's thrusters, and $f$ represents external water flow disturbances and uncertainties such as unmodeled dynamics. The movement of an ROV can be conceptualized as the general motion of a rigid body influenced by gravity and hydrodynamics in a water flow. Usually, the fluid dynamics parameters are derived from experiments or fluid simulations. However, due to the complexity of real-world ocean conditions, ensuring the accuracy of these parameters is a significant challenge. Consequently, the parameters $M$, $C(\nu)$, and $D(\nu)$ in the equation remain indeterminate. The $M$, $C(\nu)$, $D(\nu)$ in Equation (1) can be represented as the combination of the nominal parameters $M_0$, $C_0(\nu)$, $D_0(\nu)$ and the dynamic uncertainties $\Delta M$, $\Delta C(\nu)$, $\Delta D(\nu)$, as follows:
$$M = M_0 + \Delta M, \quad C(\nu) = C_0(\nu) + \Delta C(\nu), \quad D(\nu) = D_0(\nu) + \Delta D(\nu)$$
Then, Equation (1) can be rewritten as:
$$M_0\dot{\nu} + C_0(\nu)\nu + D_0(\nu)\nu + g(\eta) = \tau + \Delta f + \tau_\Delta$$
$$M_{RB} = \begin{bmatrix} m & 0 & 0 & -m y_G \\ 0 & m & 0 & m x_G \\ 0 & 0 & m & 0 \\ -m y_G & m x_G & 0 & I_z \end{bmatrix}, \qquad M_A = -\begin{bmatrix} X_{\dot{u}} & X_{\dot{v}} & X_{\dot{w}} & X_{\dot{r}} \\ Y_{\dot{u}} & Y_{\dot{v}} & Y_{\dot{w}} & Y_{\dot{r}} \\ Z_{\dot{u}} & Z_{\dot{v}} & Z_{\dot{w}} & Z_{\dot{r}} \\ N_{\dot{u}} & N_{\dot{v}} & N_{\dot{w}} & N_{\dot{r}} \end{bmatrix}$$
By assuming that the origin O of the dynamic system coincides with the centroid G and that the coordinate axes coincide with the three principal inertia axes, and ignoring the off-diagonal elements in the M R B and M A matrices [25], we can obtain the M 0 matrix:
$$M_0 = M_{RB} + M_A = \begin{bmatrix} m - X_{\dot{u}} & 0 & 0 & 0 \\ 0 & m - Y_{\dot{v}} & 0 & 0 \\ 0 & 0 & m - Z_{\dot{w}} & 0 \\ 0 & 0 & 0 & I_z - N_{\dot{r}} \end{bmatrix}$$
In Formula (5), $M_0 \in \mathbb{R}^{4\times 4}$ represents the inertia matrix, which is composed of the sum of the rigid body mass matrix $M_{RB}$ and the added mass matrix $M_A$. $m$ is the mass of the ROV, $I_z$ is the moment of inertia about the $z$-axis, and $X_{\dot{u}}$, $Y_{\dot{v}}$, $Z_{\dot{w}}$ represent the added-mass hydrodynamic forces in the $x$, $y$, and $z$ directions induced by unit accelerations $\dot{u}$, $\dot{v}$, and $\dot{w}$, respectively. $N_{\dot{r}}$ represents the additional inertial moment generated by a unit angular acceleration $\dot{r}$ about the $z$-axis.
$$C_A(\nu) = \begin{bmatrix} 0 & 0 & 0 & Y_{\dot{v}} v \\ 0 & 0 & 0 & -X_{\dot{u}} u \\ 0 & 0 & 0 & 0 \\ -Y_{\dot{v}} v & X_{\dot{u}} u & 0 & 0 \end{bmatrix}, \qquad C_{RB}(\nu) = \begin{bmatrix} 0 & 0 & 0 & -m v \\ 0 & 0 & 0 & m u \\ 0 & 0 & 0 & 0 \\ m v & -m u & 0 & 0 \end{bmatrix}$$
$$C_0(\nu) = C_{RB} + C_A = \begin{bmatrix} 0 & 0 & 0 & -(m - Y_{\dot{v}})v \\ 0 & 0 & 0 & (m - X_{\dot{u}})u \\ 0 & 0 & 0 & 0 \\ (m - Y_{\dot{v}})v & -(m - X_{\dot{u}})u & 0 & 0 \end{bmatrix}$$
In Formula (6), $C_0(\nu) \in \mathbb{R}^{4\times 4}$ and $C_0(\nu) = C_{RB} + C_A$, where $C_{RB}$ represents the matrix encompassing the rigid body Coriolis force and centripetal force, while $C_A$ denotes the matrix accounting for the Coriolis force and centripetal force resulting from the added mass of the inertial fluid dynamics [25].
$$D_0(\nu) = \begin{bmatrix} X_u + X_{u|u|}|u| & 0 & 0 & 0 \\ 0 & Y_v + Y_{v|v|}|v| & 0 & 0 \\ 0 & 0 & Z_w + Z_{w|w|}|w| & 0 \\ 0 & 0 & 0 & N_r + N_{r|r|}|r| \end{bmatrix}$$
In Formula (7), $D_0(\nu) \in \mathbb{R}^{4\times 4}$ denotes the damping matrix, which arises from the effects of viscous fluid dynamics on the robot. The symbols $X_u$, $Y_v$, $Z_w$, $N_r$ and $X_{u|u|}$, $Y_{v|v|}$, $Z_{w|w|}$, $N_{r|r|}$ denote the first-order and second-order hydrodynamic damping coefficients, respectively, that emerge during the motion of the underwater robot [25].
$$g(\eta) = \begin{bmatrix} 0 & 0 & -(W - B) & 0 \end{bmatrix}^T$$
In Formula (8), $g(\eta)$ represents the restoring force and moment caused by gravity and buoyancy [25]. $W$ is the gravity of the underwater robot, and $B$ is the buoyancy. In the physical structure design of the ROV, buoyancy is equal to gravity, so it can be further expressed as $g(\eta) = [0, 0, 0, 0]^T$.
In Formula (3), $\tau \in \mathbb{R}^{4\times 1}$ represents the thrust generated by the propellers, expressed as $\tau = [F_x, F_y, F_z, N_z]^T$, where $F_x$, $F_y$, $F_z$ are the thrusts generated by the propellers along the three coordinate axes, and $N_z$ is the moment generated by the propellers about the $z$-axis.
In Formula (3), $\Delta f \in \mathbb{R}^{4\times 1}$ represents external disturbances such as ocean currents in the working environment, and $\tau_\Delta \in \mathbb{R}^{4\times 1}$ represents the uncertainty of the lumped dynamic parameters, where $\tau_\Delta = -\Delta M\dot{\nu} - \Delta C(\nu)\nu - \Delta D(\nu)\nu$. For the subsequent design of the LADRC, we lump the model parameter uncertainty terms $\Delta M$, $\Delta C(\nu)$, $\Delta D(\nu)$ and the external disturbances together as the total disturbance.
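For illustration, the following Python sketch (not the authors' implementation) shows how the nominal matrices $M_0$, $C_0(\nu)$, and $D_0(\nu)$ of Equations (5)-(7) can be assembled; the numerical coefficients below are placeholders, since the actual values are those listed in Table 3.

```python
# Minimal sketch (not the authors' code): assembling the nominal 4-DOF matrices
# M0, C0(nu), D0(nu) of Equations (5)-(7). The numeric values are placeholders;
# the real coefficients are the ones given in Table 3 of the paper.
import numpy as np

# hypothetical nominal parameters (placeholders, not from the paper)
m, Iz = 120.0, 15.0                                  # mass [kg], yaw inertia [kg m^2]
Xud, Yvd, Zwd, Nrd = -30.0, -40.0, -50.0, -8.0       # added-mass derivatives
Xu, Yv, Zw, Nr = -20.0, -25.0, -30.0, -5.0           # first-order damping
Xuu, Yvv, Zww, Nrr = -15.0, -18.0, -22.0, -3.0       # second-order damping

def M0():
    """Nominal inertia matrix, Eq. (5): rigid body plus added mass (diagonal)."""
    return np.diag([m - Xud, m - Yvd, m - Zwd, Iz - Nrd])

def C0(nu):
    """Nominal Coriolis-centripetal matrix, Eq. (6), for nu = [u, v, w, r]."""
    u, v, _, _ = nu
    C = np.zeros((4, 4))
    C[0, 3] = -(m - Yvd) * v
    C[1, 3] = (m - Xud) * u
    C[3, 0] = (m - Yvd) * v
    C[3, 1] = -(m - Xud) * u
    return C

def D0(nu):
    """Nominal damping matrix, Eq. (7): first-order plus second-order terms."""
    u, v, w, r = nu
    return np.diag([Xu + Xuu * abs(u), Yv + Yvv * abs(v),
                    Zw + Zww * abs(w), Nr + Nrr * abs(r)])

# Example: evaluate the nominal model at a sample velocity state
nu = np.array([0.5, 0.2, -0.1, 0.05])
print(M0(), C0(nu), D0(nu), sep="\n")
```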

3. LADRC Controller Design

In addressing the challenges posed by uncertainties in ocean current disturbances, process noise, and hydrodynamic damping coefficients, the Linear Active Disturbance Rejection Control (LADRC) method is utilized. LADRC comprises a Linear Extended State Observer (LESO) and Linear State Error Feedback Control Law (LSEFC). Of particular significance is the development of the LESO, which is implemented to estimate and counteract external disturbances and uncertainties in model parameters. The LSEFC determines the virtual control signal u 0 by evaluating the system’s state error. The control block diagram is illustrated in Figure 2.

3.1. Linear Extended State Observer

LESO improves the performance of control systems by estimating and compensating for disturbances and unknown state variables. This section outlines the design process of the LESO scheme based on the mathematical model of the ROV system. The kinematics expression of the ROV is $\dot{\eta} = J(\eta)\nu$ [25], from which we can derive Formula (9):
$$\ddot{\eta} = J(\eta)\dot{\nu} + \dot{J}(\eta)\nu = J(\eta)M_0^{-1}\left[\Delta f + \tau_\Delta - C_0(\nu)\nu - D_0(\nu)\nu - g(\eta)\right] + \dot{J}(\eta)\nu + J(\eta)M_0^{-1}\tau = f + X + J(\eta)M_0^{-1}\tau$$
In Formula (9), $X = \dot{J}(\eta)\nu - J(\eta)M_0^{-1}\left[C_0(\nu)\nu + D_0(\nu)\nu + g(\eta)\right]$, and $f = J(\eta)M_0^{-1}(\Delta f + \tau_\Delta)$ represents the total uncertainty of the unmodeled dynamics and external disturbances. For the convenience of designing the LESO, let $f = [f_x, f_y, f_z, f_\psi]^T$ and $\tau = [\tau_x, \tau_y, \tau_z, \tau_\psi]^T$; then Formula (9) can be written as [26]:
$$\begin{cases} \ddot{x} = a_1\tau_x + X_1 + f_x \\ \ddot{y} = a_2\tau_y + X_2 + f_y \\ \ddot{z} = a_3\tau_z + X_3 + f_z \\ \ddot{\psi} = a_4\tau_\psi + X_4 + f_\psi \end{cases}$$
$$\begin{aligned}
X_1 &= \frac{\cos\psi}{m - X_{\dot{u}}}\left[(X_u + X_{u|u|}|u|)u + (m - Y_{\dot{v}})vr\right] - \frac{\sin\psi}{m - Y_{\dot{v}}}\left[(Y_v + Y_{v|v|}|v|)v - (m - X_{\dot{u}})ur\right] - (u\sin\psi + v\cos\psi)r \\
a_1 &= \frac{\cos\psi}{m - X_{\dot{u}}} - \frac{\sin\psi}{m - Y_{\dot{v}}} \\
X_2 &= \frac{\sin\psi}{m - X_{\dot{u}}}\left[(X_u + X_{u|u|}|u|)u + (m - Y_{\dot{v}})vr\right] + \frac{\cos\psi}{m - Y_{\dot{v}}}\left[(Y_v + Y_{v|v|}|v|)v - (m - X_{\dot{u}})ur\right] + (u\cos\psi - v\sin\psi)r \\
a_2 &= \frac{\sin\psi}{m - X_{\dot{u}}} + \frac{\cos\psi}{m - Y_{\dot{v}}} \\
X_3 &= \frac{(Z_w + Z_{w|w|}|w|)w}{m - Z_{\dot{w}}}, \quad a_3 = \frac{1}{m - Z_{\dot{w}}} \\
X_4 &= \frac{(N_r + N_{r|r|}|r|)r}{I_z - N_{\dot{r}}}, \quad a_4 = \frac{1}{I_z - N_{\dot{r}}}
\end{aligned}$$
The total uncertainty f , which represents the unmodeled dynamics and external disturbances, is defined as the total disturbance. To achieve an accurate estimation of the total disturbance f experienced in the control of underwater robots [19], the dynamics model of the ROV is rewritten as follows.
$$\ddot{\psi} = \frac{1}{I_z - N_{\dot{r}}}\tau_\psi + \frac{(N_r + N_{r|r|}|r|)r}{I_z - N_{\dot{r}}} + f_\psi$$
Using the dynamic expression of heading angle ψ from Equations (10) and (11) as an example for controller design, we provide a detailed design explanation for LESO and LSEFC.
In Formula (12), $\frac{1}{I_z - N_{\dot{r}}}$ is a constant term whose value is determined by the system's inertia parameter $I_z$ and damping coefficient $N_{\dot{r}}$. To simplify the formula, $b = \frac{1}{I_z - N_{\dot{r}}}$ is used to replace this coefficient. The second term, $\frac{(N_r + N_{r|r|}|r|)r}{I_z - N_{\dot{r}}}$, describes the nonlinear disturbances caused by hydrodynamics, which are typically regarded as a type of interference or unmodeled dynamics. In the design of the LESO, this part is incorporated into the total uncertainty $f_\psi$ for unified treatment. Therefore, Formula (12) can be simplified to:
$$\ddot{\psi} = b\tau_\psi + f_\psi$$
Taking $\psi_1, \psi_2, \psi_3$ as the state variables, with $\psi_3 = f_\psi$ as the extended state, the ROV heading angle $\psi$ control model can be expressed as:
$$\begin{cases} \psi = \psi_1 \\ \dot{\psi}_1 = \psi_2 \\ \dot{\psi}_2 = b\tau_\psi + \psi_3 \\ \dot{\psi}_3 = D \end{cases}$$
Let $D = \dot{f}_\psi$ and $\psi_2 = \dot{\psi}$. A linear extended state observer can be established for system (13) [19]:
$$\begin{cases} e_1 = \psi - z_1 \\ \dot{z}_1 = z_2 + \beta_1 e_1 \\ \dot{z}_2 = z_3 + \beta_2 e_1 + b\tau_\psi \\ \dot{z}_3 = \beta_3 e_1 \end{cases}$$
In Formula (15), $z_1, z_2$ represent the estimated values of the state variables of the controlled object (in this example, $z_1$ is the observed value of $\psi$ and $z_2$ is the observed value of $\dot{\psi}$), while $z_3$ represents the real-time estimate of the total disturbance (unknown external disturbances and model uncertainties). $\beta_1, \beta_2, \beta_3$ are the gains of the LESO. If the observer gains are chosen appropriately, the LESO can achieve precise tracking of each state variable of the controlled object. To facilitate parameter tuning, the values of $\beta_1, \beta_2, \beta_3$ are determined by $\omega_0$. By reasonably selecting the parameter $\omega_0$, the observed value of the "total disturbance" can be made closer to the true value. The Laplace transform of the LESO equations yields [19]:
$$\begin{aligned}
z_1 &= \frac{\beta_1 s^2 + \beta_2 s + \beta_3}{L(s)}Y(s) + \frac{b s}{L(s)}U(s) \\
z_2 &= \frac{\beta_2 s^2 + \beta_3 s}{L(s)}Y(s) + \frac{b s(s + \beta_1)}{L(s)}U(s) \\
z_3 &= \frac{\beta_3 s^2}{L(s)}Y(s) - \frac{b\beta_3}{L(s)}U(s)
\end{aligned}$$
$Y(s)$ is the Laplace transform of the system output $y(t)$, and $U(s)$ is the Laplace transform of the system input $u(t)$. The characteristic equation corresponding to the LESO is [18]:
$$L(s) = s^3 + \beta_1 s^2 + \beta_2 s + \beta_3$$
To stabilize the system, the roots of the characteristic equation must be located in the left half of the $s$-plane. Therefore, the three poles of the observer are uniformly placed on the negative real axis at $-\omega_o$ (where $\omega_o$ is the bandwidth of the observer, and $\omega_o > 0$). The observer gains are then obtained as $\beta_1 = 3\omega_o$, $\beta_2 = 3\omega_o^2$, $\beta_3 = \omega_o^3$. The estimation error of the LESO can be expressed as [18]:
$$\begin{cases} \dot{e}_1 = e_2 - \beta_1 e_1 \\ \dot{e}_2 = e_3 - \beta_2 e_1 \\ \dot{e}_3 = D - \beta_3 e_1 \end{cases}$$
Here, $e_i = \psi_i - z_i$ $(i = 1, 2, 3)$ defines the estimation errors of the LESO, which will be used in the stability proof of the LESO in the following text.
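As an illustration of the observer defined in Equations (15)-(18), the following sketch implements a forward-Euler discretization of the heading-channel LESO with bandwidth-parameterized gains; the time step and test values are assumptions made for the example, not settings taken from the paper.

```python
# Minimal sketch (an illustration, not the authors' implementation) of the
# bandwidth-parameterized LESO of Eq. (15), discretized with forward Euler.
import numpy as np

def leso_gains(wo: float):
    """Observer gains from the bandwidth wo: beta1 = 3*wo, beta2 = 3*wo^2, beta3 = wo^3."""
    return 3.0 * wo, 3.0 * wo**2, wo**3

def leso_step(z, psi_meas, tau_psi, b0, wo, dt):
    """One Euler step of the LESO; z = [z1, z2, z3] estimates [psi, psi_dot, f_psi]."""
    z1, z2, z3 = z
    b1, b2, b3 = leso_gains(wo)
    e1 = psi_meas - z1                     # observation error
    z1 += dt * (z2 + b1 * e1)
    z2 += dt * (z3 + b2 * e1 + b0 * tau_psi)
    z3 += dt * (b3 * e1)                   # z3 tracks the total disturbance
    return np.array([z1, z2, z3])

# Example: the observer tracking a constant heading with zero input
z = np.zeros(3)
for _ in range(1000):
    z = leso_step(z, psi_meas=0.3, tau_psi=0.0, b0=10.0, wo=5.0, dt=0.01)
print(z)   # z1 should approach 0.3
```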

3.2. Linear State Error Feedback Controller

Traditional PID controllers use error integration to eliminate static errors, but the feedback from error integration can make the system prone to oscillation. In contrast, the LESO (Linear Extended State Observer) employs real-time compensation for total disturbances, avoiding the negative effects of integral feedback. The result is shown in Equation (19) [18]:
$$\begin{cases} e_1 = \psi_d - z_1 \\ e_2 = \dot{\psi}_d - z_2 \\ u_0 = k_p e_1 + k_d e_2 \end{cases}$$
In the equation, $\psi_d$ is the reference heading angle input, $u_0$ is the error feedback control quantity, and $k_p, k_d$ are the controller gains. According to Equation (19), the transfer function from the reference $r$ to $u_0$ can be obtained:
$$\frac{U_0(s)}{R(s)} = \frac{k_p + k_d s}{s^2 + k_d s + k_p}$$
By placing both poles of the controller at $-\omega_c$ on the real axis (where $\omega_c$ represents the control bandwidth) in the left half of the $s$-plane, the controller gains can be determined as [18]:
$$k_p = \omega_c^2, \quad k_d = 2\omega_c$$
Based on u 0 , an additional compensation term for the total disturbance estimate is added, so the control quantity can be taken as:
$$\tau_\psi = \frac{u_0 - z_3}{b_0}$$
In Formula (22), u 0   is the error feedback control quantity, and τ ψ     is the actual input of the controlled object.
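The heading-channel control law of Equations (19)-(22) can then be sketched as follows; this is an illustrative fragment, with `z` being the LESO estimate (for example, produced by the observer sketch above) and `wc`, `b0` designer-chosen tuning parameters rather than values prescribed by the paper.

```python
# Minimal sketch of the heading-channel control law of Eqs. (19)-(22); z is the
# LESO estimate [z1, z2, z3], and wc, b0 are illustrative tuning parameters.
def ladrc_heading(z, psi_d, psi_d_dot, wc, b0):
    kp, kd = wc**2, 2.0 * wc        # Eq. (21): both controller poles at -wc
    e1 = psi_d - z[0]               # heading tracking error, Eq. (19)
    e2 = psi_d_dot - z[1]           # heading-rate tracking error, Eq. (19)
    u0 = kp * e1 + kd * e2          # linear state error feedback
    return (u0 - z[2]) / b0         # Eq. (22): compensate the estimated disturbance

# Example: control command for a step heading reference of 0.5 rad
print(ladrc_heading(z=[0.0, 0.0, 0.0], psi_d=0.5, psi_d_dot=0.0, wc=2.0, b0=10.0))
```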

4. DDPG Optimization of Control Parameters for Active Disturbance Rejection Controller

For the controlled system, when the range of unknown disturbances is too large and the rate of change is too fast, using a fixed-parameter active disturbance rejection controller results in low control accuracy. Therefore, by combining deep reinforcement learning, an improved active disturbance rejection controller has been designed, in which the active disturbance rejection control parameters will vary with the environment.

4.1. Deep Deterministic Policy Gradient-Active Disturbance Rejection Controller Algorithm Framework

Reinforcement learning is an algorithm that allows an agent to adjust its behavioral strategies based on observations made during interactions with the environment, to maximize cumulative rewards. The schematic diagram is shown in Figure 3.
In response to the online adjustment problem of fixed parameter active disturbance rejection controllers, this paper employs the Deep Deterministic Policy Gradient algorithm (DDPG), which can handle continuous action control. The proposed control strategy, DDPG-LADRC, treats the entire underwater robot control system as the environment, using the system’s control performance as the reward evaluation criterion. The DDPG-LADRC agent determines actions based on the current environment, and then the environment provides a new state based on the output value and calculates the reward value. The DDPG-LADRC agent makes judgments, optimizes and updates the next action, and interacts with the environment until the reward converges.

4.2. Deep Deterministic Policy Gradient Algorithm Principles

DDPG is a deep reinforcement learning algorithm based on the actor-critic framework. The actor network outputs deterministic actions in a continuous action space based on environmental state feedback, while the critic network calculates the corresponding Q-value based on the current state and action, which is used to evaluate the long-term expected return of the action. By adjusting the weights of the critic network according to the error between the Q-value predicted by the critic network and the target value computed from the actually received reward, the output estimates of the critic network become more accurate. Using the policy gradient algorithm, the parameters of the actor network are updated in the direction of increasing the action value. During the interaction between the agent and the environment, the learning parameters of both networks are continuously updated until the policy converges [27].
At time   t , the mapping from state s to action a is referred to as policy   π .
$$a_t = \pi(s_t)$$
According to the actions generated by strategy π , new states and reward values r are continuously obtained. The formula for calculating cumulative rewards is:
$$G_t = \sum_{t=1}^{T}\gamma^t r_t$$
The Bellman equation for the state value function is represented as:
$$V_\pi(s) = E_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1} \,\middle|\, s = s_t\right] = E_\pi\!\left[r_{t+1} + \gamma V_\pi(s_{t+1}) \,\middle|\, s = s_t\right]$$
Considering the impact of actions on the value function, the Bellman equation for the state-action value function is represented as:
$$Q_\pi(s, a) = E_\pi\!\left[r_{t+1} + \gamma Q_\pi(s_{t+1}, a_{t+1}) \,\middle|\, s = s_t, a = a_t\right]$$
The optimal Bellman equation can be expressed as:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma Q_\pi(s_{t+1}, a_{t+1})$$
The optimal strategy π   is obtained by maximizing the cumulative reward and its corresponding optimal Bellman equation:
$$\pi^*(s_t) = \arg\max_{a_t} Q_\pi(s_t, a_t)$$
We calculate the loss function for the target Q value, using y t   to represent it:
$$y_t = r_t + \gamma Q'\!\left(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right), \qquad L(\theta^Q) = E\!\left[\left(y_t - Q(s_t, a_t \mid \theta^Q)\right)^2\right]$$
By calculating the gradient of the loss function, we update the current value network [28].
$$\begin{aligned}
\theta_k^Q &= \theta_{k-1}^Q - \mu_Q \nabla_{\theta^Q} L(\theta_{k-1}^Q) \\
\nabla_{\theta^Q} L(\theta_{k-1}^Q) &= E\!\left[-2\left(y_t - Q(s, a \mid \theta_{k-1}^Q)\right)\nabla_{\theta^Q} Q(s, a \mid \theta_{k-1}^Q)\right]\Big|_{s = s_t,\, a = a_t}
\end{aligned}$$
The strategy network uses the Q function output from the value network as the loss function. By taking the policy gradient of the Q function, the update formula is obtained [28].
$$\begin{aligned}
\theta_k^\pi &= \theta_{k-1}^\pi - \mu_\pi \nabla_{\theta^\pi} L(\theta_{k-1}^\pi) \\
\nabla_{\theta^\pi} L(\theta_{k-1}^\pi) &= -E\!\left[\nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \pi(s_t)}\, \nabla_{\theta^\pi}\pi(s \mid \theta_{k-1}^\pi)\big|_{s = s_t}\right]
\end{aligned}$$
The target network uses a soft update method as follows [29].
$$\begin{cases} \theta_k^{Q'} = \tau\theta_{k-1}^Q + (1 - \tau)\theta_{k-1}^{Q'} \\ \theta_k^{\pi'} = \tau\theta_{k-1}^\pi + (1 - \tau)\theta_{k-1}^{\pi'} \end{cases}$$
DDPG employs a deterministic policy; exploration noise is added to the policy output, as shown in Equation (33), allowing the agent to explore the environment more effectively and preventing it from getting stuck in local optima. The deterministic policy gradient helps the critic converge and updates the network parameters [29]. The meanings of the various parameters in the above analysis are shown in Table 1.
$$a_t = \mu(s_t \mid \theta^\mu) + N$$
The structure of the DDPG algorithm model is shown in Figure 4. The DDPG agent stores the sample data obtained from interacting with the LADRC control system in the experience pool. During the learning process, it randomly samples m pieces of data from the experience pool and continuously iterates to update the network gradient values to optimize the algorithm [29,30].
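For concreteness, a minimal PyTorch-style sketch of one DDPG update, following Equations (29)-(32), is given below. The network sizes, learning rates, and mini-batch shown here are illustrative assumptions and do not correspond to the settings of Table 2.

```python
# Minimal sketch of one DDPG update (Eqs. (29)-(32)); architectures and
# hyperparameters are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

state_dim, action_dim = 8, 1     # e.g., [e, e_dot] over four DOFs; action adjusts omega_0
gamma, tau = 0.98, 0.005         # discount factor and soft-update rate

def mlp(inp, out, act=nn.Tanh):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, out), act())

actor = mlp(state_dim, action_dim)
critic = mlp(state_dim + action_dim, 1, act=nn.Identity)
actor_t = mlp(state_dim, action_dim); actor_t.load_state_dict(actor.state_dict())
critic_t = mlp(state_dim + action_dim, 1, act=nn.Identity); critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One gradient step on a sampled mini-batch (s, a, r, s2, done)."""
    # Critic: regress Q(s, a) onto the target y_t = r + gamma * Q'(s', pi'(s'))  (Eq. 29)
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: deterministic policy gradient, maximize Q(s, pi(s))  (Eq. 31)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Soft update of the target networks  (Eq. 32)
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# Example call on a random mini-batch of 64 transitions
s = torch.randn(64, state_dim); a = torch.randn(64, action_dim)
r = torch.randn(64, 1); s2 = torch.randn(64, state_dim); done = torch.zeros(64, 1)
ddpg_update(s, a, r, s2, done)
```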
Combining deep reinforcement learning with an active disturbance rejection controller, the control system is designed to obtain environmental state data through the interaction between the agent and the environment (the underwater robot control system). Here, LSEFC represents the linear state error feedback controller, and LESO represents the linear extended state observer. The structural block diagram of the active disturbance rejection controller based on deep reinforcement learning is shown in Figure 5.
According to the control system structure block diagram designed in the above figure, we set the various parameters of the deep reinforcement learning agent.
The LADRC control process is assumed to have the Markov property; thus, when optimizing with DDPG, the future state transitions of the system depend only on the current state and control actions, without the need to explicitly model the state transition probabilities. The optimization of the LESO observation capability is modeled as a reinforcement learning task in a continuous action space. The DDPG algorithm is used to dynamically adjust the key parameter $\omega_0$ of the LESO, enabling it to adapt to changes in external disturbances, thereby improving the disturbance estimation accuracy of the LESO and the robustness of the controller. The actor network is responsible for generating the adjustment of the LESO observer bandwidth $\omega_0$ based on the current state. The critic network outputs the action value Q based on the current state and the action generated by the actor network, guiding the policy update of the actor network. The ROV studied in this paper is controlled through an umbilical cable, effectively avoiding the issue of limited onboard resources, and the ROV's controller can meet the high computational demands of DDPG training.
For the state space, select e , e ˙ , corresponding to the errors e x , e y , e z , e ψ , and the differential of the error in each degree of freedom.
For the action space, select the bandwidth $\omega_0$ that places the poles of the extended state observer, with $\beta_1 = 3\omega_o$, $\beta_2 = 3\omega_o^2$, $\beta_3 = \omega_o^3$.
To reduce the final error, a reward function is set based on the error between the output of each degree of freedom and its expected value: $R = -(e_x^2 + e_y^2 + e_z^2 + e_\psi^2)$.
In terms of the discount factor, the degree to which future rewards influence current decisions is determined, and in this article, the chosen discount factor γ = 0.98 is used to ensure that accurate trajectory tracking is given high priority. After multiple adjustments, the final selected DDPG parameters are shown in Table 2.
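A minimal sketch of the interface between the DDPG agent and the LADRC loop is given below: the agent's action is mapped to the observer bandwidth $\omega_0$ (and hence to $\beta_1$, $\beta_2$, $\beta_3$), and the reward penalizes the squared tracking errors. The admissible bandwidth range is an assumption made for illustration, not a value from the paper.

```python
# Minimal sketch of the agent/controller interface: the action sets the LESO
# bandwidth and the reward penalizes tracking error. W0_MIN/W0_MAX are assumed.
import numpy as np

W0_MIN, W0_MAX = 1.0, 50.0          # assumed admissible observer bandwidth range

def action_to_gains(a: float):
    """Map a tanh action a in [-1, 1] to omega_0 and the LESO gains beta_1..beta_3."""
    w0 = W0_MIN + 0.5 * (a + 1.0) * (W0_MAX - W0_MIN)
    return w0, (3.0 * w0, 3.0 * w0**2, w0**3)

def reward(err):
    """Negative sum of squared tracking errors over [e_x, e_y, e_z, e_psi]."""
    return -float(np.sum(np.square(err)))

w0, betas = action_to_gains(0.2)
print(w0, betas, reward(np.array([0.05, -0.02, 0.01, 0.03])))
```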
The reward curve of the Deep Deterministic Policy Gradient (DDPG) algorithm is generally used to determine whether the agent’s training has converged. The curve showing the change in rewards after training over the training iterations is shown in Figure 6.

4.3. Stability Analysis

To conduct the analysis, the following hypothesis is proposed based on engineering practice: the total disturbance observed by the observer in the active disturbance rejection control is bounded within $H$, $H = \{D \mid |D| \leq F_h\}$, where $F_h$ is a positive constant.
Theorem 1. 
The estimation error of the observer constructed in Equation (15) is bounded [18], and $\lim_{\omega_0 \to \infty,\, t \to \infty}\|e\| = 0$.
Proof. 
Let $q_i = \dfrac{e_i}{\omega_o^i}$ $(i = 1, 2, 3)$; then Equation (18) can be rewritten as:
$$\begin{cases} \dot{q}_1 = \omega_o(q_2 - 3q_1) \\ \dot{q}_2 = \omega_o(q_3 - 3q_1) \\ \dot{q}_3 = \omega_o\left(\dfrac{D}{\omega_o^4} - q_1\right) \end{cases}$$
Formula (34) can be rewritten in compact form as:
$$\dot{q} = \omega_o A q + B\frac{D}{\omega_o^3}$$
For simplicity, let q = [ q 1 , q 2 , q 3 ] T , therefore:
$$A = \begin{bmatrix} -3 & 1 & 0 \\ -3 & 0 & 1 \\ -1 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
Observing the above equation, for any positive $\omega_o$, $A$ is Hurwitz; therefore, there exists a unique positive definite symmetric matrix $P_\eta$ that satisfies the Lyapunov equation $A^T P_\eta + P_\eta A = -Q_\eta$. By choosing the Lyapunov function as $V(q) = q^T P_\eta q$, we can derive the following for $\dot{V}(q)$:
$$\dot{V}(q) = q^T P_\eta \dot{q} + \dot{q}^T P_\eta q = q^T P_\eta\!\left(\omega_0 A q + B\frac{D}{\omega_0^3}\right) + \left(\omega_0 A q + B\frac{D}{\omega_0^3}\right)^T P_\eta q = -\omega_0 q^T Q_\eta q + \frac{2 q^T P_\eta B D}{\omega_0^3}$$
Referring to Equation (37), using the Cauchy inequality and the property of the minimum eigenvalue of positive definite matrices, V ˙ ( q ) can be rewritten as:
$$\dot{V}(q) = -\omega_0 q^T Q_\eta q + \frac{2 q^T P_\eta B D}{\omega_o^3} \leq -\omega_o\lambda_{min}(Q_\eta)\|q\|^2 + \frac{2 F_h \lambda_{max}(P_\eta)\|q\|}{\omega_o^3}$$
Using the eigenvalues of the matrix, we can obtain the following bounds on the quadratic form: $\lambda_{min}(P_\eta)\|q\|^2 \leq q^T P_\eta q \leq \lambda_{max}(P_\eta)\|q\|^2$. This can be rewritten as $\frac{V(q)}{\lambda_{max}(P_\eta)} \leq \|q\|^2 \leq \frac{V(q)}{\lambda_{min}(P_\eta)}$. Therefore, inequality (38) can be rewritten as:
$$\dot{V}(q) \leq -\frac{\omega_o\lambda_{min}(Q_\eta)}{\lambda_{max}(P_\eta)}V(q) + \frac{2 F_h \lambda_{max}(P_\eta)}{\omega_o^3}\sqrt{\frac{V(q)}{\lambda_{min}(P_\eta)}}$$
To obtain a linear differential inequality, let $W = \sqrt{V(q)}$; then $\dot{W} = \frac{\dot{V}(q)}{2\sqrt{V(q)}}$, and inequality (39) can be rewritten as:
$$\dot{W} \leq -\frac{\omega_o\lambda_{min}(Q_\eta)}{2\lambda_{max}(P_\eta)}W + \frac{F_h\lambda_{max}(P_\eta)}{\omega_o^3\sqrt{\lambda_{min}(P_\eta)}}$$
When studying the state equation for $\dot{W}$, it is often necessary to obtain a bound on its solution $W$ rather than the solution itself. The Gronwall–Bellman method is one of the approaches used for this purpose [31]. First, let $\beta = \frac{\omega_0\lambda_{min}(Q_\eta)}{2\lambda_{max}(P_\eta)}$ and $\alpha = \frac{F_h\lambda_{max}(P_\eta)}{\omega_o^3\sqrt{\lambda_{min}(P_\eta)}}$. By applying the inequality, we can obtain the following.
Consider $\dot{W} = -\beta W + \alpha$. Then $W \leq \left(W(t_0) - \frac{\alpha}{\beta}\right)e^{-\beta(t - t_0)} + \frac{\alpha}{\beta}$. Rearranging, we obtain the following expression:
$$W \leq \left(W(t_0) - \frac{2 F_h \lambda_{max}^2(P_\eta)}{\omega_o^4\sqrt{\lambda_{min}(P_\eta)}\,\lambda_{min}(Q_\eta)}\right)e^{-\frac{\omega_0\lambda_{min}(Q_\eta)}{2\lambda_{max}(P_\eta)}(t - t_0)} + \frac{2 F_h \lambda_{max}^2(P_\eta)}{\omega_o^4\sqrt{\lambda_{min}(P_\eta)}\,\lambda_{min}(Q_\eta)}$$
From $W = \sqrt{V(q)}$, $V(q) = q^T P_\eta q$, and $\frac{V(q)}{\lambda_{max}(P_\eta)} \leq \|q\|^2 \leq \frac{V(q)}{\lambda_{min}(P_\eta)}$, we can obtain Equation (42):
$$\|q\| \leq \sqrt{\frac{V(q)}{\lambda_{min}(P_\eta)}} = \frac{W}{\sqrt{\lambda_{min}(P_\eta)}}$$
When $t \to \infty$, Expression (43) can be obtained:
$$\|q\| \leq \sqrt{\frac{V(q)}{\lambda_{min}(P_\eta)}} \leq \frac{2 F_h \lambda_{max}^2(P_\eta)}{\omega_o^4\lambda_{min}(P_\eta)\lambda_{min}(Q_\eta)} = \frac{k}{\omega_0^4}$$
In Formula (43), $k = \frac{2 F_h \lambda_{max}^2(P_\eta)}{\lambda_{min}(P_\eta)\lambda_{min}(Q_\eta)}$ is a positive constant, and because $P_\eta$, $Q_\eta$ are independent of $\omega_0$, Equation (43) demonstrates that $\lim_{\omega_0 \to \infty,\, t \to \infty}\|q\| = 0$. Together with $q_i = \frac{e_i}{\omega_o^i}$ $(i = 1, 2, 3)$, this gives $\|e\| \leq \omega_0^3\|q\| \leq \frac{k}{\omega_o}$, so $\lim_{\omega_0 \to \infty,\, t \to \infty}\|e\| = 0$, thus completing the proof. Define $H_e$ as $H_e = \{e \mid \|e\| \leq E\}$, where $E$ is a positive constant. By adjusting $\omega_0$ to ensure $\frac{k}{\omega_o} \leq E$, the estimation error of the LESO will remain within $H_e$. Theorem 1 has been proven. □
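The argument above can also be checked numerically. The following sketch (not part of the original proof) verifies that $A$ is Hurwitz and solves the Lyapunov equation $A^T P_\eta + P_\eta A = -Q_\eta$ for a chosen positive definite $Q_\eta$ using SciPy:

```python
# Minimal numerical check (not part of the paper) of the argument around Eq. (36):
# A is Hurwitz, so A^T P + P A = -Q has a unique positive definite solution P.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-3.0, 1.0, 0.0],
              [-3.0, 0.0, 1.0],
              [-1.0, 0.0, 0.0]])
Q = np.eye(3)                                   # any positive definite Q_eta

print(np.linalg.eigvals(A))                     # all real parts negative: (s + 1)^3 = 0
P = solve_continuous_lyapunov(A.T, -Q)          # solves A^T P + P A = -Q
print(np.linalg.eigvals(P))                     # all positive -> P is positive definite
print(np.allclose(A.T @ P + P @ A, -Q))         # residual check
```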
Theorem 2. 
According to the error feedback control law given by Equation (19), we can ensure the closed-loop stability of the control system. According to Theorem 1, the convergence of the LESO can be guaranteed by carefully selecting $\omega_0$ and $b$, and the estimation error of the LESO will be constrained within $H_e$. Substituting Equation (19) into Equation (14) yields:
$$\begin{cases} \dot{e}_{\psi 1} = e_{\psi 2} \\ \dot{e}_{\psi 2} = -k_p e_{\psi 1} - k_d e_{\psi 2} + k_p e_1 + k_d e_2 + e_3 \end{cases}$$
Referring to Equation (44), $e_{\psi 1} = \psi_r - \psi_1$ and $e_{\psi 2} = \dot{\psi}_r - \psi_2$ are defined as the tracking errors. The above equation can be rewritten as:
$$\dot{e}_\psi = C e_\psi + G e, \qquad e_\psi = \begin{bmatrix} e_{\psi 1} \\ e_{\psi 2} \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix}, \quad C = \begin{bmatrix} 0 & 1 \\ -k_p & -k_d \end{bmatrix}, \quad G = \begin{bmatrix} 0 & 0 & 0 \\ k_p & k_d & 1 \end{bmatrix}$$
Since the controller gains are positive by design, $k_p, k_d > 0$. Then, the characteristic roots of $C$ can be expressed as:
$$\lambda_{1,2} = -\frac{k_d}{2} \pm \sqrt{\frac{k_d^2}{4} - k_p}$$
Since $k_p, k_d > 0$, both characteristic roots of $C$ have negative real parts; therefore, the designed linear active disturbance rejection controller is stable, and Theorem 2 has been proven. Because the estimation error of the observer is bounded, the tracking error is also bounded.

5. Simulation Analysis

The underwater intelligent cleaning and inspection robot is specifically designed for the safety inspection of marine oil platform risers and the removal of marine organisms attached to the risers. It is equipped with an Ultra-Short Baseline positioning system (USBL), an attitude sensor, a depth sensor, and a compass, enabling precise positioning, attitude awareness, and depth perception. In addition, its propulsion system includes four horizontal thrusters and four vertical thrusters. The model parameters of the underwater robot are shown in Table 3, and the physical prototype and thruster layout are illustrated in Figure 7.
To verify that DDPG-LADRC has stronger robustness, this paper proposes two experimental simulation scenarios.
(a)
Section 5.1 introduces a simple time-varying external disturbance, and the tracked trajectory is also relatively simple, to evaluate the improvement of the DDPG-LADRC control strategy on the transient performance during the motion of the ROV.
(b)
The time-varying disturbances introduced in Section 5.2 are related to the motion state of the ROV and track different trajectories, aiming to verify that the DDPG-LADRC control strategy has stronger robustness when the ROV is in a dynamic marine environment.

5.1. Scenario 1

To verify the enhanced effect of combining reinforcement learning DDPG with a linear active disturbance rejection controller in terms of disturbance suppression capability and control accuracy, the position and attitude of the underwater robot are tracked under time-varying external disturbances. The transient performance of the control system under perturbations is evaluated to validate the disturbance rejection and robustness of the DDPG-LADRC control scheme. Disturbances are introduced during the movement of the ROV as follows:
$$f = \begin{bmatrix} f_x & f_y & f_z & f_\psi \end{bmatrix}^T = \begin{bmatrix} 20\sin(0.4t) & 20\sin(0.4t) & 20\sin(0.4t) & 20\sin(0.4t) \end{bmatrix}^T$$
The initial conditions for the underwater robot are set as $[x(0), y(0), z(0), \psi(0)] = 0$, with the velocities and angular velocity set as $u(0) = v(0) = w(0) = r(0) = 0$. Additionally, for the controller parameters, the PID parameters are set as:
$$K_p = \{150, 150, 300, 370\}, \quad K_i = \{15, 15, 60, 15\}, \quad K_d = \{300, 300, 150, 300\}$$
The parameters for the Active Disturbance Rejection Control are set as follows: $b_0 = 10$, $\omega_0 = 5$; since $\beta_1 = 3\omega_o$, $\beta_2 = 3\omega_o^2$, $\beta_3 = \omega_o^3$, this gives $\beta_1 = 15$, $\beta_2 = 75$, $\beta_3 = 125$. The relevant DDPG setting parameters are shown in Table 2 above. The underwater robot simulation is designed to run for 100 s, with a simulation step size of 0.01 s. The proposed control algorithm is mainly compared with PID and fixed-parameter LADRC through three-dimensional trajectory tracking and planar tracking, to verify the degree of improvement in the system's transient performance by the DDPG-LADRC control strategy. The desired trajectory in the inertial coordinate system is:
$$\begin{cases} x_d = 2\sin(0.1\pi t)\ \mathrm{m} \\ y_d = 2\cos(0.1\pi t)\ \mathrm{m} \\ z_d = 0.2t\ \mathrm{m} \\ \psi_d = 0.03\pi t\ \mathrm{rad} \end{cases}$$
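For reference, the Scenario 1 disturbance of Equation (47) and the desired trajectory of Equation (49) can be generated on the stated simulation grid as follows (an illustrative sketch, not the authors' simulation code):

```python
# Minimal sketch generating the Scenario 1 reference trajectory (Eq. (49)) and
# the sinusoidal disturbance (Eq. (47)) on the 100 s / 0.01 s simulation grid.
import numpy as np

t = np.arange(0.0, 100.0, 0.01)                  # 100 s horizon, 0.01 s step

# desired trajectory in the inertial frame
x_d = 2.0 * np.sin(0.1 * np.pi * t)              # m
y_d = 2.0 * np.cos(0.1 * np.pi * t)              # m
z_d = 0.2 * t                                    # m
psi_d = 0.03 * np.pi * t                         # rad

# external disturbance applied to all four degrees of freedom
f = 20.0 * np.sin(0.4 * t)                       # same signal on x, y, z, psi
disturbance = np.stack([f, f, f, f], axis=1)
print(disturbance.shape)                         # (10000, 4)
```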
First, a feasibility analysis of the parameter optimization for LESO is conducted. Figure 8 compares the observation errors of the LESO optimized by DDPG with those of the fixed-parameter LESO. It can be observed that the fixed-parameter LADRC controller is not precise in tracking total disturbances. In contrast, the DDPG-LADRC can maintain better performance with a shorter time under the constraints of model parameter uncertainty and strong unknown external disturbances in underwater robot trajectory tracking control. The DDPG-LADRC can quickly respond to changes in disturbances and adjust its control strategy promptly to adapt to these changes, thereby enhancing the system’s dynamic performance. This indicates that the optimized observer parameters of DDPG-LADRC are effective.
The three-dimensional trajectory tracking performance of the ROV under different control schemes, as well as the tracking curves in the XY, XZ, and YZ planes shown in Figure 9, can be observed. It can be seen that even in the presence of disturbances, the DDPG-LADRC control scheme can achieve precise trajectory tracking, with control performance superior to that of the PID controller and the fixed parameter LADRC controller, demonstrating stronger robustness. Therefore, parameter optimization based on DDPG can enhance the control performance of LADRC.
The selected evaluation indicators for transient performance are overshoot, settling time, and peak time.
In underwater robot control, overshoot is an important indicator used to describe the dynamic performance of a system. Overshoot is typically measured by the difference between the maximum output value and the steady-state value, and it can also be expressed as a percentage of this difference relative to the steady-state value. The system without overshoot typically stabilizes at the setpoint without deviating too much from the target value, indicating that there is no significant overreaction or oscillation during the response process. From Table 4, we can see that DDPG-LADRC maintains response speed without overshoot, while PID and LADRC exhibit overshoot. When the overshoot is too large, the control system is prone to oscillation. The results indicate that DDPG-LADRC ensures the dynamic response process of the system, maintaining high robustness even in the face of model uncertainty or external disturbances. The parameter optimization effect of DDPG-LADRC is evident, effectively meeting the dynamic performance requirements of the system.
In underwater robot trajectory tracking control, the settling time is an important dynamic performance indicator. It reflects the robot's sensitivity to changes in control signals and its ability to respond quickly, defined as the time required for the ROV to reach and remain within a certain allowable error band (usually ±2% or ±5% of the final value) after initially approaching the target value. A shorter settling time means that the ROV can stabilize more quickly around the target value, reducing oscillations or instability during the transition process. Additionally, a rapid response can better handle external disturbances and changes in the internal parameters of the ROV, enhancing the system's robustness and stability. Referring to Table 5, it can be seen that the settling time of DDPG-LADRC for the ROV in the X-direction is significantly better than that of the other two control strategies, reduced by 93% and 98%, respectively. In the Y-direction, the reductions are 93% and 86%, respectively, and in the Z-direction, the reductions are 66.7% and 90%, respectively. The settling time of the heading angle $\psi$ is reduced by 64% and 89%, respectively.
Even if the overshoot is 0, the system response may still have a "peak," which does not refer to a deviation exceeding the steady-state value, but rather to the maximum value during the response process. In the underwater robot trajectory tracking control system, the peak time is an important dynamic performance indicator that describes the time required for the system response to reach its first peak. Referring to Table 6, the comparison of peak times shows that in the X-direction, DDPG-LADRC significantly outperforms the other two control strategies, with reductions of 93% and 98%, respectively. In the Y-direction, it reduces the peak time by 82% and 90%, respectively, and in the Z-direction, by 80% and 98%, respectively. The peak time of the heading angle $\psi$ is reduced by 93% and 89%, respectively.
In summary, through a comparative analysis of transient performance under different control methods, the results indicate the superiority of the DDPG-LADRC control strategy in terms of transient performance. Compared to PID controllers and traditional LADRC controllers, the proposed DDPG-LADRC is more suitable for underwater robotic systems that are multivariable, strongly coupled, have significant randomness, and are subject to unknown disturbances.
The tracking error of the ROV trajectory tracking in Figure 9 is shown in Figure 10. Compared to the PID controller and the fixed parameter LADRC controller, the proposed DDPG-LADRC controller has a smaller steady-state error. The PID and fixed-parameter LADRC control schemes are unable to eliminate steady-state errors in a short time, which leads to an inability to track the desired trajectory. However, the DDPG-LADRC significantly improves the control accuracy of the system by introducing DDPG to achieve online tuning of LADRC parameters in response to environmental changes. This ensures that the ROV can maintain satisfactory control performance even in the presence of inaccurate model parameters and significant uncertain disturbances.
After 60 s, data from 1000 sampling points were collected to calculate the root mean square error, which is used to determine the steady-state accuracy of each control method, as presented in Table 7.
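For clarity, the steady-state accuracy metric can be computed as in the following sketch, where `err` is assumed to be an array of per-sample tracking errors for the four degrees of freedom:

```python
# Minimal sketch of the steady-state accuracy metric: RMSE over the last 1000
# samples (t > 60 s in the text); `err` is an assumed (N, 4) array of tracking errors.
import numpy as np

def steady_state_rmse(err: np.ndarray, n_last: int = 1000) -> np.ndarray:
    """Root mean square error per DOF over the final n_last samples."""
    tail = err[-n_last:]
    return np.sqrt(np.mean(tail**2, axis=0))

print(steady_state_rmse(np.random.randn(10000, 4) * 0.01))
```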
In underwater robot control, better stability accuracy means that the robot can precisely reach the target position. Simulation results indicate that the designed DDPG-LADRC controller not only has robust performance but also possesses the ability to quickly track commands and suppress disturbances. Further comparisons show that the performance of DDPG-LADRC surpasses that of PID and conventional fixed-parameter LADRC. Therefore, parameter optimization based on DDPG can enhance the control performance of LADRC.

5.2. Scenario 2

To further verify the robustness of the controller, the anti-interference capability of different control methods under strong interference conditions was compared. The most representative tracking trajectory during the ROV’s motion was selected (Formula (51)). A dual closed-loop sliding mode control scheme based on a nonlinear extended state observer (NESO-DSMC) was added for the comparison of control methods [26], to validate the superiority of the DDPG-LADRC controller’s performance.
The parameters for the Active Disturbance Rejection Control are set as follows: $b_0 = 10$, $\omega_0 = 5$; since $\beta_1 = 3\omega_o$, $\beta_2 = 3\omega_o^2$, $\beta_3 = \omega_o^3$, this gives $\beta_1 = 15$, $\beta_2 = 75$, $\beta_3 = 125$. The relevant DDPG setting parameters are shown in Table 2 above. In addition, the controller parameters proposed in the NESO-DSMC are chosen as follows: $\delta = 0.01$, $\epsilon_{11} = \epsilon_{21} = \epsilon_{31} = 0.5$, $\beta = 0.1$, $\rho_{11} = 100$, $\rho_{21} = 300$, $\rho_{31} = 1000$, $\gamma_1 = \gamma_2 = \gamma_3 = \gamma_4 = 0.01$, $m = 5$, $q = 2$, $K_\eta = \mathrm{diag}\{0.3, 0.3, 0.3, 0.3\}$, $K_\nu = \mathrm{diag}\{10, 10, 10, 10\}$ [26].
The external interference added is shown in Equation (50). The added disturbance signal is related to the state of the ROV, and this signal is constantly changing.
$$\begin{cases} f_x = 40 - 1.65X - 0.3Y^2 - 1.22Z^2 - 8Z \\ f_y = 2.2X^2 - 2.5Y + 0.3Z \\ f_z = 18 - 2.1X^2 - 0.88Y^2 - 0.5Z^2 \end{cases}$$
The tracked trajectory is shown in Formula (51). This trajectory indicates that the ROV first descends vertically, then performs linear back-and-forth and spiral movements on a horizontal plane, accompanied by changes in depth and adjustments in heading, ultimately returning to a horizontal straight path. The initial position and attitude of the ROV are set as $x_0 = 0\ \mathrm{m}$, $y_0 = 1\ \mathrm{m}$, $z_0 = 0\ \mathrm{m}$, $\psi_0 = 0\ \mathrm{rad}$.
$$x_d(t) = \begin{cases} 0\ \mathrm{m}, & 0 \leq t < 20\ \mathrm{s} \\ 0.2(t - 20)\ \mathrm{m}, & 20 \leq t < 40\ \mathrm{s} \\ \sin(0.04\pi(t - 40)) + 4\ \mathrm{m}, & 40 \leq t < 60\ \mathrm{s} \\ -0.2(t - 60) + 4\ \mathrm{m}, & 60 \leq t < 80\ \mathrm{s} \\ \sin(0.05\pi(t - 80))\ \mathrm{m}, & 80 \leq t < 100\ \mathrm{s} \\ 0.2(t - 100)\ \mathrm{m}, & 100 \leq t \leq 120\ \mathrm{s} \end{cases}$$
$$y_d(t) = \begin{cases} 1\ \mathrm{m}, & 0 \leq t < 20\ \mathrm{s} \\ 1\ \mathrm{m}, & 20 \leq t < 40\ \mathrm{s} \\ -\cos(0.05\pi(t - 40)) + 2\ \mathrm{m}, & 40 \leq t < 60\ \mathrm{s} \\ 3\ \mathrm{m}, & 60 \leq t < 80\ \mathrm{s} \\ -\cos(0.05\pi(t - 80)) + 4\ \mathrm{m}, & 80 \leq t < 100\ \mathrm{s} \\ 5\ \mathrm{m}, & 100 \leq t \leq 120\ \mathrm{s} \end{cases}$$
$$z_d(t) = \begin{cases} 0.3t\ \mathrm{m}, & 0 \leq t < 20\ \mathrm{s} \\ 6 - 4\cos(0.1\pi) + 5\sin(0.1\pi x) + 4\cos(0.1\pi y)\ \mathrm{m}, & 20 \leq t \leq 120\ \mathrm{s} \end{cases}$$
$$\psi_d(t) = \begin{cases} 0\ \mathrm{rad}, & 0 \leq t < 20\ \mathrm{s} \\ 0\ \mathrm{rad}, & 20 \leq t < 40\ \mathrm{s} \\ 0.05\pi(t - 40)\ \mathrm{rad}, & 40 \leq t < 60\ \mathrm{s} \\ \pi\ \mathrm{rad}, & 60 \leq t < 80\ \mathrm{s} \\ \pi - 0.05\pi(t - 80)\ \mathrm{rad}, & 80 \leq t < 100\ \mathrm{s} \\ 0\ \mathrm{rad}, & 100 \leq t \leq 120\ \mathrm{s} \end{cases}$$
The simulation results shown in Figure 11 demonstrate that the DDPG-LADRC can achieve accurate disturbance estimation for the perturbation observations and corresponding disturbance observation error curves of the three state variables f x , f y , f z . The maximum observation error value for the observer in the X -direction is 0.00141, the maximum observation error in the Y -direction is 0.0016, and the maximum observation error in the Z -direction is 0.0021. DDPG-optimized LESO has achieved the estimation accuracy for disturbances that meet our requirements.
From Figure 12, it can be seen that LADRC, due to the issue of fixed parameters in the controller, is unable to eliminate steady-state errors in a short time. Under continuously changing external disturbances, LADRC cannot achieve optimal control performance. In the presence of significant uncertain disturbances, NESO-DSMC cannot reach the same level of error convergence accuracy as DDPG-LADRC. Table 8 and Table 9 show the RMSE and MAE under different control methods, indicating that DDPG-LADRC has better robustness compared to LADRC and NESO-DSMC. DDPG-LADRC can eliminate steady-state errors within 5 s because it incorporates DDPG for online adjustment of LADRC parameters in response to uncertain disturbances caused by environmental changes, significantly improving the control accuracy of the system. This ensures that the ROV can maintain satisfactory control performance even in the presence of inaccurate model parameters and significant uncertain disturbances.

6. Conclusions

In response to the issue of underwater robots facing difficulties in determining model parameters and external disturbances, and the inability of traditional fixed-parameter controllers to achieve optimal control performance for the controlled object, an online parameter tuning strategy based on active disturbance rejection control has been proposed: the DDPG-LADRC algorithm.
1.
Based on the nonlinear model of underwater robots, dynamic parameter uncertainty was considered, and a linear active disturbance rejection controller was designed. The convergence of the extended state observer in the linear active disturbance rejection controller and the stability of the closed-loop control were proven using the Lyapunov method. To address the issue that fixed-parameter controllers in nonlinear systems cannot achieve optimal control performance, a DDPG-LADRC control strategy was designed, which improves the performance of the LESO by adjusting the control parameters online; the resulting DDPG reward curve shows that training converges. A feasibility analysis of parameter optimization for the LESO was conducted in numerical simulations, demonstrating the effectiveness of the DDPG-LADRC strategy.
2.
Compared to PID, fixed-parameter LADRC, and the latest nonlinear observer-based double closed-loop sliding mode control method (NESO-DSMC), the DDPG-LADRC method can generate optimal parameters for the controller, thereby improving control accuracy. Experiments show that this control strategy outperforms PID, fixed-parameter LADRC, and NESO-DSMC control strategies in terms of transient performance and anti-interference capability. Therefore, it can be said that DDPG-LADRC has significant advantages in tracking and anti-interference capabilities.
3.
The algorithm, although demonstrating good performance in simulations, still faces significant challenges when being translated into practical engineering applications. For instance, the accurate determination of an ROV’s rotational inertia and hydrodynamic coefficients presents a notable challenge. In the future, the parameter adaptation concept based on DDPG can be combined with other control methods to achieve asymptotic stability and optimal control performance.

Author Contributions

G.L.: Writing—review and editing, Writing—original draft, Validation, Supervision, Software, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. D.Z.: Writing—review and editing, Writing—original draft, Visualization, Validation, Software, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. W.F.: Conceptualization, Funding acquisition, Investigation, Resources, Supervision, Writing—review and editing. Z.J.: Data curation, Investigation, Resources, Supervision, Validation, Writing—review and editing. X.L.: Data curation, Investigation, Supervision, Visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by project [2024]52 funded by the Guangdong Provincial Department of Natural Resources and the Guangdong Provincial Natural Resources Research Committee, and by the Shanghai Municipal Industrial Collaborative Innovation Technology Project (XTCX-KJ-2023-2-15). The authors gratefully acknowledge this support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to copyright issues with co-developers.

Conflicts of Interest

Author Zhe Jiang was employed by the company Lanqi Robot Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Guerrero, J.; Torres, J.; Creuze, V.; Chemori, A.; Campos, E. Saturation based nonlinear PID control for underwater vehicles: Design, stability analysis and experiments. Mechatronics 2019, 61, 96–105.
2. Sarhadi, P.; Noei, A.R.; Khosravi, A. Model reference adaptive PID control with anti-windup compensator for an autonomous underwater vehicle. Robot. Auton. Syst. 2016, 83, 87–93.
3. Han, Y.; Liu, J.; Yu, J.; Sun, C. Adaptive fuzzy quantized state feedback control for AUVs with model uncertainty. Ocean Eng. 2024, 313, 119496.
4. Li, M.; Yu, C.; Zhang, X.; Liu, C.; Lian, L. Fuzzy adaptive trajectory tracking control of work-class ROVs considering thruster dynamics. Ocean Eng. 2023, 267, 113232.
5. Yang, M.; Sheng, Z.; Yin, G.; Wang, H. A recurrent neural network based fuzzy sliding mode control for 4-DOF ROV movements. Ocean Eng. 2022, 256, 111509.
6. Chen, B.; Hu, J.; Zhao, Y.; Ghosh, B.K. Finite-time observer based tracking control of uncertain heterogeneous underwater vehicles using adaptive sliding mode approach. Neurocomputing 2022, 481, 322–332.
7. Long, C.; Hu, M.; Qin, X.; Bian, Y. Hierarchical trajectory tracking control for ROVs subject to disturbances and parametric uncertainties. Ocean Eng. 2022, 266 Pt 1, 112733.
8. Luo, W.; Liu, S. Disturbance observer based nonsingular fast terminal sliding mode control of underactuated AUV. Ocean Eng. 2023, 279, 114553.
9. Huang, B.; Yang, Q. Double-loop sliding mode controller with a novel switching term for the trajectory tracking of work-class ROVs. Ocean Eng. 2019, 178, 80–94.
10. Wen, J.; Zhang, J.; Yu, G. Predefined-Time Three-Dimensional Trajectory Tracking Control for Underactuated Autonomous Underwater Vehicles. Appl. Sci. 2025, 15, 1698.
11. Chu, Z.; Xiang, X.; Zhu, D.; Luo, C.; Xie, D. Adaptive trajectory tracking control for remotely operated vehicles considering thruster dynamics and saturation constraints. ISA Trans. 2020, 100, 28–37.
12. Shojaei, K. Neural network feedback linearization target tracking control of underactuated autonomous underwater vehicles with a guaranteed performance. Ocean Eng. 2022, 258, 111827.
13. Bao, H.; Zhang, Y.; Song, M.; Kong, Q.; Hu, X.; An, X. A review of underwater vehicle motion stability. Ocean Eng. 2023, 287, 115735.
14. Zheng, J.; Song, L.; Liu, L.; Yu, W.; Wang, Y.; Chen, C. Fixed-time sliding mode tracking control for autonomous underwater vehicles. Appl. Ocean Res. 2021, 117, 102928.
15. Xia, T.; Yang, Q.; Huang, B.; Ouyang, Y.; Zheng, Y.; Mao, P. Enhanced trajectory tracking control algorithm for ROVs considering actuator saturation, external disturbances, and model parameter uncertainties. Ocean Eng. 2024, 311, 118973.
16. Han, J. Auto disturbance rejection controller and its applications. Control Decis. 1998, 13, 19–23.
17. Gao, J.; Liang, X.; Chen, Y.; Zhang, L.; Jia, S. Hierarchical image-based visual servoing of underwater vehicle manipulator systems based on model predictive control and active disturbance rejection control. Ocean Eng. 2021, 229, 108814.
18. Gao, Z. Scaling and bandwidth-parameterization based controller tuning. In Proceedings of the 2003 American Control Conference, Denver, CO, USA, 4–6 June 2003.
19. Li, S.; Chen, Z.; Ju, Y.; Jia, Y.; Tang, W.; Wang, Y. Transverse vibration analysis and active disturbance rejection decoupling control of vector propulsion shaft system for underwater vehicles. Ocean Eng. 2023, 298, 117158.
20. Zhao, L.; Liu, X.; Wang, T. Trajectory tracking control for double-joint manipulator systems driven by pneumatic artificial muscles based on a nonlinear extended state observer. Mech. Syst. Signal Process. 2019, 122, 307–320.
21. Zheng, Y.; Chen, Z.; Huang, Z.; Sun, M.; Sun, Q. Active disturbance rejection controller for multi-area interconnected power system based on reinforcement learning. Neurocomputing 2021, 425, 149–159.
22. Huang, Z.; Chen, Z.; Zheng, Y.; Sun, M.; Sun, Q. Optimal design of load frequency active disturbance rejection control via double chains quantum genetic algorithm. Neural Comput. Appl. 2020, 33, 3325–3345.
23. Chen, Z.; Qin, B.; Sun, M.; Sun, Q. Q-learning-based parameters adaptive algorithm for active disturbance rejection control and its application to ship course control. Neurocomputing 2020, 408, 51–63.
24. Sehgal, A.; Ward, N.; La, H.M.; Papachristos, C.; Louis, S. GA+DDPG+HER: Genetic algorithm-based function optimizer for deep reinforcement learning in robotic manipulation tasks. In Proceedings of the 2022 6th IEEE International Conference on Robotic Computing (IRC), Naples, Italy, 5–7 December 2022; pp. 85–86.
25. Xu, W.; Liu, J.; Yu, J.; Han, Y. Low complexity adaptive neural network three-dimensional tracking control for autonomous underwater vehicles considering uncertain dynamics. Eng. Appl. Artif. Intell. 2025, 142, 109860.
26. Luo, G.; Gao, S.; Jiang, Z.; Luo, C.; Zhang, W.; Wang, H. ROV trajectory tracking control based on disturbance observer and combinatorial reaching law of sliding mode. Ocean Eng. 2024, 304, 117744.
27. Liang, Y.; Guo, C.; Ding, Z.; Hua, H. Agent-based modeling in electricity market using deep deterministic policy gradient algorithm. IEEE Trans. Power Syst. 2020, 35, 4180–4192.
28. Qi, G.; Li, Y. Reinforcement learning control for robot arm grasping based on improved DDPG. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 4132–4137.
29. Mousavifard, R.; Alipour, K.; Najafqolian, M.A.; Zarafshan, P. Quadrotor trajectory tracking using combined stochastic model-free position and DDPG-based attitude control. ISA Trans. 2025, 156, 240–252.
30. Wu, D.; Dong, X.; Shen, J.; Hoi, S.C.H. Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4933–4945.
31. Khalil, H.K. Nonlinear Systems, 3rd ed.; Prentice Hall: Hoboken, NJ, USA, 2002.
Figure 1. Remotely operated vehicle coordinate system.
Figure 2. Structure diagram of the linear active disturbance rejection controller.
Figure 3. Reinforcement learning diagram.
Figure 4. DDPG algorithm framework diagram.
Figure 5. Block diagram of the active disturbance rejection controller based on reinforcement learning.
Figure 6. Training reward curve.
Figure 7. Underwater robot (ROV) physical prototype and 3D arrangement of thrusters.
Figure 8. Comparison of total disturbance observation errors of the two observers.
Figure 9. Trajectory tracking results of the ROV under three control schemes.
Figure 10. Comparison of ROV tracking errors under different control methods.
Figure 11. DDPG-LADRC disturbance observation and corresponding error.
Figure 12. ROV 3D trajectory and error.
Table 1. DDPG algorithm parameter meanings.
Algorithm Parameter | Meaning
$Q(s_t, a_t \mid \theta^{Q})$ | Q value output by the online value network at time t
$Q'(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'})$ | Q value output by the target value network
$\pi'(s_{t+1} \mid \theta^{\pi'})$ | Action output by the target policy network
$\theta_k^{Q}, \theta_k^{\pi}$ | Parameters of the online networks at the k-th learning iteration
$\mu_Q$ | Learning rate of the value network
$\nabla_{\theta^{Q}} L(\theta_{k-1}^{Q})$ | Gradient of the loss function with respect to the value network parameters
$\mu_\pi$ | Learning rate of the policy network
$\nabla_{\theta^{\pi}} J(\theta_{k-1}^{\pi})$ | Policy gradient
$\theta_k^{Q'}, \theta_k^{\pi'}$ | Parameters of the target networks at the k-th learning iteration
$\tau$ | Soft update coefficient
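The symbols in Table 1 appear in the standard DDPG update rules; the compact restatement below follows the common formulation in the DDPG literature and is given for readability, not quoted from the paper's own equations, where $r_t$ is the reward, $\gamma$ the discount factor of Table 2, and $N$ the mini-batch size.
$y_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)$  (target value)
$L(\theta^{Q}) = \frac{1}{N}\sum_{t}\left(y_t - Q(s_t, a_t \mid \theta^{Q})\right)^2$  (critic loss over a mini-batch)
$\theta_k^{Q} = \theta_{k-1}^{Q} - \mu_Q\,\nabla_{\theta^{Q}} L(\theta_{k-1}^{Q})$  (critic update)
$\theta_k^{\pi} = \theta_{k-1}^{\pi} + \mu_\pi\,\nabla_{\theta^{\pi}} J(\theta_{k-1}^{\pi})$  (actor update along the policy gradient)
$\theta_k^{Q'} = \tau\,\theta_k^{Q} + (1-\tau)\,\theta_{k-1}^{Q'}$, $\theta_k^{\pi'} = \tau\,\theta_k^{\pi} + (1-\tau)\,\theta_{k-1}^{\pi'}$  (soft target updates)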
Table 2. DDPG algorithm parameters.
Hyperparameter | Value
Actor network learning rate | 0.001
Critic network learning rate | 0.0005
Mini-batch size | 64
Discount factor | 0.98
Noise variance | 0.2
Noise attenuation coefficient | 0.00001
Experience pool size | 100,000
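As an illustration only, the hyperparameters of Table 2 can be gathered into a small configuration object when reproducing the training setup; the field names below are hypothetical and are not taken from the authors' code.

from dataclasses import dataclass

@dataclass
class DDPGConfig:
    # Hypothetical field names; values are those listed in Table 2.
    actor_lr: float = 1e-3        # actor (policy) network learning rate
    critic_lr: float = 5e-4       # critic (value) network learning rate
    batch_size: int = 64          # mini-batch size sampled from the experience pool
    gamma: float = 0.98           # discount factor
    noise_variance: float = 0.2   # exploration noise variance
    noise_decay: float = 1e-5     # per-step noise attenuation coefficient
    buffer_size: int = 100_000    # experience pool (replay buffer) capacity

cfg = DDPGConfig()  # e.g., passed to the agent or trainer when reproducing the setup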
Table 3. ROV model parameters.
Parameter | Value | Parameter | Value
$m$ | 197 kg | $Y_{vv}$ | −245.2 N·s²/m²
$I_z$ | 25.1 N·m·s² | $Z_w$ | −12.6 N·s/m
$X_u$ | −5.24 N·s/m | $Z_{\dot{w}}$ | −367.8 kg
$X_{\dot{u}}$ | −135.1 kg | $Z_{ww}$ | −547.4 N·s²/m²
$X_{uu}$ | −109.1 N·s²/m² | $N_r$ | −1.52 N·s/m
$Y_v$ | −11.1 N·s/m | $N_{\dot{r}}$ | −34.3 kg
$Y_{\dot{v}}$ | −390.6 kg | $N_{rr}$ | −26.2 N·s²/m²
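For orientation, the linear and quadratic coefficients in Table 3 enter the hydrodynamic damping terms of the ROV model in the usual linear-plus-quadratic form; the following is a minimal sketch under that standard assumption, not the authors' implementation.

def damping_force(v, d_lin, d_quad):
    """One-axis hydrodynamic damping: linear plus quadratic drag.

    v      : body-frame velocity component (m/s)
    d_lin  : linear damping coefficient (e.g., X_u from Table 3)
    d_quad : quadratic damping coefficient (e.g., X_uu from Table 3)
    """
    return d_lin * v + d_quad * abs(v) * v

# Example: surge damping at u = 0.5 m/s with X_u = -5.24 and X_uu = -109.1
tau_surge = damping_force(0.5, -5.24, -109.1)  # ≈ -29.9 N, opposing forward motion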
Table 4. Comparison of transient performance of different control methods: overshoot.
Comparison | X | Y | Z | Phi
PID | 6% | 12.5% | 10% | 5%
LADRC | 2% | 25% | 5% | 0
DDPG-LADRC | 0 | 0 | 0 | 0
Table 5. Comparison of transient performance of different control methods: settling time (s).
Comparison X Y Z Phi
PID50711011
LADRC1440335
DDPG-LADRC1514
Table 6. Comparison of transient performance of different control methods: peak time (s).
Comparison X Y Z Phi
PID5034561
LADRC15625035
DDPG-LADRC1614
Table 7. Steady-state accuracy (X, Y, Z in m; Phi in rad).
Comparison | X | Y | Z | Phi
PID | 0.351 | 0.497 | 0.381 | 0.004
LADRC | 0.0285 | 0.3587 | 1.43 × 10^−5 | 0.003
DDPG-LADRC | 5.43 × 10^−5 | 1.14 × 10^−4 | 1.29 × 10^−14 | 6.56 × 10^−9
Table 8. Root mean square error (X, Y, Z in m; Phi in rad).
Comparison | X | Y | Z | Phi
LADRC | 0.01 | 0.021 | 0.014 | 0.003
NESO-DSMC | 0.005 | 0.0003 | 0.006 | 0.0006
DDPG-LADRC | 0.0011 | 0.0002 | 0.0012 | 0.00029
Table 9. Mean absolute error (X, Y, Z in m; Phi in rad).
Comparison | X | Y | Z | Phi
LADRC | 0.005 | 0.015 | 0.012 | 0.008
NESO-DSMC | 0.001 | 0.0003 | 0.0007 | 0.00028
DDPG-LADRC | 0.0006 | 0.0001 | 0.0005 | 0.00011
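For completeness, the values in Tables 8 and 9 follow the usual definitions of root mean square error and mean absolute error over the sampled tracking errors; a minimal illustrative computation (assuming error sequences logged from the simulation) is shown below.

import numpy as np

def rmse(err):
    """Root mean square error of a sampled tracking-error sequence."""
    err = np.asarray(err, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(err):
    """Mean absolute error of a sampled tracking-error sequence."""
    err = np.asarray(err, dtype=float)
    return float(np.mean(np.abs(err)))

# e.g., err_x = x_ref - x_meas at each sampling instant of the simulation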
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

