Article

USV Trajectory Tracking Control Based on Receding Horizon Reinforcement Learning

1 School of Automation, Wuhan University of Technology, Wuhan 430070, China
2 School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(9), 2771; https://doi.org/10.3390/s24092771
Submission received: 12 March 2024 / Revised: 22 April 2024 / Accepted: 24 April 2024 / Published: 26 April 2024
(This article belongs to the Section Fault Diagnosis & Sensors)

Abstract

We present a novel approach for achieving high-precision trajectory tracking control of an unmanned surface vehicle (USV) through receding horizon reinforcement learning (RHRL). The USV control architecture is a composite of feedforward and feedback components. The feedforward component is derived directly from the curvature of the reference path and the dynamic model, while the feedback component is obtained by applying the RHRL algorithm to solve the optimal tracking control problem. The proposed methodology incorporates a rolling time domain optimization mechanism, converting the infinite time domain optimal control problem into a succession of finite time domain control problems that are amenable to solution. In contrast to Lyapunov model predictive control (LMPC) and sliding mode control (SMC), the proposed RHRL controller yields an explicit state feedback control law, which endows it with the dual capabilities of direct offline deployment and online learning. Within each prediction time domain, we employ a time-independent executive–evaluator network structure to learn the optimal value function and control strategy. Furthermore, we establish the convergence of the RHRL algorithm in each prediction time domain through rigorous theoretical proof and analyze the stability of the closed-loop system. Finally, USV trajectory control tests are carried out in a simulated environment.

1. Introduction

A USV inherently constitutes a complex nonlinear system, being subject to disturbances and influences from the environment during navigation. Consequently, enhancing the path-tracking accuracy of unmanned ship motion control is a pressing concern.
At present, common methods for achieving such control include PID control [1,2], which is the most widely used, feedback control [3,4], fuzzy control [5,6], model predictive control (MPC) [7,8], and reinforcement learning (RL)-based control [9,10]. Of these approaches, the PID control method stands out for its advantages: it eliminates the need to model the unmanned ship, making it a robust and easily implementable controller. However, a challenge lies in ensuring the optimality of specific performance indices. While the fuzzy controller is capable of inferring and reproducing expert behavior, its application is challenged by the difficulty of crafting fuzzy rules, which arises primarily from the complexity of the navigation environment.
The feedback controller typically computes heading and lateral deviations by analyzing the geometric relationship between the USV and the desired path and, based on these, directly determines the steering angle for precise steering control. The tracking methods, which derive the relationship between a selected path anchor point and the USV position, include the single-point tracking method, the preview distance method, and the Stanley method. Both the single-point tracking method [11] and the preview distance method [12,13] offer the advantages of algorithmic simplicity and ease of implementation; however, the selection of the preview distance depends on the experiential judgment of the designer. The Stanley method, initially introduced by Stanford University for its autonomous vehicle, is well suited to lower vehicle speeds and requires a reference trajectory with continuous curvature for optimal implementation.
A plethora of research findings have emerged concerning the application of MPC to vehicle motion control, as documented in the literature [14,15,16,17]. Among these works, Falcone et al. [15] introduced an MPC motion controller based on a continuously linearized model, and their simulation results underscore the efficacy of the continuous-linearization MPC design approach in reducing computational cost. Carvalho et al. [17] studied an algorithm for local path planning using locally linearized MPC, carrying out linearization and convex approximation of nonlinear obstacle avoidance boundaries. Liniger et al. [18] proposed a lateral motion control method based on model predictive contouring control (MPCC); in this method, the lateral deviation is calculated by estimating the position of the projection point, which reduces the computational complexity to a certain extent. Ostafew et al. [19] adopted Gaussian process regression to build a nonparametric model of a mobile robot. For unmanned surface vehicles, a trajectory tracking controller based on MPC typically requires real-time numerical computation to solve for an open-loop control sequence, so its performance is influenced by the precision of the model as well as the unavoidable burden of online computation. Collectively, the current control strategies exhibit various limitations, characterized by suboptimal tracking accuracy and constrained computational efficiency.
In recent years, approximate dynamic programming (ADP) and reinforcement learning (RL) have been widely adopted in the design of robot decision and control algorithms thanks to their remarkable efficiency in solving optimization problems and their adaptive learning capabilities [20,21]. Yang [22] developed a learning-based PID control method for vehicle tracking control, in which the DHP algorithm was employed to adjust the PID parameters in real time so as to reduce the tracking deviation and enhance path-tracking accuracy. Gong et al. [23] designed a finite-time dynamic positioning controller for surface vessels. Shen et al. [24] introduced an innovative LMPC framework aiming to enhance trajectory tracking performance. Jiang et al. [25] also proposed sliding mode control to improve the tracking performance of USVs.
Recent advancements include noteworthy works employing deep learning and deep reinforcement learning to design controllers based on image or state information, facilitating trajectory control for USVs [26,27,28]. A key advantage of this approach lies in leveraging deep networks to enhance the feature representation capabilities of both reinforcement learning and supervised learning. Notably, the training process is entirely data driven, eliminating the need for dynamic model information. However, it has the following disadvantages:
(1)
Due to the inherent complexity of deep networks, application of this method is limited to offline training control strategies for online deployment. Moreover, its control performance is susceptible to the influence of factors such as the quantity and distribution of training samples.
(2)
In the context of deep network learning, the analysis of theoretical characteristics, such as convergence and robustness, remains a crucial and challenging issue for the academic community to address.
Motivated by the challenges outlined above, we propose an RHRL-based control method aimed at achieving high-precision lateral control of USVs. The initial step involves constructing a dynamic deviation model for the USV. The steering control comprises two parts: feedforward and feedback. Feedforward control is derived directly from the curvature of the reference path and the deviation model. In parallel, feedback control is established by solving the optimal tracking problem with the RHRL algorithm proposed in this paper. Diverging from conventional reinforcement-learning-based optimal control methods, RHRL employs a rolling horizon optimization mechanism that converts the infinite time domain optimal control problem into a sequence of finite time domain heuristic dynamic programming problems. In contrast to the MPC method, which solves for an open-loop control sequence, the strategy learned by our method is an explicit state feedback control law, which is amenable to direct offline deployment and online learning. Furthermore, in Section 2, the convergence of the proposed RHRL algorithm and the stability of the associated closed loop are theoretically analyzed within each prediction time domain. Finally, simulation and comparative experiments on USV trajectory control using the RHRL algorithm are conducted. Through simulation tests, the control performance is found to be comparable to that of LMPC, with notable advantages in terms of computational efficiency, lower sample complexity, and higher learning efficiency. To verify the algorithm's robustness and anti-interference capabilities, simulations incorporating disturbances are also conducted.
The remaining sections of this manuscript are arranged as follows. In Section 2, a dynamic model of a USV is built. Then, a USV trajectory control algorithm based on RHRL is proposed and shown to be stable. In Section 3, the simulation and comparison experiments are carried out, and disturbances are added. Section 4 contains the conclusions.

2. Materials and Methods

2.1. Modeling

In contemporary vehicle modeling, the utilization of three degrees of freedom (DOF) and six DOF predominates. However, considering the environment of the USV investigated in this study, which navigates on the sea surface, we opt for three degrees of freedom in the modeling process to avoid unnecessary complexity.
In the process of establishing the dynamical equations, a crucial decision lies in selecting the coordinate system in which to formulate them. Direct application of Newton's laws of motion requires the equations to be expanded in an inertial coordinate system. Nevertheless, several considerations compel us to derive the dynamic equations in the body-fixed coordinate system. One reason is to establish dynamic equations that are direction independent; additionally, the body-fixed coordinate system facilitates the direct assignment of forces and control moments. However, this frame of reference is not inertial. Hence, to account for the non-inertial reference frame, Coriolis and centripetal forces are introduced, which allows the remaining dynamics to be derived as if they were expressed in an inertial reference frame.
The USV under investigation features a catamaran-like structure, incorporating two fixed propellers positioned at the extremities of each hull. In Figure 1, $U_1$ and $U_2$ denote the speeds of the two thrusters, while $\theta$ represents the heading angle.
Considering its actual working environment, trajectory tracking control of the USV on the horizontal plane will be the focus of our study.
A reference frame called the BF (body frame) is fixed to the USV, with its origin chosen to coincide with the center of gravity. Global information is recorded in the IF (inertial frame). Thus, the USV's motion can be accurately described via the kinematic and dynamic equations together with the coordinate transformation between these two frames.
The kinematic equation is
$$\dot{\xi} = R(\theta)v$$
where $\xi = [x, y, \theta]^T$ represents the USV's position and heading in the IF; $v = [u, v, r]^T$ represents the USV's velocity in the BF; and the rotation matrix $R(\theta)$ depends on the heading angle $\theta$.
$R(\theta)$ can be expressed as
$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
According to the Newton’s law of motion, the dynamic equation can be established as follows:
M v ˙ + C v v + D v v + g ξ = κ
where $\kappa = [F_u, F_v, F_r]^T$ represents the thrust forces. The matrix $M$ is the inertia matrix (including added mass), and $C(v)$ represents the Coriolis and centripetal matrix. The concrete forms of these matrices are as follows:
$$M = \begin{bmatrix} M_{\dot{u}} & 0 & 0 \\ 0 & M_{\dot{v}} & 0 \\ 0 & 0 & M_{\dot{r}} \end{bmatrix}$$
$$C(v) = \begin{bmatrix} 0 & 0 & -M_{\dot{v}}v \\ 0 & 0 & M_{\dot{u}}u \\ M_{\dot{v}}v & -M_{\dot{u}}u & 0 \end{bmatrix}$$
$$D(v) = \begin{bmatrix} X_u + D_u|u| & 0 & 0 \\ 0 & Y_v + D_v|v| & 0 \\ 0 & 0 & Z_r + D_r|r| \end{bmatrix}$$
where D v is the USV’s damping matrix; g ξ denotes the specific restoring force.
The thrusters $\tau = [\tau_1, \tau_2, \tau_3]^T$ generate the thrust force $\kappa$ through $\kappa = B(\alpha)\tau$, with $\alpha$ denoting the thrusters' azimuth vector in the BF. The thrust distribution can then be written as
$$\kappa = B\tau, \quad B = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ l_1 & 0 & l_2 \end{bmatrix}$$
where $B$ is a constant $3 \times 3$ input matrix that distributes the thrust in three directions and satisfies the condition that $B^TB$ is nonsingular. $l_1, l_2 \in (0, 1)$ are the thrusters' efficiency factors.
Therefore, we can derive the dynamic model of the USV for trajectory tracking by combining Equations (1), (3), and (5):
$$\dot{x} = \begin{bmatrix} R(\theta)v \\ M^{-1}B\tau - M^{-1}C(v)v - M^{-1}D(v)v - M^{-1}g(\xi) \end{bmatrix} = f(x, \tau)$$
where $x = [x, y, \theta, u, v, r]^T$ is the defined state and $\tau = [\tau_1, \tau_2, \tau_3]^T$ is the control input. This completes the derivation of the dynamic equation governing USV operation on the water surface.
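To make the structure of the model above concrete, the following minimal Python sketch evaluates $f(x, \tau)$ for the 3-DOF equations; all numerical coefficients (inertia, damping, efficiency factors) and the zero restoring force are illustrative placeholders rather than the identified parameters of the studied vessel.

```python
import numpy as np

# Illustrative placeholder parameters (not the identified values of the studied USV).
M = np.diag([37.0, 37.0, 5.0])     # inertia matrix incl. added mass: diag(M_udot, M_vdot, M_rdot)
X_u, Y_v, Z_r = 2.0, 2.0, 1.0      # linear damping coefficients
D_u, D_v, D_r = 1.0, 1.0, 0.5      # quadratic damping coefficients
l1, l2 = 0.8, 0.8                  # thruster efficiency factors
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [l1,  0.0, l2]])     # constant thrust distribution matrix
g_xi = np.zeros(3)                 # restoring force, taken as zero on the horizontal plane

def rotation(theta):
    """Rotation matrix R(theta) from the body frame to the inertial frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def f(x, tau):
    """State derivative for x = [x, y, theta, u, v, r] and thruster inputs tau."""
    theta, nu = x[2], x[3:6]
    u, v, r = nu
    C = np.array([[0.0, 0.0, -M[1, 1] * v],
                  [0.0, 0.0,  M[0, 0] * u],
                  [M[1, 1] * v, -M[0, 0] * u, 0.0]])
    D = np.diag([X_u + D_u * abs(u), Y_v + D_v * abs(v), Z_r + D_r * abs(r)])
    xi_dot = rotation(theta) @ nu
    nu_dot = np.linalg.solve(M, B @ tau - C @ nu - D @ nu - g_xi)
    return np.concatenate([xi_dot, nu_dot])

# One forward-Euler step from rest under a constant thrust command.
x1 = np.zeros(6) + 0.1 * f(np.zeros(6), np.array([20.0, 0.0, 20.0]))
```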

2.2. The USV Trajectory Control Algorithm Based on RHRL

In this section, the USV trajectory control algorithm utilizing RHRL is elaborated. We initially formulate the performance index for the finite time domain trajectory control problem of the USV. Subsequently, we outline the core concepts of the associated reinforcement learning algorithm along with the design and implementation process of the controller. Also included is a detailed analysis of convergence based on this approach.
When conducting tracking control, it is necessary to describe the relative position between the USV and the desired path, as shown in Figure 2. The point P represents the closest point on the desired path to the USV and is called the road projection point. $P(X_p, Y_p, \varphi_d, \kappa)$ denotes the path information at the projection point, where $X_p, Y_p$ are the global coordinates of P, $\varphi_d$ is the angle between the tangent line at P and the X-axis (also known as the direction of the path), and $\kappa$ is the curvature of the path at point P.
The distance between P and the USV centroid is called the lateral deviation $e_y$; $e_y > 0$ is specified when the USV is located on the left side of the path, and $e_y < 0$ when the USV is on the right side. Therefore, the lateral deviation can be expressed as
$$e_y = -(X - X_p)\sin(\varphi_d) + (Y - Y_p)\cos(\varphi_d)$$
The path deviation $e_\varphi$ of the USV is defined as the difference between the heading and the path direction, i.e., $e_\varphi = \varphi - \varphi_d$, where $\varphi = \frac{1}{2}(\dot{z} + r)\tan\theta = r\tan\theta$. The first derivatives of $e_y$ and $e_\varphi$ are
$$\dot{e}_y = v_y\cos(e_\varphi) + v_x\sin(e_\varphi)$$
$$\dot{e}_\varphi = \omega - \kappa[v_x\cos(e_\varphi) - v_y\sin(e_\varphi)]$$
where $\omega = \dot{\varphi}$, $v_x = \dot{x}\cos\varphi + \dot{y}\sin\varphi + \sqrt{u^2 + v^2}\sin\varphi$, and $v_y = -\dot{x}\sin\varphi + \dot{y}\cos\varphi + \sqrt{u^2 + v^2}\cos\varphi$. It is assumed that $v_x$ remains constant, that there is no sideslip during motion, and that the expected yaw velocity along the desired path is constant; then, the lateral acceleration of the USV when it stably tracks the path is $a_y = v_x^2\kappa$.
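As a small, self-contained illustration of the lateral and heading deviations defined above, the sketch below evaluates $e_y$ and $e_\varphi$ from the USV pose and the projection point P; the sign of the first term in $e_y$ follows the stated left/right convention, the angle wrapping is an added convenience, and the example coordinates are arbitrary.

```python
import numpy as np

def tracking_errors(X, Y, phi, X_p, Y_p, phi_d):
    """Lateral deviation e_y and heading deviation e_phi w.r.t. projection point P.

    Convention as in the text: e_y > 0 when the USV lies on the left of the path.
    """
    e_y = -(X - X_p) * np.sin(phi_d) + (Y - Y_p) * np.cos(phi_d)
    e_phi = np.arctan2(np.sin(phi - phi_d), np.cos(phi - phi_d))  # phi - phi_d wrapped to (-pi, pi]
    return e_y, e_phi

# Arbitrary example pose and projection point.
e_y, e_phi = tracking_errors(X=1.0, Y=2.2, phi=0.30, X_p=1.1, Y_p=2.0, phi_d=0.25)
```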
Assuming that the course deviation $e_\varphi$ is small, the small-angle approximation gives $\sin(e_\varphi) \approx e_\varphi$ and $\cos(e_\varphi) \approx 1$. Then, the second derivative of the lateral deviation with respect to time can be expressed as
$$\ddot{e}_y = (\dot{v}_y + v_x\omega) - v_x^2\kappa$$
The first derivative can be approximated as
$$\dot{e}_y = v_y + v_xe_\varphi$$
Combining Equations (1), (3), (4), and (8)–(10), the following deviation model can be derived:
$$\dot{e} = A_ce + B_{c1}u + B_{c2}\omega_d$$
$$A_c = \begin{bmatrix} A_w & 0_{6\times3} & 0_{6\times3} \\ 0_{3\times6} & 0_{3\times3} & I_{3\times3} \\ 0_{3\times6} & 0_{3\times3} & -M^{-1}D \end{bmatrix}$$
$$A_w = \begin{bmatrix} 0_{3\times3} & \bar{A}_w \end{bmatrix}, \quad \bar{A}_w = \begin{bmatrix} M_{\dot{u}} + M_{\dot{v}}V & 0 & X_u + D_u|u| \\ 0 & M_{\dot{u}}u & M_{\dot{r}} \\ M_{\dot{r}} & Y_v + D_v|v| & Z_r + D_r|r| \end{bmatrix}$$
$$B_{c1} = \begin{bmatrix} E_w & 0_{6\times3} \\ 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & M^{-1} \end{bmatrix}, \quad B_{c2} = \begin{bmatrix} 0_{6\times3} \\ 0_{3\times3} \\ M^{-1} \end{bmatrix}$$
where $\omega_d = \dot{\varphi}_d$, $e = [e_y, \dot{e}_y, e_\varphi, \dot{e}_\varphi]^T$, and the control quantity is $u = \delta_f$.
Given a sampling period $\Delta t$, Equation (11a) can be discretized as
$$e(k+1) = Ae(k) + B_1u(k) + B_2\omega_d(k)$$
where $A = I + \Delta tA_c$, $B_1 = \Delta tB_{c1}$, $B_2 = \Delta tB_{c2}$, and $k$ is the discrete time index.
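The forward-Euler discretization above is straightforward to implement; in the sketch below the continuous-time matrices are generic placeholders (their true entries depend on the hydrodynamic parameters), and only the structure $A = I + \Delta t A_c$, $B_1 = \Delta t B_{c1}$, $B_2 = \Delta t B_{c2}$ is taken from the text.

```python
import numpy as np

def discretize_error_model(A_c, B_c1, B_c2, dt):
    """Forward-Euler discretization: A = I + dt*A_c, B1 = dt*B_c1, B2 = dt*B_c2."""
    return np.eye(A_c.shape[0]) + dt * A_c, dt * B_c1, dt * B_c2

# Placeholder continuous-time matrices with illustrative 4-state dimensions.
A_c = np.array([[0.0, 1.0, 0.0, 0.0],
                [-0.2, -0.5, 2.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, -1.0, -0.8]])
B_c1 = np.array([[0.0], [1.0], [0.0], [2.0]])
B_c2 = np.array([[0.0], [0.0], [0.0], [-1.0]])
A, B1, B2 = discretize_error_model(A_c, B_c1, B_c2, dt=0.1)

def step(e, u, w_d):
    """One step of the discrete deviation model e(k+1) = A e(k) + B1 u(k) + B2 w_d(k)."""
    return A @ e + B1.flatten() * u + B2.flatten() * w_d
```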
For the above model (Equation (12)), it is assumed that the path information $\{(X_i, Y_i)\}_{i=1}^M$ is available. The purpose of this paper is to design a lateral control algorithm based on RHRL (as shown in Figure 3) such that, during the control process, the lateral error state gradually converges to zero, that is, $e \to 0$.

2.2.1. Design of Performance Index for the Finite Time Domain Trajectory Control Problem

In this section, a detailed control algorithm based on RHRL is presented. We commence by designing the performance index for the finite time domain lateral control problem of the USV. Subsequently, we outline the core concept of the RHRL algorithm and delve into its design, implementation, and convergence analysis based on the actuator–evaluator structure. For the system deviation model of Equation (12), the control quantity can be decomposed into a feedforward component $u_f$ plus a feedback component $u_b$ such that $u = u_f + u_b$, as shown in Figure 3. The feedforward control quantity represents the expected control input during steady-state operation and applies when the vehicle is stably following the reference path; at that time, $e(k) = e(k+1) = 0$ holds and $u_b = 0$ as well. The feedforward control quantity $u_f$ can be determined as follows:
$$\sum_{j=0}^{\infty}A^jB_1u_f = -\sum_{j=0}^{\infty}A^jB_2\omega_d$$
The value of $\omega_d$ in the above formula can be obtained from $\omega_d = v_x\kappa$. $A$, $B_1$, and $B_2$ are the discrete-time coefficient matrices.
Since $u_f$ can easily be solved at any current time $k$, we assume that $u_f$ remains constant throughout the prediction time domain $[k, k+N]$; then, the feedback control quantity $u_b$ to be solved must satisfy the following constraint:
$$u_b \in U_b = \{u \in \mathbb{R} \mid \underline{u} - u_f \le u \le \bar{u} - u_f\}$$
where $\bar{u}$ is the maximum of $u$ and $\underline{u}$ is the minimum of $u$. The RHRL algorithm introduced in this paper seeks to minimize the following performance index by optimizing $u_b \in U_b$ in each prediction time domain:
$$V(e(k)) = \sum_{l=k}^{k+N-1}L(e(l), u_b(l)) + V_f(e(k+N))$$
where the stage cost is $L(e(l), u_b(l)) = e^T(l)Qe(l) + Pu_b^2(l)$, $Q \in \mathbb{R}^{4\times4}$ is a positive definite matrix, $P$ is a preset positive real number, and the terminal cost of the prediction time domain is
$$V_f(e(k+N)) = e^T(k+N)\bar{R}e(k+N)$$
where the penalty matrix $\bar{R} \in \mathbb{R}^{4\times4}$ is positive definite and can be obtained by solving the following Lyapunov equation:
$$F^T\bar{R}F - \bar{R} = -Q - K^TPK$$
where $F = A + B_1K$ and $K \in \mathbb{R}^{1\times4}$ is a feedback gain matrix chosen such that $F$ is Schur stable, i.e., all roots of the characteristic polynomial of $F$ lie within the unit circle, which is the stability condition for discrete linear systems.
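The terminal penalty $\bar{R}$ can be computed numerically once a stabilizing gain K has been chosen. The sketch below is one possible realization using SciPy; the model matrices are the placeholders from the previous sketch, and the gain is obtained from a discrete LQR design purely for illustration (any K that makes $F = A + B_1K$ Schur stable would do).

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Placeholder discrete-time error model (same illustrative matrices as above).
A_c = np.array([[0.0, 1.0, 0.0, 0.0],
                [-0.2, -0.5, 2.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, -1.0, -0.8]])
A = np.eye(4) + 0.1 * A_c
B1 = 0.1 * np.array([[0.0], [1.0], [0.0], [2.0]])

Q = np.diag([10.0, 1.0, 10.0, 1.0])   # positive definite state weight (placeholder)
P = np.array([[0.1]])                 # positive control weight (scalar P in the cost)

# A stabilizing gain K (u_b = K e) from a discrete LQR design, so F = A + B1 K is Schur.
S = solve_discrete_are(A, B1, Q, P)
K = -np.linalg.solve(P + B1.T @ S @ B1, B1.T @ S @ A)   # shape (1, 4)
F = A + B1 @ K

# Terminal penalty from the Lyapunov equation F^T Rbar F - Rbar = -(Q + K^T P K).
Rbar = solve_discrete_lyapunov(F.T, Q + K.T @ P @ K)
assert np.allclose(F.T @ Rbar @ F - Rbar, -(Q + K.T @ P @ K), atol=1e-8)
```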

2.2.2. Path Control Algorithm Based on RHRL

The implementation of the finite time domain reinforcement learning algorithm using the executive–evaluator involves the following main steps:
First, according to Equation (15), for any $l \in [k, k+N-1]$ we can express the value function in the recursive form
$$V(e(l)) = L(e(l), u_b(l)) + V(e(l+1))$$
where $V(e(k+N)) = V_f(e(k+N))$. At the $l$-th prediction step, $V^*(e(l))$ is defined as the optimal value function, and we obtain the HJB equation of the above finite time domain optimal control problem:
$$V^*(e(l)) = \min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^*(e(l+1))\right\}$$
and the optimal control strategy:
$$u^*(e(l)) = \arg\min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^*(e(l+1))\right\}$$
In fact, due to the control constraints, it is difficult to obtain analytical solutions for $V^*$ and $u^*$ from Equations (19) and (20). In principle, we can approximate the optimal value function and control strategy through value iteration. For any $l \in [k, k+N-1]$, given the initial value $V^0(e(l)) = 0$, the following two steps are iterated for $i = 0, 1, 2, \ldots$ until $\|V^{i+1}(e(l)) - V^i(e(l))\| \to 0$:
(1)
Strategy update
$$u^i(e(l)) = \arg\min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^i(e(l+1))\right\}$$
(2)
Value update
$$V^{i+1}(e(l)) = L(e(l), u_b^i(e(l))) + V^i(e(l+1))$$
In conclusion, the task of trajectory tracking is accomplished through continuous updating of the control strategy and the value function.
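To make the two-step recursion concrete, the sketch below performs a single strategy update and value update at one state, with the admissible feedback set discretized on a grid and the current value estimate represented by an arbitrary quadratic form; all of these choices are illustrative and differ from the network-based implementation described in the next subsection.

```python
import numpy as np

def stage_cost(e, u_b, Q, P):
    return float(e @ Q @ e + P * u_b ** 2)

def value_iteration_step(e, V, A, B1, Q, P, u_grid):
    """One sweep of the recursion: greedy strategy update, then value update at state e."""
    # Strategy update: minimize stage cost plus V at the successor state over the
    # discretized admissible feedback set U_b.
    costs = [stage_cost(e, u, Q, P) + V(A @ e + B1 * u) for u in u_grid]
    u_best = float(u_grid[int(np.argmin(costs))])
    # Value update: Bellman backup with the newly selected control.
    return u_best, stage_cost(e, u_best, Q, P) + V(A @ e + B1 * u_best)

# Illustrative usage with a crude quadratic value estimate and a coarse control grid.
A = np.eye(4) + 0.1 * np.array([[0, 1, 0, 0], [-0.2, -0.5, 2, 0],
                                [0, 0, 0, 1], [0, 0, -1, -0.8]], float)
B1 = 0.1 * np.array([0.0, 1.0, 0.0, 2.0])
Q, P = np.diag([10.0, 1.0, 10.0, 1.0]), 0.1
V = lambda e: float(e @ e)                      # placeholder for V^i
u_grid = np.linspace(-1.0, 1.0, 41)             # discretized U_b
u_star, V_next = value_iteration_step(np.array([0.5, 0.0, 0.1, 0.0]), V, A, B1, Q, P, u_grid)
```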

2.2.3. Rolling Time Domain Executor–Evaluator Learning Implementation

We employ the executive–evaluator structure to implement the finite time domain value function iteration algorithm described above. In existing finite time domain reinforcement learning control algorithms [17], the value function in the prediction time domain is regarded as a time-dependent function.
Assumption 1. 
There exists a control strategy $u_b(e) = \Phi(v(e))$ such that the system of Equation (12) is asymptotically stable under the control strategy $u = u_b + u_f$, where $\Phi(v(e))$ is a continuous function satisfying $u_b(e) \in U_b$, $v(e) \in \mathbb{R}$.
The above assumption is essentially a statement of the stabilizability of the system in Equation (12). At the same time, it is worth noting that the dynamic model of Equation (12) presented in this paper is controllable, so there must exist a continuous control law $u_b(e) \in U_b$ that renders the system asymptotically stable under the control strategy $u = u_b + u_f$. Therefore, the above assumption is reasonable.
We define $\chi_f$ as a control invariant set under the control law $u_b = Ke \in U_b$; then, we can state the following theorem.
Theorem 1. 
(Time-independent value function). Suppose the prediction horizon N is chosen such that, for any prediction time domain $[k, k+N]$ and any initial state $e(k) \in \mathbb{R}^4$, the terminal state of the system of Equation (9) satisfies $e(k+N) \in \chi_f$ under the control strategy $u(e(l))$, $l \in [k, k+N-1]$. Then there exists a control strategy $u_b(e) \in U_b$ such that $V(e(l))$, $l \in [k, k+N-1]$, is a function independent of time.
Proof of Theorem 1. 
Firstly, consider the case $e(k) \in \chi_f$. Based on the definition of $\chi_f$, there is a control law $u_b = Ke = \Phi(v(e)) \in U_b$ that ensures the state at any future time satisfies $e(l) \in \chi_f$. From this, we can solve and obtain the following function:
$$V(e(l)) = \sum_{i=l}^{k+N-1}L(e(i), u_b(i)) + V_f(e(k+N)) = e(l)^T\bar{P}e(l)$$
For the case $e(k) \notin \chi_f$, according to Assumption 1, there exist a control strategy $u_b = \Phi(v(e))$ and a finite prediction step N such that $e(k+N) \in \chi_f$. In particular, let $v = Ke$; then
$$V(e(l)) = \sum_{i=l}^{k+N-1}L(e(i), u_b(i)) + V_f(e(k+N)) = \sum_{i=l}^{+\infty}L(e(i), u_b(i))$$
where $u_b = \Phi(v(e))$.
Hence, a value function and a strategy independent of time exist. Drawing inspiration from this, we adopt a time-independent executive–evaluator structure to execute the finite time-domain value function iteration process described above. Initially, a network of evaluators is designed to approximate the value function:
$$\hat{V}(e) = \hat{W}_c^T\varphi(e)$$
where $\hat{W}_c \in \mathbb{R}^{N_c}$ represents the weight vector of the evaluator network, $N_c$ denotes the number of network nodes, and $\varphi(e)$ is the network's basis function vector. According to the definition of the evaluator network, the resulting error $E$ and terminal error $E_f$ can be expressed as
$$E(l) = \hat{W}_c^T\varphi(l) - L(e(l), \hat{u}_b(l)) - \hat{W}_c^T\varphi(l+1)$$
$$E_f = \hat{W}_c^T\varphi(e_f) - e_f^T\bar{P}e_f$$
Therein, $e_f = e(k+N)$, which can be randomly valued around 0. By minimizing $E_c(l) = E(l)^2 + E_f^2$, the update equation for the weights of the evaluator network is derived as
$$\hat{W}_c(l+1) = \hat{W}_c(l) + \mu_c\left[\Delta\varphi(e(l+1))E(l) - \varphi(e_f)E_f\right]$$
where μ c > 0 is the learning rate of the evaluator network.
Next, to deal with the control constraints, we construct the actuator network as follows:
$$\hat{u}_b(l) = \bar{u}_1\tanh\left(\hat{W}_a^T\sigma(e(l))\right) + \bar{u}_2$$
where $\bar{u}_1 = 0.5(\bar{u}_b - \underline{u}_b)$, $\bar{u}_2 = 0.5(\bar{u}_b + \underline{u}_b)$, $\hat{W}_a \in \mathbb{R}^{N_a}$ is the weight vector of the actuator network, $\sigma(e)$ is the basis function vector of the network, and $N_a$ denotes the number of network nodes. Given that the actuator network aims to approximate the optimal control strategy, we define the control deviation as follows:
$$E_a(l) = \hat{W}_a^T\sigma(e(l)) + \frac{1}{2}R^{-1}B_1^T\nabla\varphi(e(l))\hat{W}_c(l)$$
By minimizing $E_a^2$, we can obtain the update rule for the actuator network weights as
$$\hat{W}_a(l+1) = \hat{W}_a(l) - \mu_a\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}$$
where μ a > 0 represents the learning rate of the actuator network.
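A compact sketch of the evaluator and actuator updates is given below. The basis functions, dimensions, and the coupling term corresponding to $\frac{1}{2}R^{-1}B_1^T\nabla\varphi\,\hat{W}_c$ are all placeholders (the coupling is passed in as a precomputed scalar, since its exact form depends on the chosen basis), so this is a structural illustration of Equations (27)–(30) rather than the exact implementation.

```python
import numpy as np

def critic_update(W_c, phi_l, phi_lp1, phi_f, stage_cost, V_f_target, mu_c):
    """Evaluator update: gradient step that reduces E(l)^2 + E_f^2 (cf. Equation (27))."""
    E_l = W_c @ phi_l - stage_cost - W_c @ phi_lp1     # Bellman residual
    E_f = W_c @ phi_f - V_f_target                     # terminal residual
    return W_c + mu_c * ((phi_lp1 - phi_l) * E_l - phi_f * E_f)

def actor_update(W_a, sigma_l, coupling, mu_a):
    """Actuator update: gradient step on E_a^2 (cf. Equations (29) and (30)).

    `coupling` stands in for the term (1/2) R^{-1} B_1^T grad(phi) W_c, supplied
    externally because its concrete form depends on the basis functions (assumption).
    """
    E_a = W_a @ sigma_l + coupling                     # scalar-control case
    return W_a - mu_a * 2.0 * E_a * sigma_l            # d(E_a^2)/dW_a = 2 E_a sigma

def actor_output(W_a, sigma_l, u_upper, u_lower):
    """Saturated actuator output u_b = u1*tanh(W_a^T sigma) + u2 (cf. Equation (28))."""
    u1, u2 = 0.5 * (u_upper - u_lower), 0.5 * (u_upper + u_lower)
    return u1 * np.tanh(W_a @ sigma_l) + u2

# Illustrative call with random placeholder quantities.
rng = np.random.default_rng(0)
W_c, W_a = rng.normal(size=8), rng.normal(size=6)
W_c = critic_update(W_c, rng.normal(size=8), rng.normal(size=8), rng.normal(size=8),
                    stage_cost=0.3, V_f_target=0.1, mu_c=0.01)
W_a = actor_update(W_a, rng.normal(size=6), coupling=0.05, mu_a=0.01)
u_b = actor_output(W_a, rng.normal(size=6), u_upper=1.0, u_lower=-1.0)
```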
Algorithm 1: Main steps of the finite time domain reinforcement learning algorithm implemented with the executive–evaluator structure.
(I)
Initialize the weights $\hat{W}_c$, $\hat{W}_a$, and obtain the initial state $Z_0$.
(II)
At time $t = k\Delta t$, find the projection point P according to the state $Z(t)$ and calculate the deviation state $e(t)$.
(III)
For $l \in [k, k+N-1]$, repeat the following steps 1–3:
(1)
According to Equations (17) and (28), calculate $u_f(l)$ and $\hat{u}_b(l)$, respectively.
(2)
Update $\hat{W}_c$, $\hat{W}_a$ according to Equations (27) and (30).
(3)
Calculate $u(l) = u_f(l) + \hat{u}_b(l)$ according to Equations (13) and (28), and apply the prediction model to obtain $e(l+1)$.
(IV)
Calculate $u_f(k)$ and $\hat{u}_b(e(k))$ according to Equations (12) and (27), respectively.
(V)
In the time period $[k\Delta t, (k+1)\Delta t)$, apply the control $u(t) = u(k\Delta t)$ directly to the USV, and update the system state $Z((k+1)\Delta t)$.
(VI)
Set $k \leftarrow k+1$ and repeat operations II–V based on the receding time domain optimization strategy.
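To show how the steps of Algorithm 1 fit together, the following self-contained sketch runs the receding-horizon actor–critic loop on a toy stand-in for the deviation model. The matrices, bases, learning rates, the straight reference ($u_f = 0$, $\omega_d = 0$), and the numerically computed greedy target used in the actuator update are all illustrative assumptions rather than the exact formulation in the text; the terminal-error term of the evaluator update is also omitted for brevity.

```python
import numpy as np

# Toy placeholder deviation model and weights (illustrative values only).
A = np.eye(4) + 0.1 * np.array([[0, 1, 0, 0], [-1, -1, 0, 0],
                                [0, 0, 0, 1], [0, 0, -1, -1]], float)
B1 = 0.1 * np.array([0.0, 1.0, 0.0, 2.0])
B2 = 0.1 * np.array([0.0, 0.0, 0.0, -1.0])
Q, P_w, u_max = np.diag([10.0, 1.0, 10.0, 1.0]), 0.1, 1.0
mu_c, mu_a, N = 0.05, 0.05, 5
u_grid = np.linspace(-u_max, u_max, 21)                   # discretized admissible set U_b

phi = lambda e: np.outer(e, e)[np.triu_indices(4)]        # quadratic evaluator basis
sigma = lambda e: e.copy()                                # linear actuator basis
u_b_of = lambda W_a, e: u_max * np.tanh(W_a @ sigma(e))   # saturated actuator output

def horizon_learning(e_k, W_c, W_a, u_f, w_d):
    """Step III of Algorithm 1: actor-critic learning over the horizon [k, k+N-1]."""
    e = e_k.copy()
    for _ in range(N):
        u_b = u_b_of(W_a, e)
        e_next = A @ e + B1 * (u_f + u_b) + B2 * w_d      # prediction model
        cost = e @ Q @ e + P_w * u_b ** 2
        # Evaluator: gradient step that reduces the squared Bellman residual.
        E_l = W_c @ phi(e) - cost - W_c @ phi(e_next)
        W_c = W_c + mu_c * (phi(e_next) - phi(e)) * E_l
        # Actuator: move the output toward the greedy control w.r.t. the current
        # evaluator (a numerical stand-in for the analytic target in the text).
        q_vals = [e @ Q @ e + P_w * u ** 2 +
                  W_c @ phi(A @ e + B1 * (u_f + u) + B2 * w_d) for u in u_grid]
        u_target = float(u_grid[int(np.argmin(q_vals))])
        E_a = u_b - u_target
        grad_u = u_max * (1.0 - np.tanh(W_a @ sigma(e)) ** 2) * sigma(e)
        W_a = W_a - mu_a * 2.0 * E_a * grad_u
        e = e_next
    return W_c, W_a

# Steps II and IV-VI: receding-horizon execution with a plant identical to the model.
e, W_c, W_a = np.array([0.5, 0.0, 0.2, 0.0]), np.zeros(10), np.zeros(4)
for k in range(100):
    W_c, W_a = horizon_learning(e, W_c, W_a, u_f=0.0, w_d=0.0)   # step III
    u = 0.0 + u_b_of(W_a, e)                                     # step IV: u = u_f + u_b
    e = A @ e + B1 * u + B2 * 0.0                                # step V: apply to the "plant"
```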

2.2.4. Convergence Analysis of the Weight of Finite Time Domain Actuator and Evaluator

Next, we present the convergence analysis of the above RHRL algorithm in each prediction time domain $[k, k+N-1]$. First, the (locally) optimal value function and control strategy can be represented by networks:
$$V^*(e) = W_c^T\varphi(e) + \kappa_c$$
$$u_b^* = \bar{u}_1\tanh\left(W_a^T\sigma(e) + \kappa_a\right) + \bar{u}_2$$
where $W_a$ and $W_c$ are the ideal weights, and $\kappa_a$ and $\kappa_c$ are the reconstruction errors.
Assumption 2. 
(Network reconstruction error)
(1) 
$\|W_c\| \le W_{c,m}$, $\|\varphi\| \le \varphi_m$, $\|\Delta\varphi\| \le \bar{\varphi}_m$, $\|\kappa_c\| \le \kappa_{c,m}$, $\|\Delta\kappa_c\| \le \bar{\kappa}_{c,m}$
(2) 
$\|W_a\| \le W_{a,m}$, $\|\psi\| \le \psi_m$, $\|\kappa_a\| \le \kappa_{a,m}$
Assumption 3. 
(Continuous excitation)
There exist positive real numbers $q_1$, $q_2$ with $q_1 < q_2$ such that
$$q_1 \le \bar{\varphi}, \bar{\varphi}_f \le q_2$$
where $\bar{\varphi} = \Delta\varphi^T\Delta\varphi$, $\bar{\varphi}_f = \varphi_f^T\varphi_f$, and $\varphi_f = \varphi(e_f)$.
To state the following theorem more compactly, define $\gamma_1 = (4 - 4\bar{\psi}\mu_a) - (4 - 8\bar{\psi}\mu_a)(\beta_1 + \beta_3)$, $\bar{\psi} = \psi^T\psi$, and $\bar{\varphi} = \bar{\varphi}(l+1) + \bar{\varphi}_f$; $\alpha$ and $\beta_0, \beta_1, \beta_2, \beta_3$ are tunable positive real numbers.
Theorem 2. 
Under Assumptions 2 and 3, if appropriate learning rates $\mu_c$ and $\mu_a$ and constants $\{\beta_i\}_{i=0}^3$ are chosen such that $\gamma_1 > 0$ and $\alpha - \gamma_2 > 0$, then the network weights $\hat{W}_c$ and $\hat{W}_a$ updated by Equations (27) and (30) asymptotically converge to the following regions:
$$\|\bar{W}_c\| \le \sqrt{\frac{E_t}{\gamma_1}}$$
$$\|\xi_a\| \le \sqrt{\frac{E_t}{(\alpha - \gamma_2)\lambda_{\min}(\bar{g})}}$$
where $\bar{W}_c = W_c - \hat{W}_c$, $\bar{W}_a = W_a - \hat{W}_a$, $\xi_a = \bar{W}_a^T\psi$, and $E_t$ is the bounded error term defined below.
Furthermore, if $\kappa_{c,m}, \bar{\kappa}_{c,m}, \kappa_{a,m} \to 0$, then $\bar{W}_c$ and $\xi_a$ converge asymptotically to 0.
Proof of Theorem 2. 
The Lyapunov function is defined as follows:
$$L(l) = L_c(l) + L_a(l)$$
where $L_c = \mathrm{tr}(\bar{W}_c^T\eta_c^{-1}\bar{W}_c)$ and $L_a = \mathrm{tr}(\bar{W}_a^T\eta_a^{-1}\bar{W}_a)$. These can be evaluated based on Equation (26).
$$E(l) = \hat{W}_c^T\varphi(l) - \hat{W}_c^T\varphi(l+1) + \Delta V^*(l+1) = \bar{W}_c^T\Delta\varphi(l+1) + \Delta\kappa_c(l+1)$$
where $\Delta V^*(l+1) = V^*(l+1) - V^*(l)$ and $\Delta\kappa_c(l+1) = \kappa_c(l+1) - \kappa_c(l)$.
$$E_f = \hat{W}_c^T\varphi_f - W_c^T\varphi_f - \kappa_{c,f} = -\bar{W}_c^T\varphi_f - \kappa_{c,f}$$
where $\kappa_{c,f} = \kappa_c(k+N)$. Then, according to Equations (27), (35), and (36),
$$\Delta L_c(l+1) = L_c(l+1) - L_c(l) = -2\bar{W}_c^T(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c) + \mu_c(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c)^T(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c) \le -\alpha\|\bar{W}_c\|^2 + E_c$$
where $\bar{\kappa}_c = \Delta\varphi(l+1)\Delta\kappa_c(l+1) - \varphi_f\kappa_{c,f}$ and $E_c = (2\mu_c + \beta_0^{-1})\|\bar{\kappa}_c\|^2$.
Similarly, $\Delta L_a(l+1)$ can be expressed as
$$\Delta L_a(l+1) = \mathrm{tr}\left[2\bar{W}_a^T(l)\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)} + \mu_a\left(\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}\right)^T\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}\right]$$
Considering that $E_a = \xi_a - g\bar{W}_c + \bar{\kappa}_a$, $g = \frac{1}{2}R^{-1}B_1^T\nabla\varphi$, $\bar{\kappa}_a = \kappa_a - \frac{1}{2}R^{-1}B_1^T\nabla\kappa_c$, and $\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)} = 2\psi E_a$, we obtain
$$\Delta L_a = -(4 - 4\bar{\psi}\mu_a)\|\xi_a\|^2 - 8\bar{\psi}\mu_ag\bar{W}_c\bar{\kappa}_a + 4\bar{\psi}\mu_a\|\bar{W}_c\|_{\bar{g}}^2 + (4 - 8\bar{\psi}\mu_a)(\xi_a\bar{\kappa}_a - \xi_a^Tg\bar{W}_c)$$
where $\bar{g} = g^Tg$. According to Young's inequality,
$$\Delta L_a(l+1) \le -\gamma_1\|\xi_a\|^2 + \gamma_2\|\bar{W}_c\|_{\bar{g}}^2 + E_a$$
where $E_a = (1/\beta_2 + 1/\beta_3)\|\bar{\kappa}_a\|^2$. Then, defining $E_t = E_{c,m} + E_{a,m}$, we obtain
$$\Delta L \le -\gamma_1\|\xi_a\|^2 - (\alpha - \gamma_2)\|\bar{W}_c\|_{\bar{g}}^2 + E_t$$
On this basis, if $\bar{\kappa}_{c,m}, \kappa_{c,m}, \kappa_{a,m} \to 0$, then $E_t \to 0$, and $\bar{W}_c$ and $\xi_a$ asymptotically converge to 0.
Hence, at this juncture, we have successfully concluded the proof of Theorem 2. □
The above theorem indicates that $u$ can be made to converge to $u_b^*$ with an arbitrarily small error by increasing the number of basis function nodes in the actuator and the evaluator. Therefore, under the premise that Assumption 1 holds, if a sufficiently large N is chosen, the system of Equation (12) satisfies the terminal condition $e(k+N) \in \chi_f$ in the prediction time domain $[k, k+N-1]$ when driven by the strategy $\{u_b^*(k|k), \ldots, u_b^*(k+N-1|k)\}$. Thus, in the next prediction time domain $[k+1, k+N]$, the sequence $\{u_b^*(k+1|k), \ldots, u_b^*(k+N-1|k), Ke(k+N|k)\}$ is a feasible control strategy. We denote the loss produced by this feasible strategy by $Los_f(k+1|k)$; referring to Rawlings et al. [29], $Los_f(k+1|k) \le Los^*(k|k) - L(e(k|k), u_b(k|k))$ holds. Since $Ke(k+N|k)$ is suboptimal, we may safely derive
$$Los^*(k+1|k+1) - Los^*(k|k) \le Los_f(k+1|k) - Los^*(k|k) \le -L(e(k|k), u_b(k|k))$$
from which the stability of the closed-loop system follows via Lyapunov stability analysis.

3. Simulation Analysis

To ensure a precise comparison of the control performance between RHRL, Lyapunov-based MPC (LMPC), and sliding mode control (SMC), the control variable method was adopted using experimental parameters from [24,25]. In the simulations, all of the hydrodynamic parameters in the equations are based on the Falcon model [30].
The simulation results presented in this section showcase the advantages of the RHRL method. The simulations were carried out in MATLAB 2021b on an AMD Ryzen 7 5800H processor.

3.1. Parameter Selection

Two distinct desired trajectories are employed. Referring to the article by Li et al. [31], one trajectory (Path I) is a typical sinusoidal path:
$$p(t): \quad x_d = 0.4t, \quad y_d = \sin(0.4t)$$
The other trajectory, Path II, is based on [32] and is an S-shaped path:
$$p(t): \quad x_d = \sin(0.4t), \quad y_d = \sin(0.24t)$$
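For reference, both desired trajectories can be generated directly from the expressions above; the time span and sampling step in this small sketch are arbitrary choices.

```python
import numpy as np

t = np.arange(0.0, 30.0, 0.1)                     # arbitrary time span and sampling step

x_d1, y_d1 = 0.4 * t, np.sin(0.4 * t)             # Path I: sinusoidal path
x_d2, y_d2 = np.sin(0.4 * t), np.sin(0.24 * t)    # Path II: S-shaped path
```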
For the RHRL controller, the following parameters are used: the prediction horizon is set to $T = 5\delta$, where $\delta = 0.1$ s is the sampling period; the weighting matrices are $Q = \mathrm{diag}(10^5, 10^5, 10^3, 10^2, 10^2, 10^2)$, $R = \mathrm{diag}(10^4, 10^4, 10^4, 10^4)$, and $P = \mathrm{diag}(10^3, 10^3, 10^2, 10, 10)$. The control gains are $K_p = K_d = \mathrm{diag}(1, 1, 1)$, and $l_1 = l_2 = 0.8$.
In this section, the desired trajectory tracking simulation of a USV based on RHRL will be executed as described to emphasize the feasibility and efficiency of RHRL algorithm proposed earlier. The parameters for USV simulation are presented in Table 1.

3.2. Tracking Performance

Figure 4a,c depict the tracking results for Path I. The USV trajectories are represented by the blue curve for the LMPC controller, the green curve for the SMC controller, and the red curve for the RHRL controller, while the black curve illustrates the desired sinusoidal trajectory. The results demonstrate that all controllers successfully guide the USV along the desired trajectory, affirming closed-loop stability. However, the RHRL method exhibits considerably faster convergence than the LMPC and SMC methods. This acceleration in convergence is attributed to the selection of small control gain matrices $K_p$ and $K_d$. The simulation results show that the improvement in tracking accuracy is due to synchronous online incremental learning and deployment.
Figure 4b illustrates the thrust output of each propeller. It is evident that at the commencement of tracking, the RHRL controller maximally utilizes the onboard thrust capability to achieve convergence as swiftly as possible. In essence, the state remains within the prescribed boundary, aligning with expectations. It is also notable that RHRL demonstrates superior adjustment capability and undergoes more rapid adjustments.
The outcomes for Path II are presented in Figure 5. Similar observations can be made: the USV converges to the desired trajectory more quickly with RHRL.

3.3. Robustness Experiment with Disturbance

The incorporation of the receding horizon implementation introduces feedback into the closed-loop system. One of the inherent advantages of the RHRL controller is its robustness toward disturbances and emergencies, making it particularly well suited for control systems in marine and submarine environments. The RHRL's robustness is thoroughly examined and demonstrated through simulations. A constant simulated disturbance of $[100\ \mathrm{N}, 100\ \mathrm{N}, 0\ \mathrm{N\cdot m}]^T$ was added. To provide a clearer visualization of the deviation between the three algorithms, the reference trajectory, indicated by a black line, is also included in this experiment.
In analyzing the outcomes shown in Figure 6 and Figure 7, it is evident that RHRL tracking control consistently guides the USV to converge adequately toward the desired trajectory. In contrast, substantial tracking errors are exhibited when conducting tracking control using LMPC, and even greater errors are associated with SMC. Figure 6b and Figure 7b illustrate that the RHRL controller consistently provides feedback responses within a small time domain, ensuring minimal deviation.
The MSEs (mean square errors) for both paths are consolidated in Table 2 and Table 3. Generally, the MSEs are approximately 10 times smaller for RHRL compared to LMPC and SMC, especially in the case of Path II. Indeed, it is widely acknowledged that smaller MSEs correspond to reduced tracking error, thereby resulting in higher tracking accuracy; thus, it is evident that the RHRL algorithm significantly enhances tracking accuracy.
In order to more objectively demonstrate the excellent performance of the algorithm, we propose conducting quantitative analysis based on a new factor, namely thrust output. It is known that a smaller average value of thrust corresponds to lower energy consumption and enhanced cost-effectiveness. The specific data are shown in Table 4 and Table 5. As can be seen from the tables, the energy consumption of RHRL compared with LMPC is reduced by 43.85% and 41.65% for Paths I and II, respectively. The data show that RHRL is much more economical than LMPC. However, due to the algorithm characteristics, RHRL does not have a significant advantage over SMC based on this analysis.
The observed disparity stems from RHRL’s ability to learn and adapt online, utilizing online optimization to dynamically adjust control gains and effectively compensate for interference. Conversely, both LMPC and SMC lack this flexibility. Consequently, robustness is significantly enhanced by RHRL control.

4. Conclusions

In this paper, a trajectory control algorithm for USVs based on RHRL is introduced in which reinforcement learning is seamlessly integrated with a rolling time domain optimization mechanism. Thus, the infinite time domain self-learning optimization problem is effectively converted into a series of finite time domain optimization problems, which can then be solved using an executive–evaluator algorithm. The incorporation of the rolling time domain mechanism significantly enhances the learning efficiency of the RL algorithm. Moreover, compared to LMPC and SMC, the optimization method utilizing the executive and evaluator contributes to enhanced computational efficiency. Diverging from the majority of existing finite time domain executive–evaluator learning algorithms, the proposed RHRL employs a time-independent single-network structure, which reduces the intricacy of network design and the online computational complexity. Moreover, we analyzed the stability of the closed-loop system theoretically. Concerning scenarios involving significant errors in the learned approximation strategy, we plan to conduct in-depth analysis and substantiation in our forthcoming research. The simulation results demonstrate that our algorithm is effective in comparison with typical traditional algorithms in the simulation scenarios: RHRL control is superior to LMPC and SMC in terms of control performance and computational efficiency while also being more economical than LMPC, and it has lower sample complexity and higher learning efficiency.

Author Contributions

Conceptualization, Y.C. and X.G.; methodology, X.G.; software, Y.W.; validation, Y.C.; formal analysis, X.G.; investigation, X.G.; resources, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, X.G.; visualization, X.G.; supervision, Y.C.; project administration, Y.C.; funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fundamental Research Funds for the Central Universities of Ministry of Education of China grant number 2023IVA092.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments which improved the quality of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
USV: unmanned surface vehicle
RHRL: receding horizon reinforcement learning
LMPC: Lyapunov model predictive control
SMC: sliding mode control
PID: proportional integral derivative
MPCC: model predictive contouring control
ADP: approximate dynamic programming
KDHP: kernel-based dual heuristic programming (DHP)
BF: body frame
IF: inertial frame
MSE: mean square error

References

  1. Alim, M.F.A.; Kadir, R.E.A.; Gamayanti, N.; Santoso, A.; Sahal, M. Autopilot system design on monohull USV- LSS01 using PID-based sliding mode control method. IOP Conf. Ser. Earth Environ. Sci. 2021, 649, 012058. [Google Scholar] [CrossRef]
  2. Guo, X.-K. Particle swarm optimization for pid usv heading stability control. Ship Sci. Technol. 2019, 41, 52–54. [Google Scholar]
  3. Ege, E.; Ankarali, M.M. Feedback motion planning of unmanned surface vehicles via random sequential composition. Trans. Inst. Meas. Control 2019, 41, 3321–3330. [Google Scholar] [CrossRef]
  4. Huanyin, Z.; Xisheng, F.; Zhiqiang, H.U.; Wei, L.I. Dynamic Feedback Controller Based on Optimized Switching of Multiple Identification Models for Course Control of Unmanned Surface Vehicle. Robot 2013, 35, 552. [Google Scholar]
  5. Yan, D.; Xiao, C.; Wen, Y. Pod Propulsion Small Surface USV Heading Control Research. In Proceedings of the 26th International Ocean and Polar Engineering Conference, Rhodes, Greece, 26 June–1 July 2016. [Google Scholar]
  6. Deng, Y.; Zhang, X.; Im, N.; Zhang, G.; Zhang, Q. Adaptive fuzzy tracking control for underactuated surface vessels with unmodeled dynamics and input saturation. ISA Trans. 2020, 103, 52–62. [Google Scholar] [CrossRef]
  7. Dong, Z.; Zhang, Z.; Qi, S.; Zhang, H.; Li, J.; Liu, Y. Autonomous cooperative formation control of underactuated USVs based on improved MPC in complex ocean environment. Ocean Eng. 2023, 270, 113633. [Google Scholar] [CrossRef]
  8. Han, X.; Zhang, X. Tracking control of ship at sea based on MPC with virtual ship bunch under Frenet frame. Ocean Eng. 2022, 247, 110737. [Google Scholar] [CrossRef]
  9. Johnson, A.M.; Lenell, C.; Severa, E.; Rudisch, D.M.; Morrison, R.A.; Shembel, A.C. Semi-Automated Training of Rat Ultrasonic Vocalizations. Front. Behav. Neurosci. 2022, 16, 826550. [Google Scholar] [CrossRef]
  10. Zhao, Y.; Qi, X.; Ma, Y.; Li, Z.; Sotelo, M.A. Path Following Optimization for an Underactuated USV Using Smoothly-Convergent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 6208–6220. [Google Scholar] [CrossRef]
  11. Guo, J. Study on Lateral Fuzzy Control of Unmanned Vehicles Via Genetic Algorithms. J. Mech. Eng. 2012, 48, 76. [Google Scholar] [CrossRef]
  12. Leonard, J.; How, J.; Teller, S.; Berger, M.; Williams, J. A Perception-Driven Autonomous Urban Vehicle. J. Field Robot. 2009, 25, 727–774. [Google Scholar] [CrossRef]
  13. Rajamani, R.; Zhu, C.; Alexander, L. Lateral control of a backward driven front-steering vehicle. Control Eng. Pract. 2003, 11, 531–540. [Google Scholar] [CrossRef]
  14. Taherian, S.; Halder, K.; Dixit, S.; Fallah, S. Autonomous Collision Avoidance Using MPC with LQR-Based Weight Transformation. Sensors 2021, 21, 4296. [Google Scholar] [CrossRef]
  15. Falcone, P.; Borrelli, F.; Asgari, J.; Tseng, H.E.; Hrovat, D. Predictive Active Steering Control for Autonomous Vehicle Systems. IEEE Trans. Control Syst. Technol. 2007, 15, 566–580. [Google Scholar] [CrossRef]
  16. Beal, C.E.; Gerdes, J.C. Model Predictive Control for Vehicle Stabilization at the Limits of Handling. IEEE Trans. Control Syst. Technol. 2013, 21, 1258–1269. [Google Scholar] [CrossRef]
  17. Li, D.; Zhao, D.; Zhang, Q.; Chen, Y. Reinforcement Learning and Deep Learning Based Lateral Control for Autonomous Driving [Application Notes]. IEEE Comput. Intell. Mag. 2019, 14, 83–98. [Google Scholar] [CrossRef]
  18. Domahidi, A.; Liniger, A.; Morari, M. Optimization-Based Autonomous Racing of 1:43 Scale RC Cars. Optim. Control Appl. Methods 2017, 36, 628–647. [Google Scholar]
  19. Ostafew, C.J.; Schoellig, A.P.; Barfoot, T.D. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. Int. J. Robot. Res. 2016, 35, 1547–1563. [Google Scholar] [CrossRef]
  20. Alighanbari, S.; Azad, N.L. Safe Adaptive Deep Reinforcement Learning for Autonomous Driving in Urban Environments. Additional Filter? How and Where? IEEE Access 2021, 9, 141347–141359. [Google Scholar] [CrossRef]
  21. Chen, Y.; Hereid, A.; Peng, H.; Grizzle, J. Enhancing the Performance of a Safe Controller Via Supervised Learning for Truck Lateral Control. J. Dyn. Syst. Meas. Control 2019, 141, 101005. [Google Scholar] [CrossRef]
  22. Zhou, X.; Wu, Y.; Huang, J. MPC-based path tracking control method for USV. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020. [Google Scholar]
  23. Gong, C.; Su, Y.; Zhu, Q.; Zhang, D.; Hu, X. Finite-time dynamic positioning control design for surface vessels with external disturbances, input saturation and error constraints. Ocean Eng. 2023, 276, 114259. [Google Scholar] [CrossRef]
  24. Shen, C.; Shi, Y.; Buckham, B. Trajectory Tracking Control of an Autonomous Underwater Vehicle Using Lyapunov-Based Model Predictive Control. IEEE Trans. Ind. Electron. 2017, 65, 5796–5805. [Google Scholar] [CrossRef]
  25. Jiang, X.; Xia, G. Sliding mode formation control of leaderless unmanned surface vehicles with environmental disturbances. Ocean Eng. 2022, 244, 110301. [Google Scholar] [CrossRef]
  26. Mayne, D.Q.; Kerrigan, E.C. Tube-Based Robust Nonlinear Model Predictive Control. Int. J. Robust Nonlinear Control 2009, 21, 1341–1353. [Google Scholar] [CrossRef]
  27. Zhang, X.; Pan, W.; Scattolini, R.; Yu, S.; Xu, X. Robust Tube-based Model Predictive Control with Koopman Operators–Extended Version. arXiv 2021, arXiv:2108.13011. [Google Scholar]
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  29. Rawlings, J.; Mayne, D.; Diehl, M. Model Predictive Control: Theory, Computation, and Design; Nob Hill Publishing, LLC: Madison, WI, USA, 2017. [Google Scholar]
  30. Proctor, A.A. Semi-autonomous guidance and control of a Saab SeaEye Falcon ROV. Ph.D. Thesis, University of Victoria, Victoria, BC, Canada, 2014. [Google Scholar]
  31. Li, B.; Zhang, H.; Niu, Y.; Ran, D.; Xiao, B. Finite-time disturbance observer-based trajectory tracking control for quadrotor unmanned aerial vehicle with obstacle avoidance. Math. Methods Appl. Sci. 2023, 46, 1096–1110. [Google Scholar] [CrossRef]
  32. Hmeyda, F.; Bouani, F. Camera-based autonomous Mobile Robot Path Planning and Trajectory tracking using PSO algorithm and PID Controller. In Proceedings of the 2017 International Conference on Control, Automation and Diagnosis (ICCAD), Hammamet, Tunisia, 19–21 January 2017. [Google Scholar]
Figure 1. Diagram of the BF (left) and IF (right).
Figure 2. Lateral error model.
Figure 3. Trajectory tracking control block diagram of the USV.
Figure 4. The USV trajectory tracking performance in Path I. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.
Figure 5. The USV trajectory tracking performance in Path II. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.
Figure 6. The USV trajectory tracking performance in Path I with disturbance. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.
Figure 7. The USV trajectory tracking performance in Path II with disturbance. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.
Table 1. Parameters for USV simulation.
Parameter | Value
M/kg (mass of USV) | 37
D/m (distance from motors to center of mass) | 0.7
K (viscosity coefficient) | 0.1
I (moment of inertia) | 0.2
T_e (sampling period) | 0.2
i (loop index) | 1
U_cruise = U_1 = U_2 | 2
Table 2. MSE for disturbances in Path I.
MSE | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
x (m²) | 0.0518 | 0.0086 | 0.0732 | 83.3% | 88.2%
y (m²) | 0.0286 | 0.0031 | 0.0391 | 89.3% | 92.1%
ψ (rad²) | 0.3198 | 0.0358 | 0.4273 | 88.9% | 91.6%
Table 3. MSE for disturbances in Path II.
MSE | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
x (m²) | 0.1386 | 0.0158 | 0.1450 | 88.6% | 89.1%
y (m²) | 0.0968 | 0.0079 | 0.1002 | 91.8% | 92.1%
ψ (rad²) | 0.8663 | 0.3561 | 0.9984 | 58.8% | 64.3%
Table 4. The average thrust output with disturbances in Path I.
Thrust output | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
U_1 (N) | 152.9 | 86.3 | 86.8 | 43.6% | 0.57%
U_2 (N) | 154.2 | 86.5 | 87.6 | 43.9% | 1.25%
Table 5. The average thrust output with disturbances in Path II.
Thrust output | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
U_1 (N) | 106.7 | 62.1 | 63.2 | 41.8% | 1.74%
U_2 (N) | 109.2 | 63.9 | 64.6 | 41.5% | 0.11%

