Article

USV Trajectory Tracking Control Based on Receding Horizon Reinforcement Learning

1 School of Automation, Wuhan University of Technology, Wuhan 430070, China
2 School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(9), 2771; https://doi.org/10.3390/s24092771
Submission received: 12 March 2024 / Revised: 22 April 2024 / Accepted: 24 April 2024 / Published: 26 April 2024
(This article belongs to the Section Fault Diagnosis & Sensors)

Abstract

We present a novel approach for achieving high-precision trajectory tracking control of an unmanned surface vehicle (USV) through receding horizon reinforcement learning (RHRL). The USV control architecture is a composite of feedforward and feedback components. The feedforward component is derived directly from the curvature of the reference path and the dynamic model, while the feedback component is obtained by applying the RHRL algorithm to solve the optimal tracking control problem. The proposed methodology incorporates a rolling time domain optimization mechanism, converting the infinite time domain optimal control problem into a succession of finite time domain control problems that are amenable to solution. In contrast to Lyapunov model predictive control (LMPC) and sliding mode control (SMC), the proposed RHRL controller yields an explicit state feedback control law, which endows it with the dual capabilities of direct offline deployment and online learning. Within each prediction time domain, we employ a time-independent executive–evaluator network structure to learn the optimal value function and control strategy. Furthermore, we establish the convergence of the RHRL algorithm in each prediction time domain through rigorous theoretical proof and analyze the stability of the closed-loop system. Finally, USV trajectory control tests are carried out in a simulated environment.

1. Introduction

A USV inherently constitutes a complex nonlinear system, being subject to disturbances and influences from the environment during navigation. Consequently, enhancing the path-tracking accuracy of unmanned ship motion control is a pressing concern.
At present, common methods for achieving such control include PID control [1,2], which is the most widely used, feedback control [3,4], fuzzy control [5,6], model predictive control (MPC) [7,8], and reinforcement learning (RL)-based control [9,10]. Of these approaches, the PID control method stands out for its advantages: it eliminates the need to model the unmanned ship, making it a robust and easily implementable controller. However, a challenge lies in ensuring the optimality of specific performance indices. While the fuzzy controller is capable of inferring and reproducing expert behavior, its application is challenged by the difficulty of crafting fuzzy rules, which arises primarily from the complexity of the navigation environment.
The feedback controller typically computes heading and lateral deviations by analyzing the geometric relationship between the USV and the desired path and, based on these, directly determines the steering angle for precise steering control. The tracking methods, which derive the relationship between a selected path anchor point and the USV position, include the single-point tracking method, the preview distance method, and the Stanley method. Both the single-point tracking method [11] and the preview distance method [12,13] offer the advantages of algorithmic simplicity and ease of implementation; however, the selection of the preview distance depends on the experiential judgment of the designer. The Stanley method, initially introduced by Stanford University for its autonomous vehicle, is well suited to lower vehicle speeds and requires a reference trajectory with continuous curvature for optimal implementation.
A plethora of research findings have emerged concerning the application of MPC to vehicle motion control, as documented in the literature [14,15,16,17]. Among these works, Falcone et al. [15] introduced an MPC motion controller based on a continuously linearized model, and their simulation results underscore the efficacy of the continuous-linearization MPC design approach in reducing computational cost. Carvalho et al. [17] studied an algorithm for local path planning using locally linearized MPC, carrying out linearization and convex approximation of nonlinear obstacle avoidance boundaries. Liniger et al. [18] proposed a lateral motion control method based on model predictive contouring control (MPCC); in this method, the lateral deviation is calculated by estimating the position of the projection point, which reduces the computational complexity to a certain extent. Ostafew et al. [19] adopted Gaussian process regression to build a nonparametric model of a mobile robot. For unmanned surface vehicles, a trajectory tracking controller based on MPC typically requires real-time numerical computation to solve for an open-loop control sequence, so its performance is influenced by the precision of the model as well as the unavoidable burden of online computation. Collectively, the current control strategies exhibit various limitations, characterized by suboptimal tracking accuracy and constrained computational efficiency.
In recent years, approximate dynamic programming (ADP) and reinforcement learning (RL) have been widely adopted in the design of robot decision and control algorithms thanks to their remarkable efficiency in solving optimization problems and their adaptive learning capabilities [20,21]. Yang [22] developed a learning-based PID control method for vehicle tracking control, in which the DHP algorithm was employed to adjust the PID parameters in real time so as to reduce the tracking deviation and enhance path-tracking accuracy. Gong et al. [23] designed a finite-time dynamic positioning controller for surface vessels. Shen et al. [24] introduced an innovative LMPC framework aiming to enhance trajectory tracking performance. Jiang et al. [25] also proposed sliding mode control to improve the tracking performance of USVs.
Recent advancements include noteworthy works employing deep learning and deep reinforcement learning to design controllers based on image or state information, facilitating trajectory control for USVs [26,27,28]. A key advantage of this approach lies in leveraging deep networks to enhance the feature representation capabilities of both reinforcement learning and supervised learning. Notably, the training process is entirely data driven, eliminating the need for dynamic model information. However, it has the following disadvantages:
(1)
Due to the inherent complexity of deep networks, application of this method is limited to offline training control strategies for online deployment. Moreover, its control performance is susceptible to the influence of factors such as the quantity and distribution of training samples.
(2)
In the context of deep network learning, the analysis of theoretical characteristics, such as convergence and robustness, remains a crucial and challenging issue for the academic community to address.
Motivated by the challenges outlined above, we propose an RHRL-based control method aimed at achieving high-precision lateral control of USVs. The initial step involves constructing a dynamic deviation model for the USV. The steering control comprises two parts: feedforward and feedback. Feedforward control is derived directly from the curvature of the reference path and the deviation model. In parallel, feedback control is established by solving the optimal tracking problem with the RHRL algorithm proposed in this paper. Diverging from conventional reinforcement-learning-based optimal control methods, RHRL employs a rolling horizon optimization mechanism that converts the infinite time domain optimal control problem into a sequence of finite time domain heuristic dynamic programming problems. In contrast to the MPC method, which solves for an open-loop control sequence, the strategy learned by our method is an explicit state feedback control law, which is amenable to direct offline deployment and online learning. Furthermore, in Section 2, the convergence of the proposed RHRL algorithm and the stability of the associated closed loop are theoretically analyzed within each prediction time domain. Finally, simulation and comparative experiments on USV trajectory control using the RHRL algorithm are conducted. Through simulation tests, the control performance is found to be comparable to that of LMPC, with notable advantages in terms of computational efficiency, lower sample complexity, and higher learning efficiency. To verify the algorithm's robustness and anti-interference capabilities, simulations incorporating disturbances are also conducted.
The remaining sections of this manuscript are arranged as follows. In Section 2, a dynamic model of a USV is built. Then, a USV trajectory control algorithm based on RHRL is proposed and shown to be stable. In Section 3, the simulation and comparison experiments are carried out, and disturbances are added. Section 4 contains the conclusions.

2. Materials and Methods

2.1. Modeling

In contemporary vehicle modeling, the utilization of three degrees of freedom (DOF) and six DOF predominates. However, considering the environment of the USV investigated in this study, which navigates on the sea surface, we opt for three degrees of freedom in the modeling process to avoid unnecessary complexity.
In the process of establishing the dynamical equations, a crucial decision lies in selecting the coordinate system in which to formulate them. Direct application of Newton's laws of motion requires the equations to be expanded in an inertial coordinate system. Nevertheless, several considerations compel us to derive the dynamic equations in the body-fixed coordinate system. One reason is to establish dynamic equations that are direction independent; additionally, the body-fixed coordinate system facilitates the direct assignment of forces and control moments. However, this frame of reference is not inertial. Hence, to account for the non-inertial reference frame, Coriolis and centripetal forces are introduced, which allows the remaining dynamics to be derived as if they were expressed in an inertial reference frame.
The USV under investigation features a catamaran-like structure, incorporating two fixed propellers positioned at the extremities of each hull. In Figure 1, $U_1$ and $U_2$ denote the speeds of the two thrusters, while $\theta$ represents the heading angle.
Considering its actual working environment, trajectory tracking control of the USV on the horizontal plane will be the focus of our study.
A reference frame called the BF (body frame) is fixed to the USV, with its origin chosen to coincide with the center of gravity. Global information is recorded in the IF (inertial frame). Thus, the USV's motion can be accurately described via the kinematic and dynamic equations together with the coordinate transformation between these two frames.
The kinematic equation is
$$\dot{\xi} = R(\theta)v$$
where $\xi = [x, y, \theta]^T$ represents the USV's position and heading in the IF; $v = [u, v, r]^T$ represents the USV's velocity in the BF; and the rotation matrix $R(\theta)$ depends on the heading angle $\theta$.
$R(\theta)$ can be expressed as
$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
According to the Newton’s law of motion, the dynamic equation can be established as follows:
M v ˙ + C v v + D v v + g ξ = κ
where $\kappa = [F_u, F_v, F_r]^T$ represents the thrust forces. The matrix $M$ is the inertia matrix (including added mass), and $C(v)$ represents the Coriolis and centripetal matrix. The concrete forms of these matrices are as follows:
$$M = \begin{bmatrix} M_{\dot{u}} & 0 & 0 \\ 0 & M_{\dot{v}} & 0 \\ 0 & 0 & M_{\dot{r}} \end{bmatrix}$$
$$C(v) = \begin{bmatrix} 0 & 0 & -M_{\dot{v}}v \\ 0 & 0 & M_{\dot{u}}u \\ M_{\dot{v}}v & -M_{\dot{u}}u & 0 \end{bmatrix}$$
$$D(v) = \begin{bmatrix} X_u + D_u|u| & 0 & 0 \\ 0 & Y_v + D_v|v| & 0 \\ 0 & 0 & Z_r + D_r|r| \end{bmatrix}$$
where D v is the USV’s damping matrix; g ξ denotes the specific restoring force.
The thrusters $\tau = [\tau_1, \tau_2, \tau_3]^T$ generate the thrust force $\kappa$ through $\kappa = B(\alpha)\tau$, with $\alpha$ denoting the thrusters' azimuth vector in the BF. The thrust distribution can then be written as
$$\kappa = B\tau, \quad B = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ l_1 & 0 & l_2 \end{bmatrix}$$
where $B$ is a constant $3 \times 3$ input matrix that distributes the thrust in three directions and satisfies the condition that $B^TB$ is nonsingular. $l_1, l_2 \in (0, 1)$ are the thrusters' efficiency factors.
Therefore, we can derive the dynamic model of the USV for trajectory tracking by combining Equations (1), (3), and (5):
$$\dot{x} = \begin{bmatrix} R(\theta)v \\ M^{-1}B\tau - M^{-1}C(v)v - M^{-1}D(v)v - M^{-1}g(\xi) \end{bmatrix} = f(x, \tau)$$
where $x = [x, y, \theta, u, v, r]^T$ is the defined state and $\tau = [\tau_1, \tau_2, \tau_3]^T$ is the control input. This completes the derivation of the dynamic equation governing USV operation on the water surface.
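To make the structure of the model above concrete, the following minimal Python sketch evaluates $f(x, \tau)$ for the 3-DOF equations; all numerical coefficients (inertia, damping, efficiency factors) and the zero restoring force are illustrative placeholders rather than the identified parameters of the studied vessel.

```python
import numpy as np

# Illustrative placeholder parameters (not the identified values of the studied USV).
M = np.diag([37.0, 37.0, 5.0])     # inertia matrix incl. added mass: diag(M_udot, M_vdot, M_rdot)
X_u, Y_v, Z_r = 2.0, 2.0, 1.0      # linear damping coefficients
D_u, D_v, D_r = 1.0, 1.0, 0.5      # quadratic damping coefficients
l1, l2 = 0.8, 0.8                  # thruster efficiency factors
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [l1,  0.0, l2]])     # constant thrust distribution matrix
g_xi = np.zeros(3)                 # restoring force, taken as zero on the horizontal plane

def rotation(theta):
    """Rotation matrix R(theta) from the body frame to the inertial frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def f(x, tau):
    """State derivative for x = [x, y, theta, u, v, r] and thruster inputs tau."""
    theta, nu = x[2], x[3:6]
    u, v, r = nu
    C = np.array([[0.0, 0.0, -M[1, 1] * v],
                  [0.0, 0.0,  M[0, 0] * u],
                  [M[1, 1] * v, -M[0, 0] * u, 0.0]])
    D = np.diag([X_u + D_u * abs(u), Y_v + D_v * abs(v), Z_r + D_r * abs(r)])
    xi_dot = rotation(theta) @ nu
    nu_dot = np.linalg.solve(M, B @ tau - C @ nu - D @ nu - g_xi)
    return np.concatenate([xi_dot, nu_dot])

# One forward-Euler step from rest under a constant thrust command.
x1 = np.zeros(6) + 0.1 * f(np.zeros(6), np.array([20.0, 0.0, 20.0]))
```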

2.2. The USV Trajectory Control Algorithm Based on RHRL

In this section, the USV trajectory control algorithm utilizing RHRL is elaborated. We initially formulate the performance index for the finite time domain trajectory control problem of the USV. Subsequently, we outline the core concepts of the associated reinforcement learning algorithm along with the design and implementation process of the controller. Also included is a detailed analysis of convergence based on this approach.
When conducting tracking control, it is necessary to describe the relative position between the USV and the desired path, as shown in Figure 2. The point P represents the closest point on the desired path to the USV and is called the road projection point. $P(X_p, Y_p, \varphi_d, \kappa)$ denotes the path information at the projection point, where $X_p, Y_p$ are the global coordinates of P, $\varphi_d$ is the angle between the tangent line at P and the X-axis (also known as the direction of the path), and $\kappa$ is the curvature of the path at point P.
The distance between P and the USV centroid is called the lateral deviation $e_y$; $e_y > 0$ is specified when the USV is located on the left side of the path, and $e_y < 0$ when the USV is on the right side. Therefore, the lateral deviation can be expressed as
$$e_y = -(X - X_p)\sin(\varphi_d) + (Y - Y_p)\cos(\varphi_d)$$
The path deviation $e_\varphi$ of the USV is defined as the difference between the heading and the path direction, i.e., $e_\varphi = \varphi - \varphi_d$, where $\varphi = \frac{1}{2}(\dot{z} + r)\tan\theta = r\tan\theta$. The first derivatives of $e_y$ and $e_\varphi$ are
$$\dot{e}_y = v_y\cos(e_\varphi) + v_x\sin(e_\varphi)$$
$$\dot{e}_\varphi = \omega - \kappa[v_x\cos(e_\varphi) - v_y\sin(e_\varphi)]$$
where $\omega = \dot{\varphi}$, $v_x = \dot{x}\cos\varphi + \dot{y}\sin\varphi + \sqrt{u^2 + v^2}\sin\varphi$, and $v_y = -\dot{x}\sin\varphi + \dot{y}\cos\varphi + \sqrt{u^2 + v^2}\cos\varphi$. It is assumed that $v_x$ remains constant, that there is no sideslip during motion, and that the expected yaw velocity along the desired path is constant; then, the lateral acceleration of the USV when it stably tracks the path is $a_y = v_x^2\kappa$.
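As a small, self-contained illustration of the lateral and heading deviations defined above, the sketch below evaluates $e_y$ and $e_\varphi$ from the USV pose and the projection point P; the sign of the first term in $e_y$ follows the stated left/right convention, the angle wrapping is an added convenience, and the example coordinates are arbitrary.

```python
import numpy as np

def tracking_errors(X, Y, phi, X_p, Y_p, phi_d):
    """Lateral deviation e_y and heading deviation e_phi w.r.t. projection point P.

    Convention as in the text: e_y > 0 when the USV lies on the left of the path.
    """
    e_y = -(X - X_p) * np.sin(phi_d) + (Y - Y_p) * np.cos(phi_d)
    e_phi = np.arctan2(np.sin(phi - phi_d), np.cos(phi - phi_d))  # phi - phi_d wrapped to (-pi, pi]
    return e_y, e_phi

# Arbitrary example pose and projection point.
e_y, e_phi = tracking_errors(X=1.0, Y=2.2, phi=0.30, X_p=1.1, Y_p=2.0, phi_d=0.25)
```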
Assuming that the course deviation $e_\varphi$ is small, the small-angle approximation gives $\sin(e_\varphi) \approx e_\varphi$ and $\cos(e_\varphi) \approx 1$. Then, the second derivative of the lateral deviation with respect to time can be expressed as
$$\ddot{e}_y = (\dot{v}_y + v_x\omega) - v_x^2\kappa$$
The first derivative can be approximated as
$$\dot{e}_y = v_y + v_xe_\varphi$$
Combining Equations (1), (3), (4), and (8)–(10), the following deviation model can be derived:
$$\dot{e} = A_ce + B_{c1}u + B_{c2}\omega_d$$
$$A_c = \begin{bmatrix} A_w & 0_{6\times3} & 0_{6\times3} \\ 0_{3\times6} & 0_{3\times3} & I_{3\times3} \\ 0_{3\times6} & 0_{3\times3} & -M^{-1}D \end{bmatrix}$$
$$A_w = \begin{bmatrix} 0_{3\times3} & \bar{A}_w \end{bmatrix}, \quad \bar{A}_w = \begin{bmatrix} M_{\dot{u}} + M_{\dot{v}}V & 0 & X_u + D_u|u| \\ 0 & M_{\dot{u}}u & M_{\dot{r}} \\ M_{\dot{r}} & Y_v + D_v|v| & Z_r + D_r|r| \end{bmatrix}$$
$$B_{c1} = \begin{bmatrix} E_w & 0_{6\times3} \\ 0_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & M^{-1} \end{bmatrix}, \quad B_{c2} = \begin{bmatrix} 0_{6\times3} \\ 0_{3\times3} \\ M^{-1} \end{bmatrix}$$
where $\omega_d = \dot{\varphi}_d$, $e = [e_y, \dot{e}_y, e_\varphi, \dot{e}_\varphi]^T$, and the control quantity is $u = \delta_f$.
Given a sampling period $\Delta t$, Equation (11a) can be discretized as
$$e(k+1) = Ae(k) + B_1u(k) + B_2\omega_d(k)$$
where $A = I + \Delta tA_c$, $B_1 = \Delta tB_{c1}$, $B_2 = \Delta tB_{c2}$, and $k$ is the discrete time index.
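The forward-Euler discretization above is straightforward to implement; in the sketch below the continuous-time matrices are generic placeholders (their true entries depend on the hydrodynamic parameters), and only the structure $A = I + \Delta t A_c$, $B_1 = \Delta t B_{c1}$, $B_2 = \Delta t B_{c2}$ is taken from the text.

```python
import numpy as np

def discretize_error_model(A_c, B_c1, B_c2, dt):
    """Forward-Euler discretization: A = I + dt*A_c, B1 = dt*B_c1, B2 = dt*B_c2."""
    return np.eye(A_c.shape[0]) + dt * A_c, dt * B_c1, dt * B_c2

# Placeholder continuous-time matrices with illustrative 4-state dimensions.
A_c = np.array([[0.0, 1.0, 0.0, 0.0],
                [-0.2, -0.5, 2.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, -1.0, -0.8]])
B_c1 = np.array([[0.0], [1.0], [0.0], [2.0]])
B_c2 = np.array([[0.0], [0.0], [0.0], [-1.0]])
A, B1, B2 = discretize_error_model(A_c, B_c1, B_c2, dt=0.1)

def step(e, u, w_d):
    """One step of the discrete deviation model e(k+1) = A e(k) + B1 u(k) + B2 w_d(k)."""
    return A @ e + B1.flatten() * u + B2.flatten() * w_d
```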
For the above model (Equation (12)), it is assumed that the path information $\{(X_i, Y_i)\}_{i=1}^M$ is available. The purpose of this paper is to design a lateral control algorithm based on RHRL (as shown in Figure 3) such that, during the control process, the lateral error state gradually converges to zero, that is, $e \to 0$.

2.2.1. Design of Performance Index for the Finite Time Domain Trajectory Control Problem

In this section, a detailed control algorithm based on RHRL is presented. We commence by designing the performance index for the finite time domain lateral control problem of the USV. Subsequently, we outline the core concept of the RHRL algorithm and delve into its design, implementation, and convergence analysis based on the actuator–evaluator structure. For the system deviation model of Equation (12), the control quantity can be decomposed into a feedforward component $u_f$ plus a feedback component $u_b$ such that $u = u_f + u_b$, as shown in Figure 3. The feedforward control quantity represents the expected control input during steady-state operation and applies when the vehicle is stably following the reference path; at that time, $e(k) = e(k+1) = 0$ holds and $u_b = 0$ as well. The feedforward control quantity $u_f$ can be determined as follows:
$$\sum_{j=0}^{\infty}A^jB_1u_f = -\sum_{j=0}^{\infty}A^jB_2\omega_d$$
The value of $\omega_d$ in the above formula can be obtained from $\omega_d = v_x\kappa$. $A$, $B_1$, and $B_2$ are the discrete-time coefficient matrices.
Since $u_f$ can easily be solved at any current time $k$, we assume that $u_f$ remains constant throughout the prediction time domain $[k, k+N]$; then, the feedback control quantity $u_b$ to be solved must satisfy the following constraint:
$$u_b \in U_b = \{u \in \mathbb{R} \mid \underline{u} - u_f \le u \le \bar{u} - u_f\}$$
where $\bar{u}$ is the maximum of $u$ and $\underline{u}$ is the minimum of $u$. The RHRL algorithm introduced in this paper seeks to minimize the following performance index by optimizing $u_b \in U_b$ in each prediction time domain:
$$V(e(k)) = \sum_{l=k}^{k+N-1}L(e(l), u_b(l)) + V_f(e(k+N))$$
where the stage cost is $L(e(l), u_b(l)) = e^T(l)Qe(l) + Pu_b^2(l)$, $Q \in \mathbb{R}^{4\times4}$ is a positive definite matrix, $P$ is a preset positive real number, and the terminal cost of the prediction time domain is
$$V_f(e(k+N)) = e^T(k+N)\bar{R}e(k+N)$$
where the penalty matrix $\bar{R} \in \mathbb{R}^{4\times4}$ is positive definite and can be obtained by solving the following Lyapunov equation:
$$F^T\bar{R}F - \bar{R} = -Q - K^TPK$$
where $F = A + B_1K$ and $K \in \mathbb{R}^{1\times4}$ is a feedback gain matrix chosen such that $F$ is Schur stable, i.e., all roots of the characteristic polynomial of $F$ lie within the unit circle, which is the stability condition for discrete linear systems.
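The terminal penalty $\bar{R}$ can be computed numerically once a stabilizing gain K has been chosen. The sketch below is one possible realization using SciPy; the model matrices are the placeholders from the previous sketch, and the gain is obtained from a discrete LQR design purely for illustration (any K that makes $F = A + B_1K$ Schur stable would do).

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Placeholder discrete-time error model (same illustrative matrices as above).
A_c = np.array([[0.0, 1.0, 0.0, 0.0],
                [-0.2, -0.5, 2.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, -1.0, -0.8]])
A = np.eye(4) + 0.1 * A_c
B1 = 0.1 * np.array([[0.0], [1.0], [0.0], [2.0]])

Q = np.diag([10.0, 1.0, 10.0, 1.0])   # positive definite state weight (placeholder)
P = np.array([[0.1]])                 # positive control weight (scalar P in the cost)

# A stabilizing gain K (u_b = K e) from a discrete LQR design, so F = A + B1 K is Schur.
S = solve_discrete_are(A, B1, Q, P)
K = -np.linalg.solve(P + B1.T @ S @ B1, B1.T @ S @ A)   # shape (1, 4)
F = A + B1 @ K

# Terminal penalty from the Lyapunov equation F^T Rbar F - Rbar = -(Q + K^T P K).
Rbar = solve_discrete_lyapunov(F.T, Q + K.T @ P @ K)
assert np.allclose(F.T @ Rbar @ F - Rbar, -(Q + K.T @ P @ K), atol=1e-8)
```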

2.2.2. Path Control Algorithm Based on RHRL

The implementation of the finite time domain reinforcement learning algorithm using the executive–evaluator involves the following main steps:
First, according to Equation (15), for any $l \in [k, k+N-1]$ we can express the value function in the recursive form
$$V(e(l)) = L(e(l), u_b(l)) + V(e(l+1))$$
where $V(e(k+N)) = V_f(e(k+N))$. At the $l$-th prediction step, $V^*(e(l))$ is defined as the optimal value function, and we obtain the HJB equation of the above finite time domain optimal control problem:
$$V^*(e(l)) = \min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^*(e(l+1))\right\}$$
and the optimal control strategy:
$$u^*(e(l)) = \arg\min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^*(e(l+1))\right\}$$
In fact, due to the control constraints, it is difficult to obtain analytical solutions for $V^*$ and $u^*$ from Equations (19) and (20). In principle, we can approximate the optimal value function and control strategy through value iteration. For any $l \in [k, k+N-1]$, given the initial value $V^0(e(l)) = 0$, the following two steps are iterated for $i = 0, 1, 2, \ldots$ until $\|V^{i+1}(e(l)) - V^i(e(l))\| \to 0$:
(1)
Strategy update
$$u^i(e(l)) = \arg\min_{u_b(l)\in U_b}\left\{L(e(l), u_b(l)) + V^i(e(l+1))\right\}$$
(2)
Value update
$$V^{i+1}(e(l)) = L(e(l), u_b^i(e(l))) + V^i(e(l+1))$$
In conclusion, the task of trajectory tracking is accomplished through continuous updating of the control strategy and the value function.
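To make the two-step recursion concrete, the sketch below performs a single strategy update and value update at one state, with the admissible feedback set discretized on a grid and the current value estimate represented by an arbitrary quadratic form; all of these choices are illustrative and differ from the network-based implementation described in the next subsection.

```python
import numpy as np

def stage_cost(e, u_b, Q, P):
    return float(e @ Q @ e + P * u_b ** 2)

def value_iteration_step(e, V, A, B1, Q, P, u_grid):
    """One sweep of the recursion: greedy strategy update, then value update at state e."""
    # Strategy update: minimize stage cost plus V at the successor state over the
    # discretized admissible feedback set U_b.
    costs = [stage_cost(e, u, Q, P) + V(A @ e + B1 * u) for u in u_grid]
    u_best = float(u_grid[int(np.argmin(costs))])
    # Value update: Bellman backup with the newly selected control.
    return u_best, stage_cost(e, u_best, Q, P) + V(A @ e + B1 * u_best)

# Illustrative usage with a crude quadratic value estimate and a coarse control grid.
A = np.eye(4) + 0.1 * np.array([[0, 1, 0, 0], [-0.2, -0.5, 2, 0],
                                [0, 0, 0, 1], [0, 0, -1, -0.8]], float)
B1 = 0.1 * np.array([0.0, 1.0, 0.0, 2.0])
Q, P = np.diag([10.0, 1.0, 10.0, 1.0]), 0.1
V = lambda e: float(e @ e)                      # placeholder for V^i
u_grid = np.linspace(-1.0, 1.0, 41)             # discretized U_b
u_star, V_next = value_iteration_step(np.array([0.5, 0.0, 0.1, 0.0]), V, A, B1, Q, P, u_grid)
```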

2.2.3. Rolling Time Domain Executor–Evaluator Learning Implementation

We employ the executive–evaluator structure to implement the finite time domain value function iteration algorithm described above. In existing finite time domain reinforcement learning control algorithms [17], the value function in the prediction time domain is regarded as a time-dependent function.
Assumption 1. 
There exists a control strategy $u_b(e) = \Phi(v(e))$ such that the system of Equation (12) is asymptotically stable under the control strategy $u = u_b + u_f$, where $\Phi(v(e))$ is a continuous function satisfying $u_b(e) \in U_b$, $v(e) \in \mathbb{R}$.
The above assumption is essentially a statement of the stabilizability of the system in Equation (12). At the same time, it is worth noting that the dynamic model of Equation (12) presented in this paper is controllable, so there must exist a continuous control law $u_b(e) \in U_b$ that renders the system asymptotically stable under the control strategy $u = u_b + u_f$. Therefore, the above assumption is reasonable.
We define $\chi_f$ as a control invariant set under the control law $u_b = Ke \in U_b$; then, we can state the following theorem.
Theorem 1. 
(Time-independent value function). Suppose the prediction horizon N is chosen such that, for any prediction time domain $[k, k+N]$ and any initial state $e(k) \in \mathbb{R}^4$, the terminal state of the system of Equation (9) satisfies $e(k+N) \in \chi_f$ under the control strategy $u(e(l))$, $l \in [k, k+N-1]$. Then there exists a control strategy $u_b(e) \in U_b$ such that $V(e(l))$, $l \in [k, k+N-1]$, is a function independent of time.
Proof of Theorem 1. 
Firstly, consider the case $e(k) \in \chi_f$. Based on the definition of $\chi_f$, there is a control law $u_b = Ke = \Phi(v(e)) \in U_b$ that ensures the state at any future time satisfies $e(l) \in \chi_f$. From this, we can solve and obtain the following function:
$$V(e(l)) = \sum_{i=l}^{k+N-1}L(e(i), u_b(i)) + V_f(e(k+N)) = e(l)^T\bar{P}e(l)$$
For the case $e(k) \notin \chi_f$, according to Assumption 1, there exist a control strategy $u_b = \Phi(v(e))$ and a finite prediction step N such that $e(k+N) \in \chi_f$. In particular, let $v = Ke$; then
$$V(e(l)) = \sum_{i=l}^{k+N-1}L(e(i), u_b(i)) + V_f(e(k+N)) = \sum_{i=l}^{+\infty}L(e(i), u_b(i))$$
where $u_b = \Phi(v(e))$.
Hence, a value function and a strategy independent of time exist. Drawing inspiration from this, we adopt a time-independent executive–evaluator structure to execute the finite time-domain value function iteration process described above. Initially, a network of evaluators is designed to approximate the value function:
$$\hat{V}(e) = \hat{W}_c^T\varphi(e)$$
where $\hat{W}_c \in \mathbb{R}^{N_c}$ represents the weight vector of the evaluator network, $N_c$ denotes the number of network nodes, and $\varphi(e)$ is the network's basis function vector. According to the definition of the evaluator network, the resulting error $E$ and terminal error $E_f$ can be expressed as
$$E(l) = \hat{W}_c^T\varphi(l) - L(e(l), \hat{u}_b(l)) - \hat{W}_c^T\varphi(l+1)$$
$$E_f = \hat{W}_c^T\varphi(e_f) - e_f^T\bar{P}e_f$$
Therein, $e_f = e(k+N)$, which can be randomly valued around 0. By minimizing $E_c(l) = E(l)^2 + E_f^2$, the update equation for the weights of the evaluator network is derived as
$$\hat{W}_c(l+1) = \hat{W}_c(l) + \mu_c\left[\Delta\varphi(e(l+1))E(l) - \varphi(e_f)E_f\right]$$
where μ c > 0 is the learning rate of the evaluator network.
Next, to deal with the control constraints, we construct the actuator network as follows:
$$\hat{u}_b(l) = \bar{u}_1\tanh\left(\hat{W}_a^T\sigma(e(l))\right) + \bar{u}_2$$
where $\bar{u}_1 = 0.5(\bar{u}_b - \underline{u}_b)$, $\bar{u}_2 = 0.5(\bar{u}_b + \underline{u}_b)$, $\hat{W}_a \in \mathbb{R}^{N_a}$ is the weight vector of the actuator network, $\sigma(e)$ is the basis function vector of the network, and $N_a$ denotes the number of network nodes. Given that the actuator network aims to approximate the optimal control strategy, we define the control deviation as follows:
$$E_a(l) = \hat{W}_a^T\sigma(e(l)) + \frac{1}{2}R^{-1}B_1^T\nabla\varphi(e(l))\hat{W}_c(l)$$
By minimizing $E_a^2$, we can obtain the update rule for the actuator network weights as
$$\hat{W}_a(l+1) = \hat{W}_a(l) - \mu_a\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}$$
where μ a > 0 represents the learning rate of the actuator network.
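A compact sketch of the evaluator and actuator updates is given below. The basis functions, dimensions, and the coupling term corresponding to $\frac{1}{2}R^{-1}B_1^T\nabla\varphi\,\hat{W}_c$ are all placeholders (the coupling is passed in as a precomputed scalar, since its exact form depends on the chosen basis), so this is a structural illustration of Equations (27)–(30) rather than the exact implementation.

```python
import numpy as np

def critic_update(W_c, phi_l, phi_lp1, phi_f, stage_cost, V_f_target, mu_c):
    """Evaluator update: gradient step that reduces E(l)^2 + E_f^2 (cf. Equation (27))."""
    E_l = W_c @ phi_l - stage_cost - W_c @ phi_lp1     # Bellman residual
    E_f = W_c @ phi_f - V_f_target                     # terminal residual
    return W_c + mu_c * ((phi_lp1 - phi_l) * E_l - phi_f * E_f)

def actor_update(W_a, sigma_l, coupling, mu_a):
    """Actuator update: gradient step on E_a^2 (cf. Equations (29) and (30)).

    `coupling` stands in for the term (1/2) R^{-1} B_1^T grad(phi) W_c, supplied
    externally because its concrete form depends on the basis functions (assumption).
    """
    E_a = W_a @ sigma_l + coupling                     # scalar-control case
    return W_a - mu_a * 2.0 * E_a * sigma_l            # d(E_a^2)/dW_a = 2 E_a sigma

def actor_output(W_a, sigma_l, u_upper, u_lower):
    """Saturated actuator output u_b = u1*tanh(W_a^T sigma) + u2 (cf. Equation (28))."""
    u1, u2 = 0.5 * (u_upper - u_lower), 0.5 * (u_upper + u_lower)
    return u1 * np.tanh(W_a @ sigma_l) + u2

# Illustrative call with random placeholder quantities.
rng = np.random.default_rng(0)
W_c, W_a = rng.normal(size=8), rng.normal(size=6)
W_c = critic_update(W_c, rng.normal(size=8), rng.normal(size=8), rng.normal(size=8),
                    stage_cost=0.3, V_f_target=0.1, mu_c=0.01)
W_a = actor_update(W_a, rng.normal(size=6), coupling=0.05, mu_a=0.01)
u_b = actor_output(W_a, rng.normal(size=6), u_upper=1.0, u_lower=-1.0)
```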
Algorithm 1: Main steps of the finite time domain reinforcement learning algorithm implemented with the executive–evaluator structure.
(I)
Initialize the weights $\hat{W}_c$, $\hat{W}_a$, and obtain the initial state $Z_0$.
(II)
At time $t = k\Delta t$, find the projection point P according to the state $Z(t)$ and calculate the deviation state $e(t)$.
(III)
For $l \in [k, k+N-1]$, repeat the following steps 1–3:
(1)
According to Equations (17) and (28), calculate $u_f(l)$ and $\hat{u}_b(l)$, respectively.
(2)
Update $\hat{W}_c$, $\hat{W}_a$ according to Equations (27) and (30).
(3)
Calculate $u(l) = u_f(l) + \hat{u}_b(l)$ according to Equations (13) and (28), and apply the prediction model to obtain $e(l+1)$.
(IV)
Calculate $u_f(k)$ and $\hat{u}_b(e(k))$ according to Equations (12) and (27), respectively.
(V)
In the time period $[k\Delta t, (k+1)\Delta t)$, apply the control $u(t) = u(k\Delta t)$ directly to the USV, and update the system state $Z((k+1)\Delta t)$.
(VI)
Set $k \leftarrow k+1$ and repeat operations II–V based on the receding time domain optimization strategy.
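To show how the steps of Algorithm 1 fit together, the following self-contained sketch runs the receding-horizon actor–critic loop on a toy stand-in for the deviation model. The matrices, bases, learning rates, the straight reference ($u_f = 0$, $\omega_d = 0$), and the numerically computed greedy target used in the actuator update are all illustrative assumptions rather than the exact formulation in the text; the terminal-error term of the evaluator update is also omitted for brevity.

```python
import numpy as np

# Toy placeholder deviation model and weights (illustrative values only).
A = np.eye(4) + 0.1 * np.array([[0, 1, 0, 0], [-1, -1, 0, 0],
                                [0, 0, 0, 1], [0, 0, -1, -1]], float)
B1 = 0.1 * np.array([0.0, 1.0, 0.0, 2.0])
B2 = 0.1 * np.array([0.0, 0.0, 0.0, -1.0])
Q, P_w, u_max = np.diag([10.0, 1.0, 10.0, 1.0]), 0.1, 1.0
mu_c, mu_a, N = 0.05, 0.05, 5
u_grid = np.linspace(-u_max, u_max, 21)                   # discretized admissible set U_b

phi = lambda e: np.outer(e, e)[np.triu_indices(4)]        # quadratic evaluator basis
sigma = lambda e: e.copy()                                # linear actuator basis
u_b_of = lambda W_a, e: u_max * np.tanh(W_a @ sigma(e))   # saturated actuator output

def horizon_learning(e_k, W_c, W_a, u_f, w_d):
    """Step III of Algorithm 1: actor-critic learning over the horizon [k, k+N-1]."""
    e = e_k.copy()
    for _ in range(N):
        u_b = u_b_of(W_a, e)
        e_next = A @ e + B1 * (u_f + u_b) + B2 * w_d      # prediction model
        cost = e @ Q @ e + P_w * u_b ** 2
        # Evaluator: gradient step that reduces the squared Bellman residual.
        E_l = W_c @ phi(e) - cost - W_c @ phi(e_next)
        W_c = W_c + mu_c * (phi(e_next) - phi(e)) * E_l
        # Actuator: move the output toward the greedy control w.r.t. the current
        # evaluator (a numerical stand-in for the analytic target in the text).
        q_vals = [e @ Q @ e + P_w * u ** 2 +
                  W_c @ phi(A @ e + B1 * (u_f + u) + B2 * w_d) for u in u_grid]
        u_target = float(u_grid[int(np.argmin(q_vals))])
        E_a = u_b - u_target
        grad_u = u_max * (1.0 - np.tanh(W_a @ sigma(e)) ** 2) * sigma(e)
        W_a = W_a - mu_a * 2.0 * E_a * grad_u
        e = e_next
    return W_c, W_a

# Steps II and IV-VI: receding-horizon execution with a plant identical to the model.
e, W_c, W_a = np.array([0.5, 0.0, 0.2, 0.0]), np.zeros(10), np.zeros(4)
for k in range(100):
    W_c, W_a = horizon_learning(e, W_c, W_a, u_f=0.0, w_d=0.0)   # step III
    u = 0.0 + u_b_of(W_a, e)                                     # step IV: u = u_f + u_b
    e = A @ e + B1 * u + B2 * 0.0                                # step V: apply to the "plant"
```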

2.2.4. Convergence Analysis of the Weight of Finite Time Domain Actuator and Evaluator

Next, we present the convergence analysis of the above RHRL algorithm in each prediction time domain $[k, k+N-1]$. First, the (locally) optimal value function and control strategy can be represented by networks:
$$V^*(e) = W_c^T\varphi(e) + \kappa_c$$
$$u_b^* = \bar{u}_1\tanh\left(W_a^T\sigma(e) + \kappa_a\right) + \bar{u}_2$$
where $W_a$ and $W_c$ are the ideal weights, and $\kappa_a$ and $\kappa_c$ are the reconstruction errors.
Assumption 2. 
(Network reconstruction error)
(1) 
$\|W_c\| \le W_{c,m}$, $\|\varphi\| \le \varphi_m$, $\|\Delta\varphi\| \le \bar{\varphi}_m$, $\|\kappa_c\| \le \kappa_{c,m}$, $\|\Delta\kappa_c\| \le \bar{\kappa}_{c,m}$
(2) 
$\|W_a\| \le W_{a,m}$, $\|\psi\| \le \psi_m$, $\|\kappa_a\| \le \kappa_{a,m}$
Assumption 3. 
(Continuous excitation)
There exist positive real numbers $q_1$, $q_2$ with $q_1 < q_2$ such that
$$q_1 \le \bar{\varphi}, \bar{\varphi}_f \le q_2$$
where $\bar{\varphi} = \Delta\varphi^T\Delta\varphi$, $\bar{\varphi}_f = \varphi_f^T\varphi_f$, and $\varphi_f = \varphi(e_f)$.
To state the following theorem more compactly, define $\gamma_1 = (4 - 4\bar{\psi}\mu_a) - (4 - 8\bar{\psi}\mu_a)(\beta_1 + \beta_3)$, $\bar{\psi} = \psi^T\psi$, and $\bar{\varphi} = \bar{\varphi}(l+1) + \bar{\varphi}_f$; $\alpha$ and $\beta_0, \beta_1, \beta_2, \beta_3$ are tunable positive real numbers.
Theorem 2. 
Under Assumptions 2 and 3, if appropriate learning rates $\mu_c$ and $\mu_a$ and constants $\{\beta_i\}_{i=0}^3$ are chosen such that $\gamma_1 > 0$ and $\alpha - \gamma_2 > 0$, then the network weights $\hat{W}_c$ and $\hat{W}_a$ updated by Equations (27) and (30) asymptotically converge to the following regions:
$$\|\bar{W}_c\| \le \sqrt{\frac{E_t}{\gamma_1}}$$
$$\|\xi_a\| \le \sqrt{\frac{E_t}{(\alpha - \gamma_2)\lambda_{\min}(\bar{g})}}$$
where $\bar{W}_c = W_c - \hat{W}_c$, $\bar{W}_a = W_a - \hat{W}_a$, $\xi_a = \bar{W}_a^T\psi$, and $E_t$ is the bounded error term defined below.
Furthermore, if $\kappa_{c,m}, \bar{\kappa}_{c,m}, \kappa_{a,m} \to 0$, then $\bar{W}_c$ and $\xi_a$ converge asymptotically to 0.
Proof of Theorem 2. 
The Lyapunov function is defined as follows:
$$L(l) = L_c(l) + L_a(l)$$
where $L_c = \mathrm{tr}(\bar{W}_c^T\eta_c^{-1}\bar{W}_c)$ and $L_a = \mathrm{tr}(\bar{W}_a^T\eta_a^{-1}\bar{W}_a)$. These can be evaluated based on Equation (26).
$$E(l) = \hat{W}_c^T\varphi(l) - \hat{W}_c^T\varphi(l+1) + \Delta V^*(l+1) = \bar{W}_c^T\Delta\varphi(l+1) + \Delta\kappa_c(l+1)$$
where $\Delta V^*(l+1) = V^*(l+1) - V^*(l)$ and $\Delta\kappa_c(l+1) = \kappa_c(l+1) - \kappa_c(l)$.
$$E_f = \hat{W}_c^T\varphi_f - W_c^T\varphi_f - \kappa_{c,f} = -\bar{W}_c^T\varphi_f - \kappa_{c,f}$$
where $\kappa_{c,f} = \kappa_c(k+N)$. Then, according to Equations (27), (35), and (36),
$$\Delta L_c(l+1) = L_c(l+1) - L_c(l) = -2\bar{W}_c^T(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c) + \mu_c(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c)^T(\bar{\varphi}\bar{W}_c + \bar{\kappa}_c) \le -\alpha\|\bar{W}_c\|^2 + E_c$$
where $\bar{\kappa}_c = \Delta\varphi(l+1)\Delta\kappa_c(l+1) - \varphi_f\kappa_{c,f}$ and $E_c = (2\mu_c + \beta_0^{-1})\|\bar{\kappa}_c\|^2$.
Similarly, $\Delta L_a(l+1)$ can be expressed as
$$\Delta L_a(l+1) = \mathrm{tr}\left[2\bar{W}_a^T(l)\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)} + \mu_a\left(\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}\right)^T\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)}\right]$$
Considering that $E_a = \xi_a - g\bar{W}_c + \bar{\kappa}_a$, $g = \frac{1}{2}R^{-1}B_1^T\nabla\varphi$, $\bar{\kappa}_a = \kappa_a - \frac{1}{2}R^{-1}B_1^T\nabla\kappa_c$, and $\frac{\partial E_a^2(l)}{\partial\hat{W}_a(l)} = 2\psi E_a$, we obtain
$$\Delta L_a = -(4 - 4\bar{\psi}\mu_a)\|\xi_a\|^2 - 8\bar{\psi}\mu_ag\bar{W}_c\bar{\kappa}_a + 4\bar{\psi}\mu_a\|\bar{W}_c\|_{\bar{g}}^2 + (4 - 8\bar{\psi}\mu_a)(\xi_a\bar{\kappa}_a - \xi_a^Tg\bar{W}_c)$$
where $\bar{g} = g^Tg$. According to Young's inequality,
$$\Delta L_a(l+1) \le -\gamma_1\|\xi_a\|^2 + \gamma_2\|\bar{W}_c\|_{\bar{g}}^2 + E_a$$
where $E_a = (1/\beta_2 + 1/\beta_3)\|\bar{\kappa}_a\|^2$. Then, defining $E_t = E_{c,m} + E_{a,m}$, we obtain
$$\Delta L \le -\gamma_1\|\xi_a\|^2 - (\alpha - \gamma_2)\|\bar{W}_c\|_{\bar{g}}^2 + E_t$$
On this basis, if $\bar{\kappa}_{c,m}, \kappa_{c,m}, \kappa_{a,m} \to 0$, then $E_t \to 0$, and $\bar{W}_c$ and $\xi_a$ asymptotically converge to 0.
Hence, at this juncture, we have successfully concluded the proof of Theorem 2. □
The above theorem indicates that $u$ can be made to converge to $u_b^*$ with an arbitrarily small error by increasing the number of basis function nodes in the actuator and the evaluator. Therefore, under the premise that Assumption 1 holds, if a sufficiently large N is chosen, the system of Equation (12) satisfies the terminal condition $e(k+N) \in \chi_f$ in the prediction time domain $[k, k+N-1]$ when driven by the strategy $\{u_b^*(k|k), \ldots, u_b^*(k+N-1|k)\}$. Thus, in the next prediction time domain $[k+1, k+N]$, the sequence $\{u_b^*(k+1|k), \ldots, u_b^*(k+N-1|k), Ke(k+N|k)\}$ is a feasible control strategy. We denote the loss produced by this feasible strategy by $Los_f(k+1|k)$; referring to Rawlings et al. [29], $Los_f(k+1|k) \le Los^*(k|k) - L(e(k|k), u_b(k|k))$ holds. Since $Ke(k+N|k)$ is suboptimal, we may safely derive
$$Los^*(k+1|k+1) - Los^*(k|k) \le Los_f(k+1|k) - Los^*(k|k) \le -L(e(k|k), u_b(k|k))$$
from which the stability of the closed-loop system follows via Lyapunov stability analysis.

3. Simulation Analysis

To ensure a precise comparison of the control performance between RHRL, Lyapunov-based MPC (LMPC), and sliding mode control (SMC), the control variable method was adopted using experimental parameters from [24,25]. In the simulations, all of the hydrodynamic parameters in the equations are based on the Falcon model [30].
The simulation results presented in this section showcase the advantages of the RHRL method. The simulations were carried out in MATLAB 2021b on an AMD Ryzen 7 5800H processor.

3.1. Parameter Selection

Two distinct desired trajectories are employed. Referring to the article by Li et al. [31], one trajectory (Path I) is a typical sinusoidal path:
$$p(t): \quad x_d = 0.4t, \quad y_d = \sin(0.4t)$$
The other trajectory, Path II, is based on [32] and is an S-shaped path:
$$p(t): \quad x_d = \sin(0.4t), \quad y_d = \sin(0.24t)$$
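For reference, both desired trajectories can be generated directly from the expressions above; the time span and sampling step in this small sketch are arbitrary choices.

```python
import numpy as np

t = np.arange(0.0, 30.0, 0.1)                     # arbitrary time span and sampling step

x_d1, y_d1 = 0.4 * t, np.sin(0.4 * t)             # Path I: sinusoidal path
x_d2, y_d2 = np.sin(0.4 * t), np.sin(0.24 * t)    # Path II: S-shaped path
```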
For the RHRL controller, the following parameters are used: the prediction horizon is set to $T = 5\delta$, where $\delta = 0.1$ s is the sampling period; the weighting matrices are $Q = \mathrm{diag}(10^5, 10^5, 10^3, 10^2, 10^2, 10^2)$, $R = \mathrm{diag}(10^4, 10^4, 10^4, 10^4)$, and $P = \mathrm{diag}(10^3, 10^3, 10^2, 10, 10)$. The control gains are $K_p = K_d = \mathrm{diag}(1, 1, 1)$, and $l_1 = l_2 = 0.8$.
In this section, the desired trajectory tracking simulation of a USV based on RHRL will be executed as described to emphasize the feasibility and efficiency of RHRL algorithm proposed earlier. The parameters for USV simulation are presented in Table 1.

3.2. Tracking Performance

Figure 4a,c depict the tracking results for Path I. The USV trajectories are represented by the blue curve for the LMPC controller, the green curve for the SMC controller, and the red curve for the RHRL controller, while the black curve illustrates the desired sinusoidal trajectory. The results demonstrate that all controllers successfully guide the USV along the desired trajectory, affirming closed-loop stability. However, the RHRL method exhibits considerably faster convergence than the LMPC and SMC methods. This acceleration in convergence is attributed to the selection of small control gain matrices $K_p$ and $K_d$. The simulation results show that the improvement in tracking accuracy is due to synchronous online incremental learning and deployment.
Figure 4b illustrates the thrust output of each propeller. It is evident that at the commencement of tracking, the RHRL controller maximally utilizes the onboard thrust capability to achieve convergence as swiftly as possible. In essence, the state remains within the prescribed boundary, aligning with expectations. It is also notable that RHRL demonstrates superior adjustment capability and undergoes more rapid adjustments.
The outcomes for Path II are presented in Figure 5. Similar observations can be made: the USV converges to the desired trajectory more quickly with RHRL.

3.3. Robustness Experiment with Disturbance

The incorporation of the receding horizon implementation introduces feedback into the closed-loop system. One of the inherent advantages of the RHRL controller is its robustness toward disturbances and emergencies, making it particularly well suited for control systems in marine and submarine environments. The RHRL's robustness is thoroughly examined and demonstrated through simulations. A constant simulated disturbance of $[100\ \mathrm{N}, 100\ \mathrm{N}, 0\ \mathrm{N\cdot m}]^T$ was added. To provide a clearer visualization of the deviation between the three algorithms, the reference trajectory, indicated by a black line, is also included in this experiment.
In analyzing the outcomes shown in Figure 6 and Figure 7, it is evident that RHRL tracking control consistently guides the USV to converge adequately toward the desired trajectory. In contrast, substantial tracking errors are exhibited when conducting tracking control using LMPC, and even greater errors are associated with SMC. Figure 6b and Figure 7b illustrate that the RHRL controller consistently provides feedback responses within a small time domain, ensuring minimal deviation.
The MSEs (mean square errors) for both paths are consolidated in Table 2 and Table 3. Generally, the MSEs are approximately 10 times smaller for RHRL compared to LMPC and SMC, especially in the case of Path II. Indeed, it is widely acknowledged that smaller MSEs correspond to reduced tracking error, thereby resulting in higher tracking accuracy; thus, it is evident that the RHRL algorithm significantly enhances tracking accuracy.
In order to more objectively demonstrate the excellent performance of the algorithm, we propose conducting quantitative analysis based on a new factor, namely thrust output. It is known that a smaller average value of thrust corresponds to lower energy consumption and enhanced cost-effectiveness. The specific data are shown in Table 4 and Table 5. As can be seen from the tables, the energy consumption of RHRL compared with LMPC is reduced by 43.85% and 41.65% for Paths I and II, respectively. The data show that RHRL is much more economical than LMPC. However, due to the algorithm characteristics, RHRL does not have a significant advantage over SMC based on this analysis.
The observed disparity stems from RHRL’s ability to learn and adapt online, utilizing online optimization to dynamically adjust control gains and effectively compensate for interference. Conversely, both LMPC and SMC lack this flexibility. Consequently, robustness is significantly enhanced by RHRL control.

4. Conclusions

In this paper, a trajectory control algorithm for USVs based on RHRL is introduced in which reinforcement learning is seamlessly integrated with a rolling time domain optimization mechanism. Thus, the infinite time domain self-learning optimization problem is effectively converted into a series of finite time domain optimization problems, which can then be solved using an executive–evaluator algorithm. The incorporation of the rolling time domain mechanism significantly enhances the learning efficiency of the RL algorithm. Moreover, compared to LMPC and SMC, the optimization method utilizing the executive and evaluator contributes to enhanced computational efficiency. Diverging from the majority of existing finite time domain executive–evaluator learning algorithms, the proposed RHRL employs a time-independent single-network structure, which reduces the intricacy of network design and the online computational complexity. Moreover, we analyzed the stability of the closed-loop system theoretically. Concerning scenarios involving significant errors in the learned approximation strategy, we plan to conduct in-depth analysis and substantiation in our forthcoming research. The simulation results demonstrate that our algorithm is effective in comparison with typical traditional algorithms in the simulation scenarios: RHRL control is superior to LMPC and SMC in terms of control performance and computational efficiency while also being more economical than LMPC, and it has lower sample complexity and higher learning efficiency.

Author Contributions

Conceptualization, Y.C. and X.G.; methodology, X.G.; software, Y.W.; validation, Y.C.; formal analysis, X.G.; investigation, X.G.; resources, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, X.G.; visualization, X.G.; supervision, Y.C.; project administration, Y.C.; funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fundamental Research Funds for the Central Universities of Ministry of Education of China grant number 2023IVA092.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments which improved the quality of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
USV: unmanned surface vehicle
RHRL: receding horizon reinforcement learning
LMPC: Lyapunov model predictive control
SMC: sliding mode control
PID: proportional integral derivative
MPCC: model predictive contouring control
ADP: approximate dynamic programming
KDHP: kernel-based dual heuristic programming (DHP)
BF: body frame
IF: inertial frame
MSE: mean square error

References

  1. Alim, M.F.A.; Kadir, R.E.A.; Gamayanti, N.; Santoso, A.; Sahal, M. Autopilot system design on monohull USV- LSS01 using PID-based sliding mode control method. IOP Conf. Ser. Earth Environ. Sci. 2021, 649, 012058. [Google Scholar] [CrossRef]
  2. Guo, X.-K. Particle swarm optimization for pid usv heading stability control. Ship Sci. Technol. 2019, 41, 52–54. [Google Scholar]
  3. Ege, E.; Ankarali, M.M. Feedback motion planning of unmanned surface vehicles via random sequential composition. Trans. Inst. Meas. Control 2019, 41, 3321–3330. [Google Scholar] [CrossRef]
  4. Huanyin, Z.; Xisheng, F.; Zhiqiang, H.U.; Wei, L.I. Dynamic Feedback Controller Based on Optimized Switching of Multiple Identification Models for Course Control of Unmanned Surface Vehicle. Robot 2013, 35, 552. [Google Scholar]
  5. Yan, D.; Xiao, C.; Wen, Y. Pod Propulsion Small Surface USV Heading Control Research. In Proceedings of the 26th International Ocean and Polar Engineering Conference, Rhodes, Greece, 26 June–1 July 2016. [Google Scholar]
  6. Deng, Y.; Zhang, X.; Im, N.; Zhang, G.; Zhang, Q. Adaptive fuzzy tracking control for underactuated surface vessels with unmodeled dynamics and input saturation. ISA Trans. 2020, 103, 52–62. [Google Scholar] [CrossRef]
  7. Dong, Z.; Zhang, Z.; Qi, S.; Zhang, H.; Li, J.; Liu, Y. Autonomous cooperative formation control of underactuated USVs based on improved MPC in complex ocean environment. Ocean Eng. 2023, 270, 113633. [Google Scholar] [CrossRef]
  8. Han, X.; Zhang, X. Tracking control of ship at sea based on MPC with virtual ship bunch under Frenet frame. Ocean Eng. 2022, 247, 110737. [Google Scholar] [CrossRef]
  9. Johnson, A.M.; Lenell, C.; Severa, E.; Rudisch, D.M.; Morrison, R.A.; Shembel, A.C. Semi-Automated Training of Rat Ultrasonic Vocalizations. Front. Behav. Neurosci. 2022, 16, 826550. [Google Scholar] [CrossRef]
  10. Zhao, Y.; Qi, X.; Ma, Y.; Li, Z.; Sotelo, M.A. Path Following Optimization for an Underactuated USV Using Smoothly-Convergent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 6208–6220. [Google Scholar] [CrossRef]
  11. Guo, J. Study on Lateral Fuzzy Control of Unmanned Vehicles Via Genetic Algorithms. J. Mech. Eng. 2012, 48, 76. [Google Scholar] [CrossRef]
  12. Leonard, J.; How, J.; Teller, S.; Berger, M.; Williams, J. A Perception-Driven Autonomous Urban Vehicle. J. Field Robot. 2009, 25, 727–774. [Google Scholar] [CrossRef]
  13. Rajamani, R.; Zhu, C.; Alexander, L. Lateral control of a backward driven front-steering vehicle. Control Eng. Pract. 2003, 11, 531–540. [Google Scholar] [CrossRef]
  14. Taherian, S.; Halder, K.; Dixit, S.; Fallah, S. Autonomous Collision Avoidance Using MPC with LQR-Based Weight Transformation. Sensors 2021, 21, 4296. [Google Scholar] [CrossRef]
  15. Falcone, P.; Borrelli, F.; Asgari, J.; Tseng, H.E.; Hrovat, D. Predictive Active Steering Control for Autonomous Vehicle Systems. IEEE Trans. Control Syst. Technol. 2007, 15, 566–580. [Google Scholar] [CrossRef]
  16. Beal, C.E.; Gerdes, J.C. Model Predictive Control for Vehicle Stabilization at the Limits of Handling. IEEE Trans. Control Syst. Technol. 2013, 21, 1258–1269. [Google Scholar] [CrossRef]
  17. Li, D.; Zhao, D.; Zhang, Q.; Chen, Y. Reinforcement Learning and Deep Learning Based Lateral Control for Autonomous Driving [Application Notes]. IEEE Comput. Intell. Mag. 2019, 14, 83–98. [Google Scholar] [CrossRef]
  18. Domahidi, A.; Liniger, A.; Morari, M. Optimization-Based Autonomous Racing of 1:43 Scale RC Cars. Optim. Control Appl. Methods 2017, 36, 628–647. [Google Scholar]
  19. Ostafew, C.J.; Schoellig, A.P.; Barfoot, T.D. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. Int. J. Robot. Res. 2016, 35, 1547–1563. [Google Scholar] [CrossRef]
  20. Alighanbari, S.; Azad, N.L. Safe Adaptive Deep Reinforcement Learning for Autonomous Driving in Urban Environments. Additional Filter? How and Where? IEEE Access 2021, 9, 141347–141359. [Google Scholar] [CrossRef]
  21. Chen, Y.; Hereid, A.; Peng, H.; Grizzle, J. Enhancing the Performance of a Safe Controller Via Supervised Learning for Truck Lateral Control. J. Dyn. Syst. Meas. Control 2019, 141, 101005. [Google Scholar] [CrossRef]
  22. Zhou, X.; Wu, Y.; Huang, J. MPC-based path tracking control method for USV. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020. [Google Scholar]
  23. Gong, C.; Su, Y.; Zhu, Q.; Zhang, D.; Hu, X. Finite-time dynamic positioning control design for surface vessels with external disturbances, input saturation and error constraints. Ocean Eng. 2023, 276, 114259. [Google Scholar] [CrossRef]
  24. Shen, C.; Shi, Y.; Buckham, B. Trajectory Tracking Control of an Autonomous Underwater Vehicle Using Lyapunov-Based Model Predictive Control. IEEE Trans. Ind. Electron. 2017, 65, 5796–5805. [Google Scholar] [CrossRef]
  25. Jiang, X.; Xia, G. Sliding mode formation control of leaderless unmanned surface vehicles with environmental disturbances. Ocean Eng. 2022, 244, 110301. [Google Scholar] [CrossRef]
  26. Mayne, D.Q.; Kerrigan, E.C. Tube-Based Robust Nonlinear Model Predictive Control. Int. J. Robust Nonlinear Control 2009, 21, 1341–1353. [Google Scholar] [CrossRef]
  27. Zhang, X.; Pan, W.; Scattolini, R.; Yu, S.; Xu, X. Robust Tube-based Model Predictive Control with Koopman Operators–Extended Version. arXiv 2021, arXiv:2108.13011. [Google Scholar]
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  29. Rawlings, J.; Mayne, D.; Diehl, M. Model Predictive Control: Theory, Computation, and Design; Nob Hill Publishing, LLC: Madison, WI, USA, 2017. [Google Scholar]
  30. Proctor, A.A. Semi-autonomous guidance and control of a Saab SeaEye Falcon ROV. Ph.D. Thesis, University of Victoria, Victoria, BC, Canada, 2014. [Google Scholar]
  31. Li, B.; Zhang, H.; Niu, Y.; Ran, D.; Xiao, B. Finite-time disturbance observer-based trajectory tracking control for quadrotor unmanned aerial vehicle with obstacle avoidance. Math. Methods Appl. Sci. 2023, 46, 1096–1110. [Google Scholar] [CrossRef]
  32. Hmeyda, F.; Bouani, F. Camera-based autonomous Mobile Robot Path Planning and Trajectory tracking using PSO algorithm and PID Controller. In Proceedings of the 2017 International Conference on Control, Automation and Diagnosis (ICCAD), Hammamet, Tunisia, 19–21 January 2017. [Google Scholar]
Figure 1. Diagram of the BF (left) and IF (right).
Figure 2. Lateral error model.
Figure 3. Trajectory tracking control block diagram of the USV.
Figure 4. The USV trajectory tracking performance in Path I. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.
Figure 5. The USV trajectory tracking performance in Path II. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.
Figure 6. The USV trajectory tracking performance in Path I with disturbance. (a) The USV trajectory for Path I. (b) The thrust outputs for Path I. (c) The state trajectories for Path I.
Figure 7. The USV trajectory tracking performance in Path II with disturbance. (a) The USV trajectory for Path II. (b) The thrust outputs for Path II. (c) The state trajectories for Path II.
Table 1. Parameters for USV simulation.
Parameter | Value
M/kg (mass of USV) | 37
D/m (distance from motors to center of mass) | 0.7
K (viscosity coefficient) | 0.1
I (moment of inertia) | 0.2
T_e (sampling period) | 0.2
i (loop index) | 1
U_cruise = U_1 = U_2 | 2
Table 2. MSE for disturbances in Path I.
MSE | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
x (m²) | 0.0518 | 0.0086 | 0.0732 | 83.3% | 88.2%
y (m²) | 0.0286 | 0.0031 | 0.0391 | 89.3% | 92.1%
ψ (rad²) | 0.3198 | 0.0358 | 0.4273 | 88.9% | 91.6%
Table 3. MSE for disturbances in Path II.
MSE | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
x (m²) | 0.1386 | 0.0158 | 0.1450 | 88.6% | 89.1%
y (m²) | 0.0968 | 0.0079 | 0.1002 | 91.8% | 92.1%
ψ (rad²) | 0.8663 | 0.3561 | 0.9984 | 58.8% | 64.3%
Table 4. The average thrust output with disturbances in Path I.
Thrust output | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
U_1 (N) | 152.9 | 86.3 | 86.8 | 43.6% | 0.57%
U_2 (N) | 154.2 | 86.5 | 87.6 | 43.9% | 1.25%
Table 5. The average thrust output with disturbances in Path II.
Thrust output | LMPC | RHRL | SMC | Improvement I (RHRL vs. LMPC) | Improvement II (RHRL vs. SMC)
U_1 (N) | 106.7 | 62.1 | 63.2 | 41.8% | 1.74%
U_2 (N) | 109.2 | 63.9 | 64.6 | 41.5% | 0.11%

