1. Introduction
The development of smart ships has become a major focus of maritime innovation as the requirements for autonomy, safety and operational efficiency continue to increase. The development of autonomous control technologies is increasingly shifting from traditional rule-based strategies to data-driven and learning-based approaches. Among them, deep learning (DL) and reinforcement learning (RL) have shown strong potential in simulating and replicating the complex, non-linear and uncertain maneuvering behaviors typically exhibited by human operators.
Recent studies have shown the effectiveness of learning-based approaches in various ship control tasks. For instance, deep Q-networks (DQNs) and deterministic policy gradient algorithms have been used to develop control strategies for navigating ships through constrained channels in numerical simulation environments. The comparative analysis reveals both methods to be effective while exhibiting distinct control behaviors [
1]. A deep deterministic policy gradient (DDPG)-based path-planning algorithm has been proposed for unmanned surface vessels (USVs), offering continuous control outputs and a well-designed reward function for trajectory approximation, speed regulation, and position stabilization [
2]. Other deep reinforcement learning (DRL) methods focus on collision avoidance by determining the necessity and direction of evasive maneuvers [
3], or by designing state–action spaces tailored to an autonomous navigation and obstacle avoidance (ANOA) framework, which outperforms traditional DQN and Deep Sarsa methods [
4]. DRL methods can also be combined with other methods like artificial potential fields to refine the reward and action space design in uncertain environments [
5]. DRL has also been integrated with model predictive control (RL-MPC) to improve trajectory tracking by combining neural networks and MPC in an actor–critic framework [
6]. For multi-ship encounters, a generalized behavioral decision-making (GBDM) model is proposed, in which the RL-trained model uses obstacle zone by target (OZT) information derived from dynamic ship data and virtual sensors. By incorporating COLREGs into the reward structure, the model handles collision avoidance decisions effectively across 60 COLREG-clustered scenarios, exhibiting flexibility and scalability [
7].
However, existing reinforcement learning methods often require large volumes of training data and extensive interaction with the environment, which is impractical in real maritime applications. Accurate modeling of ship dynamics is also difficult, especially under limited onboard sensor configurations and with scarce real-world data. Moreover, the high cost and safety risks associated with full-scale tug trials restrict access to empirical validation, necessitating the use of simulation-based development and evaluation.
To address these challenges, control methods based on imitation learning (IL) offer promising alternatives for intelligent ship control. IL is a machine learning paradigm in which an agent learns to perform tasks by observing and replicating expert demonstrations [
8]. Inspired by natural learning mechanisms observed in humans and animals, especially in how children acquire language and social behavior through imitation [
9], IL enables effective policy acquisition without explicit modeling of system dynamics.
A representative direction in IL is generative adversarial networks (GANs) [
10], which learn data distributions through an adversarial game between a generator and a discriminator. Building on this idea, Ho and Ermon [
11] introduced Generative Adversarial Imitation Learning (GAIL), which integrates GANs with inverse reinforcement learning. In GAIL, the discriminator not only distinguishes between expert and agent behaviors but also generates a reward signal to guide the policy update. The agent interacts with the environment and optimizes its policy using reinforcement learning based on feedback from the discriminator.
GAIL demonstrates superior performance in several benchmark control tasks, including CartPole, Acrobot, and MountainCar, outperforming classical approaches such as behavioral cloning (BC), feature expectation matching (FEM), and game-theoretic apprenticeship learning (GTAL). GAIL has been extended to simulate human highway driving by integrating recurrent neural networks with trust region policy optimization (TRPO), achieving improved safety, lane-changing realism, and trajectory stability on the NGSIM dataset [
12]. To address the compounding error problem in BC, a variational autoencoder (VAE) is incorporated into GAIL to learn a low-dimensional semantic policy space, showing promising results in robotic manipulation and gait control [
13]. In the transportation domain, a conditional GAIL (cGAIL) model is trained on three months of Shenzhen taxi GPS data to learn driving strategies, enhancing income and public service quality through inter-agent knowledge transfer [
14]. A hybrid GAIL-DDPG framework is proposed with a gain regulator to improve training efficiency and adaptability under varying working conditions [
15]. GAIL is also applied to energy management in commercial buildings, outperforming proximal policy optimization (PPO) in controlling airflow systems with expert-guided efficiency gains [
16]. The SAC-LSTM-GAIL (SL-GAIL) algorithm combines the soft actor–critic (SAC) algorithm with LSTM and uses the strategies obtained by SAC-LSTM as expert data for GAIL; as a result, it does not need to spend time exploring unknown environments and instead learns control strategies directly from stable expert data [
17].
GAIL has been extended to a variety of advanced control tasks across domains. A virtual reality-integrated GAIL model (VR-GAIL) achieves higher success rates than PPO in long field-of-view multi-subtask robotic co-construction tasks, without relying on explicit reward design [
18]. By modeling the GAIL discriminator as an additive optimality signal, imitation learning and reinforcement learning are unified as probabilistic inference under a multi-objective partially observable Markov decision process, resulting in significantly better policy performance [
19]. In federated multi-agent learning, GAIL is used to track UAV motion during the global model update phase, while self-imitation learning (SIL) corrects local errors, enabling efficient distributed strategy coordination [
20]. For motion planning, a GAIL-based route planner effectively replicates DRL-generated collision avoidance trajectories [
21]. In sim-to-real transfer, PPO is used in simulation and GAIL then adapts the model for real-world dual-arm robot assembly tasks [
22]. Additionally, a DGAIL framework integrates DQN as a GAIL generator to model intelligent driving behaviors in structured traffic, reducing action randomness, improving training efficiency, and outperforming A3C, DQN, and GAIL in straight and merging road scenarios [
23].
Motivated by the strengths of imitation learning, particularly the robustness and generalization capabilities of GAIL, this paper proposes a GAIL-based control strategy for autonomous surface ships. Unlike conventional model-based controllers that rely on accurate dynamic modeling, the proposed approach learns control policies directly from expert demonstrations, offering greater adaptability to the uncertainties and nonlinearities of real-world ship maneuvering. Specifically, we develop GAIL-based control policies for two fundamental navigation tasks: heading control and path-following. These tasks are critical to autonomous navigation and berthing operations. The proposed framework leverages expert trajectory data collected from human-operated maneuvers to train a policy model that imitates expert behavior with improved smoothness, accuracy, and actuation efficiency.
The main contributions of this paper are as follows:
An imitation learning-based control framework is proposed for autonomous surface ships, enabling data-driven controller design using expert demonstration data. The framework is applied to both heading control and path-following tasks, and its effectiveness is validated through simulation experiments.
An adversarial training structure is developed, in which a control policy generator and a behavior discriminator are jointly optimized. This architecture enables the learned control policy to closely imitate expert-level behavior and enhances the generalization capability of the controller beyond conventional imitation learning methods.
The remainder of this paper is organized as follows.
Section 2 introduces the fundamentals of ship dynamics and the Generative Adversarial Imitation Learning framework.
Section 3 describes the controller design and implementation process.
Section 4 presents the simulation results and performance analysis.
Section 5 concludes the paper and outlines directions for future research.
3. Controller Design Based on Generative Adversarial Imitation Learning
3.1. Markov Decision Process Formulation
The Markov Decision Process (MDP) is a mathematical framework for describing sequential decision-making problems, integrating core concepts of states, actions, transition probabilities, and reward functions. For the ship motion control problem, the task model can be described as follows: At a given time step $t$, the ship is in state $s_t = [x, y, \psi, u, v, r]^\top$, where $x$ and $y$ are the position coordinates, $\psi$ is the heading angle, and $u$, $v$, and $r$ are the surge, sway, and yaw velocities, respectively. Based on this state, the controller generates a control action $a_t = [\delta, n]^\top$, representing the rudder angle $\delta$ and propeller revolution speed $n$. This action drives the ship to transition to the next state $s_{t+1}$. This paper focuses on learning a control policy $\pi(a \mid s)$ that maps state to action in a manner consistent with human expert behavior. The goal is to develop a human-like control strategy that imitates the maneuvering patterns of a human captain under similar operating conditions.
In typical navigation tasks, the data collected from ship operations generally comprises three key components: (1) path information, which includes a sequence of coordinate points representing the desired or executed trajectory; (2) control commands, such as rudder angles and engine thrust inputs; and (3) real-time motion states, including actual rudder angles, speeds, and heading deviations. Imitation learning algorithms leverage such data to infer control policies by learning from expert demonstrations. Through this process, the underlying decision-making patterns of human operators can be extracted and generalized to previously unseen scenarios. Consequently, expert demonstrations are transformed into closed-loop control policies, enabling autonomous systems to replicate human-like behavior in complex navigation environments.
This paper formulates the state and action spaces of the MDP as follows: the state space is $\mathcal{S} = \{ s = [x, y, \psi, u, v, r]^\top \}$, comprising the ship position, heading, and body-fixed velocities, and the action space is $\mathcal{A} = \{ a = [\delta, n]^\top \}$, comprising the rudder angle and propeller revolution speed.
The transition function in this MDP, denoted by $P(s_{t+1} \mid s_t, a_t)$, characterizes the evolution of the ship's state under the influence of control actions. It is governed by the ship's underlying maneuvering dynamics, which may be represented by data-driven models or physics-based equations.
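As a minimal illustration of these MDP elements, a schematic sketch is given below. It is not the implementation used in this paper: the variable names, the integration step, and the placeholder dynamics function are assumptions made purely for exposition.

```python
import numpy as np

# State s_t = [x, y, psi, u, v, r]: position, heading, and surge/sway/yaw velocities.
# Action a_t = [delta, n]: rudder angle and propeller revolution speed.

class ShipMDP:
    """Schematic MDP wrapper around a ship maneuvering model."""

    def __init__(self, dynamics, dt=0.1):
        self.dynamics = dynamics  # callable: (state, action) -> time derivative of state
        self.dt = dt

    def step(self, state, action):
        # Transition P(s_{t+1} | s_t, a_t), realized here by Euler integration of
        # a maneuvering model (data-driven or physics-based).
        return state + self.dt * self.dynamics(state, action)


def kinematic_placeholder(state, action):
    # Placeholder dynamics: pure kinematics with constant body-fixed velocities.
    # A real implementation would substitute an identified or physics-based model.
    x, y, psi, u, v, r = state
    return np.array([u * np.cos(psi) - v * np.sin(psi),
                     u * np.sin(psi) + v * np.cos(psi),
                     r, 0.0, 0.0, 0.0])
```

Under this interface, an expert trajectory is simply the sequence of (state, action) pairs recorded while a human operator steers the real or simulated ship.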
3.2. Expert Demonstration Data Preparation
Ship motion control can be decomposed into several sub-tasks based on different control objectives related to heading and speed, including speed control, heading control, path-following, and trajectory tracking. The performance and robustness of imitation learning models are highly dependent on the accuracy and diversity of the expert demonstration data used for training.
Expert data can be obtained from various sources, such as high-fidelity virtual simulations, physical experiments, or real-world sea trials. In this study, a scaled-down model ship is used as the experimental platform to construct the expert dataset. The dataset is specifically collected for two representative control tasks: heading control and path-following. The types of experiments and the sampling frequencies are summarized in
Table 1.
To ensure the professionalism, authenticity, and accuracy of the data, individuals with certified piloting qualifications were invited to participate in the experimental operations. These participants included students majoring in maritime technology, nautical interns, and licensed officers. By collaborating with these experienced personnel, the collected expert dataset reflects realistic ship operation behavior under diverse conditions.
3.3. Learning Control Policy Through GAIL
The GAIL framework consists of a generator G and a discriminator D. The policy $\pi$ within G generates a control action $a$ given the state $s$, while the discriminator D takes the state–action pair $(s, a)$ as input and outputs a probability value $D(s, a)$, indicating the likelihood that the action was taken by the expert rather than generated by G. Ideally, the generator learns to produce actions that the discriminator cannot distinguish from expert actions. For the ship control task addressed in this paper, the trained generator G is expected to function directly as a controller within the control system. It takes the control target as input and outputs a sequence of control commands, such that the resulting ship behavior closely imitates that in the expert dataset. To this end, the control objectives from the expert data, which describe the ship's desired state and motion, are incorporated as part of the generator's input, replacing the traditional use of random noise. In implementation, the generator is structured as a multilayer feed-forward neural network. The detailed configuration of its input and output variables, including their respective dimensions, is provided in Table 2.
Here, $\Delta x$, $\Delta y$, and $\Delta\psi$ represent the incremental errors in position and heading angle between two consecutive time steps. These values are calculated from the expert data and used as part of the training input to capture the dynamic trends of the ship's motion. The architecture and layer configuration of the generator network G are presented in Table 3.
Batch normalization is applied to each layer of the generator to accelerate training. The discriminator adopts the standard binary classification structure commonly used in GANs and GAIL, processing the input features through a multi-layer fully connected network and outputting a probability between 0 and 1 to indicate whether the input data come from the expert dataset or the generator. The input and output variable settings of the discriminator are shown in Table 4.
In this framework, the generator output is denoted as G (assigned label 0) and the expert data as E (assigned label 1). The discriminator D is trained to estimate the probability that a given input originates from the expert, thereby distinguishing between G and E. The network architecture and configuration of the discriminator are summarized in Table 5.
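For illustration, a minimal PyTorch sketch of the two networks described above is given below. The hidden-layer widths and input dimensions are assumptions made for the example (the actual configurations are those in Table 2, Table 3, Table 4 and Table 5); only the overall structure, a feed-forward generator with batch normalization and a binary-classification discriminator over state–action pairs, follows the text.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 9, 2, 64  # assumed sizes for illustration

class Generator(nn.Module):
    """Maps the control objective / ship state to a control action (rudder, rpm)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.BatchNorm1d(HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.BatchNorm1d(HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),  # bounded control outputs
        )

    def forward(self, state):
        return self.net(state)

class Discriminator(nn.Module):
    """Scores a state–action pair: probability that it comes from the expert."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1), nn.Sigmoid(),  # output in (0, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```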
The training procedure of GAIL is shown in
Figure 2. In the GAIL framework, the generator takes state vectors as input and outputs action vectors through a fully connected network, with the output dimension matching that of the environment’s action space. Each generated action is concatenated with the corresponding state to form the input to the discriminator, which estimates the probability that the input originates from expert data. The goal of the generator is to produce actions that approximate expert behavior closely enough to be identified as expert data by the discriminator. Meanwhile, the discriminator is trained to accurately distinguish between expert data and data generated by the generator, providing effective feedback to guide the generator’s improvement. During training, the generator is updated based on the output of the discriminator, while the discriminator is optimized by minimizing the loss on both expert and generated data.
The generator G functions as a virtual controller that maps the current system state to control commands, whereas the discriminator serves as a supervisory module that differentiates model-generated behavior from expert demonstrations. Through alternating optimization, the two networks iteratively improve their objectives until convergence, enabling G to approximate expert-level control decisions.
During training, G receives the instantaneous ship state and outputs a candidate control action. This action, concatenated with the input state, is passed to the discriminator. The generator seeks to maximize the probability that its outputs are classified as expert data; this is formulated with a binary cross-entropy loss that drives the discriminator’s prediction towards unity for generator samples. To maintain numerical stability, G is updated while the discriminator remains in evaluation mode, preventing inadvertent parameter changes. The discriminator is then trained on labeled state–action pairs to distinguish expert from generated data, providing informative gradients that guide G towards expert-like behavior. After training, the generator model exhibiting the best convergence metrics is deployed as a real-time controller. In operation it receives live state measurements and produces control commands, forming a closed-loop system that steers the ship along a desired heading or reference path. This realization effectively transfers expert knowledge into autonomous motion control.
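A condensed sketch of this alternating update, using the same label convention as above (expert = 1, generator = 0), is shown below. The optimizers, batch handling, and hyperparameters are placeholders, and this simplified variant updates the generator directly through the discriminator's binary cross-entropy signal as described in the text, omitting the policy-gradient machinery (e.g., TRPO/PPO) used in many GAIL implementations.

```python
import torch
import torch.nn.functional as F

def gail_update(gen, disc, gen_opt, disc_opt, expert_states, expert_actions, states):
    """One alternating GAIL-style update (illustrative sketch)."""
    # --- Generator update: make D classify generated pairs as expert (label 1) ---
    disc.eval()  # eval mode; only the generator's optimizer steps in this phase
    actions = gen(states)
    g_loss = F.binary_cross_entropy(disc(states, actions),
                                    torch.ones(states.size(0), 1))
    gen_opt.zero_grad()
    g_loss.backward()
    gen_opt.step()

    # --- Discriminator update: expert pairs -> 1, generated pairs -> 0 ---
    disc.train()
    d_expert = disc(expert_states, expert_actions)
    d_fake = disc(states, gen(states).detach())
    d_loss = (F.binary_cross_entropy(d_expert, torch.ones_like(d_expert)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()
    return g_loss.item(), d_loss.item()
```

At deployment time, only the trained generator is retained; it is queried once per control step with the current state vector to obtain the rudder and propeller commands.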
4. Simulation Results
4.1. Simulation Settings
To evaluate the performance of the proposed control strategy, simulations of both control tasks are conducted on a model-scale twin-screw azimuth stern-driven (ASD) tug,
Qiuxin No. 6, shown in
Figure 3. Key technical specifications are summarized in
Table 6. For comparison, a behavior cloning (BC)-based controller is also formulated. The training parameter settings of BC and GAIL are shown in
Table 7.
4.2. Baseline Comparison: Behavior Cloning
Behavior cloning (BC) is a commonly used baseline in imitation learning, where a policy is trained via supervised learning on expert demonstrations. Given a dataset of state–action pairs collected from expert operations, $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$, where $s_i$ denotes a state of the system and $a_i$ the corresponding action of the expert, the objective is to learn a parameterized policy $\pi_\theta$ that minimizes the deviation from the behavior of the expert:

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} L\big(\pi_\theta(s_i), a_i\big).$$

For continuous control tasks, the loss function $L$ is typically the Mean Squared Error (MSE):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \pi_\theta(s_i) - a_i \big\rVert^{2}.$$

The model is optimized using gradient descent:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta),$$

where $\alpha$ is the learning rate. After training, the learned policy $\pi_\theta$ serves as a control law that maps real-time state observations to control commands.
In this paper, the BC controller maps the ship state and trajectory tracking error to rudder and propeller commands. The input includes motion state, heading, and deviation metrics; the output is the control instruction vector. The input data are normalized to the input range of the activation function. An encoder network extracts state features before they are fed into the policy backbone. Details are shown in Table 8 and Table 9. The loss function combines the MSE prediction errors of the state transitions and the control commands.
To facilitate data processing and improve feature extraction, the input state is divided into two components: one representing the ship's current motion and heading states, and the other representing the deviation from the desired states. The features extracted from these two components are concatenated and passed into the BC network for control action generation. The network architecture and layer configurations are summarized in Table 9. To enhance generalization and prevent overfitting, batch normalization and dropout are applied after each hidden layer.
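For reference, a minimal supervised-learning sketch of this BC setup is given below. The encoder and backbone widths, the dimensions of the motion-state and deviation components, and the hyperparameters are assumptions made for illustration (the actual settings are those in Table 8 and Table 9), and only the control-command MSE term is shown, whereas the paper's loss also includes a state-transition prediction term.

```python
import torch
import torch.nn as nn

MOTION_DIM, DEV_DIM, ACTION_DIM = 6, 3, 2  # assumed feature sizes

class BCPolicy(nn.Module):
    """Encodes motion-state and deviation features, then predicts control commands."""
    def __init__(self, hidden=64, p_drop=0.1):
        super().__init__()
        self.motion_enc = nn.Sequential(nn.Linear(MOTION_DIM, hidden), nn.ReLU())
        self.dev_enc = nn.Sequential(nn.Linear(DEV_DIM, hidden), nn.ReLU())
        self.backbone = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, ACTION_DIM), nn.Tanh(),  # normalized commands
        )

    def forward(self, motion, deviation):
        feats = torch.cat([self.motion_enc(motion), self.dev_enc(deviation)], dim=-1)
        return self.backbone(feats)

def bc_train_step(policy, optimizer, motion, deviation, expert_actions):
    """One supervised update: MSE between predicted and expert control commands."""
    pred = policy(motion, deviation)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```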
4.3. Heading Control Performance
Figure 4 illustrates the heading control performance when tracking reference angles of +45° and −45°. The tug is initialized at a fixed starting position with zero initial surge, sway, and yaw velocities, and a separate initial heading angle is specified for each of the +45° and −45° cases.
Table 10 compares the heading control performance of BC and GAIL in ±45° target tracking. GAIL achieves significantly lower average and maximum heading angle errors compared to BC, indicating more accurate and robust control. Although GAIL shows a slightly larger average rudder amplitude, the improved heading control performance justifies this marginal increase in control effort.
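The evaluation metrics used here and in the following tables (mean and maximum absolute heading error, mean rudder amplitude) can be computed from logged trajectories as in the short sketch below; the variable names and the angle-wrapping convention are illustrative assumptions, not the paper's exact post-processing code.

```python
import numpy as np

def heading_metrics(psi_ref, psi, rudder):
    """Mean/max absolute heading error and mean rudder amplitude (all in degrees)."""
    err = (np.asarray(psi_ref) - np.asarray(psi) + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return {
        "mean_abs_heading_error": np.mean(np.abs(err)),
        "max_abs_heading_error": np.max(np.abs(err)),
        "mean_rudder_amplitude": np.mean(np.abs(rudder)),
    }
```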
4.4. Path-Following Control Performance
The core objective of the path-following control task is to stabilize the ship along a predefined path. In the comparison experiments, both the BC and GAIL controllers are trained on expert demonstrations collected from human-operated maneuvers along various randomly generated paths.
Figure 5 shows the path-following control performance and the corresponding control inputs along reference path 1.
Table 11 compares the performance of BC and GAIL controllers in terms of different evaluation indicators. The GAIL controller demonstrates superior performance across all indicators. Specifically, the GAIL controller reduces the mean absolute heading error from 6.12° to 1.22°, and the maximum heading error from 16.82° to 3.53°, representing improvements of approximately 80.1% and 79.0%, respectively. This suggests a significant enhancement in tracking accuracy. Additionally, the mean rudder amplitude is slightly reduced from 4.17° to 3.78°, indicating smoother and more stable control effort.
Figure 6 shows the path-following control performance and the corresponding control inputs along reference path 2.
Table 12 presents the performance comparison of the BC and GAIL controllers in the path-following task along reference path 2. The GAIL controller again demonstrates notable improvements across all evaluated metrics.
In terms of heading accuracy, the GAIL controller reduces the mean absolute heading error from 5.34° to 0.50°, and the maximum heading error from 18.65° to 2.54°, representing reductions of approximately 90.6% and 86.4%, respectively. These results highlight GAIL's strong capability in learning control policies that closely follow the desired path. Furthermore, the mean rudder amplitude is slightly reduced from 4.18° to 3.82°, indicating a smoother and more efficient control effort compared to BC.
Overall, the GAIL-based controller not only enhances tracking accuracy but also maintains better control stability. These findings further support the effectiveness of adversarial imitation learning in marine motion control applications.
4.5. Influence of Training Data Size
Imitation learning methods are highly dependent on the quality and size of the training dataset. To evaluate the impact of dataset size on control performance, this section conducts comparative experiments using both the full training dataset and reduced-scale subsets. The full dataset includes all available expert demonstration data, while the reduced datasets contain 70% and 50% of the full dataset, respectively. To ensure consistency in the dimensionality of ship state and control command inputs, only the time duration of the training sequences is proportionally shortened to 70% and 50% of the original length. These experiments aim to assess how the amount of training data influences the accuracy and robustness of the learned control policy, thereby providing a reference for optimizing data efficiency in future training strategies.
Figure 7 shows the heading control performance under different training data sizes; specific performance metrics are compared in
Table 13. With the full dataset, the controller achieves the lowest heading error metrics, with a mean absolute heading error of 2.12° and a maximum error of 8.78°, demonstrating accurate and stable heading control. When the training data is reduced to 70%, the mean heading error increases sharply to 6.62°, and the maximum error rises to 13.51°, accompanied by an increase in mean rudder amplitude from 2.82° to 3.35°. This suggests that the controller requires more frequent and stronger adjustments to compensate for reduced model precision.
Interestingly, at 50% data size, the mean rudder amplitude slightly decreases to 3.05°, but the mean and maximum heading errors continue to increase, reaching 7.97° and 16.45°, respectively. This indicates a further loss in control accuracy, despite slightly reduced actuation effort.
Figure 8 illustrates the path-following control performance with varying training data sizes, while a detailed quantitative comparison is provided in
Table 14. The results show a significant degradation in path-following accuracy as the training dataset size decreases. With the complete dataset, the GAIL controller achieves excellent performance. When the training data is reduced to 70%, the mean heading error increases sharply and the maximum error escalates to 21.35°. The mean rudder amplitude also increases, implying that the controller applies greater control effort while its tracking ability nonetheless declines. This indicates that insufficient training data limits the generalization capability of the learned policy.
The performance further deteriorates with only 50% of the data, where the mean heading error rises dramatically and the maximum error reaches 48.10°. Meanwhile, the rudder amplitude increases to 5.68°, suggesting increased but still ineffective control responses.
It can be seen that unlike heading control, path-following tasks appear more sensitive to data quantity, likely due to their higher complexity and the need for sustained control over a longer horizon. This highlights the importance of ensuring sufficient and diverse expert demonstrations when applying imitation learning to path-following or trajectory tracking tasks.
5. Conclusions and Future Work
This paper proposes an imitation learning-based control method, trained on expert demonstrations, for heading control and path-following of autonomous surface ships. The results confirm that the proposed approach achieves improved control accuracy and stability compared to baseline methods.
However, several limitations remain. First, the paired azimuth thrusters of the twin-screw ASD tug are assumed to take the same rudder angle and rotation speed, which restricts the applicability of the methods to other propulsion configurations. In addition, the task design covers only a limited set of control scenarios. Future work should extend the approach to different ship types, propulsion systems, and more diverse control tasks to improve generalization. Second, although the neural network architectures used in the BC and GAIL controllers show promising results, there is still room for optimization in terms of model structure, input–output representation, and parameter tuning. Moreover, the validation is currently limited to simulation environments. Future research will aim to enhance model efficiency and robustness and to conduct verification through physical experiments or real-time control platforms to support practical deployment in intelligent marine systems.