1. Introduction
The development of smart ships has become a major focus of maritime innovation as the requirements for autonomy, safety and operational efficiency continue to increase. The development of autonomous control technologies is increasingly shifting from traditional rule-based strategies to data-driven and learning-based approaches. Among them, deep learning (DL) and reinforcement learning (RL) have shown strong potential in simulating and replicating the complex, non-linear and uncertain maneuvering behaviors typically exhibited by human operators.
Recent studies have shown the effectiveness of learning-based approaches in various ship control tasks. For instance, deep Q-networks (DQNs) and deterministic policy gradient algorithms have been used to develop control strategies for navigating ships through constrained channels in numerical simulation environments. The comparative analysis reveals both methods to be effective while exhibiting distinct control behaviors [
1]. A deep deterministic policy gradient (DDPG)-based path-planning algorithm has been proposed for unmanned surface vessels (USVs), offering continuous control outputs and a well-designed reward function for trajectory approximation, speed regulation, and position stabilization [
2]. Other deep reinforcement learning (DRL) methods focus on collision avoidance by determining the necessity and direction of evasive maneuvers [
3], or by designing state–action spaces tailored to an autonomous navigation and obstacle avoidance (ANOA) framework, which outperforms traditional DQN and Deep Sarsa methods [
4]. DRL methods can also be combined with other methods like artificial potential fields to refine the reward and action space design in uncertain environments [
5]. DRL has also been integrated with model predictive control (RL-MPC) to improve trajectory tracking by combining neural networks and MPC in an actor–critic framework [
6]. For multi-ship encounters, a generalized behavioral decision-making (GBDM) model is proposed, in which the RL-trained model uses obstacle zone by target (OZT) information derived from dynamic ship data and virtual sensors. By incorporating COLREGs into the reward structure, the model handles collision avoidance decisions effectively across 60 COLREG-clustered scenarios, exhibiting flexibility and scalability [
7].
However, existing reinforcement learning methods often require large volumes of training data and extensive interaction with the environment, which is impractical in real maritime applications. Accurate modeling of ship dynamics is also difficult, especially under limited onboard sensor configurations and with scarce real-world data. Moreover, the high cost and safety risks associated with full-scale tug trials restrict access to empirical validation, necessitating the use of simulation-based development and evaluation.
To address these challenges, control methods based on imitation learning (IL) offer promising alternatives for intelligent ship control. IL is a machine learning paradigm in which an agent learns to perform tasks by observing and replicating expert demonstrations [
8]. Inspired by natural learning mechanisms observed in humans and animals, especially in how children acquire language and social behavior through imitation [
9], IL enables effective policy acquisition without explicit modeling of system dynamics.
A representative direction in IL is generative adversarial networks (GANs) [
10], which learn data distributions through an adversarial game between a generator and a discriminator. Building on this idea, Ho and Ermon [
11] introduced Generative Adversarial Imitation Learning (GAIL), which integrates GANs with inverse reinforcement learning. In GAIL, the discriminator not only distinguishes between expert and agent behaviors but also generates a reward signal to guide the policy update. The agent interacts with the environment and optimizes its policy using reinforcement learning based on feedback from the discriminator.
GAIL demonstrates superior performance in several benchmark control tasks, including CartPole, Acrobot, and MountainCar, outperforming classical approaches such as behavioral cloning (BC), feature expectation matching (FEM), and game-theoretic apprenticeship learning (GTAL). GAIL has been extended to simulate human highway driving by integrating recurrent neural networks with trust region policy optimization (TRPO), achieving improved safety, lane-changing realism, and trajectory stability on the NGSIM dataset [
12]. To address the compounding error problem in BC, a variational autoencoder (VAE) is incorporated into GAIL to learn a low-dimensional semantic policy space, showing promising results in robotic manipulation and gait control [
13]. In the transportation domain, a conditional GAIL (cGAIL) model is trained on three months of Shenzhen taxi GPS data to learn driving strategies, enhancing income and public service quality through inter-agent knowledge transfer [
14]. A hybrid GAIL-DDPG framework is proposed with a gain regulator to improve training efficiency and adaptability under varying working conditions [
15]. GAIL is also applied to energy management in commercial buildings, outperforming proximal policy optimization (PPO) in controlling airflow systems with expert-guided efficiency gains [
16]. The SAC-LSTM-GAIL (SL-GAIL) algorithm combines the soft actor–critic (SAC) algorithm with LSTM and uses the strategies obtained by SAC-LSTM as expert data for GAIL; as a result, it does not need to spend time exploring unknown environments and instead learns control strategies directly from stable expert data [
17].
GAIL has been extended to a variety of advanced control tasks across domains. A virtual reality-integrated GAIL model (VR-GAIL) achieves higher success rates than PPO in long field-of-view multi-subtask robotic co-construction tasks, without relying on explicit reward design [
18]. By modeling the GAIL discriminator as an additive optimality signal, imitation learning and reinforcement learning are unified as probabilistic inference under a multi-objective partially observable Markov decision process, resulting in significantly better policy performance [
19]. In federated multi-agent learning, GAIL is used to track UAV motion during the global model update phase, while self-imitation learning (SIL) corrects local errors, enabling efficient distributed strategy coordination [
20]. For motion planning, a GAIL-based route planner effectively replicates DRL-generated collision avoidance trajectories [
21]. In sim-to-real transfer, PPO is used in simulation and GAIL then adapts the model for real-world dual-arm robot assembly tasks [
22]. Additionally, a DGAIL framework integrates DQN as a GAIL generator to model intelligent driving behaviors in structured traffic, reducing action randomness, improving training efficiency, and outperforming A3C, DQN, and GAIL in straight and merging road scenarios [
23].
Motivated by the strengths of imitation learning, particularly the robustness and generalization capabilities of GAIL, this paper proposes a GAIL-based control strategy for autonomous surface ships. Unlike conventional model-based controllers that rely on accurate dynamic modeling, the proposed approach learns control policies directly from expert demonstrations, offering greater adaptability to the uncertainties and nonlinearities of real-world ship maneuvering. Specifically, we develop GAIL-based control policies for two fundamental navigation tasks: heading control and path-following. These tasks are critical to autonomous navigation and berthing operations. The proposed framework leverages expert trajectory data collected from human-operated maneuvers to train a policy model that imitates expert behavior with improved smoothness, accuracy, and actuation efficiency.
The main contributions of this paper are as follows:
An imitation learning-based control framework is proposed for autonomous surface ships, enabling data-driven controller design using expert demonstration data. The framework is applied to both heading control and path-following tasks, and its effectiveness is validated through simulation experiments.
An adversarial training structure is developed, in which a control policy generator and a behavior discriminator are jointly optimized. This architecture enables the learned control policy to closely imitate expert-level behavior and enhances the generalization capability of the controller beyond conventional imitation learning methods.
The remainder of this paper is organized as follows.
Section 2 introduces the fundamentals of ship dynamics and the Generative Adversarial Imitation Learning framework.
Section 3 describes the controller design and implementation process.
Section 4 presents the simulation results and performance analysis.
Section 5 concludes the paper and outlines directions for future research.
3. Controller Design Based on Generative Adversarial Imitation Learning
3.1. Markov Decision Process Formulation
The Markov Decision Process (MDP) is a mathematical framework for describing sequential decision-making problems, integrating core concepts of states, actions, transition probabilities, and reward functions. For the ship motion control problem, the task model can be described as follows: At a given time step $t$, the ship is in state $s_t = [x, y, \psi, u, v, r]^\top$, where $x$ and $y$ are the position coordinates, $\psi$ is the heading angle, and $u$, $v$, and $r$ are the surge, sway, and yaw velocities, respectively. Based on this state, the controller generates a control action $a_t = [\delta, n]^\top$, representing the rudder angle $\delta$ and propeller revolution speed $n$. This action drives the ship to transition to the next state $s_{t+1}$. This paper focuses on learning a control policy $\pi(a \mid s)$ that maps state to action in a manner consistent with human expert behavior. The goal is to develop a human-like control strategy that imitates the maneuvering patterns of a human captain under similar operating conditions.
In typical navigation tasks, the data collected from ship operations generally comprises three key components: (1) path information, which includes a sequence of coordinate points representing the desired or executed trajectory; (2) control commands, such as rudder angles and engine thrust inputs; and (3) real-time motion states, including actual rudder angles, speeds, and heading deviations. Imitation learning algorithms leverage such data to infer control policies by learning from expert demonstrations. Through this process, the underlying decision-making patterns of human operators can be extracted and generalized to previously unseen scenarios. Consequently, expert demonstrations are transformed into closed-loop control policies, enabling autonomous systems to replicate human-like behavior in complex navigation environments.
This paper formulates the state and action spaces of the MDP as follows: the state space is $\mathcal{S} = \{ s = [x, y, \psi, u, v, r]^\top \}$, comprising the ship position, heading, and body-fixed velocities, and the action space is $\mathcal{A} = \{ a = [\delta, n]^\top \}$, comprising the rudder angle and propeller revolution speed.
The transition function in this MDP, denoted by $P(s_{t+1} \mid s_t, a_t)$, characterizes the evolution of the ship's state under the influence of control actions. It is governed by the ship's underlying maneuvering dynamics, which may be represented by data-driven models or physics-based equations.
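As a minimal illustration of these MDP elements, a schematic sketch is given below. It is not the implementation used in this paper: the variable names, the integration step, and the placeholder dynamics function are assumptions made purely for exposition.

```python
import numpy as np

# State s_t = [x, y, psi, u, v, r]: position, heading, and surge/sway/yaw velocities.
# Action a_t = [delta, n]: rudder angle and propeller revolution speed.

class ShipMDP:
    """Schematic MDP wrapper around a ship maneuvering model."""

    def __init__(self, dynamics, dt=0.1):
        self.dynamics = dynamics  # callable: (state, action) -> time derivative of state
        self.dt = dt

    def step(self, state, action):
        # Transition P(s_{t+1} | s_t, a_t), realized here by Euler integration of
        # a maneuvering model (data-driven or physics-based).
        return state + self.dt * self.dynamics(state, action)


def kinematic_placeholder(state, action):
    # Placeholder dynamics: pure kinematics with constant body-fixed velocities.
    # A real implementation would substitute an identified or physics-based model.
    x, y, psi, u, v, r = state
    return np.array([u * np.cos(psi) - v * np.sin(psi),
                     u * np.sin(psi) + v * np.cos(psi),
                     r, 0.0, 0.0, 0.0])
```

Under this interface, an expert trajectory is simply the sequence of (state, action) pairs recorded while a human operator steers the real or simulated ship.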
3.2. Expert Demonstration Data Preparation
Ship motion control can be decomposed into several sub-tasks based on different control objectives related to heading and speed, including speed control, heading control, path-following, and trajectory tracking. The performance and robustness of imitation learning models are highly dependent on the accuracy and diversity of the expert demonstration data used for training.
Expert data can be obtained from various sources, such as high-fidelity virtual simulations, physical experiments, or real-world sea trials. In this study, a scaled-down model ship is used as the experimental platform to construct the expert dataset. The dataset is specifically collected for two representative control tasks: heading control and path-following. The types of experiments and the sampling frequencies are summarized in
Table 1.
To ensure the professionalism, authenticity, and accuracy of the data, individuals with certified piloting qualifications were invited to participate in the experimental operations. These participants included students majoring in maritime technology, nautical interns, and licensed officers. By collaborating with these experienced personnel, the collected expert dataset reflects realistic ship operation behavior under diverse conditions.
3.3. Learning Control Policy Through GAIL
The GAIL framework consists of a generator G and a discriminator D. The policy $\pi$ within G generates a control action $a$ given the state $s$, while the discriminator D takes the state–action pair $(s, a)$ as input and outputs a probability value $D(s, a)$, indicating the likelihood that the action was taken by the expert rather than generated by G. Ideally, the generator learns to produce actions that the discriminator cannot distinguish from expert actions. For the ship control task addressed in this paper, the trained generator G is expected to function directly as a controller within the control system. It takes the control target as input and outputs a sequence of control commands, such that the resulting ship behavior closely imitates that in the expert dataset. To this end, the control objectives from the expert data, which describe the ship's desired state and motion, are incorporated as part of the generator's input, replacing the traditional use of random noise. In implementation, the generator is structured as a multilayer feed-forward neural network. The detailed configuration of its input and output variables, including their respective dimensions, is provided in Table 2.
Here, $\Delta x$, $\Delta y$, and $\Delta\psi$ represent the incremental errors in position and heading angle between two consecutive time steps. These values are calculated from the expert data and used as part of the training input to capture the dynamic trends of the ship's motion. The architecture and layer configuration of the generator network G are presented in Table 3.
Batch normalization is applied to each layer of the generator to accelerate training. The discriminator adopts the standard binary classification structure commonly used in GANs and GAIL, processing the input features through a multi-layer fully connected network and outputting a probability between 0 and 1 to indicate whether the input data come from the expert dataset or the generator. The input and output variable settings of the discriminator are shown in Table 4.
In this framework, the generator output is denoted as G (assigned label 0) and the expert data as E (assigned label 1). The discriminator D is trained to estimate the probability that a given input originates from the expert, thereby distinguishing between G and E. The network architecture and configuration of the discriminator are summarized in Table 5.
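For illustration, a minimal PyTorch sketch of the two networks described above is given below. The hidden-layer widths and input dimensions are assumptions made for the example (the actual configurations are those in Table 2, Table 3, Table 4 and Table 5); only the overall structure, a feed-forward generator with batch normalization and a binary-classification discriminator over state–action pairs, follows the text.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 9, 2, 64  # assumed sizes for illustration

class Generator(nn.Module):
    """Maps the control objective / ship state to a control action (rudder, rpm)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.BatchNorm1d(HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.BatchNorm1d(HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),  # bounded control outputs
        )

    def forward(self, state):
        return self.net(state)

class Discriminator(nn.Module):
    """Scores a state–action pair: probability that it comes from the expert."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1), nn.Sigmoid(),  # output in (0, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```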
The training procedure of GAIL is shown in
Figure 2. In the GAIL framework, the generator takes state vectors as input and outputs action vectors through a fully connected network, with the output dimension matching that of the environment’s action space. Each generated action is concatenated with the corresponding state to form the input to the discriminator, which estimates the probability that the input originates from expert data. The goal of the generator is to produce actions that approximate expert behavior closely enough to be identified as expert data by the discriminator. Meanwhile, the discriminator is trained to accurately distinguish between expert data and data generated by the generator, providing effective feedback to guide the generator’s improvement. During training, the generator is updated based on the output of the discriminator, while the discriminator is optimized by minimizing the loss on both expert and generated data.
The generator G functions as a virtual controller that maps the current system state to control commands, whereas the discriminator serves as a supervisory module that differentiates model-generated behavior from expert demonstrations. Through alternating optimization, the two networks iteratively improve their objectives until convergence, enabling G to approximate expert-level control decisions.
During training, G receives the instantaneous ship state and outputs a candidate control action. This action, concatenated with the input state, is passed to the discriminator. The generator seeks to maximize the probability that its outputs are classified as expert data; this is formulated with a binary cross-entropy loss that drives the discriminator’s prediction towards unity for generator samples. To maintain numerical stability, G is updated while the discriminator remains in evaluation mode, preventing inadvertent parameter changes. The discriminator is then trained on labeled state–action pairs to distinguish expert from generated data, providing informative gradients that guide G towards expert-like behavior. After training, the generator model exhibiting the best convergence metrics is deployed as a real-time controller. In operation it receives live state measurements and produces control commands, forming a closed-loop system that steers the ship along a desired heading or reference path. This realization effectively transfers expert knowledge into autonomous motion control.
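A condensed sketch of this alternating update, using the same label convention as above (expert = 1, generator = 0), is shown below. The optimizers, batch handling, and hyperparameters are placeholders, and this simplified variant updates the generator directly through the discriminator's binary cross-entropy signal as described in the text, omitting the policy-gradient machinery (e.g., TRPO/PPO) used in many GAIL implementations.

```python
import torch
import torch.nn.functional as F

def gail_update(gen, disc, gen_opt, disc_opt, expert_states, expert_actions, states):
    """One alternating GAIL-style update (illustrative sketch)."""
    # --- Generator update: make D classify generated pairs as expert (label 1) ---
    disc.eval()  # eval mode; only the generator's optimizer steps in this phase
    actions = gen(states)
    g_loss = F.binary_cross_entropy(disc(states, actions),
                                    torch.ones(states.size(0), 1))
    gen_opt.zero_grad()
    g_loss.backward()
    gen_opt.step()

    # --- Discriminator update: expert pairs -> 1, generated pairs -> 0 ---
    disc.train()
    d_expert = disc(expert_states, expert_actions)
    d_fake = disc(states, gen(states).detach())
    d_loss = (F.binary_cross_entropy(d_expert, torch.ones_like(d_expert)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()
    return g_loss.item(), d_loss.item()
```

At deployment time, only the trained generator is retained; it is queried once per control step with the current state vector to obtain the rudder and propeller commands.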
4. Simulation Results
4.1. Simulation Settings
To evaluate the performance of the proposed control strategy, simulations of both control tasks are conducted on a model-scale twin-screw azimuth stern-driven (ASD) tug,
Qiuxin No. 6, shown in
Figure 3. Key technical specifications are summarized in
Table 6. For comparison, a behavior cloning (BC)-based controller is also formulated. The training parameter settings of BC and GAIL are shown in
Table 7.
4.2. Baseline Comparison: Behavior Cloning
Behavior cloning (BC) is a commonly used baseline in imitation learning, where a policy is trained via supervised learning on expert demonstrations. Given a dataset of state–action pairs collected from expert operations, $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^{N}$, where $s_i$ denotes a state of the system and $a_i$ the corresponding action of the expert, the objective is to learn a parameterized policy $\pi_\theta$ that minimizes the deviation from the behavior of the expert:

$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} L\big(\pi_\theta(s_i), a_i\big).$$

For continuous control tasks, the loss function $L$ is typically the Mean Squared Error (MSE):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \pi_\theta(s_i) - a_i \big\rVert^{2}.$$

The model is optimized using gradient descent:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta),$$

where $\alpha$ is the learning rate. After training, the learned policy $\pi_\theta$ serves as a control law that maps real-time state observations to control commands.
In this paper, the BC controller maps the ship state and trajectory tracking error to rudder and propeller commands. The input includes motion state, heading, and deviation metrics; the output is the control instruction vector. The input data are normalized to the input range of the activation function. An encoder network extracts state features before they are fed into the policy backbone. Details are shown in Table 8 and Table 9. The loss function combines the MSE prediction errors of the state transitions and the control commands.
To facilitate data processing and improve feature extraction, the input state is divided into two components: one representing the ship's current motion and heading states, and the other representing the deviation from the desired states. The features extracted from these two components are concatenated and passed into the BC network for control action generation. The network architecture and layer configurations are summarized in Table 9. To enhance generalization and prevent overfitting, batch normalization and dropout are applied after each hidden layer.
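For reference, a minimal supervised-learning sketch of this BC setup is given below. The encoder and backbone widths, the dimensions of the motion-state and deviation components, and the hyperparameters are assumptions made for illustration (the actual settings are those in Table 8 and Table 9), and only the control-command MSE term is shown, whereas the paper's loss also includes a state-transition prediction term.

```python
import torch
import torch.nn as nn

MOTION_DIM, DEV_DIM, ACTION_DIM = 6, 3, 2  # assumed feature sizes

class BCPolicy(nn.Module):
    """Encodes motion-state and deviation features, then predicts control commands."""
    def __init__(self, hidden=64, p_drop=0.1):
        super().__init__()
        self.motion_enc = nn.Sequential(nn.Linear(MOTION_DIM, hidden), nn.ReLU())
        self.dev_enc = nn.Sequential(nn.Linear(DEV_DIM, hidden), nn.ReLU())
        self.backbone = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, ACTION_DIM), nn.Tanh(),  # normalized commands
        )

    def forward(self, motion, deviation):
        feats = torch.cat([self.motion_enc(motion), self.dev_enc(deviation)], dim=-1)
        return self.backbone(feats)

def bc_train_step(policy, optimizer, motion, deviation, expert_actions):
    """One supervised update: MSE between predicted and expert control commands."""
    pred = policy(motion, deviation)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```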
4.3. Heading Control Performance
Figure 4 illustrates the heading control performance when tracking reference angles of +45° and −45°. The tug is initialized at a fixed starting position with zero initial surge, sway, and yaw velocities, and a separate initial heading angle is specified for each of the +45° and −45° cases.
Table 10 compares the heading control performance of BC and GAIL in ±45° target tracking. GAIL achieves significantly lower average and maximum heading angle errors compared to BC, indicating more accurate and robust control. Although GAIL shows a slightly larger average rudder amplitude, the improved heading control performance justifies this marginal increase in control effort.
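The evaluation metrics used here and in the following tables (mean and maximum absolute heading error, mean rudder amplitude) can be computed from logged trajectories as in the short sketch below; the variable names and the angle-wrapping convention are illustrative assumptions, not the paper's exact post-processing code.

```python
import numpy as np

def heading_metrics(psi_ref, psi, rudder):
    """Mean/max absolute heading error and mean rudder amplitude (all in degrees)."""
    err = (np.asarray(psi_ref) - np.asarray(psi) + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return {
        "mean_abs_heading_error": np.mean(np.abs(err)),
        "max_abs_heading_error": np.max(np.abs(err)),
        "mean_rudder_amplitude": np.mean(np.abs(rudder)),
    }
```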
4.4. Path-Following Control Performance
The core objective of the path-following control task is to stabilize the ship along a predefined path. In the comparison experiments, both the BC and GAIL controllers are trained on expert demonstrations collected from human-operated maneuvers along various randomly generated paths.
Figure 5 shows the path-following control performance and the corresponding control inputs along reference path 1.
Table 11 compares the performance of BC and GAIL controllers in terms of different evaluation indicators. The GAIL controller demonstrates superior performance across all indicators. Specifically, the GAIL controller reduces the mean absolute heading error from 6.12° to 1.22°, and the maximum heading error from 16.82° to 3.53°, representing improvements of approximately 80.1% and 79.0%, respectively. This suggests a significant enhancement in tracking accuracy. Additionally, the mean rudder amplitude is slightly reduced from 4.17° to 3.78°, indicating smoother and more stable control effort.
Figure 6 shows the path-following control performance and the corresponding control inputs along reference path 2.
Table 12 presents the performance comparison of the BC and GAIL controllers in the path-following task along reference path 2. The GAIL controller again demonstrates notable improvements across all evaluated metrics.
In terms of heading accuracy, the GAIL controller reduces the mean absolute heading error from 5.34° to 0.50°, and the maximum heading error from 18.65° to 2.54°, representing reductions of approximately 90.6% and 86.4%, respectively. These results highlight GAIL's strong capability in learning control policies that closely follow the desired path. Furthermore, the mean rudder amplitude is slightly reduced from 4.18° to 3.82°, indicating a smoother and more efficient control effort compared to BC.
Overall, the GAIL-based controller not only enhances tracking accuracy but also maintains better control stability. These findings further support the effectiveness of adversarial imitation learning in marine motion control applications.
4.5. Influence of Training Data Size
Imitation learning methods are highly dependent on the quality and size of the training dataset. To evaluate the impact of dataset size on control performance, this section conducts comparative experiments using both the full training dataset and reduced-scale subsets. The full dataset includes all available expert demonstration data, while the reduced datasets contain 70% and 50% of the full dataset, respectively. To ensure consistency in the dimensionality of ship state and control command inputs, only the time duration of the training sequences is proportionally shortened to 70% and 50% of the original length. These experiments aim to assess how the amount of training data influences the accuracy and robustness of the learned control policy, thereby providing a reference for optimizing data efficiency in future training strategies.
Figure 7 shows the heading control performance under different training data sizes; specific performance metrics are compared in
Table 13. With the full dataset, the controller achieves the lowest heading error metrics, with a mean absolute heading error of 2.12° and a maximum error of 8.78°, demonstrating accurate and stable heading control. When the training data is reduced to 70%, the mean heading error increases sharply to 6.62°, and the maximum error rises to 13.51°, accompanied by an increase in mean rudder amplitude from 2.82° to 3.35°. This suggests that the controller requires more frequent and stronger adjustments to compensate for reduced model precision.
Interestingly, at 50% data size, the mean rudder amplitude slightly decreases to 3.05°, but the mean and maximum heading errors continue to increase, reaching 7.97° and 16.45°, respectively. This indicates a further loss in control accuracy, despite slightly reduced actuation effort.
Figure 8 illustrates the path-following control performance with varying training data sizes, while a detailed quantitative comparison is provided in
Table 14. The results show a significant degradation in path-following accuracy as the training dataset size decreases. With the complete dataset, the GAIL controller achieves excellent performance. When the training data is reduced to 70%, the mean heading error increases sharply and the maximum error escalates to 21.35°. The mean rudder amplitude also increases, implying that the controller applies greater control effort while its tracking ability nonetheless declines. This indicates that insufficient training data limits the generalization capability of the learned policy.
The performance further deteriorates with only 50% of the data, where the mean heading error rises dramatically and the maximum error reaches 48.10°. Meanwhile, the rudder amplitude increases to 5.68°, suggesting increased but still ineffective control responses.
It can be seen that unlike heading control, path-following tasks appear more sensitive to data quantity, likely due to their higher complexity and the need for sustained control over a longer horizon. This highlights the importance of ensuring sufficient and diverse expert demonstrations when applying imitation learning to path-following or trajectory tracking tasks.
5. Conclusions and Future Work
This paper proposes an imitation learning-based control method, trained on expert demonstrations, for heading control and path-following of autonomous surface ships. The results confirm that the proposed approach achieves improved control accuracy and stability compared to baseline methods.
However, several limitations remain. First, the paired azimuth thrusters of the twin-screw ASD tug are assumed to take the same rudder angle and rotation speed, which restricts the applicability of the methods to other propulsion configurations. In addition, the task design covers only a limited set of control scenarios. Future work should extend the approach to different ship types, propulsion systems, and more diverse control tasks to improve generalization. Second, although the neural network architectures used in the BC and GAIL controllers show promising results, there is still room for optimization in terms of model structure, input–output representation, and parameter tuning. Moreover, the validation is currently limited to simulation environments. Future research will aim to enhance model efficiency and robustness and to conduct verification through physical experiments or real-time control platforms to support practical deployment in intelligent marine systems.