3.2.1. Candidate Set Q*<sup>n</sup>* Identification and Navigation Status Update

At iteration *n* ≥ 1, the algorithm first identifies the candidate set Q*n*, which is made up of all the waymark points within a rectangle centered at the latest point in the trajectory, i.e., around point *pn* = (*λ*¯ *<sup>n</sup>*, *φ*¯*n*):

$$\mathcal{Q}\_n = \left\{ (\lambda\_m, \bar{\phi}\_m, \theta\_m, \bar{\rho}\_m) \middle| \lambda\_m \in (\bar{\lambda}\_n - \varepsilon\_{\lambda'}, \bar{\lambda}\_n + \varepsilon\_{\lambda}), \phi\_m \in (\bar{\phi}\_n - \varepsilon\_{\phi'}, \bar{\phi}\_n + \varepsilon\_{\phi}) \right\} \tag{4}$$

where *ελ* > 0 and *εφ* > 0 are used to define the size of the rectangular area.

To ensure that the waymark point *pn*+<sup>1</sup> locates in the direction that is consistent with the recent trajectory trend, we update the current heading trend *<sup>d</sup>curn* as:

$$
\vec{d}\_{cur\_n} = \begin{cases}
\vec{d}\_n & n = 1 \\
LDM\left(\vec{d}\_{cur\_{n-1}}, \vec{d}\_n, \vec{d}\_{\mathcal{Q}'\_{n-1}}\right) & n > 1
\end{cases} \tag{5}
$$

where *<sup>d</sup><sup>n</sup>* <sup>=</sup> *dire*(*pn*−1, *pn*) and *<sup>d</sup>*Q *<sup>n</sup>*−<sup>1</sup> is the mean of the COGs of the waymark points in set Q *<sup>n</sup>*−1, and Q *<sup>n</sup>*−<sup>1</sup> is the filtered candidate set in the (*<sup>n</sup>* <sup>−</sup> <sup>1</sup>)-th iteration (see Section 3.2.2 for details). From the above equation, it can be seen that when *<sup>n</sup>* > 1, *<sup>d</sup>curn* is the linear directional mean of the (predicted) traveling direction of iteration *n* − 1, the heading direction *<sup>d</sup><sup>n</sup>* from point *pn*−<sup>1</sup> to *pn* (the current position), and the mean COG of the key points in Q *<sup>n</sup>*−1.

#### 3.2.2. Candidate Set Filtering

In this step, the waymark points in Q*<sup>n</sup>* that deviate significantly from the expected traveling direction are filtered out from further consideration in the current prediction iteration. To do so, we first denote *<sup>d</sup>DESn* . = *dire*(*pn*, *pDES*) as the direction from the current position to the destination and then introduce the following angle-dependent parameters:

$$
\delta'\_m = \Delta\_{\vec{d}\_{pn, \phi m} \vec{d}\_{DES\_m}} \tag{6}
$$

$$
\delta\_m = \Delta\_{\theta\_m \delta\_m} \tag{7}
$$

where *<sup>d</sup>pn*,*qm* = *dire*(*pn*, *qm*) is the heading direction from *pn* to *qm*. Parameter *<sup>δ</sup> <sup>m</sup>* measures the degree of difference between *<sup>d</sup>pn*,*qm* and the direction of the current position towards the destination, while *δ<sup>m</sup>* measures the degree of difference between the COG of *pn* and the COG of *qm*. An illustration of the above notation is given in Figure 3.

Ideally, if a tanker does not make a sharp turn, *δ<sup>m</sup>* is expected to take small values. Additionally, as the tanker is traveling towards the destination, it is expected that the heading from *pn* to *qm* does not deviate too drastically from *<sup>d</sup>DESn* . Therefore, it is expected that *δ <sup>m</sup>* does not take extremely large values. Therefore, we identify the refined candidate set by applying filtering based on *δ<sup>m</sup>* and *δ <sup>m</sup>* as follows:

$$\mathbb{Q}'\_{n} = \left\{ q\_{m} | q\_{m} \in Q\_{n}, \delta\_{m} \le \mathfrak{e}\_{m}, \delta'\_{m} \le \mathfrak{e}'\_{m} \right\},\tag{8}$$

where  *<sup>m</sup>* and  *<sup>m</sup>* are two thresholds.

**Figure 3.** An illustration of the mathematical notation for candidate set filtering.

#### 3.2.3. Waymark Point Selection

Once the filtered candidate set of waymark points, Q *n*, has been identified, the algorithm can proceed to make a selection for the next waymark point. As each waymark point *qm* = *λ*¯ *<sup>m</sup>*, *φ*¯*m*, ¯ *θm*, *ρ*¯*<sup>m</sup>* represents a cluster of key points, a higher value of *ρ*¯*<sup>m</sup>* means that the corresponding grid area has been visited more frequently. Hence, it is reasonable to allocate more weight to such waymark points in the selection. Additionally, the expected trajectory trend, i.e., *<sup>d</sup>curn* , is expected to be close to the heading from *pn* to *qm*. Based on these considerations, we propose a weighted distance to account for both the frequency that a waymark point is visited and the angular change from *pn* to *qm*. Specifically, the weighted distance between the current point *pn* and a candidate waymark point *qm* ∈ Q *<sup>n</sup>* can be calculated as:

$$\Delta Dis(p\_n, q\_m) = \kappa \sqrt{\rho\_{\max} / \vec{\rho}\_m} + (1 - \kappa) \Delta\_{\vec{d}\_{\text{cur}\_n, \vec{d}\_{\text{par}, \text{@m}}}} \tag{9}$$

where <sup>Δ</sup>*dcurn* ,*dpn*,*qm* is the angular change defined in (3), and *κ* ∈ (0, 1) is the weight. Usually, <sup>Δ</sup>*dcurn* ,*dpn*,*qm* is much larger than /*ρmax*/*ρ*¯*m*; thus, *κ* can be set close to 1 to better balance the two terms. Parameter *ρmax* = max{*ρ*¯*m*|*qm* ∈ Q *<sup>n</sup>*} is the maximum density of all the waymark points in the refined candidate set Q *<sup>n</sup>*. Based on the weighted distance, the next waymark point is then selected as the one with the smallest weighted distance, i.e.,

$$p\_{n+1} = \arg\min\_{q\_m \in \mathbb{Q}\_n'} Dis(p\_{n\prime}q\_m). \tag{10}$$

The weighted distance defined in (9) does not consider the geo-distance between two waymark points. This is due to the fact that the refined candidate set Q *<sup>n</sup>* ⊆ Q*<sup>n</sup>* contains waymark points that are sufficiently close to the current reference point, as shown in (4). In this case, it is sufficient to focus on the angular and density differences to identify the next waymark point. The proposed trajectory prediction algorithm is finally summarized in Algorithm 1:

#### **Algorithm 1** Trajectory prediction algorithm

```
Require: pDES, T0 = {..., p−1, p0}, Q = {q1, q2,..., qM}
Parameters: εDES, ελ, εφ, εδk , εδ
                                  k
                                   , κ, 
                                        m, 

                                           m
Initialization: T = {p0}
    while dist(pn, pDES) ≥ εDES do
       Qn = {

                λ¯ m, φ¯m, ¯
                        θm, ρ¯m

                                |λ¯ m ∈ 

                                        λ¯ n − ελ, λ¯ n + ελ

                                                          ,
            φ¯m ∈ 

                   φ¯n − εφ, φ¯n + εφ

                                     }
       Q
         n = {qm|qm ∈ Qn, δm ≤ 
                                    m, δ
                                       m ≤ 

                                              m}
       for all qm in Q
                       n do
           Dis(pn, qm) = κ
                            /ρmax/ρ¯m + (1 − κ)Δdcurn ,dpn,qm
       end for
       pn+1 = arg minqm∈Q
                             n Dis(pn, qm)
       Append pn+1 to T
    end while
    return T
```
#### **4. Experimental Results**

In this section, we present the experimental results that evaluate the performance of the proposed trajectory prediction method. We start this section by introducing the dataset and the pre-processing techniques used to process the raw AIS data. The results obtained from key point clustering are then presented to illustrate the operation of the proposed method, followed by the evaluations of trajectory prediction accuracies.

#### *4.1. Dataset and Pre-Processing*

In the experiment, we evaluate the proposed trajectory prediction method using the historical AIS data provided by the Danish Maritime Authority (DMA) [10]. The raw data contain AIS messages from various types of vessels traveling around Danish waters. The AIS data have 26 fields, including latitude, longitude, COG, destination, etc. In these experiments, we will mainly use latitude, longitude and COG fields for trajectory prediction. A comprehensive list and description of other fields can be found in Appendix A of [23]. The experiment uses the AIS data from oil tankers traveling to Port Skagen during the period between 2017 and 2019. This dataset contains the AIS data for 1278 vessels and has over 40 million AIS records. In this dataset, the data between 1 January 2017 and 30 June 2019 are used to extract the training trajectories and produce the waymark points. The AIS data from oil tankers traveling to the same port from 1 July 2019 to 31 December 2019 are used to test the proposed trajectory prediction algorithm.

The raw data are pre-processed before being used to produce the waymark points. Key steps of the pre-processing include trajectory extraction and integrity check and downsampling. We refer to Appendix B of [23] for a detailed description of the preprocessing steps. In the experiments, a total of 1046 training trajectories were obtained after the pre-processing, and there are 232 trajectories in the test period. Each trajectory undergoes a route to the destination, i.e., Port Skagen.

#### *4.2. Results for Grid-Based Key Point Clustering*

In the training dataset, there are over 550,000 key points. During the key point clustering, the difference in latitude/longitude between the center of the adjacent grid areas is set to 0.05 degrees, while the width of the grid area is 0.05 degrees for both the latitude and the longitude. The parameters for DBSCAN are listed in Table 2.


**Table 2.** Key point clustering: DBSCAN parameter settings.

Figure 4 plots all the key points from the training trajectories, from which the results show that there are some relatively clear routes in the eastern part of the area. In contrast, the routes are much less clear in the western part of the area. After DBSCAN clustering, a total of 6127 waymark points are obtained, which are plotted in Figure 5. Comparing Figure 4 to Figure 5, the results show that although the number of waymark points is only about 1.1% of the key points, they capture all the main route patterns in both the western and the eastern part of the area.

**Figure 4.** All key points in the training trajectories.

**Figure 5.** All waymark points obtained after clustering of the key points.

Figure 6 shows the empirical cumulative distribution (CDF) of the COG standard deviation (STD) before and after clustering. Before applying the clustering algorithm, each COG STD is estimated based on the key points within the same grid area. After the clustering, the COG STD is for the key points within the same cluster. Therefore, as expected and shown in Figure 6, the clustering significantly reduces the STD of the COG. For example, before applying the COG clustering, the 95% STD value is about 67.8◦. This means that for about 5% of the clusters (before applying the COG clustering), the COG spread measured

by the STD is ≥67.8◦. This result suggests that although the grid areas are relatively small, the traveling directions of the tankers within the same small area can differ significantly. In contrast, after applying the clustering, the 95% STD of the COG reduces to 2.2◦. In other words, after clustering, the traveling direction of the cluster becomes more consistent, which is more indicative for the prediction of future positions of the vessel of interest.

**Figure 6.** The empirical distribution of the COG variance before and after grid clustering, showing that the variance of the COG is greatly reduced after clustering.

#### *4.3. Results for Trajectory Prediction*

We now evaluate the performance of the proposed trajectory prediction. The parameters for waymark point connection are set as follows. The distance threshold *εDES* = 20 km, the length and the width of the rectangular region used to identify Q*<sup>n</sup>* were set to *ελ* = 0.2◦, *εϕ* = 0.2◦, and the weight coefficient of the weighted distance is set to *κ* = 10/11. In the filtering waypoint step, we have chosen two more lenient values to filter out impossible points, which are  *<sup>m</sup>* = 180◦ and  *<sup>m</sup>* = 120◦.

Figure 7 shows the number of test samples versus the prediction horizon measured by the number of hours. It can be seen that the test samples contain a mixed of short-term and long-term trajectory prediction tests. About 53.5% of the tests correspond to trajectory predictions in 10 h or more.

**Figure 7.** Number of test samples vs. time horizon of trajectory prediction.

To test the proposed algorithm, we implemented a trajectory prediction algorithm based on graph theory and the Dijkstra algorithm [18] and use it as the baseline. Additionally, we also implemented a straightforward trajectory prediction algorithm that simply selects the training trajectory that most matches the historical segment of the trajectory under prediction. This baseline is termed *best-matched*. Directed Hausdorff distance is adopted to measure the similarity. Figure 8 presents four examples of the trajectory prediction results of the proposed algorithm (the red curves), the graph-based baseline from [18] (the greed curves), and the baseline of the best matched (the orange curves). The ground truth trajectory is also plotted as a reference in each example (the grey solid curves). In the figures, the grey dotted curves are the historical segment of the trajectory prior to the prediction moment. The results show that while the baseline algorithms can provide trajectory predictions fairly close to the ground truth, the smoothness of the predicted trajectories is not good enough. As a comparison, the proposed algorithm can offer more accurate predictions in the sense that the predicted trajectories are visually closer to the true ones and retain the smoothness of realistic trajectories.

From a quantitative perspective, we further evaluate the similarities between the predicated and the ground truth trajectories for the proposed algorithm and the baseline algorithms. To measure the similarity, we consider the DTW distance [24–27] and the Hausdorff distance [28], which are commonly used to measure the similarities of two curves/sequences of data. A smaller value of this metric suggests a higher degree of similarity and thus a more accurate trajectory prediction.

As the area of interest contains different routes to Port Skagen from different directions, to better demonstrate the details of the performance evaluation, we classified the trajectories into eight trajectory clusters, as shown in Figure 9, for the tested data. Table 3 shows the number of test instances for each cluster in the experiment and the corresponding average DTW and HD. Each test instance refers to a time instance of a trajectory in the test dataset. Since each track lasts more than 25 h on average, the number of test instances is significantly larger than the number of trajectories.

**Figure 8.** Examples of trajectory prediction results.

**Figure 9.** An illustration of the trajectories of the 8 clusters in the test dataset.

From Table 3, it can be seen that the proposed algorithm outperforms the baseline algorithms on both the performance metrics and for all the eight trajectory classes. Specifically, there is a 12.08% and 25.18% improvement in the DTW distance compared to the best-history and the graph-based baselines, respectively. For Hausdorff distance, the improvement is 11.05% and 42.04% over the two baselines. It can also be seen that for trajectory clusters 1, 6 and 7, which locate on the eastern side of the sea, the advantages of the proposed algorithm over the graph-based approach are even more significant due to the complexity of the shipping lanes and the extremely narrow passable channels.

To further show the stability of the performance of the algorithms, box plots of the performance metrics of the three algorithms are presented in Figures 10 and 11, where the orange line indicates the median, and the green triangle is the location of the mean. It can be seen that for both metrics, the proposed algorithm has a lower median and 75% quantile of the mean than the baseline algorithm, demonstrating higher robustness.


**Table 3.** Performance comparison between different trajectory prediction algorithms based on DTW and Hausdorff distance.

**Figure 10.** Boxplot of the DTW distance obtained for the proposed algorithm and the two baselines.

**Figure 11.** Boxplot of the Hausdorff distance obtained for the proposed algorithm and the two baselines.

### **5. Conclusions and Discussion**

In this paper, a long-term trajectory prediction algorithm has been developed for oil tankers traveling to a known destination port. The proposed algorithm utilizes key points from historical training trajectories that are extracted from historical AIS data in an area of interest. The algorithm then applies the DBSCAN clustering algorithm to generate a set of waymark points that are much fewer than the key points but still retain all the traveling patterns of the oil tankers. Based on the waymark points, a novel path-finding algorithm has been developed that sequentially identifies a set of waymark points to form the predicted trajectory. The proposed algorithm was tested on real AIS data for oil tankers in Danish waters with a fixed destination of port Skagen. Experimental results show that compared to existing trajectory prediction algorithms, including the graph-based and the best-matched approach, the proposed method can achieve more accurate trajectory predictions with higher trajectory smoothness. Specifically, measured by DTW distance and Hausdorff distance, the proposed method offers a reduction of 25.18% and 42.04% over the graph-based baseline and 12.08% and 11.05% over the best-matched baseline, respectively.

While the proposed algorithm has been tested using real AIS data, the tested scenario is relatively limited. Further testings with AIS data from different areas, on different types of vessels, and to different destinations are needed to obtain a more comprehensive evaluation of the performance of the algorithm. Additionally, the focus of the current work was only on the prediction of the trajectory path. Further effort is needed to match the positions of the predicted trajectory to timestamps. In this direction, a combination of the current algorithm with machine learning techniques may be needed to better capture the motion dynamics of vessels in different areas.

**Author Contributions:** X.X.: onceptualization, methodology, formal analysis, software, visualization, writing—original draft. C.L.: conceptualization, methodology, formal analysis, visualization, supervision, writing—original draft, writing—review and editing, funding acquisition. J.L.: conceptualization, methodology, visualization, validation, supervision, writing—review and editing, funding acquisition. Y.M.: conceptualization, methodology, visualization, supervision, writing—review and editing. L.Z.: conceptualization, methodology, visualization, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was jointly supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LZ22F010001) and the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK229909299001-013).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The raw AIS data are available at "https://dma.dk/safety-at-sea/ navigational-information/ais-data" (accessed on 6 June 2023). The processed data that support the findings of this study are available on request from the corresponding authors.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
