**3. Methodology**

In this section, we describe the proposed long-term vessel trajectory prediction algorithm. Figure 1 presents a flow chart of the proposed algorithm. As shown in the figure, the algorithm has the following two modules: the key point clustering module and the trajectory finding module. In the remaining part of this section, we present the details of the two modules.

**Figure 1.** Flow chart of the trajectory prediction algorithm; (1) pre-processed original trajectory is used in key point clustering, (2) filter the candidate set and calculate weighted distance, (3) select the point with the smallest weighted distance as the next location point, (4) calculate the distance from the current location to the destination. If it is less than the threshold value, continue to execute the key point connection step; otherwise, output the predicted trajectory to end the prediction process.

#### *3.1. Key Point Clustering*

The key point clustering module aims to extract from pre-processed training trajectories a set of waymark points that are representative of the traveling patterns of the oil tankers in an area of interest. Instead of keeping all the raw AIS data points in the training trajectories, each trajectory is pre-processed to remove redundant data and to reduce the number of points retained in the trajectory. (A detail description of data pre-processing will be presented in Section 4.1 along with the experiments.) The points retained in the training trajectories are called the *key* points. The key point clustering module treats all key points (from all different training trajectories) as the input and produces a set of *waymark* points. As a result of clustering, each waymark point not only reflects the Lon and Lat of tankers passing through the vicinity of the corresponding cluster, but also contains information such as the COG to better capture the motion of tankers in the corresponding area.

Each predicted trajectory consists of a number of waymark points. Hence, to achieve accurate predictions, there should be a sufficient number of waymark points that can comprehensively capture all possible routes in the area of interest. Additionally, waymark points that are close in space should have distinguishing features related to the motion of the vessels to make clear selections among the waymark points. To fulfill these purposes, we adopt a COG-based clustering process to generate the waymark points [23].

To facilitate the description of the clustering process, we use *λk*, *φ<sup>k</sup>* to represent the Lon and Lat of the key point *p*˜*k*, respectively, and *θ<sup>k</sup>* the COG of *p*˜*k*. Hence, each key point is represented by *p*˜*<sup>k</sup>* = (*λk*, *φk*, *θk*) . Let D = {*p*˜1, *p*˜2, *p*˜3,..., *p*˜*K*} be the set of key points in the training trajectories. The core idea of the clustering algorithm is first to discretize the entire area of interest into a set of grids and then run DBSCAN for each grid to cluster the key points falling into the corresponding area. Specifically, DBSCAN is used to cluster the key points within the grid area based only on the COG to extract representative traveling directions. Each cluster identified by DBSCAN forms a waymark point *qm*, which contains four-dimensional information, including the average Lon *λ*¯ *<sup>m</sup>* and average Lat *φ*¯*<sup>m</sup>* of all the key points in the corresponding cluster, the median COG of the key points, i.e., ¯ *θm*, and the number of key points *ρ*¯*<sup>m</sup>* in the cluster, i.e., *qm* = *λ*¯ *<sup>m</sup>*, *φ*¯*m*, ¯ *θm*, *ρ*¯*<sup>m</sup>* .

Each of the grid areas can be represented as:

$$G\_{i,j} = \{ (\lambda, \phi) | \lambda \in (\bar{\lambda}\_i - \delta\_{\lambda'} \bar{\lambda}\_i + \delta\_{\lambda}), \phi \in (\bar{\phi}\_j - \delta\_{\phi'} \bar{\phi}\_j + \delta\_{\phi}) \},\tag{1}$$

where *λ*¯ *<sup>i</sup>* and *φ*¯*<sup>j</sup>* are the Lon and Lat of the center of the grid, and *δλ* and *δφ* are the widths in the longitude and latitude direction. The difference of the center Lon (Lat) between two adjacent grid areas can be made smaller than 2*δφ* (2*δλ*), e.g., *λ*¯ *<sup>i</sup>* <sup>−</sup> *<sup>λ</sup>*¯ *<sup>i</sup>*+<sup>1</sup> <sup>&</sup>lt; <sup>2</sup>*δφ*. In such cases, two adjacent grid areas may partially overlap with each other, and the key points falling into the overlapped areas are used multiple times during clustering. Such a setup makes it easier to obtain a representative set of waymark points from non-uniformly distributed key points by using a relatively large grid width.

It is noted that a customized distance metric is adopted when applying DBSCAN for COG clustering. Specifically, suppose *θ<sup>i</sup>* ∈ [0◦, 360◦) and *θ<sup>j</sup>* ∈ [0◦, 360◦) are two COGs. Then, the corresponding DBSCAN distance metric is calculated as |*θ<sup>i</sup>* − *θj*| if |*θ<sup>i</sup>* − *θj*| < 180◦ and 360 − |*θ<sup>i</sup>* − *θj*| otherwise. This customized metric is to avoid the errors caused by COGs close to the north, i.e., COG values close to 0◦ or 360◦. Additionally, it is noted that the Lon and Lat are not used in the clustering process. This is due to the fact that the COG clustering is performed for each grid area, where the key points have relatively close values of Lon and Lat.

As an illustrative example, Figure 2 presents the key points obtained from the trajectories of the oil tankers passing through the shown area between 2017 to 2019. It can be seen that there are areas where the key points are sparse, while there are also areas where the key points are much denser. It is known that DBSCAN requires one to specify a value on the minimum number of points in a cluster so the algorithm can operate. If there are less key points than this value, then all points in the grid area will be classified as noise, and no

cluster can be formed. Therefore, in a sparse area, the width of the grid area has to be set sufficiently large. However, if overlapping is not permitted, then a large width will lead to a reduced number of waymark points, which will make the representation of the traveling patterns less accurate.

**Figure 2.** Key points obtained from oil tankers' trajectories between 2017 to 2019.

### *3.2. Trajectory Finding Based on Waymark Points*

Without loss of generality, suppose a prediction of the future trajectory is made at time instant 0, where the historical trajectory consisted of the set of previous location points {*p*−*<sup>I</sup>* ... , *p*−2, *p*−1, *p*0}, and *pn* = (*λn*, *φn*) contains the Lat and Lon of the *n*-th position. Denote Q = {*q*1, *q*2, ... , *qM*} as the set of all waymark points obtained from the key point clustering. The task of the trajectory prediction is to select from Q a set of *N* points to produce the predicted trajectory *T* = {*p*1, ... , *pn*, ... , *pN*, *pDES*}, where *pDES* is the destination of interest. This process is implemented in an iterative way by sequentially identifying the *N* waymark points. Note that the number of point *N* is determined by the algorithm automatically and is not a preset value.

In the *n*-th iteration, the algorithm first identifies a candidate waymark point set Q*<sup>n</sup>* ⊆ Q based on *Tn* = {*p*−*<sup>I</sup>* ... , *p*0, ... , *pn*}. The candidate set Q*<sup>n</sup>* is then filtered to generate a refined candidate set Q *<sup>n</sup>*, from which the predicted point *pn*+<sup>1</sup> is chosen based on a weighted distance to be explained later in this section. In this process, the Lat, Lon and the information of direction and point density are all utilized.

For ease of exposition, we introduce the following notation:


$$LDM\{\vec{a}\_1, \vec{a}\_2, \dots, \vec{a}\_m\} = \angle\left(\frac{\vec{a}\_1 + \vec{a}\_2 + \dots + \vec{a}\_m}{m}\right),\tag{2}$$

where*ai*, *i* = 1, . . . , *m* is a unit vector from the origin.

• Δ*a*, *<sup>b</sup>* : the angular change from direction*<sup>a</sup>* to direction*b*, measured in degrees. <sup>Δ</sup>*a*, *<sup>b</sup>* is calculated as :

$$
\Delta\_{\vec{a},\vec{b}} = \begin{cases}
360^\circ - \Delta & \Delta \ge 180^\circ \\
\Delta & \Delta \le 180^\circ
\end{cases} \tag{3}
$$

where Δ = ∠*<sup>a</sup>* <sup>−</sup> <sup>∠</sup>*<sup>b</sup>* . Here, angle <sup>∠</sup>*a*, <sup>∠</sup>*<sup>b</sup>* both take value in [0◦, 360◦), where 0◦ corresponds to the north.

At the *n*-th iteration, the distance from *pn* to *pDES* is compared against a pre-determined threshold *εDES*. If

$$dist(p\_{n\prime}, p\_{DES}) < \varepsilon\_{DES}$$

i.e, the distance to the destination is sufficiently small, then the iterative process terminates, and *pDES* is appended to *Tn* to form the completed trajectory prediction. Otherwise, the algorithm continues to identify *pn*+1. The threshold *εDES* needs to be chosen with care, as setting it too large can lead to premature terminations of the trajectory finding, while setting it too small may make the prediction susceptible to interference from other routes in the areas of high concentrations of multiple routes towards different directions, e.g., the areas near a port. We recommend to set *εDES* to be the distance from the destination from which various routes begin to diverge.
