A Study of Multi-Step Sparse Vessel Trajectory Restoration Based on Feature Correlation

Ye, Lin; Chen, Xiaohui; Liu, Haiyan; Zhang, Ran; Li, Jia; Lu, Chuanwei; Zhao, Yunpeng

doi:10.3390/app14104057


Institute of Data and Target Engineering, Information Engineering University, Zhengzhou 450000, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(10), 4057; https://doi.org/10.3390/app14104057

Submission received: 21 January 2024 / Revised: 6 May 2024 / Accepted: 8 May 2024 / Published: 10 May 2024


Abstract

To address the data integrity and reliability problems caused by sparse vessel trajectory data, this paper proposes a multi-step restoration method for sparse vessel trajectories based on feature correlation. First, we preserve the overall trend of the trajectory by detecting and marking sparse and abnormal trajectory points and applying cubic spline interpolation for preliminary restoration. Then, we establish a composite indicator of feature correlation to select highly correlated trajectory features as model inputs, reducing data redundancy while retaining the key correlations between trajectory features. Finally, we develop a vessel trajectory restoration model based on the Seq2Seq network for secondary restoration of the trajectory. Comparison and ablation experiments show that the method efficiently extracts highly correlated features from vessel trajectories and combines the advantages of interpolation and neural network models to improve the accuracy of trajectory restoration and the integrity of trajectory data. The results could provide technical support for the subsequent mining of vessel behavior patterns and for assisted decision-making, and they have clear application potential.

Keywords:

sparse trajectory restoration; feature correlation indicator construction; trajectory sparsity and anomaly detection; Seq2Seq neural network

1. Introduction

With the rapid growth of international trade and global logistics needs, the maritime industry has become an important force driving the global economy. However, the development of the maritime industry has also brought about a series of related issues, such as maritime safety management, marine environmental protection, and other challenges [1,2,3]. In this context, a large amount of automatic identification system (AIS) data provides valuable support for related research. AIS is a system that records real-time information about a vessel’s position, speed, heading, and other key attributes through satellites and terrestrial base stations. By analyzing AIS data, we can mine the spatial and temporal distribution of ships, route networks, and sailing patterns. These provide decision support for optimizing maritime logistics, improving maritime security, and saving energy [4,5,6,7,8].

However, AIS data often contain missing and abnormal records, which stem mainly from inconsistent data collection intervals, limitations in data transmission and storage, and sensor failures. Missing and abnormal records leave the vessel trajectory information incomplete, which makes the AIS data sparse and anomalous. This degrades AIS data quality and complicates vessel feature mining and navigation pattern analysis [9]. Many studies have sought effective methods to control AIS data quality. Such methods mainly reduce sparse and abnormal trajectories by restoring the completeness and accuracy of the trajectory data, and they provide a high-quality database for subsequent tasks [10,11].

Currently, the main approaches to AIS data restoration are traditional mathematical modeling methods and data-driven methods. Traditional mathematical modeling methods restore trajectory data by building mathematical models from a statistical point of view. However, when applied to large volumes of AIS data, they struggle to capture the complex relationships between data dimensions because of the high dimensionality of the data, and they are sensitive to missing and abnormal data. Processing also incurs high computational complexity and serious loss of information, which reduces the accuracy of trajectory restoration. Data-driven methods instead learn and extract key features automatically from the data and capture the complex relationships between AIS data dimensions by building models. Although these methods are more adaptive, when data quality is low, machine learning methods struggle to extract complete information, and the restored trajectories fall short of expectations. When basic deep learning networks are given multi-dimensional data, redundant input features degrade model accuracy. Therefore, to process AIS data more effectively and improve the accuracy of trajectory restoration, it is necessary to first process sparse and abnormal vessel trajectory data and then construct a restoration model whose input features are selected for their high correlation with the features to be restored.

To achieve this goal, this paper proposes a multi-step sparse vessel trajectory restoration method based on feature correlation, which attempts to solve the challenges brought about by the sparseness and anomaly of AIS data for subsequent trajectory mining. First, trajectory restoration points are identified through sparse and anomalous vessel trajectory data detection, and interpolation restoration methods are utilized to populate these, thus initially ensuring the coherence of the input data. Secondly, a composite indicator of feature correlation is constructed through the correlation analysis method, which is used to control the input features, from which the most correlated features are selected as inputs to the model to ensure that the critical information can still be captured in the case of sparse data. Finally, the trajectory restoration model is constructed based on the sequence to sequence (Seq2Seq) network, and the vessel trajectory restoration is achieved by training the network to improve the data accuracy. This method has the potential to solve the problem of sparse data and provides new ideas for the field of vessel trajectory restoration.

2. Related Work

To ensure the quality and reliability of AIS data, scholars have conducted in-depth research on key issues such as erroneous data identification, abnormal data detection, and missing trajectory restoration of AIS data. Vehicle trajectory restoration is affected by road network constraints and other factors such as the relationships between other vehicles [12,13,14,15]. In general, vessel trajectory restoration only considers its trajectory data for trajectory restoration. When the starting position of the vessel and the fixed course are known, the trajectory matching technique is indeed a reliable choice to restore the vessel trajectory [16,17]. However, in the real navigation environment, the trajectory of a vessel often deviates from the intended course due to a variety of factors, which requires more general and flexible methods for trajectory restoration.

By investigating related studies, this paper summarizes two approaches: traditional mathematical modeling methods and data-driven methods.

2.1. Traditional Mathematical Modeling Methods

Traditional methods for vessel trajectory data restoration rely on an in-depth understanding of the data and accurate modeling; they detect abnormal points and repair sparse trajectories by building mathematical equations and models that describe the vessel’s operation. These methods usually establish a functional relationship between latitude, longitude, and time and use interpolation methods to build the restoration model.

To reconstruct vessel trajectories in different sailing states in a port, Zhang et al. [18] quantitatively analyzed the quality of AIS data and constructed a multi-state vessel trajectory reconstruction model based on linear regression and spline interpolation, which fits and reconstructs vessel trajectories after identifying the sailing states of vessels in the port area. Qin et al. [19] observed that earlier algorithms ignore the dynamic information of vessel operation and therefore repair curved sections of a ship trajectory poorly, resulting in low accuracy. They improved the linear interpolation algorithm and proposed an iterative trajectory repair algorithm based on two-way iteration and weighted averaging, which takes the dynamic information of the ship into account, improving restoration accuracy while reducing execution time. Zhang et al. [20] focused on single-ship AIS data and proposed a trajectory anomaly identification and repair method that uses the parameter information of the AIS data itself to identify anomalies and then repairs them by cubic spline interpolation, which eliminates abrupt changes in the parameters and makes the repaired trajectory smoother. In addition, Zhang et al. [21] proposed a vector analysis-based method for vessel trajectory anomaly detection and track repair, which analyzes the characteristics of raw AIS data to establish trajectory basis vectors, uses these vectors to classify vessel trajectories into line-type categories, and finally reconstructs each trajectory with the interpolation method suited to its category.

Although traditional methods can achieve good restoration results, when there are too many missing and abnormal trajectory points or the navigation environment is complex, these methods simply screen out the sparse and abnormal trajectory points, which makes it difficult to achieve high restoration accuracy. Therefore, instead of discarding sparse and abnormal trajectory points, it is necessary to detect and label them and to correct the trajectories by interpolation, which improves the restoration accuracy.

2.2. Data-Driven Methods

With the rapid development of computer technology, data-driven methods represented by machine learning and deep learning have been widely used in maritime data processing. Compared with traditional mathematical modeling methods, data-driven methods rely on the computing power provided by big data and artificial intelligence technology. A vessel trajectory restoration model is built on an artificial neural network to mine the patterns and rules of the data itself, and the contextual information of the trajectory features is extracted for trajectory restoration, improving the overall quality of the AIS data.

Li et al. [22] proposed an inland river trajectory completion method based on a least squares support vector machine (LS-SVM), exploiting the spatio-temporal similarity of inland river vessel trajectories. After synchronizing the AIS trajectory data, the method finds the most similar trajectories by quadratic matching and uses the LS-SVM model to repair the incomplete trajectory data. Liu et al. [23] addressed AIS data loss by building a joint model based on a back propagation (BP) neural network and piecewise cubic Hermite interpolation to repair and predict trajectory data, and found by comparison that the joint model is more effective in repairing and predicting vessel trajectories. Chen et al. [24] proposed a trajectory reconstruction framework integrating data quality control and preprocessing to solve the problem of noise pollution in AIS data. They first designed a data quality control framework to deal with abnormal trajectory data and then used an artificial neural network (ANN) for trajectory prediction, but this method does not apply to complex navigational environments. Recurrent neural networks handle time series data more effectively. Zhong et al. [25] improved on Liu’s algorithm and proposed an inland river trajectory repair method based on bi-directional long short-term memory recurrent neural networks (BLSTM-RNNs), which compensates for the limited repair capability of linear methods and can be applied to vessel trajectory repair scenarios with complex geometries and multiple missing points.

In existing research on large volumes of sparse and abnormal trajectory data, the conventional approach is to identify and remove abnormal data and then restore the missing trajectory with RNN-variant neural networks. However, these methods often neglect the potential value of abnormal trajectory points and the correlation between trajectory features, so they cannot fully use the intrinsic information of the trajectory data, and the models remain simple, which limits the restoration quality. Therefore, it is necessary to use correlation analysis to construct a composite indicator of feature correlation, determine the effective input features of the model accurately, and then repair the vessel trajectory with a restoration model built on the Seq2Seq network to obtain a more accurate and complete vessel trajectory.

3. Trajectory Data Sparsity and Anomaly Detection

To ensure the accuracy and reliability of AIS data and to meet the needs of subsequent analysis and application, it is crucial to preprocess the AIS data and to detect and handle sparse and abnormal points. This study therefore first preprocesses the AIS data to remove invalid and irrelevant data and divides the trajectories into segments based on sailing time. It then develops a sparsity and anomaly evaluation method to identify and mark the trajectory points to be restored and finally applies cubic spline interpolation for the initial sparse trajectory restoration. The overall process is shown in Figure 1.

3.1. Data Preprocessing

Data preprocessing mainly contains two parts: data cleaning and trajectory extraction. First, by deleting invalid data, the erroneous, invalid, or redundant values of AIS trajectory points can be effectively eliminated to improve the reliability of trajectory data. Second, based on extracting different vessel trajectory data, non-contiguous vessel trajectories are removed to ensure data availability for subsequent studies.

3.1.1. Invalid Data Deletion

Maritime Mobile Service Identity (MMSI) is a unique 9-digit code recorded in the AIS static data to identify a vessel; records whose MMSI is not a 9-digit code are deleted from the AIS data as invalid. In addition, the same trajectory point of a vessel may be recorded several times during a voyage. Therefore, when Equation (1) is satisfied, one of the duplicate track points is deleted, which reduces redundancy and keeps the trajectory data accurate and compact.

$$T_{i} = T_{j} \ \&\ {Lat}_{i} = {Lat}_{j} \ \&\ {Lon}_{i} = {Lon}_{j} \tag{1}$$

where $i$ and $j$ denote the $i$th and $j$th trajectory points of the same vessel.
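The two cleaning rules above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the record layout (a dict with the keys `mmsi`, `t`, `lat`, and `lon`) is a hypothetical choice made for the example.

```python
def clean_ais(records):
    """Drop records with a non-9-digit MMSI, then drop exact repeats.

    Each record is a dict with hypothetical keys: mmsi, t, lat, lon.
    """
    seen = set()
    cleaned = []
    for rec in records:
        mmsi = str(rec["mmsi"])
        if len(mmsi) != 9 or not mmsi.isdigit():
            continue  # invalid MMSI: not a 9-digit code
        # Equation (1): same timestamp, latitude, and longitude for one vessel
        key = (mmsi, rec["t"], rec["lat"], rec["lon"])
        if key in seen:
            continue  # duplicate of a point already kept
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Keeping the first occurrence of each duplicate preserves the original record order.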

3.1.2. Trajectory Segmentation

After preprocessing the trajectory data, this study found that the time intervals between some of the neighboring trajectory points were too long. To avoid the influence of AIS data discontinuity on the subsequent analysis, this paper utilized the time threshold method to segment the trajectory. First, the data of a vessel were sorted into chronological order and the timestamp intervals between trajectory points were calculated. Secondly, a period of 1800 s was selected as the time threshold according to expert experience, all timestamp intervals were traversed, and the trajectory points with time intervals larger than the time threshold were taken as cutting points for segmenting the trajectory. Finally, track segments with less than 50 track points were deleted to improve data availability.
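A minimal sketch of the time-threshold segmentation described above, assuming each point is a `(timestamp_seconds, lat, lon)` tuple (a hypothetical layout). The 1800 s threshold and the 50-point minimum are the values stated in the text.

```python
def segment_trajectory(points, gap_s=1800, min_points=50):
    """Split one vessel's points wherever the time gap exceeds gap_s.

    points: list of (timestamp_seconds, lat, lon) tuples (hypothetical layout).
    """
    points = sorted(points, key=lambda p: p[0])  # chronological order
    segments, current = [], []
    for p in points:
        # A gap larger than the threshold closes the current segment
        if current and p[0] - current[-1][0] > gap_s:
            segments.append(current)
            current = []
        current.append(p)
    if current:
        segments.append(current)
    # Discard segments with fewer than min_points points
    return [s for s in segments if len(s) >= min_points]
```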

3.2. Sparse and Abnormal Trajectory Points Detection

The sparseness and anomalies of the AIS trajectory data originate from a variety of factors, such as receiving equipment failures and data transmission interference, which all affect the authenticity of the AIS data. However, the subsequent research in trajectory analysis, mining, and modeling usually requires an accurate and reliable trajectory database. Therefore, to ensure the reliability and validity of the data, this paper develops a detection method to detect and mark the sparse and abnormal points in the trajectory data, which provides support for the subsequent trajectory restoration.

3.2.1. Detection Method Based on Time Elements

In a single voyage, the AIS acquisition frequency may vary, resulting in inconsistent time intervals between trajectory points. This inconsistency causes two problems: a larger time interval makes the trajectory sparse and loses critical vessel trajectory information, whereas a smaller time interval leads to data redundancy. To ensure the integrity and continuity of the trajectory data and eliminate the potential impact of time interval anomalies, this study handles time intervals as follows. First, the distribution of time intervals in the trajectory data was counted, as shown in Figure 2; intervals within 180 s accounted for more than 90% of the total. Based on this finding, 180 s was chosen as the interpolation time. Then, the time difference $\Delta t$ between two adjacent points is checked. If it is greater than twice the interpolation time, i.e., greater than 360 s, the trajectory segment is considered sparse, i.e., data are missing; it is marked for restoration, and the number of restoration points is $\mathrm{int}(\Delta t / 180)$.
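The time-based rule can be sketched as follows. The function name and return layout are illustrative assumptions; the count of restoration points follows the int(Δt / 180) formula exactly as stated in the text.

```python
def mark_time_gaps(timestamps, step_s=180):
    """Return (index, n_restore) for each gap longer than 2 * step_s.

    index is the position of the point that precedes the gap;
    timestamps are assumed to be sorted and given in seconds.
    """
    marks = []
    for i in range(len(timestamps) - 1):
        dt = timestamps[i + 1] - timestamps[i]
        if dt > 2 * step_s:  # sparse segment: more than 360 s by default
            marks.append((i, int(dt / step_s)))
    return marks
```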

3.2.2. Detection Method Based on Speed Elements

During the actual voyage of the vessel, according to different marine environments, there is often a limitation on the speed over ground (SOG) of the vessel, and it is necessary to utilize the AIS data to count the actual SOG distribution of the vessel and identify abnormal data through statistical analysis. This research focuses on restoring vessel trajectories under voyage status and does not consider vessel trajectories under anchored status. Therefore, we screened out the data with a sailing speed of less than 1 knot. The distribution of SOG obtained from the statistical dataset is shown in Figure 3, and based on this distribution, the SOG threshold is set to 24 knots, i.e., when the SOG is greater than 24 knots, it is inferred that there are data anomalies in this section of the trajectory, and it is marked as a trajectory restoration point.
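A minimal sketch of the speed-based screening, assuming SOG values in knots and reading "screened out" as removal of near-stationary points. The function name and return layout are hypothetical.

```python
def mark_speed_anomalies(sog_knots, min_sog=1.0, max_sog=24.0):
    """Drop near-stationary points and flag points above the SOG threshold.

    Returns (kept_indices, flagged_indices) into the original list.
    """
    # Points slower than min_sog are treated as anchored and removed
    kept = [i for i, v in enumerate(sog_knots) if v >= min_sog]
    # Points faster than max_sog are marked as trajectory restoration points
    flagged = [i for i in kept if sog_knots[i] > max_sog]
    return kept, flagged
```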

3.2.3. Detection Method Based on Location Elements

The original AIS data often contain drift and error points, which lead to abnormal vessel trajectories. To obtain accurate vessel trajectories, abnormal position points must be screened and repaired. First, Equation (2) is used to determine whether a trajectory point lies within the study area; points outside the study area are regarded as abnormal data and excluded. Then, the maximum theoretical sailing distance between neighboring trajectory points is calculated from the maximum SOG and compared with the actual distance, as in Equation (3). If the actual distance is greater than the theoretical distance, the point is judged abnormal and marked as a trajectory restoration point; otherwise, it is a normal point.

$$\begin{cases} -84.0 \leq {lon}_{i} \leq -79.5 \\ 23.0 \leq {lat}_{i} \leq 24.5 \end{cases} \tag{2}$$

$$\begin{cases} D_{act} = 2 r \arcsin \sqrt{\sin^{2} \left(\frac{{lat}_{i} - {lat}_{i + 1}}{2}\right) + \cos ({lat}_{i}) \cos ({lat}_{i + 1}) \sin^{2} \left(\frac{{lon}_{i} - {lon}_{i + 1}}{2}\right)} \\ D_{theory} = V_{\max} \times |t_{i} - t_{i + 1}| \\ V_{\max} = 1.852 \times {SOG}_{\max} \end{cases} \tag{3}$$

where $D_{act}$ is the actual distance, $D_{theory}$ is the maximum theoretical sailing distance, $r$ is the radius of the Earth, $V_{\max}$ is the maximum velocity, ${SOG}_{\max}$ is the SOG threshold, and $t$ is the trajectory point timestamp.
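Equations (2) and (3) can be sketched as below. Several details are assumptions rather than statements from the paper: coordinates are in degrees, timestamps are in seconds, the Earth radius is taken as 6371 km, and distances are in kilometres (the factor 1.852 converts knots to km/h, so the time difference is converted to hours).

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; not stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance D_act of Equation (3), inputs in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p1 - p2
    dlon = math.radians(lon1 - lon2)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_study_area(lat, lon):
    """Bounding box of Equation (2)."""
    return -84.0 <= lon <= -79.5 and 23.0 <= lat <= 24.5


def is_position_anomaly(p, q, sog_max_knots=24.0):
    """p, q: (t_seconds, lat, lon). True if the hop exceeds D_theory."""
    v_max_kmh = 1.852 * sog_max_knots        # knots -> km/h
    hours = abs(p[0] - q[0]) / 3600.0        # assumed: timestamps in seconds
    d_theory = v_max_kmh * hours
    d_act = haversine_km(p[1], p[2], q[1], q[2])
    return d_act > d_theory
```

With the 24-knot threshold, a vessel can cover about 2.2 km in 180 s, so a jump of one degree of longitude between consecutive reports is flagged.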

Based on the above three types of detection methods, sparse and anomaly detection is carried out on the trajectory data and the number and location of trajectory restoration points are accurately identified and marked, which provides the necessary support for the next step of preliminary trajectory restoration work.

3.3. Trajectory Preliminary Restoration

Trajectory restoration aims to enhance the integrity and reliability of trajectory data. Among the many methods for trajectory restoration, the interpolation method is widely applied, especially the cubic spline interpolation method. This method constructs a smooth continuous function curve by fitting the given scatter data, and its advantage is that it can retain the trend features of the original data to more accurately restore the real change trend of the trajectory data. In addition, it is characterized by high interpolation accuracy and good smoothness, making it suitable for processing complex trajectory data with preliminary restoration.

In this study, after the location and number of trajectory restoration points have been identified, the preliminary trajectory restoration is carried out by cubic spline interpolation. The restoration process is as follows. Assume that a vessel trajectory is $Traj = \{{point}_{1}, \dots, {\bar{point}}_{i}, \dots, {point}_{i + l}, \dots, {\bar{point}}_{i + m}, \dots, {point}_{n}\}$, where ${point}_{i + l}$ represents a certain trajectory point in the vessel trajectory and ${\bar{point}}_{i}$ is a restoration point marked by the trajectory abnormality judgment. Through cubic spline interpolation, the marked trajectory restoration point is restored as ${\hat{point}}_{i}$. Finally, the preliminary restoration of the vessel trajectory, $\hat{Traj} = \{{point}_{1}, \dots, {\hat{point}}_{i}, \dots, {point}_{i + l}, \dots, {\hat{point}}_{i + m}, \dots, {point}_{n}\}$, is obtained.
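The preliminary restoration step can be illustrated with a small pure-Python cubic spline. This is a sketch under assumptions the text does not state: natural boundary conditions (zero second derivative at both ends), time as the spline parameter, and latitude and longitude interpolated independently. The helper `restore_points` and its argument layout are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least three knots.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients
    diag, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution gives the per-interval polynomial coefficients
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamped to the valid range
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


def restore_points(times, lats, lons, restore_times):
    """Interpolate (t, lat, lon) at the marked restoration timestamps."""
    lat_f = natural_cubic_spline(times, lats)
    lon_f = natural_cubic_spline(times, lons)
    return [(t, lat_f(t), lon_f(t)) for t in restore_times]
```

Only the unmarked points should be passed as knots, so that anomalous positions do not influence the fitted curve.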

4. Trajectory Feature Extraction and Restoration Model Construction

Ensuring the accuracy and reliability of trajectory data is a key part of trajectory analysis. The interpolation method can repair sparse trajectories by completing missing values and correcting abnormal data, which improves the quality of trajectory data to a certain extent. However, because sparse trajectories contain runs of consecutive missing points, interpolation cannot completely and accurately reflect the changes in trajectory state. Therefore, to guarantee restoration accuracy, it is still necessary to carry out sparse trajectory restoration with the help of neural networks. With their learning and pattern recognition ability, neural networks can capture the complex relationships and nonlinear rules in the trajectory data and further optimize the repaired data, improving its completeness and continuity. This secondary repair fills in missing data and corrects abnormal data more effectively and yields more reliable trajectory data.

The AIS data contain a large amount of multi-dimensional information about the vessel. If all the data are input into the model without filtering, the redundant information in the input features increases and the restoration quality falls. Therefore, to improve the accuracy and efficiency of trajectory restoration, it is crucial to use correlation analysis to examine the intrinsic relationships between the features of the data. By analyzing the correlation of each feature in the AIS data, the features closely related to the trajectory restoration task are identified, which avoids redundant input data and improves the effectiveness and computational efficiency of the model.

The overall framework of this study is shown in Figure 4. First, the preliminarily restored trajectory data are divided so that they meet the input requirements of the model. Then, a composite indicator of feature correlation is constructed to determine the dimensions of the input features. Next, based on the correlation coefficient, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are used to select an appropriate step size for the input features. Finally, the neural network model is built to train on the data and output the restored trajectory, and the accuracy of the model and of the trajectory restoration is measured using evaluation metrics.

4.1. Dataset Construction

Different features of AIS data have different scales, and the variability of scales will inevitably have an impact on the final restoration results of the model. Moreover, AIS data contains temporal and spatial information, and this spatial–temporal connectivity lends higher dimensionality and complexity to the data. Therefore, it is necessary to normalize the data before restoration to balance the differences in scale between different features, based on which the sliding window method is used to construct the model dataset to improve the efficiency of data utilization and the accuracy of the model.

4.1.1. Data Feature Normalization

The feature variables of AIS data have different scales, so they are normalized using their maximum and minimum values to eliminate the bias that different scales introduce into the restoration results. The calculation is given in Equation (4):

$$Z_{i}^{'} = \frac{Z_{i} - Z_{\min}}{Z_{\max} - Z_{\min}} \tag{4}$$

where $Z_{i}$ and $Z_{i}^{'}$ represent the feature values before and after normalization, respectively, and $Z_{\max}$ and $Z_{\min}$ represent the maximum and minimum values of the feature, respectively.
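Equation (4) applied to one feature column can be sketched as follows. Mapping a constant column to zeros is an assumption added here to avoid division by zero; the text does not discuss that case.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] as in Equation (4)."""
    z_min, z_max = min(values), max(values)
    span = z_max - z_min
    if span == 0:
        # Assumption: a constant feature carries no scale information
        return [0.0 for _ in values]
    return [(z - z_min) / span for z in values]
```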

4.1.2. Input Sequence Division

The trajectory restoration model performs the secondary restoration of the interpolated trajectory data. To better capture the local patterns of the trajectory data and speed up model training, this study adopts the sliding window method to segment the trajectory data: each trajectory is cut into fixed-length pieces, and the resulting segments are supplied to the model as input sequences. Assume a trajectory containing $m$ trajectory points; the input sequence step size is set to $l$, the output sequence step size is set to $n$, and each slide advances $h$ steps. After processing by the sliding window method, the original continuous trajectory data are transformed into a series of independent input/output sequences, as shown in Figure 5.
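A minimal sketch of the sliding-window split, using the symbols l, n, and h from the text. It assumes that the n output points immediately follow the l input points of each window; the text does not spell out this alignment, so treat it as one possible reading of Figure 5.

```python
def sliding_windows(traj, l, n, h):
    """Return (input, output) pairs from one trajectory.

    Each pair holds l input points followed by n output points,
    and consecutive windows start h points apart.
    """
    m = len(traj)
    pairs = []
    for start in range(0, m - l - n + 1, h):
        pairs.append((traj[start:start + l], traj[start + l:start + l + n]))
    return pairs
```

For example, a 10-point trajectory with l = 4, n = 2, and h = 2 yields three input/output pairs.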

4.2. Input Feature Extraction Based on Trajectory Features Correlation

In AIS data, not every variable can have a positive impact on improving the accuracy of the trajectory restoration model. Therefore, the multidimensional feature processing of AIS data is needed to select highly correlated input features. Correlation analysis and correlation coefficient calculation are commonly used methods to effectively assess the degree of association between features. Through this method, those features that are most correlated with trajectory restoration data can be selected, thus ensuring model efficiency and accuracy and reducing data redundancy. This correlation-based feature selection strategy enhances the understanding of the intrinsic structure of the data and helps to improve the repair performance of the model.

4.2.1. Construction of a Composite Indicator of Feature Correlation

Correlation analysis is a statistical method for assessing the degree of association between trajectory data variables, providing an important reference for the selection of model input features. The principle is to measure the degree of correlation between variables by calculating correlation coefficients. To avoid the bias of a single calculation method, this study constructs a composite evaluation indicator of feature correlation, which combines multiple correlation measures through a weighted sum to comprehensively assess the degree of correlation between trajectory data variables and provides a basis for determining the features to be input into the model.

1. Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (PCC) is a commonly used correlation coefficient that quantitatively assesses the degree of linear correlation between two variables. It is very sensitive to extreme values and accurately reflects the strength of the association between variables. The coefficient ranges from −1 to 1, where −1 means that the two variables are perfectly negatively correlated, 1 means that they are perfectly positively correlated, and 0 means that they are uncorrelated. For two given random variables $X$ and $Y$, Pearson’s correlation coefficient is given by Equation (5):

$$PCC_{(X, Y)} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} {(X_{i} - \bar{X})}^{2}} \sqrt{\sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}}} \tag{5}$$

where $X_{i}$ and $Y_{i}$ denote the values of the two variables, and $\bar{X}$ and $\bar{Y}$ denote the mean values of the two variables, respectively.
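Equation (5) translates directly into code. The sketch below assumes two equal-length sequences with non-zero variance; the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of Equation (5)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)
```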

2. Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient (SRCC) is a statistical tool used to measure the correlation between variables. Unlike PCC, SRCC uses the rank order of the variables to assess their correlation, so it places fewer requirements on the data and is more widely applicable. Like PCC, SRCC takes values between −1 and 1, with larger absolute values indicating a stronger correlation. SRCC is given by Equation (6):

$$SRCC_{(X, Y)} = 1 - \frac{6 \sum_{i = 1}^{N} d_{i}^{2}}{N (N^{2} - 1)} \tag{6}$$

where $d_{i} = X_{i} - Y_{i}$; $X_{i}$ and $Y_{i}$ are the ranks of the two variables in order of magnitude; and $N$ is the sample size.
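A short sketch of Equation (6). The closed form is exact only when there are no tied values, so the helper below assumes distinct values in each variable; how ties are handled in the study is not stated.

```python
def ranks(values):
    """Rank 1..N by ascending value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman's rank correlation of Equation (6)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```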

3. Mutual Information Entropy Coefficient

Information entropy is used to characterize the amount of information in a single random variable. Mutual information entropy (MIE), on the other hand, is used to evaluate the amount of information that the occurrence of one event contributes to the occurrence of another event, measuring the degree of correlation between two random variables

X

and

Y

[26,27]. When the value of MIE is larger, the correlation between two random variables is stronger. In a special case where the value of mutual information is zero, the two random variables are independent of each other. To facilitate the subsequent calculation of the correlation index, this study adopts normalized mutual information entropy (NMIE) to calculate the mutual information between the variables, and the specific calculation Equation (7) is as follows:

\begin{cases} N M I E_{(X, Y)} = 2 \frac{I_{(X, Y)}}{H (X) + H (Y)} \\ I_{(X, Y)} = \sum_{i = 1}^{m} \sum_{j = 1}^{n} P (X_{i}, Y_{j}) \log_{2} \frac{P (X_{i}, Y_{j})}{P (X_{i}) P (Y_{j})} \\ H (X) = - \sum_{i = 1}^{m} P (X_{i}) \log_{2} P (X_{i}) \end{cases}

(7)

where

I_{(X, Y)}

is the mutual information entropy of

X

and

Y

;

H (X)

is the information entropy of

X

; and

P (X_{i})

and

P (Y_{j})

are the marginal probability distributions of

X

and

Y

.
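The NMIE calculation in Equation (7) can be illustrated with the sketch below. The discretization is not specified here, so the sketch assumes equal-width binning with a hypothetical `bins` parameter. It also assumes base-2 logarithms for both the mutual information and the entropies, so that the normalized value lies between 0 and 1.

```python
import math
from collections import Counter


def discretize(values, bins=10):
    """Map continuous values to equal-width bin indices (binning is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy H(X) in bits, estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def nmie(x_labels, y_labels):
    """Normalized mutual information 2 * I(X, Y) / (H(X) + H(Y))."""
    n = len(x_labels)
    px, py = Counter(x_labels), Counter(y_labels)
    pxy = Counter(zip(x_labels, y_labels))
    # Only observed (x, y) pairs contribute; unobserved pairs have zero probability.
    mi = sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )
    denom = entropy(x_labels) + entropy(y_labels)
    return 2 * mi / denom if denom > 0 else 0.0
```

For continuous features such as longitude and SOG, one would first call `discretize` on each series and then pass the resulting labels to `nmie`. The estimate depends on the chosen number of bins.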

This study constructs a composite indicator that comprehensively evaluates the correlation of features through a weighted summation method. The specific calculation Equation (8) is as follows:

ρ = α \times |P C C_{(X, Y)}| + β \times |S R C C_{(X, Y)}| + γ \times |N M I E_{(X, Y)}|

(8)

where

α

,

β

, and

γ

represent the weights of each indicator, respectively.

To balance the differences and contributions between different indicators and ensure the accuracy and reliability of the composite indicators, this study adopts the equal-weight treatment, letting the weight of each indicator take the value of 1/3. To determine the correlation between the variables more precisely, this study set the following criteria for determining the correlation of the variables.

\begin{cases} ρ \geq 0.8, & \text{Extremely high correlation} \\ 0.6 \leq ρ < 0.8, & \text{High correlation} \\ 0.4 \leq ρ < 0.6, & \text{Moderate correlation} \\ 0.2 \leq ρ < 0.4, & \text{Low correlation} \\ ρ < 0.2, & \text{Extremely low correlation} \end{cases}

(9)

Finally, based on the computational results, this study selects the features with higher correlation as inputs to the trajectory restoration model.
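The weighted summation of Equation (8) and the grading criteria of Equation (9) can be sketched as follows. The function names, the example scores, and the inclusive 0.4 selection threshold are illustrative assumptions. The equal weights of 1/3 follow the setting stated above.

```python
def composite_indicator(pcc, srcc, nmie, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the absolute coefficient values (Equation (8))."""
    alpha, beta, gamma = weights
    return alpha * abs(pcc) + beta * abs(srcc) + gamma * abs(nmie)


def correlation_level(rho):
    """Map a composite indicator value to the grades of Equation (9)."""
    if rho >= 0.8:
        return "extremely high"
    if rho >= 0.6:
        return "high"
    if rho >= 0.4:
        return "moderate"
    if rho >= 0.2:
        return "low"
    return "extremely low"


def select_features(scores, threshold=0.4):
    """Keep features whose composite indicator reaches the threshold (assumed inclusive)."""
    return [name for name, rho in scores.items() if rho >= threshold]


# Hypothetical indicator values for illustration only.
example_scores = {"SOG": 0.65, "TIME": 0.30}
print(select_features(example_scores))
```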

4.2.2. Select the Step Size of the Input Feature

The step size of input data is a key parameter in neural network models, which has a significant impact on the performance and accuracy of the model. Smaller step sizes can result in more repetitive samples being used for training, which in turn makes the model overfit to local features. Larger step sizes may cause the model to fail to capture fine patterns and key features in the data, resulting in a loss of information and model underfitting. The autocorrelation function (ACF) and partial autocorrelation function (PACF) are commonly used concepts in time series analyses. Of the two, the ACF measures the degree of correlation between the time series and the lagged version of the series at different lags by calculating the autocorrelation coefficients at different lag orders. The PACF measures the direct correlation between the current lagged value and the series after further removing the effect of other lagged terms, reflecting the effect of this lag order on the series, thus providing guidance for determining the appropriate step size of the input data.
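To make the two functions concrete, the sketch below estimates the sample ACF of a series and derives the PACF from it using the Durbin-Levinson recursion. This is one common way of computing these quantities and is offered as an illustration under that assumption, not as the exact procedure used in the experiments. The function names are hypothetical.

```python
def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 0..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("ACF is undefined for a constant series")
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations from autocorrelations r (with r[0] == 1)
    using the Durbin-Levinson recursion."""
    pacf = [1.0]
    phi_prev = []  # coefficients phi_{k-1, 1..k-1} of the previous order
    for k in range(1, len(r)):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf
```

A PACF that drops to near zero after lag p suggests that p previous points carry most of the direct dependence, which is the intuition behind using the lag order as the input step size.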

This study treats the longitude and latitude series as time series and calculates the ACF and PACF for each series. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are then used to select the optimal lag order. Both criteria are common measures of the quality of a time series model, and a smaller value indicates a better model. The lag order of the model with the smallest AIC and BIC values is therefore taken as the step size of the input features. AIC and BIC are calculated as shown in Equation (10):

\begin{cases} AIC = - 2 \ln (L) + 2 k \\ BIC = - 2 \ln (L) + k \ln (n) \end{cases}

(10)

where

L

is the maximized value of the likelihood function of the model,

k

is the number of estimated model parameters, and

n

is the length of the series.
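Equation (10) translates directly into code. In the sketch below, the functions take the maximized log-likelihood ln(L) rather than L itself, and `best_lag` is a hypothetical helper that picks the lag order minimizing each criterion. The candidate values shown are invented for illustration.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """Bayesian information criterion: -2 ln(L) + k ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


def best_lag(candidates, n):
    """Return (lag minimizing AIC, lag minimizing BIC).

    candidates maps a lag order to (maximized log-likelihood, parameter count).
    """
    scored = {
        lag: (aic(ll, k), bic(ll, k, n)) for lag, (ll, k) in candidates.items()
    }
    lag_aic = min(scored, key=lambda lag: scored[lag][0])
    lag_bic = min(scored, key=lambda lag: scored[lag][1])
    return lag_aic, lag_bic


# Hypothetical fitted models: lag order -> (ln L, number of parameters).
example = {1: (-120.0, 2), 2: (-100.0, 3), 3: (-99.5, 4)}
print(best_lag(example, n=200))
```

Because BIC penalizes each parameter by ln(n) instead of 2, it favors smaller lag orders than AIC for series longer than about eight points. The two criteria can therefore disagree, in which case a joint judgment is needed.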

4.3. Construction of Seq2Seq-Based Restoration Model

The trajectory restoration problem can be viewed as a sequence regression problem, which aims to minimize the difference between the repaired trajectory and the true trajectory by training on the dataset. This study utilizes the sequence-to-sequence structure of deep learning to build a trajectory restoration model. The Seq2Seq structure extracts key information from the vessel trajectory through an encoder, which encodes it into a sequence of trajectory states. This sequence represents the trajectory information and is used to infer the original trajectory. The decoder is then used to repair the original trajectory based on the extracted trajectory information. This method can effectively repair the vessel trajectory, which in turn improves the completeness and accuracy of the trajectory data.

4.3.1. Seq2Seq Model

In this study, the Seq2Seq model is used as the basis for constructing a trajectory restoration model. Seq2Seq is a generalized encoder–decoder structure. It consists of three parts: an encoder, a decoder, and an intermediate state vector that connects the two parts [28]. This underlying model architecture utilizes the RNN cell structure as the core component of the encoder and decoder. The encoder is responsible for extracting features from a variable-length data sequence and encoding this information into a fixed-size state vector. The decoder receives this state vector and generates a variable-length output sequence by learning from it. The model is shown in Figure 6.

In this paper, the input features and step size are determined by the model feature selection described above, and the dataset divided accordingly is then fed into the model. The encoder accepts an input sequence,

X = \{x_{1}, x_{2}, \dots, x_{l}\}

, and generates a hidden state vector from its information. This vector is a high-level semantic representation of the input sequence, representing the core features of the input sequence. The calculation process is represented as follows:

C = E (\{x_{1}, x_{2}, \dots, x_{l}\}, θ_{E})

(11)

where

C

is the hidden state vector,

E ()

is the function of the encoder, and

θ_{E}

is the network parameter to be learned by the encoder. The decoder then generates the output sequence Y of predicted data step by step based on this latent representation

C

, as expressed in the following equation:

Y = D (\{y_{1}, y_{2}, \dots, y_{n}\}, θ_{D}, C)

(12)

where

D ()

is the function of the decoder, and

θ_{D}

is the network parameter to be learned by the decoder.
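The encoder and decoder roles in Equations (11) and (12) can be illustrated with a deliberately small, untrained sketch in pure Python. It assumes a plain Elman RNN cell, random weights, a zero vector as the first decoder input, and feeding each output back as the next decoder input. All of these are assumptions made for illustration, and the parameter names are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix (untrained; for shape illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_hid, x, h):
    """One Elman step: h' = tanh(W_in x + W_hid h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_hid, h))]


def init_params(in_dim, hidden, out_dim, seed=0):
    rng = random.Random(seed)
    return {
        "hidden": hidden,
        "out_dim": out_dim,
        "enc_in": make_matrix(hidden, in_dim, rng),
        "enc_hid": make_matrix(hidden, hidden, rng),
        "dec_in": make_matrix(hidden, out_dim, rng),
        "dec_hid": make_matrix(hidden, hidden, rng),
        "dec_out": make_matrix(out_dim, hidden, rng),
    }


def encode(params, xs):
    """Encoder E: compress a variable-length sequence into a fixed-size vector C."""
    h = [0.0] * params["hidden"]
    for x in xs:
        h = rnn_step(params["enc_in"], params["enc_hid"], x, h)
    return h


def decode(params, context, steps):
    """Decoder D: unroll `steps` outputs starting from the context vector C."""
    h = context
    y = [0.0] * params["out_dim"]  # assumed start input
    outputs = []
    for _ in range(steps):
        h = rnn_step(params["dec_in"], params["dec_hid"], y, h)
        y = matvec(params["dec_out"], h)
        outputs.append(y)
    return outputs


params = init_params(in_dim=5, hidden=8, out_dim=2)
context = encode(params, [[0.1] * 5 for _ in range(10)])
restored = decode(params, context, steps=3)
```

The size of the context vector depends only on the hidden dimension and not on the input length, which is what allows the decoder to produce an output sequence of a different length.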

4.3.2. Basic Neural Network Models

In this paper, RNN, gated recurrent unit (GRU), LSTM, BILSTM, and BIGRU models are used as the underlying network architectures for comparison experiments. Their internal structures are shown in Figure 7.

The mechanism of operation of the RNN model is that the output of each step acts directly on itself at the next time step. The inputs from the previous moment are processed and extracted for information and then passed directly into the network along with the inputs from the next moment. This mechanism enables the hidden state to be passed backward sequentially as the sequence advances, but at the same time, it may also lead to the problem of gradient vanishing or gradient explosion, and the RNN structure is shown in Figure 7a.

Compared to RNN, the LSTM model introduces a gating mechanism that allows the model to better control long-term information transfer and capture long-distance dependence, thus alleviating the problem of gradient vanishing or gradient explosion. The LSTM structure is mainly composed of four parts, namely the input gate, the forget gate, the output gate, and the memory unit, as shown in Figure 7b.

The GRU model further simplifies the network structure, containing only an update gate and a reset gate. This simplification does not degrade model performance; the two gates together realize the functions of “forgetting” and “selecting memory”. The GRU model not only reduces the time required for model training but also helps to alleviate the problems of gradient vanishing and explosion. Its structure is shown in Figure 7c.

The BILSTM and BIGRU models, on the other hand, are trained bi-directionally to ensure that the trajectory data are fully utilized. This class of model is a bi-directional architecture composed of two independent LSTM/GRU networks, which process the sequence in the forward and backward directions and whose outputs are merged to produce the final result, with the structure shown in Figure 7d.
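The gating described above can be sketched as a single GRU step in pure Python, together with a bi-directional wrapper that concatenates the final forward and backward states. The sketch uses one common formulation of the GRU update (libraries differ in which term the update gate multiplies), and the parameter dictionary keys are hypothetical names chosen for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, u, b, x, h):
    """Compute W x + U h + b row by row."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]


def gru_step(p, x, h):
    """One GRU step with an update gate z and a reset gate r."""
    z = [sigmoid(v) for v in affine(p["Wz"], p["Uz"], p["bz"], x, h)]
    r = [sigmoid(v) for v in affine(p["Wr"], p["Ur"], p["br"], x, h)]
    # The reset gate scales the previous state before the candidate is formed.
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], p["Uh"], p["bh"], x, gated)]
    # The update gate blends the previous state with the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(p, xs, hidden):
    h = [0.0] * hidden
    for x in xs:
        h = gru_step(p, x, h)
    return h


def run_bigru(p_fwd, p_bwd, xs, hidden):
    """Concatenate the final states of a forward pass and a backward pass."""
    return run_gru(p_fwd, xs, hidden) + run_gru(p_bwd, list(reversed(xs)), hidden)
```

The bi-directional output has twice the hidden size, since the two passes use independent parameters and their states are merged by concatenation in this sketch.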

4.3.3. Evaluation Indicators

To verify the accuracy of the experimental results, several evaluation indicators were used to assess the restoration results: root mean square error (RMSE), mean absolute error (MAE), Fréchet distance (FD), and average Euclidean distance (AED). RMSE reflects the accuracy of the repair result, MAE directly expresses the repair error, FD reflects the degree of matching between the repaired trajectory and the real trajectory, and AED reflects the positional accuracy of the repaired trajectory points. The evaluation indicators are calculated as shown in Equation (13).

\begin{cases} RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(Z_{true, i} - Z_{re, i})}^{2}} \\ MAE = \frac{1}{n} \sum_{i = 1}^{n} | Z_{true, i} - Z_{re, i} | \\ FD = \max_{i \in [1, n]} \sqrt{{(x_{true, i} - x_{re, i})}^{2} + {(y_{true, i} - y_{re, i})}^{2}} \\ AED = \frac{\sum_{i = 1}^{n} \sqrt{{(x_{true, i} - x_{re, i})}^{2} + {(y_{true, i} - y_{re, i})}^{2}}}{n} \end{cases}

(13)

where

n

is the number of samples;

Z_{true, i}

and

Z_{re, i}

are the true and restored values of the features; and

x_{true, i}

,

y_{true, i}

,

x_{re, i}

, and

y_{re, i}

are the projected latitude and longitude coordinates of the true and restored points, respectively. The RMSE and MAE units are given in degrees, and the common maritime unit (nautical miles) is used for FD and AED.
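The four indicators of Equation (13) can be sketched as follows. Note that FD is implemented exactly as written in Equation (13), i.e., as the largest distance between index-aligned points, which is simpler than the general Fréchet distance over all curve reparameterizations. Conversion of projected coordinates to nautical miles is not shown, and the function names are illustrative assumptions.

```python
import math


def rmse(true, restored):
    """Root mean square error of one feature."""
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(true, restored)) / len(true))


def mae(true, restored):
    """Mean absolute error of one feature."""
    return sum(abs(t - r) for t, r in zip(true, restored)) / len(true)


def point_distances(true_xy, restored_xy):
    """Euclidean distance between each pair of index-aligned points."""
    return [
        math.hypot(tx - rx, ty - ry)
        for (tx, ty), (rx, ry) in zip(true_xy, restored_xy)
    ]


def fd(true_xy, restored_xy):
    """FD as in Equation (13): the largest pointwise distance."""
    return max(point_distances(true_xy, restored_xy))


def aed(true_xy, restored_xy):
    """Average Euclidean distance over all trajectory points."""
    d = point_distances(true_xy, restored_xy)
    return sum(d) / len(d)
```

RMSE penalizes large individual errors more heavily than MAE, so a restored trajectory with a few large deviations shows a wider gap between the two values.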

5. Experiment and Result Analysis

5.1. Experimental Data and Configuration

The data for this study were obtained from the AIS dataset downloaded from marinecadastre.gov. The original AIS dataset consists of dynamic and static information about the vessel, which includes multi-dimensional information such as MMSI, coordinated universal time (UTC), latitude, longitude, speed over ground (SOG), course over ground (COG), timestamps, and call signs, as shown in Table 1. This study uses the tanker data from a strait in July 2021 as the experimental data. The latitude range of the selected region is 23.00~24.5 N, and the longitude range is 79.50~84.00 W. There are 191,506 trajectory points in the statistical region range.

Based on relevant research and experience, this study conducted experiments in a 64-bit Windows 11 environment using Python 3.6.9 and the PyTorch framework. The experimental data obtained after the processing described in Section 3 were divided into a training dataset (70%) and a test dataset (30%). In addition, this study finalized a set of optimal hyperparameters through experimental comparisons: the loss function was mean square error (MSE), the optimizer was Adam, the initial learning rate was 0.001, the learning rate adjustment multiplier was 0.9, the dropout rate was 0.1, the number of hidden units per layer was 64, the number of layers was 2, the number of epochs was 40, and the batch size was 64.

5.2. Analysis of Anomaly Detection and Preliminary Restoration Results

After pre-processing the trajectory data, this study extracted 706 trajectories in the region, covering 140,085 trajectory points. Anomalies in the vessel trajectory data were then identified using the established criteria for abnormal and missing data; the results are shown in Table 2.

Table 2 shows that the ratio of abnormal trajectory points to normal trajectory points in this region is 2.769%, and that the anomalies are mainly time anomalies, totaling 3872. The preliminary restoration therefore mainly consists of interpolating the missing time points. Taking a section of the trajectory of one of the vessels as an example, its preliminary restoration results are shown in Figure 8. The figure shows that cubic spline interpolation can effectively fill in the missing trajectory points. The method restores the integrity of the trajectory, and the restored results also better reflect the motion state of the vessel.
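The interpolation step can be sketched with a small natural cubic spline in pure Python. The boundary conditions are not stated here, so natural boundary conditions (zero second derivative at both ends) are an assumption. In practice, latitude and longitude would each be interpolated as a function of time, and the function name is hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    if any(xs[i] >= xs[i + 1] for i in range(n)):
        raise ValueError("xs must be strictly increasing")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution gives the polynomial coefficients of each segment.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamping to the first or last segment.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


# Hypothetical timestamps (seconds) and latitudes with a gap between 20 s and 40 s.
lat_at = natural_cubic_spline([0.0, 10.0, 20.0, 40.0], [23.50, 23.52, 23.55, 23.58])
filled = lat_at(30.0)
```

The spline passes through every known point and keeps the first and second derivatives continuous, which is why the overall trend of the trajectory is preserved when gaps are filled.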

5.3. Analysis of Model Features Selection Results

To determine the input feature dimensions, this study extracts 397 vessel trajectory feature points and calculates the composite indicator of feature correlation. The results are shown in Table 3. According to Equation (9), longitude (LON) and latitude (LAT) are extremely highly correlated with each other, indicating their importance in vessel trajectory restoration. SOG and COG showed a high correlation with LAT and LON, and heading showed a moderate correlation with LAT and LON, whereas the time variable showed a moderate correlation with LON and a low correlation with LAT. Based on these findings, this study treated the variables whose composite indicator with LAT and LON is higher than 0.4 as correlated features. Therefore, a total of five dimensions, namely LAT, LON, SOG, COG, and heading, were identified as the input features of the model, which can be expressed as

X_{i} = \{{LAT}_{i}, {LON}_{i}, {COG}_{i}, {SOG}_{i}, {Heading}_{i}\}

.

In determining the model input feature step, this study adopted the lag order method traditionally used for linear time series regression models. This method regards the latitude and longitude series as spatio-temporal autocorrelation series; the calculated autocorrelation and partial autocorrelation results are shown in Figure 9. The autocorrelation and partial autocorrelation coefficients of both latitude and longitude tail off, which suggests that a traditional linear regression model may not adequately capture the complex dependence structure in these series. Therefore, to further pinpoint the lag value, this study uses the autoregressive moving average (ARMA) model.

Based on the correlation coefficients calculated for the latitude and longitude series, this study further fits the ARMA model. To determine the optimal input feature step, the AIC and BIC information criteria are calculated, and the results are shown in Figure 10. A comprehensive evaluation of the AIC and BIC values shows that the model is optimal when the lag value is 10, indicating that each trajectory point is strongly correlated with the 10 preceding trajectory points. Based on this analysis, the model feature input step was set to 10, i.e., each model training sample consists of 10 trajectory points, and each trajectory point contains five features.
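The construction of training samples with an input step of 10 can be sketched with a sliding window, in the spirit of the dataset division shown in Figure 5. The stride of one point, the default single-step output, and the use of all five features in the target are assumptions made for illustration.

```python
def sliding_windows(points, input_steps=10, output_steps=1):
    """Split one trajectory into (input, target) pairs with a stride of one point.

    points is a list of feature vectors, e.g. [LAT, LON, COG, SOG, Heading].
    """
    samples = []
    last_start = len(points) - input_steps - output_steps
    for start in range(last_start + 1):
        inputs = points[start:start + input_steps]
        targets = points[start + input_steps:start + input_steps + output_steps]
        samples.append((inputs, targets))
    return samples


# Hypothetical trajectory of 13 points with five features each.
trajectory = [[float(i)] * 5 for i in range(13)]
samples = sliding_windows(trajectory)
```

A trajectory shorter than the sum of the input and output steps yields no samples, so very short trajectories contribute nothing to training under this scheme.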

5.4. Analysis of Vessel Trajectory Restoration Results

5.4.1. Analysis of Model Restoration Results

In the Seq2Seq-based vessel trajectory restoration model, the choice of encoder and decoder affects the trajectory restoration effect. To systematically evaluate and compare the repair effects of different model architectures, this study constructs Seq2Seq models with different structures. These include the basic Seq2Seq (R2R) model and the LSTM2LSTM (L2L), BILSTM2BILSTM (BL2BL), GRU2GRU (G2G), and BIGRU2BIGRU (BG2BG) models in which the encoder and decoder are LSTM, BILSTM, GRU, and BIGRU units, respectively. Also, this study selected typical RNN, LSTM, BILSTM, and GRU recurrent network models as baseline models for comparison experiments.

The MSE of each model during training is recorded to observe the trend of the loss function, as shown in Figure 11. As can be seen from Figure 11a, the loss values of the baseline models fluctuate within the first 10 epochs, but the overall trend is decreasing, which indicates that the models are still learning at this stage. The loss curve then gradually flattens and no longer decreases significantly, which indicates that the models have converged and training is complete. As shown in Figure 11b, the Seq2Seq-based models converge slightly more slowly, and the loss curve stabilizes after 25 epochs. This shows that the Seq2Seq architecture is more complex and that the larger number of parameters increases the training time in the same environment. This complexity may stem from the fact that the Seq2Seq models need to optimize the encoder and decoder parameters simultaneously, while the baseline models only need to optimize the parameters of a single network, which may increase the training time but may also lead to better prediction performance.

The trained models performed vessel trajectory restoration on the test set, and the resulting evaluation metrics are shown in Table 4. Table 4 shows that the restoration models based on the Seq2Seq structure have smaller values of the performance metrics than the baseline models, which confirms the superior performance of the Seq2Seq structure in the restoration task. To further compare different encoder models under the Seq2Seq structure, the restoration results of some of the models are shown in Figure 12. From Table 4 and Figure 12, it can be observed that the BG2BG model performs best on all performance metrics. Specifically, it has the lowest MAE, RMSE, FD, and AED values, meaning that it has the highest restoration accuracy, the highest similarity between the restored trajectory and the real trajectory, and the smallest average Euclidean distance. This performance is attributed to its complex model structure and efficient feature extraction capability, which allow it to capture and restore vessel trajectories more accurately.

To explore the generalization ability and robustness of the trajectory restoration model in depth, we selected AIS data from an area in the Gulf of Mexico (with a latitude range from 26.00 N to 28.00 N and a longitude range from 90.10 W to 96.50 W) as a second dataset. This dataset has wider coverage than the experimental dataset and is intended to provide a more comprehensive assessment of model performance. We conducted robustness experiments using this dataset and compared the performance indicators of the models. The comparison results are shown in Table 5 and Figure 13. Table 5 shows that the BG2BG model achieves the lowest values for all indicators except MAE-LAT. Figure 13 further shows that the trajectories restored by the BG2BG model are more consistent with the trend of the real vessel trajectories. This indicates that the model performs best overall on this dataset. Combining the results of Table 4 and Table 5 with those of Figure 12 and Figure 13, we can conclude that the framework proposed in this study is effective in vessel trajectory restoration. Furthermore, the BG2BG model achieves the highest accuracy on both datasets, demonstrating that the model has good generalization ability and robustness.

5.4.2. Analysis of Ablation Experiment Results

To verify the effectiveness of the feature extraction module (Module A), the preliminary restoration module (Module B), and the Seq2Seq-based restoration module (Module C) in the vessel trajectory restoration model, this study conducted ablation tests of Modules A and B on straight and curved trajectories. The results are shown in Table 6 and Figure 14.

As can be seen from Table 6, the restoration performance indicators of straight trajectories are smaller overall than those of curved trajectories. This indicates that the restoration performance of the model varies with trajectory complexity, and that the restoration accuracy is relatively higher for straight trajectories, which have lower complexity. When the feature extraction module and the preliminary restoration module are each introduced into the model, the restoration performance metrics improve, which highlights the positive role of these two modules in enhancing model performance. Specifically, the feature extraction module efficiently extracts highly relevant features from the input data, avoiding data redundancy, while the preliminary restoration module effectively eliminates abnormal data and preserves the trend features of the original data. The Seq2Seq-based restoration module then restores the vessel trajectory on the basis of feature extraction and preliminary restoration. It reduces errors, improves the restoration accuracy, and ensures that the restored trajectory is consistent with the original data in its overall trend. It should be noted that when the preliminary restoration module is not added for the curved trajectory, the latitude MAE and RMSE values and the FD value are the lowest. Combined with Figure 14b, however, it was found that although the overall trend of the trajectory restored by the B + C structure fits the real trajectory more closely, the trajectory as a whole is offset, whereas the trajectory restored by the full A + B + C structure more closely resembles the real trajectory. The results show that the A + B structure has more advantages in straight trajectory restoration under sparse conditions. However, in curved trajectory restoration, the trajectory repaired by the full A + B + C structure fits the trend of the real vessel trajectory better. Therefore, combining the restoration indexes in Table 6 with the restoration effects in Figure 14, it can be concluded that adding the feature extraction and preliminary restoration modules effectively improves the overall performance of the vessel trajectory restoration model by optimizing the input features and improving the data quality.

5.4.3. Analysis of Multi-Step Length Restoration Results

To investigate in depth the influence of the output step size on the restoration effect, this study carried out experiments with different output step sizes, and the results are shown in Table 7 and Figure 15. As can be seen from Table 7, every restoration performance index reaches its minimum value with single-step output. As the output step size increases, the values of each index increase, indicating that the restoration effect is reduced to some extent. The restoration effects of different output step sizes are further analyzed with Figure 15. When the trajectory complexity is low, the trajectories restored with different step sizes differ somewhat from the real trajectories but fit them well on the whole. For complex trajectories, however, the restoration effects of different step sizes differ. The reason is that the computational complexity of the model increases with the output step size, which limits the ability of the model to capture and restore trajectory details and thus affects the restoration performance.

6. Conclusions

This paper proposes a sparse trajectory restoration method for vessels based on feature correlation. The method demonstrates significant effectiveness and advantages in the trajectory repair task. Through comparative experiments, this study compares the restoration effect of the baseline network with different Seq2Seq structure models and further explores the influence of different modules on the restoration effect and the sensitivity of different step sizes on the restoration effect in depth through ablation experiments.

Current trajectory restoration methods often simply eliminate anomalous data, ignoring the effect on trajectory integrity. Most existing models seldom consider the correlation between trajectory features in trajectory restoration, and inputting redundant data increases the computational complexity of the model. The method used in this study effectively detects abnormal data and accurately marks the missing locations, so that interpolation can be used to preliminarily repair the trajectory while preserving its overall trend. Then, a composite indicator of feature correlation is constructed to screen highly correlated features as model inputs, and a Seq2Seq structure model is constructed for vessel trajectory restoration, which enables the model to restore vessel trajectories more efficiently and accurately.

It is worth noting that the MAE and RMSE evaluation metrics for longitude tend to be larger than those for latitude. This difference may be attributed to the characteristics of the data itself: the longitude data may have more nonlinear relationships or noise, which increases the difficulty of model restoration. This study did not validate the robustness of the model in a more complex navigational environment. Therefore, our future research will focus on validating the model in more diverse and complex navigation environments to comprehensively assess its performance and robustness. This will provide a deeper understanding and is expected to promote the further development of vessel trajectory repair technology in practical applications.

Author Contributions

Conceptualization, L.Y., X.C. and H.L.; methodology, L.Y. and X.C.; validation, J.L.; investigation, L.Y.; resources, C.L. and Y.Z.; data curation, Y.Z.; writing—original draft preparation, L.Y.; writing—review and editing, X.C. and H.L.; visualization, R.Z.; supervision, X.C. and H.L.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (42371438).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study can be found at https://www.marinecadastre.gov/ais/ (accessed on 18 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, R.W.; Liang, M.; Nie, J.; Lim, W.Y.B.; Zhang, Y.; Guizani, M. Deep learning-powered vessel trajectory prediction for improving smart traffic services in maritime Internet of Things. IEEE Trans. Netw. Sci. Eng. 2022, 9, 3080–3094.
2. Zhang, Z.; Huang, L.; Peng, X.; Wen, Y.; Song, L. Loitering behavior detection and classification of vessel movements based on trajectory shape and Convolutional Neural Networks. Ocean. Eng. 2022, 258, 111852.
3. Chai, T.; Zhu, H.; Peng, L.; Wang, J.; Fan, Z.; Xiao, S.; Xie, J.; Hu, Y. Constructing and analyzing the causation chain network for ship collision accidents. Int. J. Mod. Phys. C 2022, 33, 2250118.
4. Sui, Z.; Huang, Y.; Wen, Y.; Zhou, C.; Huang, X. Marine traffic profile for enhancing situational awareness based on complex network theory. Ocean. Eng. 2021, 241, 110049.
5. Sui, Z.; Wen, Y.; Huang, Y.; Zhou, C.; Du, L.; Piera, M.A. Node importance evaluation in marine traffic situation complex network for intelligent maritime supervision. Ocean. Eng. 2022, 247, 110742.
6. Chao, H.C.; Wu, H.T.; Tseng, F.H. AIS meets IoT: A network security mechanism of sustainable marine resource based on edge computing. Sustainability 2021, 13, 3048.
7. Dogancay, K.; Tu, Z.; Ibal, G. Research into vessel behaviour pattern recognition in the maritime domain: Past, present and future. Digit. Signal Process. 2021, 119, 103191.
8. Toscano, D.; Murena, F.; Quaranta, F.; Mocerino, L. Assessment of the impact of ship emissions on air quality based on a complete annual emission inventory using AIS data for the port of Naples. Ocean. Eng. 2021, 232, 109166.
9. Yan, R.; Mo, H.; Yang, D.; Wang, S. Development of denoising and compression algorithms for AIS-based vessel trajectories. Ocean. Eng. 2022, 252, 111207.
10. Guo, S.; Mou, J.; Chen, L.; Chen, P. Improved kinematic interpolation for AIS trajectory reconstruction. Ocean. Eng. 2021, 234, 109256.
11. Deng, C.; Wang, S.; Liu, J.; Li, H.; Chu, B.; Zhu, J. Graph Signal Variation Detection: A novel approach for identifying and reconstructing ship AIS tangled trajectories. Ocean. Eng. 2023, 286, 115452.
12. Li, B.; Cai, Z.; Kang, M.; Su, S.; Zhang, S.; Jiang, L.; Ge, Y. A trajectory restoration algorithm for low-sampling-rate floating car data and complex urban road networks. Int. J. Geogr. Inf. Sci. 2020, 35, 717–740.
13. Li, G.; Lou, L.; Zheng, P. Route Restoration Method for Sparse Taxi GPS trajectory based on Bayesian Network. Teh. Vjesn./Tech. Gaz. 2021, 28, 668–677.
14. Ren, H.; Ruan, S.; Li, Y.; Bao, J.; Meng, C.; Li, R.; Zheng, Y. Mtrajrec: Map-constrained trajectory recovery via seq2seq multi-task learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, Singapore, 14–18 August 2021; Volume 8, pp. 1410–1419.
15. Sun, J.; Sun, X.; Zhan, Z.; Zhou, J. A method of vehicle trajectory restoration based on Kalman Filter. In Proceedings of the 2022 IEEE 8th International Conference on Cloud Computing and Intelligent Systems (CCIS), Chengdu, China, 26–28 November 2022; Volume 11, pp. 662–665.
16. Liu, C.; Wang, J.; Liu, A.; Cai, Y.; Ai, B. An Asynchronous Trajectory Matching Method Based on Piecewise Space-Time Constraints. IEEE Access 2020, 8, 224712–224728.
17. Gong, X.; Huang, Z.; Wang, Y.; Wu, L.; Liu, Y. High-performance spatiotemporal trajectory matching across heterogeneous data sources. Future Gener. Comput. Syst. 2020, 105, 148–161.
18. Zhang, L.; Meng, Q.; Xiao, Z.; Fu, X. A novel ship trajectory reconstruction approach using AIS data. Ocean. Eng. 2018, 159, 165–174.
19. Qin, H.; Yang, X. Iterative Algorithm for Vessel Trajectory Restoration Based on Improved Linear Interpolation. J. Comput. Aided Des. Comput. Graph. 2019, 31, 9.
20. Zhang, L.; Zhu, Y.; Lu, W. A detection and restoration approach for vessel trajectory anomalies based on AIS. J. Northwestern Polytech. Univ. 2021, 39, 119–125.
21. Zhang, X.; He, Y.; Tang, R.; Mou, J.; Gong, S. A novel method for reconstruct ship trajectory using raw AIS data. In Proceedings of the 2018 3rd IEEE International Conference on Intelligent Transportation Engineering (ICITE), Singapore, 3–5 September 2018; IEEE: New York, NY, USA, 2018; pp. 192–198.
22. Li, J.; Chu, X.M.; Liu, X.L. An approach for restoring the lost trajectories of vessels in inland waterways. J. Harbin Eng. Univ. 2019, 40, 67–73.
23. Liu, L.; Jiang, Z.L.; Chu, X.M. Automatic identification system data restoration and prediction. J. Harbin Eng. Univ. 2019, 40, 1072–1077.
24. Chen, X.; Ling, J.; Yang, Y.; Zheng, H.; Xiong, P.; Postolache, O.; Xiong, Y. Ship trajectory reconstruction from AIS sensory data via data quality control and prediction. Math. Probl. Eng. 2020, 2020, 7191296.
25. Zhong, C.; Jiang, Z.; Chu, X.; Liu, L. Inland ship trajectory restoration by recurrent neural network. J. Navig. 2019, 72, 1359–1377.
26. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
27. Kvalseth, T.O. On normalized mutual information: Measure derivations and properties. Entropy 2017, 19, 631.
28. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), Montreal, QC, Canada, 8–13 December 2014; Volume 2, pp. 3104–3112.

Figure 1. AIS trajectory data preprocessing and preliminary restoration framework.

Figure 2. Map of time intervals of AIS trajectory data.

Figure 3. Map of SOG distribution.

Figure 4. Vessel trajectory restoration model framework.

Figure 5. Sliding window method to divide the dataset.

Figure 6. Seq2Seq model structure.

Figure 7. Basic network model structure: (a) RNN model structure; (b) LSTM model structure; (c) GRU model structure; (d) BILSTM/BIGRU model structure.

Figure 8. Preliminary restoration trajectory results.

Figure 9. Correlation coefficient values for latitude and longitude: (a) ACF of LAT; (b) PACF of LAT; (c) ACF of LON; (d) PACF of LON.

Figure 10. AIC and BIC information criteria values for latitude and longitude features: (a) the value of LAT; (b) the value of LON.

Figure 11. Comparison of loss functions during training of different models: (a) MSE of the baseline models; (b) MSE for Seq2Seq type models.

Figure 12. Comparison of trajectory restoration results of different models: (a) restoration results of straight trajectory-1; (b) restoration results of curved trajectory-1; (c) restoration results of straight trajectory-2; (d) restoration results of curved trajectory-2.

Figure 13. Comparison of trajectory restoration results in robustness experiments: (a) restoration results of curved trajectory-3; (b) restoration results of curved trajectory-4; (c) restoration results of curved trajectory-5; (d) restoration results of curved trajectory-6.

Figure 14. Comparison of trajectory restoration results in ablation experiments: (a) ablation experiment results of straight trajectory-3; (b) ablation experiment results of curved trajectory-7.

Figure 15. Comparison of trajectory restoration results with different step sizes: (a) restoration results of curved trajectory-8; (b) restoration results of curved trajectory-9.

Table 1. Vessel information table.

Static Information			Dynamic Information
Attribute	Description	Type	Attribute	Description	Type
MMSI	Maritime mobile service identity	Text	BaseDateTime	Full UTC date and time	DateTime
VesselName	Vessel name	Text	LON	Longitude	Double
IMO	International Maritime Organization number	Text	LAT	Latitude	Double
CallSign	Vessel call sign	Text	SOG	Speed over ground	Float
VesselType	Vessel type	Int	COG	Course over ground	Float
Length	Length of vessel	Float	Heading	True heading angle	Float
Width	Width of vessel	Float	Status	Navigation status	Int
TransceiverClass	Class of AIS transceiver	Text	Draught	Draft depth of vessel	Float
			Cargo	Cargo type	Text

Table 2. Statistics of abnormal and missing trajectory points.

Abnormal Time	Abnormal SOG	Abnormal Position	Total	Percentage
3872	8	0	3880	2.769%

Table 3. Calculation results of latitude and longitude correlation indicators.

Variable	PCC (LAT)	SRCC (LAT)	NMIE (LAT)	$ρ$ (LAT)	PCC (LON)	SRCC (LON)	NMIE (LON)	$ρ$ (LON)
Time	0.056	−0.032	0.992	0.360	0.470	0.146	1.000	0.539
LAT	1.000	1.000	1.000	1.000	0.808	0.882	0.992	0.894
LON	0.808	0.882	0.992	0.894	1.000	1.000	1.000	1.000
COG	−0.702	−0.694	0.868	0.755	−0.787	−0.689	0.877	0.784
SOG	−0.529	−0.625	0.614	0.589	−0.526	−0.493	0.622	0.547
Heading	−0.741	−0.776	0.629	0.715	−0.788	−0.717	0.635	0.713
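
As a quick consistency check on Table 3, the sketch below recomputes the composite indicator from the tabulated PCC, SRCC, and NMIE values. It assumes that $ρ$ is the equally weighted mean of |PCC|, |SRCC|, and NMIE; that form is inferred from the numbers in the table rather than quoted from Section 4.2.1, and the names `TABLE3`, `composite`, and `max_deviation` are illustrative only.

```python
# Values transcribed from Table 3 as (PCC, SRCC, NMIE, rho),
# keyed first by the target coordinate (LAT or LON), then by variable.
TABLE3 = {
    "LAT": {
        "Time": (0.056, -0.032, 0.992, 0.360),
        "LAT": (1.000, 1.000, 1.000, 1.000),
        "LON": (0.808, 0.882, 0.992, 0.894),
        "COG": (-0.702, -0.694, 0.868, 0.755),
        "SOG": (-0.529, -0.625, 0.614, 0.589),
        "Heading": (-0.741, -0.776, 0.629, 0.715),
    },
    "LON": {
        "Time": (0.470, 0.146, 1.000, 0.539),
        "LAT": (0.808, 0.882, 0.992, 0.894),
        "LON": (1.000, 1.000, 1.000, 1.000),
        "COG": (-0.787, -0.689, 0.877, 0.784),
        "SOG": (-0.526, -0.493, 0.622, 0.547),
        "Heading": (-0.788, -0.717, 0.635, 0.713),
    },
}


def composite(pcc, srcc, nmie):
    """Assumed form: equally weighted mean of |PCC|, |SRCC| and NMIE."""
    return (abs(pcc) + abs(srcc) + nmie) / 3.0


def max_deviation(table):
    """Largest gap between the recomputed and the tabulated rho."""
    worst = 0.0
    for rows in table.values():
        for pcc, srcc, nmie, rho in rows.values():
            worst = max(worst, abs(composite(pcc, srcc, nmie) - rho))
    return worst


if __name__ == "__main__":
    for target, rows in TABLE3.items():
        for variable, (pcc, srcc, nmie, rho) in rows.items():
            value = composite(pcc, srcc, nmie)
            print(f"{target:3s} {variable:8s} recomputed={value:.3f} table={rho:.3f}")
    print("max deviation:", max_deviation(TABLE3))
```

Under this assumption, every recomputed value agrees with the tabulated $ρ$ to within rounding at three decimal places.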

Table 4. Comparison of restoration indicators for different models.

Type of Network	MAE (LAT)	MAE (LON)	RMSE (LAT)	RMSE (LON)	FD	AED
RNN	0.0021	0.0078	0.0030	0.0099	1.2822	0.4717
LSTM	0.0020	0.0070	0.0026	0.0095	1.3866	0.4288
BILSTM	0.0031	0.0090	0.0040	0.0113	1.5827	0.5523
GRU	0.0022	0.0073	0.0027	0.0095	1.4306	0.4414
R2R	0.0019	0.0070	0.0028	0.0094	1.2610	0.4196
L2L	0.0018	0.0072	0.0024	0.0095	1.3149	0.4185
G2G	0.0016	0.0071	0.0023	0.0091	1.3151	0.4237
BL2BL	0.0019	0.0070	0.0030	0.0093	1.3590	0.4282
BG2BG	0.0017	0.0058	0.0023	0.0077	1.2495	0.3573
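
To make Table 4 easier to read, the following sketch lists the network(s) with the lowest value for each indicator. It assumes that lower is better for all six columns, which holds for error and distance measures; the names `TABLE4`, `COLUMNS`, and `best_per_metric` are illustrative and do not come from the paper.

```python
# Column order matches Table 4.
COLUMNS = ("MAE_LAT", "MAE_LON", "RMSE_LAT", "RMSE_LON", "FD", "AED")

# Values transcribed from Table 4.
TABLE4 = {
    "RNN": (0.0021, 0.0078, 0.0030, 0.0099, 1.2822, 0.4717),
    "LSTM": (0.0020, 0.0070, 0.0026, 0.0095, 1.3866, 0.4288),
    "BILSTM": (0.0031, 0.0090, 0.0040, 0.0113, 1.5827, 0.5523),
    "GRU": (0.0022, 0.0073, 0.0027, 0.0095, 1.4306, 0.4414),
    "R2R": (0.0019, 0.0070, 0.0028, 0.0094, 1.2610, 0.4196),
    "L2L": (0.0018, 0.0072, 0.0024, 0.0095, 1.3149, 0.4185),
    "G2G": (0.0016, 0.0071, 0.0023, 0.0091, 1.3151, 0.4237),
    "BL2BL": (0.0019, 0.0070, 0.0030, 0.0093, 1.3590, 0.4282),
    "BG2BG": (0.0017, 0.0058, 0.0023, 0.0077, 1.2495, 0.3573),
}


def best_per_metric(table, columns):
    """Return, for each column, every model that attains the minimum."""
    best = {}
    for index, name in enumerate(columns):
        lowest = min(row[index] for row in table.values())
        # Ties are kept so that equal scores are not hidden.
        best[name] = [model for model, row in table.items() if row[index] == lowest]
    return best


if __name__ == "__main__":
    for metric, models in best_per_metric(TABLE4, COLUMNS).items():
        print(f"{metric:9s} -> {', '.join(models)}")
```

On these numbers BG2BG is lowest, or tied for lowest, on every indicator except the latitude MAE, where G2G is slightly lower.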

Table 5. Comparison of restoration indicators in robustness experiments.

Type of Network	MAE (LAT)	MAE (LON)	RMSE (LAT)	RMSE (LON)	FD	AED
R2R	0.0044	0.0219	0.0053	0.0225	1.7565	1.2446
L2L	0.0018	0.0223	0.0026	0.0229	1.8000	1.2154
G2G	0.0051	0.0122	0.0055	0.0142	1.5501	0.7473
BL2BL	0.0031	0.0061	0.0035	0.0067	0.7368	0.4057
BG2BG	0.0021	0.0041	0.0025	0.0050	0.6964	0.2836

Table 6. Comparison of restoration indicators in ablation experiments.

Trajectory	Ablation Method	MAE (LAT)	MAE (LON)	RMSE (LAT)	RMSE (LON)	FD	AED
Straight Trajectory-3	A + B + C	0.0004	0.0022	0.0006	0.0025	0.2720	0.1265
	B + C	0.0019	0.0034	0.0022	0.0036	0.5931	0.2240
	A + B	0.0009	0.0006	0.0011	0.0008	0.1481	0.0551
	A + C	0.0007	0.0040	0.0008	0.0042	0.3705	0.2263
Curved Trajectory-7	A + B + C	0.0011	0.0036	0.0014	0.0070	1.2462	0.2177
	B + C	0.0007	0.0068	0.0009	0.0074	0.9880	0.3789
	A + B	0.0118	0.0013	0.0193	0.0025	1.4143	0.1481
	A + C	0.0028	0.0036	0.0029	0.0067	1.4485	0.2861

Table 7. Comparison of restoration indicators with different step sizes.

Trajectory	Output Steps	MAE (LAT)	MAE (LON)	RMSE (LAT)	RMSE (LON)	FD	AED
Curved trajectory-8	1	0.0003	0.0015	0.0004	0.0018	0.2935	0.0877
	2	0.0013	0.0017	0.0013	0.0020	0.3415	0.1256
	3	0.0024	0.0025	0.0024	0.0028	0.4026	0.2040
	4	0.0007	0.0062	0.0007	0.0063	0.5603	0.3458
	5	0.0019	0.0028	0.0019	0.0030	0.4081	0.1962
Curved trajectory-9	1	0.0002	0.0050	0.0002	0.0061	0.8582	0.2225
	2	0.0007	0.0057	0.0007	0.0061	0.9565	0.3198
	3	0.0007	0.0065	0.0007	0.0068	1.0042	0.3587
	4	0.0005	0.0058	0.0006	0.0062	0.9664	0.3234
	5	0.0007	0.0059	0.0007	0.0064	0.9528	0.3208

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
