2.2. Data Preprocessing
Preprocessing of AIS data mainly consists of handling missing values and outliers. Missing values are filled by interpolation from the neighboring data points. Outliers are first identified and deleted, and the resulting gaps are then filled by interpolation.
AIS technology is not yet fully mature, so some ship trajectory points are missing; the missing data therefore need to be processed.
Figure 1 shows a trajectory with missing data; the blue points indicate the data that were actually recorded, and the black points indicate the points supplemented to compensate for the missing values [20]. The numbers are markers for the points.
Since the trajectory data form a time series, the locations where missing values need to be interpolated can be determined from the time interval between two neighboring data points. The number of missing points is directly proportional to this interval: the larger the interval between two neighboring data points, the more points are missing.
Missing values were identified from the time interval between points, using a reference interval of 10 min. Let the time series of the original AIS data be $t_1, t_2, \ldots, t_n$, and let $\Delta t_i = t_{i+1} - t_i$ be the time interval between two adjacent time points. The number of points to supplement follows from how many 10 min intervals fit into $\Delta t_i$: a gap of one interval needs no supplement, a gap of two intervals needs one supplemented point, a gap of three intervals needs two supplemented points, and so on. The missing values were then filled using Lagrange interpolation.
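The gap rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the use of timestamps in seconds, the even spacing of supplemented points, and the rounding rule for gaps that are not exact multiples of 10 min are all assumptions.

```python
def count_missing_points(timestamps, interval=600.0):
    """For each pair of adjacent timestamps (in seconds), return how many
    points should be supplemented between them."""
    counts = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        # Assumption: count the whole intervals that fit in the gap and
        # subtract the one that is already covered by the later point.
        counts.append(max(0, int(gap // interval) - 1))
    return counts


def missing_timestamps(timestamps, interval=600.0):
    """Return the timestamps at which values have to be interpolated.
    Supplemented points are spaced evenly inside each gap (an assumption)."""
    missing = []
    pairs = zip(timestamps, timestamps[1:])
    for (earlier, later), m in zip(pairs, count_missing_points(timestamps, interval)):
        step = (later - earlier) / (m + 1)
        missing.extend(earlier + k * step for k in range(1, m + 1))
    return missing
```

For timestamps of 0, 10, 30, and 60 min, the sketch reports zero, one, and two points to supplement in the three gaps.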
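Lagrange interpolation yields a polynomial that passes through all known points. The known points of the fishing vessel's existing trajectory are substituted to obtain the polynomial, and the value at each missing position is then calculated from it, as shown in the following formula [21]:
$$L(x) = \sum_{i=0}^{n} y_i \prod_{\substack{j=0 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j} \tag{1}$$
where $(x_i, y_i)$, $i = 0, \ldots, n$, are the known points and $x$ is the position of the missing value.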
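As a hedged illustration of the interpolation step, the sketch below evaluates the Lagrange polynomial directly from its definition. The helper `interpolate_missing` and its `window` parameter are hypothetical: they restrict the polynomial to the few known points nearest in time, which departs from the text's use of all known points and is assumed here only because high-degree polynomials tend to oscillate.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)  # i-th basis polynomial
        total += yi * basis
    return total


def interpolate_missing(times, values, t_missing, window=4):
    """Fill one missing value from the `window` known points nearest in time.
    Assumption: a small window is used instead of every known point."""
    nearest = sorted(range(len(times)), key=lambda i: abs(times[i] - t_missing))
    chosen = sorted(nearest[:window])
    return lagrange_interpolate(
        [times[i] for i in chosen], [values[i] for i in chosen], t_missing
    )
```

Each state value (latitude, longitude, speed, heading) would be interpolated separately, with time as the independent variable.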
Fishing vessels are equipped differently for different types of operations, so the attainable ranges of speed and heading also differ, and there is no clear regulation. The AIS data of fishing vessels record the instantaneous speed and instantaneous course, which are easily affected by waves, strong winds, and other factors and are therefore prone to noise. Outliers in the collected AIS data must therefore be processed to avoid adverse effects on subsequent model training.
Let D be the difference between the current velocity value and the previous velocity value, and F be the velocity threshold. When D > F, the distance l between the current data point and the previous data point is calculated from the current velocity. The distance l is then compared with the actual distance d between the two data points, calculated by the Haversine formula. If l is close to d, the current velocity value is considered normal; otherwise, it is considered abnormal and the data value needs to be corrected. The Haversine formula is:
$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \tag{2}$$
where $d$ represents the actual distance between the two points, $r$ represents the radius of the earth, generally taken as 6,378,137 m, and $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ represent the latitude and longitude of the previous data point and the latter data point, respectively.
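The speed check can be sketched as follows. The Haversine function follows the standard formula with the radius quoted in the text; everything else is an assumption made for illustration: the tuple layout `(t, lat, lon, speed)`, speeds in m/s, taking D as an absolute difference, and interpreting "close" as a relative tolerance `rel_tol`.

```python
import math

EARTH_RADIUS_M = 6_378_137.0  # radius quoted in the text


def haversine(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def speed_is_abnormal(prev, curr, speed_threshold, rel_tol=0.5):
    """prev and curr are (t_seconds, lat, lon, speed_m_per_s) tuples."""
    t0, lat0, lon0, v0 = prev
    t1, lat1, lon1, v1 = curr
    jump = abs(v1 - v0)          # D in the text (absolute value is assumed)
    if jump <= speed_threshold:  # D <= F: no further check is needed
        return False
    l = v1 * (t1 - t0)           # distance implied by the current speed
    d = haversine(lat0, lon0, lat1, lon1)
    # Assumption: "close" means |l - d| is within rel_tol * d.
    return abs(l - d) > rel_tol * d
```

A speed jump that is consistent with the distance actually travelled is kept, while a jump that would imply a much longer or shorter track than observed is flagged for correction.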
A latitude and longitude anomaly is an abnormal deviation of a trajectory point within a short time, which adversely affects the judgment of the subsequent model. Such a point must be deleted and then refilled by interpolation. Whether a trajectory point deviates can be judged from the degree of variation before and after it. Let the current trajectory point be $P_i$, the previous data point $P_{i-1}$, the latter data point $P_{i+1}$, and $P_m$ the midpoint of $P_{i-1}$ and $P_{i+1}$. Two distances between these points are compared, and when the comparison shows an abnormal deviation of $P_i$, the current data point is considered a latitude and longitude anomaly point. The anomaly point is deleted first, and the interpolation method introduced above is then used to fill the data.
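The position check might look like the sketch below. The exact distances and the decision rule are not recoverable from the text, so this example assumes one plausible reading: d1 is the distance from the current point to the midpoint of its two neighbors, d2 is the distance between the two neighbors, and the point is flagged when d1 > d2. The arithmetic midpoint in degrees and all function names are likewise assumptions.

```python
import math


def haversine(lat1, lon1, lat2, lon2, r=6_378_137.0):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def is_position_anomaly(prev, curr, nxt):
    """Each argument is a (lat, lon) tuple in degrees."""
    # Assumption: arithmetic midpoint, adequate for closely spaced points.
    mid = ((prev[0] + nxt[0]) / 2, (prev[1] + nxt[1]) / 2)
    d1 = haversine(curr[0], curr[1], mid[0], mid[1])
    d2 = haversine(prev[0], prev[1], nxt[0], nxt[1])
    return d1 > d2  # assumed criterion, not taken from the text


def drop_position_anomalies(points):
    """Return the trajectory without flagged points; endpoints are kept."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        # Compare against the last accepted point, not a deleted one.
        if not is_position_anomaly(kept[-1], points[i], points[i + 1]):
            kept.append(points[i])
    kept.append(points[-1])
    return kept
```

The gaps left by the removed points would then be refilled with the same interpolation used for missing values.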
Sequence slicing is a data preprocessing technique, widely used in time series analysis, that divides a long sequence into a series of shorter subsequences. Its main functions are managing long sequences and alleviating computer memory limitations. By fixing the input length, sequences of different lengths are transformed into a consistent shape suitable for a deep learning model. For data such as text or time series, slicing can help the model capture short-term temporal dependencies, which reduces computational complexity and may improve model performance. Because the number of trajectory points differs greatly between fishing vessels in the AIS dataset used in this paper, the proposed sequence slicing method divides trajectories of different lengths into the same number of segments, from which statistics are then computed.
The state vector of the fishing vessel at time t can be represented by $x_t$. It contains four state values, the latitude (lat), longitude (lon), speed, and heading direction of the ship at time t, as shown in Formula (3).
$$x_t = [\mathrm{lat}_t, \mathrm{lon}_t, \mathrm{speed}_t, \mathrm{heading}_t] \tag{3}$$
Then all the state vectors experienced by a fishing boat from time 0 to time H form the original data matrix $A$ of the whole fishing boat:
$$A = [x_0, x_1, \ldots, x_H]^{\mathsf{T}}$$
We use $L$ to represent the total length of a trajectory and divide it evenly into $N$ segments; the time step of each segment is then $T = L / N$, and the new data matrix $X_{segment}$ is:
$$X_{segment} = [A_1, A_2, \ldots, A_N]$$
where $A_k$ contains the $T$ consecutive state vectors of the $k$-th segment.
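A minimal sketch of the slicing step is given below, assuming the trajectory is a list of (lat, lon, speed, heading) tuples. Using integer division for the segment length and discarding the leftover points at the end are assumptions; the text does not say how a length that is not divisible by N is handled.

```python
def slice_trajectory(trajectory, n_segments):
    """Split a trajectory (list of state vectors) into n_segments segments
    of equal length T = L // n_segments."""
    total_length = len(trajectory)          # L in the text
    step = total_length // n_segments       # T in the text
    if step == 0:
        raise ValueError("trajectory is shorter than the number of segments")
    # Assumption: the last L % N points are dropped so all segments match.
    return [trajectory[k * step:(k + 1) * step] for k in range(n_segments)]
```

Every vessel then yields the same number of segments regardless of how many points its trajectory contains.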
The speed, course, latitude, and longitude of the fishing vessel can be used as features for judging the type of fishing vessel operation. Based on the slicing results, the features of each segment are extracted: the maximum, minimum, average, median, and standard deviation of each characteristic are computed, giving 5 × 4 = 20 values per segment. The resulting feature matrix is X_feature.
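The feature extraction can be sketched with Python's `statistics` module. The ordering of the 20 values per segment, the use of the population standard deviation (`pstdev`), and the function names are assumptions made for illustration; heading is treated as an ordinary number here, although it is an angle.

```python
import statistics


def segment_features(segment):
    """Return [max, min, mean, median, std] for each of the four state
    values of one segment, concatenated into a flat list of 20 numbers."""
    features = []
    for column in zip(*segment):  # columns: lat, lon, speed, heading
        features.extend([
            max(column),
            min(column),
            statistics.fmean(column),
            statistics.median(column),
            statistics.pstdev(column),  # assumption: population std
        ])
    return features


def feature_matrix(segments):
    """Stack the per-segment features into one row per segment."""
    return [segment_features(segment) for segment in segments]
```

With N segments per vessel, the result has N rows and 20 columns, which gives every vessel an input of the same shape.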