Article

Enhanced Transformer Framework for Multivariate Mesoscale Eddy Trajectory Prediction

1 College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
2 East China Sea Forecasting and Disaster Reduction Center, Ministry of Natural Resources, Shanghai 200136, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(10), 1759; https://doi.org/10.3390/jmse12101759
Submission received: 13 August 2024 / Revised: 23 September 2024 / Accepted: 1 October 2024 / Published: 4 October 2024
(This article belongs to the Section Physical Oceanography)

Abstract

Accurately predicting the trajectories of mesoscale eddies is essential for comprehending the distribution of marine resources and the multiscale energy cascade in the ocean. Nevertheless, current approaches for predicting mesoscale eddy trajectories frequently exhibit inadequate examination of the intrinsic multiscale temporal data, resulting in diminished predictive precision. To address this challenge, our research introduces an enhanced transformer-based framework for predicting mesoscale eddy trajectories. Initially, a multivariate dataset of mesoscale eddy trajectories is constructed and expanded, encompassing eddy properties and pertinent ocean environmental information. Additionally, novel feature factors are delineated based on the physical attributes of eddies. Subsequently, a multi-head attention mechanism is introduced to bolster the modeling of the multiscale time-varying connections within eddy trajectories. Furthermore, the original positional encoding is substituted with Time-Absolute Position Encoding, which considers the dimensions and durations of the sequence mapping, thereby improving the distinguishability of embedded vectors. Ultimately, the Soft-DTW loss function is integrated to more accurately assess the overall discrepancies among mesoscale eddy trajectories, thereby improving the model’s resilience to erratic and diverse trajectory sequences. The effectiveness of the proposed framework is assessed using the eddy-abundant South China Sea. Our framework exhibits exceptional predictive accuracy, achieving a minimum central error of 8.507 km over a seven-day period, surpassing existing state-of-the-art models.

1. Introduction

Mesoscale eddies are pivotal in the oceanic dynamic system, accounting for roughly 90% of the ocean's kinetic energy and significantly influencing upper ocean circulation [1,2,3]. Their role is critical in the mechanisms of energy cascades, heat exchange across multiscale oceanic processes [4,5], and the transport and distribution of materials and resources [6]. Consequently, the precise prediction of mesoscale eddy trajectories is paramount for a comprehensive understanding of the spatiotemporal evolution of resources [7], the intricacies of multiscale energy cascade processes [8], and the dynamics of the coupled ocean–atmosphere climate system [9]. Such predictions are also vital for issuing early warnings to offshore platforms and fishing vessels, thereby averting potential environmental disasters caused by eddies [10]. Traditional methods for characterizing ocean eddies, including conventional physics-based models, have proven inadequate at capturing the nonlinear variations in eddy motion, compromising the reliability of eddy trajectory predictions. The emergence of data-driven deep-learning methodologies has brought significant advancements in various fields, offering new opportunities to improve the accuracy of mesoscale eddy predictions and highlighting the need to incorporate sophisticated data analysis techniques into oceanographic research.
The methodologies for predicting mesoscale eddy trajectories can generally be categorized into three distinct approaches: statistical analysis, numerical simulation, and machine learning. Robinson et al. [11] pioneered the prediction of eddy evolution within the California Current System, using an anisotropic statistical model to analyze mixed spatiotemporal targets over a fortnight. Masina and Pinardi [12] introduced a quasi-geostrophic numerical model with initial condition fields for mesoscale assimilation in the central Adriatic Sea, capable of forecasting mesoscale flow field dynamics over 30 days. Shriver et al. [13] improved forecast resolution from 1/16° to 1/32° by integrating the Navy Layered Ocean Model (NLOM) with Optimal Interpolation (OI) techniques. Despite their contributions, traditional statistical and numerical simulation strategies, which rely heavily on explicit physical principles and data assimilation techniques [14], face challenges in balancing predictive accuracy with computational resource demands.
In recent years, there has been significant growth in the utilization of machine learning for predictive modeling, particularly in the realm of feature extraction and modeling. Li et al. [15] conducted a comprehensive study on the dynamics of eddy propagation in the South China Sea, employing multiple linear regression techniques to develop a reliable prediction model. The advent of deep learning has further revolutionized this area by utilizing its hierarchical model structure to analyze datasets and uncover latent information. The utilization of the efficient backpropagation algorithm in deep learning has led to significant advancements in predictive accuracy and computational resource management [16,17]. This successful equilibrium between precision and computational efficiency has established deep learning as a prominent methodology for forecasting the movement and evolution of mesoscale eddies.
Currently, predominant models for predicting eddy trajectories rely mainly on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, as well as their variants. Wang et al. [18] utilized a combination of LSTM and Extra Trees (ET) methods to improve the prediction of mesoscale eddy characteristics and their propagation trajectories, emphasizing the model's ability to handle high-dimensional data efficiently. Beyond directly feeding eddy numerical data into traditional recurrent neural networks, remote-sensing image frames are also used for trajectory prediction, giving rise to neural networks that combine recurrent and convolutional structures. However, both approaches suffer from severe gradient-vanishing issues and difficulty in capturing long-term dependencies in the data. To address these limitations, Nian et al. [19] incorporated a modified version of LSTM, known as Memory In Memory (MIM), to enhance the model's ability to capture long-term-dependent spatio-temporal features. This augmentation, in conjunction with mesoscale eddy detection techniques, notably increased the spatio-temporal predictability of mesoscale eddies. Ma et al. [20] utilized PredNet, a variant of LSTM that integrates convolutional operations, known as Convolutional Long Short-Term Memory (ConvLSTM), to model and reconstruct spatio-temporal features in Sea-Level Anomaly (SLA) data, demonstrating an average accuracy of 60% in predicting mesoscale eddy trajectories over a seven-day period. To further address the recurrent neural network's tendency to rely overly on the previous time step, attention mechanisms have been widely incorporated into models. Wang et al. [14,21] subsequently developed a deep-learning architecture incorporating spatio-temporal attention and a GRU-based network. This comprehensive framework integrates data from multiple sources, including SLA, eddy observations, and hydrological datasets. Furthermore, incorporating foundational oceanographic insights is important for improving the efficacy of data-driven models. Ge et al. [22] developed a GRU- and LSTM-based neural network model for predicting eddy trajectories that adheres to physical constraints, and introduced a novel loss function guided by geographical constraints to enhance predictive accuracy. Zhu et al. [23] developed an algorithm for precise ten-day predictions of eddy trajectories using a physics-embedded regression model, which improves forecast accuracy by 21% over traditional methods. Tang et al. [24] introduced a Convolutional Gated Recurrent Unit (ConvGRU)-based model for eddy prediction that integrates historical prior statistics to enhance accuracy; this model outperformed existing methods for seven-day forecasts. Despite the demonstrated potential of deep learning in forecasting the trajectories and evolving characteristics of mesoscale eddies, unresolved issues persist:
  • Current research often emphasizes specific intrinsic properties of ocean eddies, such as their size, intensity, and rotation patterns, while overlooking other crucial ocean environmental information like temperature gradients, salinity variations, and surrounding current systems.
  • Recurrent neural network-based methods tend to heavily rely on previous time-step outcomes, and the use of attention mechanisms has been suggested to address this issue. However, the implementation of these mechanisms often involves analyzing eddy data within limited temporal windows, which poses challenges for accurately forecasting trajectories of eddies with prolonged lifespans and results in a notable loss of valuable multi-scale temporal information.
  • Nonlinear dynamics in mesoscale eddies cause significant trajectory variability, challenging multi-step predictions. Traditional Euclidean distance loss functions struggle with local discrepancies, making it difficult to capture the overall trajectory pattern and reducing prediction accuracy.
The transformer model, utilizing the self-attention mechanism exclusively, has demonstrated notable effectiveness in analyzing time series data by adeptly capturing multi-scale temporal information from extensive data sequences [25,26,27,28]. To tackle the difficulties related to forecasting mesoscale eddy trajectories, an enhanced transformer framework customized for mesoscale eddy trajectory prediction is presented. This framework integrates diverse data variables, such as intrinsic properties of ocean eddies and ocean environmental information, enabling both single-step and multi-step prediction functionalities. In conclusion, the main contributions of this article are summarized as follows:
  • The construction and expansion of multivariate eddy data involves the integration, processing, and extraction of data regarding changes in ocean environmental information within eddy regions from various altimeter satellites. This process aims to capture the motion characteristics of eddy trajectories, particularly focusing on Mesoscale Eddy Trajectory (MET) and SLA data. Building upon this framework, novel features are developed for eddy characterization, involving the creation of new variables derived from raw data to more accurately depict the motion dynamics of mesoscale eddies.
  • The enhanced transformer framework for predicting mesoscale eddy trajectories improves the identification of multi-scale dependencies by integrating a multi-head attention mechanism to analyze dependencies within lengthy time series. The conventional positional encoding is replaced with Time-Absolute Position Encoding (TAPE) to improve the differentiation among embedded vectors by taking into account sequence-mapping dimensions and length. Moreover, the incorporation of the Soft-DTW loss function, a differentiable version of Dynamic Time Warping (DTW) that allows for smooth optimization in deep-learning models, enables a more precise assessment of overall deviations in mesoscale eddy trajectories, thereby bolstering the model’s resilience to noise and trajectories exhibiting substantial variations, ultimately enhancing the predictive accuracy of mesoscale eddy trajectories.
  • Extensive empirical investigations have been undertaken, revealing that the suggested framework for predicting mesoscale eddy trajectories outperforms existing mainstream approaches (outperforming LSTM by 64.3% and GRU by 67.6%) in terms of predictive accuracy, achieving the lowest average central error of 8.294 km (LSTM: 23.259 km, GRU: 25.582 km) over 7 days, which is superior to most existing models.
The subsequent sections of this paper are structured as follows: Section 2 outlines the preprocessing steps for multivariate mesoscale eddy time series data and introduces the enhanced transformer framework for predicting mesoscale eddy trajectories, detailing the framework’s structure and specific enhancements. Section 3 presents an analysis of the experimental setup and results. Finally, Section 4 offers a summary and discussion of the paper as a whole, along with suggestions for future research directions.

2. Methods

This section outlines the methodology for feature construction and expansion in the context of multivariate trajectory data pertaining to mesoscale eddies. This process aims to enhance the representation of mesoscale eddies by integrating latitude and longitude, amplitude, radius, and other relevant ocean environmental information. Furthermore, we introduce an enhanced transformer framework for the prediction of mesoscale eddy trajectories and offer a comprehensive elucidation of the specific components that have undergone enhancements.

2.1. Construction and Expansion of Multivariate Mesoscale Eddy Features

In order to enhance the representation of mesoscale eddy characteristics, we employ MET and SLA data sourced from Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO) and the Copernicus Marine Environment Monitoring Service (CMEMS). These datasets characterize sea surface height by integrating data from multiple altimeter satellites (Jason-3, Saral[-DP]/AltiKa, Cryosat-2, OSTM/Jason-2, Jason-1, Topex/Poseidon, Envisat, GFO, ERS-1/2, and Haiyang-2A/B) and are distributed in NetCDF format. Further details regarding the MET and SLA datasets are available in Table 1.
Here, the MET and SLA datasets have a temporal resolution of 1 day. The MET dataset covers the period from 1 January 1993 to 9 February 2022, while the SLA dataset covers the period from 1 January 1993 to 4 August 2022. To fuse the two datasets in our study, we selected the overlapping data within the common time range for training, validation, and testing. SLA is a gridded dataset with a spatial resolution of 0.25°, whereas MET is a high-resolution time series derived from altimeter satellite data through dedicated eddy tracking. To integrate these datasets, we filter and match them using the following procedure:
$Time_{MET_i} = Time_{SLA_j}$  (1)
$\left| Lat_{MET_i} - Lat_{SLA_j} \right| \le x, \quad \left| Lon_{MET_i} - Lon_{SLA_j} \right| \le x$  (2)
The neighborhood radius $x$ is set to 0.25°. When multiple SLA grid cells match an eddy center, the cell closest to the eddy center is selected as the match. Additionally, to mitigate the impact of differing measurement scales among features on model accuracy, the dataset is normalized to zero mean.
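For illustration, a minimal sketch of this matching step is given below (the array names and the flattened-grid layout are assumptions; the temporal alignment of Formula (1) is taken as already done):

```python
import numpy as np

def match_sla_cell(eddy_lat, eddy_lon, sla_lats, sla_lons, x=0.25):
    """Pick the SLA grid cell for one eddy center (Formulas (1)-(2)).

    sla_lats/sla_lons are 1-D arrays of grid-cell center coordinates for
    the same day as the eddy observation.
    """
    # Candidate cells inside the 0.25-degree neighborhood in both coordinates
    mask = (np.abs(sla_lats - eddy_lat) <= x) & (np.abs(sla_lons - eddy_lon) <= x)
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return None  # no grid cell matches this eddy center
    # If several cells match, keep the one closest to the eddy center
    d2 = (sla_lats[idx] - eddy_lat) ** 2 + (sla_lons[idx] - eddy_lon) ** 2
    return idx[np.argmin(d2)]
```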
The South China Sea (0°–25° N, 100°–125° E) was selected as the research area for investigating mesoscale eddies due to their abundant presence and distinctive spatiotemporal variability [29,30,31]. The preprocessing of multi-source mesoscale eddy trajectory data is illustrated in Figure 1.

2.1.1. Mesoscale Eddy Trajectory Feature Expansion

Additionally, two supplementary features, velocity and azimuth, are introduced to more comprehensively capture the dynamic variations in eddy trajectories. These features quantify the speed and direction of eddy motion, facilitating a thorough examination of the changing dynamics and characteristics of eddies over time. Velocity quantifies the rate of eddy movement between discrete time points, while azimuth defines the direction of eddy motion relative to a specified reference point. Specifically, velocity is derived from the Haversine formula, as follows:
$d = 2R \cdot \arcsin\sqrt{\sin^2\!\left(\frac{Lat_{t+1} - Lat_t}{2}\right) + \cos Lat_t \cdot \cos Lat_{t+1} \cdot \sin^2\!\left(\frac{Lon_{t+1} - Lon_t}{2}\right)}$  (3)
$Velocity = \frac{d}{\Delta t}$  (4)
where $R$ is the radius of the Earth (6371 km), $Lon_t$ and $Lat_t$ denote the longitude and latitude of the eddy center at time $t$, and $Lon_{t+1}$ and $Lat_{t+1}$ denote those at time $t+1$. The moving average velocity of the eddy is the ratio of the computed spherical displacement $d$ between the two points to the time difference $\Delta t$.
Azimuth denotes the horizontal angle, measured clockwise from the meridian passing through the center of a given eddy to the direction of the reference point, as depicted in Figure 2. By applying the sine and cosine rules of spherical trigonometry, one can determine the magnitude of the dihedral angle θ. The calculation formula is presented below:
$u = \sin(Lon_{i+1} - Lon_i) \cdot \cos Lat_{i+1}$
$v = \cos Lat_i \cdot \sin Lat_{i+1} - \sin Lat_i \cdot \cos Lat_{i+1} \cdot \cos(Lon_{i+1} - Lon_i)$
$Azimuth = \operatorname{atan2}(u, v)$  (5)
The determination of azimuth utilizes the atan2 function, a modified version of the arctan function that identifies the exact quadrant for the angle calculated, thus removing any directional uncertainty. The function is defined as follows:
$\operatorname{atan2}(u, v) = \begin{cases} \arctan(u/v) & v > 0 \\ \arctan(u/v) + \pi & v < 0,\ u \ge 0 \\ \arctan(u/v) - \pi & v < 0,\ u < 0 \\ \pi/2 & v = 0,\ u > 0 \\ -\pi/2 & v = 0,\ u < 0 \\ 0 & v = 0,\ u = 0 \end{cases}$  (6)
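Putting Formulas (3)–(6) together, the two expanded features could be computed as in the following sketch (the helper name is illustrative; with the dataset's daily resolution, $\Delta t = 1$ day):

```python
import numpy as np

R_EARTH = 6371.0  # Earth radius R in km

def velocity_and_azimuth(lat1, lon1, lat2, lon2, dt=1.0):
    """Velocity (Formulas (3)-(4)) and azimuth (Formulas (5)-(6)) between
    consecutive eddy centers; coordinates in degrees, dt in days."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dlat, dlon = p2 - p1, np.radians(lon2 - lon1)
    # Haversine great-circle distance d in km
    d = 2 * R_EARTH * np.arcsin(np.sqrt(np.sin(dlat / 2) ** 2 +
        np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2))
    # Bearing components u, v; np.arctan2 resolves the quadrant as in Formula (6)
    u = np.sin(dlon) * np.cos(p2)
    v = np.cos(p1) * np.sin(p2) - np.sin(p1) * np.cos(p2) * np.cos(dlon)
    return d / dt, np.arctan2(u, v)
```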

2.1.2. Multivariate Mesoscale Eddy Trajectory Feature Construction

We select features from MET, including track, time, longitude, latitude, amplitude, speed, and radius, along with features from SLA, namely Adt, Ugos, and Vgos, together with the expanded features velocity and azimuth; these are integrated to express the mesoscale eddy motion process. Eddies with lifecycles of 30 days or longer are included in our analysis to focus on long-term eddy trajectory prediction. Specifically, the prediction of mesoscale eddy trajectories is illustrated in Figure 3, which outlines the criteria used to determine the center of eddies and provides visual representations of the physical properties and ocean environmental information associated with eddies. Our work aims to predict the future trajectory of eddies for the next 7 days from the historical trajectories of eddy centers, which constitute the data input to the model.

2.2. Enhanced Transformer-Based Framework for Mesoscale Eddy Trajectory Prediction

The proposed prediction framework for multivariate mesoscale eddy trajectories, illustrated in Figure 4, integrates various components such as an encoder–decoder, multi-head attention mechanism, TAPE, and Soft-DTW. The encoder–decoder employs a sliding window to input preprocessed data into the framework, utilizing convolution and average pooling for the initial transformation of raw data to extract high-level features. Furthermore, the introduction of the multi-head attention mechanism allows for the exploration of multi-scale temporal dependencies among mesoscale eddy trajectories. Additionally, TAPE is introduced to analyze the positional connections within the trajectory sequence, thereby improving the differentiation among trajectory sequences. Lastly, the framework utilizes the Soft-DTW loss function for comprehensive analysis of the multi-step prediction outcomes.

2.2.1. Encoder and Decoder

As shown in Figure 4, the transformer model is a deep-learning architecture primarily composed of an encoder and a decoder, both of which rely heavily on attention mechanisms. A key component of this model is the multi-head self-attention mechanism, where the input is processed through three matrices: Query (Q), Key (K), and Value (V). These matrices allow the model to weigh the importance of different tokens in a sequence, enabling it to capture long-range dependencies effectively. The encoder processes the input sequence by computing attention scores through Q, K, and V, while the decoder uses these attention mechanisms to generate outputs in sequence-to-sequence tasks, such as machine translation and text generation. This structure allows the transformer to handle complex dependencies in data without relying on traditional sequential processing. The following provides a more detailed explanation:
Encoder: The encoder is employed to model the preprocessed historical trajectory data of mesoscale eddies. Specifically, before the data are fed into the encoder, they pass through a linear layer that transforms them from the initial $i$ dimensions to $d_{Traj}$ dimensions, where $d_{Traj}$ is the mapping dimension for the input and output of the eddy trajectory sequence. In tandem, convolution and average pooling operations are applied to the $d_{Traj}$-dimensional data to distill high-level features, with the convolution's output channels also set to $d_{Traj}$ dimensions. The outcome of the pooling process is then combined with the data encoded using TAPE. Following this, the encoder projects the data linearly to produce Q, K, and V, which are further processed through the multi-head self-attention mechanism to derive attention scores. Given the inherent limitations in the expressive capacity of self-attention, a fully connected feed-forward network is incorporated to augment the data's representational depth. It is important to highlight that both before and after the self-attention and feed-forward stages, residual connections and layer normalization are applied to the data, aiming to alleviate potential gradient-related issues.
Decoder: Expanding upon the encoder, the decoder incorporates an extra layer of self-attention, utilizing the output K and V from the encoder. This initial self-attention layer is configured with masking to inhibit any premature disclosure of predictive insights. Additionally, the segment of the trajectory slated for prediction is fed into the decoder through a sliding-window mechanism. In a bid to strike an optimal balance between computational efficiency and the quality of predictions, the stack depths N (for the encoder) and M (for the decoder) are both strategically set to 3.
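For concreteness, the sketch below arranges these pieces in PyTorch, the framework used in Section 3.1. The class and layer names, the input feature count, and the two-dimensional output head are illustrative assumptions, and the TAPE tensors pos_src and pos_tgt are assumed to be computed separately as in Section 2.2.3:

```python
import torch
import torch.nn as nn

class EddyTransformer(nn.Module):
    """Minimal sketch of the layout described above, not the authors' exact code."""

    def __init__(self, n_features=12, d_traj=256, n_heads=4, depth=3):
        super().__init__()
        self.embed = nn.Linear(n_features, d_traj)       # i -> d_Traj dimensions
        self.conv = nn.Conv1d(d_traj, d_traj, kernel_size=3, padding=1)
        self.pool = nn.AvgPool1d(kernel_size=3, stride=1, padding=1)
        self.core = nn.Transformer(d_model=d_traj, nhead=n_heads,
                                   num_encoder_layers=depth,  # N = 3
                                   num_decoder_layers=depth,  # M = 3
                                   batch_first=True)
        self.head = nn.Linear(d_traj, 2)                 # predicted (Lat, Lon)

    def forward(self, src, tgt, pos_src, pos_tgt):
        # Convolution + average pooling distill high-level features,
        # then the TAPE encoding is added to the pooled representation
        z = self.embed(src).transpose(1, 2)              # (B, d_traj, L) for Conv1d
        src = self.pool(self.conv(z)).transpose(1, 2) + pos_src
        tgt = self.embed(tgt) + pos_tgt
        # Causal mask keeps the decoder from seeing future prediction steps
        L = tgt.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        return self.head(self.core(src, tgt, tgt_mask=mask))
```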

2.2.2. Multi-Head Self-Attention

The multi-head attention mechanism is essential for capturing multi-scale temporal dependencies between mesoscale eddy trajectories, distinguishing it as a key advantage over recurrent neural networks like LSTM and GRU. This mechanism involves the linear combination of Q with sets of K and V to produce an output that is a weighted sum of the values. The weights assigned to each V are determined by a compatibility function that evaluates the correlation between the Q and K [25]. Through the utilization of self-attention, the model transforms Q, K, and V into multiple dimensions, employing multiple attention functions simultaneously to generate outputs across various dimensions. The multi-dimensional outputs are combined and subjected to a linear projection, enabling the model to comprehend and learn diverse representations across different subspaces. This multi-faceted approach enhances the model’s ability to perceive and encompass various relational dynamics within the sequence, thereby greatly enhancing its capacity to capture complex relationships across multiple sequence dimensions.
In this research, each observation point of an eddy center is defined as $traj_t = \{Lat_t, Lon_t, \ldots, Vgos_t\}$, with $Lat_t, Lon_t, \ldots, Vgos_t$ representing the heterogeneous features of the eddy center at a specific time $t$. The eddy trajectory sequence is defined as $Traj = \{traj_1, traj_2, \ldots, traj_n\}$, where $n$ represents the number of timestamps in the current sequence, i.e., the number of discretized eddy observations. The corresponding formulas for multi-head attention are as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concatenate}(head_1, head_2, \ldots, head_m)\, W^O$  (7)
$head_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left( \frac{Q_i K_i^T}{\sqrt{d_k}} \right) V_i$  (8)
$Q_i = Traj \cdot W_i^Q, \quad K_i = Traj \cdot W_i^K, \quad V_i = Traj \cdot W_i^V$  (9)
In Formula (7), $m$ represents the number of attention heads and $W^O \in \mathbb{R}^{m d_v \times d_{Traj}}$ is a learnable parameter; each head's calculation is expressed in Formulas (8) and (9). $Q_i$, $K_i$, and $V_i$ represent the projected query, key, and value matrices in $head_i$, where $W_i^Q \in \mathbb{R}^{d_{Traj} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{Traj} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{Traj} \times d_v}$, with $d_k$ and $d_v$ being the projection dimensions of the queries and keys, and of the values, respectively. In this study, $m = 4$ and $d_k = d_v = d_{Traj}/m = 64$. The transformer's multi-head self-attention mechanism adeptly encodes and aggregates individual observation points within the eddy trajectory sequence. By calculating the similarity between a given observation point and all others, and using these similarity scores as weights, self-attention amalgamates the linearly weighted feature vectors of the other elements. Consequently, this imbues each observation point with aggregated information from its counterparts within the specified time window. Such a strategy enhances the model's ability to capture and understand the dependencies between elements in long sequences, ensuring a more nuanced and comprehensive analysis of temporal relationships.
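Formulas (7)–(9) can be read directly as code; the following sketch assumes the projection tensors w_q, w_k, w_v, and w_o are learned parameters supplied by the caller:

```python
import torch

def multi_head_self_attention(traj, w_q, w_k, w_v, w_o):
    """Formulas (7)-(9). traj: (batch, n, d_traj); w_q/w_k/w_v: (m, d_traj, d_k);
    w_o: (m * d_v, d_traj). With d_traj = 256 and m = 4 heads, d_k = d_v = 64."""
    m, d_k = w_q.shape[0], w_q.shape[-1]
    heads = []
    for i in range(m):
        q, k, v = traj @ w_q[i], traj @ w_k[i], traj @ w_v[i]     # (batch, n, d_k)
        # Scaled dot-product compatibility scores, Formula (8)
        scores = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
        heads.append(scores @ v)                                  # weighted sum of V
    # Concatenate all heads and project back to d_traj, Formula (7)
    return torch.cat(heads, dim=-1) @ w_o
```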

2.2.3. Time-Absolute Position Encoding

In order to improve the accuracy and reliability of our model in forecasting eddy trajectories, we address an issue with the original positional encoding in the transformer architecture, which can blur distinctions among certain embedding dimensions. We propose the implementation of TAPE as an alternative to the Original Positional Encoding (OPE), with the aim of improving the discrimination of embedding vectors by taking sequence-specific dimensions and lengths into account. The transformer architecture, lacking recursive and convolutional structures, is limited in its ability to preserve essential positional information, particularly the sequential order of time series data, within its self-attention mechanism. To address this deficiency, many approaches incorporate absolute or relative positional encodings, thereby improving the transformer's comprehension of temporal context [32,33]. The original self-attention mechanism employed sine and cosine functions for absolute positional encoding to handle lengthy sequences [25]. This approach is encapsulated in the following formulas:
$OPE_{(k,\,2i)} = \sin\!\left( \frac{k}{10000^{2i/d_{Traj}}} \right), \quad OPE_{(k,\,2i+1)} = \cos\!\left( \frac{k}{10000^{2i/d_{Traj}}} \right)$  (10)
where $k$ is the temporal position in the sequence, $i$ is the dimension index, and $i \in [0, d_{Traj}/2)$. Each dimension $i$ corresponds to a sinusoid, ensuring that positions up to the order of magnitude of $10^4$ do not share repeated embeddings.
While this method proves effective for datasets with high embedding dimensions, it falls short of maximizing the embedding vector space in lower dimensions, making randomly embedded vectors difficult to distinguish [34]. Consequently, we propose replacing the OPE with TAPE, which takes the sequence length into account. We modify the frequency terms of the sine and cosine functions by incorporating the sequence length $L_{Traj}$ into the original formula. The specific formulas are as follows:
$TAPE_{(k,\,2i)} = \sin\!\left( \frac{k \cdot d_{Traj}}{4 L_{Traj} \cdot 10000^{2i/d_{Traj}}} \right), \quad TAPE_{(k,\,2i+1)} = \cos\!\left( \frac{k \cdot d_{Traj}}{4 L_{Traj} \cdot 10000^{2i/d_{Traj}}} \right)$  (11)
In Formula (11), when $d_{Traj} = 4L_{Traj}$, TAPE reduces to the encoding scheme of Formula (10). Therefore, as shown in Figure 5, we compare the embedding vectors of TAPE and OPE in two scenarios. Figure 5a depicts the situation when $d_{Traj} > 4L_{Traj}$: the similarity (dot product) between TAPE embedding vectors is lower than that of OPE, so TAPE makes better use of the embedding vector space to differentiate the positions of two vectors. In Figure 5b, when $d_{Traj} < 4L_{Traj}$, the similarity (dot product) between OPE embedding vectors fluctuates and its monotonic decreasing trend becomes insignificant. Although the overall similarity (dot product) of TAPE is higher than that of OPE, the declining trend of TAPE is more pronounced, better reflecting the differences induced by the distance between two vectors. The advantage that TAPE brings to the prediction framework is also demonstrated in the subsequent ablation studies.
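The two encodings differ only in a rescaling of the sinusoid frequencies, as the following sketch illustrates (a direct transcription of Formulas (10) and (11); $d_{Traj}$ is assumed even):

```python
import numpy as np

def positional_encoding(l_traj, d_traj, tape=True):
    """OPE (Formula (10)) or TAPE (Formula (11)); returns (l_traj, d_traj)."""
    k = np.arange(l_traj)[:, None]            # temporal positions
    i = np.arange(d_traj // 2)[None, :]       # embedding dimension index
    freq = 10000.0 ** (2 * i / d_traj)
    # TAPE rescales positions by d_Traj / (4 * L_Traj); OPE leaves them as-is
    scale = d_traj / (4.0 * l_traj) if tape else 1.0
    pe = np.zeros((l_traj, d_traj))
    pe[:, 0::2] = np.sin(k * scale / freq)
    pe[:, 1::2] = np.cos(k * scale / freq)
    return pe

# When d_traj = 4 * l_traj the two schemes coincide, as noted for Formula (11)
```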

2.2.4. Soft-DTW Loss Function

In the realm of multi-step eddy trajectory prediction, traditional loss functions based on Euclidean distance tend to emphasize minor pointwise differences within sequences, limiting their capacity to capture the larger-scale evolutionary trends of eddy trajectories. This limitation hinders a comprehensive understanding of the overarching motion dynamics of eddies. To address this issue, we adopt Soft-DTW, a loss function recognized for its effectiveness in assessing similarity among time series, which relaxes the strict alignment constraints of DTW. This relaxation makes Soft-DTW adaptable to the diverse variations present in different time series, increasing its robustness against noise and significant fluctuations in sequences.
DTW is a methodical technique for temporal alignment that determines the most efficient alignment path between two time series through the computation of an accumulated distance matrix. DTW first computes an $n \times m$ pairwise distance matrix between corresponding points, after which a dynamic program (DP) is solved using Bellman's recursion, at a quadratic cost of $O(nm)$. The precise formula is as follows:
$DTW(Traj_{pred}, Traj_{truth}) := \min_{A \in \mathcal{A}_{n,m}} \left\langle A,\, \Delta(Traj_{pred}, Traj_{truth}) \right\rangle$  (12)
where $Traj_{truth} \in \mathbb{R}^{m \times d}$ and $Traj_{pred} \in \mathbb{R}^{n \times d}$ represent the true and predicted trajectories, respectively. $A \in \mathcal{A}_{n,m} \subset \{0,1\}^{n \times m}$ denotes a binary alignment matrix [35], encoding a monotonic path connecting the two diagonal corners of the matrix, and $\Delta(Traj_{pred}, Traj_{truth})$ is the pairwise distance matrix. DTW finds the optimal alignment by minimizing the inner product between $A$ and $\Delta(Traj_{pred}, Traj_{truth})$. However, this minimization is discrete and non-differentiable, making DTW unsuitable as a loss function for backpropagation. Building on DTW, Soft-DTW introduces a smoothing parameter $\gamma$ ($\gamma > 0$), which allows some smoothness between points on the alignment path instead of strict matching. Letting $a_i$ denote the inner-product cost of the $i$-th alignment between $A$ and $\Delta(Traj_{pred}, Traj_{truth})$, the soft minimum is computed as follows:
$\min{}^{\gamma}\{a_1, \ldots, a_n\} = -\gamma \log \sum_{i=1}^{n} e^{-a_i/\gamma}$  (13)
Soft-DTW computes a soft minimum over all alignment costs, with $\gamma$ regulating the magnitude of the overall distance cost and serving as a smoothing parameter. Moreover, the inclusion of $\gamma$ makes Soft-DTW explicitly differentiable, facilitating optimization with algorithms such as gradient descent. The function can therefore serve as a loss function for tasks such as time series classification and multi-step prediction, surpassing traditional methods based on Euclidean distance [35].
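A plain reference sketch of the resulting loss is shown below; it applies the soft minimum of Formula (13) inside Bellman's recursion (an optimized implementation with a hand-written backward pass would be preferred for training at scale):

```python
import torch

def soft_dtw(pred, truth, gamma=0.1):
    """Soft-DTW between pred (n, d) and truth (m, d) via Bellman's recursion.

    A plain O(nm) sketch; autograd differentiates it end to end thanks to
    the smooth minimum of Formula (13).
    """
    n, m = pred.size(0), truth.size(0)
    delta = torch.cdist(pred, truth) ** 2     # pairwise squared distances
    inf = torch.tensor(float("inf"))
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = torch.tensor(0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft minimum over the three admissible predecessor cells
            prev = torch.stack([r[i - 1][j - 1], r[i - 1][j], r[i][j - 1]])
            r[i][j] = delta[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, 0)
    return r[n][m]
```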

3. Experiments and Results

3.1. Experiment Configuration

The experiment was conducted using PyTorch version 1.13.1 in Python 3.9.16, with training performed on a GeForce RTX Titan GPU with 48 GB of RAM. The NetCDF and NumPy libraries were employed for data input and storage. The model utilized Soft-DTW as the loss function and Adam as the optimizer, with an initial learning rate of 0.0001. The experiment was run for 100 epochs with a batch size of 32.
The dataset used consisted of a multivariate mesoscale eddy trajectory time series, comprising 4865 eddy trajectories, including 2413 cyclonic and 2452 anticyclonic eddies. The trajectories were partitioned into training, validation, and testing sets, with 80%, 10%, and 10% of the trajectories, respectively, assigned to each set. In order to consider the longevity and consistency of eddies, our framework utilized a sliding-window methodology for forecasting future eddy centers over multiple time steps. Specifically, sliding windows of 7, 14, and 21 days were employed to predict the eddy centers for the subsequent 7 days.
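A sketch of this sliding-window slicing is given below (the per-trajectory array layout is an assumption; a trajectory must span at least in_len + out_len days, which the 30-day lifecycle filter of Section 2.1.2 guarantees for the 21 + 7 case):

```python
import numpy as np

def sliding_windows(traj, in_len=21, out_len=7):
    """Slice one eddy trajectory (T, n_features) into (history, future) pairs."""
    xs, ys = [], []
    for s in range(len(traj) - in_len - out_len + 1):
        xs.append(traj[s:s + in_len])                       # e.g., 21-day history
        ys.append(traj[s + in_len:s + in_len + out_len])    # next 7 days to predict
    return np.stack(xs), np.stack(ys)
```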

3.2. Evaluation Indicators

Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are commonly used metrics in machine learning for regression tasks, serving to quantitatively assess the disparities between predicted and true values. The accuracy of predicting the location of eddy centers is evaluated by computing the MAE and RMSE of the distance errors.
Both metrics incorporate the great-circle distance between the true and predicted eddy center positions, computed from their longitudes and latitudes as follows:
$d = 2R \cdot \arcsin\sqrt{\sin^2\!\left(\frac{Lat_{truth} - Lat_{pred}}{2}\right) + \cos Lat_{pred} \cdot \cos Lat_{truth} \cdot \sin^2\!\left(\frac{Lon_{truth} - Lon_{pred}}{2}\right)}$  (14)
$MAE = \frac{1}{n} \sum_{i=1}^{n} d_i$  (15)
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$  (16)
where $Lon_{truth}$ and $Lat_{truth}$ represent the true longitude and latitude of the eddy center, and $Lon_{pred}$ and $Lat_{pred}$ represent the predicted values. The parameter $n$ is the number of forecast days.
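Formulas (14)–(16) can be computed jointly, as in this sketch (inputs are arrays of true and predicted center coordinates in degrees):

```python
import numpy as np

def center_error_metrics(lat_t, lon_t, lat_p, lon_p):
    """MAE and RMSE in km over great-circle center errors (Formulas (14)-(16))."""
    phi_t, phi_p = np.radians(lat_t), np.radians(lat_p)
    dlat, dlon = np.radians(lat_t - lat_p), np.radians(lon_t - lon_p)
    d = 2 * 6371.0 * np.arcsin(np.sqrt(np.sin(dlat / 2) ** 2 +
        np.cos(phi_p) * np.cos(phi_t) * np.sin(dlon / 2) ** 2))
    return d.mean(), np.sqrt((d ** 2).mean())
```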

3.3. Ablation Studies

In order to assess the impact of various datasets, feature expansion, and enhanced modules on the efficacy of our model in predicting mesoscale eddy trajectories, a series of ablation experiments was conducted, and the results are shown in Table 2. The experiments involved comparing the seven-day average forecast errors across six different cases. In Case 1, solely utilizing the MET dataset resulted in the highest errors in eddy center prediction, with MAE and RMSE values of 21.970 km and 27.421 km, respectively. This outcome can be attributed to the limited number of eddy-related features present in the MET dataset, which hinders the model’s ability to capture the intricate nonlinear relationships and interactive dynamics of eddies. Cases 2 and 3 demonstrate the advantages of incorporating additional azimuth and velocity features into the SLA dataset, resulting in a reduction in MAE to 14.900 km and 11.680 km, and RMSE to 17.748 km and 14.062 km, respectively. In Cases 4 and 5, the inclusion of convolutional layers, average pooling, and TAPE further improves model performance by leveraging the enriched features. Empirical findings indicate that the integration of convolution and average pooling mechanisms, along with TAPE, leads to a decrease in MAE to 11.127 km and 8.986 km, and RMSE to 13.771 km and 10.897 km, respectively. Significantly, TAPE demonstrates superior predictive accuracy compared to convolution and average pooling methods. When compared to Case 3, MAE is enhanced by 4.7% and 23.1% with convolution and average pooling, respectively, and by 2.1% and 20.9% with TAPE, respectively. Ultimately, the combined use of convolution, average pooling, and TAPE results in the highest accuracy, achieving an MAE of 8.294 km and an RMSE of 9.874 km.

3.4. Comparative Experiment and Analysis

This section undertakes a series of experimental comparisons and analyses to evaluate the performance and efficacy of different techniques in forecasting mesoscale eddy trajectories. Initially, the study examines the performance disparities between different loss functions in predicting mesoscale eddy trajectories over a seven-day period, using a historical trajectory length of 21 days as input; the results of this analysis are presented in Table 3. Next, the study assesses the prediction outcomes when the length of the input trajectory is varied through different sliding-window sizes, as outlined in Table 4. Visual representations of the prediction results for eight selected trajectories are then presented to enhance understanding of the model's effectiveness, as illustrated in Figure 6. Additionally, the visualization of attention heatmaps for features serves as a robust tool for validating the efficacy of the proposed method, offering visual evidence of its capability to capture long-term temporal dependencies in mesoscale eddy trajectory sequences, as demonstrated in Figure 8.

3.4.1. Comparison of Loss Functions

The evaluation of training performance for various models is conducted using Soft-DTW loss and L2 loss. The L2 loss function is utilized to reduce the Euclidean distance between the predicted trajectory and the ground truth trajectory in Euclidean space. The formula is expressed as follows:
$L_2 = \frac{1}{n} \sum_{i=0}^{n-1} \frac{1}{m} \sum_{t=1}^{m} \left( y_i^t - \hat{y}_i^t \right)^2$  (17)
where $m$ represents the number of forecast days, $n$ represents the batch size, and $y_i^t$ and $\hat{y}_i^t$ denote the $i$-th actual and predicted values at time $t$, respectively. Informer [26], LSTM [36], Bidirectional Long Short-Term Memory (BiLSTM) [37], GRU [38], and Bidirectional Gated Recurrent Unit (BiGRU) [38] have been selected as benchmark models for comparative analysis.
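For comparison with the Soft-DTW sketch in Section 2.2.4, Formula (17) amounts to a one-line mean squared error (a minimal sketch):

```python
import torch

def l2_loss(pred, truth):
    """Formula (17): mean squared error over batch (n) and forecast days (m)."""
    return ((pred - truth) ** 2).mean()
```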
As indicated in Table 3, we standardize the input length to 21 and contrast the average prediction errors for the eddy center over a seven-day period. Significantly, when utilizing the Soft-DTW loss function for training, the informer model demonstrates a modest enhancement of 3.3% and 2.5% in MAE and RMSE, respectively, compared to its performance with the L2 loss function. In contrast, the LSTM model shows enhancements of 14.5% and 12.6%, BiLSTM by 9.3% and 7.3%, GRU by 13.4% and 10.4%, and BiGRU by 12.2% and 10.8%. Notably, our method exhibits a substantial improvement in accuracy, with MAE and RMSE improving by 31.6% and 32.0%, respectively. The experimental findings indicate that utilizing Soft-DTW loss for training yields superior results compared to training with L2 loss across various models, primarily attributed to its capacity for comprehensive parsing of multi-step prediction outputs.

3.4.2. Comparison of Different Input Trajectory Lengths

In order to enable a more straightforward comparison of prediction results across varying input lengths, the training outcomes are consolidated in Table 4. The LSTM, BiLSTM, GRU, and BiGRU employ a multi-step prediction mode known as single-step rolling prediction, wherein predicted values are sequentially fed back into the model, causing errors to accumulate. Notably, the BiLSTM and BiGRU architectures enhance the foundational structures of LSTM and GRU by integrating both past and future context at each timestep. At each time step, the forward and backward hidden states of BiLSTM and BiGRU capture features from both past and future time points, resulting in their superior performance compared to LSTM and GRU across a range of input lengths. Additionally, the impact of error propagation is less significant in the context of seven-day multi-step prediction. Specifically, BiLSTM demonstrates a slight advantage over our model during the third to fifth days when utilizing a 14-day input length, highlighting the effectiveness of its bidirectional capabilities. Furthermore, our comparative analysis extends to the informer model, a transformer-based approach specifically tailored for time series forecasting. Although informer improves computational complexity, memory usage, and the initial decoding process within the decoder, its predictive performance across varying input lengths is not as strong as that of the other models. This disparity can be largely attributed to informer's self-attention distilling mechanism, which is most effective on extremely long time series. Moreover, characteristics of the eddy data, such as their limited lifecycles, prevent the sequences from reaching the lengths at which informer excels.
Furthermore, our model demonstrates superior performance in terms of daily average errors MAE and RMSE when considering input lengths of 7 and 21 days. In the context of varying input lengths, our model exhibits notable improvements in both MAE and RMSE compared to the GRU and BiLSTM. Specifically, for a seven-day input length, our model achieves a 37.4% reduction in MAE and a 45.8% decrease in RMSE compared to the GRU. Similarly, with a 21-day input length, our model demonstrates a 4.9% decrease in MAE and a 13.1% decrease in RMSE compared to the BiLSTM. Additionally, our model maintains or improves prediction accuracy as the forecast horizon increases, with a noticeable decrease in MAE from the first to the fifth day for a 14-day input length. The enhancement observed can be attributed to the distinctive mechanism of our model, which not only emphasizes the capture of long-distance dependencies within the input sequence but also acquires knowledge of the correlations between the model’s outputs at each iteration, facilitating a more comprehensive understanding of global information.
Concurrently, our model demonstrates slightly superior performance compared to the other two cases when the input sequence length is 14 days, suggesting that predictive performance is not solely determined by the length of the input sequence. The complex and nonlinear motion of eddies results in a diminishing ability of the model to extract global information as the input length continues to increase. Nevertheless, the extent of this decrease remains within acceptable limits. In comparison to a 14-day input length, extending the input length to 21 days results in a 15.4% decrease in the precision of the average error over a seven-day forecast. However, this precision remains 13.6% higher than that of a seven-day input. Thus, it is imperative for future research to delve deeper into the physical behaviors of eddies and carefully select an appropriate input length.

3.4.3. Comparison of Visualization of Prediction Results

Additionally, we have set the input lengths of the model to 7, 14, and 21 days, respectively, and utilized Soft-DTW for training purposes. The predicted outcomes of our approach are displayed in Figure 6, showcasing 8 selected representative trajectories for visualization (traj 1–8). The blue, orange, and green points in the graph, respectively, symbolize historical trajectories, true trajectories, and trajectories forecasted by our model. The analysis reveals that our model is capable of accurately predicting future trajectories even when encountering turning points. Furthermore, the framework consistently demonstrates its ability to maintain trajectory discrepancies within a 10 km range for varying future periods, highlighting its effectiveness in learning and prediction.
Table 5 shows the training and testing times of the different methods with different input lengths. BiLSTM peaks at 0.792 GPU-days for training and 5.702 GPU-minutes for testing with a 21-day input. In contrast, our model is the most efficient, requiring just 0.511 GPU-days for training and 3.678 GPU-minutes for testing at the same input length. Figure 7 highlights the efficiency and accuracy trade-offs among the various methods, with particular emphasis on the superior performance of our model across different input lengths.

3.4.4. Visualization of Attention Heatmap

Figure 8 provides a visual representation of the attention matrix within the model’s decoder. The distinctive masking mechanism employed in the decoder guarantees that the upper triangular segment of the 7 × 7 attention matrix is assigned a value of zero following the softmax operation, thereby preventing premature revelation of forthcoming predictions. Furthermore, it is evident that in the context of a seven-day multi-step forecast, the attention weights are evenly distributed, with relatively greater emphasis placed on the diagonal elements. This indicates that the model does not excessively depend on computational outcomes from preceding time steps.

4. Conclusions and Discussion

This study introduces an enhanced transformer-based framework for the precise forecasting of mesoscale eddy trajectories in the South China Sea. The initial step involves the efficient extraction, integration, and processing of oceanic variable data from various altimeter satellites to capture the movement characteristics of eddy paths, resulting in the creation of multivariate mesoscale eddy time series data. Subsequently, feature expansion is conducted to enhance eddy characteristics, thereby providing a more comprehensive simulation of the entire motion process during the eddy lifecycle. Additionally, we extend the conventional transformer model by incorporating convolution and pooling mechanisms in conjunction with TAPE to improve the extraction, learning, and discrimination of complex features within the data. Subsequently, the Soft-DTW loss function is implemented to comprehensively evaluate the multi-step prediction outcomes of the model. Through ablation and comparison experiments, our framework is shown to accurately forecast the trajectories of eddy centers up to 7 days in advance. When the input sequence lengths are 14 and 21, both MAE and RMSE within a seven-day period are found to be within 10 km, surpassing the performance benchmarks established by a majority of the baseline models.
The primary limitations of this study can be identified as twofold. Firstly, the study does not account for the intricate nonlinear motion of eddies themselves, thereby neglecting the fundamental motion mechanisms or features associated with eddy modeling. Instead, the focus is primarily on the transfer and updating of the model. Furthermore, the fusion of features within the data prior to input into the model results in a weakening of the correlations between said features. Despite the implementation of targeted optimization strategies to address this issue, the improvements are still limited. In future research, we intend to introduce physical constraints that consider the dynamics of eddy motion and investigate approaches such as channel independence to overcome these obstacles, ultimately improving the predictive capability of the model.

Author Contributions

Conceptualization, validation, and writing—original draft preparation, Y.D. and J.H.; formal analysis and visualization, J.H.; writing—review and editing, J.C., K.C. and J.W.; supervision, Q.H.; funding acquisition, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Program of China, grant number 2021YFC3101602 and National Natural Science Foundation of China (General Program), grant number 42376194.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data used in this study are available as follows: the MET dataset is from AVISO (Global Mesoscale Eddy Trajectory Product, altimetry.fr), and the SLA dataset is from CMEMS (Global Ocean Gridded L4 Sea Surface Heights and Derived Variables Reprocessed 1993 Ongoing, Copernicus Marine Service).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pascual, A.; Faugère, Y.; Larnicol, G.; Le Traon, P. Improved Description of the Ocean Mesoscale Variability by Combining Four Satellite Altimeters. Geophys. Res. Lett. 2006, 33, L02611. [Google Scholar] [CrossRef]
  2. Wunsch, C. The Past and Future Ocean Circulation from a Contemporary Perspective. In Geophysical Monograph Series; Schmittner, A., Chiang, J.C.H., Hemming, S.R., Eds.; American Geophysical Union: Washington, DC, USA, 2007; Volume 173, pp. 53–74. [Google Scholar]
  3. Martínez-Moreno, J.; Hogg, A.M.; England, M.H.; Constantinou, N.C.; Kiss, A.E.; Morrison, A.K. Global Changes in Oceanic Mesoscale Currents over the Satellite Altimetry Record. Nat. Clim. Change 2021, 11, 397–403. [Google Scholar] [CrossRef]
  4. Shao, Z.; Zhang, Z.; Wang, F.; Xu, Y. Pre-Training Enhanced Spatial-Temporal Graph Neural Network for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14 August 2022; pp. 1567–1577. [Google Scholar]
  5. Lin, Y.; Wang, G. The Effects of Eddy Size on the Sea Surface Heat Flux. Geophys. Res. Lett. 2021, 48, e2021GL095687. [Google Scholar] [CrossRef]
  6. Zhang, Z.; Wang, W.; Qiu, B. Oceanic Mass Transport by Mesoscale Eddies. Science 2014, 345, 322–324. [Google Scholar] [CrossRef]
  7. Beech, N.; Rackow, T.; Semmler, T.; Danilov, S.; Wang, Q.; Jung, T. Long-Term Evolution of Ocean Eddy Activity in a Warming World. Nat. Clim. Chang. 2022, 12, 910–917. [Google Scholar] [CrossRef]
  8. Eden, C.; Dietze, H. Effects of Mesoscale Eddy/Wind Interactions on Biological New Production and Eddy Kinetic Energy. J. Geophys. Res. Oceans 2009, 114, C05023. [Google Scholar] [CrossRef]
  9. Van Westen, R.M.; Dijkstra, H.A. Ocean Eddies Strongly Affect Global Mean Sea-Level Projections. Sci. Adv. 2021, 7, eabf1674. [Google Scholar] [CrossRef]
  10. Horvat, C.; Tziperman, E.; Campin, J. Interaction of Sea Ice Floe Size, Ocean Eddies, and Sea Ice Melting. Geophys. Res. Lett. 2016, 43, 8083–8090. [Google Scholar] [CrossRef]
  11. Robinson, A.R.; Carton, J.A.; Mooers, C.N.K.; Walstad, L.J.; Carter, E.F.; Rienecker, M.M.; Smith, J.A.; Leslie, W.G. A real-time dynamical forecast of ocean synoptic/mesoscale eddies. Nature 1984, 309, 781–783. [Google Scholar] [CrossRef]
  12. Masina, S.; Pinardi, N. Mesoscale Data Assimilation Studies in the Middle Adriatic Sea. Cont. Shelf Res. 1994, 14, 1293–1310. [Google Scholar] [CrossRef]
  13. Shriver, J.F.; Hurlburt, H.E.; Smedstad, O.M.; Wallcraft, A.J.; Rhodes, R.C. 1/32° Real-Time Global Ocean Prediction and Value-Added over 1/16° Resolution. J. Mar. Syst. 2007, 65, 3–26. [Google Scholar] [CrossRef]
  14. Wang, X.; Wang, X.; Yu, M.; Li, C.; Song, D.; Ren, P.; Wu, J. MesoGRU: Deep Learning Framework for Mesoscale Eddy Trajectory Prediction. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8013805. [Google Scholar] [CrossRef]
  15. Li, J.; Wang, G.; Xue, H.; Wang, H. A Simple Predictive Model for the Eddy Propagation Trajectory in the Northern South China Sea. Ocean Sci. 2019, 15, 401–412. [Google Scholar] [CrossRef]
  16. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  17. Vargas, R.; Mosavi, A.; Ruiz, R. Deep Learning: A Review. Adv. Intell. Syst. Comput. 2017, 29, 232–244. [Google Scholar]
  18. Wang, X.; Wang, H.; Liu, D.; Wang, W. The Prediction of Oceanic Mesoscale Eddy Properties and Propagation Trajectories Based on Machine Learning. Water 2020, 12, 2521. [Google Scholar] [CrossRef]
  19. Nian, R.; Cai, Y.; Zhang, Z.; He, H.; Wu, J.; Yuan, Q.; Geng, X.; Qian, Y.; Yang, H.; He, B. The Identification and Prediction of Mesoscale Eddy Variation via Memory in Memory with Scheduled Sampling for Sea Level Anomaly. Front. Mar. Sci. 2021, 8, 753942. [Google Scholar] [CrossRef]
  20. Ma, C.; Li, S.; Wang, A.; Yang, J.; Chen, G. Altimeter Observation-Based Eddy Nowcasting Using an Improved Conv-LSTM Network. Remote Sens. 2019, 11, 783. [Google Scholar] [CrossRef]
  21. Wang, X.; Li, C.; Wang, X.; Tan, L.; Wu, J. Spatio–Temporal Attention-Based Deep Learning Framework for Mesoscale Eddy Trajectory Prediction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3853–3867. [Google Scholar] [CrossRef]
  22. Ge, L.; Huang, B.; Chen, X.; Chen, G. Medium-Range Trajectory Prediction Network Compliant to Physical Constraint for Oceanic Eddy. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4206514. [Google Scholar] [CrossRef]
  23. Zhu, R.; Song, B.; Qiu, Z.; Tian, Y. A Metadata-Enhanced Deep Learning Method for Sea Surface Height and Mesoscale Eddy Prediction. Remote Sens. 2024, 16, 1466. [Google Scholar] [CrossRef]
  24. Tang, H.; Lin, J.; Ma, D. Direct Prediction for Oceanic Mesoscale Eddy Geospatial Distribution through Prior Statistical Deep Learning. Expert Syst. Appl. 2024, 249, 123737. [Google Scholar] [CrossRef]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, virtually, 2–9 February 2021. [Google Scholar]
27. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, virtually, 6–14 December 2021. [Google Scholar]
  28. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv 2023, arXiv:2310.06625v2. [Google Scholar]
  29. Chen, G.; Hou, Y.; Chu, X. Mesoscale Eddies in the South China Sea: Mean Properties, Spatiotemporal Variability, and Impact on Thermohaline Structure. J. Geophys. Res. 2011, 116, C06018. [Google Scholar] [CrossRef]
  30. Nan, F.; Xue, H.; Yu, F. Kuroshio Intrusion into the South China Sea: A Review. Prog. Oceanogr. 2015, 137, 314–333. [Google Scholar] [CrossRef]
  31. Du, Y.; Wu, D.; Liang, F.; Yi, J.; Mo, Y.; He, Z.; Pei, T. Major Migration Corridors of Mesoscale Ocean Eddies in the South China Sea from 1992 to 2012. J. Mar. Syst. 2016, 158, 173–181. [Google Scholar] [CrossRef]
  32. Dufter, P.; Schmitt, M.; Schütze, H. Position Information in Transformers: An Overview. Comput. Linguist. 2021, 48, 733–763. [Google Scholar] [CrossRef]
  33. Huang, Z.; Liang, D.; Xu, P.; Xiang, B. Improve Transformer Models with Better Relative Position Embeddings. arXiv 2020, arXiv:2009.13658. [Google Scholar]
  34. Foumani, N.M.; Tan, C.W.; Webb, G.I.; Salehi, M. Improving Position Encoding of Transformers for Multivariate Time Series Classification. arXiv 2023, arXiv:2305.16642. [Google Scholar] [CrossRef]
  35. Cuturi, M.; Blondel, M. Soft-DTW: A Differentiable Loss Function for Time-Series. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  36. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  37. Graves, A.; Schmidhuber, J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef] [PubMed]
38. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
Figure 1. Preprocessing flowchart for multivariate mesoscale eddy time series data.
Figure 1. Preprocessing flowchart for multivariate mesoscale eddy time series data.
Jmse 12 01759 g001
Figure 2. Azimuth illustration. P and Q represent the centers of two eddies. The dihedral angle θ formed by the planes OPQ and OPN represents the azimuth of P relative to Q.
Figure 2. Azimuth illustration. P and Q represent the centers of two eddies. The dihedral angle θ formed by the planes OPQ and OPN represents the azimuth of P relative to Q.
Jmse 12 01759 g002
Figure 3. Illustration of mesoscale eddy trajectory prediction. The center of the eddy is defined as the center of the speed best-fit circle (solid deep blue lines), which represents a geometric representation that approximates the eddy’s speed contour, connecting all points in the flow field with the same velocity magnitude, as indicated by the black and orange dots. The radius of eddy is the radius of the effective best fit circle (solid deep blue dashed lines), which fits the contour of maximum circum-average geostrophic speed for the detected eddy using the least squares method. And the vector formed by ugos and vgos represents the geostrophic flow velocity vector (black arrows).
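Figure 3 refers to fitting a best-fit circle to a speed contour by least squares. As a rough illustration of how such a fit can be done (not the authors' code), the algebraic Kåsa formulation below recovers a circle's center and radius from contour points expressed in local Cartesian coordinates:

```python
import numpy as np

def fit_circle(x, y):
    """Algebraic (Kasa) least-squares circle fit.
    Rewrites (x-a)^2 + (y-b)^2 = r^2 as the linear system
    x^2 + y^2 = 2*a*x + 2*b*y + c, with r = sqrt(c + a^2 + b^2)."""
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return a, b, np.sqrt(c + a**2 + b**2)

# Synthetic contour points on a noisy circle, for illustration only
t = np.linspace(0, 2 * np.pi, 50)
x = 3.0 + 40.0 * np.cos(t) + np.random.normal(0, 0.5, t.size)
y = -2.0 + 40.0 * np.sin(t) + np.random.normal(0, 0.5, t.size)
print(fit_circle(x, y))  # approximately (3.0, -2.0, 40.0)
```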
Figure 4. Flowchart of the enhanced transformer-based mesoscale eddy trajectory prediction framework.
Figure 5. Comparison of the similarity corresponding to distances between different positions in the sequence with TAPE and OPE. (a) $d_{Traj} = 256$, $L_{Traj} = 21$; (b) $d_{Traj} = 256$, $L_{Traj} = 300$.
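Figure 5 plots, for each pair of positions, the dot-product similarity of their encoding vectors under the original sinusoidal encoding (OPE) and under TAPE. A minimal sketch of how such curves can be reproduced for OPE is given below; the TAPE variant here follows our reading of the published tAPE idea, which rescales each sinusoid's frequency by d_model/L, and is not the authors' exact implementation.

```python
import numpy as np

def sinusoidal_pe(L, d, freq_scale=1.0):
    """Sinusoidal position encoding; freq_scale=1.0 gives the original (OPE).
    Setting freq_scale = d / L follows the tAPE rescaling idea (assumption)."""
    pos = np.arange(L)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angle = pos * (10000.0 ** (-i / d)) * freq_scale
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

L, d = 21, 256
ope = sinusoidal_pe(L, d)
tape = sinusoidal_pe(L, d, freq_scale=d / L)
# Similarity between position 0 and every other position, as compared in Figure 5
sim_ope = ope @ ope[0] / d
sim_tape = tape @ tape[0] / d
```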
Figure 6. Partial prediction results of different models trained with the Soft-DTW loss. (a–c) show the prediction results for different input lengths, respectively.
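For reference, Soft-DTW (Cuturi and Blondel) replaces the hard minimum in the dynamic-time-warping recursion with a smoothed soft-minimum, making the alignment cost differentiable and hence usable as a training loss. The plain NumPy forward pass below is a minimal sketch for intuition only; a practical training setup would use a differentiable implementation such as the one in tslearn.

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Forward pass of Soft-DTW with squared-Euclidean ground cost.
    x: (n, d), y: (m, d); gamma > 0 controls the smoothing of the min."""
    n, m = len(x), len(y)
    D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise cost matrix

    def softmin(a, b, c):
        # softmin_gamma(v) = -gamma * log(sum(exp(-v / gamma))), stabilized
        z = -np.array([a, b, c]) / gamma
        zmax = z.max()
        return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = D[i - 1, j - 1] + softmin(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return R[n, m]

# Example: two 7-step, 2-dimensional trajectory segments
x, y = np.random.randn(7, 2), np.random.randn(7, 2)
print(soft_dtw(x, y, gamma=0.1))
```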
Figure 7. Relationship between model training and testing times and RMSE performance for different input lengths.
Figure 8. Visualization of the 7 × 7 attention matrices for the four randomly sampled heads in the decoder.
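Each 7 × 7 matrix in Figure 8 holds one head's softmax-normalized scaled dot products between the seven query and seven key positions of the output window. As a generic, framework-agnostic illustration (not the paper's code), such a matrix can be produced as follows:

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights for a single head.
    Q, K: (L, d_k) query/key matrices; returns an (L, L) row-stochastic matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Four hypothetical heads over a 7-step output sequence, d_k = 32
rng = np.random.default_rng(0)
heads = [attention_weights(rng.normal(size=(7, 32)), rng.normal(size=(7, 32)))
         for _ in range(4)]  # each entry is a 7x7 matrix like those in Figure 8
```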
Table 1. Feature descriptions of the MET and SLA datasets.

| Dataset | Time Resolution | Spatial Resolution | Feature | Unit | Description |
|---|---|---|---|---|---|
| MET | 1 day | / | Track | - | Trajectory identification number |
| | | | Time | day | Timestamps since 1 January 1950 |
| | | | Longitude | ° | Longitude of the eddy's effective contour |
| | | | Latitude | ° | Latitude of the eddy's effective contour |
| | | | Amplitude | m | Height difference between the eddy's center and contour |
| | | | Speed | m/s | Average speed along the eddy's contour |
| | | | Radius | m | Radius of the best-fit circle corresponding to the contour |
| | | | Azimuth | rad | Azimuth relative to the eddy at the previous timestamp |
| | | | Velocity | m/s | Moving-average velocity of the eddy's center |
| SLA | 1 day | 0.25° | Adt | m | Absolute dynamic topography: sea surface height above the geoid |
| | | | Ugos | m/s | Absolute geostrophic velocity of the sea surface: zonal component |
| | | | Vgos | m/s | Absolute geostrophic velocity of the sea surface: meridional component |
Table 2. Ablation of datasets, feature expansion, and model mechanisms with MAE and RMSE metrics using the Soft-DTW loss function. We set the input length L = 21. Results are seven-day average predictions.

| Datasets, Features, and Model Mechanisms | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
|---|---|---|---|---|---|---|
| MET | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SLA | - | ✓ | ✓ | ✓ | ✓ | ✓ |
| Azimuth and Velocity | - | - | ✓ | ✓ | ✓ | ✓ |
| Conv + Avgpool | - | - | - | ✓ | - | ✓ |
| TAPE | - | - | - | - | ✓ | ✓ |
| MAE (km) | 21.970 | 14.900 | 11.680 | 11.127 | 8.986 | 8.294 |
| RMSE (km) | 27.421 | 17.748 | 14.062 | 13.771 | 10.897 | 9.874 |
Table 3. Seven-day average prediction error of different methods trained with the L2 and Soft-DTW loss functions. We set the input length L = 21.

| Method | MAE (km), L2 | MAE (km), Soft-DTW | RMSE (km), L2 | RMSE (km), Soft-DTW |
|---|---|---|---|---|
| Informer | 50.366 | 48.720 | 59.680 | 58.209 |
| LSTM | 27.209 | 23.259 | 34.151 | 29.842 |
| BiLSTM | 9.610 | 8.721 | 12.253 | 11.362 |
| GRU | 29.549 | 25.582 | 33.587 | 30.088 |
| BiGRU | 21.920 | 19.239 | 26.749 | 23.869 |
| Ours | 12.132 | 8.294 | 14.511 | 9.874 |
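The MAE and RMSE values in Tables 3 and 4 are central errors in km. Assuming these are great-circle distances between predicted and observed eddy centers on a spherical Earth of radius 6371 km, such metrics can be computed as in the sketch below (illustrative, not necessarily the authors' exact evaluation code):

```python
import numpy as np

R_EARTH_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) arrays given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R_EARTH_KM * np.arcsin(np.sqrt(a))

def center_errors(pred, true):
    """pred, true: (N, 2) arrays of (lat, lon) in degrees.
    Returns (MAE, RMSE) of the central error in km."""
    d = haversine_km(pred[:, 0], pred[:, 1], true[:, 0], true[:, 1])
    return d.mean(), np.sqrt((d ** 2).mean())
```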
Table 4. Average prediction error within a seven-day time threshold for different methods using the Soft-DTW loss function. We set the input length L ∈ {7, 14, 21}. All errors are in km.

| Input Length | Day | MAE Informer | MAE LSTM | MAE BiLSTM | MAE GRU | MAE BiGRU | MAE Ours | RMSE Informer | RMSE LSTM | RMSE BiLSTM | RMSE GRU | RMSE BiGRU | RMSE Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1 | 22.893 | 13.015 | 10.871 | 11.443 | 12.667 | 9.103 | 27.701 | 16.074 | 14.129 | 14.342 | 15.877 | 9.960 |
| 7 | 2 | 24.867 | 13.985 | 11.703 | 11.711 | 12.780 | 9.244 | 29.898 | 17.319 | 15.320 | 14.780 | 16.349 | 10.158 |
| 7 | 3 | 26.027 | 15.057 | 12.825 | 12.234 | 13.438 | 9.396 | 31.499 | 18.722 | 16.856 | 15.503 | 17.328 | 10.375 |
| 7 | 4 | 27.836 | 16.210 | 13.985 | 12.865 | 14.216 | 9.461 | 33.868 | 20.210 | 18.451 | 16.370 | 18.406 | 10.486 |
| 7 | 5 | 29.737 | 17.450 | 15.211 | 13.593 | 15.077 | 9.515 | 36.331 | 21.828 | 20.152 | 17.366 | 19.580 | 10.578 |
| 7 | 6 | 31.499 | 18.784 | 16.470 | 14.397 | 16.017 | 9.549 | 38.646 | 23.572 | 21.883 | 18.473 | 20.873 | 10.656 |
| 7 | 7 | 33.350 | 20.269 | 17.776 | 15.336 | 17.107 | 9.594 | 41.072 | 25.546 | 23.667 | 19.794 | 22.382 | 10.732 |
| 14 | 1 | 36.987 | 12.675 | 8.009 | 12.541 | 11.397 | 7.305 | 43.543 | 15.022 | 10.428 | 15.431 | 13.936 | 8.547 |
| 14 | 2 | 38.134 | 11.993 | 7.330 | 12.269 | 9.913 | 7.257 | 44.893 | 14.445 | 9.687 | 15.253 | 12.614 | 8.558 |
| 14 | 3 | 39.551 | 12.070 | 7.078 | 12.270 | 9.440 | 7.116 | 46.607 | 14.678 | 9.416 | 15.305 | 12.215 | 8.476 |
| 14 | 4 | 40.682 | 12.423 | 7.023 | 12.398 | 9.308 | 7.109 | 48.071 | 15.205 | 9.362 | 15.529 | 12.123 | 8.498 |
| 14 | 5 | 42.159 | 12.974 | 7.089 | 12.652 | 9.366 | 7.097 | 49.920 | 15.973 | 9.441 | 15.883 | 12.259 | 8.524 |
| 14 | 6 | 43.557 | 13.739 | 7.271 | 13.082 | 9.639 | 7.151 | 51.721 | 17.042 | 9.654 | 16.478 | 12.702 | 8.597 |
| 14 | 7 | 45.190 | 14.751 | 7.655 | 13.716 | 10.261 | 7.190 | 53.750 | 18.513 | 10.198 | 17.353 | 13.697 | 8.658 |
| 21 | 1 | 43.611 | 18.826 | 10.842 | 22.205 | 15.272 | 7.675 | 51.925 | 22.168 | 13.168 | 25.615 | 18.697 | 8.937 |
| 21 | 2 | 43.455 | 16.840 | 9.307 | 22.527 | 15.647 | 7.668 | 51.711 | 20.804 | 11.762 | 25.941 | 19.180 | 9.012 |
| 21 | 3 | 43.838 | 17.371 | 8.648 | 22.776 | 16.159 | 7.810 | 52.218 | 21.754 | 11.087 | 26.292 | 19.843 | 9.178 |
| 21 | 4 | 44.842 | 18.527 | 8.355 | 23.219 | 16.752 | 7.910 | 53.422 | 23.405 | 10.785 | 26.875 | 20.601 | 9.317 |
| 21 | 5 | 46.237 | 19.990 | 8.286 | 23.845 | 17.469 | 8.054 | 55.191 | 25.427 | 10.714 | 27.713 | 21.516 | 9.533 |
| 21 | 6 | 47.374 | 21.585 | 8.412 | 24.631 | 18.278 | 8.204 | 56.589 | 27.591 | 10.894 | 28.775 | 22.580 | 9.744 |
| 21 | 7 | 48.720 | 23.259 | 8.721 | 25.582 | 19.239 | 8.294 | 58.209 | 29.842 | 11.362 | 30.088 | 23.869 | 9.874 |
Table 5. Training time (GPU-days) and testing time (GPU-minutes) for the different methods with different input lengths.

| Input Length | Train Informer | Train LSTM | Train BiLSTM | Train GRU | Train BiGRU | Train Ours | Test Informer | Test LSTM | Test BiLSTM | Test GRU | Test BiGRU | Test Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0.502 | 0.535 | 0.563 | 0.510 | 0.521 | 0.313 | 3.614 | 3.852 | 4.454 | 3.662 | 3.751 | 2.251 |
| 14 | 0.597 | 0.612 | 0.698 | 0.603 | 0.667 | 0.426 | 4.278 | 4.416 | 5.024 | 4.341 | 4.808 | 3.064 |
| 21 | 0.634 | 0.697 | 0.792 | 0.678 | 0.712 | 0.511 | 4.561 | 5.018 | 5.702 | 4.886 | 5.127 | 3.678 |