1. Introduction
Meteorological prediction is extremely important in contemporary society, offering timely and accurate weather predictions that contribute significantly to diverse societal advancements. In the field of natural energy, the implementation of an effective prediction system can reduce energy costs and enhance energy utilization efficiency [
1,
2]. Furthermore, meteorological prediction also influences construction planning [
3], water resource management [
4], and numerous other domains [
5]. Meteorological prediction is becoming increasingly crucial in agriculture, particularly with the ongoing challenges of climate change. These predictions are vital for optimizing planting strategies and resource use, leading to enhanced crop yields and profitability. In the southeastern U.S. and Argentina, ENSO-based predictions have improved corn planting decisions, boosting incomes for farmers [
6]. In England, Wales, and the Hetao district in China, weather forecasts aid in efficient nitrogen use and inform practices that positively impact both soil health and crop yields [
7,
8]. These examples highlight the critical role of meteorological prediction in production practices in modern society.
Nevertheless, meteorological prediction encounters various difficulties and challenges due to complex meteorological dynamics. Currently, meteorological prediction methods fall primarily into two categories [
9]: numerical weather prediction (NWP) methods and deep learning methods. The NWP methods focus on simulating various physical processes using a series of partial differential equations (PDEs) and solving them through numerical simulations [
10,
11]. However, NWP methods necessitate significant computational resources and are time-consuming when it comes to solving PDEs [
12]. Moreover, the formulas employed in the NWP methods are often inadequate and unavoidably introduce approximation and calculation errors [
13,
14]. Deep learning methods harness the formidable learning capacity of neural networks to acquire knowledge from historical meteorological data and make predictions very quickly [
15,
16]. There has been considerable progress in using deep learning for weather prediction research. FourCastNet [
17] has established a high-resolution prediction model for global weather forecasting for the first time using data-driven methods. Although there is a certain gap in prediction accuracy compared to the ECMWF Integrated Forecasting System (IFS) based on NWP, its prediction speed is one order of magnitude faster than IFS. Pangu-Weather [
18] was the first to use a Transformer-based network in global weather forecasting, achieving prediction accuracy higher than IFS. GraphCast [
19] utilizes graph neural networks to process meteorological information at different resolutions, surpassing Pangu-Weather in prediction accuracy. By learning directly from historical data, data-driven deep learning models can skip the step of building complex and refined physical models, thus avoiding limitations that exist in NWP models, such as biases in convergence parameterization schemes that strongly affect forecasts. However, without the support of meteorological theory, such models become black boxes that produce predictive results but lack interpretability.
Traditional RNN and CNN networks have been widely used in regional meteorological forecasting, but their shortcomings are obvious in dealing with high-dimensional data. Shi et al. [
20,
21], Yu et al. [
22], and Wang et al. [
23] improved the RNN network structure, making it perform better in spatiotemporal prediction tasks and capable of handling meteorological predictions at single pressure levels. Gao et al. [
24] developed a fully CNN-based model to achieve comprehensive prediction performance. However, networks based on RNN and CNN can only predict meteorological data at a single pressure level and cannot capture the interaction information between data at different heights. Furthermore, the structures of RNN and CNN make it difficult to meet the accuracy requirements for prediction, and a more reliable network structure needs to be introduced.
Although the attention mechanism of the Transformer has shown excellent performance in many tasks, the original Transformer structure cannot process the high-dimensional meteorological data. In time series prediction tasks [
25,
26], the Transformer demonstrates an absolute accuracy advantage when compared to traditional RNN-based LSTM and regression-based ARIMA. Similarly, in image understanding methods, the Transformer has clearly emerged as a successor to the previous mainstream CNN methods [
27,
28]. In the case of 5D meteorological data, the dataset encompasses five dimensions: longitude, latitude, pressure level, multiple selected meteorological variables, and time. The existing Transformer-based methods can only predict meteorological data point by point for each coordinate in space. As the data scale increases, it demands more predictive models, consequently increasing the demand for computation resources. More importantly, point-by-point prediction ignores the interaction of meteorological data in different coordinates and cannot effectively utilize spatial information in high-dimensional data. As far as we know, SA-Fit is the first method to use the Transformer structure for regional multi-pressure level meteorological prediction.
We drew inspiration from the curve fitting algorithm in machine learning and innovatively designed the Transformer-based spatiotemporal prediction network to achieve prediction. The curve fitting algorithm is a method of constructing explicit function curves that accurately capture the patterns of data points [
29,
30]. Curve fitting algorithms have found extensive applications in earth science [
31,
32] as well as diverse domains like biology [
33] and economics [
34]. Our objective is to derive tailored fitting functions that capture the variations in regional meteorological data. To effectively apply the curve fitting algorithm to high-dimensional meteorological data, we divide the intricate meteorological data into distinct segments and continuously encode them through a spatiotemporal Transformer network while reducing their dimensions. Subsequently, we employ multiple fitting functions to accommodate the unique characteristics of each segment, with the coefficients being our prediction targets.
In this paper, we propose the spatiotemporal analysis fitting prediction algorithm (SA-Fit), an innovative integration of a lightweight Transformer-based network and curve fitting algorithm. Based on the inherent spatiotemporal coherence and high-dimensionality of meteorological data, SA-Fit adopts two key strategies. The first strategy is to improve the Transformer-based network to process spatial and temporal information of high-dimensional data in a step-by-step manner. The second strategy introduces a novel prediction approach that incorporates fitting functions with a lasso penalty to capture variations in meteorological data. As SA-Fit can concurrently predict meteorological data across multiple pressure levels, it augments the capability of the model to handle high-dimensional data, while reducing the demand for computation resources. The main contributions of this work can be summarized as follows:
- (1)
We propose an innovative algorithm that combines a lightweight Transformer-based network with the curve fitting algorithm to achieve efficient prediction of high-dimensional meteorological data.
- (2)
We improve the Transformer-based network structure for step-by-step processing of spatiotemporal information, which can fully learn the interaction information of different coordinate points in high-dimensional data.
- (3)
Our algorithm greatly reduces the model parameters compared to other Transformer-based prediction models, achieving efficient prediction and reducing the demand for computation resources.
2. Methodology
2.1. Overall Structure
For clarity and convenience, we utilize the symbols listed in Table 1 to represent the variables in five-dimensional meteorological data. We append a tilde (∼) to the corresponding symbol to denote the predicted values. The objective of our study is to utilize known regional meteorological data from past time points to predict the unknown data of the same region at future time points. Assuming the current time point is $t_0$, we sample $I$ previous data points at a fixed time interval $\Delta t$ and employ the algorithm to predict $J$ future data from $t_0$, also with a time interval of $\Delta t$. The entire prediction process can be summarized as follows:
$$\left(\tilde{M}_{t_0+\Delta t}, \ldots, \tilde{M}_{t_0+J\Delta t}\right) = \mathcal{F}_{\Theta}\left(M_{t_0-(I-1)\Delta t}, \ldots, M_{t_0}\right)$$
where $M_t$ is the regional data at time $t$, $\mathcal{F}$ represents the prediction algorithm and $\Theta$ denotes the parameters.
Since the input of our model consists of five dimensions (longitude, latitude, pressure level, the selected meteorological variables, and time), and is higher in dimensionality compared to typical network inputs, it presents challenges in effectively integrating the information from all five dimensions. We use the powerful and effective Transformer-based network structure [
35] as the core of our network design. The attention mechanism serves as the primary component of our network structure, enabling the extraction and integration of information from five-dimensional data. Specifically, as depicted in
Figure 1, we employ a spatial encoder to extract spatial information from 4D data (longitude, latitude, pressure level, and the selected meteorological variables) at various time points, resulting in one-dimensional spatial features. Subsequently, we employ a temporal encoder to encode spatial information from different time points. The effectiveness of this structure, which processes spatial information first and then temporal information, has been confirmed by Arnab et al. [
36].
To explicitly represent our prediction results, we draw inspiration from curve fitting algorithms and employ multiple explicit fitting functions to capture the variations between the future meteorological data and the data at the current time ($t_0$). For each combination of meteorological variable and future time point, denoted by ($v$, $t$), we postulate the existence of a function $g_{v,t}$ that comprehensively depicts the disparity between the data at $t$ and the data at $t_0$. For example, when representing the true value of the meteorological variable $v$ at time $t$, longitude $x$, latitude $y$, and pressure level $p$, we can use $M(v, t, x, y, p)$, so that $g_{v,t}(x, y, p) = M(v, t, x, y, p) - M(v, t_0, x, y, p)$. Our objective is to find an explicit function $f(x, y, p; \beta_{v,t})$ as an approximation of $g_{v,t}$, with $\beta_{v,t}$ representing the unknown parameters that require prediction. The final outputs of our network are the estimated parameters $\tilde{\beta}_{v,t}$ of the fitting functions for every combination of meteorological variable and future time point. The prediction process of SA-Fit for the combination ($v$, $t$) at longitude $x$, latitude $y$, and pressure level $p$ can be summarized as follows:
$$\tilde{\beta}_{v,t} = \left[\mathcal{N}_{\theta}\left(M_{\text{past}}\right)\right]_{v,t}, \qquad \tilde{M}(v, t, x, y, p) = M(v, t_0, x, y, p) + f\left(x, y, p; \tilde{\beta}_{v,t}\right)$$
where $\mathcal{N}$ represents our network with parameters $\theta$, $M_{\text{past}}$ is the sampled past data, and $[\cdot]_{v,t}$ denotes the partial outputs of the network corresponding to the combination ($v$, $t$).
2.2. Spatiotemporal Analysis Network
As shown in
Figure 2, the spatial encoder begins by employing a token embedding layer to embed the input 4D meteorological data (without time dimension) into
C channels. It then alternates between Video Swin Transformer (VST) blocks and patch merging blocks to continually integrate information and downsample the 4D data. After the multi-head self-attention (MSA) block, the resulting four-dimensional tensor is reshaped into a one-dimensional spatial feature.
The VST block, originally proposed by Liu et al. [
37], is employed in our model to extract information. Leveraging the structural similarity between video data and meteorological data, we can readily utilize this architecture to encode meteorological information. The VST block comprises two components: 3D window multi-head self-attention (3DW-MSA) and 3D shifted window multi-head self-attention (3DSW-MSA). Consider the input tensor $X \in \mathbb{R}^{H \times W \times P \times C}$, where $H$, $W$, $P$, and $C$ represent the numbers of selected longitudes, latitudes, pressure levels, and variables, respectively. 3DW-MSA partitions the tensor into $\lceil H/w_h \rceil \times \lceil W/w_w \rceil \times \lceil P/w_p \rceil$ windows with a size of $w_h \times w_w \times w_p$ in a non-overlapping manner. Subsequently, multi-head attention is applied to the tokens within these windows. To enable information exchange among the separate windows of 3DW-MSA, 3DSW-MSA shifts the windows derived from 3DW-MSA along the longitude, latitude, and pressure level by $(\lfloor w_h/2 \rfloor, \lfloor w_w/2 \rfloor, \lfloor w_p/2 \rfloor)$ tokens, thereby generating new windows. Attention operations are then performed within these reconstructed windows. The VST block with a depth of 1 is computed as follows:
$$\hat{X} = \text{3DW-MSA}(\text{LN}(X)) + X, \qquad Y = \text{MLP}(\text{LN}(\hat{X})) + \hat{X},$$
$$\hat{Y} = \text{3DSW-MSA}(\text{LN}(Y)) + Y, \qquad D = \text{MLP}(\text{LN}(\hat{Y})) + \hat{Y},$$
where $X$ and $D$ denote the input and output of the VST block; $\hat{X}$, $Y$, and $\hat{Y}$ are intermediate features; MLP denotes a two-layer multilayer perceptron; LN denotes layer normalization. The VST block repeats the aforementioned process $L$ times when it has a depth of $L$.
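The window partitioning and shifting described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the helper name `window_ids` and the 12 × 12 × 12 token grid with 4 × 4 × 4 windows are hypothetical, and the shift is realised as a simple cyclic shift without the attention mask that Swin-style models normally add.

```python
from itertools import product


def window_ids(shape, window, shift=(0, 0, 0)):
    """Map every token index (h, w, p) to the id of the 3D window containing it.

    A non-zero `shift` cyclically moves the tokens before partitioning,
    which is one common way to realise shifted windows.
    """
    H, W, P = shape
    wh, ww, wp = window
    ids = {}
    for h, w, p in product(range(H), range(W), range(P)):
        hs = (h + shift[0]) % H
        ws = (w + shift[1]) % W
        ps = (p + shift[2]) % P
        ids[(h, w, p)] = (hs // wh, ws // ww, ps // wp)
    return ids


shape, window = (12, 12, 12), (4, 4, 4)  # hypothetical sizes
regular = window_ids(shape, window)
shifted = window_ids(shape, window, shift=tuple(s // 2 for s in window))

n_windows = len(set(regular.values()))
# Tokens (3, 0, 0) and (4, 0, 0) are neighbours that fall into different
# regular windows but share a window after the shift.
crosses_boundary = regular[(3, 0, 0)] != regular[(4, 0, 0)]
joined_by_shift = shifted[(3, 0, 0)] == shifted[(4, 0, 0)]
print(n_windows, crosses_boundary, joined_by_shift)
```

The two neighbouring tokens show why the shifted pass matters: attention restricted to regular windows never relates them, while the shifted partition places them in the same window.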
The patch merging block concatenates adjacent tokens in small patches of size $a \times b \times c$, where $a$, $b$, and $c$ represent the length, width, and height of the patches, respectively. It then applies a linear layer to reduce the dimension of the concatenated tokens to one quarter. The patch merging block reduces the number of tokens to $\frac{H}{a} \times \frac{W}{b} \times \frac{P}{c}$, where the specific values of $a$, $b$, and $c$ are determined based on the data sizes.
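As a rough illustration of the bookkeeping in patch merging, the sketch below tracks only shapes and channel counts. The function name `patch_merge_shape` and the example sizes (a 12 × 12 × 12 token grid, 32 channels, 2 × 2 × 2 patches) are assumptions chosen for illustration, not values confirmed by the text.

```python
def patch_merge_shape(tokens_shape, channels, patch):
    """Return the token grid and channel sizes after one patch merging step."""
    H, W, P = tokens_shape
    a, b, c = patch
    if H % a or W % b or P % c:
        raise ValueError("token grid must be divisible by the patch size")
    merged_shape = (H // a, W // b, P // c)   # fewer tokens
    concat_dim = channels * a * b * c         # tokens of a patch concatenated
    out_dim = concat_dim // 4                 # linear layer keeps one quarter
    return merged_shape, concat_dim, out_dim


merged_shape, concat_dim, out_dim = patch_merge_shape((12, 12, 12), 32, (2, 2, 2))
print(merged_shape, concat_dim, out_dim)
```

With these example sizes the token count drops by a factor of eight while the channel dimension doubles, so the total amount of data per sample shrinks at each merging step.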
Finally, the tokens are processed through the MSA block, and then the vectors formed by concatenating all tokens are passed through a linear layer to obtain spatial features. The MSA block, with a depth of 1 and an input of $X$, is defined as follows:
$$\hat{X} = \text{MSA}(\text{LN}(X)) + X, \qquad D = \text{MLP}(\text{LN}(\hat{X})) + \hat{X},$$
where MSA denotes multi-head self-attention and $D$ denotes the output of the block.
The temporal encoder is constructed from an MSA block whose depth is a hyperparameter. Since the attention mechanism is a position- and time-agnostic set operation, we add time and position embeddings to the spatial features extracted by the spatial encoder. For the target time points, we substitute their spatial features with learnable vectors. We feed the resulting sequence of feature vectors into the temporal encoder to extract temporal information. The K tokens corresponding to the future time points are then individually fed into K distinct linear layers, yielding predictions of function parameters. The parameters of the fitting functions for all variables at a future time point are generated from the token corresponding to that time point. Specifically, if there are five variables (Z, Q, T, U, and V) that need to be predicted, the token for a given future time point is processed by the temporal encoder and its linear layer to simultaneously generate the function parameters for Z, Q, T, U, and V at that time point.
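The token flow through the temporal encoder can be sketched as plain shape arithmetic. This is only an illustration under assumptions: the function name `temporal_encoder_shapes` is hypothetical, the defaults (12 past and 4 future time points, 1024-dimensional features, 5 variables) follow the experimental setup reported in Section 3.2, and 286 is the number of coefficients of a degree-10 polynomial in three inputs.

```python
def temporal_encoder_shapes(n_past=12, n_future=4, feat_dim=1024,
                            n_vars=5, n_coeffs=286):
    """Track tensor shapes through the temporal encoder (no real attention)."""
    # One spatial feature vector per past time point, from the spatial encoder.
    spatial_features = (n_past, feat_dim)
    # Learnable vectors stand in for the unknown future time points.
    encoder_input = (n_past + n_future, feat_dim)
    # Self-attention keeps sequence length and feature size unchanged.
    encoder_output = encoder_input
    # Each future token has its own linear layer and yields the coefficients
    # of all variables for that time point.
    head_outputs = [(n_vars * n_coeffs,) for _ in range(n_future)]
    return spatial_features, encoder_input, encoder_output, head_outputs


_, enc_in, _, heads = temporal_encoder_shapes()
print(enc_in, len(heads), heads[0])
```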
2.3. Fitting Functions with a Lasso Penalty
The fundamental part of SA-Fit lies in introducing explicit functions to capture variations in meteorological data. Rather than directly predicting the meteorological variable values, we predict the parameters of the fitting functions. To obtain a predicted value, the corresponding fitting function is first derived based on the meteorological variables and the predicted time point. Then, the corresponding longitude, latitude, and pressure level are used as inputs to the fitting function, yielding the predicted variation of the meteorological variable. The final predicted value is obtained by adding the current value of meteorological variables to the predicted variation.
Figure 3 shows the process of obtaining the predicted value of meteorological variable $v$ at a given longitude, latitude, pressure level, and future time point through the fitting function.
All prediction results rely on generating the variation of meteorological variables through fitting functions. Therefore, the selection of functions directly impacts the prediction accuracy of the algorithm. In this paper, we choose the multivariate polynomial function as the fitting function to describe the variation of high-dimensional data. The form of multivariate polynomial functions with interaction terms is given by the following:
$$f(x, y, p; \beta) = \sum_{i + j + k \le d} \beta_{ijk}\, x^{i} y^{j} p^{k}, \qquad i, j, k \ge 0,$$
where $\beta_{ijk}$ are the function parameters. The highest degree of non-zero terms in multivariate polynomials is referred to as the polynomial degree, denoted as $d$. For simplicity, we uniformly assign an identical $d$ to all variables and time points.
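A minimal sketch of such a fitting function is given below, assuming a polynomial of bounded total degree in longitude, latitude, and pressure level. The helper names, the coefficient values, and the idea of passing coordinates already scaled to a small range are illustrative assumptions, not details taken from the paper.

```python
from itertools import product


def monomial_exponents(degree, n_vars=3):
    """All exponent tuples (i, j, k) with i + j + k <= degree."""
    return [e for e in product(range(degree + 1), repeat=n_vars)
            if sum(e) <= degree]


def eval_polynomial(coeffs, x, y, p):
    """Evaluate sum of c * x^i * y^j * p^k over a {(i, j, k): c} mapping."""
    return sum(c * x**i * y**j * p**k for (i, j, k), c in coeffs.items())


def predict_value(current_value, coeffs, x, y, p):
    """Predicted value = current value + fitted variation."""
    return current_value + eval_polynomial(coeffs, x, y, p)


# Hypothetical sparse coefficients: a constant, one interaction term (x*y),
# and one squared term (p^2).
coeffs = {(0, 0, 0): 1.0, (1, 1, 0): 2.0, (0, 0, 2): -1.0}
variation = eval_polynomial(coeffs, 0.5, 0.5, 1.0)
prediction = predict_value(10.0, coeffs, 0.5, 0.5, 1.0)
n_terms_deg2 = len(monomial_exponents(2))
print(variation, prediction, n_terms_deg2)
```

Storing coefficients in a sparse mapping mirrors the intent of the lasso penalty: terms whose coefficients are driven to zero simply drop out of the sum.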
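When using polynomial functions with a high degree $d$ for fitting, the resulting fitting function contains a multitude of terms, making it difficult to identify the dominant ones. Therefore, we introduce a lasso penalty to the coefficients of the polynomial fitting functions. The lasso penalty for a combination of $v$ and $t$ is defined as follows:
$$L_{\text{lasso}}^{v,t} = \sum_{i+j+k \le d} \left| \tilde{\beta}^{\,v,t}_{ijk} \right|$$
where $\tilde{\beta}^{\,v,t}_{ijk}$ are the predicted coefficients of the fitting function for that combination. Our objective is to minimize the absolute values of the fitting function coefficients, effectively conducting a process of variable selection. This process leads to certain coefficients being reduced to zero, retaining only the coefficients of the dominant terms. To reduce prediction errors for a combination of $v$ and $t$, our loss function is the mean squared error (MSE), as follows:
$$L_{\text{MSE}}^{v,t} = \frac{1}{HWP} \sum_{x, y, p} \left( \tilde{M}(v, t, x, y, p) - M(v, t, x, y, p) \right)^2$$
where $\tilde{M}$ and $M$ denote the predicted and true values, and the sum runs over all $H \times W \times P$ grid points. Our final training error is given by the following:
$$L = \sum_{v, t} \left( L_{\text{MSE}}^{v,t} + \lambda L_{\text{lasso}}^{v,t} \right)$$
where $\lambda$ is a hyperparameter.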
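The combination of a data term and a lasso term can be sketched as follows for a single variable and time point. The function names and the numbers are hypothetical, and the weight `lam` is an arbitrary example value rather than the setting used in the experiments.

```python
def lasso_penalty(coeffs):
    """Sum of absolute coefficient values."""
    return sum(abs(c) for c in coeffs)


def mse(pred, true):
    """Mean squared error over paired values."""
    return sum((a - b) ** 2 for a, b in zip(pred, true)) / len(true)


def training_loss(pred, true, coeffs, lam):
    """MSE plus lam times the lasso penalty on the fitted coefficients."""
    return mse(pred, true) + lam * lasso_penalty(coeffs)


pred = [1.0, 2.0, 5.0]          # hypothetical predicted values
true = [1.0, 2.0, 2.0]          # hypothetical true values
coeffs = [0.5, -1.5, 0.0]       # hypothetical polynomial coefficients
loss = training_loss(pred, true, coeffs, lam=0.1)
print(mse(pred, true), lasso_penalty(coeffs), loss)
```

Because the penalty grows linearly with each coefficient's magnitude, small coefficients that contribute little to the fit are cheaper to set to zero than to keep, which is the variable-selection effect described above.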
3. Experiment
3.1. Data and Research Area
We use the ERA5 reanalysis dataset [
38,
39,
40,
41] as the research data for SA-Fit. The reanalysis dataset [
42] is a globally continuous and seamless meteorological dataset that integrates historical meteorological observation data and output data from meteorological models through recalculations. The ERA5 dataset encompasses meteorological data at the surface and 37 pressure levels, with a spatial resolution of 0.25° × 0.25° in latitude and longitude. It is widely regarded as the most comprehensive and accurate reanalysis dataset globally, and it is widely used in various studies [
43,
44].
To comprehensively showcase the predictive capability of SA-Fit, we selected two regions as our research areas. The first region we selected was Shanghai, situated in Southeast China (longitude range: 120°E to 123°E, latitude range: 29°N to 32°N). The second region was Xi’an, located in Northwest China (longitude range: 107°E to 110°E, latitude range: 32.5°N to 35.5°N). Our study focuses on 13 specific pressure levels: 50 hPa, 100 hPa, 150 hPa, 200 hPa, 250 hPa, 300 hPa, 400 hPa, 500 hPa, 600 hPa, 700 hPa, 850 hPa, 925 hPa, and 1000 hPa. The meteorological variables chosen for prediction include geopotential (Z), specific humidity (Q), air temperature (T), u-component (U), and v-component (V) of the wind.
3.2. Experiment Setup
We use ERA5 data from 2012 to 2018 as the training set, 2019 data as the validation set, and 2020 and 2021 data as the test set. A time interval of 6 h is set, and the data of the preceding three days (12 time points) are utilized to predict the data of the next day (4 time points).
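With a 6 h interval, three days of input correspond to 12 time points and one day of output to 4. The sketch below shows one way to cut a 6-hourly series into such input/target pairs; the function name `make_samples` and the stride of one step between consecutive samples are assumptions, since the text does not state how samples are drawn.

```python
def make_samples(n_steps, n_in=12, n_out=4):
    """Return (input_indices, target_indices) pairs over a 6-hourly series."""
    samples = []
    # t0 is the index of the current time point (the last input step).
    for t0 in range(n_in - 1, n_steps - n_out):
        inputs = list(range(t0 - n_in + 1, t0 + 1))
        targets = list(range(t0 + 1, t0 + n_out + 1))
        samples.append((inputs, targets))
    return samples


samples = make_samples(n_steps=20)
first_inputs, first_targets = samples[0]
print(len(samples), first_inputs[0], first_inputs[-1], first_targets)
```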
Regarding the parameter settings of the network, the initial embedding dimension (C) is set to 32. The first VST block has a depth () of 1, with a window size () of . The second VST block has a depth () of 2, with a window size () of . The MSA block has a depth () of 2. The patch merging block has a patch size () of . The spatial encoder has a depth () of 6, and the temporal encoder outputs spatial features with a dimension of 1024.
During training, we set the batch size to 32. All fitting functions adopted are polynomial functions with a degree () of 10. The hyperparameter is set to . We employ an Adam optimizer with an initial learning rate of , along with an exponential decay learning rate scheduler with a gamma of 0.95. The drop rate is set to 0.3. We terminate the training when the loss on the validation set increases twice in comparison to the previous epoch.
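The learning rate schedule and stopping rule can be sketched as below. Two points are assumptions: the initial learning rate of 1e-3 is a placeholder, and the stopping rule is read here as "stop once the validation loss has risen relative to the previous epoch on two occasions", which is only one possible interpretation of the description above.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` decay steps of an exponential schedule."""
    return initial_lr * gamma ** epoch


def should_stop(val_losses, max_increases=2):
    """True once the validation loss has increased `max_increases` times
    relative to the immediately preceding epoch."""
    increases = sum(1 for prev, cur in zip(val_losses, val_losses[1:])
                    if cur > prev)
    return increases >= max_increases


lr_after_two = exponential_lr(1e-3, 0.95, 2)   # 1e-3 is a placeholder value
stop_early = should_stop([1.0, 0.9, 0.95, 0.8])
stop_late = should_stop([1.0, 0.9, 0.95, 0.8, 0.85])
print(lr_after_two, stop_early, stop_late)
```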
The benchmark methods for our experiment include ConvLSTM [
20], TrajGRU [
21], PredRNN [
45], PredRNN++ [
46], E3D-LSTM [
23], MIM [
47], CrevNet [
22], and SimVP [
24]. Due to the inability of the above comparison methods to predict all pressure levels simultaneously, we generate predictions for each pressure level individually. All experiments were conducted on a single NVIDIA RTX3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
We use root mean square error (RMSE) and mean absolute error (MAE) as the criteria for experimental results, which are defined as follows:
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( y_n - \tilde{y}_n \right)^2}, \qquad \text{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - \tilde{y}_n \right|,$$
where $y_n$ is the true value, $\tilde{y}_n$ is the predicted value, and $N$ is the number of variable values. The meteorological variables geopotential (Z), specific humidity (Q), air temperature (T), u-component (U), and v-component (V) of the wind are measured in units of m²/s², g/kg, K, m/s, and m/s, respectively.
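Both metrics are simple to compute; the sketch below uses made-up values purely to show that RMSE penalises a single large error more heavily than MAE does.

```python
import math


def rmse(true, pred):
    """Root mean square error over paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error over paired values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


true = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 8.0]   # one prediction is off by 4
print(rmse(true, pred), mae(true, pred))
```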
3.3. Experiment Results
Table 2, Table 3, Table 4 and Table 5 display the RMSE and MAE values of SA-Fit and comparison methods for different prediction time intervals. Despite being an approximate prediction algorithm, SA-Fit achieves comparable prediction accuracy to other direct prediction deep learning methods.
The experimental results demonstrate that the prediction accuracy of SA-Fit varies across different meteorological variables, which can be attributed to the suitability of the polynomial fitting functions for each variable. SA-Fit exhibits significantly higher prediction accuracy than other methods in the 6-h prediction of Z in the Shanghai region, achieving an RMSE of 114.334 and an MAE of 85.805. These values represent a reduction of 37.0% and 35.0%, respectively, compared to the second-best results. In the Xi’an region, SA-Fit achieves reductions of 9.7% in RMSE and 8.4% in MAE for the 18-h prediction of Z compared to the second-best results. However, the prediction accuracy of SA-Fit for Q is relatively unsatisfactory. In the 12-h prediction of Q in the Shanghai region, the RMSE of SA-Fit is 1.107, which is increased by 14.2% compared to the best results, while the MAE is 0.642, which is increased by 25.1%. In the 24-h prediction for the Xi’an region, these two increase ratios are 13.0% and 24.5%, respectively. As for the prediction of the other three variables, SA-Fit and other methods exhibit comparable levels of accuracy.
The experimental results indicate that SA-Fit demonstrates superior performance in short-term prediction. SA-Fit achieved the most accurate predictions in the 6-h T prediction, with an RMSE of 1.228 and an MAE of 0.900 in the Shanghai region, and an RMSE of 1.280 and an MAE of 0.958 in the Xi’an region. But in the 24-h prediction, the RMSE and MAE of SA-Fit increased by 5.0% and 6.7%, respectively, compared to the best results in the Shanghai region, and by 4.0% and 4.6% in Xi’an. SA-Fit also demonstrated the best performance in the 6-h U prediction for the Shanghai region, yielding an RMSE of 3.683 and an MAE of 2.735. However, in the 24-h prediction, SA-Fit had increased rates of 4.5% and 6.7% in RMSE and MAE, respectively, compared to the best results.
Based on the above analysis, the experiment results may imply the need to select different fitting functions for different variables and different prediction times to achieve the best prediction results.
3.4. Prediction at Different Pressure Levels
Figure 4 showcases the detailed variation in RMSE for multiple meteorological variables across different prediction time intervals and pressure levels. The behaviors of prediction results at different atmospheric pressures reveal complex, non-linear trends. For variable Z, the RMSE initially shows a sharp increase as pressure increases from 50 hPa, peaking at approximately 200 hPa, indicating a significant deviation in predictions at low pressures (upper levels). After this peak, the RMSE gradually declines, stabilizing between 600 hPa and 850 hPa, representing more accurate predictions in this pressure range. However, a slight increase is observed again as the pressure increases further. The RMSE of variable Q follows a somewhat similar trend, starting with a moderate rise in RMSE. After reaching its maximum between 600 hPa and 850 hPa, it shows a clear decline as the pressure increases, suggesting a more stable predictive accuracy in both lower and higher pressure ranges. In the case of variable T, the RMSE exhibits a relatively consistent upward trend as pressure increases, with a noticeable local maximum occurring between 250 hPa and 400 hPa. This behavior suggests that the prediction error increases more steadily for T compared to other variables at higher pressures. Variables U and V exhibit very similar patterns in their RMSE curves, characterized by an initial increase followed by a steady decline. Both variables show their highest RMSE values in the range between 150 hPa and 400 hPa, corresponding to low pressures (upper levels). However, their predictive accuracy improves significantly at higher pressures, with minimum RMSE values observed between 850 hPa and 1000 hPa. Overall, the figure highlights how different meteorological variables exhibit unique relationships between prediction errors and pressure levels, with the mid- to high-pressure regions showing more variability in RMSE patterns.
Figure 5 illustrates the variations in MAE of the SA-Fit prediction results with respect to pressure levels for different variables and prediction time points. The variation patterns of MAE are largely consistent with those of RMSE. Prediction accuracy differs across pressure levels, and the relationship between error and pressure level differs between variables. For variable
Z, the MAE initially increases, reaching a peak around 200 hPa, followed by a decrease until reaching a pressure level between 600 hPa and 850 hPa, and subsequently increasing again. Variable
Q exhibits an initial increase in MAE, followed by a decrease, and reaches its maximum at pressure levels between 600 hPa and 850 hPa. For the variable
T, MAE typically increases with increasing pressure levels, with a local maximum value between 250 hPa and 400 hPa. Variables
U and
V exhibit similar patterns of MAE changes, characterized by an initial increase, and subsequent decrease. The maximum MAE for both variables is observed between 150 hPa and 400 hPa, while the minimum MAE is observed between 850 hPa and 1000 hPa.
Comparing the RMSE and MAE of SA-Fit and SimVP prediction results in Shanghai, SA-Fit generally performs better than SimVP. In short-term prediction, SA-Fit significantly outperforms SimVP and demonstrates outstanding performance in predicting variable Z. However, in long-term prediction, SA-Fit is slightly inferior to SimVP. It is worth noting that the predictive performance of SA-Fit is more stable and less volatile than that of SimVP across pressure levels, so SA-Fit is more robust to sudden changes between pressure levels.
3.5. Comparison of the Number of Parameters
Due to the use of the spatiotemporal analysis network, SA-Fit can process and predict high-dimensional meteorological data simultaneously through the improved curve fitting algorithm. The number of parameters in SA-Fit is not affected by the scale of the data to be predicted. The previous Transformer-based time series prediction methods could only predict time series with a single spatial coordinate. When dealing with multiple time series in high-dimensional space, multiple prediction models need to be constructed. As the data scale continues to increase, the computational resources required to construct models become unacceptable.
Table 6 shows the model size required by different algorithms, and it is evident that SA-Fit has significantly fewer parameters when predicting meteorological data in the Shanghai region. We have visually demonstrated the advantage of SA-Fit in terms of model size in
Figure 6. In fact, using methods like FEDformer to simultaneously predict data for all spatial coordinate points in a given area on a typical consumer-grade graphics card is not feasible, as the GPU memory does not support such operations.
3.6. Reduced Training Time
Table 7 presents the training times for various comparative methods, alongside the results from the SA-Fit experiments conducted in the Shanghai region. The training time is an important metric used in assessing the efficiency of predictive models, especially when scaling up to handle more complex tasks such as predicting meteorological variables across multiple pressure levels.
SA-Fit stands out due to its remarkably efficient training time, especially when predicting across 13 different pressure levels simultaneously. Unlike traditional methods, where the prediction of each pressure level is treated as a separate task resulting in a linear increase in training time, SA-Fit can predict all pressure levels simultaneously. This ability significantly reduces the overall time needed for training, as it does not scale linearly with the number of pressure levels. While some comparison methods, such as ConvLSTM, SimVP, and PredRNN, also exhibit relatively short training times that are comparable to SA-Fit, these models handle predictions differently. They can become inefficient as the complexity of the task increases, particularly when dealing with high-dimensional data with more pressure levels.
As the number of pressure levels grows, the training efficiency advantage of SA-Fit becomes even more pronounced. This scalability is crucial for operational models that require quick training across various atmospheric conditions, ensuring that SA-Fit remains competitive in terms of both training speed and accuracy. Its constant training time, regardless of the number of pressure levels, highlights its suitability for large-scale meteorological forecasting tasks, offering a distinct advantage over other models.
3.7. Visualization
To enhance the clarity of SA-Fit prediction results, we randomly selected two locations and plotted the predicted values of two variables alongside their corresponding true values. These plots are presented in
Figure 7 and
Figure 8.
Figure 7 displays the predictions and true values for variables
T and
U at a pressure level of 500 hPa in the Shanghai region, specifically located at longitude 121.25° and latitude 30.75°.
Figure 8 illustrates the predictions and true values for variables
Z and
V at a pressure level of 800 hPa in the Xi’an region, situated at longitude 108.25° and latitude 34.25°.
The figures show that the trend of SA-Fit predictions aligns closely with the trend of the real data. Furthermore, SA-Fit demonstrates a strong predictive capacity, particularly in capturing significant variations within the real data. The prediction results of SA-Fit exhibit a slight delay compared to the real data as the prediction time interval increases. However, SA-Fit does not merely replicate historical data; instead, it effectively learns valuable information from the historical data to inform its predictions.
4. Discussion
SA-Fit aims to address the challenge of predicting regional high-dimensional meteorological data, which is difficult for existing deep learning methods. To this end, SA-Fit uses a lightweight Transformer-based spatiotemporal analysis network to process spatiotemporal information and leverages polynomials to fit variations in meteorological data. The experimental results demonstrate that, despite being an approximation algorithm, SA-Fit achieves comparable results to state-of-the-art algorithms developed in recent years.
Some prior prediction networks employ RNN structures and an iterative multi-step prediction strategy, which autoregressively predicts future data; this inevitably leads to cumulative errors when predicting over an extended time horizon. SA-Fit embraces a Transformer-based architecture and employs a direct multi-step prediction strategy that optimizes the multi-step prediction objective directly in a single step. This prediction strategy effectively mitigates the issue of cumulative errors when making predictions over an extended future time horizon. Zeng et al. [
50] discussed the benefits of employing Transformers within the context of this strategy. However, in our experiments, we did not observe SA-Fit outperforming RNN-based methods. In previous studies comparing RNN and Transformer networks, an RNN predicts recurrently, whereas a Transformer acts on all tokens directly through the attention mechanism, so RNN outputs show obvious cumulative errors over longer prediction horizons. This advantage of the Transformer structure (i.e., not accumulating errors) only becomes apparent when there are dozens or even hundreds of prediction time points. We posit that one potential explanation for our result is that the number of predicted future time points is insufficient to generate substantial cumulative errors.
The structure of the Transformer-based network enables SA-Fit to simultaneously encode data across multiple pressure levels within high-dimensional meteorological data, in contrast to the previous approach of separate predictions for each pressure level or point. This enhanced approach allows SA-Fit to maximize the utilization of structural data information, thus enhancing the support for accurate predictions.
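SA-Fit can significantly reduce the number of predicted values, with the extent of reduction increasing as the range of predicted longitude, latitude, and pressure level expands. During the experiment, when predicting the meteorological data of Shanghai directly, we are required to predict a total of 37,440 values, calculated as $12 \times 12 \times 13 \times 5 \times 4$ (longitude points × latitude points × pressure levels × variables × future time points). Conversely, when using SA-Fit with degree-10 polynomials, we are only required to predict 5720 function coefficients, calculated as $286 \times 5 \times 4$ (coefficients per fitting function × variables × future time points), only 15.3% of the original amount.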
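The counts above can be checked with a few lines of arithmetic. The function names are hypothetical, and the 12 × 12 horizontal grid is inferred from the reported total of 37,440 values rather than stated explicitly; the number of coefficients of a polynomial of total degree at most d in n inputs is the binomial coefficient C(d + n, n).

```python
from math import comb


def n_direct_values(n_lon, n_lat, n_levels, n_vars, n_times):
    """Number of grid values predicted directly."""
    return n_lon * n_lat * n_levels * n_vars * n_times


def n_poly_coeffs(degree, n_inputs=3):
    """Coefficients of a polynomial of total degree <= degree in n_inputs."""
    return comb(degree + n_inputs, n_inputs)


direct = n_direct_values(12, 12, 13, 5, 4)
per_function = n_poly_coeffs(10)
fitted = per_function * 5 * 4
ratio = fitted / direct
print(direct, per_function, fitted, round(100 * ratio, 1))
```

Only the first count depends on the grid size, so enlarging the region or adding pressure levels increases the number of directly predicted values while the number of coefficients stays fixed for a given polynomial degree.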
SA-Fit possesses the flexibility to choose suitable fitting functions for prediction, thus making it adaptable to the prediction demands of diverse datasets. Furthermore, distinct fitting functions can be selected for different variables within the dataset. When having prior knowledge of the variables to be predicted, we can leverage this knowledge to choose fitting functions that improve prediction accuracy and interpretability.
5. Conclusions
In this article, we propose SA-Fit, an algorithm using a lightweight Transformer-based network aimed at addressing the challenge of predicting regional high-dimensional meteorological data. SA-Fit uses an enhanced Transformer-based spatiotemporal analysis network to encode spatiotemporal information in high-dimensional meteorological data and introduces explicit functions to fit variations in meteorological data, offering a novel predictive methodology.
The experimental results show that SA-Fit is comparable to other advanced deep learning algorithms and requires fewer computation resources. When using multivariate polynomial functions for fitting, SA-Fit exhibits favorable performance in certain prediction tasks, such as geopotential prediction and short-term prediction. Meanwhile, in the experiment, SA-Fit effectively reduces training time and greatly reduces the model parameters compared to other Transformer-based prediction models.