The data used in this study come from the water level observation records of six tidal stations in the South China Sea, covering January to September 2018 (nine months in total). The data from each station for January to August 2018 were used to train the neural network, while the September data were reserved for validation. All stations recorded water levels hourly, yielding 24 readings per day. During data processing, we pre-processed the records from the six stations, discarding missing values, and then fed the processed data into the neural network for training. After training, we used the network to predict the water levels for 312 h starting from 00:00 on 3 September 2018. The FVCOM model performed the same prediction task, generating corresponding prediction results for the six stations. Of the six sets of prediction results, we selected the data from five stations and split them into a training set and a test set in an approximately 80:20 ratio for use in the reinforcement learning fusion module. The full 312 h of prediction data from the remaining station were used to validate the generalization ability of the DNR model.
Evaluation Metrics. To evaluate the performance of our proposed model, we adopted the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson correlation coefficient, and Nash–Sutcliffe Efficiency (NSE), which are commonly used metrics for tidal level prediction tasks. Additionally, we conducted a detailed comparison with other models.
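The four metrics can be computed directly from a pair of observed and predicted series; the sketch below (the function name and any data are illustrative, not from the study) shows one way to do so with NumPy.

```python
import numpy as np

def evaluate(obs, pred):
    """Compute MAE, RMSE, Pearson correlation, and NSE for a pair of
    observed/predicted water level series (1-D arrays of equal length)."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    err = pred - obs
    mae = np.mean(np.abs(err))                        # Mean Absolute Error
    rmse = np.sqrt(np.mean(err ** 2))                 # Root Mean Square Error
    pearson = np.corrcoef(obs, pred)[0, 1]            # Pearson correlation
    # Nash–Sutcliffe Efficiency: 1 minus error variance over observed variance
    nse = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    return mae, rmse, pearson, nse
```

Note that MAE and RMSE are scale-dependent (here, in the units of the water level data), while Pearson correlation and NSE are dimensionless, with 1 indicating a perfect fit.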
4.1. Experimental Setup
The hardware platform for this study comprised 32 GB of RAM, an NVIDIA RTX 4070Ti GPU with 12 GB of VRAM, and an Intel 14th-generation CPU. On the software side, VSCode was used as the development environment. For more detailed setup information, please refer to
Table 2.
The key settings of the FVCOM model’s nml configuration file are shown in
Table 3. Considering that the FVCOM simulation requires an initial cold-start spin-up period, the simulation was started in advance and the first 48 h of prediction results were discarded. That is, the actual prediction started from 00:00 on 3 September 2018, with prediction results output every hour. In
Table 3, START_DATE represents the start time of the FVCOM prediction, and END_DATE denotes the end time of the prediction. STARTUP_TYPE is used to specify the startup mode of the model. NC_FIRST_OUT refers to the start time for outputting prediction results in the NetCDF file, while NC_OUT_INTERVAL defines the time interval for outputting prediction values in the NetCDF file. As for WIND_ON, it indicates whether the wind field file is enabled, with “F” meaning that the wind field file is not enabled.
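For orientation, the keys described above live in the standard namelist groups of an FVCOM run `.nml` file, roughly as in the illustrative fragment below. The time strings are inferred from the simulation window described in the text (spin-up from 1 September, hourly output through 23:00 on 15 September); the authoritative values are those in Table 3.

```
! Illustrative FVCOM run namelist fragment; actual values are in Table 3.
&NML_CASE
 START_DATE   = '2018-09-01 00:00:00',    ! simulation (spin-up) start
 END_DATE     = '2018-09-15 23:00:00',    ! end of the prediction window
/
&NML_STARTUP
 STARTUP_TYPE = 'coldstart',              ! cold-start mode
/
&NML_NETCDF
 NC_FIRST_OUT    = '2018-09-01 00:00:00', ! first NetCDF output time
 NC_OUT_INTERVAL = 'seconds=3600.',       ! hourly output
/
&NML_SURFACE_FORCING
 WIND_ON = F,                             ! wind field file not enabled
/
```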
As shown in
Table 3, the model employed a time step of 3 s. The grid design incorporated a minimum resolution of approximately 500 m, and simulated current velocities did not exceed 4 m/s in the model outputs. CFL condition analysis confirmed that the Courant number remained substantially below 1. Furthermore, the wet/dry treatment threshold was set to 0.05 m following the FVCOM manual recommendations, significantly enhancing the model’s stability in nearshore simulations. Referencing the grid resolution settings from existing FVCOM studies [
42], our grid generation followed these principles: coastal zones, estuaries, and bays characterized by high gradients were refined to 500 m, while open sea areas gradually transitioned to 30 km. The geometric fidelity of shorelines and isobaths was maintained using high-resolution topographic and shoreline mapping data provided by RESDC.
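As a quick sanity check, the external-mode Courant number can be estimated as C = Δt·(√(gH) + |u|)/Δx. In the sketch below, Δt, Δx, and the 4 m/s velocity come from the setup described above, while the 200 m depth is an assumed upper bound of our own, not a value reported in the study.

```python
import math

def cfl_number(dt, dx, depth, u_max, g=9.81):
    """External-mode Courant number: dt * (sqrt(g*H) + |u|) / dx."""
    return dt * (math.sqrt(g * depth) + abs(u_max)) / dx

# dt = 3 s and dx = 500 m from the model setup; 4 m/s is the maximum
# simulated velocity. The 200 m depth is an illustrative assumption.
c = cfl_number(dt=3.0, dx=500.0, depth=200.0, u_max=4.0)  # ≈ 0.29
```

Even with a generous depth bound, the Courant number stays around 0.3, consistent with the claim that the 3 s time step is stable at 500 m resolution.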
The general hyperparameter settings for the TCN deep learning model are shown in
Table 4. The hyperparameters were configured based on the TCN paper and experimental observations from this study. The number of training epochs was set to 400, because the loss curve indicated convergence after approximately 300 epochs, and models trained with 400 epochs exhibited comparable performance to those trained with 500 epochs.
The common hyperparameters for the other deep learning models used for comparison were configured according to the settings in the table above. For all deep learning models, we selected water level data from January to August 2018 as the training set. The model input was uniformly set as the 48 h of observed data prior to the prediction, while the output was the water level for the next hour. Based on this setup, approximately 5500 data points were generated for each station to train the model. After training, we used a rolling prediction approach for water level forecasting. Specifically, the 48 h of observed data before 00:00 on 1 September were used to predict the water level for the next hour. This prediction was then combined with the previous 47 h of actual data to form a new input for predicting the water level for the second hour. The process continued until 48 h of water level predictions were completed. Next, the 48 h of observed water level data from 1 to 2 September were used to predict the water levels from 3 to 4 September, and so on, until predictions were made through 23:00 on 15 September, totaling 360 h of predictions. After comparing the prediction performance of the four deep learning models and the ARIMA model, we selected the best-performing model. We then used approximately 80% of the predicted data from five tidal stations to train the fusion model. Finally, we tested the fusion model using the remaining 48 h of prediction data from these five stations, as well as the full prediction data from another station that was not involved in fusion model training.
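The block-wise rolling scheme described above can be sketched as follows. Here `model` stands in for any of the trained one-step predictors, and the function name and structure are our illustrative reconstruction, not code from the paper.

```python
import numpy as np

def rolling_forecast(model, observed, horizon=48, window=48):
    """Block-wise rolling forecast: each 48-h block is seeded with
    observed data, then one-step predictions are fed back into the
    input window until the block is filled."""
    preds = []
    for start in range(window, len(observed) - horizon + 1, horizon):
        buf = list(observed[start - window:start])   # seed with real data
        for _ in range(horizon):
            y = model(np.asarray(buf[-window:]))     # one-step prediction
            preds.append(y)
            buf.append(y)                            # feed prediction back
    return np.array(preds)
```

Because every 48-h block is re-seeded with observations, prediction errors can accumulate within a block but are not carried over to the next one.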
The hyperparameter settings for the fusion model, DDPG_dual, are shown in
Table 5. The state dimension is 24, meaning the input to the fusion network consists of 24 values, formed by concatenating the 12-h predicted water levels from FVCOM and from TCN. The fusion network outputs two weights, which are used to compute a weighted fusion of the corresponding FVCOM and TCN predictions, yielding the final prediction of the fusion model.
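This weighting step can be sketched as below. The variable names are ours, and the softmax normalization of the two weights is an illustrative assumption (the paper does not specify how the actor's outputs are constrained).

```python
import numpy as np

def fuse(fvcom_pred, tcn_pred, weights):
    """Weighted fusion of two 12-h prediction vectors.
    `weights` are the two values output by the fusion (actor) network;
    softmax normalization here is an illustrative assumption."""
    w = np.exp(weights - np.max(weights))
    w /= w.sum()
    return w[0] * np.asarray(fvcom_pred) + w[1] * np.asarray(tcn_pred)

# Illustrative 12-h prediction vectors (not real station data).
fvcom_12h = np.linspace(0.0, 1.1, 12)
tcn_12h   = np.linspace(0.1, 1.2, 12)
state = np.concatenate([fvcom_12h, tcn_12h])   # 24-dim state for the actor
fused = fuse(fvcom_12h, tcn_12h, np.array([0.0, 0.0]))  # equal weights
```

With equal weights the fusion reduces to a simple average of the two base predictions; the learned policy shifts the weights toward whichever model is more reliable in the current state.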
4.2. Experimental Result
This section provides a detailed analysis of the experimental results, focusing primarily on the four metrics mentioned earlier. Comparative analysis of similar models was conducted in both the deep learning prediction model phase and the fusion model phase, with the aim of comprehensively evaluating the performance and effectiveness of the proposed fusion model.
Firstly, we used the FVCOM model to predict the tidal levels at six tidal stations. Since FVCOM starts simulations in cold-start mode, to ensure the accuracy of the prediction results, we discarded the first 48 h of data after the simulation began and selected the subsequent 312 h of data as the FVCOM prediction results. Among these, for the five stations used for fusion model training, we specifically selected the last 48 h of data, which were not involved in the fusion model training, for display (the previous 250 h of prediction data were used for fusion model training). For station BHI, which was not involved in fusion model training, we displayed its complete 312-h prediction results. In
Figure 6, the predicted results for each station are shown by green lines, while the corresponding observed tidal levels are shown by red lines.
(1) Quantitative Comparative Analysis of Models: This section focuses on the performance comparison between two categories of models: first, the comparison between deep learning models and the ARIMA model, and second, the performance comparison among different fusion models.
First, we compared the performance between the deep learning models and the ARIMA model. The deep learning models included TCN (Temporal Convolutional Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory Network), and VIT (Vision Transformer). We used these models to predict the tidal levels at six tidal stations, with the performance metrics of the models for each station shown in
Table 6. For each station, the numbers marked in red represent the minimum value for the corresponding model on that metric (note that for MAE, which is Mean Absolute Error, and RMSE, which is Root Mean Square Error, smaller values are better). The numbers marked in blue represent the maximum value for the corresponding model on that metric (where Pearson correlation coefficient and NSE, the Nash–Sutcliffe Efficiency, are better when larger).
Overall, the TCN model demonstrated the best performance in this study, achieving the smallest Mean Absolute Error (MAE) at most stations while also attaining the highest correlation coefficient and Nash–Sutcliffe Efficiency (NSE). This outstanding performance is primarily attributed to the unique architecture of the TCN: by stacking convolutional layers with different dilation rates, the TCN can effectively capture patterns at various scales and long-term dependencies within sequential data. This structural feature makes the TCN particularly effective in modeling long-term temporal correlations. Additionally, the convolutional operations used in the TCN are shift-invariant, meaning that regardless of the time position of the input sequence, the learned features remain valid. This characteristic further enhances the robustness of the TCN model, enabling it to better adapt to temporal shifts and dynamic changes in sequential data, thereby significantly improving prediction accuracy. The LSTM model followed closely, exhibiting the second-best performance, while the traditional RNN model ranked behind the LSTM. This outcome is expected, as LSTM, an improved version of RNN, has a distinct advantage in handling long-term dependencies in sequential data. In contrast, the VIT model performed poorly in this tidal level forecasting task. Originally designed for image recognition, the VIT model’s core idea is to divide an image into multiple patches and arrange them into a sequence, similar to the processing of natural language. While some adjustments can be made to the network, the VIT model still struggled to achieve satisfactory prediction results for sequential data like tidal level forecasting. The ARIMA model performed the worst in this study. Considering all metrics, the TCN model, with its excellent performance and robustness, emerged as the preferred choice for the deep learning prediction module.
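To make the dilation argument concrete: with dilation doubling at each residual block (and two causal convolutions per block, as in the original TCN design), the receptive field grows exponentially with depth. The kernel size and level count below are illustrative assumptions, not the settings of Table 4.

```python
def tcn_receptive_field(kernel_size, num_levels):
    """Receptive field (in time steps) of a TCN with dilation 2**i at
    level i and two causal convolutions per residual block."""
    rf = 1
    for level in range(num_levels):
        rf += 2 * (kernel_size - 1) * (2 ** level)
    return rf

# e.g. kernel size 3 with 5 levels already spans the 48-h input window
coverage = tcn_receptive_field(kernel_size=3, num_levels=5)  # 125 steps
```

A modest stack therefore sees the entire 48-h input at once, which is what allows the TCN to model both short-scale fluctuations and tide-cycle-scale dependencies.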
After obtaining the prediction data from FVCOM and the deep learning model TCN, we conducted a series of tests and comparisons on five reinforcement learning fusion models: DDPG_dual, DDPG, PPO, DQN, and DDQN. Specifically, we selected approximately 80% of the prediction data from five stations for training the fusion models and used the last 48 h of prediction data for fusion testing. Additionally, for the station that did not participate in the fusion model training (BHI), we used 312 h of prediction data for fusion testing. The results are presented in
Table 7. For comparison purposes, the performance metrics of the FVCOM and TCN models are also listed in the table. As in
Table 6, the numbers marked in red indicate the minimum value achieved by the corresponding model on that metric, while the numbers marked in blue indicate the maximum value achieved.
Overall, the DDPG_dual model demonstrated the best predictive performance across all tide stations, outperforming both the TCN model and the DDPG model. This can primarily be attributed to the addition of a specialized module for handling extremes in the actor and critic networks of the DDPG_dual model compared with the DDPG model. This module is more effective in predicting wave peaks and troughs, leading to predictions that are closer to the real data. This will be further validated in the subsequent ablation experiments.
The DDPG model showed strong performance at the five training stations, second only to the DDPG_dual model. However, at the BHI station, its Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) still lagged behind those of the TCN model. Notably, the DDPG_dual model also achieved the best results when conducting fusion predictions at the unseen BHI station, which suggests that its generalization ability is stronger than that of the DDPG model. It is worth emphasizing that, within the spatiotemporal scope influenced by Typhoon Mangkhut, the proposed ensemble learning DNR model effectively integrated the prediction sequences from the two base models and achieved accurate tidal level forecasting. This further validates the robustness of the proposed method under extreme weather conditions.
A comparison with the related studies discussed earlier shows that our proposed DNR model achieved a correlation coefficient above 0.9 even over a 48-h prediction horizon, outperforming the correlation coefficient reported in [27] for 24-h tidal predictions. Compared with the results of [14], the correlation coefficient of our model’s 48-h tidal prediction (averaging 0.9403 across the six stations) remains close to that of the 50-h tidal prediction curve (average 0.9511) reported in their study under the influence of Typhoon Mangkhut. Overall, the performance of the proposed model can be considered satisfactory.
(2) Qualitative Comparison: We also plotted the prediction curves of each model.
Figure 7 shows the prediction curves of the deep learning models compared against the ARIMA model, while
Figure 8 presents the prediction curves of the various fusion models. In
Figure 8, for each subplot, the red solid line represents the real tide level curve, and the black solid line represents the prediction curve of the DDPG_dual model. From the figure, it is clear that the DDPG_dual model generally fits the real curve quite well. In most cases, compared with the DDPG model, it is able to more accurately capture the peaks and troughs of the real curve, demonstrating superior predictive performance. However, at certain time points the DDPG_dual model’s predictions still show some shortcomings. For example, around the 10th hour at the SWI station and around the 13th hour at the DWS station, there are significant discrepancies between the predicted and actual tide levels. The DQN and DDQN models, due to their discrete action selection, have certain limitations in their fusion performance compared with the DDPG model. The PPO fusion model, being the most complex of the models tested, not only requires longer training times but also converges more slowly. Although some patterns in the tide level changes can be observed in its predictions, the prediction errors are large, making it difficult to meet the precision requirements of practical applications.
(3) Ablation Study: To further enhance the performance of the fusion model, we made two key improvements to the original DDPG model. On one hand, we introduced a dual-channel mechanism to strengthen the model’s ability to process input information, thereby improving the fusion effect. On the other hand, we optimized the original reward mechanism to encourage the model to focus more on accurately handling extreme values during training.
Specifically, we denote the original DDPG model as DDPG_O_1. After introducing the dual-channel mechanism, the model is named DDPG_dual. Further adjusting the reward mechanism on top of DDPG_dual yields a model named DDPG_d_e. To validate the effectiveness of the newly added structures, we tested the above models on the same dataset. The mean absolute errors (MAE) for each model are shown in
Table 8. Here, DDPG_d_e (128) and DDPG_d_e (256) denote models with hidden layers of 128 and 256 neurons, respectively. All errors are reported in centimeters (cm).
As seen in
Table 8, the basic fusion model DDPG had an average error of 12.285 cm. After incorporating the dual-channel mechanism, the average error dropped to 11.624 cm. Further improving the reward mechanism reduced the average error to 11.228 cm. When the hidden layer was enlarged to 256 neurons, the average error decreased further to 10.248 cm. Overall, these adjustments effectively enhanced the fusion performance of the DDPG model, yielding an approximately 16% reduction in prediction error. During the optimization of DDPG_dual, we implemented the aforementioned improvements and plotted their prediction curves (
Figure 9). The results demonstrate that the optimal model outperformed the others in closely matching the extreme values of the ground truth curve. For instance, at tidal station DWS, it showed significantly better alignment with the actual measurements at the first trough, third trough, and second peak. Similar improvements were observed at station SWI. Moreover, the best-performing model maintained superior consistency with the real observations across all remaining tidal stations.