Article

Maximum Individual Wave Height Prediction Using Different Machine Learning Techniques with Data Collected from a Buoy Located in Bilbao (Bay of Biscay)

by Lucia Porlan-Ferrando 1, J. David Nuñez-Gonzalez 1, Alain Ulazia Manterola 2,*, Nahia Martinez-Iturricastillo 3 and John V. Ringwood 3

1 Applied Mathematics Department, University of the Basque Country (UPV/EHU), Otaola 29, 20600 Eibar, Spain
2 Energy Engineering Department, University of the Basque Country (UPV/EHU), Otaola 29, 20600 Eibar, Spain
3 Centre for Ocean Energy Research, Maynooth University, W23 F2H6 Maynooth, Ireland
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(4), 625; https://doi.org/10.3390/jmse13040625
Submission received: 18 February 2025 / Revised: 17 March 2025 / Accepted: 19 March 2025 / Published: 21 March 2025
(This article belongs to the Special Issue New Era in Offshore Wind Energy)

Abstract:
Accurate prediction of extreme waves, particularly the maximum wave height and the ratio between the maximum and significant wave heights of individual waves, is crucial for maritime safety and the resilience of offshore infrastructure. This study employs machine learning (ML) techniques such as linear regression modeling (LM), support vector regression (SVR), long short-term memory (LSTM), and gated recurrent units (GRU) to develop predictive models based on historical data (1990–2024) obtained from a buoy at a specific oceanic location. The results show that the SVR model provides the highest accuracy in predicting the maximum wave height (Hmax), achieving a coefficient of determination (R²) of 0.9006 and a mean squared error (MSE) of 0.0185. For estimation of the ratio between maximum and significant wave heights (Hmax/Hs), the SVR and LM models exhibit comparable performance, with MSE values of approximately 0.023. These findings have significant implications for improving early warning systems, optimizing the structural design of offshore infrastructure, and enhancing the efficiency of energy extraction under changing climate conditions.

1. Introduction

Climate change, primarily driven by anthropogenic activities, is inducing rapid and significant alterations in global temperature and other climatic variables, with potentially catastrophic consequences. The predominant driver of this phenomenon is the emission of greenhouse gases resulting from the combustion of fossil fuels. However, the integration of renewable energy sources, advances in energy efficiency, electrification, demand-side management, and the deployment of intelligent solutions present viable pathways to implementing transformative reforms in electrical and transportation systems. These interventions not only mitigate the impacts of climate change but also promote job creation and reduce energy costs [1].
Renewable energy sources are characterized by their cleanliness and inexhaustibility, and constitute a competitive alternative to fossil fuels because of their diversity, abundance, and global availability. These energy sources do not produce greenhouse gas emissions or pollutants during operation [2]. Among renewable energy sources, oceans represent a substantial energy reservoir, offering opportunities through wave energy, tidal forces, ocean currents, and thermal gradients between surface and deep waters 1. Water covers approximately 70% of the Earth’s surface, and 97% of this area corresponds to seas and oceans; marine energy therefore demonstrates several advantages, including low environmental impact, high predictability, and continuous energy generation, making it a compelling option compared to intermittent sources such as solar and wind power [3]. Wave energy in particular exhibits high energy density and reliability, especially in coastal regions, and holds the potential to address a significant portion of global energy demand. However, climate change may alter oceanographic conditions such as wave height and wind speed, potentially affecting the viability of marine energy generation [4]. Moreover, climate change, both past and projected, is globally affecting the long-term behavior of wave loads and energy, as shown by several previous authors [5,6,7,8]. This lends greater importance to the study of extreme oceanic events in terms of maximum individual wave height.
Accurate prediction of extreme wave events, specifically the maximum wave height (Hmax) and the ratio of maximum height to significant height (Hmax/Hs), is critical to ensuring the safety and sustainability of essential maritime operations such as navigation, positioning of offshore platforms, and the design of wave energy converters [9]. The significant wave height (Hs), defined as the average height of the highest third of the waves recorded during a given period, is widely used in oceanography to describe general sea conditions. However, in scenarios that involve extreme wave activity, Hmax and its ratio to Hs are more relevant metrics. For wave height distributions adhering to a Rayleigh distribution, theoretical models indicate that the maximum wave height can reach approximately twice the significant wave height. Consequently, reliable forecasting of these extreme events is vital not only to ensuring maritime safety but also to optimizing the design and placement of offshore infrastructure, reducing risks to personnel, and enhancing navigation route planning [10].
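To make this factor of two concrete, a standard narrow-band (Rayleigh) result, sketched here for reference, relates the most probable maximum of N individual waves to the significant wave height:

Hmax/Hs ≈ √((1/2) ln N),

so that for a record of roughly N = 1000 waves (on the order of a few hours of measurements), Hmax/Hs ≈ √(0.5 ln 1000) ≈ 1.9, i.e., close to twice the significant wave height.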
Two principal approaches are used for predicting meteorological and geographic parameters such as temperature, wind speed, and humidity. The first approach relies on physical models that use numerical simulations based on geographic and meteorological input, while the second employs machine learning (ML) methodologies, which can enable more flexible and long-term predictive capabilities [11]. However, traditional ML models exhibit limitations in capturing large-scale extreme phenomena, underscoring the need for development of robust ML-based techniques to enhance the precision of maximum wave height estimation [12]. The integration of meteorological and oceanographic data with ML approaches has demonstrated significant improvements in the accuracy of extreme wave event predictions [13].
This research focuses on estimation of the maximum wave height Hmax and the Hmax/Hs ratio using observational data collected from a buoy located in Bilbao. By employing machine learning algorithms, including linear regression (LM), support vector regression (SVR), long short-term memory (LSTM), and gated recurrent units (GRU), this study seeks to identify the most suitable methodology for modeling these specific phenomena and improving predictive accuracy on a global scale. We hypothesize that the SVR model will provide the highest accuracy in predicting Hmax, while the LM and SVR algorithms will yield similar results when estimating the Hmax/Hs ratio. These findings are expected to contribute to the design of resilient marine structures and the optimization of maritime operations.

2. Related Works

Climate change has significantly altered wave characteristics on a global scale. In [14], the authors identified increasing trends in significant wave height across various oceanic regions, attributing these changes to wind pattern variations driven by global warming. In [15], the authors projected an increase in wave energy due to ocean warming, with potential implications for coastal erosion and maritime infrastructure.
Extreme waves pose a considerable challenge in the design of marine structures. In [16], the authors analyzed impact loads from solitary waves on vertical structures, emphasizing the importance of breaking wave forces, while ref. [17] observed a 30 cm increase in the mean height of winter waves off the California coast since 1970 along with a doubling in the frequency of waves exceeding 4 m between 1996–2016 compared to 1949–1969. In [18], the authors reported significant positive trends in extreme significant wave height and wind speed, particularly in the Beaufort and East Siberian Seas, which they attributed to sea ice retreat and the expansion of open water areas. In [19], the authors utilized satellite observations to detect increasing trends in significant wave height and wind speed, particularly in the high latitudes of the Southern Hemisphere. In [20], the authors developed a theoretical framework for evaluating wave–structure interactions, enabling assessment of wave-induced loads under varying maritime conditions. Accurate prediction of maximum wave height is critical for the design of maritime infrastructure. In [21], the authors highlighted the importance of estimating the relationship between maximum and significant wave heights for establishing reliable design criteria, while ref. [22] introduced a statistical model for predicting extreme wave heights.
Recent advancements have improved wave height forecasting. In [23], the authors demonstrated that autoregressive models could predict wave heights up to 5 s ahead, enhancing the stability of marine energy structures. In [24], the authors developed a vision transformer-based model for regional significant wave height prediction, leveraging attention mechanisms to capture wind–wave relationships while preserving positional information via transposed convolutions. Sargol et al. applied adaptive neuro-fuzzy inference systems (ANFIS) and support vector regression (SVR) to forecast seasonal maximum wave heights using buoy data from the southern Baltic Sea. In [12,25], the authors analyzed critical ocean wave parameters such as the maximum crest height, crest-to-trough height, and envelope height, validating their estimates with observational data from the North Pacific.
Reguero et al. [9] investigated the impact of climate change on the variability of the Hmax/Hs ratio, a key parameter in marine structure design and risk assessment. Their findings indicated significant regional variations in this ratio driven by shifts in wind intensity and direction, which in turn could affect the prediction of extreme wave events. An increase in Hmax/Hs suggests an elevated risk of extreme waves in specific regions, potentially compromising the safety of maritime infrastructure and vessels. Their study highlights the necessity of integrating updated climate projections into wave forecasting models and coastal engineering practices to enhance resilience against increasingly variable and intense oceanic conditions.
In conclusion, climate change is intensifying wave conditions worldwide, posing significant challenges for coastal and offshore infrastructure. Advances in maximum wave height prediction and better understanding of the Hmax/Hs ratio are crucial for improving the resilience of marine structures in response to an increasingly extreme ocean climate.
While previous studies have used certain machine learning techniques, they have either not compared a comprehensive range of models or not applied these techniques to long-term datasets. The goal of this paper is to cover this gap, providing a predictive experimental study covering different phases and techniques of machine learning that can help to predict maximum wave heights in the future from time series data.
The rest of this paper is structured as follows: Section 3 presents the experimental data, including characteristics and preprocessing; Section 4 describes the methodology; Section 5 shows the experimental pipeline; Section 6 presents the results; and Section 7 concludes the paper.

3. Data

3.1. Location and Attributes

The dataset used in this paper was acquired from the Bilbao buoy, provided by Puertos del Estado 2. The geographical location of the buoy is indicated in Figure 1, represented by a red marker. The primary technical specifications of the buoy are summarized in Table 1.
The dataset from Puertos del Estado comprised 17 attributes, categorized as temporal data, wave characteristics, and meteorological-oceanographic data, as detailed below:
  •  Time:
     Date (GMT), date and hour.
  •  Waves:
     Significant height, Hs (m).
     Mean period, Tm (s).
     Peak period, Tp (s).
     Maximum height, Hmax (m).
     Period associated with the maximum height, Tmax (s).
     Mean direction, Dir.m (degrees).
     Mean direction at peak energy, Dir.p (degrees).
     Directional spread at peak energy, Disp (degrees).
  •  Meteorology (data recorded at 3 m above the surface):
     Average wind speed, Uv (m/s).
     Average wind direction, Dir.v (degrees).
     Air temperature, Tair (°C).
     Atmospheric pressure, p (hPa).
  •  Oceanography (data recorded at 3 m below the surface):
     Average current velocity, Uc (cm/s).
     Average current direction, Dir.of.prop (degrees).
     Salinity, Sal (psu).

3.2. Data Preprocessing

The main goal of this research is to predict the maximum wave height (Hmax) and the ratio of the maximum to the significant wave height (Hmax/Hs) based on buoy-derived variables. Initially, a correlation matrix (Figure 2) was computed to examine the relationships between predictor variables and target variables (Hmax and Hmax/Hs). This analysis facilitated our selection of the most pertinent variables for each prediction task.
For prediction of Hmax, the following five variables exhibited the strongest correlations:
  • Significant wave height (Hs).
  • Mean wave period (Tm).
  • Peak wave period (Tp).
  • Period associated with the maximum wave height (T_Hmax).
  • Wind speed (U).
Similarly, for the prediction of Hmax/Hs, the selected variables included the following:
  • Mean wave period (Tm).
  • Directional spread at the energy peak (Disp).
  • Period associated with the maximum wave height (T_Hmax).
  • Mean wave direction (Dir.m).
  • Peak wave period (Tp).
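As an illustration of this correlation-based selection step, the following R sketch ranks candidate predictors by the absolute value of their Pearson correlation with the target; the data frame name (buoy) and the column names are assumptions matching the attribute list above.

predictors <- c("Hs", "Tm", "Tp", "THmax", "Dir.m", "Dir.p", "Disp",
                "Uv", "Dir.v", "Tair", "p", "Uc", "Dir.of.prop", "Sal")

# Pearson correlation of each candidate with the target,
# ignoring missing values pairwise
cors <- sapply(predictors, function(v)
  cor(buoy[[v]], buoy$Hmax, use = "pairwise.complete.obs"))

# Keep the five variables with the strongest absolute correlation
selected <- names(sort(abs(cors), decreasing = TRUE))[1:5]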
The preprocessing pipeline commenced with variable selection and an assessment of missing data. To address missing values, three distinct imputation methods were implemented:
  • Predictive Mean Matching (PMM): This method identifies the observed values closest to the predictive mean, randomly selects one, and assigns it to the missing entry.
  • Classification and Regression Trees (CART):
    Constructs decision trees via recursive partitioning.
    For each missing value, determines the terminal node of the fitted tree.
    Assigns the observed value through random sampling within the terminal node.
  • LASSO Regression with Normalization (LASSO.NORM): This method employs LASSO linear regression combined with bootstrapping to handle univariate normal missing data.
Following imputation, the dataset was normalized to ensure uniform feature scaling in order to enhance the performance and stability of the machine learning models.
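The three imputation methods correspond to options available in the R package mice ("pmm", "cart", "lasso.norm"); a minimal sketch of this imputation and scaling step, assuming a data frame named buoy, could look as follows.

library(mice)

# One imputed copy of the data per method (m kept at 1 for illustration)
imp_pmm   <- mice(buoy, method = "pmm",        m = 1, seed = 1)
imp_cart  <- mice(buoy, method = "cart",       m = 1, seed = 1)
imp_lasso <- mice(buoy, method = "lasso.norm", m = 1, seed = 1)

buoy_imp <- complete(imp_pmm)      # extract one completed data set

# Normalize numeric features to zero mean and unit variance
num_cols <- sapply(buoy_imp, is.numeric)
buoy_imp[num_cols] <- scale(buoy_imp[num_cols])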

4. Methodology

To address the prediction tasks, three regression algorithms were employed:
  • Linear Regression (LM): Linear regression can model the relationship between dependent (target) variables and independent (predictor) variables as a linear equation:
    Y = β_0 + β_1 X_1 + ⋯ + β_n X_n + ε,
    where β_0 is the intercept, β_i are the coefficients, and ε represents the error term.
  • Support Vector Regression (SVR): SVR is an extension of support vector machine (SVM) that constructs hyperplanes in high-dimensional space to minimize the prediction error within a defined tolerance margin. It is particularly effective for complex nonlinear datasets.
  • Long Short-Term Memory (LSTM): LSTM is a recurrent neural network (RNN) architecture designed to model data sequences while maintaining long-term information through a memory cell structure. It uses different types of gates (input, forget, and output) to regulate the flow of information, allowing it to learn long-term dependencies without encountering the problem of gradient vanishing.
  • Gated Recurrent Units (GRU): GRU is another variant of the recurrent neural network architecture. Similar to LSTM, it is designed to handle long-term dependencies in data sequences. Unlike LSTM, however, GRU has a simpler structure, using only two gates (reset and update) to control the flow of information, which makes it more computationally efficient without sacrificing the ability to model temporal dependencies.
Regarding the hyperparameter configuration of each model, linear regression (LM) does not require predefined hyperparameters, making it a simple and computationally efficient baseline approach. The support vector regression (SVR) model utilizing a linear kernel (kernel = "linear") adopts the following default hyperparameter values: cost = 1, gamma = 1, epsilon = 0.1, and type = "eps-regression", which govern its optimization process by defining the error margin constraints.
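The paper does not name the underlying library, but the parameter names above match the svm() function of the R package e1071; a hedged sketch of the two baselines, with an assumed formula over the selected predictors, is shown below.

library(e1071)

# Linear regression baseline (no hyperparameters to set)
lm_fit <- lm(Hmax ~ Hs + Tm + Tp + THmax + Uv, data = data_train)

# SVR with the quoted settings; variable names are assumptions
svr_fit <- svm(Hmax ~ Hs + Tm + Tp + THmax + Uv, data = data_train,
               type = "eps-regression", kernel = "linear",
               cost = 1, gamma = 1, epsilon = 0.1)

pred_lm  <- predict(lm_fit,  newdata = data_test)
pred_svr <- predict(svr_fit, newdata = data_test)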
For the LSTM and GRU models, both of which are based on recurrent neural networks, an identical hyperparameter configuration was employed. The architecture consisted of an initial layer with 128 units followed by a second layer with 64 units. The input dimensionality was defined as input_shape = (1, 5), indicating a single time step with five input features. The output layer comprised a single neuron with linear activation, ensuring the production of continuous numerical outputs. Additionally, L2 regularization with a coefficient of 0.01 was applied to the final layer in order to mitigate overfitting by controlling model complexity.
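A sketch of this recurrent architecture using the keras package for R is given below; the optimizer and loss are assumptions (the paper specifies neither), and the GRU variant simply swaps layer_lstm for layer_gru.

library(keras)

lstm_model <- keras_model_sequential() %>%
  # first recurrent layer: 128 units, one time step with five features
  layer_lstm(units = 128, return_sequences = TRUE, input_shape = c(1, 5)) %>%
  # second recurrent layer: 64 units
  layer_lstm(units = 64) %>%
  # single linear output neuron with L2 regularization (0.01)
  layer_dense(units = 1, activation = "linear",
              kernel_regularizer = regularizer_l2(0.01))

lstm_model %>% compile(optimizer = "adam", loss = "mse")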
For this study, we selected models specifically designed for regression tasks in order to facilitate a rigorous comparative evaluation of their predictive performance. The LM and SVR models were chosen due to their capacity to effectively model both linear and nonlinear relationships. Conversely, the LSTM and GRU models were incorporated to assess their performance in capturing temporal dependencies, enabling a comparative analysis between conventional and advanced approaches with an appropriate balance between predictive accuracy and computational efficiency.
The integration of these diverse machine learning models allows for a comprehensive assessment of their predictive capabilities in estimating the maximum individual wave height (Hmax) and the ratio between maximum and significant wave height (Hmax/Hs) using meteorological and oceanographic data. This comparative framework can provide valuable insights into the relative strengths and limitations of each approach in the context of wave prediction.

5. Proposed Experiment

As detailed above, the experiment described in this study was conducted using databases obtained from the Spanish State Ports website referenced previously, where data for buoys are available.
To achieve robust and accurate predictions, a structured machine learning (ML) lifecycle was implemented. This lifecycle comprised multiple phases, including planning, data preprocessing, model engineering, evaluation, deployment, monitoring, maintenance, and knowledge extraction.
A rigorous variable selection process was carried out during the data preprocessing stage, identifying five explanatory variables in addition to the target variable. Subsequently, missing values were imputed and various transformations were applied, including standardization of variables to ensure consistency in the input features.
Additionally, an autoregressive modeling approach was integrated, expressed mathematically as
y(t) = f(y(t−1), …, y(t−n), x(t−1), …, x(t−m)),
where past instances of both the independent and dependent variables were utilized. This methodological framework enabled the models to capture the temporal dependencies inherent in time series data, thereby enhancing predictive performance by incorporating relevant historical information during training.
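One simple way to build such lagged inputs in R, sketched here with hypothetical lag orders n = m = 3 and with Hs as the exogenous variable, is the following.

# Lagged copies of a vector v up to order k
make_lags <- function(v, k) sapply(1:k, function(j) dplyr::lag(v, j))

n <- 3; m <- 3   # assumed lag orders for illustration
lagged <- data.frame(
  y = buoy_imp$Hmax,
  make_lags(buoy_imp$Hmax, n),   # y(t-1), ..., y(t-n)
  make_lags(buoy_imp$Hs,   m)    # x(t-1), ..., x(t-m)
)
lagged <- na.omit(lagged)        # drop the first rows with incomplete lags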
Following data preprocessing, different regression-based predictive models were trained and evaluated, including LM, SVR, LSTM, and GRU models. A comparative analysis of these models was then performed using a range of performance metrics to assess their predictive accuracy and reliability.
A fundamental aspect of our methodological framework involved systematically testing different database configurations and preprocessing strategies in order to identify the most effective approach for training the regressors and optimizing their predictive performance.
The specific database configurations considered in this study are detailed in the following section.
For the experiments detailed herein, the database spanned the years 1990 to 2024. For training and testing purposes, individual years were selected as subsets to create training and test datasets. This procedure was repeated with multiple randomly selected years from the complete database, ensuring robust and representative results.
The computational experiments were conducted on a machine equipped with an 11th-Gen Intel® Core™ i7-1165G7 processor operating at a base frequency of 2.80 GHz and with a maximum clock speed of 2.803 GHz. Due to system resource limitations, the data analyses were performed using one-year temporal windows.

5.1. Hmax Predictions

The primary objective of this experiment was to systematically assess the efficacy of different validation methodologies in training predictive models. These methods vary in their partitioning strategies for training and test sets as well as in the configuration of prediction windows within the test set. Although conceptually similar, they differ in terms of the dimensions of the prediction windows and the allocation of data between training and testing subsets. The experiment was conducted using the preprocessed database, where each subset (training and testing) corresponded to a full year, and predictions were performed for multiple years randomly selected from the dataset. This approach ensured a robust evaluation of the models’ generalization ability across different temporal periods.
  • Method 1: The training set consisted of the first three months of the year, while the test set comprised the remaining nine months. Predictions were generated iteratively; the model was initially trained on the first three months, after which it predicted the following month. After each prediction, the actual values of the predicted month were added to the training set and the process was repeated until predictions for all months in the test set were completed.
  • Method 2: The training set was defined as the first 1000 observations of the year, while the remaining observations were assigned to the test set. Predictions were generated for the 50 observations immediately following the training set. After obtaining these predictions, the actual values corresponding to the predictions were incorporated into the training set and iteration continued until predictions were generated for the entire test set.
  • Method 3: The training set comprised the first two weeks of the year, while the remaining data were allocated to the test set. Predictions were made for three consecutive days following the initial two-week training period. After obtaining these predictions, the actual values corresponding to the predictions were incorporated into the training set and iteration continued until all test set predictions were completed.
  • Method 4: This method was similar to Method 3 but introduced an additional adjustment. After generating predictions for three consecutive days, the actual values corresponding to the predictions were added to the training set, while the three oldest days were removed. This iterative process continued until all predictions for the test set were completed.
All four methods shared a common iterative framework, with specific differences in the size and configuration of the training and test sets. Below, Algorithm 1 describes the shared approach for the methods described above.
The average results for all the trained years are presented in Figure 3, graphically represented through bar diagrams.
Validation of the proposed methods was conducted to assess their performance across both short-term and long-term prediction horizons, as evaluated through the designated prediction windows applied to the test set. Among the analyzed methods, Method 1 yielded the most notable results. Additionally, a comparative analysis of different predictive models indicated that the linear regression (LM) and support vector regression (SVR) models exhibited superior performance, demonstrating higher levels of accuracy and robustness.
Algorithm 1 General algorithm
Input: data (cleaned and normalized database)
Goal: predict the target variable on the test set.
data_train ← initial training set (method-specific)
data_test ← test set (method-specific)
n_rows ← number of rows of data_test
increment ← number of observations in the prediction window
for i in seq(1, n_rows, by = increment)
     model ← regressor fitted on data_train
     window ← rows i to min(i + increment − 1, n_rows) of data_test
     prediction ← predict(model, window)
     # per Methods 1–4, the observed values of the predicted window are
     # appended to the training set before the next window is predicted
     data_train ← rbind(data_train, window)
end for
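As a concrete, hedged instance of Algorithm 1, the sketch below implements Method 2 (first 1000 observations as the initial training set, 50-observation prediction windows) with the SVR regressor; the data frame data_all and the formula are assumptions.

library(e1071)

data_train <- data_all[1:1000, ]
data_test  <- data_all[-(1:1000), ]
increment  <- 50
preds      <- numeric(0)

for (i in seq(1, nrow(data_test), by = increment)) {
  idx   <- i:min(i + increment - 1, nrow(data_test))
  model <- svm(Hmax ~ Hs + Tm + Tp + THmax + Uv, data = data_train,
               type = "eps-regression", kernel = "linear",
               cost = 1, gamma = 1, epsilon = 0.1)
  preds <- c(preds, predict(model, newdata = data_test[idx, ]))
  # append the observed rows of the predicted window before refitting
  data_train <- rbind(data_train, data_test[idx, ])
}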

5.2. Hmax/Hs Predictions

The objective of the second experiment was to assess the effectiveness of the same validation methodologies in predicting the Hmax/Hs ratio, as opposed to the individual maximum wave height Hmax examined in the first experiment. The methodology remained identical, with the primary distinction being the target variable, which in this case corresponded to the ratio between the maximum wave height Hmax and the significant wave height Hs. In this experiment, predictions were conducted exclusively using the LM and SVR models, as they provided the best results in the previous experiment.
Despite using the same approaches and the same preprocessed dataset, the obtained results showed significantly lower performance compared to the prediction of Hmax, as shown in Figure 4, which contains the average values of the results for all the trained years. Both methods faced significant difficulties in accurately predicting this ratio, as reflected in the low R² values. This coefficient was notably lower compared to the values obtained for Hmax, indicating that the models were limited in their ability to explain the variability in the ratio data.
However, the error-based evaluation metrics, namely the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), remained comparable to those obtained in the Hmax prediction task. This finding suggests that although predicting the Hmax/Hs ratio is more challenging, the models still generate reasonable estimates in terms of absolute error magnitudes.
One potential explanation for the observed decline in predictive performance is the low correlation between the target variable (the Hmax/Hs ratio) and the predictor variables, which does not exceed 0.16. This weak correlation indicates a limited relationship between the input features and the target variable, potentially constraining the models’ ability to learn robust patterns.

6. Results

The following performance metrics were used to determine the effectiveness of the chosen regressors:
  • R-Squared (R²): The coefficient of determination, often referred to as R-squared, is the proportion of the total variance in the dependent variable that can be explained by the estimated regression model. A value close to 1 means that the model explains much of the variation in the data, while lower values mean that it does not fit the data well:
    R² = 1 − Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)² / Σᵢ₌₁ᴺ (yᵢ − ȳ)².
  • Mean Absolute Error (MAE): The MAE quantifies prediction accuracy by measuring the average magnitude of the errors in a set of predictions, providing a numerical value that represents how far the predictions are from the actual values. An MAE close to 0 indicates a more accurate model, as there is little difference between the actual and predicted values:
    MAE = (1/N) Σᵢ₌₁ᴺ |yᵢ − ŷᵢ|.
  • Mean Square Error (MSE): The MSE measures the average of the squared errors, i.e., the average squared difference between the estimated values and the actual values:
    MSE = (1/N) Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)².
  • Root Mean Square Error (RMSE): The RMSE is another metric used to measure prediction accuracy. Because the errors are squared before averaging, larger errors receive more weight. The ideal value is 0:
    RMSE = √[(1/N) Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²],
where ŷᵢ represents the predicted values, yᵢ the observed values, ȳ the mean of the observed values, and N the number of observations.
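In R, these four metrics reduce to a few lines; the sketch below assumes a vector of observed values y and a vector of predictions yhat.

# Evaluation metrics (y = observed, yhat = predicted)
r2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
mae  <- mean(abs(y - yhat))
mse  <- mean((y - yhat)^2)
rmse <- sqrt(mse)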
As previously discussed, model predictions of both the maximum wave height (Hmax) and the ratio (Hmax/Hs) were evaluated using these four key performance metrics: MAE, MSE, RMSE, and R², defined above. In the following, we present the best results accompanied by the standard deviation of the errors, allowing for assessment of the statistical significance of the predictions.

6.1. Hmax Predictions

For the initial experiment, the best results were obtained from the Hmax predictions for the year 2014, which exhibited the highest performance among all the years we evaluated.
Based on the results presented in Table 2, it can be observed that Method 1 exhibits the highest performance in terms of the considered evaluation metrics of R², MAE, MSE, and RMSE. In particular, the model that demonstrated the best performance on this dataset was SVR, suggesting its superior ability to capture the underlying patterns when predicting the target variable.
A key aspect of these results is that the standard deviation of the errors obtained with Method 1 is significantly lower compared to the other evaluated approaches. This indicates greater stability in predictions and reduced variability in errors, which in turn suggests higher consistency and accuracy across the entire dataset. The reduction in error dispersion is a critical indicator of the model’s robustness, further reinforcing its reliability compared to alternative methods and highlighting its potential as an optimal strategy for this type of predictive analysis.
To ensure the statistical validity of the results, normality tests were conducted. These tests are essential as they assess whether the data follow a normal distribution, which is a key requirement for many parametric statistical tests. The results indicated that the data do not exhibit a normal distribution, which has significant implications for selecting the most appropriate statistical tests for analysis.
Because many parametric tests assume data normality, applying them in cases where the assumption of normality does not hold could lead to misleading or biased conclusions. Therefore, the non-parametric Kruskal–Wallis test was chosen as a robust alternative that does not assume normality in order to effectively compare multiple groups. This test evaluates whether significant differences exist among the groups considered in the analysis and provides p-values that determine the statistical significance of these differences.
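A brief R sketch of this testing step is given below; the data frame errors, holding one column of per-prediction errors for each model, is an assumption, and the Shapiro–Wilk test is one common choice (the paper does not name the specific normality test), subsampled because its implementation accepts at most 5000 values.

# Normality check for one model's errors (subsampled to <= 5000 values)
shapiro.test(sample(errors$SVR, min(5000, nrow(errors))))

# Non-parametric comparison across models (LM, SVR, LSTM, GRU)
long <- stack(errors)                     # columns: values, ind (model)
kruskal.test(values ~ ind, data = long)   # p-value compared against 0.05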
The p-values obtained from the Kruskal–Wallis test are presented in Table 3, and serve as a key criterion for interpreting the significance of the results. The application of these methods ensures a rigorous and appropriate analysis, minimizing the risk of erroneous inferences and strengthening the reliability of the conclusions drawn. This approach enhances the robustness of the study and its applicability to similar contexts where the data may not conform to a normal distribution.
Because the obtained p-values are all below 0.05, it can be concluded that the observed differences are statistically significant.

6.2. Hmax/Hs Predictions

In the second experiment, the most accurate results in predicting the ratio (Hmax/Hs) were obtained for the year 2011, as shown in Table 4, highlighting it as the period with the best performance among all the years we evaluated. It is important to note that we used only the models that yielded the best results in the previous experiment, specifically the linear regression and SVR models.
Based on the results presented in Table 4, it can be observed that the SVR model exhibits superior performance in terms of R² and MAE, while the linear regression model demonstrates better performance for MSE and RMSE. However, in the context of this experiment, it is not possible to definitively determine the most suitable model, as the results obtained by all methods are comparatively similar. Nonetheless, it can be stated that Method 1 shows significantly smaller standard deviations than the other methods, suggesting greater consistency and accuracy in predictions across the dataset. This behavior underscores the robustness and reliability of this method in comparison to the evaluated alternatives.
The performance in terms of R² exhibits notably low values across all predictions. However, it is important to highlight that the trend line between the actual observations and the predictions demonstrates a high degree of congruence, as shown in Figure 5. In this figure, the x-axis represents the indices of the test set values, while the y-axis displays the predicted and actual values of Hmax. This finding suggests that, despite the imprecision of individual predictions, the models are capable of adequately capturing the overall trend of the data. This behavior indicates that with further adjustments, such as the inclusion of variables with higher correlation, the predictive capacity of the models could be significantly improved in future studies.
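A trend comparison of this kind can be reproduced with a short ggplot2 sketch; the vectors obs and pred and the axis labels are assumptions for illustration.

library(ggplot2)

df <- data.frame(index = seq_along(obs), observed = obs, predicted = pred)

ggplot(df, aes(index)) +
  geom_line(aes(y = observed),  colour = "black") +
  geom_line(aes(y = predicted), colour = "red") +
  # linear trend lines for observations and predictions
  geom_smooth(aes(y = observed),  method = "lm", se = FALSE) +
  geom_smooth(aes(y = predicted), method = "lm", se = FALSE,
              linetype = "dashed") +
  labs(x = "Test set index", y = "Predicted / observed value")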
In addition, normality tests were conducted to analyze the prediction results for the (Hmax/Hs) ratio. The p-value results indicated that the data do not follow a normal distribution; consequently, the non-parametric Kruskal–Wallis test was applied to assess whether there were significant differences between the groups. Upon examining the results presented in Table 4, it can be observed that the metrics obtained for the two models are notably similar. Therefore, the test is expected to reflect the absence of statistical differences between the two models. The results of the Kruskal–Wallis test are detailed in Table 5.
As shown in Table 5, the results align with expectations. The p-values are all greater than 0.05, allowing us to conclude that there are no statistically significant differences between the models used for predicting the variable in question.

7. Conclusions

The analysis conducted in this study demonstrates that long-term predictions are significantly more accurate than short-term ones. This can be attributed to Method 1 more effectively identifying and modeling stable trends in the data, thanks to its larger prediction window over the test set. In the context of ocean engineering, this improvement is particularly relevant for estimating extreme wave conditions such as swells, which critically impact both the safety and operational efficiency of maritime activities. The ability of the models to capture general trends over extended intervals provides a robust tool for risk planning and mitigation in harsh ocean environments, where the predictability of extreme conditions is essential for decision-making.
Regarding prediction of the Hmax variable, the models that best fit the data and achieved the most accurate results were those based on linear regression (LM) and support vector regression (SVR), a conclusion further supported by the statistical significance analysis.
This study also examined prediction of the Hmax/Hs ratio. This ratio shows a low correlation with the other variables, which affects the models’ accuracy. Nevertheless, error metrics such as MAE, MSE, and RMSE indicate that the models kept prediction errors low. Although the models did not fully capture the variability, the observed and predicted trends are similar, suggesting that they adequately captured the global dynamics.

Author Contributions

All authors contributed equally to conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization, supervision, project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Raw data are available upon request from the website www.puertos.es, accessed on 15 September 2024.

Acknowledgments

We thank Puertos del Estado (Spanish State Ports) for providing the requested data for this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANFIS        Adaptive Neuro-Fuzzy Inference System
CART         Classification and Regression Trees
GHG          Greenhouse Gases
GRU          Gated Recurrent Unit
LASSO.NORM   LASSO Linear Regression with Normalization
LM           Linear Regression Modeling
LSTM         Long Short-Term Memory
ML           Machine Learning
MAE          Mean Absolute Error
MICE         Multiple Imputation using Chained Equations
MRE          Marine Renewable Energy
MSE          Mean Square Error
OTEC         Ocean Thermal Energy Conversion
PMM          Predictive Mean Matching
PPCA         Probabilistic Principal Component Analysis
R²           R-Squared
RF           Random Forest
RMSE         Root Mean Square Error
SVR          Support Vector Regression

Notes

1
2
www.puertos.es, accessed on 15 September 2024.

References

  1. Yang, Y.; Zhao, L.; Chen, L.; Wang, C.; Wang, G. The spillover effects between renewable energy tokens and energy assets. Res. Int. Bus. Financ. 2024, 74, 102672. [Google Scholar] [CrossRef]
  2. Zhang, X.; Xu, W.; Rauf, A.; Ozturk, I. Transitioning from conventional energy to clean renewable energy in G7 countries: A signed network approach. Energy 2024, 307, 132655. [Google Scholar] [CrossRef]
  3. Khurshid, H.; Mohammed, B.S.; Al-Yacoubya, A.M.; Liew, M.; Zawawi, N.A.W.A. Analysis of hybrid offshore renewable energy sources for power generation: A literature review of hybrid solar, wind, and waves energy systems. Dev. Built Environ. 2024, 19, 100497. [Google Scholar] [CrossRef]
  4. Ibarra-Berastegui, G.; Sáenz, J.; Ulazia, A.; Sáenz-Aguirre, A.; Esnaola, G. CMIP6 projections for global offshore wind and wave energy production (2015–2100). Sci. Rep. 2023, 13, 18046. [Google Scholar] [CrossRef]
  5. Ulazia, A.; Esnaola, G.; Serras, P.; Penalba, M. On the impact of long-term wave trends on the geometry optimisation of oscillating water column wave energy converters. Energy 2020, 206, 118146. [Google Scholar] [CrossRef]
  6. Ulazia, A.; Sáenz, J.; Saenz-Aguirre, A.; Ibarra-Berastegui, G.; Carreno-Madinabeitia, S. Paradigmatic case of long-term colocated wind–wave energy index trend in Canary Islands. Energy Convers. Manag. 2023, 283, 116890. [Google Scholar] [CrossRef]
  7. Ulazia, A.; Saenz-Aguirre, A.; Ibarra-Berastegui, G.; Sáenz, J.; Carreno-Madinabeitia, S.; Esnaola, G. Performance variations of wave energy converters due to global long-term wave period change (1900–2010). Energy 2023, 268, 126632. [Google Scholar] [CrossRef]
  8. Martinez-Iturricastillo, N.; Ulazia, A.; Ringwood, J. Long term wave load trends against offshore monopile structures: A case study in the Bay of Biscay. Proc. EWTEC 2023, 15, 1–7. [Google Scholar] [CrossRef]
  9. Sierra, J.P.; Castrillo, R.; Mestres, M.; Mösso, C.; Lionello, P.; Marzo, L. Impact of climate change on wave energy resource in the Mediterranean coast of Morocco. Energies 2020, 13, 2993. [Google Scholar] [CrossRef]
  10. Stansell, P. Distributions of extreme wave, crest and trough heights measured in the North Sea. Ocean Eng. 2005, 32, 1015–1036. [Google Scholar] [CrossRef]
  11. Neshat, M.; Sergiienko, N.Y.; Rafiee, A.; Mirjalili, S.; Gandomi, A.H.; Boland, J. Meta Wave Learner: Predicting wave farms power output using effective meta-learner deep gradient boosting model: A case study from Australian coasts. Energy 2024, 304, 132122. [Google Scholar] [CrossRef]
  12. Barbariol, F.; Bidlot, J.R.; Cavaleri, L.; Sclavo, M.; Thomson, J.; Benetazzo, A. Maximum wave heights from global model reanalysis. Prog. Oceanogr. 2019, 175, 139–160. [Google Scholar] [CrossRef]
  13. White, C.; Carrelhas, A.; Gato, L.; Portillo, J.; Cândido, J. Floating wind and wave energy technologies: Applications, synergies and role in decarbonization in Portugal. Proc. EWTEC 2023, 15. [Google Scholar] [CrossRef]
  14. Young, I.R.; Zieger, S.; Babanin, A.V. Global trends in wind speed and wave height. Science 2011, 332, 451–455. [Google Scholar] [CrossRef]
  15. Reguero, B.G.; Losada, I.J.; Méndez, F.J. A recent increase in global wave power as a consequence of oceanic warming. Nat. Commun. 2019, 10, 205. [Google Scholar] [CrossRef] [PubMed]
  16. Stansby, P.K. Solitary wave run up and overtopping by a semi-analytical model. Coast. Eng. 2003, 47, 159–179. [Google Scholar] [CrossRef]
  17. Bromirski, P.D.; Miller, A.J.; Flick, R.E.; Auad, G. Wave power variability and trends across the North Pacific. J. Geophys. Res. Ocean. 2020, 125, e2019JC015419. [Google Scholar] [CrossRef]
  18. Cabral, I.S.; Young, I.R.; Toffoli, A. Long-term and seasonal variability of wind and wave extremes in the Arctic Ocean. J. Geophys. Res. Ocean. 2020, 125, e2020JC016708. [Google Scholar] [CrossRef]
  19. Young, I.R.; Ribal, A. Multiplatform evaluation of global trends in wind speed and wave height. Science 2019, 364, 548–552. [Google Scholar] [CrossRef]
  20. Boccotti, P. Wave Mechanics for Ocean Engineering; Elsevier Oceanography Series; Elsevier: Amsterdam, The Netherlands, 2000; Volume 64, Chapter 11. [Google Scholar] [CrossRef]
  21. Goda, Y. Random Seas and Design of Maritime Structures; World Scientific Publishing Company: Singapore, 2000. [Google Scholar] [CrossRef]
  22. Tayfun, M.A. Narrow-band nonlinear sea waves. J. Geophys. Res. Ocean. 1980, 85, 1548–1552. [Google Scholar] [CrossRef]
  23. Liu, Y.; Zhang, X.; Dong, Q.; Chen, G.; Li, X. Phase-resolved wave prediction with linear wave theory and physics-informed neural networks. Appl. Energy 2024, 355, 121602. [Google Scholar] [CrossRef]
  24. Liu, Y.; Huang, L.; Ma, X.; Zhang, L.; Fan, J.; Jing, Y. A fast, high-precision deep learning model for regional wave prediction. Ocean Eng. 2023, 288, 115949. [Google Scholar] [CrossRef]
  25. Memar, S.; Mahdavi-Meymand, A.; Sulisz, W. Prediction of seasonal maximum wave height for unevenly spaced time series by Black Widow Optimization algorithm. Mar. Struct. 2021, 78, 103005. [Google Scholar] [CrossRef]
Figure 1. Position of the buoy.
Figure 2. Heatmap of correlations.
Figure 3. Average results of Hmax predictions.
Figure 4. Average results of Hmax/Hs prediction.
Figure 5. Observations, predictions, and trend lines.
Table 1. Characteristics of the buoy.
Buoy          | Longitude | Latitude      | Depth | Cadence | Model    | Years
Bilbao-Biscay | 5.100 W   | 43° 37.940 N  | 870 m | 60 min  | SeaWatch | 1990–2024
Table 2. Results of Hmax predictions for 2014.

         |        Method 1             |        Method 2
         | R²     MAE    MSE    RMSE   | R²     MAE    MSE    RMSE
LM   x̄  | 0.8932 0.0967 0.0196 0.1335 | 0.6202 0.1293 0.0536 0.1702
     sd  | 0.0215 0.0338 0.0131 0.0453 | 0.2474 0.1156 0.1200 0.1578
SVR  x̄  | 0.9006 0.0916 0.0185 0.1286 | 0.6339 0.1246 0.0534 0.1659
     sd  | 0.0222 0.0348 0.0133 0.0473 | 0.2463 0.1168 0.1239 0.1617
LSTM x̄  | 0.8162 0.1433 0.0389 0.1884 | 0.3858 0.1804 0.0786 0.2374
     sd  | 0.0790 0.0518 0.0234 0.0625 | 0.2764 0.1130 0.1136 0.1494
GRU  x̄  | 0.8124 0.1435 0.0397 0.1892 | 0.3906 0.1841 0.0796 0.2397
     sd  | 0.0792 0.0553 0.0256 0.0662 | 0.2777 0.1138 0.1131 0.1493

         |        Method 3             |        Method 4
         | R²     MAE    MSE    RMSE   | R²     MAE    MSE    RMSE
LM   x̄  | 0.6748 0.1539 0.0728 0.2046 | 0.6801 0.1497 0.0647 0.1966
     sd  | 0.2367 0.1290 0.1431 0.1766 | 0.2471 0.1216 0.1182 0.1619
SVR  x̄  | 0.6884 0.1484 0.0725 0.1997 | 0.6842 0.1455 0.0646 0.1947
     sd  | 0.2376 0.1309 0.1476 0.1813 | 0.2427 0.1189 0.1227 0.1642
LSTM x̄  | 0.7047 0.3078 0.1862 0.4099 | 0.6896 0.3547 0.2687 0.4699
     sd  | 0.1778 0.1066 0.1202 0.1429 | 0.1797 0.1929 0.2728 0.2323
GRU  x̄  | 0.4640 0.2066 0.1028 0.2746 | 0.6960 0.3664 0.2956 0.4802
     sd  | 0.2902 0.1242 0.1366 0.1652 | 0.1729 0.2291 0.3569 0.2703
Table 3. The p-values for the Hmax predictions.
     | Method 1   | Method 2    | Method 3    | Method 4
R²   | 1.1 × 10⁻³ | 2.2 × 10⁻¹⁶ | 7.8 × 10⁻¹¹ | 6.8 × 10⁻⁷
MAE  | 1.5 × 10⁻² | 2.2 × 10⁻¹⁶ | 8.9 × 10⁻¹¹ | 9.4 × 10⁻⁷
MSE  | 3.0 × 10⁻² | 2.2 × 10⁻¹⁶ | 6.3 × 10⁻¹¹ | 3.5 × 10⁻⁷
RMSE | 3.0 × 10⁻² | 2.2 × 10⁻¹⁶ | 6.3 × 10⁻¹¹ | 9.07 × 10⁻⁷
Table 4. Results of Hmax/Hs predictions for 2011.

         |        Method 1             |        Method 2
         | R²     MAE    MSE    RMSE   | R²     MAE    MSE    RMSE
LM   x̄  | 0.0108 0.1170 0.0229 0.1516 | 0.0278 0.1188 0.0259 0.1532
     sd  | 0.0099 0.0029 0.0013 0.0043 | 0.0367 0.0262 0.0380 0.0489
SVR  x̄  | 0.0114 0.1166 0.0235 0.1533 | 0.0283 0.1182 0.0264 0.1547
     sd  | 0.0099 0.0030 0.0013 0.0043 | 0.0382 0.0262 0.0383 0.0492

         |        Method 3             |        Method 4
         | R²     MAE    MSE    RMSE   | R²     MAE    MSE    RMSE
LM   x̄  | 0.0221 0.1184 0.0255 0.1535 | 0.0235 0.1205 0.0260 0.1552
     sd  | 0.0303 0.0219 0.0305 0.0444 | 0.0308 0.0215 0.0307 0.0440
SVR  x̄  | 0.0221 0.1179 0.0260 0.1549 | 0.0229 0.1189 0.0262 0.1557
     sd  | 0.0306 0.0219 0.0307 0.0447 | 0.0219 0.0216 0.0304 0.0439
Table 5. The p-values for Hmax/Hs predictions.
     | Method 1 | Method 2 | Method 3 | Method 4
R²   | 0.8253   | 0.8946   | 0.7515   | 0.9070
MAE  | 0.6911   | 0.6726   | 0.6341   | 0.3464
MSE  | 0.3099   | 0.5535   | 0.4136   | 0.7876
RMSE | 0.3099   | 0.5535   | 0.5391   | 0.7876