1. Introduction
Wave buoys are devices that measure oceanographic and meteorological parameters, such as wind speed and direction, air temperature and pressure, sea surface temperature, and the height, period, and direction of waves (Figure 1), providing critical data for activities such as boating, shipping, fishing, research, offshore construction, and disaster warning [1,2]. Wave buoys are especially important for offshore wind farms, since they provide timely data on changing wave conditions, which is crucial for optimizing the design, operation, and maintenance of wind turbines in a farm and for planning and scheduling their inspection, repair, or replacement [3]. Continuous real-time wave measurements at offshore wind farms serve multiple purposes, enhancing both operations and safety. They support power output optimization, improve turbine efficiency and health monitoring, and assist decision-makers in avoiding unnecessary or high-risk maintenance activities during adverse sea states that could endanger personnel or equipment [4]. Wave buoy data are also highly relevant for planning vessel operations; offshore wind farm operators can select the most suitable and cost-effective maintenance strategy for their site, considering factors such as weather windows, failure rates, availability, downtime, and penalty costs [5,6,7].
Since wave buoys are usually deployed in remote and harsh marine locations, they are vulnerable to damage or loss from environmental or human factors, such as storms, collisions, or fishing activities, and therefore require regular maintenance to remain functional. While undergoing maintenance, the devices are offline, making marine information that is crucial for planning other activities unavailable. These factors can significantly increase operational costs and reduce the availability and quality of wave data, limiting continuity and causing interruptions in data collection [8].
These factors make it necessary to create virtual wave buoy models that can provide more reliable and comprehensive wave data, filling the gap when data are not available. Virtual wave buoy models are numerical models that can learn and reproduce sea state and wave data at the target location in the event of equipment failure. They can also provide spatially distributed and temporally continuous wave data over a large area or region by utilizing multiple virtual wave buoy locations. Virtual wave buoy models can also improve the accuracy and quality of the wave data by using advanced algorithms that can estimate more wave characteristics and reduce errors and uncertainties.
Wave height is one of the most critical wave characteristics for coastal protection, ocean engineering, offshore operations, and marine disaster prevention [9,10]. However, wave height measurements from networks of moored wave buoys are often incomplete, or sometimes erroneous, due to maintenance operations or extreme events [11]. Therefore, virtual wave buoys for wave height prediction are needed.
Numerical models such as Simulating Waves Nearshore (SWAN), the Wave Modeling Project (WAM), and WAVEWATCH III [12,13,14] are commonly used to simulate sea wave characteristics based on physical equations. However, they are computationally expensive and time-consuming, especially for large domains and complex coastal areas [15]. Alternatively, machine learning techniques have been applied to wave height prediction and reconstruction using historical data from wave buoys. Many studies in recent years have focused on data-driven, artificial neural network-based strategies for building wave height prediction and hindcasting models [9,16,17,18,19,20,21,22]. These models all use historical data from a single sensor to train forecasting and predictive models for that same sensor. The results reported in the literature above show that such strategies can describe the wave height forecast at different locations and time horizons with sufficient reliability. However, the mean absolute error of the predicted wave height can reach 40 cm in some scenarios, which is not suitable for applications that require high-accuracy predictions. The reader is also referred to [23], which provides a comprehensive literature review of machine learning-based forecasting approaches.
Among the most recent and successful studies, Fan et al. [24] presented a model based on a long short-term memory network for 1 h and 6 h predictions of the significant wave height at ten stations, using different environmental conditions collected at the same sensor location as input features. The short-term prediction of wave height yields good results, with mean absolute error values below 10 cm for the 1 h prediction in some cases and increasing errors for longer time horizons. Hu et al. [25] developed a wave forecasting model for two wave buoys located in Lake Erie using extreme gradient boosting and a long short-term memory network. The authors used a very long training period, from 1994 to 2013, applying observed wind velocity as model input and observed significant wave height and peak wave period as target variables, obtaining accurate wave height predictions with mean absolute errors lower than 10 cm. Abed-Elmdoust and Kerachian [26] employed a bidirectional gated recurrent unit network to forecast tropical cyclone wave height using data from 14 buoys in various environments over nine years, achieving the lowest error and highest correlation coefficient among all tested models. Jörges et al. [27,28] proposed a long short-term memory (LSTM) network and then a convolutional neural network mixed-data deep neural network that predict spatial ocean wave heights using random-field-simulated bathymetry data as an additional input. They focused on a nearly 13-year dataset, integrated high-frequency weather data, and showed that including bathymetry features improved wave height reconstruction and prediction by reducing the root mean square error. Their study is highly relevant to coastal and shallow-water regions but may not be suitable for deeper offshore environments due to its intensive data requirements. Gomez et al. [29] used reanalysis data gridded on a fine latitude-longitude grid, modeling weather conditions for each buoy using a sub-grid of the four closest reanalysis nodes. This approach increases data volume but can introduce significant computational overhead and complexity. Moreover, their model relies on data from the same buoy as an input, limiting its applicability when predicting in the absence of such data.
These studies focused on a single-location prediction strategy, in which the metocean historical information collected at that same location is used to train and construct the prediction models. Londhe et al. [11] provided one of the only detailed studies implementing a buoy-network strategy. Their approach uses artificial neural networks to reconstruct significant wave height data missing at a target location from a network of public wave buoys in the surrounding area. They tested six different sites and showed promising results. However, for three sites in the northern US, which also used years of data for training, the mean absolute error for the prediction target buoy was relatively large, ranging between 30 and 40 cm.
Chen et al. [30] developed a Random Forest-based surrogate model to predict the significant wave height, mean wave direction, mean zero-crossing period, and peak wave period at a target location. The model was trained using 21 years of data simulated with the SWAN physics-based numerical model. The authors then used in situ buoy observations and wind data as inputs to the trained model and compared its performance with the SWAN model at a test location in the UK. The model can produce accurate spatial wave data with far less data input and computational time than the SWAN model, and it captures the seasonal and interannual variability of wave conditions at the test site. However, by relying on 21 years of SWAN-simulated data, the proposed methodology carries a heavy initial computational demand.
Among the most relevant and recent works, Patanè et al. [31] used convolutional layers for spatial feature extraction and long short-term memory layers for temporal modeling, adding complexity and computational demand. They focused on a single-buoy prediction scenario and rely on ERA5 reanalysis wind forcing with a spectral wave model, which can be limiting if reanalysis data are unavailable or unreliable. Minuzzi et al. [32] used NOAA numerical forecast data, targeting the residual between observational data and numerical model output. Their model training relied on a massive dataset spanning 20 years, which is impractical for many applications and may not be feasible for real-time or near-real-time predictions. Additionally, their results sometimes exhibited deviations greater than 1 m from observed data, indicating lower resolution performance in local contexts.
These studies demonstrate the potential of machine learning techniques for wave height prediction. However, most of the literature focuses on forecasting wave height using historical data from the same sensor location, with few studies addressing data fusion for virtual sensing using a network of buoys. Additionally, models experimenting with different input quantities often yield high prediction errors, sometimes as large as 20-30 cm. Furthermore, the most effective models rely on extensive training datasets, increasing computational demands and limiting their applicability to scenarios with abundant data.
This study aims to address these gaps by developing a purely observational, data-driven, user-friendly, and computationally efficient virtual buoy model. Unlike previous studies, we leverage a network-based strategy that does not rely on the target buoy’s own data for predictions, ensuring broader applicability and robustness in scenarios where buoy data are missing. The methodology is applied to multiple areas, testing different buoy networks and carrying out a sensitivity analysis to determine how the placement and arrangement of sensors in the ocean affect model accuracy. The model provides accurate wave height estimates for specific locations even when the physical sensors are non-functional. The strategy is not computationally demanding and can be easily generalized to different coastal and offshore regions without relying on external reanalysis datasets, extensive training data, numerical simulations, or additional features, maintaining high accuracy with fewer inputs. The model is tested across four sites with different conditions, some hosting offshore wind farms and providing unique case studies based on proprietary data. These cases highlight how crucial such a model is for offshore wind farms, where timely and precise marine information is essential for the safety of service vessels during navigation and maintenance operations.
Additionally, the proposed imputation strategy for handling missing data enhances data integrity and prediction reliability, addressing a critical aspect not covered in the existing literature. This study, therefore, offers significant advancements in offshore wave height prediction, combining robust imputation strategies with an efficient and broadly applicable network-based approach.
3. Datasets Description
This work considers four different sites: two in the UK involving publicly and privately owned wave buoys and two in the US using only measurement data from public buoys.
Table 1 provides the spatial coordinates and characteristics of the buoys in the sites.
The two UK sites correspond to two offshore wind farm sites: Race Bank Offshore Wind Farm (ROW01) (Figure 3a) and Walney Wind Farm (WOW04) (Figure 3b). In particular, for Race Bank, there are seven public buoys located in the surrounding area of the private target buoy owned by Ørsted, while for Walney, the data from five public buoys are considered.
In addition to the public buoy data, further measurements are available at the two UK sites. The ROW01 site has two additional private sensors: a wave radar and a lidar. The wave radar provides significant wave height, maximum wave height, median wave direction, and peak wave period with a sampling frequency of 1 min. The lidar provides wind speed and direction measurements at 100 m height every 10 min. For WOW04, the data come from three weather stations: one is placed directly at the WOW04 site; of the other two, one is located on the Walney farm but in a different section (WOW03), and the other at a nearby farm, the Burbo Bank offshore wind farm. The stations provide wave height and wind speed information with a sampling frequency of 15 min.
The US sites were selected according to current lease plans for offshore wind farms on the US East Coast. In particular, the area south of the Cape Cod region (Figure 3d), located in Nantucket Bay in front of Rhode Island, will host multiple wind farm projects, making it an area of interest for testing the virtual buoy strategy. The Gulf of Maine area (Figure 3c) was chosen because this location in New England has the highest number of public buoys with robust live streaming of data. Even though no offshore wind farms are currently planned for this area, the richness of available information makes it a valuable test bed for the proposed virtual sensing strategy under New England marine conditions. Real-time and historical weather and marine condition measurements were accessed through the National Oceanic and Atmospheric Administration (NOAA). In particular, the National Data Buoy Center (NDBC), part of NOAA’s National Weather Service, operates a network of data-collecting buoys and coastal stations that provide meteorological and oceanographic observations for weather forecasting, marine safety, research, and environmental monitoring.
The Martha’s Vineyard site relies on public buoy data only. Four buoys provide reliable and accurate wave height information at this site (Figure 3d). Since no private buoy was available, no target buoy was defined a priori. Buoy C was chosen as the reference target buoy because it is the one closest to the future offshore developments planned for the area and is therefore potentially the most representative of the marine conditions in the wind farm area. Similarly, for the Gulf of Maine, no private buoys are currently deployed in the area; only measurements from six public buoys are available, and no target buoy was defined a priori. Buoy F was selected as the reference target buoy due to its proximity to the coast, similar to buoys E, G, and I, and its central location relative to the rest of the buoy network. Unlike Buoy H, which is located in deeper, open waters farther offshore, Buoy F is more representative of the nearshore marine conditions of interest for this study.
The selected sites span a range of water depths, including both shallow and moderately deep offshore regions. For example, depths range from as low as 10 m (e.g., Cleveleys buoy at WOW04) to as deep as 177 m (e.g., Buoy H at the Gulf of Maine site). This variation allows us to assess the virtual buoy modeling framework across diverse oceanographic settings, supporting its generalizability.
Data Processing and Input Features Selection
As shown in the description of the dataset used in this work, the data measurements collected at the different buoy locations are often characterized by different sampling frequencies. The first step in the data processing phase therefore consists of aligning the sampling frequencies of the buoy data in time for each site. For the two UK sites, some buoys have a sampling frequency of 60 min and others of 30 min. For the analysis, a 30 min sampling resolution is selected, and, when needed (West Sole A, Clipper, and Sean P for ROW01 and M2 for WOW04), buoy records are up-sampled from 60 to 30 min by linear interpolation. For the Gulf of Maine site, all the buoys have a sampling frequency of 60 min and are kept at that constant sampling frequency. At the Martha’s Vineyard site, Buoys A, B, and D are all sampled at 60 min, while Buoy C has a more refined sampling resolution, with measurements collected every 30 min. For the analysis, a 60 min sampling resolution is selected, down-sampling Buoy C to one value per hour.
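As an illustration, the temporal alignment can be sketched as follows in Python with pandas, assuming each buoy's records are stored in a DataFrame with a datetime index and a significant wave height column (the column and function names are hypothetical):

```python
import pandas as pd

def align_buoy(df: pd.DataFrame, target_freq: str = "30min") -> pd.Series:
    """Resample one buoy's significant wave height onto the common time grid.

    Up-sampling (e.g., 60 min -> 30 min) is filled by linear interpolation;
    down-sampling (e.g., 30 min -> 60 min) keeps one value per interval.
    """
    hs = df["significant_wave_height"].sort_index()
    return hs.resample(target_freq).mean().interpolate(method="linear", limit=1)

def build_site_table(buoy_frames: dict, target_freq: str = "30min") -> pd.DataFrame:
    """Combine per-buoy series into one table: columns = buoys, rows = timestamps."""
    aligned = {name: align_buoy(df, target_freq) for name, df in buoy_frames.items()}
    return pd.DataFrame(aligned)
```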
The dataset is subjected to the same pre-processing strategy for all the sites. The missing values (NaN values) are removed from all the buoys involved in the analysis, together with any possible non-physical negative values. Then, since the research focuses on wave height prediction at target locations to inform the safe planning and logistics of maintenance operations for service vessels, there is no interest in accurately predicting wave height conditions above 3 m. This limit reflects the use of relatively small vessels for maintenance operations in the selected farms, which have wave height limits of 1.5 m for crew transfer vessels and 2.5 m for service operation vessels. Therefore, given the use-inspired nature of this research, any wave height information above 4 m for the target buoy is removed to optimize the training of the regression model for lower wave height conditions.
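A minimal sketch of these quality filters, applied to the aligned table from the previous snippet (the 4 m cap acts only on the target column; names are again hypothetical):

```python
import pandas as pd

def clean_site_table(table: pd.DataFrame, target_col: str,
                     max_target_hs: float = 4.0) -> pd.DataFrame:
    """Apply the quality filters described in the text.

    - drop timestamps where any buoy reports NaN (deletion strategy; replaced
      by imputation in Section 6)
    - remove non-physical negative wave heights
    - remove target observations above max_target_hs (use-inspired cap)
    """
    cleaned = table.dropna(how="any")
    cleaned = cleaned[(cleaned >= 0.0).all(axis=1)]
    return cleaned[cleaned[target_col] <= max_target_hs]
```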
Finally, to conduct comparable and reliable sensitivity analyses and hyperparameter tuning, the dataset is kept constant for all analyses presented in this section and the following sections. Therefore, the analyses presented in the following sections are performed using data between 28 August 2021 (5:00 AM) and 26 October 2022 (00:30 AM) for ROW01 and WOW04, while for the US sites the dataset is fixed between 1 January 2021 (00:00 AM) and 31 December 2022 (11:30 PM).
The data measurements collected by the buoys at each site are divided into two sets: training and testing datasets. The training dataset is utilized to develop and train the regression model. On the other hand, the testing dataset comprises data that the model never saw during the learning phase. It is used to assess the model’s performance in making accurate predictions on new and unseen data.
For the training set, the first 21 days of each month are used, while the remaining days are reserved for testing. This data-splitting strategy preserves the temporal order of observations, which is crucial in time series data such as wave heights where short-term autocorrelation is typically present. Unlike a random split that may distribute temporally adjacent data points across both sets—potentially leading to information leakage and inflated model performance—our approach ensures that the model is always evaluated on future data relative to its training period. This structure also allows the model to learn from each month’s variability while maintaining the continuity of autocorrelated sequences within each split, better reflecting the conditions of real-world forecasting.
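A minimal sketch of this calendar-based split, assuming a datetime-indexed table (the 21-day threshold follows the text; everything else is illustrative):

```python
import pandas as pd

def monthly_split(table: pd.DataFrame, train_days: int = 21):
    """First `train_days` days of each month -> training; remaining days -> testing."""
    is_train = table.index.day <= train_days
    return table[is_train], table[~is_train]

# Usage (hypothetical names): neighboring buoys as features, target buoy as label
# train_df, test_df = monthly_split(site_table)
# X_train, y_train = train_df.drop(columns=["target_buoy"]), train_df["target_buoy"]
# X_test, y_test = test_df.drop(columns=["target_buoy"]), test_df["target_buoy"]
```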
In the following sections, the results presented for the different regression models concern the prediction of the significant wave height at the target buoy using only the significant wave height measurements from the neighboring wave buoys as input features.
Section 5.3 will discuss and evaluate the model’s performance when additional data sources and quantities are considered input features.
4. Model Baseline and Algorithm Tuning
Hyperparameter tuning is a critical step in building a regression model. Its objective is to identify the hyperparameter values of a given algorithm that result in optimal model performance on the validation set. This work focuses on building a regression model and finding the optimal balance between the model’s performance and its complexity. The approach employed involves first creating a baseline model using an initial estimate of the hyperparameters. Subsequently, the parameters of the Random Forest regressor (RFR) are further tuned via a grid search strategy.
The initial guess for the hyperparameters of the Random Forest model is provided for the two UK sites (ROW01 and WOW04), which are the datasets with the largest number of buoys available for building the regression models. The baseline for the two UK sites is determined using the parameters presented in Table 2. For the US sites, the baseline performance is evaluated using a Random Forest model with the parameters defined for the ROW01 site, without further tuning. Table 3 reports the Random Forest model performance in predicting the wave height for the training and testing datasets at the four sites of interest.
On the training dataset, the ROW01 and WOW04 sites show lower MAE and MdAE values, indicating better model performance in these regions than at Martha’s Vineyard and the Gulf of Maine, which exhibit higher error metrics. In contrast, WOW04 exhibits a significantly larger RMSE on the testing dataset, highlighting a notable deviation in model predictions. The other sites, ROW01, Martha’s Vineyard, and the Gulf of Maine, display more consistent performance between the training and testing datasets, with ROW01 showing minimal error increases across both.
When performing hyperparameter tuning for Random Forest regression, the key hyperparameters considered are the number of estimators and the maximum depth, both explored over the set [50, 200, 500, 625, 750, 800]; for the Martha’s Vineyard and Gulf of Maine sites, this range is narrowed down to [150, 175, 200, 220, 250]. The minimum samples per leaf and the minimum samples per split are both explored over [2, 4, 6, 8, 10, 12]. The results optimized via the grid-search approach for the four investigated sites are presented in Table 4 and identified as the optimized version of the Random Forest regressor (RFRO).
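For illustration, the grid search can be set up with scikit-learn as sketched below; the grids mirror the values listed above, while the cross-validation settings and scoring metric are assumptions, since the text does not specify them:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grids for the UK sites; for Martha's Vineyard and the Gulf of Maine the first
# two grids are narrowed to [150, 175, 200, 220, 250], as described in the text.
param_grid = {
    "n_estimators": [50, 200, 500, 625, 750, 800],
    "max_depth": [50, 200, 500, 625, 750, 800],
    "min_samples_leaf": [2, 4, 6, 8, 10, 12],
    "min_samples_split": [2, 4, 6, 8, 10, 12],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # assumption: MAE-based selection
    cv=5,                               # assumption: 5-fold cross-validation
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# rfr_optimized = search.best_estimator_
```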
The results obtained by applying the optimized RFR model are compared with those obtained from the model without optimizing the hyperparameters and with the models chosen as reference models, including lasso regression [36], support vector regression [37], gradient boosting regression [38], multilayer perceptron neural networks [39], and long short-term memory networks [40].
Lasso regression [36] is a linear model that performs both variable selection and regularization by applying an L1 penalty to the regression coefficients. This penalty encourages sparsity by shrinking some coefficients to zero, improving model interpretability. The primary hyperparameter is the regularization strength $\alpha$, which controls the degree of penalization; its value is set based on the grid search results. While lasso regression is computationally efficient and effective for linear relationships, it lacks the flexibility of RF in capturing nonlinear dependencies and feature interactions.
Support vector regression (SVR) [37] is a kernel-based method that seeks a function approximating the data within a defined margin of tolerance, using support vectors to represent the solution. The main hyperparameters include $C$, which controls the trade-off between model complexity and the tolerance for errors (set to 100 in our case); $\gamma$, which defines the influence of individual training examples (set to 0.0001); and $\varepsilon$, which sets the width of the error-insensitive tube around the predicted values (set to 0.01). A radial basis function (RBF) kernel is chosen for its ability to capture nonlinear relationships. Compared to RF, SVR often achieves high accuracy in high-dimensional spaces but can be less scalable and more sensitive to hyperparameter tuning.
Gradient boosting (GB) regression [38] is an ensemble learning technique that sequentially adds weak learners, typically shallow decision trees, to minimize the residuals of previous models. Key hyperparameters include the number of estimators (set to 60), the learning rate (set to 0.4), and the maximum tree depth (set to 2), which collectively control the model’s complexity and learning speed. Additional parameters include the regularization parameter $\alpha$ for quantile loss (used with the ‘lad’ loss), the minimum samples to split (3), the minimum samples per leaf (1), and a minimum weight fraction per leaf of 0.01. Gradient boosting offers high predictive power and flexibility but is generally more computationally intensive and prone to overfitting compared to RF.
Multilayer perceptron (MLP) networks [39] are feedforward neural networks capable of approximating nonlinear functions. Our model uses a single hidden layer with 12 neurons and rectified linear unit (ReLU) activation. The regularization term $\alpha$, set to 10, controls the L2 weight decay that mitigates overfitting. Training is performed using the limited-memory BFGS optimizer with a maximum of 3000 iterations. While MLPs are expressive and powerful for capturing complex relationships, they require more careful tuning and offer less interpretability than RF.
Long short-term memory (LSTM) networks [40] are a type of recurrent neural network (RNN) specifically designed to handle sequential data with long-range dependencies. Their architecture incorporates gating mechanisms to regulate information flow through time steps. In this study, we implement a standard LSTM architecture to explore its ability to model temporal wave height dynamics. Compared to RF, LSTMs can capture temporal dependencies more explicitly, but they require significantly more data, computational resources, and tuning of architectural and training parameters.
In all cases, model hyperparameters are optimized using a grid search procedure consistent with that applied to RF. This ensures a fair performance comparison across methods.
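For reference, the scikit-learn baselines can be instantiated with the hyperparameter values reported above, as in the sketch below; the lasso regularization strength and the LSTM architecture are not fully specified in the text and are therefore left indicative:

```python
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

reference_models = {
    # regularization strength chosen by grid search (value not reproduced here)
    "Lasso": Lasso(),
    "SVR": SVR(kernel="rbf", C=100, gamma=1e-4, epsilon=0.01),
    "GB": GradientBoostingRegressor(
        n_estimators=60, learning_rate=0.4, max_depth=2,
        loss="absolute_error",  # least-absolute-deviation loss ('lad' in older scikit-learn)
        min_samples_split=3, min_samples_leaf=1, min_weight_fraction_leaf=0.01,
    ),
    "MLP": MLPRegressor(hidden_layer_sizes=(12,), activation="relu",
                        alpha=10, solver="lbfgs", max_iter=3000),
}
# The LSTM baseline is built with a dedicated deep learning library; its exact
# architecture is not detailed in the text and is therefore not sketched here.
```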
Figure 4 and Figure 5 show the regression model performance in terms of MAE for the training and testing datasets, considering the ROW01 and Martha’s Vineyard sites, respectively. The bars are color-coded, blue for training and red for testing, and numerical values are displayed directly on top of each bar for clarity. A third set of gray bars indicates the signed percentage difference between testing and training performance, shown on the secondary y-axis. This signed representation allows a direct comparison of relative performance, with negative values indicating better generalization to the test set.
The first observation concerns the comparison between the baseline RF model and its optimized version. The optimized model tends to show better performance on the training dataset without offering any significant improvement on the testing dataset. The small improvements between RFR and RFRO on the testing dataset, such as at the Martha’s Vineyard site, do not justify the increase in computational expense required by a more complex and deeper Random Forest model. For both sites, the Random Forest model, independent of the hyperparameter optimization, tends to perform better on the training dataset than on the testing dataset.
Figure 5 highlights the differences in the performance of the various regression models between the training and testing datasets. The Random Forest model shows, across the different sites, substantial discrepancies between the training and testing cases, revealing a tendency to overfit. Random Forest is known to suffer from overfitting, meaning that it learns noise or specific patterns in the training data that do not generalize well to new or unseen data. Overfitting can result in poor performance or high variance on the test or validation data. A complex model architecture or a dataset lacking richness are common causes of this tendency.
Although the RF model exhibits signs of overfitting, its performance remains comparable to that of the other models considered. Only the SVR and LSTM models show slightly better results on the testing dataset. The LSTM model achieves marginally lower error on the test set than on the training set, particularly for the ROW01 site. This behavior is likely due to the non-random data split, where the training set contains a greater proportion of complex temporal relationships. In such cases, the model may appear to underperform during training while generalizing better to unseen data. Nonetheless, despite these localized improvements, models such as SVR and LSTM are more sensitive to hyperparameter choices and require longer training times. In contrast, the Random Forest model demonstrates more consistent and robust performance across both sites, making it a more practical and reliable choice for the application considered in this study.
When considering efficiency for a simple regression task, Random Forest surpasses alternatives such as support vector regression, long short-term memory, and gradient boosting in speed and simplicity. The computational complexity of RF training is approximately $O(T \cdot F \cdot N \log N)$, where $T$ is the number of trees, $N$ is the number of samples, and $F$ is the number of features. This structure allows RF to train efficiently by building trees in parallel, making it ideal for real-time applications.
In contrast, the complexity of SVR is $O(N^3)$, as it involves solving a quadratic optimization problem, making it computationally prohibitive for large datasets. During training, LSTM models rely on backpropagation through time to update network weights based on sequences of input data, which can be computationally intensive. During inference, backpropagation is not required; however, the model must still update its hidden state sequentially at each time step, limiting parallelization. LSTM is characterized by a complexity of roughly $O(N \cdot d^2)$, where $d$ is the hidden state size. While GB shares a theoretical complexity similar to that of RF, $O(T \cdot F \cdot N \log N)$, its sequential tree-building prevents efficient parallelization, making it slower in practice. MLPs, on the other hand, have a complexity of approximately $O(N \cdot L \cdot F \cdot H)$, where $L$ is the number of layers, $F$ is the number of input features, and $H$ is the number of neurons per layer. Hence, RF is the superior choice for simple, scalable regression models that balance accuracy and computational efficiency, especially for real-time or large-scale applications like wave height prediction.
To illustrate, consider the ROW01 case, where 1000 samples and 7 features are used. The complexity for the baseline Random Forest model with 18 estimators is approximately 1.3 million operations, while the optimized RF model with 750 estimators results in 57.75 million operations. By comparison, due to its linear nature, lasso regression requires just 7000 operations but lacks the flexibility to capture complex relationships. With its cubic complexity, SVR demands 1 billion operations, making it far more computationally expensive. LSTM, with its sequential nature, has a complexity of 25 million operations, while GB, with 60 estimators, results in 4.6 million operations, and finally, the MLP model, using one hidden layer with 12 neurons, requires approximately 84,000 operations, making it efficient in practice.
This comparison highlights the advantages of RF in terms of computational efficiency, structural flexibility, and overall robustness. While lasso regression offers a straightforward linear architecture and low computational complexity, it presumes a linear relationship between input features and the target variable, which makes it unreliable when that assumption is violated. Moreover, its objective function, which minimizes the residual error together with the sum of the absolute values of the coefficients, leaves it sensitive to outliers. In contrast, RF, as a non-parametric ensemble method, makes no assumptions about the functional form of the data, inherently providing greater flexibility in capturing complex, nonlinear interactions. Its ensemble structure enhances robustness to outliers by mitigating the influence of individual data points and allows it to naturally accommodate multicollinearity without the need for explicit feature selection, which is particularly valuable in cases where input features, such as significant wave height estimates from nearby marine locations, may be highly correlated. Additionally, RF supports inherent uncertainty quantification through the variation among its decision trees, which enhances model interpretability and confidence in the predictions. Although MLPs benefit from hardware-optimized matrix operations that make them computationally efficient, they are typically more difficult to interpret and require careful tuning to avoid instability or underfitting. In conclusion, despite the overfitting tendencies observed in our experiments, RF remains a compelling choice for wave height prediction due to its robustness to data variability, competitive performance, scalability, and ability to model complex patterns without extensive pre-processing or assumptions about data structure.
To assess the computational efficiency of the adopted model, we measure the training and inference times of the Random Forest model in the context of the ROW01 site. On a standard consumer-grade machine equipped with an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz (8 cores, 16 threads), 32 GB of RAM, and running Windows 11, model training requires approximately 1.97 s, while inference on the validation set takes 0.4 s. All computations are executed using CPU-based parallelization with the scikit-learn implementation, and the GPU is not utilized. These results demonstrate that the model is capable of delivering predictions at speeds suitable for near real-time applications, supporting its practical deployment in monitoring and decision-support scenarios.
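A minimal sketch of how such wall-clock measurements can be taken (the placeholder arrays only stand in for the ROW01 tables):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder arrays standing in for the ROW01 training/testing tables
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 7)), rng.random(1000)
X_test = rng.random((300, 7))

model = RandomForestRegressor(n_estimators=18, n_jobs=-1)  # baseline configuration

t0 = time.perf_counter()
model.fit(X_train, y_train)            # CPU-parallel tree construction
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
_ = model.predict(X_test)              # inference on the held-out set
infer_time = time.perf_counter() - t0

print(f"training: {train_time:.2f} s, inference: {infer_time:.3f} s")
```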
Uncertainty Quantification from Random Forest Trees
Random Forest models provide a natural way to quantify uncertainty through the variance among the predictions of the individual trees. Each tree in the forest is trained on a bootstrap sample of the data, leading to slightly different models. The variability among the predictions from these different trees can be used to estimate the uncertainty of the overall model prediction.
The prediction variance can be calculated by first obtaining the predictions from all trees in the forest for a given data point. The variance of these predictions provides a measure of the model’s uncertainty. Specifically, the prediction variance is computed as the average of the squared deviations of each tree’s prediction from the mean prediction of all trees. This can be expressed as
$$\sigma^2(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2, \qquad \bar{y}(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t(\mathbf{x}),$$
where $\hat{y}_t(\mathbf{x})$ is the prediction from tree $t$, $\bar{y}(\mathbf{x})$ is the mean prediction of all $T$ trees, and $\mathbf{x}$ is the input data point. This approach allows for estimating prediction intervals, providing a measure of confidence for each prediction. Higher variance indicates higher uncertainty, guiding further analysis or data collection efforts to reduce this uncertainty.
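A minimal sketch of this per-tree variance computation with scikit-learn; building a 95% interval from a Gaussian multiplier of 1.96 is an assumption used here for illustration:

```python
import numpy as np

def rf_prediction_interval(rf_model, X, z: float = 1.96):
    """Mean prediction and approximate 95% interval from per-tree variability."""
    # predictions of every individual tree: shape (n_trees, n_samples)
    tree_preds = np.stack([tree.predict(X) for tree in rf_model.estimators_])
    mean = tree_preds.mean(axis=0)
    std = tree_preds.std(axis=0)  # square root of the variance defined above
    return mean, mean - z * std, mean + z * std
```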
The results of this uncertainty analysis are presented in Figure 6 and Figure 7 for the ROW01 site. Figure 6 shows the RF predictions with 95% prediction intervals at different time scales. The blue lines represent the actual wave heights, the orange lines represent the predicted values, and the shaded violet areas indicate the 95% prediction intervals, reflecting the model’s uncertainty. In Figure 6, the highlighted shaded rectangle indicates the specific time window that is zoomed in on the right-hand side of the figure. The alignment of the predicted values with the actual values, along with the narrow prediction intervals, suggests that the model performs well, with high confidence in its predictions.
Figure 7 summarizes the model’s residuals and performance evaluation. The left subplot shows the residual plot, where residuals (the difference between actual and predicted values) are plotted against the predicted values. The red-shaded region represents the 95% confidence interval around the residuals, providing a visual depiction of the uncertainty. Most residuals are centered around zero, indicating no significant bias in the model. The spread of residuals remains relatively consistent across the range of predicted values, although a slight increase in variance at higher predicted values suggests mild heteroscedasticity. The clustering of residuals around zero further confirms the model’s accuracy, with most errors being small.
The middle plot is a horizontal histogram of the residuals combined with a kernel density estimate (KDE), offering insight into the spread of the errors. This plot suggests a normal distribution, with most residuals centered around zero, indicating a well-calibrated model with no significant skewness in the errors.
The right subplot compares predicted values vs. actual values, with the dashed red line representing the perfect prediction line (where predicted equals actual). The dense clustering of points along this line confirms that the model has high predictive accuracy, with only minor deviations observed. The inclusion of error bars represents the uncertainty bounds (±10% of the residuals), further illustrating the model’s ability to reliably estimate wave height values across the dataset.
6. Data Processing and Imputation of Missing Data
Data quality is a crucial factor for regression and virtual sensing strategies, which aim to estimate or infer the values of certain variables based on available data measurements. Poor data quality can affect the accuracy, reliability, and validity of regression models, such as Random Forest and virtual sensors, which use machine learning techniques to simulate physical sensors. This section discusses the uncertainty quantification (UQ) associated with pre-processing stages, particularly focusing on k-nearest neighbors (kNN) for imputation.
For example, remote sensing data can suffer from various imperfections, such as noise, distortion, missing values, outliers, or inconsistency, that can alter the extracted information and the decisions made. Similarly, virtual sensing for water quality assessment can be influenced by the quality of the input data, such as water samples, sensor readings, or environmental factors. Therefore, data quality assessment and improvement methods are essential to ensure the performance and robustness of the regression and virtual sensing strategies.
In the present application, two data quality issues are considered and addressed in two separate stages in the context of wave height prediction. The first data quality verification and pre-processing step, as mentioned in Section 5, focuses on the removal of any non-physical observation corresponding to negative values for all the buoys involved in the analysis (target and input buoys). Then, any wave height observation above 4 m for the target buoy is removed in order to optimize the training of the regression model for the lower wave height conditions of interest.
The second data quality problem addressed in the present virtual sensing strategy focuses on finding a solution for the presence of missing data in the virtual buoy acquisition. Missing data is a common problem in many real-world datasets, and it can substantially affect the performance and validity of regression models. Depending on the mechanism and pattern of missingness, different methods can be applied to handle missing data, such as deletion, imputation, or model-based approaches [41]. However, not all methods are suitable for all types of regression models, and some may introduce bias or uncertainty in the results.
In the preliminary analysis presented in the previous section, the missing data are handled by deletion. In particular, whenever any of the buoys (input and target) present a missing data observation, that information in time is removed from all the buoys. This strategy presents advantages and disadvantages. Deletion is simple and easy to implement, as it does not require any assumptions or models for the missing data mechanism, preserves the distribution and the relationships of the observed variables, and does not introduce any artificial values. Moreover, deletion can be unbiased and valid if the data are missing completely at random, meaning that the missing data are unrelated to any observed or unobserved variables. On the other hand, deletion can reduce the sample size and the statistical power of the analysis, as it discards potentially useful information from the non-missing values, and it can be inefficient and impractical if the data have a large proportion or a complex pattern of missingness, as it may result in losing too many observations or variables.
In the present application, the deletion strategy results in the elimination of a significant portion of the observable wave height information for all the investigated sites. To illustrate the effect of the deletion strategy on data loss, Table 8 and Figure 14 show the breakdown by buoy of the data quality information for ROW01.
The ROW01 site shows a variable distribution of missing values depending on the buoy considered (the discrepancy between the theoretical number of observations in the selected period of time and the number of observations effectively collected), with values of up to 22% for some of the buoys. In addition to the missing values, identified as gaps in the expected timestamp sequence, there are also a number of measurements for which the timestamp is present but the wave height is explicitly recorded as “NaN”. These two types of incompleteness are handled distinctly but ultimately processed uniformly for imputation. A very small number of negative values in the data also need to be removed. To optimize the prediction accuracy of the model in the wave height range of interest, the values above 4 m for the target buoy are removed. The final step in the cleaning stage is the temporal alignment of all the measurements from the network of buoys.
Since all the data for the Martha’s Vineyard and Gulf of Maine sites are maintained and pre-processed by the NDBC, there are rarely recorded “NaN”, missing, or negative values. When data were not recorded for a given time stamp, that time stamp is simply omitted from the dataset.
The final alignment stage of the cleaning phase is where most of the data are lost (about 30% per buoy). It is evident that handling missing data by deletion causes a significant loss of wave height observations. Such a loss of information can easily lead to the previously mentioned disadvantages for a regression model like RF, negatively affecting the performance of the virtual sensing strategy. Therefore, an alternative to handling missing data by deletion is needed. Random Forest implementations allow missing data to be handled with alternative approaches. It is possible to impute the missing data using either median or proximity-based measures [42,43] or by splitting the data into two subsets based on the presence or absence of missing values in each node [33]. Both methods have advantages and disadvantages, and their accuracy and efficiency depend on the amount and pattern of missing data, as well as the complexity and variability of the data [44].
The present work addresses the problem of missing data via pre-imputation, implementing a k-nearest neighbors algorithm.
6.1. Imputation via k-Nearest Neighbors
Multivariate imputation via k-nearest neighbors (kNN) [45] is a method for filling in missing values in a dataset using information from other variables. The idea is to find the $k$ most similar observations to the one with missing values, based on some distance metric, and use their values to impute the missing ones. The main steps in the implementation of the imputation via kNN are as follows (a code sketch follows the list):
Identify the missing values in the dataset and mark them with a special value, such as NaN.
Choose a distance metric to measure the similarity between observations, such as Euclidean distance or dynamic time warping for time series data.
Choose a value for k, the number of nearest neighbors to use for imputation. This can be performed empirically by comparing the performance of different values of k on a validation set.
For each observation with missing values, find the k-nearest neighbors with complete values for the same variable using the distance metric and the non-missing variables.
Impute the missing values using some aggregation function of the values from the k nearest neighbors, such as mean, median, or mode.
Repeat the process until all missing values are imputed or until convergence is reached.
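These steps can be sketched with scikit-learn's KNNImputer, which implements a NaN-aware Euclidean distance and neighbor-mean aggregation consistent with the procedure above; the choice of k and the toy values below are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_buoy_network(table: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Fill missing wave heights in a buoys-as-columns table via kNN imputation."""
    imputer = KNNImputer(n_neighbors=k, weights="uniform", metric="nan_euclidean")
    filled = imputer.fit_transform(table.to_numpy())
    return pd.DataFrame(filled, index=table.index, columns=table.columns)

# Toy usage: four buoys, one missing observation for buoy 1 at time step 4
toy = pd.DataFrame({
    "buoy1": [0.8, 0.9, 1.0, np.nan, 1.1],
    "buoy2": [0.7, 0.8, 0.9, 1.0, 1.0],
    "buoy3": [0.9, 1.0, 1.1, 1.2, 1.2],
    "buoy4": [0.6, 0.7, 0.8, 0.9, 0.9],
})
print(impute_buoy_network(toy, k=2))
```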
A simple example is provided below to give a clearer description of the above-mentioned steps. The main goal is to impute the value $x_{i,h}$, where row ($i$) is the time observation and column ($h$) is the buoy. As an example, let us consider a set of time series collecting $N$ wave height observations for four different buoys, as presented in Table 9; the goal is to fill the missing NaN information at time step 4 for buoy 1 and therefore impute the observation value $x_{4,1}$ (Table 9).
6.2. Step-by-Step Procedure for kNN Imputation
Step 1: Identify the missing value $x_{i,h}$, where row $i$ corresponds to the time index and column $h$ indicates the buoy (feature) with the missing value. For example, $x_{4,1}$ indicates the wave height at time step 4 for buoy 1.
Step 2a: Compute the Euclidean distance between row $i$ and each of the other rows $p \neq i$, using all columns that do not contain missing values in both rows $i$ and $p$.
The distance is given by
$$d_{i,p} = \sqrt{\frac{M}{|S_{i,p}|}\sum_{m \in S_{i,p}}\left(x_{i,m} - x_{p,m}\right)^2},$$
where $M$ is the total number of columns (buoys), $S_{i,p}$ is the set of usable columns (no NaNs in both rows), and $M/|S_{i,p}|$ is the weighting factor adjusting for varying dimensionality.
Example 1. For $i = 4$, a candidate row $p$ with no missing values in the shared columns, and the corresponding set $S_{4,p}$, the distance $d_{4,p}$ follows directly from the expression above using the wave heights in Table 9.
Step 2b: If some dimensions in $S_{i,p}$ are missing for either row, remove those dimensions and update $|S_{i,p}|$ accordingly.
Example 2. For a candidate row $p$ with a missing value in one of the shared columns, the distance $d_{4,p}$ is computed over the reduced set $S_{4,p}$, with the weighting factor rescaled accordingly.
Step 3: Choose the number of nearest neighbors $k$.
Step 4: Identify the $k$ rows with the smallest distances $d_{i,p}$. Denote these rows as $p_1, \dots, p_k$.
Step 5: Impute the missing value $x_{i,h}$ as the mean of the values at column $h$ from the $k$ nearest rows:
$$\hat{x}_{i,h} = \frac{1}{k}\sum_{j=1}^{k} x_{p_j,h}.$$
Example 3. Once the $k$ nearest rows to time step 4 are identified from Table 9, the imputed value $\hat{x}_{4,1}$ is the mean of their buoy 1 wave heights.
6.3. Random Forest Performance with Pre-Imputation via k-Nearest Neighbors
A preliminary performance assessment is carried out by artificially removing a month of data from one of the buoys at the ROW01 site. The missing observations are then imputed following the kNN-based strategy described in the previous section. In particular, the month of June is removed from the Chapel Point buoy (highlighted by the shaded area in Figure 15).
The kNN algorithm is used to impute the missing observations considering a Euclidean distance metric. Figure 15 shows the original time history (in red) overlapping almost perfectly with the corresponding time history of imputed wave height information (in blue), which fills the gap of missing observations.
The MAE between the original time history and the one with imputed values corresponds to 0.077 m and 0.081 m, respectively, for the two study cases. These MAE values are consistent with the overall level of accuracy reported for the RF prediction performance in the previous sections, proving the reliability of the kNN approach for imputing the missing values with a satisfactory level of accuracy.
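A sketch of this validation exercise under the same assumptions (the table, column name, and date range are placeholders): a contiguous block of observations is blanked out, imputed via kNN, and compared against the withheld truth.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def validate_imputation(table: pd.DataFrame, column: str,
                        start: str, end: str, k: int = 5) -> float:
    """Blank out `column` over [start, end], impute via kNN, and return the MAE."""
    truth = table.loc[start:end, column].copy()
    corrupted = table.copy()
    corrupted.loc[start:end, column] = np.nan

    imputer = KNNImputer(n_neighbors=k, metric="nan_euclidean")
    filled = pd.DataFrame(imputer.fit_transform(corrupted),
                          index=table.index, columns=table.columns)
    return float((filled.loc[start:end, column] - truth).abs().mean())

# e.g. mae = validate_imputation(row01_table, "ChapelPoint", "2022-06-01", "2022-06-30")
```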
Figure 16 and Figure 17 show, in red, the original time histories of the wave height observations for the two most critical buoys, located at the ROW01 (Blakeney) and WOW04 (M2) sites, and, in blue, the reconstructed time histories with the imputed observations. These buoys consistently and significantly lose valuable wave height information over the time period considered for this study. Before aligning the buoys to build the training and testing datasets used in the present work, the dataset is pre-processed via the previously introduced imputation strategy.
The impact of the imputation procedure is quantified in Table 10 and Table 11, where model performance on the testing set is reported both before and after applying the imputation. For the ROW01 site, where the buoy coverage is dense and the sensor network offers a high degree of spatial redundancy, improvements are modest but consistent across all metrics. The reduction in both MAE and MdAE suggests a marginal enhancement in overall predictive accuracy and robustness, while the increase in R2Sc indicates that the model captures a slightly larger portion of the wave height variability. The RMSE also decreases, showing a small improvement in controlling larger deviations. These changes confirm that, even in well-instrumented settings where the model already performs strongly, imputation contributes to more stable and refined predictions.
In contrast, the WOW04 site, which is characterized by a sparser and more spatially dispersed network, shows substantial improvements. The MAE decreases by over 20%, demonstrating better average predictive behavior across all observations. The MdAE drops by 13%, reflecting a more consistent performance with fewer large, anomalous errors. The explained variance rises sharply, from 0.785 to 0.967, suggesting a greatly enhanced ability of the model to reconstruct the temporal dynamics of the system. Most notably, the RMSE is reduced by 65.8%, indicating a significant decrease in large error magnitudes, which is particularly important for applications where tail risk matters. Altogether, the improvements at WOW04 demonstrate that the imputation method is especially beneficial in data-sparse contexts, where its ability to stabilize predictions and reduce uncertainty is most impactful.
6.4. Uncertainty Quantification from Data Perturbations
The imputation method implemented in this work can be considered deterministic because, given the same dataset and parameters (such as the number of neighbors and the distance metric), it will always produce the same imputed values. The imputation process does not involve any randomness; it strictly follows the defined algorithm to find the nearest neighbors and calculate the imputed values based on those neighbors. However, the uncertainty in imputation can be quantified by introducing perturbations to the data before imputation. This method involves simulating different missing data patterns and perturbing the data within realistic ranges to analyze the variability in the imputed values.
To implement this approach, one would first introduce perturbations by simulating multiple scenarios of missing data by randomly removing different sets of values in the dataset. In this work, we are considering a perturbation of 0.1 m in the wave height measurements. After these perturbations, kNN imputation is applied to each perturbed dataset. The final step is to analyze the variability in the imputed values across these different perturbed datasets. The standard deviation of the imputed values across different scenarios can be used to quantify this uncertainty. The standard deviation provides a measure of how much the imputed values vary due to the different perturbations, offering insights into the confidence of the imputation process.
The formula for calculating the standard deviation of the imputed values across the different scenarios is
$$\sigma = \sqrt{\frac{1}{L}\sum_{i=1}^{L}\left(\hat{x}^{(i)} - \bar{\hat{x}}\right)^2}, \qquad \bar{\hat{x}} = \frac{1}{L}\sum_{i=1}^{L}\hat{x}^{(i)},$$
where $\hat{x}^{(i)}$ is an imputed value in scenario $i$, $\bar{\hat{x}}$ is the mean of the imputed values across all scenarios, and $L$ is the number of perturbation scenarios.
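A minimal sketch of this perturbation study, with KNNImputer standing in for the kNN step; the noise level, number of scenarios, and k are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def imputation_uncertainty(table: pd.DataFrame, noise_std: float = 0.05,
                           n_scenarios: int = 20, k: int = 5) -> np.ndarray:
    """Std of the imputed values across noise-perturbed scenarios (one per gap)."""
    rng = np.random.default_rng(0)
    gap_mask = table.isna().to_numpy()
    scenarios = []
    for _ in range(n_scenarios):
        # gaps stay missing because NaN + noise is still NaN
        perturbed = table + rng.normal(0.0, noise_std, size=table.shape)
        filled = KNNImputer(n_neighbors=k).fit_transform(perturbed)
        scenarios.append(filled[gap_mask])      # keep only the imputed entries
    # shape (L, n_missing): rows are scenarios, columns are individual gaps
    return np.stack(scenarios).std(axis=0)      # 1/L definition, matching the text
```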
Figure 18 illustrates the predicted wave heights over the full time series, highlighting both the actual and mean predicted values. The actual wave height values are plotted in blue, while the mean predicted values are shown in orange. The light blue shaded region represents the 95% prediction interval, calculated as a multiple of the standard deviation (1.96 for a Gaussian 95% interval) of the predictions obtained from multiple imputations with random noise perturbations (0.05 m).
This interval captures the uncertainty introduced by the variability in the imputed values due to different noise perturbations, offering insights into the model’s confidence. The plot indicates that the model’s predictions closely follow the actual wave heights, with the prediction interval effectively encompassing the observed variability.
This is also evident in the zoomed-in view of a shorter time period in Figure 18. This closer inspection reveals that the prediction intervals remain tight around the mean predictions, demonstrating the model’s consistent performance and confidence even over shorter time spans.
Including the prediction interval in both figures emphasizes the importance of considering uncertainty in the imputation process. By quantifying the variability in the predictions due to random noise perturbations, the Random Forest model provides point estimates and a probabilistic measure of the prediction reliability.
7. Discussion and Conclusions
This study focused on the development of a data-driven virtual buoy that predicts wave height through regression models. The performance of the Random Forest regressor was evaluated at four different sites with varying wave characteristics and data availability. Particular attention was given to evaluating the robustness, sensitivity, and data pre-processing of the RFR for the target regression task. The Random Forest regressor proved to be a computationally inexpensive and user-friendly algorithm that can provide accurate wave height predictions for virtual buoys based on measurements from nearby buoys. The average error in the wave height prediction is less than 15 cm for all the investigated sites and, in some cases, less than 10 cm, which is comparable to the accuracy of physical buoys. The RFR can capture the variability and nonlinearity of the wave height data and outperforms linear regression models that assume a linear relationship between the predictors and the response variable. The inherent uncertainty quantification in the Random Forest algorithm, derived from the ensemble of trees, provides valuable insights into the variability and confidence of the predictions.
The RFR is highly sensitive to the distance of the buoys from the target buoy. The closer buoys play a key role in the accuracy of the prediction, as they provide more relevant information about the local wave conditions. Therefore, the optimal number and location of buoys for each virtual buoy should be carefully selected based on the spatial correlation and the wave propagation patterns. Alternatively, fixed measurement stations or additional features, such as wind speed or wave direction, could help improve the performance of the model by providing more contextual information.
The RFR framework is easily transferable and performs well across different sites. It was implemented at four sites with different wave characteristics and data availability: the Gulf of Maine, Cape Cod Bay, and the UK’s east and west coasts. The RFR showed consistent and robust performance across all sites, with similar error metrics and correlation coefficients. This indicates that the RFR can adapt to different wave regimes and data sources and provide reliable wave height predictions for virtual buoys in various locations.
The missing data distributed among the buoys used for the regression can cause a substantial loss of data during the temporal alignment of the buoy network in the pre-processing phase. The alignment process requires all buoys to have complete data for each time step, which reduces the effective sample size and may introduce bias or uncertainty in the analysis. Therefore, a multivariate imputation strategy based on k-nearest neighbors was proposed and implemented to fill in the missing values before aligning the data. The kNN imputation method improved the performance of the model, as it reduced the error metrics and increased the correlation coefficients. Additionally, an uncertainty quantification approach was implemented by introducing perturbations to the data before imputation and then analyzing the variability in the imputed values. This method yielded a small uncertainty bound across all the perturbations, further enhancing the robustness and reliability of the model. This suggests that data pre-processing is an important factor for the success of the RFR and that imputation methods can help preserve and utilize the available information. The RFR framework can be applied to other locations and variables of interest, such as wave period or direction, to provide comprehensive information about the wave conditions for various applications.
The current model focuses on moderate sea states, reflecting operational limits for offshore wind maintenance, where vessel access is typically restricted to wave heights below 1.5–2.5 m. This use-inspired scope prioritizes accurate predictions within the actionable decision window rather than across all ocean conditions. As a future development, the framework could be extended to incorporate extreme wave events—potentially through multi-objective learning—to enhance its applicability in broader wave height estimation and long-term risk assessment contexts.