1. Introduction
Wave buoys are devices that measure oceanographic and meteorological parameters, such as wind speed and direction, air temperature and pressure, sea surface temperature, and the height, period, and direction of waves (Figure 1), providing critical data for activities such as boating, shipping, fishing, research, offshore construction, and disaster warning [1,2]. Wave buoys are especially important for offshore wind farms, since they provide timely data on changing wave conditions, which is crucial for optimizing the design, operation, and maintenance of wind turbines in a farm and for planning and scheduling their inspection, repair, or replacement [3]. Continuous real-time wave measurements at offshore wind farms serve multiple purposes, enhancing both operations and safety. They support power output optimization, improve turbine efficiency and health monitoring, and assist decision-makers in avoiding unnecessary or high-risk maintenance activities during adverse sea states that could endanger personnel or equipment [4]. Wave buoy data are also highly relevant for planning vessel operations; offshore wind farm operators can select the most suitable and cost-effective maintenance strategy for their site, considering factors such as weather windows, failure rates, availability, downtime, and penalty costs [5,6,7].
Since wave buoys are usually deployed in remote and harsh marine locations, they are vulnerable to damage or loss from environmental or human factors, such as storms, collisions, or fishing activities, and therefore require regular maintenance to remain functional. While undergoing maintenance, the devices are offline, making marine information that is crucial for planning other activities unavailable. These factors can significantly increase operational costs and reduce the availability and quality of wave data, limiting continuity and causing interruptions in data collection [8].
These factors make it necessary to create virtual wave buoy models that can provide more reliable and comprehensive wave data, filling the gap when data are not available. Virtual wave buoy models are numerical models that can learn and reproduce sea state and wave data at the target location in the event of equipment failure. They can also provide spatially distributed and temporally continuous wave data over a large area or region by utilizing multiple virtual wave buoy locations. Virtual wave buoy models can also improve the accuracy and quality of the wave data by using advanced algorithms that can estimate more wave characteristics and reduce errors and uncertainties.
Wave height is one of the most critical wave characteristics for coastal protection, ocean engineering, offshore operations, and marine disaster prevention [9,10]. However, wave height measurements from networks of moored wave buoys are often incomplete, or sometimes erroneous, due to maintenance operations or extreme events [11]. Therefore, virtual wave buoys for wave height prediction are needed.
Numerical models such as Simulating Waves Nearshore (SWAN), the Wave Modeling Project (WAM), and WAVEWATCH III [12,13,14] are commonly used to simulate sea wave characteristics based on physical equations. However, they are computationally expensive and time-consuming, especially for large domains and complex coastal areas [15]. Alternatively, machine learning techniques have been applied to wave height prediction and reconstruction using historical data from wave buoys. Many studies in recent years have focused on data-driven, artificial neural network-based strategies for building wave height prediction and hindcasting models [9,16,17,18,19,20,21,22]. These models all use historical data from a single sensor to train forecasting and predictive models for that same sensor. The results reported in the literature above show that such strategies can describe the wave height forecast at different locations and time horizons with sufficient reliability. However, the mean absolute error of the predicted wave height can reach 40 cm in some scenarios, which is not suitable for applications that require high-accuracy predictions. The reader is also referred to [23], which provides a comprehensive literature review of machine learning-based forecasting approaches.
Among the most recent and successful studies, Fan et al. [24] presented a model based on a long short-term memory network for 1 h and 6 h predictions of the significant wave height at ten stations, using different environmental conditions collected at the same sensor location as input features. The short-term prediction of wave height yields good results, with mean absolute error values below 10 cm for the 1 h prediction in some cases and increasing errors for longer time horizons. Hu et al. [25] developed a wave forecasting model for two wave buoys located in Lake Erie using extreme gradient boosting and a long short-term memory network. The authors used a very long training period, from 1994 to 2013, applying observed wind velocity as model input and observed significant wave height and peak wave period as target variables, obtaining accurate wave height predictions with mean absolute errors lower than 10 cm. Abed-Elmdoust and Kerachian [26] employed a bidirectional gated recurrent unit network to forecast tropical cyclone wave height using data from 14 buoys in various environments over nine years, achieving the lowest error and highest correlation coefficient among all tested models. Jörges et al. [27,28] proposed a long short-term memory (LSTM) network and then a convolutional neural network mixed-data deep neural network that predict spatial ocean wave heights using random-field-simulated bathymetry data as an additional input. They focused on a nearly 13-year dataset, integrated high-frequency weather data, and showed that including bathymetry features improved wave height reconstruction and prediction by reducing the root mean square error. Their study is highly relevant to coastal and shallow-water regions but may not be suitable for deeper offshore environments due to its intensive data requirements. Gomez et al. [29] used reanalysis data gridded on a fine latitude-longitude grid, modeling weather conditions for each buoy using a sub-grid of the four closest reanalysis nodes. This approach increases data volume but can introduce significant computational overhead and complexity. Moreover, their model relies on data from the same buoy as an input, limiting its applicability when predicting in the absence of such data.
These studies focused on a single-location prediction strategy, in which the metocean historical information collected at that same location is used to train and construct the prediction models. Londhe et al. [11] provided one of the only detailed studies implementing a buoy-network strategy. Their approach uses artificial neural networks to reconstruct significant wave height data missing at a target location from a network of public wave buoys in the surrounding area. They tested six different sites and showed promising results. However, for three sites in the northern US, which also used years of data for training, the mean absolute error for the prediction target buoy was relatively large, ranging between 30 and 40 cm.
Chen et al. [30] developed a Random Forest-based surrogate model to predict the significant wave height, mean wave direction, mean zero-crossing period, and peak wave period at a target location. The model was trained using 21 years of data simulated with the SWAN physics-based numerical model. The authors then used in situ buoy observations and wind data as inputs to the trained model and compared its performance with the SWAN model at a test location in the UK. The model can produce accurate spatial wave data with far less data input and computational time than the SWAN model, and it captures the seasonal and interannual variability of wave conditions at the test site. However, by relying on 21 years of SWAN-simulated data, the proposed methodology carries a heavy initial computational demand.
Among the most relevant and recent works, Patanè et al. [31] used convolutional layers for spatial feature extraction and long short-term memory layers for temporal modeling, adding complexity and computational demand. They focused on a single-buoy prediction scenario and rely on ERA5 reanalysis wind forcing with a spectral wave model, which can be limiting if reanalysis data are unavailable or unreliable. Minuzzi et al. [32] used NOAA numerical forecast data, targeting the residual between observational data and numerical model output. Their model training relied on a massive dataset spanning 20 years, which is impractical for many applications and may not be feasible for real-time or near-real-time predictions. Additionally, their results sometimes exhibited deviations greater than 1 m from observed data, indicating lower resolution performance in local contexts.
These studies demonstrate the potential of machine learning techniques for wave height prediction. However, most of the literature focuses on forecasting wave height using historical data from the same sensor location, with few studies addressing data fusion for virtual sensing using a network of buoys. Additionally, models experimenting with different input quantities often yield high prediction errors, sometimes as large as 20-30 cm. Furthermore, the most effective models rely on extensive training datasets, increasing computational demands and limiting their applicability to scenarios with abundant data.
This study aims to address these gaps by developing a purely observational, data-driven, user-friendly, and computationally efficient virtual buoy model. Unlike previous studies, we leverage a network-based strategy that does not rely on the target buoy’s own data for predictions, ensuring broader applicability and robustness in scenarios where buoy data are missing. The methodology is applied to multiple areas, testing different buoy networks and carrying out a sensitivity analysis to determine how the placement and arrangement of sensors in the ocean affect model accuracy. The model provides accurate wave height estimates for specific locations even when the physical sensors are non-functional. The strategy is not computationally demanding and can be easily generalized to different coastal and offshore regions without relying on external reanalysis datasets, extensive training data, numerical simulations, or additional features, maintaining high accuracy with fewer inputs. The model is tested across four sites with different conditions, some hosting offshore wind farms and providing unique case studies based on proprietary data. These cases highlight how crucial such a model is for offshore wind farms, where timely and precise marine information is essential for the safety of service vessels during navigation and maintenance operations.
Additionally, the proposed imputation strategy for handling missing data enhances data integrity and prediction reliability, addressing a critical aspect not covered in the existing literature. This study, therefore, offers significant advancements in offshore wave height prediction, combining robust imputation strategies with an efficient and broadly applicable network-based approach.
3. Datasets Description
This work considers four different sites: two in the UK involving publicly and privately owned wave buoys and two in the US using only measurement data from public buoys.
Table 1 provides the spatial coordinates and characteristics of the buoys in the sites.
The two UK sites correspond to two offshore wind farm sites: Race Bank Offshore Wind Farm (ROW01) (Figure 3a) and Walney Wind Farm (WOW04) (Figure 3b). In particular, for Race Bank, there are seven public buoys located in the surrounding area of the private target buoy owned by Ørsted, while for Walney, the data from five public buoys are considered.
In addition to the public buoy data, further measurements are available at the two UK sites. The ROW01 site has two additional private sensors: a wave radar and a lidar. The wave radar provides significant wave height, maximum wave height, median wave direction, and peak wave period with a sampling frequency of 1 min. The lidar provides wind speed and direction measurements at 100 m height every 10 min. For WOW04, the data come from three weather stations: one is placed directly at the WOW04 site; of the other two, one is located on the Walney farm but in a different section (WOW03), and the other at a nearby farm, the Burbo Bank offshore wind farm. The stations provide wave height and wind speed information with a sampling frequency of 15 min.
The US sites were selected according to current lease plans for offshore wind farms on the US East Coast. In particular, the area south of the Cape Cod region (Figure 3d), located in Nantucket Bay in front of Rhode Island, will host multiple wind farm projects, making it an area of interest for testing the virtual buoy strategy. The Gulf of Maine area (Figure 3c) was chosen because this location in New England has the highest number of public buoys with robust live streaming of data. Even though no offshore wind farms are currently planned for this area, the richness of available information makes it a valuable test bed for the proposed virtual sensing strategy under New England marine conditions. Real-time and historical weather and marine condition measurements were accessed through the National Oceanic and Atmospheric Administration (NOAA). In particular, the National Data Buoy Center (NDBC), part of NOAA’s National Weather Service, operates a network of data-collecting buoys and coastal stations that provide meteorological and oceanographic observations for weather forecasting, marine safety, research, and environmental monitoring.
The Martha’s Vineyard site relies on public buoy data only. Four buoys provide reliable and accurate wave height information at this site (Figure 3d). Since no private buoy was available, no target buoy was defined a priori. Buoy C was chosen as the reference target buoy because it is the one closest to the future offshore developments planned for the area and is therefore potentially the most representative of the marine conditions in the wind farm area. Similarly, for the Gulf of Maine, no private buoys are currently deployed in the area; only measurements from six public buoys are available, and no target buoy was defined a priori. Buoy F was selected as the reference target buoy due to its proximity to the coast, similar to buoys E, G, and I, and its central location relative to the rest of the buoy network. Unlike Buoy H, which is located in deeper, open waters farther offshore, Buoy F is more representative of the nearshore marine conditions of interest for this study.
The selected sites span a range of water depths, including both shallow and moderately deep offshore regions. For example, depths range from as low as 10 m (e.g., Cleveleys buoy at WOW04) to as deep as 177 m (e.g., Buoy H at the Gulf of Maine site). This variation allows us to assess the virtual buoy modeling framework across diverse oceanographic settings, supporting its generalizability.
Data Processing and Input Features Selection
As shown in the description of the dataset used in this work, the data measurements collected at the different buoy locations are often characterized by different sampling frequencies. The first step in the data processing phase therefore consists of aligning the sampling frequencies of the buoy data in time for each site. For the two UK sites, some buoys have a sampling frequency of 60 min and others of 30 min. For the analysis, a 30 min sampling resolution is selected, and, when needed (West Sole A, Clipper, and Sean P for ROW01 and M2 for WOW04), buoy records are up-sampled from 60 to 30 min by linear interpolation. For the Gulf of Maine site, all the buoys have a sampling frequency of 60 min and are kept at that constant sampling frequency. At the Martha’s Vineyard site, Buoys A, B, and D are all sampled at 60 min, while Buoy C has a more refined sampling resolution, with measurements collected every 30 min. For the analysis, a 60 min sampling resolution is selected, down-sampling Buoy C to one value per hour.
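As an illustration, the temporal alignment can be sketched as follows in Python with pandas, assuming each buoy's records are stored in a DataFrame with a datetime index and a significant wave height column (the column and function names are hypothetical):

```python
import pandas as pd

def align_buoy(df: pd.DataFrame, target_freq: str = "30min") -> pd.Series:
    """Resample one buoy's significant wave height onto the common time grid.

    Up-sampling (e.g., 60 min -> 30 min) is filled by linear interpolation;
    down-sampling (e.g., 30 min -> 60 min) keeps one value per interval.
    """
    hs = df["significant_wave_height"].sort_index()
    return hs.resample(target_freq).mean().interpolate(method="linear", limit=1)

def build_site_table(buoy_frames: dict, target_freq: str = "30min") -> pd.DataFrame:
    """Combine per-buoy series into one table: columns = buoys, rows = timestamps."""
    aligned = {name: align_buoy(df, target_freq) for name, df in buoy_frames.items()}
    return pd.DataFrame(aligned)
```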
The dataset is subjected to the same pre-processing strategy for all the sites. The missing values (NaN values) are removed from all the buoys involved in the analysis, together with any possible non-physical negative values. Then, since the research focuses on wave height prediction at target locations to inform the safe planning and logistics of maintenance operations for service vessels, there is no interest in accurately predicting wave height conditions above 3 m. This limit reflects the use of relatively small vessels for maintenance operations in the selected farms, which have wave height limits of 1.5 m for crew transfer vessels and 2.5 m for service operation vessels. Therefore, given the use-inspired nature of this research, any wave height information above 4 m for the target buoy is removed to optimize the training of the regression model for lower wave height conditions.
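A minimal sketch of these quality filters, applied to the aligned table from the previous snippet (the 4 m cap acts only on the target column; names are again hypothetical):

```python
import pandas as pd

def clean_site_table(table: pd.DataFrame, target_col: str,
                     max_target_hs: float = 4.0) -> pd.DataFrame:
    """Apply the quality filters described in the text.

    - drop timestamps where any buoy reports NaN (deletion strategy; replaced
      by imputation in Section 6)
    - remove non-physical negative wave heights
    - remove target observations above max_target_hs (use-inspired cap)
    """
    cleaned = table.dropna(how="any")
    cleaned = cleaned[(cleaned >= 0.0).all(axis=1)]
    return cleaned[cleaned[target_col] <= max_target_hs]
```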
Finally, to conduct comparable and reliable sensitivity analyses and hyperparameter tuning, the dataset is kept constant for all analyses presented in this section and the following sections. Therefore, the analyses presented in the following sections are performed using data between 28 August 2021 (5:00 AM) and 26 October 2022 (00:30 AM) for ROW01 and WOW04, while for the US sites the dataset is fixed between 1 January 2021 (00:00 AM) and 31 December 2022 (11:30 PM).
The data measurements collected by the buoys at each site are divided into two sets: training and testing datasets. The training dataset is utilized to develop and train the regression model. On the other hand, the testing dataset comprises data that the model never saw during the learning phase. It is used to assess the model’s performance in making accurate predictions on new and unseen data.
For the training set, the first 21 days of each month are used, while the remaining days are reserved for testing. This data-splitting strategy preserves the temporal order of observations, which is crucial in time series data such as wave heights where short-term autocorrelation is typically present. Unlike a random split that may distribute temporally adjacent data points across both sets—potentially leading to information leakage and inflated model performance—our approach ensures that the model is always evaluated on future data relative to its training period. This structure also allows the model to learn from each month’s variability while maintaining the continuity of autocorrelated sequences within each split, better reflecting the conditions of real-world forecasting.
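A minimal sketch of this calendar-based split, assuming a datetime-indexed table (the 21-day threshold follows the text; everything else is illustrative):

```python
import pandas as pd

def monthly_split(table: pd.DataFrame, train_days: int = 21):
    """First `train_days` days of each month -> training; remaining days -> testing."""
    is_train = table.index.day <= train_days
    return table[is_train], table[~is_train]

# Usage (hypothetical names): neighboring buoys as features, target buoy as label
# train_df, test_df = monthly_split(site_table)
# X_train, y_train = train_df.drop(columns=["target_buoy"]), train_df["target_buoy"]
# X_test, y_test = test_df.drop(columns=["target_buoy"]), test_df["target_buoy"]
```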
In the following sections, the results presented for the different regression models concern the prediction of the significant wave height at the target buoy using only the significant wave height measurements from the neighboring wave buoys as input features.
Section 5.3 will discuss and evaluate the model’s performance when additional data sources and quantities are considered input features.
4. Model Baseline and Algorithm Tuning
Hyperparameter tuning is a critical step in building a regression model. Its objective is to identify the hyperparameter values of a given algorithm that result in optimal model performance on the validation set. This work focuses on building a regression model and finding the optimal balance between the model’s performance and its complexity. The approach employed involves first creating a baseline model using an initial estimate of the hyperparameters. Subsequently, the parameters of the Random Forest regressor (RFR) are further tuned via a grid search strategy.
The initial guess for the hyperparameters of the Random Forest model is provided for the two UK sites (ROW01 and WOW04), which are the datasets with the largest number of buoys available for building the regression models. The baseline for the two UK sites is determined using the parameters presented in Table 2. For the US sites, the baseline performance is evaluated using a Random Forest model with the parameters defined for the ROW01 site, without further tuning. Table 3 reports the Random Forest model performance in predicting the wave height for the training and testing datasets at the four sites of interest.
On the training dataset, the ROW01 and WOW04 sites show lower MAE and MdAE values, indicating better model performance in these regions than at Martha’s Vineyard and the Gulf of Maine, which exhibit higher error metrics. In contrast, WOW04 exhibits a significantly larger RMSE on the testing dataset, highlighting a notable deviation in model predictions. The other sites, ROW01, Martha’s Vineyard, and the Gulf of Maine, display more consistent performance between the training and testing datasets, with ROW01 showing minimal error increases across both.
When performing hyperparameter tuning for Random Forest regression, the key hyperparameters considered are the number of estimators and the maximum depth, both explored over the set [50, 200, 500, 625, 750, 800]; for the Martha’s Vineyard and Gulf of Maine sites, this range is narrowed down to [150, 175, 200, 220, 250]. The minimum samples per leaf and the minimum samples per split are both explored over [2, 4, 6, 8, 10, 12]. The results optimized via the grid-search approach for the four investigated sites are presented in Table 4 and identified as the optimized version of the Random Forest regressor (RFRO).
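For illustration, the grid search can be set up with scikit-learn as sketched below; the grids mirror the values listed above, while the cross-validation settings and scoring metric are assumptions, since the text does not specify them:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grids for the UK sites; for Martha's Vineyard and the Gulf of Maine the first
# two grids are narrowed to [150, 175, 200, 220, 250], as described in the text.
param_grid = {
    "n_estimators": [50, 200, 500, 625, 750, 800],
    "max_depth": [50, 200, 500, 625, 750, 800],
    "min_samples_leaf": [2, 4, 6, 8, 10, 12],
    "min_samples_split": [2, 4, 6, 8, 10, 12],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # assumption: MAE-based selection
    cv=5,                               # assumption: 5-fold cross-validation
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# rfr_optimized = search.best_estimator_
```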
The results obtained by applying the optimized RFR model are compared with those obtained from the model without optimizing the hyperparameters and with the models chosen as reference models, including lasso regression [36], support vector regression [37], gradient boosting regression [38], multilayer perceptron neural networks [39], and long short-term memory networks [40].
Lasso regression [36] is a linear model that performs both variable selection and regularization by applying an L1 penalty to the regression coefficients. This penalty encourages sparsity by shrinking some coefficients to zero, improving model interpretability. The primary hyperparameter is the regularization strength $\alpha$, which controls the degree of penalization; its value is set based on the grid search results. While lasso regression is computationally efficient and effective for linear relationships, it lacks the flexibility of RF in capturing nonlinear dependencies and feature interactions.
Support vector regression (SVR) [37] is a kernel-based method that seeks a function approximating the data within a defined margin of tolerance, using support vectors to represent the solution. The main hyperparameters include $C$, which controls the trade-off between model complexity and the tolerance for errors (set to 100 in our case); $\gamma$, which defines the influence of individual training examples (set to 0.0001); and $\varepsilon$, which sets the width of the error-insensitive tube around the predicted values (set to 0.01). A radial basis function (RBF) kernel is chosen for its ability to capture nonlinear relationships. Compared to RF, SVR often achieves high accuracy in high-dimensional spaces but can be less scalable and more sensitive to hyperparameter tuning.
Gradient boosting (GB) regression [38] is an ensemble learning technique that sequentially adds weak learners, typically shallow decision trees, to minimize the residuals of previous models. Key hyperparameters include the number of estimators (set to 60), the learning rate (set to 0.4), and the maximum tree depth (set to 2), which collectively control the model’s complexity and learning speed. Additional parameters include the regularization parameter $\alpha$ for quantile loss (used with the ‘lad’ loss), the minimum samples to split (3), the minimum samples per leaf (1), and a minimum weight fraction per leaf of 0.01. Gradient boosting offers high predictive power and flexibility but is generally more computationally intensive and prone to overfitting compared to RF.
Multilayer perceptron (MLP) networks [39] are feedforward neural networks capable of approximating nonlinear functions. Our model uses a single hidden layer with 12 neurons and rectified linear unit (ReLU) activation. The regularization term $\alpha$, set to 10, controls the L2 weight decay that mitigates overfitting. Training is performed using the limited-memory BFGS optimizer with a maximum of 3000 iterations. While MLPs are expressive and powerful for capturing complex relationships, they require more careful tuning and offer less interpretability than RF.
Long short-term memory (LSTM) networks [40] are a type of recurrent neural network (RNN) specifically designed to handle sequential data with long-range dependencies. Their architecture incorporates gating mechanisms to regulate information flow through time steps. In this study, we implement a standard LSTM architecture to explore its ability to model temporal wave height dynamics. Compared to RF, LSTMs can capture temporal dependencies more explicitly, but they require significantly more data, computational resources, and tuning of architectural and training parameters.
In all cases, model hyperparameters are optimized using a grid search procedure consistent with that applied to RF. This ensures a fair performance comparison across methods.
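For reference, the scikit-learn baselines can be instantiated with the hyperparameter values reported above, as in the sketch below; the lasso regularization strength and the LSTM architecture are not fully specified in the text and are therefore left indicative:

```python
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

reference_models = {
    # regularization strength chosen by grid search (value not reproduced here)
    "Lasso": Lasso(),
    "SVR": SVR(kernel="rbf", C=100, gamma=1e-4, epsilon=0.01),
    "GB": GradientBoostingRegressor(
        n_estimators=60, learning_rate=0.4, max_depth=2,
        loss="absolute_error",  # least-absolute-deviation loss ('lad' in older scikit-learn)
        min_samples_split=3, min_samples_leaf=1, min_weight_fraction_leaf=0.01,
    ),
    "MLP": MLPRegressor(hidden_layer_sizes=(12,), activation="relu",
                        alpha=10, solver="lbfgs", max_iter=3000),
}
# The LSTM baseline is built with a dedicated deep learning library; its exact
# architecture is not detailed in the text and is therefore not sketched here.
```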
Figure 4 and Figure 5 show the regression model performance in terms of MAE for the training and testing datasets, considering the ROW01 and Martha’s Vineyard sites, respectively. The bars are color-coded, blue for training and red for testing, and numerical values are displayed directly on top of each bar for clarity. A third set of gray bars indicates the signed percentage difference between testing and training performance, shown on the secondary y-axis. This signed representation allows a direct comparison of relative performance, with negative values indicating better generalization to the test set.
The first observation concerns the comparison between the baseline RF model and its optimized version. The optimized model tends to show better performance on the training dataset without offering any significant improvement on the testing dataset. The small improvements between RFR and RFRO on the testing dataset, such as at the Martha’s Vineyard site, do not justify the increase in computational expense required by a more complex and deeper Random Forest model. For both sites, the Random Forest model, independent of the hyperparameter optimization, tends to perform better on the training dataset than on the testing dataset.
Figure 5 highlights the differences in the performance of the various regression models between the training and testing datasets. The Random Forest model shows, across the different sites, substantial discrepancies between the training and testing cases, revealing a tendency to overfit. Random Forest is known to suffer from overfitting, meaning that it learns noise or specific patterns in the training data that do not generalize well to new or unseen data. Overfitting can result in poor performance or high variance on the test or validation data. A complex model architecture or a dataset lacking richness are common causes of this tendency.
Although the RF model exhibits signs of overfitting, its performance remains comparable to that of the other models considered. Only the SVR and LSTM models show slightly better results on the testing dataset. The LSTM model achieves marginally lower error on the test set than on the training set, particularly for the ROW01 site. This behavior is likely due to the non-random data split, where the training set contains a greater proportion of complex temporal relationships. In such cases, the model may appear to underperform during training while generalizing better to unseen data. Nonetheless, despite these localized improvements, models such as SVR and LSTM are more sensitive to hyperparameter choices and require longer training times. In contrast, the Random Forest model demonstrates more consistent and robust performance across both sites, making it a more practical and reliable choice for the application considered in this study.
When considering efficiency for a simple regression task, Random Forest surpasses alternatives such as support vector regression, long short-term memory, and gradient boosting in speed and simplicity. The computational complexity of RF training is approximately $O(T \cdot F \cdot N \log N)$, where $T$ is the number of trees, $N$ is the number of samples, and $F$ is the number of features. This structure allows RF to train efficiently by building trees in parallel, making it ideal for real-time applications.
In contrast, the complexity of SVR is $O(N^3)$, as it involves solving a quadratic optimization problem, making it computationally prohibitive for large datasets. During training, LSTM models rely on backpropagation through time to update network weights based on sequences of input data, which can be computationally intensive. During inference, backpropagation is not required; however, the model must still update its hidden state sequentially at each time step, limiting parallelization. LSTM is characterized by a complexity of roughly $O(N \cdot d^2)$, where $d$ is the hidden state size. While GB shares a theoretical complexity similar to that of RF, $O(T \cdot F \cdot N \log N)$, its sequential tree-building prevents efficient parallelization, making it slower in practice. MLPs, on the other hand, have a complexity of approximately $O(N \cdot L \cdot F \cdot H)$, where $L$ is the number of layers, $F$ is the number of input features, and $H$ is the number of neurons per layer. Hence, RF is the superior choice for simple, scalable regression models that balance accuracy and computational efficiency, especially for real-time or large-scale applications like wave height prediction.
To illustrate, consider the ROW01 case, where 1000 samples and 7 features are used. The complexity for the baseline Random Forest model with 18 estimators is approximately 1.3 million operations, while the optimized RF model with 750 estimators results in 57.75 million operations. By comparison, due to its linear nature, lasso regression requires just 7000 operations but lacks the flexibility to capture complex relationships. With its cubic complexity, SVR demands 1 billion operations, making it far more computationally expensive. LSTM, with its sequential nature, has a complexity of 25 million operations, while GB, with 60 estimators, results in 4.6 million operations, and finally, the MLP model, using one hidden layer with 12 neurons, requires approximately 84,000 operations, making it efficient in practice.
This comparison highlights the advantages of RF in terms of computational efficiency, structural flexibility, and overall robustness. While lasso regression offers a straightforward linear architecture and low computational complexity, it presumes a linear relationship between input features and the target variable, which makes it unreliable when that assumption is violated. Moreover, its objective function, which minimizes the residual error together with the sum of the absolute values of the coefficients, leaves it sensitive to outliers. In contrast, RF, as a non-parametric ensemble method, makes no assumptions about the functional form of the data, inherently providing greater flexibility in capturing complex, nonlinear interactions. Its ensemble structure enhances robustness to outliers by mitigating the influence of individual data points and allows it to naturally accommodate multicollinearity without the need for explicit feature selection, which is particularly valuable in cases where input features, such as significant wave height estimates from nearby marine locations, may be highly correlated. Additionally, RF supports inherent uncertainty quantification through the variation among its decision trees, which enhances model interpretability and confidence in the predictions. Although MLPs benefit from hardware-optimized matrix operations that make them computationally efficient, they are typically more difficult to interpret and require careful tuning to avoid instability or underfitting. In conclusion, despite the overfitting tendencies observed in our experiments, RF remains a compelling choice for wave height prediction due to its robustness to data variability, competitive performance, scalability, and ability to model complex patterns without extensive pre-processing or assumptions about data structure.
To assess the computational efficiency of the adopted model, we measure the training and inference times of the Random Forest model in the context of the ROW01 site. On a standard consumer-grade machine equipped with an Intel(R) Core(TM) i7-10870H CPU @ 2.20GHz (8 cores, 16 threads), 32 GB of RAM, and running Windows 11, model training requires approximately 1.97 s, while inference on the validation set takes 0.4 s. All computations are executed using CPU-based parallelization with the scikit-learn implementation, and the GPU is not utilized. These results demonstrate that the model is capable of delivering predictions at speeds suitable for near real-time applications, supporting its practical deployment in monitoring and decision-support scenarios.
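A minimal sketch of how such wall-clock measurements can be taken (the placeholder arrays only stand in for the ROW01 tables):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder arrays standing in for the ROW01 training/testing tables
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 7)), rng.random(1000)
X_test = rng.random((300, 7))

model = RandomForestRegressor(n_estimators=18, n_jobs=-1)  # baseline configuration

t0 = time.perf_counter()
model.fit(X_train, y_train)            # CPU-parallel tree construction
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
_ = model.predict(X_test)              # inference on the held-out set
infer_time = time.perf_counter() - t0

print(f"training: {train_time:.2f} s, inference: {infer_time:.3f} s")
```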
Uncertainty Quantification from Random Forest Trees
Random Forest models provide a natural way to quantify uncertainty through the variance among the predictions of the individual trees. Each tree in the forest is trained on a bootstrap sample of the data, leading to slightly different models. The variability among the predictions from these different trees can be used to estimate the uncertainty of the overall model prediction.
The prediction variance can be calculated by first obtaining the predictions from all trees in the forest for a given data point. The variance of these predictions provides a measure of the model’s uncertainty. Specifically, the prediction variance is computed as the average of the squared deviations of each tree’s prediction from the mean prediction of all trees. This can be expressed as
$$\sigma^2(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2, \qquad \bar{y}(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t(\mathbf{x}),$$
where $\hat{y}_t(\mathbf{x})$ is the prediction from tree $t$, $\bar{y}(\mathbf{x})$ is the mean prediction of all $T$ trees, and $\mathbf{x}$ is the input data point. This approach allows for estimating prediction intervals, providing a measure of confidence for each prediction. Higher variance indicates higher uncertainty, guiding further analysis or data collection efforts to reduce this uncertainty.
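A minimal sketch of this per-tree variance computation with scikit-learn; building a 95% interval from a Gaussian multiplier of 1.96 is an assumption used here for illustration:

```python
import numpy as np

def rf_prediction_interval(rf_model, X, z: float = 1.96):
    """Mean prediction and approximate 95% interval from per-tree variability."""
    # predictions of every individual tree: shape (n_trees, n_samples)
    tree_preds = np.stack([tree.predict(X) for tree in rf_model.estimators_])
    mean = tree_preds.mean(axis=0)
    std = tree_preds.std(axis=0)  # square root of the variance defined above
    return mean, mean - z * std, mean + z * std
```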
The results of this uncertainty analysis are presented in Figure 6 and Figure 7 for the ROW01 site. Figure 6 shows the RF predictions with 95% prediction intervals at different time scales. The blue lines represent the actual wave heights, the orange lines represent the predicted values, and the shaded violet areas indicate the 95% prediction intervals, reflecting the model’s uncertainty. In Figure 6, the highlighted shaded rectangle indicates the specific time window that is zoomed in on the right-hand side of the figure. The alignment of the predicted values with the actual values, along with the narrow prediction intervals, suggests that the model performs well, with high confidence in its predictions.
Figure 7 summarizes the model’s residuals and performance evaluation. The left subplot shows the residual plot, where residuals (the difference between actual and predicted values) are plotted against the predicted values. The red-shaded region represents the 95% confidence interval around the residuals, providing a visual depiction of the uncertainty. Most residuals are centered around zero, indicating no significant bias in the model. The spread of residuals remains relatively consistent across the range of predicted values, although a slight increase in variance at higher predicted values suggests mild heteroscedasticity. The clustering of residuals around zero further confirms the model’s accuracy, with most errors being small.
The middle plot is a horizontal histogram of the residuals combined with a kernel density estimate (KDE), offering insight into the spread of the errors. This plot suggests a normal distribution, with most residuals centered around zero, indicating a well-calibrated model with no significant skewness in the errors.
The right subplot compares predicted values vs. actual values, with the dashed red line representing the perfect prediction line (where predicted equals actual). The dense clustering of points along this line confirms that the model has high predictive accuracy, with only minor deviations observed. The inclusion of error bars represents the uncertainty bounds (±10% of the residuals), further illustrating the model’s ability to reliably estimate wave height values across the dataset.
6. Data Processing and Imputation of Missing Data
Data quality is a crucial factor for regression and virtual sensing strategies, which aim to estimate or infer the values of certain variables based on available data measurements. Poor data quality can affect the accuracy, reliability, and validity of regression models, such as Random Forest and virtual sensors, which use machine learning techniques to simulate physical sensors. This section discusses the uncertainty quantification (UQ) associated with pre-processing stages, particularly focusing on k-nearest neighbors (kNN) for imputation.
For example, remote sensing data can suffer from various imperfections, such as noise, distortion, missing values, outliers, or inconsistency, that can alter the extracted information and the decisions made. Similarly, virtual sensing for water quality assessment can be influenced by the quality of the input data, such as water samples, sensor readings, or environmental factors. Therefore, data quality assessment and improvement methods are essential to ensure the performance and robustness of the regression and virtual sensing strategies.
In the present application, two data quality issues are considered and addressed in two separate stages in the context of wave height prediction. The first data quality verification and pre-processing step, as mentioned in Section 5, focuses on the removal of any non-physical observation corresponding to negative values for all the buoys involved in the analysis (target and input buoys). Then, any wave height observation above 4 m for the target buoy is removed in order to optimize the training of the regression model for the lower wave height conditions of interest.
The second data quality problem addressed in the present virtual sensing strategy focuses on finding a solution for the presence of missing data in the virtual buoy acquisition. Missing data is a common problem in many real-world datasets, and it can substantially affect the performance and validity of regression models. Depending on the mechanism and pattern of missingness, different methods can be applied to handle missing data, such as deletion, imputation, or model-based approaches [41]. However, not all methods are suitable for all types of regression models, and some may introduce bias or uncertainty in the results.
In the preliminary analysis presented in the previous section, the missing data are handled by deletion. In particular, whenever any of the buoys (input and target) present a missing data observation, that information in time is removed from all the buoys. This strategy presents advantages and disadvantages. Deletion is simple and easy to implement, as it does not require any assumptions or models for the missing data mechanism, preserves the distribution and the relationships of the observed variables, and does not introduce any artificial values. Moreover, deletion can be unbiased and valid if the data are missing completely at random, meaning that the missing data are unrelated to any observed or unobserved variables. On the other hand, deletion can reduce the sample size and the statistical power of the analysis, as it discards potentially useful information from the non-missing values, and it can be inefficient and impractical if the data have a large proportion or a complex pattern of missingness, as it may result in losing too many observations or variables.
In the present application, the deletion strategy results in the elimination of a significant portion of the observable wave height information for all the investigated sites. To illustrate the effect of the deletion strategy on data loss, Table 8 and Figure 14 show the breakdown by buoy of the data quality information for ROW01.
The ROW01 site shows a variable distribution of missing values depending on the buoy considered (the discrepancy between the theoretical number of observations in the selected period of time and the number of observations effectively collected), with values of up to 22% for some of the buoys. In addition to the missing values, identified as gaps in the expected timestamp sequence, there are also a number of measurements for which the timestamp is present but the wave height is explicitly recorded as “NaN”. These two types of incompleteness are handled distinctly but ultimately processed uniformly for imputation. A very small number of negative values in the data also need to be removed. To optimize the prediction accuracy of the model in the wave height range of interest, the values above 4 m for the target buoy are removed. The final step in the cleaning stage is the temporal alignment of all the measurements from the network of buoys.
Since all the data for the Martha’s Vineyard and Gulf of Maine sites are maintained and pre-processed by the NDBC, there are rarely recorded “NaN”, missing, or negative values. When data were not recorded for a given time stamp, that time stamp is simply omitted from the dataset.
The final alignment stage of the cleaning phase is where most of the data are lost (about 30% per buoy). It is evident that handling missing data by deletion causes a significant loss of wave height observations. Such a loss of information can easily lead to the previously mentioned disadvantages for a regression model like RF, negatively affecting the performance of the virtual sensing strategy. Therefore, an alternative to handling missing data by deletion is needed. Random Forest implementations allow missing data to be handled with alternative approaches. It is possible to impute the missing data using either median or proximity-based measures [42,43] or by splitting the data into two subsets based on the presence or absence of missing values in each node [33]. Both methods have advantages and disadvantages, and their accuracy and efficiency depend on the amount and pattern of missing data, as well as the complexity and variability of the data [44].
The present work addresses the problem of missing data via pre-imputation, implementing a k-nearest neighbors algorithm.
6.1. Imputation via k-Nearest Neighbors
Multivariate imputation via k-nearest neighbors (kNN) [45] is a method for filling in missing values in a dataset using information from other variables. The idea is to find the $k$ most similar observations to the one with missing values, based on some distance metric, and use their values to impute the missing ones. The main steps in the implementation of the imputation via kNN are as follows (a code sketch follows the list):
Identify the missing values in the dataset and mark them with a special value, such as NaN.
Choose a distance metric to measure the similarity between observations, such as Euclidean distance or dynamic time warping for time series data.
Choose a value for k, the number of nearest neighbors to use for imputation. This can be performed empirically by comparing the performance of different values of k on a validation set.
For each observation with missing values, find the k-nearest neighbors with complete values for the same variable using the distance metric and the non-missing variables.
Impute the missing values using some aggregation function of the values from the k nearest neighbors, such as mean, median, or mode.
Repeat the process until all missing values are imputed or until convergence is reached.
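These steps can be sketched with scikit-learn's KNNImputer, which implements a NaN-aware Euclidean distance and neighbor-mean aggregation consistent with the procedure above; the choice of k and the toy values below are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_buoy_network(table: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Fill missing wave heights in a buoys-as-columns table via kNN imputation."""
    imputer = KNNImputer(n_neighbors=k, weights="uniform", metric="nan_euclidean")
    filled = imputer.fit_transform(table.to_numpy())
    return pd.DataFrame(filled, index=table.index, columns=table.columns)

# Toy usage: four buoys, one missing observation for buoy 1 at time step 4
toy = pd.DataFrame({
    "buoy1": [0.8, 0.9, 1.0, np.nan, 1.1],
    "buoy2": [0.7, 0.8, 0.9, 1.0, 1.0],
    "buoy3": [0.9, 1.0, 1.1, 1.2, 1.2],
    "buoy4": [0.6, 0.7, 0.8, 0.9, 0.9],
})
print(impute_buoy_network(toy, k=2))
```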
A simple example is provided below to give a clearer description of the above-mentioned steps. The main goal is to impute the value $x_{i,h}$, where row ($i$) is the time observation and column ($h$) is the buoy. As an example, let us consider a set of time series collecting $N$ wave height observations for four different buoys, as presented in Table 9; the goal is to fill the missing NaN information at time step 4 for buoy 1 and therefore impute the observation value $x_{4,1}$ (Table 9).
6.2. Step-by-Step Procedure for kNN Imputation
Step 1: Identify the missing value $x_{i,h}$, where row $i$ corresponds to the time index and column $h$ indicates the buoy (feature) with the missing value. For example, $x_{4,1}$ indicates the wave height at time step 4 for buoy 1.
Step 2a: Compute the Euclidean distance between row $i$ and each of the other rows $p \neq i$, using all columns that do not contain missing values in both rows $i$ and $p$.
The distance is given by
$$d_{i,p} = \sqrt{\frac{M}{|S_{i,p}|}\sum_{m \in S_{i,p}}\left(x_{i,m} - x_{p,m}\right)^2},$$
where $M$ is the total number of columns (buoys), $S_{i,p}$ is the set of usable columns (no NaNs in both rows), and $M/|S_{i,p}|$ is the weighting factor adjusting for varying dimensionality.
Example 1. For $i = 4$, a candidate row $p$ with no missing values in the shared columns, and the corresponding set $S_{4,p}$, the distance $d_{4,p}$ follows directly from the expression above using the wave heights in Table 9.
Step 2b: If some dimensions in $S_{i,p}$ are missing for either row, remove those dimensions and update $|S_{i,p}|$ accordingly.
Example 2. For a candidate row $p$ with a missing value in one of the shared columns, the distance $d_{4,p}$ is computed over the reduced set $S_{4,p}$, with the weighting factor rescaled accordingly.
Step 3: Choose the number of nearest neighbors $k$.
Step 4: Identify the $k$ rows with the smallest distances $d_{i,p}$. Denote these rows as $p_1, \dots, p_k$.
Step 5: Impute the missing value $x_{i,h}$ as the mean of the values at column $h$ from the $k$ nearest rows:
$$\hat{x}_{i,h} = \frac{1}{k}\sum_{j=1}^{k} x_{p_j,h}.$$
Example 3. Once the $k$ nearest rows to time step 4 are identified from Table 9, the imputed value $\hat{x}_{4,1}$ is the mean of their buoy 1 wave heights.
6.3. Random Forest Performance with Pre-Imputation via k-Nearest Neighbors
A preliminary performance assessment is carried out by artificially removing a month of data from one of the buoys at the ROW01 site. The missing observations are then imputed following the kNN-based strategy described in the previous section. In particular, the month of June is removed from the Chapel Point buoy (highlighted by the shaded area in Figure 15).
The kNN algorithm is used to impute the missing observations considering a Euclidean distance metric. Figure 15 shows the original time history (in red) overlapping almost perfectly with the corresponding time history of imputed wave height information (in blue), which fills the gap of missing observations.
The MAE between the original time history and the one with imputed values corresponds to 0.077 m and 0.081 m, respectively, for the two study cases. These MAE values are consistent with the overall level of accuracy reported for the RF prediction performance in the previous sections, proving the reliability of the kNN approach for imputing the missing values with a satisfactory level of accuracy.
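A sketch of this validation exercise under the same assumptions (the table, column name, and date range are placeholders): a contiguous block of observations is blanked out, imputed via kNN, and compared against the withheld truth.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def validate_imputation(table: pd.DataFrame, column: str,
                        start: str, end: str, k: int = 5) -> float:
    """Blank out `column` over [start, end], impute via kNN, and return the MAE."""
    truth = table.loc[start:end, column].copy()
    corrupted = table.copy()
    corrupted.loc[start:end, column] = np.nan

    imputer = KNNImputer(n_neighbors=k, metric="nan_euclidean")
    filled = pd.DataFrame(imputer.fit_transform(corrupted),
                          index=table.index, columns=table.columns)
    return float((filled.loc[start:end, column] - truth).abs().mean())

# e.g. mae = validate_imputation(row01_table, "ChapelPoint", "2022-06-01", "2022-06-30")
```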
Figure 16 and Figure 17 show, in red, the original time histories of the wave height observations for the two most critical buoys, located at the ROW01 (Blakeney) and WOW04 (M2) sites, and, in blue, the reconstructed time histories with the imputed observations. These buoys consistently and significantly lose valuable wave height information over the time period considered for this study. Before aligning the buoys to build the training and testing datasets used in the present work, the dataset is pre-processed via the previously introduced imputation strategy.
The impact of the imputation procedure is quantified in Table 10 and Table 11, where model performance on the testing set is reported both before and after applying the imputation. For the ROW01 site, where the buoy coverage is dense and the sensor network offers a high degree of spatial redundancy, improvements are modest but consistent across all metrics. The reduction in both MAE and MdAE suggests a marginal enhancement in overall predictive accuracy and robustness, while the increase in R2Sc indicates that the model captures a slightly larger portion of the wave height variability. The RMSE also decreases, showing a small improvement in controlling larger deviations. These changes confirm that, even in well-instrumented settings where the model already performs strongly, imputation contributes to more stable and refined predictions.
In contrast, the WOW04 site, which is characterized by a sparser and more spatially dispersed network, shows substantial improvements. The MAE decreases by over 20%, demonstrating better average predictive behavior across all observations. The MdAE drops by 13%, reflecting a more consistent performance with fewer large, anomalous errors. The explained variance rises sharply, from 0.785 to 0.967, suggesting a greatly enhanced ability of the model to reconstruct the temporal dynamics of the system. Most notably, the RMSE is reduced by 65.8%, indicating a significant decrease in large error magnitudes, which is particularly important for applications where tail risk matters. Altogether, the improvements at WOW04 demonstrate that the imputation method is especially beneficial in data-sparse contexts, where its ability to stabilize predictions and reduce uncertainty is most impactful.
6.4. Uncertainty Quantification from Data Perturbations
The imputation method implemented in this work can be considered deterministic because, given the same dataset and parameters (such as the number of neighbors and the distance metric), it will always produce the same imputed values. The imputation process does not involve any randomness; it strictly follows the defined algorithm to find the nearest neighbors and calculate the imputed values based on those neighbors. However, the uncertainty in imputation can be quantified by introducing perturbations to the data before imputation. This method involves simulating different missing data patterns and perturbing the data within realistic ranges to analyze the variability in the imputed values.
To implement this approach, one would first introduce perturbations by simulating multiple scenarios of missing data by randomly removing different sets of values in the dataset. In this work, we are considering a perturbation of 0.1 m in the wave height measurements. After these perturbations, kNN imputation is applied to each perturbed dataset. The final step is to analyze the variability in the imputed values across these different perturbed datasets. The standard deviation of the imputed values across different scenarios can be used to quantify this uncertainty. The standard deviation provides a measure of how much the imputed values vary due to the different perturbations, offering insights into the confidence of the imputation process.
The formula for calculating the standard deviation of the imputed values across the different scenarios is
$$\sigma = \sqrt{\frac{1}{L}\sum_{i=1}^{L}\left(\hat{x}^{(i)} - \bar{\hat{x}}\right)^2}, \qquad \bar{\hat{x}} = \frac{1}{L}\sum_{i=1}^{L}\hat{x}^{(i)},$$
where $\hat{x}^{(i)}$ is an imputed value in scenario $i$, $\bar{\hat{x}}$ is the mean of the imputed values across all scenarios, and $L$ is the number of perturbation scenarios.
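A minimal sketch of this perturbation study, with KNNImputer standing in for the kNN step; the noise level, number of scenarios, and k are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def imputation_uncertainty(table: pd.DataFrame, noise_std: float = 0.05,
                           n_scenarios: int = 20, k: int = 5) -> np.ndarray:
    """Std of the imputed values across noise-perturbed scenarios (one per gap)."""
    rng = np.random.default_rng(0)
    gap_mask = table.isna().to_numpy()
    scenarios = []
    for _ in range(n_scenarios):
        # gaps stay missing because NaN + noise is still NaN
        perturbed = table + rng.normal(0.0, noise_std, size=table.shape)
        filled = KNNImputer(n_neighbors=k).fit_transform(perturbed)
        scenarios.append(filled[gap_mask])      # keep only the imputed entries
    # shape (L, n_missing): rows are scenarios, columns are individual gaps
    return np.stack(scenarios).std(axis=0)      # 1/L definition, matching the text
```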
Figure 18 illustrates the predicted wave heights over the full time series, highlighting both the actual and mean predicted values. The actual wave height values are plotted in blue, while the mean predicted values are shown in orange. The light blue shaded region represents the 95% prediction interval, calculated as a multiple of the standard deviation (1.96 for a Gaussian 95% interval) of the predictions obtained from multiple imputations with random noise perturbations (0.05 m).
This interval captures the uncertainty introduced by the variability in the imputed values due to different noise perturbations, offering insights into the model’s confidence. The plot indicates that the model’s predictions closely follow the actual wave heights, with the prediction interval effectively encompassing the observed variability.
This is also evident in the zoomed-in view of a shorter time period in Figure 18. This closer inspection reveals that the prediction intervals remain tight around the mean predictions, demonstrating the model’s consistent performance and confidence even over shorter time spans.
Including the prediction interval in both figures emphasizes the importance of considering uncertainty in the imputation process. By quantifying the variability in the predictions due to random noise perturbations, the Random Forest model provides point estimates and a probabilistic measure of the prediction reliability.
7. Discussion and Conclusions
This study focused on the development of a data-driven virtual buoy that predicts wave height through regression models. The performance of the Random Forest regressor was evaluated at four different sites with varying wave characteristics and data availability. Particular attention was given to evaluating the robustness, sensitivity, and data pre-processing of the RFR for the target regression task. The Random Forest regressor proved to be a computationally inexpensive and user-friendly algorithm that can provide accurate wave height predictions for virtual buoys based on measurements from nearby buoys. The average error in the wave height prediction is less than 15 cm for all the investigated sites and, in some cases, less than 10 cm, which is comparable to the accuracy of physical buoys. The RFR can capture the variability and nonlinearity of the wave height data and outperforms linear regression models that assume a linear relationship between the predictors and the response variable. The inherent uncertainty quantification in the Random Forest algorithm, derived from the ensemble of trees, provides valuable insights into the variability and confidence of the predictions.
The RFR is highly sensitive to the distance of the buoys from the target buoy. The closer buoys play a key role in the accuracy of the prediction, as they provide more relevant information about the local wave conditions. Therefore, the optimal number and location of buoys for each virtual buoy should be carefully selected based on the spatial correlation and the wave propagation patterns. Alternatively, fixed measurement stations or additional features, such as wind speed or wave direction, could help improve the performance of the model by providing more contextual information.
The RFR framework is easily transferable and performs well across different sites. It was implemented at four sites with different wave characteristics and data availability: the Gulf of Maine, Cape Cod Bay, and the UK’s east and west coasts. The RFR showed consistent and robust performance across all sites, with similar error metrics and correlation coefficients. This indicates that the RFR can adapt to different wave regimes and data sources and provide reliable wave height predictions for virtual buoys in various locations.
The missing data distributed among the buoys used for the regression can cause a substantial loss of data during the temporal alignment of the buoy network in the pre-processing phase. The alignment process requires all buoys to have complete data for each time step, which reduces the effective sample size and may introduce bias or uncertainty in the analysis. Therefore, a multivariate imputation strategy based on k-nearest neighbors was proposed and implemented to fill in the missing values before aligning the data. The kNN imputation method improved the performance of the model, as it reduced the error metrics and increased the correlation coefficients. Additionally, an uncertainty quantification approach was implemented by introducing perturbations to the data before imputation and then analyzing the variability in the imputed values. This method yielded a small uncertainty bound across all the perturbations, further enhancing the robustness and reliability of the model. This suggests that data pre-processing is an important factor for the success of the RFR and that imputation methods can help preserve and utilize the available information. The RFR framework can be applied to other locations and variables of interest, such as wave period or direction, to provide comprehensive information about the wave conditions for various applications.
The current model focuses on moderate sea states, reflecting operational limits for offshore wind maintenance, where vessel access is typically restricted to wave heights below 1.5–2.5 m. This use-inspired scope prioritizes accurate predictions within the actionable decision window rather than across all ocean conditions. As a future development, the framework could be extended to incorporate extreme wave events—potentially through multi-objective learning—to enhance its applicability in broader wave height estimation and long-term risk assessment contexts.