Article

Ensemble Hindcasting of Coastal Wave Heights

by Namitha Viona Pais 1,*, Nalini Ravishanker 1,†, James O’Donnell 2,† and Ellis Shaffer 3
1 Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
2 Connecticut Institute for Resilience and Climate Adaptation, Department of Marine Sciences, University of Connecticut, Groton, CT 06340, USA
3 S&P, New York, NY 10010, USA
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
J. Mar. Sci. Eng. 2023, 11(6), 1110; https://doi.org/10.3390/jmse11061110
Submission received: 26 April 2023 / Revised: 11 May 2023 / Accepted: 22 May 2023 / Published: 24 May 2023
(This article belongs to the Section Coastal Engineering)

Abstract

Long records of wave parameters are central to the estimation of coastal flooding risk and the causes of coastal erosion. This paper leverages the predictive power of wave height history and its correlations with wind speed and direction to build statistical models for time series of wave heights, providing a method to fill data gaps and extend the record length of coastal wave observations. A threshold regression model is built where the threshold parameter, based on lagged wind speed, explains the nonlinear associations, and the lagged predictors in the model are based on a well-established empirical wind-wave relationship. The predictive model is completed by addressing the residual conditional heteroscedasticity using a GARCH model. This comprehensive model is trained on time series data from 2005 to 2013, using wave height and wind data both observed from a buoy in Long Island Sound. Subsequently, replacing the buoy wind data with observations from a nearby coastal station provides a similar level of predictive accuracy. This approach can be used to hindcast wave heights for past decades given only wind information at a coastal station. These hindcasts are used as representative of the unobserved past to carry out extreme value analysis by fitting a generalized Pareto (GP) distribution in a peaks over threshold (POT) framework. By analyzing longer periods of data, we can obtain reliable return value estimates to help design better coastal protection structures.

1. Introduction

The development of strategies to reduce the impact of coastal erosion and flooding must be informed by quantitative estimates of the wave height and period that a site is likely to experience. Where long records of observations are available, the methods of extreme value analysis [1] are often used to estimate the wave height that has a 1% probability of exceedance in a year, and these values are used in the design of coastal structures. A succinct summary of the history of the approach in engineering applications is provided in [2]. Unfortunately, even the longest data records seldom exceed twenty years, so the 1% exceedance wave height estimates are generally based on an extrapolation of a statistical model chosen to fit the observed frequency of much more likely events. The results are, therefore, sensitive to the model selected. Observations of wind velocity have been recorded at many more locations, and for longer time periods, so many projects develop a design wind condition and then use an empirical wind-driven wave formula (e.g., US Army Corps of Engineers) [3,4] to develop coastal design guidance. More recently, physics-based mathematical models of the mechanisms of wave growth, propagation, and decay have been used to synthesize long time series of wave statistics at locations where design parameters are required. A few approaches use records of observed winds to force models that compute the wave field evolution, and others have used atmospheric models or parametric representations of the path and character of hurricanes [5,6,7]. For instance, it is useful to employ numerical wave models, such as the phase-resolving Boussinesq-type model [8,9], to predict coastal wave conditions by simulating real physical processes.
Extreme value analysis of significant wave heights has been extensively studied in offshore and coastal safety [10,11]. However, this analysis may be limited by shorter record durations and gaps due to instrument failures.
This paper describes and assesses a novel statistical approach to extend the data record length of wave observations that exploits the availability of long records of wind observations at coastal sites and shorter near-shore buoy-based wave observations, and the space-time correlation of wind and wave parameters. The method allows the synthesis of longer records of coastal wave height for use in extreme value estimation.
Wave data is distributed to the public through two main portals, the National Oceanic and Atmospheric Administration (NOAA) and the United States Army Corps of Engineers (USACE) databases. Buoy fleets are independently owned and operated by universities, private research institutes, and government agencies.
Our analysis relies on NOAA datasets as distributed through NOAA’s National Data Buoy Center (NDBC). We consider hourly data from the NOAA buoy # 44039 in the Central Long Island Sound (see Figure 1) on these variables: significant wave height (referred to as wave height in the rest of the paper), wind direction, and wind speed. We also employ measurements of the wind speed and direction made at Sikorsky Memorial Airport, Bridgeport, CT (USAF station ID 725040, available at ftp://ftp.ncdc.noaa.gov/pub/data/noaa, accessed on 1 October 2022). The goal of this analysis is to
(a) verify the similarity (association) between the wind data from the buoy and its proximal weather station (see Section 2),
(b) using hourly observations from a training data set, build a predictive time series regression model adjusted for nonlinear effects for buoy wave heights as a function of its history and functions of lagged wind variables from the buoy (as exogenous predictors) (see Section 3),
(c) build and train a model for buoy wave heights by replacing the buoy wind data in (b) with the Sikorsky station wind data after transforming it (see Section 2.2),
(d) compare the in-sample and out-of-sample predictive accuracy from the model in (c) to the model in (b), in order to corroborate the use of transformed wind data from the Sikorsky coastal station (see Section 3.5.1 and Section 3.5.2),
(e) compute ensemble hindcasts of buoy wave heights for the past several decades based on the model we train in (c) using transformed Sikorsky station wind data from 2005–2013 (see Section 4), and
(f) conduct extreme value analysis and estimate the return values for wave heights using (i) the ensemble hindcasted wave heights for 1974–2004 obtained from (e), and (ii) the observed buoy wave heights from 2005–2013 (see Section 5).
The flowchart in Figure 2 summarizes our methodology.

2. Data Description

Hourly oceanographic data from buoy # 44039 are available for the years 2004–2013 on these variables: wave height (H, in meters), wind speed (u, in m/s), and wind direction (W, from 0 to 360 degrees in increments of 10). Since the sensors do not record wind speeds lower than 0.25 m/s, such low speeds are recorded as 0.25 m/s (censoring). We use the wind direction at each sample time to estimate the fetch, i.e., the distance from the wave observation location to the nearest land in the up-wind direction. Since wind direction is reported in 10-degree increments, 36 fetch lengths are required.
Hourly wind speed and wind direction data are also available since 1974 at Sikorsky Memorial Airport, Bridgeport CT, which is approximately 45 km to the west of the buoy. A simple empirical analysis of data from 2005 to 2013 reveals the relationship between observations from the buoy and from the coastal station. A Pearson correlation coefficient of 0.6218 in wind speed values, and a Kendall rank correlation coefficient of 0.6024 in the wind direction values indicate a positive correlation between the wind patterns from the buoy and coastal station. The Sikorsky station consistently records lower wind speeds than the buoy since the boundary shear stress over land is larger.
Data for each year is grouped into a windy season which includes the months of November, December, January–March, and a calm season which includes April–October. For example, the 2007–2008 windy season consists of data from November 2007–December 2007 and January–March 2008, while the 2008 calm season includes data from April 2008–October 2008. The data must be preprocessed before building and training statistical models. The buoy and coastal station data have data gaps, which are addressed in Section 2.1. Section 2.2 describes how the coastal station wind data is matched to wind data from the buoy.

2.1. Missing Data Imputation

Missing data occurs when the buoy or the coastal station sensor fails to record any information at a scheduled report time, usually due to hardware failure. The run-length of missing observations ranges from 1 to 2000 data points for the buoy, and from 1 to 235 data points for the Sikorsky station.
As shown below, missing data on wind speed and direction are first imputed, and then used for wave height imputation. In each year, the imputation is made separately for the windy and calm seasons.

2.1.1. Wind Speed and Wind Direction Imputation

The steps below describe imputation of wind speed and direction from the buoy as well as the coastal station. The imputation strategy depends on the run-length of the missing observations, denoted by $M_l$.
  • Case 1: Run length $M_l \le 5$.
We impute the wind speed and wind direction separately via linear interpolation using the imputeTS package in R with the built-in na_interpolation function [12]. Since the values of wind direction range from 0 to 360 in increments of 10, the interpolated value is rounded to the nearest multiple of 10, i.e.,
$$\text{Imputed}(W) = \text{round}(\text{na\_interpolation}(W), 10).$$
For wind speed, we first linearly interpolate a missing value using the R package, and include randomness in the imputation by adding a term $\nu \times \text{sd}(u)$, where $\nu \sim N(0, 1)$ and $\text{sd}(u)$ denotes the standard deviation of the non-missing wind speed values. To be consistent with observed wind speeds from the sensors, we censor imputed values smaller than 0.25 m/s and record them as 0.25 m/s:
$$\text{Imputed}(u) = \max\{\text{na\_interpolation}(u) + \nu \times \text{sd}(u),\ 0.25\}.$$
  • Case 2: Run length $M_l > 5$.
Here, we jointly impute missing values in wind speed and wind direction in order to preserve the relationship between them. To do this, we replace each missing pair of wind speed and wind direction with a pair sampled randomly (with replacement) from the observed data.
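The two imputation cases can be sketched in a few lines of Python. This is an illustrative stand-in, not the authors' R code: the function names `find_gaps` and `impute_wind` are hypothetical, wind speed and direction are assumed to be missing at the same time points, and gaps that touch the start or end of the series are assumed to fall under Case 2.

```python
import random
import statistics

MIN_SPEED = 0.25  # sensor floor in m/s, as stated in the text


def find_gaps(values):
    """Return (start, end) index pairs (end exclusive) for runs of None."""
    gaps, start = [], None
    for i, v in enumerate(values):
        if v is None and start is None:
            start = i
        elif v is not None and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(values)))
    return gaps


def impute_wind(speed, direction, rng, max_short=5):
    """Impute None entries in paired wind speed and direction lists.

    Assumption: speed and direction are missing at the same indices.
    """
    speed, direction = list(speed), list(direction)
    observed = [(u, w) for u, w in zip(speed, direction) if u is not None]
    sd_u = statistics.stdev(u for u, _ in observed)
    for start, end in find_gaps(speed):
        interior = start > 0 and end < len(speed)
        if end - start <= max_short and interior:
            # Case 1: linear interpolation between the bracketing observations.
            for i in range(start, end):
                frac = (i - start + 1) / (end - start + 1)
                u = speed[start - 1] + frac * (speed[end] - speed[start - 1])
                w = direction[start - 1] + frac * (direction[end] - direction[start - 1])
                # Add N(0, 1) * sd(u) noise, then censor at the sensor floor.
                speed[i] = max(u + rng.gauss(0, 1) * sd_u, MIN_SPEED)
                # Round the direction to the nearest multiple of 10 degrees.
                direction[i] = 10 * round(w / 10)
        else:
            # Case 2: resample observed (speed, direction) pairs jointly.
            for i in range(start, end):
                speed[i], direction[i] = rng.choice(observed)
    return speed, direction
```

Passing a seeded generator, e.g. `impute_wind(speed, direction, random.Random(0))`, makes the random part of the imputation reproducible.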

2.1.2. Wave Heights

Since wave heights are strongly affected by wind speed, information on wind speeds is used to impute missing wave heights. Wind speed values are grouped into four ordered bins based on their quartiles. The missing wave heights corresponding to wind speeds in each bin are imputed by the median of the observed wave heights in that bin, with some randomness injected through the term $c \times \nu$, where $\nu \sim N(0, 1)$ and $c$ is a scale factor (a damping factor set to avoid over-volatility) computed from the residual standard error obtained by fitting a linear model to the non-imputed data during the initial empirical data analysis. The value of $c$ is calculated as 0.01 for calm months and 0.03 for windy months. That is,
$$\text{Imputed}(H) = \text{median}(H_{\text{bin}}) + c \times \nu.$$
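A minimal Python sketch of this bin-median imputation follows. The function name `impute_wave_heights` is hypothetical, and the sketch assumes that the quartiles are computed from all wind speeds in the season (observed or imputed) and that every bin contains at least one observed wave height.

```python
import random
import statistics


def impute_wave_heights(height, speed, c, rng):
    """Fill None wave heights with the bin median plus c * N(0, 1) noise."""
    # Quartile cut points of wind speed define four ordered bins.
    q1, q2, q3 = statistics.quantiles(speed, n=4)

    def bin_of(u):
        return (u > q1) + (u > q2) + (u > q3)  # 0, 1, 2 or 3

    # Median of the observed wave heights within each wind-speed bin.
    by_bin = {b: [] for b in range(4)}
    for h, u in zip(height, speed):
        if h is not None:
            by_bin[bin_of(u)].append(h)
    medians = {b: statistics.median(v) for b, v in by_bin.items() if v}

    return [h if h is not None else medians[bin_of(u)] + c * rng.gauss(0, 1)
            for h, u in zip(height, speed)]
```

With the values quoted in the text, one would call this with `c=0.01` for a calm season and `c=0.03` for a windy season.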

2.1.3. Summary of Imputed Data

The pre-processed data consists of 8760 hourly observations for a non-leap year, where 3624 of them belong to a windy season and 5136 belong to a calm season. For a leap year, there are 8784 hourly observations, with 3648 coming from a windy season and 5136 from a calm season. The data for each hour consists of either observed or imputed wave heights, wind speed, wind direction, and fetch.
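The hourly counts quoted above follow directly from the season definition in Section 2. As a quick check, the sketch below recomputes them with Python's standard `calendar` module; the function name `season_hours` is hypothetical, and the windy season is indexed by the year in which it ends.

```python
import calendar

WINDY_MONTHS = (11, 12, 1, 2, 3)       # November to March
CALM_MONTHS = (4, 5, 6, 7, 8, 9, 10)   # April to October


def season_hours(year):
    """Hourly counts (windy, calm) for the windy season ending in `year`.

    The windy season spans November and December of `year - 1` plus
    January to March of `year`; the calm season is April to October of `year`.
    """
    def hours(y, m):
        return 24 * calendar.monthrange(y, m)[1]

    windy = sum(hours(year - 1 if m >= 11 else year, m) for m in WINDY_MONTHS)
    calm = sum(hours(year, m) for m in CALM_MONTHS)
    return windy, calm


print(season_hours(2007))  # (3624, 5136): non-leap year, 8760 hours in total
print(season_hours(2008))  # (3648, 5136): leap year, 8784 hours in total
```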

2.2. Transforming Coastal Station Wind Data

Our exploratory data analysis shows a positive association between the wind speed and direction from the buoy and the coastal station, although the latter consistently records lower wind speeds. To adjust for this level difference, we transform the Sikorsky station wind speed data to match the wind speed data from the buoy in 2005–2013, using the steps below. Let $u_t$, $u_t^c$, and $W_t^c$ respectively denote the buoy wind speed, the Sikorsky station wind speed, and the Sikorsky station wind direction at time $t$.
  • Step 1. Using information from the storm event database [13], we define a binary variable $\gamma_t$ to indicate the occurrence of an extreme event along the Long Island Sound within a 12-h period prior to time $t$. That is,
    $$\gamma_t = \begin{cases} 1 & \text{if } t \text{ corresponds to an extreme event} \\ 0 & \text{otherwise.} \end{cases}$$
  • Step 2. We divide the data into three groups based on the wind direction $W_t^c$ recorded at the Sikorsky station: (i) East to West, with a corresponding range of $60 \le W_t^c \le 180$ ($g = 1$ with 3813 observations), (ii) West to East, with $240 \le W_t^c \le 300$ ($g = 2$ with 10,262 observations), and (iii) all other directions ($g = 3$ with 18,580 observations).
  • Step 3. Within each group $g = 1, \ldots, 3$, we fit a linear regression model with $u_t$ as response and $u_t^c$ and $\gamma_t$ as predictors, i.e., $u_t = a_{0g} + a_{1g} u_t^c + a_{2g} \gamma_t$. The estimates of the coefficients from the three groups are shown in Table 1. We use these to estimate the transformed Sikorsky station wind speed as the fitted values from the regression:
    $$\hat{u}_t^c = \hat{a}_{0g} + \hat{a}_{1g} u_t^c + \hat{a}_{2g} \gamma_t.$$
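Once the group-wise coefficients are available, applying the transformation amounts to a lookup and one linear expression. The Python sketch below illustrates this; the function names are hypothetical, the group boundaries are assumed to be inclusive, and the coefficient values are placeholders chosen only for illustration (they are not the Table 1 estimates).

```python
def direction_group(w_c):
    """Map a Sikorsky wind direction in degrees to the group index g."""
    if 60 <= w_c <= 180:
        return 1  # East to West
    if 240 <= w_c <= 300:
        return 2  # West to East
    return 3      # all other directions


def transform_speed(u_c, w_c, gamma, coef):
    """Fitted buoy-scale wind speed: a0g + a1g * u_c + a2g * gamma."""
    a0, a1, a2 = coef[direction_group(w_c)]
    return a0 + a1 * u_c + a2 * gamma


# Placeholder coefficients for illustration only; NOT the Table 1 estimates.
COEF = {1: (1.0, 1.2, 0.5), 2: (0.8, 1.3, 0.7), 3: (0.9, 1.1, 0.6)}
print(transform_speed(u_c=4.0, w_c=90, gamma=0, coef=COEF))
```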

3. Threshold Regression GARCH Model for Wave Heights

We build a statistical model for wave heights in a given season (windy, or calm) in any given year. In a windy season, the wind speeds and wave heights are generally higher than in a calm season. To reflect this, we build separate models for each season. Starting from a linear regression motivated by an approximation of Goda’s simplified Sverdrup-Munk-Bretschneider (SMB) model [14], we build a rich model that incorporates lagged linear and nonlinear relations between wave heights and wind.

3.1. Goda’s Simplified SMB Model

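Ocean wave dynamics are closely linked to wind behavior. Let $H$ denote wave height (m), $u$ be the wind speed (m/s), $F$ be the fetch length (m) for a given wind direction, and $g$ be the gravitational acceleration (9.807 m/s²). Goda’s consolidated method is described by [4]
$$H = 0.3\,\frac{u^2}{g}\left[1 - \left(1 + 0.004\sqrt{\frac{gF}{u^2}}\right)^{-2}\right]. \tag{1}$$
The equation in (1) is highly non-linear. We construct a simplifying linear approximation as follows:
$$H = 0.3\,\frac{u^2}{g}\left[1 - \left(1 + 0.004\sqrt{\frac{gF}{u^2}}\right)^{-2}\right] = a u^2\left[1 - \left(1 + b\sqrt{\frac{F}{u^2}}\right)^{-2}\right] = a u^2\left[1 - (1 + x)^{-2}\right], \tag{2}$$
where $a = 0.3/g$, $b = 0.004\sqrt{g}$, and $x = b\sqrt{F/u^2}$. Applying the Maclaurin series expansion $f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!} x^k$, we get
$$H \approx a u^2\left[1 - \left(1 - 2x + 3x^2 - 4x^3\right)\right] \approx 2 a u^2\, b\sqrt{\frac{F}{u^2}} - 3 a u^2 \left(b\sqrt{\frac{F}{u^2}}\right)^2 + 4 a u^2 \left(b\sqrt{\frac{F}{u^2}}\right)^3 \approx 2ab\,u\sqrt{F} - 3ab^2 F + 4ab^3\,\frac{F^{3/2}}{u}. \tag{3}$$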
Now, we can explore an approximate linear relationship between H and functions of wind speed u and fetch F.
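To get a feel for how closely the truncated expansion tracks the original formula, the following Python sketch evaluates both side by side. The function names are hypothetical and the 20 km fetch is an arbitrary illustrative value, not a fetch length from the study.

```python
import math

G = 9.807  # gravitational acceleration in m/s^2


def goda_height(u, fetch):
    """Wave height (m) from Goda's simplified SMB formula (1)."""
    x = 0.004 * math.sqrt(G * fetch) / u
    return 0.3 * u**2 / G * (1.0 - (1.0 + x) ** -2)


def approx_height(u, fetch):
    """Third-order Maclaurin approximation (3) in terms of u and F."""
    a, b = 0.3 / G, 0.004 * math.sqrt(G)
    return (2 * a * b * u * math.sqrt(fetch)
            - 3 * a * b**2 * fetch
            + 4 * a * b**3 * fetch**1.5 / u)


# Illustrative fetch of 20 km (an assumption) at three wind speeds.
for u in (5.0, 10.0, 15.0):
    print(u, round(goda_height(u, 20_000.0), 3), round(approx_height(u, 20_000.0), 3))
```

Because the expansion is in powers of $x = b\sqrt{F}/u$, the truncation error grows as $x$ grows, i.e., for low wind speeds or long fetches. In the regression model this matters less than it might seem, since the coefficients of the resulting terms are estimated from data rather than fixed at their theoretical values.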

3.2. Threshold Regression Model for Wave Heights

We use the approximation in (3) as a starting point to construct a suitable regression model for wave heights as a function of lagged wave heights, lagged exogenous predictors, and their interactions with lagged wave heights, as well as an additional threshold effect to accommodate non-linearity. Let $H_t$, $u_t$, and $F_t$ denote the hourly observations on wave height, wind speed, and fetch, respectively. The model includes different components that are discussed below.
  • Lagged exogenous predictors
Correlation analysis and wave physics suggest that the square of lagged wind speed, $u_{t-r}^2$, should be included as a distinct, stand-alone term in a regression model for wave height. Additionally, we consider three functions of wind speed and fetch that emerge from the SMB approximation in (3), i.e.,
$$Z_{2,(t-r)} = u_{t-r}\sqrt{F_{t-r}}, \quad Z_{3,(t-r)} = F_{t-r}, \quad \text{and} \quad Z_{4,(t-r)} = \frac{F_{t-r}^{3/2}}{u_{t-r}}.$$
Empirical evidence based on cross-correlation function (CCF) plots between the exogenous predictors and wave heights shows that lags $r \le 6$ are useful for modeling. We include as predictors $u_{t-r}^2$, $Z_{2,(t-r)}$, $Z_{3,(t-r)}$ and $Z_{4,(t-r)}$, where $r = 1, \ldots, 6$.
  • Lagged wave heights and interactions
Based on empirical evidence from the ACF plots of wave heights, we include lagged wave heights for $r = 1, \ldots, 6$, i.e., $H_{t-1}, \ldots, H_{t-6}$, as predictors (see (6)). We also include two-way interactions between the lagged exogenous predictors and lagged wave heights, i.e., interactions of $H_{t-h}$ with $u_{t-\ell}^2$ and $Z_{i,t-k}$, where $i = 2, 3, 4$ and $h, \ell, k = 1, \ldots, 6$.
  • Predictors to capture nonlinear effects
Wave height behavior may be considerably different at low or high wind speeds. In order to capture the behavior of wave heights at the low (censored) wind speed of 0.25 m/s, we include an indicator (segmenting) variable as a predictor:
$$I_u = \begin{cases} 1 & \text{if } u_{t-1} = 0.25 \\ 0 & \text{otherwise.} \end{cases}$$
We also conjecture that the behavior of $H_t$ may be different when $u_{t-1}^2 > e$, where $e$ is an unknown threshold parameter. Since a threshold parameter can capture a nonlinear relationship between a response and a predictor, we include the thresholding effect of $u_{t-1}^2$ through the predictor
$$(u_{t-1}^2 - e)_+ = \begin{cases} u_{t-1}^2 - e & \text{if } u_{t-1}^2 > e \\ 0 & \text{otherwise.} \end{cases}$$
  • Nonlinear threshold regression model
Incorporating all the above effects, we write the general form of the threshold regression model for H t as
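$$H_t = \beta_0 + \sum_{h=1}^{H} \beta_h H_{t-h} + \sum_{\ell=2}^{L} \eta_\ell u_{t-\ell}^2 + \sum_{i=2}^{4}\sum_{k=1}^{K} \delta_{ik} Z_{i,t-k} + \sum_{h=1}^{H}\sum_{\ell=1}^{L} \gamma_{h\ell} H_{t-h} u_{t-\ell}^2 + \sum_{i=2}^{4}\sum_{h=1}^{H}\sum_{k=1}^{K} \psi_{i,hk} H_{t-h} Z_{i,t-k} + \alpha_1 I_u + \alpha_2 (u_{t-1}^2 - e)_+, \tag{6}$$
where $H$, $L$, and $K$ are set to six based on the empirical analysis.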
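The building blocks of the predictor set are simple transformations of lagged wind speed and fetch. A minimal Python sketch, with hypothetical function names, is given below; assembling the full lagged design matrix is omitted.

```python
import math


def smb_terms(u, fetch):
    """Z2, Z3, Z4 from the SMB approximation for one lagged observation."""
    return u * math.sqrt(fetch), fetch, fetch**1.5 / u


def low_wind_indicator(u_lag1, floor=0.25):
    """I_u: 1 when the lag-1 wind speed sits at the 0.25 m/s censoring floor."""
    return 1 if u_lag1 == floor else 0


def hinge(u_lag1, e):
    """(u_{t-1}^2 - e)_+ : excess of squared lag-1 wind speed over threshold e."""
    return max(u_lag1**2 - e, 0.0)
```

Since censored wind speeds are recorded as exactly 0.25 m/s and wind speed never falls below that floor, division by $u$ in `smb_terms` is always well defined.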

3.3. Model Fitting Using Buoy Wind and Wave Heights Data

We have a large basket of predictors, including lagged wave heights, lagged exogenous predictors, and their interactions, as well as the segmenting and thresholding predictors. To avoid issues due to multicollinearity, we first fit the linear portion of the model (ignoring the last two terms on the right side of (6)), and retain predictors whose variance inflation factors (VIFs) do not exceed 30. Since VIF $= 1/(1 - R^2)$, a VIF of 30 corresponds to a coefficient of determination $R^2$ of 0.97 in a linear regression of the predictor in question on all other predictors, which assesses its collinearity with them. While cutoff VIF values of 20 ($R^2 = 0.95$) or 10 ($R^2 = 0.90$) have also been suggested in the literature, by using a VIF threshold of 30, our model incorporates all predictors derived from the SMB approximation while accounting for multicollinearity. We then fit the model in (6) with the retained predictors, and the segmenting and thresholding effects. We use the R package chngpt, which employs an exact maximum likelihood estimation approach [15]. That is, we choose a grid of candidate change points that uniformly span the empirical distribution of the quantiles of the predictor we threshold (here $u_{t-1}^2$), and estimate the change point $\hat{e}$. We illustrate our model fitting for the windy and calm seasons in 2007–2008. The code and results for other years are available at the GitHub link https://github.com/NamithaVionaPais (accessed on 5 January 2023).
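The idea of estimating a change point over a grid of candidates can be illustrated with a deliberately simplified Python sketch. It is not the chngpt implementation: it uses a single predictor $x$ with a linear term and a hinge term, profiles the residual sum of squares (which is equivalent to Gaussian maximum likelihood) over the candidates, and solves the small least-squares problems through the normal equations. The function names `fit_ols` and `fit_hinge` are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    return [A[i][p] / A[i][i] for i in range(p)]


def fit_hinge(x, y, candidates):
    """Return (e_hat, beta) minimizing the RSS of y ~ 1 + x + (x - e)_+."""
    best = None
    for e in candidates:
        X = [[1.0, xi, max(xi - e, 0.0)] for xi in x]
        beta = fit_ols(X, y)
        rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                  for r, yi in zip(X, y))
        if best is None or rss < best[0]:
            best = (rss, e, beta)
    return best[1], best[2]


# Synthetic example: the slope drops by 0.3 once x exceeds 12.
x = [float(i) for i in range(21)]
y = [1.0 + 0.5 * xi - 0.3 * max(xi - 12.0, 0.0) for xi in x]
e_hat, beta = fit_hinge(x, y, candidates=[4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
print(e_hat, [round(b, 3) for b in beta])
```

Candidates must lie strictly inside the range of $x$; otherwise the hinge column is identically zero or collinear with $x$ and the normal equations become singular.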
  • Fitted threshold regression model for the windy season in 2007–2008
We show results for data corresponding to the windy season which includes November and December 2007, and January, February, and March 2008. The fitted threshold regression model for wave heights is
$$\hat{H}_t = 0.0488 + 0.5561 H_{t-1} + 0.2238 H_{t-2} + 0.0147 H_{t-3} + 0.0254 H_{t-4} + 0.0159 u_{t-1}^2 - 0.0035 u_{t-2}^2 - 0.0041 u_{t-3}^2 - 0.0021 u_{t-4}^2 - 0.0017 u_{t-5}^2 - 0.0003 (H_{t-1} \times u_{t-1}^2) + 0.0005 (H_{t-3} \times u_{t-3}^2) + 0.1568 I_u - 0.0031 (u_{t-1}^2 - 127.69)_+. \tag{7}$$
The threshold parameter $e$ corresponding to $u_{t-1}^2$ is estimated to be 127.69.
  • Fitted threshold regression model for the calm season in 2008
The calm season includes data from April-October 2008, for which the fitted threshold regression model is
$$\hat{H}_t = 0.0926 + 0.3330 H_{t-1} + 0.1460 H_{t-2} + 0.0118 H_{t-3} + 0.0322 H_{t-4} + 0.0172 u_{t-1}^2 - 0.0033 u_{t-2}^2 - 0.0014 u_{t-3}^2 - 0.0002 u_{t-4}^2 - 0.0001 u_{t-5}^2 + 0.0014 (H_{t-1} \times u_{t-1}^2) + 0.0001 (H_{t-3} \times u_{t-3}^2) - 0.0038 I_u - 0.0120 (u_{t-1}^2 - 67.24)_+. \tag{8}$$
Here, the threshold parameter $e$ is estimated as 67.24. The change point for $u_{t-1}^2$ in the windy season is almost double the change point in the calm season (127.69 versus 67.24, corresponding to wind speeds of 11.3 m/s and 8.2 m/s, respectively). This aligns with physical theory, since squared wind speeds lie on a larger scale in the windy season than in the calm season. The threshold regression model has a good in-sample fit and is consistent with the physical theory expressed in the approximation of the SMB equation.

3.4. GARCH Model to Handle Residual Nonlinearity and Volatility

Residual and squared residual diagnostics from the fitted threshold regression model help us to assess whether we have adequately captured all linear and nonlinear associations between the response and predictors. Let $R_t$ denote the residuals from a fitted model (6). Diagnostics based on the autocorrelation function (ACF) and partial autocorrelation function (PACF) of $R_t$ [16] confirm that linear temporal relationships are adequately explained. However, the ACF and PACF plots of the squared residuals $R_t^2$ indicate that some nonlinear dependence remains and has not been adequately explained by the threshold model.
The class of generalized autoregressive conditionally heteroscedastic (GARCH) models [17] is useful for fitting nonlinear time series which exhibit conditional heteroscedasticity. GARCH models belong to a class of univariate time series models that enable us to model volatility (conditional standard deviation) and study non-linear dependence over time. These models are often used in conjunction with linear regression and linear time series models to capture temporal dependence of different types.
We incorporate the nonlinear dependence into the model for $H_t$ by fitting a suitable GARCH model to the residuals $R_t$, and then adding these fits to the fits from the threshold regression model (6). After a thorough investigation of different error distributions and model orders, we select a GARCH(1,2) model for fitting the residuals from any season in any year:
$$R_t = \sigma_t \varepsilon_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 R_{t-1}^2 + \beta_1 \sigma_{t-1}^2 + \beta_2 \sigma_{t-2}^2, \tag{9}$$
where $\sigma_t^2$ is the conditional variance of $R_t$ given the history, $\varepsilon_t$ are i.i.d. $N(0, 1)$, and $\alpha_0$, $\alpha_1$, $\beta_1$ and $\beta_2$ are unknown model parameters, which are estimated by the method of conditional maximum likelihood using the R package fGarch [18].
The estimated parameters from the GARCH(1,2) fit to the residuals for the 2007–2008 windy season and 2008 calm season are shown in Table 2. Let
$$\hat{R}_t = \text{sign}(R_t) \times \hat{\sigma}_t \tag{10}$$
denote the fits from the GARCH model in (9); the sign function is given by
$$\text{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0. \end{cases} \tag{11}$$
We use the fits from (10) to obtain the final estimates for wave heights as described in Section 3.5.
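Given estimated parameters, the conditional variance recursion in (9) and the signed fits in (10) are straightforward to compute. The Python sketch below shows one way to do so; the function names are hypothetical, the start-up values for the first two conditional variances (the mean squared residual) are an assumption, and parameter estimation itself (done with fGarch in the paper) is not reproduced.

```python
import math


def garch12_sigma2(resid, alpha0, alpha1, beta1, beta2):
    """Conditional variances from the GARCH(1,2) recursion in (9)."""
    # Start-up values: the mean squared residual (an assumption of this sketch).
    start = sum(r * r for r in resid) / len(resid)
    sigma2 = [start, start]
    for t in range(2, len(resid)):
        sigma2.append(alpha0 + alpha1 * resid[t - 1] ** 2
                      + beta1 * sigma2[t - 1] + beta2 * sigma2[t - 2])
    return sigma2


def garch_fit(resid, sigma2):
    """R_hat_t = sign(R_t) * sigma_hat_t, as in (10)."""
    return [math.copysign(math.sqrt(s2), r) if r != 0 else 0.0
            for r, s2 in zip(resid, sigma2)]
```

The signed fit carries the direction of each residual but replaces its magnitude with the estimated conditional standard deviation, so periods of high volatility contribute larger corrections to the threshold regression fits.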

3.5. Final Fitted Threshold Regression GARCH Model

The final model for wave heights consists of fitting (6) followed by (9) and obtaining parameter estimates and in-sample fits. We present the results for $H_t$ in feet.

3.5.1. In-Sample Fits from the Final Model

We fit the model to data from the 2007–2008 windy season and the 2008 calm season, and assess the fits for the same seasons. Let $\hat{H}_t^{(1)}$ be the fits from the threshold regression model in (6). The fitted wave heights from the final threshold regression GARCH model are given by
$$\hat{H}_t = \hat{H}_t^{(1)} + \hat{R}_t. \tag{12}$$
Figure 3 and Figure 4 respectively show the in-sample fits along with the observed wave heights for the 2007–2008 windy season and 2008 calm season. We observe that the in-sample fits have a remarkably close match with the observed wave heights for all months in both seasons. The root mean square error (RMSE) based on the in-sample model fits is 0.2556 for the 2007–2008 windy season and 0.1849 for the 2008 calm season (see the row corresponding to the Year 2008 in Table 3).
To verify the robustness of the final threshold regression GARCH model, we fit the model to data from windy and calm seasons in each of the years from 2005 to 2013 using wave height and wind data from the buoy. The RMSE values based on in-sample fits for each season are shown in Table 3. The low values of RMSE indicate that our model is able to accurately predict the wave heights for each year.
Another useful check is to use the transformed coastal wind data (see Section 2.2) to fit the buoy wave heights for windy and calm seasons in 2005–2013. That is, we fit (6) and (9) and obtain $\hat{H}_t$ using Sikorsky coastal station wind data as predictors. The RMSE values based on the in-sample fits, using data on exogenous predictors from the Sikorsky coastal station for windy and calm seasons in each of the years from 2005 to 2013, are shown in Table 3. The small RMSE values indicate that our proposed modeling approach can adequately predict wave heights, even when we use the wind data from a nearby coastal station rather than from the buoy.

3.5.2. Ensemble Out-of-Sample Hindcasting Fits from the Final Model

Our main goal is to use our final model to predict wave heights for years when they are not observed (i.e., prior to 2005), by using wind data from the Sikorsky station as predictors. We refer to such prediction as back forecasting, or hindcasting.
Before we do this, it is essential to verify the out-of-sample predictive accuracy from our threshold regression GARCH model that uses the coastal wind data as predictors. It is also important to provide a framework for constructing ensemble hindcasts. We assess out-of-sample predictions of wave heights for the 2007–2008 windy season and the 2008 calm season, assuming that we do not observe these wave heights. To do this, we build the threshold regression GARCH model using data (i.e., wave heights from the buoy and wind data from the coastal station) from the 2008–2009 windy season until the 2013 calm season.
We describe the steps to get out-of-sample predictions of wave heights for the 2007–2008 windy season and the 2008 calm season, using the 2008–2009 windy season as training data. We refer to this training year as year $y = 1$.
  • Step 1. We train the threshold regression model in (6) using data from the 2008–2009 windy season and estimate the model coefficients. To obtain out-of-sample predictions for the 2007–2008 windy season, we use these model coefficients with the Sikorsky station wind data from the 2007–2008 windy season to get
    $$\hat{H}_t^{(1)} = \hat{\beta}_0 + \sum_{h=1}^{H} \hat{\beta}_h \hat{H}_{t-h}^{(1)} + \sum_{\ell=2}^{L} \hat{\eta}_\ell u_{t-\ell}^2 + \sum_{i=2}^{4}\sum_{k=1}^{K} \hat{\delta}_{ik} Z_{i,t-k} + \sum_{h=1}^{H}\sum_{\ell=1}^{L} \hat{\gamma}_{h\ell} \hat{H}_{t-h}^{(1)} u_{t-\ell}^2 + \sum_{i=2}^{4}\sum_{h=1}^{H}\sum_{k=1}^{K} \hat{\psi}_{i,hk} \hat{H}_{t-h}^{(1)} Z_{i,t-k} + \hat{\alpha}_1 I_u + \hat{\alpha}_2 (u_{t-1}^2 - \hat{e})_+, \tag{13}$$
    setting $\hat{H}_h^{(1)} = 0$ for $h = 1, 2, \ldots, H$ (the initial $H$ predictions).
  • Step 2. We fit the GARCH model in (9) to the residuals from the model fit to the 2008–2009 windy season data in Step 1, and construct the fitted values $\hat{R}_t$.
  • Step 3. We construct formulas to add $\hat{R}_t$ to $\hat{H}_t^{(1)}$, differentiating between leap years and non-leap years to correctly maintain the correspondence between the days. In any non-leap year, $\hat{R}_t$ would have 24 fewer observations (corresponding to hourly data on February 29) than in a leap year. To address this, we reset the $\hat{R}_t$ values corresponding to February 29 to zero. Then,
    $$\hat{H}_t = \hat{H}_t^{(1)} + \hat{R}_t. \tag{14}$$
  • Step 4. We repeat the steps for the calm season (no adjustment for leap versus non-leap year is necessary).
  • Step 5. Repeat Steps 1–3 using the windy season and calm season data from November 2008 until October 2013. Let $\hat{H}_{t,y}$ denote the out-of-sample predictions from year $y$, $y = 1, \ldots, m$, where $m = 5$. We construct the ensemble prediction for 2008 as
    $$\hat{H}_{t,e} = \frac{1}{m} \sum_{y=1}^{m} \hat{H}_{t,y}. \tag{15}$$
  • Step 6. We also construct a 10-standard deviation (10-sd) prediction interval around each $\hat{H}_{t,y}$ as
    $$\hat{H}_{t,y} \pm 10 \times \text{est.s.e.}(\hat{H}_{t,y}) = (\hat{L}_{t,y}, \hat{U}_{t,y}), \ \text{say}, \tag{16}$$
    where $\text{est.s.e.}(\hat{H}_{t,y})$ is computed based on the values of the predictors and the variance-covariance matrix of the estimated model coefficients [19]. The ensemble prediction band $(\hat{L}_{t,e}, \hat{U}_{t,e})$ is obtained by averaging over the $m = 5$ prediction intervals.
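The ensemble averaging in Steps 5 and 6, together with the RMSE used to assess the hindcasts, can be sketched as follows in Python. The function names are hypothetical; the inputs are assumed to be $m$ equally long lists of per-training-year predictions and their estimated standard errors, which would come from the fitted models.

```python
import math


def ensemble_mean(series_by_year):
    """Pointwise average of m equally long series, as in (15)."""
    m = len(series_by_year)
    return [sum(vals) / m for vals in zip(*series_by_year)]


def ensemble_band(preds_by_year, se_by_year, k=10):
    """Average the m intervals pred +/- k * s.e. into one band (lower, upper)."""
    lower = [[p - k * s for p, s in zip(ps, ss)]
             for ps, ss in zip(preds_by_year, se_by_year)]
    upper = [[p + k * s for p, s in zip(ps, ss)]
             for ps, ss in zip(preds_by_year, se_by_year)]
    return ensemble_mean(lower), ensemble_mean(upper)


def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))
```

Averaging the lower and upper limits separately yields a band that is centered on the ensemble prediction and whose half-width is 10 times the average of the $m$ estimated standard errors.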
Figure 5 and Figure 6 respectively show the ensemble out-of-sample wave height predictions (in red) for the 2007–2008 windy season and 2008 calm season using the model trained on the years 2009–2013. At most time points, these ensemble hindcasts are close to the observed wave heights (in black). Even at times when they are lower than high observed wave heights, the latter fall within the 10-sd prediction interval.
For the 2007–2008 windy season and the 2008 calm season, Table 4 shows the RMSE values obtained by comparing the out-of-sample predictions $\hat{H}_{t,y}$ with the observed values $H_{t,y}$ for each training year $y$. We also compute the RMSE values based on the ensemble hindcast $\hat{H}_{t,e}$ in (15). The reasonably small RMSE values provide convincing evidence that our hindcasting approach is useful.

4. Hindcasting Several Decades of Wave Heights

Leveraging results from Section 3, we hindcast unobserved wave heights prior to 2005 using transformed wind data from the Sikorsky station as predictors. Specifically, we obtain ensemble hindcasts of wave heights and the 10-standard deviation prediction interval estimates, using the approach in Section 3.5.2. Here m = 9 since we use data from 2005–2013 to train the model.
We examine the validity of these hindcasts using boxplots of mean and maximum wave heights for each month; see Figure 7. The figure on the left shows boxplots for each month based on the mean wave heights for that month between 1974 and 2004. For example, the boxplot for January is constructed from the mean values for January from 31 years. The figure on the right shows similar boxplots for each month based on the maximum wave heights. In addition, we show as red dots the observed mean wave heights (Figure 7 (left)) and the observed maximum wave heights (Figure 7 (right)) for each month for the years 2005–2013. These plots show that the mean and maximum wave heights across each month over the years 1974–2004 are relatively consistent with the observed mean and maximum heights for the years 2005–2013.

5. Extreme Value Analysis of Wave Heights

An m-year return value of wave heights denotes a value exceeded on average once every m years, and can be used to design safety control measures and appropriate coastal structures [20]. The daily recorded maxima (over a 24-h period) are usually used to conduct extreme value analysis and estimate the return values using an approach such as peaks over threshold (POT); see [1,21].
While long time series of wave heights would allow us to obtain accurate return value estimates, records spanning several decades are rarely available in practice. To estimate the return values, we use the point hindcasts of the wave heights $\hat{H}_{t,e}$ and the prediction intervals $(\hat{L}_{t,e}, \hat{U}_{t,e})$ from 1974 to 2004, along with the observed wave heights $H_t$ from 2005 to 2013.
The POT approach consists of fitting a generalized Pareto (GP) distribution to the tail of the data consisting of values that exceed a given threshold $u$ and then estimating the return values based on the rate of occurrence of the exceedances over the threshold. The cumulative distribution function (c.d.f.) of the GP distribution is given by
$$F_u(y) = \begin{cases} 1 - \left(1 + \dfrac{\xi y}{\sigma_u}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - \exp\left(-\dfrac{y}{\sigma_u}\right), & \xi = 0, \end{cases} \tag{17}$$
where $y > 0$, $\sigma_u > 0$ and $\xi \in \mathbb{R}$ (the real line).
We analyze the 1974–2013 wave heights data using the POT approach. We use the R package POT [22] to fit the distribution in (17) to the daily maximum of wave heights with the threshold $u$ set to 5 ft. The package employs an exact maximum likelihood (ML) approach to estimate the return values. We use the observed wave heights from 2005–2013; for the years 1974–2004, we use three setups, i.e., (i) the hindcasted wave heights $\hat{H}_{t,e}$, (ii) the lower bounds $\hat{L}_{t,e}$, and (iii) the upper bounds $\hat{U}_{t,e}$, as defined in Section 3.5.2.
The maximum likelihood estimates of the GP scale and shape parameters for each of the three setups are shown in Table 5 (see rows 2, 3 and 4). In addition, we conduct extreme value analysis on the observed daily maximum wave heights $H_t$ (2005–2013) alone; the corresponding maximum likelihood estimates are shown in Table 5 (see row 1).
The return level plots from the POT estimation for each setup are shown in Figure 8. We use these plots to estimate the m-year return value, defined as the value that is expected to be equaled or exceeded on average once every m years (i.e., with probability $1/m$ in any given year), for $m = 10, 50, 100$. The m-yr return value $z_m$ is calculated as
$$
z_m =
\begin{cases}
u + \dfrac{\sigma_u}{\xi}\left[(\lambda_u m)^{\xi} - 1\right], & \xi \neq 0, \\[2ex]
u + \sigma_u \log(\lambda_u m), & \xi = 0.
\end{cases}
$$
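The return value formula translates into a few lines of code. In the Python sketch below, the function name `return_value` is hypothetical, and $\lambda_u$ is taken to be the average number of exceedances of the threshold per year, which is the usual POT convention and is assumed here rather than quoted from the text. The inputs in the example call are arbitrary and are not the fitted values from Table 5.

```python
import math


def return_value(m, u, sigma_u, xi, lambda_u):
    """m-year return value z_m for a GP tail above threshold u.

    lambda_u is assumed to be the mean number of exceedances of u per year.
    """
    if m <= 0 or lambda_u <= 0 or sigma_u <= 0:
        raise ValueError("m, lambda_u and sigma_u must be positive")
    x = lambda_u * m
    if xi == 0:
        # Exponential-tail limit of the formula.
        return u + sigma_u * math.log(x)
    return u + (sigma_u / xi) * (x ** xi - 1.0)


# Arbitrary illustrative inputs (assumptions, not estimates from the paper).
z_example = return_value(m=4, u=5.0, sigma_u=1.0, xi=0.5, lambda_u=1.0)
```

With a negative shape parameter the return values level off toward the finite upper endpoint $u - \sigma_u/\xi$ as m grows, whereas a positive shape makes them grow without bound, which is why the sign of $\xi$ matters so much for long return periods.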
The return value estimates for each setup are shown in Table 6. We observe that the return level estimates based only on the observed wave heights for the years 2005–2013 are considerably lower than the estimates obtained using the combined observed and hindcasted wave heights for the years 1974–2013 (specifically, those based on the hindcast estimates $\hat{H}_{t,e}$ and on the upper limits of the prediction intervals $\hat{U}_{t,e}$). Because the combined record spans 40 years rather than 9, the return values estimated from it draw on more information about rare events and should therefore be more reliable when designing offshore and coastal systems. These estimates allow practitioners to design ships and offshore and coastal structures by taking into consideration the most extreme wave conditions they might need to withstand during their lifetime.
It is useful to explore the possible impact of major weather cycles, such as those described by the Southern Oscillation Index (SOI) and the North Atlantic Oscillation (NAO), on the analysis and findings. To do this, we can first group the years corresponding to SOI $< 0$ (El Niño) and SOI $\geq 0$ (La Niña), or similarly, NAO $< 0$ and NAO $\geq 0$, and then compare the annual average wave height exceedances over a given threshold (say, 4, 5, or 6 ft) between the groups. Extreme value analysis can then be implemented separately for each of these groups to look for evidence of significant differences. Since the meteorology of our site (Long Island Sound) is typical of the northwest Atlantic, the extreme value analysis may not be sensitive to the weather cycles, but other areas may have substantial decadal-scale cycles that should be considered in the empirical hindcasting of wave conditions.
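The grouping step described above can be sketched in Python as follows. The function name, the dictionary inputs, and the toy index values are all hypothetical, and "annual average exceedances" is interpreted here as the mean number of threshold exceedances per year within each group, which is one reading of the text rather than a confirmed definition.

```python
from statistics import mean


def mean_exceedances_by_phase(index_by_year, exceed_count_by_year):
    """Compare mean annual exceedance counts between index < 0 and index >= 0.

    index_by_year: {year: annual climate index value (e.g., SOI or NAO)}.
    exceed_count_by_year: {year: number of exceedances of the threshold}.
    Years missing from either input are ignored.
    """
    negative, nonnegative = [], []
    for year, value in index_by_year.items():
        if year not in exceed_count_by_year:
            continue
        target = negative if value < 0 else nonnegative
        target.append(exceed_count_by_year[year])
    return {
        "negative": mean(negative) if negative else None,
        "nonnegative": mean(nonnegative) if nonnegative else None,
    }


# Toy inputs (assumed values, for illustration only).
toy_index = {2005: -1.2, 2006: 0.4, 2007: -0.3, 2008: 0.0}
toy_counts = {2005: 10, 2006: 6, 2007: 14, 2008: 8}
toy_groups = mean_exceedances_by_phase(toy_index, toy_counts)
```

A large gap between the two group means would suggest fitting the GP distribution separately within each group, as proposed above; a formal comparison would also need a measure of uncertainty for each mean.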

6. Discussion and Summary

This study presents m-yr return value estimates of wave heights ($m = 10, 50, 100$), which are helpful in designing offshore and coastal structures to achieve safety control. Such estimates are often difficult to obtain because sufficiently long wave records are unavailable. Therefore, we develop a predictive model that uses wind data from a proximal coastal station (Sikorsky) to predict wave heights near the buoy. As an initial preprocessing step, we set up a suitable imputation technique to obtain hourly wave and wind data. A threshold regression GARCH model for wave heights near the buoy is built for two different seasons, windy and calm, for each year using wind data near the buoy. Next, we investigate the prediction efficacy of this model (in-sample and out-of-sample) when the wind data from the buoy are replaced by the transformed wind data from the Sikorsky station. Once we establish the validity of our model, we use the available wind data from the Sikorsky station for a past period of over 30 years to hindcast wave heights near the buoy.
By treating these hindcasts as estimates of the unobserved past wave heights, we conduct extreme value analysis, using the POT approach on the daily maxima of the wave heights, to estimate the m-yr return values. Since these estimates are based on a longer historical record, they can be used to design better coastal protection structures. Our study aims to improve coastal flood risk assessments by synthesizing long records of wave data based on existing wind data and by thoroughly investigating the temporal dependence of wind-wave behavior.
An alternative useful future investigation is to model the hourly wave heights from 2005 to 2013 as a single long time series, and indicate differential effects of windy and calm seasons through dummy variables treated as additional predictors. While this approach would alleviate the need to distinguish between leap and non-leap years, it would require the inclusion of multiple thresholds corresponding to windy and calm seasons. There may be value to examining this issue further, but we think it would require a substantial additional effort and a modification of the approach.

Author Contributions

Conceptualization, N.V.P., N.R. and J.O.; methodology, N.V.P., N.R. and J.O.; formal analysis, N.V.P., N.R. and J.O.; investigation, N.V.P., N.R., E.S. and J.O.; resources, J.O.; data curation, N.V.P. and E.S.; writing—original draft preparation, N.V.P., N.R., E.S. and J.O.; writing—review and editing, N.V.P., N.R. and J.O.; visualization, N.V.P., N.R. and J.O.; supervision, N.R. and J.O.; project administration, N.R. and J.O.; funding acquisition, N.R. and J.O. All authors have read and agreed to the published version of the manuscript.

Funding

Funding for this project was provided by the Connecticut Institute for Resilience and Climate Adaptation (CIRCA) through their climate research seed grants program. In addition, O’Donnell was supported by the United States Department of Housing and Urban Development through the Community Block Grant National Disaster Recovery Program, as administered by the State of Connecticut, Department of Housing.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Wave and wind data from NOAA buoy # 44039 are available online from the National Data Buoy Center at https://www.ndbc.noaa.gov (accessed on 1 October 2022). The wind data from the Sikorsky station are available online from the National Centers for Environmental Information at https://www.ncei.noaa.gov (accessed on 1 October 2022). The storm event database is available online from the National Storm Event Database at https://www.ncdc.noaa.gov/stormevents/ (accessed on 1 October 2022). Data and analysis code are available on GitHub at https://github.com/NamithaVionaPais (accessed on 5 January 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Coles, S. An Introduction to Statistical Modeling of Extreme Values, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2001.
  2. Mathiesen, M.; Goda, Y.; Hawkes, P.J.; Mansard, E.; Martín, M.J.; Peltier, E.; Thompson, E.F.; Van Vledder, G. Recommended practice for extreme wave analysis. J. Hydraul. Res. 1994, 32, 803–814.
  3. US Army Corp of Engineers. Shore Protection Manual; Vol 1 P-652; CERC Department of the Army, U.S. Army Corps of Engineers: Washington, DC, USA, 1984.
  4. Goda, Y. Revisiting Wilson’s formulas for Simplified Wind-Wave Prediction. J. Waterw. Port Coast. Ocean Eng. 2003, 129, 93–95.
  5. Panchang, V.; Jeong, C.K.; Demirbilek, Z. Analyses of Extreme Wave Heights in the Gulf of Mexico for Offshore Engineering Applications. J. Offshore Mech. Arct. Eng. 2013, 135, 031104.
  6. US Army Corp of Engineers. North Atlantic Coast Comprehensive Study: Resilient Adaptation to Increasing Risk; Technical Report P-116; U.S. Army Corps of Engineers: Washington, DC, USA, 2015.
  7. Liu, C.; Onat, Y.; Jia, Y.; O’Donnell, J. Modeling nearshore dynamics of extreme storms in complex environments of Connecticut. Coast. Eng. 2021, 168, 103950.
  8. Gao, J.; Ma, X.; Dong, G.; Chen, H.; Liu, Q.; Zang, J. Investigation on the effects of Bragg reflection on harbor oscillations. Coast. Eng. 2021, 170, 103977.
  9. Gao, J.; Zhou, X.; Zhou, L.; Zang, J.; Chen, H. Numerical investigation on effects of fringing reefs on low-frequency oscillations within a harbor. Ocean Eng. 2019, 172, 86–95.
  10. Liu, C.; Jia, Y.; Onat, Y.; Cifuentes-Lorenzen, A.; Ilia, A.; McCardell, G.; Fake, T.; O’Donnell, J. Estimating the annual exceedance probability of water levels and wave heights from high resolution coupled wave-circulation models in long island sound. J. Mar. Sci. Eng. 2020, 8, 475.
  11. Nadal-Caraballo, N.C.; Melby, J.A. North Atlantic Coast Comprehensive Study Phase I: Statistical Analysis of Historical Extreme Water Levels with Sea Level Change; Technical Report; Engineer Research and Development Center Vicksburg MS Coastal and Hydraulics LAB: Vicksburg, MS, USA, 2014.
  12. Moritz, S.; Bartz-Beielstein, T. imputeTS: Time series missing value imputation in R. R J. 2017, 9, 207.
  13. NCDC. NOAA Storm Events Database. 2023. Available online: https://www.ncdc.noaa.gov/stormevents/ (accessed on 1 October 2022).
  14. Sverdrup, H.U.; Munk, W.H. Wind, Sea and Swell: Theory of Relations for Forecasting; Hydrographic Office: Taunton, UK, 1947.
  15. Fong, Y.; Huang, Y.; Gilbert, P.B.; Permar, S.R. chngpt: Threshold regression model estimation and inference. BMC Bioinform. 2017, 18, 454.
  16. Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications; Springer: Berlin/Heidelberg, Germany, 2000; Volume 3.
  17. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327.
  18. Wuertz, D.; Runit, S.; Chalabi, M.Y. Package ‘fGarch’; Technical Report, Working Paper/Manual, 09.11.2009; R Core Team: Vienna, Austria, 2013.
  19. Ravishanker, N.; Chi, Z.; Dey, D.K. A First Course in Linear Model Theory; CRC Press: Boca Raton, FL, USA, 2021.
  20. Caires, S.; Sterl, A. 100-Year Return Value Estimates for Ocean Wind Speed and Significant Wave Height from the ERA-40 Data. J. Clim. 2005, 18, 1032–1048.
  21. Caires, S. Extreme Value Analysis: Wave Data. JCOMM Technical Report No. 57. In Technical Report; World Meteorological Organization: Geneva, Switzerland, 2011.
  22. Ribatet, M.; Dutang, C. POT: Generalized Pareto Distribution and Peaks Over Threshold; R Package Version 1.1-10; R Core Team: Vienna, Austria, 2022.
Figure 1. NOAA buoy # 44039 located in the Central Long Island Sound and owned by University of Connecticut, Department of Marine Sciences. This buoy records meteorological data as well as wave height and period.
Figure 2. Summary of the methodology.
Figure 3. Observed and in-sample fits of wave heights using the threshold regression GARCH model for the 2007–2008 windy season.
Figure 4. Observed and in-sample fits of wave heights using the threshold regression GARCH model for the 2008 calm season.
Figure 5. Ensemble hindcasts for the 2007–2008 windy season trained on years 2009–2013.
Figure 6. Ensemble hindcasts for the 2008 calm season trained on years 2009–2013.
Figure 7. Boxplots of the mean (left), and maximum (right) hindcasted wave heights for each month for the years 1973–2004. The red dots indicate the mean (left), and the maximum (right) of the observed wave heights for each month for the years 2005 to 2013.
Figure 8. Return level plots based on (i) $H_t$ for the years 2005–2013, (ii) $\hat{H}_{t,e}$ with $H_t$ for the years 1974–2013, (iii) $\hat{L}_{t,e}$ with $H_t$ for the years 1974–2013, and (iv) $\hat{U}_{t,e}$ with $H_t$ for the years 1974–2013. The estimates for the 10-year, 50-year, and 100-year return levels, along with their 95% confidence intervals, are shown in the top left of the figures.
Table 1. Estimated regression coefficients, with standard errors in parentheses, for the transformed Sikorsky station wind speeds $\hat{u}_t^{c}$. Group denotes wind direction: East to West ($g = 1$), West to East ($g = 2$), and All other directions ($g = 3$).

| Group $g$ | Windy $\hat{a}_{0g}$ | Windy $\hat{a}_{1g}$ | Windy $\hat{a}_{2g}$ | Calm $\hat{a}_{0g}$ | Calm $\hat{a}_{1g}$ | Calm $\hat{a}_{2g}$ |
|---|---|---|---|---|---|---|
| 1 | 1.3503 (0.1910) | 0.8429 (0.8429) | 0.8338 (0.1674) | 1.2087 (0.1696) | 0.8209 (0.0123) | 0.3530 (0.1607) |
| 2 | 4.7836 (0.2274) | 0.6602 (0.0124) | −0.6969 (0.2162) | 3.9107 (0.1706) | 0.409 (0.0116) | −0.5313 (0.1612) |
| 3 | 3.6884 (0.1306) | 0.6253 (0.0091) | 0.0878 (0.1227) | 1.8282 (0.0935) | 0.5289 (0.0077) | 0.8880 (0.0884) |
Table 2. GARCH model estimates, with standard errors in parentheses, for the 2007–2008 windy season and the 2008 calm season.

| Season | $\hat{\alpha}_0$ | $\hat{\alpha}_1$ | $\hat{\beta}_1$ | $\hat{\beta}_2$ |
|---|---|---|---|---|
| Windy | 0.0040 (0.0006) | 0.2027 (0.0212) | 0.6332 (0.1028) | 0.1476 (0.0871) |
| Calm | 0.0051 (0.0006) | 0.2125 (0.0213) | 0.4476 (0.1085) | 0.2756 (0.0938) |
Table 3. RMSE for in-sample fits using the threshold regression GARCH model on the buoy wave and wind data for each year 2005–2013 using model estimates for the same year.

| Year | Windy Season: Buoy-Buoy | Windy Season: Buoy-Sikorsky | Calm Season: Buoy-Buoy | Calm Season: Buoy-Sikorsky |
|---|---|---|---|---|
| 2005 | 0.2957 | 0.3419 | 0.2330 | 0.2450 |
| 2006 | 0.3209 | 0.3284 | 0.1119 | 0.1535 |
| 2007 | 0.2886 | 0.4067 | 0.1772 | 0.2043 |
| 2008 | 0.2556 | 0.2865 | 0.1849 | 0.2052 |
| 2009 | 0.2507 | 0.2874 | 0.1894 | 0.2471 |
| 2010 | 0.2358 | 0.2915 | 0.2120 | 0.2422 |
| 2011 | 0.2065 | 0.2420 | 0.2268 | 0.2435 |
| 2012 | 0.2603 | 0.2758 | 0.2319 | 0.2575 |
| 2013 | 0.2857 | 0.3422 | 0.1909 | 0.2171 |
Table 4. RMSE for out-of-sample fits for the 2007–2008 windy season and the 2008 calm season using the threshold regression GARCH model trained on data from the years 2009–2013.

| Year | 2007–2008 Windy Season RMSE | 2008 Calm Season RMSE |
|---|---|---|
| 2009 | 0.9065 | 0.7084 |
| 2010 | 0.9944 | 0.7111 |
| 2011 | 0.8980 | 0.7220 |
| 2012 | 0.9167 | 0.7675 |
| 2013 | 0.9516 | 0.7093 |
| Ensemble | 0.8554 | 0.6527 |
Table 5. POT model estimates, with standard errors in parentheses, based on observed wave heights from 2005–2013 and hindcasted wave heights from 1974–2004.

| Setup | $\hat{\sigma}_u$ | $\hat{\xi}$ |
|---|---|---|
| $H_t$ (2005–2013) | 1.3321 (0.1237) | 0.1900 (0.0538) |
| $\hat{H}_{t,e} + H_t$ (1974–2013) | 0.9695 (0.0758) | 0.1698 (0.0520) |
| $\hat{U}_{t,e} + H_t$ (1974–2013) | 0.9954 (0.03902) | 0.2452 (0.02910) |
| $\hat{L}_{t,e} + H_t$ (1974–2013) | 1.3418 (0.12648) | 0.1707 (0.05663) |
Table 6. m-year return value estimates.

| Setup | $m = 10$ | $m = 50$ | $m = 100$ |
|---|---|---|---|
| $H_t$ (2005–2013) | 9.3625 | 10.0601 | 10.3008 |
| $\hat{H}_{t,e} + H_t$ (1974–2013) | 11.0227 | 14.7095 | 16.6351 |
| $\hat{U}_{t,e} + H_t$ (1974–2013) | 17.7961 | 25.9501 | 30.5825 |
| $\hat{L}_{t,e} + H_t$ (1974–2013) | 8.7026 | 9.7016 | 10.0541 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
